Encoding

Today, I was asked for help with encoding problems in vim. Since I know that my colleague uses Windows on his desktop computer and connects to a Linux box over SSH, we first made sure that the connection between shell and terminal worked properly.

Files and network connections are byte streams. But both the terminal and the shell use an internal string representation. With UTF-8, the most common Unicode encoding, a single character is made up of one to four bytes.
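To make the difference between characters and bytes concrete, here is a small Python sketch. The sample characters and the helper name utf8_length are my own choice for illustration.

```python
# Characters from different Unicode ranges need a different number of
# bytes in UTF-8. The samples are arbitrary illustrations.
samples = ["a", "ö", "€", "😀"]


def utf8_length(ch: str) -> int:
    """Return how many bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


for ch in samples:
    encoded = ch.encode("utf-8")
    # Print the character, its byte count and the bytes in hex.
    print(repr(ch), utf8_length(ch), encoded.hex(" "))

# In latin1, the same "ö" is a single byte instead of two.
print("ö".encode("latin-1").hex())
```

The four samples take one, two, three and four bytes respectively, while latin1 stores the "ö" in one byte. This difference is what causes the effects described below.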

Shell - Terminal

In the shell, the encoding can be set with the environment variable LANG (or one of the LC_* variables). On my German language system, I use LANG=de_DE.UTF-8. Both the shell and the SSH client must be configured for the same encoding.
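As a rough sketch of how the encoding follows from these variables: LC_ALL takes precedence over LC_CTYPE, which takes precedence over LANG, and the encoding is the part after the dot. The function name effective_codeset is hypothetical, and the parsing is a simplification; the real lookup is done by the C library and also knows locale aliases.

```python
from typing import Mapping, Optional


def effective_codeset(env: Mapping[str, str]) -> Optional[str]:
    """Guess the character encoding from locale environment variables.

    Simplified illustration: takes the first non-empty variable in the
    order LC_ALL, LC_CTYPE, LANG and returns the part after the dot.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            # "de_DE.UTF-8@euro" -> codeset "UTF-8"
            _, dot, rest = value.partition(".")
            if not dot:
                return None  # e.g. "C" or "de_DE" names no codeset
            return rest.partition("@")[0]
    return None


print(effective_codeset({"LANG": "de_DE.UTF-8"}))
```

On a real system, calling effective_codeset(os.environ) after an import os gives a quick hint about what the shell expects.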

If they do not match, odd things happen. When typing an umlaut (a character that is not in basic ASCII, like the German "ö"), it appears on the screen, but removing it requires pressing Backspace twice. Or the umlaut does not appear immediately, but only after another key is pressed. This can be explained by the conversion of characters to a byte stream, both for sending the entered key to the shell and for receiving the new screen content.

If the shell uses latin1 and the terminal is configured for UTF-8, the umlaut is sent to the shell as a two-byte code. The shell interprets the two bytes as two strange symbols and echoes them, and the terminal interprets those two bytes as an umlaut again. So far, everything looks fine. But when Backspace is pressed, the shell removes only one of the strange symbols, and the terminal is left with a two-byte umlaut code that is missing its second byte. Only after a second press of Backspace are all bytes of the umlaut gone and the character string clean again.
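The round trip can be replayed in Python. This is only a simulation of the byte conversions, not of a real shell or terminal, and all variable names are made up for the illustration.

```python
# Terminal is configured for UTF-8, shell assumes latin1.
typed = "ö"

sent = typed.encode("utf-8")            # terminal -> shell: two bytes
shell_line = sent.decode("latin-1")     # shell sees two characters
echoed = shell_line.encode("latin-1")   # shell echoes the same two bytes
shown = echoed.decode("utf-8")          # terminal shows one umlaut again

print(sent.hex(" "), len(shell_line), shown)

# One Backspace: the shell drops one character, which is one byte.
after_backspace = shell_line[:-1].encode("latin-1")

# The terminal now holds an incomplete UTF-8 sequence. Python's
# decoder substitutes U+FFFD for it when asked to replace errors.
print(after_backspace.hex(), repr(after_backspace.decode("utf-8", errors="replace")))

# A second Backspace removes the remaining byte.
print(len(shell_line[:-2]))
```

The shell counts two characters where the user typed one, which is why it takes two presses of Backspace.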

In the opposite case, if the shell uses UTF-8 and the SSH client uses latin1, the umlaut is sent over the network as a single byte with a value between 128 and 255. On its own, such a byte is not valid UTF-8, because bytes in this range only occur as part of a multi-byte character. After another key press on the terminal, the additional byte may complete the UTF-8 sequence (but more than one additional byte may be required). For the shell, this is then a single two-byte character (which can only be removed as a whole), but the SSH client will happily display the two original latin1 characters.
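The same kind of simulation works for this direction. Note that Python's UTF-8 decoder is strict, so it may reject byte sequences that a shell's line editor would still try to complete. The helper is_valid_utf8 is a name I made up.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# SSH client uses latin1, shell expects UTF-8.
typed = "ö"
sent = typed.encode("latin-1")     # a single byte above 127

print(sent[0], is_valid_utf8(sent))

# The other direction: the shell prints a UTF-8 umlaut, and the
# latin1 client displays each of the two bytes as its own character.
shell_output = "ö".encode("utf-8")
displayed = shell_output.decode("latin-1")
print(len(displayed))
```

The lone byte is rejected as UTF-8, and the shell's two-byte umlaut shows up as two separate characters on the client.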

Applications

The next step is to ensure that the application works correctly together with the shell. In my experience, vim usually gets this right. But it may misinterpret the encoding of a file. Again, vim reads a stream of bytes from a file and manages them as characters. Those are printed to the screen in the encoding that the shell uses.
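What misinterpreting a file means can be shown without vim. The following Python sketch writes a latin1 file and decodes its bytes in two ways. The sample text and the temporary file are assumptions for the illustration, and vim's own detection works differently in detail.

```python
import os
import tempfile

text = "schön"

# Write the text as latin1 bytes into a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(text.encode("latin-1"))
    path = f.name

try:
    # An editor first sees nothing but bytes.
    with open(path, "rb") as f:
        raw = f.read()
finally:
    os.remove(path)

as_latin1 = raw.decode("latin-1")                # correct guess
as_utf8 = raw.decode("utf-8", errors="replace")  # wrong guess

print(raw.hex(" "))
print(as_latin1)
print(as_utf8)
```

With the wrong guess, the umlaut byte is not valid UTF-8 and ends up as a replacement character, while the ASCII letters around it survive unchanged.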

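In vim, the option fileencoding (:set fileencoding?) shows which encoding vim assumed for the file in the current buffer. Changing it with :set fileencoding=... affects how the file is written, while :e ++enc=latin1 re-reads the file with the given encoding. To be continued...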
