Cyrillic characters wrongly encoded


Post by ^rooker

[PROBLEM]
A friend of mine brought me a Russian audio CD as a gift: "Easy USSR".
I prefer having my music collection in file-based form, so I rip audio CDs to a hard disk.
The metadata (author, title) was available for all CD tracks, but both the tags and the filenames of the generated files were displayed as unreadable characters.

For example, the artist is displayed like this in the filename:

ÎÝÌÈ_ïó_Â.Ìåùåðèíà
Inside the audio player "Clementine" the encoding was interpreted differently, so it looked something like this:

ŽŽŽŽÃŽÃÃŒÃˆ ïó Â.ÌåùåðèíÃ
(Note: some non-printable characters between the "Ã" characters can't be displayed here as text.)
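
A likely explanation for the Clementine variant (an assumption, not verified against Clementine itself): the already garbled text was stored as UTF-8 and then read once more as windows-1252. A minimal Python sketch of that double misreading, with a made-up function name, reproduces the "ÃŒÃˆ" part of the string above:

```python
def misread_utf8_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then (wrongly) decode those bytes as windows-1252.

    Bytes that windows-1252 leaves undefined become U+FFFD; these would be
    the characters that cannot be shown as text in the post.
    """
    return text.encode("utf-8").decode("cp1252", errors="replace")


if __name__ == "__main__":
    # "ÌÈ" is the end of the first garbled word in the filename.
    print(misread_utf8_as_cp1252("ÌÈ"))
```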

But actually the author's name should read something like this:

"ОЭМИ пу В.Мещерина"

This CD is just one example of how Cyrillic can be misinterpreted due to encoding inconsistencies. My problem was that I am completely unfamiliar with Cyrillic code pages and therefore couldn't even say which encodings the misinterpreted versions actually used.

There are several Cyrillic encodings, so I had to find out which one applied in my case.


[SOLUTION]
You can use your browser to find out which encoding to use :)

I found a list of all track titles of "Easy USSR" on "cdtrrracks.com".
(NOTE: URL links with non-ascii characters are not interpreted correctly by phpBB at the moment)

That site (luckily) also has encoding problems with Cyrillic characters, so I could find it by searching for the wrongly encoded string.
For example track #7:

Êóáà,ìîÿ Êóáà
Now I could use Firefox's character encoding options to test which setting displays the text correctly.
Switching through "View > Character Encoding > More Encodings > East European > Cyrillic*", I found that in this case it was "Cyrillic Windows 1251".
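
The same trial-and-error can be done outside the browser. Below is a minimal Python sketch, assuming the garbled text came from bytes that were misread as windows-1252; the list of candidate encodings and the function name are my own choices for illustration. It prints one decoding per candidate so you can pick the readable one by eye, just like switching encodings in Firefox:

```python
# Common Cyrillic encodings to try (an illustrative selection, not exhaustive).
CANDIDATES = ["cp1251", "koi8_r", "cp866", "iso8859_5", "mac_cyrillic"]


def try_encodings(garbled: str, misread_as: str = "cp1252") -> dict:
    """Return {encoding: decoded text} for every candidate encoding.

    Assumption: the garbled string was produced by decoding the original
    bytes as `misread_as`, so encoding it back recovers those bytes.
    """
    raw = garbled.encode(misread_as)
    return {enc: raw.decode(enc, errors="replace") for enc in CANDIDATES}


if __name__ == "__main__":
    # Track #7 as it appears in its wrongly encoded form.
    for enc, text in try_encodings("Êóáà,ìîÿ Êóáà").items():
        print(f"{enc:>12}: {text}")
```

Only the "cp1251" line comes out as readable Russian ("Куба,моя Куба"), which matches what Firefox showed.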

Then I simply created an empty HTML file, pasted the wrongly encoded text into a "<pre>" block and set the encoding to "windows-1251":

<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=windows-1251" />
    </head>

    <body>
        <pre>
        1       Ïî äîðîãå â øêîëó       3:29
        2       Þìîðåñêà        3:21
        3       Òàíöóåì øåéê        2:57
        4       Íà äà÷å         2:15
        5       Öàðåâíà-Ëÿãóøêà         1:50
        6       Êàê çàÿö ëèñó ïåðåõèòðèë        1:23
        7       Êóáà,ìîÿ Êóáà       2:55
        8       Ïîñëåäíèé ãîä â øêîëå       2:38
        9       Ýëü Áèìáî       2:17
        10      Òàíåö ýôèîïñêîãî øóòà       2:04
        11      Òàíåö ïèíãâèíîâ         2:42
        12      Íà êîëõîçíîé ïòèöåôåðìå         2:12
        13      Íà ïëÿæå        2:55
        14      Òàíöóþùèå ãíîìû         3:28
        15      Íà áåðåãó ëàçóðíîãî çàëèâà      6:14
        16      Êîãäà íà óëèöå ìîðîç        2:09
        17      Ëÿãóøêè         1:50
        18      Íî÷íàÿ ìåëîäèÿ      3:11
        19      Ãàëîï ¹2        1:39
        20      Ôàíòàçèè íà ïåðóàíñêèå òåìû         8:02
        21      Ìóçûêàëüíûé ÿùèê        1:19
        22      Âîçäóøíàÿ êóêóðóçà      5:16
        23      Äèñê ñ äàííûìè      0:08
        </pre>
    </body>
</html>
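Now just open this HTML file in your browser and the text will be displayed correctly :)
(Make sure your editor saves the file in a single-byte Western encoding such as ISO-8859-1 or windows-1252 rather than UTF-8. Otherwise the bytes on disk no longer match the original windows-1251 bytes and the trick doesn't work.)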

In my case I was lucky to find a site that contained my strings, but the same mechanism works for any other wrongly encoded string.
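
For the ripped files themselves, the same idea can be scripted. This is only a sketch under assumptions: the filenames were misread as windows-1252 and are really windows-1251, and both helper names are hypothetical. It only computes a rename plan and does not touch any file, because a legitimate name with Western accents (e.g. "café") would also be "repaired" into nonsense, so check the plan before renaming anything.

```python
from pathlib import Path
from typing import List, Optional, Tuple


def fixed_name(name: str, wrong: str = "cp1252", right: str = "cp1251") -> Optional[str]:
    """Return the repaired name, or None if nothing needs to change."""
    try:
        repaired = name.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not mojibake of the assumed kind (e.g. the name is already Cyrillic).
        return None
    return repaired if repaired != name else None


def plan_renames(directory: str) -> List[Tuple[Path, Path]]:
    """Dry run: list (old path, new path) pairs without renaming anything."""
    plans = []
    for path in sorted(Path(directory).iterdir()):
        new_name = fixed_name(path.name)
        if new_name is not None:
            plans.append((path, path.with_name(new_name)))
    return plans
```

Calling plan_renames on the rip directory and printing the pairs shows what would change; the artist string from above would map to "ОЭМИ_пу_В.Мещерина".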