Cyrillic characters wrongly encoded
Posted: Sun Dec 22, 2013 7:02 pm
[PROBLEM]
A friend of mine brought me a Russian audio CD as a gift: "Easy USSR".
I prefer having my music collection in file-based form, so I rip audio CDs to a hard disk.
The metadata (author, title) was available for all CD tracks, but the tags as well as the filenames of the generated files were displayed in an unreadable way.
For example, the artist is displayed like this in the filename:
Code: Select all
ÎÝÌÈ_ïó_Â.Ìåùåðèíà
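Inside the audio player "Clementine" the encoding was interpreted differently, so it looked something like this:
Code: Select all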
ÃÃÃà ïó Ã.ÃåùåðèÃÃ
(Note: some non-printable characters between the "Ã" characters can't be displayed here as text.)
But actually the author's name should read something like this:
Code: Select all
"ОЭМИ пу В.Мещерина"
This CD is just one example of how Cyrillic text can be misinterpreted due to encoding inconsistencies. My problem was that I am completely unfamiliar with Cyrillic codepages and therefore couldn't even tell which encodings the misinterpreted versions actually used.
There are several different Cyrillic encodings, and I had to find out which one applied in my case.
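To illustrate how such garbage can arise, here is a minimal Python sketch. It assumes (as turns out later in this post) that the original bytes were Windows-1251 and that the ripping software read them as a Western single-byte codepage; "latin-1" is my assumption for that wrong codepage, and the variable names are only illustrative.

```python
# The correct artist name, as it should be displayed.
correct = "ОЭМИ пу В.Мещерина"

# Windows-1251 stores each Cyrillic letter as a single byte.
raw_bytes = correct.encode("cp1251")

# Reading the same bytes with a Western codepage (assumed: Latin-1)
# maps every byte to an accented Latin letter instead.
garbled = raw_bytes.decode("latin-1")

print(garbled)  # ÎÝÌÈ ïó Â.Ìåùåðèíà
```

The bytes themselves are never damaged here; only the table used to turn them into characters is wrong, which is why the text can be recovered.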
[SOLUTION]
You can use your browser to find out which encoding to use.
I found a list of all track titles of "Easy USSR" on "cdtrrracks.com".
(NOTE: URL links with non-ascii characters are not interpreted correctly by phpBB at the moment)
That site (luckily) also has encoding problems with Cyrillic characters, so I could find it by searching for the wrongly encoded string.
For example track #7:
Code: Select all
Êóáà,ìîÿ Êóáà
Now I could use the character encoding options of the Firefox browser to test which encoding would display it correctly.
Switching through "View > Character Encoding > More Encodings > East European > Cyrillic*", I found that in this case it was "Cyrillic Windows 1251".
Then I simply created an empty HTML file, copy/pasted the wrongly encoded text into a "<pre>" block and set the encoding to "windows-1251":
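The same trial-and-error can be sketched outside the browser. The following Python snippet is only a rough equivalent of clicking through the Firefox menu: it assumes the garbled string was produced by reading the bytes as Latin-1, and the function name and candidate list are my own choices, not something from the site or the CD.

```python
# Common Cyrillic single-byte encodings (Python codec names).
CANDIDATES = ["cp1251", "koi8_r", "iso8859_5", "cp866", "mac_cyrillic"]

def decode_candidates(garbled, wrong="latin-1"):
    """Return {encoding: text} for each candidate Cyrillic encoding."""
    # Undo the wrong interpretation to get the original bytes back.
    raw = garbled.encode(wrong)
    # Bytes that a candidate cannot map are shown as U+FFFD.
    return {name: raw.decode(name, errors="replace") for name in CANDIDATES}

results = decode_candidates("Êóáà,ìîÿ Êóáà")  # track #7
for name, text in results.items():
    print(f"{name:>12}: {text}")
```

Only one of the printed lines should look like real Russian; here that is the cp1251 line, which matches what the browser test showed.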
Code: Select all
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=windows-1251" />
</head>
<body>
<pre>
1 Ïî äîðîãå â øêîëó 3:29
2 Þìîðåñêà 3:21
3 Òàíöóåì øåéê 2:57
4 Íà äà÷å 2:15
5 Öàðåâíà-Ëÿãóøêà 1:50
6 Êàê çàÿö ëèñó ïåðåõèòðèë 1:23
7 Êóáà,ìîÿ Êóáà 2:55
8 Ïîñëåäíèé ãîä â øêîëå 2:38
9 Ýëü Áèìáî 2:17
10 Òàíåö ýôèîïñêîãî øóòà 2:04
11 Òàíåö ïèíãâèíîâ 2:42
12 Íà êîëõîçíîé ïòèöåôåðìå 2:12
13 Íà ïëÿæå 2:55
14 Òàíöóþùèå ãíîìû 3:28
15 Íà áåðåãó ëàçóðíîãî çàëèâà 6:14
16 Êîãäà íà óëèöå ìîðîç 2:09
17 Ëÿãóøêè 1:50
18 Íî÷íàÿ ìåëîäèÿ 3:11
19 Ãàëîï ¹2 1:39
20 Ôàíòàçèè íà ïåðóàíñêèå òåìû 8:02
21 Ìóçûêàëüíûé ÿùèê 1:19
22 Âîçäóøíàÿ êóêóðóçà 5:16
23 Äèñê ñ äàííûìè 0:08
</pre>
</body>
</html>
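Save the file in a single-byte Western encoding such as ISO-8859-1 (not UTF-8), so that the bytes on disk are the original ones. Then just open this HTML file in your browser and the text will be displayed correctly.
In my case I was lucky to find a site which contained my strings, but you can use the same mechanism for any other wrongly encoded strings.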
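If you have many strings to repair (tags, filenames), the HTML trick can also be done in a few lines of Python. This is just a sketch under the same assumptions as above: the text was Windows-1251 read as Latin-1. The helper name fix_mojibake is hypothetical, and the sample titles are taken from the track list above.

```python
def fix_mojibake(text, wrong="latin-1", right="cp1251"):
    """Re-interpret text that was decoded with the wrong codepage."""
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Text is probably not this kind of mojibake; leave it alone.
        return text

tracks = ["Ïî äîðîãå â øêîëó", "Þìîðåñêà", "Ãàëîï ¹2"]
fixed = [fix_mojibake(t) for t in tracks]
for title in fixed:
    print(title)
```

Strings that are already proper Cyrillic cannot be encoded as Latin-1, so the helper returns them unchanged instead of garbling them a second time.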