String was recognized as UTF8, while it was encoded as win1251

May 10, 2012 at 6:19 PM

Hi, I recently tried to use your utility to check whether strings are encoded in UTF-8. I use it to check string values from external sources like Google Plus, Facebook, etc. In most cases it works fine, but recently the utility failed. I would appreciate your help. Here is some code; I hope it will be useful:

        // The function that I use to detect if decoding is required.
        public static string Win1251ToUtf8(string source)
        {
            if (!string.IsNullOrWhiteSpace(source) && !Utf8Checker.IsUtf8(source))
            {
                var utf8Encoding = Encoding.GetEncoding("UTF-8");
                var win1251Encoding = Encoding.GetEncoding("Windows-1251");

                var utf8Bytes = win1251Encoding.GetBytes(source);
                var win1251Bytes = Encoding.Convert(utf8Encoding, win1251Encoding, utf8Bytes);

                source = win1251Encoding.GetString(win1251Bytes);
            }

            return source;
        }

        // This function is used in the function above.
        public static bool IsUtf8(string source)
        {
            // Extract bytes from string.
            var bytes = ToByteArray(source);

            return IsUtf8(bytes, bytes.Length);
        }

        // Extracting bytes from string
        private static byte[] ToByteArray(string stringToConvert)
        {
            var bytes = new byte[stringToConvert.Length * sizeof(char)];

            Buffer.BlockCopy(stringToConvert.ToCharArray(), 0, bytes, 0, bytes.Length);

            return bytes;
        }


// This is the string that I retrieved from Google (I just iterated over all its bytes and glued them into this array).
var stringBytes = new byte[] {32, 4, 29, 32, 32, 4, 88, 4, 32, 4, 81, 4, 33, 4, 26, 32, 33, 4, 2, 4, 32, 4, 81, 4, 32, 4, 22, 33, };

// String view after forced decoding:
Дмитрий

// String view before decoding (not sure that you will see the same as I do):
Р”РјРёС‚СЂРёР№


            // This condition from your IsValid function is true.
            if (ch <= 0x7F)
            {
                bytes = 1;

                return true;
            }

Sorry for so many details :)

Aug 8, 2016 at 4:35 PM
Edited Aug 8, 2016 at 4:37 PM
The string Дмитрий looks quite different from your bytes when it is encoded in UTF-8! Here is the sequence:

D0 94 - D0 BC - D0 B8 - D1 82 - D1 80 - D0 B8 - D0 B9

These are all 2-byte sequences, and the code points (in decimal) are the following:

1044 - CYRILLIC CAPITAL LETTER DE
1084 - CYRILLIC SMALL LETTER EM
1080 - CYRILLIC SMALL LETTER I
1090 - CYRILLIC SMALL LETTER TE
1088 - CYRILLIC SMALL LETTER ER
1080 - CYRILLIC SMALL LETTER I
1081 - CYRILLIC SMALL LETTER SHORT I
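
To make the difference concrete, here is a minimal sketch that prints the UTF-8 bytes of the name next to its UTF-16 bytes, and then reads the byte array from the first post as UTF-16 code units. The class and member names (ByteViewDemo, Hex, Posted) are made up for illustration, and it assumes that ToByteArray ran on a little-endian machine, so that Buffer.BlockCopy produced UTF-16LE:

```csharp
using System;
using System.Text;

public static class ByteViewDemo
{
    // "Дмитрий", written with escapes so the source file encoding does not matter.
    public const string Name = "\u0414\u043C\u0438\u0442\u0440\u0438\u0439";

    // The byte array quoted in the first post.
    public static readonly byte[] Posted =
    {
        32, 4, 29, 32, 32, 4, 88, 4, 32, 4, 81, 4, 33, 4,
        26, 32, 33, 4, 2, 4, 32, 4, 81, 4, 32, 4, 22, 33
    };

    // Formats bytes as space-separated hex pairs.
    public static string Hex(byte[] bytes)
    {
        return BitConverter.ToString(bytes).Replace("-", " ");
    }

    public static void Main()
    {
        // UTF-8 encoding of the name: D0 94 D0 BC D0 B8 D1 82 D1 80 D0 B8 D0 B9
        Console.WriteLine(Hex(Encoding.UTF8.GetBytes(Name)));

        // UTF-16LE encoding of the same name: 14 04 3C 04 38 04 42 04 40 04 38 04 39 04
        Console.WriteLine(Hex(Encoding.Unicode.GetBytes(Name)));

        // The posted array read as UTF-16LE: 14 code units, none of them a letter of the name.
        string decoded = Encoding.Unicode.GetString(Posted);
        foreach (char c in decoded)
        {
            Console.Write("U+{0:X4} ", (int)c);
        }
        Console.WriteLine();
    }
}
```

Read this way, the posted array is 14 UTF-16 characters (U+0420, U+201D, U+0420, U+0458, ...). Those are exactly the characters you get when the UTF-8 bytes above are decoded as Windows-1251, so the array holds the in-memory chars of an already mis-decoded string, not the raw bytes that came from Google.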

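Of course, code points <= 127 are encoded as single bytes in UTF-8. But the UTF-8 checker assumes that its input is a byte stream; when reading a file, the data could also be a word or dword stream (16-bit or 32-bit units). In such a stream many individual bytes are <= 0x7F, so they pass the single-byte test even though the data is not UTF-8. Also, each code point has to be encoded with the shortest possible sequence, and surrogate code points are invalid in UTF-8 because they are not needed there. The checker should verify this too.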
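
As a rough illustration of those extra checks, here is a sketch of a strict validator over a byte array. It is not the code of the utility discussed in this thread; StrictUtf8 and its IsValid method are hypothetical names, and the rules follow the usual UTF-8 definition (shortest form only, no surrogates, nothing above U+10FFFF):

```csharp
using System;

public static class StrictUtf8
{
    // Hypothetical helper: returns true only for well-formed UTF-8.
    public static bool IsValid(byte[] data)
    {
        int i = 0;
        while (i < data.Length)
        {
            byte b = data[i];
            int extra;      // number of continuation bytes expected
            int codePoint;  // payload bits collected so far
            int min;        // smallest code point allowed for this length

            if (b <= 0x7F) { i++; continue; }                    // single byte (ASCII)
            else if (b >= 0xC2 && b <= 0xDF) { extra = 1; codePoint = b & 0x1F; min = 0x80; }
            else if (b >= 0xE0 && b <= 0xEF) { extra = 2; codePoint = b & 0x0F; min = 0x800; }
            else if (b >= 0xF0 && b <= 0xF4) { extra = 3; codePoint = b & 0x07; min = 0x10000; }
            else return false; // stray continuation byte, C0/C1 (always overlong), or F5..FF

            if (i + extra >= data.Length) return false;          // truncated sequence

            for (int k = 1; k <= extra; k++)
            {
                byte c = data[i + k];
                if ((c & 0xC0) != 0x80) return false;            // not a continuation byte
                codePoint = (codePoint << 6) | (c & 0x3F);
            }

            if (codePoint < min) return false;                   // overlong (not the shortest form)
            if (codePoint >= 0xD800 && codePoint <= 0xDFFF) return false; // surrogate
            if (codePoint > 0x10FFFF) return false;              // beyond the Unicode range

            i += extra + 1;
        }
        return true;
    }

    public static void Main()
    {
        Console.WriteLine(IsValid(new byte[] { 0xD0, 0x94 }));       // True: U+0414
        Console.WriteLine(IsValid(new byte[] { 0xC0, 0x80 }));       // False: overlong
        Console.WriteLine(IsValid(new byte[] { 0xED, 0xA0, 0x80 })); // False: surrogate U+D800
        // The array from the first post: every byte is <= 0x7F, so it still passes.
        Console.WriteLine(IsValid(new byte[] { 32, 4, 29, 32, 32, 4, 88, 4 })); // True
    }
}
```

Note the last line: even a strict check cannot reject the posted array, because every byte in it happens to be <= 0x7F. The check only means something when it is given the raw bytes received from the external source, before any decoding into a string. If a hand-written validator is not wanted, decoding the raw bytes with new UTF8Encoding(false, true) is an alternative; with throwOnInvalidBytes set to true it throws on invalid input.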