Improvements

Feb 9, 2010 at 3:10 PM

Hello everyone,

I would like to start a discussion about needed improvements in Utf8Checker.

Please describe how you want to use Utf8Checker and which functionality you would like to see, for example:

-high performance scenarios, configurable number of bytes to scan

-BOM detection (to be released soon)

-support other encodings

 

Please add any other features you would like to see.
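
To make the BOM detection item concrete, here is a minimal sketch in Python (not the Utf8Checker code itself). The name detect_bom and the returned encoding labels are assumptions chosen for illustration; only the BOM constants come from Python's standard codecs module.

```python
import codecs

# Longer BOMs must be tested first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none.

    Hypothetical helper for illustration, not part of Utf8Checker.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

A missing BOM proves nothing, since many UTF-8 files are written without one, so a check like this can only complement the byte scan and cannot replace it.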

Aug 8, 2016 at 3:26 PM
Hi there!

What if you detect a UTF-8 sequence that is not really a sequence at all, because every byte of it is actually a single ANSI character? Of course, if you detect lots of UTF-8 sequences, the data is probably UTF-8 encoded; the more you find, the more confident you can be. But you also assume that your file or stream is byte-oriented. It could just as well be UTF-16 encoded, i.e. word-oriented, or even dword-oriented in the case of UTF-32.

Another improvement would be to check that each encoded Unicode code point uses the shortest possible UTF-8 sequence; if it does not, the sequence is invalid. Likewise, if an encoded code point is a high or low surrogate, the sequence is invalid, because UTF-8 doesn't need surrogates.
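
The shortest-form and surrogate checks could look roughly like the following Python sketch. It is only an illustration under my own assumptions, not how Utf8Checker works: is_strict_utf8 is a made-up name, and the function answers a plain yes/no without counting sequences or weighing probabilities.

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (hypothetical helper).

    Rejects overlong (non-shortest) encodings, encoded surrogates
    (U+D800..U+DFFF) and code points above U+10FFFF.
    """
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:                  # single-byte ASCII
            i += 1
            continue
        if b0 & 0xE0 == 0xC0:          # 110xxxxx: 2-byte sequence
            length, cp, min_cp = 2, b0 & 0x1F, 0x80
        elif b0 & 0xF0 == 0xE0:        # 1110xxxx: 3-byte sequence
            length, cp, min_cp = 3, b0 & 0x0F, 0x800
        elif b0 & 0xF8 == 0xF0:        # 11110xxx: 4-byte sequence
            length, cp, min_cp = 4, b0 & 0x07, 0x10000
        else:
            return False               # stray continuation byte or 0xF8..0xFF
        if i + length > n:
            return False               # sequence is cut off at end of data
        for b in data[i + 1:i + length]:
            if b & 0xC0 != 0x80:       # continuation bytes must be 10xxxxxx
                return False
            cp = (cp << 6) | (b & 0x3F)
        if cp < min_cp:
            return False               # overlong: a shorter encoding exists
        if 0xD800 <= cp <= 0xDFFF:
            return False               # surrogates are not valid in UTF-8
        if cp > 0x10FFFF:
            return False               # beyond the Unicode range
        i += length
    return True
```

Note that this does not solve the first problem: the two ANSI characters "Ã©" are stored as C3 A9 in Latin-1 or Windows-1252, and those bytes are also a perfectly valid UTF-8 encoding of "é". A strict validator can only rule UTF-8 out; it cannot prove that a short input was meant as UTF-8.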