Sunday, 15 April 2012

encoding - Can't read some characters from UTF-8 file. Pure C


I know there are many similar topics on Stack Overflow, but I have not found a solution to my problem. I am trying to read a UTF-8 file. Everything is fine with English letters, but I cannot read Russian or Spanish letters. Here is my code, just as an example.

    FILE *fp;
    char line[3];
    fp = fopen("letters.data", "r");
    if (fp == NULL) return;
    int i = 0;
    fread(line, 1, 3, fp); // BOM
    wint_t w;
    while (w = fgetwc(fp)) {
        wprintf(L"%c", w);
    }
    fclose(fp);

Here is the letters.data file:

 [image: contents of letters.data]

and output data:

 [image: program output]

I do not know what to do.

fgetwc() reads a wchar_t (a "wide character"). That is not the same thing as UTF-8. A wchar_t has a fixed size (often 16 bits, 32 on some platforms), whereas UTF-8 is variable-length, with characters between one and four bytes long, and requires some special parsing. Very easy to work with, useful. If you need more complex work, see.
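To make the "special parsing" concrete, here is a minimal sketch of decoding one UTF-8 sequence by hand. The function name `utf8_decode` and its return convention are assumptions made for illustration, not part of any library mentioned in the answer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence from s (n bytes available).
 * On success, stores the code point in *cp and returns the number of bytes
 * consumed (1 to 4). Returns 0 on malformed or truncated input. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len;
    uint32_t c, min_cp;

    if (n == 0)
        return 0;

    /* The lead byte tells us how long the sequence is. */
    if (s[0] < 0x80) {
        *cp = s[0];                 /* plain ASCII, one byte */
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        len = 2; c = s[0] & 0x1F; min_cp = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {
        len = 3; c = s[0] & 0x0F; min_cp = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {
        len = 4; c = s[0] & 0x07; min_cp = 0x10000;
    } else {
        return 0;                   /* stray continuation or invalid lead byte */
    }

    if (n < len)
        return 0;                   /* sequence is cut off */

    /* Each continuation byte has the form 10xxxxxx and adds six bits. */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong encodings, surrogates, and values past U+10FFFF. */
    if (c < min_cp || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;

    *cp = c;
    return len;
}
```

Called in a loop over a buffer read with fread(), this yields one code point per call; for example, the two bytes D0 96 decode to U+0416 (the Russian letter Ж) and C3 B1 to U+00F1 (ñ).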

Note that you are assuming a BOM at the beginning. There should not be a BOM in UTF-8 files, but some Windows editors add one anyway. You should be careful about this.
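One way to avoid assuming a BOM is to check for it and only skip it when it is really there. The sketch below does that; the helper name `skip_utf8_bom` is hypothetical, and it assumes a seekable stream that is positioned at the start of the file.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: consume a UTF-8 BOM (bytes EF BB BF) if the stream
 * starts with one. Returns 1 if a BOM was skipped; otherwise rewinds the
 * stream to the beginning and returns 0. */
int skip_utf8_bom(FILE *fp)
{
    unsigned char head[3];
    size_t got = fread(head, 1, sizeof head, fp);

    if (got == 3 && memcmp(head, "\xEF\xBB\xBF", 3) == 0)
        return 1;

    /* No BOM: those bytes are real content, so go back to the start. */
    rewind(fp);
    return 0;
}
```

With this in place of the unconditional fread(line, 1, 3, fp) from the question, a file saved without a BOM no longer loses its first three bytes.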

If all you need to do is read the bytes and write them out to a stream, there is no need to worry about UTF-8; you can treat them as just raw bytes. But if you are going to interpret them, then you have to interpret UTF-8 correctly.
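The raw-bytes case can be sketched as a plain copy loop that never looks inside the data. The name `copy_bytes` is an assumption for illustration; on Windows the input would also need to be opened in binary mode ("rb") so that no text-mode translation happens.

```c
#include <stdio.h>

/* Hypothetical helper: copy bytes from in to out unchanged, without
 * interpreting them. Returns the number of bytes successfully copied. */
size_t copy_bytes(FILE *in, FILE *out)
{
    unsigned char buf[4096];
    size_t n, total = 0;

    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n)
            break;              /* write error: stop early */
        total += n;
    }
    return total;
}
```

Passing stdout as the second argument prints the file as is; whether Russian or Spanish letters then show up correctly depends on the terminal expecting UTF-8, not on the program.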

That said, you should also verify that you actually have a UTF-8 file. On Windows, for example, it is very common for a file to be in some other code page or in UTF-16 (UTF-16 is the encoding in which a file is supposed to be written with a BOM). I find it almost always useful to look at the file in a hex editor to make sure the bytes are what you think they are.
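If no hex editor is at hand, a few lines of C can show the first bytes of the file. This is only a sketch; the function name `hex_dump` and the 16-bytes-per-row layout are choices made here for illustration.

```c
#include <stdio.h>

/* Hypothetical helper: write up to `limit` bytes of fp to out as two-digit
 * hex values, 16 per row. Returns the number of bytes dumped. */
size_t hex_dump(FILE *fp, FILE *out, size_t limit)
{
    size_t count = 0;
    int c;

    while (count < limit && (c = fgetc(fp)) != EOF) {
        /* End the row after every 16th byte, otherwise separate with a space. */
        fprintf(out, "%02X%c", (unsigned)c, (count % 16 == 15) ? '\n' : ' ');
        count++;
    }
    if (count % 16 != 0)
        fputc('\n', out);       /* finish a partial last row */
    return count;
}
```

A file that begins with EF BB BF has a UTF-8 BOM, while FF FE or FE FF at the start points to UTF-16. In UTF-8, Russian letters appear as two-byte pairs whose first byte is D0 or D1.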

