8. How to guess the encoding of a document?
Programming with Unicode
Only ASCII, UTF-8 and encodings using a BOM (UTF-7
with BOM, UTF-8 with BOM, UTF-16, and UTF-32)
have reliable algorithms to get the encoding of a document. For all other
encodings, you have to trust heuristics based on statistics.
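To illustrate why the other encodings need heuristics, here is a minimal sketch in which the same bytes decode without error under two single-byte charsets. The sample bytes are made up for the illustration; the codec names are the ones shipped with Python.

```python
# One byte string, two single-byte charsets that both accept it.
data = b"10 \xa4"

as_latin1 = data.decode("iso-8859-1")   # 0xA4 is the currency sign U+00A4
as_latin9 = data.decode("iso-8859-15")  # 0xA4 is the euro sign U+20AC

# Both decoders succeed, so validity alone cannot tell which one is right.
print(repr(as_latin1), repr(as_latin9))
```

Since neither decoder fails, only the content (for example, which character is more plausible in context) can suggest the intended charset.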
8.1. Is ASCII?
Checking whether a document is encoded in ASCII is simple: test that bit 7 of
every byte is unset (0b0xxxxxxx).
Example in C:
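#include <stddef.h>   /* size_t */

/* Return 1 if all bytes are ASCII (bit 7 unset), 0 otherwise. */
int isASCII(const char *data, size_t size)
{
    const unsigned char *str = (const unsigned char*)data;
    const unsigned char *end = str + size;
    for (; str != end; str++) {
        if (*str & 0x80)
            return 0;
    }
    return 1;
}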
In Python, the ASCII decoder can be used:
def isASCII(data):
    # data is a bytes object; decoding fails on any byte >= 0x80
    try:
        data.decode('ASCII')
    except UnicodeDecodeError:
        return False
    else:
        return True
Note
Only use the Python function on short strings because it decodes the whole
string into memory. For long strings, it is better to use the algorithm of
the C function because it doesn't allocate any memory.
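One way to follow this advice without leaving Python is to scan the input in fixed-size chunks. The sketch below is only an illustration: the name `is_ascii_stream` and the default chunk size are hypothetical, and it assumes Python 3.7 or later for `bytes.isascii()`.

```python
import io

def is_ascii_stream(stream, chunk_size=64 * 1024):
    """Return True if every byte read from a binary stream is below 0x80."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return True   # end of stream reached, no high bit seen
        # bytes.isascii() scans the chunk without building a str object
        if not chunk.isascii():
            return False

print(is_ascii_stream(io.BytesIO(b"plain text")))
print(is_ascii_stream(io.BytesIO(b"caf\xc3\xa9")))
```

Memory use is bounded by the chunk size rather than by the document size, and the scan stops at the first non-ASCII chunk.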
8.2. Check for BOM markers
If the string begins with a BOM, the encoding can be extracted
from the BOM. But there is a problem with UTF-16-LE and
UTF-32-LE: the UTF-32-LE BOM (FF FE 00 00) starts with the UTF-16-LE BOM
(FF FE), so the UTF-32-LE BOM has to be tested first.
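The ordering issue can be sketched in Python with the BOM constants from the standard `codecs` module. The function name `check_bom` and the returned labels are assumptions made for this illustration, and the UTF-7 BOM is left out for brevity.

```python
import codecs

# Longest BOMs first: the UTF-32-LE BOM (FF FE 00 00) begins with the
# UTF-16-LE BOM (FF FE), so it has to be tested before UTF-16-LE.
BOMS = (
    (codecs.BOM_UTF32_LE, "UTF-32-LE"),
    (codecs.BOM_UTF32_BE, "UTF-32-BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16-LE"),
    (codecs.BOM_UTF16_BE, "UTF-16-BE"),
)

def check_bom(data):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0

print(check_bom(b"\xff\xfe\x00\x00a\x00\x00\x00"))
print(check_bom(b"\xff\xfea\x00"))
```

Even with this order, a UTF-16-LE document whose first character after the BOM is U+0000 would be reported as UTF-32-LE, so the result is best treated as a strong hint rather than a proof.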
Example of a function written in C to check if a BOM is present:
#include