8. How to guess the encoding of a document?
Only ASCII, UTF-8 and encodings using a BOM (UTF-7
with BOM, UTF-8 with BOM, UTF-16, and UTF-32)
have reliable algorithms to get the encoding of a document. For all other
encodings, you have to trust heuristics based on statistics.
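The difference between a reliable test and a heuristic can be sketched in a few lines of Python 3. This is only an illustration under stated assumptions: the name `guess_reliable` is hypothetical, BOM handling is left out, and the whole input is decoded in memory, so it suits short inputs only.

```python
def guess_reliable(data):
    """Return 'ASCII', 'UTF-8', or None (hypothetical helper).

    None means no strict decoder accepted the bytes, so only a
    statistical heuristic could suggest an encoding.
    """
    for encoding in ("ascii", "utf-8"):
        try:
            # Strict decoding raises on the first invalid byte sequence.
            data.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding.upper()
    return None
```

A `None` result does not name an encoding; it only says that neither strict decoder accepted the bytes.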
8.1. Is ASCII?
Checking whether a document is encoded in ASCII is simple: test whether
bit 7 of every byte is unset (0b0xxxxxxx).
Example in C:
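    #include <stddef.h>  /* size_t */

    /* Return 1 if every byte has bit 7 clear (pure ASCII), 0 otherwise. */
    int isASCII(const char *data, size_t size)
    {
        const unsigned char *str = (const unsigned char *)data;
        const unsigned char *end = str + size;
        for (; str != end; str++) {
            if (*str & 0x80)
                return 0;
        }
        return 1;
    }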
In Python, the ASCII decoder can be used:
    def isASCII(data):
        try:
            data.decode('ASCII')
        except UnicodeDecodeError:
            return False
        else:
            return True
Note
Only use the Python function on short strings because it decodes the whole
string into memory. For long strings, it is better to use the algorithm of
the C function because it doesn't allocate any memory.
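The byte test of the C function can be carried over to Python 3 without decoding anything. The sketch below is one possible way to do it, not a reference implementation: the names `is_ascii_bytes` and `is_ascii_stream` and the default chunk size are assumptions made for illustration.

```python
def is_ascii_bytes(data):
    """Return True if bit 7 is unset in every byte (hypothetical helper)."""
    # Iterating over a bytes object yields integers in Python 3.
    return all(byte < 0x80 for byte in data)


def is_ascii_stream(stream, chunk_size=64 * 1024):
    """Check a binary stream chunk by chunk (hypothetical helper).

    At most chunk_size bytes are held in memory at a time.
    """
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # End of stream reached without meeting a non-ASCII byte.
            return True
        if not is_ascii_bytes(chunk):
            return False
```

`is_ascii_stream` expects an object opened in binary mode, for example the result of `open(path, "rb")`. On Python 3.7 and later, the built-in `bytes.isascii()` method performs the same per-byte test.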
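8.2. Check for BOM markers
If the string begins with a BOM, the encoding can be extracted
from the BOM. But there is a problem with UTF-16-LE and
UTF-32-LE: the UTF-32-LE BOM (FF FE 00 00) starts with the UTF-16-LE
BOM (FF FE), so the UTF-32 BOMs have to be tested before the UTF-16 ones.
Example of a function written in C to check if a BOM is present: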
#include
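The same BOM check can be sketched in Python 3 with the constants of the standard `codecs` module. The names `BOMS` and `check_bom` are hypothetical, and the UTF-7 BOM is left out to keep the example short.

```python
import codecs

# Order matters: the UTF-32-LE BOM (FF FE 00 00) begins with the
# UTF-16-LE BOM (FF FE), so the UTF-32 BOMs are tested first.
BOMS = (
    (codecs.BOM_UTF32_LE, "UTF-32-LE"),
    (codecs.BOM_UTF32_BE, "UTF-32-BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16-LE"),
    (codecs.BOM_UTF16_BE, "UTF-16-BE"),
)


def check_bom(data):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0
```

The returned length tells the caller how many bytes to skip before decoding. One case stays ambiguous: a UTF-16-LE document whose first character after the BOM is U+0000 begins with the same four bytes as the UTF-32-LE BOM, and this sketch reports it as UTF-32-LE.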