8. How to guess the encoding of a document?

Only ASCII, UTF-8 and encodings using a BOM (UTF-7 with BOM, UTF-8 with BOM, UTF-16, and UTF-32) have reliable algorithms to get the encoding of a document. For all other encodings, you have to rely on heuristics based on statistics.

8.1. Is ASCII?

Checking whether a document is encoded in ASCII is simple: test whether bit 7 of every byte is unset (0b0xxxxxxx).

Example in C:

    #include <stddef.h> /* size_t */

    int isASCII(const char *data, size_t size)
    {
        const unsigned char *str = (const unsigned char*)data;
        const unsigned char *end = str + size;
        for (; str != end; str++) {
            /* bit 7 set: not an ASCII byte */
            if (*str & 0x80)
                return 0;
        }
        return 1;
    }

In Python, the ASCII decoder can be used:

    def isASCII(data):
        try:
            data.decode('ASCII')
        except UnicodeDecodeError:
            return False
        else:
            return True

Note: Only use the Python function on short strings because it decodes the whole string into memory. For long strings, it is better to use the algorithm of the C function because it doesn't allocate any memory.

8.2. Check for BOM markers

If the string begins with a BOM, the encoding can be extracted from the BOM. But there is a problem with UTF-16-LE and UTF-32-LE: the UTF-32-LE BOM starts with the UTF-16-LE BOM.
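The prefix overlap between the two little-endian BOMs can be observed directly with the constants from Python's codecs module. The following is a minimal sketch, not part of the original text; the variable names are illustrative only. It shows that the same four bytes decode without error under both encodings, which is why a BOM check alone cannot tell them apart.

```python
from codecs import BOM_UTF16_LE, BOM_UTF32_LE

# The UTF-32-LE BOM is the UTF-16-LE BOM followed by two zero bytes.
assert BOM_UTF32_LE.startswith(BOM_UTF16_LE)

data = BOM_UTF32_LE  # b"\xff\xfe\x00\x00"

# Both decoders accept the bytes: the "utf-16" and "utf-32" codecs
# read the BOM to pick the byte order and then strip it.
as_utf16 = data.decode("utf-16")  # BOM + one U+0000 character
as_utf32 = data.decode("utf-32")  # BOM only, empty text

print(repr(as_utf16))  # '\x00'
print(repr(as_utf32))  # ''
```

Read as UTF-16-LE, the bytes are a BOM followed by a NUL character; read as UTF-32-LE, they are a BOM followed by nothing. Real text rarely starts with U+0000, so preferring UTF-32-LE in this case is a reasonable tie-break.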
Example of a function written in C to check if a BOM is present:

    #include <string.h> /* memcmp() */

    const char *UTF_16_BE_BOM = "\xFE\xFF";
    const char *UTF_16_LE_BOM = "\xFF\xFE";
    const char *UTF_8_BOM = "\xEF\xBB\xBF";
    const char *UTF_32_BE_BOM = "\x00\x00\xFE\xFF";
    const char *UTF_32_LE_BOM = "\xFF\xFE\x00\x00";

    /* Return the name of the encoding announced by the BOM,
     * or NULL if the data doesn't start with a known BOM. */
    const char *check_bom(const char *data, size_t size)
    {
        if (size >= 3) {
            if (memcmp(data, UTF_8_BOM, 3) == 0)
                return "UTF-8";
        }
        /* test UTF-32 before UTF-16: the UTF-32-LE BOM starts with
         * the UTF-16-LE BOM */
        if (size >= 4) {
            if (memcmp(data, UTF_32_LE_BOM, 4) == 0)
                return "UTF-32-LE";
            if (memcmp(data, UTF_32_BE_BOM, 4) == 0)
                return "UTF-32-BE";
        }
        if (size >= 2) {
            if (memcmp(data, UTF_16_LE_BOM, 2) == 0)
                return "UTF-16-LE";
            if (memcmp(data, UTF_16_BE_BOM, 2) == 0)
                return "UTF-16-BE";
        }
        return NULL;
    }

For the UTF-16-LE/UTF-32-LE BOM conflict: this function returns "UTF-32-LE" if the string begins with "\xFF\xFE\x00\x00", even if this string can be decoded from UTF-16-LE.

Example in Python getting the BOMs from the codecs library:

    from codecs import BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE, BOM_UTF32_BE, BOM_UTF32_LE

    BOMS = (
        (BOM_UTF8, "UTF-8"),
        (BOM_UTF32_BE, "UTF-32-BE"),
        (BOM_UTF32_LE, "UTF-32-LE"),
        (BOM_UTF16_BE, "UTF-16-BE"),
        (BOM_UTF16_LE, "UTF-16-LE"),
    )

    def check_bom(data):
        return [encoding for bom, encoding in BOMS if data.startswith(bom)]

This function is different from the C function: it returns a list. It returns ['UTF-32-LE', 'UTF-16-LE'] if the string begins with b"\xFF\xFE\x00\x00".

8.3. Is UTF-8?

The UTF-8 encoding adds markers to each byte, so it is possible to write a reliable algorithm to check whether a byte string is encoded in UTF-8.

Example of a strict C function to check if a string is encoded with UTF-8. It rejects overlong sequences (e.g. 0xC0 0x80) and surrogate characters (e.g. 0xED 0xB2 0x80, U+DC80).
    #include <stddef.h> /* size_t */
    #include <stdint.h> /* uint32_t */

    int isUTF8(const char *data, size_t size)
    {
        const unsigned char *str = (const unsigned char*)data;
        const unsigned char *end = str + size;
        unsigned char byte;
        unsigned int code_length, i;
        uint32_t ch;
        while (str != end) {
            byte = *str;
            if (byte <= 0x7F) {
                /* 1 byte sequence: U+0000..U+007F */
                str += 1;
                continue;
            }

            if (0xC2 <= byte && byte <= 0xDF)
                /* 0b110xxxxx: 2 bytes sequence */
                code_length = 2;
            else if (0xE0 <= byte && byte <= 0xEF)
                /* 0b1110xxxx: 3 bytes sequence */
                code_length = 3;
            else if (0xF0 <= byte && byte <= 0xF4)
                /* 0b11110xxx: 4 bytes sequence */
                code_length = 4;
            else {
                /* invalid first byte of a multibyte character */
                return 0;
            }

            if (str + (code_length - 1) >= end) {
                /* truncated string or invalid byte sequence */
                return 0;
            }

            /* Check continuation bytes: bit 7 should be set, bit 6 should be
             * unset (0b10xxxxxx). */
            for (i = 1; i < code_length; i++) {
                if ((str[i] & 0xC0) != 0x80)
                    return 0;
            }

            if (code_length == 2) {
                /* 2 bytes sequence: U+0080..U+07FF */
                ch = ((str[0] & 0x1f) << 6) + (str[1] & 0x3f);
                /* str[0] >= 0xC2, so ch >= 0x0080.
                   str[0] <= 0xDF, (str[1] & 0x3f) <= 0x3f, so ch <= 0x07ff */
            } else if (code_length == 3) {
                /* 3 bytes sequence: U+0800..U+FFFF */
                ch = ((str[0] & 0x0f) << 12) + ((str[1] & 0x3f) << 6) +
                     (str[2] & 0x3f);
                /* (0xff & 0x0f) << 12 | (0xff & 0x3f) << 6 | (0xff & 0x3f) = 0xffff,
                   so ch <= 0xffff */
                if (ch < 0x0800)
                    return 0;

                /* surrogates (U+D800-U+DFFF) are invalid in UTF-8:
                   test if (0xD800 <= ch && ch <= 0xDFFF) */
                if ((ch >> 11) == 0x1b)
                    return 0;
            } else if (code_length == 4) {
                /* 4 bytes sequence: U+10000..U+10FFFF */
                ch = ((str[0] & 0x07) << 18) + ((str[1] & 0x3f) << 12) +
                     ((str[2] & 0x3f) << 6) + (str[3] & 0x3f);
                if ((ch < 0x10000) || (0x10FFFF < ch))
                    return 0;
            }
            str += code_length;
        }
        return 1;
    }
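As with the ASCII check, the same test can be sketched in Python by relying on the built-in decoder. This is a minimal sketch assuming Python 3, whose strict UTF-8 decoder rejects overlong sequences, surrogates and code points above U+10FFFF; the function name `is_utf8` is chosen here for illustration and does not come from the text above.

```python
def is_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (Python 3 decoder)."""
    try:
        # errors="strict" is the default; spelled out for clarity.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    samples = [
        b"abc",               # pure ASCII
        "\u20ac".encode("utf-8"),  # 3-byte sequence (U+20AC)
        b"\xc0\x80",          # overlong encoding of U+0000
        b"\xed\xb2\x80",      # surrogate U+DC80
        b"\xe2\x82",          # truncated 3-byte sequence
    ]
    for sample in samples:
        print(sample, is_utf8(sample))
```

Like the Python ASCII check, this decodes the whole input into memory, so it is best suited to short strings; the C function scans the bytes in place without allocating.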


