python - decode utf-16 file with bom - splunktool
I have a UTF-16 LE file with BOM. I'd like to flip this file into UTF-8 without BOM so I can parse it using Python.

Last Updated: Sat Jul 30 2022. Dustin, published at Dev.

The usual code that I use didn't do the trick; it returned unknown characters instead of…

    f = open('dbo.chrRaces.Table.sql').read()
    f = str(f).decode('utf-16le', errors='ignore').encode('utf8')
    print f

What would be the proper way to decode this file so I can parse through it with f.readlines()?

BOM characters should be automatically stripped when decoding UTF-16, but not UTF-8, unless you explicitly use the utf-8-sig encoding. You could try something like this:

    import codecs

    # Read in binary mode so the BOM bytes arrive untouched.
    with open('dbo.chrRaces.Table.sql', 'rb') as infile:
        encoded_text = infile.read()

    bom = codecs.BOM_UTF16_LE  # see dir(codecs) for the other BOM constants
    # Make sure the encoding is what you expect, otherwise you'll get wrong data.
    assert encoded_text.startswith(bom)
    encoded_text = encoded_text[len(bom):]          # strip away the BOM
    decoded_text = encoded_text.decode('utf-16le')  # decode to unicode

In Python 3, the BOM-aware utf-16 codec can also be passed to open() directly:

    f = open('test_utf16.txt', mode='r', encoding='utf-16').read()
    print(f)

Suggestion: 2

codecs.BOM_UTF16_BE

    import codecs
    import re

    def _detect_encoding(self, fileid):
        # PathPointer is defined by the code base this method was taken from.
        if isinstance(fileid, PathPointer):
            s = fileid.open().readline()
        else:
            with open(fileid, 'rb') as infile:
                s = infile.readline()
        # UTF-32 is tested before UTF-16 because the UTF-32 LE BOM
        # (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
        if s.startswith(codecs.BOM_UTF32_BE):
            return 'utf-32-be'
        if s.startswith(codecs.BOM_UTF32_LE):
            return 'utf-32-le'
        if s.startswith(codecs.BOM_UTF16_BE):
            return 'utf-16-be'
        if s.startswith(codecs.BOM_UTF16_LE):
            return 'utf-16-le'
        if s.startswith(codecs.BOM_UTF8):
            return 'utf-8'
        # m = re.match(br'\s*…  (the rest of the function is cut off in the source)

    >>> 'foo'.encode('utf-16le')
    b'f\x00o\x00o\x00'
    >>> 'foo'.encode('utf-16')
    b'\xff\xfef\x00o\x00o\x00'

You can see that when using UTF-16 (instead of UTF-16LE), you get the BOM correctly prepended to the bytes. If you were on a little endian system and purposefully wanted to create a UTF-16BE file, the only way to do it is:

    >>> codecs.BOM_UTF16_BE + 'foo'.encode('utf-16be')
    b'\xfe\xff\x00f\x00o\x00o'

This doesn't make a lot of sense to me. Why is the BOM not prepended automatically when encoding with UTF-16BE? Furthermore, if you were given a UTF-16BE file on a little endian system, you might think that this would be the correct way to decode it:

    >>> (codecs.BOM_UTF16_BE + 'foo'.encode('utf-16be')).decode('utf-16be')
    '\ufefffoo'

but as you can see, that leaves the BOM on there. Strangely, decoding with UTF-16 works fine however:

    >>> (codecs.BOM_UTF16_BE + 'foo'.encode('utf-16be')).decode('utf-16')
    'foo'

It seems to me that the endian-specific versions of UTF-16 and UTF-32 should be adding/removing the appropriate BOMs, and this is a long-standing bug.

Yes, if you explicitly use big-endian or little-endian UTF, then you need to manually include a BOM if that's required. That said, if a file format or data field is specified with a particular byte order, then using a BOM is strictly incorrect. See the UTF BOM FAQ: http://www.unicode.org/faq/utf_bom.html#BOM

For regular text documents, in which the byte order doesn't really matter, use the native byte order of your platform via UTF-16 or UTF-32. Also, instead of manually encoding strings, use the "encoding" parameter of the built-in open function, or io.open or codecs.open in Python 2. This only writes a single BOM, even when writing to a file multiple times.

eryksun beat me to the answer, but I'm going to post mine anyway :) If I understand the codecs docs correctly, this is because if you are specifying the endianness you want, it is a sign that you are only going to interpret it as that endianness, so there's no need for a BOM. If you want a BOM, use utf-16/32.
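The codec behaviour described in the discussion above can be checked directly. The following is a minimal sketch, not code from the thread; the helper names `encode_with_bom` and `decode_any_utf16` are hypothetical and exist only for illustration.

```python
import codecs


def encode_with_bom(text, big_endian=False):
    """Encode text as UTF-16 in a fixed byte order, with a leading BOM.

    Hypothetical helper: the endian-specific codecs never write a BOM,
    so it is prepended by hand.
    """
    if big_endian:
        return codecs.BOM_UTF16_BE + text.encode("utf-16-be")
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def decode_any_utf16(data):
    """Decode UTF-16 bytes that start with a BOM, dropping the BOM.

    The generic 'utf-16' codec reads the BOM to pick the byte order
    and removes it from the decoded text.
    """
    return data.decode("utf-16")


be_bytes = encode_with_bom("foo", big_endian=True)
print(be_bytes)                             # b'\xfe\xff\x00f\x00o\x00o'
print(repr(be_bytes.decode("utf-16-be")))   # '\ufefffoo' (BOM kept as U+FEFF)
print(repr(decode_any_utf16(be_bytes)))     # 'foo' (BOM consumed)
```

The endian-specific codec is still the right tool for data that is known to carry no BOM, for example a chunk read from the middle of a file.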
In short, what is your use case for producing a UTF string with non-native byte order? But as eryksun said, the Python supported way to do that and include a BOM is to write the BOM yourself.

Thanks for straightening me out there! I had not noticed this in the Unicode FAQ before:

> Where the data has an associated type, such as a field in a database, a BOM is unnecessary. In particular, if a text data stream is marked as UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE, a BOM is neither necessary nor permitted. Any U+FEFF would be interpreted as a ZWNBSP.

Anyway, the thing that brought this up is that in chardet we detect codecs of files for people and we've been returning UTF-16BE or UTF-16LE when we detect the BOM at the front of the file, but we recently learned that if people tried to decode with those codecs, things don't work as expected. It seems the correct behavior in our case is to just return UTF-16 in these cases.

Just to add some more background: the LE and BE codecs are meant to be used when you already know the endianness of the platform you are targeting, e.g. in case you work on strings that were read after the initial BOM, or write to an output string in chunks after having written the initial BOM. As such, they don't treat the BOM specially, since it is a valid code point, and pass it through as-is.
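The point about reporting the generic codec once a BOM has been seen can be sketched as follows. This is an illustrative assumption about how such a detector might look, not chardet's actual code; `codec_from_bom` and the `default` parameter are hypothetical names.

```python
import codecs

# UTF-32 is listed before UTF-16 because the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOM_TO_CODEC = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def codec_from_bom(data, default="utf-8"):
    """Return a BOM-aware codec name for data, or default if no BOM is found."""
    for bom, name in _BOM_TO_CODEC:
        if data.startswith(bom):
            return name
    return default


sample = codecs.BOM_UTF16_BE + "héllo".encode("utf-16-be")
codec = codec_from_bom(sample)
print(codec, repr(sample.decode(codec)))
```

Because every name returned for a detected BOM is a BOM-aware codec (`utf-16`, `utf-32`, `utf-8-sig`), decoding with it removes the BOM instead of leaving U+FEFF at the start of the text. One known limitation of this kind of sniffing: UTF-16 LE data whose first character after the BOM is U+0000 looks like a UTF-32 LE BOM.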
If you do want BOM handling, the UTF-16 codec is the right choice. It defaults to the platform's endianness and uses the BOM to indicate which choice it made.

Suggestion: 4

    # ff_name is the UTF-16 input path, target_file_name the UTF-8 output path.
    with open(ff_name, 'rb') as source_file:
        with open(target_file_name, 'w+b') as dest_file:
            contents = source_file.read()
            # Decoding with 'utf-16' consumes the BOM; the UTF-8 output has none.
            dest_file.write(contents.decode('utf-16').encode('utf-8'))

Suggestion: 5

According to the Python documentation on reading and writing Unicode data:

> Some encodings, such as UTF-16, expect a BOM to be present at the start of a file; when such an encoding is used, the BOM will be automatically written as the first character and will be silently dropped when the file is read.

From this, it sounds like any UTF-16 or UTF-32 encoding will automatically take care of the BOM. However, try running the following code in a Python 3 script:

    with open("output.txt", mode="w", encoding="utf-16-le") as f:
        f.write("Hello World.")

Open the output file in an editor which reports the encoding, such as the excellent Notepad++ on Windows. You'll see UTF-16 (or UCS-2) Little Endian, but it will say there is no BOM.

Instead, change the encoding to utf-16. This lets Python use the operating system's endianness, and it assumes that a BOM is therefore necessary. Here's the modified code snippet:

    with open("output.txt", mode="w", encoding="utf-16") as f:
        f.write("Hello World.")

Suggestion: 6

UTF-16 is used by systems such as the Microsoft Windows API, the Java programming language and JavaScript/ECMAScript. It is also sometimes used for plain text and word-processing data files on Microsoft Windows. It is rarely used for files on Unix-like systems. Since May 2019, Microsoft has begun supporting UTF-8 (as well as UTF-16) and encouraging its use.[2]

Each Unicode code point is encoded either as one or two 16-bit code units. How these 16-bit codes are stored as bytes then depends on the 'endianness' of the text file or communication protocol.

[Table not present in the source: a summary of this conversion, showing how bits from the code point are distributed among the UTF-16 bytes.]
^ Unicode (Windows). Retrieved 2011-03-08. "These functions use UTF-16 (wide character) encoding (…) used for native Unicode encoding on Windows operating systems."

    U' = yyyyyyyyyyxxxxxxxxxx  // U - 0x10000
    W1 = 110110yyyyyyyyyy      // 0xD800 + yyyyyyyyyy
    W2 = 110111xxxxxxxxxx      // 0xDC00 + xxxxxxxxxx
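The surrogate-pair layout given earlier (U', W1, W2) can be worked through in a few lines. This is a small sketch for illustration; the function name `to_surrogate_pair` is hypothetical, and the result is compared against Python's own `utf-16-be` codec as a sanity check.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its two UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need surrogates")
    u = code_point - 0x10000       # U' = U - 0x10000, a 20-bit value
    w1 = 0xD800 + (u >> 10)        # high surrogate: top 10 bits (yyyyyyyyyy)
    w2 = 0xDC00 + (u & 0x3FF)      # low surrogate: bottom 10 bits (xxxxxxxxxx)
    return w1, w2


w1, w2 = to_surrogate_pair(0x1F600)
print(hex(w1), hex(w2))            # 0xd83d 0xde00

# Cross-check against the built-in big-endian codec, which writes no BOM.
manual = w1.to_bytes(2, "big") + w2.to_bytes(2, "big")
print(manual == chr(0x1F600).encode("utf-16-be"))   # True
```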
Further reading
1. Reading and writing UTF-16/UTF-16-LE text files in Python 3 - Lo爸的遊戲區
[Key points up front] When reading: when a UTF-16 text file is read with encoding='utf-16-le', ... when a UTF-16-LE text file is read, an error occurs and the file cannot be read, because Python expects ...
2. Python Strings decode() method - GeeksforGeeks
3. Decode UTF-8 in Python | Delft Stack
4. Unicode HOWTO — Python 3.10.7 documentation
UTF stands for “Unicode Transformation Format”, and the '8' means that 8-bit values are used in t...
5. Why Python 3 doesn't write the Unicode BOM - Peter Bloomfield
Python doesn't always output a Unicode Byte Order Mark. ... When handling Unicode, Windows and Vi...