Python 3.0 automatic decoding of UTF16 - Google Groups


comp.lang.python

Johannes Bauer, Dec 5, 2008, 10:25:23 PM
Hello group,

I'm having trouble reading a utf-16 encoded file with Python 3.0. This is
my (complete) code:

#!/usr/bin/python3.0

class AddressBook():
    def __init__(self, filename):
        f = open(filename, "r", encoding="utf16")
        while True:
            line = f.readline()
            if line == "": break
            print([line[x] for x in range(len(line))])
        f.close()

a = AddressBook("2008_11_05_Handy_Backup.txt")

This is the file (only 1 kB, if hosting doesn't work please tell me and
I'll see if I can put it someplace else):

http://www.file-upload.net/download-1297291/2008_11_05_Handy_Backup.txt.gz.html

What I get: The file reads fine for the first few lines. Then, in the last
line, I get lots of garbage (looking like uninitialized memory):

['E', 'n', 't', 'r', 'y', '0', '0', 'T', 'e', 'x', 't', ' ', '=', ' ',
'"', 'A', 'D', 'A', 'C', ' ', 'V', 'e', 'r', 'k', 'e', 'h', 'r', 's', 'i',
'n', 'f', 'o', '"', '\u0d00', '\u0a00', '䔀', '渀', '琀', '爀', '礀',
'\u3000', '\u3100', '吀', '礀', '瀀', '攀', '\u2000', '㴀', '\u2000', '一',
'甀', '洀', '戀', '攀', '爀', '䴀', '漀', '戀', '椀', '氀', '攀', '\u0d00',
'\u0a00', '䔀', '渀', '琀', '爀', '礀', '\u3000', '\u3100', '吀', '攀', '砀',
'琀', '\u2000', '㴀', '\u2000', '∀', '⬀', '㐀', '㤀', '\u3100', '㜀', '㤀',
'㈀', '㈀', '㐀', '㤀', '㤀', '∀', '\u0d00', '\u0a00', '\u0d00', '\u0a00',
'嬀', '倀', '栀', '漀', '渀', '攀', '倀', '䈀', '䬀', '\u3000', '\u3000', '㐀',
'崀', '\u0d00', '\u0a00']

Where the line

Entry00Text = "ADAC Verkehrsinfo"\r\n

is actually the only thing the line contains, Python makes the rest up.

The actual file is much longer and contains private numbers, so I
truncated them away. When I let python process the original file, it
dies with another error:

Traceback (most recent call last):
  File "./modify.py", line 12, in <module>
    a = AddressBook("2008_11_05_Handy_Backup.txt")
  File "./modify.py", line 7, in __init__
    line = f.readline()
  File "/usr/local/lib/python3.0/io.py", line 1807, in readline
    while self._read_chunk():
  File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
  File "/usr/local/lib/python3.0/io.py", line 1293, in decode
    output = self.decoder.decode(input, final=final)
  File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in _buffer_decode
    return self.decoder(input, self.errors, final)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 74-75: illegal encoding

With the place where it dies being exactly the place where it outputs
the weird garbage in the shortened file. I guess it runs over some page
boundary here or something?

Kind regards,
Johannes

-- 
"My countersuit against you will then be for deliberate mendacity,
slander of God, the Bible and me, and deliberate blasphemy."
 -- Prophet and visionary Hans Joss aka HJP in de.sci.physik
 <[email protected]>
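The odd characters at the end of the printed list follow a pattern rather than looking like arbitrary memory: 'E' is U+0045 and '䔀' is U+4500, so each one is an ASCII character whose two UTF-16 bytes have been read in the wrong order. A minimal sketch, assuming Python 3, that reproduces the pattern by deliberately decoding big-endian bytes as little-endian. It only illustrates the symptom and says nothing about where in the reading code the bytes get out of step.

```python
# Encode a few ASCII characters as big-endian UTF-16, then decode the same
# bytes with the opposite byte order to mimic the "garbage" in the output.
text = "Entry"
swapped = text.encode("utf-16-be").decode("utf-16-le")

for original, garbled in zip(text, swapped):
    # Each code point has its high and low byte exchanged.
    print("U+%04X -> U+%04X" % (ord(original), ord(garbled)))
```

The same exchange explains '\u0d00' and '\u0a00' in the list: they are '\r' (U+000D) and '\n' (U+000A) with their bytes in the wrong order.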

J Kenneth King, Dec 6, 2008, 12:51:17 AM
Johannes Bauer writes:

> Traceback (most recent call last):
>   File "./modify.py", line 12, in <module>
>     a = AddressBook("2008_11_05_Handy_Backup.txt")
>   File "./modify.py", line 7, in __init__
>     line = f.readline()
>   File "/usr/local/lib/python3.0/io.py", line 1807, in readline
>     while self._read_chunk():
>   File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
>     self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
>   File "/usr/local/lib/python3.0/io.py", line 1293, in decode
>     output = self.decoder.decode(input, final=final)
>   File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
>     (result, consumed) = self._buffer_decode(data, self.errors, final)
>   File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in
> _buffer_decode
>     return self.decoder(input, self.errors, final)
> UnicodeDecodeError: 'utf16' codec can't decode bytes in position 74-75:
> illegal encoding

It probably means what it says: that the input file contains characters
it cannot read using the specified encoding.

Are you generating the file from python using a file object with the
same encoding? If not, then you might want to look at your input data
and find a way to deal with the exception.

Johannes Bauer, Dec 6, 2008, 1:28:04 AM
J Kenneth King wrote:

> It probably means what it says: that the input file contains characters
> it cannot read using the specified encoding.

No, it doesn't. The file is just fine, just as the example.

> Are you generating the file from python using a file object with the
> same encoding? If not, then you might want to look at your input data
> and find a way to deal with the exception.

I did. The file is fine. Could you try out the example?

Regards,

Richard Brodie, Dec 6, 2008, 1:38:25 AM
"J Kenneth King" wrote in message news:[email protected]...

> It probably means what it says: that the input file contains characters
> it cannot read using the specified encoding.

That was my first thought. However it appears that there is an off by one
error somewhere in the intersection of line ending/codec processing.
Half way through the codec starts byte-flipping characters.

Terry Reedy, Dec 6, 2008, 2:24:34 AM
Johannes Bauer wrote:
> Hello group,
>
> I'm having trouble reading a utf-16 encoded file with Python 3.0. This is
> my (complete) code:

what OS. This is often critical when you have a problem interacting with
the OS.

From \r\n I guess Windows. Correct?

I suspect that '?' after \n (\u0a00) indicates not 'question-mark'
but 'uninterpretable as a utf16 character'. The traceback below
confirms that. It should be an end-of-file marker and should not be
passed to Python. I strongly suspect that whatever wrote the file
screwed up the (OS-specific) end-of-file marker. I have seen this
occasionally on Dos/Windows with ascii byte files, with the same symptom
of reading random garbage past the end of the file. Or perhaps
end-of-file does not work right with utf16.

> is actually the only thing the line contains, Python makes the rest up.

No it does not. It echoes what the OS gives it with system calls, which
is random garbage to the end of the disk block.

Try open with explicit 'rt' and 'rb' modes and see what happens. Text
mode should be default, but then \r should be deleted.

> The actual file is much longer and contains private numbers, so I
> truncated them away. When I let python process the original file, it
> dies with another error:
>
> Traceback (most recent call last):
>   File "./modify.py", line 12, in <module>
>     a = AddressBook("2008_11_05_Handy_Backup.txt")
>   File "./modify.py", line 7, in __init__
>     line = f.readline()
>   File "/usr/local/lib/python3.0/io.py", line 1807, in readline
>     while self._read_chunk():
>   File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
>     self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
>   File "/usr/local/lib/python3.0/io.py", line 1293, in decode
>     output = self.decoder.decode(input, final=final)
>   File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
>     (result, consumed) = self._buffer_decode(data, self.errors, final)
>   File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in
> _buffer_decode
>     return self.decoder(input, self.errors, final)
> UnicodeDecodeError: 'utf16' codec can't decode bytes in position 74-75:
> illegal encoding
>
> With the place where it dies being exactly the place where it outputs
> the weird garbage in the shortened file. I guess it runs over some page
> boundary here or something?

Malformed EOF more likely.

Terry Jan Reedy

Johannes Bauer, Dec 6, 2008, 2:36:21 AM
Terry Reedy wrote:
> Johannes Bauer wrote:
>> Hello group,
>>
>> I'm having trouble reading a utf-16 encoded file with Python 3.0. This is
>> my (complete) code:
>
> what OS. This is often critical when you have a problem interacting
> with the OS.

It's a 64-bit Linux, currently running:

Linux joeserver 2.6.20-skas3-v9-pre9 #4 SMP PREEMPT Wed Dec 3 18:34:49
CET 2008 x86_64 Intel(R) Core(TM)[email protected]/Linux

Kernel 2.6.26.1, however, yields the same problem.

>> Entry00Text = "ADAC Verkehrsinfo"\r\n
>
> From \r\n I guess Windows. Correct?

Well, not really. The file was created with gammu, a Linux opensource
tool to extract a phonebook off cellphones. However, gammu seems to
generate those Windows-CRLF line endings.

> I suspect that '?' after \n (\u0a00) indicates not 'question-mark'
> but 'uninterpretable as a utf16 character'. The traceback below
> confirms that. It should be an end-of-file marker and should not be
> passed to Python. I strongly suspect that whatever wrote the file
> screwed up the (OS-specific) end-of-file marker. I have seen this
> occasionally on Dos/Windows with ascii byte files, with the same symptom
> of reading random garbage past the end of the file. Or perhaps
> end-of-file does not work right with utf16.

So UTF-16 has an explicit EOF marker within the text? I cannot find one
in original file, only some kind of starting sequence I suppose
(0xfeff). The last characters of the file are 0x00 0x0d 0x00 0x0a,
simple \r\n line ending.

>> is actually the only thing the line contains, Python makes the rest up.
>
> No it does not. It echoes what the OS gives it with system calls, which
> is random garbage to the end of the disk block.

Could it not be, as Richard suggested, that there's an off-by-one?

> Try open with explicit 'rt' and 'rb' modes and see what happens. Text
> mode should be default, but then \r should be deleted.

rt:

[...]
['[', 'P', 'h', 'o', 'n', 'e', 'P', 'B', 'K', '0', '0', '3', ']', '\n']
['L', 'o', 'c', 'a', 't', 'i', 'o', 'n', ' ', '=', ' ', '0', '0', '3', '\n']
['E', 'n', 't', 'r', 'y', '0', '0', 'T', 'y', 'p', 'e', ' ', '=', ' ', 'N', 'a', 'm', 'e', '\n']
Traceback (most recent call last):
  File "./modify.py", line 12, in <module>
    a = AddressBook("2008_11_05_Handy_Backup.txt")
  File "./modify.py", line 7, in __init__
    line = f.readline()
  File "/usr/local/lib/python3.0/io.py", line 1807, in readline
    while self._read_chunk():
  File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
  File "/usr/local/lib/python3.0/io.py", line 1293, in decode
    output = self.decoder.decode(input, final=final)
  File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in _buffer_decode
    return self.decoder(input, self.errors, final)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 74-75: illegal encoding

rb works, as it doesn't take an encoding parameter.

> Malformed EOF more likely.

Could you please elaborate?

Joe Strout, Dec 6, 2008, 3:00:59 AM
On Dec 5, 2008, at 11:36 AM, Johannes Bauer wrote:

>> I suspect that '?' after \n (\u0a00) indicates not 'question-mark'
>> but 'uninterpretable as a utf16 character'. The traceback below
>> confirms that. It should be an end-of-file marker and should not be
>> passed to Python. I strongly suspect that whatever wrote the file
>> screwed up the (OS-specific) end-of-file marker. I have seen this
>> occasionally on Dos/Windows with ascii byte files, with the same
>> symptom
>> of reading random garbage past the end of the file. Or perhaps
>> end-of-file does not work right with utf16.
>
> So UTF-16 has an explicit EOF marker within the text?

No, it does not. I don't know what Terry's thinking of there, but text
files do not have any EOF marker. They start at the beginning
(sometimes including a byte-order mark), and go till the end of the
file, period.

> I cannot find one in original file, only some kind of starting
> sequence I suppose
> (0xfeff).

That's your byte-order mark (BOM).

> The last characters of the file are 0x00 0x0d 0x00 0x0a,
> simple \r\n line ending.

Sounds like a perfectly normal file to me.

It's hard to imagine, but it looks to me like you've found a bug.

Best,

in...@orlans-amo.be, Dec 6, 2008, 3:15:33 AM
On Dec 5, 3:25 pm, Johannes Bauer wrote:
> Hello group,
>
> I'm having trouble reading a utf-16 encoded file with Python 3.0. This is
> my (complete) code:
>
> #!/usr/bin/python3.0
>
> class AddressBook():
>     def __init__(self, filename):
>         f = open(filename, "r", encoding="utf16")
>         while True:
>             line = f.readline()
>             if line == "": break
>             print([line[x] for x in range(len(line))])
>         f.close()
>
> a = AddressBook("2008_11_05_Handy_Backup.txt")
>
> This is the file (only 1 kB, if hosting doesn't work please tell me and
> I'll see if I can put it someplace else):
>
> http://www.file-upload.net/download-1297291/2008_11_05_Handy_Backup.t...

2 problems: endianness and trailing zero byte.
This works for me:

class AddressBook():
    def __init__(self, filename):
        f = open(filename, "r", encoding="utf_16_be", newline="\r\n")
        while True:
            line = f.readline()
            if len(line) == 0:
                break
            print(line.replace("\r\n", ""))
        f.close()

a = AddressBook("2008_11_05_Handy_Backup2.txt")

Please note the file name: I modified your file by dropping the trailing
zero byte.

MRAB, Dec 6, 2008, 3:36:16 AM
Joe Strout wrote:
> On Dec 5, 2008, at 11:36 AM, Johannes Bauer wrote:
>
>>> I suspect that '?' after \n (\u0a00) indicates not 'question-mark'
>>> but 'uninterpretable as a utf16 character'. The traceback below
>>> confirms that. It should be an end-of-file marker and should not be
>>> passed to Python. I strongly suspect that whatever wrote the file
>>> screwed up the (OS-specific) end-of-file marker. I have seen this
>>> occasionally on Dos/Windows with ascii byte files, with the same symptom
>>> of reading random garbage past the end of the file. Or perhaps
>>> end-of-file does not work right with utf16.
>>
>> So UTF-16 has an explicit EOF marker within the text?
>
> No, it does not. I don't know what Terry's thinking of there, but text
> files do not have any EOF marker. They start at the beginning
> (sometimes including a byte-order mark), and go till the end of the
> file, period.
>
Text files _do_ sometimes have an EOF marker, such as character 0x1A. It
can occur in text files in Windows.

>> I cannot find one in original file, only some kind of starting
>> sequence I suppose
>> (0xfeff).
>
> That's your byte-order mark (BOM).
>
>> The last characters of the file are 0x00 0x0d 0x00 0x0a,
>> simple \r\n line ending.
>

John Machin, Dec 6, 2008, 6:32:23 AM
On Dec 6, 5:36 am, Johannes Bauer wrote:
> So UTF-16 has an explicit EOF marker within the text? I cannot find one
> in original file, only some kind of starting sequence I suppose
> (0xfeff). The last characters of the file are 0x00 0x0d 0x00 0x0a,
> simple \r\n line ending.

Sorry, *WRONG*. It ends in 00 0d 00 0a 00. The file is 1559 bytes
long, an ODD number, which shouldn't happen with utf16. The file is
stuffed. Python 3.0 has a bug; it should give a meaningful error
message.

Python 2.6.0 silently ignores the problem [that's a BUG] when read by
a similar method:

| >>> import codecs
| >>> lines = codecs.open('x.txt', 'r', 'utf16').readlines()
| >>> lines[-1]
| u'[PhonePBK004]\r\n'

Python 2.x does however give a meaningful precise error message if you
try a decode on the file contents:

| >>> s = open('x.txt', 'rb').read()
| >>> len(s)
| 1559
| >>> s[-35:]
| '\x00\r\x00\n\x00[\x00P\x00h\x00o\x00n\x00e\x00P\x00B\x00K\x000\x000\x004\x00]\x00\r\x00\n\x00'
| >>> u = s.decode('utf16')
| Traceback (most recent call last):
|   File "<stdin>", line 1, in <module>
|   File "C:\python26\lib\encodings\utf_16.py", line 16, in decode
|     return codecs.utf_16_decode(input, errors, True)
| UnicodeDecodeError: 'utf16' codec can't decode byte 0x00 in position 1558: truncated data

HTH,
John

Steven D'Aprano, Dec 6, 2008, 7:35:44 AM
On Fri, 05 Dec 2008 12:00:59 -0700, Joe Strout wrote:

>> So UTF-16 has an explicit EOF marker within the text?
>
> No, it does not. I don't know what Terry's thinking of there, but text
> files do not have any EOF marker. They start at the beginning
> (sometimes including a byte-order mark), and go till the end of the
> file, period.

Windows text files still interpret ctrl-Z as EOF, or at least Windows XP
does. Vista, who knows?

-- 
Steven

John Machin, Dec 6, 2008, 8:26:36 AM
On Dec 6, 10:35 am, Steven D'Aprano <...cybersource.com.au> wrote:
>> On Fri, 05 Dec 2008 12:00:59 -0700, Joe Strout wrote:
>>>> So UTF-16 has an explicit EOF marker within the text?
>>> No, it does not. I don't know what Terry's thinking of there, but text
>>> files do not have any EOF marker. They start at the beginning
>>> (sometimes including a byte-order mark), and go till the end of the
>>> file, period.
>> Windows text files still interpret ctrl-Z as EOF, or at least Windows XP
>> does. Vista, who knows?
>
> This applies only to files being read in an 8-bit text mode. It is
> inherited from MS-DOS, which followed the CP/M convention, which was
> necessary because CP/M's file system recorded only the physical file
> length in 128-byte sectors, not the logical length. It is likely to
> continue in perpetuity, just as standard railway gauge is (allegedly)
> based on the axle-length of Roman chariots.
>
The chariots in question were drawn by 2 horses, so the gauge is based
on the width of a horse. :-)

Johannes Bauer, Dec 7, 2008, 12:38:03 AM
in...@orlans-amo.be wrote:

> 2 problems: endianness and trailing zero byte.
> This works for me:

This is very strange - when using "utf16", endianness should be detected
automatically. When I simply truncate the trailing zero byte, I receive:

Traceback (most recent call last):
  File "./modify.py", line 12, in <module>
    a = AddressBook("2008_11_05_Handy_Backup.txt")
  File "./modify.py", line 7, in __init__
    line = f.readline()
  File "/usr/local/lib/python3.0/io.py", line 1807, in readline
    while self._read_chunk():
  File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
  File "/usr/local/lib/python3.0/io.py", line 1293, in decode
    output = self.decoder.decode(input, final=final)
  File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in _buffer_decode
    return self.decoder(input, self.errors, final)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 0: truncated data

But I suppose something *is* indeed weird because the file I uploaded
and which did not yield the "truncated data" error is 1559 bytes, which
just cannot be.

Regards,
Johannes
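The two file-level questions raised in this part of the thread (is there a byte-order mark, and is the byte count even?) can be answered from the raw bytes before any text decoding is attempted. A minimal sketch, assuming Python 3; the helper name `inspect_utf16_bytes` and the sample data are made up for illustration and are not the poster's file.

```python
import codecs

def inspect_utf16_bytes(raw):
    """Return (bom, odd_length) for bytes that are supposed to be UTF-16."""
    if raw.startswith(codecs.BOM_UTF16_BE):
        bom = "big-endian"
    elif raw.startswith(codecs.BOM_UTF16_LE):
        bom = "little-endian"
    else:
        bom = None  # no BOM: the 'utf-16' codec cannot detect the byte order
    # UTF-16 uses 2-byte code units, so a valid file has an even length.
    return bom, len(raw) % 2 == 1

# Hypothetical sample: a BOM, one line of text, and (for `bad`) a stray byte.
good = codecs.BOM_UTF16_BE + "[PhonePBK004]\r\n".encode("utf-16-be")
bad = good + b"\x00"

print(inspect_utf16_bytes(good))
print(inspect_utf16_bytes(bad))

try:
    bad.decode("utf-16")
except UnicodeDecodeError as exc:
    # Decoding the whole byte string at once reports the dangling byte.
    print("decode failed:", exc.reason)
```

Reading the file with `open(path, "rb").read()` and passing the result to such a helper keeps the check independent of the text layer whose behaviour is in question here.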

Johannes Bauer, Dec 7, 2008, 12:43:13 AM
John Machin wrote:
> On Dec 6, 5:36 am, Johannes Bauer wrote:
>> So UTF-16 has an explicit EOF marker within the text? I cannot find one
>> in original file, only some kind of starting sequence I suppose
>> (0xfeff). The last characters of the file are 0x00 0x0d 0x00 0x0a,
>> simple \r\n line ending.
>
> Sorry, *WRONG*. It ends in 00 0d 00 0a 00. The file is 1559 bytes
> long, an ODD number, which shouldn't happen with utf16. The file is
> stuffed. Python 3.0 has a bug; it should give a meaningful error
> message.

Yes, you are right. I fixed the file, yet another error pops up
(http://www.file-upload.net/download-1299688/2008_12_05_Handy_Backup.txt.html):

Traceback (most recent call last):
  File "./modify.py", line 12, in <module>
    a = AddressBook("2008_12_05_Handy_Backup.txt")
  File "./modify.py", line 7, in __init__
    line = f.readline()
  File "/usr/local/lib/python3.0/io.py", line 1807, in readline
    while self._read_chunk():
  File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
  File "/usr/local/lib/python3.0/io.py", line 1293, in decode
    output = self.decoder.decode(input, final=final)
  File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in _buffer_decode
    return self.decoder(input, self.errors, final)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 0: truncated data

File size is 1630 bytes - so this clearly cannot be.

Regards,

MRAB, Dec 7, 2008, 12:50:24 AM
It might be that the EOF marker (b'\x1A' or u'\u001A') was written or is
being read as a single byte instead of 2 bytes for UTF-16 text.

Mark Tolonen, Dec 7, 2008, 3:20:26 AM
"Johannes Bauer" wrote in message news:[email protected]...

How about posting your code? The first file is incorrect. It contains
an extra 0x00 byte at the end of the file, but is otherwise correctly
encoded with a big-endian UTF16 BOM and data. The second file is a
correct UTF16-BE file as well.

This code (Python 2.6) decodes the first file, removing the trailing
extra byte:

raw = open('2008_11_05_Handy_Backup.txt').read()
data = raw[:-1].decode('utf16')

and this code (Python 2.6) decodes the second:

raw = open('2008_12_05_Handy_Backup.txt').read()
data = raw.decode('utf16')

Python 3.0 also has no problems with decoding or accurate error messages:

>>> data = open('2008_12_05_Handy_Backup.txt', encoding='utf16').read()
>>> data = open('2008_11_05_Handy_Backup.txt', encoding='utf16').read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\dev\python30\lib\io.py", line 1724, in read
    decoder.decode(self.buffer.read(), final=True))
  File "C:\dev\python30\lib\io.py", line 1295, in decode
    output = self.decoder.decode(input, final=final)
  File "C:\dev\python30\lib\codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "c:\dev\python30\lib\encodings\utf_16.py", line 61, in _buffer_decode
    codecs.utf_16_ex_decode(input, errors, 0, final)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x00 in position 1558: truncated data

-Mark

John Machin, Dec 7, 2008, 5:40:47 AM
On Dec 7, 6:20 am, "Mark Tolonen" wrote:
> "Johannes Bauer" wrote in message
> news:[email protected]...
>
>> John Machin wrote:
>>> On Dec 6, 5:36 am, Johannes Bauer wrote:
>>>> So UTF-16 has an explicit EOF marker within the text? I cannot find one
>>>> in original file, only some kind of starting sequence I suppose
>>>> (0xfeff). The last characters of the file are 0x00 0x0d 0x00 0x0a,
>>>> simple \r\n line ending.
>>>
>>> Sorry, *WRONG*. It ends in 00 0d 00 0a 00. The file is 1559 bytes
>>> long, an ODD number, which shouldn't happen with utf16. The file is
>>> stuffed. Python 3.0 has a bug; it should give a meaningful error
>>> message.
>>
>> Yes, you are right. I fixed the file, yet another error pops up
>> (http://www.file-upload.net/download-1299688/2008_12_05_Handy_Backup.t...
>>
>> Traceback (most recent call last):
>>   File "./modify.py", line 12, in <module>
>>     a = AddressBook("2008_12_05_Handy_Backup.txt")
>>   File "./modify.py", line 7, in __init__
>>     line = f.readline()
>>   File "/usr/local/lib/python3.0/io.py", line 1807, in readline
>>     while self._read_chunk():
>>   File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
>>     self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
>>   File "/usr/local/lib/python3.0/io.py", line 1293, in decode
>>     output = self.decoder.decode(input, final=final)
>>   File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
>>     (result, consumed) = self._buffer_decode(data, self.errors, final)
>>   File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in
>> _buffer_decode
>>     return self.decoder(input, self.errors, final)
>> UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 0:
>> truncated data
>>
>> File size is 1630 bytes - so this clearly cannot be.
>
> How about posting your code?

He did. Ugly stuff using readline() :-) Should still work, though.
There are definite problems with readline() and readlines(),
including:

First file: silently ignores error *and* the last line returned is
garbage [consists of multiple actual lines, and the trailing
codepoints have been byte-swapped]

Second file: as he has just reported. I've reproduced it with
f = open('second_file.txt', encoding='utf16')
followed by each of:
(1) f.readlines()
(2) list(f)
(3) for line in f:
        print(repr(line))
With the last one, the error happens after printing the last actual
line in his file.

David Bolen, Dec 7, 2008, 6:01:24 AM
Johannes Bauer writes:

> This is very strange - when using "utf16", endianness should be detected
> automatically. When I simply truncate the trailing zero byte, I receive:

Any chance that whatever you used to "simply truncate the trailing zero
byte" also removed the BOM at the start of the file? Without it, utf16
wouldn't be able to detect endianness and would, I believe, fall back
to native order.

-- David

John Machin, Dec 7, 2008, 6:34:28 AM
On Dec 7, 9:01 am, David Bolen wrote:

When I read this, I thought "O no, surely not!". Seems that you are
correct:

[Python 2.5.2, Windows XP]
| >>> nobom = u'abcde'.encode('utf_16_be')
| >>> nobom
| '\x00a\x00b\x00c\x00d\x00e'
| >>> nobom.decode('utf16')
| u'\u6100\u6200\u6300\u6400\u6500'

This may well explain one of the Python 3.0 problems that the OP's 2
files exhibit: data appears to have been byte-swapped under some
conditions. Possibility: it is reading the file a chunk at a time and
applying the utf_16 encoding independently to each chunk -- only the
first chunk will have a BOM.

John Machin, Dec 7, 2008, 11:30:40 AM
Well, no, on further investigation, we're not byte-swapped, we've
tricked ourselves into decoding on odd-byte boundaries.

Here's the scoop: It's a bug in the newline handling (in io.py, class
IncrementalNewlineDecoder, method decode). It reads text files in 128-
byte chunks. Converting CR LF to \n requires special case handling
when '\r' is detected at the end of the decoded chunk n in case
there's an LF at the start of chunk n+1. Buggy solution: prepend b'\r'
to the chunk n+1 bytes and decode that -- suddenly with a 2-bytes-per-
char encoding like UTF-16 we are 1 byte out of whack. Better (IMVH[1]
O) solution: prepend '\r' to the result of decoding the chunk n+1
bytes. Each of the OP's files have \r on a 64-character boundary.
Note: They would exhibit the same symptoms if encoded in utf-16LE
instead of utf-16BE. With the better solution applied, the first file
[the truncated one] gave the expected error, and the second file [the
apparently OK one] gave sensible looking output.

[1] I thought it best to be Very Humble given what you see when you
do:
import io
print(io.__author__)
Hope my surge protector can cope with this :-)
^%!//()
NO CARRIER

Terry Reedy, Dec 7, 2008, 5:15:29 PM
John Machin wrote:

> Here's the scoop: It's a bug in the newline handling (in io.py, class
> IncrementalNewlineDecoder, method decode). It reads text files in 128-
> byte chunks. Converting CR LF to \n requires special case handling
> when '\r' is detected at the end of the decoded chunk n in case
> there's an LF at the start of chunk n+1. Buggy solution: prepend b'\r'
> to the chunk n+1 bytes and decode that -- suddenly with a 2-bytes-per-
> char encoding like UTF-16 we are 1 byte out of whack. Better (IMVH[1]
> O) solution: prepend '\r' to the result of decoding the chunk n+1
> bytes. Each of the OP's files have \r on a 64-character boundary.
> Note: They would exhibit the same symptoms if encoded in utf-16LE
> instead of utf-16BE. With the better solution applied, the first file
> [the truncated one] gave the expected error, and the second file [the
> apparently OK one] gave sensible looking output.
>
> [1] I thought it best to be Very Humble given what you see when you
> do:
> import io
> print(io.__author__)
> Hope my surge protector can cope with this :-)
> ^%!//()
> NO CARRIER

Please post this on the tracker so it can get included with other io
work for 3.0.1.
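The difference between the two strategies John Machin describes can be replayed on a hand-made chunk boundary. This is a minimal sketch, assuming Python 3; it is not the io.py source, and the sample text and variable names are invented for illustration. It splits big-endian UTF-16 bytes right after a '\r', then compares prepending the byte b'\r' before decoding with prepending the character '\r' after decoding.

```python
import codecs

text = "first line\r\nsecond line\r\n"
data = text.encode("utf-16-be")

# Put the chunk boundary right after the two bytes that encode '\r'.
cut = 2 * (text.index("\r") + 1)
chunk_n, chunk_n1 = data[:cut], data[cut:]

def new_decoder():
    # A fresh incremental decoder, as used when decoding chunk by chunk.
    return codecs.getincrementaldecoder("utf-16-be")()

# Strategy 1 (the one described as buggy): push a single byte b'\r' in
# front of the next chunk's bytes. Every 2-byte unit is now off by one byte.
buggy = new_decoder().decode(b"\r" + chunk_n1)

# Strategy 2 (the one described as better): decode the next chunk as it is
# and prepend the held-back character to the decoded text.
fixed = "\r" + new_decoder().decode(chunk_n1)

print(ascii(buggy[:3]))
print(ascii(fixed))
```

With the first strategy the text starts with '\u0d00' and '\u0a00', the same pair that shows up in the output posted at the start of the thread; with the second the chunk decodes to the expected "\r\nsecond line\r\n".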

John Machin, Dec 7, 2008, 5:46:34 PM
On Dec 7, 8:15 pm, Terry Reedy wrote:
> John Machin wrote:
>> Here's the scoop: It's a bug in the newline handling (in io.py, class
>> IncrementalNewlineDecoder, method decode). It reads text files in 128-
>> byte chunks. Converting CR LF to \n requires special case handling
>> when '\r' is detected at the end of the decoded chunk n in case
>> there's an LF at the start of chunk n+1. Buggy solution: prepend b'\r'
>> to the chunk n+1 bytes and decode that -- suddenly with a 2-bytes-per-
>> char encoding like UTF-16 we are 1 byte out of whack.
> Please post this on the tracker so it can get included with other io
> work for 3.0.1.

I'm fiddling with a short bug-demo script right now.

Johannes Bauer, Dec 7, 2008, 11:05:53 PM
John Machin wrote:

> He did. Ugly stuff using readline() :-) Should still work, though.

Well, well, I'm a C kinda guy used to while (fgets(b, sizeof(b), f))
kinda loops :-)

But, seriously - I find that whole "while True:" and "if line == """
construct ugly as hell, too. How can reading a file line by line be
achieved in a more pythonic kind of way?

D'Arcy J.M. Cain, Dec 7, 2008, 11:14:30 PM
On 07 Dec 2008 16:05:53 +0100, Johannes Bauer wrote:
> But, seriously - I find that whole "while True:" and "if line == """
> construct ugly as hell, too. How can reading a file line by line be
> achieved in a more pythonic kind of way?

for line in open(filename):

-- 
D'Arcy J.M. Cain               |  Democracy is three wolves
http://www.druid.net/darcy/    |  and a sheep voting on
+1 416 425 1212 (DoD#0082) (eNTP) |  what's for dinner.

John Machin, Dec 8, 2008, 6:20:03 AM
On Dec 8, 2:05 am, Johannes Bauer wrote:
> John Machin wrote:
>
>> He did. Ugly stuff using readline() :-) Should still work, though.
>
> Well, well, I'm a C kinda guy used to while (fgets(b, sizeof(b), f))
> kinda loops :-)
>
> But, seriously - I find that whole "while True:" and "if line == """
> construct ugly as hell, too. How can reading a file line by line be
> achieved in a more pythonic kind of way?

By using
    for line in open(.....)
as mentioned in (1) my message that you were replying to (2) the
tutorial:
http://docs.python.org/3.0/tutorial/inputoutput.html#reading-and-writing-files
... skip the stuff on readline() and readlines() this time :-)

While waiting for the bug to be fixed, you'll need something like the
following:

def utf16_getlines(fname, newline_terminated=True):
    f = open(fname, 'rb')
    raw_bytes = f.read()
    f.close()
    decoded = raw_bytes.decode('utf16')
    if newline_terminated:
        normalised = decoded.replace('\r\n', '\n')
        lines = normalised.splitlines(True)
    else:
        lines = decoded.splitlines()
    return lines

That avoids the chunk-reading problem by reading the whole file in one
go. In fact given the way I've written it, there can be 4 copies of
the file contents. Fortunately your files are tiny.

HTH,
John
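For completeness, the `for line in ...` idiom recommended above can be combined with a `with` block so the file is closed automatically. A minimal sketch, assuming a Python 3 release in which the chunk-boundary newline problem discussed in this thread no longer occurs; the helper name `iter_utf16_lines` and the temporary demo file are made up for illustration.

```python
import os
import tempfile

def iter_utf16_lines(path):
    """Yield the lines of a UTF-16 text file without their line endings."""
    # encoding="utf-16" reads the BOM to pick the byte order; the default
    # newline handling translates "\r\n" to "\n" while reading.
    with open(path, "r", encoding="utf-16") as f:
        for line in f:
            yield line.rstrip("\n")

# Demo on a small temporary file written with a BOM and CRLF line endings.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write("[PhonePBK001]\r\nLocation = 001\r\n".encode("utf-16"))

lines = list(iter_utf16_lines(demo_path))
os.remove(demo_path)
print(lines)
```

Unlike the whole-file workaround, this reads incrementally, so it only makes sense once the text layer itself can be trusted with 2-byte encodings.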


