codecs – String encoding and decoding - PyMOTW

2024-11-26


Purpose: Encoders and decoders for converting text between different representations.
Available In: 2.1 and later

The output from all the example programs from PyMOTW has been generated with Python 2.7.8, unless otherwise noted. Some of the features described here may not be available in earlier versions of Python. If you are looking for examples that work under Python 3, please refer to the PyMOTW-3 section of the site.

The codecs module provides stream and file interfaces for transcoding data in your program. It is most commonly used to work with Unicode text, but other encodings are also available for other purposes.

Unicode Primer

CPython 2.x supports two types of strings for working with text data. Old-style str instances use a single 8-bit byte to represent each character of the string using its ASCII code. In contrast, unicode strings are managed internally as a sequence of Unicode code points. The code point values are saved as a sequence of 2 or 4 bytes each, depending on the options given when Python was compiled. Both unicode and str are derived from a common base class, and support a similar API.
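To make the idea of code points concrete, here is a minimal sketch. It is written in Python 3 syntax, where the text type is named str and the byte-string type is named bytes, so it is an analogy to the Python 2 unicode and str types described here rather than the same API. The sample text is arbitrary.

```python
# Python 3 sketch: text is a sequence of code points, independent of any
# particular byte encoding.
text = 'pi: \u03c0'

# One integer code point per character.
code_points = [ord(c) for c in text]

# The number of characters can differ from the number of encoded bytes.
utf8_bytes = text.encode('utf-8')

print([hex(cp) for cp in code_points])
print(len(text), len(utf8_bytes))
```

The last character has a code point above the ASCII range, so it needs more than one byte once the text is encoded as UTF-8.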
When unicode strings are output, they are encoded using one of several standard schemes so that the sequence of bytes can be reconstructed as the same string later. The bytes of the encoded value are not necessarily the same as the code point values, and the encoding defines a way to translate between the two sets of values. Reading Unicode data also requires knowing the encoding so that the incoming bytes can be converted to the internal representation used by the unicode class.

The most common encodings for Western languages are UTF-8 and UTF-16, which use sequences of one and two byte values respectively to represent each character. Other encodings can be more efficient for storing languages where most of the characters are represented by code points that do not fit into two bytes.

See also: For more introductory information about Unicode, refer to the list of references at the end of this section. The Python Unicode HOWTO is especially helpful.

Encodings

The best way to understand encodings is to look at the different series of bytes produced by encoding the same string in different ways. The examples below use this function to format the byte string to make it easier to read.

    import binascii

    def to_hex(t, nbytes):
        "Format text t as a sequence of nbyte long values separated by spaces."
        chars_per_item = nbytes * 2
        hex_version = binascii.hexlify(t)
        num_chunks = len(hex_version) / chars_per_item
        def chunkify():
            for start in xrange(0, len(hex_version), chars_per_item):
                yield hex_version[start:start + chars_per_item]
        return ' '.join(chunkify())

    if __name__ == '__main__':
        print to_hex('abcdef', 1)
        print to_hex('abcdef', 2)

The function uses binascii to get a hexadecimal representation of the input byte string, then inserts a space between every nbytes bytes before returning the value.

    $ python codecs_to_hex.py

    61 62 63 64 65 66
    6162 6364 6566

The first encoding example begins by printing the text 'pi: π' using the raw representation of the unicode class. The π character is replaced with the expression for the Unicode code point, \u03c0. The next two lines encode the string as UTF-8 and UTF-16 respectively, and show the hexadecimal values resulting from the encoding.
    from codecs_to_hex import to_hex

    text = u'pi: π'

    print 'Raw:', repr(text)
    print 'UTF-8:', to_hex(text.encode('utf-8'), 1)
    print 'UTF-16:', to_hex(text.encode('utf-16'), 2)

The result of encoding a unicode string is a str object.

    $ python codecs_encodings.py

    Raw: u'pi: \u03c0'
    UTF-8: 70 69 3a 20 cf 80
    UTF-16: fffe 7000 6900 3a00 2000 c003

Given a sequence of encoded bytes as a str instance, the decode() method translates them to code points and returns the sequence as a unicode instance.

    from codecs_to_hex import to_hex

    text = u'pi: π'
    encoded = text.encode('utf-8')
    decoded = encoded.decode('utf-8')

    print 'Original:', repr(text)
    print 'Encoded:', to_hex(encoded, 1), type(encoded)
    print 'Decoded:', repr(decoded), type(decoded)

The choice of encoding used does not change the output type.

    $ python codecs_decode.py

    Original: u'pi: \u03c0'
    Encoded: 70 69 3a 20 cf 80 <type 'str'>
    Decoded: u'pi: \u03c0' <type 'unicode'>

Note: The default encoding is set during the interpreter start-up process, when site is loaded. Refer to Unicode Defaults for a description of the default encoding settings accessible via sys.

Working with Files

Encoding and decoding strings is especially important when dealing with I/O operations. Whether you are writing to a file, socket, or other stream, you will want to ensure that the data is using the proper encoding. In general, all text data needs to be decoded from its byte representation as it is read, and encoded from the internal values to a specific representation as it is written. Your program can explicitly encode and decode data, but depending on the encoding used it can be non-trivial to determine whether you have read enough bytes in order to fully decode the data. codecs provides classes that manage the data encoding and decoding for you, so you don’t have to create your own.

The simplest interface provided by codecs is a replacement for the built-in open() function. The new version works just like the built-in, but adds two new arguments to specify the encoding and desired error handling technique.
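As a hedged illustration of those two arguments, the sketch below uses Python 3 syntax, where the built-in open() itself accepts encoding and errors. The article's own Python 2.7 examples use codecs.open() instead, and the scratch file name here is a placeholder chosen for the illustration.

```python
import os
import tempfile

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'example.txt')

# encoding selects the codec; errors selects the error handling mode.
with open(path, mode='w', encoding='utf-8', errors='strict') as f:
    f.write('pi: \u03c0')

# Reading the file back in binary mode shows the encoded bytes.
with open(path, mode='rb') as f:
    raw = f.read()

# Reading with the same encoding recovers the original text.
with open(path, mode='r', encoding='utf-8') as f:
    decoded = f.read()

print(raw)
print(decoded)
```

The text written and the text read back are equal only because the same encoding is named on both sides.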
    from codecs_to_hex import to_hex

    import codecs
    import sys

    encoding = sys.argv[1]
    filename = encoding + '.txt'

    print 'Writing to', filename
    with codecs.open(filename, mode='wt', encoding=encoding) as f:
        f.write(u'pi: \u03c0')

    # Determine the byte grouping to use for to_hex()
    nbytes = { 'utf-8':1,
               'utf-16':2,
               'utf-32':4,
               }.get(encoding, 1)

    # Show the raw bytes in the file
    print 'File contents:'
    with open(filename, mode='rt') as f:
        print to_hex(f.read(), nbytes)

Starting with a unicode string with the code point for π, this example saves the text to a file using an encoding specified on the command line.

    $ python codecs_open_write.py utf-8
    Writing to utf-8.txt
    File contents:
    70 69 3a 20 cf 80

    $ python codecs_open_write.py utf-16
    Writing to utf-16.txt
    File contents:
    fffe 7000 6900 3a00 2000 c003

    $ python codecs_open_write.py utf-32
    Writing to utf-32.txt
    File contents:
    fffe0000 70000000 69000000 3a000000 20000000 c0030000

Reading the data with open() is straightforward, with one catch: you must know the encoding in advance, in order to set up the decoder correctly. Some data formats, such as XML, let you specify the encoding as part of the file, but usually it is up to the application to manage. codecs simply takes the encoding as an argument and assumes it is correct.

    import codecs
    import sys

    encoding = sys.argv[1]
    filename = encoding + '.txt'

    print 'Reading from', filename
    with codecs.open(filename, mode='rt', encoding=encoding) as f:
        print repr(f.read())

This example reads the files created by the previous program, and prints the representation of the resulting unicode object to the console.
    $ python codecs_open_read.py utf-8
    Reading from utf-8.txt
    u'pi: \u03c0'

    $ python codecs_open_read.py utf-16
    Reading from utf-16.txt
    u'pi: \u03c0'

    $ python codecs_open_read.py utf-32
    Reading from utf-32.txt
    u'pi: \u03c0'

Byte Order

Multi-byte encodings such as UTF-16 and UTF-32 pose a problem when transferring the data between different computer systems, either by copying the file directly or with network communication. Different systems use different ordering of the high and low order bytes. This characteristic of the data, known as its endianness, depends on factors such as the hardware architecture and choices made by the operating system and application developer. There isn’t always a way to know in advance what byte order to use for a given set of data, so the multi-byte encodings include a byte-order marker (BOM) as the first few bytes of encoded output. For example, UTF-16 is defined in such a way that 0xFFFE and 0xFEFF are not valid characters, and can be used to indicate the byte order. codecs defines constants for the byte order markers used by UTF-16 and UTF-32.

    import codecs
    from codecs_to_hex import to_hex

    for name in [ 'BOM', 'BOM_BE', 'BOM_LE',
                  'BOM_UTF8',
                  'BOM_UTF16', 'BOM_UTF16_BE', 'BOM_UTF16_LE',
                  'BOM_UTF32', 'BOM_UTF32_BE', 'BOM_UTF32_LE',
                  ]:
        print '{:12} : {}'.format(name, to_hex(getattr(codecs, name), 2))

BOM, BOM_UTF16, and BOM_UTF32 are automatically set to the appropriate big-endian or little-endian values depending on the current system’s native byte order.

    $ python codecs_bom.py

    BOM          : fffe
    BOM_BE       : feff
    BOM_LE       : fffe
    BOM_UTF8     : efbb bf
    BOM_UTF16    : fffe
    BOM_UTF16_BE : feff
    BOM_UTF16_LE : fffe
    BOM_UTF32    : fffe 0000
    BOM_UTF32_BE : 0000 feff
    BOM_UTF32_LE : fffe 0000

Byte ordering is detected and handled automatically by the decoders in codecs, but you can also choose an explicit ordering for the encoding.

    import codecs
    from codecs_to_hex import to_hex

    # Pick the non-native version of UTF-16 encoding
    if codecs.BOM_UTF16 == codecs.BOM_UTF16_BE:
        bom = codecs.BOM_UTF16_LE
        encoding = 'utf_16_le'
    else:
        bom = codecs.BOM_UTF16_BE
        encoding = 'utf_16_be'

    print 'Native order:', to_hex(codecs.BOM_UTF16, 2)
    print 'Selected order:', to_hex(bom, 2)

    # Encode the text.
    encoded_text = u'pi: \u03c0'.encode(encoding)
    print '{:14}: {}'.format(encoding, to_hex(encoded_text, 2))

    with open('non-native-encoded.txt', mode='wb') as f:
        # Write the selected byte-order marker.  It is not included in the
        # encoded text because we were explicit about the byte order when
        # selecting the encoding.
        f.write(bom)
        # Write the byte string for the encoded text.
        f.write(encoded_text)

codecs_bom_create_file.py figures out the native byte ordering, then uses the alternate form explicitly so the next example can demonstrate auto-detection while reading.

    $ python codecs_bom_create_file.py

    Native order: fffe
    Selected order: feff
    utf_16_be     : 0070 0069 003a 0020 03c0

codecs_bom_detection.py does not specify a byte order when opening the file, so the decoder uses the BOM value in the first two bytes of the file to determine it.

    import codecs
    from codecs_to_hex import to_hex

    # Look at the raw data
    with open('non-native-encoded.txt', mode='rb') as f:
        raw_bytes = f.read()

    print 'Raw:', to_hex(raw_bytes, 2)

    # Re-open the file and let codecs detect the BOM
    with codecs.open('non-native-encoded.txt', mode='rt', encoding='utf-16') as f:
        decoded_text = f.read()

    print 'Decoded:', repr(decoded_text)

Since the first two bytes of the file are used for byte order detection, they are not included in the data returned by read().

    $ python codecs_bom_detection.py

    Raw: feff 0070 0069 003a 0020 03c0
    Decoded: u'pi: \u03c0'

Error Handling

The previous sections pointed out the need to know the encoding being used when reading and writing Unicode files. Setting the encoding correctly is important for two reasons. If the encoding is configured incorrectly while reading from a file, the data will be interpreted incorrectly and may be corrupted or simply fail to decode. Not all Unicode characters can be represented in all encodings, so if the wrong encoding is used while writing an error will be generated and data may be lost.

codecs uses the same five error handling options that are provided by the encode() method of unicode and the decode() method of str.

strict: Raises an exception if the data cannot be converted.
replace: Substitutes a special marker character for data that cannot be encoded.
ignore: Skips the data.
xmlcharrefreplace: Substitutes an XML character reference (encoding only).
backslashreplace: Substitutes a backslash escape sequence (encoding only).

Encoding Errors

The most common error condition is receiving a UnicodeEncodeError when writing Unicode data to an ASCII output stream, such as a regular file or sys.stdout. This sample program can be used to experiment with the different error handling modes.

    import codecs
    import sys

    error_handling = sys.argv[1]

    text = u'pi: \u03c0'

    try:
        # Save the data, encoded as ASCII, using the error
        # handling mode specified on the command line.
        with codecs.open('encode_error.txt', 'w',
                         encoding='ascii',
                         errors=error_handling) as f:
            f.write(text)

    except UnicodeEncodeError, err:
        print 'ERROR:', err

    else:
        # If there was no error writing to the file,
        # show what it contains.
        with open('encode_error.txt', 'rb') as f:
            print 'File contents:', repr(f.read())

While strict mode is safest for ensuring your application explicitly sets the correct encoding for all I/O operations, it can lead to program crashes when an exception is raised.

    $ python codecs_encode_error.py strict

    ERROR: 'ascii' codec can't encode character u'\u03c0' in position 4: ordinal not in range(128)

Some of the other error modes are more flexible. For example, replace ensures that no error is raised, at the expense of possibly losing data that cannot be converted to the requested encoding. The Unicode character for pi still cannot be encoded in ASCII, but instead of raising an exception the character is replaced with ? in the output.

    $ python codecs_encode_error.py replace

    File contents: 'pi: ?'

To skip over problem data entirely, use ignore. Any data that cannot be encoded is simply discarded.

    $ python codecs_encode_error.py ignore

    File contents: 'pi: '

There are two lossless error handling options, both of which replace the character with an alternate representation defined by a standard separate from the encoding. xmlcharrefreplace uses an XML character reference as a substitute (the list of character references is specified in the W3C XML Entity Definitions for Characters).
    $ python codecs_encode_error.py xmlcharrefreplace

    File contents: 'pi: &#960;'

The other lossless error handling scheme is backslashreplace which produces an output format like the value you get when you print the repr() of a unicode object. Unicode characters are replaced with \u followed by the hexadecimal value of the code point.

    $ python codecs_encode_error.py backslashreplace

    File contents: 'pi: \\u03c0'

Decoding Errors

It is also possible to see errors when decoding data, especially if the wrong encoding is used.

    import codecs
    import sys

    from codecs_to_hex import to_hex

    error_handling = sys.argv[1]

    text = u'pi: \u03c0'
    print 'Original:', repr(text)

    # Save the data with one encoding
    with codecs.open('decode_error.txt', 'w', encoding='utf-16') as f:
        f.write(text)

    # Dump the bytes from the file
    with open('decode_error.txt', 'rb') as f:
        print 'File contents:', to_hex(f.read(), 1)

    # Try to read the data with the wrong encoding
    with codecs.open('decode_error.txt', 'r',
                     encoding='utf-8',
                     errors=error_handling) as f:
        try:
            data = f.read()
        except UnicodeDecodeError, err:
            print 'ERROR:', err
        else:
            print 'Read:', repr(data)

As with encoding, strict error handling mode raises an exception if the byte stream cannot be properly decoded. In this case, a UnicodeDecodeError results from trying to convert part of the UTF-16 BOM to a character using the UTF-8 decoder.

    $ python codecs_decode_error.py strict

    Original: u'pi: \u03c0'
    File contents: ff fe 70 00 69 00 3a 00 20 00 c0 03
    ERROR: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte

Switching to ignore causes the decoder to skip over the invalid bytes. The result is still not quite what is expected, though, since it includes embedded null bytes.

    $ python codecs_decode_error.py ignore

    Original: u'pi: \u03c0'
    File contents: ff fe 70 00 69 00 3a 00 20 00 c0 03
    Read: u'p\x00i\x00:\x00 \x00\x03'

In replace mode invalid bytes are replaced with \uFFFD, the official Unicode replacement character, which looks like a diamond with a black background containing a white question mark (�).
    $ python codecs_decode_error.py replace

    Original: u'pi: \u03c0'
    File contents: ff fe 70 00 69 00 3a 00 20 00 c0 03
    Read: u'\ufffd\ufffdp\x00i\x00:\x00 \x00\ufffd\x03'

Standard Input and Output Streams

The most common cause of UnicodeEncodeError exceptions is code that tries to print unicode data to the console or a Unix pipeline when sys.stdout is not configured with an encoding.

    import codecs
    import sys

    text = u'pi: π'

    # Printing to stdout may cause an encoding error
    print 'Default encoding:', sys.stdout.encoding
    print 'TTY:', sys.stdout.isatty()
    print text

Problems with the default encoding of the standard I/O channels can be difficult to debug because the program works as expected when the output goes to the console, but causes encoding errors when it is used as part of a pipeline and the output includes Unicode characters above the ASCII range. This difference in behavior is caused by Python’s initialization code, which sets the default encoding for each standard I/O channel only if the channel is connected to a terminal (isatty() returns True). If there is no terminal, Python assumes the program will configure the encoding explicitly, and leaves the I/O channel alone.

    $ python codecs_stdout.py
    Default encoding: utf-8
    TTY: True
    pi: π

    $ python codecs_stdout.py | cat -
    Default encoding: None
    TTY: False
    Traceback (most recent call last):
      File "codecs_stdout.py", line 18, in <module>
        print text
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u03c0' in
     position 4: ordinal not in range(128)

To explicitly set the encoding on the standard output channel, use getwriter() to get a stream encoder class for a specific encoding. Instantiate the class, passing sys.stdout as the only argument.

    import codecs
    import sys

    text = u'pi: π'

    # Wrap sys.stdout with a writer that knows how to handle encoding
    # Unicode data.
    wrapped_stdout = codecs.getwriter('UTF-8')(sys.stdout)
    wrapped_stdout.write(u'Via write: ' + text + '\n')

    # Replace sys.stdout with a writer
    sys.stdout = wrapped_stdout

    print u'Via print:', text

Writing to the wrapped version of sys.stdout passes the Unicode text through an encoder before sending the encoded bytes to stdout.
Replacing sys.stdout means that any code used by your application that prints to standard output will be able to take advantage of the encoding writer.

    $ python codecs_stdout_wrapped.py

    Via write: pi: π
    Via print: pi: π

The next problem to solve is how to know which encoding should be used. The proper encoding varies based on location, language, and user or system configuration, so hard-coding a fixed value is not a good idea. It would also be annoying for a user to need to pass explicit arguments to every program setting the input and output encodings. Fortunately, there is a global way to get a reasonable default encoding, using locale.

    import codecs
    import locale
    import sys

    text = u'pi: π'

    # Configure locale from the user's environment settings.
    locale.setlocale(locale.LC_ALL, '')

    # Wrap stdout with an encoding-aware writer.
    lang, encoding = locale.getdefaultlocale()
    print 'Locale encoding:', encoding
    sys.stdout = codecs.getwriter(encoding)(sys.stdout)

    print 'With wrapped stdout:', text

getdefaultlocale() returns the language and preferred encoding based on the system and user configuration settings in a form that can be used with getwriter().

    $ python codecs_stdout_locale.py

    Locale encoding: UTF-8
    With wrapped stdout: pi: π

The encoding also needs to be set up when working with sys.stdin. Use getreader() to get a reader capable of decoding the input bytes.

    import codecs
    import locale
    import sys

    # Configure locale from the user's environment settings.
    locale.setlocale(locale.LC_ALL, '')

    # Wrap stdin with an encoding-aware reader.
    lang, encoding = locale.getdefaultlocale()
    sys.stdin = codecs.getreader(encoding)(sys.stdin)

    print 'From stdin:', repr(sys.stdin.read())

Reading from the wrapped handle returns unicode objects instead of str instances.

    $ python codecs_stdout_locale.py | python codecs_stdin.py

    From stdin: u'Locale encoding: UTF-8\nWith wrapped stdout: pi: \u03c0\n'

Network Communication

Network sockets are also byte-streams, and so Unicode data must be encoded into bytes before it is written to a socket.

    import sys
    import SocketServer


    class Echo(SocketServer.BaseRequestHandler):

        def handle(self):
            # Get some bytes and echo them back to the client.
            data = self.request.recv(1024)
            self.request.send(data)
            return


    if __name__ == '__main__':
        import codecs
        import socket
        import threading

        address = ('localhost', 0) # let the kernel give us a port
        server = SocketServer.TCPServer(address, Echo)
        ip, port = server.server_address # find out what port we were given

        t = threading.Thread(target=server.serve_forever)
        t.setDaemon(True) # don't hang on exit
        t.start()

        # Connect to the server
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect((ip, port))

        # Send the data
        text = u'pi: π'
        len_sent = s.send(text)

        # Receive a response
        response = s.recv(len_sent)
        print repr(response)

        # Clean up
        s.close()
        server.socket.close()

You could encode the data explicitly, before sending it, but miss one call to send() and your program would fail with an encoding error.

    $ python codecs_socket_fail.py

    Traceback (most recent call last):
      File "codecs_socket_fail.py", line 43, in <module>
        len_sent = s.send(text)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u03c0' in
     position 4: ordinal not in range(128)

By using makefile() to get a file-like handle for the socket, and then wrapping that with a stream-based reader or writer, you will be able to pass Unicode strings and know they are encoded on the way into and out of the socket.

    import sys
    import SocketServer


    class Echo(SocketServer.BaseRequestHandler):

        def handle(self):
            # Get some bytes and echo them back to the client.  There is
            # no need to decode them, since they are not used.
            data = self.request.recv(1024)
            self.request.send(data)
            return


    class PassThrough(object):

        def __init__(self, other):
            self.other = other

        def write(self, data):
            print 'Writing:', repr(data)
            return self.other.write(data)

        def read(self, size=-1):
            print 'Reading:',
            data = self.other.read(size)
            print repr(data)
            return data

        def flush(self):
            return self.other.flush()

        def close(self):
            return self.other.close()


    if __name__ == '__main__':
        import codecs
        import socket
        import threading

        address = ('localhost', 0) # let the kernel give us a port
        server = SocketServer.TCPServer(address, Echo)
        ip, port = server.server_address # find out what port we were given

        t = threading.Thread(target=server.serve_forever)
        t.setDaemon(True) # don't hang on exit
        t.start()

        # Connect to the server
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect((ip, port))

        # Wrap the socket with a reader and writer.
        incoming = codecs.getreader('utf-8')(PassThrough(s.makefile('r')))
        outgoing = codecs.getwriter('utf-8')(PassThrough(s.makefile('w')))

        # Send the data
        text = u'pi: π'
        print 'Sending:', repr(text)
        outgoing.write(text)
        outgoing.flush()

        # Receive a response
        response = incoming.read()
        print 'Received:', repr(response)

        # Clean up
        s.close()
        server.socket.close()

This example uses PassThrough to show that the data is encoded before being sent, and the response is decoded after it is received in the client.

    $ python codecs_socket.py

    Sending: u'pi: \u03c0'
    Writing: 'pi: \xcf\x80'
    Reading: 'pi: \xcf\x80'
    Received: u'pi: \u03c0'

Encoding Translation

Although most applications will work with unicode data internally, decoding or encoding it as part of an I/O operation, there are times when changing a file’s encoding without holding on to that intermediate data format is useful. EncodedFile() takes an open file handle using one encoding and wraps it with a class that translates the data to another encoding as the I/O occurs.

    from codecs_to_hex import to_hex

    import codecs
    from cStringIO import StringIO

    # Raw version of the original data.
    data = u'pi: \u03c0'

    # Manually encode it as UTF-8.
    utf8 = data.encode('utf-8')
    print 'Start as UTF-8:', to_hex(utf8, 1)

    # Set up an output buffer, then wrap it as an EncodedFile.
    output = StringIO()
    encoded_file = codecs.EncodedFile(output, data_encoding='utf-8',
                                      file_encoding='utf-16')
    encoded_file.write(utf8)

    # Fetch the buffer contents as a UTF-16 encoded byte string
    utf16 = output.getvalue()
    print 'Encoded to UTF-16:', to_hex(utf16, 2)

    # Set up another buffer with the UTF-16 data for reading,
    # and wrap it with another EncodedFile.
    buffer = StringIO(utf16)
    encoded_file = codecs.EncodedFile(buffer, data_encoding='utf-8',
                                      file_encoding='utf-16')

    # Read the UTF-8 encoded version of the data.
    recoded = encoded_file.read()
    print 'Back to UTF-8:', to_hex(recoded, 1)

This example shows reading from and writing to separate handles returned by EncodedFile(). No matter whether the handle is used for reading or writing, the file_encoding always refers to the encoding in use by the open file handle passed as the first argument, and the data_encoding value refers to the encoding in use by the data passing through the read() and write() calls.

    $ python codecs_encodedfile.py

    Start as UTF-8: 70 69 3a 20 cf 80
    Encoded to UTF-16: fffe 7000 6900 3a00 2000 c003
    Back to UTF-8: 70 69 3a 20 cf 80

Non-Unicode Encodings

Although most of the earlier examples use Unicode encodings, codecs can be used for many other data translations. For example, Python includes codecs for working with base-64, bzip2, ROT-13, ZIP, and other data formats.

    import codecs
    from cStringIO import StringIO

    buffer = StringIO()
    stream = codecs.getwriter('rot_13')(buffer)

    text = 'abcdefghijklmnopqrstuvwxyz'

    stream.write(text)
    stream.flush()

    print 'Original:', text
    print 'ROT-13:', buffer.getvalue()

Any transformation that can be expressed as a function taking a single input argument and returning a byte or Unicode string can be registered as a codec.

    $ python codecs_rot13.py

    Original: abcdefghijklmnopqrstuvwxyz
    ROT-13: nopqrstuvwxyzabcdefghijklm

Using codecs to wrap a data stream provides a simpler interface than working directly with zlib.
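The codecs named in this section can also be applied in a single stateless call, without a stream wrapper. The following is a minimal sketch in Python 3 syntax, while the article's own examples target Python 2.7. It assumes the Python 3 codec names rot_13 and zlib_codec, and the sample data is arbitrary.

```python
import codecs

text = 'abcdefghijklmnopqrstuvwxyz'

# rot_13 is a text-to-text codec in Python 3.
rotated = codecs.encode(text, 'rot_13')

# zlib_codec is a bytes-to-bytes codec, so the input must be bytes.
original = (text + '\n').encode('ascii') * 50
compressed = codecs.encode(original, 'zlib_codec')
restored = codecs.decode(compressed, 'zlib_codec')

print(rotated)
print(len(original), len(compressed), restored == original)
```

The stateless form needs the whole input in memory at once, which is the trade-off the stream interface is meant to avoid.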
    import codecs
    from cStringIO import StringIO

    from codecs_to_hex import to_hex

    buffer = StringIO()
    stream = codecs.getwriter('zlib')(buffer)

    text = 'abcdefghijklmnopqrstuvwxyz\n' * 50

    stream.write(text)
    stream.flush()

    print 'Original length:', len(text)
    compressed_data = buffer.getvalue()
    print 'ZIP compressed:', len(compressed_data)

    buffer = StringIO(compressed_data)
    stream = codecs.getreader('zlib')(buffer)

    first_line = stream.readline()
    print 'Read first line:', repr(first_line)

    uncompressed_data = first_line + stream.read()
    print 'Uncompressed:', len(uncompressed_data)
    print 'Same:', text == uncompressed_data

Not all of the compression or encoding systems support reading a portion of the data through the stream interface using readline() or read() because they need to find the end of a compressed segment to expand it. If your program cannot hold the entire uncompressed data set in memory, use the incremental access features of the compression library instead of codecs.

    $ python codecs_zlib.py

    Original length: 1350
    ZIP compressed: 48
    Read first line: 'abcdefghijklmnopqrstuvwxyz\n'
    Uncompressed: 1350
    Same: True

Incremental Encoding

Some of the encodings provided, especially bz2 and zlib, may dramatically change the length of the data stream as they work on it. For large data sets, these encodings operate better incrementally, working on one small chunk of data at a time. The IncrementalEncoder and IncrementalDecoder API is designed for this purpose.
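The buffering behavior of the incremental API can be seen on a small scale with a text codec. This sketch is written in Python 3 syntax and swaps in the UTF-8 incremental decoder for a compression codec, so it illustrates the same interface rather than the bz2 case discussed in this section. The sample text is arbitrary.

```python
import codecs

# UTF-8 bytes for 'pi: π'; the final character occupies two bytes.
data = 'pi: \u03c0'.encode('utf-8')

decoder = codecs.getincrementaldecoder('utf-8')()

parts = []
for i in range(len(data)):
    chunk = data[i:i + 1]              # feed one byte at a time
    is_last = (i == len(data) - 1)     # tell the codec when input ends
    # An incomplete multi-byte sequence is buffered and '' is returned.
    parts.append(decoder.decode(chunk, final=is_last))

print(parts)
print(''.join(parts))
```

The first byte of the two-byte sequence produces an empty string, and the character only appears once the decoder has seen the second byte.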
    import codecs
    import sys

    from codecs_to_hex import to_hex

    text = 'abcdefghijklmnopqrstuvwxyz\n'
    repetitions = 50

    print 'Text length:', len(text)
    print 'Repetitions:', repetitions
    print 'Expected len:', len(text) * repetitions

    # Encode the text several times to build up a large amount of data
    encoder = codecs.getincrementalencoder('bz2')()
    encoded = []

    print
    print 'Encoding:',
    for i in range(repetitions):
        en_c = encoder.encode(text, final=(i == repetitions - 1))
        if en_c:
            print '\nEncoded: {} bytes'.format(len(en_c))
            encoded.append(en_c)
        else:
            sys.stdout.write('.')

    bytes = ''.join(encoded)
    print
    print 'Total encoded length:', len(bytes)
    print

    # Decode the byte string one byte at a time
    decoder = codecs.getincrementaldecoder('bz2')()
    decoded = []

    print 'Decoding:',
    for i, b in enumerate(bytes):
        # The last chunk is the final byte of the encoded data.
        final = (i + 1) == len(bytes)
        c = decoder.decode(b, final)
        if c:
            print '\nDecoded: {} characters'.format(len(c))
            print 'Decoding:',
            decoded.append(c)
        else:
            sys.stdout.write('.')
    print

    restored = u''.join(decoded)

    print
    print 'Total uncompressed length:', len(restored)

Each time data is passed to the encoder or decoder its internal state is updated. When the state is consistent (as defined by the codec), data is returned and the state resets. Until that point, calls to encode() or decode() will not return any data. When the last bit of data is passed in, the argument final should be set to True so the codec knows to flush any remaining buffered data.

    $ python codecs_incremental_bz2.py

    Text length: 27
    Repetitions: 50
    Expected len: 1350

    Encoding:.................................................
    Encoded: 99 bytes

    Total encoded length: 99

    Decoding:............................................................
    ............................
    Decoded: 1350 characters
    Decoding:..........

    Total uncompressed length: 1350

Defining Your Own Encoding

Since Python comes with a large number of standard codecs already, it is unlikely that you will need to define your own. If you do, there are several base classes in codecs to make the process easier.
The first step is to understand the nature of the transformation described by the encoding. For example, an “invertcaps” encoding converts uppercase letters to lowercase, and lowercase letters to uppercase. Here is a simple definition of an encoding function that performs this transformation on an input string:

    import string

    def invertcaps(text):
        """Return new string with the case of all letters switched.
        """
        return ''.join( c.upper() if c in string.ascii_lowercase
                        else c.lower() if c in string.ascii_uppercase
                        else c
                        for c in text
                        )

    if __name__ == '__main__':
        print invertcaps('ABC.def')
        print invertcaps('abc.DEF')

In this case, the encoder and decoder are the same function (as with ROT-13).

    $ python codecs_invertcaps.py

    abc.DEF
    ABC.def

Although it is easy to understand, this implementation is not efficient, especially for very large text strings. Fortunately, codecs includes some helper functions for creating character map based codecs such as invertcaps. A character map encoding is made up of two dictionaries. The encoding map converts character values from the input string to byte values in the output and the decoding map goes the other way. Create your decoding map first, and then use make_encoding_map() to convert it to an encoding map. The C functions charmap_encode() and charmap_decode() use the maps to convert their input data efficiently.

    import codecs
    import string

    # Map every character to itself
    decoding_map = codecs.make_identity_dict(range(256))

    # Make a list of pairs of ordinal values for the lower and upper case
    # letters
    pairs = zip([ ord(c) for c in string.ascii_lowercase],
                [ ord(c) for c in string.ascii_uppercase])

    # Modify the mapping to convert upper to lower and lower to upper.
    decoding_map.update( dict( (upper, lower) for (lower, upper) in pairs) )
    decoding_map.update( dict( (lower, upper) for (lower, upper) in pairs) )

    # Create a separate encoding map.
    encoding_map = codecs.make_encoding_map(decoding_map)

    if __name__ == '__main__':
        print codecs.charmap_encode('abc.DEF', 'strict', encoding_map)
        print codecs.charmap_decode('abc.DEF', 'strict', decoding_map)
        print encoding_map == decoding_map

Although the encoding and decoding maps for invertcaps are the same, that may not always be the case. make_encoding_map() detects situations where more than one input character is encoded to the same output byte and replaces the encoding value with None to mark the encoding as undefined.

    $ python codecs_invertcaps_charmap.py

    ('ABC.def', 7)
    (u'ABC.def', 7)
    True

The character map encoder and decoder support all of the standard error handling methods described earlier, so you do not need to do any extra work to comply with that part of the API.

    import codecs
    from codecs_invertcaps_charmap import encoding_map

    text = u'pi: π'

    for error in [ 'ignore', 'replace', 'strict' ]:
        try:
            encoded = codecs.charmap_encode(text, error, encoding_map)
        except UnicodeEncodeError, err:
            encoded = str(err)
        print '{:7}: {}'.format(error, encoded)

Because the Unicode code point for π is not in the encoding map, the strict error handling mode raises an exception.

    $ python codecs_invertcaps_error.py

    ignore : ('PI: ', 5)
    replace: ('PI: ?', 5)
    strict : 'charmap' codec can't encode character u'\u03c0' in position
     4: character maps to <undefined>

After the encoding and decoding maps are defined, you need to set up a few additional classes and register the encoding. register() adds a search function to the registry so that when a user wants to use your encoding codecs can locate it. The search function must take a single string argument with the name of the encoding, and return a CodecInfo object if it knows the encoding, or None if it does not.
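The search function contract can be sketched in a few lines. This is a Python 3 illustration rather than part of the article's Python 2.7 examples, and the names find_alias, pymotw_utf8, and pymotw_missing are made up for the sketch. Instead of building a new codec, it reuses the CodecInfo of the standard UTF-8 codec under a second name.

```python
import codecs


def find_alias(name):
    """Search function: return a CodecInfo for known names, else None."""
    # 'pymotw_utf8' is a hypothetical alias used only for this illustration.
    if name == 'pymotw_utf8':
        return codecs.lookup('utf-8')
    return None


codecs.register(find_alias)

# The registry now resolves the alias through find_alias().
encoded = 'pi: \u03c0'.encode('pymotw_utf8')
print(encoded)

# Names that no search function recognizes raise LookupError.
try:
    codecs.lookup('pymotw_missing')
    found = True
except LookupError:
    found = False
print(found)
```

Returning None for unknown names matters, because it lets the registry move on to the next registered search function.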
    import codecs
    import encodings

    def search1(encoding):
        print 'search1: Searching for:', encoding
        return None

    def search2(encoding):
        print 'search2: Searching for:', encoding
        return None

    codecs.register(search1)
    codecs.register(search2)

    utf8 = codecs.lookup('utf-8')
    print 'UTF-8:', utf8

    try:
        unknown = codecs.lookup('no-such-encoding')
    except LookupError, err:
        print 'ERROR:', err

You can register multiple search functions, and each will be called in turn until one returns a CodecInfo or the list is exhausted. The internal search function registered by codecs knows how to load the standard codecs such as UTF-8 from encodings, so those names will never be passed to your search function.

    $ python codecs_register.py

    UTF-8:
    search1: Searching for: no-such-encoding
    search2: Searching for: no-such-encoding
    ERROR: unknown encoding: no-such-encoding

The CodecInfo instance returned by the search function tells codecs how to encode and decode using all of the different mechanisms supported: stateless, incremental, and stream. codecs includes base classes that make setting up a character map encoding easy. This example puts all of the pieces together to register a search function that returns a CodecInfo instance configured for the invertcaps codec.
    import codecs

    from codecs_invertcaps_charmap import encoding_map, decoding_map

    # Stateless encoder/decoder

    class InvertCapsCodec(codecs.Codec):
        def encode(self, input, errors='strict'):
            return codecs.charmap_encode(input, errors, encoding_map)

        def decode(self, input, errors='strict'):
            return codecs.charmap_decode(input, errors, decoding_map)

    # Incremental forms

    class InvertCapsIncrementalEncoder(codecs.IncrementalEncoder):
        def encode(self, input, final=False):
            return codecs.charmap_encode(input, self.errors, encoding_map)[0]

    class InvertCapsIncrementalDecoder(codecs.IncrementalDecoder):
        def decode(self, input, final=False):
            return codecs.charmap_decode(input, self.errors, decoding_map)[0]

    # Stream reader and writer

    class InvertCapsStreamReader(InvertCapsCodec, codecs.StreamReader):
        pass

    class InvertCapsStreamWriter(InvertCapsCodec, codecs.StreamWriter):
        pass

    # Register the codec search function

    def find_invertcaps(encoding):
        """Return the codec for 'invertcaps'.
        """
        if encoding == 'invertcaps':
            return codecs.CodecInfo(
                name='invertcaps',
                encode=InvertCapsCodec().encode,
                decode=InvertCapsCodec().decode,
                incrementalencoder=InvertCapsIncrementalEncoder,
                incrementaldecoder=InvertCapsIncrementalDecoder,
                streamreader=InvertCapsStreamReader,
                streamwriter=InvertCapsStreamWriter,
                )
        return None

    codecs.register(find_invertcaps)

    if __name__ == '__main__':

        # Stateless encoder/decoder
        encoder = codecs.getencoder('invertcaps')
        text = 'abc.DEF'
        encoded_text, consumed = encoder(text)
        print 'Encoder converted "{}" to "{}", consuming {} characters'.format(
            text, encoded_text, consumed)

        # Stream writer
        import sys
        writer = codecs.getwriter('invertcaps')(sys.stdout)
        print 'StreamWriter for stdout:',
        writer.write('abc.DEF')
        print

        # Incremental decoder
        decoder_factory = codecs.getincrementaldecoder('invertcaps')
        decoder = decoder_factory()
        decoded_text_parts = []
        for c in encoded_text:
            decoded_text_parts.append(decoder.decode(c, final=False))
        decoded_text_parts.append(decoder.decode('', final=True))
        decoded_text = ''.join(decoded_text_parts)
        print 'IncrementalDecoder converted "{}" to "{}"'.format(
            encoded_text, decoded_text)

The stateless encoder/decoder base class is Codec. Override encode() and decode() with your implementation (in this case, calling charmap_encode() and charmap_decode() respectively). Each method must return a tuple containing the transformed data and the number of the input bytes or characters consumed. Conveniently, charmap_encode() and charmap_decode() already return that information.

IncrementalEncoder and IncrementalDecoder serve as base classes for the incremental interfaces. The encode() and decode() methods of the incremental classes are defined in such a way that they only return the actual transformed data. Any information about buffering is maintained as internal state. The invertcaps encoding does not need to buffer data (it uses a one-to-one mapping). For encodings that produce a different amount of output depending on the data being processed, such as compression algorithms, BufferedIncrementalEncoder and BufferedIncrementalDecoder are more appropriate base classes, since they manage the unprocessed portion of the input for you.

StreamReader and StreamWriter need encode() and decode() methods, too, and since they are expected to return the same value as the version from Codec you can use multiple inheritance for the implementation.

    $ python codecs_invertcaps_register.py

    Encoder converted "abc.DEF" to "ABC.def", consuming 7 characters
    StreamWriter for stdout: ABC.def
    IncrementalDecoder converted "ABC.def" to "abc.DEF"

See also

codecs: The standard library documentation for this module.
locale: Accessing and managing the localization-based configuration settings and behaviors.
io: The io module includes file and stream wrappers that handle encoding and decoding, too.
SocketServer: For a more detailed example of an echo server, see the SocketServer module.
encodings: Package in the standard library containing the encoder/decoder implementations provided by Python.
Unicode HOWTO: The official guide for using Unicode with Python 2.x.
Python Unicode Objects: Fredrik Lundh’s article about using non-ASCII character sets in Python 2.0.
How to Use UTF-8 with Python: Evan Jones’ quick guide to working with Unicode, including XML data and the Byte-Order Marker.
On the Goodness of Unicode: Introduction to internationalization and Unicode by Tim Bray.
On Character Strings: A look at the history of string processing in programming languages, by Tim Bray.
Characters vs. Bytes: Part one of Tim Bray’s “essay on modern character string processing for computer programmers.” This installment covers in-memory representation of text in formats other than ASCII bytes.
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!): An introduction to Unicode by Joel Spolsky.
Endianness: Explanation of endianness in Wikipedia.

© Copyright Doug Hellmann. Last updated on Jul 11, 2020.