The output from all the example programs from PyMOTW has been
generated with Python 2.7.8, unless otherwise noted. Some
of the features described here may not be available in earlier
versions of Python.

If you are looking for examples that work under Python 3, please
refer to the PyMOTW-3 section of the site.
codecs – String encoding and decoding

Purpose: Encoders and decoders for converting text between different representations.
Available In: 2.1 and later

The codecs module provides stream and file interfaces for
transcoding data in your program. It is most commonly used to work
with Unicode text, but other encodings are also available for other
purposes.

Unicode Primer

CPython 2.x supports two types of strings for working with text data.
Old-style str instances use a single 8-bit byte to represent
each character of the string using its ASCII code. In contrast,
unicode strings are managed internally as a sequence of
Unicode code points. The code point values are saved as a sequence
of 2 or 4 bytes each, depending on the options given when Python was
compiled. Both unicode and str are derived from a
common base class, and support a similar API.

When unicode strings are output, they are encoded using one
of several standard schemes so that the sequence of bytes can be
reconstructed as the same string later. The bytes of the encoded
value are not necessarily the same as the code point values, and the
encoding defines a way to translate between the two sets of values.
Reading Unicode data also requires knowing the encoding so that the
incoming bytes can be converted to the internal representation used by
the unicode class.

The most common encodings for Western languages are UTF-8 and
UTF-16, which use sequences of one and two byte values
respectively to represent each character. Other encodings can be more
efficient for storing languages where most of the characters are
represented by code points that do not fit into two bytes.

See also

For more introductory information about Unicode, refer to the list
of references at the end of this section. The Python Unicode
HOWTO is especially helpful.

Encodings

The best way to understand encodings is to look at the different
series of bytes produced by encoding the same string in different
ways. The examples below use this function to format the byte string
to make it easier to read.

    import binascii

    def to_hex(t, nbytes):
        "Format text t as a sequence of nbyte long values separated by spaces."
        chars_per_item = nbytes * 2
        hex_version = binascii.hexlify(t)
        num_chunks = len(hex_version) / chars_per_item
        def chunkify():
            for start in xrange(0, len(hex_version), chars_per_item):
                yield hex_version[start:start + chars_per_item]
        return ' '.join(chunkify())

    if __name__ == '__main__':
        print to_hex('abcdef', 1)
        print to_hex('abcdef', 2)

The function uses binascii to get a hexadecimal representation
of the input byte string, then inserts a space between every nbytes
bytes before returning the value.

    $ python codecs_to_hex.py

    61 62 63 64 65 66
    6162 6364 6566
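
For readers following along on Python 3, the same helper can be sketched as below. This is an adaptation rather than the article's code: it assumes Python 3, where `binascii.hexlify()` returns `bytes`, so the result is decoded to `str` before being split into groups.

```python
import binascii


def to_hex(data, nbytes):
    """Return the hex digits of data in groups of nbytes bytes, separated by spaces."""
    chars_per_item = nbytes * 2  # two hex digits per byte
    hex_version = binascii.hexlify(data).decode('ascii')
    return ' '.join(
        hex_version[start:start + chars_per_item]
        for start in range(0, len(hex_version), chars_per_item)
    )


one_byte_groups = to_hex(b'abcdef', 1)
two_byte_groups = to_hex(b'abcdef', 2)
print(one_byte_groups)
print(two_byte_groups)
```

The input must be a byte string; text has to be encoded first, which is the subject of the examples that follow.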

The first encoding example begins by printing the text 'pi: π'
using the raw representation of the unicode class. The π
character is replaced with the expression for the Unicode code point,
\u03c0. The next two lines encode the string as UTF-8 and UTF-16
respectively, and show the hexadecimal values resulting from the
encoding.

    # -*- coding: utf-8 -*-
    from codecs_to_hex import to_hex

    text = u'pi: π'

    print 'Raw   :', repr(text)
    print 'UTF-8 :', to_hex(text.encode('utf-8'), 1)
    print 'UTF-16:', to_hex(text.encode('utf-16'), 2)

The result of encoding a unicode string is a str
object.

    $ python codecs_encodings.py

    Raw   : u'pi: \u03c0'
    UTF-8 : 70 69 3a 20 cf 80
    UTF-16: fffe 7000 6900 3a00 2000 c003

Given a sequence of encoded bytes as a str instance, the
decode() method translates them to code points and returns the
sequence as a unicode instance.

    # -*- coding: utf-8 -*-
    from codecs_to_hex import to_hex

    text = u'pi: π'
    encoded = text.encode('utf-8')
    decoded = encoded.decode('utf-8')

    print 'Original :', repr(text)
    print 'Encoded  :', to_hex(encoded, 1), type(encoded)
    print 'Decoded  :', repr(decoded), type(decoded)

The choice of encoding used does not change the output type.

    $ python codecs_decode.py

    Original : u'pi: \u03c0'
    Encoded  : 70 69 3a 20 cf 80 <type 'str'>
    Decoded  : u'pi: \u03c0' <type 'unicode'>
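
The same round trip can be sketched in Python 3, where the roles are played by `str` (text) and `bytes` (encoded data) instead of `unicode` and `str`. The variable names are illustrative and not taken from the article.

```python
text = 'pi: \u03c0'

# Encoding turns text into bytes; decoding reverses the step.
encoded = text.encode('utf-8')
decoded = encoded.decode('utf-8')

print('Original :', repr(text))
print('Encoded  :', encoded.hex(), type(encoded).__name__)
print('Decoded  :', repr(decoded), type(decoded).__name__)

# UTF-16 produces different bytes but decodes to the same text.
utf16_round_trip = text.encode('utf-16').decode('utf-16')
```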

Note

The default encoding is set during the interpreter start-up process,
when site is loaded. Refer to Unicode Defaults
for a description of the default encoding settings accessible via
sys.

Working with Files

Encoding and decoding strings is especially important when dealing
with I/O operations. Whether you are writing to a file, socket, or
other stream, you will want to ensure that the data is using the
proper encoding. In general, all text data needs to be decoded from
its byte representation as it is read, and encoded from the internal
values to a specific representation as it is written. Your program
can explicitly encode and decode data, but depending on the encoding
used it can be non-trivial to determine whether you have read enough
bytes in order to fully decode the data. codecs provides
classes that manage the data encoding and decoding for you, so you
don't have to create your own.

The simplest interface provided by codecs is a replacement for
the built-in open() function. The new version works just like
the built-in, but adds two new arguments to specify the encoding and
desired error handling technique.

    from codecs_to_hex import to_hex

    import codecs
    import sys

    encoding = sys.argv[1]
    filename = encoding + '.txt'

    print 'Writing to', filename
    with codecs.open(filename, mode='wt', encoding=encoding) as f:
        f.write(u'pi: \u03c0')

    # Determine the byte grouping to use for to_hex()
    nbytes = { 'utf-8':1,
               'utf-16':2,
               'utf-32':4,
               }.get(encoding, 1)

    # Show the raw bytes in the file. Binary mode avoids any
    # newline translation on platforms that perform it.
    print 'File contents:'
    with open(filename, mode='rb') as f:
        print to_hex(f.read(), nbytes)

Starting with a unicode string with the code point for π,
this example saves the text to a file using an encoding specified on
the command line.

    $ python codecs_open_write.py utf-8

    Writing to utf-8.txt
    File contents:
    70 69 3a 20 cf 80

    $ python codecs_open_write.py utf-16

    Writing to utf-16.txt
    File contents:
    fffe 7000 6900 3a00 2000 c003

    $ python codecs_open_write.py utf-32

    Writing to utf-32.txt
    File contents:
    fffe0000 70000000 69000000 3a000000 20000000 c0030000

Reading the data with open() is straightforward, with one catch:
you must know the encoding in advance, in order to set up the decoder
correctly. Some data formats, such as XML, let you specify the
encoding as part of the file, but usually it is up to the application
to manage. codecs simply takes the encoding as an argument and
assumes it is correct.

    import codecs
    import sys

    encoding = sys.argv[1]
    filename = encoding + '.txt'

    print 'Reading from', filename
    with codecs.open(filename, mode='rt', encoding=encoding) as f:
        print repr(f.read())

This example reads the files created by the previous program, and
prints the representation of the resulting unicode object to
the console.

    $ python codecs_open_read.py utf-8

    Reading from utf-8.txt
    u'pi: \u03c0'

    $ python codecs_open_read.py utf-16

    Reading from utf-16.txt
    u'pi: \u03c0'

    $ python codecs_open_read.py utf-32

    Reading from utf-32.txt
    u'pi: \u03c0'
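
The write-then-read cycle can be condensed into one Python 3 sketch. It departs from the article in two labeled ways: it uses the built-in `open()`, which in Python 3 accepts an `encoding` argument directly, and it writes to a temporary directory rather than the current one.

```python
import os
import tempfile

text = 'pi: \u03c0'
results = {}

with tempfile.TemporaryDirectory() as tmpdir:
    for encoding in ('utf-8', 'utf-16', 'utf-32'):
        filename = os.path.join(tmpdir, encoding + '.txt')

        # Text is encoded on the way out...
        with open(filename, mode='w', encoding=encoding) as f:
            f.write(text)

        # ...the file on disk holds only bytes...
        with open(filename, mode='rb') as f:
            raw = f.read()

        # ...and it is decoded on the way back in.
        with open(filename, mode='r', encoding=encoding) as f:
            results[encoding] = (len(raw), f.read())

for encoding, (size, decoded) in results.items():
    print(encoding, size, repr(decoded))
```

The byte counts differ per encoding (the UTF-16 and UTF-32 files also begin with a byte-order marker), but the decoded text is identical in every case.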

Byte Order

Multi-byte encodings such as UTF-16 and UTF-32 pose a problem when
transferring the data between different computer systems, either by
copying the file directly or with network communication. Different
systems use different ordering of the high and low order bytes. This
characteristic of the data, known as its endianness, depends on
factors such as the hardware architecture and choices made by the
operating system and application developer. There isn't always a way
to know in advance what byte order to use for a given set of data, so
the multi-byte encodings include a byte-order marker (BOM) as the
first few bytes of encoded output. For example, Unicode reserves the
code point U+FEFF for the marker, and its byte-swapped form 0xFFFE is
guaranteed not to be a valid character, so the order of the first two
bytes of UTF-16 data indicates the byte order. codecs defines constants
for the byte order markers used by UTF-8, UTF-16, and UTF-32.

    import codecs
    from codecs_to_hex import to_hex

    for name in [ 'BOM', 'BOM_BE', 'BOM_LE',
                  'BOM_UTF8',
                  'BOM_UTF16', 'BOM_UTF16_BE', 'BOM_UTF16_LE',
                  'BOM_UTF32', 'BOM_UTF32_BE', 'BOM_UTF32_LE',
                  ]:
        print '{:12} : {}'.format(name, to_hex(getattr(codecs, name), 2))

BOM, BOM_UTF16, and BOM_UTF32 are automatically set to the
appropriate big-endian or little-endian values depending on the
current system's native byte order.

    $ python codecs_bom.py

    BOM          : fffe
    BOM_BE       : feff
    BOM_LE       : fffe
    BOM_UTF8     : efbb bf
    BOM_UTF16    : fffe
    BOM_UTF16_BE : feff
    BOM_UTF16_LE : fffe
    BOM_UTF32    : fffe 0000
    BOM_UTF32_BE : 0000 feff
    BOM_UTF32_LE : fffe 0000

Byte ordering is detected and handled automatically by the decoders in
codecs, but you can also choose an explicit ordering for the
encoding.

    import codecs
    from codecs_to_hex import to_hex

    # Pick the non-native version of UTF-16 encoding
    if codecs.BOM_UTF16 == codecs.BOM_UTF16_BE:
        bom = codecs.BOM_UTF16_LE
        encoding = 'utf_16_le'
    else:
        bom = codecs.BOM_UTF16_BE
        encoding = 'utf_16_be'

    print 'Native order  :', to_hex(codecs.BOM_UTF16, 2)
    print 'Selected order:', to_hex(bom, 2)

    # Encode the text.
    encoded_text = u'pi: \u03c0'.encode(encoding)
    print '{:14}: {}'.format(encoding, to_hex(encoded_text, 2))

    with open('non-native-encoded.txt', mode='wb') as f:
        # Write the selected byte-order marker. It is not included in the
        # encoded text because we were explicit about the byte order when
        # selecting the encoding.
        f.write(bom)
        # Write the byte string for the encoded text.
        f.write(encoded_text)

codecs_bom_create_file.py figures out the native byte ordering,
then uses the alternate form explicitly so the next example can
demonstrate auto-detection while reading.

    $ python codecs_bom_create_file.py

    Native order  : fffe
    Selected order: feff
    utf_16_be     : 0070 0069 003a 0020 03c0

codecs_bom_detection.py does not specify a byte order when opening
the file, so the decoder uses the BOM value in the first two bytes of
the file to determine it.

    import codecs
    from codecs_to_hex import to_hex

    # Look at the raw data
    with open('non-native-encoded.txt', mode='rb') as f:
        raw_bytes = f.read()

    print 'Raw    :', to_hex(raw_bytes, 2)

    # Re-open the file and let codecs detect the BOM
    with codecs.open('non-native-encoded.txt', mode='rt', encoding='utf-16') as f:
        decoded_text = f.read()

    print 'Decoded:', repr(decoded_text)

Since the first two bytes of the file are used for byte order
detection, they are not included in the data returned by read().

    $ python codecs_bom_detection.py

    Raw    : feff 0070 0069 003a 0020 03c0
    Decoded: u'pi: \u03c0'
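
The effect of the BOM on decoding can also be seen without a file. The following Python 3 sketch (an in-memory adaptation, not the article's program) builds big-endian data with an explicit marker and decodes it two ways.

```python
import codecs

text = 'pi: \u03c0'

# Big-endian payload with the matching byte-order marker in front.
payload = codecs.BOM_UTF16_BE + text.encode('utf_16_be')

# The generic 'utf-16' decoder reads the BOM, picks the byte order,
# and drops the marker from the result.
detected = payload.decode('utf-16')

# A decoder with an explicit byte order does not interpret the BOM,
# so it comes through as the character U+FEFF.
explicit = payload.decode('utf_16_be')

print(repr(detected))
print(repr(explicit))
```

This is why the earlier example writes the marker by hand: an encoder with an explicit byte order does not emit one.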

Error Handling

The previous sections pointed out the need to know the encoding being
used when reading and writing Unicode files. Setting the encoding
correctly is important for two reasons. If the encoding is configured
incorrectly while reading from a file, the data will be interpreted
incorrectly and may be corrupted or simply fail to decode. Not all Unicode
characters can be represented in all encodings, so if the wrong
encoding is used while writing, an error will be generated and data may
be lost.

codecs uses the same five error handling options that are
provided by the encode() method of unicode and the
decode() method of str.

strict
    Raises an exception if the data cannot be converted.
replace
    Substitutes a special marker character for data that cannot be encoded.
ignore
    Skips the data.
xmlcharrefreplace
    Substitutes an XML character reference (encoding only).
backslashreplace
    Substitutes a backslash escape sequence (encoding only).

Encoding Errors

The most common error condition is receiving a
UnicodeEncodeError when writing
Unicode data to an ASCII output stream, such as a regular file or
sys.stdout. This sample program can be used
to experiment with the different error handling modes.

    import codecs
    import sys

    error_handling = sys.argv[1]

    text = u'pi: \u03c0'

    try:
        # Save the data, encoded as ASCII, using the error
        # handling mode specified on the command line.
        with codecs.open('encode_error.txt', 'w',
                         encoding='ascii',
                         errors=error_handling) as f:
            f.write(text)

    except UnicodeEncodeError as err:
        print 'ERROR:', err

    else:
        # If there was no error writing to the file,
        # show what it contains.
        with open('encode_error.txt', 'rb') as f:
            print 'File contents:', repr(f.read())

While strict mode is safest for ensuring your application
explicitly sets the correct encoding for all I/O operations, it can
lead to program crashes when an exception is raised.

    $ python codecs_encode_error.py strict

    ERROR: 'ascii' codec can't encode character u'\u03c0' in position 4: ordinal not in range(128)

Some of the other error modes are more flexible. For example,
replace ensures that no error is raised, at the expense of
possibly losing data that cannot be converted to the requested
encoding. The Unicode character for pi still cannot be encoded in
ASCII, but instead of raising an exception the character is replaced
with ? in the output.

    $ python codecs_encode_error.py replace

    File contents: 'pi: ?'

To skip over problem data entirely, use ignore. Any data that
cannot be encoded is simply discarded.

    $ python codecs_encode_error.py ignore

    File contents: 'pi: '

There are two lossless error handling options, both of which replace
the character with an alternate representation defined by a standard
separate from the encoding. xmlcharrefreplace uses an XML
character reference as a substitute (the list of character references
is specified in the W3C XML Entity Definitions for Characters).

    $ python codecs_encode_error.py xmlcharrefreplace

    File contents: 'pi: &#960;'

The other lossless error handling scheme is backslashreplace, which
produces an output format like the value you get when you print the
repr() of a unicode object. Unicode characters are
replaced with \u followed by the hexadecimal value of the code
point.

    $ python codecs_encode_error.py backslashreplace

    File contents: 'pi: \\u03c0'
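
All five modes can be compared side by side in a few lines. The sketch below assumes Python 3 and calls `str.encode()` directly instead of writing a file, so it illustrates the modes without reproducing the article's script.

```python
text = 'pi: \u03c0'
modes = ('strict', 'replace', 'ignore',
         'xmlcharrefreplace', 'backslashreplace')

results = {}
for mode in modes:
    try:
        results[mode] = text.encode('ascii', errors=mode)
    except UnicodeEncodeError as err:
        # Only 'strict' reaches this branch for this input.
        results[mode] = 'ERROR: ' + err.reason

for mode in modes:
    print('{:18}: {!r}'.format(mode, results[mode]))
```

The two lossless modes keep enough information to recover the original character, while replace and ignore do not.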

Decoding Errors

It is also possible to see errors when decoding data, especially if
the wrong encoding is used.

    import codecs
    import sys

    from codecs_to_hex import to_hex

    error_handling = sys.argv[1]

    text = u'pi: \u03c0'
    print 'Original     :', repr(text)

    # Save the data with one encoding
    with codecs.open('decode_error.txt', 'w', encoding='utf-16') as f:
        f.write(text)

    # Dump the bytes from the file
    with open('decode_error.txt', 'rb') as f:
        print 'File contents:', to_hex(f.read(), 1)

    # Try to read the data with the wrong encoding
    with codecs.open('decode_error.txt', 'r',
                     encoding='utf-8',
                     errors=error_handling) as f:
        try:
            data = f.read()
        except UnicodeDecodeError as err:
            print 'ERROR:', err
        else:
            print 'Read         :', repr(data)

As with encoding, strict error handling mode raises an exception
if the byte stream cannot be properly decoded. In this case, a
UnicodeDecodeError results from
trying to convert part of the UTF-16 BOM to a character using the
UTF-8 decoder.

    $ python codecs_decode_error.py strict

    Original     : u'pi: \u03c0'
    File contents: ff fe 70 00 69 00 3a 00 20 00 c0 03
    ERROR: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte

Switching to ignore causes the decoder to skip over the invalid
bytes. The result is still not quite what is expected, though, since
it includes embedded null bytes.

    $ python codecs_decode_error.py ignore

    Original     : u'pi: \u03c0'
    File contents: ff fe 70 00 69 00 3a 00 20 00 c0 03
    Read         : u'p\x00i\x00:\x00 \x00\x03'

In replace mode invalid bytes are replaced with \uFFFD, the
official Unicode replacement character, which looks like a diamond
with a black background containing a white question mark (�).

    $ python codecs_decode_error.py replace

    Original     : u'pi: \u03c0'
    File contents: ff fe 70 00 69 00 3a 00 20 00 c0 03
    Read         : u'\ufffd\ufffdp\x00i\x00:\x00 \x00\ufffd\x03'
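
The three decoding modes can be reproduced in memory as well. This Python 3 sketch builds the same little-endian UTF-16 bytes shown in the output above and decodes them with the wrong codec; it is an illustration under those assumptions, not the article's program.

```python
import codecs

# The same bytes the article's file contains: BOM + little-endian UTF-16.
raw = codecs.BOM_UTF16_LE + 'pi: \u03c0'.encode('utf_16_le')

results = {}
for mode in ('strict', 'ignore', 'replace'):
    try:
        results[mode] = raw.decode('utf-8', errors=mode)
    except UnicodeDecodeError as err:
        results[mode] = 'ERROR: ' + err.reason

for mode, value in results.items():
    print('{:8}: {!r}'.format(mode, value))
```

None of the three results recovers the original text, which is the point: error handlers limit the damage of a wrong encoding but cannot correct it.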

Standard Input and Output Streams

The most common cause of UnicodeEncodeError exceptions is code that tries to print
unicode data to the console or a Unix pipeline when
sys.stdout is not configured with an
encoding.

    # -*- coding: utf-8 -*-
    import codecs
    import sys

    text = u'pi: π'

    # Printing to stdout may cause an encoding error
    print 'Default encoding:', sys.stdout.encoding
    print 'TTY:', sys.stdout.isatty()
    print text

Problems with the default encoding of the standard I/O channels can be
difficult to debug because the program works as expected when the
output goes to the console, but causes encoding errors when it is used
as part of a pipeline and the output includes Unicode characters above
the ASCII range. This difference in behavior is caused by Python's
initialization code, which sets the default encoding for each standard
I/O channel only if the channel is connected to a terminal
(isatty() returns True). If there is no terminal, Python
assumes the program will configure the encoding explicitly, and leaves
the I/O channel alone.

    $ python codecs_stdout.py

    Default encoding: utf-8
    TTY: True
    pi: π

    $ python codecs_stdout.py | cat -

    Default encoding: None
    TTY: False
    Traceback (most recent call last):
      File "codecs_stdout.py", line 18, in <module>
        print text
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u03c0' in
     position 4: ordinal not in range(128)

To explicitly set the encoding on the standard output channel, use
getwriter() to get a stream encoder class for a specific
encoding. Instantiate the class, passing sys.stdout as the only
argument.

    # -*- coding: utf-8 -*-
    import codecs
    import sys

    text = u'pi: π'

    # Wrap sys.stdout with a writer that knows how to handle encoding
    # Unicode data.
    wrapped_stdout = codecs.getwriter('UTF-8')(sys.stdout)
    wrapped_stdout.write(u'Via write: ' + text + '\n')

    # Replace sys.stdout with a writer
    sys.stdout = wrapped_stdout

    print u'Via print:', text

Writing to the wrapped version of sys.stdout passes the Unicode
text through an encoder before sending the encoded bytes to stdout.
Replacing sys.stdout means that any code used by your application
that prints to standard output will be able to take advantage of the
encoding writer.

    $ python codecs_stdout_wrapped.py

    Via write: pi: π
    Via print: pi: π
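
What the writer does to the data is easier to inspect when the wrapped stream is an in-memory buffer. In this Python 3 sketch an `io.BytesIO` object stands in for a byte-oriented standard output; that substitution is an assumption made for illustration.

```python
import codecs
import io

raw_stream = io.BytesIO()  # stand-in for a stream that accepts only bytes

# getwriter() returns a StreamWriter class; calling it wraps the stream.
wrapped = codecs.getwriter('utf-8')(raw_stream)
wrapped.write('Via write: pi: \u03c0\n')

written = raw_stream.getvalue()
print(repr(written))
```

The underlying stream only ever receives encoded bytes, which is what makes the wrapper safe to use in a pipeline.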

The next problem to solve is how to know which encoding should be
used. The proper encoding varies based on location, language, and
user or system configuration, so hard-coding a fixed value is not a
good idea. It would also be annoying for a user to need to pass
explicit arguments to every program setting the input and output
encodings. Fortunately, there is a global way to get a reasonable
default encoding, using locale.

    # -*- coding: utf-8 -*-
    import codecs
    import locale
    import sys

    text = u'pi: π'

    # Configure locale from the user's environment settings.
    locale.setlocale(locale.LC_ALL, '')

    # Wrap stdout with an encoding-aware writer.
    lang, encoding = locale.getdefaultlocale()
    print 'Locale encoding:', encoding
    sys.stdout = codecs.getwriter(encoding)(sys.stdout)

    print 'With wrapped stdout:', text

getdefaultlocale() returns the language and preferred encoding
based on the system and user configuration settings in a form that can
be used with getwriter().

    $ python codecs_stdout_locale.py

    Locale encoding: UTF-8
    With wrapped stdout: pi: π

The encoding also needs to be set up when working with sys.stdin. Use getreader() to get a reader capable of
decoding the input bytes.

    import codecs
    import locale
    import sys

    # Configure locale from the user's environment settings.
    locale.setlocale(locale.LC_ALL, '')

    # Wrap stdin with an encoding-aware reader.
    lang, encoding = locale.getdefaultlocale()
    sys.stdin = codecs.getreader(encoding)(sys.stdin)

    print 'From stdin:', repr(sys.stdin.read())

Reading from the wrapped handle returns unicode objects
instead of str instances.

    $ python codecs_stdout_locale.py | python codecs_stdin.py

    From stdin: u'Locale encoding: UTF-8\nWith wrapped stdout: pi: \u03c0\n'
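
The reader side can be sketched the same way as the writer. Here, assuming Python 3, an `io.BytesIO` buffer holding UTF-8 bytes plays the role of standard input, and the encoding is fixed at UTF-8 rather than taken from the locale.

```python
import codecs
import io

# Pretend these bytes arrived on standard input.
incoming_bytes = io.BytesIO('pi: \u03c0\n'.encode('utf-8'))

# getreader() returns a StreamReader class; calling it wraps the stream.
reader = codecs.getreader('utf-8')(incoming_bytes)
decoded = reader.read()

print(repr(decoded))
```

Everything read through the wrapper is already text, so the rest of the program never has to handle raw bytes.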

Network Communication

Network sockets are also byte-streams, and so Unicode data must be
encoded into bytes before it is written to a socket.

    # -*- coding: utf-8 -*-
    import sys
    import SocketServer


    class Echo(SocketServer.BaseRequestHandler):

        def handle(self):
            # Get some bytes and echo them back to the client.
            data = self.request.recv(1024)
            self.request.send(data)
            return


    if __name__ == '__main__':
        import codecs
        import socket
        import threading

        address = ('localhost', 0) # let the kernel give us a port
        server = SocketServer.TCPServer(address, Echo)
        ip, port = server.server_address # find out what port we were given

        t = threading.Thread(target=server.serve_forever)
        t.setDaemon(True) # don't hang on exit
        t.start()

        # Connect to the server
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect((ip, port))

        # Send the data
        text = u'pi: π'
        len_sent = s.send(text)

        # Receive a response
        response = s.recv(len_sent)
        print repr(response)

        # Clean up
        s.close()
        server.socket.close()

You could encode the data explicitly before sending it, but miss one
call to send() and your program would fail with an encoding
error.

    $ python codecs_socket_fail.py

    Traceback (most recent call last):
      File "codecs_socket_fail.py", line 43, in <module>
        len_sent = s.send(text)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u03c0' in
     position 4: ordinal not in range(128)

By using makefile() to get a file-like handle for the socket,
and then wrapping that with a stream-based reader or writer, you will
be able to pass Unicode strings and know they are encoded on the way
into and out of the socket.

    # -*- coding: utf-8 -*-
    import sys
    import SocketServer


    class Echo(SocketServer.BaseRequestHandler):

        def handle(self):
            # Get some bytes and echo them back to the client. There is
            # no need to decode them, since they are not used.
            data = self.request.recv(1024)
            self.request.send(data)
            return


    class PassThrough(object):

        def __init__(self, other):
            self.other = other

        def write(self, data):
            print 'Writing :', repr(data)
            return self.other.write(data)

        def read(self, size=-1):
            print 'Reading :',
            data = self.other.read(size)
            print repr(data)
            return data

        def flush(self):
            return self.other.flush()

        def close(self):
            return self.other.close()


    if __name__ == '__main__':
        import codecs
        import socket
        import threading

        address = ('localhost', 0) # let the kernel give us a port
        server = SocketServer.TCPServer(address, Echo)
        ip, port = server.server_address # find out what port we were given

        t = threading.Thread(target=server.serve_forever)
        t.setDaemon(True) # don't hang on exit
        t.start()

        # Connect to the server
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect((ip, port))

        # Wrap the socket with a reader and writer.
        incoming = codecs.getreader('utf-8')(PassThrough(s.makefile('r')))
        outgoing = codecs.getwriter('utf-8')(PassThrough(s.makefile('w')))

        # Send the data
        text = u'pi: π'
        print 'Sending :', repr(text)
        outgoing.write(text)
        outgoing.flush()

        # Receive a response
        response = incoming.read()
        print 'Received:', repr(response)

        # Clean up
        s.close()
        server.socket.close()

This example uses PassThrough to show that the data is
encoded before being sent, and the response is decoded after it is
received in the client.

    $ python codecs_socket.py

    Sending : u'pi: \u03c0'
    Writing : 'pi: \xcf\x80'
    Reading : 'pi: \xcf\x80'
    Received: u'pi: \u03c0'
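
The makefile() technique can be tried without a server thread. The Python 3 sketch below swaps the echo server for `socket.socketpair()`, which returns two already-connected sockets in one process; that substitution and the variable names are assumptions made to keep the example short.

```python
import codecs
import socket

# Two connected sockets in one process stand in for client and server.
client, peer = socket.socketpair()

# Wrap binary file handles for the sockets with a writer and a reader.
outgoing = codecs.getwriter('utf-8')(client.makefile('wb'))
incoming = codecs.getreader('utf-8')(peer.makefile('rb'))

text = 'pi: \u03c0'
outgoing.write(text)              # text in, UTF-8 bytes onto the socket
outgoing.flush()
client.shutdown(socket.SHUT_WR)   # signal end of data so read() returns

received = incoming.read()        # UTF-8 bytes off the socket, text out
print(repr(received))

# Clean up
outgoing.close()
incoming.close()
client.close()
peer.close()
```

Because read() with no size waits for the end of the stream, the sending side must be shut down (or closed) before the reader returns.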

Encoding Translation

Although most applications will work with unicode data
internally, decoding or encoding it as part of an I/O operation, there
are times when changing a file's encoding without holding on to that
intermediate data format is useful. EncodedFile() takes an open
file handle using one encoding and wraps it with a class that
translates the data to another encoding as the I/O occurs.

    from codecs_to_hex import to_hex

    import codecs
    from cStringIO import StringIO

    # Raw version of the original data.
    data = u'pi: \u03c0'

    # Manually encode it as UTF-8.
    utf8 = data.encode('utf-8')
    print 'Start as UTF-8   :', to_hex(utf8, 1)

    # Set up an output buffer, then wrap it as an EncodedFile.
    output = StringIO()
    encoded_file = codecs.EncodedFile(output, data_encoding='utf-8',
                                      file_encoding='utf-16')
    encoded_file.write(utf8)

    # Fetch the buffer contents as a UTF-16 encoded byte string
    utf16 = output.getvalue()
    print 'Encoded to UTF-16:', to_hex(utf16, 2)

    # Set up another buffer with the UTF-16 data for reading,
    # and wrap it with another EncodedFile.
    buffer = StringIO(utf16)
    encoded_file = codecs.EncodedFile(buffer, data_encoding='utf-8',
                                      file_encoding='utf-16')

    # Read the UTF-8 encoded version of the data.
    recoded = encoded_file.read()
    print 'Back to UTF-8    :', to_hex(recoded, 1)

This example shows reading from and writing to separate handles
returned by EncodedFile(). No matter whether the handle is used
for reading or writing, the file_encoding always refers to the
encoding in use by the open file handle passed as the first argument,
and the data_encoding value refers to the encoding in use by the data
passing through the read() and write() calls.

    $ python codecs_encodedfile.py

    Start as UTF-8   : 70 69 3a 20 cf 80
    Encoded to UTF-16: fffe 7000 6900 3a00 2000 c003
    Back to UTF-8    : 70 69 3a 20 cf 80
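
A Python 3 rendering of the same idea is sketched here, with `io.BytesIO` taking the place of `cStringIO.StringIO` (which does not exist in Python 3). It is an adaptation for illustration, not the article's listing.

```python
import codecs
import io

utf8 = 'pi: \u03c0'.encode('utf-8')

# Writing: UTF-8 bytes go in, the wrapped buffer receives UTF-16.
output = io.BytesIO()
recoder = codecs.EncodedFile(output, data_encoding='utf-8',
                             file_encoding='utf-16')
recoder.write(utf8)
utf16 = output.getvalue()

# Reading: the buffer holds UTF-16, read() hands back UTF-8 bytes.
source = io.BytesIO(utf16)
recoder = codecs.EncodedFile(source, data_encoding='utf-8',
                             file_encoding='utf-16')
recoded = recoder.read()

print(utf16.hex())
print(recoded.hex())
```

In both directions the caller sees only data_encoding bytes, while the wrapped file only ever holds file_encoding bytes.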

Non-Unicode Encodings

Although most of the earlier examples use Unicode encodings,
codecs can be used for many other data translations. For
example, Python includes codecs for working with base-64, bzip2,
ROT-13, ZIP, and other data formats.

    import codecs
    from cStringIO import StringIO

    buffer = StringIO()
    stream = codecs.getwriter('rot_13')(buffer)

    text = 'abcdefghijklmnopqrstuvwxyz'

    stream.write(text)
    stream.flush()

    print 'Original:', text
    print 'ROT-13  :', buffer.getvalue()

Any transformation that can be expressed as a function taking a single
input argument and returning a byte or Unicode string can be
registered as a codec.

    $ python codecs_rot13.py

    Original: abcdefghijklmnopqrstuvwxyz
    ROT-13  : nopqrstuvwxyzabcdefghijklm
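
In Python 3 the ROT-13 codec is a text-to-text transform, so it is not available through `str.encode()`. A minimal sketch, assuming Python 3, uses the module-level `codecs.encode()` and `codecs.decode()` functions instead of the stream writer shown above.

```python
import codecs

text = 'abcdefghijklmnopqrstuvwxyz'

# rot_13 maps str to str, so it goes through codecs.encode()
# rather than the str.encode() method.
rotated = codecs.encode(text, 'rot_13')
restored = codecs.decode(rotated, 'rot_13')

print('Original:', text)
print('ROT-13  :', rotated)
```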

Using codecs to wrap a data stream provides a simpler interface
than working directly with zlib.

    import codecs
    from cStringIO import StringIO

    from codecs_to_hex import to_hex

    buffer = StringIO()
    stream = codecs.getwriter('zlib')(buffer)

    text = 'abcdefghijklmnopqrstuvwxyz\n' * 50

    stream.write(text)
    stream.flush()

    print 'Original length :', len(text)
    compressed_data = buffer.getvalue()
    print 'ZIP compressed  :', len(compressed_data)

    buffer = StringIO(compressed_data)
    stream = codecs.getreader('zlib')(buffer)

    first_line = stream.readline()
    print 'Read first line :', repr(first_line)

    uncompressed_data = first_line + stream.read()
    print 'Uncompressed    :', len(uncompressed_data)
    print 'Same            :', text == uncompressed_data

Not all of the compression or encoding systems support reading a
portion of the data through the stream interface using
readline() or read() because they need to find the end of
a compressed segment to expand it. If your program cannot hold the
entire uncompressed data set in memory, use the incremental access
features of the compression library instead of codecs.

    $ python codecs_zlib.py

    Original length : 1350
    ZIP compressed  : 48
    Read first line : 'abcdefghijklmnopqrstuvwxyz\n'
    Uncompressed    : 1350
    Same            : True
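
The compression codec is a bytes-to-bytes transform in Python 3. The sketch below, which assumes Python 3 and uses the stateless `codecs.encode()` and `codecs.decode()` functions rather than the stream classes, shows the round trip under the codec's canonical name `zlib_codec`.

```python
import codecs

text = b'abcdefghijklmnopqrstuvwxyz\n' * 50

# zlib_codec maps bytes to bytes, so the input must already be bytes.
compressed = codecs.encode(text, 'zlib_codec')
restored = codecs.decode(compressed, 'zlib_codec')

print('Original length :', len(text))
print('Compressed      :', len(compressed))
print('Same            :', restored == text)
```

The stateless calls need the whole input in memory, which is exactly the limitation the incremental interface addresses.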

Incremental Encoding

Some of the encodings provided, especially bz2 and zlib, may
dramatically change the length of the data stream as they work on it.
For large data sets, these encodings operate better incrementally,
working on one small chunk of data at a time. The
IncrementalEncoder and IncrementalDecoder API is
designed for this purpose.

    import codecs
    import sys

    from codecs_to_hex import to_hex

    text = 'abcdefghijklmnopqrstuvwxyz\n'
    repetitions = 50

    print 'Text length :', len(text)
    print 'Repetitions :', repetitions
    print 'Expected len:', len(text) * repetitions

    # Encode the text several times to build up a large amount of data
    encoder = codecs.getincrementalencoder('bz2')()
    encoded = []

    print
    print 'Encoding:',
    for i in range(repetitions):
        en_c = encoder.encode(text, final=(i == repetitions - 1))
        if en_c:
            print '\nEncoded : {} bytes'.format(len(en_c))
            encoded.append(en_c)
        else:
            sys.stdout.write('.')

    bytes = ''.join(encoded)
    print
    print 'Total encoded length:', len(bytes)
    print

    # Decode the byte string one byte at a time
    decoder = codecs.getincrementaldecoder('bz2')()
    decoded = []

    print 'Decoding:',
    for i, b in enumerate(bytes):
        # The last byte of the encoded data marks the end of the input.
        final = (i + 1) == len(bytes)
        c = decoder.decode(b, final)
        if c:
            print '\nDecoded : {} characters'.format(len(c))
            print 'Decoding:',
            decoded.append(c)
        else:
            sys.stdout.write('.')
    print

    restored = u''.join(decoded)

    print
    print 'Total uncompressed length:', len(restored)

Each time data is passed to the encoder or decoder its internal state
is updated. When the state is consistent (as defined by the codec),
data is returned and the state resets. Until that point, calls to
encode() or decode() will not return any data. When the
last bit of data is passed in, the argument final should be set to
True so the codec knows to flush any remaining buffered data.

    $ python codecs_incremental_bz2.py

    Text length : 27
    Repetitions : 50
    Expected len: 1350

    Encoding: .................................................
    Encoded : 99 bytes

    Total encoded length: 99

    Decoding: ............................................................
    ............................
    Decoded : 1350 characters
    Decoding: ..........

    Total uncompressed length: 1350
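
The buffering behavior is easy to observe with a multi-byte text encoding. This Python 3 sketch uses the UTF-8 incremental decoder instead of bz2 (a deliberate substitution) and feeds it one byte at a time; the decoder returns an empty string until it has a complete character.

```python
import codecs

encoded = 'pi: \u03c0'.encode('utf-8')   # the last character takes two bytes

decoder = codecs.getincrementaldecoder('utf-8')()
pieces = []
for i in range(len(encoded)):
    chunk = encoded[i:i + 1]             # a single byte
    final = (i + 1) == len(encoded)      # True only for the last byte
    pieces.append(decoder.decode(chunk, final))

decoded = ''.join(pieces)
print(pieces)
print(repr(decoded))
```

The empty string in the list marks the call where the decoder held on to the first byte of π and waited for the second.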

Defining Your Own Encoding

Since Python comes with a large number of standard codecs already, it
is unlikely that you will need to define your own. If you do, there
are several base classes in codecs to make the process easier.

The first step is to understand the nature of the transformation
described by the encoding. For example, an "invertcaps" encoding
converts uppercase letters to lowercase, and lowercase letters to
uppercase. Here is a simple definition of an encoding function that
performs this transformation on an input string:

    import string

    def invertcaps(text):
        """Return new string with the case of all letters switched.
        """
        return ''.join( c.upper() if c in string.ascii_lowercase
                        else c.lower() if c in string.ascii_uppercase
                        else c
                        for c in text
                        )

    if __name__ == '__main__':
        print invertcaps('ABC.def')
        print invertcaps('abc.DEF')

In this case, the encoder and decoder are the same function (as with
ROT-13).

    $ python codecs_invertcaps.py

    abc.DEF
    ABC.def

Although it is easy to understand, this implementation is not
efficient, especially for very large text strings. Fortunately,
codecs includes some helper functions for creating character
map based codecs such as invertcaps. A character map encoding is
made up of two dictionaries. The encoding map converts character
values from the input string to byte values in the output and the
decoding map goes the other way. Create your decoding map first,
and then use make_encoding_map() to convert it to an encoding
map. The C functions charmap_encode() and
charmap_decode() use the maps to convert their input data
efficiently.

    import codecs
    import string

    # Map every character to itself
    decoding_map = codecs.make_identity_dict(range(256))

    # Make a list of pairs of ordinal values for the lower and upper case
    # letters
    pairs = zip([ ord(c) for c in string.ascii_lowercase],
                [ ord(c) for c in string.ascii_uppercase])

    # Modify the mapping to convert upper to lower and lower to upper.
    decoding_map.update( dict( (upper, lower) for (lower, upper) in pairs) )
    decoding_map.update( dict( (lower, upper) for (lower, upper) in pairs) )

    # Create a separate encoding map.
    encoding_map = codecs.make_encoding_map(decoding_map)

    if __name__ == '__main__':
        print codecs.charmap_encode('abc.DEF', 'strict', encoding_map)
        print codecs.charmap_decode('abc.DEF', 'strict', decoding_map)
        print encoding_map == decoding_map

Although the encoding and decoding maps for invertcaps are the same,
that may not always be the case. make_encoding_map() detects
situations where more than one input character is encoded to the same
output byte and replaces the encoding value with None to mark the
encoding as undefined.

    $ python codecs_invertcaps_charmap.py

    ('ABC.def', 7)
    (u'ABC.def', 7)
    True
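
The character-map helpers also exist in Python 3, with one visible difference: `charmap_encode()` returns `bytes` and `charmap_decode()` returns `str`. The following sketch rebuilds the maps under that assumption; `zip()` is wrapped in `list()` because in Python 3 it returns a one-shot iterator and the pairs are used twice.

```python
import codecs
import string

# Start from an identity mapping for all 256 byte values.
decoding_map = codecs.make_identity_dict(range(256))

pairs = list(zip([ord(c) for c in string.ascii_lowercase],
                 [ord(c) for c in string.ascii_uppercase]))

# Swap each lowercase ordinal with its uppercase partner.
decoding_map.update({upper: lower for lower, upper in pairs})
decoding_map.update({lower: upper for lower, upper in pairs})

encoding_map = codecs.make_encoding_map(decoding_map)

encoded, consumed = codecs.charmap_encode('abc.DEF', 'strict', encoding_map)
decoded, _ = codecs.charmap_decode(encoded, 'strict', decoding_map)

print(encoded, consumed)
print(decoded)
```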

The character map encoder and decoder support all of the standard
error handling methods described earlier, so you do not need to do any
extra work to comply with that part of the API.

    # -*- coding: utf-8 -*-
    import codecs
    from codecs_invertcaps_charmap import encoding_map

    text = u'pi: π'

    for error in [ 'ignore', 'replace', 'strict' ]:
        try:
            encoded = codecs.charmap_encode(text, error, encoding_map)
        except UnicodeEncodeError as err:
            encoded = str(err)
        print '{:7}: {}'.format(error, encoded)

Because the Unicode code point for π is not in the encoding map,
the strict error handling mode raises an exception.

    $ python codecs_invertcaps_error.py

    ignore : ('PI: ', 5)
    replace: ('PI: ?', 5)
    strict : 'charmap' codec can't encode character u'\u03c0' in position
    4: character maps to <undefined>

After the encoding and decoding maps are defined, you need to set
up a few additional classes and register the encoding.
register() adds a search function to the registry so that when a
user wants to use your encoding codecs can locate it. The
search function must take a single string argument with the name of
the encoding, and return a CodecInfo object if it knows the
encoding, or None if it does not.

    import codecs
    import encodings

    def search1(encoding):
        print 'search1: Searching for:', encoding
        return None

    def search2(encoding):
        print 'search2: Searching for:', encoding
        return None

    codecs.register(search1)
    codecs.register(search2)

    utf8 = codecs.lookup('utf-8')
    print 'UTF-8:', utf8

    try:
        unknown = codecs.lookup('no-such-encoding')
    except LookupError as err:
        print 'ERROR:', err

You can register multiple search functions, and each will be called in
turn until one returns a CodecInfo or the list is exhausted.
The internal search function registered by codecs knows how to
load the standard codecs such as UTF-8 from encodings, so those
names will never be passed to your search function.

    $ python codecs_register.py

    UTF-8: <codecs.CodecInfo object for encoding utf-8 at 0x...>
    search1: Searching for: no-such-encoding
    search2: Searching for: no-such-encoding
    ERROR: unknown encoding: no-such-encoding

The CodecInfo instance returned by the search function tells
codecs how to encode and decode using all of the different
mechanisms supported: stateless, incremental, and stream.
codecs includes base classes that make setting up a character
map encoding easy. This example puts all of the pieces together to
register a search function that returns a CodecInfo instance
configured for the invertcaps codec.

    import codecs

    from codecs_invertcaps_charmap import encoding_map, decoding_map

    # Stateless encoder/decoder

    class InvertCapsCodec(codecs.Codec):
        def encode(self, input, errors='strict'):
            return codecs.charmap_encode(input, errors, encoding_map)

        def decode(self, input, errors='strict'):
            return codecs.charmap_decode(input, errors, decoding_map)

    # Incremental forms

    class InvertCapsIncrementalEncoder(codecs.IncrementalEncoder):
        def encode(self, input, final=False):
            return codecs.charmap_encode(input, self.errors, encoding_map)[0]

    class InvertCapsIncrementalDecoder(codecs.IncrementalDecoder):
        def decode(self, input, final=False):
            return codecs.charmap_decode(input, self.errors, decoding_map)[0]

    # Stream reader and writer

    class InvertCapsStreamReader(InvertCapsCodec, codecs.StreamReader):
        pass

    class InvertCapsStreamWriter(InvertCapsCodec, codecs.StreamWriter):
        pass

    # Register the codec search function

    def find_invertcaps(encoding):
        """Return the codec for 'invertcaps'.
        """
        if encoding == 'invertcaps':
            return codecs.CodecInfo(
                name='invertcaps',
                encode=InvertCapsCodec().encode,
                decode=InvertCapsCodec().decode,
                incrementalencoder=InvertCapsIncrementalEncoder,
                incrementaldecoder=InvertCapsIncrementalDecoder,
                streamreader=InvertCapsStreamReader,
                streamwriter=InvertCapsStreamWriter,
                )
        return None

    codecs.register(find_invertcaps)

    if __name__ == '__main__':

        # Stateless encoder/decoder
        encoder = codecs.getencoder('invertcaps')
        text = 'abc.DEF'
        encoded_text, consumed = encoder(text)
        print 'Encoder converted "{}" to "{}", consuming {} characters'.format(
            text, encoded_text, consumed)

        # Stream writer
        import sys
        writer = codecs.getwriter('invertcaps')(sys.stdout)
        print 'StreamWriter for stdout: ',
        writer.write('abc.DEF')
        print

        # Incremental decoder
        decoder_factory = codecs.getincrementaldecoder('invertcaps')
        decoder = decoder_factory()
        decoded_text_parts = []
        for c in encoded_text:
            decoded_text_parts.append(decoder.decode(c, final=False))
        decoded_text_parts.append(decoder.decode('', final=True))
        decoded_text = ''.join(decoded_text_parts)
        print 'IncrementalDecoder converted "{}" to "{}"'.format(
            encoded_text, decoded_text)

The stateless encoder/decoder base class is Codec. Override
encode() and decode() with your implementation (in this
case, calling charmap_encode() and charmap_decode()
respectively). Each method must return a tuple containing the
transformed data and the number of the input bytes or characters
consumed. Conveniently, charmap_encode() and
charmap_decode() already return that information.

IncrementalEncoder and IncrementalDecoder serve as
base classes for the incremental interfaces. The encode() and
decode() methods of the incremental classes are defined in such
a way that they only return the actual transformed data. Any
information about buffering is maintained as internal state. The
invertcaps encoding does not need to buffer data (it uses a one-to-one
mapping). For encodings that produce a different amount of output
depending on the data being processed, such as compression algorithms,
BufferedIncrementalEncoder and
BufferedIncrementalDecoder are more appropriate base classes,
since they manage the unprocessed portion of the input for you.

StreamReader and StreamWriter need encode()
and decode() methods, too, and since they are expected to return
the same value as the version from Codec you can use multiple
inheritance for the implementation.

    $ python codecs_invertcaps_register.py

    Encoder converted "abc.DEF" to "ABC.def", consuming 7 characters
    StreamWriter for stdout: ABC.def
    IncrementalDecoder converted "ABC.def" to "abc.DEF"
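
A reduced Python 3 version of the registration step is sketched below. It registers only the stateless encoder and decoder (the incremental and stream classes are omitted for brevity), rebuilds the maps inline so the block is self-contained, and relies on the fact that in Python 3 the encoder returns `bytes`, which lets `str.encode()` and `bytes.decode()` use the codec by name.

```python
import codecs
import string

# Rebuild the invertcaps maps (same idea as the character-map example).
decoding_map = codecs.make_identity_dict(range(256))
for lower, upper in zip(string.ascii_lowercase, string.ascii_uppercase):
    decoding_map[ord(lower)] = ord(upper)
    decoding_map[ord(upper)] = ord(lower)
encoding_map = codecs.make_encoding_map(decoding_map)


class InvertCapsCodec(codecs.Codec):
    """Stateless encoder/decoder built on the charmap helpers."""

    def encode(self, input, errors='strict'):
        return codecs.charmap_encode(input, errors, encoding_map)

    def decode(self, input, errors='strict'):
        return codecs.charmap_decode(input, errors, decoding_map)


def find_invertcaps(encoding):
    """Search function: return CodecInfo for 'invertcaps', else None."""
    if encoding == 'invertcaps':
        return codecs.CodecInfo(
            name='invertcaps',
            encode=InvertCapsCodec().encode,
            decode=InvertCapsCodec().decode,
        )
    return None


codecs.register(find_invertcaps)

encoded = 'abc.DEF'.encode('invertcaps')
decoded = encoded.decode('invertcaps')
print(encoded, decoded)
```

Registration is global for the interpreter, so a search function should return None quickly for names it does not recognize.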

See also

codecs
    The standard library documentation for this module.
locale
    Accessing and managing the localization-based configuration
    settings and behaviors.
io
    The io module includes file and stream wrappers that
    handle encoding and decoding, too.
SocketServer
    For a more detailed example of an echo server, see the
    SocketServer module.
encodings
    Package in the standard library containing the encoder/decoder
    implementations provided by Python.
Unicode HOWTO
    The official guide for using Unicode with Python 2.x.
Python Unicode Objects
    Fredrik Lundh's article about using non-ASCII character sets
    in Python 2.0.
How to Use UTF-8 with Python
    Evan Jones' quick guide to working with Unicode, including XML
    data and the Byte-Order Marker.
On the Goodness of Unicode
    Introduction to internationalization and Unicode by Tim Bray.
On Character Strings
    A look at the history of string processing in programming
    languages, by Tim Bray.
Characters vs. Bytes
    Part one of Tim Bray's "essay on modern character string
    processing for computer programmers." This installment covers
    in-memory representation of text in formats other than ASCII
    bytes.
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
    An introduction to Unicode by Joel Spolsky.
Endianness
    Explanation of endianness in Wikipedia.

© Copyright Doug Hellmann. | Last updated on Jul 11, 2020. | Created using Sphinx. | Design based on "Leaves" by SmallPark