Unicode in Python 2

A quick run-down of Unicode, its use in Python 2, and some of the gotchas that arise.

- Chris Barker

History

A bit about where all this mess came from...

What the heck is Unicode anyway?

First there was chaos...
- Different machines used different encodings.

Then there was ASCII, and all was good (7 bit, 128 characters), for English speakers anyway.
- But each vendor used the top half (128-255) for different things: MacRoman, Windows 1252, etc...
- There is now "latin-1", but there are still a lot of old files around.

Non-Western European languages required totally incompatible 1-byte encodings.
- No way to mix languages with different alphabets.

Enter Unicode

The Unicode idea is pretty simple:
- one "code point" for all characters in all languages

But how do you express that in bytes?
- Early days: we can fit all the code points in a two-byte integer (65536 characters).
- Turns out that didn't work: you now need a 32-bit integer to hold all of Unicode "raw" (UCS-4).

Enter "encodings":
- An encoding is a way to map specific bytes to a code point.
- Each code point can be represented by one or more bytes.

Unicode

A good start:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
http://www.joelonsoftware.com/articles/Unicode.html

Everything is Bytes
- If it's on disk or on a network, it's bytes.
- Python provides some abstractions to make it easier to deal with bytes.

Unicode is a biggie
(actually, dealing with numbers rather than bytes is big, but we take that for granted)

Mechanics

What are strings?

Py2 strings are sequences of bytes.
Unicode strings are sequences of platonic characters.
- It's almost one code point per character, but there are complications with combined characters: accents, etc.
Platonic characters cannot be written to disk or network!
(ANSI: one character == one byte, so easy!)

Strings vs unicode

Python 2 has two types that let you work with text:
- str
- unicode

And two ways to work with binary data:
- str
- bytes() (and bytearray)

but:

    In [86]: str is bytes
    Out[86]: True

bytes is there for py3 compatibility, but it's good for making your intentions clear, too.

Unicode

The unicode object lets you work with characters.
It has all the same methods as the string object.
- "encoding" is converting from a unicode object to bytes
- "decoding" is converting from bytes to a unicode object
(sometimes this feels backwards...)
And it can get even more confusing with py2 strings being both text and bytes!

Using unicode in Py2

Built-in functions:
- ord()
- chr()
- unichr()
- str()
- unicode()

The codecs module:

    import codecs
    codecs.encode()
    codecs.decode()
    codecs.open()  # better to use ``io.open``

Encoding and Decoding

Encoding: text to bytes. You get a bytes (str) object.

    In [17]: u"this".encode('utf-8')
    Out[17]: 'this'

    In [18]: u"this".encode('utf-16')
    Out[18]: '\xff\xfet\x00h\x00i\x00s\x00'

Decoding: bytes to text. You get a unicode object.

    In [2]: text = '\xff\xfe."+"x\x00\xb2\x00'.decode('utf-16')
    In [3]: type(text)
    Out[3]: unicode
    In [4]: print text
    ∮∫x²

Unicode Literals

Use unicode in your source files:

    # -*- coding: utf-8 -*-

or escape the unicode characters:

    print u"The integral sign: \u222B"
    print u"The integral sign: \N{integral}"

Lots of tables of code points online. One example: http://inamidst.com/stuff/unidata/

hello_unicode.py.

Using Unicode

Use unicode objects in all your code:
- Decode on input
- Encode on output
- Many packages do this for you: XML processing, databases, ...
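The "decode on input, encode on output" rule can be sketched as follows. This is only an illustration, not code from the course: `shout` is a made-up processing step, the sample bytes are assumed to be UTF-8, and the snippet sticks to constructs (`b''` and `u''` literals, `io.open`) that behave the same on Python 2.7 and Python 3.

```python
import io
import os
import tempfile


def shout(text):
    """Hypothetical processing step: it only ever sees text, never raw bytes."""
    return text.upper()


# Input arrives as bytes (disk, socket, ...): decode once, at the edge.
raw_in = b'caf\xc3\xa9 \xe2\x88\xab'   # assumed to be UTF-8 encoded
text = raw_in.decode('utf-8')          # now a unicode (text) object

# Inside the program, everything is text.
result = shout(text)

# Encode once, on the way out.
raw_out = result.encode('utf-8')

# io.open applies the same decode/encode at the file boundary.
path = os.path.join(tempfile.mkdtemp(), 'out.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(result)
with io.open(path, encoding='utf-8') as f:
    round_tripped = f.read()
```

Keeping the conversions at the edges means the default encoding never gets a chance to kick in halfway through the program.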
Gotcha:
Python has a default encoding (usually ascii)

    In [1]: import sys
    In [2]: sys.getdefaultencoding()
    Out[2]: 'ascii'

The default encoding will get used in unexpected places!

Using unicode everywhere

Python 2.6 and above have a nice feature to make it easier to use unicode everywhere:

    from __future__ import unicode_literals

After running that line, the u'' is assumed:

    In [1]: s = "this is a regular py2 string"
    In [2]: print type(s)
    <type 'str'>
    In [3]: from __future__ import unicode_literals
    In [4]: s = "this is now a unicode string"
    In [5]: type(s)
    Out[5]: unicode

NOTE: You can still get py2 strings from other sources!

Encodings

What encoding should I use???

There are a lot:
http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings

But only a couple you are likely to need:
- utf-8 (*nix)
- utf-16 (Windows)

and of course, still the one-byte ones:
- ASCII
- Latin-1

UTF-8

Probably the one you'll use most: most common in Internet protocols (XML, JSON, etc.)

Nice properties:
- ASCII compatible: the first 128 characters are the same
- Any ascii string is a utf-8 string
- Compact for mostly-English text.

Gotchas:
- "Higher" code points may use more than one byte: up to 4 for one character
- ASCII compatible means it may work with the default encoding in tests, but then blow up with real data...

UTF-16

Kind of like UTF-8, except it uses at least 16 bits (2 bytes) for each character: not ASCII compatible.
But it still needs more than two bytes for some code points, so you still can't process it assuming a fixed number of bytes per character.
In C/C++ it is held in a "wide char" or "wide string".
MS Windows uses UTF-16, as does (I think) Java.

UTF-16 criticism

There is a lot of criticism on the net about UTF-16. It's kind of the worst of both worlds:
- You can't assume every character is the same number of bytes
- It takes up more memory than UTF-8

UTF Considered Harmful

But to be fair:
Early versions of Unicode: everything fit into two bytes (65536 code points). MS and Java were fairly early adopters, and it seemed simple enough to just use 2 bytes per character.
When it turned out that 4 bytes were really needed, they were kind of stuck in the middle.

Latin-1

NOT Unicode: a 1-byte per char encoding.
- Superset of ASCII suitable for Western European languages.
- The most common one-byte per char encoding for European text.
- Nice property: every byte value from 0 to 255 is a valid character (at least in Python)
- You will never get a UnicodeDecodeError if you try to decode arbitrary bytes with latin-1.
- And it can "round-trip" through a unicode object.
- Useful if you don't know the encoding: at least it won't raise an Exception
- Useful if you need to work with combined text + binary data.

latin1_test.py.

Unicode Docs

Python Docs Unicode HowTo:
http://docs.python.org/howto/unicode.html

"Reading Unicode from a file is therefore simple"

use io.open:

    import io
    f = io.open('unicode.rst', encoding='utf-8')
    for line in f:
        print repr(line)

(https://docs.python.org/2/library/io.html#module-interface)

Encodings built in to Python:
http://docs.python.org/2/library/codecs.html#standard-encodings

Gotchas in Python 2

File names, etc:
If you pass in unicode, you get unicode

    In [9]: os.listdir('./')
    Out[9]: ['hello_unicode.py', 'text.utf16', 'text.utf32']

    In [10]: os.listdir(u'./')
    Out[10]: [u'hello_unicode.py', u'text.utf16', u'text.utf32']

Python deals with the file system encoding for you...
But: some more obscure calls don't support unicode filenames:
os.statvfs() (http://bugs.python.org/issue18695)

Exception messages:
- Py2 Exceptions use str when they print messages.
- But what if you pass in a unicode object?
- It is encoded with the default encoding.
- UnicodeEncodeError inside an Exception????
- NOPE: it swallows it instead.

exception_test.py.

Unicode in Python 3

The "string" object is unicode.
Py3 has two distinct concepts:
- "text": uses the str object (which is always unicode!)
- "binary data": uses bytes or bytearray

Everything that's about text is unicode.
Everything that requires binary data uses bytes.
It's all much cleaner.
(by the way, the recent implementations are very efficient...)

Exercises

Basic Unicode LAB

Find some nifty non-ascii characters you might use.
Create a unicode object with them in two different ways.
here is one example

Read the contents into unicode objects:
- ICanEatGlass.utf8.txt
- ICanEatGlass.utf16.txt

and/or
- text.utf8
- text.utf16
- text.utf32

Write some of the text from the first exercise to a file, then read that file back in.
reference: http://inamidst.com/stuff/unidata/

NOTE: if your terminal does not support unicode, you'll get an error trying to print. Try a different terminal or IDE, or google for a solution.

Challenge Unicode LAB

We saw this earlier:

    In [38]: u'to \N{INFINITY} and beyond!'.decode('utf-8')
    ---------------------------------------------------------------------------
    UnicodeEncodeError                        Traceback (most recent call last)
    in <module>()
    ----> 1 u'to \N{INFINITY} and beyond!'.decode('utf-8')

    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.pyc in decode(input, errors)
         14
         15 def decode(input, errors='strict'):
    ---> 16     return codecs.utf_8_decode(input, errors, True)
         17
         18 class IncrementalEncoder(codecs.IncrementalEncoder):

    UnicodeEncodeError: 'ascii' codec can't encode character u'\u221e' in position 3: ordinal not in range(128)

But why would you decode a unicode object?
And it should be a no-op, so why the exception?
And why 'ascii'? I specified 'utf-8'!

It's there for backward compatibility.
What's happening under the hood:

    u'to \N{INFINITY} and beyond!'.encode().decode('utf-8')

It encodes with the default encoding (ascii), then decodes.
In this case, it barfs on attempting to encode to 'ascii'.

So never call decode on a unicode object!

But what if someone passes one into a function of yours that's expecting a py2 string?
Type checking and converting: yeach!

Read:
http://axialcorps.com/2014/03/20/unicode-str/

See if you can figure out the decorators:
unicodify.py.

(This is advanced Python JuJu: aren't you glad I didn't ask you to write that yourself?)
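For the "type checking and converting" problem above, a much simpler approach than decorators is a small conversion helper. The sketch below is an assumption-laden illustration and is not the contents of unicodify.py: `to_text` is a hypothetical name, UTF-8 is an assumed default, and the code is written to run on both Python 2.7 and Python 3.

```python
# type(u'') is `unicode` on Python 2 and `str` on Python 3.
text_type = type(u'')


def to_text(value, encoding='utf-8'):
    """Hypothetical helper: return text, decoding only when given bytes."""
    if isinstance(value, text_type):
        # Already text: calling .decode() here is what triggers the
        # implicit ascii encode shown in the traceback above.
        return value
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).decode(encoding)
    raise TypeError('expected text or bytes, got %r' % type(value))


already_text = to_text(u'to \N{INFINITY} and beyond!')
from_bytes = to_text(b'to \xe2\x88\x9e and beyond!')
```

Both calls end up with the same unicode object, and neither one ever goes through the default encoding.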


