Handling Unicode - Python 2.7 Tutorial


Python 2.7 Tutorial, with videos by mybringback.com

On this page: unicode, str, hexadecimals, '\x', '\u', '\U', u'...', hex(), ord(), .encode(), .decode(), codecs module, codecs.open()

Code Point Representation in Python 2.7

In computing, every character is assigned a unique number, called its code point. For example, the capital letter 'M' has the code point 77. The number can then have different representations, depending on the base:

    Letter   Base-10 (decimal)   Base-2 (binary, in 2 bytes)   Base-16 (hexadecimal, in 2 bytes)
    M        77                  00000000 01001101             00 4D

In Python 2, 'M', which is of the str type, can be written in its hexadecimal form by escaping each character with '\x'. Hence, 'M' and '\x4D' are the same string; both are of type str.

    >>> print '\x4D'
    M
    >>> '\x4D'
    'M'
    >>> '\x4D' == 'M'
    True
    >>> type('\x4D')
    <type 'str'>

You can of course look up the code point for any character online, but Python has built-in functions you can use. ord() converts a character to its corresponding ordinal integer, and hex() converts an integer to its hexadecimal representation.

    >>> ord('M')
    77
    >>> hex(77)
    '0x4d'
    >>> hex(ord('o'))
    '0x6f'
    >>> hex(ord('m'))
    '0x6d'

The hexadecimal code point for 'o' is 6F and for 'm' it is 6D, so the string 'Mom' written in hexadecimal looks like the example below. Note that every character needs to be escaped with '\x'.

    >>> '\x4D\x6F\x6D'
    'Mom'

CAUTION: these hexadecimal strings are still of the str type: they are not Unicode. Unicode strings are always prefixed with u'...', which is explained below.
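The code point table and the '\x' escapes above can also be produced programmatically. The following is a minimal sketch, not part of the original tutorial: the helper names describe_code_point and to_hex_escapes are hypothetical, and the second helper assumes ASCII-only input so that every character fits in a two-digit '\x' escape. It is written to run on Python 2.7 as well as Python 3.

```python
from __future__ import print_function


def describe_code_point(char):
    """Return the decimal, 16-bit binary, and 4-digit hex forms of one character.

    Hypothetical helper for illustration only.
    """
    code_point = ord(char)
    return {
        'decimal': code_point,
        'binary': format(code_point, '016b'),  # zero-padded to 2 bytes
        'hex': format(code_point, '04X'),      # zero-padded to 2 bytes
    }


def to_hex_escapes(text):
    """Spell out an ASCII string as a sequence of backslash-x escapes.

    Hypothetical helper; assumes every character has a code point below 256.
    """
    return ''.join('\\x{0:02X}'.format(ord(ch)) for ch in text)


info = describe_code_point('M')
print(info['decimal'], info['binary'], info['hex'])
print(to_hex_escapes('Mom'))
```

Pasting the second printed line between quotes at the interpreter prompt gives back 'Mom', which is the same round trip the tutorial performs by hand.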
Unicode vs. str type

In Python 2, Unicode gets its own type distinct from strings: unicode. A Unicode string literal is always marked with the u'...' prefix. A character inside it can be written in three ways: as an ordinary character, as a 16-bit escape starting with the lowercase '\u' prefix, or as a 32-bit escape starting with the uppercase '\U' prefix:

    Escape sequence   Meaning                                                Example
    none              Unicode character, typed as an ordinary character     u'M'
    \uxxxx            Unicode character with 16-bit hex value xxxx           u'\u004D'
    \Uxxxxxxxx        Unicode character with 32-bit hex value xxxxxxxx       u'\U0000004D'

First of all, the session below demonstrates that 'M' and u'M' are different objects. The == operator, which tests equality of value, returns True, but the is operator, which tests the identity of objects in memory, returns False. They might print out the same and be considered to have the same value, but they are of two different types: the former is a string (str) while the latter is a Unicode string (unicode).

    >>> print u'M'
    M
    >>> u'M'
    u'M'
    >>> u'M' == 'M'
    True
    >>> u'M' is 'M'
    False
    >>> type(u'M')
    <type 'unicode'>

The three different Unicode representations, however, are truly identical. This is not surprising: there is one Unicode standard after all, and they are just written differently.
    >>> print u'\u004D'
    M
    >>> u'\u004D'
    u'M'
    >>> u'\U0000004D'
    u'M'
    >>> u'M' == u'\U0000004D'
    True
    >>> u'M' is u'\U0000004D'
    True

The True result from is reflects how the CPython interpreter reuses single-character objects; == is the reliable test that two literals spell the same string.

Below are the Unicode strings 'Mom' and 'Mom and Dad'. Note that characters written as 16-bit escapes each carry their own '\u' prefix, while ordinary characters are typed as they are. You can mix the two within one string:

    >>> u'\u004D\u006F\u006D'
    u'Mom'
    >>> u'\u004D\u006F\u006D and Dad'
    u'Mom and Dad'

Conversion

.encode() and .decode() are the pair of methods used to convert between the Unicode and the string types. But be careful about the direction: .encode() is used to turn a Unicode string into a regular string, and .decode() works the other way. Many people find this counter-intuitive. In addition to the two methods, the type names double up as type conversion functions, as usual. Called without an encoding argument, all four use Python 2's default encoding, ASCII, so text containing non-ASCII characters needs an explicit encoding such as 'utf-8'.

    >>> u'M'.encode()
    'M'
    >>> 'M'.decode()
    u'M'
    >>> unicode('M')
    u'M'
    >>> str(u'M')
    'M'

Reading and Writing Unicode Files

The usual open method we have been using for file I/O handles text as the str type only. Therefore, to read and write Unicode-encoded files, we need a file method that is capable of handling Unicode: the codecs module provides its own codecs.open() method that lets the user specify the encoding scheme. The example below uses a file containing Korean text in UTF-8 encoding:

    >>> import codecs
    >>> f = codecs.open('Korean-UTF8.txt', encoding='utf-8')
    >>> lines = f.readlines()
    >>> f.close()
    >>> lines[0]
    u'\uaf43\r\n'
    >>> print lines[0],
    꽃
    >>> lines[5]
    u'\ud558\ub098\uc758 \ubab8\uc9d3\uc5d0 \uc9c0\ub098\uc9c0 \uc54a\uc558\ub2e4.\r\n'
    >>> print lines[5],
    하나의 몸짓에 지나지 않았다.
    >>>

Internally, the strings are stored as Unicode strings; print displays the characters in the more recognizable form. Note that printing will work only if you have Korean fonts installed on your machine.

For writing, you supply the 'w' parameter to the codecs.open() method. After that, you can use the usual .write() and .writelines() methods as long as you supply them with Unicode strings:

    >>> f = codecs.open('foo.txt', 'w', encoding='utf-8')
    >>> f.write(u'Hello world!\n')
    >>> f.close()

As a matter of fact, codecs can handle all kinds of encodings, not just UTF-8: you can use 'ascii' for ASCII, 'latin_1' for ISO-8859-1, etc. The list of standard encodings available in Python 2 can be found in the documentation for the codecs module.
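To tie the conversion methods to the file functions, the sketch below writes a Unicode string with codecs.open(), reads it back, and compares the bytes on disk with the result of an explicit .encode('utf-8'). It is an illustration under stated assumptions rather than part of the original tutorial: the sample text reuses the code point shown for lines[0] above, the file name roundtrip.txt is hypothetical, and a temporary directory is used so that no existing file is touched. The code is written to run on Python 2.7 as well as Python 3.

```python
from __future__ import print_function

import codecs
import os
import tempfile

# Sample text (assumption for illustration): the character seen in lines[0].
text = u'\uaf43\n'

# Explicit conversion: Unicode string -> UTF-8 bytes -> Unicode string.
encoded = text.encode('utf-8')
decoded = encoded.decode('utf-8')

# Hypothetical file name inside a fresh temporary directory.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'roundtrip.txt')

# Write the Unicode string; codecs.open() encodes it on the way out.
with codecs.open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Read it back; codecs.open() decodes the bytes into a Unicode string.
with codecs.open(path, encoding='utf-8') as f:
    read_back = f.read()

# Read the raw bytes with the ordinary open() to see what is stored on disk.
with open(path, 'rb') as f:
    raw = f.read()

print(read_back == text)   # the text survives the round trip
print(raw == encoded)      # the file holds the same bytes .encode() produced
print(len(text), len(raw)) # characters in memory vs. bytes on disk

# Clean up the temporary file and directory.
os.remove(path)
os.rmdir(tmp_dir)
```

The last printed line shows why the direction of .encode() and .decode() matters: two characters in the Unicode string become four bytes in the UTF-8 file.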


