Handling Unicode - Python 2.7 Tutorial

2025-01-23

文章推薦指數： 80 %

投票人數：10人

Unicode vs. str type. In Python 2, Unicode gets its own type distinct from strings: unicode. A Unicode string is always marked with the u'.. Goto:Na-RaeHan'shomepage Python2.7Tutorial WithVideosbymybringback.com HandlingUnicode PlayAllonYouTube <> Onthispage:unicode,str,hexadecimals,'\x','\u','\U',u'...',hex(),ord(),.encode(),.decode(),codecsmodule,codecs.open() CodePointRepresentationinPython2.7 Incomputing,everycharacterisassignedauniquenumber,calledcodepoint.Forexample,thecapital'M'hasthecodepointof77.Thenumbercanthenhavedifferentrepresentations,dependingonthebase: Letter Base-10(decimal)Base-2(binary,in2bytes)Base-16(hexadecimal,in2bytes) M770000000001001101004D InPython2,'M',thestrtype,canberepresentedinitshexadecimalformbyescapingeachcharacterwith'\x'.Hence,'M'and'\x4D'areidenticalobjects;bothareoftypestr. >>>print'\x4D' M >>>'\x4D' 'M' >>>'\x4D'=='M' True >>>type('\x4D') Youcanofcourselookupthecodepointforanycharacteronline,butPythonhasbuilt-infunctionsyoucanuse.ord()convertsacharactertoitscorrespondingordinalinteger,andhex()convertsanintegertoitshexadecimalrepresentation. >>>ord('M') 77 >>>hex(77) '0x4d' >>>hex(ord('o')) '0x6f' >>>hex(ord('m')) '0x6d' Thehexadecimalcodepointfor'o'is6Fand'm'6D,thereforethestring'Mom'representedinhexadecimallookslikebelow.Notethateverycharacterneedstobeescapedwith'\x'. >>>'\x4D\x6F\x6D' 'Mom' CAUTION:thesehexadecimalstringsarestillofthestrtype:theyarenotUnicode.Unicodestringsarealwaysprefixedwithu'...',whichisexplainedbelow. Unicodevs.strtype InPython2,Unicodegetsitsowntypedistinctfromstrings:unicode.AUnicodestringisalwaysmarkedwiththeu'...'prefix.Itcomesinthreevariants:8-bitwithordinarycharacter,16-bitstartingwiththelowercase'\u'characterprefix,andfinally32-bitstartingwiththeuppercase'\U'prefix: EscapesequenceMeaningExample noneUnicodecharacter,8-bitu'M' \uxxxxUnicodecharacterwith16-bithexvaluexxxxu'\u004D' \UxxxxxxxxUnicodecharacterwith32-bithexvaluexxxxxxxxu'\U0000004D' Firstofall,belowdemonstrateshow'M'andu'M'aredifferentobjects.The==operator--whichtestsequalityofvalue--returnsTrue,buttheisoperator--whichteststheidentityofobjectsinmemory--returnsFalse.Theymightprintoutthesameandbeconsideredofthesamevalue,buttheyareoftwodifferenttypes:theformerisastring(str)whilethelatterisaUnicodestring(unicode). >>>printu'M' M >>>u'M' u'M' >>>u'M'=='M' True >>>u'M'is'M' False >>>type(u'M') ThethreedifferentUnicoderepresentations,however,aretrulyidentical.Thisisnotsurprising:thereisoneUnicodestandardafterall,theyarejustwrittendifferently. >>>printu'\u004D' M >>>u'\u004D' u'M' >>>u'\U0000004D' u'M' >>>u'M'==u'\U0000004D' True >>>u'M'isu'\U0000004D' True BelowareUnicodestrings'Mom'and'MomandDad'.Notethateach16-bitUnicodecharacterisescaped,while8-bitUnicodecharactersdon'tneedtobe.Youcanmixthemupwithinonestring: >>>u'\u004D\u006F\u006D' u'Mom' >>>u'\u004D\u006F\u006DandDad' u'MomandDad' Conversion .encode()and.decode()arethepairofmethodsusedtoconvertbetweentheUnicodeandthestringtypes.Butbecarefulaboutthedirection:.encode()isusedtoturnaUnicodestringintoaregularstring,and.decode()workstheotherway.Manypeoplefindthiscounter-intuitive.Inadditiontothetwomethods,thetypenamesdoubleupastypeconversionfunctions,asusual: >>>u'M'.encode() 'M' >>>'M'.decode() u'M' >>>unicode('M') u'M' >>>str(u'M') 'M' ReadingandWritingUnicodeFiles TheusualopenmethodwehavebeenusingforfileI/Ohandlestextasthestrtypeonly.Therefore,toreadandwriteUnicode-encodedfiles,weneedafilemethodthatiscapableofhandlingUnicode:thecodecsmoduleprovidesitsowncodecs.open()methodthatletsuserspecifytheencodingscheme.BelowexampleusesthisfilecontainingKoreantextinUTF-8encoding: >>>importcodecs >>>f=codecs.open('Korean-UTF8.txt',encoding='utf-8') >>>lines=f.readlines() >>>f.close() >>>lines[0] u'\uaf43\r\n' >>>printlines[0], 꽃 >>>lines[5] u'\ud558\ub098\uc758\ubab8\uc9d3\uc5d0\uc9c0\ub098\uc9c0\uc54a\uc558\ub2e4.\r\n' >>>printlines[5], 하나의몸짓에지나지않았다. >>> Internally,thestringsarestoredasUnicodestrings;printdisplaysthecharactersinthemorerecognizableform.NotethatprintingwillworkonlyifyouhavetheKoreanfontsinstalledonyourmachine. Forwriting,yousupplythe'w'parameterwithyourcodecs.open()method.Afterthat,youcanusetheusual.write()and.writelines()methodsaslongasyousupplythemwithUnicodestrings: >>>f=codecs.open('foo.txt','w',encoding='utf-8') >>>f.write(u'Helloworld!\n') >>>f.close() Asamatteroffact,codecscanhandleallkindsofencoding,notjustUnicode:youcanuse'ascii'forASCII,'latin_1'foriso-8859-1,etc.ThelistofstandardencodingsusedinPython2canbefoundonthispage.

請為這篇文章評分？

延伸文章資訊

Convert Unicode code point and character to each other (chr ...

In Python, the built-in functions chr() and ord() are used to convert between Unicode code points...

unicode - Python Reference (The Right Way) - Read the Docs

If encoding and/or errors are given, unicode() will decode the object which can either be an 8-bi...

Unicode in Python 2

If it's on disk or on a network, it's bytes; Python provides some ... Unicode strings are sequenc...

Unicode & Character Encodings in Python: A Painless Guide

Encoding and Decoding in Python 3 ... Python 3's str type is meant to represent human-readable te...

Unicode — pysheeet

In Python 2, strings are represented in bytes, not Unicode. Python provides different types of st...

Handling Unicode - Python 2.7 Tutorial

文章推薦指數： 80 %

請為這篇文章評分？

延伸文章資訊

最新文章

相關網站資訊

中日口譯課程

中國生產力中心口譯評價

紙的應用