Unicode vs. str type. In Python 2, Unicode gets its own type distinct from strings: unicode. A Unicode string is always marked with the u'..
Goto:Na-RaeHan'shomepage
Python2.7Tutorial
WithVideosbymybringback.com
HandlingUnicode
PlayAllonYouTube
<>
Onthispage:unicode,str,hexadecimals,'\x','\u','\U',u'...',hex(),ord(),.encode(),.decode(),codecsmodule,codecs.open()
CodePointRepresentationinPython2.7
Incomputing,everycharacterisassignedauniquenumber,calledcodepoint.Forexample,thecapital'M'hasthecodepointof77.Thenumbercanthenhavedifferentrepresentations,dependingonthebase:
Letter
Base-10(decimal)Base-2(binary,in2bytes)Base-16(hexadecimal,in2bytes)
M770000000001001101004D
InPython2,'M',thestrtype,canberepresentedinitshexadecimalformbyescapingeachcharacterwith'\x'.Hence,'M'and'\x4D'areidenticalobjects;bothareoftypestr.
>>>print'\x4D'
M
>>>'\x4D'
'M'
>>>'\x4D'=='M'
True
>>>type('\x4D')
Youcanofcourselookupthecodepointforanycharacteronline,butPythonhasbuilt-infunctionsyoucanuse.ord()convertsacharactertoitscorrespondingordinalinteger,andhex()convertsanintegertoitshexadecimalrepresentation.
>>>ord('M')
77
>>>hex(77)
'0x4d'
>>>hex(ord('o'))
'0x6f'
>>>hex(ord('m'))
'0x6d'
Thehexadecimalcodepointfor'o'is6Fand'm'6D,thereforethestring'Mom'representedinhexadecimallookslikebelow.Notethateverycharacterneedstobeescapedwith'\x'.
>>>'\x4D\x6F\x6D'
'Mom'
CAUTION:thesehexadecimalstringsarestillofthestrtype:theyarenotUnicode.Unicodestringsarealwaysprefixedwithu'...',whichisexplainedbelow.
Unicodevs.strtype
InPython2,Unicodegetsitsowntypedistinctfromstrings:unicode.AUnicodestringisalwaysmarkedwiththeu'...'prefix.Itcomesinthreevariants:8-bitwithordinarycharacter,16-bitstartingwiththelowercase'\u'characterprefix,andfinally32-bitstartingwiththeuppercase'\U'prefix:
EscapesequenceMeaningExample
noneUnicodecharacter,8-bitu'M'
\uxxxxUnicodecharacterwith16-bithexvaluexxxxu'\u004D'
\UxxxxxxxxUnicodecharacterwith32-bithexvaluexxxxxxxxu'\U0000004D'
Firstofall,belowdemonstrateshow'M'andu'M'aredifferentobjects.The==operator--whichtestsequalityofvalue--returnsTrue,buttheisoperator--whichteststheidentityofobjectsinmemory--returnsFalse.Theymightprintoutthesameandbeconsideredofthesamevalue,buttheyareoftwodifferenttypes:theformerisastring(str)whilethelatterisaUnicodestring(unicode).
>>>printu'M'
M
>>>u'M'
u'M'
>>>u'M'=='M'
True
>>>u'M'is'M'
False
>>>type(u'M')
ThethreedifferentUnicoderepresentations,however,aretrulyidentical.Thisisnotsurprising:thereisoneUnicodestandardafterall,theyarejustwrittendifferently.
>>>printu'\u004D'
M
>>>u'\u004D'
u'M'
>>>u'\U0000004D'
u'M'
>>>u'M'==u'\U0000004D'
True
>>>u'M'isu'\U0000004D'
True
BelowareUnicodestrings'Mom'and'MomandDad'.Notethateach16-bitUnicodecharacterisescaped,while8-bitUnicodecharactersdon'tneedtobe.Youcanmixthemupwithinonestring:
>>>u'\u004D\u006F\u006D'
u'Mom'
>>>u'\u004D\u006F\u006DandDad'
u'MomandDad'
Conversion
.encode()and.decode()arethepairofmethodsusedtoconvertbetweentheUnicodeandthestringtypes.Butbecarefulaboutthedirection:.encode()isusedtoturnaUnicodestringintoaregularstring,and.decode()workstheotherway.Manypeoplefindthiscounter-intuitive.Inadditiontothetwomethods,thetypenamesdoubleupastypeconversionfunctions,asusual:
>>>u'M'.encode()
'M'
>>>'M'.decode()
u'M'
>>>unicode('M')
u'M'
>>>str(u'M')
'M'
ReadingandWritingUnicodeFiles
TheusualopenmethodwehavebeenusingforfileI/Ohandlestextasthestrtypeonly.Therefore,toreadandwriteUnicode-encodedfiles,weneedafilemethodthatiscapableofhandlingUnicode:thecodecsmoduleprovidesitsowncodecs.open()methodthatletsuserspecifytheencodingscheme.BelowexampleusesthisfilecontainingKoreantextinUTF-8encoding:
>>>importcodecs
>>>f=codecs.open('Korean-UTF8.txt',encoding='utf-8')
>>>lines=f.readlines()
>>>f.close()
>>>lines[0]
u'\uaf43\r\n'
>>>printlines[0],
꽃
>>>lines[5]
u'\ud558\ub098\uc758\ubab8\uc9d3\uc5d0\uc9c0\ub098\uc9c0\uc54a\uc558\ub2e4.\r\n'
>>>printlines[5],
하나의몸짓에지나지않았다.
>>>
Internally,thestringsarestoredasUnicodestrings;printdisplaysthecharactersinthemorerecognizableform.NotethatprintingwillworkonlyifyouhavetheKoreanfontsinstalledonyourmachine.
Forwriting,yousupplythe'w'parameterwithyourcodecs.open()method.Afterthat,youcanusetheusual.write()and.writelines()methodsaslongasyousupplythemwithUnicodestrings:
>>>f=codecs.open('foo.txt','w',encoding='utf-8')
>>>f.write(u'Helloworld!\n')
>>>f.close()
Asamatteroffact,codecscanhandleallkindsofencoding,notjustUnicode:youcanuse'ascii'forASCII,'latin_1'foriso-8859-1,etc.ThelistofstandardencodingsusedinPython2canbefoundonthispage.