Unicode

The main goal of this cheat sheet is to collect some common snippets which are
related to Unicode. In Python 3, strings are represented by Unicode instead of
bytes. Further information can be found in PEP 3100.

ASCII is the best-known standard that defines numeric codes for characters.
It originally defined only 128 characters, so ASCII only contains control
codes, digits, lowercase letters, uppercase letters, etc. However, that is not
enough to represent the characters used around the world, such as accented
characters, Chinese characters, or emoji. Therefore, Unicode was developed to
solve this issue. Like ASCII, it assigns a code point to each character, but it
has room for up to 1,111,998 characters.
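
The figure of 1,111,998 can be reproduced with a little arithmetic. The sketch below assumes the usual breakdown of the Unicode code space, in which the 2,048 surrogate code points and the 66 noncharacters are excluded; the variable names are illustrative only.

```python
import sys

# The Unicode code space runs from U+0000 to U+10FFFF.
total_code_points = sys.maxunicode + 1  # 0x110000

# Surrogates (U+D800..U+DFFF) are reserved for UTF-16 and never stand for characters.
surrogates = 0xDFFF - 0xD800 + 1

# Noncharacters: U+FDD0..U+FDEF plus the last two code points of each of the 17 planes.
noncharacters = (0xFDEF - 0xFDD0 + 1) + 2 * 17

assignable = total_code_points - surrogates - noncharacters

print(total_code_points)  # 1114112
print(assignable)         # 1111998

# ASCII occupies the first 128 code points; everything else lies beyond it.
print(ord('A'), ord('é'), ord('中'))  # 65 233 20013
```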

Table of Contents

Unicode
    String
    Characters
    Porting unicode(s, 'utf-8')
    Unicode Code Point
    Encoding
    Decoding
    Unicode Normalization
    Avoid UnicodeDecodeError
    Long String

String

In Python 2, strings are represented in bytes, not Unicode. Python provides
different types of strings, such as Unicode strings, raw strings, and so on.
In this case, if we want to declare a Unicode string, we add the u prefix to
string literals.

>>> s = 'Café'  # byte string
>>> s
'Caf\xc3\xa9'
>>> type(s)
<type 'str'>
>>> u = u'Café'  # unicode string
>>> u
u'Caf\xe9'
>>> type(u)
<type 'unicode'>

In Python 3, strings are represented in Unicode. If we want to represent a
byte string, we add the b prefix to string literals. Note that the early
Python 3 versions (3.0-3.2) do not support the u prefix. To ease the pain of
migrating Unicode-aware applications from Python 2, Python 3.3 once again
supports the u prefix for string literals. Further information can be found
in PEP 414.

>>> s = 'Café'
>>> type(s)
<class 'str'>
>>> s
'Café'
>>> s.encode('utf-8')
b'Caf\xc3\xa9'
>>> s.encode('utf-8').decode('utf-8')
'Café'
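
To make the distinction between the two Python 3 types concrete, here is a small sketch. The variable names are illustrative, and it assumes Python 3.3 or later so that the u prefix is accepted.

```python
text = 'Café'          # str: a sequence of Unicode code points
legacy = u'Café'       # the u prefix is accepted again since Python 3.3 (PEP 414)
data = b'Caf\xc3\xa9'  # bytes: the UTF-8 encoding of the same text

print(type(text).__name__, type(data).__name__)  # str bytes
print(text == legacy)                # True: the u prefix changes nothing in Python 3
print(text == data)                  # False: str and bytes never compare equal
print(text.encode('utf-8') == data)  # True
print(data.decode('utf-8') == text)  # True
```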

Characters

Python 2 treats every character in a string as a byte. In this case, the
length of a string may not equal the number of characters. For example,
the length of Café is 5, not 4, because é is encoded as a two-byte
character in UTF-8.

>>> s = 'Café'
>>> print([_c for _c in s])
['C', 'a', 'f', '\xc3', '\xa9']
>>> len(s)
5
>>> s = u'Café'
>>> print([_c for _c in s])
[u'C', u'a', u'f', u'\xe9']
>>> len(s)
4

Python 3 treats every character in a string as a Unicode code point. The
length of a string is always the number of code points it contains, while the
length of its encoded bytes depends on the encoding.

>>> s = 'Café'
>>> print([_c for _c in s])
['C', 'a', 'f', 'é']
>>> len(s)
4
>>> bs = bytes(s, encoding='utf-8')
>>> print(bs)
b'Caf\xc3\xa9'
>>> len(bs)
5
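
The gap between string length and byte length grows with the code point. The following sketch (sample characters chosen only for illustration) prints, for each character, its code point, its length as a str, and the length of its UTF-8 encoding.

```python
# One character each from the 1-, 2-, 3- and 4-byte UTF-8 ranges.
samples = ['a', '\u00e9', '\u4e2d', '\U0001F600']

for ch in samples:
    encoded = ch.encode('utf-8')
    # Every sample is a single code point, so len(ch) is always 1.
    print('U+%04X' % ord(ch), len(ch), len(encoded))

utf8_lengths = [len(ch.encode('utf-8')) for ch in samples]
print(utf8_lengths)  # [1, 2, 3, 4]
```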

Porting unicode(s, 'utf-8')

The unicode() built-in function was removed in Python 3, so what is the best
way to convert the expression unicode(s, 'utf-8') so it works in both
Python 2 and 3?

In Python 2:

>>> s = 'Café'
>>> unicode(s, 'utf-8')
u'Caf\xe9'
>>> s.decode('utf-8')
u'Caf\xe9'
>>> unicode(s, 'utf-8') == s.decode('utf-8')
True

In Python 3:

>>> s = 'Café'
>>> s.decode('utf-8')
AttributeError: 'str' object has no attribute 'decode'

So, the real answer is…
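
The source leaves the answer open. One possible approach, offered here only as an assumption and not as the author's answer, is a small helper that decodes byte strings and passes text through unchanged. The name `to_text` is hypothetical.

```python
def to_text(value, encoding='utf-8'):
    """Return value as text, decoding it first if it is a byte string.

    Hypothetical helper, not part of any library. On Python 2, bytes is an
    alias of str, so plain string literals are decoded; on Python 3 only
    bytes objects are decoded and str objects are returned unchanged.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


print(to_text(b'Caf\xc3\xa9') == u'Caf\xe9')  # True
print(to_text(u'Caf\xe9') == u'Caf\xe9')      # True
```

The check is on the type rather than on the interpreter version, so the same call site can run under either interpreter without a version switch.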

Unicode Code Point

ord is a built-in function that returns the Unicode code point of a given
character. Consequently, if we want to check the Unicode code point of a
character, we can use ord.

>>> s = u'Café'
>>> for _c in s: print('U+%04x' % ord(_c))
...
U+0043
U+0061
U+0066
U+00e9
>>> u = '中文'
>>> for _c in u: print('U+%04x' % ord(_c))
...
U+4e2d
U+6587
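
Going the other way, chr turns a code point back into a character, and the standard unicodedata module can report a character's official name. A minimal Python 3 sketch:

```python
import unicodedata

for ch in 'Café中':
    code_point = ord(ch)
    # chr() maps the integer back to the same one-character string.
    assert chr(code_point) == ch
    print('U+%04X' % code_point, unicodedata.name(ch))

names = [unicodedata.name(ch) for ch in 'é中']
print(names)
# ['LATIN SMALL LETTER E WITH ACUTE', 'CJK UNIFIED IDEOGRAPH-4E2D']
```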

Encoding

Converting a Unicode string (a sequence of code points) to a byte string is
called encoding.

>>> s = u'Café'
>>> type(s.encode('utf-8'))
<class 'bytes'>
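
The bytes produced depend on the codec. As an illustration (the choice of codecs here is arbitrary), the same string can be encoded in several ways, and a codec that has no code for a character raises UnicodeEncodeError.

```python
s = 'Café'

utf8 = s.encode('utf-8')
utf16 = s.encode('utf-16-le')  # little-endian, no byte order mark
latin1 = s.encode('latin-1')

print(utf8)    # b'Caf\xc3\xa9'
print(utf16)   # b'C\x00a\x00f\x00\xe9\x00'
print(latin1)  # b'Caf\xe9'

# ASCII has no code for é, so strict encoding fails.
try:
    s.encode('ascii')
except UnicodeEncodeError as exc:
    print(exc.reason)  # ordinal not in range(128)
```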

Decoding

Converting a byte string back to a Unicode string is called decoding.

>>> s = bytes('Café', encoding='utf-8')
>>> s.decode('utf-8')
'Café'
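
Decoding only recovers the original text when the codec matches the one used for encoding. The sketch below decodes the same UTF-8 bytes with the wrong codec to show the kind of garbled result that follows.

```python
data = 'Café'.encode('utf-8')   # b'Caf\xc3\xa9'

right = data.decode('utf-8')
wrong = data.decode('latin-1')  # each byte is read as one character

print(right)                    # Café
print(wrong)                    # CafÃ©
print(len(right), len(wrong))   # 4 5
```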

Unicode Normalization

Some characters can be represented in two equivalent forms. For example, the
character é can be written as e followed by a combining acute accent,
U+0065 U+0301 (canonical decomposition), or as the single code point U+00E9
(canonical composition). In this case, we may get unexpected results when
comparing two strings even though they look alike. Therefore, we can
normalize both strings to the same Unicode form to solve the issue.

# python 3
>>> u1 = 'Café'  # unicode string
>>> u2 = 'Cafe\u0301'
>>> u1, u2
('Café', 'Café')
>>> len(u1), len(u2)
(4, 5)
>>> u1 == u2
False
>>> u1.encode('utf-8')  # get u1 byte string
b'Caf\xc3\xa9'
>>> u2.encode('utf-8')  # get u2 byte string
b'Cafe\xcc\x81'
>>> from unicodedata import normalize
>>> s1 = normalize('NFC', u1)  # get u1 NFC format
>>> s2 = normalize('NFC', u2)  # get u2 NFC format
>>> s1 == s2
True
>>> s1.encode('utf-8'), s2.encode('utf-8')
(b'Caf\xc3\xa9', b'Caf\xc3\xa9')
>>> s1 = normalize('NFD', u1)  # get u1 NFD format
>>> s2 = normalize('NFD', u2)  # get u2 NFD format
>>> s1, s2
('Café', 'Café')
>>> s1 == s2
True
>>> s1.encode('utf-8'), s2.encode('utf-8')
(b'Cafe\xcc\x81', b'Cafe\xcc\x81')
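
In practice the normalization step is often wrapped in a comparison helper. The function below is a minimal sketch; `same_text` is a hypothetical name, and NFC is chosen as the default form only for illustration.

```python
from unicodedata import normalize


def same_text(a, b, form='NFC'):
    """Compare two strings after normalizing both (hypothetical helper)."""
    return normalize(form, a) == normalize(form, b)


composed = 'Caf\u00e9'     # é as a single code point
decomposed = 'Cafe\u0301'  # e followed by a combining acute accent

print(composed == decomposed)             # False
print(same_text(composed, decomposed))    # True
print(len(normalize('NFC', decomposed)))  # 4
print(len(normalize('NFD', composed)))    # 5
```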

Avoid UnicodeDecodeError

Python raises UnicodeDecodeError when a byte string cannot be decoded to
Unicode code points. If we want to avoid this exception, we can pass replace,
backslashreplace, or ignore as the errors argument of decode.

>>> u = b"\xff"
>>> u.decode('utf-8', 'strict')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
>>> # use U+FFFD, REPLACEMENT CHARACTER
>>> u.decode('utf-8', "replace")
'\ufffd'
>>> # inserts a \xNN escape sequence
>>> u.decode('utf-8', "backslashreplace")
'\\xff'
>>> # leave the character out of the Unicode result
>>> u.decode('utf-8', "ignore")
''
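
These three handlers all discard or rewrite the invalid bytes. When the original bytes must survive a round trip, the standard surrogateescape handler is another option; it is not mentioned in the source and is shown here only as a sketch.

```python
raw = b'Caf\xff'  # \xff is never valid in UTF-8

# surrogateescape maps each undecodable byte to a lone surrogate (U+DC80..U+DCFF).
text = raw.decode('utf-8', 'surrogateescape')
print(ascii(text))  # 'Caf\udcff'

# Encoding with the same handler restores the original bytes.
restored = text.encode('utf-8', 'surrogateescape')
print(restored == raw)  # True

# By contrast, "replace" loses the original byte for good.
lossy = raw.decode('utf-8', 'replace')
print(lossy.encode('utf-8') == raw)  # False
```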

Long String

The following snippet shows common ways to declare a multi-line string in
Python.

# original long string
s = 'This is a very very very long python string'

# Single quote with an escaping backslash
s = "This is a very very very " \
    "long python string"

# Using brackets
s = (
    "This is a very very very "
    "long python string"
)

# Using ``+``
s = (
    "This is a very very very " +
    "long python string"
)

# Using triple-quote with an escaping backslash
s = '''This is a very very very \
long python string'''
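
As a quick check, the sketch below confirms that these forms build the same string, assuming each piece keeps the space that separates it from the next. It also shows why the backslash matters inside a triple-quoted literal. The variable names are illustrative.

```python
original = 'This is a very very very long python string'

backslash = "This is a very very very " \
            "long python string"

brackets = (
    "This is a very very very "
    "long python string"
)

triple = '''This is a very very very \
long python string'''

# Without the backslash, the triple-quoted form keeps the line break.
triple_with_newline = '''This is a very very very
long python string'''

print(original == backslash == brackets == triple)  # True
print('\n' in triple_with_newline)                  # True
```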