Unicode — pysheeet

2024-12-25

Unicode

The main goal of this cheat sheet is to collect some common snippets related to Unicode. In Python 3, strings are represented by Unicode instead of bytes. Further information can be found in PEP 3100.

ASCII is the best-known standard that defines numeric codes for characters. It originally defined only 128 characters, so ASCII contains only control codes, digits, lowercase letters, uppercase letters, and some punctuation. That is not enough to represent the accented characters, Chinese characters, emoji, and other characters used around the world. Unicode was developed to solve this problem. Like ASCII, it assigns a code point to each character, but it has room for up to 1,111,998 characters.

Table of Contents

  Unicode
    String
    Characters
    Porting unicode(s, 'utf-8')
    Unicode Code Point
    Encoding
    Decoding
    Unicode Normalization
    Avoid UnicodeDecodeError
    Long String

String

In Python 2, strings are represented in bytes, not Unicode. Python provides different types of string literals, such as Unicode strings and raw strings. To declare a Unicode string in Python 2, we add the u prefix to the string literal.

    >>> s = 'Café'  # byte string
    >>> s
    'Caf\xc3\xa9'
    >>> type(s)
    <type 'str'>
    >>> u = u'Café'  # unicode string
    >>> u
    u'Caf\xe9'
    >>> type(u)
    <type 'unicode'>

In Python 3, strings are represented in Unicode. To represent a byte string, we add the b prefix to the string literal. Note that the early Python 3 versions (3.0-3.2) do not support the u prefix. To ease the migration of Unicode-aware applications from Python 2, Python 3.3 supports the u prefix for string literals again. Further information can be found in PEP 414.

    >>> s = 'Café'
    >>> type(s)
    <class 'str'>
    >>> s
    'Café'
    >>> s.encode('utf-8')
    b'Caf\xc3\xa9'
    >>> s.encode('utf-8').decode('utf-8')
    'Café'

Characters

Python 2 treats every character of a byte string as a single byte. In this case, the length of a string may not equal the number of characters. For example, the length of Café is 5, not 4, because é is encoded as two bytes in UTF-8.
    >>> s = 'Café'
    >>> print([_c for _c in s])
    ['C', 'a', 'f', '\xc3', '\xa9']
    >>> len(s)
    5
    >>> s = u'Café'
    >>> print([_c for _c in s])
    [u'C', u'a', u'f', u'\xe9']
    >>> len(s)
    4

Python 3 treats every character of a string as a Unicode code point. The length of a string always equals the number of code points it contains.

    >>> s = 'Café'
    >>> print([_c for _c in s])
    ['C', 'a', 'f', 'é']
    >>> len(s)
    4
    >>> bs = bytes(s, encoding='utf-8')
    >>> print(bs)
    b'Caf\xc3\xa9'
    >>> len(bs)
    5

Porting unicode(s, 'utf-8')

The unicode() built-in function was removed in Python 3, so what is the best way to convert the expression unicode(s, 'utf-8') so that it works in both Python 2 and 3?

In Python 2:

    >>> s = 'Café'
    >>> unicode(s, 'utf-8')
    u'Caf\xe9'
    >>> s.decode('utf-8')
    u'Caf\xe9'
    >>> unicode(s, 'utf-8') == s.decode('utf-8')
    True

In Python 3:

    >>> s = 'Café'
    >>> s.decode('utf-8')
    AttributeError: 'str' object has no attribute 'decode'

So, the real answer is…

Unicode Code Point

ord is a built-in function that returns the Unicode code point of a given character. To check the code point of each character in a string, we can call ord on every character.

    >>> s = u'Café'
    >>> for _c in s: print('U+%04x' % ord(_c))
    ...
    U+0043
    U+0061
    U+0066
    U+00e9
    >>> u = '中文'
    >>> for _c in u: print('U+%04x' % ord(_c))
    ...
    U+4e2d
    U+6587

Encoding

Converting Unicode code points to a byte string is called encoding.

    >>> s = u'Café'
    >>> type(s.encode('utf-8'))
    <class 'bytes'>

Decoding

Converting a byte string to Unicode code points is called decoding.

    >>> s = bytes('Café', encoding='utf-8')
    >>> s.decode('utf-8')
    'Café'

Unicode Normalization

Some characters can be represented in two similar forms. For example, the character é can be written as e followed by U+0301 COMBINING ACUTE ACCENT (canonical decomposition) or as the single code point U+00E9 (canonical composition). In this case, comparing two strings may give unexpected results even though they look alike. Therefore, we can normalize both strings to the same Unicode form before comparing them.
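The comparison described above can be wrapped in a small helper. This is only a minimal sketch: the function name same_text and its form parameter are hypothetical and not part of the standard library; unicodedata.normalize is the only library call used.

```python
from unicodedata import normalize


def same_text(a, b, form="NFC"):
    """Return True if a and b are equal after normalizing both to `form`.

    same_text is a hypothetical helper name used for illustration only.
    """
    return normalize(form, a) == normalize(form, b)


composed = "Caf\u00e9"     # é as the single code point U+00E9
decomposed = "Cafe\u0301"  # e followed by U+0301 COMBINING ACUTE ACCENT

print(composed == decomposed)           # False: the code points differ
print(same_text(composed, decomposed))  # True: equal after NFC normalization
```

Either NFC or NFD works for this purpose, as long as both strings are normalized to the same form.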
    # python3
    >>> u1 = 'Café'  # unicode string
    >>> u2 = 'Cafe\u0301'
    >>> u1, u2
    ('Café', 'Café')
    >>> len(u1), len(u2)
    (4, 5)
    >>> u1 == u2
    False
    >>> u1.encode('utf-8')  # get u1 byte string
    b'Caf\xc3\xa9'
    >>> u2.encode('utf-8')  # get u2 byte string
    b'Cafe\xcc\x81'
    >>> from unicodedata import normalize
    >>> s1 = normalize('NFC', u1)  # get u1 NFC format
    >>> s2 = normalize('NFC', u2)  # get u2 NFC format
    >>> s1 == s2
    True
    >>> s1.encode('utf-8'), s2.encode('utf-8')
    (b'Caf\xc3\xa9', b'Caf\xc3\xa9')
    >>> s1 = normalize('NFD', u1)  # get u1 NFD format
    >>> s2 = normalize('NFD', u2)  # get u2 NFD format
    >>> s1, s2
    ('Café', 'Café')
    >>> s1 == s2
    True
    >>> s1.encode('utf-8'), s2.encode('utf-8')
    (b'Cafe\xcc\x81', b'Cafe\xcc\x81')

Avoid UnicodeDecodeError

Python raises UnicodeDecodeError when a byte string cannot be decoded to Unicode code points. To avoid this exception, we can pass replace, backslashreplace, or ignore as the errors argument of decode.

    >>> u = b"\xff"
    >>> u.decode('utf-8', 'strict')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
    >>> # use U+FFFD, REPLACEMENT CHARACTER
    >>> u.decode('utf-8', "replace")
    '\ufffd'
    >>> # inserts a \xNN escape sequence
    >>> u.decode('utf-8', "backslashreplace")
    '\\xff'
    >>> # leave the character out of the Unicode result
    >>> u.decode('utf-8', "ignore")
    ''

Long String

The following snippet shows common ways to declare a long string across multiple lines in Python.
    # original long string
    s = 'This is a very very very long python string'

    # Adjacent string literals with an escaping backslash
    s = "This is a very very very " \
        "long python string"

    # Using brackets
    s = (
        "This is a very very very "
        "long python string"
    )

    # Using +
    s = (
        "This is a very very very " +
        "long python string"
    )

    # Using triple-quote with an escaping backslash
    s = '''This is a very very very \
    long python string'''
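To check that these declarations really produce the same value, a short script along the following lines can help. It is a minimal sketch: the variable names are chosen here for illustration, and the continuation line of the triple-quoted literal must start in the first column, because any leading whitespace would become part of the string.

```python
original = 'This is a very very very long python string'

# Implicit concatenation of adjacent literals inside parentheses
brackets = (
    "This is a very very very "
    "long python string"
)

# Triple-quoted literal; the backslash at the end of the line drops the newline
triple = '''This is a very very very \
long python string'''

# Without the backslash, the newline stays in the string
with_newline = '''This is a very very very
long python string'''

print(brackets == original)  # True
print(triple == original)    # True
print('\n' in with_newline)  # True
```

Note that each piece ends with a trailing space; without it, adjacent literals would be joined as "verylong".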