Unicode & Character Encodings in Python: A Painless Guide
文章推薦指數: 80 %
Encoding and Decoding in Python 3 ... Python 3's str type is meant to represent human-readable text and can contain any Unicode character. The bytes type, ...
Start Here
LearnPython
PythonTutorials→In-deptharticlesandvideocourses
LearningPaths→Guidedstudyplansforacceleratedlearning
Quizzes→Checkyourlearningprogress
BrowseTopics→Focusonaspecificareaorskilllevel
CommunityChat→LearnwithotherPythonistas
OfficeHours→LiveQ&AcallswithPythonexperts
Podcast→Hearwhat’snewintheworldofPython
Books→Roundoutyourknowledgeandlearnoffline
UnlockAllContent→
More
PythonLearningResources
PythonNewsletter
PythonJobBoard
MeettheTeam
BecomeaTutorialAuthor
BecomeaVideoInstructor
Search
Join
Sign‑In
Unicode&CharacterEncodingsinPython:APainlessGuide
byBradSolomon
advanced
python
MarkasCompleted
Tweet
Share
Email
TableofContents
What’saCharacterEncoding?
ThestringModule
ABitofaRefresher
WeNeedMoreBits!
CoveringAlltheBases:OtherNumberSystems
EnterUnicode
UnicodevsUTF-8
EncodingandDecodinginPython3
Python3:All-InonUnicode
OneByte,TwoBytes,ThreeBytes,Four
WhatAboutUTF-16andUTF-32?
Python’sBuilt-InFunctions
PythonStringLiterals:WaystoSkinaCat
OtherEncodingsAvailableinPython
YouKnowWhatTheySayAboutAssumptions…
OddsandEnds:unicodedata
WrappingUp
Resources
Removeads
WatchNowThistutorialhasarelatedvideocoursecreatedbytheRealPythonteam.Watchittogetherwiththewrittentutorialtodeepenyourunderstanding:UnicodeinPython:WorkingWithCharacterEncodings
HandlingcharacterencodingsinPythonoranyotherlanguagecanattimesseempainful.PlacessuchasStackOverflowhavethousandsofquestionsstemmingfromconfusionoverexceptionslikeUnicodeDecodeErrorandUnicodeEncodeError.ThistutorialisdesignedtocleartheExceptionfogandillustratethatworkingwithtextandbinarydatainPython3canbeasmoothexperience.Python’sUnicodesupportisstrongandrobust,butittakessometimetomaster.
Thistutorialisdifferentbecauseit’snotlanguage-agnosticbutinsteaddeliberatelyPython-centric.You’llstillgetalanguage-agnosticprimer,butyou’llthendiveintoillustrationsinPython,withtext-heavyparagraphskepttoaminimum.You’llseehowtouseconceptsofcharacterencodingsinlivePythoncode.
Bytheendofthistutorial,you’ll:
Getconceptualoverviewsoncharacterencodingsandnumberingsystems
UnderstandhowencodingcomesintoplaywithPython’sstrandbytes
KnowaboutsupportinPythonfornumberingsystemsthroughitsvariousformsofintliterals
BefamiliarwithPython’sbuilt-infunctionsrelatedtocharacterencodingsandnumberingsystems
Characterencodingandnumberingsystemsaresocloselyconnectedthattheyneedtobecoveredinthesametutorialorelsethetreatmentofeitherwouldbetotallyinadequate.
Note:ThisarticleisPython3-centric.Specifically,allcodeexamplesinthistutorialweregeneratedfromaCPython3.7.2shell,althoughallminorversionsofPython3shouldbehave(mostly)thesameintheirtreatmentoftext.
Ifyou’restillusingPython2andareintimidatedbythedifferencesinhowPython2andPython3treattextandbinarydata,thenhopefullythistutorialwillhelpyoumaketheswitch.
FreeDownload:GetasamplechapterfromPythonTricks:TheBookthatshowsyouPython’sbestpracticeswithsimpleexamplesyoucanapplyinstantlytowritemorebeautiful+Pythoniccode.
What’saCharacterEncoding?
Therearetensifnothundredsofcharacterencodings.Thebestwaytostartunderstandingwhattheyareistocoveroneofthesimplestcharacterencodings,ASCII.
Whetheryou’reself-taughtorhaveaformalcomputersciencebackground,chancesareyou’veseenanASCIItableonceortwice.ASCIIisagoodplacetostartlearningaboutcharacterencodingbecauseitisasmallandcontainedencoding.(Toosmall,asitturnsout.)
Itencompassesthefollowing:
LowercaseEnglishletters:athroughz
UppercaseEnglishletters:AthroughZ
Somepunctuationandsymbols:"$"and"!",tonameacouple
Whitespacecharacters:anactualspace(""),aswellasanewline,carriagereturn,horizontaltab,verticaltab,andafewothers
Somenon-printablecharacters:characterssuchasbackspace,"\b",thatcan’tbeprintedliterallyinthewaythattheletterAcan
Sowhatisamoreformaldefinitionofacharacterencoding?
Ataveryhighlevel,it’sawayoftranslatingcharacters(suchasletters,punctuation,symbols,whitespace,andcontrolcharacters)tointegersandultimatelytobits.Eachcharactercanbeencodedtoauniquesequenceofbits.Don’tworryifyou’reshakyontheconceptofbits,becausewe’llgettothemshortly.
Thevariouscategoriesoutlinedrepresentgroupsofcharacters.Eachsinglecharacterhasacorrespondingcodepoint,whichyoucanthinkofasjustaninteger.CharactersaresegmentedintodifferentrangeswithintheASCIItable:
CodePointRange
Class
0through31
Control/non-printablecharacters
32through64
Punctuation,symbols,numbers,andspace
65through90
UppercaseEnglishalphabetletters
91through96
Additionalgraphemes,suchas[and\
97through122
LowercaseEnglishalphabetletters
123through126
Additionalgraphemes,suchas{and|
127
Control/non-printablecharacter(DEL)
TheentireASCIItablecontains128characters.ThistablecapturesthecompletecharactersetthatASCIIpermits.Ifyoudon’tseeacharacterhere,thenyousimplycan’texpressitasprintedtextundertheASCIIencodingscheme.
ASCIITableShow/Hide
CodePoint
Character(Name)
CodePoint
Character(Name)
0
NUL(Null)
64
@
1
SOH(StartofHeading)
65
A
2
STX(StartofText)
66
B
3
ETX(EndofText)
67
C
4
EOT(EndofTransmission)
68
D
5
ENQ(Enquiry)
69
E
6
ACK(Acknowledgment)
70
F
7
BEL(Bell)
71
G
8
BS(Backspace)
72
H
9
HT(HorizontalTab)
73
I
10
LF(LineFeed)
74
J
11
VT(VerticalTab)
75
K
12
FF(FormFeed)
76
L
13
CR(CarriageReturn)
77
M
14
SO(ShiftOut)
78
N
15
SI(ShiftIn)
79
O
16
DLE(DataLinkEscape)
80
P
17
DC1(DeviceControl1)
81
Q
18
DC2(DeviceControl2)
82
R
19
DC3(DeviceControl3)
83
S
20
DC4(DeviceControl4)
84
T
21
NAK(NegativeAcknowledgment)
85
U
22
SYN(SynchronousIdle)
86
V
23
ETB(EndofTransmissionBlock)
87
W
24
CAN(Cancel)
88
X
25
EM(EndofMedium)
89
Y
26
SUB(Substitute)
90
Z
27
ESC(Escape)
91
[
28
FS(FileSeparator)
92
\
29
GS(GroupSeparator)
93
]
30
RS(RecordSeparator)
94
^
31
US(UnitSeparator)
95
_
32
SP(Space)
96
`
33
!
97
a
34
"
98
b
35
#
99
c
36
$
100
d
37
%
101
e
38
&
102
f
39
'
103
g
40
(
104
h
41
)
105
i
42
*
106
j
43
+
107
k
44
,
108
l
45
-
109
m
46
.
110
n
47
/
111
o
48
0
112
p
49
1
113
q
50
2
114
r
51
3
115
s
52
4
116
t
53
5
117
u
54
6
118
v
55
7
119
w
56
8
120
x
57
9
121
y
58
:
122
z
59
;
123
{
60
<
124
|
61
=
125
}
62
>
126
~
63
?
127
DEL(delete)
RemoveadsThestringModule
Python’sstringmoduleisaconvenientone-stop-shopforstringconstantsthatfallinASCII’scharacterset.
Here’sthecoreofthemoduleinallitsglory:
#Fromlib/python3.7/string.py
whitespace='\t\n\r\v\f'
ascii_lowercase='abcdefghijklmnopqrstuvwxyz'
ascii_uppercase='ABCDEFGHIJKLMNOPQRSTUVWXYZ'
ascii_letters=ascii_lowercase+ascii_uppercase
digits='0123456789'
hexdigits=digits+'abcdef'+'ABCDEF'
octdigits='01234567'
punctuation=r"""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""
printable=digits+ascii_letters+punctuation+whitespace
Mostoftheseconstantsshouldbeself-documentingintheiridentifiername.We’llcoverwhathexdigitsandoctdigitsareshortly.
Youcanusetheseconstantsforeverydaystringmanipulation:
>>>>>>importstring
>>>s="What'swrongwithASCII?!?!?"
>>>s.rstrip(string.punctuation)
'What'swrongwithASCII'
Note:string.printableincludesallofstring.whitespace.Thisdisagreesslightlywithanothermethodfortestingwhetheracharacterisconsideredprintable,namelystr.isprintable(),whichwilltellyouthatnoneof{'\v','\n','\r','\f','\t'}areconsideredprintable.
Thesubtledifferenceisbecauseofdefinition:str.isprintable()considerssomethingprintableif“allofitscharactersareconsideredprintableinrepr().”
ABitofaRefresher
Nowisagoodtimeforashortrefresheronthebit,themostfundamentalunitofinformationthatacomputerknows.
Abitisasignalthathasonlytwopossiblestates.Therearedifferentwaysofsymbolicallyrepresentingabitthatallmeanthesamething:
0or1
“yes”or“no”
TrueorFalse
“on”or“off”
OurASCIItablefromtheprevioussectionuseswhatyouandIwouldjustcallnumbers(0through127),butwhataremorepreciselycallednumbersinbase10(decimal).
Youcanalsoexpresseachofthesebase-10numberswithasequenceofbits(base2).Herearethebinaryversionsof0through10indecimal:
Decimal
Binary(Compact)
Binary(PaddedForm)
0
0
00000000
1
1
00000001
2
10
00000010
3
11
00000011
4
100
00000100
5
101
00000101
6
110
00000110
7
111
00000111
8
1000
00001000
9
1001
00001001
10
1010
00001010
Noticethatasthedecimalnumbernincreases,youneedmoresignificantbitstorepresentthecharactersetuptoandincludingthatnumber.
Here’sahandywaytorepresentASCIIstringsassequencesofbitsinPython.EachcharacterfromtheASCIIstringgetspseudo-encodedinto8bits,withspacesinbetweenthe8-bitsequencesthateachrepresentasinglecharacter:
>>>>>>defmake_bitseq(s:str)->str:
...ifnots.isascii():
...raiseValueError("ASCIIonlyallowed")
...return"".join(f"{ord(i):08b}"foriins)
>>>make_bitseq("bits")
'01100010011010010111010001110011'
>>>make_bitseq("CAPS")
'01000011010000010101000001010011'
>>>make_bitseq("$25.43")
'001001000011001000110101001011100011010000110011'
>>>make_bitseq("~5")
'0111111000110101'
Note:.isascii()wasintroducedinPython3.7.
Thef-stringf"{ord(i):08b}"usesPython’sFormatSpecificationMini-Language,whichisawayofspecifyingformattingforreplacementfieldsinformatstrings:
Theleftsideofthecolon,ord(i),istheactualobjectwhosevaluewillbeformattedandinsertedintotheoutput.UsingthePythonord()functiongivesyouthebase-10codepointforasinglestrcharacter.
Therighthandsideofthecolonistheformatspecifier.08meanswidth8,0padded,andthebfunctionsasasigntooutputtheresultingnumberinbase2(binary).
Thistrickismainlyjustforfun,anditwillfailverybadlyforanycharacterthatyoudon’tseepresentintheASCIItable.We’lldiscusshowotherencodingsfixthisproblemlateron.
RemoveadsWeNeedMoreBits!
There’sacriticallyimportantformulathat’srelatedtothedefinitionofabit.Givenanumberofbits,n,thenumberofdistinctpossiblevaluesthatcanberepresentedinnbitsis2n:
defn_possible_values(nbits:int)->int:
return2**nbits
Here’swhatthatmeans:
1bitwillletyouexpress21==2possiblevalues.
8bitswillletyouexpress28==256possiblevalues.
64bitswillletyouexpress264==18,446,744,073,709,551,616possiblevalues.
There’sacorollarytothisformula:givenarangeofdistinctpossiblevalues,howcanwefindthenumberofbits,n,thatisrequiredfortherangetobefullyrepresented?Whatyou’retryingtosolveforisnintheequation2n=x(whereyoualreadyknowx).
Here’swhatthatworksoutto:
>>>>>>frommathimportceil,log
>>>defn_bits_required(nvalues:int)->int:
...returnceil(log(nvalues)/log(2))
>>>n_bits_required(256)
8
Thereasonthatyouneedtouseaceilinginn_bits_required()istoaccountforvaluesthatarenotcleanpowersof2.Sayyouneedtostoreacharactersetof110characterstotal.Naively,thisshouldtakelog(110)/log(2)==6.781bits,butthere’snosuchthingas0.781bits.110valueswillrequire7bits,not6,withthefinalslotsbeingunneeded:
>>>>>>n_bits_required(110)
7
Allofthisservestoproveoneconcept:ASCIIis,strictlyspeaking,a7-bitcode.TheASCIItablethatyousawabovecontains128codepointsandcharacters,0through127inclusive.Thisrequires7bits:
>>>>>>n_bits_required(128)#0through127
7
>>>n_possible_values(7)
128
Theissuewiththisisthatmoderncomputersdon’tstoremuchofanythingin7-bitslots.Theytrafficinunitsof8bits,conventionallyknownasabyte.
Note:Throughoutthistutorial,Iassumethatabyterefersto8bits,asithassincethe1960s,ratherthansomeotherunitofstorage.Youarefreetocallthisanoctetifyouprefer.
ThismeansthatthestoragespaceusedbyASCIIishalf-empty.Ifit’snotclearwhythisis,thinkbacktothedecimal-to-binarytablefromabove.Youcanexpressthenumbers0and1withjust1bit,oryoucanuse8bitstoexpressthemas00000000and00000001,respectively.
Youcanexpressthenumbers0through3withjust2bits,or00through11,oryoucanuse8bitstoexpressthemas00000000,00000001,00000010,and00000011,respectively.ThehighestASCIIcodepoint,127,requiresonly7significantbits.
Knowingthis,youcanseethatmake_bitseq()convertsASCIIstringsintoastrrepresentationofbytes,whereeverycharacterconsumesonebyte:
>>>>>>make_bitseq("bits")
'01100010011010010111010001110011'
ASCII’sunderutilizationofthe8-bitbytesofferedbymoderncomputersledtoafamilyofconflicting,informalizedencodingsthateachspecifiedadditionalcharacterstobeusedwiththeremaining128availablecodepointsallowedinan8-bitcharacterencodingscheme.
Notonlydidthesedifferentencodingsclashwitheachother,buteachoneofthemwasbyitselfstillagrosslyincompleterepresentationoftheworld’scharacters,regardlessofthefactthattheymadeuseofoneadditionalbit.
Overtheyears,onecharacterencodingmega-schemecametorulethemall.However,beforewegetthere,let’stalkforaminuteaboutnumberingsystems,whichareafundamentalunderpinningofcharacterencodingschemes.
RemoveadsCoveringAlltheBases:OtherNumberSystems
InthediscussionofASCIIabove,yousawthateachcharactermapstoanintegerintherange0through127.
Thisrangeofnumbersisexpressedindecimal(base10).It’sthewaythatyou,me,andtherestofushumansareusedtocounting,fornoreasonmorecomplicatedthanthatwehave10fingers.
ButthereareothernumberingsystemsaswellthatareespeciallyprevalentthroughouttheCPythonsourcecode.Whilethe“underlyingnumber”isthesame,allnumberingsystemsarejustdifferentwaysofexpressingthesamenumber.
IfIaskedyouwhatnumberthestring"11"represents,you’dberighttogivemeastrangelookbeforeansweringthatitrepresentseleven.
However,thisstringrepresentationcanexpressdifferentunderlyingnumbersindifferentnumberingsystems.Inadditiontodecimal,thealternativesincludethefollowingcommonnumberingsystems:
Binary:base2
Octal:base8
Hexadecimal(hex):base16
Butwhatdoesitmeanforustosaythat,inacertainnumberingsystem,numbersarerepresentedinbaseN?
HereisthebestwaythatIknowoftoarticulatewhatthismeans:it’sthenumberoffingersthatyou’dcountoninthatsystem.
Ifyouwantamuchfullerbutstillgentleintroductiontonumberingsystems,CharlesPetzold’sCodeisanincrediblycoolbookthatexploresthefoundationsofcomputercodeindetail.
OnewaytodemonstratehowdifferentnumberingsystemsinterpretthesamethingiswithPython’sint()constructor.Ifyoupassastrtoint(),Pythonwillassumebydefaultthatthestringexpressesanumberinbase10unlessyoutellitotherwise:
>>>>>>int('11')
11
>>>int('11',base=10)#10isalreadydefault
11
>>>int('11',base=2)#Binary
3
>>>int('11',base=8)#Octal
9
>>>int('11',base=16)#Hex
17
There’samorecommonwayoftellingPythonthatyourintegeristypedinabaseotherthan10.Pythonacceptsliteralformsofeachofthe3alternativenumberingsystemsabove:
TypeofLiteral
Prefix
Example
n/a
n/a
11
Binaryliteral
0bor0B
0b11
Octalliteral
0oor0O
0o11
Hexliteral
0xor0X
0x11
Allofthesearesub-formsofintegerliterals.Youcanseethattheseproducethesameresults,respectively,asthecallstoint()withnon-defaultbasevalues.They’realljustinttoPython:
>>>>>>11
11
>>>0b11#Binaryliteral
3
>>>0o11#Octalliteral
9
>>>0x11#Hexliteral
17
Here’showyoucouldtypethebinary,octal,andhexadecimalequivalentsofthedecimalnumbers0through20.AnyoftheseareperfectlyvalidinaPythoninterpretershellorsourcecode,andallworkouttobeoftypeint:
Decimal
Binary
Octal
Hex
0
0b0
0o0
0x0
1
0b1
0o1
0x1
2
0b10
0o2
0x2
3
0b11
0o3
0x3
4
0b100
0o4
0x4
5
0b101
0o5
0x5
6
0b110
0o6
0x6
7
0b111
0o7
0x7
8
0b1000
0o10
0x8
9
0b1001
0o11
0x9
10
0b1010
0o12
0xa
11
0b1011
0o13
0xb
12
0b1100
0o14
0xc
13
0b1101
0o15
0xd
14
0b1110
0o16
0xe
15
0b1111
0o17
0xf
16
0b10000
0o20
0x10
17
0b10001
0o21
0x11
18
0b10010
0o22
0x12
19
0b10011
0o23
0x13
20
0b10100
0o24
0x14
IntegerLiteralsinCPythonSourceShow/Hide
It’samazingjusthowprevalenttheseexpressionsareinthePythonStandardLibrary.Ifyouwanttoseeforyourself,navigatetowhereveryourlib/python3.7/directorysits,andcheckouttheuseofhexliteralslikethis:
$grep-nri--include"*\.py"-e"\b0x"lib/python3.7
ThisshouldworkonanyUnixsystemthathasgrep.Youcoulduse"\b0o"tosearchforoctalliteralsor“\b0b”tosearchforbinaryliterals.
What’stheargumentforusingthesealternateintliteralsyntaxes?Inshort,it’sbecause2,8,and16areallpowersof2,while10isnot.Thesethreealternatenumbersystemsoccasionallyofferawayforexpressingvaluesinacomputer-friendlymanner.Forexample,thenumber65536or216,isjust10000inhexadecimal,or0x10000asaPythonhexadecimalliteral.
RemoveadsEnterUnicode
Asyousaw,theproblemwithASCIIisthatit’snotnearlyabigenoughsetofcharacterstoaccommodatetheworld’ssetoflanguages,dialects,symbols,andglyphs.(It’snotevenbigenoughforEnglishalone.)
UnicodefundamentallyservesthesamepurposeasASCII,butitjustencompassesaway,way,waybiggersetofcodepoints.ThereareahandfulofencodingsthatemergedchronologicallybetweenASCIIandUnicode,buttheyarenotreallyworthmentioningjustyetbecauseUnicodeandoneofitsencodingschemes,UTF-8,hasbecomesopredominantlyused.
ThinkofUnicodeasamassiveversionoftheASCIItable—onethathas1,114,112possiblecodepoints.That’s0through1,114,111,or0through17*(216)-1,or0x10ffffhexadecimal.Infact,ASCIIisaperfectsubsetofUnicode.Thefirst128charactersintheUnicodetablecorrespondpreciselytotheASCIIcharactersthatyou’dreasonablyexpectthemto.
Intheinterestofbeingtechnicallyexacting,Unicodeitselfisnotanencoding.Rather,Unicodeisimplementedbydifferentcharacterencodings,whichyou’llseesoon.Unicodeisbetterthoughtofasamap(somethinglikeadict)ora2-columndatabasetable.Itmapscharacters(like"a","¢",oreven"ቈ")todistinct,positiveintegers.Acharacterencodingneedstoofferabitmore.
Unicodecontainsvirtuallyeverycharacterthatyoucanimagine,includingadditionalnon-printableonestoo.Oneofmyfavoritesisthepeskyright-to-leftmark,whichhascodepoint8207andisusedintextwithbothleft-to-rightandright-to-leftlanguagescripts,suchasanarticlecontainingbothEnglishandArabicparagraphs.
Note:Theworldofcharacterencodingsisoneofmanyfine-grainedtechnicaldetailsoverwhichsomepeoplelovetonitpickabout.Onesuchdetailisthatonly1,111,998oftheUnicodecodepointsareactuallyusable,duetoacoupleofarchaicreasons.
UnicodevsUTF-8
Itdidn’ttakelongforpeopletorealizethatalloftheworld’scharacterscouldnotbepackedintoonebyteeach.It’sevidentfromthisthatmodern,morecomprehensiveencodingswouldneedtousemultiplebytestoencodesomecharacters.
YoualsosawabovethatUnicodeisnottechnicallyafull-blowncharacterencoding.Whyisthat?
ThereisonethingthatUnicodedoesn’ttellyou:itdoesn’ttellyouhowtogetactualbitsfromtext—justcodepoints.Itdoesn’ttellyouenoughabouthowtoconverttexttobinarydataandviceversa.
Unicodeisanabstractencodingstandard,notanencoding.That’swhereUTF-8andotherencodingschemescomeintoplay.TheUnicodestandard(amapofcharacterstocodepoints)definesseveraldifferentencodingsfromitssinglecharacterset.
UTF-8aswellasitslesser-usedcousins,UTF-16andUTF-32,areencodingformatsforrepresentingUnicodecharactersasbinarydataofoneormorebytespercharacter.We’lldiscussUTF-16andUTF-32inamoment,butUTF-8hastakenthelargestshareofthepiebyfar.
Thatbringsustoadefinitionthatislongoverdue.Whatdoesitmean,formally,toencodeanddecode?
EncodingandDecodinginPython3
Python3’sstrtypeismeanttorepresenthuman-readabletextandcancontainanyUnicodecharacter.
Thebytestype,conversely,representsbinarydata,orsequencesofrawbytes,thatdonotintrinsicallyhaveanencodingattachedtoit.
Encodinganddecodingistheprocessofgoingfromonetotheother:
Encodingvsdecoding(Image:RealPython)
In.encode()and.decode(),theencodingparameteris"utf-8"bydefault,thoughit’sgenerallysaferandmoreunambiguoustospecifyit:
>>>>>>"résumé".encode("utf-8")
b'r\xc3\xa9sum\xc3\xa9'
>>>"ElNiño".encode("utf-8")
b'ElNi\xc3\xb1o'
>>>b"r\xc3\xa9sum\xc3\xa9".decode("utf-8")
'résumé'
>>>b"ElNi\xc3\xb1o".decode("utf-8")
'ElNiño'
Theresultsofstr.encode()isabytesobject.Bothbytesliterals(suchasb"r\xc3\xa9sum\xc3\xa9")andtherepresentationsofbytespermitonlyASCIIcharacters.
Thisiswhy,whencalling"ElNiño".encode("utf-8"),theASCII-compatible"El"isallowedtoberepresentedasitis,butthenwithtildeisescapedto"\xc3\xb1".Thatmessy-lookingsequencerepresentstwobytes,0xc3and0xb1inhex:
>>>>>>"".join(f"{i:08b}"foriin(0xc3,0xb1))
'1100001110110001'
Thatis,thecharacterñrequirestwobytesforitsbinaryrepresentationunderUTF-8.
Note:Ifyoutypehelp(str.encode),you’llprobablyseeadefaultofencoding='utf-8'.Becarefulaboutexcludingthisandjustusing"résumé".encode(),becausethedefaultmaybedifferentinWindowspriortoPython3.6.
RemoveadsPython3:All-InonUnicode
Python3isall-inonUnicodeandUTF-8specifically.Here’swhatthatmeans:
Python3sourcecodeisassumedtobeUTF-8bydefault.Thismeansthatyoudon’tneed#-*-coding:UTF-8-*-atthetopof.pyfilesinPython3.
Alltext(str)isUnicodebydefault.EncodedUnicodetextisrepresentedasbinarydata(bytes).ThestrtypecancontainanyliteralUnicodecharacter,suchas"Δv/Δt",allofwhichwillbestoredasUnicode.
Python3acceptsmanyUnicodecodepointsinidentifiers,meaningrésumé="~/Documents/resume.pdf"isvalidifthisstrikesyourfancy.
Python’sremoduledefaultstothere.UNICODEflagratherthanre.ASCII.Thismeans,forinstance,thatr"\w"matchesUnicodewordcharacters,notjustASCIIletters.
Thedefaultencodinginstr.encode()andbytes.decode()isUTF-8.
Thereisoneotherpropertythatismorenuanced,whichisthatthedefaultencodingtothebuilt-inopen()isplatform-dependentanddependsonthevalueoflocale.getpreferredencoding():
>>>>>>#MacOSXHighSierra
>>>importlocale
>>>locale.getpreferredencoding()
'UTF-8'
>>>#WindowsServer2012;otherWindowsbuildsmayuseUTF-16
>>>importlocale
>>>locale.getpreferredencoding()
'cp1252'
Again,thelessonhereistobecarefulaboutmakingassumptionswhenitcomestotheuniversalityofUTF-8,evenifitisthepredominantencoding.Itneverhurtstobeexplicitinyourcode.
OneByte,TwoBytes,ThreeBytes,Four
AcrucialfeatureisthatUTF-8isavariable-lengthencoding.It’stemptingtoglossoverwhatthismeans,butit’sworthdelvinginto.
ThinkbacktothesectiononASCII.Everythinginextended-ASCII-landdemandsatmostonebyteofspace.Youcanquicklyprovethiswiththefollowinggeneratorexpression:
>>>>>>all(len(chr(i).encode("ascii"))==1foriinrange(128))
True
UTF-8isquitedifferent.AgivenUnicodecharactercanoccupyanywherefromonetofourbytes.Here’sanexampleofasingleUnicodecharactertakingupfourbytes:
>>>>>>ibrow="🤨"
>>>len(ibrow)
1
>>>ibrow.encode("utf-8")
b'\xf0\x9f\xa4\xa8'
>>>len(ibrow.encode("utf-8"))
4
>>>#Callinglist()onabytesobjectgivesyou
>>>#thedecimalvalueforeachbyte
>>>list(b'\xf0\x9f\xa4\xa8')
[240,159,164,168]
Thisisasubtlebutimportantfeatureoflen():
ThelengthofasingleUnicodecharacterasaPythonstrwillalwaysbe1,nomatterhowmanybytesitoccupies.
Thelengthofthesamecharacterencodedtobyteswillbeanywherebetween1and4.
Thetablebelowsummarizeswhatgeneraltypesofcharactersfitintoeachbyte-lengthbucket:
DecimalRange
HexRange
What’sIncluded
Examples
0to127
"\u0000"to"\u007F"
U.S.ASCII
"A","\n","7","&"
128to2047
"\u0080"to"\u07FF"
MostLatinicalphabets*
"ę","±","ƌ","ñ"
2048to65535
"\u0800"to"\uFFFF"
Additionalpartsofthemultilingualplane(BMP)**
"ത","ᄇ","ᮈ","‰"
65536to1114111
"\U00010000"to"\U0010FFFF"
Other***
"𝕂","𐀀","😓","🂲",
*SuchasEnglish,Arabic,Greek,andIrish
**Ahugearrayoflanguagesandsymbols—mostlyChinese,Japanese,andKoreanbyvolume(alsoASCIIandLatinalphabets)
***AdditionalChinese,Japanese,Korean,andVietnamesecharacters,plusmoresymbolsandemojis
Note:Intheinterestofnotlosingsightofthebigpicture,thereisanadditionalsetoftechnicalfeaturesofUTF-8thataren’tcoveredherebecausetheyarerarelyvisibletoaPythonuser.
Forinstance,UTF-8actuallyusesprefixcodesthatindicatethenumberofbytesinasequence.Thisenablesadecodertotellwhatbytesbelongtogetherinavariable-lengthencoding,andletsthefirstbyteserveasanindicatorofthenumberofbytesinthecomingsequence.
Wikipedia’sUTF-8articledoesnotshyawayfromtechnicaldetail,andthereisalwaystheofficialUnicodeStandardforyourreadingenjoymentaswell.
WhatAboutUTF-16andUTF-32?
Let’sgetbacktotwootherencodingvariants,UTF-16andUTF-32.
ThedifferencebetweentheseandUTF-8issubstantialinpractice.Here’sanexampleofhowmajorthedifferenceiswitharound-tripconversion:
>>>>>>letters="αβγδ"
>>>rawdata=letters.encode("utf-8")
>>>rawdata.decode("utf-8")
'αβγδ'
>>>rawdata.decode("utf-16")#😧
'뇎닎돎듎'
Inthiscase,encodingfourGreekletterswithUTF-8andthendecodingbacktotextinUTF-16wouldproduceatextstrthatisinacompletelydifferentlanguage(Korean).
Glaringlywrongresultslikethisarepossiblewhenthesameencodingisn’tusedbidirectionally.Twovariationsofdecodingthesamebytesobjectmayproduceresultsthataren’teveninthesamelanguage.
ThistablesummarizestherangeornumberofbytesunderUTF-8,UTF-16,andUTF-32:
Encoding
BytesPerCharacter(Inclusive)
VariableLength
UTF-8
1to4
Yes
UTF-16
2to4
Yes
UTF-32
4
No
OneothercuriousaspectoftheUTFfamilyisthatUTF-8willnotalwaystakeuplessspacethanUTF-16.Thatmayseemmathematicallycounterintuitive,butit’squitepossible:
>>>>>>text="記者鄭啟源羅智堅"
>>>len(text.encode("utf-8"))
26
>>>len(text.encode("utf-16"))
22
ThereasonforthisisthatthecodepointsintherangeU+0800throughU+FFFF(2048through65535indecimal)takeupthreebytesinUTF-8versusonlytwoinUTF-16.
I’mnotbyanymeansrecommendingthatyoujumpaboardtheUTF-16train,regardlessofwhetherornotyouoperateinalanguagewhosecharactersarecommonlyinthisrange.Amongotherreasons,oneofthestrongargumentsforusingUTF-8isthat,intheworldofencoding,it’sagreatideatoblendinwiththecrowd.
Nottomention,it’s2019:computermemoryischeap,sosaving4bytesbygoingoutofyourwaytouseUTF-16isarguablynotworthit.
RemoveadsPython’sBuilt-InFunctions
You’vemadeitthroughthehardpart.Timetousewhatyou’veseenthusfarinPython.
Pythonhasagroupofbuilt-infunctionsthatrelateinsomewaytonumberingsystemsandcharacterencoding:
ascii()
bin()
bytes()
chr()
hex()
int()
oct()
ord()
str()
Thesecanbelogicallygroupedtogetherbasedontheirpurpose:
ascii(),bin(),hex(),andoct()areforobtainingadifferentrepresentationofaninput.Eachoneproducesastr.Thefirst,ascii(),producesanASCIIonlyrepresentationofanobject,withnon-ASCIIcharactersescaped.Theremainingthreegivebinary,hexadecimal,andoctalrepresentationsofaninteger,respectively.Theseareonlyrepresentations,notafundamentalchangeintheinput.
bytes(),str(),andint()areclassconstructorsfortheirrespectivetypes,bytes,str,andint.Theyeachofferwaysofcoercingtheinputintothedesiredtype.Forinstance,asyousawearlier,whileint(11.0)isprobablymorecommon,youmightalsoseeint('11',base=16).
ord()andchr()areinversesofeachotherinthatthePythonord()functionconvertsastrcharactertoitsbase-10codepoint,whilechr()doestheopposite.
Here’samoredetailedlookateachoftheseninefunctions:
Function
Signature
Accepts
ReturnType
Purpose
ascii()
ascii(obj)
Varies
str
ASCIIonlyrepresentationofanobject,withnon-ASCIIcharactersescaped
bin()
bin(number)
number:int
str
Binaryrepresentationofaninteger,withtheprefix"0b"
bytes()
bytes(iterable_of_ints)bytes(s,enc[,errors])bytes(bytes_or_buffer)bytes([i])
Varies
bytes
Coerce(convert)theinputtobytes,rawbinarydata
chr()
chr(i)
i:inti>=0i<=1114111
str
ConvertanintegercodepointtoasingleUnicodecharacter
hex()
hex(number)
number:int
str
Hexadecimalrepresentationofaninteger,withtheprefix"0x"
int()
int([x])int(x,base=10)
Varies
int
Coerce(convert)theinputtoint
oct()
oct(number)
number:int
str
Octalrepresentationofaninteger,withtheprefix"0o"
ord()
ord(c)
c:strlen(c)==1
int
ConvertasingleUnicodecharactertoitsintegercodepoint
str()
str(object=’‘)str(b[,enc[,errors]])
Varies
str
Coerce(convert)theinputtostr,text
Youcanexpandthesectionbelowtoseesomeexamplesofeachfunction.
Examples:ascii()Show/Hide
ascii()givesyouanASCII-onlyrepresentationofanobject,withnon-ASCIIcharactersescaped:
>>>>>>ascii("abcdefg")
"'abcdefg'"
>>>ascii("jalepeño")
"'jalepe\\xf1o'"
>>>ascii((1,2,3))
'(1,2,3)'
>>>ascii(0xc0ffee)#Hexliteral(int)
'12648430'
Examples:bin()Show/Hide
bin()givesyouabinaryrepresentationofaninteger,withtheprefix"0b":
>>>>>>bin(0)
'0b0'
>>>bin(400)
'0b110010000'
>>>bin(0xc0ffee)#Hexliteral(int)
'0b110000001111111111101110'
>>>[bin(i)foriin[1,2,4,8,16]]#`int`+listcomprehension
['0b1','0b10','0b100','0b1000','0b10000']
Examples:bytes()Show/Hide
bytes()coercestheinputtobytes,representingrawbinarydata:
>>>>>>#Iterableofints
>>>bytes((104,101,108,108,111,32,119,111,114,108,100))
b'helloworld'
>>>bytes(range(97,123))#Iterableofints
b'abcdefghijklmnopqrstuvwxyz'
>>>bytes("real🐍","utf-8")#String+encoding
b'real\xf0\x9f\x90\x8d'
>>>bytes(10)
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>>bytes.fromhex('c0ffee')
b'\xc0\xff\xee'
>>>bytes.fromhex("7265616c707974686f6e")
b'realpython'
Examples:chr()Show/Hide
chr()convertsanintegercodepointtoasingleUnicodecharacter:
>>>>>>chr(97)
'a'
>>>chr(7048)
'ᮈ'
>>>chr(1114111)
'\U0010ffff'
>>>chr(0x10FFFF)#Hexliteral(int)
'\U0010ffff'
>>>chr(0b01100100)#Binaryliteral(int)
'd'
Examples:hex()Show/Hide
hex()givesthehexadecimalrepresentationofaninteger,withtheprefix"0x":
>>>>>>hex(100)
'0x64'
>>>[hex(i)foriin[1,2,4,8,16]]
['0x1','0x2','0x4','0x8','0x10']
>>>[hex(i)foriinrange(16)]
['0x0','0x1','0x2','0x3','0x4','0x5','0x6','0x7',
'0x8','0x9','0xa','0xb','0xc','0xd','0xe','0xf']
Examples:int()Show/Hide
int()coercestheinputtoint,optionallyinterpretingtheinputinagivenbase:
>>>>>>int(11.0)
11
>>>int('11')
11
>>>int('11',base=2)
3
>>>int('11',base=8)
9
>>>int('11',base=16)
17
>>>int(0xc0ffee-1.0)
12648429
>>>int.from_bytes(b"\x0f","little")
15
>>>int.from_bytes(b'\xc0\xff\xee',"big")
12648430
Examples:ord()Show/Hide
ThePythonord()functionconvertsasingleUnicodecharactertoitsintegercodepoint:
>>>>>>ord("a")
97
>>>ord("ę")
281
>>>ord("ᮈ")
7048
>>>[ord(i)foriin"helloworld"]
[104,101,108,108,111,32,119,111,114,108,100]
Examples:str()Show/Hide
str()coercestheinputtostr,representingtext:
>>>>>>str("strofstring")
'strofstring'
>>>str(5)
'5'
>>>str([1,2,3,4])#Like[1,2,3,4].__str__(),butusestr()
'[1,2,3,4]'
>>>str(b"\xc2\xbccupofflour","utf-8")
'¼cupofflour'
>>>str(0xc0ffee)
'12648430'
PythonStringLiterals:WaystoSkinaCat
Ratherthanusingthestr()constructor,it’scommonplacetotypeastrliterally:
>>>>>>meal="shrimpandgrits"
Thatmayseemeasyenough.Buttheinterestingsideofthingsisthat,becausePython3isUnicode-centricthroughandthrough,youcan“type”Unicodecharactersthatyouprobablywon’tevenfindonyourkeyboard.YoucancopyandpastethisrightintoaPython3interpretershell:
>>>>>>alphabet='αβγδεζηθικλμνξοπρςστυφχψ'
>>>print(alphabet)
αβγδεζηθικλμνξοπρςστυφχψ
Besidesplacingtheactual,unescapedUnicodecharactersintheconsole,thereareotherwaystotypeUnicodestringsaswell.
OneofthedensestsectionsofPython’sdocumentationistheportiononlexicalanalysis,specificallythesectiononstringandbytesliterals.Personally,Ihadtoreadthissectionaboutone,two,ormaybeninetimesforittoreallysinkin.
PartofwhatitsaysisthatthereareuptosixwaysthatPythonwillallowyoutotypethesameUnicodecharacter.
Thefirstandmostcommonwayistotypethecharacteritselfliterally,asyou’vealreadyseen.Thetoughpartwiththismethodisfindingtheactualkeystrokes.That’swheretheothermethodsforgettingandrepresentingcharacterscomeintoplay.Here’sthefulllist:
EscapeSequence
Meaning
HowToExpress"a"
"\ooo"
Characterwithoctalvalueooo
"\141"
"\xhh"
Characterwithhexvaluehh
"\x61"
"\N{name}"
CharacternamednameintheUnicodedatabase
"\N{LATINSMALLLETTERA}"
"\uxxxx"
Characterwith16-bit(2-byte)hexvaluexxxx
"\u0061"
"\Uxxxxxxxx"
Characterwith32-bit(4-byte)hexvaluexxxxxxxx
"\U00000061"
Here’ssomeproofandvalidationoftheabove:
>>>>>>(
..."a"==
..."\x61"==
..."\N{LATINSMALLLETTERA}"==
..."\u0061"==
..."\U00000061"
...)
True
Now,therearetwomaincaveats:
Notalloftheseformsworkforallcharacters.Thehexrepresentationoftheinteger300is0x012c,whichsimplyisn’tgoingtofitintothe2-hex-digitescapecode"\xhh".Thehighestcodepointthatyoucansqueezeintothisescapesequenceis"\xff"("ÿ").Similarlyfor"\ooo",itwillonlyworkupto"\777"("ǿ").
For\xhh,\uxxxx,and\Uxxxxxxxx,exactlyasmanydigitsarerequiredasareshownintheseexamples.ThiscanthrowyouforaloopbecauseofthewaythatUnicodetablesconventionallydisplaythecodesforcharacters,withaleadingU+andvariablenumberofhexcharacters.ThekeyisthatUnicodetablesmostoftendonotzero-padthesecodes.
Forinstance,ifyouconsultunicode-table.comforinformationontheGothicletterfaihu(orfehu),"𐍆",you’llseethatitislistedashavingthecodeU+10346.
Howdoyouputthisinto"\uxxxx"or"\Uxxxxxxxx"?Well,youcan’tfititin"\uxxxx"becauseit’sa4-bytecharacter,andtouse"\Uxxxxxxxx"torepresentthischaracter,you’llneedtoleft-padthesequence:
>>>>>>"\U00010346"
'𐍆'
Thisalsomeansthatthe"\Uxxxxxxxx"formistheonlyescapesequencethatiscapableofholdinganyUnicodecharacter.
Note:Here’sashortfunctiontoconvertstringsthatlooklike"U+10346"intosomethingPythoncanworkwith.Itusesstr.zfill():
>>>>>>defmake_uchr(code:str):
...returnchr(int(code.lstrip("U+").zfill(8),16))
>>>make_uchr("U+10346")
'𐍆'
>>>make_uchr("U+0026")
'&'
RemoveadsOtherEncodingsAvailableinPython
Sofar,you’veseenfourcharacterencodings:
ASCII
UTF-8
UTF-16
UTF-32
Thereareatonofotheronesoutthere.
OneexampleisLatin-1(alsocalledISO-8859-1),whichistechnicallythedefaultfortheHypertextTransferProtocol(HTTP),perRFC2616.WindowshasitsownLatin-1variantcalledcp1252.
Note:ISO-8859-1isstillverymuchpresentoutinthewild.TherequestslibraryfollowsRFC2616“totheletter”inusingitasthedefaultencodingforthecontentofanHTTPorHTTPSresponse.Iftheword“text”isfoundintheContent-Typeheader,andnootherencodingisspecified,thenrequestswilluseISO-8859-1.
Thecompletelistofacceptedencodingsisburiedwaydowninthedocumentationforthecodecsmodule,whichispartofPython’sStandardLibrary.
There’sonemoreusefulrecognizedencodingtobeawareof,whichis"unicode-escape".IfyouhaveadecodedstrandwanttoquicklygetarepresentationofitsescapedUnicodeliteral,thenyoucanspecifythisencodingin.encode():
>>>>>>alef=chr(1575)#Or"\u0627"
>>>alef_hamza=chr(1571)#Or"\u0623"
>>>alef,alef_hamza
('ا','أ')
>>>alef.encode("unicode-escape")
b'\\u0627'
>>>alef_hamza.encode("unicode-escape")
b'\\u0623'
YouKnowWhatTheySayAboutAssumptions…
JustbecausePythonmakestheassumptionofUTF-8encodingforfilesandcodethatyougeneratedoesn’tmeanthatyou,theprogrammer,shouldoperatewiththesameassumptionforexternaldata.
Let’ssaythatagainbecauseit’saruletoliveby:whenyoureceivebinarydata(bytes)fromathirdpartysource,whetheritbefromafileoroveranetwork,thebestpracticeistocheckthatthedataspecifiesanencoding.Ifitdoesn’t,thenit’sonyoutoask.
AllI/Ohappensinbytes,nottext,andbytesarejustonesandzerostoacomputeruntilyoutellitotherwisebyinformingitofanencoding.
Here’sanexampleofwherethingscangowrong.You’resubscribedtoanAPIthatsendsyouarecipeoftheday,whichyoureceiveinbytesandhavealwaysdecodedusing.decode("utf-8")withnoproblem.Onthisparticularday,partoftherecipelookslikethis:
>>>>>>data=b"\xbccupofflour"
Itlooksasiftherecipecallsforsomeflour,butwedon’tknowhowmuch:
>>>>>>data.decode("utf-8")
Traceback(mostrecentcalllast):
File"
延伸文章資訊
- 1Unicode & Character Encodings in Python: A Painless Guide
Encoding and Decoding in Python 3 ... Python 3's str type is meant to represent human-readable te...
- 2Unicode — pysheeet
In Python 3, strings are represented by Unicode instead of bytes. ... get u2 byte string b'Cafe\x...
- 3unicode - Python Reference (The Right Way) - Read the Docs
- 4Python 3 Tutorial 第二堂(1)Unicode 支援、基本I/O
是這樣的… import sys for line in open(sys.
- 5Unicode in Python 2
Unicode strings are sequences of platonic characters ... import codecs codecs.encode() codecs.dec...