Unicode in Python 2
A quick run-down of Unicode, its use in Python 2, and some of the gotchas that arise.
- Chris Barker

History
A bit about where all this mess came from...

What the heck is Unicode anyway?
First there was chaos...
Different machines used different encodings.
Then there was ASCII, and all was good: 7 bits, 128 characters (for English speakers, anyway).
But each vendor used the top half (128-255) for different things: MacRoman, Windows 1252, etc...
There is now “latin-1”, but a lot of old files are still around.
Non-Western European languages required totally incompatible 1-byte encodings.
There was no way to mix languages with different alphabets.

Enter Unicode
The Unicode idea is pretty simple:
one “code point” for all characters in all languages
But how do you express that in bytes?
Early days: we can fit all the code points in a two-byte integer (65536 characters).
That turned out not to be enough: holding every code point “raw” now takes a 32-bit integer (UCS-4, also known as UTF-32).
Enter “encodings”:
An encoding is a way to map specific bytes to a code point.
Each code point can be represented by one or more bytes.
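
To make the "one or more bytes per code point" idea concrete, here is a minimal sketch that encodes three sample characters and counts the resulting bytes. The helper name `encoded_sizes` and the choice of sample characters are illustrative assumptions, not part of the original notes. The little-endian codec names are used so that no byte-order mark is added to the counts. The sketch is written to run on both Python 2.7 and Python 3.

```python
# Illustrative sketch: how many bytes one code point takes in each encoding.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")  # "-le" variants add no BOM


def encoded_sizes(char):
    """Return {encoding name: number of bytes} for a single character."""
    return dict((name, len(char.encode(name))) for name in ENCODINGS)


samples = [
    ("ascii x", u"x"),                  # U+0078
    ("integral", u"\u222b"),            # U+222B, inside the first 65536
    ("grinning face", u"\U0001F600"),   # U+1F600, beyond the first 65536
]

for label, char in samples:
    print("%-14s %r" % (label, encoded_sizes(char)))
```

The last sample is the reason a two-byte integer was not enough: UTF-16 needs two 16-bit units (a surrogate pair) to represent it.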

Unicode
A good start:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
http://www.joelonsoftware.com/articles/Unicode.html
Everything is Bytes
If it’s on disk or on a network, it’s bytes.
Python provides some abstractions to make it easier to deal with bytes.
Unicode is a biggie.
Actually, dealing with numbers rather than bytes is big, but we take that for granted.

Mechanics

What are strings?
Py2 strings are sequences of bytes.
Unicode strings are sequences of platonic characters.
It’s almost one code point per character, but there are complications with combined characters: accents, etc.
Platonic characters cannot be written to disk or network!
(ANSI: one character == one byte, so easy!)
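
The "almost one code point per character" caveat can be sketched with an accented letter, which Unicode allows to be spelled either as one precomposed code point or as a base letter plus a combining accent. The variable names below are illustrative; `unicodedata.normalize` is the standard-library call for converting between the two spellings. The sketch runs on both Python 2.7 and Python 3.

```python
import unicodedata

# Two spellings of the same visible character, "e with acute accent".
composed = u"\u00e9"      # LATIN SMALL LETTER E WITH ACUTE: one code point
decomposed = u"e\u0301"   # "e" followed by COMBINING ACUTE ACCENT: two code points

print("lengths: %d and %d" % (len(composed), len(decomposed)))
print("equal as-is: %s" % (composed == decomposed))

# NFC normalization composes the pair back into the single code point.
nfc = unicodedata.normalize("NFC", decomposed)
print("equal after NFC: %s" % (nfc == composed))
```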

Strings vs unicode
Python 2 has two types that let you work with text:
str
unicode
And two ways to work with binary data:
str
bytes() (and bytearray)
but:
In [86]: str is bytes
Out[86]: True
bytes is there for py3 compatibility, but it’s good for making your intentions clear, too.

Unicode
The unicode object lets you work with characters.
It has all the same methods as the string object.
“encoding” is converting from a unicode object to bytes
“decoding” is converting from bytes to a unicode object
(sometimes this feels backwards...)
And it can get even more confusing, since py2 strings are both text and bytes!

Using unicode in Py2
Built-in functions
ord()
chr()
unichr()
str()
unicode()
The codecs module
import codecs
codecs.encode()
codecs.decode()
codecs.open()  # better to use ``io.open``
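
As a rough illustration of the tools listed above, the sketch below converts between a character and its code point, then writes and reads a small file with `io.open`, which takes an `encoding` argument and does the encoding and decoding for you. The scratch file name `demo.txt` is a made-up example, and the `unichr` fallback is only there so the same code also runs on Python 3, where `chr()` already returns text.

```python
import io
import os
import tempfile

try:
    unichr                # exists on Python 2
except NameError:
    unichr = chr          # Python 3: chr() covers all of Unicode

code_point = ord(u"\u222b")          # character -> integer code point
same_char = unichr(code_point)       # integer code point -> character

text = u"The integral sign: \u222b"
path = os.path.join(tempfile.mkdtemp(), "demo.txt")  # hypothetical scratch file

with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)                    # encoded to UTF-8 on the way out

with io.open(path, "r", encoding="utf-8") as f:
    read_back = f.read()             # decoded back to text on the way in

with io.open(path, "rb") as f:
    raw = f.read()                   # the bytes that are actually on disk

print("%d bytes on disk for %d characters" % (len(raw), len(read_back)))
```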

Encoding and Decoding
Encoding: text to bytes. You get a bytes (str) object:
In [17]: u"this".encode('utf-8')
Out[17]: 'this'
In [18]: u"this".encode('utf-16')
Out[18]: '\xff\xfet\x00h\x00i\x00s\x00'
Decoding: bytes to text. You get a unicode object:
In [2]: text = '\xff\xfe."+"x\x00\xb2\x00'.decode('utf-16')
In [3]: type(text)
Out[3]: unicode
In [4]: print text
∮∫x²

Unicode Literals
Use unicode in your source files:
# -*- coding: utf-8 -*-
Or escape the unicode characters:
print u"The integral sign: \u222B"
print u"The integral sign: \N{integral}"
Lots of tables of code points online.
One example:
http://inamidst.com/stuff/unidata/
hello_unicode.py.
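
The different ways of writing a literal all produce the same one-character string, which the following sketch checks. It is an illustration rather than the contents of hello_unicode.py. The variable names are arbitrary, and `unicodedata.lookup` and `unicodedata.name` are the standard-library functions for going between a character and its official name. The coding line matters on Python 2 because the file contains a typed, non-ASCII character.

```python
# -*- coding: utf-8 -*-
import unicodedata

typed = u"∫"                                 # typed directly; relies on the coding line in Py2
by_hex = u"\u222b"                           # escaped by code point
by_name = u"\N{INTEGRAL}"                    # escaped by official character name
by_lookup = unicodedata.lookup("INTEGRAL")   # looked up at run time

all_same = (typed == by_hex == by_name == by_lookup)
print("all spellings equal: %s" % all_same)
print("official name: %s" % unicodedata.name(by_hex))
```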

Using Unicode
Use unicode objects in all your code.
Decode on input.
Encode on output.
Many packages do this for you: XML processing, databases, ...
Gotcha:
Python has a default encoding (usually ascii):
In [2]: sys.getdefaultencoding()
Out[2]: 'ascii'
The default encoding will get used in unexpected places!
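
The decode-on-input, encode-on-output pattern can be sketched as a function that accepts bytes, works on text in the middle, and hands bytes back. The function name `shout` and its UTF-8 default are assumptions made for this example only. It runs on both Python 2.7 and Python 3.

```python
def shout(raw_bytes, encoding="utf-8"):
    """Upper-case some encoded text (hypothetical example function)."""
    text = raw_bytes.decode(encoding)     # decode on input: bytes -> unicode
    result = text.upper()                 # all real work happens on text
    return result.encode(encoding)        # encode on output: unicode -> bytes


incoming = u"caf\u00e9".encode("utf-8")   # pretend these bytes arrived from a file or socket
outgoing = shout(incoming)

print(repr(outgoing))
```

Because the upper-casing is done on the decoded text, the accented letter is handled as a character instead of as two unrelated bytes.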
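
One place the default encoding shows up: when Python 2 mixes a str with a unicode object (for example `u"abc" + some_bytes`), it decodes the bytes implicitly with the default encoding. The sketch below spells that decode out explicitly, so it behaves the same way on Python 2.7 and Python 3. The variable names are illustrative.

```python
# Bytes that hold non-ASCII text, as they might arrive from a file.
utf8_bytes = u"caf\u00e9".encode("utf-8")

# This is the decode Python 2 performs behind the scenes with the
# default 'ascii' encoding; it fails on any byte above 127.
try:
    utf8_bytes.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError as exc:
    ascii_failed = True
    print("implicit-style decode failed: %s" % exc)

# Naming the real encoding explicitly works.
text = utf8_bytes.decode("utf-8")
print("explicit decode gave %d characters" % len(text))
```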

Using unicode everywhere
Python 2.6 and above have a nice feature to make it easier to use unicode everywhere:
from __future__ import unicode_literals
After that import, the u'' prefix is assumed for string literals:
In [1]: s = "this is a regular py2 string"
In [2]: print type(s)