How Python does Unicode - James Bennett
文章推薦指數: 80 %
In Python 3, there is one and only one string type. Its name is str and it's Unicode. Sequences of bytes are now represented by a type called ... Skiptocontent HowPythondoes Unicode Published:September5,2017.Filedunder: Pedantics,Programming,Python,Unicode. Asweall(hopefully)knowbynow,Python3madeasignificantchangetohowstringsworkinthelanguage.I’montherecordasbeingstronglyinfavorofthischange,andI’vewrittenatlengthaboutwhyIthinkitwastherightthingtodo.Butforthosewho’vebeenlivingunderarockthepasttenyearsorso,here’sabriefsummary,becauseit’srelevanttowhatIwanttogointo today: InPython2,twotypescouldbeusedtorepresentstrings.Oneofthem,str,wasa“bytestring”type;itrepresentedasequenceofbytesinsomeparticulartextencoding,anddefaultedtoASCII.Theother,unicode,was(asthenameimplies)aUnicodestringtype.Thusitdidnotrepresentanyparticularencoding(ordidit?Keepreadingtofindout!).InPython2,manyoperationsallowedyoutouseeithertype,manycomparisonsworkedevenonstringsofdifferenttypes,andstrandunicodewerebothsubclassesofacommonbaseclass,basestring.TocreateastrinPython2,youcanusethestr()built-in,orstring-literalsyntax,likeso:my_string='Thisismystring.'.Tocreateaninstanceofunicode,youcanusetheunicode()built-in,orprefixastringliteralwithau,likeso:my_unicode=u'ThisismyUnicodestring.'. InPython3,thereisoneandonlyonestringtype.Itsnameisstrandit’sUnicode.Sequencesofbytesarenowrepresentedbyatypecalledbytes.Abytesinstance—ifithappenstorepresentASCII-encodedtext—doesstillsupportsomeoperationswhichletyouuseitasASCII,butthingsthatacceptstringswillgenerallynotacceptbytesanymore,andcomparisonsandotheroperationsnolongerworkbetweenbytesandstrinstances.Thereareevencommand-lineflagstothePythoninterpreterwhichcanpromotefailedmixingofbytesandstrintoanexceptionthatcrashesyourprogram,ifyou want. Thisisagoodthing.Unicodeisthecorrectstringtype,andencoding/decodingshouldbehandledatboundaries(likenetworkinterfacesorthefilesystem),andshouldnotbethingsyouconstantlyhavetoworryabouteverytimeyou’redealingwithastring.Andasasideeffect,thebytestypeisactuallyalotmoreusefulforworkingwithbytesthanPython2’sstrtypeever was. Butthere’sanotherchangeinthePython3series—onewhichhappenedinPython3.3—that’snotaswellknownbutalsoimportant.Andthat’swhatIwanttotalkabouttoday,butfirstweneedonemorebitof background. Unicodetransformation(nottheMichaelBay kind) IfyoualreadyknowhowencodingsofUnicode—specifically,UTF-8,UTF-16,andUTF-32—work,youcanskipthissection.Ifyoudon’tknowaboutthat,keep reading! Asyoumayknow,Unicodeisnotanencoding.Itmakessensetotalkaboutastringthat’sencodedasASCII,orencodedasISO-8859-2,orencodedasShiftJIS,becausethoseareallencodings.ItmakesnosensetotalkaboutastringencodedasUnicode,becauseUnicodeisnotanencoding.Unicodeismorelikeabigdatabaseofcharactersandpropertiesandrules,andonitsownsaysnothingabouthowtoencodeasequenceofcharactersintoasequenceofbytesforactualusebya computer. Andwhenyougetrightdowntoit,Unicodedoesn’tworkintermsofcharacters;itworksintermsofthingscalledcodepoints.IfyouweretorenderaUnicodesequencefordisplay,say,onyourscreen,you’ddiscoverthere’snotacleanmappingbetweencodepointsandwhatyouconsidertobe“characters”;sometimesa“character”mightbeasinglecodepoint,sometimesitmightbemadeupofmultiplecodepoints.Orsomethingthatlookstoyoulikemultiple“characters”mightbeasinglecode point! Herearesome examples: CodepointU+0041,LATINCAPITALLETTERA,is,well,aLatincapitalletterA.Itlookslikethis: A CodepointU+0327COMBININGCEDILLA,ontheotherhand,reallyneedssomethingelsetogowithit.PutitafteraU+0063LATINSMALLLETTERC,andyougeta“character”madeupofmultiplecodepoints:ç,whichisalsoequivalenttothesinglecodepointU+00E7LATINSMALLLETTERCWITHCEDILLA.U+00E7iswhatUnicodecallsa“composedform”andU+0063U+0327isa“decomposedform”.Unicodedefinesrulesforhowtocomposeanddecomposesequencesofcodepointsintoeachother,andwhendifferentsequencesshouldbeconsideredequivalentornormalizedtothesame thing. Sohowaboutasinglecodepointthatlookslikemultiple“characters”?Well,let’sturntoU+FDFAARABICLIGATURESALLALLAHOUALAYHEWASALLAM,whichlookslikethis(assumingyouhaveafontthatsupportsit):ﷺ,and—ifyoufollowUnicode’srulesforhowtodecomposeit—isnearlytwenty“characters”inasinglecodepoint.Itspresenceasasinglecodepointisduetoitsfrequentuse;itrepresentsthephrase“peacebeupon him”. IfyouwanttolearnhowtospeakUnicode,bytheway,thewordyouwanttousefor“character”isusuallygoingtobe“grapheme”.InUnicode,a“grapheme”isdefinedasa“minimallydistinctiveunitofwritinginthecontextofaparticularwritingsystem”,withtheUnicodeglossarysuggesting“b”and“d”asanexampleforEnglish,wherechangingonetotheotherproducesadifferentword(asin“big”versus “dig”). SoUnicodeiscodepoints,whichareusuallygivenashexadecimalrepresentationsofintegers.AndUnicodeorganizestheseinto“planes”of216(that’s65,536)codepointseach,brokenupinto“blocks”ofrelatedcodepoints.Therearecurrently17planesdefinedinUnicode,thoughonlyafewhaveanythingassignedtotheirconstituentcode points. Howmightwegoaboutrepresentingthisinsequencesofbytes?OneideamightbetojustspitoutasequenceofbytesrepresentingtheintegervaluesoftheUnicodecodepoints;thesmallestnumberofbytescapableofrepresentinganyarbitrarily-chosenUnicodecodepointis3(thehighestpossiblecodepointrequires21bitstorepresentasaninteger),butideallywewantanicepoweroftwo,sowecouldrepresentthecodepointsas4-byteintegers,andcallita day. AndinfactthisisprettymuchthedefinitionofUTF-32(that’s“UnicodeTransformationFormat,32-bit”).UTF-32isalsosometimesreferredtoasUCS-4(“UniversalCodedCharacterSet,4-byte”—UCSisanISO/IECstandardsimilartoUnicode,andsharingtheassignmentsofcodepoints,butwithoutthefullsetofpropertiesandrulesdefinedbyUnicode).Sothat’sanoption,butitdoeseatupalotofspace;mostreal-worldtextdatadoesn’tusecodepointshighenoughtoneedthismanybytes,soitcanseem—andbe—wastefultoinsistonusingfourbytesevenwhenit’snot necessary. Onceuponatime,whenthesetofUnicodecodepointshadn’tgrownpastthelimitofa16-bitinteger,therewasanencodingcalledUCS-2,whereeachcodepointwasrepresentedliterallyasa16-bit/two-byteinteger.Today,though,Unicodecontainscodepointsoutsidetherangeofa16-bitinteger(inotherwords,outsideofthelowest-numberedplane,officiallyknownastheBasicMultilingualPlane),sothatnolongerworksquiteaselegantly.However,withabitofshoehorning,it’sstillpossibletouseanencodingbasedontwo-byteintegervalues;thatsystemiscalledUTF-16(UnicodeTransformationFormat,16-bit).ThewayUTF-16works is: Ifacodepointcanberepresentedasa16-bitinteger,encodeitasthat16-bitintegerandyou’re done. Ifnot,subtract0x010000fromthecodepoint,thentakethefirsttenbitsoftheresultandadd0xD800,andtakethelasttenbitsandadd0xDC00.Thiswillproducetwo16-bitvalues,thefirstintherange0xD800-0xDBFFandthesecondintherange0xDC00-0xDFFF.Thesetwovaluesareusedtorepresentthecodepoint(whichnowisencodedbyfourbytes total). Thetwo16-bitvaluesproducedforcodepointsthatcan’tbecleanlyhandledinasingle16-bitunitareknownas“surrogatepairs”.Usingsurrogatepairshassomeadvantages:it’simmediatelyobviouswhenyou’relookingatthem,sincethere’saparticularrangethatthey—andonlythey—fallinto,andsincethefirstandsecondpartsofthesurrogatepairfallintodifferentranges,youcanalsofigureout,ifyou’redumpedatarandompointinaUTF-16stream,whetheryou’relookingatasurrogatepairandwhetheryou’relookingatthefirstorthesecondhalfof one. However,thisdidrequiresomechangestoUnicodeitself,becauseitmeansUTF-16canonlyhandlecodepointsrepresentableby21bitsorfewer(that’sasmuchascanbeencodedusingtherangesthesurrogatepairsfallinto),andthere’sarangeofcodepointinthelowest-numberedplanewhichcanneverbeassigned,sinceUTF-16isusingthemforsurrogatepairs.Unicodeitselfnowguaranteesthattherangesusedforsurrogatepairswillneverbeassigned,andthatnocodepointslargerthan21bitswilleverbeassigned,inordertoensureUTF-16willalwaysbeableencodeallof Unicode. It’sworthnoting,however,thatbecauseofthislittleworkaroundUTF-16isavariable-widthencoding.WhereUTF-32alwaysusesfourbytes,andwhereUCS-2alwaysusedtwo(sinceatthetimetwobyteswasenough),UTF-16sometimesusestwobytesandsometimesusesfour,andyouhavetopayattentiontothevaluesyou’reconsuminginordertoknowwhichis which. UTF-16alsomakesuseofthecodepointU+FEFF,officiallyZEROWIDTHNO-BREAKSPACE,asabyte-ordermarktodistinguishendianness(onabig-endiansystemitwillreadasFEFF,whileonlittle-endianitreadsasFFFE).UTF-16-encodedstreamsoftextusuallystartwiththisvalueandit’saneasywaytodetectfilescontainingUTF-16-encodedcontents.UTF-32alsocanusethebyte-ordermark,butreal-worlduseofUTF-32forstorageandtransmissionofdatais rare. However,thisisstillslightlywasteful,because—atleastintheEnglish-speakingworld—alotoftextcanberepresentedwithcodepointsthatfitinonebyte,andalotofsystemswereoriginallybuiltonASCII,whichdidn’tevenusethefull8bitsofasinglebyte.Unicodethoughtfullyputthe128charactersofASCIIintothefirst128codepointsofthefirstblockofthefirstplane,buttotakeadvantageofthatrequiresanencodingthatmakesthosecodepointscomeoutexactlyequivalenttoasequenceofASCIIbytes.ThatencodingisUTF-8. InUTF-8,asinglecodepointmayrequireanywherefrom1-4bytestoencode.Thewaythisworks is: Ifthecodepointislessthanorequalto0x7F,encodeitas-is,inasinglebyte,leavingthehigh(first,ifyou’relookingattheoutputof,say,Python’sbin()built-in)bitsetto zero. Ifthecodepointisgreaterthan0x7Fbutlessthanorequalto0x07FF,encodeitastwobytes.Thefirstbytebegins110,followedbythefirstfivebitsofthecodepoint,andthesecondbytebegins10followedbytheremainingbitsofthecode point. Ifthecodepointisgreaterthan0x07FFbutlessthanorequalto0xFFFF,encodeitasthreebytes.Thefirstbytebeginswith1110followedbythefirstfourbitsofthecodepoint,andthesecondandthirdbyteseachbeginwith10followedbysixbitsfromthecode point. Forallothercodepoints,encodeasfourbytes.Thefirstbytebeginswith11110followedbythefirstthreebitsofthecodepoint,theneachoftheotherthreebytesbegins10followedbysixbitsofthecode point. Thiscanencodeany21-bitcodepointinatmostfourbytes(sinceitusesatmost11bitstosignalhowthecodepointisspreadacrossmultiplebytes,and32 -11bitsleaves21bitstostorethevalueofthecodepoint).UTF-8cantheoreticallybeextendedouttoencodelargercodepointsifUnicodeeverweretoassignthem,butsinceUnicodeitselfwillnownevergoover21bits,UTF-8canhandleanycodepointinfourbytesorfewer.Though,again,UTF-8isavariable-widthencoding;anygivencodepointmightrequire1,2,3,or4 bytes. Butsincethefirst128codepointsareencodedexactlyasthemselves,inonebytewiththehighbitsettozero,aUTF-8sequencewhichcontainsonlycodepointsinthatrangeisidenticaltotheequivalentASCII,whichissomethingalotofpeople like. Also,forreasonsknownonlytoMicrosoft,manyWindowsprogramsinsertthebyte-ordermarkinUTF-8-encodedfiles.ThecodepointU+FEFFcomesoutasthethree-bytesequenceEFBBBFinUTF-8,justincaseyoueverneedtocheckfor it. Intherealworld,UTF-8andUTF-16arecommonchoicesofencodingforsystemswhichuseUnicode;WindowsusesUTF-16formanyAPIs(thoughtechnicallynotforfilesystempaths,which—althoughthey’remadeupof16-bitunits—arenotspecifiedasneedingtobevalidUTF-16,acommonpitfallthat,alongwithothermisuses,inspiredanalternateencodingmechanismcalled“WTF-8”),whilemodernUnix/LinuxsystemsmostlyseemtogoUTF-8. Whatdoesthishavetodowith Python? AndnowwecantalkaboutPythonstrings.InPython2,andinPython3priorto3.3,PythonhadexactlytwooptionsforhowUnicodestrings(unicodeonPython2,stronPython3)wouldbestoredinmemory.ThechoicewasmadeatthetimeyourPythoninterpreterwascompiled,andwouldproduceeithera“narrow”ora“wide”buildofPython.Ina“narrow”build,PythonwouldinternallystoreUnicodeinatwo-byteencodingwithsurrogatepairs.Ina“wide”build,PythonwouldinternallystoreUnicodeinafour-byte encoding. Thiscanleadtosomeunexpectedoutcomes.Forexample,considerthefollowing(onapre-3.3“narrow”buildof Python): >>>s=u"\U0001F4A9" >>>len(s) 2 What’sworse,iteratingoveritwillyieldtwo“characters”,whichPythonwilltellyouhavecodepointsD83DandDCA9.Ifyouhaven’tworkeditoutyet,that’sthebigclue,becausebothofthosevaluesareinthesurrogate-pairrange.SincethisPythoninterpreterwascompiledwith“narrow”Unicode,itstoresthestringusingtwo-byteencodingwithsurrogatepairs,anddoesnothingtowarnyouwhenyouencounteroneofthem.ItjustgivesyouthetwocodepointsD83DDCA9,whicharethesurrogatepairfor1F4A9. (andofcoursecodepointU+1F4A9iseveryone’sfavorite,💩,alsoknownasPILEOFPOO) Pythonwasn’ttheonlylanguagetohavethisproblem.SeverallanguagesuseUTF-16fortheirUnicodestrings,includingJavaScript(whichwillalsoreportalengthof2forthissingle-code-pointstring).Butit’sanannoyance;Unicodeshouldbeacleanabstraction,oratleastacleaneronethanthis,andthelanguageshouldn’tbeleakingdetailsofitsimplementationupwardtothe programmer. Buthowtodothis?Theonlyfixed-widthencodingavailableisUTF-32,whichwastesalotofspaceonmosttext.AndwhileitmightbepossibletojusthavePythontransparentlyconvertsurrogatepairsintocodepointsforthehigh-levelAPI,it’dstillbeanissueforanythingusinglower-level(i.e.,C)APIstoworkwithPython strings. Python3.3solvedalloftheseproblemsbytakinganewapproachtointernalstorageofUnicode.Gonearethe“wide”and“narrow”builds,gonearethemessyleaksofsurrogatepairsintohigh-levelcode.InPython3.3andlater,theinternalstorageofUnicodeisnowdynamicandchosenonaper-stringbasis.Here’showit works: Pythonparsessourcecodeontheassumptionthatit’sUTF-8. Whenitneedstocreatestringobjects,Pythondeterminesthehighestcodepointinthestring,andlooksatthesizeoftheencodingneededtostorethatcodepoint as-is. Pythonthenchoosesthatencoding—whichwillbeoneoflatin-1,UCS-2,orUCS-4—tostorethe string. Allthegorydetails,ifyouwanttoreadaboutthem,areinPEP393. OneobviousadvantageofthisisthatPythonstringsnowuselessmemoryonaverage;manystringsconsistsolelyofcodepointswhichcanbeencodedtoonebyte,andnowtheywillbeencodedtojustonebyteeach(asopposedtopreviously,whenthey’duseatleast two). Moreimportantly,astringconsistingsolelyofU+1F4A9PILEOFPOOisnowastringoflength1,asitshouldbe.AndtheAPIiscleaner, too: IteratingaPythonstringisiterationovercode points. IndexingintoaPythonstringyieldswhatevercorrespondstothecodepointatthe index. ThelengthofaPythonstringisthenumberofcodepointsit contains. Thisisallpossible—oratleastmucheasiertodo—becausePythonstringsarenow,internally,alwaysfixed-width.Italsonicelymirrorsthebehaviorofbytesobjects,whereiterationandindexingyieldtheintegervaluesofthe bytes. Sowhat’sthe point? Thischangemaynotseemlikesuchabigdeal,especiallywhencomparedtotheupheavalinvolvedinmovingfromPython2toPython3.AndintermsofitsvisibleimpactonendusersofPython,itprobablyisn’t;unlessyou’recarefullymeasuringyourPythonprograms’memoryuseandthatmemoryuseisdominatedbystringstorage,youwon’tnoticemuchadvantagefromthefactthatPythoncannowstoremanystringsusingonlyonebytepercodepoint.AndunlessyouwereroutinelyworkingwithcodepointsoutsidetheBasicMultilingualPlane,youprobablyneverwouldhavenoticedthatlengthanditerationandindexingforthemwas weird. Butitisstillanimportantchange.Andit’sonethatmakesPythonmoreright.Iknowit’spopularthesedaystopromoteUTF-8astheOneTrueEncoding,andthatseveralpopularnew-ishlanguagesareusingUTF-8asthebackingstorageoftheirstringtype(andasaresult,exposingsomeofthequirksofvariable-widthencodingtousersofthoselanguages).Butforahigh-levellanguage,IamincreasinglyoftheopinionthatPython’sapproachis correct. Unicode,forpeoplewhodon’talreadyhaveatleastapassingfamiliaritywithhowitalreadyworks,andespeciallyifthey’recomingfromaworldofASCIIoroneofthepopular8-bitencodings,canseemveryweird,andalotofprogrammersalreadyfallintooneofthosecategories.AddinginthequirksofhowUnicodeactuallygetsrepresentedasbytesinmemory,foralanguagewheremanualmemorymanagementandotherlower-levelprogrammingtasksarerare,isprobablyimposingtoomuchcognitive load. Soabstractingawaythestorage,andprovidingasingleclearAPIintermsofUnicodecodepoints,istherightthingtodo.AndPEP393’schangetotheinternalstorageofstringswasanotherstepdownthepathofDoingtheRightThing™inPython’shistory,andIthinkmorepeopleshouldknowabout it.
延伸文章資訊
- 1Convert Unicode code point and character to each other (chr ...
In Python, the built-in functions chr() and ord() are used to convert between Unicode code points...
- 2Unicode — pysheeet
In Python 2, strings are represented in bytes, not Unicode. Python provides different types of st...
- 3unicode - Python Reference (The Right Way) - Read the Docs
If encoding and/or errors are given, unicode() will decode the object which can either be an 8-bi...
- 4Unicode in Python 2
If it's on disk or on a network, it's bytes; Python provides some ... Unicode strings are sequenc...
- 5How Python does Unicode - James Bennett
In Python 3, there is one and only one string type. Its name is str and it's Unicode. Sequences o...