Unicode HOWTO — Python 3.10.0 documentation

文章推薦指數: 80 %
投票人數:10人

Python's string type uses the Unicode Standard for representing characters, which lets Python programs work with all these different possible characters. Navigation index modules| next| previous| Python» 3.10.0Documentation» PythonHOWTOs» UnicodeHOWTO | UnicodeHOWTO¶ Release 1.12 ThisHOWTOdiscussesPython’ssupportfortheUnicodespecification forrepresentingtextualdata,andexplainsvariousproblemsthat peoplecommonlyencounterwhentryingtoworkwithUnicode. IntroductiontoUnicode¶ Definitions¶ Today’sprogramsneedtobeabletohandleawidevarietyof characters.Applicationsareofteninternationalizedtodisplay messagesandoutputinavarietyofuser-selectablelanguages;the sameprogrammightneedtooutputanerrormessageinEnglish,French, Japanese,Hebrew,orRussian.Webcontentcanbewritteninanyof theselanguagesandcanalsoincludeavarietyofemojisymbols. Python’sstringtypeusestheUnicodeStandardforrepresenting characters,whichletsPythonprogramsworkwithallthesedifferent possiblecharacters. Unicode(https://www.unicode.org/)isaspecificationthataimsto listeverycharacterusedbyhumanlanguagesandgiveeachcharacter itsownuniquecode.TheUnicodespecificationsarecontinually revisedandupdatedtoaddnewlanguagesandsymbols. Acharacteristhesmallestpossiblecomponentofatext.‘A’,‘B’,‘C’, etc.,arealldifferentcharacters.Soare‘È’and‘Í’.Charactersvary dependingonthelanguageorcontextyou’retalking about.Forexample,there’sacharacterfor“RomanNumeralOne”,‘Ⅰ’,that’s separatefromtheuppercaseletter‘I’.They’llusuallylookthesame, butthesearetwodifferentcharactersthathavedifferentmeanings. TheUnicodestandarddescribeshowcharactersarerepresentedby codepoints.Acodepointvalueisanintegerintherange0to 0x10FFFF(about1.1millionvalues,the actualnumberassigned islessthanthat).Inthestandardandinthisdocument,acodepointiswritten usingthenotationU+265Etomeanthecharacterwithvalue 0x265e(9,822indecimal). TheUnicodestandardcontainsalotoftableslistingcharactersand theircorrespondingcodepoints: 0061'a';LATINSMALLLETTERA 0062'b';LATINSMALLLETTERB 0063'c';LATINSMALLLETTERC ... 007B'{';LEFTCURLYBRACKET ... 2167'Ⅷ';ROMANNUMERALEIGHT 2168'Ⅸ';ROMANNUMERALNINE ... 265E'♞';BLACKCHESSKNIGHT 265F'♟';BLACKCHESSPAWN ... 1F600'😀';GRINNINGFACE 1F609'😉';WINKINGFACE ... Strictly,thesedefinitionsimplythatit’smeaninglesstosay‘thisis characterU+265E’.U+265Eisacodepoint,whichrepresentssomeparticular character;inthiscase,itrepresentsthecharacter‘BLACKCHESSKNIGHT’, ‘♞’.In informalcontexts,thisdistinctionbetweencodepointsandcharacterswill sometimesbeforgotten. Acharacterisrepresentedonascreenoronpaperbyasetofgraphical elementsthat’scalledaglyph.TheglyphforanuppercaseA,forexample, istwodiagonalstrokesandahorizontalstroke,thoughtheexactdetailswill dependonthefontbeingused.MostPythoncodedoesn’tneedtoworryabout glyphs;figuringoutthecorrectglyphtodisplayisgenerallythejobofaGUI toolkitoraterminal’sfontrenderer. Encodings¶ Tosummarizetheprevioussection:aUnicodestringisasequenceof codepoints,whicharenumbersfrom0through0x10FFFF(1,114,111 decimal).Thissequenceofcodepointsneedstoberepresentedin memoryasasetofcodeunits,andcodeunitsarethenmapped to8-bitbytes.TherulesfortranslatingaUnicodestringintoa sequenceofbytesarecalledacharacterencoding,orjust anencoding. Thefirstencodingyoumightthinkofisusing32-bitintegersasthe codeunit,andthenusingtheCPU’srepresentationof32-bitintegers. Inthisrepresentation,thestring“Python”mightlooklikethis: Python 0x500000007900000074000000680000006f0000006e000000 01234567891011121314151617181920212223 Thisrepresentationisstraightforwardbutusingitpresentsanumberof problems. It’snotportable;differentprocessorsorderthebytesdifferently. It’sverywastefulofspace.Inmosttexts,themajorityofthecodepoints arelessthan127,orlessthan255,soalotofspaceisoccupiedby0x00 bytes.Theabovestringtakes24bytescomparedtothe6bytesneededforan ASCIIrepresentation.IncreasedRAMusagedoesn’tmattertoomuch(desktop computershavegigabytesofRAM,andstringsaren’tusuallythatlarge),but expandingourusageofdiskandnetworkbandwidthbyafactorof4is intolerable. It’snotcompatiblewithexistingCfunctionssuchasstrlen(),soanew familyofwidestringfunctionswouldneedtobeused. Thereforethisencodingisn’tusedverymuch,andpeopleinsteadchooseother encodingsthataremoreefficientandconvenient,suchasUTF-8. UTF-8isoneofthemostcommonlyusedencodings,andPythonoften defaultstousingit.UTFstandsfor“UnicodeTransformationFormat”, andthe‘8’meansthat8-bitvaluesareusedintheencoding.(There arealsoUTF-16andUTF-32encodings,buttheyarelessfrequently usedthanUTF-8.)UTF-8usesthefollowingrules: Ifthecodepointis<128,it’srepresentedbythecorrespondingbytevalue. Ifthecodepointis>=128,it’sturnedintoasequenceoftwo,three,or fourbytes,whereeachbyteofthesequenceisbetween128and255. UTF-8hasseveralconvenientproperties: ItcanhandleanyUnicodecodepoint. AUnicodestringisturnedintoasequenceofbytesthatcontainsembedded zerobytesonlywheretheyrepresentthenullcharacter(U+0000).Thismeans thatUTF-8stringscanbeprocessedbyCfunctionssuchasstrcpy()andsent throughprotocolsthatcan’thandlezerobytesforanythingotherthan end-of-stringmarkers. AstringofASCIItextisalsovalidUTF-8text. UTF-8isfairlycompact;themajorityofcommonlyusedcharacterscanbe representedwithoneortwobytes. Ifbytesarecorruptedorlost,it’spossibletodeterminethestartofthe nextUTF-8-encodedcodepointandresynchronize.It’salsounlikelythat random8-bitdatawilllooklikevalidUTF-8. UTF-8isabyteorientedencoding.Theencodingspecifiesthateach characterisrepresentedbyaspecificsequenceofoneormorebytes.This avoidsthebyte-orderingissuesthatcanoccurwithintegerandwordoriented encodings,likeUTF-16andUTF-32,wherethesequenceofbytesvariesdepending onthehardwareonwhichthestringwasencoded. References¶ TheUnicodeConsortiumsitehascharactercharts,a glossary,andPDFversionsoftheUnicodespecification.Bepreparedforsome difficultreading.Achronologyofthe originanddevelopmentofUnicodeisalsoavailableonthesite. OntheComputerphileYoutubechannel,TomScottbriefly discussesthehistoryofUnicodeandUTF-8 (9minutes36seconds). Tohelpunderstandthestandard,JukkaKorpelahaswrittenanintroductory guidetoreadingthe Unicodecharactertables. Anothergoodintroductoryarticle waswrittenbyJoelSpolsky. Ifthisintroductiondidn’tmakethingscleartoyou,youshouldtry readingthisalternatearticlebeforecontinuing. Wikipediaentriesareoftenhelpful;seetheentriesfor“characterencoding”andUTF-8,forexample. Python’sUnicodeSupport¶ Nowthatyou’velearnedtherudimentsofUnicode,wecanlookatPython’s Unicodefeatures. TheStringType¶ SincePython3.0,thelanguage’sstrtypecontainsUnicode characters,meaninganystringcreatedusing"unicoderocks!",'unicode rocks!',orthetriple-quotedstringsyntaxisstoredasUnicode. ThedefaultencodingforPythonsourcecodeisUTF-8,soyoucansimply includeaUnicodecharacterinastringliteral: try: withopen('/tmp/input.txt','r')asf: ... exceptOSError: #'Filenotfound'errormessage. print("Fichiernontrouvé") Sidenote:Python3alsosupportsusingUnicodecharactersinidentifiers: répertoire="/tmp/records.log" withopen(répertoire,"w")asf: f.write("test\n") Ifyoucan’tenteraparticularcharacterinyoureditororwantto keepthesourcecodeASCII-onlyforsomereason,youcanalsouse escapesequencesinstringliterals.(Dependingonyoursystem, youmayseetheactualcapital-deltaglyphinsteadofauescape.) >>>"\N{GREEKCAPITALLETTERDELTA}"#Usingthecharactername '\u0394' >>>"\u0394"#Usinga16-bithexvalue '\u0394' >>>"\U00000394"#Usinga32-bithexvalue '\u0394' Inaddition,onecancreateastringusingthedecode()methodof bytes.Thismethodtakesanencodingargument,suchasUTF-8, andoptionallyanerrorsargument. Theerrorsargumentspecifiestheresponsewhentheinputstringcan’tbe convertedaccordingtotheencoding’srules.Legalvaluesforthisargumentare 'strict'(raiseaUnicodeDecodeErrorexception),'replace'(use U+FFFD,REPLACEMENTCHARACTER),'ignore'(justleavethe characteroutoftheUnicoderesult),or'backslashreplace'(insertsa \xNNescapesequence). Thefollowingexamplesshowthedifferences: >>>b'\x80abc'.decode("utf-8","strict") Traceback(mostrecentcalllast): ... UnicodeDecodeError:'utf-8'codeccan'tdecodebyte0x80inposition0: invalidstartbyte >>>b'\x80abc'.decode("utf-8","replace") '\ufffdabc' >>>b'\x80abc'.decode("utf-8","backslashreplace") '\\x80abc' >>>b'\x80abc'.decode("utf-8","ignore") 'abc' Encodingsarespecifiedasstringscontainingtheencoding’sname.Python comeswithroughly100differentencodings;seethePythonLibraryReferenceat StandardEncodingsforalist.Someencodingshavemultiplenames;for example,'latin-1','iso_8859_1'and'8859’areallsynonymsfor thesameencoding. One-characterUnicodestringscanalsobecreatedwiththechr() built-infunction,whichtakesintegersandreturnsaUnicodestringoflength1 thatcontainsthecorrespondingcodepoint.Thereverseoperationisthe built-inord()functionthattakesaone-characterUnicodestringand returnsthecodepointvalue: >>>chr(57344) '\ue000' >>>ord('\ue000') 57344 ConvertingtoBytes¶ Theoppositemethodofbytes.decode()isstr.encode(), whichreturnsabytesrepresentationoftheUnicodestring,encodedinthe requestedencoding. Theerrorsparameteristhesameastheparameterofthe decode()methodbutsupportsafewmorepossiblehandlers.Aswellas 'strict','ignore',and'replace'(whichinthiscase insertsaquestionmarkinsteadoftheunencodablecharacter),thereis also'xmlcharrefreplace'(insertsanXMLcharacterreference), backslashreplace(insertsa\uNNNNescapesequence)and namereplace(insertsa\N{...}escapesequence). Thefollowingexampleshowsthedifferentresults: >>>u=chr(40960)+'abcd'+chr(1972) >>>u.encode('utf-8') b'\xea\x80\x80abcd\xde\xb4' >>>u.encode('ascii') Traceback(mostrecentcalllast): ... UnicodeEncodeError:'ascii'codeccan'tencodecharacter'\ua000'in position0:ordinalnotinrange(128) >>>u.encode('ascii','ignore') b'abcd' >>>u.encode('ascii','replace') b'?abcd?' >>>u.encode('ascii','xmlcharrefreplace') b'ꀀabcd޴' >>>u.encode('ascii','backslashreplace') b'\\ua000abcd\\u07b4' >>>u.encode('ascii','namereplace') b'\\N{YISYLLABLEIT}abcd\\u07b4' Thelow-levelroutinesforregisteringandaccessingtheavailable encodingsarefoundinthecodecsmodule.Implementingnew encodingsalsorequiresunderstandingthecodecsmodule. However,theencodinganddecodingfunctionsreturnedbythismodule areusuallymorelow-levelthaniscomfortable,andwritingnewencodings isaspecializedtask,sothemodulewon’tbecoveredinthisHOWTO. UnicodeLiteralsinPythonSourceCode¶ InPythonsourcecode,specificUnicodecodepointscanbewrittenusingthe \uescapesequence,whichisfollowedbyfourhexdigitsgivingthecode point.The\Uescapesequenceissimilar,butexpectseighthexdigits, notfour: >>>s="a\xac\u1234\u20ac\U00008000" ...#^^^^two-digithexescape ...#^^^^^^four-digitUnicodeescape ...#^^^^^^^^^^eight-digitUnicodeescape >>>[ord(c)forcins] [97,172,4660,8364,32768] Usingescapesequencesforcodepointsgreaterthan127isfineinsmalldoses, butbecomesanannoyanceifyou’reusingmanyaccentedcharacters,asyouwould inaprogramwithmessagesinFrenchorsomeotheraccent-usinglanguage.You canalsoassemblestringsusingthechr()built-infunction,butthisis evenmoretedious. Ideally,you’dwanttobeabletowriteliteralsinyourlanguage’snatural encoding.YoucouldtheneditPythonsourcecodewithyourfavoriteeditor whichwoulddisplaytheaccentedcharactersnaturally,andhavetheright charactersusedatruntime. PythonsupportswritingsourcecodeinUTF-8bydefault,butyoucanusealmost anyencodingifyoudeclaretheencodingbeingused.Thisisdonebyincluding aspecialcommentaseitherthefirstorsecondlineofthesourcefile: #!/usr/bin/envpython #-*-coding:latin-1-*- u='abcdé' print(ord(u[-1])) ThesyntaxisinspiredbyEmacs’snotationforspecifyingvariableslocaltoa file.Emacssupportsmanydifferentvariables,butPythononlysupports ‘coding’.The-*-symbolsindicatetoEmacsthatthecommentisspecial; theyhavenosignificancetoPythonbutareaconvention.Pythonlooksfor coding:nameorcoding=nameinthecomment. Ifyoudon’tincludesuchacomment,thedefaultencodingusedwillbeUTF-8as alreadymentioned.SeealsoPEP263formoreinformation. UnicodeProperties¶ TheUnicodespecificationincludesadatabaseofinformationabout codepoints.Foreachdefinedcodepoint,theinformationincludes thecharacter’sname,itscategory,thenumericvalueifapplicable (forcharactersrepresentingnumericconceptssuchastheRoman numerals,fractionssuchasone-thirdandfour-fifths,etc.).There arealsodisplay-relatedproperties,suchashowtousethecodepoint inbidirectionaltext. Thefollowingprogramdisplayssomeinformationaboutseveralcharacters,and printsthenumericvalueofoneparticularcharacter: importunicodedata u=chr(233)+chr(0x0bf2)+chr(3972)+chr(6000)+chr(13231) fori,cinenumerate(u): print(i,'%04x'%ord(c),unicodedata.category(c),end="") print(unicodedata.name(c)) #Getnumericvalueofsecondcharacter print(unicodedata.numeric(u[1])) Whenrun,thisprints: 000e9LlLATINSMALLLETTEREWITHACUTE 10bf2NoTAMILNUMBERONETHOUSAND 20f84MnTIBETANMARKHALANTA 31770LoTAGBANWALETTERSA 433afSoSQUARERADOVERSSQUARED 1000.0 Thecategorycodesareabbreviationsdescribingthenatureofthecharacter. Thesearegroupedintocategoriessuchas“Letter”,“Number”,“Punctuation”,or “Symbol”,whichinturnarebrokenupintosubcategories.Totakethecodes fromtheaboveoutput,'Ll'means‘Letter,lowercase’,'No'means “Number,other”,'Mn'is“Mark,nonspacing”,and'So'is“Symbol, other”.See theGeneralCategoryValuessectionoftheUnicodeCharacterDatabasedocumentationfora listofcategorycodes. ComparingStrings¶ Unicodeaddssomecomplicationtocomparingstrings,becausethesame setofcharacterscanberepresentedbydifferentsequencesofcode points.Forexample,aletterlike‘ê’canberepresentedasasingle codepointU+00EA,orasU+0065U+0302,whichisthecodepointfor ‘e’followedbyacodepointfor‘COMBININGCIRCUMFLEXACCENT’.These willproducethesameoutputwhenprinted,butoneisastringof length1andtheotherisoflength2. Onetoolforacase-insensitivecomparisonisthe casefold()stringmethodthatconvertsastringtoa case-insensitiveformfollowinganalgorithmdescribedbytheUnicode Standard.Thisalgorithmhasspecialhandlingforcharacterssuchas theGermanletter‘ß’(codepointU+00DF),whichbecomesthepairof lowercaseletters‘ss’. >>>street='Gürzenichstraße' >>>street.casefold() 'gürzenichstrasse' Asecondtoolistheunicodedatamodule’s normalize()functionthatconvertsstringstoone ofseveralnormalforms,wherelettersfollowedbyacombining characterarereplacedwithsinglecharacters.normalize()can beusedtoperformstringcomparisonsthatwon’tfalselyreport inequalityiftwostringsusecombiningcharactersdifferently: importunicodedata defcompare_strs(s1,s2): defNFD(s): returnunicodedata.normalize('NFD',s) returnNFD(s1)==NFD(s2) single_char='ê' multiple_chars='\N{LATINSMALLLETTERE}\N{COMBININGCIRCUMFLEXACCENT}' print('lengthoffirststring=',len(single_char)) print('lengthofsecondstring=',len(multiple_chars)) print(compare_strs(single_char,multiple_chars)) Whenrun,thisoutputs: $python3compare-strs.py lengthoffirststring=1 lengthofsecondstring=2 True Thefirstargumenttothenormalize()functionisa stringgivingthedesirednormalizationform,whichcanbeoneof ‘NFC’,‘NFKC’,‘NFD’,and‘NFKD’. TheUnicodeStandardalsospecifieshowtodocaselesscomparisons: importunicodedata defcompare_caseless(s1,s2): defNFD(s): returnunicodedata.normalize('NFD',s) returnNFD(NFD(s1).casefold())==NFD(NFD(s2).casefold()) #Exampleusage single_char='ê' multiple_chars='\N{LATINCAPITALLETTERE}\N{COMBININGCIRCUMFLEXACCENT}' print(compare_caseless(single_char,multiple_chars)) ThiswillprintTrue.(WhyisNFD()invokedtwice?Because thereareafewcharactersthatmakecasefold()returna non-normalizedstring,sotheresultneedstobenormalizedagain.See section3.13oftheUnicodeStandardforadiscussionandanexample.) UnicodeRegularExpressions¶ Theregularexpressionssupportedbytheremodulecanbeprovided eitherasbytesorstrings.Someofthespecialcharactersequencessuchas \dand\whavedifferentmeaningsdependingonwhether thepatternissuppliedasbytesorastring.Forexample, \dwillmatchthecharacters[0-9]inbytesbut instringswillmatchanycharacterthat’sinthe'Nd'category. Thestringinthisexamplehasthenumber57writteninbothThaiand Arabicnumerals: importre p=re.compile(r'\d+') s="Over\u0e55\u0e5757flavours" m=p.search(s) print(repr(m.group())) Whenexecuted,\d+willmatchtheThainumeralsandprintthem out.Ifyousupplythere.ASCIIflagto compile(),\d+willmatchthesubstring“57”instead. Similarly,\wmatchesawidevarietyofUnicodecharactersbut only[a-zA-Z0-9_]inbytesorifre.ASCIIissupplied, and\swillmatcheitherUnicodewhitespacecharactersor [\t\n\r\f\v]. References¶ SomegoodalternativediscussionsofPython’sUnicodesupportare: ProcessingTextFilesinPython3,byNickCoghlan. PragmaticUnicode,aPyCon2012presentationbyNedBatchelder. ThestrtypeisdescribedinthePythonlibraryreferenceat TextSequenceType—str. Thedocumentationfortheunicodedatamodule. Thedocumentationforthecodecsmodule. Marc-AndréLemburggaveapresentationtitled“PythonandUnicode”(PDFslides)at EuroPython2002.TheslidesareanexcellentoverviewofthedesignofPython 2’sUnicodefeatures(wheretheUnicodestringtypeiscalledunicodeand literalsstartwithu). ReadingandWritingUnicodeData¶ Onceyou’vewrittensomecodethatworkswithUnicodedata,thenextproblemis input/output.HowdoyougetUnicodestringsintoyourprogram,andhowdoyou convertUnicodeintoaformsuitableforstorageortransmission? It’spossiblethatyoumaynotneedtodoanythingdependingonyourinput sourcesandoutputdestinations;youshouldcheckwhetherthelibrariesusedin yourapplicationsupportUnicodenatively.XMLparsersoftenreturnUnicode data,forexample.ManyrelationaldatabasesalsosupportUnicode-valued columnsandcanreturnUnicodevaluesfromanSQLquery. Unicodedataisusuallyconvertedtoaparticularencodingbeforeitgets writtentodiskorsentoverasocket.It’spossibletodoallthework yourself:openafile,readan8-bitbytesobjectfromit,andconvertthebytes withbytes.decode(encoding).However,themanualapproachisnotrecommended. Oneproblemisthemulti-bytenatureofencodings;oneUnicodecharactercanbe representedbyseveralbytes.Ifyouwanttoreadthefileinarbitrary-sized chunks(say,1024or4096bytes),youneedtowriteerror-handlingcodetocatchthecase whereonlypartofthebytesencodingasingleUnicodecharacterarereadatthe endofachunk.Onesolutionwouldbetoreadtheentirefileintomemoryand thenperformthedecoding,butthatpreventsyoufromworkingwithfilesthat areextremelylarge;ifyouneedtoreada2GiBfile,youneed2GiBofRAM. (More,really,sinceforatleastamomentyou’dneedtohaveboththeencoded stringanditsUnicodeversioninmemory.) Thesolutionwouldbetousethelow-leveldecodinginterfacetocatchthecase ofpartialcodingsequences.Theworkofimplementingthishasalreadybeen doneforyou:thebuilt-inopen()functioncanreturnafile-likeobject thatassumesthefile’scontentsareinaspecifiedencodingandacceptsUnicode parametersformethodssuchasread()and write().Thisworksthroughopen()'sencodingand errorsparameterswhichareinterpretedjustlikethoseinstr.encode() andbytes.decode(). ReadingUnicodefromafileisthereforesimple: withopen('unicode.txt',encoding='utf-8')asf: forlineinf: print(repr(line)) It’salsopossibletoopenfilesinupdatemode,allowingbothreadingand writing: withopen('test',encoding='utf-8',mode='w+')asf: f.write('\u4500blahblahblah\n') f.seek(0) print(repr(f.readline()[:1])) TheUnicodecharacterU+FEFFisusedasabyte-ordermark(BOM),andisoften writtenasthefirstcharacterofafileinordertoassistwithautodetection ofthefile’sbyteordering.Someencodings,suchasUTF-16,expectaBOMtobe presentatthestartofafile;whensuchanencodingisused,theBOMwillbe automaticallywrittenasthefirstcharacterandwillbesilentlydroppedwhen thefileisread.Therearevariantsoftheseencodings,suchas‘utf-16-le’ and‘utf-16-be’forlittle-endianandbig-endianencodings,thatspecifyone particularbyteorderinganddon’tskiptheBOM. Insomeareas,itisalsoconventiontousea“BOM”atthestartofUTF-8 encodedfiles;thenameismisleadingsinceUTF-8isnotbyte-orderdependent. ThemarksimplyannouncesthatthefileisencodedinUTF-8.Forreadingsuch files,usethe‘utf-8-sig’codectoautomaticallyskipthemarkifpresent. Unicodefilenames¶ Mostoftheoperatingsystemsincommonusetodaysupportfilenames thatcontainarbitraryUnicodecharacters.Usuallythisis implementedbyconvertingtheUnicodestringintosomeencodingthat variesdependingonthesystem.TodayPythonisconvergingonusing UTF-8:PythononMacOShasusedUTF-8forseveralversions,andPython 3.6switchedtousingUTF-8onWindowsaswell.OnUnixsystems, therewillonlybeafilesystemencoding.ifyou’vesettheLANGorLC_CTYPEenvironmentvariables;if youhaven’t,thedefaultencodingisagainUTF-8. Thesys.getfilesystemencoding()functionreturnstheencodingtouseon yourcurrentsystem,incaseyouwanttodotheencodingmanually,butthere’s notmuchreasontobother.Whenopeningafileforreadingorwriting,youcan usuallyjustprovidetheUnicodestringasthefilename,anditwillbe automaticallyconvertedtotherightencodingforyou: filename='filename\u4500abc' withopen(filename,'w')asf: f.write('blah\n') Functionsintheosmodulesuchasos.stat()willalsoacceptUnicode filenames. Theos.listdir()functionreturnsfilenames,whichraisesanissue:shoulditreturn theUnicodeversionoffilenames,orshoulditreturnbytescontaining theencodedversions?os.listdir()candoboth,dependingonwhetheryou providedthedirectorypathasbytesoraUnicodestring.Ifyoupassa Unicodestringasthepath,filenameswillbedecodedusingthefilesystem’s encodingandalistofUnicodestringswillbereturned,whilepassingabyte pathwillreturnthefilenamesasbytes.Forexample, assumingthedefaultfilesystemencodingisUTF-8,runningthefollowingprogram: fn='filename\u4500abc' f=open(fn,'w') f.close() importos print(os.listdir(b'.')) print(os.listdir('.')) willproducethefollowingoutput: $pythonlistdir-test.py [b'filename\xe4\x94\x80abc',...] ['filename\u4500abc',...] ThefirstlistcontainsUTF-8-encodedfilenames,andthesecondlistcontains theUnicodeversions. Notethatonmostoccasions,youshouldcanjuststickwithusing UnicodewiththeseAPIs.ThebytesAPIsshouldonlybeusedon systemswhereundecodablefilenamescanbepresent;that’s prettymuchonlyUnixsystemsnow. TipsforWritingUnicode-awarePrograms¶ Thissectionprovidessomesuggestionsonwritingsoftwarethatdealswith Unicode. Themostimportanttipis: SoftwareshouldonlyworkwithUnicodestringsinternally,decodingtheinput dataassoonaspossibleandencodingtheoutputonlyattheend. IfyouattempttowriteprocessingfunctionsthatacceptbothUnicodeandbyte strings,youwillfindyourprogramvulnerabletobugswhereveryoucombinethe twodifferentkindsofstrings.Thereisnoautomaticencodingordecoding:if youdoe.g.str+bytes,aTypeErrorwillberaised. Whenusingdatacomingfromawebbrowserorsomeotheruntrustedsource,a commontechniqueistocheckforillegalcharactersinastringbeforeusingthe stringinageneratedcommandlineorstoringitinadatabase.Ifyou’redoing this,becarefultocheckthedecodedstring,nottheencodedbytesdata; someencodingsmayhaveinterestingproperties,suchasnotbeingbijective ornotbeingfullyASCII-compatible.Thisisespeciallytrueiftheinput dataalsospecifiestheencoding,sincetheattackercanthenchoosea cleverwaytohidemalicioustextintheencodedbytestream. ConvertingBetweenFileEncodings¶ TheStreamRecoderclasscantransparentlyconvertbetween encodings,takingastreamthatreturnsdatainencoding#1 andbehavinglikeastreamreturningdatainencoding#2. Forexample,ifyouhaveaninputfilefthat’sinLatin-1,you canwrapitwithaStreamRecodertoreturnbytesencodedin UTF-8: new_f=codecs.StreamRecoder(f, #en/decoder:usedbyread()toencodeitsresultsand #bywrite()todecodeitsinput. codecs.getencoder('utf-8'),codecs.getdecoder('utf-8'), #reader/writer:usedtoreadandwritetothestream. codecs.getreader('latin-1'),codecs.getwriter('latin-1')) FilesinanUnknownEncoding¶ Whatcanyoudoifyouneedtomakeachangetoafile,butdon’tknow thefile’sencoding?IfyouknowtheencodingisASCII-compatibleand onlywanttoexamineormodifytheASCIIparts,youcanopenthefile withthesurrogateescapeerrorhandler: withopen(fname,'r',encoding="ascii",errors="surrogateescape")asf: data=f.read() #makechangestothestring'data' withopen(fname+'.new','w', encoding="ascii",errors="surrogateescape")asf: f.write(data) Thesurrogateescapeerrorhandlerwilldecodeanynon-ASCIIbytes ascodepointsinaspecialrangerunningfromU+DC80to U+DCFF.Thesecodepointswillthenturnbackintothe samebyteswhenthesurrogateescapeerrorhandlerisusedto encodethedataandwriteitbackout. References¶ OnesectionofMasteringPython3Input/Output, aPyCon2010talkbyDavidBeazley,discussestextprocessingandbinarydatahandling. ThePDFslidesforMarc-AndréLemburg’spresentation“WritingUnicode-aware ApplicationsinPython” discussquestionsofcharacterencodingsaswellashowtointernationalize andlocalizeanapplication.TheseslidescoverPython2.xonly. TheGutsofUnicodeinPython isaPyCon2013talkbyBenjaminPetersonthatdiscussestheinternalUnicode representationinPython3.3. Acknowledgements¶ TheinitialdraftofthisdocumentwaswrittenbyAndrewKuchling. IthassincebeenrevisedfurtherbyAlexanderBelopolsky,GeorgBrandl, AndrewKuchling,andEzioMelotti. Thankstothefollowingpeoplewhohavenotederrorsoroffered suggestionsonthisarticle:ÉricAraujo,NicholasBastin,Nick Coghlan,MariusGedminas,KentJohnson,KenKrugler,Marc-André Lemburg,MartinvonLöwis,TerryJ.Reedy,SerhiyStorchaka, ErykSun,ChadWhitacre,GrahamWideman. TableofContents UnicodeHOWTO IntroductiontoUnicode Definitions Encodings References Python’sUnicodeSupport TheStringType ConvertingtoBytes UnicodeLiteralsinPythonSourceCode UnicodeProperties ComparingStrings UnicodeRegularExpressions References ReadingandWritingUnicodeData Unicodefilenames TipsforWritingUnicode-awarePrograms ConvertingBetweenFileEncodings FilesinanUnknownEncoding References Acknowledgements Previoustopic SortingHOWTO Nexttopic HOWTOFetchInternetResourcesUsingTheurllibPackage ThisPage ReportaBug ShowSource Navigation index modules| next| previous| Python» 3.10.0Documentation» PythonHOWTOs» UnicodeHOWTO |



請為這篇文章評分?