Converting Between Unicode and Plain Strings - O'Reilly
Credit: David Ascher, Paul Prescod

Problem

You need to deal with data that doesn’t fit in the ASCII character set.

Solution

Unicode strings can be encoded in plain strings in a variety of ways, according to whichever encoding you choose:

    # Convert Unicode to plain Python string: "encode"
    unicodestring = u"Hello world"
    utf8string = unicodestring.encode("utf-8")
    asciistring = unicodestring.encode("ascii")
    isostring = unicodestring.encode("ISO-8859-1")
    utf16string = unicodestring.encode("utf-16")

    # Convert plain Python string to Unicode: "decode"
    plainstring1 = unicode(utf8string, "utf-8")
    plainstring2 = unicode(asciistring, "ascii")
    plainstring3 = unicode(isostring, "ISO-8859-1")
    plainstring4 = unicode(utf16string, "utf-16")

    assert plainstring1 == plainstring2 == plainstring3 == plainstring4

Discussion

If you find yourself dealing with text that contains non-ASCII characters, you have to learn about Unicode: what it is, how it works, and how Python uses it.

Unicode is a big topic. Luckily, you don’t need to know everything about Unicode to be able to solve real-world problems with it: a few basic bits of knowledge are enough. First, you must understand the difference between bytes and characters. In older, ASCII-centric languages and environments, bytes and characters are treated as the same thing. Since a byte can hold up to 256 values, these environments are limited to 256 characters. Unicode, on the other hand, has tens of thousands of characters. That means that a Unicode character can take more than one byte, so you need to make the distinction between characters and bytes.
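The distinction between characters and bytes is easy to check by counting both for the same text. The following is a minimal sketch, not part of the original recipe: the helper name `char_and_byte_counts` and the sample strings are hypothetical, and the code is written so that it should run under Python 2.7 as well as Python 3.3+ (an assumption beyond the recipe's own Python 2 code).

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: the same text measured in characters and in bytes.
# u"" literals and \u escapes keep the source pure ASCII.

def char_and_byte_counts(text, encoding):
    """Return (number of characters, number of bytes after encoding)."""
    encoded = text.encode(encoding)
    return len(text), len(encoded)

ascii_text = u"Hello world"      # 11 characters, all ASCII
accented_text = u"caf\u00e9"     # 4 characters, the last one is non-ASCII

# ASCII-only text: one byte per character in UTF-8.
ascii_counts = char_and_byte_counts(ascii_text, "utf-8")

# The accented character needs two bytes in UTF-8.
accented_utf8 = char_and_byte_counts(accented_text, "utf-8")

# UTF-16 (little-endian, no byte-order mark) uses two bytes per character here.
accented_utf16 = char_and_byte_counts(accented_text, "utf-16-le")
```

For the ASCII-only string the two counts agree; for the accented string the byte count exceeds the character count, and by how much depends on the encoding chosen.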
Standard Python strings are really bytestrings, and a Python character is really a byte. Other terms for the standard Python type are “8-bit string” and “plain string.” In this recipe we will call them bytestrings, to remind you of their byte-orientedness.

Conversely, a Python Unicode character is an abstract object big enough to hold the character, analogous to Python’s long integers. You don’t have to worry about the internal representation; the representation of Unicode characters becomes an issue only when you are trying to send them to some byte-oriented function, such as the write method for files or the send method for network sockets. At that point, you must choose how to represent the characters as bytes. Converting from Unicode to a bytestring is called encoding the string. Similarly, when you load Unicode strings from a file, socket, or other byte-oriented object, you need to decode the strings from bytes to characters.

There are many ways of converting Unicode objects to bytestrings, each of which is called an encoding. For a variety of historical, political, and technical reasons, there is no one “right” encoding. Every encoding has a case-insensitive name, and that name is passed to the encode and decode methods as a parameter. Here are a few you should know about:

The UTF-8 encoding can handle any Unicode character. It is also backward compatible with ASCII, so a pure ASCII file can also be considered a UTF-8 file, and a UTF-8 file that happens to use only ASCII characters is identical to an ASCII file with the same characters. This property makes UTF-8 very backward-compatible, especially with older Unix tools. UTF-8 is far and away the dominant encoding on Unix. Its primary weakness is that it is fairly inefficient for Eastern texts.

The UTF-16 encoding is favored by Microsoft operating systems and the Java environment. It is less efficient for Western languages but more efficient for Eastern ones. A variant of UTF-16 is sometimes known as UCS-2.
The ISO-8859 series of encodings are 256-character ASCII supersets. They cannot support all of the Unicode characters; they can support only some particular language or family of languages. ISO-8859-1, also known as Latin-1, covers most Western European and African languages, but not Arabic. ISO-8859-2, also known as Latin-2, covers many Eastern European languages such as Hungarian and Polish.

If you want to be able to encode all Unicode characters, you probably want to use UTF-8. You will probably need to deal with the other encodings only when you are handed data in those encodings created by some other application.

See Also

Unicode is a huge topic, but a recommended book is Unicode: A Primer, by Tony Graham (Hungry Minds, Inc.); details are available at http://www.menteith.com/unicode/primer/.
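The trade-offs among the encodings described in the Discussion can be checked with a few lines of code. This is a rough sketch rather than part of the original recipe: the helpers `encoded_size` and `can_encode` and the sample strings are hypothetical, "utf-16-le" is used instead of "utf-16" so that the byte-order mark is not counted, and the code is written to run under Python 2.7 and Python 3.3+ alike.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: size and coverage of the encodings discussed above.

def encoded_size(text, encoding):
    """Number of bytes that text occupies once encoded."""
    return len(text.encode(encoding))

def can_encode(text, encoding):
    """True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

western = u"na\u00efve"                 # 5 characters, one accented letter
polish = u"\u0142\u00f3d\u017a"         # 4 characters, a Polish place name
eastern = u"\u65e5\u672c\u8a9e"         # 3 CJK characters

# Pure ASCII text produces identical bytes in ASCII and UTF-8.
same_bytes = u"Hello world".encode("ascii") == u"Hello world".encode("utf-8")

# UTF-8 is compact for Western text, UTF-16 for Eastern text.
western_sizes = (encoded_size(western, "utf-8"), encoded_size(western, "utf-16-le"))
eastern_sizes = (encoded_size(eastern, "utf-8"), encoded_size(eastern, "utf-16-le"))

# The ISO-8859 encodings each cover only one family of languages.
latin1_handles_western = can_encode(western, "iso-8859-1")
latin1_handles_polish = can_encode(polish, "iso-8859-1")
latin2_handles_polish = can_encode(polish, "iso-8859-2")
latin1_handles_eastern = can_encode(eastern, "iso-8859-1")
utf8_handles_eastern = can_encode(eastern, "utf-8")
```

With these samples, the Western string is smaller in UTF-8 and the Eastern string is smaller in UTF-16, while Latin-1 rejects both the Polish and the CJK text that Latin-2 and UTF-8 respectively accept.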