A Guide to Unicode, UTF-8 and Strings in Python
Published in Towards Data Science

Let's take a tour of the essential concepts of strings that will take your understanding to the next level.

Strings are one of the most common data types in Python. They are used to deal with text data of any kind. The field of Natural Language Processing is built on top of text and string processing of some kind, so it is important to know how strings work in Python. Strings are usually easy to deal with when they are made up of English ASCII characters, but "problems" appear when we enter into non-ASCII characters, which are becoming increasingly common today, especially with the advent of emojis.

[Figure: Let's decipher what is hidden in the strings]

Many programmers use encode and decode with strings in hopes of removing the dreaded UnicodeDecodeError. Hopefully, this blog will help you overcome the dread about dealing with strings. Below I am going to use a Q and A format to really get to the answers to the questions you might have, and which I also had before I started learning about strings.

1. What are strings made of?

In Python (2 or 3), strings can be represented either in bytes or in unicode code points. A byte is a unit of information made of 8 bits; bytes are used to store all files on a hard disk. So all of the CSV and JSON files on your computer are built of bytes. We can all agree that we need bytes, but then what about unicode code points? We will get to them in the next question.

2. What is Unicode, and what are unicode code points?

While reading bytes from a file, a reader needs to know what those bytes mean. So if you write a JSON file and send it over to your friend, your friend needs to know how to deal with the bytes in your JSON file. For the first 20 years or so of computing, upper and lower case English characters, some punctuation and digits were enough. These were all encoded into a 128-symbol list called ASCII. 7 bits of information, which fit within 1 byte, are enough to encode every character in that list. You could tell your friend to decode your JSON file using the ASCII encoding, and voila, she would be able to read what you sent her.

This was fine for the initial few decades or so, but slowly we realized that there are far more characters than just English characters. We tried extending the 128 characters to 256 (via Latin-1, or ISO-8859-1) to fully utilize the 8-bit space, but that was not enough. We needed an international standard that we all agreed on to deal with hundreds of thousands of non-English characters.

In came Unicode!

Unicode is an international standard that maintains a mapping between individual characters and unique numbers. As of May 2019, the most recent version of Unicode is 12.1, which contains over 137k characters covering many scripts, including English, Hindi, Chinese and Japanese, as well as emojis. Each of these 137k characters is represented by a unicode code point. So unicode code points refer to the actual characters that are displayed. These code points are encoded to bytes, and decoded from bytes back to code points. Examples: the unicode code point for the letter a is U+0061, for the emoji 🖐 it is U+1F590, and for Ω it is U+03A9.

Three of the most popular encoding standards defined by Unicode are UTF-8, UTF-16 and UTF-32.

3. What are the Unicode encodings UTF-8, UTF-16, and UTF-32?

We now know that Unicode is an international standard that maps every known character to a unique number. The next question is: how do we move these unique numbers around the internet? You already know the answer! Using bytes of information.

UTF-8: It uses 1, 2, 3 or 4 bytes to encode every code point. It is backwards compatible with ASCII. All English characters need just 1 byte, which is quite efficient. We only need more bytes if we are sending non-English characters. It is the most popular encoding, and it is the default encoding for str.encode() and bytes.decode() in Python 3. In Python 2, the default encoding is ASCII (unfortunately).

UTF-16 is variable width, at 2 or 4 bytes per code point. This encoding is great for Asian text, as most of it can be encoded in 2 bytes per character. It is bad for English, as all English characters also need 2 bytes here.

UTF-32 is fixed width, at 4 bytes per code point. Every character is encoded in 4 bytes, so it needs a lot of memory. It is not used very often.

[You can read more in this StackOverflow post.]

We need the encode method to convert unicode code points to bytes. This typically happens while writing string data to a CSV or JSON file, for example.

We need the decode method to convert bytes to unicode code points. This typically happens while reading data from a file into strings.

[Figure: Why are encode and decode methods needed?]

4. What data types in Python handle Unicode code points and bytes?

As we discussed earlier, in Python, strings can be represented either in bytes or in unicode code points. The main takeaways in Python are:

  1. Python 2 uses the str type to store bytes and the unicode type to store unicode code points. All strings are str type by default, which means bytes, and the default encoding is ASCII. So if an incoming file contains Cyrillic characters, Python 2 might fail because ASCII cannot handle them. In this case, we need to remember to use decode("utf-8") while reading files. This is inconvenient.

  2. Python 3 came and fixed this. Strings are still str type by default, but they now mean unicode code points instead: we carry what we see. If we want to store these str type strings in files, we use the bytes type instead. The default encoding is UTF-8 instead of ASCII. Perfect!

5. Any code examples to compare the different data types?

Yes, let's look at "你好", which is Chinese for hello. It takes 6 bytes in UTF-8 to store this string made of 2 unicode code points. Let's take the example of the popular len function to see how things might differ in Python 2 and 3, and the things you need to keep note of.

    >>> print(len("你好"))   # Python 2 - str is bytes
    6
    >>> print(len(u"你好"))  # Python 2 - Add 'u' for unicode code points
    2
    >>> print(len("你好"))   # Python 3 - str is unicode code points
    2

So, prefixing a u in Python 2 can decide whether your code functions correctly or not, which can be confusing! Python 3 fixed this by using unicode code points by default, so len works as you would expect, giving a length of 2 in the example above.

Let's look at more examples in Python 3 for dealing with strings:

    # Strings are made of unicode code points by default
    >>> print(len("你好"))
    2
    # Manually encode a string into bytes
    >>> print(len("你好".encode("utf-8")))
    6
    # You don't need to pass an argument as the default encoding is "utf-8"
    >>> print(len("你好".encode()))
    6
    # Print actual unicode code points instead of characters
    >>> print("你好".encode("unicode_escape"))
    b'\\u4f60\\u597d'
    # Print bytes encoded in UTF-8 for this string
    >>> print("你好".encode())
    b'\xe4\xbd\xa0\xe5\xa5\xbd'

6. It's a lot of information! Can you summarize?

Sure! Let's see all we have covered so far visually.

By default in Python 3, we are on the left side, in the world of Unicode code points for strings. We only need to go back and forth with bytes while writing or reading data. The default encoding for encode() and decode() is UTF-8, but other encodings can also be used. When decoding, we need to know which encoding was used to create the bytes, otherwise we might get errors or gibberish!

[Figure: Visual diagram of how encoding and decoding work for strings]

This diagram holds true for both Python 2 and Python 3! We might be getting UnicodeDecodeErrors because:

1) We are trying to use ASCII to handle non-ASCII characters. This happens especially in Python 2, where the default codec is ASCII and is applied implicitly. So you should explicitly encode and decode bytes using UTF-8.

2) We might be using the wrong decoder completely. If unicode code points were encoded in UTF-16 instead of UTF-8, you might run into bytes that are gibberish in UTF-8 land. So the UTF-8 decoder might fail completely to understand the bytes.

A good practice is to decode your bytes with UTF-8 (or whichever encoding was used to create those bytes) as soon as they are loaded from a file. Run your processing on unicode code points throughout your Python code, and then encode back into bytes with UTF-8 when writing to a file at the end. This is called the Unicode Sandwich. Read/watch the excellent talk by Ned Batchelder (@nedbat) about this.

If you want to add more information about strings in Python, please mention it in the comments below, as it will help others. This concludes my blog on the guide to Unicode, UTF-8 and strings. Good luck in your own explorations with text!

PS, check out my new podcast! It's called "The Data Life Podcast", where I talk about similar topics. In a recent episode I talked about why Pandas is the new Excel.

If you have any questions, drop me a note at my LinkedIn profile. Thanks for reading!

Author: Sanket Gupta
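The byte counts, the two decoding failure modes and the Unicode Sandwich described in this article can be tried out with a short Python 3 sketch. This is only an illustration under stated assumptions: the helper name uppercase_file, the sample text and the temporary file names are hypothetical and not taken from the article, and the "utf-16-le" / "utf-32-le" codecs are chosen so that no byte-order mark is added to the byte counts.

```python
import os
import tempfile

text = "你好"  # two Unicode code points

# Encoding turns code points into bytes; the byte count depends on the codec.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # little-endian, no byte-order mark
utf32_bytes = text.encode("utf-32-le")
byte_lengths = (len(utf8_bytes), len(utf16_bytes), len(utf32_bytes))

# Decoding with the codec that produced the bytes recovers the original text.
round_trip = utf8_bytes.decode("utf-8")

# Failure mode 1: ASCII cannot decode bytes above 0x7F, so it raises.
try:
    utf8_bytes.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Failure mode 2: Latin-1 accepts any byte, so the wrong decoder
# silently produces gibberish instead of raising.
gibberish = utf8_bytes.decode("latin-1")


def uppercase_file(path_in, path_out):
    """Hypothetical helper showing the Unicode Sandwich pattern."""
    # Bottom slice: decode bytes to str as soon as the data is read.
    with open(path_in, "r", encoding="utf-8") as f:
        content = f.read()
    # Filling: all processing happens on Unicode code points.
    processed = content.upper()
    # Top slice: encode str back to bytes only when writing out.
    with open(path_out, "w", encoding="utf-8") as f:
        f.write(processed)
    return processed


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.txt")    # illustrative file names
    dst = os.path.join(tmp, "out.txt")
    with open(src, "wb") as f:
        f.write("héllo 你好".encode("utf-8"))
    result = uppercase_file(src, dst)
    with open(dst, "rb") as f:
        written_bytes = f.read()
```

Passing encoding="utf-8" to open() explicitly is a deliberate choice in this sketch: it keeps the file's encoding independent of whatever default the platform would otherwise pick.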