A Guide to Unicode, UTF-8 and Strings in Python


By Sanket Gupta. Published in Towards Data Science.

Let's take a tour of essential concepts of strings that will take your understanding to the next level.

Strings are one of the most common data types in Python. They are used to deal with text data of any kind. The field of Natural Language Processing is built on top of text and string processing of some kind. It is important to know how strings work in Python. Strings are usually easy to deal with when they are made up of English ASCII characters, but "problems" appear when we enter into non-ASCII characters, which are becoming increasingly common in the world today, especially with the advent of emojis.

Let's decipher what is hidden in the strings.

Many programmers use encode and decode with strings in hopes of removing the dreaded UnicodeDecodeError. Hopefully, this blog will help you overcome the dread of dealing with strings. Below I am going to use a Q and A format to really get to the answers to the questions you might have, and which I also had before I started learning about strings.

1. What are strings made of?

In Python (2 or 3), strings can be represented either in bytes or in unicode code points. A byte is a unit of information made of 8 bits; bytes are used to store all files on a hard disk. So all of the CSVs and JSON files on your computer are built of bytes. We can all agree that we need bytes, but then what about unicode code points? We will get to them in the next question.

2. What is Unicode, and unicode code points?

While reading bytes from a file, a reader needs to know what those bytes mean. So if you write a JSON file and send it over to your friend, your friend would need to know how to deal with the bytes in your JSON file. For the first 20 years or so of computing, upper and lower case English characters, some punctuation and digits were enough. These were all encoded into a 128-symbol list called ASCII. 7 bits of information, which fits in 1 byte, is enough to encode every English character. You could tell your friend to decode your JSON file in ASCII encoding, and voila, she would be able to read what you sent her.

This was cool for the initial few decades or so, but slowly we realized that there are far more characters than just English characters. We tried extending 128 characters to 256 characters (via Latin-1 or ISO-8859-1) to fully utilize the 8-bit space, but that was not enough. We needed an international standard that we all agreed on to deal with hundreds of thousands of non-English characters.

In came Unicode!

Unicode is an international standard that maintains a mapping between individual characters and unique numbers. As of May 2019, the most recent version of Unicode is 12.1, which contains over 137k characters covering different scripts including English, Hindi, Chinese and Japanese, as well as emojis. These 137k characters are each represented by a unicode code point. So unicode code points refer to actual characters that are displayed. These code points are encoded to bytes and decoded from bytes back to code points.

Examples: the Unicode code point for the letter a is U+0061, for the emoji 🖐 it is U+1F590, and for Ω it is U+03A9.

3 of the most popular encoding standards defined by Unicode are UTF-8, UTF-16 and UTF-32.

3. What are Unicode encodings UTF-8, UTF-16, and UTF-32?

We now know that Unicode is an international standard that maps every known character to a unique number. Then the next question is: how do we move these unique numbers around the internet? You already know the answer! Using bytes of information.

UTF-8: It uses 1, 2, 3 or 4 bytes to encode every code point. It is backwards compatible with ASCII. All English characters need just 1 byte, which is quite efficient. We only need more bytes if we are sending non-English characters. It is the most popular form of encoding, and is the default encoding in Python 3. In Python 2, the default encoding is ASCII (unfortunately).

UTF-16: It is variable width, using 2 or 4 bytes. This encoding is great for Asian text, as most of it can be encoded in 2 bytes per character. It is bad for English, as all English characters also need 2 bytes here.

UTF-32: It is fixed width at 4 bytes. All characters are encoded in 4 bytes, so it needs a lot of memory. It is not used very often.

[You can read more in this StackOverflow post.]

We need the encode method to convert unicode code points to bytes. This typically happens while writing string data to a CSV or JSON file, for example.

We need the decode method to convert bytes to unicode code points. This typically happens while reading data from a file into strings.

[Figure: Why are encode and decode methods needed?]

4. What data types in Python handle Unicode code points and bytes?

As we discussed earlier, in Python, strings can be represented either in bytes or in unicode code points. The main takeaways in Python are:

  1. Python 2 uses the str type to store bytes and the unicode type to store unicode code points. All strings are str type by default, which is bytes, and the default encoding is ASCII. So if an incoming file contains Cyrillic characters, Python 2 might fail because ASCII will not be able to handle those Cyrillic characters. In this case, we need to remember to use decode("utf-8") while reading files. This is inconvenient.

  2. Python 3 came and fixed this. Strings are still str type by default, but they now mean unicode code points instead: we carry what we see. If we want to store these str type strings in files, we use the bytes type instead. The default encoding is UTF-8 instead of ASCII. Perfect!

5. Any code examples to compare the different data types?

Yes, let's look at "你好", which is Chinese for hello. It takes 6 bytes to store this string made of 2 unicode code points. Let's take the example of the popular len function to see how things might differ in Python 2 and 3, and the things you need to keep note of.

    >>> print(len("你好"))   # Python 2 - str is bytes
    6
    >>> print(len(u"你好"))  # Python 2 - Add 'u' for unicode code points
    2
    >>> print(len("你好"))   # Python 3 - str is unicode code points
    2

So, prefixing a u in Python 2 can decide whether your code functions correctly or not, which can be confusing! Python 3 fixed this by using unicode code points by default, so len works as you would expect, giving a length of 2 in the example above.

Let's look at more examples in Python 3 for dealing with strings:

    # A string is made of unicode code points by default
    >>> print(len("你好"))
    2

    # Manually encode a string into bytes
    >>> print(len(("你好").encode("utf-8")))
    6

    # You don't need to pass an argument as default encoding is "utf-8"
    >>> print(len(("你好").encode()))
    6

    # Print actual unicode code points instead of characters [Source]
    >>> print(("你好").encode("unicode_escape"))
    b'\\u4f60\\u597d'

    # Print bytes encoded in UTF-8 for this string
    >>> print(("你好").encode())
    b'\xe4\xbd\xa0\xe5\xa5\xbd'

6. It's a lot of information! Can you summarize?

Sure! Let's see all we have covered so far visually.

By default in Python 3, we are on the left side, in the world of Unicode code points for strings. We only need to go back and forth with bytes while writing or reading the data. The default encoding during this conversion is UTF-8, but other encodings can also be used. When decoding, we need to know which encoding was used to create the bytes, otherwise we might get errors or gibberish!

[Figure: Visual diagram of how encoding and decoding works for strings]

This diagram holds true for both Python 2 and Python 3! We might be getting UnicodeDecodeErrors due to:

  1) Trying to use ASCII on non-ASCII characters. This happens especially in Python 2, where the default encoding is ASCII. So you should explicitly encode and decode bytes using UTF-8.

  2) Using the wrong decoder completely. If unicode code points were encoded in UTF-16 instead of UTF-8, you might run into bytes that are gibberish in UTF-8 land. So the UTF-8 decoder might fail completely to understand the bytes.

A good practice is to decode your bytes with UTF-8 (or whichever encoding was used to create those bytes) as soon as they are loaded from a file. Run your processing on unicode code points throughout your Python code, and then write back to bytes in a file using the UTF-8 encoder at the end. This is called the Unicode Sandwich. Read/watch the excellent talk by Ned Batchelder (@nedbat) about this.

If you want to add more information about strings in Python, please mention it in the comments below, as it will help others. This concludes my blog on the guide to Unicode, UTF-8 and strings. Good luck in your own explorations with text!

PS, check out my new podcast! It's called "The Data Life Podcast", where I talk about similar topics. In a recent episode I talked about why Pandas is the new Excel. You can listen to the podcast here or wherever you listen to your podcasts.

If you have any questions, drop me a note at my LinkedIn profile. Thanks for reading!
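
As a closing illustration, the Unicode Sandwich and the "wrong decoder" failure described above can be sketched in Python 3 roughly as follows. This is a minimal sketch, not code from the article: the helper name `shout` and the sample text are hypothetical, and in-memory bytes stand in for a file so the example stays self-contained.

```python
# A minimal sketch of the "Unicode sandwich" (Python 3).
# `shout` is a hypothetical helper used only for illustration.

def shout(raw: bytes, encoding: str = "utf-8") -> bytes:
    # Bottom slice of bread: decode bytes to str as soon as they arrive.
    text = raw.decode(encoding)
    # The filling: all processing happens on unicode code points (str).
    processed = text.upper() + "!"
    # Top slice of bread: encode back to bytes only when leaving the program.
    return processed.encode(encoding)


# Bytes as they might arrive from a file or the network (assumed UTF-8).
incoming = "héllo 你好".encode("utf-8")
outgoing = shout(incoming)
print(outgoing.decode("utf-8"))

# Using the wrong decoder: bytes produced with UTF-16 are not valid UTF-8,
# so decoding them as UTF-8 raises UnicodeDecodeError.
utf16_bytes = "你好".encode("utf-16")
try:
    utf16_bytes.decode("utf-8")
    wrong_decoder_failed = False
except UnicodeDecodeError:
    wrong_decoder_failed = True
print(wrong_decoder_failed)
```

When reading real files, passing the encoding explicitly, as in `open(path, encoding="utf-8")`, gives the same sandwich: the file object decodes on read and encodes on write, so the rest of the code only ever sees str.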


