A Guide to Unicode, UTF-8 and Strings in Python

2025-01-10


By Sanket Gupta. Published in Towards Data Science.

Let's take a tour of essential string concepts that will take your understanding to the next level.

Strings are one of the most common data types in Python. They are used to deal with text data of any kind. The field of Natural Language Processing is built on top of text and string processing of some kind. It is important to know how strings work in Python. Strings are usually easy to deal with when they are made up of English ASCII characters, but "problems" appear when we move to non-ASCII characters, which are becoming increasingly common today, especially with the advent of emojis.

[Figure: Let's decipher what is hidden in the strings]

Many programmers use encode and decode with strings in hopes of removing the dreaded UnicodeDecodeError. Hopefully, this blog will help you overcome the dread of dealing with strings. Below I use a Q and A format to really get to the answers to the questions you might have, and which I also had before I started learning about strings.

1. What are strings made of?

In Python (2 or 3), strings can be represented either in bytes or in Unicode code points. A byte is a unit of information made of 8 bits; bytes are used to store all files on a hard disk. So all of the CSV and JSON files on your computer are built of bytes. We can all agree that we need bytes, but then what about Unicode code points? We will get to them in the next question.

2. What is Unicode, and what are Unicode code points?

While reading bytes from a file, a reader needs to know what those bytes mean. So if you write a JSON file and send it over to your friend, your friend needs to know how to deal with the bytes in your JSON file. For the first 20 years or so of computing, upper and lower case English characters, some punctuation and digits were enough. These were all encoded into a 128-symbol list called ASCII. 7 bits of information, which fits in 1 byte, is enough to encode every English character. You could tell your friend to decode your JSON file using the ASCII encoding, and voila, she would be able to read what you sent her.

This was fine for the first few decades, but slowly we realized that there are far more characters than just the English ones. We tried extending the 128 characters to 256 (via Latin-1, or ISO-8859-1) to fully utilize the 8-bit space, but that was not enough. We needed an international standard that we all agreed on to deal with hundreds and thousands of non-English characters.

In came Unicode!

Unicode is an international standard that maintains a mapping between individual characters and unique numbers. As of May 2019, the most recent version of Unicode is 12.1, which contains over 137k characters covering many scripts, including English, Hindi, Chinese and Japanese, as well as emojis. Each of these 137k characters is represented by a Unicode code point. So Unicode code points refer to the actual characters that are displayed. These code points are encoded to bytes, and decoded from bytes back to code points. Examples: the Unicode code point for the letter a is U+0061, for the emoji 🖐 it is U+1F590, and for Ω it is U+03A9.

Three of the most popular encoding standards defined by Unicode are UTF-8, UTF-16 and UTF-32.

3. What are the Unicode encodings UTF-8, UTF-16, and UTF-32?

We now know that Unicode is an international standard that maps every known character to a unique number. The next question is: how do we move these unique numbers around the internet? You already know the answer! Using bytes of information.

UTF-8: It uses 1, 2, 3 or 4 bytes to encode every code point. It is backwards compatible with ASCII. All English characters need just 1 byte, which is quite efficient. We only need more bytes if we are sending non-English characters. It is the most popular encoding, and it is the default in Python 3 for source files and for str.encode() and bytes.decode(). In Python 2, the default encoding is ASCII (unfortunately).

UTF-16: It is variable width, using 2 or 4 bytes per code point. This encoding is great for Asian text, as most of it can be encoded in 2 bytes per character. It is bad for English, as all English characters also need 2 bytes here.

UTF-32: It is fixed width, using 4 bytes per code point. Because every character takes 4 bytes, it needs a lot of memory. It is not used very often.

[You can read more in this StackOverflow post.]

We need the encode method to convert Unicode code points to bytes. This typically happens when writing string data to a CSV or JSON file, for example.

We need the decode method to convert bytes to Unicode code points. This typically happens when reading data from a file into strings.

[Figure: Why are encode and decode methods needed?]

4. What data types in Python handle Unicode code points and bytes?

As we discussed earlier, in Python, strings can be represented either in bytes or in Unicode code points. The main takeaways in Python are:

1. Python 2 uses the str type to store bytes and the unicode type to store Unicode code points. All strings are str type by default, which means bytes, and the default encoding is ASCII. So if an incoming file contains Cyrillic characters, Python 2 might fail because ASCII cannot handle them. In this case, we need to remember to use decode("utf-8") while reading files. This is inconvenient.

2. Python 3 came and fixed this. Strings are still str type by default, but they now hold Unicode code points instead: we carry what we see. If we want to store these str strings in files, we convert them to the bytes type. The default encoding is UTF-8 instead of ASCII. Perfect!

5. Any code examples to compare the different data types?

Yes, let's look at "你好", which is Chinese for hello. It takes 6 bytes to store this string, which is made of 2 Unicode code points. Let's take the popular len function as an example to see how things differ in Python 2 and 3, and what you need to keep in mind.

    >>> print(len("你好"))   # Python 2 - str is bytes
    6
    >>> print(len(u"你好"))  # Python 2 - Add 'u' for unicode code points
    2
    >>> print(len("你好"))   # Python 3 - str is unicode code points
    2

So, prefixing a u in Python 2 can decide whether your code works correctly or not, which can be confusing! Python 3 fixed this by using Unicode code points by default, so len works as you would expect and gives a length of 2 in the example above.

Let's look at more examples of dealing with strings in Python 3:

    # strings are by default made of unicode code points
    >>> print(len("你好"))
    2
    # Manually encode a string into bytes
    >>> print(len(("你好").encode("utf-8")))
    6
    # You don't need to pass an argument as default encoding is "utf-8"
    >>> print(len(("你好").encode()))
    6
    # Print actual unicode code points instead of characters [Source]
    >>> print(("你好").encode("unicode_escape"))
    b'\\u4f60\\u597d'
    # Print bytes encoded in UTF-8 for this string
    >>> print(("你好").encode())
    b'\xe4\xbd\xa0\xe5\xa5\xbd'

6. It's a lot of information! Can you summarize?

Sure! Let's see all we have covered so far visually.

By default in Python 3, we are on the left side, in the world of Unicode code points for strings. We only need to go back and forth with bytes while writing or reading data. str.encode() and bytes.decode() default to UTF-8 for this conversion, but other encodings can also be used. When opening text files, pass encoding="utf-8" explicitly, since the default used by open() can depend on the platform. We need to know which encoding was used to produce the bytes when we decode them, otherwise we might get errors or gibberish!

[Figure: Visual diagram of how encoding and decoding works for strings]

This diagram holds true for both Python 2 and Python 3! We might be getting UnicodeDecodeErrors because:

1) We are using ASCII to encode or decode non-ASCII characters. This happens especially in Python 2, where the default codec is ASCII. So you should explicitly encode and decode bytes using UTF-8.

2) We are using the wrong decoder completely. If Unicode code points were encoded in UTF-16 instead of UTF-8, you might run into bytes that are gibberish in UTF-8 land. So a UTF-8 decoder might fail completely to understand the bytes.

A good practice is to decode your bytes with UTF-8 (or whichever encoding was used to create those bytes) as soon as they are loaded from a file. Run your processing on Unicode code points throughout your Python code, and then encode back into bytes with UTF-8 when writing to a file at the end. This is called the Unicode Sandwich. Read or watch the excellent talk by Ned Batchelder (@nedbat) about this.

If you want to add more information about strings in Python, please mention it in the comments below, as it will help others. This concludes my blog on the guide to Unicode, UTF-8 and strings. Good luck in your own explorations with text!

PS, check out my new podcast! It's called "The Data Life Podcast", where I talk about similar topics. In a recent episode I talked about why Pandas is the new Excel. You can listen to the podcast here or wherever you listen to your podcasts.

If you have any questions, drop me a note at my LinkedIn profile. Thanks for reading!
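
As a rough illustration of the points in this article (byte lengths under each encoding, a failure caused by the wrong decoder, and the Unicode Sandwich), here is a minimal Python 3 sketch. The helper name append_exclamation and the file names in.txt and out.txt are hypothetical and chosen only for this example; the processing step (appending "!") is a stand-in for whatever real text processing you do.

```python
import os
import tempfile

text = "你好"  # two Unicode code points

# Byte lengths of the same two code points under each encoding.
# The "-le" variants are used so that no byte order mark is prepended.
sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
print(sizes)

# Wrong decoder: bytes produced by UTF-16 are not valid UTF-8 here.
utf16_bytes = text.encode("utf-16")
try:
    utf16_bytes.decode("utf-8")
    wrong_decoder_failed = False
except UnicodeDecodeError:
    wrong_decoder_failed = True

# ASCII cannot represent these characters at all.
try:
    text.encode("ascii")
    ascii_encode_failed = False
except UnicodeEncodeError:
    ascii_encode_failed = True


def append_exclamation(src_path, dst_path):
    """Hypothetical helper showing the Unicode Sandwich pattern."""
    # Bread: read bytes and decode them to str right away.
    with open(src_path, "rb") as src:
        text_in = src.read().decode("utf-8")
    # Filling: all processing happens on Unicode code points.
    text_out = text_in + "!"
    # Bread: encode back to bytes only when writing out.
    with open(dst_path, "wb") as dst:
        dst.write(text_out.encode("utf-8"))
    return text_out


with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "in.txt")
    dst_path = os.path.join(tmp, "out.txt")
    with open(src_path, "wb") as f:
        f.write(text.encode("utf-8"))
    result = append_exclamation(src_path, dst_path)
    with open(dst_path, "rb") as f:
        written = f.read()
```

Opening the files in binary mode and calling decode and encode by hand keeps the two conversion points visible. In everyday code, open(path, encoding="utf-8") in text mode does the same conversion for you.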

