Byte string, Unicode string, Raw string — A Guide to all strings ...
Published in Towards Data Science

Byte string, Unicode string, Raw string: A Guide to all strings in Python

Differences, usages, Python versus NumPy versus Pandas?

[Photo by Hitesh Choudhary on Unsplash]

"String" in Python? It sounds like one of the most basic topics, one that every Python programmer should have mastered in their first Python tutorial. However, do you know there are at least four types of strings in primitive Python? Do you know how your strings are actually represented in NumPy, Pandas, or any other package? What are the differences and caveats that you need to know? (See below)

Here, let me try to clear up some of your confusion based on my own learning experience. We are going to cover these topics:

- What are the concepts of "Encoding" and "Decoding"?
- What is a raw (r) string or a format (f) string, and when should I use them?
- What are the differences between NumPy/Pandas strings and primitive Python strings?

Byte string and Unicode string (default Python 3 string): It's all about Encoding

To understand the differences between a byte string and a Unicode string, we first need to know what "Encoding" and "Decoding" are.

[Figure: Encoding and Decoding (image by author)]

To store human-readable characters on computers, we need to encode them into bytes. Conversely, we need to decode the bytes back into human-readable characters to display them. A byte, in computer science, is a sequence of bits (0/1), commonly of length 8. So the characters "Hi" are actually stored as "01001000 01101001" on the computer, which consumes 2 bytes (16 bits). The rule that defines the encoding process is called an encoding scheme; commonly used ones include "ASCII", "UTF-8", etc. Now the question is: what do these encoding schemes look like?

"ASCII" converts each character into one byte. One byte consists of 8 bits and each bit is 0 or 1, so a byte can distinguish 2⁸ = 256 values. Strictly speaking, ASCII uses only 7 of those bits and therefore defines 2⁷ = 128 characters. That is more than enough for the 26 English letters (upper and lower case) plus digits and some commonly used symbols. See the "ASCII" table for full information.

However, even 256 values are obviously not enough to store all the characters in the world. In light of that, people designed Unicode, in which each character is assigned a "code point". For instance, "H" is represented by the code point "U+0048". According to Wikipedia, Unicode included 144,697 characters at the time of writing. But again, a code point by itself is not something the computer can store, so we have "UTF-8" and other encoding schemes that convert code points to bytes. "UTF-8" means the minimum number of bits used to represent a character is 8, so as you can guess, "UTF-16" means the minimum number of bits is 16. UTF-8 is far more popular than UTF-16, and it is compatible with the original ASCII standard (every ASCII character is still represented by the same single byte), so for this article and for most of your work, understanding UTF-8 is enough. See the "UTF-8" table for full information.

With the basic concepts understood, let's cover some practical coding tips in Python. In Python 3, the default string is called a Unicode string (u string); you can think of it as human-readable characters. As explained above, you can encode it to a byte string (b string), and the byte string can be decoded back to the Unicode string.

    u'Hi'.encode('ASCII')
    > b'Hi'
    b'\x48\x69'.decode('ASCII')
    > 'Hi'

When a byte string is displayed, Python shows every byte that corresponds to a printable ASCII character as that character, which is why the first result looks human-readable (b'Hi'). The same byte string can equally be written as hex codes (b'\x48\x69'), which can be found in any "ASCII" table.

To wrap up this section, let's look at one "UTF-8" example; again, the hex codes for every character can be found in the UTF-8 table:

    b'\xe0\xb0\x86'.decode('utf-8')
    > 'ఆ'

Raw string

To start with this type of string, we just need to know one thing about the default Unicode string (u string): the backslash ("\") is a special character, such that the character following it takes on a special meaning (e.g. \t, \n, etc.). To ignore the special meaning of the backslash, we have the raw string (r string), in which a backslash is just a backslash and does not change the meaning of the characters that follow it.

[Figure: Unicode and Raw string (image by author)]

Here comes my personal suggestion: unless you need to define a regular expression match pattern (see the example below), I suggest using the Unicode string with escaping (using a backslash to cancel the special character). As shown in the third example, we used a backslash to make sure we output a literal "\" followed by "t" instead of a tab character "\t".

Why would I recommend that? Because the raw string cannot really solve everything. For instance, how do you output a literal single quotation mark in a raw string? The following attempt fails with an error (the traceback is cut off in the source):

    r'ttt'g''
    File "…
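
The encoding round trip and the raw-string caveat described in this guide can be checked with a short Python sketch. This is only an illustration under my own assumptions: the variable names and the Windows-style path are made up for the example and do not come from the original article.

```python
# Sketch only: names and sample values are illustrative assumptions.

# 1. Encoding turns a str (Unicode string) into bytes; decoding reverses it.
text = "Hi"
encoded = text.encode("ascii")        # the two bytes 0x48 0x69
bits = "".join(f"{byte:08b}" for byte in encoded)
round_trip = encoded.decode("ascii")  # back to the original str

# 2. UTF-8 uses three bytes for the Telugu letter from the article.
decoded = b"\xe0\xb0\x86".decode("utf-8")
code_point = f"U+{ord(decoded):04X}"

# ASCII has no mapping for that character, so encoding it fails.
try:
    decoded.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# 3. A raw string keeps backslashes literal; a normal string must double them.
raw_path = r"C:\temp\new"
escaped_path = "C:\\temp\\new"

# 4. A single quote inside a raw string: switch to double-quote delimiters,
#    because in r'ttt\'g' the backslash itself stays in the string.
raw_quote = r"ttt'g"
raw_backslash_quote = r'ttt\'g'

print(encoded, bits, round_trip)
print(code_point, ascii_ok)
print(raw_path == escaped_path, len(raw_quote), len(raw_backslash_quote))
```

The last two assignments show why the raw string does not solve the quoting problem by itself: escaping the quote with a backslash is accepted by the parser, but the backslash remains part of the resulting string, so changing the delimiter (or using a normal escaped string) is usually the cleaner choice.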