Byte string, Unicode string, Raw string: A Guide to all strings in Python
Published in Towards Data Science

Differences, usages, Python versus NumPy versus Pandas?

(Photo by Hitesh Choudhary on Unsplash)

"String" in Python? It sounds like one of the most basic topics, one that every Python programmer should have mastered in their first Python tutorial. However, do you know that there are at least four types of strings in primitive Python? Do you know how your strings are actually represented in NumPy, Pandas, or any other package? What are the differences and caveats you need to know? (See below)

Here, let me try to clear up some of that confusion based on my own learning experience. We are going to cover these topics:

- What are the concepts of "Encoding" and "Decoding"?
- What is a raw (r) string or a format (f) string, and when should I use them?
- What are the differences between NumPy/Pandas strings and primitive Python strings?

Byte string and Unicode string (default Python3 string): It's all about Encoding

To understand the differences between a byte string and a Unicode string, we first need to know what "Encoding" and "Decoding" are.

[Figure: Encoding and Decoding (Image by Author)]

To store human-readable characters on a computer, we need to encode them into bytes. Conversely, we need to decode the bytes back into human-readable characters to display them. A byte, in computer science, is a unit made of bits (0/1), commonly 8 of them. So the characters "Hi" are actually stored as "01001000 01101001" on the computer, which takes 2 bytes (16 bits). The rule that defines the encoding process is called an encoding scheme; commonly used ones include "ASCII", "UTF-8", etc. Now the question is: what do these encoding schemes look like?

"ASCII" converts each character into one byte. One byte consists of 8 bits and each bit holds 0 or 1, so a byte can take 2⁸ = 256 different values. Standard ASCII uses only 7 of those bits and therefore defines 2⁷ = 128 characters, which is more than enough for the 26 English letters plus some commonly used characters. See the "ASCII" table for full information.

However, even the 256 values a single byte allows are obviously not enough to store all the characters in the world. In light of that, people designed Unicode, in which each character is assigned a "code point". For instance, "H" is represented as code point "U+0048". According to Wikipedia, Unicode defines 144,697 characters. But again, a code point by itself cannot be stored on the computer, so we have "UTF-8" and other encoding schemes to convert code points into bytes. "UTF-8" means the minimum number of bits used to represent a character is 8 (a character takes between one and four bytes), so as you can guess, "UTF-16" means the minimum number of bits is 16. UTF-8 is far more popular than UTF-16, and it is compatible with the original ASCII standard (every ASCII character is still represented by the same single byte), so for this article and for most of your work, understanding UTF-8 is enough. See the "UTF-8" table for full information.

With the basic concepts understood, let's cover some practical coding tips in Python. In Python3, the default string is called a Unicode string (u string); you can think of it as human-readable characters. As explained above, you can encode it into a byte string (b string), and the byte string can be decoded back into the Unicode string.

u'Hi'.encode('ASCII')
> b'Hi'
b'\x48\x69'.decode('ASCII')
> 'Hi'

When Python displays a byte string, it shows every byte that corresponds to a printable ASCII character as that character, which is why the first result is human-readable (b'Hi'). The same byte string can also be written as hex codes (b'\x48\x69'), which can be found in any "ASCII" table. To wrap up this section, let's look at one "UTF-8" example; again, the hex codes for every character can be found in the UTF-8 table:

b'\xe0\xb0\x86'.decode('utf-8')
> 'ఆ'

Raw string

To start with this type of string, we just need to know one thing about the default Unicode string (u string): the backslash ("\") is a special character in a Unicode string, such that the character following it takes on a special meaning (e.g. \t, \n, etc). To ignore the special meaning of the backslash, we have the raw string (r string), in which a backslash is just a backslash and does not change the meaning of the characters that follow it.

[Figure: Unicode and Raw string (Image by Author)]

Here comes my personal suggestion: unless you need to define a regular expression pattern (see the example below), I suggest using the Unicode string with escaping (using a backslash to cancel the special character). As shown in the third example of the figure, we used a backslash to make sure we output a literal "\" instead of a tab ("\t").

Why would I recommend that? Because the raw string cannot really solve everything. For instance, how do you output a literal single quotation mark in a raw string?

r'ttt'g''
File "
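The encoding round trip and the raw-string behaviour described above can be checked with a short script. This is only a minimal sketch using the standard library. The variable names and the sample Windows-style path are made up for illustration and do not come from the original article.

```python
import re

# --- Encoding and decoding -------------------------------------------
text = "Hi"
ascii_bytes = text.encode("ascii")        # str -> bytes
round_trip = ascii_bytes.decode("ascii")  # bytes -> str

# The same two bytes written as hex escapes, and as a bit pattern.
same_bytes = ascii_bytes == b"\x48\x69"
bits = " ".join(f"{byte:08b}" for byte in ascii_bytes)
print(ascii_bytes, same_bytes, bits)

# A character outside ASCII needs several bytes in UTF-8.
utf8_bytes = b"\xe0\xb0\x86"
telugu = utf8_bytes.decode("utf-8")
code_point = f"U+{ord(telugu):04X}"
print(code_point, len(telugu), len(utf8_bytes))

# ASCII cannot represent that character, so encoding fails.
try:
    telugu.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
print("ASCII encoding failed:", ascii_failed)

# --- Raw strings versus escaped strings ------------------------------
escaped_path = "C:\\temp\\new"   # hypothetical path, backslashes escaped
raw_path = r"C:\temp\new"        # same text written as a raw string
print(escaped_path == raw_path)

tab_length = len("\t")           # one character: a tab
raw_tab_length = len(r"\t")      # two characters: backslash and t
print(tab_length, raw_tab_length)

# In a raw string, \' lets the literal continue past the quote,
# but the backslash stays in the resulting value.
raw_quote = r'it\'s'
escaped_quote = 'it\'s'
print(len(raw_quote), len(escaped_quote))

# Raw strings keep regular expression patterns readable.
numbers = re.findall(r"\d+", "a1b22c333")
print(numbers)
```

The last few lines show the caveat about quotes: a raw string cannot drop the backslash before a quote, so switching the enclosing quote type (for example r"it's") or using a normal escaped string is usually the simpler option.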