Byte string, Unicode string, Raw string: A Guide to all strings in Python


Published in Towards Data Science
By Guangyuan (Frank) Li, Bioinformatics PhD student at Cincinnati Children's Hospital Medical Center; GitHub: https://github.com/frankligy

Differences, usages, Python versus NumPy versus Pandas?

"String" in Python? It sounds like one of the most basic topics, something every Python programmer should have mastered in their first Python tutorial. However, do you know there are at least four types of strings in primitive Python? Do you know how your strings are actually represented in NumPy, Pandas, or other packages? What are the differences and caveats you need to know?

Here, let me try to clear up some of the confusion based on my own learning experience. We are going to cover these topics:

- What are the concepts of "Encoding" and "Decoding"?
- What is a raw (r) string or a format (f) string, and when should I use them?
- What are the differences between NumPy/Pandas strings and primitive Python strings?

Byte string and Unicode String (default Python3 string): It's all about Encoding

To understand the differences between a byte string and a Unicode string, we first need to know what "Encoding" and "Decoding" are.

[Figure: Encoding and Decoding (Image by Author)]

To store human-readable characters on a computer, we need to encode them into bytes. Conversely, we need to decode the bytes back into human-readable characters for display. A byte, in computer science, is a unit made of 0/1 bits, commonly of length 8. So the characters "Hi" are actually stored as "01001000 01101001" on the computer, which consumes 2 bytes (16 bits).

The rule that defines the encoding process is called an encoding scheme; commonly used ones include "ASCII", "UTF-8", etc. Now the question is, what do these encoding schemes look like?

"ASCII" converts each character into one byte. The standard uses only 7 of the 8 bits, so the total number of characters "ASCII" can represent is 2⁷ = 128 (a full byte could hold 2⁸ = 256 values). That is more than enough for the 26 English letters plus some commonly used characters. See the "ASCII" table for full information.

However, a single byte is obviously not enough to store all the characters in the world. In light of that, people designed Unicode, in which each character is assigned a "code point". For instance, "H" is represented as code point "U+0048". According to Wikipedia, Unicode includes 144,697 characters (as of version 14.0). But again, a code point cannot be stored directly by the computer, so we have "UTF-8" and other encoding schemes to convert the code point into bytes. "UTF-8" means the minimum number of bits used to represent a character is 8 (a character takes one to four bytes), so you can guess that "UTF-16" means the minimum number of bits is 16. UTF-8 is far more popular than UTF-16, and it is compatible with the original ASCII standard (every ASCII character is still represented by the same single byte), so for this article and for most of your work, understanding UTF-8 is enough. See the "UTF-8" table for full information.

With the basic concepts understood, let's cover some practical coding tips in Python. In Python3, the default string is called a Unicode string (u string); you can think of it as human-readable characters. As explained above, you can encode it to a byte string (b string), and the byte string can be decoded back to the Unicode string.

    u'Hi'.encode('ASCII')
    > b'Hi'

    b'\x48\x69'.decode('ASCII')
    > 'Hi'

When a byte string is displayed, Python shows every byte that falls in the printable ASCII range as the corresponding character, which is why the first result looks human-readable (b'Hi'). Any byte can also be written as a hex escape (b'\x48\x69'), and bytes outside the printable range are always shown that way; the hex values can be found in any "ASCII" table.

To wrap up this section, let's look at one "UTF-8" example. Again, the hex code for every character can be found in the UTF-8 table:

    b'\xe0\xb0\x86'.decode('utf-8')
    > 'ఆ'

Raw string

To start with this type of string, we just need to know one thing about the default Unicode string (u string): the backslash ("\") is a special character in a Unicode string, such that the character following it takes on a special meaning (e.g. \t, \n, etc.). So in order to ignore the special meaning of the backslash, we have the raw string (r string), in which a backslash is just a backslash and does not change the meaning of the character that follows it.

[Figure: Unicode and Raw string (Image by Author)]

Here comes my personal suggestion: unless you need to define a regular expression match pattern (see the example below), I suggest using the Unicode string with escaping (using a backslash to cancel the special meaning of a character). As shown in the third example of the figure, we used a backslash to make sure we output a literal "\" instead of a tab "\t".

Why would I recommend that? Because the raw string cannot really solve everything. For instance, how do you output a literal single quote in a raw string that is itself delimited by single quotes?

    r'ttt'g''
      File "", line 1
        r'ttt'g''
              ^
    SyntaxError: invalid syntax

(Switching the outer delimiter to double quotes, r"ttt'g'", works in this particular case, but writing r'ttt\'g\'' does not help, because a raw string keeps the backslashes in the result.) Using the escape idea along with a Unicode string is a more general approach:

    u'ttt\'g\''
    > "ttt'g'"

The main place where a raw string (r string) is useful is when you are dealing with regular expressions. Regular expressions are a whole can of worms and I do not intend to cover them in this article. But when using a regular expression, we usually need to first define a match pattern, and that is where the raw string is recommended.

    import re
    pat = re.compile(r'ENSG\d+$')
    string = 'ENSG00000555'
    re.search(pat, string)
    > <re.Match object; span=(0, 12), match='ENSG00000555'>

Format string

For experienced Python programmers, the format string should not be an unfamiliar concept. It allows you to dynamically configure the string you want to print. Before Python version 3.6, the recommended approach for creating a format string looked like this:

    var = 'hello'
    print('{} world'.format(var))
    > hello world

Since Python 3.6, there is a new "f string" that achieves the same goal:

    var = 'hello'
    print(f'{var} world')
    > hello world

The important thing I want to note here is that, when using a format string, the curly brace "{}" becomes a special character with its own meaning. As a result, if we still want to output a literal "{}", we need to escape it using double curly braces "{{}}":

    '{{}}{}'.format(5)
    > '{}5'

String in Numpy and Pandas

What we have covered so far is all about primitive string types in Python; we haven't touched on how strings are handled in other popular Python packages. Here I am going to share a bit about string types in Numpy and Pandas.

In Numpy, a string can usually be specified with three different "dtypes":

- Fixed-width Unicode (U)
- Fixed-length byte (S)
- Python object (O)

    import numpy as np
    arr1 = np.array(['hello', 'hi', 'ha'], dtype='<U5')
    arr2 = np.array(['hello', 'hi', 'ha'], dtype='|S5')
    arr3 = np.array(['hello', 'hi', 'ha'], dtype='object')

    > array(['hello', 'hi', 'ha'], dtype='<U5')
    > array([b'hello', b'hi', b'ha'], dtype='|S5')
    > array(['hello', 'hi', 'ha'], dtype=object)

Here "<U5" means a little-endian Unicode string of at most 5 characters, and "|S5" means a byte string of at most 5 bytes.

In Pandas, a Series of strings is displayed with either the "object" dtype (s1) or the dedicated "string" dtype (s2):

    > s1
    0    hello
    1       hi
    2       ha
    dtype: object

    > s2
    0    hello
    1       hi
    2       ha
    dtype: string

These two types are in general similar; the subtle differences are outlined in the documentation.

Conclusion

In summary, we talked about the different representations of "string" in Python. Starting with the default Unicode string (u string), we touched on how it relates to the byte string (b string). Understanding the conversion is very important because the standard output from other programs sometimes arrives as bytes, and we need to decode it to a Unicode string before further processing. We then talked about the raw string (r string) and the format string (f string) and the caveats we need to pay attention to when using them. Finally, we summarised the different ways strings are represented in Numpy and Pandas. Special care should be taken when instantiating Numpy or Pandas objects with strings, because their behavior can differ considerably from that of primitive Python strings.

That's about it! I hope you find this article interesting and useful. Thanks for reading! If you like this article, follow me on Medium. You can also connect with me on Twitter or LinkedIn, and please let me know if you have any questions or what kind of tutorials you would like to see in the future.
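To tie the pieces above together, here is a minimal recap sketch using only the standard library. It walks through the encode/decode round trip, the equivalence between a raw string and its escaped form, and brace escaping in format strings. The variable names (text, encoded, byte_lengths, and so on) are illustrative choices and do not come from the article.

```python
# Unicode string -> byte string -> Unicode string round trip with UTF-8.
text = "Hi ఆ"
encoded = text.encode("utf-8")
decoded = encoded.decode("utf-8")

# ASCII characters take one byte each in UTF-8; 'ఆ' (U+0C06) takes three.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in text}

# A raw string keeps the backslash literal; the escaped form is the same string.
raw_pattern = r"ENSG\d+$"
escaped_pattern = "ENSG\\d+$"

# Doubled braces produce literal braces in both str.format and f-strings.
value = 5
formatted = "{{}}{}".format(value)
f_formatted = f"{{}}{value}"

print(encoded)                         # printable ASCII bytes shown as characters, the rest as \x escapes
print(decoded == text)                 # the round trip is lossless
print(byte_lengths)
print(raw_pattern == escaped_pattern)  # both spell the same regular expression
print(formatted, f_formatted)
```

Running this should show that the round trip returns the original text, and that the raw and escaped patterns compare equal, which is why the choice between them is largely a matter of readability.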


