Unicode character encodings - Python Morsels

All text that comes from outside of your Python process starts as binary data.

All input starts as raw bytes

When you open a file in Python, the default mode is r or rt, for read text mode:

    >>> with open("my_file.txt") as f:
    ...     contents = f.read()
    ...
    >>> f.mode
    'r'

That means that when we read our file, we'll get back strings that represent text:

    >>> contents
    'This is a file ✨\n'

But that's not what Python actually reads from disk.

If we open a file with the mode rb and read from our file, we'll see what Python sees; that is, bytes:

    >>> with open("my_file.txt", mode="rb") as f:
    ...     contents = f.read()
    ...
    >>> contents
    b'This is a file \xe2\x9c\xa8\n'
    >>> type(contents)
    <class 'bytes'>

Bytes are what Python decodes to make strings.

Encoding strings into bytes

If you have a string in Python and you'd like to convert it into bytes, you can call its encode method:

    >>> text = "Hello there! \u2728"
    >>> text.encode()
    b'Hello there! \xe2\x9c\xa8'

The encode method uses the character encoding utf-8 by default:

    >>> text.encode("utf-8")
    b'Hello there! \xe2\x9c\xa8'

But you can specify a different character encoding if you'd like:

    >>> text.encode("utf-16-le")
    b"H\x00e\x00l\x00l\x00o\x00 \x00t\x00h\x00e\x00r\x00e\x00!\x00 \x00('"

Decoding bytes into strings

If you have a bytes object and you'd like to convert it into a string, you need to decode it by calling its decode method:

    >>> data = b"Hello there! \xe2\x9c\xa8"
    >>> data.decode()
    'Hello there! ✨'

Like the string encode method, the bytes decode method uses the character encoding utf-8 by default:

    >>> data.decode("utf-8")
    'Hello there! ✨'

But if you have bytes that represent data in a different character encoding, you'll need to specify that character encoding instead:

    >>> data = b"H\x00e\x00l\x00l\x00o\x00 \x00t\x00h\x00e\x00r\x00e\x00!\x00 \x00('"
    >>> data.decode("utf-16le")
    'Hello there! ✨'
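To make the encode and decode steps above concrete, here is a minimal sketch that round-trips the same string through two encodings and compares sizes. The variable names are just for illustration; the point is that one string can have several different byte representations, and each one only decodes back correctly with the encoding that produced it.

```python
text = "Hello there! \u2728"

# The same 14 characters become different bytes in each encoding.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

print(len(text))         # 14 characters
print(len(utf8_bytes))   # 16 bytes: 13 ASCII characters + 3 bytes for the sparkles
print(len(utf16_bytes))  # 28 bytes: 2 bytes for every character in this string

# Decoding with the matching encoding gives back the original string.
print(utf8_bytes.decode("utf-8") == text)
print(utf16_bytes.decode("utf-16-le") == text)
```

Both comparisons print True, while the two bytes objects themselves are not equal to each other.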

Specifying a character encoding when opening files

When you open a file in Python, whether for writing or for reading, it's considered a best practice to specify the character encoding that you're working with:

    >>> with open("message.txt", mode="wt", encoding="utf-8") as f:
    ...     f.write("In Jan 2020 I said \u201cI'm glad I upgraded to Python 3\u201d.")
    ...
    53
    >>> with open("message.txt", mode="rt", encoding="utf-8") as f:
    ...     contents = f.read()
    ...
    >>> contents
    "In Jan 2020 I said “I'm glad I upgraded to Python 3”."

This is because on different operating systems, Python will use a different character encoding by default when it's working with text files.

On my machine, the default character encoding is utf-8. But on Windows, the default character encoding is usually cp1252.

Be careful with your character encodings

So if we read this UTF-8 file on a Windows machine without specifying an encoding, we would get a UnicodeDecodeError:

    >>> with open("message.txt", mode="rt") as f:
    ...     contents = f.read()
    ...
    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
      File "/usr/lib/python3.10/encodings/cp1252.py", line 23, in decode
        return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 55: character maps to <undefined>

The traceback for this UnicodeDecodeError is trying to tell us that there's a mismatch between the character encoding of the bytes that we're reading and the character encoding that Python is trying to use to read them.

But you can't rely on a UnicodeDecodeError always being raised when there's a character encoding mismatch. Sometimes two different encodings may use the same bytes to represent different text.

Here we've saved a file using the UTF-8 character encoding:

    >>> text = "Yay unicode! \N{SPARKLES}"
    >>> print(text)
    Yay unicode! ✨
    >>> with open("sparkles.txt", mode="wt", encoding="utf-8") as f:
    ...     f.write(text)
    ...
    14

If we read this file using the cp1252 character encoding, we'll see different text than what we started with:

    >>> with open("sparkles.txt", encoding="cp1252") as f:
    ...     contents = f.read()
    ...
    >>> contents
    'Yay unicode! âœ¨'

We used cp1252 to decode bytes that were encoded using utf-8 and ended up with mojibake.
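Both outcomes of an encoding mismatch can be reproduced on any operating system by decoding bytes in memory instead of opening files. The sketch below is only an illustration: decode_as_cp1252 is a hypothetical helper, not something from the standard library, and the two sample strings are shortened versions of the text used in the examples above.

```python
def decode_as_cp1252(text):
    """Encode text as UTF-8, then (wrongly) decode those bytes as cp1252.

    Hypothetical helper for illustration only.
    """
    data = text.encode("utf-8")
    try:
        return data.decode("cp1252")
    except UnicodeDecodeError as error:
        # Some UTF-8 bytes (such as 0x9d) have no meaning in cp1252.
        return f"UnicodeDecodeError: {error.reason}"


quoted = "\u201cI'm glad\u201d"          # curly quotes: mismatch raises an error
sparkles = "Yay unicode! \N{SPARKLES}"   # sparkles: mismatch silently "works"

print(decode_as_cp1252(quoted))
print(decode_as_cp1252(sparkles))
```

The first call reports an error because the closing curly quote encodes to a byte that cp1252 can't decode. The second call returns a string without complaint, but the sparkles character comes back as three unrelated characters.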
Mojibake like this is actually a really common problem between utf-8 (the default encoding on Linux/Mac) and cp1252 (the default encoding on Windows) in particular, because these two character encodings are very similar, but far from the same.

Summary

When you read a file, Python will read bytes from disk and then decode those bytes to make them into strings. When you write to a file, Python will take your strings and encode those strings into bytes to write them to disk.

It's considered a best practice to specify the character encoding that you're working with whenever you're reading or writing text from outside of your Python process, especially if you're working with non-ASCII text.
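As a recap of that best practice, here is a small self-contained sketch that writes and reads a file with an explicit encoding. It uses a temporary directory so that it doesn't leave files behind; the file name and variable names are assumptions made for this example.

```python
import tempfile
from pathlib import Path

text = "Yay unicode! \N{SPARKLES}"

with tempfile.TemporaryDirectory() as directory:
    path = Path(directory) / "sparkles.txt"

    # Writing: Python encodes our string into bytes using the encoding we name.
    with open(path, mode="wt", encoding="utf-8") as f:
        f.write(text)

    # Binary mode shows the bytes that actually ended up on disk.
    with open(path, mode="rb") as f:
        raw = f.read()

    # Reading: Python decodes those bytes using the encoding we name.
    with open(path, mode="rt", encoding="utf-8") as f:
        contents = f.read()

print(raw)
print(contents == text)
```

Because the encoding is spelled out in both open calls, the result doesn't depend on the operating system's default encoding.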


