Python Convert Unicode to Bytes, ASCII, UTF-8, Raw String


Python Convert Unicode to Bytes

Converting Unicode strings to bytes is quite common these days, because strings have to be converted to bytes before they can be written to files or fed into machine learning pipelines. Let's take a look at how this can be accomplished.

Method 1 Built-in function bytes()

A string can be converted to bytes with the built-in bytes() function. Given a string and an encoding, bytes() encodes the string in the same way as the string's own encode() method. Let's see how it works and check the data type right away:

>>> A = 'Hello'
>>> print(bytes(A, 'utf-8'), type(bytes(A, 'utf-8')))
b'Hello' <class 'bytes'>

The b prefix appeared, a sign that the result is a string of bytes. Unlike the following method, bytes() does not apply any encoding by default. The encoding has to be specified explicitly, otherwise the call raises TypeError: string argument without an encoding.

Method 2 Built-in function encode()

Perhaps the most common way to accomplish this task is to call the string's encode() method directly, without going through bytes().
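As a minimal sketch of the two routes just described, the snippet below converts the same string with bytes() and with encode() and checks that the results match. The helper names to_bytes_via_bytes and to_bytes_via_encode are made up for illustration; only bytes() and str.encode() come from the text.

```python
def to_bytes_via_bytes(text: str, encoding: str = "utf-8") -> bytes:
    """Hypothetical helper: convert with the bytes() constructor."""
    return bytes(text, encoding)


def to_bytes_via_encode(text: str, encoding: str = "utf-8") -> bytes:
    """Hypothetical helper: convert with the str.encode() method."""
    return text.encode(encoding)


sample = "Hello"
via_bytes = to_bytes_via_bytes(sample)
via_encode = to_bytes_via_encode(sample)

# bytes() refuses a string when no encoding is given.
try:
    bytes(sample)
    missing_encoding_error = None
except TypeError as exc:
    missing_encoding_error = str(exc)

print(via_bytes, via_encode, via_bytes == via_encode)
print(missing_encoding_error)
```

Both helpers return the same bytes object for the same encoding, so the choice between them is mostly a matter of style.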
The encode() method is applied to a Unicode string and returns a string of bytes. It takes two optional arguments: the encoding and an error handler. Any supported encoding can be used: ASCII, UTF-8 (used by default), UTF-16, latin-1, etc. Error handling can work in several ways:

strict: used by default, raises a UnicodeEncodeError (a subclass of UnicodeError) when a character is not supported by the encoding;
ignore: unsupported characters are skipped;
replace: unsupported characters are replaced with "?";
xmlcharrefreplace: unsupported characters are replaced with their XML character references;
backslashreplace: unsupported characters are replaced with sequences starting with a backslash;
namereplace: unsupported characters are replaced with sequences like \N{…};
surrogateescape: when decoding, each undecodable byte is replaced with a surrogate code point from U+DC80 to U+DCFF; when encoding, these surrogates are turned back into the original bytes;
surrogatepass: lets surrogate code points through instead of raising an error; used with the following encodings: utf-8, utf-16, utf-32, utf-16-be, utf-16-le, utf-32-be, utf-32-le.

Let's consider an example:

>>> A = '\u0048\u0065\u006C\u006C\u006F'
>>> print(A.encode())
b'Hello'

In this example we specified neither the encoding nor the error handler, so the defaults were used: UTF-8 encoding and the strict handler, which did not raise any errors. Relying on the defaults is discouraged, though. Other developers may work with encodings other than UTF-8, and an explicit argument leaves no doubt about which encoding is intended.

Python Convert Unicode to ASCII

Now let's look at methods for converting byte strings further. We want to end up with a string that contains only ASCII characters.
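Before going through the individual methods, here is a rough sketch of how the encoding-side error handlers behave when the target encoding is ASCII. The sample string and the variable names are illustrative assumptions; the handler names are the standard ones listed earlier.

```python
# 'Hello', a tab, then two CJK characters that ASCII cannot represent.
text = "Hello\t\u5316\u4eb1"

handlers = ["ignore", "replace", "xmlcharrefreplace",
            "backslashreplace", "namereplace"]

# Encode the same text once per handler and collect the results.
results = {name: text.encode("ascii", errors=name) for name in handlers}
for name, encoded in results.items():
    print(f"{name:18} {encoded!r}")

# With the default 'strict' handler the same call raises instead.
try:
    text.encode("ascii")
    strict_error = None
except UnicodeEncodeError as exc:
    # start/end give the slice of the input that could not be encoded.
    strict_error = (exc.start, exc.end)

print("strict failed on slice", strict_error)
```

Every handler except strict returns pure ASCII bytes; they differ only in how much information about the original characters survives.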
Method 1 Built-in function decode()

The decode() method, like encode(), takes two arguments: the encoding and the error handler. Let's see how it works:

>>> print(A.encode('ascii').decode('ascii'))
Hello

This method works well if the input Unicode string contains only ASCII characters, or if other developers are responsible and have explicitly declared the encoding in the header. But as soon as a code point outside the range from 0 to 127 appears, the method fails:

>>> A = '\u0048\u0065\u006C\u006C\u006F\t\u5316\u4EB1'
>>> print(A.encode('ascii').decode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 6-7: ordinal not in range(128)

You can use various error handlers in the encode() step, for example backslashreplace (to replace unsupported characters with sequences starting with backslashes) or namereplace (to insert sequences like \N{…}):

>>> A = '\u0048\u0065\u006C\u006C\u006F\t\u5316\u4EB1'
>>> print(A.encode('ascii', 'backslashreplace').decode('ascii'))
Hello	\u5316\u4eb1
>>> print(A.encode('ascii', 'namereplace').decode('ascii'))
Hello	\N{CJK UNIFIED IDEOGRAPH-5316}\N{CJK UNIFIED IDEOGRAPH-4EB1}

As a result, we can get an unexpected or uninformative answer, which can lead to further errors or to time wasted on additional processing.

Method 2 Module unidecode()

PyPI has a unidecode module. It exports a function that takes a Unicode string and returns a string that can be encoded into ASCII bytes in Python 3.x:

>>> from unidecode import unidecode
>>> print(unidecode(A))
Hello	Hua Ye

You can also pass an errors argument to unidecode(), which determines what to do with characters that are not present in its transliteration tables. The default is ignore, which means that Unidecode ignores these characters (replaces them with an empty string). strict raises a UnidecodeError; the exception object contains an index attribute that can be used to find the invalid character. replace substitutes "?" (or another string specified in the replace_str argument). preserve keeps the original non-ASCII character in the string. Note that if preserve is used, the string returned by unidecode() will not be ASCII-encodable!
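When installing a third-party package is not an option, a cruder approximation is possible with the standard library alone. The sketch below is not what unidecode does: it only decomposes characters with unicodedata.normalize() and drops whatever is still non-ASCII, so accented Latin letters lose their accents while CJK or Cyrillic text simply disappears. The function name ascii_fold is hypothetical.

```python
import unicodedata


def ascii_fold(text: str) -> str:
    """Hypothetical helper: strip accents, drop everything else non-ASCII."""
    # NFKD splits 'é' into 'e' plus a combining accent (U+0301).
    decomposed = unicodedata.normalize("NFKD", text)
    # The 'ignore' handler then discards the combining marks and any
    # other character that ASCII cannot represent.
    return decomposed.encode("ascii", "ignore").decode("ascii")


folded_latin = ascii_fold("Café déjà vu")
folded_cjk = ascii_fold("Hello\t\u5316\u4eb1")

print(folded_latin)
print(repr(folded_cjk))
```

For text that is mostly Latin-based this is often good enough; for anything else, a real transliteration table such as the one shipped with unidecode is needed.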
Python Convert Unicode to UTF-8

UTF-8 is the default encoding in Python and the most popular one, to the point of becoming a kind of standard. If we also assume that other developers treat it the same way and do not forget to declare the encoding in the script header, almost all string handling tasks boil down to encoding to and decoding from UTF-8. For this task, both of the above methods are applicable.

Method 1 Built-in function encode() and decode()

With encode(), we first get a byte string by applying UTF-8 encoding to the input Unicode string. We then call decode(), which gives back a Unicode string that is readable and can be printed or displayed to the user on the console.

>>> B = '\u0048\u0065\u006C\u006C\u006F\t\u5316\u4EB1\t\u041f\u0440\u0438\u0432\u0435\u0442'
>>> print(B.encode('utf-8').decode('utf-8'))
Hello	化亱	Привет

UTF-8 can represent every Unicode character, so it is hard to imagine text from popular applications, environments, or operating systems that it cannot encode. For this conversion the error handler can usually be left at its default.

Method 2 Module unidecode

The code points of the characters can also be listed as numbers with ord(), here converted to float for the first five characters of B:

>>> print(list(map(float, [ord(i) for i in B[:5]])))
[72.0, 101.0, 108.0, 108.0, 111.0]

Or we can use a for loop. Each value will be a float, since we explicitly convert to this type:

>>> for i in B[:5]:
...     print(float(ord(i)), end='')
...
72.0101.0108.0108.0111.0
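Returning to the encode()/decode() round trip from this section, the following sketch checks that the round trip is lossless and shows how many bytes UTF-8 spends on different characters. It also illustrates the one case where the default strict handler does matter for UTF-8, a lone surrogate code point. The variable names are illustrative assumptions.

```python
text = "Hello\t\u5316\u4eb1\t\u041f\u0440\u0438\u0432\u0435\u0442"

encoded = text.encode("utf-8")
decoded = encoded.decode("utf-8")

# The character count and the byte count differ, yet the round trip
# returns the original string.
print(len(text), len(encoded), decoded == text)

# UTF-8 is variable-width: ASCII takes 1 byte, Cyrillic 2, CJK 3.
widths = {ch: len(ch.encode("utf-8")) for ch in "H\u041f\u5316"}
print(widths)

# A lone surrogate is not a valid character, so strict encoding fails.
lone_surrogate = "\ud800"
try:
    lone_surrogate.encode("utf-8")
    surrogate_error = None
except UnicodeEncodeError as exc:
    surrogate_error = exc.reason

# The surrogatepass handler lets it through anyway.
passed = lone_surrogate.encode("utf-8", "surrogatepass")
print(surrogate_error, passed)
```

Ordinary text never contains lone surrogates, which is why the default handler is normally sufficient for UTF-8.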


