Python Convert Unicode to Bytes
Converting Unicode strings to bytes is quite common these days, because strings must be converted to bytes before they can be written to files or fed into machine learning pipelines. Let's take a look at how this can be accomplished.
Method 1 Built-in function bytes()
A string can be converted to bytes using the generic bytes() function. Internally, CPython runs the encoder for the encoding you specify to convert the string. Let's see how it works and immediately check the data type:
>>> A = 'Hello'
>>> print(bytes(A, 'utf-8'), type(bytes(A, 'utf-8')))
# b'Hello' <class 'bytes'>
A literal b appeared, a sign that this is a string of bytes. Unlike the following method, the bytes() function does not apply any encoding by default: it requires the encoding to be specified explicitly and otherwise raises TypeError: string argument without an encoding.
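As a small sketch of the point above (the variable names are only illustrative), the following shows that bytes() with an explicit encoding gives the same result as str.encode(), and that omitting the encoding raises TypeError:

```python
text = 'Hello'

# With an explicit encoding, bytes() and str.encode() produce the same bytes.
as_bytes = bytes(text, 'utf-8')
same_as_encode = as_bytes == text.encode('utf-8')

# Without an encoding, bytes() refuses a str argument.
try:
    bytes(text)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(as_bytes, same_as_encode)
print(error_message)
```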
Method 2 Built-in function encode()
Perhaps the most common way to accomplish this task is to call encode() on the string itself, which performs the conversion directly, without the extra step of going through bytes().
encode() is a method of str that returns a bytes object. It accepts two optional arguments: the target encoding and an error handler. Any supported encoding can be named: ASCII, UTF-8 (used by default), UTF-16, latin-1, etc. Error handling can work in several ways:
strict: used by default, raises a UnicodeError (specifically UnicodeEncodeError) when it meets a character that the encoding does not support;
ignore: unsupported characters are skipped;
replace: unsupported characters are replaced with "?";
xmlcharrefreplace: unsupported characters are replaced with their corresponding XML character references;
backslashreplace: unsupported characters are replaced with escape sequences starting with a backslash;
namereplace: unsupported characters are replaced with sequences like \N{…};
surrogateescape: on decoding, replaces each undecodable byte with a surrogate code from U+DC80 to U+DCFF; on encoding, turns those surrogates back into the original bytes;
surrogatepass: allows surrogate codes to be encoded and decoded instead of raising an error; used with the following encodings: utf-8, utf-16, utf-32, utf-16-be, utf-16-le, utf-32-be, utf-32-le.
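To make the handlers above concrete, here is a minimal sketch that encodes one short string to ASCII with several of them. The sample string and variable names are illustrative assumptions, not part of the original example:

```python
text = 'Hi \u5316'  # 'Hi ' followed by one CJK character that ASCII cannot represent

handlers = ['ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace', 'namereplace']
results = {name: text.encode('ascii', errors=name) for name in handlers}
for name, encoded in results.items():
    print(f'{name:18} {encoded}')

# 'strict' (the default) raises instead of substituting.
try:
    text.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# surrogateescape: an undecodable byte survives a decode/encode round trip.
raw = b'abc\xff'
decoded = raw.decode('utf-8', errors='surrogateescape')
round_trip = decoded.encode('utf-8', errors='surrogateescape')
print(strict_failed, ascii(decoded), round_trip)
```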
Let's consider an example:
>>> A = '\u0048\u0065\u006C\u006C\u006F'
>>> print(A.encode())
# b'Hello'
In this example we specified neither the encoding nor the error handler, so the defaults were used: UTF-8 encoding and the strict handler, which did not raise any errors. Relying on the defaults is highly discouraged, however, since other developers may use encodings other than UTF-8 without declaring them in the header, and the declared encoding may not match the actual content.
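A short sketch of why naming the encoding matters: the same string produces different bytes under different codecs, and decoding with the wrong codec silently yields garbage. The sample string and the choice of cp1251 and latin-1 are assumptions made for illustration only:

```python
text = '\u041f\u0440\u0438\u0432\u0435\u0442'  # 'Привет'

# The same text produces different byte sequences under different encodings.
utf8_bytes = text.encode('utf-8', errors='strict')
cp1251_bytes = text.encode('cp1251', errors='strict')
print(len(utf8_bytes), len(cp1251_bytes))

# Decoding with the wrong codec raises no error here, but does not recover the text.
wrong = utf8_bytes.decode('latin-1')
right = utf8_bytes.decode('utf-8')
print(wrong == text, right == text)
```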
Python Convert Unicode to ASCII
Now let's look at methods for converting byte strings further. We need to get a Unicode string that contains only ASCII characters.
Method 1 Built-in function decode()
The decode() method of bytes, like encode(), works with two arguments: the encoding and the error handler. Let's see how it works:
>>> print(A.encode('ascii').decode('ascii'))
# Hello
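The error handler argument of decode() only matters when the bytes are invalid for the chosen encoding. A minimal sketch, using a made-up byte string with one non-ASCII byte:

```python
raw = b'Hello\xff'  # 0xFF is not a valid ASCII byte

replaced = raw.decode('ascii', errors='replace')          # inserts U+FFFD
ignored = raw.decode('ascii', errors='ignore')            # drops the bad byte
escaped = raw.decode('ascii', errors='backslashreplace')  # keeps it as an escape

# The default 'strict' handler raises UnicodeDecodeError instead.
try:
    raw.decode('ascii')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(ascii(replaced), ignored, escaped, strict_failed)
```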
This method works well if the input Unicode string contains only ASCII characters, or if other developers were responsible and explicitly declared the encoding in the header, but as soon as a code point outside the range from 0 to 127 appears, the method fails:
>>> A = '\u0048\u0065\u006C\u006C\u006F\t\u5316\u4EB1'
>>> print(A.encode('ascii').decode('ascii'))
# UnicodeEncodeError: 'ascii' codec can't encode characters in position 6-7: ordinal not in range(128)
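If you would rather detect the problem than let it propagate, the exception itself reports where encoding failed. The sketch below is one possible way to inspect it; the variable names are illustrative:

```python
text = 'Hello\t\u5316\u4eb1'

# str.isascii() (Python 3.7+) is a cheap check before encoding.
can_encode = text.isascii()

try:
    text.encode('ascii')
    failure = None
except UnicodeEncodeError as exc:
    # start/end delimit the slice of the input that could not be encoded.
    failure = (exc.encoding, exc.start, exc.end, text[exc.start:exc.end])

print(can_encode, failure)
```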
You can use various error handlers, for example backslashreplace (to replace unsupported characters with sequences starting with backslashes) or namereplace (to insert sequences like \N{…}):
>>> A = '\u0048\u0065\u006C\u006C\u006F\t\u5316\u4EB1'
>>> print(A.encode('ascii', 'backslashreplace').decode('ascii', 'backslashreplace'))
# Hello \u5316\u4eb1
>>> print(A.encode('ascii', 'namereplace').decode('ascii', 'namereplace'))
# Hello \N{CJK UNIFIED IDEOGRAPH-5316}\N{CJK UNIFIED IDEOGRAPH-4EB1}
As a result, we may get an unexpected or uninformative answer, which can lead to further errors or to time wasted on additional processing.
Method 2 Module unidecode()
PyPI has a unidecode module. It exports a function that takes a Unicode string and returns a string that can be encoded into ASCII bytes in Python 3.x:
>>> from unidecode import unidecode
>>> print(unidecode(A))
# Hello Hua Ye
You can also provide an errors argument to unidecode(), which determines what to do with characters not present in its transliteration tables. The default is ignore, which means that Unidecode ignores these characters (replaces them with an empty string). strict will raise UnidecodeError; the exception object will contain an index attribute that can be used to find the invalid character. replace will replace them with "?" (or another string specified in the replace_str argument). preserve will keep the original non-ASCII character in the string. Note that if preserve is used, the string returned by unidecode() will not be ASCII-encodable!
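If installing a third-party package is not an option, a rough standard-library approximation is sometimes enough. Note that this is a different technique from unidecode: it only decomposes characters with unicodedata.normalize() and drops whatever is still non-ASCII, so accented Latin letters survive but CJK characters are simply lost. The helper name below is hypothetical:

```python
import unicodedata

def strip_accents_to_ascii(text):
    """Hypothetical helper: decompose characters, then drop anything non-ASCII."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', errors='ignore').decode('ascii')

latin = strip_accents_to_ascii('Cr\u00e8me br\u00fbl\u00e9e')  # 'Crème brûlée'
cjk = strip_accents_to_ascii('Hello\t\u5316\u4eb1')

print(latin)
print(repr(cjk))
```

Unlike unidecode, this sketch performs no transliteration, so it is only suitable when losing non-Latin text is acceptable.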
Python Convert Unicode to UTF-8
Because UTF-8 is the default encoding in Python and the most popular one, effectively a standard, and assuming that other developers treat it the same way and do not forget to declare the encoding in the script header, we can say that almost all string handling tasks boil down to encoding to and decoding from UTF-8.
For this task, both of the above methods are applicable.
Method 1 Built-in function encode() and decode()
With encode(), we first get a byte string by applying UTF-8 encoding to the input Unicode string, and then use decode(), which gives us back a Unicode string decoded from UTF-8 that is readable and can be printed or displayed to the user in the console.
>>> B = '\u0048\u0065\u006C\u006C\u006F\t\u5316\u4EB1\t\u041f\u0440\u0438\u0432\u0435\u0442'
>>> print(B.encode('utf-8').decode('utf-8'))
# Hello 化亱 Привет
Since it is difficult to imagine a character used in popular applications or operating environments that cannot be represented in UTF-8, the error handler can usually be left unspecified.
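One narrow exception is worth knowing about: lone surrogate code points, which are not valid characters on their own. The sketch below, with illustrative variable names, shows the normal round trip and then the surrogate case, where strict fails and surrogatepass succeeds:

```python
text = 'Hello\t\u5316\u4eb1\t\u041f\u0440\u0438\u0432\u0435\u0442'

# Ordinary text round-trips through UTF-8 without any error handler.
encoded = text.encode('utf-8')
round_trip_ok = encoded.decode('utf-8') == text
print(len(text), len(encoded), round_trip_ok)

# A lone surrogate is rejected by the default 'strict' handler...
lone_surrogate = '\ud800'
try:
    lone_surrogate.encode('utf-8')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# ...but 'surrogatepass' lets it through.
passed = lone_surrogate.encode('utf-8', errors='surrogatepass')
print(strict_failed, passed)
```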
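Method 2 Module unidecode
The numeric code points of the characters can also be listed, here for the first five characters of B and converted to float:
>>> print(list(map(float, [ord(i) for i in B[:5]])))
# [72.0, 101.0, 108.0, 108.0, 111.0]
Or we can use a for loop, and the data type of each value will be float, since we explicitly asked for conversion to this type:
>>> for i in B[:5]:
...     print(float(ord(i)), end='')
# 72.0101.0108.0108.0111.0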