Unicode & Character Encodings in Python: A Painless Guide

2025-01-23

by Brad Solomon

Table of Contents

- What's a Character Encoding?
- The string Module
- A Bit of a Refresher
- We Need More Bits!
- Covering All the Bases: Other Number Systems
- Enter Unicode
- Unicode vs UTF-8
- Encoding and Decoding in Python 3
- Python 3: All-In on Unicode
- One Byte, Two Bytes, Three Bytes, Four
- What About UTF-16 and UTF-32?
- Python's Built-In Functions
- Python String Literals: Ways to Skin a Cat
- Other Encodings Available in Python
- You Know What They Say About Assumptions…
- Odds and Ends: unicodedata
- Wrapping Up
- Resources

This tutorial has a related video course created by the Real Python team: Unicode in Python: Working With Character Encodings.

Handling character encodings in Python or any other language can at times seem painful. Places such as Stack Overflow have thousands of questions stemming from confusion over exceptions like UnicodeDecodeError and UnicodeEncodeError. This tutorial is designed to clear the Exception fog and illustrate that working with text and binary data in Python 3 can be a smooth experience. Python's Unicode support is strong and robust, but it takes some time to master.

This tutorial is different because it's not language-agnostic but instead deliberately Python-centric. You'll still get a language-agnostic primer, but you'll then dive into illustrations in Python, with text-heavy paragraphs kept to a minimum. You'll see how to use concepts of character encodings in live Python code.

By the end of this tutorial, you'll:

- Get conceptual overviews on character encodings and numbering systems
- Understand how encoding comes into play with Python's str and bytes
- Know about support in Python for numbering systems through its various forms of int literals
- Be familiar with Python's built-in functions related to character encodings and numbering systems

Character encoding and numbering systems are so closely connected that they need to be covered in the same tutorial or else the treatment of either would be totally inadequate.

Note: This article is Python 3-centric. Specifically, all code examples in this tutorial were generated from a CPython 3.7.2 shell, although all minor versions of Python 3 should behave (mostly) the same in their treatment of text.

If you're still using Python 2 and are intimidated by the differences in how Python 2 and Python 3 treat text and binary data, then hopefully this tutorial will help you make the switch.

What's a Character Encoding?

There are tens if not hundreds of character encodings. The best way to start understanding what they are is to cover one of the simplest character encodings, ASCII.

Whether you're self-taught or have a formal computer science background, chances are you've seen an ASCII table once or twice. ASCII is a good place to start learning about character encoding because it is a small and contained encoding. (Too small, as it turns out.)

It encompasses the following:

- Lowercase English letters: a through z
- Uppercase English letters: A through Z
- Some punctuation and symbols: "$" and "!", to name a couple
- Whitespace characters: an actual space (" "), as well as a newline, carriage return, horizontal tab, vertical tab, and a few others
- Some non-printable characters: characters such as backspace, "\b", that can't be printed literally in the way that the letter A can

So what is a more formal definition of a character encoding?

At a very high level, it's a way of translating characters (such as letters, punctuation, symbols, whitespace, and control characters) to integers and ultimately to bits. Each character can be encoded to a unique sequence of bits. Don't worry if you're shaky on the concept of bits, because we'll get to them shortly.

The various categories outlined represent groups of characters. Each single character has a corresponding code point, which you can think of as just an integer. Characters are segmented into different ranges within the ASCII table:

    Code Point Range    Class
    0 through 31        Control/non-printable characters
    32 through 64       Punctuation, symbols, numbers, and space
    65 through 90       Uppercase English alphabet letters
    91 through 96       Additional graphemes, such as [ and \
    97 through 122      Lowercase English alphabet letters
    123 through 126     Additional graphemes, such as { and |
    127                 Control/non-printable character (DEL)

The entire ASCII table contains 128 characters. This table captures the complete character set that ASCII permits. If you don't see a character here, then you simply can't express it as printed text under the ASCII encoding scheme.
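
As a rough illustration of the range table above, the sketch below maps a single character to its class by looking at its code point. The function name ascii_class and the short class labels are made up for this example and are not part of the standard library.

```python
def ascii_class(ch: str) -> str:
    """Return the range-table class for one ASCII character.

    Hypothetical helper for illustration only.
    """
    cp = ord(ch)  # ord() expects exactly one character
    if cp > 127:
        raise ValueError(f"{ch!r} is outside the ASCII table")
    if cp <= 31 or cp == 127:
        return "control"
    if cp <= 64:
        return "punctuation, symbol, digit, or space"
    if 65 <= cp <= 90:
        return "uppercase letter"
    if 97 <= cp <= 122:
        return "lowercase letter"
    # What remains is 91-96 and 123-126.
    return "additional grapheme"


for sample in ("A", "z", "7", " ", "[", "~", "\n"):
    print(repr(sample), ord(sample), ascii_class(sample))
```

Anything above code point 127 raises ValueError here, which mirrors the point that ASCII simply has no slot for those characters.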

ASCII Table

    Code Point  Character (Name)                 Code Point  Character (Name)
    0           NUL (Null)                       64          @
    1           SOH (Start of Heading)           65          A
    2           STX (Start of Text)              66          B
    3           ETX (End of Text)                67          C
    4           EOT (End of Transmission)        68          D
    5           ENQ (Enquiry)                    69          E
    6           ACK (Acknowledgment)             70          F
    7           BEL (Bell)                       71          G
    8           BS (Backspace)                   72          H
    9           HT (Horizontal Tab)              73          I
    10          LF (Line Feed)                   74          J
    11          VT (Vertical Tab)                75          K
    12          FF (Form Feed)                   76          L
    13          CR (Carriage Return)             77          M
    14          SO (Shift Out)                   78          N
    15          SI (Shift In)                    79          O
    16          DLE (Data Link Escape)           80          P
    17          DC1 (Device Control 1)           81          Q
    18          DC2 (Device Control 2)           82          R
    19          DC3 (Device Control 3)           83          S
    20          DC4 (Device Control 4)           84          T
    21          NAK (Negative Acknowledgment)    85          U
    22          SYN (Synchronous Idle)           86          V
    23          ETB (End of Transmission Block)  87          W
    24          CAN (Cancel)                     88          X
    25          EM (End of Medium)               89          Y
    26          SUB (Substitute)                 90          Z
    27          ESC (Escape)                     91          [
    28          FS (File Separator)              92          \
    29          GS (Group Separator)             93          ]
    30          RS (Record Separator)            94          ^
    31          US (Unit Separator)              95          _
    32          SP (Space)                       96          `
    33          !                                97          a
    34          "                                98          b
    35          #                                99          c
    36          $                                100         d
    37          %                                101         e
    38          &                                102         f
    39          '                                103         g
    40          (                                104         h
    41          )                                105         i
    42          *                                106         j
    43          +                                107         k
    44          ,                                108         l
    45          -                                109         m
    46          .                                110         n
    47          /                                111         o
    48          0                                112         p
    49          1                                113         q
    50          2                                114         r
    51          3                                115         s
    52          4                                116         t
    53          5                                117         u
    54          6                                118         v
    55          7                                119         w
    56          8                                120         x
    57          9                                121         y
    58          :                                122         z
    59          ;                                123         {
    60          <                                124         |
    61          =                                125         }
    62          >                                126         ~
    63          ?                                127         DEL (delete)

The string Module

Python's string module is a convenient one-stop-shop for string constants that fall in ASCII's character set.

Here's the core of the module in all its glory:

    # From lib/python3.7/string.py

    whitespace = ' \t\n\r\v\f'
    ascii_lowercase = 'abcdefghijklmnopqrstuvwxyz'
    ascii_uppercase = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    ascii_letters = ascii_lowercase + ascii_uppercase
    digits = '0123456789'
    hexdigits = digits + 'abcdef' + 'ABCDEF'
    octdigits = '01234567'
    punctuation = r"""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""
    printable = digits + ascii_letters + punctuation + whitespace

Most of these constants should be self-documenting in their identifier name. We'll cover what hexdigits and octdigits are shortly.

You can use these constants for everyday string manipulation:

    >>> import string

    >>> s = "What's wrong with ASCII?!?!?"
    >>> s.rstrip(string.punctuation)
    "What's wrong with ASCII"

Note: string.printable includes all of string.whitespace. This disagrees slightly with another method for testing whether a character is considered printable, namely str.isprintable(), which will tell you that none of {'\v', '\n', '\r', '\f', '\t'} are considered printable.

The subtle difference is because of definition: str.isprintable() considers something printable if "all of its characters are considered printable in repr()."

A Bit of a Refresher

Now is a good time for a short refresher on the bit, the most fundamental unit of information that a computer knows.

A bit is a signal that has only two possible states. There are different ways of symbolically representing a bit that all mean the same thing:

- 0 or 1
- "yes" or "no"
- True or False
- "on" or "off"

Our ASCII table from the previous section uses what you and I would just call numbers (0 through 127), but what are more precisely called numbers in base 10 (decimal).

You can also express each of these base-10 numbers with a sequence of bits (base 2). Here are the binary versions of 0 through 10 in decimal:

    Decimal  Binary (Compact)  Binary (Padded Form)
    0        0                 00000000
    1        1                 00000001
    2        10                00000010
    3        11                00000011
    4        100               00000100
    5        101               00000101
    6        110               00000110
    7        111               00000111
    8        1000              00001000
    9        1001              00001001
    10       1010              00001010

Notice that as the decimal number n increases, you need more significant bits to represent the character set up to and including that number.

Here's a handy way to represent ASCII strings as sequences of bits in Python. Each character from the ASCII string gets pseudo-encoded into 8 bits, with spaces in between the 8-bit sequences that each represent a single character:

    >>> def make_bitseq(s: str) -> str:
    ...     if not s.isascii():
    ...         raise ValueError("ASCII only allowed")
    ...     return " ".join(f"{ord(i):08b}" for i in s)
    ...
    >>> make_bitseq("bits")
    '01100010 01101001 01110100 01110011'

    >>> make_bitseq("CAPS")
    '01000011 01000001 01010000 01010011'

    >>> make_bitseq("$25.43")
    '00100100 00110010 00110101 00101110 00110100 00110011'

    >>> make_bitseq("~5")
    '01111110 00110101'

Note: .isascii() was introduced in Python 3.7.

The f-string f"{ord(i):08b}" uses Python's Format Specification Mini-Language, which is a way of specifying formatting for replacement fields in format strings:

- The left side of the colon, ord(i), is the actual object whose value will be formatted and inserted into the output. Using the Python ord() function gives you the base-10 code point for a single str character.
- The right-hand side of the colon is the format specifier. 08 means width 8, 0 padded, and the b functions as a sign to output the resulting number in base 2 (binary).

This trick is mainly just for fun, and it will fail very badly for any character that you don't see present in the ASCII table. We'll discuss how other encodings fix this problem later on.

We Need More Bits!

There's a critically important formula that's related to the definition of a bit. Given a number of bits, n, the number of distinct possible values that can be represented in n bits is 2^n:

    def n_possible_values(nbits: int) -> int:
        return 2 ** nbits

Here's what that means:

- 1 bit will let you express 2^1 == 2 possible values.
- 8 bits will let you express 2^8 == 256 possible values.
- 64 bits will let you express 2^64 == 18,446,744,073,709,551,616 possible values.

There's a corollary to this formula: given a range of distinct possible values, how can we find the number of bits, n, that is required for the range to be fully represented? What you're trying to solve for is n in the equation 2^n = x (where you already know x).

Here's what that works out to:

    >>> from math import ceil, log

    >>> def n_bits_required(nvalues: int) -> int:
    ...     return ceil(log(nvalues) / log(2))
    ...
    >>> n_bits_required(256)
    8

The reason that you need to use a ceiling in n_bits_required() is to account for values that are not clean powers of 2. Say you need to store a character set of 110 characters total. Naively, this should take log(110) / log(2) == 6.781 bits, but there's no such thing as 0.781 bits. 110 values will require 7 bits, not 6, with the final slots being unneeded:

    >>> n_bits_required(110)
    7

All of this serves to prove one concept: ASCII is, strictly speaking, a 7-bit code. The ASCII table that you saw above contains 128 code points and characters, 0 through 127 inclusive. This requires 7 bits:

    >>> n_bits_required(128)  # 0 through 127
    7
    >>> n_possible_values(7)
    128

The issue with this is that modern computers don't store much of anything in 7-bit slots. They traffic in units of 8 bits, conventionally known as a byte.

Note: Throughout this tutorial, I assume that a byte refers to 8 bits, as it has since the 1960s, rather than some other unit of storage. You are free to call this an octet if you prefer.

This means that the storage space used by ASCII is half-empty. If it's not clear why this is, think back to the decimal-to-binary table from above. You can express the numbers 0 and 1 with just 1 bit, or you can use 8 bits to express them as 00000000 and 00000001, respectively.

You can express the numbers 0 through 3 with just 2 bits, or 00 through 11, or you can use 8 bits to express them as 00000000, 00000001, 00000010, and 00000011, respectively. The highest ASCII code point, 127, requires only 7 significant bits.

Knowing this, you can see that make_bitseq() converts ASCII strings into a str representation of bytes, where every character consumes one byte:

    >>> make_bitseq("bits")
    '01100010 01101001 01110100 01110011'

ASCII's underutilization of the 8-bit bytes offered by modern computers led to a family of conflicting, informalized encodings that each specified additional characters to be used with the remaining 128 available code points allowed in an 8-bit character encoding scheme.
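
To make the conflict concrete, here is a small sketch that decodes one byte from the upper half of the 8-bit range under several single-byte encodings that ship with Python. The choice of byte (0xE9) and of codecs is mine, picked only for illustration.

```python
# One byte above 127, i.e. outside what ASCII itself defines.
raw = bytes([0xE9])

# A few single-byte codecs from Python's standard library.
CODECS = ("latin-1", "cp1252", "cp437", "cp1251")

# Decode the very same byte under each codec.
decoded = {codec: raw.decode(codec) for codec in CODECS}

for codec, char in decoded.items():
    print(f"{codec:>8}: {char!r} (code point {ord(char)})")

# The lower half (0 through 127) is where these codecs agree with ASCII.
assert all(b"A".decode(codec) == "A" for codec in CODECS)
```

The same byte comes out as different characters depending on which table the reader assumes, which is exactly the clash described above.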

Not only did these different encodings clash with each other, but each one of them was by itself still a grossly incomplete representation of the world's characters, regardless of the fact that they made use of one additional bit.

Over the years, one character encoding mega-scheme came to rule them all. However, before we get there, let's talk for a minute about numbering systems, which are a fundamental underpinning of character encoding schemes.

Covering All the Bases: Other Number Systems

In the discussion of ASCII above, you saw that each character maps to an integer in the range 0 through 127.

This range of numbers is expressed in decimal (base 10). It's the way that you, me, and the rest of us humans are used to counting, for no reason more complicated than that we have 10 fingers.

But there are other numbering systems as well that are especially prevalent throughout the CPython source code. While the "underlying number" is the same, all numbering systems are just different ways of expressing the same number.

If I asked you what number the string "11" represents, you'd be right to give me a strange look before answering that it represents eleven.

However, this string representation can express different underlying numbers in different numbering systems. In addition to decimal, the alternatives include the following common numbering systems:

- Binary: base 2
- Octal: base 8
- Hexadecimal (hex): base 16

But what does it mean for us to say that, in a certain numbering system, numbers are represented in base N?

Here is the best way that I know of to articulate what this means: it's the number of fingers that you'd count on in that system.

If you want a much fuller but still gentle introduction to numbering systems, Charles Petzold's Code is an incredibly cool book that explores the foundations of computer code in detail.
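
The "number of fingers" idea can also be spelled out as positional notation: each digit is worth its value times a power of the base. The sketch below does that by hand. The names DIGITS and to_int are invented for this example, and the built-in int(text, base) does the same job in real code.

```python
import string

# Digit symbols in order of value: '0'-'9', then 'a'-'z' (enough for base 36).
DIGITS = string.digits + string.ascii_lowercase


def to_int(text: str, base: int) -> int:
    """Interpret text as a number written in the given base (2 through 36)."""
    if not 2 <= base <= 36:
        raise ValueError("base must be between 2 and 36")
    total = 0
    for ch in text.lower():
        value = DIGITS.index(ch)  # raises ValueError for unknown symbols
        if value >= base:
            raise ValueError(f"digit {ch!r} is not valid in base {base}")
        # Shift what we have one position left, then add the new digit.
        total = total * base + value
    return total


for base in (2, 8, 10, 16):
    print(f'"11" in base {base} is {to_int("11", base)}')
```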

One way to demonstrate how different numbering systems interpret the same thing is with Python's int() constructor. If you pass a str to int(), Python will assume by default that the string expresses a number in base 10 unless you tell it otherwise:

    >>> int('11')
    11
    >>> int('11', base=10)  # 10 is already default
    11
    >>> int('11', base=2)  # Binary
    3
    >>> int('11', base=8)  # Octal
    9
    >>> int('11', base=16)  # Hex
    17

There's a more common way of telling Python that your integer is typed in a base other than 10. Python accepts literal forms of each of the 3 alternative numbering systems above:

    Type of Literal  Prefix    Example
    n/a              n/a       11
    Binary literal   0b or 0B  0b11
    Octal literal    0o or 0O  0o11
    Hex literal      0x or 0X  0x11

All of these are sub-forms of integer literals. You can see that these produce the same results, respectively, as the calls to int() with non-default base values. They're all just int to Python:

    >>> 11
    11
    >>> 0b11  # Binary literal
    3
    >>> 0o11  # Octal literal
    9
    >>> 0x11  # Hex literal
    17

Here's how you could type the binary, octal, and hexadecimal equivalents of the decimal numbers 0 through 20. Any of these are perfectly valid in a Python interpreter shell or source code, and all work out to be of type int:

    Decimal  Binary   Octal  Hex
    0        0b0      0o0    0x0
    1        0b1      0o1    0x1
    2        0b10     0o2    0x2
    3        0b11     0o3    0x3
    4        0b100    0o4    0x4
    5        0b101    0o5    0x5
    6        0b110    0o6    0x6
    7        0b111    0o7    0x7
    8        0b1000   0o10   0x8
    9        0b1001   0o11   0x9
    10       0b1010   0o12   0xa
    11       0b1011   0o13   0xb
    12       0b1100   0o14   0xc
    13       0b1101   0o15   0xd
    14       0b1110   0o16   0xe
    15       0b1111   0o17   0xf
    16       0b10000  0o20   0x10
    17       0b10001  0o21   0x11
    18       0b10010  0o22   0x12
    19       0b10011  0o23   0x13
    20       0b10100  0o24   0x14

Integer Literals in CPython Source

It's amazing just how prevalent these expressions are in the Python Standard Library. If you want to see for yourself, navigate to wherever your lib/python3.7/ directory sits, and check out the use of hex literals like this:

    $ grep -nri --include "*\.py" -e "\b0x" lib/python3.7

This should work on any Unix system that has grep. You could use "\b0o" to search for octal literals or "\b0b" to search for binary literals.
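
Going the other way, from an int back to text in a chosen base, can be sketched with bin(), oct(), hex(), and format specifiers. The helper name spellings is made up for this example. Passing base 0 to int() asks it to infer the base from the literal's prefix.

```python
def spellings(n: int) -> dict:
    """Return the decimal, binary, octal, and hex spellings of n as strings."""
    return {
        "decimal": str(n),
        "binary": bin(n),
        "octal": oct(n),
        "hex": hex(n),
    }


for name, text in spellings(17).items():
    # int(text, 0) reads the 0b/0o/0x prefix (or its absence) to pick the base.
    print(f"{name:>8}: {text:>8} -> {int(text, 0)}")

# Format specifiers give the same digits, with or without the prefix.
print(format(17, "b"), f"{17:#o}", f"{65536:#x}")
```

Every spelling round-trips to the same int, which is the sense in which these are only different ways of writing one number.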

What's the argument for using these alternate int literal syntaxes? In short, it's because 2, 8, and 16 are all powers of 2, while 10 is not. These three alternate number systems occasionally offer a way of expressing values in a computer-friendly manner. For example, the number 65536, or 2^16, is just 10000 in hexadecimal, or 0x10000 as a Python hexadecimal literal.

Enter Unicode

As you saw, the problem with ASCII is that it's not nearly a big enough set of characters to accommodate the world's set of languages, dialects, symbols, and glyphs. (It's not even big enough for English alone.)

Unicode fundamentally serves the same purpose as ASCII, but it just encompasses a way, way, way bigger set of code points. There are a handful of encodings that emerged chronologically between ASCII and Unicode, but they are not really worth mentioning just yet because Unicode and one of its encoding schemes, UTF-8, have become so predominantly used.

Think of Unicode as a massive version of the ASCII table, one that has 1,114,112 possible code points. That's 0 through 1,114,111, or 0 through 17 * (2^16) - 1, or 0x10ffff hexadecimal. In fact, ASCII is a perfect subset of Unicode. The first 128 characters in the Unicode table correspond precisely to the ASCII characters that you'd reasonably expect them to.

In the interest of being technically exacting, Unicode itself is not an encoding. Rather, Unicode is implemented by different character encodings, which you'll see soon. Unicode is better thought of as a map (something like a dict) or a 2-column database table. It maps characters (like "a", "¢", or even "ቈ") to distinct, positive integers. A character encoding needs to offer a bit more.

Unicode contains virtually every character that you can imagine, including additional non-printable ones too. One of my favorites is the pesky right-to-left mark, which has code point 8207 and is used in text with both left-to-right and right-to-left language scripts, such as an article containing both English and Arabic paragraphs.

Note: The world of character encodings is one of many fine-grained technical details over which some people love to nitpick. One such detail is that only 1,111,998 of the Unicode code points are actually usable, due to a couple of archaic reasons.
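
The numbers in this section can be checked from a Python shell. The sketch below confirms the size of the code space and walks the character-to-integer map with ord() and chr(). The function usable_code_points is a made-up name, and its arithmetic reflects one common accounting for the 1,111,998 figure (excluding surrogates and noncharacters), which is my assumption about what the note refers to.

```python
import sys

# sys.maxunicode is the highest code point Python's str can hold.
TOTAL_CODE_POINTS = sys.maxunicode + 1
print(hex(sys.maxunicode), TOTAL_CODE_POINTS)

# ord() maps a character to its code point; chr() maps it back.
for ch in ("a", "\u00a2", "\u200f"):  # "a", cent sign, right-to-left mark
    cp = ord(ch)
    assert chr(cp) == ch
    print(f"U+{cp:04X} is {cp} in decimal")


def usable_code_points() -> int:
    """One accounting of 'usable' code points (an assumption, see lead-in)."""
    surrogates = len(range(0xD800, 0xE000))  # reserved for UTF-16 pairs
    # Noncharacters: U+FDD0 through U+FDEF, plus the last two code points
    # of each of the 17 planes.
    noncharacters = 32 + 2 * 17
    return TOTAL_CODE_POINTS - surrogates - noncharacters


print(usable_code_points())
```

Asking for a code point past the end of the table, such as chr(0x110000), raises ValueError.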

Unicode vs UTF-8

It didn't take long for people to realize that all of the world's characters could not be packed into one byte each. It's evident from this that modern, more comprehensive encodings would need to use multiple bytes to encode some characters.

You also saw above that Unicode is not technically a full-blown character encoding. Why is that?

There is one thing that Unicode doesn't tell you: it doesn't tell you how to get actual bits from text, just code points. It doesn't tell you enough about how to convert text to binary data and vice versa.

Unicode is an abstract encoding standard, not an encoding. That's where UTF-8 and other encoding schemes come into play. The Unicode standard (a map of characters to code points) defines several different encodings from its single character set.

UTF-8 as well as its lesser-used cousins, UTF-16 and UTF-32, are encoding formats for representing Unicode characters as binary data of one or more bytes per character. We'll discuss UTF-16 and UTF-32 in a moment, but UTF-8 has taken the largest share of the pie by far.

That brings us to a definition that is long overdue. What does it mean, formally, to encode and decode?

Encoding and Decoding in Python 3

Python 3's str type is meant to represent human-readable text and can contain any Unicode character.

The bytes type, conversely, represents binary data, or sequences of raw bytes, that do not intrinsically have an encoding attached to them.

Encoding and decoding is the process of going from one to the other:

[Figure: Encoding vs decoding (Image: Real Python)]

In .encode() and .decode(), the encoding parameter is "utf-8" by default, though it's generally safer and more unambiguous to specify it:

    >>> "résumé".encode("utf-8")
    b'r\xc3\xa9sum\xc3\xa9'
    >>> "El Niño".encode("utf-8")
    b'El Ni\xc3\xb1o'

    >>> b"r\xc3\xa9sum\xc3\xa9".decode("utf-8")
    'résumé'
    >>> b"El Ni\xc3\xb1o".decode("utf-8")
    'El Niño'

The result of str.encode() is a bytes object. Both bytes literals (such as b"r\xc3\xa9sum\xc3\xa9") and the representations of bytes permit only ASCII characters.

This is why, when calling "El Niño".encode("utf-8"), the ASCII-compatible "El" is allowed to be represented as it is, but the n with tilde is escaped to "\xc3\xb1". That messy-looking sequence represents two bytes, 0xc3 and 0xb1 in hex:

    >>> " ".join(f"{i:08b}" for i in (0xc3, 0xb1))
    '11000011 10110001'

That is, the character ñ requires two bytes for its binary representation under UTF-8.

Note: If you type help(str.encode), you'll see a default of encoding='utf-8'. In Python 3 that default is the same on every platform, Windows included. What has varied across platforms and versions is the encoding used for files and the console. Passing the encoding explicitly, rather than writing just "résumé".encode(), still makes the intent unmistakable to readers.

Python 3: All-In on Unicode

Python 3 is all-in on Unicode and UTF-8 specifically. Here's what that means:

- Python 3 source code is assumed to be UTF-8 by default. This means that you don't need # -*- coding: UTF-8 -*- at the top of .py files in Python 3.
- All text (str) is Unicode by default. Encoded Unicode text is represented as binary data (bytes). The str type can contain any literal Unicode character, such as "Δv / Δt", all of which will be stored as Unicode.
- Python 3 accepts many Unicode code points in identifiers, meaning résumé = "~/Documents/resume.pdf" is valid if this strikes your fancy.
- Python's re module defaults to the re.UNICODE flag rather than re.ASCII. This means, for instance, that r"\w" matches Unicode word characters, not just ASCII letters.
- The default encoding in str.encode() and bytes.decode() is UTF-8.

There is one other property that is more nuanced, which is that the default encoding to the built-in open() is platform-dependent and depends on the value of locale.getpreferredencoding():

    >>> # Mac OS X High Sierra
    >>> import locale
    >>> locale.getpreferredencoding()
    'UTF-8'

    >>> # Windows Server 2012; other Windows builds may use UTF-16
    >>> import locale
    >>> locale.getpreferredencoding()
    'cp1252'

Again, the lesson here is to be careful about making assumptions when it comes to the universality of UTF-8, even if it is the predominant encoding. It never hurts to be explicit in your code.

One Byte, Two Bytes, Three Bytes, Four

A crucial feature is that UTF-8 is a variable-length encoding. It's tempting to gloss over what this means, but it's worth delving into.
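
As a quick preview of what "variable-length" looks like in practice, the sketch below encodes four sample characters and prints how many bytes each one takes, along with the bits of each byte. The sample characters and the helper name utf8_length are my own choices for illustration.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes one character occupies once encoded as UTF-8."""
    return len(ch.encode("utf-8"))


# Escapes are used so the example does not depend on the file's own encoding.
SAMPLES = [
    "A",           # U+0041, plain ASCII
    "\u00f1",      # n with tilde
    "\u2030",      # per mille sign
    "\U0001F613",  # an emoji
]

for ch in SAMPLES:
    encoded = ch.encode("utf-8")
    bits = " ".join(f"{byte:08b}" for byte in encoded)
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} byte(s): {bits}")
```

In the printed bits, the leading byte of a multi-byte sequence starts with as many 1s as there are bytes in the sequence, and each following byte starts with 10.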

Think back to the section on ASCII. Everything in extended-ASCII-land demands at most one byte of space. You can quickly prove this with the following generator expression:

    >>> all(len(chr(i).encode("ascii")) == 1 for i in range(128))
    True

UTF-8 is quite different. A given Unicode character can occupy anywhere from one to four bytes. Here's an example of a single Unicode character taking up four bytes:

    >>> ibrow = "🤨"
    >>> len(ibrow)
    1
    >>> ibrow.encode("utf-8")
    b'\xf0\x9f\xa4\xa8'
    >>> len(ibrow.encode("utf-8"))
    4

    >>> # Calling list() on a bytes object gives you
    >>> # the decimal value for each byte
    >>> list(b'\xf0\x9f\xa4\xa8')
    [240, 159, 164, 168]

This is a subtle but important feature of len():

- The length of a single Unicode character as a Python str will always be 1, no matter how many bytes it occupies.
- The length of the same character encoded to bytes will be anywhere between 1 and 4.

The table below summarizes what general types of characters fit into each byte-length bucket:

    Decimal Range     Hex Range                     What's Included                                       Examples
    0 to 127          "\u0000" to "\u007F"          U.S. ASCII                                            "A", "\n", "7", "&"
    128 to 2047       "\u0080" to "\u07FF"          Most Latinic alphabets*                               "ę", "±", "ƌ", "ñ"
    2048 to 65535     "\u0800" to "\uFFFF"          Additional parts of the multilingual plane (BMP)**    "ത", "ᄇ", "ᮈ", "‰"
    65536 to 1114111  "\U00010000" to "\U0010FFFF"  Other***                                              "𝕂", "𐀀", "😓", "🂲"

*Such as English, Arabic, Greek, and Irish
**A huge array of languages and symbols, mostly Chinese, Japanese, and Korean by volume (also ASCII and Latin alphabets)
***Additional Chinese, Japanese, Korean, and Vietnamese characters, plus more symbols and emojis

Note: In the interest of not losing sight of the big picture, there is an additional set of technical features of UTF-8 that aren't covered here because they are rarely visible to a Python user.

For instance, UTF-8 actually uses prefix codes that indicate the number of bytes in a sequence. This enables a decoder to tell what bytes belong together in a variable-length encoding, and lets the first byte serve as an indicator of the number of bytes in the coming sequence.

Wikipedia's UTF-8 article does not shy away from technical detail, and there is always the official Unicode Standard for your reading enjoyment as well.

What About UTF-16 and UTF-32?

Let's get back to two other encoding variants, UTF-16 and UTF-32.

The difference between these and UTF-8 is substantial in practice. Here's an example of how major the difference is with a round-trip conversion:

    >>> letters = "αβγδ"
    >>> rawdata = letters.encode("utf-8")
    >>> rawdata.decode("utf-8")
    'αβγδ'
    >>> rawdata.decode("utf-16")  # 😧
    '뇎닎돎듎'

In this case, encoding four Greek letters with UTF-8 and then decoding back to text in UTF-16 would produce a text str that is in a completely different language (Korean).

Glaringly wrong results like this are possible when the same encoding isn't used bidirectionally. Two variations of decoding the same bytes object may produce results that aren't even in the same language.

This table summarizes the range or number of bytes under UTF-8, UTF-16, and UTF-32:

    Encoding  Bytes Per Character (Inclusive)  Variable Length
    UTF-8     1 to 4                           Yes
    UTF-16    2 to 4                           Yes
    UTF-32    4                                No

One other curious aspect of the UTF family is that UTF-8 will not always take up less space than UTF-16. That may seem mathematically counterintuitive, but it's quite possible:

    >>> text = "記者 鄭啟源 羅智堅"
    >>> len(text.encode("utf-8"))
    26
    >>> len(text.encode("utf-16"))
    22

The reason for this is that the code points in the range U+0800 through U+FFFF (2048 through 65535 in decimal) take up three bytes in UTF-8 versus only two in UTF-16.

I'm not by any means recommending that you jump aboard the UTF-16 train, regardless of whether or not you operate in a language whose characters are commonly in this range. Among other reasons, one of the strong arguments for using UTF-8 is that, in the world of encoding, it's a great idea to blend in with the crowd.

Not to mention, it's 2019: computer memory is cheap, so saving 4 bytes by going out of your way to use UTF-16 is arguably not worth it.

Python's Built-In Functions

You've made it through the hard part. Time to use what you've seen thus far in Python.

Python has a group of built-in functions that relate in some way to numbering systems and character encoding:

- ascii()
- bin()
- bytes()
- chr()
- hex()
- int()
- oct()
- ord()
- str()

These can be logically grouped together based on their purpose:

- ascii(), bin(), hex(), and oct() are for obtaining a different representation of an input. Each one produces a str. The first, ascii(), produces an ASCII only representation of an object, with non-ASCII characters escaped. The remaining three give binary, hexadecimal, and octal representations of an integer, respectively. These are only representations, not a fundamental change in the input.
- bytes(), str(), and int() are class constructors for their respective types, bytes, str, and int. They each offer ways of coercing the input into the desired type. For instance, as you saw earlier, while int(11.0) is probably more common, you might also see int('11', base=16).
- ord() and chr() are inverses of each other in that the Python ord() function converts a str character to its base-10 code point, while chr() does the opposite.

Here's a more detailed look at each of these nine functions:

    Function  Signature                 Accepts                          Return Type  Purpose
    ascii()   ascii(obj)                Varies                           str          ASCII only representation of an object, with non-ASCII characters escaped
    bin()     bin(number)               number: int                      str          Binary representation of an integer, with the prefix "0b"
    bytes()   bytes(iterable_of_ints)   Varies                           bytes        Coerce (convert) the input to bytes, raw binary data
              bytes(s, enc[, errors])
              bytes(bytes_or_buffer)
              bytes([i])
    chr()     chr(i)                    i: int, i >= 0, i <= 1114111     str          Convert an integer code point to a single Unicode character
    hex()     hex(number)               number: int                      str          Hexadecimal representation of an integer, with the prefix "0x"
    int()     int([x])                  Varies                           int          Coerce (convert) the input to int
              int(x, base=10)
    oct()     oct(number)               number: int                      str          Octal representation of an integer, with the prefix "0o"
    ord()     ord(c)                    c: str, len(c) == 1              int          Convert a single Unicode character to its integer code point
    str()     str(object='')            Varies                           str          Coerce (convert) the input to str, text
              str(b[, enc[, errors]])

The sections below show examples of these functions.
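
The examples that follow do not include one for oct(), so here is a short sketch in the same spirit as the bin() and hex() examples. The variable names are arbitrary.

```python
# oct() gives the octal representation of an integer, with the prefix "0o".
print(oct(8))         # one "eight" and zero "ones"
print(oct(400))
print(oct(0xc0ffee))  # hex literal (int) in, octal string out

powers = [oct(i) for i in [1, 2, 4, 8, 16]]
print(powers)

# int() with base=8 accepts the "0o" prefix, so the round trip is lossless.
round_trip = int(oct(400), 8)
print(round_trip)
```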

Examples: ascii()

ascii() gives you an ASCII-only representation of an object, with non-ASCII characters escaped:

    >>> ascii("abcdefg")
    "'abcdefg'"

    >>> ascii("jalepeño")
    "'jalepe\\xf1o'"

    >>> ascii((1, 2, 3))
    '(1, 2, 3)'

    >>> ascii(0xc0ffee)  # Hex literal (int)
    '12648430'

Examples: bin()

bin() gives you a binary representation of an integer, with the prefix "0b":

    >>> bin(0)
    '0b0'

    >>> bin(400)
    '0b110010000'

    >>> bin(0xc0ffee)  # Hex literal (int)
    '0b110000001111111111101110'

    >>> [bin(i) for i in [1, 2, 4, 8, 16]]  # `int` + list comprehension
    ['0b1', '0b10', '0b100', '0b1000', '0b10000']

Examples: bytes()

bytes() coerces the input to bytes, representing raw binary data:

    >>> # Iterable of ints
    >>> bytes((104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100))
    b'hello world'

    >>> bytes(range(97, 123))  # Iterable of ints
    b'abcdefghijklmnopqrstuvwxyz'

    >>> bytes("real 🐍", "utf-8")  # String + encoding
    b'real \xf0\x9f\x90\x8d'

    >>> bytes(10)
    b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

    >>> bytes.fromhex('c0 ff ee')
    b'\xc0\xff\xee'

    >>> bytes.fromhex("72 65 61 6c 70 79 74 68 6f 6e")
    b'realpython'

Examples: chr()

chr() converts an integer code point to a single Unicode character:

    >>> chr(97)
    'a'

    >>> chr(7048)
    'ᮈ'

    >>> chr(1114111)
    '\U0010ffff'

    >>> chr(0x10FFFF)  # Hex literal (int)
    '\U0010ffff'

    >>> chr(0b01100100)  # Binary literal (int)
    'd'

Examples: hex()

hex() gives the hexadecimal representation of an integer, with the prefix "0x":

    >>> hex(100)
    '0x64'

    >>> [hex(i) for i in [1, 2, 4, 8, 16]]
    ['0x1', '0x2', '0x4', '0x8', '0x10']

    >>> [hex(i) for i in range(16)]
    ['0x0', '0x1', '0x2', '0x3', '0x4', '0x5', '0x6', '0x7',
     '0x8', '0x9', '0xa', '0xb', '0xc', '0xd', '0xe', '0xf']

Examples: int()

int() coerces the input to int, optionally interpreting the input in a given base:

    >>> int(11.0)
    11

    >>> int('11')
    11

    >>> int('11', base=2)
    3

    >>> int('11', base=8)
    9

    >>> int('11', base=16)
    17

    >>> int(0xc0ffee - 1.0)
    12648429

    >>> int.from_bytes(b"\x0f", "little")
    15

    >>> int.from_bytes(b'\xc0\xff\xee', "big")
    12648430

Examples: ord()

The Python ord() function converts a single Unicode character to its integer code point:

    >>> ord("a")
    97

    >>> ord("ę")
    281

    >>> ord("ᮈ")
    7048

    >>> [ord(i) for i in "hello world"]
    [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]

Examples: str()

str() coerces the input to str, representing text:

    >>> str("str of string")
    'str of string'

    >>> str(5)
    '5'

    >>> str([1, 2, 3, 4])  # Like [1, 2, 3, 4].__str__(), but use str()
    '[1, 2, 3, 4]'

    >>> str(b"\xc2\xbc cup of flour", "utf-8")
    '¼ cup of flour'

    >>> str(0xc0ffee)
    '12648430'

Python String Literals: Ways to Skin a Cat

Rather than using the str() constructor, it's commonplace to type a str literally:

    >>> meal = "shrimp and grits"

That may seem easy enough. But the interesting side of things is that, because Python 3 is Unicode-centric through and through, you can "type" Unicode characters that you probably won't even find on your keyboard. You can copy and paste this right into a Python 3 interpreter shell:

    >>> alphabet = 'αβγδεζηθικλμνξοπρςστυφχψ'
    >>> print(alphabet)
    αβγδεζηθικλμνξοπρςστυφχψ

Besides placing the actual, unescaped Unicode characters in the console, there are other ways to type Unicode strings as well.

One of the densest sections of Python's documentation is the portion on lexical analysis, specifically the section on string and bytes literals. Personally, I had to read this section about one, two, or maybe nine times for it to really sink in.

Part of what it says is that there are up to six ways that Python will allow you to type the same Unicode character.

The first and most common way is to type the character itself literally, as you've already seen. The tough part with this method is finding the actual keystrokes. That's where the other methods for getting and representing characters come into play. Here's the full list:

    Escape Sequence  Meaning                                            How To Express "a"
    "\ooo"           Character with octal value ooo                     "\141"
    "\xhh"           Character with hex value hh                        "\x61"
    "\N{name}"       Character named name in the Unicode database       "\N{LATIN SMALL LETTER A}"
    "\uxxxx"         Character with 16-bit (2-byte) hex value xxxx      "\u0061"
    "\Uxxxxxxxx"     Character with 32-bit (4-byte) hex value xxxxxxxx  "\U00000061"

Here's some proof and validation of the above:

    >>> (
    ...     "a" ==
    ...     "\x61" ==
    ...     "\N{LATIN SMALL LETTER A}" ==
    ...     "\u0061" ==
    ...     "\U00000061"
    ... )
    True

Now, there are two main caveats:

- Not all of these forms work for all characters. The hex representation of the integer 300 is 0x012c, which simply isn't going to fit into the 2-hex-digit escape code "\xhh". The highest code point that you can squeeze into this escape sequence is "\xff" ("ÿ"). Similarly for "\ooo", it will only work up to "\777" ("ǿ").
- For \xhh, \uxxxx, and \Uxxxxxxxx, exactly as many digits are required as are shown in these examples. This can throw you for a loop because of the way that Unicode tables conventionally display the codes for characters, with a leading U+ and variable number of hex characters. The key is that Unicode tables most often do not zero-pad these codes.

For instance, if you consult unicode-table.com for information on the Gothic letter faihu (or fehu), "𐍆", you'll see that it is listed as having the code U+10346.

How do you put this into "\uxxxx" or "\Uxxxxxxxx"? Well, you can't fit it in "\uxxxx" because it's a 4-byte character, and to use "\Uxxxxxxxx" to represent this character, you'll need to left-pad the sequence:

    >>> "\U00010346"
    '𐍆'

This also means that the "\Uxxxxxxxx" form is the only escape sequence that is capable of holding any Unicode character.

Note: Here's a short function to convert strings that look like "U+10346" into something Python can work with. It uses str.zfill():

    >>> def make_uchr(code: str):
    ...     return chr(int(code.lstrip("U+").zfill(8), 16))
    ...
    >>> make_uchr("U+10346")
    '𐍆'
    >>> make_uchr("U+0026")
    '&'

Other Encodings Available in Python

So far, you've seen four character encodings:

- ASCII
- UTF-8
- UTF-16
- UTF-32

There are a ton of other ones out there.

One example is Latin-1 (also called ISO-8859-1), which is technically the default for the Hypertext Transfer Protocol (HTTP), per RFC 2616. Windows has its own Latin-1 variant called cp1252.

Note: ISO-8859-1 is still very much present out in the wild. The requests library follows RFC 2616 "to the letter" in using it as the default encoding for the content of an HTTP or HTTPS response. If the word "text" is found in the Content-Type header, and no other encoding is specified, then requests will use ISO-8859-1.

The complete list of accepted encodings is buried way down in the documentation for the codecs module, which is part of Python's Standard Library.
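
A short sketch can show how the codecs module treats encoding names and where Latin-1 and cp1252 part ways. The alias list and the helper try_encode are chosen for this example only. codecs.lookup() and its .name attribute come from the standard library.

```python
import codecs

# Several spellings that Python resolves to the same Latin-1 codec.
ALIASES = ("latin-1", "ISO-8859-1", "latin_1", "L1")
canonical = {alias: codecs.lookup(alias).name for alias in ALIASES}
print(canonical)


def try_encode(text: str, encoding: str):
    """Return the encoded bytes, or None if the codec has no slot for text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


quarter = "\u00bc"  # vulgar fraction one quarter
euro = "\u20ac"     # euro sign

for encoding in ("latin-1", "cp1252", "utf-8"):
    print(encoding, try_encode(quarter, encoding), try_encode(euro, encoding))
```

The one-quarter sign lands on the same single byte in Latin-1 and cp1252, while the euro sign exists in cp1252 but not in Latin-1, which is one reason the two should not be treated as interchangeable.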
There’s one more useful recognized encoding to be aware of, which is "unicode-escape". If you have a decoded str and want to quickly get a representation of its escaped Unicode literal, then you can specify this encoding in .encode():

>>> alef = chr(1575)  # Or "\u0627"
>>> alef_hamza = chr(1571)  # Or "\u0623"
>>> alef, alef_hamza
('ا', 'أ')
>>> alef.encode("unicode-escape")
b'\\u0627'
>>> alef_hamza.encode("unicode-escape")
b'\\u0623'

You Know What They Say About Assumptions…

Just because Python makes the assumption of UTF-8 encoding for files and code that you generate doesn’t mean that you, the programmer, should operate with the same assumption for external data.

Let’s say that again because it’s a rule to live by: when you receive binary data (bytes) from a third party source, whether it be from a file or over a network, the best practice is to check that the data specifies an encoding. If it doesn’t, then it’s on you to ask.

All I/O happens in bytes, not text, and bytes are just ones and zeros to a computer until you tell it otherwise by informing it of an encoding.

Here’s an example of where things can go wrong. You’re subscribed to an API that sends you a recipe of the day, which you receive in bytes and have always decoded using .decode("utf-8") with no problem. On this particular day, part of the recipe looks like this:

>>> data = b"\xbc cup of flour"

It looks as if the recipe calls for some flour, but we don’t know how much:

>>> data.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbc in position 0: invalid start byte

Uh oh. There’s that pesky UnicodeDecodeError that can bite you when you make assumptions about encoding. You check with the API host. Lo and behold, the data is actually sent over encoded in Latin-1:

>>> data.decode("latin-1")
'¼ cup of flour'

There we go. In Latin-1, every character fits into a single byte, whereas the “¼” character takes up two bytes in UTF-8 ("\xc2\xbc").

The lesson here is that it can be dangerous to assume the encoding of any data that is handed off to you. It’s usually UTF-8 these days, but it’s the small percentage of cases where it’s not that will blow things up.
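One way to act on that lesson is to make the encoding an explicit input rather than a silent assumption. The sketch below is hypothetical: decode_payload() and its declared_encoding parameter are names invented for illustration, not part of any library or API discussed here. It uses the declared encoding when there is one, and otherwise tries strict UTF-8 and refuses to guess if that fails.

```python
from typing import Optional


def decode_payload(data: bytes, declared_encoding: Optional[str] = None) -> str:
    """Decode bytes using the encoding the sender declared.

    Hypothetical helper: with no declared encoding it tries strict UTF-8
    and raises instead of silently guessing.
    """
    if declared_encoding is not None:
        return data.decode(declared_encoding)
    try:
        return data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError as exc:
        raise ValueError(
            "payload is not valid UTF-8 and no encoding was declared; "
            "ask the data provider which encoding they use"
        ) from exc


recipe = b"\xbc cup of flour"

# With the encoding the provider confirmed, decoding succeeds.
print(decode_payload(recipe, declared_encoding="latin-1"))

# Without it, the helper fails loudly rather than producing mojibake.
try:
    decode_payload(recipe)
except ValueError as exc:
    print("Refused to guess:", exc)
```

Raising here is a design choice: a loud failure at the boundary is usually easier to debug than garbled text discovered much later.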
If you really do need to abandon ship and guess an encoding, then have a look at the chardet library, which uses methodology from Mozilla to make an educated guess about ambiguously encoded text. That said, a tool like chardet should be your last resort, not your first.

Odds and Ends: unicodedata

We would be remiss not to mention unicodedata from the Python Standard Library, which lets you interact with and do lookups on the Unicode Character Database (UCD):

>>> import unicodedata
>>> unicodedata.name("€")
'EURO SIGN'
>>> unicodedata.lookup("EURO SIGN")
'€'

Wrapping Up

In this article, you’ve decoded the wide and imposing subject of character encoding in Python.

You’ve covered a lot of ground here:

- Fundamental concepts of character encodings and numbering systems
- Integer, binary, octal, hex, str, and bytes literals in Python
- Python’s built-in functions related to character encoding and numbering systems
- Python 3’s treatment of text versus binary data

Now, go forth and encode!

Resources

For even more detail about the topics covered here, check out these resources:

- Joel Spolsky: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
- David Zentgraf: What every programmer absolutely, positively needs to know about encodings and character sets to work with text
- Mozilla: A composite approach to language/encoding detection
- Wikipedia: UTF-8
- Jon Skeet: Unicode and .NET
- Charles Petzold: Code: The Hidden Language of Computer Hardware and Software
- Network Working Group, RFC 3629: UTF-8, a transformation format of ISO 10646
- Unicode Technical Standard #18: Unicode Regular Expressions

The Python docs have two pages on the subject:

- What’s New in Python 3.0
- Unicode HOWTO

About Brad Solomon

Brad is a software engineer and a member of the Real Python Tutorial Team.
Each tutorial at Real Python is created by a team of developers so that it meets our high quality standards. The team members who worked on this tutorial are: Alex, Aldren, Joanna.