What is UTF-8 Encoding? A Guide for Non-Programmers

文章推薦指數: 80 %
投票人數:10人

UTF-8 is an encoding system for Unicode. It can translate any Unicode character to a matching unique binary string, and can also translate ... WhatisUTF-8Encoding?AGuideforNon-Programmers GetmoreinformationinourintrotoHTMLandCSSguideformarketers. GettheFreeGuide JamieJuviler Updated: November02,2021 Published: August10,2020 Text:itsimportanceontheinternetgoeswithoutsaying.It’sthefirst“T”in“HTTP”,theonly“T”in“HTML”,andvirtuallyeverywebsiteusesitsomehow,beitaURL,apieceofmarketingcopy,aproductreview,aviralTweet,orablogpost.(Hithere!) But,webtextmightnotactuallybeassimpleasyouthink.Considerthethousandsoflanguagesspokentoday,orallthepunctuationandsymbolswecanaddtoenhancethem,orthefactthatnewemojisarebeingcreatedtocaptureeveryhumanemotion.Howdowebsitesstoreandprocessallofthis? Thetruthis,evensomethingasbasicastextrequiresawell-coordinated,clearly-definedsystemtoappearinwebbrowsers.Inthispost,I’llexplainthebasicsofonetechnologycentraltotextontheweb,UTF-8.We’lllearnthebasicsoftextstorageandencoding,anddiscusshowithelpsputengagingwordsacrossyoursite. Beforewebegin,youshouldbefamiliarwiththebasicsofHTMLandreadytodiveintosomelightcomputerscience. WhatIsUTF-8? UTF-8standsfor“UnicodeTransformationFormat-8bits.”That’snothelpfultousyet,solet’srewindtothebasics. Binary:HowComputersStoreInformation Inordertostoreinformation,computersuseabinarysystem.Inbinary,alldataisrepresentedinsequencesof1sand0s.Themostbasicunitofbinaryisabit,whichisjustasingle1or0.Thenextlargestunitofbinary,abyte,consistsof8bits.Anexampleofabyteis“01101011”. Everydigitalassetyou’veeverencountered—fromsoftwaretomobileappstowebsitestoInstagramstories—isbuiltonthissystemofbytes,whicharestrungtogetherinawaythatmakessensetocomputers.Whenwerefertofilesizes,we’rereferencingthenumberofbytes.Forexample,akilobyteisroughlyonethousandbytes,andagigabyteisroughlyonebillionbytes. Textisoneofmanyassetsthatcomputersstoreandprocess.Textismadeupofindividualcharacters,eachofwhichisrepresentedincomputersbyastringofbits.Thesestringsareassembledtoformdigitalwords,sentences,paragraphs,romancenovels,andsoon. ASCII:ConvertingSymbolstoBinary TheAmericanStandardCodeforInformationInterchange(ASCII)wasanearlystandardizedencodingsystemfortext.Encodingistheprocessofconvertingcharactersinhumanlanguagesintobinarysequencesthatcomputerscanprocess. ASCII’slibraryincludeseveryupper-caseandlower-caseletterintheLatinalphabet(A,B,C…),everydigitfrom0to9,andsomecommonsymbols(like/,!,and?).Itassignseachofthesecharactersauniquethree-digitcodeandauniquebyte. ThetablebelowshowsexamplesofASCIIcharacterswiththeirassociatedcodesandbytes. Character ASCIICode BYTE A 065 01000001 a 097 01100001 B 066 01000010 b 098 01100010 Z 090 01011010 z 122 01111010 0 048 00110000 9 057 00111001 ! 033 00100001 ? 063 00111111 Justascharacterscometogethertoformwordsandsentencesinlanguage,binarycodedoessointextfiles.So,thesentence“Thequickbrownfoxjumpsoverthelazydog.”representedinASCIIbinarywouldbe: 0101010001101000011001010010000001110001 0111010101101001011000110110101100100000 0110001001110010011011110111011101101110 0010000001100110011011110111100000100000 0110101001110101011011010111000001110011 0010000001101111011101100110010101110010 0010000001110100011010000110010100100000 0110110001100001011110100111100100100000 01100100011011110110011100101110 Thatdoesn’tmeanmuchtoushumans,butit’sacomputer’sbreadandbutter. ThenumberofcharactersthatASCIIcanrepresentislimitedtothenumberofuniquebytesavailable,sinceeachcharactergetsonebyte.Ifyoudothemath,you’llfindthatthereare256differentwaysofgroupseight1sand0stogether.Thisgivesus256differentbytes,or256waystorepresentacharacterinASCII.WhenASCIIwasintroducedin1960,thiswasokay,sincedevelopersneededonly128bytestorepresentalltheEnglishcharactersandsymbolstheyneeded. But,ascomputingexpandedglobally,computersystemsbegantostoretextinlanguagesbesidesEnglish,manyofwhichusednon-ASCIIcharacters.Newsystemswerecreatedtomapotherlanguagestothesamesetof256uniquebytes,buthavingmultipleencodingsystemswasinefficientandconfusing.Developersneededabetterwaytoencodeallpossiblecharacterswithonesystem. Unicode:AWaytoStoreEverySymbol,Ever EnterUnicode,anencodingsystemthatsolvesthespaceissueofASCII.LikeASCII,Unicodeassignsauniquecode,calledacodepoint,toeachcharacter.However,Unicode’smoresophisticatedsystemcanproduceoveramillioncodepoints,morethanenoughtoaccountforeverycharacterinanylanguage. Unicodeisnowtheuniversalstandardforencodingallhumanlanguages.Andyes,itevenincludesemojis. Belowaresomeexamplesoftextcharactersandtheirmatchingcodepoints.Eachcodepointbeginswith“U”for“Unicode,”followedbyauniquestringofcharacterstorepresentthecharacter. Character Codepoint A U+0041 a U+0061 0 U+0030 9 U+0039 ! U+0021 Ø U+00D8 ڃ U+0683 ಚ U+0C9A 𠜎 U+2070E 😁 U+1F601   IfyouwanttolearnhowcodepointsaregeneratedandwhattheymeaninUnicode,checkoutthisin-depthexplanation. So,wenowhaveastandardizedwayofrepresentingeverycharacterusedbyeveryhumanlanguageinasinglelibrary.Thissolvestheissueofmultiplelabelingsystemsfordifferentlanguages—anycomputeronEarthcanuseUnicode. But,Unicodealonedoesn’tstorewordsinbinary.ComputersneedawaytotranslateUnicodeintobinarysothatitscharacterscanbestoredintextfiles.Here’swhereUTF-8comesin. UTF-8:TheFinalPieceofthePuzzle UTF-8isanencodingsystemforUnicode.ItcantranslateanyUnicodecharactertoamatchinguniquebinarystring,andcanalsotranslatethebinarystringbacktoaUnicodecharacter.Thisisthemeaningof“UTF”,or“UnicodeTransformationFormat.” ThereareotherencodingsystemsforUnicodebesidesUTF-8,butUTF-8isuniquebecauseitrepresentscharactersinone-byteunits.Rememberthatonebyteconsistsofeightbits,hencethe“-8”initsname. Morespecifically,UTF-8convertsacodepoint(whichrepresentsasinglecharacterinUnicode)intoasetofonetofourbytes.Thefirst256charactersintheUnicodelibrary—whichincludethecharacterswesawinASCII—arerepresentedasonebyte.CharactersthatappearlaterintheUnicodelibraryareencodedastwo-byte,three-byte,andeventuallyfour-bytebinaryunits. Belowisthesamecharactertablefromabove,withtheUTF-8outputforeachcharacteradded.Noticehowsomecharactersarerepresentedasjustonebyte,whileothersusemore. Character Codepoint UTF-8binaryencoding A U+0041 01000001 a U+0061 01100001 0 U+0030 00110000 9 U+0039 00111001 ! U+0021 00100001 Ø U+00D8 1100001110011000 ڃ U+0683 1101101010000011 ಚ U+0C9A 111000001011001010011010 𠜎 U+2070E 11110000101000001001110010001110 😁 U+1F601 11110000100111111001100010000001 WhywouldUTF-8convertsomecharacterstoonebyte,andothersuptofourbytes?Inshort,tosavememory.Byusinglessspacetorepresentmorecommoncharacters(i.e.ASCIIcharacters),UTF-8reducesfilesizewhileallowingforamuchlargernumberofless-commoncharacters.Theseless-commoncharactersareencodedintotwoormorebytes,butthisisokayifthey’restoredsparingly. SpatialefficiencyisakeyadvantageofUTF-8encoding.IfinsteadeveryUnicodecharacterwasrepresentedbyfourbytes,atextfilewritteninEnglishwouldbefourtimesthesizeofthesamefileencodedwithUTF-8. AnotherbenefitofUTF-8encodingisitsbackwardcompatibilitywithASCII.Thefirst128charactersintheUnicodelibrarymatchthoseintheASCIIlibrary,andUTF-8translatesthese128UnicodecharactersintothesamebinarystringsasASCII.Asaresult,UTF-8cantakeatextfileformattedbyASCIIandconvertittohuman-readabletextwithoutissue. UTF-8CharactersinWebDevelopment UTF-8isthemostcommoncharacterencodingmethodusedontheinternettoday,andisthedefaultcharactersetforHTML5.Over95%ofallwebsites,likelyincludingyourown,storecharactersthisway.Additionally,commondatatransfermethodsovertheweb,likeXMLandJSON,areencodedwithUTF-8standards. Sinceit’snowthestandardmethodforencodingtextontheweb,allyoursitepagesanddatabasesshoulduseUTF-8.AcontentmanagementsystemorwebsitebuilderwillsaveyourfilesinUTF-8formatbydefault,butit’sstillagoodideatomakesureyou’restickingtothisbestpractice. TextfilesencodedwithUTF-8mustindicatethistothesoftwareprocessingit.Otherwise,thesoftwarewon’tproperlytranslatethebinarybackintocharacters.InHTMLfiles,youmightseeastringofcodelikethefollowingnearthetop: ThistellsthebrowserthattheHTMLfileisencodedbyUTF-8,sothatthebrowsercantranslateitbacktolegibletext. UTF-8vs.UTF-16 AsImentioned,UTF-8isnottheonlyencodingmethodforUnicodecharacters—there’salsoUTF-16.Thesemethodsdifferinthenumberofbytestheyneedtostoreacharacter.UTF-8encodesacharacterintoabinarystringofone,two,three,orfourbytes.UTF-16encodesaUnicodecharacterintoastringofeithertwoorfourbytes. Thisdistinctionisevidentfromtheirnames.InUTF-8,thesmallestbinaryrepresentationofacharacterisonebyte,oreightbits.InUTF-16,thesmallestbinaryrepresentationofacharacteristwobytes,orsixteenbits. BothUTF-8andUTF-16cantranslateUnicodecharactersintocomputer-friendlybinaryandbackagain.However,theyarenotcompatiblewitheachother.Thesesystemsusedifferentalgorithmstomapcodepointstobinarystrings,sothebinaryoutputforanygivencharacterwilllookdifferentfrombothmethods: Character UTF-8binaryencoding UTF-16binaryencoding A 01000001 01000001110110000000111011011111 𠜎 11110000101000001001110010001110 01000001110110000000111011011111   UTF-8encodingispreferabletoUTF-16onthemajorityofwebsites,becauseituseslessmemory.RecallthatUTF-8encodeseachASCIIcharacterinjustonebyte.UTF-16mustencodethesesamecharactersineithertwoorfourbytes.ThismeansthatanEnglishtextfileencodedwithUTF-16wouldbeatleastdoublethesizeofthesamefileencodedwithUTF-8. UTF-16isonlymoreefficientthanUTF-8onsomenon-Englishwebsites.IfawebsiteusesalanguagewithcharactersfartherbackintheUnicodelibrary,UTF-8willencodeallcharactersasfourbytes,whereasUTF-16mightencodemanyofthesamecharactersasonlytwobytes.Still,ifyourpagesarefilledwithABCsand123s,stickwithUTF-8. DecodingtheWorldofUTF-8Encoding Thatwasalotofwordsaboutwords,solet’ssummarizewhatwe’vecovered: Computersstoredata,includingtextcharacters,asbinary(1sand0s). ASCIIwasanearlywayofencoding,ormappingcharacterstobinarycodesothatcomputerscouldstorethem.However,ASCIIdidnotprovideenoughroomfornon-Latincharactersandnumberstoberepresentedinbinary. Unicodewasthesolutiontothisproblem.Unicodeassignsaunique“codepoint”toeverycharacterineveryhumanlanguage. UTF-8isaUnicodecharacterencodingmethod.ThismeansthatUTF-8takesthecodepointforagivenUnicodecharacterandtranslatesitintoastringofbinary.Italsodoesthereverse,readinginbinarydigitsandconvertingthembacktocharacters. UTF-8iscurrentlythemostpopularencodingmethodontheinternetbecauseitcanefficientlystoretextcontaininganycharacter. UTF-16isanotherencodingmethod,butislessefficientforstoringtextfiles(exceptforthosewrittenincertainnon-Englishlanguages). Unicodetranslationisn’tsomethingmostofusneedtothinkaboutwhenbrowsingordesigningwebsites,andthat’sexactlythepoint—tocreateaseamlesstextprocessingsystemthatfunctionsforalllanguagesandwebbrowsers.Ifit’sworkingwell,youwon’tnoticeit. But,ifyoufindyourwebsite’spagesareusingupaninordinateamountofspace,orifyourtextislitteredwith▢sand�s,it’stimetoputyournewknowledgeofUTF-8towork. Topics: WebsiteDesign Don'tforgettosharethispost! RelatedArticles previous next 14DosandDon’tsofUsingWebsiteTemplates in2022 WhatisLiquidLayoutinInDesign?[+HowToUseIt] WhatisGracefulDegradation&WhyDoesitMatter? 19ExamplesofBadWebsiteDesignin2022[+WhatTheyGotWrong] User-CenteredDesign:WhatItIsandHowtoDoItRight 10BrochureWebsiteExamplestoInspireYouin2022[+HowtoMakeOne] 7BestDesignFeedbackToolsforDesigners&Developers[+WhyUseThem] 25WebinarLandingPageExamplestoCopyin2022[+DesignTips] 45WebsitePop-upExamplesThatGetClicks 15LetterheadExamplesWithLogostoInspireYours JoinThousandsofProfessionals ThanksforSubscribing! Close Getthetoolsandskillsneededtoimproveyourwebsite.SubscribetotheWebsiteBlog. Close We'recommittedtoyourprivacy.HubSpotusestheinformationyouprovidetoustocontactyouaboutourrelevantcontent,products,andservices.Youmayunsubscribefromthesecommunicationsatanytime.Formoreinformation,checkoutourPrivacyPolicy. PopupforDOWNLOADNOW:FREEWEBSITEREDESIGNWORKBOOK DOWNLOADNOW:FREEWEBSITEREDESIGNWORKBOOK FREEWEBSITEREDESIGNWORKBOOK Learnhowtoredesignyourwebsitewiththisfreeguide. DOWNLOADTHEFREEWORKBOOK DOWNLOADTHEFREEWORKBOOK



請為這篇文章評分?