What is UTF-8 Encoding? A Guide for Non-Programmers
文章推薦指數: 80 %
UTF-8 is an encoding system for Unicode. It can translate any Unicode character to a matching unique binary string, and can also translate ...
WhatisUTF-8Encoding?AGuideforNon-Programmers
GetmoreinformationinourintrotoHTMLandCSSguideformarketers.
GettheFreeGuide
JamieJuviler
Updated:
November02,2021
Published:
August10,2020
Text:itsimportanceontheinternetgoeswithoutsaying.It’sthefirst“T”in“HTTP”,theonly“T”in“HTML”,andvirtuallyeverywebsiteusesitsomehow,beitaURL,apieceofmarketingcopy,aproductreview,aviralTweet,orablogpost.(Hithere!)
But,webtextmightnotactuallybeassimpleasyouthink.Considerthethousandsoflanguagesspokentoday,orallthepunctuationandsymbolswecanaddtoenhancethem,orthefactthatnewemojisarebeingcreatedtocaptureeveryhumanemotion.Howdowebsitesstoreandprocessallofthis?
Thetruthis,evensomethingasbasicastextrequiresawell-coordinated,clearly-definedsystemtoappearinwebbrowsers.Inthispost,I’llexplainthebasicsofonetechnologycentraltotextontheweb,UTF-8.We’lllearnthebasicsoftextstorageandencoding,anddiscusshowithelpsputengagingwordsacrossyoursite.
Beforewebegin,youshouldbefamiliarwiththebasicsofHTMLandreadytodiveintosomelightcomputerscience.
WhatIsUTF-8?
UTF-8standsfor“UnicodeTransformationFormat-8bits.”That’snothelpfultousyet,solet’srewindtothebasics.
Binary:HowComputersStoreInformation
Inordertostoreinformation,computersuseabinarysystem.Inbinary,alldataisrepresentedinsequencesof1sand0s.Themostbasicunitofbinaryisabit,whichisjustasingle1or0.Thenextlargestunitofbinary,abyte,consistsof8bits.Anexampleofabyteis“01101011”.
Everydigitalassetyou’veeverencountered—fromsoftwaretomobileappstowebsitestoInstagramstories—isbuiltonthissystemofbytes,whicharestrungtogetherinawaythatmakessensetocomputers.Whenwerefertofilesizes,we’rereferencingthenumberofbytes.Forexample,akilobyteisroughlyonethousandbytes,andagigabyteisroughlyonebillionbytes.
Textisoneofmanyassetsthatcomputersstoreandprocess.Textismadeupofindividualcharacters,eachofwhichisrepresentedincomputersbyastringofbits.Thesestringsareassembledtoformdigitalwords,sentences,paragraphs,romancenovels,andsoon.
ASCII:ConvertingSymbolstoBinary
TheAmericanStandardCodeforInformationInterchange(ASCII)wasanearlystandardizedencodingsystemfortext.Encodingistheprocessofconvertingcharactersinhumanlanguagesintobinarysequencesthatcomputerscanprocess.
ASCII’slibraryincludeseveryupper-caseandlower-caseletterintheLatinalphabet(A,B,C…),everydigitfrom0to9,andsomecommonsymbols(like/,!,and?).Itassignseachofthesecharactersauniquethree-digitcodeandauniquebyte.
ThetablebelowshowsexamplesofASCIIcharacterswiththeirassociatedcodesandbytes.
Character
ASCIICode
BYTE
A
065
01000001
a
097
01100001
B
066
01000010
b
098
01100010
Z
090
01011010
z
122
01111010
0
048
00110000
9
057
00111001
!
033
00100001
?
063
00111111
Justascharacterscometogethertoformwordsandsentencesinlanguage,binarycodedoessointextfiles.So,thesentence“Thequickbrownfoxjumpsoverthelazydog.”representedinASCIIbinarywouldbe:
0101010001101000011001010010000001110001
0111010101101001011000110110101100100000
0110001001110010011011110111011101101110
0010000001100110011011110111100000100000
0110101001110101011011010111000001110011
0010000001101111011101100110010101110010
0010000001110100011010000110010100100000
0110110001100001011110100111100100100000
01100100011011110110011100101110
Thatdoesn’tmeanmuchtoushumans,butit’sacomputer’sbreadandbutter.
ThenumberofcharactersthatASCIIcanrepresentislimitedtothenumberofuniquebytesavailable,sinceeachcharactergetsonebyte.Ifyoudothemath,you’llfindthatthereare256differentwaysofgroupseight1sand0stogether.Thisgivesus256differentbytes,or256waystorepresentacharacterinASCII.WhenASCIIwasintroducedin1960,thiswasokay,sincedevelopersneededonly128bytestorepresentalltheEnglishcharactersandsymbolstheyneeded.
But,ascomputingexpandedglobally,computersystemsbegantostoretextinlanguagesbesidesEnglish,manyofwhichusednon-ASCIIcharacters.Newsystemswerecreatedtomapotherlanguagestothesamesetof256uniquebytes,buthavingmultipleencodingsystemswasinefficientandconfusing.Developersneededabetterwaytoencodeallpossiblecharacterswithonesystem.
Unicode:AWaytoStoreEverySymbol,Ever
EnterUnicode,anencodingsystemthatsolvesthespaceissueofASCII.LikeASCII,Unicodeassignsauniquecode,calledacodepoint,toeachcharacter.However,Unicode’smoresophisticatedsystemcanproduceoveramillioncodepoints,morethanenoughtoaccountforeverycharacterinanylanguage.
Unicodeisnowtheuniversalstandardforencodingallhumanlanguages.Andyes,itevenincludesemojis.
Belowaresomeexamplesoftextcharactersandtheirmatchingcodepoints.Eachcodepointbeginswith“U”for“Unicode,”followedbyauniquestringofcharacterstorepresentthecharacter.
Character
Codepoint
A
U+0041
a
U+0061
0
U+0030
9
U+0039
!
U+0021
Ø
U+00D8
ڃ
U+0683
ಚ
U+0C9A
𠜎
U+2070E
😁
U+1F601
IfyouwanttolearnhowcodepointsaregeneratedandwhattheymeaninUnicode,checkoutthisin-depthexplanation.
So,wenowhaveastandardizedwayofrepresentingeverycharacterusedbyeveryhumanlanguageinasinglelibrary.Thissolvestheissueofmultiplelabelingsystemsfordifferentlanguages—anycomputeronEarthcanuseUnicode.
But,Unicodealonedoesn’tstorewordsinbinary.ComputersneedawaytotranslateUnicodeintobinarysothatitscharacterscanbestoredintextfiles.Here’swhereUTF-8comesin.
UTF-8:TheFinalPieceofthePuzzle
UTF-8isanencodingsystemforUnicode.ItcantranslateanyUnicodecharactertoamatchinguniquebinarystring,andcanalsotranslatethebinarystringbacktoaUnicodecharacter.Thisisthemeaningof“UTF”,or“UnicodeTransformationFormat.”
ThereareotherencodingsystemsforUnicodebesidesUTF-8,butUTF-8isuniquebecauseitrepresentscharactersinone-byteunits.Rememberthatonebyteconsistsofeightbits,hencethe“-8”initsname.
Morespecifically,UTF-8convertsacodepoint(whichrepresentsasinglecharacterinUnicode)intoasetofonetofourbytes.Thefirst256charactersintheUnicodelibrary—whichincludethecharacterswesawinASCII—arerepresentedasonebyte.CharactersthatappearlaterintheUnicodelibraryareencodedastwo-byte,three-byte,andeventuallyfour-bytebinaryunits.
Belowisthesamecharactertablefromabove,withtheUTF-8outputforeachcharacteradded.Noticehowsomecharactersarerepresentedasjustonebyte,whileothersusemore.
Character
Codepoint
UTF-8binaryencoding
A
U+0041
01000001
a
U+0061
01100001
0
U+0030
00110000
9
U+0039
00111001
!
U+0021
00100001
Ø
U+00D8
1100001110011000
ڃ
U+0683
1101101010000011
ಚ
U+0C9A
111000001011001010011010
𠜎
U+2070E
11110000101000001001110010001110
😁
U+1F601
11110000100111111001100010000001
WhywouldUTF-8convertsomecharacterstoonebyte,andothersuptofourbytes?Inshort,tosavememory.Byusinglessspacetorepresentmorecommoncharacters(i.e.ASCIIcharacters),UTF-8reducesfilesizewhileallowingforamuchlargernumberofless-commoncharacters.Theseless-commoncharactersareencodedintotwoormorebytes,butthisisokayifthey’restoredsparingly.
SpatialefficiencyisakeyadvantageofUTF-8encoding.IfinsteadeveryUnicodecharacterwasrepresentedbyfourbytes,atextfilewritteninEnglishwouldbefourtimesthesizeofthesamefileencodedwithUTF-8.
AnotherbenefitofUTF-8encodingisitsbackwardcompatibilitywithASCII.Thefirst128charactersintheUnicodelibrarymatchthoseintheASCIIlibrary,andUTF-8translatesthese128UnicodecharactersintothesamebinarystringsasASCII.Asaresult,UTF-8cantakeatextfileformattedbyASCIIandconvertittohuman-readabletextwithoutissue.
UTF-8CharactersinWebDevelopment
UTF-8isthemostcommoncharacterencodingmethodusedontheinternettoday,andisthedefaultcharactersetforHTML5.Over95%ofallwebsites,likelyincludingyourown,storecharactersthisway.Additionally,commondatatransfermethodsovertheweb,likeXMLandJSON,areencodedwithUTF-8standards.
Sinceit’snowthestandardmethodforencodingtextontheweb,allyoursitepagesanddatabasesshoulduseUTF-8.AcontentmanagementsystemorwebsitebuilderwillsaveyourfilesinUTF-8formatbydefault,butit’sstillagoodideatomakesureyou’restickingtothisbestpractice.
TextfilesencodedwithUTF-8mustindicatethistothesoftwareprocessingit.Otherwise,thesoftwarewon’tproperlytranslatethebinarybackintocharacters.InHTMLfiles,youmightseeastringofcodelikethefollowingnearthetop:
延伸文章資訊
- 1HTML Unicode (UTF-8) Reference - W3Schools
Unicode is a character set. UTF-8 is encoding. Unicode is a list of characters with unique decima...
- 2UTF-8 - 维基百科,自由的百科全书
UTF-8(8-bit Unicode Transformation Format)是一種針對Unicode的可變長度字元編碼,也是一种前缀码。它可以用一至四个字节对Unicode字符集中的所有...
- 3UTF-8 encoder/decoder
About this tool. This tool uses utf8.js to UTF-8-encode any string you enter in the 'decoded' fie...
- 4Day27 Python 基礎- 字符轉編碼操作 - iT 邦幫忙
coding:utf-8 -*- import sys print(sys.getdefaultencoding()) # 打印出目前系統字符編碼s = '你好' s_to_unicode = ...
- 5Encoding.UTF8 屬性(System.Text) - Microsoft Learn
WriteLine(); // Display the number of bytes required to encode the array. int reqBytes = utf8.Get...