What is UTF-8? UTF-8 Character Encoding Tutorial


Quincy Larson

UTF-8 is a character encoding system. It encodes ASCII characters exactly as ASCII does, while still allowing for international characters, such as Chinese characters.

As of the mid 2020s, UTF-8 is one of the most popular encoding systems.

To start using UTF-8, you will want to first familiarize yourself with the basic ASCII character set.

What is the ASCII Character Set?

ASCII uses 7-bit code points to represent 128 different characters. These code points are divided into 95 printable characters, which include the 26 letters of the English alphabet in both upper and lower case, the 10 digits (0 through 9), and a variety of punctuation and other symbols.

There are also 33 non-printable characters, which include control characters like carriage return and line feed, as well as various other characters that are used for things like formatting text.

UTF-8 VS ASCII – What's the Difference?

ASCII stops at 128 characters. UTF-8 builds on it by using 8-bit bytes as its building blocks and combining one to four of them per character, which is enough to cover every Unicode code point, more than a million in total.

The first 128 code points are encoded exactly as they are in ASCII. This means that UTF-8 can represent all of the printable ASCII characters, as well as the non-printable characters, and that any valid ASCII text is also valid UTF-8. UTF-8 also includes a variety of additional international characters, such as Chinese characters and Arabic characters.

How to Use UTF-8 in Your Webpages – HTML UTF-8 Example

And now the easy part. You don't actually need to know how it works (though I'll tell you in a moment).

You can configure UTF-8 Character Encoding in your HTML code with a single line of HTML located in the <head> section of your code:

<meta charset="UTF-8">

With that out of the way, let me explain how UTF-8 works, and why it's such a brilliant encoding scheme.

How UTF-8 Encoding Works, and How Much Storage Each Character Uses

When representing characters in UTF-8, each code point is represented by a sequence of one or more bytes. The number of bytes used depends on the code point being represented by the character. Here's a breakdown of the usage range:

- code points in the ASCII range (0-127) are represented by a single byte
- code points in the range (128-2047) are represented by two bytes
- code points in the range (2048-65535) are represented by three bytes
- and code points in the range (65536-1114111) are represented by four bytes.

(This may seem like a lot of possible characters, but keep in mind that Chinese alone has tens of thousands of characters.)

The first byte of a UTF-8 sequence is called the "leader byte". The leader byte tells you how many bytes are in the sequence, and it carries the highest-order bits of the character's code point value.

The leader byte for a single-byte sequence is always in the range (0-127). The leader byte for a two-byte sequence is in the range (194-223). The leader byte for a three-byte sequence is in the range (224-239). And the leader byte for a four-byte sequence is in the range (240-244). The four-byte bit pattern would allow values up to 247, but 245 through 247 would encode code points beyond 1114111, so they never appear in valid UTF-8.

The remaining bytes in the sequence are called "trailing bytes." The trailing bytes are always in the range (128-191), whether the sequence is two, three, or four bytes long.

You can calculate the code point value of a character by looking at the leader byte and the trailing bytes.

- For a single-byte sequence, the code point value is equal to the value of the leader byte.
- For a two-byte sequence, the code point value is equal to ((leader byte - 192) * 64) + (trailing byte - 128).
- For a three-byte sequence, the code point value is equal to ((leader byte - 224) * 4096) + ((trailing byte 1 - 128) * 64) + (trailing byte 2 - 128).
- For a four-byte sequence, the code point value is equal to ((leader byte - 240) * 262144) + ((trailing byte 1 - 128) * 4096) + ((trailing byte 2 - 128) * 64) + (trailing byte 3 - 128).

Note that the two-byte formula subtracts 192, not 194. The two-byte leader pattern starts at 192, but the values 192 and 193 are never used because they could only encode code points below 128, which already fit in a single byte.

UTF-8 is a Sound Choice for Encoding

Again, UTF-8 is a super efficient encoding system. It can represent a wide range of characters while still being compatible with ASCII. This makes it a sound choice for use in internationalized software.

I hope you've found this helpful. If you want to learn more about programming and technology, try freeCodeCamp's core coding curriculum. It's free.
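If you'd like to check the byte ranges and the decoding arithmetic from the section on how UTF-8 encoding works, here is a minimal Python sketch. The function names utf8_length and decode_one are made up for this illustration and are not part of any library. The decoder assumes it is handed exactly one well-formed sequence, so treat it as a way to follow the math rather than as a real decoder. Python's built-in str.encode("utf-8") is used as the reference to compare against.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for one code point (hypothetical helper)."""
    if code_point < 0 or code_point > 1114111:
        raise ValueError("outside the Unicode range")
    if code_point <= 127:
        return 1
    if code_point <= 2047:
        return 2
    if code_point <= 65535:
        return 3
    return 4


def decode_one(seq: bytes) -> int:
    """Decode a single UTF-8 sequence into its code point (hypothetical helper).

    Assumption: seq is one complete, well-formed sequence. Trailing bytes
    are not validated, so malformed input is not reliably rejected.
    """
    leader, trailing = seq[0], seq[1:]
    if leader <= 127:
        # Single byte: the byte value is the code point.
        return leader
    if 194 <= leader <= 223:
        # Two bytes: subtracting 192 strips the leader's marker bits.
        return (leader - 192) * 64 + (trailing[0] - 128)
    if 224 <= leader <= 239:
        return ((leader - 224) * 4096
                + (trailing[0] - 128) * 64
                + (trailing[1] - 128))
    if 240 <= leader <= 244:
        return ((leader - 240) * 262144
                + (trailing[0] - 128) * 4096
                + (trailing[1] - 128) * 64
                + (trailing[2] - 128))
    raise ValueError("not a valid UTF-8 leader byte")


# One sample per byte length: Latin letter A, e with acute accent,
# a Chinese character, and an emoji (written as escapes to keep the file ASCII).
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")  # Python's own encoder is the reference
    assert len(encoded) == utf8_length(ord(ch))
    assert decode_one(encoded) == ord(ch)
    print(f"U+{ord(ch):04X}", list(encoded), decode_one(encoded))
```

Each sample lands in a different row of the breakdown, so the loop exercises the one-, two-, three-, and four-byte cases in turn. For real work you would call bytes.decode("utf-8") instead, which also rejects invalid input.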

