How many characters can UTF-8 encode? - Stack Overflow

2025-01-10

Question (score 134). Viewed 118k times.

If UTF-8 is 8 bits, does it not mean that there can be only a maximum of 256 different characters? The first 128 code points are the same as in ASCII. But it says UTF-8 can support up to millions of characters? How does this work?

Tags: utf-8, character-encoding, ascii

Asked Apr 19, 2012 at 13:29 by eMRe. Edited Sep 30, 2015 at 22:33 by mklement0.

Comments:
In the UTF-8, UTF-16, UTF-32 encodings of Unicode, the number is the number of bits in its code units, one or more of which encode a Unicode code point. – Tom Blodget, Oct 29, 2017 at 16:39

10 Answers

Answer (score 175)

UTF-8 does not use one byte all the time, it's 1 to 4 bytes.

The first 128 characters (US-ASCII) need one byte.
The next 1,920 characters need two bytes to encode. This covers the remainder of almost all Latin alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Tāna alphabets, as well as Combining Diacritical Marks.
Three bytes are needed for characters in the rest of the Basic Multilingual Plane, which contains virtually all characters in common use, including most Chinese, Japanese and Korean [CJK] characters.
Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).

Source: Wikipedia

Answered Apr 19, 2012 at 13:34 by zwippie. Edited Oct 2, 2015 at 13:53 by mklement0.

Comments:
BMP uses 2 bytes, but you say it is 3? Am I wrong? – chiperortiz, Mar 3, 2019 at 15:11
@chiperortiz, BMP is indeed 16 bits, so it can be encoded as UTF-16 with constant length per character (UTF-16 also supports going beyond 16 bits, but it's a difficult practice, and many implementations don't support it). However, for UTF-8, you also need to encode how long it will be, so you lose some bits. Which is why you need 3 bytes to encode the complete BMP. This may seem wasteful, but remember that UTF-16 always uses 2 bytes, while UTF-8 uses one byte per character for most Latin-based language characters, making it twice as compact. – sanderd17, Mar 25, 2019 at 8:17
The main thrust of the OP's question is related to why it is called UTF-8 -- this doesn't really answer that. – jbyrd, Feb 13, 2020 at 14:25
@sanderd17 So the difference between UTF-8 and UTF-16 is that UTF-8 can use as little as 1 byte (sufficient for Latin) whereas UTF-16 needs at least 2 bytes? – Timo, Nov 16, 2020 at 18:34
How many characters can UTF-8 encode? – a55, Aug 29 at 2:07

Answer (score 55)

UTF-8 uses 1-4 bytes per character: one byte for ASCII characters (the first 128 Unicode values are the same as ASCII). But that only requires 7 bits. If the highest ("sign") bit is set, the byte belongs to a multi-byte sequence; in the first byte, the number of consecutive high bits set indicates the number of bytes in the sequence, followed by a 0, and the remaining bits contribute to the value. For the other (continuation) bytes, the highest two bits will be 1 and 0 and the remaining 6 bits are for the value.

So a four byte sequence would begin with 11110... (and ... = three bits for the value) then three bytes with 6 bits each for the value, yielding a 21 bit value. 2^21 exceeds the number of Unicode characters, so all of Unicode can be expressed in UTF-8.
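The bit layout described above can be sketched as a small hand-written encoder. This is an illustration only, not how any particular library implements it: the function name `utf8_encode` is made up for this example, and the range checks (rejecting surrogates and anything above U+10FFFF) follow the restrictions mentioned elsewhere in this thread. The result is compared against Python's built-in `str.encode("utf-8")`.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single code point by hand (hypothetical helper for illustration)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:
        # 0xxxxxxx: 7 value bits
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx: 5 + 6 = 11 value bits
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 = 16 value bits
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 3 + 6 + 6 + 6 = 21 value bits
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


if __name__ == "__main__":
    # One sample per sequence length: A, e-acute, euro sign, an emoji.
    for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
        encoded = utf8_encode(ord(ch))
        assert encoded == ch.encode("utf-8")
        bits = " ".join(f"{byte:08b}" for byte in encoded)
        print(f"U+{ord(ch):04X} -> {bits}")
```

Printing the bytes in binary makes the prefix bits (0, 110, 1110, 11110 for lead bytes and 10 for continuation bytes) easy to see next to the value bits.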
Answered Apr 19, 2012 at 13:40 by CodeClown42. Edited May 31, 2019 at 12:12.

Comments:
@NickL. No, I mean 3 bytes. In that example, if the first byte of a multibyte sequence begins 1111, the first 1 indicates that it is the beginning of a multibyte sequence, then the number of consecutive 1's after that indicates the number of additional bytes in the sequence (so a first byte will begin either 110, 1110, or 11110). – CodeClown42, Jul 2, 2016 at 15:31
Found proof for your words in RFC 3629: tools.ietf.org/html/rfc3629#section-3. However, I don't understand why I need to place "10" at the beginning of the second byte, as in 110xxxxx 10xxxxxx. Why not just 110xxxxx xxxxxxxx? – kolobok, Nov 6, 2017 at 10:15
Found the answer in softwareengineering.stackexchange.com/questions/262227/…. Just for safety reasons (in case a single byte in the middle of the stream is corrupted). – kolobok, Nov 6, 2017 at 10:47
@NickL "So a four byte sequence would begin with 11110... (... = three bytes for the value)" should have read "... = three bits" (thanks). This is why a 4 byte UTF-8 character has a 21-bit value (3+6+6+6). – CodeClown42, May 31, 2019 at 12:10
Why is everyone writing a long wall of text, when the technical description can fit in a small paragraph like here! Thumbs up! – duketwo, Apr 4 at 3:36

Answer (score 38)

Unicode vs UTF-8

Unicode resolves code points to characters. UTF-8 is a storage mechanism for Unicode. Unicode has a spec. UTF-8 has a spec. They both have different limits. UTF-8 has a different upper bound.

Unicode

Unicode is designated with "planes." Each plane carries 2^16 code points. There are 17 planes in Unicode, for a total of 17 * 2^16 code points. The first plane, plane 0 or the BMP, is special in the weight of what it carries.

Rather than explain all the nuances, let me just quote the above article on planes.

"The 17 planes can accommodate 1,114,112 code points. Of these, 2,048 are surrogates, 66 are non-characters, and 137,468 are reserved for private use, leaving 974,530 for public assignment."
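The figures in that quote can be re-derived with a few lines of arithmetic. The sketch below is not from the thread: the range boundaries for surrogates, noncharacters, and private-use areas are the ones given in the Unicode Standard as I understand it, and the variable names are made up for illustration.

```python
PLANES = 17
CODE_POINTS_PER_PLANE = 2 ** 16

# Whole code space: U+0000 .. U+10FFFF
total = PLANES * CODE_POINTS_PER_PLANE

# Surrogates: U+D800 .. U+DFFF
surrogates = 0xDFFF - 0xD800 + 1

# Noncharacters: U+FDD0 .. U+FDEF, plus the last two code points of every plane
noncharacters = (0xFDEF - 0xFDD0 + 1) + 2 * PLANES

# Private use: U+E000 .. U+F8FF in the BMP, plus planes 15 and 16
# (each of those planes minus its two trailing noncharacters)
private_use = (0xF8FF - 0xE000 + 1) + 2 * (CODE_POINTS_PER_PLANE - 2)

public = total - surrogates - noncharacters - private_use

print(f"total code points:  {total:,}")
print(f"surrogates:         {surrogates:,}")
print(f"noncharacters:      {noncharacters:,}")
print(f"private use:        {private_use:,}")
print(f"left for public assignment: {public:,}")
```

Under these assumptions the totals agree with the quoted numbers: 1,114,112 code points, of which 974,530 remain for public assignment.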
UTF-8

Now let's go back to the article linked above,

"The encoding scheme used by UTF-8 was designed with a much larger limit of 2^31 code points (32,768 planes), and can encode 2^21 code points (32 planes) even if limited to 4 bytes. Since Unicode limits the code points to the 17 planes that can be encoded by UTF-16, code points above 0x10FFFF are invalid in UTF-8 and UTF-32."

So you can see that you can put stuff into UTF-8 that isn't valid Unicode. Why? Because UTF-8 accommodates code points that Unicode doesn't even support.

UTF-8, even with a four byte limitation, supports 2^21 code points, which is far more than 17 * 2^16.

Answered Jul 11, 2017 at 18:58 by Evan Carroll. Edited Sep 13, 2017 at 7:35 by ijmacd.

Answer (score 30)

According to this table*, UTF-8 should support:

2^31 = 2,147,483,648 characters

However, RFC 3629 restricted the possible values, so now we're capped at 4 bytes, which gives us

2^21 = 2,097,152 characters

Note that a good chunk of those characters are "reserved" for custom use, which is actually pretty handy for icon-fonts.

* Wikipedia used to show a table with 6 bytes -- they've since updated the article.

2017-07-11: Corrected for double-counting the same code point encoded with multiple bytes

Answered Jul 20, 2016 at 18:38 by mpen. Edited Oct 7, 2021 at 7:34 by Community Bot.

Comments:
This answer is double counting the number of encodings possible. Once you have counted all 2^7, you cannot count them again in 2^11, 2^16, etc. The correct number of encodings possible is 2^21 (though not all are currently being used). – Jimmy, Jun 24, 2017 at 17:15
@Jimmy You sure I'm double counting? 0xxxxxxx gives 7 usable bits, 110xxxxx 10xxxxxx gives 11 more -- there's no overlap. The first byte starts with 0 in the first case, and 1 in the second case. – mpen, Jun 26, 2017 at 0:25
@mpen so what code point does 00000001 store and what does 11000000 10000001 store? – Evan Carroll, Jul 11, 2017 at 16:00
@EvanCarroll Uhh.... point taken. Didn't realize there were multiple ways to encode the same code point.
– mpen, Jul 11, 2017 at 18:25

Answer (score 30)

2,164,864 "characters" can be potentially coded by UTF-8.

This number is 2^7 + 2^11 + 2^16 + 2^21, which comes from the way the encoding works:

1-byte chars have 7 bits for encoding: 0xxxxxxx (0x00-0x7F)
2-byte chars have 11 bits for encoding: 110xxxxx 10xxxxxx (0xC0-0xDF for the first byte; 0x80-0xBF for the second)
3-byte chars have 16 bits for encoding: 1110xxxx 10xxxxxx 10xxxxxx (0xE0-0xEF for the first byte; 0x80-0xBF for continuation bytes)
4-byte chars have 21 bits for encoding: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (0xF0-0xF7 for the first byte; 0x80-0xBF for continuation bytes)

As you can see this is significantly larger than current Unicode (1,112,064 characters).

UPDATE: My initial calculation is wrong because it doesn't consider additional rules. See comments to this answer for more details.

Answered Oct 29, 2017 at 12:08 by Ruben Reyes. Edited Oct 13, 2020 at 6:19 by Manu Manjunath.

Comments:
Your math doesn't respect the UTF-8 rule that only the shortest code unit sequence is allowed to encode a code point. So, 00000001 is valid for U+0001 but 11110000 10000000 10000000 10000001 is not. Ref: Table 3-7. Well-Formed UTF-8 Byte Sequences. Besides, the question is directly answered by the table: you just add up the ranges. (They are disjoint to exclude surrogates for UTF-16). – Tom Blodget, Oct 29, 2017 at 17:01
Tom - thanks for your comment! I was unaware of those restrictions. I saw table 3-7 and ran the numbers and it looks like there are 1,083,392 possible valid sequences. – Ruben Reyes, Oct 30, 2017 at 13:23
This is an accurate answer. Other answers have just stopped at 2^21 and forgot the rest of the combinations possible.
– Manu Manjunath, Oct 13, 2020 at 6:17
Yes, theoretically there are 2^7 + 2^11 + 2^16 + 2^21 symbols but a lot of them are invalid by UTF-8 rules, so in the end < 2^21. UTF-8 must encode 1,114,112 Unicode symbols and 2^21 is enough. Though many forget that there are symbol combinations: diacritic modifiers, skin tone modifiers, zero-width-joiner, flags, etc. UTF-8 symbols are 1-4 bytes, but a combination of these symbols could be 6/8/20/... bytes, which can express many more "visible" symbols. – Lev Lukomsky, Jan 13 at 10:31
@RubenReyes The input data in your Excel sheet is wrong, due to which your total is incorrect. I ran the numbers and there are exactly 1,112,064 valid UTF-8 byte sequences, which is precisely all the code points from U+0000 to U+10FFFF minus the surrogate halves from U+D800 to U+DFFF. Thus, there are exactly 1,114,112 - 2,048 = 1,112,064 valid UTF-8 byte sequences, or 2^16 + 2^20 - 2^11 byte sequences. The mistake in your input data is in the 2nd byte hex end column. It should be 80..BF instead of 80..9F for the 4th and the 6th row respectively. – Aadit M Shah, Oct 6 at 11:39

Answer (score 9)

UTF-8 is a variable length encoding with a minimum of 8 bits per character.
Characters with higher code points will take up to 32 bits.

Answered Apr 19, 2012 at 13:35 by deceze.

Comments:
This is misleading. The longest code point you can have is 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, so only 21 bits can be used for encoding the actual character. – Boris Verkhovskiy, May 23, 2017 at 16:21
I said code points may take up to 32 bits to be encoded, I never claimed that (by induction) you can encode 2^32 characters in 32 bit UTF-8. But that is rather moot, since you can encode all existing Unicode characters in UTF-8, and you can encode even more if you stretch UTF-8 to 48 bits (which exists but is deprecated), so I'm not sure what the misleading point is. – deceze, May 23, 2017 at 22:08

Answer (score 5)

Quote from Wikipedia: "UTF-8 encodes each of the 1,112,064 code points in the Unicode character set using one to four 8-bit bytes (termed "octets" in the Unicode Standard)."
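The 1,112,064 figure can be checked by brute force. The sketch below assumes that CPython's built-in strict `utf-8` codec follows the well-formedness rules discussed in the comments (shortest form only, no surrogates); the helper names `count_utf8_lengths` and `is_valid_utf8` are made up for this example.

```python
from collections import Counter


def count_utf8_lengths() -> Counter:
    """Tally the encoded length in bytes of every code point the codec accepts."""
    lengths = Counter()
    for cp in range(0x110000):  # U+0000 .. U+10FFFF
        try:
            encoded = chr(cp).encode("utf-8")
        except UnicodeEncodeError:
            continue  # surrogates U+D800 .. U+DFFF are rejected
        lengths[len(encoded)] += 1
    return lengths


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the strict decoder accepts the byte sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    lengths = count_utf8_lengths()
    for n in sorted(lengths):
        print(f"{n}-byte sequences: {lengths[n]:,}")
    print(f"total: {sum(lengths.values()):,}")

    # U+0001 has exactly one valid encoding; the longer forms are rejected.
    print(is_valid_utf8(b"\x01"))              # shortest form
    print(is_valid_utf8(b"\xc0\x81"))          # overlong 2-byte form
    print(is_valid_utf8(b"\xf0\x80\x80\x81"))  # overlong 4-byte form
```

The per-length tallies come out to 128, 1,920, 61,440 and 1,048,576, which sum to 1,112,064. This is why adding 2^7 + 2^11 + 2^16 + 2^21 overcounts: each code point has exactly one valid encoding, so the overlong forms of U+0001 contribute nothing.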
Some links:
http://www.utf-8.com/
http://www.joelonsoftware.com/articles/Unicode.html
http://www.icu-project.org/docs/papers/forms_of_unicode/
http://en.wikipedia.org/wiki/UTF-8

Answered Apr 19, 2012 at 13:35 by ZZ-bb.

Answer (score 1)

Check out the Unicode Standard and related information, such as their FAQ entry, UTF-8, UTF-16, UTF-32 & BOM. It’s not that smooth sailing, but it’s authoritative information, and much of what you might read about UTF-8 elsewhere is questionable.

The “8” in “UTF-8” relates to the length of code units in bits. Code units are entities used to encode characters, not necessarily as a simple one-to-one mapping. UTF-8 uses a variable number of code units to encode a character.

The collection of characters that can be encoded in UTF-8 is exactly the same as for UTF-16 or UTF-32, namely all Unicode characters. They all encode the entire Unicode coding space, which even includes noncharacters and unassigned code points.

Answered Apr 19, 2012 at 14:25 by Jukka K. Korpela.

Answer (score 0)

While I agree with mpen on the current maximum UTF-8 codes (2,164,864) (listed below, I couldn't comment on his), he is off by 2 levels if you remove the 2 major restrictions of UTF-8: the 4 byte limit, and the rule that codes 254 and 255 cannot be used (he only removed the 4 byte limit).

Starting code 254 follows the basic arrangement of starting bits (multi-bit flag set to 1, a count of 6 1's, and terminal 0, no spare bits) giving you 6 additional bytes to work with (6 10xxxxxx groups, an additional 2^36 codes).

Starting code 255 doesn't exactly follow the basic setup, no terminal 0 but all bits are used, giving you 7 additional bytes (multi-bit flag set to 1, a count of 7 1's, and no terminal 0 because all bits are used; 7 10xxxxxx groups, an additional 2^42 codes).

Adding these in gives a final maximum presentable character set of 4,468,982,745,216. This is more than all characters in current use, old or dead languages, and any believed lost languages. Angelic or Celestial script anyone?
Also there are single byte codes that are overlooked/ignored in the UTF-8 standard in addition to 254 and 255: 128-191, and a few others. Some are used locally by the keyboard; for example, code 128 is usually a deleting backspace. The other starting codes (and associated ranges) are invalid for one or more reasons (https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences).

Answered Aug 14, 2016 at 21:52 by James V. Fields. Edited Aug 14, 2016 at 22:43.

Answer (score -2)

Unicode is firmly married to UTF-8. Unicode specifically supports 2^21 code points (2,097,152 characters) which is exactly the same number of code points supported by UTF-8. Both systems reserve the same 'dead' space and restricted zones for code points etc. ... as of June 2018 the most recent version, Unicode 11.0, contains a repertoire of 137,439 characters.

From the Unicode standard (Unicode FAQ):

"The Unicode Standard encodes characters in the range U+0000..U+10FFFF, which amounts to a 21-bit code space."

From the UTF-8 Wikipedia page (UTF-8 Description):

"Since the restriction of the Unicode code-space to 21-bit values in 2003, UTF-8 is defined to encode code points in one to four bytes, ..."

Answered Nov 26, 2018 at 21:29 by Displayname.

Comments:
21 bits is rounded up. Unicode supports 1,114,112 code points (U+0000 to U+10FFFF) like it says. (Sometimes described as 17 planes of 65536.) – Tom Blodget, Nov 27, 2018 at 0:27
@TomBlodget, You are correct. The most relevant takeaway from this discussion is that UTF-8 can encode all the currently defined points in the Unicode standard and will likely be able to do so for quite some time to come. – Displayname, Nov 28, 2018 at 16:52