UTF-8 - Wikipedia
ASCII-compatible variable-width encoding of Unicode, using one to four bytes

UTF-8
Standard: Unicode Standard
Classification: Unicode Transformation Format, extended ASCII, variable-width encoding
Extends: US-ASCII
Transforms / Encodes: ISO/IEC 10646 (Unicode)
Preceded by: UTF-1

UTF-8 is a variable-width character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode (or Universal Coded Character Set) Transformation Format – 8-bit.[1]

UTF-8 is capable of encoding all 1,112,064[nb 1] valid character code points in Unicode using one to four one-byte (8-bit) code units. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. It was designed for backward compatibility with ASCII: the first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well.

UTF-8 was designed as a superior alternative to UTF-1, a proposed variable-width encoding with partial ASCII compatibility which lacked some features including self-synchronization and fully ASCII-compatible handling of characters such as slashes. Ken Thompson and Rob Pike produced the first implementation for the Plan 9 operating system in September 1992.[2][3] This led to its adoption by X/Open as its specification for FSS-UTF,[4] which would first be officially presented at USENIX in January 1993[5] and subsequently adopted by the Internet Engineering Task Force (IETF) in RFC 2277 (BCP 18)[6] for future internet standards work, replacing Single Byte Character Sets such as Latin-1 in older RFCs.
UTF-8 is the dominant encoding for the World Wide Web (and internet technologies), accounting for 98% of all web pages, and up to 100.0% for some languages, as of 2022.[7]

Naming

The official Internet Assigned Numbers Authority (IANA) code for the encoding is "UTF-8".[8] All letters are upper-case, and the name is hyphenated. This spelling is used in all the Unicode Consortium documents relating to the encoding. However, the name "utf-8" may be used by all standards conforming to the IANA list (which include CSS, HTML, XML, and HTTP headers),[9] as the declaration is case-insensitive.[8]

Other variants, such as those that omit the hyphen or replace it with a space, i.e. "utf8" or "UTF 8", are not accepted as correct by the governing standards.[10] Despite this, most web browsers can understand them, and so standards intended to describe existing practice (such as HTML5) may effectively require their recognition.[11]

"UTF-8-BOM" and "UTF-8-NOBOM" are sometimes used for text files which contain or don't contain a byte order mark (BOM), respectively.[citation needed] In Japan especially, UTF-8 encoding without a BOM is sometimes called "UTF-8N".[12][13]

In Windows, UTF-8 is code page 65001.[14]

In HP PCL, UTF-8 is called Symbol-ID "18N".[15]

Encoding

UTF-8 encodes code points in one to four bytes, depending on the value of the code point. The x characters are replaced by the bits of the code point:

Code point ↔ UTF-8 conversion

| First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|
| U+0000 | U+007F | 0xxxxxxx | | | |
| U+0080 | U+07FF | 110xxxxx | 10xxxxxx | | |
| U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| U+10000 | U+10FFFF[nb 2] | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
The first 128 code points (ASCII) need one byte. The next 1,920 code points need two bytes to encode, which covers the remainder of almost all Latin-script alphabets, and also IPA extensions, Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana and N'Ko alphabets, as well as Combining Diacritical Marks. Three bytes are needed for the rest of the Basic Multilingual Plane, which contains virtually all code points in common use,[16] including most Chinese, Japanese and Korean characters. Four bytes are needed for code points in the other planes of Unicode, which include less common CJK characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).

A "character" can take more than 4 bytes because it is made of more than one code point. For instance, a national flag character takes 8 bytes since it is "constructed from a pair of Unicode scalar values", both from outside the BMP.[17][nb 3]

Examples

Consider the encoding of the euro sign, €:

1. The Unicode code point for € is U+20AC.
2. As this code point lies between U+0800 and U+FFFF, this will take three bytes to encode.
3. Hexadecimal 20AC is binary 0010 0000 1010 1100. The two leading zeros are added because a three-byte encoding needs exactly sixteen bits from the code point.
4. Because the encoding will be three bytes long, its leading byte starts with three 1s, then a 0 (1110...).
5. The four most significant bits of the code point are stored in the remaining low-order four bits of this byte (11100010), leaving 12 bits of the code point yet to be encoded (...0000 1010 1100).
6. All continuation bytes contain exactly six bits from the code point. So the next six bits of the code point are stored in the low-order six bits of the next byte, and 10 is stored in the high-order two bits to mark it as a continuation byte (so 10000010).
7. Finally the last six bits of the code point are stored in the low-order six bits of the final byte, and again 10 is stored in the high-order two bits (10101100).

The three bytes 11100010 10000010 10101100 can be more concisely written in hexadecimal, as E2 82 AC.

The table below summarizes this conversion, as well as others with different lengths in UTF-8.
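The euro-sign walkthrough generalizes to any code point. The following Python sketch packs the bits by hand according to the code point ↔ UTF-8 conversion table; the function name `encode_code_point` is hypothetical, and the range checks anticipate the surrogate and U+10FFFF restrictions described later in the article. It is meant as an illustration under those assumptions, not as a replacement for the built-in `str.encode`.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one code point as UTF-8 (hypothetical helper, for illustration)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:                      # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Compare the hand-rolled result with Python's own encoder.
for ch in "$£ह€한𐍈":
    encoded = encode_code_point(ord(ch))
    print(f"U+{ord(ch):04X}", encoded.hex(" ").upper(), encoded == ch.encode("utf-8"))
```

Each branch corresponds to one row of the conversion table: the leading byte carries the length marker plus the highest bits, and every continuation byte carries `10` plus six further bits.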
Examples of UTF-8 encoding

| Character | Code point | Binary code point | Binary UTF-8 | Hex UTF-8 |
|---|---|---|---|---|
| $ | U+0024 | 010 0100 | 00100100 | 24 |
| £ | U+00A3 | 000 1010 0011 | 11000010 10100011 | C2 A3 |
| ह | U+0939 | 0000 1001 0011 1001 | 11100000 10100100 10111001 | E0 A4 B9 |
| € | U+20AC | 0010 0000 1010 1100 | 11100010 10000010 10101100 | E2 82 AC |
| 한 | U+D55C | 1101 0101 0101 1100 | 11101101 10010101 10011100 | ED 95 9C |
| 𐍈 | U+10348 | 0 0001 0000 0011 0100 1000 | 11110000 10010000 10001101 10001000 | F0 90 8D 88 |

Octal

UTF-8's use of six bits per byte to represent the actual characters being encoded means that octal notation (which uses 3-bit groups) can aid in the comparison of UTF-8 sequences with one another and in manual conversion.[18]

Octal code point ↔ Octal UTF-8 conversion

| First code point | Last code point | Code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|---|
| 000 | 0177 | xxx | xxx | | | |
| 0200 | 03777 | xxyy | 3xx | 2yy | | |
| 04000 | 077777 | xyyzz | 34x | 2yy | 2zz | |
| 0100000 | 0177777 | 1xyyzz | 35x | 2yy | 2zz | |
| 0200000 | 04177777 | xyyzzww | 36x | 2yy | 2zz | 2ww |

With octal notation, the arbitrary octal digits, marked with x, y, z or w in the table, will remain unchanged when converting to or from UTF-8.

Example: Á = U+00C1 = 0301 (in octal) is encoded as 303 201 in UTF-8 (C3 81 in hex).
Example: € = U+20AC = 020254 is encoded as 342 202 254 in UTF-8 (E2 82 AC in hex).

Code page layout

The following table summarizes usage of UTF-8 code units (individual bytes or octets) in a code page format. The upper half is for bytes used only in single-byte codes, so it looks like a normal code page; the lower half is for continuation bytes and leading bytes and is explained further in the legend below.

| UTF-8 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0x | NUL | SOH | STX | ETX | EOT | ENQ | ACK | BEL | BS | HT | LF | VT | FF | CR | SO | SI |
| 1x | DLE | DC1 | DC2 | DC3 | DC4 | NAK | SYN | ETB | CAN | EM | SUB | ESC | FS | GS | RS | US |
| 2x | SP | ! | " | # | $ | % | & | ' | ( | ) | * | + | , | - | . | / |
| 3x | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | : | ; | < | = | > | ? |
| 4x | @ | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O |
| 5x | P | Q | R | S | T | U | V | W | X | Y | Z | [ | \ | ] | ^ | _ |
| 6x | ` | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o |
| 7x | p | q | r | s | t | u | v | w | x | y | z | { | \| | } | ~ | DEL |
| 8x | • | • | • | • | • | • | • | • | • | • | • | • | • | • | • | • |
| 9x | • | • | • | • | • | • | • | • | • | • | • | • | • | • | • | • |
| Ax | • | • | • | • | • | • | • | • | • | • | • | • | • | • | • | • |
| Bx | • | • | • | • | • | • | • | • | • | • | • | • | • | • | • | • |
| Cx | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| Dx | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| Ex | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| Fx | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 5 | 5 | 5 | 5 | 6 | 6 | | |

Legend (the original table distinguishes these groups by cell color; a number N in a cell is the length of the sequence that the byte would start):

00–7F: 7-bit (single-byte) code points. They must not be followed by a continuation byte.[19]
80–BF (•): Continuation bytes.[20] Each adds six bits to the code point.[nb 4]
C2–DF, E1–EC, EE, EF, F1–F3: Leading bytes for a sequence of multiple bytes; each must be followed by exactly N−1 continuation bytes.[21]
E0, ED, F0, F4: Leading bytes where not all arrangements of continuation bytes are valid. E0 and F0 could start overlong encodings. F4 can start code points greater than U+10FFFF. ED can start code points in the range U+D800–U+DFFF, which are invalid UTF-16 surrogate halves.[22]
C0, C1, F5–FF: Do not appear in a valid UTF-8 sequence. C0 and C1 could be used only for an "overlong" encoding of a 1-byte character.[23] F5 to FD are leading bytes of 4-byte or longer sequences that can only encode code points larger than U+10FFFF.[24] FE and FF were never assigned any meaning.[25]

Overlong encodings

In principle, it would be possible to inflate the number of bytes in an encoding by padding the code point with leading 0s. To encode the euro sign € from the above example in four bytes instead of three, it could be padded with leading 0s until it was 21 bits long – 000 000010 000010 101100, and encoded as 11110000 10000010 10000010 10101100 (or F0 82 82 AC in hexadecimal). This is called an overlong encoding.

The standard specifies that the correct encoding of a code point uses only the minimum number of bytes required to hold the significant bits of the code point. Longer encodings are called overlong and are not valid UTF-8 representations of the code point. This rule maintains a one-to-one correspondence between code points and their valid encodings, so that there is a unique valid encoding for each code point. This ensures that string comparisons and searches are well-defined.
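The overlong form of the euro sign can be built and tested directly. In this Python sketch, `four_byte_form` and `is_valid_utf8` are hypothetical helper names; the first deliberately skips the shortest-form rule to reproduce the padded encoding, and the second relies on Python's built-in strict decoder, which is assumed here to stand in for any conforming decoder.

```python
def four_byte_form(cp: int) -> bytes:
    """Pack cp into the four-byte pattern WITHOUT checking for shortest form."""
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def is_valid_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


overlong = four_byte_form(0x20AC)     # padded, four-byte form of the euro sign
shortest = "€".encode("utf-8")        # the only valid form

print(overlong.hex(" ").upper(), is_valid_utf8(overlong))
print(shortest.hex(" ").upper(), is_valid_utf8(shortest))
```

Because only the shortest form is accepted, a byte-wise comparison of two valid UTF-8 strings is also a comparison of their code point sequences.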
Invalid sequences and error handling

Not all sequences of bytes are valid UTF-8. A UTF-8 decoder should be prepared for:

- invalid bytes
- an unexpected continuation byte
- a non-continuation byte before the end of the character
- the string ending before the end of the character (which can happen in simple string truncation)
- an overlong encoding
- a sequence that decodes to an invalid code point

Many of the first UTF-8 decoders would decode these, ignoring incorrect bits and accepting overlong results. Carefully crafted invalid UTF-8 could make them either skip or create ASCII characters such as NUL, slash, or quotes. Invalid UTF-8 has been used to bypass security validations in high-profile products including Microsoft's IIS web server[26] and Apache's Tomcat servlet container.[27] RFC 3629 states "Implementations of the decoding algorithm MUST protect against decoding invalid sequences."[10] The Unicode Standard requires decoders to "...treat any ill-formed code unit sequence as an error condition. This guarantees that it will neither interpret nor emit an ill-formed code unit sequence."

Since RFC 3629 (November 2003), the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) and code points not encodable by UTF-16 (those after U+10FFFF) are not legal Unicode values, and their UTF-8 encoding must be treated as an invalid byte sequence. Not decoding unpaired surrogate halves makes it impossible to store invalid UTF-16 (such as Windows filenames or UTF-16 that has been split between the surrogates) as UTF-8,[28] while it is possible with WTF-8.
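Each category in the list above can be illustrated with a short byte string. The sample bytes below are chosen for this sketch (they do not come from the article), and Python's strict decoder is used as one example of a decoder that protects against invalid sequences; `is_rejected` is a hypothetical helper name.

```python
# One hand-picked sample per category of invalid input (illustrative only).
CASES = {
    "invalid byte": b"\xff",
    "unexpected continuation byte": b"\x80",
    "non-continuation byte before the end of the character": b"\xe2\x82A",
    "string ending before the end of the character": b"\xe2\x82",
    "overlong encoding of '/'": b"\xc0\xaf",
    "surrogate half U+D800": b"\xed\xa0\x80",
    "code point above U+10FFFF": b"\xf4\x90\x80\x80",
}


def is_rejected(data: bytes) -> bool:
    """True if strict decoding raises instead of producing text."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


for label, raw in CASES.items():
    print(f"{label:55} {raw.hex(' ').upper():12} rejected={is_rejected(raw)}")
```

The overlong sample shows why lenient decoders were a security problem: C0 AF would decode to an ASCII slash that a byte-level filter looking for 2F never saw.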
Some implementations of decoders throw exceptions on errors.[29] This has the disadvantage that it can turn what would otherwise be harmless errors (such as a "no such file" error) into a denial of service. For instance, early versions of Python 3.0 would exit immediately if the command line or environment variables contained invalid UTF-8.[30] An alternative practice is to replace errors with a replacement character. Since Unicode 6[31] (October 2010), the standard (chapter 3) has recommended a "best practice" where the error ends as soon as a disallowed byte is encountered. In these decoders E1,A0,C0 is two errors (2 bytes in the first one). This means an error is no more than three bytes long and never contains the start of a valid character, and there are 21,952 different possible errors.[32] The standard also recommends replacing each error with the replacement character "�" (U+FFFD).

Byte order mark

If the Unicode byte order mark (BOM, U+FEFF) character is at the start of a UTF-8 file, the first three bytes will be 0xEF, 0xBB, 0xBF.

The Unicode Standard neither requires nor recommends the use of the BOM for UTF-8, but warns that it may be encountered at the start of a file trans-coded from another encoding.[33] While ASCII text encoded using UTF-8 is backward compatible with ASCII, this is not true when Unicode Standard recommendations are ignored and a BOM is added. A BOM can confuse software that isn't prepared for it but can otherwise accept UTF-8, e.g. programming languages that permit non-ASCII bytes in string literals but not at the start of the file. Nevertheless, there was and still is software that always inserts a BOM when writing UTF-8, and refuses to correctly interpret UTF-8 unless the first character is a BOM (or the file only contains ASCII).[34]

Adoption

See also: Popularity of text encodings

[Figure: Declared character set for the 10 million most popular websites since 2010]

[Figure: Use of the main encodings on the web from 2001 to 2012 as recorded by Google,[35] with UTF-8 overtaking all others in 2008 and over 60% of the web in 2012 (since then approaching 100%). The ASCII-only figure includes all web pages that only contain ASCII characters, regardless of the declared header. Other encodings such as GB2312 are added to "others".]
Many standards only support UTF-8, e.g. open JSON exchange requires it (without a byte order mark (BOM)).[36] UTF-8 is also the recommendation from the WHATWG for HTML and DOM specifications,[37] and the Internet Mail Consortium recommends that all e-mail programs be able to display and create mail using UTF-8.[38][39] The World Wide Web Consortium recommends UTF-8 as the default encoding in XML and HTML (and not just using UTF-8, also declaring it in metadata), "even when all characters are in the ASCII range .. Using non-UTF-8 encodings can have unexpected results".[40]

Lots of software has the ability to read/write UTF-8, though this often requires the user to change options from the normal settings, and may require a BOM (byte order mark) as the first character to read the file. Examples include Microsoft Word[41][42][43] and Microsoft Excel.[44][45] Most databases support UTF-8 (sometimes the only option as with some file formats), including Microsoft's since SQL Server 2019, resulting in 35% speed increase, and "nearly 50% reduction in storage requirements."[46] Microsoft fully supports and recommends UTF-8 for its products such as Windows.

UTF-8 has been the most common encoding for the World Wide Web since 2008.[47] As of October 2022, UTF-8 accounts for on average 97.9% of all web pages (and 989 of the top 1,000 highest ranked web pages).[7] Although many pages only use ASCII characters to display content, few websites now declare their encoding to only be ASCII instead of UTF-8.[48] Over a third of the languages tracked have 100% UTF-8 use.
For local text files UTF-8 usage is lower, and many legacy single-byte (and CJK multi-byte) encodings remain in use. The primary cause is editors that do not display or write UTF-8 unless the first character in a file is a byte order mark (BOM), making it impossible for other software to use UTF-8 without being rewritten to ignore the byte order mark on input and add it on output.[49][50] There has been some improvement: Notepad on Windows now writes UTF-8 without a BOM by default (a change since Windows 7, catching up with most other editors),[51] some system files on Windows 11 require UTF-8[52] and do not require the BOM, and almost all files on macOS and Linux are required to be UTF-8 without a BOM.[citation needed] Java 18 changed to defaulting to reading and writing files as UTF-8,[53] whereas in older versions (e.g. LTS versions) only the NIO API did so. Many other programming languages default to UTF-8 for I/O, including the current Ruby 3.0[54][55] and R 4.2.2.[56] All currently supported versions of Python support UTF-8 for I/O, even on Windows (where it is opt-in for the open() function[57]). There are plans to make UTF-8 I/O the default in Python 3.15 on Windows, as on other platforms, and changes have already been made to help programmers prepare for this.[58]
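Where UTF-8 is opt-in for `open()`, the usual way to avoid depending on the platform default is to pass the encoding explicitly. A minimal sketch follows; the file name and sample text are made up for the example, and `newline=""` is used only so the bytes on disk match the string exactly on every platform.

```python
import os
import tempfile

text = "naïve café €\n"
path = os.path.join(tempfile.mkdtemp(), "example.txt")   # hypothetical file

# Write and read with an explicit encoding instead of the platform default.
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write(text)

with open(path, "rb") as f:
    raw = f.read()

with open(path, "r", encoding="utf-8", newline="") as f:
    round_trip = f.read()

print(raw)                               # the UTF-8 bytes as stored on disk
print(raw.startswith(b"\xef\xbb\xbf"))   # whether a BOM was written
print(round_trip == text)
```

The `"utf-8"` codec writes no byte order mark; Python also offers a separate `"utf-8-sig"` codec for files that begin with one.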
Internal use within software is lower, with UTF-16 in use, particularly on Windows, but also by JavaScript, Python,[59][60] Qt, and many other cross-platform software libraries. Compatibility with the Windows API is the primary reason for this (though the belief that direct indexing of the BMP improves speed was also a factor). More recent software has started to use UTF-8 (almost) exclusively: the default string primitive used in Go,[61] Julia, Rust, Swift 5,[62] and PyPy[63] is UTF-8, a future version of Python intends to store strings as UTF-8,[64] and modern versions of Microsoft Visual Studio use UTF-8 internally[65] (however, they still require a command-line switch to read or write UTF-8[66]). UTF-8 is the "only text encoding mandated to be supported by the C++ standard" in C++20.[67] All currently supported Windows versions support UTF-8 in some way, partially at least since Windows XP (and fully in the latest versions). As of May 2019, Microsoft reversed its course of only recommending UTF-16: Windows now provides the ability to set UTF-8 as the "code page" for the Windows API (the multi-byte API; previously this was impossible), and Microsoft now recommends that programmers use UTF-8.[68]

History

See also: Universal Coded Character Set § History

The International Organization for Standardization (ISO) set out to compose a universal multi-byte character set in 1989. The draft ISO 10646 standard contained a non-required annex called UTF-1 that provided a byte stream encoding of its 32-bit code points. This encoding was not satisfactory on performance grounds, among other problems. The biggest problem was probably that it did not have a clear separation between ASCII and non-ASCII: new UTF-1 tools would be backward compatible with ASCII-encoded text, but UTF-1-encoded text could confuse existing code expecting ASCII (or extended ASCII), because it could contain continuation bytes in the range 0x21–0x7E that meant something else in ASCII, e.g., 0x2F for '/', the Unix path directory separator. This example is reflected in the name and introductory text of its replacement. The table below was derived from a textual description in the annex.
UTF-1

| Number of bytes | First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 |
|---|---|---|---|---|---|---|---|
| 1 | U+0000 | U+009F | 00–9F | | | | |
| 2 | U+00A0 | U+00FF | A0 | A0–FF | | | |
| 2 | U+0100 | U+4015 | A1–F5 | 21–7E, A0–FF | | | |
| 3 | U+4016 | U+38E2D | F6–FB | 21–7E, A0–FF | 21–7E, A0–FF | | |
| 5 | U+38E2E | U+7FFFFFFF | FC–FF | 21–7E, A0–FF | 21–7E, A0–FF | 21–7E, A0–FF | 21–7E, A0–FF |

In July 1992, the X/Open committee XoJIG was looking for a better encoding. Dave Prosser of Unix System Laboratories submitted a proposal for one that had faster implementation characteristics and introduced the improvement that 7-bit ASCII characters would only represent themselves; all multi-byte sequences would include only bytes where the high bit was set. The name File System Safe UCS Transformation Format (FSS-UTF) and most of the text of this proposal were later preserved in the final specification.[69][70][71][72]

FSS-UTF

FSS-UTF proposal (1992)

| Number of bytes | First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 |
|---|---|---|---|---|---|---|---|
| 1 | U+0000 | U+007F | 0xxxxxxx | | | | |
| 2 | U+0080 | U+207F | 10xxxxxx | 1xxxxxxx | | | |
| 3 | U+2080 | U+8207F | 110xxxxx | 1xxxxxxx | 1xxxxxxx | | |
| 4 | U+82080 | U+208207F | 1110xxxx | 1xxxxxxx | 1xxxxxxx | 1xxxxxxx | |
| 5 | U+2082080 | U+7FFFFFFF | 11110xxx | 1xxxxxxx | 1xxxxxxx | 1xxxxxxx | 1xxxxxxx |

In August 1992, this proposal was circulated by an IBM X/Open representative to interested parties. A modification by Ken Thompson of the Plan 9 operating system group at Bell Labs made it self-synchronizing, letting a reader start anywhere and immediately detect character boundaries, at the cost of being somewhat less bit-efficient than the previous proposal. It also abandoned the use of biases and instead added the rule that only the shortest possible encoding is allowed; the additional loss in compactness is relatively insignificant, but readers now have to look out for invalid encodings to avoid reliability and especially security issues. Thompson's design was outlined on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike. In the following days, Pike and Thompson implemented it and updated Plan 9 to use it throughout, and then communicated their success back to X/Open, which accepted it as the specification for FSS-UTF.[71]

FSS-UTF (1992) / UTF-8 (1993)[2]

| Number of bytes | First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 |
|---|---|---|---|---|---|---|---|---|
| 1 | U+0000 | U+007F | 0xxxxxxx | | | | | |
| 2 | U+0080 | U+07FF | 110xxxxx | 10xxxxxx | | | | |
| 3 | U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | | | |
| 4 | U+10000 | U+1FFFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | | |
| 5 | U+200000 | U+3FFFFFF | 111110xx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | |
| 6 | U+4000000 | U+7FFFFFFF | 1111110x | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |

UTF-8 was first officially presented at the USENIX conference in San Diego, from January 25 to 29, 1993. The Internet Engineering Task Force adopted UTF-8 in its Policy on Character Sets and Languages in RFC 2277 (BCP 18) for future internet standards work, replacing Single Byte Character Sets such as Latin-1 in older RFCs.[73]

In November 2003, UTF-8 was restricted by RFC 3629 to match the constraints of the UTF-16 character encoding: explicitly prohibiting code points corresponding to the high and low surrogate characters removed more than 3% of the three-byte sequences, and ending at U+10FFFF removed more than 48% of the four-byte sequences and all five- and six-byte sequences.

Standards

There are several current definitions of UTF-8 in various standards documents:

- RFC 3629 / STD 63 (2003), which establishes UTF-8 as a standard internet protocol element
- RFC 5198 defines UTF-8 NFC for Network Interchange (2008)
- ISO/IEC 10646:2014 §9.1 (2014)[74]
- The Unicode Standard, Version 14.0.0 (2021)[75]

They supersede the definitions given in the following obsolete works:

- The Unicode Standard, Version 2.0, Appendix A (1996)
- ISO/IEC 10646-1:1993 Amendment 2 / Annex R (1996)
- RFC 2044 (1996)
- RFC 2279 (1998)
- The Unicode Standard, Version 3.0, §2.3 (2000) plus Corrigendum #1: UTF-8 Shortest Form (2000)
- Unicode Standard Annex #27: Unicode 3.1 (2001)[76]
- The Unicode Standard, Version 5.0 (2006)[77]
- The Unicode Standard, Version 6.0 (2010)[78]

They are all the same in their general mechanics, with the main differences being on issues such as allowed range of code point values and safe handling of invalid input.
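The change in the allowed range of code point values can be quantified. The short Python calculation below checks the percentages quoted for the RFC 3629 restriction against the ranges in the FSS-UTF (1992) / UTF-8 (1993) table; the variable names are chosen for this sketch.

```python
# Code points reachable by three-byte sequences in shortest form.
three_byte_total = 0xFFFF - 0x0800 + 1          # 63,488
surrogates = 0xDFFF - 0xD800 + 1                # 2,048 prohibited surrogates

# Code points reachable by four-byte sequences before and after RFC 3629.
four_byte_total = 0x1FFFFF - 0x10000 + 1        # 2,031,616
four_byte_kept = 0x10FFFF - 0x10000 + 1         # 1,048,576
four_byte_removed = four_byte_total - four_byte_kept

print(f"three-byte sequences removed: {surrogates / three_byte_total:.1%}")
print(f"four-byte sequences removed:  {four_byte_removed / four_byte_total:.1%}")

# Valid code points that remain: 17 planes of 2**16, minus the surrogates.
valid_code_points = 17 * 2**16 - 2**11
print(f"valid code points: {valid_code_points:,}")
```

The last figure matches the count of 1,112,064 valid code points given in the introduction.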
Comparison with other encodings

See also: Comparison of Unicode encodings

Some of the important features of this encoding are as follows:

Backward compatibility: Backward compatibility with ASCII and the enormous amount of software designed to process ASCII-encoded text was the main driving force behind the design of UTF-8. In UTF-8, single bytes with values in the range of 0 to 127 map directly to Unicode code points in the ASCII range. Single bytes in this range represent characters, as they do in ASCII. Moreover, 7-bit bytes (bytes where the most significant bit is 0) never appear in a multi-byte sequence, and no valid multi-byte sequence decodes to an ASCII code point. A sequence of 7-bit bytes is both valid ASCII and valid UTF-8, and under either interpretation represents the same sequence of characters. Therefore, the 7-bit bytes in a UTF-8 stream represent all and only the ASCII characters in the stream. Thus, many text processors, parsers, protocols, file formats, text display programs, etc., which use ASCII characters for formatting and control purposes, will continue to work as intended by treating the UTF-8 byte stream as a sequence of single-byte characters, without decoding the multi-byte sequences. ASCII characters on which the processing turns, such as punctuation, whitespace, and control characters will never be encoded as multi-byte sequences. It is therefore safe for such processors to simply ignore or pass through the multi-byte sequences, without decoding them. For example, ASCII whitespace may be used to tokenize a UTF-8 stream into words; ASCII line feeds may be used to split a UTF-8 stream into lines; and ASCII NUL characters can be used to split UTF-8-encoded data into null-terminated strings. Similarly, many format strings used by library functions like "printf" will correctly handle UTF-8-encoded input arguments.
Fallback and auto-detection: Only a small subset of possible byte strings are a valid UTF-8 string: several bytes cannot appear; a byte with the high bit set cannot be alone; and further requirements mean that it is extremely unlikely that a readable text in any extended ASCII is valid UTF-8. Part of the popularity of UTF-8 is due to it providing a form of backward compatibility for these as well. A UTF-8 processor which erroneously receives extended ASCII as input can thus "auto-detect" this with very high reliability. A UTF-8 stream may simply contain errors, resulting in the auto-detection scheme producing false positives; but auto-detection is successful in the vast majority of cases, especially with longer texts, and is widely used. It also works to "fall back" or replace 8-bit bytes using the appropriate code point for a legacy encoding when errors in the UTF-8 are detected, allowing recovery even if UTF-8 and legacy encoding is concatenated in the same file.

Prefix code: The first byte indicates the number of bytes in the sequence. Reading from a stream can instantaneously decode each individual fully received sequence, without first having to wait for either the first byte of a next sequence or an end-of-stream indication. The length of multi-byte sequences is easily determined by humans as it is simply the number of high-order 1s in the leading byte. An incorrect character will not be decoded if a stream ends mid-sequence.

Self-synchronization: The leading bytes and the continuation bytes do not share values (continuation bytes start with the bits 10 while single bytes start with 0 and longer lead bytes start with 11). This means a search will not accidentally find the sequence for one character starting in the middle of another character. It also means the start of a character can be found from a random position by backing up at most 3 bytes to find the leading byte. An incorrect character will not be decoded if a stream starts mid-sequence, and a shorter sequence will never appear inside a longer one.

Sorting order: The chosen values of the leading bytes mean that a list of UTF-8 strings can be sorted in code point order by sorting the corresponding byte sequences.
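Several of the properties listed above can be observed with a few lines of Python. This is a small sketch with made-up sample data; `sequence_length` and `char_start` are hypothetical helper names, and both assume that their input is valid UTF-8.

```python
def sequence_length(lead: int) -> int:
    """Prefix code: the leading byte alone gives the sequence length."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    raise ValueError(f"{lead:#04x} is not a leading byte")


def char_start(buf: bytes, pos: int) -> int:
    """Self-synchronization: back up over continuation bytes (10xxxxxx)."""
    while pos > 0 and (buf[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


# Backward compatibility: split on the ASCII line feed without decoding.
text = "naïve café\n한글 €"
lines = text.encode("utf-8").split(b"\n")
print([line.decode("utf-8") for line in lines])

# Self-synchronization and prefix code on the bytes 61 E2 82 AC 62.
data = "a€b".encode("utf-8")
start = char_start(data, 3)              # index 3 is the last byte of the euro sign
print(start, sequence_length(data[start]))

# Sorting order: byte-wise order equals code point order.
words = ["z", "é", "한", "𐍈", "a", "€"]
by_bytes = [w.decode("utf-8") for w in sorted(w.encode("utf-8") for w in words)]
print(by_bytes == sorted(words))         # Python compares str by code point
```

None of these operations needs a full decoder, which is the practical benefit the list describes.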
Single-byte

UTF-8 can encode any Unicode character, avoiding the need to figure out and set a "code page" or otherwise indicate what character set is in use, and allowing output in multiple scripts at the same time. For many scripts there have been more than one single-byte encoding in usage, so even knowing the script was insufficient information to display it correctly.

The bytes 0xFE and 0xFF do not appear, so a valid UTF-8 stream never matches the UTF-16 byte order mark and thus cannot be confused with it. The absence of 0xFF (0377) also eliminates the need to escape this byte in Telnet (and FTP control connection).

UTF-8 encoded text is larger than specialized single-byte encodings except for plain ASCII characters. In the case of scripts which used 8-bit character sets with non-Latin characters encoded in the upper half (such as most Cyrillic and Greek alphabet code pages), characters in UTF-8 will be double the size. For some scripts, such as Thai and Devanagari (which is used by various South Asian languages), characters will triple in size. There are even examples where a single byte turns into a composite character in Unicode and is thus six times larger in UTF-8. This has caused objections in India and other countries.

It is possible in UTF-8 (or any other multi-byte encoding) to split or truncate a string in the middle of a character. If the two pieces are not re-appended later before interpretation as characters, this can introduce an invalid sequence at both the end of the previous section and the start of the next, and some decoders will not preserve these bytes and result in data loss. Because UTF-8 is self-synchronizing this will however never introduce a different valid character, and it is also fairly easy to move the truncation point backward to the start of a character.

If the code points are all the same size, measurements of a fixed number of them is easy. Due to ASCII-era documentation where "character" is used as a synonym for "byte" this is often considered important. However, by measuring string positions using bytes instead of "characters" most algorithms can be easily and efficiently adapted for UTF-8. Searching for a string within a long string can for example be done byte by byte; the self-synchronization property prevents false positives.
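Moving a truncation point back to a character boundary, and searching byte by byte, might look as follows in Python. The helper name `truncate_utf8` is hypothetical, and the sketch assumes valid UTF-8 input and a non-negative byte limit.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut data to at most max_bytes without splitting a character."""
    if len(data) <= max_bytes:
        return data
    cut = max_bytes
    # data[cut] is the first byte that would be dropped. If it is a
    # continuation byte (10xxxxxx), the cut falls inside a character,
    # so move back until the cut sits on a character's first byte.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]


data = "héllo €uro".encode("utf-8")

for limit in (2, 8, 10):
    piece = truncate_utf8(data, limit)
    print(limit, len(piece), piece.decode("utf-8"))

# Byte-wise search: positions are byte offsets, and a match can never
# begin in the middle of another character.
print(data.find("€".encode("utf-8")))
```

A plain slice such as `data[:8]` would end in the middle of the euro sign and fail strict decoding; the adjusted cut always decodes cleanly.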
Other multi-byte

UTF-8 can encode any Unicode character. Files in different scripts can be displayed correctly without having to choose the correct code page or font. For instance, Chinese and Arabic can be written in the same file without specialized markup or manual settings that specify an encoding.

UTF-8 is self-synchronizing: character boundaries are easily identified by scanning for well-defined bit patterns in either direction. If bytes are lost due to error or corruption, one can always locate the next valid character and resume processing. If there is a need to shorten a string to fit a specified field, the previous valid character can easily be found. Many multi-byte encodings such as Shift JIS are much harder to resynchronize. This also means that byte-oriented string-searching algorithms can be used with UTF-8 (as a character is the same as a "word" made up of that many bytes); optimized versions of byte searches can be much faster due to hardware support and lookup tables that have only 256 entries. Self-synchronization does however require that bits be reserved for these markers in every byte, increasing the size.

UTF-8 is efficient to encode using simple bitwise operations. It does not require slower mathematical operations such as multiplication or division (unlike Shift JIS, GB 2312 and other encodings).

UTF-8 will take more space than a multi-byte encoding designed for a specific script. East Asian legacy encodings generally used two bytes per character yet take three bytes per character in UTF-8.

UTF-16

Main article: UTF-16

Byte encodings and UTF-8 are represented by byte arrays in programs, and often nothing needs to be done to a function when converting source code from a byte encoding to UTF-8. UTF-16 is represented by 16-bit word arrays, and converting to UTF-16 while maintaining compatibility with existing ASCII-based programs (such as was done with Windows) requires every API and data structure that takes a string to be duplicated, one version accepting byte strings and another version accepting UTF-16. If backward compatibility is not needed, all string handling still must be modified.
Text encoded in UTF-8 will be smaller than the same text encoded in UTF-16 if there are more code points below U+0080 than in the range U+0800..U+FFFF. This is true for all modern European languages. It is often true even for languages like Chinese, due to the large number of spaces, newlines, digits, and HTML markup in typical files.

Most communication (e.g. HTML and IP) and storage (e.g. for Unix) was designed for a stream of bytes. A UTF-16 string must use a pair of bytes for each code unit:

- The order of those two bytes becomes an issue and must be specified in the UTF-16 protocol, such as with a byte order mark.
- If an odd number of bytes is missing from UTF-16, the whole rest of the string will be meaningless text. Any bytes missing from UTF-8 will still allow the text to be recovered accurately starting with the next character after the missing bytes.

Derivatives

The following implementations show slight differences from the UTF-8 specification. They are incompatible with the UTF-8 specification and may be rejected by conforming UTF-8 applications.

CESU-8

Main article: CESU-8

Unicode Technical Report #26[79] assigns the name CESU-8 to a nonstandard variant of UTF-8, in which Unicode characters in supplementary planes are encoded using six bytes, rather than the four bytes required by UTF-8. CESU-8 encoding treats each half of a four-byte UTF-16 surrogate pair as a two-byte UCS-2 character, yielding two three-byte UTF-8 characters, which together represent the original supplementary character. Unicode characters within the Basic Multilingual Plane appear as they would normally in UTF-8. The Report was written to acknowledge and formalize the existence of data encoded as CESU-8, despite the Unicode Consortium discouraging its use, and notes that a possible intentional reason for CESU-8 encoding is preservation of UTF-16 binary collation.
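The CESU-8 construction described above can be reproduced in Python by encoding each UTF-16 code unit separately. This is only a sketch for illustration: Python has no built-in CESU-8 codec, `cesu8_encode` is a hypothetical name, and the `surrogatepass` error handler is used solely to let individual surrogate halves through the UTF-8 encoder.

```python
import struct


def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8: each UTF-16 code unit becomes its own sequence."""
    out = bytearray()
    utf16 = text.encode("utf-16-le")
    for (unit,) in struct.iter_unpack("<H", utf16):
        # Surrogate halves are encoded as if they were ordinary BMP
        # code points, which yields three bytes per half.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


bmp = "€"            # Basic Multilingual Plane: identical to UTF-8
supplementary = "𐍈"  # U+10348: four bytes in UTF-8, six in CESU-8

print(cesu8_encode(bmp).hex(" ").upper(), bmp.encode("utf-8").hex(" ").upper())
print(cesu8_encode(supplementary).hex(" ").upper(),
      supplementary.encode("utf-8").hex(" ").upper())
```

A strict UTF-8 decoder rejects the six-byte result, because it contains encoded surrogate halves.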
CESU-8 encoding can result from converting UTF-16 data with supplementary characters to UTF-8, using conversion methods that assume UCS-2 data, meaning they are unaware of four-byte UTF-16 supplementary characters. It is primarily an issue on operating systems which extensively use UTF-16 internally, such as Microsoft Windows.[citation needed]

In Oracle Database, the UTF8 character set uses CESU-8 encoding, and is deprecated. The AL32UTF8 character set uses standards-compliant UTF-8 encoding, and is preferred.[80][81]

CESU-8 is prohibited for use in HTML5 documents.[82][83][84]

MySQL utf8mb3

In MySQL, the utf8mb3 character set is defined to be UTF-8 encoded data with a maximum of three bytes per character, meaning only Unicode characters in the Basic Multilingual Plane (i.e. from UCS-2) are supported. Unicode characters in supplementary planes are explicitly not supported. utf8mb3 is deprecated in favor of the utf8mb4 character set, which uses standards-compliant UTF-8 encoding. utf8 is an alias for utf8mb3, but is intended to become an alias to utf8mb4 in a future release of MySQL.[85] It is possible, though unsupported, to store CESU-8 encoded data in utf8mb3, by handling UTF-16 data with supplementary characters as though it is UCS-2.

Modified UTF-8

Modified UTF-8 (MUTF-8) originated in the Java programming language. In Modified UTF-8, the null character (U+0000) uses the two-byte overlong encoding 11000000 10000000 (hexadecimal C0 80), instead of 00000000 (hexadecimal 00).[86] Modified UTF-8 strings never contain any actual null bytes but can contain all Unicode code points including U+0000,[87] which allows such strings (with a null byte appended) to be processed by traditional null-terminated string functions. All known Modified UTF-8 implementations also treat the surrogate pairs as in CESU-8.
In normal usage, the language supports standard UTF-8 when reading and writing strings through InputStreamReader and OutputStreamWriter (if it is the platform's default character set or as requested by the program). However it uses Modified UTF-8 for object serialization[88] among other applications of DataInput and DataOutput, for the Java Native Interface,[89] and for embedding constant strings in class files.[90]

The dex format defined by Dalvik also uses the same modified UTF-8 to represent string values.[91] Tcl also uses the same modified UTF-8[92] as Java for internal representation of Unicode data, but uses strict CESU-8 for external data.

WTF-8

In WTF-8 (Wobbly Transformation Format, 8-bit) unpaired surrogate halves (U+D800 through U+DFFF) are allowed.[93] This is necessary to store possibly-invalid UTF-16, such as Windows filenames. Many systems that deal with UTF-8 work this way without considering it a different encoding, as it is simpler.[94]

(The term "WTF-8" has also been used humorously to refer to erroneously doubly-encoded UTF-8,[95][96] sometimes with the implication that CP1252 bytes are the only ones encoded.)[97]

PEP 383

Version 3 of the Python programming language treats each byte of an invalid UTF-8 bytestream as an error (see also changes with new UTF-8 mode in Python 3.7[98]); this gives 128 different possible errors. Extensions have been created to allow any byte sequence that is assumed to be UTF-8 to be losslessly transformed to UTF-16 or UTF-32, by translating the 128 possible error bytes to reserved code points, and transforming those code points back to error bytes to output UTF-8. The most common approach is to translate the codes to U+DC80...U+DCFF which are low (trailing) surrogate values and thus "invalid" UTF-16, as used by Python's PEP 383 (or "surrogateescape") approach.[99] Another encoding called MirBSD OPTU-8/16 converts them to U+EF80...U+EFFF in a Private Use Area.[100] In either approach, the byte value is encoded in the low eight bits of the output code point.
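Python exposes the PEP 383 approach as the `surrogateescape` error handler. The following sketch uses made-up input bytes to show the round trip and the placement of the byte value in the low eight bits of the escaped code point.

```python
raw = b"caf\xe9 \xff ok"     # contains two bytes that are not valid UTF-8

# Invalid bytes 0x80..0xFF become lone surrogates U+DC80..U+DCFF.
text = raw.decode("utf-8", errors="surrogateescape")
escaped = [ch for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]

print([f"U+{ord(ch):04X}" for ch in escaped])
print([f"{ord(ch) & 0xFF:#04x}" for ch in escaped])   # original byte values

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")
print(restored == raw)

# A strict encoder refuses the escaped code points.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)
```

The valid parts of the input decode normally, so the text can be processed as a string while the undecodable bytes are carried along unchanged.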
These encodings are useful because they avoid the need to deal with "invalid" byte strings until much later, if at all, and allow "text" and "data" byte arrays to be the same object. If a program wants to use UTF-16 internally, these are required to preserve and use filenames that can use invalid UTF-8;[101] as the Windows filesystem API uses UTF-16, the need to support invalid UTF-8 is less there.[99]

For the encoding to be reversible, the standard UTF-8 encodings of the code points used for erroneous bytes must be considered invalid. This makes the encoding incompatible with WTF-8 or CESU-8 (though only for 128 code points). When re-encoding, it is necessary to be careful of sequences of error code points which convert back to valid UTF-8, which may be used by malicious software to get unexpected characters in the output. This cannot produce ASCII characters, however, so it is considered comparatively safe, since malicious sequences (such as cross-site scripting) usually rely on ASCII characters.[101]

See also

Alt code
Comparison of email clients § Features
Comparison of Unicode encodings
GB 18030
UTF-EBCDIC
Iconv
Percent-encoding § Current standard
Specials (Unicode block)
Unicode and email
Unicode and HTML
Character encodings in HTML

Notes

[nb1] 17 planes times 2^16 code points per plane, minus 2^11 technically-invalid surrogates.
[nb2] There are enough x bits to encode up to 0x1FFFFF, but the current RFC 3629 §3 limits UTF-8 encoding to code point U+10FFFF, to match the limits of UTF-16. The obsolete RFC 2279 allowed UTF-8 encoding up to (then legal) code point U+7FFFFFF.
[nb3] Some complex emoji characters can take even more than this; the transgender flag emoji (🏳️⚧️), which consists of the five-code point sequence U+1F3F3 U+FE0F U+200D U+26A7 U+FE0F, requires sixteen bytes to encode, while that for the flag of Scotland (🏴) requires a total of twenty-eight bytes for the seven-code point sequence U+1F3F4 U+E0067 U+E0062 U+E0073 U+E0063 U+E0074 U+E007F.
[nb4] For example, cell 9D says +1D. The hexadecimal number 9D in binary is 10011101, and since the 2 highest bits (10) are reserved for marking this as a continuation byte, the remaining 6 bits (011101) have a hexadecimal value of 1D.
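The arithmetic in the notes above can be checked with a few lines of Python. This is just a verification sketch; the variable names are illustrative and the emoji are written as escape sequences so the code points are explicit.

```python
# Count of valid code points: 17 planes of 2**16 each, minus 2**11 surrogates.
valid_code_points = 17 * 2**16 - 2**11

# Transgender flag: U+1F3F3 U+FE0F U+200D U+26A7 U+FE0F (five code points).
transgender_flag = "\U0001F3F3\uFE0F\u200D\u26A7\uFE0F"

# Flag of Scotland: U+1F3F4 followed by six tag characters (seven code points).
scotland_flag = "\U0001F3F4\U000E0067\U000E0062\U000E0073\U000E0063\U000E0074\U000E007F"

# Payload of continuation byte 9D: drop the two marker bits, keep the low six.
continuation_payload = 0x9D & 0x3F

if __name__ == "__main__":
    print(valid_code_points)
    print(len(transgender_flag.encode("utf-8")))
    print(len(scotland_flag.encode("utf-8")))
    print(hex(continuation_payload))
```

The first emoji needs one four-byte and four three-byte sequences, and the second needs seven four-byte sequences, which is where the totals of sixteen and twenty-eight bytes come from.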
References

[1] "Chapter 2. General Structure". The Unicode Standard (6.0 ed.). Mountain View, California, US: The Unicode Consortium. ISBN 978-1-936213-01-6.
[2] Pike, Rob (2003-04-30). "UTF-8 history".
[3] Pike, Rob; Thompson, Ken (1993). "Hello World or Καλημέρα κόσμε or こんにちは 世界" (PDF). Proceedings of the Winter 1993 USENIX Conference.
[4] "File System Safe UCS - Transformation Format (FSS-UTF) - X/Open Preliminary Specification" (PDF). unicode.org.
[5] "USENIX Winter 1993 Conference Proceedings". usenix.org.
[6] "RFC 2277 - IETF Policy on Character Sets and Languages". datatracker.ietf.org.
[7] "Usage Survey of Character Encodings broken down by Ranking". w3techs.com. Retrieved 2022-10-11.
[8] "Character Sets". Internet Assigned Numbers Authority. 2013-01-23. Retrieved 2013-02-08.
[9] Dürst, Martin. "Setting the HTTP charset parameter". W3C. Retrieved 2013-02-08.
[10] Yergeau, F. (2003). UTF-8, a transformation format of ISO 10646. Internet Engineering Task Force. doi:10.17487/RFC3629. RFC 3629. Retrieved 2015-02-03.
[11] "Encoding Standard § 4.2. Names and labels". WHATWG. Retrieved 2018-04-29.
[12] "BOM". suikawiki (in Japanese). Retrieved 2013-04-26.
[13] Davis, Mark. "Forms of Unicode". IBM. Archived from the original on 2005-05-06. Retrieved 2013-09-18.
[14] Liviu (2014-02-07). "UTF-8 codepage 65001 in Windows 7 - part I". Retrieved 2018-01-30. "Previously under XP (and, unverified, but probably Vista, too) for loops simply did not work while code page 65001 was active"
[15] "HP PCL Symbol Sets | Printer Control Language (PCL & PXL) Support Blog". 2015-02-19. Archived from the original on 2015-02-19. Retrieved 2018-01-30.
[16] Allen, Julie D.; Anderson, Deborah; Becker, Joe; Cook, Richard, eds. (2012). The Unicode Standard, Version 6.1. Mountain View, California: Unicode Consortium.
[17] "Apple Developer Documentation". developer.apple.com. Retrieved 2021-03-15.
[18] "BinaryString (flink 1.9-SNAPSHOT API)". ci.apache.org. Retrieved 2021-03-24.
[19] "Chapter 3" (PDF), The Unicode Standard, p. 54
[20] "Chapter 3" (PDF), The Unicode Standard, p. 55
[21] "Chapter 3" (PDF), The Unicode Standard, p. 55
[22] Yergeau, F. (November 2003). UTF-8, a transformation format of ISO 10646. IETF. doi:10.17487/RFC3629. STD 63. RFC 3629. Retrieved 2020-08-20.
[23] "Chapter 3" (PDF), The Unicode Standard, p. 54
[24] Yergeau, F. (November 2003). UTF-8, a transformation format of ISO 10646. IETF. doi:10.17487/RFC3629. STD 63. RFC 3629. Retrieved 2020-08-20.
[25] "Chapter 3" (PDF), The Unicode Standard, p. 55
[26] Marin, Marvin (2000-10-17). "Web Server Folder Traversal MS00-078".
[27] "Summary for CVE-2008-2938". National Vulnerability Database.
[28] "PEP 529 -- Change Windows filesystem encoding to UTF-8". Python.org. Retrieved 2022-05-10. "This PEP proposes changing the default filesystem encoding on Windows to utf-8, and changing all filesystem functions to use the Unicode APIs for filesystem paths. [..] can correctly round-trip all characters used in paths (on POSIX with surrogateescape handling; on Windows because str maps to the native representation). On Windows bytes cannot round-trip all characters used in paths"
[29] "DataInput (Java Platform SE 8)". docs.oracle.com. Retrieved 2021-03-24.
[30] "Non-decodable Bytes in System Character Interfaces". python.org. 2009-04-22. Retrieved 2014-08-13.
[31] "Unicode 6.0.0".
[32] 128 1-byte, (16+5)×64 2-byte, and 5×64×64 3-byte. There may be somewhat fewer if more precise tests are done for each continuation byte.
[33] "Chapter 2" (PDF), The Unicode Standard - Version 6.0, p. 30
[34] "UTF-8 and Unicode FAQ for Unix/Linux".
[35] Davis, Mark (2012-02-03). "Unicode over 60 percent of the web". Official Google Blog. Archived from the original on 2018-08-09. Retrieved 2020-07-24.
[36] Bray, Tim (December 2017). "The JavaScript Object Notation (JSON) Data Interchange Format". IETF. Retrieved 2018-02-16.
[37] "Encoding Standard". encoding.spec.whatwg.org. Retrieved 2020-04-15.
[38] "Usage of Internet Mail in The World Characters". washingtonindependent.com. 1998-08-01. Retrieved 2007-11-08.
[39] "Encoding Standard". encoding.spec.whatwg.org. Retrieved 2018-11-15.
[40] "Specifying the document's character encoding". HTML 5.2. World Wide Web Consortium. 2017-12-14. Retrieved 2018-06-03.
[41] "Choose text encoding when you open and save files". support.microsoft.com. Retrieved 2021-11-01.
[42] "utf8 - Character encoding of Microsoft Word DOC and DOCX files?". Stack Overflow. Retrieved 2021-11-01.
[43] "Exporting a UTF-8 .txt file from Word".
[44] "excel - Are XLSX files UTF-8 encoded by definition?". Stack Overflow. Retrieved 2021-11-01.
[45] "How to open UTF-8 CSV file in Excel without mis-conversion of characters in Japanese and Chinese language for both Mac and Windows?". answers.microsoft.com. Retrieved 2021-11-01.
[46] "Introducing UTF-8 support for SQL Server". techcommunity.microsoft.com. 2019-07-02. Retrieved 2021-08-24. "For example, changing an existing column data type from NCHAR(10) to CHAR(10) using an UTF-8 enabled collation, translates into nearly 50% reduction in storage requirements. [..] In the ASCII range, when doing intensive read/write I/O on UTF-8, we measured an average 35% performance improvement over UTF-16 using clustered tables with a non-clustered index on the string column, and an average 11% performance improvement over UTF-16 using a heap."
[47] Davis, Mark (2008-05-05). "Moving to Unicode 5.1". Retrieved 2021-02-19.
[48] "Usage Statistics and Market Share of ASCII for Websites, October 2021". w3techs.com. Retrieved 2020-11-01.
[49] "How can I make Notepad to save text in UTF-8 without the BOM?". Stack Overflow. Retrieved 2021-03-24.
[50] Galloway, Matt. "Character encoding for iOS developers. Or UTF-8 what now?". www.galloway.me.uk. Retrieved 2021-01-02. "in reality, you usually just assume UTF-8 since that is by far the most common encoding."
[51] "Windows 10 Notepad is Getting Better UTF-8 Encoding Support". BleepingComputer. Retrieved 2021-03-24. "Microsoft is now defaulting to saving new text files as UTF-8 without BOM as shown below."
[52] "Customize the Windows 11 Start menu". docs.microsoft.com. Retrieved 2021-06-29. "Make sure your LayoutModification.json uses UTF-8 encoding."
[53] "JEP 400: UTF-8 by Default". openjdk.java.net. Retrieved 2022-03-30.
[54] "Feature #16604: Set default for Encoding.default_external to UTF-8 on Windows - Ruby master - Ruby Issue Tracking System". bugs.ruby-lang.org. Retrieved 2022-08-01.
[55] "Feature #12650: Use UTF-8 encoding for ENV on Windows - Ruby master - Ruby Issue Tracking System". bugs.ruby-lang.org. Retrieved 2022-08-01.
[56] "New features in R 4.2.0 | R-bloggers". The Jumping Rivers Blog. 2022-04-01. Retrieved 2022-08-01.
[57] "PEP 540 – Add a new UTF-8 Mode | peps.python.org". peps.python.org. Retrieved 2022-09-23.
[58] "PEP 597 -- Add optional EncodingWarning". Python.org. Retrieved 2021-08-24.
[59] Python uses a number of encodings for what it calls "Unicode", however none of these encodings are UTF-8. Python 2 and early 3 versions used UTF-16 on Windows and UTF-32 on Unix. Newer Python 3 implementations use three fixed-width encodings: ISO-8859-1, UCS-2, and UTF-32, depending on the maximum code point needed.
[60] "PEP 393 – Flexible String Representation". Python.org. Retrieved 2022-05-18.
[61] "The Go Programming Language Specification". Retrieved 2021-02-10.
[62] Tsai, Michael J. "Michael Tsai - Blog - UTF-8 String in Swift 5". Retrieved 2021-03-15. "Switching to UTF-8 fulfills one of String's long-term goals to enable high-performance processing, [..] also lays the groundwork for providing even more performant APIs in the future"
[63] Mattip (2019-03-24). "PyPy Status Blog: PyPy v7.1 released; now uses utf-8 internally for unicode strings". PyPy Status Blog. Retrieved 2020-11-21.
[64] "PEP 623 -- Remove wstr from Unicode". Python.org. Retrieved 2020-11-21. "Until we drop legacy Unicode object, it is very hard to try other Unicode implementation like UTF-8 based implementation in PyPy"
[65] "/validate-charset (Validate for compatible characters)". docs.microsoft.com. Retrieved 2021-07-19. "Visual Studio uses UTF-8 as the internal character encoding during conversion between the source character set and the execution character set."
[66] "/utf-8 (Set Source and Executable character sets to UTF-8)". docs.microsoft.com. Retrieved 2021-07-18.
[67] "absent std::u8string in C++11". NewbeDEV. Retrieved 2021-11-01.
[68] "Use the Windows UTF-8 code page - UWP applications". docs.microsoft.com. Retrieved 2020-06-06. "As of Windows Version 1903 (May 2019 Update), you can use the ActiveCodePage property in the appxmanifest for packaged apps, or the fusion manifest for unpackaged apps, to force a process to use UTF-8 as the process code page. [..] CP_ACP equates to CP_UTF8 only if running on Windows Version 1903 (May 2019 Update) or above and the ActiveCodePage property described above is set to UTF-8. Otherwise, it honors the legacy system code page. We recommend using CP_UTF8 explicitly."
[69] "Appendix F. FSS-UTF / File System Safe UCS Transformation format" (PDF). The Unicode Standard 1.1. Archived (PDF) from the original on 2016-06-07. Retrieved 2016-06-07.
[70] Whistler, Kenneth (2001-06-12). "FSS-UTF, UTF-2, UTF-8, and UTF-16". Archived from the original on 2016-06-07. Retrieved 2006-06-07.
[71] Pike, Rob (2003-04-30). "UTF-8 history". Retrieved 2012-09-07.
[72] Pike, Rob (2012-09-06). "UTF-8 turned 20 years old yesterday". Retrieved 2012-09-07.
[73] Alvestrand, Harald (January 1998). IETF Policy on Character Sets and Languages. doi:10.17487/RFC2277. BCP 18.
[74] ISO/IEC 10646:2014 §9.1, 2014.
[75] The Unicode Standard, Version 14.0 §3.9 D92, §3.10 D95, 2021.
[76] Unicode Standard Annex #27: Unicode 3.1, 2001.
[77] The Unicode Standard, Version 5.0 §3.9–§3.10 ch. 3, 2006.
[78] The Unicode Standard, Version 6.0 §3.9 D92, §3.10 D95, 2010.
[79] McGowan, Rick (2011-12-19). "Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8)". Unicode Consortium. Unicode Technical Report #26.
[80] "Character Set Support". Oracle Database 19c Documentation, SQL Language Reference. Oracle Corporation.
[81] "Supporting Multilingual Databases with Unicode § Support for the Unicode Standard in Oracle Database". Database Globalization Support Guide. Oracle Corporation.
[82] "8.2.2.3. Character encodings". HTML 5.1 Standard. W3C.
[83] "8.2.2.3. Character encodings". HTML 5 Standard. W3C.
[84] "12.2.3.3 Character encodings". HTML Living Standard. WHATWG.
[85] "The utf8mb3 Character Set (3-Byte UTF-8 Unicode Encoding)". MySQL 8.0 Reference Manual. Oracle Corporation.
[86] "Java SE documentation for Interface java.io.DataInput, subsection on Modified UTF-8". Oracle Corporation. 2015. Retrieved 2015-10-16.
[87] "The Java Virtual Machine Specification, section 4.4.7: "The CONSTANT_Utf8_info Structure"". Oracle Corporation. 2015. Retrieved 2015-10-16.
[88] "Java Object Serialization Specification, chapter 6: Object Serialization Stream Protocol, section 2: Stream Elements". Oracle Corporation. 2010. Retrieved 2015-10-16.
[89] "Java Native Interface Specification, chapter 3: JNI Types and Data Structures, section: Modified UTF-8 Strings". Oracle Corporation. 2015. Retrieved 2015-10-16.
[90] "The Java Virtual Machine Specification, section 4.4.7: "The CONSTANT_Utf8_info Structure"". Oracle Corporation. 2015. Retrieved 2015-10-16.
[91] "ART and Dalvik". Android Open Source Project. Archived from the original on 2013-04-26. Retrieved 2013-04-09.
[92] "UTF-8 bit by bit". Tcler's Wiki. 2001-02-28. Retrieved 2022-09-03.
[93] Sapin, Simon (2016-03-11) [2014-09-25]. "The WTF-8 encoding". Archived from the original on 2016-05-24. Retrieved 2016-05-24.
[94] Sapin, Simon (2015-03-25) [2014-09-25]. "The WTF-8 encoding § Motivation". Archived from the original on 2020-08-16. Retrieved 2020-08-26.
[95] "WTF-8.com". 2006-05-18. Retrieved 2016-06-21.
[96] Speer, Robyn (2015-05-21). "ftfy (fixes text for you) 4.0: changing less and fixing more". Archived from the original on 2015-05-30. Retrieved 2016-06-21.
[97] "WTF-8, a transformation format of code page 1252". Archived from the original on 2016-10-13. Retrieved 2016-10-12.
[98] "PEP 540 -- Add a new UTF-8 Mode". Python.org. Retrieved 2021-03-24.
[99] von Löwis, Martin (2009-04-22). "Non-decodable Bytes in System Character Interfaces". Python Software Foundation. PEP 383.
[100] "RTFM optu8to16(3), optu8to16vis(3)". www.mirbsd.org.
[101] Davis, Mark; Suignard, Michel (2014). "3.7 Enabling Lossless Conversion". Unicode Security Considerations. Unicode Technical Report #36.
External links

Original UTF-8 paper (or pdf) for Plan 9 from Bell Labs
UTF-8 test pages: Andreas Prilop (archived 2017-11-30 at the Wayback Machine), Jost Gippert, World Wide Web Consortium
Unix/Linux: UTF-8/Unicode FAQ, Linux Unicode HOWTO, UTF-8 and Gentoo
Characters, Symbols and the Unicode Miracle on YouTube
Retrieved from "https://en.wikipedia.org/w/index.php?title=UTF-8&oldid=1115403769"

Categories: Character encoding; Computer-related introductions in 1993; Encodings; Unicode Transformation Formats