UTF-8 - Jenkov.com

Jakob Jenkov
Last update: 2022-08-07

UTF-8 is a byte encoding used to encode Unicode characters. UTF-8 uses 1, 2, 3 or 4 bytes to represent a Unicode character. Remember, a Unicode character is represented by a Unicode code point. Thus, UTF-8 uses 1, 2, 3 or 4 bytes to represent a Unicode code point.

UTF-8 is a very commonly used textual encoding on the web, and is thus very popular. Web browsers understand UTF-8. Many programming languages also allow you to use UTF-8 in the code, and can import and export UTF-8 text easily. Several textual data formats and markup languages are often encoded in UTF-8, for instance JSON, XML, HTML, CSS and SVG.

UTF-8 Marker Bits and Code Point Bits

When translating a Unicode code point to one or more UTF-8 encoded bytes, each of these bytes is composed of marker bits and code point bits. The marker bits tell how to interpret the given byte. The code point bits are used to represent the value of the code point. In the following sections the marker bits are written using 0's and 1's, and the code point bits are written using the characters Z, Y, X, W and V. Each character represents a single bit.

Unicode Code Point Intervals Used in UTF-8

For Unicode code points in the hexadecimal value interval U+0000 to U+007F UTF-8 uses a single byte to represent the character. The code points in this interval represent the same characters as the ASCII characters, and use the same integer values (code points) to represent them. In binary digits, the single byte representing a code point in this interval looks like this:

    0ZZZZZZZ

The marker bit has the value 0. The bits representing the code point value are marked with Z.
For Unicode code points in the interval U+0080 to U+07FF UTF-8 uses two bytes to represent the character. In binary digits, the two bytes representing a code point in this interval look like this:

    110YYYYY 10ZZZZZZ

The marker bits are the 110 and 10 bits of the two bytes. The Y and Z characters represent the bits used to represent the code point value. The first byte (most significant byte) is the byte to the left.

For Unicode code points in the interval U+0800 to U+FFFF UTF-8 uses three bytes to represent the character. In binary digits, the three bytes representing a code point in this interval look like this:

    1110XXXX 10YYYYYY 10ZZZZZZ

The marker bits are the 1110 and 10 bits of the three bytes. The X, Y and Z characters represent the bits used to represent the code point value. The first byte (most significant byte) is the byte to the left.

For Unicode code points in the interval U+10000 to U+10FFFF UTF-8 uses four bytes to represent the character. In binary digits, the four bytes representing a code point in this interval look like this:

    11110VVV 10WWXXXX 10YYYYYY 10ZZZZZZ

The marker bits are the 11110 and 10 bits of the four bytes. The bits named V and W mark the code point plane the character is from. The rest of the bits, marked with X, Y and Z, represent the rest of the code point. The first byte (most significant byte) is the byte on the left.

Reading UTF-8

When reading UTF-8 encoded bytes into characters, you need to figure out if a given character (code point) is represented by 1, 2, 3 or 4 bytes. You do so by looking at the bit pattern of the first byte.

If the first byte has the bit pattern 0ZZZZZZZ (most significant bit is a 0) then the character code point is represented only by this byte.

If the first byte has the bit pattern 110YYYYY (3 most significant bits are 110) then the character code point is represented by two bytes.

If the first byte has the bit pattern 1110XXXX (4 most significant bits are 1110) then the character code point is represented by three bytes.

If the first byte has the bit pattern 11110VVV (5 most significant bits are 11110) then the character code point is represented by four bytes.
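The first-byte check described above can be sketched as a small Java helper. This is only an illustration under my own assumptions: the class name Utf8LengthSketch and the method sequenceLength are hypothetical names chosen for this example, and returning -1 for a byte that cannot start a character is a choice made here, not something the encoding prescribes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named for this illustration only.
class Utf8LengthSketch {

    /**
     * Returns how many bytes the UTF-8 sequence starting with firstByte occupies,
     * or -1 if the byte cannot start a sequence (for example a 10xxxxxx byte).
     */
    static int sequenceLength(byte firstByte) {
        int b = firstByte & 0xFF;                        // treat the signed Java byte as 0..255
        if ((b & 0b1000_0000) == 0b0000_0000) return 1;  // 0ZZZZZZZ
        if ((b & 0b1110_0000) == 0b1100_0000) return 2;  // 110YYYYY
        if ((b & 0b1111_0000) == 0b1110_0000) return 3;  // 1110XXXX
        if ((b & 0b1111_1000) == 0b1111_0000) return 4;  // 11110VVV
        return -1;                                       // not a valid first byte
    }

    public static void main(String[] args) {
        // One character from each interval: U+0061, U+00E9, U+20AC, U+1F600
        byte[] utf8 = "a\u00E9\u20AC\uD83D\uDE00".getBytes(StandardCharsets.UTF_8);
        int i = 0;
        while (i < utf8.length) {
            int length = sequenceLength(utf8[i]);
            System.out.println("offset " + i + ": " + length + " byte(s)");
            i += length;   // safe here because the input is known to be valid UTF-8
        }
    }
}
```

For the sample string the helper reports sequence lengths of 1, 2, 3 and 4 bytes, one for each of the four intervals. For untrusted input you would have to handle the -1 case before advancing the offset.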
Once you know how many bytes are used to represent the given character code point, read all the actual code point carrying bits (bits marked with V, W, X, Y and Z) into a single 32-bit data type (e.g. a Java int). The bits then make up the integer value of the code point. Here is how a 32-bit data type looks after reading a 4-byte UTF-8 character into it:

    00000000 000VVVWW XXXXYYYY YYZZZZZZ

Notice how all the marker bits (the most significant bits with the patterns 11110 and 10) have been removed from all of the 4 bytes, before the remaining bits (the bits marked with V, W, X, Y and Z) are copied into the 32-bit data type.

Writing UTF-8

When writing UTF-8 text you need to translate Unicode code points into UTF-8 encoded bytes. First, you must figure out how many bytes you need to represent the given code point. I have explained the code point value intervals at the top of this UTF-8 tutorial, so I will not repeat them here.

Second, you need to translate the bits representing the code point into the corresponding UTF-8 bytes. Once you know how many bytes are needed to represent the code point, you also know what bit pattern of marker bits and code point bits you need to use. Simply create the needed number of bytes with marker bits, and copy the correct code point bits into each of the bytes, and you are done.

Here is an example of translating a code point that requires 4 bytes in UTF-8. The code point has the abstract value (as bit pattern):

    00000000 000VVVWW XXXXYYYY YYZZZZZZ

The corresponding 4 UTF-8 bytes will look like this:

    11110VVV 10WWXXXX 10YYYYYY 10ZZZZZZ

Reading and Writing UTF-8 in Java

There are several ways to read and write UTF-8 encoded bytes in Java. In the following sections I will cover a few of them.
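Before moving on to the Java APIs, the four-byte translation described under Writing UTF-8 can be sketched in a few lines of Java. This is a minimal sketch, not a full encoder: the class FourByteSketch and its two methods are hypothetical names for this example, and the code point U+1F600 is just a sample value from the four-byte interval. The decode method assumes its input really is a well-formed four-byte sequence and does no validation.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical class, named for this illustration only.
class FourByteSketch {

    /** Encodes a code point in U+10000..U+10FFFF as 11110VVV 10WWXXXX 10YYYYYY 10ZZZZZZ. */
    static byte[] encodeFourBytes(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("Not a four-byte code point: " + codePoint);
        }
        return new byte[] {
            (byte) (0b1111_0000 | (codePoint >> 18)),           // marker 11110 + VVV
            (byte) (0b1000_0000 | ((codePoint >> 12) & 0x3F)),  // marker 10 + WWXXXX
            (byte) (0b1000_0000 | ((codePoint >> 6) & 0x3F)),   // marker 10 + YYYYYY
            (byte) (0b1000_0000 | (codePoint & 0x3F))           // marker 10 + ZZZZZZ
        };
    }

    /** Strips the marker bits again and reassembles the 21 code point bits. */
    static int decodeFourBytes(byte[] utf8) {
        return ((utf8[0] & 0x07) << 18)
             | ((utf8[1] & 0x3F) << 12)
             | ((utf8[2] & 0x3F) << 6)
             |  (utf8[3] & 0x3F);
    }

    public static void main(String[] args) {
        int codePoint = 0x1F600;
        byte[] manual = encodeFourBytes(codePoint);

        // Cross-check against the encoder built into the JDK.
        byte[] jdk = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);

        System.out.println(Arrays.equals(manual, jdk));
        System.out.println(Integer.toHexString(decodeFourBytes(manual)));
    }
}
```

For U+1F600 the manual encoding produces the bytes F0 9F 98 80, which is the same result the JDK encoder gives, and decoding those bytes yields 1f600 again.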
Read UTF-8 Into a Java String

If you need to read UTF-8 into a Java String, you can do like this:

    byte[] utf8 = ... // get the UTF-8 bytes from somewhere (file, URL etc)

    String string = new String(utf8, StandardCharsets.UTF_8);

Get UTF-8 Bytes From Java String

You can obtain the characters of a Java String as UTF-8 encoded bytes, like this:

    byte[] utf8 = string.getBytes(StandardCharsets.UTF_8);

A Utf8Buffer Class Which Can Write and Read UTF-8 Code Points

Here is a Utf8Buffer class which can both write and read UTF-8 as Java integer code points:

    public class Utf8Buffer {

        public byte[] buffer;
        public int offset;
        public int length;
        public int endOffset;
        public int tempOffset;

        public Utf8Buffer(byte[] data, int offset, int length) {
            this.buffer = data;
            this.offset = offset;
            this.tempOffset = offset;
            this.length = length;
            this.endOffset = offset + length;
        }

        public void reset() {
            this.tempOffset = this.offset;
        }

        public void calculateLengthAndEndOffset() {
            this.length = this.tempOffset - this.offset;
            this.endOffset = this.tempOffset;
        }

        public int writeCodepoint(int codepoint) {
            if (codepoint < 0) {
                // Negative values are not Unicode code points
                throw new IllegalArgumentException("Unknown Unicode codepoint: " + codepoint);
            }
            if (codepoint < 0x00_00_00_80) {
                // This is a one byte UTF-8 char
                buffer[this.tempOffset++] = (byte) (0xFF & codepoint);
                return 1;
            } else if (codepoint < 0x00_00_08_00) {
                // This is a two byte UTF-8 char. Value is at most 11 bits long (less than 12 bits in value).
                // Get highest 5 bits into first byte
                buffer[this.tempOffset]     = (byte) (0xFF & (0b1100_0000 | (0b0001_1111 & (codepoint >> 6))));
                buffer[this.tempOffset + 1] = (byte) (0xFF & (0b1000_0000 | (0b0011_1111 & codepoint)));
                this.tempOffset += 2;
                return 2;
            } else if (codepoint < 0x00_01_00_00) {
                // This is a three byte UTF-8 char. Value is at most 16 bits long (less than 17 bits in value).
                // Get the highest 4 bits into the first byte
                buffer[this.tempOffset]     = (byte) (0xFF & (0b1110_0000 | (0b0000_1111 & (codepoint >> 12))));
                buffer[this.tempOffset + 1] = (byte) (0xFF & (0b1000_0000 | (0b0011_1111 & (codepoint >> 6))));
                buffer[this.tempOffset + 2] = (byte) (0xFF & (0b1000_0000 | (0b0011_1111 & codepoint)));
                this.tempOffset += 3;
                return 3;
            } else if (codepoint < 0x00_11_00_00) {
                // This is a four byte UTF-8 char. Value is at most 21 bits long (less than 22 bits in value).
                // Get the highest 3 bits into the first byte
                buffer[this.tempOffset]     = (byte) (0xFF & (0b1111_0000 | (0b0000_0111 & (codepoint >> 18))));
                buffer[this.tempOffset + 1] = (byte) (0xFF & (0b1000_0000 | (0b0011_1111 & (codepoint >> 12))));
                buffer[this.tempOffset + 2] = (byte) (0xFF & (0b1000_0000 | (0b0011_1111 & (codepoint >> 6))));
                buffer[this.tempOffset + 3] = (byte) (0xFF & (0b1000_0000 | (0b0011_1111 & codepoint)));
                this.tempOffset += 4;
                return 4;
            }
            throw new IllegalArgumentException("Unknown Unicode codepoint: " + codepoint);
        }

        public int nextCodepoint() {
            int firstByteOfChar = 0xFF & buffer[tempOffset];

            if (firstByteOfChar < 0b1000_0000) {        // 128
                // this is a single byte UTF-8 char (an ASCII char)
                tempOffset++;
                return firstByteOfChar;
            } else if (firstByteOfChar < 0b1100_0000) { // 192
                // 10xxxxxx is a continuation byte and cannot be the first byte of a char
                throw new IllegalStateException("Codepoint not recognized from first byte: " + firstByteOfChar);
            } else if (firstByteOfChar < 0b1110_0000) { // 224
                // this is a two byte UTF-8 char
                int nextCodepoint = 0b0001_1111 & firstByteOfChar; // 0x1F
                nextCodepoint <<= 6;
                nextCodepoint |= 0x3F & buffer[tempOffset + 1];
                tempOffset += 2;
                return nextCodepoint;
            } else if (firstByteOfChar < 0b1111_0000) { // 240
                // this is a three byte UTF-8 char
                int nextCodepoint = 0b0000_1111 & firstByteOfChar; // 0x0F
                nextCodepoint <<= 6;
                nextCodepoint |= 0x3F & buffer[tempOffset + 1];
                nextCodepoint <<= 6;
                nextCodepoint |= 0x3F & buffer[tempOffset + 2];
                tempOffset += 3;
                return nextCodepoint;
            } else if (firstByteOfChar < 0b1111_1000) { // 248
                // this is a four byte UTF-8 char
                int nextCodepoint = 0b0000_0111 & firstByteOfChar; // 0x07
                nextCodepoint <<= 6;
                nextCodepoint |= 0x3F & buffer[tempOffset + 1];
                nextCodepoint <<= 6;
                nextCodepoint |= 0x3F & buffer[tempOffset + 2];
                nextCodepoint <<= 6;
                nextCodepoint |= 0x3F & buffer[tempOffset + 3];
                tempOffset += 4;
                return nextCodepoint;
            }
            throw new IllegalStateException("Codepoint not recognized from first byte: " + firstByteOfChar);
        }
    }

Using the Utf8Buffer class could look like this:

    Utf8Buffer utf8Buffer = new Utf8Buffer(new byte[1024], 0, 0);

    utf8Buffer.writeCodepoint(0x7F);

    // After writing, calculating length and offsets is necessary,
    // and if you want to read, tempOffset must be set back to offset (reset())
    utf8Buffer.calculateLengthAndEndOffset();
    utf8Buffer.reset();

    int nextCodePoint = utf8Buffer.nextCodepoint();

Searching Forwards in UTF-8

Searching forwards in UTF-8 is reasonably straightforward. You decode one character at a time, and compare it to the character you are searching for. No big surprise here.

Searching Backwards in UTF-8

The UTF-8 encoding has the nice side effect that you can search backwards in UTF-8 encoded bytes. You can see from each byte if it is the beginning of a character or not by looking at the marker bits. The following marker bit patterns all imply that the byte is the beginning of a character:

    0        Beginning of 1 byte character (also an ASCII character)
    110      Beginning of 2 byte character
    1110     Beginning of 3 byte character
    11110    Beginning of 4 byte character

The following marker bit pattern implies that the byte is not the first byte of a UTF-8 character:

    10       Second, third or fourth byte of a UTF-8 character

Notice how you can always see from a marker bit pattern if it is the first byte of a character, or a second / third / fourth byte. Just keep searching backwards until you find the beginning of the character, then go forward and decode it, and check if it is the character you are looking for.
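The backwards search described above could be sketched like this in Java. Treat it as an illustration only: the class Utf8BackwardSketch and its methods are hypothetical names for this example, the input is assumed to be well-formed UTF-8, and decoding each character via the String constructor is a convenience chosen here rather than the fastest possible approach.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical class, named for this illustration only.
class Utf8BackwardSketch {

    /** True for bytes of the form 10xxxxxx, which never start a character. */
    static boolean isContinuationByte(byte b) {
        return (b & 0b1100_0000) == 0b1000_0000;
    }

    /** Steps back from index until a byte that starts a character is found. */
    static int startOfCharacter(byte[] utf8, int index) {
        int i = index;
        while (i > 0 && isContinuationByte(utf8[i])) {
            i--;
        }
        return i;
    }

    /** Returns the byte offset of the last occurrence of codePoint, or -1 if absent. */
    static int lastIndexOf(byte[] utf8, int codePoint) {
        int i = utf8.length - 1;
        while (i >= 0) {
            int start = startOfCharacter(utf8, i);
            // Go forward from the start byte and decode just this one character.
            String oneChar = new String(utf8, start, i - start + 1, StandardCharsets.UTF_8);
            if (oneChar.codePointAt(0) == codePoint) {
                return start;
            }
            i = start - 1;   // continue with the byte before this character
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] utf8 = "na\u00EFve caf\u00E9".getBytes(StandardCharsets.UTF_8);
        System.out.println(lastIndexOf(utf8, 0x00E9));
        System.out.println(lastIndexOf(utf8, 'a'));
    }
}
```

Note that the offsets returned are byte offsets into the UTF-8 data, not character indexes, so they differ from String.lastIndexOf as soon as the text contains multi-byte characters.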


