Encode a String to UTF-8 in Java - Stack Abuse
Branko Ilic
Introduction
When working with Strings in Java, we oftentimes need to encode them to a specific charset, such as UTF-8.
UTF-8 represents a variable-width character encoding that uses between one and four eight-bit bytes to represent all valid Unicode code points.
A code point can represent a single character, but can also have other meanings, such as formatting. "Variable-width" means that it encodes each code point with a different number of bytes (between one and four) and, as a space-saving measure, commonly used code points are represented with fewer bytes than those used less frequently.
UTF-8 uses one byte to represent code points from 0-127, making the first 128 code points a one-to-one map with ASCII characters, so UTF-8 is backward-compatible with ASCII.
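To make the "between one and four bytes" claim concrete, here is a minimal sketch that counts the UTF-8 bytes of a few single code points. The class name Utf8Widths and the helper utf8Length are made up for this illustration; the only library call is String.getBytes(Charset).

```java
import java.nio.charset.StandardCharsets;

public class Utf8Widths {

    // Hypothetical helper: number of bytes UTF-8 needs to encode the given text.
    public static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // One code point each, taken from different Unicode ranges.
        System.out.println(utf8Length("A"));  // 1 (ASCII range)
        System.out.println(utf8Length("č"));  // 2
        System.out.println(utf8Length("あ")); // 3
        System.out.println(utf8Length("😀")); // 4 (outside the Basic Multilingual Plane)
    }
}
```

The ASCII letter takes a single byte, which is exactly the backward compatibility described above.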
Note: Java represents all Strings as UTF-16, which uses a minimum of two bytes to store a code point. Why would we need to convert to UTF-8 then?
Not all input might be UTF-16, or UTF-8 for that matter. You might actually receive ASCII-encoded text, and ASCII doesn't support nearly as many characters as UTF-8. Additionally, not all output might handle UTF-16, so it makes sense to convert to the more universal UTF-8.
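As a rough illustration of how the two encodings differ in size, the sketch below encodes two of the sample Strings used in this article with both charsets and compares the byte counts. The class EncodedSizes and the helper sizeIn are hypothetical names; UTF_16BE is used instead of UTF_16 so that no byte order mark is counted.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodedSizes {

    // Hypothetical helper: size in bytes of the text once encoded with the given charset.
    public static int sizeIn(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String german = "Wie heißen Sie?";
        String japanese = "よろしくお願いします";

        System.out.println(sizeIn(german, StandardCharsets.UTF_16BE));   // 30
        System.out.println(sizeIn(german, StandardCharsets.UTF_8));      // 16
        System.out.println(sizeIn(japanese, StandardCharsets.UTF_16BE)); // 20
        System.out.println(sizeIn(japanese, StandardCharsets.UTF_8));    // 30
    }
}
```

UTF-8 is smaller for mostly-ASCII text, but not for every script, so the usual reason to convert is compatibility rather than size.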
We'll be working with a few Strings that contain Unicode characters you might not encounter on a daily basis, such as č, ß and あ, simulating user input.
Let's write out a couple of Strings:
String serbianString = "Šta radiš?"; // What are you doing?
String germanString = "Wie heißen Sie?"; // What's your name?
String japaneseString = "よろしくお願いします"; // Pleased to meet you.
Now, let's leverage the String(byte[] bytes, Charset charset) constructor to recreate these Strings from their UTF-8 bytes, but decode them with a different Charset, simulating input that was wrongly treated as ASCII before it reached us:
// Encode explicitly as UTF-8 so the result doesn't depend on the platform default charset
String asciiSerbianString = new String(serbianString.getBytes(StandardCharsets.UTF_8), StandardCharsets.US_ASCII);
String asciiGermanString = new String(germanString.getBytes(StandardCharsets.UTF_8), StandardCharsets.US_ASCII);
String asciiJapaneseString = new String(japaneseString.getBytes(StandardCharsets.UTF_8), StandardCharsets.US_ASCII);
System.out.println(asciiSerbianString);
System.out.println(asciiGermanString);
System.out.println(asciiJapaneseString);
Because the bytes were decoded as ASCII, every byte outside the ASCII range is replaced with the � replacement character when we print the Strings:
��ta radi��?
Wie hei��en Sie?
������������������������������
The first two Strings contain just a few characters that aren't valid ASCII, while the final one doesn't contain a single valid ASCII character, so none of it survives.
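One way to spot this kind of loss before it happens is to ask an ASCII encoder whether it can represent the text at all. The following is a minimal sketch, not a complete validation strategy; the class AsciiCheck and both helper methods are hypothetical names, built on CharsetEncoder.canEncode and the String constructor used above.

```java
import java.nio.charset.StandardCharsets;

public class AsciiCheck {

    // True if every character of the text can be represented in US-ASCII.
    public static boolean isPureAscii(String text) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(text);
    }

    // Counts the U+FFFD replacement characters produced when the text's
    // UTF-8 bytes are decoded as US-ASCII, as in the example above.
    public static long lostBytes(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8, StandardCharsets.US_ASCII);
        return misread.chars().filter(c -> c == '\uFFFD').count();
    }

    public static void main(String[] args) {
        System.out.println(isPureAscii("Wie heißen Sie?"));    // false
        System.out.println(lostBytes("Šta radiš?"));           // 4
        System.out.println(lostBytes("よろしくお願いします")); // 30
    }
}
```

The counts match the output shown above: two replacement characters for each two-byte letter, and three for each Japanese character.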
To avoid this issue, we can assume that not all input is already encoded to our liking, and encode it ourselves to iron out such cases. There are several ways we can go about encoding a String to UTF-8 in Java.
Encoding a String in Java simply means converting its characters into a sequence of bytes according to a given Charset. Decoding is the reverse: the same Charset tells Java how to interpret those bytes once we form a String instance from them.
Using the getBytes() method
The String class offers a getBytes() method, which encodes the String's characters into a new byte array. Since encoding is really just the conversion of characters into bytes, we can pass a Charset to getBytes() to choose the encoding while getting the data.
By default, without providing a Charset, the bytes are encoded using the platform's default Charset, which might not be UTF-8 or UTF-16 (Java 18 and later default to UTF-8, while older versions take the default from the operating system). Let's get the bytes of a String as UTF-8 and print them out:
String serbianString = "Šta radiš?"; // What are you doing?
byte[] bytes = serbianString.getBytes(StandardCharsets.UTF_8);
for (byte b : bytes) {
    System.out.print(String.format("%s ", b));
}
This outputs:
-59 -96 116 97 32 114 97 100 105 -59 -95 63
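These are the byte values of our UTF-8 encoded characters, not code points. Java bytes are signed, so any byte above 127, such as the two bytes that encode Š, prints as a negative number. They're not really useful to human eyes. Though, again, we can leverage String's constructor to make a human-readable String from this very sequence. Since we've encoded this byte array as UTF_8, we can make a new String from it, as long as the platform's default Charset is UTF-8 as well:
String utf8String = new String(bytes);
System.out.println(utf8String);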
Note: Rather than relying on the platform default, you can state the Charset explicitly when decoding the bytes through the String constructor:
String utf8String = new String(bytes, StandardCharsets.UTF_8);
This now outputs the exact same String we started with, after a round trip through UTF-8:
Šta radiš?
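As a compact recap of this section, the sketch below encodes a String to UTF-8, prints the bytes as unsigned hexadecimal (often easier to read than signed decimals) and decodes them back with an explicit Charset. The class Utf8RoundTrip and its helper methods are hypothetical names introduced for this example.

```java
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {

    public static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Formats each byte as two uppercase hex digits, separated by spaces.
    public static String toHex(byte[] bytes) {
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", b & 0xFF)); // mask to treat the byte as unsigned
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        String original = "Šta radiš?";
        byte[] bytes = encode(original);
        System.out.println(toHex(bytes));                   // C5 A0 74 61 20 72 61 64 69 C5 A1 3F
        System.out.println(decode(bytes).equals(original)); // true

        // Decoding with the wrong charset does not fail; it silently yields different text.
        String misread = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.println(misread.equals(original));       // false
    }
}
```

Passing the Charset on both sides keeps the round trip independent of whatever default the machine happens to use.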
Encode a String to UTF-8 with Java 7 StandardCharsets
Java 7 introduced the StandardCharsets class, which provides constants for the Charsets that every Java platform supports: US_ASCII, ISO_8859_1, UTF_8, UTF_16, UTF_16BE and UTF_16LE.
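To show what these constants buy us, here is a small sketch comparing them with the older approach of naming the charset as a String, which forces callers to deal with a checked UnsupportedEncodingException. The class CharsetLookup and its two methods are hypothetical; wrapping the exception in an IllegalStateException is just one possible choice.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetLookup {

    // Name-based lookup: the compiler insists that we handle a checked exception.
    public static byte[] encodeByName(String text) {
        try {
            return text.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Should not happen, since every Java platform supports UTF-8.
            throw new IllegalStateException(e);
        }
    }

    // Constant-based: no lookup by name and no checked exception.
    public static byte[] encodeByConstant(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.equals(encodeByName("ß"), encodeByConstant("ß")));  // true
        System.out.println(Charset.forName("UTF-8").equals(StandardCharsets.UTF_8)); // true
    }
}
```

Both variants produce the same bytes; the constant simply removes the possibility of a misspelled charset name.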
Each Charset has an encode() and a decode() method. encode() accepts a CharBuffer (which implements CharSequence, same as a String) and has an overload that accepts a String directly, while decode() accepts a ByteBuffer and returns a CharBuffer. In practical terms, this means we can pass a String straight into the encode() method of a Charset.
The encode() method returns a ByteBuffer, which we can easily turn into a String again.
Earlier, when we used the getBytes() method, we stored the bytes we got in a byte array, but when using the StandardCharsets class, things are a bit different. The encoded bytes come back in a ByteBuffer, which we then decode back into a String. Let's see how this works in code:
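String japaneseString = "よろしくお願いします"; // Pleased to meet you.
ByteBuffer byteBuffer = StandardCharsets.UTF_8.encode(japaneseString);
// Decode the buffer itself: byteBuffer.array() may expose unused trailing bytes of the backing array
String utf8String = StandardCharsets.UTF_8.decode(byteBuffer).toString();
System.out.println(utf8String);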
Running this code results in:
よろしくお願いします
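If a plain byte array is needed from the ByteBuffer, for example to write it to a stream, it is safer to copy only the bytes between the buffer's position and its limit than to call array(), whose backing array may be larger than the encoded content. A minimal sketch, with BufferBytes and its methods as hypothetical names:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BufferBytes {

    // Copies exactly the encoded bytes out of the buffer returned by encode().
    public static byte[] toUtf8Bytes(String text) {
        ByteBuffer buffer = StandardCharsets.UTF_8.encode(text);
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    // Wraps the bytes in a buffer and decodes them back into a String.
    public static String fromUtf8Bytes(byte[] bytes) {
        return StandardCharsets.UTF_8.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        String japaneseString = "よろしくお願いします";
        byte[] bytes = toUtf8Bytes(japaneseString);

        System.out.println(bytes.length); // 30
        System.out.println(Arrays.equals(bytes, japaneseString.getBytes(StandardCharsets.UTF_8))); // true
        System.out.println(fromUtf8Bytes(bytes).equals(japaneseString)); // true
    }
}
```

The copied array is identical to what getBytes(StandardCharsets.UTF_8) returns, so the two approaches are interchangeable once the buffer is handled this way.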
Encode a String to UTF-8 with Apache Commons
The Apache Commons Codec package contains simple encoders and decoders for various formats such as Base64 and Hexadecimal. In addition to these widely used encoders and decoders, the codec package also maintains a collection of phonetic encoding utilities.
For us to be able to use the Apache Commons Codec, we need to add it to our project as an external dependency.
Using Maven, this means adding the commons-codec dependency to our pom.xml file.