Encode a String to UTF-8 in Java - Stack Abuse


Branko Ilic

Introduction

When working with Strings in Java, we oftentimes need to encode them to a specific charset, such as UTF-8.

UTF-8 is a variable-width character encoding that uses between one and four eight-bit bytes to represent all valid Unicode code points.

A code point can represent a single character, but can also have other meanings, such as formatting. "Variable-width" means that it encodes each code point with a different number of bytes (between one and four) and, as a space-saving measure, commonly used code points are represented with fewer bytes than those used less frequently.

UTF-8 uses one byte to represent code points from 0-127, making the first 128 code points a one-to-one map with ASCII characters, so UTF-8 is backward-compatible with ASCII.

Note: Java encodes all Strings into UTF-16, which uses a minimum of two bytes to store code points. Why would we need to convert to UTF-8 then?

Not all input might be UTF-16, or UTF-8 for that matter. You might actually receive an ASCII-encoded String, which doesn't support as many characters as UTF-8. Additionally, not all output might handle UTF-16, so it makes sense to convert to a more universal UTF-8.

We'll be working with a few Strings that contain Unicode characters you might not encounter on a daily basis - such as č, ß and あ, simulating user input.

Let's write out a couple of Strings:

    String serbianString = "Šta radiš?"; // What are you doing?
    String germanString = "Wie heißen Sie?"; // What's your name?
    String japaneseString = "よろしくお願いします"; // Pleased to meet you.
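To make the "variable-width" point above concrete, here is a minimal sketch that counts how many bytes a few sample characters occupy in UTF-8 and in UTF-16. The class name Utf8Widths and its helper methods are made up for this illustration, and the characters are written as Unicode escapes so the result doesn't depend on the source file's encoding:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Widths {

    // Number of bytes the text occupies once encoded as UTF-8.
    public static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the text occupies as UTF-16 (big-endian, no byte order mark).
    public static int utf16Length(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",            // U+0041, plain ASCII
            "\u010D",       // U+010D, the letter č
            "\u3042",       // U+3042, the hiragana あ
            "\uD83D\uDE00"  // U+1F600, an emoji stored as a surrogate pair
        };
        for (String sample : samples) {
            System.out.println("UTF-8: " + utf8Length(sample)
                    + " byte(s), UTF-16: " + utf16Length(sample) + " byte(s)");
        }
    }
}
```

The four samples take one, two, three and four bytes in UTF-8 respectively, while UTF-16 needs two bytes for the first three and four bytes for the emoji.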
Now, let's leverage the String(byte[] bytes, Charset charset) constructor of a String to recreate these Strings, but with a different Charset, simulating ASCII input that arrived to us in the first place:

    String asciiSerbianString = new String(serbianString.getBytes(), StandardCharsets.US_ASCII);
    String asciigermanString = new String(germanString.getBytes(), StandardCharsets.US_ASCII);
    String asciijapaneseString = new String(japaneseString.getBytes(), StandardCharsets.US_ASCII);

    System.out.println(asciiSerbianString);
    System.out.println(asciigermanString);
    System.out.println(asciijapaneseString);

Once we've created these Strings and decoded their bytes as ASCII characters, we can print them:

    ��ta radi��?
    Wie hei��en Sie?
    ������������������������������

While the first two Strings contain just a few characters that aren't valid ASCII characters, the final one doesn't contain a single valid ASCII character.

To avoid this issue, we can assume that not all input might already be encoded to our liking - and encode it to iron out such cases ourselves. There are several ways we can go about encoding a String to UTF-8 in Java.

Encoding a String in Java means converting its characters into a sequence of bytes according to a given Charset. Decoding is the reverse: interpreting a sequence of bytes through a Charset to form a String instance.

Using the getBytes() method

The String class offers a getBytes() method, which encodes the String's characters and returns the result as a new byte array. Since encoding is really just producing this byte array, we can pass a Charset to the method to control how the bytes are formed.

By default, without providing a Charset, the bytes are encoded using the platform's default Charset - which might not be UTF-8 or UTF-16. Let's get the bytes of a String and print them out:

    String serbianString = "Šta radiš?"; // What are you doing?
    byte[] bytes = serbianString.getBytes(StandardCharsets.UTF_8);

    for (byte b : bytes) {
        System.out.print(String.format("%s", b));
    }

This outputs:

    -59-96116973211497100105-59-9563

These are the signed values of the UTF-8 encoded bytes, printed without separators (-59, -96, 116, 97, 32, 114, 97, 100, 105, -59, -95, 63), and they're not really useful to human eyes. Java's byte type is signed, so bytes above 127 show up as negative numbers. Though, again, we can leverage String's constructor to make a human-readable String from this very sequence. Since we've encoded this byte array as UTF_8, we can make a new String from it, as long as the platform's default Charset is UTF-8 as well:

    String utf8String = new String(bytes);
    System.out.println(utf8String);

Note: Instead of relying on the default Charset, you can also pass the Charset to the String constructor explicitly, which decodes the bytes correctly on any platform:

    String utf8String = new String(bytes, StandardCharsets.UTF_8);

This now outputs the exact same String we started with, after a round trip through UTF-8:

    Šta radiš?

Encode a String to UTF-8 with Java 7 StandardCharsets

Since Java 7, we've been introduced to the StandardCharsets class, which has several Charsets available such as US_ASCII, ISO_8859_1, UTF_8 and UTF_16 among others.

Each Charset has an encode() and a decode() method. The encode() method accepts a CharBuffer or a String, while decode() accepts a ByteBuffer. In practical terms - this means we can chuck a String into the encode() method of a Charset.

The encode() method returns a ByteBuffer - which we can easily turn into a String again.

Earlier, when we used the getBytes() method, we stored the bytes we got in an array of bytes, but when using the StandardCharsets class, things are a bit different. We first need to use a class called ByteBuffer to store our bytes. Then, we need to both encode and then decode back our newly allocated bytes. Let's see how this works in code:

    String japaneseString = "よろしくお願いします"; // Pleased to meet you.
    ByteBuffer byteBuffer = StandardCharsets.UTF_8.encode(japaneseString);

    // Decode only the bytes between the buffer's position and limit.
    // byteBuffer.array() may expose unused trailing capacity, which would
    // end up as stray NUL characters at the end of the String.
    String utf8String = StandardCharsets.UTF_8.decode(byteBuffer).toString();
    System.out.println(utf8String);

Running this code results in:

    よろしくお願いします

Encode a String to UTF-8 with Apache Commons

The Apache Commons Codec package contains simple encoders and decoders for various formats such as Base64 and Hexadecimal. In addition to these widely used encoders and decoders, the codec package also maintains a collection of phonetic encoding utilities.

For us to be able to use the Apache Commons Codec, we need to add it to our project as an external dependency.

Using Maven, let's add the commons-codec dependency to our pom.xml file:

    <dependency>
        <groupId>commons-codec</groupId>
        <artifactId>commons-codec</artifactId>
        <version>1.15</version>
    </dependency>

Alternatively, if you're using Gradle:

    implementation 'commons-codec:commons-codec:1.15'

Now, we can utilize the utility classes of Apache Commons - and as usual, we'll be leveraging the StringUtils class.

It allows us to convert Strings to and from bytes using various encodings required by the Java specification. This class is null-safe and thread-safe, so we've got an extra layer of protection when working with Strings.

To encode a String to UTF-8 with Apache Commons Codec's StringUtils class, we can use the getBytesUtf8() method, which functions much like the getBytes() method with a specified Charset:

    String germanString = "Wie heißen Sie?"; // What's your name?
    byte[] bytes = StringUtils.getBytesUtf8(germanString);
    String utf8String = StringUtils.newStringUtf8(bytes);
    System.out.println(utf8String);

This results in:

    Wie heißen Sie?

Or, you can use the regular StringUtils class from the commons-lang3 dependency (add the version you want to use):

    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
    </dependency>

If you're using Gradle:

    implementation group: 'org.apache.commons', name: 'commons-lang3', version: ${version}

And now, we can use much the same approach as with regular Strings:

    String germanString = "Wie heißen Sie?"; // What's your name?
    byte[] bytes = StringUtils.getBytes(germanString, StandardCharsets.UTF_8);
    String utf8String = StringUtils.toEncodedString(bytes, StandardCharsets.UTF_8);
    System.out.println(utf8String);

Though, this approach is thread-safe and null-safe:

    Wie heißen Sie?
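As a quick cross-check of the two JDK-only approaches covered earlier, the following sketch encodes the same String with getBytes() and with Charset.encode(), and verifies that both yield the same bytes and decode back to the original text. The class and method names (Utf8RoundTrip, viaGetBytes, viaCharsetEncode, decode) are invented for this example. Note that it copies only the buffer's remaining bytes rather than calling array(), on the assumption that the backing array can be larger than the encoded content:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8RoundTrip {

    // Encodes with String.getBytes(Charset).
    public static byte[] viaGetBytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Encodes with Charset.encode(String) and copies only the bytes between
    // the buffer's position and limit into a right-sized array.
    public static byte[] viaCharsetEncode(String text) {
        ByteBuffer buffer = StandardCharsets.UTF_8.encode(text);
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    // Decodes UTF-8 bytes back into a String, naming the Charset explicitly.
    public static String decode(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "Wie hei\u00DFen Sie?"; // \u00DF is the letter ß

        byte[] fromGetBytes = viaGetBytes(text);
        byte[] fromEncode = viaCharsetEncode(text);

        System.out.println(Arrays.equals(fromGetBytes, fromEncode)); // prints true
        System.out.println(decode(fromGetBytes).equals(text));       // prints true
    }
}
```

Passing StandardCharsets.UTF_8 on both the encoding and decoding side keeps the round trip independent of the platform's default Charset.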
Conclusion

In this tutorial, we've taken a look at how to encode a Java String to UTF-8. We've covered a few approaches - manually converting a String with getBytes() and the String constructor, the Java 7 StandardCharsets class, as well as Apache Commons.

#java #encoding #apache commons

Last Updated: November 24th, 2021
Author: Branko Ilic
Editor: David Landup


