How do I encode/decode UTF-16LE byte arrays with a BOM?
Asked 13 years, 4 months ago. Modified 5 years, 1 month ago. Viewed 39k times.
24
I need to encode/decode UTF-16 byte arrays to and from java.lang.String. The byte arrays are given to me with a Byte Order Marker (BOM), and I need to encode byte arrays with a BOM.
Also, because I'm dealing with a Microsoft client/server, I'd like to emit the encoding in little endian (along with the LE BOM) to avoid any misunderstandings. I do realize that with the BOM it should work big endian, but I don't want to swim upstream in the Windows world.
As an example, here is a method which encodes a java.lang.String as UTF-16 in little endian with a BOM:
    public static byte[] encodeString(String message) {
        byte[] tmp = null;
        try {
            tmp = message.getBytes("UTF-16LE");
        } catch (UnsupportedEncodingException e) {
            // should not be possible: UTF-16LE is a required charset
            AssertionError ae =
                new AssertionError("Could not encode UTF-16LE");
            ae.initCause(e);
            throw ae;
        }
        // use brute force method to add BOM
        byte[] utf16lemessage = new byte[2 + tmp.length];
        utf16lemessage[0] = (byte) 0xFF;
        utf16lemessage[1] = (byte) 0xFE;
        System.arraycopy(tmp, 0,
                         utf16lemessage, 2,
                         tmp.length);
        return utf16lemessage;
    }
What is the best way to do this in Java? Ideally I'd like to avoid copying the entire byte array into a new byte array that has two extra bytes allocated at the beginning.
The same goes for decoding such a string, but that's much more straightforward by using the java.lang.String constructor:
    public String(byte[] bytes,
                  int offset,
                  int length,
                  String charsetName)
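For the decoding side, a minimal sketch might look like the following. The class and method names (`Utf16Decode`, `decode`, `decodeLittleEndian`) are made up for illustration, and `StandardCharsets` assumes Java 7 or later; on older JREs the charset-name overloads shown above do the same job.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Decode {

    /** Decodes UTF-16 bytes, letting the decoder pick the byte order from the BOM. */
    public static String decode(byte[] bytes) {
        // The "UTF-16" decoder inspects the first two bytes: FF FE selects
        // little-endian, FE FF selects big-endian. The BOM is consumed and
        // does not appear in the resulting String.
        return new String(bytes, StandardCharsets.UTF_16);
    }

    /** Decodes bytes that are required to start with the little-endian BOM (FF FE). */
    public static String decodeLittleEndian(byte[] bytes) {
        if (bytes.length < 2 || bytes[0] != (byte) 0xFF || bytes[1] != (byte) 0xFE) {
            throw new IllegalArgumentException("missing UTF-16LE BOM");
        }
        // Skip the two BOM bytes using the offset/length constructor.
        return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        byte[] data = { (byte) 0xFF, (byte) 0xFE, 'h', 0, 'i', 0 };
        System.out.println(decode(data));             // prints "hi"
        System.out.println(decodeLittleEndian(data)); // prints "hi"
    }
}
```

The second method is stricter: it rejects input without the expected BOM instead of silently falling back to another byte order.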
Tags: java, unicode, utf-16, byte-order-mark
asked May 18, 2009 at 19:55 by Jared Oberhaus; edited May 18, 2009 at 20:27
5 Answers
32
The "UTF-16" charset name will always encode with a BOM and will decode data in either byte order (big or little endian), but "UnicodeBig" and "UnicodeLittle" are useful for encoding in a specific byte order. Use UTF-16LE or UTF-16BE for no BOM - see this post for how to use "\uFEFF" to handle BOMs manually. See here for canonical naming of charset string names or (preferably) the Charset class. Also take note that only a limited subset of encodings is absolutely required to be supported.
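The differences between these charset names are easiest to see in the raw bytes. The sketch below is only illustrative: the class `BomDemo` and its helper methods are hypothetical names, and `StandardCharsets` assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {

    /** "UTF-16": the encoder writes a big-endian BOM followed by big-endian data. */
    static byte[] withBomBigEndian(String s) {
        return s.getBytes(StandardCharsets.UTF_16);
    }

    /** "UTF-16LE": little-endian data with no BOM at all. */
    static byte[] noBomLittleEndian(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE);
    }

    /** Manual BOM: prepend U+FEFF, which UTF-16LE encodes as the bytes FF FE. */
    static byte[] withBomLittleEndian(String s) {
        return ("\uFEFF" + s).getBytes(StandardCharsets.UTF_16LE);
    }

    /** Formats a byte array as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(withBomBigEndian("A")));    // FE FF 00 41
        System.out.println(hex(noBomLittleEndian("A")));   // 41 00
        System.out.println(hex(withBomLittleEndian("A"))); // FF FE 41 00
    }
}
```

Note that the string concatenation in `withBomLittleEndian` creates a temporary copy of the text, so this approach is simple rather than allocation-free.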
answered May 18, 2009 at 20:08 by McDowell
Thanks! One more issue though... Using "UTF-16" encodes the data as Big Endian, which I suspect will not go over well with Microsoft data (even though the BOM exists). Any way to encode UTF-16LE with BOM with Java? I'll update my question to reflect what I was really looking for...
– Jared Oberhaus, May 18, 2009 at 20:14
Click on the "see this post" link he gave. Basically, you stuff a \uFEFF character at the beginning of your string, and then encode to UTF-16LE, and the result will have a proper BOM.
– Daniel Martin, May 18, 2009 at 20:17
Use "UnicodeLittle" (assuming your JRE supports it - ("\uFEFF" + "my string").getBytes("UTF-16LE") otherwise). Though I would be surprised if Microsoft APIs expected a BOM but couldn't handle big-endian data - they tend to like using BOMs more than other platforms. Test with empty strings - you may get empty arrays if there is no data.
– McDowell, May 18, 2009 at 20:22
I would be completely unsurprised at Microsoft defining a format where it expects a UTF-16LE BOM to begin a file and will not behave if the file begins with a UTF-8 BOM or a UTF-16BE BOM. I would be completely unsurprised because this is exactly the behavior I have observed with Excel loading CSV files - if the file begins with a UTF-16LE BOM, then it loads the data in UTF-16LE and expects tabs between columns. Any other character sequence and it loads data in some local character set with "," or ";" (locale-dependent!) between columns.
– Daniel Martin, May 18, 2009 at 20:42
Just to reiterate: "UnicodeLittle" (a.k.a. "x-UTF-16LE-BOM") will write the file as UTF-16 little-endian with a BOM. This should be the preferred method for WRITING the files, but it only seems to be available since Java 6 (JDK 1.6). For READING, you should stick with "UTF-16".
– Alan Moore, May 18, 2009 at 23:51
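A hedged sketch of the "UnicodeLittle" suggestion from the comments above: "x-UTF-16LE-BOM" is not one of the charsets every JRE is required to provide, so this version checks for it at runtime and otherwise falls back to prepending U+FEFF by hand. The class name `UnicodeLittleDemo` is hypothetical, and `StandardCharsets` assumes Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class UnicodeLittleDemo {

    /** Encodes as UTF-16LE with a leading BOM, preferring the JRE's own charset. */
    public static byte[] encode(String s) {
        String name = "x-UTF-16LE-BOM"; // a.k.a. "UnicodeLittle"
        if (Charset.isSupported(name)) {
            return s.getBytes(Charset.forName(name));
        }
        // Fallback: U+FEFF encoded as UTF-16LE yields the bytes FF FE.
        return ("\uFEFF" + s).getBytes(StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        byte[] bytes = encode("A");
        for (byte b : bytes) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}
```

As the comment about empty strings warns, the two branches may differ for empty input (the charset may emit no bytes at all, while the fallback always emits the BOM), so it is worth testing that case against whatever the receiving side expects.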
6
First off, for decoding you can use the character set "UTF-16"; that automatically detects an initial BOM. For encoding UTF-16BE, you can also use the "UTF-16" character set - that'll write a proper BOM and then output big endian stuff.
For encoding to little endian with a BOM, I don't think your current code is too bad, even with the double allocation (unless your strings are truly monstrous). What you might want to do if they are is not deal with a byte array but rather a java.nio ByteBuffer, and use the java.nio.charset.CharsetEncoder class. (Which you can get from Charset.forName("UTF-16LE").newEncoder()).
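One way the CharsetEncoder idea could be sketched so that only a single output array is allocated: size the array up front (UTF-16 uses exactly two bytes per Java char), write the BOM, and let the encoder fill the rest in place. The class and method names are hypothetical, `StandardCharsets` assumes Java 7 or later, and replacing malformed input (such as a lone surrogate) with the encoder's replacement bytes is an assumption of this sketch, not something the answer specifies.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf16LeBomEncoder {

    /** Encodes the message as UTF-16LE with a leading BOM into one pre-sized array. */
    public static byte[] encodeWithBom(String message) {
        // 2 bytes for the BOM plus 2 bytes per UTF-16 code unit.
        byte[] out = new byte[2 + 2 * message.length()];
        ByteBuffer buffer = ByteBuffer.wrap(out);
        buffer.put((byte) 0xFF).put((byte) 0xFE);

        CharsetEncoder encoder = StandardCharsets.UTF_16LE.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);

        // Encode directly into the backing array; no intermediate byte[] is created here.
        CoderResult result = encoder.encode(CharBuffer.wrap(message), buffer, true);
        if (result.isUnderflow()) {
            result = encoder.flush(buffer);
        }
        if (!result.isUnderflow() || buffer.hasRemaining()) {
            throw new IllegalStateException("unexpected encoder result: " + result);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] bytes = encodeWithBom("hi");
        System.out.println(bytes.length); // prints 6
    }
}
```

`CharBuffer.wrap(message)` wraps the String without copying its characters, so the only large allocation in this sketch is the result array itself.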
answered May 18, 2009 at 20:15 by Daniel Martin
6
This is how you do it in nio:
    return Charset.forName("UTF-16LE").encode(message)
            .put(0, (byte) 0xFF)
            .put(1, (byte) 0xFE)
            .array();
It is certainly supposed to be faster. I don't know how many arrays it makes under the covers, but my understanding of the point of the API is that it is supposed to minimize that.
answered May 18, 2009 at 23:09 by Yishai
This one actually doesn't work. The put(0) and put(1) calls overwrite the first two bytes of the encoded message's ByteBuffer.
– hopia, Aug 24, 2017 at 22:18
3
    ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream(string.length() * 2 + 2);
    byteArrayOutputStream.write(new byte[] { (byte) 0xFF, (byte) 0xFE });
    byteArrayOutputStream.write(string.getBytes("UTF-16LE"));
    return byteArrayOutputStream.toByteArray();
EDIT: Rereading your question, I see you would rather avoid the double array allocation altogether. Unfortunately the API doesn't give you that, as far as I know. (There was a method, but it is deprecated, and you can't specify encoding with it).
I wrote the above before I saw your comment. I think the answer to use the nio classes is on the right track. I was looking at that, but I'm not familiar enough with the API to know offhand how you get that done.
answered May 18, 2009 at 20:09 by Yishai; edited May 18, 2009 at 20:36
Thanks. In addition what I would have liked here is to not allocate the entire byte array with string.getBytes("UTF-16LE") -- perhaps by wrapping the stream as an InputStream, which was the point of my earlier question: stackoverflow.com/questions/837703/…
– Jared Oberhaus, May 18, 2009 at 20:21
Note that this code actually allocates arrays big enough for the String three times, since you have the internal array of the ByteArrayOutputStream which is copied in the call .toByteArray(). A way to get it back down to only allocating two is to wrap the ByteArrayOutputStream in an OutputStreamWriter and write the string to that. Then you still have the ByteArrayOutputStream's internal state and the copy made by .toByteArray(), but not the return value from .getBytes
– Daniel Martin, May 18, 2009 at 20:55
It seems that you are just exchanging a char array for a byte array if you do that, as the OutputStreamWriter delegates to the StreamEncoder class, which creates a char[] buffer to retrieve the String data. String is immutable, and the size of an array is fixed, so that copy seems unavoidable. I think nio is supposed to help with that double creation on the ByteArrayOutputStream
– Yishai, May 18, 2009 at 21:29
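The OutputStreamWriter variant discussed in these comments might be sketched as follows. The class name `WriterEncode` is hypothetical, and try-with-resources, `StandardCharsets`, and `UncheckedIOException` assume Java 8 or later. As the comments note, this avoids the array returned by getBytes but still copies once in toByteArray(), and the writer uses an internal buffer of its own.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class WriterEncode {

    /** Writes the BOM bytes, then streams the text through a UTF-16LE writer. */
    public static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(2 + 2 * s.length());
        // Little-endian BOM, written as raw bytes ahead of the encoded text.
        out.write(0xFF);
        out.write(0xFE);
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_16LE)) {
            writer.write(s);
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
        // Closing the writer flushed it, so the stream now holds BOM + text.
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(encode("hi").length); // prints 6
    }
}
```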
0
This is an old question, but still, I couldn't find an acceptable answer for my situation. Basically, Java doesn't have a built-in encoder for UTF-16LE with a BOM. And so, you have to roll your own implementation.
Here's what I ended up with:
    private byte[] encodeUTF16LEWithBOM(final String s) {
        ByteBuffer content = Charset.forName("UTF-16LE").encode(s);
        byte[] bom = { (byte) 0xff, (byte) 0xfe };
        // remaining() is the number of encoded bytes; capacity() is not guaranteed to match it
        return ByteBuffer.allocate(content.remaining() + bom.length).put(bom).put(content).array();
    }
answered Aug 24, 2017 at 22:17 by hopia