Why can the UTF-8 BOM bytes EF BB BF be replaced by \uFEFF?
Asked 3 years, 8 months ago. Modified 1 year, 2 months ago. Viewed 6k times. Score: 6

The byte order mark (BOM) for UTF-8 is EF BB BF, as noted in section 23.8 of the Unicode 9 specification (search for "signature").

Many solutions in Java remove it with a simple one-line call:

```java
replace("\uFEFF", "")
```

I don't understand why this works.

Here is my test code. I checked the bytes after calling String#replace and found that EF BB BF is INDEED removed. See this code run live at IdeOne.com.

So magic. Why does this work?

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import org.junit.Test; // JUnit 4 is assumed; the post only shows @Test

public class BomTest { // class name is illustrative, not from the post
    @Test
    public void shit() throws Exception {
        byte[] b = new byte[]{-17, -69, -65, 97, 97, 97}; // EF BB BF 61 61 61
        char[] c = new char[10];
        // Keep the number of chars actually read so the unused slots of c are ignored.
        int n = new InputStreamReader(new ByteArrayInputStream(b), "UTF-8").read(c);
        // getBytes() without an argument uses the platform default charset.
        byte[] bytes = new String(c, 0, n).replace("\uFEFF", "").getBytes();
        for (byte bt : bytes) { // 61 61 61: EF BB BF is indeed removed
            System.out.println(bt);
        }
    }
}
```

Tags: java, byte-order-mark

asked Jan 18, 2019 at 3:32 by aaron.chu; edited Jul 28, 2021 at 22:54 by Basil Bourque

Comments:

You are confusing the encoding with the character code point. Also, in normal use, UTF-8 encoded content should not use a BOM. – Mark Rotteveel, Jan 18, 2019 at 9:37

Related: How to add a UTF-8 BOM in Java?
– Basil Bourque, Jul 28, 2021 at 22:45

2 Answers

Answer 1 (score 8)

The reason is that Unicode text should start with the byte order mark (except in UTF-8, where it is not mandatory [1]).

From Wikipedia:

> The byte order mark (BOM) is a Unicode character, U+FEFF BYTE ORDER MARK (BOM), whose appearance as a magic number at the start of a text stream ...
> ...
> The BOM is encoded in the same scheme as the rest of the document ...

This means that this special character (\uFEFF) must also be encoded in UTF-8.

UTF-8 can encode Unicode code points in one to four bytes.

- Code points that can be represented with 7 bits are encoded in one byte; the highest bit is always zero: 0xxxxxxx
- All other code points are encoded in multiple bytes, depending on the number of bits. The leading set bits of the first byte give the number of bytes used for the encoding, e.g. 110xxxxx means the encoding takes two bytes. Continuation bytes always have the form 10xxxxxx (the x bits carry the code point).

The code points in the range U+0000-U+007F can be encoded with one byte.
The code points in the range U+0080-U+07FF can be encoded with two bytes.
The code points in the range U+0800-U+FFFF can be encoded with three bytes.

A detailed explanation is on Wikipedia.

For the BOM we need three bytes.

```
hex     FEFF
binary  11111110 11111111

encode the bits in UTF-8
pattern for three-byte encoding  1110xxxx 10xxxxxx 10xxxxxx
the bits of the code point           1111   111011   111111
result                           11101111 10111011 10111111
in hex                           EF       BB       BF
```

EF BB BF sounds familiar already. ;-)

The byte sequence EF BB BF is nothing other than the BOM encoded in UTF-8.

As the byte order mark has no meaning for UTF-8, Java does not write it when encoding to UTF-8.

Encoding the BOM character as UTF-8:

```
jshell> "\uFEFF".getBytes("UTF-8")
$1 ==> byte[3] { -17, -69, -65 }  // EF BB BF
```

Hence, when the file is read, the byte sequence is decoded to \uFEFF.
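The three-byte encoding worked through above can also be reproduced in code. The following is a minimal sketch, not part of the original answer: the class `Utf8ThreeByteSketch` and the method `encodeThreeBytes` are hypothetical names, and the method handles only the three-byte range (it does not check for surrogates).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class; names are illustrative only.
class Utf8ThreeByteSketch {

    // Encodes a code point in U+0800..U+FFFF with the pattern
    // 1110xxxx 10xxxxxx 10xxxxxx. Surrogates are not rejected here.
    static byte[] encodeThreeBytes(int codePoint) {
        if (codePoint < 0x0800 || codePoint > 0xFFFF) {
            throw new IllegalArgumentException("not in the three-byte range");
        }
        return new byte[] {
            (byte) (0xE0 | (codePoint >> 12)),         // 1110 + top 4 bits
            (byte) (0x80 | ((codePoint >> 6) & 0x3F)), // 10 + middle 6 bits
            (byte) (0x80 | (codePoint & 0x3F))         // 10 + low 6 bits
        };
    }

    public static void main(String[] args) {
        byte[] manual = encodeThreeBytes(0xFEFF);
        byte[] library = "\uFEFF".getBytes(StandardCharsets.UTF_8);
        for (byte b : manual) {
            System.out.printf("%02X ", b & 0xFF); // prints EF BB BF
        }
        System.out.println();
        // The hand-rolled result matches what the JDK encoder produces.
        System.out.println(Arrays.equals(manual, library)); // prints true
    }
}
```

Comparing against `getBytes(StandardCharsets.UTF_8)` is only a sanity check; in real code the JDK encoder should be used rather than manual bit shifting.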
For an encoding such as UTF-16, the BOM is added:

```
jshell> " ".getBytes("UTF-16")
$2 ==> byte[4] { -2, -1, 0, 32 }  // FE FF + the encoded SPACE
```

[1] Cited from http://www.unicode.org/versions/Unicode9.0.0/ch23.pdf:

> Although there are never any questions of byte order with UTF-8 text, this sequence can serve as signature for UTF-8 encoded text where the character set is unmarked. As with a BOM in UTF-16, this sequence of bytes will be extremely rare at the beginning of text files in other character encodings.

answered Jan 18, 2019 at 9:31 by SubOptimal; edited Jul 29, 2021 at 6:52

Comments:

@BasilBourque I wasn't aware that one could misread the sentence that way. I have now made it clearer what I wanted to say. – SubOptimal, Jul 29, 2021 at 6:53

Answer 2 (score 5)

InputStreamReader is decoding the UTF-8 encoded byte sequence (b) into UTF-16BE, and in the process translates the UTF-8 BOM to the UTF-16BE BOM (\uFEFF). UTF-16BE is selected as the target encoding because Charset defaults to this behavior:

https://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html

> The UTF-16 charsets are specified by RFC 2781; the transformation formats upon which they are based are specified in Amendment 1 of ISO 10646-1 and are also described in the Unicode Standard.
>
> The UTF-16 charsets use sixteen-bit quantities and are therefore sensitive to byte order. In these encodings the byte order of a stream may be indicated by an initial byte-order mark represented by the Unicode character '\uFEFF'. Byte-order marks are handled as follows:
>
> When decoding, the UTF-16BE and UTF-16LE charsets interpret the initial byte-order marks as a ZERO-WIDTH NON-BREAKING SPACE; when encoding, they do not write byte-order marks.
>
> When decoding, the UTF-16 charset interprets the byte-order mark at the beginning of the input stream to indicate the byte-order of the stream but defaults to big-endian if there is no byte-order mark; when encoding, it uses big-endian byte order and writes a big-endian byte-order mark.
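The BOM rules quoted above from the Charset documentation can be observed directly. This is a small sketch using only `StandardCharsets` and `String`; the class name `Utf16BomSketch` and the helper `decodedLength` are hypothetical and exist only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; names are illustrative only.
class Utf16BomSketch {

    // Number of chars produced when decoding the bytes with the given charset.
    static int decodedLength(byte[] bytes, Charset charset) {
        return new String(bytes, charset).length();
    }

    public static void main(String[] args) {
        // Encoding: UTF-16 writes a big-endian BOM, UTF-16BE and UTF-16LE do not.
        System.out.println(Arrays.toString("a".getBytes(StandardCharsets.UTF_16)));   // [-2, -1, 0, 97]
        System.out.println(Arrays.toString("a".getBytes(StandardCharsets.UTF_16BE))); // [0, 97]
        System.out.println(Arrays.toString("a".getBytes(StandardCharsets.UTF_16LE))); // [97, 0]

        // Decoding FE FF 00 61: UTF-16 consumes the BOM,
        // UTF-16BE keeps it as the character U+FEFF.
        byte[] withBom = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x61};
        System.out.println(decodedLength(withBom, StandardCharsets.UTF_16));   // 1
        System.out.println(decodedLength(withBom, StandardCharsets.UTF_16BE)); // 2
    }
}
```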
See JLS 3.1 to understand why the internal encoding of String is UTF-16:

https://docs.oracle.com/javase/specs/jls/se8/html/jls-3.html#jls-3.1

> The Java programming language represents text in sequences of 16-bit code units, using the UTF-16 encoding.

String#getBytes() returns a byte sequence in the platform's default encoding, which appears to be UTF-8 for your system.

Summary

The sequence EF BB BF (UTF-8 BOM) is translated to FEFF (UTF-16BE BOM) when decoding the byte sequence into a String using InputStreamReader, because the encoding of java.lang.String with a default Charset is UTF-16BE in the presence of a BOM. After replacing the UTF-16BE BOM and calling String#getBytes(), the string is encoded into UTF-8 (the default charset for your platform) and you see your original byte sequence without a BOM.

answered Jan 18, 2019 at 3:42 by Chris Hutchinson; edited Jan 18, 2019 at 20:28

Comments:

And where does the language demonstrate that it is UTF-16BE, instead of UTF-16-Host? – Deduplicator, Jan 18, 2019 at 14:42

@Deduplicator adjusted the answer to explain why UTF-16BE is chosen – Chris Hutchinson, Jan 18, 2019 at 20:17
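The decode, replace, encode round trip discussed in this thread can be written with explicit charsets so that the result does not depend on the platform default. This is a sketch under assumptions: `BomRoundTripSketch` and `stripLeadingBom` are hypothetical names, and stripping only a leading U+FEFF (instead of calling `replace` on the whole string) is a design choice made here, not something the answers prescribe.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative only.
class BomRoundTripSketch {

    // Drops a single leading U+FEFF and leaves any later occurrence untouched.
    static String stripLeadingBom(String text) {
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        byte[] input = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x61, 0x61, 0x61};

        // Decoding: the three bytes EF BB BF become the single char U+FEFF.
        String decoded = new String(input, StandardCharsets.UTF_8);
        System.out.println(Integer.toHexString(decoded.charAt(0))); // feff

        // Encoding after the strip: only 61 61 61 remains.
        byte[] output = stripLeadingBom(decoded).getBytes(StandardCharsets.UTF_8);
        System.out.println(output.length); // 3
    }
}
```

Passing `StandardCharsets.UTF_8` to both `new String(...)` and `getBytes(...)` avoids relying on the default charset, which the second answer notes only "appears to be UTF-8" on the asker's system.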