Why can the UTF-8 BOM bytes EF BB BF be replaced by \uFEFF?


Asked 3 years, 8 months ago. Modified 1 year, 2 months ago. Viewed 6k times.

The byte order mark (BOM) for UTF-8 is EF BB BF, as noted in section 23.8 of the Unicode 9 specification (search for "signature").

Many Java solutions for removing it are a simple one-liner:

    replace("\uFEFF", "")

I don't understand why this works.

Here is my test code. I check the bytes after calling String#replace and find that EF BB BF is INDEED removed. See this code run live at IdeOne.com.

So magic. Why does this work?

    import java.io.ByteArrayInputStream;
    import java.io.InputStreamReader;
    import org.junit.Test; // assuming JUnit 4; the original does not show its imports

    @Test
    public void bomIsRemoved() throws Exception {
        byte[] b = new byte[]{-17, -69, -65, 97, 97, 97}; // EF BB BF 61 61 61
        char[] c = new char[10];
        // Decode the UTF-8 bytes into chars; read() returns how many chars were filled.
        int n = new InputStreamReader(new ByteArrayInputStream(b), "UTF-8").read(c);
        // Use only the n decoded chars so the unused tail of c (NUL chars) is not included.
        // getBytes() encodes with the platform default charset.
        byte[] bytes = new String(c, 0, n).replace("\uFEFF", "").getBytes();
        for (byte bt : bytes) { // 61 61 61, we can see EF BB BF is indeed removed
            System.out.println(bt);
        }
    }

Tags: java, byte-order-mark

Asked Jan 18, 2019 at 3:32 by aaron.chu. Edited Jul 28, 2021 at 22:54 by Basil Bourque.

Comments:

You are confusing encoding with the character code point. Also, in normal use, UTF-8 encoded content should not use a BOM. – Mark Rotteveel, Jan 18, 2019 at 9:37

Related: How to add a UTF-8 BOM in Java?
– Basil Bourque, Jul 28, 2021 at 22:45

2 Answers

Answer 1 (score: 8)

The reason is that Unicode text should start with the byte order mark (except in UTF-8, where it is not mandatory [1]).

From Wikipedia:

> The byte order mark (BOM) is a Unicode character, U+FEFF BYTE ORDER MARK (BOM), whose appearance as a magic number at the start of a text stream ...
> ...
> The BOM is encoded in the same scheme as the rest of the document ...

This means that this special character (\uFEFF) must also be encoded in UTF-8.

UTF-8 encodes Unicode code points in one to four bytes.

- Code points that fit in 7 bits are encoded in one byte; the highest bit is always zero: 0xxxxxxx
- All other code points are encoded in multiple bytes, depending on the number of bits. The number of leading 1 bits in the first byte gives the number of bytes used for the encoding, e.g. 110xxxxx means the encoding uses two bytes. Continuation bytes always start with 10: 10xxxxxx (the x bits hold the bits of the code point).

Code points in the range U+0000 - U+007F are encoded with one byte.
Code points in the range U+0080 - U+07FF are encoded with two bytes.
Code points in the range U+0800 - U+FFFF are encoded with three bytes.

A detailed explanation is on Wikipedia.

For the BOM we need three bytes.

    hex                                FEFF
    binary                             11111110 11111111

    encode the bits in UTF-8
    pattern for three-byte encoding    1110xxxx 10xxxxxx 10xxxxxx
    the bits of the code point             1111   111011   111111
    result                             11101111 10111011 10111111
    in hex                             EF       BB       BF

EF BB BF already sounds familiar. ;-)

The byte sequence EF BB BF is nothing other than the BOM encoded in UTF-8.

As the byte order mark has no meaning for UTF-8, Java does not use it: its UTF-8 charset does not write a BOM when encoding and does not strip one when decoding.

Encoding the BOM character as UTF-8:

    jshell> "\uFEFF".getBytes("UTF-8")
    $1 ==> byte[3] { -17, -69, -65 }  // EF BB BF

Hence, when the file is read, the byte sequence is decoded to \uFEFF.
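The bit manipulation worked out by hand above can be checked with a few shifts and masks. This is only a minimal sketch: the class name `BomUtf8Encoding` and the helper `encodeThreeBytes` are made up for illustration and are not part of the original answer. The helper covers only the three-byte range and does not check for surrogates.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BomUtf8Encoding {

    // Hypothetical helper for illustration: encodes a code point in the
    // range U+0800..U+FFFF using the three-byte UTF-8 pattern
    // 1110xxxx 10xxxxxx 10xxxxxx. Surrogates are not checked here.
    static byte[] encodeThreeBytes(int codePoint) {
        if (codePoint < 0x0800 || codePoint > 0xFFFF) {
            throw new IllegalArgumentException("not a three-byte code point");
        }
        return new byte[] {
            (byte) (0xE0 | (codePoint >> 12)),         // 1110 + top 4 bits
            (byte) (0x80 | ((codePoint >> 6) & 0x3F)), // 10 + middle 6 bits
            (byte) (0x80 | (codePoint & 0x3F))         // 10 + low 6 bits
        };
    }

    public static void main(String[] args) {
        byte[] manual = encodeThreeBytes(0xFEFF);
        byte[] library = "\uFEFF".getBytes(StandardCharsets.UTF_8);

        StringBuilder hex = new StringBuilder();
        for (byte b : manual) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim());          // EF BB BF
        System.out.println(Arrays.equals(manual, library)); // true
    }
}
```

The manual result matches what the JDK's own UTF-8 encoder produces for the single character U+FEFF, which is the point of the answer: EF BB BF and \uFEFF are the same character in two different representations.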
When encoding to UTF-16, for example, the BOM is added:

    jshell> " ".getBytes("UTF-16")
    $2 ==> byte[4] { -2, -1, 0, 32 }  // FE FF + the encoded SPACE

[1] Cited from http://www.unicode.org/versions/Unicode9.0.0/ch23.pdf:

> Although there are never any questions of byte order with UTF-8 text, this sequence can serve as signature for UTF-8 encoded text where the character set is unmarked. As with a BOM in UTF-16, this sequence of bytes will be extremely rare at the beginning of text files in other character encodings.

Answered Jan 18, 2019 at 9:31 by SubOptimal. Edited Jul 29, 2021 at 6:52.

Comments:

@BasilBourque Wasn't aware that one could misread the sentence that way. I have now made it clearer what I wanted to say. – SubOptimal, Jul 29, 2021 at 6:53

Answer 2 (score: 5)

InputStreamReader decodes the UTF-8 encoded byte sequence (b) into UTF-16BE, and in the process translates the UTF-8 BOM to the UTF-16BE BOM (\uFEFF). UTF-16BE is selected as the target encoding because Charset defaults to this behavior:

https://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html

> The UTF-16 charsets are specified by RFC 2781; the transformation formats upon which they are based are specified in Amendment 1 of ISO 10646-1 and are also described in the Unicode Standard.
>
> The UTF-16 charsets use sixteen-bit quantities and are therefore sensitive to byte order. In these encodings the byte order of a stream may be indicated by an initial byte-order mark represented by the Unicode character '\uFEFF'. Byte-order marks are handled as follows:
>
> - When decoding, the UTF-16BE and UTF-16LE charsets interpret the initial byte-order marks as a ZERO-WIDTH NON-BREAKING SPACE; when encoding, they do not write byte-order marks.
> - When decoding, the UTF-16 charset interprets the byte-order mark at the beginning of the input stream to indicate the byte-order of the stream but defaults to big-endian if there is no byte-order mark; when encoding, it uses big-endian byte order and writes a big-endian byte-order mark.
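The charset rules quoted above can be observed directly. The following is a small sketch rather than anything from the original answer; the class name `Utf16BomDemo` and the helper `decode` are illustrative only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf16BomDemo {

    // Illustrative helper (not from the original post): decode bytes with a given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String s = "a";

        // Encoding: UTF-16 writes a big-endian BOM, UTF-16BE and UTF-16LE write none.
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_16)));   // [-2, -1, 0, 97]
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_16BE))); // [0, 97]
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_16LE))); // [97, 0]

        // Decoding: FE FF followed by a big-endian 'a'.
        byte[] withBom = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x61};

        // UTF-16 consumes the BOM to pick the byte order.
        String viaUtf16 = decode(withBom, StandardCharsets.UTF_16);
        // UTF-16BE keeps it as the character U+FEFF.
        String viaUtf16Be = decode(withBom, StandardCharsets.UTF_16BE);

        System.out.println(viaUtf16.length());                // 1
        System.out.println(viaUtf16Be.length());              // 2
        System.out.println(viaUtf16Be.charAt(0) == '\uFEFF'); // true
    }
}
```

The UTF-8 charset behaves like UTF-16BE and UTF-16LE in this respect: a leading BOM is passed through as U+FEFF, which is why the question's `replace("\uFEFF", "")` finds something to remove.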
See JLS 3.1 to understand why the internal encoding of String is UTF-16:

https://docs.oracle.com/javase/specs/jls/se8/html/jls-3.html#jls-3.1

> The Java programming language represents text in sequences of 16-bit code units, using the UTF-16 encoding.

String#getBytes() returns a byte sequence in the platform's default encoding, which appears to be UTF-8 on your system.

Summary

The sequence EF BB BF (UTF-8 BOM) is translated to FE FF (UTF-16BE BOM) when the byte sequence is decoded into a String using InputStreamReader, because the encoding of java.lang.String with a default Charset is UTF-16BE in the presence of a BOM. After the UTF-16BE BOM is replaced and String#getBytes() is called, the string is encoded into UTF-8 (the default charset on your platform), and you see your original byte sequence without a BOM.

Answered Jan 18, 2019 at 3:42 by Chris Hutchinson. Edited Jan 18, 2019 at 20:28.

Comments:

And where does the language demonstrate that it is UTF-16BE, instead of UTF-16-Host? – Deduplicator, Jan 18, 2019 at 14:42

@Deduplicator adjusted the answer to explain why UTF-16BE is chosen – Chris Hutchinson, Jan 18, 2019 at 20:17
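The decode, strip, and re-encode round trip described in the summary can be sketched with explicit charsets, so the result does not depend on the platform default. The class name `StripUtf8Bom` and the helper `decodeWithoutBom` are hypothetical names chosen for this example. Unlike the one-liner from the question, the helper removes only a single leading U+FEFF.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class StripUtf8Bom {

    // Hypothetical helper for illustration: decode UTF-8 bytes and drop
    // one leading U+FEFF if it is present.
    static String decodeWithoutBom(byte[] utf8Bytes) {
        // The JDK's UTF-8 decoder turns EF BB BF into the single char U+FEFF.
        String text = new String(utf8Bytes, StandardCharsets.UTF_8);
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        byte[] input = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', 'a', 'a'};

        String text = decodeWithoutBom(input);
        // Re-encode explicitly as UTF-8 instead of relying on the platform default.
        byte[] output = text.getBytes(StandardCharsets.UTF_8);

        System.out.println(text.length());           // 3
        System.out.println(Arrays.toString(output)); // [97, 97, 97]
    }
}
```

Checking only the start of the string is a deliberate choice in this sketch: elsewhere in a text, U+FEFF is a ZERO WIDTH NO-BREAK SPACE, and `replace("\uFEFF", "")` would remove those occurrences too.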