Why BOM is U+FE FF, rather than U+FF FE? - Stack Overflow
Asked 6 years, 6 months ago. Modified 6 years, 6 months ago. Viewed 1k times.

So I'm teaching myself character encoding, and I have a presumably stupid question. Wikipedia says:

    The byte order mark (BOM) is a Unicode character, U+FEFF BYTE ORDER MARK (BOM), ...

and a chart on that page reads:

    Encoding       Representation (hexadecimal)
    UTF-8          EF BB BF
    UTF-16 (BE)    FE FF
    UTF-16 (LE)    FF FE
    ...

I'm a little confused by it. As far as I know, most machines using Intel CPUs are little-endian, so why is the BOM U+FE FF for UTF-16 (BE), rather than U+EF BB BF for UTF-8 or U+FF FE for UTF-16 (LE)?

Tags: unicode, utf-8, character-encoding

asked Apr 1, 2016 at 15:59 by nalzok

Well, because U+FEFF is a character and U+FFFE is not. A zero-width space has the nice property that it has no effect on rendered text even when an app does not properly filter it or flubs by inserting BOMs in the middle of a text stream. Very common bug.
– Hans Passant Apr 1, 2016 at 16:44

On your "rather than U+EF BB BF for UTF-8": quite funny, because UTF-8 does not need a "byte order mark". All code units in UTF-8 encoded text are exactly 1 byte long, so there is zero chance of getting your endianness wrong.
– Jongware Apr 2, 2016 at 11:53

@RadLexus So UTF-8 doesn't need a BOM to indicate the endianness, while UTF-16 and UTF-32 do?
– nalzok Apr 2, 2016 at 14:14

Yes. There is a quite lengthy article discussing all varieties on Wikipedia.
– Jongware Apr 2, 2016 at 15:37

2 Answers

Answer (score 3):

"As I know, most machines using Intel CPUs are little-endian"

Intel CPUs are not the only CPUs used in the world. AMD, ARM, etc. And there are big-endian CPUs.

"why BOM is U+FE FF for UTF-16(BE), rather than U+EF BB BF for UTF-8 or U+FF FE for UTF-16(LE)?"

U+FEFF is the Unicode code point designation. FE FF, EF BB BF, FF FE: these are sequences of bytes instead. U+ only applies to Unicode code point designations, not bytes.

The numeric value of Unicode code point U+FEFF ZERO WIDTH NO-BREAK SPACE (which is its official designation, not U+FEFF BYTE ORDER MARK, though it is also used as a BOM) is 0xFEFF (65279).

That code point value encoded in UTF-8 produces three 8-bit code unit values 0xEF 0xBB 0xBF, which are not subject to any endian issues, which is why UTF-8 does not have separate LE and BE variants.

That same code point value encoded in UTF-16 produces one 16-bit code unit value 0xFEFF. Because it is a multi-byte (16-bit) value, it is subject to endian when interpreted as two 8-bit bytes, hence the LE (0xFF 0xFE) and BE (0xFE 0xFF) variants.

It is not just the BOM that is affected. All code units in a UTF-16 string are affected by endian. The BOM helps a decoder know the endian used for the code units in the entire string.

UTF-32, which also uses multi-byte (32-bit) code units, is also subject to endian, and thus it also has LE and BE variants, and a 32-bit BOM to express that endian to decoders (0xFF 0xFE 0x00 0x00 for LE, 0x00 0x00 0xFE 0xFF for BE). And yes, as you can probably guess, there is an ambiguity between the UTF-16LE BOM and the UTF-32LE BOM if you don't know ahead of time which UTF you are dealing with. A BOM is meant to identify the endian, hence the name "Byte Order Mark", not the particular encoding (though it is commonly used for that purpose).

edited Apr 1, 2016 at 17:48, answered Apr 1, 2016 at 17:41 by Remy Lebeau

It's probably also worth mentioning that U+FFFE is not a valid Unicode character, which is why U+FEFF can be used as a byte order mark (otherwise it would be impossible to distinguish reliably). And "endian" should be "endianness".
– Keith Thompson Apr 1, 2016 at 17:46

I do see that U+FEFF was deprecated in Unicode 3.2 as a break character in favor of U+2060 WORD JOINER for that purpose. However, the Unicode spec does also say that if U+FEFF appears inside a string then it should still be treated as a breaker: "Unicode 3.2 implementations should support this new character [U+2060], but also support the ZWNBSP semantic of U+FEFF."
– Remy Lebeau Apr 1, 2016 at 17:54

Did you misread my comment? U+FEFF is a valid character (ZERO WIDTH NO-BREAK SPACE), and is used as a BOM. U+FFFE is not a valid character, or at least doesn't have a name assigned to it (see UnicodeData.txt at unicode.org/Public/UNIDATA/UnicodeData.txt, a 1.6MB text file), which is why U+FEFF can be used as a BOM. If you see U+FFFE, it's probably a byte-swapped BOM, and you need to change the endianness you're using to interpret the input. (And "endian" is an adjective; "endianness" is the corresponding noun.)
– Keith Thompson Apr 1, 2016 at 17:57

Yes, I misread your comment. And yes, if you read a 2-byte BOM into a 16-bit variable and get a value of 0xFFFE then the BOM is in the opposite endian than what you read it as.
– Remy Lebeau Apr 1, 2016 at 17:59

Then is it OK to write fputwc(L'\uFEFF', fp); to add a BOM?
– nalzok Apr 2, 2016 at 0:57

Answer (score 2):

"why BOM is U+FE FF for UTF-16(BE)"

It isn't. BOM is character number U+FEFF. There's no space, it's a single hexadecimal number, aka 65279. This definition does not depend on what sequence of bytes is used to represent that character in any particular encoding.

It happens that the hexadecimal representation of the byte sequence that encodes the character (*) in UTF-16BE, 0xFE, 0xFF, has the same order of digits as the hexadecimal representation of the character number U+FEFF; this is just an artefact of big-endianness: it puts the most-significant content on the left, same as humans do for big [hexa]decimal numbers.

(* and indeed any character in the Basic Multilingual Plane. It gets hairier when you go above this range as they no longer fit in two bytes.)
answered Apr 1, 2016 at 17:27 by bobince
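The byte sequences mentioned in the answers above can be reproduced with a few lines of Python. This is only a small sketch for illustration: the names `BOM`, `bom_bytes` and `read_u16` are made up here and do not come from the thread. It encodes the single code point U+FEFF in each encoding form, then shows what a reader sees when it interprets the two UTF-16LE bytes with the wrong byte order.

```python
# U+FEFF is one code point with the numeric value 0xFEFF (65279).
BOM = "\ufeff"


def bom_bytes(encoding: str) -> str:
    """Return the bytes of U+FEFF in the given encoding as spaced hex."""
    return BOM.encode(encoding).hex(" ").upper()


def read_u16(raw: bytes, byteorder: str) -> int:
    """Interpret the first two bytes as one 16-bit code unit."""
    return int.from_bytes(raw[:2], byteorder)


# Print the byte sequence of the same code point in each encoding form.
for name in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
    print(f"{name:10} {bom_bytes(name)}")

# The bytes FF FE read as little-endian give 0xFEFF; read as big-endian
# they give 0xFFFE, which signals that the byte order guess was wrong.
raw = BOM.encode("utf-16-le")
print(hex(read_u16(raw, "little")), hex(read_u16(raw, "big")))
```

The code point never changes; only its serialization into bytes does, which is why the "U+" notation applies to U+FEFF and not to EF BB BF or FF FE.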
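The ambiguity between the UTF-16LE BOM (FF FE) and the UTF-32LE BOM (FF FE 00 00) noted in the first answer matters when a BOM is used to guess the encoding. The following is a rough sketch of one common approach, not something proposed in the thread: test the longer BOMs first. The helper names `sniff_bom` and `decode_with_bom` are hypothetical; the `codecs.BOM_*` constants are part of the Python standard library.

```python
import codecs

# Longer BOMs come first so that UTF-32LE (FF FE 00 00) is tested before
# UTF-16LE (FF FE), which is a prefix of it.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Strip a leading BOM if there is one and decode the remaining bytes."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or default)


print(sniff_bom(b"\xff\xfeh\x00i\x00"))
print(decode_with_bom(b"\xef\xbb\xbfhi"))
```

This heuristic is not airtight: UTF-16LE text whose first character after the BOM is U+0000 also begins with FF FE 00 00 and would be taken for UTF-32LE, so knowing the encoding ahead of time remains the more reliable option.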