Byte order mark - Wikipedia
"FEFF" redirects here. For the airport in Central African Republic with the airport code FEFF, see Bangui M'Poko International Airport. For the program used in X-ray absorption spectroscopy, see FEFF (software). For the name of U+FEFF in Unicode and the alternative usage as a zero-width non-breaking space, see Word joiner.

The byte order mark (BOM) is a particular usage of the special Unicode character, U+FEFF BYTE ORDER MARK, whose appearance as a magic number at the start of a text stream can signal several things to a program reading the text:[1]

- The byte order, or endianness, of the text stream in the cases of 16-bit and 32-bit encodings;
- The fact that the text stream's encoding is Unicode, to a high level of confidence;
- Which Unicode character encoding is used.

BOM use is optional. Its presence interferes with the use of UTF-8 by software that does not expect non-ASCII bytes at the start of a file but that could otherwise handle the text stream.

Unicode can be encoded in units of 8-bit, 16-bit, or 32-bit integers. For the 16- and 32-bit representations, a computer receiving text from arbitrary sources needs to know which byte order the integers are encoded in. The BOM is encoded in the same scheme as the rest of the document and becomes a noncharacter Unicode code point if its bytes are swapped. Hence, the process accessing the text can examine these first few bytes to determine the endianness, without requiring some contract or metadata outside of the text stream itself. Generally the receiving computer will swap the bytes to its own endianness, if necessary, and would no longer need the BOM for processing.
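
The check described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `detect_byte_order_16` is hypothetical, and the sketch assumes the caller already knows the stream uses 16-bit code units.

```python
import struct

BOM = 0xFEFF          # U+FEFF BYTE ORDER MARK
SWAPPED_BOM = 0xFFFE  # noncharacter that appears when the two bytes are swapped


def detect_byte_order_16(data: bytes):
    """Return 'big', 'little', or None for a stream of 16-bit code units.

    Hypothetical helper: it assumes 16-bit units and only inspects
    the first two bytes.
    """
    if len(data) < 2:
        return None
    # Read the first code unit as if the stream were big-endian.
    (unit,) = struct.unpack(">H", data[:2])
    if unit == BOM:
        return "big"
    if unit == SWAPPED_BOM:
        # The bytes arrived swapped, so the stream is little-endian.
        return "little"
    return None  # no BOM: the byte order must come from elsewhere
```

A caller could then pick the matching decoder, or fall back to out-of-band information when the function returns `None`.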

The byte sequence of the BOM differs per Unicode encoding (including ones outside the Unicode standard such as UTF-7, see table below), and none of the sequences is likely to appear at the start of text streams stored in other encodings. Therefore, placing an encoded BOM at the start of a text stream can indicate that the text is Unicode and identify the encoding scheme used. This use of the BOM character is called a "Unicode signature".[2]

Usage

If the BOM character appears in the middle of a data stream, Unicode says it should be interpreted as a "zero-width non-breaking space" (inhibits line-breaking between word-glyphs). In Unicode 3.2, this usage is deprecated in favor of the "Word Joiner" character, U+2060.[1] This allows U+FEFF to be used only as a BOM.

UTF-8

The UTF-8 representation of the BOM is the (hexadecimal) byte sequence 0xEF, 0xBB, 0xBF.

The Unicode Standard permits the BOM in UTF-8,[3] but does not require or recommend its use.[4] Byte order has no meaning in UTF-8,[5] so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8, or that it was converted to UTF-8 from a stream that contained an optional BOM. The standard also does not recommend removing a BOM when it is there, so that round-tripping between encodings does not lose information, and so that code that relies on it continues to work.[6][7] The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it "SHOULD forbid use of U+FEFF as a signature."[8] An example of not following this recommendation is the IETF Syslog protocol, which requires text to be in UTF-8 and also requires the BOM.[9]

Not using a BOM allows text to be backwards-compatible with some software that is not Unicode-aware. Examples include programming languages that permit non-ASCII bytes in string literals but not at the start of the file.
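
As an illustration of handling the optional UTF-8 signature, the sketch below uses Python's standard `utf-8-sig` codec, which skips a leading 0xEF 0xBB 0xBF on decoding and writes one on encoding. The helper name `decode_utf8_optional_bom` is an assumption made for this example. It reports whether a BOM was present so that a caller who wants to round-trip the file can write the signature back.

```python
import codecs


def decode_utf8_optional_bom(data: bytes):
    """Decode UTF-8 bytes, tolerating an optional leading BOM.

    Returns (text, had_bom). Hypothetical helper for illustration.
    """
    had_bom = data.startswith(codecs.BOM_UTF8)  # b"\xef\xbb\xbf"
    # "utf-8-sig" drops one leading BOM if present, otherwise acts like "utf-8".
    text = data.decode("utf-8-sig")
    return text, had_bom


def encode_utf8(text: str, with_bom: bool) -> bytes:
    """Encode text as UTF-8, adding the signature only when asked to."""
    return text.encode("utf-8-sig" if with_bom else "utf-8")
```

For example, `decode_utf8_optional_bom(b"\xef\xbb\xbfhi")` yields the text `"hi"` together with `True`, and passing that flag to `encode_utf8` reproduces the original bytes.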

UTF-8 is a sparse encoding in the sense that a large fraction of possible byte combinations do not result in valid UTF-8 text. Binary data and text in any other encoding are likely to contain byte sequences that are invalid as UTF-8. Practically the only exception is text that consists purely of ASCII-range bytes. Because all modern encodings use ASCII-range bytes to represent ASCII characters, ASCII-only text can be safely interpreted as UTF-8 regardless of what encoding was intended by the system that emitted the bytes. Because of these considerations, heuristic analysis can detect with high confidence whether UTF-8 is in use, without requiring a BOM.

Microsoft compilers[10] and interpreters, and many pieces of software on Microsoft Windows such as Notepad, treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII. Windows PowerShell (up to 5.1) will add a BOM when it saves UTF-8 XML documents. However, PowerShell Core 6 added a utf8NoBOM value for the -Encoding switch on some cmdlets so that documents can be saved without a BOM. Google Docs also adds a BOM when converting a document to a plain text file for download.

UTF-16

In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream. If an attempt is made to read this stream with the wrong endianness, the bytes will be swapped, thus delivering the character U+FFFE, which is defined by Unicode as a "noncharacter" that should never appear in the text.

- If the 16-bit units are represented in big-endian byte order, the BOM will appear in the sequence of bytes as 0xFE 0xFF
- If the 16-bit units use little-endian order, the BOM will appear in the sequence of bytes as 0xFF 0xFE

Neither of these sequences is valid UTF-8, so their presence indicates that the file is not encoded in UTF-8.

For the IANA registered charsets UTF-16BE and UTF-16LE, a byte order mark should not be used because the names of these character sets already determine the byte order. If encountered anywhere in such a text stream, U+FEFF is to be interpreted as a "zero width no-break space".
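
The difference between the BOM-aware and the explicitly named UTF-16 charsets can be observed with Python's built-in codecs. The snippet below is only a demonstration of that behaviour; the variable names are illustrative.

```python
# The bytes FE FF 00 41: a big-endian BOM followed by "A" (U+0041).
data = b"\xfe\xff\x00A"

# The generic "utf-16" codec reads the BOM, selects the byte order, and drops it.
with_bom_codec = data.decode("utf-16")

# "utf-16-be" fixes the byte order by name, so U+FEFF is kept as a character.
explicit_be = data.decode("utf-16-be")

# Reading the BOM with the wrong byte order yields the noncharacter U+FFFE.
wrong_order = data[:2].decode("utf-16-le")

print(repr(with_bom_codec), repr(explicit_be), repr(wrong_order))
```

Here `with_bom_codec` is `"A"`, `explicit_be` is `"\ufeffA"`, and `wrong_order` is `"\ufffe"`.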

If there is no BOM, it is possible to guess whether the text is UTF-16 and its byte order by searching for ASCII characters (i.e. a 0 byte adjacent to a byte in the 0x20-0x7E range, also 0x0A and 0x0D for LF and CR). A large number (i.e. far higher than random chance) in the same order is a very good indication of UTF-16, and whether the 0 is in the even or odd bytes indicates the byte order. However, this can result in both false positives and false negatives.

Clause D98 of conformance (section 3.10) of the Unicode standard states, "The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian." Whether or not a higher-level protocol is in force is open to interpretation. Files local to a computer for which the native byte ordering is little-endian, for example, might be argued to be encoded as UTF-16LE implicitly. Therefore, the presumption of big-endian is widely ignored. The W3C/WHATWG encoding standard used in HTML5 specifies that content labelled either "utf-16" or "utf-16le" is to be interpreted as little-endian "to deal with deployed content".[11] However, if a byte-order mark is present, then that BOM is to be treated as "more authoritative than anything else".[12]

Programs that interpret UTF-16 as a byte-based encoding may display a garbled mess of characters, but ASCII characters would be recognizable because the low byte of the UTF-16 representation is the same as the ASCII code and therefore would be displayed the same. The upper byte of 0 may be displayed as nothing, white space, a period, or some other unvarying glyph.

UTF-32

Although a BOM could be used with UTF-32, this encoding is rarely used for transmission. Otherwise the same rules as for UTF-16 are applicable.

The BOM for little-endian UTF-32 is the same pattern as a little-endian UTF-16 BOM followed by a NUL character, an unusual example of the BOM being the same pattern in two different encodings. Programmers using the BOM to identify the encoding will have to decide whether UTF-32 or a NUL first character is more likely.
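
One way to resolve the UTF-32/UTF-16 overlap is to test the longer signatures first. The sketch below does that using the BOM constants from Python's `codecs` module. The function name `sniff_bom` and the choice to prefer UTF-32 over "UTF-16LE text starting with NUL" are assumptions of this example, not something the standard mandates.

```python
import codecs

# Longest signatures first, so FF FE 00 00 is reported as UTF-32LE
# rather than as a UTF-16LE BOM followed by a NUL character.
_SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def sniff_bom(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM matches.

    Hypothetical helper covering only the UTF-8, UTF-16 and UTF-32 signatures.
    """
    for bom, name in _SIGNATURES:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

A caller would typically decode `data[bom_length:]` with the returned encoding name, and fall back to a heuristic or a default when no signature is found.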

Byte order marks by encoding

This table illustrates how the BOM character is represented as a byte sequence in various encodings and how those sequences might appear in a text editor that is interpreting each byte as a legacy encoding (CP1252 and caret notation for the C0 controls):

| Encoding | Representation (hexadecimal) | Representation (decimal) | Bytes as CP1252 characters |
|---|---|---|---|
| UTF-8[a] | EF BB BF | 239 187 191 | ï»¿ |
| UTF-16 (BE) | FE FF | 254 255 | þÿ |
| UTF-16 (LE) | FF FE | 255 254 | ÿþ |
| UTF-32 (BE) | 00 00 FE FF | 0 0 254 255 | ^@^@þÿ (^@ is the null character) |
| UTF-32 (LE) | FF FE 00 00 | 255 254 0 0 | ÿþ^@^@ (^@ is the null character) |
| UTF-7[a] | 2B 2F 76[b][14][15] | 43 47 118 | +/v |
| UTF-1[a] | F7 64 4C | 247 100 76 | ÷dL |
| UTF-EBCDIC[a] | DD 73 66 73 | 221 115 102 115 | Ýsfs |
| SCSU[a] | 0E FE FF[c] | 14 254 255 | ^Nþÿ (^N is the "shift out" character) |
| BOCU-1[a] | FB EE 28 | 251 238 40 | ûî( |
| GB-18030[a] | 84 31 95 33 | 132 49 149 51 | „1•3 |

a. This is not literally a "byte order" mark, since a code unit in these encodings is one byte and therefore cannot have bytes in a "wrong" order. Nevertheless, the BOM can be used to indicate the encoding of the text that follows it.[5][13]
b. Followed by 38, 39, 2B, or 2F (ASCII 8, 9, + or /), depending on what the next character is.
c. SCSU allows other encodings of U+FEFF; the shown form is the signature recommended in UTR #6.[16]

See also

- Left-to-right mark
- Arabic Presentation Forms-B, block to which code point U+FEFF belongs

References

1. "FAQ - UTF-8, UTF-16, UTF-32 & BOM". Unicode.org. Retrieved 28 January 2017.
2. "The Unicode® Standard Version 9.0" (PDF). The Unicode Consortium.
3. "The Unicode Standard 5.0, Chapter 2: General Structure" (PDF). p. 36. Retrieved 29 March 2009. Table 2-4. The Seven Unicode Encoding Schemes
4. "The Unicode Standard 5.0, Chapter 2: General Structure" (PDF). p. 36. Retrieved 30 November 2008. Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature
5. "FAQ - UTF-8, UTF-16, UTF-32 & BOM: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can I still assume the remaining UTF-8 bytes are in big-endian order?". Unicode.org. Retrieved 4 January 2009.
6. "Re: pre-HTML5 and the BOM from Asmus Freytag on 2012-07-13 (Unicode Mail List Archive)". Unicode.org. Retrieved 14 July 2012.
7. "Bug ID: JDK-6378911 UTF-8 decoder handling of byte-order mark has changed". Bugs.java.com. Retrieved 14 October 2021.
8. Yergeau, Francois (November 2003). UTF-8, a transformation format of ISO 10646. IETF. doi:10.17487/RFC3629. RFC 3629. Retrieved 15 May 2014.
9. Gerhards, Rainer (March 2009). "MSG". The Syslog Protocol. IETF. sec. 6.4. doi:10.17487/RFC5424. RFC 5424.
10. Alf P. Steinbach (2011). "Unicode part 1: Windows console i/o approaches". Retrieved 24 March 2012. However, since the C++ source code was encoded as UTF-8 without BOM (as is usual in Linux), the Visual C++ compiler erroneously assumed that the source code was encoded as Windows ANSI.
11. "UTF-16LE". Encoding Standard. WHATWG.
12. "Decode". Encoding Standard. WHATWG.
13. Yergeau, François (8 November 2003). "RFC 3629 - UTF-8, a transformation format of ISO 10646". Tools.ietf.org. Retrieved 28 January 2017.
14. https://unicode.org/L2/L2021/21038-bom-guidance.pdf
15. "SDL Documentation".
16. Markus Scherer. "UTS #6: Compression Scheme for Unicode". Unicode.org. Retrieved 28 January 2017.

External links

- Unicode FAQ: UTF-8, UTF-16, UTF-32 & BOM
- The Unicode Standard, chapter 2.6 Encoding Schemes
- The Unicode Standard, chapter 2.13 Special Characters and Noncharacters, section Byte Order Mark (BOM)
- The Unicode Standard, chapter 16.8 Specials, section Byte Order Mark (BOM): U+FEFF

Retrieved from "https://en.wikipedia.org/w/index.php?title=Byte_order_mark&oldid=1106914685"