Byte order mark - Wikipedia
The byte order mark (BOM) is a particular usage of the special Unicode character, U+FEFF BYTE ORDER MARK, whose appearance as a magic number at the start of a text stream can signal several things to a program reading the text:[1]

- The byte order, or endianness, of the text stream in the cases of 16-bit and 32-bit encodings;
- The fact that the text stream's encoding is Unicode, to a high level of confidence;
- Which Unicode character encoding is used.

BOM use is optional. Its presence interferes with the use of UTF-8 by software that does not expect non-ASCII bytes at the start of a file but that could otherwise handle the text stream.

Unicode can be encoded in units of 8-bit, 16-bit, or 32-bit integers. For the 16- and 32-bit representations, a computer receiving text from arbitrary sources needs to know which byte order the integers are encoded in. The BOM is encoded in the same scheme as the rest of the document and becomes a noncharacter Unicode code point if its bytes are swapped. Hence, the process accessing the text can examine these first few bytes to determine the endianness, without requiring some contract or metadata outside of the text stream itself. Generally the receiving computer will swap the bytes to its own endianness, if necessary, and would no longer need the BOM for processing.
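The byte-swap property described above can be illustrated with a short Python sketch. It encodes U+FEFF in both 16-bit byte orders and reads the first code unit back with the right and the wrong endianness. The helper name `read_first_code_unit` is hypothetical and exists only for this example.

```python
BOM = "\ufeff"

def read_first_code_unit(data: bytes, byteorder: str) -> int:
    """Interpret the first two bytes as one 16-bit code unit.

    byteorder is 'big' or 'little', as accepted by int.from_bytes.
    """
    return int.from_bytes(data[:2], byteorder)

# The explicit-endian codecs encode U+FEFF itself without adding anything.
big = BOM.encode("utf-16-be")      # bytes FE FF
little = BOM.encode("utf-16-le")   # bytes FF FE

# Read with the matching byte order: the BOM, U+FEFF.
right = read_first_code_unit(big, "big")
# Read with the opposite byte order: the noncharacter U+FFFE.
wrong = read_first_code_unit(big, "little")

print(hex(right), hex(wrong))
```

A reader that sees U+FFFE as the first code unit can therefore conclude that it picked the wrong byte order and swap.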
The byte sequence of the BOM differs per Unicode encoding (including ones outside the Unicode standard such as UTF-7, see table below), and none of the sequences is likely to appear at the start of text streams stored in other encodings. Therefore, placing an encoded BOM at the start of a text stream can indicate that the text is Unicode and identify the encoding scheme used. This use of the BOM character is called a "Unicode signature".[2]

Usage

If the BOM character appears in the middle of a data stream, Unicode says it should be interpreted as a "zero-width non-breaking space" (inhibits line-breaking between word-glyphs). In Unicode 3.2, this usage is deprecated in favor of the "Word Joiner" character, U+2060.[1] This allows U+FEFF to be used only as a BOM.

UTF-8

The UTF-8 representation of the BOM is the (hexadecimal) byte sequence 0xEF, 0xBB, 0xBF.

The Unicode Standard permits the BOM in UTF-8,[3] but does not require or recommend its use.[4] Byte order has no meaning in UTF-8,[5] so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8, or that it was converted to UTF-8 from a stream that contained an optional BOM. The standard also does not recommend removing a BOM when it is there, so that round-tripping between encodings does not lose information, and so that code that relies on it continues to work.[6][7] The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it "SHOULD forbid use of U+FEFF as a signature."[8] An example of not following this recommendation is the IETF Syslog protocol, which requires text to be in UTF-8 and also requires the BOM.[9]

Not using a BOM allows text to be backwards-compatible with some software that is not Unicode-aware. Examples include programming languages that permit non-ASCII bytes in string literals but not at the start of the file.
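As a rough illustration of how a program might deal with the UTF-8 signature, the Python sketch below checks for the three-byte sequence and decodes the text either with or without it. The function names are hypothetical. `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the Python standard library. Whether to drop or keep the signature is left to the application.

```python
import codecs

def has_utf8_signature(data: bytes) -> bool:
    """Return True if data starts with the UTF-8 BOM bytes EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)

def decode_dropping_signature(data: bytes) -> str:
    """Decode UTF-8 and drop one leading BOM if present ('utf-8-sig' codec)."""
    return data.decode("utf-8-sig")

def decode_keeping_signature(data: bytes) -> str:
    """Decode UTF-8 as-is; a leading BOM survives as the character U+FEFF."""
    return data.decode("utf-8")

plain = "héllo".encode("utf-8")
signed = codecs.BOM_UTF8 + plain

print(has_utf8_signature(signed), has_utf8_signature(plain))
print(repr(decode_dropping_signature(signed)))
print(repr(decode_keeping_signature(signed)))
```

Keeping the signature preserves round-tripping, while dropping it suits consumers that would otherwise treat U+FEFF as unexpected content at the start of the text.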
UTF-8 is a sparse encoding in the sense that a large fraction of possible byte combinations do not result in valid UTF-8 text. Binary data and text in any other encoding are likely to contain byte sequences that are invalid as UTF-8. Practically the only exception is text that consists purely of ASCII-range bytes. Because all modern encodings use ASCII-range bytes to represent ASCII characters, ASCII-only text can be safely interpreted as UTF-8 regardless of what encoding was intended by the system that emitted the bytes. Because of these considerations, heuristic analysis can detect with high confidence whether UTF-8 is in use, without requiring a BOM.

Microsoft compilers[10] and interpreters, and many pieces of software on Microsoft Windows such as Notepad, treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII. Windows PowerShell (up to 5.1) will add a BOM when it saves UTF-8 XML documents. However, PowerShell Core 6 added a utf8NoBOM value for the -Encoding switch on some cmdlets so that documents can be saved without a BOM. Google Docs also adds a BOM when converting a document to a plain text file for download.

UTF-16

In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream. If an attempt is made to read this stream with the wrong endianness, the bytes will be swapped, thus delivering the character U+FFFE, which is defined by Unicode as a "noncharacter" that should never appear in the text.

- If the 16-bit units are represented in big-endian byte order, the BOM will appear in the sequence of bytes as 0xFE 0xFF
- If the 16-bit units use little-endian order, the BOM will appear in the sequence of bytes as 0xFF 0xFE

Neither of these sequences is valid UTF-8, so their presence indicates that the file is not encoded in UTF-8.

For the IANA registered charsets UTF-16BE and UTF-16LE, a byte order mark should not be used because the names of these character sets already determine the byte order. If encountered anywhere in such a text stream, U+FEFF is to be interpreted as a "zero width no-break space".
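The two UTF-16 byte sequences above can be checked with a few lines of Python. This is a minimal sketch that assumes the stream is already known to be UTF-16. The function name `utf16_byte_order` is hypothetical, while the `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE` constants and the generic `utf-16` codec come from the standard library.

```python
import codecs
from typing import Optional

def utf16_byte_order(data: bytes) -> Optional[str]:
    """Return 'big' or 'little' according to a leading UTF-16 BOM, else None."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "little"
    return None

sample_be = codecs.BOM_UTF16_BE + "Hi".encode("utf-16-be")
sample_le = codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le")

print(utf16_byte_order(sample_be), utf16_byte_order(sample_le))

# The generic 'utf-16' codec reads the BOM, picks the byte order, and drops it.
print(sample_be.decode("utf-16"), sample_le.decode("utf-16"))
```

The explicit `utf-16-be` and `utf-16-le` codecs, by contrast, do not consume a BOM, which matches the rule that the UTF-16BE and UTF-16LE charsets should not carry one.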
If there is no BOM, it is possible to guess whether the text is UTF-16 and its byte order by searching for ASCII characters (i.e. a 0 byte adjacent to a byte in the 0x20-0x7E range, also 0x0A and 0x0D for LF and CR). A large number (i.e. far higher than random chance) in the same order is a very good indication of UTF-16, and whether the 0 is in the even or odd bytes indicates the byte order. However, this can result in both false positives and false negatives.

Clause D98 of conformance (section 3.10) of the Unicode standard states, "The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian." Whether or not a higher-level protocol is in force is open to interpretation. Files local to a computer for which the native byte ordering is little-endian, for example, might be argued to be encoded as UTF-16LE implicitly. Therefore, the presumption of big-endian is widely ignored. The W3C/WHATWG encoding standard used in HTML5 specifies that content labelled either "utf-16" or "utf-16le" is to be interpreted as little-endian "to deal with deployed content".[11] However, if a byte-order mark is present, then that BOM is to be treated as "more authoritative than anything else".[12]

Programs that interpret UTF-16 as a byte-based encoding may display a garbled mess of characters, but ASCII characters would be recognizable because the low byte of the UTF-16 representation is the same as the ASCII code and therefore would be displayed the same. The upper byte of 0 may be displayed as nothing, white space, a period, or some other unvarying glyph.

UTF-32

Although a BOM could be used with UTF-32, this encoding is rarely used for transmission. Otherwise the same rules as for UTF-16 are applicable.

The BOM for little-endian UTF-32 is the same pattern as a little-endian UTF-16 BOM followed by a NUL character, an unusual example of the BOM being the same pattern in two different encodings. Programmers using the BOM to identify the encoding will have to decide whether UTF-32 or a NUL first character is more likely.
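One way to make that decision explicit is to test the longer UTF-32 signatures before the UTF-16 ones. The Python sketch below does so. The function name `sniff_bom` and the choice to prefer UTF-32 over "UTF-16 followed by NUL" are assumptions of this example, not something the Unicode Standard prescribes. The `codecs.BOM_*` constants are from the standard library.

```python
import codecs
from typing import Optional, Tuple

# Order matters: FF FE 00 00 (UTF-32 LE) must be tried before FF FE (UTF-16 LE),
# otherwise every little-endian UTF-32 stream would be reported as UTF-16.
_SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),   # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32-le"),   # FF FE 00 00
    (codecs.BOM_UTF8, "utf-8"),           # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),   # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),   # FF FE
]

def sniff_bom(data: bytes) -> Tuple[Optional[str], int]:
    """Return (encoding name, BOM length) for a leading BOM, or (None, 0)."""
    for signature, name in _SIGNATURES:
        if data.startswith(signature):
            return name, len(signature)
    return None, 0

def decode_with_bom(data: bytes, fallback: str = "utf-8") -> str:
    """Decode using the sniffed encoding and skip the BOM bytes themselves."""
    name, length = sniff_bom(data)
    return data[length:].decode(name or fallback)

print(sniff_bom(b"\xff\xfe\x00\x00A\x00\x00\x00"))
print(sniff_bom(b"\xff\xfeA\x00"))
print(decode_with_bom(b"\xef\xbb\xbfabc"))
```

With this ordering, a UTF-16 (LE) text whose first character really is NUL would be misread as UTF-32, which is the trade-off the paragraph above describes.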
Byte order marks by encoding

This table illustrates how the BOM character is represented as a byte sequence in various encodings and how those sequences might appear in a text editor that is interpreting each byte as a legacy encoding (CP1252 and caret notation for the C0 controls):

| Encoding | Representation (hexadecimal) | Representation (decimal) | Bytes as CP1252 characters |
|---|---|---|---|
| UTF-8[a] | EF BB BF | 239 187 191 | ï»¿ |
| UTF-16 (BE) | FE FF | 254 255 | þÿ |
| UTF-16 (LE) | FF FE | 255 254 | ÿþ |
| UTF-32 (BE) | 00 00 FE FF | 0 0 254 255 | ^@^@þÿ (^@ is the null character) |
| UTF-32 (LE) | FF FE 00 00 | 255 254 0 0 | ÿþ^@^@ (^@ is the null character) |
| UTF-7[a] | 2B 2F 76[b][14][15] | 43 47 118 | +/v |
| UTF-1[a] | F7 64 4C | 247 100 76 | ÷dL |
| UTF-EBCDIC[a] | DD 73 66 73 | 221 115 102 115 | Ýsfs |
| SCSU[a] | 0E FE FF[c] | 14 254 255 | ^Nþÿ (^N is the "shift out" character) |
| BOCU-1[a] | FB EE 28 | 251 238 40 | ûî( |
| GB-18030[a] | 84 31 95 33 | 132 49 149 51 | „1•3 |

[a] This is not literally a "byte order" mark, since a code unit in these encodings is one byte and therefore cannot have bytes in a "wrong" order. Nevertheless, the BOM can be used to indicate the encoding of the text that follows it.[5][13]
[b] Followed by 38, 39, 2B, or 2F (ASCII 8, 9, + or /), depending on what the next character is.
[c] SCSU allows other encodings of U+FEFF; the shown form is the signature recommended in UTR #6.[16]

See also

- Left-to-right mark
- Arabic Presentation Forms-B, block to which code point U+FEFF belongs

References

[1] "FAQ - UTF-8, UTF-16, UTF-32 & BOM". Unicode.org. Retrieved 28 January 2017.
[2] "The Unicode® Standard Version 9.0" (PDF). The Unicode Consortium.
[3] "The Unicode Standard 5.0, Chapter 2: General Structure" (PDF). p. 36. Retrieved 29 March 2009. Table 2-4. The Seven Unicode Encoding Schemes
[4] "The Unicode Standard 5.0, Chapter 2: General Structure" (PDF). p. 36. Retrieved 30 November 2008. Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature
[5] "FAQ - UTF-8, UTF-16, UTF-32 & BOM: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can I still assume the remaining UTF-8 bytes are in big-endian order?". Unicode.org. Retrieved 4 January 2009.
[6] "Re: pre-HTML5 and the BOM from Asmus Freytag on 2012-07-13 (Unicode Mail List Archive)". Unicode.org. Retrieved 14 July 2012.
[7] "Bug ID: JDK-6378911 UTF-8 decoder handling of byte-order mark has changed". Bugs.java.com. Retrieved 14 October 2021.
[8] Yergeau, Francois (November 2003). UTF-8, a transformation format of ISO 10646. IETF. doi:10.17487/RFC3629. RFC 3629. Retrieved 15 May 2014.
[9] Gerhards, Rainer (March 2009). "MSG". The Syslog Protocol. IETF. sec. 6.4. doi:10.17487/RFC5424. RFC 5424.
[10] Alf P. Steinbach (2011). "Unicode part 1: Windows console i/o approaches". Retrieved 24 March 2012. However, since the C++ source code was encoded as UTF-8 without BOM (as is usual in Linux), the Visual C++ compiler erroneously assumed that the source code was encoded as Windows ANSI.
[11] "UTF-16LE". Encoding Standard. WHATWG.
[12] "Decode". Encoding Standard. WHATWG.
[13] Yergeau, François (8 November 2003). "RFC 3629 - UTF-8, a transformation format of ISO 10646". Tools.ietf.org. Retrieved 28 January 2017.
[14] https://unicode.org/L2/L2021/21038-bom-guidance.pdf
[15] "SDL Documentation".
[16] Markus Scherer. "UTS #6: Compression Scheme for Unicode". Unicode.org. Retrieved 28 January 2017.
External links

- Unicode FAQ: UTF-8, UTF-16, UTF-32 & BOM
- The Unicode Standard, chapter 2.6 Encoding Schemes
- The Unicode Standard, chapter 2.13 Special Characters and Noncharacters, section Byte Order Mark (BOM)
- The Unicode Standard, chapter 16.8 Specials, section Byte Order Mark (BOM): U+FEFF