Byte order mark - Wikipedia


The byte order mark (BOM) is a particular usage of the special Unicode character U+FEFF BYTE ORDER MARK, whose appearance as a magic number at the start of a text stream can signal several things to a program reading the text:[1]

- The byte order, or endianness, of the text stream in the cases of 16-bit and 32-bit encodings;
- The fact that the text stream's encoding is Unicode, to a high level of confidence;
- Which Unicode character encoding is used.

BOM use is optional. Its presence interferes with the use of UTF-8 by software that does not expect non-ASCII bytes at the start of a file but that could otherwise handle the text stream.

Unicode can be encoded in units of 8-bit, 16-bit, or 32-bit integers. For the 16- and 32-bit representations, a computer receiving text from arbitrary sources needs to know which byte order the integers are encoded in. The BOM is encoded in the same scheme as the rest of the document and becomes a noncharacter Unicode code point if its bytes are swapped. Hence, the process accessing the text can examine these first few bytes to determine the endianness, without requiring some contract or metadata outside of the text stream itself. Generally the receiving computer will swap the bytes to its own endianness, if necessary, and would no longer need the BOM for processing.

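As a rough illustration of the point that the BOM is encoded in the same scheme as the rest of the text, the following Python sketch encodes U+FEFF with several standard-library codecs and prints the resulting bytes. The helper name `bom_bytes` is made up for this example. The endianness-specific codec names (`utf-16-be` and so on) are used because Python's plain `utf-16` and `utf-32` codecs prepend a BOM of their own when encoding.

```python
# Sketch: how the single character U+FEFF looks as bytes in several encodings.
# `bom_bytes` is a hypothetical helper name, not part of any library.

BOM_CHAR = "\ufeff"

ENCODINGS = ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")


def bom_bytes(encoding: str) -> bytes:
    """Return U+FEFF encoded with the given Python codec name."""
    return BOM_CHAR.encode(encoding)


if __name__ == "__main__":
    for name in ENCODINGS:
        # bytes.hex(sep) requires Python 3.8 or later.
        print(f"{name:10} {bom_bytes(name).hex(' ').upper()}")
```

Comparing the `utf-16-be` and `utf-16-le` rows shows the two bytes in opposite order, which is what lets a reader infer the byte order from the first code unit.
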
The byte sequence of the BOM differs per Unicode encoding (including ones outside the Unicode standard such as UTF-7, see table below), and none of the sequences is likely to appear at the start of text streams stored in other encodings. Therefore, placing an encoded BOM at the start of a text stream can indicate that the text is Unicode and identify the encoding scheme used. This use of the BOM character is called a "Unicode signature".[2]

Usage

If the BOM character appears in the middle of a data stream, Unicode says it should be interpreted as a "zero-width non-breaking space" (inhibits line-breaking between word-glyphs). In Unicode 3.2, this usage is deprecated in favor of the "Word Joiner" character, U+2060.[1] This allows U+FEFF to be used only as a BOM.

UTF-8

The UTF-8 representation of the BOM is the (hexadecimal) byte sequence 0xEF, 0xBB, 0xBF.

The Unicode Standard permits the BOM in UTF-8,[3] but does not require or recommend its use.[4] Byte order has no meaning in UTF-8,[5] so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8, or that it was converted to UTF-8 from a stream that contained an optional BOM. The standard also does not recommend removing a BOM when it is there, so that round-tripping between encodings does not lose information, and so that code that relies on it continues to work.[6][7] The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it "SHOULD forbid use of U+FEFF as a signature."[8] An example of not following this recommendation is the IETF Syslog protocol, which requires text to be in UTF-8 and also requires the BOM.[9]

Not using a BOM allows text to be backwards-compatible with some software that is not Unicode-aware. Examples include programming languages that permit non-ASCII bytes in string literals but not at the start of the file.

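For software that must accept UTF-8 input both with and without the signature, one possible approach in Python is sketched below. It relies on the standard library's `codecs.BOM_UTF8` constant and the `utf-8-sig` codec, which skips a leading BOM when decoding. The function names `strip_utf8_bom` and `decode_utf8_lenient` are invented for this example, and whether dropping the BOM is appropriate depends on the application, since the standard does not recommend removing it.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def decode_utf8_lenient(data: bytes) -> str:
    """Decode UTF-8 bytes whether or not they start with a BOM."""
    # "utf-8-sig" skips one leading BOM on decode and otherwise acts like "utf-8".
    return data.decode("utf-8-sig")


if __name__ == "__main__":
    with_bom = codecs.BOM_UTF8 + "héllo".encode("utf-8")
    without_bom = "héllo".encode("utf-8")
    # Both inputs decode to the same string.
    print(decode_utf8_lenient(with_bom) == decode_utf8_lenient(without_bom))
    # The plain "utf-8" codec keeps the BOM as a leading U+FEFF character.
    print(with_bom.decode("utf-8").startswith("\ufeff"))
```

Code that needs to round-trip the signature can record whether `data.startswith(codecs.BOM_UTF8)` was true and encode with `utf-8-sig` again on output.
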
UTF-8 is a sparse encoding in the sense that a large fraction of possible byte combinations do not result in valid UTF-8 text. Binary data and text in any other encoding are likely to contain byte sequences that are invalid as UTF-8. Practically the only exception is text that consists purely of ASCII-range bytes. Because all modern encodings use ASCII-range bytes to represent ASCII characters, ASCII-only text can be safely interpreted as UTF-8 regardless of what encoding was intended by the system that emitted the bytes. Because of these considerations, heuristic analysis can detect with high confidence whether UTF-8 is in use, without requiring a BOM.

Microsoft compilers[10] and interpreters, and many pieces of software on Microsoft Windows such as Notepad, treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII. Windows PowerShell (up to 5.1) will add a BOM when it saves UTF-8 XML documents. However, PowerShell Core 6 has added a utf8NoBOM value for the -Encoding switch on some cmdlets so that documents can be saved without a BOM. Google Docs also adds a BOM when converting a document to a plain text file for download.

UTF-16

In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream. If an attempt is made to read this stream with the wrong endianness, the bytes will be swapped, thus delivering the character U+FFFE, which is defined by Unicode as a "noncharacter" that should never appear in the text.

- If the 16-bit units are represented in big-endian byte order, the BOM will appear in the sequence of bytes as 0xFE 0xFF
- If the 16-bit units use little-endian order, the BOM will appear in the sequence of bytes as 0xFF 0xFE

Neither of these sequences is valid UTF-8, so their presence indicates that the file is not encoded in UTF-8.

For the IANA registered charsets UTF-16BE and UTF-16LE, a byte order mark should not be used because the names of these character sets already determine the byte order. If encountered anywhere in such a text stream, U+FEFF is to be interpreted as a "zero width no-break space".

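The two UTF-16 byte sequences above, and the U+FFFE that results from reading a BOM with the wrong endianness, can be checked with a short Python sketch. The function name `utf16_byte_order` is hypothetical; `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE` are standard-library constants.

```python
import codecs
from typing import Optional


def utf16_byte_order(data: bytes) -> Optional[str]:
    """Return "big", "little", or None based on a leading UTF-16 BOM."""
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "little"
    return None


if __name__ == "__main__":
    big_endian = codecs.BOM_UTF16_BE + "Hi".encode("utf-16-be")
    print(utf16_byte_order(big_endian))
    # Decoding the big-endian BOM bytes as little-endian yields U+FFFE.
    swapped = big_endian[:2].decode("utf-16-le")
    print(f"U+{ord(swapped):04X}")
```

This helper inspects only the first two bytes and assumes the caller already knows the data is some form of UTF-16.
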
If there is no BOM, it is possible to guess whether the text is UTF-16 and its byte order by searching for ASCII characters (i.e. a 0 byte adjacent to a byte in the 0x20-0x7E range, also 0x0A and 0x0D for LF and CR). A large number (i.e. far higher than random chance) in the same order is a very good indication of UTF-16, and whether the 0 is in the even or odd bytes indicates the byte order. However, this can result in both false positives and false negatives.

Clause D98 of conformance (section 3.10) of the Unicode standard states, "The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian." Whether or not a higher-level protocol is in force is open to interpretation. Files local to a computer for which the native byte ordering is little-endian, for example, might be argued to be encoded as UTF-16LE implicitly. Therefore, the presumption of big-endian is widely ignored. The W3C/WHATWG encoding standard used in HTML5 specifies that content labelled either "utf-16" or "utf-16le" is to be interpreted as little-endian "to deal with deployed content".[11] However, if a byte-order mark is present, then that BOM is to be treated as "more authoritative than anything else".[12]

Programs that interpret UTF-16 as a byte-based encoding may display a garbled mess of characters, but ASCII characters would be recognizable because the low byte of the UTF-16 representation is the same as the ASCII code and therefore would be displayed the same. The upper byte of 0 may be displayed as nothing, white space, a period, or some other unvarying glyph.

UTF-32

Although a BOM could be used with UTF-32, this encoding is rarely used for transmission. Otherwise the same rules as for UTF-16 are applicable.

The BOM for little-endian UTF-32 is the same pattern as a little-endian UTF-16 BOM followed by a NUL character, an unusual example of the BOM being the same pattern in two different encodings. Programmers using the BOM to identify the encoding will have to decide whether UTF-32 or a NUL first character is more likely.

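The ambiguity between FF FE 00 00 as a UTF-32 (LE) BOM and as a UTF-16 (LE) BOM followed by NUL has to be resolved one way or the other by any BOM sniffer. The Python sketch below takes the assumption that UTF-32 is the more likely reading and therefore tests the longer signatures first; that ordering is a choice made for this example, not something the text prescribes. The names `sniff_bom` and `decode_with_bom` are hypothetical, while the `codecs.BOM_*` constants come from the standard library.

```python
import codecs
from typing import Optional, Tuple

# Longer signatures come first, so FF FE 00 00 is tried as UTF-32 (LE)
# before its two-byte prefix FF FE is accepted as UTF-16 (LE).
# Assumption: UTF-32 is judged more likely than UTF-16 text starting with NUL.
_SIGNATURES = (
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def sniff_bom(data: bytes) -> Tuple[Optional[str], int]:
    """Return (codec name, BOM length), or (None, 0) if no BOM matches."""
    for signature, name in _SIGNATURES:
        if data.startswith(signature):
            return name, len(signature)
    return None, 0


def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Decode data using its BOM if present, else the given default codec."""
    name, skip = sniff_bom(data)
    return data[skip:].decode(name or default)


if __name__ == "__main__":
    print(sniff_bom(b"\xff\xfeA\x00"))
    print(sniff_bom(b"\xff\xfe\x00\x00A\x00\x00\x00"))
```

With this ordering, UTF-16 (LE) text whose first character after the BOM is NUL would be misread as UTF-32 (LE); reversing the order of the table entries would make the opposite trade-off.
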
Byte order marks by encoding

This table illustrates how the BOM character is represented as a byte sequence in various encodings and how those sequences might appear in a text editor that is interpreting each byte as a legacy encoding (CP1252 and caret notation for the C0 controls):

Encoding        Representation (hexadecimal)  Representation (decimal)  Bytes as CP1252 characters
UTF-8[a]        EF BB BF                      239 187 191               ï»¿
UTF-16 (BE)     FE FF                         254 255                   þÿ
UTF-16 (LE)     FF FE                         255 254                   ÿþ
UTF-32 (BE)     00 00 FE FF                   0 0 254 255               ^@^@þÿ (^@ is the null character)
UTF-32 (LE)     FF FE 00 00                   255 254 0 0               ÿþ^@^@ (^@ is the null character)
UTF-7[a]        2B 2F 76[b][14][15]           43 47 118                 +/v
UTF-1[a]        F7 64 4C                      247 100 76                ÷dL
UTF-EBCDIC[a]   DD 73 66 73                   221 115 102 115           Ýsfs
SCSU[a]         0E FE FF[c]                   14 254 255                ^Nþÿ (^N is the "shift out" character)
BOCU-1[a]       FB EE 28                      251 238 40                ûî(
GB-18030[a]     84 31 95 33                   132 49 149 51             „1•3

a. This is not literally a "byte order" mark, since a code unit in these encodings is one byte and therefore cannot have bytes in a "wrong" order. Nevertheless, the BOM can be used to indicate the encoding of the text that follows it.[5][13]
b. Followed by 38, 39, 2B, or 2F (ASCII 8, 9, + or /), depending on what the next character is.
c. SCSU allows other encodings of U+FEFF; the shown form is the signature recommended in UTR #6.[16]

See also

- Left-to-right mark
- Arabic Presentation Forms-B, block to which code point U+FEFF belongs

References

1. "FAQ - UTF-8, UTF-16, UTF-32 & BOM". Unicode.org. Retrieved 28 January 2017.
2. "The Unicode® Standard Version 9.0" (PDF). The Unicode Consortium.
3. "The Unicode Standard 5.0, Chapter 2: General Structure" (PDF). p. 36. Retrieved 29 March 2009. Table 2-4. The Seven Unicode Encoding Schemes
4. "The Unicode Standard 5.0, Chapter 2: General Structure" (PDF). p. 36. Retrieved 30 November 2008. Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature
5. "FAQ - UTF-8, UTF-16, UTF-32 & BOM: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can I still assume the remaining UTF-8 bytes are in big-endian order?". Unicode.org. Retrieved 4 January 2009.
6. "Re: pre-HTML5 and the BOM from Asmus Freytag on 2012-07-13 (Unicode Mail List Archive)". Unicode.org. Retrieved 14 July 2012.
7. "Bug ID: JDK-6378911 UTF-8 decoder handling of byte-order mark has changed". Bugs.java.com. Retrieved 14 October 2021.
8. Yergeau, Francois (November 2003). UTF-8, a transformation format of ISO 10646. IETF. doi:10.17487/RFC3629. RFC 3629. Retrieved 15 May 2014.
9. Gerhards, Rainer (March 2009). "MSG". The Syslog Protocol. IETF. sec. 6.4. doi:10.17487/RFC5424. RFC 5424.
10. Alf P. Steinbach (2011). "Unicode part 1: Windows console i/o approaches". Retrieved 24 March 2012. However, since the C++ source code was encoded as UTF-8 without BOM (as is usual in Linux), the Visual C++ compiler erroneously assumed that the source code was encoded as Windows ANSI.
11. "UTF-16LE". Encoding Standard. WHATWG.
12. "Decode". Encoding Standard. WHATWG.
13. Yergeau, François (8 November 2003). "RFC 3629 - UTF-8, a transformation format of ISO 10646". Tools.ietf.org. Retrieved 28 January 2017.
14. https://unicode.org/L2/L2021/21038-bom-guidance.pdf
15. "SDL Documentation".
16. Markus Scherer. "UTS #6: Compression Scheme for Unicode". Unicode.org. Retrieved 28 January 2017.

External links

- Unicode FAQ: UTF-8, UTF-16, UTF-32 & BOM
- The Unicode Standard, chapter 2.6 Encoding Schemes
- The Unicode Standard, chapter 2.13 Special Characters and Noncharacters, section Byte Order Mark (BOM)
- The Unicode Standard, chapter 16.8 Specials, section Byte Order Mark (BOM): U+FEFF

Retrieved from "https://en.wikipedia.org/w/index.php?title=Byte_order_mark&oldid=1106914685"


