The byte-order mark (BOM) in HTML - W3C
Question

What is the byte-order mark, and what do I need to know about it when creating HTML?

Answer

What is a byte-order mark?

At the beginning of a page that uses a Unicode character encoding you may find some bytes that represent the Unicode code point U+FEFF BYTE ORDER MARK (abbreviated as BOM).

The name BYTE ORDER MARK is an alias for the original character name ZERO WIDTH NO-BREAK SPACE (ZWNBSP). With the introduction of U+2060 WORD JOINER, there's no longer a need to ever use U+FEFF for its ZWNBSP effect, so from that point on, and with the availability of a formal alias, the name ZERO WIDTH NO-BREAK SPACE is no longer helpful, and we will use the alias here.

The BOM, when correctly used, is invisible.

Before UTF-8 was introduced in early 1993, the expected way of transferring Unicode text was as 16-bit code units, using an encoding called UCS-2 which was later extended to UTF-16. 16-bit code units can be expressed as bytes in two ways: the most significant byte first (big-endian) or the least significant byte first (little-endian). To communicate which byte order was in use, U+FEFF (the byte-order mark) was used at the start of the stream as a magic number that is not logically part of the text the stream represents.

The original article illustrates this with a picture of the bytes used in a sequence of two-byte characters. Each 2-digit hexadecimal number represents a byte in the stream of text. You can see that the order of the two bytes that represent a single character is reversed for big-endian vs. little-endian storage. The byte-order mark indicates which order is used, so that applications can immediately decode the content.
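The byte reversal described above can be reproduced in a few lines. This is a minimal sketch in Python, not part of the original article; the sample text "AB" and the helper name `hex_bytes` are assumptions chosen for illustration.

```python
# Encode U+FEFF followed by two sample characters in both UTF-16 byte orders.
# The sample text "AB" is an assumption made for this illustration.
text = "\ufeffAB"

big_endian = text.encode("utf-16-be")     # most significant byte first
little_endian = text.encode("utf-16-le")  # least significant byte first


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated 2-digit hexadecimal numbers."""
    return " ".join(f"{b:02X}" for b in data)


print(hex_bytes(big_endian))     # FE FF 00 41 00 42
print(hex_bytes(little_endian))  # FF FE 41 00 42 00
```

The first two bytes are the byte-order mark itself: FE FF signals big-endian storage and FF FE signals little-endian storage.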
In the UTF-8 encoding, the presence of the BOM is not essential because, unlike the UTF-16 encodings, there is no alternative sequence of bytes in a character. However, the BOM may still occur in UTF-8 encoded text, either as a by-product of an encoding conversion or because it was added by an editor to flag the content as UTF-8. In this situation, the BOM is often called a UTF-8 signature.

What do I need to know about the BOM?

Most of the time you will not have to worry about the byte-order mark in UTF-8. You will find that some editors (such as Notepad on Windows) will always add a BOM when you save a file with the UTF-8 encoding; others will offer you a choice.

In HTML5, browsers are required to recognize the UTF-8 BOM and use it to detect the encoding of the page, and recent versions of major browsers handle the BOM as expected when used for UTF-8 encoded pages.

The UTF-8 BOM offers reliable encoding detection, since it is extremely short and stable, works in XML and HTML, and works whether your page is read over the network or not (unlike HTTP declarations). However, bear in mind that it is always a good idea to declare the encoding of your page using the meta element, in addition to the BOM, so that the encoding is apparent to people looking at the source text.

Also, there are a number of situations where the BOM, particularly because it is invisible, may cause a problem. See the section below for more information about those.

If you use a UTF-16 encoding for your page (and we strongly recommend that you don't), there are some additional considerations.

Detecting the BOM

You can find out whether a page contains a BOM at the start or further down in the content by using the W3C Internationalization Checker. A BOM at the start of the page will be reported in the Information panel. A BOM that is included in the page lower down (typically due to content being added to the page from an external source) will be reported in the Detailed Report section.

You can try looking for a UTF-8 signature in your content in your editor, but if your editor handles the BOM correctly you probably won't be able to see it. With a binary editor capable of displaying the hexadecimal byte values in the file, the UTF-8 signature displays as EF BB BF.
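If no checker or binary editor is to hand, the same test can be scripted. The following is a rough sketch in Python rather than anything the article prescribes; the function names and the sample bytes are hypothetical. It checks for the EF BB BF signature at the start of some bytes and counts any further occurrences lower down.

```python
import codecs


def has_utf8_signature(data: bytes) -> bool:
    """True if the bytes start with the UTF-8 signature EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)


def count_non_initial_signatures(data: bytes) -> int:
    """Count EF BB BF sequences that occur after the start of the data."""
    start = len(codecs.BOM_UTF8) if has_utf8_signature(data) else 0
    return data.count(codecs.BOM_UTF8, start)


# Hypothetical page: a BOM at the start, plus one carried in by included content.
sample = codecs.BOM_UTF8 + b"<p>one</p>" + codecs.BOM_UTF8 + b"<p>two</p>"

print(codecs.BOM_UTF8.hex(" ").upper())      # EF BB BF
print(has_utf8_signature(sample))            # True
print(count_non_initial_signatures(sample))  # 1
```

To inspect a real file, the bytes could be read with `open(path, "rb").read()`, where `path` is whatever file you want to check.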
If your editor or browser applies the wrong character encoding to a UTF-8 encoded file with a BOM, you are likely to see a sequence of unexpected characters at the start of the file. These are the bytes that compose the BOM, shown as the characters those bytes represent in that encoding. With the Latin 1 (ISO 8859-1) character encoding, the signature displays as the characters ï»¿.

Alternatively, your editor may tell you in a status bar or a menu what encoding your file is in, including information about the presence or not of the UTF-8 signature. For example, if you use Save As in Dreamweaver and your file has a BOM at the start you will see a check mark in the box labeled 'Include Unicode Signature (BOM)'. You can also specify in your preferences whether new documents should use a BOM by default.

Potential issues with the UTF-8 BOM

What follows are some situations where the byte-order mark has been known to cause problems.

In general, these issues are fading away as people adopt newer versions of browsers and editing tools. It is worth knowing about them if your user base still uses older technology. However, this is not solely about legacy issues.

PHP includes

At the time this article was written, if you include some external file in a page using PHP and that file starts with a BOM, it may create blank lines.

This is because the BOM is not stripped before inclusion into the page, and acts like a character occupying a line of text. In the article's example, a blank line containing the BOM appears above the first item of included text.

You should ensure that the included files do not start with a BOM.

You may also find that the BOM causes problems for an ordinary PHP page. When sending custom HTTP headers, the code to set the header must be called before output begins. A BOM at the start of the file causes the page to begin output before the header command is interpreted, and may lead to error messages and other problems in the displayed page.

Processing with program code

You need to be careful to take the BOM into account in scripts or program code that automatically process files that start with a BOM. For example, when pattern matching at the start of a file that begins with a BOM you need additional code to test for the presence of the BOM and ignore it if found.
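The extra test mentioned above might look something like this. It is a sketch in Python under stated assumptions: the helper `strip_utf8_signature`, the `DOCTYPE` pattern, and the sample bytes are all hypothetical and not taken from the article.

```python
import codecs
import re


def strip_utf8_signature(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 signature, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical pattern that must match at the very start of the file.
DOCTYPE = re.compile(rb"<!DOCTYPE html>", re.IGNORECASE)

raw = codecs.BOM_UTF8 + b"<!DOCTYPE html><html></html>"

print(DOCTYPE.match(raw) is not None)                        # False
print(DOCTYPE.match(strip_utf8_signature(raw)) is not None)  # True
```

When working with decoded text rather than bytes, Python's `utf-8-sig` codec has the same effect: `raw.decode("utf-8-sig")` drops a leading signature if one is present and otherwise behaves like `utf-8`.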
The UTF-8 encoding without a BOM has the property that a document which contains only characters from the US-ASCII range is encoded byte-for-byte the same way as the same document encoded using the US-ASCII encoding. Such a document can be processed and understood when encoded either as UTF-8 or as US-ASCII. Adding a BOM inserts additional non-ASCII bytes, so this is no longer true. If you have processes or scripts that assume that the content consists of US-ASCII characters only, you will need to avoid the BOM.

HTTP precedence

Changes introduced with HTML5 mean that the byte-order mark overrides any encoding declaration in the HTTP header when detecting the encoding of an HTML page. This can be very useful when the author of the page cannot control the character encoding setting of the server, or is unaware of its effect, and the server is declaring pages to be in an encoding other than UTF-8. If the BOM has a higher precedence than the HTTP headers, the page should be correctly identified as UTF-8.

At the time of writing, not all browsers do this, so you should not rely on all readers of your page benefitting from this just yet.

Previous versions of Internet Explorer gave the BOM precedence over HTTP, but IE10 and IE11 give a higher precedence to HTTP. It is hoped that the next version of Internet Explorer will revert to the previous behaviour, which will then be in line with the other major browsers.

In browsers where the HTTP header still overrides the byte-order mark and the server is declaring pages to have a non-Unicode character encoding, you are likely to find unexpected characters at the start of the page (such as ï»¿ in a page labelled in HTTP as ISO 8859-1) as well as problems displaying non-ASCII characters on the page.

Other issues

If you use applications or scripts in the back end of your site you should check that they are also able to recognize and handle the BOM.

We strongly recommend that you don't change the encoding of a file from UTF-8 to a non-Unicode encoding, but if, for some exceptional reason, you do, you must ensure that the BOM is removed. If you don't, either the browser will continue to treat your content as UTF-8, or you will see strange characters at the beginning of the page.
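The strange characters can be reproduced outside a browser. In this small Python sketch (the sample word "café" is an assumption for illustration), the same bytes are decoded once as UTF-8 and once as ISO 8859-1, which is roughly what happens when a page with a UTF-8 signature is treated as Latin 1.

```python
import codecs

# Hypothetical page content: a UTF-8 signature followed by UTF-8 encoded text.
page = codecs.BOM_UTF8 + "café".encode("utf-8")

as_utf8 = page.decode("utf-8-sig")     # signature recognized and dropped
as_latin1 = page.decode("iso-8859-1")  # every byte shown as one Latin 1 character

print(as_utf8)    # café
print(as_latin1)  # ï»¿cafÃ©
```

The three signature bytes turn into ï»¿ at the start, and the non-ASCII character é is garbled as well, matching the two symptoms described for pages labelled with a non-Unicode encoding.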
Removing the BOM

If you need to remove the BOM, check whether your editor allows you to specify whether a UTF-8 signature is added or kept while you save the file. Such an editor provides a way of removing the signature by simply reading the file in, then saving it out again. For example, in editors such as Notepad++ on Windows and TextWrangler on the Mac, it is possible to select the encoding from a list while using the Save As function. The list has options to save as UTF-8 with or without the BOM. Just choose the option without the BOM and save.

You can also remove the signature with a script. One of the benefits of using a script is that you can remove the signature quickly, and from multiple files. In fact, the script could be run automatically as part of your process. If you use Perl, you could use a simple script created by Martin Dürst.

Note: You should check the process impact of removing the signature. It may be that some part of your content development process relies on the use of the signature to indicate that a file is in UTF-8. Bear in mind also that pages with a high proportion of Latin characters may look correct superficially but that occasional characters outside the ASCII range (U+0000 to U+007F) may be incorrectly encoded.

Additional information

Here are some additional notes for those who are encoding their HTML pages using UTF-16. Note that, for HTML, it's recommended that you use UTF-8 and that you avoid UTF-16. So for most people this section will be academic.

According to RFC 2781 and the Unicode Standard, if you declare the character encoding of your page using HTTP as either "UTF-16LE" or "UTF-16BE" then you should not use a byte-order mark at the beginning of the page. Only if the page is labelled in HTTP using the IANA charset name "UTF-16" is a byte-order mark appropriate.

Note that this is solely about the labeling of the content. Of course, the actual sequence of bytes is the same, whether you label content as UTF-16 and add a BOM, or whether you label it as UTF-16LE or UTF-16BE.

The HTML5 specification currently disallows the use of any other, text-based in-document encoding declaration for pages using the UTF-16 encoding. In effect, this means that the BOM is, itself, the declaration that you have to add.
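The labelling distinction can be seen in how a decoder treats the same bytes under different labels. The sketch below uses Python's built-in codecs as a stand-in; browsers may behave differently, and the sample text "A" is an assumption for illustration.

```python
import codecs

text = "A"  # sample text, chosen for illustration

body_le = text.encode("utf-16-le")        # bytes 41 00, no byte-order mark
with_bom = codecs.BOM_UTF16_LE + body_le  # bytes FF FE 41 00

# Labelled "UTF-16": the BOM selects the byte order and is not part of the text.
print(with_bom.decode("utf-16") == "A")           # True

# Labelled "UTF-16LE" with no BOM: the label alone gives the byte order.
print(body_le.decode("utf-16-le") == "A")         # True

# Labelled "UTF-16LE" but with a BOM: U+FEFF is kept as a character in the text.
print(with_bom.decode("utf-16-le") == "\ufeffA")  # True
```

The last case shows one practical reason for the rule: with an explicit byte-order label, a leading U+FEFF is no longer consumed as a signature and ends up in the content.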
The byte-order mark is also used for text labeled as UTF-32, and should not be used for text labeled as UTF-32BE or UTF-32LE. The use of UTF-32 for HTML content, however, is strongly discouraged and some implementations have removed support for it, so we haven't even mentioned it until now.

Further reading

Getting started? Introducing Character Sets and Encodings
Tutorial: Handling character encodings in HTML and CSS
Related links: Authoring HTML & CSS; Characters; Declaring the character encoding for HTML