The byte-order mark (BOM) in HTML - W3C


Quick check

To check for byte-order marks in a page, use the W3C Internationalization Checker and look in the "Character encoding" area of the Information table. If the page has non-initial BOMs there will be a warning message lower down.

Question

What is the byte-order mark, and what do I need to know about it when creating HTML?

Answer

What is a byte-order mark?

At the beginning of a page that uses a Unicode character encoding you may find some bytes that represent the Unicode code point U+FEFF BYTE ORDER MARK (abbreviated as BOM).

The name BYTE ORDER MARK is an alias for the original character name ZERO WIDTH NO-BREAK SPACE (ZWNBSP). With the introduction of U+2060 WORD JOINER, there's no longer a need to ever use U+FEFF for its ZWNBSP effect, so from that point on, and with the availability of a formal alias, the name ZERO WIDTH NO-BREAK SPACE is no longer helpful, and we will use the alias here.

The BOM, when correctly used, is invisible.

Before UTF-8 was introduced in early 1993, the expected way of transferring Unicode text was as 16-bit code units, in an encoding called UCS-2, which was later extended to UTF-16. 16-bit code units can be expressed as bytes in two ways: the most significant byte first (big-endian) or the least significant byte first (little-endian). To communicate which byte order was in use, U+FEFF (the byte-order mark) was used at the start of the stream as a magic number that is not logically part of the text the stream represents.

The picture below shows the bytes used in a sequence of two-byte characters. Each 2-digit hexadecimal number represents a byte in the stream of text. You can see that the order of the two bytes that represent a single character is reversed for big-endian vs. little-endian storage. The byte-order mark indicates which order is used, so that applications can immediately decode the content.

[Figure: the bytes of a sequence of two-byte characters in big-endian and little-endian order]

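The byte-order difference described above can be reproduced in a few lines. What follows is a minimal Python sketch, not part of the original article: the helper names `hex_bytes` and `decode_utf16_with_bom` are hypothetical, and the sample text "AB" is an arbitrary choice.

```python
import codecs


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated 2-digit hexadecimal numbers."""
    return " ".join(f"{b:02X}" for b in data)


text = "AB"  # U+0041 U+0042, one 16-bit code unit each in UTF-16

print(hex_bytes(text.encode("utf-16-be")))  # 00 41 00 42 (big-endian)
print(hex_bytes(text.encode("utf-16-le")))  # 41 00 42 00 (little-endian)

# U+FEFF serialized in each byte order gives the two possible UTF-16 BOMs.
print(hex_bytes(codecs.BOM_UTF16_BE))  # FE FF
print(hex_bytes(codecs.BOM_UTF16_LE))  # FF FE


def decode_utf16_with_bom(data: bytes) -> str:
    """Choose the byte order from a leading BOM, then decode the rest."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    raise ValueError("no UTF-16 byte-order mark found")
```

The BOM itself is dropped from the decoded text, which matches the description of it as a magic number that is not logically part of the content.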
In the UTF-8 encoding, the presence of the BOM is not essential because, unlike the UTF-16 encodings, there is no alternative sequence of bytes in a character. However, the BOM may still occur in UTF-8 encoded text, either as a by-product of an encoding conversion or because it was added by an editor to flag the content as UTF-8. In this situation, the BOM is often called a UTF-8 signature.

What do I need to know about the BOM?

Most of the time you will not have to worry about the byte-order mark in UTF-8. You will find that some editors (such as Notepad on Windows) will always add a BOM when you save a file with the UTF-8 encoding; others will offer you a choice.

In HTML5, browsers are required to recognize the UTF-8 BOM and use it to detect the encoding of the page, and recent versions of major browsers handle the BOM as expected when used for UTF-8 encoded pages.

The UTF-8 BOM offers reliable encoding detection, since it is extremely short and stable, works in XML and HTML, and works whether your page is read over the network or not (unlike HTTP declarations). However, bear in mind that it is always a good idea to declare the encoding of your page using the meta element, in addition to the BOM, so that the encoding is apparent to people looking at the source text.

Also, there are a number of situations where the BOM, particularly because it is invisible, may cause a problem. See "Potential issues with the UTF-8 BOM" for more information about those.

If you use a UTF-16 encoding for your page (and we strongly recommend that you don't), there are some additional considerations.

Detecting the BOM

You can find out whether a page contains a BOM at the start or further down in the content by using the W3C Internationalization Checker. A BOM at the start of the page will be reported in the Information panel. A BOM that is included in the page lower down (typically due to content being added to the page from an external source) will be reported in the Detailed Report section.

You can try looking for a UTF-8 signature in your content in your editor, but if your editor handles the BOM correctly you probably won't be able to see it. With a binary editor capable of displaying the hexadecimal byte values in the file, the UTF-8 signature displays as EF BB BF.

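If no binary editor is at hand, the same check for the EF BB BF bytes can be scripted. The following is a rough Python sketch rather than a replacement for the checker; the function names `has_utf8_signature` and `bom_offsets` and the sample bytes are assumptions made for illustration.

```python
import codecs
from typing import List


def has_utf8_signature(data: bytes) -> bool:
    """True if the data starts with the UTF-8 BOM bytes EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)


def bom_offsets(data: bytes) -> List[int]:
    """Byte offsets of every EF BB BF sequence, initial or not."""
    offsets = []
    start = data.find(codecs.BOM_UTF8)
    while start != -1:
        offsets.append(start)
        start = data.find(codecs.BOM_UTF8, start + len(codecs.BOM_UTF8))
    return offsets


# Hypothetical page: a BOM at the start, plus one carried in by included content.
page = codecs.BOM_UTF8 + b"<p>a</p>" + codecs.BOM_UTF8 + b"<p>b</p>"
print(has_utf8_signature(page))  # True
print(bom_offsets(page))         # [0, 11]
```

Searching the raw bytes is sufficient for well-formed UTF-8, because EF can only appear there as the first byte of a three-byte sequence.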
If your editor or browser applies the wrong character encoding to a UTF-8 encoded file with a BOM, you are likely to see a sequence of stray characters at the start of the file. These are the bytes that compose the BOM, displayed as the characters those bytes represent in that encoding. With the Latin 1 (ISO 8859-1) character encoding, the signature displays as the characters ï»¿.

Alternatively, your editor may tell you in a status bar or a menu what encoding your file is in, including information about the presence or not of the UTF-8 signature. For example, if you use Save As in Dreamweaver and your file has a BOM at the start you will see a check mark in the box labeled 'Include Unicode Signature (BOM)'. You can also specify in your preferences whether new documents should use a BOM by default.

Potential issues with the UTF-8 BOM

What follows are some situations where the byte-order mark has been known to cause problems.

In general, these issues are fading away as people adopt newer versions of browsers and editing tools. It is worth knowing about them if your user base still uses older technology. However, this is not solely about legacy issues.

PHP includes

At the time this article was written, if you include some external file in a page using PHP and that file starts with a BOM, it may create blank lines.

This is because the BOM is not stripped before inclusion into the page, and acts like a character occupying a line of text. In the example that accompanies this article, a blank line containing the BOM appears above the first item of included text.

You should ensure that the included files do not start with a BOM.

You may also find that the BOM causes problems for an ordinary PHP page. When sending custom HTTP headers, the code to set the header must be called before output begins. A BOM at the start of the file causes the page to begin output before the header command is interpreted, and may lead to error messages and other problems in the displayed page.

Processing with program code

You need to be careful to take the BOM into account in scripts or program code that automatically process files that start with a BOM. For example, when pattern matching at the start of a file that begins with a BOM you need additional code to test for the presence of the BOM and ignore it if found.

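As an illustration of that additional code, here is a small Python sketch that matches a pattern at the start of a decoded page. The doctype pattern, the sample bytes and the function name `starts_with_doctype` are assumptions chosen for the example, not something the article prescribes.

```python
import re

BOM = "\ufeff"  # U+FEFF as it appears after decoding the bytes as UTF-8

# Hypothetical pattern: does the text begin with an HTML doctype?
DOCTYPE = re.compile(r"<!DOCTYPE\s+html", re.IGNORECASE)


def starts_with_doctype(text: str) -> bool:
    """Match at the start of the text, ignoring a leading BOM if present."""
    if text.startswith(BOM):
        text = text[len(BOM):]
    return DOCTYPE.match(text) is not None


raw = b"\xef\xbb\xbf<!DOCTYPE html><title>t</title>"
decoded = raw.decode("utf-8")  # the "utf-8" codec keeps the BOM as U+FEFF

print(DOCTYPE.match(decoded) is not None)  # False: the BOM blocks the match
print(starts_with_doctype(decoded))        # True
```

In Python specifically, decoding with the "utf-8-sig" codec instead of "utf-8" drops a leading BOM, which avoids the explicit test; other languages need their own equivalent.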
The UTF-8 encoding without a BOM has the property that a document which contains only characters from the US-ASCII range is encoded byte-for-byte the same way as the same document encoded using the US-ASCII encoding. Such a document can be processed and understood when encoded either as UTF-8 or as US-ASCII. Adding a BOM inserts additional non-ASCII bytes, so this is no longer true. If you have processes or scripts that assume that the content consists of US-ASCII characters only, you will need to avoid the BOM.

HTTP precedence

Changes introduced with HTML5 mean that the byte-order mark overrides any encoding declaration in the HTTP header when detecting the encoding of an HTML page. This can be very useful when the author of the page cannot control the character encoding setting of the server, or is unaware of its effect, and the server is declaring pages to be in an encoding other than UTF-8. If the BOM has a higher precedence than the HTTP headers, the page should be correctly identified as UTF-8.

At the time of writing, not all browsers do this, so you should not rely on all readers of your page benefitting from this just yet.

Previous versions of Internet Explorer gave the BOM precedence over HTTP, but IE10 and IE11 give a higher precedence to HTTP. It is hoped that the next version of Internet Explorer will revert to the previous behaviour, which will then be in line with the other major browsers.

In browsers where the HTTP header still overrides the byte-order mark and the server is declaring pages to have a non-Unicode character encoding, you are likely to find unexpected characters at the start of the page (such as ï»¿ in a page labelled in HTTP as ISO 8859-1) as well as problems displaying non-ASCII characters on the page.

Other issues

If you use applications or scripts in the back end of your site you should check that they are also able to recognize and handle the BOM.

We strongly recommend that you don't change the encoding of a UTF-8 file to a non-Unicode encoding, but if, for some exceptional reason, you do, you must ensure that the BOM is removed. If you don't, either the browser will continue to treat your content as UTF-8, or you will see strange characters at the beginning of the page.

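Where a leading BOM does have to be removed, a short script can do it for one file or for a whole directory tree. This is a minimal Python sketch under stated assumptions: the function names `strip_utf8_bom` and `strip_tree` and the `*.html` default pattern are hypothetical, and the files are rewritten in place, so try it on a copy first.

```python
import codecs
from pathlib import Path
from typing import List


def strip_utf8_bom(path: Path) -> bool:
    """Remove a leading UTF-8 BOM from the file; return True if one was removed."""
    data = path.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    # Rewrite the file in place without the first three bytes (EF BB BF).
    path.write_bytes(data[len(codecs.BOM_UTF8):])
    return True


def strip_tree(root: Path, pattern: str = "*.html") -> List[Path]:
    """Strip the BOM from every matching file under root; return the changed files."""
    return [p for p in sorted(root.rglob(pattern)) if strip_utf8_bom(p)]
```

Only a BOM at the very start of each file is removed; the rest of the bytes are left untouched, so the script does not convert the file to another encoding.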
Removing the BOM

If you need to remove the BOM, check whether your editor allows you to specify whether a UTF-8 signature is added or kept while you save the file. Such an editor provides a way of removing the signature by simply reading the file in, then saving it out again. For example, in editors such as Notepad++ on Windows and TextWrangler on the Mac, it is possible to select the encoding from a list while using the Save As function. The list has options to save as UTF-8 with or without the BOM. Just choose the option without the BOM and save.

One of the benefits of using a script is that you can remove the signature quickly, and from multiple files. In fact the script could be run automatically as part of your process. If you use Perl, you could use a simple script created by Martin Dürst.

Note: You should check the process impact of removing the signature. It may be that some part of your content development process relies on the use of the signature to indicate that a file is in UTF-8. Bear in mind also that pages with a high proportion of Latin characters may look correct superficially but that occasional characters outside the ASCII range (U+0000 to U+007F) may be incorrectly encoded.

Additional information

Here are some additional notes for those who are encoding their HTML pages using UTF-16. Note that, for HTML, it's recommended that you use UTF-8 and that you avoid UTF-16. So for most people this section will be academic.

According to RFC 2781 and the Unicode Standard, if you declare the character encoding of your page using HTTP as either "UTF-16LE" or "UTF-16BE" then you should not use a byte-order mark at the beginning of the page. Only if the page is labelled in HTTP using the IANA charset name "UTF-16" is a byte-order mark appropriate.

Note that this is solely about the labeling of the content. Of course, the actual sequence of bytes for the text is the same, whether you label content as UTF-16 and add a BOM, or whether you label it as UTF-16LE or UTF-16BE.

The HTML5 specification currently disallows the use of any other, text-based in-document encoding declaration for pages using the UTF-16 encoding. In effect, this means that the BOM is, itself, the declaration that you have to add.

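The labeling rule for UTF-16 can be made concrete with a short Python sketch. The function `encode_for_label` is hypothetical, and the choice of big-endian order for the generic "UTF-16" label is an assumption made only for this example. The last lines show how Python's own codecs treat the BOM, which is not necessarily how a given browser behaves.

```python
import codecs


def encode_for_label(text: str, label: str) -> bytes:
    """Serialize text for a given charset label, following the labeling rule."""
    label = label.lower()
    if label == "utf-16":
        # Generic label: a BOM is needed to announce the byte order.
        return codecs.BOM_UTF16_BE + text.encode("utf-16-be")
    if label == "utf-16be":
        return text.encode("utf-16-be")  # order is in the label, so no BOM
    if label == "utf-16le":
        return text.encode("utf-16-le")  # order is in the label, so no BOM
    raise ValueError(f"unsupported label: {label}")


generic = encode_for_label("hi", "UTF-16")
explicit = encode_for_label("hi", "UTF-16BE")

# Apart from the two BOM bytes, the byte sequences are identical.
print(generic[2:] == explicit)  # True

# Python's "utf-16" codec consumes the BOM; "utf-16-be" keeps it as U+FEFF.
print(generic.decode("utf-16") == "hi")             # True
print(generic.decode("utf-16-be") == "\ufeff" + "hi")  # True
```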
The byte-order mark is also used for text labeled as UTF-32, and should not be used for text labeled as UTF-32BE or UTF-32LE. The use of UTF-32 for HTML content, however, is strongly discouraged and some implementations have removed support for it, so we haven't even mentioned it until now.

Further reading

Getting started? Introducing Character Sets and Encodings
Tutorial, Handling character encodings in HTML and CSS
Related links, Authoring HTML & CSS
    Characters
    Declaring the character encoding for HTML


