Display problems caused by the UTF-8 BOM - W3C


When using UTF-8 encoded pages in some user agents, I get an extra line or unwanted characters at the top of my web page or included file. How do I remove them?

Answer

If you are dealing with a file encoded in UTF-8, your display problems may be caused by the presence of a UTF-8 signature (BOM) that the user agent doesn't recognize. This used to be a problem for static HTML files, but is no longer one in recent versions of major browsers. However, if you use PHP to generate your HTML, this was still an issue with PHP version 5.3.6.

The BOM is always at the beginning of the file, and so you would normally expect to see the display issues at the top of a page. However, you may also find blank lines appearing within the page if you include text from a separate file that begins with a UTF-8 signature.

This article will help you determine whether the UTF-8 signature is causing the problem. If there is no evidence of a UTF-8 signature at the beginning of the file, then you will have to look elsewhere for a solution.

What is a UTF-8 signature (BOM)?

Some applications insert a particular combination of bytes at the beginning of a file to indicate that the text contained in the file is Unicode. This combination of bytes is known as a signature or Byte Order Mark (BOM). Some applications, such as a text editor or a browser, will display the BOM as an extra line in the file; others will display unexpected characters, such as ï»¿.

The BOM is the Unicode code point U+FEFF, corresponding to the Unicode character 'ZERO WIDTH NO-BREAK SPACE' (ZWNBSP).

In UTF-16 and UTF-32 encodings, unless there is some alternative indicator, the BOM is essential to ensure correct interpretation of the file's contents. Each character in the file is represented by 2 or 4 bytes of data and the order in which these bytes are stored in the file is significant; the BOM indicates this order.
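To make the byte-order point concrete, here is a minimal Python sketch that encodes U+FEFF in each encoding and prints the resulting bytes. The names BOM, hex_bytes, bom_bytes and misread are illustrative only and are not part of the original article. The last lines show what a tool displays if it reads the UTF-8 signature as ISO 8859-1.

```python
BOM = "\ufeff"  # the code point U+FEFF


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


# Explicit-endian codecs encode U+FEFF like any other character,
# so each result is exactly the signature for that encoding.
bom_bytes = {
    name: BOM.encode(name)
    for name in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")
}

for name, data in bom_bytes.items():
    print(f"{name:<10} {hex_bytes(data)}")

# What a tool shows if it misreads the UTF-8 signature as ISO 8859-1.
misread = bom_bytes["utf-8"].decode("latin-1")
print(repr(misread))
```

The two UTF-16 results are the same two bytes in opposite order, which is how a reader can tell the byte order of the file from the signature alone.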
In the UTF-8 encoding, the presence of the BOM is not essential because, unlike the UTF-16 or UTF-32 encodings, there is no alternative byte order for a character. The BOM may still occur in UTF-8 encoded text, however, either as a by-product of an encoding conversion or because it was added by an editor.

Detecting the BOM

First, we need to check whether there is indeed a BOM at the beginning of the file.

You can try looking for a BOM in your content, but if your editor handles the UTF-8 signature correctly you probably won't be able to see it. An editor which does not handle the UTF-8 signature correctly displays the bytes that compose that signature according to its own character encoding setting. (With the Latin 1 (ISO 8859-1) character encoding, the signature displays as the characters ï»¿.) With a binary editor capable of displaying the hexadecimal byte values in the file, the UTF-8 signature displays as EF BB BF.

Alternatively, your editor may tell you in a status bar or a menu what encoding your file is in, including information about the presence or absence of the UTF-8 signature.

If not, some kind of script-based test may help. (Note: if you think the problem is caused by a file included by PHP or some other mechanism, run the test on the included file itself.)

Removing the BOM

If you have an editor which shows the characters that make up the UTF-8 signature, you may be able to delete them by hand. Chances are, however, that the BOM is there in the first place because you didn't see it.

Check whether your editor allows you to specify whether a UTF-8 signature is added or kept during a save. Such an editor provides a way of removing the signature by simply reading the file in and then saving it out again. For example, if Dreamweaver detects a BOM, the Save As dialogue box will have a check mark alongside the text "Include Unicode Signature (BOM)". Just uncheck the box and save.

One of the benefits of using a script is that you can remove the signature quickly, and from multiple files. In fact, the script could be run automatically as part of your process. If you use Perl, you could use a simple script created by Martin Dürst.
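As an illustration of the script-based approach, here is a minimal Python sketch that detects and removes a leading UTF-8 signature. It is not Martin Dürst's Perl script; the function names has_utf8_bom and strip_utf8_bom are hypothetical, and the sketch assumes each file is small enough to read into memory.

```python
import codecs
from pathlib import Path

UTF8_BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def has_utf8_bom(path) -> bool:
    """Return True if the file starts with the UTF-8 signature."""
    with open(path, "rb") as f:
        return f.read(len(UTF8_BOM)) == UTF8_BOM


def strip_utf8_bom(path) -> bool:
    """Remove a leading UTF-8 signature in place.

    Returns True if the file was changed, False if it had no signature.
    Only the first three bytes are examined; the rest of the file is
    written back unchanged.
    """
    p = Path(path)
    data = p.read_bytes()
    if not data.startswith(UTF8_BOM):
        return False
    p.write_bytes(data[len(UTF8_BOM):])
    return True
```

To process many files at once, the function can be called in a loop, for example over Path("site").rglob("*.html"), where "site" stands in for your own content directory. Working on raw bytes rather than decoded text means the sketch never re-encodes the file, so nothing except the signature is altered.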
Note: You should check the process impact of removing the signature. It may be that some part of your content development process relies on the use of the signature to indicate that a file is in UTF-8. Bear in mind also that pages with a high proportion of Latin characters may look correct superficially but that occasional characters outside the ASCII range (U+0000 to U+007F) may be incorrectly encoded.

By the way

You will find that some text editors such as Windows Notepad will automatically add a UTF-8 signature to any file you save as UTF-8.

A UTF-8 signature at the beginning of a CSS file can sometimes cause the initial rules in the file to fail on certain user agents.

In some browsers, the presence of a UTF-8 signature will cause the browser to interpret the text as UTF-8 regardless of any character encoding declarations to the contrary.

Further reading

Unicode FAQ about the Byte Order Mark
Setting encoding in web authoring applications
Unicode Bidirectional Algorithm basics
Authoring HTML & CSS
Characters
Handling the byte-order mark


