What's the difference between UTF-8 and UTF-8 with BOM?
What's different between UTF-8 and UTF-8 with BOM? Which is better?
unicode utf-8 character-encoding byte-order-mark
edited Sep 9 at 16:08
Henke
asked Feb 8, 2010 at 18:26
simple
UTF-8 can be auto-detected better by contents than by BOM. The method is simple: try to read the file (or a string) as UTF-8 and if that succeeds, assume that the data is UTF-8. Otherwise assume that it is CP1252 (or some other 8 bit encoding). Any non-UTF-8 eight bit encoding will almost certainly contain sequences that are not permitted by UTF-8. Pure ASCII (7 bit) gets interpreted as UTF-8, but the result is correct that way too.
– Tronic
Feb 11, 2010 at 13:25
Scanning large files for UTF-8 content takes time. A BOM makes this process much faster. In practice you often need to do both. The culprit nowadays is that still a lot of text content isn't Unicode, and I still bump into tools that say they do Unicode (for instance UTF-8) but emit their content in a different codepage.
– Jeroen Wiert Pluimers
Dec 18, 2013 at 7:41
@Tronic I don't really think that "better" fits in this case. It depends on the environment. If you are sure that all UTF-8 files are marked with a BOM then checking the BOM is the "better" way, because it is faster and more reliable.
– mg30rg
Jul 31, 2014 at 9:31
UTF-8 does not have a BOM. When you put a U+FEFF code point at the start of a UTF-8 file, special care must be taken to deal with it. This is just one of those Microsoft naming lies, like calling an encoding "Unicode" when there is no such thing.
– tchrist
Oct 1, 2014 at 22:37
"The modern Mainframe (and AIX) is little endian UTF-8 aware" UTF-8 doesn't have an endianness! There is no shuffling of bytes around to put pairs or groups of four into the right "order" for a particular system! To detect a UTF-8 byte sequence it may be useful to note that the first byte of a multi-byte sequence "codepoint" (the bytes that are NOT "plain" ASCII ones) has the MS bit set and one to three more successively less significant bits set, followed by a reset bit. The number of those additional set bits is one less than the number of bytes in that codepoint, and those bytes will ALL have the MSB set...
– SlySven
Aug 19, 2016 at 17:38
22 Answers
898
The UTF-8 BOM is a sequence of bytes at the start of a text stream (0xEF, 0xBB, 0xBF) that allows the reader to more reliably guess a file as being encoded in UTF-8.
Normally, the BOM is used to signal the endianness of an encoding, but since endianness is irrelevant to UTF-8, the BOM is unnecessary.
According to the Unicode standard, the BOM for UTF-8 files is not recommended:
2.6 Encoding Schemes
... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the “Byte Order Mark” subsection in Section 16.8, Specials, for more information.
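To make the byte sequence concrete, here is a minimal Python sketch (not part of the original answer). It uses the standard library's `codecs.BOM_UTF8` constant and the `utf-8-sig` codec; the helper name `has_utf8_bom` and the sample strings are made up for illustration.

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the bytes EF BB BF (hypothetical helper)."""
    return data.startswith(codecs.BOM_UTF8)

with_bom = codecs.BOM_UTF8 + "héllo".encode("utf-8")
without_bom = "héllo".encode("utf-8")

# The "utf-8-sig" codec drops a leading BOM if there is one
# and behaves like plain UTF-8 if there is none.
text_a = with_bom.decode("utf-8-sig")
text_b = without_bom.decode("utf-8-sig")

# The plain "utf-8" codec keeps the BOM as a leading U+FEFF character.
text_c = with_bom.decode("utf-8")

print(has_utf8_bom(with_bom), has_utf8_bom(without_bom))
print(repr(text_a), repr(text_b), repr(text_c))
```

A reader that may receive either variant can decode with `utf-8-sig` and not care which one it got.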
edited Apr 16, 2020 at 22:43
Peter Mortensen
answered Feb 8, 2010 at 18:33
Martin Cote
It might not be recommended but from my experience in Hebrew conversions the BOM is sometimes crucial for UTF-8 recognition in Excel, and may make the difference between gibberish and Hebrew
– Matanya
Dec 7, 2012 at 8:13
It might not be recommended but it did wonders to my powershell script when trying to output "æøå"
– Marius
Nov 12, 2013 at 9:22
Regardless of it not being recommended by the standard, it's allowed, and I greatly prefer having something to act as a UTF-8 signature rather than the alternatives of assuming or guessing. Unicode-compliant software should/must be able to deal with its presence, so I personally encourage its use.
– martineau
Dec 31, 2013 at 20:41
@bames53: Yes, in an ideal world storing the encoding of text files as file system metadata would be a better way to preserve it. But most of us living in the real world can't change the file system of the OS(s) our programs get run on -- so using the Unicode standard's platform-independent BOM signature seems like the best and most practical alternative IMHO.
– martineau
Jan 16, 2014 at 19:37
@martineau Just yesterday I ran into a file with a UTF-8 BOM that wasn't UTF-8 (it was CP936). What's unfortunate is that the ones responsible for the immense amount of pain caused by the UTF-8 BOM are largely oblivious to it.
– bames53
Jan 16, 2014 at 23:21
274
The other excellent answers already answered that:
There is no official difference between UTF-8 and BOM-ed UTF-8
A BOM-ed UTF-8 string will start with the three following bytes: EF BB BF
Those bytes, if present, must be ignored when extracting the string from the file/stream.
But, as additional information to this, the BOM for UTF-8 could be a good way to "smell" if a string was encoded in UTF-8... Or it could be a legitimate string in any other encoding...
For example, the data [EF BB BF 41 42 43] could either be:
The legitimate ISO-8859-1 string "ï»¿ABC"
The legitimate UTF-8 string "ABC"
So while it can be cool to recognize the encoding of a file's content by looking at the first bytes, you should not rely on this, as shown by the example above.
Encodings should be known, not divined.
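The ambiguity in that example can be reproduced directly. The following Python sketch (added for illustration, not from the original answer) decodes the same six bytes under both assumptions; the variable names are arbitrary.

```python
# The six bytes from the example above.
data = bytes([0xEF, 0xBB, 0xBF, 0x41, 0x42, 0x43])

# Read as ISO-8859-1, every byte is one character, so we get six characters.
as_latin1 = data.decode("iso-8859-1")

# Read as UTF-8 with the signature stripped, we get three characters.
as_utf8 = data.decode("utf-8-sig")

print(repr(as_latin1), len(as_latin1))
print(repr(as_utf8), len(as_utf8))
```

Both decodes succeed without error, which is the point: the bytes alone do not say which reading was intended.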
edited May 6, 2015 at 19:25
Peter Mortensen
answered Feb 8, 2010 at 18:42
paercebal
@Alcott: You understood correctly. The string [EF BB BF 41 42 43] is just a bunch of bytes. You need external information to choose how to interpret it. If you believe those bytes were encoded using ISO-8859-1, then the string is "ï»¿ABC". If you believe those bytes were encoded using UTF-8, then it is "ABC". If you don't know, then you must try to find out. The BOM could be a clue. The absence of invalid characters when decoded as UTF-8 could be another... In the end, unless you can memorize/find the encoding somehow, an array of bytes is just an array of bytes.
– paercebal
Sep 11, 2011 at 18:57
@paercebal While "ï»¿" is valid latin-1, it is very unlikely that a text file begins with that combination. The same holds for the ucs2-le/be markers ÿþ and þÿ. Also you can never know.
– user877329
Jun 21, 2013 at 16:48
@deceze It is probably linguistically invalid: First ï (which is ok), then some quotation mark without space in-between (not ok). ¿ indicates it is Spanish but ï is not used in Spanish. Conclusion: It is not latin-1 with a certainty well above the certainty without it.
– user877329
Nov 5, 2013 at 7:20
@user Sure, it doesn't necessarily make sense. But if your system relies on guessing, that's where uncertainties come in. Some malicious user submits text starting with these 3 letters on purpose, and your system suddenly assumes it's looking at UTF-8 with a BOM, treats the text as UTF-8 where it should use Latin-1, and some Unicode injection takes place. Just a hypothetical example, but certainly possible. You can't judge a text encoding by its content, period.
– deceze
Nov 5, 2013 at 7:44
"Encodings should be known, not divined." The heart and soul of the problem. +1, good sir. In other words: either standardize your content and say, "We're always using this encoding. Period. Write it that way. Read it that way," or develop an extended format that allows for storing the encoding as metadata. (The latter probably needs some "bootstrap standard encoding," too. Like saying "The part that tells you the encoding is always ASCII.")
– jpmc26
Jul 23, 2015 at 21:25
147
There are at least three problems with putting a BOM in UTF-8 encoded files.
Files that hold no text are no longer empty because they always contain the BOM.
Files that hold text within the ASCII subset of UTF-8 are no longer themselves ASCII because the BOM is not ASCII, which makes some existing tools break down, and it can be impossible for users to replace such legacy tools.
It is not possible to concatenate several files together because each file now has a BOM at the beginning.
And, as others have mentioned, it is neither sufficient nor necessary to have a BOM to detect that something is UTF-8:
It is not sufficient because an arbitrary byte sequence can happen to start with the exact sequence that constitutes the BOM.
It is not necessary because you can just read the bytes as if they were UTF-8; if that succeeds, it is, by definition, valid UTF-8.
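The first and third problems are easy to see at the byte level. This is a small Python sketch added for illustration, assuming a byte-level join like the one `cat` performs; the file contents are invented.

```python
import codecs

# Two hypothetical files, each saved as "UTF-8 with BOM".
part1 = codecs.BOM_UTF8 + b"first\n"
part2 = codecs.BOM_UTF8 + b"second\n"

# A byte-level concatenation keeps both signatures.
joined = part1 + part2

# Only the leading BOM is treated as a signature; the second one
# survives as a U+FEFF character in the middle of the text.
text = joined.decode("utf-8-sig")
print(repr(text))

# A file with "no text" is still three bytes long once it carries a BOM.
empty_with_bom = codecs.BOM_UTF8
print(len(empty_with_bom), repr(empty_with_bom.decode("utf-8-sig")))
```

Whether the stray U+FEFF in the middle matters depends on the consumer; as the comments below argue, some software ignores it and some does not.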
edited Sep 9 at 16:16
Henke
answered Nov 15, 2012 at 13:28
jpsecher
Re point 1 "Files that hold no text are no longer empty because they always contain the BOM", this (1) conflates the OS filesystem level with the interpreted contents level, plus it (2) incorrectly assumes that using BOM one must put a BOM also in every otherwise empty file. The practical solution to (1) is to not do (2). Essentially the complaint reduces to "it's possible to impractically put a BOM in an otherwise empty file, thus preventing the most easy detection of logically empty file (by checking file size)". Still good software should be able to deal with it, since it has a purpose.
– Cheers and hth. - Alf
Jun 18, 2014 at 14:22
Re point 2, "Files that hold ASCII text is no longer themselves ASCII", this conflates ASCII with UTF-8. An UTF-8 file that holds ASCII text is not ASCII, it's UTF-8. Similarly, an UTF-16 file that holds ASCII text is not ASCII, it's UTF-16. And so on. ASCII is a 7-bit single byte code. UTF-8 is an 8-bit variable length extension of ASCII. If "tools break down" due to >127 values then they're just not fit for an 8-bit world. One simple practical solution is to use only ASCII files with tools that break down for non-ASCII byte values. A probably better solution is to ditch those ungood tools.
– Cheers and hth. - Alf
Jun 18, 2014 at 14:27
Re point 3, "It is not possible to concatenate several files together because each file now has a BOM at the beginning" is just wrong. I have no problem concatenating UTF-8 files with BOM, so it's clearly possible. I think maybe you meant the Unix-land cat won't give you a clean result, a result that has BOM only at the start. If you meant that, then that's because cat works at the byte level, not at the interpreted contents level, and in similar fashion cat can't deal with photographs, say. Still it doesn't do much harm. That's because the BOM encodes a zero-width non-breaking space.
– Cheers and hth. - Alf
Jun 18, 2014 at 14:34
@Cheers and hth. - Alf This answer is correct. You are merely pointing out Microsoft bugs.
– tchrist
Oct 1, 2014 at 22:34
@brighty: The situation isn't improved any by adding a bom though.
– Deduplicator
Sep 20, 2015 at 4:29
120
Here are examples of the BOM usage that actually cause real problems and yet many people don't know about it.
BOM breaks scripts
Shell scripts, Perl scripts, Python scripts, Ruby scripts, Node.js scripts or any other executable that needs to be run by an interpreter - all start with a shebang line which looks like one of those:
#!/bin/sh
#!/usr/bin/python
#!/usr/local/bin/perl
#!/usr/bin/env node
It tells the system which interpreter needs to be run when invoking such a script. If the script is encoded in UTF-8, one may be tempted to include a BOM at the beginning. But actually the "#!" characters are not just characters. They are in fact a magic number that happens to be composed out of two ASCII characters. If you put something (like a BOM) before those characters, then the file will look like it had a different magic number and that can lead to problems.
See Wikipedia, article: Shebang, section: Magic number:
The shebang characters are represented by the same two bytes in
extended ASCII encodings, including UTF-8, which is commonly used for
scripts and other text files on current Unix-like systems. However,
UTF-8 files may begin with the optional byte order mark (BOM); if the
"exec" function specifically detects the bytes 0x23 and 0x21, then the
presence of the BOM (0xEF 0xBB 0xBF) before the shebang will prevent
the script interpreter from being executed. Some authorities recommend
against using the byte order mark in POSIX (Unix-like) scripts,[14]
for this reason and for wider interoperability and philosophical
concerns. Additionally, a byte order mark is not necessary in UTF-8,
as that encoding does not have endianness issues; it serves only to
identify the encoding as UTF-8. [emphasis added]
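A quick way to spot this problem before deploying a script is to look at its first bytes. The sketch below is an illustration only: `shebang_status` and `strip_bom` are hypothetical helper names, and the check simply mirrors the "first two bytes must be 0x23 0x21" rule described in the quote.

```python
import codecs

def shebang_status(data: bytes) -> str:
    """Classify the start of a script (hypothetical helper)."""
    if data.startswith(b"#!"):
        return "ok"
    if data.startswith(codecs.BOM_UTF8 + b"#!"):
        # The magic number 0x23 0x21 is no longer at offset 0.
        return "bom-before-shebang"
    return "no-shebang"

def strip_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

script = codecs.BOM_UTF8 + b"#!/bin/sh\necho hello\n"
print(shebang_status(script))
print(shebang_status(strip_bom(script)))
```

In practice one would read the bytes with `open(path, "rb")` and write the stripped result back; that part is left out to keep the sketch short.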
BOM is illegal in JSON
See RFC 7159, Section 8.1:
Implementations MUST NOT add a byte order mark to the beginning of a JSON text.
BOM is redundant in JSON
Not only is it illegal in JSON, it is also not needed to determine the character encoding because there are more reliable ways to unambiguously determine both the character encoding and endianness used in any JSON stream (see this answer for details).
BOM breaks JSON parsers
Not only is it illegal in JSON and not needed, it actually breaks all software that determines the encoding using the method presented in RFC 4627:
Determining the encoding and endianness of JSON, examining the first four bytes for the NUL byte:
00 00 00 xx - UTF-32BE
00 xx 00 xx - UTF-16BE
xx 00 00 00 - UTF-32LE
xx 00 xx 00 - UTF-16LE
xx xx xx xx - UTF-8
Now, if the file starts with BOM it will look like this:
00 00 FE FF - UTF-32BE
FE FF 00 xx - UTF-16BE
FF FE 00 00 - UTF-32LE
FF FE xx 00 - UTF-16LE
EF BB BF xx - UTF-8
Note that:
UTF-32BE doesn't start with three NULs, so it won't be recognized
UTF-32LE the first byte is not followed by three NULs, so it won't be recognized
UTF-16BE has only one NUL in the first four bytes, so it won't be recognized
UTF-16LE has only one NUL in the first four bytes, so it won't be recognized
Depending on the implementation, all of those may be interpreted incorrectly as UTF-8 and then misinterpreted or rejected as invalid UTF-8, or not recognized at all.
Additionally, if the implementation tests for valid JSON as I recommend, it will reject even the input that is indeed encoded as UTF-8, because it doesn't start with an ASCII character < 128 as it should according to the RFC.
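The NUL-pattern method can be written down in a few lines. This Python sketch is one possible reading of the table above, not code from the RFC; `detect_json_encoding` is a hypothetical name and the sample payload is invented. It also shows that Python's own `json.loads` refuses a string that starts with U+FEFF.

```python
import codecs
import json

def detect_json_encoding(data: bytes) -> str:
    """Guess the encoding from where NUL bytes fall in the first four bytes."""
    head = data[:4]
    if len(head) < 4:
        return "utf-8"
    nul = tuple(b == 0 for b in head)
    if nul == (True, True, True, False):
        return "utf-32-be"   # 00 00 00 xx
    if nul == (True, False, True, False):
        return "utf-16-be"   # 00 xx 00 xx
    if nul == (False, True, True, True):
        return "utf-32-le"   # xx 00 00 00
    if nul == (False, True, False, True):
        return "utf-16-le"   # xx 00 xx 00
    return "utf-8"           # anything else

payload = '{"a": 1}'
plain = payload.encode("utf-16-le")
with_bom = codecs.BOM_UTF16_LE + plain

print(detect_json_encoding(plain))      # pattern xx 00 xx 00 matches
print(detect_json_encoding(with_bom))   # FF FE 7B 00 matches no UTF-16 pattern

# The standard library parser also rejects text that begins with a BOM.
try:
    json.loads("\ufeff" + payload)
    bom_accepted = True
except ValueError:
    bom_accepted = False
print(bom_accepted)
```

With the BOM in front, the UTF-16 data falls through to the UTF-8 branch, which is the misdetection the answer describes.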
Other data formats
BOM in JSON is not needed, is illegal and breaks software that works correctly according to the RFC. It should be a no-brainer to just not use it then and yet, there are always people who insist on breaking JSON by using BOMs, comments, different quoting rules or different data types. Of course anyone is free to use things like BOMs or anything else if they need it - just don't call it JSON then.
For data formats other than JSON, take a look at what it really looks like. If the only encodings are UTF-* and the first character must be an ASCII character lower than 128 then you already have all the information needed to determine both the encoding and the endianness of your data. Adding BOMs even as an optional feature would only make it more complicated and error prone.
Other uses of BOM
As for the uses outside of JSON or scripts, I think there are already very good answers here. I wanted to add more detailed info specifically about scripting and serialization, because it is an example of BOM characters causing real problems.
edited Oct 7, 2021 at 7:34
Community Bot
answered Jun 26, 2016 at 11:34
rsp
rfc7159 which supersedes rfc4627 actually suggests supporting BOM may not be so evil. Basically not having a BOM is just an ambiguous kludge so that old Windows and Unix software that are not Unicode-aware can still process utf-8.
– Eric Grange
Apr 10, 2017 at 7:59
Sounds like JSON needs updating in order to support it, same with Perl scripts, Python scripts, Ruby scripts, Node.js. Just because these platforms opted to not include support, doesn't necessarily kill the use for BOM. Apple has been trying to kill Adobe for a few years now, and Adobe is still around. But an enlightening post.
– htm11h
Jul 24, 2017 at 15:47
@EricGrange, you seem to be very strongly supporting BOM, but fail to realize that this would render the all-ubiquitous, universally useful, optimal-minimum "plain text" format a relic of the pre-UTF8 past! Adding any sort of (in-band) header to the plain text stream would, by definition, impose a mandatory protocol to the simplest text files, making it never again the "simplest"! And for what gain? To support all the other, ancient CP encodings that also didn't have signatures, so you might mistake them with UTF-8? (BTW, ASCII is UTF-8, too. So, a BOM to those, too? ;) Come on.)
– Sz.
Mar 14, 2018 at 22:20
This answer is the reason why I came to this question! I create my bash scripts in Windows and experience a lot of problems when publishing those scripts to Linux! Same thing with JSON files.
– Tono Nam
Jul 2, 2019 at 14:43
I wish I could vote this answer up about fifty times. I also want to add that at this point, UTF-8 has won the standards war, and nearly all text being produced on the Internet is UTF-8. Some of the most popular programming languages (such as C# and Java) use UTF-16 internally, but when programmers using those languages write files to output streams, they almost always encode them as UTF-8. Therefore, it no longer makes sense to have a BOM to mark a UTF-8 file; UTF-8 should be the default you use when reading, and only try other encodings if UTF-8 decoding fails.
– rmunn
Aug 23, 2019 at 1:56
51
What's different between UTF-8 and UTF-8 without BOM?
Short answer: In UTF-8, a BOM is encoded as the bytes EF BB BF at the beginning of the file.
Long answer:
Originally, it was expected that Unicode would be encoded in UTF-16/UCS-2. The BOM was designed for this encoding form. When you have 2-byte code units, it's necessary to indicate which order those two bytes are in, and a common convention for doing this is to include the character U+FEFF as a "Byte Order Mark" at the beginning of the data. The character U+FFFE is permanently unassigned so that its presence can be used to detect the wrong byte order.
UTF-8 has the same byte order regardless of platform endianness, so a byte order mark isn't needed. However, it may occur (as the byte sequence EF BB BF) in data that was converted to UTF-8 from UTF-16, or as a "signature" to indicate that the data is UTF-8.
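The relationship between U+FEFF and the byte sequences mentioned above can be checked directly. This is a small Python illustration added here; `utf16_byte_order` is a hypothetical helper, and it deliberately ignores UTF-32, whose little-endian BOM also begins with FF FE.

```python
bom = "\ufeff"

# The same code point serialized three ways.
utf16_be = bom.encode("utf-16-be")
utf16_le = bom.encode("utf-16-le")
utf8 = bom.encode("utf-8")
print(utf16_be.hex(), utf16_le.hex(), utf8.hex())

def utf16_byte_order(data: bytes) -> str:
    """Infer UTF-16 byte order from a leading BOM (simplified sketch)."""
    if data[:2] == b"\xfe\xff":
        return "big-endian"
    if data[:2] == b"\xff\xfe":
        # Read as big-endian this would be U+FFFE, which is never a valid
        # character, so the bytes must be in the opposite order.
        return "little-endian"
    return "unknown"

print(utf16_byte_order("\ufeffhi".encode("utf-16-le")))
print(utf16_byte_order("\ufeffhi".encode("utf-16-be")))
```

In UTF-8 there is only one possible serialization of U+FEFF, which is why it carries no byte-order information there.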
Which is better?
Without. As Martin Cote answered, the Unicode standard does not recommend it. It causes problems with non-BOM-aware software.
A better way to detect whether a file is UTF-8 is to perform a validity check. UTF-8 has strict rules about what byte sequences are valid, so the probability of a false positive is negligible. If a byte sequence looks like UTF-8, it probably is.
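A validity check of this kind is a few lines in Python. The sketch below is illustrative only; the function names are made up, and the CP1252 fallback is an assumption borrowed from Tronic's comment under the question rather than something this answer prescribes.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

def decode_with_fallback(data: bytes, fallback: str = "cp1252") -> str:
    """Try UTF-8 first; otherwise fall back to an assumed legacy code page."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(fallback, errors="replace")

utf8_bytes = "héllo".encode("utf-8")
legacy_bytes = "héllo".encode("cp1252")

print(looks_like_utf8(utf8_bytes), looks_like_utf8(legacy_bytes))
print(decode_with_fallback(utf8_bytes), decode_with_fallback(legacy_bytes))
```

Pure ASCII input passes the check too, and decoding it as UTF-8 gives the right result.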
edited May 6, 2015 at 19:27
Peter Mortensen
answered Jul 31, 2010 at 22:53
dan04
this would also invalidate valid UTF-8 with a single erroneous byte in it, though :/
– endolith
Jul 15, 2012 at 1:05
-1 re "It causes problems with non-BOM-aware software.", that's never been a problem for me, but on the contrary, that absence of BOM causes problems with BOM-aware software (in particular Visual C++) has been a problem. So this statement is very platform-specific, a narrow Unix-land point of view, but is misleadingly presented as if it applies in general. Which it does not.
– Cheers and hth. - Alf
Jun 18, 2014 at 14:46
No, UTF-8 has no BOM. This answer is incorrect. See the Unicode Standard.
– tchrist
Oct 1, 2014 at 22:35
You can even think you have a pure ASCII file when just looking at the bytes. But this could be a utf-16 file as well where you'd have to look at words and not at bytes. Modern software should be aware of BOMs. Still reading utf-8 can fail if detecting invalid sequences, codepoints that can use a smaller sequence or codepoints that are surrogates. For utf-16 reading might fail too when there are orphaned surrogates.
– brighty
Feb 9, 2015 at 16:56
@Alf, I disagree with your interpretation of a non-BOM attitude as "platform-specific, a narrow Unix-land point of view." To me, the only way that the narrow-mindedness could lie with "Unix land" were if MS and Visual C++ came before *NIX, which they didn't. The fact that MS (I assume knowingly) started using a BOM in UTF-8 rather than UTF-16 suggests to me that they promoted breaking sh, perl, g++, and many other free and powerful tools. Want things to work? Just buy the MS versions. MS created the platform-specific problem, just like the disaster of their \x80-\x95 range.
– bballdave025
Jan 17, 2020 at 23:17
38
UTF-8 with BOM is better identified. I have reached this conclusion the hard way. I am working on a project where one of the results is a CSV file, including Unicode characters.
If the CSV file is saved without a BOM, Excel thinks it's ANSI and shows gibberish. Once you add "EF BB BF" at the front (for example, by re-saving it using Notepad with UTF-8; or Notepad++ with UTF-8 with BOM), Excel opens it fine.
Prepending the BOM character to Unicode text files is recommended by RFC 3629: "UTF-8, a transformation format of ISO 10646", November 2003
at https://www.rfc-editor.org/rfc/rfc3629 (this last info found at: http://www.herongyang.com/Unicode/Notepad-Byte-Order-Mark-BOM-FEFF-EFBBBF.html)
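For anyone producing such a CSV from code rather than re-saving it in an editor, here is a minimal Python sketch of the same idea. It relies on the standard `utf-8-sig` codec, which writes the EF BB BF prefix when encoding; the function name and sample rows are invented, and whether a given Excel version then opens the file correctly is the answer's observation, not something this code verifies.

```python
import csv
import io

def csv_bytes_for_excel(rows) -> bytes:
    """Serialize rows as CSV and encode as UTF-8 with a leading BOM."""
    buf = io.StringIO(newline="")
    csv.writer(buf).writerows(rows)
    # "utf-8-sig" prepends EF BB BF when encoding.
    return buf.getvalue().encode("utf-8-sig")

data = csv_bytes_for_excel([["name", "city"], ["Zoë", "Kraków"]])
print(data[:3].hex())
print(data.decode("utf-8-sig"))
```

When writing straight to disk, `open(path, "w", encoding="utf-8-sig", newline="")` gives the same bytes.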
edited Oct 7, 2021 at 5:46
Community Bot
answered Jun 28, 2012 at 17:34
Helen Craigman
Thanks for this excellent tip in case one is creating UTF-8 files for use by Excel. In other circumstances though, I would still follow the other answers and skip the BOM.
– barfuin
May 7, 2013 at 19:20
It's also useful if you create files that contain only ASCII and later may have non-ASCII added to them. I have just run into such an issue: software that expects utf8 creates a file with some data for user editing. If the initial file contains only ASCII, is opened in some editors and then saved, it ends up in latin-1 and everything breaks. If I add the BOM, it will get detected as UTF8 by the editor and everything works.
– Roberto Alsina
Sep 9, 2013 at 22:03
I have found multiple programming related tools which require the BOM to recognise UTF-8 files correctly. Visual Studio, SSMS, SourceTree....
– kjbartel
Jan 27, 2015 at 13:24
Where do you read a recommendation for using a BOM into that RFC? At most, there's a strong recommendation to not forbid it under certain circumstances where doing so is difficult.
– Deduplicator
Aug 11, 2015 at 18:37
"Excel thinks it's ANSI and shows gibberish": then the problem is in Excel.
– user8017719
Nov 26, 2016 at 8:10
17
Question: What's different between UTF-8 and UTF-8 without a BOM? Which is better?
Here are some excerpts from the Wikipedia article on the byte order mark (BOM) that I believe offer a solid answer to this question.
On the meaning of the BOM and UTF-8:
The Unicode Standard permits the BOM in UTF-8, but does not require
or recommend its use. Byte order has no meaning in UTF-8, so its
only use in UTF-8 is to signal at the start that the text stream is
encoded in UTF-8.
Argument for NOT using a BOM:
The primary motivation for not using a BOM is backwards-compatibility
with software that is not Unicode-aware... Another motivation for not
using a BOM is to encourage UTF-8 as the "default" encoding.
Argument FOR using a BOM:
The argument for using a BOM is that without it, heuristic analysis is
required to determine what character encoding a file is using.
Historically such analysis, to distinguish various 8-bit encodings, is
complicated, error-prone, and sometimes slow. A number of libraries
are available to ease the task, such as Mozilla Universal Charset
Detector and International Components for Unicode.
Programmers mistakenly assume that detection of UTF-8 is equally
difficult (it is not because the vast majority of byte sequences
are invalid UTF-8, while the encodings these libraries are trying to
distinguish allow all possible byte sequences). Therefore not all
Unicode-aware programs perform such an analysis and instead rely on
the BOM.
In particular, Microsoft compilers and interpreters, and many
pieces of software on Microsoft Windows such as Notepad will not
correctly read UTF-8 text unless it has only ASCII characters or it
starts with the BOM, and will add a BOM to the start when saving text
as UTF-8. Google Docs will add a BOM when a Microsoft Word document is
downloaded as a plain text file.
On which is better, WITH or WITHOUT the BOM:
The IETF recommends that if a protocol either (a) always uses UTF-8,
or (b) has some other way to indicate what encoding is being used,
then it “SHOULD forbid use of U+FEFF as a signature.”
My Conclusion:
Use the BOM only if compatibility with a software application is absolutely essential.
Also note that while the referenced Wikipedia article indicates that many Microsoft applications rely on the BOM to correctly detect UTF-8, this is not the case for all Microsoft applications. For example, as pointed out by @barlop, when using the Windows Command Prompt with UTF-8†, commands such as type and more do not expect the BOM to be present. If the BOM is present, it can be problematic as it is for other applications.
† The chcp command offers support for UTF-8 (without the BOM) via code page 65001.
edited Mar 4, 2018 at 1:16
answered Oct 2, 2014 at 20:24
DavidRR
I'd better stick to WITHOUT the BOM. I found that .htaccess and gzip compression in combination with a UTF-8 BOM gives an encoding error. Changing the encoding to UTF-8 without BOM, following a suggestion as explained here, solved the problems.
– eQ19
Apr 16, 2015 at 15:09
'Another motivation for not using a BOM is to encourage UTF-8 as the "default" encoding.' -- Which is so strong & valid an argument, that you could have actually stopped the answer there! ... ;-o Unless you got a better idea for universal text representation, that is. ;) (I don't know how old you are, how many years you had to suffer in the pre-UTF8 era (when linguists desperately considered even changing their alphabets), but I can tell you that every second we get closer to ridding the mess of all the ancient single-byte-with-no-metadata encodings, instead of having "the one" is pure joy.)
– Sz.
Mar 14, 2018 at 22:41
See also this comment about how adding a BOM (or anything!) to the simplest of the text file formats, "plain text", would mean preventing exactly the best universal text encoding format from being "plain", and "simple" (i.e. "overheadless")! ...
– Sz.
Mar 14, 2018 at 22:58
BOM is mostly problematic on Linux because many utilities do not really support Unicode to begin with (they will happily truncate in the middle of code points for instance). For most other modern software environments, use BOM whenever the encoding is not unambiguous (through specs or metadata).
– Eric Grange
Aug 23, 2019 at 7:58
16
BOM tends to boom (no pun intended (sic)) somewhere, someplace. And when it booms (for example, doesn't get recognized by browsers, editors, etc.), it shows up as the weird characters ï»¿ at the start of the document (for example, HTML file, JSON response, RSS, etc.) and causes the kind of embarrassments like the recent encoding issue experienced during the talk of Obama on Twitter.
It's very annoying when it shows up at places hard to debug or when testing is neglected. So it's best to avoid it unless you must use it.
edited May 6, 2015 at 19:28
Peter Mortensen
answered Jul 11, 2011 at 7:56
Halil Özgür
Yes, just spent hours identifying a problem caused by a file being encoded as UTF-8 instead of UTF-8 without BOM. (The issue only showed up in IE7 so that led me on quite a goose chase. I used Django's "include".)
– user984003
Jan 31, 2013 at 20:45
Future readers: Note that the tweet issue I've mentioned above was not strictly related to BOM, but if it was, then the tweet would be garbled in a similar way, but at the start of the tweet.
– Halil Özgür
Feb 1, 2013 at 7:26
@user984003 No, the problem is that Microsoft has misled you. What it calls UTF-8 is not UTF-8. What it calls UTF-8 without BOM is what UTF-8 really is.
– tchrist
Oct 2, 2014 at 0:11
what does the "sic" add to your "no pun intended"
– JoelFan
Oct 23, 2017 at 21:15
@JoelFan I can't recall anymore but I guess the pun might have been intended despite the author's claim :)
– Halil Özgür
Oct 23, 2017 at 21:34
15
This question already has a million-and-one answers and many of them are quite good, but I wanted to try and clarify when a BOM should or should not be used.
As mentioned, any use of the UTF BOM (Byte Order Mark) in determining whether a string is UTF-8 or not is educated guesswork. If there is proper metadata available (like charset="utf-8"), then you already know what you're supposed to be using, but otherwise you'll need to test and make some assumptions. This involves checking whether the file a string comes from begins with the hexadecimal byte code, EF BB BF.
If a byte code corresponding to the UTF-8 BOM is found, the probability is high enough to assume it's UTF-8 and you can go from there. When forced to make this guess, however, additional error checking while reading would still be a good idea in case something comes up garbled. You should only assume a BOM is not UTF-8 (i.e. latin-1 or ANSI) if the input definitely shouldn't be UTF-8 based on its source. If there is no BOM, however, you can simply determine whether it's supposed to be UTF-8 by validating against the encoding.
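The decision order described above (declared metadata first, then the BOM, then validation, then a legacy fallback) could be sketched in Python roughly as follows. Everything here is an assumption made for illustration: the function name `guess_text`, the `declared` parameter standing in for metadata such as a charset attribute, and CP1252 as the legacy code page.

```python
import codecs
from typing import Optional, Tuple

def guess_text(data: bytes, declared: Optional[str] = None,
               legacy: str = "cp1252") -> Tuple[str, str]:
    """Return (text, encoding_used) following the order described above."""
    if declared:
        # Proper metadata wins; no guessing needed.
        return data.decode(declared), declared
    if data.startswith(codecs.BOM_UTF8):
        try:
            return data.decode("utf-8-sig"), "utf-8-sig"
        except UnicodeDecodeError:
            pass  # BOM-like prefix, but the rest is not valid UTF-8
    else:
        try:
            return data.decode("utf-8"), "utf-8"
        except UnicodeDecodeError:
            pass  # not valid UTF-8
    # Last resort: assume a legacy 8-bit code page.
    return data.decode(legacy, errors="replace"), legacy

print(guess_text(b"\xef\xbb\xbfabc"))
print(guess_text("é".encode("utf-8")))
print(guess_text(b"caf\xe9"))
```

The error check after the BOM branch covers the case reported in the comments on the top answer, where a file carried a UTF-8 BOM but its body was in another encoding.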
Why is a BOM not recommended?
Non-Unicode-aware or poorly compliant software may assume it's latin-1 or ANSI and won't strip the BOM from the string, which can obviously cause issues.
It's not really needed (just check if the contents are compliant and always use UTF-8 as the fallback when no compliant encoding can be found)
When should you encode with a BOM?
If you're unable to record the metadata in any other way (through a charset tag or file system meta), and the programs being used like BOMs, you should encode with a BOM. This is especially true on Windows where anything without a BOM is generally assumed to be using a legacy code page. The BOM tells programs like Office that, yes, the text in this file is Unicode; here's the encoding used.
When it comes down to it, the only files I ever really have problems with are CSV. Depending on the program, it either must, or must not have a BOM. For example, if you're using Excel 2007+ on Windows, it must be encoded with a BOM if you want to open it smoothly and not have to resort to importing the data.
edited Apr 16, 2020 at 23:37
Peter Mortensen
answered Jan 25, 2016 at 16:03
jpc-ae
The last section of your answer is 100% correct: the only reason to use a BOM is when you have to interoperate with buggy software that doesn't use UTF-8 as its default to parse unknown files.
– rmunn
Aug 23, 2019 at 2:01
8
UTF-8 without BOM has no BOM, which doesn't make it any better than UTF-8 with BOM, except when the consumer of the file needs to know (or would benefit from knowing) whether the file is UTF-8-encoded or not.
The BOM is usually useful to determine the endianness of the encoding, which is not required for most use cases.
Also, the BOM can be unnecessary noise/pain for those consumers that don't know or care about it, and can result in user confusion.
edited Feb 8, 2010 at 18:42
answered Feb 8, 2010 at 18:30
Romain
"which has no use for UTF-8 as it is 8-bits per glyph anyway." Er... no, only ASCII-7 glyphs are 8-bits in UTF-8. Anything beyond that is going to be 16, 24, or 32 bits.
– Powerlord
Feb 8, 2010 at 18:38
"The BOM is usually useful to determine the endianness of the encoding, which is not required for most use cases."... endianness simply does not apply to UTF-8, regardless of use case
– JoelFan
Oct 23, 2017 at 21:30
a consumer that needs to know is broken by design.
– Jasen
Aug 9, 2020 at 8:38
8
It should be noted that for some files you must not have the BOM even on Windows. Examples are SQL*plus or VBScript files. In case such files contain a BOM you get an error when you try to execute them.
edited Aug 11, 2015 at 18:43
Deduplicator
answered Jan 31, 2015 at 21:09
Wernfried Domscheit
7
Quoted at the bottom of the Wikipedia page on BOM: http://en.wikipedia.org/wiki/Byte-order_mark#cite_note-2
"Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature"
answered Feb 8, 2010 at 18:35
pib
Do you have any example where software makes a decision of whether to use UTF-8 with/without BOM, based on whether the previous encoding it is encoding from, had a BOM or not?! That seems like an absurd claim
– barlop
Mar 3, 2018 at 15:31
7
UTF-8 with BOM only helps if the file actually contains some non-ASCII characters. If it is included and there aren't any, then it will possibly break older applications that would have otherwise interpreted the file as plain ASCII. These applications will definitely fail when they come across a non-ASCII character, so in my opinion the BOM should only be added when the file can, and should, no longer be interpreted as plain ASCII.
I want to make it clear that I prefer to not have the BOM at all. Add it in if some old rubbish breaks without it, and replacing that legacy application is not feasible.
Don't make anything expect a BOM for UTF-8.
edited Apr 16, 2020 at 23:15
Peter Mortensen
answered Jul 3, 2014 at 2:43
James Wakefield
it's not certain that non UTF8-aware applications will fail if they encounter UTF8, the whole point of UTF8 is that many things will just work: wc(1) will give a correct line and octet count, and a correct word count if no unicode-only spacing characters are used.
– Jasen
Aug 9, 2020 at 8:37
I agree with you @Jasen. Trying to work out if I just delete this old answer. My current opinion is that the answer is simply don't add a BOM. The end user can append one if they have to hack a file to make it work with old software. We shouldn't make software that perpetuates this incorrect behaviour. There is no reason why a file couldn't start with a zero-width-non-joiner that is meant to be interpreted as one.
– James Wakefield
Dec 16, 2021 at 4:31
6
I look at this from a different perspective. I think UTF-8 with BOM is better as it provides more information about the file. I use UTF-8 without BOM only if I face problems.
I am using multiple languages (even Cyrillic) on my pages for a long time and when the files are saved without BOM and I re-open them for editing with an editor (as cherouvim also noted), some characters are corrupted.
Note that Windows' classic Notepad automatically saves files with a BOM when you try to save a newly created file with UTF-8 encoding.
I personally save server side scripting files (.asp, .ini, .aspx) with BOM and .html files without BOM.
Share
Improvethisanswer
Follow
editedMay23,2017at11:55
CommunityBot
111silverbadge
answeredMay11,2012at8:34
user1358065user1358065
10311silverbadge44bronzebadges
5
4
Thanks for the excellent tip about Windows classic Notepad. I already spent some time finding out the exact same thing. My consequence was to always use Notepad++ instead of Windows classic Notepad. :-)
– barfuin
May 7, 2013 at 19:22
You'd better use MadEdit. It's the only editor that, in hex mode, shows one character if you select a UTF-8 byte sequence instead of a 1:1 basis between byte and character. A hex editor that is aware of a UTF-8 file should behave like MadEdit does!
– brighty
Feb 9, 2015 at 16:49
@brighty I don't think you need one-to-one for the sake of the BOM. It doesn't matter; it doesn't take much to recognise that a UTF-8 BOM is efbbbf, or feff (or fffe if read wrong). One can simply delete those bytes. It's not bad, though, to have a mapping for the rest of the file, but to also be able to delete byte by byte too.
– barlop
Mar 3, 2018 at 15:34
@barlop Why would you want to delete a UTF-8 BOM if the file's content is UTF-8 encoded? The BOM is recognized by modern text viewers and text controls as well as text editors. A one-to-one view of a UTF-8 sequence makes no sense, since n bytes result in one character. Of course a text editor or hex editor should allow deleting any byte, but this can lead to invalid UTF-8 sequences.
– brighty
Mar 4, 2018 at 16:41
@brighty UTF-8 with BOM is an encoding, and UTF-8 without BOM is an encoding. The cmd prompt uses UTF-8 without BOM, so if you have a UTF-8 file and you run the command chcp 65001 for UTF-8 support, it's UTF-8 without BOM. If you do type myfile it will only display properly if there is no BOM. If you do echo aaa>a.a or echo אאא>a.a to output the chars to file a.a, and you have chcp 65001, it will output with no BOM.
– barlop
Mar 5, 2018 at 4:55
6
When you want to display information encoded in UTF-8 you may not face problems. Declare, for example, an HTML document as UTF-8 and you will have everything displayed in your browser that is contained in the body of the document.
But this is not the case when we have text, CSV and XML files, either on Windows or Linux.
For example, a text file in Windows or Linux, one of the easiest things imaginable, is not (usually) UTF-8.
Save it as XML and declare it as UTF-8:
It will not display (it will not be read) correctly, even if it's declared as UTF-8.
I had a string of data containing French letters that needed to be saved as XML for syndication. Without creating a UTF-8 file from the very beginning (changing options in the IDE and "Create New File") or adding the BOM at the beginning of the file
$file = "\xEF\xBB\xBF" . $string; // prepend the UTF-8 BOM bytes
I was not able to save the French letters in an XML file.
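For comparison, here is a rough Python analogue of the PHP line above. It is only a sketch: the function name `with_utf8_bom` is invented for illustration, and it adds the one extra step of not prepending a second BOM when the text already starts with U+FEFF.

```python
import codecs


def with_utf8_bom(text: str) -> bytes:
    """Encode text as UTF-8 and make sure it starts with exactly one BOM.

    with_utf8_bom is a hypothetical helper name used only for this sketch.
    """
    data = text.encode("utf-8")
    if data.startswith(codecs.BOM_UTF8):
        # The text already began with U+FEFF; do not add another one.
        return data
    return codecs.BOM_UTF8 + data


print(with_utf8_bom("é"))  # b'\xef\xbb\xbf\xc3\xa9'
```

When no BOM is present yet, the result is the same as `text.encode("utf-8-sig")`.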
edited May 6, 2015 at 19:33 by Peter Mortensen
answered Sep 10, 2012 at 16:50 by Florin Sima
1
4
I know this is an old answer, but I just want to mention that it's wrong. Text files on Linux (can't speak for other Unixes) usually /are/ UTF-8.
– Functino
Nov 14, 2015 at 23:41
6
One practical difference is that if you write a shell script for Mac OS X and save it as plain UTF-8 (which here evidently means with a BOM), you will get the response:
#!/bin/bash: No such file or directory
in response to the shebang line specifying which shell you wish to use:
#!/bin/bash
If you save as UTF-8, no BOM (say in BBEdit) all will be well.
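The failure makes sense if the script loader only recognises a shebang when the file's very first two bytes are `#!`. A minimal Python sketch of that check follows, with that behaviour stated as an assumption; `starts_with_shebang` is a hypothetical name, and the real check happens in the operating system, not in Python.

```python
import codecs


def starts_with_shebang(data: bytes) -> bool:
    """Mimic a loader that only recognises '#!' at byte offset 0.

    starts_with_shebang is a hypothetical helper name used only for this
    sketch; it models the assumed behaviour, it does not call the OS.
    """
    return data[:2] == b"#!"


script = "#!/bin/bash\necho hi\n".encode("utf-8")

print(starts_with_shebang(script))                    # True
print(starts_with_shebang(codecs.BOM_UTF8 + script))  # False
```

With the BOM in front, the first bytes are EF BB BF rather than 23 21, so the shebang is no longer where the loader looks for it.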
edited May 6, 2015 at 19:46 by Peter Mortensen
answered Jan 24, 2014 at 20:38 by David
1
10
That's because Microsoft has swapped the meaning of what the standard says. UTF-8 has no BOM: they have created Microsoft UTF-8, which inserts a spurious BOM in front of the data stream, and then told you that no, this is actually UTF-8. It is not. It is just extending and corrupting.
– tchrist
Oct 2, 2014 at 0:14
5
The Unicode Byte Order Mark (BOM) FAQ provides a concise answer:
Q: How I should deal with BOMs?
A: Here are some guidelines to follow:
1. A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as files. When you need to conform to such a protocol, use a BOM.
2. Some protocols allow optional BOMs in the case of untagged text. In those cases,
   - Where a text data stream is known to be plain text, but of unknown encoding, BOM can be used as a signature. If there is no BOM, the encoding could be anything.
   - Where a text data stream is known to be plain Unicode text (but not which endian), then BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian.
3. Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as encoding form signature should be avoided.
4. Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used.
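The last guideline can be illustrated with Python's built-in codecs, purely as a sketch of how one implementation treats the declared-endianness encodings versus the generic one; other libraries may behave differently.

```python
import codecs

text = "hi"

# Encodings with a declared byte order write no BOM.
be = text.encode("utf-16-be")
le = text.encode("utf-16-le")
print(be)  # b'\x00h\x00i'
print(le)  # b'h\x00i\x00'

# The generic "utf-16" codec writes a BOM first (in the platform's byte order).
generic = text.encode("utf-16")
print(generic[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))  # True

# Decoding BOM-prefixed data: the generic codec consumes the BOM, while the
# declared-endianness codec keeps it as a U+FEFF character in the text.
marked = codecs.BOM_UTF16_BE + be
print(marked.decode("utf-16") == "hi")             # True
print(marked.decode("utf-16-be") == "\ufeff" + "hi")  # True
```

The final line shows why a BOM in a stream already declared as UTF-16BE is unwelcome: it ends up as an extra character in the decoded text.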
answered Mar 8, 2018 at 13:58 by Wernfried Domscheit
0
4
As mentioned above, UTF-8 with BOM may cause problems with non-BOM-aware (or compatible) software. I once edited HTML files encoded as UTF-8 + BOM with the Mozilla-based KompoZer, as a client required that WYSIWYG program.
Invariably the layout would get destroyed when saving. It took me some time to fiddle my way around this. These files then worked well in Firefox, but showed a CSS quirk in Internet Explorer destroying the layout, again. After fiddling with the linked CSS files for hours to no avail I discovered that Internet Explorer didn't like the BOM-fed HTML file. Never again.
Also, I just found this in Wikipedia:
The shebang characters are represented by the same two bytes in extended ASCII encodings, including UTF-8, which is commonly used for scripts and other text files on current Unix-like systems. However, UTF-8 files may begin with the optional byte order mark (BOM); if the "exec" function specifically detects the bytes 0x23 0x21, then the presence of the BOM (0xEF 0xBB 0xBF) before the shebang will prevent the script interpreter from being executed. Some authorities recommend against using the byte order mark in POSIX (Unix-like) scripts,[15] for this reason and for wider interoperability and philosophical concerns
edited May 6, 2015 at 19:44 by Peter Mortensen
answered Jun 22, 2013 at 4:56 by Marek Möhling
3
From http://en.wikipedia.org/wiki/Byte-order_mark:
The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional, and, if used, should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.
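As a quick illustration of that last sentence, the same code point U+FEFF produces a different byte sequence in each Unicode encoding form, which is what lets it act as a signature. A short Python sketch:

```python
BOM = "\ufeff"

# Hex bytes of U+FEFF in each encoding form.
signatures = {
    name: BOM.encode(name).hex()
    for name in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")
}

for name, hex_bytes in signatures.items():
    print(name, hex_bytes)
# utf-8 efbbbf
# utf-16-be feff
# utf-16-le fffe
# utf-32-be 0000feff
# utf-32-le fffe0000
```

In UTF-8 there is only one possible byte order, so the three bytes EF BB BF identify the encoding rather than any endianness.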
Always using a BOM in your file will ensure that it always opens correctly in an editor which supports UTF-8 and BOM.
My real problem with the absence of BOM is the following. Suppose we've got a file which contains:
abc
Without BOM this opens as ANSI in most editors. So another user of this file opens it and appends some native characters, for example:
abg-αβγ
Oops... Now the file is still in ANSI and guess what, "αβγ" does not occupy 6 bytes, but 3. This is not UTF-8 and this causes other problems later on in the development chain.
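The byte counts in that scenario can be checked directly. This Python sketch assumes that "ANSI" here means the Windows Greek code page cp1253, which the answer does not actually specify; `is_valid_utf8` is a hypothetical helper name.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8.

    is_valid_utf8 is a hypothetical helper name used only for this sketch.
    """
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


greek = "αβγ"
ansi = greek.encode("cp1253")  # assumption: "ANSI" = Windows-1253 (Greek)
utf8 = greek.encode("utf-8")

print(len(ansi), len(utf8))  # 3 6
print(is_valid_utf8(utf8))   # True
print(is_valid_utf8(ansi))   # False
```

The three code-page bytes are not a valid UTF-8 sequence, which is the kind of breakage later in the chain that the answer is pointing at.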
edited May 6, 2015 at 19:23 by Peter Mortensen
answered Feb 8, 2010 at 18:31 by cherouvim
6
10
And ensure that spurious bytes appear in the beginning of non-BOM-aware software. Yay.
– Romain
Feb 8, 2010 at 18:33
1
@RomainMuller: e.g. PHP 5 will throw "impossible" errors when you try to send headers after the BOM.
– Piskvor left the building
Feb 8, 2010 at 18:47
5
αβγ is not ASCII, but can appear in 8-bit ASCII-based encodings. The use of a BOM disables a benefit of UTF-8: its compatibility with ASCII (the ability to work with legacy applications where pure ASCII is used).
– ctrl-alt-delor
Jan 7, 2011 at 13:03
1
This is the wrong answer. A string with a BOM in front of it is something else altogether. It is not supposed to be there and just screws everything up.
– tchrist
Oct 2, 2014 at 0:13
"Without BOM this opens as ANSI in most editors." I agree absolutely. If this happens you're lucky if you deal with the correct codepage, but indeed it's just a guess, because the codepage is not part of the file. A BOM is.
– brighty
Feb 9, 2015 at 16:59
1
Here is my experience with Visual Studio, Sourcetree and Bitbucket pull requests, which has been giving me some problems:
So it turns out BOM with a signature will include a red dot character on each file when reviewing a pull request (it can be quite annoying).
If you hover on it, it will show a character like "ufeff", but it turns out Sourcetree does not show these types of byte marks, so it will most likely end up in your pull requests, which should be OK because that's how Visual Studio 2017 encodes new files now, so maybe Bitbucket should ignore this or make it show in another way. More info here:
Red dot marker BitBucket diff view
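If the marker is unwanted in a repository, one option (not something this answer recommends, just a possible workaround) is to strip the signature before committing. A minimal Python sketch, with `strip_utf8_bom` as a hypothetical helper name:

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove one leading UTF-8 BOM, leaving all other bytes untouched.

    strip_utf8_bom is a hypothetical helper name used only for this sketch.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


print(strip_utf8_bom(b"\xef\xbb\xbfclass Foo {}"))  # b'class Foo {}'
print(strip_utf8_bom(b"class Foo {}"))              # b'class Foo {}'
```

Only a BOM at the very start is removed; a U+FEFF elsewhere in the file is left alone, since there it is ordinary content.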
edited Apr 16, 2020 at 23:47 by Peter Mortensen
answered Jul 31, 2019 at 9:30 by Leo
0
I saved an AutoHotkey file as UTF-8, and the Chinese characters became garbled.
With a UTF-8 BOM, it works fine.
AutoHotkey will not automatically recognize a UTF-8 file unless it begins with a byte order mark.
https://www.autohotkey.com/docs/FAQ.htm#nonascii
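For tools like this that do require the signature, a file can be written with a BOM from code as well as from an editor. The Python sketch below uses the standard `utf-8-sig` codec; the file name and the script text are placeholders, not real AutoHotkey code, and `write_with_bom` is a hypothetical helper name.

```python
import os
import tempfile


def write_with_bom(path: str, text: str) -> None:
    """Write text as UTF-8 with a leading BOM.

    write_with_bom is a hypothetical helper name used only for this sketch.
    """
    # "utf-8-sig" emits EF BB BF once at the start of the file.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        f.write(text)


# Placeholder path and content, chosen only for illustration.
path = os.path.join(tempfile.mkdtemp(), "example.ahk")
write_with_bom(path, "你好\n")

with open(path, "rb") as f:
    first_bytes = f.read(3)

print(first_bytes == b"\xef\xbb\xbf")  # True
```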
answered May 8 at 3:41 by GoodPen
-4
UTF with a BOM is better if you use UTF-8 in HTML files and if you use Serbian Cyrillic, Serbian Latin, German, Hungarian or some exotic language on the same page.
That is my opinion (30 years of computing and IT industry).
edited Apr 16, 2020 at 23:11 by Peter Mortensen
answered Mar 15, 2013 at 10:01 by user2173444
3
1
I find this to be true as well. If you use characters outside the ASCII range and you omit the BOM, browsers interpret it as ISO-8859-1 and you get garbled characters. Given the answers above, this is apparently on the browser vendors doing the wrong thing when they don't detect a BOM. But unless you work at Microsoft Edge/Mozilla/Webkit/Blink, you have no choice but to work with the defects these apps have.
– asontu
Nov 28, 2017 at 8:42
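The garbling this comment describes can be reproduced without a browser: it is what happens when UTF-8 bytes are decoded as ISO-8859-1. A small Python sketch, which makes no claim about how any particular browser chooses its fallback encoding:

```python
original = "é"

# UTF-8 encodes "é" as the two bytes C3 A9; read as ISO-8859-1, each byte
# becomes its own character.
garbled = original.encode("utf-8").decode("iso-8859-1")
print(garbled)  # Ã©

# The damage is reversible as long as the bytes themselves were not altered.
restored = garbled.encode("iso-8859-1").decode("utf-8")
print(restored == original)  # True
```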
UTF what? UTF-8? UTF-16? Something else?
– Peter Mortensen
Apr 16, 2020 at 23:12
If your server doesn't indicate the correct MIME type charset parameter you should use the