What is the difference between "UTF-8 with BOM" and "UTF-8 without BOM"? Which one is generally used for web page code?

2025-01-23

12-27. Invited to answer.

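I knew that if you climb the mountain often enough you eventually meet the tiger: showing off about Unicode in front of @梁海 was always going to end like this. First, what is a BOM?

I won't explain that here; Wikipedia covers it in detail.

http://en.wikipedia.org/wiki/Byte_order_mark

Using a BOM in a web page is a mistake.

The BOM was not designed to support HTML and XML.

To identify the text encoding, HTML has the charset attribute and XML has the encoding attribute; there is no need to drag in a BOM for show.

In theory a BOM can be used to identify a UTF-16-encoded HTML page, but in practice very few people do this.

After all, UTF-16 spends two bytes even on ASCII characters, which makes it a poor fit for web pages.

That said, calling the BOM a bad habit is not entirely fair.

The BOM is part of the Unicode standard and has its own specific scope of use.

Typically a BOM marks a Unicode plain-text byte stream, giving text-processing programs a convenient way to tell which Unicode encoding (UTF-8, UTF-16BE, UTF-16LE) a .txt file they have read is in.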
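As a rough illustration of how a text-processing program might use the BOM this way, here is a minimal Python sketch. The function names `sniff_bom` and `decode_text` and the UTF-8 fallback are assumptions made for this example, not something the answer prescribes; only the `codecs.BOM_*` constants come from the standard library.

```python
import codecs

# Check the longest signatures first: the UTF-32LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16LE BOM (FF FE).
_SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom, name in _SIGNATURES:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_text(data: bytes, fallback: str = "utf-8") -> str:
    """Decode using the BOM if there is one, otherwise the fallback encoding."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or fallback)
```

A caller would read the file in binary mode and pass the bytes to `decode_text`; the BOM itself is dropped from the returned string.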

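Windows handles the BOM comparatively well because it built Unicode detection code into its APIs, chiefly CreateFile().

When a text file is opened, it automatically detects and strips the BOM.

Windows does this for historical reasons: it grew out of a multi-code-page environment.

When Unicode was introduced, the designers of Windows wanted to stay compatible with both Unicode and non-Unicode (multibyte) text files without users noticing, so they had to rely on this little trick.

By contrast, systems such as Linux spent less time steeped in a multi-locale environment, and the community had enough motivation to travel light (an aside: Microsoft's compatibility requirements really are obsessive; nothing that breaks compatibility is allowed, so it often ends up tying its own hands), so they simply moved to UTF-8 in one step.

Of course there was a transition period. From the release of GTK+ 2.0, the first all-UTF-8 version, to the point where essentially all GTK developers had abandoned the multi-locale GTK+ 1.2, I recall it took at least three or four years.

The BOM is unpopular mainly in UNIX environments, because many UNIX programs ignore it.

The main problem is the #! marker on the first line that all UNIX scripting languages share. It depends on the shell parsing it, and many shells, for compatibility reasons, do not check for a BOM, so when a BOM is added the shell treats it as ordinary character input, which breaks the #! marker and causes trouble.

In fact the interpreters of many modern scripting languages, Python for example, can handle a BOM themselves, but the shell is stuck at this point, so they suffer the collateral damage.

This is not really the shell's fault either, because the BOM violates a common UNIX design principle: data present in a document must be visible.

A BOM cannot be edited as a visible character in a text editor, and that alone is enough to displease many UNIX developers.

Incidentally, even if a scripting language can handle a BOM, using BOMs everywhere is not the recommended approach.

Each scripting language has its own way of handling Unicode. Python's # -*- coding: utf-8 -*- and Perl's use utf8 are both simpler and more reliable than a BOM.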
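To see how Python itself treats the coding declaration and the BOM, the standard library's `tokenize.detect_encoding` can be asked what it would assume for a given source. The wrapper name `source_encoding` is an assumption made for this sketch; the underlying call is the real standard-library function.

```python
import io
import tokenize


def source_encoding(source: bytes) -> str:
    """Report the encoding Python's tokenizer would assume for this source."""
    # detect_encoding looks at a leading UTF-8 BOM and at a coding
    # declaration on the first or second line.
    encoding, _first_lines = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


declared = source_encoding(b"# -*- coding: gbk -*-\nprint(1)\n")
with_bom = source_encoding(b"\xef\xbb\xbfprint(1)\n")
neither = source_encoding(b"print(1)\n")
```

With a declaration the declared name is reported, with a BOM the result is `utf-8-sig`, and with neither the default `utf-8` applies.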

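The other good news is that even people who have to switch between Windows and UNIX need not despair.

Fortunately, on UNIX we have a wonder tool in VIM. Even when a BOM blocks the way, three commands solve the problem: set nobomb; set fileencoding=utf8; w.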
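For files that are already UTF-8, the same clean-up can be scripted outside an editor. The following is a minimal sketch, not a replacement for the VIM commands (it does not convert other encodings); the function name `remove_utf8_bom` is hypothetical.

```python
import codecs
from pathlib import Path


def remove_utf8_bom(path) -> bool:
    """Rewrite the file without its leading UTF-8 BOM.

    Returns True if a BOM was found and removed, False if the file
    was left untouched.
    """
    p = Path(path)
    data = p.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    p.write_bytes(data[len(codecs.BOM_UTF8):])
    return True
```

The function works on raw bytes so that the rest of the file is passed through unchanged, whatever it contains.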

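Looking back, it really does seem that only Windows insists on using the BOM.

P.S.: This is my 150th answer.

I have just realized how few answers I have written.

P.S. 2: It occurs to me that I should explain why removing the BOM with VIM has to be done on UNIX.

VIM on Windows has a strange bug: it always identifies UTF-16 files as binary, whereas VIM on UNIX (Linux or Mac, either will do) has no such problem.

This problem has followed me from VIM 6.8 all the way to VIM 7.3.

I still do not know whether it is a bug in VIM or a bug in my own .vimrc file.

If any expert can explain it, I would be very grateful.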

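UTF-8 does not need a BOM, although the Unicode standard permits a BOM in UTF-8.

So UTF-8 without a BOM is the standard form, and putting a BOM in UTF-8 files is mainly a Microsoft habit (incidentally, calling little-endian UTF-16 with a BOM "Unicode" without further explanation is also a Microsoft habit).

The BOM (byte order mark) was intended for UTF-16 and UTF-32, where it marks the byte order.

Microsoft uses a BOM in UTF-8 because it makes UTF-8 clearly distinguishable from ASCII and other encodings, but such files cause problems on operating systems other than Windows.

The difference between "UTF-8" and "UTF-8 with BOM" is simply whether there is a BOM.

That is, whether the file starts with U+FEFF.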
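The difference can be made concrete in a few lines of Python. This is only an illustration; it relies on Python's `utf-8-sig` codec, which writes the BOM when encoding, and the sample text is arbitrary.

```python
text = "中文 text"

plain = text.encode("utf-8")         # UTF-8 without BOM
with_bom = text.encode("utf-8-sig")  # UTF-8 with BOM

# U+FEFF encoded as UTF-8 is the three-byte sequence EF BB BF.
bom_bytes = "\ufeff".encode("utf-8")

# The two files differ only by those three leading bytes.
same_apart_from_bom = with_bom == bom_bytes + plain
```

Everything after the first three bytes is byte-for-byte identical in the two forms.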

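UTF-8 web page code should not use a BOM; otherwise things often go wrong.

Here is a small example: "Why is the information inside <head> in this web page code interpreted by the browser as being inside <body>?"

Also attached is a passage from The Unicode Standard, Version 6.0, Section 3.10, D95 UTF-8 encoding scheme:

While there is obviously no need for a byte order signature when using UTF-8, there are occasions when processes convert UTF-16 or UTF-32 data containing a byte order mark into UTF-8. When represented in UTF-8, the byte order mark turns into the byte sequence <EF BB BF>. Its usage at the beginning of a UTF-8 data stream is neither required nor recommended by the Unicode Standard, but its presence does not affect conformance to the UTF-8 encoding scheme. Identification of the <EF BB BF> byte sequence at the beginning of a data stream can, however, be taken as a near-certain indication that the data stream is using the UTF-8 encoding scheme.

http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf

I will not comment on whether to use a BOM in web programming; where software cannot cope with it, it obviously cannot be used.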

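Recently I have been learning cocos2d-x, which is pure C++ code. When the code contained non-ASCII characters such as Chinese,

I found that errors occurred.

The code was written on a Mac with Xcode and then compiled on Windows with VS.

In the end I converted all the source files to the with-BOM format and compilation passed. Linking failed, but I think that was not an encoding problem.

People usually say you should not use Chinese when writing C++ code, but we programmers often want something that is comfortable for us to read, so why should Chinese be off limits? So I wrote a helloworld.cpp-style file on Windows whose output was in Chinese, saved it as UTF-8 with BOM, copied it to the Mac and compiled it with g++. It compiled and ran correctly, and Xcode opened and displayed the source file correctly as well.

So my suggestion is that if a program has to run on Windows, Mac and Linux, the source code is best saved as UTF-8 with BOM, which is more portable.

g++ does not accept UTF-16, whether big-endian or little-endian.

Alternatively, use UTF-8 without BOM and keep characters outside the ASCII range (above 127) out of the code.

As for the claim that UTF-8 without BOM is the standard, I think that is a statement colored by personal feeling.

The real standard is that the BOM is optional. Why optional? Because sometimes leaving out the BOM causes errors. Take Windows, with its long history: users in many countries use Windows, and their files are written in their local ANSI encodings, such as GBK and GB2312 in mainland China and Big5 in Hong Kong and Taiwan. Because these encodings were designed for the characters used locally, files stored in them are smaller, so they are used heavily and exist in large numbers. Microsoft cannot ignore the files of billions of users worldwide and blindly change how files are decoded. Microsoft is also one of the bodies that define Unicode, so UTF-8 with BOM conforms to the international standard too.

Perhaps because of their authors' personal reasons, or perhaps for the sake of efficiency, many programs cannot correctly tell whether a UTF-8 file has a BOM, which leads to all kinds of mojibake.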
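Handling both cases when reading is not hard in practice. As a minimal sketch in Python, the `utf-8-sig` codec accepts UTF-8 input with or without a BOM and drops the BOM if it is there; the helper name `read_utf8` is an assumption for this example.

```python
def read_utf8(path) -> str:
    """Read a UTF-8 text file, ignoring a leading BOM if there is one."""
    # "utf-8-sig" strips EF BB BF when present and otherwise behaves
    # like plain "utf-8".
    with open(path, encoding="utf-8-sig", newline="") as f:
        return f.read()


def read_utf8_naive(path) -> str:
    """Plain "utf-8" keeps the BOM as an invisible U+FEFF character."""
    with open(path, encoding="utf-8", newline="") as f:
        return f.read()
```

The naive version is where stray invisible characters at the start of the text usually come from.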

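I do not want to say which one is the standard, nor do I want to attack any company or group.

Microsoft is not wrong to insist on the BOM, because it does so with users in mind.

It may inconvenience those of us who write programs, but the largest group of computer users is not programmers.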

Because of the way it is encoded, UTF-8 is independent of byte order, so it does not need a BOM.
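A quick way to see this byte-order independence is to compare the encodings of a single character. This is an illustrative Python snippet; the character chosen is arbitrary.

```python
ch = "中"  # U+4E2D

utf16_be = ch.encode("utf-16-be")  # bytes 4E 2D
utf16_le = ch.encode("utf-16-le")  # bytes 2D 4E, the same unit byte-swapped
utf8 = ch.encode("utf-8")          # bytes E4 B8 AD, only one possible order

# UTF-16 needs to say which order is in use; that is what the BOM does.
bom_be = "\ufeff".encode("utf-16-be")  # FE FF
bom_le = "\ufeff".encode("utf-16-le")  # FF FE
```

UTF-16 serializes 16-bit units, so two byte orders exist; UTF-8 is defined directly as a byte sequence, so there is nothing to swap.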

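I think Windows basically has to take the blame for "UTF-8 with BOM". I am not entirely sure whether a UTF-8 file is allowed to carry a BOM, but precisely because it is unnecessary, a lot of cross-platform software does not actually support this format.

As for what a BOM is, anyone interested can read my rambling write-up, 編碼歪傳 (番外篇).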

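It is a goose with its head on versus a goose with its head removed; some of the dumber editors mistake the headless goose for a duck…

UTF-8 with BOM is a downright rogue! Windows always thinks itself clever and does things nobody else can understand! UTF-8 does not need a BOM header! From when I first started learning to code (what I make can hardly be called programs) until now, I have lost count of how many times this BOM header has bitten me, especially as someone entirely self-taught. Do you know how long it can take to find one bug? The only difference between having a BOM and not having one is the BOM header itself; for details, see the top-ranked answers from the experts.

A Windows-only oddity.

Please use UTF-8 without a BOM header! The bugs it causes include but are not limited to:

鍩 (thanks to @飛揚; see his answer)
blank lines in HTML
unexplained gaps between divs
mojibake
and if you use SSL, you will definitely have problems!

While I am at it, let me also express my contempt for SONY's Memory Stick and the iPhone's connector. Let this kind of rant be folded away.

The "鍩" in PHP: anyone who has used it knows…

Notepad++ automatically saving as UTF-8 with BOM is rather nasty.

A few weeks ago I was still struggling with BOM problems…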
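The stray character mentioned in the answers above (written 鍩 on this page, 锘 in its simplified form) is what the first two BOM bytes look like when a UTF-8 file with BOM is read as GBK. A small Python sketch of the effect follows; the `<?php` payload is just an illustrative stand-in for the start of a PHP file.

```python
import codecs

# A PHP file saved as UTF-8 with BOM: EF BB BF followed by the source text.
php_file = codecs.BOM_UTF8 + b"<?php"

# Read as GBK, the bytes EF BB form the two-byte character 锘; the
# remaining BF is not a complete GBK character, so it is replaced.
seen_as_gbk = php_file.decode("gbk", errors="replace")

first_char = seen_as_gbk[0]
```

Because PHP sends everything before `<?php` to the output, those three bytes end up at the top of the page, which is where the blank lines and gaps in the list come from.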

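As @梁海 said, "UTF-8 without BOM is the standard form". That is indeed the case, and the BOM-less form is the more widely used, so I personally recommend the BOM-less form in general, switching to the BOM form only when a problem comes up.

Windows saves with a BOM: if you save a UTF-8 txt file with Notepad, it actually has a BOM, which is worth keeping in mind.

Different text editors also name the two variants slightly differently. EditPlus, for example, calls the BOM form UTF-8+ and the BOM-less form UTF-8, whereas Notepad++ calls the BOM form standard UTF-8 and the BOM-less form "UTF-8 without BOM".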

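Only 吳秀軍's answer to this question is correct. Everyone here touting BOM-less UTF-8 is long on hair and short on insight (one even recommended a Mac; allow me a chuckle), and I will not deign to argue with them.

I will just ask: how do you deal with this situation? At some point you save a document containing Chinese as BOM-less UTF-8.

The next time, you or someone else opens the file in an editor, which automatically detects that it is a UTF-8 file (or you explicitly open it as UTF-8, though I assume nobody is that diligent). You edit it and save it, and the file is still UTF-8.

So far everything is lovely and runs exactly as in the ideal case.

Is it? Perhaps at some point your edits happen to remove all the Chinese content from the file, yet you save it as UTF-8 as usual, and nothing looks amiss.

Here is the problem: the next time the file is opened, how does the editor identify its encoding? ASCII? GBK? Or UTF-8? UTF-8's compatibility with ASCII is indeed a strength, but at times that very strength becomes a weakness that hides problems.

Hence: the BOM way is good; add a BOM and stay safe.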
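The ambiguity described in this answer is easy to reproduce. In the Python sketch below, a file containing only ASCII characters is encoded under several encodings; the list of encodings and the sample text are arbitrary choices for illustration.

```python
ascii_only = "hello, world\n"

encodings = ["ascii", "gbk", "big5", "shift_jis", "utf-8"]
payloads = {name: ascii_only.encode(name) for name in encodings}

# Every encoding yields exactly the same bytes, so nothing in the file
# can tell an editor which one was intended.
indistinguishable = len(set(payloads.values())) == 1

# With a BOM the file carries an explicit UTF-8 signature.
signed = ascii_only.encode("utf-8-sig")
has_signature = signed.startswith(b"\xef\xbb\xbf")
```

Whether that signature is worth its side effects is exactly what the rest of the thread argues about.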

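Addendum: if you live only in the world of professional programmers, and especially if you are on Linux, then UTF-8 really is close to the ultimate solution.

But most people in the world still use Windows. The person who hands you this file may be a teammate, someone from another department, or a customer. You cannot guarantee that everyone will follow your standard in every case, and besides, the vast majority of people know nothing about encodings. If something goes wrong, to them it is your problem, not theirs: the file was fine when they gave it to you.

You may want to argue, but a mature professional should learn to listen to and understand what customers need rather than explain professional problems to them.

This is why Windows Notepad insists on adding a BOM to UTF-8: for compatibility with legacy encodings. The UNIX camp rejects UTF-8 with BOM so that its ancient programs can keep running. These differing decisions reflect each side's own interests, and neither is really right or wrong.

But which is more troublesome and more costly to fix: having programs add a "chomp UTF-8 BOM" feature, or having hundreds of millions of users with no technical background face mojibake? I think the answer is obvious.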

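The encoding of a text is metadata about that text. HTML, XML and the like already declare the encoding in their header, so they do not need a BOM.

But for a text file with no metadata, on what grounds does *nix simply decree that it is UTF-8? Why can it not be GB2312, or JIS? So I think the Windows approach reflects thorough consideration and a responsible attitude toward users.

Also: the claim that Unicode does not recommend a BOM for UTF-8 is made up out of thin air.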

I have been bitten by this; since then I take extra care when saving files. Of course, GBK -> UTF-8 is another trap.

What on earth: because of this BOM, when importing a CSV into MongoDB the first field was always wrong, so a find using the first field as the condition returned nothing! It cost me a whole evening.

Ignore the pile of nonsense above that misses the point; read the official explanation from The Unicode Consortium (The Unicode Standard 5.0, Electronic edition, Copyright 1991-2007 Unicode, Inc., Section 16.8 Specials):

16.8 Specials

The Specials block contains code points that are interpreted as neither control nor graphic characters but that are provided to facilitate current software practices. For information about the noncharacter code points U+FFFE and U+FFFF, see Section 16.7, Noncharacters.

Byte Order Mark (BOM): U+FEFF

For historical reasons, the character U+FEFF used for the byte order mark is named zero width no-break space. Except for compatibility with versions of Unicode prior to Version 3.2, U+FEFF is not used with the semantics of zero width no-break space (see Section 16.2, Layout Controls). Instead, its most common and most important usage is in the following two circumstances:

1. Unmarked Byte Order. Some machine architectures use the so-called big-endian byte order, while others use the little-endian byte order. When Unicode text is serialized into bytes, the bytes can go in either order, depending on the architecture. Sometimes this byte order is not externally marked, which causes problems in interchange between different systems.
2. Unmarked Character Set. In some circumstances, the character set information for a stream of coded characters (such as a file) is not available. The only information available is that the stream contains text, but the precise character set is not known.

In these two cases, the character U+FEFF is used as a signature to indicate the byte order and the character set by using the byte serializations described in Section 3.10, Unicode Encoding Schemes. Because the byte-swapped version U+FFFE is a noncharacter, when an interpreting process finds U+FFFE as the first character, it signals either that the process has encountered text that is of the incorrect byte order or that the file is not valid Unicode text.

In the UTF-16 encoding scheme, U+FEFF at the very beginning of a file or stream explicitly signals the byte order.

The byte sequences <FE FF> or <FF FE> may also serve as a signature to identify a file as containing UTF-16 text. Either sequence is exceedingly rare at the outset of text files using other character encodings, whether single- or multiple-byte, and therefore not likely to be confused with real text data. For example, in systems that employ ISO Latin-1 (ISO/IEC 8859-1) or the Microsoft Windows ANSI Code Page 1252, the byte sequence <FE FF> constitutes the string <thorn, y diaeresis> "þÿ"; in systems that employ the Apple Macintosh Roman character set or the Adobe Standard Encoding, this sequence represents the sequence <ogonek, hacek> "˛ˇ"; in systems that employ other common IBM PC code pages (for example, CP 437, 850), this sequence represents <black square, no-break space> "■ ".

In UTF-8, the BOM corresponds to the byte sequence <EF BB BF>. Although there are never any questions of byte order with UTF-8 text, this sequence can serve as signature for UTF-8 encoded text where the character set is unmarked. As with a BOM in UTF-16, this sequence of bytes will be extremely rare at the beginning of text files in other character encodings. For example, in systems that employ Microsoft Windows ANSI Code Page 1252, <EF BB BF> corresponds to the sequence <i diaeresis, guillemet, inverted question mark> "ï»¿".

For compatibility with versions of the Unicode Standard prior to Version 3.2, the code point U+FEFF has the word-joining semantics of zero width no-break space when it is not used as a BOM. In new text, these semantics should be encoded by U+2060 word joiner. See "Line and Word Breaking" in Section 16.2, Layout Controls, for more information.

Where the byte order is explicitly specified, such as in UTF-16BE or UTF-16LE, then all U+FEFF characters, even at the very beginning of the text, are to be interpreted as zero width no-break spaces. Similarly, where Unicode text has known byte order, initial U+FEFF characters are not required, but for backward compatibility are to be interpreted as zero width no-break spaces. For example, for strings in an API, the memory architecture of the processor provides the explicit byte order. For databases and similar structures, it is much more efficient and robust to use a uniform byte order for the same field (if not the entire database), thereby avoiding use of the byte order mark.

Systems that use the byte order mark must recognize when an initial U+FEFF signals the byte order. In those cases, it is not part of the textual content and should be removed before processing, because otherwise it may be mistaken for a legitimate zero width no-break space. To represent an initial U+FEFF zero width no-break space in a UTF-16 file, use U+FEFF twice in a row. The first one is a byte order mark; the second one is the initial zero width no-break space. See Table 16-4 for a summary of encoding scheme signatures.

Table 16-4. Unicode Encoding Scheme Signatures

Encoding Scheme         Signature
UTF-8                   EF BB BF
UTF-16 Big-endian       FE FF
UTF-16 Little-endian    FF FE
UTF-32 Big-endian       00 00 FE FF
UTF-32 Little-endian    FF FE 00 00

If U+FEFF had only the semantics of a signature code point, it could be freely deleted from text without affecting the interpretation of the rest of the text. Carelessly appending files together, for example, can result in a signature code point in the middle of text. Unfortunately, U+FEFF also has significance as a character. As a zero width no-break space, it indicates that line breaks are not allowed between the adjoining characters. Thus U+FEFF affects the interpretation of text and cannot be freely deleted. The overloading of semantics for this code point has caused problems for programs and protocols. The new character U+2060 word joiner has the same semantics in all cases as U+FEFF, except that it cannot be used as a signature. Implementers are strongly encouraged to use word joiner in those circumstances whenever word joining semantics are intended.

An initial U+FEFF also takes a characteristic form in other charsets designed for Unicode text. (The term "charset" refers to a wide range of text encodings, including encoding schemes as well as compression schemes and text-specific transformation formats.) The characteristic sequences of bytes associated with an initial U+FEFF can serve as signatures in those cases, as shown in Table 16-5.

Table 16-5. U+FEFF Signature in Other Charsets

Charset       Signature
SCSU          0E FE FF
BOCU-1        FB EE 28
UTF-7         2B 2F 76 38 or 2B 2F 76 39 or 2B 2F 76 2B or 2B 2F 76 2F
UTF-EBCDIC    DD 73 66 73

Most signatures can be deleted either before or after conversion of an input stream into a Unicode encoding form. However, in the case of BOCU-1 and UTF-7, the input byte sequence must be converted before the initial U+FEFF can be deleted, because stripping the signature byte sequence without conversion destroys context necessary for the correct interpretation of subsequent bytes in the input sequence.

Specials: U+FFF0–U+FFF8

The nine unassigned Unicode code points in the range U+FFF0..U+FFF8 are reserved for special character definitions.

Annotation Characters: U+FFF9–U+FFFB

An interlinear annotation consists of annotating text that is related to a sequence of annotated characters. For all regular editing and text-processing algorithms, the annotated characters are treated as part of the text stream. The annotating text is also part of the content, but for all or some text processing, it does not form part of the main text stream. However, within the annotating text, characters are accessible to the same kind of layout, text-processing, and editing algorithms as the base text. The annotation characters delimit the annotating and the annotated text, and identify them as part of an annotation. See Figure 16-4.

[Figure 16-4. Annotation Characters]

The annotation characters are used in internal processing when out-of-band information is associated with a character stream, very similarly to the usage of U+FFFC object replacement character. However, unlike the opaque objects hidden by the latter character, the annotation itself is textual.

Conformance. A conformant implementation that supports annotation characters interprets the base text as if it were part of an unannotated text stream. Within the annotating text, it interprets the annotating characters with their regular Unicode semantics.

U+FFF9 interlinear annotation anchor is an anchor character, preceding the interlinear annotation. The exact nature and formatting of the annotation depend on additional information that is not part of the plain text stream. This situation is analogous to that for U+FFFC object replacement character.

U+FFFA interlinear annotation separator separates the base characters in the text stream from the annotation characters that follow. The exact interpretation of this character depends on the nature of the annotation. More than one separator may be present. Additional separators delimit parts of a multipart annotating text.

U+FFFB interlinear annotation terminator terminates the annotation object (and returns to the regular text stream).

Use in Plain Text. Usage of the annotation characters in plain text interchange is strongly discouraged without prior agreement between the sender and the receiver, because the content may be misinterpreted otherwise. Simply filtering out the annotation characters on input will produce an unreadable result or, even worse, an opposite meaning. On input, a plain text receiver should either preserve all characters or remove the interlinear annotation characters as well as the annotating text included between the interlinear annotation separator and the interlinear annotation terminator.

When an output for plain text usage is desired but the receiver is unknown to the sender, these interlinear annotation characters should be removed as well as the annotating text included between the interlinear annotation separator and the interlinear annotation terminator.

This restriction does not preclude the use of annotation characters in plain text interchange, but it requires a prior agreement between the sender and the receiver for correct interpretation of the annotations.

Lexical Restrictions. If an implementation encounters a paragraph break between an anchor and its corresponding terminator, it shall terminate any open annotations at this point. Anchor characters must precede their corresponding terminator characters. Unpaired anchors or terminators shall be ignored. A separator occurring outside a pair of delimiters, shall be ignored. Annotations may be nested.

Formatting. All formatting information for an annotation is provided by higher-level protocols. The details of the layout of the annotation are implementation-defined. Correct formatting may require additional information that is not present in the character stream, but rather is maintained out-of-band. Therefore, annotation markers serve as placeholders for an implementation that has access to that information from another source. The formatting of annotations and other special line layout features of Japanese is discussed in JIS X 4501.

Input. Annotation characters are not normally input or edited directly by end users. Their insertion and management in text are typically handled by an application, which will present a user interface for selecting and annotating text.

Collation. With the exception of the special case where the annotation is intended to be used as a sort key, annotations are typically ignored for collation or optionally preprocessed to act as tie breakers only. Importantly, annotation base characters are not ignored, but rather are treated like regular text.

Replacement Characters: U+FFFC–U+FFFD

U+FFFC. The U+FFFC object replacement character is used as an insertion point for objects located within a stream of text. All other information about the object is kept outside the character data stream. Internally it is a dummy character that acts as an anchor point for the object's formatting information. In addition to assuring correct placement of an object in a data stream, the object replacement character allows the use of general stream-based algorithms for any textual aspects of embedded objects.

U+FFFD. The U+FFFD replacement character is the general substitute character in the Unicode Standard. It can be substituted for any "unknown" character in another encoding that cannot be mapped in terms of known Unicode characters (see Section 5.3, Unknown and Missing Characters).

I see plenty of "programmers" here who cannot even use Notepad....

I don't know what Microsoft is up to; handling Chinese is troublesome enough as it is.
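The CSV import complaint earlier in the thread (a first field that never matches) is consistent with the BOM ending up inside the first column name. The Python sketch below reproduces that effect with the standard `csv` module only; the sample data and the helper name `first_header` are made up for illustration, and no claim is made about how MongoDB's own import tools behave.

```python
import csv
import io

# A small CSV file saved as UTF-8 with BOM.
raw = b"\xef\xbb\xbfname,age\nalice,30\n"


def first_header(data: bytes, encoding: str) -> str:
    """Decode the CSV bytes and return the name of the first column."""
    reader = csv.DictReader(io.StringIO(data.decode(encoding)))
    return reader.fieldnames[0]


naive = first_header(raw, "utf-8")      # BOM kept: the key is "\ufeffname"
clean = first_header(raw, "utf-8-sig")  # BOM stripped: the key is "name"
```

The two keys look identical when printed, which is why a lookup on "name" silently finds nothing in the naive case.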

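Notepad++ has two options in its Encoding menu:

"Encode in UTF-8" and "Encode in UTF-8 without BOM".

At a glance, you would surely pick "Encode in UTF-8".

As a result, when copying things from Notepad++ into MongoDB, a number of extra bytes appeared for no apparent reason.

If you do not install Notepad++ and use the default Notepad instead, the trap is even bigger: it adds a BOM by default and you cannot opt out.

A BOM problem once cost me several hours of debugging. Lesson learned: be very careful when programming on Windows.

When posting problems on pintia, the manual said UTF8, so out of habit I included a BOM. During internal testing every submission came out wrong, and in the end I spent some time converting the encoding and uploading everything again.

I would like to know how to convert BOM-less UTF-8 into UTF-8 with BOM on UNIX; garbled CSV files are killing me. Help from an expert, please.

For the sake of their own interests, they have made a mess of UTF-8.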
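For the question above about converting BOM-less UTF-8 into UTF-8 with BOM on UNIX, one possible approach is a few lines of Python, which runs the same way on any platform. This is a minimal sketch; the function name `add_utf8_bom` is hypothetical, and the validation step is a design choice of this example rather than a requirement.

```python
import codecs
from pathlib import Path


def add_utf8_bom(path) -> bool:
    """Prepend a UTF-8 BOM to the file unless it already has one.

    Returns True if a BOM was added, False if the file already had one.
    Raises UnicodeDecodeError if the content is not valid UTF-8, so that
    a file in some other encoding is not mislabeled.
    """
    p = Path(path)
    data = p.read_bytes()
    if data.startswith(codecs.BOM_UTF8):
        return False
    data.decode("utf-8")  # validation only; the result is discarded
    p.write_bytes(codecs.BOM_UTF8 + data)
    return True
```

Checking for an existing BOM first keeps the function safe to run twice on the same file.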
