Process a file that starts with a BOM (FF FE)
Asked 8 years, 4 months ago. Modified 2 years, 11 months ago. Viewed 14k times.

I received a .csv file with the FF FE BOM:

    $ head -n1 dotan.csv | hd
    00000000  ff fe 41 00 64 00 20 00  67 00 72 00 6f 00 75 00  |..A.d. .g.r.o.u.|

When using awk to parse it I'm getting a bunch of null bytes, which I suspect is due to the byte order. How can I swap the byte order on this file (using the CLI) so that normal tools will work with it?

Note that I think that this file is only ASCII characters (except for the BOM), but I cannot confirm that as grep thinks that it is a binary file:

    $ grep -P '^[\x00-\x7f]' dotan.csv
    Binary file dotan.csv matches

Searching for the same string in VIM shows every character matching!

Using iconv to convert to ASCII does not get rid of \x00 values, actually it makes the problem worse as now they look like null bytes instead of UTF-8!

    $ iconv -f UTF-8 -t ASCII dotan.csv > fixed.txt
    iconv: illegal input sequence at position 0
    $ iconv -f UTF-8 -t ASCII//IGNORE dotan.csv > fixed.txt
    $ head -n1 fixed.txt | hd
    00000000  41 00 64 00 20 00 67 00  72 00 6f 00 75 00 70 00  |A.d. .g.r.o.u.p.|
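The manual check with hd in the question can be sketched as a small shell helper that looks at the leading bytes and names the encoding they suggest. This is only a rough sketch: detect_bom and the sample file names are hypothetical, and a BOM is a hint rather than proof of the encoding (for example, FF FE 00 00 is treated here as UTF-32LE, although it could in principle be UTF-16LE text starting with a NUL character).

```shell
#!/bin/sh
# detect_bom FILE: print the encoding suggested by the file's first bytes.
# (detect_bom is a hypothetical helper name, not part of any existing tool.)
detect_bom() {
    # First four bytes as lowercase hex with all whitespace removed,
    # e.g. "fffe4100" for a UTF-16LE file that starts with "A".
    sig=$(od -An -tx1 -N4 "$1" | tr -d '[:space:]')
    case $sig in
        efbbbf*)  echo "UTF-8" ;;
        fffe0000) echo "UTF-32LE" ;;
        0000feff) echo "UTF-32BE" ;;
        fffe*)    echo "UTF-16LE" ;;
        feff*)    echo "UTF-16BE" ;;
        *)        echo "unknown" ;;
    esac
}

# Small sample files (octal escapes: \377\376 = FF FE, \357\273\277 = EF BB BF).
printf '\377\376A\000' > le.txt
printf '\357\273\277a' > u8.txt
printf 'plain' > plain.txt

detect_bom le.txt
detect_bom u8.txt
detect_bom plain.txt
```

The 32-bit patterns are tested before the 16-bit ones because FF FE is a prefix of the UTF-32LE mark.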
Question tags: text-processing, character-encoding, unicode

Question asked Jun 15, 2014 at 8:07 by dotancohen; edited Jun 15, 2014 at 22:56 by Gilles 'SO- stop being evil'

Comments on the question:

Was the CSV file created on Windows or Mac? – cuonglm, Jun 15, 2014 at 8:25

Can you give a portion of the file? – cuonglm, Jun 15, 2014 at 8:27

Here is a link to an anonymized portion of the file which preserves the unique problems with it. Thank you! – dotancohen, Jun 15, 2014 at 9:04

3 Answers

Answer (score 18), by cuonglm, answered Jun 15, 2014 at 8:52, edited Jun 15, 2014 at 9:20

From this Wikipedia article, FF FE means UTF-16LE. So you should tell iconv to convert from UTF-16LE to UTF-8:

    iconv -f UTF-16LE -t UTF-8 dotan.csv > fixed.txt

Comments on this answer:

Perfect, thank you! I had the UTF-8 and UTF-16 BOM mixed up: I thought that FF FE and FE FF were UTF-8 and I never knew the UTF-16 BOM(s). Actually, those are UTF-16 BOMs, and I never knew the (useless) UTF-8 BOM! – dotancohen, Jun 15, 2014 at 9:19

@dotancohen: I tested on my Fedora and the tail solution works fine. What OS do you use? – cuonglm, Jun 15, 2014 at 9:22

This doesn't work (i.e. remove the BOM) for version "iconv (GNU libiconv 1.14)" in Git Bash on Windows. But (for whatever reason) using just UTF-16 instead of one of the byte-order versions works. – Kenny Evitt, Mar 21, 2017 at 20:10

Answer (score 5), by nisetama, answered Dec 22, 2015 at 3:03

dos2unix also removes BOMs and converts UTF-16 to UTF-8:

    $ printf %s あ | recode ..utf16 > a; xxd -p a; dos2unix a; xxd -p a
    feff3042
    dos2unix: converting file a to Unix format...
    e38182

dos2unix also removes UTF-8 BOMs:

    $ printf %b '\xef\xbb\xbfa' > a; dos2unix a; xxd -p a
    dos2unix: converting file a to Unix format...
    61

Answer (score 1), by Charles Merriam, answered Nov 6, 2019 at 20:59, edited Nov 6, 2019 at 21:30

Also answered on Stack Overflow: How can I remove the BOM from a UTF-8 file? @rici has a good answer.

Short answer: sed -i $'1s/^\uFEFF//' file.txt, but not on BSD or OS/X.
Another answer: vi file.txt, :set nobomb, :w, simple but manual
Install dos2unix; dos2unix -r file.txt

These marks have several possible meanings, including just that the file is UTF-8; see the Wikipedia Article. Windows programs love to add these marks. Most editors will not remove these marks.
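The iconv approach from the answers can be tried end to end on a throwaway file. The following is a minimal sketch, assuming an iconv that accepts the UTF-16LE and UTF-8 encoding names; sample.csv is a made-up stand-in for dotan.csv. Because some iconv builds reportedly pass the BOM through when the byte order is named explicitly (see the comment about GNU libiconv 1.14 above), the sketch drops the two BOM bytes itself before decoding.

```shell
#!/bin/sh
# Build a tiny UTF-16LE sample: the FF FE BOM, then "Ad group" and a newline,
# with each ASCII character followed by a 00 byte.
printf '\377\376A\000d\000 \000g\000r\000o\000u\000p\000\n\000' > sample.csv

# Skip the 2-byte BOM (start output at byte 3), then decode the remainder
# as little-endian UTF-16 and write it out as UTF-8.
tail -c +3 sample.csv | iconv -f UTF-16LE -t UTF-8 > fixed.txt

# The result is plain text that awk, grep and friends can read.
cat fixed.txt
```

Skipping the BOM by hand only makes sense once the first two bytes are known to be FF FE. According to the comment above, passing plain UTF-16 as the source encoding is an alternative that lets iconv consume the BOM itself.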