How can I remove the BOM from a UTF-8 file?


Asked 5 years, 2 months ago. Modified 6 months ago. Viewed 166k times.

Question (score 139)

I have a file in UTF-8 encoding with BOM and want to remove the BOM. Are there any Linux command-line tools to remove the BOM from the file?

    $ file test.xml
    test.xml: XML 1.0 document, UTF-8 Unicode (with BOM) text, with very long lines

Tags: command-line, files, unicode

– asked Jul 23, 2017 at 10:05 by m13r; edited Jul 23, 2017 at 10:06 by Michael Homer

Comments:

Similar: AWK with BOM: Is there any cool way to handle Unicode BOM with regexp? – Stéphane Chazelas, Jul 23, 2017 at 10:40

I've made a fairly simple tool to do just that a few months ago: oskog97.com/read/?path=/small-scripts/killbom&referer=/… Might be worth installing something like it in /usr/local/bin if you have many UTF-8 encoded files with BOMs. – Oskar Skog, Jul 23, 2017 at 11:24

Weirdly, cross-posted at stackoverflow.com/questions/45240387/… – tripleee, Jan 12, 2021 at 7:27

In UTF-8, U+FEFF is encoded as 3 bytes: EF BB BF. One thing you could do is combine xxd and xxd -r to change those first three bytes to something within the printable ASCII range, like 41 41 41, so that "AAA" will appear in the BOM's place, which you can then simply delete and save with a regular text editor. Bit of a roundabout way, but it works.
– Braden Best, Aug 11, 2021 at 23:29

10 Answers

Answer (score 140)

If you're not sure whether the file contains a UTF-8 BOM, then this (assuming the GNU implementation of sed) will remove the BOM if it exists, or make no changes if it doesn't.

    sed '1s/^\xEF\xBB\xBF//' < orig.txt > new.txt

You can also overwrite the existing file with the -i option:

    sed -i '1s/^\xEF\xBB\xBF//' orig.txt

If you are using the BSD version of sed (e.g. macOS), then you need to have bash do the escaping:

    sed $'1s/\xef\xbb\xbf//' < orig.txt > new.txt

– answered Jul 23, 2017 at 14:08 by CSM; edited May 28, 2020 at 13:05 by Matthew Buckett

Comments:

This may not work in a UTF-8 locale, but prepending a locale override to C or POSIX will always work. – hildred, Jul 23, 2017 at 15:29

@hildred I've tested it with the en_US.UTF-8 locale and it worked. When will it fail? – m13r, Jul 24, 2017 at 6:55

@m13r, it depends on the version of sed and compile options. In the failure case, a very new version of sed with Unicode character classes will bring the three-byte sequence in as a single character, which does not match the three-character sequence. However, in such a case you can do a sixteen-bit character match. However, this is a new feature and not universally present. If you want to test, I recommend compiling the latest version. – hildred, Jul 24, 2017 at 16:25

To fix it to work with a Unicode-enabled sed, do LC_ALL=C sed '1s/^\xEF\xBB\xBF//' – Joshua, Jul 24, 2017 at 17:41

@mazunki, 1s/ means only search the first line; other lines are unaffected. The ^ means only match at the start of the (first) line. \xEF\xBB\xBF is the UTF-8 BOM (escaped hex string). // means replace with nothing. I could have added 1 to the end (for 1s/^\xEF\xBB\xBF//1), which would mean only match the first occurrence of the pattern on the line. But as the search is anchored with ^, this won't make any difference. If the file doesn't have the BOM at the start of the first line, the pattern won't match, and thus no change is made. – CSM, Oct 27, 2019 at 18:47

Answer (score 117)

A BOM doesn't make sense in UTF-8. Those are generally added by mistake by bogus software on Microsoft OSes.
dos2unix will remove it and also take care of other idiosyncrasies of Windows text files.

    dos2unix test.xml

– answered Jul 23, 2017 at 10:42 by Stéphane Chazelas

Comments:

I agree that a UTF-8 encoded BOM does not make sense, but believe it or not, there are lots of people who think it is a great idea that helps differentiate UTF-8 from other 8-bit encodings. So it is a matter of taste. Windows Notepad adds a BOM on purpose. – Johan Myréen, Jul 23, 2017 at 14:02

What does it matter if it makes sense or not, when the context is just a question on how to remove it? According to Wikipedia, Notepad requires the BOM to recognize a file as UTF-8, and Google Docs also adds it while exporting a file as text. I doubt they all do it by mistake. – ilkkachu, Jul 23, 2017 at 14:09

Is there a way of not converting the line endings and just removing the BOM with dos2unix? – m13r, Jul 25, 2017 at 7:55

@m13r Then use the sed script in this answer. That will remove only the BOM (if it exists); nothing else will be changed. – user232326, Jul 26, 2017 at 5:51

@JohanMyréen Yes, but it is not correct to call them UTF-8. They are not UTF-8 files. They are UTF-8-with-BOM files, which is another file format. I suppose those Windows freaks won't be happy getting ODT files called MS Office files :) – 9ilsdx9rvj0lo, Nov 9, 2018 at 9:47

Answer (score 84)

Using VIM

Open the file in VIM:

    vi text.xml

Remove BOM encoding:

    :set nobomb

Save and quit:

    :wq

For a non-interactive solution, try the following command line:

    vi -c ":set nobomb" -c ":wq" text.xml

That should remove the BOM, save the file and quit, all from the command line.

– answered Dec 24, 2017 at 18:05 by Joshua Pinter; edited Aug 4, 2020 at 20:54

Comments:

Oddly with vim 8 on a mac, I have a csv utf-8 file made by Excel and it starts with a BOM, yet :set nobomb doesn't modify or remove it. – dlamblin, Oct 9, 2019 at 21:11

This is much faster than tail on large files.
– user239558, Dec 2, 2019 at 20:14

For multiple files: vim -c ":bufdo set nobomb|update" -c "q" * – Dennis Williamson, Sep 7, 2021 at 13:41

Answer (score 33)

It is possible to remove the BOM from a file with the tail command:

    tail -c +4 withBOM.txt > withoutBOM.txt

Be aware that this chops the first 3 bytes from the file (output starts at the 4th byte), so be sure that the file really contains the BOM before running tail.

– answered Jul 23, 2017 at 10:05 by m13r; edited Oct 13, 2021 at 14:30

Comments:

Why 4? The BOM has 3 bytes. – deviantfan, Jul 23, 2017 at 17:12

@deviantfan Which is why you need to start at the 4th byte if you want to skip it. – Stéphane Chazelas, Jul 23, 2017 at 18:33

tail is using 1-based indexing?! WTF! – CodesInChaos, Jul 23, 2017 at 19:31

@CodesInChaos, tail -c -1 or tail -c 1 (what tail is generally used for) is the content starting with the last byte, tail -c +1 starting with the first byte. tail -c 0 / tail -c +0 for that would be a lot more unintuitive. – Stéphane Chazelas, Jul 23, 2017 at 23:05

@deviantfan: (dd bs=1 count=3 of=/dev/null; cat) <input >output. Or with GNU (head -c3 >/dev/null; cat), even in UTF-8 or other non-single-byte locale; GNU head does 'char' = byte. – dave_thompson_085, Jul 24, 2017 at 6:16

Answer (score 7)

You can use

    LANG=C LC_ALL=C sed -e 's/\r$//;1s/^\xef\xbb\xbf//' -i -- filename

to remove the byte order mark from the beginning of the file, if it has any, as well as convert any CRLF newlines to LF only. The LANG=C LC_ALL=C tells the shell you want the command to run in the default C locale (also known as the default POSIX locale), where the three bytes forming the Byte Order Mark are treated as bytes. The -i option to sed means in-place. If you use -i.old, then sed saves the original file as filename.old, and the new file (with the modifications, if any) as filename.
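To see what that command does to a concrete file, here is a minimal sketch. It assumes GNU sed (for the \xHH escapes and this form of -i) and uses a hypothetical throwaway file created with mktemp; nothing here comes from the answer beyond the sed command itself. It writes a BOM plus CRLF line endings, runs the command with a .old backup suffix, and dumps the first bytes of both files.

```shell
# Hypothetical sample file: printf writes the BOM (octal 357 273 277)
# followed by two lines with CRLF endings.
sample=$(mktemp)
printf '\357\273\277first\r\nsecond\r\n' > "$sample"

# The command from the answer, keeping the original as "$sample.old".
LANG=C LC_ALL=C sed -e 's/\r$//;1s/^\xef\xbb\xbf//' -i.old -- "$sample"

# Show the first three bytes of the backup and of the rewritten file in hex.
od -An -tx1 -N3 < "$sample.old"
od -An -tx1 -N3 < "$sample"
```

If the BOM was removed, the first dump should begin with ef bb bf and the second should not.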
I personally like to have this sed command as ~/bin/ms-fix; for example, as

    #!/bin/dash
    # Run sed in the C locale so the BOM is matched as three plain bytes.
    export LANG=C LC_ALL=C
    if [ $# -gt 0 ]; then
        # With arguments: fix each file in place, stopping at the first failure.
        for FILE in "$@"; do
            sed -e 's/\r$//;1s/^\xef\xbb\xbf//' -i -- "$FILE" || exit 1
        done
    else
        # Without arguments: act as a filter from standard input to standard output.
        exec sed -e 's/\r$//;1s/^\xef\xbb\xbf//'
    fi

so that if I need to apply this to, say, all C source files and headers (my old code from the MS-DOS era, for example!), I just run

    find . -name '*.[CHch]' -print0 | xargs -r0 ~/bin/ms-fix

or, if I just want to look at such a file without modifying it, I can run

    ~/bin/ms-fix < filename

in my UTF-8 terminal.

– answered Jul 23, 2017 at 19:10 by Nominal Animal; edited Jul 24, 2017 at 14:25

Comments:

Why not simply sed -e 's/\r$//;1s/^\xef\xbb\xbf//' -i -- "$@"? – Stéphane Chazelas, Jul 24, 2017 at 14:02

@StéphaneChazelas: Because I want the script to exit immediately if there is an issue with a replacement, which sed -e 's/\r$//;1s/^\xef\xbb\xbf//' -i -- "$@" does not do; it does return an exit code, but it processes all files listed in the argument list before exiting. – Nominal Animal, Jul 24, 2017 at 14:24

@StéphaneChazelas: The -- before the filename(s) is, of course, important: without it, filenames beginning with a dash may be considered options by sed. I edited those into my answer; thank you for the reminder! – Nominal Animal, Jul 24, 2017 at 14:27

Answer (score 7)

I use a vim one-liner on the regular for this:

    vim --clean -c 'se nobomb|wq' filename
    vim --clean -c 'bufdo se nobomb|wqa' filename1 filename2 ...

– answered Jan 23, 2020 at 19:40 by Robyn Murdock

Comments:

This should also be achievable using VIM's ex personality. – JdeBP, Oct 7, 2020 at 9:46

Answer (score 2)

I have a slightly different problem, and am putting this here for someone who, like me, ends up here with data full of ZERO WIDTH NO-BREAK SPACE characters (which are known as Byte Order Mark when they are the first character of the file).

I got this data by copying out of a Grafana query metrics field, and it had multiple (17) \xef\xbb\xbf sequences (which show up in vim as rate(node{job) in a single line with only 81 actual characters.
I modified Nominal Animal's code just slightly:

    LANG=C LC_ALL=C sed -e 's/\xef\xbb\xbf//g'

And the :set nobomb thing in vim only removes the very first one in the file. I tried this:

    LANG=C vim b

Then vim doesn't show them, but they are still there (even after a write...)

– answered Aug 4, 2020 at 22:15 by Wayne Walker

Answer (score 1)

I had the same question and ended up writing a dedicated utility bom(1) for this. It's available here.

Here's the man page:

    NAME
        bom -- Decode Unicode byte order mark

    SYNOPSIS
        bom --strip [--expect types] [--lenient] [--prefer32] [--utf8] [file]
        bom --detect [--expect types] [--prefer32] [file]
        bom --print type
        bom --list
        bom --help
        bom --version

    DESCRIPTION
        bom decodes, verifies, reports, and/or strips the byte order mark (BOM) at the start of the specified file, if any.

        When no file is specified, or when file is -, read standard input.

    OPTIONS
        -d, --detect
            Report the detected BOM type to standard output and then exit.

            See SUPPORTED BOM TYPES for possible values.

        -e, --expect types
            Expect to find one of the specified BOM types, otherwise exit with an error.

            Multiple types may be specified, separated by commas.

            Specifying NONE is acceptable and matches when the file has no (supported) BOM.

        -h, --help
            Output command line usage help.

        -l, --lenient
            Silently ignore any illegal byte sequences encountered when converting the remainder of the file to UTF-8.

            Without this flag, bom will exit immediately with an error if an illegal byte sequence is encountered.

            This flag has no effect unless the --utf8 flag is given.

        --list
            List the supported BOM types and exit.

        -p, --print type
            Output the byte sequence corresponding to the type byte order mark.

        --prefer32
            Used to disambiguate the byte sequence FF FE 00 00, which can be either a UTF-32LE BOM or a UTF-16LE BOM followed by a NUL character.

            Without this flag, UTF-16LE is assumed; with this flag, UTF-32LE is assumed.

        -s, --strip
            Strip the BOM, if any, from the beginning of the file and output the remainder of the file.

        -u, --utf8
            Convert the remainder of the file to UTF-8, assuming the character encoding implied by the detected BOM.
            For files with no (supported) BOM, this flag has no effect and the remainder of the file is copied unmodified.

            For files with a UTF-8 BOM, the identity transformation is still applied, so (for example) illegal byte sequences will be detected.

        -v, --version
            Output program version and exit.

    SUPPORTED BOM TYPES
        The supported BOM types are:

        NONE        No supported BOM was detected.
        UTF-7       A UTF-7 BOM was detected.
        UTF-8       A UTF-8 BOM was detected.
        UTF-16BE    A UTF-16 (Big Endian) BOM was detected.
        UTF-16LE    A UTF-16 (Little Endian) BOM was detected.
        UTF-32BE    A UTF-32 (Big Endian) BOM was detected.
        UTF-32LE    A UTF-32 (Little Endian) BOM was detected.
        GB18030     A GB18030 (Chinese National Standard) BOM was detected.

    EXAMPLES
        To tell what kind of byte order mark a file has:

            $ bom --detect

        To normalize files with byte order marks into UTF-8, and pass other files through unchanged:

            $ bom --strip --utf8

        Same as previous example, but discard illegal byte sequences instead of generating an error:

            $ bom --strip --utf8 --lenient

        To verify a properly encoded UTF-8 or UTF-16 file with a byte-order-mark and output it as UTF-8:

            $ bom --strip --utf8 --expect UTF-8,UTF-16LE,UTF-16BE

        To just remove any byte order mark and get on with your life:

            $ bom --strip file

    RETURN VALUES
        bom exits with one of the following values:

        0    Success.
        1    A general error occurred.
        2    The --expect flag was given but the detected BOM did not match.
        3    An illegal byte sequence was detected (and --lenient was not specified).

    SEE ALSO
        iconv(1)

        bom: Decode Unicode byte order mark, https://github.com/archiecobbs/bom.

– answered Apr 6 at 19:08 by Archie

Answer (score 0)

Recently I found this tiny command-line tool which adds or removes the BOM on arbitrary UTF-8 encoded files: UTF BOM Utils (new link at github)

A little drawback: you can download only the plain C++ source code. You have to create the makefile (with CMake, for example) and compile it yourself; binaries are not provided on this page.

– answered Oct 16, 2018 at 17:58 by Wernfried Domscheit

Answer (score 0)

I know it's been a while, but since I had a slightly different issue, I'm posting so others may benefit.
My text file was randomly haunted by the characters \fe\ff; luckily for me, they appeared at the start of lines, and the set of allowed characters is limited to alphanumerics.

The vim command below cuts a leading non-alphanumeric character from each line, but use it with caution, as your set of allowed characters might vary.

    :%s/^[^a-zA-Z0-9]//g

– answered Nov 10, 2021 at 10:54 by Smirk; edited Nov 10, 2021 at 11:07 by AdminBee
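The tail answer above warns that the file must really start with a BOM before any bytes are chopped off. A minimal sketch of that check, assuming a POSIX shell with od and tail; the function names has_bom and strip_bom are hypothetical and not taken from any answer:

```shell
#!/bin/sh
# has_bom FILE: succeed only if FILE begins with the bytes EF BB BF.
has_bom() {
    # od prints the first 3 bytes as hex; tr removes the whitespace od adds.
    [ "$(od -An -tx1 -N3 < "$1" | tr -d '[:space:]')" = "efbbbf" ]
}

# strip_bom SRC DST: copy SRC to DST, skipping a leading UTF-8 BOM if present.
strip_bom() {
    if has_bom "$1"; then
        tail -c +4 < "$1" > "$2"   # output starts at byte 4, so 3 bytes are skipped
    else
        cat < "$1" > "$2"          # no BOM: copy the file unchanged
    fi
}
```

With these defined, strip_bom withBOM.txt withoutBOM.txt behaves like the tail command when a BOM is present and like a plain copy otherwise. SRC and DST must be different files, since the output redirection truncates DST before it is read.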


