Convert UTF-8 with BOM to UTF-8 with no BOM in Python

Asked 10 years, 9 months ago. Modified 1 month ago. Viewed 154k times. Score: 101.

Two questions here. I have a set of files which are usually UTF-8 with BOM. I'd like to convert them (ideally in place) to UTF-8 with no BOM. It seems like codecs.StreamRecoder(stream, encode, decode, Reader, Writer, errors) would handle this, but I don't really see any good examples of its usage. Would this be the best way to handle this?

Source files:

    Tue Jan 17$ file brh-m-157.json
    brh-m-157.json: UTF-8 Unicode (with BOM) text

Also, it would be ideal if we could handle different input encodings without explicitly knowing them (I have seen ASCII and UTF-16). It seems like this should all be feasible. Is there a solution that can take any known Python encoding and output as UTF-8 without BOM?

Edit 1: proposed solution from below (thanks!)

    fp = open('brh-m-157.json', 'rw')
    s = fp.read()
    u = s.decode('utf-8-sig')
    s = u.encode('utf-8')
    print fp.encoding
    fp.write(s)

This gives me the following error:

    IOError: [Errno 9] Bad file descriptor

Newsflash: I'm being told in comments that the mistake is that I open the file with mode 'rw' instead of 'r+'/'r+b', so I should eventually re-edit my question and remove the solved part.
Tags: python, utf-8, utf-16, byte-order-mark

asked Jan 17, 2012 at 16:37 by timpone; edited Jan 30, 2012 at 21:15 by tzot

Comment (score 2): You need to open your file for reading plus update, i.e., with an r+ mode. Add b too so that it will work on Windows as well without any funny line ending business. Finally, you'll want to seek back to the beginning of the file and truncate it at the end; please see my updated answer.
– Martin Geisler, Jan 17, 2012 at 21:58

7 Answers

Answer (score 150)

Simply use the "utf-8-sig" codec:

    fp = open("file.txt", "rb")  # binary mode, so read() returns bytes
    s = fp.read()
    u = s.decode("utf-8-sig")

That gives you a unicode string without the BOM. You can then use

    s = u.encode("utf-8")

to get a normal UTF-8 encoded string back in s. If your files are big, then you should avoid reading them all into memory. The BOM is simply three bytes at the beginning of the file, so you can use this code to strip them out of the file:

    import os, sys, codecs

    BUFSIZE = 4096
    BOMLEN = len(codecs.BOM_UTF8)

    path = sys.argv[1]
    with open(path, "r+b") as fp:
        chunk = fp.read(BUFSIZE)
        if chunk.startswith(codecs.BOM_UTF8):
            i = 0  # write position, always BOMLEN bytes behind the read position
            chunk = chunk[BOMLEN:]
            while chunk:
                fp.seek(i)
                fp.write(chunk)
                i += len(chunk)
                fp.seek(BOMLEN, os.SEEK_CUR)  # jump ahead to the next unread byte
                chunk = fp.read(BUFSIZE)
            fp.seek(-BOMLEN, os.SEEK_CUR)
            fp.truncate()  # drop the now-duplicated last BOMLEN bytes

It opens the file, reads a chunk, and writes it out to the file 3 bytes earlier than where it read it. The file is rewritten in place. An easier solution is to write the shorter file to a new file, as in newtover's answer. That would be simpler, but would use twice the disk space for a short period.
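For readers on Python 3, the simpler write-to-a-new-file alternative mentioned above might look like the following minimal sketch. The function name strip_utf8_bom is hypothetical and not part of the original answer, and the sketch assumes a temporary file can be created in the same directory as the target.

```python
import codecs
import os
import shutil
import tempfile


def strip_utf8_bom(path):
    """Rewrite `path` without a leading UTF-8 BOM.

    Hypothetical helper: returns True if a BOM was removed,
    False if the file did not start with one.
    """
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, "rb") as src:
        # Reading the first three bytes consumes the BOM if it is there.
        if src.read(len(codecs.BOM_UTF8)) != codecs.BOM_UTF8:
            return False
        # Create the temporary file next to the original so the final
        # rename stays on the same filesystem.
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "wb") as dst:
                shutil.copyfileobj(src, dst)  # streams in chunks
        except BaseException:
            os.remove(tmp_path)
            raise
    # The source is closed by now, so the replacement also works on Windows.
    os.replace(tmp_path, path)
    return True
```

Note that tempfile.mkstemp creates the new file with restrictive permissions, so the original file's mode and ownership are not carried over; copy them explicitly if that matters.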
As for guessing the encoding, you can just loop through the encodings from most to least specific:

    def decode(s):
        for encoding in "utf-8-sig", "utf-16":
            try:
                return s.decode(encoding)
            except UnicodeDecodeError:
                continue
        return s.decode("latin-1")  # will always work

A UTF-16 encoded file won't decode as UTF-8, so we try with UTF-8 first. If that fails, then we try with UTF-16. Finally, we use Latin-1; this will always work since all 256 bytes are legal values in Latin-1. You may want to return None instead in this case since it's really a fallback and your code might want to handle this more carefully (if it can).

answered Jan 17, 2012 at 16:47 by Martin Geisler; edited Jul 18, 2018 at 20:33 by 200_success

Comment (score 1): Hmm, I updated the question in edit #1 with sample code but am getting a bad file descriptor. Thanks for any help. Trying to figure this out.
– timpone, Jan 17, 2012 at 17:29

Comment (score 2): It seems I got AttributeError: 'str' object has no attribute 'decode'. So I finally used the code as with open(filename, encoding='utf-8-sig') as f_content:, then doc = f_content.read(), and it worked for me.
– clement116, Apr 20, 2021 at 19:21

Answer (score 78)

In Python 3 it's quite easy: read the file and rewrite it with utf-8 encoding:

    s = open(bom_file, mode='r', encoding='utf-8-sig').read()
    open(bom_file, mode='w', encoding='utf-8').write(s)

answered Oct 23, 2015 at 2:57 by Geng Jiawen; edited Oct 29, 2015 at 19:30 by the

Answer (score 7)

    import codecs
    import shutil
    import sys

    # Use the underlying binary streams on Python 3; on Python 2 the
    # standard streams already carry bytes.
    stdin = getattr(sys.stdin, "buffer", sys.stdin)
    stdout = getattr(sys.stdout, "buffer", sys.stdout)

    s = stdin.read(3)
    if s != codecs.BOM_UTF8:
        stdout.write(s)  # not a BOM, so pass the bytes through

    shutil.copyfileobj(stdin, stdout)

answered Jan 17, 2012 at 17:03 by newtover

Comment (score 1): Can you explain how this code works? $ remove_bom.py output.txt Am I right?
– guneysus, Nov 2, 2013 at 12:38

Comment (score 1): @guneysus, yes, exactly
– newtover, Nov 2, 2013 at 18:55

Answer (score 5)

This is my implementation to convert any kind of encoding to UTF-8 without BOM, replacing Windows line endings with the universal format:

    import codecs
    import os

    import chardet  # third-party package


    def utf8_converter(file_path, universal_endline=True):
        '''
        Convert any type of file to UTF-8 without BOM
        and using universal endline by default.

        Parameters
        ----------
        file_path : string, file path.
        universal_endline : boolean (True),
            by default convert endlines to universal format.
        '''
        # Fix file path
        file_path = os.path.realpath(os.path.expanduser(file_path))

        # Read from file (binary mode, so chardet sees the raw bytes)
        with open(file_path, 'rb') as file_open:
            raw = file_open.read()

        # Decode
        raw = raw.decode(chardet.detect(raw)['encoding'])
        # Remove Windows end line
        if universal_endline:
            raw = raw.replace('\r\n', '\n')
        # Encode to UTF-8
        raw = raw.encode('utf8')
        # Remove BOM
        if raw.startswith(codecs.BOM_UTF8):
            raw = raw.replace(codecs.BOM_UTF8, b'', 1)

        # Write to file
        with open(file_path, 'wb') as file_open:
            file_open.write(raw)
        return 0

answered May 14, 2014 at 8:04 by estevo

Answer (score 5)

I found this question because I was having trouble with configparser.ConfigParser().read(fp) when opening files with a UTF-8 BOM header.

For those who are looking for a way to remove the header so that ConfigParser can open the config file instead of reporting the error "File contains no section headers", please open the file like the following:

    configparser.ConfigParser().read(config_file_path, encoding="utf-8-sig")

This could save you tons of effort by making the removal of the file's BOM header unnecessary.

(I know this sounds unrelated, but hopefully this could help people struggling like me.)

answered Sep 7, 2017 at 0:46 by Alto.Clef; edited May 28, 2019 at 10:45 by Ahmed Ashour

Comment (score 2): As I was first working with try-except: this also opens UTF-8 "not BOM" encoded files without problems.
– flipSTAR, Oct 7, 2020 at 14:08

Answer (score 4)

You can use codecs.

    import codecs

    # Binary mode, so the first three bytes can be compared with the BOM
    with open("test.txt", 'rb') as filehandle:
        content = filehandle.read()
        if content[:3] == codecs.BOM_UTF8:
            content = content[3:]
        print(content.decode("utf-8"))

answered Feb 9, 2015 at 10:44 by wcc526; edited Oct 22, 2018 at 7:27 by doekman

Comment: Not a usable snippet at all (filehandle? Also codecs.BOM_UTF8 returns a syntax error).
– Max, Sep 7, 2016 at 16:11

Answer (score 1)

In Python 3 you should add encoding='utf-8-sig':

    with open(file_name, mode='a', encoding='utf-8-sig') as csvfile:
        csvfile.writelines(rows)

That's it.

answered Aug 21 at 4:46 by MohammadAminEskandari
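For the second part of the question (take input in an unknown encoding and emit UTF-8 without BOM), a Python 3 sketch could chain the candidate-codec loop from the accepted approach with a final encode step. The function name to_utf8_no_bom and the candidate list are assumptions made for illustration. Codec guessing of this kind is a heuristic: the Latin-1 fallback never fails, so it can silently mis-decode data that was really in some other encoding.

```python
def to_utf8_no_bom(data, candidates=("utf-8-sig", "utf-16")):
    """Decode `data` (bytes) with the first codec that accepts it and
    return the text re-encoded as UTF-8 without a BOM.

    Hypothetical helper; the candidate order is an assumption.
    """
    for encoding in candidates:
        try:
            text = data.decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    else:
        # No candidate worked: Latin-1 maps every byte to a character,
        # so this cannot fail, but the result may be wrong.
        text = data.decode("latin-1")
    # The plain "utf-8" codec never writes a BOM.
    return text.encode("utf-8")
```

A typical use would be to read a file with open(path, "rb"), pass the bytes through this function, and write the result back with open(path, "wb").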