Convert UTF-8 with BOM to UTF-8 with no BOM in Python
Asked 10 years, 9 months ago | Modified 1 month ago | Viewed 154k times

Votes: 101
Two questions here. I have a set of files which are usually UTF-8 with BOM. I'd like to convert them (ideally in place) to UTF-8 with no BOM. It seems like codecs.StreamRecoder(stream, encode, decode, Reader, Writer, errors) would handle this. But I don't really see any good examples on usage. Would this be the best way to handle this?

source files:

    Tue Jan 17$ file brh-m-157.json
    brh-m-157.json: UTF-8 Unicode (with BOM) text

Also, it would be ideal if we could handle different input encodings without explicitly knowing them (seen ASCII and UTF-16). It seems like this should all be feasible. Is there a solution that can take any known Python encoding and output as UTF-8 without BOM?
edit 1: proposed sol'n from below (thanks!)

    fp = open('brh-m-157.json', 'rw')
    s = fp.read()
    u = s.decode('utf-8-sig')
    s = u.encode('utf-8')
    print fp.encoding
    fp.write(s)

This gives me the following error:

    IOError: [Errno 9] Bad file descriptor
Newsflash

I'm being told in comments that the mistake is I open the file with mode 'rw' instead of 'r+'/'r+b', so I should eventually re-edit my question and remove the solved part.

Tags: python, utf-8, utf-16, byte-order-mark
edited Jan 30, 2012 at 21:15 by tzot
asked Jan 17, 2012 at 16:37 by timpone
Comment (2 votes): You need to open your file for reading plus update, i.e., with a r+ mode. Add b too so that it will work on Windows as well without any funny line ending business. Finally, you'll want to seek back to the beginning of the file and truncate it at the end; please see my updated answer.
– Martin Geisler, Jan 17, 2012 at 21:58
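
The fix described in the comment above can be sketched roughly as follows for Python 3. This is only an illustration, not the asker's final code: the function name `strip_bom_in_place` is hypothetical, and the sketch assumes the file is small enough to read into memory in one go.

```python
def strip_bom_in_place(path):
    """Rewrite a small UTF-8 file in place without a leading BOM."""
    # "r+b" opens the file for reading and updating in binary mode;
    # "rw" is not a valid mode string.
    with open(path, "r+b") as fp:
        raw = fp.read()
        text = raw.decode("utf-8-sig")   # drops the BOM if one is present
        raw = text.encode("utf-8")       # plain UTF-8, no BOM
        fp.seek(0)                       # go back to the start before writing
        fp.write(raw)
        fp.truncate()                    # cut off the leftover bytes at the end
```

Because the data is handled as bytes, line endings are left untouched on every platform.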
7 Answers

Votes: 150
Simply use the "utf-8-sig" codec:

    fp = open("file.txt")  # Python 2; on Python 3, open with "rb" so read() returns bytes
    s = fp.read()
    u = s.decode("utf-8-sig")

That gives you a unicode string without the BOM. You can then use

    s = u.encode("utf-8")

to get a normal UTF-8 encoded string back in s. If your files are big, then you should avoid reading them all into memory. The BOM is simply three bytes at the beginning of the file, so you can use this code to strip them out of the file:

    import os, sys, codecs

    BUFSIZE = 4096
    BOMLEN = len(codecs.BOM_UTF8)

    path = sys.argv[1]
    with open(path, "r+b") as fp:
        chunk = fp.read(BUFSIZE)
        if chunk.startswith(codecs.BOM_UTF8):
            i = 0
            chunk = chunk[BOMLEN:]
            while chunk:
                fp.seek(i)
                fp.write(chunk)
                i += len(chunk)
                fp.seek(BOMLEN, os.SEEK_CUR)
                chunk = fp.read(BUFSIZE)
            fp.seek(-BOMLEN, os.SEEK_CUR)
            fp.truncate()

It opens the file, reads a chunk, and writes it out to the file 3 bytes earlier than where it read it. The file is rewritten in place. An easier solution is to write the shorter file to a new file, as in newtover's answer. That would be simpler, but it uses twice the disk space for a short period.
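
For comparison, the copy-to-a-new-file alternative mentioned above might look something like this in Python 3. It is a sketch under assumptions, not code from any of the answers: the function name `strip_bom_via_copy` is made up for illustration, and the temporary file is placed next to the original so that the final `os.replace` stays on one filesystem.

```python
import codecs
import os
import shutil
import tempfile


def strip_bom_via_copy(path):
    """Copy path to a temporary file without its UTF-8 BOM, then swap it in.

    Returns True if a BOM was removed, False if the file had none.
    """
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, "rb") as src:
        head = src.read(len(codecs.BOM_UTF8))
        if head != codecs.BOM_UTF8:
            return False  # no BOM: leave the file untouched
        # Temporary file in the same directory as the original.
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "wb") as dst:
                shutil.copyfileobj(src, dst)  # streams the rest in chunks
        except BaseException:
            os.remove(tmp_path)
            raise
    # Replace the original only after both files are closed.
    os.replace(tmp_path, path)
    return True
```

One trade-off to be aware of: `tempfile.mkstemp` creates the file with restrictive permissions, so the replaced file may not keep the original's mode unless it is copied over separately.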
As for guessing the encoding, you can just loop through the encodings from most to least specific:

    def decode(s):
        for encoding in "utf-8-sig", "utf-16":
            try:
                return s.decode(encoding)
            except UnicodeDecodeError:
                continue
        return s.decode("latin-1")  # will always work

A UTF-16 encoded file won't decode as UTF-8, so we try UTF-8 first. If that fails, we try UTF-16. Finally, we use Latin-1; this will always work since all 256 byte values are legal in Latin-1. You may want to return None instead in this case, since it's really a fallback and your code might want to handle this more carefully (if it can).
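
A variant that returns None instead of falling back to Latin-1, as suggested above, could be sketched like this for Python 3 bytes. The name `guess_decode` and the choice to also return the matching encoding are assumptions made for illustration. Note that the guess can still be wrong: many even-length byte strings decode as UTF-16 without raising an error.

```python
def guess_decode(data):
    """Try a few encodings in order on a bytes object.

    Returns (text, encoding) for the first codec that succeeds,
    or None if none of them can decode the data.
    """
    for encoding in ("utf-8-sig", "utf-16"):
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return None  # caller decides how to handle undecodable input
```

The result can then be written back out with `text.encode("utf-8")`, which never adds a BOM.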
edited Jul 18, 2018 at 20:33 by 200_success
answered Jan 17, 2012 at 16:47 by Martin Geisler
Comment (1 vote): hmm, I updated the question in edit #1 with sample code but getting a bad file descriptor. thx for any help. Trying to figure this out.
– timpone, Jan 17, 2012 at 17:29

Comment (2 votes): seems got AttributeError: 'str' object has no attribute 'decode'. So I finally used the code as with open(filename, encoding='utf-8-sig') as f_content:, then doc = f_content.read() and it worked for me.
– clement116, Apr 20, 2021 at 19:21
Votes: 78

In Python 3 it's quite easy: read the file and rewrite it with utf-8 encoding:

    s = open(bom_file, mode='r', encoding='utf-8-sig').read()
    open(bom_file, mode='w', encoding='utf-8').write(s)
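
A slightly more defensive take on the same idea is sketched below. It is not part of the original answer: the function name `rewrite_without_bom` is hypothetical, the `with` blocks make sure the file is closed before it is reopened for writing, and `newline=""` is added on the assumption that existing line endings should be preserved rather than translated.

```python
def rewrite_without_bom(path):
    """Re-save a UTF-8 text file without a BOM, keeping its line endings."""
    # newline="" turns off newline translation when reading ...
    with open(path, mode="r", encoding="utf-8-sig", newline="") as src:
        text = src.read()
    # ... and when writing, so "\r\n" is not rewritten on any platform.
    with open(path, mode="w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

Like the two-line version, this reads the whole file into memory, so it suits small and medium-sized files.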
edited Oct 29, 2015 at 19:30 by the
answered Oct 23, 2015 at 2:57 by Geng Jiawen
Votes: 7

    import codecs
    import shutil
    import sys

    s = sys.stdin.read(3)
    if s != codecs.BOM_UTF8:
        sys.stdout.write(s)

    shutil.copyfileobj(sys.stdin, sys.stdout)
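
The filter above is written for Python 2, where sys.stdin yields byte strings. In Python 3 the text streams return str, so the comparison with codecs.BOM_UTF8 would never match; a possible adaptation reads from the underlying binary streams instead. The helper name `copy_without_bom` is an assumption made for this sketch.

```python
import codecs
import shutil


def copy_without_bom(src, dst):
    """Copy the binary stream src to dst, dropping a leading UTF-8 BOM."""
    head = src.read(len(codecs.BOM_UTF8))
    if head != codecs.BOM_UTF8:
        dst.write(head)  # not a BOM: these bytes are real data, keep them
    shutil.copyfileobj(src, dst)  # copy the remainder in chunks


# As a command-line filter one would call:
#   copy_without_bom(sys.stdin.buffer, sys.stdout.buffer)
```

Used as a filter, the script reads the original file on standard input and writes the BOM-free copy to standard output, so the result has to be redirected to a new file.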
answered Jan 17, 2012 at 17:03 by newtover
Comment (1 vote): can you explain how this code is work? $ remove_bom.py