How can Python check if a file name is in UTF8?
I have a PHP script that creates a list of files in a directory. However, PHP can see only file names in English and totally ignores file names in other languages, such as Russian or Asian languages.
After lots of effort I found the only solution that could work for me: using a Python script that renames the files to UTF8, so the PHP script can process them after that.
(After PHP has finished processing the files, I rename the files to English; I don't keep them in UTF8.)
I used the following Python script, which works fine:
import sys
import os
import glob
import ntpath
from random import randint

for infile in glob.glob(os.path.join('C:\\MyFiles', u'*')):
    if os.path.isfile(infile):
        infile_utf8 = infile.encode('utf8')
        os.rename(infile, infile_utf8)
The problem is that it also converts file names that are already in UTF8. I need a way to skip the conversion in case the file name is already in UTF8.
I was trying this Python script:
for infile in glob.glob(os.path.join('C:\\MyFiles', u'*')):
    if os.path.isfile(infile):
        try:
            infile.decode('UTF-8', 'strict')
        except UnicodeDecodeError:
            infile_utf8 = infile.encode('utf8')
            os.rename(infile, infile_utf8)
But if the file name is already in UTF8, I get a fatal error:
UnicodeDecodeError: 'ascii' codec can't decode characters in position 18-20
ordinal not in range(128)
I also tried another way, which also didn't work:
for infile in glob.glob(os.path.join('C:\\MyFiles', u'*')):
    if os.path.isfile(infile):
        try:
            tmpstr = str(infile)
        except UnicodeDecodeError:
            infile_utf8 = infile.encode('utf8')
            os.rename(infile, infile_utf8)
I got exactly the same error as before.
Any ideas?
Python is very new to me, and it is a huge effort for me to debug even a simple script, so please write an explicit answer (i.e. code). I don't have the ability to test general ideas that may or may not work. Thanks.
Examples of file names:
hello.txt
你好.txt
안녕하세요.html
chào.doc
I think you're confusing your terminology and making some wrong assumptions. AFAIK, PHP can open file names of any encoding type; PHP is very much agnostic about encoding types.
You haven't been clear about exactly what you want to achieve, as UTF-8 != English, and the example foreign file names could be encoded in a number of ways but never in ASCII English! Can you explain what you think an existing UTF-8 file looks like and what a non-UTF-8 file is?
To add to your confusion, under Windows, file names are transparently stored as UTF-16.
Therefore, you should not try to encode file names to UTF-8. Instead, you should use Unicode strings and allow Python to work out the proper conversion. (Don't encode in UTF-16 either!)
Please clarify your question further.
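As a rough illustration of why "is this file name in UTF-8?" has no single answer, the sketch below encodes one name three ways. It assumes Python 3, where str is always Unicode (the code in this thread is Python 2), and the helper name encoded_forms is made up for this example. The name itself is just text; bytes only appear once an encoding is chosen.

```python
def encoded_forms(name):
    """Return the bytes a Unicode file name becomes under several encodings.

    Hypothetical helper for illustration only; an encoding that cannot
    represent the name maps to None.
    """
    forms = {}
    for encoding in ("utf-8", "utf-16-le", "cp1252"):
        try:
            forms[encoding] = name.encode(encoding)
        except UnicodeEncodeError:
            forms[encoding] = None
    return forms


if __name__ == "__main__":
    for name in ("hello.txt", "chào.doc", "你好.txt"):
        # ascii() keeps the output printable on consoles with a narrow code page.
        print(ascii(name))
        for encoding, data in encoded_forms(name).items():
            print("   ", encoding, data)
```

The same name yields different bytes per encoding, and 你好.txt has no cp1252 form at all, which is why the script should keep names as Unicode strings and leave the conversion to Python and the operating system.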
Update:
I now understand your problem with PHP. http://evertpot.com/filesystem-encoding-and-php/ tells us that non-Latin characters are troublesome with PHP + Windows. It would seem that only files whose names are made of Windows-1252 characters can be seen and opened.
The challenge you have is to convert your file names to be Windows-1252 compatible. As you've stated in your question, it would be ideal not to rename files that are already compatible. I've reworked your attempt as:
import os
from glob import glob
import shutil
import urllib

# A unicode pattern makes glob return unicode file names (Python 2).
files = glob(u'*.txt')
for my_file in files:
    try:
        print "File %s" % my_file
    except UnicodeEncodeError:
        # The console code page cannot show this name; print escapes instead.
        print "File (escaped): %s" % my_file.encode("unicode_escape")
    new_name = my_file
    try:
        my_file.encode("cp1252", "strict")
        print "Name unchanged. Copying anyway"
    except UnicodeEncodeError:
        print "Cannot convert to cp1252"
        utf_8_name = my_file.encode("UTF-8")
        new_name = urllib.quote(utf_8_name)
        print "New name: (%% encoded): %s" % new_name

    # The "fixed" subdirectory must already exist.
    shutil.copy2(my_file, os.path.join("fixed", new_name))
Breakdown:
1. Print the file name. By default, the Windows shell only shows results in a local DOS code page. For example, my shell can show ü.txt but €.txt shows as ?.txt. Therefore, you need to be careful of Python throwing exceptions because it can't print properly. This code attempts to print the Unicode version but falls back to printing Unicode code point escapes instead.
2. Try to encode the string as Windows-1252. If this works, the file name is OK.
3. Else: convert the file name to UTF-8, then percent-encode it. This way, the file name remains unique and you could reverse this procedure in PHP.
4. Copy the file to the new/verified file.
For example, 你好.txt becomes %E4%BD%A0%E5%A5%BD.txt
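The naming steps above can also be sketched as two small pure functions. This is a Python 3 adaptation rather than the code from the answer: in Python 3 the quoting function lives at urllib.parse.quote, and the names cp1252_safe_name and original_name are hypothetical. The sketch only computes names and touches no files.

```python
from urllib.parse import quote, unquote


def cp1252_safe_name(name):
    """Return name unchanged if cp1252 can represent it, else percent-encode it."""
    try:
        name.encode("cp1252", "strict")
        return name
    except UnicodeEncodeError:
        # quote() encodes a str as UTF-8 by default before percent-encoding.
        return quote(name)


def original_name(safe_name):
    """Undo the percent-encoding.

    Assumption: the original name contained no literal '%' character,
    otherwise the round trip is ambiguous.
    """
    return unquote(safe_name)


if __name__ == "__main__":
    for name in ("hello.txt", "chào.doc", "你好.txt", "안녕하세요.html"):
        print(ascii(name), "->", cp1252_safe_name(name))
```

Names such as hello.txt and chào.doc pass through untouched because every character exists in Windows-1252; only the Chinese and Korean names are rewritten. On the PHP side, a function such as rawurldecode should be able to recover the original name, though that part is untested here.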
For all UTF-8 issues with Python, I warmly recommend spending 36 minutes watching "Pragmatic Unicode" by Ned Batchelder (http://nedbatchelder.com/text/unipain.html) at PyCon 2012. For me it was a revelation! A lot of this presentation is in fact not Python-specific, but it helps in understanding important things like the difference between Unicode strings and UTF-8 encoded bytes...
The reason I'm recommending this video to you (like I did for many friends) is that some of your code contains contradictions, like trying to decode and then encode if decoding fails: such methods cannot apply to the same object! Even though in Python 2 it's syntactically possible, it makes no sense, and in Python 3 the distinction between bytes and str makes things clearer:
A str object can be encoded in bytes:
>>> a = 'a'
>>> type(a)
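A minimal Python 3 sketch of the distinction described here, using one of the file names from the question: str offers encode() but no decode(), and bytes offers decode() but no encode(), so the decode-then-encode pattern from the question cannot apply to a single object.

```python
text = "你好.txt"                 # str: a sequence of Unicode code points
data = text.encode("utf-8")      # bytes: one possible encoding of that text

assert isinstance(data, bytes)
assert data.decode("utf-8") == text   # decoding the bytes recovers the str

# In Python 3 the two directions live on different types.
print(hasattr(text, "decode"))   # False: str has no decode()
print(hasattr(data, "encode"))   # False: bytes has no encode()
```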