How can Python check if a file name is in UTF8?

I have a PHP script that creates a list of files in a directory, however, PHP can see only file names in English and totally ignores file names in other languages, such as Russian or Asian languages.

After lots of efforts I found the only solution that could work for me - using a python script that renames the files to UTF8, so the PHP script can process them after that.

(After PHP has finished processing the files, I rename the files to English, I don't keep them in UTF8).

I used the following python script, that works fine:

    import sys
    import os
    import glob
    import ntpath
    from random import randint

    for infile in glob.glob(os.path.join('C:\\MyFiles', u'*')):
        if os.path.isfile(infile):
            infile_utf8 = infile.encode('utf8')
            os.rename(infile, infile_utf8)

The problem is that it converts also file names that are already in UTF8. I need a way to skip the conversion in case the file name is already in UTF8.

I was trying this python script:

    for infile in glob.glob(os.path.join('C:\\MyFiles', u'*')):
        if os.path.isfile(infile):
            try:
                infile.decode('UTF-8', 'strict')
            except UnicodeDecodeError:
                infile_utf8 = infile.encode('utf8')
                os.rename(infile, infile_utf8)

But, if file name is already in utf8, I get fatal error:

    UnicodeDecodeError: 'ascii' codec can't decode characters in position 18-20
    ordinal not in range(128)

I also tried another way, which also didn't work:

    for infile in glob.glob(os.path.join('C:\\MyFiles', u'*')):
        if os.path.isfile(infile):
            try:
                tmpstr = str(infile)
            except UnicodeDecodeError:
                infile_utf8 = infile.encode('utf8')
                os.rename(infile, infile_utf8)

I got exactly the same error as before.

Any ideas?

Python is very new to me, and it is a huge effort for me to debug even a simple script, so please write an explicit answer (i.e. code). I don't have the ability of testing general ideas that maybe work or maybe not. Thanks.

Examples of file names:

    hello.txt
    你好.txt
    안녕하세요.html
    chào.doc

I think you're confusing your terminology and making some wrong assumptions. AFAIK, PHP can open filenames of any encoding type - PHP is very much agnostic about encoding types.

You haven't been clear exactly what you want to achieve, as UTF-8 != English and the example foreign filenames could be encoded in a number of ways but never in ASCII English! Can you explain what you think an existing UTF-8 file looks like and what a non-UTF-8 file is?

To add to your confusion, under Windows, filenames are transparently stored as UTF-16. Therefore, you should not try to encode filenames to UTF-8. Instead, you should use Unicode strings and allow Python to work out the proper conversion. (Don't encode in UTF-16 either!)

Please clarify your question further.

Update:

I now understand your problem with PHP. http://evertpot.com/filesystem-encoding-and-php/ tells us that non-latin characters are troublesome with PHP + Windows. It would seem that only files whose names are made of Windows-1252 characters can be seen and opened.

The challenge you have is to convert your filenames to be Windows-1252 compatible. As you've stated in your question, it would be ideal not to rename files that are already compatible. I've reworked your attempt as:

    # Python 2 code; the "fixed" directory must already exist.
    import os
    from glob import glob
    import shutil
    import urllib

    # A unicode pattern makes glob return unicode file names.
    files = glob(u'*.txt')

    for my_file in files:
        try:
            print "File %s" % my_file
        except UnicodeEncodeError:
            # The console code page cannot show the name; print escapes instead.
            print "File (escaped): %s" % my_file.encode("unicode_escape")

        new_name = my_file

        try:
            # Names that fit in Windows-1252 are kept as they are.
            my_file.encode("cp1252", "strict")
            print "Name unchanged. Copying anyway"
        except UnicodeEncodeError:
            print "Cannot convert to cp1252"
            # Percent-encode the UTF-8 bytes so the name becomes plain ASCII.
            utf_8_name = my_file.encode("UTF-8")
            new_name = urllib.quote(utf_8_name)
            print "New name: (%% encoded): %s" % new_name

        shutil.copy2(my_file, os.path.join("fixed", new_name))

Breakdown:

1. Print the filename. By default, the Windows shell only shows results in a local DOS code page. For example, my shell can show ü.txt but €.txt shows as ?.txt. Therefore, you need to be careful of Python throwing exceptions because it can't print properly. This code attempts to print the Unicode version but falls back to printing Unicode code point escapes.
2. Try to encode the string as Windows-1252. If this works, the filename is OK.
3. Else: convert the filename to UTF-8, then percent-encode it. This way, the filename remains unique and you could reverse this procedure in PHP.
4. Copy the file to the new/verified file.

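For readers on Python 3, the renaming rule from the breakdown above could be sketched roughly as follows. This is only an illustration, not the answer's original code: the function names `php_safe_name` and `original_name` are made up for this sketch, and it only computes the new name without touching any files. It relies on `urllib.parse.quote`, which percent-encodes the UTF-8 bytes of a `str` by default.

```python
from urllib.parse import quote, unquote


def php_safe_name(name):
    """Return `name` unchanged if Windows-1252 can represent it,
    otherwise return its UTF-8 bytes percent-encoded as ASCII.

    Hypothetical helper name, used for illustration only.
    """
    try:
        name.encode("cp1252", "strict")
        return name
    except UnicodeEncodeError:
        # quote() encodes the str as UTF-8 before percent-encoding it.
        return quote(name)


def original_name(encoded):
    """Reverse the percent-encoding applied by php_safe_name()."""
    return unquote(encoded)


if __name__ == "__main__":
    for example in ["hello.txt", "你好.txt", "안녕하세요.html", "chào.doc"]:
        # ascii() keeps the output printable on consoles with a narrow code page.
        print(ascii(example), "->", php_safe_name(example))
```

With the example names from the question, `hello.txt` and `chào.doc` stay unchanged because every character exists in Windows-1252, while the Chinese and Korean names are percent-encoded. One caveat of this sketch: a compatible name that already contains a literal `%` is left as it is, so decoding it later would not be a safe round trip.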
For example, 你好.txt becomes %E4%BD%A0%E5%A5%BD.txt

For all UTF-8 issues with Python, I warmly recommend spending 36 minutes watching "Pragmatic Unicode" by Ned Batchelder (http://nedbatchelder.com/text/unipain.html) at PyCon 2012. For me it was a revelation! A lot of this presentation is in fact not Python-specific but helps in understanding important things like the difference between Unicode strings and UTF-8 encoded bytes...

The reason I'm recommending this video to you (like I did for many friends) is that some of your code contains contradictions, like trying to decode and then encode if decoding fails: such methods cannot apply to the same object! Even though in Python 2 it's syntactically possible, it makes no sense, and in Python 3, the distinction between bytes and str makes things clearer:

A str object can be encoded into bytes:

    >>> a = 'a'
    >>> type(a)
    <class 'str'>
    >>> a.encode
    <built-in method encode of str object at 0x...>
    >>> a.decode
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'str' object has no attribute 'decode'

...while a bytes object can be decoded into str:

    >>> b = b'b'
    >>> type(b)
    <class 'bytes'>
    >>> b.decode
    <built-in method decode of bytes object at 0x...>
    >>> b.encode
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'bytes' object has no attribute 'encode'

Coming back to your question of working with filenames, the tricky question you need to answer is: "what is the encoding of your filenames?" The language doesn't matter, only the encoding!

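To make the bytes/str distinction concrete, here is a minimal Python 3 sketch of the only form in which an "is this UTF-8?" check makes sense: it has to be asked of raw bytes, never of an already decoded string. The helper name `is_valid_utf8` and the sample values are assumptions made for this illustration.

```python
def is_valid_utf8(raw):
    """Return True if the bytes object `raw` decodes cleanly as UTF-8.

    Hypothetical helper for illustration; it only accepts bytes,
    because a str has no encoding left to check.
    """
    try:
        raw.decode("utf-8", "strict")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    samples = {
        "UTF-8 bytes": "你好.txt".encode("utf-8"),
        "cp1252 bytes": "chào.doc".encode("cp1252"),
        "ASCII bytes": b"hello.txt",
    }
    for label, raw in samples.items():
        print(label, raw, is_valid_utf8(raw))
```

Note that pure ASCII bytes are also valid UTF-8, so a successful decode only shows that the bytes could be UTF-8, not that they were written with that encoding in mind. This is why the question "what is the encoding of your filenames?" has to be answered from knowledge of the system rather than guessed from the data.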
Comments

9 years ago: Another alternative would be to use the unidecode module and convert the unicode filenames to good-enough ASCII.

9 years ago: UnicodeDecodeError: 'ascii' is not what you're thinking. This means something is trying to be decoded as 'ascii', not utf8.

9 years ago: Show the entire error trace. There's only one explicit decode but it doesn't look like the error is coming from there.

9 years ago: Looks like this was answered a few times. stackoverflow.com/questions/6707657/…

9 years ago: The error is exactly what I wrote. If you try to run this code and, as an input, use a file name in an Asian/Russian/Arabic language, you will get exactly the same error.

9 years ago: I am trying to decode to see if the variable infile is in UTF8 or not. If decoding fails, it means that it was never encoded in UTF8 so it can be encoded now.

9 years ago: Please note that the result of the decoding does not go to any variable. Do you still think there is a contradiction?

9 years ago: Ok, the problem is I don't quite understand what converting to English or converting to UTF-8 really means. It would be nice if you could provide the repr() of two filenames: one that you want to convert, one that you don't. (I can't comment directly on your question)

9 years ago: I have added examples of file names at the end of my question above.

9 years ago: Good, but what is the repr() of these names when they're read by glob? Do you get a Unicode object or a str? And also, how would you like these names to be converted "in English"?

9 years ago: What I want to do is very basic - to read a file name using PHP. However, PHP on Windows cannot see file names in foreign languages such as: 你好.txt, 안녕하세요.html, chào.doc. You can try it yourself by creating such files and trying to list them using PHP (glob, scandir, etc.). If you succeed, please let me know how you did it.

9 years ago: I've done some research and now understand what your restrictions are. Please see my update to my answer.

8 years, 11 months ago: Your method is very nice, but so far I could not use it because it often creates huge file names that Windows cannot write/read (more than 255 characters). Any other ideas?

8 years, 11 months ago: @Tom, I'm really disappointed you unaccepted my answer. I put a lot of time and effort into answering your question with a solution. If your requirements have changed then you ought to create a new question. Some initial thoughts are: use short paths and gzip + base64 encode the filename, or, if the original filename is not important, you could hash the filename.

8 years, 11 months ago: You are right. Your solution is the best so far and it solves most of the problem. I gave you the credit back. Well done and many thanks. Regardless - this programming problem needs more development. Maybe I will continue it in a new question, as you suggested.

Mentions: Phyton_user, Community, Alastair McCormack, Pierre H.
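
The hashing idea from the comments above was never spelled out in the thread. A rough Python 3 sketch of it, under stated assumptions, might look like this: the function name `short_safe_name` is hypothetical, the 255-character limit is the figure quoted in the comment, and SHA-256 is simply one reasonable choice of hash. Percent-encoding is kept for names that fit, and only over-long results are replaced by a digest.

```python
import hashlib
import os
from urllib.parse import quote

MAX_NAME_LENGTH = 255  # limit quoted in the comment; treated as an assumption


def short_safe_name(name, limit=MAX_NAME_LENGTH):
    """Percent-encode `name`; if the result is longer than `limit`,
    fall back to a SHA-256 hex digest plus the original extension.

    Hypothetical helper for illustration only.
    """
    quoted = quote(name)
    if len(quoted) <= limit:
        return quoted
    _stem, extension = os.path.splitext(name)
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    return digest + quote(extension)


if __name__ == "__main__":
    long_name = "你" * 100 + ".txt"
    print(short_safe_name("hello.txt"))
    print(short_safe_name(long_name))
```

Unlike percent-encoding, a hash cannot be reversed, so this variant only fits the case mentioned in the comment where the original filename is not important, or where a separate mapping from digest to original name is stored.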


