Read UTF-8 files correctly with PowerShell - iTecNote


Tags: encoding, powershell, utf-8

Following situation:

- A PowerShell script creates a file with UTF-8 encoding.
- The user may or may not edit the file, possibly losing the BOM, but should keep the encoding as UTF-8, and possibly changing the line separators.
- The same PowerShell script reads the file, adds some more content and writes it all as UTF-8 back to the same file.
- This can be iterated many times.

With Get-Content and Out-File -Encoding UTF8 I have problems reading it correctly. It's stumbling over the BOM it has written before (putting it in the content, breaking my parsing regex), does not use UTF-8 encoding and even deletes line breaks in the original content part.

I need a function that can read any file with UTF-8 encoding, ignore and delete the BOM and not modify the content. What should I use?

Update

I have added a little test script that shows what I'm trying to do and what happens instead.

    # Read data if exists
    $data = ""
    $startRev = 1;
    if (Test-Path test.txt)
    {
        $data = Get-Content -Path test.txt
        if ($data -match "^[0-9-]{10}-r([0-9]+)")
        {
            $startRev = [int] $matches[1] + 1
        }
    }
    Write-Host Next revision is $startRev

    # Define example data to add
    $startRev = $startRev + 10
    $newMsgs = "2014-04-01-r" + $startRev + "`r`n`r`n" + `
        "Line1`r`n" + `
        "Line2`r`n`r`n"

    # Write new data back
    $data = $newMsgs + $data
    $data | Out-File test.txt -Encoding UTF8

After running it a few times, new sections should be added to the beginning of the file, the existing content should not be altered in any way (currently loses line breaks) and no additional newlines should be added at the end of the file (seems to happen sometimes).

Instead, the second run gives me an error.

Best Solution

If the file is supposed to be UTF-8, why don't you try to read it decoding UTF-8:

    Get-Content -Path test.txt -Encoding UTF8

Related Solutions

Best way to convert text files between character sets

Stand-alone utility approach:

    iconv -f ISO-8859-1 -t UTF-8 in.txt > out.txt

    -f ENCODING  the encoding of the input
    -t ENCODING  the encoding of the output

You don't have to specify either of these arguments. They will default to your current locale, which is usually UTF-8.
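Returning to the PowerShell question above: one way to get a read that skips the BOM and leaves line breaks alone is to bypass Get-Content and Out-File and call the .NET file APIs directly. The following is only a sketch and is not taken from the original answer. The function names Read-Utf8Text and Write-Utf8Text are made up for illustration, and it assumes the file is valid UTF-8 and small enough to hold in memory.

```powershell
# Hypothetical helper names; only the .NET calls are standard API.

function Read-Utf8Text {
    param([Parameter(Mandatory)][string] $Path)
    # Convert to a full file system path, because .NET's current directory
    # can differ from PowerShell's current location.
    $fullPath = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($Path)
    # ReadAllText decodes the bytes as UTF-8, skips a leading BOM if present,
    # and returns the whole file as ONE string with its line breaks untouched.
    return [System.IO.File]::ReadAllText($fullPath, [System.Text.Encoding]::UTF8)
}

function Write-Utf8Text {
    param(
        [Parameter(Mandatory)][string] $Path,
        [Parameter(Mandatory)][AllowEmptyString()][string] $Text
    )
    $fullPath = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($Path)
    # UTF8Encoding($false) emits no BOM; WriteAllText appends no trailing newline.
    $utf8NoBom = New-Object System.Text.UTF8Encoding $false
    [System.IO.File]::WriteAllText($fullPath, $Text, $utf8NoBom)
}
```

In the test script, this would mean replacing the Get-Content line with `$data = Read-Utf8Text -Path test.txt` and the Out-File line with `Write-Utf8Text -Path test.txt -Text $data`. Because the file then comes back as a single string rather than an array of lines, -match returns a Boolean and fills $matches, and concatenating the new section no longer joins the old lines with spaces.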
PHP - UTF-8 all the way through

Data Storage:

Specify the utf8mb4 character set on all tables and text columns in your database. This makes MySQL physically store and retrieve values encoded natively in UTF-8. Note that MySQL will implicitly use utf8mb4 encoding if a utf8mb4_* collation is specified (without any explicit character set).

In older versions of MySQL (< 5.5.3), you'll unfortunately be forced to use simply utf8, which only supports a subset of Unicode characters. I wish I were kidding.

Data Access:

In your application code (e.g. PHP), in whatever DB access method you use, you'll need to set the connection charset to utf8mb4. This way, MySQL does no conversion from its native UTF-8 when it hands data off to your application and vice versa.

Some drivers provide their own mechanism for configuring the connection character set, which both updates its own internal state and informs MySQL of the encoding to be used on the connection; this is usually the preferred approach. In PHP:

- If you're using the PDO abstraction layer with PHP ≥ 5.3.6, you can specify charset in the DSN:

      $dbh = new PDO('mysql:charset=utf8mb4');

- If you're using mysqli, you can call set_charset():

      $mysqli->set_charset('utf8mb4');       // object oriented style
      mysqli_set_charset($link, 'utf8mb4');  // procedural style

- If you're stuck with plain mysql but happen to be running PHP ≥ 5.2.3, you can call mysql_set_charset.

If the driver does not provide its own mechanism for setting the connection character set, you may have to issue a query to tell MySQL how your application expects data on the connection to be encoded: SET NAMES 'utf8mb4'.

The same consideration regarding utf8mb4/utf8 applies as above.

Output:

If your application transmits text to other systems, they will also need to be informed of the character encoding. With web applications, the browser must be informed of the encoding in which data is sent (through HTTP response headers or HTML metadata).

In PHP, you can use the default_charset php.ini option, or manually issue the Content-Type MIME header yourself, which is just more work but has the same effect.

When encoding the output using json_encode(), add JSON_UNESCAPED_UNICODE as a second parameter.
Input:

Unfortunately, you should verify every received string as being valid UTF-8 before you try to store it or use it anywhere. PHP's mb_check_encoding() does the trick, but you have to use it religiously. There's really no way around this, as malicious clients can submit data in whatever encoding they want, and I haven't found a trick to get PHP to do this for you reliably.

From my reading of the current HTML spec, the following sub-bullets are not necessary or even valid anymore for modern HTML. My understanding is that browsers will work with and submit data in the character set specified for the document. However, if you're targeting older versions of HTML (XHTML, HTML4, etc.), these points may still be useful:

- For HTML before HTML5 only: you want all data sent to you by browsers to be in UTF-8. Unfortunately, the only way to reliably do this is to add the accept-charset attribute to all your <form> tags: <form ... accept-charset="UTF-8">.
- For HTML before HTML5 only: note that the W3C HTML spec says that clients "should" default to sending forms back to the server in whatever charset the server served, but this is apparently only a recommendation, hence the need for being explicit on every single <form> tag.

Other Code Considerations:

Obviously enough, all files you'll be serving (PHP, HTML, JavaScript, etc.) should be encoded in valid UTF-8.

You need to make sure that every time you process a UTF-8 string, you do so safely. This is, unfortunately, the hard part. You'll probably want to make extensive use of PHP's mbstring extension.

PHP's built-in string operations are not by default UTF-8 safe. There are some things you can safely do with normal PHP string operations (like concatenation), but for most things you should use the equivalent mbstring function.

To know what you're doing (read: not mess it up), you really need to know UTF-8 and how it works on the lowest possible level. Check out any of the links from utf8.com for some good resources to learn everything you need to know.

