Read UTF-8 files correctly with PowerShell - iTecNote
文章推薦指數: 80 %
A PowerShell script creates a file with UTF-8 encoding; The user may or may not edit the file, possibly losing the BOM, but should keep the encoding as UTF-8, ... Skiptocontent encodingpowershellutf-8 Followingsituation: APowerShellscriptcreatesafilewithUTF-8encoding Theusermayormaynoteditthefile,possiblylosingtheBOM,butshouldkeeptheencodingasUTF-8,andpossiblychangingthelineseparators ThesamePowerShellscriptreadsthefile,addssomemorecontentandwritesitallasUTF-8backtothesamefile Thiscanbeiteratedmanytimes WithGet-ContentandOut-File-EncodingUTF8Ihaveproblemsreadingitcorrectly.It'sstumblingovertheBOMithaswrittenbefore(puttingitinthecontent,breakingmyparsingregex),doesnotuseUTF-8encodingandevendeleteslinebreaksintheoriginalcontentpart. IneedafunctionthatcanreadanyfilewithUTF-8encoding,ignoreanddeletetheBOMandnotmodifythecontent.WhatshouldIuse? Update IhaveaddedalittletestscriptthatshowswhatI'mtryingtodoandwhathappensinstead. #Readdataifexists $data="" $startRev=1; if(Test-Pathtest.txt) { $data=Get-Content-Pathtest.txt if($data-match"^[0-9-]{10}-r([0-9]+)") { $startRev=[int]$matches[1]+1 } } Write-HostNextrevisionis$startRev #Defineexampledatatoadd $startRev=$startRev+10 $newMsgs="2014-04-01-r"+$startRev+"`r`n`r`n"+` "Line1`r`n"+` "Line2`r`n`r`n" #Writenewdataback $data=$newMsgs+$data $data|Out-Filetest.txt-EncodingUTF8 Afterrunningitafewtimes,newsectionsshouldbeaddedtothebeginningofthefile,theexistingcontentshouldnotbealteredinanyway(currentlyloseslinebreaks)andnoadditionalnewlinesshouldbeaddedattheendofthefile(seemstohappensometimes). Instead,thesecondrungivesmeanerror. BestSolution IfthefileissupposedtobeUTF8whydon'tyoutrytoreaditdecodingUTF8: Get-Content-Pathtest.txt-EncodingUTF8 RelatedSolutionsBestwaytoconverttextfilesbetweencharactersets Stand-aloneutilityapproach iconv-fISO-8859-1-tUTF-8in.txt>out.txt -fENCODINGtheencodingoftheinput -tENCODINGtheencodingoftheoutput Youdon'thavetospecifyeitherofthesearguments.Theywilldefaulttoyourcurrentlocale,whichisusuallyUTF-8. Php–UTF-8allthewaythrough DataStorage: Specifytheutf8mb4charactersetonalltablesandtextcolumnsinyourdatabase.ThismakesMySQLphysicallystoreandretrievevaluesencodednativelyinUTF-8.NotethatMySQLwillimplicitlyuseutf8mb4encodingifautf8mb4_*collationisspecified(withoutanyexplicitcharacterset). InolderversionsofMySQL(<5.5.3),you'llunfortunatelybeforcedtousesimplyutf8,whichonlysupportsasubsetofUnicodecharacters.IwishIwerekidding. DataAccess: Inyourapplicationcode(e.g.PHP),inwhateverDBaccessmethodyouuse,you'llneedtosettheconnectioncharsettoutf8mb4.Thisway,MySQLdoesnoconversionfromitsnativeUTF-8whenithandsdataofftoyourapplicationandviceversa. Somedriversprovidetheirownmechanismforconfiguringtheconnectioncharacterset,whichbothupdatesitsowninternalstateandinformsMySQLoftheencodingtobeusedontheconnection—thisisusuallythepreferredapproach.InPHP: Ifyou'reusingthePDOabstractionlayerwithPHP≥5.3.6,youcanspecifycharsetintheDSN: $dbh=newPDO('mysql:charset=utf8mb4'); Ifyou'reusingmysqli,youcancallset_charset(): $mysqli->set_charset('utf8mb4');//objectorientedstyle mysqli_set_charset($link,'utf8mb4');//proceduralstyle Ifyou'restuckwithplainmysqlbuthappentoberunningPHP≥5.2.3,youcancallmysql_set_charset. Ifthedriverdoesnotprovideitsownmechanismforsettingtheconnectioncharacterset,youmayhavetoissueaquerytotellMySQLhowyourapplicationexpectsdataontheconnectiontobeencoded:SETNAMES'utf8mb4'. Thesameconsiderationregardingutf8mb4/utf8appliesasabove. Output: Ifyourapplicationtransmitstexttoothersystems,theywillalsoneedtobeinformedofthecharacterencoding.Withwebapplications,thebrowsermustbeinformedoftheencodinginwhichdataissent(throughHTTPresponseheadersorHTMLmetadata). InPHP,youcanusethedefault_charsetphp.inioption,ormanuallyissuetheContent-TypeMIMEheaderyourself,whichisjustmoreworkbuthasthesameeffect. Whenencodingtheoutputusingjson_encode(),addJSON_UNESCAPED_UNICODEasasecondparameter. Input: Unfortunately,youshouldverifyeveryreceivedstringasbeingvalidUTF-8beforeyoutrytostoreitoruseitanywhere.PHP'smb_check_encoding()doesthetrick,butyouhavetouseitreligiously.There'sreallynowayaroundthis,asmaliciousclientscansubmitdatainwhateverencodingtheywant,andIhaven'tfoundatricktogetPHPtodothisforyoureliably. FrommyreadingofthecurrentHTMLspec,thefollowingsub-bulletsarenotnecessaryorevenvalidanymoreformodernHTML.Myunderstandingisthatbrowserswillworkwithandsubmitdatainthecharactersetspecifiedforthedocument.However,ifyou'retargetingolderversionsofHTML(XHTML,HTML4,etc.),thesepointsmaystillbeuseful: ForHTMLbeforeHTML5only:youwantalldatasenttoyoubybrowserstobeinUTF-8.Unfortunately,ifyougobytheonlywaytoreliablydothisisaddtheaccept-charsetattributetoallyour
延伸文章資訊
- 1PowerShell scripts for creating and reading test files with the ...
WE CAN'T BE SURE THAT THE ENCLOSING FILE WILL HAVE a UTF-8 BOM. # E.G., WHEN DOWNLOADED FROM A Gi...
- 2Get Text File Encoding - Power Tips - IDERA Community
That's why most cmdlets dealing with text file reading offer the ... Encoding]::UTF8} [PSCustomOb...
- 3Byte order mark - Globalization - Microsoft Learn
- 4Read UTF-8 files correctly with PowerShell - Stack Overflow
I need a function that can read any file with UTF-8 encoding, ignore and delete the BOM and not m...
- 5How to: Save and open files with encoding - Visual Studio (Windows)