Read UTF-8 files correctly with PowerShell - iTecNote
文章推薦指數: 80 %
A PowerShell script creates a file with UTF-8 encoding; The user may or may not edit the file, possibly losing the BOM, but should keep the encoding as UTF-8, ... Skiptocontent encodingpowershellutf-8 Followingsituation: APowerShellscriptcreatesafilewithUTF-8encoding Theusermayormaynoteditthefile,possiblylosingtheBOM,butshouldkeeptheencodingasUTF-8,andpossiblychangingthelineseparators ThesamePowerShellscriptreadsthefile,addssomemorecontentandwritesitallasUTF-8backtothesamefile Thiscanbeiteratedmanytimes WithGet-ContentandOut-File-EncodingUTF8Ihaveproblemsreadingitcorrectly.It'sstumblingovertheBOMithaswrittenbefore(puttingitinthecontent,breakingmyparsingregex),doesnotuseUTF-8encodingandevendeleteslinebreaksintheoriginalcontentpart. IneedafunctionthatcanreadanyfilewithUTF-8encoding,ignoreanddeletetheBOMandnotmodifythecontent.WhatshouldIuse? Update IhaveaddedalittletestscriptthatshowswhatI'mtryingtodoandwhathappensinstead. #Readdataifexists $data="" $startRev=1; if(Test-Pathtest.txt) { $data=Get-Content-Pathtest.txt if($data-match"^[0-9-]{10}-r([0-9]+)") { $startRev=[int]$matches[1]+1 } } Write-HostNextrevisionis$startRev #Defineexampledatatoadd $startRev=$startRev+10 $newMsgs="2014-04-01-r"+$startRev+"`r`n`r`n"+` "Line1`r`n"+` "Line2`r`n`r`n" #Writenewdataback $data=$newMsgs+$data $data|Out-Filetest.txt-EncodingUTF8 Afterrunningitafewtimes,newsectionsshouldbeaddedtothebeginningofthefile,theexistingcontentshouldnotbealteredinanyway(currentlyloseslinebreaks)andnoadditionalnewlinesshouldbeaddedattheendofthefile(seemstohappensometimes). Instead,thesecondrungivesmeanerror. BestSolution IfthefileissupposedtobeUTF8whydon'tyoutrytoreaditdecodingUTF8: Get-Content-Pathtest.txt-EncodingUTF8 RelatedSolutionsBestwaytoconverttextfilesbetweencharactersets Stand-aloneutilityapproach iconv-fISO-8859-1-tUTF-8in.txt>out.txt -fENCODINGtheencodingoftheinput -tENCODINGtheencodingoftheoutput Youdon'thavetospecifyeitherofthesearguments.Theywilldefaulttoyourcurrentlocale,whichisusuallyUTF-8. Php–UTF-8allthewaythrough DataStorage: Specifytheutf8mb4charactersetonalltablesandtextcolumnsinyourdatabase.ThismakesMySQLphysicallystoreandretrievevaluesencodednativelyinUTF-8.NotethatMySQLwillimplicitlyuseutf8mb4encodingifautf8mb4_*collationisspecified(withoutanyexplicitcharacterset). InolderversionsofMySQL(<5.5.3),you'llunfortunatelybeforcedtousesimplyutf8,whichonlysupportsasubsetofUnicodecharacters.IwishIwerekidding. DataAccess: Inyourapplicationcode(e.g.PHP),inwhateverDBaccessmethodyouuse,you'llneedtosettheconnectioncharsettoutf8mb4.Thisway,MySQLdoesnoconversionfromitsnativeUTF-8whenithandsdataofftoyourapplicationandviceversa. Somedriversprovidetheirownmechanismforconfiguringtheconnectioncharacterset,whichbothupdatesitsowninternalstateandinformsMySQLoftheencodingtobeusedontheconnection—thisisusuallythepreferredapproach.InPHP: Ifyou'reusingthePDOabstractionlayerwithPHP≥5.3.6,youcanspecifycharsetintheDSN: $dbh=newPDO('mysql:charset=utf8mb4'); Ifyou'reusingmysqli,youcancallset_charset(): $mysqli->set_charset('utf8mb4');//objectorientedstyle mysqli_set_charset($link,'utf8mb4');//proceduralstyle Ifyou'restuckwithplainmysqlbuthappentoberunningPHP≥5.2.3,youcancallmysql_set_charset. Ifthedriverdoesnotprovideitsownmechanismforsettingtheconnectioncharacterset,youmayhavetoissueaquerytotellMySQLhowyourapplicationexpectsdataontheconnectiontobeencoded:SETNAMES'utf8mb4'. Thesameconsiderationregardingutf8mb4/utf8appliesasabove. Output: Ifyourapplicationtransmitstexttoothersystems,theywillalsoneedtobeinformedofthecharacterencoding.Withwebapplications,thebrowsermustbeinformedoftheencodinginwhichdataissent(throughHTTPresponseheadersorHTMLmetadata). InPHP,youcanusethedefault_charsetphp.inioption,ormanuallyissuetheContent-TypeMIMEheaderyourself,whichisjustmoreworkbuthasthesameeffect. Whenencodingtheoutputusingjson_encode(),addJSON_UNESCAPED_UNICODEasasecondparameter. Input: Unfortunately,youshouldverifyeveryreceivedstringasbeingvalidUTF-8beforeyoutrytostoreitoruseitanywhere.PHP'smb_check_encoding()doesthetrick,butyouhavetouseitreligiously.There'sreallynowayaroundthis,asmaliciousclientscansubmitdatainwhateverencodingtheywant,andIhaven'tfoundatricktogetPHPtodothisforyoureliably. FrommyreadingofthecurrentHTMLspec,thefollowingsub-bulletsarenotnecessaryorevenvalidanymoreformodernHTML.Myunderstandingisthatbrowserswillworkwithandsubmitdatainthecharactersetspecifiedforthedocument.However,ifyou'retargetingolderversionsofHTML(XHTML,HTML4,etc.),thesepointsmaystillbeuseful: ForHTMLbeforeHTML5only:youwantalldatasenttoyoubybrowserstobeinUTF-8.Unfortunately,ifyougobytheonlywaytoreliablydothisisaddtheaccept-charsetattributetoallyour
延伸文章資訊
- 1Changing source files encoding and some fun with PowerShell
Changing source files encoding and some fun with PowerShell ... At least it can correctly read te...
- 2How to: Save and open files with encoding - Visual Studio (Windows)
- 3PowerShell Studio tip: View and change file encoding
- 4Change / Save encoding How to convert several txt files UTF ...
Hello. I need to convert several txt files , located in C:\Folder1 from UTF-8 to UTF-8-BOM. I had...
- 5Byte order mark - Globalization - Microsoft Learn