Read UTF-8 files correctly with PowerShell - iTecNote
文章推薦指數: 80 %
A PowerShell script creates a file with UTF-8 encoding; The user may or may not edit the file, possibly losing the BOM, but should keep the encoding as UTF-8, ... Skiptocontent encodingpowershellutf-8 Followingsituation: APowerShellscriptcreatesafilewithUTF-8encoding Theusermayormaynoteditthefile,possiblylosingtheBOM,butshouldkeeptheencodingasUTF-8,andpossiblychangingthelineseparators ThesamePowerShellscriptreadsthefile,addssomemorecontentandwritesitallasUTF-8backtothesamefile Thiscanbeiteratedmanytimes WithGet-ContentandOut-File-EncodingUTF8Ihaveproblemsreadingitcorrectly.It'sstumblingovertheBOMithaswrittenbefore(puttingitinthecontent,breakingmyparsingregex),doesnotuseUTF-8encodingandevendeleteslinebreaksintheoriginalcontentpart. IneedafunctionthatcanreadanyfilewithUTF-8encoding,ignoreanddeletetheBOMandnotmodifythecontent.WhatshouldIuse? Update IhaveaddedalittletestscriptthatshowswhatI'mtryingtodoandwhathappensinstead. #Readdataifexists $data="" $startRev=1; if(Test-Pathtest.txt) { $data=Get-Content-Pathtest.txt if($data-match"^[0-9-]{10}-r([0-9]+)") { $startRev=[int]$matches[1]+1 } } Write-HostNextrevisionis$startRev #Defineexampledatatoadd $startRev=$startRev+10 $newMsgs="2014-04-01-r"+$startRev+"`r`n`r`n"+` "Line1`r`n"+` "Line2`r`n`r`n" #Writenewdataback $data=$newMsgs+$data $data|Out-Filetest.txt-EncodingUTF8 Afterrunningitafewtimes,newsectionsshouldbeaddedtothebeginningofthefile,theexistingcontentshouldnotbealteredinanyway(currentlyloseslinebreaks)andnoadditionalnewlinesshouldbeaddedattheendofthefile(seemstohappensometimes). Instead,thesecondrungivesmeanerror. BestSolution IfthefileissupposedtobeUTF8whydon'tyoutrytoreaditdecodingUTF8: Get-Content-Pathtest.txt-EncodingUTF8 RelatedSolutionsBestwaytoconverttextfilesbetweencharactersets Stand-aloneutilityapproach iconv-fISO-8859-1-tUTF-8in.txt>out.txt -fENCODINGtheencodingoftheinput -tENCODINGtheencodingoftheoutput Youdon'thavetospecifyeitherofthesearguments.Theywilldefaulttoyourcurrentlocale,whichisusuallyUTF-8. Php–UTF-8allthewaythrough DataStorage: Specifytheutf8mb4charactersetonalltablesandtextcolumnsinyourdatabase.ThismakesMySQLphysicallystoreandretrievevaluesencodednativelyinUTF-8.NotethatMySQLwillimplicitlyuseutf8mb4encodingifautf8mb4_*collationisspecified(withoutanyexplicitcharacterset). InolderversionsofMySQL(<5.5.3),you'llunfortunatelybeforcedtousesimplyutf8,whichonlysupportsasubsetofUnicodecharacters.IwishIwerekidding. DataAccess: Inyourapplicationcode(e.g.PHP),inwhateverDBaccessmethodyouuse,you'llneedtosettheconnectioncharsettoutf8mb4.Thisway,MySQLdoesnoconversionfromitsnativeUTF-8whenithandsdataofftoyourapplicationandviceversa. Somedriversprovidetheirownmechanismforconfiguringtheconnectioncharacterset,whichbothupdatesitsowninternalstateandinformsMySQLoftheencodingtobeusedontheconnection—thisisusuallythepreferredapproach.InPHP: Ifyou'reusingthePDOabstractionlayerwithPHP≥5.3.6,youcanspecifycharsetintheDSN: $dbh=newPDO('mysql:charset=utf8mb4'); Ifyou'reusingmysqli,youcancallset_charset(): $mysqli->set_charset('utf8mb4');//objectorientedstyle mysqli_set_charset($link,'utf8mb4');//proceduralstyle Ifyou'restuckwithplainmysqlbuthappentoberunningPHP≥5.2.3,youcancallmysql_set_charset. Ifthedriverdoesnotprovideitsownmechanismforsettingtheconnectioncharacterset,youmayhavetoissueaquerytotellMySQLhowyourapplicationexpectsdataontheconnectiontobeencoded:SETNAMES'utf8mb4'. Thesameconsiderationregardingutf8mb4/utf8appliesasabove. Output: Ifyourapplicationtransmitstexttoothersystems,theywillalsoneedtobeinformedofthecharacterencoding.Withwebapplications,thebrowsermustbeinformedoftheencodinginwhichdataissent(throughHTTPresponseheadersorHTMLmetadata). InPHP,youcanusethedefault_charsetphp.inioption,ormanuallyissuetheContent-TypeMIMEheaderyourself,whichisjustmoreworkbuthasthesameeffect. Whenencodingtheoutputusingjson_encode(),addJSON_UNESCAPED_UNICODEasasecondparameter. Input: Unfortunately,youshouldverifyeveryreceivedstringasbeingvalidUTF-8beforeyoutrytostoreitoruseitanywhere.PHP'smb_check_encoding()doesthetrick,butyouhavetouseitreligiously.There'sreallynowayaroundthis,asmaliciousclientscansubmitdatainwhateverencodingtheywant,andIhaven'tfoundatricktogetPHPtodothisforyoureliably. FrommyreadingofthecurrentHTMLspec,thefollowingsub-bulletsarenotnecessaryorevenvalidanymoreformodernHTML.Myunderstandingisthatbrowserswillworkwithandsubmitdatainthecharactersetspecifiedforthedocument.However,ifyou'retargetingolderversionsofHTML(XHTML,HTML4,etc.),thesepointsmaystillbeuseful: ForHTMLbeforeHTML5only:youwantalldatasenttoyoubybrowserstobeinUTF-8.Unfortunately,ifyougobytheonlywaytoreliablydothisisaddtheaccept-charsetattributetoallyour
延伸文章資訊
- 1Byte order mark - Globalization - Microsoft Learn
- 2Understanding file encoding in VS Code and PowerShell
This problem occurs because VS Code encodes the character – in UTF-8 as the bytes 0xE2 0x80 0x93 ...
- 3Get-Content - PowerShell Command - PDQ
UTF8: Encodes in UTF-8 format. ... Therefore, by default, when reading a text file, Get-Content r...
- 4Set-Content - PowerShell - SS64.com
set-content -encoding UTF8 will write a BOM if one is available in the source file, or if the sou...
- 5UTF-8 - MDN Web Docs Glossary: Definitions of Web-related terms