Changing source files encoding and some fun with PowerShell


One day I was asked to assist in creating a PDF document with all the source code of one of our libraries. This weird task was needed for patenting our product. It's possible to do manually, indeed, but it's humiliating for a dev :). Obviously it can be easily automated in many ways. It turned out to be a fun adventure.

First of all, some context: the library is a .NET solution with almost 3000 C# files. It's a little bit big :). The first issue I encountered was that the solution contains files in different encodings. All comments were written in Russian, so some files were in an ASCII-based encoding (windows-1251) while others were in UTF-8. It's a mess. So the first idea was to normalize all files to UTF-8, despite the fact that the comments themselves are not needed for this task. Besides comments, the sources contain strings with national letters, so the encoding is important anyway. By "important" I mean that it must be known in order to read and process a file.

In this post I'll talk about converting encodings, and in the next one about generating Word/PDF files.

There are three approaches to detecting an encoding:

- Use the byte order mark (BOM). It's a naive approach for telling Unicode from ASCII, and in practice it doesn't work, as it's common not to have a BOM in UTF-8 files.
- Use some platform API; for example, on Windows there is the MLang COM component (mlang.dll).
- Try to detect the encoding by heuristics on our own.

I found a nice lib/tool wrapper for MLang here: http://www.codeproject.com/Articles/17201/Detect-Encoding-for-In-and-Outgoing-Text. But I wasn't keen to use COM. Next I found this C# port of Mozilla's Universal Charset Detector: https://github.com/errepi/ude. It worked pretty well for me. Not ideally: in some cases it didn't see the difference between windows-1251 and mac-Cyrillic content. But my goal wasn't to detect arbitrary encodings, only to differentiate UTF-8 from ASCII, so it was good enough.

It should be understood that detecting an encoding is, in any case, a nondeterministic task. We can guess based on content, but it's not 100% guaranteed.

I decided to use PowerShell to process the files. Below you'll find a simple script that processes all files matching a mask in a folder, recursively. The script detects the encoding and saves each file back in UTF-8.

UPDATE: After I published the post I realized that all this stuff about detecting encoding (at least in such a way) is over-complication. It turned out that PowerShell's Get-Content cmdlet supports detecting file encoding. It's undocumented, but it works. At least it can correctly read text in UTF-8 and ASCII (windows-1251 in my case). I'm not sure whether it's able to distinguish between different ASCII encodings, but at least it can differentiate UTF-8 from non-UTF-8. I suspect the cmdlet uses the Windows regional settings to get the code page for non-Unicode content; if that's the case, it's pretty limited. So I left the post mostly unchanged, as it can still be helpful to demonstrate different techniques for working with PowerShell. Also, Get-Content doesn't work correctly with UTF-8 files without a BOM, so in some cases it can be useful to use encoding detection anyway. At the bottom I put a simpler working version of the script.

All my efforts ended up in a script which I published on GitHub: https://github.com/evil-shrike/SourceFilesProcessor. Here's a simplified version of it.

Let's go through the script. First we need to load the library's assembly. We use the Add-Type cmdlet for this. As the current directory can differ from the directory where the script is located, we use an absolute path to the library, getting it from the path of the script itself:

```powershell
$scriptPath = Split-Path -Path $MyInvocation.MyCommand.Definition -Parent
Add-Type -Path "$scriptPath\Ude.dll"
```

After the assembly is loaded we can use types from it. Usage of the UDE lib is wrapped in a GetFileEncoding function:

```powershell
# Read the content of the file as bytes
[byte[]]$bytes = Get-Content -Encoding Byte -Path $filePath
# Create an instance of Ude.CharsetDetector
$cdet = New-Object -TypeName Ude.CharsetDetector
# Pass the bytes to UDE to detect the encoding
$cdet.Feed($bytes, 0, $bytes.Length)
$cdet.DataEnd()
# Get the result: it's the name of an encoding
return $cdet.Charset
```
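The snippet above is only the function body; a minimal sketch of the complete helper and how it might be called could look like this (the exact shape in the script on GitHub may differ):

```powershell
# Sketch of a complete GetFileEncoding helper built on UDE (assumes Ude.dll is already loaded)
function GetFileEncoding($filePath)
{
    # Read the raw bytes of the file (-ReadCount 0 returns them as a single array)
    [byte[]]$bytes = Get-Content -Encoding Byte -ReadCount 0 -Path $filePath
    # Feed the bytes to the charset detector
    $cdet = New-Object -TypeName Ude.CharsetDetector
    $cdet.Feed($bytes, 0, $bytes.Length)
    $cdet.DataEnd()
    # Charset is a string such as "UTF-8" or "windows-1251", or $null if nothing was detected
    return $cdet.Charset
}

# Example usage: print the detected encoding for every C# file under a folder
Get-ChildItem -Path "C:\src\MyLib" -Filter *.cs -Recurse | ForEach-Object {
    "{0}: {1}" -f $_.FullName, (GetFileEncoding $_.FullName)
}
```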
Unfortunately the lib doesn't support System.Text.Encoding and just returns a string.

Next, once we have detected the file's encoding, we can correctly read it as text. We're interested only in cases when the encoding is neither UTF-8 (already good) nor 7-bit ASCII (no national letters, also good). As you can see, I hard-coded the code page here, assuming that any non-Unicode file is in windows-1251. If you have files in different ASCII code pages, it'll be a little harder to pick the correct encoding. The UDE library returns the name of the encoding with the highest probability (it's called "Confidence" inside the lib), but internally it keeps results (probabilities) for all tested encodings.

```powershell
$text = Get-Content $filePath -Encoding Byte -ReadCount 0
$text = [System.Text.Encoding]::GetEncoding(1251).GetString($text)
```

Now that we have the content of the file, save it back in UTF-8:

```powershell
[System.IO.File]::WriteAllText($filePath, $text)
```

What can be easier in PowerShell, right? But why do I use .NET's File.WriteAllText? There are different ways to save files in PS:

```powershell
Set-Content -Path "$filePath" -Value $text -Encoding utf8 -Force
Out-File -FilePath "$filePath" -InputObject $text -Encoding utf8 -Force
```

It turned out that both methods insert a BOM into the saved file, and it's unavoidable. I didn't want to pollute my files with a BOM. That's why I use File.WriteAllText: it just saves in UTF-8 without a BOM.

That's all on the task of changing file encodings. But there are some PowerShell gotchas I ran into while coding. In the following sections I'll share small tricks and tips on PowerShell.

Resolve-RelativePath

For some reason neither PowerShell nor .NET has an easy method to get a relative path. Sad but true. So here's such a method. In C# it'd be:

```csharp
string ResolveRelativePath(string path, string fromPath)
{
    return Uri.UnescapeDataString(
        new Uri(fromPath).MakeRelativeUri(new Uri(path))
            .ToString()
            .Replace('/', Path.DirectorySeparatorChar));
}
```

It returns a relative path to the folder/file path from the folder fromPath. If path = "c:\temp\folder\file.ext" and fromPath = "c:\temp" then the result will be "folder\file.ext".

In PowerShell:

```powershell
function Resolve-RelativePath($path, $fromPath)
{
    $path = Resolve-Path $path
    $fromPath = Resolve-Path $fromPath
    $fromUri = New-Object -TypeName System.Uri -ArgumentList "$fromPath"
    $pathUri = New-Object -TypeName System.Uri -ArgumentList "$path"
    return [System.Uri]::UnescapeDataString(
        $fromUri.MakeRelativeUri($pathUri).ToString().Replace(
            '/', [System.IO.Path]::DirectorySeparatorChar))
}
```
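A quick usage example, assuming both paths exist (Resolve-Path throws for non-existent paths); the paths here are just the ones from the example above:

```powershell
# Per the example above, this should give: folder\file.ext
Resolve-RelativePath -path "c:\temp\folder\file.ext" -fromPath "c:\temp"
```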
Reading a file as text

The Get-Content cmdlet returns an array of strings by default. Can you believe it? If we need just the content of a file as a single string, we must specify the -Raw option:

```powershell
$text = Get-Content $filePath -Encoding UTF8 -Raw
```

Process files

Let's suppose we need to update the header in all files. A file header is a comment before the first line of code, so we need to remove the old header first and then add a new one. To remove the header we split the text by new lines (the -split operator) and filter with where:

```powershell
# Remove all comment lines from the beginning of the file
$bHeader = $true
$lines = $text -split "`r`n" | where {
    if (!$bHeader) { return $true }
    if ($_ -eq "`r`n" -or $_ -eq "`n") { return $false }
    if ($_ -notmatch "^//") {
        $bHeader = $false
        return $true
    }
    return $false
}
# Add the new header
$lines = @($headerText) + $lines
$text = [System.String]::Join([System.Environment]::NewLine, $lines)
```

Having an array of lines ($lines), we need to insert an item at the beginning. IMO, adding an item to an array is not very obvious in PS. Someone may think of $lines.Add($item), but it won't work, as arrays are fixed-length. So usually people use the += operator, which creates a new array:

```powershell
$lines += $headerText
```

But we need to insert, not append, so:

```powershell
$lines = @($headerText) + $lines
```

The simpler version

Here's the script for converting file encodings to UTF-8 using Get-Content's ability to detect encoding. It doesn't use any external library, as Get-Content supports basic encoding detection, which can be enough.
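A minimal sketch of such a script, assuming Windows PowerShell 5.x and that Get-Content's detection is good enough for the files at hand (the actual script in the GitHub repo may differ):

```powershell
# Sketch only: re-save every matching file as UTF-8 without BOM,
# relying on Get-Content's built-in encoding detection.
param(
    [string]$Path = ".",
    [string]$Filter = "*.cs"
)

# UTF8Encoding($false) means "no BOM"
$utf8NoBom = New-Object System.Text.UTF8Encoding($false)

Get-ChildItem -Path $Path -Filter $Filter -Recurse | ForEach-Object {
    # Get-Content guesses the encoding (BOM, UTF-8 or the ANSI code page)
    $text = Get-Content -Path $_.FullName -Raw
    # Write the text back as UTF-8 without a BOM
    [System.IO.File]::WriteAllText($_.FullName, $text, $utf8NoBom)
}
```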


