Regular Expressions: An Introduction for Translators
Regular expressions (also known as RegEx) are a very powerful resource and open a full range of possibilities in different programs, including some computer-assisted translation (CAT) tools. You can think of regular expressions as a search-and-replace function on steroids. Regular expressions can assist our translation work by allowing us to search, replace, and filter text in ways that would otherwise be impossible in our software tools.

Have you ever wondered how much easier it would be if you could:

- Perform two or more separate searches at the same time (e.g., searching for different forms of the same term, or perhaps for different words altogether)?
- Filter text in your CAT tool to display only those segments that are capitalized differently between the source and target?
- Search a glossary for all capitalized headwords and change them to lowercase while leaving acronyms and other terms that are in all uppercase unchanged, and do all this in a single operation?
- Filter text in your CAT tool to find the segments where the end punctuation differs between the source and target?
- Filter text in your CAT tool to display only the segments that contain certain words in the source, or the segments that don't contain a specific word in the target?

In short, have you ever wondered how much easier it would be if you could do something beyond what the normal search-and-replace function can do? If yes, then regular expressions may help you.

At first, regular expressions may appear cryptic, but once you've learned the basics and seen how useful they can be, you'll be able to decide how much time you want to invest to become more proficient at using them. The following will focus on using regular expressions for searching, replacing, and filtering text in CAT tools such as SDL Trados Studio, memoQ, or Xbench. CAT tools also use regular expressions for creating segmentation and auto-translation rules, or for protecting tags.

Figure 1: Using a Single Regular Expression to Find Different Forms of the Same Term

What Are Regular Expressions and How Can They Help Translators?
A regular expression is a special sequence of characters or symbols that defines a search pattern. This pattern is then used to search for (or replace) specific instances of words or phrases in a text. [1] Regular expressions are used by search engines, text editors, text processing utilities, and for lexical analysis.

The simplest regular expressions use no symbols, just normal characters. For example, to find all instances of words that contain "able," you would use the regular expression comprising the search string "able," which would not only find the word "able," but also "enable," "able-bodied," and "agreeable." But if this was all that regular expressions could do, they wouldn't be interesting, or particularly useful, or challenging to learn.

As we'll see from the examples that follow, what makes regular expressions powerful are the various symbols and characters you can use with them.

Ways to Use Regular Expressions

Finding Different Forms of the Same Term: Here is how you can use a regular expression to ensure that the terms "gray" and "preamplifier" are spelled consistently in your translation:

- gr(a|e)y or gr[ae]y: Finds the word "gray" spelled both with an "a" ("gray") and with an "e" ("grey").
- pre[ -]?amplifier: Finds instances of "preamplifier," "pre-amplifier," and "pre amplifier." (See Figure 1.)

Let's see how these regular expressions work.

- gr(a|e)y: This regular expression searches for the letters "gr" followed by a group (enclosed in parentheses) that contains an alternative (marked by the pipe symbol "|"): either the letter "a" or the letter "e." The group is then followed by the letter "y."
- gr[ae]y: Here we do the same, but this time using a set (enclosed in brackets) of the characters possible in that position: the letters "a" or "e." (Note: both the group and the set in these examples match exactly one letter, but provide alternatives for which letter that could be.)
- pre[ -]?amplifier: This expression uses a set (enclosed in brackets) containing a space and a hyphen, followed by a question mark. The question mark is a quantifier: it tells the regular expression how many times the preceding element or character should be matched, and it means "0 or 1" times. So, this regular expression searches for "pre," followed by a space, a hyphen, or nothing at all (this is where the "zero times" comes into play), followed by "amplifier." It therefore matches "pre amplifier," "pre-amplifier," and "preamplifier." (Note: In Xbench, the hyphen must be escaped, so this regular expression would need to be changed to pre[ \-]?amplifier.)

Searching for Multiple Words at the Same Time: Let's say you've just received a long translation to edit. Since it was done by translators from different countries, you notice that they used different words for the same term. You want to filter the target text to see all the segments that contain either "melocotón" or "durazno," two alternative translations for the word "peach." Using the simple regular expression (melocotón|durazno) does the trick. (Note that searching for alternatives is not limited to only two terms.)

Figure 2: Find All Segments Where Capitalization Doesn't Match between the Source and Target Text

Finding All Segments Where Target Capitalization Doesn't Match the Source: The following pair of regular expressions can be used to check that capitalization in your target document matches capitalization in your source:

^[A-Z] (capitalization in the source)
^[a-z] (capitalization in the target)

This works in the text filter of tools like memoQ, Studio 2017, or Xbench to find segments that are capitalized in the source text but not in the target. (In these tools, you would use the regular expression search mode and select "match case" or "case sensitive." See Figure 2.)
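The two term searches described in this section (spelling variants and alternative translations) can be tried outside a CAT tool as well. The following is a minimal sketch in Python, which is an assumption of this example: the article itself targets the engines built into Studio, memoQ, and Xbench, and those may differ in details. The sample segments and the helper name matching_segments are hypothetical.

```python
import re

# Hypothetical sample segments, invented for this illustration.
segments = [
    "The gray cable connects to the preamplifier.",
    "Use the grey cable with the pre-amplifier.",
    "A pre amplifier is not required.",
    "El melocotón está maduro.",
    "Compramos un durazno y una manzana.",
]

# A set: exactly one character, either "a" or "e", between "gr" and "y".
spelling = re.compile(r"gr[ae]y")
# A set holding a space and a hyphen, made optional by "?".
# The hyphen is last in the set, so it is read literally, not as a range.
compound = re.compile(r"pre[ -]?amplifier")
# Alternation between two complete words.
peach = re.compile(r"(melocotón|durazno)")


def matching_segments(pattern, texts):
    """Return the segments in which the pattern occurs at least once."""
    return [t for t in texts if pattern.search(t)]


spelling_hits = matching_segments(spelling, segments)
compound_hits = matching_segments(compound, segments)
peach_hits = matching_segments(peach, segments)

for hit in compound_hits:
    print(hit)
```

Python's searches are case sensitive by default, which corresponds to selecting "match case" in a CAT tool.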
In the pair ^[A-Z] and ^[a-z], the caret symbol (^) at the beginning of each regular expression signals the beginning of a string or segment. In each expression the caret is followed by a set of letters in brackets. Each set contains a range: the hyphen between the letters marks the range within the set. The first set is the range of all uppercase letters and the second set is the range of all lowercase letters. You can specify different ranges as necessary and have several ranges in a set. For example, you can use [A-G] to designate the range of all uppercase letters from "A" through "G," and [0-9A-Za-z] to designate all digits and all capital or lowercase letters.

Normalize Capitalization of Headwords in a Glossary: Let's say you have a glossary in a tab-delimited format and it's a mess: some headwords are capitalized, some are not, and some are acronyms in all uppercase. When preparing to import the glossary into your termbase, you decide you want to find all the capitalized headwords and replace them with lowercase while leaving the terms that are all in capital letters untouched. That is, you want to change this (columns are separated by a tab):

Acción correctiva            Corrective action
ácido nitrilotriacético      NTA, Nitrilotriacetic Acid
ADN recombinante             Recombinant DNA
bajada del nivel de agua     drawdown
Datos EMAP                   EMAP data
DDT                          DDT
Desperdicios domésticos      Household waste
Empaque a prueba de niños    CRP, Child-Resistant Packaging

into this:

acción correctiva            corrective action
ácido nitrilotriacético      NTA, Nitrilotriacetic Acid
ADN recombinante             recombinant DNA
bajada del nivel de agua     drawdown
datos EMAP                   EMAP data
DDT                          DDT
desperdicios domésticos      household waste
empaque a prueba de niños    CRP, Child-Resistant Packaging

Only the first letter of each column is affected, so a capitalized word that follows an acronym (as in the last row) is left as it is.

You can do this in a text editor like Notepad++ using a pair of regular expressions: (^|\t)([A-Z])([a-z]) in the "Find what" field, and $1\L$2$3 in the "Replace with" field. (See Figure 3.) Let's walk through the process.
In the first highlighted box in the "Find what" section in Figure 3, we start by telling Notepad++ to search for the beginning of a line (represented by the symbol "^") or a tab character (\t), and then for a word that begins with any capital letter ([A-Z]) followed by any lowercase letter ([a-z]). Each of these items is enclosed in parentheses to form its own group. In the second highlighted box in the "Replace with" section, we tell Notepad++ to replace the beginning of the line or tab character (^|\t) in the first group with the same character ($1 refers to the first group). We then tell Notepad++ to replace the initial capital letter in the second group with the same letter, but lowercase (\L$2), and to write back the third group, which is already lowercase ($3). [2] The end result: Notepad++ will find words that begin with a capital letter followed by a lowercase one, lowercase the initial, and skip any acronyms that are all uppercase.

Figure 3: Normalize Glossary Capitalization in Notepad++

Figure 4: Finding Segments Where the End Punctuation Doesn't Match between the Source Text and the Target

Finding All Segments Where the End Punctuation in the Target Text Doesn't Match the Source: Here are two regular expressions you can use to check that punctuation in your target text matches the punctuation in the source:

\.$ (punctuation in the source)
[^.]$ (punctuation in the target)

These expressions find all segment pairs that end with a period in the source text but not in the target. In the first expression, the "$" signals the end of a string or segment, and the backslash (\) followed by the "." stands for a literal period. (The backslash is the escape character. [3]) This tells the regular expression to find all segments that end in a period. In the second expression, the caret inside the set marks negation, so [^.] indicates "any character that is not a period." Therefore, [^.]$ will find all the segments that don't end in a period. (See Figure 4.) You can modify this expression to search for other punctuation marks (e.g., \?$ and [^?]$ would find all segments ending with a question mark in the source but not in the target).
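The source/target filter pairs described so far (capitalization and end punctuation) follow the same logic: a segment pair is shown only when the source matches one pattern and the target matches the other. Here is a minimal Python sketch of that logic. It is an illustration only, not how any CAT tool implements its filter; the function name filter_pairs and the sample segment pairs are hypothetical.

```python
import re


def filter_pairs(pairs, source_regex, target_regex):
    """Keep (source, target) pairs where each side matches its own pattern."""
    src = re.compile(source_regex)
    tgt = re.compile(target_regex)
    return [(s, t) for s, t in pairs if src.search(s) and tgt.search(t)]


# Hypothetical English-Spanish segment pairs.
pairs = [
    ("Press the button.", "Pulse el botón."),
    ("Close the cover.", "Cierre la tapa"),      # target lacks the period
    ("Open the file.", "abra el archivo."),      # target starts in lowercase
]

# Source ends in a period, target ends in anything but a period.
missing_period = filter_pairs(pairs, r"\.$", r"[^.]$")

# Source starts with an uppercase letter, target with a lowercase one.
lowercase_start = filter_pairs(pairs, r"^[A-Z]", r"^[a-z]")

print(missing_period)
print(lowercase_start)
```

Swapping in \?$ and [^?]$ would give the question-mark variant mentioned above.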
Finding All Terms Enclosed in Double Quotes: You can use these regular expressions to find all quoted terms in a document so you can add them to your termbase:

- ("|“).*?("|”): This finds all items enclosed in double quotes, both straight and curly. First, the regular expression finds the opening double quotes, either straight or curly. Then it finds the content enclosed in the quotes, ending with the closing double quotes. In this expression the "." means "any character that is not a paragraph mark (newline)." The asterisk "*" is another quantifier that means "between zero and any number of times," while the question mark "?" placed after it makes the match as short as possible: "but only until you find the first occurrence of what follows," here the first closing quote. Without the question mark, the regular expression would match up to the last closing double quotes in the segment. This regular expression works in both memoQ and Studio. (See Figure 5.)
- ("|“)[^"”]*("|”): The [^"”]* means "any content that is not a closing quote." This regular expression is similar to the one above, but also works in Xbench (the first one doesn't). Remember, not all tools use the same regular expression search engine, so what works in one tool may not work the same way in another. [4]

Figure 5: Find All Terms Enclosed in Double Quotes

Figure 6: An Annotated List of Regular Expressions

How Do You Learn Regular Expressions?

A good way is to start with a tutorial. The best I know is online at Regular-Expressions.info (www.regular-expressions.info).

Next, get into the habit of expressing in words what you want to do and then try to see how to convert that into regular expressions. There are also several tools and websites that can help you build and test regular expressions.

Expresso
www.ultrapico.com/expresso.htm
Expresso is free for use with .NET regular expressions only (i.e., with the regular expressions used in Studio and memoQ).

Regular Expressions 101
https://regex101.com
A free online tool that explains what each element of your regular expression does.

RegexBuddy
www.regexbuddy.com
RegexBuddy is a commercial tool that will integrate with your favorite searching and editing tools for instant access. It will also help you collect and document libraries of regular expressions for future reuse.
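A few lines of script can also serve as a quick test bench. As an illustration, the sketch below uses Python (an assumption of this example; Studio and memoQ use .NET regular expressions, and Xbench has its own engine) to compare the quoted-term patterns discussed earlier: the version with *?, the version with a plain *, and the negated-set version. The sample sentence is hypothetical.

```python
import re

# Hypothetical segment with one straight-quoted and one curly-quoted term.
text = 'Select "Save As" and then click “OK” to confirm.'

# With "?" after "*": stop at the first closing quote.
lazy = re.compile(r'("|“).*?("|”)')
# Without "?": run on to the last closing quote in the segment.
greedy = re.compile(r'("|“).*("|”)')
# Negated set: any run of characters that are not closing quotes.
negated = re.compile(r'("|“)[^"”]*("|”)')

lazy_hits = [m.group(0) for m in lazy.finditer(text)]
greedy_hits = [m.group(0) for m in greedy.finditer(text)]
negated_hits = [m.group(0) for m in negated.finditer(text)]

print(lazy_hits)     # two separate quoted terms
print(greedy_hits)   # one long match spanning both terms
print(negated_hits)  # same terms as the first pattern
```

On this sample, the first and third patterns return the two quoted terms separately, while the second returns a single match that runs from the first opening quote to the last closing quote.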
Regardless of the tool you use, a general suggestion is to keep a list of the regular expressions you use and write a brief description to remember what each does. No need for anything fancy; a simple text file will do. (See Figure 6 for an example.)

Notes

1. Based on the definition provided by Wikipedia.
2. This, unfortunately, doesn't work for accented letters, which are left unchanged.
3. The backslash is used because the dot has a special meaning in regular expressions. When you need to search for the period itself, you need to escape it. For certain regular expression engines, when within a set (but not elsewhere), the dot just indicates the period character.
4. My thanks to Josep Condal of ApSIC for suggesting this regular expression for Xbench.

Regular Expression Cheat Sheet

(Note: These are some of the more important RegEx symbols. See the additional references at the end of this article, or your CAT tool help file, for more.)

RegEx Symbol    Explanation
()              Group
[]              Set
|               Alternative
.               Any character that is not a paragraph mark
?               Quantifier: matches the previous character zero or one time
+               Quantifier: matches the previous character one or more times
*               Quantifier: matches the previous character zero or more times
{n}             Exact quantifier: matches the previous character exactly n times
{n,m}           Exact range quantifier: matches the previous character between n and m times
^               Designates the beginning of a string or segment
$               Designates the end of a string or segment
-               Range operator (inside a set): for example, [A-D] is the range of all capital letters between A and D.
\t              Tab character
\d              The class of all digits, so any digit. The same as [0-9].
\s              The class of all whitespace, so space, non-breaking space, etc.
[^0-5]          Negated class: this means "any character that is not a digit between 0 and 5"
\               Escape character: used to search for a character that otherwise would mean something else in a regular expression. For example, to search for a question mark, we must escape it: \?
$1, $2, etc.    In a replacement operation, these represent the first group, the second group, etc.
\L              In a replacement operation, this changes what follows it (e.g., the group in \L$2) to lowercase. Note that this will not work with accented characters.
\U              In a replacement operation, this changes what follows it (e.g., the group in \U$2) to uppercase. This will not work with accented characters.
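Replacement tokens such as $1 and \L are not available in every engine. As a rough sketch of the glossary clean-up outside Notepad++, the Python snippet below keeps the article's search pattern, (^|\t)([A-Z])([a-z]), but performs the lowercasing in a small replacement function, because Python's re module has no \L token. The function names and the sample rows are hypothetical, and this is one possible approach rather than an equivalent of any particular tool.

```python
import re

# The article's search pattern: start of line or tab, then an uppercase
# letter followed by a lowercase one. MULTILINE lets ^ match on every line.
HEADWORD = re.compile(r"(^|\t)([A-Z])([a-z])", re.MULTILINE)


def lowercase_initial(match):
    """Stand-in for the replacement $1\\L$2$3: keep group 1, lowercase group 2."""
    return match.group(1) + match.group(2).lower() + match.group(3)


def normalize_glossary(text):
    """Lowercase capitalized headwords; all-uppercase acronyms are skipped."""
    return HEADWORD.sub(lowercase_initial, text)


# Hypothetical tab-delimited sample, modeled on the glossary in the article.
glossary = (
    "Acción correctiva\tCorrective action\n"
    "ADN recombinante\tRecombinant DNA\n"
    "DDT\tDDT\n"
    "Datos EMAP\tEMAP data\n"
)

normalized = normalize_glossary(glossary)
print(normalized)
```

As in Notepad++, the range [A-Z] covers only unaccented letters, so a headword beginning with an accented capital would be left unchanged by this pattern.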
RegEx Example    Explanation

pre[ -]?amplifier
    Finds "pre amplifier," "pre-amplifier," and "preamplifier" (Studio and memoQ).
pre[ \-]?amplifier
    Finds "pre amplifier," "pre-amplifier," and "preamplifier" (Xbench).
(apple|orange)
    Finds all the segments containing either the word "apple" or the word "orange." Note that this is not restricted to only two options: (apple|orange|banana) finds all segments that contain "apple," "orange," or "banana."
(".*?")|(“.*?”)
    Finds all items enclosed in double quotes (both straight quotes and curly quotes).
("|“).*?("|”)
    Finds all items enclosed in double quotes (both straight quotes and curly quotes), and it finds them even when mismatched (e.g., opening straight quotes and closing curly quotes, and vice versa). Note that ("|“).*("|”) without the question mark will find items from the first opening quote to the last closing quote.
(["“«]).*?(["”»])
    Finds all items enclosed in double quotes, covering three different types of double quotes: straight quotes, curly quotes, and angle quotes.
("|“)[^"”]*("|”)
    This works the same way, but for Xbench.
((?<=\s)|^)[-+\(]?((\d{1,3}(,\d{3})*)|\d+)(\.\d+)?\)?((?=\s)|$)
    Finds all numbers with a dot (instead of a comma) as a decimal separator, and numbers with a comma as the thousands separator. This could be changed for different numeric patterns.
^((?!searchstring).)*$
    Finds all segments that don't contain the search string (works for memoQ and Studio, not for Xbench).
-"searchstring"
    Same as above, but works for Xbench (with PowerSearch on).

RegEx (Source)    RegEx (Target)    Explanation

^[A-Z]    ^[a-z]
    This works in the text filter of a tool like memoQ, Studio 2017, and Xbench (together with the selection of "Regular Expressions" and "Case sensitive") to find the segments that are capitalized in the source but not in the target.
\.$    [^.]$
    Finds mismatched closing punctuation (in this case, the period, but the same regular expression can be adapted to search for other punctuation).
"([\.\?,!;:]$)=1"    -@1$
    (Xbench, with PowerSearch on) Finds mismatches in closing punctuation between the source and target. This covers several punctuation marks at a time. (Thanks to Oscar Martin of ApSIC for suggesting this pattern.)
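Two of the longer patterns from the example table above, the "does not contain" filter and the number finder, are easier to judge once you see what they match. The sketch below tries them in Python, which is an assumption of this example; the table lists them for memoQ and Studio, and behavior in other engines may differ. The helper names and the sample texts are hypothetical.

```python
import re

# The number pattern from the table, copied unchanged.
NUMBER = re.compile(
    r"((?<=\s)|^)[-+\(]?((\d{1,3}(,\d{3})*)|\d+)(\.\d+)?\)?((?=\s)|$)"
)


def segments_without(word, segments):
    """Return segments that do not contain the word, via ^((?!word).)*$."""
    # re.escape guards against special characters in the search string.
    pattern = re.compile(r"^((?!" + re.escape(word) + r").)*$")
    return [s for s in segments if pattern.search(s)]


def find_numbers(text):
    """Return numbers written with a decimal dot and comma thousands separator."""
    return [m.group(0) for m in NUMBER.finditer(text)]


# Hypothetical sample data.
segments = ["Check the cable.", "Replace the fuse.", "cable tray"]

no_cable = segments_without("cable", segments)
english_style = find_numbers("Total 1,234.56 units and 42 items")
european_style = find_numbers("1.234,56 units")

print(no_cable)
print(english_style)
print(european_style)
```

The last call returns nothing because the sample number uses a comma as the decimal separator, which is exactly the kind of mismatch this pattern is meant to expose when filtering a target text.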
Search Field    Replace Field    Explanation

^([A-Z])([a-z])    \L$1$2
    (At the beginning of a line, with match case on) Searches for all strings at the beginning of a segment that begin with an uppercase letter followed by a lowercase letter and changes the uppercase letter to lowercase. This skips words (such as acronyms) that are all uppercase.
(\t)([A-Z])([a-z])    $1\L$2$3
    (After a tab) Same as above, but after a tab, instead of at the beginning of a line. These two RegEx search and replace strings are useful for converting to lowercase glossaries that were written with capitalized entries.
(^|\t)([A-Z])([a-z])    $1\L$2$3
    This combines the two previous search and replace operations: converts to lowercase words at the beginning of a line and after a tab.

Which Translation Tools Use Regular Expressions?

CAT Tools

SDL Trados Studio
http://bit.ly/SDL-RegEx

memoQ
http://bit.ly/memoQ-RegEx

Other CAT tools (e.g., Wordfast or Déjà Vu) also use regular expressions; however, certain tools may use them for file preparation and project management, but not for search and replace or filtering.

Wordfast
http://bit.ly/Wordfast-RegEx

Déjà Vu
http://bit.ly/DéjaVu-RegEx

Translation Quality Assurance and Support Tools

Xbench
http://bit.ly/Xbench-RegEx

QA Distiller
http://bit.ly/QaDistiller-RegEx

Text Editors and Word Processors

Notepad++
https://notepad-plus-plus.org
Notepad++ is an excellent text editor that uses regular expressions.

MS Word
http://bit.ly/MSWord-wildcards
MS Word, in Advanced Find, uses wildcards, a simplified form of regular expressions.

Additional References

Multifarious
https://multifarious.filkin.com
Paul Filkin's blog: he often writes about how to use regular expressions in Studio.

Translation Tribulations
http://www.translationtribulations.com/
Kevin Lossner's blog is a great resource for memoQ, with various posts that explain how to use regular expressions to fine-tune memoQ.

Regular-Expressions.info
www.regular-expressions.info
A great tutorial and reference site that covers regular expressions in depth.

RegExLib.com
www.regexlib.com/Default.aspx
The internet's first regular expression library.
Regular Expression Language: Quick Reference
http://bit.ly/Microsoft-regex
A reference to Microsoft's .NET regular expressions, which are used in memoQ and Studio.

Riccardo Schiaffino has always worked in translation: first as a freelance translator, then as a partner in a translation agency, and later as a translator and translation manager for a major software company. He is particularly interested in using software tools to help improve translation quality. He currently works for Aliquantum, a company specializing in Italian and Spanish legal, medical, and IT translation and localization. He also teaches translation, translation tools, and localization at Denver University. Contact: [email protected].

The ATA Chronicle © 2022 All rights reserved.