re — Regular expression operations — Python 3.10.5 ...


Source code: Lib/re.py

This module provides regular expression matching operations similar to those found in Perl.

Both patterns and strings to be searched can be Unicode strings (str) as well as 8-bit strings (bytes). However, Unicode strings and 8-bit strings cannot be mixed: that is, you cannot match a Unicode string with a byte pattern or vice-versa; similarly, when asking for a substitution, the replacement string must be of the same type as both the pattern and the search string.

Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python's usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal. Also, please note that any invalid escape sequences in Python's usage of the backslash in string literals now generate a DeprecationWarning and in the future this will become a SyntaxError. This behaviour will happen even if it is a valid escape sequence for a regular expression.

The solution is to use Python's raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation.

It is important to note that most regular expression operations are available as module-level functions and methods on compiled regular expressions. The functions are shortcuts that don't require you to compile a regex object first, but miss some fine-tuning parameters.
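
The backslash collision described above can be checked directly. The following is a minimal sketch; sample_path is a made-up value used only for illustration.

```python
import re

# A raw string keeps the backslash; a normal literal interprets the escape.
raw_newline = r"\n"    # two characters: a backslash and 'n'
real_newline = "\n"    # one character: a newline

# Both spellings describe the same regular expression, \\, which matches
# one literal backslash.
escaped_pattern = "\\\\"
raw_pattern = r"\\"

# Hypothetical sample text containing exactly one backslash.
sample_path = r"C:\temp"
match = re.search(raw_pattern, sample_path)
```

Writing the pattern as a raw string keeps it identical to the regular expression the engine sees, which is why raw strings are the usual choice.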
See also: The third-party regex module, which has an API compatible with the standard library re module, but offers additional functionality and a more thorough Unicode support.

Regular Expression Syntax

A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular expressions; if A and B are both regular expressions, then AB is also a regular expression. In general, if a string p matches A and another string q matches B, the string pq will match AB. This holds unless A or B contain low precedence operations; boundary conditions between A and B; or have numbered group references. Thus, complex expressions can easily be constructed from simpler primitive expressions like the ones described here. For details of the theory and implementation of regular expressions, consult the Friedl book [Frie09], or almost any textbook about compiler construction.

A brief explanation of the format of regular expressions follows. For further information and a gentler presentation, consult the Regular Expression HOWTO.

Regular expressions can contain both special and ordinary characters. Most ordinary characters, like 'A', 'a', or '0', are the simplest regular expressions; they simply match themselves. You can concatenate ordinary characters, so last matches the string 'last'. (In the rest of this section, we'll write RE's in this special style, usually without quotes, and strings to be matched 'in single quotes'.)

Some characters, like '|' or '(', are special. Special characters either stand for classes of ordinary characters, or affect how the regular expressions around them are interpreted.

Repetition qualifiers (*, +, ?, {m,n}, etc) cannot be directly nested. This avoids ambiguity with the non-greedy modifier suffix ?, and with other modifiers in other implementations. To apply a second repetition to an inner repetition, parentheses may be used. For example, the expression (?:a{6})* matches any multiple of six 'a' characters.
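
The nesting rule for repetition qualifiers can be sketched as follows. The helper name is_multiple_of_six_as is hypothetical, and the except branch assumes that a directly nested repetition is rejected at compile time with re.error.

```python
import re

# Wrapping the inner repetition in a non-capturing group lets a second
# repetition apply to it.
multiple_of_six = re.compile(r"(?:a{6})*")


def is_multiple_of_six_as(text):
    """Return True if text is zero or more blocks of exactly six 'a's."""
    return multiple_of_six.fullmatch(text) is not None


# Nesting the qualifiers directly is expected to fail to compile.
try:
    re.compile(r"a{6}*")
    direct_nesting_allowed = True
except re.error:
    direct_nesting_allowed = False
```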
The special characters are:

.
(Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.

^
(Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.

$
Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. foo matches both 'foo' and 'foobar', while the regular expression foo$ matches only 'foo'. More interestingly, searching for foo.$ in 'foo1\nfoo2\n' matches 'foo2' normally, but 'foo1' in MULTILINE mode; searching for a single $ in 'foo\n' will find two (empty) matches: one just before the newline, and one at the end of the string.

*
Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match 'a', 'ab', or 'a' followed by any number of 'b's.

+
Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match 'a' followed by any non-zero number of 'b's; it will not match just 'a'.

?
Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either 'a' or 'ab'.

*?, +?, ??
The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn't desired; if the RE <.*> is matched against '<a> b <c>', it will match the entire string, and not just '<a>'. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using the RE <.*?> will match only '<a>'.

{m}
Specifies that exactly m copies of the previous RE should be matched; fewer matches cause the entire RE not to match. For example, a{6} will match exactly six 'a' characters, but not five.

{m,n}
Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. For example, a{3,5} will match from 3 to 5 'a' characters. Omitting m specifies a lower bound of zero, and omitting n specifies an infinite upper bound. As an example, a{4,}b will match 'aaaab' or a thousand 'a' characters followed by a 'b', but not 'aaab'. The comma may not be omitted or the modifier would be confused with the previously described form.
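
The greedy and non-greedy behaviour, the two meanings of $, and bounded repetition described above can be tried out with a short sketch. All variable names are illustrative only.

```python
import re

# Greedy versus non-greedy repetition on the same input.
markup = "<a> b <c>"
greedy = re.match(r"<.*>", markup).group(0)   # takes as much as possible
lazy = re.match(r"<.*?>", markup).group(0)    # takes as little as possible

# '$' with and without MULTILINE.
text = "foo1\nfoo2\n"
default_hit = re.search(r"foo.$", text).group(0)
multiline_hit = re.search(r"foo.$", text, re.MULTILINE).group(0)

# A bare '$' matches twice in 'foo\n': just before the newline and at the end.
dollar_positions = [m.start() for m in re.finditer(r"$", "foo\n")]

# Bounded repetition: at least four 'a' characters before the 'b'.
at_least_four = re.compile(r"a{4,}b")
```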
{m,n}?
Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible. This is the non-greedy version of the previous qualifier. For example, on the 6-character string 'aaaaaa', a{3,5} will match 5 'a' characters, while a{3,5}? will only match 3 characters.

\
Either escapes special characters (permitting you to match characters like '*', '?', and so forth), or signals a special sequence; special sequences are discussed below.

If you're not using a raw string to express the pattern, remember that Python also uses the backslash as an escape sequence in string literals; if the escape sequence isn't recognized by Python's parser, the backslash and subsequent character are included in the resulting string. However, if Python would recognize the resulting sequence, the backslash should be repeated twice. This is complicated and hard to understand, so it's highly recommended that you use raw strings for all but the simplest expressions.

[]
Used to indicate a set of characters. In a set:

- Characters can be listed individually, e.g. [amk] will match 'a', 'm', or 'k'.
- Ranges of characters can be indicated by giving two characters and separating them by a '-', for example [a-z] will match any lowercase ASCII letter, [0-5][0-9] will match all the two-digit numbers from 00 to 59, and [0-9A-Fa-f] will match any hexadecimal digit. If - is escaped (e.g. [a\-z]) or if it's placed as the first or last character (e.g. [-a] or [a-]), it will match a literal '-'.
- Special characters lose their special meaning inside sets. For example, [(+*)] will match any of the literal characters '(', '+', '*', or ')'.
- Character classes such as \w or \S (defined below) are also accepted inside a set, although the characters they match depend on whether ASCII or LOCALE mode is in force.
- Characters that are not within a range can be matched by complementing the set. If the first character of the set is '^', all the characters that are not in the set will be matched. For example, [^5] will match any character except '5', and [^^] will match any character except '^'. ^ has no special meaning if it's not the first character in the set.
- To match a literal ']' inside a set, precede it with a backslash, or place it at the beginning of the set. For example, both [()[\]{}] and []()[{}] will match a parenthesis.
- Support of nested sets and set operations as in Unicode Technical Standard #18 might be added in the future. This would change the syntax, so to facilitate this change a FutureWarning will be raised in ambiguous cases for the time being. That includes sets starting with a literal '[' or containing literal character sequences '--', '&&', '~~', and '||'. To avoid a warning escape them with a backslash.

Changed in version 3.7: FutureWarning is raised if a character set contains constructs that will change semantically in the future.

|
A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way. This can be used inside groups (see below) as well. As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy. To match a literal '|', use \|, or enclose it inside a character class, as in [|].

(...)
Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(], [)].

(?...)
This is an extension notation (a '?' following a '(' is not meaningful otherwise). The first character after the '?' determines what the meaning and further syntax of the construct is. Extensions usually do not create a new group; (?P<name>...) is the only exception to this rule. Following are the currently supported extensions.
(?aiLmsux)
(One or more letters from the set 'a', 'i', 'L', 'm', 's', 'u', 'x'.) The group matches the empty string; the letters set the corresponding flags: re.A (ASCII-only matching), re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), re.U (Unicode matching), and re.X (verbose), for the entire regular expression. (The flags are described in Module Contents.) This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the re.compile() function. Flags should be used first in the expression string.

(?:...)
A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.

(?aiLmsux-imsx:...)
(Zero or more letters from the set 'a', 'i', 'L', 'm', 's', 'u', 'x', optionally followed by '-' followed by one or more letters from the set 'i', 'm', 's', 'x'.) The letters set or remove the corresponding flags: re.A (ASCII-only matching), re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), re.U (Unicode matching), and re.X (verbose), for the part of the expression. (The flags are described in Module Contents.)

The letters 'a', 'L' and 'u' are mutually exclusive when used as inline flags, so they can't be combined or follow '-'. Instead, when one of them appears in an inline group, it overrides the matching mode in the enclosing group. In Unicode patterns (?a:...) switches to ASCII-only matching, and (?u:...) switches to Unicode matching (default). In byte patterns (?L:...) switches to locale-dependent matching, and (?a:...) switches to ASCII-only matching (default). This override is only in effect for the narrow inline group, and the original matching mode is restored outside of the group.

New in version 3.6.

Changed in version 3.7: The letters 'a', 'L' and 'u' also can be used in a group.

(?P<name>...)
Similar to regular parentheses, but the substring matched by the group is accessible via the symbolic group name name. Group names must be valid Python identifiers, and each group name must be defined only once within a regular expression. A symbolic group is also a numbered group, just as if the group were not named.
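Named groups can be referenced in three contexts. If the pattern is (?P<quote>['"]).*?(?P=quote) (i.e. matching a string quoted with either single or double quotes):

- in the same pattern itself: (?P=quote) (as shown), \1
- when processing match object m: m.group('quote'), m.end('quote') (etc.)
- in a string passed to the repl argument of re.sub(): \g<quote>, \g<1>, \1

(?P=name)
A backreference to a named group; it matches whatever text was matched by the earlier group named name.

(?#...)
A comment; the contents of the parentheses are simply ignored.

(?=...)
Matches if ... matches next, but doesn't consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it's followed by 'Asimov'.

(?!...)
Matches if ... doesn't match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it's not followed by 'Asimov'.

(?<=...)
Matches if the current position in the string is preceded by a match for ... that ends at the current position. This is called a positive lookbehind assertion. (?<=abc)def will find a match in 'abcdef', since the lookbehind will back up 3 characters and check if the contained pattern matches. The contained pattern must only match strings of some fixed length, meaning that abc or a|b are allowed, but a* and a{3,4} are not. Note that patterns which start with positive lookbehind assertions will not match at the beginning of the string being searched; you will most likely want to use the search() function rather than the match() function:

>>> import re
>>> m = re.search('(?<=abc)def', 'abcdef')
>>> m.group(0)
'def'

This example looks for a word following a hyphen:

>>> m = re.search(r'(?<=-)\w+', 'spam-egg')
>>> m.group(0)
'egg'

Changed in version 3.5: Added support for group references of fixed length.

(?<!...)
Matches if the current position in the string is not preceded by a match for .... This is called a negative lookbehind assertion. Similar to positive lookbehind assertions, the contained pattern must only match strings of some fixed length. Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.

(?(id/name)yes-pattern|no-pattern)
Will try to match with yes-pattern if the group with given id or name exists, and with no-pattern if it doesn't. no-pattern is optional and can be omitted. For example, (<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$) is a poor email matching pattern, which will match with '<user@host.com>' as well as 'user@host.com', but not with '<user@host.com' nor 'user@host.com>'.

The special sequences consist of '\' and a character from the list below. If the ordinary character is not an ASCII digit or an ASCII letter, then the resulting RE will match the second character. For example, \$ matches the character '$'.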
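
As a rough illustration of the extension notations covered so far, the sketch below exercises a named group with its backreference, both lookahead forms, and both lookbehind forms. The sample strings and variable names are made up for the example.

```python
import re

# Named group plus a backreference to it inside the same pattern.
quoted = re.compile(r"""(?P<quote>['"]).*?(?P=quote)""")
m = quoted.search('say "hi there" now')
whole_match = m.group(0)
quote_char = m.group("quote")

# The same group referenced from a replacement string.
replaced = quoted.sub(r"\g<quote>X\g<quote>", 'say "hi"')

# Lookahead assertions do not consume the text they inspect.
positive_ahead = re.search(r"Isaac (?=Asimov)", "Isaac Asimov")
negative_ahead = re.search(r"Isaac (?!Asimov)", "Isaac Asimov")

# Lookbehind assertions inspect the text just before the current position.
after_hyphen = re.search(r"(?<=-)\w+", "spam-egg").group(0)
not_after_hyphen = re.findall(r"(?<!-)\b\w+", "spam-egg")
```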
\number
Matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'thethe' (note the space after the group). This special sequence can only be used to match one of the first 99 groups. If the first digit of number is 0, or number is 3 octal digits long, it will not be interpreted as a group match, but as the character with octal value number. Inside the '[' and ']' of a character class, all numeric escapes are treated as characters.

\A
Matches only at the start of the string.

\b
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.

By default Unicode alphanumerics are the ones used in Unicode patterns, but this can be changed by using the ASCII flag. Word boundaries are determined by the current locale if the LOCALE flag is used. Inside a character range, \b represents the backspace character, for compatibility with Python's string literals.

\B
Matches the empty string, but only when it is not at the beginning or end of a word. This means that r'py\B' matches 'python', 'py3', 'py2', but not 'py', 'py.', or 'py!'. \B is just the opposite of \b, so word characters in Unicode patterns are Unicode alphanumerics or the underscore, although this can be changed by using the ASCII flag. Word boundaries are determined by the current locale if the LOCALE flag is used.

\d
For Unicode (str) patterns: Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd]). This includes [0-9], and also many other digit characters. If the ASCII flag is used only [0-9] is matched.
For 8-bit (bytes) patterns: Matches any decimal digit; this is equivalent to [0-9].

\D
Matches any character which is not a decimal digit. This is the opposite of \d. If the ASCII flag is used this becomes the equivalent of [^0-9].
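
The word-boundary and digit sequences above can be checked with a small sketch. The candidate strings come from the text; the Arabic-Indic digit is an arbitrary choice of a non-ASCII character in category Nd.

```python
import re

# \b requires a word boundary on both sides of 'foo'.
word = re.compile(r"\bfoo\b")
candidates = ["foo", "foo.", "(foo)", "bar foo baz", "foobar", "foo3"]
boundary_hits = [s for s in candidates if word.search(s)]

# \B requires that 'py' is followed by another word character.
inside_word = re.compile(r"py\B")
inside_hits = [s for s in ["python", "py3", "py", "py."] if inside_word.search(s)]

# \d matches any Unicode decimal digit unless ASCII mode is requested.
arabic_indic_three = "\u0663"
unicode_digit = re.fullmatch(r"\d", arabic_indic_three)
ascii_digit = re.fullmatch(r"\d", arabic_indic_three, re.ASCII)

# In a bytes pattern \d is always equivalent to [0-9].
bytes_digit = re.fullmatch(rb"\d", b"3")
```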
\s
For Unicode (str) patterns: Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched.
For 8-bit (bytes) patterns: Matches characters considered whitespace in the ASCII character set; this is equivalent to [ \t\n\r\f\v].

\S
Matches any character which is not a whitespace character. This is the opposite of \s. If the ASCII flag is used this becomes the equivalent of [^ \t\n\r\f\v].

\w
For Unicode (str) patterns: Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched.
For 8-bit (bytes) patterns: Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_]. If the LOCALE flag is used, matches characters considered alphanumeric in the current locale and the underscore.

\W
Matches any character which is not a word character. This is the opposite of \w. If the ASCII flag is used this becomes the equivalent of [^a-zA-Z0-9_]. If the LOCALE flag is used, matches characters which are neither alphanumeric in the current locale nor the underscore.

\Z
Matches only at the end of the string.

Most of the standard escapes supported by Python string literals are also accepted by the regular expression parser:

\a  \b  \f  \n
\N  \r  \t  \u
\U  \v  \x  \\

(Note that \b is used to represent word boundaries, and means "backspace" only inside character classes.)

'\u', '\U', and '\N' escape sequences are only recognized in Unicode patterns. In bytes patterns they are errors. Unknown escapes of ASCII letters are reserved for future use and treated as errors.

Octal escapes are included in a limited form. If the first digit is a 0, or if there are three octal digits, it is considered an octal escape. Otherwise, it is a group reference. As for string literals, octal escapes are always at most three digits in length.

Changed in version 3.3: The '\u' and '\U' escape sequences have been added.

Changed in version 3.6: Unknown escapes consisting of '\' and an ASCII letter now are errors.
Changed in version 3.8: The '\N{name}' escape sequence has been added. As in string literals, it expands to the named Unicode character (e.g. '\N{EM DASH}').

Module Contents

The module defines several functions, constants, and an exception. Some of the functions are simplified versions of the full featured methods for compiled regular expressions. Most non-trivial applications always use the compiled form.

Flags

Changed in version 3.6: Flag constants are now instances of RegexFlag, which is a subclass of enum.IntFlag.

re.A
re.ASCII
Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode patterns, and is ignored for byte patterns. Corresponds to the inline flag (?a).

Note that for backward compatibility, the re.U flag still exists (as well as its synonym re.UNICODE and its embedded counterpart (?u)), but these are redundant in Python 3 since matches are Unicode by default for strings (and Unicode matching isn't allowed for bytes).

re.DEBUG
Display debug information about compiled expression. No corresponding inline flag.

re.I
re.IGNORECASE
Perform case-insensitive matching; expressions like [A-Z] will also match lowercase letters. Full Unicode matching (such as Ü matching ü) also works unless the re.ASCII flag is used to disable non-ASCII matches. The current locale does not change the effect of this flag unless the re.LOCALE flag is also used. Corresponds to the inline flag (?i).

Note that when the Unicode patterns [a-z] or [A-Z] are used in combination with the IGNORECASE flag, they will match the 52 ASCII letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin capital letter I with dot above), 'ı' (U+0131, Latin small letter dotless i), 'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign). If the ASCII flag is used, only letters 'a' to 'z' and 'A' to 'Z' are matched.
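
The interaction between IGNORECASE and ASCII described above can be sketched as follows. The characters are written as escapes so the example does not depend on the source file's encoding; all variable names are illustrative.

```python
import enum
import re

kelvin_sign = "\u212a"   # KELVIN SIGN, one of the four extra letters
u_umlaut_upper = "\u00dc"
u_umlaut_lower = "\u00fc"

# With IGNORECASE alone, [a-z] also accepts the Kelvin sign.
kelvin_unicode = re.fullmatch(r"[a-z]", kelvin_sign, re.IGNORECASE)

# Adding ASCII restricts the match to the 52 ASCII letters.
kelvin_ascii = re.fullmatch(r"[a-z]", kelvin_sign, re.IGNORECASE | re.ASCII)

# Full Unicode case folding works for non-ASCII letters by default.
umlaut_unicode = re.fullmatch(u_umlaut_lower, u_umlaut_upper, re.IGNORECASE)
umlaut_ascii = re.fullmatch(u_umlaut_lower, u_umlaut_upper, re.IGNORECASE | re.ASCII)

# Flag constants are enum.IntFlag members, so they combine with '|'.
flags_are_intflag = isinstance(re.IGNORECASE | re.ASCII, enum.IntFlag)
```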
re.L
re.LOCALE
Make \w, \W, \b, \B and case-insensitive matching dependent on the current locale. This flag can be used only with bytes patterns. The use of this flag is discouraged as the locale mechanism is very unreliable, it only handles one "culture" at a time, and it only works with 8-bit locales. Unicode matching is already enabled by default in Python 3 for Unicode (str) patterns, and it is able to handle different locales/languages. Corresponds to the inline flag (?L).

Changed in version 3.6: re.LOCALE can be used only with bytes patterns and is not compatible with re.ASCII.

Changed in version 3.7: Compiled regular expression objects with the re.LOCALE flag no longer depend on the locale at compile time. Only the locale at matching time affects the result of matching.

re.M
re.MULTILINE
When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string. Corresponds to the inline flag (?m).

re.S
re.DOTALL
Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline. Corresponds to the inline flag (?s).

re.X
re.VERBOSE
This flag allows you to write regular expressions that look nicer and are more readable by allowing you to visually separate logical sections of the pattern and add comments. Whitespace within the pattern is ignored, except when in a character class, or when preceded by an unescaped backslash, or within tokens like *?, (?: or (?P<...>. When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.

This means that the two following regular expression objects that match a decimal number are functionally equal:

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")

Corresponds to the inline flag (?x).
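
The flags above can be combined with the bitwise OR operator or written inline. The sketch below compares a verbose and a compact pattern on a few made-up samples and shows how MULTILINE and DOTALL change the anchors and the dot.

```python
import re

# The verbose and compact spellings should accept the same strings.
verbose_number = re.compile(r"""\d +  # the integral part
                                \.    # the decimal point
                                \d *  # some fractional digits""", re.X)
compact_number = re.compile(r"\d+\.\d*")
samples = ["3.14", "42.", "7", "x.1"]
same_verdicts = all(
    bool(verbose_number.fullmatch(s)) == bool(compact_number.fullmatch(s))
    for s in samples
)

lines = "first\nsecond\nthird"

# Without MULTILINE, '^' and '$' only anchor to the whole string.
whole_string_only = re.findall(r"^\w+$", lines)

# With MULTILINE (as a flag or inline), they anchor to every line.
per_line_flag = re.findall(r"^\w+$", lines, re.MULTILINE)
per_line_inline = re.findall(r"(?m)^\w+$", lines)

# Adding DOTALL lets '.' cross newlines, so one greedy match spans all lines.
across_lines = re.findall(r"^.+$", lines, re.MULTILINE | re.DOTALL)
```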
Functions

re.compile(pattern, flags=0)
Compile a regular expression pattern into a regular expression object, which can be used for matching using its match(), search() and other methods, described below.

The expression's behaviour can be modified by specifying a flags value. Values can be any of the following variables, combined using bitwise OR (the | operator).

The sequence

prog = re.compile(pattern)
result = prog.match(string)

is equivalent to

result = re.match(pattern, string)

but using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.

Note: The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn't worry about compiling regular expressions.

re.search(pattern, string, flags=0)
Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

re.match(pattern, string, flags=0)
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.

If you want to locate a match anywhere in string, use search() instead (see also search() vs. match()).

re.fullmatch(pattern, string, flags=0)
If the whole string matches the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

New in version 3.4.
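
The difference between search(), match() and fullmatch() can be seen on one short input. This is a minimal sketch with a made-up two-line string.

```python
import re

text = "one\ntwo"

# match() only looks at the start of the string, even in MULTILINE mode.
match_plain = re.match(r"two", text)
match_multiline = re.match(r"^two", text, re.MULTILINE)

# search() scans forward, so MULTILINE lets '^' find the second line.
search_multiline = re.search(r"^two", text, re.MULTILINE)

# fullmatch() requires the pattern to cover the entire string.
full_partial = re.fullmatch(r"one", text)
full_whole = re.fullmatch(r"one\ntwo", text)

# A zero-length match is still a match object, not None.
zero_length = re.match(r"x*", text)
```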
re.split(pattern, string, maxsplit=0, flags=0)
Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.

>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'\W+', 'Words, words, words.', 1)
['Words', 'words, words.']
>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
['0', '3', '9']

If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string:

>>> re.split(r'(\W+)', '...words, words...')
['', '...', 'words', ', ', 'words', '...', '']

That way, separator components are always found at the same relative indices within the result list.

Empty matches for the pattern split the string only when not adjacent to a previous empty match.

>>> re.split(r'\b', 'Words, words, words.')
['', 'Words', ', ', 'words', ', ', 'words', '.']
>>> re.split(r'\W*', '...words...')
['', '', 'w', 'o', 'r', 'd', 's', '', '']
>>> re.split(r'(\W*)', '...words...')
['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']

Changed in version 3.1: Added the optional flags argument.

Changed in version 3.7: Added support of splitting on a pattern that could match an empty string.

re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings or tuples. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result.

The result depends on the number of capturing groups in the pattern. If there are no groups, return a list of strings matching the whole pattern. If there is exactly one group, return a list of strings matching that group. If multiple groups are present, return a list of tuples of strings matching the groups. Non-capturing groups do not affect the form of the result.
>>> re.findall(r'\bf[a-z]*', 'which foot or hand fell fastest')
['foot', 'fell', 'fastest']
>>> re.findall(r'(\w+)=(\d+)', 'set width=20 and height=10')
[('width', '20'), ('height', '10')]

Changed in version 3.7: Non-empty matches can now start just after a previous empty match.

re.finditer(pattern, string, flags=0)
Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result.

Changed in version 3.7: Non-empty matches can now start just after a previous empty match.

re.sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn't found, string is returned unchanged. repl can be a string or a function; if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes of ASCII letters are reserved for future use and treated as errors. Other unknown escapes such as \& are left alone. Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern. For example:

>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
...        r'static PyObject*\npy_\1(void)\n{',
...        'def myfunc():')
'static PyObject*\npy_myfunc(void)\n{'

If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string. For example:

>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
'pro--gram files'
>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
'Baked Beans & Spam'

The pattern may be a string or a pattern object.

The optional argument count is the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. If omitted or zero, all occurrences will be replaced. Empty matches for the pattern are replaced only when not adjacent to a previous empty match, so sub('x*', '-', 'abxd') returns '-a-b--d-'.
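
The two forms of repl and the count argument can be compared side by side. In this sketch the helper double_number and the sample sentence are hypothetical.

```python
import re


def double_number(match):
    """Replacement function: return the matched integer doubled, as text."""
    return str(int(match.group(0)) * 2)


sentence = "3 apples and 10 pears"

# A function repl is called once per non-overlapping match.
all_doubled = re.sub(r"\d+", double_number, sentence)

# count limits how many occurrences are replaced.
first_doubled = re.sub(r"\d+", double_number, sentence, count=1)

# A string repl can refer back to groups by number.
swapped = re.sub(r"(\w+)@(\w+)", r"\2 at \1", "user@host")

# Empty matches are replaced only when not adjacent to a previous empty match.
empty_matches = re.sub("x*", "-", "abxd")
```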
In string-type repl arguments, in addition to the character escapes and backreferences described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number; \g<2> is therefore equivalent to \2, but isn't ambiguous in a replacement such as \g<2>0. \20 would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character '0'. The backreference \g<0> substitutes in the entire substring matched by the RE.

Changed in version 3.1: Added the optional flags argument.

Changed in version 3.5: Unmatched groups are replaced with an empty string.

Changed in version 3.6: Unknown escapes in pattern consisting of '\' and an ASCII letter now are errors.

Changed in version 3.7: Unknown escapes in repl consisting of '\' and an ASCII letter now are errors.

Changed in version 3.7: Empty matches for the pattern are replaced when adjacent to a previous non-empty match.

re.subn(pattern, repl, string, count=0, flags=0)
Perform the same operation as sub(), but return a tuple (new_string, number_of_subs_made).

Changed in version 3.1: Added the optional flags argument.

Changed in version 3.5: Unmatched groups are replaced with an empty string.

re.escape(pattern)
Escape special characters in pattern. This is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it. For example:

>>> print(re.escape('https://www.python.org'))
https://www\.python\.org

>>> legal_chars = string.ascii_lowercase + string.digits + "!#$%&'*+-.^_`|~:"
>>> print('[%s]+' % re.escape(legal_chars))
[abcdefghijklmnopqrstuvwxyz0123456789!\#\$%\&'\*\+\-\.\^_`\|\~:]+

>>> operators = ['+', '-', '*', '/', '**']
>>> print('|'.join(map(re.escape, sorted(operators, reverse=True))))
/|\-|\+|\*\*|\*

This function must not be used for the replacement string in sub() and subn(), only backslashes should be escaped. For example:

>>> digits_re = r'\d+'
>>> sample = '/usr/sbin/sendmail - 0 errors, 12 warnings'
>>> print(re.sub(digits_re, digits_re.replace('\\', r'\\'), sample))
/usr/sbin/sendmail - \d+ errors, \d+ warnings

Changed in version 3.3: The '_' character is no longer escaped.
Changed in version 3.7: Only characters that can have special meaning in a regular expression are escaped. As a result, '!', '"', '%', "'", ',', '/', ':', ';', '<', '=', '>', '@', and "`" are no longer escaped.

re.purge()
Clear the regular expression cache.

Exceptions

exception re.error(msg, pattern=None, pos=None)
Exception raised when a string passed to one of the functions here is not a valid regular expression (for example, it might contain unmatched parentheses) or when some other error occurs during compilation or matching. It is never an error if a string contains no match for a pattern. The error instance has the following additional attributes:

msg
The unformatted error message.

pattern
The regular expression pattern.

pos
The index in pattern where compilation failed (may be None).

lineno
The line corresponding to pos (may be None).

colno
The column corresponding to pos (may be None).

Changed in version 3.5: Added additional attributes.

Regular Expression Objects

Compiled regular expression objects support the following methods and attributes:

Pattern.search(string[, pos[, endpos]])
Scan through string looking for the first location where this regular expression produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

The optional second parameter pos gives an index in the string where the search is to start; it defaults to 0. This is not completely equivalent to slicing the string; the '^' pattern character matches at the real beginning of the string and at positions just after a newline, but not necessarily at the index where the search is to start.

The optional parameter endpos limits how far the string will be searched; it will be as if the string is endpos characters long, so only the characters from pos to endpos - 1 will be searched for a match. If endpos is less than pos, no match will be found; otherwise, if rx is a compiled regular expression object, rx.search(string, 0, 50) is equivalent to rx.search(string[:50], 0).
>>> pattern = re.compile("d")
>>> pattern.search("dog")     # Match at index 0
<re.Match object; span=(0, 1), match='d'>
>>> pattern.search("dog", 1)  # No match; search doesn't include the "d"

Pattern.match(string[, pos[, endpos]])
If zero or more characters at the beginning of string match this regular expression, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

The optional pos and endpos parameters have the same meaning as for the search() method.

>>> pattern = re.compile("o")
>>> pattern.match("dog")      # No match as "o" is not at the start of "dog".
>>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".
<re.Match object; span=(1, 2), match='o'>

If you want to locate a match anywhere in string, use search() instead (see also search() vs. match()).

Pattern.fullmatch(string[, pos[, endpos]])
If the whole string matches this regular expression, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

The optional pos and endpos parameters have the same meaning as for the search() method.

>>> pattern = re.compile("o[gh]")
>>> pattern.fullmatch("dog")      # No match as "o" is not at the start of "dog".
>>> pattern.fullmatch("ogre")     # No match as not the full string matches.
>>> pattern.fullmatch("doggie", 1, 3)   # Matches within given limits.
<re.Match object; span=(1, 3), match='og'>

New in version 3.4.

Pattern.split(string, maxsplit=0)
Identical to the split() function, using the compiled pattern.

Pattern.findall(string[, pos[, endpos]])
Similar to the findall() function, using the compiled pattern, but also accepts optional pos and endpos parameters that limit the search region like for search().

Pattern.finditer(string[, pos[, endpos]])
Similar to the finditer() function, using the compiled pattern, but also accepts optional pos and endpos parameters that limit the search region like for search().

Pattern.sub(repl, string, count=0)
Identical to the sub() function, using the compiled pattern.

Pattern.subn(repl, string, count=0)
Identical to the subn() function, using the compiled pattern.
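
The pos and endpos parameters of the compiled-pattern methods can be explored with a short sketch. It also shows why passing pos is not the same as slicing when the pattern starts with '^'. The inputs and names are made up for illustration.

```python
import re

# '^' anchors to the real start of the string, not to pos.
anchored = re.compile(r"^b")
with_pos = anchored.search("ab", 1)      # starts at index 1, '^' fails there
with_slice = anchored.search("ab"[1:])   # the slice is a new string

# findall() restricted to the characters from index 2 to index 7.
digits = re.compile(r"\d+")
windowed = digits.findall("1 22 333 4444", 2, 8)

# fullmatch() must cover exactly the region between pos and endpos.
region = digits.fullmatch("ab123cd", 2, 5)

# Pattern.sub() behaves like re.sub() and accepts named group references.
iso_date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
reordered = iso_date.sub(r"\g<month>/\g<year>", "2022-06")
```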
Pattern.flags
The regex matching flags. This is a combination of the flags given to compile(), any (?...) inline flags in the pattern, and implicit flags such as UNICODE if the pattern is a Unicode string.

Pattern.groups
The number of capturing groups in the pattern.

Pattern.groupindex
A dictionary mapping any symbolic group names defined by (?P<id>) to group numbers. The dictionary is empty if no symbolic groups were used in the pattern.

Pattern.pattern
The pattern string from which the pattern object was compiled.

Changed in version 3.7: Added support of copy.copy() and copy.deepcopy(). Compiled regular expression objects are considered atomic.

Match Objects

Match objects always have a boolean value of True. Since match() and search() return None when there is no match, you can test whether there was a match with a simple if statement:

match = re.search(pattern, string)
if match:
    process(match)

Match objects support the following methods and attributes:

Match.expand(template)
Return the string obtained by doing backslash substitution on the template string template, as done by the sub() method. Escapes such as \n are converted to the appropriate characters, and numeric backreferences (\1, \2) and named backreferences (\g<1>, \g<name>) are replaced by the contents of the corresponding group.

Changed in version 3.5: Unmatched groups are replaced with an empty string.

Match.group([group1, ...])
Returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, group1 defaults to zero (the whole match is returned). If a groupN argument is zero, the corresponding return value is the entire matching string; if it is in the inclusive range [1..99], it is the string matching the corresponding parenthesized group. If a group number is negative or larger than the number of groups defined in the pattern, an IndexError exception is raised. If a group is contained in a part of the pattern that did not match, the corresponding result is None. If a group is contained in a part of the pattern that matched multiple times, the last match is returned.
    >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
    >>> m.group(0)       # The entire match
    'Isaac Newton'
    >>> m.group(1)       # The first parenthesized subgroup.
    'Isaac'
    >>> m.group(2)       # The second parenthesized subgroup.
    'Newton'
    >>> m.group(1, 2)    # Multiple arguments give us a tuple.
    ('Isaac', 'Newton')

If the regular expression uses the (?P<name>...) syntax, the groupN arguments may also be strings identifying groups by their group name. If a string argument is not used as a group name in the pattern, an IndexError exception is raised.

A moderately complicated example:

    >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
    >>> m.group('first_name')
    'Malcolm'
    >>> m.group('last_name')
    'Reynolds'

Named groups can also be referred to by their index:

    >>> m.group(1)
    'Malcolm'
    >>> m.group(2)
    'Reynolds'

If a group matches multiple times, only the last match is accessible:

    >>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
    >>> m.group(1)                        # Returns only the last match.
    'c3'

Match.__getitem__(g)

This is identical to m.group(g). This allows easier access to an individual group from a match:

    >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
    >>> m[0]       # The entire match
    'Isaac Newton'
    >>> m[1]       # The first parenthesized subgroup.
    'Isaac'
    >>> m[2]       # The second parenthesized subgroup.
    'Newton'

New in version 3.6.

Match.groups(default=None)

Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern. The default argument is used for groups that did not participate in the match; it defaults to None.

For example:

    >>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
    >>> m.groups()
    ('24', '1632')

If we make the decimal place and everything after it optional, not all groups might participate in the match. These groups will default to None unless the default argument is given:

    >>> m = re.match(r"(\d+)\.?(\d+)?", "24")
    >>> m.groups()      # Second group defaults to None.
    ('24', None)
    >>> m.groups('0')   # Now, the second group defaults to '0'.
    ('24', '0')

Match.groupdict(default=None)

Return a dictionary containing all the named subgroups of the match, keyed by the subgroup name. The default argument is used for groups that did not participate in the match; it defaults to None. For example:

    >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
    >>> m.groupdict()
    {'first_name': 'Malcolm', 'last_name': 'Reynolds'}

Match.start([group])
Match.end([group])

Return the indices of the start and end of the substring matched by group; group defaults to zero (meaning the whole matched substring). Return -1 if group exists but did not contribute to the match. For a match object m, and a group g that did contribute to the match, the substring matched by group g (equivalent to m.group(g)) is

    m.string[m.start(g):m.end(g)]

Note that m.start(group) will equal m.end(group) if group matched a null string. For example, after m = re.search('b(c?)', 'cba'), m.start(0) is 1, m.end(0) is 2, m.start(1) and m.end(1) are both 2, and m.start(2) raises an IndexError exception.

An example that will remove remove_this from email addresses:

    >>> email = "tony@tiremove_thisger.net"
    >>> m = re.search("remove_this", email)
    >>> email[:m.start()] + email[m.end():]
    'tony@tiger.net'

Match.span([group])

For a match m, return the 2-tuple (m.start(group), m.end(group)). Note that if group did not contribute to the match, this is (-1, -1). group defaults to zero, the entire match.

Match.pos

The value of pos which was passed to the search() or match() method of a regex object. This is the index into the string at which the RE engine started looking for a match.

Match.endpos

The value of endpos which was passed to the search() or match() method of a regex object. This is the index into the string beyond which the RE engine will not go.

Match.lastindex

The integer index of the last matched capturing group, or None if no group was matched at all. For example, the expressions (a)b, ((a)(b)), and ((ab)) will have lastindex == 1 if applied to the string 'ab', while the expression (a)(b) will have lastindex == 2, if applied to the same string.

Match.lastgroup

The name of the last matched capturing group, or None if the group didn't have a name, or if no group was matched at all.
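The positional attributes described above can be seen together on one match. This is a small sketch; the pattern and the group names int and frac are made up for illustration, and the optional frac group is deliberately left unmatched.

```python
import re

# Hypothetical pattern: an integer part and an optional fractional part.
m = re.search(r"(?P<int>\d+)(?:\.(?P<frac>\d+))?", "price: 24 USD")

whole_span = m.span()            # span of the entire match
int_span = m.span("int")         # span of a group that participated
frac_span = m.span("frac")       # (-1, -1): group exists but did not participate

# For a participating group, slicing m.string reproduces m.group().
int_text = m.string[m.start("int"):m.end("int")]

print(whole_span, int_span, frac_span, int_text)
print(m.lastindex, m.lastgroup)  # index and name of the last matched group
print(m.pos, m.endpos)           # defaults: 0 and len(m.string)
print(m.groupdict(default="0"))  # unmatched groups take the default
```

Checking for -1 (or None from group()) before slicing is a reasonable habit whenever a pattern contains optional groups.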
Match.re

The regular expression object whose match() or search() method produced this match instance.

Match.string

The string passed to match() or search().

Changed in version 3.7: Added support of copy.copy() and copy.deepcopy(). Match objects are considered atomic.

Regular Expression Examples

Checking for a Pair

In this example, we'll use the following helper function to display match objects a little more gracefully:

    def displaymatch(match):
        if match is None:
            return None
        return '<Match: %r, groups=%r>' % (match.group(), match.groups())

Suppose you are writing a poker program where a player's hand is represented as a 5-character string with each character representing a card, "a" for ace, "k" for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9" representing the card with that value.

To see if a given string is a valid hand, one could do the following:

    >>> valid = re.compile(r"^[a2-9tjqk]{5}$")
    >>> displaymatch(valid.match("akt5q"))  # Valid.
    "<Match: 'akt5q', groups=()>"
    >>> displaymatch(valid.match("akt5e"))  # Invalid.
    >>> displaymatch(valid.match("akt"))    # Invalid.
    >>> displaymatch(valid.match("727ak"))  # Valid.
    "<Match: '727ak', groups=()>"

That last hand, "727ak", contained a pair, or two of the same valued cards. To match this with a regular expression, one could use backreferences as such:

    >>> pair = re.compile(r".*(.).*\1")
    >>> displaymatch(pair.match("717ak"))     # Pair of 7s.
    "<Match: '717', groups=('7',)>"
    >>> displaymatch(pair.match("718ak"))     # No pairs.
    >>> displaymatch(pair.match("354aa"))     # Pair of aces.
    "<Match: '354aa', groups=('a',)>"

To find out what card the pair consists of, one could use the group() method of the match object in the following manner:

    >>> pair = re.compile(r".*(.).*\1")
    >>> pair.match("717ak").group(1)
    '7'

    # Error because re.match() returns None, which doesn't have a group() method:
    >>> pair.match("718ak").group(1)
    Traceback (most recent call last):
      File "<pyshell#23>", line 1, in <module>
        re.match(r".*(.).*\1", "718ak").group(1)
    AttributeError: 'NoneType' object has no attribute 'group'

    >>> pair.match("354aa").group(1)
    'a'

Simulating scanf()

Python does not currently have an equivalent to scanf(). Regular expressions are generally more powerful, though also more verbose, than scanf() format strings. The table below offers some more-or-less equivalent mappings between scanf() format tokens and regular expressions.
    scanf() Token      Regular Expression
    %c                 .
    %5c                .{5}
    %d                 [-+]?\d+
    %e, %E, %f, %g     [-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?
    %i                 [-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)
    %o                 [-+]?[0-7]+
    %s                 \S+
    %u                 \d+
    %x, %X             [-+]?(0[xX])?[\dA-Fa-f]+

To extract the filename and numbers from a string like

    /usr/sbin/sendmail - 0 errors, 4 warnings

you would use a scanf() format like

    %s - %d errors, %d warnings

The equivalent regular expression would be

    (\S+) - (\d+) errors, (\d+) warnings

search() vs. match()

Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default).

For example:

    >>> re.match("c", "abcdef")    # No match
    >>> re.search("c", "abcdef")   # Match
    <re.Match object; span=(2, 3), match='c'>

Regular expressions beginning with '^' can be used with search() to restrict the match at the beginning of the string:

    >>> re.match("c", "abcdef")    # No match
    >>> re.search("^c", "abcdef")  # No match
    >>> re.search("^a", "abcdef")  # Match
    <re.Match object; span=(0, 1), match='a'>

Note however that in MULTILINE mode match() only matches at the beginning of the string, whereas using search() with a regular expression beginning with '^' will match at the beginning of each line.

    >>> re.match('X', 'A\nB\nX', re.MULTILINE)    # No match
    >>> re.search('^X', 'A\nB\nX', re.MULTILINE)  # Match
    <re.Match object; span=(4, 5), match='X'>

Making a Phonebook

split() splits a string into a list delimited by the passed pattern. The method is invaluable for converting textual data into data structures that can be easily read and modified by Python, as demonstrated in the following example that creates a phonebook.

First, here is the input. Normally it may come from a file; here we are using triple-quoted string syntax:

    >>> text = """Ross McFluff: 834.345.1254 155 Elm Street
    ...
    ... Ronald Heathmore: 892.345.3428 436 Finley Avenue
    ... Frank Burger: 925.541.7625 662 South Dogwood Way
    ...
    ...
    ... Heather Albrecht: 548.326.4584 919 Park Place"""

The entries are separated by one or more newlines. Now we convert the string into a list with each nonempty line having its own entry:

    >>> entries = re.split("\n+", text)
    >>> entries
    ['Ross McFluff: 834.345.1254 155 Elm Street',
    'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
    'Frank Burger: 925.541.7625 662 South Dogwood Way',
    'Heather Albrecht: 548.326.4584 919 Park Place']

Finally, split each entry into a list with first name, last name, telephone number, and address. We use the maxsplit parameter of split() because the address has spaces, our splitting pattern, in it:

    >>> [re.split(":? ", entry, 3) for entry in entries]
    [['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
    ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
    ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
    ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]

The :? pattern matches the colon after the last name, so that it does not occur in the result list. With a maxsplit of 4, we could separate the house number from the street name:

    >>> [re.split(":? ", entry, 4) for entry in entries]
    [['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
    ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
    ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
    ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]

Text Munging

sub() replaces every occurrence of a pattern with a string or the result of a function. This example demonstrates using sub() with a function to "munge" text, or randomize the order of all the characters in each word of a sentence except for the first and last characters:

    >>> import random
    >>> def repl(m):
    ...     inner_word = list(m.group(2))
    ...     random.shuffle(inner_word)
    ...     return m.group(1) + "".join(inner_word) + m.group(3)
    >>> text = "Professor Abdolmalek, please report your absences promptly."
    >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
    'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
    >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
    'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'
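Because the munging example above shuffles with the global random state, its output differs on every call. The following is a minimal sketch of a repeatable variant, assuming a hypothetical munge() wrapper and seed parameter that are not part of the re module.

```python
import random
import re

def munge(text, seed=None):
    """Shuffle the inner letters of each word; first and last letters stay put."""
    rng = random.Random(seed)   # private generator, so a seed gives repeatable output

    def repl(m):
        inner_word = list(m.group(2))
        rng.shuffle(inner_word)
        return m.group(1) + "".join(inner_word) + m.group(3)

    # Words shorter than three characters do not match and are left unchanged.
    return re.sub(r"(\w)(\w+)(\w)", repl, text)

text = "Professor Abdolmalek, please report your absences promptly."
munged = munge(text, seed=0)
print(munged)
```

Passing the same seed again should reproduce the same munged sentence, which makes the function easier to test than the version that relies on module-level random state.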
Finding all Adverbs

findall() matches all occurrences of a pattern, not just the first one as search() does. For example, if a writer wanted to find all of the adverbs in some text, they might use findall() in the following manner:

    >>> text = "He was carefully disguised but captured quickly by police."
    >>> re.findall(r"\w+ly\b", text)
    ['carefully', 'quickly']

Finding all Adverbs and their Positions

If one wants more information about all matches of a pattern than the matched text, finditer() is useful as it provides match objects instead of strings. Continuing with the previous example, if a writer wanted to find all of the adverbs and their positions in some text, they would use finditer() in the following manner:

    >>> text = "He was carefully disguised but captured quickly by police."
    >>> for m in re.finditer(r"\w+ly\b", text):
    ...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
    07-16: carefully
    40-47: quickly

Raw String Notation

Raw string notation (r"text") keeps regular expressions sane. Without it, every backslash ('\') in a regular expression would have to be prefixed with another one to escape it. For example, the two following lines of code are functionally identical:

    >>> re.match(r"\W(.)\1\W", " ff ")
    <re.Match object; span=(0, 4), match=' ff '>
    >>> re.match("\\W(.)\\1\\W", " ff ")
    <re.Match object; span=(0, 4), match=' ff '>

When one wants to match a literal backslash, it must be escaped in the regular expression. With raw string notation, this means r"\\". Without raw string notation, one must use "\\\\", making the following lines of code functionally identical:

    >>> re.match(r"\\", r"\\")
    <re.Match object; span=(0, 1), match='\\'>
    >>> re.match("\\\\", r"\\")
    <re.Match object; span=(0, 1), match='\\'>

Writing a Tokenizer

A tokenizer or scanner analyzes a string to categorize groups of characters. This is a useful first step in writing a compiler or interpreter.
The text categories are specified with regular expressions. The technique is to combine those into a single master regular expression and to loop over successive matches:

    from typing import NamedTuple
    import re

    class Token(NamedTuple):
        type: str
        value: str
        line: int
        column: int

    def tokenize(code):
        keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
        token_specification = [
            ('NUMBER',   r'\d+(\.\d*)?'),  # Integer or decimal number
            ('ASSIGN',   r':='),           # Assignment operator
            ('END',      r';'),            # Statement terminator
            ('ID',       r'[A-Za-z]+'),    # Identifiers
            ('OP',       r'[+\-*/]'),      # Arithmetic operators
            ('NEWLINE',  r'\n'),           # Line endings
            ('SKIP',     r'[ \t]+'),       # Skip over spaces and tabs
            ('MISMATCH', r'.'),            # Any other character
        ]
        tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
        line_num = 1
        line_start = 0
        for mo in re.finditer(tok_regex, code):
            kind = mo.lastgroup
            value = mo.group()
            column = mo.start() - line_start
            if kind == 'NUMBER':
                value = float(value) if '.' in value else int(value)
            elif kind == 'ID' and value in keywords:
                kind = value
            elif kind == 'NEWLINE':
                line_start = mo.end()
                line_num += 1
                continue
            elif kind == 'SKIP':
                continue
            elif kind == 'MISMATCH':
                raise RuntimeError(f'{value!r} unexpected on line {line_num}')
            yield Token(kind, value, line_num, column)

    statements = '''
        IF quantity THEN
            total := total + price * quantity;
            tax := price * 0.05;
        ENDIF;
    '''

    for token in tokenize(statements):
        print(token)

The tokenizer produces the following output:

    Token(type='IF', value='IF', line=2, column=4)
    Token(type='ID', value='quantity', line=2, column=7)
    Token(type='THEN', value='THEN', line=2, column=16)
    Token(type='ID', value='total', line=3, column=8)
    Token(type='ASSIGN', value=':=', line=3, column=14)
    Token(type='ID', value='total', line=3, column=17)
    Token(type='OP', value='+', line=3, column=23)
    Token(type='ID', value='price', line=3, column=25)
    Token(type='OP', value='*', line=3, column=31)
    Token(type='ID', value='quantity', line=3, column=33)
    Token(type='END', value=';', line=3, column=41)
    Token(type='ID', value='tax', line=4, column=8)
    Token(type='ASSIGN', value=':=', line=4, column=12)
    Token(type='ID', value='price', line=4, column=15)
    Token(type='OP', value='*', line=4, column=21)
    Token(type='NUMBER', value=0.05, line=4, column=23)
    Token(type='END', value=';', line=4, column=27)
    Token(type='ENDIF', value='ENDIF', line=5, column=4)
    Token(type='END', value=';', line=5, column=9)
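The core of the tokenizer technique is the combination of a named-group alternation with Match.lastgroup. The following is a reduced sketch of that idea only; the SPEC table, the MASTER name, and the kinds() helper are hypothetical and much smaller than the tokenizer above.

```python
import re

# Hypothetical token categories for a tiny arithmetic language.
# Order matters: alternatives are tried left to right, so the
# catch-all MISMATCH entry must come last.
SPEC = [
    ("NUMBER", r"\d+"),
    ("OP", r"[+\-*/]"),
    ("SKIP", r"\s+"),
    ("MISMATCH", r"."),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in SPEC))

def kinds(text):
    """Return (category, text) pairs, using lastgroup to name each match."""
    result = []
    for mo in MASTER.finditer(text):
        if mo.lastgroup == "SKIP":
            continue
        if mo.lastgroup == "MISMATCH":
            raise ValueError(f"unexpected {mo.group()!r} at index {mo.start()}")
        result.append((mo.lastgroup, mo.group()))
    return result

print(kinds("12 + 3*4"))
```

Since exactly one top-level named group participates in each match and none of the sub-patterns contain named groups of their own, lastgroup identifies which category matched.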


