Regular Expression (Regex) Tutorial


Regular Expressions (Regex)

Regular Expression, or regex or regexp in short, is extremely powerful in searching and manipulating text strings, particularly in processing text files. One line of regex can easily replace several dozen lines of programming code.

Regex is supported in all the scripting languages (such as Perl, Python, PHP, and JavaScript), in general-purpose programming languages such as Java, and even in word processors such as Word for searching texts. Getting started with regex may not be easy due to its geeky syntax, but it is certainly worth the investment of your time.

Regex By Examples

This section is meant for those who need to refresh their memory. Novices should go to the next section to learn the syntax before looking at these examples.

Regex Syntax Summary

- Character: All characters, except those having special meaning in regex, match themselves. E.g., the regex x matches substring "x"; regex 9 matches "9"; regex = matches "="; and regex @ matches "@".
- Special Regex Characters: These characters have special meaning in regex (to be discussed below): ., +, *, ?, ^, $, (, ), [, ], {, }, |, \.
- Escape Sequences (\char):
  - To match a character having special meaning in regex, you need to use an escape sequence prefixed with a backslash (\). E.g., \. matches "."; regex \+ matches "+"; and regex \( matches "(".
  - You also need to use regex \\ to match "\" (backslash).
  - Regex recognizes common escape sequences such as \n for newline, \t for tab, \r for carriage-return, \nnn for an octal number of up to 3 digits, \xhh for a two-digit hex code, \uhhhh for a 4-digit Unicode, \Uhhhhhhhh for an 8-digit Unicode (support for the last two varies by language).
- A Sequence of Characters (or String): Strings can be matched by combining a sequence of characters (called sub-expressions). E.g., the regex Saturday matches "Saturday". The matching, by default, is case-sensitive, but can be set to case-insensitive via a modifier.
- OR Operator (|): E.g., the regex four|4 accepts strings "four" or "4".
- Character class (or Bracket List):
  - [...]: Accept ANY ONE of the characters within the square brackets, e.g., [aeiou] matches "a", "e", "i", "o" or "u".
  - [.-.] (Range Expression): Accept ANY ONE of the characters in the range, e.g., [0-9] matches any digit; [A-Za-z] matches any uppercase or lowercase letter.
  - [^...]: NOT ONE of the characters, e.g., [^0-9] matches any non-digit.
  - Only these four characters require an escape sequence inside the bracket list: ^, -, ], \.
- Occurrence Indicators (or Repetition Operators):
  - +: one or more (1+), e.g., [0-9]+ matches one or more digits such as '123', '000'.
  - *: zero or more (0+), e.g., [0-9]* matches zero or more digits. It accepts all those in [0-9]+ plus the empty string.
  - ?: zero or one (optional), e.g., [+-]? matches an optional "+", "-", or an empty string.
  - {m,n}: m to n (both inclusive)
  - {m}: exactly m times
  - {m,}: m or more (m+)
- Metacharacters: each matches a single character
  - . (dot): ANY ONE character except newline. Same as [^\n]
  - \d, \D: ANY ONE digit/non-digit character. Digits are [0-9]
  - \w, \W: ANY ONE word/non-word character. For ASCII, word characters are [a-zA-Z0-9_]
  - \s, \S: ANY ONE space/non-space character. For ASCII, whitespace characters are [ \n\r\t\f]
- Position Anchors: do not match a character, but a position such as start-of-line, end-of-line, start-of-word and end-of-word.
  - ^, $: start-of-line and end-of-line respectively. E.g., ^[0-9]+$ matches a numeric string.
  - \b: boundary of word, i.e., start-of-word or end-of-word. E.g., \bcat\b matches the word "cat" in the input string.
  - \B: Inverse of \b, i.e., non-start-of-word or non-end-of-word.
  - \<, \>: start-of-word and end-of-word respectively, similar to \b. E.g., \<cat\> matches the word "cat" in the input string.
  - \A, \Z: start-of-input and end-of-input respectively.
- Parenthesized Back References:
  - Use parentheses ( ) to create a back reference.
  - Use $1, $2, ... (Java, Perl, JavaScript) or \1, \2, ... (Python) to retrieve the back references in sequential order.
- Laziness (Curb Greediness for Repetition Operators): *?, +?, ??, {m,n}?, {m,}?

Example: Numbers [0-9]+ or \d+

- A regex (regular expression) consists of a sequence of sub-expressions. In this example, [0-9] and +.
- The [...], known as character class (or bracket list), encloses a list of characters. It matches any SINGLE character in the list. In this example, [0-9] matches any SINGLE character between 0 and 9 (i.e., a digit), where dash (-) denotes the range.
- The +, known as occurrence indicator (or repetition operator), indicates one or more occurrences (1+) of the previous sub-expression. In this case, [0-9]+ matches one or more digits.
- A regex may match a portion of the input (i.e., substring) or the entire input. In fact, it could match zero or more substrings of the input (with global modifier).
- This regex matches any numeric substring (of digits 0 to 9) of the input. For example,
  - If the input is "abc123xyz", it matches substring "123".
  - If the input is "abcxyz", it matches nothing.
  - If the input is "abc00123xyz456_0", it matches substrings "00123", "456" and "0" (three matches).
- Take note that this regex matches numbers with leading zeros, such as "000", "0123" and "0001", which may not be desirable.
- You can also write \d+, where \d is known as a metacharacter that matches any digit (same as [0-9]). There is more than one way to write a regex! Take note that many programming languages (C, Java, JavaScript, Python) use backslash \ as the prefix for escape sequences in string literals (e.g., \n for newline), so you need to write "\\d+" instead.

Code Examples (Python, Java, JavaScript, Perl, PHP)

Code Example in Python

See "Python's re module for Regular Expression" for full coverage.

Python supports regex via module re. Python also uses backslash (\) for escape sequences (i.e., you need to write \\ for \, \\d for \d), but it supports raw strings in the form of r'...', which ignore the interpretation of escape sequences - great for writing regex.

    # Test under the Python Command-Line Interpreter
    $ python3
    ......
    >>> import re   # Need module 're' for regular expression

    # Try find: re.findall(regexStr, inStr) -> matchedSubstringsList
    # r'...' denotes raw strings which ignore escape code, i.e., r'\n' is '\'+'n'
    >>> re.findall(r'[0-9]+', 'abc123xyz')
    ['123']   # Return a list of matched substrings
    >>> re.findall(r'[0-9]+', 'abcxyz')
    []
    >>> re.findall(r'[0-9]+', 'abc00123xyz456_0')
    ['00123', '456', '0']
    >>> re.findall(r'\d+', 'abc00123xyz456_0')
    ['00123', '456', '0']

    # Try substitute: re.sub(regexStr, replacementStr, inStr) -> outStr
    >>> re.sub(r'[0-9]+', r'*', 'abc00123xyz456_0')
    'abc*xyz*_*'

    # Try substitute with count: re.subn(regexStr, replacementStr, inStr) -> (outStr, count)
    >>> re.subn(r'[0-9]+', r'*', 'abc00123xyz456_0')
    ('abc*xyz*_*', 3)   # Return a tuple of output string and count

Code Example in Java

See "Regular Expressions (Regex) in Java" for full coverage.

Java supports regex in package java.util.regex.

    import java.util.regex.Pattern;
    import java.util.regex.Matcher;

    public class TestRegexNumbers {
       public static void main(String[] args) {
          String inputStr = "abc00123xyz456_0";  // Input String for matching
          String regexStr = "[0-9]+";            // Regex to be matched

          // Step 1: Compile a regex via static method Pattern.compile(), default is case-sensitive
          Pattern pattern = Pattern.compile(regexStr);
          // Pattern.compile(regex, Pattern.CASE_INSENSITIVE);  // for case-insensitive matching

          // Step 2: Allocate a matching engine from the compiled regex pattern,
          //         and bind to the input string
          Matcher matcher = pattern.matcher(inputStr);

          // Step 3: Perform matching and process the matching results
          // Try Matcher.find(), which finds the next match
          while (matcher.find()) {
             System.out.println("find() found substring \"" + matcher.group()
                   + "\" starting at index " + matcher.start()
                   + " and ending at index " + matcher.end());
          }

          // Try Matcher.matches(), which tries to match the ENTIRE input (^...$)
          if (matcher.matches()) {
             System.out.println("matches() found substring \"" + matcher.group()
                   + "\" starting at index " + matcher.start()
                   + " and ending at index " + matcher.end());
          } else {
             System.out.println("matches() found nothing");
          }

          // Try Matcher.lookingAt(), which tries to match from the START of the input (^...)
          if (matcher.lookingAt()) {
             System.out.println("lookingAt() found substring \"" + matcher.group()
                   + "\" starting at index " + matcher.start()
                   + " and ending at index " + matcher.end());
          } else {
             System.out.println("lookingAt() found nothing");
          }

          // Try Matcher.replaceFirst(), which replaces the first match
          String replacementStr = "**";
          String outputStr = matcher.replaceFirst(replacementStr);  // first match only
          System.out.println(outputStr);

          // Try Matcher.replaceAll(), which replaces all matches
          replacementStr = "++";
          outputStr = matcher.replaceAll(replacementStr);  // all matches
          System.out.println(outputStr);
       }
    }

The output is:

    find() found substring "00123" starting at index 3 and ending at index 8
    find() found substring "456" starting at index 11 and ending at index 14
    find() found substring "0" starting at index 15 and ending at index 16
    matches() found nothing
    lookingAt() found nothing
    abc**xyz456_0
    abc++xyz++_++

Code Example in Perl

See "Regular Expression (Regex) in Perl" for full coverage.

Perl makes extensive use of regular expressions with many built-in syntaxes and operators. In Perl (and JavaScript), a regex is delimited by a pair of forward slashes (default), in the form of /regex/. You can use built-in operators:

- m/regex/modifier or /regex/modifier: Match against the regex. The m is optional.
- s/regex/replacement/modifier: Substitute matched substring(s) by the replacement.

In Perl, you can use a single-quoted non-interpolating string '....' to write a regex, which disables interpretation of backslash (\) by Perl.
    #!/usr/bin/env perl
    use strict;
    use warnings;

    my $inStr = 'abc00123xyz456_0';  # input string
    my $regex = '[0-9]+';            # regex pattern string in non-interpolating string

    # Try match /regex/modifiers (or m/regex/modifiers)
    my @matches = ($inStr =~ /$regex/g);  # Match $inStr with regex with global modifier
                                          # Store all matches in an array
    print "@matches\n";   # Output: 00123 456 0

    while ($inStr =~ /$regex/g) {
       # The built-in array variables @- and @+ keep the start and end positions
       # of the matches, where $-[0] and $+[0] is the full match, and
       # $-[n] and $+[n] for back references $1, $2, etc.
       print substr($inStr, $-[0], $+[0] - $-[0]), ', ';   # Output: 00123, 456, 0,
    }
    print "\n";

    # Try substitute s/regex/replacement/modifiers
    $inStr =~ s/$regex/**/g;   # with global modifier
    print "$inStr\n";          # Output: abc**xyz**_**

Code Example in JavaScript

See "Regular Expression in JavaScript" for full coverage.

In JavaScript (and Perl), a regex is delimited by a pair of forward slashes, in the form of /.../. There are two sets of methods, invoked via a RegExp object or a String object.

[Listing "JavaScript Example: Regex" is missing here.]


Code Example in PHP

[TODO]

Example: Full Numeric Strings ^[0-9]+$ or ^\d+$

- The leading ^ and the trailing $ are known as position anchors, which match the start and end positions of the line, respectively. As a result, the entire input string must be matched fully, instead of a portion of the input string (substring).
- This regex matches any non-empty numeric string (comprising digits 0 to 9), e.g., "0" and "12345". It does not match "" (empty string), "abc", "a123", "abc123xyz", etc. However, it also matches "000", "0123" and "0001" with leading zeros.

Example: Positive Integer Literals [1-9][0-9]*|0 or [1-9]\d*|0

- [1-9] matches any character between 1 and 9; [0-9]* matches zero or more digits. The * is an occurrence indicator representing zero or more occurrences. Together, [1-9][0-9]* matches any number without a leading zero.
- | represents the OR operator, which is used to include the number 0.
- This expression matches "0" and "123"; but does not match "000" and "0123" (but see below).
- You can replace [0-9] by metacharacter \d, but not [1-9].
- We did not use position anchors ^ and $ in this regex. Hence, it can match any part of the input string. For example,
  - If the input string is "abc123xyz", it matches the substring "123".
  - If the input string is "abcxyz", it matches nothing.
  - If the input string is "abc123xyz456_0", it matches substrings "123", "456" and "0" (three matches).
  - If the input string is "0012300", it matches substrings "0", "0" and "12300" (three matches)!!!

Example: Full Integer Literals ^[+-]?([1-9][0-9]*|0)$ or ^[+-]?([1-9]\d*|0)$

- This regex matches an integer literal (the entire string, because of the position anchors): positive, negative or zero. The parentheses are needed because | has the lowest precedence; without them, ^ would apply only to the first alternative and $ only to the second.
- [+-] matches either a + or - sign. ? is an occurrence indicator denoting 0 or 1 occurrence, i.e., optional. Hence, [+-]? matches an optional leading + or - sign.
- We have covered three occurrence indicators: + for one or more, * for zero or more, and ? for zero or one.

Example: Identifiers (or Names) [a-zA-Z_][0-9a-zA-Z_]* or [a-zA-Z_]\w*

- Begin with one letter or underscore, followed by zero or more digits, letters and underscores.
- You can use metacharacter \w for a word character [a-zA-Z0-9_]. Recall that metacharacter \d can be used for a digit [0-9].
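One detail worth checking for anchored patterns that contain |: alternation has the lowest precedence, so in ^[+-]?[1-9][0-9]*|0$ the ^ belongs to the first alternative and the $ to the second. The sketch below compares that ungrouped form with a grouped form in Python. The names UNGROUPED, GROUPED and is_integer_literal are illustrative and not part of the tutorial.

```python
import re

# Ungrouped: parsed as (^[+-]?[1-9][0-9]*) OR (0$), so each anchor
# constrains only one of the two alternatives.
UNGROUPED = re.compile(r'^[+-]?[1-9][0-9]*|0$')

# Grouped: both alternatives sit between ^ and $.
GROUPED = re.compile(r'^[+-]?([1-9][0-9]*|0)$')


def is_integer_literal(text, pattern):
    """Return True if `pattern` finds a match somewhere in `text`."""
    return pattern.search(text) is not None


for sample in ['123', '-45', '0', '0123', '123abc', 'abc0']:
    print(repr(sample),
          'ungrouped:', is_integer_literal(sample, UNGROUPED),
          'grouped:', is_integer_literal(sample, GROUPED))
```

The two forms agree on "123", "-45", "0" and "0123", but the ungrouped form also accepts "123abc" (via the first alternative) and "abc0" (via the second), which the grouped form rejects.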
Example: Image Filenames ^\w+\.(gif|png|jpg|jpeg)$

- The position anchors ^ and $ match the beginning and the ending of the input string, respectively. That is, this regex must match the entire input string, instead of a part of the input string (substring).
- \w+ matches one or more word characters (same as [a-zA-Z0-9_]+).
- \. matches the dot (.) character. We need to use \. to represent . as . has special meaning in regex. The \ is known as the escape code, which restores the original literal meaning of the following character. Similarly, *, +, ? (occurrence indicators), ^, $ (position anchors) have special meaning in regex. You need to use an escape code to match these characters.
- (gif|png|jpg|jpeg) matches either "gif", "png", "jpg" or "jpeg". The | denotes the "OR" operator. The parentheses are used for grouping the selections.
- A modifier i after the regex, as in /^\w+\.(gif|png|jpg|jpeg)$/i, specifies case-insensitive matching (applicable to some languages like Perl and JavaScript only). With it, the regex accepts "test.GIF" and "TesT.Gif".

Example: Email Addresses ^\w+([.-]?\w+)*@\w+([.-]?\w+)*(\.\w{2,3})+$

- The position anchors ^ and $ match the beginning and the ending of the input string, respectively. That is, this regex must match the entire input string, instead of a part of the input string (substring).
- \w+ matches 1 or more word characters (same as [a-zA-Z0-9_]+).
- [.-]? matches an optional character . or -. Although dot (.) has special meaning in regex, in a character class (square brackets) any character except ^, -, ] or \ is a literal and does not require an escape sequence.
- ([.-]?\w+)* matches 0 or more occurrences of [.-]?\w+.
- The sub-expression \w+([.-]?\w+)* is used to match the username in the email, before the @ sign. It begins with at least one word character [a-zA-Z0-9_], followed by more word characters or . or -. However, a . or - must be followed by a word character [a-zA-Z0-9_]. That is, the input string cannot begin with . or -; and cannot contain "..", "--", ".-" or "-.". An example of a valid string is "a.1-2-3".
- The @ matches itself. In regex, all characters other than those having special meanings match themselves, e.g., a matches a, b matches b, etc.
- Again, the sub-expression \w+([.-]?\w+)* is used to match the email domain name, with the same pattern as the username described above.
- The sub-expression \.\w{2,3} matches a . followed by two or three word characters, e.g., ".com", ".edu", ".us", ".uk", ".co".
- (\.\w{2,3})+ specifies that the above sub-expression could occur one or more times, e.g., ".com", ".co.uk", ".edu.sg" etc.

Exercise: Interpret this regex, which provides another representation of an email address: ^[\w\-\.\+]+\@[a-zA-Z0-9\.\-]+\.[a-zA-Z0-9]{2,4}$.

Example: Swapping Words using Parenthesized Back-References ^(\S+)\s+(\S+)$ and $2 $1

- The ^ and $ match the beginning and ending of the input string, respectively.
- The \s (lowercase s) matches a whitespace (blank, tab \t, and newline \r or \n). On the other hand, the \S+ (uppercase S) matches anything that is NOT matched by \s, i.e., non-whitespace. In regex, the uppercase metacharacter denotes the inverse of the lowercase counterpart, for example, \w for word character and \W for non-word character; \d for digit and \D for non-digit.
- The above regex matches two words (without whitespaces) separated by one or more whitespaces.
- Parentheses () have two meanings in regex:
  - to group sub-expressions, e.g., (abc)*
  - to provide a so-called back-reference for capturing and extracting matches.
- The parentheses in (\S+), called a parenthesized back-reference, are used to extract the matched substring from the input string. In this regex, there are two (\S+), which match the first two words, separated by one or more whitespaces \s+. The two matched words are extracted from the input string and typically kept in special variables $1 and $2 (or \1 and \2 in Python), respectively.
- To swap the two words, you can access the special variables, and print "$2 $1" (via a programming language); or use the substitute operator "s/(\S+)\s+(\S+)/$2 $1/" (in Perl).

Code Example in Python

Python keeps the parenthesized back references in \1, \2, .... In a replacement string, \g<0> refers to the entire match (\0 does not: it is read as an octal escape).

    $ python3
    >>> import re
    >>> re.findall(r'^(\S+)\s+(\S+)$', 'apple orange')
    [('apple', 'orange')]   # A list of tuples if the pattern has more than one back reference
    # Back references are kept in \1, \2, \3, etc.
    >>> re.sub(r'^(\S+)\s+(\S+)$', r'\2 \1', 'apple orange')   # Prefix r for raw string which ignores escape
    'orange apple'
    >>> re.sub(r'^(\S+)\s+(\S+)$', '\\2 \\1', 'apple orange')  # Need to use \\ for \ in a regular string
    'orange apple'

Code Example in Java

Java keeps the parenthesized back references in $1, $2, ....

    import java.util.regex.Pattern;
    import java.util.regex.Matcher;

    public class TestRegexSwapWords {
       public static void main(String[] args) {
          String inputStr = "apple orange";
          String regexStr = "^(\\S+)\\s+(\\S+)$";   // Regex pattern to be matched
          String replacementStr = "$2 $1";          // Replacement pattern with back references

          // Step 1: Allocate a Pattern object to compile a regex
          Pattern pattern = Pattern.compile(regexStr);

          // Step 2: Allocate a Matcher object from the Pattern, and provide the input
          Matcher matcher = pattern.matcher(inputStr);

          // Step 3: Perform the matching and process the matching result
          String outputStr = matcher.replaceFirst(replacementStr);  // first match only
          System.out.println(outputStr);   // Output: orange apple
       }
    }

Example: HTTP Addresses ^http:\/\/\S+(\/\S+)*(\/)?$

- Begin with http://. Take note that you may need to write / as \/ with an escape code in some languages (JavaScript, Perl).
- Followed by \S+, one or more non-whitespaces, for the domain name.
- Followed by (\/\S+)*, zero or more "/...", for the sub-directories.
- Followed by (\/)?, an optional (0 or 1) trailing /, for a directory request.
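As a quick check of the HTTP address pattern, here is a minimal Python sketch. The helper name is_http_url and the sample addresses (using the placeholder host example.com) are assumptions for illustration only. In a Python raw string the forward slash needs no escaping, so \/ is written as /.

```python
import re

# The tutorial's HTTP pattern; '/' is not special in Python's regex syntax.
HTTP_URL = re.compile(r'^http://\S+(/\S+)*(/)?$')


def is_http_url(text):
    """Return True if the whole string matches the http:// pattern."""
    return HTTP_URL.match(text) is not None


# example.com is a placeholder host used only for illustration.
samples = [
    'http://www.example.com',
    'http://www.example.com/docs/',
    'https://www.example.com',      # rejected: the pattern requires "http://"
    'http://',                      # rejected: \S+ needs at least one character
    'http://www.example.com/a b',   # rejected: contains a whitespace
]
for s in samples:
    print(repr(s), is_http_url(s))
```

Since \S also matches "/", the first \S+ can already absorb the whole path, so the pattern is best read as a loose sanity check rather than a strict URL validator.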
Example: Regex Patterns in AngularJS

The following rather complex regex patterns are used by AngularJS, in JavaScript syntax:

    var ISO_DATE_REGEXP = /^\d{4,}-[01]\d-[0-3]\dT[0-2]\d:[0-5]\d:[0-5]\d\.\d+(?:[+-][0-2]\d:[0-5]\d|Z)$/;
    var URL_REGEXP = /^[a-z][a-z\d.+-]*:\/*(?:[^:@]+(?::[^@]+)?@)?(?:[^\s:/?#]+|\[[a-f\d:]+])(?::\d+)?(?:\/[^?#]*)?(?:\?[^#]*)?(?:#.*)?$/i;
    var EMAIL_REGEXP = /^(?=.{1,254}$)(?=.{1,64}@)[-!#$%&'*+/0-9=?A-Z^_`a-z{|}~]+(\.[-!#$%&'*+/0-9=?A-Z^_`a-z{|}~]+)*@[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?(\.[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?)*$/;
       // Match both uppercase and lowercase letters, single-quote but not double-quote
    var NUMBER_REGEXP = /^\s*(-|\+)?(\d+|(\d*(\.\d*)))([eE][+-]?\d+)?\s*$/;
    var DATE_REGEXP = /^(\d{4,})-(\d{2})-(\d{2})$/;
    var DATETIMELOCAL_REGEXP = /^(\d{4,})-(\d\d)-(\d\d)T(\d\d):(\d\d)(?::(\d\d)(\.\d{1,3})?)?$/;
    var WEEK_REGEXP = /^(\d{4,})-W(\d\d)$/;
    var MONTH_REGEXP = /^(\d{4,})-(\d\d)$/;
    var TIME_REGEXP = /^(\d\d):(\d\d)(?::(\d\d)(\.\d{1,3})?)?$/;

Example: Sample Regex in Perl

    s/^\s+//         # Remove leading whitespaces (substitute with empty string)
    s/\s+$//         # Remove trailing whitespaces
    s/^\s+|\s+$//g   # Remove leading and trailing whitespaces

Regular Expression (Regex) Syntax

A Regular Expression (or Regex) is a pattern (or filter) that describes a set of strings that match the pattern. In other words, a regex accepts a certain set of strings and rejects the rest.

A regex consists of a sequence of characters, metacharacters (such as ., \d, \D, \s, \S, \w, \W) and operators (such as +, *, ?, |, ^). It is constructed by combining many smaller sub-expressions.

Matching a Single Character

The fundamental building blocks of a regex are patterns that match a single character. Most characters, including all letters (a-z and A-Z) and digits (0-9), match themselves. For example, the regex x matches substring "x"; z matches "z"; and 9 matches "9".

Non-alphanumeric characters without special meaning in regex also match themselves. For example, = matches "="; @ matches "@".
Regex Special Characters and Escape Sequences

Regex's Special Characters

These characters have special meaning in regex (discussed in detail in the later sections):

- metacharacter: dot (.)
- bracket list: [ ]
- position anchors: ^, $
- occurrence indicators: +, *, ?, { }
- parentheses: ( )
- or: |
- escape and metacharacter: backslash (\)

Escape Sequences

The characters listed above have special meanings in regex. To match one of these characters, prepend it with a backslash (\), forming an escape sequence. For example, \+ matches "+"; \[ matches "["; and \. matches ".".

Regex also recognizes common escape sequences such as \n for newline, \t for tab, \r for carriage-return, \nnn for an octal number of up to 3 digits, \xhh for a two-digit hex code, \uhhhh for a 4-digit Unicode, \Uhhhhhhhh for an 8-digit Unicode (support for the last two varies by language).

Code Example in Python

    $ python3
    >>> import re   # Need module 're' for regular expression

    # Try find: re.findall(regexStr, inStr) -> matchedStrList
    # r'...' denotes raw strings which ignore escape code, i.e., r'\n' is '\'+'n'
    >>> re.findall(r'a', 'abcabc')
    ['a', 'a']
    >>> re.findall(r'=', 'abc=abc')   # '=' is not a special regex character
    ['=']
    >>> re.findall(r'\.', 'abc.com')  # '.' is a special regex character, need regex escape sequence
    ['.']
    >>> re.findall('\\.', 'abc.com')  # You need to write \\ for \ in a regular Python string
    ['.']

Code Example in JavaScript

[TODO]

Code Example in Java

[TODO]

Matching a Sequence of Characters (String or Text)

Sub-Expressions

A regex is constructed by combining many smaller sub-expressions or atoms. For example, the regex Friday matches the string "Friday". The matching, by default, is case-sensitive, but can be set to case-insensitive via a modifier.

OR (|) Operator

You can provide alternatives using the "OR" operator, denoted by a vertical bar '|'. For example, the regex four|for|floor|4 accepts strings "four", "for", "floor" or "4".

Bracket List (Character Class) [...], [^...], [.-.]

A bracket expression is a list of characters enclosed by [ ], also called a character class. It matches ANY ONE character in the list. However, if the first character of the list is the caret (^), then it matches ANY ONE character NOT in the list. For example, the regex [02468] matches a single digit 0, 2, 4, 6, or 8; the regex [^02468] matches any single character other than 0, 2, 4, 6, or 8.
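The two bracket lists just described can be tried out with a short Python sketch. The sample string 'a1b2c3d4' and the variable names are made up for illustration.

```python
import re

text = 'a1b2c3d4'   # illustrative input, not from the tutorial

# [02468] matches any ONE character that is an even digit.
even_digits = re.findall(r'[02468]', text)

# [^02468] matches any ONE character that is NOT an even digit,
# which includes letters as well as odd digits.
not_even_digits = re.findall(r'[^02468]', text)

print(even_digits)       # ['2', '4']
print(not_even_digits)   # ['a', '1', 'b', 'c', '3', 'd']
```

Note that the negated list is not "odd digits": it accepts every character outside the list, so letters are matched too.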
Instead of listing all characters, you could use a range expression inside the bracket. A range expression consists of two characters separated by a hyphen (-). It matches any single character that sorts between the two characters, inclusive. For example, [a-d] is the same as [abcd]. You could include a caret (^) in front of the range to invert the matching. For example, [^a-d] is equivalent to [^abcd].

Most of the special regex characters lose their meaning inside a bracket list, and can be used as they are; except ^, -, ] or \.

- To include a ], place it first in the list, or use escape \].
- To include a ^, place it anywhere but first, or use escape \^.
- To include a -, place it last, or use escape \-.
- To include a \, use escape \\.
- No escape is needed for the other characters such as ., +, *, ?, (, ), {, }, etc., inside the bracket list.
- You can also include metacharacters (to be explained in the next section), such as \w, \W, \d, \D, \s, \S inside the bracket list.

Named Character Classes in Bracket List (For Perl Only?)

Named (POSIX) classes of characters are pre-defined within bracket expressions. They are:

- [:alnum:], [:alpha:], [:digit:]: letters+digits, letters, digits.
- [:xdigit:]: hexadecimal digits.
- [:lower:], [:upper:]: lowercase/uppercase letters.
- [:cntrl:]: control characters.
- [:graph:]: printable characters, except space.
- [:print:]: printable characters, including space.
- [:punct:]: printable characters, excluding letters and digits.
- [:space:]: whitespace.

For example, [[:alnum:]] means [0-9A-Za-z]. (Note that the square brackets in these class names are part of the symbolic names, and must be included in addition to the square brackets delimiting the bracket list.)

Metacharacters ., \w, \W, \d, \D, \s, \S

A metacharacter is a symbol with a special meaning inside a regex.

- The metacharacter dot (.) matches any single character except newline \n (same as [^\n]). For example, ... matches any 3 characters (including alphabets, numbers, whitespaces, but except newline); the.. matches "there", "these", "the  ", and so on.
- \w (word character) matches any single letter, number or underscore (same as [a-zA-Z0-9_]). The uppercase counterpart \W (non-word-character) matches any single character that is not matched by \w (same as [^a-zA-Z0-9_]).
- In regex, the uppercase metacharacter is always the inverse of the lowercase counterpart.
- \d (digit) matches any single digit (same as [0-9]). The uppercase counterpart \D (non-digit) matches any single character that is not a digit (same as [^0-9]).
- \s (space) matches any single whitespace (same as [ \t\n\r\f]: blank, tab, newline, carriage-return and form-feed). The uppercase counterpart \S (non-space) matches any single character that is not matched by \s (same as [^ \t\n\r\f]).

Examples:

    \s\s        # Matches two spaces
    \S\S\s      # Two non-spaces followed by a space
    \s+         # One or more spaces
    \S+\s\S+    # Two words (non-spaces) separated by a space

Backslash (\) and Regex Escape Sequences

Regex uses backslash (\) for two purposes:

- for metacharacters such as \d (digit), \D (non-digit), \s (space), \S (non-space), \w (word), \W (non-word).
- to escape special regex characters, e.g., \. for ., \+ for +, \* for *, \? for ?. You also need to write \\ for \ in regex to avoid ambiguity.
- Regex also recognizes \n for newline, \t for tab, etc.

Take note that in many programming languages (C, Java, Python), backslash (\) is also used for escape sequences in strings, e.g., "\n" for newline, "\t" for tab, and you also need to write "\\" for \. Consequently, to write the regex pattern \\ (which matches one \) in these languages, you need to write "\\\\" (two levels of escape!!!). Similarly, you need to write "\\d" for the regex metacharacter \d. This is cumbersome and error-prone!!!

Occurrence Indicators (Repetition Operators): +, *, ?, {m}, {m,n}, {m,}

A regex sub-expression may be followed by an occurrence indicator (aka repetition operator):

- ?: The preceding item is optional and matched at most once (i.e., occurs 0 or 1 times).
- *: The preceding item will be matched zero or more times, i.e., 0+
- +: The preceding item will be matched one or more times, i.e., 1+
- {m}: The preceding item is matched exactly m times.
- {m,}: The preceding item is matched m or more times, i.e., m+
- {m,n}: The preceding item is matched at least m times, but not more than n times.

For example: The regex xy{2,4} accepts "xyy", "xyyy" and "xyyyy".

Modifiers

You can apply modifiers to a regex to tailor its behavior, such as global, case-insensitive, multiline, etc. The ways to apply modifiers differ among languages.
In Perl, you can attach modifiers after a regex, in the form of /.../modifiers. For example:

    m/abc/i   # case-insensitive matching
    m/abc/g   # global (match ALL instead of match first)

In Java, you apply modifiers when compiling the regex Pattern. For example,

    Pattern p1 = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);  // for case-insensitive matching
    Pattern p2 = Pattern.compile(regex, Pattern.MULTILINE);         // for multiline input string
    Pattern p3 = Pattern.compile(regex, Pattern.DOTALL);            // Dot (.) matches all characters including newline

The commonly-used modifier modes are:

- Case-Insensitive mode (or i): case-insensitive matching for letters.
- Global (or g): match all instead of first match.
- Multiline mode (or m): affects ^ and $. In multiline mode, ^ matches start-of-line or start-of-input; $ matches end-of-line or end-of-input. \A and \Z are not affected: \A still matches start-of-input and \Z end-of-input.
- Single-line mode (or s): Dot (.) will match all characters, including newline.
- Comment mode (or x): allow and ignore embedded comments starting with # till end-of-line (EOL).

Greediness, Laziness and Backtracking for Repetition Operators

Greediness of Repetition Operators *, +, ?, {m,n}: The repetition operators are greedy operators, and by default grasp as many characters as possible for a match. For example, the regex xy{2,4} tries to match "xyyyy", then "xyyy", and then "xyy".

Lazy Quantifiers *?, +?, ??, {m,n}?, {m,}?: You can put an extra ? after the repetition operators to curb their greediness (i.e., stop at the shortest match). For example,

    input = "The <code>first</code> and <code>second</code> instances"
    regex = <code>.*</code> matches "<code>first</code> and <code>second</code>"
    But
    regex = <code>.*?</code> produces two matches: "<code>first</code>" and "<code>second</code>"

Backtracking: If a regex reaches a state where a match cannot be completed, it backtracks by unwinding one character from the greedy match. For example, if the regex z*zzz is matched against the string "zzzz", the z* first matches "zzzz"; unwinds to match "zzz"; unwinds to match "zz"; and finally unwinds to match "z", such that the rest of the pattern can find a match.

Possessive Quantifiers *+, ++, ?+, {m,n}+, {m,}+: You can put an extra + after the repetition operators to disable backtracking, even if it may result in match failure. E.g., z++z will not match "zzzz". This feature might not be supported in some languages.
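The difference between greedy and lazy matching, and the effect of backtracking, can be observed with a short Python sketch. The input string (with two <code>...</code> elements) and the variable names are illustrative assumptions. Possessive quantifiers were only added to Python's re module in version 3.11, so that check is guarded by a version test.

```python
import re
import sys

# Illustrative input containing two <code>...</code> elements.
text = 'The <code>first</code> and <code>second</code> instances'

# Greedy: .* grabs as much as possible, so one match spans both elements.
greedy = re.findall(r'<code>.*</code>', text)

# Lazy: .*? stops at the first possible </code>, giving two matches.
lazy = re.findall(r'<code>.*?</code>', text)

print(greedy)
print(lazy)

# Backtracking: z* first takes all four z's, then gives characters back
# until the trailing 'zzz' can match, leaving a single 'z' in group 1.
backtrack = re.fullmatch(r'(z*)zzz', 'zzzz')
print(backtrack.group(1))

# Possessive: z++ never gives characters back, so the final z cannot match.
if sys.version_info >= (3, 11):
    print(re.fullmatch(r'z++z', 'zzzz'))
```

A lazy quantifier is usually the simplest fix when a pattern "eats" more than intended between two delimiters.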
Position Anchors ^, $, \b, \B, \<, \>, \A, \Z

Position anchors DO NOT match actual characters, but match positions in a string, such as start-of-line, end-of-line, start-of-word, and end-of-word.

- ^ and $: The ^ matches the start-of-line. The $ matches the end-of-line excluding newline, or end-of-input (for input not ending with newline). These are the most commonly-used position anchors. For example,

      ing$           # ending with 'ing'
      ^testing123$   # Matches only one pattern. Should use equality comparison instead.
      ^[0-9]+$       # Numeric string

- \b and \B: The \b matches the boundary of a word (i.e., start-of-word or end-of-word); and \B matches the inverse of \b, or non-word-boundary. For example,

      \bcat\b   # matches the word "cat" in input string "This is a cat."
                # but does not match input "This is a catalog."

- \< and \>: The \< and \> match the start-of-word and end-of-word, respectively (compared with \b, which can match both the start and end of a word).
- \A and \Z: The \A matches the start of the input. The \Z matches the end of the input. They are different from ^ and $ when it comes to matching input with multiple lines. ^ matches at the start of the string and after each line break, while \A only matches at the start of the string. $ matches at the end of the string and before each line break, while \Z only matches at the end of the string. For example,

      $ python3
      >>> import re
      # Using ^ and $ in multiline mode
      >>> p1 = re.compile(r'^.+$', re.MULTILINE)   # . for any character except newline
      >>> p1.findall('testing\ntesting')
      ['testing', 'testing']
      >>> p1.findall('testing\ntesting\n')
      ['testing', 'testing']
      # ^ matches start-of-input or after each line break at start-of-line
      # $ matches end-of-input or before line break at end-of-line
      # newlines are NOT included in the matches

      # Using \A and \Z in multiline mode
      >>> p2 = re.compile(r'\A.+\Z', re.MULTILINE)
      >>> p2.findall('testing\ntesting')
      []   # This pattern does not match the internal \n
      >>> p3 = re.compile(r'\A.+\n.+\Z', re.MULTILINE)   # to match the internal \n
      >>> p3.findall('testing\ntesting')
      ['testing\ntesting']
      >>> p3.findall('testing\ntesting\n')
      []   # This pattern does not match the trailing \n
      # \A matches start-of-input and \Z matches end-of-input

Capturing Matches via Parenthesized Back-References & Matched Variables $1, $2, ...
Parentheses ( ) serve two purposes in regex:

- Firstly, parentheses ( ) can be used to group sub-expressions for overriding the precedence or applying a repetition operator. For example, (abc)+ (accepts abc, abcabc, abcabcabc, ...) is different from abc+ (accepts abc, abcc, abccc, ...).
- Secondly, parentheses are used to provide the so-called back-references (or capturing groups). A back-reference contains the matched substring. For example, the regex (\S+) creates one back-reference (\S+), which contains the first word (consecutive non-spaces) of the input string; the regex (\S+)\s+(\S+) creates two back-references: (\S+) and another (\S+), containing the first two words, separated by one or more spaces \s+.

These back-references (or capturing groups) are stored in special variables $1, $2, ... (or \1, \2, ... in Python), where $1 contains the substring matched by the first pair of parentheses, and so on. For example, (\S+)\s+(\S+) creates two back-references which match the first two words. The matched words are stored in $1 and $2 (or \1 and \2), respectively.

Back-references are important for manipulating strings. They can be used in the substitution string as well as in the pattern. For example,

    # Swap the first and second words separated by one space
    s/(\S+) (\S+)/$2 $1/;                       # Perl
    re.sub(r'(\S+) (\S+)', r'\2 \1', inStr)     # Python

    # Remove duplicate word
    s/(\w+) \1/$1/;                             # Perl (use \1 inside the pattern)
    re.sub(r'(\w+) \1', r'\1', inStr)           # Python

(Advanced) Lookahead/Lookbehind, Groupings and Conditional

These features might not be supported in some languages.
Positive Lookahead (?=pattern)

The (?=pattern) is known as positive lookahead. It performs the match, but does not capture the match, returning only the result: match or no match. It is also called an assertion as it does not consume any characters in matching. For example, the following complex regex is used by AngularJS to match email addresses:

    ^(?=.{1,254}$)(?=.{1,64}@)[-!#$%&'*+/0-9=?A-Z^_`a-z{|}~]+(\.[-!#$%&'*+/0-9=?A-Z^_`a-z{|}~]+)*@[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?(\.[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?)*$

The first positive lookahead ^(?=.{1,254}$) sets the maximum length to 254 characters. The second positive lookahead ^(?=.{1,64}@) sets a maximum of 64 characters before the '@' sign for the username.

Negative Lookahead (?!pattern)

Inverse of (?=pattern). Match if pattern is missing. For example, a(?=b) matches 'a' in 'abc' (not consuming 'b'); but not 'acc'. Whereas a(?!b) matches 'a' in 'acc', but not 'abc'.

Positive Lookbehind (?<=pattern)

[TODO]

Negative Lookbehind (?<!pattern)

Named Capturing Group (?<name>pattern)

The capture group can be referenced later by name.

Atomic Grouping (?>pattern)

Disable backtracking, even if this may lead to match failure.

Conditional (?(Cond)then|else)

[TODO]

Unicode

The metacharacters \w, \W (word and non-word character), \b, \B (word and non-word boundary) recognize Unicode characters.

[TODO]

Regex in Programming Languages

- Python: See "Python re module for Regular Expression"
- Java: See "Regular Expressions in Java"
- JavaScript: See "Regular Expression in JavaScript"
- Perl: See "Regular Expressions in Perl"
- PHP: [Link]
- C/C++: [Link]

Last modified: November, 2018


