How can I extract a portion of a string variable using regular ...


String processing is fairly easy in Stata because of the many built-in string functions. Among these string functions are three functions that are related to regular expressions: regexm for matching, regexr for replacing, and regexs for subexpressions. We will show some examples of how to use regular expressions to extract and/or replace a portion of a string variable using these three functions. At the bottom of the page is an explanation of all the regular expression operators as well as the functions that work with regular expressions.

Examples

Example 1: A researcher has addresses as a string variable and wants to create a new variable that contains just the zip codes.

Example 2: We have a variable that contains full names in the order of first name and then last name. We want to create a new variable with the full name in the order of last name and then first name, separated by a comma.

Example 3: Dates were entered as a string variable. In some cases the year was entered as a four-digit value (which is what Stata generally expects to see), but in other cases it was entered as a two-digit value. We want to create a date variable in numeric format based on this string variable. This task can actually be handled easily with regular Stata commands; see our FAQ page "My date variable is a string, how can I turn it into a date variable Stata can recognize?" for information on doing this. We have included this example here for demonstration purposes, not because regular expressions are necessarily the best way to handle this situation.

In these situations, regular expressions can be used to identify cases in which a string contains a set of values (e.g. a specific word, a number followed by a word, etc.) and extract that set of values from the whole string for use elsewhere.

Example 1: Extracting zip codes from addresses

Let's start with some fake entries of addresses.

    input str60 address
    "4905 Lakeway Drive, College Station, Texas 77845 USA"
    "673 Jasmine Street, Los Angeles, CA 90024"
    "2376 First street, San Diego, CA 90126"
    "6 West Central St, Tempe AZ 80068"
    "1234 Main St. Cambridge, MA 01238-1234"
    end

To find the zip code we will look for a five-digit number within an address. The gen command (short for "generate") below tells Stata to generate a new variable called zip. The rest of the command is a little tricky because the "if" is evaluated first: if (regexm(address, "[0-9][0-9][0-9][0-9][0-9]")) searches the variable address for a five-digit number and, if it can find one, the = regexs(0) indicates that Stata should set the value of zip equal to that five-digit number. We indicate that we want a five-digit number by specifying "[0-9]" five times. Unless otherwise indicated using a *, +, or ? mark, one and only one of the characters contained in brackets will be matched. This means that stringing five of these expressions together will find a run of five consecutive digits. Note that the 0-9 indicates that the expression should match any character 0 through 9 (i.e. 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9 are all matches).

    gen zip = regexs(0) if (regexm(address, "[0-9][0-9][0-9][0-9][0-9]"))
    list

     +--------------------------------------------------------------+
     |                                              address     zip |
     |--------------------------------------------------------------|
  1. | 4905 Lakeway Drive, College Station, Texas 77845 USA   77845 |
  2. |            673 Jasmine Street, Los Angeles, CA 90024   90024 |
  3. |               2376 First street, San Diego, CA 90126   90126 |
  4. |                    6 West Central St, Tempe AZ 80068   80068 |
  5. |               1234 Main St. Cambridge, MA 01238-1234   01238 |
     +--------------------------------------------------------------+

Example 1, Variation Number 1

In our simplified example above, none of the addresses have five-digit street numbers. What if there are addresses with five-digit street numbers? Let's look at another dataset of fake addresses and see what happens when we try to use the same code as above.

    clear
    input str60 address
    "4905 Lakeway Drive, College Station, Texas 77845"
    "673 Jasmine Street, Los Angeles, CA 90024"
    "2376 First street, San Diego, CA 90126"
    "66666 West Central St, Tempe AZ 80068"
    "12345 Main St. Cambridge, MA 01238"
    end
    gen zip = regexs(0) if (regexm(address, "[0-9][0-9][0-9][0-9][0-9]"))
    list

     +----------------------------------------------------------+
     |                                          address     zip |
     |----------------------------------------------------------|
  1. | 4905 Lakeway Drive, College Station, Texas 77845   77845 |
  2. |        673 Jasmine Street, Los Angeles, CA 90024   90024 |
  3. |           2376 First street, San Diego, CA 90126   90126 |
  4. |            66666 West Central St, Tempe AZ 80068   66666 |
  5. |               12345 Main St. Cambridge, MA 01238   12345 |
     +----------------------------------------------------------+

Clearly, this is not working correctly, since the last two rows of the variable zip have picked up the street numbers for these addresses instead of the zip codes. In this dataset, the zip code appears at the end of the address string. If we assume that this is the case for all addresses in the data, the remedy is really simple. We can specify "[0-9][0-9][0-9][0-9][0-9]$", which instructs Stata to find a five-digit number at the end of the string.

    gen zip = regexs(0) if (regexm(address, "[0-9][0-9][0-9][0-9][0-9]$"))
    list

     +----------------------------------------------------------+
     |                                          address     zip |
     |----------------------------------------------------------|
  1. | 4905 Lakeway Drive, College Station, Texas 77845   77845 |
  2. |        673 Jasmine Street, Los Angeles, CA 90024   90024 |
  3. |           2376 First street, San Diego, CA 90126   90126 |
  4. |            66666 West Central St, Tempe AZ 80068   80068 |
  5. |               12345 Main St. Cambridge, MA 01238   01238 |
     +----------------------------------------------------------+

Example 1, Variation Number 2

Sometimes zip codes also include the four-digit extension, and the country name may also appear at the end of the address, as in some of the addresses shown below.

    clear
    input str60 address
    "4905 Lakeway Drive, College Station, Texas 77845 USA"
    "673 Jasmine Street, Los Angeles, CA 90024"
    "2376 First street, San Diego, CA 90126"
    "66666 West Central St, Tempe AZ 80068"
    "12345 Main St. Cambridge, MA 01238-1234"
    "12345 Main St Sommerville MA 01239-2345"
    "12345 Main St Watertwon MA 01239 USA"
    end

In this type of more realistic situation, the code in the previous examples won't work correctly, since there are extra characters after the zip code to be extracted. Here is how we can do it using a more complicated regular expression.

    gen zip = regexs(1) if regexm(address, "([0-9][0-9][0-9][0-9][0-9])[-]*[0-9]*[ a-zA-Z]*$")
    list

     +--------------------------------------------------------------+
     |                                              address     zip |
     |--------------------------------------------------------------|
  1. | 4905 Lakeway Drive, College Station, Texas 77845 USA   77845 |
  2. |            673 Jasmine Street, Los Angeles, CA 90024   90024 |
  3. |               2376 First street, San Diego, CA 90126   90126 |
  4. |                66666 West Central St, Tempe AZ 80068   80068 |
  5. |              12345 Main St. Cambridge, MA 01238-1234   01238 |
     |--------------------------------------------------------------|
  6. |              12345 Main St Sommerville MA 01239-2345   01239 |
  7. |                 12345 Main St Watertwon MA 01239 USA   01239 |
     +--------------------------------------------------------------+

What we have added to the regular expression is this sub-expression: "[-]*[0-9]*[ a-zA-Z]*". It has three components:

    [-]*         : matches zero or more dashes "-"
    [0-9]*       : matches zero or more digits
    [ a-zA-Z]*   : matches zero or more blank spaces or letters

These additions allow us to match the cases where there are trailing characters after the zip code and to extract the zip code correctly. Notice that we also used "regexs(1)" instead of "regexs(0)" as we did previously, because we are now using a subexpression, indicated by the pair of parentheses in "([0-9][0-9][0-9][0-9][0-9])".
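
Before applying a pattern like this to a whole variable, it can help to try it on a single literal string. The following is a minimal sketch rather than part of the original example: the two address literals are simply copies of fake entries used above, and the last bracket is written with a leading space, "[ a-zA-Z]", so that blanks as well as letters are matched. The values noted in the comments are what the pattern should return under those assumptions.

```stata
* Try the pattern on one literal address with a ZIP+4 extension.
display regexm("12345 Main St. Cambridge, MA 01238-1234", "([0-9][0-9][0-9][0-9][0-9])[-]*[0-9]*[ a-zA-Z]*$")
* Whole match (zip code plus trailing characters); expected 01238-1234
display regexs(0)
* First subexpression only; expected 01238
display regexs(1)

* Try it on an address with a trailing country name.
display regexm("4905 Lakeway Drive, College Station, Texas 77845 USA", "([0-9][0-9][0-9][0-9][0-9])[-]*[0-9]*[ a-zA-Z]*$")
* Expected 77845
display regexs(1)
```

Comparing regexs(0) with regexs(1) in this way makes it easy to see which part of the match the parentheses isolate.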

Another strategy that might work better in some cases is the following regular expression:

    gen zip2 = regexs(1) if (regexm(address, ".*([0-9][0-9][0-9][0-9][0-9])"))

In this example, the period (i.e. ".") matches any character, and the asterisk ("*") means zero or more of the preceding expression, so together ".*" matches any run of characters. Because ".*" takes as much of the string as it can while still letting the rest of the pattern match, the subexpression captures the last run of five consecutive digits in the address rather than the first.

Example 2: Extracting first name and last name and switching their order

We have a variable that contains a person's full name in the order of first name and then last name. We want to create a new variable for the full name in the order of last name and then first name, separated by a comma. To start, let's make a sample data set.

    clear
    input str40 fullname
    "John Adams"
    "Adam Smiths"
    "Mary Smiths"
    "Charlie Wade"
    end

Now we need to capture the first word and the second word and swap them. Here is the regular expression for this purpose: ([a-zA-Z]+)[ ]*([a-zA-Z]+). There are three parts in this regular expression:

    ([a-zA-Z]+)  : subexpression capturing a string consisting of letters, both lowercase and uppercase. This will be the first name.
    [ ]*         : matches space(s). This is the spacing between first name and last name.
    ([a-zA-Z]+)  : subexpression capturing a string consisting of letters. This will be the last name.

    gen n = regexs(2)+", "+regexs(1) if regexm(fullname, "([a-zA-Z]+)[ ]*([a-zA-Z]+)")
    list

     +------------------------------+
     |     fullname               n |
     |------------------------------|
  1. |   John Adams     Adams, John |
  2. |  Adam Smiths    Smiths, Adam |
  3. |  Mary Smiths    Smiths, Mary |
  4. | Charlie Wade   Wade, Charlie |
     +------------------------------+

This indeed works. Let's see how regexs works in this case. regexm actually identifies a number of sections, based on the whole expression as well as the subexpressions. The following code uses regexs to place each of these components (subexpressions) into its own variable and then displays them.

    gen n0 = regexs(0) if regexm(fullname, "(([a-zA-Z]+)[ ]*([a-zA-Z]+))")
    gen n1 = regexs(2) if regexm(fullname, "(([a-zA-Z]+)[ ]*([a-zA-Z]+))")
    gen n2 = regexs(3) if regexm(fullname, "(([a-zA-Z]+)[ ]*([a-zA-Z]+))")
    list fullname n0 n1 n2

     +------------------------------------------------+
     |     fullname             n0        n1       n2 |
     |------------------------------------------------|
  1. |   John Adams     John Adams      John    Adams |
  2. |  Adam Smiths    Adam Smiths      Adam   Smiths |
  3. |  Mary Smiths    Mary Smiths      Mary   Smiths |
  4. | Charlie Wade   Charlie Wade   Charlie     Wade |
     +------------------------------------------------+

Note that this version of the pattern wraps the whole expression in an extra pair of parentheses, so subexpression 1 is the full name and the first and last names are subexpressions 2 and 3.

Example 3: Two- and four-digit values for year

In this example, we have dates entered as a string variable. Stata can handle this using standard commands (see "My date variable is a string, how can I turn it into a date variable Stata can recognize?"); we are using this as an example of what you could do with regular expressions. The goal of this process is to produce a string variable with the appropriate four-digit year for every case, which Stata can then easily convert into a date. To do this we will start by separating out each element of the date (day, month, and two- or four-digit year) into a separate variable. Then we will assign the correct four-digit year to cases where there are currently only two digits. Finally, we concatenate the variables to create a single string variable that contains day, month, and four-digit year.

First, input the dates:

    input str18 date
    20jan2007
    16June06
    06sept1985
    21june04
    4july90
    9jan1999
    6aug99
    19august2003
    end

Next, we want to identify the day of the month and place it in a variable called day. To do this we instruct Stata to find the day by looking at the beginning of the string (i.e. the date) for one or more values from 0-9. (In other words, look for a number at the start of the line, since we know the first series of numbers is the day.) Generate a new variable day, and set it equal to that value.

    gen day = regexs(0) if regexm(date, "^[0-9]+")

The line of syntax below finds the month by looking for one or more letters together in the string. It then generates the variable month and sets it equal to the month identified in the string.

    gen month = regexs(0) if regexm(date, "[a-zA-Z]+")

The year is where things get more complex. Note that the values for assigning centuries are based on my knowledge of my "data." First of all, we extract all the digits for the year. We use the "$" operator to indicate that the match must end at the end of the string. (The code below keeps year as a string so that it can be concatenated; Stata's function "real" could convert it into a numeric variable afterwards if needed.) The next action involves dealing with two-digit years starting with "0". These correspond to recent years in the twenty-first century. To turn these into four-digit years, we concatenate (using the +) the string "20" with the string identified (the two-digit year). Next we find the two-digit years 10-99, and concatenate the string "19" with those strings. Finally, we create the variable date2, which is our date containing only four-digit years. (We could also use the three variables day, month, and year to create a date variable using the Stata date functions.)

    gen year = regexs(0) if regexm(date, "[0-9]*$")
    replace year = "20"+regexs(0) if regexm(year, "^[0][0-9]$")
    replace year = "19"+regexs(0) if regexm(year, "^[1-9][0-9]$")
    gen date2 = day+month+year
    list

     +---------------------------------------------------+
     |         date   day    month   year          date2 |
     |---------------------------------------------------|
  1. |    20jan2007    20      jan   2007      20jan2007 |
  2. |     16June06    16     June   2006     16June2006 |
  3. |   06sept1985    06     sept   1985     06sept1985 |
  4. |     21june04    21     june   2004     21june2004 |
  5. |      4july90     4     july   1990      4july1990 |
     |---------------------------------------------------|
  6. |     9jan1999     9      jan   1999       9jan1999 |
  7. |       6aug99     6      aug   1999       6aug1999 |
  8. | 19august2003    19   august   2003   19august2003 |
     +---------------------------------------------------+

Regular Expressions

Regular expressions are, in general, a way of searching for, and in some cases replacing, the occurrence of a pattern within a string based on a set of rules. These rules are defined using a set of operators. The following table shows all of the operators Stata accepts, and explains each one. Note that in Stata, regular expressions will always fall within quotation marks.

    [ ]
        Square brackets indicate that one of the characters inside the brackets should be matched. For example, if I wanted to search for a single letter between f and m, I would type "[f-m]".

    a-z
        A range specifies that any value within that range is acceptable. This is case sensitive, so a-z is not the same as A-Z; if either case can be counted as a match, include both (a-zA-Z). Numeric values are also acceptable as ranges (e.g. 0-9).

    .
        A period matches any character.

    \
        Allows you to match characters that are usually regular expression operators. For example, if you wanted to match a "[" you would type \[ instead of just a single [.

    *
        Match zero or more of the preceding expression. For example, if I wanted to allow an optional run of digits at some point in a pattern, so that the match still succeeds when no digits are present as long as the rest of the expression fits, I could specify [0-9]*.

    +
        Match one or more of the preceding expression. For example, if I wanted to match a word containing any combination of letters, I would specify [a-zA-Z]+.

    ?
        Match either zero or one of the preceding expression.

    ^
        When it appears at the beginning of an expression, a "^" indicates that the following expression should appear at the beginning of the string.

    $
        When it appears at the end of an expression, a "$" indicates that the preceding expression should appear at the end of the string. For example, if I wanted to match a number that was the last thing to appear in a string, I would specify "[0-9]+$".

    |
        The logical operator or, indicating that either the expression preceding it or the expression following it qualifies as a match.

    ( )
        Creates a subexpression within a larger expression. Useful with the "or" operator (i.e. |), and when extracting and replacing values. For example, if I wanted to extract a numeric value which I know follows directly after a word or set of letters, I could use the regular expression "[a-zA-Z]+([0-9]+)". This matches the whole expression, but allows you to select the portion in the parentheses (called a substring). Handling substrings is discussed in greater detail below.

These expressions can be combined to search for a wide variety of strings.
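
As a quick way to see several of these operators working together, the following sketch runs a few patterns with the display command. It is only an illustration: the test strings ("color", "colour", "item[42]") are made up here, and the values noted in the comments are what these patterns should return given the operator descriptions above.

```stata
* ^ and $ anchor the match; the parentheses capture the letters between two runs of digits.
* regexm should display 1 and regexs(1) should display June.
display regexm("16June06", "^[0-9]+([a-zA-Z]+)[0-9]+$")
display regexs(1)

* ? makes the preceding character optional, so both spellings should display 1.
display regexm("color", "^colou?r$")
display regexm("colour", "^colou?r$")

* | inside parentheses accepts either alternative; regexs(1) should display aug.
display regexm("6aug99", "(aug|sept)")
display regexs(1)

* A backslash escapes an operator, here the literal brackets; regexs(1) should display 42.
display regexm("item[42]", "\[([0-9]+)\]")
display regexs(1)
```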

As mentioned above, there are three types of actions that can be performed with regular expressions in Stata (if you are creative, you can do any number of other things using these functions, but the basic tools are the built-in Stata functions). Stata has a separate function for each of the three:

    regexm : used to find matching strings; evaluates to one if there is a match, and zero otherwise.
    regexs : used to return the nth substring within an expression matched by regexm (hence, regexm must always be run before regexs; note that an "if" is evaluated first even though it appears later on the line of syntax).
    regexr : used to replace a matched expression with something else.

Each of these has a slightly different syntax. The line below shows the syntax for regexm, that is, the function that matches your regular expression. The string may be a string you type in yourself, a string from a macro, or, most commonly, the name of a variable. The regular expression is the pattern you would like to find; note that it must appear in quotation marks.

    regexm(string, "regular expression")

For regexs, that is, to recall all or a portion of a string, the syntax is:

    regexs(n)

where n is the number assigned to the substring you want to extract. The substrings are actually divided when you run regexm. The entire matched string is returned as substring 0, and each subexpression is numbered sequentially from 1 to n. For example, after running regexm("907-789-3939", "([0-9]*)-([0-9]*)-([0-9]*)"), regexs returns the following:

    Subexpression #    String Returned
    0                  907-789-3939
    1                  907
    2                  789
    3                  3939

Note that in subexpressions 1, 2, and 3, the dashes are dropped, since they are not included in the parentheses that mark the subexpressions. You can take another look at how this works using the following syntax, which uses the display command to run the functions.

    display regexm("907-789-3939", "([0-9]*)-([0-9]*)-([0-9]*)")
    display regexs(0)
    display regexs(1)
    display regexs(2)
    display regexs(3)

Because they are functions, the regex functions work within other commands (e.g. generate), but cannot be used on their own (i.e. you cannot start a command in Stata with regexm(…)).
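
The examples on this page use regexm and regexs, while regexr is described but not demonstrated. The following is a minimal sketch of how it might be used, assuming the three-argument form regexr(string, "regular expression", "replacement"), which replaces the first matching substring and returns the string unchanged when nothing matches. The variable name address2 and the choice of " USA" as the text to strip are illustrative only.

```stata
* Replace a trailing "Drive" with "Dr." in a literal string.
display regexr("4905 Lakeway Drive", "Drive$", "Dr.")

* No match here, so the original string should be displayed unchanged.
display regexr("673 Jasmine Street", "Drive$", "Dr.")

* Applied to a variable: strip a trailing " USA" from each address, if present.
* Caution: clear drops whatever data are currently in memory.
clear
input str60 address
"4905 Lakeway Drive, College Station, Texas 77845 USA"
"673 Jasmine Street, Los Angeles, CA 90024"
end
generate address2 = regexr(address, " USA$", "")
list address2
```

Unlike regexs, regexr does not need a preceding call to regexm, since it performs the match itself.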


