Artificial Intelligence: Foundations of Computational Agents -- 7.8 Bayesian Learning


Rather than choosing the most likely model or delineating the set of all models that are consistent with the training data, another approach is to compute the posterior probability of each model given the training examples.

The idea of Bayesian learning is to compute the posterior probability distribution of the target features of a new example conditioned on its input features and all of the training examples.

Suppose a new case has inputs X = x and has target features, Y; the aim is to compute P(Y | X = x ∧ e), where e is the set of training examples. This is the probability distribution of the target variables given the particular inputs and the examples. The role of a model is to be the assumed generator of the examples. If we let M be a set of disjoint and covering models, then reasoning by cases and the chain rule give

P(Y | x ∧ e) = ∑_{m ∈ M} P(Y ∧ m | x ∧ e)
             = ∑_{m ∈ M} P(Y | m ∧ x ∧ e) × P(m | x ∧ e)
             = ∑_{m ∈ M} P(Y | m ∧ x) × P(m | e).

The first two equalities are theorems from the definition of probability. The last equality makes two assumptions: the model includes all of the information about the examples that is necessary for a particular prediction [i.e., P(Y | m ∧ x ∧ e) = P(Y | m ∧ x)], and the model does not change depending on the inputs of the new example [i.e., P(m | x ∧ e) = P(m | e)]. This formula says that we average over the prediction of all of the models, where each model is weighted by its posterior probability given the examples.

P(m | e) can be computed using Bayes' rule:

P(m | e) = P(e | m) × P(m) / P(e).

Thus, the weight of each model depends on how well it predicts the data (the likelihood) and its prior probability. The denominator, P(e), is a normalizing constant to make sure the posterior probabilities of the models sum to 1. Computing P(e) can be very difficult when there are many models.

A set {e1, ..., ek} of examples is i.i.d. (independent and identically distributed), where the distribution is given by model m, if, for all i and j, examples ei and ej are independent given m, which means P(ei ∧ ej | m) = P(ei | m) × P(ej | m). We usually assume that the examples are i.i.d.

Suppose the set of training examples e is {e1, ..., ek}. That is, e is the conjunction of the ei, because all of the examples have been observed to be true. The assumption that the examples are i.i.d. implies

P(e | m) = ∏_{i=1}^{k} P(ei | m).

The set of models may include structurally different models in addition to models that differ in the values of the parameters. One of the techniques of Bayesian learning is to make the parameters of the model explicit and to determine the distribution over the parameters.

Example 7.30: Consider the simplest learning task under uncertainty. Suppose there is a single Boolean random variable, Y. One of two outcomes, a and ¬a, occurs for each example. We want to learn the probability distribution of Y given some examples.

There is a single parameter, φ, that determines the set of all models. Suppose that φ represents the probability of Y = true. We treat this parameter as a real-valued random variable on the interval [0, 1]. Thus, by definition of φ, P(a | φ) = φ and P(¬a | φ) = 1 - φ.

Suppose an agent has no prior information about the probability of Boolean variable Y and no knowledge beyond the training examples. This ignorance can be modeled by having the prior probability distribution of the variable φ be a uniform distribution over the interval [0, 1]. This is the probability density function labeled n0 = 0, n1 = 0 in Figure 7.15.

We can update the probability distribution of φ given some examples. Assume that the examples, obtained by running a number of independent experiments, are a particular sequence of outcomes that consists of n0 cases where Y is false and n1 cases where Y is true.
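Before deriving this posterior analytically, it may help to see the model-averaging formula computed directly. The following Python sketch is an illustration, not the book's code: it treats each value of φ on a finite grid as a separate model m (the grid resolution and the three observations are assumptions made here), computes P(m | e) by Bayes' rule with the i.i.d. likelihood, and averages the predictions P(Y = true | m) weighted by the posterior.

```python
# Models: a grid of values for phi = P(Y = true | m), with a uniform prior.
K = 101
models = [i / (K - 1) for i in range(K)]
prior = [1.0 / K] * K

def model_posterior(examples, models, prior):
    # P(m | e) is proportional to P(e | m) * P(m), where the i.i.d.
    # assumption makes P(e | m) a product of per-example likelihoods.
    weights = []
    for phi, pm in zip(models, prior):
        likelihood = 1.0
        for y in examples:
            likelihood *= phi if y else (1.0 - phi)
        weights.append(likelihood * pm)
    z = sum(weights)  # the normalizing constant P(e)
    return [w / z for w in weights]

examples = [True, True, False]  # n1 = 2 true cases, n0 = 1 false case
posterior = model_posterior(examples, models, prior)

# Bayesian model averaging: P(Y = true | e) = sum_m P(Y = true | m) P(m | e).
p_true = sum(phi * pm for phi, pm in zip(models, posterior))
print(f"P(Y = true | e) = {p_true:.3f}")  # about 0.6 = (n1 + 1)/(n1 + n0 + 2)
```

With a uniform prior and these three examples, the averaged prediction comes out close to (n1 + 1)/(n1 + n0 + 2) = 0.6, the expected value derived analytically in Example 7.31 below.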
Figure 7.15: Beta distribution based on different samples

The posterior distribution for φ given the training examples can be derived by Bayes' rule. Let the examples e be the particular sequence of observations that resulted in n1 occurrences of Y = true and n0 occurrences of Y = false. Bayes' rule gives us

P(φ | e) = P(e | φ) × P(φ) / P(e).

The denominator is a normalizing constant to make sure the area under the curve is 1.

Given that the examples are i.i.d.,

P(e | φ) = φ^{n1} × (1 - φ)^{n0}

because there are n0 cases where Y = false, each with a probability of 1 - φ, and n1 cases where Y = true, each with a probability of φ.

One possible prior probability, P(φ), is a uniform distribution on the interval [0, 1]. This would be reasonable when the agent has no prior information about the probability.

Figure 7.15 gives some posterior distributions of the variable φ based on different sample sizes, given a uniform prior. The cases are (n0 = 1, n1 = 2), (n0 = 2, n1 = 4), and (n0 = 4, n1 = 8). Each of these peaks at the same place, namely at 2/3. More training examples make the curve sharper.

The distribution of this example is known as the beta distribution; it is parametrized by two counts, α0 and α1, and a probability p. Traditionally, the αi parameters for the beta distribution are one more than the counts; thus, αi = ni + 1. The beta distribution is

Beta_{α0,α1}(p) = (1/K) × p^{α1 - 1} × (1 - p)^{α0 - 1}

where K is a normalizing constant that ensures the integral over all values is 1. Thus, the uniform distribution on [0, 1] is the beta distribution Beta_{1,1}.

The generalization of the beta distribution to more than two parameters is known as the Dirichlet distribution. The Dirichlet distribution with two sorts of parameters, the "counts" α1, ..., αk, and the probability parameters p1, ..., pk, is

Dirichlet_{α1,...,αk}(p1, ..., pk) = (1/K) × ∏_{j=1}^{k} pj^{αj - 1}

where K is a normalizing constant that ensures the integral over all values is 1; pi is the probability of the ith outcome (and so 0 ≤ pi ≤ 1) and αi is one more than the count of the ith outcome. That is, αi = ni + 1. The Dirichlet distribution looks like Figure 7.15 along each dimension (i.e., as each pj varies between 0 and 1).

For many cases, summing over all models weighted by their posterior distribution is difficult, because the models may be complicated (e.g., if they are decision trees or even belief networks). However, for the Dirichlet distribution, the expected value for outcome i (averaging over all pj's) is

αi / ∑_j αj.

The reason that the αi parameters are one more than the counts is to make this formula simple. This fraction is well defined only when the αj are all non-negative and not all are zero.

Example 7.31: Consider Example 7.30, which determines the value of φ based on a sequence of observations made up of n0 cases where Y is false and n1 cases where Y is true. Consider the posterior distributions shown in Figure 7.15. What is interesting about these is that, whereas the most likely posterior value of φ is n1/(n0 + n1), the expected value of this distribution is (n1 + 1)/(n0 + n1 + 2).

Thus, the expected value of the n0 = 1, n1 = 2 curve is 3/5; for the n0 = 2, n1 = 4 case the expected value is 5/8; and for the n0 = 4, n1 = 8 case it is 9/14. As the learner gets more training examples, this value approaches n/m.

This estimate is better than n/m for a number of reasons. First, it tells us what to do if the learning agent has no examples: use the uniform prior of 1/2. This is the expected value of the n = 0, m = 0 case. Second, consider the case where n = 0 and m = 3. The agent should not use P(y) = 0, because this says that Y is impossible, and it certainly does not have evidence for this! The expected value of this curve with a uniform prior is 1/5.

An agent does not have to start with a uniform prior; it can start with any prior distribution. If the agent starts with a prior that is a Dirichlet distribution, its posterior will be a Dirichlet distribution. The posterior distribution can be obtained by adding the observed counts to the αi parameters of the prior distribution.
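As a quick check of Example 7.31, this small Python snippet (mine, not the book's; the list of test counts is an assumption chosen to match the cases discussed above) computes the posterior mode and the expected value side by side:

```python
# Posterior summaries for phi under a uniform (Beta(1,1)) prior.

def posterior_mode(n1, n0):
    return n1 / (n1 + n0)            # most likely value; undefined with no data

def posterior_mean(n1, n0):
    return (n1 + 1) / (n1 + n0 + 2)  # expected value; 1/2 with no examples

for n0, n1 in [(1, 2), (2, 4), (4, 8), (3, 0)]:
    print(f"n0={n0} n1={n1}: "
          f"mode={posterior_mode(n1, n0):.3f} mean={posterior_mean(n1, n0):.3f}")
# n0=1 n1=2: mode=0.667 mean=0.600
# n0=2 n1=4: mode=0.667 mean=0.625
# n0=4 n1=8: mode=0.667 mean=0.643
# n0=3 n1=0: mode=0.000 mean=0.200
```

The last case reproduces the point made above: with n = 0 and m = 3, the mode says Y is impossible, while the expected value gives the more sensible 1/5.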
The i.i.d. assumption can be represented as a belief network, where each of the ei are independent given model m. This independence assumption can be represented by the belief network shown on the left side of Figure 7.16.

Figure 7.16: Belief network and plate models of Bayesian learning

If m is made into a discrete variable, any of the inference methods of the previous chapter can be used for inference in this network. A standard reasoning technique in such a network is to condition on all of the observed ei and to query the model variable or an unobserved ei variable.

The problem with specifying a belief network for a learning problem is that the model grows with the number of observations. Such a network can be specified before any observations have been received by using a plate model. A plate model specifies what variables will be used in the model and what will be repeated in the observations. The right side of Figure 7.16 shows a plate model that represents the same information as the left side. The plate is drawn as a rectangle that contains some nodes, with an index drawn on the bottom right of the plate. The nodes in the plate are indexed by the index. In the plate model, there are multiple copies of the variables in the plate, one for each value of the index. The intuition is that there is a pile of plates, one for each value of the index. The number of plates can be varied depending on the number of observations and what is queried. In this figure, all of the nodes in the plate share a common parent. The probability of each copy of a variable in a plate given the parents is the same for each index.

A plate model lets us specify more complex relationships between the variables. In a hierarchical Bayesian model, the parameters of the model can depend on other parameters. Such a model is hierarchical in the sense that some parameters can depend on other parameters.

Example 7.32: Suppose a diagnostic assistant agent wants to model the probability that a particular patient in a hospital is sick with the flu before symptoms have been observed for this patient. This prior information about the patient can be combined with the observed symptoms of the patient. The agent wants to learn this probability, based on the statistics about other patients in the same hospital and about patients at different hospitals. This problem can range from the case where a lot of data exists about the current hospital (in which case, presumably, that data should be used) to the case where there is no data about the particular hospital that the patient is in. A hierarchical Bayesian model can be used to combine the statistics about the particular hospital the patient is in with the statistics about the other hospitals.

Suppose that for patient X in hospital H there is a random variable S_{HX} that is true when the patient is sick with the flu. (Assume that the patient identification number and the hospital uniquely determine the patient.) There is a value φ_H for each hospital H that will be used for the prior probability of being sick with the flu for each patient in H. In a Bayesian model, φ_H is treated as a real-valued random variable with domain [0, 1]. S_{HX} depends on φ_H, with P(S_{HX} | φ_H) = φ_H. Assume that φ_H is distributed according to a beta distribution.

We don't assume that φ_{h1} and φ_{h2} are independent of each other, but that they depend on hyperparameters. The hyperparameters can be the prior counts α0 and α1. The parameters depend on the hyperparameters in terms of the conditional probability P(φ_{hi} | α0, α1) = Beta_{α0,α1}(φ_{hi}); α0 and α1 are real-valued random variables, which require some prior distribution.

Figure 7.17: Hierarchical Bayesian model

The plate model and the corresponding belief network are shown in Figure 7.17. Part (a) shows the plate model, where there is a copy of the outside plate for each hospital and a copy of the inside plate for each patient in the hospital. Part of the resulting belief network is shown in part (b). Observing some of the S_{HX} will affect the φ_H and so α0 and α1, which will in turn affect the other φ_H variables and the unobserved S_{HX} variables.

Sophisticated methods exist to evaluate such networks. However, if the variables are made discrete, any of the methods of the previous chapter can be used.
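To make the discretization idea concrete, here is a minimal sketch of Example 7.32 in Python. It is my own illustration under stated assumptions, not the book's code: the hospital names and counts are invented, the hyperparameters α0 and α1 are restricted to a coarse grid with a uniform prior, and each φ_H is integrated out analytically using the standard beta-binomial marginal likelihood rather than being discretized itself.

```python
import math

# Hypothetical counts per hospital: (n_sick, n_not_sick). h3 has little data.
hospital_counts = {"h1": (3, 97), "h2": (5, 45), "h3": (0, 2)}

def log_beta(a, b):
    # Logarithm of the beta function B(a, b).
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal(n1, n0, a1, a0):
    # log P(sequence | alpha0, alpha1) with phi_H integrated out:
    # B(alpha1 + n1, alpha0 + n0) / B(alpha1, alpha0).
    return log_beta(a1 + n1, a0 + n0) - log_beta(a1, a0)

# Uniform prior over a coarse grid of hyperparameter values.
grid = [(a0, a1) for a0 in range(1, 31) for a1 in range(1, 31)]
log_post = [sum(log_marginal(n1, n0, a1, a0)
                for n1, n0 in hospital_counts.values())
            for a0, a1 in grid]
mx = max(log_post)
weights = [math.exp(lp - mx) for lp in log_post]
z = sum(weights)

# Posterior probability of flu for a new patient in h3: average the
# per-hospital posterior mean (alpha1 + n1)/(alpha1 + n1 + alpha0 + n0)
# over the posterior on the hyperparameters.
n1, n0 = hospital_counts["h3"]
p_flu = sum(w * (a1 + n1) / (a1 + n1 + a0 + n0)
            for w, (a0, a1) in zip(weights, grid)) / z
print(f"P(flu | new patient in h3) = {p_flu:.3f}")
```

Because h3 has seen only two patients, its estimate is pulled toward the rate implied by the other hospitals; with lots of local data, the local counts would dominate. That is exactly the behavior Example 7.32 asks for.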
In addition to using the posterior distribution of φ to derive the expected value, we can use it to answer other questions such as: What is the probability that φ is in the range [a, b]? In other words, derive P((φ ≥ a ∧ φ ≤ b) | e). This is the problem that the Reverend Thomas Bayes solved more than 200 years ago [Bayes (1763)]. The solution he gave, although in much more cumbersome notation, was

(∫_a^b p^n × (1 - p)^{m-n} dp) / (∫_0^1 p^n × (1 - p)^{m-n} dp)

where, as above, n is the number of true cases out of m examples.

This kind of knowledge is used in surveys when it may be reported that a survey is correct with an error of at most 5%, 19 times out of 20. It is also the same type of information that is used by probably approximately correct (PAC) learning, which guarantees an error of at most ε at least 1 - δ of the time. If an agent chooses the midpoint of the range [a, b], namely (a + b)/2, as its hypothesis, it will have error less than or equal to (b - a)/2, just when the hypothesis is in [a, b]. The value 1 - δ corresponds to P(φ ≥ a ∧ φ ≤ b | e). If ε = (b - a)/2 and δ = 1 - P(φ ≥ a ∧ φ ≤ b | e), choosing the midpoint will result in an error of at most ε in 1 - δ of the time. PAC learning gives worst-case results, whereas Bayesian learning gives the expected number. Typically, the Bayesian estimate is more accurate, but the PAC results give a guarantee of the error. The sample complexity (see Section 7.7.2) required for Bayesian learning is typically much less than that of PAC learning; many fewer examples are required to expect to achieve the desired accuracy than are needed to guarantee the desired accuracy.
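Bayes' integral above is straightforward to evaluate numerically. Here is a small, self-contained Python sketch, an illustration under the uniform-prior assumption (the counts and the interval [0.5, 0.8] are made up for this example), that computes P(a ≤ φ ≤ b | e) by trapezoidal integration:

```python
def prob_phi_in_range(a, b, n1, n0, steps=10_000):
    """P(a <= phi <= b | e) for n1 true and n0 false cases, uniform prior."""
    def integral(lo, hi):
        # Trapezoidal rule applied to the unnormalized posterior
        # p^n1 * (1 - p)^n0 over [lo, hi].
        h = (hi - lo) / steps
        total = 0.0
        for i in range(steps + 1):
            p = lo + i * h
            weight = 0.5 if i in (0, steps) else 1.0
            total += weight * p**n1 * (1 - p)**n0
        return total * h
    return integral(a, b) / integral(0.0, 1.0)

# With n1 = 8, n0 = 4 (the sharpest curve in Figure 7.15), this reports how
# much posterior mass lies within [0.5, 0.8] around the mean 9/14.
print(prob_phi_in_range(0.5, 0.8, n1=8, n0=4))
```

Read through the PAC lens discussed above: for [a, b] = [0.5, 0.8], choosing the midpoint 0.65 gives ε = 0.15, and δ is one minus the printed probability.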
Artificial Intelligence, Poole & Mackworth (LCI, UBC, Vancouver, Canada). Copyright © 2010, David Poole and Alan Mackworth. This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 2.5 Canada License.

