Bayesian method (1). The prior distribution | by Xichu Zhang
Towards Data Science

Fig 0.1 Beta distribution with various parameters. (Image by author)

It is easy to find a huge number of good articles introducing Bayesian statistics. However, most of them cover only what Bayesian statistics is and how Bayesian inference works, without going into many mathematical details. And it is a fun and challenging area to explore. Therefore, I am planning a series of articles on the theory of Bayesian statistics, covering the selection of the prior, the loss function in Bayesian inference, and the relation between Bayesian statistics and some frequentist approaches.

In this post, the prior distribution used in Bayesian statistics is introduced. Why do we need to learn this? Because picking a prior distribution is one of the first steps in Bayesian inference, and knowing more about priors helps with choosing one.

The basics

We start with a brief overview of how Bayesian statistics works, and introduce some notation that will be used later. In Bayesian statistics, we assume a prior probability distribution and then update it using the data we have. This updating gives us the posterior probability distribution. We denote the posterior by π(θ|x). (The symbol π might seem annoying, since it is also a very common constant in math, but in this context it reminds us that the distribution is related to the parameter of the population distribution.) It is calculated as

$$\pi(\theta \mid \mathbf{x}) = \frac{\pi(\mathbf{x} \mid \theta)\,\pi(\theta)}{\int_{\Theta} \pi(\mathbf{x} \mid \theta)\,\pi(\theta)\,d\theta} \tag{1.1}$$

where Θ is the space (here, by "space", we mean a "sample space") of all possible parameter values, and π(x|θ) is the likelihood: the conditional probability that, given the true parameter value θ, the output x is observed. The parameters of the prior itself (such as α and β in the beta priors below), as opposed to the parameter θ of the population distribution, are called hyperparameters. And as always, we use bold font (x) to denote a vector.

The denominator is also known as the evidence. It is a normalizing factor (a constant in θ) that makes the posterior π(θ|x) a probability distribution (integrating to one). This is very easy to verify:

$$\int_{\Theta} \pi(\theta \mid \mathbf{x})\,d\theta = \frac{\int_{\Theta} \pi(\mathbf{x} \mid \theta)\,\pi(\theta)\,d\theta}{\int_{\Theta} \pi(\mathbf{x} \mid \theta')\,\pi(\theta')\,d\theta'} = 1$$

The normalizing factor can be ignored in inference, as is done in some pieces of literature such as [1], since leaving out a constant does not change the shape of the curve. The posterior probability can then be written in the form

$$\pi(\theta \mid \mathbf{x}) \propto \pi(\mathbf{x} \mid \theta)\,\pi(\theta) \tag{1.2}$$

Once we have the posterior distribution, which is the distribution of the parameter (keep this in mind), we can calculate the predictive distribution. It is a conditional probability: the probability distribution of observing y, given data x. It is calculated as

$$\pi(y \mid \mathbf{x}) = \int_{\Theta} f(y \mid \theta)\,\pi(\theta \mid \mathbf{x})\,d\theta \tag{1.3}$$

where f(y|θ) is the probability density function of the new observation, given the parameter θ. Equation 1.3 might seem a bit messy at first, but after a close look we can see that it is in fact the law of total probability (which is as simple as a weighted average): we integrate the product of the probability distribution of y given the parameter value θ and the probability of the parameter taking value θ given the data x.

Choosing the prior

The prior is sometimes described as the "belief" about the data [2]. This means that we choose the prior according to our knowledge of the data. Of course, it is not as completely subjective a matter as the word "belief" might suggest.

Properties of the prior

Note that the prior of the parameter can fail to be normalizable: its density is nonnegative, but its sum or integral is infinite. We call this kind of distribution an improper prior distribution.

According to Wikipedia, an informative prior expresses specific, definite information about a variable. Informative priors are often avoided, but if prior information is available, they are an appropriate way of introducing that information into the model [7].

When we do not know much about our data or about the distribution of the parameter, it makes sense to choose a so-called "vague prior", which reflects minimal knowledge. We then need a prior distribution with no population basis, which makes it difficult to construct, and which plays a minimal role in the posterior distribution. Such a prior density is called a noninformative prior, or diffuse prior. Some people would rather say that the prior distribution always contains some information. Sometimes improper priors are used to represent such vague priors; we will see examples of this later. A related term is weakly informative prior, which contains partial information: enough to give the posterior distribution reasonable bounds, but not fully capturing one's scientific knowledge about the parameter [3].

A very interesting property of a prior is conjugacy, which means that the posterior distribution has the same parametric form as the prior distribution. This kind of prior is strongly tied to the posterior, so we say that it contains strong prior knowledge [5]. The benefit of conjugate priors is obvious: the posterior will be a known distribution.

Examples of some common priors

1. Uniform prior

The most intuitive and easiest prior, when the value of the parameter is bounded, is the uniform prior distribution. This prior is noninformative (sometimes it is also called a "low information prior" [2]); it assumes that all parameter values in the parameter space Θ are equally likely. For example, if we want to model the data with a Bernoulli distribution (as in the famous coin-tossing example), the parameter p is a probability, which falls in the interval [0, 1]. In this case the prior becomes π(θ) = 1 for θ in [0, 1].

2. Haldane prior

Minimal knowledge does not necessarily mean that all parameter values are equally likely; many other noninformative priors are possible. Another example of a noninformative prior is the Haldane prior, proposed by J. B. S. Haldane for the estimation of rare events. The Haldane prior is the beta distribution with parameters α = 0, β = 0:

$$\pi(\theta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)}, \qquad \alpha = 0,\ \beta = 0 \tag{2.1}$$

where B(α, β) is the beta function. As a reminder, the beta function looks like this:

$$B(\alpha, \beta) = \int_0^1 t^{\alpha-1}(1-t)^{\beta-1}\,dt \tag{2.3}$$

which can also be written in terms of the gamma function:

$$B(\alpha, \beta) = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha+\beta)} \tag{2.4}$$

Note that Beta(0, 0) is not defined, since B(0, 0) diverges, but substituting α = 0, β = 0 into the numerator of Eq 2.1 shows its limiting shape; that is, the Haldane prior without the normalizing coefficient (recall that Bayes' theorem can be written in the form of Equation 1.2, so an unnormalized prior is enough for inference):

$$\pi(\theta) \propto \theta^{-1}(1-\theta)^{-1} \tag{2.6}$$

This prior gives the most weight to θ = 1 and θ = 0. This can be made clear with the example from [5]: consider the scenario where we are observing whether an unknown compound dissolves in water. At first, we are completely ignorant of the result. Therefore, after observing that a small sample dissolves, we immediately conclude that all samples will do so; if it does not, we conclude that no sample can dissolve.

3. Conjugate prior: the beta distribution

A third example is the beta distribution, which is a conjugate prior of the binomial distribution. And note that, since the Bernoulli distribution is a special case of the binomial distribution (namely Binomial(1, p)), the beta distribution is also a conjugate prior of the Bernoulli distribution. It is the typical example of a conjugate prior (it appears on Wikipedia, in [3], and elsewhere). Here we show why the beta distribution is conjugate to the binomial distribution. First, recall that the probability mass function (pmf) of the binomial distribution is

$$P(X = k) = \binom{n}{k} p^{k} (1-p)^{n-k} \tag{2.7}$$

where n is the total number of trials, k is the number of successes, and p is the probability of success. Therefore, the likelihood is

$$\pi(\mathbf{x} \mid \theta) = \binom{n}{k} \theta^{k} (1-\theta)^{n-k} \tag{2.8}$$

Referring back to the basics section, the likelihood is denoted π(x|θ), where x is the observed value, so x = (k, n − k). This means the parameters of the binomial distribution become the observed values, and the "parameter" θ in this likelihood is the success probability we want to infer. We then choose a beta distribution as the prior; what we want to show is that the posterior distribution is of the same type as the prior distribution.

$$\pi(\theta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)} \tag{2.9}$$

The posterior distribution is derived as follows:

$$\pi(\theta \mid \mathbf{x}) \propto \pi(\mathbf{x} \mid \theta)\,\pi(\theta) \propto \theta^{k}(1-\theta)^{n-k}\,\theta^{\alpha-1}(1-\theta)^{\beta-1} = \theta^{(\alpha+k)-1}(1-\theta)^{(\beta+n-k)-1} \tag{2.10}$$

We can see that the posterior distribution is again a beta distribution, namely Beta(α + k, β + n − k).

4. Jeffreys prior

The Jeffreys prior is a noninformative prior defined in terms of the square root of the determinant of the Fisher information matrix:

$$\pi(\theta) \propto \sqrt{\det I(\theta)} \tag{2.11}$$

The Fisher information and the Fisher information matrix were introduced in a previous article, but for convenience we recall them here. Originally, the Fisher information is defined as the variance of the score:

$$I(\theta) = \mathrm{Var}_{\theta}\!\left[\frac{\partial}{\partial\theta}\log f(X;\theta)\right] = E_{\theta}\!\left[\left(\frac{\partial}{\partial\theta}\log f(X;\theta)\right)^{2}\right] \tag{2.12}$$

where the subscript θ means that the expected value is taken with
regard to θ (the second equality holds because the score has expectation zero). The matrix form is written as

$$[I(\theta)]_{ij} = E_{\theta}\!\left[\frac{\partial}{\partial\theta_i}\log f(X;\theta)\,\frac{\partial}{\partial\theta_j}\log f(X;\theta)\right] \tag{2.13}$$

Under certain conditions (the density function f being twice differentiable, plus the usual regularity conditions), the Fisher information can also be written as

$$I(\theta) = -E_{\theta}\!\left[\frac{\partial^{2}}{\partial\theta^{2}}\log f(X;\theta)\right] \tag{2.14}$$

Equation 2.14 is the formula for the single-variable case (when there are multiple parameters, we use the matrix form). Let us calculate the Jeffreys prior of a Bernoulli trial, which is a single-variable case; there are reasons why we use this distribution for the demonstration, which we will see shortly. The probability mass function of the Bernoulli distribution is

$$f(x;\theta) = \theta^{x}(1-\theta)^{1-x}, \qquad x \in \{0, 1\} \tag{2.15}$$

Now we calculate the Fisher information of this density (Equation 2.15):

$$I(\theta) = E_{\theta}\!\left[\left(\frac{x}{\theta} - \frac{1-x}{1-\theta}\right)^{2}\right] = \frac{1}{\theta} + \frac{1}{1-\theta} = \frac{1}{\theta(1-\theta)} \tag{2.16}$$

Since the parameter is one-dimensional (single-variable), the Fisher information is just a number, which is also the determinant, and we obtain the prior distribution

$$\pi(\theta) \propto \sqrt{I(\theta)} = \theta^{-1/2}(1-\theta)^{-1/2} \tag{2.17}$$

Look at Equation 2.17 carefully and you will find that the Jeffreys prior is similar to the Haldane prior. But unlike the Haldane prior, the Jeffreys prior is proper.

Fig 2.18 The Jeffreys prior of the Bernoulli parameter. (Image by author)

It is also related to the beta distribution, since Equation 2.17, once normalized, equals Beta(1/2, 1/2).

Summary

This post was mainly about the prior distribution in Bayesian inference. We began with the basics of Bayesian inference, then looked at the types of prior distributions, and finally worked through some common priors.

References:

[1] Lee, T. S., & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. JOSA A, 20(7), 1434–1448.
[2] Surya Tokdar, Choosing a prior distribution, accessed 4 December 2021.
[3] Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (1995). Bayesian data analysis. Chapman and Hall/CRC.
[4] Etz, A., & Wagenmakers, E. J. (2017). J. B. S. Haldane's contribution to the Bayes factor hypothesis test. Statistical Science, 313–329.
[5] Jaynes, E. T. (1968). Prior probabilities. IEEE Transactions on Systems Science and Cybernetics, 4(3), 227–241.
[6] Stanford, J. L., & Vardeman, S. B. (1994). Statistical methods for physical science (Vol. 28). Academic Press.
[7] Golchi, S. (2016, October). Informative priors and Bayesian computation. In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA) (pp. 782–789). IEEE.
[8] Nicenboim, B., Schad, D. J., & Vasishth, S. (2021). An introduction to Bayesian data analysis for cognitive science.
[9] Jeremy Orloff and Jonathan Bloom, Conjugate priors: Beta and normal, accessed 11 December 2021.
[10] The prior distribution, accessed 1 January 2022.

Further reading:

For more about probability theory: Measure theory in probability (Probability is not simple after all), on towardsdatascience.com.
For more about the comparison between Bayesian and frequentist approaches: Subjectivism in decision science (A note on Bayesianism), on medium.com.

Supplement: code used to generate Fig 0.1.
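The original listing did not survive in this copy, so here is a minimal reconstruction sketch that produces a figure in the style of Fig 0.1. It assumes numpy, scipy, and matplotlib are available, and the (α, β) pairs are my own illustrative choices (including the uniform Beta(1, 1) and Jeffreys Beta(1/2, 1/2) priors discussed above):

```python
# Reconstruction sketch of the Fig 0.1 supplement; assumes numpy, scipy,
# matplotlib. The (alpha, beta) pairs below are illustrative choices.
import matplotlib
matplotlib.use("Agg")                      # render without a display
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import beta

x = np.linspace(0.01, 0.99, 500)           # avoid the endpoints, where some
                                           # densities diverge
for a_, b_ in [(0.5, 0.5), (1, 1), (2, 2), (2, 5), (5, 2)]:
    plt.plot(x, beta.pdf(x, a_, b_), label=f"Beta({a_}, {b_})")
plt.xlabel(r"$\theta$")
plt.ylabel("density")
plt.title("Beta distribution with various parameters")
plt.legend()
plt.savefig("fig0_1_beta.png")
```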
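As an additional check, the conjugacy result for the beta-binomial pair (posterior Beta(α + k, β + n − k)) can be verified numerically, by normalizing prior × likelihood on a grid and comparing against the closed form. This is a pure-Python sketch; the data (k = 14 successes in n = 20 trials) and the prior Beta(2, 2) are arbitrary illustrative choices:

```python
import math

def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta, via log-gamma for numerical stability."""
    log_b = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(theta)
                    + (b - 1) * math.log(1 - theta) - log_b)

# Illustrative setup: prior Beta(2, 2); data: k = 14 successes in n = 20 trials.
a, b, n, k = 2.0, 2.0, 20, 14

# Numerical route: posterior = prior * likelihood / evidence, on a midpoint grid.
grid = [(i + 0.5) / 1000 for i in range(1000)]
unnorm = [beta_pdf(t, a, b) * t ** k * (1 - t) ** (n - k) for t in grid]
evidence = sum(unnorm) / 1000     # Riemann-sum approximation of the denominator
numeric = [u / evidence for u in unnorm]

# Analytic route: conjugacy says the posterior is Beta(a + k, b + n - k).
analytic = [beta_pdf(t, a + k, b + n - k) for t in grid]
max_err = max(abs(p - q) for p, q in zip(numeric, analytic))

# Posterior predictive of one more success: the integral in the predictive
# distribution reduces to the posterior mean, (a + k) / (a + b + n).
predictive = (a + k) / (a + b + n)
print(max_err, predictive)
```

The two routes agree to the accuracy of the grid approximation, and the predictive probability of a further success comes out as 16/24 ≈ 0.667.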
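The Bernoulli Fisher information can likewise be checked from its definition as the variance of the score, without using the closed form. This sketch estimates the score by a finite difference of the log-likelihood (the step size eps is an illustrative choice) and compares the result to 1/(p(1 − p)), whose square root is proportional to the Jeffreys prior density:

```python
import math

def bernoulli_fisher_info(p, eps=1e-6):
    """Fisher information I(p) of one Bernoulli trial, computed from the
    definition as the variance of the score."""
    def log_lik(x, q):
        return x * math.log(q) + (1 - x) * math.log(1 - q)
    def score(x):
        # central finite difference of the log-likelihood in p
        return (log_lik(x, p + eps) - log_lik(x, p - eps)) / (2 * eps)
    # E[score] = 0, so Var[score] = E[score^2] = p*score(1)^2 + (1-p)*score(0)^2
    return p * score(1) ** 2 + (1 - p) * score(0) ** 2

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    closed_form = 1.0 / (p * (1.0 - p))                 # analytic I(p)
    jeffreys_density = math.sqrt(bernoulli_fisher_info(p))  # pi(p) up to a constant
    print(f"p={p}: I(p)~{bernoulli_fisher_info(p):.4f}, "
          f"closed form {closed_form:.4f}, sqrt(I)~{jeffreys_density:.4f}")
```

The numerical values match the closed form, which in turn confirms that the Jeffreys prior density for the Bernoulli parameter is proportional to p^(−1/2) (1 − p)^(−1/2).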