Probability concepts explained: Bayesian inference for parameter estimation


[Cover image: Pixabay]

Introduction

In the previous blog post I covered the maximum likelihood method for parameter estimation in machine learning and statistical models. In this post we'll go over another method for parameter estimation using Bayesian inference. I'll also show how this method can be viewed as a generalisation of maximum likelihood, and in what case the two methods are equivalent. Some fundamental knowledge of probability theory is assumed, e.g. marginal and conditional probability. These concepts are explained in my first post in this series. Additionally, it also helps to have some basic knowledge of a Gaussian distribution, but it's not necessary.

Bayes' Theorem

Before introducing Bayesian inference, it is necessary to understand Bayes' theorem. Bayes' theorem is really cool. What makes it useful is that it allows us to use some knowledge or belief that we already have (commonly known as the prior) to help us calculate the probability of a related event. For example, if we want to find the probability of selling ice cream on a hot and sunny day, Bayes' theorem gives us the tools to use prior knowledge about the likelihood of selling ice cream on any other type of day (rainy, windy, snowy etc.). We'll talk more about this later, so don't worry if you don't understand it just yet.

Mathematical definition

Mathematically, Bayes' theorem is defined as:

P(A|B) = P(B|A) × P(A) / P(B)

where A and B are events, P(A|B) is the conditional probability that event A occurs given that event B has already occurred (P(B|A) has the same meaning but with the roles of A and B reversed), and P(A) and P(B) are the marginal probabilities of event A and event B occurring respectively.

Example

Mathematical definitions can often feel too abstract and scary, so let's try to understand this with an example. One of the examples that I gave in the introductory blog post was about picking a card from a pack of traditional playing cards. There are 52 cards in the pack, 26 of them are red and 26 are black. What is the probability of the card being a 4 given that we know the card is red? To convert this into the math symbols that we see above, we can say that event A is the event that the card picked is a 4 and event B is the card being red. Hence, P(A|B) in the equation above is P(4|red) in our example, and this is what we want to calculate. We previously worked out that this probability is equal to 1/13 (there are 26 red cards and 2 of those are 4's), but let's calculate this using Bayes' theorem. We need to find the probabilities for the terms on the right hand side. They are:

P(B|A) = P(red|4) = 1/2
P(A) = P(4) = 4/52 = 1/13
P(B) = P(red) = 1/2

When we substitute these numbers into the equation for Bayes' theorem above we get 1/13, which is the answer that we were expecting.

How does Bayes' Theorem allow us to incorporate prior beliefs?

Above I mentioned that Bayes' theorem allows us to incorporate prior beliefs, but it can be hard to see how it lets us do this just by looking at the equation above. So let's see how we can do that using the ice cream and weather example. Let A represent the event that we sell ice cream and B be the event of the weather. Then we might ask: what is the probability of selling ice cream on any given day given the type of weather? Mathematically this is written as P(A = ice cream sale | B = type of weather), which is equivalent to the left hand side of the equation.

P(A) on the right hand side is the expression that is known as the prior. In our example this is P(A = ice cream sale), i.e. the (marginal) probability of selling ice cream regardless of the type of weather outside. P(A) is known as the prior because we might already know the marginal probability of the sale of ice cream. For example, I could look at data that said 30 people out of a potential 100 actually bought ice cream at some shop somewhere. So my P(A = ice cream sale) = 30/100 = 0.3, prior to me knowing anything about the weather. This is how Bayes' theorem allows us to incorporate prior information.
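To make the mechanics concrete, here is a small Python sketch (mine, not from the original post) that plugs the playing-card example into the formula and then reuses the ice cream prior of 0.3. The two weather probabilities in the second part are made-up placeholders, purely to show where the prior enters the calculation.

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)

def bayes(p_b_given_a, p_a, p_b):
    return p_b_given_a * p_a / p_b

# Playing-card example from above: P(4 | red)
print(bayes(p_b_given_a=2 / 4, p_a=4 / 52, p_b=26 / 52))  # 0.0769... = 1/13

# Ice cream example with the prior P(ice cream sale) = 0.3.
# The two weather probabilities below are hypothetical placeholders,
# only there to show how the prior enters the calculation.
p_sunny_given_sale = 0.6   # hypothetical: fraction of sale days that are sunny
p_sunny = 0.4              # hypothetical: fraction of all days that are sunny
print(bayes(p_sunny_given_sale, p_a=0.3, p_b=p_sunny))     # P(sale | sunny) = 0.45
```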
Caution: I mentioned above that I could find data from a shop to get prior information, but there is nothing stopping me from making up a completely subjective prior that is not based on any data whatsoever. It's possible for someone to come up with a prior that is an informed guess from personal experience or particular domain knowledge, but it's important to know that the resulting calculation will be affected by this choice. I'll go into more detail regarding how the strength of the prior belief affects the outcome later in the post.

Bayesian Inference

Definition

Now we know what Bayes' theorem is and how to use it, we can start to answer the question: what is Bayesian inference? Firstly, (statistical) inference is the process of deducing properties about a population or probability distribution from data. We did this in my previous post on maximum likelihood: from a set of observed data points we determined the maximum likelihood estimate of the mean. Bayesian inference is therefore just the process of deducing properties about a population or probability distribution from data using Bayes' theorem. That's it.

Using Bayes' theorem with distributions

Until now the examples that I've given have used single numbers for each term in the Bayes' theorem equation. This meant that the answers we got were also single numbers. However, there may be times when single numbers are not appropriate. In the ice cream example above we saw that the prior probability of selling ice cream was 0.3. But what if 0.3 was just my best guess and I was a bit uncertain about this value? The probability could also be 0.25 or 0.4. In this case a distribution for our prior belief might be more appropriate (see figure below). This distribution is known as the prior distribution.

[Figure: Two distributions that represent our prior probability of selling ice cream on any given day.]

The peak values of both the blue and gold curves occur around the value of 0.3 which, as we said above, is our best guess of our prior probability of selling ice cream. The fact that f(x) is non-zero for other values of x shows that we're not completely certain that 0.3 is the true value. The blue curve shows that it's likely to be anywhere between 0 and 0.5, whereas the gold curve shows that it's likely to be anywhere between 0 and 1. The fact that the gold curve is more spread out and has a smaller peak than the blue curve means that a prior probability expressed by the gold curve is "less certain" about the true value than the blue curve. In a similar manner we can represent the other terms in Bayes' theorem using distributions. We mostly need to use distributions when we're dealing with models.
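The figure itself isn't reproduced here, but as a rough stand-in for the two prior curves it describes, the sketch below evaluates two Beta distributions whose peaks sit near 0.3, one narrow and one spread out. The specific Beta parameters are my own illustrative choices, not values from the post.

```python
import numpy as np
from scipy.stats import beta

x = np.linspace(0, 1, 501)

# A fairly confident prior and a vaguer prior, both peaking near 0.3.
# The Beta parameters are illustrative choices, not taken from the article.
narrow_prior = beta(a=7.0, b=15.0)   # mode = (7 - 1) / (7 + 15 - 2) = 0.3
wide_prior   = beta(a=2.2, b=3.8)    # mode = (2.2 - 1) / (2.2 + 3.8 - 2) = 0.3

print("narrow prior density at 0.3:", narrow_prior.pdf(0.3))
print("wide prior density at 0.3:  ", wide_prior.pdf(0.3))
# Plotting narrow_prior.pdf(x) and wide_prior.pdf(x) against x reproduces the
# qualitative picture: the more spread-out curve has the lower peak.
```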
Model form of Bayes' Theorem

In the introductory definition of Bayes' theorem above I used events A and B, but when the model form of Bayes' theorem is stated in the literature different symbols are often used. Let's introduce them.

Instead of event A we'll typically see Θ (this symbol is called theta). Theta is what we're interested in: it represents the set of parameters. So if we're trying to estimate the parameter values of a Gaussian distribution then Θ represents both the mean, μ, and the standard deviation, σ (written mathematically as Θ = {μ, σ}).

Instead of event B we'll see data or y = {y1, y2, …, yn}. These represent the data, i.e. the set of observations that we have. I'll explicitly use data in the equation to hopefully make it a little less cryptic. So now Bayes' theorem in model form is written as:

P(Θ|data) = P(data|Θ) × P(Θ) / P(data)

We've seen that P(Θ) is the prior distribution. It represents our beliefs about the true value of the parameters, just like we had distributions representing our belief about the probability of selling ice cream.

P(Θ|data) on the left hand side is known as the posterior distribution. This is the distribution representing our belief about the parameter values after we have calculated everything on the right hand side, taking the observed data into account.

P(data|Θ) is something we've come across before. If you made it to the end of my previous post on maximum likelihood then you'll remember that we said L(data; μ, σ) is the likelihood distribution (for a Gaussian distribution). Well, P(data|Θ) is exactly this: it's the likelihood distribution in disguise. Sometimes it's written as ℒ(Θ; data), but it's the same thing here.

Therefore we can calculate the posterior distribution of our parameters using our prior beliefs updated with our likelihood. This gives us enough information to go through an example of parameter inference using Bayesian inference. But first…

Why did I completely disregard P(data)?

Well, apart from being the marginal distribution of the data it doesn't really have a fancy name, although it's sometimes referred to as the evidence. Remember, we're only interested in the parameter values, but P(data) doesn't have any reference to them. In fact, P(data) doesn't even evaluate to a distribution; it's just a number. We've already observed the data, so we can calculate P(data). In general, it turns out that calculating P(data) is very hard, and so many methods exist to calculate it. This blog post by Prasoon Goyal explains several methods of doing so.

The reason why P(data) is important is that the number that comes out is a normalising constant. One of the necessary conditions for a probability distribution is that the sum of all possible outcomes of an event is equal to 1 (e.g. the total probability of rolling a 1, 2, 3, 4, 5 or 6 on a 6-sided die is equal to 1). The normalising constant makes sure that the resulting posterior distribution is a true probability distribution by ensuring that the sum of the distribution (I should really say integral, because it's usually a continuous distribution, but that's just being too pedantic right now) is equal to 1.

In some cases we don't care about this property of the distribution. We only care about where the peak of the distribution occurs, regardless of whether the distribution is normalised or not. In this case many people write the model form of Bayes' theorem as

P(Θ|data) ∝ P(data|Θ) × P(Θ)

where ∝ means "proportional to". This makes it explicit that the true posterior distribution is not equal to the right hand side, because we haven't accounted for the normalisation constant P(data).

Bayesian inference example

Well done for making it this far. You may need a break after all of that theory. But let's plough on with an example where inference might come in handy. The example we're going to use is to work out the length of a hydrogen bond. You don't need to know what a hydrogen bond is. I'm only using this as an example because it was one that I came up with to help out a friend during my PhD (we were in the Biochemistry department, which is why it was relevant at the time).

[Figure: an illustration related to hydrogen bonds.]

I've included this image because I think it looks nice, helps to break up the dense text, and is kind of related to the example that we're going to go through. Don't worry, you don't need to understand the figure to understand what we're about to go through on Bayesian inference. In case you're wondering, I made the figure with Inkscape.

Let's assume that a hydrogen bond is between 3.2Å and 4.0Å (a quick check on Google gave me this information; the Ångström, Å, is a unit of distance where 1Å is equal to 0.1 nanometres, so we're talking about very tiny distances). This information will form my prior. In terms of a probability distribution, I'll reformulate this as a Gaussian distribution with mean μ = 3.6Å and standard deviation σ = 0.2Å (see figure below).

[Figure: Our prior probability for the length of a hydrogen bond, represented by a Gaussian distribution with mean μ = 3.6Å and standard deviation σ = 0.2Å.]

Now we're presented with some data that gives measured lengths of hydrogen bonds (5 data points generated randomly from a Gaussian distribution with mean 3Å and standard deviation 0.4Å, to be exact; in real world situations these data would come from the result of a scientific experiment). We can derive a likelihood distribution from the data just like we did in the previous post on maximum likelihood. Assuming that the data were generated from a process that can be described by a Gaussian distribution, we get the likelihood distribution represented by the gold curve in the figure below. Notice that the maximum likelihood estimate of the mean from the 5 data points is less than 3 (about 2.8Å).

[Figure: Prior probability for the distance of a hydrogen bond in blue and the likelihood distribution in gold, derived from the 5 gold data points.]
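If you want to play along in code, here is a minimal sketch of this setup (my own; the random seed is arbitrary, so the simulated measurements won't exactly match the five points used in the post):

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # arbitrary seed; not the post's exact data points

# Prior belief about the hydrogen bond length (in Angstroms)
prior_mu, prior_sigma = 3.6, 0.2

# Five simulated "measurements" from a Gaussian with mean 3.0 and std 0.4,
# standing in for experimental data
true_mu, noise_sigma = 3.0, 0.4
data = rng.normal(loc=true_mu, scale=noise_sigma, size=5)

# For a Gaussian likelihood, the maximum likelihood estimate of the mean
# is simply the sample mean
mle_mean = data.mean()
print("measurements:", np.round(data, 2))
print("MLE of the mean:", round(mle_mean, 2))   # typically somewhere around 3 A
```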
Now we have 2 Gaussian distributions, blue representing the prior and gold representing the likelihood. We don't care about the normalising constant, so we have everything we need to calculate the unnormalised posterior distribution. Recall that the equation representing the probability density for a Gaussian is

P(x; μ, σ) = 1/(σ√(2π)) · exp(−(x − μ)² / (2σ²))

So we have to multiply 2 of these. I won't go through the maths here because it gets very messy. If you're interested in the maths then you can see it performed in the first 2 pages of this document. The resulting posterior distribution is shown in pink in the figure below.

[Figure: The posterior distribution in pink, generated by multiplying the blue and gold distributions.]

Now that we have the posterior distribution for the length of a hydrogen bond we can derive statistics from it. For example, we could use the expected value of the distribution to estimate the distance. Or we could calculate the variance to quantify our uncertainty about our conclusion. One of the most common statistics calculated from the posterior distribution is the mode. This is often used as the estimate of the true value of the parameter of interest and is known as the maximum a posteriori probability estimate or simply, the MAP estimate. In this case the posterior distribution is also a Gaussian distribution, so the mean is equal to the mode (and the median), and the MAP estimate for the distance of a hydrogen bond is at the peak of the distribution at about 3.2Å.

Concluding remarks

Why am I always using Gaussians?

You'll notice that in all my examples that involve distributions I use Gaussian distributions. One of the main reasons is that it makes the maths a lot easier. But the Bayesian inference example required calculating the product of 2 distributions. I said this was messy and so I didn't go through the maths. But even without doing the maths myself, I knew that the posterior was a Gaussian distribution. This is because the Gaussian distribution has a particular property that makes it easy to work with: it's conjugate to itself with respect to a Gaussian likelihood function. This means that if I multiply a Gaussian prior distribution with a Gaussian likelihood function, I'll get a Gaussian posterior function. The fact that the posterior and prior are both from the same distribution family (they are both Gaussians) means that they are called conjugate distributions. In this case the prior distribution is known as a conjugate prior.

In many inference situations likelihoods and priors are chosen such that the resulting distributions are conjugate, because it makes the maths easier. An example in data science is Latent Dirichlet Allocation (LDA), which is an unsupervised learning algorithm for finding topics in several text documents (referred to as a corpus). A very good introduction to LDA can be found here in Edwin Chen's blog.

In some cases we can't just pick the prior or likelihood in such a way as to make it easy to calculate the posterior distribution. Sometimes the likelihood and/or the prior distribution can look horrendous, and calculating the posterior by hand is not easy or possible. In these cases we can use different methods to calculate the posterior distribution. One of the most common is a technique called Markov Chain Monte Carlo methods. Ben Shaver has written a brilliant article called A Zero-Math Introduction to Markov Chain Monte Carlo Methods that explains this technique in a very accessible manner.

What happens when we get new data?

One of the great things about Bayesian inference is that you don't need lots of data to use it: 1 observation is enough to update the prior. In fact, the Bayesian framework allows you to update your beliefs iteratively in real time as data comes in. It works as follows: you have a prior belief about something (e.g. the value of a parameter) and then you receive some data. You can update your beliefs by calculating the posterior distribution like we did above. Afterwards, even more data comes in. The posterior then becomes the new prior. We can update the new prior with the likelihood derived from the new data and again we get a new posterior. This cycle can continue indefinitely, so you're continuously updating your beliefs.
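To make this cycle concrete, here is a sketch (my own, reusing the prior and the simulated `data` from the earlier snippet, and treating the measurement standard deviation of 0.4Å as known). It applies the standard Gaussian conjugate update once in batch, and then one observation at a time, so that each posterior becomes the prior for the next point. This closed-form update is the "messy" product of Gaussians worked out; its peak sits between the prior mean of 3.6Å and the sample mean, mirroring the roughly 3.2Å MAP estimate quoted above.

```python
import numpy as np

def gaussian_posterior(prior_mu, prior_sigma, data, noise_sigma):
    """Conjugate update for the mean of a Gaussian with known noise standard
    deviation: Gaussian prior in, Gaussian posterior out."""
    data = np.asarray(data, dtype=float)
    prior_prec = 1.0 / prior_sigma**2            # precision = 1 / variance
    data_prec = data.size / noise_sigma**2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mu = post_var * (prior_prec * prior_mu + data_prec * data.mean())
    return post_mu, np.sqrt(post_var)

# Batch update using the prior and the simulated measurements from the earlier snippet
post_mu, post_sigma = gaussian_posterior(3.6, 0.2, data, noise_sigma=0.4)
print("posterior mean / MAP estimate:", round(post_mu, 2))   # between the prior mean and the MLE

# Iterative updating: feed the data in one point at a time, letting each
# posterior become the prior for the next observation. The end result is the
# same posterior as processing all five points in one batch.
mu, sigma = 3.6, 0.2
for y in data:
    mu, sigma = gaussian_posterior(mu, sigma, [y], noise_sigma=0.4)
print("posterior after sequential updates:", round(mu, 2), round(sigma, 3))
```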
it’svariants)isagreatexampleofthis.It’susedinmanyscenarios,butpossiblythemosthighprofileindatascienceareitsapplicationstoselfdrivingcars.IusedavariantcalledtheUnscentedKalmanfilterduringmyPhDinmathematicalproteincrystallography,andcontributedtoanopensourcepackageimplementingthem.ForagoodvisualdescriptionofKalmanFilterscheckoutthisblogpost:HowaKalmanfilterworks,inpicturesbyTimBabb.UsingpriorsasregularisersThedatathatwegeneratedinthehydrogenbondlengthexampleabovesuggestedthat2.8Åwasthebestestimate.However,wemaybeatriskofoverfittingifwebasedourestimatesolelyonthedata.Thiswouldbeahugeproblemifsomethingwaswrongwiththedatacollectionprocess.WecancombatthisintheBayesianframeworkusingpriors.InourexampleusingaGaussianpriorcentredon3.6ÅresultedinaposteriordistributionthatgaveaMAPestimateofthehydrogenbondlengthas3.2Å.Thisdemonstratesthatourpriorcanactasaregulariserwhenestimatingparametervalues.Theamountofweightthatweputonourpriorvsourlikelihooddependsontherelativeuncertaintybetweenthetwodistributions.Inthefigurebelowwecanseethisgraphically.Thecoloursarethesameasabove,bluerepresentsthepriordistribution,goldthelikelihoodandpinktheposterior.Intheleftgraphinthefigureyoucanseethatourprior(blue)ismuchlessspreadoutthanthelikelihood(gold).Thereforetheposteriorresemblesthepriormuchmorethatthelikelihood.Theoppositeistrueinthegraphontheright.Thereforeifwewishtoincreasetheregularisationofaparameterwecanchoosetonarrowthepriordistributioninrelationtothelikelihood.MichaelGreenhaswrittenanarticlecalledThetruthaboutBayesianpriorsandoverfittingthatcoversthisinmoredetailandgivesadviceonhowtosetpriors.WhenistheMAPestimateequaltothemaximumlikelihoodestimate?TheMAPestimateisequaltotheMLEwhenthepriordistributionisuniform.Anexampleofauniformdistributionisshownbelow.UniformdistributionWhatwecanseeisthattheuniformdistributionassignsequalweighttoeveryvalueonthex-axis(it’sahorizontalline).Intuitivelyitrepresentsalackofanypriorknowledgeaboutwhichvaluesaremostlikely.Inthiscasealloftheweightisassignedtothelikelihoodfunction,sowhenwemultiplythepriorbythelikelihoodtheresultingposteriorexactlyresemblesthelikelihood.Therefore,themaximumlikelihoodmethodcanbeviewedasaspecialcaseofMAP.WhenIstartedwritingthispostIdidn’tactuallythinkthatitwouldbeanywherenearthislongsothankyousomuchformakingitthisfar.Ireallydoappreciateit.Asalways,ifthereisanythingthatisunclearorI’vemadesomemistakesintheabovefeelfreetoleaveacomment.InthenextpostinthisseriesIwillprobablytrytocovermarginalisationforworkingoutP(data),thenormalisingconstantthatIignoredinthispost.Unlessofcoursethereissomethingelsethatsomeonewouldlikemetogoover;)Thankyouforreading.MorefromTowardsDataScienceFollowYourhomefordatascience.AMediumpublicationsharingconcepts,ideasandcodes.ReadmorefromTowardsDataScienceRecommendedfromMediumvipulchauhanProbabilityTheory 
When I started writing this post I didn't actually think that it would be anywhere near this long, so thank you so much for making it this far. I really do appreciate it. As always, if there is anything that is unclear or I've made some mistakes in the above, feel free to leave a comment. In the next post in this series I will probably try to cover marginalisation for working out P(data), the normalising constant that I ignored in this post. Unless of course there is something else that someone would like me to go over ;)

Thank you for reading.


