When is Bayesian Machine Learning actually useful?

When it comes to Bayesian Machine Learning, you likely either love it or prefer to stay at a safe distance from anything Bayesian. Given that current state-of-the-art models hardly ever mention Bayes at all, there is probably at least some rationale behind the latter. On the other hand, many high profile research groups have been working on Bayesian Machine Learning tools for decades. And they still do. Thus, it is rather unlikely that this field is complete quackery either. As so often in life, the truth probably lies somewhere between the two extremes. In this article, I want to share my view on the titular question and point out when Bayesian Machine Learning can be helpful.

To avoid confusion, let us briefly define what Bayesian Machine Learning means in the context of this article:

"The Bayesian framework for machine learning states that you start out by enumerating all reasonable models of the data and assigning your prior belief P(M) to each of these models. Then, upon observing the data D, you evaluate how probable the data was under each of these models to compute P(D|M)."

- Zoubin Ghahramani

Less formally, we apply tools and frameworks from Bayesian statistics to Machine Learning models. I will provide some references at the end, in case you are not familiar with Bayesian statistics yet. Also, our discussion will essentially equate Machine Learning with neural networks and differentiable models in general (particularly Linear Regression). Since Deep Learning is currently the cornerstone of modern Machine Learning, this appears to be a fair approach. As a final disclaimer, we will differentiate between frequentist and Bayesian Machine Learning. The former includes the standard ML methods and loss functions that you are probably already familiar with. Finally, the fact that 'Bayesian' is written in uppercase but 'frequentist' is not has no judgemental meaning.

Now, let us begin with a surprising result:

Your frequentist model is probably already Bayesian

This statement might sound surprising and odd. However, there is a neat connection between the Bayesian and the frequentist learning paradigm. Let's start with Bayes' formula and apply it to a Bayesian Neural Network:

$$p(W \mid D) = \frac{p(D \mid W)\,p(W)}{p(D)} \tag{1}$$

As an illustrative example, we could have

$$y \mid x, W \sim \mathcal{N}\big(f_W(x), \sigma^2\big), \tag{2}$$

i.e. the network output defines the mean of the target variable which, presumably, follows a Normal distribution. For the prior over the network weights - let's presume we have K weights in total - we might choose independent Normal distributions:

$$w_k \sim \mathcal{N}(0, \sigma_w^2), \quad k = 1, \ldots, K. \tag{3}$$

The setup with (2) and (3) is fairly standard for Bayesian Neural Networks. Finding the posterior weight distribution in (1) turns out to be futile in any reasonable setting for Bayesian networks. This is nothing new for Bayesian models either. We now have a few options to deal with this issue, e.g. MCMC, Variational Bayes or Maximum a-posteriori estimation (MAP). Technically, the latter only gives us a point estimate at the posterior maximum, not the full posterior distribution:

$$\hat{W} = \arg\max_W \frac{p(D \mid W)\,p(W)}{p(D)}$$

Given that a probability density is strictly positive over its domain, we can introduce logarithms:

$$\hat{W} = \arg\max_W \Big[\log p(D \mid W) + \log p(W) - \log p(D)\Big]$$

Since the last summand does not depend on the model parameters, it does not affect the argmax. Hence, we can leave it out:

$$\hat{W} = \arg\max_W \Big[\log p(D \mid W) + \log p(W)\Big] \tag{4}$$
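To make the MAP route concrete, here is a minimal PyTorch sketch that maximizes the unnormalized log-posterior of (4) directly, under the Gaussian likelihood (2) and prior (3). The toy data, architecture and hyperparameters are illustrative assumptions, not part of the derivation above:

```python
import torch
import torch.nn as nn

# Toy 1-D regression data (assumed for illustration)
torch.manual_seed(0)
X = torch.linspace(-2, 2, 50).unsqueeze(-1)
y = torch.sin(X) + 0.1 * torch.randn_like(X)

net = nn.Sequential(nn.Linear(1, 16), nn.Tanh(), nn.Linear(16, 1))

sigma = 0.1    # assumed observation noise std, cf. eq. (2)
sigma_w = 1.0  # assumed prior std over the weights, cf. eq. (3)

opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for step in range(2000):
    opt.zero_grad()
    # Gaussian log-likelihood of eq. (2), up to additive constants
    log_lik = -0.5 / sigma**2 * ((y - net(X)) ** 2).sum()
    # Gaussian log-prior of eq. (3), up to additive constants
    log_prior = -0.5 / sigma_w**2 * sum((p**2).sum() for p in net.parameters())
    # MAP, eq. (4): maximize the log-posterior = minimize its negative
    loss = -(log_lik + log_prior)
    loss.backward()
    opt.step()
```

Note that the normalizing constants of likelihood and prior are simply dropped here, exactly as in the argmax derivation above.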
From Bayesian MAP to regularized MSE

The purpose of equations (2) and (3) was only an illustrative one. Let us replace likelihood and prior with the following:

$$p(D \mid W) = \frac{1}{Z_1} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{N} \big(y_i - f_W(x_i)\big)^2\right) \tag{5}$$

$$p(W) = \frac{1}{Z_2} \exp\left(-\frac{\lambda}{2} \sum_{k=1}^{K} w_k^2\right) \tag{6}$$

The Z-terms are normalization constants whose sole purpose is to yield valid probability densities. Now we can plug (5) and (6) into (4):

$$\hat{W} = \arg\max_W \left[-\frac{1}{2\sigma^2} \sum_{i=1}^{N} \big(y_i - f_W(x_i)\big)^2 - \frac{\lambda}{2} \sum_{k=1}^{K} w_k^2\right]$$

As the maximum of a function is equal to the minimum of the negative function, we can write

$$\hat{W} = \arg\min_W \left[\frac{1}{2\sigma^2} \sum_{i=1}^{N} \big(y_i - f_W(x_i)\big)^2 + \frac{\lambda}{2} \sum_{k=1}^{K} w_k^2\right]$$

Finally, the argmin is also unaffected by multiplication with a positive constant:

$$\hat{W} = \arg\min_W \left[\frac{1}{N} \sum_{i=1}^{N} \big(y_i - f_W(x_i)\big)^2 + \tilde{\lambda} \sum_{k=1}^{K} w_k^2\right], \qquad \tilde{\lambda} = \frac{\lambda \sigma^2}{N}$$

The final term is nothing more than the standard MSE objective with regularization. Hence, a regularized MSE objective is equivalent to a Bayesian MAP objective, given a specific prior and likelihood. If you reverse the above, you can find a prior-likelihood pair for almost any frequentist loss function. However, the frequentist objective typically only yields a point solution. The Bayesian method, on the other hand, gives you a full posterior distribution with uncertainty intervals on top. Using uninformative prior distributions, you can also remove the regularization term if necessary. We won't cover a derivation here, however.

The price of Bayesian Machine Learning in the real world

So, in theory, Bayesian Machine Learning yields a more complete picture than the frequentist approach. The caveat however is the intractability of the posterior distribution and thus the need to approximate or estimate it. Both approximation and estimation, however, inevitably lead to a loss in precision. Take for example the popular variational inference approach for Bayesian Neural Networks. In order to optimize the model, we need to approximate a so-called ELBO objective by sampling from the variational distribution. The usual ELBO objective looks like this - don't worry if you don't understand everything:

$$\mathcal{L}(\phi) = \mathbb{E}_{q_\phi(W)}\big[\log p(D \mid W)\big] - \mathrm{KL}\big(q_\phi(W) \,\|\, p(W)\big) \tag{7}$$

Our goal is to either maximize (7) or minimize its negative. However, the expectation term is typically intractable. The easiest solution to this issue is the application of the reparametrization trick, writing $W = g_\phi(\epsilon)$ with $\epsilon \sim p(\epsilon)$. This allows us to estimate the ELBO via

$$\mathcal{L}(\phi) \approx \frac{1}{M} \sum_{m=1}^{M} \log p\big(D \mid g_\phi(\epsilon_m)\big) - \mathrm{KL}\big(q_\phi(W) \,\|\, p(W)\big), \qquad \epsilon_m \sim p(\epsilon). \tag{8}$$

The linearity of the gradient operation then allows us to estimate the gradient of (7) via (8):

$$\nabla_\phi \mathcal{L}(\phi) \approx \frac{1}{M} \sum_{m=1}^{M} \nabla_\phi \log p\big(D \mid g_\phi(\epsilon_m)\big) - \nabla_\phi \mathrm{KL}\big(q_\phi(W) \,\|\, p(W)\big)$$

With this formula, we are basically sampling M gradients from a reparametrized distribution. Thereafter, we use those samples to calculate an 'average' gradient.

Unbiased gradient estimation for the ELBO

As demonstrated in the landmark paper by Kingma and Welling (2013), we have

$$\mathbb{E}_{p(\epsilon)}\left[\frac{1}{M} \sum_{m=1}^{M} \nabla_\phi \log p\big(D \mid g_\phi(\epsilon_m)\big)\right] = \nabla_\phi\, \mathbb{E}_{q_\phi(W)}\big[\log p(D \mid W)\big]. \tag{9}$$

In essence, equation (9) tells us that our sampling-based gradient is, on average, equal to the true gradient. The problem with (9) is that we don't know anything about the higher-order statistical moments of the sampled gradient. For example, our estimate might have a prohibitively large variance. Thus, while we gradient descend in the correct direction on average, we are doing so unreasonably inefficiently. The additional randomness due to resampling also makes it difficult to find the right stopping time for gradient descent. Add that to the problem of non-convex loss functions and your chances of underperforming against a non-Bayesian network are fairly high.

In summary, the true posterior in a Bayesian world could give us a more complete picture about the optimal parameters. Since we need to resort to approximation however, we most likely end up with worse performance than with the standard approach.
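As a toy illustration of (7) to (9), the sketch below fits a mean-field Gaussian variational distribution over the single weight of a linear model, using M reparametrized samples per gradient step. All names, shapes and constants are assumptions made for demonstration:

```python
import torch
import torch.nn.functional as F

# Toy data: y = 1.5 * x + noise (assumed for illustration)
torch.manual_seed(0)
X = torch.linspace(-2, 2, 50)
y = 1.5 * X + 0.1 * torch.randn_like(X)

# Variational parameters phi = (mu, rho) of q(w) = N(mu, softplus(rho)^2)
mu = torch.zeros(1, requires_grad=True)
rho = torch.zeros(1, requires_grad=True)
sigma, sigma_w, M = 0.1, 1.0, 16  # noise std, prior std, MC samples

opt = torch.optim.Adam([mu, rho], lr=1e-2)
for step in range(1000):
    opt.zero_grad()
    std = F.softplus(rho)
    eps = torch.randn(M)          # reparametrization: w = g_phi(eps)
    w = mu + std * eps
    # Monte Carlo estimate of E_q[log p(D|w)], cf. eq. (8)
    log_lik = (-0.5 / sigma**2 * (y - w[:, None] * X) ** 2).sum(dim=1).mean()
    # KL(q || p) between two univariate Gaussians, closed form
    kl = (torch.log(sigma_w / std) + (std**2 + mu**2) / (2 * sigma_w**2) - 0.5).sum()
    loss = -(log_lik - kl)        # negative ELBO, cf. eq. (7)
    loss.backward()               # sampled gradient, unbiased per eq. (9)
    opt.step()
```

Running this with a small M makes the gradient noise discussed above directly visible: the loss fluctuates from step to step even near the optimum.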
When does Bayesian Machine Learning actually make sense?

The above considerations beg the question of whether you want to even use Bayesian Machine Learning at all. Personally, I see two particular situations where this could be the case. Keep in mind that what follows are rules of thumb. The actual decision for or against Bayesian Machine Learning should be based on the specific problem at hand.

[Figure: When is Bayesian Machine Learning actually useful? A simplistic decision tree for guidance - always adapt to your specific problem.]

Small datasets and informative prior knowledge

Let us move back to the regularized MSE derivation from before. Our objective was

$$\hat{W} = \arg\max_W \left[-\frac{1}{2\sigma^2} \sum_{i=1}^{N} \big(y_i - f_W(x_i)\big)^2 - \frac{\lambda}{2} \sum_{k=1}^{K} w_k^2\right]$$

There is a clear tradeoff between the size of the dataset, N, and the amount of model parameters, K. For sufficiently large datasets, the prior on the right side becomes irrelevant for the MAP solution, and vice versa. This is also a key property for posterior inference in general.

[Figure: Figurative behaviour of the posterior distribution. More data makes the posterior a closer fit to the data. Less data makes it more and more similar to the prior distribution.]

When the dataset is small in relation to the amount of model parameters, the posterior distribution will closely resemble the prior. Thus, if the prior distribution contains sensible information about the problem, we can mitigate the lack of data to some extent.

Take for example a plain linear regression model. Depending on the signal-to-noise ratio, a lot of data might be necessary before the parameters converge to the 'best' value. If you have a well-reasoned prior distribution about that 'best' solution, your model might actually converge faster.

[Figure: 1) Due to the small dataset and the random noise, the plain OLS regression line is off. 2) A well-informed prior already produces a reasonable range before any model fit has occurred. 3) After 'training', the Bayesian model produces a predictive mean regression line that comes much closer to the data-generating regression line.]

The obvious objection: a badly chosen prior pulls the posterior away from the 'best' solution instead of towards it. As a result, convergence to the 'best' solution could be even slower than without any prior at all. There are more tools in Bayesian statistics to mitigate this reasonable objection and I am keen to cover this topic in a future article. Also, even if you don't want to go fully Bayesian, a MAP estimate could still be useful given high-quality prior knowledge.

Functional priors for modern Bayesian Machine Learning

A different issue might be the difficulty of expressing meaningful prior knowledge over neural network weights. If the model is a black box, how could you actually tell it what to do - apart from maybe some vague zero-mean Normal distribution over the weights? One promising direction to solve this issue could be functional Bayes. In that approach, we only tell the network what outputs we expect for given inputs, based on our prior knowledge. Hence, we only care about the posterior functions that our model can express. The exact parameter posterior distribution is only of secondary interest.

In short, if you have only limited data but well-informed prior knowledge, Bayesian Machine Learning could actually help. A small worked example follows below.
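To see the small-data effect in numbers, here is a minimal NumPy sketch of conjugate Bayesian linear regression with a single slope parameter. The dataset size, the prior location and all other constants are assumptions chosen for illustration:

```python
import numpy as np

# Tiny dataset from y = 2.0 * x + noise (assumed for illustration)
rng = np.random.default_rng(0)
true_slope, sigma = 2.0, 1.0
X = rng.uniform(-1, 1, size=(5, 1))  # only 5 observations
y = true_slope * X[:, 0] + sigma * rng.normal(size=5)

# Informative prior on the slope: N(1.8, 0.5^2), assumed well-reasoned
mu0, tau0 = 1.8, 0.5

# Conjugate Gaussian posterior (known noise std):
prec_post = X.T @ X / sigma**2 + 1 / tau0**2   # posterior precision
var_post = 1 / prec_post                       # posterior variance
mu_post = var_post * (X[:, 0] @ y / sigma**2 + mu0 / tau0**2)

ols_slope = np.linalg.lstsq(X, y, rcond=None)[0][0]
print(f"OLS slope:      {ols_slope:.2f}")
print(f"Posterior mean: {mu_post.item():.2f} +/- {np.sqrt(var_post).item():.2f}")
```

With only five noisy observations, the OLS estimate can land noticeably off the true slope, while the posterior mean is shrunk towards the informative prior; as N grows, the likelihood term dominates and both estimates converge.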
Importance of uncertainty quantification

For large datasets, the effect of your prior knowledge becomes less and less relevant. In that case, obtaining a frequentist solution might be fully sufficient and the superior approach. On some occasions however, the full picture - a.k.a. posterior uncertainty - could be important.

Consider the case of making a medical decision based on some Machine Learning model, where the model does not produce a final decision but rather supports doctors in their judgement. If Bayesian point accuracy is at least close to a frequentist equivalent, the uncertainty output can serve as useful extra information. Presuming that the expert's time is limited, they might want to take a closer look at highly uncertain model output. For output with low uncertainty, a quick sanity check might suffice. This of course requires the uncertainty estimates to be valid in the first place. Hence, the Bayesian approach comes at the additional cost of monitoring uncertainty performance as well. This should of course be taken into account when deciding whether to use a Bayesian model or not.

Another example where Bayesian uncertainty can shine are time-series forecasting problems. If you model a time series in an autoregressive manner, i.e. you estimate a model of the form

$$y_{t+1} = f_W(y_t, y_{t-1}, \ldots) + \epsilon_{t+1},$$

your model errors accumulate the further ahead you are trying to forecast. Even for a highly simple autoregressive time series such as

$$y_t = \alpha y_{t-1} + \epsilon_t,$$

a minuscule deviation in your estimated model can lead to catastrophic error accumulation in the long run. A full posterior could give a much clearer picture of how other, similarly probable scenarios might play out. Even though the model might be wrong on average, we nevertheless get an idea of how different parameter estimates affect the forecast.

[Figure: Bayesian forecast (purple) with uncertainty vs. frequentist point forecast (red). Although the frequentist forecast is slightly more accurate in the long run, the Bayesian uncertainty interval correctly includes the realized mean trajectory.]

Finally, if the last paragraph has made you interested in trying out Bayesian Machine Learning, here is a great method to start with:

Bayesian Deep Learning light with MC dropout

Luckily, it is nowadays not too costly to train a Bayesian model. Unless you are working with very big models, the increased computational demand of Bayesian inference should not be too problematic. One particularly straightforward approach to Bayesian inference is MC dropout, introduced by Gal and Ghahramani; it has since become a fairly popular tool. In summary, the authors show that it is sensible to use Dropout during both training and inference of a model. In fact, this is proved to be equivalent to variational inference with a particular, Bernoulli-based, variational distribution. Hence, MC Dropout can be a great starting point to make your Neural Network Bayes-ish without requiring too much effort upfront. A sketch of the procedure follows at the end of this section.

Pros and cons

On the one hand, MC dropout makes it quite straightforward to make an existing model Bayesian. You can simply re-train it with dropout and leave dropout turned on during inference. The samples you create that way can be seen as draws from the Bayesian posterior predictive distribution. On the other hand, the variational distribution in MC dropout is based on Bernoulli random variables. This should - in theory - make it an even less accurate approximation than the common Gaussian variational distribution. However, building and using this model is quite simple, requiring only plain Tensorflow or Pytorch. Also, there has been some deeper criticism about the validity of the approach. There exists an interesting debate involving one of the authors here. Whatever you make out of that criticism, MC Dropout can be a helpful baseline for more sophisticated methods. Once you get the hang of Bayesian Machine Learning, you can try to improve performance with more sophisticated models from there.
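The procedure itself fits in a few lines. The following PyTorch sketch shows the core idea; the architecture, dropout rate and number of samples are illustrative assumptions:

```python
import torch
import torch.nn as nn

# A standard network with dropout layers; train it with your usual loss.
net = nn.Sequential(
    nn.Linear(1, 64), nn.ReLU(), nn.Dropout(p=0.1),
    nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p=0.1),
    nn.Linear(64, 1),
)
# ... training loop omitted ...

def mc_dropout_predict(model, x, n_samples=100):
    """Sample repeatedly with dropout still active at inference time."""
    model.train()  # train mode keeps the Dropout layers stochastic
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    # Mean and spread of the (approximate) posterior predictive draws
    return samples.mean(dim=0), samples.std(dim=0)

x_new = torch.linspace(-2, 2, 10).unsqueeze(-1)
mean, std = mc_dropout_predict(net, x_new)
```

One caveat: `model.train()` also switches layers like BatchNorm into training mode, so for architectures beyond this plain example you may want to re-enable only the dropout modules.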
Conclusion

I hope that this article has given you some insights on the usefulness of Bayesian Machine Learning. Certainly, it is no magic bullet and there are many occasions where it might not be the right choice. If you carefully weigh the pros and cons though, Bayesian methods can be a highly useful tool. Also, I am happy to have a deeper discussion on the topic, either in the comments or via private channels.

References

[1] Gelman, Andrew, et al. "Bayesian Data Analysis." CRC Press, 2013.
[2] Kruschke, John. "Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan." Academic Press, 2014.
[3] Sivia, Devinderjit, and John Skilling. "Data Analysis: A Bayesian Tutorial." OUP Oxford, 2006.


