Indices of Effect Existence and Significance in the Bayesian ...

2025-01-23


Indices of Effect Existence and Significance in the Bayesian Framework
ORIGINAL RESEARCH article
Front. Psychol., 10 December 2019, Sec. Quantitative Psychology and Measurement
https://doi.org/10.3389/fpsyg.2019.02767
Dominique Makowski1*, Mattan S. Ben-Shachar2, S. H. Annabel Chen1,3,4*† and Daniel Lüdecke5†
1 School of Social Sciences, Nanyang Technological University, Singapore, Singapore
2 Department of Psychology, Ben-Gurion University of the Negev, Beersheba, Israel
3 Centre for Research and Development in Learning, Nanyang Technological University, Singapore, Singapore
4 Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore
5 Department of Medical Sociology, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
Editor and reviewers: Pietro Cipresso (Department of Psychology, University of Turin, Italy); Richard S. John (University of Southern California, United States); Jose D. Perezgonzalez (Massey University Business School, New Zealand). The editor and reviewers' affiliations are the latest provided on their Loop research profiles and may not reflect their situation at the time of review.
Turmoil has engulfed psychological science. Causes and consequences of the reproducibility crisis are in dispute. With the hope of addressing some of its aspects, Bayesian methods are gaining increasing attention in psychological science. Some of their advantages, as opposed to the frequentist framework, are the ability to describe parameters in probabilistic terms and to explicitly incorporate prior knowledge about them into the model. These issues are crucial, in particular regarding the current debate about statistical significance. Bayesian methods are not necessarily the only remedy against incorrect interpretations or wrong conclusions, but there is an increasing agreement that they are one of the keys to avoiding such fallacies. Nevertheless, their flexible nature is both their power and their weakness, for there is no agreement about what indices of “significance” should be computed or reported. This lack of a consensual index or guideline, comparable to the frequentist p-value, further contributes to the unnecessary opacity that many unfamiliar readers perceive in Bayesian statistics. Thus, this study describes and compares several Bayesian indices and provides intuitive visual representations of their “behavior” in relation to common sources of variance such as sample size, magnitude of effects and frequentist significance. The results contribute to the development of an intuitive understanding of the values that researchers report, allowing sensible recommendations to be drawn for the description of Bayesian statistics, which is critical for the standardization of scientific reporting.
Introduction
The Bayesian framework is quickly gaining popularity among psychologists and neuroscientists (Andrews and Baguley, 2013), for reasons such as flexibility, better accuracy in noisy data and small samples, lower proneness to type I errors, the possibility of introducing prior knowledge into the analysis and the intuitiveness and straightforward interpretation of results (Kruschke, 2010; Kruschke et al., 2012; Etz and Vandekerckhove, 2016; Wagenmakers et al., 2016, 2018; Dienes and Mclatchie, 2018). On the other hand, the frequentist approach has been associated with the focus on p-values and null hypothesis significance testing (NHST). The misinterpretation and misuse of p-values, so-called “p-hacking” (Simmons et al., 2011), has been shown to critically contribute to the reproducibility crisis in psychological science (Chambers et al., 2014; Szucs and Ioannidis, 2016). The reliance on p-values has been criticized for its association with inappropriate inference, and effects can be drastically overestimated, sometimes even in the wrong direction, when estimation is tied to statistical significance in highly variable data (Gelman, 2018). Power calculations allow researchers to control the probability of falsely rejecting the null hypothesis, but do not completely solve this problem. For instance, the “false-alarm probability” of even very small p-values can be much higher than expected (Nuzzo, 2014). In response, there is an increasing belief that the generalization and utilization of the Bayesian framework is one way of overcoming these issues (Maxwell et al., 2015; Etz and Vandekerckhove, 2016; Marasini et al., 2016; Wagenmakers et al., 2017; Benjamin et al., 2018; Halsey, 2019).
The tenacity and resilience of the p-value as an index of significance are remarkable, despite the long-lasting criticism and discussion about its misuse and misinterpretation (Gardner and Altman, 1986; Cohen, 1994; Anderson et al., 2000; Fidler et al., 2004; Finch et al., 2004). This endurance might be informative on how such indices, and the accompanying heuristics applied to interpret them (e.g., assigning thresholds like 0.05, 0.01, and 0.001 to certain levels of significance), are useful and necessary for researchers to gain an intuitive (although possibly simplified) understanding of the interactions and structure of their data. Moreover, the utility of such an index is most salient in contexts where decisions must be made and rationalized (e.g., in medical settings). Unfortunately, these heuristics can become severely rigidified, and meeting significance has become a goal unto itself rather than a tool for understanding the data (Cohen, 1994; Kirk, 1996). This is particularly problematic given that p-values can only be used to reject the null hypothesis and not to accept it as true, because a statistically non-significant result does not mean that there is no difference between groups or no effect of a treatment (Wagenmakers, 2007; Amrhein et al., 2019).
While significance testing (and its inherent categorical interpretation heuristics) might have its place as a complementary perspective to effect estimation, this does not remove the need for improvement. One possible advance could focus on improving the understanding of the values being used, for instance through a new, simpler index. Bayesian inference allows intuitive probability statements to be made about an effect, as opposed to the less straightforward mathematical definition of the p-value, which contributes to its common misinterpretation. Another improvement could be found in providing an intuitive understanding (e.g., by visual means) of the behavior of the indices in relation to the main sources of variance, such as sample size, noise, or effect presence. Such a better overall understanding of the indices would hopefully act as a barrier against their mindless reporting by allowing users to nuance the interpretations and conclusions that they draw.
The Bayesian framework offers several alternative indices to the p-value. To better understand these indices, it is important to point out one of the core differences between Bayesian and frequentist methods. From a frequentist perspective, the effects are fixed (but unknown) and data are random. On the other hand, instead of having single estimates of some “true effect” (for instance, the “true” correlation between x and y), Bayesian methods compute the probability of different effect values given the observed data (and some prior expectation), resulting in a distribution of possible values for the parameters, called the posterior distribution. The description of the posterior distribution (e.g., through its centrality, dispersion, etc.) allows conclusions to be drawn from Bayesian analyses.
Bayesian “significance” testing indices could be roughly grouped into three overlapping categories: Bayes factors, posterior indices and Region of Practical Equivalence (ROPE)-based indices. Bayes factors are a family of indices of relative evidence of one model over another (e.g., the null vs. the alternative hypothesis; Jeffreys, 1998; Ly et al., 2016). Aside from having a straightforward interpretation (“given the observed data, is the null hypothesis of an absence of an effect more, or less likely?”), they make it possible to quantify the evidence in favor of the null hypothesis (Dienes, 2014; Jarosz and Wiley, 2014). However, their use for parameter description in complex models is still a matter of debate (Wagenmakers et al., 2010; Heck, 2019), as they are highly dependent on the specification of priors (Etz et al., 2018; Kruschke and Liddell, 2018). In contrast, “posterior indices” reflect objective characteristics of the posterior distribution, for instance the proportion of strictly positive values. They also make it possible to derive legitimate statements about the probability of an effect falling in a given range, that is, statements of the kind that are often, but wrongly, drawn from frequentist confidence intervals. Finally, ROPE-based indices are related to the redefinition of the null hypothesis from the classic point-null hypothesis to a range of values considered negligible or too small to be of any practical relevance (the Region of Practical Equivalence, or ROPE; Kruschke, 2014; Lakens, 2017; Lakens et al., 2018), usually spread equally around 0 (e.g., [−0.1; 0.1]). The idea behind this index is that an effect is almost never exactly zero, but can instead be very tiny, with no practical relevance. It is interesting to note that this perspective unites significance testing with the focus on effect size (involving a discrete separation between at least two categories: negligible and non-negligible), which finds an echo in recent statistical recommendations (Ellis and Steyn, 2003; Sullivan and Feinn, 2012; Simonsohn et al., 2014).
Despite the richness provided by the Bayesian framework and the availability of multiple indices, no consensus has yet emerged on which ones should be used. Literature continues to bloom in a raging debate, often polarized between proponents of the Bayes factor as the supreme index and its detractors (Spanos, 2013; Robert, 2014, 2016; Wagenmakers et al., 2019), with strong theoretical arguments being developed on both sides. Yet no practical, empirical and direct comparison between these indices has been done. This might be a deterrent for scientists interested in adopting the Bayesian framework. Moreover, this gray area can make it harder for readers or reviewers unfamiliar with the Bayesian framework to follow the assumptions and conclusions, which could in turn cast unnecessary doubt upon an entire study. While we think that such indices of significance and their interpretation guidelines (in the form of rules of thumb) are useful in practice, we also strongly believe that they should be accompanied by an understanding of their “behavior” in relation to major sources of variance, such as sample size, noise or effect presence. This knowledge is important for people to implicitly and intuitively appraise the meaning and implication of the mathematical values they report. Such an understanding could prevent the crystallization of the possible heuristics and categories derived from such indices, as has unfortunately occurred for p-values.
Thus, based on the simulation of linear and logistic regressions (arguably some of the most widely used models in the psychological sciences), the present work aims at comparing several indices of effect “significance,” providing visual representations of the “behavior” of such indices in relation to sample size, noise and effect presence, as well as their relationship to frequentist p-values (an index which, beyond its many flaws, is well known and could be used as a reference for Bayesian neophytes), and finally drawing recommendations for Bayesian statistics reporting.
Materials and Methods
Data Simulation
We simulated datasets suited for linear and logistic regression and started by simulating an independent, normally distributed x variable (with mean 0 and SD 1) of a given sample size. Then, the corresponding y variable was added, either having a perfect correlation with x (in the case of data for linear regressions) or as a binary variable perfectly separated by x (for logistic regressions). The case of no effect was simulated by creating a y variable that was independent of (i.e., not correlated with) x. Finally, Gaussian noise (the error) was added to the x variable before its standardization, which in turn decreases the standardized coefficient (the effect size).
The simulation aimed at modulating the following characteristics: outcome type (linear or logistic regression), sample size (from 20 to 100 by steps of 10), null hypothesis (original regression coefficient from which data is drawn prior to noise addition: 1 for the presence of a “true” effect, or 0 for its absence) and noise (Gaussian noise applied to the predictor, with SD uniformly spread between 0.666 and 6.66 over 1,000 different values), which is directly related to the absolute value of the coefficient (i.e., the effect size). We generated a dataset for each combination of these characteristics, resulting in a total of 36,000 (2 model types × 2 presence/absence of effect × 9 sample sizes × 1,000 noise variations) datasets. The code used for data generation is available on GitHub (footnote 1). Note that the generation usually takes several days or weeks to complete.
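The simulation design described above can be sketched in a few lines. This is a minimal illustration in Python, not the authors' generation code (their analysis was run in R). The function names, the random seed and the evenly spaced noise grid are assumptions made for the example, and only the linear case is shown.

```python
import random
import statistics


def standardize(values):
    """Rescale values to mean 0 and SD 1."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def simulate_linear(n, noise_sd, true_effect, rng):
    """Return (x, y) for one linear-regression dataset (hypothetical helper).

    true_effect=1: y starts out perfectly correlated with x.
    true_effect=0: y is drawn independently of x.
    Gaussian noise is added to x before it is standardized.
    """
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    if true_effect == 1:
        y = list(x)
    else:
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x_noisy = [xi + rng.gauss(0.0, noise_sd) for xi in x]
    return standardize(x_noisy), standardize(y)


# Design grid from the text; even spacing of the noise SDs is an assumption.
model_types = ["linear", "logistic"]
effects = [0, 1]
sample_sizes = list(range(20, 101, 10))
noise_levels = [0.666 + i * (6.66 - 0.666) / 999 for i in range(1000)]
n_datasets = (len(model_types) * len(effects)
              * len(sample_sizes) * len(noise_levels))

rng = random.Random(1)
x, y = simulate_linear(n=50, noise_sd=0.666, true_effect=1, rng=rng)
```

Counting the grid this way gives the 36,000 datasets reported in the text.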
Indices
For each of these datasets, Bayesian and frequentist regressions were fitted to predict y from x as the single predictor. We then computed the following seven indices from all simulated models (see Figure 1), related to the effect of x.
Figure 1. Bayesian indices of effect existence and significance. (A) The Probability of Direction (pd) is defined as the proportion of the posterior distribution that is of the median’s sign (the size of the yellow area relative to the whole distribution). (B) The MAP-based p-value is defined as the density value at 0 (the height of the red lollipop) divided by the density at the Maximum A Posteriori (MAP; the height of the blue lollipop). (C) The percentage in ROPE corresponds to the red area relative to the distribution [with or without tails for ROPE (full) and ROPE (95%), respectively]. (D) The Bayes factor (vs. 0) corresponds to the point-null density of the prior (the blue lollipop on the dotted distribution) divided by that of the posterior (the red lollipop on the yellow distribution), and the Bayes factor (vs. ROPE) is calculated as the odds of the prior falling within vs. outside the ROPE (the blue area on the dotted distribution) divided by that of the posterior (the red area on the yellow distribution).
Frequentist p-Value
This was the only index computed by the frequentist version of the regression. The p-value represents the probability that, for a given statistical model, when the null hypothesis is true, the effect would be greater than or equal to the observed coefficient (Wasserstein and Lazar, 2016).
Probability of Direction (pd)
The Probability of Direction (pd) varies between 50 and 100% and can be interpreted as the probability that a parameter (described by its posterior distribution) is strictly positive or negative (whichever is the most probable). It is mathematically defined as the proportion of the posterior distribution that is of the median’s sign (Makowski et al., 2019).
MAP-Based p-Value
The MAP-based p-value is related to the odds that a parameter has against the null hypothesis (Mills and Parent, 2014; Mills, 2017). It is mathematically defined as the density value at 0 divided by the density at the Maximum A Posteriori (MAP), i.e., the equivalent of the mode for continuous distributions.
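The two definitions above translate directly into operations on posterior draws. The following is a minimal sketch in Python. The Gaussian kernel density estimate, the normal-reference bandwidth and the grid search for the mode are choices made for this illustration, not the estimator used by the authors, and the draws come from a made-up posterior.

```python
import math
import random
import statistics


def probability_of_direction(draws):
    """Proportion of posterior draws that share the sign of the median."""
    median = statistics.median(draws)
    if median >= 0:
        same_sign = sum(1 for d in draws if d > 0)
    else:
        same_sign = sum(1 for d in draws if d < 0)
    return same_sign / len(draws)


def kde(draws, point, bandwidth):
    """Gaussian kernel density estimate of the draws at a single point."""
    total = sum(math.exp(-0.5 * ((point - d) / bandwidth) ** 2) for d in draws)
    return total / (len(draws) * bandwidth * math.sqrt(2.0 * math.pi))


def map_based_p_value(draws, grid_size=200):
    """Density at 0 divided by the density at the (approximate) mode."""
    n = len(draws)
    # Normal-reference rule of thumb for the bandwidth (an assumption).
    bandwidth = 1.06 * statistics.stdev(draws) * n ** (-1 / 5)
    low, high = min(draws), max(draws)
    grid = [low + i * (high - low) / (grid_size - 1) for i in range(grid_size)]
    density_at_zero = kde(draws, 0.0, bandwidth)
    density_at_map = max(max(kde(draws, g, bandwidth) for g in grid),
                         density_at_zero)
    return density_at_zero / density_at_map


rng = random.Random(42)
posterior = [rng.gauss(0.5, 0.2) for _ in range(2000)]  # illustrative draws
pd = probability_of_direction(posterior)
p_map = map_based_p_value(posterior)
```

Because the MAP-based p-value goes through a density estimate, changing the bandwidth or the kernel can change its value, whereas the pd depends only on the signs of the draws.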
ROPE (95%)
The ROPE (95%) refers to the percentage of the 95% Highest Density Interval (HDI) that lies within the ROPE. As suggested by Kruschke (2014), the Region of Practical Equivalence (ROPE) was defined as the range from −0.1 to 0.1 for linear regressions and its equivalent, −0.18 to 0.18, for logistic models (based on the π/√3 formula to convert log odds ratios to standardized differences; Cohen, 1988). Although we present the “95% percentage” because of the history of this index and of its widespread use, the reader should note that this value was recently challenged due to its arbitrary nature (McElreath, 2018).
ROPE (Full)
The ROPE (full) is similar to ROPE (95%), with the exception that it refers to the percentage of the whole posterior distribution that lies within the ROPE.
Bayes Factor (vs. 0)
The Bayes Factor (BF) used here is based on prior and posterior distributions of a single parameter. In this context, the Bayes factor indicates the degree by which the mass of the posterior distribution has shifted further away from or closer to the null value (0), relative to the prior distribution, thus indicating if the null hypothesis has become less or more likely given the observed data. The BF was computed as a Savage-Dickey density ratio, which is also an approximation of a Bayes factor comparing the marginal likelihoods of the model against a model in which the tested parameter has been restricted to the point-null (Wagenmakers et al., 2010).
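The two ROPE percentages can be sketched as follows, again in Python and on made-up posterior draws. The HDI is approximated here by the narrowest window over the sorted draws, which is one common approach and an assumption of this example; the function names are hypothetical.

```python
import math
import random


def rope_full(draws, low, high):
    """Share of the whole posterior lying inside [low, high]."""
    return sum(1 for d in draws if low <= d <= high) / len(draws)


def hdi(draws, mass=0.95):
    """Narrowest interval that contains `mass` of the sorted draws."""
    ordered = sorted(draws)
    n = len(ordered)
    window = max(1, math.ceil(mass * n))
    best_start = min(range(n - window + 1),
                     key=lambda i: ordered[i + window - 1] - ordered[i])
    return ordered[best_start], ordered[best_start + window - 1]


def rope_95(draws, low, high):
    """Share of the draws inside the 95% HDI that lie inside [low, high]."""
    hdi_low, hdi_high = hdi(draws, 0.95)
    inside_hdi = [d for d in draws if hdi_low <= d <= hdi_high]
    return rope_full(inside_hdi, low, high)


# ROPE bound for logistic models: 0.1 converted to the log-odds scale.
logistic_bound = 0.1 * math.pi / math.sqrt(3)

rng = random.Random(7)
posterior = [rng.gauss(0.0, 0.03) for _ in range(4000)]  # illustrative draws
in_rope_full = rope_full(posterior, -0.1, 0.1)
in_rope_95 = rope_95(posterior, -0.1, 0.1)
```

Rounded to two decimals, `logistic_bound` gives the 0.18 bound quoted above for logistic models.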
Bayes Factor (vs. ROPE)
The Bayes factor (vs. ROPE) is similar to the Bayes factor (vs. 0), but instead of a point-null, the null hypothesis is a range of negligible values (defined here in the same way as for the ROPE indices). The BF was computed by comparing the prior and posterior odds of the parameter falling within vs. outside the ROPE (see Non-overlapping Hypotheses in Morey and Rouder, 2011). This measure is closely related to the ROPE (full), as it can be formally defined as the ratio between the ROPE (full) odds for the posterior distribution and the ROPE (full) odds for the prior distribution:
\[ \mathrm{BF}_{\mathrm{ROPE}} = \frac{\mathrm{odds}\left(\mathrm{ROPE}_{\mathrm{full\ posterior}}\right)}{\mathrm{odds}\left(\mathrm{ROPE}_{\mathrm{full\ prior}}\right)} \]
Data Analysis
In order to achieve the two-fold aim of this study, (1) comparing Bayesian indices and (2) providing visual guides for an intuitive understanding of the numeric values in relation to a known frame of reference (the frequentist p-value), we will start by presenting the relationship between these indices and the main sources of variance, such as sample size, noise and null hypothesis (true if absence of effect, false if presence of effect). We will then compare Bayesian indices with the frequentist p-value and its commonly used thresholds (0.05, 0.01, 0.001). Finally, we will show the mutual relationship between three recommended Bayesian candidates. Taken together, these results will help us outline guides to ease the reporting and interpretation of the indices.
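Both Bayes factors defined in the Indices section can be illustrated with closed-form normal distributions. In this sketch the prior is the normal(0, 1) prior named in the text, while the two posteriors are made-up normal approximations; in a real analysis the densities and masses would be estimated from posterior draws. The odds are taken here as the odds of the parameter lying outside the ROPE, so that values above 1 favor a non-negligible effect; this orientation is an assumption of the example, and its reciprocal expresses the same evidence in favor of the null.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mean, sd):
    """Cumulative probability of a normal distribution at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def mass_in_rope(mean, sd, low, high):
    """Probability mass of a normal distribution inside [low, high]."""
    return normal_cdf(high, mean, sd) - normal_cdf(low, mean, sd)


def bf_vs_point_null(prior, posterior):
    """Savage-Dickey ratio: prior density at 0 over posterior density at 0."""
    return normal_pdf(0.0, *prior) / normal_pdf(0.0, *posterior)


def bf_vs_rope(prior, posterior, low=-0.1, high=0.1):
    """Posterior odds of lying outside the ROPE divided by the prior odds."""
    def odds_outside(dist):
        inside = mass_in_rope(*dist, low, high)
        return (1.0 - inside) / inside
    return odds_outside(posterior) / odds_outside(prior)


prior = (0.0, 1.0)              # (mean, SD), as stated in the text
posterior_effect = (0.5, 0.1)   # hypothetical posterior far from 0
posterior_null = (0.0, 0.03)    # hypothetical posterior concentrated on 0

bf0_effect = bf_vs_point_null(prior, posterior_effect)
bf0_null = bf_vs_point_null(prior, posterior_null)
bfrope_effect = bf_vs_rope(prior, posterior_effect)
bfrope_null = bf_vs_rope(prior, posterior_null)
```

With these illustrative inputs, both Bayes factors exceed 1 for the posterior located away from 0 and fall below 1 for the posterior concentrated on 0, which is the two-directional behavior the text attributes to BFs.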
In order to provide an intuitive understanding of values, data processing will focus on creating clear visual figures to help the user grasp the patterns and variability that exist when computing the investigated indices. Nevertheless, we decided to also mathematically test our claims in cases where the graphical representation begged for a deeper investigation. Thus, we fitted two regression models to assess the impact of sample size and noise, respectively. For these models (but not for the figures), to ensure that any differences between the indices are not due to differences in their scale or distribution, we converted all indices to the same scale by normalizing the indices between 0 and 1 (note that BFs were transformed to posterior probabilities, assuming uniform prior odds) and reversing the p-values, the MAP-based p-values and the ROPE indices so that a higher value corresponds to stronger “significance.”
The statistical analyses were conducted using R (R Core Team, 2019). Computations of Bayesian models were done using the rstanarm package (Goodrich et al., 2019), a wrapper for the Stan probabilistic language (Carpenter et al., 2017). We used Markov Chain Monte Carlo sampling (in particular, Hamiltonian Monte Carlo; Gelman et al., 2014) with 4 chains of 2,000 iterations, half of which were used for warm-up. Mildly informative priors (a normal distribution with mean 0 and SD 1) were used for the parameter in all models. The Bayesian indices were calculated using the bayestestR package (Makowski et al., 2019).
Results
Impact of Sample Size
Figure 2 shows the sensitivity of the indices to sample size. The p-value, the pd and the MAP-based p-value are sensitive to sample size only in the presence of a true effect (when the null hypothesis is false). When the null hypothesis is true, all three indices are unaffected by sample size. In other words, these indices reflect the amount of observed evidence (the sample size) for the presence of an effect (i.e., against the null hypothesis being true), but not for the absence of an effect. The ROPE indices, however, appear to be strongly modulated by the sample size when there is no effect, suggesting their sensitivity to the amount of evidence for the absence of effect. Finally, the figure suggests that BFs are sensitive to sample size for both presence and absence of a true effect.
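The rescaling step described in the Data Analysis section can be sketched as follows. With uniform prior odds (prior odds of 1), a Bayes factor BF corresponds to a posterior probability of BF / (1 + BF). The min-max rescaling shown for the remaining indices is an assumption about what "normalizing between 0 and 1" means here, and the helper names are hypothetical.

```python
def bf_to_posterior_probability(bf):
    """Posterior probability of the alternative when prior odds equal 1."""
    return bf / (1.0 + bf)


def reverse(value):
    """Flip a 0-1 index so that higher means stronger 'significance'."""
    return 1.0 - value


def min_max(values):
    """Rescale a list of values linearly onto the 0-1 range."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


bayes_factors = [0.2, 1.0, 3.0, 30.0]
bf_probabilities = [bf_to_posterior_probability(bf) for bf in bayes_factors]
p_values_reversed = [reverse(p) for p in (0.5, 0.05, 0.001)]
pd_rescaled = min_max([0.5, 0.75, 1.0])
```

After these transformations all indices point in the same direction, which is what allows their sensitivity to sample size and noise to be compared within a single regression model.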
Figure 2. Impact of sample size on the different indices, for linear and logistic models, and when the null hypothesis is true or false. Gray vertical lines for p-values and Bayes factors represent commonly used thresholds.
Consistent with Figure 2 and Table 1, the model investigating the sensitivity of the different indices to sample size suggests that BF indices are sensitive to sample size both when an effect is present (null hypothesis is false) and absent (null hypothesis is true). ROPE indices are particularly sensitive to sample size when the null hypothesis is true, while the p-value, the pd and the MAP-based p-value are only sensitive to sample size when the null hypothesis is false, in which case they are more sensitive than ROPE indices. These findings can be related to the concept of consistency: as the number of data points increases, the statistic converges toward some “true” value. Here, we observe that the p-value, the pd and the MAP-based p-value are consistent only when the null hypothesis is false. In other words, as sample size increases, they tend to reflect more strongly that the effect is present. On the other hand, ROPE indices appear to be consistent when the effect is absent. Finally, BFs are consistent both when the effect is absent and when it is present. BF (vs. ROPE), compared to BF (vs. 0), is more sensitive to sample size when the null hypothesis is true, and ROPE (full) is overall slightly more consistent than ROPE (95%).
Table 1. Sensitivity to sample size.
Impact of Noise
Figure 3 shows the indices’ sensitivity to noise. Unlike the patterns of sensitivity to sample size, the indices display more similar patterns in their sensitivity to noise (or magnitude of effect). All indices are unidirectionally impacted by noise: as noise increases, the observed coefficients decrease in magnitude, and the indices become less “pronounced” (each in its own direction). However, it is interesting to note that the variability of the indices seems to be differently impacted by noise. For the p-values, the pd and the ROPE indices, the variability increases as the noise increases. In other words, small variations in small observed coefficients can yield very different values. In contrast, the variability of BFs decreases as the true effect tends toward 0. For the MAP-based p-value, the variability appears to be the highest for moderate amounts of noise. This behavior seems consistent across model types.
Figure 3. Impact of noise. The noise corresponds to the standard deviation of the Gaussian noise that was added to the generated data. It is related to the magnitude of the parameter (the more noise there is, the smaller the coefficient). Gray vertical lines for p-values and Bayes factors represent commonly used thresholds. The scale is capped for the Bayes factors as these extend to infinity.
Consistent with Figure 3 and Table 2, the model investigating sensitivity to noise when an effect is present (in the absence of an effect, there is only noise), adjusted for sample size, suggests that BFs (especially vs. ROPE), followed by the MAP-based p-value and percentages in ROPE, are the most sensitive to noise. As noise is a proxy of effect size (linearly related to the absolute value of the coefficient of the parameter), this result highlights the fact that these indices are sensitive to the magnitude of the effect. For example, as noise increases, evidence for an effect becomes weak, and data seem to support the absence of an effect (or at the very least the presence of a negligible effect), which is reflected in BFs being consistently smaller than 1. On the other hand, as the p-value and the pd quantify evidence only for the presence of an effect, as noise increases, they become more dependent on larger sample sizes to be able to detect the presence of an effect.
Table 2. Sensitivity to noise.
Relationship With the Frequentist p-Value
Figure 4 suggests that the pd has a 1:1 correspondence with the frequentist p-value (through the formula p_two-sided = 2 × (1 − pd)). BF indices still appear to have a severely non-linear relationship with the frequentist index, mostly due to the fact that smaller p-values correspond to stronger evidence in favor of the presence of an effect, but the reverse is not true. ROPE-based percentages appear to be only weakly related to p-values. Critically, their relationship seems to be strongly dependent on sample size.
Figure 4. Relationship with the frequentist p-value. In each plot, the p-value densities are visualized by the marginal top (absence of true effect) and bottom (presence of true effect) markers, whereas on the left (presence of true effect) and right (absence of true effect), the markers represent the density of the index of interest. Different point shapes, representing different sample sizes, specifically illustrate its impact on the percentages in ROPE, for which each “curve line” is associated with one sample size (the bigger the sample size, the higher the percentage in ROPE).
Figure 5 shows the equivalence between p-value thresholds (0.1, 0.05, 0.01, 0.001) and the Bayesian indices. As expected, the pd has the sharpest thresholds (95, 97.5, 99.5, and 99.95%, respectively). For logistic models, these threshold points appear to be more conservative (i.e., Bayesian indices have to be more “pronounced” to reach the same level of significance). This sensitivity to model type is the strongest for BFs (which is possibly related to the difference in the prior specification for these two types of models).
Figure 5. The probability of reaching different p-value based significance thresholds (0.1, 0.05, 0.01, 0.001 for solid, long-dashed, short-dashed, and dotted lines, respectively) for different values of the corresponding Bayesian indices.
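The correspondence between the pd and the two-sided p-value, and the pd thresholds quoted for Figure 5, follow from simple arithmetic. A short sketch, with hypothetical function names and the pd expressed as a proportion rather than a percentage:

```python
def pd_to_p(pd):
    """Two-sided p-value corresponding to a probability of direction."""
    return 2.0 * (1.0 - pd)


def p_to_pd(p):
    """Probability of direction corresponding to a two-sided p-value."""
    return 1.0 - p / 2.0


# pd equivalents of the p-value thresholds discussed in the text.
thresholds = {p: p_to_pd(p) for p in (0.1, 0.05, 0.01, 0.001)}
```

Mapping 0.1, 0.05, 0.01 and 0.001 through `p_to_pd` recovers the 95, 97.5, 99.5 and 99.95% thresholds listed above.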
Relationship Between ROPE (Full), pd, and BF (vs. ROPE)
Figure 6 suggests that the relationship between the ROPE (full) and the pd might be strongly affected by the sample size, and subject to differences across model types. This seems to echo the relationship between ROPE (full) and p-value, the latter having a 1:1 correspondence with pd. On the other hand, the ROPE (full) and the BF (vs. ROPE) seem very closely related within the same model type, reflecting their formal relationship [see definition of BF (vs. ROPE) above]. Overall, these results help to demonstrate the consistency of ROPE (full) and BF (vs. ROPE) both in the presence and in the absence of a true effect, whereas the pd, being equivalent to the p-value, is only consistent when the true effect is present.
Figure 6. Relationship between three Bayesian indices: the probability of direction (pd), the percentage of the full posterior distribution in the ROPE, and the Bayes factor (vs. ROPE).
Discussion
Based on the simulation of linear and logistic models, the present work aimed to compare several Bayesian indices of effect “significance” (see Table 3), providing visual representations of the “behavior” of such indices in relation to important sources of variance such as sample size, noise and effect presence, as well as comparing them with the well-known and widely used frequentist p-value.
Table 3. Summary of Bayesian indices of effect existence and significance.
The results tend to suggest that the investigated indices could be separated into two categories. The first group, including the pd and the MAP-based p-value, presents similar properties to those of the frequentist p-value: they are sensitive only to the amount of evidence for the alternative hypothesis (i.e., when an effect is truly present). In other words, these indices are not able to reflect the amount of evidence in favor of the null hypothesis (Rouder et al., 2009; Rouder and Morey, 2012). A high value suggests that the effect exists, but a low value indicates uncertainty regarding its existence (but not certainty that it is non-existent). The second group, including ROPE and Bayes factors, seems sensitive to both presence and absence of effect, accumulating evidence as the sample size increases. However, ROPE seems particularly suited to provide evidence in favor of the null hypothesis. Consistent with this, combining Bayes factors with ROPE (BF vs. ROPE), as compared to Bayes factors against the point-null (BF vs. 0), leads to a higher sensitivity to null effects (Morey and Rouder, 2011; Rouder and Morey, 2012).
We also showed that besides sharing similar properties, the pd has a 1:1 correspondence with the frequentist p-value, being its Bayesian equivalent. Bayes factors, however, appear to have a severely non-linear relationship with the frequentist index, which is to be expected from their mathematical definition and their sensitivity when the null hypothesis is true. This in turn can lead to surprising conclusions. For instance, Bayes factors lower than 1, which are considered as providing evidence against the presence of an effect, can still correspond to a “significant” frequentist p-value (see Figures 3, 4). ROPE indices are less closely related to the p-value, as their relationship appears dependent on another factor: the sample size. This suggests that the ROPE encapsulates additional information about the strength of evidence.
What is the point of comparing Bayesian indices with the frequentist p-value, especially after having pointed out its many flaws? While this comparison may seem counter-intuitive (as Bayesian thinking is intrinsically different from the frequentist framework), we believe that this juxtaposition is interesting for didactic reasons. The frequentist p-value “speaks” to many and can thus be seen as a reference and a way to facilitate the shift toward the Bayesian framework. Thus, pragmatically documenting such bridges can only foster the understanding of the methodological issues that our field is facing, and in turn act against dogmatic adherence to a framework. This does not preclude, however, that a change in the general paradigm of significance seeking and “p-hacking” is necessary, and that Bayesian indices are fundamentally different from the frequentist p-value, rather than mere approximations or equivalents.
Critically, while the purpose of these indices was solely referred to as significance until now, we would like to emphasize the nuanced perspective of existence-significance testing as a dual framework for parameter description and interpretation. The idea supported here is that there is a conceptual and practical distinction, and possible dissociation to be made, between an effect’s existence and its significance. In this context, existence is simply defined as the consistency of an effect in one particular direction (i.e., positive or negative), without any assumptions or conclusions as to its size, importance, relevance or meaning. It is an objective feature of an estimate (tied to its uncertainty). On the other hand, significance would be here re-framed following its original literal definition, such as “being worthy of attention” or “importance.” An effect can be considered significant if its magnitude is higher than some given threshold. This aspect can be explored, to a certain extent, in an objective way with the concept of practical equivalence (Kruschke, 2014; Lakens, 2017; Lakens et al., 2018), which suggests the use of a range of values assimilated to the absence of an effect (ROPE). If the effect falls within this range, it is considered to be non-significant for practical reasons: the magnitude of the effect is likely to be too small to be of high importance in real-world scenarios or applications. Nevertheless, significance also retains a more subjective aspect, corresponding to its contextual meaningfulness and relevance. This, however, is usually dependent on the literature, priors, novelty, context or field, and thus cannot be objectively or neutrally assessed using a statistical index alone.
While indices of existence and significance can be numerically related (as shown in our results), the former is conceptually independent from the latter. For example, an effect for which the whole posterior distribution is concentrated within the [0.0001, 0.0002] range would be considered to be positive with a high level of certainty (and thus, existing in that direction), but also not significant (i.e., too small to be of any practical relevance). Acknowledging the distinction and complementary nature of these two aspects can in turn enrich the information and usefulness of the results reported in psychological science (for practical reasons, the implementation of this dual framework of existence-significance testing is made straightforward through the bayestestR open-source package for R; Makowski et al., 2019). In this context, the pd and the MAP-based p-value appear as indices of effect existence, mostly sensitive to the certainty related to the direction of the effect. ROPE-based indices and Bayes factors are indices of effect significance, related to the magnitude and the amount of evidence in favor of it (see also a similar discussion of statistical significance vs. effect size in the frequentist framework; e.g., Cohen, 1994).
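The example of a posterior concentrated within [0.0001, 0.0002] can be checked numerically. The sketch below draws from a uniform distribution over that range, which is an arbitrary choice made for illustration, and applies the definitions of the pd and of the ROPE (full) with the [−0.1, 0.1] region used for linear models.

```python
import random
import statistics


def probability_of_direction(draws):
    """Proportion of posterior draws that share the sign of the median."""
    median = statistics.median(draws)
    if median >= 0:
        same_sign = sum(1 for d in draws if d > 0)
    else:
        same_sign = sum(1 for d in draws if d < 0)
    return same_sign / len(draws)


def rope_full(draws, low, high):
    """Share of the whole posterior lying inside [low, high]."""
    return sum(1 for d in draws if low <= d <= high) / len(draws)


rng = random.Random(3)
tiny_effect = [rng.uniform(0.0001, 0.0002) for _ in range(4000)]

existence = probability_of_direction(tiny_effect)   # certainty of direction
negligibility = rope_full(tiny_effect, -0.1, 0.1)   # share inside the ROPE
```

Every draw is positive and every draw also lies inside the ROPE, so the effect is maximally "existent" and at the same time entirely negligible.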
The inherent subjectivity related to the assessment of significance is one of the practical limitations of ROPE-based indices (despite being, conceptually, an asset, allowing for contextual nuance in the interpretation), as they require an explicit definition of the non-significant range (the ROPE). Although default values have been reported in the literature (for instance, half of a “negligible” effect size reference value; Kruschke, 2014), it is critical to reproducibility and transparency that the researcher’s choice is explicitly stated (and, if possible, justified). Beyond being arbitrary, this range also has hard limits (for instance, contrary to a value of 0.0499, a value of 0.0501 would be considered non-negligible if the range ends at 0.05). This reinforces a categorical and clustered perspective of what is by essence a continuous space of possibilities. Importantly, as this range is fixed to the scale of the response (it is expressed in the unit of the response), ROPE indices are sensitive to changes in the scale of the predictors. For instance, negligible results may change into non-negligible results when predictors are scaled up (e.g., reaction times expressed in seconds instead of milliseconds), which an inattentive or malicious researcher could misleadingly present as “significant” (note that indices of existence, such as the pd, would not be affected by this). Finally, the ROPE definition is also dependent on the model type, and selecting a consistent or homogeneous range for all the families of models is not straightforward. This can make comparisons between model types difficult, and adds a burden when interpreting ROPE-based indices. In summary, while a well-defined ROPE can be a powerful tool to give a different and new perspective, it also requires extra caution on the part of authors and readers.
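The scale sensitivity described above can be made concrete with a small numerical sketch. Expressing a predictor in seconds instead of milliseconds multiplies its coefficient, and therefore every posterior draw, by 1,000, while the ROPE stays fixed on the response scale. The posterior values below are invented for illustration.

```python
import random
import statistics


def probability_of_direction(draws):
    """Proportion of posterior draws that share the sign of the median."""
    median = statistics.median(draws)
    if median >= 0:
        same_sign = sum(1 for d in draws if d > 0)
    else:
        same_sign = sum(1 for d in draws if d < 0)
    return same_sign / len(draws)


def rope_full(draws, low, high):
    """Share of the whole posterior lying inside [low, high]."""
    return sum(1 for d in draws if low <= d <= high) / len(draws)


rng = random.Random(11)
# Hypothetical posterior of a slope per millisecond of reaction time.
slope_per_ms = [rng.gauss(0.0005, 0.0002) for _ in range(4000)]
# Same model with reaction time in seconds: the slope is 1,000 times larger.
slope_per_s = [d * 1000 for d in slope_per_ms]

rope_ms = rope_full(slope_per_ms, -0.1, 0.1)
rope_s = rope_full(slope_per_s, -0.1, 0.1)
pd_ms = probability_of_direction(slope_per_ms)
pd_s = probability_of_direction(slope_per_s)
```

The percentage in ROPE drops from the whole posterior to a small fraction of it purely because of the change of unit, whereas the pd is identical in both parameterizations.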
As for the difference between ROPE (95%) and ROPE (full), we suggest reporting the latter (i.e., the percentage of the whole posterior distribution that falls within the ROPE instead of a given proportion of CI). This bypasses the use of another arbitrary range (95%) and appears to be more sensitive for delineating highly significant effects. Critically, rather than using the percentage in ROPE as a dichotomous, all-or-nothing decision criterion, such as suggested by the original equivalence test (Kruschke, 2014), we recommend using the percentage as a continuous index of significance (with explicitly specified cut-off points if categorization is needed, for instance 5% for significance and 95% for non-significance).
Our results underline the Bayes factor as an interesting index, able to provide evidence in favor of or against the presence of an effect. Moreover, its easy interpretation in terms of odds in favor of or against one hypothesis or another makes it a compelling index for communication. Nevertheless, one of the main critiques of Bayes factors is their sensitivity to priors (shown in our results here through their sensitivity to model types, as priors’ odds for logistic and linear models are different). Moreover, while the BF appears even better when compared with a ROPE than when compared with a point-null, it also carries all the limitations related to ROPE specification mentioned above. Thus, we recommend using Bayes factors (preferentially vs. a ROPE) if the user has explicitly specified (and has a rationale for) informative priors (often called “subjective” priors; Wagenmakers, 2007). In the end, there is a relative proximity between Bayes factors (vs. ROPE) and the percentage in ROPE (full), consistent with their mathematical relationship.
Being quite different from the Bayes factor and ROPE indices, the Probability of Direction (pd) is an index of effect existence representing the certainty with which an effect goes in a particular direction (i.e., is positive or negative). Beyond its simplicity of interpretation, understanding and computation, this index also presents other interesting properties. It is independent from the model, i.e., it is solely based on the posterior distributions and does not require any additional information from the data or the model. Contrary to ROPE-based indices, it is robust to the scale of both the response variable and the predictors. Nevertheless, this index also presents some limitations. Most importantly, the pd is not relevant for assessing the size or importance of an effect and is not able to provide information in favor of the null hypothesis. In other words, a high pd suggests the presence of an effect but a small pd does not give us any information about how plausible the null hypothesis is, suggesting that this index can only be used to possibly reject the null hypothesis (which is consistent with the interpretation of the frequentist p-value). In contrast, BFs (and to some extent the percentage in ROPE) increase or decrease as the evidence becomes stronger (more data points), in both directions.

Many of the strengths of the pd also apply to the MAP-based p-value. Although the MAP-based p-value may be somewhat more sensitive than the pd, it also presents an important limitation. Indeed, the MAP-based p-value is mathematically dependent on the density at 0 and at the mode. However, the density estimation of a continuous distribution is a statistical problem on its own and many different methods exist. It is possible that changing the density estimation method may impact the MAP-based p-value, with unknown results. The pd, however, has a linear relationship with the frequentist p-value, which is in our opinion an asset.
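A minimal Python sketch of the pd and of its linear link to the p-value is given below. It assumes the commonly stated approximation p (two-sided) ≈ 2 × (1 − pd); the simulated posterior draws and the helper names `prob_direction` and `pd_to_p` are illustrative assumptions, not part of the study's code.

```python
from statistics import NormalDist

def prob_direction(draws):
    """Share of posterior draws that have the same sign as the majority."""
    n = len(draws)
    share_positive = sum(d > 0 for d in draws) / n
    share_negative = sum(d < 0 for d in draws) / n
    return max(share_positive, share_negative)

def pd_to_p(pd, two_sided=True):
    """Approximate frequentist p-value implied by a pd (assumed linear mapping)."""
    return 2 * (1 - pd) if two_sided else 1 - pd

# Hypothetical posterior draws for a negative effect.
draws = NormalDist(mu=-5.0, sigma=2.5).samples(20_000, seed=1)

pd = prob_direction(draws)
p_approx = pd_to_p(pd)

print(f"pd = {pd:.2%}, approximate two-sided p = {p_approx:.3f}")
```

Under this mapping, pd values of 95, 97.5 and 99.5% correspond to two-sided p-values of 0.10, 0.05 and 0.01, respectively. Note that only the sign of each draw enters the computation, which is why the pd carries no information about the size of the effect.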
After all the criticism regarding the frequentist p-value, it may appear contradictory to suggest the usage of its Bayesian empirical equivalent. The subtler perspective that we support is that the p-value is not an intrinsically bad, or wrong, index. Instead, it is its misuse, misunderstanding and misinterpretation that fuel the decay of the situation into the crisis. Interestingly, the proximity between the pd and the p-value follows the original definition of the latter (Fisher, 1925) as an index of effect existence rather than significance (as in “worth of interest”; Cohen, 1994). Addressing this confusion, the Bayesian equivalent has an intuitive meaning and interpretation, contributing to making more obvious the fact that all thresholds and heuristics are arbitrary. In summary, the mathematical and interpretative transparency of the pd, and its conceptualization as an index of effect existence, offer valuable insight into the characterization of Bayesian results, and its practical proximity with the frequentist p-value makes it a perfect metric to ease the transition of psychological research into the adoption of the Bayesian framework.

Our study has some limitations. First, our simulations were based on simple linear and logistic regression models. Although these models are widespread, the behavior of the presented indices for other model families or types, such as count models or mixed effects models, still needs to be explored. Furthermore, we only tested continuous predictors. The indices may behave differently when varying the type of predictor (binary, ordinal) as well. Finally, we limited our simulations to small sample sizes, for the reason that data is particularly noisy in small samples, and experiments in psychology often include only a limited number of subjects. However, it is possible that the indices converge (or diverge) for larger samples. Importantly, before being able to draw a definitive conclusion about the qualities of these indices, further studies should investigate the robustness of these indices to sampling characteristics (e.g., sampling algorithm, number of iterations, chains, warm-up) and the impact of prior specification (Kass and Raftery, 1995; Vanpaemel, 2010; Kruschke, 2011), all of which are important parameters of Bayesian statistics.
Reporting Guidelines

How can the current observations be used to improve statistical good practices in psychological science? Based on the present comparison, we can start outlining the following guidelines. As existence and significance are complementary perspectives, we suggest using at minimum one index of each category. As an objective index of effect existence, the pd should be reported, for its simplicity of interpretation, its robustness and its numeric proximity to the well-known frequentist p-value. As an index of significance, either the BF (vs. ROPE) or the ROPE (full) should be reported, for their ability to discriminate between presence and absence of effect (De Santis, 2007) and the information they provide related to evidence of the size of the effect. Selection between the BF (vs. ROPE) and the ROPE (full) should depend on the informativeness of the priors used: when uninformative priors are used, and there is little prior knowledge regarding the expected size of the effect, the ROPE (full) should be reported, as it reflects only the posterior distribution and is not sensitive to the width of a wide range of prior scales (Rouder et al., 2018). On the other hand, in cases where informed priors are used, reflecting prior knowledge regarding the expected size of the effect, the BF (vs. ROPE) should be used.
Defining appropriate heuristics to aid in interpretation is beyond the scope of this paper, as it would require testing them on more natural datasets. Nevertheless, if we take the frequentist framework and the existing literature as a reference point, it seems that 95, 97, and 99% may be relevant reference points (i.e., easy-to-remember values) for the pd. A concise, standardized, reference template sentence to describe the parameter of a model including an index of point-estimate, uncertainty, existence, significance and effect size (Cohen, 1988) could be, in the case of pd and BF:

“There is moderate evidence (BF_ROPE = 3.44) [BF (vs. ROPE)] in favor of the presence of effect of X, which has a probability of 98.14% [pd] of being negative (Median = −5.04, 89% CI [−8.31, 0.12]), and can be considered to be small (Std. Median = −0.29) [standardized coefficient].”

And if the user decides to use the percentage in ROPE instead of the BF:

“The effect of X has a probability of 98.14% [pd] of being negative (Median = −5.04, 89% CI [−8.31, 0.12]), and can be considered to be small (Std. Median = −0.29) [standardized coefficient] and significant (0.82% in ROPE) [ROPE (full)].”

Data Availability Statement

The full R code used for data generation, data processing, figures creation, and manuscript compiling is available on GitHub at https://github.com/easystats/easystats/tree/master/publications/makowski_2019_bayesian.

Author Contributions

DM conceived and coordinated the study. DM, MB-S, and DL participated in the study design, statistical analysis, data interpretation, and manuscript drafting. DL supervised the manuscript drafting. SC performed a critical review of the manuscript, assisted with the manuscript drafting, and provided funding for publication. All authors read and approved the final manuscript.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Acknowledgments

This study was made possible by the development of the bayestestR package, itself part of the easystats ecosystem (Lüdecke et al., 2019), an open-source and collaborative project created to facilitate the usage of R. Thus, there is substantial evidence in favor of the fact that we thank the masters of easystats and all the other padawan following the way of the Bayes.

Footnotes

^ https://github.com/easystats/easystats/tree/master/publications/makowski_2019_bayesian/data

References

Amrhein, V., Greenland, S., and McShane, B. (2019). Scientists rise up against statistical significance. Nature 567, 305–307. doi: 10.1038/d41586-019-00857-9

Anderson, D. R., Burnham, K. P., and Thompson, W. L. (2000). Null hypothesis testing: problems, prevalence, and an alternative. J. Wildlife Manag. 64, 912–923.

Andrews, M., and Baguley, T. (2013). Prior approval: the growth of bayesian methods in psychology. Br. J. Math. Statist. Psychol. 66, 1–7. doi: 10.1111/bmsp.12004

Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E.-J., Berk, R., et al. (2018). Redefine statistical significance. Nat. Hum. Behav. 2, 6–10.

Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., et al. (2017). Stan: a probabilistic programming language. J. Statist. Softw. 76, 1–32. doi: 10.18637/jss.v076.i01

Chambers, C. D., Feredoes, E., Muthukumaraswamy, S. D., and Etchells, P. (2014). Instead of ‘playing the game’ it is time to change the rules: registered reports at aims neuroscience and beyond. AIMS Neurosci. 1, 4–17. doi: 10.3934/neuroscience.2014.1.4

Cohen, J. (1988). Statistical Power Analysis for the Social Sciences. New York, NY: Academic Publishers.
Cohen, J. (1994). The earth is round (p < .05).

Received: 18 September 2019

Edited by: Pietro Cipresso

Reviewed by: Richard S. John, Jose D. Perezgonzalez

This article is part of the Research Topic: Statistical Guidelines: New Developments in Statistical Methods and Psychometric Tools