Why you should be plotting learning curves in your next ...


Published in Towards Data Science

Why you should be plotting learning curves in your next machine learning project

Spoiler: they will help you understand whether your model suffers from high variance or high bias, and I'll explain what you can do about it.

[Image by author]

The bias-variance dilemma is a widely known problem in machine learning. It matters so much that if you don't get the trade-off right, no amount of hours or money thrown at your model will make up for it.

In the illustration above, you can get a feel for what bias and variance are, and for how they can affect your model's performance.

The first chart shows a model (blue line) that is underfitting the training data (red crosses). This model is biased because it "assumes" the relationship between the independent variable and the dependent variable is linear when it is not. Plotting a scatter plot of the data is always helpful, as it will reveal the true relationship between the variables: a quadratic function would fit the data "just right" (second chart).

The third chart is a clear example of overfitting. The high complexity of the model allows it to fit the data very closely, in fact too closely. Although this model might perform really well on the training data, its performance on the test data (i.e. data it has never seen before) will be much worse. In other words, this model suffers from high variance, which means it won't be good at making predictions on data it has never seen before.

Because the main point of building a machine learning model is to make accurate predictions on new data, you should focus on making sure it generalises well to unseen observations rather than on maximising its performance on your training set.

What can you do if your model performance is not so good?

There are several things you can do:

- Get more data
- Try a smaller set of features (reduce model complexity)
- Try adding/creating more features (increase model complexity)
- Try decreasing the regularisation parameter λ (increase model complexity)
- Try increasing the regularisation parameter λ (decrease model complexity)

The question now is: "How do I know which of those things to try first?" The answer is: "Well, it depends." It basically depends on whether your model is suffering from high bias or from high variance.

At this point you might be wondering: "OK, so my model is not performing as expected, but how do I know whether it has a bias problem or a variance problem?" Learning curves!

Learning curves

Learning curves show the relationship between training set size and your chosen evaluation metric (e.g. RMSE, accuracy, etc.) on your training and validation sets. They can be an extremely useful tool when diagnosing your model's performance, as they can tell you whether your model is suffering from bias or variance.

[Image by author]

If your learning curves look like this, your model is suffering from high bias. Both the training and validation (or cross-validation) errors are high, and they don't seem to improve with more training examples. The fact that your model performs similarly badly on both the training and validation sets suggests that it is underfitting the data and therefore has high bias.

[Image by author]

On the other hand, if your learning curves look like this, your model might have a high-variance problem. In this chart, the validation error is much higher than the training error, which suggests that you are overfitting the data.

What can you do if your model performance is not so good? (pt. II)

Cool, so you have now identified what's going on with your model and are in a great position to decide what to do next.

If your model has high bias, you should:

- Try adding/creating more features
- Try decreasing the regularisation parameter λ

Both of these increase your model's complexity and therefore help solve your underfitting problem.

If your model has high variance, you should:

- Get more data
- Try a smaller set of features
- Try increasing the regularisation parameter λ

When your model is overfitting the training data, you can either reduce its complexity or get more data. As you can see above, the learning curves chart of a high-variance model suggests that, with enough data, the validation and training errors will end up closer to each other. An intuitive explanation is that as you give your model more data, the gap between your model's complexity and the underlying complexity in your data gets smaller and smaller.

Python implementation and real-life example

I wrote this function to plot the learning curves of a model. Feel free to use it in your own work!

I thought I would end this post by showing you a real-life example of a learning curves plot, which was created with the above code:

[Image by author]

From the plot, it is very clear that my random forest model is suffering from high bias: the training and validation curves are very close together, and the accuracy is not great, at around the 70% mark.

Knowing this helped me decide what my next step should be in order to improve my model's performance. Because I had a high-bias problem, I knew that getting more training data wasn't going to help by itself, and that increasing the complexity of my model by engineering new and more relevant features was probably going to deliver the greatest impact.

Conclusion

Next time you have a badly performing model in front of you, remember to plot the learning curves, analyse them, and work out whether you have a bias or a variance problem. Knowing this will help you decide what your next steps should be, and it could save you countless headaches and hours wasted on work that is not going to help your model.
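To make the high-bias pattern concrete, here is a minimal, dependency-free sketch. It is not the plotting function from the post: the helper names (`fit_line`, `rmse`, `learning_curve_points`) and the toy data are assumptions made purely for illustration. It fits a straight line to data generated from a quadratic, as in the underfitting chart described earlier, and reports training and validation RMSE for growing training set sizes.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    return intercept, slope


def rmse(model, xs, ys):
    """Root mean squared error of a fitted line on the given points."""
    intercept, slope = model
    squared = [(intercept + slope * x - y) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / len(squared))


def learning_curve_points(train, validation, sizes):
    """For each size, fit on the first `size` training rows and return
    a list of (size, training RMSE, validation RMSE) tuples."""
    val_x, val_y = zip(*validation)
    points = []
    for size in sizes:
        sub_x, sub_y = zip(*train[:size])
        model = fit_line(sub_x, sub_y)
        points.append((size, rmse(model, sub_x, sub_y), rmse(model, val_x, val_y)))
    return points


# Hypothetical toy data (an assumption, not the article's dataset):
# y = x^2 plus Gaussian noise with standard deviation 0.3, x in [-3, 3].
rng = random.Random(0)
data = []
for _ in range(400):
    x = rng.uniform(-3, 3)
    data.append((x, x * x + rng.gauss(0, 0.3)))
train, validation = data[:300], data[300:]

curve = learning_curve_points(train, validation, sizes=[10, 50, 100, 200, 300])
for size, train_err, val_err in curve:
    print(f"n={size:3d}  train RMSE={train_err:.2f}  validation RMSE={val_err:.2f}")
```

If the two errors stay close to each other and well above the noise level (0.3 in this toy setup) as the training set grows, that is the high-bias signature: more data alone will not help a straight line fit a curve. The returned tuples can be handed to any plotting library to draw the two curves.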
Adrià Luz. Tales about data, statistics, machine learning, visualisation, and much more. By Adrià Luz (@adrialuz) and Sara Gaspar (@sargaspar).


