Why you should be plotting learning curves in your next machine learning project
Published in Towards Data Science

Spoiler: they will help you understand whether your model suffers from high variance or high bias, and I'll explain what you can do about it

[Image by author]

The bias-variance dilemma is a widely known problem in the field of machine learning. Its importance is such that if you don't get the trade-off right, it won't matter how many hours or how much money you throw at your model.

In the illustration above, you can get a feel for what bias and variance are, as well as how they can affect your model performance.

The first chart shows a model (blue line) that is underfitting the training data (red crosses). This model is biased, because it "assumes" the relationship between the independent variable and the dependent variable is linear when it is not. Plotting a scatter plot of the data is always helpful, as it will reveal the true relationship between the variables: a quadratic function would fit the data "just right" (second chart).

The third chart is a clear example of overfitting. The high complexity of the model allows it to fit the data very closely, in fact too closely. Although this model might perform really well on the training data, its performance on the test data (i.e. data it has never seen before) will be much worse. In other words, this model suffers from high variance, which means that it won't be good at making predictions on data it has never seen before.

Because the main point of building a machine learning model is to be able to accurately make predictions on new data, you should focus on making sure it will generalise well to unseen observations, rather than on maximising its performance on your training set.

What can you do if your model performance is not so good?

There are several things you can do:

- Get more data
- Try a smaller set of features (reduce model complexity)
- Try adding/creating more features (increase model complexity)
- Try decreasing the regularisation parameter λ (increase model complexity)
- Try increasing the regularisation parameter λ (decrease model complexity)

The question now is: "how do I know which of those things to try first?". The answer is: "well, it depends." And it basically depends on whether your model is suffering from high bias or from high variance.

The issue here, you might be wondering, is: "ok, so my model is not performing as expected… but how do I know if it has a bias problem or a variance problem?!". Learning curves!

Learning curves

Learning curves show the relationship between training set size and your chosen evaluation metric (e.g. RMSE, accuracy, etc.) on your training and validation sets. They can be an extremely useful tool when diagnosing your model performance, as they can tell you whether your model is suffering from bias or variance.

[Image by author]

If your learning curves look like this, it means your model is suffering from high bias. Both the training and validation (or cross-validation) errors are high, and they don't seem to improve with more training examples. The fact that your model is performing similarly badly on both the training and validation sets suggests that the model is underfitting the data and therefore has high bias.

[Image by author]

On the other hand, if your learning curves look like this, your model might have a high-variance problem. In this chart, the validation error is much higher than the training error, which suggests that you are overfitting the data.

What can you do if your model performance is not so good? (pt. II)

Cool, so you have now identified what's going on with your model and are in a great position to decide what to do next.

If your model has high bias, you should:

- Try adding/creating more features
- Try decreasing the regularisation parameter λ

These two things will increase your model complexity and will therefore help solve your underfitting problem.

If your model has high variance, you should:

- Get more data
- Try a smaller set of features
- Try increasing the regularisation parameter λ

When your model is overfitting the training data, you can either try reducing its complexity or getting more data. As you can see above, the learning curves chart of a high-variance model suggests that, with enough data, the validation and training errors will end up closer to each other. An intuitive explanation for this is that if you give your model more data, the gap between your model's complexity and the underlying complexity in your data will get smaller and smaller.

Python implementation and real life example

I wrote this function to plot the learning curves of a model. Feel free to use it in your own work!

I thought I would end this post by showing you a real-life example of a learning curves plot, which was created with the above code:

[Image by author]

From the plot, it is very clear that my random forest model is suffering from high bias, as the training and validation curves are very close together and the accuracy is not great, at around the 70% mark.

Knowing this helped me when it came to deciding what my next step was going to be in order to improve my model performance. Because I had a high-bias problem, I knew getting more training data wasn't going to help by itself, and that increasing the complexity of my model by engineering new and more relevant features was probably going to deliver the greatest impact.

Conclusion

Next time you have a bad-performing model in front of you, remember to plot the learning curves, analyse them, and work out whether you have a bias or a variance problem. Knowing this will help you decide what your next steps should be, and it could save you countless headaches and hours wasted on work that is not going to help your model.
Adrià Luz
Tales about data, statistics, machine learning, visualisation, and much more. By Adrià Luz (@adrialuz) and Sara Gaspar (@sargaspar).
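To make the diagnosis described in the post concrete, here is a minimal sketch of how the numbers behind a learning-curve plot can be computed, using only the Python standard library. Everything in it is an assumption made for illustration and not the author's code: the synthetic quadratic data, the straight-line model, the helper names (`fit_line`, `rmse`, `learning_curve_points`, `diagnose`), and the `target_err` and `gap_ratio` thresholds, which are arbitrary. For real estimators, scikit-learn's `learning_curve` in `sklearn.model_selection` returns comparable training and validation scores.

```python
import random


def fit_line(xs, ys):
    """Closed-form least squares for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:  # all x identical: fall back to predicting the mean
        return 0.0, mean_y
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return a, mean_y - a * mean_x


def rmse(model, xs, ys):
    """Root mean squared error of the line (a, b) on the given points."""
    a, b = model
    return (sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5


def learning_curve_points(train, val, sizes):
    """Return (size, train_rmse, val_rmse) for each training-set size."""
    val_xs, val_ys = zip(*val)
    points = []
    for m in sizes:
        xs, ys = zip(*train[:m])  # train on the first m examples only
        model = fit_line(xs, ys)
        points.append((m, rmse(model, xs, ys), rmse(model, val_xs, val_ys)))
    return points


def diagnose(train_err, val_err, target_err, gap_ratio=1.5):
    """Rough heuristic with hypothetical thresholds, not a formal test."""
    if val_err > gap_ratio * train_err:
        return "high variance"  # validation error much worse than training
    if train_err > target_err:
        return "high bias"      # both errors high and close together
    return "ok"


def make_data(n, rng):
    """Hypothetical data: a quadratic relationship plus Gaussian noise."""
    data = []
    for _ in range(n):
        x = rng.uniform(-3, 3)
        data.append((x, x * x + rng.gauss(0, 0.3)))
    return data


rng = random.Random(0)  # fixed seed so the run is repeatable
train, val = make_data(200, rng), make_data(200, rng)
points = learning_curve_points(train, val, sizes=[2, 5, 10, 25, 50, 100, 200])

for m, train_err, val_err in points:
    print(f"{m:>4} examples  train RMSE={train_err:.2f}  validation RMSE={val_err:.2f}")

_, final_train, final_val = points[-1]
print(diagnose(final_train, final_val, target_err=1.0))
```

Because a straight line cannot follow a quadratic relationship, both errors should level off at a similar, high value as the training set grows, which is the high-bias pattern described in the post. Passing the (size, training error, validation error) triples to any plotting library gives the familiar chart.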