Learning Curves Tutorial: What Are Learning Curves?


Machine learning models are employed to learn patterns in data. The best models can generalize well when faced with instances that were not part of the initial training data. During the research phase, several experiments are conducted to find the solution that best solves the business's problem and reduces the error made by the model. An error may be defined as the difference between the model's prediction for an observation and the true value of that observation.

There are two major causes of error in machine learning models:

- Bias describes a model that makes simplified assumptions so the target function is easier to approximate; a model may learn that every 5'9" male in the world wears a size medium top - this is clearly biased.
- Variance describes the variability in the model's predictions: how much the predictions of the model change when we change the data used to train it.

To attain a more accurate solution, we seek to reduce the amount of bias and variance present in our model. This is not a straightforward task. Bias and variance are at odds with each other - reducing one will increase the other - because of a concept known as the bias-variance tradeoff.
In this article you'll learn:

- How to detect whether a model suffers from high bias or high variance
- How to diagnose a model suffering from either symptom
- How to build a good-fit model

Before we get into detecting the error symptoms, let's first go into more depth on the bias-variance tradeoff.

A look at the bias-variance tradeoff

All supervised learning algorithms strive to achieve the same objective: estimating the mapping function (f_hat) for a target variable (y) given some input data (X). We refer to the function that a machine learning model aims to approximate as the target function.

Changing the input data used to approximate the target function will likely result in a different target function, which may impact the outputs predicted by the model. How much our target function varies as the training data is changed is known as the variance. We don't want our model to have high variance because, while the algorithm may perform flawlessly during training, it fails to generalize to unseen instances.

[Image: an overfit decision boundary (green) vs. a smoother line of best fit (black). Source: Wikipedia]

In the above image, the approximated target function is the green line and the line of best fit is in black. Notice how well the model learns the training data with the green line: it does its best to ensure all red and blue observations are separated. If we trained this model on new observations, it would learn an entirely new target function and attempt to enact the same behavior.

Consider a scenario in which we use a linear method like linear regression to approximate the target function. The first thing to note about linear regression is that it assumes a linear relationship between the input data and the target we are trying to predict. Events in the real world are a lot more complex. At the cost of some flexibility, this simple assumption makes the target function much quicker to learn and easier to understand. We refer to this paradigm as bias.

[Image: a linear fit missing the structure of the data. Source: Wikipedia]

In the image above, the red line represents the learned target function. Many of the observations fall far away from the values predicted by the model.
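The two failure modes above can be made concrete with a small experiment. This is a minimal sketch, not part of the original tutorial: the synthetic data and the choice of polynomial degrees are illustrative assumptions, with a degree-1 fit standing in for high bias and a degree-15 fit for high variance.

```python
# Sketch: a degree-1 polynomial underfits (high bias), while a
# degree-15 polynomial chases the noise in the sample (high variance).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)  # noisy sine wave

train_mse = {}
for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    # training error only: the flexible model will look far better here
    train_mse[degree] = mean_squared_error(y, model.predict(X))
    print(f"degree={degree:2d}  training MSE={train_mse[degree]:.4f}")
```

The flexible model fits the training sample far more closely, but, as the images above suggest, that closeness is exactly what fails to transfer to new data.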
We can reduce the bias in a model by making it more flexible, but this introduces variance. On the flip side, we can reduce the variance of a model by simplifying it, but this introduces bias. There's no way to escape this relationship. The best alternative is to choose a model and configure it such that it strikes a balance in the tradeoff between bias and variance.

[Image: bias, variance, and total error as a function of model complexity. Source: Wikipedia]

Due to unknown factors influencing the target function, there will always be some error present in the model, known as the irreducible error. This may be observed in the image above by noting the amount of error that remains under the lowest point of the Total Error plot. To build the ideal model, we must find a balance between bias and variance such that the total error is minimized. This is illustrated with the dotted line labeled Optimum Model Complexity.

Let's expand on bias and variance using learning curves.

The anatomy of a learning curve

Learning curves are plots used to show a model's performance as the training set size increases. They can also be used to show a model's performance over a defined period of time. We typically use them to diagnose algorithms that learn incrementally from data. The technique works by evaluating a model on the training and validation datasets, then plotting the measured performance.

For example, imagine we've modeled the relationship between some inputs and outputs using a machine learning algorithm. We start off by training the model on one instance and validating against one hundred instances. What do you think will happen? If you said the model will learn the training data perfectly, then you're correct - there would be no errors.

It's not hard to model the relationship of one input to one output; all you have to do is remember that relationship. The difficult part is making accurate predictions when presented with new, unseen instances. Since our model learned the training data so well, it would have a terrible time trying to generalize to data it hasn't seen before. The model will perform poorly on our validation data as a result, meaning there would be a large difference between the performance of our model on the training data and the validation data. We call this difference the generalization error.
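The one-instance thought experiment above is easy to verify in code. This is a hedged sketch on invented synthetic data (the linear relationship, noise level, and sizes are illustrative assumptions), assuming scikit-learn and NumPy are available:

```python
# Train a decision tree on 1 instance vs. 100 instances and compare
# the gap between training error and validation error.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, (200, 1))
y = 3 * X.ravel() + rng.normal(0, 1, 200)
X_val, y_val = X[100:], y[100:]  # held-out validation set

def rmse_gap(n_train):
    """RMSE on the training subset and on the validation set."""
    model = DecisionTreeRegressor(random_state=0).fit(X[:n_train], y[:n_train])
    train_rmse = np.sqrt(mean_squared_error(y[:n_train], model.predict(X[:n_train])))
    val_rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
    return train_rmse, val_rmse

train_1, val_1 = rmse_gap(1)        # perfect memorization, poor generalization
train_100, val_100 = rmse_gap(100)  # the gap shrinks with more data
```

With one training instance the tree memorizes it exactly (training RMSE of zero) while validation error is huge; with one hundred instances the generalization gap narrows considerably.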
If our algorithm is going to stand a chance of making better predictions on the validation dataset, we need to add more data. Introducing new instances to the training data will inevitably change the target function of our model. How the model performs as we grow the training dataset can be monitored and plotted to reveal the evolution of the training and validation error scores.

This means the graph will display two different results:

- Training curve: the curve calculated from the training data; used to inform us of how well the model is learning.
- Validation curve: the curve calculated from the validation data; used to inform us of how well the model is generalizing to unseen instances.

These curves show us how well the model is performing as the data grows, hence the name learning curves.

Note: The same process may be used to inform us of how our model learns over time. Instead of monitoring how the model is doing as the data gets larger, we monitor how well the model learns over time. For example, you may decide to learn a new language: your grasp of that language could be evaluated and assigned a numerical score to show how you've fared over the course of 52 weeks.

You've now learned the anatomy of a learning curve; let's put it into practice with a real-world dataset to give you a visual understanding.

Use case: Predicting real estate valuations

We will be using the market historical dataset of real estate valuation, collected from Sindian District, New Taipei City, Taiwan.

Our task is to predict the real estate valuation given the following features:

- X1 = the transaction date (for example, 2013.250 = March 2013, 2013.500 = June 2013, etc.)
- X2 = the house age (unit: years)
- X3 = the distance to the nearest MRT station (unit: meters)
- X4 = the number of convenience stores within walking distance (integer)
- X5 = the geographic coordinate, latitude (unit: degrees)
- X6 = the geographic coordinate, longitude (unit: degrees)

The target variable is defined as:

- Y = house price per unit area (10,000 New Taiwan Dollars per ping, where ping is a local unit: 1 ping = 3.3 square meters)

The target we are predicting is continuous, so the problem is going to require regression techniques.
Let's start by peeking at the data:

```python
import pandas as pd

data = pd.read_excel("/content/gdrive/MyDrive/real_estate_valuation_data.xlsx")
print(data.info())
data.head()
```

```
RangeIndex: 414 entries, 0 to 413
Data columns (total 8 columns):
 #   Column                                  Non-Null Count  Dtype
---  ------                                  --------------  -----
 0   No                                      414 non-null    int64
 1   X1 transaction date                     414 non-null    float64
 2   X2 house age                            414 non-null    float64
 3   X3 distance to the nearest MRT station  414 non-null    float64
 4   X4 number of convenience stores         414 non-null    int64
 5   X5 latitude                             414 non-null    float64
 6   X6 longitude                            414 non-null    float64
 7   Y house price of unit area              414 non-null    float64
dtypes: float64(6), int64(2)
memory usage: 26.0 KB
None
```

```
   No  X1 transaction date  X2 house age  X3 distance to the nearest MRT station  X4 number of convenience stores  X5 latitude  X6 longitude  Y house price of unit area
0   1          2012.916667          32.0                                84.87882                               10     24.98298     121.54024                        37.9
1   2          2012.916667          19.5                               306.59470                                9     24.98034     121.53951                        42.2
2   3          2013.583333          13.3                               561.98450                                5     24.98746     121.54391                        47.3
3   4          2013.500000          13.3                               561.98450                                5     24.98746     121.54391                        54.8
4   5          2012.833333           5.0                               390.56840                                5     24.97937     121.54245                        43.1
```

There's an extra feature called No which was not referenced in the documentation of the data. It's possible it refers to an index, but for simplicity's sake we are going to remove it. Also, the feature names do not match what was given in the documentation, so we are going to clean this up.
```python
# rename the columns to the short names used in the documentation
renamed_columns = [col.split()[0] for col in data.columns]
renamed_columns_map = {data.columns[i]: renamed_columns[i] for i in range(len(data.columns))}
data.rename(renamed_columns_map, axis=1, inplace=True)

# remove the No column
data.drop("No", axis=1, inplace=True)
print(data.head())

# separate the features and the target
features, target = data.columns[:-1], data.columns[-1]
X = data[features]
y = data[target]
```

This is how the final dataset looks before we split the features and target labels:

```
            X1    X2         X3  X4        X5         X6     Y
0  2012.916667  32.0   84.87882  10  24.98298  121.54024  37.9
1  2012.916667  19.5  306.59470   9  24.98034  121.53951  42.2
2  2013.583333  13.3  561.98450   5  24.98746  121.54391  47.3
3  2013.500000  13.3  561.98450   5  24.98746  121.54391  54.8
4  2012.833333   5.0  390.56840   5  24.97937  121.54245  43.1
```

To demonstrate bias, variance, and good-fit solutions, we are going to build three models: a decision tree regressor, a support vector machine for regression, and a random forest regressor. After building each model, we will plot its learning curves and share some diagnostic techniques.

Diagnosing learning curves

Learning curves are interpreted by assessing their shape. Once the shape and dynamics have been interpreted, we can use them to diagnose any problems in a machine learning model's behavior.

The learning_curve() function in scikit-learn makes it easy for us to monitor training and validation scores, which is what is required to plot a learning curve. The parameters we pass to the learning_curve() function are as follows:

- estimator: the model used to approximate the target function
- X: the input data
- y: the target
- cv: the cross-validation splitting strategy
- scoring: the metric used to evaluate the performance of the model
- train_sizes: the absolute numbers of training examples that will be used to generate the learning curve; the values we are using are arbitrary

Model 1: Decision tree regressor

A model with high variance is said to be overfit. It learns the training data and its random noise extremely well, resulting in a model that performs well on the training data but fails to generalize to unseen instances. We observe such behavior when the algorithm being used is too flexible for the problem being solved, or when the model is trained for too long.
For example, the decision tree regressor is a non-linear machine learning algorithm. Non-linear algorithms typically have low bias and high variance, which suggests that changes to the dataset will cause large variations in the target function.

Let's demonstrate high variance with our decision tree regressor:

```python
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt

# overfitting: an unconstrained decision tree
decision_tree = DecisionTreeRegressor()

train_sizes, train_scores, test_scores = learning_curve(
    estimator=decision_tree,
    X=X,
    y=y,
    cv=5,
    scoring="neg_root_mean_squared_error",
    train_sizes=[1, 75, 165, 270, 331]
)

# the neg_* scorer returns negated errors, so flip the sign back
train_mean = -train_scores.mean(axis=1)
test_mean = -test_scores.mean(axis=1)

plt.subplots(figsize=(10, 8))
plt.plot(train_sizes, train_mean, label="train")
plt.plot(train_sizes, test_mean, label="validation")
plt.title("Learning Curve")
plt.xlabel("Training Set Size")
plt.ylabel("RMSE")
plt.legend(loc="best")
plt.show()
```

The model makes very few mistakes when it's required to predict instances it's seen during training, but performs terribly on new instances it hasn't been exposed to. You can observe this behavior by noticing how large the generalization error is between the training curve and the validation curve. One way to improve this behavior is to add more instances to our training dataset. Another is to add regularization to the model (i.e., restricting the tree from growing to its full depth), which introduces some bias.

Model 2: Support Vector Machine

A model with high bias is said to be underfit. It makes simplistic assumptions about the training data, which makes it difficult to learn the underlying patterns. This results in a model that has high error on both the training and validation datasets. We can observe such behavior when the model being used is too simple for the problem being solved, or when the model is not trained for long enough.
For example, the support vector machine for regression makes stronger assumptions about the form of the target function, which typically gives it higher bias and lower variance. To introduce even more bias into our model, we've increased the regularization by setting the C parameter to a small value.

Let's demonstrate high bias with our support vector machine:

```python
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler

# underfitting: a heavily regularized SVR (small C)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

svm = SVR(C=0.25)

train_sizes, train_scores, test_scores = learning_curve(
    estimator=svm,
    X=X_scaled,
    y=y,
    cv=5,
    scoring="neg_root_mean_squared_error",
    train_sizes=[1, 75, 150, 270, 331]
)

train_mean = -train_scores.mean(axis=1)
test_mean = -test_scores.mean(axis=1)

plt.subplots(figsize=(10, 8))
plt.plot(train_sizes, train_mean, label="train")
plt.plot(train_sizes, test_mean, label="validation")
plt.title("Learning Curve")
plt.xlabel("Training Set Size")
plt.ylabel("RMSE")
plt.legend(loc="best")
plt.show()
```

The generalization gap between the training and validation curves becomes extremely small as the training dataset size increases. This indicates that adding more examples to our model is not going to improve its performance. A solution to this problem may be to create more features, or to make the model more flexible to reduce the number of assumptions being made.

Model 3: Random Forest Regressor

A good-fit model exists in the gray area between an underfit and an overfit model. The model may not be as good on the training data as it is in the overfit instance, but it will make far fewer errors when faced with unseen instances. This behavior can be observed when the training error rises, but only to the point of stability, as the validation error decreases to the point of stability.

To demonstrate this we are going to use a random forest, which is an ensemble of decision trees. This means the model is also non-linear, but bias is added to the model by creating several diverse trees and combining their predictions. We've also added more regularization by setting max_depth, which controls the maximum depth of each tree, to a value of three.
Let's see how this looks in code:

```python
from sklearn.ensemble import RandomForestRegressor

# a better fit: an ensemble of shallow trees
random_forest = RandomForestRegressor(max_depth=3)

train_sizes, train_scores, test_scores = learning_curve(
    estimator=random_forest,
    X=X,
    y=y,
    cv=5,
    scoring="neg_root_mean_squared_error",
    train_sizes=[1, 75, 150, 270, 331]
)

train_mean = -train_scores.mean(axis=1)
test_mean = -test_scores.mean(axis=1)

plt.subplots(figsize=(10, 8))
plt.plot(train_sizes, train_mean, label="train")
plt.plot(train_sizes, test_mean, label="validation")
plt.title("Learning Curve")
plt.xlabel("Training Set Size")
plt.ylabel("RMSE")
plt.legend(loc="best")
plt.show()
```

Now you can see we've reduced the error on the validation data. It came at the cost of weakened performance on the training data, but overall it's a better model.

The generalization error is much smaller, with a low number of errors being made. Also, both curves are stable beyond a training set size of about 250, which implies that adding more instances may not improve this model much further.

In summary, a model's behavior can be observed using learning curves. The ideal scenario when building machine learning models is to keep the error as low as possible. Two factors that result in high error are bias and variance, and being able to strike a balance between the two will result in a better-performing model.
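As a closing note, the three model examples above repeat the same plotting logic. In practice it's convenient to factor it into a helper; this is a sketch (not from the original tutorial, and the function name plot_learning_curve is our own), wrapping scikit-learn's learning_curve the same way the examples do:

```python
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, X, y, train_sizes, cv=5,
                        scoring="neg_root_mean_squared_error", ylabel="RMSE"):
    """Plot mean train/validation error against training-set size."""
    sizes, train_scores, test_scores = learning_curve(
        estimator=estimator, X=X, y=y, cv=cv,
        scoring=scoring, train_sizes=train_sizes,
    )
    # neg_* scorers return negated errors, so flip the sign back
    train_mean = -train_scores.mean(axis=1)
    test_mean = -test_scores.mean(axis=1)

    fig, ax = plt.subplots(figsize=(10, 8))
    ax.plot(sizes, train_mean, label="train")
    ax.plot(sizes, test_mean, label="validation")
    ax.set_title("Learning Curve")
    ax.set_xlabel("Training Set Size")
    ax.set_ylabel(ylabel)
    ax.legend(loc="best")
    return train_mean, test_mean
```

Any of the three models could then be diagnosed in one line, e.g. `plot_learning_curve(RandomForestRegressor(max_depth=3), X, y, [1, 75, 150, 270, 331])` followed by `plt.show()`.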


