Split it up - Part 2
I recently heard in the news about the ongoing CBSE 10th board exams, and suddenly I recalled my own 10th board exam days. PHEWWW!!! How dreadful those examination days were. I remember that as the exams approached, our school teachers would increase their teaching speed, take extra lectures, and sometimes even eat up our sports time, trying to complete the syllabus as soon as possible. After the syllabus was finished and our teachers had taken many revision sessions, we were finally ready for the final battle of our school life, the final exams, and fortunately we all scored well.
Hey wait!! I forgot to mention the pre-board exams. After our revision sessions, when the enemy named final exam was just outside our palace and ready to attack, our teachers used to conduct an exam less dreadful than the finals, commonly known as the pre-board exam. I always wondered why the heck they conducted so many exams. I asked my mother, who is indeed a very good mathematics teacher, what these pre-board exams are actually meant for. She patiently told me that it is necessary to examine a student's learning before she/he actually takes the final exams, and to check whether the student is able to apply the concepts she/he has learnt in class and generalize them to other questions rather than just mugging up. If students score well in the pre-board exam, they are ready for the final exam; if not, the teacher needs to reteach some concepts, give more practice problems, and introspect on where the students are facing problems and what can be done. All these problems can be handled and corrected before the final exams, because you cannot afford to make mistakes there. Hence the pre-board exam is a necessary check of a student's performance on questions similar to those in the final exams.
Ohh!!! Is it? Then can we extend this idea to machine learning as well? After all, there too training comes first and testing follows, just like our learning phase and exam phase. The answer to this question is a big yesss.
As discussed in the last blog, "Split it up - Part 1", a dataset is split into Train and Test data, where the Train data is used to train the model and the Test data is used to evaluate its performance. But heyy!! Does this guarantee that the model will give the same accuracy on future unseen data as it gave on the Test data? Absolutely not. One cannot guarantee that the model will work as well as it did on the Test data. The model also needs to go through its own pre-board exam, to validate how it is actually working and what changes are required to enhance its performance. In machine learning this exam is known as Cross Validation. A dataset is usually split into 3 parts, namely Train data, Cross Validation (CV) data and Test data.
Cross validation is used to tune the hyperparameters of the model so that it generalizes and performs well on the test data, which will be completely new to the model.
Put simply, after cross validation our model is ready to give its final exam, that is, to face the test data (completely new questions). The result of this final exam on the test data gives the actual performance of the model. Since training/learning is the most crucial part, most of the datapoints are reserved for the Train data (more than half of the dataset), while the Test and CV data can have roughly similar numbers of datapoints.
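To make the pre-board/final exam idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the data is synthetic, the model is a tiny hand-written k-nearest-neighbour classifier, and `knn_predict`, `accuracy` and `candidate_ks` are hypothetical names. The hyperparameter `k` is chosen on the CV data, and the test data is used only once at the very end.

```python
import random

def knn_predict(train, x, k):
    """Predict the label of x by majority vote among its k nearest training points."""
    neighbours = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if votes * 2 > k else 0

def accuracy(train, data, k):
    """Fraction of points in data whose label the k-NN model predicts correctly."""
    correct = sum(knn_predict(train, x, k) == label for x, label in data)
    return correct / len(data)

# Synthetic, made-up data: label is 1 when x > 0.5, with about 10% of labels flipped as noise.
rng = random.Random(0)
points = []
for _ in range(200):
    x = rng.random()
    label = int(x > 0.5)
    if rng.random() < 0.1:
        label = 1 - label
    points.append((x, label))

# 60:20:20 split into train, CV and test data (the points are already in random order).
train, cv, test = points[:120], points[120:160], points[160:]

# "Pre-board exam": pick the hyperparameter k that scores best on the CV data.
candidate_ks = [1, 3, 5, 7, 9]  # odd values, so a vote can never tie
cv_scores = {k: accuracy(train, cv, k) for k in candidate_ks}
best_k = max(cv_scores, key=cv_scores.get)

# "Final exam": the test data is touched only once, with the chosen k.
test_accuracy = accuracy(train, test, best_k)
print("CV accuracy per k:", cv_scores)
print("chosen k:", best_k, "test accuracy:", round(test_accuracy, 3))
```

Notice that the test data never influences the choice of `k`. If it did, the final score would no longer be an honest estimate of how the model does on new questions.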
Takeaway points:
- The dataset should be split into Train, Cross Validation and Test data.
- Cross validation is necessary to make changes to the model before feeding it the test data, which is unseen data.
- Train data must hold more datapoints than the Test and Cross Validation data.
- Data can be split in a Train:CV:Test ratio such as 60:20:20, 70:15:15 or 64:16:20, whichever gives good results, as decided by the user.
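The ratios above can be tried out with a small helper. This is only a sketch using the Python standard library: the function name `train_cv_test_split`, its `ratios` and `seed` parameters, and the stand-in dataset of 100 numbers are all assumptions made for illustration, not part of any library.

```python
import random

def train_cv_test_split(data, ratios=(60, 20, 20), seed=42):
    """Shuffle data and split it into train, CV and test lists by the given ratio."""
    if len(ratios) != 3 or any(r <= 0 for r in ratios):
        raise ValueError("ratios must be three positive numbers")
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)  # shuffle a copy, leave the input untouched
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_cv = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    cv = shuffled[n_train:n_train + n_cv]
    test = shuffled[n_train + n_cv:]  # the test set takes whatever remains
    return train, cv, test

dataset = list(range(100))  # stand-in for 100 datapoints
for ratios in [(60, 20, 20), (70, 15, 15), (64, 16, 20)]:
    train, cv, test = train_cv_test_split(dataset, ratios)
    print(ratios, "->", len(train), len(cv), len(test))
```

Shuffling before slicing matters when the original data is sorted in some way (for example by date or by class), otherwise one of the three parts could end up with only one kind of datapoint.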
You made it simple as that!!!
What if I use my entire data for cross validation? Will that be a wrong practice?
Yes, it is absolutely wrong. You first need to train the model on the train data; then, and only then, can you use the cross validation data for hyperparameter tuning, and finally evaluate the performance of the model on the Test data.