
Cross Validation and Its Types in Machine Learning

Cross validation is a technique in which we train our model on a subset of the data and then evaluate it on the complementary subset of the data set.

Cross validation involves the following steps:
1. Reserve some portion of the sample data set.
2. Train your model using the rest of the data set.
3. Use the reserved portion of the data set to test your model.

There are various methods of cross validation; they are as follows:

1. Validation:
In this method we perform a 50-50 train-test split of the data set. The drawback is that the 50% of the data we reserve for testing may contain important information the model never sees; leaving all of that out can make the model highly biased and perform poorly.
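The 50-50 hold-out described above can be sketched as follows. This is a minimal example assuming scikit-learn is installed; the iris data set and the k-nearest-neighbors classifier are just illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Reserve 50% of the data for testing; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42)

model = KNeighborsClassifier().fit(X_train, y_train)
print(f"hold-out accuracy: {model.score(X_test, y_test):.3f}")
```

With only a single split, the reported accuracy depends heavily on which half of the data happened to land in the test set.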

2. Leave One Out Cross Validation (LOOCV):
In this method we train on the whole data set except a single data point, test on that one point, and repeat the process for every data point. The advantage is that the model considers all the data points, so all of the features are covered and the result is less biased. The disadvantage is cost: the model must be trained as many times as there are data points in the data set, so it takes a lot of time to execute.
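A minimal LOOCV sketch, again assuming scikit-learn; note how the number of training runs equals the number of samples, which is exactly the cost described above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
loo = LeaveOneOut()

# One train/test round per data point: 150 fits for the 150-sample iris set.
print("number of fits:", loo.get_n_splits(X))

scores = cross_val_score(KNeighborsClassifier(), X, y, cv=loo)
print(f"LOOCV accuracy: {scores.mean():.3f}")
```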

3. K-fold Cross Validation:
This method is a solution to the problems we faced in the above two methods. We split the data set into k subsets, known as folds, then train on k-1 of the folds and hold out the remaining fold for evaluating the trained model, repeating so that each fold serves once as the test set.
Record the error you see on each round of predictions; the average of your k recorded errors is called the cross validation error and serves as the performance metric for the model.
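The procedure above can be sketched with scikit-learn (assumed available); averaging the per-fold errors gives the cross validation error described in the text.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# k = 5 folds: each fold is the test set exactly once.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=kf)

# Average error over the k folds = the cross validation error.
cv_error = 1 - scores.mean()
print(f"cross validation error: {cv_error:.3f}")
```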

[Figure: K-fold cross validation]

Note: A value of k = 10 is commonly suggested, as a lower value of k takes us toward the simple validation process, while a larger value takes us toward the LOOCV method.

4. Stratified K-fold Cross Validation:
Stratification is the process of rearranging the data so as to ensure that each fold is a good representative of the whole; in classification, this means each fold preserves the class proportions of the full data set.
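A small sketch of stratification, assuming scikit-learn. Iris has 50 samples of each of its 3 classes, so with 5 stratified folds every test fold should contain 10 samples per class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)  # 150 samples, 50 per class

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # Every test fold keeps the 1:1:1 class balance of the full data set.
    print(f"fold {fold}: class counts = {np.bincount(y[test_idx])}")
```

A plain `KFold` on sorted labels like iris could easily produce folds missing a class entirely, which is exactly what stratification prevents.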

