
Cross Validation and Its Types in Machine Learning

Cross validation is a technique in which we train our model on a subset of the data set and then evaluate it on the complementary subset.

The Cross Validation includes the following steps:
1. Reserve some portion of the sample dataset.
2. Train your model using the rest of the data set.
3. Use the reserved portion of the data set to test your model.
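The steps above can be sketched as a simple hold-out split; `holdout_split` is a hypothetical helper written here for illustration, using only the standard library:

```python
import random

def holdout_split(data, test_fraction=0.2, seed=0):
    """Reserve a portion of the data set for testing; train on the rest."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)                      # shuffle so the held-out portion is random
    n_test = int(len(data) * test_fraction)   # size of the reserved (test) portion
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [data[i] for i in train_idx], [data[i] for i in test_idx]

data = list(range(10))
train, test = holdout_split(data, test_fraction=0.2)
print(len(train), len(test))  # 8 2
```

In practice a library routine such as scikit-learn's `train_test_split` does the same job.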

There are various cross-validation methods; they are as follows:
1. Validation:
In this method we perform a train-test split using 50% of the data set for each part. The 50% that we reserve for testing may contain important information that the model never sees; leaving it all out can make the model highly biased, and therefore a poor model.

2. Leave One Out Cross Validation (LOOCV):
In this method we train on the whole data set except a single data point, test on that one point, and iterate over every data point in turn. The advantage is that every data point is used for training, so we can be sure all of the data is covered by the model and it is less biased. The disadvantage is that the procedure repeats once per data point, so it takes a lot of time to execute on large data sets.
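A minimal sketch of the LOOCV splitting loop, using a hypothetical generator `loocv_splits` written here for illustration:

```python
def loocv_splits(data):
    """Yield (train, test) pairs: each iteration leaves out one data point."""
    for i in range(len(data)):
        test = [data[i]]                 # the single held-out point
        train = data[:i] + data[i + 1:]  # everything else is used for training
        yield train, test

data = [10, 20, 30, 40]
splits = list(loocv_splits(data))
print(len(splits))  # 4 — one split per data point
```

Note that the number of train/evaluate rounds equals the number of data points, which is exactly why LOOCV is slow on large data sets.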

3. K-fold Cross Validation:
This method is a solution to the problems we faced in the two methods above. We split the data set into k subsets, known as folds. We then train on k-1 of the subsets and leave one subset out for the evaluation of the trained model, repeating until each fold has served as the test set once.
In this process, record the error you see on each round of predictions; the average of your k recorded errors is called the cross-validation error and will serve as your performance metric for the model.
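The whole procedure, including the averaging of the k fold errors, can be sketched as follows. The function names, the toy mean-predictor "model", and the squared-error scorer are all assumptions made for illustration, not a particular library's API:

```python
def kfold_cv_error(xs, ys, k, fit, error):
    """Split into k folds; train on k-1 folds, score on the held-out fold,
    and average the k fold errors into a single cross-validation error."""
    n = len(xs)
    # Distribute samples as evenly as possible across the k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    errors, start = [], 0
    for size in fold_sizes:
        stop = start + size
        x_test, y_test = xs[start:stop], ys[start:stop]
        x_train = xs[:start] + xs[stop:]
        y_train = ys[:start] + ys[stop:]
        model = fit(x_train, y_train)                 # train on the k-1 folds
        errors.append(error(model, x_test, y_test))   # evaluate on the held-out fold
        start = stop
    return sum(errors) / k                            # the cross-validation error

# Toy example: a "model" that predicts the training mean, scored by squared error.
fit = lambda xs, ys: sum(ys) / len(ys)
error = lambda m, xs, ys: sum((y - m) ** 2 for y in ys) / len(ys)
ys = [1.0, 2.0, 3.0, 4.0, 5.0]
cv_err = kfold_cv_error(ys, ys, k=5, fit=fit, error=error)
```

With k equal to the number of data points, this reduces to LOOCV; with k = 2 and equal halves, it approaches the simple validation method.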

[Figure: K-fold cross validation visual]

Note: It is suggested that the value of k should be 10, as a lower value of k would take us toward the simple validation process and a larger value would take us toward the LOOCV method.

4. Stratified K-fold Cross Validation:
Stratification is the process of rearranging the data so as to ensure that each fold is a good representative of the whole, i.e. each fold preserves roughly the same class proportions as the full data set.
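A minimal sketch of stratified fold assignment, assuming a hypothetical helper `stratified_folds`: group the sample indices by class label, then deal each group out round-robin so every fold keeps the class ratios:

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign sample indices to k folds while preserving class proportions."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)           # group indices by class label
    folds = [[] for _ in range(k)]
    for indices in by_label.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)        # deal each class round-robin
    return folds

labels = ['a'] * 6 + ['b'] * 4
folds = stratified_folds(labels, k=2)
# each fold gets 3 'a' samples and 2 'b' samples, matching the 60/40 ratio
```

Library implementations such as scikit-learn's `StratifiedKFold` follow the same idea.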

