
The Art of Winning Kaggle Competitions

 What is Kaggle?
Learning data science can be overwhelming. Finding a community in which to share code, data, and ideas can seem just as overwhelming, even far-fetched. But there is one place where all of these come together: Kaggle.

Looked at more comprehensively, Kaggle is an online community for data scientists that offers machine learning competitions, datasets, notebooks, access to training accelerators, and education. Anthony Goldbloom (CEO) and Ben Hamner (CTO) founded Kaggle in 2010, and Google acquired the company in 2017.
Kaggle competitions have improved the state of the machine learning art in several areas. One is mapping dark matter; another is HIV/AIDS research. Looking at the winners of Kaggle competitions, you’ll see lots of XGBoost models, some Random Forest models, and a few deep neural networks.
The winning recipe for Kaggle competitions involves the following steps:
Step one is to read the competition guidelines thoroughly. Many Kagglers who struggle on the platform do not fully understand the competition itself: the overview, description, timeline, evaluation and eligibility criteria, and the prize. Ignoring these details will cost you in the long run. You need to know the deadline for your last submission; small details such as a competition's timeline can be deal-breakers. Studying the guidelines carefully will also uncover other commonly missed details, such as the required submission format and any guide to reproducing benchmarks. Do not start working on a Kaggle competition until you are clear about all the instructions. Take your time before jumping in.

The second, and crucial, step is to understand the performance measure. It is the yardstick your submission will be measured against, and you need to know it inside out. According to most experienced Kagglers, an approach optimized for the specific measure makes it substantially easier to boost your score. For instance, Mean Squared Error (MSE) and Mean Absolute Error (MAE) are closely related, but MSE penalizes large errors much more heavily than MAE does, so not knowing the difference will hurt your final score.
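To make the MSE/MAE difference concrete, here is a minimal sketch in plain Python. The target values are made up for illustration, and the two "models" are just constant guesses (the mean and the median of the targets); none of this comes from a real competition.

```python
from statistics import mean, median


def mse(y_true, y_pred):
    """Mean Squared Error: average of squared differences."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean Absolute Error: average of absolute differences."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


# Toy targets with one large outlier (illustrative values only).
y_true = [10, 12, 11, 13, 100]

# Two constant "models": always predict the mean, or always predict the median.
mean_guess = [mean(y_true)] * len(y_true)
median_guess = [median(y_true)] * len(y_true)

for name, guess in [("mean", mean_guess), ("median", median_guess)]:
    print(f"{name:>6} guess: MSE={mse(y_true, guess):.2f}  MAE={mae(y_true, guess):.2f}")
```

On this toy data the mean guess scores better under MSE while the median guess scores better under MAE, so the same pair of submissions would be ranked in opposite order depending on the measure.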
Step three is to understand the data in detail. Start with exploratory data analysis to find missing and null values and hidden patterns in the dataset. The more you know about the data, the better the models you can build on top of it. Specializing to the dataset works in your favor as long as you do not over-fit. Look for weaknesses in the data that you can exploit: can you extract secondary fields from the given primary values, or cast the given values to another type or format that is more machine-learning friendly?
Step four is to know what you want (the objective) before worrying about how to get it. Most novices on Kaggle worry excessively about which language to use (R or Python). It is wiser to begin by learning the data and identifying the patterns you intend to model. Knowing the domain and understanding the data go a long way toward winning a competition.
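As a rough illustration of these checks, the sketch below counts missing values, derives secondary fields from a date string, and casts a text price to a number. The rows and the column names (sale_date, price, color) are hypothetical and chosen only for the example.

```python
from datetime import datetime

# Hypothetical raw rows, as they might arrive from a CSV (all values are text).
rows = [
    {"sale_date": "2015-03-14", "price": "12500", "color": "red"},
    {"sale_date": "2015-07-02", "price": "", "color": "silver"},
    {"sale_date": "2016-01-20", "price": "8900", "color": None},
]


def count_missing(rows):
    """Count empty or None values per column."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            if value in (None, ""):
                counts[column] = counts.get(column, 0) + 1
    return counts


def add_derived_fields(row):
    """Return a copy of the row with secondary fields and a numeric price."""
    enriched = dict(row)
    sale_date = datetime.strptime(row["sale_date"], "%Y-%m-%d")
    enriched["sale_month"] = sale_date.month
    enriched["sale_weekday"] = sale_date.weekday()  # Monday is 0
    # Typecast the text price to a float; keep missing prices as None.
    enriched["price"] = float(row["price"]) if row["price"] else None
    return enriched


missing = count_missing(rows)
enriched_rows = [add_derived_fields(row) for row in rows]
print(missing)
print(enriched_rows[0])
```

In a real competition you would more likely do this with a dataframe library, but the idea is the same: find the gaps first, then turn raw fields into ones a model can use.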
Step five, an often neglected one, is to set up your own local validation environment. Doing so lets you move faster and produce dependable results instead of relying solely on leaderboard scores. You can skip this step if you are short on time, or if the dataset is small enough to be handled comfortably in Kaggle's hosted Docker environment. With your own environment, you can evaluate a submission as many times as you like and are not bound by the five-submissions-a-day limit on Kaggle competitions. Once you feel confident about the results, you can submit to the live competition. This gives you an immense edge over peers who have not set up a local environment. By reducing the number of submissions you make, you also substantially reduce the risk of over-fitting the leaderboard, which will save you from poor results at the final evaluation stage.
Step six is to read the forums. Forums and discussions are your friends. Monitor the forum consistently as you work on the competition; there is no way around it. Subscribe to the forum so that you receive notifications about the competition you are participating in. The forum helps you keep abreast of what your competitors are up to, which has been made possible by the recent Kaggle trend of sharing code while the competition is still running. The host also frequently shares insights and directions about the competition on the forum. Even if you do not win, you can keep trying and learn from the post-competition summaries on the forum to see where you went wrong or what your peers did to surpass you. This is a great way to learn from the best and improve consistently.
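A local validation loop can be as small as the sketch below, which runs k-fold cross-validation in plain Python. The "model" is a stand-in that always predicts the mean of its training targets, the data is made up, and the function names are hypothetical; in practice you would plug in your real model and the competition's own metric.

```python
import random
from statistics import mean


def k_fold_indices(n_rows, n_folds, seed=0):
    """Shuffle row indices and split them into n_folds disjoint folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_validate(targets, n_folds=5, seed=0):
    """Return one MAE score per fold for a mean-predicting stand-in model."""
    scores = []
    for held_out in k_fold_indices(len(targets), n_folds, seed):
        held_out_set = set(held_out)
        train = [t for i, t in enumerate(targets) if i not in held_out_set]
        prediction = mean(train)  # stand-in for fitting a real model
        scores.append(mean(abs(targets[i] - prediction) for i in held_out))
    return scores


# Toy targets (illustrative values only).
targets = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 5.5, 4.5, 6.5, 9.0]
fold_scores = cross_validate(targets, n_folds=5)
print(f"local CV MAE: {mean(fold_scores):.3f}")
```

Because every row is held out exactly once, the averaged score is an estimate you can recompute as often as you like without spending a leaderboard submission.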

Kaggle Profile category, Earn Badges to show your skills

Step seven is to research exhaustively. There is a good chance that the competition you are participating in is hosted by people who have dedicated their lives to finding a viable solution. Hosts of such competitions often have code, benchmarks, official company blogs, and extensive published papers or patents that come in handy. Even if you do not win in your first several attempts, you will learn, hone your skills, and become a better data scientist.
Step eight is to stay with the basics and apply them rigorously. While playing around with obscure methods is fun for data scientists, it is the basics that will get you far in a competition. The common algorithms you may be tempted to ignore have great implementations. It is wise to tune the main parameters manually when experimenting with methods. Experienced Kagglers admit that manual tuning is one of their winning habits.
Step nine is the mother of all steps: it is time to ensemble your models. This simply means combining the models that you have developed independently. In most high-profile competitions, different teams come together and combine their models to boost their scores. Since winning solutions almost always rely on more than one model, it is wise to merge different independent models even when you are competing solo.
Step ten is to commit to a single project or a select few. If you try to compete in every single competition, you will lose focus. It is better to focus on one or two and prove your mettle. The rank progression all the way to grandmaster will then come naturally. Remember that time and patience are two prime factors, along with your data science expertise, in moving forward.
Step eleven, the final step, is to pick the right approach. In the history of Kaggle, two winning approaches keep emerging across competitions: feature engineering and neural networks (deep learning).
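The simplest form of ensembling is a (weighted) average of each model's predictions. The sketch below assumes two already-trained models whose predictions are available as lists; the model names, the numbers, and the blend helper are all hypothetical.

```python
from statistics import mean


def blend(predictions_by_model, weights=None):
    """Weighted average of per-row predictions across models."""
    names = list(predictions_by_model)
    if weights is None:
        weights = {name: 1.0 for name in names}  # plain average by default
    total_weight = sum(weights[name] for name in names)
    n_rows = len(predictions_by_model[names[0]])
    return [
        sum(weights[name] * predictions_by_model[name][i] for name in names) / total_weight
        for i in range(n_rows)
    ]


def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


# Made-up validation targets and predictions from two hypothetical models.
y_true = [10.0, 20.0, 30.0]
predictions = {
    "model_a": [12.0, 18.0, 33.0],
    "model_b": [9.0, 23.0, 28.0],
}

blended = blend(predictions)
for name, preds in predictions.items():
    print(f"{name}: MAE={mae(y_true, preds):.3f}")
print(f"blend:   MAE={mae(y_true, blended):.3f}")
```

On this toy data the two models err in opposite directions, so their average beats both. That is not guaranteed in general; blending helps most when the models are individually reasonable and make different mistakes, which is worth checking on your local validation set.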
Feature engineering is the best approach if you understand the data. The first step is to take the provided data and plot histograms to help you explore it. You will then typically spend a large amount of time generating features and testing which ones correlate with the target variable. For example, consider the Kaggle competition titled Don’t Get Kicked, hosted by the used-car dealer Carvana. Participants were required to predict which cars bought at a second-hand (pre-owned) auction would turn out to be bad buys. Many participants put forward their algorithms and models. Ultimately, it turned out that one of the most predictive features was color. The participants grouped the cars into two categories: standard colors and unusual colors. An unusually colored car turned out to be less likely to be a bad buy. Before participants arrived at this conclusion, there were numerous hypotheses, models, and kernels that did not perform as expected.
The most popular winning algorithm used to be Random Forest. However, this has changed over the last six months: a newer algorithm, XGBoost, is taking over practically every competition for structured data.
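The color grouping described above can be sketched as a small engineered feature. Everything in this example is an assumption made for illustration: the list of "standard" colors, the rows, and the 0/1 target values are invented and do not come from the competition data.

```python
from collections import defaultdict

# Assumed list of "standard" colors; the real grouping is not given in the text.
STANDARD_COLORS = {"black", "white", "silver", "grey", "gray", "blue", "red"}


def color_group(color):
    """Collapse a raw color string into 'standard' or 'unusual'."""
    return "standard" if color.strip().lower() in STANDARD_COLORS else "unusual"


def target_rate_by_group(rows):
    """Average of the 0/1 target within each color group."""
    totals = defaultdict(lambda: [0, 0])  # group -> [sum of targets, row count]
    for color, target in rows:
        group = color_group(color)
        totals[group][0] += target
        totals[group][1] += 1
    return {group: positives / count for group, (positives, count) in totals.items()}


# Invented (color, target) pairs, purely to exercise the functions.
rows = [
    ("Silver", 1), ("black", 0), ("white", 1),
    ("Orange", 0), ("purple", 0), ("BLUE", 1),
]

print(target_rate_by_group(rows))
```

Comparing the target rate across the two groups is a quick first test of whether a new feature carries any signal before you spend time feeding it to a model.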
The second winning approach on Kaggle is neural networks and deep learning. If you are dealing with a dataset that consists of speech or image-rich content, deep learning is the way to go. The Kagglers who win competitions on this kind of unstructured data rarely spend much time on feature engineering; they consider it more productive and effective to focus on constructing neural networks. For example, let’s look at a Kaggle problem that calls for the deep learning approach. In the diabetic retinopathy detection competition, sponsored by the California Healthcare Foundation, participants were given images of the eye and asked to diagnose which images indicated the presence of diabetic retinopathy. This devastating illness is one of the leading causes of blindness in the United States. The winning algorithm agreed with an ophthalmologist about as often as one professional ophthalmologist agrees with another.
So in a Kaggle competition, should you build deep learning networks or opt for feature engineering? Choosing the best approach for a particular competition is fairly straightforward. If you are dealing with a problem that consists of a lot of structured data, your best bet is the feature engineering approach. If, on the other hand, you are dealing with unstructured data such as a large collection of images, then the recommended approach is building and training neural networks. Overall, it is often a mix of the two that takes the prize.

Believe in yourself and take the time to learn as much as you can. Avoid dismissing any piece of information. For all data scientists who want to master machine learning algorithms, Kaggle is the best platform to boost your experience and hone your skills. 

