One of the first things we are taught in Programming 101 is to write well-structured, well-commented code. And as any newbie would, we ignore this lesson and focus on getting to the end result. With that lesson in mind, I wrote an R script (yes, the R language!) to run on files amounting to about 30 GB. This was my first professional experience after graduation and I did not want to fuck up. So I structured the code, wrote all the comments and ran it on all the files. And what happened next?
I am currently working on my B.Tech project, which involves predicting flight prices. To build the models, I needed historical flight prices. Unfortunately, such data is not available, so I had to build a scraper that extracts flight prices daily and saves them to a CSV file.
I consider myself a newbie in the data analysis world. What I have understood so far is that data preparation is the most important step in solving any problem. Each predictive model requires a certain type of data, in a certain format. For instance, tree-based boosting models like xgboost require all the feature variables to be numeric. While solving the San Francisco Crime Classification problem on Kaggle, I stumbled upon different ways to handle categorical variables. One of the methods for converting a categorical input variable into numeric ones is one-hot encoding (also called dummy coding).
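
To make the idea concrete, here is a minimal sketch of one-hot encoding in plain Python. It is only an illustration, not the code from my Kaggle solution: the function name `one_hot_encode` and the sample district values are made up for this example. Each distinct category becomes its own 0/1 column, and every row gets a 1 in exactly one of them.

```python
def one_hot_encode(values):
    """Return (categories, rows): the sorted distinct categories and,
    for each input value, a list of 0/1 flags with a single 1."""
    categories = sorted(set(values))
    # Map each category to the position of its indicator column.
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows


# Hypothetical sample values for a categorical "district" column.
districts = ["NORTHERN", "MISSION", "NORTHERN", "CENTRAL"]
categories, encoded = one_hot_encode(districts)

print(categories)
for row in encoded:
    print(row)
```

Strictly speaking, dummy coding usually drops one of these columns (k - 1 columns for k categories), since the last one can be inferred from the others. For tree-based models keeping all k columns is generally harmless.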