As technology has become more accessible, language barriers have slowly started to break down. People have started to mix languages to make communication more intuitive and easy. This is evident in multilingual countries like India and Turkey. I spent the early months of 2019 trying to understand and learn how to analyse such multilingual texts. I am planning to write a series of posts explaining my findings and learnings.
One of the first things we are taught in Programming 101 is to write well-structured and well-commented code. And as any newbie would, we ignore this lesson and focus on achieving the end result. With that lesson in mind, I wrote an R (the R language!) script to be run on files amounting to 30 GB! This was my first professional experience after graduation and I did not want to fuck up. So, I structured the code, wrote all the comments and ran it on all the files. And what happened next?
I am currently working on my B.Tech Project, which involves predicting flight prices. To build the models, I needed historical flight prices. Unfortunately, such data is not available, so I had to build a scraper to extract flight prices daily and save them in a CSV file.
I consider myself a newbie in the data analysis world. What I have understood so far is that data preparation is the most important step in solving any problem. Each predictive model requires data of a certain type, in a certain format. For instance, tree-based boosting models like XGBoost require all the feature variables to be numeric. While solving the San Francisco Crime Classification problem on Kaggle, I stumbled upon different ways to handle categorical variables. One of the methods to convert a categorical input variable into numeric ones is one-hot encoding (also called dummy coding), which creates one 0/1 indicator column per category.
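To make the idea concrete, here is a minimal sketch of one-hot encoding in plain Python. It is only an illustration, not the code I used for the Kaggle problem: the `one_hot_encode` helper and the sample district names are made up for this example, and in practice a library function such as `pandas.get_dummies` would do the same job.

```python
def one_hot_encode(values):
    """Return (categories, rows): one 0/1 indicator column per distinct category."""
    # Sort the distinct categories so the column order is deterministic.
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one "hot" column per row
        rows.append(row)
    return categories, rows


# Hypothetical sample data, not taken from the Kaggle dataset.
districts = ["NORTHERN", "MISSION", "NORTHERN", "CENTRAL"]
columns, encoded = one_hot_encode(districts)
print(columns)
for row in encoded:
    print(row)
```

Each input value becomes a row with a single 1 in the column of its category, so a variable with k distinct categories turns into k numeric columns that a model like XGBoost can consume.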