Many would say that machine learning is the new oil, and I would most likely agree. I was first exposed to the world of ML in my undergrad days, but it always remained a black box for me. One of my major motivations for joining a Master's program was to peek inside this black box. I signed up for a course called Computational Data Analysis (CSE6740), offered by Prof. Negar Kivyansh. In every lecture I attended, I had a eureka moment where I understood the why behind a particular method. This blog (or blog series) is my attempt to follow the Feynman Learning Technique. So let's see if that works!
As technology has become more accessible, language barriers have slowly started to break down. People have started to mix languages to make communication easier and more intuitive. This is especially evident in multilingual countries like India and Turkey. I spent the early months of 2019 learning how to analyse such multilingual texts. I am planning to write a series of posts explaining my findings and what I learned.
One of the first things we are taught in Programming 101 is to write well-structured, commented code. And as any newbie would, we ignore this lesson and focus on achieving the end result. Carrying this lesson forward, I wrote an R (the R language!) script to run on files amounting to 30 GB! This was my first professional experience after graduation and I did not want to fuck up. So I structured the code, wrote all the comments, and ran it on all the files. And what happened next?
I am currently working on my B.Tech project, which involves predicting flight prices. To build the models, I needed historical flight prices. Unfortunately, such data is not available, so I had to build a scraper that extracts flight prices daily and saves them in a CSV file.
I consider myself a newbie in the world of data analysis. What I have understood so far is that data preparation is the most important step in solving any problem. Each predictive model requires data of a certain type, arranged in a certain way. For instance, tree-based boosting models like xgboost require all the feature variables to be numeric. While solving the San Francisco Crime Classification problem on Kaggle, I stumbled upon different ways to handle categorical variables. One of the methods for converting a categorical input variable into numeric form is one-hot encoding, also known as dummy coding.
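To make the idea concrete, here is a minimal sketch of one-hot encoding in plain Python. The function name `one_hot_encode` and the sample district values are my own illustrative assumptions, not taken from the actual Kaggle data; in practice you would more likely reach for a library routine, but the underlying idea is the same: each distinct category gets its own 0/1 column.

```python
def one_hot_encode(values):
    """Return (categories, rows): one 0/1 indicator column per distinct category."""
    # Sort the distinct categories so the column order is deterministic.
    categories = sorted(set(values))
    column_of = {category: i for i, category in enumerate(categories)}

    rows = []
    for value in values:
        row = [0] * len(categories)
        row[column_of[value]] = 1  # mark the column matching this value
        rows.append(row)
    return categories, rows


# Hypothetical sample values, used only for illustration.
districts = ["NORTHERN", "MISSION", "NORTHERN", "CENTRAL"]
categories, encoded = one_hot_encode(districts)

for district, row in zip(districts, encoded):
    print(district, row)
```

Each row contains exactly one 1, so a single categorical column with k distinct values becomes k numeric columns that a model expecting numeric features can consume.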