Data Collection

Data is one of the most important aspects of any machine learning project. Luckily for this project, Amazon has made the review dataset available in public S3 buckets.

  1. Data Source
  2. Integrating into the pipeline

Data Source

The Amazon Customer Review Dataset is available for free in the Registry of Open Data on AWS. It spans two decades of reviews, from 1995 to 2015, across 50 different product categories, and the complete dataset contains more than 130 million reviews. For testing our model code and UI, we downloaded the 'Electronics' review file and ran it locally.
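
For local testing, a single category file can be pulled from the public bucket with boto3. The sketch below is a minimal example; the bucket name and object key follow the dataset's documented layout and should be treated as assumptions to verify against the Registry listing.

    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # Bucket and key are assumptions based on the dataset's published layout
    # (s3://amazon-reviews-pds/tsv/...); verify against the Registry of Open
    # Data listing before running.
    BUCKET = "amazon-reviews-pds"
    KEY = "tsv/amazon_reviews_us_Electronics_v1_00.tsv.gz"

    # Unsigned (anonymous) requests are sufficient for a public bucket.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    s3.download_file(BUCKET, KEY, "amazon_reviews_us_Electronics_v1_00.tsv.gz")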

A comprehensive list of links for accessing the dataset can be found on its page in the Registry of Open Data on AWS.

Integrating into the pipeline

As you would imagine, text-heavy datasets can grow very large on disk. It also becomes important to clean the text so that the NLP model gets the best possible version of the dataset.


Cleaning the text

The review text we extracted from the dataset contained a lot of noise, so we wrote a cleaning script. It removed unnecessary characters, hyperlinks, symbols, excess spaces, and other patterns of text that our algorithms could not process. From the cleaned dataset, we extracted the review text for our analysis. The cleaning script is available in the project's GitHub repository.
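
To give a concrete picture, here is a minimal sketch of the kind of cleanup the script performs. The regular expressions are illustrative assumptions, not the project's actual patterns.

    import re

    def clean_review(text: str) -> str:
        """Illustrative cleanup of a single review body (assumed patterns)."""
        if not isinstance(text, str):
            return ""
        text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # strip hyperlinks
        text = re.sub(r"<[^>]+>", " ", text)                 # strip HTML tags
        text = re.sub(r"[^A-Za-z0-9.,!?' ]+", " ", text)     # strip stray symbols
        text = re.sub(r"\s+", " ", text).strip()             # collapse excess spaces
        return text

    print(clean_review("Great <br />product!!  See http://example.com ★★★★★"))
    # -> "Great product!! See"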


Tackling the large size

Amazon hosts the files in two formats - compressed TSV and Parquet. Since we had experience working with TSV files, we decided to move ahead with that format. However, when testing on a sample file, we ran into memory errors because the NLP libraries were already occupying a significant chunk of the RAM. To tackle this problem, we made a series of infrastructure and framework decisions -

  • We created an EC2 instance on AWS with the default t3.micro configuration and mounted an additional EBS volume to increase the storage capacity.
  • Pandas provides an efficient way of reading compressed files in small chunks; see the pandas read_csv documentation for details. We specifically used the chunksize and compression parameters (a short sketch follows this list).
  • Since we were running the model on AWS, the dataset in the S3 buckets was directly accessible, so there was no need to download raw files. This saved us unnecessary I/O operations.
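
Putting the last two points together, the following is a hedged sketch of the chunked read, streaming the compressed TSV straight from S3 with pandas (which needs the s3fs package for s3:// URLs). The S3 path, chunk size, and column name are assumptions, not the project's exact values.

    import pandas as pd

    # Assumed path following the dataset's public layout; adjust to the actual
    # bucket/key in use.
    S3_PATH = "s3://amazon-reviews-pds/tsv/amazon_reviews_us_Electronics_v1_00.tsv.gz"

    # chunksize bounds memory use; compression='gzip' decompresses on the fly.
    reader = pd.read_csv(
        S3_PATH,
        sep="\t",
        compression="gzip",
        chunksize=100_000,
        on_bad_lines="skip",
    )

    for chunk in reader:
        # 'review_body' is the assumed name of the review text column.
        texts = chunk["review_body"].dropna()
        # ... run the cleaning script and NLP model on `texts` ...

On a small instance such as t3.micro, keeping the chunk size modest is what keeps the whole pipeline within the available RAM.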