Tool - Google Play Comments Extractor

I was recently chatting with a childhood friend (a rare thing; thanks to COVID-19, life is not so fast, for now). He's a researcher and a Six Sigma expert. He mentioned a problem he was facing: he needed to extract Google Play app reviews. Being a programmer at heart (for good and bad), I thought it would be easy to crack. I tried various techniques before finding a better, faster way to get it done. Here is my journey.

Challenge(s)

Retrieve user reviews from the Google Play store for a given app
The app is related to COVID-19 in India, so thousands of reviews are posted every single day. In short, a large data volume
The Google Play page doesn't make it easy either. Minified JavaScript, dynamic event injection, and user-experience features make it a complicated environment for data extraction
Paging combines infinite scrolling with a button-based "next page" function

Solutions considered/tried

RPA - I'm not an RPA expert (not yet), but I had read a bit about it and thought it could easily help us. I tried the community edition of UiPath. Getting started was easy, and I could easily tag the pieces of information to be extracted. But detecting dynamic tags and loading new (paged) data became tricky to handle in point-and-click mode. (In hindsight, I might have been able to use this option with some code/scripting.)
API - at first glance I noticed that each time the page has to load new comments, it fires an API call that retrieves the comments. Unfortunately, it uses various encoded parameters and extensive security settings that prohibit (or at least highly complicate) replicating these calls
Data scraping - static scraping doesn't work, as the page is dynamic and loads new data in a paged fashion
Dynamic data scraping - this seemed like the better bet, as it gives more control over imitating button clicks and automating the processing (it was also familiar territory, so I was naturally inclined toward it)

Solution

I used NodeJS and Puppeteer to build a simple solution that can be fast and memory efficient

NodeJS - JavaScript runtime (runs JavaScript outside the browser)
Puppeteer - NodeJS library for controlling a headless Chrome/Chromium browser, used here for web scraping

Extraction Process Steps:

1. Load the app page in a headless browser
2. On the page, switch to viewing comments sorted by "Newest" (to show comments in chronological order, newest first)
3. Load additional data
   a. To handle infinite scroll, simply scroll to the bottom of the page
   b. If a "Show more" button is displayed, click it
4. Wait 1.5 seconds (just a buffer) to allow the new comments to load
5. Extract the comments and save them to a CSV file
6. Repeat this process (from step 3) until no new content is loaded or the maximum number of page loads is reached (configurable)
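
The load-more loop from the steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the tool itself: the two selectors are hypothetical placeholders (the real Google Play class names are obfuscated and change over time), the function names are made up for this example, and the "Newest" sort and the CSV export are left out to keep it short.

```javascript
// Hypothetical selectors: placeholders only, not the values used by the tool.
const REVIEW_SELECTOR = '.review';
const SHOW_MORE_SELECTOR = '.show-more';

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Keeps loading more reviews until the review count stops growing or the
// configured maximum number of page loads is reached.
// `page` is a Puppeteer Page (or any object exposing the same methods).
async function loadAllReviews(page, { maxPageLoads = 100, waitMs = 1500 } = {}) {
  let previousCount = 0;
  for (let load = 0; load < maxPageLoads; load++) {
    // Infinite scroll: jump to the bottom of the page.
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));

    // "Show more" button: click it only when it is present.
    const showMore = await page.$(SHOW_MORE_SELECTOR);
    if (showMore) await showMore.click();

    // Buffer so the newly requested reviews have time to render.
    await sleep(waitMs);

    const count = await page.$$eval(REVIEW_SELECTOR, (nodes) => nodes.length);
    if (count === previousCount) break; // nothing new was loaded
    previousCount = count;
  }
  return previousCount;
}

// Not called automatically; needs Puppeteer to be installed.
async function run(appUrl) {
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(appUrl, { waitUntil: 'networkidle2' });
    const total = await loadAllReviews(page);
    console.log(`Loaded ${total} reviews`);
  } finally {
    await browser.close();
  }
}
```

Calling `run(process.argv[2])` at the end of the script would wire it to the command line. Stopping when the count no longer grows is one simple way to detect "no new content"; a slow network could trigger it early, which is why the wait buffer matters.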

Setup

1. Install NodeJS on your machine (download and install from https://nodejs.org/en/download/)
2. Install Puppeteer on your machine (see the installation steps at https://github.com/puppeteer/puppeteer)
3. Download the tool from https://github.com/toanshulverma/playappreviewsextractor (you can download just the file named playappreviewsextractor.js)

Usage

1. Open the app page on the Google Play website
2. Copy the URL from the browser (to be used in step 4)
3. Navigate to the folder where the tool (step 3 in Setup above) was downloaded
4. Run the following command (the app URL is passed as a parameter)

        node playappreviewsextractor <GOOGLE PLAY APP URL>

   For example,

        node playappreviewsextractor "https://play.google.com/store/apps/details?id=com.cynoteck.kidsFun2Write"

5. Run the following command to merge the per-page CSV files into one CSV file

(Windows)

        copy *.csv userreviews.csv

(Linux/ Mac)

        cat *.csv > userreviews.csv
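
If you would rather do the merge from NodeJS (the same way on every OS), something like the sketch below should work. It is not part of the tool: the function name `mergeCsvFiles` is made up for this example, and it assumes the per-page files have no header row and all live in one folder. Unlike the shell one-liners, it skips the output file itself, so rerunning it does not feed an older merge back in.

```javascript
const fs = require('fs');
const path = require('path');

// Concatenates every .csv file in `dir` into `outputName`, skipping the
// output file itself. Returns the number of files merged.
function mergeCsvFiles(dir, outputName = 'userreviews.csv') {
  const inputs = fs
    .readdirSync(dir)
    .filter((name) => name.toLowerCase().endsWith('.csv') && name !== outputName)
    .sort(); // plain name order, so "page10" sorts before "page2"

  const outputPath = path.join(dir, outputName);
  fs.writeFileSync(outputPath, ''); // start from an empty file
  for (const name of inputs) {
    let content = fs.readFileSync(path.join(dir, name), 'utf8');
    // Make sure the last row of one file does not run into the next file.
    if (content.length > 0 && !content.endsWith('\n')) content += '\n';
    fs.appendFileSync(outputPath, content);
  }
  return inputs.length;
}
```

For example, `mergeCsvFiles(process.cwd())` merges the files in the current folder into userreviews.csv.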

Note:

I'm sure it can be optimized further to run faster, but I wanted to keep an additional buffer for page reloads, so that loading stays reliable as the page gets larger (the app I had to use had around 1M reviews).
This can certainly be optimized for memory, as it seems to eat RAM as new pages get loaded. I did write each page to a separate file to help reduce memory consumption, but there is room for further improvement
A lot of CSS selectors are hardcoded to identify the right components. These may break as and when Google changes the app code to use new tags
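
To illustrate the one-file-per-page idea from the notes above, here is a rough sketch of writing a batch of reviews to its own CSV file. The function names, the file naming scheme, and the review fields (author, date, rating, text) are assumptions made for this example; the tool's actual output format may differ.

```javascript
const fs = require('fs');
const path = require('path');

// Quotes a value for CSV when it contains a comma, quote, or line break
// (RFC 4180 style: embedded quotes are doubled).
function csvField(value) {
  const text = String(value ?? '');
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Writes one batch of reviews to its own file, e.g. reviews-0003.csv,
// so earlier batches do not have to be kept in memory.
function writeReviewPage(dir, pageNumber, reviews) {
  const rows = reviews.map(
    (r) => [r.author, r.date, r.rating, r.text].map(csvField).join(',') + '\n'
  );
  const fileName = `reviews-${String(pageNumber).padStart(4, '0')}.csv`;
  const filePath = path.join(dir, fileName);
  fs.writeFileSync(filePath, rows.join(''), 'utf8');
  return filePath;
}
```

Review text often contains commas, quotes, and line breaks, so quoting matters here. The zero-padded page number keeps the files in the right order when they are listed or merged by name.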

