I was recently chatting with a childhood friend (a rare chance; thanks to COVID-19, life is not so fast, for now). He's a researcher and a Six Sigma expert. He mentioned that he needed to extract Google Play app reviews. Being a programmer at heart (for better or worse), I thought it would be easy to crack. I explored several techniques and settled on a better/ faster way to get it done. Here I'm sharing my journey.
Challenge(s)
- Retrieve user reviews from the Google Play store for a given app
- The app is related to COVID-19 in India, so thousands of reviews are posted every single day. In short, a large data volume
- The Google Play page doesn't make it easy either. Between minified JavaScript, dynamic event injection, and user-experience behavior, it's a complicated environment for data extraction
- Paging combines infinite scrolling with a button-based "next page" control
Solutions considered/ tried
- RPA - I'm not an RPA expert (not yet), but I have read a bit about it and thought it could help. I tried the community edition of UiPath. Getting started was easy, and I could easily tag the pieces of information to extract. But detecting dynamically generated tags and reloading new (paged) data became tricky to handle in point-and-click mode. (In hindsight, this option might have worked with some code/ scripting.)
- API - at first glance I noticed that each time the page has to load new comments, it fires an API call that retrieves the comments. Unfortunately, it uses various encoded parameters and extensive security settings that prevent (or at least highly complicate) replicating these calls
- Data scraping - static scraping doesn't work, as the page is dynamic and loads new data in a paged fashion
- Dynamic data scraping - this seemed like a better bet, as it gives better control over imitating button clicks and automating the processing (it was also familiar territory, so I was naturally inclined towards it)
Solution
I used Node.js and Puppeteer to build a simple solution that aims to be fast and memory efficient.
Extraction Process Steps:
1. Load the app page in a headless browser
2. On the page, switch to viewing comments sorted by "Newest" (to show comments in chronological order)
3. Load additional data
   - To handle infinite scroll, simply scroll to the bottom of the page
   - If the "Show more" button is displayed, click on it
   - Wait for 1.5 seconds (just a buffer) to allow new comments to load
4. Extract the comments and save them in a CSV file
5. Repeat this process (from step 3) until no new content is loaded or the maximum number of page loads is hit (configurable)
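Steps 3 to 5 can be sketched roughly as below. This is not the tool's actual code, just a minimal illustration of the loop. It assumes `page` behaves like a Puppeteer `Page`; the `SELECTORS` values, the `loadAllReviews` name, and the `onBatch` callback are hypothetical placeholders, and the real Google Play selectors will differ.

```javascript
// Hypothetical selectors: placeholders, not Google Play's real class names.
const SELECTORS = {
  review: 'div.review',         // assumption: one element per review
  showMore: 'button.show-more', // assumption: the "Show more" button
};

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Keeps loading reviews until nothing new appears or maxPageLoads is reached.
// onBatch (optional) is called with the index range of newly loaded reviews,
// which is where extraction and CSV writing (step 4) would happen.
async function loadAllReviews(page, options = {}) {
  const { maxPageLoads = 100, waitMs = 1500, onBatch = async () => {} } = options;
  const countReviews = () => page.$$eval(SELECTORS.review, (els) => els.length);

  let previousCount = await countReviews();
  let loads = 0;

  while (loads < maxPageLoads) {
    // Infinite scroll: jump to the bottom so the page requests more reviews.
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));

    // Button-based paging: click "Show more" only when it is present.
    const showMore = await page.$(SELECTORS.showMore);
    if (showMore) await showMore.click();

    // Buffer that gives the new reviews time to render.
    await sleep(waitMs);
    loads += 1;

    const currentCount = await countReviews();
    if (currentCount === previousCount) break; // nothing new was loaded
    await onBatch(previousCount, currentCount);
    previousCount = currentCount;
  }
  return { loads, reviewCount: previousCount };
}
```

With Puppeteer installed, the `page` argument would come from `puppeteer.launch()`, `browser.newPage()`, and `page.goto(url)`. Stopping when the review count no longer grows is one simple way to detect "no new content"; a slow network could trigger it too early, which is why the wait buffer is kept generous.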
Setup
1. Install Node.js on your machine (download and install from https://nodejs.org/en/download/)
2. Install Puppeteer on your machine (refer to the installation steps at https://github.com/puppeteer/puppeteer)
3. Download the tool from https://github.com/toanshulverma/playappreviewsextractor (you can just download the file titled playappreviewsextractor.js)
Usage
1. Open the app page on the Google Play website
2. Copy the URL from the browser (to be used in step 4)
3. Navigate to the folder where the tool (step 3 in Setup above) was downloaded
4. Run the following command (the app URL is passed as a parameter)
   node playappreviewsextractor <GOOGLE PLAY APP URL>
   For example,
   node playappreviewsextractor "https://play.google.com/store/apps/details?id=com.cynoteck.kidsFun2Write"
5. Run the following command to merge all the CSV files into one CSV file
(Windows)
copy *.csv userreviews.csv
(Linux/ Mac)
cat *.csv > userreviews.csv
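If you would rather merge from Node.js (the same on every OS), something like the sketch below should work. It is not part of the tool; the `mergeCsvFiles` name is made up for illustration, and it assumes the per-page CSV files have no header row. It also skips userreviews.csv itself, so running the merge a second time does not fold the previous output back in.

```javascript
const fs = require('fs');
const path = require('path');

// Merge every .csv file in `dir` into `outputName`, skipping the output file
// itself so that re-running the merge does not duplicate earlier results.
// Assumption: the input files carry no header row.
function mergeCsvFiles(dir, outputName = 'userreviews.csv') {
  const inputs = fs
    .readdirSync(dir)
    .filter((name) => name.toLowerCase().endsWith('.csv') && name !== outputName)
    .sort(); // plain lexicographic order of file names

  const outputPath = path.join(dir, outputName);
  fs.writeFileSync(outputPath, ''); // start from an empty output file

  for (const name of inputs) {
    let text = fs.readFileSync(path.join(dir, name), 'utf8');
    // Make sure the last row of one file does not run into the next file.
    if (text.length > 0 && !text.endsWith('\n')) text += '\n';
    fs.appendFileSync(outputPath, text);
  }
  return inputs;
}
```

Calling `mergeCsvFiles('.')` from the folder holding the CSV files would produce userreviews.csv there and return the list of files that were merged.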
Note:
- I'm sure it can be optimized to run faster, but I wanted to keep an additional buffer for page reloads so that loading stays reliable as the page gets larger (the app I worked with had around 1M reviews).
- This can certainly be optimized for memory, as it seems to eat RAM as new pages get loaded. I did write each page to a separate file to help reduce memory consumption, but there is room for additional improvement
- A lot of CSS selectors are hardcoded to identify the right components. These may break as and when Google changes the app code to use new tags
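To illustrate the "one file per page" idea from the notes, here is a small sketch of writing each batch of reviews to its own CSV file. Again, this is an illustration and not the tool's actual code: the review fields (`author`, `date`, `rating`, `text`), the function names, and the file naming scheme are all assumptions. Fields are quoted because review text often contains commas, quotes, or line breaks.

```javascript
const fs = require('fs');

// Quote a field only when it contains a comma, a double quote, or a line
// break; embedded double quotes are doubled, as CSV readers expect.
function csvField(value) {
  const text = String(value ?? '');
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Turn a batch of review objects into CSV text, one row per review.
// Assumption: each review has author, date, rating, and text properties.
function toCsv(reviews) {
  return reviews
    .map((r) => [r.author, r.date, r.rating, r.text].map(csvField).join(',') + '\n')
    .join('');
}

// Write one batch to its own file so earlier batches need not stay in memory.
// Zero-padded page numbers keep the files in order when sorted by name.
function writePage(reviews, pageNumber, prefix = 'reviews') {
  const fileName = `${prefix}-${String(pageNumber).padStart(5, '0')}.csv`;
  fs.writeFileSync(fileName, toCsv(reviews));
  return fileName;
}
```

Once a batch is written, the script can drop its own references to those reviews, although the browser page itself still grows as more reviews are loaded.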