
Tool - Google Play Comments Extractor

I was recently chatting with a childhood friend (a rare thing; thanks to COVID-19, life is not so fast, for now). He's a researcher and a Six Sigma expert. He mentioned that he needed to extract Google Play app reviews. Being a programmer at heart (for good and bad), I thought it would be easy to crack. I tried several techniques before settling on a better/ faster way to get it done. Here is my journey.

Challenge(s)

  1. Retrieve user reviews from the Google Play store for a given app
  2. The app is related to COVID-19 in India, so thousands of reviews are posted every single day. In short, a large data volume
  3. The Google Play page doesn't make it easy either. Minified JavaScript, dynamic event injection and a rich user experience make it a complicated environment for data extraction
  4. Paging combines infinite scrolling with a button-based "next page" function

Solutions thought/ tried

  1. RPA - I'm not an RPA expert (not yet), but I had read a bit about it and thought it could easily help us. I tried the community version of UiPath. Getting started was easy, and I could easily tag the various pieces of information to be extracted. But detecting dynamic tags and reloading new (paged) data became tricky to handle in point-and-click mode. (In hindsight, I might have been able to make this option work with some code/ scripting)
  2. API - at first glance I noticed that each time the page has to load new comments, it fires an API call that retrieves the comments. Unfortunately, the call uses various coded parameters and extensive security settings that prohibit (or at least highly complicate) replicating these calls
  3. Data scraping - static scraping doesn't work, as the page is dynamic and loads new data in a paged fashion
  4. Dynamic data scraping - this seemed like the better bet, as it gives more control over imitating button clicks and automating the processing (it was also familiar territory, so I was naturally inclined towards it)

Solution

I used NodeJS and Puppeteer to build a simple solution, which can be fast and memory efficient
  • NodeJS - JavaScript runtime for running scripts outside the browser
  • Puppeteer - NodeJS library that drives a headless Chrome/ Chromium browser, useful for web scraping

Extraction Process Steps:
  1. Load the app page in a headless browser
  2. On the page, switch to viewing comments sorted by "Newest" (to show comments in chronological order)
  3. Load additional data
    • To handle infinite scroll, simply scroll to the bottom of the page
    • If a "Show more" button is displayed, click on it
  4. Wait for 1.5 seconds (just a buffer) to allow new comments to load
  5. Extract the comments and save them in a CSV file
  6. Repeat this process (from step 3) until no new content is loaded or the configured maximum number of page loads is reached

Setup

  1. Install NodeJS on your machine (download and install from https://nodejs.org/en/download/)
  2. Install Puppeteer on your machine (refer to the installation steps at https://github.com/puppeteer/puppeteer)
  3. Download the tool from https://github.com/toanshulverma/playappreviewsextractor (you can just download the file named playappreviewsextractor.js)

Usage

  1. Open the app page on the Google Play website
  2. Copy the url from the browser (to be used in step 4)
  3. Navigate to the folder where the tool (step 3 in Setup above) was downloaded
  4. Run the following command (the app url is passed as a parameter)

    node playappreviewsextractor <GOOGLE PLAY APP URL>

    For example,

    node playappreviewsextractor "https://play.google.com/store/apps/details?id=com.cynoteck.kidsFun2Write"

  5. Run the following command to merge the extracted CSV files into one CSV file

    (Windows)
        copy *.csv userreviews.csv

    (Linux/ Mac)
        cat *.csv > userreviews.csv
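
If you prefer one command that behaves the same on every OS, the merge can also be done from NodeJS. The sketch below is only an illustration and is not part of the tool; the function name mergeCsvFiles is made up. It assumes the per-page files sit in one folder and carry no header row. It skips the output file itself, since userreviews.csv also matches *.csv on a second run.

```javascript
const fs = require('fs');
const path = require('path');

// Concatenates every .csv file in `dir` into `outName`, in file name order.
// Note: the order is alphabetical, so "page10.csv" sorts before "page2.csv".
function mergeCsvFiles(dir, outName = 'userreviews.csv') {
  const inputs = fs.readdirSync(dir)
    .filter((name) => name.toLowerCase().endsWith('.csv') && name !== outName)
    .sort();

  const outPath = path.join(dir, outName);
  fs.writeFileSync(outPath, ''); // start from an empty output file

  for (const name of inputs) {
    let text = fs.readFileSync(path.join(dir, name), 'utf8');
    // Make sure the last row of one file does not run into the next file
    if (text.length > 0 && !text.endsWith('\n')) text += '\n';
    fs.appendFileSync(outPath, text);
  }
  return inputs; // names of the files that were merged
}
```

Calling `mergeCsvFiles('.')` from a small script in the download folder would then produce userreviews.csv there.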

Note:

  1. I'm sure it can be further optimized to run faster, but I wanted to keep an additional buffer for page reloads, so that new comments still have time to load as the page gets larger (the app I had to work with had around 1M reviews).
  2. This can certainly be optimized for memory, as it seems to eat more RAM as new pages get loaded. I did write each page to a separate file to help reduce memory consumption, but there is room for further improvement
  3. A lot of CSS selectors are hardcoded to identify the right components. These may break as and when Google changes the page markup
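
To illustrate the "one file per page" idea from note 2, here is a minimal sketch of writing one batch of reviews to its own CSV file. It is not the code of the tool: the function names, the file name pattern reviews-0001.csv and the review fields (author, date, rating, text) are all assumptions for the example. The quoting matters because review text routinely contains commas, quotes and line breaks.

```javascript
const fs = require('fs');
const path = require('path');

// Quotes a value for CSV: wrap it in double quotes when it contains a comma,
// a quote or a line break, and double any embedded quotes.
function csvField(value) {
  const text = String(value ?? '');
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Turns an array of rows (arrays of values) into CSV text, one line per row.
function toCsv(rows) {
  return rows.map((row) => row.map(csvField).join(',') + '\n').join('');
}

// Writes one batch of reviews to its own zero-padded, numbered file.
// The review fields used here are hypothetical.
function writeBatch(dir, pageNumber, reviews) {
  const name = `reviews-${String(pageNumber).padStart(4, '0')}.csv`;
  const rows = reviews.map((r) => [r.author, r.date, r.rating, r.text]);
  fs.writeFileSync(path.join(dir, name), toCsv(rows));
  return name;
}
```

Once a batch is on disk, the script no longer needs to hold those rows in its own memory. The zero-padded number keeps the files in page order when they are listed alphabetically.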
