Tool - Google Play Comments Extractor

I was recently chatting with a childhood friend (a rare thing; thanks to COVID-19, life is not so fast for now). He's a researcher and a Six Sigma expert, and he mentioned that he needed to extract Google Play app reviews. Being a programmer at heart (for good and bad), I thought it would be easy to crack. I tried various techniques and settled on the one that worked best and fastest for me. Here is my journey.

Challenge(s)

  1. Retrieve user reviews from the Google Play store for a given app
  2. The app is related to COVID-19 in India, so thousands of reviews are posted every single day. In short, a large data volume
  3. The Google Play page doesn't make it easy either. With minified JavaScript, dynamically injected event handlers and a layout built for user experience rather than extraction, it's a complicated environment for data extraction
  4. Paging combines infinite scrolling with a button-based "next page" action

Solutions thought/ tried

  1. RPA - I'm not an RPA expert (not yet), but I had read a bit about it and thought it could easily help. I tried the community version of UiPath. Getting started was easy, and I could quickly tag the pieces of information to be extracted. But detecting dynamically generated tags and reloading new (paged) data became tricky to handle in point-and-click mode. (Looking back, I might have been able to make this option work with some code/ scripting)
  2. API - at first glance I noticed that each time the page loads new comments, it fires an API call that retrieves them. Unfortunately, the call uses various encoded parameters and extensive security settings that prohibit (or at least highly complicate) replicating it
  3. Data scraping - static scraping doesn't work, as the page is dynamic and loads new data in a paged fashion
  4. Dynamic data scraping - this seemed like the better bet, as it gives more control over imitating button clicks and automating the processing (it was also familiar territory for me, so I was naturally inclined towards it)

Solution

I used NodeJS and Puppeteer to build a simple solution, which can be fast and memory efficient
  • NodeJS - JavaScript runtime for running scripts outside the browser
  • Puppeteer - NodeJS library for controlling a headless Chrome/ Chromium browser, used here for web scraping
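For readers new to Puppeteer, here is a minimal sketch of opening a page in a headless browser. It is not taken from the tool itself: `openAppPage` is a hypothetical helper, and the `puppeteer` module is passed in as an argument only so the function can be tried with a stub.

```javascript
// Sketch only: `openAppPage` is a hypothetical name, not part of the tool.
// `puppeteer` is the module returned by require("puppeteer").
async function openAppPage(puppeteer, appUrl) {
  // Launch Chromium without a visible window.
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  // Wait until network activity has mostly settled before continuing.
  await page.goto(appUrl, { waitUntil: "networkidle2" });
  return { browser, page };
}
```

With the real library this would be called as `openAppPage(require("puppeteer"), url)`, followed by `browser.close()` once extraction is done.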

Extraction Process Steps:
  1. Load the app page in a headless browser
  2. On the page, switch to viewing comments sorted by "Newest" (to show the most recent comments first)
  3. Load additional data
    • To handle infinite scroll, simply scroll to the bottom of the page
    • In case a "Show more" button is displayed, click on it
  4. Wait for 1.5 seconds (just a buffer) to allow new comments to be loaded
  5. Extract comments and save them in a CSV file
  6. Repeat this process (from step 3) until no new content is being loaded or the maximum number of page loads is hit (configurable)
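Steps 3 to 6 can be sketched roughly as the loop below. This is a simplified illustration, not the tool's actual code: the selectors are made-up placeholders (Google Play's real class names are minified and change over time), `loadMoreReviews` is a hypothetical name, and the CSV writing from step 5 is left out so that the loop only counts the reviews on the page.

```javascript
// Hypothetical selectors, for illustration only.
const SELECTORS = {
  review: "div.review",         // assumed: one element per review
  showMore: "button.show-more", // assumed: the "Show more" button
};

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// `page` is a Puppeteer Page (or anything exposing evaluate, $ and $$eval).
async function loadMoreReviews(page, { maxPageLoads = 100, waitMs = 1500 } = {}) {
  let previousCount = 0;
  for (let load = 0; load < maxPageLoads; load++) {
    // Infinite scroll: jump to the bottom so the page requests the next batch.
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));

    // Button-based paging: click "Show more" when it is present.
    const button = await page.$(SELECTORS.showMore);
    if (button) await button.click();

    // Buffer so the newly requested reviews have time to render.
    await sleep(waitMs);

    const count = await page.$$eval(SELECTORS.review, (nodes) => nodes.length);
    if (count === previousCount) break; // nothing new arrived, so stop
    previousCount = count;
  }
  return previousCount;
}
```

The loop ends either when a pass adds no new reviews or when `maxPageLoads` is reached, which mirrors the two stop conditions in step 6.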

Setup

  1. Install NodeJS on your machine (download and install from https://nodejs.org/en/download/)
  2. Install Puppeteer on your machine (refer to the installation steps at https://github.com/puppeteer/puppeteer)
  3. Download tool from https://github.com/toanshulverma/playappreviewsextractor (you can just download the file titled playappreviewsextractor.js)

Usage

  1. Open app page from google play website
  2. Copy url from browser (to be used in step 4)
  3. Navigate to folder where tool (step 3 in Setup above) is downloaded
  4. Run the following command (the app URL is passed as a parameter)

    node playappreviewsextractor <GOOGLE PLAY APP URL>

    For example,

    node playappreviewsextractor "https://play.google.com/store/apps/details?id=com.cynoteck.kidsFun2Write"

  5. Run the following command to merge the generated CSV files into one CSV file

    (Windows)
        copy *.csv userreviews.csv

    (Linux/ Mac)
        cat *.csv > userreviews.csv
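If you would rather merge from Node (so the same step works on any OS), something like the sketch below should do. `mergeCsvFiles` is a hypothetical helper, not part of the tool. It assumes the per-page files sit in one folder, skips the output file itself so a stale merge is never re-included, and does not remove repeated header rows, if the files have any.

```javascript
const fs = require("fs");
const path = require("path");

// Concatenate every .csv file in `dir` (sorted by name) into `outputName`.
// Returns the number of input files that were merged.
function mergeCsvFiles(dir, outputName = "userreviews.csv") {
  const inputs = fs
    .readdirSync(dir)
    .filter((name) => name.toLowerCase().endsWith(".csv") && name !== outputName)
    .sort();
  const outputPath = path.join(dir, outputName);
  fs.writeFileSync(outputPath, ""); // start from an empty output file
  for (const name of inputs) {
    let content = fs.readFileSync(path.join(dir, name), "utf8");
    // Make sure each file ends with a newline so rows never run together.
    if (content && !content.endsWith("\n")) content += "\n";
    fs.appendFileSync(outputPath, content);
  }
  return inputs.length;
}
```

Calling `mergeCsvFiles(".")` from the folder holding the CSV files would produce `userreviews.csv` there.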

Note:

  1. I'm sure it can be optimized to run faster, but I wanted to keep an additional buffer for page reloads, so that loading stays reliable as the page gets larger (the app I worked with had around 1M reviews).
  2. Memory use can certainly be optimized, as the tool seems to eat more RAM as new pages get loaded. I did write each page to a separate file to help reduce memory consumption, but there is room for additional improvement
  3. A lot of CSS selectors are hardcoded to identify the right components. These may break as and when Google changes the page markup
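To illustrate the "one file per page" idea from note 2, here is a rough sketch of writing a batch of reviews to its own CSV file. The function names, the file naming pattern and the column names (`author`, `date`, `rating`, `text`) are assumptions for illustration, not the tool's actual output format. Quoting follows the usual CSV convention of wrapping a field in double quotes and doubling any quotes inside it, which matters because review text often contains commas and line breaks.

```javascript
const fs = require("fs");
const path = require("path");

// Quote a field only when it contains a comma, quote or line break.
function csvField(value) {
  const text = String(value ?? "");
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Turn an array of objects into CSV text, one line per object.
function toCsv(rows, columns) {
  return rows
    .map((row) => columns.map((c) => csvField(row[c])).join(",") + "\n")
    .join("");
}

// Write one batch of reviews to its own file, e.g. reviews-0003.csv,
// so the batch does not need to stay in memory afterwards.
function writeBatch(dir, batchNumber, rows) {
  const name = `reviews-${String(batchNumber).padStart(4, "0")}.csv`;
  const file = path.join(dir, name);
  fs.writeFileSync(file, toCsv(rows, ["author", "date", "rating", "text"]));
  return file;
}
```

Zero-padding the batch number keeps the files in the right order when they are later merged by name.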
