How to create unigrams, bigrams and n-grams of App Reviews

in R using tidytext

This is one of the most frequent questions I’ve heard from first-time NLP / Text Analytics programmers (or, as the world likes to call them, “Data Scientists”).

Prerequisite

For simplicity, this post assumes that you already know how to install a package, so you’ve got tidytext installed on your R machine.

install.packages("tidytext")

Loading the Library

Let’s start with loading the tidytext library.

library(tidytext)

Extracting App Reviews

We’ll use the R package itunesr to download iOS App Reviews, on which we’ll perform simple text analysis (unigrams, bigrams, n-grams). The getReviews() function of itunesr helps us extract reviews of the Medium iOS App.

library(itunesr)
library(tidyverse)

# Extracting Medium iOS App Reviews
medium <- getReviews("828256236","us",1)

Overview of the extracted App Reviews

As usual, we’ll start by looking at the head of the dataframe.

head(medium) 
##                                    Title
## 1 Great blog, but feature gaps on mobile
## 2              A Platform for Your Voice
## 3          Great Thought Provoking Reads
## 4        Series functionality is broken.
## 5                great until you realize
## 6                       Always inspired!
##                                        Author_URL   Author_Name
## 1 https://itunes.apple.com/us/reviews/id510046603   shiquorlits
## 2 https://itunes.apple.com/us/reviews/id261380693    ehunternyc
## 3 https://itunes.apple.com/us/reviews/id660631915   Chrissyerks
## 4 https://itunes.apple.com/us/reviews/id256769909       Joesdjk
## 5 https://itunes.apple.com/us/reviews/id268734859     Booskiiee
## 6 https://itunes.apple.com/us/reviews/id219297759 bebe@lightbox
##   App_Version Rating
## 1        3.91      3
## 2        3.91      5
## 3        3.91      5
## 4        3.90      1
## 5        3.90      1
## 6        3.90      5
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Review
## 1 First of all, Medium is an excellent platform with an excellent mobile app. Don’t get me wrong. \n\nI’m knocking off stars because there’s a couple major feature gaps between the mobile app and desktop platform for content creators. Namely...\n\n- Sharing the friends-only link\n- Changing the distribution settings, licensing, and SEO of an article (aside from tags)\n- Code blocks or snippets in draft mode\n\nIf any of those features exist in mobile, I’ve missed them. \n\nI also find it very unusual that getting a direct message/response on your article from a publisher (or otherwise) doesn’t show up in notifications - instead, unless I’m mistaken, you have to look for small asterisks that show up to the right of the article body.
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   As a new writer, it is almost impossible to be published. Medium has allowed me to get my message out and be HEARD! \nA wonderful “first step.”\nEllen Hunter, KidsAreAlright.org
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Can spend hours reading this app. Love it!
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                     There is no way to delete a card from a series draft on desktop and every time I try to delete a card on mobile the app crashes. Not to mention every time I open the series draft on mobile it arbitrarily adds a new blank card to the beginning of the series only making the problem worse.
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         you gotta pay to keep this going i really thought i found something finally free to use and help me wanna educate my self better \ngarbage absolute garbage
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           Medium provides inspirational and thought provoking articles that help me grow. I love sharing them with others, as well.
##                  Date
## 1 2019-08-16 01:59:02
## 2 2019-08-15 06:48:25
## 3 2019-08-15 02:37:31
## 4 2019-08-13 16:29:38
## 5 2019-08-13 15:00:21
## 6 2019-08-12 21:19:34

Now we know that there are two text columns of interest - Title and Review.

To make our n-gram analysis a bit more meaningful, we’ll extract only the positive (5-star) reviews to see what good things people are writing about the Medium iOS App. To decide on the filter, let’s first look at the split of Rating.

table(medium$Rating)
## 
##  1  3  4  5 
## 10  4  4 31

So, 5-star reviews make up the largest share of the reviews we extracted (31 of 49), and we’re good to go with filtering only those. We’ll pick the Review column and restrict it to Rating == 5. Since tidytext needs a dataframe (or tibble) to work with, we’ll put these 5-star reviews in a column of a new dataframe.

reviews <- data.frame(txt = medium$Review[medium$Rating==5],
                      stringsAsFactors = FALSE)

Tokens

Tokenization in NLP is the process of splitting a text corpus based on some splitting unit - it could be word tokens, sentence tokens, or tokens produced by some more advanced algorithm (for example, to split a conversation). Here, we’ll simply do word tokenization.

reviews %>% 
  unnest_tokens(output = word, input = txt) %>% 
  head()
##       word
## 1       as
## 1.1      a
## 1.2    new
## 1.3 writer
## 1.4     it
## 1.5     is

As you can see above, unnest_tokens() is the function that helps us with tokenization. Since it works with the %>% pipe operator, its first argument is a dataframe or tibble. The second argument, output, is the name of the new column where the tokenized words will be put. The third argument, input, is the name of the column that holds the input text.
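To see what unnest_tokens() does in isolation, here is a minimal sketch on a made-up two-row dataframe (the toy object and its text are assumptions for illustration, not part of the Medium data). It also tries token = "sentences", the sentence-level splitting unit mentioned earlier.

```r
library(tidytext)

# Hypothetical toy data, not the Medium reviews
toy <- data.frame(txt = c("Love this app. Great reads!",
                          "Easy to use."),
                  stringsAsFactors = FALSE)

# Word tokens: one row per word, lowercased, punctuation stripped
toy_words <- unnest_tokens(toy, output = word, input = txt)

# Sentence tokens: one row per sentence
toy_sentences <- unnest_tokens(toy, output = sentence, input = txt,
                               token = "sentences")

toy_words$word
toy_sentences$sentence
```

The word version is what the rest of this post builds on; the sentence version is only there to show that the splitting unit is a choice.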

These are the unigrams for the Medium iOS App Reviews. As with many other data science projects, data like this is not useful unless it’s summarized or visualized in a way that surfaces insights.

reviews %>% 
  unnest_tokens(output = word, input = txt) %>% 
  count(word, sort = TRUE) 
## # A tibble: 421 x 2
##    word         n
##    <chr>    <int>
##  1 i           38
##  2 the         37
##  3 to          27
##  4 and         24
##  5 of          22
##  6 for         18
##  7 a           16
##  8 it          14
##  9 medium      13
## 10 articles    12
## # … with 411 more rows

Looking at the most frequent unigrams, we end up with words such as i, the and to. This is one of those places where we need to remove stop words.

Stopword Removal

Fortunately, tidytext helps us remove stop words by providing a dataframe of stop words (stop_words) from multiple lexicons. With that, we can use anti_join() to keep only the words that are present in the left dataframe (the tokenized reviews) but not in the right one (stop_words).

reviews %>% 
  unnest_tokens(output = word, input = txt) %>% 
  anti_join(stop_words) %>% 
  count(word, sort = TRUE) 
## Joining, by = "word"
## # A tibble: 251 x 2
##    word         n
##    <chr>    <int>
##  1 medium      13
##  2 articles    12
##  3 reading     12
##  4 app          8
##  5 love         8
##  6 content      5
##  7 i’m          5
##  8 it’s         4
##  9 read         4
## 10 easy         3
## # … with 241 more rows

With the stop words removed, we now get a better representation of the most frequently appearing unigrams in the reviews.
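One thing the output above hints at: i’m and it’s survive the stop-word filter. A likely explanation is that the reviews use a typographic apostrophe (’) while the entries in stop_words use a straight one ('), so the join finds no match. A minimal sketch of one way to handle that, on a made-up review (toy and its sample sentence are assumptions), is to normalize the apostrophe before tokenizing and to name the join column explicitly:

```r
library(tidytext)
library(dplyr)

# Hypothetical review text containing a typographic apostrophe (U+2019)
toy <- data.frame(txt = "I\u2019m reading Medium articles",
                  stringsAsFactors = FALSE)

cleaned <- toy %>%
  # Replace the typographic apostrophe with a straight one
  mutate(txt = gsub("\u2019", "'", txt, fixed = TRUE)) %>%
  unnest_tokens(output = word, input = txt) %>%
  # Naming the key explicitly avoids the "Joining, by" message
  anti_join(stop_words, by = "word")

cleaned$word
```

Whether this matters for your own data depends on how the source text encodes apostrophes, so it is worth checking the surviving tokens before adding the extra step.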

Unigram Visualization

We’ve got our data in the shape that we want, so let’s go ahead and visualize it. To keep the pipeline intact, I’m not creating any temporary object to store the previous output and simply continue the same pipe. Also, since too many bars (words) wouldn’t make any sense (and would only result in a cluttered plot), we’ll keep just the top 10 words.

reviews %>% 
  unnest_tokens(output = word, input = txt) %>% 
  anti_join(stop_words) %>% 
  count(word, sort = TRUE) %>% 
  slice(1:10) %>% 
  ggplot() + geom_bar(aes(word, n), stat = "identity", fill = "#de5833") +
  theme_minimal() +
  labs(title = "Top unigrams of Medium iOS App Reviews",
       subtitle = "using Tidytext in R",
       caption = "Data Source: itunesr - iTunes App Store")
## Joining, by = "word"
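A side note on the plot: when a character column such as word is mapped to an axis, ggplot2 orders the bars alphabetically rather than by count. If you would rather see the bars ranked by frequency, one option is to wrap the word in reorder(). The sketch below uses a small made-up count table (toy_counts is an assumption standing in for the piped output above) and geom_col(), which is shorthand for geom_bar(stat = "identity").

```r
library(ggplot2)

# Hypothetical word counts standing in for the piped output above
toy_counts <- data.frame(word = c("medium", "articles", "app"),
                         n = c(13, 12, 8),
                         stringsAsFactors = FALSE)

p <- ggplot(toy_counts) +
  # reorder() ranks the words by n instead of alphabetically
  geom_col(aes(reorder(word, n), n), fill = "#de5833") +
  # Horizontal bars keep longer words readable
  coord_flip() +
  theme_minimal() +
  labs(x = "word")

p
```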

Bigrams & N-grams

Now that we’ve got the core code for unigram visualization set up, we can slightly modify it - just by adding the arguments token = "ngrams" and n = 2 to the tokenization step - to extract n-grams: 2 for bigrams, 3 for trigrams, or any n of your interest. But remember, large values of n may not be as useful as smaller ones.

Doing this naively has a catch: the anti_join() stop-word removal we used above won’t work here, since each token is now a bigram (a two-word combination separated by a space) and would never match a single stop word. So, we’ll separate each bigram into two words on the space, filter out rows where either word1 or word2 is a stop word, and then unite them back - which gives us the bigrams after stop-word removal. This is the process you might have to carry out whenever you are dealing with n-grams.

reviews %>% 
  unnest_tokens(word, txt, token = "ngrams", n = 2) %>% 
  separate(word, c("word1", "word2"), sep = " ") %>% 
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>% 
  unite(word,word1, word2, sep = " ") %>% 
  count(word, sort = TRUE) %>% 
  slice(1:10) %>% 
  ggplot() + geom_bar(aes(word, n), stat = "identity", fill = "#de5833") +
  theme_minimal() +
  coord_flip() +
  labs(title = "Top Bigrams of Medium iOS App Reviews",
       subtitle = "using Tidytext in R",
       caption = "Data Source: itunesr - iTunes App Store")
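The same separate / filter / unite pattern extends to larger n. As a minimal sketch for trigrams (n = 3), here it is on two made-up reviews; toy and its text are assumptions for illustration, and the plotting step is left out to keep it short.

```r
library(tidytext)
library(dplyr)
library(tidyr)

# Hypothetical toy reviews, not the Medium data
toy <- data.frame(txt = c("love reading medium articles",
                          "easy reading medium articles"),
                  stringsAsFactors = FALSE)

trigrams <- toy %>%
  unnest_tokens(word, txt, token = "ngrams", n = 3) %>%
  separate(word, c("word1", "word2", "word3"), sep = " ") %>%
  # Keep a trigram only if none of its three words is a stop word
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !word3 %in% stop_words$word) %>%
  unite(word, word1, word2, word3, sep = " ") %>%
  count(word, sort = TRUE)

trigrams
```

Dropping every n-gram that contains a stop word is a strict rule, and it gets stricter as n grows, so on a small corpus very few trigrams may be left.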

Summary

This particular exercise may not reveal many meaningful insights since we started with very little data, but it becomes really useful when you have a decent-sized text corpus, where this simple analysis of unigrams and bigrams (n-gram analysis) can reveal something business-worthy (say, in Customer Service, App Development or many other use-cases).

 