This is one of the most frequent questions I hear from first-time NLP / text analytics programmers (or, as the world likes to call them, “Data Scientists”).
Prerequisite
For simplicity, this post assumes that you already know how to install a package, and so you’ve got tidytext installed on your R machine.
install.packages("tidytext")
Loading the Library
Let’s start by loading the tidytext library.
library(tidytext)
Extracting App Reviews
We’ll use the R package itunesr to download iOS app reviews, on which we’ll perform simple text analysis (unigrams, bigrams, n-grams). The getReviews() function of itunesr extracts the reviews of the Medium iOS app.
library(itunesr)
library(tidyverse)
# Extracting Medium iOS App Reviews
medium <- getReviews("828256236","us",1)
Overview of the Extracted App Reviews
As usual, we’ll start by looking at the head of the dataframe.
head(medium)
## Title
## 1 Great blog, but feature gaps on mobile
## 2 A Platform for Your Voice
## 3 Great Thought Provoking Reads
## 4 Series functionality is broken.
## 5 great until you realize
## 6 Always inspired!
## Author_URL Author_Name
## 1 https://itunes.apple.com/us/reviews/id510046603 shiquorlits
## 2 https://itunes.apple.com/us/reviews/id261380693 ehunternyc
## 3 https://itunes.apple.com/us/reviews/id660631915 Chrissyerks
## 4 https://itunes.apple.com/us/reviews/id256769909 Joesdjk
## 5 https://itunes.apple.com/us/reviews/id268734859 Booskiiee
## 6 https://itunes.apple.com/us/reviews/id219297759 bebe@lightbox
## App_Version Rating
## 1 3.91 3
## 2 3.91 5
## 3 3.91 5
## 4 3.90 1
## 5 3.90 1
## 6 3.90 5
## Review
## 1 First of all, Medium is an excellent platform with an excellent mobile app. Don’t get me wrong. \n\nI’m knocking off stars because there’s a couple major feature gaps between the mobile app and desktop platform for content creators. Namely...\n\n- Sharing the friends-only link\n- Changing the distribution settings, licensing, and SEO of an article (aside from tags)\n- Code blocks or snippets in draft mode\n\nIf any of those features exist in mobile, I’ve missed them. \n\nI also find it very unusual that getting a direct message/response on your article from a publisher (or otherwise) doesn’t show up in notifications - instead, unless I’m mistaken, you have to look for small asterisks that show up to the right of the article body.
## 2 As a new writer, it is almost impossible to be published. Medium has allowed me to get my message out and be HEARD! \nA wonderful “first step.”\nEllen Hunter, KidsAreAlright.org
## 3 Can spend hours reading this app. Love it!
## 4 There is no way to delete a card from a series draft on desktop and every time I try to delete a card on mobile the app crashes. Not to mention every time I open the series draft on mobile it arbitrarily adds a new blank card to the beginning of the series only making the problem worse.
## 5 you gotta pay to keep this going i really thought i found something finally free to use and help me wanna educate my self better \ngarbage absolute garbage
## 6 Medium provides inspirational and thought provoking articles that help me grow. I love sharing them with others, as well.
## Date
## 1 2019-08-16 01:59:02
## 2 2019-08-15 06:48:25
## 3 2019-08-15 02:37:31
## 4 2019-08-13 16:29:38
## 5 2019-08-13 15:00:21
## 6 2019-08-12 21:19:34
Now we know that there are two text columns of interest: Title and Review.
To make our n-gram analysis a bit more meaningful, we’ll extract only the positive (5-star) reviews to see what good things people are writing about the Medium iOS app. To decide on the filter, let’s first look at the split of Rating.
table(medium$Rating)
##
## 1 3 4 5
## 10 4 4 31
So 5-star reviews are the largest group in the extract (31 of 49), and we’re good to go with filtering only those. We’ll pick Review and keep only the rows where Rating == 5. Since tidytext needs a dataframe (or tibble) to process, we’ll put these 5-star reviews in a column of a new dataframe.
reviews <- data.frame(txt = medium$Review[medium$Rating==5],
stringsAsFactors = FALSE)
Tokens
Tokenization in NLP is the process of splitting a text corpus into units based on some splitting rule. The units could be word tokens, sentence tokens, or the result of a more advanced algorithm that splits a conversation. Here, we’ll simply do word tokenization.
reviews %>%
unnest_tokens(output = word, input = txt) %>%
head()
## word
## 1 as
## 1.1 a
## 1.2 new
## 1.3 writer
## 1.4 it
## 1.5 is
As you can see above, unnest_tokens() is the function that does the tokenization. Since it supports the %>% pipe operator, the first argument of the function is a dataframe or tibble. The second argument, output, is the name of the new column where the tokenized words are put. The third argument, input, is the name of the column that holds the input text.
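To make the difference between word and sentence tokens concrete, here is a minimal sketch on a made-up, two-sentence review (the toy data frame and its text are assumptions for illustration, not part of the Medium extract). It relies on the token argument of unnest_tokens(), which defaults to "words" and also accepts "sentences".

```r
library(dplyr)
library(tidytext)

# Hypothetical toy review, not taken from the itunesr extract
toy <- data.frame(txt = "Love this app. Easy reading!",
                  stringsAsFactors = FALSE)

# Word tokens (the default): one row per word
toy_words <- toy %>%
  unnest_tokens(output = word, input = txt)

# Sentence tokens: one row per sentence
toy_sentences <- toy %>%
  unnest_tokens(output = sentence, input = txt, token = "sentences")

toy_words
toy_sentences
```

Note that, by default, unnest_tokens() lowercases the text and drops the input column, and word tokenization strips punctuation, which is why the word column earlier contains "as" rather than "As".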
These word tokens are the unigrams of the Medium iOS app reviews. As with many other data science projects, data like this is not useful until it’s summarized or visualized in a way that surfaces insights, so let’s start by counting how often each word appears.
reviews %>%
unnest_tokens(output = word, input = txt) %>%
count(word, sort = TRUE)
## # A tibble: 421 x 2
## word n
## <chr> <int>
## 1 i 38
## 2 the 37
## 3 to 27
## 4 and 24
## 5 of 22
## 6 for 18
## 7 a 16
## 8 it 14
## 9 medium 13
## 10 articles 12
## # … with 411 more rows
Looking at the most frequent unigrams, we end up with words like the, i and and. This is one of those places where we need to remove stopwords.
Stopword Removal
Fortunately, tidytext helps us remove stopwords by providing a dataframe of stopwords (stop_words) drawn from multiple lexicons. With that, we can use anti_join to keep the words that are present in the left dataframe (reviews) but not in the right dataframe (stop_words).
reviews %>%
unnest_tokens(output = word, input = txt) %>%
anti_join(stop_words) %>%
count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 251 x 2
## word n
## <chr> <int>
## 1 medium 13
## 2 articles 12
## 3 reading 12
## 4 app 8
## 5 love 8
## 6 content 5
## 7 i’m 5
## 8 it’s 4
## 9 read 4
## 10 easy 3
## # … with 241 more rows
With the stop words removed, we now get a better representation of the most frequently appearing unigrams in the reviews.
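One detail in the output above: i’m and it’s survive the stop-word removal. A likely reason is that the reviews use the typographic apostrophe (’), while stop_words stores these contractions with a straight apostrophe ('), so the join finds no match. The sketch below shows one possible workaround: normalizing the apostrophe before tokenizing. The toy text is made up for illustration, and the normalization step is an assumption on my part rather than something the original pipeline does.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy review written with the typographic apostrophe (U+2019)
toy <- data.frame(txt = "I\u2019m reading articles and it\u2019s easy",
                  stringsAsFactors = FALSE)

toy_clean <- toy %>%
  # Replace the typographic apostrophe with a straight one
  mutate(txt = gsub("\u2019", "'", txt, fixed = TRUE)) %>%
  unnest_tokens(output = word, input = txt) %>%
  # Naming the join column explicitly also silences the "Joining, by" message
  anti_join(stop_words, by = "word")

toy_clean
```

Applied to the real data, the same mutate() step would go right before unnest_tokens() in the pipeline above.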
Unigram Visualization
We’ve got our data in the shape we want, so let’s go ahead and visualize it. To keep the pipeline intact, I’m not creating any temporary object to store the previous output; we’ll simply extend the same pipeline. Also, too many bars (words) wouldn’t make sense (and would only result in a cluttered plot), so we’ll keep only the top 10 words.
reviews %>%
unnest_tokens(output = word, input = txt) %>%
anti_join(stop_words) %>%
count(word, sort = TRUE) %>%
slice(1:10) %>%
ggplot() + geom_bar(aes(word, n), stat = "identity", fill = "#de5833") +
theme_minimal() +
labs(title = "Top unigrams of Medium iOS App Reviews",
subtitle = "using Tidytext in R",
caption = "Data Source: itunesr - iTunes App Store")
## Joining, by = "word"
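By default, ggplot2 orders a character axis alphabetically, so the bars in the plot above are not sorted by count. If you’d prefer them sorted, one option is to turn word into a factor ordered by n with reorder(). The sketch below uses a small hand-typed data frame (counts copied from the top rows of the output above) so that it runs on its own; the object names are just for illustration.

```r
library(dplyr)
library(ggplot2)

# Counts copied by hand from the stop-word-filtered output above
top_words <- data.frame(word = c("medium", "articles", "reading", "app", "love"),
                        n = c(13, 12, 12, 8, 8),
                        stringsAsFactors = FALSE)

# Order the factor levels of word by their count
ordered_words <- top_words %>%
  mutate(word = reorder(word, n))

# geom_col() is shorthand for geom_bar(stat = "identity")
p <- ggplot(ordered_words) +
  geom_col(aes(word, n), fill = "#de5833") +
  coord_flip() +   # horizontal bars keep the word labels readable
  theme_minimal()

p
```

In the full pipeline, the mutate() call would sit between slice(1:10) and ggplot().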
Bigrams & N-grams
Now that we’ve got the core code for unigram visualization set up, we can modify it slightly to extract n-grams, just by adding two new arguments to the tokenization step: token = "ngrams" and n = 2. Use 2 for bigrams, 3 for trigrams, or any n of your interest. But remember, large values of n may not be as useful as smaller ones.
Doing this naively has a catch: the anti_join we used above for stop-word removal won’t work directly, because each token is now a bigram (a two-word combination separated by a space). So we’ll separate each bigram at the space into word1 and word2, filter out the rows where either word is a stop word, and then unite the two words back, which gives us the bigrams after stop-word removal. This is the process you’ll have to carry out whenever you’re dealing with n-grams.
reviews %>%
unnest_tokens(word, txt, token = "ngrams", n = 2) %>%
separate(word, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
unite(word,word1, word2, sep = " ") %>%
count(word, sort = TRUE) %>%
slice(1:10) %>%
ggplot() + geom_bar(aes(word, n), stat = "identity", fill = "#de5833") +
theme_minimal() +
coord_flip() +
labs(title = "Top Bigrams of Medium iOS App Reviews",
subtitle = "using Tidytext in R",
caption = "Data Source: itunesr - iTunes App Store")
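The same separate / filter / unite pattern extends to larger n. Here is a minimal trigram sketch on a made-up sentence (the toy text is an assumption for illustration; it is built from words that survived the stop-word removal earlier in this post, plus a few obvious stop words).

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical toy review, not taken from the itunesr extract
toy <- data.frame(txt = "I love reading articles on the Medium app",
                  stringsAsFactors = FALSE)

toy_trigrams <- toy %>%
  unnest_tokens(word, txt, token = "ngrams", n = 3) %>%
  # Split each trigram into its three words
  separate(word, c("word1", "word2", "word3"), sep = " ") %>%
  # Keep a trigram only if none of its words is a stop word
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !word3 %in% stop_words$word) %>%
  # Glue the words back into a single trigram column
  unite(word, word1, word2, word3, sep = " ")

toy_trigrams
```

Because a single stop word anywhere in the n-gram drops the whole row, fewer and fewer n-grams survive as n grows, which is one practical reason large values of n tend to be less useful on a small corpus.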
Summary
This particular exercise may not reveal many meaningful insights, since we started with little data. But the approach is really useful when you have a decent-sized text corpus: this simple analysis of unigrams and bigrams (n-gram analysis) can reveal something business-worthy, say in customer service, app development, or many other use cases.