Scrape HTML Table using rvest

In this tutorial, we’ll scrape HTML tables from a Wikipedia page with rvest, clean up the data, and use it to build a few simple data visualization plots.

YouTube - https://youtu.be/KCUj7JQKOJA

Why?

As a Data Scientist or Data Analyst, the data you need is often not readily available, so it’s handy to know skills like web scraping to collect your own. Web scraping is a vast area; this tutorial focuses on one particular aspect of it: scraping (extracting) tables from web pages.

Code

library(tidyverse)
library(rvest)  # provides read_html() and html_table(); not attached by library(tidyverse)

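# Download and parse the Wikipedia page
content <- read_html("https://en.wikipedia.org/wiki/List_of_highest-grossing_films_in_the_United_States_and_Canada")

# Extract every <table> on the page as a list of data frames
tables <- content %>% html_table(fill = TRUE)

# Take the first table and drop its first row
first_table <- tables[[1]]

first_table <- first_table[-1,]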
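
To see what html_table() hands back without depending on the live Wikipedia page, here is a minimal sketch on a small inline HTML document. The HTML snippet, the film names and the numbers are made up for illustration, and the repeated header row is only an assumption about why a first row might need dropping.

```r
library(rvest)

# Hypothetical two-table page, built inline so the example needs no network access
page <- read_html('
  <table>
    <tr><th>Rank</th><th>Title</th><th>Lifetime gross</th></tr>
    <tr><td>Rank</td><td>Title</td><td>Lifetime gross</td></tr>
    <tr><td>1</td><td>Film A</td><td>$900,000,000</td></tr>
    <tr><td>2</td><td>Film B</td><td>$800,000,000</td></tr>
  </table>
  <table>
    <tr><th>Year</th><th>Adjusted gross</th></tr>
    <tr><td>1999</td><td>$1,000,000</td></tr>
  </table>
')

# html_table() on a whole document returns a list with one data frame per <table>
demo_tables <- page %>% html_table()
length(demo_tables)

# Pick the first table and drop its first row, which here repeats the header
demo_first <- demo_tables[[1]]
demo_first <- demo_first[-1, ]
demo_first
```

Indexing with [[1]], [[2]] and so on depends on the order of tables on the page, so it is worth printing each candidate before committing to one.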

library(janitor)

first_table <- first_table %>% clean_names()
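
As a small illustration of what clean_names() does to scraped headers, the sketch below uses a made-up data frame with a repeated column name. It is presumably this kind of duplication that produces a column like lifetime_gross_2 in the Wikipedia table, though the actual headers on the page may differ.

```r
library(janitor)

# Hypothetical data frame with messy, duplicated headers like a scraped table might have
messy <- data.frame(
  a = c("Film A", "Film B"),
  b = c("$900", "$800"),
  c = c("$950", "$850")
)
names(messy) <- c("Title", "Lifetime gross", "Lifetime gross")

# clean_names() lower-cases, replaces spaces with underscores,
# and makes duplicated names unique with a numeric suffix
cleaned <- clean_names(messy)
names(cleaned)
```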

# Top 20 films by lifetime gross, as a horizontal bar chart
first_table %>% 
  mutate(lifetime_gross = parse_number(lifetime_gross)) %>%  # strip "$" and "," and convert to numeric
  arrange(desc(lifetime_gross)) %>% 
  head(20) %>% 
  mutate(title = fct_reorder(title, lifetime_gross)) %>%  # order the bars by gross
  ggplot() + geom_col(aes(y = title, x = lifetime_gross), fill = "blue") +
  labs(title = "Top 20 Grossing movies in US and Canada",
       caption = "Data Source: Wikipedia")
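
The two helpers doing the real work in this pipeline are parse_number() and fct_reorder(). A minimal sketch of each on made-up values (the film names and figures below are hypothetical, not taken from the Wikipedia table):

```r
library(readr)
library(forcats)

# Scraped money columns arrive as text; parse_number() drops "$" and "," and returns numbers
gross_text <- c("$936,662,225", "$858,373,000", "$760,507,625")
gross <- parse_number(gross_text)
gross

# fct_reorder() turns the titles into a factor whose levels are sorted by gross (ascending),
# which is what controls the order of the bars in ggplot
titles <- c("Film A", "Film B", "Film C")
titles_ordered <- fct_reorder(titles, gross)
levels(titles_ordered)
```

Because ggplot draws the first factor level at the bottom of the y axis, ascending order puts the highest-grossing film at the top of the chart.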



# Same chart for lifetime_gross_2 (the _2 suffix is what clean_names() gives a repeated column name)
first_table %>% 
  mutate(lifetime_gross_2 = parse_number(lifetime_gross_2)) %>% 
  arrange(desc(lifetime_gross_2)) %>% 
  head(20) %>% 
  mutate(title = fct_reorder(title, lifetime_gross_2)) %>% 
  ggplot() + geom_col(aes(y = title, x = lifetime_gross_2), fill = "blue") +
  labs(title = "Top 20 Grossing movies in US and Canada",
       caption = "Data Source: Wikipedia")



# The second table on the page
second_table <- tables[[2]]

second_table <- second_table %>% 
  clean_names()


# Total adjusted gross per year, as a line chart
second_table %>% 
  mutate(adjusted_gross = parse_number(adjusted_gross)) %>% 
  group_by(year) %>% 
  summarise(total_adjusted_gross = sum(adjusted_gross)) %>% 
  arrange(desc(total_adjusted_gross)) %>% 
  ggplot() + geom_line(aes(x = year, y = total_adjusted_gross, group = 1))
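
The group_by() and summarise() step above can be checked on a tiny made-up table before running it on scraped data. The values below are hypothetical, and na.rm = TRUE is an assumption added here so that a cell that fails to parse does not turn a whole year's total into NA.

```r
library(dplyr)
library(readr)

# Hypothetical stand-in for the scraped second table
demo <- tibble(
  year = c(1977, 1977, 1997),
  adjusted_gross = c("$1,000", "$2,500", "$4,000")
)

# Convert the text amounts to numbers, then add them up per year
yearly <- demo %>%
  mutate(adjusted_gross = parse_number(adjusted_gross)) %>%
  group_by(year) %>%
  summarise(total_adjusted_gross = sum(adjusted_gross, na.rm = TRUE))
yearly
```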
 