The Problem
There’s one thing about R that is top of mind for a lot of people: the black-and-white plot of the iris dataset, which makes for a decidedly boring view of R. It’s boring because of its aesthetics, but also because it’s such a clichéd example, used over and over again. The other problem is finding the right dataset for the problem you want to teach, learn, or experiment with. Say you want to teach time series: your Spam / Ham classification dataset isn’t going to be of any use there.
Solution
No more worries. That’s where fakir comes in. fakir is an R package by Colin Fay (of ThinkR), who has contributed a great deal to the R community.
About fakir
As the documentation puts it, the goal of fakir is to provide fake datasets that can be used to teach R.
Installation and Loading
fakir can be installed from GitHub (it isn’t available on CRAN yet):
# install.packages("devtools")
devtools::install_github("ThinkR-open/fakir")
library(fakir)
Use-case: Clickstream / Web Data
Clickstream / web data is something a lot of organizations use in analytics these days, but it’s hard to get your hands on any, since no company wants to share theirs. There’s sample data in the Google Analytics test account, but that may not serve your purpose when learning data science in R or R’s ecosystem.
This is a typical case where fakir can help you:
library(tidyverse)
fakir::fake_visits() %>% head()
## # A tibble: 6 x 8
## timestamp year month day home about blog contact
## <date> <dbl> <dbl> <int> <int> <int> <int> <int>
## 1 2017-01-01 2017 1 1 NA 64 446 145
## 2 2017-01-02 2017 1 2 159 102 487 250
## 3 2017-01-03 2017 1 3 NA 59 479 433
## 4 2017-01-04 2017 1 4 123 202 601 109
## 5 2017-01-05 2017 1 5 362 162 311 378
## 6 2017-01-06 2017 1 6 NA 244 450 350
That’s how simple it is to get sample (tidy) clickstream data with fakir. Another thing worth mentioning: if you look at the fake_visits() documentation, you’ll find an argument that takes a seed value, which means you are in control of the randomization and can reproduce the data.
fake_visits(from = "2017-01-01", to = "2017-12-31", local = c("en_US", "fr_FR"),
seed = 2811) %>% head()
## # A tibble: 6 x 8
## timestamp year month day home about blog contact
## <date> <dbl> <dbl> <int> <int> <int> <int> <int>
## 1 2017-01-01 2017 1 1 NA 64 446 145
## 2 2017-01-02 2017 1 2 159 102 487 250
## 3 2017-01-03 2017 1 3 NA 59 479 433
## 4 2017-01-04 2017 1 4 123 202 601 109
## 5 2017-01-05 2017 1 5 362 162 311 378
## 6 2017-01-06 2017 1 6 NA 244 450 350
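The reproducibility claim is easy to check for yourself. The following is a minimal sketch, assuming fake_visits() is deterministic for a given seed, as its documentation suggests. The seed values 42 and 43 are arbitrary picks for illustration, not anything special to fakir.

```r
library(fakir)

# Same seed twice: the two tibbles should come out identical
visits_a <- fake_visits(seed = 42)
visits_b <- fake_visits(seed = 42)
identical(visits_a, visits_b)

# A different seed should, in all likelihood, give different visit counts
visits_c <- fake_visits(seed = 43)
identical(visits_a, visits_c)
```

Fixing the seed this way is handy in a classroom, since every student ends up looking at the same numbers.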
Use-case: French Data
Also, in the above call to fake_visits(), you might have noticed another argument, local, which lets you generate French data instead of English. In my personal opinion, this is crucial if you are on a mission to improve data literacy or democratise data science.
fake_ticket_client(vol = 10, local = "fr_FR") %>% head()
## # A tibble: 6 x 25
## ref num_client prenom nom job age region id_dpt departement
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 DOSS~ 79 Phili~ Bért~ Prof~ 18 Poito~ 86 Vienne
## 2 DOSS~ 69 Étien~ Dupo~ Char~ 42 Breta~ 22 Côtes-d'Ar~
## 3 DOSS~ 120 Roland Pasc~ Admi~ 34 Île-d~ 77 Seine-et-M~
## 4 DOSS~ 31 Noël Bena~ Cons~ 43 Poito~ 79 Deux-Sèvres
## 5 DOSS~ 59 Jean Pelt~ Ingé~ 46 Picar~ 80 Somme
## 6 DOSS~ 118 Adèle Pare~ <NA> 19 <NA> 41 Loir-et-Ch~
## # ... with 16 more variables: gestionnaire_cb <chr>, nom_complet <chr>,
## # entry_date <dttm>, points_fidelite <dbl>, priorite_encodee <dbl>,
## # priorite <fct>, timestamp <date>, annee <dbl>, mois <dbl>, jour <int>,
## # pris_en_charge <chr>, pris_en_charge_code <int>, type <chr>,
## # type_encoded <int>, etat <fct>, source_appel <fct>
In the above example, we’ve used another fakir function, fake_ticket_client(), which gives us a typical ticket dataset (like the one you’d get from ServiceNow or Zendesk).
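As a quick sketch of what one might do with such a ticket dataset, the snippet below counts tickets per request type. It relies on the column name type visible in the French output above. The value vol = 100 is an arbitrary size chosen for illustration, and dplyr is assumed to be installed.

```r
library(fakir)
library(dplyr)

# Generate a French ticket dataset (vol = 100 is an arbitrary size for illustration)
tickets <- fake_ticket_client(vol = 100, local = "fr_FR")

# Count tickets per request type, most frequent first
tickets_by_type <- tickets %>%
  count(type, sort = TRUE)

tickets_by_type
```

The same pattern should work for the other categorical columns shown above, such as etat or source_appel.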
Use-case: Scatter Plot
So, about the rant I made at the start of this post regarding iris (don’t get me wrong: I’ve got huge respect for the scientists who created this dataset; it’s just the misuse and overuse of it that I don’t appreciate): we can now get past it with fakir’s datasets.
fake_visits() %>%
  ggplot() +
  geom_point(aes(blog, about, color = as.factor(month)))
## Warning: Removed 51 rows containing missing values (geom_point).
(Perhaps not a good scatter plot for showing correlation, but hey, you can teach scatter plots without plotting Petal Length against Sepal Length.)
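The warning above comes from the NA values in the visit columns, and the timestamp column also makes this dataset a candidate for time series teaching. Here is a hedged sketch that reshapes the data to long format, drops the missing counts explicitly, and plots visits over time. The column names come from the fake_visits() output shown earlier. The names page, visits, and visits_long are made up for this example, and tidyr 1.0.0 or later is assumed for pivot_longer().

```r
library(fakir)
library(dplyr)
library(tidyr)
library(ggplot2)

# One row per day and page instead of one column per page
visits_long <- fake_visits() %>%
  pivot_longer(
    cols = c(home, about, blog, contact),
    names_to = "page",
    values_to = "visits"
  ) %>%
  # Drop missing counts up front rather than letting ggplot2 warn about them
  drop_na(visits)

# Daily visits over time, one line per page
ggplot(visits_long, aes(timestamp, visits, color = page)) +
  geom_line()
```

Dropping the NA rows before plotting makes the data loss a deliberate step that students can see and discuss, rather than a warning they learn to ignore.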
Summary
If you are in the business of teaching, or like experimenting, and don’t want to use clichéd datasets, fakir is a very nice package to get to know. As the author of fakir mentions in the package description, charlatan is another R package that helps generate meaningful fake data.