Penguins Dataset Overview - an iris alternative in R

If there's one dataset that data scientists and data analysts reach for most while learning or teaching something, it's either iris (more so among R users) or titanic (more so among Python users). The iris dataset isn't the most used just because it's easily accessible, but because it can be used to demonstrate many data science concepts like correlation, regression and classification. The objective of this post is to introduce you to the penguins dataset and get you started with a few code snippets so that you can take off yourself!
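As a quick start, here's a minimal sketch that loads the data via the {palmerpenguins} package (which ships the dataset as a `penguins` data frame) and runs one of the classic demos, a correlation:

```r
# install.packages("palmerpenguins")
library(palmerpenguins)

# first look at the dataset
head(penguins)
str(penguins)

# a simple correlation demo, dropping the rows with NAs
cor(penguins$bill_length_mm, penguins$flipper_length_mm,
    use = "complete.obs")
```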

ggplot2 Text Customization with ggtext | Data Visualization in R

ggplot2 is the go-to R package for anyone who wants to make beautiful static visualizations in R. But most ggplot2 plots look almost the same, and few data analysts or data scientists care about customizing them, primarily because it's not very intuitive to do so. That's where ggplot2 extensions come in very handy. ggtext is an R package (by Claus O. Wilke) that helps in customizing the text present in ggplot2 plots.
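A minimal sketch of the idea: ggtext's `element_markdown()` lets a plot title carry markdown/HTML styling (the dataset and styling here are just illustrative):

```r
library(ggplot2)
library(ggtext)

ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point() +
  labs(
    title = "Engine **displacement** vs <span style='color:steelblue'>highway</span> mileage"
  ) +
  # element_markdown() tells ggplot2 to render the title as markdown/HTML
  theme(plot.title = element_markdown())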


Despite all the memes around Microsoft Excel, Excel is still a powerful tool for quick data transformation and preprocessing (forget about the date thing 😉). This post is my attempt to show an Excel person how they can replicate some of their most frequently used operations, like VLOOKUP (Fuzzy), using R. We perform VLOOKUP's approximate match first in Excel and then replicate the same task in RStudio using stringdist_left_join(), the fuzzy left join from the R package {fuzzyjoin}.
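A sketch of the fuzzy-join idea on hypothetical data (the post's own dataset will differ): misspelt country names get matched against a clean reference table within an edit distance:

```r
library(dplyr)
library(fuzzyjoin)

# hypothetical tables: messy names on the left, clean lookup on the right
sales     <- tibble(country = c("Inda", "Brasil", "Germny"),
                    revenue = c(10, 20, 30))
reference <- tibble(country   = c("India", "Brazil", "Germany"),
                    continent = c("Asia", "South America", "Europe"))

# approximate match, allowing up to 2 edits - VLOOKUP's fuzzy cousin
stringdist_left_join(sales, reference, by = "country", max_dist = 2)
```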

Analyse Google Trends Search Data in R using {gtrendsR}

As popular as Google is for searching information on the web, Google can also provide information (metadata) about those searches. Google search insights can be extremely helpful in marketing analytics, market research, understanding customer demands and trends, and so on. While market researchers or anyone interested in Google search information would usually go to the Google Trends website, as a programmer you can extract the same information programmatically, to use in your data science workflow or to set up an automation.
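A minimal sketch with {gtrendsR} (keyword, region and time window here are just examples):

```r
library(gtrendsR)

# search interest for "data science" in India over the last 12 months
trends <- gtrends(keyword = "data science", geo = "IN", time = "today 12-m")

# interest over time comes back as a tidy data frame
head(trends$interest_over_time)

# gtrendsR also ships a plot method for a quick look
plot(trends)
```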

GET Hackernews Front Page Results using REST API in R

Whenever we talk about data collection, we usually think about web scraping. What's often forgotten is that a lot of websites / web apps offer an API to access their data the right way. This video tutorial explains how you can use the httr package to make GET requests (REST API calls) to collect data from Hacker News, a very popular website for tech news. The objective of this post is to outline how to use httr's GET() to start making REST API calls, and also how to parse the response object and extract the desired data.
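A sketch of the pattern, assuming Hacker News' public Firebase API endpoints (the video may use a different endpoint):

```r
library(httr)

# top story IDs from Hacker News' public API
res <- GET("https://hacker-news.firebaseio.com/v0/topstories.json")
stop_for_status(res)
ids <- content(res, as = "parsed")

# fetch the details of the first story and pull out its title
story <- content(
  GET(paste0("https://hacker-news.firebaseio.com/v0/item/", ids[[1]], ".json")),
  as = "parsed"
)
story$title
```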

Automated Programmatic Website Screenshots in R with {webshot} [Video Tutorial]

In this video tutorial, we explore the R package {webshot} by Winston Chang. This package internally uses PhantomJS to capture screenshots of web pages / websites, Shiny applications and R Markdown documents. {webshot} also lets you take a screenshot of a particular viewport, or of a section of a website selected by a CSS selector. Youtube: https://youtu.be/oQKwd1cgiq4 Please subscribe and leave a comment if you have any feedback. I'm new to video making, so any suggestions or feedback to improve would be greatly appreciated!
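A minimal sketch (the URL and the `.sidebar` selector are just placeholders):

```r
library(webshot)
# webshot needs PhantomJS installed once:
# webshot::install_phantomjs()

# full-page screenshot
webshot("https://www.r-project.org/", file = "r-project.png")

# only a section of the page, picked by a CSS selector
webshot("https://www.r-project.org/", file = "sidebar.png",
        selector = ".sidebar")
```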

How to do Excel VLOOKUP in R (using left_join, merge)

This tutorial helps you code Excel's VLOOKUP (Exact Match) functionality in R using dplyr's left_join() and Base R's merge(). Please let me know your feedback if this helps Excel users try out R and get confident about doing data analytics in R. Youtube - https://www.youtube.com/watch?v=GsxlOwa4dSg Video Tutorial Code

```r
# library tidyverse for data manipulation and plot
library(tidyverse)
# reading input dataset
co2 <- read_csv("C:/users/abdrs/Downloads/food_consumption.csv")
countries <- read_csv("C:/users/abdrs/Downloads/Countries-Continents.
```

Easy ggplot2 Theme customization with {ggeasy}

In this post, we'll learn about {ggeasy}, an R package by Jonathan Carroll. The goal of {ggeasy} is to help R programmers make ggplot2 theme customizations with simple, plain-English functions (much easier than playing with theme()). We use a dataset generated by {fakir} for this tutorial. Youtube: https://youtu.be/iAH1GJoBZmI Video Tutorial Code

```r
library(fakir)
library(tidyverse)
library(ggeasy)

# generate dataset
clients <- fakir::fake_ticket_client(100)

# rotate x axis labels
clients %>% count(state) %>%
  ggplot() + geom_col(aes(state, n)) +
  easy_rotate_x_labels()

# color the text and increase text size
clients %>% count(state) %>%
  ggplot() + geom_col(aes(n, state), fill = "orange") +
  easy_text_color("orange") +
  easy_text_size(25, teach = TRUE)

# move legend position
clients %>% count(state, source_call) %>% # View()
  ggplot() + geom_col(aes(n, state, fill = source_call)) +
  # easy_move_legend("bottom", teach = TRUE)
  theme(legend.
```

How to make PowerPoint Slides (PPT) using RStudio in 2020

The reason I wanted to make this short tutorial is that there are a lot of old tutorials available on the Internet to help you make a PowerPoint using R. But some time back, RStudio made a new option available that makes it extremely easy and simple to create a PowerPoint presentation with R outputs in it. Check it yourself: + Go to File -> New File -> R Markdown + Then choose Presentation -> PowerPoint
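The skeleton RStudio generates boils down to an R Markdown file whose YAML header targets PowerPoint; something like this (title and content are placeholders):

```markdown
---
title: "My Slides"
output: powerpoint_presentation
---

# First slide

- Any R chunk's output (plots, tables) renders straight into the .pptx
```

Knitting this file (the Knit button, or `rmarkdown::render()`) produces the .pptx directly.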

Sentiment Analysis in R with {sentimentr} that handles Negation (Valence Shifters)

Sentiment Analysis is one of the most sought-after and widely used NLP techniques. Companies like to see what their customers are talking about - for instance, if there's a new product launch, what's the feedback about it. Wherever you've got natural language - social media, community pages, customer support - Sentiment Analysis as a technique has found its home there. While the technique itself is in high demand, Sentiment Analysis is one of the NLP fields that's far from super-accurate, the reason being the many different ways humans talk.
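A minimal sketch of {sentimentr}'s valence-shifter handling - note how the negation flips the polarity of the second sentence:

```r
library(sentimentr)

# "not" is a valence shifter, so the second sentence scores negative
sentiment(c("I am happy", "I am not happy"))

# average sentiment per element of text
sentiment_by("The product is great. The delivery was not great.")
```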

Android Smartphone Analysis in R [Code + Video]

In this post, we'll learn how to analyse your Android smartphone usage data. Steps: + Download your MyActivity data from Google Takeout - https://takeout.google.com/ (after selecting the JSON format instead of the default HTML format) + When the download is available, save the zip file and unzip it to find MyActivity.json inside the last level of the folder + Create a new R project (using RStudio) with the MyActivity.json file in the project folder
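The steps above can be sketched as follows; the column names (`time`, `title`) are assumptions based on typical Takeout exports and may differ in yours:

```r
library(jsonlite)

# read the Takeout export sitting in the project folder
activity <- fromJSON("MyActivity.json")

# parse the ISO-8601 timestamp so we can aggregate by day/hour later
activity$time <- as.POSIXct(activity$time,
                            format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC")
head(activity$title)
```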

Need a good RStudio Alternative? Try VSCode R

In this tutorial, you'll learn to use VS Code (Visual Studio Code from Microsoft) as an alternative IDE to the most popular one, RStudio. You'll start by enabling R programming within your VS Code environment, and also briefly look at some features and shortcuts of VSCode R. Youtube - https://www.youtube.com/watch?v=ZFGt9LyijhM Github Plugin - https://github.com/Ikuyadeu/vscode-R

Scrape HTML Table using rvest

In this tutorial, we'll see how to scrape an HTML table from Wikipedia and process the data to find insights in it (or, naively, to build a data visualization plot). Youtube - https://youtu.be/KCUj7JQKOJA Why? Most of the time, as a Data Scientist or Data Analyst, your data may not be readily available, hence it's handy to know skills like web scraping to collect your own data. While web scraping is a vast area, this tutorial focuses on one particular aspect of it: scraping or extracting tables from web pages.
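A sketch of the rvest pattern (the Wikipedia page here is just an example target; the tutorial's page may differ):

```r
library(rvest)
library(dplyr)

url  <- "https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)"
page <- read_html(url)

# html_table() parses every <table> on the page into a list of data frames
tables <- html_table(page, fill = TRUE)
pop    <- tables[[1]]
glimpse(pop)
```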

Quick Intro to Reproducible Example in R with reprex

This video quickly introduces you to an amazing R package called reprex that helps in generating reproducible examples, which can be useful in a lot of places - Github issues, Stack Overflow questions and answers, the R-devel mailing list - or simply to share your problem with someone, or for teaching! Link: https://www.youtube.com/watch?v=hnzrDLf9anw
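The basic workflow is tiny:

```r
library(reprex)

# 1. copy the problematic code to the clipboard, then:
reprex()
# 2. a rendered, paste-ready snippet (code + output) lands back
#    on the clipboard, formatted for GitHub/Stack Overflow
```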

How to Connect RStudio with Git (Github)

This video explains how to connect RStudio with Git (Github) for a better R programming / software development workflow. It could be as big as updating a package or as simple as managing a small repo. The video also shows how you can clone a repo, commit a change and push it back to its master branch on Github. Youtube Link: https://www.youtube.com/watch?v=lXwH2R4n3RQ

How to use bootstraplib's Live Theme Previewer to customize Shiny apps?

One of the announcements at rstudio::conf 2020 that caught my eye is a brand new package, {bootstraplib} - https://github.com/rstudio/bootstraplib/ . It's another open-source contribution from RStudio (now a PBC). {bootstraplib} provides tools for theming Shiny and R Markdown from R via Bootstrap (3 or 4) Sass. If you're not aware of Bootstrap, it's one of the most popular (open-source) CSS frameworks on the web. While {bootstraplib} has got a lot of things, this post is going to be about the one thing I love the most: the "Live Theme Preview", where we can simply edit the values of CSS properties and see the change in real time.

How to create Bar Race Animation Charts in R

Bar race animation charts have started going viral on social media, leaving a lot of data enthusiasts wondering how these charts are made. The objective of this post is to explain how to build such bar race animation charts using R, with the power of its versatile packages. Packages The packages required to build animated plots in R are: ggplot2 and gganimate. While those two are the essential packages, we have also used the entire tidyverse, plus janitor and scales, in this project for data manipulation, cleaning and formatting.
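The core recipe, sketched on the {gapminder} example data (the post's own dataset will differ): rank the values within each time step, keep the top N, and let gganimate tween between the steps:

```r
library(ggplot2)
library(gganimate)
library(dplyr)
library(gapminder)   # stand-in data for this sketch

ranked <- gapminder %>%
  group_by(year) %>%
  mutate(rank = rank(-pop)) %>%   # rank countries by population each year
  filter(rank <= 10) %>%
  ungroup()

p <- ggplot(ranked, aes(x = rank, y = pop, fill = country)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_x_reverse() +   # rank 1 at the top
  transition_states(year, transition_length = 4, state_length = 1) +
  labs(title = "Population, year: {closest_state}")

animate(p)
```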

How to reorder arrange bars with in each Facet of ggplot

One of the problems we usually face with ggplot2 is rearranging bars in ascending or descending order. If that problem is solved using reorder() or fct_reorder(), the next problem arises when we have facets and want to order the bars within each facet. Recently I came across the function reorder_within() from the package tidytext (thanks to Julia Silge and Tyler Rinker, who created this solution originally). Example code:

```r
library(tidyr)
library(ggplot2)

iris_gathered <- gather(iris, metric, value, -Species)

ggplot(iris_gathered, aes(reorder(Species, value), value)) +
  geom_bar(stat = 'identity') +
  facet_wrap(~ metric)
```

As you can see above, the bars in the last facet aren't ordered properly.
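Continuing from the snippet above, this is the standard reorder_within() fix: order Species separately inside each facet, and strip the internal label suffix with scale_x_reordered():

```r
library(tidytext)

ggplot(iris_gathered,
       aes(reorder_within(Species, value, metric), value)) +
  geom_bar(stat = 'identity') +
  scale_x_reordered() +                    # cleans up the suffixed labels
  facet_wrap(~ metric, scales = "free_x")  # each facet gets its own order
```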

Kannada MNIST Prediction Classification using H2O AutoML in R

The Kannada MNIST dataset is another MNIST-type digits dataset, for the Kannada (Indian) language. All details of the dataset curation have been captured in the paper titled "Kannada-MNIST: A new handwritten digits dataset for the Kannada language" by Vinay Uday Prabhu. The author's Github repo can be found here. The objective of this post is to demonstrate how to use h2o.ai's AutoML function to quickly get a (better) baseline. This also demonstrates how AutoML tools help democratize the machine learning model building process.
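A minimal sketch of the h2o AutoML flow; the file name and the `"label"` column are assumptions about the CSV layout:

```r
library(h2o)
h2o.init()

# hypothetical file; a MNIST-style CSV with a "label" column
train <- h2o.importFile("train.csv")
train$label <- h2o.asfactor(train$label)  # classification, not regression

aml <- h2o.automl(
  y = "label",
  training_frame = train,
  max_models = 10,
  seed = 1
)

# ranked list of every model AutoML tried
aml@leaderboard
```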

Handling Missing Values in R using tidyr

In this post, we'll see 3 functions from tidyr that are useful for handling missing values (NAs) in a dataset. Please note: this post isn't going to be about missing value imputation. tidyr According to the documentation of tidyr, the goal of tidyr is to help you create tidy data. Tidy data is data where: + Every column is a variable. + Every row is an observation. + Every cell is a single value.
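Assuming the three functions are tidyr's usual NA helpers, here's a quick sketch of each on a toy data frame:

```r
library(tidyr)

df <- tibble::tibble(
  day   = 1:4,
  value = c(10, NA, NA, 25)
)

drop_na(df)                           # remove rows containing NAs
fill(df, value, .direction = "down")  # carry the last observation forward
replace_na(df, list(value = 0))       # replace NAs with a fixed value
```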

Functional Programming + Iterative Web Scraping in R

Web Scraping in R Web scraping needs no introduction among data enthusiasts. It's one of the most viable and essential ways of collecting data when the data itself isn't available. Knowing web scraping comes in very handy when you are short of data, need macroeconomic indicators, or simply have no dataset available for a particular project - like building a Word2vec / language model with a custom text dataset.
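A sketch of the functional-programming angle: wrap one page's scrape in a function, then iterate it over many URLs with purrr (the URL pattern and CSS selector here are hypothetical):

```r
library(rvest)
library(purrr)

# hypothetical: the same listing spread over several pages
urls <- paste0("https://example.com/listings?page=", 1:5)

scrape_page <- function(url) {
  read_html(url) %>%
    html_nodes(".listing-title") %>%   # hypothetical selector
    html_text()
}

# possibly() keeps one bad page from killing the whole run
titles <- map(urls, possibly(scrape_page, otherwise = character(0)))
```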

Hindi and Other Languages in India based on 2001 census

India is the world's largest democracy and, as it goes, also a highly diverse place. This is my attempt to see how "Hindi" and other languages are spoken in India. In this post, we'll see how to collect data for this puzzle - directly from Wikipedia - and how we're going to visualize it, highlighting the insight. Data Wikipedia is a great source for data like this - languages spoken in India - and because Wikipedia lists these tables as HTML <table> elements, it becomes much easier for us to use rvest::html_table() to extract the table as a dataframe without much hassle.

Regex Problem? Here's an R package that will write Regex for you

REGEX is that thing that scares almost everyone, almost all the time. Hence, finding an alternative is always very helpful, and peaceful too. Here's a nice R package that helps us do REGEX without knowing REGEX. REGEX This is the REGEX pattern to test the validity of a URL: ^(http)(s)?(\:\/\/)(www\.)?([^\ ]*)$ A typical regular expression contains characters ( http ) and meta characters ([]). The combination of these two forms a meaningful regular expression for a particular task.

How to do Tamil Text Analysis & NLP in R

udpipe is a beautiful R package for text analytics and NLP, and helps in topic extraction. While most text analytics resources online cover only English, this post picks up a different language - Tamil - and fortunately, udpipe has got a Tamil language model. Loading library(udpipe) Tamil Text Below is an excerpt from a Tamil movie review: text <- data.frame(tamil = "கரு - கோமாவால் 16 வருட வாழக்கையை இழந்தவன் மனிதத்தை இந்த கால மனிதர்களுக்கு நினைவுபடுத்து தான் கோமாளி படத்தின் கரு.

How to scrape Zomato Restaurants Data in R

Zomato is a popular restaurant listing website in India (similar to Yelp), and people are always interested in seeing how to download or scrape Zomato restaurants data for data science and visualizations. In this post, we'll learn how to scrape / download Zomato restaurants (buffets) data using R. I also hope this post serves as a basic web scraping framework / guide for any such task of building a new dataset from the internet.

Combining the power of R and Python with reticulate

R + Py In the world of R vs Python fights, this is a simple (you could call it naive, too) attempt to show how we can combine the power of Python with R and create a new superpower - like the Incredibles, if you have watched it before! About this Dataset This dataset contains a bunch of tweets that came with the tag #JustDoIt after Nike released the ad campaign with Colin Kaepernick that turned controversial.
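A minimal sketch of what {reticulate} makes possible - calling Python from an R session (assumes a Python installation with numpy available):

```r
library(reticulate)

# import a Python module and call it like an R object
np <- import("numpy")
np$mean(c(1, 2, 3, 4))

# or run Python code inline and pull the result back into R
py_run_string("x = [n ** 2 for n in range(5)]")
py$x
```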

How to do Topic Extraction from Customer Reviews in R

Topic extraction is an integral part of Information Extraction (IE) from a corpus of text, to understand the key things the corpus is talking about. While this can be achieved naively using unigrams and bigrams, a more intelligent way of doing it - with an algorithm called RAKE - is what we're going to see in this post. Udpipe udpipe is an NLP-focused R package created and open-sourced by the organization BNOSAC.
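A sketch of RAKE via udpipe's keywords_rake(); it assumes the reviews have already been annotated with udpipe_annotate() and turned into a data frame `x` with columns like doc_id, lemma and upos:

```r
library(udpipe)

# RAKE keyword extraction over nouns and adjectives
keyw <- keywords_rake(
  x,
  term      = "lemma",
  group     = "doc_id",
  relevant  = x$upos %in% c("NOUN", "ADJ"),
  ngram_max = 2
)

# highest-scoring key phrases first
head(keyw[order(-keyw$rake), ])
```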

3 tidyverse tricks for most commonly used Excel Features

In this post, we're simply going to see a few tricks that could help improve your tooling using {tidyverse}. Create a difference variable between the current value and the next value This is also known as lead and lag - especially in a time series dataset, this variable becomes very important in feature engineering. In Excel, this is simply done by creating a new formula field that subtracts the next cell from the current cell (or the current cell from the previous cell) and dragging the formula down to the last cell.
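The dplyr equivalent of that drag-down formula, on a toy series:

```r
library(dplyr)

sales <- tibble(month = 1:5, revenue = c(100, 120, 90, 150, 160))

sales %>%
  mutate(
    next_revenue = lead(revenue),            # value from the next row
    prev_revenue = lag(revenue),             # value from the previous row
    diff_to_next = lead(revenue) - revenue   # the Excel drag-down difference
  )
```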

How to Automate EDA with DataExplorer in R

EDA (Exploratory Data Analysis) is one of the key steps in any Data Science project. The better the EDA, the better the feature engineering can be. From modelling to communication, EDA has many hidden benefits that aren't often emphasised when Data Science is taught to beginners. The Problem That said, EDA is also one of the areas of the Data Science pipeline where a lot of manual code is written for different types of plots and different types of inference.
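A quick sketch of what automation looks like with {DataExplorer} (iris is just the stand-in dataset):

```r
library(DataExplorer)

# one-liner: a full automated EDA report (HTML) for any data frame
create_report(iris)

# or the individual pieces
plot_intro(iris)        # dataset overview
plot_missing(iris)      # missing-value profile
plot_correlation(iris)  # correlation heatmap
```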

Do you love Data Science? I mean, the Data part in it

Last week, we talked all about Artificial Intelligence (also Artificial Stupidity), which led me to think about the foundation of Data Science: the data itself. I think data is the least appreciated entity in the Data Science value chain. You might agree with me if you do Data Science outside competitive platforms like Kaggle, where the data given to you is what most Data Scientists can only dream of in their jobs.

How to generate meaningful fake data for learning, experimentation and teaching

The Problem There's one thing about R that a lot of people have top of mind: the black-and-white plot of the iris dataset, which is definitely a boring view of R. It's boring because of the aesthetics, but also because it's such a clichéd example, used over and over again. The other problem is finding the right dataset for the particular problem you want to teach / learn / experiment with.

Extract Top Reddit Posts of #rstats in 3 lines of R Code

This post is kept (literally) minimal to demonstrate how simple this hack is using R (of course, it could be just as simple in other languages too). This is also to establish that R has use-cases beyond statistics and data mining. Objective The rstats subreddit is one of the popular sources of R-related information / discussion on the internet. We're trying to extract the top posts of the rstats subreddit. Data Format Lucky for us, Reddit offers a JSON file for every subreddit (and every post), and we'll use that here.
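A sketch of the idea (Reddit may rate-limit requests without a user agent, so treat this as illustrative; the nesting path into the parsed list is based on Reddit's usual JSON shape):

```r
library(jsonlite)

# any subreddit is also available as JSON; top posts of r/rstats
top <- fromJSON("https://www.reddit.com/r/rstats/top.json")

# post titles live a few levels deep in the parsed structure
head(top$data$children$data$title)
```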

How to make Square (Pie) Charts for Infographics in R

Are you looking for a unique way of visualizing your numbers, instead of simply using bar charts - which could bore your audience if used slide after slide? Here's the square pie / waffle chart for you. A waffle chart, or as it goes technically, a square pie chart, is just a pie chart that uses squares instead of circles to represent percentages. So it's good to keep in mind that this works best for percentages.
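A minimal sketch with the {waffle} package (the group names and numbers are made up; they sum to 100 so each square reads as 1%):

```r
library(waffle)

# each square = 1 unit; with parts summing to 100, a square = 1%
waffle(c(`Group A` = 45, `Group B` = 35, `Group C` = 20),
       rows  = 10,
       title = "A Square Pie / Waffle Chart")
```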

How to create unigrams, bigrams and n-grams of App Reviews

This is one of the frequent questions I've heard from first-time NLP / text analytics programmers (or, as the world likes to call them, "Data Scientists"). Prerequisite For simplicity, this post assumes that you already know how to install a package, and so you've got tidytext installed on your machine: install.packages("tidytext") Loading the Library Let's start by loading the tidytext library: library(tidytext) Extracting App Reviews We'll use the R package itunesr to download iOS app reviews, on which we'll perform simple text analysis (unigrams, bigrams, n-grams).
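The tokenization itself looks like this; the `reviews` data frame here is a hypothetical stand-in for the reviews pulled via itunesr:

```r
library(tidytext)
library(dplyr)

# hypothetical stand-in for downloaded app reviews
reviews <- tibble(text = c("love the new update",
                           "the app keeps crashing on launch"))

reviews %>% unnest_tokens(word, text)                            # unigrams
reviews %>% unnest_tokens(bigram,  text, token = "ngrams", n = 2)
reviews %>% unnest_tokens(trigram, text, token = "ngrams", n = 3)
```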

Interactive Visualization in R with apexcharter

Interactive visualizations are powerful these days because they are made for the web - simply a combination of HTML, CSS and JavaScript. This has paved the way for a lot of JavaScript charting libraries, like highcharts.js and apexcharts.js. Thanks to R's htmlwidgets, many R developers have started porting those JavaScript charting libraries to R, and dreamRs is one such leading developer group working at the intersection of R + Web.

Programmatically extract TIOBE Index Ratings

The TIOBE Index is an index (ranking) that claims to represent the popularity of programming languages. Yihui (the creator of the blogdown package) recently wrote a blog post titled "On TIOBE Index and the era of decision fatigue", and I strongly recommend going through it before continuing with this post. So the disclaimer goes like this: this post / author doesn't believe that the TIOBE Index is a fair way to measure / present the popularity of programming languages; this is written just to teach you how to extract / get the TIOBE Index programmatically using the R package tiobeindexr.

How to reshape a dataframe from wide to long or long to wide format

Reshaping a dataframe / table from long to wide format, or wide to long, is one of the daily tasks a Data Analyst / Data Scientist does. The long format is similar to the tidy format that the tidyverse advocates. Even though it's a very common task, the tidyr solution of using spread() and gather() was almost never intuitive enough to use in code without searching Stack Overflow or referring to the documentation.
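For context, here's the newer pivot pair that replaced spread()/gather(), on a toy table:

```r
library(tidyr)

wide <- tibble::tibble(country = c("A", "B"),
                       `2019`  = c(1, 3),
                       `2020`  = c(2, 4))

# wide -> long (the replacement for gather())
long <- pivot_longer(wide, cols = -country,
                     names_to = "year", values_to = "value")

# long -> wide (the replacement for spread())
pivot_wider(long, names_from = "year", values_from = "value")
```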

Find out Bulk Email ID Reputations Risk using R

If you are working in InfoSec / cyber security, one of the things that might be part of your day job is filtering email to remove spam / phishing. While this could be done at several levels and in several ways, monitoring the email ID (like abc@xyz.com) and validating its reputation - to see if it seems risky / suspicious or authentic before allowing it to reach the user's inbox - is one of the solid approaches (while also being error-prone, with false positives).

How to do negation-proof sentiment analysis in R

Sentiment Analysis is one of those things in machine learning that's still improving with the rise of deep-learning-based NLP solutions. Many things - like sarcasm and negation - make Sentiment Analysis a rather tough nut to crack. Deep learning, as effective as it is, is also computationally expensive, and if you are ready to trade off between cost (expense) and accuracy, then this is the solution for building a negation-proof Sentiment Analysis solution (in R).