r-bloggers

How to reorder/arrange bars within each Facet of ggplot

One of the problems we usually face with ggplot is rearranging the bars in ascending or descending order. If that problem is solved using reorder() or fct_reorder(), the next problem is when we have facets and want to order the bars within each facet. Recently I came across the function reorder_within() from the package tidytext (thanks to Julia Silge and Tyler Rinker, who created this solution originally). Example Code:
library(tidyr)
library(ggplot2)
iris_gathered <- gather(iris, metric, value, -Species)
ggplot(iris_gathered, aes(reorder(Species, value), value)) +
  geom_bar(stat = 'identity') +
  facet_wrap(~ metric)
As you can see above, the bars in the last facet aren’t ordered properly.
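The fix the post points to can be sketched roughly like this, assuming tidytext’s reorder_within() and scale_x_reordered() together with free x scales per facet:
library(tidyr)
library(ggplot2)
library(tidytext)

iris_gathered <- gather(iris, metric, value, -Species)

ggplot(iris_gathered,
       aes(reorder_within(Species, value, metric), value)) +
  geom_col() +
  scale_x_reordered() +                     # strips the internal "___metric" suffix from the labels
  facet_wrap(~ metric, scales = "free_x")   # free scales so each facet keeps its own ordering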

Kannada MNIST Prediction Classification using H2O AutoML in R

The Kannada MNIST dataset is another MNIST-type digits dataset, for the Kannada (Indian) language. All details of the dataset curation have been captured in the paper titled “Kannada-MNIST: A new handwritten digits dataset for the Kannada language” by Vinay Uday Prabhu. The GitHub repo of the author can be found here. The objective of this post is to demonstrate how to use h2o.ai’s automl function to quickly get a (better) baseline. This also proves a point about how these AutoML tools help democratize the Machine Learning model-building process.
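A minimal sketch of that AutoML baseline idea (not the post’s exact code); the CSV file names and the "label" target column are placeholders/assumptions for the Kannada-MNIST data:
library(h2o)
h2o.init()

train <- h2o.importFile("kannada_mnist_train.csv")   # placeholder path
test  <- h2o.importFile("kannada_mnist_test.csv")    # placeholder path
train$label <- as.factor(train$label)                # assumed target column, cast for classification

aml <- h2o.automl(y = "label",
                  training_frame = train,
                  max_runtime_secs = 300)             # small time budget for a quick baseline
print(aml@leaderboard)                                # models ranked by cross-validated metric
preds <- h2o.predict(aml@leader, test)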

Handling Missing Values in R using tidyr

In this post, we’ll see 3 functions from tidyr that are useful for handling missing values (NAs) in a dataset. Please note: this post isn’t going to be about missing value imputation. tidyr According to the documentation of tidyr, the goal of tidyr is to help you create tidy data. Tidy data is data where:
+ Every column is a variable.
+ Every row is an observation.
+ Every cell is a single value.
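The teaser doesn’t name the three functions; assuming they are tidyr’s usual NA helpers drop_na(), fill() and replace_na(), a quick sketch on toy data looks like this:
library(tidyr)
library(dplyr)

df <- tibble(day   = 1:5,
             sales = c(100, NA, NA, 120, NA))

drop_na(df, sales)                       # drop rows where sales is NA
fill(df, sales, .direction = "down")     # carry the last observed value forward
replace_na(df, list(sales = 0))          # replace NAs with a fixed value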

Functional Programming + Iterative Web Scraping in R

Web Scraping in R Web scraping needs no introduction among Data enthusiasts. It’s one of the most viable and essential ways of collecting data when the data itself isn’t available. Knowing web scraping comes in very handy when you are short of data, need macroeconomic indicators, or simply have no data available for a particular project - like building a Word2vec / language model with a custom text dataset.
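A rough sketch of the functional-programming angle, assuming purrr + rvest; the paginated URL pattern and the CSS selector are hypothetical placeholders:
library(purrr)
library(rvest)

pages <- paste0("https://example.com/listings?page=", 1:5)   # hypothetical paginated listing

scrape_page <- function(url) {
  page <- read_html(url)
  data.frame(
    title = page %>% html_nodes(".listing-title") %>% html_text(trim = TRUE),  # hypothetical selector
    stringsAsFactors = FALSE
  )
}

# map_dfr() applies the scraper to every URL and row-binds the results into one data frame
all_listings <- map_dfr(pages, scrape_page)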

Hindi and Other Languages in India based on 2001 census

India is the world’s largest democracy and, as it goes, also a highly diverse place. This is my attempt to see how “Hindi” and other languages are spoken in India. In this post, we’ll see how to collect data for this relevant puzzle - directly from Wikipedia - and how we’re going to visualize it, highlighting the insight. Data Wikipedia is a great source for data like this (Languages spoken in India), and because Wikipedia lists these tables as HTML <table> elements, it becomes quite easy for us to use rvest::html_table() to extract the table as a dataframe without much hassle.
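A minimal sketch of that extraction step; the Wikipedia URL and the table index are assumptions to verify against the live page:
library(rvest)

url  <- "https://en.wikipedia.org/wiki/Languages_of_India"   # assumed article
page <- read_html(url)

tables    <- html_table(page, fill = TRUE)   # every <table> on the page as a list of data frames
languages <- tables[[2]]                     # pick the relevant table after inspecting the list
head(languages)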

Regex Problem? Here's an R package that will write Regex for you

REGEX is that thing that scares everyone almost all the time. Hence, finding some alternative is always very helpful and peaceful too. Here’s a nice R package that helps us do REGEX without knowing REGEX. REGEX This is the REGEX pattern to test the validity of a URL: ^(http)(s)?(\:\/\/)(www\.)?([^\ ]*)$ A typical regular expression contains Characters ( http ) and Meta Characters ([]). The combination of these two forms a meaningful regular expression for a particular task.
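To see the quoted pattern in action, here’s a quick base-R check (using perl = TRUE so the escaped characters behave exactly as written above):
url_pattern <- "^(http)(s)?(\\:\\/\\/)(www\\.)?([^\\ ]*)$"

grepl(url_pattern, "https://www.r-bloggers.com", perl = TRUE)   # TRUE  - looks like a URL
grepl(url_pattern, "not a url",                  perl = TRUE)   # FALSE - doesn't start with http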

How to do Tamil Text Analysis & NLP in R

udpipe is a beautiful R package for Text Analytics and NLP and helps in Topic Extraction. While most Text Analytics resources online are only about English, this post picks up a different language - Tamil - and fortunately, udpipe has got a Tamil language model. Loading
library(udpipe)
Tamil Text Below is a part extracted from a Tamil movie review:
text <- data.frame(tamil = "கரு - கோமாவால் 16 வருட வாழக்கையை இழந்தவன் மனிதத்தை இந்த கால மனிதர்களுக்கு நினைவுபடுத்து தான் கோமாளி படத்தின் கரு.")
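A rough sketch of annotating that Tamil text with udpipe; the language name passed to udpipe_download_model() is an assumption that may differ by package version:
library(udpipe)

model_file <- udpipe_download_model(language = "tamil")   # may need "tamil-ttb" in newer udpipe versions
ud_tamil   <- udpipe_load_model(model_file$file_model)

ann <- udpipe_annotate(ud_tamil, x = text$tamil)          # text is the data frame created above
ann <- as.data.frame(ann)
head(ann[, c("token", "upos", "lemma")])                  # tokens, parts of speech and lemmas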

How to scrape Zomato Restaurants Data in R

Zomato is a popular restaurant-listing website in India (similar to Yelp), and people are always interested in seeing how to download or scrape Zomato restaurant data for Data Science and visualizations. In this post, we’ll learn how to scrape / download Zomato Restaurants (Buffets) data using R. I also hope this post serves as a basic web-scraping framework / guide for any such task of building a new dataset from the internet.
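A bare-bones sketch of the scraping step, with a placeholder URL and made-up CSS selectors (the real Zomato markup, and its terms of use, need to be checked first):
library(rvest)

listing_url <- "https://www.zomato.com/chennai/restaurants?buffet=1"   # placeholder URL
page <- read_html(listing_url)

restaurants <- data.frame(
  name   = page %>% html_nodes(".result-title") %>% html_text(trim = TRUE),   # hypothetical selector
  rating = page %>% html_nodes(".rating-value") %>% html_text(trim = TRUE),   # hypothetical selector
  stringsAsFactors = FALSE
)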

Combining the power of R and Python with reticulate

R + Py In the world of R vs Python fights, this is a simple (one could even call it naive) attempt to show how we can combine the power of Python with R and create a new superpower - like this one, if you have watched The Incredibles! About this Dataset This dataset contains a bunch of tweets that came with the tag #JustDoIt after Nike released the ad campaign with Colin Kaepernick that turned controversial.
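A tiny sketch of the reticulate idea - calling pandas from R; the CSV file name is a placeholder, and a Python environment with pandas installed is assumed:
library(reticulate)

pd     <- import("pandas")                      # Python module, usable with $ just like an R object
tweets <- pd$read_csv("justdoit_tweets.csv")    # placeholder file; result is auto-converted to an R data.frame
head(tweets)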

How to do Topic Extraction from Customer Reviews in R

Topic Extraction is an integral part of IE (Information Extraction) from a corpus of text, to understand the key things the corpus is talking about. While this can be achieved naively using unigrams and bigrams, a more intelligent way of doing it, with an algorithm called RAKE, is what we’re going to see in this post. Udpipe udpipe is an NLP-focused R package created and open-sourced by the organization bnosac.
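A sketch of RAKE via udpipe’s keywords_rake(); here `reviews` (a character vector of customer review texts) and the English model download are assumptions:
library(udpipe)

m   <- udpipe_download_model(language = "english")
ud  <- udpipe_load_model(m$file_model)
ann <- as.data.frame(udpipe_annotate(ud, x = reviews))   # reviews: assumed character vector of review texts

keyw <- keywords_rake(x = ann, term = "lemma", group = "doc_id",
                      relevant = ann$upos %in% c("NOUN", "ADJ"))   # RAKE over noun/adjective lemmas
head(keyw[order(-keyw$rake), ])                                    # top-scoring keyword phrases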

3 tidyverse tricks for most commonly used Excel Features

In this post, we’re simply going to see 3 tricks that could help improve your tooling using {tidyverse}. Create a difference variable between the current value and the next value This is also known as lead and lag - especially in a time-series dataset this variable becomes very important in feature engineering. In Excel, this is simply done by creating a new formula field that subtracts the current cell from the next cell (or the previous cell from the current cell) and dragging the formula down to the last cell.
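A {dplyr} sketch of the same idea on toy data, using lead() and lag():
library(dplyr)

sales <- tibble(month = 1:6,
                value = c(120, 150, 130, 170, 160, 190))

# Equivalent of the Excel "next cell minus current cell" formula column
sales %>%
  mutate(next_value   = lead(value),
         diff_to_next = next_value - value,
         diff_to_prev = value - lag(value))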

How to Automate EDA with DataExplorer in R

EDA (Exploratory Data Analysis) is one of the key steps in any Data Science project. The better the EDA, the better the feature engineering can be. From modelling to communication, EDA has many more hidden benefits that aren’t often emphasised when Data Science is taught to beginners. The Problem That said, EDA is also one of the areas of the Data Science pipeline where a lot of manual code is written for different types of plots and different types of inference.
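A quick sketch of what the automated version looks like with DataExplorer (iris used as a stand-in dataset):
library(DataExplorer)

introduce(iris)        # high-level summary: rows, columns, missing values
plot_missing(iris)     # missingness profile per column
plot_histogram(iris)   # histograms of all continuous variables
create_report(iris)    # full HTML EDA report in one call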

Do you love Data Science? I mean, the Data part in it

Last week, we talked all about Artificial Intelligence (also Artificial Stupidity), which led me to think about the foundation of Data Science - the Data itself. I think Data is the least appreciated entity in the Data Science value chain. You might agree with me if you do Data Science outside competitive platforms like Kaggle, where the data given to you is what most Data Scientists dream about in their jobs.

How to generate meaningful fake data for learning, experimentation and teaching

The Problem There’s one thing about R that a lot of people have as their top-of-mind: the black-and-white plot of the iris dataset, which is definitely a boring view of R. That’s boring partly because of the aesthetics, but also because it’s such a clichéd example, used over and over again. The other problem is finding the right dataset for the particular problem you want to teach, learn or experiment with.

Extract Top Reddit Posts of #rstats in 3 lines of R Code

This post is kept (literally) minimal to demonstrate how simple this hack is using R (of course, it could be just as simple in other languages too). This is also to establish the point that R has got use-cases beyond statistics and data mining. Objective The rstats subreddit is one of the popular sources of R-related information / discussion on the internet. We’re trying to extract the top posts of the rstats subreddit. Data Format Lucky for us, Reddit offers a JSON file for every subreddit (and post), and we’ll use that here.
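Roughly the “3 lines” idea in a sketch; the field names follow Reddit’s listing JSON schema, and Reddit may also expect a custom user agent for programmatic access:
library(jsonlite)

top    <- fromJSON("https://www.reddit.com/r/rstats/top/.json?limit=10")   # .json suffix returns JSON
titles <- top$data$children$data$title                                     # post titles from the listing
head(titles)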

How to make Square (Pie) Charts for Infographics in R

Are you looking for some unique way of visualizing your numbers instead of simply using bar charts - which could sometimes bore the audience if used slide after slide? Here’s the Square Pie / Waffle Chart for you. A Waffle Chart, or as it goes technically, a Square Pie Chart, is just a pie chart that uses squares instead of circles to represent percentages. So, it’s good to keep in mind that it works best for percentages.
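A minimal sketch with the waffle package, assuming the parts are percentages that sum to 100:
library(waffle)

shares <- c(`Product A` = 45, `Product B` = 30, `Product C` = 25)
waffle(shares, rows = 10)   # 10 x 10 grid, so each square represents one percentage point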

How to create unigrams, bigrams and n-grams of App Reviews

This is one of the frequent questions I’ve heard from first-time NLP / Text Analytics programmers (or, as the world likes to call them, “Data Scientists”). Prerequisite For simplicity, this post assumes that you already know how to install a package, and so you’ve got tidytext installed on your R machine.
install.packages("tidytext")
Loading the Library Let’s start by loading the tidytext library.
library(tidytext)
Extracting App Reviews We’ll use the R package itunesr for downloading iOS App Reviews, on which we’ll perform simple text analysis (unigrams, bigrams, n-grams).
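A small sketch of the tokenization itself, with toy reviews standing in for the itunesr output:
library(tidytext)
library(dplyr)

reviews <- tibble(review_id = 1:2,
                  text = c("love the new update", "app crashes after the update"))

unnest_tokens(reviews, word,    text)                           # unigrams (one word per row)
unnest_tokens(reviews, bigram,  text, token = "ngrams", n = 2)  # bigrams
unnest_tokens(reviews, trigram, text, token = "ngrams", n = 3)  # trigrams / general n-grams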

Interactive Visualization in R with apexcharter

Interactive visualizations are powerful these days because they are made for the web. The web is simply a combination of HTML, CSS and JavaScript, which together build interactive visualizations - paving the way for a lot of JavaScript charting libraries like highcharts.js and apexcharts.js. Thanks to R’s htmlwidgets, many R developers have started porting those JavaScript charting libraries to R, and dreamRs is one such leading developer group working at the intersection of R + Web.
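A minimal apexcharter sketch - an interactive column chart from a summary table (ggplot2’s mpg data used as a stand-in):
library(apexcharter)
library(ggplot2)
library(dplyr)

mpg_summary <- count(mpg, class)   # cars per vehicle class

apex(data = mpg_summary, mapping = aes(x = class, y = n), type = "column")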

Programmatically extract TIOBE Index Ratings

The TIOBE Index is an index (ranking) that claims to represent the popularity of programming languages. Yihui (the creator of the blogdown package) recently wrote a blog post titled “On TIOBE Index and the era of decision fatigue”, and I strongly recommend going through that before continuing with this post. So the disclaimer goes like this: this post/author doesn’t believe that the TIOBE Index is a fair way to measure/present the popularity of programming languages, and this is written just to teach you how to extract/get the TIOBE Index programmatically using the R package tiobeindexr.
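As a sketch only: this assumes tiobeindexr exposes a helper such as top_20() for pulling the current ratings table - check the package documentation for the exact function names:
library(tiobeindexr)

top_20()   # assumed helper returning the current top-20 ratings as a data frame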

How to reshape a dataframe from wide to long or long to wide format

Reshaping a dataframe / table from long to wide format or wide to long format is one of the daily tasks a Data Analyst / Data Scientist does. The long format is similar to the tidy format that the tidyverse advocates. Even though it’s a very common task, the tidyr solution of using spread() and gather() was almost never intuitive enough to be used in code without hitting Stack Overflow or referring to the documentation.
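Presumably the more intuitive replacement the post builds toward is tidyr’s newer pivot_longer() / pivot_wider() pair; a toy round-trip looks like this:
library(tidyr)

wide <- data.frame(country = c("IN", "US"),
                   yr2018  = c(2.7, 2.9),
                   yr2019  = c(2.9, 2.1))   # toy numbers for illustration

# wide -> long
long <- pivot_longer(wide, cols = starts_with("yr"),
                     names_to = "year", values_to = "gdp_growth")

# long -> wide (round-trip back)
pivot_wider(long, names_from = year, values_from = gdp_growth)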

Find out Bulk Email ID Reputation Risk using R

If you are working in Info Sec / Cyber Security, one of the things that might be part of your day job is filtering email to remove spam / phishing emails. While this could be done at several levels and in several ways, monitoring the email ID (like abc@xyz.com) and validating its reputation to see whether it seems risky / suspicious or authentic, before allowing it to reach the user’s inbox, is one of the solid ways (though it’s also error-prone, with false positives).

How to do negation-proof sentiment analysis in R

Sentiment Analysis is one of those things in Machine Learning that is still improving with the rise of Deep Learning based NLP solutions. Things like sarcasm, negations and similar quirks make Sentiment Analysis a rather tough nut to crack. Deep Learning, as effective as it is, is also computationally expensive, and if you are ready to trade off between cost and accuracy, then this is the solution for building a negation-proof Sentiment Analysis solution (in R).
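The teaser doesn’t name the package; assuming a valence-shifter-aware scorer like sentimentr is what makes the analysis “negation-proof”, a minimal sketch looks like this:
library(sentimentr)

txt <- c("I like this product.",
         "I do NOT like this product.")   # the negation should flip the polarity

sentiment_by(get_sentences(txt))          # per-text average sentiment, negation-aware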