The Kannada MNIST dataset is another MNIST-type digits dataset, this one for the Kannada (Indian) language. All details of the dataset curation are captured in the paper “Kannada-MNIST: A new handwritten digits dataset for the Kannada language” by Vinay Uday Prabhu. The author's GitHub repo can be found here. The objective of this post is to demonstrate how to use h2o.ai's AutoML function to quickly get a (better) baseline. It also illustrates how AutoML tools help democratize the machine learning model-building process.
In this post, we'll see three functions from tidyr that are useful for handling missing values (NAs) in a dataset. Please note: this post isn't about missing value imputation.
tidyr
According to the tidyr documentation, the goal of tidyr is to help you create tidy data. Tidy data is data where:
+ Every column is a variable.
+ Every row is an observation.
+ Every cell is a single value.
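As a rough illustration of NA handling in tidyr, here is a minimal sketch. The excerpt above does not name the three functions the post covers, so `drop_na()`, `fill()`, and `replace_na()` are an assumption on my part. All three are real tidyr functions, but they may not be the ones the post uses. The data frame `scores` and its values are made up for the example.

```r
library(tidyr)

# Hypothetical toy data with missing values (not from the post).
scores <- data.frame(
  id    = 1:4,
  score = c(10, NA, NA, 7),
  grade = c("a", NA, "b", NA)
)

# drop_na(): keep only rows with no NA in any column.
dropped <- drop_na(scores)

# fill(): carry the last non-missing value of `score` downwards.
filled <- fill(scores, score)

# replace_na(): substitute a fixed value per column.
replaced <- replace_na(scores, list(score = 0, grade = "unknown"))
```

None of these calls estimate missing values from the data, which fits the note that the post is not about imputation.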
Web Scraping in R
Web scraping needs no introduction among data enthusiasts. It's one of the most viable and essential ways of collecting data when the data itself isn't readily available. Knowing web scraping comes in very handy when you are short of data, need macroeconomic indicators, or simply have no data available for a particular project, such as a Word2vec / language model with a custom text dataset.
India is the world's largest democracy and, as it goes, also a highly diverse place. This is my attempt to see how “Hindi” and other languages are spoken in India. In this post, we'll see how to collect the data for this puzzle directly from Wikipedia and how to visualize it to highlight the insight.
Data
Wikipedia is a great source for data like this (languages spoken in India), and because Wikipedia lists these tables as HTML <table> elements, it becomes quite easy to use rvest::html_table() to extract the table as a data frame without much hassle.
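To show the idea of turning an HTML <table> into a data frame, here is a minimal sketch that runs without a network connection. It parses a small inline HTML snippet instead of the real Wikipedia page. The table contents are placeholders, not actual language statistics, and the object names `page`, `table_node`, and `languages` are hypothetical.

```r
library(rvest)

# Hypothetical stand-in for a Wikipedia page: a tiny inline HTML document.
# The rows below are placeholders, not real figures.
page <- minimal_html('
  <table>
    <tr><th>Language</th><th>Speakers</th></tr>
    <tr><td>Language A</td><td>100</td></tr>
    <tr><td>Language B</td><td>50</td></tr>
  </table>')

# Select the first <table> element, then convert it to a data frame (tibble).
table_node <- html_element(page, "table")
languages  <- html_table(table_node)

print(languages)
```

For a live page, `read_html()` with the page URL would take the place of `minimal_html()`. Which table on the page to select depends on the page layout, so that step is left out here.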