Investigating the XDA Package | elastacloud-channels

Just last week I stumbled upon an R package called xda. R offers several ways of performing exploratory analysis. Data scientists know that an initial look at a dataset gives quick insight into what it contains and some idea of how much work the pre-processing stage may involve.

Common functions that come with base R are summary(), str(), head() and tail(), which respectively give a statistical summary of your dataset, display the internal structure of an R object, and return the first and last several rows of a matrix or data frame. However, these functions apply to the entire dataset and do not let you pick out columns by data type. This is where the xda package comes in.
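As a quick reminder of what the base functions give you, here is a minimal sketch. The data frame df is a toy with made-up rows (its column names mirror SLID purely for illustration), so the numbers mean nothing.

```r
# Toy data frame with made-up rows; column names mirror SLID for illustration only.
df <- data.frame(
  wages     = c(10.5, 11.0, NA, 17.8, 14.0, 8.2),
  education = c(15.0, 13.2, 16.0, 14.0, 8.0, 12.0),
  age       = c(40, 19, 49, 46, 71, 50),
  sex       = factor(c("Male", "Male", "Male", "Male", "Female", "Female")),
  language  = factor(c("English", "English", "Other", "Other", "English", "French"))
)

summary(df)   # summary of every column in one go
str(df)       # internal structure: column types and first few values
head(df, 3)   # first three rows
tail(df, 3)   # last three rows
```

Each call covers the whole data frame at once; none of them takes an argument for restricting the output to columns of one type.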

The functions currently included in the package are:

numSummary(mydata) function automatically detects all numeric columns in the dataframe mydata and provides their summary statistics
charSummary(mydata) function automatically detects all character columns in the dataframe mydata and provides their summary statistics
Plot(mydata, dep.var) plots all independent variables in the dataframe mydata against the dependent variable specified by the dep.var parameter
removeSpecial(mydata, vec) replaces all special characters (specified by vector vec) in the dataframe mydata with NA
bivariate(mydata, dep.var, indep.var) performs bivariate analysis between dependent variable dep.var and independent variable indep.var in the dataframe mydata

I’ll use the SLID dataset from the ‘car’ package to show how some of the tools from the xda package work. To do that, we first need to install the xda package. The recommended way to do this in R is through devtools. Once you have the devtools package installed, load it and install xda from GitHub:

library(devtools)

install_github("ujjwalkarn/xda")

Now load the xda package and the SLID dataset. SLID ships with the ‘car’ package, so car has to be loaded before data(SLID) can find it.

library(xda)

library(car)  # makes the SLID dataset available

data(SLID)

With the xda package, we can choose which columns we want to see by their data type. For example, the summary statistics for the numerical variables are produced by numSummary(SLID).
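To make the idea of a numeric-only summary concrete, here is a rough base-R sketch of the same kind of result. This is not how xda implements numSummary(), and the statistics chosen here are my own assumption; the data frame is a toy with made-up rows.

```r
# Toy data frame with made-up rows; column names mirror SLID for illustration only.
df <- data.frame(
  wages     = c(10.5, 11.0, NA, 17.8, 14.0, 8.2),
  education = c(15.0, 13.2, 16.0, 14.0, 8.0, 12.0),
  sex       = factor(c("Male", "Male", "Male", "Male", "Female", "Female"))
)

# Keep only the numeric columns.
numeric_df <- df[vapply(df, is.numeric, logical(1))]

# One row per numeric column, one column per statistic.
num_stats <- data.frame(
  n       = vapply(numeric_df, function(x) sum(!is.na(x)), integer(1)),
  missing = vapply(numeric_df, function(x) sum(is.na(x)), integer(1)),
  mean    = vapply(numeric_df, mean, numeric(1), na.rm = TRUE),
  sd      = vapply(numeric_df, sd, numeric(1), na.rm = TRUE),
  min     = vapply(numeric_df, min, numeric(1), na.rm = TRUE),
  max     = vapply(numeric_df, max, numeric(1), na.rm = TRUE)
)
print(num_stats)
```

The sex column is dropped automatically because it is a factor, which is the behaviour the package offers in a single call.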

For the character variables, use charSummary(SLID).

Note that charSummary() identifies sex and language as categorical variables and gives us the levels and counts for each level.
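The levels-and-counts part can be approximated in base R as follows. Again, this is only a sketch on a toy data frame with made-up rows, not the package's own code.

```r
# Toy data frame with made-up rows; column names mirror SLID for illustration only.
df <- data.frame(
  wages    = c(10.5, 11.0, NA, 17.8, 14.0, 8.2),
  sex      = factor(c("Male", "Male", "Male", "Male", "Female", "Female")),
  language = factor(c("English", "English", "Other", "Other", "English", "French"))
)

# Treat factor and character columns as categorical.
is_categorical <- vapply(df, function(x) is.factor(x) || is.character(x), logical(1))

# Count the observations at each level, reporting NA as its own level if present.
level_counts <- lapply(df[is_categorical], table, useNA = "ifany")
print(level_counts)
```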

The bivariate() command performs bivariate analysis between dependent variable dep.var and independent variable indep.var in the dataframe. For a bivariate analysis with wages as the dependent variable and sex as the independent variable, call bivariate(SLID, 'wages', 'sex').

The output shows the minimum, maximum and mean wages for each sex. If we go the other way round, with sex as the dependent variable and wages as the independent variable, bivariate() splits the wages variable into bins and shows how many males and females fall into each bin.
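Both directions of that analysis can be sketched in base R. The toy data frame below uses made-up rows, and the three equal-width bins are my own assumption; xda may well choose its bins differently.

```r
# Toy data frame with made-up rows; column names mirror SLID for illustration only.
df <- data.frame(
  wages = c(10.5, 11.0, NA, 17.8, 14.0, 8.2),
  sex   = factor(c("Male", "Male", "Male", "Male", "Female", "Female"))
)

# wages as dependent, sex as independent: min, max and mean wages per sex.
wages_by_sex <- do.call(rbind, lapply(split(df$wages, df$sex), function(x) {
  c(min = min(x, na.rm = TRUE), max = max(x, na.rm = TRUE), mean = mean(x, na.rm = TRUE))
}))
print(wages_by_sex)

# sex as dependent, wages as independent: bin wages, then count each sex per bin.
wage_bin <- cut(df$wages, breaks = 3)   # three equal-width bins (an assumption)
sex_by_wage_bin <- table(wage_bin, sex = df$sex)
print(sex_by_wage_bin)
```

Rows with a missing wage fall out of the second table, since cut() returns NA for them and table() skips NA by default.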

The Plot() command is, for me, the most exciting thing about the xda package. It plots all independent variables in the dataframe against the dependent variable.

It automatically detects which plot should be used for each independent variable. This is very useful when you want to do a quick analysis of your sample data. One thing to note, as mentioned in one of the comments on Reddit, is that all of the above can be done with other tools, perhaps with more flexibility. That said, the xda package is recommended during the initial data analysis stage.
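To illustrate the point about other tools, here is one way the "pick a plot per variable" idea could be sketched in base R. The function plot_against() is hypothetical, and its rule (scatter plot for numeric columns, box plot otherwise, with a numeric dependent variable) is my assumption; xda's own selection rules may differ.

```r
# Toy data frame with made-up rows; column names mirror SLID for illustration only.
df <- data.frame(
  wages = c(10.5, 11.0, NA, 17.8, 14.0, 8.2),
  age   = c(40, 19, 49, 46, 71, 50),
  sex   = factor(c("Male", "Male", "Male", "Male", "Female", "Female"))
)

# Hypothetical helper: plot every other column against a numeric dep.var.
plot_against <- function(data, dep.var) {
  y <- data[[dep.var]]
  chosen <- character(0)
  for (v in setdiff(names(data), dep.var)) {
    x <- data[[v]]
    if (is.numeric(x)) {
      plot(x, y, xlab = v, ylab = dep.var)             # numeric vs numeric
      chosen[v] <- "scatter"
    } else {
      boxplot(split(y, x), xlab = v, ylab = dep.var)   # numeric vs categorical
      chosen[v] <- "boxplot"
    }
  }
  invisible(chosen)   # record of which plot type was drawn for each variable
}
```

Calling plot_against(df, "wages") draws one plot per remaining column and invisibly returns the plot type chosen for each.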

Links to articles on the xda package:

https://www.reddit.com/r/statistics/comments/4jyc8c/xda_r_package_for_exploratory_data_analysis/

https://www.r-bloggers.com/introducing-xda-r-package-for-exploratory-data-analysis/

https://www.r-bloggers.com/using-xda-with-googlesheets-in-r/