Welcome to this post on data cleaning in R! This post will describe the data exploration process and introduce useful functions to streamline common tasks.

Philosophy

There are common steps that you may face in your data exploration:

  • Import (messy) data
  • Clean messy data
  • Describe cleaned data
  • Plot cleaned data
  • Model data
  • Write output tables

This package aims to facilitate this basic data handling with functions that I’ve found handy.

Quick Reference Table

handy Function Use Case Description
clean_names Clean Data Normalize variable names of a data.frame to snake_case
title_names Clean Data Normalize variable names of a data.frame to Title Case
remove_columns Clean Data Removes columns that are either missing or non-unique
add_mean_row Describe Data Add a mean row to a data.frame of numeric values
add_total_row Describe Data Add a total row to a data.frame of numeric values
check_variables Describe Data Calculate % missing and number of unique values for all variables
%p% Shorthand Allow string concatenation by piping
cs Shorthand Makes a character vector without quotes
named_list Shorthand Makes a named list using object names
multiplot Plotting Arrange multiple plots
scale_x_human Plotting Scales ggplot axis to easily readable K, M, B, etc.
scale_y_human Plotting Scales ggplot axis to easily readable K, M, B, etc.
write_excel Write Output Write a list of data.frames in R to named sheets in an Excel workbook
write_regression_to_excel Write Output Write a list of data.frames in R to named sheets in an Excel workbook

Clean

Let’s use R’s built in iris dataset.

These cleaning functions are built to be compatible with magrittr/dplyr pipes.

library(handy)
library(magrittr)
library(knitr)
iris3obs <- head(iris, 3) 
iris3obs %>% kable()
iris3obs %>% clean_names %>% kable()

What if we want to focus on columns that change for the set we’re looking for?

iris3obs %>% remove_columns(1) %>% kable()

Describe

For every dataset, we want to know what it contains:

mtcars %>% check_variables %>% kable

Sometimes it’s helpful to add a mean or total row for a table that’s to be printed out. These are best for datasets with a character key as the first column.

arrests <- USArrests %>% tibble::rownames_to_column() %>% dplyr::as_tibble()
arrests %>% head %>% add_mean_row %>% kable
arrests %>% head %>% add_total_row %>% kable

Plot

library(ggplot2)
p1 <- Seatbelts %>% tibble::as_tibble() %>% 
  ggplot(aes(kms, DriversKilled)) +
  geom_point() +
  scale_x_human() 
p2 <- Seatbelts %>% tibble::as_tibble() %>% 
  ggplot(aes(DriversKilled)) +
  geom_histogram(alpha = .5, bins = 30)
multiplot(p1, p2)

Write

And for some analyses, it’s nice to write out the final data to Excel.

named_list(mtcars, iris) %>% 
  write_excel(workbook_fname = 'wb.xlsx')

readxl::excel_sheets('wb.xlsx')
readxl::read_excel('wb.xlsx', sheet = 'iris') %>% head %>% kable

Here, using named_list assembles a list with the list names set to the names of the objects.

lm_assault <- lm(Assault ~ UrbanPop, data = arrests)
lm_anscombe <- lm(y1 ~ x1, data = anscombe)
reg_list <- named_list(lm_assault, lm_anscombe)

reg_list

reg_list %>% write_regression_to_excel('reg.xlsx')
readxl::read_excel('reg.xlsx', sheet = 'lm_anscombe') %>% head %>% kable

Other

What do we have in memory at this point?

lsos(n = 5) %>% kable