Wrangling categorical data in R
A peer-reviewed article of this Preprint also exists.

Abstract
Data wrangling is a critical foundation of data science, and wrangling of categorical data is an important component of this process. However, categorical data can introduce unique issues in data wrangling, particularly in real-world settings with collaborators and periodically updated dynamic data. This paper discusses common problems arising from categorical variable transformations in R, demonstrates the use of factors, and suggests approaches to address data wrangling challenges. For each problem, we present at least two strategies for management, one in base R and the other from the ‘tidyverse.’ We consider several motivating examples, suggest defensive coding strategies, and outline principles for data wrangling to help ensure data quality and sound analysis.
Cite this as
2017. Wrangling categorical data in R. PeerJ Preprints 5:e3163v2 https://doi.org/10.7287/peerj.preprints.3163v2
Author comment
This version contains updated citations to other articles in this collection.
Additional Information
Competing Interests
The authors declare that they have no competing interests.
Author Contributions
Amelia McNamara analyzed the data, contributed reagents/materials/analysis tools, wrote the paper, performed the computation work, and reviewed drafts of the paper.
Nicholas J Horton analyzed the data, contributed reagents/materials/analysis tools, wrote the paper, performed the computation work, and reviewed drafts of the paper.
Data Deposition
The following information was supplied regarding data availability:
The code is available online at https://github.com/dsscollection/factor-mgmt (currently a private repository, but will soon be made public).
Funding
The authors received no funding for this work.