Data Import in R
In practice, most data is not created in R itself; it comes from other software, hardware, and channels such as Excel or the internet. This chapter covers how to import such data into R so that analysis can begin. Read it straight through, or pick the topics that fit your needs and available time.
Key topics include symbol-separated files, Excel files, JSON files, and R's native data formats, RData and RDS. Other formats are covered as supplementary material in the "Common Issues and Solutions" section.
Symbol-Separated Files
Symbol-separated files are among the most commonly used data file formats, so knowing how to import them is essential. Here, "symbol" means the delimiter that separates fields. The most common delimiters are the comma (,) and the tab character (\t); files that use them are known as CSV and TSV files, respectively.
CSV
CSV files typically have the .csv extension. The extension does not affect the file's content but helps quickly identify the format and aids automatic interpretation by some software. For example, a CSV file representing student grades might look like this:
student,chinese,math,english
stu1,99,100,98
stu2,60,50,88
R has a built-in read.table() function to import various delimited files. Here’s how to use it to import the data directly from text:
stu <- read.table(text = "
student,chinese,math,english
stu1,99,100,98
stu2,60,50,88
", header = TRUE, sep = ",")
stu
class(stu)
Usually, data is stored in files on the computer. To import a CSV file using read.table():
cars <- read.table(file = "data/data-import/mtcars.csv", header = TRUE, sep = ",")
Check the first few rows:
head(cars)
Alternatively, use read.csv(), a wrapper around read.table() whose defaults (header = TRUE, sep = ",") suit CSV files:
cars2 <- read.csv(file = "data/data-import/mtcars.csv")
head(cars2)
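As a rough illustration of what read.csv() simplifies, the sketch below writes the built-in mtcars dataset to a temporary file and reads it back both ways. The temporary file and the variable names are made up for this example and are not part of the chapter's data folder.

```r
# Write a small CSV to a temporary file (hypothetical file, for illustration only)
tmp <- tempfile(fileext = ".csv")
write.csv(mtcars, tmp, row.names = FALSE)

# read.table() needs the header and separator spelled out ...
a <- read.table(tmp, header = TRUE, sep = ",")
# ... while read.csv() uses header = TRUE and sep = "," by default
b <- read.csv(tmp)

identical(a, b)  # compare the two results

# colClasses fixes column types up front instead of letting R guess them
d <- read.csv(tmp, colClasses = c(cyl = "character"))
class(d$cyl)

unlink(tmp)  # clean up
```

Setting colClasses is optional, but it can help when a column such as an ID or postal code should stay as text rather than be converted to a number.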
Efficient Data Import with readr and data.table
For larger datasets or when performance is critical, the readr and data.table packages are recommended. These packages can read large datasets more quickly than base R functions.
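The timing snippets that follow refer to temp_csv, and the later class() check refers to z1, neither of which is defined in this section. A minimal setup sketch, assuming temp_csv is a temporary CSV file and z1 is the base R baseline read with read.csv() (the data size and column names are invented for illustration):

```r
# Build a moderately sized data frame (size chosen arbitrarily for the demo)
set.seed(1)
n <- 1e5
big <- data.frame(
  id = seq_len(n),
  x  = rnorm(n),
  g  = sample(letters, n, replace = TRUE)
)

# Assumed definition of temp_csv: a throwaway CSV file in the session temp dir
temp_csv <- tempfile(fileext = ".csv")
write.csv(big, temp_csv, row.names = FALSE)

# Base R baseline, named to match the z1/z2/z3 pattern used in this section
time1 <- system.time(
  z1 <- read.csv(temp_csv)
)
time1
```

Comparing time1 with the readr and data.table timings gives a rough sense of the speed difference on your own machine; the exact numbers depend on hardware and file size.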
Using readr:
library(readr)
time2 <- system.time(
z2 <- read_csv(temp_csv)
)
time2
Using data.table:
library(data.table)
time3 <- system.time(
z3 <- fread(temp_csv)
)
time3
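Check the class of the imported objects. Base R returns a data.frame, read_csv() a tibble, and fread() a data.table:
class(z1)  # z1: the same file read with base R, e.g. z1 <- read.csv(temp_csv)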
class(z2)
class(z3)
TSV and Other CSV Variants
TSV files use the tab character as a delimiter and can be imported similarly by specifying sep = "\t":
mt <- read.table("data/data-import/mtcars.tsv", sep = "\t", header = TRUE)
mt
Using readr:
mt2 <- read_tsv("data/data-import/mtcars.tsv")
mt2
Using data.table:
mt3 <- fread("data/data-import/mtcars.tsv")
mt3
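The same pattern extends to other delimiters. As a sketch, assuming a semicolon-separated file with decimal commas (the sample data below is invented), base R offers read.csv2(), or read.table() with sep and dec set explicitly:

```r
# Invented sample: semicolon as field separator, comma as decimal mark
txt <- "name;height;weight
a;1,75;60,5
b;1,80;72,0"

# read.csv2() defaults to sep = ";" and dec = ","
v1 <- read.csv2(text = txt)

# Equivalent call with read.table(), spelling the options out
v2 <- read.table(text = txt, header = TRUE, sep = ";", dec = ",")

v1
sapply(v1, class)  # inspect how each column was parsed
```

For any other single-character delimiter, passing it as sep to read.table() works the same way.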
Excel
Excel files are widely used for data storage and processing. The readxl package can be used to import data from Excel files:
library(readxl)
mt_excel <- read_excel("data/data-import/mtcars.xlsx")
head(mt_excel)
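To read a specific sheet, list the sheet names first and then pass one to the sheet argument. Here excel_path is assumed to point to the example workbook bundled with readxl, which contains a sheet named "iris":
excel_path <- readxl_example("datasets.xlsx")  # assumption: readxl's bundled example workbook
excel_sheets(excel_path)
iris_xl <- read_excel(excel_path, sheet = "iris")  # named iris_xl to avoid masking the built-in iris
head(iris_xl)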
JSON
JSON is a lightweight data-interchange format. The jsonlite package is widely used to parse and generate JSON in R. Its toJSON() function converts R objects to JSON text:
jsonlite::toJSON(letters)
jsonlite::toJSON(c(a = 1L, b = 2.0))
jsonlite::toJSON(data.frame(a = 1:3, b = 2:4))
jsonlite::toJSON(list(a = 1L, b = 2:5, c = c(TRUE, FALSE), d = NULL))
Save JSON data to a file:
jsonlite::write_json(list(a = 1L, b = 2:5, c = c(TRUE, FALSE), d = NULL), path = "data/data-import/example.json")
Read JSON data:
jsonlite::read_json("data/data-import/example.json")
jsonlite::read_json("data/data-import/example.json", simplifyVector = TRUE)
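To make the effect of simplifyVector concrete, the sketch below parses a JSON string with jsonlite::fromJSON() rather than a file, so it does not depend on example.json existing. The JSON text is made up for illustration and only loosely mirrors the list written earlier.

```r
# Made-up JSON text for illustration
txt <- '{"a": [1], "b": [2, 3, 4, 5], "c": [true, false]}'

# Without simplification every JSON array becomes an R list
raw <- jsonlite::fromJSON(txt, simplifyVector = FALSE)
str(raw$b)

# With simplification, arrays of primitives become atomic vectors
simple <- jsonlite::fromJSON(txt, simplifyVector = TRUE)
str(simple$b)

# toJSON() wraps length-one vectors in an array unless auto_unbox = TRUE
jsonlite::toJSON(list(a = 1L))
jsonlite::toJSON(list(a = 1L), auto_unbox = TRUE)
```

The simplified form is usually more convenient for analysis, while the unsimplified form preserves the exact nesting of the JSON document.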
R Data Files
R's native storage formats, RData and RDS, are an efficient and common way to save and load R objects.
RData
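RData files can store several objects at once; load() restores them under their original names:
d1 <- head(mtcars)  # example objects, assumed here; any R objects work
d2 <- tail(mtcars)
save(d1, d2, file = "data/data-import/mtcars.RData")
rm(d1, d2)  # remove them so that load() visibly restores them
load("data/data-import/mtcars.RData")
ls()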
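Because load() restores objects under their saved names, it can silently overwrite objects of the same name in the workspace. One way to guard against that, sketched here with invented objects and a temporary file, is to load into a separate environment and inspect it first:

```r
# Invented example objects and a throwaway file
x1 <- 1:5
x2 <- letters[1:3]
rdata_file <- tempfile(fileext = ".RData")
save(x1, x2, file = rdata_file)

# Load into a fresh environment instead of the global workspace
e <- new.env()
loaded <- load(rdata_file, envir = e)

loaded   # load() returns the names of the restored objects
e$x1     # access an object without touching the workspace

unlink(rdata_file)  # clean up
```

Capturing the return value of load() is also a quick way to find out what an unfamiliar RData file contains.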
RDS
An RDS file stores a single object. Because readRDS() returns the object rather than restoring it under its saved name, you can assign it to any name:
saveRDS(mtcars, file = "data/data-import/mtcars.rds")
mtcars_rename <- readRDS("data/data-import/mtcars.rds")
head(mtcars_rename)
Common Issues and Solutions
Loading Data from Clipboard
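Passing "clipboard" as the file name reads a table that has been copied to the system clipboard. This works on Windows; on macOS, read from pipe("pbpaste") instead:
clip_data <- read.table("clipboard", header = TRUE)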
Reading Line by Line
Use readLines() to read file content line by line:
fil <- tempfile(fileext = ".data")
cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = fil, sep = "\n")
readLines(fil, n = -1)
unlink(fil) # Clean up
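For files too large to read at once, readLines() can also pull lines incrementally from an open connection. A minimal sketch, writing similar toy content to a new temporary file (the variable names are illustrative):

```r
# Recreate the toy file; writeLines() ends the file with a newline
fil2 <- tempfile(fileext = ".data")
writeLines(c("TITLE extra line", "2 3 5 7", "", "11 13 17"), fil2)

con <- file(fil2, open = "r")
title <- readLines(con, n = 1)  # first call reads only the title line
rest  <- readLines(con)         # next call continues where the first stopped
close(con)

# Drop blank lines and convert the remaining text to numbers
rest <- rest[nzchar(rest)]
nums <- as.numeric(unlist(strsplit(rest, " ")))
nums

unlink(fil2)  # clean up
```

Opening the connection explicitly is what lets successive readLines() calls continue from the current position; passing a file name each time would start again from the first line.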
Fixed-Width File Format
Use read.fwf() or readr::read_fwf() for fixed-width format files.
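As a brief illustration with base R, read.fwf() takes a vector of column widths. The data, widths, and column names below are invented for the example:

```r
# Invented fixed-width data: 4 characters for id, 3 for age, 5 for score
fwf_file <- tempfile(fileext = ".txt")
writeLines(c("A001 23 88.5",
             "B002 31 92.0"), fwf_file)

# widths gives the number of characters in each field, left to right
dat <- read.fwf(fwf_file, widths = c(4, 3, 5),
                col.names = c("id", "age", "score"))
dat
str(dat)

unlink(fwf_file)  # clean up
```

The widths must add up to cover each line exactly as it is laid out in the file, so it is worth checking them against the file's documentation or a few sample lines.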