Read large CSV in chunks and transform to RDS in R

I have a 20GB CSV file that I want to convert to an RDS file in R. However, the original file is too large to be processed (the computer with 64GB RAM tells me that 80.9GB needs to be allocated which exceeds its memory capacity). Therefore I am wondering, whether and how I can read that CSV in chunks, turn each chunk into a separate RDS file and afterward merge them together? Would that yield the same outcome as if I directly turned that one CSV file into an RDS file?
I am very new to R and could unfortunately not find any answers to my question.
Below is the current code I'm using.
library(Matrix)
library(data.table)
b <- fread('dtm.csv')
b_matx<- as.matrix(b)
dtm_b <- Matrix(b_matx, sparse = TRUE)
saveRDS(dtm_b, "dtm.rds")

See if this works.
It reads one column at a time using fread. By default fread creates a data.table; however, data.tables use external pointers, which can be a problem here, so we pass the data.table=FALSE argument to get a plain data frame. After reading a column in, it immediately writes it back out as an RDS file. After all columns have been written out as RDS files, it reads them back in and writes out a final RDS file that combines them. We use the 6-row input in the Note at the end as an example.
If fread with select= still takes up too much memory, use the xsv utility (not an R program) to ensure that only the column of interest is read in. xsv can be downloaded for various platforms; then use the commented-out line instead of the line following it. (Alternatively, cut, sed or awk can be used for the same purpose.)
You can also try interspersing the code lines with gc() to trigger garbage collection.
Also try replacing as.data.frame in the last line with setDT.
library(data.table)

File <- "BOD.csv"

# fread that returns a plain data frame rather than a data.table
freadDF <- function(..., data.table = FALSE) fread(..., data.table = data.table)

# read just the header to get the column names
L <- as.list(freadDF(File, nrows = 0))
nms <- names(L)

# write each column out to its own RDS file, one column at a time
fmt <- "xsv select %s %s"
# for(nm in nms) saveRDS(freadDF(cmd = sprintf(fmt, nm, File))[[1]], paste0(nm, ".rds"))
for(nm in nms) saveRDS(freadDF(File, select = nm)[[1]], paste0(nm, ".rds"))

# read the per-column RDS files back in and combine them into the final RDS file
for(nm in names(L)) L[[nm]] <- readRDS(paste0(nm, ".rds"))
saveRDS(as.data.frame(L), sub("\\.csv$", ".rds", File))
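For the setDT suggestion above, the last line becomes something like this; setDT converts the list to a data.table by reference, avoiding the extra copy that as.data.frame can make:
saveRDS(setDT(L), sub("\\.csv$", ".rds", File))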
Note
write.csv(BOD, "BOD.csv", quote = FALSE, row.names = FALSE)

Related

Saving data frames using a for loop with file names corresponding to data frames

I have a few data frames (colors, sets, inventory) and I want to save each of them into a folder that I have set as my wd. I want to do this using a for loop, but I am not sure how to write the file argument such that R understands that it should use the elements of the vector as the file names.
I might write:
DFs <- c("colors", "sets", "inventory")
for (x in 1:length(DFs)){
save(x, file = "x.Rda")
}
The goal would be that the files would save as colors.Rda, sets.Rda, etc. However, the last element to run through the loop simply saves as x.Rda.
In short, perhaps my question is: how do you tell R that I am wanting to use elements being run through a loop within an argument when that argument requires a character string?
For bonus points, I am sure I will encounter the same problem if I want to load a series of files from that folder in the future. Rather than loading each one individually, I'd also like to write a for loop. To load these a few minutes ago, I used the incredibly clunky code:
sets_file <- "~/Documents/ME teaching/R notes/datasets/sets.csv"
sets <- read.csv(sets_file)
inventories_file <- "~/Documents/ME teaching/R notes/datasets/inventories.csv"
inventories <- read.csv(inventories_file)
colors_file <- "~/Documents/ME teaching/R notes/datasets/colors.csv"
colors <- read.csv(colors_file)
For compactness I use lapply instead of a for loop here, but the idea is the same:
lapply(DFs, \(x) save(list = x, file = paste0(x, ".Rda")))
Note that you need to generate the varying file names by providing x as a variable and not as a character (as part of the file name).
To load those files, you can simply do:
lapply(paste0(DFs, ".Rda"), load, envir = globalenv())
To save you can do this:
DFs <- list(color, sets, inventory)
names(DFs) = c("color", "sets", "inventory")
for (x in 1:length(DFs)) {
  dx = paste(names(DFs)[[x]], "Rda", sep = ".")
  dfx = DFs[[x]]
  save(dfx, file = dx)   # note: each object is saved under the name dfx
}
To specify a path, just include it when constructing the dx object, as in the reading example below.
To read:
DFs <- c("colors", "sets", "inventory")
# or
DFs = dir("~/Documents/ME teaching/R notes/datasets/")
for(x in 1:length(DFs)){
arq = paste("~/Documents/ME teaching/R notes/datasets/", DFs[x], ".csv", sep = "")
DFs[x] = read.csv(arq)
}
It will read as a list, so you can access using [[]] indexation.

R : import millions of small alpha numeric csv files

I have about 300GB of 15KB csv files (each with exactly 100 rows) that I need to import, concatenate, manipulate and resave as a single rds.
I've managed to reduce the amount of RAM needed by only importing the columns I need but as soon as I need to do any operations on the columns, I max it out.
What is your strategy for this type of problem?
This is a shot at answering your question.
While this may not be the most effective or efficient solution, it works. The biggest upside is that you don't need to hold all the information in memory at once; instead you just append each result to a file.
If this is not fast enough, it is possible to use parallel processing (e.g. the parallel package) to speed it up; a sketch follows the code below.
library(tidyverse)
library(data.table)
# Make some example files
for (file_number in 1:1000) {
  df = data.frame(a = runif(10), b = runif(10))
  write_csv(x = df, file = paste0("example_", file_number, ".csv"))
}

# Get the list of files; change getwd() to your directory
list_of_files <- list.files(path = getwd(), full.names = TRUE)

# Define a function to read, manipulate, and save the result
read_man_save <- function(filename) {
  # Read the file using data.table's fread, which is faster than read_csv
  df = fread(file = filename)
  # Do the manipulation here, for example taking only the mean of column a
  result = mean(df$a)
  # Append the result to a file
  write(result, file = "out.csv", append = TRUE)
}
# Use lapply to perform the function over the list of filenames
# The output (which is null) is stored in a junk object
junk <- lapply(list_of_files, read_man_save)
# The resulting "out.csv" now contains 1000 lines of the mean
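As a hedged sketch of the parallel suggestion above, reusing list_of_files and fread from the code block; it returns the results instead of appending, so parallel workers do not interleave writes to the same file:
library(parallel)

# same idea as read_man_save, but returning the value rather than appending
read_man <- function(filename) {
  df = fread(file = filename)
  mean(df$a)
}

# mclapply forks on Linux/macOS; on Windows use parLapply() with makeCluster()
results <- mclapply(list_of_files, read_man, mc.cores = 4)   # mc.cores is a placeholder
write(unlist(results), file = "out.csv", ncolumns = 1)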
Feel free to comment if you want any edits to better reflect your use case.
You could also use the disk.frame package; it is designed to allow manipulation of data larger than RAM.
You can then manipulate the data as you would with data.table or with dplyr verbs.
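A minimal sketch, assuming a large file big.csv and placeholder column names a and b:
library(disk.frame)
library(dplyr)
setup_disk.frame()   # start background workers for parallel chunk processing

# convert the large CSV into an on-disk, chunked disk.frame
dff <- csv_to_disk.frame("big.csv", outdir = "big.df")

# dplyr verbs are applied chunk by chunk; collect() materialises the result in RAM
res <- dff %>%
  select(a, b) %>%
  filter(a > 0.5) %>%
  collect()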

Filter CSV files for specific value before importing

I have a folder with thousands of comma delimited CSV files, totaling dozens of GB. Each file contains many records, which I'd like to separate and process separately based on the value in the first field (for example, aa, bb, cc, etc.).
Currently, I'm importing all the files into a dataframe and then subsetting in R into smaller, individual dataframes. The problem is that this is very memory intensive - I'd like to filter the first column during the import process, not once all the data is in memory.
This is my current code:
setwd("E:/Data/")
files <- list.files(path = "E:/Data/",pattern = "*.csv")
temp <- lapply(files, fread, sep=",", fill=TRUE, integer64="numeric",header=FALSE)
DF <- rbindlist(temp)
DFaa <- subset(DF, V1 =="aa")
If possible, I'd like to move the "subset" process into lapply.
Thanks
1) read.csv.sql This will read the file directly into a temporary SQLite database (which it sets up for you) and then read only the aa records into R. The rest of the file is not read into R at any time. The table is then deleted from the database.
File is a character string that contains the file name (or pathname if not in the current directory). Other arguments may be needed depending on the format of the data.
library(sqldf)
read.csv.sql(File, "select * from file where V1 == 'aa'", dbname = tempfile())
2) grep/findstr Another possibility is to use grep (Linux) or findstr (Windows) to extract the lines containing aa. That should get you the desired lines plus possibly a few others, and at that point the input is much smaller, so it can be subset in R without memory problems. For example,
fread("findstr aa File")[V1 == 'aa'] # Windows
fread("grep aa File")[V1 == 'aa'] # Linux
sed or gawk could also be used and are included with Linux and in Rtools on Windows.
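For example, a hedged gawk equivalent (Linux shown; the quoting needs adjusting under the Windows cmd shell):
fread(cmd = "gawk -F, '$1 == \"aa\"' File")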
3) csvfix The free csvfix utility is available on all platforms that R supports and can be used to select field values -- there also exist numerous other similar utilities such as csvkit, csvtk, miller and xsv.
The line below says to return only lines for which the first comma separated field equals aa. This line may need to be modified slightly depending on the cmd line or shell processor used.
fread("csvfix find -if $1==aa File") # Windows
fread("csvfix find -if '$1'==aa File") # Linux bash
setwd("E:/Data/")
files <- list.files(path = "E:/Data/",pattern = "*.csv")
temp <- lapply(files, function(x) subset(fread(x, sep=",", fill=TRUE, integer64="numeric",header=FALSE), V1=="aa"))
DF <- rbindlist(temp)
Untested, but this will probably work - replace your function call with an anonymous function.
This could help but you have to expand the function:
#Function
myload <- function(x) {
  y <- fread(x, sep = ",", fill = TRUE, integer64 = "numeric", header = FALSE)
  y <- subset(y, V1 == "aa")
  return(y)
}
#Apply
temp <- lapply(files, myload)
If you don't want to muck with SQL, consider using the skip argument in a loop. Slower, but that way you read in a block of lines, filter them, then read in the next block of lines (to the same temp variable so as not to take extra memory), etc.
Inside your lapply call, use either a second lapply or, equivalently, a for loop:
# N is the number of 1000-row blocks to read, minus one
for (jj in 0:N) {
  # skip the header plus the rows already read, then read the next 1000-row block
  foo <- fread(filename, skip = jj*1000 + 1, nrows = 1000, sep = ",",
               fill = TRUE, integer64 = "numeric", header = FALSE)
  mydata[[jj + 1]] <- do_something_to_filter(foo)
}

Read a 20GB file in chunks without exceeding my RAM - R

I'm currently trying to read a 20GB file. I only need 3 columns of that file.
My problem is, that I'm limited to 16 GB of ram. I tried using readr and processing the data in chunks with the function read_csv_chunked and read_csv with the skip parameter, but those both exceeded my RAM limits.
Even the read_csv(file, ..., skip = 10000000, nrow = 1) call that reads one line uses up all my RAM.
My question now is, how can I read this file? Is there a way to read chunks of the file without using that much ram?
The LaF package can read ASCII data in chunks. It can be used directly, or, if you are using dplyr, the chunked package builds on it to provide an interface suited to dplyr.
The readr package has read_csv_chunked and related functions.
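A minimal sketch of that chunked readr approach, assuming the three wanted columns are named col1, col2 and col3 (placeholder names):
library(readr)

# only chunk_size rows are parsed at a time, and each chunk is reduced to the
# three wanted columns before the pieces are combined into one data frame
DF <- read_csv_chunked(
  "myfile",
  callback = DataFrameCallback$new(function(chunk, pos) chunk[, c("col1", "col2", "col3")]),
  chunk_size = 100000
)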
The section of this web page entitled The Loop as well as subsequent sections of that page describes how to do chunked reads with base R.
It may be that if you remove all but the first three columns, the data will be small enough to just read in and process in one go.
vroom in the vroom package can read files very quickly and also has the ability to read in just the columns named in its col_select= argument, which may make the data small enough to read in one go.
fread in the data.table package is a fast reading function that also supports a select= argument which can select only specified columns.
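For example, a sketch with a placeholder file name and column names:
library(data.table)
DF <- fread("myfile", select = c("col1", "col2", "col3"))
# vroom is analogous: vroom::vroom("myfile", col_select = c(col1, col2, col3))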
read.csv.sql in the sqldf package (also see its github page) can read a file larger than R can handle into a temporary external SQLite database, which it creates for you and removes afterwards, and reads the result of the given SQL statement into R. If the first three columns are named col1, col2 and col3 then try the code below. See ?read.csv.sql and ?sqldf for the remaining arguments, which will depend on your file.
library(sqldf)
DF <- read.csv.sql("myfile", "select col1, col2, col3 from file",
dbname = tempfile(), ...)
read.table and read.csv in base R have a colClasses= argument which takes a vector of column classes. If the file has nc columns then use colClasses = rep(c(NA, "NULL"), c(3, nc - 3)) to read only the first 3 columns.
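A sketch of that colClasses= idiom, with a placeholder file name and nc derived from the header row:
# count the columns cheaply by reading a single row
nc <- ncol(read.csv("myfile", nrows = 1))

# NA keeps a column (its class is guessed); "NULL" drops the column while reading
DF <- read.csv("myfile", colClasses = rep(c(NA, "NULL"), c(3, nc - 3)))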
Another approach is to pre-process the file outside of R, using cut, sed or awk (available natively on UNIX and in the Rtools bin directory on Windows) or any of a number of free command line utilities such as csvfix, to remove all but the first three columns, and then see if that makes it small enough to read in one go.
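For instance, a hedged sketch that combines cut with fread (placeholder file name; on Windows this needs the Rtools utilities on the PATH):
library(data.table)
DF <- fread(cmd = "cut -d, -f1-3 myfile")   # keep only the first three comma-separated fields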
Also check out the High Performance Computing task view.
We can try something like this. First, a small example csv:
X = data.frame(id = 1:1e5, matrix(runif(1e6), ncol = 10))
write.csv(X, "test.csv", quote = FALSE, row.names = FALSE)
You can use the nrows argument of read.csv: instead of providing a file name, you provide a connection, and you store the results inside a list, for example:
data = vector("list",200)
con = file("test.csv","r")
data[[1]] = read.csv(con, nrows=1000)
dim(data[[1]])
COLS = colnames(data[[1]])
data[[1]] = data[[1]][,1:3]
head(data[[1]])
id X1 X2 X3
1 1 0.13870273 0.4480100 0.41655108
2 2 0.82249489 0.1227274 0.27173937
3 3 0.78684815 0.9125520 0.08783347
4 4 0.23481987 0.7643155 0.59345660
5 5 0.55759721 0.6009626 0.08112619
6 6 0.04274501 0.7234665 0.60290296
In the above, we read the first chunk, collected the colnames and subsetted. If you carry on reading through the connection, the headers will be missing, and we need to specify that:
for (i in 2:100) {   # 1e5 rows / 1000 rows per chunk = 100 chunks in total
  data[[i]] = read.csv(con, nrows = 1000, col.names = COLS, header = FALSE)[, 1:3]
}
close(con)   # close the connection once all chunks have been read
Finally, we bind all of those together into a single data.frame:
data = do.call(rbind,data)
all.equal(data[,1:3],X[,1:3])
[1] TRUE
You can see that I specified a much larger list than required (200 slots while only 100 chunks are read); rbind simply drops the unused NULL entries, so you can over-allocate if you don't know exactly how long the file is. Note, though, that read.csv will throw an error once the connection is exhausted, so the loop should not run past the last chunk (or the read should be wrapped in tryCatch). This is a bit tidier than writing a while loop.
So we wrap it into a function, specifying the file, number of rows to read at one go, the number of times, and the column names (or position) to subset:
read_chunkcsv = function(file, rows_to_read, ntimes, col_subset) {
  data = vector("list", ntimes)
  con = file(file, "r")
  on.exit(close(con))   # make sure the connection gets closed
  data[[1]] = read.csv(con, nrows = rows_to_read)
  COLS = colnames(data[[1]])
  data[[1]] = data[[1]][, col_subset]
  for (i in 2:ntimes) {
    data[[i]] = read.csv(con, nrows = rows_to_read,
                         col.names = COLS, header = FALSE)[, col_subset]
  }
  return(do.call(rbind, data))
}
all.equal(X[,1:3],
read_chunkcsv("test.csv",rows_to_read=10000,ntimes=10,1:3))

Creating a new file with both a subset of data and file names from a group of .csv files

My issue is likely with how I'm exporting the data from the for loop, but I'm not sure how to fix it.
I've got over 200 files in a folder, all structured in the same way, from which I'd like to pull the maximum number from a single column. I've made a for loop to do this, based on code from http://www.r-bloggers.com/looping-through-files/
What I have running so far looks like this:
fileNames <- Sys.glob("*.csv")
for (i in 1:length(fileNames)) {
  data <- read.csv(fileNames[i])
  VelM <- max(data[, 8], na.rm = TRUE)
  write.table(VelM, "Summary", append = TRUE, sep = ",",
              row.names = FALSE, col.names = FALSE)
}
This works, but I need to figure out a way to have a second column in my summary file that contains the original file name the data in that row came from for reference.
I tried making both a matrix and a data frame instead of going straight to the table writing, but in both cases I wasn't able to append the data and ended up with values from only the last file.
Any ideas would be greatly appreciated!
Here's what I would recommend to improve your current method, also going with fread() because it's very fast and has the select argument. Notice I have moved the write.table() call outside the for() loop. This allows a cleaner way of adding the new column of file names alongside the max column, and eliminates the need to append to the file on every iteration.
library(data.table)
fileNames <- Sys.glob("*.csv")
VelM <- numeric(length(fileNames))
for (i in seq_along(fileNames)) {
  VelM[i] <- max(fread(fileNames[i], select = 8)[[1L]], na.rm = TRUE)
}
write.table(data.frame(VelM, fileNames), "Summary", sep = ",",
row.names = FALSE, col.names = FALSE)
If you want to quickly read files, you should consider using data.table::fread or readr::read_csv instead of base read.csv.
For example:
fileNames <- list.files(path = your_path, pattern='\\.csv') # instead of Sys.glob
library('data.table')
dt <- rbindlist(setNames(lapply(fileNames, fread, select = 8), fileNames), idcol = TRUE)
summary_dt <- dt[, .(max_val = max(your_var, na.rm = TRUE)), by = .id]
write.table(summary_dt, 'yourfile.csv', sep = ',', row.names = FALSE, col.names = FALSE)
Explanation: data.table::fread reads in only the select= 8th column from each file (via lapply over fileNames, which returns a list of one-column data.tables). setNames() gives that list the file names as names, and data.table::rbindlist then combines it into a single data.table, adding an id column (named .id by default) because of idcol=TRUE. From ?rbindlist, note that
If input is a named list, ids are generated using them
Because the list is named with the elements of fileNames, this is an easy way of passing the file names through for grouping.
The rest is data.table syntax. It wasn't clear from your question if there is a header row and whether you know the heading in advance. If so, you can either keep header=TRUE and use the header name for your_var, or you can do skip=1, header=FALSE, col.names = 'your_var'.
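For the no-header case, a hedged single-file sketch (the column position 8 and the name your_var are placeholders, as above; setnames() is used here as an alternative to col.names=):
library(data.table)
one <- fread(fileNames[1], select = 8, skip = 1, header = FALSE)
setnames(one, "your_var")   # give the single selected column a usable name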
