R: import millions of small alphanumeric CSV files

I have about 300 GB of 15 KB CSV files (each with exactly 100 rows) that I need to import, concatenate, manipulate and resave as a single RDS.
I've managed to reduce the amount of RAM needed by importing only the columns I need, but as soon as I need to do any operations on the columns, I max it out.
What is your strategy for this type of problem?

This is a shot at answering your question.
While this may not be the most effective or efficient solution, it works. The biggest upside is that you don't need to hold all the information in memory at once; instead you just append each result to a file.
If this is not fast enough, it is possible to use parallel processing to speed it up (see the sketch after the code below).
library(tidyverse)
library(data.table)
# Make some example files
for (file_number in 1:1000) {
  df = data.frame(a = runif(10), b = runif(10))
  write_csv(x = df, file = paste0("example_", file_number, ".csv"))
}
# Get the list of files; change getwd() to your directory
list_of_files <- list.files(path = getwd(), pattern = "^example_.*\\.csv$", full.names = TRUE)
# Define function to read, manipulate, and save result
read_man_save <- function(filename) {
  # Read file using data.table's fread, which is faster than read_csv
  df = fread(file = filename)
  # Do the manipulation here, for example getting only the mean of column a
  result = mean(df$a)
  # Append the result to a file
  write(result, file = "out.csv", append = TRUE)
}
# Use lapply to perform the function over the list of filenames
# The output (which is null) is stored in a junk object
junk <- lapply(list_of_files, read_man_save)
# The resulting "out.csv" now contains 1000 lines of the mean
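If the sequential lapply above is too slow, here is a minimal sketch of the parallel idea. It assumes a Unix-alike system (mclapply forks worker processes; on Windows you would use parLapply instead) and that four cores are available; results are collected in memory and written once, so concurrent workers never append to the same file.
library(parallel)
library(data.table)
# Read one file and compute its summary, returning the value instead of
# appending to a shared file (concurrent appends can interleave lines)
read_man <- function(filename) {
  df <- fread(file = filename)
  mean(df$a)
}
# mc.cores = 4 is an assumption; set it to the number of cores you have
results <- mclapply(list_of_files, read_man, mc.cores = 4)
# Write all results in one go from the main process, one value per line
write(unlist(results), file = "out.csv", ncolumns = 1)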
Feel free to comment if you want any edits to better reflect your use case.

You could also use the disk.frame package; it is designed to allow manipulation of data larger than RAM.
You can then manipulate the data as you would with data.table or with dplyr verbs.
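A minimal sketch of that approach, assuming CSVs named as in the example above and a 4-worker setup (both assumptions to adjust for your data): csv_to_disk.frame() writes the data into on-disk chunks, dplyr verbs are evaluated chunk by chunk, and only the collected result has to fit in RAM.
library(disk.frame)
library(dplyr)
# Background workers used for chunk processing (assumption: 4 cores)
setup_disk.frame(workers = 4)
options(future.globals.maxSize = Inf)
# Convert the CSVs into one on-disk disk.frame without loading them all into RAM
list_of_files <- list.files(path = getwd(), pattern = "^example_.*\\.csv$", full.names = TRUE)
dff <- csv_to_disk.frame(list_of_files, outdir = "combined.df")
# Manipulate with dplyr verbs; only the collected result lives in memory
result <- dff %>%
  filter(a > 0.5) %>%
  select(a, b) %>%
  collect()
saveRDS(result, "result.rds")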

Related

Read large CSV in chunks and transform to RDS in R

I have a 20GB CSV file that I want to convert to an RDS file in R. However, the original file is too large to be processed (the computer, with 64GB of RAM, tells me that 80.9GB needs to be allocated, which exceeds its memory capacity). Therefore I am wondering whether and how I can read that CSV in chunks, turn each chunk into a separate RDS file and afterwards merge them together. Would that yield the same outcome as if I directly turned that one CSV file into an RDS file?
I am very new to R and could unfortunately not find any answers to my question.
Below is the current code I'm using.
library(Matrix)
library(data.table)
b <- fread('dtm.csv')
b_matx <- as.matrix(b)
dtm_b <- Matrix(b_matx, sparse = TRUE)
saveRDS(dtm_b, "dtm.rds")
See if this works.
It reads one column at a time using fread. By default fread creates a data.table; these use external pointers, which can be a problem here, so we use the data.table = FALSE argument to get a plain data frame instead. After reading a column in, it immediately writes it back out as an RDS file. After all columns have been written out as RDS files, it reads them back in and writes the final RDS file that combines them. We use the 6-row input in the Note at the end as an example.
If fread with select= still takes up too much memory, use the xsv utility (not an R program) to ensure that only the column of interest is read in. xsv can be downloaded for various platforms here; then use the commented-out line instead of the line following it. (Alternatively, cut, sed or awk can be used for the same purpose.)
You can also try interspersing the code lines with gc() to trigger garbage collection.
Also try replacing as.data.frame in the last line with setDT.
library(data.table)
File <- "BOD.csv"
# fread wrapper that returns a plain data frame rather than a data.table
freadDF <- function(..., data.table = FALSE) fread(..., data.table = data.table)
# Read only the header to get the column names
L <- as.list(freadDF(File, nrows = 0))
nms <- names(L)
fmt <- "xsv select %s %s"
# Read one column at a time and save each as its own RDS file
# for(nm in nms) saveRDS(freadDF(cmd = sprintf(fmt, nm, File))[[1]], paste0(nm, ".rds"))
for(nm in nms) saveRDS(freadDF(File, select = nm)[[1]], paste0(nm, ".rds"))
# Read the per-column RDS files back in and write out the combined RDS
for(nm in names(L)) L[[nm]] <- readRDS(paste0(nm, ".rds"))
saveRDS(as.data.frame(L), sub(".csv$", ".rds", File))
Note
write.csv(BOD, "BOD.csv", quote = FALSE, row.names = FALSE)
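If you prefer the row-chunking route the question asks about, here is a rough sketch using fread's skip= and nrows= arguments. The chunk size and file names are assumptions; note that the final recombination still needs enough RAM to hold all chunks at once, and that column types should be checked for consistency across chunks (the rows themselves come out the same as reading the CSV in one go).
library(data.table)
File <- "dtm.csv"
chunk_size <- 1e6L                        # rows per chunk (assumption; tune to your RAM)
header <- names(fread(File, nrows = 0))   # column names from the header line
n_rows <- nrow(fread(File, select = 1L))  # count rows by reading a single column
n_chunks <- ceiling(n_rows / chunk_size)
for (i in seq_len(n_chunks)) {
  chunk <- fread(File,
                 skip = (i - 1) * chunk_size + 1,  # + 1 skips the header line
                 nrows = chunk_size,
                 header = FALSE, col.names = header)
  saveRDS(chunk, sprintf("dtm_chunk_%03d.rds", i))
  rm(chunk); gc()
}
# Recombine; rbindlist gives the same rows as one big fread would
chunks <- lapply(sprintf("dtm_chunk_%03d.rds", seq_len(n_chunks)), readRDS)
saveRDS(rbindlist(chunks), "dtm.rds")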

Using a For-loop to create multiple objects with incremental suffixes, then reading in .csv file to each new object (also with incremental suffixes)

I've just started learning R so forgive me for my ignorance! I'm reading in lots of .csv files, each of which correlates to a different year (2010-2019). I then filter down the .csv files based on a variable within one of the columns (because the datasets are very large). Currently I am using the code below to do this and then repeating it for each year:
data_2010 <- data.table::fread("//Project/2010 data/2010 data.csv", select = c("date", "id", "type"))
data_b_2010 <- data_2010[which(data_2010$type=="ABC123")]
rm(data_2010)
What I would like to do is use a For-loop to create new object data_20xx for each year, and then read in the .csv files (and apply the filter of "type") for each year too.
I think I know how to create the objects in a For-loop but not entirely sure how I would also assign the .csv files and change the filepath string so it updates with each year (i.e. "//Project/2010 data/2010 data.csv" to "//Project/2011 data/2011 data.csv").
Any help would be greatly appreciated!
Next time, please provide a reproducible example so we can help you.
I would use data.table which contains specialized functions to do what you want.
library(data.table)
setwd("Project")
allfiles <- list.files(recursive = TRUE, full.names = TRUE)
allcsv <- allfiles[grepl("\\.csv$", allfiles)]
data_list <- list()
for (i in 1:length(allcsv)) {
  # Print progress as a fraction of files read
  print(round(i / length(allcsv), 2))
  data_list[[i]] <- fread(allcsv[i])
}
data_list_filtered <- lapply(data_list, function(x) {
  y <- data.frame(x)
  return(y[which(y$type == "ABC123"), ])
})
result <- rbindlist(data_list_filtered)
First, list.files will tell you all the files contained in your working dir by default.
Second, read each csv file into the data_list list using the fast and efficient fread function.
Third, do the filtering within a loop, as requested.
Fourth, use rbindlist from data.table to rbind all of these data.tables into one.
Finally, if you are not familiar with the data.table syntax, you can run setDF(result) to convert your results back to a data.frame.
I strongly encourage you to learn the data.table syntax as it is quite powerful and efficient for tabular data manipulations. These vignettes will get you started.
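If you would rather stick with the year-by-year structure from your question, a small sketch of that variant could look like this (the path pattern and the "ABC123" filter are taken from your example; adjust both to your data):
library(data.table)
years <- 2010:2019
data_list <- list()
for (yr in years) {
  # Build the per-year path, e.g. "//Project/2010 data/2010 data.csv"
  path <- sprintf("//Project/%d data/%d data.csv", yr, yr)
  dt <- fread(path, select = c("date", "id", "type"))
  # Filter immediately so only the rows of interest stay in memory
  data_list[[as.character(yr)]] <- dt[type == "ABC123"]
}
# Either keep the per-year tables in data_list, or stack them into one table
result <- rbindlist(data_list, idcol = "year")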

Using R to merge many large CSV files across sub-directories

I have over 300 large CSV files with the same filename, each in a separate sub-directory, that I would like to merge into a single dataset using R. I'm asking for help on how to remove columns I don't need in each CSV file, while merging in a way that breaks the process down into smaller chunks that my memory can more easily handle.
My objective is to create a single CSV file that I can then import into STATA for further analysis using code I have already written and tested on one of these files.
Each of my CSVs is itself rather large (about 80 columns, many of which are unnecessary, and each file has tens to hundreds of thousands of rows), and there are almost 16 million observations in total, or roughly 12GB.
I have written some code which manages to do this successfully for a test case of two CSVs. The challenge is that neither my work nor my personal computers have enough memory to do this for all 300+ files.
The code I have tried is here:
library(here) ## loads package to find files
( allfiles = list.files(path = here("data"), ## creates a list of the files, read as [1], [2], ... [n]
    pattern = "candidates.csv", ## identifies the relevant files
    full.names = TRUE,          ## returns the full file names
    recursive = TRUE) )         ## searches in sub-directories
read_fun = function(path) {
  test = read.csv(path,
                  header = TRUE)
  test
} ### defines a function that reads one file
( test = read.csv(allfiles[1],
                  header = TRUE) ) ### tests that file [1] can be read
library(purrr) ### loads package to unlock map_dfr
library(dplyr) ### loads package to unlock map_dfr
( combined_dat = map_dfr(allfiles, read_fun) )
I expect the result to be a single RDS file, and this works for the test case. Unfortunately, the amount of memory this process requires when looking at 15.5m observations across all my files causes RStudio to crash, and no RDS file is produced.
I am looking for help on how to 1) reduce the load on my memory by stripping out some of the variables in my CSV files I don't need (columns with headers junk1, junk2, etc); and 2) how to merge in a more manageable way that merges my CSV files in sequence, either into a few RDS files to themselves be merged later, or through a loop cumulatively into a single RDS file.
However, I don't know how to proceed with these - I am still new to R, and any help on how to proceed with both 1) and 2) would be much appreciated.
Thanks,
Twelve GB is quite a bit for one object. It's probably not practical to use a single RDS or CSV unless you have far more than 12GB of RAM. You might want to look into using a database, a technology made for this kind of thing; I'm sure Stata can also interact with databases. You might also want to read up on how to interact with large CSVs using various strategies and packages.
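As a hedged sketch of the database route (DBI and RSQLite are real packages; the file, table, and column names here are assumptions for illustration), each cleaned CSV can be appended to a SQLite table on disk and queried later without ever holding all 12GB in RAM:
library(DBI)
library(RSQLite)
library(readr)
library(dplyr)
con <- dbConnect(RSQLite::SQLite(), "candidates.sqlite")
csv_paths <- list.files("dir_of_csvs", pattern = "\\.csv.*", recursive = TRUE, full.names = TRUE)
for (path in csv_paths) {
  df <- read_csv(path, col_types = cols(.default = "c"))  # everything as character
  df <- select(df, -starts_with("junk"))                  # drop the junk columns
  # Append each cleaned file to one table; SQLite keeps it on disk
  dbWriteTable(con, "candidates", df, append = TRUE)
}
dbDisconnect(con)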
Creating a large CSV isn't at all difficult. Just remember that you have to work with said giant CSV sometime in the future, which probably will be difficult. To create a large CSV, just process each component CSV individually and then append them to your new CSV. The following reads in each CSV, removes unwanted columns, and then appends the resulting dataframe to a flat file:
library(dplyr)
library(readr)
library(purrr)
load_select_append <- function(path) {
  # Read in CSV. Let every column be of class character.
  df <- read_csv(path, col_types = paste(rep("c", 82), collapse = ""))
  # Remove variables beginning with "junk"
  df <- select(df, -starts_with("junk"))
  # If the file exists, append to it without column names; otherwise create it
  # with column names.
  if (file.exists("big_csv.csv")) {
    write_csv(df, "big_csv.csv", col_names = FALSE, append = TRUE)
  } else {
    write_csv(df, "big_csv.csv", col_names = TRUE)
  }
}
# Get the paths to the CSVs.
csv_paths <- list.files(path = "dir_of_csvs",
                        pattern = "\\.csv.*",
                        recursive = TRUE,
                        full.names = TRUE)
# Apply function to each path.
walk(csv_paths, load_select_append)
When you're ready to work with your CSV you might want to consider using something like the ff package, which enables interaction with on-disk objects. You are somewhat restricted in what you can do with an ffdf object, so eventually you'll have to work with samples:
library(ff)
df_ff <- read.csv.ffdf(file = "big_csv.csv")
df_samp <- df_ff[sample.int(nrow(df_ff), size = 100000),]
df_samp <- mutate(df_samp, ID = factor(ID))
summary(df_samp)
#### OUTPUT ####
values ID
Min. :-9.861 17267 : 6
1st Qu.: 6.643 19618 : 6
Median :10.032 40258 : 6
Mean :10.031 46804 : 6
3rd Qu.:13.388 51269 : 6
Max. :30.465 52089 : 6
(Other):99964
As far as I know, chunking and on-disk interactions are not possible with RDS or RDA, so you are stuck with flat files (or you go with one of the other options I mentioned above).

How do I save output from a large simulation in R? (multiple nodes, safe access)

I am doing a large simulation for a research project--simulating 1,000 football seasons and analyzing the results. As the seasons will be spread across multiple nodes, I need an easy way to save my output data into a file (or files) to access later. Since I can't control when the nodes will finish, I can't have them all trying to write to the same file at the same time, but if they all save to a different file, I would need a way to aggregate all the data easily afterward. Thoughts?
I do not know if this question was asked already, but here is what I do in my research. You can loop through the file names and aggregate them into one object like so:
require(data.table)
dt1 <- data.table()
for (i in 1:100) {
  k <- paste0("C:/chunkruns/dat", i, "/dt.RData")
  load(k)  # assumes each dt.RData file contains an object named dt
  dt1 <- rbind(dt1, dt)
}
agg.data <- dt1
rm(dt1)
The above code assumes that all your files are saved in different folders but with the same file name.
Alternatively, you can use the following to identify file paths matching a pattern and then combine them:
require(data.table)
# Get the list of files and then read them in using read.csv
k <- list.files(path = "W:/chunkruns/dat", pattern = "^Output", all.files = FALSE, full.names = TRUE, recursive = TRUE)
m <- lapply(k, FUN = function(x) read.csv(x, skip = 11, header = TRUE))
agg.data <- rbindlist(m)
rm(m)
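For the writing side of your question, one hedged way to guarantee that no two nodes ever touch the same file is to have each node save its own uniquely named RDS and to stack them afterwards. task_id is a hypothetical identifier you would take from your scheduler or job index, and season_results stands in for whatever object a node produces:
library(data.table)
# On each node: save that node's simulated seasons under a task-specific name
save_node_output <- function(season_results, task_id) {
  dir.create("results", showWarnings = FALSE)
  saveRDS(season_results, sprintf("results/seasons_%04d.rds", task_id))
}
# Afterwards, on one machine: read every per-node file back and stack them
files <- list.files("results", pattern = "\\.rds$", full.names = TRUE)
agg.data <- rbindlist(lapply(files, readRDS))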

Using rbind() to combine multiple data frames into one larger data.frame within lapply()

I'm using R-Studio 0.99.491 and R version 3.2.3 (2015-12-10). I'm a relative newbie to R, and I'd appreciate some help. I'm doing a project where I'm trying to use the server logs on an old media server to identify which folders/files within the server are still being accessed and which aren't, so that my team knows which files to migrate. Each log is for a 24 hour period, and I have approximately a year's worth of logs, so in theory, I should be able to see all of the access over the past year.
My ideal output is to get a tree structure or plot that will show me the folders on our server that are being used. I've figured out how to read one log (one day) into R as a data.frame and then use the data.tree package in R to turn that into a tree. Now, I want to recursively go through all of the files in the directory, one by one, and add them to that original data.frame, before I create the tree. Here's my current code:
# Create the list of log files in the folder
files <- list.files(pattern = "\\.log$", full.names = TRUE, recursive = FALSE)
# Create a new data.frame to hold the aggregated log data
uridata <- data.frame()
# My function to go through each file, one by one, and add it to the 'uridata' df, above
lapply(files, function(x){
  uriraw <- read.table(x, skip = 3, header = TRUE, stringsAsFactors = FALSE)
  # print(nrow(uriraw))
  uridata <- rbind(uridata, uriraw)
  # print(nrow(uridata))
})
The problem is that, no matter what I try, the value of 'uridata' within the lapply loop seems to not be saved/passed outside of the lapply loop, but is somehow being overwritten each time the loop runs. So instead of getting one big data.frame, I just get the contents of the last 'uriraw' file. (That's why there are those two commented print commands inside the loop; I was testing how many lines there were in the data frames each time the loop ran.)
Can anyone clarify what I'm doing wrong? Again, I'd like one big data.frame at the end that combines the contents of each of the (currently seven) log files in the folder.
do.call() is your friend.
big.list.of.data.frames <- lapply(files, function(x){
  read.table(x, skip = 3, header = TRUE, stringsAsFactors = FALSE)
})
or more concisely (but less-tinkerable):
big.list.of.data.frames <- lapply(files, read.table,
                                  skip = 3, header = TRUE,
                                  stringsAsFactors = FALSE)
Then:
big.data.frame <- do.call(rbind,big.list.of.data.frames)
This is a recommended way to do things because "growing" a data frame dynamically in R is painful: it is slow and memory-expensive, since a new data frame gets built at each iteration.
You can use map_df from the purrr package instead of lapply to have all results combined directly into a data frame.
map_df(files, read.table, skip = 3, header = TRUE, stringsAsFactors = FALSE)
Another option is fread from data.table:
library(data.table)
rbindlist(lapply(files, fread, skip = 3))
