Partition a large list into chunks with convenient I/O in R

I have a large list, approximately 1.3 GB in size. I'm looking for the fastest solution in R to generate chunks and save them in any convenient format such that:
a) every saved file of the chunk is less than 100MB large
b) the original list can be loaded conveniently and fast into a new R workspace
EDIT II: The reason for doing this is to have an R solution that works around GitHub's file size restriction of 100 MB per file. The limitation to R is due to some external, non-technical restrictions which I can't comment on.
What is the best solution for this problem?
EDIT I: Since it was mentioned in the comments that some example code would help make this a better question:
An R example of a list roughly 1.3 GB in size:
li <- list(a = rnorm(10^8),
           b = rnorm(10^7.8))

So, you want to split a file and reload it as a single dataframe.
There is a twist: to reduce file size, it would be wise to compress, but then the file size is not entirely deterministic. You may have to tweak a parameter.
The following is a piece of code I have used for a similar task (unrelated to GitHub though).
The split.file function takes 3 arguments: a dataframe, the number of rows to write in each file, and the base filename. For instance, if basename is "myfile", the files will be "myfile00001.rds", "myfile00002.rds", etc.
The function returns the number of files written.
The join.files function takes the base name.
Note:
Play with the rows parameter to find out the correct size to fit in 100 MB. It depends on your data, but for similar datasets a fixed size should do. However, if you are dealing with very different datasets, this approach will likely fail.
When reading, you need about twice as much memory as your dataframe occupies (a list of the smaller dataframes is read first, then they are rbind-ed together).
The number is written as 5 digits, but you can change that. The goal is to have the names in lexicographic order, so that when the files are concatenated, the rows are in the same order as the original file.
Here are the functions:
split.file <- function(db, rows, basename) {
  n <- nrow(db)
  m <- n %/% rows
  # Write full-size chunks of `rows` rows each
  for (k in seq_len(m)) {
    db.sub <- db[seq(1 + (k - 1) * rows, k * rows), , drop = FALSE]
    saveRDS(db.sub, file = sprintf("%s%.5d.rds", basename, k),
            compress = "xz", ascii = FALSE)
  }
  # Write the remainder, if any
  if (m * rows < n) {
    db.sub <- db[seq(1 + m * rows, n), , drop = FALSE]
    saveRDS(db.sub, file = sprintf("%s%.5d.rds", basename, m + 1),
            compress = "xz", ascii = FALSE)
    m <- m + 1
  }
  m  # number of files written
}
join.files <- function(basename) {
  # Lexicographic sort keeps the chunks in their original row order
  files <- sort(list.files(pattern = sprintf("%s[0-9]{5}\\.rds", basename)))
  do.call("rbind", lapply(files, readRDS))
}
Example:
n <- 1500100
db <- data.frame(x = rnorm(n))
split.file(db, 100000, "myfile")
dbx <- join.files("myfile")
all(dbx$x == db$x)
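If you want to sanity-check the 100 MB constraint after writing, base R's file.size() (which returns sizes in bytes) can report the largest chunk. A small sketch, not from the original answer, using the "myfile" naming from the example above:
# Largest chunk size in MB; if this exceeds 100, lower `rows` in split.file()
chunk_files <- list.files(pattern = "myfile[0-9]{5}\\.rds")
max(file.size(chunk_files)) / 2^20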

Related

Saving data frames using a for loop with file names corresponding to data frames

I have a few data frames (colors, sets, inventory) and I want to save each of them into a folder that I have set as my wd. I want to do this using a for loop, but I am not sure how to write the file argument such that R understands that it should use the elements of the vector as the file names.
I might write:
DFs <- c("colors", "sets", "inventory")
for (x in 1:length(DFs)) {
  save(x, file = "x.Rda")
}
The goal would be that the files would save as colors.Rda, sets.Rda, etc. However, the last element to run through the loop simply saves as x.Rda.
In short, perhaps my question is: how do you tell R to use the element currently passing through the loop inside an argument that requires a character string?
For bonus points, I am sure I will encounter the same problem if I want to load a series of files from that folder in the future. Rather than loading each one individually, I'd also like to write a for loop. To load these a few minutes ago, I used the incredibly clunky code:
sets_file <- "~/Documents/ME teaching/R notes/datasets/sets.csv"
sets <- read.csv(sets_file)
inventories_file <- "~/Documents/ME teaching/R notes/datasets/inventories.csv"
inventories <- read.csv(inventories_file)
colors_file <- "~/Documents/ME teaching/R notes/datasets/colors.csv"
colors <- read.csv(colors_file)
For compactness I use lapply instead of a for loop here, but the idea is the same:
lapply(DFs, \(x) save(list = x, file = paste0(x, ".Rda")))
Note that the varying file names have to be built from x as a variable (via paste0()) rather than embedding the literal character "x" in the file name, and the object itself is passed by name via list = x.
To load those files, you can simply do:
lapply(paste0(DFs, ".Rda"), load, envir = globalenv())
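As a quick check (not part of the original answer): load() restores each object under the name it was saved with, so after the two calls above the data frames are back in the global environment under their original names:
# Each .Rda file stores the object under its own name, so load() brings back
# "colors", "sets", and "inventory" directly:
sapply(DFs, exists)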
To save you can do this:
DFs <- list(colors, sets, inventory)
names(DFs) <- c("colors", "sets", "inventory")
for (x in 1:length(DFs)) {
  dx <- paste(names(DFs)[[x]], "Rda", sep = ".")
  dfx <- DFs[[x]]
  save(dfx, file = dx)   # note: the object inside each .Rda is named dfx
}
To save to a specific folder, just include the path when constructing the dx object.
To read:
DFs <- c("colors", "sets", "inventory")
# or
DFs = dir("~/Documents/ME teaching/R notes/datasets/")
for(x in 1:length(DFs)){
arq = paste("~/Documents/ME teaching/R notes/datasets/", DFs[x], ".csv", sep = "")
DFs[x] = read.csv(arq)
}
The data frames end up in a list (dat above), so you can access each one with [[ ]] indexing, e.g. dat[["colors"]].

Fastest way in R to read and compare CSV files

I know there are other questions on Stack Overflow about the fastest way of reading CSV files in R, and they have been answered; data.table seems to be the way to go. But I have additional requirements.
I need to come up with a script that performs a set operation between two groups of vectors (finding the count of values that appear in both vectors). Both groups of vectors are to be fetched from CSV files in two different directories, dirA and dirB. Each vector in dirA will be compared with all the vectors in dirB and the number of matches recorded. dirA has about 50 files and dirB has 3000 files of varying size (1 to 60 MB).
Below is my attempt at it using R. It is not as fast as I would expect (compared to a similar solution implemented in Pandas, this code is 30% slower). One pass over the 3000 files takes more than 120 seconds to read. Is there something I am missing, or is this about the best I can get in R, maybe by clever use of vectorization and multiple comparisons in one go? Any help is appreciated. Thank you.
I am using data.table version 1.13.6.
I want to read everything as strings (there are leading zeros and some other anomalies).
Code:
library(here)
library(data.table)

path_dirA <- "data/processed_data_dirA"
path_dirB <- "data/processed_data_dirB"

fn_dirA <- list.files(here(path_dirA), pattern = "csv")
fn_dirB <- list.files(here(path_dirB), pattern = "csv")

v_count_matched <- integer()
for (fn1 in fn_dirA) {
  f1 <- data.table::fread(here(path_dirA, fn1), colClasses = 'character')
  for (fn2 in fn_dirB) {
    f2 <- data.table::fread(here(path_dirB, fn2), colClasses = 'character')
    # nrow() counts the matching values; length() would only count columns
    v_count_matched <- c(v_count_matched, nrow(fintersect(f1[, 1], f2[, 1])))
  }
}
For this particular case, most of the time is consumed reading the CSV files.
If you cached those CSV files on disk in another format with faster read times, you would get the biggest savings.
For example, this pays off if you need to repeat the comparisons daily but only one CSV has changed.
You could have those CSV files saved (cached on disk) in fst format. https://www.fstpackage.org/
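A minimal sketch of that caching idea (not from the original answer; it assumes the files in dirB are the ones worth caching and reuses the directory path from the question):
library(data.table)
library(fst)

# One-time (or incremental) conversion: cache each CSV as an .fst file
csvs <- list.files("data/processed_data_dirB", pattern = "\\.csv$", full.names = TRUE)
for (csv in csvs) {
  cached <- sub("\\.csv$", ".fst", csv)
  if (!file.exists(cached)) {
    write_fst(fread(csv, colClasses = "character"), cached)
  }
}

# Later runs read the much faster .fst copies
f2 <- read_fst(sub("\\.csv$", ".fst", csvs[1]), as.data.table = TRUE)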
One possible speed-up would be to fill a preallocated vector by index rather than concatenating:
fn_dirA <- list.files(here(path_dirA), pattern = "csv")
fn_dirB <- list.files(here(path_dirB), pattern = "csv")

# Preallocate the result vector instead of growing it with c()
v_count_matched <- vector("integer", length(fn_dirA) * length(fn_dirB))
counter <- 0
for (fn1 in fn_dirA) {
  f1 <- data.table::fread(here(path_dirA, fn1), colClasses = 'character')
  for (fn2 in fn_dirB) {
    counter <- counter + 1
    f2 <- data.table::fread(here(path_dirB, fn2), colClasses = 'character')
    v_count_matched[counter] <- nrow(fintersect(f1[, 1], f2[, 1]))
  }
}
I have accepted an answer based on what worked previously. However, I was able to reduce the execution time greatly by simply adding setkey. The whole thing is taking 6 hours now instead of days!
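For reference, a minimal sketch of where setkey might fit (an assumption on my part, since the original post does not show it; here the comparison column is taken to be the first column of each table):
library(data.table)

# Count values common to the first columns of two tables, keying them first
count_matches <- function(f1, f2) {
  setkeyv(f1, names(f1)[1])   # sorts/indexes by the comparison column
  setkeyv(f2, names(f2)[1])
  nrow(fintersect(f1[, 1], f2[, 1]))
}

a <- data.table(id = c("001", "002", "007"))
b <- data.table(id = c("007", "002", "999"))
count_matches(a, b)  # 2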

R : import millions of small alpha numeric csv files

I have about 300 GB of 15 KB csv files (each with exactly 100 rows) that I need to import, concatenate, manipulate and resave as a single rds.
I've managed to reduce the amount of RAM needed by only importing the columns I need, but as soon as I need to do any operations on the columns, I max it out.
What is your strategy for this type of problem?
This is a shot at answering your question.
While this may not be the most effective or efficient solution, it works. The biggest upside is that you don't need to hold all the information in memory at once; instead you just append each result to a file.
If this is not fast enough, it is possible to use parallel processing to speed it up (see the sketch after the code below).
library(tidyverse)
library(data.table)

# Make some example files
for (file_number in 1:1000) {
  df <- data.frame(a = runif(10), b = runif(10))
  write_csv(x = df, file = paste0("example_", file_number, ".csv"))
}

# Get the list of files; change getwd() to your directory
# (the pattern keeps "out.csv" and anything else in the folder out of the list)
list_of_files <- list.files(path = getwd(), pattern = "^example_.*\\.csv$",
                            full.names = TRUE)

# Define a function to read, manipulate, and save the result
read_man_save <- function(filename) {
  # Read the file using data.table::fread, which is faster than read_csv
  df <- fread(file = filename)
  # Do the manipulation here, for example getting only the mean of a
  result <- mean(df$a)
  # Append the result to a file
  write(result, file = "out.csv", append = TRUE)
}

# Use lapply to apply the function over the list of filenames
# The output (which is NULL) is stored in a junk object
junk <- lapply(list_of_files, read_man_save)

# The resulting "out.csv" now contains 1000 means, one per input file
Feel free to comment if you want any edits to better reflect your use case.
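A small sketch of the parallel variant mentioned above (not part of the original answer). It assumes a Unix-alike system, since parallel::mclapply does not fork on Windows, and it sidesteps the risk of several workers appending to the same file by having each worker return its result, with a single write at the end:
library(parallel)
library(data.table)

# Each worker reads one file and returns its summary instead of writing to disk
per_file_result <- function(filename) {
  df <- fread(file = filename)
  mean(df$a)
}

results <- mclapply(list_of_files, per_file_result, mc.cores = 4)

# Single writer: collect everything and write once
fwrite(data.table(mean_a = unlist(results)), "out_parallel.csv")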
You could also use the disk.frame package; it is designed to allow manipulation of data larger than RAM.
You can then manipulate the data like you would in data.table or using dplyr verbs.
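A rough sketch of that route (not from the original answer), assuming the same list_of_files as above; argument names may vary between disk.frame versions, so treat this as a starting point:
library(disk.frame)
library(dplyr)

setup_disk.frame()   # start the background workers disk.frame uses

# Ingest all the CSVs into one chunked, on-disk data frame
df_on_disk <- csv_to_disk.frame(list_of_files, outdir = "all_rows.df")

# Manipulate with dplyr verbs; collect() pulls only the (smaller) result into RAM
result <- df_on_disk %>%
  select(a, b) %>%
  filter(a > 0.5) %>%
  collect()

# Resave the combined result as a single rds, as the question asks
saveRDS(result, "result.rds")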

R: use single file while running a for loop on list of files

I am trying to create a loop where I select one file name from a list of file names, and use that one file to run read.capthist and subsequently discretize, fit, derived, and save the outputs using save. The list contains 10 files with identical rows and columns; the only difference between them is the geographical coordinates in each row.
The issue I am running into is that capt needs to be a single file (in the secr package they are 'captfile' types), but I don't know how to select a single file from this list and get my loop to recognize it as a single entity.
This is the error I get when I try and select only one file:
Error in read.capthist(female[[i]], simtraps, fmt = "XY", detector = "polygon") :
requires single 'captfile'
I am not a programmer by training, I've learned R on my own and used stack overflow a lot for solving my issues, but I haven't been able to figure this out. Here is the code I've come up with so far:
library(secr)
setwd("./")
files = list.files(pattern = "female*")
lst <- vector("list", length(files))
names(lst) <- files
for (i in 1:length(lst)) {
  capt <- lst[i]
  femsimCH <- read.capthist(capt, simtraps, fmt = 'XY', detector = "polygon")
  femsimdiscCH <- discretize(femsimCH, spacing = 2500, outputdetector = 'proximity')
  fit <- secr.fit(femsimdiscCH, buffer = 15000, detectfn = 'HEX', method = 'BFGS',
                  trace = FALSE, CL = TRUE)
  save(fit, file = "C:/temp/fit.Rdata")
  D.fit <- derived(fit)
  save(D.fit, file = "C:/temp/D.fit.Rdata")
}
simtraps is a list of coordinates.
Ideally I would also like my outputs to have unique identifiers; since I am simulating data and will have to compare all the results, I don't want each iteration to overwrite the previous output.
I know I can use this code by bringing in each file and running this separately (this code works for non-simulation runs of a couple data sets), but as I'm hoping to run 100 simulations, this would be laborious and prone to mistakes.
Any tips would be greatly appreciated for an R novice!
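A sketch of one likely fix (not from the original thread, and assuming each element of files is a valid capthist CSV and that simtraps is defined as described): lst[i] returns a one-element list rather than a file name, so pass files[i] directly to read.capthist, and build unique output paths with paste0() so iterations do not overwrite each other:
library(secr)

files <- list.files(pattern = "^female")    # note: pattern is a regex, not a glob
for (i in seq_along(files)) {
  capt <- files[i]                          # a single file name, not a one-element list
  femsimCH <- read.capthist(capt, simtraps, fmt = "XY", detector = "polygon")
  femsimdiscCH <- discretize(femsimCH, spacing = 2500, outputdetector = "proximity")
  fit <- secr.fit(femsimdiscCH, buffer = 15000, detectfn = "HEX",
                  method = "BFGS", trace = FALSE, CL = TRUE)
  # unique output names per iteration so nothing gets overwritten
  save(fit, file = paste0("C:/temp/fit_", i, ".Rdata"))
  D.fit <- derived(fit)
  save(D.fit, file = paste0("C:/temp/D.fit_", i, ".Rdata"))
}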

How do I save output from a large simulation in R? (multiple nodes, safe access)

I am doing a large simulation for a research project--simulating 1,000 football seasons and analyzing the results. As the seasons will be spread across multiple nodes, I need an easy way to save my output data into a file (or files) to access later. Since I can't control when the nodes will finish, I can't have them all trying to write to the same file at the same time, but if they all save to a different file, I would need a way to aggregate all the data easily afterward. Thoughts?
I do not know if this question has been asked already, but here is what I do in my research. You can loop through the file names and aggregate them into one object like so:
require(data.table)

dt1 <- data.table()
for (i in 1:100) {
  k <- paste0("C:/chunkruns/dat", i, "/dt.RData")
  load(k)               # restores an object named dt from each file
  dt1 <- rbind(dt1, dt)
}
agg.data <- dt1
rm(dt1)
The above code assumes that all your files are saved in different folders but with the same file name (dt.RData, each containing an object named dt).
Alternatively, you can use the following to identify file paths matching a pattern and then combine them:
require(data.table)

# Get the list of files, then read them with read.csv
# (`pattern` is a regular expression, so use "^Output" rather than the glob "Output*")
k <- list.files(path = "W:/chunkruns/dat", pattern = "^Output", all.files = FALSE,
                full.names = TRUE, recursive = TRUE)
m <- lapply(k, FUN = function(x) read.csv(x, skip = 11, header = TRUE))
agg.data <- rbindlist(m)
rm(m)
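To avoid the concurrent-write problem in the first place, a common pattern (sketched here under the assumption of a SLURM-style array job; the environment variable name is scheduler-specific) is to have every node save its own uniquely named file and aggregate afterwards with code like the above:
# On each node: save results under a name derived from the task/node id
task_id <- Sys.getenv("SLURM_ARRAY_TASK_ID", unset = "0")        # assumption: SLURM array job
season_results <- data.frame(season = 1:10, wins = rpois(10, 8)) # placeholder results
saveRDS(season_results, file = sprintf("results_%s.rds", task_id))

# Afterwards, on one machine: gather all per-node files into a single table
library(data.table)
files <- list.files(pattern = "^results_.*\\.rds$", full.names = TRUE)
agg.data <- rbindlist(lapply(files, readRDS))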
