I know there are other questions on Stack Overflow about the fastest way to read CSV files in R, and they have been answered; data.table seems to be the way to go. But I have additional requirements.
I need to come up with a script that performs a set intersection between two groups of vectors (to count the values that appear in both vectors). Both groups of vectors are fetched from CSV files in two different directories, dirA and dirB. Each vector in dirA is compared with all the vectors in dirB and the number of matches is recorded. dirA has about 50 files and dirB has 3000 files of varying size (1 to 60 MB).
Below is my attempt at it in R. It is not as fast as I would expect (a similar solution implemented in pandas is about 30% faster). One pass over the 3000 files takes more than 120 seconds just to read them. Is there something I am missing, or is this about the best I can get in R, maybe by clever use of vectorization and multiple comparisons in one go? Any help is appreciated. Thank you.
I am using data.table version 1.13.6.
I want to read everything as strings (there are leading zeros and some other anomalies).
Code:
library(here)
library(data.table)

path_dirA <- "data/processed_data_dirA"
path_dirB <- "data/processed_data_dirB"

fn_dirA <- list.files(here(path_dirA), pattern = "csv")
fn_dirB <- list.files(here(path_dirB), pattern = "csv")

v_count_matched <- integer()
for (fn1 in fn_dirA) {
  f1 <- data.table::fread(here(path_dirA, fn1), colClasses = 'character')
  for (fn2 in fn_dirB) {
    f2 <- data.table::fread(here(path_dirB, fn2), colClasses = 'character')
    # number of distinct values present in the first column of both files
    v_count_matched <- c(v_count_matched, nrow(fintersect(f1[, 1], f2[, 1])))
  }
}
For this particular case, most of the time is spent reading the CSV files.
If you could cache those CSV files on disk in another format with faster read times, you would obtain the biggest savings.
This pays off, for example, if you need to repeat the comparisons daily but only one CSV has changed.
You could have those CSV files saved (cached on disk) in fst format: https://www.fstpackage.org/
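A minimal sketch of such a cache, assuming a "cache" subdirectory next to the CSVs and hypothetical helper names (cache_as_fst, read_cached):

library(data.table)
library(fst)
library(here)

# Convert a CSV to fst once; later runs read the much faster fst copy
# and only re-parse a CSV whose fst copy is missing or older.
cache_as_fst <- function(dir, fn) {
  csv_path <- here(dir, fn)
  fst_path <- here(dir, "cache", sub("\\.csv$", ".fst", fn))
  dir.create(dirname(fst_path), showWarnings = FALSE, recursive = TRUE)
  if (!file.exists(fst_path) || file.mtime(fst_path) < file.mtime(csv_path)) {
    write_fst(fread(csv_path, colClasses = "character"), fst_path, compress = 50)
  }
  fst_path
}

read_cached <- function(dir, fn) {
  as.data.table(read_fst(cache_as_fst(dir, fn)))
}

With this in place, the inner loop can call read_cached(path_dirB, fn2) instead of fread(), and on later runs only the changed CSVs pay the parsing cost.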
One possible speed-up is to write results into a preallocated vector by index rather than concatenating:
fn_dirA <- list.files(here(path_dirA), pattern = "csv")
fn_dirB <- list.files(here(path_dirB), pattern = "csv")

# preallocate instead of growing the vector with c()
v_count_matched <- integer(length(fn_dirA) * length(fn_dirB))

counter <- 0
for (fn1 in fn_dirA) {
  f1 <- data.table::fread(here(path_dirA, fn1), colClasses = 'character')
  for (fn2 in fn_dirB) {
    counter <- counter + 1
    f2 <- data.table::fread(here(path_dirB, fn2), colClasses = 'character')
    v_count_matched[counter] <- nrow(fintersect(f1[, 1], f2[, 1]))
  }
}
I have accepted an answer based on what worked previously. However, I was able to reduce the execution time greatly by simply adding setkey. The whole thing is taking 6 hours now instead of days!
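The post doesn't show exactly where setkey() was added; as a rough sketch (assuming the compared values live in a column named V1, which is purely illustrative), keying each table before the comparison looks like this:

library(data.table)
library(here)

f1 <- fread(here(path_dirA, fn1), colClasses = "character")
f2 <- fread(here(path_dirB, fn2), colClasses = "character")

# setkey() sorts each table by the column and marks it as the key,
# so joins and lookups on that column can use binary search.
setkey(f1, V1)
setkey(f2, V1)

# count of distinct values present in both keyed columns
n_matched <- nrow(fintersect(f1[, .(V1)], f2[, .(V1)]))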
Related
I have a large list of approx. 1.3 GB in size. I'm looking for the fastest solution in R to generate chunks and save them in any convenient format so that:
a) every saved chunk file is less than 100 MB
b) the original list can be loaded conveniently and fast into a new R workspace
EDIT II: The reason for doing so is an R solution to bypass the GitHub file size restriction of 100 MB per file. The limitation to R is due to some external non-technical restrictions which I can't comment on.
What is the best solution for this problem?
EDIT I: Since it was mentioned in the comments that some code for the problem helps to create a better question:
An R example of a list roughly 1.3 GB in size:
li <- list(a = rnorm(10^8),
b = rnorm(10^7.8))
So, you want to split a file and reload it as a single dataframe.
There is a twist: to reduce file size, it would be wise to compress, but then the file size is not entirely deterministic. You may have to tweak a parameter.
The following is a piece of code I have used for a similar task (unrelated to GitHub though).
The split.file function takes 3 arguments: a dataframe, the number of rows to write in each file, and the base filename. For instance, if basename is "myfile", the files will be "myfile00001.rds", "myfile00002.rds", etc.
The function returns the number of files written.
The join.files function takes the base name.
Note:
Play with the rows parameter to find the number of rows that fits within 100 MB. It depends on your data, but for similar datasets a fixed value should do. However, if you are dealing with very different datasets, this approach will likely fail.
When reading, you need about twice as much memory as your dataframe occupies (because a list of the smaller dataframes is read first, then rbind-ed together).
The number is written as 5 digits, but you can change that. The goal is to have the names in lexicographic order, so that when the files are concatenated, the rows are in the same order as the original file.
Here are the functions:
split.file <- function(db, rows, basename) {
  n <- nrow(db)
  m <- n %/% rows
  for (k in seq_len(m)) {
    db.sub <- db[seq(1 + (k - 1) * rows, k * rows), , drop = FALSE]
    saveRDS(db.sub, file = sprintf("%s%.5d.rds", basename, k),
            compress = "xz", ascii = FALSE)
  }
  if (m * rows < n) {
    db.sub <- db[seq(1 + m * rows, n), , drop = FALSE]
    saveRDS(db.sub, file = sprintf("%s%.5d.rds", basename, m + 1),
            compress = "xz", ascii = FALSE)
    m <- m + 1
  }
  m
}

join.files <- function(basename) {
  files <- sort(list.files(pattern = sprintf("%s[0-9]{5}\\.rds", basename)))
  do.call("rbind", lapply(files, readRDS))
}
Example:
n <- 1500100
db <- data.frame(x = rnorm(n))
split.file(db, 100000, "myfile")
dbx <- join.files("myfile")
all(dbx$x == db$x)
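A quick sanity check (a small sketch reusing the "myfile" basename from the example) that the chunks actually stay below GitHub's 100 MB limit:

chunks <- list.files(pattern = "myfile[0-9]{5}\\.rds")
max(file.size(chunks)) < 100 * 1024^2  # TRUE if every chunk is under 100 MB

If this returns FALSE, lower the rows argument and split again.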
I have 30 sas-files (dataset1.sas7bdat through dataset30.sas7bdat, approx. 10 GB per file) in a folder, and need to analyse a subset of rows in these data files (all rows where the character variable A begins with 10).
Thus, I need to read each of the sas-files into R, filter a subset with grep on variable A and then save each of these filtered datasets as a .rds-file.
I'm trying to achieve this using a for loop over list.files() and the haven package to read the sas-files. To avoid running out of memory, I need to remove the imported dataset on each iteration after the subset has been filtered and saved as .rds.
Though neither elegant nor satisfying, I could hard-code it manually 30 times over like this, copy/pasting and incrementing the suffix by 1 each time:
dt1 <- haven::read_sas("~/folder/dataset1.sas7bdat")
dt1 <- data.table::as.data.table(dt1)
dt1 <- dt1[grep("^10", A)]
saveRDS(dt1, "~/folder/subset1.rds")

dt2 <- haven::read_sas("~/folder/dataset2.sas7bdat")
dt2 <- data.table::as.data.table(dt2)
dt2 <- dt2[grep("^10", A)]
saveRDS(dt2, "~/folder/subset2.rds")
etc.
While the following for loop technically reads the files into memory, it will never finish because it quickly runs out of memory, so it does not let me filter the data:
folder <- "~/folder/"
file_list <- list.files(path = folder, pattern = "^dataset")
for (i in 1:length(file_list)) {
  assign(file_list[i], haven::read_sas(paste(folder, file_list[i], sep = '')))
}
Is there a way to - on each iteration in the loop - filter the dataset, remove the unfiltered dataset and save the subset in a .rds-file?
I can't seem to come up with a way to incorporate this into my approach of using the assign() function.
Is there a better way to go about this?
You could potentially look at freeing up memory by clearing the memory after each load and filter via
rm(list=ls())
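A minimal sketch of that idea inside a loop (here freeing only the one large object per iteration with rm() and gc() rather than wiping the whole workspace; folder and file names are illustrative):

library(haven)
library(data.table)

folder <- "~/folder"
file_list <- list.files(path = folder, pattern = "^dataset")

for (i in seq_along(file_list)) {
  dt <- as.data.table(read_sas(file.path(folder, file_list[i])))
  dt <- dt[grep("^10", A)]   # keep rows where A starts with 10
  saveRDS(dt, file.path(folder, paste0("subset", i, ".rds")))
  rm(dt)   # drop the big object...
  gc()     # ...and return the memory before the next file is read
}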
I slept on it, and woke up with a working solution using a function to do the work, and a loop to cycle through filenames. This also enables me to save the output in a different folder (my raw data folder is read-only):
library(haven)
library(data.table)
fromFolder <- "~/folder_with_input_data/"
toFolder <- "~/folder_with_output_data/"
import_sas <- function(filename) {
  dt <- read_sas(paste(fromFolder, filename, sep = ''), NULL)
  dt <- as.data.table(dt)
  dt <- dt[grep("^10", A)]
  saveRDS(dt, paste(toFolder, filename, '.rds', sep = ''), compress = FALSE)
  remove(dt)
}

file_list <- list.files(path = fromFolder, pattern = "^dataset")
for (filename in file_list) {
  import_sas(filename)
}
I haven't tested this with the full 30 files yet. I'll do that tonight.
If I encounter problems, I will post an update tomorrow. Otherwise, this question can be closed in 48 hours.
Update: It worked without a hitch and completed the 297 GB conversion in around 13 hours. I don't think it can be optimized to accomplish the task much faster; the vast majority of computing time is spent opening the sas-files, which I don't think can be done any faster by means other than haven. Unless someone has an idea to optimize the process, this question can be closed.
I have a huge dataset consisting of hundreds of files and I need to use the bigmemory package (or similar) to read the data. How do I use read.big.matrix with multiple files?
You could probably load the files iteratively from the working directory and then merge the matrices.
library(bigmemory)

setwd(path_to_files)
files <- list.files(path = ".")
number_of_matrices <- length(files)

# preallocate the list rather than growing it inside the loop
my_matrices <- vector("list", length = number_of_matrices)

for (file in seq_along(files)) {
  my_matrices[[file]] <- read.big.matrix(filename = files[file])
}

# Assuming the same columns, this merges them into one single matrix
massive_matrix <- do.call("rbind", my_matrices)
I have not tested the code, but it might give you an idea of how the problem could be approached.
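One caveat worth sketching: as far as I know, rbind() has no method for big.matrix objects, so if the combined data still fits in RAM you may have to pull each chunk back into a base matrix first and re-wrap the result (which partly defeats the purpose of bigmemory, so treat this as a fallback):

library(bigmemory)

# bm[, ] materialises a big.matrix as an ordinary matrix in RAM
combined <- do.call(rbind, lapply(my_matrices, function(bm) bm[, ]))
massive_matrix <- as.big.matrix(combined)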
I have quite a large number of quite heavy datasets. I would like to extract a subset from each of them and save it to a separate CSV file (one per dataset). These are the commands I would like to loop over all the files I have in the folder:
df <- read.csv("1985.csv", header = FALSE, stringsAsFactors = TRUE, sep = "\t")
df_short <- df[df$V6 == "OPP", ]
write.csv(df_short, file = "OPP_1985.csv", row.names = FALSE)
rm(df)
rm(df_short)
This is probably a very noob question, but I am struggling to understand how to do it, so I would really appreciate some help with this!
EDIT:
Following @SimonShine's suggestion, I have run the code and it works!
You don't specify whether you are trying to collect the subsets into one dataset or to make one file per subset. You refer to OPP_1985, which appears to be out of scope for the code you wrote. Did you mean to refer to df_short?
You could start by abstracting what you want to do with one datafile into a function, e.g.:
extract_and_save_from_dataset <- function(csvfile) {
  df <- read.csv(csvfile, header = FALSE, stringsAsFactors = TRUE, sep = "\t")
  df_short <- df[df$V6 == "OPP", ]
  # anchor the extension so only the trailing ".csv" is replaced
  csvfile_short <- sub("\\.csv$", "_short.csv", csvfile)
  write.csv(df_short, file = csvfile_short, row.names = FALSE)
}
Assuming you have a collection of dataset filenames, you could apply this function multiple times:
# csvfiles <- c("OPP_1985.csv", "OPP_1986.csv", ...)
csvfiles <- list.files("/path/to/my/csvfiles")
for (csvfile in csvfiles) {
  extract_and_save_from_dataset(csvfile)
}
The data.table approach is probably the fastest option, especially if you have a large dataset. The function fwrite from data.table writes in parallel using many CPUs, making it extremely fast.
Here is how you can split your original data into subgroups defined by the values of df$V6 and save each subset to a separate .csv file.
library(data.table)
setDT(df)[, fwrite(.SD, paste0("output_", V6, ".csv")), by = V6, .SDcols = names(df)]
P.S. The files will be named output_*.csv, where * is the corresponding V6 value.
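For instance, applied to the 1985 file from the earlier question (a sketch; fread already returns a data.table, so setDT is only needed when starting from a plain data.frame):

library(data.table)

# read the tab-separated file; without a header the columns are named V1, V2, ..., V6, ...
df <- fread("1985.csv", header = FALSE, sep = "\t")

# writes one file per distinct value of V6, e.g. output_OPP.csv
df[, fwrite(.SD, paste0("output_", V6, ".csv")), by = V6, .SDcols = names(df)]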
I want to automate the extraction of certain information from text files using grep, grepl and regexpr. I have code that works when I run it on each individual file, but I cannot get the loop to work to automate the process for all files in my working directory.
I am reading in the txt files as strings because of the structure of the data. The loop seems to iterate over the first file numerous times, corresponding to the number of files in the directory, obviously because of the length(txtfiles) call in the for statement.
txtfiles = list.files(pattern = "*.txt")
for (i in 1:length(txtfiles)){
  all_data <- readLines(txtfiles[i])
  # select hours of operation
  hours_op[i] <- all_data[hours_of_operation <- grep("Annual Hours of Operation:", all_data)]
  hours_op[i] <- regmatches(hours_op, regexpr("[0-9]{1,9}.[0-9]{1,9}", hours_op))
}
I would be grateful if someone could point me in the right direction to repeat this routine for each file, rather than the same file multiple times over. I want to end up with a list of the file names and the corresponding hours_op.
You need to either add an index ([i]) to every one of your references to hours_op, as in:
hours_op <- character(length(txtfiles))  # the result must exist before indexed assignment
for (i in 1:length(txtfiles)){
  all_data <- readLines(txtfiles[i])
  hours_op[i] <- all_data[hours_of_operation <- grep("Annual Hours of Operation:", all_data)]
  hours_op[i] <- regmatches(hours_op[i], regexpr("[0-9]{1,9}.[0-9]{1,9}", hours_op[i]))
}
or better yet, use a temporary variable:
hours_op <- character(length(txtfiles))  # preallocate the result
for (i in 1:length(txtfiles)){
  all_data <- readLines(txtfiles[i])
  temp <- all_data[hours_of_operation <- grep("Annual Hours of Operation:", all_data)]
  hours_op[i] <- regmatches(temp, regexpr("[0-9]{1,9}.[0-9]{1,9}", temp))
}
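To end up with the file names attached to the extracted values, one option (a sketch that assumes the pattern occurs exactly once per file) is to wrap the extraction in a function and build a named vector with vapply:

extract_hours <- function(f) {
  all_data <- readLines(f)
  line <- all_data[grep("Annual Hours of Operation:", all_data)][1]
  regmatches(line, regexpr("[0-9]{1,9}.[0-9]{1,9}", line))
}

txtfiles <- list.files(pattern = "*.txt")
hours_op <- vapply(txtfiles, extract_hours, character(1))  # names are the file names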