R: Split and write a very large data frame into slices

I have a large data frame my_df in R containing 1,983,000 records. The following sample code takes a chunk of 1000 rows starting at row 25001, does some processing, and writes the processed data to a file on the local disk.
my_df1 <- my_df[25001:26000, ]
my_df1$end <- as.POSIXct(paste(my_df1$end, "23:59", sep = ""))
my_df1$year <- lubridate::year(my_df1$start)
str_data <- my_df1
setwd("path_to_local_dir/data25001_26000")
write.table(str_data, file = "data25001-26000.csv", row.names = F, col.names = F, quote = F)
and so on like this:
my_df2 <- my_df[26001:27000,]
...
I would like to automate this task so that the chunks of 1000 records are processed and written to a new directory. Any advice on how this could be done?

Consider generalizing your process into a function, data_to_disk, and calling that function with an iterator method like lapply(), passing a sequence of integers built with seq() for each subsequent thousand. Also incorporate dynamic directory creation (though maybe dump all 1,000+ files into one directory instead of creating 1,000+ directories?).
data_to_disk <- function(num) {
  str_data <- within(my_df[num:(num + 999), ], {   # note the comma: row slice
    end <- as.POSIXct(paste0(end, "23:59"))
    year <- lubridate::year(start)
  })
  my_dir <- paste0("path_to_local_dir/data", num, "_", num + 999)
  if (!dir.exists(my_dir)) dir.create(my_dir)
  write.table(str_data, file = paste0(my_dir, "/", "data", num, "-", num + 999, ".csv"),
              row.names = FALSE, col.names = FALSE, quote = FALSE)
  return(str_data)   # return the processed chunk, not the full data frame
}
seqs <- seq(25001, nrow(my_df), by=1000)
head(seqs)
# [1] 25001 26001 27001 28001 29001 30001
tail(seqs)
# [1] 1977001 1978001 1979001 1980001 1981001 1982001
# LIST OF 1,958 DATA FRAMES
df_list <- lapply(seqs, data_to_disk)
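If you do not need the processed chunks back in memory afterwards, a small variant of the call (just a sketch) discards the result instead of keeping df_list around:
# run purely for the side effect of writing files; don't keep the list
invisible(lapply(seqs, data_to_disk))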

Here is my code for the sliced loop:
step1 = 1000
runto = nrow(my_df)
nsteps = ceiling(runto / step1)
for (part in seq_len(nsteps)) {  # part = 1
  cat(part, 'of', nsteps, '\n')
  fr = (part - 1) * step1 + 1
  to = min(part * step1, runto)
  my_df1 = my_df[fr:to, ]
  # ... processing from the question goes here, producing str_data ...
  write.table(str_data, file = paste0("data", fr, "-", to, ".csv"))
}
rm(part, step1, runto, nsteps, fr, to)

You can add a grouping variable to your data first (e.g., to identify every 1000 rows), then use d_ply() to split the data and write each group to a file.
library(plyr)   # for d_ply()
library(dplyr)  # for filter() and %>%
library(readr)  # for write_csv()
df <- data.frame(var = runif(1000000))
df$fold <- cut(seq(1, nrow(df)), breaks = 100, labels = FALSE)
df %>% filter(fold <= 2) %>%  # only writes the first two files
  d_ply(.(fold), function(i) {
    # make filenames 'data1.csv', 'data2.csv', ...
    write_csv(i, paste0('data', unique(i$fold), '.csv'))
  })
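For comparison, here is a base-R sketch of the same split-and-write idea with no plyr/dplyr dependency (file names are illustrative, matching the pattern above):
df <- data.frame(var = runif(1000000))
df$fold <- cut(seq_len(nrow(df)), breaks = 100, labels = FALSE)
pieces <- split(df, df$fold)     # one data frame per fold
for (f in names(pieces)[1:2]) {  # first two folds only, as above
  write.csv(pieces[[f]], paste0("data", f, ".csv"), row.names = FALSE)
}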

This is similar to @Parfait's answer but takes a lot of the work out of the function: it creates a copy of the entire dataset and performs the time manipulations on it once, up front, so the lapply() call only has to slice and write.
my_df1 <- my_df
my_df1$end <- as.POSIXct(paste(my_df1$end, "23:59", sep = ""))
my_df1$year <- lubridate::year(my_df1$start)
lapply(seq(25001, nrow(my_df1), by = 1000),
       function(i) write.table(my_df1[i:(i + 1000 - 1), ],  # parentheses matter: i:i+999 would be a single index
                               file = paste0('path_to_local_dir/data',
                                             i, '-', i + 1000 - 1, '.csv'),
                               row.names = FALSE, col.names = FALSE, quote = FALSE))
For me, I'd probably just do:
write.table(my_df1, file = ...)
and be done with it. I don't see the advantage of splitting it up; 2 million rows really isn't that many.

Related

For-loop in R to create a new file (but gives incorrect/unexpected output)

I'm currently busy with some data and I need to check its validity. Therefore, I would like to use a for-loop to go through all my data files, and in this for-loop calculate some things (like mean, min, max...). My code below runs, but it produces an incorrectly written csv file. The problem occurs when the csv file is created, after the calculations are done. CSV:
"c.1..1..1004.89081855716..630.174466667434..461.738905906677.." "c.1..1..950.990843858612..479.98560814955..517.955102920532.."
1 1
1 1
1004.89081855716 950.990843858612
630.174466667434 479.98560814955
461.738905906677 517.955102920532
1535.86795806885 1452.30199813843
-13.3948961645365 3.72026950120926
1259.26423788071 1159.17089223862
Approach/What I'm expecting:
I start from data files with eye-tracking data in them. As you can see at the beginning of the code, I try to get some values out of this eye-tracking data (validity, a new file with only validity == 1 data, ...). Once I have created the filtered_data data frame, I want to calculate some extra values from it (mean, sd, min/max). My plan is to create a new csv file (validity_loop.csv) in which I can find all my calculations (validity_left, validity_right, mean_eye_x, mean_eye_y, min_eye_x, max_eye_x, min_eye_y, max_eye_y), all in a row, with one row per data set (file_list[i]). Can someone help me tackle and solve this issue?
Here is my code:
set <- setwd("/Users/Sarah/Documents")
file_list <- list.files(set, pattern = ".csv", all.files = TRUE)
validity_list <- data_list <- vector("list", length = length(file_list))
for (i in seq_along(file_list)) {
  filename = file_list[i]
  # read files
  data_frame = read.csv(filename, sep = ",", dec = ".",
                        header = TRUE,
                        stringsAsFactors = FALSE)
  # what has to be done
  # validity
  validity_left <- mean(is.numeric(data_frame$left_gaze_point_validity))
  validity_right <- mean(is.numeric(data_frame$right_gaze_point_validity))
  # clean data frame (validity == 1)
  to_keep = which(data_frame$left_gaze_point_validity == 1 &
                  data_frame$right_gaze_point_validity == 1)
  filtered_data = data_frame[to_keep, ]
  filtered_data$left_eye_x = as.numeric(filtered_data$left_eye_x)
  filtered_data$left_eye_y = as.numeric(filtered_data$left_eye_y)
  filtered_data$right_eye_x = as.numeric(filtered_data$right_eye_x)
  filtered_data$right_eye_y = as.numeric(filtered_data$right_eye_y)
  # single eye-data
  filtered_data$eye_x <- (filtered_data$left_eye_x + filtered_data$right_eye_x) / 2
  filtered_data$eye_y <- (filtered_data$left_eye_y + filtered_data$right_eye_y) / 2
  # pixels
  filtered_data$eye_x <- filtered_data$eye_x * 1920
  filtered_data$eye_y <- filtered_data$eye_y * 1080
  # SD and mean + min/max
  mean_eye_x <- mean(filtered_data$eye_x)
  mean_eye_y <- mean(filtered_data$eye_y)
  sd_eye_x <- sd(filtered_data$eye_x)
  sd_eye_y <- sd(filtered_data$eye_y)
  min_eye_x <- min(filtered_data$eye_x)
  min_eye_y <- min(filtered_data$eye_y)
  max_eye_x <- max(filtered_data$eye_x)
  max_eye_y <- max(filtered_data$eye_y)
  # add everything to the new file
  validity_list[[i]] <- c(validity_left, validity_right,
                          mean_eye_x, mean_eye_y,
                          min_eye_x, min_eye_y,
                          max_eye_x, max_eye_y)
}
# new document
write.table(validity_list,
            file = "Master T&O/Thesis /Loop/Validity/validity_loop.csv",
            col.names = TRUE, row.names = FALSE)
I managed to get a new data frame in R which contains the values of my validity_list in matrix form.
# for-loop, attempt 2
set <- setwd("/Users/Sarah/Documents/Master T&O/Thesis /Loop")
file_list <- list.files(set, pattern = ".csv", all.files = TRUE)
validity_list <- vector("list", length = length(file_list))
for (i in seq_along(file_list)) {
  filename = file_list[i]
  # read files
  data_frame = read.csv(filename, sep = ",", dec = ".", header = TRUE, stringsAsFactors = FALSE)
  # what has to be done
  # validity
  validity_left <- mean(is.numeric(data_frame$left_gaze_point_validity))
  validity_right <- mean(is.numeric(data_frame$right_gaze_point_validity))
  # clean data frame (validity == 1)
  to_keep = which(data_frame$left_gaze_point_validity == 1 & data_frame$right_gaze_point_validity == 1)
  filtered_data = data_frame[to_keep, ]
  filtered_data$left_eye_x = as.numeric(filtered_data$left_eye_x)
  filtered_data$left_eye_y = as.numeric(filtered_data$left_eye_y)
  filtered_data$right_eye_x = as.numeric(filtered_data$right_eye_x)
  filtered_data$right_eye_y = as.numeric(filtered_data$right_eye_y)
  # single eye-data
  filtered_data$eye_x <- (filtered_data$left_eye_x + filtered_data$right_eye_x) / 2
  filtered_data$eye_y <- (filtered_data$left_eye_y + filtered_data$right_eye_y) / 2
  # pixels
  filtered_data$eye_x <- filtered_data$eye_x * 1920
  filtered_data$eye_y <- filtered_data$eye_y * 1080
  # SD and mean + min/max
  mean_eye_x <- mean(filtered_data$eye_x)
  mean_eye_y <- mean(filtered_data$eye_y)
  sd_eye_x <- sd(filtered_data$eye_x)
  sd_eye_y <- sd(filtered_data$eye_y)
  min_eye_x <- min(filtered_data$eye_x)
  min_eye_y <- min(filtered_data$eye_y)
  max_eye_x <- max(filtered_data$eye_x)
  max_eye_y <- max(filtered_data$eye_y)
  # add everything to the new file
  validity_list[[i]] <- c(validity_left, validity_right, mean_eye_x, mean_eye_y, min_eye_x, max_eye_x, min_eye_y, max_eye_y)
  validity_matrix <- matrix(unlist(validity_list), ncol = 8, byrow = TRUE)
}
# new document
write.table(validity_matrix, file = "/Users/Sarah/Documents/Master T&O/Thesis /Loop/Validity/validity_loop.csv", dec = ".")
The only problem I have now is that the values of my validity_list items are wrong, but that's another problem and I'm trying to fix it!
If I understand correctly, the following line gathers all your data together:
validity_list[[i]] <- c(validity_left, validity_right, mean_eye_x,
                        mean_eye_y, min_eye_x, max_eye_x, min_eye_y, max_eye_y)
If it were Python, this would be:
validity_list = (validity_left, validity_right, mean_eye_x,
                 mean_eye_y, min_eye_x, max_eye_x, min_eye_y, max_eye_y)
where the '=' tells the interpreter that everything behind it is a tuple '(', data, ')', which makes it one single dataset; if you then write it, it ends up in one column. If you instead pick the elements one by one with a for-loop, 'validity_left' would be written to a separate column. In your case, would adding something like this to your code below be an option?
for item in validity_list:
    # process each item individually, etc.
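Translating that idea back to R, a minimal sketch (assuming validity_list is filled as in your loop) binds the per-file vectors row-wise, names one column per statistic, and writes the table once:
# bind the per-file result vectors into rows of a data frame
validity_df <- as.data.frame(do.call(rbind, validity_list))
names(validity_df) <- c("validity_left", "validity_right",
                        "mean_eye_x", "mean_eye_y",
                        "min_eye_x", "max_eye_x",
                        "min_eye_y", "max_eye_y")
write.csv(validity_df, "validity_loop.csv", row.names = FALSE)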

write results sequentially in a loop in r

I have a bunch of single files to which I need to apply a test, and I need a way to automatically write the results for each file into a single output file. Here is what I do:
library(ape)
library(xts)  # needed for as.xts()
stud_files <- list.files("path/dir/data", full.names = TRUE)
for (f in stud_files) {
  df <- read.table(f, header = TRUE, sep = ";")
  df_xts <- as.xts(df$cola, order.by = as.Date(df$colb, "%m/%d/%Y"))
  pet <- testa(df_xts)
  res <- data.frame(estimate = pet$estimate,
                    p.value = pet$p.value,
                    logi = pet$alternative)
  write.dna(res, file = "res_testa.xls", format = "sequential")
}
This loop works well, except for the last command, which aims to write the results of each file consecutively; it saves only the last result. Also, the results are saved as a string, not as the table (data.frame) I defined above. Any ideas in this case? Thanks in advance.
Check help(write.dna).
write.dna(x, file, format = "interleaved", append = FALSE,
          nbcol = 6, colsep = " ", colw = 10, indent = NULL,
          blocksep = 1)
append: a logical; if TRUE, the data are appended to the file without erasing the data possibly existing in the file; otherwise the file (if it exists) is overwritten (FALSE, the default).
Set append = TRUE and you should be all set.
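In the loop from the question, that is a one-argument change (same objects as above):
# inside the for-loop: append each file's result instead of overwriting
write.dna(res, file = "res_testa.xls", format = "sequential", append = TRUE)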
As some of the comments point out, however, you are probably better off generating your table, and then writing it all at once to a file. Unless you have billions of files, you likely won't run out of memory.
Here is how I would approach this.
library(ape)
library(data.table)
library(xts)  # for as.xts()
stud_files <- list.files("path/dir/data", full.names = TRUE)
sumfunc <- function(f) {
  df <- read.table(f, header = TRUE, sep = ";")
  df_xts <- as.xts(df$cola, order.by = as.Date(df$colb, "%m/%d/%Y"))
  pet <- testa(df_xts)
  res <- data.table(estimate = pet$estimate,
                    p.value = pet$p.value,
                    logi = pet$alternative)
  return(res)
}
lres <- lapply(stud_files, sumfunc)
dat <- rbindlist(lres)
write.table(dat,
            file = "res_testa.csv",
            sep = ",",
            quote = FALSE,
            row.names = FALSE)

Loop not working and saving output

Hi, somehow my loop is not working; it only keeps the last value. Here's the code:
library(readxl)
library(readr)
library(plyr)
library(dplyr)
path = "C:/Users/benja/OneDrive/Studium/Bachelorarbeit/Ressourcen/Conference Calls/"
Enterprise = "ABB Ltd"
# import dictionary
Dictionary <- read_excel("C:/Users/benja/OneDrive/Studium/Bachelorarbeit/Ressourcen/LoughranMcDonald_MasterDictionary_2014.xlsx",
                         sheet = "Tabelle1")
for (File in c("2016 Q1.xml", "2016 Q2.xml", "2016 Q3.xml", "2016 Q4.txt")) {
  # import text
  ABB_2016_Q4 <- read_delim(paste0(path, Enterprise, "/", File),
                            " ", escape_double = FALSE, col_names = FALSE,
                            trim_ws = TRUE)
  # reformatting -> first transpose, then vector, lowercase, data frame
  ABB_2016_Q4 = data.frame(tolower(c(t(ABB_2016_Q4))))
  colnames(ABB_2016_Q4) = "Word"
  # merge text with dictionary
  Analyze_2016_Q4 = inner_join(Dictionary, ABB_2016_Q4)
  # analysis
  Rating = sum(Analyze_2016_Q4$Rating)
}
If I test it with
print(File)
it prints the appropriate list of files, but the loop still isn't working. And how can I save the results after each iteration? I want the Rating for each quarter displayed.
It looks like you're loading one 'master' file, then loading lots of individual files and trying to join these to the master. If that's the case, I'd take a more functional approach rather than use a for() loop.
Some example data:
master <- data.frame(
  key = letters,
  stringsAsFactors = FALSE
)
a <- data.frame(
  key = sample(letters, 13),
  dat = sample(1:100, 13),
  stringsAsFactors = FALSE
)
a$key
letters_reduced <- letters %in% a$key
letters_reduced <- letters[!letters_reduced]
b <- data.frame(
  key = sample(letters_reduced, 13),
  dat = sample(1:100, 13),
  stringsAsFactors = FALSE
)
readr::write_csv(a, "~/StackOverflow/BenjaminBerger/a.csv")
readr::write_csv(b, "~/StackOverflow/BenjaminBerger/b.csv")
So we have the master object in memory. To load in multiple files in R, assuming they're in the same directory, I'd use list.files() then iterate over the files with lapply() and read_csv():
files <- list.files("StackOverflow/BenjaminBerger", pattern = "*.csv",
full.names = TRUE)
df <- lapply(files, readr::read_csv)
You now have a list of data frames. There are many ways you could join these to your master object, but perhaps the simplest is to 'collapse' the list of data frames into one data frame, and do one join with this. This is as easy as:
df <- dplyr::bind_rows(df)
master <- dplyr::inner_join(master, df, by = "key")
Which gets you:
head(master)
# key dat
# 1 a 38
# 2 b 52
# 3 c 59
# 4 d 77
# 5 e 34
# 6 f 93
Your loop is probably working, but at the moment it's not returning anything. :)
You can, for instance, write your results to a list:
# initiate result list
allResults <- list()
# populate your file list; depending on your directory, you can also use list.files()
files <- c("2016 Q1.xml", "2016 Q2.xml", "2016 Q3.xml", "2016 Q4.txt")
# iterate through your files
for (i in seq_along(files)) {
  # import text
  ABB_2016_Q4 <- read_delim(paste0(path, Enterprise, "/", files[i]),
                            " ", escape_double = FALSE, col_names = FALSE,
                            trim_ws = TRUE)
  # reformatting -> first transpose, then vector, lowercase, data frame
  ABB_2016_Q4 = data.frame(tolower(c(t(ABB_2016_Q4))))
  colnames(ABB_2016_Q4) = "Word"
  # merge text with dictionary
  Analyze_2016_Q4 = inner_join(Dictionary, ABB_2016_Q4)
  # analyse & store results & add identifier:
  allResults[[i]] = data.frame(ID = paste0("Q", i),
                               result = sum(Analyze_2016_Q4$Rating),
                               stringsAsFactors = FALSE)
}
# flatten result list to a data frame:
allResultsDf <- do.call(rbind, allResults)

Stream processing large csv file in R

I need to make a couple of relatively simple changes to a very large csv file (c. 8.5 GB). I tried initially using various reader functions: read.csv, readr::read_csv, data.table::fread. However, they all run out of memory.
I'm thinking I need to use a stream-processing approach instead: read a chunk, update it, write it, repeat. I found this answer, which is along the right lines; however, I don't know how to terminate the loop (I'm relatively new to R).
So I have 2 questions:
What's the right way to make the while loop work?
Is there a better way (for some definition of 'better')? e.g. is there some way to do this using dplyr & pipes?
Current code as follows:
src_fname <- "testdata/model_input.csv"
tgt_fname <- "testdata/model_output.csv"
# changes needed in file: rebase identifiers, set another col to constant value
rebase_data <- function(data, offset) {
  data$'Unique Member ID' <- data$'Unique Member ID' - offset
  data$'Client Name' <- "TestClient2"
  return(data)
}
CHUNK_SIZE <- 1000
src_conn = file(src_fname, "r")
data <- read.csv(src_conn, nrows = CHUNK_SIZE, check.names = FALSE)
cols <- colnames(data)
offset <- data$'Unique Member ID'[1] - 1
data <- rebase_data(data, offset)
# 1st time through, write the headers
tgt_conn = file(tgt_fname, "w")
write.csv(data, tgt_conn, row.names = FALSE)
# loop over remaining data
end = FALSE
while (end == FALSE) {
  data <- read.csv(src_conn, nrows = CHUNK_SIZE, check.names = FALSE, col.names = cols)
  data <- rebase_data(data, offset)
  # write.csv doesn't support col.names=FALSE; so use write.table, which does
  write.table(data, tgt_conn, row.names = FALSE, col.names = FALSE, sep = ",")
  # ??? How to test for EOF and set end = TRUE if so ???
  # This doesn't work, presumably because nrow() != CHUNK_SIZE on final loop?
  if (nrow(data) < CHUNK_SIZE) {
    end <- TRUE
  }
}
close(src_conn)
close(tgt_conn)
Thanks for any pointers.
Sorry to poke a 2-year-old thread, but now with readr::read_csv_chunked (auto-loaded along with dplyr when loading tidyverse), we can also do it like this:
require(tidyverse)
## For non-exploratory code, as @antoine-sac suggested, use:
# require(readr)  # for read_csv_chunked() and read_csv()
# require(dplyr)  # for the pipe %>%, thus fewer parentheses
src_fname = "testdata/model_input.csv"
tgt_fname = "testdata/model_output.csv"
CHUNK_SIZE = 1000
offset = read_csv(src_fname, n_max = 1)$comm_code %>% as.numeric() - 1
rebase.chunk = function(df, pos) {
  df$comm_code = df$comm_code %>% as.numeric() - offset
  df$'Client Name' = "TestClient2"
  is.append = pos > 1  # TRUE for every chunk after the first
  df %>% write_csv(
    tgt_fname,
    append = is.append
  )
}
read_csv_chunked(
  src_fname,
  callback = SideEffectChunkCallback$new(rebase.chunk),
  chunk_size = CHUNK_SIZE,
  progress = TRUE  # optional, show progress bar
)
The tricky part here is setting is.append based on the parameter pos, which indicates the starting row number of the data frame df within the original file. In readr::write_csv, when append = FALSE the header (column names) is written to the file; otherwise it is not.
Try this out:
library("chunked")
read_chunkwise(src_fname, chunk_size=CHUNK_SIZE) %>%
rebase_data(offset) %>%
write_chunkwise(tgt_fname)
You may need to fiddle a bit with the colnames to get exactly what you want.
(Disclaimer: haven't tried the code)
Note that there is no vignette with the package but the standard usage is described on github: https://github.com/edwindj/chunked/
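Since a plain function like rebase_data() may not be applied chunkwise, here is a hedged, untested sketch that expresses the same changes with chunked's dplyr verbs instead (assuming src_fname, tgt_fname, offset, and CHUNK_SIZE are defined as in the question):
library(chunked)
library(dplyr)
# express the per-row changes with mutate() so they can run chunk by chunk
read_chunkwise(src_fname, chunk_size = CHUNK_SIZE) %>%
  mutate(`Unique Member ID` = `Unique Member ID` - offset,
         `Client Name` = "TestClient2") %>%
  write_chunkwise(tgt_fname)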
OK, I found a solution, as follows:
# src_fname <- "testdata/model_input.csv"
# tgt_fname <- "testdata/model_output.csv"
CHUNK_SIZE <- 20000
# changes needed in file: rebase identifiers, set another col to constant value
rebase_data <- function(data, offset) {
  data$'Unique Member ID' <- data$'Unique Member ID' - offset
  data$'Client Name' <- "TestClient2"
  return(data)
}
#--------------------------------------------------------
# Get the structure first to speed things up
#--------------------------------------------------------
structure <- read.csv(src_fname, nrows = 2, check.names = FALSE)
cols <- colnames(structure)
offset <- structure$'Unique Member ID'[1] - 1
# open the input & output files for reading & writing
src_conn = file(src_fname, "r")
tgt_conn = file(tgt_fname, "w")
lines_read <- 0
end <- FALSE
read_header <- TRUE
write_header <- TRUE
while (end == FALSE) {
  data <- read.csv(src_conn, nrows = CHUNK_SIZE, check.names = FALSE, col.names = cols, header = read_header)
  if (nrow(data) > 0) {
    lines_read <- lines_read + nrow(data)
    print(paste0("lines read this chunk: ", nrow(data), ", lines read so far: ", lines_read))
    data <- rebase_data(data, offset)
    # write.csv doesn't support col.names=FALSE; so use write.table, which does
    write.table(data, tgt_conn, row.names = FALSE, col.names = write_header, sep = ",")
  }
  if (nrow(data) < CHUNK_SIZE) {
    end <- TRUE
  }
  read_header <- FALSE
  write_header <- FALSE
}
close(src_conn)
close(tgt_conn)

Reading multiple csv files of the same format into a data frame

I need to run the same set of code for multiple CSV files, and I want to do this macro-style, applying the same code to each file. Below is the code I am executing, but the results are not coming out properly: it reads the data in 2-d format, while I need to run it in 3-d format.
lf = list.files(path = "D:/THD/data", pattern = ".csv",
                full.names = TRUE, recursive = TRUE, include.dirs = TRUE)
ds <- lapply(lf, read.table)
I don't know if this is going to be useful, but one of the ways I do it is:
## Step 1: read files
mycsv = dir(pattern = ".csv")
n <- length(mycsv)
mylist <- vector("list", n)
for (i in 1:n) mylist[[i]] <- read.csv(mycsv[i], header = TRUE)
Then I usually just use an apply function to change things, for example:
## change column names
mylist <- lapply(mylist, function(x) {
  names(x) <- c("type", "date", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8",
                "v9", "v10", "v11", "v12", "v13", "v14", "v15", "v16", "v17",
                "v18", "v19", "v20", "v21", "v22", "v23", "v24", "total")
  return(x)
})
## change the type column for weekday/weekend
mylist <- lapply(mylist, function(x) {
  f = c("we", "we", "wd", "wd", "wd", "wd", "wd")
  x$type = rep(f, 52, length.out = 365)
  return(x)
})
and so on.
Then, after all the changes I made, I save with the following code (it is also sometimes useful to split the original file name and rename each file with part of it, so that I can track individual files later):
## for example, some of my files had a name pattern such as "201_E424220_N563500.csv", so I split the name to save each file under a new one like this:
mylist <- lapply(1:length(mylist), function(i) {
  mylist.i <- mylist[[i]]
  s = strsplit(mycsv[i], "_", fixed = TRUE)[[1]]
  d = cbind(mylist.i[, c("type", "date")], ID = s[1], Easting = s[2],
            Northing = s[3], mylist.i[, 3:ncol(mylist.i)])
  return(d)
})
for (i in 1:n)
  write.csv(file = paste("file", i, ".csv", sep = ""), mylist[[i]], row.names = FALSE)
I hope this will help. When you get some time, please read about the plyr package; it is a very useful package with lots of data-analysis options. plyr has apply functions such as:
## l_ply: split list, apply function and discard results
## ldply: split list, apply function and return results in a data frame
## laply: split list, apply function and return results in an array
For example, you can use ldply to read all your csv files and return a data frame, something like:
data = ldply(list.files(pattern = ".csv"), function(fname) {
  j = read.csv(fname, header = TRUE)
  return(j)
})
Here, data will be your data frame containing the data from all your csv files.
Thanks, Ayan
