Process a large list of objects in R

I have a large list of objects (say 100k elements). Each element will have to be processed by a function process, but I would like to do the processing in chunks, say 20 passes, because I want to save the processing results to a file on the hard drive and keep operating memory free.
I'm new to R and I know that it should involve some apply magic, but I don't know how to do it (yet).
Any guidance would be much appreciated.
A small example:
# build a toy list: 100 elements, each the number 500
objects <- list()
for (i in 1:100) {
  objects <- append(objects, 500)
}
objects
processOneElement <- function(x) {
  x/20 + 23
}
I would like to process the first 20 elements in one go and save the results, then process the second 20 elements in a second go and save those results... and so on.
process <- function(x) {
  x/20 + 23
}
results <- lapply(objects, FUN=process)
# the same, but in chunks of 20, writing each batch to its own file:
index <- seq(1, length(objects), by = 20)
lapply(index, function(idx1) {
  idx2 <- min(idx1 + 20 - 1, length(objects))
  batch <- lapply(idx1:idx2, function(x) {
    process(objects[[x]])
  })
  write.table(batch, paste("batch", idx1, sep = ""))
})

With what you have given, this is the answer I could suggest. Assuming your list is stored in list.object,
lapply(seq(1, length(list.object), by = 20), function(idx) {
  # here idx will be 1, 21, 41 etc...
  idx2 <- min(idx + 20 - 1, length(list.object))
  # do what you want here..
  batch.20.processed <- lapply(idx:idx2, function(x) {
    process(list.object[[x]])  # passes idx:idx2 indices one at a time
  })
  # here you have the processed batch (up to 20 elements; the last may be shorter)
  # finally write to file
  lapply(seq_along(batch.20.processed), function(x) {
    write.table(batch.20.processed[[x]], ...)
    # where "..." is all other allowed arguments to write.table
    # such as file, row.names, col.names, quote etc.
    # don't literally pass "..." to write.table
  })
})
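As an alternative sketch of my own (not from the answer above; the chunk size, file names, and the rm/gc cleanup are illustrative assumptions), base R's split() can build the batches up front, using the objects list and process function from the question:

chunk.size <- 20
chunk.id <- ceiling(seq_along(objects) / chunk.size)   # 1,1,...,1,2,2,...
chunks <- split(objects, chunk.id)                     # list of 20-element sub-lists

for (i in seq_along(chunks)) {
  batch <- lapply(chunks[[i]], process)                # process one chunk
  write.table(unlist(batch), paste0("batch", i, ".txt"))
  rm(batch); gc()                                      # free memory before the next chunk
}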

Related

Money adding program in R

I want to create a program in R that takes integer user input and then adds it to the previous user input. E.g. user input (say one day): 10, then (maybe the next day) user input: 15 --> output 25. Ideally this would accept a nearly infinite amount of input. Here is what I have so far:
amount_spent <- function() {
  i <- 1
  while (i < 10) {
    n <- readline(prompt = "How much did you spend?: ")
    i <- i + 1
  }
  print(c(as.integer(n)))
}
amount_spent()
Problems I have with this code are that it only saves the last input value, and that it is difficult to control when the user is allowed to input. Is there any way to save user input collected through readline() into a data structure that can be manipulated?
# 1.R
fname <- "s.data"
if (file.exists(fname)) {
  load(fname)
}
if (!exists("s")) {
  s <- 0
}
n <- 0
while (TRUE) {
  cat("Enter a number: ")
  n <- scan("stdin", double(), n = 1, quiet = TRUE)
  if (length(n) != 1) {
    print("exiting")
    break
  }
  s <- s + as.numeric(n)
  cat("Sum=", s, "\n")
  save(list = c("s"), file = fname)
}
You should run the script like this: Rscript 1.R
To exit the loop press Ctrl-D in Unix, or Ctrl-Z in Windows.
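An illustrative session (my example; the numbers are arbitrary) shows how the sum persists across runs thanks to the save/load of s.data:

$ Rscript 1.R
Enter a number: 3
Sum= 3
Enter a number:    # press Ctrl-D here
[1] "exiting"
$ Rscript 1.R
Enter a number: 4
Sum= 7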
An R-ish way to do it would be through closures. Here is an example for interactive use (i.e. within an R session).
balance_setup <- function() {
  balance <- 0
  change_balance <- function() {
    n <- readline(prompt = "How much did you spend?: ")
    n <- as.numeric(n)
    if (!is.na(n))
      balance <<- balance + n
    balance
  }
  print_balance <- function() {
    balance
  }
  list(change_balance = change_balance,
       print_balance = print_balance)
}
funs <- balance_setup()
change_balance <- funs$change_balance
print_balance <- funs$print_balance
Calling balance_setup creates a variable balance and two functions that can access it: one for changing the balance, one for printing it. In R, a function can only return a single value, so I bundle both functions together in a list.
change_balance()
## How much did you spend? 5
## [1] 5
change_balance()
## How much did you spend? 5
## [1] 10
print_balance()
## [1] 10
If you want many inputs, use a loop:
repeat{
change_balance()
}
Break the loop with Ctrl-C, Escape or whatever is used on your platform.
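If you'd rather not rely on an interrupt, a small variation of my own (not part of the original answer) is to ask before each entry and exit cleanly:

repeat {
  ans <- readline(prompt = "Add another amount? (y/n): ")
  if (ans != "y") break
  change_balance()
}
print_balance()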

Putting user-defined functions on a list in a for loop

I have problems storing user-defined functions in an R list when they are put on it in a for loop.
I have to define some segment-specific functions based on some parameters, so I create the functions and put them on a list by looping through the segments with a for loop. The problem is that I get the same function everywhere on the resulting list.
The code looks like this:
n <- 100
segmenty <- 1:n
segment_functions <- list()
for (i in segmenty) {
  segment_functions[[i]] <- function() { return(i) }
}
When I run the code, what I get is the same function (the last one created in the loop) for all indexes:
## for all k
segment_functions[[k]]()
[1] 100
There is no problem when I put the functions on list manually e.g.
segment_functions[[1]] <- function(){return(1)}
segment_functions[[2]] <- function(){return(2)}
segment_functions[[3]] <- function(){return(3)}
works just fine.
I honestly have no idea what's wrong. Could you help?
You need to use the force function to ensure that i is evaluated at the time of the assignment into the list; each call to the factory f below also gives the stored function its own environment, so every list element captures its own copy of i:
n <- 100
segmenty <- 1:n
segment_functions <- list()
f <- function(i) { force(i); function() return(i) }
for (i in segmenty) {
  segment_functions[[i]] <- f(i)
}
I'd use lapply and capture i in a closure of the wrapper:
segment_functions <- lapply(1:100, function(i) function() i)
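A quick check of either approach (illustrative output):

segment_functions[[5]]()
## [1] 5
segment_functions[[42]]()
## [1] 42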

Memory problems using bigmemory to load large dataset in R

I have a large text file (>10 million rows, > 1 GB) that I wish to process one line at a time to avoid loading the entire thing into memory. After processing each line I wish to save some variables into a big.matrix object. Here is a simplified example:
library(bigmemory)
library(pryr)
con <- file('x.csv', open = "r")
x <- big.matrix(nrow = 5, ncol = 1, type = 'integer')
for (i in 1:5) {
  print(c(address(x), refs(x)))
  y <- readLines(con, n = 1, warn = FALSE)
  x[i] <- 2L * as.integer(y)
}
close(con)
where x.csv contains
4
18
2
14
16
Following the advice here http://adv-r.had.co.nz/memory.html I have printed the memory address of my big.matrix object and it appears to change with each loop iteration:
[1] "0x101e854d8" "2"
[1] "0x101d8f750" "2"
[1] "0x102380d80" "2"
[1] "0x105a8ff20" "2"
[1] "0x105ae0d88" "2"
Can big.matrix objects be modified in place?
Is there a better way to load, process and then save these data? The current method is slow!
The slowest part of your method appears to be the call to read each line individually. We can 'chunk' the data, i.e. read several lines at a time, in order to avoid hitting the memory limit while possibly speeding things up.
Here's the plan:
Figure out how many lines we have in a file
Read in a chunk of those lines
Perform some operation on that chunk
Push that chunk back into a new file to save for later
library(readr)
# Make a test file: 100,000 rows x 10 columns
x <- data.frame(matrix(rnorm(1000000), 100000, 10))
write_csv(x, "./test_set2.csv")
# Create a function to read a variable in a file chunk by chunk and double it
calcDouble <- function(calc.file, outputFile = "./outPut_File.csv",
                       read.size = 500000, variable = "X1") {
  # Set up variables
  num.lines <- 0
  lines.per <- NULL
  i <- 1L
  # Gather column names and position of objective column
  connection.names <- file(calc.file, open = "r")
  data.names <- read.table(connection.names, sep = ",", header = TRUE, nrows = 1)
  close(connection.names)
  col.name <- which(colnames(data.names) == variable)
  # Find length of file by line
  connection.len <- file(calc.file, open = "r")
  while ((linesread <- length(readLines(connection.len, read.size))) > 0) {
    lines.per[i] <- linesread
    num.lines <- num.lines + linesread
    i <- i + 1L
  }
  close(connection.len)
  # Make connection for doubling function
  # Loop through the file and double the chosen variable
  connection.double <- file(calc.file, open = "r")
  for (j in seq_along(lines.per)) {
    # Read in a chunk of the file; on the first chunk, skip the header line
    # and read one row less, since the header was counted by readLines above
    if (j == 1) {
      data <- read.table(connection.double, sep = ",", header = FALSE,
                         skip = 1, nrows = lines.per[j] - 1, comment.char = "")
    } else {
      data <- read.table(connection.double, sep = ",", header = FALSE,
                         nrows = lines.per[j], comment.char = "")
    }
    # Grab the column we need and double it
    double <- data[, col.name] * 2
    if (j != 1) {
      write_csv(data.frame(double), outputFile, append = TRUE)
    } else {
      write_csv(data.frame(double), outputFile)
    }
    message(paste0("Reading from Chunk: ", j, " of ", length(lines.per)))
  }
  close(connection.double)
}
calcDouble("./test_set2.csv", read.size = 50000, variable = "X1")
So we get back a .csv file with the manipulated data. You can change double <- data[, col.name] * 2 to whatever you need to do to each chunk.
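As a side note of my own (not part of the original answer): newer versions of readr also provide a chunked reader that does this bookkeeping for you. A minimal sketch, assuming the same test file and column as above:

library(readr)

# the callback receives each chunk as an ordinary data frame; results are row-bound
double_chunk <- function(chunk, pos) {
  data.frame(double = chunk$X1 * 2)
}

out <- read_csv_chunked("./test_set2.csv",
                        DataFrameCallback$new(double_chunk),
                        chunk_size = 50000)
write_csv(out, "./outPut_File.csv")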

Store results of a for-loop in an object or matrix

I have the following problem:
I use a for-loop within R to get specific data from a matrix.
My code is as follows:
for (i in 1:100) {
  T <- as.Date(as.mondate(STARTLISTING) + i)  # as.mondate() is from the 'mondate' package
  DELIST <- (subset(datensatz_Start_End.frame, TIME <= T))[, 1]
  write.table(DELIST, file = paste("tab", i, ".csv"), sep = ",")
  print(DELIST)
}
Using print, R displays the data.
Using write.table, R writes the data into different files.
My aim is to aggregate the results from the for-loop in one matrix (one row for each 'i'), but unfortunately I cannot make it work.
Sorry, I'm a real noob with R.
for (i in 1:100) {
  T <- as.Date(as.mondate(STARTLISTING) + i)
  DELIST <- (subset(datensatz_Start_End.frame, TIME <= T))[, 1]
  assign(paste('b', i, sep = ''), DELIST)
}
This delivers 100 objects (b1 ... b100), which contain my results.
But what I need is one matrix/dataframe with 100 columns, or one list.
Any ideas?
Hey!
Since I'm not allowed to edit my own answer, here is my (simple) solution:
DELIST <- vector("list", 100)
for (i in 1:100) {
  T <- as.Date(as.mondate(STARTLISTING) + i)
  DELIST[[i]] <- as.character((subset(datensatz_Start_End.frame, TIME <= T))[, 1])
}
DELIST[[99]] ## it is possible to request the relevant companies for every 'i'
Thx to everyone!
George
If you want a list, you can use lapply instead of a loop:
LL <- lapply(1:100,
             function(i) {
               T <- as.Date(as.mondate(STARTLISTING) + i)
               (subset(datensatz_Start_End.frame, TIME <= T))[, 1]
             })
After that you can rbind the results together using do.call:
result <- do.call(rbind, LL)
Or, if you are confident that all elements of LL have the same columns, you can use the more efficient rbindlist from the data.table package:
library(data.table)
result <- rbindlist(LL)
Check out the rbind function. You can start with an empty DELIST.DF and append each row to it inside the loop:
DELIST.DF <- NULL
for (i in 1:100) {
  T <- as.Date(as.mondate(STARTLISTING) + i)
  DELIST <- (subset(datensatz_Start_End.frame, TIME <= T))[, 1]
  DELIST.DF <- rbind(DELIST.DF, DELIST)
  write.table(DELIST, file = paste("tab", i, ".csv"), sep = ",")
  print(DELIST)
}
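Note (my addition, not from the answer above): growing DELIST.DF with rbind inside the loop copies the object on every iteration, so it slows down quadratically as the loop gets long; for large runs, collecting the pieces in a list and calling do.call(rbind, ...) once, as shown above, scales better.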

combination of expand.grid and mapply?

I am trying to come up with a variant of mapply (call it xapply for now) that combines the functionality (sort of) of expand.grid and mapply. That is, for a function FUN and a list of arguments L1, L2, L3, ... of unknown length, it should produce a list of length n1*n2*n3 (where ni is the length of list i) which is the result of applying FUN to all combinations of the elements of the list.
If expand.grid worked to generate lists of lists rather than data frames, one might be able to use it, but I have in mind that the lists may be lists of things that won't necessarily fit into a data frame nicely.
This function works OK if there are exactly three lists to expand, but I am curious about a more generic solution. (FLATTEN is unused, but I can imagine that FLATTEN=FALSE would generate nested lists rather than a single list ...)
xapply3 <- function(FUN, L1, L2, L3, FLATTEN = TRUE, MoreArgs = NULL) {
  retlist <- list()
  count <- 1
  for (i in seq_along(L1)) {
    for (j in seq_along(L2)) {
      for (k in seq_along(L3)) {
        retlist[[count]] <- do.call(FUN, c(list(L1[[i]], L2[[j]], L3[[k]]), MoreArgs))
        count <- count + 1
      }
    }
  }
  retlist
}
edit: forgot to return the result. One might be able to solve this by making a list of the indices with combn and going from there ...
I think I have a solution to my own question, but perhaps someone can do better (and I haven't implemented FLATTEN=FALSE ...)
xapply <- function(FUN, ..., FLATTEN = TRUE, MoreArgs = NULL) {
  L <- list(...)
  inds <- do.call(expand.grid, lapply(L, seq_along))  ## Marek's suggestion
  retlist <- list()
  for (i in 1:nrow(inds)) {
    arglist <- mapply(function(x, j) x[[j]], L, as.list(inds[i, ]), SIMPLIFY = FALSE)
    if (FLATTEN) {
      retlist[[i]] <- do.call(FUN, c(arglist, MoreArgs))
    }
  }
  retlist
}
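A tiny illustration of xapply (my example, not from the post); note that expand.grid varies the first list fastest, which determines the order of the results:

xapply(paste, list("a", "b"), list(1, 2))
## a list of four strings: "a 1", "b 1", "a 2", "b 2"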
edit: I tried @baptiste's suggestion, but it's not easy (or wasn't for me). The closest I got was
xapply2 <- function(FUN, ..., FLATTEN = TRUE, MoreArgs = NULL) {
  L <- list(...)
  xx <- do.call(expand.grid, L)
  f <- function(...) {
    do.call(FUN, lapply(list(...), "[[", 1))
  }
  mlply(xx, f)  # mlply is from the plyr package
}
which still doesn't work. expand.grid is indeed more flexible than I thought (although it creates a weird data frame that can't be printed), but enough magic is happening inside mlply that I can't quite make it work.
Here is a test case:
L1 <- list(data.frame(x = 1:10, y = 1:10),
           data.frame(x = runif(10), y = runif(10)),
           data.frame(x = rnorm(10), y = rnorm(10)))
L2 <- list(y ~ 1, y ~ x, y ~ poly(x, 2))
z <- xapply(lm, L2, L1)
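With this test case, z should be a list of nine fitted models: each of the three formulas in L2 fit to each of the three data frames in L1.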
@ben-bolker, I had a similar desire and think I have a preliminary solution worked out that I've also tested to work in parallel. The function, which I somewhat confusingly called gmcmapply (g for grid), takes an arbitrarily large named list mvars (which gets expand.grid-ed within the function) and a FUN that uses the list names as if they were arguments to the function itself. gmcmapply updates the formals of FUN so that, by the time FUN is passed to mcmapply, its arguments reflect the variables the user would like to iterate over (which would be layers in a nested for loop). mcmapply then dynamically updates the values of these formals as it cycles over the expanded set of variables in mvars.
I've posted the preliminary code as a gist (reprinted with an example below) and would be curious to get your feedback on it. I'm a grad student and self-described intermediately-skilled R enthusiast, so this is pushing my R skills for sure. You or other folks in the community may have suggestions that would improve on what I have. I do think that even as it stands, I'll be coming back to this function quite a bit in the future.
gmcmapply <- function(mvars, FUN, SIMPLIFY = FALSE, mc.cores = 1, ...) {
  require(parallel)
  FUN <- match.fun(FUN)
  # allow for default args to carry over from FUN
  funArgs <- formals(FUN)[which(names(formals(FUN)) != "...")]
  # allows expanded dot args to be passed as formal args to the user-specified function
  expand.dots <- list(...)
  # Implement non-default arg substitutions passed through dots.
  if (any(names(funArgs) %in% names(expand.dots))) {
    dot_overwrite <- names(funArgs[which(names(funArgs) %in% names(expand.dots))])
    funArgs[dot_overwrite] <- expand.dots[dot_overwrite]
    # for arg naming and matching below
    expand.dots[dot_overwrite] <- NULL
  }
  ## build grid of mvars to loop over; this ensures that each combination of inputs
  ## is evaluated (equivalent to creating a structure of nested for loops)
  grid <- expand.grid(mvars, KEEP.OUT.ATTRS = FALSE, stringsAsFactors = FALSE)
  # specify formals of the function to be evaluated by merging the grid
  # to mapply over with the expanded dot args
  argdefs <- rep(list(bquote()), ncol(grid) + length(expand.dots) + length(funArgs) + 1)
  names(argdefs) <- c(colnames(grid), names(funArgs), names(expand.dots), "...")
  argdefs[which(names(argdefs) %in% names(funArgs))] <- funArgs          # restore default args
  argdefs[which(names(argdefs) %in% names(expand.dots))] <- expand.dots  # plug in dot args
  formals(FUN) <- argdefs
  if (SIMPLIFY) {
    # standard mapply
    do.call(mcmapply, c(FUN, c(unname(grid), mc.cores = mc.cores)))  # mc.cores = 1 == mapply
  } else {
    # standard Map
    do.call(mcmapply, c(FUN, c(unname(grid), SIMPLIFY = FALSE, mc.cores = mc.cores)))
  }
}
example code below:
# Example 1:
# just make sure variables used in your function appear as the names of mvars
myfunc <- function(...) {
  return_me <- paste(l3, l1^2 + l2, sep = "_")
  return(return_me)
}
mvars <- list(l1 = 1:10,
              l2 = 1:5,
              l3 = letters[1:3])
### list output (SIMPLIFY = FALSE, Map-like)
lreturns <- gmcmapply(mvars, myfunc)
### simplified/concatenated output (SIMPLIFY = TRUE, mapply-like)
lreturns <- gmcmapply(mvars, myfunc, SIMPLIFY = TRUE)
## N.B. This is equivalent to running (up to the ordering of combinations):
lreturns <- c()
for (l1 in 1:10) {
  for (l2 in 1:5) {
    for (l3 in letters[1:3]) {
      lreturns <- c(lreturns, myfunc(l1, l2, l3))
    }
  }
}
### concatenated output run on 2 cores
lreturns <- gmcmapply(mvars, myfunc, SIMPLIFY = TRUE, mc.cores = 2)
Example 2. Pass non-default args to FUN.
## Since the apply functions don't accept full calls as inputs (calls are internal), the user can pass arguments to FUN through dots, which can overwrite a default option of FUN.
# e.g. apply(x, 1, FUN) works and apply(x, 1, FUN(arg_to_change = not_default)) does not; the correct way to specify non-default/additional args to FUN is:
# gmcmapply(mvars, FUN, arg_to_change = not_default)
## update myfunc to have a default argument
myfunc <- function(rep_letters = 3, ...) {
  return_me <- paste(rep(l3, rep_letters), l1^2 + l2, sep = "_")
  return(return_me)
}
lreturns <- gmcmapply(mvars, myfunc, rep_letters = 1)
A bit of additional functionality I would like to add, but am still trying to work out:
Cleaning up the output to be a pretty nested list with the names of mvars (normally I'd create multiple lists within a nested for loop and tag lower-level lists onto higher-level lists all the way up until all layers of the gigantic nested loop were done). I think using some abstracted variant of the solution provided here will work, but I haven't figured out how to make the solution flexible to the number of columns in the expand.grid-ed data.frame.
An option to log the outputs of the child processes that get called in mcmapply in a user-specified directory. That way you could look at .txt outputs from every combination of variables generated by expand.grid (i.e. if the user prints model summaries or status messages as a part of FUN, as I often do). I think a feasible solution is to use the substitute() and body() functions, described here, to edit FUN so that it opens a sink() at the beginning and closes it at the end, if the user specifies a directory to write to. Right now I just program it right into FUN itself, but later it would be nice to pass gmcmapply an argument called something like log_children = "path_to_log_dir" and then edit the body of the function to (pseudocode) sink(file = file.path(log_children, paste0(paste(names(mvars), sep = "_"), ".txt"))).
Let me know what you think!
-Nate
