Merge and name data frames in for loop - r

I have a bunch of DF named like: df1, df2, ..., dfN
and lt1, lt2, ..., ltN
I would like to merge them in a loop, something like:
for (X in 1:N){
outputX <- merge(dfX, ltX, ...)
}
But I am having trouble getting the names of output, dfX, and ltX to change in each iteration. I realize that plyr/data.table/reshape might offer an easier way, but I would like to get the for loop to work.
Perhaps I should clarify. The DFs are quite large, which is why plyr etc. will not work (they crash). I would like to avoid copying.
The next step in the code is to save the merged DF.
This is why I prefer the for-loop approach, since I know what each merged DF is named in the environment.

You can combine data frames into lists and use mapply, as in the example below:
i <- 1:3
d1.a <- data.frame(i=i,a=letters[i])
d1.b <- data.frame(i=i,A=LETTERS[i])
i <- 11:13
d2.a <- data.frame(i=i,a=letters[i])
d2.b <- data.frame(i=i,A=LETTERS[i])
L1 <- list(d1.a, d2.a)
L2 <- list(d1.b, d2.b)
mapply(merge,L1,L2,SIMPLIFY=F)
# [[1]]
# i a A
# 1 1 a A
# 2 2 b B
# 3 3 c C
#
# [[2]]
# i a A
# 1 11 k K
# 2 12 l L
# 3 13 m M
If you'd like to save each of the resulting data frames in the global environment (I'd advise against it though), you could do:
result <- mapply(merge,L1,L2,SIMPLIFY=F)
names(result) <- paste0('output',seq_along(result))
which will give a name to every data frame in the list, and then:
sapply(names(result),function(s) assign(s,result[[s]],envir = globalenv()))
Please note that this is a base R solution that does essentially the same thing as your sample code.
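For readers who do want the literal for loop from the question, a minimal sketch with get() and assign() might look like the following. The toy data frames, the key column id, and N = 2 are assumptions made up for illustration; only the df/lt/output naming pattern comes from the question.

```r
# Toy inputs standing in for df1..dfN and lt1..ltN (illustrative data only)
df1 <- data.frame(id = 1:3, a = c("x", "y", "z"))
lt1 <- data.frame(id = 1:3, b = c(10, 20, 30))
df2 <- data.frame(id = 4:5, a = c("p", "q"))
lt2 <- data.frame(id = 4:5, b = c(40, 50))
N <- 2

for (X in seq_len(N)) {
  # get() looks up an object by the name built for this iteration
  merged <- merge(get(paste0("df", X)), get(paste0("lt", X)), by = "id")
  # assign() stores the result as output1, output2, ...
  assign(paste0("output", X), merged)
}
```

This keeps the predictable names the question asks for, at the cost of scattering the results across the environment rather than holding them in one list.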

If your data frames are in a list, writing a for loop is trivial:
# lt = list(lt1, lt2, lt3, ...)
# if your data is very big, this may run you out of memory
# "^lt[0-9]+$" matches only lt1, lt2, ...; note that ls() sorts names alphabetically
lt = mget(ls(pattern = "^lt[0-9]+$"))
merged_data = merge(lt[[1]], lt[[2]])
# seq_along(lt)[-(1:2)] is empty when there are only two data frames
for (i in seq_along(lt)[-(1:2)]) {
  merged_data = merge(merged_data, lt[[i]])
  save(merged_data, file = paste0("merging", i, ".rda"))
}
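If only the final merged result is needed, the same sequential merge can be written as a fold with Reduce(). This is a sketch under the assumption that all frames share a key column, here called id; the three small data frames are invented for illustration.

```r
# Illustrative list of data frames sharing the key column "id"
lt <- list(
  data.frame(id = 1:3, a = 1:3),
  data.frame(id = 1:3, b = 4:6),
  data.frame(id = 2:3, c = 7:8)
)

# Merge the first two frames, then merge that result with the third, and so on
merged_all <- Reduce(function(x, y) merge(x, y, by = "id"), lt)
```

Unlike the loop, this does not save the intermediate merges to disk, so it only fits the case where those files are not needed.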


Conditionally add named elements to a list

I have a function to perform actions on a variable list of dataframes depending on user selections. The function mostly performs generic actions but there are a few actions that are dataframe specific.
My code runs fine if all dataframes are selected but I am unable to get it to work if not all dataframes are selected.
The following provides a minimal reproducible example:
# User switches.
df1Switch <- TRUE
df2Switch <- TRUE
df3Switch <- TRUE
# DF creation.
set.seed(1)
df <- data.frame(X=sample(1:10), Y=sample(11:20))
if (df1Switch) df1 <- df
if (df2Switch) df2 <- df
if (df3Switch) df3 <- df
# Function to do something.
fn_something <- function(file_list, file_names) {
df <- file_list
# Do lots of generic things.
df$Z <- df$X + df$Y
# Do a few specific things.
if (file_names == "Name1") df$X <- df$X + 1
else if (file_names == "Name2") df$X <- df$Z - 1
else if (file_names == "Name3") df$Y <- df$X + df$Y
return(df)
}
# Call function to do something.
file_list <- list(Name1=df1, Name2=df2, Name3=df3)
file_names <- names(file_list)
all_df <- do.call(rbind,mapply(fn_something, file_list, file_names,
SIMPLIFY=FALSE))
In this case the code runs fine as the user has selected to create all three dataframes. I use a named list so that the specific actions can be performed against the correct dataframes.
The output looks something like this (the actual numbers aren't important):
X Y Z
Name1.1 4 13 16
Name1.2 5 12 16
Name1.3 6 16 21
: : : :
Name2.1 15 13 16
: : : :
The problem arises if the user selects not to create some dataframes, e.g.:
# User switches.
df1Switch <- TRUE
df2Switch <- FALSE
df3Switch <- TRUE
Not surprisingly, in this case an object not found error results:
> # Call function to do something.
> file_list <- list(Name1=df1, Name2=df2, Name3=df3)
Error: object 'df2' not found
What I would like to do is conditionally specify the contents of file_list along the lines of this pseudo code:
file_list <- list(if (df1Switch) {Name1=df1}, if (df2Switch) {Name2=df2}, if (df3Switch) {Name3=df3})
I have come across list.foldLeft and the question "Conditionally merge list elements", but I don't know if either is suitable.
(I'll re-hash my comment:)
In general, I would encourage you to consider use of a list-of-dataframes instead of individual frames. My rationale for this:
assuming that each frame is structured (nearly) identically; and
assuming that what you do to one frame you will (or at least can) do to all frames; then
it is easier to list_of_frames <- lapply(list_of_frames, some_func) than it is to do something like:
for (nm in c("df1", "df2", "df3")) {
d <- get(nm)
d <- some_func(d)
assign(nm, d)
}
especially when dealing with non-global environments (i.e., doing this within a function).
To be clear, "easier" is subjective: though it does win code-golf, I find it much easier to read and understand that "I am running some_func on each element of list_of_frames and saving the result". (You can even save it to a new list-of-frames, thereby keeping the original frames untouched.)
You may also do things conditionally, as in
needs_work <- sapply(list_of_frames, some_checker_func) # returns logical
# or
needs_work <- c("df1", "df2") # names of elements of list_of_frames
list_of_frames[needs_work] <- lapply(list_of_frames[needs_work], some_func)
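As a runnable sketch of the conditional update above: some_checker_func and some_func are hypothetical placeholders (here, "has missing values" and "replace missing values with 0"), and the three frames are toy data.

```r
# Toy list of frames; df2 is the only one with a missing value
list_of_frames <- list(
  df1 = data.frame(x = 1:3),
  df2 = data.frame(x = c(NA, 2)),
  df3 = data.frame(x = 4:5)
)

# Hypothetical checker: does this frame need work?
some_checker_func <- function(d) anyNA(d$x)
# Hypothetical worker: replace missing values with 0
some_func <- function(d) {
  d$x[is.na(d$x)] <- 0
  d
}

needs_work <- sapply(list_of_frames, some_checker_func)  # named logical vector
list_of_frames[needs_work] <- lapply(list_of_frames[needs_work], some_func)
```

Only the elements flagged by needs_work are replaced; the other frames are left exactly as they were.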
Having said that ... the direct answer to your one liner:
c(if (df1Switch) list(Name1=df1), if (df2Switch) list(Name2=df2), if (df3Switch) list(Name3=df3))
This capitalizes on the fact that unstated else results in a NULL, and the NULL-compressing (dropping) characteristic of c(). You can see it in action with:
c(if (T) list(a=1), if (T) list(b=2), if (T) list(d=4))
# $a
# [1] 1
# $b
# [1] 2
# $d
# [1] 4
c(if (T) list(a=1), if (FALSE) list(b=2), if (T) list(d=4))
# $a
# [1] 1
# $d
# [1] 4
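An alternative sketch, assuming the switches can be kept in one named logical vector instead of three separate variables (that restructuring is my assumption, not part of the question): pick the selected names and fetch only those objects with mget().

```r
# Switches kept in one named logical vector (an assumed restructuring)
switches <- c(df1 = TRUE, df2 = FALSE, df3 = TRUE)

df <- data.frame(X = 1:3, Y = 4:6)
if (switches[["df1"]]) df1 <- df
if (switches[["df2"]]) df2 <- df
if (switches[["df3"]]) df3 <- df

# Fetch only the data frames that were actually created
wanted <- names(switches)[switches]
file_list <- mget(wanted, envir = environment())
names(file_list) <- sub("^df", "Name", wanted)  # df1 -> Name1, df3 -> Name3
```

Because df2 is never looked up, the "object not found" error does not occur, and file_list can be passed on unchanged.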

Creating a data frame from looping through a list of data frames in R

I have a large data set that is organized as a list of 1044 data frames. Each data frame is a profile that holds the same data for a different station and time. I am trying to create a data frame that holds the output of my function fitsObs, but my current code only goes through a single data frame. Any ideas?
i=1
start=1
for(i in 1:1044){
station1 <- surveyCTD$stations[[i]]
df1 <- surveyCTD$data[[i]]
date1 <- surveyCTD$dates[[i]]
fitObs <- fitTp2(-df1$depth, df1$temp)
if(start==1){
start=0
dfout <- data.frame(
date=date1
,station=station1
)
names(fitObs) <- paste0(names(fitObs),"o")
dfout<-cbind(dfout, df1$temp, df1$depth)
dfout <- cbind(dfout, fitObs)
}
}
At first glance, I would try two things to debug it. First, print the head of a data frame inside the loop to see what the loop is doing. Then check where dfout is assigned: it is only built inside the if (start == 1) block, which runs in the first iteration only, so later iterations never add anything to it.
Moreover, setting i = 1 before the loop has no effect, because for reassigns i on every iteration.
I have created a reproducible example of my best guess as to what you are asking. I also assume that you are able to adjust the concepts in this general example to suit your own problem. It's easier if you provide an example of your list in future.
First we create some reproducible data
a <- c(10,20,30,40)
b <- c(5,10,15,20)
c <- c(20,25,30,35)
df1 <- data.frame(x=a+1,y=b+1,z=c+1)
df2 <- data.frame(x=a,y=b,z=c)
ls1 <- list(df1,df2)
Which looks like this
print(ls1)
[[1]]
x y z
1 11 6 21
2 21 11 26
3 31 16 31
4 41 21 36
[[2]]
x y z
1 10 5 20
2 20 10 25
3 30 15 30
4 40 20 35
So we now have two data frames within a single list. The following code goes through the columns of each data frame in the list and applies mean() to each column. To work on rows instead, pass 1 rather than 2 as the second argument of apply().
df <- do.call("rbind", lapply(ls1, function(x) apply(x,2,mean)))
df <- as.data.frame(df)
print(df)
x y z
1 26 13.5 28.5
2 25 12.5 27.5
You should be able to replace mean() with whatever function you have written for your data. Let me know if this helps.
Consider building a generalized function to be called with Map (a wrapper around mapply, the multivariate, element-wise member of the apply family) to build a list of data frames, each with your fitObs output. Pass all equal-length objects into the data.frame() constructor.
Then, outside the loop, run do.call for a final, single appended data frame of all 1,044 smaller data frames (assuming each has exactly the same names and number of columns):
# GENERALIZED FUNCTION
add_fit_obs <- function(dt, st, df) {
  fitObs <- fitTp2(-df$depth, df$temp)
  names(fitObs) <- paste0(names(fitObs), "o")
  # use the df argument here, not the global df1
  tmp <- data.frame(
    date = dt,
    station = st,
    depth = df$depth,
    temp = df$temp,
    fitObs
  )
  return(tmp)
}
# LIST OF DATA FRAMES (argument order must match dt, st, df)
df_list <- Map(add_fit_obs, surveyCTD$dates, surveyCTD$stations, surveyCTD$data)
# EQUIVALENTLY:
# df_list <- mapply(add_fit_obs, surveyCTD$dates, surveyCTD$stations, surveyCTD$data, SIMPLIFY=FALSE)
# SINGLE DATAFRAME
master_df <- do.call(rbind, df_list)
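Since fitTp2 and surveyCTD are not shown in the question, here is a self-contained sketch of the same Map plus do.call(rbind, ...) pattern. fit_stub is a hypothetical stand-in that just returns two summary values, and the two-profile surveyCTD list is invented toy data.

```r
# Hypothetical stand-in for fitTp2(): returns a named list of scalar results
fit_stub <- function(z, temp) list(min = min(temp), max = max(temp))

# Toy version of the surveyCTD structure with two profiles
surveyCTD <- list(
  stations = list("A", "B"),
  dates    = list(as.Date("2020-01-01"), as.Date("2020-01-02")),
  data     = list(
    data.frame(depth = 1:4, temp = c(10, 9, 8, 7)),
    data.frame(depth = 1:3, temp = c(12, 11, 9))
  )
)

add_fit <- function(dt, st, df) {
  fit <- fit_stub(-df$depth, df$temp)
  names(fit) <- paste0(names(fit), "o")
  # scalar date, station, and fit values are recycled to the profile length
  data.frame(date = dt, station = st, depth = df$depth, temp = df$temp, fit)
}

# One data frame per profile, then stacked into a single data frame
df_list <- Map(add_fit, surveyCTD$dates, surveyCTD$stations, surveyCTD$data)
master_df <- do.call(rbind, df_list)
```

Each profile contributes as many rows as it has depth readings, so the toy result has 4 + 3 rows.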

R: Repeat script n times, changing variables in each iteration

I have a script that I want to repeat n times, where some variables are changed by 1 each iteration. I'm creating a data frame consisting of the standard deviation of the difference of various vectors. My script currently looks like this:
standard.deviation <- data.frame(
  c(
    sd(diff(t1[,1])),
    sd(diff(t1[,2])),
    sd(diff(t1[,3])),
    sd(diff(t1[,4])),
    sd(diff(t1[,5]))
  ),
  c(
    sd(diff(t2[,1])),
    sd(diff(t2[,2])),
    sd(diff(t2[,3])),
    sd(diff(t2[,4])),
    sd(diff(t2[,5]))
  ),
  c(
    sd(diff(t3[,1])),
    sd(diff(t3[,2])),
    sd(diff(t3[,3])),
    sd(diff(t3[,4])),
    sd(diff(t3[,5]))
  )
)
I want to write the script creating the vector only once, and repeat it n times (n=3 in this example) so that I end up with n vectors. In each iteration, I want to add 1 to a variable (in this case: 1 -> 2 -> 3, so the number next to 't'). t1, t2 and t3 are all separate data frames, and I can't figure out how to loop a script with changing data frame names.
1) How to make this happen?
2) I would also like to divide each sd value in a row by the row number. How would I do this?
3) I will be using 140 data frames in total. Is there a way to call all of these with a simple function, rather than making a list and adding each of the 140 data frames individually?
Use functions to get more readable code:
set.seed(123) # so you'll get the same number as this example
t1 <- t2 <- t3 <- data.frame(replicate(5,runif(10)))
# make a function for your sd of diff
sd.cols <- function(data) {
# loop over the df columns
sapply(data,function(x) sd(diff(x)))
}
# make a list of your data frames
dflist <- list(sdt1=t1,sdt2=t2,sdt3=t3)
# Loop over the list
result <- data.frame(lapply(dflist,sd.cols))
Which gives:
> result
sdt1 sdt2 sdt3
1 0.4887692 0.4887692 0.4887692
2 0.5140287 0.5140287 0.5140287
3 0.2137486 0.2137486 0.2137486
4 0.3856857 0.3856857 0.3856857
5 0.2548264 0.2548264 0.2548264
Assuming that you always want to use columns 1 to 5...
# some data
t3 <- t2 <- t1 <- as.data.frame(matrix(rnorm(100),10,10))
# script itself
lis=list(t1,t2,t3)
sapply(lis,function(x) sapply(x[,1:5],function(y) sd(diff(y))))
# [,1] [,2] [,3]
# V1 1.733599 1.733599 1.733599
# V2 1.577737 1.577737 1.577737
# V3 1.574130 1.574130 1.574130
# V4 1.158639 1.158639 1.158639
# V5 0.999489 0.999489 0.999489
The output is a matrix, so as.data.frame should fix that.
For completeness: As @Tensibai mentions, you can just use mget(ls(pattern="^t[0-9]+$")) to build the list, assuming that all your variables are named t followed by a number (mget already returns a list, so no extra list() is needed).
Edit: Thanks to @Tensibai for pointing out a missing step and improving the code, and for the mget step.
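Questions 2 and 3 can be sketched together as follows. The random t1 to t3 frames are toy data, and the numeric reordering of the names is my addition, since ls() sorts alphabetically and would put t10 before t2 once there are 140 frames.

```r
set.seed(1)
# Toy stand-ins for the real t1, t2, ... data frames
t1 <- as.data.frame(matrix(rnorm(50), 10, 5))
t2 <- as.data.frame(matrix(rnorm(50), 10, 5))
t3 <- as.data.frame(matrix(rnorm(50), 10, 5))

# Question 3: collect every object named t<number> without listing them by hand
nms <- ls(pattern = "^t[0-9]+$")
nms <- nms[order(as.integer(sub("^t", "", nms)))]  # t1, t2, ..., t10, ...
frames <- mget(nms, envir = environment())

# One column of sd(diff(.)) values per data frame, one row per column 1..5
sds <- sapply(frames, function(d) sapply(d[, 1:5], function(col) sd(diff(col))))

# Question 2: divide each value by its row number
scaled <- sds / row(sds)
```

row(sds) is a matrix of row indices with the same shape as sds, so the division is element-wise.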
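You can iterate through a list of the ts...
# put the data frames in a named list; c() would flatten them into columns
dats <- list(t1 = t1, t2 = t2, t3 = t3)
ans <- matrix(NA_real_, nrow = length(dats), ncol = 5,
              dimnames = list(names(dats), as.character(1:5)))
for (k in seq_along(dats)) {
  for (k2 in 1:5) {
    # sd of the differences, as in the question
    ans[k, k2] <- sd(diff(dats[[k]][, k2]))
  }
}
ans <- as.data.frame(ans)
attr(ans, "title") <- "standard deviation"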

R: Quickly Performing Operations on Subsets of a Data Frame, then Re-aggregating the Result Without an Inner Function

We have a very large data frame df that can be split by factors. On each subset of the data frame created by this split, we need to perform an operation to increase the number of rows of that subset until it's a certain length. Afterwards, we rbind the subsets to get a bigger version of df.
Is there a way of doing this quickly without using an inner function?
Let's say our subset operation (in a separate .R file) is:
foo <- function(df) { magic }
We've come up with a few ways of doing this:
1)
df <- split(df, factor)
df <- lapply(df, foo)
rbindlist(df)
2)
assign('list.df', list(), envir=.GlobalEnv)
assign('i', 1, envir=.GlobalEnv)
dplyr::group_by(df, factor)
dplyr::mutate(df, foo.list(df.col))
df <- rbindlist(list.df)
rm('list.df', envir=.GlobalEnv)
rm('i', envir=.GlobalEnv)
(In a separate file)
foo.list <- function(df.cols) {
magic;
list.df[[i]] <<- magic.df
i <<- i + 1
return(dummy)
}
The issue with the first approach is time. The lapply simply takes too long to really be desirable (on the order of an hour with our data set).
The issue with the second approach is the extremely undesirable side-effect of tampering with the user's global environment. It's significantly faster, but this is something we'd rather avoid if we can.
We've also tried passing in the list and count variables and then trying to substitute them with the variables in the parent environment (A sort of hack to get around R's lack of pass-by-reference).
We've looked at a number of possibly relevant SO questions (R applying a function to a subset of a data frame, Calculations on subsets of a data frame, R: Pass by reference, etc.), but none of them addressed our question well.
If you want to run code, here's something you can copy and paste:
x <- runif(n=10, min=0, max=3)
y <- sample(x=10, replace=FALSE)
factors <- runif(n=10, min=0, max=2)
factors <- floor(factors)
df <- data.frame(factors, x, y)
df now looks like this (length 10):
## We group by factor, then run foo on the groups.
foo <- function(df.subset) {
min <- min(df.subset$y)
max <- max(df.subset$y)
## We fill out df.subset to have everything between the min and
## max values of y. Then we assign the old values of df.subset
## to the corresponding spots.
df.fill <- data.frame(x=rep(0, max-min+1),
y=min:max,
factors=rep(df.subset$factors[1], max-min+1))
df.fill$x[which(df.subset$y %in%(min:max))] <- df.subset$x
df.fill
}
So I can take my sample code in the first approach to build a new df (length 18):
Using data.table this doesn't take long due to speedy functionality. If you can, rewrite your function to work with specific variables. The split-apply-combine processing may get a performance boost:
library(data.table)
system.time(
df2 <- setDT(df)[,foo(df), factors]
)
# user system elapsed
# 1.63 0.39 2.03
Another variation using data.table.. First get the min(y):max(y) part and then join+update:
require(data.table)
ans = setDT(df)[, .(x=0, y=min(y):max(y)), by=factors
][df, x := i.x, on=c("factors", "y")][]
ans
# factors x y
# 1: 0 1.25104362 1
# 2: 0 0.16729068 2
# 3: 0 0.00000000 3
# 4: 0 0.02533907 4
# 5: 0 0.00000000 5
# 6: 0 0.00000000 6
# 7: 0 1.80547980 7
# 8: 1 0.34043937 3
# 9: 1 0.00000000 4
# 10: 1 1.51742163 5
# 11: 1 0.15709287 6
# 12: 1 0.00000000 7
# 13: 1 1.26282241 8
# 14: 1 2.88292354 9
# 15: 1 1.78573288 10
Pierre and Roland have already provided nice solutions.
If you need scalability not only in run time but also in memory, you can spread the data across a number of remote R instances.
The most basic setup requires only Rserve/RSclient, so there are no non-CRAN dependencies.
Spread data across R instances
For easier reproducibility, the example below starts two R instances on a single localhost machine. For real scalability you need to start the Rserve nodes on remote machines.
# start R nodes
library(Rserve)
port = 6311:6312
invisible(sapply(port, function(port) Rserve(debug = FALSE, port = port, args = c("--no-save"))))
# populate data
set.seed(123)
x = runif(n=5e6,min=0, max=3)
y = sample(x=5e6,replace=FALSE)
factors = runif(n=5e6, min=0, max=2)
factors = floor(factors)
df = data.frame(factors, x, y)
# connect Rserve nodes
library(RSclient)
rscl = sapply(port, function(port) RS.connect(port = port))
# assign chunks to R nodes
sapply(seq_along(rscl), function(i) RS.assign(rscl[[i]], name = "x", value = df[df$factors == (i-1),]))
# assign magic function to R nodes
foo = function(df) df
sapply(rscl, RS.assign, name = "foo", value = foo)
All processing on the remote machines can be performed in parallel (using wait=FALSE and RS.collect), which further reduces computing time.
Using lapply + RS.eval
# sequentially
l = lapply(rscl, RS.eval, foo(x))
rbindlist(l)
# in parallel
invisible(sapply(rscl, RS.eval, foo(x), wait=FALSE))
l = lapply(rscl, RS.collect)
rbindlist(l)
Using big.data.table::rscl.*
The big.data.table package provides a few wrappers around the RSclient::RS.* functions that allow them to accept a list of connections to R nodes.
They don't use data.table in any way, so they can be applied to a data.frame, a vector, or any R type that can be chunked. The example below uses a basic data.frame.
library(big.data.table)
# sequentially
l = rscl.eval(rscl, foo(x), simplify=FALSE)
rbindlist(l)
# in parallel
invisible(rscl.eval(rscl, foo(x), wait=FALSE))
l = rscl.collect(rscl, simplify=FALSE)
rbindlist(l)
Using big.data.table
This example requires the data on the nodes to be stored as data.tables, but it offers a convenient API and many other features.
library(big.data.table)
rscl.require(rscl, "data.table")
rscl.eval(rscl, is.data.table(setDT(x))) # is.data.table to suppress collection of `setDT` results
bdt = big.data.table(rscl = rscl)
# in parallel by default
bdt[, foo(.SD), factors]
# considering we have data partitioned using `factors` field, the `by` is redundant in that case
bdt[, foo(.SD)]
# optionally use `[[` to access R nodes environment directly
bdt[[expr = foo(x)]]
Clean workspace
# disconnect
rscl.close(rscl)
# shutdown nodes started from R
l = lapply(setNames(nm = port), function(port) tryCatch(RSconnect(port = port), error = function(e) e, warning = function(w) w))
invisible(lapply(l, function(rsc) if(inherits(rsc, "sockconn")) RSshutdown(rsc)))
I don't think your function works as intended. It relies on y being ordered.
Try using a data.table join with grouping:
library(data.table)
setDT(df)
df2 <- df[, .SD[data.table(y=seq(.SD[, min(y)], .SD[, max(y)], by = 1)), .SD,
on = "y"], #data.table join
by = factors] #grouping
df2[is.na(x), x:= 0]
setkey(df2, factors, y, x)
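For comparison, the same "build the full y range per group, then join the original values back" idea can be sketched in base R. This is only meant to make the logic easy to check on small data; it makes no claim about speed on large data frames, and the four-row df is toy data.

```r
# Toy data: two groups with gaps in y
df <- data.frame(
  factors = c(0, 0, 1, 1),
  x       = c(0.5, 1.5, 2.0, 0.1),
  y       = c(1, 4, 7, 5)
)

# Full range of y for each group
rng <- do.call(rbind, lapply(split(df, df$factors), function(d) {
  data.frame(factors = d$factors[1], y = seq(min(d$y), max(d$y)))
}))

# Left join the original rows onto the full range, then fill the gaps with 0
filled <- merge(rng, df, by = c("factors", "y"), all.x = TRUE)
filled$x[is.na(filled$x)] <- 0
```

Because the join is on y rather than on position, it does not depend on the rows of df being ordered.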

R: Using a vector to feed dataframe names for sapply

I'm quite new to R, and I'm trying to use it to organize and extract info from some tables into different but similar tables. Instead of repeating the commands and changing the table names each time:
#DvE, DvS, and EvS are dataframes
Sum.DvE <- data.frame(DvE$genes, DvE$FDR, DvE$logFC)
names(Sum.DvE) <- c("gene","FDR","log2FC")
Sum.DvS <- data.frame(DvS$genes, DvS$FDR, DvS$logFC)
names(Sum.DvS) <- c("gene","FDR","log2FC")
Sum.EvS <- data.frame(EvS$genes, EvS$FDR, EvS$logFC)
names(Sum.EvS) <- c("gene","FDR","log2FC")
I thought it would be easier to create a vector of the table names, and feed it into a for loop:
Sum.Comp <- c("DvE","DvS","EvS")
for(i in 1:3){
Sum.Comp[i] <- data.frame(i$genes, i$FDR, i$logFC)
names(Sum.Comp[i]) <- c("gene","FDR","log2FC")
}
But I get
>Error in i$genes : $ operator is invalid for atomic vectors
which I kind of expected because I was just trying it out, but can someone tell me if what I want to do can be done some other way, or if you have some suggestions for me, that would be much appreciated!
Clarification: Basically I'm trying to ask if there's a way to feed a dataframe name into a for loop through a vector, because I think I get the error because R doesn't realize "i" in the for loop stands for a dataframe name. This is a more simplified example:
DF1 <- data.frame(A=1:5, B=1:5, C=1:5, D=1:5)
DF2 <- data.frame(A=10:15, B=10:15, C=10:15, D=10:15)
DF3 <- data.frame(A=20:25, B=20:25, C=20:25, D=20:25)
DFs <- c("DF1", "DF2", "DF3")
for (i in 1:3){
New.i <- data.frame(i$A, i$D)
}
And I'd like it to make 3 new dataframes called "New.DF1", "New.DF2", "New.DF3" with example outputs like:
New.DF1
A D
1 1
2 2
3 3
4 4
5 5
New.DF2
A D
10 10
11 11
12 12
13 13
14 14
15 15
Thank you!
Not entirely sure I understand your problem, but the code below may do what you're asking. I've created simple values for the input data frames for testing.
DvE <- data.frame(genes=1:2, FDR=2:3, logFC=3:4)
DvS <- data.frame(genes=4, FDR=5, logFC=6)
EvS <- data.frame(genes=7, FDR=8, logFC=9)
df_names <- c("DvE","DvS", "EvS")
sum_df <- function(x) data.frame(gene=x$genes, FDR=x$FDR, log2FC=x$logFC)
for(df in df_names) {
assign(paste("Sum.",df,sep=""), do.call("sum_df", list(as.name(df)) ) )
}
Instead of operating on the names of variables, it would be easier to store the data frames you want to process in a list and then process them with lapply:
to.process <- list(DvE, DvS, EvS)
processed <- lapply(to.process, function(x) {
data.frame(gene=x$genes, FDR=x$FDR, log2FC=x$logFC)
})
Now you can access the new data frames with processed[[1]], processed[[2]], and processed[[3]].
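If the Sum.* names from the question matter, a named list keeps them without creating separate variables. A minimal sketch with invented toy values for the three comparison tables:

```r
# Toy stand-ins for the three comparison tables
DvE <- data.frame(genes = c("g1", "g2"), FDR = c(0.01, 0.20), logFC = c(1.5, -0.3))
DvS <- data.frame(genes = "g3", FDR = 0.05, logFC = 2.1)
EvS <- data.frame(genes = "g4", FDR = 0.50, logFC = 0.2)

# Name the list elements so the results can be looked up by comparison
comparisons <- list(DvE = DvE, DvS = DvS, EvS = EvS)
summaries <- lapply(comparisons, function(x) {
  data.frame(gene = x$genes, FDR = x$FDR, log2FC = x$logFC)
})
names(summaries) <- paste0("Sum.", names(summaries))
```

The results are then available as summaries$Sum.DvE, summaries$Sum.DvS, and summaries$Sum.EvS.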
