Combine lapply, seq_along and ddply

I've been searching this forum and trying to apply what was said in previous answers to my own case, but something in my code is still missing.
I use lapply() with a function that runs ddply inside. This works nicely. However, I would like to identify each result by the name of the data frame it came from, rather than by [[1]], [[2]], ...
For this reason I have been trying to work in seq_along, but without success. Here is what I have:
I created a list to group 16 different data frames (with the same structure) in one object, called melt_noNA_noDC_regression:
melt_noNA_noDC_regression <-
list(I1U_melt_noNA_noDC_regression, I1L_melt_noNA_noDC_regression,
I1U_melt_noNA_noDC_regression, I1L_melt_noNA_noDC_regression,
CU_melt_noNA_noDC_regression, CL_melt_noNA_noDC_regression,
P3U_melt_noNA_noDC_regression, P3L_melt_noNA_noDC_regression,
P4U_melt_noNA_noDC_regression, P4L_melt_noNA_noDC_regression,
M1U_melt_noNA_noDC_regression, M1L_melt_noNA_noDC_regression,
M2U_melt_noNA_noDC_regression, M2L_melt_noNA_noDC_regression,
M3U_melt_noNA_noDC_regression, M3L_melt_noNA_noDC_regression)
Later, I run this lapply() line successfully.
lapply(melt_noNA_noDC_regression, function(x) ddply(x, .(Species), model_regression))
As I have 16 different data frames, I would like to identify them in the results of the lapply function. I have tried several combinations to include seq_along within the lapply code, as in this case:
lapply(melt_noNA_noDC_regression, function(x) {
ddply(x, .(Species), model_regression)
seq_along(x), function(i) paste(names(x)[[i]], x[[i]])
})
However, I keep getting errors, which is a bit frustrating. It is probably easy to solve, but I am stuck.
Any idea how to solve this?

Consider using eapply (lapply's lesser-known sibling) or mget to retrieve a named list of your data frames. Then run that list through lapply for the ddply call; the result is a list with the same names and the new corresponding values.
df_list <- eapply(.GlobalEnv, function(d) d)[c("I1U_melt_noNA_noDC_regression",
"I1L_melt_noNA_noDC_regression",
"I1U_melt_noNA_noDC_regression",
...)]
df_list <- mget(c("I1U_melt_noNA_noDC_regression",
"I1L_melt_noNA_noDC_regression",
"I1U_melt_noNA_noDC_regression",
...))
# GENERALIZED FOR ANY DF IN GLOBAL ENV
df_list <- Filter(is.data.frame, eapply(.GlobalEnv, function(d) d))
new_list <- lapply(df_list, function(x) ddply(x, .(Species), model_regression))
And because eapply (environment apply) is part of the apply family and iterates over every object in an environment, you can bypass lapply entirely. You must, however, account for non-data-frame objects and then subset to the data frames you want by name; hence the tryCatch and the [] indexing:
new_list2 <- eapply(.GlobalEnv, function(x)
tryCatch(ddply(x, .(Species), model_regression),
warning = function(w) return(NA),
error = function(e) return(NA)
)
)[c("I1U_melt_noNA_noDC_regression",
"I1L_melt_noNA_noDC_regression",
"I1U_melt_noNA_noDC_regression",
...)]
all.equal(new_list, new_list2)
# [1] TRUE
With all that said, ideally your data processing would build a named data frame list from the start, rather than creating 16 separate, similarly structured objects that flood your global environment. So consider adjusting the source of your regression objects and replace the following:
I1U_melt_noNA_noDC_regression <- ...
with this:
df_list <- list()
df_list[["I1U_melt_noNA_noDC_regression"]] <- ...

Related

List of lists with data frames [duplicate]

I know this topic has appeared on SO a few times, but the examples were often more complicated, and I would like an answer (or a set of possible solutions) to this simple situation. I am still wrapping my head around R and programming in general. Here I want to use lapply or a simple loop on data, which is a list of three lists of vectors.
data1 <- list(rnorm(100),rnorm(100),rnorm(100))
data2 <- list(rnorm(100),rnorm(100),rnorm(100))
data3 <- list(rnorm(100),rnorm(100),rnorm(100))
data <- list(data1,data2,data3)
Now, I want to obtain the list of means for each vector. The result would be a list of three elements (lists).
I only know how to obtain a list of outcomes for a single list of vectors, either with
for (i in 1:length(data1)){
means <- lapply(data1,mean)
}
or by:
lapply(data1,mean)
and I know how to get all the means using rapply:
rapply(data,mean)
The problem is that rapply does not maintain the list structure.
Help and possibly some tips/explanations would be much appreciated.
We can loop through the list of lists with a nested lapply/sapply
lapply(data, sapply, mean)
It is otherwise written as
lapply(data, function(x) sapply(x, mean))
Or if you need the output with the list structure, a nested lapply can be used
lapply(data, lapply, mean)
Or with rapply, we can use the how argument to control what kind of output we get.
rapply(data, mean, how='list')
If we are using a for loop, we may need to create an object to store the results.
res <- vector('list', length(data))
for(i in seq_along(data)){
  res[[i]] <- vector('list', length(data[[i]]))  # keep the nested list structure
  for(j in seq_along(data[[i]])){
    res[[i]][[j]] <- mean(data[[i]][[j]])
  }
}
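As a quick sanity check (a sketch using the objects above), the list-preserving approaches all produce the same nested structure:
out_nested <- lapply(data, lapply, mean)
out_rapply <- rapply(data, mean, how = 'list')
identical(out_nested, out_rapply)  # TRUE
identical(out_nested, res)         # TRUE, res being the result of the loop above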

R: transforming multiple sets to dataframes at once

I have 31 datasets corresponding to data about 31 teachers. I need to perform multiple transformations on all of them. One of these is converting each dataset into a plain data frame:
class(alexandre)
[1] "tbl_df" "tbl" "data.frame"
As I said, I have 31 similar datasets, and I need to transform all into dataframes. My code to do so has been
alexandre <- as.data.frame(alexandre)
adrian <- as.data.frame(adrian)
akemi <- as.data.frame(akemi)
arcanjo <- as.data.frame(arcanjo)
ana_barbara <- as.data.frame(ana_barbara)
brigida <- as.data.frame(brigida)
cleiton <- as.data.frame(cleiton)
daniela <- as.data.frame(daniela)
davi <- as.data.frame(davi)
eliezer <- as.data.frame(eliezer)
eduardo <- as.data.frame(eduardo)
eustaquio <- as.data.frame(eustaquio)
gilberto <- as.data.frame(gilberto)
gilmar <- as.data.frame(gilmar)
jorge <- as.data.frame(jorge)
juarez <- as.data.frame(juarez)
junior <- as.data.frame(junior)
... and so on (31 lines of this). Obviously all these lines of code take up too much space, and there must be a faster (and more elegant) way to accomplish this. In fact, I tried this:
teachers <- c(alexandre, akemi, adrian, brigida, davi, ...)
cnames <- function(x){
colnames(x) <- c(1:18)
}
mapply(cnames, teachers)
Then I would do all the work with a few lines of code. This method (form a vector containing all datasets, then use mapply on the vector) would make my work much easier because, as I said, I have to perform multiple transformations on all these datasets.
This code does not work, however. I get the following error:
Error in `colnames<-`(`*tmp*`, value = c(1:18)) :
attempt to set 'colnames' on an object with less than two dimensions
This error message is very unenlightening, I find. I have no idea what to do to make the code work, which is obviously why I'm here. Any other methods to accomplish what I'm trying to do are welcome. Thanks.
As commented and often discussed in the R tag of SO, simply use a list to maintain all your individual, similarly structured data frames. Doing so gives you the following benefits:
Easily run operations consistently across all items using loops or apply family calls without separate naming assignments.
Keeps your environment and workspace organized: you maintain one object, easily referenced by number or name, instead of 31 objects flooding your global environment.
Facilitates data frame migrations and handling with rbind, cbind, split, by, or other operations.
To create a list of all current data frames in the global environment, use eapply or mget, filtering for data frame objects. Either returns a named list of data frames.
teachers_df_list <- Filter(is.data.frame, eapply(.GlobalEnv, identity))
teachers_df_list <- Filter(is.data.frame, mget(x=ls()))
Alternatively, build the list when you originally read your data frames from their file sources, e.g. with list.files:
teachers_df_list <- lapply(list.files(...), function(f) read.csv(f, ...))
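For the file-based route, a sketch (the folder path and file layout are assumptions) that also names the list elements so the $-access below works:
# Hypothetical sketch: read every CSV in a folder into a named list, using
# the file name (minus extension) as the element name
paths <- list.files("data/teachers", pattern = "\\.csv$", full.names = TRUE)
teachers_df_list <- setNames(lapply(paths, read.csv),
                             tools::file_path_sans_ext(basename(paths)))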
You lose no data frame functionality by storing them inside a list.
head(teachers_df_list$alexandre)
tail(teachers_df_list$adrian)
summary(teachers_df_list$akemi)
...
Then run your needed operations with lapply, e.g. renaming columns with setNames, aggregating with aggregate, or modeling with lm:
new_teachers_df_list <- lapply(teachers_df_list,
                               function(df) setNames(df, paste0("col", 1:18)))
new_teachers_agg_list <- lapply(new_teachers_df_list,
                                function(df) aggregate(col1 ~ col2, df, sum))
new_teachers_model_list <- lapply(new_teachers_df_list,
                                  function(df) summary(lm(col1 ~ col2, df)))
Even compile all data frames into one master version using do.call + rbind:
# ADD A TEACHER INDICATOR COLUMN
new_teachers_df_list <- Map(function(df, n) transform(df, teacher=n),
new_teachers_df_list, names(new_teachers_df_list))
# BUILD SINGLE DF
teachers_df <- do.call(rbind, new_teachers_df_list)
Even split the master version back into individual groupings if needed later on:
# SPLIT BACK TO LIST OF DFs
teachers_df_list <- split(teachers_df, teachers_df$teacher)
Maybe you could use a list to store all your data frames. It seems to work, but you then need a way to get the data frames back out of the list afterwards.
df_1 <- data.frame(c(0, 1, 0), c(3, 4, 5))
df_2 <- data.frame(c(0, 1, 0), c(3, 4, 5))
l <- list(df_1, df_2)
lapply(l, function(x){
  colnames(x) <- 1:2
  return(x)
})
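If the individual objects are needed back in the workspace afterwards, one possible follow-up (a sketch that overwrites the originals) is to name the list and use list2env:
names(l) <- c("df_1", "df_2")
l <- lapply(l, function(x) { colnames(x) <- 1:2; x })
list2env(l, envir = .GlobalEnv)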

subset inside a function by the variables specified in ddply

Often, inside a function that I apply with ddply to one data.frame, I need to subset a second data.frame by the same variables I am splitting on. To do that I currently write the variables out again explicitly inside the function, and I wonder whether there is a more elegant way. Below is a trivial example showing my current approach.
d1 <- expand.grid(x=c('a','b'), y=c('c','d'), z=1:3)
d2 <- expand.grid(x=c('a','b'), y=c('c','d'), z=4:6)
results <- ddply(d1, .(x,y), function(d) {
  d2Sub <- subset(d2, x==unique(d$x) & y==unique(d$y))
  out <- d$z + d2Sub$z
  data.frame(out)
})
The plyr package offers functions to make the whole split/apply/combine construct easy. To my knowledge, however, you can only split one thing: a list, a data.frame, an array.
In your case, what you are trying to do is split two objects, then mapply (or Map), then recombine. Since plyr does not have a ready solution for this more complicated construct, you could do it in base R. That's how I assume people were doing things before plyr came out:
# split
d1.split <- split(d1, list(d1$x, d1$y))
d2.split <- split(d2, list(d2$x, d2$y))
# apply
res.split <- Map(function(df1, df2) data.frame(x = df1$x, y = df1$y,
out = df1$z + df2$z),
d1.split, d2.split, USE.NAMES = FALSE)
# combine
res <- do.call(rbind, res.split)
Up to you to decide if it is more elegant or not than your current approach. The assignments I made were to help comprehension, but you can write the whole thing as a single res <- do.call(rbind, Map(FUN, split(d1, ...), split(d2, ...), ...)) statement if you prefer, as sketched below.
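For reference, the single-statement form (same logic as the three steps above, just inlined):
res <- do.call(rbind, Map(
  function(df1, df2) data.frame(x = df1$x, y = df1$y, out = df1$z + df2$z),
  split(d1, list(d1$x, d1$y)),
  split(d2, list(d2$x, d2$y)),
  USE.NAMES = FALSE
))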

Apply an already defined function to all dataframes at once [duplicate]

I have already defined a function (which works fine). Nevertheless, I have 20 data frames in the workspace (dat1 to dat20) to which I want to apply that same function.
So far it looks like this:
dat1 <- func(dat=dat1)
dat2 <- func(dat=dat2)
dat3 <- func(dat=dat3)
dat4 <- func(dat=dat4)
...
dat20 <- func(dat=dat20)
However, is there a more elegant way to do this with a shorter command, i.e. to apply the function to all data frames at once?
I tried this, but it didn't work:
mylist <- paste0("dat", 1:20, sep="")
lapply(mylist, func)
Try something like:
lapply(mget(ls(pattern="dat")),func)
Some details: The pattern argument in ls will limit which object names it lists (e.g., I assume you have other objects including your function in the global environment). mget retrieves those objects from the environment and turns them into a list, which you can then lapply your function over.
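If you also want the results written back over the original dat1 to dat20 objects rather than kept in a list, one possible follow-up (the tightened regex is an assumption, to match only those objects) is list2env:
list2env(lapply(mget(ls(pattern = "^dat[0-9]+$")), func), envir = .GlobalEnv)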
If you have the name of a variable, you can use get() to retrieve the value from the workspace. The corresponding assignment function is called assign():
mylist <- paste0("dat", 1:20)
lapply(mylist, function(name) assign(name, func(dat=get(name)), envir = .GlobalEnv))
The desired behavior can also be obtained using eval instead of lapply.
Assume mylist holds the names of the data frames you want to apply func to. mylist might be generated using
mylist <- ls(pattern="dat")
Then you can use the following code to do exactly what you want:
cCmd <- paste(mylist, "<- func(", mylist, ")", sep="")
eCmd <- parse(text=cCmd)
eval(eCmd)

Correct implementation of lapply

As far as I understand it, when using R it can be more elegant to use functions such as lapply rather than the for loops that are used more often than not in other languages. However, I cannot get my head around the syntax and am making foolish errors when trying to implement simple tasks with it. For example:
I have a series of data frames loaded from csv files using a for loop. The following dummy data frames adequately describe the data:
x <- c(0,10,11,12,13)
y <- c(1,NA,NA,NA,NA)
z <- c(2,20,21,22,23)
a <- c(0,6,5,4,3)
b <- c(1,7,8,9,10)
c <- c(2,NA,NA,NA,NA)
df1 <- data.frame(x,y,z)
df2 <- data.frame(a,b,c)
I first generate a list of data frame names (data_names; I do this when loading the csv files) and then simply want to sum the columns. My attempt of course does not work:
lapply(data_names, function(df) {
counts <- colSums(!is.na(data_names))
})
I could of course use lists (and I realise in the long run this may be better), however from a pedagogical point of view I would like to understand lapply better.
Many thanks for any pointers
It's really just your use of is.na, and the fact that you don't need the assignment operator <- inside the function. lapply returns a list which is the result of applying FUN to each element of the input list. You assign the output of lapply to a variable, e.g. res <- lapply( .... , FUN ).
I'm also not too sure how you made the list initially, but the below should suffice. You also don't need an anonymous function in this case: you can use the named function colSums and provide the na.rm = TRUE argument to take care of pesky NAs in your data:
lapply( list( df1, df2 ) , colSums , na.rm = TRUE )
[[1]]
x y z
46 1 88
[[2]]
a b c
18 35 2
So you can read this as:
For each df in the list:
apply colSums with the argument na.rm = TRUE
The result is a list, each element of which is the result of applying colSums to each df in the list.
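If, as in the question, you only have a character vector of data frame names (data_names), a sketch using mget to fetch the corresponding objects first (this assumes the names refer to data frames in the global environment):
data_names <- c("df1", "df2")
res <- lapply(mget(data_names, envir = .GlobalEnv), colSums, na.rm = TRUE)
res$df1
#  x  y  z
# 46  1 88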
