Often I need to subset a data.frame inside a function by the variables that I am subsetting another data.frame to which I apply ddply. To do that I explicitly write again the variables inside the function and I wonder whether there is a more elegant way to do that. Below I include a trivial example just to show which is my current approach to do this.
d1<-expand.grid(x=c('a','b'),y=c('c','d'),z=1:3)
d2<-expand.grid(x=c('a','b'),y=c('c','d'),z=4:6)
results<-ddply(d1,.(x,y),function(d) {
d2Sub<-subset(d2,x==unique(d$x) & y==unique(d$y))
out<-d$z+d2Sub$z
data.frame(out)
})
The plyr package offers functions to make the whole split/apply/combine construct easy. To my knowledge, however, you can only split one thing: a list, a data.frame, an array.
In your case, what you are trying to do is split two objects, then mapply (or Map), then recombine. Since plyr does not have a ready solution for this more complicated construct, you could do it in base R. That's how I assume people were doing things before plyr came out:
# split
d1.split <- split(d1, list(d1$x, d1$y))
d2.split <- split(d2, list(d2$x, d2$y))
# apply
res.split <- Map(function(df1, df2) data.frame(x = df1$x, y = df1$y,
out = df1$z + df2$z),
d1.split, d2.split, USE.NAMES = FALSE)
# combine
res <- do.call(rbind, res.split)
Up to you to decide if it is more elegant or not than you current approach. The assignments I made were to help comprehension, but you can write the whole thing as a single res <- do.call(rbind, Map(FUN, split(d1, ...), split(d2, ...), ...)) statement if you prefer.
Related
I have a problem with cleaning up my code. I understand I could type this all out but we don't want that obviously.
I have only dataframes in my global environment. They are all "data.frame".
I want to check the dimensions of all of them and put that in a tibble. I managed that somehow. I also would like to change their colnames() tolower() which works easy if I just type the name of the data.frame, but there's more than 2 and I want it done automatically. Then I also want to mutate all data.frames in the same way.
Small example of my code:
library(tidyverse)
x <- data.frame(letters[1:2]) #To create the data
y <- data.frame(letters[3:4])
dfs <- as.list(ls()) #I take whatever is in my environment
I managed below to get a tibble of the dimensions:
z <- as_tibble(lapply(seq_along(dfs),
function(j) dim(get(dfs[[j]]))), .name_repair = "unique")
colnames(z) <- dfs
Now for the colnames of all the data.frames stored in my list I basically want to perform this code:
colnames(dfs[[1]]) <- tolower(colnames(dfs[[1]])
but that returns NULL as I found out earlier. So I used get() in there to make it work for the dimensions. But if I use get() to assign colnames it says it can't find function "get<-".
Since all colnames for all dataframes are the same (just different nrows()) I could save the lowercase colnames as value and use that, but that doesn't take away that it cant find the get<- function.
names <- tolower(colnames(x))
sapply(seq_along(dfs),
function(j) colnames(get(dfs[[j]])) <- names)
*Error in colnames(get(dfs[[j]])) <- names :
could not find function "get<-"*
as for the mutating part I tried a for loop:
for(i in seq_along(dfs)){
get(dfs[[i]]) <- get(dfs[[i]]) %>% mutate(cd = ab)
}
But it's the same issue.
Could anyone help clearing this problem for me? (and if a cleaner code for the dimensions is available that would be highly appreciated)
I am just trying to up my coding skills. I would have been long done if I just typed it all out but that defeats the purpose.
Thanks!
-JK
Using base R
lapply(dfs, function(x) transform(setNames(x, tolower(names(x))), X = c('a', 'b')))
I have 31 datasets corresponding to data about 31 teachers. I need to perform multiple transformations on all these datasets. One of them is transforming all of them into dataframes
class(alexandre)
[1] "tbl_df" "tbl" "data.frame"
As I said, I have 31 similar datasets, and I need to transform all into dataframes. My code to do so has been
alexandre <- as.data.frame(alexandre)
adrian <- as.data.frame(adrian)
akemi <- as.data.frame(akemi)
arcanjo <- as.data.frame(arcanjo)
ana_barbara <- as.data.frame(ana_barbara)
brigida <- as.data.frame(brigida)
cleiton <- as.data.frame(cleiton)
daniela <- as.data.frame(daniela)
davi <- as.data.frame(davi)
eliezer <- as.data.frame(eliezer)
eduardo <- as.data.frame(eduardo)
eustaquio <- as.data.frame(eustaquio)
gilberto <- as.data.frame(gilberto)
gilmar <- as.data.frame(gilmar)
jorge <- as.data.frame(jorge)
juarez <- as.data.frame(juarez)
junior <- as.data.frame(junior)
... and add some rows to this code (31 lines of this). Obviously all these lines of code take too much space and there must be a faster(and more elegant) way to accomplish this. In fact, I tried this
teachers <- c(alexandre, akemi, adrian, brigida, davi, ...)
cnames <- function(x){
colnames(x) <- c(1:18)
}
mapply(cnames, teachers)
Then I would do all the work with a few lines of code. And this method (form a vector containing all datasets, then use mapply on the vector) would make my work much easier because, as I said, I have to perform multiple transformation on all these datasets.
This code does not work, however. I get the following error:
Error in `colnames<-`(`*tmp*`, value = c(1:18)) :
attempt to set 'colnames' on an object with less than two dimensions
This error message is very unenlightening, I find. I have no idea what to do to to make the code work, which is obviously why I'm here. Any other methods to accomplish what I'm trying to do are welcome. Thanks.
As commented and often discussed in the R tag of SO, simply use a list to maintain all your individual, similarly structured data frames. Doing so allows you the following benefits:
Easily run operations consistently across all items using loops or apply family calls without separate naming assignments.
Organizes your environment and workspace with maintenance of one object with easy reference by number or name instead of 31 objects flooding your global environment.
Facilitates data frame migrations and handling with rbind, cbind, split, by, or other operations.
To create a list of all current data frames in global environment use eapply or mget filtering on data frame objects. Each returns a named list of data frames.
teachers_df_list <- Filter(is.data.frame, eapply(.GlobalEnv, identity))
teachers_df_list <- Filter(is.data.frame, mget(x=ls()))
Alternatively, source your data frames originally from file sources using list objects such as list.files:
teachers_df_list <- lapply(list.files(...), function(f) read.csv(f, ...))
You lose no functionality of data frame if stored inside a list.
head(teachers_df_list$alexandre)
tail(teachers_df_list$adrian)
summary(teachers_df_list$akemi)
...
Then run your needed operations with lapply like renaming columns with right-hand side function, setNames. Run other needed operations: aggregate or lm.
new_teachers_df_list <- lapply(teachers_df_list,
function(df) setNames(df, paste0("col_", c(1:18)))
new_teachers_agg_list <- lapply(teachers_df_list,
function(df) aggregate(col1 ~ col2, df, sum))
new_teachers_model_list <- lapply(teachers_df_list,
function(df) summary(lm(col1 ~ col2, df)))
Even compile all data frames into one master version using do.call + rbind:
# ADD A TEACHER INDICATOR COLUMN
new_teachers_df_list <- Map(function(df, n) transform(df, teacher=n),
new_teachers_df_list, names(new_teachers_df_list))
# BUILD SINGLE DF
teachers_df <- do.call(rbind, new_teachers_df_list)
Even split master version back into individual groupings if needed later on:
# SPLIT BACK TO LIST OF DFs
teachers_df_list <- split(teachers_df, teachers_df$teacher)
Maybe you could use a list to stock all your data.frame. It seems to work, but you need to find a way to extract all data.frame in the list after that.
df_1 <- data.frame(c(0, 1, 0), c(3, 4, 5))
df_2 <- data.frame(c(0, 1, 0), c(3, 4, 5))
l <- list(df_1, df_2)
lapply(l, function(x){
colnames(x) <- 1:2
return(x)
})
I've been searching around this forum and trying to implement in my case what was said in previous answers from those questions. However, something in my code is missing.
I use lapply() with a function inside that runs ddply. This works nice. However, I would like to identify every result from a single data frame by reading the name of the data frame, and not [[1]], [[2]]...
For this reason, I am trying to implement the seq_along argument, but unsuccessfully. Let's see what I have:
I created a list to group 16 different data frames (with the same structure) in one object, called melt_noNA_noDC_regression:
melt_noNA_noDC_regression <-
list(I1U_melt_noNA_noDC_regression, I1L_melt_noNA_noDC_regression,
I1U_melt_noNA_noDC_regression, I1L_melt_noNA_noDC_regression,
CU_melt_noNA_noDC_regression, CL_melt_noNA_noDC_regression,
P3U_melt_noNA_noDC_regression, P3L_melt_noNA_noDC_regression,
P4U_melt_noNA_noDC_regression, P4L_melt_noNA_noDC_regression,
M1U_melt_noNA_noDC_regression, M1L_melt_noNA_noDC_regression,
M2U_melt_noNA_noDC_regression, M2L_melt_noNA_noDC_regression,
M3U_melt_noNA_noDC_regression, M3L_melt_noNA_noDC_regression)
Later, I run this lapply() line successfully.
lapply(melt_noNA_noDC_regression, function(x) ddply(x, .(Species), model_regression))
As I have 16 different data frames, I would like to identify them in the results of the lapply function. I have tried several combinations to include seq_along within the lapply code, as in this case:
lapply(melt_noNA_noDC_regression, function(x) {
ddply(x, .(Species), model_regression)
seq_along(x), function(i) paste(names(x)[[i]], x[[i]])
})
However, I've been getting errors constantly, and it is a bit frustrating. It is maybe very easy to solve, but I am block.
Any idea to solve this?
Consider using eapply (lapply's lesser known sibling) or mget to retrieve a named list of your dataframes. Then run them through lapply for the ddply call to return the same named dataframe list with new corresponding values.
df_list <- eapply(.GlobalEnv, function(d) d)[c("I1U_melt_noNA_noDC_regression",
"I1L_melt_noNA_noDC_regression",
"I1U_melt_noNA_noDC_regression",
...)]
df_list <- mget(c("I1U_melt_noNA_noDC_regression",
"I1L_melt_noNA_noDC_regression",
"I1U_melt_noNA_noDC_regression",
...))
# GENERALIZED FOR ANY DF IN GLOBAL ENV
df_list <- Filter(function(i) class(i)=="data.frame", eapply(.GlobalEnv, function(d) d))
new_list <- lapply(df_list, function(x) ddply(x, .(Species), model_regression))
And because eapply (being environment apply) is part of the apply family and can iterate through objects, you can bypass lapply. But you must account for non-dataframes and then filter out by df names. Hence, tryCatch is used and [] indexing:
new_list2 <- eapply(.GlobalEnv, function(x)
tryCatch(ddply(x, .(Species), model_regression),
warning = function(w) return(NA),
error = function(e) return(NA)
)
)[c("I1U_melt_noNA_noDC_regression",
"I1L_melt_noNA_noDC_regression",
"I1U_melt_noNA_noDC_regression",
...)]
all.equal(new_list, new_list2)
# [1] TRUE
With all that said, ideally in your data processing you would originally use a named dataframe list and not create separate, similar structured 16 objects flooding your global environment. Therefore, consider adjusting the source of your regression objects, so replace the following:
I1U_melt_noNA_noDC_regression <- ...
with this:
df_list = list()
df_list["I1U_melt_noNA_noDC_regression"] <- ...
In so far as I understand it, when using r it can be more elegant to use functions such as lapply rather than for loops (that are used more often than not in other object oriented languages). However I cannot get my head around the syntax and am making foolish errors when trying to implement simple tasks with the command. For example:
I have a series of dataframes loaded from csv files using a for loop.The following dummy dataframes adequately describe the data:
x <- c(0,10,11,12,13)
y <- c(1,NA,NA,NA,NA)
z <- c(2,20,21,22,23)
a <- c(0,6,5,4,3)
b <- c(1,7,8,9,10)
c <- c(2,NA,NA,NA,NA)
df1 <- data.frame(x,y,z)
df2 <- data.frame(a,b,c)
I first generate a list of dataframe names (data_names- I do this when loading the csv files) and then simply want to sum the columns. My attempt of course does not work:
lapply(data_names, function(df) {
counts <- colSums(!is.na(data_names))
})
I could of course use lists (and I realise in the long run this maybe better) however from a pedagogical point of view I would like to understand lapply better.
Many thanks for any pointers
It's really just your use of is.na and the fact you don't need to use the asignment operator <- inside the function. lapply returns a list which is the result of applying FUN to each element of the input list. You assign the output of lapply to a variable, e.g. res <- lapply( .... , FUN ).
I'm also not too sure how you made the list initially, but the below should suffice. You also don't need an anonymous function in this case, you can use the named colSums and also provide the na.rm = TRUE argument to take care of persky NAs in your data:
lapply( list( df1, df2 ) , colSums , na.rm = TRUE )
[[1]]
x y z
46 1 88
[[2]]
a b c
18 35 2
So you can read this as:
For each df in the list:
apply colSums with the argument na.rm = TRUE
The result is a list, each element of which is the result of applying colSums to each df in the list.
merge is a very nice function: It merges matrices and data.frames, and returns a data.frame.
Having rather big character matrices,
is there another good way to merge -
without data.frame conversion?
Comment 1:
A small function to merge a named vector with a matrix or data.frame. Elements of the vector can link to multiple entries in the matrix:
expand <- function(v,m,by.m,v.name='v',...) {
df <- do.call(rbind,lapply(names(v),function(x) {
pos <- which(m[,by.m] %in% v[x])
cbind(x,m[pos,],...)
}))
colnames(df)[1] <- v.name
df
}
Example:
v <- rep(letters,each=3)[seq_along(letters)]
names(v) <- letters
m <- data.frame(a=unique(v),b=seq_along(unique(v)),stringsAsFactors=F)
expand(v,m,'a')
You can use a combination of match and cbind to do the equivalent of merge without conversion to data frame, a simple example:
st1 <- state.x77[ sample(1:50), ]
st2 <- as.matrix( USArrests )[ sample(1:50), ]
tmp1 <- match(rownames(st1), rownames(st2) )
st3 <- cbind( st1, st2[tmp1,] )
head(st3)
Keeping track of which columns you want, and merging whith many to 1 relationships or missing rows in one group require a bit more thought but are still possible.
No, not without either (a) overwriting the merge function or (b) creating a new merge.matrix() S3 function (this would be the right approach to the problem).
You can see in the merge help:
Value
A data frame.
Also, the merge.default function:
> merge.default
function (x, y, ...)
merge(as.data.frame(x), as.data.frame(y), ...)
There is now a merge.Matrix function in the Matrix.utils package. This works on combinations of matrices as well as capital M Matrices, data.frames, etc.
The match solution is nice, but as someone pointed out does not work on m:n relationships. It also does not implement the other features of merge, including all.x, all.y, etc.