I need to use ddply to apply multiple functions to multiple columns of my data frame. When I use the column name (RV in the example below), my split variables (Group and Round below) work: I get a mean value for each combination of Round and Group.
I need to do this on 20 columns, so I was thinking of creating a for loop and passing column indexes.
When I use the column index (for example df[[1]], which is "RV" in my data frame), Group and Round are ignored and the grand mean is returned for all combinations of Round and Group.
I also tried to pass the column name (in new.df3), but Round and Group are ignored again.
df <- data.frame("RV" = 1:5, "Group" = c("a","b","b","b","a"), "Round" = c("2","1","1","2","1"))
# this works and a separate mean for each combination of "Group" and "Round" is calculated
new.df <- ddply(df, c("Group", "Round"), summarise,
mean= mean(RV))
# this does not work and the grand mean is returned for all combinations of "Group" and "Round"
new.df2 <- ddply(df, c("Group", "Round"), summarise,
mean= mean(df[[1]]))
# this does not work and the grand mean is returned for all combinations of "Group" and "Round"
new.df3 <- ddply(df, c("Group", "Round"), summarise,
mean= mean(df[,colnames(df[1])]))
I tried "lapply" and the same issue exists. Any suggestion why this happens and how I can fix it?
As great a package as plyr is, you would do well here to update to its newest iteration, dplyr. There, the code would be:
v <- vars(RV) # add all your variables here
new.df <- df %>%
group_by(Group, Round) %>%
summarize_at(v, funs(mean))
So using this method, you plug all your variables into v, and you'll get a mean for each of them, for every combination of Group and Round. The pipe operator (%>%) looks weird when you first see it, but it helps streamline your code. It takes the output of the previous function and sets it as the first argument of the next function. It makes it easy to see that we're taking df, grouping by Group and Round, then summarizing.
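For comparison, here is the same call written without the pipe (just a rewrite of the code above, not a different method):
new.df <- summarize_at(group_by(df, Group, Round), v, funs(mean))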
If you really want to stick with plyr, we can get a solution there too:
new.df <- ddply(df, c("Group", "Round"), summarise,
RV_mean = mean(RV),
var2_mean = mean(var2) # add more variables just like this
)
We can also work from your list approaches:
new.df2 <- ddply(df, .(Group, Round), function(data_subset) { # note alternative way to reference Group and Round
as.data.frame(llply(data_subset[,c("RV"), drop = FALSE], mean)) # add your variables here
})
Note that within ddply, I always refer to the subset of the data frame inside my function calls; I never refer to df. df always refers to the original data frame, not the subset you are trying to work with.
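To connect this back to looping over column indexes, here is a hedged sketch (assuming your 20 columns can be identified by a vector of positions; cols is just a stand-in with column 1): the means are computed from the subset, never from df.
cols <- 1 # replace with the indexes of your 20 columns
new.df4 <- ddply(df, c("Group", "Round"), function(data_subset) {
as.data.frame(lapply(data_subset[, cols, drop = FALSE], mean))
})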
As a follow-up to this question, I'm using dplyr's group_split() to make data frames / tibbles based on the levels of a column. Continuing off that question, I want to split on two columns instead of one. When I try to split and name the resulting datasets, the wrong names get attributed to some of them.
Here's a simple example:
library(dplyr)
#Sample dataset to intuitively illustrate issue
example <- tibble(number = c(1:6),
even_or_odd = c("odd", "even", "odd", "even", "odd", "even"),
prime_or_not = c("prime", "prime", "prime", "not", "prime", "not")) %>%
mutate(type = paste0(even_or_odd, "_", prime_or_not)) %>%
mutate(type_factor = factor(type, levels = unique(type)))
#Does group split to make 3 datasets
the_test <- example %>%
group_split(even_or_odd, prime_or_not) %>%
setNames(unique(example$type_factor))
#The data sets with some being correct but others not
even_prime <- the_test["even_prime"]$even_prime #works!
even_not <- the_test["even_not"]$even_not #wrong label :`-(
odd_prime <- the_test["odd_prime"]$odd_prime #wrong label :`-(
odd_not <- the_test["odd_not"]$odd_not #works--correctly throws an error!
My question: how do I ensure that my group names will be attributed to the right dataset and avoid the issues here with even_not and odd_prime being mixed up?
In my actual dataset, I have 50+ combinations, so typing them all out manually is not an option. In addition, my actual dataset will have some combinations that don't consistently exist (like the odd-not-prime combination here), so relying on the index isn't an option either.
Instead of splitting by the two columns, use the factor column that was created, which ensures that the split follows the order of the levels of type_factor. In addition, using unique on type_factor can cause issues if the order of the values in 'type_factor' differs, i.e. unique returns values in order of their first occurrence. levels is better here. In fact, it may be more appropriate to use droplevels as well, in case of unused levels (see the sketch after the code below).
the_test <- example %>%
group_split(type_factor) %>%
setNames(levels(example$type_factor))
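As a sketch of the droplevels point: dropping unused levels first keeps the names aligned with the groups that group_split() actually returns (it has no effect in this example, where every level of type_factor is used).
the_test <- example %>%
mutate(type_factor = droplevels(type_factor)) %>%
group_split(type_factor) %>%
setNames(levels(droplevels(example$type_factor)))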
group_split returns an unnamed list. If we want to avoid the pain of naming incorrectly, use split from base R, which does return a named list. The result can then come back in any order, as long as the key/value pairs are correct:
# 1 - return in a different order based on alphabetic order
split(example, example[c("even_or_odd", "prime_or_not")], drop = TRUE)
# 2 - return order based on the levels of the factor column
split(example, example$type_factor)
# 3 - With dplyr pipe
example %>%
split(.$type_factor)
# 4 - or using magrittr exposition operator
library(magrittr)
example %$%
split(x = ., f = type_factor)
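A quick usage sketch of the named result (reusing the example data from above): elements can be pulled safely by name regardless of their position in the list.
groups <- split(example, example$type_factor)
groups[["even_prime"]] # the even & prime rows, pulled by name
groups[["odd_not"]]    # NULL, since that combination does not exist in the data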
Oh, of course the moment I post it, I realize that an easy solution existed:
Just change group_split() to use the new variable and it works!
library(dplyr)
#Does group split to make 3 datasets
the_test <- example %>%
group_split(type_factor) %>%
setNames(unique(example$type_factor))
#The data sets, all of which now get the right names
even_prime <- the_test["even_prime"]$even_prime #works!
even_not <- the_test["even_not"]$even_not #works now!
odd_prime <- the_test["odd_prime"]$odd_prime #works now!
odd_not <- the_test["odd_not"]$odd_not #works--correctly throws an error!
I have got data with observations in rows. There is an outcome variable y (dbl) as well as multiple factors, herein called f_1 and f_2. The latter denote conditions of an experiment. The data situation is mirrored by the following minimal example:
set.seed(123)
y = rnorm(10)
f_1 = factor(rep(c("A", "B"), 5))
f_2 = factor(rep(c("C", "D"), each = 5))
dat <- data.frame(y, f_1, f_2)
I would like to compute mean values of y for groups defined by f_1 and f_2. Importantly, I do not want a mean value for each combination of f_1 and f_2, but mean values based on f_1 on the one hand and mean values based on f_2 on the other hand. These should be saved as factors in dat, where each observation has a mean_f_1 (mean value when data is grouped according to f_1) and mean_f_2 (mean value when data is grouped according to f_2). The labels of the new factors mean_f_1 and mean_f_2 should correspond to the values/labels of f_1 and f_2. The labels have a meaning. Thus, a mean calculated for group "A" (from f_1) should keep the label "A" (in mean_f_1). The number of condition variables f_... in the original data is higher than 2, so I would like not to repeat code for each factor (see I).
I have come up with two approaches. The first (I; group_by approach) gives the desired result but repeats code for each factor.
I) group_by approach
library(dplyr)
dat %>%
group_by(f_1) %>%
mutate(mean_f_1 = factor(mean(y), label = unique(f_1))) %>%
group_by(f_2) %>%
mutate(mean_f_2 = factor(mean(y), label = unique(f_2)))
In other words, repeating the 'group_by - mutate' statements for each factor seems avoidable. I did not manage to use across() here.
The other approach (II; ave approach) avoids code repetition but won't assign factor labels. Assigning factor labels using unique() messed up the order of labels in the original data.
II) ave approach
dat %>% mutate(across(starts_with("f"),
~ ave(y, .x, FUN = mean),
.names = "mean_{.col}"))
Do you have an idea how to ...
... improve (I) to work on multiple factors?
... improve (II) to include factor labels?
... solve the problem differently?
A dplyr solution is preferred.
To avoid repeating code for each factor, I suggest iterating over factors. Something like:
library(dplyr)
factors = c("f_1", "f_2")
for(ff in factors){
new_col = paste0("mean_",ff)
dat <- dat %>%
group_by(!!sym(ff)) %>%
mutate(!!sym(new_col) := factor(mean(y), label = unique(!!sym(ff))))
}
This produces identical output to your group_by approach. To scale up to more columns, add them to the factors vector and the code will iterate over them.
The !!sym(.) is used to turn a character string into a column name. There are several other ways to do this; see the programming-with-dplyr vignette for other options. The unusual assignment operator := has the same behaviour as =, except that it allows unquoting (!!) on the left-hand side.
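As a minimal illustration of both pieces, with a purely hypothetical new column name:
library(dplyr)
col <- "f_1"                        # a column name stored as a string
new_col <- paste0("copy_of_", col)  # hypothetical name for the new column
dat %>% mutate(!!sym(new_col) := !!sym(col)) # := allows !!sym() on the left-hand side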
Let's say I have a dataframe with 10+1 columns and 10 rows, and every value has the same units except for one column (the "grouping" column A).
I'm trying to accomplish the following: given a grouping of the data frame based on the last column, how do I compute the standard deviation of the whole block as a single, monolithic variable?
Let's say I do the grouping (in reality it's a cut in intervals):
df %>% group_by(A)
From what I have gathered throughout this site, you can use aggregate or dplyr methods to calculate the variance per column, i.e.:
this (SO won't let me embed if I have <10 rep).
In that picture the grouping is shown as colors, but by using aggregate I would get one standard deviation per specified column (I know you can use cbind to handle more than one variable, for example aggregate(cbind(V1,V2)~A, df, sd)) and per group, and similarly with dplyr and %>%, with summarise(..., FUN=sd) appended at the end.
However what I want is this: just like in Matlab when you do
group1 = df(row_group,:) % row_group would be df(:,end)==1 in this case
stdev(group1(:)) % operator (:) is key here
% iterate for every group
I have my reasons for wanting it that specific way, and of course the real dataframe is bigger than this mock example.
Minimum working example:
df <- data.frame(cbind(matrix(rnorm(100),10,10),c(1,2,1,1,2,2,3,3,3,1)))
colnames(df) <- c(paste0("V",seq(1,10)),"A")
df %>% group_by(A) %>% summarise_at(vars(V1), funs(sd(.))) # no good
aggregate(V1~A, data=df, sd) # no good
aggregate(cbind(V1,V2,V3,V4,V5,V6,V7,V8,V9,V10)~A, data=df, sd) # nope
df %>% group_by(A) %>% summarise_at(vars(V1,V2,V3,V4,V5,V6,V7,V8,V9,V10), funs(sd(.))) # same as above...
Result should be 3 doubles, each with the sd of the group (which should be close to 1 if enough columns are added).
If you want a base R solution, try the following.
sp <- split(df[-1], cut(df$A, breaks=c(2.1)))
lapply(sp, function(x) var(unlist(x)))
#$`(0.998,2]`
#[1] 0.848707
#
#$`(2,3]`
#[1] 1.80633
I have coded it in two lines to make it clearer, but you can avoid the creation of sp and write the one-liner
lapply(split(df[-1], cut(df$A, breaks=c(2.1))), function(x) var(unlist(x)))
Or, for a result in another form,
sapply(sp, function(x) var(unlist(x)))
#(0.998,2] (2,3]
# 0.848707 1.806330
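If you would rather stay with dplyr, here is a hedged sketch of the same "whole block" idea, grouping directly by A rather than by cut intervals and pulling in tidyr (not otherwise used in this thread) to stack the V columns:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(-A, names_to = "variable", values_to = "value") %>% # stack V1..V10 into one column
group_by(A) %>%
summarise(sd_all = sd(value)) # one standard deviation per group over the whole block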
DATA
set.seed(6322) # make the results reproducible
df <- data.frame(cbind(matrix(rnorm(100),10,10),c(1,2,1,1,2,2,3,3,3,1)))
colnames(df) <- c(paste0("V",seq(1,10)),"A")
I have a dataframe containing one row per company, with different variables (some numeric, others not):
data <- data.frame(id=1:5,
CA = c(1200,1500,1550,200,0),
EBE = c(800,50,654,8555,0),
VA = c(6984,6588,633,355,84),
FBCF = c(35,358,358,1331,86),
name=c("qsdf","xdwfq","qsdf","sqdf","qsdfaz"),
weight = c(1, 5, 10,1 ,1))
I would like to summarise all numeric variables by a weighted sum. If I wanted a simple sum I would do:
data %>% summarise_if(is.numeric,sum)
but I don't see how to define a weighted sum.
I tried:
w.sum <- function(x) {sum(x*weight) %>% return()}
but without any success.
We can do the weighted sum inside funs:
data %>%
summarise_if(is.numeric, funs(sum(.*weight)))
Note that the above applies to every column of numeric class. In the example, the 'id' column is numeric but probably does not need to be summarised. A better option would be summarise_at, to specify only the columns of interest:
data %>%
summarise_at(names(.)[2:5], funs(sum(.*weight)))
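With more recent dplyr (1.0 or later), the same weighted sum can be written with across(); a sketch, assuming CA, EBE, VA and FBCF are the columns of interest:
data %>%
summarise(across(c(CA, EBE, VA, FBCF), ~ sum(.x * weight)))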
This is a mock-up based on mtcars of what I would like to do:
compute a column that counts the number of cars that have less displacement (disp) than the current row, within the same transmission category (am)
the expected column contains the values I would like to get
try1 is an attempt with the findInterval function; the problem is that I cannot make it count within the subsets defined by the category (am)
I have tried solutions with *apply, but I have never been able to make the called function work only on a subset that depends on the value of a variable in the row being processed (I hope this makes sense).
x = mtcars[1:6,c("disp","am")]
# expected values are the number of cars that have less disp while having the same am
x$expected = c(1,1,0,1,2,0)
#this ordered table is for findInterval
a = x[order(x$disp),]
a
# I use the findInterval function to get the number of values and I try subsetting the call
# -0.1 is to deal with the closed intervals
x$try1 = findInterval(x$disp-0.1, a$disp[a$am==x$am])
x
# try1 values are not computed depending on the subsetting of a
Any solution will do; the use of the findInterval function is not mandatory.
I'd rather have a more general solution enabling a column value to be computed by calling a function that takes values from the current row to compute the expected value.
As pointed out by @dimitris_ps, the previous solution neglects duplicated values. The following provides the remedy.
library(dplyr)
x %>%
group_by(am) %>%
mutate(expected=findInterval(disp, sort(disp) + 0.0001))
or
library(data.table)
setDT(x)[, expected:=findInterval(disp, sort(disp) + 0.0001), by=am]
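A quick check against the expected values stated in the question (a sketch on the same six-row subset):
library(dplyr)
x <- mtcars[1:6, c("disp", "am")]
x %>%
group_by(am) %>%
mutate(expected = findInterval(disp, sort(disp) + 0.0001)) %>%
ungroup()
# expected comes back as 1, 1, 0, 1, 2, 0, matching the question's expected column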
Based on @Khashaa's logic, this is my approach:
library(dplyr)
mtcars %>%
group_by(am) %>%
mutate(expected=match(disp, sort(disp))-1)