Getting two different means in R using same numbers - r

I'm trying to calculate the mean of some grouped data, but I'm running into an issue where the mean generated using base::mean() is generating a different value than when I use base:rowMeans() or try to replicate the mean in Excel.
Here's the code with a simplified data frame looking at just a small piece of the data:
df <- data.frame("ID" = 1101372,
"Q1" = 5.996667,
"Q2" = 6.005556,
"Q3" = 5.763333)
avg1 <- df %>%
summarise(new_avg = mean(Q1,
Q2,
Q3)) # Returns a value of 5.99667
avg2 <- rowMeans(df[,2:4]) # Returns a value of 5.921852
The value in avg2 is what I get when I use AVERAGE in Excel, but I can't figure out why mean() is not generating the same number.
Any thoughts?

Here, the mean is taking only the first argument i.e. Q1 as 'x' because the usage for ?mean is
mean(x, trim = 0, na.rm = FALSE, ...)
i.e. the second and third argument are different. In the OP's code, x will be taken as "Q1", trim as "Q2" and so on.. The ... at the end also means that the user can supply n number of parameters without any error and leads to confusions like this (if we don't check the usage)
We can specify the data as ., subset the columns of interest and use that in rowMeans
df %>%
summarise(new_avg = rowMeans(.[-1]))
This would be more efficient. But, if we want to use mean as such, then do a rowwise
df %>%
rowwise() %>%
summarise(new_avg = mean(c(Q1, Q2, Q3)))
# A tibble: 1 x 1
# new_avg
# <dbl>
#1 5.92
Or convert to 'long' format and then do the group_by 'ID' and get the mean
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -ID) %>%
group_by(ID) %>% # can skip this step if there is only a single row
summarise(new_avg = mean(value))
# A tibble: 1 x 2
# ID new_avg
# <dbl> <dbl>
#1 1101372 5.92

Related

Add summarize variable in multiple statements using dplyr?

In dplyr, group_by has a parameter add, and if it's true, it adds to the group_by. For example:
data <- data.frame(a=c('a','b','c'), b=c(1,2,3), c=c(4,5,6))
data <- data %>% group_by(a, add=TRUE)
data <- data %>% group_by(b, add=TRUE)
data %>% summarize(sum_c = sum(c))
Output:
a b sum_c
1 a 1 4
2 b 2 5
3 c 3 6
Is there an analogous way to add summary variables to a summarize statement? I have some complicated conditionals (with dbplyr) where if x=TRUE I want to add
variable x_v to the summary.
I see several related stackoverflow questions, but I didn't see this.
EDIT: Here is some precise example code, but simplified from the real code (which has more than two conditionals).
summarize_num <- TRUE
summarize_num_distinct <- FALSE
data <- data.frame(val=c(1,2,2))
if (summarize_num && summarize_num_distinct) {
summ <- data %>% summarize(n=n(), n_unique=n_distinct())
} else if (summarize_num) {
summ <- data %>% summarize(n=n())
} else if (summarize_num_distinct) {
summ <- data %>% summarize(n_unique=n_distinct())
}
Depending on conditions (summarize_num, and summarize_num_distinct here), the eventual summary (summ here) has different columns.
As the number of conditions goes up, the number of clauses goes up combinatorially. However, the conditions are independent, so I'd like to add the summary variables independently as well.
I'm using dbplyr, so I have to do it in a way that it can get translated into SQL.
Would this work for your situation? Here, we add a column for each requested summation using mutate. It's computationally wasteful since it does the same sum once for every row in each group, and then discards everything but the first row of each group. But that might be fine if your data's not too huge.
data <- data.frame(val=c(1,2,2), grp = c(1, 1, 2)) # To show it works within groups
summ <- data %>% group_by(grp)
if(summarize_num) {summ = mutate(summ, n = n())}
if(summarize_num_distinct) {summ = mutate(summ, n_unique=n_distinct(val))}
summ = slice(summ, 1) %>% ungroup() %>% select(-val)
## A tibble: 2 x 3
# grp n n_unique
# <dbl> <int> <int>
#1 1 2 2
#2 2 1 1
The summarise_at() function takes a list of functions as parameter. So, we can get
data <- data.frame(val=c(1,2,2))
fcts <- list(n_unique = n_distinct, n = length)
data %>%
summarise_at(.vars = "val", fcts)
n_unique n
1 2 3
All functions in the list must take one argument. Therefore, n() was replaced by length().
The list of functions can be modified dynamically as requested by the OP, e.g.,
summarize_num_distinct <- FALSE
summarize_num <- TRUE
fcts <- list(n_unique = n_distinct, n = length)
data %>%
summarise_at(.vars = "val", fcts[c(summarize_num_distinct, summarize_num)])
n
1 3
So, the idea is to define a list of possible aggregation functions and then to select dynamically the aggregation to compute. Even the order of columns in the aggregate can be determined:
fcts <- list(n_unique = n_distinct, n = length, sum = sum, avg = mean, min = min, max = max)
data %>%
summarise_at(.vars = "val", fcts[c(6, 2, 4, 3)])
max n avg sum
1 2 3 1.666667 5

Time series function in dplyr

I am working with data that stops in a specific year and is NA afterwards. And I need to calculate allot of variables based on lagged values of other variables. I would like to find a way that a whole series is calculated instead of each time one year when one of the variables is NA. I was looking at dplyr given that I am working with panel data and thus need to group it by ID.
I provide the example below:
set.seed(1)
df <- data.frame( year = c(seq(2000, 2018), seq(2000, 2018)) , id = c(rep(1, 19),rep(2, 19)), varA = floor(rnorm(38)*100), varB= floor(rnorm(38)*100), varC= floor(rnorm(38)*100))
df <- df %>% mutate(varA = if_else(year>2010, as.double(NA) , varA) ,
varB = if_else(year>2010, as.double(NA) , varB),
varC = if_else(year>2010, as.double(NA) , varC)) %>% group_by(id) %>% arrange(year)
What I would like is to find a way to calculate a variable that is equal to variable C when it is available, but afterwards is equal to a formula based on lagged values of variable C, B and A. When executing the code below, varResult and D are ony calculated for one year given that the lags are only available for one year:
df <- df %>% mutate( varD = lag(varA)*lag(varB),
varRESULT = if_else(is.na(varC), lag(varC, 1)/lag(varD, 2)*lag(varD, 1), varC))
But I would like to find a way to calculate immidiatly the whole serries (taking into account the panel dimension of the data) instead of heaving to repeat the code 7 times. Preferably a solution where you can calculate varD seperatly from varResults, given that in the final application I have multiple variables that are linked to each other.
Proposed solution:
Starting with the first NA, the "recursive" lags of vars varA, varB, and varC are equal to the last value of these variables.
Thus, starting from these initial variables, we can create new variables: varA1, varB1, and varC1 where we fill the NAs with the last value, by id:
library(dplyr)
library(tidyr) # for the function `fill`
df <- df %>%
mutate(varA1 = varA, varB1 = varB, varC1 = varC) %>%
group_by(id) %>%
arrange(year) %>%
fill(varA1, varB1, varC1) # fills with last value
Then, we apply the formula:
df <- df %>%
mutate( varD = lag(varA1)*lag(varB1),
varRESULT = if_else(is.na(varC), lag(varC1, 1)/lag(varD, 2)*lag(varD, 1), varC)) %>%
select(-varA1, -varB1, -varC1)

Using tidyverse to reshape a data.frame and its column names

I have a data.frame of some experiment with several factors and measured values for each sample. For example:
factors <- c("age","sex")
The data.frame looks like this:
library(dplyr)
set.seed(1)
df <- do.call(rbind,lapply(1:10,function(i) expand.grid(age=c("Y","O"),sex=c("F","M")) %>% dplyr::mutate(val=rnorm(4))))
grouped.mean.val.df <- df %>% dplyr::group_by_(.dots=factors) %>% dplyr::summarise(mean.val=mean(val))
I want to create a data.frame which has a single row and the number of columns is the number of factor combinations (i.e. nrow(expand.grid(age=c("Y","O"),sex=c("F","M")) in this example), where the value is the mean df$val for the corresponding combination of factors.
To get the mean df$val for each combination of factors I do:
grouped.mean.val.df <- df %>% dplyr::group_by_(.dots=factors) %>% dplyr::summarise(mean.val=mean(val))
And the resulting data.frame I'd like to obtain is:
res.df <- data.frame(Y.F=grouped.mean.val.df$mean.val[1],
Y.M=grouped.mean.val.df$mean.val[2],
O.F=grouped.mean.val.df$mean.val[3],
O.M=grouped.mean.val.df$mean.val[4])
Is there a tidyverse way to get that?
We can do unite and then a spread. unite the 'age' and 'sex' to create a single column, mutate the values to factor (to make the order as the same as in the expected) and do a spread to 'wide' format
library(tidyverse)
grouped.mean.val.df %>%
unite(agesex, age, sex, sep=".") %>%
mutate(agesex = factor(agesex, levels = unique(agesex))) %>%
spread(agesex, mean.val)
# A tibble: 1 x 4
# Y.F Y.M O.F O.M
# <dbl> <dbl> <dbl> <dbl>
#1 0.0695 0.411 -0.118 0.00577
Also, instead of group_by_, we can use group_by_atwhich takes strings as variables
df %>%
group_by_at(factors) %>%
summarise(mean.val = mean(val)) %>%
unite(agesex, age, sex, sep=".") %>%
mutate(agesex = factor(agesex, levels = unique(agesex))) %>%
spread(agesex, mean.val)

Collapse data frame, by group, using lists of variables for weighted average AND sum

I want to collapse the following data frame, using both summation and weighted averages, according to groups.
I have the following data frame
group_id = c(1,1,1,2,2,3,3,3,3,3)
var_1 = sample.int(20, 10)
var_2 = sample.int(20, 10)
var_percent_1 =rnorm(10,.5,.4)
var_percent_2 =rnorm(10,.5,.4)
weighting =sample.int(50, 10)
df_to_collapse = data.frame(group_id,var_1,var_2,var_percent_1,var_percent_2,weighting)
I want to collapse my data according to the groups identified by group_id. However, in my data, I have variables in absolute levels (var_1, var_2) and in percentage terms (var_percent_1, var_percent_2).
I create two lists for each type of variable (my real data is much bigger, making this necessary). I also have a weighting variable (weighting).
to_be_weighted =df_to_collapse[, 4:5]
to_be_summed = df_to_collapse[,2:3]
to_be_weighted_2=colnames(to_be_weighted)
to_be_summed_2=colnames(to_be_summed)
And my goal is to simultaneously collapse my data using eiter sum or weighted average, according to the type of variable (ie if its in percentage terms, I use weighted average).
Here is my best attempt:
df_to_collapse %>% group_by(group_id) %>% summarise_at(.vars = c(to_be_summed_2,to_be_weighted_2), .funs=c(sum, mean))
But, as you can see, it is not a weighted average
I have tried many different ways of using the weighted.mean fucntion, but have had no luck. Here is an example of one such attempt;
df_to_collapse %>% group_by(group_id) %>% summarise_at(.vars = c(to_be_weighted_2,to_be_summed_2), .funs=c(weighted.mean(to_be_weighted_2, weighting), sum))
And the corresponding error:
Error in weighted.mean.default(to_be_weighted_2, weighting) :
'x' and 'w' must have the same length
Here's a way to do it by reshaping into long data, adding a dummy variable called type for whether it's a percentage (optional, but handy), applying a function in summarise based on whether it's a percentage, then spreading back to wide shape. If you can change column names, you could come up with a more elegant way of doing the type column, but that's really more for convenience.
The trick for me was the type[1] == "percent"; I had to use [1] because everything in each group has the same type, but otherwise == operates over every value in the vector and gives multiple logical values, when you really just need 1.
library(tidyverse)
set.seed(1234)
group_id = c(1,1,1,2,2,3,3,3,3,3)
var_1 = sample.int(20, 10)
var_2 = sample.int(20, 10)
var_percent_1 =rnorm(10,.5,.4)
var_percent_2 =rnorm(10,.5,.4)
weighting =sample.int(50, 10)
df_to_collapse <- data.frame(group_id,var_1,var_2,var_percent_1,var_percent_2,weighting)
df_to_collapse %>%
gather(key = var, value = value, -group_id, -weighting) %>%
mutate(type = ifelse(str_detect(var, "percent"), "percent", "int")) %>%
group_by(group_id, var) %>%
summarise(sum_or_avg = ifelse(type[1] == "percent", weighted.mean(value, weighting), sum(value))) %>%
ungroup() %>%
spread(key = var, value = sum_or_avg)
#> # A tibble: 3 x 5
#> group_id var_1 var_2 var_percent_1 var_percent_2
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 26 31 0.269 0.483
#> 2 2 32 21 0.854 0.261
#> 3 3 29 49 0.461 0.262
Created on 2018-05-04 by the reprex package (v0.2.0).

filtering within the summarise function of dplyr

I am struggling a little with dplyr because I want to do two things at one and wonder if it is possible.
I want to calculate the mean of values and at the same time the mean for the values which have a specific value in an other column.
library(dplyr)
set.seed(1234)
df <- data.frame(id=rep(1:10, each=14),
tp=letters[1:14],
value_type=sample(LETTERS[1:3], 140, replace=TRUE),
values=runif(140))
df %>%
group_by(id, tp) %>%
summarise(
all_mean=mean(values),
A_mean=mean(values), # Only the values with value_type A
value_count=sum(value_type == 'A')
)
So the A_mean column should calculate the mean of values where value_count == 'A'.
I would normally do two separate commands and merge the results later, but I guess there is a more handy way and I just don't get it.
Thanks in advance.
We can try
df %>%
group_by(id, tp) %>%
summarise(all_mean = mean(values),
A_mean = mean(values[value_type=="A"]),
value_count=sum(value_type == 'A'))
You can do this with two summary steps:
df %>%
group_by(id, tp, value_type) %>%
summarise(A_mean = mean(values)) %>%
summarise(all_mean = mean(A_mean),
A_mean = sum(A_mean * (value_type == "A")),
value_count = sum(value_type == "A"))
The first summary calculates the means per value_type and the second "sums" only the mean of value_type == "A"
You can also give the following function a try:
?summarise_if
(the function family is summarise_all)
Example
The dplyr documentation serves a quite good example of this, i think:
# The _if() variants apply a predicate function (a function that
# returns TRUE or FALSE) to determine the relevant subset of
# columns. Here we apply mean() to the numeric columns:
starwars %>%
summarise_if(is.numeric, mean, na.rm = TRUE)
#> # A tibble: 1 x 3
#> height mass birth_year
#> <dbl> <dbl> <dbl>
#> 1 174. 97.3 87.6
The interesting thing here is the predicate function. This represents the rule by which the columns, that will have to be summarized, are selected.

Resources