summarize_all with "n()" function - r

I'm summarizing a data frame in dplyr with the summarize_all() function. If I do the following:
summarize_all(mydf, list(mean="mean", median="median", sd="sd"))
I get a tibble with 3 variables for each of my original measures, all suffixed by the type (mean, median, sd). Great! But when I try to capture the within-vector n's to calculate the standard deviations myself and to make sure missing cells aren't counted...
summarize_all(mydf, list(mean="mean", median="median", sd="sd", n="n"))
...I get an error:
Error in (function () : unused argument (var_a)
This is not an issue with my var_a vector. If I remove it, I get the same error for var_b, etc. The summarize_all function is producing odd results whenever I request n or n(), or if I use .funs() and list the descriptives I want to compute instead.
What's going on?

The error occurs because n() takes no arguments, unlike mean() and median(), so it fails when summarize_all() passes each column to it. Use length() instead to get the number of values in each column (note that length() also counts NA cells):
summarize_all(mydf, list(mean="mean", median="median", sd="sd", n="length"))
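Since the question asks that missing cells not be counted, here is a minimal sketch of how the two counts differ. The data frame `toy` and the result names `n_all` and `n_obs` are made up for illustration and are not from the original post.

```r
library(dplyr)

# Hypothetical toy data with one missing value per column
toy <- data.frame(var_a = c(1, 2, NA, 4), var_b = c(NA, 10, 20, 30))

res <- summarise_all(
  toy,
  list(
    mean  = ~ mean(., na.rm = TRUE),
    n_all = ~ length(.),       # counts every cell, including NA
    n_obs = ~ sum(!is.na(.))   # counts only the non-missing cells
  )
)
res
```

The result has one column per variable and statistic (for example `var_a_n_obs`), so the non-missing counts can be used directly when computing standard deviations or standard errors by hand.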

We can use the ~ (lambda) notation if we want finer control, e.g. to pass additional arguments:
library(dplyr)
mtcars %>%
summarise_all(list(mean = ~ mean(.), median = ~median(.), n = ~ n()))
However, computing n() for each column does not make much sense, because it is the same for every column. Instead, create n once before doing the summarise:
mtcars %>%
group_by(n = n()) %>%
summarise_all(list(mean = mean, median = median))
Otherwise, just pass the unquoted function
mtcars %>%
summarise_all(list(mean = mean, median = median))
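For readers on dplyr 1.0.0 or later, the same idea can be sketched with across(), which supersedes summarise_all(). This is one possible arrangement rather than the answerer's own code: the row count is computed once, and it is placed after across() so that across(everything(), ...) does not try to summarise the new n column as well.

```r
library(dplyr)

res <- mtcars %>%
  summarise(
    across(everything(), list(mean = mean, median = median)),
    n = n()   # computed once for the whole data frame
  )
res
```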

Related

Is it possible to count by using the count function within across()?

Hello R and tidyverse wizards,
I try to count the rows of the starwars data set to know how many observations we get with the variables "height" and "mass".
I managed to get it with this code:
library(tidyverse)
starwars %>%
select(height, mass) %>%
drop_na() %>%
summarise(across(.cols = c(height, mass),
list(obs = ~ n(),
mean = mean,
sd = sd))) %>%
View()
I would like to replace the obs = ~ n() by the count function and tried this version:
library(tidyverse)
starwars %>%
select(height, mass) %>%
drop_na() %>%
summarise(across(.cols = c(height, mass),
list(obs = count,
mean = mean,
sd = sd))) %>%
View()
but it was too simple to work, classic :p
I had this error message --> Error in View : Problem while computing ..1 = across(...)
And when I got rid of the View() function, I had another error message --> Error in summarise():
! Problem while computing ..1 = across(...).
Caused by error in across():
! Problem while computing column height_obs.
Caused by error in UseMethod():
! no applicable method for 'count' applied to an object of class "c('integer', 'numeric')"
So, I got two questions:
could someone please explain why the code worked with ~ n() but not with count?
is it possible to use the count function instead of ~ n() in that case?
Sorry if it is a dumb question but I just try to understand the across and the count functions by playing with it.
In the function description it says that "df %>% count(a, b) is roughly equivalent to df %>% group_by(a, b) %>% summarise(n = n())", so I assume that using count() within across results in something like a double summarize-command, hence the use in favor of n().
Edit: Here you find the solution in the comment by G. Grothendieck
What is the difference between n() and count() in R? When should one favour the use of either or both?
n() returns a number
count() returns a dataframe
count() takes a dataframe as its first argument. It then returns counts for columns within that dataframe, passed as additional arguments. e.g.,
library(dplyr)
count(starwars, mass, height)
When you put count() inside across(), it passes columns to count() without including the dataframe as the first argument. Equivalent to if you ran,
count(starwars$mass, starwars$height)
Because count() expects a dataframe as the first argument, it throws an error.
n(), on the other hand, doesn't take any arguments, and simply counts the rows in the current group (or in the whole data frame if it is ungrouped). You have to include the ~, as otherwise across() will try passing each column to n(), which causes an error since n() doesn't expect arguments.
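The difference can be seen on a small example. The tibble `toy` and the name `non_missing` are invented for this sketch; it also shows a per-column alternative to n() for cases where drop_na() has not been applied first.

```r
library(dplyr)

# Hypothetical data with one missing value in each column
toy <- tibble(height = c(172, NA, 96), mass = c(77, 75, NA))

# n() ignores the column it is called for and returns the row count;
# sum(!is.na(.x)) counts the non-missing values of that column
n_res <- toy %>%
  summarise(across(c(height, mass),
                   list(obs = ~ n(), non_missing = ~ sum(!is.na(.x)))))

# count() works on the data frame itself and returns a data frame
count_res <- count(toy)

n_res
count_res
```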

How to calculate weighted mean using mutate_at in R?

I have a dataframe ("df") with a number of columns that I would like to estimate the weighted means of, weighting by population (df$Population), and grouping by commuting zone (df$cz).
This is the list of columns I would like to estimate the weighted means of:
vlist = c("Public_Welf_Total_Exp", "Welf_Cash_Total_Exp", "Welf_Cash_Cash_Assist", "Welf_Ins_Total_Exp","Total_Educ_Direct_Exp", "Higher_Ed_Total_Exp", "Welf_NEC_Cap_Outlay","Welf_NEC_Direct_Expend", "Welf_NEC_Total_Expend", "Total_Educ_Assist___Sub", "Health_Total_Expend", "Total_Hospital_Total_Exp", "Welf_Vend_Pmts_Medical","Hosp_Other_Total_Exp","Unemp_Comp_Total_Exp", "Unemp_Comp_Cash___Sec", "Total_Unemp_Rev", "Hous___Com_Total_Exp", "Hous___Com_Construct")
This is the code I have been using:
df = df %>% group_by(cz) %>% mutate_at(vlist, weighted.mean(., df$Population))
I have also tried:
df = df %>% group_by(cz) %>% mutate_at(vlist, function(x) weighted.mean(x, df$Population))
As well as tested the following code on only 2 columns:
df = df %>% group_by(cz) %>% mutate_at(vars(Public_Welf_Total_Exp, Welf_Cash_Total_Exp), weighted.mean(., df$Population))
However, everything I have tried gives me the following error, even though there are no NAs in any of my variables:
Error in weighted.mean.default(., df$Population) :
'x' and 'w' must have the same length
I understand that I could do the following estimation using lapply, but I don't know how to group by another variable using lapply. I would appreciate any suggestions!
There is a lot to unpack here...
You probably mean summarise instead of mutate, because with mutate you would just replicate the group result on every row.
mutate_at and summarise_at are superseded; you should use across instead.
Your code did not work because you did not write your function as a formula (there is no ~ at the beginning) and because you used df$Population instead of Population. When you write Population, summarise knows you mean the column Population which, at that point, is grouped like the rest of the data frame. df$Population refers to the full column of the original, ungrouped data frame. That is not only wrong: it also raises an error, because the length of the variable being averaged (one group) does not match the length of the weights in df$Population (all rows).
Here is how you could do it:
library(dplyr)
df %>%
group_by(cz) %>%
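# all_of() selects the columns named in the character vector vlist;
# Population is evaluated within each cz group
summarise(across(all_of(vlist), ~ weighted.mean(.x, Population)),
.groups = "drop")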
If you really need to use summarise_at (and probably you are using an old version of dplyr [lower than 1.0.0]), then you could do:
df %>%
group_by(cz) %>%
summarise_at(vlist, ~weighted.mean(., Population)) %>%
ungroup()
I considered df and vlist like the following:
vlist <- c("Public_Welf_Total_Exp", "Welf_Cash_Total_Exp", "Welf_Cash_Cash_Assist", "Welf_Ins_Total_Exp","Total_Educ_Direct_Exp", "Higher_Ed_Total_Exp", "Welf_NEC_Cap_Outlay","Welf_NEC_Direct_Expend", "Welf_NEC_Total_Expend", "Total_Educ_Assist___Sub", "Health_Total_Expend", "Total_Hospital_Total_Exp", "Welf_Vend_Pmts_Medical","Hosp_Other_Total_Exp","Unemp_Comp_Total_Exp", "Unemp_Comp_Cash___Sec", "Total_Unemp_Rev", "Hous___Com_Total_Exp", "Hous___Com_Construct")
df <- as.data.frame(matrix(rnorm(length(vlist) * 100), ncol = length(vlist)))
names(df) <- vlist
df$cz <- rep(letters[1:10], each = 10)
df$Population <- runif(100)
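The length mismatch described above can be made visible with a small sketch. The data frame `toy` and its columns are hypothetical; the point is only that the bare column name is sliced per group while the `$` form always returns every row.

```r
library(dplyr)

# Hypothetical data: two commuting zones with two rows each
toy <- data.frame(
  cz         = c("a", "a", "b", "b"),
  Population = c(1, 3, 2, 2),
  x          = c(10, 20, 30, 50)
)

res <- toy %>%
  group_by(cz) %>%
  summarise(
    rows_in_group  = length(x),               # size of the current group
    weights_bare   = length(Population),      # same size: sliced per group
    weights_dollar = length(toy$Population),  # all rows: causes the mismatch
    wmean          = weighted.mean(x, Population),
    .groups = "drop"
  )
res
```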

Summarising twice in same pipe R

I obviously get an error with the below but I was hoping to summarise the same column with regards to mean and median, and also how many points are in the polygon. But within the same pipe. Any help would be great.
Nin_Sep_points_sf_joined <-
st_join(merged_ten_seven_shp, Nin_Sep_sf_3011) %>%
filter(!is.na(Employment_diff)) %>%
group_by(Kod) %>%
summarise(Count=mean(as.numeric(as.character(price)))), summarise(Count_tot=n()), summarise(Count=median(as.numeric(as.character(price))))
You can supply multiple comma-separated arguments to a single summarise call:
library(dplyr)
Nin_Sep_points_sf_joined <-
st_join(merged_ten_seven_shp, Nin_Sep_sf_3011) %>%
filter(!is.na(Employment_diff)) %>%
group_by(Kod) %>%
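# give each summary its own name; reusing Count would overwrite the mean with the median
summarise(Mean_price=mean(as.numeric(as.character(price))),
Count_tot=n(),
Median_price=median(as.numeric(as.character(price))))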
Note that you can even refer to the results of previous arguments in the next argument. So you could calculate SD based on Count_tot.
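As a sketch of referring to an earlier result, the following computes a standard error from the count created in the same call. The data frame `toy` stands in for the joined sf object (the geometry and the spatial join are left out), and the name `SE_price` is an assumption made for this example.

```r
library(dplyr)

# Hypothetical stand-in for the joined data, with price stored as text
toy <- data.frame(
  Kod   = c("A", "A", "A", "B", "B"),
  price = c("10", "20", "30", "5", "15")
)

res <- toy %>%
  mutate(price = as.numeric(as.character(price))) %>%
  group_by(Kod) %>%
  summarise(
    Count_tot  = n(),
    Mean_price = mean(price),
    SE_price   = sd(price) / sqrt(Count_tot),  # uses Count_tot defined above
    .groups = "drop"
  )
res
```

Converting price once with mutate() also avoids repeating the as.numeric(as.character()) call in every summary.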

R: summarise multiple columns with different summation functions using dplyr results in error?

I am transforming a customer journey dataset from user aggregation level to a day level aggregation. The problem is that I cannot simply sum or mean all columns, as not all variables can be aggregated in the same way. For example, duration is a variable that I want to summarise via mean, while purchase_own is a variable that I want to summarise via sum.
I used dplyr to get this working, but it gives me an error. I tried the following code:
CJd <- CJre %>% group_by(date) %>% summarise_at(vars(purchase_own, purchase_any, CIT,
FIT, T1:T22, devicemobile, devicefixed, purchase_comp, POS_comp, POS_own, POS_any,
markov, first_touch, last_touch, linear_touch), sum) %>%
summarise_at(vars(duration, difference), mean) %>% summarise_at(CountTP, max)
This results in an error:
Error in .f(.x[[i]], ...) : object 'duration' not found
I suspect that this means that summarise_at(vars(duration, difference), mean) is not allowed as second summarise code. Now my question is, how can I write the summarise function so that summation is different for some variables?
Actual results is that only the first summarise_at gets executed, which results in missing variables in my dataset. The missing variables need to be summarised with mean and max, respectively. The expected outcome is these variables grouped by date and summarised by the named functions mean or max are added to the dataset.
The issue is that the first summarise_at did not include 'duration', so that column no longer exists in the summarised data when the second summarise_at runs. Instead, use mutate_at to compute the sums while keeping all columns, then add those columns to the grouping and summarise the remaining ones:
CJre %>%
group_by(date) %>%
mutate_at(vars(purchase_own, purchase_any, CIT,
FIT, T1:T22, devicemobile, devicefixed, purchase_comp,
POS_comp, POS_own, POS_any,
markov, first_touch, last_touch, linear_touch), sum) %>%
# group_by_at() accepts the T1:T22 range; .add = TRUE keeps the date grouping
# (the argument was called add before dplyr 1.0.0)
group_by_at(vars(purchase_own, purchase_any, CIT,
FIT, T1:T22, devicemobile, devicefixed, purchase_comp,
POS_comp, POS_own, POS_any,
markov, first_touch, last_touch, linear_touch), .add = TRUE) %>%
summarise_at(vars(duration, difference), mean)
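With dplyr 1.0.0 or later, one alternative is to apply different functions to different column sets inside a single summarise call using across(). This is a sketch rather than the answerer's method, and `toy` is a hypothetical reduced version of CJre with only a few of the columns.

```r
library(dplyr)

# Hypothetical reduced data set
toy <- data.frame(
  date         = c("d1", "d1", "d2", "d2"),
  purchase_own = c(1, 0, 1, 1),
  purchase_any = c(1, 1, 0, 1),
  duration     = c(10, 20, 30, 50),
  difference   = c(1, 3, 5, 7),
  CountTP      = c(2, 5, 1, 4)
)

res <- toy %>%
  group_by(date) %>%
  summarise(
    across(c(purchase_own, purchase_any), sum),  # columns to be summed
    across(c(duration, difference), mean),       # columns to be averaged
    CountTP = max(CountTP),                      # single column, maximum
    .groups = "drop"
  )
res
```

Because everything happens in one summarise call, no column is dropped before it has been summarised.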

Calling prop.test function in R with dplyr

I am trying to calculate several binomial proportion confidence intervals. My data are in a data frame, and though I can successfully extract the estimate from the object returned by prop.test, the conf.int variable seems to be null when run on the data frame.
library(dplyr)
cases <- c(50000, 1000, 10, 2343242)
population <- c(100000000, 500000000, 100000, 200000000)
df <- as.data.frame(cbind(cases, population))
df %>% mutate(rate = prop.test(cases, population, conf.level=0.95)$estimate)
This appropriately returns
cases population rate
1 50000 1e+08 0.00050000
2 1000 5e+08 0.00000200
3 10 1e+05 0.00010000
4 2343242 2e+08 0.01171621
However, when I run
df %>% mutate(confint.lower = prop.test(cases, population, conf.level=0.95)$conf.int[1])
I sadly get
Error in mutate_impl(.data, dots) :
Column `confint.lower` is of unsupported type NULL
Any thoughts? I know alternative ways to calculate the binomial proportion confidence interval, but I would really like to learn how to use dplyr well.
Thank you!
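You can use dplyr::rowwise() to group on rows:
df %>%
rowwise() %>%
mutate(lower_ci = prop.test(cases, population, conf.level=0.95)$conf.int[1])
By default dplyr passes each column to the function as a whole vector. Vectorized functions, as @Jake Fisher mentioned above, therefore work without rowwise(). prop.test() is not vectorized in that sense: given the full columns, it runs a single test comparing all four proportions, and for more than two groups it returns no confidence interval, so conf.int is NULL.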
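The behaviour can be checked in base R without dplyr. The sketch below reuses the vectors from the question; the object names `whole` and `one` are chosen only for this illustration.

```r
cases <- c(50000, 1000, 10, 2343242)
population <- c(100000000, 500000000, 100000, 200000000)

# All rows at once: a single test of equal proportions across four groups
whole <- prop.test(cases, population, conf.level = 0.95)
length(whole$estimate)    # one estimate per group
is.null(whole$conf.int)   # no interval is returned for more than two groups

# One row at a time: a one-sample test, which does return an interval
one <- prop.test(cases[1], population[1], conf.level = 0.95)
one$conf.int
```

This is why extracting $estimate works inside mutate() while $conf.int[1] is NULL.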
This is what I would do to catch all of the confidence interval components at once:
df %>%
rowwise() %>%
mutate(tst = list(broom::tidy(prop.test(cases, population, conf.level=0.95)))) %>%
tidyr::unnest(tst)
As of version 0.8.3 of dplyr, the lifecycle status of the rowwise() function was "questioning"; as of version 1.0.0, rowwise() is no longer being questioned.
As an alternative, I would rather recommend the use of purrr::map2() to achieve the goal:
library(purrr)
df %>%
mutate(rate = map2(cases, population, ~ prop.test(.x, .y, conf.level=0.95) %>%
broom::tidy())) %>%
tidyr::unnest(rate)

Resources