dplyr "weighted sum" and across()


I have already asked a similar question, with the answer shown below. There I wanted to aggregate my dataframe by "number" and calculate a weighted mean. Now I would like to compute a weighted sum instead, but I cannot figure out how to apply one to my dataframe. The weighted.sum function no longer works with my R version.
df = data.frame(number=c("a","a","a","b","c","c"), y=c(1,2,3,4,1,7),
z=c(2,2,6,8,9,1), weight =c(1,1,3,1,2,1))
df %>%
group_by(number) %>%
summarise(across(c(y, z),
list( mean = ~mean(., na.rm = TRUE), sd = ~sd(., na.rm = TRUE),
weighted = ~weighted.mean(., w = weight))), .groups = 'drop')

We could use
library(dplyr)
df %>%
group_by(number) %>%
summarise(across(c(y, z),
list( mean = ~mean(., na.rm = TRUE),
sd = ~sd(., na.rm = TRUE),
weighted = ~weighted.mean(., w = weight),
weightedsum = ~ sum(. * weight))), .groups = 'drop')
# A tibble: 3 x 9
# number y_mean y_sd y_weighted y_weightedsum z_mean z_sd z_weighted z_weightedsum
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 a 2 1 2.4 12 3.33 2.31 4.4 22
#2 b 4 NA 4 4 8 NA 8 8
#3 c 4 4.24 3 9 5 5.66 6.33 19
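As a quick sanity check (a minimal sketch, not part of the original answer): the weighted sum is just the weighted mean multiplied by the total weight in each group, so the two results can be cross-checked against each other. The column names wsum, wmean and total_weight are made up for illustration.

```r
library(dplyr)

# Same data as in the question
df <- data.frame(number = c("a", "a", "a", "b", "c", "c"),
                 y = c(1, 2, 3, 4, 1, 7),
                 z = c(2, 2, 6, 8, 9, 1),
                 weight = c(1, 1, 3, 1, 2, 1))

check <- df %>%
  group_by(number) %>%
  summarise(wsum = sum(y * weight),                 # weighted sum of y
            wmean = weighted.mean(y, w = weight),   # weighted mean of y
            total_weight = sum(weight),             # sum of weights per group
            .groups = "drop")

# The weighted sum should equal weighted mean times total weight
all.equal(check$wsum, check$wmean * check$total_weight)
```

If the two ever disagree, the usual suspect is an NA in y or weight, which sum() and weighted.mean() treat differently unless na.rm is set.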

Related

Dplyr Summarise Groups as Column Names

I got a data frame with a lot of columns and want to summarise them with multiple functions.
test_df <- data.frame(Group = sample(c("A", "B", "C"), 10, T), var1 = sample(1:5, 10, T), var2 = sample(3:7, 10, T))
test_df %>%
group_by(Group) %>%
summarise_all(c(Mean = mean, Sum = sum))
# A tibble: 3 x 5
Group var1_Mean var2_Mean var1_Sum var2_Sum
<chr> <dbl> <dbl> <int> <int>
1 A 3.14 5.14 22 36
2 B 4.5 4.5 9 9
3 C 4 6 4 6
This results in a tibble with Group as the first column and column names that combine the previous column name with the function name.
The desired result is a table with the previous column names in the first column and the groups and functions in the column names.
I can achieve this with
test_longer <- test_df %>% pivot_longer(cols = starts_with("var"), names_to = "var", values_to = "val")
# Add row number because spread needs unique identifiers for rows
test_longer <- test_longer %>%
group_by(Group) %>%
mutate(grouped_id = row_number())
spread(test_longer, Group, val) %>%
select(-grouped_id) %>%
group_by(var) %>%
summarise_all(c(Mean = mean, Sum = sum), na.rm = T)
# A tibble: 2 x 7
var A_Mean B_Mean C_Mean A_Sum B_Sum C_Sum
<chr> <dbl> <dbl> <dbl> <int> <int> <int>
1 var1 3.14 4.5 4 22 9 4
2 var2 5.14 4.5 6 36 9 6
But this seems to be a rather long detour... There probably is a better way, but I could not find it. Any suggestions? Thank you

There are lots of ways to go about it, but I would simplify it by pivoting to a longer data frame first and then grouping by var and Group. Then you can just pivot wider to get the final result you want. Note that I used summarize(across()), which replaces the superseded summarize_all(), even though with a single value column I could have just specified Mean = ... and Sum = ... manually.
set.seed(123)
test_df %>%
pivot_longer(
var1:var2,
names_to = "var"
) %>%
group_by(Group, var) %>%
summarize(
across(
everything(),
list(Mean = mean, Sum = sum),
.names = "{.fn}"
),
.groups = "drop"
) %>%
pivot_wider(
names_from = "Group",
values_from = c(Mean, Sum),
names_glue = "{Group}_{.value}"
)
#> # A tibble: 2 × 7
#> var A_Mean B_Mean C_Mean A_Sum B_Sum C_Sum
#> <chr> <dbl> <dbl> <dbl> <int> <int> <int>
#> 1 var1 1 2.5 3.2 1 10 16
#> 2 var2 5 4.5 4.4 5 18 22
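To illustrate the remark about specifying Mean and Sum manually, here is a minimal sketch on a small deterministic data frame (toy is made up so the numbers can be checked by hand; it is not the random test_df from the question):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with known values
toy <- data.frame(Group = c("A", "A", "B", "B"),
                  var1 = c(1, 3, 5, 7),
                  var2 = c(2, 4, 6, 8))

res <- toy %>%
  pivot_longer(var1:var2, names_to = "var") %>%   # values land in column `value`
  group_by(Group, var) %>%
  summarise(Mean = mean(value),                   # only one value column, so no across() needed
            Sum = sum(value),
            .groups = "drop") %>%
  pivot_wider(names_from = Group,
              values_from = c(Mean, Sum),
              names_glue = "{Group}_{.value}")

res
```

The result has one row per original variable (var1, var2) and one column per group/statistic combination, which is the layout asked for in the question.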

How to rewrite the same code with across function

I scripted the following code
out %>% group_by(tests0, GROUP) %>%
summarise(
mean0 = mean(score0, na.rm = T),
stderr0 = std.error(score0, na.rm = T),
mean7 = mean(score7, na.rm = T),
stederr7 = std.error(score7, na.rm = T),
diff.std.mean = t.test(score0, score7, paired = T)$estimate,
p.value = t.test(score0, score7, paired = T)$p.value,
)
and I have obtained the following output
tests0 GROUP mean0 stderr0 mean7 stederr7 diff.std.mean p.value
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ADAS_CogT0 CONTROL 12.6 0.525 13.6 0.662 -1.15 0.00182
2 ADAS_CogT0 TRAINING 14.0 0.613 12.6 0.570 1.40 0.00295
3 PVF_T0 CONTROL 32.1 1.22 31.3 1.45 0.498 0.636
4 PVF_T0 TRAINING 31.6 1.37 34.3 1.51 -2.48 0.0102
5 ROCF_CT0 CONTROL 29.6 0.893 30.3 0.821 -0.180 0.835
6 ROCF_CT0 TRAINING 30.1 0.906 29.5 0.929 0.489 0.615
7 ROCF_IT0 CONTROL 12.8 0.563 12.2 0.683 0.580 0.356
8 ROCF_IT0 TRAINING 10.9 0.735 12.3 0.768 -1.44 0.0238
9 ROCF_RT0 CONTROL 12.1 0.725 12.5 0.797 -0.370 0.598
10 ROCF_RT0 TRAINING 10.5 0.746 10.9 0.742 -0.534 0.370
11 SVF_T0 CONTROL 35.5 1.05 34 1.15 1.42 0.107
12 SVF_T0 TRAINING 34.1 1.04 32.9 1.16 0.962 0.231
If I would like to do the same via the across() function, what am I supposed to do to achieve the same results as shown in the code above? Actually, I am in trouble because I was drawing on an example from the answer published under the question "Reproduce a complex table with double headers", but I was not able to adapt it properly.
Here is the dataset
Below you can find the way I would like to obtain the same result. It is a method that requires .x manipulation.
out %>%
group_by(across(all_of(tests0, GROUP))) %>% summarise(across(starts_with('score'),
list(mean = ~ mean(.x,na.rm = T),
stderr = ~ std.error(.x, na.rm = TRUE),
diff.std.mean = ~ t.test(.x, na.rm = T)))$estimate,
p.value = ~ t.test(.x, na.rm = T)))$p.value)),.groups = "drop")

You can use the argument .names in across():
library(dplyr)
out %>%
group_by(tests0, GROUP) %>%
summarize(across(c(score0, score7), sd, na.rm = TRUE, .names = "sd_{.col}"),
across(c(score0, score7), mean, na.rm = TRUE, .names = "mean_{.col}"),
diff.std.mean = t.test(score0, score7, paired = T)$estimate,
p.value = t.test(score0, score7, paired = T)$p.value) %>%
ungroup()
#> `summarise()` has grouped output by 'tests0'. You can override using the `.groups` argument.
#> # A tibble: 2 x 8
#> tests0 GROUP sd_score0 sd_score7 mean_score0 mean_score7 diff.std.mean p.value
#> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 ADAS_~ CONT~ 3.72 4.81 12.5 13.5 -1.24 0.00471
#> 2 ADAS_~ TRAI~ 4.55 4.15 14.0 12.6 1.40 0.00295
Created on 2021-11-26 by the reprex package (v2.0.1)
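One caveat, as far as I know: passing extra arguments such as na.rm = TRUE through the ... of across() is deprecated as of dplyr 1.1.0, so on a current dplyr the same idea is usually written with lambdas. A minimal sketch on made-up toy data (out_toy is hypothetical and only stands in for the real out):

```r
library(dplyr)

# Hypothetical stand-in for `out`, with one missing score
out_toy <- data.frame(
  GROUP  = rep(c("CONTROL", "TRAINING"), each = 4),
  score0 = c(10, 12, NA, 14, 9, 11, 13, 15),
  score7 = c(11, 13, 12, 16, 8, 12, 12, 13)
)

res <- out_toy %>%
  group_by(GROUP) %>%
  summarise(
    # na.rm is set inside the lambda instead of being passed via `...`
    across(c(score0, score7), ~ sd(.x, na.rm = TRUE), .names = "sd_{.col}"),
    across(c(score0, score7), ~ mean(.x, na.rm = TRUE), .names = "mean_{.col}"),
    # the paired t-test only uses rows where both scores are present
    diff.std.mean = t.test(score0, score7, paired = TRUE)$estimate,
    p.value = t.test(score0, score7, paired = TRUE)$p.value,
    .groups = "drop"
  )

res
```

Because the columns are named explicitly with c(score0, score7), the second across() does not pick up the sd_ columns created by the first one.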
EDIT
If you prefer a list it would be easier to determine the separate parts and then bind them together:
library(data.table)
by <- c("tests0", "GROUP")
out_dt <- data.table::data.table(out)
means <- out_dt[, sapply(.SD, function(x) list(mean = mean(x, na.rm = TRUE))),
by = by, .SDcols = patterns("^score")]
sds <- out_dt[, sapply(.SD, function(x) list(sd = sd(x, na.rm = TRUE))),
by = by, .SDcols = patterns("^score")]
t_est <- out_dt[, .(diff.std.mean = t.test(score0, score7, paired = T)$estimate), by = by]
tpvalue <- out_dt[, .(p.value = t.test(score0, score7, paired = T)$p.value), by = by]
list(means = means, sds = sds, diff.std.mean = t_est, p.value = tpvalue)

Here is another approach you may want to consider. First I took your code and cut and pasted it into a function. Abstracting the column names and removing the dependency on the plotrix package for calculating the standard error are the only changes.
g <- function (df)
{
nms <- c(names(df)[1:2],
paste0('mean', sub(".*[a-z]","",names(df)[3])),
paste0('stderr', sub(".*[a-z]","",names(df)[3])),
paste0('mean', sub(".*[a-z]","",names(df)[4])),
paste0('stderr', sub(".*[a-z]","",names(df)[4])),
'diff.std.mean', 'p.value')
z <- df %>% group_by(df[,1:2]) %>%
summarize(
x1 = mean(pull(df[,3]), na.rm = T),
x2 = sd(pull(df[,3]), na.rm=T) / sqrt(sum(!is.na(pull(df[,3])))),
x3 = mean(pull(df[,4]), na.rm = T),
x4 = sd(pull(df[,4]), na.rm=T) / sqrt(sum(!is.na(pull(df[,4])))),
x5 = t.test(pull(df[,3]), pull(df[,4]), paired = T)$estimate,
x6 = t.test(pull(df[,3]), pull(df[,4]), paired = T)$p.value)
colnames(z) <- nms
return(z)
}
Then, because the test data only had one level of the factor and an insufficient sample size for the plotrix::std.error function that you used, I introduced variation in the tests0 factor, doubled the sample size, and dropped the unused levels because they would cause iterations on empty frames. In addition, I added a score8 column to show how you could run this on other variables.
s <- t %>% mutate(tests0 = case_when(Education <= 8 ~ 'ADAS_CogTO', T ~ 'PVF_T0'),
score8 = score0 + score7)
q <- rbind(s, s)
fct_drop(q$tests0)
Then I split the frame by the factor levels, applied the function to each of the splits, and merged the data back together inside a function that lets you choose the score and group variables. I assumed two of each, which is safe for the score variables since you are doing a paired t-test, and it is easily extensible for the group variables (if you simply move the score variables to positions 1 and 2 and use all remaining variables passed to the function as group variables).
h <- function(df, group_vars, score_vars)
{
z <- df %>% select(group_vars, score_vars)
z <- z %>% group_by(z[,1:2]) %>%
group_map( ~ g(.x), .keep = T) %>%
bind_rows()
}
Note that if you want to apply this to other data, you only need to change the columns passed as the group and score variables. It should be fairly easy to alter that as well if you want to; I just thought this was a good framework for what you seem to be trying to do. Think about how you handle the case where score0 is missing and score7 is non-missing (or vice versa), since these observations are included in some of your summary statistics but necessarily excluded from the paired t-test. Good luck.
x <- h(q, c("tests0", "GROUP"), c("score0", "score7")) %>%
group_by(tests0) %>%
pivot_wider(id_cols = tests0,
names_from = GROUP,
values_from = c("mean0","stderr0","mean7","stderr7",
'diff.std.mean', 'p.value'))

I don't have a function called std.error so I've used sd, but of course you can change it.
library(dplyr)
library(readr)
out %>%
group_by(tests0, GROUP) %>%
summarise(
across(c(score0, score7), list(mean = mean, stderr = sd), na.rm = TRUE,
.names = '{.fn}{parse_number(.col)}'),
with(t.test(score0, score7, paired = T),
tibble(diff.std.mean = estimate,
p.value)))
# # A tibble: 2 × 8
# tests0 GROUP mean0 stderr0 mean7 stderr7 diff.std.mean p.value
# <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 ADAS_CogT0 CONTROL 12.5 3.72 13.5 4.81 -1.24 0.00471
# 2 ADAS_CogT0 TRAINING 14.0 4.55 12.6 4.15 1.40 0.00295
In reality I would just put the above code in a function that takes an x and y argument and then run fun(df, x = score0, y = score7). But, just for fun, if you must use .x and .y, here's one way (although imo it would be a little silly to do this)
df %>%
group_by(tests0, GROUP) %>%
select(starts_with('score')) %>%
summarise(
across(everything(), list(mean = mean, stderr = sd), na.rm = TRUE,
.names = '{.fn}{parse_number(.col)}'),
across(everything(), list(list)) %>%
pmap_dfr(~ t.test(.x, .y, paired = TRUE)[c('estimate', 'p.value')]) %>%
transmute(diff.std.mean = estimate, p.value))
# # A tibble: 2 × 8
# # Groups: tests0 [1]
# tests0 GROUP mean0 stderr0 mean7 stderr7 diff.std.mean p.value
# <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 ADAS_CogT0 CONTROL 12.5 3.72 13.5 4.81 -1.24 0.00471
# 2 ADAS_CogT0 TRAINING 14.0 4.55 12.6 4.15 1.40 0.00295
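Since several answers here fall back to sd because std.error comes from the plotrix package, here is a minimal sketch of a standard error helper that can be dropped into across(). The helper name se and the toy data are made up for illustration; the formula (sd divided by the square root of the non-missing count) is the one used in the g() function above.

```r
library(dplyr)

# Hypothetical helper: standard error of the mean, ignoring NA values
se <- function(x) {
  x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}

# Toy data standing in for the `out` data frame from the question
toy <- data.frame(
  GROUP  = rep(c("CONTROL", "TRAINING"), each = 4),
  score0 = c(2, 4, NA, 6, 1, 2, 3, 4),
  score7 = c(3, 5, 7, 9, 2, 2, 2, 6)
)

res <- toy %>%
  group_by(GROUP) %>%
  summarise(
    across(c(score0, score7),
           list(mean = ~ mean(.x, na.rm = TRUE), stderr = se),
           .names = "{.fn}_{.col}"),
    .groups = "drop"
  )

res
```

With a helper like this, the stderr columns hold an actual standard error rather than the standard deviation shown in the outputs above.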

I thought of a possible workaround (that may or may not help) by using across() "manually", that is, without applying functions one column at a time. The resulting output is a data.frame with deeply nested list columns, so unnest() will come in handy. I also used possibly() to handle the case where two columns are not present: remember that across() can match any number of columns, while t.test() needs both x and y arguments.
Code:
library(tidyverse)
data <-
df %>%
group_by(tests0, GROUP) %>%
summarize(
all = list(across(starts_with("score")) %>%
{
tibble(
ttest = data.frame(possibly(~ reduce(., ~ t.test(.x, .y, paired = TRUE))[c("estimate", 'p.value')], NA)(.)),
means = data.frame(map(., ~ mean(.x, na.rm = TRUE)) %>% set_names(., str_replace(names(.), "\\D+", "mean"))),
stderrs = data.frame(map(., ~ sd(.x, na.rm = TRUE)) %>% set_names(., str_replace(names(.), "\\D+", "stederr")))
)
})
)
#> `summarise()` has grouped output by 'tests0'. You can override using the `.groups` argument.
data %>%
unnest(all) %>%
unnest(-c("tests0", "GROUP"))
#> # A tibble: 2 × 8
#> # Groups: tests0 [1]
#> tests0 GROUP estimate p.value mean0 mean7 stederr0 stederr7
#> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 ADAS_CogT0 CONTROL -1.24 0.00471 12.5 13.5 3.72 4.81
#> 2 ADAS_CogT0 TRAINING 1.40 0.00295 14.0 12.6 4.55 4.15
Created on 2021-11-29 by the reprex package (v2.0.1)

Aggregating the standard deviation and counting non-NAs in sparklyr

I have a large data.frame and I have been aggregating summary statistics for numerous variables using summarise in conjunction with across. Due to the size of my data.frame I have had to start processing my data in sparklyr.
As sparklyr does not support across, I am using summarise_each. This works OK, except that summarise_each in sparklyr does not appear to support sd and sum(!is.na(.)).
Below is an example dataset and how I would process it usually, using dplyr:
test <- data.frame(ID = c("Group1","Group1",'Group1','Group1','Group1','Group1','Group1',
"Group2","Group2","Group2",'Group2','Group2','Group2',"Group2",
"Group3","Group3","Group3"),
Value1 = c(-100,-10,-5,-5,-5,1,2,1,2,3,4,4,4,4,1,2,3),
Value2 = c(50,100,10,-5,3,1,2,2,2,3,4,4,4,4,1,2,3))
test %>%
group_by %>%
summarise(across((Value1:Value2), ~sum(!is.na(.), na.rm = TRUE), .names = "{col}_count"),
across((Value1:Value2), ~min(., na.rm = TRUE), .names = "{col}_min"),
across((Value1:Value2), ~max(., na.rm = TRUE), .names = "{col}_max"),
across((Value1:Value2), ~mean(., na.rm = TRUE), .names = "{col}_mean"),
across((Value1:Value2), ~sd(., na.rm = TRUE), .names = "{col}_sd"))
# A tibble: 1 x 10
Value1_count Value2_count Value1_min Value2_min Value1_max Value2_max Value1_mean Value2_mean Value1_sd Value2_sd
<int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 17 17 -100 -5 4 100 -5.53 11.2 24.7 25.8
I have also been able to successfully achieve the same answer using summarise_each as shown below:
test %>%
group_by(ID) %>%
summarise_each(funs(min = min(., na.rm = TRUE),
max = max(., na.rm = TRUE),
mean = mean(., na.rm = TRUE),
sum = sum(., na.rm = TRUE),
sd = sd(., na.rm = TRUE)))
ID Value1_min Value2_min Value1_max Value2_max Value1_mean Value2_mean Value1_sum Value2_sum
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Group1 -100 -5 2 100 -17.4 23 -122 161
2 Group2 1 2 4 4 3.14 3.29 22 23
3 Group3 1 1 3 3 2 2 6 6
When using sparklyr I have successfully been able to compute the min, max, mean, sum as shown below:
sc <- spark_connect(master = "local", version = "2.4.3")
test <- spark_read_csv(sc = sc, path = "C:\\path\\test space.csv")
test %>%
group_by(ID) %>%
summarise_each(funs(min = min(., na.rm = TRUE),
max = max(., na.rm = TRUE),
mean = mean(., na.rm = TRUE),
sum = sum(., na.rm = TRUE)))
# Source: spark<?> [?? x 9]
ID Value1_min Value_2_min Value1_max Value_2_max Value1_mean Value_2_mean Value1_sum Value_2_sum
<chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
1 Group2 1 2 4 4 3.14 3.29 22 23
2 Group3 1 1 3 3 2 2 6 6
3 Group1 -100 -5 2 100 -17.4 23 -122 161
But I get error messages when trying to obtain the sd and sum(!is.na(.)). Below are the code and the error message I am receiving. Is there any workaround to aggregate these values?
test %>%
group_by(ID) %>%
summarise_each(funs(min = min(., na.rm = TRUE),
max = max(., na.rm = TRUE),
mean = mean(., na.rm = TRUE),
sum = sum(., na.rm = TRUE),
sd = sd(., na.rm = TRUE)))
Error: org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'AS' expecting ')'(line 1, pos 298)
== SQL ==
SELECT `ID`, MIN(`Value1`) AS `Value1_min`, MIN(`Value_2`) AS `Value_2_min`, MAX(`Value1`) AS `Value1_max`, MAX(`Value_2`) AS `Value_2_max`, AVG(`Value1`) AS `Value1_mean`, AVG(`Value_2`) AS `Value_2_mean`, SUM(`Value1`) AS `Value1_sum`, SUM(`Value_2`) AS `Value_2_sum`, stddev_samp(`Value1`, TRUE AS `na.rm`) AS `Value1_sd`, stddev_samp(`Value_2`, TRUE AS `na.rm`) AS `Value_2_sd`
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------^^^
FROM `test_space_30172a44_c0aa_4305_9a5e_d45fa77ba0b9`
GROUP BY `ID`
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:241)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:117)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:69)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
at sun.reflect.GeneratedMethodAccessor66.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sparklyr.Invoke.invoke(invoke.scala:147)
at sparklyr.StreamHandler.handleMethodCall(stream.scala:136)
at sparklyr.StreamHandler.read(stream.scala:61)
at sparklyr.BackendHandler$$anonfun$channelRead0$1.apply$mcV$sp(handler.scala:58)
at scala.util.control.Breaks.breakable(Breaks.scala:38)
at sparklyr.BackendHandler.channelRead0(handler.scala:38)
at sparklyr.BackendHandler.channelRead0(handler.scala:14)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
In addition: Warning messages:
1: Named arguments ignored for SQL stddev_samp
2: Named arguments ignored for SQL stddev_samp

The problem is the na.rm parameter. Spark's stddev_samp function has no such parameter and sparklyr doesn't seem to handle it.
SQL aggregate functions always ignore missing (NULL) values, so you don't need to specify na.rm.
test_spark %>%
group_by(ID) %>%
summarise_each(funs(min = min(.),
max = max(.),
mean = mean(.),
sum = sum(.),
sd = sd(.)))
#> # Source: spark<?> [?? x 11]
#> ID Value1_min Value2_min Value1_max Value2_max Value1_mean Value2_mean
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Group2 1 2 4 4 3.14 3.29
#> 2 Group1 -100 -5 2 100 -17.4 23
#> 3 Group3 1 1 3 3 2 2
#> Value1_sum Value2_sum Value1_sd Value2_sd
#> <dbl> <dbl> <dbl> <dbl>
#> 1 22 23 1.21 0.951
#> 2 -122 161 36.6 38.6
#> 3 6 6 1 1
This looks like a bug specific to summarise as sd with na.rm works fine with mutate.
test_spark %>%
group_by(ID) %>%
mutate_each(funs(sd = sd(., na.rm = TRUE)))
For sum(!is.na(.)), you just need to write it as sum(ifelse(is.na(.), 0, 1)).
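The rewrite works because both expressions count the non-missing values. A quick check in plain local R (this does not run against Spark; it only shows that the two forms agree on an ordinary vector):

```r
# Toy vector with two missing values
x <- c(1, NA, 3, NA, 5)

count_logical <- sum(!is.na(x))              # the usual dplyr idiom
count_ifelse  <- sum(ifelse(is.na(x), 0, 1)) # the form suggested for sparklyr

count_logical
count_ifelse
```

Both give 3 here, the number of non-NA entries.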

Superfluous columns returned by dplyr::summarise() function

I am having some trouble with the new dplyr::summarise() function
Here is the data
df <- data.frame(id = factor(1:10),
group = factor(rep(letters[1:2],each = 5)),
w1 = rnorm(10),
w2 = rnorm(10),
w3 = rnorm(10),
dummy = as.character(LETTERS[1:10]),
stringsAsFactors = F)
Now I want to get means and standard deviations for the numeric variables only. So I ran the following code
df %>%
dplyr::select(id, group, w1:w3) %>%
group_by(group) %>%
dplyr::summarise(across(where(is.numeric), ~ mean(.x, na.rm = T), .names = "mean_{col}"),
across(where(is.numeric), ~ sd(.x, na.rm = T), .names = "sd_{col}"),
count = n())
Which gives me the following output
# A tibble: 2 x 11
# group mean_w1 mean_w2 mean_w3 sd_w1 sd_w2 sd_w3 sd_mean_w1 sd_mean_w2 sd_mean_w3 count
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# a -0.399 0.152 -0.151 1.07 0.703 1.15 NA NA NA 5
# b 0.560 -0.107 -0.0439 1.18 0.612 0.862 NA NA NA 5
Now the columns starting with mean_ and sd_ are exactly what I want, but I'm also getting this set of sd_mean_ columns, I assume because it is trying to find the sd of the new mean_ columns.
How do I get the output without the superfluous columns?

The issue is that by the time the second across() is evaluated, the number of numeric columns has increased, so the sd function is applied to the new mean_ columns as well. To avoid this, apply multiple functions in the same across() using list().
library(dplyr)
df %>%
group_by(group) %>%
summarise(across(where(is.numeric), list(mean = ~mean(., na.rm = TRUE),
sd = ~sd(., na.rm = TRUE)),
.names = "{fn}_{col}"),
count = n())
# group mean_w1 sd_w1 mean_w2 sd_w2 mean_w3 sd_w3 count
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#1 a 0.0746 0.696 0.760 1.39 0.0530 1.29 5
#2 b 0.522 0.686 0.0979 0.566 -0.0133 1.12 5
Also, your attempt would work as expected if you don't select columns by their type:
df %>%
group_by(group) %>%
summarise(across(w1:w3, ~ mean(.x, na.rm = T), .names = "mean_{col}"),
across(w1:w3, ~ sd(.x, na.rm = T), .names = "sd_{col}"),
count = n())
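To see the difference on data that can be checked by hand, here is a minimal sketch comparing the two forms (the toy data frame and the names two_calls and one_call are made up for illustration):

```r
library(dplyr)

# Hypothetical deterministic data
toy <- data.frame(group = rep(c("a", "b"), each = 3),
                  w1 = c(1, 2, 3, 4, 5, 6),
                  w2 = c(2, 4, 6, 1, 1, 1))

# Two separate across() calls selecting by type: the second call also
# sees the numeric mean_ columns created by the first one
two_calls <- toy %>%
  group_by(group) %>%
  summarise(across(where(is.numeric), ~ mean(.x), .names = "mean_{.col}"),
            across(where(is.numeric), ~ sd(.x), .names = "sd_{.col}"))

# One across() call with a list of functions: each function is applied
# to the original columns only
one_call <- toy %>%
  group_by(group) %>%
  summarise(across(where(is.numeric),
                   list(mean = ~ mean(.x), sd = ~ sd(.x)),
                   .names = "{.fn}_{.col}"))

names(two_calls)
names(one_call)
```

Comparing the two sets of names should make the extra sd_mean_ columns easy to spot in the first result.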

weighted.mean, summarise() and across()

I would like to aggregate the following dataframe (variables y and z) by number and weight it by "weight". This works as follows:
df = data.frame(number=c("a","a","a","b","c","c"), y=c(1,2,3,4,1,7),
z=c(2,2,6,8,9,1), weight =c(1,1,3,1,2,1))
aggregate = df %>%
group_by(number) %>%
summarise_at(vars(y,z), funs(weighted.mean(. , w=weight)))
Since summarise_at should no longer be used, I tried it with across(), but I wasn't successful:
aggregate = df %>%
group_by(number) %>%
summarise(across(everything(), list( mean = mean, sd = sd)))
# this works for mean but I can't just change it with "weighted.mean" etc.

We can pass an anonymous function with ~. Judging by the summarise_at call, the OP only wants summaries of columns 'y' and 'z'; using everything() would also return the mean, sd and weighted.mean of the 'weight' column, which doesn't make much sense.
library(dplyr)
df %>%
group_by(number) %>%
summarise(across(c(y, z),
list( mean = mean, sd = sd,
weighted = ~weighted.mean(., w = weight))), .groups = 'drop')
# A tibble: 3 x 7
# number y_mean y_sd y_weighted z_mean z_sd z_weighted
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 a 2 1 2.4 3.33 2.31 4.4
#2 b 4 NA 4 8 NA 8
#3 c 4 4.24 3 5 5.66 6.33
The mean and sd work fine as long as there are no NA elements. If there are NA values, we may need na.rm = TRUE (it is FALSE by default). In that case, a lambda call is useful for passing the additional parameter.
df %>%
group_by(number) %>%
summarise(across(c(y, z),
list( mean = ~mean(., na.rm = TRUE), sd = ~sd(., na.rm = TRUE),
weighted = ~weighted.mean(., w = weight))), .groups = 'drop')
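Note that the lambda above still calls weighted.mean() without na.rm, so a missing value in y or z would propagate. A minimal base R sketch of what na.rm = TRUE does there (the vectors x and w are made up for illustration):

```r
# Toy values with one missing observation and its weights
x <- c(1, NA, 3)
w <- c(1, 2, 3)

wm_default <- weighted.mean(x, w)                # NA propagates by default
wm_narm    <- weighted.mean(x, w, na.rm = TRUE)  # drops the NA and its weight: (1*1 + 3*3) / (1 + 3)

wm_default
wm_narm
```

Inside across() this would read ~ weighted.mean(., w = weight, na.rm = TRUE), assuming the missing values are in the summarised columns rather than in weight.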
