How to combine summarize and summarize_if in dplyr

I would like to combine a summarize statement (to count the number of observations) with a summarise_if statement (to summarise all numeric variables).
Using data("iris"), I would like to:
1. Count the number of observations per Species and add this count as a column in the new table.
2. Summarise all numeric variables (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) by Species.
I can do these steps separately with the code below:
Number 1.
library(dplyr)
iris %>%
  group_by(Species) %>%
  summarise(n = n())
Number 2.
iris %>%
  group_by(Species) %>%
  summarise_if(is.numeric, median, na.rm = TRUE)
Q: How to combine these calculations into one step?
Just piping one after the other gives me a different result. My desired output is this:

Use across:
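library(dplyr) # across() requires dplyr >= 1.0
iris %>%
  group_by(Species) %>%
  # a lambda is used because passing na.rm through across()'s ... is deprecated since dplyr 1.1.0
  summarise(n = n(), across(where(is.numeric), ~ median(.x, na.rm = TRUE)))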
For those interested, the same thing in data.table:
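library(data.table)
iris_dt <- as.data.table(iris) # a copy; the built-in iris data frame is left unchanged
num_cols <- names(iris_dt)[sapply(iris_dt, is.numeric)]
# .N is the group size; .SD holds the columns listed in .SDcols
iris_dt[, c(list(n = .N), lapply(.SD, median, na.rm = TRUE)),
        .SDcols = num_cols,
        by = Species]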

Related

dplyr summarise and then summarise_at in the same pipe

This question has come up before and there are some solutions but none that I could find for this specific case. e.g.
library(dplyr)
library(ggplot2) # provides the diamonds dataset
my_diamonds <- diamonds %>%
  mutate(blah_var1 = rnorm(n()),
         blah_var2 = rnorm(n()),
         blah_var3 = rnorm(n()),
         blah_var4 = rnorm(n()),
         blah_var5 = rnorm(n()))
my_diamonds %>%
  group_by(cut) %>%
  summarise(MaxClarity = max(clarity),
            MinTable = min(table), .groups = 'drop') %>%
  summarise_at(vars(contains('blah')), mean)
I want a new data frame showing the max clarity, the min table and the mean of each of the blah variables. The above returned an empty tibble. Based on some other SO posts I tried using mutate and then summarise_at:
my_diamonds %>%
  group_by(cut) %>%
  mutate(MaxClarity = max(clarity),
         MinTable = min(table)) %>%
  summarise_at(vars(contains('blah')), mean)
This returns a tibble, but only for the blah variables; MaxClarity and MinTable are missing.
Is there a way to combine summarise and summarise_at in the same dplyr chain?

One issue with chaining the calls is that after the first summarise we are left with only the grouping column ('cut') and the summarised columns ('MaxClarity' and 'MinTable'), so the 'blah' columns are no longer available. In addition, the grouping is removed by .groups = 'drop'. Instead, compute everything in a single summarise with across:
library(dplyr) # version >= 1.0
my_diamonds %>%
  group_by(cut) %>%
  summarise(MaxClarity = max(clarity),
            MinTable = min(table),
            # lambda form: passing na.rm through across()'s ... is deprecated since dplyr 1.1.0
            across(contains('blah'), ~ mean(.x, na.rm = TRUE)), .groups = 'drop')
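To see why the chained version loses columns, it can help to look at what the first summarise() returns. The sketch below uses the built-in mtcars data as a stand-in for my_diamonds (an assumption made only to keep the example self-contained); step1 and one_call are illustrative names:

```r
library(dplyr)

# After the first summarise() only the grouping column and the new summary remain,
# so a later summarise_at()/across() has no other columns left to work on.
step1 <- mtcars %>%
  group_by(cyl) %>%
  summarise(MaxHp = max(hp), .groups = "drop")
names(step1)

# Doing everything in one summarise() keeps all columns available.
one_call <- mtcars %>%
  group_by(cyl) %>%
  summarise(MaxHp = max(hp),
            across(c(mpg, wt), ~ mean(.x, na.rm = TRUE)),
            .groups = "drop")
names(one_call)
```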

aggregate data by 5min excluding max and min

I have a data frame like so:
Time <- seq.POSIXt(as.POSIXct("2017-11-14 00:01:00 CET"), as.POSIXct("2017-11-14 00:15:00 CET"), units = "minute", by=60)
A <- c(2,3,5,2,5,8,17,3,5,8,17,3,5,1,5)
B <- c(1,1,2,1,2,1,2,2,2,4,6,7,8,8,9)
DF <- data.frame(Time=Time, A=A, B=B)
and I want a "newDF" in which the data are aggregated in 5-minute intervals, excluding, for each column, the max and min values before aggregating.
Using dplyr I get to something like this:
library(dplyr)
library(lubridate) # for floor_date()
DF$TimeStamp_round <- floor_date(DF$Time, unit = "5 minutes")
DF <- DF %>%
  group_by(TimeStamp_round) %>%
  mutate(TimeStamp_count = cur_group_id())
newDF <- DF %>%
  group_by(TimeStamp_count) %>%
  summarise(across(where(is.numeric), mean))
but I still haven't managed to exclude the max/min values before the summarise() call in newDF.
Note: I do not want to do it manually for each column, because the real DF has 350 columns.

After grouping by 'TimeStamp_round', we can drop the values equal to the group's minimum or maximum (its range) before taking the mean:
library(dplyr)
DF %>%
  group_by(TimeStamp_round) %>%
  summarise(across(A:B, ~ mean(.[!. %in% range(.)])), .groups = 'drop')
Or, if there are more columns and we want the mean only for the numeric ones:
DF %>%
  select(-Time) %>%
  group_by(TimeStamp_round) %>%
  summarise(across(where(is.numeric),
                   ~ mean(.[!. %in% range(.)])), .groups = 'drop')
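Note that the %in% range(.) filter removes every value equal to the minimum or maximum, including ties. If only one minimum and one maximum should be dropped per group, a different helper is needed. The sketch below is a swapped-in alternative, not the method from the answer; drop_one_each is a hypothetical name, and x is the first 5-minute block of column A from the question:

```r
x <- c(2, 3, 5, 2, 5)

# The answer's approach: every 2 and every 5 is removed, leaving only 3.
drop_all_extremes <- mean(x[!x %in% range(x)])

# Hypothetical helper: remove exactly one minimum and one maximum.
# sort() also drops NA values; groups with fewer than 3 values give NA.
drop_one_each <- function(v) {
  s <- sort(v)
  if (length(s) <= 2) return(NA_real_)
  mean(s[-c(1, length(s))])
}

drop_all_extremes   # mean of 3
drop_one_each(x)    # mean of 2, 3, 5
```

Such a helper could be passed to across() in place of the lambda, for example across(where(is.numeric), drop_one_each).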

Reduce a data frame by combining like rows according to two qualitative factors

I have a dataframe like the following:
observations <- data.frame(
  X = c("00KS089001","00KS089001","00KS089002","00KS089002","00KS089003","00KS089003","00KS105001","00KS105001",
        "00KS177011","00KS177011","00P0006","00P006","00P006","00P006"),
  hzdept = c(0,20,0,15,0,13,0,20,0,16,0,6,13,29),
  hzdepb = c(20,30,15,30,13,30,20,30,16,30,6,13,29,30),
  Y = c("Red","White","Red","White","Green","Red","Red","Blue",
        "Black","Black","Red","White","White","White"),
  Z = c(0.67,0.33,0.5,0.5,0.43,0.57,0.67,0.33,0.53,0.47,0.2,0.23,0.53,0.04))
I want to be able to reduce this so that any time X and Y are the same for two rows, the observations are combined, i.e.
data.frame(
  X = c("00KS089001","00KS089001","00KS089002","00KS089002","00KS089003","00KS089003","00KS105001","00KS105001",
        "00KS177011","00P0006","00P006"),
  hzdept = c(0,20,0,15,0,13,0,20,0,0,6),
  hzdepb = c(20,30,15,30,13,30,20,30,30,6,30),
  Y = c("Red","White","Red","White","Green","Red","Red","Blue",
        "Black","Red","White"),
  Z = c(0.67,0.33,0.5,0.5,0.43,0.57,0.67,0.33,1.00,0.20,0.80))
Any suggestions on how to best go about this?

Edit: OK, now that I see from your comment above how hzdept and hzdepb are supposed to be combined:
library(tidyverse)
# Sum Z within each X/Y combination
df <- observations %>% count(X, Y, wt = Z, name = "Z")
# Smallest hzdept per X/Y combination
df_hzdept <- observations %>%
  arrange(hzdept) %>%
  distinct(X, Y, .keep_all = TRUE) %>%
  select(X, Y, hzdept)
# Largest hzdepb per X/Y combination
df_hzdepb <- observations %>%
  arrange(desc(hzdepb)) %>%
  distinct(X, Y, .keep_all = TRUE) %>%
  select(X, Y, hzdepb)
# Join on the grouping keys explicitly
df <- df %>%
  left_join(df_hzdept, by = c("X", "Y")) %>%
  left_join(df_hzdepb, by = c("X", "Y"))

Using dplyr
Here is how you would group by two columns and summarise the other columns in a data frame using their minimum, maximum, and sum:
library(magrittr) # For the pipe: %>%
observations %>%
  dplyr::group_by(X, Y) %>%
  dplyr::summarise(hzdept = min(hzdept),
                   hzdepb = max(hzdepb),
                   Z = sum(Z), .groups = 'drop')

How to group by two column in R but with if statment for second?

I can't find any help on the internet.
I have 3 columns in a .sav file loaded into RStudio.
There is M with values 1,2,3,4,5,6,7 and the label weight, and N with values 1,2,3 and the label diet.
I want to group by these columns, but for the N column I only want to pick the rows where the value is 1. The last column, A, holds age data.
I wrote this:
library(dplyr)
df %>%
  group_by(M, N) %>%
  summarize(values = mean(A, na.rm = TRUE))
This groups the data, but for all values of N.
I tried something like this:
library(dplyr)
df %>%
  group_by(M, N == 1) %>%
  summarize(values = mean(A, na.rm = TRUE))
but again I got groups for all categories of N, including NA etc.
Expected: I want to group by M only (all values), using only the rows where N is 1.
What should that group_by look like?

We can group by 'M' and summarise 'A' restricted to the rows where 'N' is 1:
library(dplyr)
df %>%
  group_by(M) %>%
  summarise(values = mean(A[N == 1], na.rm = TRUE))
Another option is to put a filter in between, but this would also remove the groups of 'M' that have no rows with 'N' equal to 1:
df %>%
  filter(N == 1) %>%
  group_by(M) %>%
  summarise(values = mean(A, na.rm = TRUE))
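The difference between the two options shows up when some group of M has no row with N equal to 1. A minimal sketch with made-up data (the values in toy are invented for illustration and are not from the question's .sav file):

```r
library(dplyr)

# Toy data: M = 3 has no row with N == 1.
toy <- data.frame(M = c(1, 1, 2, 2, 3),
                  N = c(1, 2, 1, 1, 3),
                  A = c(30, 40, 25, 35, 50))

# Subsetting inside summarise() keeps every M; the empty group yields NaN.
inside <- toy %>%
  group_by(M) %>%
  summarise(values = mean(A[N == 1], na.rm = TRUE))

# Filtering first drops the M groups that have no N == 1 rows.
filtered <- toy %>%
  filter(N == 1) %>%
  group_by(M) %>%
  summarise(values = mean(A, na.rm = TRUE))
```

Which behaviour is preferable depends on whether the empty groups should remain visible in the result.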

Incorporating na.rm=TRUE into Summarise_Each for Multiple Functions in dplyr

So I have a dplyr table movie_info_comb from which I am calculating various statistics on one column metascore. Here is the code:
summarise_each_(movie_info_comb, funs(min,max,mean,sum,sd,median,IQR),"metascore")
How do I incorporate na.rm=TRUE? I've only seen examples in which one statistic is being calculated, and I'd hate to have to repeat this once for each function.
Thanks in advance.

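You can do this with lazy evaluation (note that this relies on lazyeval and summarize_each(), which later dplyr releases deprecated):
library(dplyr)
library(lazyeval)
# Build a lazy call FUN(., na.rm = TRUE) for a function given by name
na.rm = function(FUN_string)
  lazy(FUN(., na.rm = TRUE)) %>%
  interp(FUN = FUN_string %>% as.name)
# Apply the wrapper to several function names and keep the names
na.rm.apply = function(FUN_strings)
  FUN_strings %>%
  lapply(na.rm) %>%
  setNames(FUN_strings)
mtcars %>%
  select(mpg) %>%
  summarize_each(
    c("min","max","mean","sum","sd","median","IQR") %>%
      na.rm.apply)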
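With dplyr 1.0 or later, a similar result can probably be had without lazyeval by handing across() a named list of functions. This is a sketch under that assumption, not the original answer's method; stats and na_rm_fns are illustrative names, and mtcars stands in for movie_info_comb:

```r
library(dplyr)

# The statistics to compute, named so the output columns get readable suffixes.
stats <- list(min = min, max = max, mean = mean, sum = sum,
              sd = sd, median = median, IQR = IQR)

# Wrap each function once so that it is always called with na.rm = TRUE.
na_rm_fns <- lapply(stats, function(f) function(x) f(x, na.rm = TRUE))

# One output column per function, named mpg_min, mpg_max, and so on.
result <- mtcars %>%
  summarise(across(mpg, na_rm_fns))
```

The wrapping step is written once, so adding another statistic only means adding an entry to stats.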
