Using summarise from package dplyr - R

I would like to know how many values are used to calculate the mean when using the summarize function
table<- df %>% group_by(x) %>% summarise_if(is.numeric, mean, na.rm = TRUE)

Add a count summary too, by testing each value with is.na() and summing the values that are not missing.
Note that summarise_if() has been superseded by across().
table<- df %>% group_by(x) %>%
summarise(across(where(is.numeric), list(mean = ~ mean(.x, na.rm = TRUE), n = ~sum(!is.na(.x)))))
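To see how the count of non-missing values differs from the number of rows in a group, here is a minimal sketch. The data frame and its column names (x, v) are made up for illustration; n_rows is added only for comparison.

```r
library(dplyr)

# Hypothetical data: group "a" contains one missing value
df <- data.frame(x = c("a", "a", "a", "b", "b"),
                 v = c(1, NA, 3, 4, 6))

result <- df %>%
  group_by(x) %>%
  summarise(
    across(where(is.numeric),
           list(mean = ~ mean(.x, na.rm = TRUE),   # mean of the non-missing values
                n    = ~ sum(!is.na(.x)))),        # how many values went into that mean
    n_rows = n()                                   # rows per group, NA rows included
  )

print(result)
```

For group "a" the mean is computed from 2 values (v_n), although the group has 3 rows (n_rows).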

I may be wrong, but I believe simply using dplyr's count() should work. See below:
library(dplyr)
# Creating a demonstrative data frame
colors <- c('red', 'green', 'red', 'green', 'red', 'green', 'green')
obs <- c(1, 2, 3, 1, 5, 2, 6)
mytable <- data.frame(colors, obs)
# Checking the summarise function
mytable %>%
group_by(colors) %>%
summarise_if(is.numeric, mean)
# First approach, using summarise, n = n
mytable %>%
group_by(colors) %>%
summarise(n = n())
# Second, more elegant approach using count
mytable %>%
count(colors)
If needed, you can add in a filter or subset function to test whether data is numeric. Note that n() and count() count the rows in each group, so rows whose value is NA are included.

Related

More efficient way of taking averages over multiple lists

I have some data where I use the rsample package to create rolling windows (I use the iris data set here). The rolling_iris dataset contains a number of lists.
I would like to compute the min, max, mean and sd of each of the lists. That is, in split 1, compute the min across the first 4 columns, and so on. I currently do this by mapping over the splits, using pivot_longer to rearrange the data, computing the statistics, and finally using pivot_wider to get the data back into its original form. This is quite slow.
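library(dplyr)
library(purrr)
library(tidyr)    # pivot_longer(), pivot_wider()
library(rsample)  # rolling_origin(), analysis(), assessment()
head(iris)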
rolling_iris <- rsample::rolling_origin(iris, initial = 10, assess = 1, cumulative = FALSE, skip = 0)
rolling_iris_statistics <- map(rolling_iris$splits, ~analysis(.x) %>%
  pivot_longer(cols = 1:4) %>%
  mutate(
    min = min(value),
    max = max(value),
    mean = mean(value),
    sd = sd(value)
  ) %>%
  group_by(name) %>%
  mutate(rowID = row_number()) %>%
  pivot_wider(names_from = name, values_from = value)
)
I would like to map over each of the lists and compute the above statistics. Then once this is done scale the analysis by the following function.
Scale_Me <- function(x){
(x - min(x)) / (max(x) - min(x))
}
Additional:
rolling_iris_analysis <- map(rolling_iris$splits, ~analysis(.x))
rolling_iris_assessment <- map(rolling_iris$splits, ~assessment(.x))
EDIT:
I managed to compute the following (I am not sure if it is "faster")
analysis <- map(rolling_iris$splits, ~analysis(.x))
map(analysis, ~select(., c(1:4)) %>% as.matrix %>% mean())
The code below extracts the analysis window of each split, so rolling_iris_dfs is a list of data frames. You can then iterate over each data frame and compute the statistics.
# analysis() returns the rows of each window; the $data element of a split holds the full data set
rolling_iris_dfs <- map(rolling_iris$splits, analysis)
# The list elements are already plain data frames, so analysis() is not called again here
rolling_iris_stats <- map(rolling_iris_dfs, ~ .x %>%
  pivot_longer(cols = 1:4) %>%
  mutate(
    min = min(value),
    max = max(value),
    mean = mean(value),
    sd = sd(value)
  ) %>%
  group_by(name) %>%
  mutate(rowID = row_number()) %>%
  pivot_wider(names_from = name, values_from = value)
)
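If only the four summary statistics per window are needed, the pivoting can be skipped altogether. The following is a sketch rather than a benchmarked solution: it assumes the first four columns are numeric, and the helper name window_stats is made up for illustration.

```r
library(dplyr)
library(purrr)
library(rsample)

rolling_iris <- rolling_origin(iris, initial = 10, assess = 1,
                               cumulative = FALSE, skip = 0)

# Hypothetical helper: summarise the first four columns of one window
window_stats <- function(split) {
  values <- as.matrix(analysis(split)[, 1:4])  # numeric matrix, no reshaping needed
  data.frame(min = min(values), max = max(values),
             mean = mean(values), sd = sd(values))
}

# One row of statistics per rolling window
rolling_iris_stats <- bind_rows(map(rolling_iris$splits, window_stats))
head(rolling_iris_stats)
```

Each row of rolling_iris_stats corresponds to one split, so the values can be looked up by split index when scaling the windows later.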

Mutate within nested data frame

I would like to perform kmeans within groups and add to my data information about cluster number and center which an observation was assigned to (still, within groups so cluster 1 is not the same for group A and group B). I thought that I can pluck cluster assignment and centroid from kmeans and then maybe join these two with each other and finally, with original data. To do the former I wanted to add a row number to data frames with centers and then join by the number of cluster. But how can I add row number within nested data frames? The following code works well until the last, 'nested' mutate.
my_data <- data.frame(group = c(sample(c('A', 'B', 'C'), 20, replace = TRUE)), x = runif(100, 0, 10), y = runif(100, 0, 10))
my_data %>%
group_by(group) %>%
nest() %>%
mutate(km_cluster = map(data, ~kmeans(.x, 3) %>% pluck('cluster')),
km_centers = map(data, ~kmeans(.x, 3) %>% pluck('centers') %>% mutate(cluster = row_number())))
@Luke.sonnet provided an answer that works well with map but, interestingly, not with map2; see below:
my_data %>%
group_by(group) %>%
nest() %>%
mutate(number = sample(3:7, 3)) %>%
mutate(km_cluster = map2(data, number, ~kmeans(.x, .y) %>% pluck('cluster')),
km_centers = map2(data, number, ~kmeans(.x, .y) %>% pluck('centers') %>% as_tibble() %>% mutate(cluster = row_number())))
Any ideas how to solve the issue in that case? And equally important, what is the cause of such behaviour?
The problem is that pluck() is returning a matrix. Cast to a tibble first and number differently.
library(tidyverse)
my_data <- data.frame(group = c(sample(c('A', 'B', 'C'), 20, replace = TRUE)), x = runif(100, 0, 10), y = runif(100, 0, 10))
my_data %>%
group_by(group) %>%
nest() %>%
mutate(number = sample(3:7, 3)) %>%
mutate(km_cluster = map2(data, number, ~kmeans(.x, .y) %>% pluck('cluster')),
km_centers = map2(data, number, ~kmeans(.x, .y) %>% pluck('centers') %>% as_tibble() %>% mutate(cluster = seq_len(nrow(.)))))
Note that you can also use mutate(cluster = row_number(x)), but this gives different numbers, because row_number(x) ranks the rows by the values of x (and a bare row_number() uses the rows from the parent data frame). Since kmeans orders the matrix of centers row-wise by cluster number, I think the approach in the main chunk is correct.
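One thing to watch in the code above is that kmeans() is called twice per group, and since kmeans starts from random centers, the cluster numbers from one call need not match the centers from the other. A minimal sketch that fits once per group and joins the centers back onto the observations could look like this; the helper add_clusters and the center_ column prefix are made up for illustration, and a fixed k = 3 is assumed.

```r
library(dplyr)

set.seed(42)
my_data <- data.frame(group = sample(c("A", "B", "C"), 100, replace = TRUE),
                      x = runif(100, 0, 10),
                      y = runif(100, 0, 10))

# Hypothetical helper: fit kmeans once, then attach cluster number and center
add_clusters <- function(df, k) {
  km <- kmeans(df, centers = k)
  centers <- as.data.frame(km$centers)           # matrix -> data frame
  names(centers) <- paste0("center_", names(centers))
  centers$cluster <- seq_len(nrow(centers))      # rows are ordered by cluster number
  df$cluster <- km$cluster
  left_join(df, centers, by = "cluster")
}

clustered <- my_data %>%
  group_by(group) %>%
  group_modify(~ add_clusters(.x, k = 3)) %>%    # .x holds the group's rows without `group`
  ungroup()

head(clustered)
```

Cluster numbers are assigned within each group, so cluster 1 in group A is unrelated to cluster 1 in group B.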

Use variable names in function in dplyr for sum and cumsum

A dplyr programming question. I am trying to write a function that takes column names as inputs and also filters on a value passed to the function. What I am trying to recreate is the following data frame, called test:
#test df
x<- sample(1:100, 10)
y<- sample(c(TRUE, FALSE), 10, replace = TRUE)
date<- seq(as.Date("2018-01-01"), as.Date("2018-01-10"), by =1)
my_df<- data.frame(x = x, y =y, date =date)
test<- my_df %>% group_by(date) %>%
summarise(total = n(), total_2 = sum(y ==TRUE, na.rm=TRUE)) %>%
mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
ungroup() %>% filter(date >= "2018-01-03")
The function I am testing is as follows:
cumsum_df<- function(data, date_field, cumulative_y, minimum_date = "2017-04-21") {
date_field <- enquo(date_field)
cumulative_y <- enquo(cumulative_y)
data %>% group_by(!!date_field) %>%
summarise(total = n(), total_2 = sum(!!cumulative_y ==TRUE, na.rm=TRUE)) %>%
mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
ungroup() %>% filter((!!date_field) >= minimum_date)
}
test2<- cumsum_df(data = my_df, date_field = date, cumulative_y = y, minimum_date = "2018-01-03")
I have looked at some examples of using enquo, and this thread gets me halfway there:
Use variable names in functions of dplyr
But the issue is that I get two different data frame outputs for test and test2. The output from the function does not contain the data from the logical column referenced by y.
I also tried this instead
cumsum_df<- function(data, date_field, cumulative_y, minimum_date = "2017-04-21") {
date_field <- enquo(date_field)
cumulative_y <- deparse(substitute(cumulative_y))
data %>% group_by(!!date_field) %>%
summarise(total = n(), total_2 = sum(data[[cumulative_y]] ==TRUE, na.rm=TRUE)) %>%
mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
ungroup() %>% filter((!!date_field) >= minimum_date)
}
test2<- cumsum_df(data= my_df, date_field = date, cumulative_y = y, minimum_date = "2018-01-04")
Based on this thread: Pass a data.frame column name to a function
But the output of test2 is also wildly different, and it seems to do some kind of recursive accumulation, which again differs from my test data frame.
If anyone can help that would be much appreciated.
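Two details in the attempts above probably explain the differences. In the first, R parses !!cumulative_y == TRUE as !!(cumulative_y == TRUE), because ! binds less tightly than ==, so the comparison happens before the unquoting; writing (!!cumulative_y) == TRUE avoids that. In the second, data[[cumulative_y]] refers to the whole column of the original data frame rather than to the rows of the current group. A minimal sketch of one possible fix, assuming a dplyr version that supports the {{ }} (embrace) operator:

```r
library(dplyr)

cumsum_df <- function(data, date_field, cumulative_y, minimum_date = "2017-04-21") {
  data %>%
    group_by({{ date_field }}) %>%
    summarise(total = n(),
              # {{ }} forwards the column, so the comparison sees the group's values
              total_2 = sum({{ cumulative_y }} == TRUE, na.rm = TRUE)) %>%
    mutate(cumulative_a = cumsum(total),
           cumulative_b = cumsum(total_2)) %>%
    ungroup() %>%
    filter({{ date_field }} >= as.Date(minimum_date))
}

# Same test data as in the question, with a seed added for reproducibility
set.seed(1)
my_df <- data.frame(x = sample(1:100, 10),
                    y = sample(c(TRUE, FALSE), 10, replace = TRUE),
                    date = seq(as.Date("2018-01-01"), as.Date("2018-01-10"), by = 1))

test2 <- cumsum_df(my_df, date_field = date, cumulative_y = y,
                   minimum_date = "2018-01-03")
```

The minimum date is converted with as.Date() so that the filter compares dates with dates.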

Difference between double brackets and the $ sign

Since I have to use a function in a loop, I have to use the dplyr functions group_by_at() and summarise_at(). Unfortunately, when I select columns by index, I am not able to use the complete() function from tidyr to prevent empty groups from being removed. Or is there another option to prevent dplyr from dropping empty groups?
library(dplyr)
library(tidyr)  # complete() lives in tidyr; loading plyr after dplyr would mask summarise()
df1 <- mtcars %>%
group_by(gear) %>%
summarise(Mittelwert = mean(mpg, na.rm = TRUE)) %>%
complete(gear, fill = list(Gewicht = 1))
df1
df2 <- mtcars %>%
group_by_at(10) %>%
summarise_at(1, mean, na.rm = TRUE) %>%
complete(gear, fill = list(Gewicht = 1))
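One option that may help here is the .drop argument of group_by(), which keeps empty groups when the grouping column is a factor. The sketch below is only an illustration: the extra level 6 is invented so that an empty group exists, and the grouping column is looked up by position and then passed by name.

```r
library(dplyr)

# Hypothetical setup: make gear a factor with one level (6) that has no rows
cars <- mtcars %>%
  mutate(gear = factor(gear, levels = c(3, 4, 5, 6)))

group_col <- names(cars)[10]   # position 10 is "gear"

gear_summary <- cars %>%
  group_by(across(all_of(group_col)), .drop = FALSE) %>%  # keep empty factor levels
  summarise(n = n(), Mittelwert = mean(mpg, na.rm = TRUE))

print(gear_summary)
```

The empty group shows up with n = 0, and its mean is NaN because there are no values to average.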

summarise mean of a specific column in dplyr

I would like to summarise a grouped data.frame without knowing the name of the column. But what I know is, that the feature is always at position 3 (column) in this data.frame, is that possible?
library(dplyr)
# tibble() replaces the deprecated data_frame()
df <- tibble(date = rep(c("2017-01-01", "2017-01-02", "2017-01-03"), 2),
group = rep(c("A", "B"), 3),
temperature = runif(6, -10, 30),
percipitation = runif(6, 0,5)
)
parameter <- "perc"
df1 <- df %>%
select(date, group, starts_with(parameter)) %>%
group_by(group) %>%
summarise(
avg = mean(percipitation)
)
In this example the code works, but of course only for the parameter 'perc' and not for 'temp' or so.
avg = mean(df[[3]])
or something like this doesn't work. Any suggestions?
You could keep just the grouping variable and the third column using select(group, 3). The function summarise_all() can then be used to calculate the mean.
df %>%
select(group, 3) %>%
group_by(group) %>%
summarise_all(mean)  # funs() is deprecated; the function can be passed directly
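Another way to address a column by position is to look up its name first and then refer to it through the .data pronoun, which avoids dropping the other columns. This is a sketch under the assumption that the grouping column is called group, as in the question; the helper name mean_by_position and the sample values are made up for illustration.

```r
library(dplyr)

# Hypothetical helper: average the column at position `pos` within each group
mean_by_position <- function(data, pos) {
  col <- names(data)[pos]   # resolve the position to a column name once
  data %>%
    group_by(group) %>%
    summarise(avg = mean(.data[[col]], na.rm = TRUE))
}

# Small example with known values instead of runif()
df <- tibble(date = rep(c("2017-01-01", "2017-01-02", "2017-01-03"), 2),
             group = rep(c("A", "B"), 3),
             temperature = c(1, 2, 3, 4, 5, 6),
             percipitation = c(0, 1, 0, 1, 0, 1))

result <- mean_by_position(df, 3)   # averages temperature per group
```

Calling mean_by_position(df, 4) would average the fourth column in the same way.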
