Moving mean as a function in dplyr

I'd like to create a function that can calculate the moving mean for a variable number of last observations and different variables. Take this as mock data:
df = expand.grid(site = factor(seq(10)),
                 year = 2000:2004,
                 day = 1:50)
df$temp = rpois(dim(df)[1], 5)
Calculating for one variable and a fixed number of past observations works. For example, this calculates the average temperature over the last 5 days:
library(dplyr)
library(zoo)
df <- df %>%
  group_by(site, year) %>%
  arrange(site, year, day) %>%
  mutate(almost_avg = rollmean(x = temp, 5, align = "right", fill = NA)) %>%
  mutate(avg = lag(almost_avg, 1))
So far so good. But my attempt to wrap this in a function fails.
avg_last_x <- function(dataframe, column, last_x) {
  dataframe <- dataframe %>%
    group_by(site, year) %>%
    arrange(site, year, day) %>%
    mutate(almost_avg = rollmean(x = column, k = last_x, align = "right", fill = NA)) %>%
    mutate(avg = lag(almost_avg, 1))
  return(dataframe)
}
avg_last_x(dataframe = df, column = "temp", last_x = 10)
I get this error:
Error in mutate_impl(.data, dots) : k <= n is not TRUE
I understand this is probably related to the evaluation mechanism in dplyr, but I can't get it fixed.
Thanks in advance for your help.

The problem is that inside mutate() the argument column is just the string "temp", so rollmean() sees a length-one vector and the check k <= n fails. Building the expression with lazyeval and the underscored mutate_() should fix it:
library(lazyeval)

avg_last_x <- function(dataframe, column, last_x) {
  dataframe %>%
    group_by(site, year) %>%
    arrange(site, year, day) %>%
    mutate_(almost_avg = interp(~rollmean(x = c, k = last_x, align = "right", fill = NA),
                                c = as.name(column)),
            avg = ~lag(almost_avg, 1))
}
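mutate_() and lazyeval are now deprecated; for reference, a sketch of the same function with current tidy evaluation (assuming dplyr >= 1.0, with the column still passed as a string) could look like this:
library(dplyr)
library(zoo)

avg_last_x <- function(dataframe, column, last_x) {
  dataframe %>%
    group_by(site, year) %>%
    arrange(site, year, day) %>%
    # .data[[column]] looks the column up by its string name within the data
    mutate(almost_avg = rollmean(x = .data[[column]], k = last_x,
                                 align = "right", fill = NA),
           avg = lag(almost_avg, 1)) %>%
    ungroup()
}

avg_last_x(dataframe = df, column = "temp", last_x = 10)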

group “weighted” rolling mean while excluding own group value when a group has multiple observations

I'm trying to calculate a group "weighted" rolling mean while excluding the own group's value when a group has multiple observations. This is related to my earlier question group "weighted" mean with multiple grouping variables and excluding own group value. The key difference is that the earlier method is not readily applicable, since a group now has multiple observations.
Based on the following dataset, here's the operation I want to apply: within each state-year, each row should get the weighted mean of value over all counties except its own. For instance, the new variable for the first two rows will be 19*9/18 + 48*3/18 + 6*2/18 + 31*4/18 ≈ 25.06. The next two rows will take 81*1/10 + 52*3/10 + 6*2/10 + 31*4/10 = 37.3, and so on.
set.seed(57)
df <- data.frame(
  state = rep(c("AL", "CA"), each = 12),
  year = rep(c(2011:2012), 12),
  county = rep(letters[1:6], each = 4),
  value = sample(100, 24),
  wt = sample(10, 24, replace = T)
) %>% arrange(state, year)
If I apply the following code, the issue is that the other observation from the same county is still included in the weighted mean.
df %>%
  group_by(state, year) %>%
  mutate(new_val = purrr::map_dbl(row_number(),
                                  ~ weighted.mean(value[-.x], wt[-.x])))
As a workaround, I've tried the following (compute the weighted mean within each county-year first, then apply the code above), but the two approaches do not produce the same results, though they are somewhat similar.
df %>%
  group_by(state, county, year) %>%
  mutate(wp = weighted.mean(value, wt),
         wt2 = sum(wt)) %>%
  distinct(state, year, county, wp, wt2) %>%
  ungroup() %>%
  group_by(state, year) %>%
  mutate(new_val = purrr::map_dbl(row_number(),
                                  ~ weighted.mean(wp[-.x], wt2[-.x])))
Thank you for taking the time to read this!
I found an answer, but I'm sure that this is not the best approach. Any other suggestions would be very helpful for future reference.
library(purrr)

x <- c(rep(c(letters[1:3]), 2), rep(c(letters[4:6]), 2))
year <- rep(rep(c(2011:2012), each = 3), times = 2)
state <- rep(c("AL", "CA"), each = 6)

get_wv <- function(x, year, state) {
  weighted.mean(df$value[df$county != x & df$year == year & df$state == state],
                df$wt[df$county != x & df$year == year & df$state == state])
}

res <- pmap(.l = list(x, year, state), .f = get_wv)
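A fully vectorized alternative (not from the original thread; just a sketch of the same leave-own-county-out logic) computes the state-year totals once and subtracts each county's own contribution:
library(dplyr)

df %>%
  group_by(state, year) %>%
  mutate(tot_vw = sum(value * wt),   # state-year sum of value * weight
         tot_w  = sum(wt)) %>%       # state-year sum of weights
  group_by(state, year, county) %>%
  # leave-own-county-out weighted mean:
  # (total weighted sum - own county's weighted sum) / (total weight - own county's weight)
  mutate(new_val = (tot_vw - sum(value * wt)) / (tot_w - sum(wt))) %>%
  ungroup() %>%
  select(-tot_vw, -tot_w)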

Skip "zero" level of dichotomous variables in expss tables

I want to create a summary table for some dichotomous variables using the expss package. Since the variables are dichotomous, one of the two levels would be sufficient to "show the picture".
I tried to use the function tab_net_cells, but was not able to get the right results. Here is some example code with BrCa (breast cancer) coded as 1 or 0. I only want to show the number of patients with, but not without, breast cancer.
df <- data.frame(BrCa = c(1,1,1,0,0,0,NA,NA,0,0))
df$group <- c(1,2,1,2,1,2,1,2,1,2)
df %>%
  expss::tab_cols(group) %>%
  expss::tab_cells(BrCa) %>%
  expss::tab_stat_cpct(total_row_position = "none", label = "%") %>%
  expss::tab_stat_cases(total_row_position = "none", label = "N") %>%
  expss::tab_pivot(stat_position = "inside_rows")

df %>%
  expss::tab_cols(group) %>%
  expss::tab_cells(BrCa) %>%
  expss::tab_net_cells("BrCa" = eq(1)) %>%
  expss::tab_stat_cpct(total_row_position = "none", label = "%") %>%
  expss::tab_stat_cases(total_row_position = "none", label = "N") %>%
  expss::tab_pivot(stat_position = "inside_rows")
The simplest way is to filter the resulting table:
df <- data.frame(BrCa = c(1,1,1,0,0,0,NA,NA,0,0))
df$group <- c(1,2,1,2,1,2,1,2,1,2)
df %>%
  expss::tab_cols(group) %>%
  expss::tab_cells(BrCa) %>%
  expss::tab_stat_cpct(total_row_position = "none", label = "%") %>%
  expss::tab_stat_cases(total_row_position = "none", label = "N") %>%
  expss::tab_pivot(stat_position = "inside_rows") %>%
  expss::where(grepl(1, row_labels))
Another way is to use mean and sum instead of cpct and cases:
df %>%
  expss::tab_cols(group) %>%
  expss::tab_cells(BrCa*100) %>%
  expss::tab_stat_mean(label = "%") %>%
  expss::tab_stat_sum(label = "N") %>%
  expss::tab_pivot(stat_position = "inside_rows")
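The reason the mean/sum trick works is that for a 0/1 variable the mean is the share of 1s and the sum is the count of 1s. A quick base-R check on the mock data (not part of the original answer):
with(df, tapply(BrCa * 100, group, mean, na.rm = TRUE))  # % with BrCa in each group
with(df, tapply(BrCa, group, sum, na.rm = TRUE))         # N with BrCa in each group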

Use variable names in function in dplyr for sum and cumsum

dplyr programming question here. I'm trying to write a dplyr function which takes column names as inputs and also filters on a minimum date supplied to the function. What I am trying to recreate is the following, called test:
# test df
x <- sample(1:100, 10)
y <- sample(c(TRUE, FALSE), 10, replace = TRUE)
date <- seq(as.Date("2018-01-01"), as.Date("2018-01-10"), by = 1)
my_df <- data.frame(x = x, y = y, date = date)

test <- my_df %>%
  group_by(date) %>%
  summarise(total = n(), total_2 = sum(y == TRUE, na.rm = TRUE)) %>%
  mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
  ungroup() %>%
  filter(date >= "2018-01-03")
The function I am testing is as follows:
cumsum_df <- function(data, date_field, cumulative_y, minimum_date = "2017-04-21") {
  date_field <- enquo(date_field)
  cumulative_y <- enquo(cumulative_y)
  data %>%
    group_by(!!date_field) %>%
    summarise(total = n(), total_2 = sum(!!cumulative_y == TRUE, na.rm = TRUE)) %>%
    mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
    ungroup() %>%
    filter((!!date_field) >= minimum_date)
}

test2 <- cumsum_df(data = my_df, date_field = date, cumulative_y = y, minimum_date = "2018-01-03")
I have looked at some examples of using enquo, and this thread gets me half way there:
Use variable names in functions of dplyr
But the issue is that I get two different data frame outputs for test and test2. The output from the function does not have the correct counts from the referenced logical y column.
I also tried this instead
cumsum_df <- function(data, date_field, cumulative_y, minimum_date = "2017-04-21") {
  date_field <- enquo(date_field)
  cumulative_y <- deparse(substitute(cumulative_y))
  data %>%
    group_by(!!date_field) %>%
    summarise(total = n(), total_2 = sum(data[[cumulative_y]] == TRUE, na.rm = TRUE)) %>%
    mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
    ungroup() %>%
    filter((!!date_field) >= minimum_date)
}

test2 <- cumsum_df(data = my_df, date_field = date, cumulative_y = y, minimum_date = "2018-01-04")
Based on this thread: Pass a data.frame column name to a function
But the output in test2 is also wildly different and seems to do some kind of recursive accumulation, which again differs from my test data frame.
If anyone can help that would be much appreciated.
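No answer is included for this question in this excerpt, but two points may help. In the second attempt, data[[cumulative_y]] is the whole, ungrouped column, so each per-date group sums every value of y and the cumsum then snowballs, which matches the "recursive accumulation" observed. For the first attempt, the enquo()/!! plumbing can be replaced by the curly-curly operator (rlang >= 0.4); a sketch, not from the original thread:
library(dplyr)

cumsum_df <- function(data, date_field, cumulative_y, minimum_date = "2017-04-21") {
  data %>%
    group_by({{ date_field }}) %>%
    summarise(total = n(),
              total_2 = sum({{ cumulative_y }} == TRUE, na.rm = TRUE)) %>%
    mutate(cumulative_a = cumsum(total),
           cumulative_b = cumsum(total_2)) %>%
    ungroup() %>%
    filter({{ date_field }} >= minimum_date)
}

test2 <- cumsum_df(my_df, date_field = date, cumulative_y = y, minimum_date = "2018-01-03")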

Pretty tables with cumulative count / percentage and group totals using R "tables" package

I am trying to produce a formatted html table which has columns for frequency, cumulative frequency, column percentage, and cumulative column percentage. The table should also have the data subsetted by a grouping variable, and including a group total.
I can almost achieve this using a combination of dplyr and tidyr, but the output is a data frame which doesn't look so pretty. I wonder if there is an easier way using the tables::tabular command?
# Sample data
dat <- data.frame(
  id = 1:100,
  group = factor(sample(c("A", "B"), 100, replace = TRUE)),
  sessions = factor(sample(1:10, 100, replace = TRUE))
)
# dplyr/tidyr solution
library(dplyr)
library(tidyr)
dat %>%
  group_by(group, sessions) %>%
  tally() %>%
  spread(key = group, value = n) %>%
  mutate(All = rowSums(.[-1])) %>%
  gather(key = group, value = n, -sessions) %>%
  group_by(group) %>%
  mutate(
    cum_n = cumsum(n),
    p = round(n / sum(n) * 100, 1),
    cum_p = round(cum_n / sum(n) * 100, 1)
  ) %>%
  data.frame() %>%
  reshape(timevar = "group", idvar = "sessions", direction = "wide")
# As far as I get using tables::tabular
library(tables)
tabular(
  Factor(sessions, "Sessions") ~
    (Heading() * group + 1) *
    (
      (n = 1) +
      # (cum_n = ??) +
      Heading("%") * Percent(denom = "col") * Format(digits = 2)
      # + Heading("cum_%") * ?? * Format(digits = 2)
    ),
  data = dat
)
I would recommend using knitr::kable and kableExtra, excellent packages for producing tables. You can also set them up for multiple output formats, for example using the same code to produce HTML and LaTeX for PDF.
library(dplyr)
library(tidyr)
library(knitr)
library(kableExtra)

dat %>%
  group_by(group, sessions) %>%
  tally() %>%
  spread(key = group, value = n) %>%
  mutate(All = rowSums(.[-1])) %>%
  gather(key = group, value = n, -sessions) %>%
  group_by(group) %>%
  mutate(
    cum_n = cumsum(n),
    p = round(n / sum(n) * 100, 1),
    cum_p = round(cum_n / sum(n) * 100, 1)
  ) %>%
  data.frame() %>%
  reshape(timevar = "group", idvar = "sessions", direction = "wide") %>%
  kable("html") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
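As a usage note on the multiple-format point (a sketch; summary_df is a placeholder for the reshaped data frame produced by the pipeline above):
# the same table rendered for LaTeX/PDF instead of HTML
summary_df %>%
  kable("latex", booktabs = TRUE) %>%
  kable_styling(latex_options = c("striped", "hold_position"))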

working with paired data across groups in the tidyverse

I have multiple observations from each of a few groups and I'd like to make a matrix of QQ plots (or another type of plot), comparing each group to every other group.
Here's an example of what I'm talking about:
library(tidyverse)
set.seed(27599)
n <- 30
d <- data_frame(person = c(rep('Alice', n),
                           rep('Bob', n),
                           rep('Charlie', n),
                           rep('Danielle', n)),
                score = c(rnorm(n = n),
                          rnorm(n = n, mean = 0.1),
                          rnorm(n = n, sd = 2),
                          rnorm(n = n, mean = 0.3, sd = 1.4)))

by_hand <- data_frame(a = sort(d$score[d$person == 'Alice']),
                      b = sort(d$score[d$person == 'Bob']),
                      c = sort(d$score[d$person == 'Charlie']),
                      d = sort(d$score[d$person == 'Danielle']))

pairs(x = by_hand,
      lower.panel = function(x, y) { points(x, y); abline(0, 1) })
Here, I've manipulated the data by hand and used graphics::pairs() to make the plot. Can the same be done inside the tidyverse?
Here's what I've tried.
d %>%
  group_by(person) %>%
  mutate(score = sort(score)) %>%
  glimpse()
This seems promising.
d %>%
  group_by(person) %>%
  mutate(score = sort(score)) %>%
  spread(key = person, value = score)
This gives the 'duplicate identifiers' error.
Maybe reshape2 would be better to use here?
library(reshape2)

d %>%
  group_by(person) %>%
  mutate(score = sort(score)) %>%
  dcast(formula = score ~ person)
This creates a data.frame with 120 rows, and most of the values (90 per person) are NA. How can I create a wide data.frame without introducing so many NA?
You need a variable that links the row position for each person. Try
by_tidyverse <- d %>%
  group_by(person) %>%
  mutate(rowID = 1:n(),
         score = sort(score)) %>%
  spread(key = person, value = score) %>%
  select(-rowID)

pairs(x = by_tidyverse,
      lower.panel = function(x, y) { points(x, y); abline(0, 1) })
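spread() has since been superseded in tidyr; an equivalent sketch with pivot_wider() (assuming tidyr >= 1.0):
library(dplyr)
library(tidyr)

by_tidyverse <- d %>%
  group_by(person) %>%
  mutate(rowID = row_number(),   # links the row position within each person
         score = sort(score)) %>%
  ungroup() %>%
  pivot_wider(names_from = person, values_from = score) %>%
  select(-rowID)

pairs(x = by_tidyverse,
      lower.panel = function(x, y) { points(x, y); abline(0, 1) })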
