simple nested functions and dplyr tidyeval - r

library(tidyverse)
set.seed(1)
graph.data <- tibble(cal.date = as.Date(40100:40129, origin = "1899-12-30"),
random_num = rnorm(30, 8, 5))
This is the data frame we're working with here.
# A tibble: 30 x 2
cal.date random_num
<date> <dbl>
1 2009-10-14 4.87
2 2009-10-15 8.92
3 2009-10-16 3.82
4 2009-10-17 16.0
5 2009-10-18 9.65
6 2009-10-19 3.90
7 2009-10-20 10.4
8 2009-10-21 11.7
9 2009-10-22 10.9
10 2009-10-23 6.47
# ... with 20 more rows
I'm trying to nest two functions (I'm not sure whether "lexical scoping" is the right term), which I call child_function and parent_function.
child_function <- function(df, variable, hor.line = 6) {
variable <- enquo(variable)
df <- mutate(mutation = 2 * !!variable, horizontal.line = hor.line)
}
parent_function <- function(df, date, variable, hor.line = 6) {
date <- enquo(date)
variable <- enquo(variable)
hor.line <- enquo(hor.line)
df <- child_function(df, !!variable, !!hor.line) %>% print()
p <- ggplot(df, aes(date, mutation)) +
geom_point() +
geom_hline(aes(yintercept = !!hor.line))
p
}
When I test it all out with the line below I get "Error in !variable : invalid argument type".
parent_function(graph.data, date = cal.date, variable = random_num, hor.line=8)
I imagine I'm not using the proper dplyr tidyeval syntax. What's wrong with my functions?

Needed a bit of a cleanup, but now it should work:
library(tidyverse)
set.seed(1)
graph.data <- tibble(cal.date = as.Date(40100:40129, origin = "1899-12-30"),
random_num = rnorm(30, 8, 5))
child_function <- function(df, variable, hor.line = 6) {
variable <- enquo(variable)
df <- mutate(df, mutation := 2 * !! variable, horizontal.line := hor.line)
}
parent_function <- function(df, date, variable, hor.line = 6) {
date <- enquo(date)
variable <- enquo(variable)
df <- child_function(df, !! variable, hor.line) %>% print()
p <- ggplot(df, aes(!! date, mutation)) +
geom_point() +
geom_hline(aes(yintercept = hor.line))
p
}
parent_function(graph.data, date = cal.date, variable = random_num, hor.line=8)
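I think the main issue was that you sometimes put !! or enquo where there was no need, and left them out where they were needed: mutate() in child_function was never given df, hor.line is an ordinary value that needs neither enquo() nor !!, and date was captured with enquo() but never unquoted with !! inside aes().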
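To make the forwarding pattern concrete, here is a minimal sketch on a toy tibble, without the plotting step. The helper names double_col and wrap_double and the toy data are made up for illustration; the enquo()/!! usage mirrors the corrected functions above.

```r
library(dplyr)

# double_col() and wrap_double() are hypothetical names used only for illustration.
double_col <- function(df, variable, hor.line = 6) {
  variable <- enquo(variable)            # capture the column expression
  mutate(df,
         mutation = 2 * !!variable,      # unquote it inside mutate()
         horizontal.line = hor.line)     # plain value: no enquo()/!! needed
}

wrap_double <- function(df, variable, hor.line = 6) {
  variable <- enquo(variable)            # capture again in the outer function
  double_col(df, !!variable, hor.line)   # unquote when forwarding to the inner one
}

toy <- tibble(x = c(1, 2, 3))
res <- wrap_double(toy, x, hor.line = 8)
res$mutation         # 2 4 6
res$horizontal.line  # 8 8 8
```

The rule of thumb this illustrates: only arguments that name a column need enquo() on the way in and !! on the way out; ordinary values are passed along unchanged.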

Related

How can I automate t-test for nested variables in R?

I would like to automate the collection of summary statistics that arise from t-tests. In the example below I have nested variables Age, Location, and Treatment. For each Age & Location I would like to run a t-test based on Treatment which has the two categorical names Control & Treatment. Put another way, I would like to know about the difference between the Control and Treatment means at each Location for each Age.
I would like to run the t-tests using the col_t_welch function in matrixTests because the output already has several of the summary statistics I'm looking for (i.e., mean.diff, stderr, and pvalue). How could I set up my dataframe (df1) to be able to run a for-loop for a nested t-test?
Reproducible Example:
library(matrixTests)
library(ggplot2)
set.seed(123)
df1 <- data.frame(matrix(ncol = 4, nrow = 36))
x <- c("Age","Location","Treatment","Value")
colnames(df1) <- x
df1$Age <- as.factor(rep(c(1,2,3), each = 12))
df1$Location <- as.factor(rep(c("Central","North"), each = 6))
df1$Treatment <- as.factor(rep(c("Control","Treatment"), each = 3))
df1$Value <- round(rnorm(36,200,25),0)
# I can't get the for-loop below to work because I'm not sure how to set up the data frame, but I was thinking something along these lines.
i <- 1
p <- numeric(length = 3*2)
mean_diff <- numeric(length = 3*2)
SE_diff <- numeric(length = 3*2)
for(j in c("1", "2", "3")){
for(k in c("Control", "Treatment")){
ttest <- col_t_welch(Value, data = df1, subset = Age == j & Treatment == k))
p[i] <- a$pvalue
mean_diff[i] <- ttest$mean.diff
SE_diff[i] <- ttest$stderr
i <- i + 1
}
}
The ideal final data frame would look like d2 below.
d2 <- expand.grid(Age = rep(c(1,2,3), 1),
Location = rep(c("Central","North"), 1),
mean_diff = NA,
SE_diff = NA,
pvalue = NA)
C1 <- df1[c(1:6),3:4]
N1 <- df1[c(7:12),3:4]
C2 <- df1[c(13:18),3:4]
N2 <- df1[c(19:24),3:4]
C3 <- df1[c(25:30),3:4]
N3 <- df1[c(31:36),3:4]
c1_mod <- col_t_welch(x=C1[1:3,2], y=C1[4:6,2])
n1_mod <- col_t_welch(x=N1[1:3,2], y=N1[4:6,2])
c2_mod <- col_t_welch(x=C2[1:3,2], y=C2[4:6,2])
n2_mod <- col_t_welch(x=N2[1:3,2], y=N2[4:6,2])
c3_mod <- col_t_welch(x=C3[1:3,2], y=C3[4:6,2])
n3_mod <- col_t_welch(x=N3[1:3,2], y=N3[4:6,2])
d2[1,3] <- c1_mod$mean.diff
d2[1,4] <- c1_mod$stderr
d2[1,5] <- c1_mod$pvalue
d2[2,3] <- c2_mod$mean.diff
d2[2,4] <- c2_mod$stderr
d2[2,5] <- c2_mod$pvalue
d2[3,3] <- c3_mod$mean.diff
d2[3,4] <- c3_mod$stderr
d2[3,5] <- c3_mod$pvalue
d2[4,3] <- n1_mod$mean.diff
d2[4,4] <- n1_mod$stderr
d2[4,5] <- n1_mod$pvalue
d2[5,3] <- n2_mod$mean.diff
d2[5,4] <- n2_mod$stderr
d2[5,5] <- n2_mod$pvalue
d2[6,3] <- n3_mod$mean.diff
d2[6,4] <- n3_mod$stderr
d2[6,5] <- n3_mod$pvalue
d2
I think this might help you
Libraries
library(matrixTests)
library(tidyverse)
Data
set.seed(123)
df1 <- data.frame(matrix(ncol = 4, nrow = 36))
x <- c("Age","Location","Treatment","Value")
colnames(df1) <- x
df1$Age <- as.factor(rep(c(1,2,3), each = 12))
df1$Location <- as.factor(rep(c("Central","North"), each = 6))
df1$Treatment <- as.factor(rep(c("Control","Treatment"), each = 3))
df1$Value <- round(rnorm(36,200,25),0)
How to
df1 %>%
group_nest(Age,Location,Treatment) %>%
pivot_wider(names_from = Treatment,values_from = data) %>%
mutate(
test = map2(
.x = Control,
.y = Treatment,
.f = ~col_t_welch(.x,.y)
)
) %>%
unnest(test) %>%
select(Age,Location,pvalue,mean.diff,stderr)
Result
# A tibble: 6 x 5
Age Location pvalue mean.diff stderr
<fct> <fct> <dbl> <dbl> <dbl>
1 1 Central 0.675 -9.67 21.3
2 1 North 0.282 -22 17.7
3 2 Central 0.925 -3 28.4
4 2 North 0.570 9.33 14.6
5 3 Central 0.589 -14.7 25.0
6 3 North 0.311 -11.3 8.59
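If you would rather stay close to the loop idea from the question, a similar table can be sketched in base R by splitting the data and calling t.test(), which runs a Welch test by default. This is only a sketch under assumptions: the toy data below is made up (one grouping variable instead of two), and tt$stderr requires R 3.6.0 or later.

```r
# Toy data with a layout similar to df1 (values made up for illustration)
toy <- data.frame(
  Age = rep(c(1, 2), each = 6),
  Treatment = rep(rep(c("Control", "Treatment"), each = 3), times = 2),
  Value = c(1, 2, 3, 2, 4, 6, 10, 11, 12, 10, 11, 12)
)

# One Welch t-test per Age group, collected into one row each
rows <- lapply(split(toy, toy$Age), function(g) {
  tt <- t.test(Value ~ Treatment, data = g)  # var.equal = FALSE is the default
  data.frame(
    Age = g$Age[1],
    mean_diff = unname(tt$estimate[1] - tt$estimate[2]),  # Control minus Treatment
    SE_diff = tt$stderr,
    pvalue = tt$p.value
  )
})
out <- do.call(rbind, rows)
out
```

To group by both Age and Location, split on list(df1$Age, df1$Location) instead of a single column.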

Using the pipe in selfmade function with tidyeval (quo_name)

I have two functions: date_diff and group_stat. I have read this tidyverse article and am trying to create simple functions that work with the pipe.
The first function creates a difftime column and names it timex_minus_timey, but when I pipe the result into the next function I have to look up that name so I can fill in summary_var. Is there a better way to do this?
library(tidyverse)
#
set.seed(42)
data <- dplyr::bind_rows(
tibble::tibble(Hosp = rep("A", 1000),
drg = sample(letters[1:5], 1000, replace = TRUE),
time1 = as.POSIXlt("2018-02-03 08:00:00", tz = "UTC") + rnorm(1000, 0, 60*60*60),
time2 = time1 + runif(1000, min = 10*60, max = 20*60)),
tibble::tibble(Hosp = rep("B", 1000),
drg = sample(letters[1:5], 1000, replace = TRUE),
time1 = as.POSIXlt("2018-02-03 08:00:00", tz = "UTC") + rnorm(1000, 0, 60*60*60),
time2 = time1 + runif(1000, min = 10*60, max = 20*60))
)
date_diff <- function(df, stamp1, stamp2, units = "mins"){
stamp1 <- rlang::enquo(stamp1)
stamp2 <- rlang::enquo(stamp2)
name <- paste0(rlang::quo_name(stamp1), "_minus_", rlang::quo_name(stamp2))
out <- df %>%
dplyr::mutate(!!name := as.numeric(difftime(!!stamp1, !!stamp2, units=units)))
out
}
group_stat <- function(df, group_var, summary_var, .f) {
func <- rlang::as_function(.f)
group_var <- rlang::enquo(group_var)
summary_var <-rlang::enquo(summary_var)
name <- paste0(rlang::quo_name(summary_var), "_", deparse(substitute(.f)))
df %>%
dplyr::group_by(!!group_var) %>%
dplyr::summarise(!!name := func(!!summary_var, na.rm = TRUE))
}
data %>%
date_diff(time2, time1) %>%
group_stat(Hosp, summary_var = time2_minus_time1, mean)
#> # A tibble: 2 x 2
#> Hosp time2_minus_time1_mean
#> <chr> <dbl>
#> 1 A 15.1
#> 2 B 14.9
Created on 2019-05-02 by the reprex package (v0.2.1)
If you intend to always use these functions one after another in this way you could add an attribute containing the new column's name with date_diff, and have group_stat use that attribute. With the if condition, the attribute is only used if it exists and the summary_var argument is not provided.
date_diff <- function(df, stamp1, stamp2, units = "mins"){
stamp1 <- rlang::enquo(stamp1)
stamp2 <- rlang::enquo(stamp2)
name <- paste0(rlang::quo_name(stamp1), "_minus_", rlang::quo_name(stamp2))
out <- df %>%
dplyr::mutate(!!name := as.numeric(difftime(!!stamp1, !!stamp2, units=units)))
attr(out, 'date_diff_nm') <- name
out
}
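group_stat <- function(df, group_var, summary_var, .f) {
  # Fall back to the name stored by date_diff() when summary_var is not supplied;
  # otherwise summary_var is expected to be a column name given as a string.
  if(!is.null(attr(df, 'date_diff_nm')) && missing(summary_var))
    summary_var <- attr(df, 'date_diff_nm')
  group_var <- rlang::enquo(group_var)
  name <- paste0(summary_var, "_", deparse(substitute(.f)))
  df %>%
    dplyr::group_by(!!group_var) %>%
    # funs() is deprecated in recent dplyr, so call .f on the column directly
    dplyr::summarise(!!name := .f(.data[[summary_var]], na.rm = TRUE))
}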
data %>%
date_diff(time2, time1) %>%
group_stat(Hosp, .f = mean)
# # A tibble: 2 x 2
# Hosp time2_minus_time1_mean
# <chr> <dbl>
# 1 A 15.1
# 2 B 14.9

Calculate multiple moving calculations in one statement

I want to calculate all moving averages in one statement rather than repeating myself. Is this possible using quantmod or does it require some clever use of tidyeval and/or purrr?
library(tidyquant)
library(quantmod)
library(zoo)
tibble(date = as.Date('2018-01-01') + days(1:100),
value = 100 + cumsum(rnorm(100))) %>%
tq_mutate(mutate_fun = rollapply, select = "value", width = 10, FUN = mean, col_rename = "rm10") %>%
tq_mutate(mutate_fun = rollapply, select = "value", width = 5, FUN = mean, col_rename = "rm5") %>%
gather(series, value, -date) %>%
ggplot(aes(date, value, color = series)) +
geom_line()
Here is a solution using data.table's new frollmean()-function
data.table v1.12.0 or higher required.
sample data
library( data.table )
set.seed(123)
# adding 1:100 to a Date adds days, so days() from lubridate is not needed here
dt <- data.table( date = as.Date('2018-01-01') + 1:100,
value = 100 + cumsum(rnorm(100)))
code
#set the windows you want to roll on
windows <- c(5,10)
#create a rm+window column for each roll
dt[, ( paste0( "rm", windows ) ) := lapply( windows, function(x) frollmean( value, x)) ]
output
head( dt, 15 )
# date value rm5 rm10
# 1: 2018-01-02 99.43952 NA NA
# 2: 2018-01-03 99.20935 NA NA
# 3: 2018-01-04 100.76806 NA NA
# 4: 2018-01-05 100.83856 NA NA
# 5: 2018-01-06 100.96785 100.2447 NA
# 6: 2018-01-07 102.68292 100.8933 NA
# 7: 2018-01-08 103.14383 101.6802 NA
# 8: 2018-01-09 101.87877 101.9024 NA
# 9: 2018-01-10 101.19192 101.9731 NA
# 10: 2018-01-11 100.74626 101.9287 101.0867
# 11: 2018-01-12 101.97034 101.7862 101.3398
# 12: 2018-01-13 102.33015 101.6235 101.6519
# 13: 2018-01-14 102.73092 101.7939 101.8482
# 14: 2018-01-15 102.84161 102.1239 102.0485
# 15: 2018-01-16 102.28577 102.4318 102.1802
plot
#plot molten data
library(ggplot2)
ggplot( data = melt(dt, id.vars = c("date") ),
aes(x = date, y = value, colour = variable)) +
geom_line()
update - grouped data
library(data.table)
library(ggplot2)
set.seed(123)
#changed the sample data a bit, to get different values for grp=1 and grp=2
dt <- data.table(grp = rep(1:2, each = 100), date = rep(as.Date('2018-01-01') + 1:100, 2), value = 100 + cumsum(rnorm(200)))
dt[, ( paste0( "rm", windows ) ) := lapply( windows, function(x) frollmean( value, x)), by = "grp" ]
ggplot( data = melt(dt, id.vars = c("date", "grp") ),
aes(x = date, y = value, colour = variable)) +
geom_line() +
facet_wrap(~grp, nrow = 1)
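For readers without data.table, the same "one column per window" idea can be sketched in base R. roll_mean_right is a hypothetical helper written for this example, not part of any package; it assumes a right-aligned window that yields NA until the window is full, which matches the NA pattern in the frollmean output above.

```r
# Hypothetical helper: right-aligned rolling mean, NA until the window is full
roll_mean_right <- function(x, n) {
  out <- rep(NA_real_, length(x))
  if (n > length(x)) return(out)
  for (i in n:length(x)) {
    out[i] <- mean(x[(i - n + 1):i])
  }
  out
}

x <- c(1, 2, 3, 4, 5, 6)
windows <- c(2, 3)

# One column per window, named rm<window> as in the answer above
res <- sapply(windows, function(n) roll_mean_right(x, n))
colnames(res) <- paste0("rm", windows)
res
#      rm2 rm3
# [1,]  NA  NA
# [2,] 1.5  NA
# [3,] 2.5   2
# [4,] 3.5   3
# [5,] 4.5   4
# [6,] 5.5   5
```

The loop is only meant to show what is being computed; for long series a dedicated rolling function will be much faster.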
In this example I use the AAPL adjusted close price, downloaded with the getSymbols function from quantmod.
Let's say you want SMAs with the following lengths:
smaLength = c(30,35,40,46,53,61,70,81,93)
Now create the SMA like so:
lapply(smaLength,function(x) SMA(AAPL$AAPL.Adjusted,x)) %>% do.call(cbind,.) %>% tail()
result:
SMA SMA.1 SMA.2 SMA.3 SMA.4 SMA.5 SMA.6 SMA.7 SMA.8
2019-03-04 167.3703 165.2570 163.3706 162.1362 161.5904 162.9735 164.7770 169.3341 175.4143
2019-03-05 168.0162 165.9396 164.0682 162.5499 161.7934 162.8342 164.6408 168.9595 174.9418
2019-03-06 168.7454 166.6585 164.7488 162.9638 162.0062 162.8110 164.6165 168.6446 174.5135
2019-03-07 169.3866 167.2323 165.3086 163.3320 162.1409 162.7868 164.5661 168.2780 174.0284
2019-03-08 170.0820 167.7646 165.8150 163.6764 162.3807 162.8711 164.5855 167.8407 173.5334
2019-03-11 170.8092 168.4419 166.4589 164.1471 162.8097 163.0354 164.6573 167.4864 173.0806
Define the input, then lapply over the widths to create a rolling mean for each one, and merge the results together. Finally, plot it.
library(ggplot2)
library(magrittr)
library(zoo)
set.seed(123)
w <- c(1, 5, 10)
zoo(100 * cumsum(rnorm(100)), as.Date("2018-01-01") + 1:100) %>%
lapply(w, rollmeanr, x = .) %>%
do.call("merge", .) %>%
setNames(w) %>%
autoplot(facet = NULL)

I would like to use dplyr::mutate rather than the plyr::ddply function in pipeline processing

I would like to do what I have done here with plyr::ddply using mutate instead. Is it possible to perform a non-vectorized (row-by-row) operation with mutate somehow?
test <- tibble::tibble(
x = c(1,2,3),
y = c(0.5,1,1.5)
)
d <- c(1.23, 0.99, 2.18)
test %>% mutate(., s = (function(x, y) {
dn <- dnorm(x = d, mean = x, sd = y)
s <- sum(dn)
s
})(x,y))
test %>% plyr::ddply(., c("x","y"), .fun = function(row) {
dn <- dnorm(x = d, mean = row$x, sd = row$y)
s <- sum(dn)
s
})
A popular method is using the dplyr function: rowwise().
library(tidyverse)
test <- tibble::tibble(
x = c(1,2,3),
y = c(0.5,1,1.5)
)
d <- c(1.23, 0.99, 2.18)
test %>%
rowwise() %>% # before mutate, tell dplyr to evaluate row by row
mutate(., s = (function(x, y) {
dn <- dnorm(x = d, mean = x, sd = y)
s <- sum(dn)
s})(x,y))
This yields the following result:
# A tibble: 3 x 3
x y s
<dbl> <dbl> <dbl>
1 1 0.5 1.56
2 2 1 0.929
3 3 1.5 0.470
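An alternative that avoids rowwise() is to loop over the two columns in parallel. The sketch below uses base R's mapply(); the argument names m and s are arbitrary, and the data is the same as in the question.

```r
test <- data.frame(
  x = c(1, 2, 3),
  y = c(0.5, 1, 1.5)
)
d <- c(1.23, 0.99, 2.18)

# Call the function once per (x, y) pair; d is used whole in every call
test$s <- mapply(
  function(m, s) sum(dnorm(d, mean = m, sd = s)),
  test$x,
  test$y
)
round(test$s, 3)
```

Inside a pipeline, the same call can go directly into mutate(), for example mutate(s = mapply(..., x, y)), because mapply() returns one value per row.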

How to pass second parameter to function while using the map function of purrr package in R

Apologies for what might be a very simple question.
I am new to using the purrr package in R and I'm struggling with trying to pass a second parameter to a function.
library(dplyr)
library(purrr)
my_function <- function(x, y = 2) {
z = x + y
return(z)
}
my_df_2 <- my_df %>%
mutate(new_col = map_dbl(.x = old_col, .f = my_function))
This works and most often I don't need to change the value of y, but if I had to pass a different value for y (say y = 3) through the mutate & map combination, what is the syntax for it?
Thank you very much in advance!
The other idea is to use the following syntax.
library(dplyr)
library(purrr)
# The function
my_function <- function(x, y = 2) {
z = x + y
return(z)
}
# Example data frame
my_df <- tibble(old_col = 1:5)  # data_frame() is deprecated in favour of tibble()
# Apply the function
my_df_2 <- my_df %>%
mutate(new_col = map_dbl(old_col, ~my_function(.x, y = 3)))
my_df_2
# # A tibble: 5 x 2
# old_col new_col
# <int> <dbl>
# 1 1 4.
# 2 2 5.
# 3 3 6.
# 4 4 7.
# 5 5 8.
I think all you need to do is modify map_dbl like so:
library(dplyr)
library(purrr)
df <- data.frame(a = c(2, 3, 4, 5.5))
my_function <- function(x, y = 2) {
z = x + y
return(z)
}
df %>%
mutate(new_col = map_dbl(.x = a, y = 3, .f = my_function))
a new_col
1 2.0 5.0
2 3.0 6.0
3 4.0 7.0
4 5.5 8.5
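The two answers are equivalent, which a short side-by-side check makes clear. The sketch below also adds a plain anonymous function as a third spelling; the variable names via_dots, via_formula and via_lambda are made up for this example.

```r
library(purrr)

my_function <- function(x, y = 2) {
  x + y
}

# Extra named arguments after .f are forwarded to every call
via_dots <- map_dbl(1:5, my_function, y = 3)

# Formula shorthand: .x is the current element
via_formula <- map_dbl(1:5, ~ my_function(.x, y = 3))

# Plain anonymous function, no purrr shorthand involved
via_lambda <- map_dbl(1:5, function(v) my_function(v, y = 3))

via_dots                           # 4 5 6 7 8
identical(via_dots, via_formula)   # TRUE
identical(via_dots, via_lambda)    # TRUE
```

As far as I know, recent purrr documentation leans towards the anonymous-function forms, since they make it explicit which argument varies and which stays constant.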

Resources