Using a temporary inner variable in dplyr outside of the group - r

I need to calculate an FDR variable per group, using an expected random distribution of p values (corresponds to the "Random" type).
library(dplyr)
library(data.table)
calculate_empirical_fdr = function(control_pVal, test_pVal) {
m_control = length(control_pVal)
m_test = length(test_pVal)
unlist(lapply(test_pVal, function(significance_threshold) {
m_control = length(control_pVal)
m_test = length(test_pVal)
# number of expected false positives in a p-value sequence with the size m_test
FP_expected = length(control_pVal[control_pVal<=significance_threshold])*m_test/m_control
S = length(test_pVal[test_pVal<=significance_threshold]) # number of significant hits (FP + TP)
return(FP_expected/S)
}))
}
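To see what the function returns, it can help to run the same computation on a few hand-picked p-values. The sketch below is a compact restatement of the logic above; the name empirical_fdr_sketch and the toy vectors are illustrative assumptions and not part of the original question.

```r
# Illustrative restatement of the per-threshold computation (hypothetical name).
empirical_fdr_sketch <- function(control_pVal, test_pVal) {
  m_control <- length(control_pVal)
  m_test <- length(test_pVal)
  vapply(test_pVal, function(threshold) {
    # expected false positives, rescaled from the control size to the test size
    fp_expected <- sum(control_pVal <= threshold) * m_test / m_control
    # number of test p-values at or below the threshold (FP + TP)
    s <- sum(test_pVal <= threshold)
    fp_expected / s
  }, numeric(1))
}

control <- c(0.1, 0.3, 0.5, 0.7, 0.9)  # toy "Random" p-values
test <- c(0.05, 0.2, 0.6)              # toy test p-values
fdr <- empirical_fdr_sketch(control, test)
print(fdr)
```

Each test p-value is used as its own threshold, so the result has one FDR estimate per element of test.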
An example dataset with groups I need to control for in the "Group" variable:
set.seed(42)
library(dplyr); library(data.table)
dataset_test = data.table(Type = c(rep("Random", 500),
rep("test1", 500),
rep("test2", 500)),
Group = sample(c("group1", "group2", "group3"), 1500, replace = T),
Pvalue = c(runif(n = 500),
rbeta(n = 500, shape1 = 1, shape2 = 4),
rbeta(n = 500, shape1 = 1, shape2 = 6))
)
I have found that the best way to use my function per group would be to use a temporary variable in which I store the p-values of the Random type, but this does not work:
dataset_test %>%
group_by(Group) %>%
{filter(Type=="Random") %>% select(Pvalue) ->> control_set } %>%
group_by(Type, add = T) %>%
mutate(FDR_empirical = calculate_empirical_fdr(control_pVal = control_set,
test_pVal = Pvalue)) %>%
data.table()
Error in filter(Type == "Random") : object 'Type' not found
I understand that temporary variables probably "do not see" the environment within the data.table; I would be glad to hear any suggestions on how to fix it.

You can do something like this, which filters the control group p-values using the data.table special symbol .BY:
setDT(dataset_test)
dataset_test[
i= Type!="Random",
j = FDR_empirical:=calculate_empirical_fdr(dataset_test[Type=="Random" & Group ==.BY$Group, Pvalue], Pvalue),
by = .(Group, Type)
]
Output:
Type Group Pvalue FDR_empirical
1: Random group1 0.70292111 NA
2: Random group1 0.72383117 NA
3: Random group1 0.76413459 NA
4: Random group1 0.87942702 NA
5: Random group2 0.71229213 NA
---
1496: test2 group3 0.34817178 0.3681791
1497: test2 group1 0.22419118 0.2308988
1498: test2 group3 0.07258545 0.2314655
1499: test2 group2 0.24687976 0.2849462
1500: test2 group1 0.12206777 0.1760657
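A smaller case may make the role of .BY easier to follow. The sketch below uses a made-up five-row table (the column n_control and the values are assumptions for illustration) and counts, for each test row, how many Random p-values share its group.

```r
library(data.table)

# Toy data: three control ("Random") rows and two test rows (illustrative values).
dt <- data.table(
  Type   = c("Random", "Random", "Random", "test1", "test1"),
  Group  = c("g1", "g1", "g2", "g1", "g2"),
  Pvalue = c(0.4, 0.6, 0.1, 0.05, 0.3)
)

# Within each by-group, .BY$Group holds the current group value, so the inner
# subset picks only the control rows that belong to the same group.
dt[
  Type != "Random",
  n_control := length(dt[Type == "Random" & Group == .BY$Group, Pvalue]),
  by = Group
]
print(dt)
```

Rows excluded by the i condition (the Random rows) are left as NA in the new column, which matches the NA values in the output above.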

Two possible solutions
Use the dot .
dataset_test %>%
group_by(Group) %>%
{filter(., Type=="Random") %>% select(Pvalue) ->> control_set; . } %>%
group_by(Type, add = T)
Use the tee-pipe %T>% from the magrittr package
library(magrittr)
dataset_test %>%
group_by(Group) %T>% {
filter(., Type=="Random") %>% select(Pvalue) ->> control_set} %>%
group_by(Type, add = T)
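Both variants rely on the same magrittr behaviour: when the right-hand side is a braced block, the piped value is not inserted as a first argument and is only available as the dot. A minimal sketch on a plain vector (the variable names are illustrative assumptions):

```r
library(magrittr)

x <- c(3, 1, 2)

# Variant 1: refer to the piped value as `.` and return it as the last
# expression so the pipeline continues with the original data.
out_dot <- x %>% { saved_dot <<- sort(.); . } %>% rev()

# Variant 2: the tee pipe returns its left-hand side regardless of what the
# braced block evaluates to.
out_tee <- x %T>% { saved_tee <<- sort(.) } %>% rev()

print(out_dot)
print(saved_dot)
```

In the question's code, filter(Type=="Random") inside the braces received no data argument, which is why Type could not be found.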

Related

How to plot sjPlots from a nested tibble?

I create some models like this using a nested tidyr dataframe:
set.seed(1)
library(tidyr)
library(dplyr)
library(sjPlot)
library(tibble)
library(purrr)
fits <- tribble(~group, ~colA, ~colB, ~colC,
sample(c("group1", "group2"), 10, replace = T), 0, sample(10, replace = T), sample(10, replace = T),
sample(c("group1", "group2"), 10, replace = T), 1, sample(10, replace = T), sample(10, replace = T)) %>%
unnest(cols = c(colB, colC)) %>%
nest(data=-group) %>%
mutate(fit= map(data, ~glm(formula = colA ~ colB + colC, data = .x, family="binomial"))) %>%
dplyr::select(group, fit) %>%
tibble::column_to_rownames("group")
I would like to use this data to create some quick marginal effects plots with sjPlot::plot_models like this
plot_models(as.list(fits), type = "pred", terms = c("colB", "colA", "colC"))
Unfortunately, I get the error
Error in if (fam.info$is_linear) tf <- NULL else tf <- "exp" :
argument is of length zero
In addition: Warning message:
Could not access model information.
I've played around a bit with the nesting of the data but I've been unable to get it into a format that sjPlot::plot_models will accept.
What I was expecting to get is a "Forest plot of multiple regression models" as described in the help file. Ultimately, the goal is to plot the marginal effects of regression models by group, which I was hoping the plot_models will do (please correct me if I'm wrong).
I think there are some issues with the original code as well as with the data. The function call passes arguments from plot_model that are not supported in plot_models. I first show an example of how plot_models can be called and used with a nested tibble, using {ggplot2}'s diamonds data set. Then I apply this approach to the OP's sample data, which doesn't yield usable results*. Finally, I create some new toy data to show how the approach could be applied to a binomial model.
(* In the original toy data the dependent variable is either always 0 or always 1 in each model, so this is unlikely to yield usable results.)
set.seed(1)
library(tidyr)
library(dplyr)
library(sjPlot)
library(tibble)
library(ggplot2)
# general example
fits <- tibble(id = c("x", "y", "z")) %>%
rowwise() %>%
mutate(fit = list(glm(reformulate(
termlabels = c("cut", "color", "depth", "table", "price", id),
response = "carat"),
data = diamonds)))
plot_models(fits$fit)
# OP's example data
fits2 <- tribble(~group, ~colA, ~colB, ~colC,
sample(c("group1", "group2"), 10, replace = T), 0,
sample(10, replace = T), sample(10, replace = T),
sample(c("group1", "group2"), 10, replace = T), 1,
sample(10, replace = T),
sample(10, replace = T)) %>%
unnest(cols = c(colB, colC)) %>%
nest(data = -group) %>%
rowwise() %>%
mutate(fit = list(glm(formula = colA ~ colB + colC, data = data, family="binomial")))
plot_models(fits2$fit)
#> Warning: Transformation introduced infinite values in continuous y-axis
#> Warning: Removed 4 rows containing missing values (geom_point).
# new data for binomial model
n <- 500
g <- round(runif(n, 0L, 1L), 0)
x1 <- runif(n,0,100)
x2 <- runif(n,0,100)
y <- (x2 - x1 + rnorm(n,sd=20)) < 0
fits3 <- tibble(g, y, x1, x2) %>%
nest_by(g) %>%
mutate(fit = list(glm(formula = y ~ x1 + x2, data = data, family="binomial")))
plot_models(fits3$fit)
Created on 2021-01-23 by the reprex package (v0.3.0)

Grid search on ARIMA model in R

I'm trying to get a grid search for my ARIMA model working and I need additional help with it.
I have the following data:
head(train)
Date Count
<date> <int>
1 2016-06-15 21
2 2016-06-16 21
3 2016-06-17 12
4 2016-06-18 20
5 2016-06-19 29
6 2016-06-20 30
Train data Date variable ranges from 2016-06-15 to 2019-06-30 with 1111 observations in total
Train data Count variable ranges from min=3 to max=154 with mean=23.83 and sd=13.84.
I was able to define hyper parameters and create 36 ARIMA models with the following code:
#Create ts data
ts_train = xts(train[, -1], order.by = as.POSIXct(train$Date), frequency = 365)
#ARIMA model tune
#tibble helper function
to_tibble <- function(forecast_object){
point_estimate <- forecast_object$mean %>%
as_tsibble() %>%
rename(point_estimate = value,
date = index)
upper <- forecast_object$upper %>%
as_tsibble() %>%
spread(key, value) %>%
rename(date = index,
upper80 = `80%`,
upper95 = `95%`)
lower <- forecast_object$lower %>%
as_tsibble() %>%
spread(key, value) %>%
rename(date = index,
lower80 = `80%`,
lower95 = `95%`)
reduce(list(point_estimate, upper, lower), full_join)
}
#Trend hyper parameters
order_list <- list("p" = seq(0, 2),
"d" = seq(0, 1),
"q" = seq(0, 2)) %>%
cross() %>%
map(lift(c))
#Seasonal hyper parameters
season_list <- list("P" = seq(0, 2),
"D" = seq(0, 1),
"Q" = seq(0, 2),
"period" = 365) %>%
cross() %>%
map(lift(c))
#Coerce vectors to tibbles
orderdf <- tibble("order" = order_list)
seasondf <- tibble("season" = season_list)
#Create grid of hyper-parameters
hyper_parameters_df <- crossing(orderdf, seasondf)
#Run grid search of ARIMA models
tic <- Sys.time()
models_df <- hyper_parameters_df %>%
mutate(models = map2(.x = order,
.y = season,
~possibly(arima, otherwise = NULL)(x = ts_train,
order = .x, seasonal = .y)))
running_time <- Sys.time() - tic
running_time
#Drop models which couldn't compute ARIMA
final_models = models_df %>% drop_na()
nrows <- nrow(final_models)
And then I get an error when I try to calculate the RMSE across my test data with the following code:
final_models <- final_models %>%
mutate(forecast = map(models, ~possibly(forecast, otherwise = NULL)(., h = 183))) %>%
mutate(point_forecast = map(forecast, ~.$`mean`)) %>%
mutate(true_value = rerun(nrows, test)) %>%
mutate(rmse = map2_dbl(point_forecast, true_value,
~sqrt(mean((.x - .y) ** 2))))
I get one error and one warning message:
Error in .x - .y : non-numeric argument to binary operator
In addition: Warning message:
In mean((.x - .y)^2) :
Incompatible methods ("Ops.ts", "Ops.data.frame") for "-"
Can someone please help me with that?
Here is my test data if it's needed to create dummy data:
head(test)
Date Count
<date> <int>
1 2019-07-02 20
2 2019-07-03 28
3 2019-07-04 35
4 2019-07-05 34
5 2019-07-06 60
6 2019-07-07 63
Test data Date variable ranges from 2019-07-01 to 2019-12-31 with 184 observations in total
Test data Count variable ranges from min=6 to max=63 with mean=21.06 and sd=9.89.
The problem is that when you are computing the RMSE you are using time series rather than vectors. So, you have to change the class of both predictions and true values to numeric.
Here is my solution:
# Load libraries
library(fpp2)
library(dplyr)
library(xts)
library(purrr)
library(tidyr)
# Create sample dataset
dates <- seq.Date(as.Date("2019-07-02"), by = "day", length.out = length(WWWusage))
train <- data.frame(Date = dates, Count = WWWusage)
# Get test dataset using drift method
test <- forecast::rwf(WWWusage, h = 183, drift = TRUE)$mean
#Create ts data
ts_train = xts(train[, -1], order.by = as.POSIXct(train$Date), frequency = 365)
#ARIMA model tune
#tibble helper function
to_tibble <- function(forecast_object){
point_estimate <- forecast_object$mean %>%
as_tsibble() %>%
rename(point_estimate = value,
date = index)
upper <- forecast_object$upper %>%
as_tsibble() %>%
spread(key, value) %>%
rename(date = index,
upper80 = `80%`,
upper95 = `95%`)
lower <- forecast_object$lower %>%
as_tsibble() %>%
spread(key, value) %>%
rename(date = index,
lower80 = `80%`,
lower95 = `95%`)
reduce(list(point_estimate, upper, lower), full_join)
}
#Trend hyper parameters
order_list <- list("p" = seq(0, 2),
"d" = seq(0, 1),
"q" = seq(0, 2)) %>%
cross() %>%
map(lift(c))
#Seasonal hyper parameters
season_list <- list("P" = seq(0, 2),
"D" = seq(0, 1),
"Q" = seq(0, 2),
"period" = 365) %>%
cross() %>%
map(lift(c))
#Coerce vectors to tibbles
orderdf <- tibble("order" = order_list)
seasondf <- tibble("season" = season_list)
#Create grid of hyper-parameters
hyper_parameters_df <- crossing(orderdf, seasondf)
#Run grid search of ARIMA models
tic <- Sys.time()
models_df <- hyper_parameters_df %>%
mutate(models =
map2(.x = order,
.y = season,
~possibly(arima, otherwise = NULL)(x = ts_train, order = .x, seasonal = .y)))
running_time <- Sys.time() - tic
running_time
#Drop models which couldn't compute ARIMA
final_models = models_df %>% drop_na()
nrows <- nrow(final_models)
# Estimate RMSE for each candidate
# Note: you have to make sure that both .x and .y are numeric
final_models2 <- final_models %>%
mutate(forecast = map(models, ~possibly(forecast, otherwise = NULL)(., h = 183))) %>%
mutate(point_forecast = map(forecast, ~.$`mean`)) %>%
mutate(true_value = rerun(nrows, test)) %>%
mutate(rmse = map2_dbl(point_forecast, true_value,
~sqrt(mean((as.numeric(.x) - as.numeric(.y)) ** 2))))
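The coercion step can also be isolated in a small helper, which makes the RMSE calculation easy to check on its own. This is a minimal sketch; the function name rmse_numeric and the toy values are assumptions for illustration. If the true values live in a data frame, as the Ops.data.frame warning in the question suggests, the numeric column would need to be extracted first (for example test$Count).

```r
# Hypothetical helper: coerce both inputs to plain numeric vectors before
# taking the difference, so ts attributes do not interfere.
rmse_numeric <- function(predicted, actual) {
  sqrt(mean((as.numeric(predicted) - as.numeric(actual))^2))
}

point_forecast <- ts(c(1, 2, 3), start = 2019)  # toy forecast stored as a ts
true_value <- c(1, 2, 5)                        # toy observed values

err <- rmse_numeric(point_forecast, true_value)
print(err)
```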

How can I pass change arguments into character in a function?

Here is a minimal example.
df <- data.frame("Treatment" = c(rep("A", 2), rep("B", 2)), "Price" = 1:4, "Cost" = 2:5)
I want to summarize the data by treatments for all the variables I have, and put them together, so I define a function to do this for each variable first, and then rbind them later on.
SummarizeFn <- function(x,y,z) {
x %>% group_by(Treatment) %>%
summarize(n = n(), Mean = mean(y), SD = sd(y)) %>%
cbind("Var" = rep(y, 3)) # add a column to show which variable those statistics belong to.
}
SumPrice <- SummarizeFn(df, df$Price, "Price")
However, R tells me that object "Price" is not found. How to solve this problem?
Also, how to make y as a character indicating the mean and sd are of price?
Price isn't a variable on its own; it is only defined as a column of your data frame df, so you need to pass df$Price, as in SummarizeFn(df, df$Price, "Price"):
SummarizeFn <- function(x,y,z)
{
df1<-(x %>% group_by(Treatment)
%>% summarize(n = n(), Mean = mean(y), SD = sd(y))
)
df1<- df1 %>% mutate ("Var" = z)
return(df1)
}
SumPrice <- SummarizeFn(df, df$Price,"Price")
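One caveat with passing df$Price is that the whole column is handed to mean() and sd(), so every treatment appears to get the same overall statistics rather than its own. A possible alternative, not taken from the answer above, is to pass only the column name as a string and look it up with dplyr's .data pronoun; the function name summarize_by_name is a hypothetical choice for this sketch.

```r
library(dplyr)

df <- data.frame(Treatment = c(rep("A", 2), rep("B", 2)),
                 Price = 1:4, Cost = 2:5)

# Hypothetical variant: `var` is a column name given as a character string.
summarize_by_name <- function(data, var) {
  data %>%
    group_by(Treatment) %>%
    # .data[[var]] resolves the column inside each group
    summarize(n = n(), Mean = mean(.data[[var]]), SD = sd(.data[[var]])) %>%
    # record which variable the statistics belong to
    mutate(Var = var)
}

sum_price <- summarize_by_name(df, "Price")
sum_cost <- summarize_by_name(df, "Cost")
print(rbind(sum_price, sum_cost))
```

Because the name is already a string, the same argument supplies both the column to summarize and the label in the Var column.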

How to transform/resample/interpolate data for normalising variable length within a tidy dataset with multiple grouping variables in R?

I am aiming to normalize the length of vectors for averaging within a tidy data set. Using approx seems to be the way to go, but I can't make it work efficiently in the tidyverse. One issue is probably related to resizing within a data frame. Here's a reproducible example:
# create reproducible dataset
i = 80
I = 110
id = rep("AA", I+i)
event = rep("event1", I+i)
sub_event = NA
sub_event[1:i] = 1
sub_event[i+1:I] = 2
sub_event = as.factor(sub_event)
y1 = sin(seq(0, 5*pi, length.out = i))
y2 = sin(seq(0, 5*pi, length.out = I))
y3 = cos(seq(0, 5*pi, length.out = i))
y4 = cos(seq(0, 5*pi, length.out = I))
var1 = c(y1,y2)
var2 = c(y3,y4)
df1 <- data.frame(id, event, sub_event,var1, var2)
df2 <- df1
df2$event = "event2"
df <- rbind(df1, df2)
temp <- df
temp$id = "BB"
df <- rbind(df, temp)
# create a "time" vector for sub_event
df <- df %>%
group_by(id, event, sub_event) %>%
mutate(sub_event_time = seq_along(var1)) %>%
select(id, event, sub_event, sub_event_time, everything()) %>%
ungroup()
Plot var1
# plot
ggplot(df,
aes(x=sub_event_time, y=var1, colour = sub_event)) +
geom_point() +
geom_path() +
facet_wrap(id~event)
I want to transform/resample the data so that the length of var1 for each sub_event equals the length of the longest sub_event within each event for each id.
For instance we want: length of var1 for event 1 sub event 1 = length of var1 for event 1 sub event 2 (which is the longest).
Here's an attempt:
# attempt for var1 only
aim.df <- df %>%
ungroup() %>%
select(-var2) %>%
group_by(id, event) %>%
mutate(max_sub_event_time = max(sub_event_time)) %>%
mutate(var1 = approx(var1, n = max_sub_event_time)$y)
This returns the following error:
Error in mutate_impl(.data, dots) :
Column `var1` must be length 190 (the group size) or one, not 110
In addition: Warning messages:
1: In if (n <= 0) stop("'approx' requires n >= 1") :
the condition has length > 1 and only the first element will be used
2: In seq.int(x[1L], x[nx], length.out = n) :
first element used of 'length.out' argument
Any ideas ?
steps...
group_by(id, event, sub_event)
remove sub_event_time since it will be irrelevant once you add observations
summarise the result of the approx function as a list column (you will have to convert var1 and max_sub_event_time to appropriate input for approx)
unnest the resulting list column
group_by(id, event, sub_event) again and add a new sub_event_time
code...
library(dplyr)
library(tidyr)
df %>%
ungroup() %>%
select(-var2) %>%
group_by(id, event) %>%
mutate(max_sub_event_time = max(sub_event_time)) %>%
group_by(id, event, sub_event) %>%
select(-sub_event_time) %>%
summarise(var1_int = list(approx(as.numeric(var1), n = first(max_sub_event_time))$y)) %>%
unnest(cols = var1_int) %>%
group_by(id, event, sub_event) %>%
mutate(sub_event_time = row_number())
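The resampling itself is done by approx with a scalar n, which is why the pipeline takes first(max_sub_event_time) rather than the whole column. A minimal base R sketch on a made-up three-point vector (the values are assumptions for illustration):

```r
# Toy signal with 3 samples, stretched to 5 equally spaced samples.
short_signal <- c(0, 1, 0)

# With only one vector supplied, approx treats it as y and uses 1, 2, 3 as x;
# n must be a single number giving the desired output length.
resampled <- approx(short_signal, n = 5)$y
print(resampled)
```

The output length differs from the input length, which is the reason the result has to go through summarise as a list column instead of mutate.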

working with paired data across groups in the tidyverse

I have multiple observations from each of a few groups and I'd like to make a matrix of QQ plots (or another type of plot), comparing each group to every other group.
Here's an example of what I'm talking about:
library(tidyverse)
set.seed(27599)
n <- 30
d <- data_frame(person = c(rep('Alice', n),
rep('Bob', n),
rep('Charlie', n),
rep('Danielle', n)),
score = c(rnorm(n = n),
rnorm(n = n, mean = 0.1),
rnorm(n = n, sd = 2),
rnorm(n = n, mean = 0.3, sd = 1.4)))
by_hand <- data_frame(a = sort(d$score[d$person == 'Alice']),
b = sort(d$score[d$person == 'Bob']),
c = sort(d$score[d$person == 'Charlie']),
d = sort(d$score[d$person == 'Danielle']))
pairs(x = by_hand,
lower.panel = function(x, y) { points(x, y); abline(0, 1);})
Here, I've manipulated the data by hand and used graphics::pairs() to make the plot. Can the same be done inside the tidyverse?
Here's what I've tried.
d %>%
group_by(person) %>%
mutate(score = sort(score)) %>%
glimpse()
This seems promising.
d %>%
group_by(person) %>%
mutate(score = sort(score)) %>%
spread(key = person, value = score)
This gives the 'duplicate identifiers' error.
Maybe reshape2 would be better to use here?
d %>%
group_by(person) %>%
mutate(score = sort(score)) %>%
dcast(formula = score ~ person)
This creates a data.frame with 120 rows, and most of the values (90 per person) are NA. How can I create a wide data.frame without introducing so many NA?
You need a variable that links the row position for each person. Try
by_tidyverse <- d %>%
group_by(person) %>%
mutate(rowID=1:n(),
score=sort(score)
) %>%
spread(key = person, value = score) %>%
select(-rowID)
pairs(x = by_tidyverse, lower.panel = function(x, y) { points(x, y); abline(0, 1);})
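The same idea can be checked on a tiny data set. The sketch below uses tidyr's pivot_wider, the newer replacement for spread, on made-up scores for two people; the data and object names are assumptions for illustration.

```r
library(dplyr)
library(tidyr)

toy <- tibble(person = rep(c("Alice", "Bob"), each = 3),
              score = c(3, 1, 2, 6, 4, 5))

wide <- toy %>%
  group_by(person) %>%
  # sort within each person and record the rank position as the linking key
  mutate(score = sort(score), rowID = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = person, values_from = score) %>%
  select(-rowID)

print(wide)
```

Without rowID there is nothing to say which Alice score belongs on the same row as which Bob score, which is what caused the duplicate identifiers error and the NA-filled result.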
