In R tidymodels how can I specify contrasts for specific variables?

I would like to specify "sum to zero" contrasts for two predictors in a linear model using a tidymodels recipe. Is it possible? Looking at the recipes documentation, it seems that before version 1.3 there were attempts to build variable-specific options, but the strategy shifted to a global option.
I am trying to convert this base R code into tidymodels:
Bikeshare <- ISLR2::Bikeshare # start with original data
contrasts(Bikeshare$hr) <- contr.sum(24)
contrasts(Bikeshare$mnth) <- contr.sum(12)
mod.lm2 <-
  lm(
    bikers ~ mnth + hr + workingday + temp + weathersit,
    data = Bikeshare
  )
summary(mod.lm2)
I got this far:
library(tidymodels)
Bikeshare <- ISLR2::Bikeshare # start with original data
contrasts(Bikeshare$hr) <- contr.sum(24)
contrasts(Bikeshare$mnth) <- contr.sum(12)
lm_spec <- linear_reg() %>%
  set_engine("lm")

the_rec <-
  recipe(
    bikers ~ mnth + hr + workingday + temp + weathersit,
    data = Bikeshare
  ) %>%
  step_dummy(c(mnth, hr), one_hot = TRUE)

the_workflow <- workflow() %>%
  add_recipe(the_rec) %>%
  add_model(lm_spec)

the_workflow_fit_lm_fit <-
  fit(the_workflow, data = Bikeshare) %>%
  extract_fit_parsnip()

summary(the_workflow_fit_lm_fit$fit)
Does anybody know how to get the same results out of a tidymodels workflow?
I don't think I can use contr.sum as a global option. That gives me the betas I would like for two of the variables, but it changes the contrasts on the others as well.
Bikeshare <- ISLR2::Bikeshare # be sure to work with original data
old_opt <- options()$contrast
options(contrasts = c('contr.sum', 'contr.poly'))

The docs for step_dummy() say:
To change the type of contrast being used, change the global contrast option via options.
so there is no way, outside of global options, to change it.
We should probably have an example though :-/
Note that, for new samples, the options are read from the global option again. Make sure that they are set the same at prediction-time:
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from
#>   required_pkgs.model_spec parsnip
tidymodels_prefer()
data("penguins")
penguins <-
  penguins %>%
  distinct(species)
# R's defaults
old_opt <- options()$contrast
old_opt
#>         unordered           ordered
#> "contr.treatment"      "contr.poly"
# default contrast
default <-
  recipe(~ species, data = penguins) %>%
  step_dummy(species) %>%
  prep()
default %>% bake(new_data = NULL)
#> # A tibble: 3 × 2
#> species_Chinstrap species_Gentoo
#> <dbl> <dbl>
#> 1 0 0
#> 2 0 1
#> 3 1 0
# Now set the contrasts option to something else:
options(contrasts = c('contr.sum', 'contr.poly'))
with_opt <-
  recipe(~ species, data = penguins) %>%
  step_dummy(species) %>%
  prep()
with_opt %>% bake(new_data = NULL)
#> # A tibble: 3 × 2
#> species_X1 species_X2
#> <dbl> <dbl>
#> 1 1 0
#> 2 -1 -1
#> 3 0 1
# reset options:
options(contrasts = old_opt)
with_opt %>% bake(new_data = penguins)
#> # A tibble: 3 × 2
#> species_Chinstrap species_Gentoo
#> <dbl> <dbl>
#> 1 0 0
#> 2 0 1
#> 3 1 0
Created on 2021-11-16 by the reprex package (v2.0.0)
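Putting this together with the original Bikeshare code, a minimal sketch could look like the one below (my own untested adaptation, not an official recipes feature). The global option is set just before fitting and restored afterwards, and step_dummy() is used with its defaults instead of one_hot = TRUE so that the contrast option actually governs the encoding; as noted in the question, this applies sum contrasts to every factor predictor (weathersit included), and the same option has to be in effect again at prediction time.
library(tidymodels)

Bikeshare <- ISLR2::Bikeshare

lm_spec <- linear_reg() %>%
  set_engine("lm")

the_rec <-
  recipe(
    bikers ~ mnth + hr + workingday + temp + weathersit,
    data = Bikeshare
  ) %>%
  step_dummy(all_nominal_predictors())

the_workflow <- workflow() %>%
  add_recipe(the_rec) %>%
  add_model(lm_spec)

# set sum-to-zero contrasts globally just for prep()/fit(), then restore
old_opt <- options(contrasts = c("contr.sum", "contr.poly"))
the_workflow_fit <- fit(the_workflow, data = Bikeshare)
options(old_opt)

summary(extract_fit_parsnip(the_workflow_fit)$fit)
# per the reprex above, set the same contrasts option again before predicting on new data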

Related

Problems in plotting line and error bars on the plot in R

I have this piece of R code that should plot some data from growth_data.txt. Basically, it should produce a line graph with a single line (or line + points) for the control and the treated animals in this dataset: one line for all the controls and one line for all the treated animals, with appropriate error bars at each time point. But I don't know why the plot doesn't show the lines and error bars, which is weird.
What is wrong with my code, and how can I fix it? I included the plot I'm getting now.
library(tximport)
library(DESeq2)
library(tidyverse)
library(cowplot)
library(pheatmap)
library(RColorBrewer)
library(dplyr)
library(ggplot2)
theme_set(theme_classic())
growth_data <- read.delim("growth_data.txt") %>% tibble()

# tidying the data
growth_data_long <- growth_data %>%
  pivot_longer(-animal,
               names_to = "Day",
               values_to = "Growth")

growth2 <- growth_data_long %>%
  mutate(group = str_extract(animal, "\\w+"))
growth2

growth2 %>%
  filter(group != "") %>%
  ggplot() +
  aes(Day, Growth, color = group) +
  geom_point() +
  geom_smooth(method = lm)
I'm sorry - I was incorrect about "Day" being a factor - thanks for fixing the broken link.
One potential solution is to add a 'group' aesthetic, e.g.
library(tidyverse)
theme_set(theme_classic())
growth_data <- read.delim("~/Desktop/growth_data.txt") %>% tibble()

# tidying the data
growth_data_long <- growth_data %>%
  pivot_longer(-animal,
               names_to = "Day",
               values_to = "Growth")

growth2 <- growth_data_long %>%
  mutate(group = str_extract(animal, "\\w+"))
growth2
#> # A tibble: 60 × 4
#> animal Day Growth group
#> <chr> <chr> <dbl> <chr>
#> 1 Control 1 Day.1 1.08 Control
#> 2 Control 1 Day.2 1.49 Control
#> 3 Control 1 Day.3 2.73 Control
#> 4 Control 1 Day.4 2.81 Control
#> 5 Control 1 Day.5 3.8 Control
#> 6 Control 1 Day.6 4.8 Control
#> 7 Control 2 Day.1 1.22 Control
#> 8 Control 2 Day.2 1.86 Control
#> 9 Control 2 Day.3 2.01 Control
#> 10 Control 2 Day.4 2.53 Control
#> # … with 50 more rows
growth2 %>%
  filter(group != "") %>%
  ggplot(aes(Day, Growth, color = group, group = group)) +
  geom_point() +
  geom_smooth(method = "lm")
#> `geom_smooth()` using formula 'y ~ x'
Created on 2021-10-11 by the reprex package (v2.0.1)
The docs go into more detail about grouping.
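Since the original question also asked for error bars at each time point, here is a hedged sketch of one way to add them with stat_summary() and mean_se(); it assumes the same growth2 data frame as above and that a mean ± one standard error bar per Day and group is what you're after.
library(tidyverse)

growth2 %>%
  filter(group != "") %>%
  ggplot(aes(Day, Growth, color = group, group = group)) +
  # mean point and connecting line per group at each Day
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun = mean, geom = "line") +
  # mean +/- one standard error at each Day
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.2)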

wrap tidymodels recipe into function

Is it possible to wrap a tidymodel recipe into a function? I've tried the following:
# Data setup
library(tidyverse)
library(tidymodels)
parks <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-06-22/parks.csv')
modeling_df <- parks %>%
  select(pct_near_park_data, spend_per_resident_data, med_park_size_data) %>%
  rename(nearness = "pct_near_park_data",
         spending = "spend_per_resident_data",
         acres = "med_park_size_data") %>%
  mutate(nearness = (parse_number(nearness) / 100)) %>%
  mutate(spending = parse_number(spending))
# Start building models
set.seed(123)
park_split <- initial_split(modeling_df)
park_train <- training(park_split)
park_test <- testing(park_split)
Works well without function:
tree_rec <- recipe(nearness ~ ., data = park_train)
Problem: wrap recipe into function:
custom_rec <- function(dat, var){
  tree_rec <- recipe(nearness ~ {{var}}, data = dat)
}
custom_rec(park_train, spending)
Error:
Error during wrapup: No in-line functions should be used here; use steps to define baking actions.
Error: no more error handlers available (recursive errors?); invoking 'abort' restart
The R formula is an extremely useful but weird, weird thing so I don't recommend trying to mess around with it in a situation like you have here.
Instead, try using the update_role() interface for recipes:
library(tidymodels)
library(modeldata)
data(biomass)
# split data
biomass_tr <- biomass[biomass$dataset == "Training",]
my_rec <- function(dat, preds) {
  recipe(dat) %>%
    update_role({{preds}}, new_role = "predictor") %>%
    update_role(HHV, new_role = "outcome") %>%
    update_role(sample, new_role = "id variable") %>%
    update_role(dataset, new_role = "splitting indicator")
}
my_rec(biomass_tr, carbon) %>% prep() %>% summary()
#> # A tibble: 8 × 4
#> variable type role source
#> <chr> <chr> <chr> <chr>
#> 1 sample nominal id variable original
#> 2 dataset nominal splitting indicator original
#> 3 carbon numeric predictor original
#> 4 hydrogen numeric <NA> original
#> 5 oxygen numeric <NA> original
#> 6 nitrogen numeric <NA> original
#> 7 sulfur numeric <NA> original
#> 8 HHV numeric outcome original
my_rec(biomass_tr, c(carbon, hydrogen, oxygen, nitrogen)) %>% prep() %>% summary()
#> # A tibble: 8 × 4
#> variable type role source
#> <chr> <chr> <chr> <chr>
#> 1 sample nominal id variable original
#> 2 dataset nominal splitting indicator original
#> 3 carbon numeric predictor original
#> 4 hydrogen numeric predictor original
#> 5 oxygen numeric predictor original
#> 6 nitrogen numeric predictor original
#> 7 sulfur numeric <NA> original
#> 8 HHV numeric outcome original
Created on 2021-09-21 by the reprex package (v2.0.1)
If you are set on the formula interface, maybe try using rlang::new_formula().
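For completeness, here is a rough, untested sketch of that formula route: the formula is built programmatically with rlang::new_formula() and handed to recipe(), with nearness hard-coded as the outcome and the predictor captured as a symbol.
library(tidymodels)
library(rlang)

custom_rec <- function(dat, var) {
  # build `nearness ~ <var>` without writing the formula in-line
  f <- new_formula(quote(nearness), ensym(var))
  recipe(f, data = dat)
}

custom_rec(park_train, spending) %>% prep() %>% summary()
For several predictors, base R's reformulate() (e.g. reformulate(c("spending", "acres"), response = "nearness")) is another way to build the formula before passing it to recipe().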

Adding 'List' Objects to Word document using the Officer package

First time posting here.
I'm trying to get some statistical results to output onto a Word doc using the officer package. I understand that the body_add_* functions seem to only work on data frames. However, functions and tests like gvlma and ncvTest output lists with unconventional dimensions, so I'm unable to use the tidyr package to tidy the lists before turning them into a data frame with data.frame(). So I need help adding these blocks of text, which are lists, to a Word document.
So far I have this; the ADF test outputs a very nice list that is easily convertible to a data frame:
# ADF test into dataframe
adf_df = data.frame(adf)
adf_df
ft <- flextable(data = adf_df) %>%
  theme_booktabs() %>%
  autofit()

# Output table into Word doc
doc <- read_docx() %>%
  body_add_flextable(value = ft) %>%
  body_add_par(gvlma)
fileout <- "test.docx"
print(doc, target = fileout)
The body_add_par(gvlma) line gives the error:
Warning messages:
1: In if (grepl("<|>", x)) { :
the condition has length > 1 and only the first element will be used
2: In charToRaw(enc2utf8(x)) :
argument should be a character vector of length 1
all but the first element will be ignored
gvlma outputs as a list and here is the output:
Call:
lm(formula = PD ~ ., data = dataset)
Coefficients:
(Intercept) WorldBank_Oil
1.282 -1.449
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = model)
Value p-value Decision
Global Stat 4.6172 0.3289 Assumptions acceptable.
Skewness 0.1858 0.6664 Assumptions acceptable.
Kurtosis 0.1812 0.6703 Assumptions acceptable.
Link Function 1.7823 0.1819 Assumptions acceptable.
Heteroscedasticity 2.4678 0.1162 Assumptions acceptable.
Replicating the error with the iris dataset:
library(officer)
library(flextable)

adf_df <- iris
ft <- flextable(data = adf_df) %>%
  theme_booktabs() %>%
  autofit()

gvlma <- lm(Petal.Length ~ Sepal.Length + Sepal.Width, data = iris)

# Output table into Word doc
doc <- read_docx() %>%
  body_add_flextable(value = ft) %>%
  body_add_par(gvlma)
Warning messages:
1: In if (grepl("<|>", x)) { :
  the condition has length > 1 and only the first element will be used
2: In charToRaw(enc2utf8(x)) :
  argument should be a character vector of length 1
  all but the first element will be ignored
The issue here is that the linear model is kept as a list, which is efficient for pulling out test parameters or model statistics but not great as static output.
One way to work around this is to use the commands from library(broom):
library(broom)
gvlma2 <- tidy(gvlma)
gvlma3 <- glance(gvlma)
doc <- read_docx() %>%
  body_add_flextable(value = ft) %>%
  body_add_flextable(value = flextable(gvlma2)) %>%
  body_add_flextable(value = flextable(gvlma3))
fileout <- "test.docx"
print(doc, target = fileout)
gvlma2:
# A tibble: 3 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -2.52 0.563 -4.48 1.48e- 5
2 Sepal.Length 1.78 0.0644 27.6 5.85e-60
3 Sepal.Width -1.34 0.122 -10.9 9.43e-21
gvlma3:
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
<dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int>
1 0.868 0.866 0.646 482. 2.74e-65 3 -146. 300. 312. 61.4 147
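If you really want the raw printed summary in the Word file rather than a tidied table, another hedged option is to capture the printed output as a character vector and add it one paragraph at a time (a sketch, using the gvlma/lm object from the example above):
library(officer)

# capture what print(gvlma) would show in the console, one string per line
gvlma_txt <- capture.output(print(gvlma))

doc <- read_docx()
for (line in gvlma_txt) {
  # body_add_par() expects a single string, so add the lines one by one
  doc <- body_add_par(doc, line, style = "Normal")
}
print(doc, target = "test_text.docx")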

Using custom function to apply across multiple groups and subsets

I am having trouble trying to apply a custom function to multiple groups within a data frame and mutate the result onto the original data. I am trying to calculate the percent inhibition for each row of data (each observation in the experiment has a value). The tricky part is that the function needs the means of two different groups of values (positive and negative controls) and then uses those means in each calculation.
In other words, the experimental value is subtracted from the mean of the negative control, then divided by the mean of the negative control minus the mean of the positive control.
Each observation, including the + and - controls, should have a calculated percent inhibition, and as a double check, for each experiment (grouping) the mean percent inhibition of the - controls should be around 0 and that of the + controls around 100.
The function:
percent_inhibition <- function(uninhibited, inhibited, unknown){
  uninhibited <- as.vector(uninhibited)
  inhibited <- as.vector(inhibited)
  unknown <- as.vector(unknown)

  mu_u <- mean(uninhibited, na.rm = TRUE)
  mu_i <- mean(inhibited, na.rm = TRUE)

  percent_inhibition <- (mu_u - unknown)/(mu_u - mu_i)*100
  return(percent_inhibition)
}
I have a data frame with multiple variables: target, box, replicate, and sample type. I am able to do the calculation by subsetting the data (below) to one target, box, and replicate, but I have not been able to figure out the right way to apply it to all of the data.
subset <- data %>%
  filter(target == "A", box == "1", replicate == 1)

uninhib <-
  subset$value[subset$sample == "uninhib"]
inhib <-
  subset$value[subset$sample == "inhib"]

pct <- subset %>%
  mutate(pct = percent_inhibition(uninhib, inhib, .$value))
I have tried group_by with do and nest, but my knowledge is lacking on how to apply these functions to my subsetting problem. I'm stuck when it comes to the subset of the subset (calculating the means) and then applying that to the individual values. I am hoping there is an elegant way to do this without all of the subsetting, but I am at a loss as to how.
I have tried:
inhibition <- data %>%
  group_by(target, box, replicate) %>%
  mutate(pct = percent_inhibition(.$value[.$sample == "uninhib"],
                                  .$value[.$sample == "inhib"],
                                  .$value))
But I get the error that the columns are not the right length, because of the group_by function.
library(tidyr)
library(purrr)
library(dplyr)
data %>%
  group_by(target, box, replicate) %>%
  mutate(pct = {
    x <- split(value, sample)
    percent_inhibition(x$uninhib, x$inhib, value)
  })
#> # A tibble: 10,000 x 6
#> # Groups: target, box, replicate [27]
#> target box replicate sample value pct
#> <chr> <chr> <int> <chr> <dbl> <dbl>
#> 1 A 1 3 inhib -0.836 1941.
#> 2 C 1 1 uninhib -0.221 -281.
#> 3 B 3 2 inhib -2.10 1547.
#> 4 C 1 1 uninhib -1.67 -3081.
#> 5 C 1 3 inhib -1.10 -1017.
#> 6 A 2 1 inhib -1.67 906.
#> 7 B 3 1 uninhib -0.0495 -57.3
#> 8 C 3 2 inhib 1.56 5469.
#> 9 B 3 2 uninhib -0.405 321.
#> 10 B 1 2 inhib 0.786 -3471.
#> # … with 9,990 more rows
Created on 2019-03-25 by the reprex package (v0.2.1)
Or:
data %>%
  group_by(target, box, replicate) %>%
  mutate(pct = percent_inhibition(value[sample == "uninhib"],
                                  value[sample == "inhib"],
                                  value))
With data as:
n <- 10000L
set.seed(123)
data <-
  tibble(
    target = sample(LETTERS[1:3], n, replace = TRUE),
    box = sample(as.character(1:3), n, replace = TRUE),
    replicate = sample(1:3, n, replace = TRUE),
    sample = sample(c("inhib", "uninhib"), n, replace = TRUE),
    value = rnorm(n)
  )
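As a quick check of the double-check mentioned in the question (within each target/box/replicate the uninhibited controls should average about 0 and the inhibited controls about 100), something along these lines should work with the simulated data above:
data %>%
  group_by(target, box, replicate) %>%
  mutate(pct = percent_inhibition(value[sample == "uninhib"],
                                  value[sample == "inhib"],
                                  value)) %>%
  # by construction these come out at (essentially) 0 and 100 in every group
  summarise(mean_uninhib = mean(pct[sample == "uninhib"]),
            mean_inhib   = mean(pct[sample == "inhib"]))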

The Intercept of a categorical multiple regression R is not the mean value?

Let's say I have 2 (categorical) variables and one continuous:
library(tidyverse)
set.seed(123)
ds <- data.frame(
  depression = rnorm(90, 10, 2),
  schooling_dummy = c(0, 1, 2),
  sex_dummy = c(0, 1)
)
When I regress depression on sex (0 or 1), the intercept is 10.0436, which is the mean for sex = 0. OK!
ds %>% group_by(sex_dummy) %>%
  summarise(formatC(mean(depression), format = "f", digits = 4))
# A tibble: 2 x 2
sex_dummy `formatC(mean(depression), format = "f", digits = 4)`
<dbl> <chr>
1 0 10.0436
2 1.00 10.1640
The same thing happens when I regress depression on schooling. The intercept value is 10.4398. The mean of schooling = 0 is the same.
ds %>% group_by(schooling_dummy) %>%
  summarise(formatC(mean(depression), format = "f", digits = 4))
# A tibble: 3 x 2
schooling_dummy `formatC(mean(depression), format = "f", digits = 4)`
<dbl> <chr>
1 0 10.4398
2 1.00 9.7122
3 2.00 10.1593
Now, when I compute a model with both variables, why is the intercept not the mean when both groups = 0? The regression intercept is 10.3796, but the mean when sex = 0 and schooling = 0 is 10.32548:
ds %>% group_by(schooling_dummy, sex_dummy) %>%
  summarise(formatC(mean(depression), format = "f", digits = 5))
# A tibble: 6 x 3
# Groups: schooling_dummy [?]
schooling_dummy sex_dummy `formatC(mean(depression), format = "f", digits = 5)`
<dbl> <dbl> <chr>
1 0 0 10.32548
2 0 1.00 10.55404
3 1.00 0 9.59305
4 1.00 1.00 9.83139
5 2.00 0 10.21218
6 2.00 1.00 10.10648
When I predict the model when both are 0:
predict(mod3, data.frame(sex_dummy=0, schooling_dummy=0))
1
10.37956
This result is related to depression (of course...) but still not what I was expecting, since the intercept should be the mean of the outcome when all predictors are at their reference (zero) level (Reference: https://www.theanalysisfactor.com/interpret-the-intercept/), which is the same point made in this previous forum post.
I am aware that my variables are categorical and I'm adjusting my script, as you can reproduce using the code below.
Thanks
library(tidyverse)
set.seed(123)

ds <- data.frame(
  depression = rnorm(90, 10, 2),
  schooling_dummy = c(0, 1, 2),
  sex_dummy = c(0, 1)
)

mod <- lm(data = ds, depression ~ relevel(factor(sex_dummy), ref = "0"))
summary(mod)

ds %>% group_by(sex_dummy) %>%
  summarise(formatC(mean(depression), format = "f", digits = 4))

mod2 <- lm(data = ds, depression ~ relevel(factor(schooling_dummy), ref = "0"))
summary(mod2)

ds %>% group_by(schooling_dummy) %>%
  summarise(formatC(mean(depression), format = "f", digits = 4))

mod3 <- lm(data = ds, depression ~ relevel(factor(sex_dummy), ref = "0") +
             relevel(factor(schooling_dummy), ref = "0"))
summary(mod3)

ds %>% group_by(schooling_dummy, sex_dummy) %>%
  summarise(formatC(mean(depression), format = "f", digits = 5))

predict(mod3, data.frame(sex_dummy = 0, schooling_dummy = 0))
There are two errors in your thinking (although your R code works, so it's not a programming error).
First and foremost, you violated your own statement: you have not dummy coded schooling, since it does not have only zeroes and ones; it has 0, 1 & 2.
Second, you forgot the interaction effect in your lm modeling...
Try this...
library(tidyverse)
set.seed(123)

ds <- data.frame(
  depression = rnorm(90, 10, 2),
  schooling_dummy = c(0, 1, 2),
  sex_dummy = c(0, 1)
)

# if you explicitly make these variables factors, not integers, R will do the right thing with them
ds$schooling_dummy <- factor(ds$schooling_dummy)
ds$sex_dummy <- factor(ds$sex_dummy)

ds %>% group_by(schooling_dummy, sex_dummy) %>%
  summarise(formatC(mean(depression), format = "f", digits = 5))

# you need an asterisk in your lm model to include the interaction term
lm(depression ~ schooling_dummy * sex_dummy, data = ds)
The results give you the mean(s) you were expecting...
Call:
lm(formula = depression ~ schooling_dummy * sex_dummy, data = ds)
Coefficients:
(Intercept) schooling_dummy1 schooling_dummy2
10.325482 -0.732433 -0.113305
sex_dummy1 schooling_dummy1:sex_dummy1 schooling_dummy2:sex_dummy1
0.228561 0.009778 -0.334254
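As a quick check (my own addition, with a hypothetical object name mod_int), predicting the interaction model at the reference levels recovers the cell mean the question was expecting:
mod_int <- lm(depression ~ schooling_dummy * sex_dummy, data = ds)
predict(mod_int, data.frame(schooling_dummy = factor(0, levels = c(0, 1, 2)),
                            sex_dummy = factor(0, levels = c(0, 1))))
# ~10.3255, i.e. the schooling = 0, sex = 0 cell mean from the table above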
And FWIW, you can avoid this sort of accidental misuse of categorical variables if your data is coded as characters to begin with... so if your data is coded this way:
ds <- data.frame(
  depression = rnorm(90, 10, 2),
  schooling = c("A", "B", "C"),
  sex = c("Male", "Female")
)
You're less likely to make the same mistake, plus the results are easier to read...
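A short sketch of what that might look like end to end (same seed as before; the column names follow the snippet just above):
set.seed(123)
ds_chr <- data.frame(
  depression = rnorm(90, 10, 2),
  schooling  = c("A", "B", "C"),
  sex        = c("Male", "Female")
)

# character columns are treated as factors by lm(), so no manual dummy coding is needed
lm(depression ~ schooling * sex, data = ds_chr)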
