I am trying to test the effect of a treatment on the proportion of juveniles in a population of migrating birds. The birds were counted and identified as juveniles or adults daily, but the treatment was applied only on every second day. Days without treatment were used as a control. The problem is that the proportion of juveniles in the population is expected to be affected not only by the treatment, but also by migration phenology. For example, it is possible that on a given day more juveniles migrated to the study area, and therefore this, and not only the treatment, affected the proportion of juveniles in the population. To account for this problem, I also checked the proportion of juveniles every day at a nearby site which was not affected by the treatment (i.e., a control site). Hence, I have two types of controls.
To analyze the data, I thought of using a binomial GLMM, with the proportion of juveniles as the variable of interest, the treatment as a categorical (with or without treatment) explanatory variable and day as a random-intercepts factor, and I use weights to account for the different number of birds in each day, but I am not sure how to input the data from the control site. From what I read, it should be used as an offset, but I am not sure exactly how.
Is the link function affected by the fact it (juveniles prop. at the ctrl. site) is a proportion?
Is it better to use the juveniles prop. at the ctrl. site in an interaction instead of an offset (i.e., ~ Treatment * Juv.prop.cntrl.site)?
This is the model I have so far, but I am not sure if it makes sense, especially if the offset is set correctly:
glm(Juv.prop.exp.site ~ Treatment + Day, offset = Juv.prop.cntrl.site, weights = Tot.birds.exp.site, data = df, family = binomial)
Where Juv.prop.exp.site is the number of juveniles divided by the total at this site (juveniles + adults)
See the data here: DATA (day starts at 11, because during the first 10 days no birds of that species were observed)
Normally, I would suggest that questions regarding statistical analysis are migrated to CrossValidated, where you will get better answers to purely statistical questions. However, in your case, it will help a lot to reshape your data into a tidy format before analysis, which is more of a programming problem.
Essentially, you need one column each for day, site, treatment, number of juveniles, and number of adults. I am assuming that in your data, "V" is the treatment and "X" is the control.
library(tidyverse)
df <- data %>%
  select(1, 2, 4, 5, 8, 9) %>%
  rename_all(~gsub("\\.site", "_site", .x)) %>%
  pivot_longer(1:4, names_sep = "\\.", names_to = c(".value", "Site")) %>%
  mutate(Treatment = ifelse(Site == "Exp_site", Treatment, "X")) %>%
  mutate(Treatment = ifelse(Treatment == "V", "Treatment", "Control")) %>%
  mutate(Site = ifelse(Site == "Exp_site", "Experimental", "Control")) %>%
  rename(Juveniles = Juv, Adults = Ad) %>%
  select(2, 1, 3:5)
This makes your data look like this, and to my mind this is easier to analyse (and to reason about):
df
#> # A tibble: 100 x 5
#> Day Treatment Site Juveniles Adults
#> <int> <chr> <chr> <int> <int>
#> 1 11 Control Experimental 1 0
#> 2 11 Control Control 0 0
#> 3 12 Treatment Experimental 2 1
#> 4 12 Control Control 1 0
#> 5 13 Control Experimental 2 0
#> 6 13 Control Control 1 1
#> 7 14 Treatment Experimental 6 3
#> 8 14 Control Control 4 2
#> 9 15 Control Experimental 6 4
#> 10 15 Control Control 1 2
#> # ... with 90 more rows
#> # i Use `print(n = ...)` to see more rows
You can then perform a binomial glm like this, with Treatment and Site as independent variables.
model <- glm(cbind(Juveniles, Adults) ~ Treatment + Site,
             data = df, family = binomial)
summary(model)
#> Call:
#> glm(formula = cbind(Juveniles, Adults) ~ Treatment + Site, family = binomial,
#> data = df)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -3.4652 -0.6971 0.0000 0.7895 2.9541
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 1.0059 0.1461 6.886 5.74e-12 ***
#> TreatmentTreatment 0.3012 0.2877 1.047 0.295
#> SiteExperimental -0.1632 0.2598 -0.628 0.530
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 118.16 on 88 degrees of freedom
#> Residual deviance: 117.07 on 86 degrees of freedom
#> AIC: 244.13
#>
#> Number of Fisher Scoring iterations: 4
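As a side note on the weights idea from the question: the two-column cbind(Juveniles, Adults) response used above should give the same fit as modelling the proportion with the daily totals as weights. Here is a minimal sketch on simulated counts. The data frame `toy` and its columns `trt`, `total`, `juv` and `ad` are made up for illustration and are not the data from the question.

```r
# Simulated counts; `toy`, `trt`, `total`, `juv` and `ad` are hypothetical names
set.seed(1)
toy <- data.frame(trt = rep(c("Control", "Treatment"), each = 10))
toy$total <- rpois(20, lambda = 15) + 1   # at least one bird per day
toy$juv   <- rbinom(20, size = toy$total, prob = 0.6)
toy$ad    <- toy$total - toy$juv

# Form 1: juveniles and adults as a two-column response
fit_counts <- glm(cbind(juv, ad) ~ trt, family = binomial, data = toy)

# Form 2: proportion as the response, daily totals as prior weights
fit_prop <- glm(juv / total ~ trt, weights = total,
                family = binomial, data = toy)

# Compare the coefficients of the two parameterisations
all.equal(coef(fit_counts), coef(fit_prop))
```

So the choice between the two forms is mostly a matter of convenience; the cbind() form just avoids having to compute proportions and totals by hand.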
I was getting acquainted with tidymodels by reading the book and this line in section 9.2 kept me thinking about retransformation.
It is best practice to analyze the predictions on the transformed
scale (if one were used) even if the predictions are reported using
the original units.
But I found it confusing that the examples in the book use a log transformation on the outcome, but they do not use a recipe for this (the recipe has not been introduced at this point, but later when they introduce recipe, still they do not use step_log for the outcome but just for the predictors). So I wanted to try that and found something puzzling, illustrated with the reprex below:
# So let's use most of the code from the examples in the book
library(tidymodels)
tidymodels_prefer()
set.seed(501)
# data budget
data(ames)
ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
ames_folds <- vfold_cv(ames_train, v = 10, strata = Sale_Price)
# IJALM
lm_model <-
  linear_reg(penalty = 0) |>
  set_engine("glmnet")
# And use a recipe,
# but instead of manually transforming the outcome like this ...
# `ames <- ames %>% mutate(Sale_Price = log10(Sale_Price))`
# let's include the outcome transformation into the recipe
simple_ames <-
  recipe(
    Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type,
    data = ames_train
  ) |>
  step_log(Gr_Liv_Area, base = 10) |>
  step_dummy(all_nominal_predictors()) |>
  step_log(Sale_Price, base = 10, skip = TRUE)
lm_wflow <-
  workflow() |>
  add_model(lm_model) |>
  add_recipe(simple_ames)
lm_res <- fit_resamples(
  lm_wflow,
  resamples = ames_folds,
  control = control_resamples(save_pred = TRUE, save_workflow = TRUE),
  metrics = metric_set(rmse)
)
collect_metrics(lm_res)
#> # A tibble: 1 x 6
#> .metric .estimator mean n std_err .config
#> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 rmse standard 197264. 10 1362. Preprocessor1_Model1
# Now, I wanted to double-check how rmse was calculated
# It should be the mean of the rmse for each fold
# (individual values stored in this list lm_res$.metrics,
# with one element for each fold)
# and each rmse should have been calculated with the predictions of each fold
# (stored in lm_res$.predictions)
# So, this rmse corresponding to the first fold
lm_res$.metrics[[1]]
#> # A tibble: 1 x 4
#> .metric .estimator .estimate .config
#> <chr> <chr> <dbl> <chr>
#> 1 rmse standard 196074. Preprocessor1_Model1
# Should have been calculated with this data
lm_res$.predictions[[1]]
#> # A tibble: 236 x 4
#> .pred .row Sale_Price .config
#> <dbl> <int> <int> <chr>
#> 1 5.09 2 126000 Preprocessor1_Model1
#> 2 4.92 33 105900 Preprocessor1_Model1
#> 3 5.06 34 125500 Preprocessor1_Model1
#> 4 5.18 44 121500 Preprocessor1_Model1
#> 5 5.14 51 127000 Preprocessor1_Model1
#> 6 5.13 53 114000 Preprocessor1_Model1
#> 7 4.90 57 84900 Preprocessor1_Model1
#> 8 5.13 62 113000 Preprocessor1_Model1
#> 9 5.02 74 83500 Preprocessor1_Model1
#> 10 5.02 76 88250 Preprocessor1_Model1
#> # ... with 226 more rows
#> # i Use `print(n = ...)` to see more rows
# But here's the issue!
# The predictions are in the log-scale while the observed values
# are in the original units.
# This is just a quick-check to make sure the rmse reported above
# (calculated by yardstick) does in fact involve mixing-up the log-scale
# (predictions) and the original units (observed values)
yhat <- lm_res$.predictions[[1]]$.pred
yobs <- lm_res$.predictions[[1]]$Sale_Price
sqrt(mean((yhat - yobs)^2))
#> [1] 196073.6
# So, apparently, for cross-validation tidymodels does not `bake` the folds
# with the recipe to calculate the metrics
And here’s where I got it (at least I think so), after spending half
an hour writing this reprex. So, to not feel I wasted my time, I decided
to post it anyway, and put what I think is going on as an answer.
Perhaps someone finds it useful, because it was not evident to me at
first. Or perhaps someone can explain if there is something else going on.
Created on 2022-08-07 by the reprex package (v2.0.1)
It was basically your fault. You explicitly told tidymodels not to bake() this step. The line below was the culprit, in particular, the skip = TRUE part.
step_log(Sale_Price, base = 10, skip = TRUE)
Hence, tidymodels will not include that step in baking the folds before making the predictions and you end up with the log-scale of the predictions mixed-up with the untransformed outcome variable. This is perhaps one of the examples they had in mind when they wrote in the documentation that:
Care should be taken when using skip = TRUE as it may affect the
computations for subsequent operations.
You probably decided to skip that step because the outcome variable may not be available in a new dataset for which you want predictions, and then the process would fail. But skipping it basically messes up the metrics for cross-validation. So, better not to skip the step and to deal with that problem some other way.
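To see the effect of skip = TRUE in isolation, here is a small sketch using the recipes package directly. It uses mtcars with mpg as the outcome purely for illustration (that choice is an assumption of this example, not something from the question).

```r
library(recipes)

# Log-transform the outcome with skip = TRUE, as in the question
# (mtcars and mpg are stand-ins chosen for this illustration)
rec <- recipe(mpg ~ wt, data = mtcars) |>
  step_log(mpg, base = 10, skip = TRUE)

prepped <- prep(rec, training = mtcars)

# new_data = NULL returns the training set as processed during prep(),
# where skipped steps were still applied
baked_train <- bake(prepped, new_data = NULL)

# Any other data passed to bake() bypasses the skipped step
baked_new <- bake(prepped, new_data = mtcars)

head(baked_train$mpg)  # log10 scale
head(baked_new$mpg)    # original units
```

The model is therefore trained on the log scale, while any data baked afterwards (including the assessment part of each fold) keeps the outcome in its original units, which is the mismatch seen in the reprex.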
I have a data frame with post and follow-up measurements for approximately 200 people. In the study, we try to find out if there is a correlation between sports participation and distress symptoms. We have two measurement periods (post and follow-up) that are conducted after a workshop about health and sports. Post was conducted 6 months after the workshop and follow-up one year after the workshop. We formed the following hypothesis: "Participation in sport for obese people within one year after a workshop correlates significantly positively with psychological distress symptoms at follow-up." I assume the dependent variable is psychological distress and the independent variable is the participation in sports activities. The data structure looks like:
Df
$ measurement_period : Factor w/ 2 levels "0","1": 1 1 1 1
$ psychological_distress : int 12 45 32 85
$ participation : Factor w/ 2 levels "0","1": 1 1 1 1
$ id : num 1 2 3 4
After reading some posts here, we believe that there are 2 levels in the model: 1 ) measurement period (post and follow up) 2) id
At first we conducted an unconditional model (an intercept-only model, to confirm whether a multilevel model is appropriate; we hope this is right) with the following code:
test <- lmer(psychological_distress ~ 1 + (1 | id), data = Df)
But we are not sure whether the model is appropriate given the data structure, or whether the level 1 and level 2 classification is correct.
Thank you very much in advance!
Your model:
lmer(psychological_distress ~ 1 + (1|id) , data = Df)
is a variance components model. It will tell you how much of the variation in psychological_distress is attributable to the id level, and how much is attributable to the unit/residual level. That isn't going to answer your research question:
we try to find out if there is a correlation between sports participation and distress symptoms
To look into this, you need to include the participation variable as a fixed effect, and also the time variable, and their interaction. So in the first instance I would consider this:
lmer(psychological_distress ~ measurement_period*participation + (1|id) , data = Df)
A good website on how to fit longitudinal and growth models using lme4 is https://rpsychologist.com/r-guide-longitudinal-lme-lmer
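For reference, the variance components model is still useful for one thing: the share of the total variance that sits at the id level (the intraclass correlation). A minimal sketch follows, using the sleepstudy data that ships with lme4 as a stand-in for your data, so Subject plays the role of id and Reaction the role of psychological_distress.

```r
library(lme4)

# sleepstudy ships with lme4; it stands in for the study data here
m0 <- lmer(Reaction ~ 1 + (1 | Subject), data = sleepstudy)

# Variance components: between-subject and residual
vc <- as.data.frame(VarCorr(m0))
between  <- vc$vcov[vc$grp == "Subject"]
residual <- vc$vcov[vc$grp == "Residual"]

# Intraclass correlation: share of total variance at the subject level
icc <- between / (between + residual)
icc
```

A value well above zero suggests that observations from the same person are correlated, which supports keeping the (1 | id) term once the fixed effects are added.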
As Robert pointed out, and as demonstrated on the website, it is often useful to fit an interaction between "time" and "group" (e.g., treatment vs. control), to see how the outcome changes for each group over time. You can see this change by looking at the coefficients, but it's usually easier to plot (adjusted) predictions.
Here's a toy example:
library(parameters)
library(datawizard)
library(lme4)
library(ggeffects)
data("qol_cancer")
# filter two time points
qol_cancer <- data_filter(qol_cancer, time %in% c(1, 2))
# create fake treatment/control variable
set.seed(123)
treatment <- sample(unique(qol_cancer$ID), size = length(unique(qol_cancer$ID)) / 2, replace = FALSE)
qol_cancer$treatment <- 0
qol_cancer$treatment[qol_cancer$ID %in% treatment] <- 1
qol_cancer$time <- as.factor(qol_cancer$time)
qol_cancer$treatment <- factor(qol_cancer$treatment, labels = c("control", "treatment"))
m <- lmer(QoL ~ time * treatment + (1 + time | ID),
          data = qol_cancer,
          control = lmerControl(check.nobs.vs.nRE = "ignore"))
model_parameters(m)
#> # Fixed Effects
#>
#> Parameter | Coefficient | SE | 95% CI | t(368) | p
#> ----------------------------------------------------------------------------------------
#> (Intercept) | 70.74 | 2.15 | [66.52, 74.97] | 32.90 | < .001
#> time [2] | 0.27 | 2.22 | [-4.10, 4.64] | 0.12 | 0.905
#> treatment [treatment] | 4.88 | 3.04 | [-1.10, 10.86] | 1.60 | 0.110
#> time [2] * treatment [treatment] | 1.95 | 3.14 | [-4.23, 8.13] | 0.62 | 0.535
#>
#> # Random Effects
#>
#> Parameter | Coefficient
#> ---------------------------------------
#> SD (Intercept: ID) | 15.14
#> SD (time2: ID) | 7.33
#> Cor (Intercept~time2: ID) | -0.62
#> SD (Residual) | 14.33
#>
#> Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
#> using a Wald t-distribution approximation.
ggpredict(m, c("time", "treatment")) |> plot()
Regarding the statistical significance of the interaction term: the p-values from the summary might be misleading. If you're really interested in statistically significant differences either between time points, or between groups (treatment vs. control), it is recommended to calculate pairwise contrasts including p-values. You can do this, e.g., with the emmeans-package.
library(emmeans)
emmeans(m, c("time", "treatment")) |> contrast(method = "pairwise", adjust = "none")
#> contrast estimate SE df t.ratio p.value
#> time1 control - time2 control -0.266 2.22 186 -0.120 0.9049
#> time1 control - time1 treatment -4.876 3.04 186 -1.604 0.1105
#> time1 control - time2 treatment -7.092 2.89 316 -2.453 0.0147
#> time2 control - time1 treatment -4.610 2.89 316 -1.594 0.1118
#> time2 control - time2 treatment -6.826 2.73 186 -2.497 0.0134
#> time1 treatment - time2 treatment -2.216 2.22 186 -0.997 0.3199
#>
#> Degrees-of-freedom method: kenward-roger
Created on 2022-05-22 by the reprex package (v2.0.1)
Here you can see, e.g., that treatment and control do not differ regarding their QoL at time point 1, but they do at time point 2.
I am trying to use the lme4 package in R and function lmer() to fit a model for my split-split plot design. I would have used a repeated measures ANOVA if I did not have a small number of observations missing, but the missing data should be no problem with a linear mixed effects model.
My data frame (data) has a simple structure with four factors and a numeric outcome variable called all_vai. Note that in this example data frame, not all levels of all factors are crossed even though they would be in my real data (except for the missing observations). It shouldn't matter for my question, which is an attempt to fix problematic syntax.
collected_vai <- rnorm(125, mean = 6, sd = 1)
missing <- rep(NA, times = 3)
all_vai <- c(collected_vai, missing)
year1 <- rep(2018, times = 32)
year2 <- rep(2019, times = 32)
year3 <- rep(2020, times = 32)
year4 <- rep(2021, times = 32)
year <- c(year1, year2, year3, year4)
disturbance_severity <- rep(c(0,45,65,85), each = 32)
treatment <- rep(c("B" , "T"), each = 64)
replicate <- rep(c("A", "B", "C", "D"), each = 32)
data = data.frame(all_vai, year, disturbance_severity, treatment, replicate)
data$year <- as.factor(data$year)
data$disturbance_severity <- as.factor(data$disturbance_severity)
data$treatment <- as.factor(data$treatment)
data$replicate <- as.factor(data$replicate)
Here is the model I ran for an identical data set with a different (normally distributed) numeric outcome and no missing observations -- i.e., this is the model I would be running if I didn't have unbalanced repeated measures now due to missing data:
VAImodel1 <- aov(all_vai ~ disturbance_severity*treatment*year + Error(replicate/disturbance_severity/treatment/year), data = data)
summary(VAImodel1)
When I run this, I get the error message: "Warning message:
In aov(mean_vai ~ disturbance_severity * treatment * Year + Error(Replicate/disturbance_severity/treatment/Year), :
Error() model is singular"
I have observations from different years nested within different treatments, which are nested within different disturbance severities, and all of this nested within replicates (which are experimental blocks). So I tried using this structure in lme4:
library(lme4)
library(lmerTest)
VAImodel2 <- lmer(all_vai ~ (year|replicate:disturbance_severity:treatment) + disturbance_severity*treatment*year, data = data)
summary(VAImodel2)
And this is the error message I get: "Error: number of observations (=125) <= number of random effects (=128) for term (Year | Replicate:disturbance_severity:treatment); the random-effects parameters and the residual variance (or scale parameter) are probably unidentifiable"
Next I tried simplifying my model so that I was not running out of degrees of freedom, by removing the treatment variable and interaction term, like so:
VAImodel3 <- lmer(all_vai ~ (year|replicate:disturbance_severity) + disturbance_severity*year, data = data)
summary(VAImodel3)
This time I get a different error: "boundary (singular) fit: see ?isSingular
Warning message:
Model failed to converge with 1 negative eigenvalue: -1.2e-01 "
Thank you in advance for any help.
Your problem is incorrect data preparation!
Let's start by defining values for your variables year, disturbance_severity, treatment, replicate.
library(tidyverse)
set.seed(123)
yars = 2018:2021
disturbances = c(0,45,65,85)
treatments = c("B" , "T")
replicates = c("A", "B", "C", "D")
n = length(yars)*length(disturbances)*length(treatments)*length(replicates)*1
nNA=3
Please note that I first created the variables yars, disturbances, treatments and replicates with all the allowed values.
Then I calculated the amount of data in n (you can increase the last value in the multiplication from 1 e.g. to 10) and determined how many values will be missing in the variable nNA.
The key aspect is the use of the function expand.grid(yars, disturbances, treatments, replicates) which will return the appropriate table with the correct distribution of values.
Look at the first few lines of what expand.grid returns.
Var1 Var2 Var3 Var4
1 2018 0 B A
2 2019 0 B A
3 2020 0 B A
4 2021 0 B A
5 2018 45 B A
6 2019 45 B A
7 2020 45 B A
8 2021 45 B A
9 2018 65 B A
10 2019 65 B A
11 2020 65 B A
12 2021 65 B A
13 2018 85 B A
14 2019 85 B A
15 2020 85 B A
16 2021 85 B A
17 2018 0 T A
18 2019 0 T A
This is crucial here.
The next step is straightforward. We create a tibble and pass it to the aov function.
data = tibble(sample(c(rnorm(n-nNA, mean = 6, sd = 1), rep(NA, nNA)), n)) %>%
  mutate(expand.grid(yars, disturbances, treatments, replicates)) %>%
  rename_with(~c("all_vai", "year", "disturbance_severity", "treatment", "replicate"))
VAImodel1 <- aov(all_vai ~ disturbance_severity*treatment*year +
                   Error(replicate/disturbance_severity/treatment/year), data = data)
summary(VAImodel1)
output
Error: replicate
Df Sum Sq Mean Sq F value Pr(>F)
disturbance_severity 1 0.1341 0.1341 0.093 0.811
treatment 1 0.0384 0.0384 0.027 0.897
Residuals 1 1.4410 1.4410
Error: replicate:disturbance_severity
Df Sum Sq Mean Sq F value Pr(>F)
disturbance_severity 1 0.1391 0.1391 0.152 0.763
treatment 1 0.1819 0.1819 0.199 0.733
year 1 1.4106 1.4106 1.545 0.431
Residuals 1 0.9129 0.9129
Error: replicate:disturbance_severity:treatment
Df Sum Sq Mean Sq F value Pr(>F)
treatment 1 0.4647 0.4647 0.698 0.491
year 1 0.8127 0.8127 1.221 0.384
Residuals 2 1.3311 0.6655
Error: replicate:disturbance_severity:treatment:year
Df Sum Sq Mean Sq F value Pr(>F)
treatment 1 2.885 2.8846 3.001 0.144
year 1 0.373 0.3734 0.388 0.560
treatment:year 1 0.002 0.0015 0.002 0.970
Residuals 5 4.806 0.9612
Error: Within
Df Sum Sq Mean Sq F value Pr(>F)
treatment 1 0.03 0.031 0.039 0.8430
year 1 1.29 1.292 1.662 0.2002
treatment:year 1 4.30 4.299 5.532 0.0206 *
Residuals 102 79.26 0.777
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Now there is no "Error() model is singular" warning!
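One caveat, offered as a hedged observation: the single degree of freedom reported for disturbance_severity and year in the output above suggests they were fitted as numeric variables, not as four-level factors. The sketch below builds the same layout with named columns, converts every design variable to a factor (an assumption about the intended analysis), and checks that each combination of levels occurs exactly once. The object names `design` and `cells` are made up for this example.

```r
# Full factorial layout; the column names follow the question
design <- expand.grid(
  year                 = 2018:2021,
  disturbance_severity = c(0, 45, 65, 85),
  treatment            = c("B", "T"),
  replicate            = c("A", "B", "C", "D")
)

# Treat every design variable as categorical
# (an assumption about the intended analysis)
design[] <- lapply(design, factor)

# Count the rows in each combination of factor levels
cells <- xtabs(~ year + disturbance_severity + treatment + replicate,
               data = design)
all(cells == 1)

# For comparison, the rep(..., each = ) layout from the question ties
# treatment to replicate: "B" only occurs with replicates A and B
table(treatment = rep(c("B", "T"), each = 64),
      replicate = rep(c("A", "B", "C", "D"), each = 32))
```

The second table shows why the original data frame was a problem: with rep(..., each = ), whole factors end up confounded with each other, so their effects cannot be separated.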
I'm doing an assignment for university and have copied and pasted the R code so I know it's right but I'm still not getting any P or F values from my data:
Food Temperature Area
50 11 820.2175
100 11 936.5437
50 14 1506.568
100 14 1288.053
50 17 1692.882
100 17 1792.54
This is the code I've used so far:
aovdata<-read.table("Condition by area.csv",sep=",",header=T)
attach(aovdata)
Food <- as.factor(Food) ; Temperature <- as.factor(Temperature)
summary(aov(Area ~ Temperature*Food))
but then this is the output:
Df Sum Sq Mean Sq
Temperature 2 757105 378552
Food 1 1 1
Temperature:Food 2 35605 17803
Any help, especially the code I need to fix it, would be great. I think there could be a problem with the data but I don't know what.
I would do this. Be aware of the difference between factor and continuous predictors.
library(tidyverse)
df <- sapply(strsplit(c("Food Temperature Area", "50 11 820.2175", "100 11 936.5437",
                        "50 14 1506.568", "100 14 1288.053", "50 17 1692.882",
                        "100 17 1792.54"), " +"), paste0, collapse = ",") %>%
  read_csv()
model <- lm(Area ~ Temperature * as.factor(Food),df)
summary(model)
#>
#> Call:
#> lm(formula = Area ~ Temperature * as.factor(Food), data = df)
#>
#> Residuals:
#> 1 2 3 4 5 6
#> -83.34 25.50 166.68 -50.99 -83.34 25.50
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -696.328 505.683 -1.377 0.302
#> Temperature 145.444 35.580 4.088 0.055 .
#> as.factor(Food)100 38.049 715.144 0.053 0.962
#> Temperature:as.factor(Food)100 -2.778 50.317 -0.055 0.961
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 151 on 2 degrees of freedom
#> Multiple R-squared: 0.9425, Adjusted R-squared: 0.8563
#> F-statistic: 10.93 on 3 and 2 DF, p-value: 0.08498
ggeffects::ggpredict(model,terms = c('Temperature','Food')) %>% plot()
Created on 2020-12-08 by the reprex package (v0.3.0)
The actual problem with your example is not that you're using factors as predictor variables, but rather that you have fitted a 'saturated' linear model (as many parameters as observations), so there is no variation left to compute a residual SSQ, so the ANOVA doesn't include F/P values etc.
It's fine for temperature and food to be categorical (factor) predictors, that's how they would be treated in a classic two-way ANOVA design. It's just that in order to analyze this design with the interaction you need more replication.
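The point about saturation can be checked directly on the six observations from the question. This is a sketch, not a recommendation: dropping the interaction, as done in the second model, amounts to assuming the two factors act additively, which the data cannot confirm without replication.

```r
# The six observations from the question
d <- data.frame(
  Food        = factor(c(50, 100, 50, 100, 50, 100)),
  Temperature = factor(c(11, 11, 14, 14, 17, 17)),
  Area        = c(820.2175, 936.5437, 1506.568, 1288.053, 1692.882, 1792.54)
)

# Saturated model: 6 parameters for 6 observations, no residual df
fit_full <- aov(Area ~ Temperature * Food, data = d)
df.residual(fit_full)

# Dropping the interaction (an additivity assumption) frees 2 residual df,
# so F and p values can be computed
fit_add <- aov(Area ~ Temperature + Food, data = d)
df.residual(fit_add)
summary(fit_add)
```

With only two residual degrees of freedom the additive model has very little power, so collecting replicate measurements per Food and Temperature combination remains the better fix.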
I understand that I need one dummy variable less than the total number of dummy variables. However, I am stuck, because I keep receiving the error: "1 not defined because of singularities", when running lm in R. I found a similar question here: What is causing this error? Coefficients not defined because of singularities but it is slightly different than my problem.
I have two treatments, (1) "benefit" and (2) "history", with two levels each: (1) "low" and "high" and (2) "short" and "long", i.e. 4 combinations. Additionally, I have a control group, which was exposed to neither. Therefore, I coded 4 dummy variables (one less than the total number of groups, n=5). The dummy-coded data looks like this:
                             low benefit  high benefit  short history  long history
Control group                0            0             0              0
low benefit, short history   1            0             1              0
low benefit, long history    1            0             0              1
high benefit, short history  0            1             1              0
high benefit, long history   0            1             0              1
When I run my lm I get this:
Model:
summary(lm(X ~ short history + high benefit + long history + low benefit + Control variables, data = df))
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.505398100 0.963932438 5.71139 4.8663e-08 ***
Dummy short history 0.939025772 0.379091565 2.47704 0.0142196 *
Dummy high benefit -0.759944023 0.288192645 -2.63693 0.0091367 **
Dummy long history 0.759352915 0.389085599 1.95163 0.0526152 .
Dummy low benefit NA NA NA NA
Control variables xxx xxx xxx xxx
This error always occurs for the dummy variable in the 4th position. The control variables are all estimated without problems.
I already tried to only include two variables with two levels, meaning for "history" I coded, 1 for "long" and 0 for "short", and for "benefit", 1 for "high" and 0 for "low". This way, the lm worked, but the problem is, that the Control Group and the combination "short history, low benefit" are coded identically, i.e. 0 and 0 for both variables.
I am sorry, if this is a basic mistake but I have not been able to figure it out. If you need more information, please say so. Thanks in advance.
As I put in the comments, you only have two variables. If you make them factors and check the contrasts, R will do the right thing. Please also see http://www.sthda.com/english/articles/40-regression-analysis/163-regression-with-categorical-variables-dummy-coding-essentials-in-r/
Make up data representative of yours.
set.seed(2020)
df <- data.frame(
X = runif(n = 120, min = 5, max = 15),
benefit = rep(c("control", "low", "high"), 40),
history = c(rep("control", 40), rep("long", 40), rep("short", 40))
)
Make benefit and history factors, check that control is the base contrast for each.
df$benefit <- factor(df$benefit)
df$history <- factor(df$history)
contrasts(df$benefit)
#> high low
#> control 0 0
#> high 1 0
#> low 0 1
contrasts(df$history)
#> long short
#> control 0 0
#> long 1 0
#> short 0 1
Run the regression and get the summary. There are 4 coefficients, all compared to control/control.
lm(X ~ benefit + history, df)
#>
#> Call:
#> lm(formula = X ~ benefit + history, data = df)
#>
#> Coefficients:
#> (Intercept) benefithigh benefitlow historylong historyshort
#> 9.94474 -0.08721 0.11245 0.37021 -0.35675
summary(lm(X ~ benefit + history, df))
#>
#> Call:
#> lm(formula = X ~ benefit + history, data = df)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -5.4059 -2.3706 -0.0007 2.4986 4.7669
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 9.94474 0.56786 17.513 <2e-16 ***
#> benefithigh -0.08721 0.62842 -0.139 0.890
#> benefitlow 0.11245 0.62842 0.179 0.858
#> historylong 0.37021 0.62842 0.589 0.557
#> historyshort -0.35675 0.62842 -0.568 0.571
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.809 on 115 degrees of freedom
#> Multiple R-squared: 0.01253, Adjusted R-squared: -0.02182
#> F-statistic: 0.3648 on 4 and 115 DF, p-value: 0.8333
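As for why lm() reported NA for the last dummy in the question: with the four dummies as coded there, low + high is always equal to short + long, so one column carries no extra information. A small sketch with one row per group shows this; the matrix `X` and the factor `group` are names made up for this illustration.

```r
# One row per group, using the dummy coding from the question
X <- cbind(
  intercept = 1,
  low   = c(0, 1, 1, 0, 0),
  high  = c(0, 0, 0, 1, 1),
  short = c(0, 1, 0, 1, 0),
  long  = c(0, 0, 1, 0, 1)
)
rownames(X) <- c("control", "low/short", "low/long", "high/short", "high/long")

# low + high equals short + long in every row, so one column is redundant
all(X[, "low"] + X[, "high"] == X[, "short"] + X[, "long"])
qr(X)$rank   # compare with ncol(X), which is 5

# A single five-level group factor gives a full-rank design instead
group <- factor(rownames(X), levels = rownames(X))
qr(model.matrix(~ group))$rank
```

Five groups can support at most five parameters including the intercept, but those parameters have to be linearly independent. Coding the groups as one five-level factor is one way to get that; lm() dropping whichever dummy comes last in the formula is its way of dealing with the redundancy.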