I'm doing an assignment for university and have copied and pasted the R code, so I know it's right, but I'm still not getting any P or F values from my data:
Food Temperature Area
50 11 820.2175
100 11 936.5437
50 14 1506.568
100 14 1288.053
50 17 1692.882
100 17 1792.54
This is the code I've used so far:
aovdata<-read.table("Condition by area.csv",sep=",",header=T)
attach(aovdata)
Food <- as.factor(Food) ; Temperature <- as.factor(Temperature)
summary(aov(Area ~ Temperature*Food))
but then this is the output:
Df Sum Sq Mean Sq
Temperature 2 757105 378552
Food 1 1 1
Temperature:Food 2 35605 17803
Any help, especially the code I need to fix it, would be great. I think there could be a problem with the data but I don't know what.
I would do this. Be aware of the difference between factor and continuous predictors.
library(tidyverse)
df <- sapply(strsplit(c("Food Temperature Area", "50 11 820.2175", "100 11 936.5437",
"50 14 1506.568", "100 14 1288.053", "50 17 1692.882",
"100 17 1792.54")," +"), paste0, collapse=",") %>%
read_csv()
model <- lm(Area ~ Temperature * as.factor(Food),df)
summary(model)
#>
#> Call:
#> lm(formula = Area ~ Temperature * as.factor(Food), data = df)
#>
#> Residuals:
#> 1 2 3 4 5 6
#> -83.34 25.50 166.68 -50.99 -83.34 25.50
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -696.328 505.683 -1.377 0.302
#> Temperature 145.444 35.580 4.088 0.055 .
#> as.factor(Food)100 38.049 715.144 0.053 0.962
#> Temperature:as.factor(Food)100 -2.778 50.317 -0.055 0.961
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 151 on 2 degrees of freedom
#> Multiple R-squared: 0.9425, Adjusted R-squared: 0.8563
#> F-statistic: 10.93 on 3 and 2 DF, p-value: 0.08498
ggeffects::ggpredict(model,terms = c('Temperature','Food')) %>% plot()
Created on 2020-12-08 by the reprex package (v0.3.0)
The actual problem with your example is not that you're using factors as predictor variables, but rather that you have fitted a 'saturated' linear model (as many parameters as observations), so there is no variation left to compute a residual SSQ, and hence the ANOVA table doesn't include F or P values.
It's fine for temperature and food to be categorical (factor) predictors, that's how they would be treated in a classic two-way ANOVA design. It's just that in order to analyze this design with the interaction you need more replication.
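For example, here is a minimal sketch on made-up data with two replicates per Temperature × Food cell, just to show that the F value and Pr(>F) columns appear once there are residual degrees of freedom left over:
set.seed(1)
# Fabricated data: 2 replicates for each Temperature x Food combination
dd <- expand.grid(Temperature = factor(c(11, 14, 17)),
                  Food = factor(c(50, 100)),
                  rep = 1:2)
dd$Area <- 1000 + 150 * as.numeric(dd$Temperature) + rnorm(nrow(dd), sd = 100)
# 12 observations, 6 parameters -> 6 residual df, so F and P values are computable
summary(aov(Area ~ Temperature * Food, data = dd))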
I want to analyze the relationship between whether someone smokes or not and the number of alcoholic drinks they have.
The reproducible data set:
smoking_status   alcohol_drinks
1                2
0                5
1                2
0                1
1                0
1                0
0                0
1                9
1                6
1                5
I have used glm() to analyse this relation:
glm <- glm(smoking_status ~ alcohol_drinks, data = data, family = binomial)
summary(glm)
confint(glm)
Using the above I'm able to extract the p-value and the confidence interval for the entire set.
However, I would like to extract the confidence interval for each smoking status, so that I can produce this results table:
              Alcohol drinks, mean (95%CI)   p-values
Smokers       X (X - X)                      0.492
Non-smokers   X (X - X)
How can I produce this?
First of all, the response alcohol_drinks is not binary, so a logistic regression is out of the question. Since the response is count data, I will fit a Poisson model.
To have confidence intervals for each binary value of smoking_status, coerce to factor and fit a model without an intercept.
x <- 'smoking_status alcohol_drinks
1 2
0 5
1 2
0 1
1 0
1 0
0 0
1 9
1 6
1 5'
df1 <- read.table(textConnection(x), header = TRUE)
pois_fit <- glm(alcohol_drinks ~ 0 + factor(smoking_status), data = df1, family = poisson(link = "log"))
summary(pois_fit)
#>
#> Call:
#> glm(formula = alcohol_drinks ~ 0 + factor(smoking_status), family = poisson(link = "log"),
#> data = df1)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.6186 -1.7093 -0.8104 1.1389 2.4957
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> factor(smoking_status)0 0.6931 0.4082 1.698 0.0895 .
#> factor(smoking_status)1 1.2321 0.2041 6.036 1.58e-09 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for poisson family taken to be 1)
#>
#> Null deviance: 58.785 on 10 degrees of freedom
#> Residual deviance: 31.324 on 8 degrees of freedom
#> AIC: 57.224
#>
#> Number of Fisher Scoring iterations: 5
confint(pois_fit)
#> Waiting for profiling to be done...
#> 2.5 % 97.5 %
#> factor(smoking_status)0 -0.2295933 1.399304
#> factor(smoking_status)1 0.8034829 1.607200
#>
exp(confint(pois_fit))
#> Waiting for profiling to be done...
#> 2.5 % 97.5 %
#> factor(smoking_status)0 0.7948568 4.052378
#> factor(smoking_status)1 2.2333058 4.988822
Created on 2022-06-04 by the reprex package (v2.0.1)
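If you also want the mean column for the table in the question: because this Poisson model has a log link and no intercept, the exponentiated coefficients are exactly the per-group mean drink counts, matching the exponentiated confidence intervals above:
exp(coef(pois_fit))
# factor(smoking_status)0 = 2.000  (mean drinks, non-smokers: (5 + 1 + 0) / 3)
# factor(smoking_status)1 = 3.429  (mean drinks, smokers: 24 / 7)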
Edit
The edit to the question states that the problem was reversed: what is asked is the effect of alcohol drinking on smoking status. With a binary response (individuals are either smokers or not), a logistic regression is a possible model.
bin_fit <- glm(smoking_status ~ alcohol_drinks, data = df1, family = binomial(link = "logit"))
summary(bin_fit)
#>
#> Call:
#> glm(formula = smoking_status ~ alcohol_drinks, family = binomial(link = "logit"),
#> data = df1)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.7491 -0.8722 0.6705 0.8896 1.0339
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.3474 0.9513 0.365 0.715
#> alcohol_drinks 0.1877 0.2730 0.687 0.492
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 12.217 on 9 degrees of freedom
#> Residual deviance: 11.682 on 8 degrees of freedom
#> AIC: 15.682
#>
#> Number of Fisher Scoring iterations: 4
# Odds ratios
exp(coef(bin_fit))
#> (Intercept) alcohol_drinks
#> 1.415412 1.206413
exp(confint(bin_fit))
#> Waiting for profiling to be done...
#> 2.5 % 97.5 %
#> (Intercept) 0.2146432 11.167555
#> alcohol_drinks 0.7464740 2.417211
Created on 2022-06-05 by the reprex package (v2.0.1)
Another way to conduct a logistic regression is to regress the cumulative counts of smokers on increasing numbers of alcoholic drinks. In order to do this, the data must be sorted by alcohol_drinks, so I will create a second data set, df2. Code inspired by this RPubs post.
df2 <- df1[order(df1$alcohol_drinks), ]
Total <- sum(df2$smoking_status)
df2$smoking_status <- cumsum(df2$smoking_status)
fit <- glm(cbind(smoking_status, Total - smoking_status) ~ alcohol_drinks, data = df2, family = binomial())
summary(fit)
#>
#> Call:
#> glm(formula = cbind(smoking_status, Total - smoking_status) ~
#> alcohol_drinks, family = binomial(), data = df2)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -0.9714 -0.2152 0.1369 0.2942 0.8975
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -1.1671 0.3988 -2.927 0.003428 **
#> alcohol_drinks 0.4437 0.1168 3.798 0.000146 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 23.3150 on 9 degrees of freedom
#> Residual deviance: 3.0294 on 8 degrees of freedom
#> AIC: 27.226
#>
#> Number of Fisher Scoring iterations: 4
# Odds ratios
exp(coef(fit))
#> (Intercept) alcohol_drinks
#> 0.3112572 1.5584905
exp(confint(fit))
#> Waiting for profiling to be done...
#> 2.5 % 97.5 %
#> (Intercept) 0.1355188 0.6569898
#> alcohol_drinks 1.2629254 2.0053079
plot(smoking_status/Total ~ alcohol_drinks,
data = df2,
xlab = "Alcoholic Drinks",
ylab = "Proportion of Smokers")
lines(df2$alcohol_drinks, fit$fitted, type="l", col="red")
title(main = "Alcohol and Smoking")
Created on 2022-06-05 by the reprex package (v2.0.1)
First of all, I have to apologize if my headline is misleading; I am not sure how to phrase it appropriately for my question.
I am currently working on a fixed-effect model. My data looks like the following table, although it is not the actual data due to information privacy.
state   district   year   grade   Y     X     id
AK      1001       2009   3       0.1   0.5   1001.3
AK      1001       2010   3       0.8   0.4   1001.3
AK      1001       2011   3       0.5   0.7   1001.3
AK      1001       2009   4       1.5   1.3   1001.4
AK      1001       2010   4       1.1   0.7   1001.4
AK      1001       2011   4       2.1   0.4   1001.4
...     ...        ...    ..      ..    ..    ...
WY      5606       2011   6       4.2   5.3   5606.6
I used the fixest package to run the fixed-effect model for this project. To get unique observations in this dataset, I have to combine district, grade, and year. Note that I avoided plm because there is no way to specify three fixed effects in the model unless you combine two identities (in my case, I generated id by combining district and grade); fixest seems to solve this problem. However, I got different results when specifying three fixed effects (district, grade, and year) compared to two fixed effects (id and year). The following code and results may clear up some confusion in my explanation.
# Two fixed effects (id and year)
df <- transform(df, id = apply(df[c("district", "grade")], 1, paste, collapse = "."))
fe = feols(y ~ x | id + year, df, se = "standard")
summary(fe)
OLS estimation, Dep. Var.: y
Observations: 499,112
Fixed-effects: id: 64,302, year: 10
Standard-errors: IID
Estimate Std. Error t value Pr(>|t|)
X 0.012672 0.003602 3.51804 0.00043478 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.589222 Adj. R2: 0.761891
Within R2: 2.846e-5
###########################################################################
# Three fixed effects (district, grade, and year)
fe = feols(y ~ x | district + grade + year, df, se = "standard")
summary(fe)
OLS estimation, Dep. Var.: y
Observations: 499,112
Fixed-effects: district: 11,097, grade: 6, year: 10
Standard-errors: IID
Estimate Std. Error t value Pr(>|t|)
X 0.014593 0.00401 3.63866 0.00027408 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.702543 Adj. R2: 0.698399
Within R2: 2.713e-5
Questions:
Why are the results different?
This is the equation I plan to use. I am not sure which model is associated with this specification. My feeling is that it is the second one, but if that is the case, why do many websites recommend combining two identities and running a normal plm?
Thank you so much for reading my question. Any answers/suggestions/advice would be appreciated!
The answer is simply that you are estimating two different models.
Three fixed-effects (FEs):
y_dgt = beta * x_dgt + alpha_district + gamma_grade + delta_year + e_dgt
Year + id FEs (I renamed id into district_grade):
y_dgt = beta * x_dgt + mu_district_grade + delta_year + e_dgt
The set of FEs in the first estimation is strictly included in the set of FEs of the second estimation, so the first model is the more restrictive one.
Here is a reproducible example in which we see that we obtain two different sets of coefficients.
library(fixest)
data(trade)
est = fepois(Euros ~ log(dist_km) | sw(Origin + Product, Origin^Product) + Year, trade)
etable(est, vcov = "iid")
#> model 1 model 2
#> Dependent Var.: Euros Euros
#>
#> log(dist_km) -1.020*** (1.18e-6) -1.024*** (1.19e-6)
#> Fixed-Effects: ------------------- -------------------
#> Origin Yes No
#> Product Yes No
#> Year Yes Yes
#> Origin-Product No Yes
#> _______________ ___________________ ___________________
#> S.E. type IID IID
#> Observations 38,325 38,325
#> Squared Cor. 0.27817 0.35902
#> Pseudo R2 0.53802 0.64562
#> BIC 2.75e+12 2.11e+12
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see that they have different FEs, which confirms that the models estimated are completely different:
summary(fixef(est$`Origin + Product + Year`))
#> Fixed_effects coefficients
#> Origin Product Year
#> Number of fixed-effects 15 20 10
#> Number of references 0 1 1
#> Mean 23.5 -0.012 0.157
#> Standard-deviation 1.15 1.35 0.113
#> COEFFICIENTS:
#> Origin: AT BE DE DK ES
#> 22.91 23.84 24.62 23.62 24.83 ... 10 remaining
#> -----
#> Product: 1 2 3 4 5
#> 0 1.381 0.624 1.414 -1.527 ... 15 remaining
#> -----
#> Year: 2007 2008 2009 2010 2011
#> 0 0.06986 0.006301 0.07463 0.163 ... 5 remaining
summary(fixef(est$`Origin^Product + Year`))
#> Fixed_effects coefficients
#> Origin^Product Year
#> Number of fixed-effects 300 10
#> Number of references 0 1
#> Mean 23.1 0.157
#> Standard-deviation 1.96 0.113
#> COEFFICIENTS:
#> Origin^Product: 101 102 103 104 105
#> 22.32 24.42 24.82 21.28 23.04 ... 295 remaining
#> -----
#> Year: 2007 2008 2009 2010 2011
#> 0 0.06962 0.006204 0.07454 0.1633 ... 5 remaining
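As a side note, fixest can build the combined fixed-effect on the fly with the same ^ operator, so the id + year specification from the question can be written without manually pasting district and grade together (a sketch assuming the question's df):
# district^grade creates one dummy per district-grade pair, i.e. the same FE as id
fe_combined = feols(y ~ x | district^grade + year, df, se = "standard")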
I just started to get into R for data analysis (previously I just used SPSS or Excel).
Currently, I am trying to run a logistic regression with one dependent variable and 5 independent variables while controlling for 3 variables.
My current attempt is:
reg_model <- glm(formula = Dependent ~ Independent1 + Independent2 + Independent3 + Independent4 + Independent5, family = binomial(), data = df)
I am not sure how (or where) to insert the 3 control variables into the model because just adding the 3 control variables as independent variables into the model seems wrong to me (or am I wrong here?).
You can control for potential confounders by adding them as independent variables into the model on the right-hand side of the formula.
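Applied to the model in the question, that looks like the sketch below, where Control1, Control2, and Control3 are hypothetical placeholders for your three control variables:
reg_model <- glm(Dependent ~ Independent1 + Independent2 + Independent3 +
                   Independent4 + Independent5 +
                   Control1 + Control2 + Control3,  # controls enter like any other predictor
                 family = binomial(), data = df)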
The demonstration below shows why this matters: the estimate (effect size) for the Graduate Record Exam (GRE) score is lower in the second model, after controlling for grade point average (GPA), which is correlated with GRE:
library(readr)
# gre: Graduate Record Exam scores
# gpa: grade point average
data <- read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
#>
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#> admit = col_double(),
#> gre = col_double(),
#> gpa = col_double(),
#> rank = col_double()
#> )
data
#> # A tibble: 400 x 4
#> admit gre gpa rank
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0 380 3.61 3
#> 2 1 660 3.67 3
#> 3 1 800 4 1
#> 4 1 640 3.19 4
#> 5 0 520 2.93 4
#> 6 1 760 3 2
#> 7 1 560 2.98 1
#> 8 0 400 3.08 2
#> 9 1 540 3.39 3
#> 10 0 700 3.92 2
#> # … with 390 more rows
model1 <- glm(admit ~ gre, data = data, family = "binomial")
summary(model1)
#>
#> Call:
#> glm(formula = admit ~ gre, family = "binomial", data = data)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.1623 -0.9052 -0.7547 1.3486 1.9879
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -2.901344 0.606038 -4.787 1.69e-06 ***
#> gre 0.003582 0.000986 3.633 0.00028 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 499.98 on 399 degrees of freedom
#> Residual deviance: 486.06 on 398 degrees of freedom
#> AIC: 490.06
#>
#> Number of Fisher Scoring iterations: 4
# gre and gpa are correlated. Let's control for them!
cor(data)
#> admit gre gpa rank
#> admit 1.0000000 0.1844343 0.17821225 -0.24251318
#> gre 0.1844343 1.0000000 0.38426588 -0.12344707
#> gpa 0.1782123 0.3842659 1.00000000 -0.05746077
#> rank -0.2425132 -0.1234471 -0.05746077 1.00000000
model2 <- glm(admit ~ gre + gpa, data = data, family = "binomial")
summary(model2)
#>
#> Call:
#> glm(formula = admit ~ gre + gpa, family = "binomial", data = data)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.2730 -0.8988 -0.7206 1.3013 2.0620
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -4.949378 1.075093 -4.604 4.15e-06 ***
#> gre 0.002691 0.001057 2.544 0.0109 *
#> gpa 0.754687 0.319586 2.361 0.0182 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 499.98 on 399 degrees of freedom
#> Residual deviance: 480.34 on 397 degrees of freedom
#> AIC: 486.34
#>
#> Number of Fisher Scoring iterations: 4
Created on 2021-10-01 by the reprex package (v2.0.1)
I understand that I need one dummy variable fewer than the total number of groups. However, I am stuck, because I keep receiving the error "1 not defined because of singularities" when running lm in R. I found a similar question here: What is causing this error? Coefficients not defined because of singularities, but it is slightly different from my problem.
I have two treatments, (1) "benefit" and (2) "history", with two levels each, (1) "low"/"high" and (2) "short"/"long", i.e. 4 combinations. Additionally, I have a control group, which was exposed to neither. Therefore, I coded 4 dummy variables (one fewer than the total number of groups, n=5). The dummy-coded data looks like this:
                              low benefit   high benefit   short history   long history
Control group                      0             0               0              0
low benefit, short history         1             0               1              0
low benefit, long history          1             0               0              1
high benefit, short history        0             1               1              0
high benefit, long history         0             1               0              1
When I run my lm I get this:
Model:
summary(lm(X ~ `short history` + `high benefit` + `long history` + `low benefit` + `Control variables`, data = df))
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.505398100 0.963932438 5.71139 4.8663e-08 ***
Dummy short history 0.939025772 0.379091565 2.47704 0.0142196 *
Dummy high benefit -0.759944023 0.288192645 -2.63693 0.0091367 **
Dummy long history 0.759352915 0.389085599 1.95163 0.0526152 .
Dummy low benefit NA NA NA NA
Control variables xxx xxx xxx xxx
This error always occurs for the dummy variable in the 4th position. The control variables are all calculated without problems.
I already tried including only two variables with two levels each, meaning for "history" I coded 1 for "long" and 0 for "short", and for "benefit", 1 for "high" and 0 for "low". This way, the lm worked, but the problem is that the control group and the combination "short history, low benefit" are coded identically, i.e. 0 and 0 for both variables.
I am sorry if this is a basic mistake, but I have not been able to figure it out. If you need more information, please say so. Thanks in advance.
As I put in the comments, you only have two variables; if you make them factors and check the contrasts, R will do the right thing. Please also see http://www.sthda.com/english/articles/40-regression-analysis/163-regression-with-categorical-variables-dummy-coding-essentials-in-r/
Make up data representative of yours.
set.seed(2020)
df <- data.frame(
X = runif(n = 120, min = 5, max = 15),
benefit = rep(c("control", "low", "high"), 40),
history = c(rep("control", 40), rep("long", 40), rep("short", 40))
)
Make benefit and history factors, check that control is the base contrast for each.
df$benefit <- factor(df$benefit)
df$history <- factor(df$history)
contrasts(df$benefit)
#> high low
#> control 0 0
#> high 1 0
#> low 0 1
contrasts(df$history)
#> long short
#> control 0 0
#> long 1 0
#> short 0 1
Run the regression and get the summary: 4 coefficients, all compared to control/control.
lm(X ~ benefit + history, df)
#>
#> Call:
#> lm(formula = X ~ benefit + history, data = df)
#>
#> Coefficients:
#> (Intercept) benefithigh benefitlow historylong historyshort
#> 9.94474 -0.08721 0.11245 0.37021 -0.35675
summary(lm(X ~ benefit + history, df))
#>
#> Call:
#> lm(formula = X ~ benefit + history, data = df)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -5.4059 -2.3706 -0.0007 2.4986 4.7669
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 9.94474 0.56786 17.513 <2e-16 ***
#> benefithigh -0.08721 0.62842 -0.139 0.890
#> benefitlow 0.11245 0.62842 0.179 0.858
#> historylong 0.37021 0.62842 0.589 0.557
#> historyshort -0.35675 0.62842 -0.568 0.571
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.809 on 115 degrees of freedom
#> Multiple R-squared: 0.01253, Adjusted R-squared: -0.02182
#> F-statistic: 0.3648 on 4 and 115 DF, p-value: 0.8333
Using R...
I have a data.frame with five variables.
One of the variables colr has values ranging from 1 to 5.
It is defined as an integer with values 1, 2, 3, 4, and 5.
Problem: I would like to build a regression model where the values within colr, the integers 1 through 5, are reported as independent variables with the following names:
1 = Silver,
2 = Blue,
3 = Pink,
4 = Other than Silver, Blue or Pink,
5 = Color Not Reported.
Question: Is there a way to extract or rename these values that is different from the following (as this process does not rename, e.g., 1 to Silver, in the summary regression output):
lm(dependent_variable ~ I(colr.f == 1) +
I(colr.f == 2) +
I(colr.f == 3) +
I(colr.f == 4) +
I(colr.f == 5),
data = df)
I am open to any method that would allow me to create and name these different values independently, but I would prefer a way to do so using the tidyverse or dplyr, as this is something I have to do frequently when building multivariate models.
Thank you for any help.
If you have this:
df <- data.frame(int = sample(5, 20, TRUE), value = rnorm(20))
df
#> int value
#> 1 3 -0.62042198
#> 2 4 0.85009260
#> 3 5 -1.04971518
#> 4 1 -2.58255471
#> 5 1 0.62357772
#> 6 4 0.00286785
#> 7 4 -0.05981318
#> 8 4 0.72961261
#> 9 4 -0.03156315
#> 10 1 -2.05486209
#> 11 5 1.77099554
#> 12 1 1.02790956
#> 13 1 -0.70354012
#> 14 1 0.27353731
#> 15 2 -0.04817215
#> 16 2 0.17151374
#> 17 5 -0.54824346
#> 18 2 0.41123284
#> 19 5 0.05466070
#> 20 1 -0.41029986
You can do this:
library(tidyverse)
df <- df %>% mutate(color = factor(c("red", "green", "orange", "blue", "pink"))[int])
df
#> int value color
#> 1 3 -0.62042198 orange
#> 2 4 0.85009260 blue
#> 3 5 -1.04971518 pink
#> 4 1 -2.58255471 red
#> 5 1 0.62357772 red
#> 6 4 0.00286785 blue
#> 7 4 -0.05981318 blue
#> 8 4 0.72961261 blue
#> 9 4 -0.03156315 blue
#> 10 1 -2.05486209 red
#> 11 5 1.77099554 pink
#> 12 1 1.02790956 red
#> 13 1 -0.70354012 red
#> 14 1 0.27353731 red
#> 15 2 -0.04817215 green
#> 16 2 0.17151374 green
#> 17 5 -0.54824346 pink
#> 18 2 0.41123284 green
#> 19 5 0.05466070 pink
#> 20 1 -0.41029986 red
Which allows a regression like this:
lm(value ~ color, data = df) %>% summary()
#>
#> Call:
#> lm(formula = value ~ color, data = df)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.03595 -0.33687 -0.00447 0.46149 1.71407
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.2982 0.4681 0.637 0.534
#> colorgreen -0.1200 0.7644 -0.157 0.877
#> colororange -0.9187 1.1466 -0.801 0.436
#> colorpink -0.2413 0.7021 -0.344 0.736
#> colorred -0.8448 0.6129 -1.378 0.188
#>
#> Residual standard error: 1.047 on 15 degrees of freedom
#> Multiple R-squared: 0.1451, Adjusted R-squared: -0.0829
#> F-statistic: 0.6364 on 4 and 15 DF, p-value: 0.6444
Created on 2020-02-16 by the reprex package (v0.3.0)
I'm not sure I'm understanding your question the right way, but can't you just use
library(dplyr)
df <- df %>%
  mutate(color = factor(colr.f, levels = 1:5,
                        labels = c("silver", "blue", "pink", "not s, b, p", "not reported")))
and then just run the regression on color only.
/edit for clarification. Making up some data:
df <- data.frame(
x=rnorm(100),
color=factor(rep(c(1,2,3,4,5), each=20),
labels=c("Silver", "Blue", "Pink", "Not S, B, P", "Not reported")),
y=rnorm(100, 4))
m1 <- lm(y~x+color, data=df)
m2 <- lm(y~x+color-1, data=df)
summary(m1)
Call:
lm(formula = y ~ x + color, data = df)
Residuals:
Min 1Q Median 3Q Max
-1.96394 -0.59647 0.00237 0.56916 2.13392
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.93238 0.19312 20.362 <2e-16 ***
x 0.13588 0.09856 1.379 0.171
colorBlue -0.07862 0.27705 -0.284 0.777
colorPink -0.02167 0.27393 -0.079 0.937
colorNot S, B, P 0.15238 0.27221 0.560 0.577
colorNot reported 0.14139 0.27230 0.519 0.605
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8606 on 94 degrees of freedom
Multiple R-squared: 0.0268, Adjusted R-squared: -0.02496
F-statistic: 0.5177 on 5 and 94 DF, p-value: 0.7623
summary(m2)
Call:
lm(formula = y ~ x + color - 1, data = df)
Residuals:
Min 1Q Median 3Q Max
-1.96394 -0.59647 0.00237 0.56916 2.13392
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x 0.13588 0.09856 1.379 0.171
colorSilver 3.93238 0.19312 20.362 <2e-16 ***
colorBlue 3.85376 0.19570 19.692 <2e-16 ***
colorPink 3.91071 0.19301 20.262 <2e-16 ***
colorNot S, B, P 4.08477 0.19375 21.083 <2e-16 ***
colorNot reported 4.07377 0.19256 21.156 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8606 on 94 degrees of freedom
Multiple R-squared: 0.9578, Adjusted R-squared: 0.9551
F-statistic: 355.5 on 6 and 94 DF, p-value: < 2.2e-16
The first model is a model with an intercept, so one of the factor levels must be dropped to avoid perfect multicollinearity. In this case, the "effect" of silver is the value of the intercept, while the "effect" of each other color is the intercept plus that color's coefficient.
The second model is estimated without intercept (without constant), so you can see the individual effects. However, you should probably know what you are doing before estimating the model without intercept.
With base R:
labels <- c("Silver", "Blue", "Pink", "Other Color", "Color Not Reported")
df$colr.f2 <- factor(df$colr.f, levels = seq_along(labels), labels = labels)
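Then the recoded factor can go straight into the model (a sketch; y stands in for whatever your dependent variable is):
summary(lm(y ~ colr.f2, data = df))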