I am running mixed linear models using lmer from lme4. We are testing the effect of family, strain, and temperature on several growth factors in brook trout. I have 4 families (variable FAMILLE) from which we sampled our individuals: 2 are from the selected strain and 2 are from the control strain (variable Lignee). Within each strain, the 2 families were marked as either resistant (Res) or sensitive (Sens). So my fixed-effect variable (FAMILLE) is nested in my variable Lignee. The experiment was conducted at 3 different temperatures.
Here is what my dataframe looks like:
structure(list(BASSIN = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("1",
"2", "3", "4"), class = "factor"), t.visee = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = c("15", "17", "19"), class = "factor"), FAMILLE = structure(c(2L,
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L,
1L), .Label = c("RES", "SENS"), class = "factor"), Lignee = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L), .Label = c("CTRL", "SEL"), class = "factor"), taux.croiss.sp.poids = c(0.8,
1.14285714285714, 1.42857142857143, 0.457142857142857, -0.228571428571429,
0.628571428571429, 0.971428571428571, 0.742857142857143, 1.08571428571429,
0.8, 0.571428571428571, 1.02857142857143, 0.8, 0.285714285714286,
0.285714285714286, 0.571428571428571, 0.742857142857143, 1.14285714285714,
0.628571428571429, 0.742857142857143, 1.02857142857143, 0.285714285714286,
0.628571428571429, 0.628571428571429, 0.857142857142857, 0.8,
1.08571428571429, 1.37142857142857, 0.742857142857143, 1.08571428571429,
0.0571428571428571, 0.571428571428571, 0.171428571428571, 0.8,
0.685714285714286, 0.285714285714286, 0.285714285714286, 0.8,
0.457142857142857, 1.02857142857143, 0.342857142857143, 0.742857142857143,
0.857142857142857, 0.457142857142857, 0.742857142857143, 1.25714285714286,
0.971428571428571, 0.857142857142857, 0.742857142857143, 0.514285714285714
)), row.names = c(NA, -50L), class = c("tbl_df", "tbl", "data.frame"
))
Lignee has 2 levels (Sel and Ctrl)
FAMILLE has 2 levels (Sens and Res)
So I have 4 distinct levels:
Lignee Sel and FAMILLE Sens
Lignee Sel and FAMILLE Res
Lignee Ctrl and FAMILLE Sens
Lignee Ctrl and FAMILLE Res
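As a quick check (a sketch, not from the original post; it assumes the dput() output above has been assigned to mydata1), the four combinations can be tabulated directly:

```r
# Sketch: cross-tabulate strain and family to see the four groups
# (assumes the dput() structure above was assigned to mydata1)
mydata1$group <- interaction(mydata1$Lignee, mydata1$FAMILLE, drop = TRUE)
table(mydata1$group)
```

Note that with the nested formula Lignee/FAMILLE, R expands the term to Lignee + Lignee:FAMILLE, so the four group means are encoded as the intercept plus three contrasts.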
When I run, for example, this line to test the effect of the variables on the rate of weight gain:
model6 <- lmer((taux.croiss.sp.poids) ~ t.visee + Lignee/FAMILLE + (1 |BASSIN), data = mydata1, REML = FALSE)
and then
summary(model6)
Linear mixed model fit by maximum likelihood. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: (taux.croiss.sp.poids) ~ t.visee + Lignee/FAMILLE + (1 | BASSIN)
Data: mydata1
AIC BIC logLik deviance df.resid
115.2 139.5 -50.6 101.2 228
Scaled residuals:
Min 1Q Median 3Q Max
-3.11527 -0.59489 0.05557 0.69775 2.79920
Random effects:
Groups Name Variance Std.Dev.
BASSIN (Intercept) 0.01184 0.1088
Residual 0.08677 0.2946
Number of obs: 235, groups: BASSIN, 4
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 0.770942 0.209508 194.702337 3.680 0.000302 ***
t.visee -0.019077 0.011682 231.005933 -1.633 0.103809
LigneeSEL 0.214062 0.054471 231.007713 3.930 0.000112 ***
LigneeCTRL:FAMILLESENS -0.008695 0.054487 231.038877 -0.160 0.873358
LigneeSEL:FAMILLESENS -0.205001 0.054242 231.016973 -3.779 0.000200 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) t.vise LgnSEL LCTRL:
t.visee -0.948
LigneeSEL -0.131 0.000
LCTRL:FAMIL -0.124 -0.007 0.504
LSEL:FAMILL 0.000 0.000 -0.498 0.000
From what I understand, the model chooses one family as the reference group, which won't appear in the output. But here, two groups are missing:
LigneeCTRL:FAMILLERES
AND
LigneeSEL:FAMILLERES
Does somebody know why my output is missing not ONE but TWO of the groups?
I'm French Canadian, so don't hesitate to ask if something is not clear; I will try to re-explain it in other words!
Also, this is my 1st message on Stack Overflow. I tried to include everything needed, but don't hesitate to tell me if I should include anything else!
Thanks in advance
I'm applying the example here:
https://quantdev.ssri.psu.edu/sites/qdev/files/09_EnsembleMethods_2017_1127.html
to my data, to build a model for classification using the caret package.
I got to the point:
cvcontrol <- trainControl(method="repeatedcv", number = 10, repeats=3,allowParallel=TRUE)
train.rf <- train(as.factor(variate) ~ .,
data=train.n.inp,
method="rf",
trControl=cvcontrol,
importance=TRUE)
rf.classTrain <- predict(train.rf, type="raw")
#computing confusion matrix
cM <- confusionMatrix(train.n.inp$variate,rf.classTrain)
I don't understand why the predict function is needed to compute the confusion matrix, or, in other words, what the difference is between cM and train.rf$finalModel:
train.rf$finalModel
OOB estimate of error rate: 43.08%
Confusion matrix:
MV UV class.error
MV 25 12 0.3243243
UV 16 12 0.5714286
> cM
Confusion Matrix and Statistics
Reference
Prediction MV UV
MV 37 0
UV 0 28
Accuracy : 1
I am confused by the (large) difference between the two confusion matrices and unsure which one reflects the accuracy of the model. Any help appreciated.
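For illustration (a minimal sketch on a built-in dataset with hypothetical names, not the original data): predict() without a newdata argument returns in-sample predictions on the training set, while finalModel$confusion reports randomForest's out-of-bag estimate, so the two matrices will generally differ:

```r
library(caret)  # assumes caret and randomForest are installed
set.seed(42)
fit <- train(Species ~ ., data = iris, method = "rf",
             trControl = trainControl(method = "cv", number = 5))
# In-sample (resubstitution) predictions: typically optimistic
in.sample <- predict(fit, type = "raw")
confusionMatrix(in.sample, iris$Species)
# Out-of-bag confusion matrix from the underlying randomForest object
fit$finalModel$confusion
```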
The data:
dput(train.n.inp)
structure(list(variate = structure(c(1L, 1L, 2L, 1L, 1L, 2L,
1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L,
1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 1L,
2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L,
1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L), .Label = c("MV",
"UV"), class = "factor"), AMB = c(0.148918043959789, 0.137429106929874,
0.13522219247215, 0.152139165429334, 0.193551266136034, 0.1418753904697,
0.132098434875739, 0.256245486778797, 0.136593400352133, 0.0183612037420183,
0.0235701709547339, 0.030539801539972, 0.0532418112925866, 0.0506048730618504,
0.0443005622763673, 0.172991261592386, 0.135717125493919, 0.139092406429261,
0.1225892299329, 0.13579014839877, 0.183709401293317, 0.122207888096455,
0.00542803592726925, 0.0192455922563268, 0.0731446096925737,
0.0150264910871489, 0.0487793004405717, 0.0433918327937752, 0.0122597343588996,
0.0211847560629296, 0.114451232870044, 0.113712890165437, 0.00788647372392488,
-0.03807738805183, 0.00735097242168299, -0.00173226557619129,
0.000279921135262793, 0.0487306185040041, 0.00901021509302318,
0.164378615647997, 0.081505732298031, 0.0337690366656119, 0.0520247628784008,
0.0318461001711981, 0.0467265454486446, 0.0503046677863513, 0.026150313592808,
0.102418680881792, 0.145640126897581, 0.158703113209843, 0.166192017785134,
0.145234444092853, 0.189096868940113, 0.142573164893833, 0.157794383727251,
0.312043099741174, 0.136009217113324, 0.115213916542934, 0.119757563955894,
0.120065882887488, 0.141891617781889, 0.177956819122265, 0.13731551574455,
0.328513821613157, 0.110426859447136), MB = c(-0.73416, -0.67752,
-0.66664, -0.75004, -0.9542, -0.69944, -0.65124, -1.26328, -0.6734,
-0.09052, -0.1162, -0.15056, -0.26248, -0.24948, -0.2184, -0.85284,
-0.66908, -0.68572, -0.60436, -0.66944, -0.90568, -0.60248, -0.02676,
-0.09488, -0.3606, -0.07408, -0.24048, -0.21392, -0.06044, -0.10444,
-0.56424, -0.5606, -0.0388800000000001, 0.18772, -0.0362400000000001,
0.00854000000000001, -0.00138, -0.24024, -0.04442, -0.81038,
-0.40182, -0.16648, -0.25648, -0.157, -0.23036, -0.248, -0.12892,
-0.50492, -0.718, -0.7824, -0.81932, -0.716, -0.93224, -0.70288,
-0.77792, -1.53836, -0.67052, -0.568, -0.5904, -0.59192, -0.69952,
-0.87732, -0.67696, -1.61956, -0.5444), MGE = c(1.58768, 1.6152,
1.53288, 1.52972, 1.12908, 1.50552, 1.48988, 1.67552, 1.55052,
1.23556, 1.27284, 1.21336, 0.84592, 1.30172, 1.14048, 1.26828,
1.20884, 1.21764, 1.22876, 1.22168, 1.27944, 1.22528, 1.26932,
1.25408, 1.183, 1.38032, 1.33416, 0.95584, 1.31188, 1.39796,
1.33848, 1.4458, 1.18416, 1.23868, 1.22968, 1.17838, 1.17278,
1.13368, 1.11374, 1.31642, 1.14034, 1.21984, 1.17128, 1.16364,
1.15036, 1.12984, 1.22484, 1.17244, 1.2768, 1.55744, 1.66964,
1.54848, 1.17416, 1.56424, 1.48928, 1.9326, 1.54588, 1.228, 1.29096,
1.39296, 1.38432, 1.275, 1.32704, 1.9442, 1.35128)), row.names = c(NA,
-65L), class = "data.frame")
I want to run a regression where parendiv is my dependent variable and routine1997 is my independent variable, and compare males to females. The data look like this:
structure(list(gender = structure(c(2L, 1L, 2L, 1L, 2L, 1L, 1L,
2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L), .Label = c("male",
"female"), class = "factor"), parent = structure(c(2L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = c("intact", "parentaldivorce"), class = "factor"),
routine = structure(c(1L, 1L, 1L, 1L, NA, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 1L, 2L, 3L, 2L, 1L, 3L, 3L), .Label = c("Med",
"High", "Low"), class = "factor")), row.names = c(3L, 5L,
6L, 7L, 8L, 9L, 10L, 11L, 16L, 18L, 19L, 21L, 22L, 23L, 24L,
25L, 28L, 29L, 30L, 34L), class = "data.frame")
This is the code, and I want to specifically compare the coefficients between men and women.
lm(parent~routine, data=nlsy97, subset=gender)
There are two ways to compare the coefficients.
The easiest way would be to code gender as a dummy (0/1) and include an interaction term in the model. Then you get the difference gender makes for the coefficient, complete with a p-value:
out = lm(parent ~ routine * gender, data=nlsy97)
The other way would be to use a multigroup regression, comparing the pooled regression model (all genders included) with the unpooled models (separate slopes or intercepts or both for the genders). The model with the smallest AIC fits the data best. If the random-slope model yields the lowest AIC, you have gender differences in your effect. If the random-intercept model is best, you just have level differences between the genders but may assume equal effects.
library(lme4)
pooled = lm(parent ~ routine, data=nlsy97)
r.inter = lmer(parent ~ routine + (1|gender), data=nlsy97)
r.slope = lmer(parent ~ routine + (routine|gender), data=nlsy97)
r.unpooled = lmer(parent ~ routine + (1+routine|gender), data=nlsy97)
AIC(pooled)
AIC(r.inter)
AIC(r.slope)
AIC(r.unpooled)
Using coefficients() (or its shorthand coef()) on the model with the lowest AIC gives you the exact coefficients for the individual groups.
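A sketch of that last step (illustrative, assuming r.slope from above turned out to be the best-fitting model): coef() on a merMod object returns one row of coefficients per grouping level, here one per gender:

```r
# Sketch: per-gender intercepts and slopes from the random-slope model
# (assumes the r.slope fit from above had the lowest AIC)
coef(r.slope)$gender
```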
EDIT: I just noticed that you have only 20 cases in total. If this is your whole dataset, you should probably not do any statistical analyses at all.
I want to run logistic regressions for several (n = 30) SNPs (coded as 0, 1, 2) as predictors with a case-control variable (0/1) as the outcome. As a few of those SNPs are correlated, I cannot put all of them in one model; I have to run a separate regression for each, i.e., I cannot simply add them together in one model like rs1 + rs2 + rs3 and so on. I need each one regressed separately, like below:
test1 = glm(casecontrol ~ rs1, data = mydata, family=binomial)
test2 = glm(casecontrol ~ rs2, data = mydata, family=binomial)
test3 = glm(casecontrol ~ rs3, data = mydata, family=binomial)
While I can run all the above regressions separately, is there a way to loop them so I get a summary() of all tests in one go?
I will also have to adjust for age and sex, but that will come after I run an unadjusted loop.
My data head, from dput(head(mydata)):
structure(list(ID = 1:6, sex = c(2L, 1L, 2L, 1L, 1L, 1L), age = c(52.4725405022036,
58.4303618001286, 44.5300065923948, 61.4786037395243, 67.851808819687,
39.7451378498226), bmi = c(31.4068751083687, 32.0614937413484,
23.205021363683, 29.1445372393355, 32.6287483051419, 20.5887741968036
), casecontrol = c(0L, 1L, 0L, 1L, 1L, 1L), rs1 = c(1L, 0L, 2L,
2L, 1L, 2L), rs2 = c(2L, 1L, 2L, 0L, 1L, 1L), rs3 = c(1L, 0L,
1L, 2L, 2L, 2L), rs4 = c(1L, 1L, 1L, 1L, 0L, 2L), rs5 = c(1L,
0L, 0L, 0L, 1L, 2L), rs6 = c(1L, 1L, 1L, 1L, 1L, 2L), rs7 = c(0L,
0L, 0L, 0L, 0L, 0L), rs8 = c(0L, 0L, 1L, 0L, 2L, 1L), rs9 = c(0L,
0L, 2L, 1L, 1L, 0L), rs10 = c(2L, 0L, 0L, 2L, 2L, 1L), rs11 = c(0L,
1L, 1L, 0L, 1L, 1L), rs12 = c(1L, 2L, 0L, 1L, 2L, 2L), rs13 = c(0L,
2L, 0L, 0L, 0L, 0L), rs14 = c(1L, 1L, 1L, 1L, 2L, 2L), rs15 = c(1L,
2L, 1L, 1L, 0L, 1L), rs16 = c(0L, 2L, 1L, 2L, 2L, 1L), rs17 = c(0L,
2L, 1L, 1L, 2L, 2L), rs18 = c(1L, 2L, 2L, 1L, 1L, 1L), rs19 = c(1L,
1L, 0L, 1L, 2L, 2L), rs20 = c(2L, 1L, 0L, 2L, 2L, 1L), rs21 = c(1L,
2L, 2L, 1L, 1L, 0L), rs22 = c(1L, 1L, 2L, 2L, 0L, 1L), rs23 = c(2L,
0L, 2L, 1L, 1L, 1L), rs24 = c(0L, 0L, 0L, 2L, 2L, 2L), rs25 = c(2L,
2L, 1L, 1L, 0L, 1L), rs26 = c(1L, 1L, 0L, 2L, 0L, 1L), rs27 = c(1L,
1L, 1L, 1L, 0L, 1L), rs28 = c(0L, 1L, 1L, 2L, 0L, 2L), rs29 = c(2L,
2L, 2L, 2L, 1L, 2L), rs30 = c(0L, 2L, 1L, 2L, 1L, 0L)), row.names = c(NA,
6L), class = "data.frame")
Probably you want something like this:
lapply(1:30, function(i) glm(as.formula(paste0('casecontrol ~ ', 'rs', i)), data = mydata, family = binomial))
which will execute 30 logistic regressions with the selected predictor.
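To also get the summaries in one go, the fitted models can be kept in a list and summarized together (a sketch building on the lapply() call above; it assumes mydata as posted):

```r
# Sketch: store the 30 fits, then summarize them all at once
models <- lapply(1:30, function(i)
  glm(as.formula(paste0('casecontrol ~ rs', i)),
      data = mydata, family = binomial))
summaries <- lapply(models, summary)
summaries[[1]]  # summary of the model for rs1
```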
Instead of hard coding the overall number of predictors, you can use:
sum(grepl('rs', names(mydata))), which will return 30.
You can use the tidy function from the broom package to get the summary in a tidy format.
purrr::map_dfr(1:30, function(i) data.frame(model = i, tidy(glm(as.formula(paste0('casecontrol ~ ', 'rs', i)), data = mydata, family = binomial))))
or you can do this in a more dynamic way:
names(mydata)[grepl('rs', names(mydata))] -> pred #get all predictors that contain 'rs'
purrr::map_dfr(1:length(pred),
function(i) data.frame(model = i,
tidy(glm(as.formula(paste0('casecontrol ~ ', pred[i])), data = mydata, family = binomial))))
If you want to adjust for another variable, you simply need to extend the pred vector (note these are additive adjustments, not interactions):
c(pred, paste0(pred, ' + age')) -> pred #each rs model adjusted for age
or
c(pred, paste0(pred, ' + age + sex')) -> pred #each rs model adjusted for age and sex
You can do something like this (requires dplyr for select()):
outcome <- mydata %>% select("casecontrol") #select outcome
features <- mydata %>% select("rs1":"rs30") #select features
features_names <- data.frame(colnames(features)) #get feature names
for (i in 1:nrow(features_names)) { # loop over the features
  selected <- features[, i, drop = FALSE] #get each feature
  total <- cbind(selected, outcome) # combine with the outcome
  glm.fit <- glm(casecontrol ~ ., data = total, family = binomial("logit"))
  print(summary(glm.fit)) # summary() alone is not printed inside a for loop
}
Please find my data "w" below.
I have the covariate w$WHO, which has three levels: w$WHO==1, w$WHO==2 and w$WHO==3
I want to relevel so w$WHO==1 is set as reference.
I tried
w$WHO <- factor(w$WHO)
w$WHO <- relevel(w$WHO, ref=1)
and
w$WHO <- relevel(w$WHO, ref="1")
My script is
library(rms)
d <- datadist(w)
options(datadist="d")
model <- cph(Surv(rfs,recurrence)~age + WHO,data=w)
summary(model)
As you can see, the adjusted model chooses w$WHO==2 as reference.
Effects Response : Surv(rfs, recurrence)
Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
age 48.545 68.907 20.362 0.28228 0.093283 0.099454 0.46512
Hazard Ratio 48.545 68.907 20.362 1.32620 NA 1.104600 1.59220
WHO - 1:2 2.000 1.000 NA -0.56706 0.156850 -0.874490 -0.25963
Hazard Ratio 2.000 1.000 NA 0.56719 NA 0.417080 0.77134
WHO - 3:2 2.000 3.000 NA 0.69360 0.152910 0.393910 0.99330
Hazard Ratio 2.000 3.000 NA 2.00090 NA 1.482800 2.70010
Here is my data
My data
w <- structure(list(age = c(54.36164384, 74.91232877, 64.98356164,
60.56712329, 57.61369863, 45.85205479, 78.47123288, 59.95616438,
57.4739726, 25.12876712, 56.61917808, 61.10136986, 58.74246575,
62.56438356, 55.81917808, 30.83013699, 63.11232877, 56.29863014,
47.96986301, 40.53424658, 49.9890411, 47.75616438, 40.83835616,
42.02191781, 49.85205479, 55.05479452, 59.33424658, 71.89589041,
60.30410959, 50.24383562, 41.3260274, 33.4, 73.27945205, 67.45753425
), WHO = c(3L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 3L,
2L, 3L, 2L, 1L, 2L, 1L, 1L, 2L, 3L, 1L, 2L, 1L, 1L, 3L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), recurrence = c(1L, 1L, 1L, 1L, 0L, 1L, 0L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), rfs = c(19.1, 15.33333333,
49.16666667, 15.6, 57.16666667, 47.63333333, 54, 16.93333333,
6.7, 102.1, 24.33666667, 127.7666667, 100.6333333, 25.96666667,
1.233333333, 13.1, 72.16666667, 62, 97.23333333, 199.1, 24.73333333,
60.46666667, 10.43333333, 31.76666667, 28.96666667, 56.43333333,
9.533333333, 114.9333333, 114.8666667, 85.06666667, 107.6, 121.2,
69.56666667, 70.03333333)), .Names = c("age", "WHO", "recurrence",
"rfs"), class = "data.frame", row.names = 271:304)
Thanks,
Best.
The solution was:
d$limits$WHO[2] <- 1
model <- cph(Surv(rfs,recurrence)~age + WHO,data=w)
summary(model)
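For context (an explanatory sketch, not part of the original answer): relevel() has no effect on the summary.rms() output because rms takes the adjust-to (reference) value from the datadist object, which for a factor defaults to its most frequent level; editing d$limits overrides that stored value before the model is summarized:

```r
# Sketch: inspect and change the adjust-to value stored in the datadist
library(rms)
w$WHO <- factor(w$WHO)
d <- datadist(w)
options(datadist = "d")
d$limits$WHO          # row 2 ("Adjust to") holds the reference level
d$limits$WHO[2] <- 1  # force level "1" to be used as the reference
```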