Linear model regressing on every level of a numeric field in R

I am currently trying to run a linear model on a large data set, but am running into issues with some specific variables.
pv_model <- lm(SalePrice ~ MSSubClass + LotConfig + GarageArea + LotFrontage, data = train)
summary(pv_model)
Here is the code for my regression. SalePrice, MSSubClass, GarageArea, and LotFrontage are all numeric fields, while LotConfig is a factor variable.
Here is the output of my pv_model:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 98154.64 17235.51 5.695 1.75e-08 ***
MSSubClass 50.05 58.38 0.857 0.391539
LotConfigCulDSac 69949.50 12740.62 5.490 5.42e-08 ***
LotConfigFR2 19998.34 14592.31 1.370 0.170932
LotConfigFR3 21390.99 34126.44 0.627 0.530962
LotConfigInside 21666.04 5597.33 3.871 0.000118 ***
GarageArea 175.67 10.96 16.035 < 2e-16 ***
LotFrontage101 42571.20 42664.89 0.998 0.318682
LotFrontage102 26051.49 35876.54 0.726 0.467968
LotFrontage103 36528.81 35967.56 1.016 0.310131
LotFrontage104 218129.42 58129.56 3.752 0.000188 ***
LotFrontage105 61737.12 27618.21 2.235 0.025673 *
LotFrontage106 40806.22 58159.42 0.702 0.483120
LotFrontage107 36744.69 29494.94 1.246 0.213211
LotFrontage108 71537.30 42565.91 1.681 0.093234 .
LotFrontage109 -29193.02 42528.98 -0.686 0.492647
LotFrontage110 73589.28 27706.92 2.656 0.008068 **
As you can see, the first variables behave correctly: both the factor and the numeric fields are handled appropriately. That is, until it gets to LotFrontage. For whatever reason, the model runs the regression on every single level of LotFrontage.
For reference, LotFrontage describes the square footage of the subject's front yard. I have cleaned the data and replaced the NA values. I am at a loss as to why this particular column behaves so unusually.
Any help is greatly appreciated.

If I download the data from the Kaggle link (or use a GitHub link) and do:
train = read.csv("train.csv")
class(train$LotFrontage)
[1] "integer"
pv_model <- lm(SalePrice ~ MSSubClass + LotConfig + GarageArea + LotFrontage,
data = train)
summary(pv_model)
Call:
lm(formula = SalePrice ~ MSSubClass + LotConfig + GarageArea +
LotFrontage, data = train)
Residuals:
Min 1Q Median 3Q Max
-380310 -33812 -4418 24345 487970
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11915.866 9455.677 1.260 0.20785
MSSubClass 105.699 45.345 2.331 0.01992 *
LotConfigCulDSac 81789.113 10547.120 7.755 1.89e-14 ***
LotConfigFR2 17736.355 11787.227 1.505 0.13266
LotConfigFR3 17649.409 31418.281 0.562 0.57439
LotConfigInside 13073.201 5002.092 2.614 0.00907 **
GarageArea 208.708 8.725 23.920 < 2e-16 ***
LotFrontage 722.380 88.294 8.182 7.12e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
I suggest that you read the CSV in again, as above. LotFrontage is an integer column in the raw file, so it must have been converted to a factor or character column somewhere in your cleaning step.
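If re-reading the raw file is not an option, it is worth checking how the NAs were replaced: filling a numeric column with a character placeholder silently coerces the whole column to character/factor, and lm() then dummy-codes every distinct value. A minimal sketch of the repair, where the "None" placeholder is hypothetical:
class(train$LotFrontage)  # "factor" or "character" here would explain the output
train$LotFrontage[train$LotFrontage == "None"] <- NA  # hypothetical placeholder value
train$LotFrontage <- as.numeric(as.character(train$LotFrontage))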

Related

General Linear Model interpretation of parameter estimates in R

I have a data set that looks like
"","OBSERV","DIOX","logDIOX","OXYGEN","LOAD","PRSEK","PLANT","TIME","LAB"
"1",1011,984.06650389,6.89169348002254,"L","H","L","RENO_N","1","KK"
"2",1022,1790.7973641,7.49041625445373,"H","H","L","RENO_N","1","USA"
"3",1031,661.95870145,6.4952031694744,"L","H","H","RENO_N","1","USA"
"4",1042,978.06853583,6.88557974511529,"H","H","H","RENO_N","1","KK"
"5",1051,270.92290942,5.60183431332639,"N","N","N","RENO_N","1","USA"
"6",1062,402.98269729,5.99889362626069,"N","N","N","RENO_N","1","USA"
"7",1071,321.71945701,5.77367991426247,"H","L","L","RENO_N","1","KK"
"8",1082,223.15260359,5.40785585845064,"L","L","L","RENO_N","1","USA"
"9",1091,246.65350151,5.507984523849,"H","L","H","RENO_N","1","USA"
"10",1102,188.48323034,5.23900903921703,"L","L","H","RENO_N","1","KK"
"11",1141,267.34994025,5.58855843790491,"N","N","N","RENO_N","1","KK"
"12",1152,452.10355987,6.11391126834609,"N","N","N","RENO_N","1","KK"
"13",2011,2569.6672555,7.85153169693888,"N","N","N","KARA","1","USA"
"14",2021,604.79620572,6.40489155123453,"N","N","N","KARA","1","KK"
"15",2031,2610.4804449,7.86728956188212,"L","H",NA,"KARA","1","KK"
"16",2032,3789.7097503,8.24004471210954,"L","H",NA,"KARA","1","USA"
"17",2052,338.97054188,5.82591320649553,"L","L","L","KARA","1","KK"
"18",2061,391.09027375,5.96893841249289,"H","L","H","KARA","1","USA"
"19",2092,410.04420258,6.01626496505788,"N","N","N","KARA","1","USA"
"20",2102,313.51882368,5.74785940190679,"N","N","N","KARA","1","KK"
"21",2112,1242.5931417,7.12495571830002,"H","H","H","KARA","1","KK"
"22",2122,1751.4827969,7.46821802066524,"H","H","L","KARA","1","USA"
"23",3011,60.48026048,4.10231703874031,"N","N","N","RENO_S","1","KK"
"24",3012,257.27729731,5.55015448107691,"N","N","N","RENO_S","1","USA"
"25",3021,46.74282552,3.84466077914493,"N","N","N","RENO_S","1","KK"
"26",3022,73.605375516,4.29871805996994,"N","N","N","RENO_S","1","KK"
"27",3031,108.25433812,4.68448344109116,"H","H","L","RENO_S","1","KK"
"28",3032,124.40704234,4.82355878915293,"H","H","L","RENO_S","1","USA"
"29",3042,123.66859296,4.81760535031397,"L","H","L","RENO_S","1","KK"
"30",3051,170.05332632,5.13611207209694,"N","N","N","RENO_S","1","USA"
"31",3052,95.868704018,4.56297958887925,"N","N","N","RENO_S","1","KK"
"32",3061,202.69261215,5.31169060558111,"N","N","N","RENO_S","1","USA"
"33",3062,70.686307069,4.25825187761015,"N","N","N","RENO_S","1","USA"
"34",3071,52.034715526,3.95191110210073,"L","H","H","RENO_S","1","KK"
"35",3072,93.33525462,4.53619789950355,"L","H","H","RENO_S","1","USA"
"36",3081,121.47464906,4.79970559129829,"H","H","H","RENO_S","1","USA"
"37",3082,94.833869239,4.55212661590867,"H","H","H","RENO_S","1","KK"
"38",3091,68.624596439,4.22865101914209,"H","L","L","RENO_S","1","USA"
"39",3092,64.837097371,4.17187792984139,"H","L","L","RENO_S","1","KK"
"40",3101,32.351569811,3.47666254561192,"L","L","L","RENO_S","1","KK"
"41",3102,29.285124102,3.37707967726539,"L","L","L","RENO_S","1","USA"
"42",3111,31.36974463,3.44584388158928,"L","L","H","RENO_S","1","USA"
"43",3112,28.127853881,3.33676032670116,"L","L","H","RENO_S","1","KK"
"44",3121,91.825330102,4.51988818660262,"H","L","H","RENO_S","1","KK"
"45",3122,136.4559307,4.91600171048243,"H","L","H","RENO_S","1","USA"
"46",4011,126.11889968,4.83722511024933,"H","L","H","RENO_N","2","KK"
"47",4022,76.520259821,4.33755554003153,"L","L","L","RENO_N","2","KK"
"48",4032,93.551979795,4.53851721545715,"L","L","H","RENO_N","2","USA"
"49",4041,207.09703422,5.33318744777751,"H","L","L","RENO_N","2","USA"
"50",4052,383.44185307,5.94918798759058,"N","N","N","RENO_N","2","USA"
"51",4061,156.79345897,5.05492939129363,"N","N","N","RENO_N","2","USA"
"52",4071,322.72413197,5.77679787769979,"L","H","L","RENO_N","2","USA"
"53",4082,554.05710342,6.31726775620079,"H","H","H","RENO_N","2","USA"
"54",4091,122.55552697,4.80856420867156,"N","N","N","RENO_N","2","KK"
"55",4102,112.70050456,4.72473389805434,"N","N","N","RENO_N","2","KK"
"56",4111,94.245481423,4.54590288271731,"L","H","H","RENO_N","2","KK"
"57",4122,323.16498582,5.77816298482521,"H","H","L","RENO_N","2","KK"
I define a linear model in R using lm as
lm1 <- lm(logDIOX ~ 1 + OXYGEN + LOAD + PLANT + TIME + LAB, data=data)
and I want to interpret the estimated coefficients. However, when I extract the coefficients I get multiple 'NAs' (I'm assuming it's due to linear dependencies among the variables). How can I then interpret the coefficients? I only have one intercept that somehow represents one of the levels of each of the included factors in the model. Is it possible to get an estimate for each factor level?
> summary(lm1)
Call:
lm(formula = logDIOX ~ OXYGEN + LOAD + PLANT + TIME + LAB, data = data)
Residuals:
Min 1Q Median 3Q Max
-0.90821 -0.32102 -0.08993 0.27311 0.97758
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.2983 0.2110 34.596 < 2e-16 ***
OXYGENL -0.4086 0.1669 -2.449 0.017953 *
OXYGENN -0.7567 0.1802 -4.199 0.000113 ***
LOADL -1.0645 0.1675 -6.357 6.58e-08 ***
LOADN NA NA NA NA
PLANTRENO_N -0.6636 0.2174 -3.052 0.003664 **
PLANTRENO_S -2.3452 0.1929 -12.158 < 2e-16 ***
TIME2 -0.9160 0.2065 -4.436 5.18e-05 ***
LABUSA 0.3829 0.1344 2.849 0.006392 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5058 on 49 degrees of freedom
Multiple R-squared: 0.8391, Adjusted R-squared: 0.8161
F-statistic: 36.5 on 7 and 49 DF, p-value: < 2.2e-16
For the NA part of your question, you can have a look here: linear regression "NA" estimate just for last coefficient. An NA estimate means that one of your variables can be written as a linear combination of the others (in this data set, LOAD is "N" exactly when OXYGEN is "N", so the LOADN dummy duplicates OXYGENN and lm() drops it).
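If you want R to identify the dependency for you, stats::alias() reports which coefficients are exact linear combinations of the others. A minimal sketch, assuming lm1 is the full model fit above:
alias(lm1)
# The Complete component should express LOADN in terms of the other columns,
# which is why lm() dropped it.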
As for the factors and their levels: R folds the first level of each factor into the intercept and reports every other level as a difference from that baseline. I think it will be clearer with a one-factor regression:
lm1 <- lm(logDIOX ~ 1 + OXYGEN , data=df)
> summary(lm1)
Call:
lm(formula = logDIOX ~ 1 + OXYGEN, data = df)
Residuals:
Min 1Q Median 3Q Max
-1.7803 -0.7833 -0.2027 0.6597 3.1229
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.5359 0.2726 20.305 <2e-16 ***
OXYGENL -0.4188 0.3909 -1.071 0.289
OXYGENN -0.1896 0.3807 -0.498 0.621
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.188 on 54 degrees of freedom
Multiple R-squared: 0.02085, Adjusted R-squared: -0.01542
F-statistic: 0.5749 on 2 and 54 DF, p-value: 0.5662
What this result is saying is that for OXYGEN="H" the intercept is 5.5359, for OXYGEN="L" the intercept is 5.5359 - 0.4188 = 5.1171, and for OXYGEN="N" the intercept is 5.5359 - 0.1896 = 5.3463.
Hope this helps
UPDATE:
Following your comment, I generalize to your model.
When OXYGEN = "H", LOAD = "H", PLANT = "KARA", TIME = "1", LAB = "KK", then:
logDIOX = 7.2983
When OXYGEN = "L", LOAD = "H", PLANT = "KARA", TIME = "1", LAB = "KK", then:
logDIOX = 7.2983 - 0.4086 = 6.8897
When OXYGEN = "L", LOAD = "L", PLANT = "KARA", TIME = "1", LAB = "KK", then:
logDIOX = 7.2983 - 0.4086 - 1.0645 = 5.8252
etc.
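Rather than adding coefficients by hand, predict() on a small new data frame reproduces these numbers. A sketch under two assumptions: TIME was treated as a factor (as the TIME2 coefficient suggests), and R will warn that the fit is rank-deficient because of the aliased LOADN term:
newdat <- data.frame(OXYGEN = c("H", "L", "L"),
                     LOAD   = c("H", "H", "L"),
                     PLANT  = "KARA", TIME = "1", LAB = "KK")
predict(lm1, newdata = newdat)
# Should return 7.2983, 6.8897, and 5.8252, matching the walkthrough above.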

GLM submodel testing in R: why all statistics are still the same after remove one continuous covariate?

I am doing submodel testing. The smaller model is nested in the bigger model; the bigger model has one more continuous variable than the smaller model. I use the likelihood ratio test. The result is quite strange: both models have the same statistics, such as residual deviance and df. I also find that the two models have the same estimated coefficients and std. errors. How is that possible?
summary(m2221)
Call:
glm(formula = clm ~ veh_age + veh_body + agecat + veh_value:veh_age +
veh_value:area, family = "binomial", data = Car)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.9245 -0.3939 -0.3683 -0.3437 2.7095
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.294118 0.382755 -3.381 0.000722 ***
veh_age2 0.051790 0.098463 0.526 0.598897
veh_age3 -0.166801 0.094789 -1.760 0.078457 .
veh_age4 -0.239862 0.096154 -2.495 0.012611 *
veh_bodyCONVT -2.184124 0.707884 -3.085 0.002033 **
veh_bodyCOUPE -0.850675 0.393625 -2.161 0.030685 *
veh_bodyHBACK -1.105087 0.374134 -2.954 0.003140 **
veh_bodyHDTOP -0.973472 0.383404 -2.539 0.011116 *
veh_bodyMCARA -0.649036 0.469407 -1.383 0.166765
veh_bodyMIBUS -1.295135 0.404691 -3.200 0.001373 **
veh_bodyPANVN -0.903032 0.395295 -2.284 0.022345 *
veh_bodyRDSTR -1.108488 0.826541 -1.341 0.179883
veh_bodySEDAN -1.097931 0.373578 -2.939 0.003293 **
veh_bodySTNWG -1.129122 0.373713 -3.021 0.002516 **
veh_bodyTRUCK -1.156099 0.384088 -3.010 0.002613 **
veh_bodyUTE -1.343958 0.377653 -3.559 0.000373 ***
agecat2 -0.198002 0.058382 -3.391 0.000695 ***
agecat3 -0.224492 0.056905 -3.945 7.98e-05 ***
agecat4 -0.253377 0.056774 -4.463 8.09e-06 ***
agecat5 -0.441906 0.063227 -6.989 2.76e-12 ***
agecat6 -0.447231 0.072292 -6.186 6.15e-10 ***
veh_age1:veh_value -0.000637 0.026387 -0.024 0.980740
veh_age2:veh_value 0.035386 0.031465 1.125 0.260753
veh_age3:veh_value 0.114485 0.036690 3.120 0.001806 **
veh_age4:veh_value 0.189866 0.057573 3.298 0.000974 ***
veh_value:areaB 0.044099 0.021550 2.046 0.040722 *
veh_value:areaC 0.021892 0.019189 1.141 0.253931
veh_value:areaD -0.023616 0.024939 -0.947 0.343658
veh_value:areaE -0.013506 0.026886 -0.502 0.615415
veh_value:areaF 0.057780 0.026602 2.172 0.029850 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 33767 on 67855 degrees of freedom
Residual deviance: 33592 on 67826 degrees of freedom
AIC: 33652
Number of Fisher Scoring iterations: 5
summary(m222)
Call:
glm(formula = clm ~ veh_value + veh_age + veh_body + agecat +
veh_value:veh_age + veh_value:area, family = "binomial",
data = Car)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.9245 -0.3939 -0.3683 -0.3437 2.7095
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.294118 0.382755 -3.381 0.000722 ***
veh_value -0.000637 0.026387 -0.024 0.980740
veh_age2 0.051790 0.098463 0.526 0.598897
veh_age3 -0.166801 0.094789 -1.760 0.078457 .
veh_age4 -0.239862 0.096154 -2.495 0.012611 *
veh_bodyCONVT -2.184124 0.707884 -3.085 0.002033 **
veh_bodyCOUPE -0.850675 0.393625 -2.161 0.030685 *
veh_bodyHBACK -1.105087 0.374134 -2.954 0.003140 **
veh_bodyHDTOP -0.973472 0.383404 -2.539 0.011116 *
veh_bodyMCARA -0.649036 0.469407 -1.383 0.166765
veh_bodyMIBUS -1.295135 0.404691 -3.200 0.001373 **
veh_bodyPANVN -0.903032 0.395295 -2.284 0.022345 *
veh_bodyRDSTR -1.108488 0.826541 -1.341 0.179883
veh_bodySEDAN -1.097931 0.373578 -2.939 0.003293 **
veh_bodySTNWG -1.129122 0.373713 -3.021 0.002516 **
veh_bodyTRUCK -1.156099 0.384088 -3.010 0.002613 **
veh_bodyUTE -1.343958 0.377653 -3.559 0.000373 ***
agecat2 -0.198002 0.058382 -3.391 0.000695 ***
agecat3 -0.224492 0.056905 -3.945 7.98e-05 ***
agecat4 -0.253377 0.056774 -4.463 8.09e-06 ***
agecat5 -0.441906 0.063227 -6.989 2.76e-12 ***
agecat6 -0.447231 0.072292 -6.186 6.15e-10 ***
veh_value:veh_age2 0.036023 0.034997 1.029 0.303331
veh_value:veh_age3 0.115122 0.039476 2.916 0.003543 **
veh_value:veh_age4 0.190503 0.058691 3.246 0.001171 **
veh_value:areaB 0.044099 0.021550 2.046 0.040722 *
veh_value:areaC 0.021892 0.019189 1.141 0.253931
veh_value:areaD -0.023616 0.024939 -0.947 0.343658
veh_value:areaE -0.013506 0.026886 -0.502 0.615415
veh_value:areaF 0.057780 0.026602 2.172 0.029850 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 33767 on 67855 degrees of freedom
Residual deviance: 33592 on 67826 degrees of freedom
AIC: 33652
anova(m2221, m222, test = "LRT")
Analysis of Deviance Table
Model 1: clm ~ veh_age + veh_body + agecat + veh_value:veh_age +
veh_value:area
Model 2: clm ~ veh_value + veh_age + veh_body + agecat + veh_value:veh_age +
veh_value:area
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 67826 33592
2 67826 33592 0 0
You have specified the same model in two different ways. To show this, I'll first explain what is going on under the hood a bit, then I'll walk through the coefficients of your two models to show they are the same, and I'll end with some higher-level intuition and explanation.
Different formula creating different interaction terms
First, the only difference between your two models is that the second model includes veh_value as a main-effect predictor, whereas the first does not. However, veh_value is interacted with other predictors in both models.
So let's consider a simple reproducible example and see what R does when we do this. I'll use model.matrix() to feed in two different formulas and view the continuous and dummy variables that R creates as a result.
colnames(model.matrix(am ~ mpg + factor(cyl):mpg, mtcars))
#[1] "(Intercept)" "mpg" "mpg:factor(cyl)6" "mpg:factor(cyl)8"
colnames(model.matrix(am ~ factor(cyl):mpg, mtcars))
#"(Intercept)" "factor(cyl)4:mpg" "factor(cyl)6:mpg" "factor(cyl)8:mpg"
Notice in the first call I included the continuous predictor mpg whereas I did not include it in the second call (a simpler example of what you are doing).
Now note that the second model matrix contains an extra interaction variable (factor(cyl)4:mpg) that the first does not. In other words, because we did not include mpg in the model directly, all levels of cyl get included in the interaction.
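To see why this is only a change of parameterization, we can check that the two design matrices span the same column space; a short sketch with the same mtcars formulas:
X1 <- model.matrix(am ~ mpg + factor(cyl):mpg, mtcars)
X2 <- model.matrix(am ~ factor(cyl):mpg, mtcars)
qr(X1)$rank             # 4
qr(X2)$rank             # 4
qr(cbind(X1, X2))$rank  # still 4: neither matrix carries information the other lacks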
Your models are the same
Your models are doing essentially the same thing as the simple example above, and we can show that the coefficients are actually the same once added together.
In your first model, all 4 levels of veh_age are included in the interaction with veh_value because veh_value is not included in the model on its own.
veh_age1:veh_value -0.000637 0.026387 -0.024 0.980740
veh_age2:veh_value 0.035386 0.031465 1.125 0.260753
veh_age3:veh_value 0.114485 0.036690 3.120 0.001806 **
veh_age4:veh_value 0.189866 0.057573 3.298 0.000974 ***
In your second model, only 3 levels of veh_age are included in the interaction with veh_value because veh_value is included in the model on its own.
veh_value:veh_age2 0.036023 0.034997 1.029 0.303331
veh_value:veh_age3 0.115122 0.039476 2.916 0.003543 **
veh_value:veh_age4 0.190503 0.058691 3.246 0.001171 **
Now, here is the critical piece to see that the models are actually the same. It's easiest to show by just walking through all of the levels of veh_age.
First consider veh_age = 1
For both models, the coefficient on veh_value conditioned on veh_age when veh_age = 1 is -0.000637
# For first model
veh_age1:veh_value -0.000637 0.026387 -0.024 0.980740
# For second model
veh_value -0.000637 0.026387 -0.024 0.980740
Now consider veh_age = 2
For both models, the coefficient on veh_value conditioned on veh_age when veh_age = 2 is 0.035386
# For first model
veh_age2:veh_value 0.035386 0.031465 1.125 0.260753
# For second model - note sum of the two below is 0.035386
veh_value -0.000637 0.026387 -0.024 0.980740
veh_value:veh_age2 0.036023 0.034997 1.029 0.303331
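Continuing the mtcars sketch, we can verify the equivalence directly: the level-specific slope in one parameterization equals the base slope plus the interaction adjustment in the other, and the fitted values are numerically identical.
m1 <- lm(am ~ factor(cyl):mpg, data = mtcars)
m2 <- lm(am ~ mpg + factor(cyl):mpg, data = mtcars)
coef(m1)["factor(cyl)6:mpg"]                    # slope for cyl == 6
coef(m2)["mpg"] + coef(m2)["mpg:factor(cyl)6"]  # the same number
all.equal(fitted(m1), fitted(m2))               # TRUE: identical fits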
Intuition
When you include the interaction veh_value:veh_age, you are essentially saying that you want the coefficient of veh_value, a continuous variable, to be conditioned on veh_age, a categorical variable. Including both the interaction veh_value:veh_age and veh_value as predictors says the same thing: you want the coefficient of veh_value, conditioned on the value of veh_age.

How to Interpret binomial GLMM results

I have a large dataset (24765 obs).
I am trying to look at how cleaning method affects emergence success (ES).
I have several fixed factors: beach (4 levels) and cleaning method (3 levels).
I also have a few random effects: Zone (128 levels), Year (18 years), and Index (24765 levels).
This is an OLRE (observation-level random effect) model to account for overdispersion.
My best-fit model based on AIC scores is:
mod8a<-glmer(ES.test~beach+method+(1|Year)+(1|index),data=y5,weights=egg.total,family=binomial)
The summary showed:
summary(mod8a)  # AIC = 216732.9; same effect at every beach
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: binomial ( logit )
Formula: ES.test ~ beach + method + (1 | Year) + (1 | index)
Data: y5
Weights: egg.total
AIC BIC logLik deviance df.resid
214834.2 214891.0 -107410.1 214820.2 24758
Scaled residuals:
Min 1Q Median 3Q Max
-1.92900 -0.09344 0.00957 0.14682 1.62327
Random effects:
Groups Name Variance Std.Dev.
index (Intercept) 1.6541 1.286
Year (Intercept) 0.6512 0.807
Number of obs: 24765, groups: index, 24765; Year, 19
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.65518 0.18646 3.514 0.000442 ***
beachHillsboro -0.06770 0.02143 -3.159 0.001583 **
beachHO/HA 0.31927 0.03716 8.591 < 2e-16 ***
methodHTL only 0.18106 0.02526 7.169 7.58e-13 ***
methodno clean 0.05989 0.03170 1.889 0.058853 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) bchHll bHO/HA mtHTLo
beachHllsbr -0.002
beachHO/HA -0.054 0.047
mthdHTLonly -0.107 -0.242 0.355
methodnclen -0.084 -0.060 0.265 0.628
What is my "intercept" (as seen above)? I am missing levels of my fixed factors; is that because R could not compute them?
I tested for Overdispersion:
overdisp_fun <- function(model) {
  ## number of variance parameters in
  ## an n-by-n variance-covariance matrix
  vpars <- function(m) {
    nrow(m) * (nrow(m) + 1) / 2
  }
  model.df <- sum(sapply(VarCorr(model), vpars)) + length(fixef(model))
  rdf <- nrow(model.frame(model)) - model.df
  rp <- residuals(model, type = "pearson")
  Pearson.chisq <- sum(rp^2)
  prat <- Pearson.chisq / rdf
  pval <- pchisq(Pearson.chisq, df = rdf, lower.tail = FALSE)
  c(chisq = Pearson.chisq, ratio = prat, rdf = rdf, p = pval)
}
> overdisp_fun(mod8a)
chisq ratio rdf p
2.064765e+03 8.339790e-02 2.475800e+04 1.000000e+00
This shows the plot of mod8a
I would like to know why I am getting such a curve and what it means
Lastly, I did a multiple-comparison analysis using multcomp:
ls1<- glht(mod8a, mcp(beach = "Tukey"))$linfct
ls2 <- glht(mod8a, mcp(method= "Tukey"))$linfct
summary(glht(mod8a, linfct = rbind(ls1, ls2)))
Simultaneous Tests for General Linear Hypotheses
Fit: glmer(formula = ES.test ~ beach + method + (1 | Year) + (1 |
index), data = y5, family = binomial, weights = egg.total)
Linear Hypotheses:
Estimate Std. Error z value Pr(>|z|)
Hillsboro - FTL/P == 0 -0.06770 0.02143 -3.159 0.00821 **
HO/HA - FTL/P == 0 0.31927 0.03716 8.591 < 0.001 ***
HO/HA - Hillsboro == 0 0.38696 0.04201 9.211 < 0.001 ***
HTL only - HTL and SB == 0 0.18106 0.02526 7.169 < 0.001 ***
no clean - HTL and SB == 0 0.05989 0.03170 1.889 0.24469
no clean - HTL only == 0 -0.12117 0.02524 -4.800 < 0.001 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Adjusted p values reported -- single-step method)
At this point, help with interpreting the analysis would be greatly appreciated (especially with that sigmoid curve for my residuals).

What happens to the coefficients when we switch labels (0/1) - in practice?

I am trying to see in practice what was explained here: what happens to the coefficients once the labels are switched. But I am not getting what I expected. Here is my attempt:
I am using the natality public-use data given as an example in "Practical Data Science with R", where the outcome is a logical variable, atRisk, that classifies newborn babies as at risk or not, with levels FALSE and TRUE.
load(url("https://github.com/WinVector/zmPDSwR/tree/master/CDC/NatalRiskData.rData"))
train <- sdata[sdata$ORIGRANDGROUP<=5,]
test <- sdata[sdata$ORIGRANDGROUP>5,]
complications <- c("ULD_MECO","ULD_PRECIP","ULD_BREECH")
riskfactors <- c("URF_DIAB", "URF_CHYPER", "URF_PHYPER",
"URF_ECLAM")
y <- "atRisk"
x <- c("PWGT", "UPREVIS", "CIG_REC", "GESTREC3", "DPLURAL", complications, riskfactors)
fmla <- paste(y, paste(x, collapse="+"), sep="~")
model <- glm(fmla, data=train, family=binomial(link="logit"))
summary(model)
This results to the following coefficients:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.412189 0.289352 -15.249 < 2e-16 ***
PWGT 0.003762 0.001487 2.530 0.011417 *
UPREVIS -0.063289 0.015252 -4.150 3.33e-05 ***
CIG_RECTRUE 0.313169 0.187230 1.673 0.094398 .
GESTREC3< 37 weeks 1.545183 0.140795 10.975 < 2e-16 ***
DPLURALtriplet or higher 1.394193 0.498866 2.795 0.005194 **
DPLURALtwin 0.312319 0.241088 1.295 0.195163
ULD_MECOTRUE 0.818426 0.235798 3.471 0.000519 ***
ULD_PRECIPTRUE 0.191720 0.357680 0.536 0.591951
ULD_BREECHTRUE 0.749237 0.178129 4.206 2.60e-05 ***
URF_DIABTRUE -0.346467 0.287514 -1.205 0.228187
URF_CHYPERTRUE 0.560025 0.389678 1.437 0.150676
URF_PHYPERTRUE 0.161599 0.250003 0.646 0.518029
URF_ECLAMTRUE 0.498064 0.776948 0.641 0.521489
OK, now let us switch the labels in our atRisk variable:
sdata$atRisk <- factor(sdata$atRisk)
levels(sdata$atRisk) <- c("TRUE", "FALSE")
and re-run the above analysis, where I expect to see a change in the signs of the reported coefficients. However, I get exactly the same coefficients:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.412189 0.289352 -15.249 < 2e-16 ***
PWGT 0.003762 0.001487 2.530 0.011417 *
UPREVIS -0.063289 0.015252 -4.150 3.33e-05 ***
CIG_RECTRUE 0.313169 0.187230 1.673 0.094398 .
GESTREC3< 37 weeks 1.545183 0.140795 10.975 < 2e-16 ***
DPLURALtriplet or higher 1.394193 0.498866 2.795 0.005194 **
DPLURALtwin 0.312319 0.241088 1.295 0.195163
ULD_MECOTRUE 0.818426 0.235798 3.471 0.000519 ***
ULD_PRECIPTRUE 0.191720 0.357680 0.536 0.591951
ULD_BREECHTRUE 0.749237 0.178129 4.206 2.60e-05 ***
URF_DIABTRUE -0.346467 0.287514 -1.205 0.228187
URF_CHYPERTRUE 0.560025 0.389678 1.437 0.150676
URF_PHYPERTRUE 0.161599 0.250003 0.646 0.518029
URF_ECLAMTRUE 0.498064 0.776948 0.641 0.521489
What am I doing wrong here? Can you help, please?
It's because you set train <- sdata[sdata$ORIGRANDGROUP<=5,] and then you change sdata$atRisk <- factor(sdata$atRisk), but your model is using the train dataset, whose levels DID NOT get changed.
Instead, you can do:
y <- "!atRisk"
x <- c("PWGT", "UPREVIS", "CIG_REC", "GESTREC3", "DPLURAL", complications, riskfactors)
fmla <- paste(y, paste(x, collapse="+"), sep="~")
model <- glm(fmla, data=train, family=binomial(link="logit"))
Call:
glm(formula = fmla, family = binomial(link = "logit"), data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.2641 0.1358 0.1511 0.1818 0.9732
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.412189 0.289352 15.249 < 2e-16 ***
PWGT -0.003762 0.001487 -2.530 0.011417 *
UPREVIS 0.063289 0.015252 4.150 3.33e-05 ***
CIG_RECTRUE -0.313169 0.187230 -1.673 0.094398 .
GESTREC3< 37 weeks -1.545183 0.140795 -10.975 < 2e-16 ***
DPLURALtriplet or higher -1.394193 0.498866 -2.795 0.005194 **
DPLURALtwin -0.312319 0.241088 -1.295 0.195163
ULD_MECOTRUE -0.818426 0.235798 -3.471 0.000519 ***
ULD_PRECIPTRUE -0.191720 0.357680 -0.536 0.591951
ULD_BREECHTRUE -0.749237 0.178129 -4.206 2.60e-05 ***
URF_DIABTRUE 0.346467 0.287514 1.205 0.228187
URF_CHYPERTRUE -0.560025 0.389678 -1.437 0.150676
URF_PHYPERTRUE -0.161599 0.250003 -0.646 0.518029
URF_ECLAMTRUE -0.498064 0.776948 -0.641 0.521489
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2698.7 on 14211 degrees of freedom
Residual deviance: 2463.0 on 14198 degrees of freedom
AIC: 2491
Number of Fisher Scoring iterations: 7
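A minimal alternative sketch, assuming the sdata/train objects from the question: flip the factor levels on train itself (the data the model is actually fit to), so that FALSE becomes the modeled outcome and the coefficients come out negated.
train$atRisk <- factor(train$atRisk, levels = c("TRUE", "FALSE"))
model2 <- glm(fmla, data = train, family = binomial(link = "logit"))
# glm() models the probability of the second level ("FALSE"), so every sign flips:
all.equal(coef(model2), -coef(model))  # TRUE, up to convergence tolerance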

Logistic Regression: Strange Variable arises

I am using R to perform logistic regression on my data set. My data set has more than 50 variables.
I am running the following code:
glm(X...ResponseFlag ~ NetWorth + LOR + IntGrandChld + OccupInput, family = binomial, data = data)
When I call summary() I get the following output:
> summary(ResponseModel)
Call:
glm(formula = X...ResponseFlag ~ NetWorth + LOR + IntGrandChld +
OccupInput, family = binomial, data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.2785 -0.9576 -0.8925 1.3736 1.9721
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.971166 0.164439 -5.906 3.51e-09 ***
NetWorth 0.082168 0.019849 4.140 3.48e-05 ***
LOR -0.019716 0.006494 -3.036 0.0024 **
IntGrandChld -0.021544 0.085274 -0.253 0.8005
OccupInput2 0.005796 0.138390 0.042 0.9666
OccupInput3 0.471020 0.289642 1.626 0.1039
OccupInput4 -0.031880 0.120636 -0.264 0.7916
OccupInput5 -0.148898 0.129922 -1.146 0.2518
OccupInput6 -0.481183 0.416277 -1.156 0.2477
OccupInput7 -0.057485 0.218309 -0.263 0.7923
OccupInput8 0.505676 0.123955 4.080 4.51e-05 ***
OccupInput9 -0.382375 0.821362 -0.466 0.6415
OccupInputA -12.903334 178.064831 -0.072 0.9422
OccupInputB 0.581272 1.003193 0.579 0.5623
OccupInputC -0.034188 0.294507 -0.116 0.9076
OccupInputD 0.224634 0.385959 0.582 0.5606
OccupInputE -1.292358 1.072864 -1.205 0.2284
OccupInputF 14.132144 308.212341 0.046 0.9634
OccupInputH 0.622677 1.006982 0.618 0.5363
OccupInputU 0.087526 0.095740 0.914 0.3606
OccupInputV -1.010939 0.637746 -1.585 0.1129
OccupInputW 0.262031 0.256238 1.023 0.3065
OccupInputX 0.332209 0.428806 0.775 0.4385
OccupInputY 0.059771 0.157135 0.380 0.7037
OccupInputZ 0.638520 0.711979 0.897 0.3698
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 5885.1 on 4467 degrees of freedom
Residual deviance: 5809.6 on 4443 degrees of freedom
AIC: 5859.6
Number of Fisher Scoring iterations: 12
From the output, you can see that some new variables like OccupInput2 have arisen. OccupInput actually had values 1, 2, 3, ..., A, B, C, D, ... But the same thing did not happen for NetWorth or LOR.
I am new to R and do not have any explanation for why there are new variables.
Can anybody give me an explanation? Thank you in advance.
I would assume that OccupInput in your model is a factor variable. R introduces so-called dummy variables when you include factor regressors in a linear model.
What you see as OccupInput2 and so forth in the table are the coefficients associated with the individual factor levels (the reference level OccupInput1 is covered by the intercept term).
You can verify the type of OccupInput from the output of the sapply(data, class) call, which yields the data types of the columns in your input data frame.
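If it helps to see the mechanism, here is a minimal illustration with made-up values (the occup name is only for demonstration):
occup <- factor(c("1", "2", "3", "A", "B"))
model.matrix(~ occup)
# The first level ("1") is absorbed into the intercept; each remaining level
# gets its own 0/1 dummy column (occup2, occup3, occupA, occupB), matching
# the OccupInput2, OccupInput3, ... pattern in your summary.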
