Linear model regressing on every level of a numeric field - r
I am currently trying to run a linear model on a large data set, but am running into issues with some specific variables.
pv_model <- lm(SalePrice ~ MSSubClass + LotConfig + GarageArea + LotFrontage, data = train)
summary(pv_model)
Here is the code for my regression. SalePrice, MSSubClass, GarageArea, and LotFrontage are all numeric fields, while LotConfig is a factor variable.
Here is the output of my pv_model:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 98154.64 17235.51 5.695 1.75e-08 ***
MSSubClass 50.05 58.38 0.857 0.391539
LotConfigCulDSac 69949.50 12740.62 5.490 5.42e-08 ***
LotConfigFR2 19998.34 14592.31 1.370 0.170932
LotConfigFR3 21390.99 34126.44 0.627 0.530962
LotConfigInside 21666.04 5597.33 3.871 0.000118 ***
GarageArea 175.67 10.96 16.035 < 2e-16 ***
LotFrontage101 42571.20 42664.89 0.998 0.318682
LotFrontage102 26051.49 35876.54 0.726 0.467968
LotFrontage103 36528.81 35967.56 1.016 0.310131
LotFrontage104 218129.42 58129.56 3.752 0.000188 ***
LotFrontage105 61737.12 27618.21 2.235 0.025673 *
LotFrontage106 40806.22 58159.42 0.702 0.483120
LotFrontage107 36744.69 29494.94 1.246 0.213211
LotFrontage108 71537.30 42565.91 1.681 0.093234 .
LotFrontage109 -29193.02 42528.98 -0.686 0.492647
LotFrontage110 73589.28 27706.92 2.656 0.008068 **
As you can see, the first variables behave correctly: each numeric field gets a single coefficient, and the factor LotConfig gets one coefficient per level. That is, until it gets to LotFrontage. For whatever reason, the model estimates a separate coefficient for every single value of LotFrontage, as if it were a factor.
For reference, LotFrontage describes the square footage of the subject's front yard. I have properly cleaned the data and replaced NA values. I am really at a loss as to why this particular column behaves so unusually.
Any help is greatly appreciated.
If I download the data from the Kaggle link or use a GitHub link and do:

train = read.csv("train.csv")
class(train$LotFrontage)
[1] "integer"
pv_model <- lm(SalePrice ~ MSSubClass + LotConfig + GarageArea + LotFrontage,
data = train)
summary(pv_model)
Call:
lm(formula = SalePrice ~ MSSubClass + LotConfig + GarageArea +
LotFrontage, data = train)
Residuals:
Min 1Q Median 3Q Max
-380310 -33812 -4418 24345 487970
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11915.866 9455.677 1.260 0.20785
MSSubClass 105.699 45.345 2.331 0.01992 *
LotConfigCulDSac 81789.113 10547.120 7.755 1.89e-14 ***
LotConfigFR2 17736.355 11787.227 1.505 0.13266
LotConfigFR3 17649.409 31418.281 0.562 0.57439
LotConfigInside 13073.201 5002.092 2.614 0.00907 **
GarageArea 208.708 8.725 23.920 < 2e-16 ***
LotFrontage 722.380 88.294 8.182 7.12e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
I suggest that you read the csv in again as above. Most likely your LotFrontage column was converted to a factor or character during your NA cleaning (for example, by replacing missing values with a text placeholder), and lm() creates one dummy variable per level of a factor.
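If re-reading the file is not an option, here is a sketch of how to check and repair the column in place (assuming LotFrontage was indeed turned into a factor or character by the NA replacement; untested against your data):

# check how R is currently storing the column
class(train$LotFrontage)

# if it comes back "factor" or "character", convert via character;
# as.numeric() directly on a factor would return the level codes, not the values
train$LotFrontage <- as.numeric(as.character(train$LotFrontage))

# any non-numeric placeholders (e.g. the literal string "NA") become NA here
sum(is.na(train$LotFrontage))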
Related
General Linear Model interpretation of parameter estimates in R
I have a data set that looks like:

"","OBSERV","DIOX","logDIOX","OXYGEN","LOAD","PRSEK","PLANT","TIME","LAB"
"1",1011,984.06650389,6.89169348002254,"L","H","L","RENO_N","1","KK"
"2",1022,1790.7973641,7.49041625445373,"H","H","L","RENO_N","1","USA"
"3",1031,661.95870145,6.4952031694744,"L","H","H","RENO_N","1","USA"
"4",1042,978.06853583,6.88557974511529,"H","H","H","RENO_N","1","KK"
"5",1051,270.92290942,5.60183431332639,"N","N","N","RENO_N","1","USA"
"6",1062,402.98269729,5.99889362626069,"N","N","N","RENO_N","1","USA"
"7",1071,321.71945701,5.77367991426247,"H","L","L","RENO_N","1","KK"
"8",1082,223.15260359,5.40785585845064,"L","L","L","RENO_N","1","USA"
"9",1091,246.65350151,5.507984523849,"H","L","H","RENO_N","1","USA"
"10",1102,188.48323034,5.23900903921703,"L","L","H","RENO_N","1","KK"
"11",1141,267.34994025,5.58855843790491,"N","N","N","RENO_N","1","KK"
"12",1152,452.10355987,6.11391126834609,"N","N","N","RENO_N","1","KK"
"13",2011,2569.6672555,7.85153169693888,"N","N","N","KARA","1","USA"
"14",2021,604.79620572,6.40489155123453,"N","N","N","KARA","1","KK"
"15",2031,2610.4804449,7.86728956188212,"L","H",NA,"KARA","1","KK"
"16",2032,3789.7097503,8.24004471210954,"L","H",NA,"KARA","1","USA"
"17",2052,338.97054188,5.82591320649553,"L","L","L","KARA","1","KK"
"18",2061,391.09027375,5.96893841249289,"H","L","H","KARA","1","USA"
"19",2092,410.04420258,6.01626496505788,"N","N","N","KARA","1","USA"
"20",2102,313.51882368,5.74785940190679,"N","N","N","KARA","1","KK"
"21",2112,1242.5931417,7.12495571830002,"H","H","H","KARA","1","KK"
"22",2122,1751.4827969,7.46821802066524,"H","H","L","KARA","1","USA"
"23",3011,60.48026048,4.10231703874031,"N","N","N","RENO_S","1","KK"
"24",3012,257.27729731,5.55015448107691,"N","N","N","RENO_S","1","USA"
"25",3021,46.74282552,3.84466077914493,"N","N","N","RENO_S","1","KK"
"26",3022,73.605375516,4.29871805996994,"N","N","N","RENO_S","1","KK"
"27",3031,108.25433812,4.68448344109116,"H","H","L","RENO_S","1","KK"
"28",3032,124.40704234,4.82355878915293,"H","H","L","RENO_S","1","USA"
"29",3042,123.66859296,4.81760535031397,"L","H","L","RENO_S","1","KK"
"30",3051,170.05332632,5.13611207209694,"N","N","N","RENO_S","1","USA"
"31",3052,95.868704018,4.56297958887925,"N","N","N","RENO_S","1","KK"
"32",3061,202.69261215,5.31169060558111,"N","N","N","RENO_S","1","USA"
"33",3062,70.686307069,4.25825187761015,"N","N","N","RENO_S","1","USA"
"34",3071,52.034715526,3.95191110210073,"L","H","H","RENO_S","1","KK"
"35",3072,93.33525462,4.53619789950355,"L","H","H","RENO_S","1","USA"
"36",3081,121.47464906,4.79970559129829,"H","H","H","RENO_S","1","USA"
"37",3082,94.833869239,4.55212661590867,"H","H","H","RENO_S","1","KK"
"38",3091,68.624596439,4.22865101914209,"H","L","L","RENO_S","1","USA"
"39",3092,64.837097371,4.17187792984139,"H","L","L","RENO_S","1","KK"
"40",3101,32.351569811,3.47666254561192,"L","L","L","RENO_S","1","KK"
"41",3102,29.285124102,3.37707967726539,"L","L","L","RENO_S","1","USA"
"42",3111,31.36974463,3.44584388158928,"L","L","H","RENO_S","1","USA"
"43",3112,28.127853881,3.33676032670116,"L","L","H","RENO_S","1","KK"
"44",3121,91.825330102,4.51988818660262,"H","L","H","RENO_S","1","KK"
"45",3122,136.4559307,4.91600171048243,"H","L","H","RENO_S","1","USA"
"46",4011,126.11889968,4.83722511024933,"H","L","H","RENO_N","2","KK"
"47",4022,76.520259821,4.33755554003153,"L","L","L","RENO_N","2","KK"
"48",4032,93.551979795,4.53851721545715,"L","L","H","RENO_N","2","USA"
"49",4041,207.09703422,5.33318744777751,"H","L","L","RENO_N","2","USA"
"50",4052,383.44185307,5.94918798759058,"N","N","N","RENO_N","2","USA"
"51",4061,156.79345897,5.05492939129363,"N","N","N","RENO_N","2","USA"
"52",4071,322.72413197,5.77679787769979,"L","H","L","RENO_N","2","USA"
"53",4082,554.05710342,6.31726775620079,"H","H","H","RENO_N","2","USA"
"54",4091,122.55552697,4.80856420867156,"N","N","N","RENO_N","2","KK"
"55",4102,112.70050456,4.72473389805434,"N","N","N","RENO_N","2","KK"
"56",4111,94.245481423,4.54590288271731,"L","H","H","RENO_N","2","KK"
"57",4122,323.16498582,5.77816298482521,"H","H","L","RENO_N","2","KK"

I define a linear model in R using lm as:

lm1 <- lm(logDIOX ~ 1 + OXYGEN + LOAD + PLANT + TIME + LAB, data=data)

and I want to interpret the estimated coefficients. However, when I extract the coefficients I get multiple NAs (I'm assuming this is due to linear dependencies among the variables). How can I then interpret the coefficients? I only have one intercept that somehow represents one of the levels of each of the included factors in the model. Is it possible to get an estimate for each factor level?

> summary(lm1)

Call:
lm(formula = logDIOX ~ OXYGEN + LOAD + PLANT + TIME + LAB, data = data)

Residuals:
     Min       1Q   Median       3Q      Max
-0.90821 -0.32102 -0.08993  0.27311  0.97758

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   7.2983     0.2110  34.596  < 2e-16 ***
OXYGENL      -0.4086     0.1669  -2.449 0.017953 *
OXYGENN      -0.7567     0.1802  -4.199 0.000113 ***
LOADL        -1.0645     0.1675  -6.357 6.58e-08 ***
LOADN             NA         NA      NA       NA
PLANTRENO_N  -0.6636     0.2174  -3.052 0.003664 **
PLANTRENO_S  -2.3452     0.1929 -12.158  < 2e-16 ***
TIME2        -0.9160     0.2065  -4.436 5.18e-05 ***
LABUSA        0.3829     0.1344   2.849 0.006392 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5058 on 49 degrees of freedom
Multiple R-squared:  0.8391, Adjusted R-squared:  0.8161
F-statistic:  36.5 on 7 and 49 DF,  p-value: < 2.2e-16
For the NA part of your question you can have a look here: linear regression "NA" estimate just for last coefficient. It means that one of your variables can be described as a linear combination of the rest.

For the factors and their levels, the way R works is to show the intercept for the first factor level and, for each remaining level, the difference from that intercept. I think it will be clearer with a regression on just one factor:

lm1 <- lm(logDIOX ~ 1 + OXYGEN, data=df)
> summary(lm1)

Call:
lm(formula = logDIOX ~ 1 + OXYGEN, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-1.7803 -0.7833 -0.2027  0.6597  3.1229

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.5359     0.2726  20.305   <2e-16 ***
OXYGENL      -0.4188     0.3909  -1.071    0.289
OXYGENN      -0.1896     0.3807  -0.498    0.621
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.188 on 54 degrees of freedom
Multiple R-squared:  0.02085, Adjusted R-squared:  -0.01542
F-statistic: 0.5749 on 2 and 54 DF,  p-value: 0.5662

What this result says is that for OXYGEN = "H" the intercept is 5.5359, for OXYGEN = "L" the intercept is 5.5359 - 0.4188 = 5.1171, and for OXYGEN = "N" the intercept is 5.5359 - 0.1896 = 5.3463. Hope this helps.

UPDATE: Following your comment, I generalize to your model:

when OXYGEN = "H", LOAD = "H", PLANT = "KARA", TIME = 1, LAB = "KK" then: logDIOX = 7.2983
when OXYGEN = "L", LOAD = "H", PLANT = "KARA", TIME = 1, LAB = "KK" then: logDIOX = 7.2983 - 0.4086 = 6.8897
when OXYGEN = "L", LOAD = "L", PLANT = "KARA", TIME = 1, LAB = "KK" then: logDIOX = 7.2983 - 0.4086 - 1.0645 = 5.8252
etc.
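As for getting an estimate for each factor level directly: for the single-factor case, one sketch (assuming the same df as above) is to drop the intercept, which makes R report one mean per level instead of differences from a reference level. The values in the comment are what the arithmetic above implies, not separately verified output:

# remove the intercept so each OXYGEN level gets its own estimate
lm2 <- lm(logDIOX ~ 0 + OXYGEN, data = df)
coef(lm2)
# expected: OXYGENH = 5.5359, OXYGENL = 5.1171, OXYGENN = 5.3463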
GLM submodel testing in R: why are all statistics still the same after removing one continuous covariate?
I am doing a submodel test. The smaller model is nested in the bigger model; the bigger model has one more continuous variable than the smaller model. I use the likelihood ratio test. The result is quite strange: both models have the same statistics, such as residual deviance and df. I also find that the two models have the same estimated coefficients and std. errors. How is this possible?

summary(m2221)

Call:
glm(formula = clm ~ veh_age + veh_body + agecat + veh_value:veh_age +
    veh_value:area, family = "binomial", data = Car)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.9245  -0.3939  -0.3683  -0.3437   2.7095

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)        -1.294118   0.382755  -3.381 0.000722 ***
veh_age2            0.051790   0.098463   0.526 0.598897
veh_age3           -0.166801   0.094789  -1.760 0.078457 .
veh_age4           -0.239862   0.096154  -2.495 0.012611 *
veh_bodyCONVT      -2.184124   0.707884  -3.085 0.002033 **
veh_bodyCOUPE      -0.850675   0.393625  -2.161 0.030685 *
veh_bodyHBACK      -1.105087   0.374134  -2.954 0.003140 **
veh_bodyHDTOP      -0.973472   0.383404  -2.539 0.011116 *
veh_bodyMCARA      -0.649036   0.469407  -1.383 0.166765
veh_bodyMIBUS      -1.295135   0.404691  -3.200 0.001373 **
veh_bodyPANVN      -0.903032   0.395295  -2.284 0.022345 *
veh_bodyRDSTR      -1.108488   0.826541  -1.341 0.179883
veh_bodySEDAN      -1.097931   0.373578  -2.939 0.003293 **
veh_bodySTNWG      -1.129122   0.373713  -3.021 0.002516 **
veh_bodyTRUCK      -1.156099   0.384088  -3.010 0.002613 **
veh_bodyUTE        -1.343958   0.377653  -3.559 0.000373 ***
agecat2            -0.198002   0.058382  -3.391 0.000695 ***
agecat3            -0.224492   0.056905  -3.945 7.98e-05 ***
agecat4            -0.253377   0.056774  -4.463 8.09e-06 ***
agecat5            -0.441906   0.063227  -6.989 2.76e-12 ***
agecat6            -0.447231   0.072292  -6.186 6.15e-10 ***
veh_age1:veh_value -0.000637   0.026387  -0.024 0.980740
veh_age2:veh_value  0.035386   0.031465   1.125 0.260753
veh_age3:veh_value  0.114485   0.036690   3.120 0.001806 **
veh_age4:veh_value  0.189866   0.057573   3.298 0.000974 ***
veh_value:areaB     0.044099   0.021550   2.046 0.040722 *
veh_value:areaC     0.021892   0.019189   1.141 0.253931
veh_value:areaD    -0.023616   0.024939  -0.947 0.343658
veh_value:areaE    -0.013506   0.026886  -0.502 0.615415
veh_value:areaF     0.057780   0.026602   2.172 0.029850 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 33767  on 67855  degrees of freedom
Residual deviance: 33592  on 67826  degrees of freedom
AIC: 33652

Number of Fisher Scoring iterations: 5

summary(m222)

Call:
glm(formula = clm ~ veh_value + veh_age + veh_body + agecat +
    veh_value:veh_age + veh_value:area, family = "binomial", data = Car)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.9245  -0.3939  -0.3683  -0.3437   2.7095

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)        -1.294118   0.382755  -3.381 0.000722 ***
veh_value          -0.000637   0.026387  -0.024 0.980740
veh_age2            0.051790   0.098463   0.526 0.598897
veh_age3           -0.166801   0.094789  -1.760 0.078457 .
veh_age4           -0.239862   0.096154  -2.495 0.012611 *
veh_bodyCONVT      -2.184124   0.707884  -3.085 0.002033 **
veh_bodyCOUPE      -0.850675   0.393625  -2.161 0.030685 *
veh_bodyHBACK      -1.105087   0.374134  -2.954 0.003140 **
veh_bodyHDTOP      -0.973472   0.383404  -2.539 0.011116 *
veh_bodyMCARA      -0.649036   0.469407  -1.383 0.166765
veh_bodyMIBUS      -1.295135   0.404691  -3.200 0.001373 **
veh_bodyPANVN      -0.903032   0.395295  -2.284 0.022345 *
veh_bodyRDSTR      -1.108488   0.826541  -1.341 0.179883
veh_bodySEDAN      -1.097931   0.373578  -2.939 0.003293 **
veh_bodySTNWG      -1.129122   0.373713  -3.021 0.002516 **
veh_bodyTRUCK      -1.156099   0.384088  -3.010 0.002613 **
veh_bodyUTE        -1.343958   0.377653  -3.559 0.000373 ***
agecat2            -0.198002   0.058382  -3.391 0.000695 ***
agecat3            -0.224492   0.056905  -3.945 7.98e-05 ***
agecat4            -0.253377   0.056774  -4.463 8.09e-06 ***
agecat5            -0.441906   0.063227  -6.989 2.76e-12 ***
agecat6            -0.447231   0.072292  -6.186 6.15e-10 ***
veh_value:veh_age2  0.036023   0.034997   1.029 0.303331
veh_value:veh_age3  0.115122   0.039476   2.916 0.003543 **
veh_value:veh_age4  0.190503   0.058691   3.246 0.001171 **
veh_value:areaB     0.044099   0.021550   2.046 0.040722 *
veh_value:areaC     0.021892   0.019189   1.141 0.253931
veh_value:areaD    -0.023616   0.024939  -0.947 0.343658
veh_value:areaE    -0.013506   0.026886  -0.502 0.615415
veh_value:areaF     0.057780   0.026602   2.172 0.029850 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 33767  on 67855  degrees of freedom
Residual deviance: 33592  on 67826  degrees of freedom
AIC: 33652

anova(m2221, m222, test = "LRT")

Analysis of Deviance Table

Model 1: clm ~ veh_age + veh_body + agecat + veh_value:veh_age + veh_value:area
Model 2: clm ~ veh_value + veh_age + veh_body + agecat + veh_value:veh_age +
    veh_value:area
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1     67826      33592
2     67826      33592  0        0
You have specified the same model in two different ways. To show this, I'll first explain a bit of what is going on under the hood, then I'll walk through the coefficients of your two models to show they are the same, and I'll end with some higher-level intuition.

Different formulas creating different interaction terms

First, the only difference between your two models is that the first model includes veh_value as a predictor, whereas the second does not. However, veh_value is interacted with other predictors in both models. So let's consider a simple reproducible example and see what R does when we do this. I'll use model.matrix to feed in two different formulas and view the continuous and dummy variables that R creates as a result.

colnames(model.matrix(am ~ mpg + factor(cyl):mpg, mtcars))
# [1] "(Intercept)"      "mpg"              "mpg:factor(cyl)6" "mpg:factor(cyl)8"

colnames(model.matrix(am ~ factor(cyl):mpg, mtcars))
# [1] "(Intercept)"      "factor(cyl)4:mpg" "factor(cyl)6:mpg" "factor(cyl)8:mpg"

Notice that in the first call I included the continuous predictor mpg, whereas I did not include it in the second call (a simpler version of what you are doing). Now note that the second model matrix contains an extra interaction variable (factor(cyl)4:mpg) that the first does not. In other words, because we did not include mpg in the model directly, all levels of cyl get included in the interaction.

Your models are the same

Your models are doing essentially the same thing as the simple example above, and we can show that the coefficients are actually the same once added together. In your first model, all 4 levels of veh_age are included in the interaction with veh_value, but veh_value itself is not in the model:

veh_age1:veh_value -0.000637   0.026387  -0.024 0.980740
veh_age2:veh_value  0.035386   0.031465   1.125 0.260753
veh_age3:veh_value  0.114485   0.036690   3.120 0.001806 **
veh_age4:veh_value  0.189866   0.057573   3.298 0.000974 ***

In your second model, only 3 levels of veh_age are included in the interaction with veh_value, because veh_value itself is in the model:

veh_value:veh_age2  0.036023   0.034997   1.029 0.303331
veh_value:veh_age3  0.115122   0.039476   2.916 0.003543 **
veh_value:veh_age4  0.190503   0.058691   3.246 0.001171 **

Now, here is the critical piece for seeing that the models are actually the same. It's easiest to show by walking through the levels of veh_age.

First consider veh_age = 1. For both models, the coefficient on veh_value conditioned on veh_age = 1 is -0.000637:

# For the first model
veh_age1:veh_value -0.000637   0.026387  -0.024 0.980740
# For the second model
veh_value          -0.000637   0.026387  -0.024 0.980740

Now consider veh_age = 2. For both models, the coefficient on veh_value conditioned on veh_age = 2 is 0.035386:

# For the first model
veh_age2:veh_value  0.035386   0.031465   1.125 0.260753
# For the second model - note that the two coefficients below sum to 0.035386
veh_value          -0.000637   0.026387  -0.024 0.980740
veh_value:veh_age2  0.036023   0.034997   1.029 0.303331

Intuition

When you include the interaction veh_value:veh_age, you are saying that you want the coefficient of veh_value, a continuous variable, to be conditioned on veh_age, a categorical variable. Including both the interaction veh_value:veh_age and veh_value as predictors says the same thing: you want the coefficient of veh_value, but you want to condition it on the value of veh_age.
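A quick, hedged way to confirm this kind of equivalence on the mtcars sketch above (using lm rather than glm so the fits are exact; am stands in for a response): whenever two design matrices span the same column space, the fitted values must agree.

# both formulas span the same column space, so the fits are identical
m1 <- lm(am ~ mpg + factor(cyl):mpg, data = mtcars)
m2 <- lm(am ~ factor(cyl):mpg, data = mtcars)
all.equal(fitted(m1), fitted(m2))  # TRUE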
How to Interpret binomial GLMM results
I have a large dataset (24765 obs). I am trying to look at how cleaning method affects emergence success (ES). I have several fixed factors: beach (4 levels) and cleaning method (3 levels). I also have a few random variables: Zone (128 levels), Year (18 years), and index (24765 levels). This is an OLRE (observation-level random effect) model to account for overdispersion. My best-fit model based on AIC scores is:

mod8a <- glmer(ES.test ~ beach + method + (1|Year) + (1|index),
               data = y5, weights = egg.total, family = binomial)

The summary showed:

summary(mod8a)  # AIC = 216732.9, same effect at every beach
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
 Family: binomial  ( logit )
Formula: ES.test ~ beach + method + (1 | Year) + (1 | index)
   Data: y5
Weights: egg.total

     AIC      BIC    logLik  deviance  df.resid
214834.2  214891.0  -107410.1  214820.2     24758

Scaled residuals:
     Min       1Q   Median       3Q      Max
-1.92900 -0.09344  0.00957  0.14682  1.62327

Random effects:
 Groups Name        Variance Std.Dev.
 index  (Intercept) 1.6541   1.286
 Year   (Intercept) 0.6512   0.807
Number of obs: 24765, groups:  index, 24765; Year, 19

Fixed effects:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)     0.65518    0.18646   3.514 0.000442 ***
beachHillsboro -0.06770    0.02143  -3.159 0.001583 **
beachHO/HA      0.31927    0.03716   8.591  < 2e-16 ***
methodHTL only  0.18106    0.02526   7.169 7.58e-13 ***
methodno clean  0.05989    0.03170   1.889 0.058853 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr) bchHll bHO/HA mtHTLo
beachHllsbr -0.002
beachHO/HA  -0.054  0.047
mthdHTLonly -0.107 -0.242  0.355
methodnclen -0.084 -0.060  0.265  0.628

What is my "intercept" (as seen above)? I am missing levels of the fixed factors; is that because R could not compute them?

I tested for overdispersion:

overdisp_fun <- function(mod8a) {
  ## number of variance parameters in
  ## an n-by-n variance-covariance matrix
  vpars <- function(m) {
    nrow(m)*(nrow(m)+1)/2
  }
  model8a.df <- sum(sapply(VarCorr(mod8a), vpars)) + length(fixef(mod8a))
  rdf <- nrow(model.frame(mod8a)) - model8a.df
  rp <- residuals(mod8a, type = "pearson")
  Pearson.chisq <- sum(rp^2)
  prat <- Pearson.chisq/rdf
  pval <- pchisq(Pearson.chisq, df = rdf, lower.tail = FALSE)
  c(chisq = Pearson.chisq, ratio = prat, rdf = rdf, p = pval)
}

> overdisp_fun(mod8a)
       chisq        ratio          rdf            p
2.064765e+03 8.339790e-02 2.475800e+04 1.000000e+00

This shows the plot of mod8a; I would like to know why I am getting such a curve and what it means.

Lastly, I did a multiple-comparison analysis using multcomp:

ls1 <- glht(mod8a, mcp(beach = "Tukey"))$linfct
ls2 <- glht(mod8a, mcp(method = "Tukey"))$linfct
summary(glht(mod8a, linfct = rbind(ls1, ls2)))

  Simultaneous Tests for General Linear Hypotheses

Fit: glmer(formula = ES.test ~ beach + method + (1 | Year) + (1 | index),
    data = y5, family = binomial, weights = egg.total)

Linear Hypotheses:
                           Estimate Std. Error z value Pr(>|z|)
Hillsboro - FTL/P == 0     -0.06770    0.02143  -3.159  0.00821 **
HO/HA - FTL/P == 0          0.31927    0.03716   8.591  < 0.001 ***
HO/HA - Hillsboro == 0      0.38696    0.04201   9.211  < 0.001 ***
HTL only - HTL and SB == 0  0.18106    0.02526   7.169  < 0.001 ***
no clean - HTL and SB == 0  0.05989    0.03170   1.889  0.24469
no clean - HTL only == 0   -0.12117    0.02524  -4.800  < 0.001 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Adjusted p values reported -- single-step method)

At this point, help with interpreting the analysis would be greatly appreciated (especially with that sigmoid curve for my residuals).
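As a hedged sketch of how the fixed effects above can be read on more familiar scales (assuming the fitted mod8a object and the lme4 accessors; the numeric comments are derived from the summary above):

library(lme4)  # for fixef()

# the intercept is the log-odds of emergence success at the reference levels
# (the alphabetically first beach and method), for an average Year/index
fixef(mod8a)["(Intercept)"]           # 0.65518 on the logit scale
plogis(fixef(mod8a)["(Intercept)"])   # about 0.66 as a probability

# the other fixed effects are log-odds differences from the reference levels,
# e.g. "HTL only" vs. the reference method, expressed as an odds ratio:
exp(fixef(mod8a)["methodHTL only"])   # about 1.20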
What happens to the coefficients when we switch labels (0/1) - in practice?
I am trying to see in practice what was explained here about what happens to the coefficients once the labels are switched, but I am not getting what is expected. Here is my attempt.

I am using the example of the natality public-use data given in "Practical Data Science with R", where the output is a logical variable that classifies newborn babies as atRisk, with levels FALSE and TRUE:

load(url("https://github.com/WinVector/zmPDSwR/tree/master/CDC/NatalRiskData.rData"))
train <- sdata[sdata$ORIGRANDGROUP<=5,]
test <- sdata[sdata$ORIGRANDGROUP>5,]

complications <- c("ULD_MECO","ULD_PRECIP","ULD_BREECH")
riskfactors <- c("URF_DIAB", "URF_CHYPER", "URF_PHYPER", "URF_ECLAM")
y <- "atRisk"
x <- c("PWGT", "UPREVIS", "CIG_REC", "GESTREC3", "DPLURAL", complications, riskfactors)
fmla <- paste(y, paste(x, collapse="+"), sep="~")

model <- glm(fmla, data=train, family=binomial(link="logit"))
summary(model)

This results in the following coefficients:

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)
(Intercept)              -4.412189   0.289352 -15.249  < 2e-16 ***
PWGT                      0.003762   0.001487   2.530 0.011417 *
UPREVIS                  -0.063289   0.015252  -4.150 3.33e-05 ***
CIG_RECTRUE               0.313169   0.187230   1.673 0.094398 .
GESTREC3< 37 weeks        1.545183   0.140795  10.975  < 2e-16 ***
DPLURALtriplet or higher  1.394193   0.498866   2.795 0.005194 **
DPLURALtwin               0.312319   0.241088   1.295 0.195163
ULD_MECOTRUE              0.818426   0.235798   3.471 0.000519 ***
ULD_PRECIPTRUE            0.191720   0.357680   0.536 0.591951
ULD_BREECHTRUE            0.749237   0.178129   4.206 2.60e-05 ***
URF_DIABTRUE             -0.346467   0.287514  -1.205 0.228187
URF_CHYPERTRUE            0.560025   0.389678   1.437 0.150676
URF_PHYPERTRUE            0.161599   0.250003   0.646 0.518029
URF_ECLAMTRUE             0.498064   0.776948   0.641 0.521489

OK, now let us switch the labels in our atRisk variable:

esdata$atRisk <- factor(sdata$atRisk)
levels(sdata$atRisk) <- c("TRUE", "FALSE")

and re-run the above analysis, where I am expecting to see a change in the signs of the reported coefficients. However, I get exactly the same coefficients:

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)
(Intercept)              -4.412189   0.289352 -15.249  < 2e-16 ***
PWGT                      0.003762   0.001487   2.530 0.011417 *
UPREVIS                  -0.063289   0.015252  -4.150 3.33e-05 ***
CIG_RECTRUE               0.313169   0.187230   1.673 0.094398 .
GESTREC3< 37 weeks        1.545183   0.140795  10.975  < 2e-16 ***
DPLURALtriplet or higher  1.394193   0.498866   2.795 0.005194 **
DPLURALtwin               0.312319   0.241088   1.295 0.195163
ULD_MECOTRUE              0.818426   0.235798   3.471 0.000519 ***
ULD_PRECIPTRUE            0.191720   0.357680   0.536 0.591951
ULD_BREECHTRUE            0.749237   0.178129   4.206 2.60e-05 ***
URF_DIABTRUE             -0.346467   0.287514  -1.205 0.228187
URF_CHYPERTRUE            0.560025   0.389678   1.437 0.150676
URF_PHYPERTRUE            0.161599   0.250003   0.646 0.518029
URF_ECLAMTRUE             0.498064   0.776948   0.641 0.521489

What am I doing wrong here? Can you help please?
It's because you set

train <- sdata[sdata$ORIGRANDGROUP<=5,]

and then you change

sdata$atRisk <- factor(sdata$atRisk)

but your model is using the train dataset, whose levels DID NOT get changed. Instead you can do:

y <- "!atRisk"
x <- c("PWGT", "UPREVIS", "CIG_REC", "GESTREC3", "DPLURAL", complications, riskfactors)
fmla <- paste(y, paste(x, collapse="+"), sep="~")
model <- glm(fmla, data=train, family=binomial(link="logit"))

Call:
glm(formula = fmla, family = binomial(link = "logit"), data = train)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.2641   0.1358   0.1511   0.1818   0.9732

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)
(Intercept)               4.412189   0.289352  15.249  < 2e-16 ***
PWGT                     -0.003762   0.001487  -2.530 0.011417 *
UPREVIS                   0.063289   0.015252   4.150 3.33e-05 ***
CIG_RECTRUE              -0.313169   0.187230  -1.673 0.094398 .
GESTREC3< 37 weeks       -1.545183   0.140795 -10.975  < 2e-16 ***
DPLURALtriplet or higher -1.394193   0.498866  -2.795 0.005194 **
DPLURALtwin              -0.312319   0.241088  -1.295 0.195163
ULD_MECOTRUE             -0.818426   0.235798  -3.471 0.000519 ***
ULD_PRECIPTRUE           -0.191720   0.357680  -0.536 0.591951
ULD_BREECHTRUE           -0.749237   0.178129  -4.206 2.60e-05 ***
URF_DIABTRUE              0.346467   0.287514   1.205 0.228187
URF_CHYPERTRUE           -0.560025   0.389678  -1.437 0.150676
URF_PHYPERTRUE           -0.161599   0.250003  -0.646 0.518029
URF_ECLAMTRUE            -0.498064   0.776948  -0.641 0.521489
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2698.7  on 14211  degrees of freedom
Residual deviance: 2463.0  on 14198  degrees of freedom
AIC: 2491

Number of Fisher Scoring iterations: 7
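An alternative sketch that relabels the response on train itself (untested; it relies on glm() treating the first level of a factor response as the reference and modelling the probability of the second level):

# put TRUE first so glm() models P(atRisk == "FALSE"),
# which flips the sign of every coefficient relative to the original fit
train$atRisk <- factor(train$atRisk, levels = c("TRUE", "FALSE"))
model_flipped <- glm(fmla, data = train, family = binomial(link = "logit"))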
Logistic Regression: Strange Variable arises
I am using R to perform logistic regression on my data set, which has more than 50 variables. I am running the following code:

glm(X...ResponseFlag ~ NetWorth + LOR + IntGrandChld + OccupInput, family = binomial, data = data)

When I call summary() I get the following output:

> summary(ResponseModel)

Call:
glm(formula = X...ResponseFlag ~ NetWorth + LOR + IntGrandChld +
    OccupInput, family = binomial, data = data)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.2785  -0.9576  -0.8925   1.3736   1.9721

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.971166   0.164439  -5.906 3.51e-09 ***
NetWorth       0.082168   0.019849   4.140 3.48e-05 ***
LOR           -0.019716   0.006494  -3.036   0.0024 **
IntGrandChld  -0.021544   0.085274  -0.253   0.8005
OccupInput2    0.005796   0.138390   0.042   0.9666
OccupInput3    0.471020   0.289642   1.626   0.1039
OccupInput4   -0.031880   0.120636  -0.264   0.7916
OccupInput5   -0.148898   0.129922  -1.146   0.2518
OccupInput6   -0.481183   0.416277  -1.156   0.2477
OccupInput7   -0.057485   0.218309  -0.263   0.7923
OccupInput8    0.505676   0.123955   4.080 4.51e-05 ***
OccupInput9   -0.382375   0.821362  -0.466   0.6415
OccupInputA  -12.903334 178.064831  -0.072   0.9422
OccupInputB    0.581272   1.003193   0.579   0.5623
OccupInputC   -0.034188   0.294507  -0.116   0.9076
OccupInputD    0.224634   0.385959   0.582   0.5606
OccupInputE   -1.292358   1.072864  -1.205   0.2284
OccupInputF   14.132144 308.212341   0.046   0.9634
OccupInputH    0.622677   1.006982   0.618   0.5363
OccupInputU    0.087526   0.095740   0.914   0.3606
OccupInputV   -1.010939   0.637746  -1.585   0.1129
OccupInputW    0.262031   0.256238   1.023   0.3065
OccupInputX    0.332209   0.428806   0.775   0.4385
OccupInputY    0.059771   0.157135   0.380   0.7037
OccupInputZ    0.638520   0.711979   0.897   0.3698
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 5885.1  on 4467  degrees of freedom
Residual deviance: 5809.6  on 4443  degrees of freedom
AIC: 5859.6

Number of Fisher Scoring iterations: 12

From the output, you can see that new variables like OccupInput2 ... have appeared. OccupInput originally had values 1, 2, 3, ..., A, B, C, D, ..., yet this did not happen for NetWorth or LOR. I am new to R and have no explanation for why these new variables appear. Can anybody give me an explanation? Thank you in advance.
I would assume that OccupInput in your model is a factor variable. R introduces so-called dummy variables when you include factor regressors in a linear model. What you see as OccupInput2 and so forth in the table are the coefficients associated with the individual factor levels (the reference level OccupInput1 is covered by the intercept term).

You can verify the type of OccupInput from the output of sapply(data, class), which yields the data types of the columns in your input data frame.
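A minimal sketch of the same mechanism using a built-in dataset (mtcars stands in for your data):

# cyl takes the values 4, 6 and 8; treating it as a factor creates dummies
m <- lm(mpg ~ factor(cyl), data = mtcars)
coef(m)
# (Intercept)  factor(cyl)6  factor(cyl)8
# the reference level (cyl == 4) is absorbed into the intercept;
# the other coefficients are offsets from it
sapply(mtcars, class)  # shows which columns are numeric vs. factor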