Logistic Regression: Strange Variables Arise - R
I am using R to perform logistic regression on my data set. My data set has more than 50 variables.
I am running the following code:
glm(X...ResponseFlag ~ NetWorth + LOR + IntGrandChld + OccupInput, family = binomial, data = data)
When I call summary(), I get the following output:
> summary(ResponseModel)
Call:
glm(formula = X...ResponseFlag ~ NetWorth + LOR + IntGrandChld +
OccupInput, family = binomial, data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.2785 -0.9576 -0.8925 1.3736 1.9721
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.971166 0.164439 -5.906 3.51e-09 ***
NetWorth 0.082168 0.019849 4.140 3.48e-05 ***
LOR -0.019716 0.006494 -3.036 0.0024 **
IntGrandChld -0.021544 0.085274 -0.253 0.8005
OccupInput2 0.005796 0.138390 0.042 0.9666
OccupInput3 0.471020 0.289642 1.626 0.1039
OccupInput4 -0.031880 0.120636 -0.264 0.7916
OccupInput5 -0.148898 0.129922 -1.146 0.2518
OccupInput6 -0.481183 0.416277 -1.156 0.2477
OccupInput7 -0.057485 0.218309 -0.263 0.7923
OccupInput8 0.505676 0.123955 4.080 4.51e-05 ***
OccupInput9 -0.382375 0.821362 -0.466 0.6415
OccupInputA -12.903334 178.064831 -0.072 0.9422
OccupInputB 0.581272 1.003193 0.579 0.5623
OccupInputC -0.034188 0.294507 -0.116 0.9076
OccupInputD 0.224634 0.385959 0.582 0.5606
OccupInputE -1.292358 1.072864 -1.205 0.2284
OccupInputF 14.132144 308.212341 0.046 0.9634
OccupInputH 0.622677 1.006982 0.618 0.5363
OccupInputU 0.087526 0.095740 0.914 0.3606
OccupInputV -1.010939 0.637746 -1.585 0.1129
OccupInputW 0.262031 0.256238 1.023 0.3065
OccupInputX 0.332209 0.428806 0.775 0.4385
OccupInputY 0.059771 0.157135 0.380 0.7037
OccupInputZ 0.638520 0.711979 0.897 0.3698
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 5885.1 on 4467 degrees of freedom
Residual deviance: 5809.6 on 4443 degrees of freedom
AIC: 5859.6
Number of Fisher Scoring iterations: 12
From the output, you can see that new variables like OccupInput2 and so on have appeared. OccupInput originally had the values 1, 2, 3, ..., A, B, C, D, ... but nothing similar happened for NetWorth or LOR.
I am new to R and have no explanation for why these new variables appear.
Can anybody explain this? Thank you in advance.
I would assume that OccupInput in your model is a factor variable. R introduces so-called dummy variables when you include factor regressors in a linear model.
What you see as OccupInput2 and so forth in the table are the coefficients associated with the individual factor levels (the reference level OccupInput1 is covered by the intercept term).
You can verify the type of OccupInput with sapply(data, class), which yields the data types of the columns in your input data frame.
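To see the dummy coding directly, here is a minimal sketch using the built-in mtcars data set in place of your data (your data set and column names are not available here):

# Sketch: R expands a factor into dummy variables behind the scenes.
df <- mtcars
df$cyl <- factor(df$cyl)              # make cyl a factor with levels 4, 6, 8
sapply(df[c("mpg", "cyl")], class)    # mpg: "numeric", cyl: "factor"

# model.matrix() shows the regressors a model formula actually produces:
# the reference level (cyl = 4) is absorbed into the intercept, and one
# dummy column (cyl6, cyl8) is created for each remaining level.
head(model.matrix(~ mpg + cyl, data = df))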
Related
Preparation of variables for binary logistic regression
I want to perform a binary logistic regression on a binary variable. The variable "burdened" (German: "belastet") has the two values agree (1) and disagree (2). I am working with glm() and family = "binomial". When I put my independent variables into the model (both categorical and metric) and then calculate the p-value using pchisq, I get 0.

belastet0 <- glm(belastetB ~ 1, data = MF, family = binomial(), subset = (sex == 2))
summary(belastet0)
belastet1 <- glm(belastetB ~ age + SES_3 + eig_Kinder + Zufriedenh_BZ + LZ + Angst + guteSeiten + finanzielleEinb + persKontakt, data = MF, family = "binomial", subset = (sex == 2))
summary(belastet1)
bel_chi <- belastet0$null.deviance - belastet1$deviance
bel_chidf <- belastet1$df.null - belastet1$df.residual
bel_pchisq <- 1 - pchisq(bel_chi, bel_chidf)

The output I get:

Deviance Residuals:
Min 1Q Median 3Q Max
-3.0832 -0.5579 0.4269 0.7315 2.1323

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.933019 0.345034 11.399 < 2e-16 ***
age -0.017936 0.005805 -3.090 0.00200 **
SES_3mittel -0.252995 0.081740 -3.095 0.00197 **
SES_3niedrig -0.426660 0.131045 -3.256 0.00113 **
eig_Kinder 0.195782 0.044914 4.359 1.31e-05 ***
Zufriedenh_BZ 0.074256 0.021855 3.398 0.00068 ***
LZ -0.452521 0.026458 -17.103 < 2e-16 ***
Angststimme zu 0.955357 0.073680 12.966 < 2e-16 ***
Angstweder noch 0.554067 0.109405 5.064 4.10e-07 ***
guteSeitenstimme zu -1.312848 0.105667 -12.424 < 2e-16 ***
guteSeitenweder noch -0.451338 0.144038 -3.133 0.00173 **
finanzielleEinbstimme zu 0.759940 0.092765 8.192 2.57e-16 ***
finanzielleEinbweder noch 0.814164 0.136931 5.946 2.75e-09 ***
persKontaktstimme zu 1.001333 0.082896 12.079 < 2e-16 ***
persKontaktweder noch 0.538896 0.124962 4.312 1.61e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6691.7 on 5928 degrees of freedom
Residual deviance: 5366.5 on 5914 degrees of freedom
(14325 observations deleted due to missingness)
AIC: 5396.5

And for bel_pchisq <- 1 - pchisq(bel_chi, bel_chidf) I receive: bel_pchisq = 0.

I think the problem is that I have not cleaned my data. I have already revised my metric variables, e.g. data$Age[is.na(data$Age)] <- mean(data$Age, na.rm = T), and my categorical ones, e.g. MF$belastetB <- as.factor(MF$belastetB), unfortunately with only partial success. In addition, by applying the metric formula I overwrote all my variables, but I still need them in their original form. Unfortunately, I am not at all sure how to prepare my variables for binary logistic regression so that I get a p-value that is not 0, because a p-value of 0 suggests that I have an error in my formula or that my variables are not prepared correctly.

My categorical independent variables are: SES (high, medium, low), Angst (agree, neither, disagree), guteSeiten (agree, neither, disagree), finanzielleEinb (agree, neither, disagree), persKontakt (agree, neither, disagree).

My metric independent variables are: age, eig_Kinder, Zufriedenh_BZ (scale: 0-10), LZ (scale: 0-10).

For example, the output of LZ (metric) looks like this:

summary(MF$LZ)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 6.000 7.000 6.794 8.000 10.000 707

table(MF$LZ)
0 1 2 3 4 5 6 7 8 9 10
231 261 728 1551 1775 4024 4166 7937 9792 4085 1710

And for Angst (categorical):

table(MF$Angst)
stimme nicht zu stimme zu weder noch
16918 14607 5255

summary(MF$Angst)
Length Class Mode
36967 character character

What formulas can I apply, or how do I need to change/adjust my variables, so that I get a p-value other than 0?
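One likely explanation, offered as a sketch rather than a definitive diagnosis: with a deviance drop of 6691.7 - 5366.5 = 1325.2 on 14 degrees of freedom, the true p-value is far below double precision, so 1 - pchisq(...) underflows to exactly 0. Computing the upper tail directly with lower.tail = FALSE avoids the subtraction:

# Sketch: reproduce the apparent p-value of 0 with the deviances printed
# above, then compute the upper tail directly.
bel_chi   <- 6691.7 - 5366.5  # null deviance minus residual deviance
bel_chidf <- 5928 - 5914      # df.null minus df.residual
1 - pchisq(bel_chi, bel_chidf)                  # 0: the CDF rounds to 1 in double precision
pchisq(bel_chi, bel_chidf, lower.tail = FALSE)  # minuscule but nonzero p-value

So a printed p-value of 0 here does not necessarily indicate a data preparation error; it can simply mean the full model fits overwhelmingly better than the intercept-only model.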
Logistic regression from R returning values greater than one
I have run a logistic regression in R using glm to predict the likelihood that an individual in 1993 will have arthritis in 2004 (Arth2004) based on gender (Gen), smoking status (Smoke1993), hypertension (HT1993), high cholesterol (HC1993), and BMI (BMI1993) status in 1993. My sample size is n = 7896. All variables are binary, with 0 for false and 1 for true, except BMI, which is continuous numeric. For gender, male = 1 and female = 0. When I run the regression in R, I get good p-values, but when I actually use the regression for prediction, I quite often get values greater than one for very ordinary individuals. I apologize for the large code block, but I thought more information might be helpful.

library(ResourceSelection)
library(MASS)
data = read.csv(file.choose())
data$Arth2004 = as.factor(data$Arth2004)
data$Gen = as.factor(data$Gen)
data$Smoke1993 = as.factor(data$Smoke1993)
data$HT1993 = as.factor(data$HT1993)
data$HC1993 = as.factor(data$HC1993)
data$BMI1993 = as.numeric(data$BMI1993)
logistic <- glm(Arth2004 ~ Gen + Smoke1993 + BMI1993 + HC1993 + HT1993, data=data, family="binomial")
summary(logistic)
hoslem.test(logistic$y, fitted(logistic))
confint(logistic)
min(data$BMI1993)
median(data$BMI1993)
max(data$BMI1993)
e = 2.71828

The output is as follows:

Call:
glm(formula = Arth2004 ~ Gen + Smoke1993 + BMI1993 + HC1993 + HT1993, family = "binomial", data = data)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.0362 -1.0513 -0.7831 1.1844 1.8807

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.346104 0.158043 -14.845 < 2e-16 ***
Gen1 -0.748286 0.048398 -15.461 < 2e-16 ***
Smoke19931 -0.059342 0.064606 -0.919 0.358
BMI1993 0.084056 0.006005 13.997 < 2e-16 ***
HC19931 0.388217 0.047820 8.118 4.72e-16 ***
HT19931 0.341375 0.058423 5.843 5.12e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 10890 on 7895 degrees of freedom
Residual deviance: 10309 on 7890 degrees of freedom
AIC: 10321
Number of Fisher Scoring iterations: 4

Hosmer and Lemeshow goodness of fit (GOF) test
data: logistic$y, fitted(logistic)
X-squared = 18.293, df = 8, p-value = 0.01913

Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) -2.65715966 -2.03756775
Gen1 -0.84336906 -0.65364134
Smoke19931 -0.18619647 0.06709748
BMI1993 0.07233866 0.09588198
HC19931 0.29454661 0.48200673
HT19931 0.22690608 0.45595006

[1] 18
[1] 26
[1] 43

A non-smoking female with median BMI (26), hypertension, and high cholesterol yields the following:

e^(26*0.084056+1*0.388217+1*0.341375-0*0.748286-0*0.059342-2.346104)
[1] 1.7664

I think the issue is somehow related to BMI, since that is the only numeric variable. Does anyone know why this regression produces probabilities greater than 1?
By default, family = "binomial" uses the logit link function (see ?family), so the linear predictor is on the log-odds scale. What you computed by exponentiating it is the odds, not the probability. The probability you're looking for is the odds divided by one plus the odds: 1.7664 / (1 + 1.7664) ≈ 0.64.
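In practice you can skip the hand calculation. A sketch of the two standard ways to get probabilities, assuming the logistic model object and the 0/1 factor coding from the question:

# Sketch: predicted probability for a non-smoking female with BMI 26,
# hypertension, and high cholesterol (factor coding as in the question).
newdata <- data.frame(Gen = factor(0, levels = c(0, 1)),
                      Smoke1993 = factor(0, levels = c(0, 1)),
                      BMI1993 = 26,
                      HC1993 = factor(1, levels = c(0, 1)),
                      HT1993 = factor(1, levels = c(0, 1)))
predict(logistic, newdata, type = "response")  # probability directly, ~0.64

# Equivalently, apply the inverse logit to the linear predictor:
eta <- predict(logistic, newdata, type = "link")  # log-odds
plogis(eta)                                       # exp(eta) / (1 + exp(eta))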
General Linear Model interpretation of parameter estimates in R
I have a data set that looks like this:

"","OBSERV","DIOX","logDIOX","OXYGEN","LOAD","PRSEK","PLANT","TIME","LAB"
"1",1011,984.06650389,6.89169348002254,"L","H","L","RENO_N","1","KK"
"2",1022,1790.7973641,7.49041625445373,"H","H","L","RENO_N","1","USA"
"3",1031,661.95870145,6.4952031694744,"L","H","H","RENO_N","1","USA"
"4",1042,978.06853583,6.88557974511529,"H","H","H","RENO_N","1","KK"
"5",1051,270.92290942,5.60183431332639,"N","N","N","RENO_N","1","USA"
"6",1062,402.98269729,5.99889362626069,"N","N","N","RENO_N","1","USA"
"7",1071,321.71945701,5.77367991426247,"H","L","L","RENO_N","1","KK"
"8",1082,223.15260359,5.40785585845064,"L","L","L","RENO_N","1","USA"
"9",1091,246.65350151,5.507984523849,"H","L","H","RENO_N","1","USA"
"10",1102,188.48323034,5.23900903921703,"L","L","H","RENO_N","1","KK"
"11",1141,267.34994025,5.58855843790491,"N","N","N","RENO_N","1","KK"
"12",1152,452.10355987,6.11391126834609,"N","N","N","RENO_N","1","KK"
"13",2011,2569.6672555,7.85153169693888,"N","N","N","KARA","1","USA"
"14",2021,604.79620572,6.40489155123453,"N","N","N","KARA","1","KK"
"15",2031,2610.4804449,7.86728956188212,"L","H",NA,"KARA","1","KK"
"16",2032,3789.7097503,8.24004471210954,"L","H",NA,"KARA","1","USA"
"17",2052,338.97054188,5.82591320649553,"L","L","L","KARA","1","KK"
"18",2061,391.09027375,5.96893841249289,"H","L","H","KARA","1","USA"
"19",2092,410.04420258,6.01626496505788,"N","N","N","KARA","1","USA"
"20",2102,313.51882368,5.74785940190679,"N","N","N","KARA","1","KK"
"21",2112,1242.5931417,7.12495571830002,"H","H","H","KARA","1","KK"
"22",2122,1751.4827969,7.46821802066524,"H","H","L","KARA","1","USA"
"23",3011,60.48026048,4.10231703874031,"N","N","N","RENO_S","1","KK"
"24",3012,257.27729731,5.55015448107691,"N","N","N","RENO_S","1","USA"
"25",3021,46.74282552,3.84466077914493,"N","N","N","RENO_S","1","KK"
"26",3022,73.605375516,4.29871805996994,"N","N","N","RENO_S","1","KK"
"27",3031,108.25433812,4.68448344109116,"H","H","L","RENO_S","1","KK"
"28",3032,124.40704234,4.82355878915293,"H","H","L","RENO_S","1","USA"
"29",3042,123.66859296,4.81760535031397,"L","H","L","RENO_S","1","KK"
"30",3051,170.05332632,5.13611207209694,"N","N","N","RENO_S","1","USA"
"31",3052,95.868704018,4.56297958887925,"N","N","N","RENO_S","1","KK"
"32",3061,202.69261215,5.31169060558111,"N","N","N","RENO_S","1","USA"
"33",3062,70.686307069,4.25825187761015,"N","N","N","RENO_S","1","USA"
"34",3071,52.034715526,3.95191110210073,"L","H","H","RENO_S","1","KK"
"35",3072,93.33525462,4.53619789950355,"L","H","H","RENO_S","1","USA"
"36",3081,121.47464906,4.79970559129829,"H","H","H","RENO_S","1","USA"
"37",3082,94.833869239,4.55212661590867,"H","H","H","RENO_S","1","KK"
"38",3091,68.624596439,4.22865101914209,"H","L","L","RENO_S","1","USA"
"39",3092,64.837097371,4.17187792984139,"H","L","L","RENO_S","1","KK"
"40",3101,32.351569811,3.47666254561192,"L","L","L","RENO_S","1","KK"
"41",3102,29.285124102,3.37707967726539,"L","L","L","RENO_S","1","USA"
"42",3111,31.36974463,3.44584388158928,"L","L","H","RENO_S","1","USA"
"43",3112,28.127853881,3.33676032670116,"L","L","H","RENO_S","1","KK"
"44",3121,91.825330102,4.51988818660262,"H","L","H","RENO_S","1","KK"
"45",3122,136.4559307,4.91600171048243,"H","L","H","RENO_S","1","USA"
"46",4011,126.11889968,4.83722511024933,"H","L","H","RENO_N","2","KK"
"47",4022,76.520259821,4.33755554003153,"L","L","L","RENO_N","2","KK"
"48",4032,93.551979795,4.53851721545715,"L","L","H","RENO_N","2","USA"
"49",4041,207.09703422,5.33318744777751,"H","L","L","RENO_N","2","USA"
"50",4052,383.44185307,5.94918798759058,"N","N","N","RENO_N","2","USA"
"51",4061,156.79345897,5.05492939129363,"N","N","N","RENO_N","2","USA"
"52",4071,322.72413197,5.77679787769979,"L","H","L","RENO_N","2","USA"
"53",4082,554.05710342,6.31726775620079,"H","H","H","RENO_N","2","USA"
"54",4091,122.55552697,4.80856420867156,"N","N","N","RENO_N","2","KK"
"55",4102,112.70050456,4.72473389805434,"N","N","N","RENO_N","2","KK"
"56",4111,94.245481423,4.54590288271731,"L","H","H","RENO_N","2","KK"
"57",4122,323.16498582,5.77816298482521,"H","H","L","RENO_N","2","KK"

I define a linear model in R using lm as

lm1 <- lm(logDIOX ~ 1 + OXYGEN + LOAD + PLANT + TIME + LAB, data=data)

and I want to interpret the estimated coefficients. However, when I extract the coefficients I get multiple 'NAs' (I'm assuming it's due to linear dependencies among the variables). How can I then interpret the coefficients? I only have one intercept that somehow represents one of the levels of each of the included factors in the model. Is it possible to get an estimate for each factor level?

> summary(lm1)

Call:
lm(formula = logDIOX ~ OXYGEN + LOAD + PLANT + TIME + LAB, data = data)

Residuals:
Min 1Q Median 3Q Max
-0.90821 -0.32102 -0.08993 0.27311 0.97758

Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.2983 0.2110 34.596 < 2e-16 ***
OXYGENL -0.4086 0.1669 -2.449 0.017953 *
OXYGENN -0.7567 0.1802 -4.199 0.000113 ***
LOADL -1.0645 0.1675 -6.357 6.58e-08 ***
LOADN NA NA NA NA
PLANTRENO_N -0.6636 0.2174 -3.052 0.003664 **
PLANTRENO_S -2.3452 0.1929 -12.158 < 2e-16 ***
TIME2 -0.9160 0.2065 -4.436 5.18e-05 ***
LABUSA 0.3829 0.1344 2.849 0.006392 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5058 on 49 degrees of freedom
Multiple R-squared: 0.8391, Adjusted R-squared: 0.8161
F-statistic: 36.5 on 7 and 49 DF, p-value: < 2.2e-16
For the NA part of your question you can have a look here: linear regression "NA" estimate just for last coefficient. Basically, one of your variables can be described as a linear combination of the rest.

For the factors and their levels, the way R works is to report the intercept for the first factor level and then the difference from that intercept for each remaining level. I think it will be clearer with a one-factor regression:

lm1 <- lm(logDIOX ~ 1 + OXYGEN, data = df)

> summary(lm1)
Call:
lm(formula = logDIOX ~ 1 + OXYGEN, data = df)
Residuals:
Min 1Q Median 3Q Max
-1.7803 -0.7833 -0.2027 0.6597 3.1229
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.5359 0.2726 20.305 <2e-16 ***
OXYGENL -0.4188 0.3909 -1.071 0.289
OXYGENN -0.1896 0.3807 -0.498 0.621
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.188 on 54 degrees of freedom
Multiple R-squared: 0.02085, Adjusted R-squared: -0.01542
F-statistic: 0.5749 on 2 and 54 DF, p-value: 0.5662

What this result says is that for OXYGEN = "H" the intercept is 5.5359, for OXYGEN = "L" the intercept is 5.5359 - 0.4188 = 5.1171, and for OXYGEN = "N" the intercept is 5.5359 - 0.1896 = 5.3463. Hope this helps.

UPDATE: Following your comment, I generalize to your model:
when OXYGEN = "H", LOAD = "H", PLANT = "KARA", TIME = 1, LAB = "KK", then logDIOX = 7.2983
when OXYGEN = "L", LOAD = "H", PLANT = "KARA", TIME = 1, LAB = "KK", then logDIOX = 7.2983 - 0.4086 = 6.8897
when OXYGEN = "L", LOAD = "L", PLANT = "KARA", TIME = 1, LAB = "KK", then logDIOX = 7.2983 - 0.4086 - 1.0645 = 5.8252
etc.
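To identify which term is the exact linear combination behind the NA, base R's alias() can be used; a sketch assuming the full model from the question:

# Sketch: alias() reports exact linear dependencies in a fitted lm.
lm1 <- lm(logDIOX ~ OXYGEN + LOAD + PLANT + TIME + LAB, data = data)
alias(lm1)
# The "Complete" block shows the aliased term: here LOADN is a linear
# combination of the other columns, because LOAD = "N" occurs exactly when
# OXYGEN = "N" in this data, so the LOADN and OXYGENN dummies coincide.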
GLM submodel testing in R: why are all statistics still the same after removing one continuous covariate?
I am doing submodel testing. The smaller model is nested in the bigger model; the bigger model has one extra continuous variable compared to the smaller model. I use the likelihood ratio test. The result is quite strange: both models have the same statistics, such as residual deviance and df. I also find the two models have the same estimated coefficients and std. errors. How is this possible?

summary(m2221)

Call:
glm(formula = clm ~ veh_age + veh_body + agecat + veh_value:veh_age + veh_value:area, family = "binomial", data = Car)

Deviance Residuals:
Min 1Q Median 3Q Max
-0.9245 -0.3939 -0.3683 -0.3437 2.7095

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.294118 0.382755 -3.381 0.000722 ***
veh_age2 0.051790 0.098463 0.526 0.598897
veh_age3 -0.166801 0.094789 -1.760 0.078457 .
veh_age4 -0.239862 0.096154 -2.495 0.012611 *
veh_bodyCONVT -2.184124 0.707884 -3.085 0.002033 **
veh_bodyCOUPE -0.850675 0.393625 -2.161 0.030685 *
veh_bodyHBACK -1.105087 0.374134 -2.954 0.003140 **
veh_bodyHDTOP -0.973472 0.383404 -2.539 0.011116 *
veh_bodyMCARA -0.649036 0.469407 -1.383 0.166765
veh_bodyMIBUS -1.295135 0.404691 -3.200 0.001373 **
veh_bodyPANVN -0.903032 0.395295 -2.284 0.022345 *
veh_bodyRDSTR -1.108488 0.826541 -1.341 0.179883
veh_bodySEDAN -1.097931 0.373578 -2.939 0.003293 **
veh_bodySTNWG -1.129122 0.373713 -3.021 0.002516 **
veh_bodyTRUCK -1.156099 0.384088 -3.010 0.002613 **
veh_bodyUTE -1.343958 0.377653 -3.559 0.000373 ***
agecat2 -0.198002 0.058382 -3.391 0.000695 ***
agecat3 -0.224492 0.056905 -3.945 7.98e-05 ***
agecat4 -0.253377 0.056774 -4.463 8.09e-06 ***
agecat5 -0.441906 0.063227 -6.989 2.76e-12 ***
agecat6 -0.447231 0.072292 -6.186 6.15e-10 ***
veh_age1:veh_value -0.000637 0.026387 -0.024 0.980740
veh_age2:veh_value 0.035386 0.031465 1.125 0.260753
veh_age3:veh_value 0.114485 0.036690 3.120 0.001806 **
veh_age4:veh_value 0.189866 0.057573 3.298 0.000974 ***
veh_value:areaB 0.044099 0.021550 2.046 0.040722 *
veh_value:areaC 0.021892 0.019189 1.141 0.253931
veh_value:areaD -0.023616 0.024939 -0.947 0.343658
veh_value:areaE -0.013506 0.026886 -0.502 0.615415
veh_value:areaF 0.057780 0.026602 2.172 0.029850 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 33767 on 67855 degrees of freedom
Residual deviance: 33592 on 67826 degrees of freedom
AIC: 33652
Number of Fisher Scoring iterations: 5

summary(m222)

Call:
glm(formula = clm ~ veh_value + veh_age + veh_body + agecat + veh_value:veh_age + veh_value:area, family = "binomial", data = Car)

Deviance Residuals:
Min 1Q Median 3Q Max
-0.9245 -0.3939 -0.3683 -0.3437 2.7095

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.294118 0.382755 -3.381 0.000722 ***
veh_value -0.000637 0.026387 -0.024 0.980740
veh_age2 0.051790 0.098463 0.526 0.598897
veh_age3 -0.166801 0.094789 -1.760 0.078457 .
veh_age4 -0.239862 0.096154 -2.495 0.012611 *
veh_bodyCONVT -2.184124 0.707884 -3.085 0.002033 **
veh_bodyCOUPE -0.850675 0.393625 -2.161 0.030685 *
veh_bodyHBACK -1.105087 0.374134 -2.954 0.003140 **
veh_bodyHDTOP -0.973472 0.383404 -2.539 0.011116 *
veh_bodyMCARA -0.649036 0.469407 -1.383 0.166765
veh_bodyMIBUS -1.295135 0.404691 -3.200 0.001373 **
veh_bodyPANVN -0.903032 0.395295 -2.284 0.022345 *
veh_bodyRDSTR -1.108488 0.826541 -1.341 0.179883
veh_bodySEDAN -1.097931 0.373578 -2.939 0.003293 **
veh_bodySTNWG -1.129122 0.373713 -3.021 0.002516 **
veh_bodyTRUCK -1.156099 0.384088 -3.010 0.002613 **
veh_bodyUTE -1.343958 0.377653 -3.559 0.000373 ***
agecat2 -0.198002 0.058382 -3.391 0.000695 ***
agecat3 -0.224492 0.056905 -3.945 7.98e-05 ***
agecat4 -0.253377 0.056774 -4.463 8.09e-06 ***
agecat5 -0.441906 0.063227 -6.989 2.76e-12 ***
agecat6 -0.447231 0.072292 -6.186 6.15e-10 ***
veh_value:veh_age2 0.036023 0.034997 1.029 0.303331
veh_value:veh_age3 0.115122 0.039476 2.916 0.003543 **
veh_value:veh_age4 0.190503 0.058691 3.246 0.001171 **
veh_value:areaB 0.044099 0.021550 2.046 0.040722 *
veh_value:areaC 0.021892 0.019189 1.141 0.253931
veh_value:areaD -0.023616 0.024939 -0.947 0.343658
veh_value:areaE -0.013506 0.026886 -0.502 0.615415
veh_value:areaF 0.057780 0.026602 2.172 0.029850 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 33767 on 67855 degrees of freedom
Residual deviance: 33592 on 67826 degrees of freedom
AIC: 33652

anova(m2221, m222, test = "LRT")

Analysis of Deviance Table
Model 1: clm ~ veh_age + veh_body + agecat + veh_value:veh_age + veh_value:area
Model 2: clm ~ veh_value + veh_age + veh_body + agecat + veh_value:veh_age + veh_value:area
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 67826 33592
2 67826 33592 0 0
You have specified the same model in two different ways. To show this, I'll first explain what is going on under the hood a bit, then I'll walk through the coefficients of your two models to show they are the same, and I'll end with some higher-level intuition.

Different formulas creating different interaction terms

First, the only difference between your two models is that the second model includes veh_value as a standalone predictor, whereas the first does not; veh_value is interacted with other predictors in both models. So let's consider a simple reproducible example and see what R does when we do this. I'll use model.matrix to feed in two different formulas and view the continuous and dummy variables that R creates as a result:

colnames(model.matrix(am ~ mpg + factor(cyl):mpg, mtcars))
#[1] "(Intercept)" "mpg" "mpg:factor(cyl)6" "mpg:factor(cyl)8"
colnames(model.matrix(am ~ factor(cyl):mpg, mtcars))
#[1] "(Intercept)" "factor(cyl)4:mpg" "factor(cyl)6:mpg" "factor(cyl)8:mpg"

Notice in the first call I included the continuous predictor mpg, whereas I did not include it in the second call (a simpler version of what you are doing). Now note that the second model matrix contains an extra interaction variable (factor(cyl)4:mpg) that the first does not. In other words, because we did not include mpg in the model directly, all levels of cyl get included in the interaction.

Your models are the same

Your models are essentially doing the same thing as the simple example above, and we can show that the coefficients are actually the same when added together. In your first model, all 4 levels of veh_age are included in the interaction with veh_value because veh_value is not in the model on its own:

veh_age1:veh_value -0.000637 0.026387 -0.024 0.980740
veh_age2:veh_value 0.035386 0.031465 1.125 0.260753
veh_age3:veh_value 0.114485 0.036690 3.120 0.001806 **
veh_age4:veh_value 0.189866 0.057573 3.298 0.000974 ***

In your second model, only 3 levels of veh_age are included in the interaction with veh_value because veh_value is in the model on its own:

veh_value:veh_age2 0.036023 0.034997 1.029 0.303331
veh_value:veh_age3 0.115122 0.039476 2.916 0.003543 **
veh_value:veh_age4 0.190503 0.058691 3.246 0.001171 **

Now, here is the critical piece for seeing that the models are actually the same. It's easiest to show by walking through the levels of veh_age.

First consider veh_age = 1. For both models, the coefficient on veh_value conditioned on veh_age = 1 is -0.000637:

# For the first model
veh_age1:veh_value -0.000637 0.026387 -0.024 0.980740
# For the second model
veh_value -0.000637 0.026387 -0.024 0.980740

Now consider veh_age = 2. For both models, the coefficient on veh_value conditioned on veh_age = 2 is 0.035386:

# For the first model
veh_age2:veh_value 0.035386 0.031465 1.125 0.260753
# For the second model - note the sum of the two below is 0.035386
veh_value -0.000637 0.026387 -0.024 0.980740
veh_value:veh_age2 0.036023 0.034997 1.029 0.303331

Intuition

When you include the interaction veh_value:veh_age, you are saying that you want the coefficient of veh_value, a continuous variable, to be conditioned on veh_age, a categorical variable. Including both the interaction veh_value:veh_age and veh_value as predictors says the same thing: you want the coefficient of veh_value, conditioned on the value of veh_age. The two parameterizations therefore describe identical models.
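A quick way to convince yourself, offered as a sketch on the same mtcars stand-in rather than the asker's Car data: because the two model matrices span the same column space, the two fits are identical.

# Sketch: the two parameterizations give identical fits (mtcars stand-in).
m_a <- glm(am ~ factor(cyl):mpg, family = binomial, data = mtcars)
m_b <- glm(am ~ mpg + factor(cyl):mpg, family = binomial, data = mtcars)

# Same fitted probabilities (up to IRLS convergence tolerance):
all.equal(fitted(m_a), fitted(m_b), tolerance = 1e-6)
# Same residual deviance:
c(deviance(m_a), deviance(m_b))
# LRT comparison: 0 df difference and no change in deviance, as in the question:
anova(m_a, m_b, test = "LRT")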
Extract data from partial least squares regression in R
I want to use partial least squares regression to find the most representative variables for predicting my data. Here is my code:

library(pls)
potion <- read.table("potion-insomnie.txt", header = T)
potionTrain <- potion[1:182, ]
potionTest <- potion[183:192, ]
potion1 <- plsr(Sommeil ~ Aubepine + Bave + Poudre + Pavot, data = potionTrain, validation = "LOO")

summary(lm(potion1)) gives me this answer:

Call:
lm(formula = potion1)
Residuals:
Min 1Q Median 3Q Max
-14.9475 -5.3961 0.0056 5.2321 20.5847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.63931 1.67955 22.410 < 2e-16 ***
Aubepine -0.28226 0.05195 -5.434 1.81e-07 ***
Bave -1.79894 0.26849 -6.700 2.68e-10 ***
Poudre 0.35420 0.72849 0.486 0.627
Pavot -0.47678 0.52027 -0.916 0.361
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.845 on 177 degrees of freedom
Multiple R-squared: 0.293, Adjusted R-squared: 0.277
F-statistic: 18.34 on 4 and 177 DF, p-value: 1.271e-12

I deduced that only the variables Aubepine and Bave are representative, so I redid the model with just these two variables:

potion1 <- plsr(Sommeil ~ Aubepine + Bave, data = potionTrain, validation = "LOO")

And I plot:

plot(potion1, ncomp = 2, asp = 1, line = TRUE)

This gives the plot of predicted vs. measured values. The problem is that I see the regression line on the plot, but I cannot find its equation and R². Is that possible? And is the first part the same as a multiple linear regression (ANOVA)?
pacman::p_load(pls)
data(mtcars)
potion <- mtcars
potionTrain <- potion[1:28, ]
potionTest <- potion[29:32, ]
potion1 <- plsr(mpg ~ cyl + disp + hp + drat, data = potionTrain, validation = "LOO")
coef(potion1)   # coefficients
scores(potion1) # scores

## R^2:
R2(potion1, estimate = "train")
## cross-validated R^2:
R2(potion1)
## Both:
R2(potion1, estimate = "all")
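On the follow-up about the plotted line's equation and R²: the line drawn by plot(..., line = TRUE) is the identity line y = x (a target line), not a fitted regression. To get an actual regression line through the predicted-vs-measured scatter, here is a sketch reusing the mtcars-based model above:

# Sketch: equation and R^2 of a line fitted through predicted vs. measured,
# reusing the mtcars-based potion1 model from above.
pred <- as.numeric(predict(potion1, ncomp = 2))  # training predictions, 2 components
meas <- potionTrain$mpg                          # measured response
line_fit <- lm(pred ~ meas)
coef(line_fit)               # intercept and slope of the fitted line
summary(line_fit)$r.squared  # R^2 of predicted vs. measured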