lm predict won't predict - r

I have two data frames: one is training data (pubs1), the other (pubs2) is test data. I can create a linear regression object, but I am unable to create a prediction. This is not my first time doing this, and I can't figure out what is going wrong.
> head(pubs1 )
id pred37 actual weight diff1 weightDiff1 pred1 pred2 pred3 pred4
1 11 128.3257 128.3990 6.43482732 -0.07333650 -0.4719076922 126.3149 126.1024 126.9057 126.2718
2 31 100.8822 100.9777 3.55520287 -0.09553741 -0.3396548680 100.7820 100.8589 100.9179 100.8903
3 33 100.7204 100.9630 7.46413438 -0.24262409 -1.8109787866 100.8576 100.8434 100.8521 100.8914
4 52 100.8564 100.9350 0.01299138 -0.07855588 -0.0010205495 100.8700 100.8925 100.8344 100.8714
5 56 100.8410 100.9160 0.01299138 -0.07502125 -0.0009746298 100.8695 100.8889 100.8775 100.8871
6 71 100.8889 100.8591 1.19266269 0.02979818 0.0355391800 100.8357 100.9205 100.8107 100.8316
> head(pubs2 )
id pred37 pred1 pred2 pred3 pred4
1 762679 98.32212 97.84181 98.0776 98.03222 97.90022
2 762680 115.79698 114.91411 115.1470 115.27129 115.45027
3 762681 104.56418 104.81372 104.8537 104.66239 104.55240
4 762682 106.65768 106.71011 106.6722 106.68662 106.60757
5 762683 102.15662 103.14207 103.2035 103.31190 103.40397
6 762684 101.96057 102.25939 102.1031 102.20659 102.04557
> lm1 <- lm(pubs1$actual ~ pubs1$pred37 + pubs1$pred1 + pubs1$pred2
+ + pubs1$pred3 + pubs1$pred4)
> summary(lm1)
Call:
lm(formula = pubs1$actual ~ pubs1$pred37 + pubs1$pred1 + pubs1$pred2 +
pubs1$pred3 + pubs1$pred4)
Residuals:
Min 1Q Median 3Q Max
-18.3415 -0.2309 0.0016 0.2236 17.8639
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.122478 0.027227 -4.498 6.85e-06 ***
pubs1$pred37 0.543270 0.005086 106.823 < 2e-16 ***
pubs1$pred1 0.063680 0.007151 8.905 < 2e-16 ***
pubs1$pred2 0.317768 0.010977 28.950 < 2e-16 ***
pubs1$pred3 0.024302 0.008321 2.921 0.00349 **
pubs1$pred4 0.052183 0.010879 4.797 1.61e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7298 on 99994 degrees of freedom
Multiple R-squared: 0.9932, Adjusted R-squared: 0.9932
F-statistic: 2.926e+06 on 5 and 99994 DF, p-value: < 2.2e-16
>
> pred2 <- predict(lm1, pubs2)
Warning message:
'newdata' had 50000 rows but variable(s) found have 100000 rows
> str(pubs1)
'data.frame': 100000 obs. of 10 variables:
$ id : num 11 31 33 52 56 71 85 87 92 95 ...
$ pred37 : num 128 101 101 101 101 ...
$ actual : num 128 101 101 101 101 ...
$ weight : num 6.435 3.555 7.464 0.013 0.013 ...
$ diff1 : num -0.0733 -0.0955 -0.2426 -0.0786 -0.075 ...
$ weightDiff1: num -0.471908 -0.339655 -1.810979 -0.001021 -0.000975 ...
$ pred1 : num 126 101 101 101 101 ...
$ pred2 : num 126 101 101 101 101 ...
$ pred3 : num 127 101 101 101 101 ...
$ pred4 : num 126 101 101 101 101 ...
> str(pubs2)
'data.frame': 50000 obs. of 6 variables:
$ id : num 762679 762680 762681 762682 762683 ...
$ pred37: num 98.3 115.8 104.6 106.7 102.2 ...
$ pred1 : num 97.8 114.9 104.8 106.7 103.1 ...
$ pred2 : num 98.1 115.1 104.9 106.7 103.2 ...
$ pred3 : num 98 115 105 107 103 ...
$ pred4 : num 97.9 115.5 104.6 106.6 103.4 ...
> colnames(pubs1)
[1] "id" "pred37" "actual" "weight" "diff1" "weightDiff1" "pred1" "pred2" "pred3" "pred4"
> colnames(pubs2)
[1] "id" "pred37" "pred1" "pred2" "pred3" "pred4"
Is there anything here that I'm missing?

Instead of,
lm1 <- lm(pubs1$actual ~ pubs1$pred37 + pubs1$pred1 + pubs1$pred2 +
  pubs1$pred3 + pubs1$pred4)
try,
lm1 <- lm(actual ~ pred37 + pred1 + pred2 +
  pred3 + pred4, data = pubs1)
Otherwise predict.lm will look for variables called pubs1$pred37 in your new data frame.
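To make the fix concrete, here is a minimal end-to-end sketch, assuming pubs1 and pubs2 are structured as in the str() output above:
# Fit with bare column names plus a data argument, so the model's
# variable names match columns that also exist in pubs2.
lm1 <- lm(actual ~ pred37 + pred1 + pred2 + pred3 + pred4, data = pubs1)
# predict() now looks up pred37, pred1, ... in pubs2 and returns
# one prediction per row of the new data.
pred2 <- predict(lm1, newdata = pubs2)
length(pred2)  # 50000, i.e. nrow(pubs2)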

Related

Plot interaction in paneldata

I have run this regression without any problems, and I get a coefficient for each interaction between econ_sit and educ_cat. econ_sit is a continuous variable, and educ_cat is a categorical variable from 1-6. How can I plot only the coefficients for the interaction terms in a good way?
model_int_f <- felm(satis_gov_sc ~ econ_sit*factor(educ_cat) + factor(benefit) + econ_neth + age + gender + pol_sof
| factor(wave) + factor(id) # Respondent and time fixed effects
| 0
| id, # Cluster standard errors on each respondent
data = full1)
summary(model_int_f)
Call:
felm(formula = satis_gov_sc ~ econ_sit * factor(educ_cat) + factor(benefit) + econ_neth + age + gender + pol_sof | factor(wave) + factor(id) | 0 | id, data = full1)
Residuals:
Min 1Q Median 3Q Max
-0.58468 -0.04464 0.00000 0.04728 0.78470
Coefficients:
Estimate Cluster s.e. t value Pr(>|t|)
econ_sit 0.1411692 0.0603100 2.341 0.01928 *
factor(educ_cat)2 0.0525580 0.0450045 1.168 0.24292
factor(educ_cat)3 0.1229048 0.0576735 2.131 0.03313 *
factor(educ_cat)4 0.1244146 0.0486455 2.558 0.01057 *
factor(educ_cat)5 0.1245556 0.0520246 2.394 0.01669 *
factor(educ_cat)6 0.1570034 0.0577240 2.720 0.00655 **
factor(benefit)2 -0.0030380 0.0119970 -0.253 0.80010
factor(benefit)3 0.0026064 0.0072590 0.359 0.71957
econ_neth 0.0642726 0.0131940 4.871 1.14e-06 ***
age 0.0177453 0.0152661 1.162 0.24512
gender 0.1088780 0.0076137 14.300 < 2e-16 ***
pol_sof 0.0006003 0.0094504 0.064 0.94935
econ_sit:factor(educ_cat)2 -0.0804820 0.0653488 -1.232 0.21816
econ_sit:factor(educ_cat)3 -0.0950652 0.0793818 -1.198 0.23114
econ_sit:factor(educ_cat)4 -0.1259772 0.0692072 -1.820 0.06877 .
econ_sit:factor(educ_cat)5 -0.1469749 0.0654870 -2.244 0.02485 *
econ_sit:factor(educ_cat)6 -0.1166243 0.0693709 -1.681 0.09279 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1161 on 11159 degrees of freedom
(23983 observations deleted due to missingness)
Multiple R-squared(full model): 0.8119 Adjusted R-squared: 0.717
Multiple R-squared(proj model): 0.00657 Adjusted R-squared: -0.4946
F-statistic(full model, *iid*):8.557 on 5630 and 11159 DF, p-value: < 2.2e-16
F-statistic(proj model): 55.38 on 17 and 5609 DF, p-value: < 2.2e-16
This is what my data looks like:
$ id : num 1 1 1 1 2 2 2 2 3 3 3 3
$ wave : chr "2013" "2015" "2016" "2017" ...
$ satis_gov_sc: num 0.5 0.4 0.4 0.6 0.6 0.5 0.6 0.7 0.7 0.7 ...
$ econ_sit : num NA NA 0.708 0.75 0.708 ...
$ educ_cat : num 5 5 5 5 5 6 6 6 6 6 ...
$ benefit : num 3 3 3 3 3 3 3 3 3 3 ...
$ econ_neth : num NA 0.6 0.6 0.7 0.7 0.5 0.4 0.6 0.8 0.7 ...
$ age : num 58 60 61 62 63 51 53 54 55 56 ...
$ gender : num 1 1 1 1 1 1 1 1 1 1 ...
$ pol_sof : num 1 1 1 0.8 1 1 1 1 0.8 1 ...
I've tried to run a simple plot_model with the following code:
plot_model(model_int_f, type = "pred", terms = c("econ_sit", "educ_cat"))
However, I only get an error, because felm objects are not compatible with type = "pred":
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "felm"
Any suggestions on how to plot the interaction terms?
Thanks in advance!
felm does not have a predict method, so it is not compatible with plot_model. You could use another fixed-effects library.
Here's an example using fixest. As you did not provide a sample of your data, I have used data(iris).
library(fixest); library(sjPlot)
res = feols(Sepal.Length ~ Sepal.Width + Petal.Length:Species | Species, cluster='Species', iris)
plot_model(res, type = "pred", terms = c("Petal.Length", "Species"))
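Transliterating the model from the question into fixest syntax might look like the sketch below; it is untested, since full1 is not available, but it mirrors the felm specification (wave and respondent fixed effects, standard errors clustered on id):
library(fixest); library(sjPlot)
# fixest treats variables after | as fixed effects (factors) automatically.
model_int_fx <- feols(satis_gov_sc ~ econ_sit * factor(educ_cat) +
                        factor(benefit) + econ_neth + age + gender + pol_sof |
                        wave + id,
                      cluster = ~id, data = full1)
plot_model(model_int_fx, type = "pred", terms = c("econ_sit", "educ_cat"))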

R loops and plotting regression model

I am learning to work with the apply family of functions and R loops.
I am working with a basic data table that has a y (outcome variable) column and an x (predictor variable) column, with 100 rows.
I have already used the lm() function to run a regression for the data.
Model.1<-lm(y~x, data = data)
Coefficients:
(Intercept) x
13.87 4.89
summary(Model.1)
Residuals:
Min 1Q Median 3Q Max
-4.1770 -1.7005 -0.0011 1.5625 6.4893
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.87039 0.95625 14.51 <2e-16 ***
x 4.88956 0.09339 52.35 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.195 on 98 degrees of freedom
Multiple R-squared: 0.9655, Adjusted R-squared: 0.9651
F-statistic: 2741 on 1 and 98 DF, p-value: < 2.2e-16
anova(Model.1)
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
x 1 13202 13202.5 2740.9 < 2.2e-16 ***
Residuals 98 472 4.8
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
attributes(Model.1)
$names
[1] "coefficients" "residuals" "effects" "rank" "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model"
$class
[1] "lm"
I now want to randomly sample 100 observations from my table of "y" and "x" values. This is the function I created to draw the random sample with replacement:
draw_100<-function(){
random_100=sample(data, 100, replace = TRUE)
}
Running random_100 gives me this output:
random_100
x x.1 y
1 8.112187 8.112187 53.69602
2 8.403589 8.403589 53.79438
3 9.541786 9.541786 58.48542
4 8.989281 8.989281 57.08601
5 6.965905 6.965905 46.62331
6 10.167800 10.167800 63.91487
7 10.683152 10.683152 65.84915
8 10.703093 10.703093 66.24738
9 8.337231 8.337231 51.87687
10 13.106177 13.106177 75.94588
11 10.726036 10.726036 65.19384
12 8.601641 8.601641 51.95095
13 10.338696 10.338696 62.92599
14 5.771682 5.771682 42.14190
15 6.161545 6.161545 46.36998
16 9.874543 9.874543 63.67148
17 8.540996 8.540996 58.85341
18 9.866002 9.866002 63.26319
19 8.622546 8.622546 57.05820
20 9.539929 9.539929 64.76654
21 9.498090 9.498090 61.38521
22 8.206142 8.206142 53.43508
23 8.245825 8.245825 58.29646
24 12.192542 12.192542 76.17440
25 6.955028 6.955028 49.73094
26 10.237639 10.237639 65.71210
27 10.927818 10.927818 67.18048
28 8.536011 8.536011 52.97402
29 9.574403 9.574403 60.53908
30 9.507752 9.507752 58.40020
31 5.838214 5.838214 41.93612
32 10.702791 10.702791 64.54986
33 6.704084 6.704084 46.88057
34 12.914798 12.914798 78.99422
35 16.607947 16.607947 96.60247
36 8.334241 8.334241 55.32263
37 12.287914 12.287914 71.46411
38 11.214098 11.214098 68.53254
39 7.722161 7.722161 50.81632
40 14.065276 14.065276 80.31033
41 10.402173 10.402173 64.36506
42 10.984727 10.984727 64.25032
43 8.491214 8.491214 58.36475
44 9.120864 9.120864 61.24240
45 10.251654 10.251654 60.56177
46 4.497277 4.497277 33.20243
47 11.384417 11.384417 68.61502
48 14.033980 14.033980 83.95417
49 9.909422 9.909422 62.27733
50 8.692219 8.692219 55.73567
51 12.864750 12.864750 79.08818
52 9.886267 9.886267 65.87693
53 10.457541 10.457541 61.36505
54 13.395296 13.395296 76.01832
55 10.343134 10.343134 60.84247
56 10.233329 10.233329 65.12074
57 10.756491 10.756491 70.05930
58 9.287774 9.287774 57.65071
59 11.704419 11.704419 72.65211
60 13.075236 13.075236 77.87956
61 12.066161 12.066161 69.34647
62 10.044714 10.044714 65.80648
63 13.331926 13.331926 80.72634
64 10.816099 10.816099 67.11356
65 10.377846 10.377846 63.14035
66 11.824583 11.824583 67.51041
67 7.114326 7.114326 51.80456
68 9.752344 9.752344 59.36107
69 10.869720 10.869720 67.97186
70 10.366262 10.366262 66.28012
71 10.656127 10.656127 67.86625
72 6.246312 6.246312 45.95457
73 8.003875 8.003875 49.29802
74 11.541176 11.541176 67.89918
75 11.799510 11.799510 73.15802
76 9.787112 9.787112 62.90187
77 13.187445 13.187445 80.26162
78 13.019787 13.019787 75.69156
79 3.854378 3.854378 35.82556
80 11.724234 11.724234 71.79034
81 6.953864 6.953864 45.72355
82 12.822231 12.822231 76.93698
83 9.285428 9.285428 59.61610
84 10.259240 10.259240 62.37958
85 10.613086 10.613086 63.91694
86 8.547155 8.547155 54.72216
87 15.069100 15.069100 86.23767
88 7.816772 7.816772 51.41676
89 13.854272 13.854272 88.10100
90 9.495968 9.495968 61.61393
91 9.881453 9.881453 65.24259
92 7.475875 7.475875 50.80777
93 13.286219 13.286219 81.15708
94 9.703433 9.703433 60.75532
95 5.415999 5.415999 42.55981
96 12.997555 12.997555 78.12987
97 11.893787 11.893787 68.97691
98 5.228217 5.228217 37.38417
99 8.392504 8.392504 54.81151
100 8.077527 8.077527 51.47045
I am hitting a roadblock: how do I use this new random sample of 100 values, fit a regression model to it, and extract the coefficient and standard error?
I thought I might need to use the sapply() function, but I truly believe I am overthinking this, because when I ran the regression model on the stored random sample it was identical to Model.1. Something is off.
Model.2<-lm(y~x, data = random_100)
Call:
lm(formula = y ~ x, data = random_100)
Coefficients:
(Intercept) x
13.87 4.89
The intercept and slope were identical to Model.1:
Call:
lm(formula = y ~ x, data = random_100)
Residuals:
Min 1Q Median 3Q Max
-4.1770 -1.7005 -0.0011 1.5625 6.4893
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.87039 0.95625 14.51 <2e-16 ***
x 4.88956 0.09339 52.35 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.195 on 98 degrees of freedom
Multiple R-squared: 0.9655, Adjusted R-squared: 0.9651
F-statistic: 2741 on 1 and 98 DF, p-value: < 2.2e-16
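A likely explanation for the identical fit, judging from the x, x.1, y columns above: sample() applied to a data frame samples its columns (with replacement), not its rows, so random_100 is essentially the original data with a duplicated x column. A minimal sketch of row-wise resampling, assuming the data frame is called data with columns x and y:
# Resample rows (not columns) with replacement.
draw_100 <- function() {
  data[sample(nrow(data), 100, replace = TRUE), ]
}
# Fit once and pull the estimates and standard errors.
fit <- lm(y ~ x, data = draw_100())
coef(summary(fit))[, c("Estimate", "Std. Error")]
# Repeat, e.g., 1000 times, keeping the slope row each time.
boot <- t(replicate(1000,
  coef(summary(lm(y ~ x, data = draw_100())))["x", 1:2]))
head(boot)  # columns: Estimate, Std. Error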

Survey package (survival analysis)

I am using the survey package to analyse a longitudinal database. The data look like this:
personid spellid long.w Dur rc sex 1 10 age
1 1 278 6.4702295519 0 0 47 20 16
2 1 203 2.8175129012 1 1 126 87 62
3 1 398 6.1956669321 0 0 180 6 37
4 1 139 7.2791061847 1 0 104 192 20
7 1 10 3.6617503439 1 0 18 24 25
8 1 3 2.265464682 0 1 168 136 40
9 1 134 6.3180994022 0 1 116 194 35
10 1 272 6.9167936912 0 0 39 119 45
11 1 296 5.354798213 1 1 193 161 62
After the variable SEX I have 10 bootstrap weights, then the variable Age.
The longitudinal weight is given in the column long.w
I am using the following code.
data.1 <- read.table("Panel.csv", sep = ",",header=T)
library(survey)
library(survival)
#### Unweigthed model
mod.1 <- summary(coxph(Surv(Dur, rc) ~ age + sex, data.1))
mod.1
coxph(formula = Surv(Dur, rc) ~ age + sex, data = data.1)
n= 36, number of events= 14
coef exp(coef) se(coef) z Pr(>|z|)
age -4.992e-06 1.000e+00 2.291e-02 0.000 1.000
sex 5.277e-01 1.695e+00 5.750e-01 0.918 0.359
exp(coef) exp(-coef) lower .95 upper .95
age 1.000 1.00 0.9561 1.046
sex 1.695 0.59 0.5492 5.232
Concordance= 0.651 (se = 0.095 )
Rsquare= 0.024 (max possible= 0.858 )
### --- Weights
weights <- data.1[,7:16]*data.1$long.w
panel <-svrepdesign(data=data.1,
weights=data.1[,3],
type="BRR",
repweights=weights,
combined.weights=TRUE
)
#### Weighted model
mod.1.w <- svycoxph(Surv(Dur,rc)~ age+ sex ,design=panel)
summary(mod.1.w)
Balanced Repeated Replicates with 10 replicates.
Call:
svycoxph.svyrep.design(formula = Surv(Dur, rc) ~ age + sex, design = panel)
n= 36, number of events= 14
coef exp(coef) se(coef) z Pr(>|z|)
age 0.0198 1.0200 0.0131 1.512 0.131
sex 1.0681 2.9098 0.2336 4.572 4.84e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.02 0.9804 0.9941 1.047
sex 2.91 0.3437 1.8407 4.600
Concordance= 0.75 (se = 0.677 )
Rsquare= NA (max possible= NA )
Likelihood ratio test= NA on 2 df, p=NA
Wald test = 28.69 on 2 df, p=5.875e-07
Score (logrank) test = NA on 2 df, p=NA
### ----
> panel.2 <-svrepdesign(data=data.1,
+ weights=data.1[,3],
+ type="BRR",
+ repweights=data.1[,7:16],
+ combined.weights=FALSE,
+ )
Warning message:
In svrepdesign.default(data = data.1, weights = data.1[, 3], type = "BRR", :
Data look like combined weights: mean replication weight is 101.291666666667 and mean sampling weight is 203.944444444444
mod.2.w <- svycoxph(Surv(Dur,rc)~ age+ sex ,design=panel.2)
> summary(mod.2.w)
Call: svrepdesign.default(data = data.1, weights = data.1[, 3], type = "BRR",
repweights = data.1[, 7:16], combined.weights = FALSE, )
Balanced Repeated Replicates with 10 replicates.
Call:
svycoxph.svyrep.design(formula = Surv(Dur, rc) ~ age + sex, design = panel.2)
n= 36, number of events= 14
coef exp(coef) se(coef) z Pr(>|z|)
age 0.0198 1.0200 0.0131 1.512 0.131
sex 1.0681 2.9098 0.2336 4.572 4.84e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.02 0.9804 0.9941 1.047
sex 2.91 0.3437 1.8407 4.600
Concordance= 0.75 (se = 0.677 )
Rsquare= NA (max possible= NA )
Likelihood ratio test= NA on 2 df, p=NA
Wald test = 28.69 on 2 df, p=5.875e-07
Score (logrank) test = NA on 2 df, p=NA
The sum of the longitudinal weights is 7,342. The total of events should be around 2,357, and the censored observations around 4,985, for a "population" of 7,342 individuals.
Do models mod.1.w and mod.2.w take the longitudinal weights into consideration? If they do, why does the summary report only n= 36, number of events= 14?
The design works well if I take other statistics: for example, the mean of Dur in data.1 is around 4.9 without considering the sampling design, and 5.31 when I use svymean(~Dur, panel.2).
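A quick way to check whether the weights are actually being applied is to look at weighted totals rather than the printed n; this is just a sketch, assuming rc is the 0/1 event indicator as in the data above:
# Weighted count of events under the replicate design
svytotal(~rc, panel.2)                    # should land near 2,357
# Total of the sampling weights
sum(weights(panel.2, type = "sampling"))  # should land near 7,342
For what it's worth, the n= 36, number of events= 14 printed by coxph-style summaries counts sample rows and sample events, not weighted totals, so a small n there does not by itself mean the weights were ignored.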

R - cox hazard model not including levels of a factor

I am fitting a cox model to some data that is structured as such:
str(test)
'data.frame': 147 obs. of 8 variables:
$ AGE : int 71 69 90 78 61 74 78 78 81 45 ...
$ Gender : Factor w/ 2 levels "F","M": 2 1 2 1 2 1 2 1 2 1 ...
$ RACE : Factor w/ 5 levels "","BLACK","HISPANIC",..: 5 2 5 5 5 5 5 5 5 1 ...
$ SIDE : Factor w/ 2 levels "L","R": 1 1 2 1 2 1 1 1 2 1 ...
$ LESION.INDICATION: Factor w/ 12 levels "CLAUDICATION",..: 1 11 4 11 9 1 1 11 11 11 ...
$ RUTH.CLASS : int 3 5 4 5 4 3 3 5 5 5 ...
$ LESION.TYPE : Factor w/ 3 levels "","OCCLUSION",..: 3 3 2 3 3 3 2 3 3 3 ...
$ Primary : int 1190 1032 166 689 219 840 1063 115 810 157 ...
the RUTH.CLASS variable is actually a factor, and I've changed it to one as follows:
> test$RUTH.CLASS <- as.factor(test$RUTH.CLASS)
> summary(test$RUTH.CLASS)
3 4 5 6
48 56 35 8
great.
after fitting the model
stent.surv <- Surv(test$Primary)
> cox.ruthclass <- coxph(stent.surv ~ RUTH.CLASS, data=test )
>
> summary(cox.ruthclass)
Call:
coxph(formula = stent.surv ~ RUTH.CLASS, data = test)
n= 147, number of events= 147
coef exp(coef) se(coef) z Pr(>|z|)
RUTH.CLASS4 0.1599 1.1734 0.1987 0.804 0.42111
RUTH.CLASS5 0.5848 1.7947 0.2263 2.585 0.00974 **
RUTH.CLASS6 0.3624 1.4368 0.3846 0.942 0.34599
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
RUTH.CLASS4 1.173 0.8522 0.7948 1.732
RUTH.CLASS5 1.795 0.5572 1.1518 2.796
RUTH.CLASS6 1.437 0.6960 0.6762 3.053
Concordance= 0.574 (se = 0.026 )
Rsquare= 0.045 (max possible= 1 )
Likelihood ratio test= 6.71 on 3 df, p=0.08156
Wald test = 7.09 on 3 df, p=0.06902
Score (logrank) test = 7.23 on 3 df, p=0.06478
> levels(test$RUTH.CLASS)
[1] "3" "4" "5" "6"
When I fit more variables in the model, similar things happen:
cox.fit <- coxph(stent.surv ~ RUTH.CLASS + LESION.INDICATION + LESION.TYPE, data=test )
>
> summary(cox.fit)
Call:
coxph(formula = stent.surv ~ RUTH.CLASS + LESION.INDICATION +
LESION.TYPE, data = test)
n= 147, number of events= 147
coef exp(coef) se(coef) z Pr(>|z|)
RUTH.CLASS4 -0.5854 0.5569 1.1852 -0.494 0.6214
RUTH.CLASS5 -0.1476 0.8627 1.0182 -0.145 0.8847
RUTH.CLASS6 -0.4509 0.6370 1.0998 -0.410 0.6818
LESION.INDICATIONEMBOLIC -0.4611 0.6306 1.5425 -0.299 0.7650
LESION.INDICATIONISCHEMIA 1.3794 3.9725 1.1541 1.195 0.2320
LESION.INDICATIONISCHEMIA/CLAUDICATION 0.2546 1.2899 1.0189 0.250 0.8027
LESION.INDICATIONREST PAIN 0.5302 1.6993 1.1853 0.447 0.6547
LESION.INDICATIONTISSUE LOSS 0.7793 2.1800 1.0254 0.760 0.4473
LESION.TYPEOCCLUSION -0.5886 0.5551 0.4360 -1.350 0.1770
LESION.TYPESTEN -0.7895 0.4541 0.4378 -1.803 0.0714 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
RUTH.CLASS4 0.5569 1.7956 0.05456 5.684
RUTH.CLASS5 0.8627 1.1591 0.11726 6.348
RUTH.CLASS6 0.6370 1.5698 0.07379 5.499
LESION.INDICATIONEMBOLIC 0.6306 1.5858 0.03067 12.964
LESION.INDICATIONISCHEMIA 3.9725 0.2517 0.41374 38.141
LESION.INDICATIONISCHEMIA/CLAUDICATION 1.2899 0.7752 0.17510 9.503
LESION.INDICATIONREST PAIN 1.6993 0.5885 0.16645 17.347
LESION.INDICATIONTISSUE LOSS 2.1800 0.4587 0.29216 16.266
LESION.TYPEOCCLUSION 0.5551 1.8015 0.23619 1.305
LESION.TYPESTEN 0.4541 2.2023 0.19250 1.071
Concordance= 0.619 (se = 0.028 )
Rsquare= 0.137 (max possible= 1 )
Likelihood ratio test= 21.6 on 10 df, p=0.01726
Wald test = 22.23 on 10 df, p=0.01398
Score (logrank) test = 23.46 on 10 df, p=0.009161
> levels(test$LESION.INDICATION)
[1] "CLAUDICATION" "EMBOLIC" "ISCHEMIA" "ISCHEMIA/CLAUDICATION"
[5] "REST PAIN" "TISSUE LOSS"
> levels(test$LESION.TYPE)
[1] "" "OCCLUSION" "STEN"
truncated output from model.matrix below:
> model.matrix(cox.fit)
RUTH.CLASS4 RUTH.CLASS5 RUTH.CLASS6 LESION.INDICATIONEMBOLIC LESION.INDICATIONISCHEMIA
1 0 0 0 0 0
2 0 1 0 0 0
We can see that the first level of each of these is being excluded from the model. Any input would be greatly appreciated. I noticed that on the LESION.TYPE variable the blank level "" is not being included, but that is not by design; it should be an NA or something similar.
I'm confused and could use some help with this. Thanks.
Factors in any model return coefficients relative to a base level (a contrast). Your contrasts default to treatment contrasts against a base level. There is no point in calculating a coefficient for the dropped level, because the model returns its prediction when all the other factor dummies are 0 (factor levels are complete and mutually exclusive for every observation). You can alter the default by changing the contrasts in your options.
For your coefficients to be versus an average of all factors:
options(contrasts=c(unordered="contr.sum", ordered="contr.poly"))
For your coefficients to be versus a specific treatment (what you have above and your default):
options(contrasts=c(unordered="contr.treatment", ordered="contr.poly"))
As you can see, there are two types of factors in R: unordered (categorical, e.g. red, green, blue) and ordered (e.g. strongly disagree, disagree, no opinion, agree, strongly agree).
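A small self-contained illustration of the base-level behaviour, using a toy factor rather than the questioner's data:
f <- factor(c("3", "4", "5", "6"))
# Treatment contrasts (the default): level "3" is absorbed into the
# intercept, so only levels 4-6 get columns, as with RUTH.CLASS above.
model.matrix(~ f)
# Sum contrasts: coefficients become deviations from the grand mean.
model.matrix(~ f, contrasts.arg = list(f = "contr.sum"))
# To keep treatment contrasts but choose a different reference level:
f2 <- relevel(f, ref = "5")
model.matrix(~ f2)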

Splitting a large data frame into smaller segments

I have the following data frame, and I want to break it up into 10 different data frames: the initial 100-row data frame should become 10 data frames of 10 rows each. I could do the following and get the desired results.
df = data.frame(one=c(rnorm(100)), two=c(rnorm(100)), three=c(rnorm(100)))
df1 = df[1:10,]
df2 = df[11:20,]
df3 = df[21:30,]
df4 = df[31:40,]
df5 = df[41:50,]
...
Of course, this isn't an elegant way to perform this task when the initial data frame is larger or when it doesn't divide evenly into segments.
So given the above, let's say we have the following data frame.
df = data.frame(one=c(rnorm(1123)), two=c(rnorm(1123)), three=c(rnorm(1123)))
Now I want to split it into new data frames of 200 rows each, with a final data frame holding the remaining rows. What would be a more elegant (i.e., quicker) way to perform this task?
> str(split(df, (as.numeric(rownames(df))-1) %/% 200))
List of 6
$ 0:'data.frame': 200 obs. of 3 variables:
..$ one : num [1:200] -1.592 1.664 -1.231 0.269 0.912 ...
..$ two : num [1:200] 0.639 -0.525 0.642 1.347 1.142 ...
..$ three: num [1:200] -0.45 -0.877 0.588 1.188 -1.977 ...
$ 1:'data.frame': 200 obs. of 3 variables:
..$ one : num [1:200] -0.0017 1.9534 0.0155 -0.7732 -1.1752 ...
..$ two : num [1:200] -0.422 0.869 0.45 -0.111 0.073 ...
..$ three: num [1:200] -0.2809 1.31908 0.26695 0.00594 -0.25583 ...
$ 2:'data.frame': 200 obs. of 3 variables:
..$ one : num [1:200] -1.578 0.433 0.277 1.297 0.838 ...
..$ two : num [1:200] 0.913 0.378 0.35 -0.241 0.783 ...
..$ three: num [1:200] -0.8402 -0.2708 -0.0124 -0.4537 0.4651 ...
$ 3:'data.frame': 200 obs. of 3 variables:
..$ one : num [1:200] 1.432 1.657 -0.72 -1.691 0.596 ...
..$ two : num [1:200] 0.243 -0.159 -2.163 -1.183 0.632 ...
..$ three: num [1:200] 0.359 0.476 1.485 0.39 -1.412 ...
$ 4:'data.frame': 200 obs. of 3 variables:
..$ one : num [1:200] -1.43 -0.345 -1.206 -0.925 -0.551 ...
..$ two : num [1:200] -1.343 1.322 0.208 0.444 -0.861 ...
..$ three: num [1:200] 0.00807 -0.20209 -0.56865 1.06983 -0.29673 ...
$ 5:'data.frame': 123 obs. of 3 variables:
..$ one : num [1:123] -1.269 1.555 -0.19 1.434 -0.889 ...
..$ two : num [1:123] 0.558 0.0445 -0.0639 -1.934 -0.8152 ...
..$ three: num [1:123] -0.0821 0.6745 0.6095 1.387 -0.382 ...
If some code might have changed the rownames it would be safer to use:
split(df, (seq(nrow(df))-1) %/% 200)
Another option is the chunk() helper from the ff package:
require(ff)
df <- data.frame(one=c(rnorm(1123)), two=c(rnorm(1123)), three=c(rnorm(1123)))
for(i in chunk(from = 1, to = nrow(df), by = 200)){
print(df[min(i):max(i), ])
}
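The same chunk()-based approach can collect the pieces into a list instead of printing them; a short sketch, still assuming the ff package is loaded:
pieces <- lapply(chunk(from = 1, to = nrow(df), by = 200),
                 function(i) df[min(i):max(i), ])
sapply(pieces, nrow)  # 200 200 200 200 200 123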
If you can generate a vector that defines the groups, you can split anything:
f <- rep(seq_len(ceiling(1123 / 200)),each = 200,length.out = 1123)
> df1 <- split(df,f = f)
> lapply(df1,dim)
$`1`
[1] 200 3
$`2`
[1] 200 3
$`3`
[1] 200 3
$`4`
[1] 200 3
$`5`
[1] 200 3
$`6`
[1] 123 3
This chops df into one-million-row groups and appends a million rows at a time to the df table in SQL:
batchsize = 1000000 # vary to your liking
# cycles through data by batchsize
for (i in 1:ceiling(nrow(df)/batchsize))
{
print(i) # just to show the progress
# below shows how to cycle through data
batch <- df[(((i-1)*batchsize)+1):(batchsize*i), , drop = FALSE] # drop = FALSE keeps it from being converted to a vector
# if below not done then the last batch has Nulls above the number of rows of actual data
batch <- batch[!is.na(batch$ID),] # ID is a variable I presume is in every row
# in this case the table already existed; for a new table, use overwrite = TRUE
(dbWriteTable(con, "df", batch, append = TRUE,row.names = FALSE))
}
Something like this...?
b <- seq(10, 100, 10)
lapply(seq_along(b), function(i) df[(b-9)[i]:b[i], ])
[[1]]
one two three
1 -2.4157992 -0.6232517 1.0531358
2 0.6769020 0.3908089 -1.9543895
3 0.9804026 -2.5167334 0.7120919
4 -1.2200089 0.5108479 0.5599177
5 0.4448290 -1.2885275 -0.7665413
6 0.8431848 -0.9359947 0.1068137
7 -1.8168134 -0.2418887 1.1176077
8 1.4475904 -0.8010347 2.3716663
9 0.7264027 -0.3573623 -1.1956806
10 0.2736119 -1.5553148 0.2691115
[[2]]
one two three
11 -0.3273536 -1.92475496 -0.08031696
12 1.5558892 -1.20158371 0.09104958
13 1.9202047 -0.13418754 0.32571632
14 -0.0515136 -2.15669216 0.23099397
15 0.1909732 -0.30802742 -1.28651457
16 0.8545580 -0.18238266 1.57093844
17 0.4903039 0.02895376 -0.47678196
18 0.5125400 0.97052082 -0.70541908
19 -1.9324370 0.22093545 -0.34436105
20 -0.5763433 0.10442551 -2.05597985
[[3]]
one two three
21 0.7168771 -1.22902943 -0.18728871
22 1.2785641 0.14686576 -1.74738091
23 -1.1856173 0.43829361 0.41269975
24 0.0220843 1.57428924 -0.80163986
25 -1.0012255 0.05520813 0.50871603
26 -0.1842323 -1.61195239 0.04843504
27 0.2328831 -0.38432225 0.95650710
28 0.8821687 -1.32456215 -1.33367967
29 -0.8902177 0.86414661 -1.39629358
30 -0.6586293 -2.27325919 0.27367902
[[4]]
one two three
31 1.3810437 -1.0178835 0.07779591
32 0.6102753 0.3538498 1.92316801
33 -1.5034439 0.7926925 2.21706284
34 0.8251638 0.3992922 0.56781321
35 -1.0832114 0.9878058 -0.16820827
36 -0.4132375 -0.9214491 1.06681472
37 -0.6787631 1.3497766 2.18327887
38 -3.0082585 -1.3047024 -0.04913214
39 -0.3433300 1.1008951 -2.02065141
40 0.6009334 1.2334421 0.15623298
[[5]]
one two three
41 -1.8608051 -0.08589437 0.02370983
42 -0.1829953 0.91139017 -0.01356590
43 1.1146731 0.42384993 -0.68717391
44 1.9039900 -1.70218225 0.06100297
45 -0.4851939 1.38712015 -1.30613414
46 -0.4661664 0.23504099 -0.29335162
47 0.5807227 -0.87821946 -0.14816121
48 -2.0168910 -0.47657382 0.90503226
49 2.5056404 0.27574224 0.10326333
50 0.2238735 0.34441325 -0.17186115
[[6]]
one two three
51 1.51613140 -2.5630782 -0.6720399
52 0.03859537 -2.6688365 0.3395574
53 -0.08695292 -0.5114117 -0.1378789
54 -0.51878363 -0.5401962 0.3946324
55 -2.20482710 0.1716744 0.1786546
56 -0.28133749 -0.4497112 0.5936497
57 -2.38269088 -0.4625695 1.0048914
58 0.37865952 0.5055141 0.3337986
59 0.09329172 0.1560469 0.2835735
60 -1.10818863 -0.2618910 0.3650042
[[7]]
one two three
61 -1.2507208 -1.5050083 -0.63871084
62 0.1379394 0.7996674 -1.80196762
63 0.1582008 -0.3208973 0.40863693
64 -0.6224605 0.1416938 -0.47174711
65 1.1556149 -1.4083576 -1.12619693
66 -0.6956604 0.7994991 1.16073748
67 0.6576676 1.4391007 0.04134445
68 1.4610598 -1.0066840 -1.82981058
69 1.1951788 -0.4005535 1.57256648
70 -0.1994519 0.2711574 -1.04364396
[[8]]
one two three
71 1.23897065 0.4473611 -0.35452535
72 0.89015916 2.3747385 0.87840852
73 -1.17339703 0.7433220 0.40232381
74 -0.24568490 -0.4776862 1.24082294
75 -0.47187443 -0.3271824 0.38542703
76 -2.20899136 -1.1131712 -0.33663075
77 -0.05968035 -0.6023045 -0.23747388
78 1.19687199 -1.3390960 -1.37884241
79 -1.29310506 0.3554548 -0.05936756
80 -0.17470891 1.6198307 0.69170207
[[9]]
one two three
81 -1.06792315 0.04801998 0.08166394
82 0.84152560 -0.45793907 0.27867619
83 0.07619456 -1.21633682 -2.51290495
84 0.55895466 -1.01844178 -0.41887672
85 0.33825508 -1.15061381 0.66206732
86 -0.36041720 0.32808609 -1.83390913
87 -0.31595401 -0.87081019 0.45369366
88 0.92331087 1.22055348 -1.91048757
89 1.30491142 1.22582353 -1.32244004
90 -0.32906839 1.76467263 1.84479228
[[10]]
one two three
91 2.80656707 -0.9708417 0.25467304
92 0.35770119 -0.6132523 -1.11467041
93 0.09598908 -0.5710063 -0.96412216
94 -1.08728715 0.3019572 -0.04422049
95 0.14317455 0.1452287 -0.46133199
96 -1.00218917 -0.1360570 0.88864256
97 -0.25316855 0.6341925 -1.37571664
98 0.36375921 1.2244921 0.12718650
99 0.13345555 0.5330221 -0.29444683
100 2.28548261 -2.0413222 -0.53209956
