MICE package in R: passive imputation

I want to handle missing values with multiple imputation and then analyse the data with a linear mixed model.
I am stuck on passive imputation for "BMI" (body mass index) and "BMI category". "BMI" is calculated from height and weight and then categorized into "BMI category".
How should I impute "BMI category"?
The data look like this:
sub_eu_surf[1:5, 3:12]
age gender smoking exercise education sbp dbp height weight bmi
1 41 1 1 2 18 120 80 185 107 31.26370
2 46 1 3 2 18 130 70 182 102 30.79338
3 46 1 3 2 18 130 70 182 102 30.79338
4 47 1 1 2 14 130 80 178 78 24.61810
5 47 1 1 1 14 150 80 175 85 27.75510
Since "BMI category" is not a predictor in my imputation model, I decided to create it after imputation. The details are below:
1. Define the imputation methods and the predictor matrix
ini<-mice(sub_eu_surf, maxit=0)
meth<-ini$meth
meth["bmi"]<-"~I(weight/(height/100)^2)"
pred <- ini$predictorMatrix
pred[c("pm25_global", "pm25_eu", "pm10_eu", "no2_eu"), ]<-0
pred[,c("bmi", "hba1c", "pm25_eu", "pm10_eu")]<-0
pred[,"tc"]<-0
pred[c("smoking", "exercise", "hdl", "glucose"), "tc"]<-1
pred[c("smoking", "exercise", "hdl", "glucose"), "ldl"]<-0
vis <- ini$vis
imp_eu<-mice(sub_eu_surf, meth=meth, pred=pred, vis=vis, seed=200, print=F, m=5, maxit=5)
long_eu<- complete(imp_eu, "long", include=TRUE)
long_eu$bmi_category<-cut(as.numeric(long_eu$bmi), breaks=c(0, 18.5, 25, 30, 72))
complete_eu<-as.mids(long_eu)
But I received an error when analyzing my data:
test1<-with(imp_eu, lme(sbp~pm25_global+gender+age+education+bmi_category, random=~1|centre))
Error in eval(expr, envir, enclos) : object 'bmi_category' not found
Why does this happen?

You are running your analyses on the original mids object imp_eu, not on the modified complete_eu. Try:
test1<-with(complete_eu, lme(sbp~pm25_global+gender+age+education+bmi_category, random=~1|centre))
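The same workflow can be sketched on a small public dataset. This sketch assumes the mice package is installed and uses its built-in nhanes data in place of sub_eu_surf; the variable name bmi_cat, the cut points, and the use of lm() instead of lme() are illustrative choices, not taken from the question. It adds a derived factor in long format, converts back with as.mids(), and fits the model on the new mids object rather than the original one.

```r
library(mice)

# Impute the small built-in nhanes data (stand-in for sub_eu_surf)
imp <- mice(nhanes, m = 3, maxit = 2, seed = 200, printFlag = FALSE)

# Stack the original data (.imp == 0) and the imputed data sets
long <- complete(imp, action = "long", include = TRUE)

# Derive the category after imputation (illustrative cut points)
long$bmi_cat <- cut(long$bmi, breaks = c(0, 25, 30, Inf))

# Convert back to a mids object; this one contains bmi_cat
imp2 <- as.mids(long)

# Fit on the modified object, not on imp
fit <- with(imp2, lm(chl ~ age + bmi_cat))
```

Running the same with() call on imp instead of imp2 would fail in the way described in the question, because imp was created before bmi_cat existed.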

Related

Getting Warning: " 'newdata' had 5 rows but variables found have 750 rows" even though column names are the same

I have a dataset called PimaDiabetes, which can be pulled from here.
PimaDiabetes <- read.csv("PimaDiabetes.csv")
From which I derived a logistic model:
chosen_glm = glm(PimaDiabetes$Outcome ~ PimaDiabetes$Pregnancies+PimaDiabetes$Glucose
+PimaDiabetes$SkinThickness+PimaDiabetes$BMI
+PimaDiabetes$DiabetesPedigree, data = PimaDiabetes)
However, whenever I try to run it against a new dataset called ToPredict:
Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin  BMI  DiabetesPedigree  Age
4            136      70             0              0        20   31.2              22
1            121      78             39             74       20   39                28
3            108      62             24             0        20   26                25
0            181      88             44             510      20   43.3              26
8            154      78             32             0        20   32.4              45
I get the following warning:
>predict(chosen_glm,ToPredict,type="response")
Warning message:
'newdata' had 5 rows but variables found have 750 rows
And I'm not sure what's wrong.
The column names of PimaDiabetes
colnames(PimaDiabetes)
[1] "Pregnancies" "Glucose" "BloodPressure" "SkinThickness" "Insulin" "BMI" "DiabetesPedigree" "Age"
[9] "Outcome"
are the same as those of ToPredict, apart from Outcome:
colnames(ToPredict)
[1] "Pregnancies" "Glucose" "BloodPressure" "SkinThickness" "Insulin" "BMI" "DiabetesPedigree" "Age"
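The formula in the question refers to the columns as PimaDiabetes$Pregnancies and so on, so predict() cannot substitute the columns of newdata and falls back to the 750 training rows. Use bare column names together with data = instead. Note also that glm() fits an ordinary Gaussian linear model unless a family is given, so family = binomial is added below to get a logistic model. Try this:
PimaDiabetes = read.csv("diabetes.csv")
chosen_glm = glm(
Outcome ~ Pregnancies + Glucose + SkinThickness + BMI + DiabetesPedigreeFunction,
family = binomial,
data = PimaDiabetes
)
ToPredict = PimaDiabetes[sample(nrow(PimaDiabetes),5),]
predict(chosen_glm,ToPredict,type="response")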
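The difference between the two formula styles can be reproduced with a small simulated example. The data frames train and new and the columns x and y below are made up for illustration and are not the Pima data. When the formula contains train$x, predict() looks the variable up by that literal expression and finds the training column, so newdata is effectively ignored.

```r
set.seed(1)

# Simulated training data (hypothetical, for illustration only)
train <- data.frame(x = rnorm(50))
train$y <- rbinom(50, 1, plogis(0.5 * train$x))

# Three new observations to predict
new <- data.frame(x = c(-1, 0, 1))

# Formula written with train$...: newdata cannot replace these variables
bad <- glm(train$y ~ train$x, family = binomial)
p_bad <- suppressWarnings(predict(bad, new, type = "response"))
length(p_bad)   # one value per training row

# Formula written with bare names plus data =: newdata is used
good <- glm(y ~ x, family = binomial, data = train)
p_good <- predict(good, new, type = "response")
length(p_good)  # one value per row of new
```

Without suppressWarnings(), the first predict() call emits the same kind of warning as in the question ('newdata' had 3 rows but variables found have 50 rows).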

Obtaining Predictions for New Observations (R Programming Language)

I am working with the R programming language. I created a decision tree for this dataset in R (to predict whether the "diabetes" column is either "pos" or "neg"):
#load libraries
library(pdp)
library(C50)
#load data
data(pima)
#remove na's
new_data = na.omit(pima)
#format data
new_data$age = as.factor(ifelse(new_data$age >30, "1", "0"))
new_data$pregnant = as.factor(ifelse(new_data$pregnant >2, "1", "0"))
#run model
tree_mod <- C5.0(x = new_data[, 1:8], rules = TRUE, y = new_data$diabetes)
Here is my question: I am trying to obtain a column of "predictions" made by the model for new observations. I then want to take this column and append it to the original dataset.
Using the following link, https://cran.r-project.org/web/packages/C50/vignettes/C5.0.html, I used the "predict" function:
#pretend this is new data
new = new_data[1:10,]
#run predictions
pred = predict(tree_mod, newdata = new[, 1:8])
But this produces the following error:
Error in x[j] : invalid subscript type 'closure'
Can anyone please show me how to do this?
I am trying to create something like this ("prediction_made_by_model"):
pregnant glucose pressure triceps insulin mass pedigree age diabetes prediction_made_by_model
4 0 89 66 23 94 28.1 0.167 0 neg pos
5 0 137 40 35 168 43.1 2.288 1 pos neg
7 1 78 50 32 88 31.0 0.248 0 pos neg
9 0 197 70 45 543 30.5 0.158 1 pos pos
14 0 189 60 23 846 30.1 0.398 1 pos neg
15 1 166 72 19 175 25.8 0.587 1 pos pos
Thanks!
I was able to figure it out. For some reason, this was not working before:
pred = predict(tree_mod, newdata = new[, 1:8])
new$prediction_made_by_model = pred

A concise way to extract some elements of a "survfit" object into a data frame

I load a data set from the survival library, and generate a survfit object:
library(survival)
data(lung)
lung$SurvObj <- with(lung, Surv(time, status == 2))
fit <- survfit(SurvObj ~ 1, data = lung, conf.type = "log-log")
This object is a list:
> str(fit)
List of 13
$ n : int 228
$ time : int [1:186] 5 11 12 13 15 26 30 31 53 54 ...
$ n.risk : num [1:186] 228 227 224 223 221 220 219 218 217 215 ...
$ n.event : num [1:186] 1 3 1 2 1 1 1 1 2 1 ...
...
Now I specify some members (all same length) that I want to turn into a data frame:
members <- c("time", "n.risk", "n.event")
I'm looking for a concise way to make a data frame with the three list members as columns, with the columns named time, n.risk, n.event (not fit$time, fit$n.risk, fit$n.event)
Thus the resulting data frame should look like this:
time n.risk n.event
[1,] 5 228 1
[2,] 11 227 3
[3,] 12 224 1
...
This is OK
data.frame(unclass(fit)[members])
Another (more canonical) way is
with(fit, data.frame(time, n.risk, n.event))
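As a quick check that the two approaches above give the same result, here is a minimal sketch using the lung data that ships with the survival package. The object names df_list and df_with are illustrative. The unclass() call matters because `[` on a survfit object dispatches to a survfit-specific subsetting method (used to select curves) rather than plain list subsetting.

```r
library(survival)

# Same fit as in the question, with the Surv() call written inline
fit <- survfit(Surv(time, status == 2) ~ 1, data = lung,
               conf.type = "log-log")

members <- c("time", "n.risk", "n.event")

# Approach 1: drop the class, subset the underlying list by name
df_list <- data.frame(unclass(fit)[members])

# Approach 2: evaluate the column names inside the fit object
df_with <- with(fit, data.frame(time, n.risk, n.event))

head(df_list, 3)
```

The first form is convenient when the member names are stored in a character vector, as in the question; the second requires typing the names out.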
The broom package contains functions to tidy up the results of regression models and present them in an object of class data.frame. For those unfamiliar with the tidy philosophy, please see Tidy Data [1].
library(broom)
#create tidy dataframe and subset by the columns saved in members
df <- tidy(fit)[,members]
head(df)
# time n.risk n.event
#1 5 228 1
#2 11 227 3
#3 12 224 1
#4 13 223 2
#5 15 221 1
#6 26 220 1
[1] Wickham, Hadley. "Tidy Data." Journal of Statistical Software [Online], 59.10 (2014): 1-23. Web. 16 Jun. 2017.
I used cbind to bind the data frames, then used names to rename the columns:
time=as.data.frame(fit$time)
n.risk=as.data.frame(fit$n.risk)
n.event=as.data.frame(fit$n.event)
members2=cbind(time,n.risk,n.event)
names(members2)=c("time","n.risk","n.event")
head(members2)
time n.risk n.event
1 5 228 1
2 11 227 3
3 12 224 1
4 13 223 2
5 15 221 1
6 26 220 1
library(survival)
data(lung)
lung$SurvObj <- with(lung, Surv(time, status == 2))
fit <- survfit(SurvObj ~ 1, data = lung, conf.type = "log-log")
str(fit)
members<-data.frame(time=fit$time,n.risk=fit$n.risk,n.event=fit$n.event)
members

Plotting estimated probabilities from binary logistic regression when one or more predictor variables are held constant

I am a biology grad student who has been spinning my wheels for about thirty hours on the following issue. In summary, I would like to plot estimated probabilities from a binary logistic regression model (glm) that I fitted. I have already gone through model selection, validation, etc. and am now simply trying to produce figures. I had no problem plotting probability curves for the model I selected, but what I am really interested in is a figure that shows the probability of a binary outcome across one predictor variable when the other predictor variable is held constant.
I cannot figure out how to assign this constant value to only one of the predictor variables and plot the probability for the other variable. Ultimately I would like to produce figures similar to the crude example I attached (desired output). I admit I am a novice in R and I certainly appreciate folks' time, but I have exhausted online searches and have yet to find the approach or a solution adequately explained. This is the closest information related to my question, but I found the explanation vague, and it did not provide an example of assigning one predictor a constant value while plotting the probability for the other predictor. https://stat.ethz.ch/pipermail/r-help/2010-September/253899.html
Below I provide a simulated dataset and my progress. Thank you very much for your expertise; I believe a solution and code example would be helpful for other ecologists who use logistic regression.
The simulated dataset shows survival outcomes over the winter for lizards. The predictor variables are "mass" and "depth".
x<-read.csv('logreg_example_data.csv',header = T)
x
survival mass depth
1 0 4.294456 262
2 0 8.359857 261
3 0 10.740580 257
4 0 10.740580 257
5 0 6.384678 257
6 0 6.384678 257
7 0 11.596380 270
8 0 11.596380 270
9 0 4.294456 262
10 0 4.294456 262
11 0 8.359857 261
12 0 8.359857 261
13 0 8.359857 261
14 0 7.920406 258
15 0 7.920406 258
16 0 7.920406 261
17 0 10.740580 257
18 0 10.740580 258
19 0 38.824960 262
20 0 9.916840 239
21 1 6.384678 257
22 1 6.384678 257
23 1 11.596380 270
24 1 11.596380 270
25 1 11.596380 270
26 1 23.709520 288
27 1 23.709520 288
28 1 23.709520 288
29 1 38.568970 262
30 1 38.568970 262
31 1 6.581013 295
32 1 6.581013 298
33 1 0.766564 269
34 1 5.440803 262
35 1 5.440803 262
36 1 19.534710 252
37 1 19.534710 259
38 1 8.359857 263
39 1 10.740580 257
40 1 38.824960 264
41 1 38.824960 264
42 1 41.556970 239
#Dataset name is x
# time to run the glm model
model1<-glm(formula=survival ~ mass + depth, family = "binomial", data=x)
model1
summary(model1)
# OK, now here is how I predict the probability of a lizard "Bob" surviving the winter with a mass of 32.949 grams and a burrow depth of 264 mm
newdata<-data.frame(mass = 32.949, depth = 264)
predict(model1, newdata, type = "response")
# the lizard "Bob" has an 87.3% chance of surviving the winter
# Now let's assume the glm model is robust and the lizard is endangered.
# From all my research I know the average burrow depth is 263.9 mm at a national park.
# Let's say I am also interested in survival probabilities at burrow depths of 200 and 100 mm, respectively.
# How do I use the glm model fitted above to generate a plot
# showing the probability of lizards surviving at the burrow depths stated above
# across a range of mass values from 0.0 to 100.0 grams?
# I know I need to use the plot and predict functions, but I cannot figure out how to tell R that I
# want to use the glm model to predict "survival" based on "mass" when the other predictor "depth" is held at constant values of biological relevance.
# I would also like to add dashed lines for the 95% CI.
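One possible approach to the question above, sketched on made-up data: build a data frame in which mass varies over a grid and depth is fixed at a single value, predict on the link (log-odds) scale with standard errors, and back-transform to probabilities. The data frame sim and its coefficients are simulated for illustration and are not the lizard data; the names grid, fit, lwr, and upr are also illustrative. The interval is a Wald-type approximation (estimate plus or minus 1.96 standard errors on the link scale), which is one common choice and not the only one.

```r
set.seed(1)

# Simulated stand-in for the lizard data (hypothetical values)
n <- 200
sim <- data.frame(mass = runif(n, 1, 40), depth = runif(n, 240, 300))
sim$survival <- rbinom(n, 1, plogis(-8 + 0.05 * sim$mass + 0.03 * sim$depth))

model <- glm(survival ~ mass + depth, family = binomial, data = sim)

# Vary mass over a grid while holding depth constant at 263.9 mm
grid <- data.frame(mass = seq(0, 100, length.out = 101), depth = 263.9)

# Predict on the link scale so the interval stays within [0, 1]
# after back-transformation with plogis()
pr <- predict(model, newdata = grid, type = "link", se.fit = TRUE)
grid$fit <- plogis(pr$fit)
grid$lwr <- plogis(pr$fit - 1.96 * pr$se.fit)
grid$upr <- plogis(pr$fit + 1.96 * pr$se.fit)

# Solid line: estimated probability; dashed lines: approximate 95% CI
plot(grid$mass, grid$fit, type = "l", ylim = c(0, 1),
     xlab = "Mass (g)", ylab = "Estimated probability of survival")
lines(grid$mass, grid$lwr, lty = 2)
lines(grid$mass, grid$upr, lty = 2)
```

Repeating the grid and predict() steps with depth set to 200 or 100 and adding the results with lines() would give one curve per depth. Predictions for mass or depth values outside the observed range are extrapolations and should be read with caution.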

Confusion matrix with a four-level class in R

I am trying to get a confusion matrix for a multi-level factor variable (Rating).
My data looks like this:
> head(credit)
Income Rating Cards Age Education Gender Student Married Ethnicity Balance
1 14.891 bad 2 34 11 Male No Yes Caucasian 333
2 106.025 excellent 3 82 15 Female Yes Yes Asian 903
3 104.593 excellent 4 71 11 Male No No Asian 580
4 148.924 excellent 3 36 11 Female No No Asian 964
5 55.882 good 2 68 16 Male No Yes Caucasian 331
6 80.180 excellent 4 77 10 Male No No Caucasian 1151
I built a classification tree with the rpart() function then predicted probabilities.
credit_model <- rpart(Rating ~ ., data=credit_train, method="class")
credit_pred <- predict(credit_model, credit_test)
Then I want to assess the prediction with CrossTable() from the gmodels package.
library(gmodels)
CrossTable(credit_test, credit_pred, prop.chisq=FALSE, prop.c=FALSE, prop.r=FALSE, dnn=c("actual Rating", "predicted Rating"))
But I get this error:
Error in CrossTable(credit_test, credit_pred, prop.chisq = FALSE,
prop.c = FALSE, : x and y must have the same length
I don't know why I get this error for a 4-level class. When I have a binary class it works fine.
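A likely cause, offered as an assumption because the credit data are not available here: for a classification tree, predict() on an rpart object returns a matrix of class probabilities when no type is given, and credit_test is a whole data frame, so neither argument passed to CrossTable() is a vector of class labels. The sketch below uses the built-in iris data (three classes) as a stand-in, and base table() instead of CrossTable(); the object names are illustrative.

```r
library(rpart)

set.seed(42)
idx <- sample(nrow(iris), 100)
train <- iris[idx, ]
test <- iris[-idx, ]

model <- rpart(Species ~ ., data = train, method = "class")

# Default for a classification tree: one probability column per class
prob <- predict(model, test)
dim(prob)

# type = "class" returns one predicted label per test row
pred <- predict(model, test, type = "class")

# Cross-tabulate the actual labels (a single column) against predictions
conf_mat <- table(actual = test$Species, predicted = pred)
conf_mat
```

By the same reasoning, passing the label column and the class predictions, for example credit_test$Rating and predict(credit_model, credit_test, type = "class"), should give CrossTable() two vectors of equal length.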

Resources