Obtaining Predictions for New Observations (R Programming Language)

I am working with the R programming language. I created a decision tree for this dataset in R (to predict whether the "diabetes" column is either "pos" or "neg"):
#load libraries
library(pdp)
library(C50)
#load data
data(pima)
#remove na's
new_data = na.omit(pima)
#format data
new_data$age = as.factor(ifelse(new_data$age >30, "1", "0"))
new_data$pregnant = as.factor(ifelse(new_data$pregnant >2, "1", "0"))
#run model
tree_mod <- C5.0(x = new_data[, 1:8], rules = TRUE, y = new_data$diabetes)
Here is my question: I am trying to obtain a column of "predictions" made by the model for new observations. I then want to append this column to the original dataset.
Using the following link, https://cran.r-project.org/web/packages/C50/vignettes/C5.0.html, I used the "predict" function:
#pretend this is new data
new = new_data[1:10,]
#run predictions
pred = predict(tree_mod, newdata = new[, 1:8])
But this produces the following error:
Error in x[j] : invalid subscript type 'closure'
Can anyone please show me how to do this?
I am trying to create something like this ("prediction_made_by_model"):
pregnant glucose pressure triceps insulin mass pedigree age diabetes prediction_made_by_model
4 0 89 66 23 94 28.1 0.167 0 neg pos
5 0 137 40 35 168 43.1 2.288 1 pos neg
7 1 78 50 32 88 31.0 0.248 0 pos neg
9 0 197 70 45 543 30.5 0.158 1 pos pos
14 0 189 60 23 846 30.1 0.398 1 pos neg
15 1 166 72 19 175 25.8 0.587 1 pos pos
Thanks!

I was able to figure it out. For some reason, this was not working before:
pred = predict(tree_mod, newdata = new[, 1:8])
new$prediction_made_by_model = pred

Related

Getting Warning: " 'newdata' had 5 rows but variables found have 750 rows" even though column names are the same

I have a dataset called PimaDiabetes, which can be pulled from here.
PimaDiabetes <- read.csv("PimaDiabetes.csv")
From which I derived a logistic model:
chosen_glm = glm(PimaDiabetes$Outcome ~ PimaDiabetes$Pregnancies+PimaDiabetes$Glucose
+PimaDiabetes$SkinThickness+PimaDiabetes$BMI
+PimaDiabetes$DiabetesPedigree, data = PimaDiabetes)
However, whenever I try to run it against a new dataset called ToPredict:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigree Age
4 136 70 0 0 20 31.2 22
1 121 78 39 74 20 39 28
3 108 62 24 0 20 26 25
0 181 88 44 510 20 43.3 26
8 154 78 32 0 20 32.4 45
I get the following warning:
>predict(chosen_glm,ToPredict,type="response")
Warning message:
'newdata' had 5 rows but variables found have 750 rows
And I'm not sure what's wrong.
The colnames
colnames(PimaDiabetes)
[1] "Pregnancies" "Glucose" "BloodPressure" "SkinThickness" "Insulin" "BMI" "DiabetesPedigree" "Age"
[9] "Outcome"
Are the same
colnames(ToPredict)
[1] "Pregnancies" "Glucose" "BloodPressure" "SkinThickness" "Insulin" "BMI" "DiabetesPedigree" "Age"
Try this:
PimaDiabetes = read.csv("diabetes.csv")
chosen_glm = glm(
Outcome ~ Pregnancies + Glucose + SkinThickness + BMI + DiabetesPedigreeFunction,
data = PimaDiabetes
)
ToPredict = PimaDiabetes[sample(nrow(PimaDiabetes),5),]
predict(chosen_glm,ToPredict,type="response")
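To see why the rewritten formula matters, here is a minimal sketch on a built-in dataset. It assumes mtcars and lm() as stand-ins for PimaDiabetes and glm(); the object names fit_bad, fit_ok and new_rows are made up for illustration. When the formula contains terms like mtcars$wt, predict() looks for a variable with that exact name, does not find it in newdata, and silently falls back to the training vector.

```r
# Illustrative only: mtcars stands in for the PimaDiabetes data.
fit_bad <- lm(mtcars$mpg ~ mtcars$wt, data = mtcars)  # terms are hard-wired to mtcars
fit_ok  <- lm(mpg ~ wt, data = mtcars)                # terms are plain column names

new_rows <- mtcars[1:5, ]

# predict() cannot find a column literally called "mtcars$wt" in new_rows,
# so it reuses the full training vector and warns about the row mismatch.
pred_bad <- suppressWarnings(predict(fit_bad, newdata = new_rows))

# With bare column names, the values are looked up in new_rows as intended.
pred_ok <- predict(fit_ok, newdata = new_rows)

length(pred_bad)  # one prediction per training row
length(pred_ok)   # one prediction per row of new_rows
```

Separately, glm() defaults to family = gaussian, so if a logistic model is what you want, family = binomial probably needs to be added to the call as well.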

Writing a function to compare differences of a series of numeric variables

I am working on a problem set and absolutely cannot figure this one out. I think I've fried my brain to the point where it doesn't even make sense anymore.
Here is a look at the data ...
sex age chol tg ht wt sbp dbp vldl hdl ldl bmi
<chr> <int> <int> <int> <dbl> <dbl> <int> <int> <int> <int> <int> <dbl>
1 M 60 137 50 68.2 112. 110 70 10 53 74 2.40
2 M 26 154 202 82.8 185. 88 64 34 31 92 2.70
3 M 33 198 108 64.2 147 120 80 22 34 132 3.56
4 F 27 154 47 63.2 129 110 76 9 57 88 3.22
5 M 36 212 79 67.5 176. 130 100 16 37 159 3.87
6 F 31 197 90 64.5 121 122 78 18 58 111 2.91
7 M 28 178 163 66.5 167 118 68 19 30 135 3.78
8 F 28 146 60 63 105. 120 80 12 46 88 2.64
9 F 25 231 165 64 126 130 72 23 70 137 3.08
10 M 22 163 30 68.8 173 112 70 6 50 107 3.66
# … with 182 more rows
I must write a function, myTtest, to perform the following task:
Perform two-sample t-tests to compare the differences of a series of numeric variables between each level of a classification variable
The first argument, dat, is a data frame
The second argument, classVar, is a character vector of length 1. It is the name of the classification variable, such as 'sex.'
The third argument, numVar, is a character vector that contains the name of the numeric variables, such as c("age", "chol", "tg"). This means I need to perform three t-tests to compare the difference of those between males and females.
The function should return a data frame with the following variables: Varname, F.mean, M.mean, t (for t-statistics), df (for degrees of freedom), and p (for p-value).
I should be able to run this ...
myTtest(dat = chol, classVar = "sex", numVar = c("age", "chol", "tg"))
... and then get the data frame to appear.
Any help is greatly appreciated. I am pulling my hair out over this one! As well, as noted in my comment below, this has to be done without Tidyverse ... which is why I'm having so much trouble to begin with.
The intuition for this solution is that you can loop over your dependent variables, and call t.test() in each loop. Then save the results from each DV and stack them together in one big data frame.
I'll leave out some bits for you to fill in, but here's the gist:
First, some example data:
set.seed(123)
n <- 20
grp <- sample(c("m", "f"), n, replace = TRUE)
df <- data.frame(grp = grp, age = rnorm(n), chol = rnorm(n), tg = rnorm(n))
df
grp age chol tg
1 m 1.2240818 0.42646422 0.25331851
2 m 0.3598138 -0.29507148 -0.02854676
3 m 0.4007715 0.89512566 -0.04287046
4 f 0.1106827 0.87813349 1.36860228
5 m -0.5558411 0.82158108 -0.22577099
6 f 1.7869131 0.68864025 1.51647060
7 f 0.4978505 0.55391765 -1.54875280
8 f -1.9666172 -0.06191171 0.58461375
9 m 0.7013559 -0.30596266 0.12385424
10 m -0.4727914 -0.38047100 0.21594157
Now make a container that each of the model outputs will go into:
fits_df <- data.frame()
Loop over each DV and append the model output to fits_df each time with rbind:
for (dv in c("age", "chol", "tg")) {
frml <- as.formula(paste0(dv, " ~ grp")) # make a model formula: dv ~ grp
fit <- t.test(frml, alternative = "two.sided", data = df) # perform the two-sided t-test
# hint: use str(fit) to figure out how to pull out each value you care about
fit_df <- data.frame(
dv = dv,
f_mean = xxx,
m_mean = xxx,
t = xxx,
df = xxx,
p = xxx
)
fits_df <- rbind(fits_df, fit_df)
}
Your output will look like this:
fits_df
dv f_mean m_mean t df p
1 age -0.18558068 -0.04446755 -0.297 15.679 0.7704954
2 chol 0.07731514 0.22158672 -0.375 17.828 0.7119400
3 tg 0.09349567 0.23693052 -0.345 14.284 0.7352112
One note: When you're pulling out values from fit, you may get odd row names in your output data frame. This is due to the names property of the various fit attributes. You can get rid of these by using as.numeric() or as.character() wrappers around the values you pull from fit (for example, fit$statistic can be cleaned up with as.character(round(fit$statistic, 3))).
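For reference, one way the pieces could fit together is sketched below, in base R only. This is not the only possible solution: it assumes the classification variable has exactly the two levels "F" and "M", uses the default Welch test, and the toy data frame is made up purely to exercise the function.

```r
# A possible myTtest(), assuming classVar has exactly the levels "F" and "M".
myTtest <- function(dat, classVar, numVar) {
  rows <- lapply(numVar, function(v) {
    frml <- as.formula(paste(v, "~", classVar))   # e.g. age ~ sex
    fit <- t.test(frml, data = dat)               # Welch two-sample t-test
    means <- tapply(dat[[v]], dat[[classVar]], mean, na.rm = TRUE)
    data.frame(
      Varname = v,
      F.mean = means[["F"]],
      M.mean = means[["M"]],
      t = unname(fit$statistic),    # unname() avoids odd row names
      df = unname(fit$parameter),
      p = fit$p.value
    )
  })
  do.call(rbind, rows)   # stack one row per numeric variable
}

# Made-up data, only to show the call
set.seed(1)
toy <- data.frame(
  sex = rep(c("F", "M"), each = 10),
  age = rnorm(20, mean = 30, sd = 5),
  chol = rnorm(20, mean = 180, sd = 20)
)
res <- myTtest(dat = toy, classVar = "sex", numVar = c("age", "chol"))
res
```

Building the rows with lapply() and binding them once at the end avoids growing a data frame inside a loop, but the rbind-in-a-loop version above gives the same result.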

anova - selecting multiple DVs simultaneously

I am trying to run anova on many dependent variables. I have one independent variable, which is my grouping variable (Group). I have about 25 DVs - "TMTG, TMTF, CUE, CSE, TCUE, TCSE, WRS, WMAO, TWRS, TWMAO, JCP, JCPE ....etc". I used the following code for the first three variables and I am getting the desired output. How do I tweak the code to get the output for all 25 variables at the same time, but without naming them? I have another dataset with 100 DVs - I can't write those out!
here is the data frame
Group TMTG TMTF CUE CSE WRS
TN 27 33 35.12 13.56 0
TN 32 34 12.90 25.56 0
TN 14 78 11 14.78 0
TN 89 41 98 45.25 0
TL 65 11 18.5 23.89 0
TL 12 78 34.6 41.85 0
TL 11 20 35.5 45.5 0
TL 27 25 11.28 55.69 0
Here is the code:
mydataframe
manova_1 <-
manova(cbind(TMTG, TMTF, CUE) ~ as.factor(Group), data = mydataframe)
manova_1
summary.aov(manova_1)
Here is the output
Response TMTG :
Df Sum Sq Mean Sq F value Pr(>F)
as.factor(Group) 1 0.535 0.5351 0.1683 0.6858
Residuals 21 66.769 3.1795
Response TMTF :
Df Sum Sq Mean Sq F value Pr(>F)
as.factor(Group) 1 0.02 0.016 5e-04 0.9831
Residuals 21 749.13 35.673
Response CUE :
Df Sum Sq Mean Sq F value Pr(>F)
as.factor(Group) 1 14.7 14.75 0.0372 0.8489
Residuals 21 8325.7 396.46
I want to tweak this line:
manova(cbind(TMTG, TMTF, CUE) ~ as.factor(Group), data = mydataframe)
so that cbind can take in all the columns without me having to write them out. I tried cbind(2:24) but it's not working! Any help would be appreciated!!!
Assuming 1) Group is the first variable in mydataframe, and 2) you want to do a manova as opposed to a number of separate anovas, you could replace the line:
manova(cbind(TMTG, TMTF, CUE) ~ as.factor(Group), data = mydataframe)
with:
manova(as.matrix(mydataframe[, -1]) ~ as.factor(Group), data = mydataframe)
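Here is a small self-contained sketch of that replacement. The data frame toy_df and its values are simulated for illustration (they are not the data from the question), and Group is assumed to be the first column, as above.

```r
# Simulated stand-in for mydataframe: Group first, then the DVs.
set.seed(42)
toy_df <- data.frame(
  Group = rep(c("TN", "TL"), each = 6),
  TMTG = rnorm(12, mean = 30, sd = 10),
  TMTF = rnorm(12, mean = 40, sd = 15),
  CUE = rnorm(12, mean = 25, sd = 8)
)

# Every column except the first becomes a response, with no names typed out.
manova_all <- manova(as.matrix(toy_df[, -1]) ~ as.factor(Group), data = toy_df)

# One univariate ANOVA table per response column
aov_tables <- summary.aov(manova_all)
length(aov_tables)
aov_tables
```

If some of the other columns are not DVs, you would need to index only the DV columns (for example toy_df[, 2:4]) rather than dropping just the first one.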

A concise way to extract some elements of a "survfit" object into a data frame

I load a data set from the survival library, and generate a survfit object:
library(survival)
data(lung)
lung$SurvObj <- with(lung, Surv(time, status == 2))
fit <- survfit(SurvObj ~ 1, data = lung, conf.type = "log-log")
This object is a list:
> str(fit)
List of 13
$ n : int 228
$ time : int [1:186] 5 11 12 13 15 26 30 31 53 54 ...
$ n.risk : num [1:186] 228 227 224 223 221 220 219 218 217 215 ...
$ n.event : num [1:186] 1 3 1 2 1 1 1 1 2 1 ...
...
Now I specify some members (all same length) that I want to turn into a data frame:
members <- c("time", "n.risk", "n.event")
I'm looking for a concise way to make a data frame with the three list members as columns, with the columns named time, n.risk, n.event (not fit$time, fit$n.risk, fit$n.event)
Thus the resulting data frame should look like this:
time n.risk n.event
[1,] 5 228 1
[2,] 11 227 3
[3,] 12 224 1
...
This is OK
data.frame(unclass(fit)[members])
Another (more canonical) way is
with(fit, data.frame(time, n.risk, n.event))
The broom package contains functions to tidy up the results of regression models and present them in an object of class data.frame. For those unfamiliar with the tidy philosophy, please see Tidy data [ 1 ]
library(broom)
#create tidy dataframe and subset by the columns saved in members
df <- tidy(fit)[,members]
head(df)
# time n.risk n.event
#1 5 228 1
#2 11 227 3
#3 12 224 1
#4 13 223 2
#5 15 221 1
#6 26 220 1
[ 1 ] Wickham, Hadley. "Tidy Data." Journal of Statistical Software [Online], 59.10 (2014): 1-23. Web. 16 Jun. 2017
Used cbind to bind the dataframes, then used names to change the name of columns
time=as.data.frame(fit$time)
n.risk=as.data.frame(fit$n.risk)
n.event=as.data.frame(fit$n.event)
members2=cbind(time,n.risk,n.event)
names(members2)=c("time","n.risk","n.event")
head(members2)
time n.risk n.event
1 5 228 1
2 11 227 3
3 12 224 1
4 13 223 2
5 15 221 1
6 26 220 1
library(survival)
data(lung)
lung$SurvObj <- with(lung, Surv(time, status == 2))
fit <- survfit(SurvObj ~ 1, data = lung, conf.type = "log-log")
str(fit)
members<-data.frame(time=fit$time,n.risk=fit$n.risk,n.event=fit$n.event)
members

MICE package in R: passive imputation

I aimed to handle missing values with multiple imputation and then analyse with mixed linear model.
I am stuck on passive imputation for "BMI" (body mass index) and "BMI category". "BMI" was calculated from height and weight and then categorized into "BMI category".
How to impute 'BMI category'?
The database looks like below:
sub_eu_surf[1:5, 3:12]
age gender smoking exercise education sbp dbp height weight bmi
1 41 1 1 2 18 120 80 185 107 31.26370
2 46 1 3 2 18 130 70 182 102 30.79338
3 46 1 3 2 18 130 70 182 102 30.79338
4 47 1 1 2 14 130 80 178 78 24.61810
5 47 1 1 1 14 150 80 175 85 27.75510
Since 'bmi category' is not a predictor of my imputation, I decided to create it after imputation. And details are below:
1. To define method and predictor
ini<-mice(sub_eu_surf, maxit=0)
meth<-ini$meth
meth["bmi"]<-"~I(weight/(height/100)^2)"
pred <- ini$predictorMatrix
pred[c("pm25_global", "pm25_eu", "pm10_eu", "no2_eu"), ]<-0
pred[,c("bmi", "hba1c", "pm25_eu", "pm10_eu")]<-0
pred[,"tc"]<-0
pred[c("smoking", "exercise", "hdl", "glucose"), "tc"]<-1
pred[c("smoking", "exercise", "hdl", "glucose"), "ldl"]<-0
vis <- ini$vis
imp_eu<-mice(sub_eu_surf, meth=meth, pred=pred, vis=vis, seed=200, print=F, m=5, maxit=5)
long_eu<- complete(imp_eu, "long", include=TRUE)
long_eu$bmi_category<-cut(as.numeric(long_eu$bmi), breaks=c(0, 18.5, 25, 30, 72))
complete_eu<-as.mids(long_eu)
But I received an error when analyzing my data:
test1<-with(imp_eu, lme(sbp~pm25_global+gender+age+education+bmi_category, random=~1|centre))
Error in eval(expr, envir, enclos) : object 'bmi_category' not found
How does this happen?
You are running your analyses on the original mids object imp_eu, not on the modified complete_eu. Try:
test1<-with(complete_eu, lme(sbp~pm25_global+gender+age+education+bmi_category, random=~1|centre))
