So I have a data set called x. The contents are simple enough to just write out so I'll just outline it here:
the dependent variable, Report, in the first column is binary yes/no (0 = no, 1 = yes)
the subsequent 3 columns are all categorical variables (race.f, sex.f, gender.f) that have all been converted to factors, and they're designated by numbers (e.g. 1= white, 2 = black, etc.)
I have run a logistic regression on x as follows:
glm <- glm(Report ~ race.f + sex.f + gender.f, data=x,
family = binomial(link="logit"))
And I can check the fitted probabilities by looking at summary(glm$fitted).
My question: How do I create a fifth column on the right side of this data set x that will include the predictions (i.e. fitted probabilities) for Report? Of course, I could just insert glm$fitted as a column, but I'd like to write code that predicts it from whatever is in the race, sex, and gender columns, for more general use.
Right now I have the following code, which I hope will create a predicted column as well as lower and upper bounds for the confidence interval.
xnew <- cbind(xnew, predict(glm5, newdata = xnew, type = "link", se = TRUE))
xnew <- within(xnew, {
PredictedProb <- plogis(fit)
LL <- plogis(fit - (1.96 * se.fit))
UL <- plogis(fit + (1.96 * se.fit))
})
Unfortunately I get the error:
Error in eval(expr, envir, enclos) : object 'race.f' not found
after the cbind code.
Anyone have any idea?
There appear to be a few typos in your code. First, xnew calls on glm5, but as far as I can see your model is named glm (by the way, using glm as the name of your model object is probably not a good idea, since it is also the name of the function). Second, make sure the variable race.f is actually in the dataset you want to predict from. My guess is that R can't find that variable, hence the error.
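A minimal sketch of the corrected workflow, on simulated data because the original x is not available. The object name fit_report and the simulated values are assumptions made for illustration; the column names follow the question. The point is that the data frame passed as newdata must contain the same factor columns the formula uses, and that the model object referenced in predict() must be the one that was actually fitted.

```r
set.seed(1)
n <- 200

# Simulated stand-in for x (assumption: the real data are not available)
x <- data.frame(
  Report   = rbinom(n, 1, 0.4),
  race.f   = factor(sample(1:3, n, replace = TRUE)),
  sex.f    = factor(sample(1:2, n, replace = TRUE)),
  gender.f = factor(sample(1:2, n, replace = TRUE))
)

# Store the model under a name other than "glm"
fit_report <- glm(Report ~ race.f + sex.f + gender.f, data = x,
                  family = binomial(link = "logit"))

# Predict on the link (log-odds) scale so the interval can be built there
pred <- predict(fit_report, newdata = x, type = "link", se.fit = TRUE)

# Back-transform to probabilities with the inverse logit
x$PredictedProb <- plogis(pred$fit)
x$LL <- plogis(pred$fit - 1.96 * pred$se.fit)
x$UL <- plogis(pred$fit + 1.96 * pred$se.fit)

head(x)
```

Because the interval is computed on the link scale and then back-transformed, the bounds always stay between 0 and 1.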
I am trying to report all the interactions in a linear model that reads:
mod1.lme <- lm(volume ~ Group * Treatment + Group + Treatment, data = df)
Group is a factor variable with 3 levels: A, B and C.
The result that I currently get is for (I made up the data):
These two estimates are in reference to Treatment:A, but I would like to see each effect independently. So the output that I would like to get is:
Treatment:A
Treatment:B
Treatment:C
If I eliminate the intercept adding -1 at the end I get:
What is the best way to code this?
Thanks
You are seeing this output because one of the factor levels becomes a reference level, and the remaining coefficients are then "the difference in effect from the reference level". Removing the intercept on its own only changes how the Group main effects are coded: as long as the formula contains the Treatment main effect, the interaction terms are still differences from the reference group. To get one Treatment coefficient per group, drop the Treatment main effect as well, so that Treatment is nested within Group:
mod1.lme <- lm(volume ~ Group + Group:Treatment - 1, data = df)
Edit:
To change the name of the interaction effect, one would have to edit the name manually
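sum.lm <- summary(mod1.lme)
# use the full element name: assigning through the partial name $coef would add a new element instead of renaming the coefficient table
rownames(sum.lm$coefficients) <- c("groupA","groupB","groupC", "groupA:Treatment", "groupB:Treatment", "groupC:Treatment")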
or alternatively use another package for summaries such as sjPlot
library(sjPlot)
tab_model(mod1.lme, pred.labels = c("groupA","groupB","groupC", "groupA:Treatment", "groupB:Treatment", "groupC:Treatment"))
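To see what each parameterization estimates, here is a small sketch on simulated data. It assumes Treatment is a numeric variable (the question does not say), and the data frame and slopes are made up. It compares the default reference-level coding with a formula in which Treatment is nested within Group.

```r
set.seed(1)

# Simulated data (assumption): three groups with different Treatment slopes
df <- data.frame(
  Group     = factor(rep(c("A", "B", "C"), each = 20)),
  Treatment = rnorm(60)
)
true_slope <- c(A = 1, B = 2, C = 3)
df$volume <- 10 + true_slope[as.character(df$Group)] * df$Treatment +
  rnorm(60, sd = 0.5)

# Reference-level coding: interaction terms are differences from group A
mod_ref <- lm(volume ~ Group * Treatment, data = df)

# Nested coding: one intercept and one Treatment slope per group
mod_nested <- lm(volume ~ Group + Group:Treatment - 1, data = df)

names(coef(mod_ref))
names(coef(mod_nested))

# The slope for group B is the same in both parameterizations
slope_B <- coef(mod_ref)["Treatment"] + coef(mod_ref)["GroupB:Treatment"]
all.equal(unname(slope_B), unname(coef(mod_nested)["GroupB:Treatment"]))
```

Both models have identical fitted values; only the way the coefficients are reported differs.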
I want to plot how the estimated survival from a Cox model depends on the value of a covariate of interest, while the remaining variables are fixed at their average values (if they are continuous) or at their lowest values (for dummy variables). Following this example http://www.sthda.com/english/wiki/cox-proportional-hazards-model , I have constructed a new data frame with three rows, one for each value of my variable of interest, with the other covariates held fixed. Two of these covariates are factors. The new data frame is then passed to survfit() via the newdata argument.
When I pass the data frame to survfit(), I get the following error message: Error in relevel.default(occupation) : 'relevel' only for factors. What is the source of the problem? If it is related to the factor vectors, how can I solve it? An example of the code is below. Unfortunately, I cannot share the data or find a dataset that produces the same error message.
I have transformed the factor variables into integer vectors in the Cox model and in the new dataset. It did not work.
I have deleted all the factor variables, and then it works.
I have tried to implement this strategy, but it did not work: Plotting predicted survival curves for continuous covariates in ggplot
fit <- coxph(Surv(entry, exit, event == 1) ~ status_plot +
exp_national + relevel(occupation, 5) + age + gender + EDUCATION , data = data)
data_rank <- with(data,
data.frame(status_plot = c(1,2,3), # factor vector of interest
exp_national=rep(mean(exp_national, na.rm = TRUE), 3),
occupation = c(5,5,5), # factor with 6 categories, number 5 is the category of reference in the cox model
age=rep(mean(age, na.rm = TRUE), 3),
gender = c(1,1,1),
EDUCATION=rep(mean(EDUCATION, na.rm = TRUE), 3) ))
surv.fin <- survfit(fit, newdata=data_rank) # this produces the error
Looking at the code, it appears you probably attempted to take the mean of a factor, so please post at least the output of str(data) as an edit to the body of your question. You should also realize that a single value given for a column in a data.frame() call is recycled to the correct length, so each of the means could be entered as a single item rather than wrapped in rep().
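Another plausible reading of the error message is that a column that is a factor in the fitting data is a plain number in the new data frame (occupation = c(5,5,5) is numeric), so relevel() refuses it. The sketch below is only an illustration under that assumption: it uses the lung data shipped with the survival package rather than the original data, and the names d, fit_cox and new_rows are made up. It builds the factor column in the new data with the same levels as in the fitting data and lets a single mean be recycled.

```r
library(survival)

# Stand-in data (assumption): lung from the survival package
d <- lung
d$sex <- factor(d$sex, levels = c(1, 2), labels = c("male", "female"))

fit_cox <- coxph(Surv(time, status) ~ sex + age, data = d)

# Factor columns in newdata are built as factors with the original levels;
# the single mean value for age is recycled to both rows
new_rows <- data.frame(
  sex = factor(c("male", "female"), levels = levels(d$sex)),
  age = mean(d$age, na.rm = TRUE)
)

sf <- survfit(fit_cox, newdata = new_rows)
plot(sf, col = 1:2, xlab = "Days", ylab = "Estimated survival")
```

Checking str() of both the fitting data and the new data frame side by side is a quick way to spot a column whose type differs between the two.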
Apologies for a very basic question. I'm struggling to get R to recognise the y values for a ROC
I'm trying to do a basic ROC but can't seem to set the vector for y.
fullmodel= glm(culture_positive ~ No_symptoms + sex + art_status_v1 +current_cd4 +
bmi_v1 +nurse_tb_diagnosis_crp_v1 + temperature_v1,
family="binomial", data= Data1)
roc(y , fullmodel$fitted.values, plot=TRUE)
Error in roc(y, fullmodel$fitted.values, plot = TRUE) :
object 'y' not found
So 'y' is a column in my dataset Data1 labelled 'culture_positive' as per the glm but whatever I try I keep getting this message that 'y' is not found.
Once again, apologies for a basic question but it is really holding me up.
Since y is not in your global environment, you need to tell R where to find it. You can either use the column you used to fit the model:
roc(Data1$culture_positive , fullmodel$fitted.values, plot=TRUE)
or the response stored in the glm object
roc(fullmodel$y , fullmodel$fitted.values, plot=TRUE)
I would recommend the second option. It is somewhat safer because you take y and fitted.values from the same object, so they will fit together.
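A small sketch of why taking both vectors from the model object is safer, using simulated data (the values and the reduced formula are assumptions for illustration). When a predictor has missing values, glm() drops those rows by default, so the response column in the data frame no longer has the same length as the fitted values, while the y stored in the model does.

```r
set.seed(42)

# Simulated stand-in for Data1 (assumption), with two missing predictor values
Data1 <- data.frame(
  culture_positive = rbinom(100, 1, 0.5),
  bmi_v1           = rnorm(100, mean = 22, sd = 3)
)
Data1$bmi_v1[c(3, 10)] <- NA

fullmodel <- glm(culture_positive ~ bmi_v1, family = "binomial", data = Data1)

length(Data1$culture_positive)   # 100: all rows of the data frame
length(fullmodel$y)              # 98: only the rows used in the fit
length(fullmodel$fitted.values)  # 98: matches fullmodel$y
```

With incomplete data, passing the data frame column together with the fitted values would give vectors of different lengths, whereas fullmodel$y always lines up with fullmodel$fitted.values.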
I have the following data
scorer<-function(points){
points["scores"] <- as.vector((points$X-5)^2+(points$Y-5)^2-9)
points["class"]<-(as.vector( points$scores<0 ))
points
}
dt<-scorer(data.frame(X=c(0,1,5,20,5,3,9,3,5,5),Y=c(0,9,9,0,-18,3,4,5,7,4)))
Then I am trying to predict the last column (class) using an SVM:
library(e1071)
model <- svm(class ~ . , dt)
predictedClass <- predict(model, dt)
but it complains with:
Error in svm.default(x, y, scale = scale, ..., na.action = na.action) :
Need numeric dependent variable for regression.
The advice from nya really works.
Please have a look at the description of the type parameter:
svm can be used as a classification machine, as a regression machine, or for
novelty detection. Depending on whether y is a factor or not, the default setting
for type is C-classification or eps-regression ... page 50
With your dataset you can do classification using the svm method.
But if you really want to do regression, convert your variable "class" to numeric form, taking the value 1 for a negative score and 0 for a positive score.
scorer <- function(points) {
points["scores"] <- as.vector((points$X-5)^2+(points$Y-5)^2-9)
points["class"]<-as.vector( ifelse(points$scores<0 ,1,0))
points
}
dt<-scorer(data.frame(X=c(0,1,5,20,5,3,9,3,5,5),Y=c(0,9,9,0,-18,3,4,5,7,4)))
svm(class~.,dt)
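For the classification route, a minimal sketch is to store class as a factor so that svm() picks its default classification type. The formula class ~ X + Y is an assumption made here so that the scores column, from which class is derived, is not used as a predictor.

```r
library(e1071)

scorer <- function(points) {
  points$scores <- (points$X - 5)^2 + (points$Y - 5)^2 - 9
  # A factor response makes svm() default to classification
  points$class <- factor(points$scores < 0)
  points
}

dt <- scorer(data.frame(X = c(0, 1, 5, 20, 5, 3, 9, 3, 5, 5),
                        Y = c(0, 9, 9, 0, -18, 3, 4, 5, 7, 4)))

model <- svm(class ~ X + Y, data = dt)
predictedClass <- predict(model, dt)

# Cross-tabulate predictions against the true labels
table(predicted = predictedClass, actual = dt$class)
```

With only ten points this is purely illustrative; the predictions are made on the training data and say nothing about out-of-sample accuracy.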
I used ApacheData data with 83784 rows to build a linear regression model:
fit <-lm(tomorrow_apache~ as.factor(state_today)
+as.numeric(daily_creat)
+ as.numeric(last1yr_min_hosp_icu_MDRD)
+as.numeric(bun)
+as.numeric(urin)
+as.numeric(category6)
+as.numeric(category7)
+as.numeric(other_fluid)
+ as.factor(daily)
+ as.factor(age)
+ as.numeric(apache3)
+ as.factor(mv)
+ as.factor(icu_loc)
+ as.factor(liver_tr_before_admit)
+ as.numeric(min_GCS)
+ as.numeric(min_PH)
+ as.numeric(previous_day_creat)
+ as.numeric(previous_day_bun) ,ApacheData)
And I want to use this model to predict a new input so I give each predictor variable a value:
predict(fit, data=data.frame(state_today=1, daily_creat=2.3, last1yr_min_hosp_icu_MDRD=3, bun=10, urin=0.01, category6=10, category7=20, other_fluid=0, daily=2 , age=25, apache3=12, mv=1, icu_loc=1, liver_tr_before_admit=0, min_GCS=20, min_PH=3, previous_day_creat=2.1, previous_day_bun=14))
I expect a single value as the prediction for this new input, but I get a great many predictions! I don't know why this is happening. What am I doing wrong?
Thanks a lot for your time!
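One likely cause, sketched on a built-in dataset since ApacheData is not available: predict() for lm objects takes the new observations through the newdata argument. An argument named data is not recognised, is silently absorbed by ..., and the fitted values for all training rows are returned. The model and the new observation below are made up for illustration.

```r
# Stand-in model (assumption): mtcars instead of ApacheData
fit_demo <- lm(mpg ~ wt + factor(cyl), data = mtcars)

new_obs <- data.frame(wt = 3, cyl = 6)

# "data" is ignored, so this returns one fitted value per training row
length(predict(fit_demo, data = new_obs))     # 32

# "newdata" is the argument predict() actually uses
length(predict(fit_demo, newdata = new_obs))  # 1
```

The same change in the original call, newdata = data.frame(...) instead of data = data.frame(...), should return a single prediction, provided the factor values supplied were also present in the training data.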
You may also want to try the excellent effects package in R (?effects). It's very useful for graphing the predicted probabilities from your model by setting the inputs on the right-hand side of the equation to particular values. I can't reproduce the example you've given in your question, but to give you an idea of how to quickly extract predicted probabilities in R and then plot them (since this is vital to understanding what they mean), here's a toy example using the in-built data sets in R:
install.packages("effects") # installs the "effects" package in R
library(effects) # loads the "effects" package
data(Prestige) # loads in-built dataset
m <- lm(prestige ~ income + education + type, data=Prestige)
# this last step creates predicted values of the outcome based on a range of values
# on the "income" variable and holding the other inputs constant at their mean values
eff <- effect("income", m, default.levels=10)
plot(eff) # graphs the predicted values of prestige across the range of income
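The same idea can be sketched in base R without an extra package: build a grid over one input, hold the other inputs at their means, and call predict() on it. This is an illustration on mtcars with numeric predictors only, not a reimplementation of what the effects package does internally; the names m_demo and grid are made up.

```r
m_demo <- lm(mpg ~ wt + hp, data = mtcars)

# Ten values across the range of wt, with hp held at its mean (recycled)
grid <- data.frame(
  wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 10),
  hp = mean(mtcars$hp)
)

grid$pred <- predict(m_demo, newdata = grid)

plot(grid$wt, grid$pred, type = "l",
     xlab = "wt", ylab = "Predicted mpg (hp at its mean)")
```

For models with factor inputs, each factor would also need a fixed level in the grid, which is one of the details the effects package handles for you.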