I have a data frame of bird counts. I have the participants ID number, the number of birds they counted, the year they counted them, their lat and long coordinates, and their effort. I have made this model:
model = lmer(count~year+lat+long+effort+(1|participant), data = df)
I now want the model to plot predicted values from that same data set. So, that data was for 1997-2017, and I want the model to give me predicted values for each year. I want to plot these, so the final plot will have the predicted count on the y-axis, and the year (categorical) on the x-axis. Each year will have one data point w/ a confidence interval.
I have tried figuring out predict(), but I'm not quite sure how to use that to get what I want. It seems to need a new data frame, but I don't have a new data set to run through the model to predict a future count. I want the model to go back and work on the previous data that I put into it already, based off of the Beta values in the output of summary(model).
I found this thread, and it seems to be basically what I'm looking to do, but I can't get the sjPlot dependencies to download, sjlabelled throws an error every time: How to plot predicted values with standard errors for lmer model results?
You could try the ggeffects-package, which will be used in the forthcoming sjPlot-update to plot predicted values.
library(ggeffects)
dat <- ggpredict(model, terms = "dat")
plot(dat)
If you're missing dependencies, try:
install.packages(
c("sjlabelled", "sjmisc", "sjstats", "ggeffects", "sjPlot"),
dependencies = TRUE
)
You may even want to install ggeffects from GitHub, since the current dev-version has some fixes and improvements for mixed models.
devtools::install_github("strengejacke/ggeffects")
I found the package I was looking for, it's called predictedmeans and has a function where you put in the model and the model term you want predictions for predictmeans(model, model term). It works perfectly!
Related
I'm relatively new to glm - so please bear with me.
I have created a glm (logistic regression) to predict whether an individual CONTINUES studies ("0") or does NOTCONTINUE ("1"). I am interested in predicting the latter. The glm uses seven factors in the dataset and the confusion matrices are very good for what I need and combining seven years' of data have also been done. Straight-forward.
However, I now need to apply the model to the current years' data, which of course does not have the NOTCONTINUE column in it. Lets say the glm model is "CombinedYears" and the new data is "Data2020"
How can I use the glm model to get predictions of who will ("0") or will NOT ("1") continue their studies? Do I need to insert a NOTCONTINUE column into the latest file ?? I have tried this structure
Predict2020 <- predict(CombinedYears, data.frame(Data2020), type = 'response')
but the output only holds values <0.5.
Any help very gratefully appreciated. Thank you in advance
You mentioned that you already created a prediction model to predict whether a particular student will continue studies or not. You used the glm package and your model name is CombinedYears.
Now, what you have to know is that your problem is a binary classification and you used logistic regression for this. The output of your model when you apply it on new data, or even the same data used to fit the model, is probabilities. These are values between zero and one. In the development phase of your model, you need to determine the cutoff threshold of these probabilities which you can use later on when you predict new data. For example, you may determine 0.5 as a cutoff, and every probability above that is considered NOTCONTINUE and below that is CONTINUE. However, the best threshold can be determined from your data as well by maximizing both specificity and sensitivity. This can be done by calculating the area under the receiver operating characteristic curve (AUC). There are many packages than can do this for you, such as pROC and AUC packages in R. The same packages can determine the best cutoff as well.
What you have to do is the following:
Determine the cutoff threshold after calculating the AUC
library(pROC)
roc_object = roc(your_fit_data$NOTCONTINUE ~ fitted(CombinedYears))
coords(roc.roc_object, "best", ret="threshold", transpose = FALSE)
Use your model to predict on your new data year (as you did)
Predict2020 = predict(CombinedYears, data.frame(Data2020), type = 'response')
Now, the content of Predict2020 is just probabilities for each
student. Use the cutoff you obtained from step (1) to classify your
students accordingly
Im using the book Applied Survival Analysis Using R by Moore to try and model some time-to-event data. The issue I'm running into is plotting the estimated survival curves from the cox model. Because of this I'm wondering if my understanding of the model is wrong or not. My data is simple: a time column t, an event indicator column (1 for event 0 for censor) i, and a predictor column with 6 factor levels p.
I believe I can plot estimated surival curves for a cox model as follows below. But I don't understand how to use survfit and baseplot, nor functions from survminer to achieve the same end. Here is some generic code for clarifying my question. I'll use the pharmcoSmoking data set to demonstrate my issue.
library(survival)
library(asaur)
t<-pharmacoSmoking$longestNoSmoke
i<-pharmacoSmoking$relapse
p<-pharmacoSmoking$levelSmoking
data<-as.data.frame(cbind(t,i,p))
model <- coxph(Surv(data$t, data$i) ~ p, data=data)
As I understand it, with the following code snippets, modeled after book examples, a baseline (cumulative) hazard at my reference factor level for p may be given from
base<-basehaz(model, centered=F)
An estimate of the survival curve is given by
s<-exp(-base$hazard)
t<-base$time
plot(s~t, typ = "l")
The survival curve associated with a different factor level may then be given by
beta_n<-model$coefficients #only one coef in this case
s_n <- s^(exp(beta_n))
lines(s_n~t)
where beta_n is the coefficient for the nth factor level from the cox model. The code above gives what I think are estimated survival curves for heavy vs light smokers in the pharmcoSmokers dataset.
Since thats a bit of code I was looking to packages for a one-liner solution, I had a hard time with the documentation for Survival ( there weren't many examples in the docs) and also tried survminer. For the latter I've tried:
library(survminer)
ggadjustedcurves(model, variable ="p" , data=data)
This gives me something different than my prior code, although it is similar. Is the method I used earlier incorrect? Or is there a different methodology that accounts for the difference? The survminer code doesn't work from my data (I get a 'can't allocated vector of size yada yada error, and my data is ~1m rows) which seems weird considering I can make plots using what I did before no problem. This is the primary reason I am wondering if I am understanding how to plot survival curves for my model.
I have a glm in R that nicely explains abundance of a species of the form
x<-glm(log(abundance) ~ distance+sampling_effort, data=df)
All terms are significant (p-value<0.01) and model assumptions seem to be valid. The data is actually from a raster map. Now I want to create predicted values from my model, but while leaving out the sampling_effort term. So it would create a new raster map that compensates for sampling effort and thus provides a better prediction of abundance if sampling_effort would be equal everywhere. How can I do this?
Ok, after some better googling I found the answer already on http://r.789695.n4.nabble.com/Remove-term-from-formula-for-predict-lm-td1017686.html
Basically the easiest way is just to set the sampling_effort to 0 in a new dataset and use that with predict like this:
newdata <- df
newdata$sampling_effort = 0
predicted_values_compensated <- predict(x, newdata)
there is quite a lot of information (internet and textbooks) on how to do survival analysis in R with the survival package. But I don't find any information on how to do this when you have left censored data.
Problem background:
I have a self constructed data set with published survival data. Usually the event time and the date of the last follow-up (right censoring) is given. There is however one study that only states that the event happened before day 360. So I left censored this data.
What I want to do:
I want to analyse the complete data set with left truncation, events, and right truncation. I want to plot the Kaplan-Meier curve by gender and then
do a log-rank test
do a Cox regression
What I need:
I am able to create a Surv object with type = interval2. But this does neither allow to calculate survdiff, nor coxph of the survival package.
The intcox package was removed from CRAN and I don't find what I search in the icenReg or interval packages.
Can anyone please give me a hind how to solve my problem or where to find practical information on this? I am already spending days on this one.
Many thanks!
You can fit a Cox-PH model with both right and left censoring in icenReg by using the ic_sp function. You can fit this using the standard Surv response variable, i.e.
fit <- ic_sp(Surv(L, R, type = 'interval2') ~ treatment, data = myData)
or a little more succinctly with
fit <- ic_sp(cbind(L, R) ~ treatment, data = myData)
Log-rank tests are not available in icenReg, but can be found in the interval package.
I am using support vector regression in R to forecast future values for a uni-variate time series. Splitting the historical data into test and train sets, I find a model by using svm function in R to the test data and then use the predict() command with train data to predict values for the train set. We can then compute prediction errors. I wonder what happens then? we have a model and by checking the model on the train data, we see the model is efficient. How can I use this model to predict future values out of train data? Generally speaking, we use predict function in R and give it a forecast horizon (h=12) to predict 12 future values. Based on what I saw, the predict() command for SVM does not have such coomand and needs a train dataset. How should I build a train data set for predicting future data which is not in our historical data set?
Thanks
Just a stab in the dark... SVM is not for prediction but for classification, specifically supervised. I am guessing you are trying to predict stock values, no? How about classify your existing data, using some size of your choice say 100 values at a time, for noise (N), up (U), big up (UU), down (D), and big down (DD). In this way as your data comes in you slide your classification frame and get it to tell you if the upcoming trend is N, U, UU, D, DD.
What you can do is to build a data frame with columns representing the actual stock price and its n lagged values. And use it as a train set/test set (the actual value is the output and the previous values the explanatory variables). With this method you can do a 1-day (or whatever the granularity is) into the future forecast and then you can use your prediction to make another one and so on.