I have a question about plotting the actual data of a time series together with the values from a fitted model. In particular, my questions relate to this paper:
https://static.googleusercontent.com/media/www.google.com/en//googleblogs/pdfs/google_predicting_the_present.pdf
In the appendix of the document, you can find an R script. Here I have two initial questions: (1) what does
##### Define Predictors - Time Lags;
dat$s1 = c(NA, dat$sales[1:(nrow(dat)-1)]);
dat$s12 = c(rep(NA, 12), dat$sales[1:(nrow(dat)-12)]);
do, and (2) what is the function of:
##### Divide data by two parts - model fitting & prediction
dat1 = mdat[1:(nrow(mdat)-1), ]
dat2 = mdat[nrow(mdat), ]
Final and main question: let's say I fit the following model to my data:
fit = lm(log(sales) ~ log(s1) + log(s12) + trends1, data=dat1);
summary(fit)
The adjusted R-squared value is 0.342. Thus, I'd argue that the model above explains roughly 34% of the variance in the actual data. Now, how can I plot this "model graph" (the fitted values) so that I get something like the figure in the paper?
I assume the second graph's "fitted" line is actually the output of the estimated model, right? If so, that part seems to be missing from the script.
Thanks a lot!
EDIT 1:
Tried this:
# Actual values and fitted values
plot(sales ~ month, data= dat1, col="blue", lwd=1, type="l", xaxt = "n", xaxs="r",yaxs="r", xlab="", ylab="Total Sales");
par(new=TRUE)
plot(fitted(fit) ~ month, data= dat1, col="red", lwd=1, type="l", xaxs="r", yaxs="r", yaxt = "n", xlab="Month", ylab="Index", xaxt="n");
axis(4)
Output: Error in (function (formula, data = NULL, subset = NULL, na.action = na.fail, : variable lengths differ (found for 'month')
dat$s1 = c(NA, dat$sales[1:(nrow(dat)-1)])
This creates a new column s1 containing sales lagged by one period: the first element is NA, and the last sales value is dropped.
dat$s12 = c(rep(NA, 12), dat$sales[1:(nrow(dat)-12)])
This creates the column s12 (a 12-period lag): 12 leading NAs followed by the first nrow(dat)-12 values of dat$sales.
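To see the lagging concretely, here is a tiny made-up example (a hypothetical five-value sales vector, not from the paper's data):
sales <- c(10, 20, 30, 40, 50)
c(NA, sales[1:4])            # one-period lag:  NA 10 20 30 40
c(rep(NA, 2), sales[1:3])    # two-period lag:  NA NA 10 20 30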
dat1 = mdat[1:(nrow(mdat)-1), ]
dat2 = mdat[nrow(mdat), ]
dat1 contains all but the last observation (row); dat2 is only the last row. When predicting the response (sales), you only need to pass a data.frame with at least the columns on the right-hand side of the formula (also called explanatory variables), in this case s1, s12 and trends1, as the newdata argument to the predict() function. This is where dat2 is used.
predict.fit = predict(fit, newdata=dat2, se.fit=TRUE)
This line (which needs to run before the predict() call above) fits the model using dat1.
fit = lm(log(sales) ~ log(s1) + log(s12) + trends1, data=dat1)
fitted(fit) will give you the fitted values. Try predict(fit) and compare whether it's any different.
The semicolons at the end of each statement are redundant.
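As for the plotting question, here is a minimal sketch (assuming dat1 has a numeric or date column month and default row names). Two details matter: lm() drops the rows where s1 or s12 is NA, so the fitted values must be aligned with the rows actually used, and the model is fit on log(sales), so exp() puts the fitted values back on the sales scale:
used <- as.numeric(rownames(model.frame(fit)))   # rows actually used by lm()
plot(sales ~ month, data = dat1, type = "l", col = "blue",
     xlab = "Month", ylab = "Total Sales")
lines(dat1$month[used], exp(fitted(fit)), col = "red")
legend("topleft", legend = c("actual", "fitted"), col = c("blue", "red"), lty = 1)
This also explains the "variable lengths differ" error in your EDIT: fitted(fit) is shorter than dat1$month because of the dropped NA rows.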
Related
I am using R. I am following this tutorial (https://rviews.rstudio.com/2017/09/25/survival-analysis-with-r/) and I am trying to adapt the code for a similar problem.
In this tutorial, a statistical model is developed on a dataset, and then this statistical model is used to predict 3 new observations. We then plot the results for these 3 observations:
#load libraries
library(survival)
library(dplyr)
library(ranger)
library(data.table)
library(ggplot2)
#use the built in "lung" data set
#remove missing values (dataset is called "a")
a = na.omit(lung)
#create id variable
a$ID <- seq_along(a[,1])
#create test set with only the first 3 rows
new = a[1:3,]
#create a training set by removing first three rows
a = a[-c(1:3),]
#fit survival model (random survival forest)
r_fit <- ranger(Surv(time,status) ~ age + sex + ph.ecog + ph.karno + pat.karno + meal.cal + wt.loss, data = a, mtry = 4, importance = "permutation", splitrule = "extratrees", verbose = TRUE)
#create new intermediate variables required for the survival curves
death_times <- r_fit$unique.death.times
surv_prob <- data.frame(r_fit$survival)
avg_prob <- sapply(surv_prob, mean)
#use survival model to produce estimated survival curves for the first three observations
pred <- predict(r_fit, new, type = 'response')$survival
pred <- data.table(pred)
colnames(pred) <- as.character(r_fit$unique.death.times)
#plot the results for these 3 patients
# unlist() turns each one-row data.table into a numeric vector
plot(r_fit$unique.death.times, unlist(pred[1, ]), type = "l", col = "red")
lines(r_fit$unique.death.times, unlist(pred[2, ]), col = "green")
lines(r_fit$unique.death.times, unlist(pred[3, ]), col = "blue")
From here, I would like to try to add confidence intervals (confidence regions) to each of these 3 curves, so that each curve is drawn with a shaded band around it.
I found a previous stackoverflow post (survfit() Shade 95% confidence interval survival plot) that shows how to do something similar, but I am not sure how to extend the results from this post to each individual observation.
Does anyone know if there is a direct way to add these confidence intervals?
Thanks
If you create your plot using ggplot, you can use the geom_ribbon function to draw confidence intervals as follows:
ggplot(data = ...) +
  geom_line(aes(x = ..., y = ...), color = ...) +
  geom_ribbon(aes(x = ..., ymin = ..., ymax = ...), fill = ..., alpha = ...) +
  geom_line(aes(x = ..., y = ...), color = ...) +
  geom_ribbon(aes(x = ..., ymin = ..., ymax = ...), fill = ..., alpha = ...)
You can put + after geom_line and repeat the same steps for each observation.
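Applied to the ranger example above, a minimal sketch for one patient could look like the following. Note that lo1 and hi1 are hypothetical, precomputed lower/upper bounds: ranger does not return pointwise standard errors for its survival predictions, so the bounds have to come from somewhere else (e.g. bootstrapping):
library(ggplot2)
df1 <- data.frame(time = r_fit$unique.death.times,
                  surv = unlist(pred[1, ]),
                  lo = lo1,   # hypothetical lower bound
                  hi = hi1)   # hypothetical upper bound
ggplot(df1) +
  geom_ribbon(aes(x = time, ymin = lo, ymax = hi), fill = "red", alpha = 0.2) +
  geom_line(aes(x = time, y = surv), color = "red")
Repeat with further geom_line/geom_ribbon layers for the other two patients.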
You can also check:
Having trouble plotting multiple data sets and their confidence intervals on the same GGplot. Data Frame included and
https://bookdown.org/ripberjt/labbook/appendix-guide-to-data-visualization.html
I have set up a logistic regression model in R and successfully plotted the points of the model to show a relationship in the dataset. I am having trouble showing the line graph of the prediction. The model predicts readmission rates of a hospital based on the length of the initial stay (in days). Here is my code:
mydata <- read.csv(file = 'C:\\Users\\nickg\\Downloads\\3kfid8emf9rkc9ek30sf\\medical_clean.csv', header=TRUE)[,c("Initial_days","ReAdmis")]
head(mydata)
mydata$ReAdmis.f <- factor(mydata$ReAdmis)
logfit <- glm(mydata$ReAdmis.f ~ mydata$Initial_days, data = mydata, family = binomial)
summary(logfit)
range(mydata$Initial_days)
xweight <- seq(0, 79.992, .008)
yweight <- predict(logfit, list(xweight), type = "response")
plot(mydata$Initial_days, mydata$ReAdmis.f, pch = 16, xlab = "Initial Days", ylab = "ReAdmission Y/N")
lines(xweight, yweight)
As you can see I have the model set up and ranges described by xweight and yweight, but nothing shows up for the line.
Always use curve for this:
plot(ReAdmis.f ~ Initial_days, data = mydata,
     pch = 16, xlab = "Initial Days", ylab = "ReAdmission Y/N")
# x is created by the curve function based on the plot's x limits;
# note that newdata must contain the x variable with exactly the same
# name as in the original data
curve(predict(logfit, newdata = data.frame(Initial_days = x),
              type = "response"),
      add = TRUE)
However, the issue here could also be that your y variable is a factor (internally coded as 1 and 2 when there are two levels), whereas logistic regression predictions always lie in the interval [0, 1]. You should convert ReAdmis.f into 0/1 integer values before running the code.
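A minimal sketch of that conversion, plus one more fix worth making: because the posted model was fit with mydata$ inside the formula, predict() cannot match the Initial_days column in newdata, so the model should be refit with a clean formula:
# 0/1 response: the second factor level is coded as 1
mydata$ReAdmis01 <- as.integer(mydata$ReAdmis.f) - 1
# refit without mydata$ in the formula so predict(..., newdata) works
logfit <- glm(ReAdmis01 ~ Initial_days, data = mydata, family = binomial)
plot(ReAdmis01 ~ Initial_days, data = mydata, pch = 16,
     xlab = "Initial Days", ylab = "ReAdmission Y/N")
curve(predict(logfit, newdata = data.frame(Initial_days = x), type = "response"),
      add = TRUE)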
I am using the R programming language. I am trying to follow this tutorial: https://rdrr.io/cran/randomForestSRC/man/plot.competing.risk.rfsrc.html
This tutorial shows how to use the "survival random forest" algorithm, an algorithm for analyzing survival data. In this example, the "follic" data set is used, and the survival random forest algorithm analyzes the instantaneous hazard of an observation experiencing "status 1" vs "status 2" (this is called "competing risks").
In the code below, the survival random forest model is trained on the follic data set using all observations except the last two observations. Then, this model is used to predict the hazards of the last two observations:
#load library
library(randomForestSRC)
#load data
data(follic, package = "randomForestSRC")
#train model on all observations except the last 2 observations
follic.obj <- rfsrc(Surv(time, status) ~ ., follic[c(1:539),], nsplit = 3, ntree = 100)
#use model to predict the last two observations
f <- predict(follic.obj, follic[540:541, ])
#plot individual curves - does not work
plot.competing.risk(f)
However, this seems to produce the average hazards for the last two observations experiencing "status 1 vs status 2".
Is there a way to plot the individual hazards of the first observation and the second observation?
Thanks
EDIT1:
I know how to do this for other functions in this package, e.g. here you can plot these curves for 7 observations at once:
data(veteran, package = "randomForestSRC")
plot.survival(rfsrc(Surv(time, status)~ ., veteran), cens.model = "rfsrc")
## pbc data
data(pbc, package = "randomForestSRC")
pbc.obj <- rfsrc(Surv(days, status) ~ ., pbc)
## use subset to focus on specific individuals
plot.survival(pbc.obj, subset = c(3, 10))
This example seems to show the predicted survival curves for 7 observations at once (plus the confidence intervals; the red line is the average). But I still do not know how to do this for the "plot.competing.risk" function.
EDIT2:
I think there might be an indirect way to solve this - you can predict each observation individually:
#use model to predict the last two observations individually
f1 <- predict(follic.obj, follic[540, ])
f2 <- predict(follic.obj, follic[541, ])
#plot individual curves
plot.competing.risk(f1)
plot.competing.risk(f2)
But I was hoping there was a more straightforward way to do this. Does anyone know how?
One possible way is to modify the plot.competing.risk function to draw individual lines, looping over the individuals and overlaying their curves, as shown below.
#use model to predict the last three observations
f <- predict(follic.obj, follic[539:541, ])
x <- f
par(mfrow = c(2, 2))
for (k in 1:3) {  # k indexes the type of plot (CSCHF, CIF, CPC)
  for (i in 1:dim(x$chf)[1]) {  # i indexes the individuals in x
    # the original function takes group means, e.g.
    # cschf <- apply(x$chf, c(2, 3), mean, na.rm = TRUE);
    # here we take the individual values instead
    cschf <- x$chf[i, , ]
    cif <- x$cif[i, , ]
    cpc <- do.call(cbind, lapply(1:ncol(cif), function(j) {
      cif[, j] / (1 - rowSums(cif[, -j, drop = FALSE]))
    }))
    if (k == 1) { matx <- cschf; yrange <- range(x$chf) }
    if (k == 2) { matx <- cif;   yrange <- range(x$cif) }
    if (k == 3) { matx <- cpc;   yrange <- c(0, 1) }  # manually assigned, for now
    ylab <- c("Cause-Specific CHF", "Probability (%)", "Probability (%)")[k]
    # add = TRUE after the first individual overlays the lines in one panel
    matplot(x$time.interest, matx, type = "l", lty = 1, lwd = 3, col = 1:2,
            add = (i != 1), ylim = yrange, xlab = "Time", ylab = ylab)
  }
  leg.txt <- paste(c("CSCHF", "CIF", "CPC")[k], 1:2, " ")
  legend("bottomright", legend = leg.txt, col = 1:2, lty = 1, lwd = 3)
}
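Alternatively, a more compact sketch for a single event type. This assumes, as the loop above does, that the first dimension of f$cif indexes individuals, the second time points, and the third the event type:
# one line per individual for the event-1 cumulative incidence function
matplot(f$time.interest, t(f$cif[, , 1]), type = "l", lty = 1, lwd = 2,
        xlab = "Time", ylab = "CIF (event 1)")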
I am running into an error when I try to compute predicted values from a model-averaged object using the MuMIn package's predict.averaging. I have been assured that when the full argument is set to FALSE, the function should return predicted values based on the conditional (subset) average coefficients. However, it returns an error. See the example below using the mtcars dataset; it is very similar to my actual setup.
library(MuMIn)
options(na.action = "na.fail")
global.model <- glm(mpg ~ hp + drat + wt,
data=mtcars)
dr <- dredge(global.model)
mod.avg <- model.avg(dr, subset = delta < 2, fit = T)
summary(mod.avg)
predict(mod.avg, se.fit = TRUE, full = FALSE)
The error indicates that full is ignored, meaning that the full-model coefficients are used for the predicted values (not what I want). I have confirmed this by some simple manual checking of values. It is also evident by examining the predict() output: notice how the values jump, suggesting that a coefficient is set to zero or something. It has also been suggested that changing glm to lm will fix the issue, but it does not, at least for me.
Thanks!
Comparing predictions from the component models to the averaged ones, you can see that the "full averaged" predictions fall within the range of the component predictions (which is as it should be).
On the other hand, the "subset averaged" coefficients produce predictions that are quite biased. This is because the effects are inflated: the zero coefficients are ignored when calculating the mean.
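A toy illustration of the difference (made-up numbers, not from the mtcars fit), with two models of equal weight where the second model omits the term:
w <- c(0.5, 0.5)   # Akaike weights of two models
b <- c(0.8, 0)     # the second model omits the term, so its coefficient is 0
sum(w * b)                      # full average: 0.4 (zeros included)
sum(w * b) / sum(w[b != 0])     # subset average: 0.8 (zeros ignored)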
# Full/subset averaged predictions
pyfa <- predict(mod.avg, full = TRUE)
pysa <- predict(mod.avg, full = FALSE)
# Note: full=FALSE works only with se.fit=FALSE
# Predictions from component models
pycm <- do.call("cbind", lapply(get.models(mod.avg, TRUE), predict))
n <- ncol(pycm)
k <- rep(1:3, c(n, 1, 1))
lty <- c(2,1,1); lwd <- c(1,2,2); col <- c(3,1,2)
matplot(cbind(pycm, pyfa, pysa), type = "l",
lty = lty[k], lwd = lwd[k], col = col[k],
ylab = "predicted")
legend("topleft", legend = c("component", "full average", "subset average"),
lty = lty, lwd = lwd, col = col)
I have this code
factors<-read.csv("India_Factors.csv",header=TRUE)
marketfactor<-factors[,4]
sizefactor<-factors[,5]
valuefactor<-factors[,6]
library(tseries)   # get.hist.quote() comes from the tseries package
dati <- get.hist.quote("SI", quote = "AdjClose", compression = "m")
returns<-diff(dati)
regression<-lm(returns ~ marketfactor + sizefactor + valuefactor,na.action=na.omit)
that does multilinear regression.
I want to plot, on a 2D plane, the returns against one factor (and this is trivial, of course) with the projection of the linear regression hyperplane for that specific factor superimposed. To be clearer, the result should look like this: wolfram demonstrations (see the snapshots).
Any help will be greatly appreciated.
Thank you for your time and have a nice weekend.
Giorgio.
The points in my comment notwithstanding, here is the canonical way to generate output from a fitted model in R for combinations of predictors. It isn't really clear what the plots you want are showing, but the ones that make sense to me are partial plots: one variable is varied over its range whilst the others are held at some common value. Here I use the sample mean when holding a variable constant.
First, some dummy data with only two covariates, though this extends to any number:
set.seed(1)
dat <- data.frame(y = rnorm(100))
dat <- transform(dat,
x1 = 0.2 + (0.4 * y) + rnorm(100),
x2 = 2.4 + (2.3 * y) + rnorm(100))
Fit the regression model
mod <- lm(y ~ x1 + x2, data = dat)
Next, some data values to predict at using the model. You could do all variables in a single prediction and then subset the resulting object to plot only the relevant rows. Alternatively, and more clearly (though more verbosely), you can deal with each variable separately. Below I create two data frames, one per covariate in the model. In each data frame I generate 100 values over the range of the covariate being varied and repeat the mean value of the other covariate(s).
pdatx1 <- with(dat, data.frame(x1 = seq(min(x1), max(x1), length = 100),
x2 = rep(mean(x2), 100)))
pdatx2 <- with(dat, data.frame(x1 = rep(mean(x1), 100),
x2 = seq(min(x2), max(x2), length = 100)))
In a linear regression with straight lines you really don't need 100 values; the two end points of the covariate's range would do. However, for models where the fitted function is not linear, you need to predict at more locations.
Next, use the model to predict at these data points
pdatx1 <- transform(pdatx1, yhat = predict(mod, pdatx1))
pdatx2 <- transform(pdatx2, yhat = predict(mod, pdatx2))
Now we are ready to draw the partial plots. First compute a range for the y axis. Again, this is mostly redundant here, but if you are adding confidence intervals you will need to include their values in this range:
ylim <- range(pdatx1$yhat, pdatx2$yhat, dat$y)
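If you do add confidence intervals, here is a sketch of one way to get them using predict.lm's interval argument (shown for x1 only):
# pointwise confidence intervals for the partial fit on x1
ci1 <- predict(mod, pdatx1, interval = "confidence")
pdatx1 <- cbind(pdatx1, lwr = ci1[, "lwr"], upr = ci1[, "upr"])
ylim <- range(pdatx1$yhat, pdatx1$lwr, pdatx1$upr, dat$y)
# then, after the corresponding plot() call below:
# lines(lwr ~ x1, data = pdatx1, col = "red", lty = 2)
# lines(upr ~ x1, data = pdatx1, col = "red", lty = 2)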
To plot (here putting two figures on the same plot device) we can use the following code
layout(matrix(1:2, ncol = 2))
plot(y ~ x1, data = dat)
lines(yhat ~ x1, data = pdatx1, col = "red", lwd = 2)
plot(y ~ x2, data = dat)
lines(yhat ~ x2, data = pdatx2, col = "red", lwd = 2)
layout(1)
Which produces the two partial plots side by side, with the fitted partial relationship drawn in red in each.