Predict future values using polynomial regression in R

I was trying to predict future values of a sample using polynomial regression in R. The y values within the sample form a wave pattern.
For example:
x = 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
y = 1,2,3,4,5,4,3,2,1,0,1,2,3,4,5,4
But when the graph is plotted for the future values, the resulting y values are completely different from what was expected. Instead of a wave pattern, I get a graph where the y values keep increasing.
futurY = 17,18,19,20,21,22
I tried different degrees of polynomial regression, but the predicted results for futurY were drastically different from what was expected.
Following is the sample R code which was used to get the results:
dfram <- data.frame('x'=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16))
dfram$y <- c(1,2,3,4,5,4,3,2,1,0,1,2,3,4,5,4)
plot(dfram$x, dfram$y, type="l", lwd=3)
pred <- data.frame('x'=c(17,18,19,20,21,22))
myFit <- lm(y ~ poly(x,5), data=dfram)
newdata <- predict(myFit, pred)
print(newdata)
plot(pred[,1],data.frame(newdata)[,1],type="l",col="red", lwd=3)
Is this the correct technique to use for predicting the unknown future y values, or should I be using other techniques like forecasting?

# Reproducing your data frame
dfram <- data.frame("x" = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16),
"y" = c(1,2,3,4,5,4,3,2,1,0,1,2,3,4,5,4))
From your graph I got the phase and period of the signal. There are better ways of estimating them automatically (a rough sketch of one option follows the phase/period snippet below).
# Phase and period
fase = 1
per = 10
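Not part of the original answer, but one rough automatic way to estimate the period is the periodogram from base R's stats package. On a series this short the frequency grid is coarse, so treat the result as a starting point rather than an exact value:
# Sketch: estimate the dominant period from the periodogram (stats::spec.pgram)
sp <- spec.pgram(dfram$y, taper = 0, plot = FALSE)
per_est <- 1 / sp$freq[which.max(sp$spec)]  # period (in samples) of the strongest frequency component
per_est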
In the linear model formula I've put the triangular-wave equations.
fit <- lm(y ~ I((((trunc((x-fase)/(per/2))%%2)*2)-1) * (x-fase)%%(per/2))
+ I((((trunc((x-fase)/(per/2))%%2)*2)-1) * ((per/2)-((x-fase)%%(per/2))))
,data=dfram)
# Predict the old data
p_olddata <- predict(fit,type="response")
# Predict the new data
newdata <- data.frame('x'=c(17,18,19,20,21,22))
p_newdata <- predict(fit,newdata,type="response")
# Plotting old and new data
plot(x = c(dfram$x, newdata$x),
     y = c(p_olddata, p_newdata),
     col = c(rep("blue", length(p_olddata)), rep("green", length(p_newdata))),
     xlab = "x",
     ylab = "y")
lines(dfram)
Where the black line is the original signal, the blue circles are the prediction for the original points and the green circles are the prediction for the new data.
The graph shows a perfect fit for the model because there is no noise in the data. A real dataset will have noise, so the fit will not look as clean as this.

Related

Asymptotic regression function not correlating with raw data

I'm trying to model raw data by an asymptotic function with the equation $$f(x) = a + (b-a)(1-\exp(-c x))$$ using R. To do so I used the following code:
# packages mentioned in the post: drc and aomisc (drm and DRC.asymReg), ggpubr (plotting)
library(drc)
library(aomisc)
library(ggpubr)
rawData <- import("path/StackTestData.tsv")  # import() is presumably rio::import (the package is not stated in the post)
# executing regression
X <- rawData$x
Y <- rawData$y
model <- drm(Y ~ X, fct = DRC.asymReg())
# creating the regression function
f_0_ <- model$coefficients[1] #value for y if x=0
steepness <- model$coefficients[2]
plateau <- model$coefficients[3]
eq <- function(x){f_0_+(plateau-f_0_)*(1-exp(-steepness*x))}
# plotting the regression function together with the raw data
ggplot(rawData,aes(x=x,y=y)) +
geom_line(col="red") +
stat_function(fun=eq,col="blue") +
ylim(10,12.5)
In some cases I get a proper regression function. However, with the attached data I don't: the regression function shows no correlation with the raw data whatsoever, as shown in the figure below. Can you perhaps offer a better solution for performing the asymptotic regression, or do you know where the error lies?
Best, Max
R 4.1.2 was used with RStudio 1.4.1106. For plotting, the ggpubr package was used; for DRC.asymReg(), the aomisc and drc packages were loaded.
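One alternative worth trying (not from the original post) is base R's nls() with the self-starting asymptotic model SSasymp(), which fits the same functional form; whether it converges will depend on the data:
# SSasymp fits y = Asym + (R0 - Asym) * exp(-exp(lrc) * x),
# i.e. a = R0, b = Asym and c = exp(lrc) in the notation of the question
fit_nls <- nls(y ~ SSasymp(x, Asym, R0, lrc), data = rawData)
summary(fit_nls)
rawData$pred <- predict(fit_nls)  # fitted curve for plotting against the raw data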

Add raw data points to jp.int (sjPlot)

For my manuscript, I plotted a lme with an interaction of two continuous variables:
Create the data:
library(nlme)    # for lme()
library(sjPlot)  # for sjp.int()
mydata <- data.frame(SID  = sample(1:150, 400, replace = TRUE),
                     age  = sample(50:70, 400, replace = TRUE),
                     sex  = sample(c("Male", "Female"), 200, replace = TRUE),
                     time = seq(0.7, 6.2, length.out = 400),
                     Vol  = rnorm(400),
                     HCD  = rnorm(400))
mydata$time <- as.numeric(mydata$time)
Run the model:
model <- lme(HCD ~ age*time+sex*time+Vol*time, random=~time|SID, data=mydata)
Make plot:
sjp.int(model, swap.pred=T, show.ci=T, mdrt.values="meansd")
The reviewer now wants me to add the raw data points to this plot. How can I do this? I tried adding geom_point() referring to mydata, but that is not possible.
Any ideas?
Update:
I thought that maybe I could extract the random slopes of HCD, residualize HCD and Vol for the covariates, and plot those two against each other to make things easier (then I could plot the points in a 2D plot).
So I tried to extract the slopes and use them to fit a linear regression, but the results are different (in the reproducible example less significant; in my own data the interaction became non-significant, whereas it was significant in the lme). I'm not sure what that means, or whether it just shows that I should not try to plot it this way.
get the slopes:
model <- lme(HCD ~ time, random=~time|SID, data=mydata)
slopes <- rbind(row.names(model$coefficients$random$SID), model$coef$random$SID[,2])
slopes2 <- data.frame(matrix(unlist(slopes), nrow=144, byrow=T))
names(slopes2)[1] <- "SID"
names(slopes2)[2] <- "slopes"
(save the slopes2 and reopen, because somehow R sees it as a factor)
Then create a cross-sectional dataframe and merge the slopes:
mydata$time2 <- round(mydata$time)
new <- reshape(mydata,idvar = "SID", timevar="time2", direction="wide")
newdata <- dplyr::left_join(new, slop, by="SID")
The lm:
modelw <- lm(slop$slopes ~ age.1+sex.1+Vol.1, data=newdata)
Vol now has a p-value of 0.8 (previously this was 0.14)

Predict out of sample using flexsurvreg in R

I have the following model in R
library(flexsurv)
data(ovarian)
model = flexsurvreg(Surv(futime, fustat) ~ ecog.ps + rx, data = ovarian, dist='weibull')
model
predict(model,data = ovarian, type = 'response')
The model summary output was shown as an image in the original post.
I am trying to predict the survival time using the predict function in R and get an error (also shown as an image in the original post).
How can I predict expected lifetime using this flexsurvreg model?
I understand that the documentation mentions a totlos.fs function, but this data does not seem to have a trans variable that totlos.fs requires to provide an output.
If there is no other alternative to totlos.fs how can I create a trans variable in this data and handle it along with existing covariates?
Please advise.
Section 3 of the supplementary examples document in the flexsurv documentation has an example in which the predicted values are calculated directly from the model equation. As you are using the Weibull distribution (with n = 2 parameters) I believe this should work:
n <- 2  # number of distribution parameters for the Weibull
pred.model <- model.matrix(model) %*% model$res[-(1:n), "est"]
Cheers
Nik, I know your question is an old one, but see below how I hacked a way to do it. It involves retrieving the shape and rate parameters from your fit of the training data, and then using qgompertz() from flexsurv instead of predict. Please excuse the use of my own self-contained example code, but you should be able to follow along.
# generate the training data "lung1" from data(lung) in survival package
# hacked way for truncating the lung data to 2 years of follow up
require(survival)
lung$yrs <- lung$time/365
lung1 <- lung[c("status", "yrs")]
lung1$status[ lung1$yrs >2] <- 1
lung1$yrs[ lung1$yrs >2] <- 2
# from the training data build KM to obtain survival %s
s <- Surv(time=lung1$yrs, event=lung1$status)
km.lung <- survfit(s ~ 1, data=lung1)
plot(km.lung)
# generate dataframe to use later for plotting
cut.length <- sum((km.lung$time <= 2)) # so I can create example test data
test.data <- data.frame(yrs = km.lung$time[1:cut.length] , surv=round(km.lung$surv[1:cut.length], 3))
##
## doing the same as above with gompertz
##
require(flexsurv) #needed to run gompertz model
s <- Surv(time=lung1$yrs, event=lung1$status)
gomp <- flexsurvreg(s ~ 1, data=lung1, dist="gompertz") # run this to get shape and rate estimates for gompertz
gomp # notice the shape and rate values
# create variables for these values
g.shape <- 0.5866
g.rate <- 0.5816
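Instead of hard-coding the rounded values, the estimates could also be pulled straight from the fitted object (a small sketch, assuming the usual flexsurvreg coefficient table stored in $res):
g.shape <- gomp$res["shape", "est"]
g.rate  <- gomp$res["rate", "est"]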
##
## plot data and vizualize the gomperts
##
# vars for plotting
df1 <- test.data
xvar <- "yrs"
yvar <- "surv"
extendedtime <- 3 #
ylim1 <- c(0,1)
xlim1 <- c(0, extendedtime)
# plot the survival % for training data
plot(df1[,yvar]~df1[,xvar], type="S", ylab="", xlab="", lwd=3, xlim=xlim1, ylim=ylim1)
# Nik -- here is where the magic happens: qgompertz() converts survival probabilities back into times
lines(qgompertz(seq(.01, .99, by = .01), shape = g.shape, rate = g.rate), seq(.99, .01, by = -.01), col = "red", lwd = 2, lty = 2)
# generate a km curve from the testing data
s <- Surv(time=lung$yrs, event=lung$status)
km.lung <- survfit(s ~ 1, data=lung)
par(new=T)
# now draw remaining survival curve from the testing section
plot(km.lung$surv[(cut.length+1):length(km.lung$time)]~km.lung$time[(cut.length+1):length(km.lung$time)], type="S", col="blue", ylab="", xlab="", lwd=3, xlim=xlim1, ylim=ylim1)

Abline not working with Linear regression Model

I have some data in R that I want to test on various models. I have split the data into two sets: 80% training and 20% testing. Now I want to train a linear model on the training set and predict on the testing set.
This is what I have done so far:
temp<-lm(formula = cityMpg ~ peakRpm+horsePower+wheelBase , data=train)
temp_test<- predict(temp,test)
plot(temp_test)
Here I get the scatter plot. Now I just want a line in this scatter plot.
When I use abline(temp_test), I get an error.
I want the line to be added automatically; I do not wish to specify the coordinates.
The error is:
Error in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...) :
invalid a=, b= specification
As pointed out above, this is a bit tricky for a multi-dimensional model.
Get some data (you neglected to include a reproducible example: see http://tinyurl.com/reproducible-000 ...)
library(foreign)
dat <- read.arff(url("http://www.cs.umb.edu/~rickb/files/UCI/autos.arff"))
Split into training and test data sets:
train <- dat[1:150,]
test <- dat[151:nrow(dat),]
The variable names are a bit awkward for R (the dashes are interpreted as minus operators, so we have to use back-quotes to protect the names):
fit <- lm(`city-mpg` ~ `peak-rpm`+horsepower+`wheel-base`,data=train)
temp_test <- predict(fit,test)
Plot the predictions vs peak RPM:
par(las=1,bty="l") ## cosmetic
plot(test[["peak-rpm"]],temp_test,xlab="peak rpm",ylab="predicted")
In order to add the line, we have to adjust the intercept according to some baseline values of the other parameters: we'll use the mean (another alternative is to center all the predictor variables before fitting the model):
cf <- coef(fit)
abline(a=cf["(Intercept)"]+
mean(test$horsepower)*cf["horsepower"]+
mean(test$`wheel-base`)*cf["`wheel-base`"],
b=coef(fit)["`peak-rpm`"])
Another way to do this is to use predict():
newdat <- with(test,
data.frame(horsepower=mean(horsepower),
"wheel-base"=mean(`wheel-base`),
"peak-rpm"=seq(min(`peak-rpm`),
max(`peak-rpm`),
length=41),
check.names=FALSE))
newdat["city-mpg"] <- predict(fit,newdat)
with(newdat,lines(`peak-rpm`,`city-mpg`,col=4))
(41 points is silly for a straight line -- we could have used just 2 -- but will work well if you want to plot something curved, like confidence intervals or a nonlinear fit.)
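For instance (a sketch, not in the original answer), confidence bands along the same grid could be added with predict()'s interval argument:
ci <- predict(fit, newdat, interval = "confidence")
lines(newdat$`peak-rpm`, ci[, "lwr"], col = 4, lty = 2)
lines(newdat$`peak-rpm`, ci[, "upr"], col = 4, lty = 2)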
Alternatively you could just fit the marginal model, but the actual fitted line is somewhat different (it will only be the same if all the predictors are orthogonal to each other):
fit2 <- lm(`city-mpg` ~ `peak-rpm`,data=train)
abline(fit2,col="red")

How to get predicted values on Sigmoid Growth model in R

I am trying to forecast future revenue using a sigmoid growth model in R. The model is:
Y = a/( 1+ce^(-bX) ) + noise
My code:
n <- 100  # n was not defined in the original post; any number of points works
x <- seq(-5,5,length=n)
y <- 1/(1+exp(-x))
plot(y~x, type='l', lwd=3)
title(main='Sigmoid Growth')
I could draw the plot, but I don't know how to get future values. Suppose I want to predict the revenue values for the next 6 years.
Make y a function, and plot that (plot has special support for functions):
y <- function(x) 1/(1+exp(-x))
plot(y,-5,11,type="l",lwd=3)
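If you have noisy revenue data and want fitted future values rather than just the curve, one option (a sketch, not part of the answer above) is to fit the logistic curve with nls() and the self-starting SSlogis() model, then predict at future x values:
# SSlogis fits y = Asym / (1 + exp((xmid - x)/scal)),
# which is the same curve as a/(1 + c*exp(-b*x)) with c = exp(xmid/scal) and b = 1/scal
set.seed(1)
n   <- 20
x   <- seq(-5, 5, length = n)
y   <- 1/(1+exp(-x)) + rnorm(n, sd = 0.05)  # simulated noisy observations
dat <- data.frame(x, y)
fit <- nls(y ~ SSlogis(x, Asym, xmid, scal), data = dat)
future <- data.frame(x = seq(6, 11, by = 1))  # e.g. the next 6 years
predict(fit, newdata = future)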
