R: How to plot custom range of polynomial produced by lm poly fit - r

I'm confused by the coefficients produced by the output of lm
Here's a copy of the data I'm working with
(postprocessed.csv)
"","time","value"
"1",1,2.61066016308988
"2",2,3.41246054742996
"3",3,3.8608767964033
"4",4,4.28686048552237
"5",5,4.4923132964825
"6",6,4.50557049744317
"7",7,4.50944447661246
"8",8,4.51097373134893
"9",9,4.48788748823809
"10",10,4.34603985656981
"11",11,4.28677073671406
"12",12,4.20065901625172
"13",13,4.02514194962519
"14",14,3.91360194972916
"15",15,3.85865748409081
"16",16,3.81318053258601
"17",17,3.70380706527433
"18",18,3.61552922363713
"19",19,3.61405310598722
"20",20,3.64591327503384
"21",21,3.70234435835577
"22",22,3.73503970503372
"23",23,3.81003078640584
"24",24,3.88201196162666
"25",25,3.89872518158949
"26",26,3.97432743542362
"27",27,4.2523675144599
"28",28,4.34654855854847
"29",29,4.49276038902684
"30",30,4.67830892029687
"31",31,4.91896819673664
"32",32,5.04350767355202
"33",33,5.09073406942046
"34",34,5.18510849382162
"35",35,5.18353176529036
"36",36,5.2210776270173
"37",37,5.22643491929207
"38",38,5.11137006553725
"39",39,5.01052467981257
"40",40,5.0361056705898
"41",41,5.18149486951409
"42",42,5.36334869132276
"43",43,5.43053620818444
"44",44,5.60001072279525
I have fitted a 4th order polynomial to this data using the following script:
library(ggplot2)
library(matrixStats)
library(forecast)
df_input <- read.csv("postprocessed.csv")
x <- df_input$time
y <- df_input$value
df <- data.frame(x, y)
poly4model <- lm(y~poly(x, degree=4), data=df)
v <- seq(30, 40)
vv <- poly4model$coefficients[1] +
poly4model$coefficients[2] * v +
poly4model$coefficients[3] * (v ^ 2) +
poly4model$coefficients[4] * (v ^ 3) +
poly4model$coefficients[5] * (v ^ 4)
pdf("postprocessed.pdf")
plot(df)
lines(v, vv, col="red", pch=20, lw=3)
dev.off()
I initially tried using the predict function to do this, but couldn't get that to work, so resorted to implementing this "workaround" using some new vectors v and vv to store the data for the line in the region I am trying to plot.
Ultimately, I am trying to do this:
Fit a 4th order polynomial to the data
Plot the 4th order polynomial over the range of data in one color
Plot the 4th order polynomial over the range from the last value to the last value + 10 (prediction) in a different color
At the moment I am fairly sure using v and vv to do this is not "the best way", however I would have thought it should work. What is happening is that I get very large values.
Here is a screenshot from Desmos. I copied and pasted the same coefficients as shown by typing poly4model$coefficients into the console. However, something must have gone wrong because this function is nothing like the data.
I think I've provided enough info to be able to run this short script. However I will add the pdf as well.

It is easiest to use the predict function to create your line. To do that, you pass the model and a data frame with the desired independent variables to the predict function.
x <- df_input$time
y <- df_input$value
df <- data.frame(x, y)
poly4model <- lm(y~poly(x, degree=4), data=df)
v <- seq(30, 40)
#Notice the column in the dataframe is the same variable name
# as the variable in the model!
predict(poly4model, data.frame(x=v))
plot(df)
lines(v, predict(poly4model, data.frame(x=v)), col="red", lwd=3)
NOTE
The function poly "Returns or evaluates orthogonal polynomials of degree 1 to degree over the specified set of points x: these are all orthogonal to the constant polynomial of degree 0." To return the "normal" polynomial coefficients one needs to use the "raw=TRUE" option in the function.
poly4model <- lm(y~poly(x, degree=4, raw=TRUE), data=df)
Now your equation above will work.
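As a rough sketch of both points above, the following checks that hand evaluation of the polynomial agrees with predict once raw=TRUE is used, and then draws the fitted curve over the data in one colour and a 10-step extrapolation in another. The data here is a synthetic stand-in for postprocessed.csv, and the names orth_fit, raw_fit, x_fit and x_pred are made up for illustration.

```r
# Synthetic stand-in for postprocessed.csv (an assumption, not the asker's data)
set.seed(1)
df <- data.frame(x = 1:44)
df$y <- 3 + 0.05 * df$x + sin(df$x / 6) + rnorm(44, sd = 0.1)

orth_fit <- lm(y ~ poly(x, degree = 4), data = df)              # orthogonal basis
raw_fit  <- lm(y ~ poly(x, degree = 4, raw = TRUE), data = df)  # plain powers of x

# Hand evaluation b0 + b1*v + ... + b4*v^4 is only meaningful for the raw fit
v <- seq(30, 40)
manual <- as.vector(outer(v, 0:4, "^") %*% coef(raw_fit))
max_diff_manual <- max(abs(manual - predict(raw_fit, data.frame(x = v))))

# Both parameterisations describe the same fitted curve
max_diff_basis <- max(abs(predict(raw_fit, data.frame(x = v)) -
                          predict(orth_fit, data.frame(x = v))))

# Fitted curve over the data in one colour, 10 steps beyond it in another
x_fit  <- seq(min(df$x), max(df$x), length.out = 200)
x_pred <- seq(max(df$x), max(df$x) + 10, length.out = 50)
y_fit  <- predict(orth_fit, data.frame(x = x_fit))
y_pred <- predict(orth_fit, data.frame(x = x_pred))

plot(df$x, df$y, xlab = "x", ylab = "y",
     xlim = range(x_fit, x_pred), ylim = range(df$y, y_fit, y_pred))
lines(x_fit, y_fit, col = "blue", lwd = 3)
lines(x_pred, y_pred, col = "red", lwd = 3)
```

Because predict works with either basis, the orthogonal fit can be kept for plotting, and raw=TRUE is only needed when the coefficients themselves are copied elsewhere. Extrapolating a 4th-order polynomial 10 steps past the data can diverge quickly, so the red segment is best read with caution.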

Related

How to do an exponential regression model?

I have a small data base (txt file).
I want to obtain an exponential regression in R.
The commands that I am using are:
regression <- read.delim("C:/Users/david/OneDrive/Desktop/regression.txt")
View(regression)
source('~/.active-rstudio-document', echo=TRUE)
m <- nls(DelSqRho ~ (1-exp(-a*(d-b)**2)), data=regression, start=list(a=1, b=1))
y_est<-predict(m,regression$d)
plot(x,y)
lines(x,y_est)
summary(m)
But, when I run it, I get an error:
Error in nls(DelSqRho ~ (1 - exp(-a * (d - b)^2)), data = regression, :
step factor 0.000488281 reduced below 'minFactor' of 0.000976562
and I do not know how to solve it. How can I obtain the exponential regression? Any hint, please?
nls is quite sensitive to the values of the starting parameters and so you want to choose values that give a reasonable fit to the data (minpack.lm::nlsLM can be a bit more forgiving).
You can plot the curve at your starting values of a=1 and b=1 and see that it doesn't do a great job of capturing the curve.
regression <- read.delim("regression.txt")
with(regression, plot(d, DelSqRho, ylim=c(-3, 1)))
xs <- seq(min(regression$d), max(regression$d), length=100)
a <- 1; b <- 1; ys <- 1 - exp(-a* (xs - b)**2)
lines(xs, ys)
One way to get starting values is by rearranging the objective function.
y = 1 - exp(-a*(x-b)**2) can be rearranged as log(1/(1-y)) = a(x-b)^2 = ab^2 - 2abx + ax^2 (here y must be less than one). This is a quadratic in x, so linear regression can then be used to get an estimate of a (the coefficient of x^2) and of b (from the intercept ab^2).
start_m <- lm(log(1/(1-DelSqRho)) ~ poly(d, 2, raw=TRUE), regression)
unname(a <- coef(start_m)[3]) # as `a` is aligned with the quadratic term
# [1] -0.2345953
unname(b <- sqrt(coef(start_m)[1]/coef(start_m)[3]))
# [1] 2.933345
(Sometimes it is not possible to rearrange the data in this way and you can try to get a rough idea of the parameters by plotting the curves at various starting parameters. nls2 can also do a brute force search or grid search over starting parameters.)
We can now try to estimate the nls model at these parameters:
m <- nls(DelSqRho ~ 1-exp(-a*(d-b)**2), data=regression, start=list(a=a, b=b))
coef(m)
# a b
# -0.2379078 2.8868374
And plot the results:
# note that `newdata` must be a named list or data frame
# in which to look for variables with which to predict.
y_est <- predict(m, newdata=data.frame(d=xs))
with(regression, plot(d, DelSqRho))
lines(xs, y_est, col="red", lwd=2)
The fit isn't great and is perhaps suggestive that a more flexible model is required.
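To illustrate the starting-value idea end to end, here is a minimal sketch on simulated data where the true parameters are known. The data frame sim and the true values a = 0.3 and b = 3 are assumptions chosen for the example, not the asker's data. As a small variation on the approach above, b is recovered from the linear term (-2ab) rather than from the intercept, which keeps its sign.

```r
# Simulated data with known parameters (assumption: a = 0.3, b = 3)
set.seed(42)
sim <- data.frame(d = seq(0, 6, by = 0.25))
sim$y <- 1 - exp(-0.3 * (sim$d - 3)^2) + rnorm(nrow(sim), sd = 0.005)

# Linearise: log(1/(1-y)) = a*b^2 - 2*a*b*d + a*d^2 (requires y < 1)
start_m <- lm(log(1 / (1 - y)) ~ poly(d, 2, raw = TRUE), data = sim)
cf <- unname(coef(start_m))
a0 <- cf[3]               # coefficient of d^2
b0 <- -cf[2] / (2 * a0)   # linear coefficient is -2*a*b

# Refine the rough estimates with nls
m <- nls(y ~ 1 - exp(-a * (d - b)^2), data = sim,
         start = list(a = a0, b = b0))
est <- coef(m)
```

With starting values this close to the optimum, nls should need only a few iterations. On real data the linearised fit can be much rougher, but it usually still lands in a region from which nls converges.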

Finding location of maximum D-statistic from KS test

I'm comparing two different empirical cumulative distribution functions using the KS-test, and I'd like to extract the location (in the ECDF) where the maximum of the test statistic is.
Question: Using R, is there a convenient way to extract that, perhaps from the ks.test function or otherwise?
Thanks for any and all comments.
It does not appear you can extract such a location (which might not be unique, BTW) from the output of ks.test, but by emulating the key calculation there you can obtain the answer:
compare <- function(x, y) {
n <- length(x); m <- length(y)
w <- c(x, y)
o <- order(w)
z <- cumsum(ifelse(o <= n, m, -n))
i <- which.max(abs(z))
w[o[i]]
}
The calculation through z <- ... is from the ks.test source, while the last two lines (fairly clearly) find the location where the maximum deviation is attained.
As an example, let's generate two datasets and compare them:
set.seed(17)
x <- rnorm(30)
y <- rnorm(20, sd=2/3)
u <- compare(x,y)
The reported value of u is 0.04946235. To see whether this is correct, check it against the ECDFs and the output of ks.test:
e.x <- ecdf(x)
e.y <- ecdf(y)
abs(e.x(u) - e.y(u))
ks.test(x,y)$statistic
The output in both cases is 0.4166667, indicating perfect agreement. A plot of the situation will clarify what is going on:
plot(e.x, col="Blue", main="ECDF", xlab="Value", ylab="Probability")
plot(e.y, add=TRUE, col="Red")
lines(c(u,u), c(0,1), col="Gray")
lines(c(u,u), c(e.x(u), e.y(u)), lwd=2)
It shows both ECDFs and marks the location found by compare (namely, u) with a vertical line: it is supposed to indicate the place where the two graphs attain their greatest vertical separation.
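As an independent cross-check, the same location can be found by brute force: evaluate both ECDFs at every pooled observation and take the largest gap. This is only a sketch and assumes continuous data without ties. The names pooled, gap and d_brute are illustrative.

```r
# Same helper as above, repeated so the example is self-contained
compare <- function(x, y) {
  n <- length(x); m <- length(y)
  w <- c(x, y)
  o <- order(w)
  z <- cumsum(ifelse(o <= n, m, -n))
  w[o[which.max(abs(z))]]
}

set.seed(17)
x <- rnorm(30)
y <- rnorm(20, sd = 2/3)

# Brute force: the ECDFs only change at observed values,
# so the maximum gap must occur at one of the pooled points
pooled <- sort(c(x, y))
gap <- abs(ecdf(x)(pooled) - ecdf(y)(pooled))
d_brute <- max(gap)

# Gap at the location reported by compare()
u <- compare(x, y)
gap_at_u <- abs(ecdf(x)(u) - ecdf(y)(u))
```

If several points share the maximum gap, compare returns the first of them in sorted order, so comparing the gap sizes is more robust than comparing the locations themselves.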

How to predict using a locally smoothed mean?

(Statistics beginner here).
I have some training data (x,y), and wish to make prediction for new data x_new.
Now let's assume I have the data for the plot below, but I do not know how y is computed. So I would like to use the data I have to calculate, for any given x, the local mean of the y data, as this seems like the best guess I can make.
install.packages("gplots")
library("gplots")
x <- abs(rnorm(500))
y <- rnorm(500, mean=2*x, sd=2+2*x)
bandplot(x,y)
Is there a R function to predict y for a given x, using the locally smoothed mean (here shown in red thanks to the function bandplot), or something similar?
wapply from gplots returns the locally smoothed mean as a table for x and y.
x <- 1:1000
y <- rnorm(1000, mean=1, sd=1 + x/1000 )
wapply(x,y,mean)
To predict, one would need, I guess, to find the closest x in the table returned by wapply, then read off the local mean for y.
For a value a, the closest x will be given by the index:
index = which(abs(wapply(x,y,mean)$x-a)==min(abs(wapply(x,y,mean)$x-a)))
then the prediction should be:
pred = wapply(x,y,mean)$y[index]
So in one line:
locally_smoothed_mean_prediction = function(a) wapply(x,y,mean)$y[which(abs(wapply(x,y,mean)$x-a)==min(abs(wapply(x,y,mean)$x-a)))]
> locally_smoothed_mean_prediction(600)
[1] 1.055642
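If an exact match to wapply is not required, a different technique that comes with base R is local regression via loess, which has its own predict method. Note that this swaps the windowed mean for a locally weighted polynomial fit, so the numbers will not be identical to those from wapply. A minimal sketch on the same kind of simulated data:

```r
set.seed(1)
x <- 1:1000
y <- rnorm(1000, mean = 1, sd = 1 + x / 1000)

# Local regression (default span = 0.75, degree = 2)
fit <- loess(y ~ x)

# Predict the smoothed value at new x positions
pred <- predict(fit, newdata = data.frame(x = c(250, 600)))
```

The span argument of loess controls how local the smoothing is. By default predict returns NA for x values outside the range of the training data.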

specifying degrees of freedom for b-spline fit using bs function in splines package

I am using the bs function of the splines package to create a b-spline smoothing curve for graphical purposes. (There is at least one report that Excel uses a third order b-spline for its smooth line graphs, and I would like to be able to duplicate those curves.) I am having trouble understanding the arguments required by the bs function. Representative code follows below, as adapted from the bs documentation:
require(splines)
require(ggplot2)
n <- 10
x <- 1:10
y <- rnorm(n)
d <- data.frame(x=x, y=y)
summary(fm1 <- lm(y ~ bs(x, degree=3)), data=d)
x.spline <- seq(1, 10, length.out=n*10)
spline.data <- data.frame(x=x.spline, y=predict(fm1, data.frame(x=x.spline)))
ggplot(d, aes(x,y)) + geom_point + geom_line(aes(x,y), data=spline.data)
The example code in the bs documentation specifies df=5 in the call to bs, and does not specify degree. I have no idea how many degrees of freedom I have. All I know is that I want a third order b-spline. I have experimented with specifying different values of df instead of, or in addition to degree, and I get dramatically different results. This is why I suspect that a specification of df is the issue here. How would I calculate df in this context?
The help file suggests df = length(knots) + degree. If I treat the interior points as knots, this gives me df=11 for this example, which generates error messages and a nonsensical spline fit.
Thank you in advance.
I was apparently not clear in my intentions. I am trying to do this:
How can I use spline() with ggplot?, but with b-splines.
You should not be trying to fit every point. The goal is to find a summary that is an acceptable fit but which depends on a limited number of knots. There is not much value in increasing the degree of the polynomial above the default of three. With only 10 points you surely do not want df=11. Try df=5 and the results should be reasonably flat. The rms/Hmisc package author, Frank Harrell, prefers restricted cubic splines because the predictions at the extremes are linear and thus less wild than would occur with other polynomial bases.
I corrected a couple of misspellings and added a knots argument to make your code work:
require(splines)
require(ggplot2); set.seed(trunc(100000*pi))
n <- 10
x <- 1:10
y <- rnorm(n)
d <- data.frame(x=x, y=y)
summary(fm1 <- lm(y ~ bs(x, degree=3, knots=2), data=d))
x.spline <- seq(1, 10, length.out=n*10)
spline.data <- data.frame(x=x.spline, y=predict(fm1, data.frame(x=x.spline)))
ggplot(d, aes(x,y)) + geom_point() + geom_line(aes(x,y), data=spline.data)
I came away from the exercise of varying the random seed with the opinion that Frank Harrell knows what he is talking about. I don't get the same sort of behavior at the extremes when using his packages.
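The relation df = length(knots) + degree from the help file can be checked directly by inspecting the basis that bs returns. This is a small sketch, and the knot positions 4 and 7 are arbitrary choices for illustration.

```r
library(splines)

x <- 1:10

# Specify df and let bs() choose the interior knots (degree defaults to 3)
b_df <- bs(x, df = 5)
n_knots_from_df <- length(attr(b_df, "knots"))   # df - degree interior knots

# Specify the interior knots and let df follow from them
b_kn <- bs(x, degree = 3, knots = c(4, 7))
n_cols_from_knots <- ncol(b_kn)                  # length(knots) + degree columns

# In a regression the intercept adds one more coefficient
set.seed(1)
y <- rnorm(10)
n_coef <- length(coef(lm(y ~ bs(x, df = 5))))
```

So for a cubic b-spline, df counts basis columns rather than data points. Choosing df=5 means two interior knots, while df=11 on ten points would ask for more parameters than there are observations.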
I did a little more work and came up with the following. First, an apology. What I was looking for was a smoothing spline, rather than a regression spline. I did not have the vocabulary to phrase the question properly. While the example in the help file for bs() appears to provide this, the function does not provide the same behavior for my sample data. There is another function, smooth.spline, in the stats package, which offers what I needed.
set.seed(trunc(100000*pi))
n <- 10
x <- 1:n
xx <- seq(1, n, length.out=200)
y <- rnorm(n)
d <- data.frame(x=x, y=y)
spl <- smooth.spline(x,y, spar=0.1)
spline.data <- data.frame(predict(spl, xx))
ggplot(d,aes(x,y)) + geom_point() + geom_line(aes(x,y), spline.data)
spl2 <- smooth.spline(x, y, control=
list(trace=TRUE, tol=1e-6, spar=0.1, low=-1.5, high=0.3))
spline.data2 <- data.frame(predict(spl2,xx))
ggplot(d,aes(x,y)) + geom_point() + geom_line(aes(x,y), spline.data2)
The two calls to smooth.spline represent two approaches. The first specifies the smoothing parameter manually, and the second iterates to an optimal solution. I found that I had to constrain the optimization properly to get the type of solution I was after.
The result is intended to match the b-spline used by the Excel line plot. I have collaborators who consider Excel graphics to be the standard, and I need to at least match that performance.
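For reference, a stripped-down sketch of the two approaches without ggplot2: predict on a smooth.spline fit returns a list with components x and y, so wrapping it in data.frame yields columns that can be passed straight to a plotting function. The second fit here simply uses the default generalized cross-validation rather than the constrained search above, which is a simplification on my part.

```r
set.seed(1)
n <- 10
x <- 1:n
y <- rnorm(n)
xx <- seq(1, n, length.out = 200)

spl_manual <- smooth.spline(x, y, spar = 0.1)  # smoothing parameter fixed by hand
spl_auto   <- smooth.spline(x, y)              # smoothing parameter chosen by GCV

# predict() returns list(x = ..., y = ...), which data.frame() turns into two columns
curve_manual <- data.frame(predict(spl_manual, xx))
curve_auto   <- data.frame(predict(spl_auto, xx))

plot(x, y)
lines(curve_manual$x, curve_manual$y, col = "red")
lines(curve_auto$x, curve_auto$y, col = "blue")
```

A small spar keeps the curve close to the points, which is the behavior wanted when imitating a smoothed line chart. The automatically chosen value is often much smoother on noisy data.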

Create function to automatically create plots from summary(fit <- lm( y ~ x1 + x2 +... xn))

I am running the same regression several times with small alterations to the x variables. Once I have determined the fit and the significance of each variable in the linear regression model, I want to view all the major plots. Instead of creating each plot one by one, I want a function that loops through the variables (x1...xn) of the following model.
fit <- lm(y ~ x1 + x2 + ... + xn)
The plots I want to create for all x are
1) 'x versus y' for all x in the function above
2) 'x versus predicted y'
3) x versus residuals
4) x versus time, where time is not a variable used in the regression but provided in the dataframe the data comes from.
I know how to access the coefficients from fit, however I am not able to use the coefficient names from the summary and reuse them in a function for creating the plots, as the names are characters.
I hope my question has been clearly described and hasn't been asked already.
Thanks!
Create some mock data
dat <- data.frame(x1=rnorm(100), x2=rnorm(100,4,5), x3=rnorm(100,8,27),
x4=rnorm(100,-6,0.1), t=(1:100)+runif(100,-2,2))
dat <- transform(dat, y=x1+4*x2+3.6*x3+4.7*x4+rnorm(100,3,50))
Make the fit
fit <- lm(y~x1+x2+x3+x4, data=dat)
Compute the predicted values
dat$yhat <- predict(fit)
Compute the residuals
dat$resid <- residuals(fit)
Get a vector of the variable names
vars <- names(coef(fit))[-1]
A plot can be made using this character representation of the name if you use it to build a string version of a formula and translate that. The four plots are below, and they are wrapped in a loop over all the vars. Additionally, this is surrounded by setting ask to TRUE so that you get a chance to see each plot. Alternatively, you can arrange multiple plots on the screen, or write them all to files to review later.
opar <- par(ask=TRUE)
for (v in vars) {
plot(as.formula(paste("y~",v)), data=dat)
plot(as.formula(paste("yhat~",v)), data=dat)
plot(as.formula(paste("resid~",v)), data=dat)
plot(as.formula(paste("t~",v)), data=dat)
}
par(opar)
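The "write them all to files" alternative might look like the sketch below. It uses reformulate, a base R helper that builds a formula from character strings, in place of paste plus as.formula, and sends every plot to a single multi-page PDF. The temporary file name and the smaller mock data set are assumptions for the example.

```r
set.seed(1)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100, 4, 5), t = 1:100)
dat$y <- dat$x1 + 4 * dat$x2 + rnorm(100)

fit <- lm(y ~ x1 + x2, data = dat)
dat$yhat  <- predict(fit)
dat$resid <- residuals(fit)
vars <- names(coef(fit))[-1]   # predictor names, intercept dropped

out_file <- tempfile(fileext = ".pdf")   # hypothetical output location
pdf(out_file)
for (v in vars) {
  for (resp in c("y", "yhat", "resid", "t")) {
    # reformulate("x1", response = "y") gives the formula y ~ x1
    plot(reformulate(v, response = resp), data = dat,
         main = paste(resp, "vs", v))
  }
}
invisible(dev.off())
```

Each plot lands on its own page of the PDF, so there is no need for par(ask=TRUE). Note that names(coef(fit)) only matches column names when the predictors are plain numeric variables. Factors and interaction terms produce coefficient names that are not columns of the data.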
The coefficients are stored in the fit objects as you say, but you can access them generically in a function by referring to them this way:
x <- 1:10
y <- x*3 + rnorm(1)
plot(x,y)
fit <- lm(y~x)
fit$coefficient[1] # intercept
fit$coefficient[2] # slope
str(fit) # a lot of info, but you can see how the fit is stored
My guess is when you say you know how to access the coefficients you are getting them from summary(fit) which is a bit harder to access than taking them directly from the fit. By using fit$coeff[1] etc you don't have to have the name of the variable in your function.
Three options to directly answer what I think was the question: How to access the coefficients using character arguments:
x <- 1:10
y <- x*3 + rnorm(1)
fit <- lm(y~x)
# 1
fit$coefficient["x"]
# 2
coefname <- "x"
fit$coefficient[coefname]
#3
coef(fit)[coefname]
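If the values are wanted from summary(fit) after all, its coefficient table is a plain matrix that can be indexed by row and column name in the same way. A brief sketch, with independent noise added to each point so that the summary is well defined:

```r
set.seed(1)
x <- 1:10
y <- 3 * x + rnorm(10)
fit <- lm(y ~ x)

# Matrix with one row per coefficient and columns
# "Estimate", "Std. Error", "t value", "Pr(>|t|)"
coef_table <- coef(summary(fit))

coefname <- "x"
slope   <- coef_table[coefname, "Estimate"]
slope_p <- coef_table[coefname, "Pr(>|t|)"]
```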
If the question was how to plot the various functions then you should supply a sufficiently complex construction (in R) to allow demonstration of methods with a well-specified set of objects.
