specifying degrees of freedom for b-spline fit using bs function in splines package

specifying degrees of freedom for b-spline fit using bs function in splines package - r

I am using the bs function of the splines package to create a b-spline smoothing curve for graphical purposes. (There is at least one report that Excel uses a third order b-spline for its smooth line graphs, and I would like to be able to duplicate those curves.) I am having trouble understanding the arguments required by the bs function. Representative code follows below, as adapted from the bs documentation:
require(splines)
require(ggplot2)
n <- 10
x <- 1:10
y <- rnorm(n)
d <- data.frame(x=x, y=y)
summary(fm1 <- lm(y ~ bs(x, degree=3)), data=d)
x.spline <- seq(1, 10, length.out=n*10)
spline.data <- data.frame(x=x.spline, y=predict(fm1, data.frame(x=x.spline)))
ggplot(d, aes(x,y)) + geom_point + geom_line(aes(x,y), data=spline.data)
The example code in the bs documentation specifies df=5 in the call to bs, and does not specify degree. I have no idea how many degrees of freedom I have. All I know is that I want a third order b-spline. I have experimented with specifying different values of df instead of, or in addition to degree, and I get dramatically different results. This is why I suspect that a specification of df is the issue here. How would I calculate df in this context?
The help file suggests df = length(knots) + degree. If I treat the interior points as knots, this gives me df=11 for this example, which generates error messages and a nonsensical spline fit.
Thank you in advance.
I was apparently not clear in my intentions. I am trying to do this:
How can I use spline() with ggplot?, but with b-splines.

You should not be trying to fit every point. The goal is to find a summary that is an acceptable fit but which depends on a limited number of knots. There is not much value in increasing hte degree of the polynomial above the default of three. With only 10 points you surely do not want df=11. Try df=5 and the results should be reasonably flat. The rms/Hnisc package author, Frank Harrell, prefers restricted cubic splines because the predictions at the extremes are linear and thus less wild than would occur with other polynomial bases.
I corrected a couple of misspellings and added a knots argument to make your code work:
require(splines)
require(ggplot2); set.seed(trunc(100000*pi))
n <- 10
x <- 1:10
y <- rnorm(n)
d <- data.frame(x=x, y=y)
summary(fm1 <- lm(y ~ bs(x, degree=3, knots=2)), data=d)
x.spline <- seq(1, 10, length.out=n*10)
spline.data <- data.frame(x=x.spline, y=predict(fm1, data.frame(x=x.spline)))
ggplot(d, aes(x,y)) + geom_point() + geom_line(aes(x,y), data=spline.data)
I came away from the exercise of varying the randomseed with the opinion that Frank Harrell knows what he is talking about. I don't get the same sort of behavior at the extremes when using his packages.

I did a little more work and came up with the following. First, an apology. What I was looking for was a smoothing spline, rather than a regression spline. I did not have the vocabulary to phrase the question properly. While the example in the help file for bs() appears to provide this, the function does not provide the same behavior for my sample data. There is another function, smooth.spline, in the stats package, which offers what I needed.
set.seed(tunc(100000*pi))
n <- 10
x <- 1:n
xx <- seq(1, n, length.out=200)
y <- rnorm(n)
d <- data.frame(x=x, y=y)
spl <- smooth.spline(x,y, spar=0.1)
spline.data <- data.frame(y=predict(spl,xx))
ggplot(d,aes(x,y)) + geom_point() + geom_line(aes(x,y), spline.data)
spl2 <- smooth.spline(x, y, control=
list(trace=TRUE, tol=1e-6, spar=0.1, low=-1.5, high=0.3))
spline.data2 <- data.frame(predit(spl2,xx))
ggplot(d,aes(x,y)) + geom_point() + geom_line(aes(x,y), spline.data2)
The two calls to smooth.spline represent two approaches. The first specifies the smoothing parameter manually, and the second iterates to an optimal solution. I found that I had to constrain the optimization properly to get the type of solution I was after.
The result is intended to match the b-spline used by the Excel line plot. I have collaborators who consider Excel graphics to be the standard, and I need to at least match that performance.

Related

Using optim to choose initial values for nls

One method I have seen in the literature is the use of optim() to choose initial values for nonlinear models in the package nls or nlme, however, I am puzzled by the actual implementation.
Take an example using COVID data from Alachua, FL:
dat=data.frame(x=seq(1,10,1), y=c(27.9,23.1,24.6,33.0,48.0,136.4,243.4,396.7,519.9,602.8))
x are time points and y is the number of people infected per 10,000 people
Now, if I wanted to fit a four-parameter logistic model in nls, I could use
n1 <- nls(y ~ SSfpl(x, A, B, M, S), data = dat)
But now imagine that parameter estimation is highly sensitive to the initial values so I want to optimize my approach. How would this be achieved?
The way I have thought to try is as follows
fun_to_optim <- function(data, guess){
x = data$x
y = data$y
A = guess[1]
B = guess[2]
M = guess[3]
S = guess[4]
y = A + (B-A)/(1+exp((M-x)/S))
return(-sum(y)) }
optim(fn=fun_to_optim, data=dat,
par=c(10,10,10,10),
method="Nelder-Mead")
The result from optim() is wrong but I cannot see my error. Thank you for any assistance.

The main issue is that you're not computing/returning the sum of squares from your objective function. However: I think you really have it backwards. Using nls() with SSfpl is about the best you're going to do in terms of optimization: it has sensible heuristics for picking starting values (SS stands for "self-starting"), and it provides a gradient function for the optimizer. It's not impossible that, with a considerable amount of work, you could find better heuristics for picking starting values for a particular system, but in general switching from nls to optim + Nelder-Mead will leave you worse off than when you started (illustration below).
fun_to_optim <- function(data, guess){
x = data$x
y = data$y
A = guess[1]
B = guess[2]
M = guess[3]
S = guess[4]
y_pred = A + (B-A)/(1+exp((M-x)/S))
return(sum((y-y_pred)^2))
}
Fit optim() with (1) your suggested starting values; (2) better starting values that are somewhere nearer the correct values (you could get most of these values by knowing the geometry of the function — e.g. A is the left asymptote, B is the right asymptote, M is the midpoint, S is the scale); (3) same as #2 but using BFGS rather than Nelder-Mead.
opt1 <- optim(fn=fun_to_optim, data=dat,
par=c(A=10,B=10,M=10,S=10),
method="Nelder-Mead")
opt2 <- optim(fn=fun_to_optim, data=dat,
par=c(A=10,B=500,M=10,S=1),
method = "Nelder-Mead")
opt3 <- optim(fn=fun_to_optim, data=dat,
par=c(A=10,B=500,M=10,S=1),
method = "BFGS")
Results:
xvec <- seq(1,10,length=101)
plot(y~x, data=dat)
lines(xvec, predict(n1, newdata=data.frame(x=xvec)))
p1 <- with(as.list(opt1$par), A + (B-A)/(1+exp((M-xvec)/S)))
lines(xvec, p1, col=2)
p2 <- with(as.list(opt2$par), A + (B-A)/(1+exp((M-xvec)/S)))
lines(xvec, p2, col=4)
p3 <- with(as.list(opt3$par), A + (B-A)/(1+exp((M-xvec)/S)))
lines(xvec, p3, col=6)
legend("topleft", col=c(1,2,4,6), lty=1,
legend=c("nls","NM (bad start)", "NM", "BFGS"))
nls and good starting values + BFGS overlap, and provide a good fit
optim/Nelder-Mead from bad starting values is absolutely terrible — converges on a constant line
optim/N-M from good starting values gets a reasonable fit, but obviously worse; I haven't analyzed why it gets stuck there.

R: How to plot custom range of polynomial produced by lm poly fit

I'm confused by the coefficients produced by the output of lm
Here's a copy of the data I'm working with
(postprocessed.csv)
"","time","value"
"1",1,2.61066016308988
"2",2,3.41246054742996
"3",3,3.8608767964033
"4",4,4.28686048552237
"5",5,4.4923132964825
"6",6,4.50557049744317
"7",7,4.50944447661246
"8",8,4.51097373134893
"9",9,4.48788748823809
"10",10,4.34603985656981
"11",11,4.28677073671406
"12",12,4.20065901625172
"13",13,4.02514194962519
"14",14,3.91360194972916
"15",15,3.85865748409081
"16",16,3.81318053258601
"17",17,3.70380706527433
"18",18,3.61552922363713
"19",19,3.61405310598722
"20",20,3.64591327503384
"21",21,3.70234435835577
"22",22,3.73503970503372
"23",23,3.81003078640584
"24",24,3.88201196162666
"25",25,3.89872518158949
"26",26,3.97432743542362
"27",27,4.2523675144599
"28",28,4.34654855854847
"29",29,4.49276038902684
"30",30,4.67830892029687
"31",31,4.91896819673664
"32",32,5.04350767355202
"33",33,5.09073406942046
"34",34,5.18510849382162
"35",35,5.18353176529036
"36",36,5.2210776270173
"37",37,5.22643491929207
"38",38,5.11137006553725
"39",39,5.01052467981257
"40",40,5.0361056705898
"41",41,5.18149486951409
"42",42,5.36334869132276
"43",43,5.43053620818444
"44",44,5.60001072279525
I have fitted a 4th order polynomial to this data using the following script:
library(ggplot2)
library(matrixStats)
library(forecast)
df_input <- read.csv("postprocessed.csv")
x <- df_input$time
y <- df_input$value
df <- data.frame(x, y)
poly4model <- lm(y~poly(x, degree=4), data=df)
v <- seq(30, 40)
vv <- poly4model$coefficients[1] +
poly4model$coefficients[2] * v +
poly4model$coefficients[3] * (v ^ 2) +
poly4model$coefficients[4] * (v ^ 3) +
poly4model$coefficients[5] * (v ^ 4)
pdf("postprocessed.pdf")
plot(df)
lines(v, vv, col="red", pch=20, lw=3)
dev.off()
I initially tried using the predict function to do this, but couldn't get that to work, so resorted to implementing this "workaround" using some new vectors v and vv to store the data for the line in the region I am trying to plot.
Ultimatly, I am trying to do this:
Fit a 4th order polynomial to the data
Plot the 4th order polynomial over the range of data in one color
Plot the 4th order polynomial over the range from the last value to the last value + 10 (prediction) in a different color
At the moment I am fairly sure using v and vv to do this is not "the best way", however I would have thought it should work. What is happening is that I get very large values.
Here is a screenshot from Desmos. I copied and pasted the same coefficients as shown by typing poly4model$coefficients into the console. However, something must have gone wrong because this function is nothing like the data.
I think I've provided enough info to be able to run this short script. However I will add the pdf as well.

It is easiest to use the predict function to create your line. To do that, you pass the model and a data frame with the desired independent variables to the predict function.
x <- df_input$time
y <- df_input$value
df <- data.frame(x, y)
poly4model <- lm(y~poly(x, degree=4), data=df)
v <- seq(30, 40)
#Notice the column in the dataframe is the same variable name
# as the variable in the model!
predict(poly4model, data.frame(x=v))
plot(df)
lines(v, predict(poly4model, data.frame(x=seq(30, 40))), col="red", pch=20, lw=3)
NOTE
The function poly "Returns or evaluates orthogonal polynomials of degree 1 to degree over the specified set of points x: these are all orthogonal to the constant polynomial of degree 0." To return the "normal" polynomial coefficients one needs to use the "raw=TRUE" option in the function.
poly4model <- lm(y~poly(x, degree=4, raw=TRUE), data=df)
Now your equation above will work.

Fit distribution to given frequency values in R

I have frequency values changing with the time (x axis units), as presented on the picture below. After some normalization these values may be seen as data points of a density function for some distribution.
Q: Assuming that these frequency points are from Weibull distribution T, how can I fit best Weibull density function to the points so as to infer the distribution T parameters from it?
sample <- c(7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
611,1037,727,489,432,371,1125,69,595,624)
plot(1:length(sample), sample, type = "l")
points(1:length(sample), sample)
Update.
To prevent from being misunderstood, I would like to add little more explanation. By saying I have frequency values changing with the time (x axis units) I mean I have data which says that I have:
7787 realizations of value 1
3056 realizations of value 2
2359 realizations of value 3 ... etc.
Some way towards my goal (incorrect one, as I think) would be to create a set of these realizations:
# Loop to simulate values
set.values <- c()
for(i in 1:length(sample)){
set.values <<- c(set.values, rep(i, times = sample[i]))
}
hist(set.values)
lines(1:length(sample), sample)
points(1:length(sample), sample)
and use fitdistr on the set.values:
f2 <- fitdistr(set.values, 'weibull')
f2
Why I think it is incorrect way and why I am looking for a better solution in R?
in the distribution fitting approach presented above it is assumed that set.values is a complete set of my realisations from the distribution T
in my original question I know the points from the first part of the density curve - I do not know its tail and I want to estimate the tail (and the whole density function)

Here is a better attempt, like before it uses optim to find the best value constrained to a set of values in a box (defined by the lower and upper vectors in the optim call). Notice it scales x and y as part of the optimization in addition to the Weibull distribution shape parameter, so we have 3 parameters to optimize over.
Unfortunately when using all the points it pretty much always finds something on the edges of the constraining box which indicates to me that maybe Weibull is maybe not a good fit for all of the data. The problem is the two points - they ares just too large. You see the attempted fit to all data in the first plot.
If I drop those first two points and just fit the rest, we get a much better fit. You see this in the second plot. I think this is a good fit, it is in any case a local minimum in the interior of the constraining box.
library(optimx)
sample <- c(60953,7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
611,1037,727,489,432,371,1125,69,595,624)
t.sample <- 0:22
s.fit <- sample[3:23]
t.fit <- t.sample[3:23]
wx <- function(param) {
res <- param[2]*dweibull(t.fit*param[3],shape=param[1])
return(res)
}
minwx <- function(param){
v <- s.fit-wx(param)
sqrt(sum(v*v))
}
p0 <- c(1,200,1/20)
paramopt <- optim(p0,minwx,gr=NULL,lower=c(0.1,100,0.01),upper=c(1.1,5000,1))
popt <- paramopt$par
popt
rms <- paramopt$value
tit <- sprintf("Weibull - Shape:%.3f xscale:%.1f yscale:%.5f rms:%.1f",popt[1],popt[2],popt[3],rms)
plot(t.sample[2:23], sample[2:23], type = "p",col="darkred")
lines(t.fit, wx(popt),col="blue")
title(main=tit)

You can directly calculate the maximum likelihood parameters, as described here.
# Defining the error of the implicit function
k.diff <- function(k, vec){
x2 <- seq(length(vec))
abs(k^-1+weighted.mean(log(x2), w = sample)-weighted.mean(log(x2),
w = x2^k*sample))
}
# Setting the error to "quite zero", fulfilling the equation
k <- optimize(k.diff, vec=sample, interval=c(0.1,5), tol=10^-7)$min
# Calculate lambda, given k
l <- weighted.mean(seq(length(sample))^k, w = sample)
# Plot
plot(density(rep(seq(length(sample)),sample)))
x <- 1:25
lines(x, dweibull(x, shape=k, scale= l))

Assuming the data are from a Weibull distribution, you can get an estimate of the shape and scale parameter like this:
sample <- c(7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
611,1037,727,489,432,371,1125,69,595,624)
f<-fitdistr(sample, 'weibull')
f
If you are not sure whether it is distributed Weibull, I would recommend using the ks.test. This tests whether your data is from a hypothesised distribution. Given your knowledge of the nature of the data, you could test for a few selected distributions and see which one works best.
For your example this would look like this:
ks = ks.test(sample, "pweibull", shape=f$estimate[1], scale=f$estimate[2])
ks
The p-value is insignificant, hence you do not reject the hypothesis that the data is from a Weibull distribution.
Update: The histograms of either the Weibull or exponential look like a good match to your data. I think the exponential distribution gives you a better fit. Pareto distribution is another option.
f<-fitdistr(sample, 'weibull')
z<-rweibull(10000, shape= f$estimate[1],scale= f$estimate[2])
hist(z)
f<-fitdistr(sample, 'exponential')
z = rexp(10000, f$estimate[1])
hist(z)

Programming a QQ plot

I have a sample of math test scores for male and female students. I want to draw QQ plot for each gender to see if each of them is normally distributed. I know how to draw the QQ plot for the overall sample, but how can I draw them separately?

Here is a simple solution using base graphics:
scores <- rnorm(200, mean=12, sd=2)
gender <- gl(2, 50, labels=c("M","F"))
opar <- par(mfrow=c(1,2))
for (g in levels(gender))
qqnorm(scores[gender==g], main=paste("Gender =", g))
par(opar)
A more elegant lattice solution then:
qqmath(~ scores | gender, data=data.frame(scores, gender), type=c("p", "g"))
See the on-line help for qqmath for more discussion and example of possible customization.

In Python, you have a QQplot method offered by the OpenTURNS Library see doc here. Here is an example.
In a first step, we generate a random sample of size 300 from a Uniform distribution.
In a second step, we consider that we do not know where this sample comes from and try to fit a Normal distribution and a Uniform distribution.
In a third step, we draw the QQPlot of ;the sample against each of the fitted distributions in order to "see" which one is the best
1st step:
import openturns as ot
from openturns.viewer import View
distribution = ot.Uniform(-1, 1)
sample = distribution.getSample(300)
2nd step:
fitted_normal = ot.NormalFactory().build(sample)
fitted_uniform = ot.UniformFactory().build(sample)
3rd step:
QQ_plot1 = ot.VisualTest.DrawQQplot(sample, fitted_normal)
QQ_plot2 = ot.VisualTest.DrawQQplot(sample,fitted_uniform)
View(QQ_plot1)
View(QQ_plot2)
As expected, the fitted Uniform is more adapted to the sample the Normal which has bigger error at both ends of the domain.

Create function to automatically create plots from summary(fit <- lm( y ~ x1 + x2 +... xn))

I am running the same regression with small alterations of x variables several times. My aim is after having determined the fit and significance of each variable for this linear regression model to view all all major plots. Instead of having to create each plot one by one, I want a function to loop through my variables (x1...xn) from the following list.
fit <-lm( y ~ x1 + x2 +... xn))
The plots I want to create for all x are
1) 'x versus y' for all x in the function above
2) 'x versus predicted y
3) x versus residuals
4) x versus time, where time is not a variable used in the regression but provided in the dataframe the data comes from.
I know how to access the coefficients from fit, however I am not able to use the coefficient names from the summary and reuse them in a function for creating the plots, as the names are characters.
I hope my question has been clearly described and hasn't been asked already.
Thanks!

Create some mock data
dat <- data.frame(x1=rnorm(100), x2=rnorm(100,4,5), x3=rnorm(100,8,27),
x4=rnorm(100,-6,0.1), t=(1:100)+runif(100,-2,2))
dat <- transform(dat, y=x1+4*x2+3.6*x3+4.7*x4+rnorm(100,3,50))
Make the fit
fit <- lm(y~x1+x2+x3+x4, data=dat)
Compute the predicted values
dat$yhat <- predict(fit)
Compute the residuals
dat$resid <- residuals(fit)
Get a vector of the variable names
vars <- names(coef(fit))[-1]
A plot can be made using this character representation of the name if you use it to build a string version of a formula and translate that. The four plots are below, and the are wrapped in a loop over all the vars. Additionally, this is surrounded by setting ask to TRUE so that you get a chance to see each plot. Alternatively you arrange multiple plots on the screen, or write them all to files to review later.
opar <- par(ask=TRUE)
for (v in vars) {
plot(as.formula(paste("y~",v)), data=dat)
plot(as.formula(paste("yhat~",v)), data=dat)
plot(as.formula(paste("resid~",v)), data=dat)
plot(as.formula(paste("t~",v)), data=dat)
}
par(opar)

The coefficients are stored in the fit objects as you say, but you can access them generically in a function by referring to them this way:
x <- 1:10
y <- x*3 + rnorm(1)
plot(x,y)
fit <- lm(y~x)
fit$coefficient[1] # intercept
fit$coefficient[2] # slope
str(fit) # a lot of info, but you can see how the fit is stored
My guess is when you say you know how to access the coefficients you are getting them from summary(fit) which is a bit harder to access than taking them directly from the fit. By using fit$coeff[1] etc you don't have to have the name of the variable in your function.

Three options to directly answer what I think was the question: How to access the coefficients using character arguments:
x <- 1:10
y <- x*3 + rnorm(1)
fit <- lm(y~x)
# 1
fit$coefficient["x"]
# 2
coefname <- "x"
fit$coefficient[coefname]
#3
coef(fit)[coefname]
If the question was how to plot the various functions then you should supply a sufficiently complex construction (in R) to allow demonstration of methods with a well-specified set of objects.