I am using the lowess function to fit a regression between two variables x and y. Now I want to know the fitted value at a new value of x. For example, how do I find the fitted value at x = 2.5 in the following example? I know loess can do that, but I want to reproduce someone's plot and he used lowess.
set.seed(1)
x <- 1:10
y <- x + rnorm(x)
fit <- lowess(x, y)
plot(x, y)
lines(fit)
Local regression (lowess) is a non-parametric statistical method; it's not like linear regression, where you can use the model directly to estimate values at new points.
You'll need to take the fitted values from the function (that's why it only returns a list to you) and choose your own interpolation scheme, then use that scheme to predict your new points.
A common technique is spline interpolation (but there are others):
https://www.r-bloggers.com/interpolation-and-smoothing-functions-in-base-r/
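For the example in the question, a minimal sketch of this idea using base R interpolation on the list that lowess returns (this reuses the fit object from the question):
fit <- lowess(x, y)
approx(fit$x, fit$y, xout = 2.5)$y   # linear interpolation between the fitted points
spline(fit$x, fit$y, xout = 2.5)$y   # or spline interpolation, as in the linked post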
EDIT: I'm pretty sure the predict function does the interpolation for you. However, I can't find any information about what exactly predict uses, so I've tried to trace the source code.
https://github.com/wch/r-source/blob/af7f52f70101960861e5d995d3a4bec010bc89e6/src/library/stats/R/loess.R
else { ## interpolate
## need to eliminate points outside original range - not in pred_
I'm sure the R code calls the underlying C implementation, but it's not well documented, so I don't know which algorithm it uses.
My suggestion: either trust the predict function or roll your own interpolation algorithm.
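For comparison, a minimal sketch of the loess route mentioned in the question, whose predict method handles new x values directly (note that loess and lowess use different defaults for the span and the local polynomial degree, so the two fits will not match exactly):
fit2 <- loess(y ~ x)
predict(fit2, newdata = data.frame(x = 2.5))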
Let's assume I want to draw a plot similar to the one here using R, i.e. hazard ratio on the y axis and some predictor variable on the x axis, based on a Cox model with a spline term. The only exception is that I want to set my own x axis points; termplot seems to pick all the unique values from the data set, but I want to use some sort of grid. This is because I am doing multiple imputation, which induces different unique values in every round. Otherwise I can do the combined inference quite easily, but it would be a lot easier to make predictions for the same predictor values in every imputation round.
So, I need to find a way to use the termplot function so that I can fix the predictor values, or to find a workaround. I tried to use the predict function, but its newdata argument requires values for all the other (adjusting) variables too, which inflates the standard errors. This is a problem because I am also plotting confidence intervals. I think I could do this manually without any functions, except that spline terms are out of my reach in this sense.
Here is an illustrative example.
library(survival)
data(diabetic)
diabetic <- diabetic[diabetic$eye == "right", ]
# Model with a spline term and two adjusting variables
mod <- coxph(Surv(time, status) ~ pspline(age, df = 3) + risk + laser, data = diabetic)
summary(mod)
# Let's pretend this is the grid
# These are in the data but it's easier for comparison in this example
ages<-20:25
# Right SEs but what if I want to use different age values?
termplot(mod,term=1,se=TRUE,plot=F)$age[20:25,]
# This does something else
termplot(mod,data=data.frame(age=ages),term=1,se=TRUE,plot=F)$age
# This produces an error
predict(mod,newdata=data.frame(age=ages),se.fit=T)
# This inflates variance
# May actually work with models without categorical variables: what to do with them?
# Actual predictions are different but all that matters is the difference between ages
# and they are in line
predict(mod,newdata=data.frame(age=ages,risk=mean(diabetic$risk),laser="xenon"),se.fit=T)
Please let me know if I didn't explain my problem sufficiently. I tried to keep it as simple as possible.
In the end, this is how I worked it out. First, I made the predictions and SEs using the termplot function, and then I used linear interpolation to get approximately correct predictions and SEs for my custom grid.
ptemp <- termplot(mod, term = 1, se = TRUE, plot = FALSE)
ptemp <- data.frame(ptemp[1])             # Pick up age and the corresponding estimates and SEs
x <- ptemp[, 1]; y <- ptemp[, 2]; se <- ptemp[, 3]
f <- approxfun(x, y)                      # Linear interpolation function
x2 <- seq(from = 10, to = 50, by = 0.5)   # You can make a finer grid if you want
y2 <- f(x2)                               # The interpolation itself
f_se <- approxfun(x, se)                  # Same for the SEs
se2 <- f_se(x2)
dat <- data.frame(x2, y2, se2)            # The wanted result
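For completeness, a minimal sketch (using the dat object built above) of plotting the interpolated term with approximate pointwise 95% intervals; the values are still on the linear predictor (log hazard ratio) scale that termplot returns:
with(dat, {
  plot(x2, y2, type = "l", xlab = "age", ylab = "log hazard ratio")
  lines(x2, y2 - 1.96 * se2, lty = 2)  # approximate lower 95% limit
  lines(x2, y2 + 1.96 * se2, lty = 2)  # approximate upper 95% limit
})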
I have a data.frame in R whose variables represent locations and whose observations are measurements of a certain variable at those locations. I want to measure the decay of dependence between locations as a function of distance, so the variogram is particularly useful for my study.
I am trying to use the gstat library, but I am a bit confused about certain parameters. As far as I understand, the (empirical) variogram should only need as basic data:
The locations of the variables
Observations for these variables
And then other parameters like maximum distance, directions, ...
Now, the gstat::variogram() function requires as its first input an object of class gstat. Checking the documentation of the gstat() function, I see that it outputs an object of this class, but this function requires a formula argument, which is described as:
formula that defines the dependent variable as a linear model of independent variables; suppose the dependent variable has name z, for ordinary and simple kriging use the formula z~1; for simple kriging also define beta (see below); for universal kriging, suppose z is linearly dependent on x and y, use the formula z~x+y
Could someone explain to me what this formula is for?
try
methods(variogram)
and you'll see that gstat has several methods for variogram, one of which requires a gstat object as its first argument.
Given a data.frame, the easiest approach is to use the formula method:
variogram(z~1, ~x+y, data)
which specifies that, in data, z is the observed variable of interest; ~1 specifies a constant mean model; and ~x+y specifies that the coordinates are found in the columns x and y of data.
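As a concrete illustration, here is a minimal sketch using the meuse dataset shipped with the sp package, a plain data.frame whose coordinates are in columns x and y and whose observed variable of interest is zinc:
library(sp)
library(gstat)
data(meuse)                                    # plain data.frame with columns x, y, zinc, ...
v <- variogram(log(zinc) ~ 1, ~x + y, meuse)   # constant mean model, coordinates in x and y
plot(v)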
I would like to fit a gam model to a dataset while specifying the types of functions to use.
It would be something like:
y ~ cst1 * (s(var1)-s(var2)) * (1 - exp(var3*cst2))
s has to be the same function for both var1 and var2. I don't have a prior idea of the family of the function s. To summarize, the model would find the constants (cst1 and cst2) plus the function s.
Is it possible? If not, is there any other way (another type of model) I can use to do what I'm looking for?
Thanks in advance for replies.
This model could be fit with nls, the non-linear least squares function. This will allow you to model the formula you want directly. The splines will need to be done manually, though. This question gets at what you would be trying to do.
As far as getting the splines to be the same for var1 and var2, you can do this by subtracting the basis matrices. Basically you want to compute a coefficient vector A where the term is A * s(var1) - A * s(var2) = A * (s(var1) - s(var2)). You wouldn't want to just do s(var1 - var2); in general, f(x) - f(y) != f(x - y). To do this in R, you would:
Compute the spline basis matrices with ns() for var1 and var2, giving them the same knots. You need to specify both the knots and the Boundary.knots parameters so that the two splines will share the same basis.
Subtract the two spline basis matrices (the output from the ns() function).
Adapt the resulting subtracted spline basis matrix for the nls formula, as they do in the question I linked earlier.
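A minimal sketch of these three steps, assuming simulated data with the hypothetical variable names from the question (here the true s() happens to be sin(), and cst1 is absorbed into the spline coefficient vector A, since cst1 * A is not separately identifiable):
library(splines)

# Simulated data for illustration only
set.seed(1)
n <- 200
dat <- data.frame(var1 = runif(n, 0, 10),
                  var2 = runif(n, 0, 10),
                  var3 = runif(n, 0, 5))
dat$y <- with(dat, 1.5 * (sin(var1) - sin(var2)) * (1 - exp(-0.3 * var3)) + rnorm(n, sd = 0.1))

# Step 1: spline bases with identical knots and boundary knots for var1 and var2
kn  <- quantile(c(dat$var1, dat$var2), probs = c(0.25, 0.50, 0.75))
bkn <- range(c(dat$var1, dat$var2))
B1  <- ns(dat$var1, knots = kn, Boundary.knots = bkn)
B2  <- ns(dat$var2, knots = kn, Boundary.knots = bkn)

# Step 2: subtract the bases so one coefficient vector A represents s(var1) - s(var2)
Bd <- B1 - B2

# Step 3: fit with nls; cst2 enters the exponential term as in the question's formula
fit <- nls(y ~ as.vector(Bd %*% A) * (1 - exp(cst2 * var3)),
           data = dat,
           start = list(A = rep(0.5, ncol(Bd)), cst2 = -0.2))
summary(fit)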
I'm using the nlsLM function to fit a nonlinear regression. How does one extract the hat values and Cook's Distance from an nlsLM model object?
With objects created using the nls or nlreg functions, I know how to extract the hat values and the Cook's distance of the observations, but I can't figure out how to get them using nlsLM.
Can anyone help me out on this? Thanks!
So, it's not Cook's distance or based on hat values, but you can use the function nlsJack in the nlstools package to jackknife your nls model: it removes each point, one by one, refits the model, and shows, roughly speaking, how much the model coefficients change with or without a given observation in there.
Reproducible example:
xs = rep(1:10, times = 10)
ys = 3 + 2*exp(-0.5*xs)
for (i in 1:100) {
xs[i] = rnorm(1, xs[i], 2)
}
df1 = data.frame(xs, ys)
nls1 = nls(ys ~ a + b*exp(d*xs), data=df1, start=c(a=3, b=2, d=-0.5))
require(nlstools)
plot(nlsJack(nls1))
The plot shows the percentage change in each model coefficient as each individual observation is removed, and it marks points above a certain threshold as "influential" in the resulting plot. The documentation for nlsJack describes how this threshold is determined:
An observation is empirically defined as influential for one parameter if the difference between the estimate of this parameter with and without the observation exceeds twice the standard error of the estimate divided by sqrt(n). This empirical method assumes a small curvature of the nonlinear model.
My impression so far is that this is a fairly liberal criterion; it tends to mark a lot of points as influential.
nlstools is a pretty useful package overall for diagnosing nls model fits, though.
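As a usage note (a sketch, assuming the standard nlstools methods for nlsJack objects), you can also keep the jackknife object and inspect its summary, which reports the resampled estimates and which observations were flagged as influential for each parameter:
jack1 <- nlsJack(nls1)
summary(jack1)   # jackknife estimates and influential observations per parameter
plot(jack1)      # same plot as above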
I have the model lm(y ~ x + I(log(x))) and I would like to use predict to get predictions for a new data frame containing new values of x, based on my model. How does predict deal with the AsIs function I() in the model? Does I(log(x)) need to be specified separately in the newdata argument of predict, or does predict understand that it should construct and use I(log(x)) from x?
UPDATE
@DWin: The way the variables enter the model affects the coefficients, especially for interactions. My example is simplistic, but try this out:
x<-rep(seq(0,100,by=1),10)
y<-15+2*rnorm(1010,10,4)*x+2*rnorm(1010,10,4)*x^(1/2)+rnorm(1010,20,100)
z<-x^2
plot(x,y)
lm1<-lm(y~x*I(x^2))
lm2<-lm(y~x*x^2)
lm3<-lm(y~x*z)
summary(lm1)
summary(lm2)
summary(lm3)
You see that lm1 = lm3, but lm2 is something different (in formula syntax x*x^2 collapses to just x, so there is only one slope coefficient). Assuming you don't want to create the extra variable z (computationally inefficient for large datasets), the only way to build an interaction model like lm3 is with I. Again, this is a very simplistic example (that may make no statistical sense), but it illustrates the point for complicated models.
@Ben Bolker: I would like to avoid guessing and try to get an authoritative answer (I can't directly check this with my models since they are much more complicated than the example). My guess is that predict correctly assumes and constructs the I(log(x)).
You do not need to make your variable names look like the term I(x). Just use "x" in the newdata argument.
The reason lm(y~x*I(x^2)) and lm(y~x*x^2) are different is that "^" and "*" are reserved symbols in R formulas. That's not the case with the log function. It is also incorrect that interactions can only be constructed with I(). If you want a second-degree polynomial in R, you should use poly(x, 2). Whether you build the model with I(log(x)) or with just log(x), you should get the same model. Both of them will be transformed to the proper predictor values by predict if you use:
newdata = data.frame(x = seq(min(x), max(x), length = 10))
Using poly will protect you from incorrect inferences that are so commonly caused by the use of I(x^2).
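A minimal sketch checking this with a toy version of the model from the question: the newdata frame contains only a plain x column, and predict rebuilds both I(log(x)) and log(x) from it, giving identical predictions for the two formulations:
set.seed(1)
d <- data.frame(x = runif(50, 1, 10))
d$y <- 2 * d$x + 3 * log(d$x) + rnorm(50)

m1 <- lm(y ~ x + I(log(x)), data = d)
m2 <- lm(y ~ x + log(x), data = d)

newdata <- data.frame(x = seq(min(d$x), max(d$x), length = 10))
all.equal(predict(m1, newdata = newdata), predict(m2, newdata = newdata))  # TRUE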