I am doing a functional regression in R (package fda) and am supposed to eliminate the intercept term. But the fda package in R does not seem to support such a formula.
Here is what I wish to do:
fit.fd <- fRegress(Acc.fd~Velo.fd - 1)
where Acc.fd and Velo.fd are two functional objects in the package fda. But it is no different from:
fit.fd <- fRegress(Acc.fd~Velo.fd)
Since the result is deeply nested, I am adding an example so the code can be run on a small scale and the details of the result can be inspected.
list3d <- rep(0, 10*5*2)
list3d <- array(list3d, c(10,5, 2))
# The data is 5 functions each evaluated at 10 points
# Indep variable
list3d[, , 2] <- matrix(rnorm(50, 0, 1), 10, 5)
# Response variable
list3d[, , 1] <- matrix(rnorm(50, 0, 0.1) , 10, 5)+list3d[, , 2] ^ 2
dimnames(list3d)[[1]] <- seq(0,9)
time.range <- c(0, 9)
time.basis <- create.fourier.basis(time.range, nbasis = 3)
lfd <- vec2Lfd(c(0, (2*pi/20)^2, 0), rangeval = time.range)
time.lfd<- smooth.basisPar(seq(0,9), list3d , time.basis, Lfdobj = lfd, lambda = 0.01)$fd
Acc.fd <- time.lfd[, 1]
Velo.fd <- time.lfd[, 2]
# Expecting a fit without an intercept here
fit.fd <- fRegress(Acc.fd ~ Velo.fd - 1)
# plot of coef func
# (plotpoints is not defined in the original; an evaluation grid over the time range is assumed)
plotpoints <- seq(0, 9, length.out = 101)
plot(plotpoints, eval.fd(plotpoints, fit.fd$betaestlist$Velo.fd$fd))
# Plot of intercept func, which I wish to constrain to zero
plot(plotpoints, eval.fd(plotpoints, fit.fd$betaestlist$const$fd))
# Compare with regular functional regression with no restriction
fit.fd <- fRegress(Acc.fd ~ Velo.fd)
plot(plotpoints, eval.fd(plotpoints, fit.fd$betaestlist$Velo.fd$fd))
So the no-intercept option does not work the same way as in lm()? Could anyone help me out here? Many thanks!
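A possible workaround (not from the original post; just a sketch assuming the fda list interface of fRegress) is to bypass the formula method and supply xfdlist/betalist explicitly, simply leaving out the constant term:
# Sketch only: no constant covariate means no intercept term.
# The choice of beta.basis for the coefficient function is illustrative.
beta.basis <- create.fourier.basis(time.range, nbasis = 3)
xfdlist  <- list(Velo.fd = Velo.fd)
betalist <- list(Velo.fd = fdPar(beta.basis))
fit.noint <- fRegress(Acc.fd, xfdlist, betalist)
plot(fit.noint$betaestlist$Velo.fd$fd)   # only one coefficient function is estimated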
I have an array of outputs from hundreds of segmented linear models (made using the segmented package in R). I want to be able to use these outputs on new data, using the predict function. To be clear, I do not have the segmented linear model objects in my workspace; I just saved and reimported the relevant outputs (e.g. the coefficients and breakpoints). For this reason I can't simply use the predict.segmented function from the segmented package.
Below is a toy example based on this link that seems promising, but does not match the output of the predict.segmented function.
library(segmented)
set.seed(12)
xx <- 1:100
zz <- runif(100)
yy <- 2 + 1.5*pmax(xx-35,0) - 1.5*pmax(xx-70,0) +
15*pmax(zz-0.5,0) + rnorm(100,0,2)
dati <- data.frame(x=xx,y=yy,z=zz)
out.lm<-lm(y~x,data=dati)
o <- segmented(out.lm, seg.Z = ~x, psi = list(x = c(30, 60)),
               control = seg.control(display = FALSE))
# Note that coefficients with U in the name are differences in slopes, not slopes.
# Compare:
slope(o)
coef(o)[2] + coef(o)[3]
coef(o)[2] + coef(o)[3] + coef(o)[4]
# prediction
pred <- data.frame(x = 1:100)
pred$dummy1 <- pmax(pred$x - o$psi[1,2], 0)
pred$dummy2 <- pmax(pred$x - o$psi[2,2], 0)
pred$dummy3 <- I(pred$x > o$psi[1,2]) * (coef(o)[2] + coef(o)[3])
pred$dummy4 <- I(pred$x > o$psi[2,2]) * (coef(o)[2] + coef(o)[3] + coef(o)[4])
names(pred)[-1]<- names(model.frame(o))[-c(1,2)]
# compute the prediction, using standard predict function
# computing confidence intervals further
# suppose that the breakpoints are fixed
pred <- data.frame(pred, predict(o, newdata= pred,
interval="confidence"))
# Try prediction using predict.segmented to compare
test <- predict.segmented(o)
plot(pred$fit, test, ylim = c(0, 100))
abline(0,1, col = "red")
# At least one segment not being predicted correctly?
Can I use the base R predict() function (not the predict.segmented() function) with the coefficients and breakpoints saved from segmented linear models?
UPDATE
I figured out that the code above has issues (don't use it). By reverse-engineering the predict.segmented() function, I produced the design matrix and used that to predict values instead of calling predict() directly. I do not consider this a full answer to the original question yet, because predict() can also produce confidence intervals for the prediction and I have not yet implemented that -- the question is still open for someone to add confidence intervals.
library(segmented)
## Define function for making matrix of dummy variables (this is based on code from predict.segmented())
dummy.matrix <- function(x.values, x_names, psi.est = TRUE, nameU, nameV, diffSlope, est.psi) {
  # Creates a model matrix with dummy variables for a segmented lm with two breakpoints.
  # In the examples below, obj denotes the fitted segmented lm object.
  # Inputs:
  #   x.values:  the x values of the segmented lm
  #   x_names:   the name of the column of x values
  #   psi.est:   legacy from the predict.segmented function; leave it set to TRUE
  #   nameU:     names (character) of the 3rd and 4th coefs, i.e. "U1.x" and "U2.x" for an lm
  #              with two breaks. Example: names(c(obj$coef[3], obj$coef[4]))
  #   nameV:     names (character) of the 5th and 6th coefs, i.e. "psi1.x" and "psi2.x" for an lm
  #              with two breaks. Example: names(c(obj$coef[5], obj$coef[6]))
  #   diffSlope: the coefficients (numeric) holding the slope differences, called U1.x and U2.x
  #              for an lm with two breaks. Example: c(obj$coef[3], obj$coef[4])
  #   est.psi:   the estimated breakpoints (numeric) from segmented.lm.
  #              Example: c(obj$psi[1,2], obj$psi[2,2])
  n <- length(x.values)
  k <- length(est.psi)
  PSI <- matrix(rep(est.psi, rep(n, k)), ncol = k)
  newZ <- matrix(x.values, nrow = n, ncol = k, byrow = FALSE)
  dummy1 <- pmax(newZ - PSI, 0)
  if (psi.est) {
    V <- ifelse(newZ > PSI, -1, 0)
    dummy2 <- if (k == 1) V * diffSlope else V %*% diag(diffSlope)
    newd <- cbind(x.values, dummy1, dummy2)
    colnames(newd) <- c(x_names, nameU, nameV)
  } else {
    newd <- cbind(x.values, dummy1)
    colnames(newd) <- c(x_names, nameU)
  }
  # if (!x_names %in% names(coef(obj.seg)))
  #   newd <- newd[, -1, drop = FALSE]
  return(newd)
}
## Test dummy matrix function----------------------------------------------
set.seed(12)
xx<-1:100
zz<-runif(100)
yy<-2+1.5*pmax(xx-35,0)-1.5*pmax(xx-70,0)+15*pmax(zz-.5,0)+rnorm(100,0,2)
dati<-data.frame(x=xx,y=yy,z=zz)
out.lm<-lm(y~x,data=dati)
#1 segmented variable, 2 breakpoints: you have to specify starting values (vector) for psi:
o<-segmented(out.lm,seg.Z=~x,psi=c(30,60),
control=seg.control(display=FALSE))
slope(o)
plot.segmented(o)
summary(o)
# Test dummy matrix fn with the same dataset
newdata <- dati
nameU1 <- c("U1.x", "U2.x")
nameV1 <- c("psi1.x", "psi2.x")
diffSlope1 <- c(o$coef[3], o$coef[4])
est.psi1 <- c(o$psi[1,2], o$psi[2,2])
test <- dummy.matrix(x.values = newdata$x, x_names = "x", psi.est = TRUE,
nameU = nameU1, nameV = nameV1, diffSlope = diffSlope1, est.psi = est.psi1)
# Predict response variable using matrix multiplication
col1 <- matrix(1, nrow = dim(test)[1])
test <- cbind(col1, test) # Now test is the same as model.matrix(o)
predY <- coef(o) %*% t(test)
plot(predY[1,])
lines(predict.segmented(o), col = "blue") # good, predict.segmented gives same answer
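For the still-open confidence-interval piece, here is one possible sketch (not verified against predict.segmented(), and it treats the estimated breakpoints as fixed, so their uncertainty is ignored): use the design matrix built above together with the coefficient covariance matrix.
# Sketch: confidence intervals with the breakpoints treated as fixed
X <- test                                     # design matrix incl. intercept column
fit <- as.vector(X %*% coef(o))
se.fit <- sqrt(diag(X %*% vcov(o) %*% t(X)))  # std. errors of the fitted values
crit <- qt(0.975, df.residual(o))
ci <- data.frame(fit = fit, lwr = fit - crit * se.fit, upr = fit + crit * se.fit)
head(ci)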
I am trying to convert an "nlmrt" object to an "nls" object using nls2. However, I can only manage to do it if I write the names of the parameters explicitly in the call. Can't I define the parameter names programmatically? See the reproducible example:
library(nlmrt)
scale_vector <- function(vector, ranges_in, ranges_out){
  t <- (vector - ranges_in[1, ]) / (ranges_in[2, ] - ranges_in[1, ])
  (1 - t) * ranges_out[1, ] + t * ranges_out[2, ]
}
shobbs.res <- function(x) {
  # UNSCALED Hobbs weeds problem -- coefficients are rescaled internally using
  # scale_vector; y is taken from the enclosing environment
  ranges_in  <- rbind(c(0, 0, 0), c(100, 10, 0.1))
  ranges_out <- rbind(c(0, 0, 0), c(1, 1, 1))
  x <- scale_vector(x, ranges_in, ranges_out)
  tt <- 1:12
  100*x[1]/(1 + 10*x[2]*exp(-0.1*x[3]*tt)) - y
}
y <- c(5.308, 7.24, 9.638, 12.866, 17.069, 23.192, 31.443,
38.558, 50.156, 62.948, 75.995, 91.972)
st <- c(b1=100, b2=10, b3=0.1)
ans1n <- nlfb(st, shobbs.res)
print(coef(ans1n))
This works:
library(nls2)
ans_nls2 <- nls2(y ~ shobbs.res(c(b1, b2, b3)) + y, start = coef(ans1n), alg = "brute")
However, this forces me to hard-code the parameters names in the call to nls2. For reasons related to my actual code, I would like to be able to do something like
ans_nls2 <- nls2(y ~ shobbs.res(names(st)) + y, start = coef(ans1n), alg = "brute")
But this returns an error:
Error in vector - ranges_in[1, ] :
non-numeric argument to binary operator
Is it possible to fix this, without having to hard-code explicitly the names of parameters in the call to nls2?
nls2 will accept a string as a formula:
co <- coef(ans1n)
fo_str <- sprintf("y ~ shobbs.res(c(%s)) + y", toString(names(co)))
nls2(fo_str, start = co, alg = "brute")
giving:
Nonlinear regression model
model: y ~ shobbs.res(c(b1, b2, b3)) + y
data: NULL
b1 b2 b3
196.1863 49.0916 0.3136
residual sum-of-squares: 2.587
Number of iterations to convergence: 3
Achieved convergence tolerance: NA
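A small variation on the same idea (not part of the original answer) is to convert the string to a formula object explicitly before the call:
fo <- as.formula(sprintf("y ~ shobbs.res(c(%s)) + y", toString(names(co))))
nls2(fo, start = co, alg = "brute")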
I'm trying a very simple random forest, as shown below. The code is entirely self-contained and runnable.
library(randomForest)
n = 1000
factor=10
x1 = seq(n) + rnorm(n, 0, 150)
y = x1*factor + rnorm(n, 0, 550)
x_data = data.frame(x1)
y_data = data.frame(y)
k=2
for (nfold in seq(k)){
fold_ids <- cut(seq(1, nrow(x_data)), breaks=k, labels=FALSE)
id_indices <- which(fold_ids==nfold)
fold_x <- x_data[id_indices,]
fold_y <- y_data[id_indices,]
fold_x_df = data.frame(x=fold_x)
fold_y_df = data.frame(y=fold_y)
print(paste("number of rows in fold_x_df is ", nrow(fold_x_df), sep=" "))
print(paste("number of rows in fold_y_df is ", nrow(fold_y_df), sep=" "))
rf = randomForest(fold_x_df, fold_y_df, ntree=1000)
print(paste("mse for fold number ", " is ", sum(rf$mse)))
}
rf = randomForest(x_data, y_data, ntree=1000)
It gives me an error:
...The response has five or fewer unique values. Are you sure you want to do regression?
I don't understand why it gives me that error.
I've checked these sources:
Use of randomforest() for classification in R?
RandomForest error code
https://www.kaggle.com/c/15-071x-the-analytics-edge-competition-spring-2015/forums/t/13383/warning-message-in-random-forest
None of those solved my problem. You can look at the print statements: there are clearly more than 5 unique values in the response. Not to mention, I'm doing regression here, not classification, so I'm not sure why that check applies at all.
The problem is giving the response as a data frame. Since the response must be one-dimensional, it makes sense that it should be a vector. Here's how I would simplify your code to use the data argument of randomForest with the formula method to avoid the issue entirely:
## simulation: unchanged (but seed set for reproducibility)
library(randomForest)
n = 1000
factor=10
set.seed(47)
x1 = seq(n) + rnorm(n, 0, 150)
y = x1*factor + rnorm(n, 0, 550)
## use a single data frame
all_data = data.frame(y, x1)
## define the number of folds and the fold ids outside the loop
k <- 2
fold_ids <- cut(seq_len(nrow(all_data)), breaks = k, labels = FALSE)
for (nfold in seq(k)) {
id_indices <- which(fold_ids == nfold)
## sprintf can be nicer than paste for "filling in blanks"
print(sprintf("number of rows in fold %s is %s", nfold, length(id_indices)))
## just pass the subset of the data directly to randomForest
## no need for extracting, subsetting, putting back in data frames...
rf <- randomForest(y ~ ., data = all_data[id_indices, ], ntree = 1000)
## sprintf also allows for formatting
## %g switches to scientific notation for very large or very small values
print(sprintf("mse for fold %s is %g", nfold, sum(rf$mse)))
}
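If you prefer to keep the original x/y call rather than the formula interface, the minimal fix is simply to pass the response as a vector, not a data frame (a sketch of the same idea):
rf2 <- randomForest(x = data.frame(x1), y = y, ntree = 1000)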
I have already solved what I want to get out of my code; I'm just looking for a cleaner way of getting this result. Are there any built-in functions I don't know about?
We have 2 correlated variables and a lot of binomial factors (around 200),
here illustrated with just f1 and f2:
x <- rnorm(100)
y <- rnorm(100)
f1 <- rbinom(100, 1, 0.5)
f2 <- rbinom(100, 1, 0.5)
# which gives the possible groups:
group <- rep(NA, 100)
group[which(f1 & f2)] <- "A"
group[which(!f1 & f2)] <- "B"
group[which(f1 & !f2)] <- "C"
group[which(!f1 & !f2)] <- "D"
df <- data.frame(group,y,x,f1,f2)
We run a model selection, adding and removing terms and interactions, and end up with a model; here, say that both f1 and f2 and their interactions with x came out as predictors:
m <- glm(y ~ x * f1 + x * f2)
Then my aim is to make a simple linear model output for each group i.e.:
y = a * x + b
# The possible groups:
groups <- data.frame(groups = c("A", "B", "C", "D"), f1=c(1,0,1,0), f2=c(1,1,0,0))
interactions <- grep(":", attr(m$terms, "term.labels"))
factors <- attr(m$terms, "term.labels")[-c(1,interactions)]
interaction.terms <- substring(attr(m$terms, "term.labels")[interactions], 3)
functions <- data.frame(groups$groups, intercept=NA, slope=NA)
for(i in seq(along=groups$groups)) {
intercept <- coef(m)["(Intercept)"] + sum(groups[i, factors]*coef(m)[factors])
slope <- coef(m)["x"] + sum(groups[i, interaction.terms]*coef(m)[paste("x:", interaction.terms, sep="")])
functions[i, "intercept"] <- intercept
functions[i, "slope"] <- slope
}
Which gives an output like this:
> functions
groups.groups intercept slope
1 A -0.10932806 -0.07468630
2 B -0.37755949 -0.17769345
3 C 0.23635139 0.18406047
4 D -0.03188004 0.08105332
The output is correct and is what I would like, so that part is fine. I just think this method is quite a complicated mess; I can't seem to find a cleaner way of getting these functions out.
I would probably recommend using predict() for this. The intercept is just the predicted value at x=0, and the slope is the difference between the predicted values at x=1 and x=0 (which works because the model is linear in x). So you can do
int <- predict(m, cbind(groups,x=0))
t1 <- predict(m, cbind(groups,x=1))
data.frame(group=groups$groups, int=int, slope=t1-int)
You didn't set a seed for your example so your exact results aren't reproducible, but if you do set.seed(15) before the sample generation, you should get
group int slope
1 A -0.08372785 -0.16037708
2 B -0.03904330 0.14322623
3 C 0.16455660 -0.02951151
4 D 0.20924114 0.27409179
with both methods.
I am trying to build a rolling regression function based on the example here, but in addition to returning the predicted values, I would like to return some rolling model diagnostics (i.e. coefficients, t-values, and maybe R^2). I would like the results to be returned in discrete objects based on the type of result. The example provided in the link above successfully creates the rolling predictions, but I need some assistance packaging and writing out the rolling model diagnostics.
In the end, I would like the function to return the following objects:
Predictions
Coefficients
T values
R^2
Below is the code:
require(zoo)
require(dynlm)
## Create Some Dummy Data
set.seed(12345)
x <- rnorm(mean=3,sd=2,100)
y <- rep(NA,100)
y[1] <- x[1]
for(i in 2:100) y[i]=1+x[i-1]+0.5*y[i-1]+rnorm(1,0,0.5)
int <- 1:100
dummydata <- data.frame(int=int,x=x,y=y)
zoodata <- as.zoo(dummydata)
rolling.regression <- function(series) {
mod <- dynlm(formula = y ~ L(y) + L(x), data = as.zoo(series)) # get model
nextOb <- max(series[,'int'])+1 # To get the first row that follows the window
if (nextOb<=nrow(zoodata)) { # You won't predict the last one
# 1) Make Predictions
predicted <- predict(mod,newdata=data.frame(x=zoodata[nextOb,'x'],y=zoodata[nextOb,'y']))
attributes(predicted) <- NULL
c(predicted=predicted,square.res <-(predicted-zoodata[nextOb,'y'])^2)
# 2) Extract coefficients
#coefficients <- coef(mod)
# 3) Extract rolling coefficient t values
#tvalues <- ????(mod)
# 4) Extract rolling R^2
#rsq <-
}
}
rolling.window <- 20
results.z <- rollapply(zoodata, width=rolling.window, FUN=rolling.regression, by.column=F, align='right')
So after figuring out how to extract t-values from the model (i.e. mod), what do I need to do to make the function return three (3) separate objects (i.e. Predictions, Coefficients, and T-values)?
I am fairly new to R, really new to functions, and extremely new to zoo, and I'm stuck.
Any assistance would be greatly appreciated.
I hope I understood you correctly; here is a small edit of your function:
rolling.regression <- function(series) {
mod <- dynlm(formula = y ~ L(y) + L(x), data = as.zoo(series)) # get model
nextOb <- max(series[,'int'])+1 # To get the first row that follows the window
if (nextOb<=nrow(zoodata)) { # You won't predict the last one
# 1) Make Predictions
predicted=predict(mod,newdata=data.frame(x=zoodata[nextOb,'x'],y=zoodata[nextOb,'y']))
attributes(predicted)<-NULL
#Solution 1; Quicker to write
# c(predicted=predicted,
# square.res=(predicted-zoodata[nextOb,'y'])^2,
# summary(mod)$coef[, 1],
# summary(mod)$coef[, 3],
# AdjR = summary(mod)$adj.r.squared)
#Solution 2; Get column names right
c(predicted=predicted,
square.res=(predicted-zoodata[nextOb,'y'])^2,
coef_intercept = summary(mod)$coef[1, 1],
coef_Ly = summary(mod)$coef[2, 1],
coef_Lx = summary(mod)$coef[3, 1],
tValue_intercept = summary(mod)$coef[1, 3],
tValue_Ly = summary(mod)$coef[2, 3],
tValue_Lx = summary(mod)$coef[3, 3],
AdjR = summary(mod)$adj.r.squared)
}
}
rolling.window <- 20
results.z <- rollapply(zoodata, width=rolling.window, FUN=rolling.regression, by.column=F, align='right')
head(results.z)
predicted square.res coef_intercept coef_Ly coef_Lx tValue_intercept tValue_Ly tValue_Lx AdjR
20 10.849344 0.721452 0.26596465 0.5798046 1.049594 0.38309211 7.977627 13.59831 0.9140886
21 12.978791 2.713053 0.26262820 0.5796883 1.039882 0.37741499 7.993014 13.80632 0.9190757
22 9.814676 11.719999 0.08050796 0.5964808 1.073941 0.12523824 8.888657 15.01353 0.9340732
23 5.616781 15.013297 0.05084124 0.5984748 1.077133 0.08964998 9.881614 16.48967 0.9509550
24 3.763645 6.976454 0.26466039 0.5788949 1.068493 0.51810115 11.558724 17.22875 0.9542983
25 9.433157 31.772658 0.38577698 0.5812665 1.034862 0.70969330 10.728395 16.88175 0.9511061
To see how it works, make a small example with a regression:
x <- rnorm(1000); y <- 2*x + rnorm(1000)
reg <- lm(y ~ x)
summary(reg)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.02694322 0.03035502 0.8876033 0.374968
x 1.97572544 0.03177346 62.1816310 0.000000
As you can see, calling summary first and then getting the coefficients of it (coef(summary(reg)) works as well) gives you a table with estimates, standard errors, and t-values. So estimates are saved in column 1 of that table, t-values in column 3. And that's how I obtain them in the updated rolling.regression function.
EDIT
I updated my solution; now it also contains the adjusted R^2. If you just want the plain R^2, use summary(mod)$r.squared instead of summary(mod)$adj.r.squared.
EDIT 2
Quick and dirty hack for naming the columns:
rolling.regression <- function(series) {
mod <- dynlm(formula = y ~ L(y) + L(x), data = as.zoo(series)) # get model
nextOb <- max(series[,'int'])+1 # To get the first row that follows the window
if (nextOb<=nrow(zoodata)) { # You won't predict the last one
# 1) Make Predictions
predicted=predict(mod,newdata=data.frame(x=zoodata[nextOb,'x'],y=zoodata[nextOb,'y']))
attributes(predicted)<-NULL
#Get variable names
strVar <- c("Intercept", paste0("L", 1:(nrow(summary(mod)$coef)-1)))
vec <- c(predicted=predicted,
square.res=(predicted-zoodata[nextOb,'y'])^2,
AdjR = summary(mod)$adj.r.squared,
summary(mod)$coef[, 1],
summary(mod)$coef[, 3])
names(vec)[4:length(vec)] <- c(paste0("Coef_", strVar), paste0("tValue_", strVar))
vec
}
}
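To get the separate objects the question asked for, one small sketch (assuming rollapply keeps the names assigned inside the function as column names of the zoo result) is to split the result by column-name prefix:
results.z <- rollapply(zoodata, width = rolling.window, FUN = rolling.regression,
                       by.column = FALSE, align = "right")
predictions  <- results.z[, "predicted"]
coefficients <- results.z[, grep("^Coef_",   colnames(results.z)), drop = FALSE]
tvalues      <- results.z[, grep("^tValue_", colnames(results.z)), drop = FALSE]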