multiple linear regression: error in user defined function - r

I have written my own function for MLR. However, there seems to be an issue with the output (see the examples at the end).
But when I run the code, line by line, the output is correct.
mlr <- function(dependentvar, dataset) {
x <- model.matrix(dependentvar ~., dataset) # Design Matrix for x
y <- dependentvar # dependent variable
betas <- solve(crossprod(x))%*%crossprod(x,y) # beta values
SST <- t(y)%*%y - (sum(y)^2/dim(dataset)[1]) # total sum of squares
SSres <- t(y)%*%y -(t(betas)%*%crossprod(x,y)) # sum of squares of residuals
SSreg <- SST - SSres # regression sum of squares
sigmasqr <- SSres/(length(y) - dim(dataset)[2]) # variance or (MSE)
varofbeta <- sigmasqr[1]*solve( crossprod(x)) # variance of beta
cat("SST:", SST,"SSresiduals:", SSres,"SSregression:", SSreg, sep = "\n", append = FALSE)
return(betas)
}
To see the problem, try
mlr(trees$Height, trees)
I get the same problem even if I get rid of $
Height <- trees$Height
mlr(Height, trees)

Use the following:
x <- model.matrix(reformulate(".", dependentvar), dataset)
y <- dataset[[dependentvar]]
and pass in dependentvar as a string.
Example:
mlr("Height", trees)
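For reference, here is a minimal sketch of the whole function with that change applied. As far as I can tell, the original misbehaves because the `.` in `dependentvar ~ .` expands to every column of dataset, including Height itself, since the response is named dependentvar rather than Height. Returning a list instead of printing with cat, and using solve(a, b) instead of solve(a) %*% b, are my own variations and not part of the answer above.

```r
mlr <- function(dependentvar, dataset) {
  # Build "Height ~ ." from the string so the response is excluded from the predictors
  x <- model.matrix(reformulate(".", response = dependentvar), dataset)
  y <- dataset[[dependentvar]]

  xtx <- crossprod(x)       # X'X
  xty <- crossprod(x, y)    # X'y
  betas <- solve(xtx, xty)  # least-squares coefficients

  n <- length(y)
  k <- ncol(x)              # number of estimated coefficients, intercept included

  SST   <- sum(y^2) - sum(y)^2 / n                # total sum of squares
  SSres <- sum(y^2) - drop(crossprod(betas, xty)) # residual sum of squares
  SSreg <- SST - SSres                            # regression sum of squares
  sigmasqr  <- SSres / (n - k)                    # residual variance (MSE)
  varofbeta <- sigmasqr * solve(xtx)              # covariance matrix of the coefficients

  list(betas = betas, SST = SST, SSres = SSres, SSreg = SSreg,
       sigmasqr = sigmasqr, varofbeta = varofbeta)
}

fit <- mlr("Height", trees)
ref <- lm(Height ~ ., data = trees)  # reference fit for comparison
```

Comparing fit$betas with coef(ref) is a quick way to check that the design matrix no longer contains the response.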

Related

Finding residuals in r

Having created a least-squares regression model using a data set with certain set of x and y values, how do I then use an x value that is not from the original data set to find residual of the y-value corresponding to that x-value?
When I use resid(lm(y~x)), it gives me the residuals of all the original points/observations, but I am interested in finding out residual for a point on the regression line that was not part of the observations in the original dataset.
This snippet of code should give you an idea. I generate data, fit a model and use it to predict a new X vector, finding the residuals for it.
# creating data
x <- rnorm(1000)
y <- x * 2 + rnorm(1000)
new_x <- rnorm(1000)
new_y <- new_x * 2 + rnorm(1000)
# creating model
lm_model <- lm(y ~ x)
# predicting using model new data
y_hat_new <- predict(lm_model, data.frame(x = new_x)) # newdata must be a data.frame whose column is named x, as in the model
new_resid <- new_y - y_hat_new

fitting linear regression models with different predictors using loops

I want to fit regression models using a single predictor variable at a time. In total I have 7 predictors and 1 response variable. I want to write a chunk of code that picks a predictor variable from the data frame and fits a model. I would then like to extract the regression coefficient (not the intercept) and its sign, and store them in 2 vectors. Here's my code:
for (x in (1:7))
{
fit <- lm(distance ~ FAA_unique_with_duration_filtered[x] , data=FAA_unique_with_duration_filtered)
coeff_values<-summary(fit)$coefficients[,1]
coeff_value<-coeff_values[2]
append(coeff_value_vector,coeff_value , after = length(coeff_value_vector))
append(RCs_sign_vector ,sign(coeff_values[2]) , after = length(RCs_sign_vector))
}
Here x should select the first column, then the 2nd, and so on. However, I am getting the following error.
Error in model.frame.default(formula = distance ~ FAA_unique_with_duration_filtered[x], :
invalid type (list) for variable 'FAA_unique_with_duration_filtered[x]'
Is there a way to do this using loops?
You don't really need loops for this.
Suppose we want to regress y1, the 5th column of the built-in anscombe dataset, separately on each of the first 4 columns.
Then:
a <- anscombe
reg <- function(i) coef(lm(y1 ~., a[c(5, i)]))[[2]] # use lm
coefs <- sapply(1:4, reg)
signs <- sign(coefs)
# or
a <- anscombe
reg <- function(i) cov(a$y1, a[[i]]) / var(a[[i]]) # use formula for slope
coefs <- sapply(1:4, reg)
signs <- sign(coefs)
Alternately the following where reg is either of the reg definitions above.
a <- anscombe
coefs <- numeric(4)
for(i in 1:4) coefs[i] <- reg(i)
signs <- sign(coefs)
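If you do want to keep a loop closer to the one in the question, here is a hedged sketch on the same anscombe setup. The error in the question comes from df[x], which is a one-column data frame (a list) rather than a vector; building the formula from the column name with reformulate avoids that. Note also that append() returns a new vector and does not modify its argument, so its result has to be assigned. The names predictors, slopes and slope_signs are only for illustration.

```r
a <- anscombe
predictors <- names(a)[1:4]   # columns to try one at a time

# Pre-allocate a named vector instead of growing it with append()
slopes <- numeric(length(predictors))
names(slopes) <- predictors

for (p in predictors) {
  # reformulate("x1", response = "y1") builds the formula y1 ~ x1
  fit <- lm(reformulate(p, response = "y1"), data = a)
  slopes[p] <- coef(fit)[[p]]  # slope only, not the intercept
}

slope_signs <- sign(slopes)
```

The same pattern should carry over to the original data frame by swapping in its predictor column names and "distance" as the response.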

Different set.seed each run in R

I want to "measure" which regression method is more robust to outliers.
For this, I sum the variances of the model coefficients. On each run, I generate data from a t-distribution. I call set.seed ten times to get ten specific data sets.
However, I also want a different set of ten seeds on each run, so that in total I will have 10 sums of the variances. The code below gives me only one sum, for the first set of ten seeds.
How can I do this?
#######################################
p <- 5
n <- 50
#######################################
FX <- function(seed, data) {
#for loops over a seed #
for (i in seed) {
set.seed(seed)
# generating data from t-distribution #
x<- matrix(rt(n*p,1), ncol = p)
y<-rt(n,1)
dat=cbind(x,y)
data<-as.data.frame(dat)
# performing a regression model on the data #
lm1 <- lm(y ~ ., data=data)
lm.coefs <- coef(lm1)
lad1 <- lad(y ~ ., data=data, method="BR")
lad.coefs <- coef(lad1)
}
# calculate variance of the coefficients #
return(`attr<-`(cbind(lmm=var(lm.coefs), lad=var(lad.coefs)), "seed", seed))
}
#######################################
seeds <- 1:10 ## 10 seeds to get different data sets from the t-distribution #
res <- lapply(seeds, FX, data=data) # 10 different variances from 10 data sets/models
sov <- t(sapply(res, colSums)) # put them in a matrix
colSums(sov) # sum of the 10 variances for each model.
Here is something closer to your intended result.
The code below fixes a few key issues in your original code. It was not clear what data the function was intended to return.
1) It creates a vector of seed numbers inside the function.
2) It also creates a vector inside the function to store the variance of the coefficients for each iteration of the loop (not sure if this is what you want).
3) I needed to comment out the lad function since I do not know which package it is from (you would need to follow step 2 above to add it back in).
4) Some general cleanup of the code.
p <- 5
n <- 50
FX <- function(seed, data) {
#for loops over a seed #
#Fixes the starting seed issue
startingSeed <- (seed-1)*10 +1
seeds <- seq( startingSeed, startingSeed+9)
#create vector to store results from loop iteration
lm.coefs <- vector(mode="numeric", length=10)
index <- 1
for (i in seeds) {
set.seed(i)
# generating data from t-distribution #
x<- matrix(rt(n*p,1), ncol = p)
y<-rt(n,1)
data<-data.frame(x, y)
# performing a regression model on the data #
lm1 <- lm(y ~ ., data=data)
lm.coefs[index] <- var(coef(lm1))
# lad1 <- lad(y ~ ., data=data, method="BR")
# lad.coefs <- coef(lad1)
index <- index +1
}
# calculate variance of the coefficients #
return(`attr<-`(cbind(lmm=lm.coefs), "seed", seed))
}
seeds <- 1:10 ## 10 runs, each mapped to its own block of 10 seeds inside FX #
res <- lapply(seeds, FX, data=data) # 10 variances (one per seed) for each run
sov <- t(sapply(res, colSums)) # put them in a matrix
colSums(sov) # sum of the 10 variances for each run.
Hope this provides the answer or at least guidance to solve your problem.
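The same idea can also be written without the explicit index bookkeeping. This is only a sketch under the same assumptions as above (lm only, lad left out, run r uses seeds (r-1)*10+1 through r*10); the helper names coef_var_for_seed and run_sum are made up for illustration.

```r
p <- 5
n <- 50

# Variance of the fitted lm coefficients for one seed (one simulated data set)
coef_var_for_seed <- function(seed) {
  set.seed(seed)
  x <- matrix(rt(n * p, df = 1), ncol = p)
  y <- rt(n, df = 1)
  var(coef(lm(y ~ ., data = data.frame(x, y))))
}

# Sum of the coefficient variances over the 10 seeds belonging to one run
run_sum <- function(run) {
  seeds <- (run - 1) * 10 + 1:10
  sum(vapply(seeds, coef_var_for_seed, numeric(1)))
}

sums <- vapply(1:10, run_sum, numeric(1))  # one sum per run, 10 in total
```

Because every seed is set explicitly, calling run_sum with the same run number again reproduces the same sum.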

R: implementing my own gradient boosting algorithm

I am trying to write my own gradient boosting algorithm. I understand there are existing packages like gbm and xgboost, but I wanted to understand how the algorithm works by writing my own.
I am using the iris data set, and my outcome is Sepal.Length (continuous). My loss function is mean(1/2*(y-yhat)^2) (basically the mean squared error with 1/2 in front), so my corresponding gradient is just the residual y - yhat. I'm initializing the predictions at 0.
library(rpart)
data(iris)
#Define gradient
grad.fun <- function(y, yhat) {return(y - yhat)}
mod <- list()
grad_boost <- function(data, learning.rate, M, grad.fun) {
# Initialize fit to be 0
fit <- rep(0, nrow(data))
grad <- grad.fun(y = data$Sepal.Length, yhat = fit)
# Initialize model
mod[[1]] <- fit
# Loop over a total of M iterations
for(i in 1:M){
# Fit base learner (tree) to the gradient
tmp <- data$Sepal.Length
data$Sepal.Length <- grad
base_learner <- rpart(Sepal.Length ~ ., data = data, control = ("maxdepth = 2"))
data$Sepal.Length <- tmp
# Fitted values by fitting current model
fit <- fit + learning.rate * as.vector(predict(base_learner, newdata = data))
# Update gradient
grad <- grad.fun(y = data$Sepal.Length, yhat = fit)
# Store current model (index is i + 1 because index 1 contains the initialized estimates)
mod[[i + 1]] <- base_learner
}
return(mod)
}
With this, I split up the iris data set into a training and testing data set and fit my model to it.
train.dat <- iris[1:100, ]
test.dat <- iris[101:150, ]
learning.rate <- 0.001
M = 1000
my.model <- grad_boost(data = train.dat, learning.rate = learning.rate, M = M, grad.fun = grad.fun)
Now I calculate the predicted values from my.model. For my.model, the fitted values are 0 (vector of initial estimates) + learning.rate * predictions from tree 1 + learning rate * predictions from tree 2 + ... + learning.rate * predictions from tree M.
yhats.mymod <- apply(sapply(2:length(my.model), function(x) learning.rate * predict(my.model[[x]], newdata = test.dat)), 1, sum)
# Calculate RMSE
> sqrt(mean((test.dat$Sepal.Length - yhats.mymod)^2))
[1] 2.612972
I have a few questions:
1) Does my gradient boosting algorithm look right?
2) Did I calculate the predicted values yhats.mymod correctly?
1) Yes, this looks correct. At each step you are fitting to the pseudo-residuals, which are computed as the negative gradient of the loss with respect to the fit. You have correctly derived this at the start of your question, and even bothered to get the factor of 1/2 right.
2) This also looks correct. You are aggregating across the models, weighted by learning rate, just as you did during training.
But to address something that was not asked, I noticed that your training setup has a few quirks.
The iris dataset is split equally between 3 species (setosa, versicolor, virginica) and these are adjacent in the data. Your training data has all of the setosa and versicolor, while the test set has all of the virginica examples. There is no overlap, which will lead to out-of-sample problems. It is preferable to balance your training and test sets to avoid this.
The combination of learning rate and model count looks too low to me. The remaining error shrinks as (1-lr)^n, so the fit can reach at most 1-(1-lr)^n of the target. With lr = 1e-3 and n = 1000 that is only 63.2% of the data magnitude. That is, even if every model predicts every sample correctly, you would be estimating 63.2% of the correct value. Initializing the fit with an average, instead of 0s, would help since then the effect is a regression to the mean instead of just a drag.
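Both points can be checked with a few lines of R. This is a rough sketch, not a replacement for the code in the question: the first part assumes an idealised base learner that predicts the current residual exactly, and the second part draws 35 rows per species for training, where 35 and the seed are arbitrary choices of mine.

```r
# Part 1: how far the fit can get with lr = 0.001 and M = 1000
lr <- 0.001
M <- 1000
y <- c(4.3, 5.8, 7.9)          # a few made-up target values
fit <- rep(0, length(y))       # initialised at 0, as in the question
for (i in seq_len(M)) {
  fit <- fit + lr * (y - fit)  # perfect learner: prediction equals the residual
}
recovered <- 1 - (1 - lr)^M    # fraction of the target reached, about 0.632
ratio <- fit / y               # should match `recovered` for every value

# Part 2: a split that keeps all three species in both sets
set.seed(1)
rows_by_species <- split(seq_len(nrow(iris)), iris$Species)
train_idx <- unlist(lapply(rows_by_species, function(idx) sample(idx, 35)))
train.dat <- iris[train_idx, ]
test.dat  <- iris[-train_idx, ]
```

With more trees or a larger learning rate, recovered moves towards 1, which is the usual trade-off between shrinkage and the number of iterations.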

Issues with predict function in R

I'm having issues with using the predict() function in R and I hope that I can get some help. Consider a dataset with two columns - 1) Y, 2) X
My goal is to fit a natural spline, get a 95% CI, and mark points outside of the 95% CI as outliers. Here is what I do:
1) Initially no point in the dataset is marked as an outlier.
2) I fit my ns model and, using its 95% CI, I mark the points outside of the CI as outliers.
3) I then exclude the initially marked outliers, fit another ns model and, using its 95% CI, mark outliers again.
Issue:
Suppose my initial dataset has 1000 obs. I mark some outliers in the first round and I get 23 outliers. Then I fit another ns (call it fit.ns) using the remaining 977 non-outliers. I then use ALL X's (all 1000) to get predicted values based on this new fit, but I get a warning AND an error that newdata in my predict function has 1000 obs but the fit has 977. The predicted values returned also have 977 values and NOT 1000.
My predict() code:
# Fitting a Natural Spline Fit (df = 3 by default)
fit.ns <- lm(data.ns$IBI ~ ns(data.ns$Time, knots = data.ns$Time[knots]))
# Getting Fitted Values and 95% CI:
fit.ns.values <- predict(fit.ns, newdata = data.frame(Time = data.temp$Time),
interval="prediction", level = 1 - 0.05) # ??? PROBLEM
I really appreciate your help.
Seems that I cannot upload the dataset, but my code is:
library(splines)
ns.knot <- 10
for (i in 1:2){
# I exclude outliers so that my ns.fit does not get affected by outliers
data.ns <- data.temp[data.temp$OutlierInd == 0,]
data.ns$BeatNum <- 1:nrow(data.ns) # BeatNum is like a row number for me and is an auxiliary variable
# Place Holder for Natural Spline results:
data.temp$IBI.NSfit <- rep(NA, nrow(data.temp))
data.temp$IBI.NSfit.L95 <- rep(NA, nrow(data.temp))
data.temp$IBI.NSfit.U95 <- rep(NA, nrow(data.temp))
# defining the knots in n.s.:
knots <- (data.ns$BeatNum)[seq(ns.knot, (length(data.ns$BeatNum) - ns.knot), by = ns.knot)]
# Fitting a Natural Spline Fit (df = 3 by default)
fit.ns <- lm(data.ns$IBI ~ ns(data.ns$Time, knots = data.ns$Time[knots]))
# Getting Fitted Values and 95% CI:
fit.ns.values <- predict(fit.ns, newdata = data.frame(Time = data.temp$Time), interval="prediction", level = 1 - 0.05) # ??? PROBLEM
data.temp$IBI.NSfit <- fit.ns.values[,1]
data.temp$IBI.NSfit.L95 <- fit.ns.values[,2]
data.temp$IBI.NSfit.U95 <- fit.ns.values[,3]
# Updating OutlierInd based on Natural Spline 95% CI:
data.temp$OutlierInd <- ifelse(data.temp$IBI < data.temp$IBI.NSfit.U95 & data.temp$IBI > data.temp$IBI.NSfit.L95, 0, 1)
}
Finally, I found the solution:
When I fit the model, I should use the "data =" option. In other words, instead of the command below,
# Fitting a Natural Spline Fit (df = 3 by default)
fit.ns <- lm(data.ns$IBI ~ ns(data.ns$Time, knots = data.ns$Time[knots]))
I should use the command below instead:
# Fitting a Natural Spline Fit (df = 3 by default)
fit.ns <- lm(IBI ~ ns(Time, knots = Time[knots]), data = data.ns)
Then the predict function will work.
I wanted to add a comment but my rep level doesn't allow that.
Anyway, I think it is a well-documented point that predict looks up variables in newdata by the exact names used in the model formula. So, in my experience, fitting with plain column names and a data argument is the best way to get around this error.
So, in the case above, please define a data frame just for your fit, like this:
library(splines)
#Fit part
fit.data <- data.frame(y=rnorm(30),x=rnorm(30))
fit.ns <- lm(y ~ ns(x,3),data=fit.data)
#Predict
pred.data <- data.frame(y=rnorm(10),x=rnorm(10))
pred.fit <- predict(fit.ns, newdata = data.frame(x = pred.data$x), interval = "confidence", level = 0.95)
IMHO, this should get rid of your error
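To see the difference between the two ways of writing the formula, here is a small simulated sketch. The column names Time and IBI follow the question, but the data, the 20 dropped rows and the df = 3 spline are made up for illustration.

```r
library(splines)

set.seed(42)
full <- data.frame(Time = seq(0, 10, length.out = 100))
full$IBI <- sin(full$Time) + rnorm(100, sd = 0.1)

# Pretend 20 interior rows were flagged as outliers and dropped before fitting
sub <- full[-seq(3, 98, by = 5), ]

# Formula written with sub$...: predict() cannot match newdata by name,
# so it silently falls back to the 80 fitting rows (and warns about it)
bad <- lm(sub$IBI ~ ns(sub$Time, df = 3))
bad_pred <- suppressWarnings(predict(bad, newdata = data.frame(Time = full$Time)))

# Formula written with bare column names plus data =: newdata is used as intended
good <- lm(IBI ~ ns(Time, df = 3), data = sub)
good_pred <- predict(good, newdata = full, interval = "prediction", level = 0.95)

n_bad <- length(bad_pred)   # number of predictions from the first fit
n_good <- nrow(good_pred)   # number of predictions from the second fit
```

Here n_bad equals the number of fitting rows while n_good equals the number of rows in newdata, which is the mismatch described in the question.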
