I am really struggling with the following problem set using R.
I want to simulate a data set with one dependent and 20 independent variables that are i.i.d. normal. Each variable should have 100 observations. (I managed to do this part.)
Now the part I am struggling with: my plan is to conduct automated regressions for all possible combinations of up to 5 regressors, using a regression function of my own that mimics the output of summary(lm) and takes a vector y and a matrix or vector x as input (so my.lm(y, x)), and then to bring the results into a suitable data structure.
I would be thankful for any hint!
I doubt the soundness of what you are trying to do, but here goes.
I will make up a dataset, since you have not posted one.
my.lm <- function(x, y, n = 5){
  # fit one model given a row of Combn: the non-zero entries are the
  # column indices of x to use as regressors
  f <- function(inx){
    inx_cols <- inx[inx != 0]   # apply() passes the row itself, so just drop the zero padding
    X <- as.data.frame(x[, inx_cols])
    names(X) <- paste0("X", inx_cols)
    X <- cbind(X, y)
    name_y <- names(X)[length(names(X))]
    fmla <- as.formula(paste(name_y, ".", sep = "~"))
    tryCatch(lm(fmla, data = X), error = function(e) e)
  }
  # one row per combination of 1 to n regressors, zero-padded to n columns
  nc_x <- ncol(x)
  nr <- sum(choose(nc_x, seq_len(n)))
  Combn <- matrix(0, nrow = nr, ncol = n)
  first <- 1
  for(i in seq_len(n)){
    last <- first + choose(nc_x, i) - 1
    Combn[first:last, seq_len(i)] <- t(combn(nc_x, i))
    first <- last + 1
  }
  apply(Combn, 1, f)
}
set.seed(6876)
regr <- replicate(20, rnorm(100))
coefs <- sample(-5:5, 20, TRUE)
resp <- regr %*% coefs + rnorm(100)
lm_list <- my.lm(regr, resp)
length(lm_list)
#[1] 21699
So the function above produced as many objects as expected: sum(choose(20, 1:5)) = 21699.
Before continuing, let's see how many of them are errors (from a singular matrix, for instance).
err_list <- lapply(lm_list, function(x){
  if(inherits(x, "error")) x else NULL
})
err_list <- err_list[!sapply(err_list, is.null)]
length(err_list)
#[1] 0
No errors.
So get the summaries of the objects of class "lm".
good_list <- lapply(lm_list, function(x){
  if(inherits(x, "lm")) x else NULL
})
good_list <- good_list[!sapply(good_list, is.null)]
smry_list <- lapply(good_list, summary)
smry_list[[1]]
#
#Call:
#lm(formula = fmla, data = X)
#
#Residuals:
#    Min      1Q  Median      3Q     Max
#-34.654  -9.487  -1.985   9.486  50.213
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept)   0.6449     1.5237   0.423    0.673
#X1           -7.3969     1.5074  -4.907 3.68e-06 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 15.02 on 98 degrees of freedom
#Multiple R-squared: 0.1972,  Adjusted R-squared: 0.189
#F-statistic: 24.08 on 1 and 98 DF, p-value: 3.684e-06
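As for bringing the results into one data structure, as the question asks, here is a minimal sketch (the name results_df and the choice of columns are mine): pull the regressor set and the fit statistics out of each summary.
results_df <- do.call(rbind, lapply(smry_list, function(s){
  cf <- coef(s)  # the coefficient matrix of a summary.lm object
  data.frame(regressors = paste(rownames(cf)[-1], collapse = "+"),
             r.squared  = s$r.squared,
             adj.r.sq   = s$adj.r.squared,
             sigma      = s$sigma)
}))
head(results_df)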
I'm trying to run a different model for each of 28 experiment-by-location (latitude) combinations, in order to determine the regression slope for each experiment at each location. Here is a link to my data. Similar code has worked before for looping regressions when varying the predictor, but now this code is doing several confusing things: the P-value and slope columns contain 28 rows of NAs followed by 55 rows of values, the R-squared column contains 88 rows of values, and when I ran a sample model I couldn't find any of its values anywhere in the table.
data$Grain <- as.numeric(as.character(data$Grain))
data2 <- subset(data, Age > 1)
ls1 <- list()
ls2 <- list()
ls3 <- list()
data2$Key3 <- paste(data2$LAT, data2$Experiment)
for (i in unique(data2$Key3)){
  model <- lm(as.formula(paste0("Grain", "~", "Age")), subset(data2, Key3 == i))
  pval  <- summary(model)$coefficients[,4]  # P values for the model
  rsq   <- summary(model)$r.squared
  slope <- summary(model)$coefficients[,1]
  ls1 <- c(ls1, pval[2])   # P value of the second row, i.e. the predictor Age
  ls2 <- c(ls2, rsq)
  ls3 <- c(ls3, slope[2])  # slope of the second row, i.e. the predictor Age
}
Ps     <- do.call(rbind, ls1)
Rs     <- do.call(rbind, ls2)
slopes <- do.call(rbind, ls3)
table  <- cbind(Ps, Rs, slopes)
## Sample model- sorry the Key3 values are so ugly
summary(lm(Grain~Age, subset(data2, Key3 == '44.06751 IREE- N Rate')))
Any ideas on how to get the actual P, R2, and slopes into a table? Any thoughts as to why this is spitting out so many rows of NAs?
The code seems to work now. Use Key3 == '44.06751IREE- N Rate' (no space) if you build the key with paste0(), because unlike paste() it does not insert a space separator by default.
data$Grain <- readr::parse_number(data$Grain)
data2 <- subset(data, Age > 1)
ls1 <- list()
ls2 <- list()
ls3 <- list()
data2$Key3 <- paste0(data2$LAT, data2$Experiment)
for (i in unique(data2$Key3)){
  model <- lm(as.formula(paste0("Grain", "~", "Age")), subset(data2, Key3 == i))
  pval  <- summary(model)$coefficients[,4]  # P values for the model
  rsq   <- summary(model)$r.squared
  slope <- summary(model)$coefficients[,1]
  ls1 <- c(ls1, pval[2])   # P value of the second row, i.e. the predictor Age
  ls2 <- c(ls2, rsq)
  ls3 <- c(ls3, slope[2])  # slope of the second row, i.e. the predictor Age
}
Ps     <- do.call(rbind, ls1)
Rs     <- do.call(rbind, ls2)
slopes <- do.call(rbind, ls3)
table  <- cbind(Ps, Rs, slopes)
## Sample model - sorry the Key3 values are so ugly
summary(lm(Grain ~ Age, subset(data2, Key3 == '44.06751IREE- N Rate')))
Call:
lm(formula = Grain ~ Age, data = subset(data2, Key3 == "44.06751IREE- N Rate"))

Residuals:
   Min     1Q Median     3Q    Max
  -245   -210     19    135    586

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1175.3       68.1    17.2   <2e-16 ***
Age           -310.2       22.2   -14.0   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 211 on 132 degrees of freedom
  (10 observations deleted due to missingness)
Multiple R-squared: 0.596,  Adjusted R-squared: 0.593
F-statistic: 195 on 1 and 132 DF, p-value: <2e-16
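To get the statistics into a table you can match them by Key3 directly; cbind() on the unnamed lists loses track of which row belongs to which experiment. A minimal sketch (assuming the loop above has just run, so ls1, ls2, ls3 are filled in the order of unique(data2$Key3)):
results <- data.frame(Key3  = unique(data2$Key3),
                      P     = unlist(ls1),   # p-value of Age per group
                      R2    = unlist(ls2),
                      slope = unlist(ls3))
head(results)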
I am having trouble estimating the constant (intercept) of a multivariate linear regression model using the stochastic gradient descent optimization method (i.e. mini-batch gradient descent with a batch size of 1). The function in R that I use is:
StochasticGradientDescent <- function(data, alpha, iteration, epsilon){
  data <- matrix(unlist(data), ncol = ncol(data), byrow = FALSE)
  # note: the original 1:ncol(data)-1 means (1:ncol(data)) - 1 and only worked
  # because a 0 index is silently dropped; seq_len makes the intent explicit
  independent.variable <- data[, seq_len(ncol(data) - 1)]
  dependent.variable   <- data[, ncol(data)]
  # add a column of 1s for the constant
  independent.variable <- cbind(theta0 = 1, independent.variable)
  theta_new <- matrix(0, ncol = ncol(independent.variable))
  theta_old <- matrix(1, ncol = ncol(independent.variable))
  # cost function
  CostFunction <- function(independent.variable, dependent.variable, theta){
    1/(2*NROW(dependent.variable)) *
      sum(((independent.variable %*% t(theta)) - dependent.variable)^2)
  }
  thetas <- vector(mode = "list", length = iteration)
  thetas[[1]] <- theta_new
  J <- numeric(length = iteration)
  J[1] <- CostFunction(independent.variable, dependent.variable, theta_old)
  # stochastic gradient: one randomly drawn observation per update
  derivative <- function(independent.variable, dependent.variable, theta){
    idx <- sample.int(NROW(independent.variable), 1)
    descent <- t(independent.variable[idx, , drop = FALSE]) %*%
      ((independent.variable[idx, , drop = FALSE] %*% t(theta)) -
         dependent.variable[idx, drop = FALSE])
    t(descent)
  }
  # stopping criterion
  step <- 1
  while(any(abs(theta_new - theta_old) > epsilon) && step <= iteration){
    step <- step + 1
    # gradient descent update
    theta_old <- theta_new
    theta_new <- theta_old - alpha *
      derivative(independent.variable, dependent.variable, theta_old)
    # record keeping
    thetas[[step]] <- theta_new
    J[step] <- CostFunction(independent.variable, dependent.variable, theta_new)
  }
  costs <- data.frame(costs = J)
  theta <- data.frame(do.call(rbind, thetas), row.names = NULL)
  list(costs = costs, theta = theta)
}
I simulate artificial data:
x1 <- runif(1000000,1,100);
x2 <- runif(1000000,1,200);
y <- 5+4*x1+3*x2;
The QR decomposition used by lm gives this result:
fit <- lm(y ~ x1+x2);
summary(fit)
#
#Call:
#lm(formula = y ~ x1 + x2)
#
#Residuals:
#       Min         1Q     Median         3Q        Max
#-7.386e-09  0.000e+00  0.000e+00  0.000e+00  9.484e-10
#
#Coefficients:
#             Estimate Std. Error   t value Pr(>|t|)
#(Intercept) 5.000e+00  2.162e-14 2.313e+14   <2e-16 ***
#x1          4.000e+00  2.821e-16 1.418e+16   <2e-16 ***
#x2          3.000e+00  1.403e-16 2.138e+16   <2e-16 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 8.062e-12 on 999997 degrees of freedom
#Multiple R-squared: 1,  Adjusted R-squared: 1
#F-statistic: 3.292e+32 on 2 and 999997 DF, p-value: < 2.2e-16
My initial values for the thetas are 0s, the learning rate is 0.00005, the number of iterations is 5000, and the stopping criterion epsilon (a user-defined small value) is 0.000001: if the change in the trained parameters between two iterations is smaller than this value, the algorithm stops. The result I get is given below:
data<- data.frame(cbind(x1, x2, y))
results <- StochasticGradientDescent( data = data, alpha = 0.00005, iteration = 5000, epsilon = .000001)
results$theta[ nrow(results$theta), ]
#        theta0      V2       V3
#5001 0.2219142 4.04408 2.999861
As you can see, the slope estimates are very close to the true ones, but the estimate of theta0 (the intercept/constant) is not even close. Besides, I only get these values at the end of the whole cycle of iterations, which is not good; I cannot converge efficiently. I tried but I really could not figure out why this is the case. Could someone help me, please?
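One common cause of exactly this symptom is the scale mismatch between the constant column (always 1) and regressors that range up to 100 and 200: with a single learning rate, the gradient step for theta0 is far too small. A standard remedy is to standardize the regressors before running SGD and transform the coefficients back afterwards. A minimal sketch using the function above (the learning rate 0.05 is a guess and will need tuning):
x1s <- scale(x1)
x2s <- scale(x2)
data_s <- data.frame(x1 = as.vector(x1s), x2 = as.vector(x2s), y = y)
res_s  <- StochasticGradientDescent(data = data_s, alpha = 0.05,
                                    iteration = 5000, epsilon = 1e-6)
th <- as.numeric(res_s$theta[nrow(res_s$theta), ])  # theta0, theta1, theta2 on scaled data
# undo the scaling: y = th0 + th1*(x1 - m1)/s1 + th2*(x2 - m2)/s2
b1 <- th[2] / attr(x1s, "scaled:scale")
b2 <- th[3] / attr(x2s, "scaled:scale")
b0 <- th[1] - b1 * attr(x1s, "scaled:center") - b2 * attr(x2s, "scaled:center")
c(b0, b1, b2)  # should approach 5, 4, 3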
Following the code of a tutorial, I defined a class and methods for my own linear-regression function.
data(cats, package="MASS")
linmodeEst <- function(x, y){
  qx <- qr(x)                 # QR decomposition
  coef <- solve.qr(qx, y)     # same as solve(t(x) %*% x) %*% t(x) %*% y
  df <- nrow(x) - ncol(x)
  sigma2 <- sum((y - x %*% coef)^2)/df
  vcov <- sigma2 * chol2inv(qx$qr)
  colnames(vcov) <- rownames(vcov) <- colnames(x)
  list(coefficients = coef, vcov = vcov, sigma = sqrt(sigma2), df = df)
}
linmod <- function(x, ...) UseMethod("linmod")

linmod.default <- function(x, y, ...){
  x <- as.matrix(x)
  y <- as.matrix(y)
  est <- linmodeEst(x, y)
  est$fitted.values <- as.vector(x %*% est$coefficients)
  est$residuals <- y - est$fitted.values
  est$call <- match.call()
  class(est) <- "linmod"
  est
}

print.linmod <- function(x, ...){
  cat("Call:\n")
  print(x$call)
  cat("\nCoefficients:\n")
  print(x$coefficients)
}

summary.linmod <- function(object, ...){
  se <- sqrt(diag(object$vcov))
  tval <- coef(object)/se
  TAB <- cbind(Estimate = coef(object),
               StdErr   = se,
               t.value  = tval,
               p.value  = 2*pt(-abs(tval), df = object$df))
  res <- list(call = object$call, coefficients = TAB)
  class(res) <- "summary.linmod"
  res
}

print.summary.linmod <- function(x, ...){
  cat("Call:\n")
  print(x$call)
  cat("\n")
  printCoefmat(x$coefficients, P.value = TRUE, has.Pvalue = TRUE)
}
x <- cbind(Const = 1, Bwt = cats$Bwt)
y <- cats$Hwt   # the original cats$Hw worked only via partial matching
mod1 <- linmod(x, y)
summary(mod1)
So, in summary.linmod <- function(object, ...) I defined the table names: Estimate, StdErr, t.value, p.value. In R I get all the names in the header, but in RStudio just StdErr. Why is this happening?
My system: Linux 64bit, R 3.1.1
The documentation at ?cbind states that: "For cbind (rbind) the column (row) names are taken from the colnames (rownames) of the arguments if these are matrix-like."
In constructing TAB, you are binding three single-column matrices (coef(object), tval, and 2*pt(-abs(tval), df = object$df)) to se, a plain vector with two elements. Because of the behaviour quoted above, cbind uses the matrices' (empty) column names to name the matrix-like columns of TAB.
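You can see the quoted behaviour in a tiny example (the numbers are made up):
m <- matrix(1:2, ncol = 1)                  # single-column matrix, no colnames
cbind(Estimate = m, StdErr = c(0.1, 0.2))
#        StdErr
# [1,] 1    0.1
# [2,] 2    0.2
# the matrix's empty column name wins over the Estimate tag;
# only the plain vector gets its label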
Use cbind.data.frame or simply data.frame to construct TAB, and your summary output will have the expected names:
summary.linmod <- function(object, ...){
  se <- sqrt(diag(object$vcov))
  tval <- coef(object)/se
  TAB <- data.frame(Estimate = coef(object),
                    StdErr   = se,
                    t.value  = tval,
                    p.value  = 2*pt(-abs(tval), df = object$df))
  res <- list(call = object$call, coefficients = TAB)
  class(res) <- "summary.linmod"
  res
}
> summary(mod1)
# Call:
# linmod.default(x = x, y = y)
#
#       Estimate  StdErr t.value p.value
# Const -0.35666 0.69228 -0.5152  0.6072
# Bwt    4.03406 0.25026 16.1194  <2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This piece of code will return the coefficients: intercept, slope1, slope2.
set.seed(1)
n  <- 10
y  <- rnorm(n)
x1 <- rnorm(n)
x2 <- rnorm(n)
lm.ft <- function(y, x1, x2)
  return(lm(y ~ x1 + x2)$coef)
res <- list()
for(i in 1:n){
  x1.bar <- x1 - x1[i]
  x2.bar <- x2 - x2[i]
  res[[i]] <- lm.ft(y, x1.bar, x2.bar)
}
If I type:
> res[[1]]
I get:
(Intercept) x1 x2
-0.44803887 0.06398476 -0.62798646
How can we return predicted values, residuals, R squared, etc.?
I need something general that lets me extract whatever I need from the summary.
There are a couple of things going on here.
First, you are better off combining your variables into a data.frame:
df <- data.frame(y=rnorm(10), x1=rnorm(10), x2 = rnorm(10))
fit <- lm(y~x1+x2, data=df)
If you do this, using your model for prediction with a new dataset will be much easier.
Second, some of the statistics of the fit are accessible from the model itself, and some are accessible from summary(fit).
coef <- coefficients(fit) # coefficients
resid <- residuals(fit) # residuals
pred <- predict(fit) # fitted values
rsq <- summary(fit)$r.squared # R-sq for the fit
se <- summary(fit)$sigma # se of the fit
To get the statistics of the coefficients, you need to use summary:
stat.coef <- summary(fit)$coefficients
coef <- stat.coef[,1] # 1st column: coefficients (same as above)
se.coef <- stat.coef[,2] # 2nd column: se for each coef
t.coef <- stat.coef[,3] # 3rd column: t-value for each coef
p.coef <- stat.coef[,4] # 4th column: p-value for each coefficient
In your function, you return just the coefficients. Try returning the whole model:
lm.ft=function(y,x1,x2) lm(y~x1+x2) # You don't need the return statement.
Now try your code, and then run:
summary(res[[1]])
# Call:
# lm(formula = y ~ x1 + x2)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.88518 -0.25311 0.03868 0.43110 0.61753
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -0.44804 0.32615 -1.374 0.2119
# x1 0.06398 0.24048 0.266 0.7979
# x2 -0.62799 0.26915 -2.333 0.0524 .
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.6149 on 7 degrees of freedom
# Multiple R-squared: 0.5173, Adjusted R-squared: 0.3794
# F-statistic: 3.751 on 2 and 7 DF, p-value: 0.07814
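Combining these accessors with the loop from the question (and with lm.ft returning the whole model, as above), a minimal sketch that collects a few statistics per fit:
fit_stats <- do.call(rbind, lapply(res, function(m){
  s <- summary(m)
  data.frame(r.squared = s$r.squared,
             sigma     = s$sigma,                  # residual standard error
             p.x1      = s$coefficients["x1", 4])  # p-value of the x1 slope
}))
fit_stats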
You need predict:
set.seed(1)
n  <- 10
y  <- rnorm(n)
x1 <- rnorm(n)
x2 <- rnorm(n)
lm.ft <- function(y, x1, x2)
  # return(lm(y ~ x1 + x2)$coef)
  return(lm(y ~ x1 + x2))
res <- lm.ft(y, x1, x2)
ypredicted <- predict(res)
residuals <- y - ypredicted
I'm trying to write a function that regresses on multiple variables, then predicts new data based on the model:
"tnt" <- function(train_dep, train_indep, test_dep, test_indep)
{
y <- train_dep
x <- train_indep
mod <- lm (y ~ x)
estimate <- predict(mod, data.frame(x=test_indep))
rmse <- sqrt(sum((test_dep-estimate)^2)/length(test_dep))
print(summary(mod))
print(paste("RMSE: ", rmse))
}
If I pass it the following, it fails:
train_dep   <- vector1
train_indep <- cbind(vector2, vector3)
test_dep    <- vector4
test_indep  <- cbind(vector5, vector6)
tnt(train_dep, train_indep, test_dep, test_indep)
Changing the above to something like the following works, but I want this done dynamically so I can pass a matrix with any number of columns:
x1 <- x[,1]
x2 <- x[,2]
mod <- lm(y ~ x1 + x2)
estimate <- predict(mod, data.frame(x1 = test_indep[,1], x2 = test_indep[,2]))
Looks like this could help, but I'm still confused on the rest of the process: http://finzi.psych.upenn.edu/R/Rhelp02a/archive/70843.html
Try this instead:
tnt <- function(train_dep, train_indep, test_dep, test_indep)
{
  dat <- as.data.frame(cbind(y = train_dep, train_indep))
  mod <- lm(y ~ ., data = dat)
  newdat <- as.data.frame(test_indep)
  names(newdat) <- names(dat)[2:length(dat)]
  estimate <- predict(mod, newdata = newdat)
  rmse <- sqrt(sum((test_dep - estimate)^2)/length(test_dep))
  print(summary(mod))
  print(paste("RMSE: ", rmse))
}
Called with a small toy test set, the function prints:
Call:
lm(formula = y ~ ., data = dat)
Residuals:
1 2 3
0 0 0
Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)        0          0      NA       NA
V2                 1          0     Inf   <2e-16 ***
V3                NA         NA      NA       NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0 on 1 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: Inf on 1 and 1 DF, p-value: < 2.2e-16
[1] "RMSE: 0"
Warning message:
In predict.lm(mod, newdata = newdat) :
prediction from a rank-deficient fit may be misleading
The warning appears because the test data you supplied produces an exact fit.
Modified using the as.formula suggestion in the comments. Roman's comment above about passing everything in one data.frame and using the . notation in formulas is probably the best solution, but I implemented it with paste because you should know how to use paste and as.formula :-).
tnt <- function(train_dep, train_indep, test_dep, test_indep) {
  form <- as.formula(paste("train_dep ~",
                           paste("train_indep$", colnames(train_indep),
                                 sep = "", collapse = " + ")))
  mod <- lm(form)
  # caution: because the formula hard-codes train_indep$..., predict() cannot
  # match these names in newdata and falls back to the training data;
  # see the variant sketched below for out-of-sample prediction
  estimate <- predict(mod, data.frame(x = test_indep))
  rmse <- sqrt(sum((test_dep - estimate)^2)/length(test_dep))
  print(summary(mod))
  print(paste("RMSE: ", rmse))
}
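For genuine out-of-sample prediction the formula has to reference column names inside a data.frame rather than train_indep$... objects; otherwise predict() cannot match newdata. A sketch combining the paste/as.formula idea with the data.frame approach from the other answer (it assumes train_indep has column names, as the cbind calls above provide):
tnt <- function(train_dep, train_indep, test_dep, test_indep) {
  dat <- data.frame(y = train_dep, train_indep)
  # build "y ~ col1 + col2 + ..." from the column names alone
  form <- as.formula(paste("y ~", paste(colnames(dat)[-1], collapse = " + ")))
  mod <- lm(form, data = dat)
  newdat <- as.data.frame(test_indep)
  names(newdat) <- colnames(dat)[-1]
  estimate <- predict(mod, newdata = newdat)
  rmse <- sqrt(sum((test_dep - estimate)^2)/length(test_dep))
  print(summary(mod))
  print(paste("RMSE: ", rmse))
}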