Following a tutorial, I defined a class and methods for my own linear regression function.
data(cats, package="MASS")
linmodeEst <- function(x, y){
  qx <- qr(x)                 # QR-decomposition
  coef <- solve.qr(qx, y)     # same as solve(t(x)%*%x) %*% t(x) %*% y
  df <- nrow(x) - ncol(x)
  sigma2 <- sum((y - x %*% coef)^2) / df
  vcov <- sigma2 * chol2inv(qx$qr)
  colnames(vcov) <- rownames(vcov) <- colnames(x)
  list(coefficients = coef, vcov = vcov, sigma = sqrt(sigma2), df = df)
}
linmod <- function(x, ...) UseMethod("linmod")

linmod.default <- function(x, y, ...){
  x <- as.matrix(x)
  y <- as.matrix(y)
  est <- linmodeEst(x, y)
  est$fitted.values <- as.vector(x %*% est$coefficients)
  est$residuals <- y - est$fitted.values
  est$call <- match.call()
  class(est) <- "linmod"
  est
}
print.linmod <- function(x, ...){
  cat("Call:\n")
  print(x$call)
  cat("\nCoefficients:\n")
  print(x$coefficients)
}
summary.linmod <- function(object, ...){
  se <- sqrt(diag(object$vcov))
  tval <- coef(object) / se
  TAB <- cbind(Estimate = coef(object),
               StdErr = se,
               t.value = tval,
               p.value = 2 * pt(-abs(tval), df = object$df))
  res <- list(call = object$call, coefficients = TAB)
  class(res) <- "summary.linmod"
  res
}
print.summary.linmod <- function(x, ...){
  cat("Call:\n")
  print(x$call)
  cat("\n")
  printCoefmat(x$coefficients, P.value = TRUE, has.Pvalue = TRUE)
}
x = cbind(Const=1, Bwt=cats$Bwt)
y = cats$Hw
mod1 <- linmod(x,y)
summary(mod1)
So, in summary.linmod <- function(object, ...) I defined the table column names: Estimate, StdErr, t.value, p.value. In R I get all the names in the header, but in RStudio only StdErr. Why is this happening?
My system: Linux 64bit, R 3.1.1
The documentation at ?cbind states that: "For cbind (rbind) the column (row) names are taken from the colnames (rownames) of the arguments if these are matrix-like."
In constructing TAB, you are binding three single-column matrices (coef(object), tval, and 2*pt(-abs(tval), df=object$df)) to se, a non-matrix vector with 2 elements. Because of the behaviour quoted above, cbind uses the matrices' (empty) colnames to name the matrix-like columns of TAB.
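You can check which behaviour your R build shows with a tiny toy example (hypothetical objects, not from your code):
m <- matrix(1:2)      # one-column matrix, colnames(m) is NULL
v <- c(1.5, 2.5)      # plain numeric vector
colnames(cbind(Estimate = m, StdErr = v))
# if the first entry comes back as "", cbind took the (missing) colnames of the
# matrix argument rather than the tag "Estimate" -- the behaviour described above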
Use cbind.data.frame or simply data.frame to construct TAB, and your summary output will have the expected names:
summary.linmod <- function(object, ...){
  se <- sqrt(diag(object$vcov))
  tval <- coef(object) / se
  TAB <- data.frame(Estimate = coef(object),
                    StdErr = se,
                    t.value = tval,
                    p.value = 2 * pt(-abs(tval), df = object$df))
  res <- list(call = object$call, coefficients = TAB)
  class(res) <- "summary.linmod"
  res
}
> summary(mod1)
# Call:
# linmod.default(x = x, y = y)
#
# Estimate StdErr t.value p.value
# Const -0.35666 0.69228 -0.5152 0.6072
# Bwt 4.03406 0.25026 16.1194 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In the example below I use vcovHC() to generate robust standard errors for a standard linear model. Using this same example, I tried to generate the distribution of F statistics with robust standard errors using the Fstats() function. The problem is that Fstats applies the covariance estimator internally, as shown in the bottom line of the code below (which I extracted from the Fstats function).
else {
allX <- cbind(X1, matrix(rep(0, point[i]*k), ncol=k))
allX <- rbind(allX, cbind(X2, X2))
fm2 <- lm(y ~ 0 + allX)
beta2 <- coef(fm2)[-(1:k)]
V <- vcov.(fm2)
I don't know how to specify the vcovHC option when using the Fstats() function. What I did try (see below) does not work. Thank you for your help.
> rm(list = ls())
>
> library(lmtest)
> library(sandwich)
> library(strucchange)
>
> y <- as.ts(c(4,3,-2,1,-4,0,1,-2,-3,2,0,1,-2,2,4,1,3,5,2,2,3,4))
> X <- as.ts(rep(1,22))
>
> # lm model with robust standard errors
> model<- lm(y ~ X - 1)
> coeftest(model, vcov = vcovHC(model, type = "HC0"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
X 1.13636 0.51545 2.2046 0.03877 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> # Fstats model with regular standard errors
> model2 <- Fstats(y ~ X - 1, from = 3, to = 18)
> model2$Fstats
Time Series:
Start = 3
End = 18
Frequency = 1
[1] 0.1530982 0.1010540 0.5609994 0.8631524 0.8332749 2.2221599 5.5956726
[8] 4.5129612 5.6715064 5.8951319 11.4357216 10.2821317 6.1921347 7.1910292
[15] 5.4546513 2.1974238
>
> # Fstats model with robust standard errors
> model3 <- Fstats(y ~ X - 1, from = 3, to = 18, vcovHC(.,type = "HC0"))
Error in vcovHC(., type = "HC0") : object '.' not found
The vcov argument in Fstats() needs to be a function (see ?Fstats) because it has to be applied to a whole range of different linear models fitted to subsets of the data. This is in contrast to coeftest() where it can either be a function or a matrix because it is just combined with a single model.
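Applied to the call from the question, that means wrapping vcovHC in a small function, for example (a sketch, not tested against your data; it mirrors the vcov usage in the worked example below):
model3 <- Fstats(y ~ X - 1, from = 3, to = 18,
  vcov = function(x, ...) vcovHC(x, type = "HC0", ...))
model3$Fstats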
A worked example is included with the durab data set, which provides the growth rate of the Industrial Production Index relative to average weekly labor hours in the manufacturing/durables sector. The example is taken from Hansen (2001) and uses a simple AR(1) model:
library("strucchange")
data("durab", package = "strucchange")
fsHC <- Fstats(y ~ lag, data = durab, from = 0.1,
vcov = function(x, ...) vcovHC(x, type = "HC", ...))
plot(fsHC)
sctest(fsHC)
## supF test
##
## data: fsHC
## sup.F = 20.221, p-value = 0.001371
Note also that for the classical HC sandwich estimator you can use sandwich() rather than vcovHC(..., type = "HC0"). Thus, using vcov = sandwich in the code above is simpler than setting up a new function on the fly.
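So the durab example above could, for instance, be shortened to something like this (a sketch under that simplification):
fsHC0 <- Fstats(y ~ lag, data = durab, from = 0.1, vcov = sandwich)
plot(fsHC0)
sctest(fsHC0)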
I ran into a weird problem with the plm() function. Below is the code:
library(data.table)
library(tidyverse)
library(plm)
#Data Generation
n <- 500
set.seed(75080)
z <- rnorm(n)
w <- rnorm(n)
x <- 5*z + 50
y <- -100*z+ 1100 + 50*w
y <- 10*round(y/10)
y <- ifelse(y<200,200,y)
y <- ifelse(y>1600,1600,y)
dt1 <- data.table('id'=1:500,'sat'=y,'income'=x,'group'=rep(1,n))
z <- rnorm(n)
w <- rnorm(n)
x <- 5*z + 80
y <- -80*z+ 1200 + 50*w
y <- 10*round(y/10)
y <- ifelse(y<200,200,y)
y <- ifelse(y>1600,1600,y)
dt2 <- data.table('id'=501:1000,'sat'=y,'income'=x,'group'=rep(2,n))
z <- rnorm(n)
w <- rnorm(n)
x <- 5*z + 30
y <- -120*z+ 1000 + 50*w
y <- 10*round(y/10)
y <- ifelse(y<200,200,y)
y <- ifelse(y>1600,1600,y)
dt3 <- data.table('id'=1001:1500,'sat'=y,'income'=x,'group'=rep(3,n))
dtable <- merge(dt1 ,dt2, all=TRUE)
dtable <- merge(dtable ,dt3, all=TRUE)
# Model
dtable_p <- pdata.frame(dtable, index = "group")
mod_1 <- plm(sat ~ income, data = dtable_p,model = "pooling")
Error in [.data.frame(x, , which) : undefined columns selected
I checked all the possibilities but I cannot figure out why it gives me an error. The column names are correct, so why does R say the columns are undefined? Thank you!
Follow-up: I added another test with the data set that @StupidWolf uses below, to confirm the problem:
data("Produc", package = "plm")
form <- log(gsp) ~ log(pc)
Produc$group <- Produc$region
pProduc <- pdata.frame(Produc, index = "group")
Produc$group <- rep(1:48, each = 17)
summary(plm(form, data = pProduc, model = "pooling"))
Error in `[.data.frame`(x, , which) : undefined columns selected
This is extremely weird; the answer is that the index must not be named "group".
I suspect that somewhere inside the plm function it adds a "group" column to your data.frame.
We can use the example dataset
data("Produc", package = "plm")
form <- log(gsp) ~ log(pc)
Produc$group = Produc$region
pProduc <- pdata.frame(Produc, index = c("group"))
summary(plm(form, data = pProduc, model = "random"))
Error in `[.data.frame`(x, , which) : undefined columns selected
Using the "region" column from which I copied, it works:
pProduc <- pdata.frame(Produc, index = c("region"))
summary(plm(form, data = pProduc, model = "random"))
Oneway (individual) effect Random Effect Model
(Swamy-Arora's transformation)
Call:
plm(formula = form, data = pProduc, model = "random")
Unbalanced Panel: n = 9, T = 51-136, N = 816
Effects:
var std.dev share
idiosyncratic 0.03691 0.19213 0.402
individual 0.05502 0.23457 0.598
theta:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.8861 0.9012 0.9192 0.9157 0.9299 0.9299
Residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.68180 -0.11014 0.00977 -0.00039 0.13815 0.45491
Coefficients:
Estimate Std. Error z-value Pr(>|z|)
(Intercept) -1.099088 0.138395 -7.9417 1.994e-15 ***
log(pc) 1.100102 0.010623 103.5627 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 459.71
Residual Sum of Squares: 30.029
R-Squared: 0.93468
Adj. R-Squared: 0.9346
Chisq: 11647.6 on 1 DF, p-value: < 2.22e-16
For your example, just rename the "group" column and also set it as a factor to avoid other errors (for "pooling" it should be treated as categorical, not numeric).
dtable <- merge(dt1 ,dt2, all=TRUE)
dtable <- merge(dtable ,dt3, all=TRUE)
dtable$group = factor(dtable$group)
colnames(dtable)[4] = "GROUP"
dtable_p <- pdata.frame(dtable, index = "GROUP")
summary(plm(sat ~ income, data = dtable_p, model = "pooling"))
I am really struggling with the following problem set using R.
I want to simulate a data set with one dependent variable and 20 independent variables that are i.i.d. normal. Each variable should have 100 observations. (I managed to do this part.)
(Now the part I am struggling with):
My plan is to run automated regressions for all possible combinations of up to 5 regressors, using my own regression function that mimics the output of summary(lm) and takes a vector y and a matrix or vector x as input (so my.lm(y, x)), and then to bring the results into a suitable data structure.
I would be thankful for every hint!
I doubt the soundness of what you are trying to do, but here it is.
I will make up a dataset, since you have not posted one.
my.lm <- function(x, y, n = 5){
  # fit one lm for a given row of the combination matrix Combn
  f <- function(inx){
    inx_cols <- Combn[inx, ]
    inx_cols <- inx_cols[inx_cols != 0]   # drop the zero padding
    X <- as.data.frame(x[, inx_cols])
    names(X) <- paste0("X", inx_cols)
    X <- cbind(X, y)
    name_y <- names(X)[length(names(X))]
    fmla <- as.formula(paste(name_y, ".", sep = "~"))
    tryCatch(lm(fmla, data = X), error = function(e) e)
  }
  # build a matrix with one row per combination of 1 to n regressors,
  # padding shorter combinations with zeros
  nc_x <- ncol(x)
  nr <- sum(choose(nc_x, seq_len(n)))
  Combn <- matrix(0, nrow = nr, ncol = n)
  first <- 1
  for(i in seq_len(n)){
    last <- first + choose(nc_x, i) - 1
    Combn[first:last, seq_len(i)] <- t(combn(nc_x, i))
    first <- last + 1
  }
  # fit a model for every row and return the list of fits (or caught errors)
  apply(Combn, 1, f)
}
set.seed(6876)
regr <- replicate(20, rnorm(100))
coefs <- sample(-5:5, 20, TRUE)
resp <- regr %*% coefs + rnorm(100)
lm_list <- my.lm(regr, resp)
length(lm_list)
#[1] 21699
So the function above produced as many objects as expected.
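That count matches the number of ways to choose between 1 and 5 regressors out of 20:
sum(choose(20, 1:5))
#[1] 21699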
Before continuing, let's see how many are errors (singular matrix, for instance).
err_list <- lapply(lm_list, function(x){
if(inherits(x, "error")) x else NULL
})
err_list <- err_list[!sapply(err_list, is.null)]
length(err_list)
#[1] 0
No errors.
So get the summaries of the objects of class "lm".
good_list <- lapply(lm_list, function(x){
if(inherits(x, "lm")) x else NULL
})
good_list <- good_list[!sapply(good_list, is.null)]
smry_list <- lapply(good_list, summary)
smry_list[[1]]
#
#Call:
# lm(formula = fmla, data = X)
#Residuals:
# Min 1Q Median 3Q Max
#-34.654 -9.487 -1.985 9.486 50.213
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.6449 1.5237 0.423 0.673
#X1 -7.3969 1.5074 -4.907 3.68e-06 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 15.02 on 98 degrees of freedom
#Multiple R-squared: 0.1972, Adjusted R-squared: 0.189
#F-statistic: 24.08 on 1 and 98 DF, p-value: 3.684e-06
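To also bring the results into a suitable data structure, as asked, one possible sketch (not part of the original answer; it reuses good_list and smry_list from above) collects a couple of statistics per fitted model:
fit_stats <- data.frame(
  n_regressors = sapply(good_list, function(m) length(coef(m)) - 1),
  adj_r_squared = sapply(smry_list, function(s) s$adj.r.squared)
)
head(fit_stats)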
I am having trouble estimating the constant (intercept) of a multivariate linear regression model using stochastic gradient descent (i.e. mini-batch gradient descent with a batch size of 1). The function in R that I use is:
StochasticGradientDescent <- function(data, alpha, iteration, epsilon){
  data <- matrix(unlist(data), ncol = ncol(data), byrow = FALSE)
  independent.variable <- data[, 1:(ncol(data) - 1)]
  dependent.variable <- data[, ncol(data)]
  # add column of 1s for the constant
  independent.variable <- cbind(theta0 = 1, independent.variable)
  theta_new <- matrix(0, ncol = ncol(independent.variable))
  theta_old <- matrix(1, ncol = ncol(independent.variable))
  # cost function
  CostFunction <- function(independent.variable, dependent.variable, theta){
    1/(2*(NROW(dependent.variable))) * sum(((independent.variable %*% t(theta)) - dependent.variable)^2)
  }
  thetas <- vector(mode = "list", length = iteration)
  thetas[[1]] <- theta_new
  J <- numeric(length = iteration)
  J[1] <- CostFunction(independent.variable, dependent.variable, theta_old)
  # gradient evaluated at a single randomly drawn observation
  derivative <- function(independent.variable, dependent.variable, theta){
    idx <- sample.int(NROW(independent.variable), 1)
    descent <- t(independent.variable[idx, , drop = FALSE]) %*%
      ((independent.variable[idx, , drop = FALSE] %*% t(theta)) - dependent.variable[idx, drop = FALSE])
    return(t(descent))
  }
  # stopping criterion
  step <- 1
  while(any(abs(theta_new - theta_old) > epsilon) & step <= iteration){
    step <- step + 1
    # gradient descent update
    theta_old <- theta_new
    theta_new <- theta_old - alpha * derivative(independent.variable, dependent.variable, theta_old)
    # record keeping
    thetas[[step]] <- theta_new
    J[step] <- CostFunction(independent.variable, dependent.variable, theta_new)
  }
  costs <- data.frame(costs = J)
  theta <- data.frame(do.call(rbind, thetas), row.names = NULL)
  return(list(costs = costs, theta = theta))
}
I simulate artificial data:
x1 <- runif(1000000,1,100);
x2 <- runif(1000000,1,200);
y <- 5+4*x1+3*x2;
R's lm(), which fits via QR decomposition, gives this result:
fit <- lm(y ~ x1+x2);
summary(fit)
#
#Call:
# lm(formula = y ~ x1 + x2)
#
#Residuals:
# Min 1Q Median 3Q Max
#-7.386e-09 0.000e+00 0.000e+00 0.000e+00 9.484e-10
#
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 5.000e+00 2.162e-14 2.313e+14 <2e-16 ***
# x1 4.000e+00 2.821e-16 1.418e+16 <2e-16 ***
# x2 3.000e+00 1.403e-16 2.138e+16 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 8.062e-12 on 999997 degrees of freedom
#Multiple R-squared: 1, Adjusted R-squared: 1
#F-statistic: 3.292e+32 on 2 and 999997 DF, p-value: < 2.2e-16
My initial values for the thetas are 0s. The learning rate is 0.00005, the number of iterations is 5000, and the stopping criterion epsilon (a user-defined small value) is 0.000001: if the change in the trained parameters between two iterations is smaller than this value, the algorithm stops. The result I get is given below:
data<- data.frame(cbind(x1, x2, y))
results <- StochasticGradientDescent( data = data, alpha = 0.00005, iteration = 5000, epsilon = .000001)
results$theta[ nrow(results$theta), ]
# theta0 V2 V3
#5001 0.2219142 4.04408 2.999861
As you can see, the slope estimates are very close to the actual values. However, the estimate for theta0 (the intercept/constant) is not even close. Moreover, I only reach these values at the end of the full cycle of iterations, which is not good; the algorithm does not converge efficiently. I tried but I really could not figure out why this is the case. Could someone help me, please?
I have a table with Ancylostoma infection versus sex (2 levels), location (2 levels), year, management (2 levels), ancestry (4 levels) and viremia as categorical variables, and I also have HL and age as numeric variables.
I fitted a GLMM:
glm_toxo <- glmer(Ancylostoma ~ as.factor(Sexo) + Edad + as.factor(año) + as.factor(Manejo) +
  as.factor(Localizacion) + as.factor(Viremia.FeLV) + Ancestria + HL + (1|Nombre),
  family = "binomial", data = data_silv)
dd_toxo <- dredge(glm_toxo)
a<- get.models(dd_toxo, subset = delta < 2)
b<-(model.avg(a))
And I got this result:
Model-averaged coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.0222 0.8911 2.269 0.0233 *
as.factor(Localizacion)PORT -15.2935 2163.9182 0.007 0.9944
as.factor(Localizacion)SMO -3.0012 0.7606 3.946 7.95e-05 ***
as.factor(Manejo)SILV 1.8125 0.7799 2.324 0.0201 *
Edad -0.1965 0.1032 1.904 0.0569 .
as.factor(Sexo)M 0.5015 0.4681 1.071 0.2840
HL -0.9381 1.4244 0.659 0.5102
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
I would like to plot the probability of infection (y) against age (x), but using the estimates from my model.avg.
I tried with this script:
nseq <- function(x, len = length(x)) seq(min(x, na.rm = TRUE),max(x, na.rm=TRUE), length = len)
####
newdata <- as.data.frame(lapply(lapply(data_silv[2:4], mean), rep, 213))
newdata$Edad <- nseq(data_silv$Edad, nrow(newdata))
(año <- sample(as.factor(data_silv$año),size=213,rep=T))
(Manejo <- sample(as.factor(data_silv$Manejo),size=213,rep=T))
(Sexo <- sample(as.factor(data_silv$Sexo),size=213,rep=T))
newdata <- as.data.frame(cbind(mean(data_silv$HL), año,Manejo,Sexo,
data_silv$Localizacion, nseq(data_silv$Edad, nrow(newdata)),
data_silv$Ancylostoma))
names(newdata) <- c("HL","año","Manejo","Sexo","Localizacion","Edad",
"Ancylostoma")
newdata$pred <- data.frame(
model = sapply(a, predict, newdata = newdata),
averaged.subset = predict(b, newdata, full = FALSE),
averaged.full = predict(b, newdata, full = TRUE)
)
library(ggplot2)
ggplot(newdata,aes(x="Edad",y="pred",color="Localizacion")) + geom_line()
#####
But I don't get a graph, or I get an error.
Does anyone know a way to plot my model.avg with both categorical and numeric variables? Keep in mind that I only want to show the probability of infection vs age, with two lines: one per level of Localizacion (Localizacion has 2 levels).
My original data would look like this table:
#
año <- sample(as.factor(2005:2009),size=213,rep=T)
riqueza <- sample((0:3),size=213,rep=T)
HL <- rnorm(213, mean=0.54, sd=0.13)
Ancylostoma <- sample(as.factor(0:1),size=213,rep=T)
Edad <- sample(as.factor(0:21),size=213,rep=T)
Manejo<- sample(c("CCC", "SILV"), 213, replace = TRUE)
Sexo<- sample(c("M", "H"), 213, replace = TRUE)
Localizacion<- sample(c("SMO", "DON", "PORT"), 213, replace = TRUE)
Ancestria<- sample(c("DON", "SMO", "F1", "F2"), 213, replace = TRUE)
newdata <- as.data.frame(cbind(HL,año,Manejo,Sexo,
Localizacion, Edad,Ancylostoma))
names(newdata) <- c("HL","año","Manejo","Sexo","Localizacion","Edad",
"Ancylostoma")
#
And with that data I compute my model estimates. Then I would like to do the prediction.
Thank you, I am not sure if I am explaining this well.
Sorry for my English.