Following the code of a tutorial, I defined a class and methods for my linear regression function.
data(cats, package="MASS")
linmodeEst <- function(x,y){
qx <- qr(x) # QR-decomposition
coef <- solve.qr(qx,y) # solve(t(x)%*%x)%*%t(x)%*%y
df <- nrow(x)-ncol(x)
sigma2 <- sum((y-x%*%coef)^2)/df
vcov <- sigma2 * chol2inv(qx$qr)
colnames(vcov) <- rownames(vcov) <- colnames(x)
list(coefficients = coef, vcov=vcov, sigma = sqrt(sigma2), df=df)
}
linmod <- function(x,...) UseMethod("linmod")
linmod.default <- function(x,y,...){
x <- as.matrix(x)
y <- as.matrix(y)
est <- linmodeEst(x,y)
est$fitted.values <- as.vector(x%*%est$coefficients)
est$residuals <- y - est$fitted.values
est$call <- match.call()
class(est) <- "linmod"
est
}
print.linmod <- function(x,...){
cat("Call:\n")
print(x$call)
cat("\nCoefficients:\n")
print(x$coefficients)
}
summary.linmod <- function(object,...){
se <- sqrt(diag(object$vcov))
tval <- coef(object)/se
TAB <- cbind(Estimate = coef(object),
StdErr = se,
t.value = tval,
p.value = 2*pt(-abs(tval), df=object$df))
res <- list(call=object$call, coefficients=TAB)
class(res) <- "summary.linmod"
res
}
print.summary.linmod <- function(x,...){
cat("Call:\n")
print(x$call)
cat("\n")
printCoefmat(x$coefficients, P.values=TRUE, has.Pvalue=TRUE)
}
x = cbind(Const=1, Bwt=cats$Bwt)
y = cats$Hw
mod1 <- linmod(x,y)
So, in summary.linmod <- function(object,...) I defined the column names Estimate, StdErr, t.value and p.value. In plain R I get all the names in the header; in RStudio I get just StdErr. Why is this happening?
My system: Linux 64bit, R 3.1.1
The documentation at ?cbind states that: "For cbind (rbind) the column (row) names are taken from the colnames (rownames) of the arguments if these are matrix-like."
In constructing TAB, you are binding three single-column matrices (coef(object), tval and 2*pt(-abs(tval), df=object$df)) to se, a non-matrix vector with 2 elements. Because of the behaviour quoted above, cbind names the matrix-like columns of TAB after the matrices' own (empty) column names and ignores the argument names, so only StdErr keeps its name.
Use cbind.data.frame or simply data.frame to construct TAB, and your summary output will have the expected names:
summary.linmod <- function(object,...){
se <- sqrt(diag(object$vcov))
tval <- coef(object)/se
TAB <- data.frame(Estimate = coef(object),
StdErr = se,
t.value = tval,
p.value = 2*pt(-abs(tval), df=object$df))
res <- list(call=object$call, coefficients=TAB)
class(res) <- "summary.linmod"
res
}
> summary(mod1)
# Call:
# linmod.default(x = x, y = y)
# Estimate StdErr t.value p.value
# Const -0.35666 0.69228 -0.5152 0.6072
# Bwt 4.03406 0.25026 16.1194 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
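The naming rule quoted from ?cbind in this answer can be checked in isolation. The snippet below is a minimal sketch with made-up numbers (est and se are illustrative names, not objects from the question): a single-column matrix without column names loses its argument name in cbind, whereas data.frame keeps it.

```r
est <- matrix(c(-0.36, 4.03), ncol = 1)   # single-column matrix without colnames (made-up values)
se  <- c(0.69, 0.25)                      # plain numeric vector (made-up values)

# cbind: the argument name "Estimate" is ignored because est is matrix-like
m <- cbind(Estimate = est, StdErr = se)
colnames(m)

# data.frame: both argument names are kept
d <- data.frame(Estimate = est, StdErr = se)
names(d)
```

Comparing colnames(m) with names(d) shows why only the StdErr header survived in the original summary method.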
In the example below I use the vcovHC option to generate robust standard errors for a standard linear model. Using this same example, I tried to generate the distribution of F statistics with robust standard errors using the Fstats() function. The problem is that Fstats uses vcovHC internally as shown in the bottom line of the code below (which I extracted from the Fstats function).
else {
allX <- cbind(X1, matrix(rep(0, point[i]*k), ncol=k))
allX <- rbind(allX, cbind(X2, X2))
fm2 <- lm(y ~ 0 + allX)
beta2 <- coef(fm2)[-(1:k)]
V <- vcov.(fm2)
I don't know how to specify the vcovHC option when using the Fstats() function. What I did try (see below) does not work. Thank you for your help.
rm(list = ls())
> library(lmtest)
> library(sandwich)
> library(strucchange)
> y <- as.ts(c(4,3,-2,1,-4,0,1,-2,-3,2,0,1,-2,2,4,1,3,5,2,2,3,4))
> X <- as.ts(rep(1,22))
> # lm model with robust standard errors
> model<- lm(y ~ X - 1)
> coeftest(model, vcov = vcovHC(model, type = "HC0"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
X 1.13636 0.51545 2.2046 0.03877 *
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> # Fstats model with regular standard errors
> model2 <- Fstats(y ~ X - 1, from = 3, to = 18)
> model2$Fstats
Time Series:
Start = 3
End = 18
Frequency = 1
[1] 0.1530982 0.1010540 0.5609994 0.8631524 0.8332749 2.2221599 5.5956726
[8] 4.5129612 5.6715064 5.8951319 11.4357216 10.2821317 6.1921347 7.1910292
[15] 5.4546513 2.1974238
# Fstats model with robust standard errors
> model3 <- Fstats(y ~ X - 1, from = 3, to = 18, vcovHC(.,type = "HC0"))
Error in vcovHC(., type = "HC0") : object '.' not found
The vcov argument in Fstats() needs to be a function (see ?Fstats) because it has to be applied to a whole range of different linear models fitted to subsets of the data. This is in contrast to coeftest() where it can either be a function or a matrix because it is just combined with a single model.
A worked example is included in the durab data set, providing the growth rate of the Industrial Production Index to average weekly labor hours in the manufacturing/durables sector. The example is taken from Hansen (2001) and uses a simple AR(1) model:
data("durab", package = "strucchange")
fsHC <- Fstats(y ~ lag, data = durab, from = 0.1,
vcov = function(x, ...) vcovHC(x, type = "HC", ...))
sctest(fsHC)
## supF test
## data: fsHC
## sup.F = 20.221, p-value = 0.001371
Note also that for the classical HC sandwich estimator you can use sandwich() rather than vcovHC(..., type = "HC0"). Thus, using vcov = sandwich in the code above is simpler than setting up a new function on the fly.
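Applied to the data from the question, the same idea might look like the sketch below. It assumes the strucchange and sandwich packages are installed; robust_vcov is an illustrative helper name, not something provided by either package.

```r
library(strucchange)
library(sandwich)

y <- as.ts(c(4,3,-2,1,-4,0,1,-2,-3,2,0,1,-2,2,4,1,3,5,2,2,3,4))
X <- as.ts(rep(1, 22))

# A function of the fitted model, so Fstats() can re-evaluate it
# for every sub-model it fits (illustrative helper name).
robust_vcov <- function(x, ...) vcovHC(x, type = "HC0", ...)

model3 <- Fstats(y ~ X - 1, from = 3, to = 18, vcov = robust_vcov)
model3$Fstats   # one robust F statistic per candidate breakpoint
```

The only change from the failing call is that the estimator is handed over unevaluated, as a function, instead of as the expression vcovHC(., type = "HC0").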
I had a weird problem with the plm() function. Below is the code:
library(data.table)
library(plm)
#Data Generation
n <- 500
z <- rnorm(n)
w <- rnorm(n)
x <- 5*z + 50
y <- -100*z+ 1100 + 50*w
y <- 10*round(y/10)
y <- ifelse(y<200,200,y)
y <- ifelse(y>1600,1600,y)
dt1 <- data.table('id'=1:500,'sat'=y,'income'=x,'group'=rep(1,n))
z <- rnorm(n)
w <- rnorm(n)
x <- 5*z + 80
y <- -80*z+ 1200 + 50*w
y <- 10*round(y/10)
y <- ifelse(y<200,200,y)
y <- ifelse(y>1600,1600,y)
dt2 <- data.table('id'=501:1000,'sat'=y,'income'=x,'group'=rep(2,n))
z <- rnorm(n)
w <- rnorm(n)
x <- 5*z + 30
y <- -120*z+ 1000 + 50*w
y <- 10*round(y/10)
y <- ifelse(y<200,200,y)
y <- ifelse(y>1600,1600,y)
dt3 <- data.table('id'=1001:1500,'sat'=y,'income'=x,'group'=rep(3,n))
dtable <- merge(dt1 ,dt2, all=TRUE)
dtable <- merge(dtable ,dt3, all=TRUE)
# Model
dtable_p <- pdata.frame(dtable, index = "group")
mod_1 <- plm(sat ~ income, data = dtable_p,model = "pooling")
Error in [.data.frame(x, , which) : undefined columns selected
I checked everything I could think of, but I cannot figure out why it gives me an error. The column names are correct, so why does R say "undefined columns"? Thank you!
Follow-up: I ran another test on the data set that @StupidWolf used, to confirm the problem:
data("Produc", package = "plm")
form <- log(gsp) ~ log(pc)
Produc$group <- Produc$region
pProduc <- pdata.frame(Produc, index = "group")
Produc$group <- rep(1:48, each = 17)
summary(plm(form, data = pProduc, model = "pooling"))
>Error in `[.data.frame`(x, , which) : undefined columns selected
This is extremely weird: the answer is that the index must not be named "group".
I suspect somewhere in the plm function, it must be adding a "group" to your data.frame.
We can use the example dataset
data("Produc", package = "plm")
form <- log(gsp) ~ log(pc)
Produc$group = Produc$region
pProduc <- pdata.frame(Produc, index = c("group"))
summary(plm(form, data = pProduc, model = "random"))
Error in `[.data.frame`(x, , which) : undefined columns selected
Using the "region" column from which I copied, it works:
pProduc <- pdata.frame(Produc, index = c("region"))
summary(plm(form, data = pProduc, model = "random"))
Oneway (individual) effect Random Effect Model
   (Swamy-Arora's transformation)
Call:
plm(formula = form, data = pProduc, model = "random")
Unbalanced Panel: n = 9, T = 51-136, N = 816
Effects:
                  var std.dev share
idiosyncratic 0.03691 0.19213 0.402
individual    0.05502 0.23457 0.598
theta:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.8861  0.9012  0.9192  0.9157  0.9299  0.9299
Residuals:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-0.68180 -0.11014  0.00977 -0.00039  0.13815  0.45491
Coefficients:
             Estimate Std. Error  z-value  Pr(>|z|)
(Intercept) -1.099088   0.138395  -7.9417 1.994e-15 ***
log(pc)      1.100102   0.010623 103.5627 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 459.71
Residual Sum of Squares: 30.029
R-Squared: 0.93468
Adj. R-Squared: 0.9346
Chisq: 11647.6 on 1 DF, p-value: < 2.22e-16
For your example, just rename the column "group" and also convert it to a factor to avoid the other errors. (For "pooling" it should be treated as categorical, not numeric.)
dtable <- merge(dt1 ,dt2, all=TRUE)
dtable <- merge(dtable ,dt3, all=TRUE)
dtable$group = factor(dtable$group)
colnames(dtable)[4] = "GROUP"
dtable_p <- pdata.frame(dtable, index = "GROUP")
summary(plm(sat ~ income, data = dtable_p, model = "pooling"))
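The renaming step can also be wrapped in a small guard. This is only a sketch in base R: rename_clashing is a hypothetical helper, not part of plm, and the toy data frame is made up for illustration.

```r
# Hypothetical helper (not part of plm): rename columns whose names are in `avoid`
rename_clashing <- function(df, avoid = "group", suffix = "_id") {
  hit <- names(df) %in% avoid
  names(df)[hit] <- paste0(names(df)[hit], suffix)
  df
}

# made-up toy data with a column called "group"
toy <- data.frame(id = 1:6,
                  sat = c(900, 1000, 1100, 1200, 950, 1050),
                  income = c(48, 52, 79, 83, 29, 31),
                  group = rep(1:3, each = 2))

toy <- rename_clashing(toy)
toy$group_id <- factor(toy$group_id)   # treat the index as categorical
names(toy)

# the renamed column can then serve as the index, e.g.
# pdata.frame(toy, index = "group_id")
```

Keeping the rename in one place makes it easy to apply before every pdata.frame() call if the clash with "group" affects several data sets.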
I am really struggling with the following problem set in R.
I want to simulate a data set with one dependent and 20 independent variables that are i.i.d. normal. Each variable should have 100 observations. (I managed to do this part.)
(Now the part I am struggling with):
My plan is to run automated regressions for all possible combinations of up to 5 regressors, using my own regression function that mimics the output of summary(lm) and takes a vector y and a matrix or vector x as input (so my.lm(y,x)), and then to collect the results in a suitable data structure.
I would be thankful for any hint!
I doubt the soundness of what you are trying to do but here it goes.
I will make up a dataset, since you have not posted one.
my.lm <- function(x, y, n = 5){
  # fit one model; `inx` is a row of Combn (column indices padded with zeros)
  f <- function(inx){
    inx_cols <- inx[inx != 0]
    X <- as.data.frame(x[, inx_cols])
    names(X) <- paste0("X", inx_cols)
    X <- cbind(X, y)
    name_y <- names(X)[length(names(X))]
    fmla <- as.formula(paste(name_y, ".", sep = "~"))
    # return the error object instead of stopping if a fit fails
    tryCatch(lm(fmla, data = X), error = function(e) e)
  }
  nc_x <- ncol(x)
  # number of subsets with 1 to n regressors
  nr <- sum(choose(nc_x, seq_len(n)))
  Combn <- matrix(0, nrow = nr, ncol = n)
  first <- 1
  for(i in seq_len(n)){
    last <- first + choose(nc_x, i) - 1
    Combn[first:last, seq_len(i)] <- t(combn(nc_x, i))
    first <- last + 1
  }
  apply(Combn, 1, f)
}
regr <- replicate(20, rnorm(100))
coefs <- sample(-5:5, 20, TRUE)
resp <- regr %*% coefs + rnorm(100)
lm_list <- my.lm(regr, resp)
length(lm_list)
#[1] 21699
So the function above produced as many objects as expected.
Before continuing, let's see how many are errors (singular matrix, for instance).
err_list <- lapply(lm_list, function(x){
if(inherits(x, "error")) x else NULL
})
err_list <- err_list[!sapply(err_list, is.null)]
length(err_list)
#[1] 0
No errors.
So get the summaries of the objects of class "lm".
good_list <- lapply(lm_list, function(x){
if(inherits(x, "lm")) x else NULL
})
good_list <- good_list[!sapply(good_list, is.null)]
smry_list <- lapply(good_list, summary)
smry_list[[1]]
#Call:
# lm(formula = fmla, data = X)
#Residuals:
# Min 1Q Median 3Q Max
#-34.654 -9.487 -1.985 9.486 50.213
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.6449 1.5237 0.423 0.673
#X1 -7.3969 1.5074 -4.907 3.68e-06 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 15.02 on 98 degrees of freedom
#Multiple R-squared: 0.1972, Adjusted R-squared: 0.189
#F-statistic: 24.08 on 1 and 98 DF, p-value: 3.684e-06
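The question also asks for the results in a suitable data structure. One possible layout, sketched here on a deliberately small made-up problem (4 regressors, subsets of size 1 or 2; all object names are illustrative), is a data frame with one row per fitted model:

```r
set.seed(1)                                   # arbitrary seed, only for reproducibility
x <- replicate(4, rnorm(50))                  # 4 regressors, 50 observations
colnames(x) <- paste0("X", 1:4)
y <- drop(x %*% c(2, 0, -1, 0) + rnorm(50))   # made-up coefficients
dat <- data.frame(y = y, x)

# every subset of 1 or 2 regressors, as character vectors of column names
subsets <- unlist(lapply(1:2, function(k) combn(colnames(x), k, simplify = FALSE)),
                  recursive = FALSE)

fits <- lapply(subsets, function(v) lm(reformulate(v, response = "y"), data = dat))

# one row per model: which regressors, how many, and two fit statistics
results <- data.frame(
  regressors = sapply(subsets, paste, collapse = " + "),
  n_regr     = lengths(subsets),
  adj_r2     = sapply(fits, function(m) summary(m)$adj.r.squared),
  sigma      = sapply(fits, function(m) summary(m)$sigma)
)
head(results[order(-results$adj_r2), ])
```

A flat table like this is easy to sort and filter; the full coefficient tables can stay in the list of summaries and be looked up by row position when needed.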
I am having trouble estimating the constant (intercept) of a multivariate linear regression model with stochastic gradient descent (that is, mini-batch gradient descent with a batch size of 1). The R function that I use is:
StochasticGradientDescent <- function(data, alpha, iteration, epsilon){
  data <- matrix(unlist(data), ncol=ncol(data), byrow=FALSE)
  independent.variable <- data[, 1:(ncol(data)-1)]
  dependent.variable <- data[, ncol(data)]
  #add column of 1s for constant
  independent.variable <- cbind(theta0 = 1, independent.variable)
  theta_new <- matrix( 0, ncol = ncol(independent.variable))
  theta_old <- matrix( 1, ncol = ncol(independent.variable))
  #Cost function
  CostFunction <- function (independent.variable, dependent.variable, theta){
    1/(2*(NROW(dependent.variable))) * sum(((independent.variable %*% t(theta)) - dependent.variable)^2)
  }
  thetas <- vector( mode = "list", length = iteration )
  thetas[[1]] <- theta_new
  J <- numeric( length = iteration )
  J[1] <- CostFunction(independent.variable, dependent.variable, theta_old )
  #gradient computed from a single randomly drawn observation
  derivative <- function(independent.variable, dependent.variable, theta){
    idx <- sample.int(NROW(independent.variable), 1)
    descent <- (t(independent.variable[idx, , drop = FALSE]) %*% ((independent.variable[idx, , drop = FALSE] %*% t(theta)) - dependent.variable[idx, drop = FALSE]))
    return( t(descent) )
  }
  #stopping criterion
  step <- 1
  while(any(abs(theta_new - theta_old) > epsilon) && step <= iteration )
  {
    step <- step + 1
    # gradient descent
    theta_old <- theta_new
    theta_new <- theta_old - alpha * derivative(independent.variable, dependent.variable, theta_old)
    # record keeping
    thetas[[step]] <- theta_new
    J[step] <- CostFunction(independent.variable, dependent.variable, theta_new)
  }
  costs <- data.frame( costs = J )
  theta <- data.frame( do.call( rbind, thetas ), row.names = NULL )
  return( list( costs = costs, theta = theta))
}
I simulate an artificial data set:
x1 <- runif(1000000,1,100);
x2 <- runif(1000000,1,200);
y <- 5+4*x1+3*x2;
lm(), which uses a QR decomposition, gives this result:
fit <- lm(y ~ x1+x2);
summary(fit)
#Call:
# lm(formula = y ~ x1 + x2)
#Residuals:
# Min 1Q Median 3Q Max
#-7.386e-09 0.000e+00 0.000e+00 0.000e+00 9.484e-10
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 5.000e+00 2.162e-14 2.313e+14 <2e-16 ***
# x1 4.000e+00 2.821e-16 1.418e+16 <2e-16 ***
# x2 3.000e+00 1.403e-16 2.138e+16 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 8.062e-12 on 999997 degrees of freedom
#Multiple R-squared: 1, Adjusted R-squared: 1
#F-statistic: 3.292e+32 on 2 and 999997 DF, p-value: < 2.2e-16
My initial values for the thetas are 0. The learning rate is 0.00005 and the number of iterations is 5000. The stopping criterion, epsilon (a user-defined small value), is 0.000001: if the change in every trained parameter between two consecutive iterations is smaller than this value, the algorithm stops. The result I get is given below:
data<- data.frame(cbind(x1, x2, y))
results <- StochasticGradientDescent( data = data, alpha = 0.00005, iteration = 5000, epsilon = .000001)
results$theta[ nrow(results$theta), ]
# theta0 V2 V3
#5001 0.2219142 4.04408 2.999861
As you can see, the estimates of the slope coefficients are very close to the actual ones. However, the estimate for theta0 (the intercept/constant) is not even close. Besides, I only get these values at the end of the full cycle of iterations, which is not good: the algorithm does not converge efficiently. I tried but really could not figure out why this is the case. Could someone help me, please?
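The post itself contains no answer. As a hedged sketch only: with a column of 1s next to regressors in the range 1 to 200, the problem is badly scaled, and a single small learning rate moves the intercept very slowly. The code below assumes that centring and scaling the regressors is acceptable, uses a smaller sample than the question, and picks its own step size (alpha = 0.05 is an assumption for the scaled problem, not a value from the post).

```r
set.seed(42)                      # arbitrary seed for reproducibility
n  <- 1000                        # smaller than in the question, to keep it fast
x1 <- runif(n, 1, 100)
x2 <- runif(n, 1, 200)
y  <- 5 + 4 * x1 + 3 * x2         # noise-free, as in the question

Z <- scale(cbind(x1, x2))         # centre and scale the regressors
X <- cbind(1, Z)                  # column of 1s for the intercept

theta <- rep(0, ncol(X))
alpha <- 0.05                     # assumed step size for the scaled problem
for (step in 1:5000) {
  i     <- sample.int(n, 1)                 # batch size 1
  resid <- sum(X[i, ] * theta) - y[i]       # prediction error for that observation
  theta <- theta - alpha * resid * X[i, ]   # single-observation gradient step
}

# map the coefficients back to the original scale of x1 and x2
slopes    <- theta[-1] / attr(Z, "scaled:scale")
intercept <- theta[1] - sum(slopes * attr(Z, "scaled:center"))
round(c(intercept, slopes), 4)
```

In this noise-free setting the back-transformed values should come out close to 5, 4 and 3, which suggests that the slow intercept in the original run is a conditioning issue rather than an error in the gradient formula.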
I have a table of Ancylostoma infection against sex (2 levels), location (2 levels), year, management (2 levels), ancestry (4 levels) and viremia as categorical variables, plus HL and age as numeric variables.
I fitted a GLMM:
library(lme4)
library(MuMIn)
glm_toxo<-glmer((Ancylostoma) ~ as.factor(Sexo)+(Edad)+as.factor(año)+as.factor(Manejo)+as.factor(Localizacion)+as.factor(Viremia.FeLV) +(Ancestria) +(HL)+(1|Nombre), family="binomial", data= data_silv)
dd_toxo <- dredge (glm_toxo)
a<- get.models(dd_toxo, subset = delta < 2)
b <- model.avg(a)
summary(b)
And I got this result:
Model-averaged coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.0222 0.8911 2.269 0.0233 *
as.factor(Localizacion)PORT -15.2935 2163.9182 0.007 0.9944
as.factor(Localizacion)SMO -3.0012 0.7606 3.946 7.95e-05 ***
as.factor(Manejo)SILV 1.8125 0.7799 2.324 0.0201 *
Edad -0.1965 0.1032 1.904 0.0569 .
as.factor(Sexo)M 0.5015 0.4681 1.071 0.2840
HL -0.9381 1.4244 0.659 0.5102
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
I would like to plot the probability of infection (y) against age (x), using the estimates from my model.avg.
I tried this script:
nseq <- function(x, len = length(x)) seq(min(x, na.rm = TRUE),max(x, na.rm=TRUE), length = len)
newdata <- as.data.frame(lapply(lapply(data_silv[2:4], mean), rep, 213))
newdata$Edad <- nseq(data_silv$Edad, nrow(newdata))
(año <- sample(as.factor(data_silv$año),size=213,rep=T))
(Manejo <- sample(as.factor(data_silv$Manejo),size=213,rep=T))
(Sexo <- sample(as.factor(data_silv$Sexo),size=213,rep=T))
newdata <-$HL), año,Manejo,Sexo,
data_silv$Localizacion, nseq(data_silv$Edad, nrow(newdata)),
names(newdata) <- c("HL","año","Manejo","Sexo","Localizacion","Edad",
newdata$pred <- data.frame(
model = sapply(a, predict, newdata = newdata),
averaged.subset = predict(b, newdata, full = FALSE),
averaged.full = predict(b, newdata, full = TRUE)
)
ggplot(newdata,aes(x="Edad",y="pred",color="Localizacion")) + geom_line()
But I get no graph, or I get an error.
Does anyone know a way to plot my model.avg with both categorical and numeric variables, given that I only want to show the probability of infection against age, with two lines: localizacion1 and localizacion2 (Localizacion has 2 levels)?
My original data would look like this table:
año <- sample(as.factor(2005:2009),size=213,rep=T)
riqueza <- sample((0:3),size=213,rep=T)
HL <- rnorm(213, mean=0.54, sd=0.13)
Ancylostoma <- sample(as.factor(0:1),size=213,rep=T)
Edad <- sample(as.factor(0:21),size=213,rep=T)
Manejo<- sample(c("CCC", "SILV"), 213, replace = TRUE)
Sexo<- sample(c("M", "H"), 213, replace = TRUE)
Localizacion<- sample(c("SMO", "DON", "PORT"), 213, replace = TRUE)
Ancestria<- sample(c("DON", "SMO", "F1", "F2"), 213, replace = TRUE)
newdata <- as.data.frame(cbind(HL,año,Manejo,Sexo,
Localizacion, Edad,Ancylostoma))
names(newdata) <- c("HL","año","Manejo","Sexo","Localizacion","Edad",
"Ancylostoma")
With that data I fit my model and obtain the estimates. Then I would like to do the prediction.
Thank you. I am not sure if I am explaining this well.
I am sorry for my English.
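The post has no answer, so the following is only a sketch of the general approach: build a prediction grid with every age for each location, hold the other covariates fixed, and predict on the probability scale. It uses a plain glm() on made-up data as a stand-in for the model-averaged glmer fits, keeps Edad numeric, and all object names (d, fit, grid) are illustrative.

```r
set.seed(1)                                   # arbitrary seed
n <- 213
d <- data.frame(
  Edad         = sample(0:21, n, replace = TRUE),          # age kept numeric
  Localizacion = factor(sample(c("DON", "SMO"), n, replace = TRUE)),
  HL           = rnorm(n, mean = 0.54, sd = 0.13)
)
d$Ancylostoma <- rbinom(n, 1, plogis(0.5 - 0.15 * d$Edad))  # made-up relationship

# stand-in model: a plain glm() instead of the model-averaged glmer fits
fit <- glm(Ancylostoma ~ Edad + Localizacion + HL, family = binomial, data = d)

# prediction grid: a sequence of ages for each location, HL held at its mean
grid <- expand.grid(
  Edad         = seq(min(d$Edad), max(d$Edad), length.out = 50),
  Localizacion = levels(d$Localizacion),
  HL           = mean(d$HL)
)
grid$pred <- predict(fit, newdata = grid, type = "response")  # probabilities

# with ggplot2 the aesthetics are bare column names, not quoted strings:
# ggplot(grid, aes(x = Edad, y = pred, colour = Localizacion)) + geom_line()
```

Note that aes(x = "Edad", y = "pred") with quotes maps constant strings rather than the columns, which is one likely reason the original plot shows no lines. The same kind of grid could in principle be passed as newdata when predicting from the averaged model, although the random effect would need separate handling there.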