In the example below I use the vcovHC() function to generate robust standard errors for a standard linear model. Using this same example, I tried to generate the distribution of F statistics with robust standard errors using the Fstats() function. The problem is that Fstats() applies the covariance estimator internally, as shown in the last line of the code below (which I extracted from the Fstats() function).
else {
    allX <- cbind(X1, matrix(rep(0, point[i]*k), ncol = k))
    allX <- rbind(allX, cbind(X2, X2))
    fm2 <- lm(y ~ 0 + allX)
    beta2 <- coef(fm2)[-(1:k)]
    V <- vcov.(fm2)
I don't know how to specify the vcovHC option when using the Fstats() function. What I did try (see below) does not work. Thank you for your help.
> rm(list = ls())
>
> library(lmtest)
> library(sandwich)
> library(strucchange)
>
> y <- as.ts(c(4,3,-2,1,-4,0,1,-2,-3,2,0,1,-2,2,4,1,3,5,2,2,3,4))
> X <- as.ts(rep(1,22))
>
> # lm model with robust standard errors
> model <- lm(y ~ X - 1)
> coeftest(model, vcov = vcovHC(model, type = "HC0"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
X 1.13636 0.51545 2.2046 0.03877 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> # Fstats model with regular standard errors
> model2 <- Fstats(y ~ X - 1, from = 3, to = 18)
> model2$Fstats
Time Series:
Start = 3
End = 18
Frequency = 1
[1] 0.1530982 0.1010540 0.5609994 0.8631524 0.8332749 2.2221599 5.5956726
[8] 4.5129612 5.6715064 5.8951319 11.4357216 10.2821317 6.1921347 7.1910292
[15] 5.4546513 2.1974238
>
> # Fstats model with robust standard errors
> model3 <- Fstats(y ~ X - 1, from = 3, to = 18, vcovHC(.,type = "HC0"))
Error in vcovHC(., type = "HC0") : object '.' not found
The vcov argument in Fstats() needs to be a function (see ?Fstats) because it has to be applied to a whole range of different linear models fitted to subsets of the data. This is in contrast to coeftest() where it can either be a function or a matrix because it is just combined with a single model.
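Applied to the toy series in the question, that means wrapping vcovHC() in a function of the fitted model (a minimal sketch, reusing the y and X objects defined above):
# Pass a *function* of a fitted model, not a precomputed matrix:
model3 <- Fstats(y ~ X - 1, from = 3, to = 18,
                 vcov = function(x, ...) vcovHC(x, type = "HC0", ...))
model3$Fstats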
A worked example is provided with the durab data set, which contains the growth rate of the Industrial Production Index to average weekly labor hours in the manufacturing/durables sector. The example is taken from Hansen (2001) and uses a simple AR(1) model:
library("strucchange")
data("durab", package = "strucchange")
fsHC <- Fstats(y ~ lag, data = durab, from = 0.1,
vcov = function(x, ...) vcovHC(x, type = "HC", ...))
plot(fsHC)
sctest(fsHC)
## supF test
##
## data: fsHC
## sup.F = 20.221, p-value = 0.001371
Note also that for the classical HC sandwich estimator you can use sandwich() rather than vcovHC(..., type = "HC0"). Thus, using vcov = sandwich in the code above is simpler than setting up a new function on the fly.
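For instance, a sketch using the same durab example as above:
fsHC0 <- Fstats(y ~ lag, data = durab, from = 0.1, vcov = sandwich)
plot(fsHC0)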
I am using the nls() function in R to perform a nonlinear fit. I have specified my independent variable as follows:
t <- seq(1,7)
and my dependent variable as P <- c(0.0246, 0.2735, 0.5697, 0.6715, 0.8655, 0.9614, 1)
I then have tried:
m <- nls(P ~ 1 / (c + q*exp(-b*t))^(1/v)),
but every time I get:
"Error in c + q * exp(-b * t) : non-numeric argument to binary
operator"
Every one of my variables is numeric. Any ideas?
Thanks!
You have more than one problem in your script. The main issue is that you should never use names that are already taken by R: t() is the matrix transpose, c() combines values into a vector, and q() quits R. nls() will not try to fit them as parameters because they are already defined objects. I recommend using more meaningful and less dangerous names such as Coef1, Coef2, …
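A quick way to see where the error message comes from: c already refers to the base function, and doing arithmetic on a function raises exactly the error reported above.
c + 1
# Error in c + 1 : non-numeric argument to binary operator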
The second problem is that you are trying to fit a model with 4 parameters to a data set with only 7 points... This may yield singularities and other problems.
For the sake of the argument, I have reduced your model to three variables, and changed some names:
Time <- seq(1,7)
Prob <- c(0.0246, 0.2735, 0.5697, 0.6715, 0.8655, 0.9614, 1)
plot(Time, Prob)
And now we perform the nls() fit:
Fit <- nls(Prob ~ 1 / (Coef1 + Coef2 * exp(-Coef3 * Time)))
X <- data.frame(Time = seq(0, 7, length.out = 100))
Y <- predict(object = Fit, newdata = X)
lines(X$Time, Y)
And a summary of the results:
summary(Fit)
# Formula: Prob ~ 1/(Coef1 + Coef2 * exp(-Coef3 * Time))
#
# Parameters:
# Estimate Std. Error t value Pr(>|t|)
# Coef1 1.00778 0.06113 16.487 7.92e-05 ***
# Coef2 23.43349 14.42378 1.625 0.1796
# Coef3 1.04899 0.21892 4.792 0.0087 **
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.06644 on 4 degrees of freedom
#
# Number of iterations to convergence: 12
# Achieved convergence tolerance: 3.04e-06
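As a side note, the fit above was run without a start argument, so nls() warns and initializes every parameter at 1 before iterating. Supplying rough starting values explicitly is usually safer; the numbers below are just eyeballed guesses:
Fit <- nls(Prob ~ 1 / (Coef1 + Coef2 * exp(-Coef3 * Time)),
           start = list(Coef1 = 1, Coef2 = 20, Coef3 = 1))  # rough eyeballed start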
I know it is not exactly what you wanted, but I hope it helps.
I'm trying to reproduce this Stata example and move from stargazer to texreg. The data are available here.
To run the regression and get the se I run this code:
library(readstata13)
library(sandwich)
cluster_se <- function(model_result, data, cluster){
  model_variables <- intersect(colnames(data), c(colnames(model_result$model), cluster))
  model_rows <- as.integer(rownames(model_result$model))
  data <- data[model_rows, model_variables]
  cl <- data[[cluster]]
  M <- length(unique(cl))
  N <- nrow(data)
  K <- model_result$rank
  dfc <- (M/(M-1))*((N-1)/(N-K))
  uj <- apply(estfun(model_result), 2, function(x) tapply(x, cl, sum))
  vcovCL <- dfc*sandwich(model_result, meat=crossprod(uj)/N)
  sqrt(diag(vcovCL))
}
elemapi2 <- read.dta13(file = 'elemapi2.dta')
lm1 <- lm(formula = api00 ~ acs_k3 + acs_46 + full + enroll, data = elemapi2)
se.lm1 <- cluster_se(model_result = lm1, data = elemapi2, cluster = "dnum")
stargazer::stargazer(lm1, type = "text", style = "aer", se = list(se.lm1))
==========================================================
api00
----------------------------------------------------------
acs_k3 6.954
(6.901)
acs_46 5.966**
(2.531)
full 4.668***
(0.703)
enroll -0.106**
(0.043)
Constant -5.200
(121.786)
Observations 395
R2 0.385
Adjusted R2 0.379
Residual Std. Error 112.198 (df = 390)
F Statistic 61.006*** (df = 4; 390)
----------------------------------------------------------
Notes: ***Significant at the 1 percent level.
**Significant at the 5 percent level.
*Significant at the 10 percent level.
texreg produces this:
texreg::screenreg(lm1, override.se=list(se.lm1))
========================
Model 1
------------------------
(Intercept) -5.20
(121.79)
acs_k3 6.95
(6.90)
acs_46 5.97 ***
(2.53)
full 4.67 ***
(0.70)
enroll -0.11 ***
(0.04)
------------------------
R^2 0.38
Adj. R^2 0.38
Num. obs. 395
RMSE 112.20
========================
How can I fix the p-values?
Robust Standard Errors with texreg are easy: just pass the coeftest directly!
This has become much easier since the question was last answered: it appears you can now just pass the coeftest object with the desired variance-covariance matrix directly. Downside: you lose the goodness-of-fit statistics (such as R^2 and the number of observations), but depending on your needs, this may not be a big problem.
How to include robust standard errors with texreg
> screenreg(list(reg1, coeftest(reg1,vcov = vcovHC(reg1, 'HC1'))),
custom.model.names = c('Standard Standard Errors', 'Robust Standard Errors'))
=============================================================
Standard Standard Errors Robust Standard Errors
-------------------------------------------------------------
(Intercept) -192.89 *** -192.89 *
(55.59) (75.38)
x 2.84 ** 2.84 **
(0.96) (1.04)
-------------------------------------------------------------
R^2 0.08
Adj. R^2 0.07
Num. obs. 100
RMSE 275.88
=============================================================
*** p < 0.001, ** p < 0.01, * p < 0.05
To generate this example, I created a data frame with heteroscedasticity; see below for full runnable sample code:
require(sandwich);
require(lmtest);   # needed for coeftest()
require(texreg);
set.seed(1234)
df <- data.frame(x = 1:100);
df$y <- 1 + 0.5*df$x + 5*100:1*rnorm(100)
reg1 <- lm(y ~ x, data = df)
First, notice that your use of as.integer() is dangerous and likely to cause problems once you use data with non-numeric row names. For instance, with the built-in data set mtcars, whose row names are car names, as.integer() coerces all row names to NA and your function no longer works.
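For example:
# mtcars has car names as row names, so coercing them to integers yields NAs
# (with a "NAs introduced by coercion" warning):
head(as.integer(rownames(mtcars)))
# [1] NA NA NA NA NA NA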
To your actual question: you can provide custom p-values to texreg, which means that you need to compute the corresponding p-values. To achieve this, you could compute the variance-covariance matrix, compute the test statistics, and then compute the p-values manually, or you can simply compute the variance-covariance matrix and supply it to, e.g., coeftest(). Then you can extract the standard errors and p-values from there. Since I am unwilling to download any data, I use the mtcars data for the following:
library(sandwich)
library(lmtest)
library(texreg)
cluster_se <- function(model_result, data, cluster){
  model_variables <- intersect(colnames(data), c(colnames(model_result$model), cluster))
  model_rows <- rownames(model_result$model) # changed to be able to work with mtcars, not tested with other data
  data <- data[model_rows, model_variables]
  cl <- data[[cluster]]
  M <- length(unique(cl))
  N <- nrow(data)
  K <- model_result$rank
  dfc <- (M/(M-1))*((N-1)/(N-K))
  uj <- apply(estfun(model_result), 2, function(x) tapply(x, cl, sum))
  vcovCL <- dfc*sandwich(model_result, meat=crossprod(uj)/N) # value of this last assignment is returned
}
lm1 <- lm(formula = mpg ~ cyl + disp, data = mtcars)
vcov.lm1 <- cluster_se(model_result = lm1, data = mtcars, cluster = "carb")
standard.errors <- coeftest(lm1, vcov. = vcov.lm1)[,2]
p.values <- coeftest(lm1, vcov. = vcov.lm1)[,4]
texreg::screenreg(lm1, override.se=standard.errors, override.p = p.values)
And just for completeness' sake, let's do it manually:
t.stats <- abs(coefficients(lm1) / sqrt(diag(vcov.lm1)))
t.stats
(Intercept) cyl disp
38.681699 5.365107 3.745143
These are your t-statistics using the cluster-robust standard errors. The degrees of freedom are stored in lm1$df.residual, and using the built-in functions for the t-distribution (see e.g. ?pt), we get:
manual.p <- 2*pt(-t.stats, df=lm1$df.residual)
manual.p
(Intercept) cyl disp
1.648628e-26 9.197470e-06 7.954759e-04
Here, pt is the distribution function, and we want to compute the probability of observing a statistic at least as extreme as the one we observe. Since we are testing two-sided and the density is symmetric, we first take the left tail using the negative value and then double it. This is identical to using 2*(1-pt(t.stats, df=lm1$df.residual)). Now, just to check that this yields the same result as before:
all.equal(p.values, manual.p)
[1] TRUE
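As a side note, recent versions of the sandwich package provide vcovCL(), which computes a cluster-robust covariance matrix directly. It applies its own small-sample adjustments, so the numbers can differ slightly from the hand-rolled cluster_se() above; a sketch:
vcov.cl <- vcovCL(lm1, cluster = mtcars$carb, type = "HC1")
coeftest(lm1, vcov. = vcov.cl)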
I need to generate data for a given value of the coefficient of multiple determination.
For example, if I specify R^2 = 0.77, I want to generate data that produce a regression model with R^2 = 0.77,
but the data must lie in a certain range. For example, the sample size is 100 and I need 4 variables (x1 being the dependent variable), with values in the range 5-15. How do I do that?
I tried to use optim:
optim(0.77, fn, gr = NULL,
method = c("Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "SANN",
"Brent"),
lower = 5, upper = 15,
control = list(), hessian = FALSE)
but I don't know how to create the function fn for my purpose. Please help me write this function.
First here's a solution:
library(mvtnorm)
get.r <- function(x) c((x+sqrt(x**2+3*x))/(3),(x-sqrt(x**2+3*x))/(3))
set.seed(123)
cv <- get.r(0.77)[1]
out <- rmvnorm(100,sigma=matrix(c(1,cv,cv,cv,cv,1,cv,cv,cv,cv,1,cv,cv,cv,cv,1),ncol=4))
out1 <- as.data.frame(10*(out-min(out))/diff(range(out))+5)
range(out1)
# [1] 5 15
lm1 <- lm(V1~V2+V3+V4,data=out1)
summary(lm1)
# Call:
# lm(formula = V1 ~ V2 + V3 + V4, data = out1)
#
# Residuals:
# Min 1Q Median 3Q Max
# -1.75179 -0.64323 -0.03397 0.64770 2.23142
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.36180 0.50940 0.710 0.479265
# V2 0.29557 0.09311 3.175 0.002017 **
# V3 0.31433 0.08814 3.566 0.000567 ***
# V4 0.35438 0.07581 4.674 9.62e-06 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.927 on 96 degrees of freedom
# Multiple R-squared: 0.7695, Adjusted R-squared: 0.7623
# F-statistic: 106.8 on 3 and 96 DF, p-value: < 2.2e-16
Now let me explain how I got there. We can construct this statistically. First we need to understand a little about correlation and covariance. One formula for correlation is
Corr(X, Y) = Cov(X,Y)/sqrt(Var(X)Var(Y))
And one formula for covariance is:
Cov(X,Y) = E(XY) - E(X)E(Y)
In your question you want to get the multiple correlation of the regression model:
Y = X1 + X2 + X3
Let's make this as simple as possible and force the variance of all variables to be 1 and let's make the pairwise correlation between any two variables to be equal and call it r.
Now we're looking for the square of the correlation between Y and X1 + X2 + X3, which is:
R^2 = [Cov(Y,X1 + X2 + X3)]^2/[Var(Y)Var(X1 + X2 + X3)]
Note that
Cov(Y,X1 + X2 + X3) = Cov(Y,X1) + Cov(Y,X2) + Cov(Y,X3)
Further note that the variance of each variable is 1 and the pairwise correlation is r, so the above result is equivalent to 3r.
Also note that
Var(X1 + X2 + X3) = Var(X1) + Var(X2) + Var(X3) + 2[Cov(X1,X2) + Cov(X1,X3) + Cov(X2,X3)].
Since the variance of each is 1 and each pairwise covariance equals r, this is equivalent to 3 + 6r, so
R^2 = 9r^2/(3 + 6r) = 3r^2/(1 + 2r)
We can use the quadratic equation to solve for r and get
r = (R^2 +/- sqrt((R^2)^2+3R^2))/3
If we substitute R^2 = 0.77, then r = -0.3112633 or 0.8245966. We can use either to get what you need by using rmvnorm() within the mvtnorm package. And since R^2 is invariant to linear transformations, we can transform the resulting variables so that they fall between 5 and 15.
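A quick numeric check of that derivation (plugging the positive root back into the R^2 formula):
r <- (0.77 + sqrt(0.77^2 + 3*0.77))/3
r
# [1] 0.8245966
3*r^2/(1 + 2*r)
# [1] 0.77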
Update:
If we want to simulate with n predictors, we can use the following (note that I am not transforming the range of each predictor, but that can be done after the fact without altering the multiple R^2):
get.r <- function(x, n) c(((n-1)*x + sqrt(((n-1)*x)**2 + 4*n*x))/(2*n),
                          ((n-1)*x - sqrt(((n-1)*x)**2 + 4*n*x))/(2*n))
sim.data <- function(R2, n) {
  # get.r() expects the number of predictors; the covariance matrix covers
  # the n predictors plus the response, hence the (n+1) x (n+1) dimension.
  sig.mat <- matrix(get.r(R2, n)[1], n+1, n+1)
  diag(sig.mat) <- 1
  out <- as.data.frame(rmvnorm(100, sigma = sig.mat))
  return(out)
}
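A usage sketch (with mvtnorm loaded and the two functions above defined): simulate four predictors targeting a population R^2 of 0.77; the sample R^2 should land near 0.77, up to sampling variability.
set.seed(123)
dat <- sim.data(0.77, 4)                  # V1 is the response, V2-V5 the predictors
summary(lm(V1 ~ ., data = dat))$r.squared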
This isn't an answer, but I wanted to share what I did. I don't believe optim can be used the way you want it to. I attempted a "brute force" method to find a data set that could work, but the highest r-squared my random draws produced was 0.23:
rm(list = ls())

# Initializing our boolean, counter and best r-squared so far.
Done <- FALSE
count <- 1
maxr2 <- .000001

# I set y ahead of time.
y <- sample(5:15, 100, replace = TRUE)

# Running until an appropriate r-squared is found.
while(!Done) {
  # Generating a sample data set to optimize y on.
  a <- sample(5:15, 100, replace = TRUE)
  b <- sample(5:15, 100, replace = TRUE)
  c <- sample(5:15, 100, replace = TRUE)
  data <- data.frame(y = y, a = a, b = b, c = c)

  # Making our equation and fitting a linear model.
  EQ <- "y ~ a + b + c"          # Creating the equation.
  model <- lm(EQ, data)          # Running the model.

  if (count != 1) { if (summary(model)$r.squared > maxr2) { maxr2 <- summary(model)$r.squared } }
  r2 <- summary(model)$r.squared # Grabbing the r-squared.
  print(r2)                      # Printing the r-squared to see what is popping out.

  if (r2 <= 0.78 & r2 >= 0.76) { Done <- TRUE } # If the r-squared is satisfactory, stop.
  count <- count + 1                            # Incrementing our counter.
  if (count >= 1000000) { Done <- TRUE ; print("A satisfactory r-squared was not found.") } # Run at most 1,000,000 times.
}
# data will hold the data set whose model has an r-squared near 0.77, if one was found.
The issue with optim is that it optimizes individual parameters, i.e. single values. Its first argument, par, is meant to be a vector of the values you want to optimize. That could be used to optimize an r-squared through some function that depends on a few parameters (these would be your par values). However, in this case you are asking to optimize entire columns towards a target r-squared, which (as far as I know) does not make sense with optim.
I'm trying to find a solution in R that performs similarly to MATLAB's trust-region-reflective algorithm. This question has been asked before, but the author was asked to provide a reproducible example. I couldn't comment there, so the only option was to post a new question. Here's my example:
x <- c(5000,5000,5000,5000,2500,2500,2500,2500,1250,1250,1250,1250,625,625, 625,625,312,312,312,312,156,156,156,156)
y <- c(0.209065186,0.208338898,0.211886104,0.209638321,0.112064803,0.110535275,0.111748670,0.111208841,0.060416469,0.059098975,0.059274827,0.060859512,0.032178730,0.033190833,0.031621743,0.032345817,0.017983939,0.016632180,0.018468540,0.019513489,0.011490089,0.011076365,0.009282322,0.012309134)
Since initial parameter values are the central issue, I tried the 'nls2' package, which uses a 'brute-force' algorithm to find good starting parameters. Even with that, nls() and nlsLM() cannot reach convergence. Here's some basic code for this:
library('nls2'); library('minpack.lm')
fo <- y ~ I(A * (x ^ B) + C)
sA <- seq(-2,1,len=10) # range of parameter values
sB <- seq(-1,1,len=10)
sC <- seq(-1,1,len=10)
st1 <- expand.grid(A=sA,B=sB,C=sC)
mod1 <- nls2(fo,start=st1,algorithm="brute-force")
fit_ <- nls(fo,start=coef(mod1)) # basic nls
# or nls.lm
fit_ <- nlsLM(fo, start=coef(mod1),algorithm = "LM")
MATLAB produced:
a = 7.593e-05 (6.451e-05, 8.736e-05)
b = 0.9289 (0.9116, 0.9462)
c = 0.002553 (0.001333, 0.003772)
Goodness of fit:
SSE: 2.173e-05
R-square: 0.9998
Adjusted R-square: 0.9998
RMSE: 0.001017
and yes, using these parameter values, R also produced the solution.
Question: how can I obtain this in R without using MATLAB?
After looking at the plotted data, I have no problem guessing suitable starting values:
plot(y ~ x)
The data lie almost on a straight line through 0, so good starting values for B and C should be 1 and 0, respectively. Then you only need to guesstimate the slope of that line. Of course, you could also use lm(y ~ x) to find starting values for A and C; a sketch of that idea follows after the summary below.
fo <- y ~ A * (x ^ B) + C
DF <- data.frame(x, y)
fit <- nls(fo, start = list(A = 0.001, B = 1, C = 0), data = DF)
summary(fit)
#Formula: y ~ A * (x^B) + C
#
#Parameters:
# Estimate Std. Error t value Pr(>|t|)
#A 7.593e-05 5.495e-06 13.820 5.17e-12 ***
#B 9.289e-01 8.317e-03 111.692 < 2e-16 ***
#C 2.552e-03 5.866e-04 4.351 0.000281 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 0.001017 on 21 degrees of freedom
#
#Number of iterations to convergence: 5
#Achieved convergence tolerance: 9.084e-07
lines(seq(min(x), max(x), length.out = 100),
predict(fit, newdata = data.frame(x = seq(min(x), max(x), length.out = 100))),
col = "blue")
I have an outcome variable, say Y, and a list of 20 variables that could affect Y (say X1...X20). I would like to test which variables are NOT independent of Y. To do this I want to run a univariable glm for each variable and Y (i.e. Y~X1, ..., Y~X20) and then do a likelihood ratio test for each model. Finally, I would like to create a table that has the resulting p-value from the likelihood ratio test for each model.
From what I have seen, the lapply() and split() functions could be useful for this, but I don't really understand how they work from the examples I've seen.
This is what I tried at first:
> VarNames<-c(names(data[30:47]))
> glms<-glm(intBT~VarNames,family=binomial(logit))
Error in model.frame.default(formula = intBT ~ VarNames, drop.unused.levels = TRUE) :
variable lengths differ (found for 'VarNames')
I'm not sure if that was a good approach though.
It is easier to answer your questions if you provide a minimal example.
One way to go - but certainly not the most beautiful - is to use paste() to create the formulas as a vector of strings and then use lapply() on them. The code could look like this:
example.data <- data.frame(intBT=1:10, bli=1:10, bla=1:10, blub=1:10)
var.names <- c('bli', 'bla', 'blub')
formulas <- paste('intBT ~', var.names)
fitted.models <- lapply(formulas, glm, data=example.data)
This gives a list of fitted models. You can then use the apply functions on fitted.models to run further tests, for example the likelihood ratio tests sketched below.
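To get to the p-value table the question asks for, a possible follow-up is sketched here. It assumes the lmtest package; also note that the toy example.data above is perfectly collinear, so the resulting numbers are meaningless in this illustration and only the mechanics matter.
library(lmtest)
null.model <- glm(intBT ~ 1, data = example.data)   # intercept-only baseline
lr.pvalues <- sapply(fitted.models, function(m) lrtest(null.model, m)[2, "Pr(>Chisq)"])
data.frame(variable = var.names, p.value = lr.pvalues)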
Like Paul said it really helps if you provide a minimal example, but I think this does what you want.
set.seed(123)
N <- 100
num_vars <- 5
df <- data.frame(lapply(1:num_vars, function(i) rnorm(N)))
names(df) <- paste0("X", 1:num_vars)
e <- rnorm(N)
y <- as.numeric((df$X1 + df$X2 + e) > 0.5)
pvalues <- vector(mode = "list")
singlevar <- function(var, y, df){
model <- as.formula(paste0("y ~ ", var))
pvalues[var] <- coef(summary(glm(model, family = "binomial", data = df)))[var,4]
}
sapply(colnames(df), singlevar, y, df)
X1 X2 X3 X4 X5
1.477199e-04 4.193461e-05 8.885365e-01 9.064953e-01 9.702645e-01
For comparison:
Call:
glm(formula = y ~ X2, family = "binomial", data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0674 -0.8211 -0.5296 0.9218 2.5463
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.5591 0.2375 -2.354 0.0186 *
X2 1.2871 0.3142 4.097 4.19e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 130.68 on 99 degrees of freedom
Residual deviance: 106.24 on 98 degrees of freedom
AIC: 110.24
Number of Fisher Scoring iterations: 4