I have regression output in the form of a dataset. How can I input the estimates and standard errors into stargazer manually, so that stargazer creates its typical regression table?
term estimate std.error statistic p.value
1 rho 0.56782511824 0.016618530837 34.168190 0.000000e+00
2 (Intercept) -4.10698330735 0.537699847356 -7.638059 2.198242e-14
4 Unemployment_Rate 0.02288489900 0.016412419393 1.394365 1.632075e-01
5 pop_sq_mi 0.00020135202 0.000045361286 4.438852 9.044016e-06
6 prcntHS 0.13303000437 0.006002571434 22.162169 0.000000e+00
7 prcntBA 0.03698563228 0.012723399878 2.906899 3.650316e-03
8 prcntBlack 0.00877367484 0.004458885465 1.967683 4.910448e-02
9 prcntMulti 0.01404154066 0.004182210799 3.357445 7.866653e-04
10 prcntHisp 0.04316697336 0.003523552546 12.250980 0.000000e+00
11 prcntForeignBorn 0.02229836451 0.009707563865 2.297009 2.161824e-02
12 medianIncome -0.00002809549 0.000002933667 -9.576917 0.000000e+00
13 per_gop_2016 -0.02366390363 0.002698813668 -8.768261 0.000000e+00
I have tried to use the following method (as an example) without much luck.
X1 <- sample(seq(1,100,1), 100,replace= T)
X2 <- sample(seq(1,100,1), 100,replace= T)
Y <- sample(seq(1,100,1), 100,replace= T)
df <- data.frame(Y, X1, X2)
Results <- lm(Y ~ X1 + X2, data = df)
library(broom)
Results_DF <- data.frame(tidy(Results))
stargazer(type = "text",
          coef = list(Results_DF$estimate, Results_DF$estimate),
          se = list(Results_DF$std.error, Results_DF$std.error),
          omit.table.layout = "s")
Error in if (substr(inside[i], 1, nchar("list(")) == "list(") { :
missing value where TRUE/FALSE needed
Any advice would be greatly appreciated. Thank You!
You are almost there.
Here is a reproducible example. It should be possible to modify it so that it works with your data. Be careful with the t and p values; check out the p.auto option in stargazer. Of course, you will need to manually change or delete the regression footer containing the number of observations, the F-statistic, etc.
library(stargazer)
# coefficients data
d_lm <- data.frame(var = letters[1:4],
                   est = runif(4),
                   sd = runif(4),
                   t = runif(4),
                   p = runif(4))
# fake data
d <- data.frame(y = runif(30),
                a = runif(30),
                b = runif(30),
                c = runif(30),
                d = runif(30))
# fake regression
lm <- lm(y ~ a + b + c + d - 1, d)
stargazer(lm,
          coef = list(d_lm$est),
          se = list(d_lm$sd),
          t = list(d_lm$t), # if not supplied stargazer will calculate t values for you
          p = list(d_lm$p), # if not supplied stargazer will calculate p values for you
          type = "text")
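Applied to the attempt in the question, the error appears to come from calling stargazer() without any model object; passing the fitted lm (or any placeholder model whose terms match the supplied coefficients) fixes it. A minimal sketch, reusing Results and Results_DF from the question's code (tidy() output has columns estimate, std.error and p.value):
library(stargazer)
stargazer(Results,
          coef = list(Results_DF$estimate),
          se = list(Results_DF$std.error),
          p = list(Results_DF$p.value),
          omit.table.layout = "s",
          type = "text")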
Forewarning: I am a complete noob, so I'm sorry for the dumb question. I've tried everything to figure out how to write out the actual polynomial function given these coefficients.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 89.6131 0.8525 105.119 < 2e-16
poly(log(x), 3, raw = TRUE)1 -36.8351 2.3636 -15.584 1.13e-10
poly(log(x), 3, raw = TRUE)2 6.9735 1.6968 4.110 0.000928
poly(log(x), 3, raw = TRUE)3 -0.7105 0.3124 -2.274 0.038063
I thought that it would just be f(x) = 89.6131 - 36.8351*log(x) + 6.9735*log(x^2) - 0.7105*log(x^3).
I've tried a bunch of variations of this but nothing seems to work. I'm trying to plug my polynomial function and my x-values in to Desmos and get it to return what I'm getting in R which is:
1 2 3 4 5 6
9.806469 15.028672 20.317227 25.669588 28.757896 35.816853
7 8 9 10 11 12
41.334623 43.919057 49.267966 53.880519 60.862101 63.830004
13 14 15 16 17 18
70.390727 79.412081 80.416065 85.214063 86.165068 98.187744
19
96.723278
My x values are:
x = c(49.64,34.61,23.76,16.31,13.23,8.47,6.19,5.4,4.15,3.37,2.53,2.26,1.79,1.34,1.3,1.13,1.1,0.8,0.83)
Modeling code:
#data
x = c(49.64,34.61,23.76,16.31,13.23,8.47,6.19,5.4,4.15,3.37,2.53,2.26,1.79,1.34,1.3,1.13,1.1,0.8,0.83)
y = c(10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100)
#fitting the model
model1 <- lm(y~poly(log(x),3,raw=TRUE))
new.distance <- data.frame(
distance = c(49.64,34.61,23.76,16.31,13.23,8.47,6.19,5.4,4.15,3.37,2.53,2.26,1.79,1.34,1.3,1.13,1.1,0.83,0.8)
)
predict(model1, newdata = new.distance)
summary(model1)
Libraries
library(tidyverse)
Sample data
x <- c(49.64,34.61,23.76,16.31,13.23,8.47,6.19,5.4,4.15,3.37,2.53,2.26,1.79,1.34,1.3,1.13,1.1,0.8,0.83)
y <- c(10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100)
df <-
  tibble(
    x = x,
    y = y
  ) %>%
  mutate(
    lx = log(x)
  )
Fitting model
model1 <- lm(y~poly(log(x),3,raw=TRUE))
Predicting data
df_to_pred <-
  data.frame(
    x = c(49.64,34.61,23.76,16.31,13.23,8.47,6.19,5.4,4.15,3.37,2.53,2.26,1.79,1.34,1.3,1.13,1.1,0.83,0.8)
  )
Predicted data: model vs. manual formula
df %>%
  cbind(y_pred_model = predict(model1, newdata = df_to_pred)) %>%
  mutate(y_pred_manual = 89.6131 - 36.8351*log(x) + 6.9735*log(x)^2 - 0.7105*log(x)^3) %>%
  ggplot(aes(y_pred_manual, y_pred_model))+
  geom_abline(intercept = 0, slope = 1, size = 1, col = "red")+
  geom_point()
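The key point is that raw = TRUE in poly(log(x), 3, raw = TRUE) produces powers of log(x), so the manual formula needs log(x)^2 and log(x)^3, not log(x^2) and log(x^3). A quick sanity check against the fitted coefficients (a sketch reusing model1 and x from above):
b <- coef(model1)
f <- function(x) b[1] + b[2]*log(x) + b[3]*log(x)^2 + b[4]*log(x)^3
# should be TRUE: the manual polynomial reproduces the model's fitted values
all.equal(as.numeric(f(x)), as.numeric(predict(model1)))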
I read that it is possible to store dataframes in a column of a dataframe with nest:
https://tidyr.tidyverse.org/reference/nest.html
Is it also possible to store tables in a column of a dataframe?
The reason is that I would like to calculate the Kappa for every subgroup of a dataframe with caret. However, caret::confusionMatrix(t) expects a table as input.
In the example-code below this works fine if I calculate the Kappa for the complete dataframe at once:
library(tidyverse)
library(caret)
# generate some sample data:
n <- 100L
x1 <- rnorm(n, 1.0, 2.0)
x2 <- rnorm(n, -1.0, 0.5)
y <- rbinom(n, 1L, plogis(1 * x1 + 1 * x2))
my_factor <- rep( c('A','B','C','D'), 25 )
df <- cbind(x1, x2, y, my_factor)
# fit a model and make predictions:
mod <- glm(y ~ x1 + x2, "binomial")
probs <- predict(mod, type = "response")
# confusion matrix
probs_round <- round(probs)
t <- table(factor(probs_round, c(1,0)), factor(y, c(1,0)))
ccm <- caret::confusionMatrix(t)
# extract Kappa:
ccm$overall[2]
> Kappa
> 0.5232
However, if I try group_by to generate the Kappa for every factor level as a subgroup (see code below), it does not succeed. I suppose I need to nest t in a certain way in df, although I don't know how:
# extract Kappa for every subgroup with same factor (NOT WORKING CODE):
df <- cbind(df, probs_round)
df <- as.data.frame(df)
output <- df %>%
  dplyr::group_by(my_factor) %>%
  dplyr::mutate(t = table(factor(probs_round, c(1,0)), factor(y, c(1,0)))) %>%
  summarise(caret::confusionMatrix(t))
Expected output:
>my_factor Kappa
>1 A 0.51
>2 B 0.52
>3 C 0.53
>4 D 0.54
Is this correct and is this possible?
(the exact values for Kappa will be different due to the randomness in the sample data)
Thanks a lot!
You could skip the intermediate mutate() that's giving you trouble and do:
library(dplyr)
library(caret)
df %>%
  group_by(my_factor) %>%
  summarize(t = confusionMatrix(table(factor(probs_round, c(1,0)),
                                      factor(y, c(1,0))))$overall[2])
Returns:
# A tibble: 4 x 2
my_factor t
<chr> <dbl>
1 A 0.270
2 B 0.513
3 C 0.839
4 D 0.555
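If you also want to keep each group's confusion table itself (the "table in a column" part of the question), the same summarize() can store it in a list-column. A minimal sketch, assuming df, probs_round and y as built in the question:
out <- df %>%
  group_by(my_factor) %>%
  summarize(t = list(table(factor(probs_round, c(1, 0)),   # confusion table kept as a list-column
                           factor(y, c(1, 0)))),
            Kappa = confusionMatrix(t[[1]])$overall["Kappa"])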
The above approach is the easiest way to get the desired results. But just to show what's possible, we can use your approach with dplyr::nest_by(), which groups the data set rowwise.
In the approach below we calculate a separate glm for each subgroup. I'm not sure if that's what you want to do.
library(tidyverse)
library(caret)
# generate some sample data:
n <- 1000L
df <- tibble(x1 = rnorm(n, 1.0, 2.0),
             x2 = rnorm(n, -1.0, 0.5),
             y = rbinom(n, 1L, plogis(x1 + 1 * x1 + 1 * x2)),
             my_factor = rep( c('A','B','C','D'), 250))
output <- df %>%
  nest_by(my_factor) %>%
  mutate(y = list(data$y),
         mod = list(glm(y ~ x1 + x2,
                        family = "binomial",
                        data = data)),
         probs = list(predict(mod, type = "response")),
         probs_round = list(round(probs)),
         t = list(table(factor(probs_round, c(1, 0)),
                        factor(y, c(1, 0)))),
         ccm = caret::confusionMatrix(t)$overall[2])
output %>%
pull(ccm)
#> Kappa Kappa Kappa Kappa
#> 0.7743682 0.7078112 0.7157761 0.7549340
Created on 2021-06-23 by the reprex package (v0.3.0)
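To get the result into the shape shown as the expected output in the question (one row per factor level with its Kappa), you can pull those two columns out of the rowwise result built above, e.g.:
output %>%
  ungroup() %>%
  select(my_factor, Kappa = ccm)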
What I did: I ran a linear mixed-effects model analysis in R with the nlme library. I have a categorical fixed variable, Blurriness, with 2 levels: B standing for Blurred, N standing for Non-Blurred. Upon suggestion, I changed them to 1 (for B) and 0 (for N).
Problem: I re-ran the model and got different p-values/results (I do not mean that signs flipped from + to -; the numbers themselves changed).
What I did to solve it: I then reversed the coding (0 for B, 1 for N) to see if it changed anything, and I got the same p-values and coefficients as when I coded it as B and N (great!). But do you have any idea why that might be?
Edit: I add here a reproducible example: the data with only 80 rows: https://home.mycloud.com/action/share/dedef0a3-794c-4ccc-b245-f93559de1f33
katilimci = factor(dENEME$Participants)
resimler = factor(dENEME$ImageID)
bugu0 = factor(dENEME$Blurriness)
sira0 = factor(dENEME$TheOrderofTheImages)
cekicilik0 = factor(dENEME$TargetAttractiveness)
bugu1 = factor(dENEME$Blurriness2)
sira1= factor(dENEME$TheOrderofTheImages2)
cekicilik1 = factor(dENEME$TargetAttractiveness2)
library(nlme)
myModel1 = lme(Ratings~bugu0+sira0+cekicilik0+bugu0:cekicilik0+bugu0:sira0+sira0:cekicilik0+bugu0:cekicilik0:sira0,data = dENEME, random=list(katilimci=~1, resimler=~1),na.action = na.exclude)
myModel2 = lme(Ratings~bugu1+sira1+cekicilik1+bugu1:cekicilik1+bugu1:sira1+sira1:cekicilik1+bugu1:cekicilik1:sira1,data = dENEME, random=list(katilimci=~1, resimler=~1),na.action = na.exclude)
summary(myModel1)
summary(myModel2)
The resulting p-values are different and I could not find the reason why...
Edit 2: Another reproducible example:
library(nlme)
#fixed factors:
variable1<-as.factor(rep(c("A","B"),each=20))
variable2<-as.factor(sample(rep(c("A","B"),each=20)))
variable3<-as.factor(sample(rep(c("A","B"),each=20)))
#y variable:
ratings<-c(rnorm(20,0,2),rnorm(20,1,6))
#random factor:
ID<-as.factor(paste("ID",rep(1:20,times=2),sep=""))
#contrast matrices:
contrasts(variable1)<-c(0,1)
#here, for variable1, level A becomes 0 and level B becomes 1.
contrasts(variable2)<-c(0,1)
contrasts(variable3)<-c(0,1)
#model1:
m1<-lme(ratings~variable1*variable2*variable3,random=~1|ID)
contrasts(variable1)<-c(1,0)
#now, for variable1, level A becomes 1 and level B becomes 0, so all the fixed variables mirror each other in the data that we created.
contrasts(variable2)<-c(1,0)
contrasts(variable3)<-c(1,0)
#model2:
m2<-lme(ratings~variable1*variable2*variable3,random=~1|ID)
summary(m1)
summary(m2)
#we bind the parameters of the 2 models to see them together for comparison:
rbind(
summary(m1)[[20]][,1],
summary(m2)[[20]][,1]
)
I've had a go at making a reproducible example since there isn't one in the question.
require(nlme)
df <- data.frame(dv = c(rnorm(20, 0), rnorm(20, 1)),
                 Blurriness = factor(c(rep("B", 20), rep("N", 20))),
                 Random = factor(rep(rep(c("x", "y"), each = 5), 2)),
                 Blurriness_1_0 = rep(1:0, each = 20),
                 Blurriness_0_1 = rep(0:1, each = 20))
m <- list()
m[[1]] <- lme(dv ~ Blurriness, random = ~ Blurriness | Random, data = df)
m[[2]] <- lme(dv ~ Blurriness_1_0, random = ~ Blurriness_1_0 | Random, data = df)
m[[3]] <- lme(dv ~ Blurriness_0_1, random = ~ Blurriness_0_1 | Random, data = df)
models <- lapply(m, function(x) summary(x)$tTable)
This gives 3 models which hopefully show the behaviour you describe:
models
#> [[1]]
#> Value Std.Error DF t-value p-value
#> (Intercept) -0.138797 0.2864303 37 -0.4845752 0.630834098
#> BlurrinessN 1.008572 0.3451909 37 2.9217817 0.005901891
#>
#> [[2]]
#> Value Std.Error DF t-value p-value
#> (Intercept) 0.8697753 0.2864293 37 3.036614 0.004366652
#> Blurriness_1_0 -1.0085723 0.3451909 37 -2.921781 0.005901898
#>
#> [[3]]
#> Value Std.Error DF t-value p-value
#> (Intercept) -0.138797 0.2864303 37 -0.4845752 0.630834098
#> Blurriness_0_1 1.008572 0.3451909 37 2.9217817 0.005901891
In this example, the p-values differ only for the intercepts, which is what you would expect (the intercept is essentially the mean of whichever group is coded 0, and those two group means sit at different numbers of standard errors from 0).
Perhaps this is not what you meant though - it's difficult to tell from your question without a reproducible example.
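The general point can be seen even without mixed models: flipping a 0/1 coding only changes which level is the reference, so the slope changes sign, the intercept shifts by the slope, and the slope's t- and p-values stay the same. A minimal sketch (made-up data, plain lm for brevity):
set.seed(1)
g <- factor(rep(c("B", "N"), each = 20))
y <- rnorm(40, mean = ifelse(g == "B", 0, 1))
f1 <- lm(y ~ I(as.numeric(g == "B")))  # B coded 1, N coded 0
f2 <- lm(y ~ I(as.numeric(g == "N")))  # B coded 0, N coded 1
coef(f1); coef(f2)  # intercepts differ, slopes are sign-flipped
summary(f1)$coefficients[2, 4]; summary(f2)$coefficients[2, 4]  # same p-value for the slope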
I was wondering why lm() says 5 coefficients are not defined because of singularities and then gives all NAs for those 5 coefficients in the summary output.
Note that all my predictors are categorical.
Is there anything wrong with my data or code regarding these 5 coefficients? How can I fix this?
d <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/v.csv", h = T) # Data
nms <- c("Age","genre","Length","cf.training","error.type","cf.scope","cf.type","cf.revision")
d[nms] <- lapply(d[nms], as.factor) # make factor
vv <- lm(dint~Age+genre+Length+cf.training+error.type+cf.scope+cf.type+cf.revision, data = d)
summary(vv)
First 6 lines of output:
Coefficients: (5 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.17835 0.63573 0.281 0.779330
Age1 -0.04576 0.86803 -0.053 0.958010
Age2 0.46431 0.87686 0.530 0.596990
Age99 -1.64099 1.04830 -1.565 0.118949
genre2 1.57015 0.55699 2.819 0.005263 **
genre4 NA NA NA NA ## For example, this row is all `NA`s; there are 4 more like it!
As others noted, one problem is that you seem to have multicollinearity. Another is that there are missing values in your dataset. The missing values should probably just be removed. As for the correlated variables, you should inspect your data to identify this collinearity and remove it. Deciding which variables to remove and which to retain is a very domain-specific topic. However, if you wish, you could use regularisation and fit a model while retaining all variables. This also allows you to fit a model when n (the number of samples) is less than p (the number of predictors).
I've shown code below that demonstrates how to examine the correlation structure within your data and identify which variables are most correlated (thanks to this answer). I've also included an example of fitting such a model using L2 regularisation (commonly known as ridge regression).
d <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/v.csv", h = T) # Data
nms <- c("Age","genre","Length","cf.training","error.type","cf.scope","cf.type","cf.revision")
d[nms] <- lapply(d[nms], as.factor) # make factor
vv <- lm(dint~Age+genre+Length+cf.training+error.type+cf.scope+cf.type+cf.revision, data = d)
df <- d
df[] <- lapply(df, as.numeric)
cor_mat <- cor(as.matrix(df), use = "complete.obs")
library("gplots")
heatmap.2(cor_mat, trace = "none")
## https://stackoverflow.com/questions/22282531/how-to-compute-correlations-between-all-columns-in-r-and-detect-highly-correlate
library("tibble")
library("dplyr")
library("tidyr")
d2 <- df %>%
  as.matrix() %>%
  cor(use = "complete.obs") %>%
  ## Set diag (a vs a) to NA, then remove
  (function(x) {
    diag(x) <- NA
    x
  }) %>%
  as.data.frame %>%
  rownames_to_column(var = 'var1') %>%
  gather(var2, value, -var1) %>%
  filter(!is.na(value)) %>%
  ## Sort by decreasing absolute correlation
  arrange(-abs(value))
## 2 pairs of variables are almost exactly correlated!
head(d2)
#> var1 var2 value
#> 1 id study.name 0.9999430
#> 2 study.name id 0.9999430
#> 3 Location timed 0.9994082
#> 4 timed Location 0.9994082
#> 5 Age ed.level 0.7425026
#> 6 ed.level Age 0.7425026
## Remove some variables here, or maybe try regularized regression (see below)
library("glmnet")
## glmnet requires matrix input
X <- d[, c("Age", "genre", "Length", "cf.training", "error.type", "cf.scope", "cf.type", "cf.revision")]
X[] <- lapply(X, as.numeric)
X <- as.matrix(X)
ind_na <- apply(X, 1, function(row) any(is.na(row)))
X <- X[!ind_na, ]
y <- d[!ind_na, "dint"]
glmnet <- glmnet(
  x = X,
  y = y,
  ## alpha = 0 is ridge regression
  alpha = 0)
plot(glmnet)
Created on 2019-11-08 by the reprex package (v0.3.0)
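As a quicker first diagnostic, base R can report exactly which terms lm() dropped and which other terms they are linear combinations of, via alias(). A short sketch, assuming the vv model fitted above:
## Rows are the aliased (NA) coefficients; the entries show the linear
## combination of the other terms that each dropped coefficient duplicates.
alias(vv)$Complete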
In such a situation you can use the "olsrr" package in R for stepwise regression analysis. Here is sample code for doing stepwise regression in R:
library("olsrr")
#Load the data
d <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/v.csv", h = T)
# stepwise regression
vv <- lm(dint ~ Age + genre + Length + cf.training + error.type + cf.scope + cf.type + cf.revision, data = d)
summary(vv)
k <- ols_step_both_p(vv, pent = 0.05, prem = 0.1)
# stepwise regression plot
plot(k)
# final model
k$model
It should provide output comparable to that of SPSS's stepwise regression procedure.
Is there a single function, similar to "runif", "rnorm" and the like which will produce simulated predictions for a linear model? I can code it on my own, but the code is ugly and I assume that this is something someone has done before.
slope = 1.5
intercept = 0
x = as.numeric(1:10)
e = rnorm(10, mean=0, sd = 1)
y = slope * x + intercept + e
df = data.frame(x, y)
fit = lm(y ~ x, data = df)
newX = data.frame(x = as.numeric(11:15))
What I'm interested in is a function that looks like the line below:
sims = rlm(1000, fit, newX)
That function would return 1000 simulations of y values, based on the new x variables.
Showing that Gavin Simpson's suggestion of modifying stats:::simulate.lm is a viable one.
## Modify stats:::simulate.lm by inserting some tracing code immediately
## following the line that reads "ftd <- fitted(object)"
trace(what = stats:::simulate.lm,
      tracer = quote(ftd <- list(...)[["XX"]]),
      at = list(6))
## Prepare the data and 'fit' object
df <- data.frame(x =x<-1:10, y = 1.5*x + rnorm(length(x)))
fit <- lm(y ~ x, data = df)
## Define new covariate values and compute their predicted/fitted values
newX <- 8:1
newFitted <- predict(fit, newdata = data.frame(x = newX))
## Pass in fitted via the argument 'XX'
simulate(fit, nsim = 4, XX = newFitted)
# sim_1 sim_2 sim_3 sim_4
# 1 11.0910257 11.018211 10.95988582 13.398902
# 2 12.3802903 10.589807 10.54324607 11.728212
# 3 8.0546746 9.925670 8.14115433 9.039556
# 4 6.4511230 8.136040 7.59675948 7.892622
# 5 6.2333459 3.131931 5.63671024 7.645412
# 6 3.7449859 4.686575 3.45079655 5.324567
# 7 2.9204519 3.417646 2.05988078 4.453807
# 8 -0.5781599 -1.799643 -0.06848592 0.926204
That works, but this is a cleaner (and likely better) approach:
## A function for simulating at new x-values
simulateX <- function(object, nsim = 1, seed = NULL, X, ...) {
  object$fitted.values <- predict(object, X)
  simulate(object = object, nsim = nsim, seed = seed, ...)
}
## Prepare example data and a fit object
df <- data.frame(x =x<-1:10, y = 1.5*x + rnorm(length(x)))
fit <- lm(y ~ x, data = df)
## Supply new x-values in a data.frame of the form expected by
## the newdata= argument of predict.lm()
newX <- data.frame(x = 8:1)
## Try it out
simulateX(fit, nsim = 4, X = newX)
# sim_1 sim_2 sim_3 sim_4
# 1 11.485024 11.901787 10.483908 10.818793
# 2 10.990132 11.053870 9.181760 10.599413
# 3 7.899568 9.495389 10.097445 8.544523
# 4 8.259909 7.195572 6.882878 7.580064
# 5 5.542428 6.574177 4.986223 6.289376
# 6 5.622131 6.341748 4.929637 4.545572
# 7 3.277023 2.868446 4.119017 2.609147
# 8 1.296182 1.607852 1.999305 2.598428
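For reference, here is roughly what that machinery amounts to for a plain unweighted lm: simulate() draws new responses from a normal distribution centred on the fitted values, with the residual standard error as the standard deviation, ignoring uncertainty in the estimated coefficients. A hand-rolled sketch (simulateX_manual is a made-up name, not part of any package):
simulateX_manual <- function(object, nsim = 1, newdata) {
  mu <- predict(object, newdata = newdata)   # predicted means at the new x values
  sigma <- summary(object)$sigma             # residual standard error
  replicate(nsim, rnorm(length(mu), mean = mu, sd = sigma))
}
simulateX_manual(fit, nsim = 4, newdata = newX)   # newX as defined above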