How to nest tables in a column of a dataframe?

I read that it is possible to store dataframes in a column of a dataframe with nest:
https://tidyr.tidyverse.org/reference/nest.html
Is it also possible to store tables in a column of a dataframe?
The reason is that I would like to calculate the Kappa for every subgroup of a dataframe with caret. However, caret::confusionMatrix(t) expects a table as input.
In the example-code below this works fine if I calculate the Kappa for the complete dataframe at once:
library(tidyverse)
library(caret)
# generate some sample data:
n <- 100L
x1 <- rnorm(n, 1.0, 2.0)
x2 <- rnorm(n, -1.0, 0.5)
y <- rbinom(n, 1L, plogis(1 * x1 + 1 * x2))
my_factor <- rep( c('A','B','C','D'), 25 )
df <- cbind(x1, x2, y, my_factor)
# fit a model and make predictions:
mod <- glm(y ~ x1 + x2, family = "binomial")
probs <- predict(mod, type = "response")
# confusion matrix
probs_round <- round(probs)
t <- table(factor(probs_round, c(1,0)), factor(y, c(1,0)))
ccm <- caret::confusionMatrix(t)
# extract Kappa:
ccm$overall[2]
> Kappa
> 0.5232
However, if I try to use group_by() to generate the Kappa for every factor level as a subgroup (see code below), it does not succeed. I suppose I need to nest t in df in a certain way, but I don't know how:
# extract Kappa for every subgroup with same factor (NOT WORKING CODE):
df <- cbind(df, probs_round)
df <- as.data.frame(df)
output <- df %>%
  dplyr::group_by(my_factor) %>%
  dplyr::mutate(t = table(factor(probs_round, c(1,0)), factor(y, c(1,0)))) %>%
  dplyr::summarise(caret::confusionMatrix(t))
Expected output:
>   my_factor Kappa
> 1 A         0.51
> 2 B         0.52
> 3 C         0.53
> 4 D         0.54
Is this the right approach, and is it possible?
(the exact values for Kappa will be different due to the randomness in the sample data)
Thanks a lot!

You could skip the intermediate mutate() that's giving you trouble and do:
library(dplyr)
library(caret)
df %>%
  group_by(my_factor) %>%
  summarize(t = confusionMatrix(table(factor(probs_round, c(1,0)),
                                      factor(y, c(1,0))))$overall[2])
Returns:
# A tibble: 4 x 2
  my_factor     t
  <chr>     <dbl>
1 A         0.270
2 B         0.513
3 C         0.839
4 D         0.555
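If you do want the tables themselves nested in a column, as the title asks, any R object (including a table) can live in a list-column; a minimal sketch building on the same df:

# store one table per group in a list-column, then map confusionMatrix() over it
df %>%
  group_by(my_factor) %>%
  summarize(t = list(table(factor(probs_round, c(1, 0)),
                           factor(y, c(1, 0))))) %>%
  mutate(Kappa = sapply(t, function(tab) confusionMatrix(tab)$overall[2]))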

The summarize() approach above is the easiest way to get the desired results. But just to show what's possible, we can follow your original idea with dplyr::nest_by(), which nests the data and groups the result rowwise.
In the approach below we fit a separate glm for each subgroup. I'm not sure if that's what you want to do.
library(tidyverse)
library(caret)
# generate some sample data:
n <- 1000L
df <- tibble(x1 = rnorm(n, 1.0, 2.0),
             x2 = rnorm(n, -1.0, 0.5),
             y = rbinom(n, 1L, plogis(x1 + 1 * x1 + 1 * x2)),
             my_factor = rep(c('A','B','C','D'), 250))
output <- df %>%
  nest_by(my_factor) %>%
  mutate(y = list(data$y),
         mod = list(glm(y ~ x1 + x2,
                        family = "binomial",
                        data = data)),
         probs = list(predict(mod, type = "response")),
         probs_round = list(round(probs)),
         t = list(table(factor(probs_round, c(1, 0)),
                        factor(y, c(1, 0)))),
         ccm = caret::confusionMatrix(t)$overall[2])
output %>%
  pull(ccm)
#> Kappa Kappa Kappa Kappa
#> 0.7743682 0.7078112 0.7157761 0.7549340
Created on 2021-06-23 by the reprex package (v0.3.0)
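Since pull() drops the group labels, you can keep them alongside the Kappa values instead, e.g.:

output %>%
  select(my_factor, ccm)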

Related

How to add coefficients to existing data base such that their effect on the final intercept is given?

Firstly, let's say I have a data frame df with variables y, x1, and x2, where x1 is a continuous variable and x2 is a factor.
Let's say I have a model:
model <- glm(y ~ x1 + x2, data = df, family = binomial)
This will result in an object where I can extract the coefficients using the command model$coefficients.
However, for use in another program I would like to export the data frame df, but I'd also like to be able to display the results of the model beyond simply adding the fitted values to the data frame.
Therefore I would like to have coeff1*x1 and coeff2*x2 in the same dataframe as well, so that I could use these together with the original data to display their effects. The problem is that one of the variables is a multi-level factor, so it's not ideal to simply use a for-loop to extract the coefficients and multiply the variables by them.
Is there another way to add two new variables to the dataframe df such that they've been derived from combining the original variables x1, x2 and their respective coefficients?
Try:
set.seed(123)
N <- 10
df <- data.frame(x1 = rnorm(N, 10, 1),
                 x2 = sample(1:3, N, TRUE),
                 y = as.integer(50 - x2 * 0.4 + x1 * 1.2 + rnorm(N, 0, 0.5) > 52))
model <- glm(y ~ x1 + x2, data = df, family = binomial)
# add column for intercept
df <- cbind(x0 = rep(1, N), df)
df$intercept <- df$x0 * model$coefficients["(Intercept)"]
df[["coeff1*x1"]] <- df$x1 * model$coefficients["x1"]
df[["coeff2*x2"]] <- df$x2 * model$coefficients["x2"]
#   x0        x1 x2 y intercept     coeff1*x1     coeff2*x2
# 1  1  9.439524  1 1  24.56607 -3.361333e-06 -4.281056e-07
# 2  1  9.769823  1 1  24.56607 -3.478949e-06 -4.281056e-07
# 3  1 11.558708  1 1  24.56607 -4.115956e-06 -4.281056e-07
Alternatively (starting again from the original df):
# add column for intercept
df <- cbind(x0 = rep(1, N), df)
tmp <- as.data.frame(Map(function(x, y) x * y, subset(df, select = -y), model$coefficients))
names(tmp) <- paste0("coeff*", names(model$coefficients))
cbind(df, tmp)
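Both versions assume the columns of df line up one-to-one with the coefficient vector, which breaks down once a multi-level factor expands into several dummy columns. A sketch that keeps the alignment automatic by working from the design matrix instead (model.matrix() returns one column per model term, including factor dummies):

# multiply each design-matrix column by its coefficient
mm <- model.matrix(model)
contrib <- sweep(mm, 2, model$coefficients, `*`)
colnames(contrib) <- paste0("coeff*", colnames(mm))
cbind(df, contrib)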

Regression imputation with dplyr in R

I want to do regression imputation with dplyr in R efficiently. Here is my problem: I have a data set with many missing values for one column - let's call it p. Now I want to estimate the missing values of p with a regression imputation approach. For that I regress p on a set of variables with OLS using uncensored data (a subset of the data set without missing values for p). Then I use the estimated coefficients to calculate the missing values of p.
My data set looks like that:
df <- data.frame(
  id = c(1, 1, 1, 2, 2, 2),
  group = c(1, 1, 2, 1, 1, 2),
  sub_group = c(1, 2, 3, 1, 2, 3),
  p = c(4.3, 5.7, NA, NA, NA, 10),
  var1 = c(0.3, 0.1, 0.4, 0.9, 0.1, 0.2),
  var2 = c(0, 0, 0, 1, 1, 1)
)
where id represent individuals, which buy goods from a group (e.g. "food") with subgroups (like "bread"). p is the price, while var1 and var2 are some demographic variables (like "education" and "age").
What I've done so far:
library(dplyr)
df <- as_tibble(df)
# Create uncensored data
uncensored_df <- df %>%
  filter(!is.na(p))
# Run regression on uncensored data
imp_model <- lm(p ~ var1 + var2, data = uncensored_df)
# Get the coefficients of the fitted model
coefs <- unname(imp_model$coefficients)
# Use coefficients to compute missing values of p
censored_df <- df %>%
  filter(is.na(p)) %>%
  group_by(id, group, sub_group) %>%
  mutate(p = coefs[1] + coefs[2] * var1 + coefs[3] * var2)
# And finally combine the two subsets
bind_rows(uncensored_df, censored_df) %>% arrange(id, group, sub_group)
As I use more than var1 and var2 in my actual problem (about 30 variables), what is a better way to do regression imputation with dplyr? (I'm also open for non-dplyr solutions, though.)
library(dplyr)
fit <- lm(p ~ ., data = select(df, p, starts_with("var")))
df %>%
  rowwise() %>%
  mutate(p = ifelse(is.na(p), predict(fit, newdata = across()), p)) %>%
  ungroup()
How it works
For starters, when fitting your model, you can subset your data frame using select() and any of the tidyselect helpers to pick out your predictor variables (here starts_with("var")). This subset data frame then allows you to use the p ~ . notation, which means regress p on everything else in the subset data frame.
Next you create a row-wise data frame and use your model to predict where p is missing. In this instance across() turns each row into a 1x6 tibble that you can pass to the newdata argument. predict() then uses the model fit and this new data to predict a value of p.
Output
     id group sub_group     p  var1  var2
  <dbl> <dbl>     <dbl> <dbl> <dbl> <dbl>
1     1     1         1  4.3    0.3     0
2     1     1         2  5.7    0.1     0
3     1     2         3  3.60   0.4     0
4     2     1         1  5.10   0.9     1
5     2     1         2 10.7    0.1     1
6     2     2         3 10      0.2     1
Benchmarking
As mentioned in the comments, for large data frames the rowwise operation takes significantly longer than some other options:
library(microbenchmark)
set.seed(1)
df1 <- df %>%
  slice_sample(n = 1E5, replace = TRUE)
fit <- lm(p ~ ., data = select(df1, p, starts_with("var")))
dplyr_rowwise <- function() {
  df1 %>%
    rowwise() %>%
    mutate(p = ifelse(is.na(p), predict(fit, newdata = across()), p)) %>%
    ungroup()
}
dplyr_coalesce <- function() {
  df1 %>%
    mutate(p = coalesce(p, predict(fit, newdata = df1)))
}
base_index <- function() {
  isna <- is.na(df1$p)
  df1$p[isna] <- predict(fit, newdata = subset(df1, isna))
}
microbenchmark(
dplyr_rowwise(),
dplyr_coalesce(),
base_index(),
times = 10L
)
Unit: milliseconds
             expr        min         lq        mean      median         uq
  dplyr_rowwise() 63739.9512 64441.0800 66926.46041 65513.51785 66923.0241
 dplyr_coalesce()     6.5901     6.9037     8.55971     7.21125     7.7157
     base_index()    13.0368    13.1790    15.73682    13.53310    19.3004
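The coalesce() version is both the simplest and by far the fastest here: predict() is vectorised, so predicting once for the whole data frame and patching only the missing entries avoids the per-row overhead that rowwise() incurs on 100,000 rows.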

Input Covariates From Data Frame With Regression Results into Stargazer

I have a regression output in the form of a dataset. How can I manually input the estimates and standard errors into stargazer, such that stargazer creates its typical regression table?
               term       estimate      std.error statistic      p.value
1               rho  0.56782511824 0.016618530837 34.168190 0.000000e+00
2       (Intercept) -4.10698330735 0.537699847356 -7.638059 2.198242e-14
4 Unemployment_Rate  0.02288489900 0.016412419393  1.394365 1.632075e-01
5         pop_sq_mi  0.00020135202 0.000045361286  4.438852 9.044016e-06
6           prcntHS  0.13303000437 0.006002571434 22.162169 0.000000e+00
7           prcntBA  0.03698563228 0.012723399878  2.906899 3.650316e-03
8        prcntBlack  0.00877367484 0.004458885465  1.967683 4.910448e-02
9        prcntMulti  0.01404154066 0.004182210799  3.357445 7.866653e-04
10        prcntHisp  0.04316697336 0.003523552546 12.250980 0.000000e+00
11 prcntForeignBorn  0.02229836451 0.009707563865  2.297009 2.161824e-02
12     medianIncome -0.00002809549 0.000002933667 -9.576917 0.000000e+00
13     per_gop_2016 -0.02366390363 0.002698813668 -8.768261 0.000000e+00
I have tried to use the following method (as an example) without much luck.
X1 <- sample(seq(1,100,1), 100,replace= T)
X2 <- sample(seq(1,100,1), 100,replace= T)
Y <- sample(seq(1,100,1), 100,replace= T)
df <- data.frame(Y, X1, X2)
Results <- lm(Y ~ X1 + X2, data = df)
library(broom)
Results_DF <- data.frame(tidy(Results))
library(stargazer)
stargazer(type = "text",
          coef = list(Results_DF$estimate, Results_DF$estimate),
          se = list(Results_DF$std.error, Results_DF$std.error),
          omit.table.layout = "s")
Error in if (substr(inside[i], 1, nchar("list(")) == "list(") { :
missing value where TRUE/FALSE needed
Any advice would be greatly appreciated. Thank You!
You are almost there.
Here is a reproducible example; it should be straightforward to modify it so that it works with your data. Be careful with the t and p values: check out the p.auto option in stargazer. Of course, you will need to manually change or delete the regression footer containing observations, the F-statistic, etc.
library(stargazer)
# coefficients data
d_lm <- data.frame(var = letters[1:4],
                   est = runif(4),
                   sd = runif(4),
                   t = runif(4),
                   p = runif(4))
# fake data
d <- data.frame(y = runif(30),
                a = runif(30),
                b = runif(30),
                c = runif(30),
                d = runif(30))
# fake regression (named fit to avoid shadowing the lm() function)
fit <- lm(y ~ a + b + c + d - 1, d)
stargazer(fit,
          coef = list(d_lm$est),
          se = list(d_lm$sd),
          t = list(d_lm$t), # if not supplied, stargazer will calculate t values for you
          p = list(d_lm$p), # if not supplied, stargazer will calculate p values for you
          type = "text")

Why is lm() not showing some output in R?

I was wondering why lm() says 5 coefs not defined because of singularities and then returns NA in the summary output for those 5 coefficients.
Note that all my predictors are categorical.
Is there anything wrong with my data on these 5 coefficients or code? How can I possibly fix this?
d <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/v.csv", h = T) # Data
nms <- c("Age","genre","Length","cf.training","error.type","cf.scope","cf.type","cf.revision")
d[nms] <- lapply(d[nms], as.factor) # make factor
vv <- lm(dint~Age+genre+Length+cf.training+error.type+cf.scope+cf.type+cf.revision, data = d)
summary(vv)
First 6 lines of output:
Coefficients: (5 not defined because of singularities)
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.17835    0.63573   0.281 0.779330
Age1         -0.04576    0.86803  -0.053 0.958010
Age2          0.46431    0.87686   0.530 0.596990
Age99        -1.64099    1.04830  -1.565 0.118949
genre2        1.57015    0.55699   2.819 0.005263 **
genre4             NA         NA      NA       NA ## for example, this row is all NAs; there are 4 more like it
As others noted, one problem is that you seem to have multicollinearity. Another is that there are missing values in your dataset. The missing values should probably just be removed. As for correlated variables, you should inspect your data to identify the collinearity and remove the offending variables. Deciding which variables to remove and which to retain is a very domain-specific topic. However, if you wish, you could use regularisation and fit a model while retaining all variables. This also allows you to fit a model when n (number of samples) is less than p (number of predictors).
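Before removing anything by hand, you can also ask lm() itself which terms are behind the NA estimates; alias() reports the exact linear dependencies among the model terms:

# which terms are linear combinations of the other terms?
alias(vv)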
I've shown code below that demonstrates how to examine the correlation structure within your data and to identify the most correlated variables (thanks to this answer). I've also included an example of fitting such a model using L2 regularisation (commonly known as ridge regression).
d <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/v.csv", h = T) # Data
nms <- c("Age","genre","Length","cf.training","error.type","cf.scope","cf.type","cf.revision")
d[nms] <- lapply(d[nms], as.factor) # make factor
vv <- lm(dint~Age+genre+Length+cf.training+error.type+cf.scope+cf.type+cf.revision, data = d)
df <- d
df[] <- lapply(df, as.numeric)
cor_mat <- cor(as.matrix(df), use = "complete.obs")
library("gplots")
heatmap.2(cor_mat, trace = "none")
## https://stackoverflow.com/questions/22282531/how-to-compute-correlations-between-all-columns-in-r-and-detect-highly-correlate
library("tibble")
library("dplyr")
library("tidyr")
d2 <- df %>%
  as.matrix() %>%
  cor(use = "complete.obs") %>%
  ## Set diag (a vs a) to NA, then remove
  (function(x) {
    diag(x) <- NA
    x
  }) %>%
  as.data.frame() %>%
  rownames_to_column(var = 'var1') %>%
  gather(var2, value, -var1) %>%
  filter(!is.na(value)) %>%
  ## Sort by decreasing absolute correlation
  arrange(-abs(value))
## 2 pairs of variables are almost exactly correlated!
head(d2)
#> var1 var2 value
#> 1 id study.name 0.9999430
#> 2 study.name id 0.9999430
#> 3 Location timed 0.9994082
#> 4 timed Location 0.9994082
#> 5 Age ed.level 0.7425026
#> 6 ed.level Age 0.7425026
## Remove some variables here, or maybe try regularized regression (see below)
library("glmnet")
## glmnet requires matrix input
X <- d[, c("Age", "genre", "Length", "cf.training", "error.type", "cf.scope", "cf.type", "cf.revision")]
X[] <- lapply(X, as.numeric)
X <- as.matrix(X)
ind_na <- apply(X, 1, function(row) any(is.na(row)))
X <- X[!ind_na, ]
y <- d[!ind_na, "dint"]
glmnet <- glmnet(
x = X,
y = y,
## alpha = 0 is ridge regression
alpha = 0)
plot(glmnet)
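In practice you would also want to tune the ridge penalty rather than read coefficients off the default path; a sketch using cross-validation (cv.glmnet() selects lambda for you):

cv_fit <- cv.glmnet(x = X, y = y, alpha = 0)
plot(cv_fit)
coef(cv_fit, s = "lambda.min")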
Created on 2019-11-08 by the reprex package (v0.3.0)
In such a situation you can use the "olsrr" package in R for stepwise regression analysis. Here is sample code for a stepwise regression analysis in R:
library("olsrr")
#Load the data
d <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/v.csv", h = T)
# stepwise regression
vv <- lm(dint ~ Age + genre + Length + cf.training + error.type + cf.scope + cf.type + cf.revision, data = d)
summary(vv)
k <- ols_step_both_p(vv, pent = 0.05, prem = 0.1)
# stepwise regression plot
plot(k)
# final model
k$model
It will provide you with essentially the same output as SPSS.

Predict function in R

I am trying to use the predict function to predict 100 new points. I have a data.frame with one vector that is 100 doubles long.
I am trying the predict function: predict(model, newdata=mydat)
The function only returns a vector of length four.
This could be due to the fact that the model was made only with four points, but I am unsure.
EDIT:
Creation of mydat
mydat <- data.frame(V1 = seq(0, max(myExperimentSummary$V1), length.out = 100))
The model I am using
model
#Nonlinear regression model
# model: mean ~ (1/(1 + exp(-b * (V1 - c))))
# data: myExperimentSummary
# b c
#-0.6721 3.2120
# residual sum-of-squares: 0.04395
#
#Number of iterations to convergence: 1
#Achieved convergence tolerance: 5.204e-06
EDIT2: Fixing the typos
EDIT3:
# nlsLM() comes from the minpack.lm package
library(minpack.lm)
fitcoef <- nlsLM(mean ~ (a/(1 + exp(-b*(V5 - c)))), data = myExperimentSummary,
                 start = c(a = 1, b = .1, c = 25))
fitmodel <- nls(mean ~ (1/(1 + exp(-b*(V1 - c)))), data = myExperimentSummary,
                start = coef(fitcoef))
mydat <- data.frame(V1 = seq(0, max(myExperimentSummary$V1), length.out = 100))
predict(fitmodel, mydat)
If your data are still as in your previous question:
dat <- read.table(text = " V1 N mean
0.1 9 0.9
1 9 0.8
10 9 0.1
5 9 0.2",
header = TRUE)
model <- nls(mean ~ -a/(1 + exp(-b * (V1-o))), data = dat,
start=list(a=-1.452, b=-0.451, o=1.292))
Then I can not reproduce your problem:
mydat <- data.frame(V1 = seq(0, max(dat$V1), length.out = 100))
y <- predict(model, mydat)
length(y)
# [1] 100
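As an aside, predict() on an nls fit quietly falls back to the fitted values (here of length 4) when newdata does not contain the predictor named in the formula, so a length-4 result usually signals a name mismatch. Note that your first fit (fitcoef) uses V5 while the second uses V1; make sure the column name in mydat matches the variable in the formula of the model you are predicting from.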
