How to first group_by() then iterate lm() through columns? - r

Let's say we have a data frame with a set of 3 dependent variables and 6 independent variables tagged by a grouping variable. An example of this format is generated with the sample code below:
library(tidyverse)
library(broom)
n <- 15
df <- data.frame(groupingvar= sample(letters[1:2], size = n, replace = TRUE),
y1 = rnorm(n,10,1), y2=rnorm(n,100,10), y3=rnorm(n,1000,100),
x1= rnorm(n,10,1), x2=rnorm(n,10,1), x3=rnorm(n,10,1),
x4=rnorm(n,10,1), x5=rnorm(n,10,1), x6=rnorm(n,10,1))
df <- arrange(df,groupingvar)
If I wanted to regress each of the y1, y2, y3 on the set of x1 through x6 I could use something along the lines of:
y <- as.matrix(select(df,y1:y3))
x <- as.matrix(select(df,x1:x6))
regs <-lm(y~x)
coeffs <- tidy(regs)
coeffs <- arrange(coeffs,response, term)
(by making use of the following line from the lm() help: "If response is a matrix, a linear model is fitted separately by least-squares to each column of the matrix.")
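The quoted behaviour can be checked directly. The sketch below is only an illustration on made-up data (the names d, fit_all and fit_y1 are hypothetical): it fits two responses in one call and compares one coefficient column with a separate single-response fit.

```r
set.seed(1)
n <- 30
d <- data.frame(y1 = rnorm(n), y2 = rnorm(n), x1 = rnorm(n), x2 = rnorm(n))

# One call with a matrix response: one coefficient column per response
fit_all <- lm(cbind(y1, y2) ~ x1 + x2, data = d)

# Separate fit for a single response
fit_y1 <- lm(y1 ~ x1 + x2, data = d)

# Column "y1" of the joint fit should match the separate fit
all.equal(coef(fit_all)[, "y1"], coef(fit_y1))
```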
However, if I need to first group by the grouping variable and then apply the lm function then I'm not quite sure how to do it. I have tried the following, but it produces the same set of coefficients for both groups.
regs2 <- df %>% group_by(groupingvar) %>%
do(fit2 = lm(as.matrix(select(df,y1:y3)) ~ as.matrix(select(df,x1:x6))))
coeffs2 <- tidy(regs2,fit2)
coeffs2 <- arrange(coeffs2,groupingvar, response)

In data.table, you could melt the data (reshape it to long format, stacking the three outcome variables in one column instead of storing them in three) and then run lm by both groupingvar and the outcome variable:
library(data.table)
setDT(df)
#alternatively, set id.vars = c('groupingvar', paste0('x', 1:6)), etc.
longDT = melt(df, id.vars = grep('y', names(df), invert = TRUE))
#this helper function basically splits a named vector into
# its two components
coefsplit = function(reg) {
beta = coef(reg)
list(var = names(beta), coef = beta)
}
#I personally wouldn't assign longDT, I'd just chain this onto
# the output of melt;
longDT[ , coefsplit(lm(value ~ ., data = .SD)), by = .(groupingvar, variable)]
# groupingvar variable var coef
# 1: a y1 (Intercept) -3.595564e+03
# 2: a y1 x1 -3.796627e+01
# 3: a y1 x2 -1.557268e+02
# 4: a y1 x3 2.862738e+02
# 5: a y1 x4 1.579548e+02
# ...
# 38: b y3 x2 2.136253e+01
# 39: b y3 x3 -3.810176e+01
# 40: b y3 x4 4.187719e+01
# 41: b y3 x5 -2.586184e+02
# 42: b y3 x6 1.181879e+02
# groupingvar variable var coef
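For readers without data.table, the same long-format idea can be sketched in base R. This is an illustration under assumptions, not the answerer's code: the data are simulated with only two predictors, and the names ys, xs, long, cells and coefs are hypothetical.

```r
set.seed(42)
n <- 60
df <- data.frame(groupingvar = rep(c("a", "b"), each = n / 2),
                 y1 = rnorm(n), y2 = rnorm(n), y3 = rnorm(n),
                 x1 = rnorm(n), x2 = rnorm(n))

ys <- c("y1", "y2", "y3")
xs <- setdiff(names(df), c("groupingvar", ys))

# Stack the outcomes: one row per (observation, outcome) pair
long <- do.call(rbind, lapply(ys, function(v) {
  data.frame(df[c("groupingvar", xs)], variable = v, value = df[[v]])
}))

# Fit one model per (group, outcome) cell
cells <- split(long, list(long$groupingvar, long$variable))
coefs <- t(sapply(cells, function(d) coef(lm(value ~ ., data = d[c("value", xs)]))))
coefs
```

Each row of coefs corresponds to one group and outcome combination (for example a.y1), which mirrors the by = .(groupingvar, variable) grouping above.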

I also found a way to achieve this using cbind() as follows:
library(tidyverse)
library(broom)
n <- 20
df4 <- data.frame(groupingvar= sample(1:2, size = n, replace = TRUE),
y1 = rnorm(n,10,1), y2=rnorm(n,100,10), y3=rnorm(n,1000,100),
x1= rnorm(n,10,1), x2=rnorm(n,10,1), x3=rnorm(n,10,1),
x4=rnorm(n,10,1), x5=rnorm(n,10,1), x6=rnorm(n,10,1))
df4 <- arrange(df4,groupingvar)
regs <- df4 %>% group_by(groupingvar) %>%
do(fit = lm(cbind(y1,y2,y3) ~ . -groupingvar, data = .))
coeffs <- tidy(regs, fit)
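Note that in the attempt from the question, select(df, ...) inside do() refers to the full data frame rather than the current group (.), which would explain the identical coefficients for both groups. The per-group idea can also be sketched in base R with split() and lapply(); this is a minimal illustration on simulated data with only three predictors, and the names fits and coefs are hypothetical.

```r
set.seed(7)
n <- 60
df <- data.frame(groupingvar = rep(c("a", "b"), each = n / 2),
                 y1 = rnorm(n), y2 = rnorm(n), y3 = rnorm(n),
                 x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))

# Split by group, then fit all three responses at once within each piece
fits <- lapply(split(df, df$groupingvar), function(d) {
  lm(cbind(y1, y2, y3) ~ x1 + x2 + x3, data = d)
})

# One coefficient matrix (terms in rows, responses in columns) per group
coefs <- lapply(fits, coef)
coefs$a
coefs$b
```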

Related

How to add coefficients to existing data base such that their effect on the final intercept is given?

Firstly, let's say I have a data frame df with variables y, x1, and x2, where x1 is a continuous variable and x2 is a factor.
Let's say I have a model:
model <- glm(y ~ x1 + x2, data = df, family = binomial)
This will result in an object where I can extract the coefficients using the command model$coefficients.
However, for use in another program I would like to export the data frame df, but I'd also like to be able to display the results of the model beyond simply adding the fitted values to the data frame.
Therefore I would like to have coeff1*x1 and coeff2*x2 also in the same dataframe, so that I could use these and the original data together to display their effects. The problem arises from the fact that one of the variables is a multi-level factor and therefore it's not preferable to simply use a for-loop to extract the coefficients and multiply the variables with them.
Is there another way to add two new variables to the dataframe df such that they've been derived from combining the original variables x1, x2 and their respective coefficients?
Try:
set.seed(123)
N <- 10
df <- data.frame(x1 = rnorm(N, 10, 1),
x2 = sample(1:3, N, TRUE),
y = as.integer(50 - x2* 0.4 + x1 * 1.2 + rnorm(N, 0, 0.5) > 52))
model <- glm(y ~ x1 + x2, data = df, family = binomial)
# add column for intercept
df <- cbind(x0 = rep(1, N), df)
df$intercept <- df$x0 * model$coefficients["(Intercept)"]
df[["coeff1*x1"]] <- df$x1 * model$coefficients["x1"]
df[["coeff2*x2"]] <- df$x2 * model$coefficients["x2"]
# x0 x1 x2 y intercept coeff1*x1 coeff2*x2
# 1 1 9.439524 1 1 24.56607 -3.361333e-06 -4.281056e-07
# 2 1 9.769823 1 1 24.56607 -3.478949e-06 -4.281056e-07
# 3 1 11.558708 1 1 24.56607 -4.115956e-06 -4.281056e-07
Alternatively, starting again from the original df and model (before the columns above were added):
# add column for intercept
df <- cbind(x0 = rep(1, N), df)
tmp <- as.data.frame(Map(function(x, y) x * y, subset(df, select = -y), model$coefficients))
names(tmp) <- paste0("coeff*", names(model$coefficients))
cbind(df, tmp)
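When x2 is a genuine multi-level factor, as in the question, one possible approach is to multiply the model's design matrix by the coefficients column by column. The sketch below is an assumption-laden illustration rather than part of the answer above: the data are simulated, the level names are made up, and the names X, contrib and out are hypothetical.

```r
set.seed(123)
N <- 200
df <- data.frame(x1 = rnorm(N, 10, 1),
                 x2 = factor(sample(c("low", "mid", "high"), N, TRUE)))
df$y <- rbinom(N, 1, plogis(-5 + 0.5 * df$x1 + (df$x2 == "high")))

model <- glm(y ~ x1 + x2, data = df, family = binomial)

# Design matrix: intercept, x1, and one dummy column per non-reference level of x2
X <- model.matrix(model)

# Multiply each column by its coefficient
contrib <- sweep(X, 2, coef(model), `*`)
colnames(contrib) <- paste0("coef*", colnames(X))

# The contributions sum to the linear predictor (log-odds scale)
all.equal(unname(rowSums(contrib)), unname(predict(model, type = "link")))

out <- cbind(df, contrib)
head(out)
```

Because the factor is expanded into dummy columns, each row has a non-zero contribution only in the column for its own level (the reference level contributes through the intercept).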

How to nest tables in a column of a dataframe?

I read that it is possible to store dataframes in a column of a dataframe with nest:
https://tidyr.tidyverse.org/reference/nest.html
Is it also possible to store tables in a column of a dataframe?
The reason is that I would like to calculate the Kappa for every subgroup of a dataframe with caret. However, caret::confusionMatrix(t) expects a table as input.
In the example-code below this works fine if I calculate the Kappa for the complete dataframe at once:
library(tidyverse)
library(caret)
# generate some sample data:
n <- 100L
x1 <- rnorm(n, 1.0, 2.0)
x2 <- rnorm(n, -1.0, 0.5)
y <- rbinom(n, 1L, plogis(1 * x1 + 1 * x2))
my_factor <- rep( c('A','B','C','D'), 25 )
df <- cbind(x1, x2, y, my_factor)
# fit a model and make predictions:
mod <- glm(y ~ x1 + x2, "binomial")
probs <- predict(mod, type = "response")
# confusion matrix
probs_round <- round(probs)
t <- table(factor(probs_round, c(1,0)), factor(y, c(1,0)))
ccm <- caret::confusionMatrix(t)
# extract Kappa:
ccm$overall[2]
> Kappa
> 0.5232
However, when I use group_by to generate the Kappa for every factor level as a subgroup (see code below), it does not succeed. I suppose I need to nest t in df in some way, but I don't know how:
# extract Kappa for every subgroup with same factor (NOT WORKING CODE):
df <- cbind(df, probs_round)
df <- as.data.frame(df)
output <- df %>%
dplyr::group_by(my_factor) %>%
dplyr::mutate(t = table(factor(probs_round, c(1,0)), factor(y, c(1,0)))) %>%
summarise(caret::confusionMatrix(t))
Expected output:
>my_factor Kappa
>1 A 0.51
>2 B 0.52
>3 C 0.53
>4 D 0.54
Is this correct and is this possible?
(the exact values for Kappa will be different due to the randomness in the sample data)
Thanks a lot!
You could skip the intermediate mutate() that's giving you trouble and compute the Kappa directly inside summarize():
library(dplyr)
library(caret)
df %>%
group_by(my_factor) %>%
summarize(t = confusionMatrix(table(factor(probs_round, c(1,0)),
factor(y, c(1,0))))$overall[2])
Returns:
# A tibble: 4 x 2
my_factor t
<chr> <dbl>
1 A 0.270
2 B 0.513
3 C 0.839
4 D 0.555
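To see what is being computed per group, Kappa can also be worked out by hand from each group's confusion table. The sketch below assumes the Kappa reported in confusionMatrix()$overall is the unweighted Cohen's kappa; the helper cohen_kappa and the simulated data frame d (with a hypothetical pred column) are made up for illustration.

```r
# Unweighted Cohen's kappa from a square confusion table
cohen_kappa <- function(tab) {
  n  <- sum(tab)
  po <- sum(diag(tab)) / n                       # observed agreement
  pe <- sum(rowSums(tab) * colSums(tab)) / n^2   # agreement expected by chance
  (po - pe) / (1 - pe)
}

set.seed(1)
n <- 400
d <- data.frame(my_factor = rep(c("A", "B", "C", "D"), n / 4),
                y = rbinom(n, 1, 0.5))
# Hypothetical predictions that agree with y about 80% of the time
d$pred <- ifelse(runif(n) < 0.8, d$y, 1 - d$y)

# One table, and therefore one Kappa, per subgroup
kappas <- vapply(split(d, d$my_factor), function(g) {
  cohen_kappa(table(factor(g$pred, c(1, 0)), factor(g$y, c(1, 0))))
}, numeric(1))
kappas
```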
The above approach is the easiest way to get the desired results. But just to show what's possible, we can use your approach with dplyr::nest_by(), which nests each group's data into a list-column and returns a row-wise data frame.
In the approach below we calculate a separate glm for each subgroup. I'm not sure if that's what you want to do.
library(tidyverse)
library(caret)
# generate some sample data:
n <- 1000L
df <- tibble(x1 = rnorm(n, 1.0, 2.0),
x2 = rnorm(n, -1.0, 0.5),
y = rbinom(n, 1L, plogis(x1 + 1 * x1 + 1 * x2)),
my_factor = rep( c('A','B','C','D'), 250))
output <- df %>%
nest_by(my_factor) %>%
mutate(y = list(data$y),
mod = list(glm(y ~ x1 + x2,
family = "binomial",
data = data)),
probs = list(predict(mod, type = "response")),
probs_round = list(round(probs)),
t = list(table(factor(probs_round, c(1, 0)),
factor(y, c(1, 0)))),
ccm = caret::confusionMatrix(t)$overall[2])
output %>%
pull(ccm)
#> Kappa Kappa Kappa Kappa
#> 0.7743682 0.7078112 0.7157761 0.7549340
Created on 2021-06-23 by the reprex package (v0.3.0)

Having a function `t.test()` take multiple variables inputted to it in R

I was wondering if there is a way to avoid using t.test() 3 times for comparing 3 variables x1, x2, and x3 and instead using t.test() one time to take any two variables at a time inputted to it?
For example, for: x1 = rnorm(20) ; x2 = rnorm(20) ; x3 = rnorm(20), I'm now using: t.test(x1, x2) ; t.test(x1, x3) ; t.test(x2, x3) but could I just use t.test() one time?
Here is what I tried with no success:
t.test(cbind(x1, x2, x3))
Similar to your question on cor just now, here is the syntax for handling pairwise calculations:
set.seed(21L)
x1 <- rnorm(20); x2 <- rnorm(20); x3 <- rnorm(20)
pcor <- function(...) {
combn(list(...),
2,
function(y) cor(y[[1]], y[[2]]),
simplify=FALSE)
}
pcor(x1, x2, x3)
pttest <- function(...) {
combn(list(...),
2,
function(a) t.test(x=a[[1]], y=a[[2]]), #change this to whatever you want
simplify=FALSE)
}
pttest(x1, x2, x3)
We can use pairwise.t.test
library(dplyr)
library(magrittr)
data(airquality)
airquality %>%
mutate(Month = factor(Month, labels = month.abb[5:9])) %>%
summarise(pval = list(pairwise.t.test(Ozone, Month, p.adj = "bonf")$p.value)) %>%
pull(pval) %>%
extract2(1)
# May Jun Jul Aug
#Jun 1.0000000000 NA NA NA
#Jul 0.0002931151 0.10225483 NA NA
#Aug 0.0001949061 0.08312222 1.000000000 NA
#Sep 1.0000000000 1.00000000 0.006969712 0.004847635
Using the OP's example
pairwise.t.test(c(x1, x2, x3), rep(paste0("x", 1:3), each = 20), p.adj = "bonf")
# Pairwise comparisons using t tests with pooled SD
#data: c(x1, x2, x3) and rep(paste0("x", 1:3), each = 20)
# x1 x2
# x2 0.486 -
# x3 1.000 0.095
data
set.seed(24)
x1 <- rnorm(20)
x2 <- rnorm(20)
x3 <- rnorm(20)
If you want to pick two of the variables at random, try this:
s = sample(x = c("x1","x2","x3"),size = 2,replace = F)
t.test(eval(parse(text=s[1])),eval(parse(text=s[2])))
When running pairwise t-tests, the alpha level must be adjusted for multiple comparisons. The Bonferroni correction is often used in agriculture, Holm sometimes in medicine. Without such a correction, you get more significant differences than you should.
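As a rough illustration of the adjustment, the sketch below runs every pairwise t.test() and then corrects the raw p-values with p.adjust(). It uses t.test()'s default Welch test, so the numbers will not match the pooled-SD output of pairwise.t.test() shown earlier; the names x, pairs and raw_p are hypothetical.

```r
set.seed(24)
x <- list(x1 = rnorm(20), x2 = rnorm(20), x3 = rnorm(20))

# All unordered pairs of variable names
pairs <- combn(names(x), 2, simplify = FALSE)

# Welch t-test p-value for each pair (the t.test() default)
raw_p <- vapply(pairs, function(p) t.test(x[[p[1]]], x[[p[2]]])$p.value, numeric(1))
names(raw_p) <- vapply(pairs, paste, character(1), collapse = " vs ")

# Raw p-values next to two common corrections
data.frame(raw = raw_p,
           bonferroni = p.adjust(raw_p, method = "bonferroni"),
           holm = p.adjust(raw_p, method = "holm"))
```

Holm-adjusted p-values are never larger than the Bonferroni-adjusted ones, so Holm is the less conservative of the two.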

how to put a threshold for Step package?

Thanks to this post regarding the failure of stepwise variable selection in lm.
I have data that looks, for example, like the data described in that post:
set.seed(1) # for reproducible example
x <- sample(1:500,500) # need this so predictors are not perfectly correlated.
x <- matrix(x,nc=5) # 100 rows, 5 cols
y <- 1+ 3*x[,1]+2*x[,2]+4*x[,5]+rnorm(100) # y depends on variables 1, 2, 5 only
# you start here...
df <- data.frame(y,as.matrix(x))
full.model <- lm(y ~ ., df) # include all predictors
step(full.model,direction="backward")
What I need is to select only the 5 best variables, and then the 6 best variables, out of these 20. Does anyone know how to impose this constraint?
MuMIn::dredge() has an option, m.lim, that limits the number of terms in each model. Note: the number of combinations, and with it the time required, grows exponentially with the number of predictors.
set.seed(1) # for reproducible example
x <- sample(100*20)
x <- matrix(x, nc = 20) # 20 predictor
y <- 1 + 2*x[,1] + 3*x[,2] + 4*x[,3] + 5*x[,7] + 6*x[,8] + 7*x[,9] + rnorm(100) # y depends on variables 1,2,3,7,8,9 only
df <- data.frame(y, as.matrix(x))
full.model <- lm(y ~ ., df) # include all predictors
library(MuMIn)
options(na.action = "na.fail") # dredge() refuses to run on a global model fitted with the default na.omit
# trace = 2: a progress bar is displayed
dredge(full.model, m.lim = c(5, 5), trace = 2) # result: x2, x3, x7, x8, x9
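To get a feel for the cost mentioned in the note, the number of candidate models can be counted with base R before running anything. This is plain arithmetic, not a MuMIn call; the names p and sizes are hypothetical.

```r
p <- 20                      # number of candidate predictors

# Number of models with exactly 5 and exactly 6 predictors
sizes <- c(k5 = choose(p, 5), k6 = choose(p, 6))
sizes
#    k5    k6
# 15504 38760

# All subsets, which is what an unrestricted search would have to visit
2^p
# 1048576
```

Restricting the search to a fixed size, as m.lim = c(5, 5) does above, therefore fits far fewer models than an unrestricted search over all subsets.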

Why predict multinom() gives the same probabilities when I give it different data frames?

I have 6 classes of outcome variable and 14 predictor variables. I built the model below:
fit <- multinom(y ~ X1 + X2 + as.factor(X3) + ... + X14, data= Original)
And I want to predict probabilities of each class of outcome for a given new data point.
X1 <- 1.6
X2 <- 4
X3 <- 15
.
.
.
X14 <- 8
dfin <- data.frame( ses = c(100, 200, 300), X1, X2, X3, ..., X14)
Then I run predict:
predict(fit, todaydata = dfin, type = "probs")
The outcome looks like:
#class1 #class2 #class3 #class4 #class5 #class6
#5541 7.226948e-01 1.498199e-01 8.086624e-02 1.253289e-02 8.799416e-03 2.528670e-02
#5546 6.034188e-01 7.386553e-02 1.908132e-01 1.229962e-01 4.716406e-04 8.434623e-03
#5548 7.266859e-01 1.278779e-01 1.001634e-01 2.032530e-02 7.156766e-03 1.779076e-02
#5562 7.120179e-01 1.471181e-01 9.146071e-02 1.265592e-02 8.189511e-03 2.855781e-02
#5666 6.645056e-01 3.034978e-02 1.687687e-01 1.219601e-01 3.972833e-03 1.044308e-02
#5668 4.875966e-01 3.126855e-02 2.090006e-01 2.430828e-01 3.721631e-03 2.532970e-02
#5670 3.900772e-01 1.305786e-02 1.803779e-01 4.137106e-01 1.314298e-03 1.462155e-03
#5671 4.272971e-01 1.194599e-02 1.748494e-01 3.833422e-01 8.863019e-04 1.678975e-03
#5674 5.477521e-01 2.587478e-02 1.650817e-01 2.487404e-01 3.368726e-03 9.182195e-03
#5677 4.300207e-01 9.532836e-03 1.608679e-01 3.946310e-01 2.626104e-03 2.321351e-03
#5678 4.542981e-01 1.220728e-02 1.410984e-01 3.885146e-01 2.670689e-03 1.210891e-03
#...
Then I change values of new data point by running the lines below:
X1 <- 2.7
X2 <- 5.1
X3 <- 28
.
.
.
X14 <- 2
dfin2 <- data.frame( ses = c(100, 200, 300), X1, X2, X3, ..., X14)
predict(fit, todaydata = dfin2, type = "probs")
again I got exactly the same probabilities.
#class1 #class2 #class3 #class4 #class5 #class6
#5541 7.226948e-01 1.498199e-01 8.086624e-02 1.253289e-02 8.799416e-03 2.528670e-02
#5546 6.034188e-01 7.386553e-02 1.908132e-01 1.229962e-01 4.716406e-04 8.434623e-03
#5548 7.266859e-01 1.278779e-01 1.001634e-01 2.032530e-02 7.156766e-03 1.779076e-02
#5562 7.120179e-01 1.471181e-01 9.146071e-02 1.265592e-02 8.189511e-03 2.855781e-02
#5666 6.645056e-01 3.034978e-02 1.687687e-01 1.219601e-01 3.972833e-03 1.044308e-02
#5668 4.875966e-01 3.126855e-02 2.090006e-01 2.430828e-01 3.721631e-03 2.532970e-02
#5670 3.900772e-01 1.305786e-02 1.803779e-01 4.137106e-01 1.314298e-03 1.462155e-03
#5671 4.272971e-01 1.194599e-02 1.748494e-01 3.833422e-01 8.863019e-04 1.678975e-03
#5674 5.477521e-01 2.587478e-02 1.650817e-01 2.487404e-01 3.368726e-03 9.182195e-03
#5677 4.300207e-01 9.532836e-03 1.608679e-01 3.946310e-01 2.626104e-03 2.321351e-03
#5678 4.542981e-01 1.220728e-02 1.410984e-01 3.885146e-01 2.670689e-03 1.210891e-03
#...
What am I doing wrong that causes the same outcome for the two different data frames dfin and dfin2?
My second question: why do I get so many rows of output for a single data point?
Thanks a lot for your time!
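One likely cause, offered as a hedged sketch rather than a confirmed diagnosis: the predict() method for multinom fits takes its new data through the argument newdata, and an unrecognised argument name such as todaydata is silently absorbed by `...`. In that case predict() returns the fitted probabilities for the training data, which would explain both the identical output and the many rows. The example below uses the built-in iris data as a stand-in for Original, and the names new_point, p_wrong and p_right are hypothetical.

```r
library(nnet)

# Stand-in model: three outcome classes, two predictors
fit <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris, trace = FALSE)

new_point <- data.frame(Sepal.Length = 5.0, Sepal.Width = 3.4)

# Misspelled argument: ignored, so the fitted values for all training rows come back
p_wrong <- predict(fit, todaydata = new_point, type = "probs")
NROW(p_wrong)   # one row per training observation (150 for iris)

# Correct argument name: probabilities for the new data point only
p_right <- predict(fit, newdata = new_point, type = "probs")
p_right
```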
