chi square over multiple groups and variables

I have a huge dataset with several groups (factors with between 2 and 6 levels) and dichotomous variables (0, 1).
Example data:
DF <- data.frame(
group1 = sample(x = c("A","B","C","D"), size = 100, replace = T),
group2 = sample(x = c("red","blue","green"), size = 100, replace = T),
group3 = sample(x = c("tiny","small","big","huge"), size = 100, replace = T),
var1 = sample(x = 0:1, size = 100, replace = T),
var2 = sample(x = 0:1, size = 100, replace = T),
var3 = sample(x = 0:1, size = 100, replace = T),
var4 = sample(x = 0:1, size = 100, replace = T),
var5 = sample(x = 0:1, size = 100, replace = T))
I want to run a chi-square test for every group against every variable.
library(tidyverse)
library(rstatix)
chisq_test(DF$group1, DF$var1)
chisq_test(DF$group1, DF$var2)
chisq_test(DF$group1, DF$var3)
...
etc
I managed to make it work with two nested for loops, but I'm sure there is a better solution:
groups <- c("group1","group2","group3")
vars <- c("var1","var2","var3","var4","var5")
results <- data.frame()
for(i in groups){
for(j in vars){
test <- chisq_test(DF[,i], DF[,j])
test <- mutate(test, group=i, var=j)
results <- rbind(results, test)
}
}
results
I think I need some kind of apply function, but I can't figure it out.

Here is one way to do it with apply. I am sure there is an even more elegant way to do it with dplyr. (Note that here I extract the p-value of each test, but you can extract something else, or the whole test result, if you prefer.)
# columns 1:3 are the groups, columns 4:8 are var1 to var5
res <- apply(DF[,1:3], 2, function(x) {
  apply(DF[,4:8], 2,
        function(y) {chisq.test(x,y)$p.value})
})

Here's a quick and easy dplyr solution that transforms the data into long format keyed by group and var, then runs the chi-square test on each combination of group and var.
DF %>%
pivot_longer(starts_with("group"), names_to = "group", values_to = "group_val") %>%
pivot_longer(starts_with("var"), names_to = "var", values_to = "var_val") %>%
group_by(group, var) %>%
summarise(chisq_test(group_val, var_val)) %>%
ungroup()
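For comparison, here is a minimal base R sketch of the same idea: build every group/variable combination with expand.grid, then run one test per row. It uses stats::chisq.test rather than rstatix, works on a smaller made-up DF, and the column names statistic, df and p are my own choices, not anything from the answers above.

```r
set.seed(42)
DF <- data.frame(
  group1 = sample(c("A", "B", "C", "D"), 100, replace = TRUE),
  group2 = sample(c("red", "blue", "green"), 100, replace = TRUE),
  var1 = sample(0:1, 100, replace = TRUE),
  var2 = sample(0:1, 100, replace = TRUE)
)

# One row per group/variable combination
combos <- expand.grid(
  group = grep("^group", names(DF), value = TRUE),
  var = grep("^var", names(DF), value = TRUE),
  stringsAsFactors = FALSE
)

# Run one chi-square test per row of combos
tests <- Map(function(g, v) chisq.test(DF[[g]], DF[[v]]),
             combos$group, combos$var)

# Pull the pieces of interest out of each test object
combos$statistic <- vapply(tests, function(tst) unname(tst$statistic), numeric(1), USE.NAMES = FALSE)
combos$df <- vapply(tests, function(tst) unname(tst$parameter), numeric(1), USE.NAMES = FALSE)
combos$p <- vapply(tests, function(tst) tst$p.value, numeric(1), USE.NAMES = FALSE)
combos
```

The result is one row per test, which is the same shape the nested loop in the question builds up with rbind.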

Related

Create a matrix with lowest cell count for every pair of binary variables

I have a dataset with several binary variables (x1-x5, values: 1, 2, NA). My goal is to identify whether pairs of binary variables have zero or very low cell counts in their cross-tabulation (after ignoring the missing values). So I would like to compute the cross-tab for each pair of binary variables in my dataset, extract the lowest cell count from each table, and collect those values in a matrix. The result would resemble a correlation matrix, except that each entry holds the lowest cell count for that pair of variables instead of a correlation coefficient. Below is a toy dataset for anyone who would like to help.
library(tidyverse)
df <- data.frame(x1 = rbinom(n = 1000, size = 1, prob = 0.5),
x2 = rbinom(n = 1000, size = 1, prob = 0.3),
x3 = rbinom(n = 1000, size = 1, prob = 0.4),
x4 = rbinom(n = 1000, size = 1, prob = 0.2),
x5 = rbinom(n = 1000, size = 1, prob = 0.05)) |>
mutate(across(everything(), ~as.factor(.))) |>
mutate(across(everything(), ~recode(., "1" = "2", "0" = "1")))
df1 <- as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE) ]))

I think this is what you mean. It's inefficient (we should only compute one triangle) but short.
cfun <- function(i, j) {
min(table(df[[i]], df[[j]]))
}
outer(1:ncol(df), 1:ncol(df), Vectorize(cfun))
If you want to be more efficient:
n <- ncol(df)
m <- matrix(NA_integer_, n, n, dimnames = list(names(df), names(df)))
for (i in 1:(n-1)) {
for (j in (i+1):n) {
m[j,i] <- cfun(i,j)
}
}
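A small deterministic check may help here. The sketch below uses a made-up data frame d (not the question's data) and a helper I named min_cell. It relies on table() dropping NA values by default, and fills both triangles of the matrix while visiting each pair only once via combn.

```r
# Tiny made-up data with missing values
d <- data.frame(
  a = c(1, 1, 2, 2, NA, 1),
  b = c(1, 1, 2, 2, 1, NA),
  c = c(1, 2, 1, 2, 2, 1)
)

# table() ignores NA by default (useNA = "no"), so incomplete pairs are dropped
min_cell <- function(i, j) min(table(d[[i]], d[[j]]))

pairs <- combn(names(d), 2)  # each unordered pair once, one pair per column
m <- matrix(NA_integer_, ncol(d), ncol(d),
            dimnames = list(names(d), names(d)))
for (k in seq_len(ncol(pairs))) {
  i <- pairs[1, k]
  j <- pairs[2, k]
  m[i, j] <- m[j, i] <- min_cell(i, j)  # mirror the value into both triangles
}
m
```

Here a and b never disagree on the complete rows, so their entry is 0, while the other two pairs have a lowest cell count of 1. One caveat: with non-factor columns, a level that vanishes entirely after dropping NA values will not appear in the table at all, so storing the variables as factors (as the question does) is the safer choice.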

Someone (probably @dcsuka) suggested another solution but then deleted it from the answer section. Thankfully, I had already saved it in my script. After tweaking the code a tiny bit, it returned the correct results. So I am copying it here because, as Ben said, diversity is good.
df2 <- df1 %>%
colnames() %>%
combn(2) %>%
t() %>%
as_tibble(.name_repair = ~c("var1", "var2"))
df3 <- df2 %>%
rowwise() %>%
mutate(crosstab = list(as_tibble(table(select(df1, var1, var2)))),
value = min(list(select(crosstab, n))[[1]])) %>%
select(-crosstab) %>%
pivot_wider(names_from = var1, values_from = value)

I want a function in R to check whether values of one column are greater than the 75th percentile and then write yes or no in the next column

I have tried the following function, but it gives "No" for every row, even when I change the quantile value.
NOTE: I have 3 independent datasets to which I want to apply the function.
outlier<-function(x1,x2){
q1<-quantile(x1 , .75, na.rm = TRUE)
if(x1>q1){x2<-"Yes"
}else{
x2<-"No"
}
}
I have also tried x2<-ifelse(x1>q1,"Yes","No") inside the function, but it still doesn't work.

You can use an ifelse statement and create a new column using mutate.
library(dplyr)
set.seed(1)
df <- tibble(x1 = sample(c(1:10), size = 10, replace = T))
df %>%
mutate(x2 = ifelse(quantile(x1, 0.75, na.rm = T) < x1, "Yes", "No"))
If you want a function
library(dplyr)
set.seed(1)
df <- tibble(x1 = sample(c(1:10), size = 10, replace = T),
x2 = sample(c(1:10), size = 10, replace = T),
x3 = sample(c(1:10), size = 10, replace = T),
x4 = sample(c(1:10), size = 10, replace = T))
outlier <- function(dataframe, quant = 0.75, col = c("x1", "x2")){
  # flag values above the quant-th quantile of each selected column
  dataframe %>%
    mutate(across(all_of(col), ~ifelse(.x > quantile(.x, quant, na.rm = TRUE), 'Yes', 'No'),
                  .names = '{col}_yes'))
}
outlier(dataframe = df,quant = 0.25)
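As for why the function in the question fails: if() expects a single condition, not a whole vector (recent R versions raise an error when given a longer one), and the assignment to x2 only changes a local copy inside the function. A minimal base R sketch that avoids both problems is to return the vector of labels and assign it outside. The name flag_above_quantile and the two small datasets are made up for illustration.

```r
# Return "Yes"/"No" for each element of x, compared with its own quantile
flag_above_quantile <- function(x, prob = 0.75) {
  cutoff <- unname(quantile(x, prob, na.rm = TRUE))
  ifelse(x > cutoff, "Yes", "No")  # vectorised, unlike if ... else
}

# Apply the same function to several independent datasets
datasets <- list(
  a = data.frame(x1 = 1:10),
  b = data.frame(x1 = c(5, 50, 500, 5000))
)
datasets <- lapply(datasets, function(d) {
  d$x2 <- flag_above_quantile(d$x1)
  d
})
datasets$a
datasets$b
```

Because the cutoff is computed per dataset, each one is compared against its own 75th percentile.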

New column with random boolean values while controlling the ratio of TRUE/FALSE per category

In R I've got a dataset like this one:
df <- data.frame(
ID = c(1:30),
x1 = seq(0, 1, length.out = 30),
x2 = seq(100, 3000, length.out = 30),
category = gl(3, 10, labels = c("NEGATIVE", "NEUTRAL", "POSITIVE"))
)
Now I want to add a new column with randomized boolean values, but inside each category the proportion of TRUE and FALSE values should be the same (i.e. the randomizing process should generate the same count of true and false values, in the above data frame 5 TRUEs and 5 FALSEs in each of the 3 categories). How to do this?

You can sample a vector of TRUE and FALSE values without replacement, so that you get a randomized and balanced column in your data frame. Using logical values rather than the strings "TRUE" and "FALSE" keeps the column boolean.
sample(rep(c(TRUE,FALSE),each=5),10,replace=FALSE)

Based on Yacine Hajji's answer:
addRandomBool <- function(df, p){
  n <- ceiling(nrow(df) * p)  # number of TRUE values in this group
  df$bool <- sample(rep(c(TRUE, FALSE), times = c(n, nrow(df) - n)))
  df
}
Reduce(rbind, lapply(split(df, df$category), addRandomBool, p = 0.5))
where the parameter p determines the proportion of TRUE.

This will sample within each group from a vector of 5 TRUE and 5 FALSE without replacement. It will assume that there are always 10 records per group.
library(dplyr)
library(tidyr)
df <- data.frame(
ID = c(1:30),
x1 = seq(0, 1, length.out = 30),
x2 = seq(100, 3000, length.out = 30),
category = gl(3, 10, labels = c("NEGATIVE", "NEUTRAL", "POSITIVE"))
)
set.seed(pi)
df %>%
group_by(category) %>%
nest() %>%
mutate(data = lapply(data,
function(df){ # Function to sample and assign the new_col
df$new_col <- sample(rep(c(FALSE, TRUE),
each = 5),
size = 10,
replace = FALSE)
df
})) %>%
unnest(cols = "data")
This next example is a little more generalized, but still assumes an (approximately) even distribution of TRUE and FALSE within a group. It can accommodate variable group sizes, and even groups with odd numbers of records (in which case the extra record is FALSE, because rep(c(FALSE, TRUE), length.out = n) starts with FALSE).
library(dplyr)
library(tidyr)
df <- data.frame(
ID = c(1:30),
x1 = seq(0, 1, length.out = 30),
x2 = seq(100, 3000, length.out = 30),
category = gl(3, 10, labels = c("NEGATIVE", "NEUTRAL", "POSITIVE"))
)
set.seed(pi)
df %>%
group_by(category) %>%
nest() %>%
mutate(data = lapply(data,
function(df){
df$new_col <- sample(rep(c(FALSE, TRUE),
length.out = nrow(df)),
size = nrow(df),
replace = FALSE)
df
})) %>%
unnest(cols = "data")
Maintaining Column Order
A couple of options to maintain the column order:
First, you can save the column order before you do your group_by - nest, and then use select to set the order when you're done.
set.seed(pi)
orig_col <- names(df) # original column order
df %>%
group_by(category) %>%
nest() %>%
mutate(data = lapply(data,
function(df){
df$new_col <- sample(rep(c(FALSE, TRUE),
length.out = nrow(df)),
size = nrow(df),
replace = FALSE)
df
})) %>%
unnest(cols = "data") %>%
select(all_of(c(orig_col, "new_col"))) # Restore the column order
Or you can use a base R solution that doesn't change the column order in the first place
df <- split(df, df["category"])
df <- lapply(df,
function(df){
df$new_col <- sample(rep(c(FALSE, TRUE),
length.out = nrow(df)),
size = nrow(df),
replace = FALSE)
df
})
do.call("rbind", c(df, list(make.row.names = FALSE)))
There are likely a dozen other ways to do this, and probably more efficient ways that I'm not thinking of.
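One such alternative, sketched here under the same assumption of an (approximately) even split, is base R's ave(), which applies a function within each group and returns the results in the original row order. The column name flag is my own. Since ave() writes the results back into the numeric index vector, the values come back as 0/1 and are converted with as.logical().

```r
df <- data.frame(
  ID = 1:30,
  category = gl(3, 10, labels = c("NEGATIVE", "NEUTRAL", "POSITIVE"))
)

set.seed(1)
df$flag <- as.logical(ave(
  seq_len(nrow(df)),  # dummy values; only the group sizes matter
  df$category,
  FUN = function(idx) sample(rep(c(TRUE, FALSE), length.out = length(idx)))
))

# Check the balance within each category
table(df$category, df$flag)
```

This keeps both the row order and the column order, so nothing has to be restored afterwards.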

How can I create a function to create a plot with echarts4r?

Hello everyone. I would like to know whether it is possible to write a function that simplifies creating a chart with echarts4r in R. I'm trying, but I get the error Error: Can't subset columns that don't exist. Does anyone know how I can fix it? The code I'm using is the following:
library(echarts4r)
graf_func <- function(dataframe, vary, varx){
grafico <- base |>
e_charts(vary) |>
e_bar(varx) |>
e_tooltip(trigger = "axis")
return(grafico)
}
df <- data.frame(
var1 = runif(10, min = 100, max = 200),
var2 = runif(10, min = 10, max = 200)
)
graf_func(dataframe = df, vary = var1, varx = var2)

Use the functions e_charts_ and e_bar_ and pass the column names as character strings. Note also that the function should pipe from its dataframe argument rather than from base.
library(echarts4r)
graf_func <- function(dataframe, vary, varx){
grafico <- dataframe |>
e_charts_(vary) |>
e_bar_(varx) |>
e_tooltip(trigger = "axis")
return(grafico)
}
df <- data.frame(
var1 = runif(10, min = 100, max = 200),
var2 = runif(10, min = 10, max = 200)
)
graf_func(dataframe = df, vary = "var1", varx = "var2")

How to plot sjPlots from a nested tibble?

I create some models like this using a nested tidyr dataframe:
set.seed(1)
library(tidyr)
library(dplyr)
library(sjPlot)
library(tibble)
library(purrr)
fits <- tribble(~group, ~colA, ~colB, ~colC,
sample(c("group1", "group2"), 10, replace = T), 0, sample(10, replace = T), sample(10, replace = T),
sample(c("group1", "group2"), 10, replace = T), 1, sample(10, replace = T), sample(10, replace = T)) %>%
unnest(cols = c(colB, colC)) %>%
nest(data=-group) %>%
mutate(fit= map(data, ~glm(formula = colA ~ colB + colC, data = .x, family="binomial"))) %>%
dplyr::select(group, fit) %>%
tibble::column_to_rownames("group")
I would like to use this data to create some quick marginal effects plots with sjPlot::plot_models like this
plot_models(as.list(fits), type = "pred", terms = c("colB", "colA", "colC"))
Unfortunately, I get the error
Error in if (fam.info$is_linear) tf <- NULL else tf <- "exp" :
argument is of length zero
In addition: Warning message:
Could not access model information.
I've played around a bit with the nesting of the data but I've been unable to get it into a format that sjPlot::plot_models will accept.
What I was expecting to get is a "Forest plot of multiple regression models" as described in the help file. Ultimately, the goal is to plot the marginal effects of regression models by group, which I was hoping the plot_models will do (please correct me if I'm wrong).

I think there are some issues with the original code as well as with the data. The function call uses arguments from plot_model that plot_models does not support. I first show an example of how plot_models can be called with a nested tibble, using {ggplot2}'s diamonds data set. Then I apply this approach to the OP's sample data, which doesn't yield usable results*. Finally, I create some new toy data to show how the approach could be applied to a binomial model.
(* In the original toy data the dependent variable is either always 0 or always 1 in each model, so this is unlikely to yield usable results.)
set.seed(1)
library(tidyr)
library(dplyr)
library(sjPlot)
library(tibble)
library(ggplot2)
# general example
fits <- tibble(id = c("x", "y", "z")) %>%
rowwise() %>%
mutate(fit = list(glm(reformulate(
termlabels = c("cut", "color", "depth", "table", "price", id),
response = "carat"),
data = diamonds)))
plot_models(fits$fit)
# OP's example data
fits2 <- tribble(~group, ~colA, ~colB, ~colC,
sample(c("group1", "group2"), 10, replace = T), 0,
sample(10, replace = T), sample(10, replace = T),
sample(c("group1", "group2"), 10, replace = T), 1,
sample(10, replace = T),
sample(10, replace = T)) %>%
unnest(cols = c(colB, colC)) %>%
nest(data = -group) %>%
rowwise() %>%
mutate(fit = list(glm(formula = colA ~ colB + colC, data = data, family="binomial")))
plot_models(fits2$fit)
#> Warning: Transformation introduced infinite values in continuous y-axis
#> Warning: Removed 4 rows containing missing values (geom_point).
# new data for binomial model
n <- 500
g <- round(runif(n, 0L, 1L), 0)
x1 <- runif(n,0,100)
x2 <- runif(n,0,100)
y <- (x2 - x1 + rnorm(n,sd=20)) < 0
fits3 <- tibble(g, y, x1, x2) %>%
nest_by(g) %>%
mutate(fit = list(glm(formula = y ~ x1 + x2, data = data, family="binomial")))
plot_models(fits3$fit)
Created on 2021-01-23 by the reprex package (v0.3.0)