How can I use the count function under certain conditions in R? - r

I have a set of sample data such as the following:
tableData <- tibble(Fruits = sample(c('Apple', 'Banana', 'Orange'), 30, T),
Ripeness = sample(c('yes', 'no'), 30, T),
Mean = ifelse(Ripeness == 'yes', 1.4 + runif(30), 1.6 + runif(30))) %>%
add_row(Fruits = "Peach", Ripeness = "yes", Mean = 5)
I have a function that summarizes for p-value calculation and a mean difference calculation.
tableData %>%
group_by(Fruits) %>%
summarise(Meandiff = mean(Mean[Ripeness == 'yes'])-
mean(Mean[Ripeness == 'no']),
t_test_pval = get_t_test_pval(Mean ~ Ripeness))
Using the summarise function, is it also possible to add another column that counts the number of observations for each fruit if the fruit has a ripeness of "yes" (ie count apple observations with yes ripeness)?

How about this:
set.seed(2)
tableData <- tibble(Fruits = sample(c('Apple', 'Banana', 'Orange'), 30, T),
Ripeness = sample(c('yes', 'no'), 30, T),
Mean = ifelse(Ripeness == 'yes', 1.4 + runif(30), 1.6 + runif(30))) %>%
add_row(Fruits = "Peach", Ripeness = "yes", Mean = 5)
tableData %>%
group_by(Fruits) %>%
summarise(Meandiff = mean(Mean[Ripeness == 'yes']) - mean(Mean[Ripeness == 'no']),
t_test_p_val = if(length(unique(Ripeness))!=2) NaN else t.test(Mean ~ Ripeness)$p.value,
N.yes = sum(Ripeness=="yes"))
Fruits Meandiff t_test_p_val N.yes
<chr> <dbl> <dbl> <int>
1 Apple -0.260 0.241 5
2 Banana -0.223 0.305 4
3 Orange -0.692 0.000290 7
4 Peach NaN NaN 1

Related

Correlation by group and unique pairs

I want to do correlations for each unique combination and grouped by another variable. My solutions works for a very small dataset buy imagine more columns it's getting very tedious.
set.seed((13))
df <- data.frame(group = rep(LETTERS[1:3], 3),
var1 = rnorm(9, 1),
var2 = rnorm(9, 2),
var3 = rnorm(9, 1))
df %>%
group_by(group) %>%
summarise(var1_var2 = cor(var1, var2),
var1_var3 = cor(var1, var3),
var2_var3 = cor(var2, var3))
I also tried this one, but it doens't work.
df %>%
group_by(group) %>%
summarise(cor = cor(df[,2:ncol(df)]))
Here is an option. Map out all the combos then run a cor test for each group and each var and then pivot wider at the end:
library(tidyverse)
map_dfr(unique(df$group), \(x){
data.frame(t(combn(c("var1", "var2", "var3"), 2))) |>
mutate(group = x)
}) |>
mutate(cor = pmap_dbl(list(X1, X2, group),
~cor(df[df$group == ..3, ..1],
df[df$group == ..3, ..2]))) |>
unite(test, X1, X2) |>
pivot_wider(names_from = test, values_from = cor)
#> # A tibble: 3 x 4
#> group var1_var2 var1_var3 var2_var3
#> <chr> <dbl> <dbl> <dbl>
#> 1 A 0.318 -0.476 -0.985
#> 2 B -0.373 -0.487 -0.628
#> 3 C 0.535 0.991 0.645
Another solution. This works for any number of variables.
library(dplyr)
library(tidyr)
library(purrr)
library(tibble)
set.seed((13))
df <- data.frame(group = rep(LETTERS[1:3], 3),
var1 = rnorm(9, 1),
var2 = rnorm(9, 2),
var3 = rnorm(9, 2))
df %>%
select(-group) %>%
split(df$group) %>%
imap_dfr(
~ {
expand.grid(
first = names(.x),
second = names(.x),
stringsAsFactors = FALSE
) %>%
filter(first < second) %>%
rowwise() %>%
transmute(
group = .y,
pair = paste(first, second, sep = "_"),
cor = cor(.x[[first]], .x[[second]])
)
}
) %>%
pivot_wider(
names_from = "pair",
values_from = "cor"
)
# # A tibble: 3 × 4
# group var1_var2 var1_var3 var2_var3
# <chr> <dbl> <dbl> <dbl>
# 1 A 0.318 -0.476 -0.985
# 2 B -0.373 -0.487 -0.628
# 3 C 0.535 0.991 0.645

map filter and mutate across columns

I want to calculate the proportion of a variable in subgroups compared to the proportion of the whole dataset. The subgroups are based on binary columns. I want to filter the dataframe for each column, count the grouping variable and calculate the proportions. To compare the proportions, I calculate an index value which is 100*prop_subgroup/prop_overall.
I tried and failed to do this with map. Below is a for-loop and a lot of detours to achieve this, and I´m looking for some help to clean up this code and solve this "the tidyverse way". Thank you!
data <- data.frame(group = sample(c(LETTERS[1:6], NA), 1000, T),
v1 = sample(c(0, 1, NA), 1000, T),
v2 = sample(c(0, 1, 2, 3, 4, NA), 1000, T),
v3 = sample(c(0, 1, NA), 1000, T, prob = c(0.05, 0.05, 0.9)),
v4 = sample(c(0, 1, NA), 1000, T, prob = c(0.8, 0.1, 0.1)),
v5 = sample(c("a", 1, NA), 1000, T, prob = c(0.8, 0.1, 0.1)))
Calculate the prop.table
result <- data %>% count(group) %>% na.omit() %>% transmute(group = group, prop = n/sum(n))
Select binary columns
data_binary <- data %>% select(where(is.numeric)) %>%
select(where(function(x) {max(x, na.rm = T) == 1})) %>%
bind_cols(data %>% select(group), .)
Very ugly peace of code to calculate the frequencies for each group. Left join because some subgroups do not contain all grouping variables. The key peace I failed to do with map is the filtering based on one column and count of another column applied to all binary columns.
for(i in 2:ncol(data_binary)){
name <- names(data_binary)[i]
result <- left_join(result, data_binary %>% filter(.[[i]] == 1) %>% count(group) %>%
na.omit() %>% transmute(group = group, "{{name}}_index" := n/sum(n)))
}
Calculate index based on the frequencies
index <- bind_cols(result %>% select(group),
result %>% transmute_at(vars(-c("prop", "group")), function(x) {100 * x / result$prop}))
Result
group "v1"_index "v3"_index "v4"_index
1 A 79.90019 16.21418 60.54443
2 B 91.31450 97.28507 87.45307
3 C 114.26996 122.50712 95.30142
4 D 96.63614 175.24198 109.06017
5 E 100.08550 116.05938 126.39978
6 F 116.70123 62.55683 116.79493
I think you can accomplish this with a group_by, summarize to get counts and group_by, mutate to calculate fractions. However, I don't produce the same result so perhaps I don't understand exactly how you want to calculate the fractions (sum only the ones?)
data <- data.frame(group = sample(c(LETTERS[1:6], NA), 1000, T),
v1 = sample(c(0, 1, NA), 1000, T),
v2 = sample(c(0, 1, 2, 3, 4, NA), 1000, T),
v3 = sample(c(0, 1, NA), 1000, T, prob = c(0.05, 0.05, 0.9)),
v4 = sample(c(0, 1, NA), 1000, T, prob = c(0.8, 0.1, 0.1)),
v5 = sample(c("a", 1, NA), 1000, T, prob = c(0.8, 0.1, 0.1)))
library(tidyverse)
# counts and fractions for each combination of group and variable
data_long <- data %>%
as_tibble() %>%
# select only binary
select(group, where(~max(., na.rm = TRUE) == 1)) %>%
# pivot and calculate sums and fractions
pivot_longer(-group) %>%
drop_na(value) %>%
group_by(group, name) %>% summarize(count = sum(value), .groups = "drop") %>%
group_by(group) %>% mutate(fraction = count / sum(count))
print(data_long)
#> # A tibble: 21 x 4
#> # Groups: group [7]
#> group name count fraction
#> <chr> <chr> <dbl> <dbl>
#> 1 A v1 61 0.693
#> 2 A v3 7 0.0795
#> 3 A v4 20 0.227
#> 4 B v1 54 0.659
#> 5 B v3 10 0.122
#> 6 B v4 18 0.220
#> 7 C v1 45 0.75
#> 8 C v3 4 0.0667
#> 9 C v4 11 0.183
#> 10 D v1 48 0.716
#> # ... with 11 more rows
# pivot wider on fractions to get output in desired form
data_wide <- data_long %>%
pivot_wider(id_cols = group, values_from = fraction)
print(data_wide)
#> # A tibble: 7 x 4
#> # Groups: group [7]
#> group v1 v3 v4
#> <chr> <dbl> <dbl> <dbl>
#> 1 A 0.693 0.0795 0.227
#> 2 B 0.659 0.122 0.220
#> 3 C 0.75 0.0667 0.183
#> 4 D 0.716 0.0896 0.194
#> 5 E 0.707 0.0690 0.224
#> 6 F 0.677 0.154 0.169
#> 7 <NA> 0.725 0.0980 0.176
Created on 2022-03-31 by the reprex package (v2.0.1)

Take Symmetrical Mean of a tibble (ignoring the NAs)

I have a tibble where the rows and columns are the same IDs and I would like to take the mean (ignoring the NAs) to make the df symmetrical. I am struggling to see how.
data <- tibble(group = LETTERS[1:4],
A = c(NA, 10, 20, NA),
B = c(15, NA, 25, 30),
C = c(20, NA, NA, 10),
D = c(10, 12, 15, NA)
)
I would normally do
A <- as.matrix(data[-1])
(A + t(A))/2
But this does not work because of the NAs.
Edit: below is the expected output.
output <- tibble(group = LETTERS[1:4],
A = c(NA, 12.5, 20, 10),
B = c(12.5, NA, 25, 21),
C = c(20, 25, NA, 12.5),
D = c(10, 21, 12.5, NA))
Here is a suggestion using tidyverse code.
library(tidyverse)
data <- tibble(group = LETTERS[1:4],
A = c(NA, 10, 20, NA),
B = c(15, NA, 25, 30),
C = c(20, NA, NA, 10),
D = c(10, 12, 15, NA)
)
A <- data %>%
pivot_longer(-group, values_to = "x")
B <- t(data) %>%
as.data.frame() %>%
setNames(LETTERS[1:4]) %>%
rownames_to_column("group") %>%
pivot_longer(-group, values_to = "y") %>%
left_join(A, by = c("group", "name")) %>%
mutate(
mean = if_else(!(is.na(x) | is.na(y)), (x + y)/2, x),
mean = if_else(is.na(mean) & !is.na(y), y, mean)
) %>%
select(-x, -y) %>%
pivot_wider(names_from = name, values_from = mean)
B
## A tibble: 4 x 5
# group A B C D
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 A NA 12.5 20 10
#2 B 12.5 NA 25 21
#3 C 20 25 NA 12.5
#4 D 10 21 12.5 NA
Okay so this is how I ended up doing this. I would have preferred if I didnt use a for loop because the actual data I have is much bigger but beggars cant be choosers!
A <- as.matrix(data[-1])
for (i in 1:nrow(A)){
for (j in 1:ncol(A)){
if(is.na(A[i,j])){
A[i,j] <- A[j, i]
}
}
}
output <- (A + t(A))/2
output %>%
as_tibble() %>%
mutate(group = data$group) %>%
select(group, everything())
# A tibble: 4 x 5
group A B C D
<chr> <dbl> <dbl> <dbl> <dbl>
1 A NA 12.5 20 10
2 B 12.5 NA 25 21
3 C 20 25 NA 12.5
4 D 10 21 12.5 NA

Proportions by group with survey weights

data = data.frame(ID = 1:1000,
GROUP = factor(sample(1:5, rep = T)),
CAT = factor(sample(1:5, rep = T)),
DOG = factor(sample(1:5, rep = T)),
FOX = factor(sample(1:5, rep = T)),
MOUSE = factor(sample(1:5, rep = T)),
WEIGHT = round(runif(1000)*100,0)
)
data_WANT = data.frame(VARS = c("CAT", "DOG", "FOX", "MOUSE", "WEIGHT"),
GROUP1_N = NA,
GROUP1_PROP = NA,
GROUP2_N = NA,
GROUP2_PROP = NA,
GROUP3_N = NA,
GROUP3_PROP = NA,
GROUP4_N = NA,
GROUP4_PROP = NA,
GROUP5_N = NA,
GROUP5_PROP = NA)
I have a dataframe called 'data' and I wish to create a dataframe or datatable that presents the COUNT(_N) of each variable by GROUP and also the weighted proportion (_PROP) for each variable for each group using the variable WEIGHT in the dataframe called 'data'. This is a probability weight that is given to me to get representative estimates.
We can use data.table methods
library(data.table)
dcast(melt(setDT(type.convert(data, as.is = TRUE))[,
c(list(N = .N), lapply(.SD, weighted.mean, WEIGHT)),
GROUP, .SDcols = CAT:MOUSE], id.var = c('GROUP', 'N'),
variable.name = 'Animal'), Animal ~
paste0('GROUP_', GROUP), value.var = c('value', 'N'))
Perhaps, you are trying to do :
library(dplyr)
library(tidyr)
data %>%
type.convert(as.is = TRUE) %>%
group_by(GROUP) %>%
summarise(across(CAT:MOUSE, list(N = ~n(),
PROP = ~weighted.mean(., WEIGHT)))) %>%
pivot_longer(-GROUP,
names_to = c('Animal', 'prop'),
names_sep = '_') %>%
pivot_wider(names_from = c(GROUP, prop), values_from = value,
names_prefix = 'GROUP_')
# A tibble: 4 x 11
# Animal GROUP_1_N GROUP_1_PROP GROUP_2_N GROUP_2_PROP GROUP_3_N
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 CAT 200 5 200 1 200
#2 DOG 200 5 200 2 200
#3 FOX 200 1 200 3 200
#4 MOUSE 200 2 200 1 200
# … with 5 more variables: GROUP_3_PROP <dbl>, GROUP_4_N <dbl>,
# GROUP_4_PROP <dbl>, GROUP_5_N <dbl>, GROUP_5_PROP <dbl>
The pivot_longer and pivot_wider step is to get data in the same format as shown in data_WANT and they are not necessary to perform the calculation.

Cumsum table with grouping

How can I get Cumsum table grouped by both Gender and State?
Gender = sample(c('male', 'female'), 100, replace=TRUE)
State = sample(c('CA', 'WA', 'NV', 'OR', "AZ"), 100, replace=TRUE)
Number = sample(1:8, size=100, replace=TRUE)
df <- data.frame(Gender,State, Number)
If we are looking for cumsum table, then
library(data.table)
dcast(setDT(df)[, .N, .(Gender, State, Number)
][, perc := round(100*N/sum(N), 2), .(Gender, State)],
Gender + State ~Number, value.var = 'perc', fill = 0, drop = FALSE)[,
(3:10) := lapply(Reduce(`+`, .SD, accumulate = TRUE),
function(x) paste0(x, "%")), .SDcols = -(1:2)][]
For a simpler approach, I would recommend using dplyr. Dplyr is loaded along with a bunch of other helpful packages when you load tidyverse.
library(tidyverse)
Gender = sample(c('male', 'female'), 100, replace=TRUE)
State = sample(c('CA', 'WA', 'NV', 'OR', "AZ"), 100, replace=TRUE)
Number = sample(1:8, size=100, replace=TRUE)
df <- data.frame(Gender,State, Number)
df <- df %>%
group_by(Gender, State) %>%
mutate(Number_CumSum = cumsum(Number)) %>%
ungroup() %>%
arrange(State, Gender)
head(df)
# A tibble: 6 x 4
Gender State Number Number_CumSum
<fctr> <fctr> <int> <int>
1 female AZ 8 8
2 female AZ 3 11
3 female AZ 4 15
4 female AZ 5 20
5 female AZ 2 22
6 female AZ 7 29

Resources