I'm a beginner in R and I am stuck with the following:
library(dplyr)
df <- tibble(
  id = c(01, 02),
  a = c(0.44, 0.42),
  b = c(1, 0.42),
  c = c(NaN, 0.71),
  d = c(0.75, 0),
  e = c(0.66, 0.75),
  f = c(0.5, 0.22),
  g = c(1, NaN),
  h = c(0.8, NaN)
)
How can I use mutate to add columns that count, for each row, the number of cells > 0, separately for the columns a:d and e:h?
I have been thinking of something like this:
df1 <- df %>%
rowwise() %>%
mutate(casesatod = length(which(., > 0), na.rm = TRUE),
casesetoh = length(which(., > 0), na.rm = TRUE))
Of course, this code is incomplete, but it should give you an idea of what I had in mind.
I'm really looking forward to your help!
Thanks in advance!
One option is rowSums after selecting a subset of columns from the dataset. Because rowSums is vectorized, it is more efficient than rowwise.
library(dplyr)
df %>%
mutate(casesatod = rowSums(.[2:5] > 0, na.rm = TRUE),
casesetoh = rowSums(.[6:9] > 0, na.rm = TRUE))
If we need to select by column name, use select:
df %>%
mutate(casesatod = rowSums(select(., a:d) > 0, na.rm = TRUE),
casesetoh = rowSums(select(., e:h) > 0, na.rm = TRUE))
# A tibble: 2 x 11
# id a b c d e f g h casesatod casesetoh
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 0.44 1 NaN 0.75 0.66 0.5 1 0.8 3 4
#2 2 0.42 0.42 0.71 0 0.75 0.22 NaN NaN 3 2
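As a cross-check, the same per-row counts can be reproduced without dplyr. This is only a minimal base R sketch: it rebuilds the example data as a plain data.frame (an assumption made for self-containment; the question uses a tibble) and selects the two column blocks by name.

```r
# Example data from the question, rebuilt as a plain data.frame
df <- data.frame(
  id = c(1, 2),
  a = c(0.44, 0.42), b = c(1, 0.42), c = c(NaN, 0.71), d = c(0.75, 0),
  e = c(0.66, 0.75), f = c(0.5, 0.22), g = c(1, NaN), h = c(0.8, NaN)
)

# Comparing a data.frame with 0 gives a logical matrix; NaN > 0 is NA,
# and na.rm = TRUE makes rowSums skip those cells
df$casesatod <- rowSums(df[c("a", "b", "c", "d")] > 0, na.rm = TRUE)
df$casesetoh <- rowSums(df[c("e", "f", "g", "h")] > 0, na.rm = TRUE)

print(df[c("id", "casesatod", "casesetoh")])
```

The counts should agree with the casesatod and casesetoh columns in the tibble output above.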
I have a list of questions, and I want to know how many rows have non-NA values using summarize. I want to use summarize because I'm already using that to calculate the average, which works in the below code. Why does the below code not work and how can I fix it?
library(dplyr)
test <- tibble(student = c("j", "c", "s"),
q1 = c(1, 2, 3),
q2 = c(NA_real_, NA_real_, 4),
q3 = c(43, NA_real_, 232))
test %>%
dplyr::summarise(n = across(starts_with("q"), ~n(.x)),
avg = across(contains("q"), ~ round(mean(.x, na.rm = T), 2)))
expected_outcome <- tibble(n_q1 = 3,
n_q2 = 1,
n_q3 = 2,
avg_q1 = 2,
avg_q2 = 4,
avg_q3 = 138)
library(dplyr)
test %>%
summarize(across(starts_with("q"), list(n = ~sum(!is.na(.)),
avg = ~mean(., na.rm = T)),
.names = "{.fn}_{.col}"))
From the ?across documentation, you can pass a list to the .fns argument:
A list of functions/lambdas, e.g. list(mean = mean, n_miss = ~ sum(is.na(.x)))
This will apply every function in that list to the columns you have specified. You can then use the .names argument of across to set the column names how you desire.
Output
n_q1 avg_q1 n_q2 avg_q2 n_q3 avg_q3
<int> <dbl> <int> <dbl> <int> <dbl>
1 3 2 1 4 2 138.
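For context on why the original attempt fails: n() takes no arguments, so ~n(.x) errors, whereas sum(!is.na(.x)) counts the non-missing values of a column. The sketch below is a base R illustration of what a named function list does conceptually (every function applied to every selected column, results named "<fn>_<col>"). It is not how across is implemented, and the names fns, q_cols, res, and summary_row are made up for the example.

```r
test <- data.frame(
  student = c("j", "c", "s"),
  q1 = c(1, 2, 3),
  q2 = c(NA, NA, 4),
  q3 = c(43, NA, 232)
)

# Named list of summary functions, mirroring the list passed to .fns
fns <- list(
  n   = function(x) sum(!is.na(x)),
  avg = function(x) mean(x, na.rm = TRUE)
)

# Columns whose names start with "q"
q_cols <- grep("^q", names(test), value = TRUE)

# Apply every function to every selected column, naming results "<fn>_<col>"
res <- list()
for (col in q_cols) {
  for (fn in names(fns)) {
    res[[paste(fn, col, sep = "_")]] <- fns[[fn]](test[[col]])
  }
}
summary_row <- as.data.frame(res)
print(summary_row)
```

Note that the unrounded mean of q3 is 137.5; the tibble output above displays it as 138. because of tibble's significant-digit printing.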
Update: Oops, I misread the question, sorry. Here is an alternative just for fun; the preferred answer was already given by @LMc:
library(dplyr)
test %>%
summarise(across(starts_with("q"), list(avg = ~mean(., na.rm = T)),
.names = "{.fn}_{.col}")) %>%
bind_cols(test %>% purrr::map_df(~sum(!is.na(.))))
avg_q1 avg_q2 avg_q3 student q1 q2 q3
<dbl> <dbl> <dbl> <int> <int> <int> <int>
1 2 4 138. 3 3 1 2
First, incomplete answer:
To count the non-NA values in every column of the dataset, we could do this:
library(dplyr)
test %>%
purrr::map_df(~sum(!is.na(.)))
student q1 q2 q3
<int> <int> <int> <int>
1 3 3 1 2
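The same per-column count can also be obtained in base R, since is.na on a data frame returns a logical matrix that colSums can add up. A minimal sketch, with the example data rebuilt as a plain data.frame and the hypothetical names non_na_all and non_na_q:

```r
test <- data.frame(
  student = c("j", "c", "s"),
  q1 = c(1, 2, 3),
  q2 = c(NA, NA, 4),
  q3 = c(43, NA, 232)
)

# Non-NA count for every column
non_na_all <- colSums(!is.na(test))

# Restrict to the question columns only
non_na_q <- colSums(!is.na(test[startsWith(names(test), "q")]))

print(non_na_all)
print(non_na_q)
```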
I want to calculate the proportion of a variable in subgroups compared to the proportion of the whole dataset. The subgroups are based on binary columns. I want to filter the dataframe for each column, count the grouping variable and calculate the proportions. To compare the proportions, I calculate an index value which is 100*prop_subgroup/prop_overall.
I tried and failed to do this with map. Below is a for-loop and a lot of detours to achieve this, and I'm looking for some help to clean up this code and solve this "the tidyverse way". Thank you!
data <- data.frame(group = sample(c(LETTERS[1:6], NA), 1000, T),
v1 = sample(c(0, 1, NA), 1000, T),
v2 = sample(c(0, 1, 2, 3, 4, NA), 1000, T),
v3 = sample(c(0, 1, NA), 1000, T, prob = c(0.05, 0.05, 0.9)),
v4 = sample(c(0, 1, NA), 1000, T, prob = c(0.8, 0.1, 0.1)),
v5 = sample(c("a", 1, NA), 1000, T, prob = c(0.8, 0.1, 0.1)))
Calculate the prop.table
result <- data %>% count(group) %>% na.omit() %>% transmute(group = group, prop = n/sum(n))
Select binary columns
data_binary <- data %>% select(where(is.numeric)) %>%
select(where(function(x) {max(x, na.rm = T) == 1})) %>%
bind_cols(data %>% select(group), .)
Very ugly piece of code to calculate the frequencies for each group. A left join is used because some subgroups do not contain all values of the grouping variable. The key piece I failed to do with map is filtering on one column and counting another column, applied to all binary columns.
for(i in 2:ncol(data_binary)){
name <- names(data_binary)[i]
result <- left_join(result, data_binary %>% filter(.[[i]] == 1) %>% count(group) %>%
na.omit() %>% transmute(group = group, "{{name}}_index" := n/sum(n)))
}
Calculate index based on the frequencies
index <- bind_cols(result %>% select(group),
result %>% transmute_at(vars(-c("prop", "group")), function(x) {100 * x / result$prop}))
Result
group "v1"_index "v3"_index "v4"_index
1 A 79.90019 16.21418 60.54443
2 B 91.31450 97.28507 87.45307
3 C 114.26996 122.50712 95.30142
4 D 96.63614 175.24198 109.06017
5 E 100.08550 116.05938 126.39978
6 F 116.70123 62.55683 116.79493
I think you can accomplish this with group_by + summarize to get counts, followed by group_by + mutate to calculate fractions. However, my result differs from yours, so perhaps I don't understand exactly how you want to calculate the fractions (sum only the ones?).
data <- data.frame(group = sample(c(LETTERS[1:6], NA), 1000, T),
v1 = sample(c(0, 1, NA), 1000, T),
v2 = sample(c(0, 1, 2, 3, 4, NA), 1000, T),
v3 = sample(c(0, 1, NA), 1000, T, prob = c(0.05, 0.05, 0.9)),
v4 = sample(c(0, 1, NA), 1000, T, prob = c(0.8, 0.1, 0.1)),
v5 = sample(c("a", 1, NA), 1000, T, prob = c(0.8, 0.1, 0.1)))
library(tidyverse)
# counts and fractions for each combination of group and variable
data_long <- data %>%
as_tibble() %>%
# select only binary
select(group, where(~max(., na.rm = TRUE) == 1)) %>%
# pivot and calculate sums and fractions
pivot_longer(-group) %>%
drop_na(value) %>%
group_by(group, name) %>% summarize(count = sum(value), .groups = "drop") %>%
group_by(group) %>% mutate(fraction = count / sum(count))
print(data_long)
#> # A tibble: 21 x 4
#> # Groups: group [7]
#> group name count fraction
#> <chr> <chr> <dbl> <dbl>
#> 1 A v1 61 0.693
#> 2 A v3 7 0.0795
#> 3 A v4 20 0.227
#> 4 B v1 54 0.659
#> 5 B v3 10 0.122
#> 6 B v4 18 0.220
#> 7 C v1 45 0.75
#> 8 C v3 4 0.0667
#> 9 C v4 11 0.183
#> 10 D v1 48 0.716
#> # ... with 11 more rows
# pivot wider on fractions to get output in desired form
data_wide <- data_long %>%
pivot_wider(id_cols = group, values_from = fraction)
print(data_wide)
#> # A tibble: 7 x 4
#> # Groups: group [7]
#> group v1 v3 v4
#> <chr> <dbl> <dbl> <dbl>
#> 1 A 0.693 0.0795 0.227
#> 2 B 0.659 0.122 0.220
#> 3 C 0.75 0.0667 0.183
#> 4 D 0.716 0.0896 0.194
#> 5 E 0.707 0.0690 0.224
#> 6 F 0.677 0.154 0.169
#> 7 <NA> 0.725 0.0980 0.176
Created on 2022-03-31 by the reprex package (v2.0.1)
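For comparison, the index as defined in the question (100 * prop_subgroup / prop_overall, where the subgroup is the set of rows with value 1 in a binary column) can be sketched directly in base R. The small data set below is invented so the arithmetic can be followed by hand, and binary_cols is assumed to be known in advance rather than detected. Unlike the left_join version in the question, a group that never occurs in a subgroup gets an index of 0 here instead of NA.

```r
# Small deterministic toy data (made up for this sketch)
data <- data.frame(
  group = c("A", "A", "B", "B", "B", "C"),
  v1 = c(1, 0, 1, 1, NA, 0),
  v3 = c(0, 1, 1, NA, 0, 1)
)

grp <- factor(data$group)            # NA groups, if any, are dropped by table()
overall <- prop.table(table(grp))    # share of each group in the whole data set

binary_cols <- c("v1", "v3")         # assumed known here

index <- sapply(binary_cols, function(col) {
  keep <- which(data[[col]] == 1)           # which() ignores NA comparisons
  subgroup <- prop.table(table(grp[keep]))  # share of each group in the subgroup
  100 * as.numeric(subgroup) / as.numeric(overall)
})
rownames(index) <- levels(grp)

print(round(index, 2))
```

Here rows with v1 == 1 are one A and two B, so the A share is 1/3 against an overall share of 2/6, giving an index of 100.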
I have a toy example of a dataframe:
df <- data.frame(matrix(, nrow = 5, ncol = 0))
df["A|A"] <- c(0.3, 0, 0, 100, 23)
df["A|B"]= c(0, 0, 0.3, 10, 0.23)
df["A|C"]= c(0.3, 0.1, 0, 100, 2)
df["B|B"]= c(0, 0, 0, 12, 2)
df["B|B"]= c(0, 0, 0.3, 0, 0.23)
df["B|C"]= c(0.3, 0, 0, 21, 3)
df["C|A"]= c(0.3, 0, 1, 100, 0)
df["C|B"]= c(0, 0, 0.3, 10, 0.2)
df["C|C"]= c(0.3, 0, 1, 1, 0.3)
I need to get a matrix with counts of non-zero values between A and A, A and B, ..., C and C.
I started splitting the colnames and assigning them to variables. But I don't know how to create a matrix with certain rows and columns in a loop
counts <- colSums(df != 0)
df <- rbind(df, counts)
for(i in colnames(df)) {
cluster1 <- (strsplit(i, "\\|")[[1]])[1]
cluster2 <- (strsplit(i, "\\|")[[1]])[2]
}
A base R option
> table(read.table(text = rep(names(df), colSums(df > 0)), sep = "|"))
V2
V1 A B C
A 3 3 4
B 0 2 3
C 3 3 4
or a longer version
table(
data.frame(
do.call(
rbind,
strsplit(
as.character(subset(stack(df), values > 0)$ind),
"\\|"
)
)
)
)
gives
X2
X1 A B C
A 3 3 4
B 0 2 3
C 3 3 4
Reshape the data into 'long' format with pivot_longer, then separate the 'name' column into two, and reshape back to 'wide' with pivot_wider, specifying values_fn as a lambda function that counts the values greater than zero.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = everything()) %>%
separate(name, into = c('name1', 'name2')) %>%
pivot_wider(names_from = name2, values_from = value,
values_fn = list(value = ~ sum(. > 0)), values_fill = 0)
Output
# A tibble: 3 x 4
name1 A B C
<chr> <int> <int> <int>
1 A 3 3 4
2 B 0 2 3
3 C 3 3 4
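Staying closer to the loop started in the question, one possibility is to pre-allocate a labelled matrix and fill one cell per column name. This is a hedged sketch in base R: the data frame is rebuilt with check.names = FALSE, only the second "B|B" assignment is kept because it overwrites the first in the original code, and the names labels and m are made up for the example.

```r
df <- data.frame(
  "A|A" = c(0.3, 0, 0, 100, 23),
  "A|B" = c(0, 0, 0.3, 10, 0.23),
  "A|C" = c(0.3, 0.1, 0, 100, 2),
  "B|B" = c(0, 0, 0.3, 0, 0.23),
  "B|C" = c(0.3, 0, 0, 21, 3),
  "C|A" = c(0.3, 0, 1, 100, 0),
  "C|B" = c(0, 0, 0.3, 10, 0.2),
  "C|C" = c(0.3, 0, 1, 1, 0.3),
  check.names = FALSE
)

# Non-zero count per column, named by column
counts <- colSums(df != 0)

# All labels that occur on either side of "|"
labels <- sort(unique(unlist(strsplit(names(counts), "|", fixed = TRUE))))

# Pre-allocate a labelled matrix of zeros, then fill it cell by cell
m <- matrix(0, nrow = length(labels), ncol = length(labels),
            dimnames = list(labels, labels))
for (nm in names(counts)) {
  parts <- strsplit(nm, "|", fixed = TRUE)[[1]]
  m[parts[1], parts[2]] <- counts[[nm]]
}

print(m)
```

Pairs with no column in the data, such as B and A, simply keep their initial value of 0.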
I have a data frame with duplicated IDs. An ID stands for a specific entity. The IDs are duplicated because the dataset refers to a process that every entity can go through multiple times.
Here is a small example dat:
library(dplyr)
glimpse(dat)
Observations: 6
Variables: 3
$ ID <dbl> 1, 1, 1, 2, 2, 2
$ Amount <dbl> 10, 70, 80, 50, 10, 10
$ Product <fct> A, B, C, B, E, A
ID stands for the entity, Amount stands for the amount of money the entity has spent, and Product stands for the good the entity bought.
The issue is that I have to "condense" this data so that every ID / entity occurs only once. For the continuous variable, this is not an issue because I can simply calculate the mean per ID.
library(tidyr)
dat_con_ID <- dat %>%
select(ID) %>%
unique()
dat_con_Amount <- dat %>%
group_by(ID) %>%
summarise(Amount = mean(Amount))
dat_con <- inner_join(dat_con_ID, dat_con_Amount, by = "ID")
glimpse(dat_con)
Observations: 2
Variables: 2
$ ID <dbl> 1, 2
$ Amount <dbl> 53.33333, 23.33333
The problem is that I can't calculate the mean of Product because it's a categorical variable. One option would be to turn this factor into dummy variables and calculate their means, but since the original data frame is really huge, this is not a good solution. Any idea how to handle this problem?
Maybe you are trying to do this:
I am using the data.table library. I also modified your data by adding one extra row for ID = 1, so that you can see the difference in the output.
Data:
library('data.table')
dat <- data.table(ID =as.double(c(1, 1, 1, 2, 2, 2,1)),
Amount = as.double(c( 10, 70, 80, 50, 10, 10, 20)),
Product = factor( c('A', 'B', 'C', 'B', 'E', 'A', 'A')))
Code:
# average amount per id
dat[, .(avg_amt = mean(Amount)), by = .(ID) ]
# ID avg_amt
# 1: 1 45.00000
# 2: 2 23.33333
# average product per id
dat[, .SD[, .N, by = Product ][, .( avg_pdt = N/sum(N), Product)], by = .(ID) ]
# ID avg_pdt Product
# 1: 1 0.5000000 A
# 2: 1 0.2500000 B
# 3: 1 0.2500000 C
# 4: 2 0.3333333 B
# 5: 2 0.3333333 E
# 6: 2 0.3333333 A
# combining average amount and average product per id
dat[, .SD[, .N, by = Product ][, .( Product,
avg_pdt = N/sum(N),
avg_amt = mean(Amount))],
by = .(ID) ]
# ID Product avg_pdt avg_amt
# 1: 1 A 0.5000000 45.00000
# 2: 1 B 0.2500000 45.00000
# 3: 1 C 0.2500000 45.00000
# 4: 2 B 0.3333333 23.33333
# 5: 2 E 0.3333333 23.33333
# 6: 2 A 0.3333333 23.33333
edit
Another idea would be to count 'Product' per 'ID', then calculate the mean of 'Amount' and the relative frequency of each product. Spreading the data by 'Product' gives a wide format in which every ID / entity occurs only once.
dat %>%
add_count(Product, ID) %>%
group_by(ID) %>%
mutate(Amount = mean(Amount),
n = n / n()) %>%
unique() %>%
spread(Product, n, sep = "_") %>%
ungroup()
# A tibble: 2 x 6
# ID Amount Product_A Product_B Product_C Product_E
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1. 45.0 0.500 0.250 0.250 NA
#2 2. 23.3 0.333 0.333 NA 0.333
My first attempt, not what OP was looking for, but in case someone is interested:
As suggested by @steveb in the comments, you could summarise Product as a string.
library(dplyr)
dat %>%
group_by(ID) %>%
summarise(Amount = mean(Amount),
Product = toString( sort(unique(Product)))
)
# A tibble: 2 x 3
# ID Amount Product
# <dbl> <dbl> <chr>
#1 1. 45.0 A, B, C
#2 2. 23.3 A, B, E
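For the per-ID relative frequencies of Product discussed above, a base R cross-check is possible with table and prop.table. This is a sketch under the assumption that the seven-row example data is used. The names avg_amt, shares, share_df, and condensed are made up, and products an ID never bought show up as 0 rather than NA.

```r
dat <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2, 1),
  Amount = c(10, 70, 80, 50, 10, 10, 20),
  Product = factor(c("A", "B", "C", "B", "E", "A", "A"))
)

# Mean amount per ID
avg_amt <- tapply(dat$Amount, dat$ID, mean)

# Share of each product within each ID (every row sums to 1)
shares <- prop.table(table(dat$ID, dat$Product), margin = 1)

# Turn the table into one column per product
share_df <- as.data.frame.matrix(shares)
names(share_df) <- paste0("Product_", names(share_df))

# One row per ID: mean amount plus the product shares
condensed <- data.frame(
  ID = as.numeric(rownames(shares)),
  Amount = as.numeric(avg_amt[rownames(shares)]),
  share_df
)

print(condensed)
```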
data
dat <- structure(list(ID = c(1, 1, 1, 2, 2, 2, 1), Amount = c(10, 70,
80, 50, 10, 10, 20), Product = structure(c(1L, 2L, 3L, 2L, 4L,
1L, 1L), .Label = c("A", "B", "C", "E"), class = "factor")), .Names = c("ID",
"Amount", "Product"), row.names = c(NA, -7L), class = c("tbl_df",
"tbl", "data.frame"))
Consider the following example data:
tmp_df_dplyr <- data.frame(groups = rep(c("C", "B", "A"), each = 3),
a = c(-2, 0, -1, -1, 0, 1, 0, 1, 2),
b = rep(c(-1, 0, 1), each = 3))
I wish to do the following, except using colSums:
tmp_df_dplyr %>%
group_by(groups) %>%
summarise(min_group = min(c(sum(a), sum(b))))
# produces:
# A tibble: 3 × 2
groups min_group
<fctr> <dbl>
1 A 3
2 B 0
3 C -3
Using dot referencing, I get an unexpected result:
tmp_df_dplyr %>%
group_by(groups) %>%
summarise(min_group = min(colSums(.[, c('a', 'b')])))
# produces
# A tibble: 3 × 2
groups min_group
<fctr> <dbl>
1 A 0
2 B 0
3 C 0
that is, it looks like the groups are not being applied.
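A likely explanation is that `.` is the magrittr placeholder for the whole data frame arriving through the pipe, so colSums(.[, c('a', 'b')]) is evaluated on all nine rows for every group, and both column totals are 0. One possible workaround, sketched below, is to refer to the columns by bare name inside summarise so that dplyr supplies the per-group values; cbind(a, b) is just one way to hand them to colSums, and res is a made-up name.

```r
library(dplyr)

tmp_df_dplyr <- data.frame(
  groups = rep(c("C", "B", "A"), each = 3),
  a = c(-2, 0, -1, -1, 0, 1, 0, 1, 2),
  b = rep(c(-1, 0, 1), each = 3)
)

res <- tmp_df_dplyr %>%
  group_by(groups) %>%
  # a and b are the current group's values here, unlike `.`
  summarise(min_group = min(colSums(cbind(a, b))))

print(res)
```

This should reproduce the result of the min(c(sum(a), sum(b))) version shown earlier.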