Counts of factor levels for multiple variables grouped by row

I want to count how many times a specific factor level occurs across multiple factor variables per row.
Simplified, I want to know how many times each factor level is chosen across specific variables per row (memberID).
Example data:
results <- data.frame(
  MemID = c('A','B','C','D','E','F','G','H'),
  value_a = c(1, 2, 1, 4, 5, 1, 4, 0),
  value_b = c(1, 5, 2, 3, 4, 1, 0, 3),
  value_c = c(3, 5, 2, 1, 1, 1, 2, 1)
)
In this example, I want to know the frequency of each factor level in value_a and value_b for each MemID. How many times does A respond 1? How many times does A respond 2? And so on, for each level and each MemID, but only for value_a and value_b.
I would like the output to look something like this:
counts_by_level <- data.frame(
  MemID = c('A','B','C','D','E','F','G','H'),
  count_1 = c(2, 0, 1, 0, 0, 2, 0, 0),
  count_2 = c(0, 1, 1, 0, 0, 0, 0, 0),
  count_3 = c(0, 0, 0, 1, 0, 0, 0, 1),
  count_4 = c(0, 0, 0, 1, 1, 0, 1, 0),
  count_5 = c(0, 1, 0, 0, 1, 0, 0, 0)
)
I have been trying to use add_count or add_tally as well as table and searching other ways to answer this question. However, I am struggling to identify specific factor levels across multiple variables and then output new columns for the counts of those levels for each row.

You could do something like this. Note that you didn't include a zero count, but there are some zero selections.
library(tidyverse)

results |>
  select(-value_c) |>
  pivot_longer(cols = starts_with("value")) |>
  mutate(count = 1) |>
  select(-name) |>
  pivot_wider(names_from = value,
              values_from = count,
              names_prefix = "count_",
              values_fill = 0,
              values_fn = sum)
#> # A tibble: 8 x 7
#>   MemID count_1 count_2 count_5 count_4 count_3 count_0
#>   <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#> 1 A           2       0       0       0       0       0
#> 2 B           0       1       1       0       0       0
#> 3 C           1       1       0       0       0       0
#> 4 D           0       0       0       1       1       0
#> 5 E           0       0       1       1       0       0
#> 6 F           2       0       0       0       0       0
#> 7 G           0       0       0       1       0       1
#> 8 H           0       0       0       0       1       1

Another solution:
results %>%
  group_by(MemID, value_a, value_b) %>%
  summarise(n = n()) %>%
  pivot_longer(c(value_a, value_b)) %>%
  group_by(MemID, value) %>%
  summarise(n = sum(n)) %>%
  pivot_wider(id_cols = MemID,
              names_from = value, names_sort = TRUE, names_prefix = "count_",
              values_from = n, values_fn = sum, values_fill = 0)
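For completeness, the same counts can be produced in base R with table(), which the question mentions trying; a sketch against the example data (only value_a and value_b are stacked):

```r
# Base-R sketch: stack value_a and value_b, then cross-tabulate by MemID.
results <- data.frame(MemID = c('A','B','C','D','E','F','G','H'),
                      value_a = c(1, 2, 1, 4, 5, 1, 4, 0),
                      value_b = c(1, 5, 2, 3, 4, 1, 0, 3))

tab <- table(MemID = rep(results$MemID, 2),
             value = c(results$value_a, results$value_b))
colnames(tab) <- paste0("count_", colnames(tab))
tab
```

One difference from the pivot_wider() answers: table() sorts the levels, so the columns come out as count_0 through count_5 rather than in order of first appearance.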


Conditional rolling sum across columns

I have a data frame of values across successive years (columns) for unique individuals (rows). A dummy data example is provided here:
dt = structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), `2015` = c(0,
0.8219178, 0, 0.1369863, 0, 1.369863, 0.2739726, 0.8219178, 5,
0), `2016` = c(0, 1.369863, 0, 0.2739726, 0, 0.2739726, 0, 3.2876712,
0, 0), `2017` = c(0.6849315, 0, 0, 0.6849315, 0, 0.5479452, 0,
0, 0, 0), `2018` = c(1.0958904, 0.5479452, 1.9178082, 0, 0, 0,
0, 0, 0, 3), `2019` = c(0, 0, 0, 1.0958904, 0, 0.9589041, 0.5479452,
0, 0, 0), `2020` = c(0.4383562, 0, 0, 0, 0.2739726, 0.6849315,
0, 0, 0, 0)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-10L))
I want to create a dataset where the maximum value for each individual that should appear for each year is 1. In cases where it exceeds this value, I want to carry the excess value over 1 into the next year (column) and sum it to that year's value for each individual and so on.
The expected result is:
dt_expected = structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), `2015` = c(0,
0.8219178, 0, 0.1369863, 0, 1, 0.2739726, 0.8219178, 1, 0), `2016` = c(0,
1, 0, 0.2739726, 0, 0.6438356, 0, 1, 1, 0), `2017` = c(0.6849315,
0.369863, 0, 0.6849315, 0, 0.5479452, 0, 1, 1, 0), `2018` = c(1,
0.5479452, 1, 0, 0, 0, 0, 1, 1, 1), `2019` = c(0.0958904, 0,
0.9178082, 1, 0, 0.9589041, 0.5479452, 0.2876712, 1, 1), `2020` = c(0.4383562,
0, 0, 0.0958904, 0.2739726, 0.6849315, 0, 0, 0, 1)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L))
I am at a total loss as to where to start with this problem, so any assistance achieving this using data.table would be greatly appreciated. My only thought is to use lapply with an ifelse for the conditional component. Should I then be using rowSums or Reduce to shift excess values across columns?
A translation of Martin Morgan's answer to data.table:
library(data.table)
setDT(dt)

for (i in 2:(ncol(dt) - 1)) {
  x = dt[[i]]
  set(dt, j = i, value = pmin(x, 1))
  set(dt, j = i + 1L, value = dt[[i + 1L]] + pmax(x - 1, 0))
}
Not particularly pretty or efficient, but as a starting point I used pmin() and pmax() to update each year (and the subsequent year) iteratively. The current year becomes the minimum of the current year and 1 (pmin(x, 1)); the subsequent year becomes the current subsequent year plus the excess of the previous year (pmax(x - 1, 0)).
update <- function(df) {
  result = df
  for (idx in 2:(ncol(df) - 1)) {
    x = result[[idx]]
    result[[idx]] = pmin(x, 1)
    result[[idx + 1]] = result[[idx + 1]] + pmax(x - 1, 0)
  }
  result
}
We have
> all.equal(update(dt), dt_expected)
[1] TRUE
I don't know how to translate this into efficient data.table syntax, but the function 'works' as is on a data.table, update(as.data.table(dt)).
Not sure if there is a more efficient way with built-in functions, but I simply wrote a recursive function that implements your described algorithm for a single row and then applied it over every row.
f <- function(l, rest = 0, out = list()) {
  if (length(l) == 0) return(unlist(out))
  if (l[[1]] + rest <= 1) {
    f(l[-1], rest = 0, out = append(out, list(l[[1]] + rest)))
  } else {
    f(l[-1], rest = l[[1]] + rest - 1, out = append(out, list(1)))
  }
}

dt[-1] <- apply(dt[-1], 1, f, simplify = FALSE) |>
  do.call(what = rbind)
dt
#> # A tibble: 10 × 7
#>       ID `2015` `2016` `2017` `2018` `2019` `2020`
#>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#>  1     1  0      0      0.685  1      0.0959 0.438
#>  2     2  0.822  1      0.370  0.548  0      0
#>  3     3  0      0      0      1      0.918  0
#>  4     4  0.137  0.274  0.685  0      1      0.0959
#>  5     5  0      0      0      0      0      0.274
#>  6     6  1      0.644  0.548  0      0.959  0.685
#>  7     7  0.274  0      0      0      0.548  0
#>  8     8  0.822  1      1      1      0.288  0
#>  9     9  1      1      1      1      1      0
#> 10    10  0      0      0      1      1      1
Created on 2022-03-25 by the reprex package (v2.0.1)
Here is my solution:
library(tidyverse)

dt |>
  pivot_longer(cols = -ID, names_to = "year") |>
  arrange(ID, year) |>
  group_by(ID) |>
  mutate(x = {
    r <- accumulate(value,
                    ~ max(0, .y + .x - 1),
                    .init = 0)
    pmin(1, value + head(r, -1))
  }) |>
  select(x, year, ID) |>
  pivot_wider(names_from = "year", values_from = "x")
##> # A tibble: 10 × 7
##> # Groups:   ID [10]
##>       ID `2015` `2016` `2017` `2018` `2019` `2020`
##>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##>  1     1  0      0      0.685  1      0.0959 0.438
##>  2     2  0.822  1      0.370  0.548  0      0
##>  3     3  0      0      0      1      0.918  0
##>  4     4  0.137  0.274  0.685  0      1      0.0959
##>  5     5  0      0      0      0      0      0.274
##>  6     6  1      0.644  0.548  0      0.959  0.685
##>  7     7  0.274  0      0      0      0.548  0
##>  8     8  0.822  1      1      1      0.288  0
##>  9     9  1      1      1      1      1      0
##> 10    10  0      0      0      1      1      1
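The running carry in the accumulate() answer can also be written in base R with Reduce(..., accumulate = TRUE); a sketch (the helper name clip_carry is mine, not from the answers above):

```r
# Per-row carry logic: clip each year at 1 and push the excess into the next.
clip_carry <- function(v) {
  # excess[i + 1] is the amount carried out of year i, starting from 0
  excess <- Reduce(function(acc, x) max(0, x + acc - 1), v, 0, accumulate = TRUE)
  pmin(1, v + head(excess, -1))
}

# Row 6 (ID = 6) of the example data:
clip_carry(c(1.369863, 0.2739726, 0.5479452, 0, 0.9589041, 0.6849315))
```

Applied row-wise, e.g. with t(apply(dt[-1], 1, clip_carry)), this reproduces dt_expected.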

How to add a condition to mutate(across())

I have a df and would like to calculate a percentage (.x/.x[1] * 100) when row_number() > 2 and the first row in the same column is not 0. What should I do if I want to use mutate(across(...))? Where and how can I add the .x[1] != 0 condition?
mutate(across(.fns = ~ifelse(row_number() > 2 ... sprintf("%1.0f (%.2f%%)", .x, .x/.x[1] * 100), .x)))
df<-structure(list(Total = c(4, 2, 1, 1, 0, 0), `ELA` = c(0,
0, 0, 0, 0, 0), `Math` = c(4, 2, 1, 1, 0,
0), `PE` = c(0, 0, 0, 0, 0, 0)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
library(dplyr)

df %>%
  mutate(across(
    where(~ .x[1] > 0),
    ~ ifelse(
      row_number() > 2,
      sprintf("%1.0f (%.2f%%)", .x, .x / .x[1] * 100),
      .x
    )
  ))
# # A tibble: 6 × 4
#   Total      ELA   Math       PE
#   <chr>    <dbl> <chr>     <dbl>
# 1 4            0 4             0
# 2 2            0 2             0
# 3 1 (25.00%)   0 1 (25.00%)    0
# 4 1 (25.00%)   0 1 (25.00%)    0
# 5 0 (0.00%)    0 0 (0.00%)     0
# 6 0 (0.00%)    0 0 (0.00%)     0
Have a look at the ?across help page for more examples.

Subtracting each column from its previous one in a data frame

I have a very simple case here in which I would like to subtract each column from its previous one. In effect I am looking for a sliding subtraction: the first column stays as-is, then the second column has the first subtracted from it, the third has the second subtracted, and so on through the last column.
Here is my sample data set:
structure(list(x = c(1, 0, 0, 0), y = c(1, 0, 1, 1), z = c(0,
1, 1, 1)), class = "data.frame", row.names = c(NA, -4L))
and my desired output:
structure(list(x = c(1, 0, 0, 0), y = c(0, 0, 1, 1), z = c(-1,
1, 0, 0)), class = "data.frame", row.names = c(NA, -4L))
I am personally looking for a solution with purrr family of functions. I also thought about slider but I'm not quite familiar with the latter one. So I would appreciate any help and idea with these two packages in advance. Thank you very much.
A simple dplyr-only solution:
cur_data() inside mutate/summarise creates a whole copy of the data, so
just subtract cur_data()[-ncol(.)] from cur_data()[-1].
With pmap_dfr you can do similar things.
df <- structure(list(x = c(1, 0, 0, 0), y = c(1, 0, 1, 1), z = c(0,
  1, 1, 1)), class = "data.frame", row.names = c(NA, -4L))

library(dplyr)

df %>%
  mutate(cur_data()[-1] - cur_data()[-ncol(.)])
#>   x y  z
#> 1 1 0 -1
#> 2 0 0  1
#> 3 0 1  0
#> 4 0 1  0
Similarly, with purrr:
pmap_dfr(df, ~ c(c(...)[1], c(...)[-1] - c(...)[-ncol(df)]))
I think you are looking for pmap_df with lag to subtract the previous value.
library(purrr)
library(dplyr)

pmap_df(df, ~ {x <- c(...); x - lag(x, default = 0)})
# A tibble: 4 x 3
#       x     y     z
#   <dbl> <dbl> <dbl>
# 1     1     0    -1
# 2     0     0     1
# 3     0     1     0
# 4     0     1     0
Verbose, but simple:
df %>%
  select(x) %>%
  bind_cols(
    map2_dfc(df %>% select(-1),
             df %>% select(-ncol(df)),
             ~ .x - .y)
  )
#   x y  z
# 1 1 0 -1
# 2 0 0  1
# 3 0 1  0
# 4 0 1  0
We can just do this in base R (no packages needed):
cbind(df[1], df[-1] - df[-ncol(df)])
Output:
  x y  z
1 1 0 -1
2 0 0  1
3 0 1  0
4 0 1  0
Or using dplyr:
library(dplyr)

df %>%
  mutate(.[-1] - .[-ncol(.)])

How to pass a vector of column names in case_when

I am using case_when to summarise a data frame using rowwise in dplyr. I have a sample data frame as shown below
structure(list(A = c(NA, 1, 0, 0, 0, 0, 0), B = c(NA, 0, 0, 1,
0, 0, 0), C = c(NA, 1, 0, 0, 0, 0, 0), D = c(NA, 1, 0, 1, 0,
0, 1), E = c(NA, 1, 0, 1, 0, 0, 1)), row.names = c(NA, -7L), class = "data.frame")
The code works when I mention all the names
df %>%
  rowwise() %>%
  mutate(New = case_when(any(c(A, B, C, D, E) == 1) ~ 1,
                         all(c(A, B, C, D, E) == 0) ~ 0))
Can I pass the names in a vector, e.g. cols <- colnames(df), and then use that in case_when?
To answer your question, you can use cur_data() (in dplyr >= 1.0.0) or c_across().
library(dplyr)

df %>%
  rowwise() %>%
  mutate(New = case_when(any(cur_data() == 1) ~ 1,
                         all(cur_data() == 0) ~ 0))
#       A     B     C     D     E   New
#   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1    NA    NA    NA    NA    NA    NA
# 2     1     0     1     1     1     1
# 3     0     0     0     0     0     0
# 4     0     1     0     1     1     1
# 5     0     0     0     0     0     0
# 6     0     0     0     0     0     0
# 7     0     0     0     1     1     1
With c_across():
df %>%
  rowwise() %>%
  mutate(New = case_when(any(c_across() == 1) ~ 1,
                         all(c_across() == 0) ~ 0))
But you can also solve this using rowSums():
df %>%
  mutate(New = case_when(rowSums(. == 1, na.rm = TRUE) > 0 ~ 1,
                         rowSums(. == 0, na.rm = TRUE) == ncol(.) ~ 0))
If you only have 0s and 1s in your dataset, you could use:
df$New <- ifelse(rowSums(df) > 0, 1, 0)
If the row sum is greater than 0, at least one 1 is present. Output:
   A  B  C  D  E New
1 NA NA NA NA NA  NA
2  1  0  1  1  1   1
3  0  0  0  0  0   0
4  0  1  0  1  1   1
5  0  0  0  0  0   0
6  0  0  0  0  0   0
7  0  0  0  1  1   1
In base R, we can do this with
df$New <- +(rowSums(df) > 0)
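To literally pass a vector of column names, as the question asks, c_across() accepts tidyselect helpers such as all_of(); a sketch, assuming dplyr >= 1.0.0:

```r
library(dplyr)

# The sample data frame from the question
df <- structure(list(A = c(NA, 1, 0, 0, 0, 0, 0), B = c(NA, 0, 0, 1, 0, 0, 0),
                     C = c(NA, 1, 0, 0, 0, 0, 0), D = c(NA, 1, 0, 1, 0, 0, 1),
                     E = c(NA, 1, 0, 1, 0, 0, 1)),
                row.names = c(NA, -7L), class = "data.frame")

cols <- colnames(df)  # or any subset of column names

out <- df %>%
  rowwise() %>%
  mutate(New = case_when(any(c_across(all_of(cols)) == 1) ~ 1,
                         all(c_across(all_of(cols)) == 0) ~ 0)) %>%
  ungroup()
out
```

all_of() errors if a name in cols is missing from the data, which is usually what you want when the vector is built programmatically.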

How to convert indicator columns to a concatenated column (of column names)

I have 3 columns consisting of indicator (0/1) values:
icols <-
structure(list(delivery_group = c(0, 1, 1, 0, 0), culturally_tailored = c(0,
0, 1, 0, 1), integrated_intervention = c(1, 0, 0, 0, 0)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -5L))
I would like to return a single character column 'qualifiers', such that column names with indicator == 1 are concatenated in a string as below:
*qualifiers*
integrated_intervention
delivery_group
delivery_group, culturally_tailored
culturally_tailored
I tried extdplyr::ind (with various options) without success. The one below crashed my R session.
icols <- extdplyr::ind_to_char(col = qualifiers, ret_factor = FALSE, remove = TRUE,
                               from = c("delivery_group", "culturally_tailored",
                                        "integrated_intervention"),
                               mutually_exclusive = FALSE,
                               collectively_exhaustive = FALSE)
I found Convert Boolean indicator columns to a single factor column, but thought there might be a simpler solution.
You can try:
icols$collapsed <- apply(icols, 1, function(x) paste0(names(icols)[x == 1], collapse = ", "))
icols
  delivery_group culturally_tailored integrated_intervention                           collapsed
1              0                   0                       1             integrated_intervention
2              1                   0                       0                      delivery_group
3              1                   1                       0 delivery_group, culturally_tailored
4              0                   0                       0
5              0                   1                       0                 culturally_tailored
Or, even more compactly as Maurits suggested:
apply(icols, 1, function(x) toString(names(icols)[x == 1]))
I'm not sure this is a "simple" solution, but here is a solution using the tidyverse.
library(tidyverse)

icols <- tibble(
  delivery_group = c(0, 1, 1, 0, 0),
  culturally_tailored = c(0, 0, 1, 0, 1),
  integrated_intervention = c(1, 0, 0, 0, 0)
)

icols %>%
  rowid_to_column(var = "rowid") %>%
  gather(key = "qualifiers", value = "indicator", -rowid) %>%
  filter(indicator == 1) %>%
  group_by(rowid) %>%
  summarize(qualifiers = paste(qualifiers, collapse = ", ")) %>%
  ungroup() %>%
  complete(rowid = 1:nrow(icols)) %>%
  select(qualifiers)
#> # A tibble: 5 x 1
#> qualifiers
#> <chr>
#> 1 integrated_intervention
#> 2 delivery_group
#> 3 delivery_group, culturally_tailored
#> 4 <NA>
#> 5 culturally_tailored
Created on 2019-02-27 by the reprex package (v0.2.1)
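Note that gather() is superseded in current tidyr; the same approach can be sketched with pivot_longer() (assuming tidyr >= 1.0.0, and using the toString() shortcut mentioned above):

```r
library(tidyverse)

icols <- tibble(
  delivery_group = c(0, 1, 1, 0, 0),
  culturally_tailored = c(0, 0, 1, 0, 1),
  integrated_intervention = c(1, 0, 0, 0, 0)
)

# Pivot to long form, keep the 1s, collapse names per row, and restore
# rows that had no indicator set (they come back as NA via complete()).
res <- icols %>%
  rowid_to_column("rowid") %>%
  pivot_longer(-rowid, names_to = "qualifiers", values_to = "indicator") %>%
  filter(indicator == 1) %>%
  group_by(rowid) %>%
  summarize(qualifiers = toString(qualifiers)) %>%
  complete(rowid = 1:nrow(icols)) %>%
  pull(qualifiers)
res
```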
Here's a crazy way:
library(tidyverse)
icols %>%
  mutate(qualifiers = case_when(
    delivery_group == 1 & culturally_tailored == 1 ~ "delivery_group, culturally_tailored",
    delivery_group == 1 & integrated_intervention == 1 ~ "delivery_group, integrated_intervention",
    culturally_tailored == 1 & integrated_intervention == 1 ~ "culturally_tailored, integrated_intervention",
    culturally_tailored == 1 ~ "culturally_tailored",
    integrated_intervention == 1 ~ "integrated_intervention",
    delivery_group == 1 ~ "delivery_group"))
# A tibble: 5 x 4
  delivery_group culturally_tailored integrated_intervention qualifiers
           <dbl>               <dbl>                   <dbl> <chr>
1              0                   0                       1 integrated_intervention
2              1                   0                       0 delivery_group
3              1                   1                       0 delivery_group, culturally_tailored
4              0                   0                       0 NA
5              0                   1                       0 culturally_tailored
