Conditions based on dynamic column names in R

I have the following data:
name | product1_flag1 | product1_flag2 | product1_flag3 | product2_flag1 | product2_flag2 | product2_flag3
lmn | 0 | 1 | 0 | 1 | 0 | 1
Here, the product names and the number of products are dynamic. For each name, I want to create a new column Product1_Final_Flag based on that product's flag values, like: if ((flag1 = 1 or flag2 = 0) and flag3 = 1) then "1" else "0".
The expected output is as follows:
name | Product1_final_Flag | Product2_final_Flag
lmn | 0 | 1
How should I achieve the same?

Using DF shown reproducibly in the Note at the end, convert to long form having columns name, product, flag and value. Then convert to wide form having columns name, product, flag1, flag2 and flag3. Compute flag, append "_final_flag" to the product and select desired columns. Finally use pivot_wider to produce products along the top.
library(dplyr)
library(tidyr)
DF %>%
pivot_longer(-1, names_to = c("product", "flag"), names_sep = "_") %>%
pivot_wider(names_from = "flag") %>%
mutate(flag = (flag1 | !flag2) * flag3,
product = paste0(product, "_final_flag")) %>%
select(name, product, flag) %>%
pivot_wider(names_from = "product", values_from = "flag")
## # A tibble: 1 x 3
## name product1_final_flag product2_final_flag
## <chr> <int> <int>
## 1 lmn 0 1
Note
DF is shown in a reproducible manner here:
DF <- structure(list(name = "lmn", product1_flag1 = 0L, product1_flag2 = 1L,
product1_flag3 = 0L, product2_flag1 = 1L, product2_flag2 = 0L,
product2_flag3 = 1L), class = "data.frame", row.names = c(NA, -1L))
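For comparison, here is a minimal base R sketch of the same rule, without reshaping. It assumes every product has exactly the three columns `<product>_flag1`, `<product>_flag2` and `<product>_flag3`; the helper names `products`, `final` and `result` are made up for illustration.

```r
# Same data as in the Note above
DF <- data.frame(name = "lmn", product1_flag1 = 0L, product1_flag2 = 1L,
                 product1_flag3 = 0L, product2_flag1 = 1L, product2_flag2 = 0L,
                 product2_flag3 = 1L)

# Discover the product prefixes from the column names (assumed naming scheme)
products <- unique(sub("_flag[0-9]+$", "", names(DF)[-1]))

# Apply (flag1 == 1 or flag2 == 0) and flag3 == 1 to each product
final <- lapply(products, function(p) {
  f1 <- DF[[paste0(p, "_flag1")]]
  f2 <- DF[[paste0(p, "_flag2")]]
  f3 <- DF[[paste0(p, "_flag3")]]
  as.integer((f1 == 1 | f2 == 0) & f3 == 1)
})
names(final) <- paste0(products, "_final_flag")

# One final-flag column per product, next to the name column
result <- cbind(DF["name"], as.data.frame(final))
result
```

Because the prefixes are read from the column names, adding a product4_* set of columns should need no change to the code, as long as the naming assumption holds.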

Related

Use list names inside purrr:::map_dfr function

I was trying something relatively simple, but having some struggles. Let's say I have two dataframes df1 and df2:
df1:
id expenditure
1 10
2 20
1 30
2 50
df2:
id expenditure
1 30
2 50
1 60
2 10
I also added them to a named list:
table_list = list()
table_list[['a']] = df1
table_list[['b']] = df2
And now I want to perform some summary operation through a function and then bind those rows:
get_summary = function(table){
final_table = table %>% group_by(id) %>% summarise(total_expenditure= sum(expenditure))
}
And then apply this through map_dfr:
summary = table_list %>% map_dfr(get_summary, .id='origin_table')
This creates almost what I'm looking for:
origin_table id total_expenditure
a 1 40
a 2 70
b 1 90
b 2 60
But, what if I would like to do something specific depending on the element of the list that is being passed, something like this:
get_summary = function(table, name){
dummy_list = c(TRUE, FALSE)
names(dummy_list) = c('a', 'b')
final_table = table %>% group_by(id) %>% summarise(total_expenditure= sum(expenditure))
is_true = dummy_list[[name]] # Want to use the original name to call another list
if(is_true) final_table = final_table %>% mutate(total_expenditure = total_expenditure + 1)
return(final_table)
}
This would bring something like this:
origin_table id total_expenditure
a 1 41
a 2 71
b 1 90
b 2 60
So is there any way to use list names inside my function? Or any way to identify which element of my list I'm working with? Maybe map_dfr is too restricted and I have to use something else?
Edit: changed example so it is more grounded in reality
Instead of map, use imap, which passes each element to the function as .x and its name as .y:
library(purrr)
library(dplyr)
get_summary = function(dat, name){
dat %>%
group_by(id) %>%
summarise(total_expenditure= sum(expenditure, na.rm = TRUE),
.groups = "drop") %>%
mutate(total_expenditure = if(name=='a')
total_expenditure + 1 else total_expenditure)
}
Testing:
> table_list %>%
imap_dfr(~ get_summary(.x, name = .y), .id = 'origin_table')
# A tibble: 4 × 3
origin_table id total_expenditure
<chr> <int> <dbl>
1 a 1 41
2 a 2 71
3 b 1 90
4 b 2 60
data
table_list <- list(a = structure(list(id = c(1L, 2L, 1L, 2L),
expenditure = c(10L,
20L, 30L, 50L)), class = "data.frame", row.names = c(NA, -4L)),
b = structure(list(id = c(1L, 2L, 1L, 2L), expenditure = c(30L,
50L, 60L, 10L)), class = "data.frame", row.names = c(NA,
-4L)))
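The question's original idea of looking the name up in a named vector can be combined with imap. The following is only a sketch: `add_one` is a hypothetical lookup vector standing in for the question's dummy_list, and imap plus bind_rows is used in place of imap_dfr.

```r
library(purrr)
library(dplyr)

table_list <- list(
  a = data.frame(id = c(1L, 2L, 1L, 2L), expenditure = c(10L, 20L, 30L, 50L)),
  b = data.frame(id = c(1L, 2L, 1L, 2L), expenditure = c(30L, 50L, 60L, 10L))
)

# Hypothetical lookup: which tables get the +1 adjustment
add_one <- c(a = TRUE, b = FALSE)

get_summary <- function(dat, name) {
  out <- dat %>%
    group_by(id) %>%
    summarise(total_expenditure = sum(expenditure), .groups = "drop")
  # `name` is the list name supplied by imap as its second argument
  if (add_one[[name]]) {
    out <- out %>% mutate(total_expenditure = total_expenditure + 1)
  }
  out
}

result <- table_list %>%
  imap(get_summary) %>%              # calls get_summary(element, name)
  bind_rows(.id = "origin_table")    # list names become the id column
result
```

Keeping the per-table behaviour in a lookup vector means a new table only needs a new entry in `add_one`, not another branch inside the function.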
I managed to do it by adding origin_table as a column on the data frames beforehand:
df1 = df1 %>% mutate(origin_table = 'a')
df2 = df2 %>% mutate(origin_table = 'b')
Then I can extract the origin by doing the following:
get_summary = function(table){
dummy_list = c(TRUE, FALSE)
names(dummy_list) = c('a', 'b')
origin = table %>% distinct(origin_table) %>% pull
final_table = table %>% group_by(id) %>% summarise(total_expenditure= sum(expenditure))
is_true = dummy_list[[origin ]] # Want to use the original name to call another list
if(is_true) final_table = final_table %>% mutate(total_expenditure = total_expenditure + 1)
return(final_table)
}

Finding maximum difference between columns of same name in R

I have the following table in R. I have 2 A columns, 3 B columns and 1 C column. I need to calculate the maximum difference possible between any columns of the same name and return the column name as output.
For row 1
The max difference between A is 2
The max difference between B is 4
I need the output as B
For row 2
The max difference between A is 3
The max difference between B is 2
I need the output as A
| A  | A | B | B | B | C |
| 2  | 4 | 5 | 2 | 1 | 0 |
| -3 | 0 | 2 | 3 | 4 | 2 |
First of all, it is risky (and in some cases not allowed) to have non-unique column names, so the first step is to make the names unique with base::make.unique(). Next, tidyr::pivot_longer() moves the grouping information held in the column names into a column, where it is easier to work with; the regex passed to names_pattern discards the differentiating suffixes, so columns from the same group share a name again. Then dplyr::group_by() followed by dplyr::summarize() computes the largest difference for each id and grp, which correspond to the rows and the same-named columns of the original data. Finally, dplyr::slice_max() keeps only the group with the largest difference for each id.
library(tidyverse)
d <- structure(list(A = c(2L, -3L), A = c(4L, 0L), B = c(5L, 2L), B = 2:3, B = c(1L, 4L), C = c(0L, 2L)), row.names = c(NA, -2L), class = "data.frame")
# give unique names
names(d) <- make.unique(names(d), sep = "_")
d %>%
mutate(id = row_number()) %>%
pivot_longer(-id, names_to = "grp", names_pattern = "([A-Z])*") %>%
group_by(id, grp) %>%
summarise(max_diff = max(value) - min(value)) %>%
slice_max(order_by = max_diff, n = 1, with_ties = F)
#> `summarise()` has grouped output by 'id'. You can override using the `.groups` argument.
#> # A tibble: 2 x 3
#> # Groups: id [2]
#> id grp max_diff
#> <int> <chr> <int>
#> 1 1 B 4
#> 2 2 A 3
Created on 2022-02-14 by the reprex package (v2.0.1)
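To see what the renaming step does to the duplicated names, here is a small base R sketch. The sub() call is only a stand-in for the names_pattern regex used above, chosen to show how the group letter is recovered; it is not the same call.

```r
# Duplicated column names as in the question
nms <- c("A", "A", "B", "B", "B", "C")

# make.unique() appends a suffix to the repeats only
unique_nms <- make.unique(nms, sep = "_")
unique_nms
# "A" "A_1" "B" "B_1" "B_2" "C"

# Dropping everything from the underscore onwards recovers the groups
grp <- sub("_.*$", "", unique_nms)
grp
# "A" "A" "B" "B" "B" "C"
```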
Here is a base R option using aggregate + range + diff + which.max:
df$max_diff <- with(
p <- aggregate(
. ~ id,
cbind(id = names(df), as.data.frame(t(df))),
function(v) diff(range(v))
),
id[sapply(p[-1],which.max)]
)
which gives
> df
A A B B B C max_diff
1 2 4 5 2 1 0 B
2 -3 0 2 3 4 2 A
data
> dput(df)
structure(list(A = c(2L, -3L), A = c(4L, 0L), B = c(5L, 2L),
B = 2:3, B = c(1L, 4L), C = c(0L, 2L), max_diff = c("B",
"A")), row.names = c(NA, -2L), class = "data.frame")
We may also use split.default to split the columns by name, and then use max.col to find the index of the largest difference in each row:
m1 <- sapply(split.default(df, names(df)), \(x)
apply(x, 1, \(u) diff(range(u))))
df$max_diff <- colnames(m1)[max.col(m1, "first")]
df$max_diff
[1] "B" "A"

Computing Percentages of each Subgroup

This question has been answered before, but the solutions are not working for my particular situation.
col1 | col2
A | 0
B | 1
A | 0
A | 1
B | 0
I'm basically looking for this:
col1 | col2 | Percentage
A | 0 | 0.67
A | 1 | 0.33
B | 0 | 0.50
B | 1 | 0.50
Both columns are factors. The following solution is what I keep finding on other threads:
df %>% group_by(col1, col2) %>% summarise(n=n()) %>% mutate(freq = n / sum(n))
or something along those lines.
In fact, group_by doesn't really seem to be doing anything at all. It's not giving me an 'n' or 'freq' column. Don't know what I'm doing wrong. Is it because I'm working with factors? Also, if it's not obvious, the values provided in the columns are hypothetical.
One option is to count the rows after grouping by 'col1', then add 'col2' as a second grouping column and divide each subgroup's count by the count already stored for 'col1':
library(dplyr)
df %>%
group_by(col1) %>%
mutate(n = n()) %>%
group_by(col2, .add = TRUE) %>% # dplyr >= 1.0.0; older versions use add = TRUE
summarise(freq = n()/n[1])
# A tibble: 4 x 3
# Groups: col1 [2]
# col1 col2 freq
# <chr> <int> <dbl>
#1 A 0 0.667
#2 A 1 0.333
#3 B 0 0.5
#4 B 1 0.5
data
df <- structure(list(col1 = c("A", "B", "A", "A", "B"), col2 = c(0L,
1L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA, -5L
))
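The same within-group proportions can also be sketched in base R with table() and prop.table(), which may help rule out package conflicts as the cause of the problem. The object names `counts`, `pct` and `pct_long` are illustrative.

```r
df <- data.frame(col1 = c("A", "B", "A", "A", "B"),
                 col2 = c(0L, 1L, 0L, 1L, 0L))

# Counts of each col2 value within each col1 value
counts <- table(col1 = df$col1, col2 = df$col2)

# margin = 1 divides every row by its row total
pct <- prop.table(counts, margin = 1)

# Long format with one row per (col1, col2) pair
pct_long <- as.data.frame(pct, responseName = "Percentage")
pct_long
```

Note that as.data.frame() on a table lists every col1/col2 pair, including pairs that never occur (they get a proportion of 0).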

How to find all combinations in column and count occurrences in data

I am trying to find all actual combinations within my data of values in column 1.
I then want to count all occurrences of these by column 2.
It feels like R should be able to do this fairly quickly. I tried reading up on combn and expand.grid, but with no success. The main problem was I could not find any guidance on how to generate combinations within a column.
My data looks like:
Animal (n=57) | Person ID (n=1000)
Dog | 0001
Cat | 0004
Bird | 0001
Snake | 0002
Spider | 0002
Cat | 0003
Dog | 0004
Expected output is:
AnimalComb | CountbyID
Cat | 1
DogBird | 1
SnakeSpider | 1
CatDog | 1
EDIT deleted an erroneous entry for cat
If I have understood you correctly, you need to group_by PersonID, paste together all the unique Animals in each group, and count the occurrences of that combination, which can be done by dividing the number of rows in the group (n()) by the number of distinct values (n_distinct).
library(dplyr)
df %>%
group_by(PersonID) %>%
summarise(AnimalComb = paste(unique(Animal), collapse = ""),
CountbyID = n() / n_distinct(Animal))
# PersonID AnimalComb CountbyID
# <int> <chr> <dbl>
#1 1 DogBird 1
#2 2 SnakeSpider 1
#3 3 Cat 1
#4 4 CatDog 1
An option using data.table
library(data.table)
setDT(df)[, .(AnimalComb = toString(unique(Animal)),
CountbyID = .N/uniqueN(Animal)), by = PersonID]
data
df <- structure(list(Animal = c("Dog", "Cat", "Bird", "Snake", "Spider",
"Cat", "Dog"), PersonID = c(1L, 4L, 1L, 2L, 2L, 3L, 4L)),
class = "data.frame", row.names = c(NA, -7L))
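If the goal is instead to count how many persons share each combination, a base R sketch could look like the following. It assumes that the order of animals within a person does not matter, which is why the names are sorted before pasting; `combo` and `combo_counts` are illustrative names.

```r
df <- data.frame(
  Animal = c("Dog", "Cat", "Bird", "Snake", "Spider", "Cat", "Dog"),
  PersonID = c(1L, 4L, 1L, 2L, 2L, 3L, 4L)
)

# One combination string per person; sorting makes "DogBird" and
# "BirdDog" count as the same combination (an assumption)
combo <- tapply(df$Animal, df$PersonID,
                function(a) paste(sort(unique(a)), collapse = ""))

# Number of persons having each combination
combo_counts <- as.data.frame(table(AnimalComb = as.vector(combo)),
                              responseName = "CountbyID")
combo_counts
```

With the sample data every combination belongs to exactly one person, so every count is 1; a second person owning a cat and a dog would raise the CatDog count to 2.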

Embed nested list of data.frames in R

Setup:
I have a tibble (named data) with an embedded list of data.frames.
df1 <- data.frame(name = c("columnName1","columnName2","columnName3"),
value = c("yes", 1L, 0L),
stringsAsFactors = F)
df2 <- data.frame(name = c("columnName1","columnName2","columnName3"),
value = c("no", 1L, 1L),
stringsAsFactors = F)
df3 <- data.frame(name = c("columnName1","columnName2","columnName3"),
value = c("yes", 0L, 0L),
stringsAsFactors = F)
responses = list(df1,
df2,
df3)
data <- tibble(ids = c(23L, 42L, 84L),
responses = responses)
Note this is a simplified example of the data. The original data is from a flat json file and loaded with jsonlite::stream_in() function.
Objective:
My goal is to convert this tibble to another tibble where the embedded data.frames are spread (transposed) as columns; for example, my goal tibble is:
goal <- tibble(ids = c(23L, 42L, 84L),
columnName1 = c("yes","no","yes"),
columnName2 = c(1L, 1L, 0L),
columnName3 = c(0L, 1L, 0L))
# goal tibble
> goal
# A tibble: 3 x 4
ids columnName1 columnName2 columnName3
<int> <chr> <int> <int>
1 23 yes 1 0
2 42 no 1 1
3 84 yes 0 0
My inelegant solution:
Use dplyr::bind_rows() and tidyr::spread():
rdf <- dplyr::bind_rows(data$responses, .id = "id") %>%
tidyr::spread(key = "name", -id)
goal2 <- cbind(ids = data$ids, rdf[,-1]) %>%
as.tibble()
Comparing my solution to the goal:
# produced tibble
> goal2
# A tibble: 3 x 4
ids columnName1 columnName2 columnName3
* <int> <chr> <chr> <chr>
1 23 yes 1 0
2 42 no 1 1
3 84 yes 0 0
Overall, my solution works but has a few problems:
I don't know how to pass the unique ids through bind_rows() which forces me to create a dummy id ("id") which can't match to the original id ("ids"). This forces me to use a cbind() (which I don't like) and manually remove the dummy id (using -1 slicing on rdf).
The column types are lost, because my approach converts the integer columns to character.
Any suggestions on how to improve my solution (especially using tidyverse based packages like tidyjson or tidyr)?
We can loop over the 'responses' column with map, spread each data.frame to 'wide' format with convert = TRUE so that the column types are converted, store the result as a list column with transmute, and then unnest it:
library(tidyverse)
data %>%
transmute(ids, ind = map(responses, ~.x %>%
spread(name, value, convert = TRUE))) %>%
unnest(ind)
# A tibble: 3 x 4
# ids columnName1 columnName2 columnName3
# <int> <chr> <int> <int>
#1 23 yes 1 0
#2 42 no 1 1
#3 84 yes 0 0
Or using the OP's code, we set the names of the list with 'ids' column, do the bind_rows and then spread
bind_rows(setNames(data$responses, data$ids), .id = 'ids') %>%
spread(name, value, convert = TRUE)
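With newer versions of tidyr, the same reshaping can be sketched with unnest() and pivot_wider(). This is an alternative under stated assumptions rather than the approach from the answers above: it relies on utils::type.convert() to restore the column types, and the sample data is rebuilt here with the values already stored as strings.

```r
library(dplyr)
library(tidyr)
library(tibble)

nm <- c("columnName1", "columnName2", "columnName3")
data <- tibble(
  ids = c(23L, 42L, 84L),
  responses = list(
    data.frame(name = nm, value = c("yes", "1", "0"), stringsAsFactors = FALSE),
    data.frame(name = nm, value = c("no", "1", "1"), stringsAsFactors = FALSE),
    data.frame(name = nm, value = c("yes", "0", "0"), stringsAsFactors = FALSE)
  )
)

result <- data %>%
  unnest(responses) %>%                                  # one row per id and name
  pivot_wider(names_from = name, values_from = value) %>%
  # values arrive as character; re-infer the type of each column
  mutate(across(-ids, ~ type.convert(.x, as.is = TRUE)))
result
```

Because ids travels along through unnest(), no dummy id or cbind() is needed.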
