I have a data frame in R containing information about clients' purchasing history over the last year. The data frame looks something like this:
Client | Prod A | Prod B | Prod C
---------------------------------
A | 1 | 0 | 1
B | 1 | 1 | 0
C | 1 | 0 | 1
D | 0 | 0 | 1
E | 1 | 0 | 0
---------------------------------
Here 1 means the client has purchased the product at some point and 0 means they haven't bought it at all.
In this particular table the most frequent combination is Product A and Product C with 2 cases out of 5.
I want to find a method/function that will get me the most common combination of products for a data frame of this type, whatever its dimensions.
Thanks in advance for your help.
With your data in a data frame dat (the Client column dropped via dat[,-1]), a base R option is xtabs:
res <- as.data.frame(xtabs(~., data = dat[,-1]))
res
# Prod.A Prod.B Prod.C Freq
# 1 0 0 0 0
# 2 1 0 0 1
# 3 0 1 0 0
# 4 1 1 0 1
# 5 0 0 1 1
# 6 1 0 1 2
# 7 0 1 1 0
# 8 1 1 1 0
From this you can see the counts of combinations, the "max" of which is
subset(res, Freq == max(Freq))
# Prod.A Prod.B Prod.C Freq
# 6 1 0 1 2
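If you also want just the product names in that winning combination, one possible extra step (not part of the original answer) is:
best <- subset(res, Freq == max(Freq))
prods <- setdiff(names(best), "Freq")
# the indicator columns from as.data.frame(xtabs(...)) are factors with
# levels "0"/"1", so the comparison below works on those levels
prods[best[1, prods] == 1]
# [1] "Prod.A" "Prod.C"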
With your data frame in df:
aggregate(Client ~ Prod.A + Prod.B + Prod.C, df, length)
Prod.A Prod.B Prod.C Client
1 1 0 0 1
2 1 1 0 1
3 0 0 1 1
4 1 0 1 2
The last column, Client, gives the count.
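If you then want only the most frequent combination, one small extra step (not part of the original answer) is to store the aggregate result and keep the row with the largest count:
counts <- aggregate(Client ~ Prod.A + Prod.B + Prod.C, df, length)
# which.max() returns the first maximum; with ties you may prefer
# counts[counts$Client == max(counts$Client), ]
counts[which.max(counts$Client), ]
#   Prod.A Prod.B Prod.C Client
# 4      1      0      1      2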
Solution using dplyr
library(dplyr)
df <- data.frame(Client = c("A","B","C","D","E"),
                 `Prod A` = c(1,1,1,0,1),
                 `Prod B` = c(0,1,0,0,0),
                 `Prod C` = c(1,0,1,1,0))
# note: data.frame() converts the backticked names to Prod.A, Prod.B, Prod.C
# (check.names = TRUE by default), which is why group_by(Prod.A, ...) works below
df %>%
  dplyr::group_by(Prod.A, Prod.B, Prod.C) %>%
  dplyr::summarise(count = n())
# A tibble: 4 x 4
# Groups: Prod.A, Prod.B [3]
Prod.A Prod.B Prod.C count
<dbl> <dbl> <dbl> <int>
1 0 0 1 1
2 1 0 0 1
3 1 0 1 2
4 1 1 0 1
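If only the most frequent combination is wanted, a small extension of the above (not part of the original answer) is to ungroup and filter on the maximum count:
df %>%
  dplyr::group_by(Prod.A, Prod.B, Prod.C) %>%
  dplyr::summarise(count = n()) %>%
  dplyr::ungroup() %>%
  dplyr::filter(count == max(count))
# A tibble: 1 x 4
#   Prod.A Prod.B Prod.C count
#    <dbl>  <dbl>  <dbl> <int>
# 1      1      0      1     2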
library(dplyr)
df <- data.frame(Client = c("A", "B", "C", "D", "E"),
                 `Prod A` = c(1, 1, 1, 0, 1),
                 `Prod B` = c(0, 1, 0, 0, 0),
                 `Prod C` = c(1, 0, 1, 1, 0))
df %>%
  rowwise() %>%
  mutate(length = sum(Prod.A, Prod.B, Prod.C)) %>%
  group_by(Prod.A, Prod.B, Prod.C) %>%
  mutate(count = n()) %>%
  ungroup() %>%
  filter(count == max(count) & length > 1) %>%
  select(1:4)
which will produce:
Client Prod.A Prod.B Prod.C
<chr> <dbl> <dbl> <dbl>
1 A 1 0 1
2 C 1 0 1
Related
I have a long dataset with students' grades and subjects. I want to keep a long dataset, but I want to add a column that tells me how many Fs a student had in their humanities courses (English and History) and their STEM courses (Biology and Math). I also want the same for Ds, Cs, Bs, and As.
I know I could explicitly spell this out, but in the future, they might have other subjects (like adding Chemistry to STEM) or completely different categories, like Foreign Languages, so I want it to be scalable.
I know how to get all combinations of columns, and I know how to do to each part manually--but I don't know how to combine the two. Any help would be greatly appreciated!
#Sample data
library(tidyverse)
student_grades <- tibble(student_id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5),
subject = c(rep(c("english", "biology", "math", "history"), 4), NA, "biology"),
grade = c(1, 2, 3, 4, 5, 4, 3, 2, 2, 4, 1, 1, 1, 1, 2, 3, 3, 4))
#All combinations of grades and subjects
all_subject_combos <- c("eng|his", "bio|math")
all_grades <- c("F", "D", "C", "B", "A")
subjects_and_letter_grades <- expand.grid(all_subject_combos, all_grades)
all_combos <- subjects_and_letter_grades %>%
unite("names", c(Var1, Var2)) %>%
mutate(names = str_replace_all(names, "\\|", "_")) %>%
pull(names)
#Manual generation of numbers of Fs by subject
#This is what I want the results to look like, but with all other letter grades
student_grades %>%
  group_by(student_id) %>%
  mutate(eng_his_F = sum(case_when(
           str_detect(subject, "eng|his") & grade == 1 ~ 1,
           TRUE ~ 0), na.rm = TRUE),
         bio_math_F = sum(case_when(
           str_detect(subject, "bio|math") & grade == 1 ~ 1,
           TRUE ~ 0), na.rm = TRUE)) %>%
  ungroup()
Ideally, this would be scalable for any number of subject combos and wouldn't require me to write out the same code for Ds, Cs, Bs, and As. Thank you!
We may loop over the all_combos vector with map. Within each iteration, group by 'student_id' (this could also be done outside the loop, storing the grouped data in an object to use here), create the new column, named after the looped value, by evaluating (!!) with the := operator on the sum of the case_when output, and finally bind the result with the original data.
library(dplyr)
library(purrr)
library(stringr)
map_dfc(all_combos, ~ student_grades %>%
    group_by(student_id) %>%
    transmute(!! .x := sum(case_when(
      str_detect(subject, str_replace(.x, "(\\w+)_(\\w+)_.", "\\1|\\2")) &
        grade == match(str_extract(.x, ".$"), all_grades) ~ 1,
      TRUE ~ 0))) %>%
    ungroup %>%
    dplyr::select(-student_id)) %>%
  bind_cols(student_grades, .)
Output:
# A tibble: 18 × 13
student_id subject grade eng_his_F bio_math_F eng_his_D bio_math_D eng_his_C bio_math_C eng_his_B bio_math_B eng_hi…¹ bio_m…²
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 english 1 1 0 0 1 0 1 1 0 0 0
2 1 biology 2 1 0 0 1 0 1 1 0 0 0
3 1 math 3 1 0 0 1 0 1 1 0 0 0
4 1 history 4 1 0 0 1 0 1 1 0 0 0
5 2 english 5 0 0 1 0 0 1 0 1 1 0
6 2 biology 4 0 0 1 0 0 1 0 1 1 0
7 2 math 3 0 0 1 0 0 1 0 1 1 0
8 2 history 2 0 0 1 0 0 1 0 1 1 0
9 3 english 2 1 1 1 0 0 0 0 1 0 0
10 3 biology 4 1 1 1 0 0 0 0 1 0 0
11 3 math 1 1 1 1 0 0 0 0 1 0 0
12 3 history 1 1 1 1 0 0 0 0 1 0 0
13 4 english 1 1 1 0 1 1 0 0 0 0 0
14 4 biology 1 1 1 0 1 1 0 0 0 0 0
15 4 math 2 1 1 0 1 1 0 0 0 0 0
16 4 history 3 1 1 0 1 1 0 0 0 0 0
17 5 <NA> 3 0 0 0 0 0 0 0 1 0 0
18 5 biology 4 0 0 0 0 0 0 0 1 0 0
# … with abbreviated variable names ¹eng_his_A, ²bio_math_A
Here's another way of looking at it. I use a small mapping table (subject_to_field) which maps the subject to its field (english -> humanities, math -> STEM, etc.). I thought that this may help scalability. You need to maintain this table as subjects are added or removed.
The left_join then combines the field with the student_grades tibble.
Adding the column "grade2" is not needed but improves readability.
Finally, all we need to do is perform the appropriate grouping and counting.
In this approach you will not get a zero tally for grades that do not occur for a student; a possible workaround is sketched after the output below.
library(tidyverse)
student_grades <- tibble(student_id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5),
subject = c(rep(c("english", "biology", "math", "history"), 4), NA, "biology"),
grade = c(1, 2, 3, 4, 5, 4, 3, 2, 2, 4, 1, 1, 1, 1, 2, 3, 3, 4))
student_grades <- student_grades %>%
  mutate(grade2 = case_when(
    grade == 1 ~ "A",
    grade == 2 ~ "B",
    grade == 3 ~ "C",
    grade == 4 ~ "D",
    grade == 5 ~ "F"))
subject_to_field <- tibble(
subject = c("biology", "english", "history", "math"),
field = c("STEM", "Humanities", "Humanities", "STEM")
)
student_grades <- student_grades %>%
left_join(subject_to_field, by = c("subject" = "subject"))
student_summary <- student_grades %>%
group_by(student_id, field, subject, grade2) %>%
summarise(count = n())
Which will give you this output:
> student_summary
# A tibble: 18 × 5
# Groups: student_id, field, subject [18]
student_id field subject grade2 count
<dbl> <chr> <chr> <chr> <int>
1 1 Humanities english A 1
2 1 Humanities history D 1
3 1 STEM biology B 1
4 1 STEM math C 1
5 2 Humanities english F 1
6 2 Humanities history B 1
7 2 STEM biology D 1
8 2 STEM math C 1
9 3 Humanities english B 1
10 3 Humanities history A 1
11 3 STEM biology D 1
12 3 STEM math A 1
13 4 Humanities english A 1
14 4 Humanities history C 1
15 4 STEM biology A 1
16 4 STEM math B 1
17 5 STEM biology D 1
18 5 NA NA C 1
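If zero tallies are needed as well, one possible sketch (not part of the original answer; it assumes tidyr is attached via library(tidyverse) above) is to expand the summary with complete():
student_summary %>%
  ungroup() %>%
  complete(student_id,
           nesting(field, subject),
           grade2 = c("A", "B", "C", "D", "F"),
           fill = list(count = 0))
Note that this also creates zero rows for subjects a student never took; drop those afterwards if they are not wanted.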
Background
I've got a dataframe df:
df <- data.frame(task = c("a","b","c", "d","e"),
rater_1 = c(1,0,1,0,0),
rater_2 = c(1,0,1,1,1),
rater_3 = c(1,0,0,0,0),
stringsAsFactors=FALSE)
> df
task rater_1 rater_2 rater_3
1 a 1 1 1
2 b 0 0 0
3 c 1 1 0
4 d 0 1 0
5 e 0 1 0
Raters are given rating tasks about the quality of a product -- if the thing they're rating is of good quality, it gets a 1; if not, it gets a 0.
The problem
I'd like to get R to break down how many of the 5 tasks rated had 3/3 raters mark 1, how many had 2/3 raters mark 1, etc. And to include a percent, too.
I'm looking for something like this:
raters count percent
3 of 3 1 20.0
2 of 3 1 20.0
1 of 3 2 40.0
0 of 3 1 20.0
What I've tried
I've managed to get dplyr to sum across rows, but I can't then collapse it all like I want:
df %>%
mutate(sum1 = rowSums(across(where(is.numeric))))
task rater_1 rater_2 rater_3 sum1
1 a 1 1 1 3
2 b 0 0 0 0
3 c 1 1 0 2
4 d 0 1 0 1
5 e 0 1 0 1
I think I'm overthinking things, but I'm running on little sleep and am missing several billion neurons. Thanks.
Perhaps something like this?
library(dplyr)
df <- data.frame(task = c("a", "b", "c", "d", "e"),
rater_1 = c(1,0,1,0,0),
rater_2 = c(1,0,1,1,1),
rater_3 = c(1,0,0,0,0))
df |>
  mutate(count = rowSums(across(where(is.numeric)))) |>
  group_by(count) |>
  summarize(pct = n() / nrow(df))
#> # A tibble: 4 x 2
#> count pct
#> <dbl> <dbl>
#> 1 0 0.2
#> 2 1 0.4
#> 3 2 0.2
#> 4 3 0.2
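If the output should also show the "N of 3" labels and percentages from the question, one possible follow-up (a sketch, not part of the original answer; it assumes the rater columns are the only numeric ones and are named rater_*) is:
library(dplyr)

n_raters <- sum(grepl("^rater", names(df)))  # number of rater columns

df |>
  mutate(n_ones = rowSums(across(where(is.numeric)))) |>
  count(n_ones) |>
  arrange(desc(n_ones)) |>
  transmute(raters = paste(n_ones, "of", n_raters),
            count = n,
            percent = 100 * n / sum(n))
#   raters count percent
# 1 3 of 3     1      20
# 2 2 of 3     1      20
# 3 1 of 3     2      40
# 4 0 of 3     1      20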
I have some data similar to that below.
df <- data.frame(id = 1:5, tags = c("A,B,AB,C", "C", "AB,E", NA, "B,C"))
df
# id tags
# 1 1 A,B,AB,C
# 2 2 C
# 3 3 AB,E
# 4 4 <NA>
# 5 5 B,C
I'd like to create a new dummy variable column for each tag in the "tags" column, resulting in a dataframe like the following:
correct_df <- data.frame(id = 1:5,
tags = c("A,B,AB,C", "C", "AB,E", NA, "B,C"),
A = c(1, 0, 0, 0, 0),
B = c(1, 0, 0, 0, 1),
C = c(1, 1, 0, 0, 1),
E = c(0, 0, 1, 0, 0),
AB = c(1, 0, 1, 0, 0)
)
correct_df
# id tags A B C E AB
# 1 1 A,B,AB,C 1 1 1 0 1
# 2 2 C 0 0 1 0 0
# 3 3 AB,E 0 0 0 1 1
# 4 4 <NA> 0 0 0 0 0
# 5 5 B,C 0 1 1 0 0
One of the challenges is ensuring that the "A" column has 1 only for the "A" tag, so that it doesn't have 1 for the "AB" tag, for example. The following won't work for this reason, since "A" gets 1 for the "AB" tag:
df <- df %>%
mutate(A = ifelse(grepl("A", tags, fixed = T), 1, 0))
df
# id tags A
# 1 1 A,B,AB,C 1
# 2 2 C 0
# 3 3 AB,E 1 < Incorrect
# 4 4 <NA> 0
# 5 5 B,C 0
Another challenge is doing this programmatically. I can probably deal with a solution that manually creates a column for each tag, but a solution that doesn't assume which tag columns need to be created beforehand is best, since there can potentially be many different tags. Is there some relatively simple solution that I'm overlooking?
Does this work:
library(tidyr)
library(dplyr)

df %>%
  separate_rows(tags) %>%
  mutate(A = case_when(tags == 'A' ~ 1, TRUE ~ 0),
         B = case_when(tags == 'B' ~ 1, TRUE ~ 0),
         C = case_when(tags == 'C' ~ 1, TRUE ~ 0),
         E = case_when(tags == 'E' ~ 1, TRUE ~ 0),
         AB = case_when(tags == 'AB' ~ 1, TRUE ~ 0)) %>%
  group_by(id) %>%
  mutate(tags = toString(tags)) %>%
  group_by(id, tags) %>%
  summarise(across(A:AB, sum))
`summarise()` regrouping output by 'id' (override with `.groups` argument)
# A tibble: 5 x 7
# Groups: id [5]
id tags A B C E AB
<int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 A, B, AB, C 1 1 1 0 1
2 2 C 0 0 1 0 0
3 3 AB, E 0 0 0 1 1
4 4 NA 0 0 0 0 0
5 5 B, C 0 1 1 0 0
Here's a solution:
library(dplyr)
library(stringr)
library(magrittr)
library(tidyr)
#Data
df <- data.frame(id = 1:5, tags = c("A,B,AB,C", "C", "AB,E", NA, "B,C"))
#Separate into rows
df %<>% mutate(t2 = tags) %>% separate_rows(t2, sep = ",")
#Create a presence/absence column
df %<>% mutate(pa = 1)
#Pivot wider and use the presence/absence
#column as entries; fill with 0 if absent
df %<>% pivot_wider(names_from = t2, values_from = pa, values_fill = 0)
df
# # A tibble: 5 x 8
# id tags A B AB C E `NA`
# <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 A,B,AB,C 1 1 1 1 0 0
# 2 2 C 0 0 0 1 0 0
# 3 3 AB,E 0 0 1 0 1 0
# 4 4 NA 0 0 0 0 0 1
# 5 5 B,C 0 1 0 1 0 0
Edit: updated the code to enable it to retain the tags column. Sorry.
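One possible tweak (not part of the original answer): the NA in tags produces a literal `NA` column after pivoting. If that column is unwanted, it can be dropped afterwards, e.g. (assuming dplyr >= 1.0 for any_of()):
df %<>% select(-any_of("NA"))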
I am trying to filter a data set to only include subjects who have data in all conditions (levels of a factor).
I have tried to filter by calculating the number of levels for each subject, but that does not work.
library(tidyverse)
Data <- data.frame(
Subject = factor(c(rep(1, 3),
rep(2, 3),
rep(3, 1))),
Condition = factor(c("A", "B", "C",
"A", "B", "C",
"A")),
Val = c(1, 0, 1,
0, 0, 1,
1)
)
Data %>%
semi_join(
.,
Data %>%
group_by(Subject) %>%
summarize(Num_Cond = length(levels(Condition))) %>%
filter(Num_Cond == 3),
by = "Subject"
)
This attempt yields:
Subject Condition Val
1 1 A 1
2 1 B 0
3 1 C 1
4 2 A 0
5 2 B 0
6 2 C 1
7 3 A 1
Desired output:
Subject Condition Val
1 1 A 1
2 1 B 0
3 1 C 1
4 2 A 0
5 2 B 0
6 2 C 1
I want to filter subject 3 out because they only have data for one condition.
Is there a dplyr/tidyverse approach for this problem?
We can create a condition with all and levels
library(dplyr)
Data %>%
group_by(Subject) %>%
filter(all(levels(Condition) %in% Condition))
# A tibble: 6 x 3
# Groups: Subject [2]
# Subject Condition Val
# <fct> <fct> <dbl>
#1 1 A 1
#2 1 B 0
#3 1 C 1
#4 2 A 0
#5 2 B 0
#6 2 C 1
Or with n_distinct and nlevels
Data %>%
group_by(Subject) %>%
filter(nlevels(Condition) == n_distinct(Condition))
# A tibble: 6 x 3
# Groups: Subject [2]
# Subject Condition Val
# <fct> <fct> <dbl>
#1 1 A 1
#2 1 B 0
#3 1 C 1
#4 2 A 0
#5 2 B 0
#6 2 C 1
Here is a solution testing whether the number of rows of each group is equal to the number of levels of Condition.
Data %>%
group_by(Subject) %>%
filter(n() == nlevels(Condition))
## A tibble: 6 x 3
## Groups: Subject [2]
# Subject Condition Val
# <fct> <fct> <dbl>
#1 1 A 1
#2 1 B 0
#3 1 C 1
#4 2 A 0
#5 2 B 0
#6 2 C 1
Edit
Following the comment by user @akrun, I tested with a data set having duplicate rows, and the code above does fail.
bind_rows(Data, Data) %>%
group_by(Subject) %>%
#distinct() %>%
filter(n() == nlevels(Condition))
## A tibble: 0 x 3
## Groups: Subject [0]
## ... with 3 variables: Subject <fct>, Condition <fct>, Val <dbl>
Uncommenting the distinct() line solves the problem, as shown below.
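For completeness, this is the pipeline from above with distinct() uncommented:
bind_rows(Data, Data) %>%
  group_by(Subject) %>%
  distinct() %>%
  filter(n() == nlevels(Condition))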
I found a relatively simple solution by subsetting on Subject:
Data %>%
semi_join(
.,
Data %>%
group_by(Subject) %>%
droplevels() %>%
summarize(Num_Cond = length(levels(Condition)[Subject])) %>%
filter(Num_Cond == 3),
by = "Subject"
)
This gives the desired output:
Subject Condition Val
1 1 A 1
2 1 B 0
3 1 C 1
4 2 A 0
5 2 B 0
6 2 C 1
This has been bugging me for two days.
I have data like this:
Account.ID asset_name
6yS A
6yS B
6yS B
6yS C
6yU D
876 C
From here I want to create additional dummy-like columns, but I want only one row per ID.
My output should look like this:
Account.ID asset_name Flag_A Flag_B Flag_C Flag_D
6yS A 1 2 1 0
6yU D 0 0 0 1
876 C 0 0 1 0
I tried aggregating, but that produces a separate table that I would then have to merge back in, and I don't want to lose information. Please help me out.
Thank you all in advance.
This one?
library(dplyr)

df %>%
  count(Account.ID, asset_name) %>%
  tidyr::pivot_wider(names_from = asset_name,
                     values_from = n,
                     values_fill = list(n = 0))
# A tibble: 3 x 5
Account.ID A B C D
<chr> <int> <int> <int> <int>
1 6yS 1 2 1 0
2 6yU 0 0 0 1
3 876 0 0 1 0
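If the Flag_ prefix from the desired output is also wanted, pivot_wider() has a names_prefix argument (a small addition, not in the original answer):
df %>%
  count(Account.ID, asset_name) %>%
  tidyr::pivot_wider(names_from = asset_name,
                     values_from = n,
                     values_fill = list(n = 0),
                     names_prefix = "Flag_")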
You can use dcast from data.table with the fun.aggregate argument:
library(data.table)
dcast(data = setDT(df)[, asset_name := paste0('Flag_', asset_name)],
formula = Account.ID ~ asset_name,
fun.aggregate = length)
Output:
Account.ID Flag_A Flag_B Flag_C Flag_D
1: 6yS 1 2 1 0
2: 6yU 0 0 0 1
3: 876 0 0 1 0
Here's a tidyverse solution, although not the most elegant.
library(dplyr)
library(tidyr)

Account.ID <- c('6yS', '6yS', '6yS', '6yS', '6yU', '876')
asset_name <- c('A','B','B','C','D','C')
df <- data.frame(Account.ID, asset_name)

df <- df %>%
  group_by(Account.ID, asset_name) %>%
  summarise(Count = n()) %>%
  spread(key = asset_name, value = Count, fill = 0)
Returns:
Account.ID A B C D
<fct> <dbl> <dbl> <dbl> <dbl>
1 6yS 1 2 1 0
2 6yU 0 0 0 1
3 876 0 0 1 0
I think I have an answer for you. So this is your dataset:
Account.ID <- c("6yS", "6yS", "6yS", "6yS", "6yU", 876)
asset_name <- c("A", "B", "B", "C", "D", "C")
df <- data.frame(Account.ID, asset_name)
df
Account.ID asset_name
1 6yS A
2 6yS B
3 6yS B
4 6yS C
5 6yU D
6 876 C
For further transformations I am using tidyverse, so install it and load the library:
install.packages("tidyverse")
library(tidyverse)
df <- df %>%
  group_by(Account.ID, asset_name) %>%
  summarize(n = n()) %>%
  spread(asset_name, n)
df
# A tibble: 3 x 5
# Groups: Account.ID [3]
Account.ID A B C D
<fct> <int> <int> <int> <int>
1 6yS 1 2 1 NA
2 6yU NA NA NA 1
3 876 NA NA 1 NA
Now all that needs to be done is turn NAs into 0 and rename columns:
df[is.na(df)] <- 0
names(df)[2:ncol(df)] <- paste0("Flag_", names(df)[2:ncol(df)])
df
# A tibble: 3 x 5
# Groups: Account.ID [3]
Account.ID Flag_A Flag_B Flag_C Flag_D
<fct> <dbl> <dbl> <dbl> <dbl>
1 6yS 1 2 1 0
2 6yU 0 0 0 1
3 876 0 0 1 0
Is this what you were looking for?