Pipe output of one data.frame to another using dplyr - r

I have two data.frames. The first is a look-up table that tells me the set of products included in each group. Each group has at least one product of Type 1 and one of Type 2.
The second data.frame tells me details about the transaction. Each transaction can have one of the following products:
a) Only products of Type 1 from one of the groups
b) Only products of Type 2 from one of the groups
c) Product of Type 1 and Type 2 from the same group
For my analysis, I am interested in c) above, i.e. how many transactions include products of both Type 1 and Type 2 from the same group. A transaction is ignored altogether if its Type 1 and Type 2 products come from different groups.
Thus, each product of Type 1 or Type 2 MUST belong to the same group.
Here's my look up table:
> P_Lookup
Group ProductID1 ProductID2
Group1 A 1
Group1 B 2
Group1 B 3
Group2 C 4
Group2 C 5
Group2 C 6
Group3 D 7
Group3 C 8
Group3 C 9
Group4 E 10
Group4 F 11
Group4 G 12
Group5 H 13
Group5 H 14
Group5 H 15
For instance, I won't have Product G and Product 15 in one transaction because they belong to different groups.
Here are the transactions:
TransactionID ProductID ProductType
a1 A 1
a1 B 1
a1 1 2
a2 C 1
a2 4 2
a2 5 2
a3 D 1
a3 C 1
a3 7 2
a3 8 2
a4 H 1
a5 1 2
a5 2 2
a5 3 2
a5 3 2
a5 1 2
a6 H 1
a6 15 2
My Code:
Now, I was able to write code using dplyr for shortlisting transactions from one group. However, I am not sure how I can vectorize my code for all groups.
Here's my code:
P_Groups<-unique(P_Lookup$Group)
Chosen_Group<-P_Groups[5]
P_Group_Ind <- P_Trans %>%
group_by(TransactionID)%>%
dplyr::filter((ProductID %in% unique(P_Lookup[P_Lookup$Group==Chosen_Group,]$ProductID1)) |
(ProductID %in% unique(P_Lookup[P_Lookup$Group==Chosen_Group,]$ProductID2)) ) %>%
mutate(No_of_PIDs = n_distinct(ProductType)) %>%
mutate(Group_Name = Chosen_Group)
P_Group_Ind<-P_Group_Ind[P_Group_Ind$No_of_PIDs>1,]
This works well as long as I manually select each group, i.e. by setting Chosen_Group. However, I am not sure how I can automate this. One option I am considering is a for loop, but I know that the beauty of R is vectorization, so I want to stay away from for loops.
I'd sincerely appreciate any help. I have spent almost two days on this. I looked at using dplyr in for loop in r, but it seems this thread is talking about a different issue.
DATA:
Here's dput for P_Trans:
structure(list(TransactionID = c("a1", "a1", "a1", "a2", "a2",
"a2", "a3", "a3", "a3", "a3", "a4", "a5", "a5", "a5", "a5", "a5",
"a6", "a6"), ProductID = c("A", "B", "1", "C", "4", "5", "D",
"C", "7", "8", "H", "1", "2", "3", "3", "1", "H", "15"), ProductType = c(1,
1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2)), .Names = c("TransactionID",
"ProductID", "ProductType"), row.names = c(NA, 18L), class = "data.frame")
Here's dput for P_Lookup:
structure(list(Group = c("Group1", "Group1", "Group1", "Group2",
"Group2", "Group2", "Group3", "Group3", "Group3", "Group4", "Group4",
"Group4", "Group5", "Group5", "Group5"), ProductID1 = c("A",
"B", "B", "C", "C", "C", "D", "C", "C", "E", "F", "G", "H", "H",
"H"), ProductID2 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
14, 15)), .Names = c("Group", "ProductID1", "ProductID2"), row.names = c(NA,
15L), class = "data.frame")
Here's the dput() after adding a product to P_Trans that doesn't exist in the look-up table:
structure(list(TransactionID = c("a1", "a1", "a1", "a2", "a2",
"a2", "a3", "a3", "a3", "a3", "a4", "a5", "a5", "a5", "a5", "a5",
"a6", "a6", "a7"), ProductID = c("A", "B", "1", "C", "4", "5",
"D", "C", "7", "8", "H", "1", "2", "3", "3", "1", "H", "15",
"22"), ProductType = c(1, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2,
2, 2, 2, 1, 2, 3)), .Names = c("TransactionID", "ProductID",
"ProductType"), row.names = c(NA, 19L), class = "data.frame")

Below is a tidyverse (dplyr, tidyr, and purrr) solution that I hope will help.
Note that the use of map_df in the last line returns all results as a data frame. If you'd prefer it to be a list object for each group, then simply use map.
library(dplyr)
library(tidyr)
library(purrr)
# Save unique groups for later use
P_Groups <- unique(P_Lookup$Group)
# Convert lookup table to product IDs and Groups
P_Lookup <- P_Lookup %>%
gather(ProductIDn, ProductID, ProductID1, ProductID2) %>%
select(ProductID, Group) %>%
distinct() %>%
nest(-ProductID, .key = Group)
# Bind Group information to transactions
# and group for next analysis
P_Trans <- P_Trans %>%
left_join(P_Lookup) %>%
filter(!map_lgl(Group, is.null)) %>%
unnest(Group) %>%
group_by(TransactionID)
# Iterate through Groups to produce results
map(P_Groups, ~ filter(P_Trans, Group == .)) %>%
map(~ mutate(., No_of_PIDs = n_distinct(ProductType))) %>%
map_df(~ filter(., No_of_PIDs > 1))
#> Source: local data frame [12 x 5]
#> Groups: TransactionID [4]
#>
#> TransactionID ProductID ProductType Group No_of_PIDs
#> <chr> <chr> <dbl> <chr> <int>
#> 1 a1 A 1 Group1 2
#> 2 a1 B 1 Group1 2
#> 3 a1 1 2 Group1 2
#> 4 a2 C 1 Group2 2
#> 5 a2 4 2 Group2 2
#> 6 a2 5 2 Group2 2
#> 7 a3 D 1 Group3 2
#> 8 a3 C 1 Group3 2
#> 9 a3 7 2 Group3 2
#> 10 a3 8 2 Group3 2
#> 11 a6 H 1 Group5 2
#> 12 a6 15 2 Group5 2

Here is a single pipe dplyr solution:
P_DualGroupTransactionsCount <-
  P_Lookup %>% # data needing single column map of Keys
  gather(IDnum, ProductID, ProductID1:ProductID2) %>% # produce long single map of Keys for GroupID (tidyr::)
  right_join(P_Trans) %>% # join transactions to groupID info
  group_by(TransactionID, Group) %>% # organize for same transaction & same group
  mutate(DualGroup = ifelse(n_distinct(ProductType)==2, TRUE, FALSE)) %>% # flag transaction/group pairs that contain both product types
  filter(DualGroup == TRUE) %>% # choose only doubles
  select(TransactionID, Group) %>% # remove excess columns
  distinct %>% # remove excess rows
  nrow # count of unique transaction ID's
# P_DualGroupTransactions
# Source: local data frame [4 x 2]
# Groups: TransactionID, Group [4]
#
# TransactionID Group
# <chr> <chr>
# 1 a1 Group1
# 2 a2 Group2
# 3 a3 Group3
# 4 a6 Group5
# P_DualGroupTransactionsCount
[1] 4
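Because gather() is superseded in tidyr >= 1.0.0, the same idea can also be sketched with pivot_longer(). This is only a minimal sketch under a few assumptions, not either answerer's code: P_Lookup and P_Trans are rebuilt from the dput() output in the question, ProductID2 is converted to character so that both ID columns can share one column, and dual_group is a hypothetical name for the result.

```r
library(dplyr)
library(tidyr)

# Data rebuilt from the dput() output in the question
P_Lookup <- data.frame(
  Group = rep(paste0("Group", 1:5), each = 3),
  ProductID1 = c("A", "B", "B", "C", "C", "C", "D", "C", "C",
                 "E", "F", "G", "H", "H", "H"),
  ProductID2 = 1:15,
  stringsAsFactors = FALSE
)
P_Trans <- data.frame(
  TransactionID = c("a1", "a1", "a1", "a2", "a2", "a2", "a3", "a3", "a3",
                    "a3", "a4", "a5", "a5", "a5", "a5", "a5", "a6", "a6"),
  ProductID = c("A", "B", "1", "C", "4", "5", "D", "C", "7", "8", "H",
                "1", "2", "3", "3", "1", "H", "15"),
  ProductType = c(1, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2),
  stringsAsFactors = FALSE
)

dual_group <- P_Lookup %>%
  # both ID columns must have the same type before stacking them
  mutate(ProductID2 = as.character(ProductID2)) %>%
  pivot_longer(c(ProductID1, ProductID2),
               names_to = "IDnum", values_to = "ProductID") %>%
  # one row per (Group, ProductID) pair
  distinct(Group, ProductID) %>%
  # a product can belong to several groups, so newer dplyr versions
  # may warn about a many-to-many join here
  inner_join(P_Trans, by = "ProductID") %>%
  group_by(TransactionID, Group) %>%
  # keep transaction/group pairs that contain both product types
  filter(n_distinct(ProductType) == 2) %>%
  ungroup() %>%
  distinct(TransactionID, Group) %>%
  arrange(TransactionID)

nrow(dual_group)
#> [1] 4
```

Using inner_join() instead of right_join() means that products missing from the look-up table (such as product 22 in the extended P_Trans) are dropped before counting.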

Related

How to group_by(x) and summarise by counting distinct(y) for each x level? [duplicate]

I have the following situation:
V1 V2
A A1
A A1
A A1
A A2
A A2
A A3
B B1
B B2
B B2
I need to group by V1 and summarise by counting how many distinct values of V2 each V1 level has. Something like this:
V1 n
A 3
B 2
How can I use dplyr functions to solve that?
Thanks!!
We can use rle after grouping by 'V1'
library(dplyr)
df1 %>%
group_by(V1) %>%
summarise(n = length(rle(V2)$values), .groups = 'drop')
-output
# A tibble: 2 × 2
V1 n
<chr> <int>
1 A 3
2 B 2
Or with rleid and n_distinct
library(data.table)
df1 %>%
group_by(V1) %>%
summarise(n = n_distinct(rleid(V2)))
# A tibble: 2 × 2
V1 n
<chr> <int>
1 A 3
2 B 2
data
df1 <- structure(list(V1 = c("A", "A", "A", "A", "A", "A", "B", "B",
"B"), V2 = c("A1", "A1", "A1", "A2", "A2", "A1", "B1", "B2",
"B2")), class = "data.frame", row.names = c(NA, -9L))
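Note that rle() counts runs of consecutive equal values, so it equals the number of distinct values only when equal values are adjacent within each V1 level. A minimal sketch that counts distinct values directly with n_distinct(), using the table from the question (df_q and counts are hypothetical names):

```r
library(dplyr)

# Data as shown in the question's table
df_q <- data.frame(
  V1 = c("A", "A", "A", "A", "A", "A", "B", "B", "B"),
  V2 = c("A1", "A1", "A1", "A2", "A2", "A3", "B1", "B2", "B2"),
  stringsAsFactors = FALSE
)

counts <- df_q %>%
  group_by(V1) %>%
  # n_distinct() ignores row order, unlike rle()
  summarise(n = n_distinct(V2), .groups = "drop")

counts
```

This gives n = 3 for A and n = 2 for B regardless of how the rows are sorted.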

How to leave only rows that meet a specific condition in R

I have a data frame that contains around 700 cases with 1800 examinations. Some cases underwent several different modalities. I want to leave only one examination result based on the specific condition of the modality.
Here is a dummy data frame:
df <- data.frame (ID = c("1", "1", "1", "2", "2", "3", "4", "4", "5", "5"),
c1 = c("A", "B", "C", "A", "C", "A", "A", "B", "B", "C"),
x1 = c(5, 4, 5, 3, 1, 3, 4, 2, 3, 5),
x2 = c(4, 3, 7, 9, 1, 2, 4, 7, 5, 0))
There are five cases with 10 exams. [c1] is the exam modality (condition), and the results are x1 and x2.
I want to leave only one row based on the following condition:
C > B > A
I want to leave the row with C first; if not, leave the row with B; If C and B are absent, leave the row with A.
Desired output:
output <- data.frame (ID = c("1", "2", "3", "4", "5"),
c1 = c("C", "C", "A", "B", "C"),
x1 = c(5, 1, 3, 2, 5),
x2 = c(7, 1, 2, 7, 0))
You can arrange the data in the required order and, for each ID, select its first row.
library(dplyr)
req_order <- c('C', 'B', 'A')
df %>%
arrange(ID, match(c1, req_order)) %>%
distinct(ID, .keep_all = TRUE)
# ID c1 x1 x2
# <chr> <chr> <dbl> <dbl>
#1 1 C 5 7
#2 2 C 1 1
#3 3 A 3 2
#4 4 B 2 7
#5 5 C 5 0
In base R, this can be written as :
df1 <- df[order(match(df$c1, req_order)), ]
df1[!duplicated(df1$ID), ]
Here is one approach:
df.srt <- df[order(df$c1, decreasing=TRUE), ]
df.spl <- split(df.srt, df.srt$ID)
first <- lapply(df.spl, head, n=1)
result <- do.call(rbind, first)
result
# ID c1 x1 x2
# 1 1 C 5 7
# 2 2 C 1 1
# 3 3 A 3 2
# 4 4 B 2 7
# 5 5 C 5 0
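Sorting c1 in decreasing order works here only because the priority C > B > A happens to match reverse alphabetical order. For an arbitrary priority, the match() idea from the first answer can be wrapped in a small base R helper. This is a sketch; pick_by_priority is a hypothetical name, and the second priority vector below is invented purely for illustration.

```r
df <- data.frame(ID = c("1", "1", "1", "2", "2", "3", "4", "4", "5", "5"),
                 c1 = c("A", "B", "C", "A", "C", "A", "A", "B", "B", "C"),
                 x1 = c(5, 4, 5, 3, 1, 3, 4, 2, 3, 5),
                 x2 = c(4, 3, 7, 9, 1, 2, 4, 7, 5, 0),
                 stringsAsFactors = FALSE)

# Keep one row per ID: the row whose c1 comes first in req_order
pick_by_priority <- function(d, req_order) {
  prio <- match(d$c1, req_order)          # 1 = highest priority
  sorted <- d[order(d$ID, prio), ]        # best row first within each ID
  picked <- sorted[!duplicated(sorted$ID), ]
  rownames(picked) <- NULL
  picked
}

# Priority from the question: C > B > A
pick_by_priority(df, c("C", "B", "A"))

# Hypothetical non-alphabetical priority: B > C > A
pick_by_priority(df, c("B", "C", "A"))
```

With the question's priority this reproduces the desired output; with the hypothetical priority, IDs 1, 4 and 5 keep their B rows instead.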

Applying adjusted boxplot method adjboxStats() with group_by in R?

I am a beginner and would like to:
i) produce adjboxStats() for every code in my data (see below)
ii) eliminate the outliers for every code
Some dummy data:
data <- data.frame(
code=c("A1","A2","A3","A1","A2","A3","A1","A2","A3","A1","A2","A3","A1","A2","A3","A1","A2","A3","A2","A3","A1","A2","A3","A1","A2"),
duration=c(100,100,100,200,200,200,23523,213123,12,23213,968,37253,573012,472662,3846516,233,262,5737,3038,2,5,123,969,6,40582)
)
At the moment, I am able to produce the results across all codes, see below. But I have problems i) to run the statistics for every code (would group_by(code) work?) and then ii) to exclude the found outliers ($out) for every code.
library(robustbase)
adjboxStats(data$duration, coef = 1.5, a = -4, b = 3, do.conf = TRUE, do.out = TRUE)
$stats
[1] 2 100 262 23523 573012
$n
[1] 50
$conf
[1] -4971.77 5495.77
$fence
[1] -571.2153 707257.8400
$out
[1] 3846516 3846516
Thank you very much in advance for your help!
We can do a group by and summarise in a list
library(dplyr)
library(robustbase)
data1 <- data %>%
group_by(code) %>%
summarise(out = list(adjboxStats(duration, coef = 1.5,
a = -4, b = 3, do.conf = TRUE, do.out = TRUE)))
data1
# A tibble: 3 x 2
# code out
# <chr> <list>
#1 A1 <named list [5]>
#2 A2 <named list [5]>
#3 A3 <named list [5]>
data1$out[[1]]
#$stats
#[1] 5.0 53.0 216.5 23368.0 573012.0
#$n
#[1] 8
#$conf
#[1] -12807.59 13240.59
#$fence
#[1] -624.4143 696935.1967
#$out
#numeric(0)
If we are interested in filtering out the outliers, then use %in% with ! after extracting the 'out' component
data %>%
group_by(code) %>%
filter(!duration %in% adjboxStats(duration, coef = 1.5,
a = -4, b = 3, do.conf = TRUE, do.out = TRUE)$out)
# A tibble: 24 x 2
# Groups: code [3]
# code duration
# <chr> <dbl>
# 1 A1 100
# 2 A2 100
# 3 A3 100
# 4 A1 200
# 5 A2 200
# 6 A3 200
# 7 A1 23523
# 8 A2 213123
# 9 A3 12
#10 A1 23213
# … with 14 more rows
data
data <- structure(list(code = c("A1", "A2", "A3", "A1", "A2", "A3", "A1",
"A2", "A3", "A1", "A2", "A3", "A1", "A2", "A3", "A1", "A2", "A3",
"A2", "A3", "A1", "A2", "A3", "A1", "A2"), duration = c(100,
100, 100, 200, 200, 200, 23523, 213123, 12, 23213, 968, 37253,
573012, 472662, 3846516, 233, 262, 5737, 3038, 2, 5, 123, 969,
6, 40582)), class = "data.frame", row.names = c(NA, -25L))

extracting specific values in R

I am a beginner in R coding and I have the sample data here. I am trying to extract all the entries which have 2 "d7" and one "d1" for identical Idvalue number.
Sample name Idvalue_number
a d1 1
f d7 1
b d7 1
s d1 5
g d7 5
r d7 5
z d1 7
y d7 7
d d1 7
Expected output
a d1 1
f d7 1
b d7 1
s d1 5
g d7 5
r d7 5
Some code I have tried which is not giving me the desired output is here:
d1d7 <- data_ %>%
group_by(dvalue_number) %>%
filter(n() >= 3 & any(name == first(name)))
Could someone help me here? Thanks in advance.
One way that you can do it is shown below.
library(tidyverse)
#create a dataframe for example
df = data.frame(Sample = c("a", "f", "b", "s", "g", "r", "z", "y", "d"),
name = c("d1", "d7", "d7", "d1", "d7", "d7", "d7", "d7","d7"),
Idvalue_number = c(1, 1, 1, 5, 5, 5, 7, 7, 7))
df %>% group_by(Idvalue_number, name) %>%
summarise(total = n()) %>%
filter(name == "d1" & total == 1 | name == "d7" & total == 2)
Idvalue_number name total
<dbl> <fct> <int>
1 1 d1 1
2 1 d7 2
3 5 d1 1
4 5 d7 2
An option would be to filter based on the frequency of 'd1', 'd7' in each 'Idvalue_number'
library(dplyr)
data_ %>%
group_by(Idvalue_number) %>%
filter(n() >= 3, sum(name == 'd1') == 1, sum(name == "d7")== 2)
# A tibble: 6 x 3
# Groups: Idvalue_number [2]
# Sample name Idvalue_number
# <chr> <chr> <int>
#1 a d1 1
#2 f d7 1
#3 b d7 1
#4 s d1 5
#5 g d7 5
#6 r d7 5
data
data_ <- structure(list(Sample = c("a", "f", "b", "s", "g", "r", "z",
"y", "d"), name = c("d1", "d7", "d7", "d1", "d7", "d7", "d1",
"d7", "d1"), Idvalue_number = c(1L, 1L, 1L, 5L, 5L, 5L, 7L, 7L,
7L)), class = "data.frame", row.names = c(NA, -9L))

how to create new variables from one variable using two rules

I would appreciate any help to create new variables from one variable.
Specifically, I need help to simultaneously create one row per ID and several columns of E, where each of the new columns (that is, E1, E2, E3) contains the values of E for that ID. I tried doing this with melt followed by spread, but I am getting the error:
Error: Duplicate identifiers for rows (4, 7, 9), (1, 3, 6), (2, 5, 8)
Additionally, I tried the solutions discussed here and here but these did not work for my case because I need to be able to create row identifiers for rows (4, 1, 2), (7, 3, 5), and (9, 6, 8). That is, E for rows (4, 1, 2) should be named E1, E for rows (7, 3, 5) should be named E2, E for rows (9, 6, 8) should be named E3, and so on.
#data
dT<-structure(list(A = c("a1", "a2", "a1", "a1", "a2", "a1", "a1",
"a2", "a1"), B = c("b2", "b2", "b2", "b1", "b2", "b2", "b1",
"b2", "b1"), ID = c("3", "4", "3", "1", "4", "3", "1", "4", "1"
), E = c(0.621142094943352, 0.742109450696123, 0.39439152996948,
0.40694392882818, 0.779607277916503, 0.550579323666347, 0.352622183880119,
0.690660491345867, 0.23378944873769)), class = c("data.table",
"data.frame"), row.names = c(NA, -9L))
#my attempt
A B ID E
1: a1 b2 3 0.6211421
2: a2 b2 4 0.7421095
3: a1 b2 3 0.3943915
4: a1 b1 1 0.4069439
5: a2 b2 4 0.7796073
6: a1 b2 3 0.5505793
7: a1 b1 1 0.3526222
8: a2 b2 4 0.6906605
9: a1 b1 1 0.2337894
aTempDF <- melt(dT, id.vars = c("A", "B", "ID"))
A B ID variable value
1: a1 b2 3 E 0.6211421
2: a2 b2 4 E 0.7421095
3: a1 b2 3 E 0.3943915
4: a1 b1 1 E 0.4069439
5: a2 b2 4 E 0.7796073
6: a1 b2 3 E 0.5505793
7: a1 b1 1 E 0.3526222
8: a2 b2 4 E 0.6906605
9: a1 b1 1 E 0.2337894
aTempDF%>%spread(variable, value)
Error: Duplicate identifiers for rows (4, 7, 9), (1, 3, 6), (2, 5, 8)
#expected output
A B ID E1 E2 E3
1: a1 b2 3 0.6211421 0.3943915 0.5505793
2: a2 b2 4 0.7421095 0.7796073 0.6906605
3: a1 b1 1 0.4069439 0.3526222 0.2337894
Thanks in advance for any help.
You can use dcast from data.table
library(data.table)
dcast(dT, A + B + ID ~ paste0("E", rowid(ID)))
# A B ID E1 E2 E3
#1 a1 b1 1 0.4069439 0.3526222 0.2337894
#2 a1 b2 3 0.6211421 0.3943915 0.5505793
#3 a2 b2 4 0.7421095 0.7796073 0.6906605
You need to create the correct 'time variable' first which is what rowid(ID) does.
For those looking for a tidyverse solution:
library(tidyverse)
dT <- structure(
list(
A = c("a1", "a2", "a1", "a1", "a2", "a1", "a1", "a2", "a1"),
B = c("b2", "b2", "b2", "b1", "b2", "b2", "b1", "b2", "b1"),
ID = c("3", "4", "3", "1", "4", "3", "1", "4", "1"),
E = c(0.621142094943352, 0.742109450696123, 0.39439152996948, 0.40694392882818,
0.779607277916503, 0.550579323666347, 0.352622183880119, 0.690660491345867,
0.23378944873769)),
class = c("data.table",
"data.frame"),
row.names = c(NA, -9L))
dT %>%
as_tibble() %>% # since dataset is a data.table object
group_by(A, B, ID) %>%
# Just so columns are "E1", "E2", etc.
mutate(rn = glue::glue("E{row_number()}")) %>%
ungroup() %>%
spread(rn, E) %>%
# not necessary, just making output in the same order as your expected output
arrange(desc(B))
# A tibble: 3 x 6
# A B ID E1 E2 E3
# <chr> <chr> <chr> <dbl> <dbl> <dbl>
#1 a1 b2 3 0.621 0.394 0.551
#2 a2 b2 4 0.742 0.780 0.691
#3 a1 b1 1 0.407 0.353 0.234
As mentioned in the accepted answer, you need a "key" variable to spread on first. This is created using row_number() and glue where glue just gives you the proper E1, E2, etc. variable names.
The group_by piece just makes sure that the row numbers are with respect to A, B and ID.
EDIT for tidyr >= 1.0.0
The (not-so) new pivot_ functions supersede gather and spread and eliminate the need to glue the new variable names together in a mutate.
dT %>%
as_tibble() %>% # since dataset is a data.table object
group_by(A, B, ID) %>%
# no longer need to glue (or paste) the names together but still need a row number
mutate(rn = row_number()) %>%
ungroup() %>%
pivot_wider(names_from = rn, values_from = E, names_glue = "E{.name}") %>% # names_glue argument allows for easy transforming of the new variable names
# not necessary, just making output in the same order as your expected output
arrange(desc(B))
# A tibble: 3 x 6
# A B ID E1 E2 E3
# <chr> <chr> <chr> <dbl> <dbl> <dbl>
#1 a1 b2 3 0.621 0.394 0.551
#2 a2 b2 4 0.742 0.780 0.691
#3 a1 b1 1 0.407 0.353 0.234

Resources