I have a dataset like this:
VisitID | Item |
1 | A |
1 | B |
1 | C |
1 | D |
2 | A |
2 | D |
2 | B |
3 | B |
3 | C |
4 | D |
4 | C |
I would like to create a classification column for items according to these set-theory conditions:
VisitID contains A only, B only, C only, A&B, A&C, B&C, A&B&C, Others (none of A, B, or C is present)
The results should look like this:
VisitID | Item | Classification |
1 | A | A&B&C |
1 | B | A&B&C |
1 | C | A&B&C |
1 | D | A&B&C |
2 | A | A&B |
2 | D | A&B |
2 | B | A&B |
3 | B | B&C |
3 | C | B&C |
4 | D | C only |
4 | C | C only |
How can I do this in R, especially with dplyr?
You can left_join the data with a grouped, filtered, and summarised copy of itself.
library(dplyr)
data %>% left_join(
group_by(data, VisitID) %>%
distinct(VisitID, Item) %>%
filter(Item %in% c("A","B","C")) %>%
summarise(set=paste0(Item, collapse="&")),
by="VisitID")
Output:
VisitID Item set
1 1 A A&B&C
2 1 B A&B&C
3 1 C A&B&C
4 1 D A&B&C
5 2 A A&B
6 2 D A&B
7 2 B A&B
8 3 B B&C
9 3 C B&C
10 4 D C
11 4 C C
12 5 D <NA>
13 5 E <NA>
Data:
dput(data)
structure(list(VisitID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L,
4L, 4L, 5L, 5L), Item = c("A", "B", "C", "D", "A", "D", "B",
"B", "C", "D", "C", "D", "E")), class = "data.frame", row.names = c(NA,
-13L))
We can write a custom function :
paste_values <- function(x) {
x1 <- x[x %in% c("A", "B", "C")]
if (n_distinct(x1) == 1)
#If you want to stay in base R, use this condition instead:
#if (length(unique(x1)) == 1)
paste0(unique(x1), " only")
else
paste0(unique(x1), collapse = " & ")
}
and apply it for each group.
library(dplyr)
df %>% group_by(VisitID) %>% mutate(Item = paste_values(Item))
# VisitID Item
# <int> <chr>
# 1 1 A & B & C
# 2 1 A & B & C
# 3 1 A & B & C
# 4 1 A & B & C
# 5 2 A & B
# 6 2 A & B
# 7 2 A & B
# 8 3 B & C
# 9 3 B & C
#10 4 C only
#11 4 C only
We can also use the same function in base R :
df$Item <- with(df, ave(Item, VisitID, FUN = paste_values))
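Note that paste_values returns an empty string for a visit that contains none of A, B, or C, whereas the question asks for "Others" in that case. Below is a minimal base R sketch that covers that case too. The function name classify_visit and the Classification column name are my own choices, not from the answers above, and the sample data is the 13-row data set from the first answer.

```r
# Hypothetical helper: label one visit by which of the target items it contains.
classify_visit <- function(items, targets = c("A", "B", "C")) {
  present <- targets[targets %in% items]  # keeps the order of `targets`
  label <- if (length(present) == 0) {
    "Others"
  } else if (length(present) == 1) {
    paste(present, "only")
  } else {
    paste(present, collapse = "&")
  }
  rep(label, length(items))  # one label per row of the visit
}

data <- data.frame(
  VisitID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L),
  Item = c("A", "B", "C", "D", "A", "D", "B", "B", "C", "D", "C", "D", "E"),
  stringsAsFactors = FALSE
)

# ave() applies the function within each VisitID and keeps the original row order.
data$Classification <- ave(data$Item, data$VisitID, FUN = classify_visit)
data
```

Because the label is built from targets rather than from the items in the order they appear, visit 2 (A, D, B) gets "A&B" regardless of row order, and Item is left untouched.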
This is what my data looks like:
+---------+--+----------+--+
| Subj_ID | | Location | |
+---------+--+----------+--+
| 1 | | 1 | |
| 1 | | 2 | |
| 1 | | 3 | |
| 2 | | 1 | |
| 2 | | 4 | |
| 2 | | 2 | |
| 3 | | 1 | |
| 3 | | 2 | |
| 3 | | 5 | |
+---------+--+----------+--+
In this dataset, only subject 1 has a location value of 3, so I want to label subject 1 as YES for intervention. Since subjects 2 and 3 don't have a location value of 3, they need to be labeled NO.
This is what I want the data to look like.
+---------+--+----------+--------------+
| Subj_ID | | Location | Intervention |
+---------+--+----------+--------------+
| 1 | | 1 | YES |
| 1 | | 2 | YES |
| 1 | | 3 | YES |
| 2 | | 1 | NO |
| 2 | | 4 | NO |
| 2 | | 2 | NO |
| 3 | | 1 | NO |
| 3 | | 2 | NO |
| 3 | | 5 | NO |
+---------+--+----------+--------------+
Thanks in advance for the help! Dplyr preferred if possible.
One option with dplyr: after grouping by 'Subj_ID', check whether 3 is %in% Location, which returns a single TRUE/FALSE per group, then convert that to a numeric index to pick either "NO" or "YES".
library(dplyr)
df1 %>%
group_by(Subj_ID) %>%
mutate(Intervention = c("NO", "YES")[(3 %in% Location)+1])
# A tibble: 9 x 3
# Groups: Subj_ID [3]
# Subj_ID Location Intervention
# <int> <dbl> <chr>
#1 1 1 YES
#2 1 2 YES
#3 1 3 YES
#4 2 1 NO
#5 2 4 NO
#6 2 2 NO
#7 3 1 NO
#8 3 2 NO
#9 3 5 NO
Or use any
df1 %>%
group_by(Subj_ID) %>%
mutate(Intervention = case_when(any(Location == 3) ~ "YES", TRUE ~ "NO"))
Or using base R
df1$Intervention <- with(df1, c("NO", "YES")[1 + (Subj_ID %in%
Subj_ID[Location == 3])])
data
df1 <- data.frame(Subj_ID = rep(1:3, each = 3),
Location = c(1:3, 1, 4, 2, 1, 2, 5))
We can use match for each Subj_ID to check if 3 is present in any Location.
library(dplyr)
df %>%
group_by(Subj_ID) %>%
mutate(Intervention = c('Yes', 'No')[is.na(match(3,Location)) + 1])
#Can also use
#mutate(Intervention = c('No', 'Yes')[(match(3,Location, nomatch = 0L) > 0) + 1])
# Subj_ID Location Intervention
# <int> <dbl> <chr>
#1 1 1 Yes
#2 1 2 Yes
#3 1 3 Yes
#4 2 1 No
#5 2 4 No
#6 2 2 No
#7 3 1 No
#8 3 2 No
#9 3 5 No
data
df <- structure(list(Subj_ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
Location = c(1, 2, 3, 1, 4, 2, 1, 2, 5)), class = "data.frame",
row.names = c(NA, -9L))
I am just venturing into R programming and finding my way around.
Let's say I have a table as below:
Store | Product | Sales
X | A | 2
X | B | 1
X | C | 3
Y | A | 1
Y | B | 2
Y | C | 5
Z | A | 3
Z | B | 6
Z | C | 2
I need to change the sales values of certain products based on another table. Please find below:
Product | Sales
A | 10
B | 7
C | 15
My final table should be:
Store | Product | Sales
X | A | 10
X | B | 7
X | C | 15
Y | A | 10
Y | B | 7
Y | C | 15
Z | A | 10
Z | B | 7
Z | C | 15
I have 2 methods of doing this now:
1) Using joins
2) Using an if-else statement inside a for loop to subset the
Is there any other way to do this more effectively and in fewer steps?
Thanks in advance!
EDIT: I forgot to mention an exception earlier. What if my dataset is like below?
Store | Product | Sales
X | A | 2
X | B | 1
X | C | 3
X | D | 4
Y | A | 1
Y | B | 2
Y | C | 5
Y | D | 2
Z | A | 3
Z | B | 6
Z | C | 2
Z | D | 3
There's an extra product (D) with sales. I want to retain the sales value for that product if it is not present in the 2nd table, which is:
Product | Sales
A | 10
B | 7
C | 15
How about this join?
Since you only want to change the Sales values of certain products, I have included just two products in lookup_df to illustrate. coalesce keeps the original value wherever the lookup has no match.
library(dplyr)
df %>%
left_join(lookup_df, by = "Product", suffix = c("_Original", "_New")) %>%
mutate(Sales_New = coalesce(Sales_New, Sales_Original))
Output is:
Store Product Sales_Original Sales_New
1 X A 2 10
2 X B 1 1
3 X C 3 15
4 Y A 1 10
5 Y B 2 2
6 Y C 5 15
7 Z A 3 10
8 Z B 6 6
9 Z C 2 15
Sample data:
df <- structure(list(Store = c("X", "X", "X", "Y", "Y", "Y", "Z", "Z",
"Z"), Product = c("A", "B", "C", "A", "B", "C", "A", "B", "C"
), Sales = c(2L, 1L, 3L, 1L, 2L, 5L, 3L, 6L, 2L)), .Names = c("Store",
"Product", "Sales"), class = "data.frame", row.names = c(NA,
-9L))
lookup_df <- structure(list(Product = c("A", "C"), Sales = c(10L, 15L)), .Names = c("Product", "Sales"), class = "data.frame", row.names = c(NA,
-2L))
# Product Sales
#1 A 10
#2 C 15
If you use a lookup-vector, it is relatively short:
d <- read.table(text = "
Store | Product | Sales
X | A | 2
X | B | 1
X | C | 3
Y | A | 1
Y | B | 2
Y | C | 5
Z | A | 3
Z | B | 6
Z | C | 2", sep = "|", header = T, stringsAsFactors = F)
lookup <- read.table(text = "Product | Sales
A | 10
B | 7
C | 15", sep = "|", header = T, stringsAsFactors = F)
lookup$Product <- gsub("^\\s+|\\s+$", "", lookup$Product) # remove spaces
lookup <- setNames(lookup$Sales, lookup$Product) # convert to vector
d$Product <- gsub("^\\s+|\\s+$", "", d$Product) # remove spaces
d$Sales <- lookup[d$Product] # main part
d
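With this approach, lookup[d$Product] returns NA for any product that has no entry in the lookup vector, so product D from the question's edit would lose its sales value. A minimal sketch of one way to fall back to the original value, assuming the data from the edit (the new_sales name is mine):

```r
# Sales data including product D, which has no entry in the lookup
d <- data.frame(
  Store = rep(c("X", "Y", "Z"), each = 4),
  Product = rep(c("A", "B", "C", "D"), times = 3),
  Sales = c(2, 1, 3, 4, 1, 2, 5, 2, 3, 6, 2, 3),
  stringsAsFactors = FALSE
)
lookup <- c(A = 10, B = 7, C = 15)  # named vector: product -> new sales

# NA wherever the product is missing from the lookup
new_sales <- unname(lookup[d$Product])

# Keep the original value where there is no replacement
d$Sales <- ifelse(is.na(new_sales), d$Sales, new_sales)
d
```

This is the lookup-vector counterpart of the coalesce step in the join answer.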
I have the following table:
+----+------------+----------+
| ID | Date | Variable |
+----+------------+----------+
| a | 12/03/2017 | d |
| a | 15/04/2017 | d |
| a | 20/06/2017 | c |
| b | 14/05/2017 | c |
| b | 15/08/2017 | c |
| b | 16/09/2017 | c |
+----+------------+----------+
For each ID, I'd like a check in a separate column that tells whether there was a "c" value after the occurrence of a "d" value, like this:
+----+------------+----------+-------+------------+
| ID | Date | Variable | Check | Date |
+----+------------+----------+-------+------------+
| a | 12/03/2017 | d | 1 | 20/06/2017 |
| a | 15/04/2017 | d | 1 | 20/06/2017 |
| a | 20/06/2017 | c | 1 | 20/06/2017 |
| b | 14/05/2017 | c | 0 | 0 |
| b | 15/08/2017 | c | 0 | 0 |
| b | 16/09/2017 | c | 0 | 0 |
+----+------------+----------+-------+------------+
It's not just about finding an occurrence of "c", but about seeing whether "c" occurs after "d" or not. It would also help to have the corresponding date in a separate column. I tried removing the duplicates and then identifying the lead value (or n of rows > 1), but is there a simpler way to do this?
Any dplyr or data.table approach would be most helpful.
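A solution using dplyr. There must be a better way than this, but I think it should work. unique(Variable[!is.na(Variable)]) gives a vector that is one of c("c", "d"), c("d", "c"), "c", or "d". Since unique keeps values in order of first appearance, c("d", "c") means the first "d" comes before the first "c", assuming the rows are sorted by date within each ID. If you are sure there are no NA values, you can drop the !is.na part. Date[Variable %in% "c"][1] selects the first date with a "c".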
dat2 <- dat %>%
group_by(ID) %>%
mutate(Check = ifelse(identical(unique(Variable[!is.na(Variable)]), c("d", "c")),
1L, 0L)) %>%
mutate(Date2 = ifelse(Check == 1L, Date[Variable %in% "c"][1], "0")) %>%
ungroup()
dat2
# # A tibble: 6 x 5
# ID Date Variable Check Date2
# <chr> <chr> <chr> <int> <chr>
# 1 a 12/03/2017 d 1 20/06/2017
# 2 a 15/04/2017 d 1 20/06/2017
# 3 a 20/06/2017 c 1 20/06/2017
# 4 b 14/05/2017 c 0 0
# 5 b 15/08/2017 c 0 0
# 6 b 16/09/2017 c 0 0
DATA
dat <- read.table(text = "ID Date Variable
a '12/03/2017' d
a '15/04/2017' d
a '20/06/2017' c
b '14/05/2017' c
b '15/08/2017' c
b '16/09/2017' c",
header = TRUE, stringsAsFactors = FALSE)
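If the rows are not guaranteed to be sorted, or an ID can contain other patterns such as c, d, c, comparing parsed dates may be more robust than relying on row order. Here is a rough base R sketch of that idea. The helper name first_c_after_d is hypothetical, the dates are assumed to be in day/month/year format, and it returns NA instead of "0" when there is no match.

```r
dat <- data.frame(
  ID = rep(c("a", "b"), each = 3),
  Date = c("12/03/2017", "15/04/2017", "20/06/2017",
           "14/05/2017", "15/08/2017", "16/09/2017"),
  Variable = c("d", "d", "c", "c", "c", "c"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: date (as text) of the earliest "c" that falls after
# the earliest "d" within one ID, or NA if there is none.
first_c_after_d <- function(date, variable) {
  parsed <- as.Date(date, format = "%d/%m/%Y")
  d_dates <- parsed[variable == "d"]
  if (length(d_dates) == 0) return(NA_character_)
  hits <- variable == "c" & parsed > min(d_dates)
  if (!any(hits)) return(NA_character_)
  date[hits][which.min(parsed[hits])]
}

# One result per ID, then mapped back onto every row of that ID
match_date <- sapply(split(dat, dat$ID),
                     function(g) first_c_after_d(g$Date, g$Variable))
dat$Check <- as.integer(!is.na(match_date[dat$ID]))
dat$Date2 <- unname(match_date[dat$ID])
dat
```

Keeping the missing case as NA rather than "0" means Date2 can later be converted to a real date column without a special case.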
A data.table solution. As also suggested by @RYoda, you can use data.table::shift to test for your condition and then join the results back to the original dataset.
check <- dat[, {
idx <- Variable =='d' & shift(Variable, type="lead") == "c"
list(MatchDate=ifelse(any(idx), shift(Date, type="lead", fill=NA_character_)[idx][1L], "0"),
Check=as.integer(any(idx)))
}, by=.(ID)]
dat[check, on=.(ID)]
# ID Date Variable MatchDate Check
# 1: a 12/03/2017 d 20/06/2017 1
# 2: a 15/04/2017 d 20/06/2017 1
# 3: a 20/06/2017 c 20/06/2017 1
# 4: b 14/05/2017 c 0 0
# 5: b 15/08/2017 c 0 0
# 6: b 16/09/2017 c 0 0
data:
library(data.table)
dat <- data.table(ID=rep(c('a','b'), each=3),
Date=c("12/03/2017","15/04/2017","20/06/2017","14/05/2017","15/08/2017","16/09/2017"),
Variable=c('d','d','c','c','c','c'))
One solution uses fill from the tidyr package. The approach is:
First populate Check and c_Date for rows where Variable is c. Then fill the rows above using the fill function on both the Check and c_Date columns. This step populates the desired values in rows with a d value. Finally, replace the values of Check and c_Date for rows where Variable is c.
Note: the OP suggested that Check for rows with Variable as c can be either 0 or 1. My solution sets it to 0.
# Data
df <- read.table(text = "ID Date Variable
a 12/03/2017 d
a 15/04/2017 d
a 20/06/2017 c
b 14/05/2017 c
b 15/08/2017 c
b 16/09/2017 c", header = T, stringsAsFactors = F)
df$Date <- as.POSIXct(df$Date, format = "%d/%m/%Y")
library(dplyr)
library(tidyr)
df %>% group_by(ID) %>%
arrange(ID, Date) %>%
mutate(Check = ifelse(Variable == "c", 1L, NA),
c_Date = ifelse(Variable == "c", as.character(Date), NA) ) %>%
fill(Check, .direction = "up") %>%
fill(c_Date, .direction = "up") %>%
mutate(Check = ifelse(Variable == "c", 0L, Check),
c_Date = ifelse(Variable == "c", NA, c_Date) )
# Result
# ID Date Variable Check c_Date
# <chr> <dttm> <chr> <int> <chr>
# 1 a 2017-03-12 00:00:00 d 1 2017-06-20
# 2 a 2017-04-15 00:00:00 d 1 2017-06-20
# 3 a 2017-06-20 00:00:00 c 0 <NA>
# 4 b 2017-05-14 00:00:00 c 0 <NA>
# 5 b 2017-08-15 00:00:00 c 0 <NA>
# 6 b 2017-09-16 00:00:00 c 0 <NA>
I will post a reproducible Example.
id <- c(1,1,1,1,2,2,1,1)
group <- c("a","b","c","d","a","b","c","d")
df <- data.frame(id, group)
I want something like this as end result.
+====+========+========+
| id | group1 | group2 |
+====+========+========+
| 1 | a | b |
+----+--------+--------+
| 1 | b | c |
+----+--------+--------+
| 1 | c | d |
+----+--------+--------+
| 1 | d | - |
+----+--------+--------+
| 2 | a | b |
+----+--------+--------+
| 2 | b | - |
+----+--------+--------+
| 1 | c | d |
+----+--------+--------+
| 1 | d | - |
+----+--------+--------+
Note that the order of the IDs matters. I also have another column with a timestamp.
One solution with dplyr and rleid from data.table:
library(dplyr)
df %>%
mutate(id2 = data.table::rleid(id)) %>%
group_by(id2) %>%
mutate(group2 = lead(group))
# A tibble: 8 x 4
# Groups: id2 [3]
# id group id2 group2
# <dbl> <fct> <int> <fct>
# 1 1.00 a 1 b
# 2 1.00 b 1 c
# 3 1.00 c 1 d
# 4 1.00 d 1 NA
# 5 2.00 a 2 b
# 6 2.00 b 2 NA
# 7 1.00 c 3 d
# 8 1.00 d 3 NA
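rleid gives each run of consecutive equal id values its own number, which is why the second block of id 1 ends up in a separate group (id2 = 3). If you would rather not depend on data.table, the same run index can be sketched in base R with rle. The run_id name is my own, and this version leaves NA where the desired output shows "-".

```r
id <- c(1, 1, 1, 1, 2, 2, 1, 1)
group <- c("a", "b", "c", "d", "a", "b", "c", "d")
df <- data.frame(id, group, stringsAsFactors = FALSE)

# Number each run of consecutive equal ids: 1 1 1 1 2 2 3 3
runs <- rle(df$id)
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# Within each run, shift group up by one; the last row of a run gets NA
df$group2 <- ave(df$group, run_id, FUN = function(g) c(g[-1], NA))
df
```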
If I understood your question correctly, you can use the following function:
id <- c(1,1,1,1,2,2,1,1)
group <- c("a","b","c","d","a","b","c","d")
df <- data.frame(id, group)
add_group2 <- function(df) {
n <- nrow(df) # row count of the argument, not the length of the global 'group' vector
group2 <- as.character(df$group[2:n])
group2 <- c(group2, "-")
group2[which(c(df$id[-n] - c(df$id[2:n]), 0) != 0)] <- "-"
return(data.frame(df, group2))
}
add_group2(df)
Result should be:
id group group2
1 1 a b
2 1 b c
3 1 c d
4 1 d -
5 2 a b
6 2 b -
7 1 c d
8 1 d -
I have a data.table like so:
id | id2 | val
--------------
1 | 1 | A
1 | 2 | B
2 | 3 | C
2 | 4 | D
3 | 5 | E
3 | 6 | F
I want to group by the id column and return the row with the maximum id2 for each id, like so:
id | id2 | val
--------------
1 | 2 | B
2 | 4 | D
3 | 6 | F
It's easy in SQL:
SELECT id, MAX(id2) FROM tbl GROUP BY id;
But I want to know how to do this with data.table. So far I have:
tbl[, .(id2 = max(id2)), by = id]
but I don't know how to get the val part.
df <- read.table(header = T, text = "id id2 val
1 1 A
1 2 B
2 3 C
2 4 D
3 5 E
3 6 F")
library(data.table)
setDT(df)
df[, max_id2 := max(id2), by = id]
df <- df[id2 == max_id2, ]
df[, max_id2 := NULL]
id id2 val
1: 1 2 B
2: 2 4 D
3: 3 6 F