I have a data.table like so:
id | id2 | val
--------------
1 | 1 | A
1 | 2 | B
2 | 3 | C
2 | 4 | D
3 | 5 | E
3 | 6 | F
I want to group by the id column and return the maximum id2 for each id, like so:
id | id2 | val
--------------
1 | 2 | B
2 | 4 | D
3 | 6 | F
It's easy in SQL:
SELECT id, MAX(id2) FROM tbl GROUP BY id;
But I want to know how to do this with data.table. So far I have:
tbl[, .(id2 = max(id2)), by = id]
but I don't know how to get the val part.
df <- read.table(header = T, text = "id id2 val
1 1 A
1 2 B
2 3 C
2 4 D
3 5 E
3 6 F")
library(data.table)
setDT(df)
df[, max_id2 := max(id2), by = id]
df <- df[id2 == max_id2, ]
df[, max_id2 := NULL]
id id2 val
1: 1 2 B
2: 2 4 D
3: 3 6 F
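A more compact variant, offered as a sketch rather than as the method used above: select the row at `which.max(id2)` within each group through `.SD`. The table name `dt` and the result name `res` are made up for this example. Note that `which.max` keeps only the first row when several rows tie for the maximum, whereas the `id2 == max_id2` filter keeps all of them.

```r
library(data.table)

# Same example data as in the question (dt is a hypothetical name)
dt <- data.table(id  = c(1, 1, 2, 2, 3, 3),
                 id2 = 1:6,
                 val = c("A", "B", "C", "D", "E", "F"))

# For each id, keep the single row where id2 is largest
res <- dt[, .SD[which.max(id2)], by = id]
res
```

For large tables with many groups, `.SD` subsetting can be slow, so the join-free three-step version above may still be preferable.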
I have a df like this:
VisitID | Item |
1 | A |
1 | B |
1 | C |
1 | D |
2 | A |
2 | D |
2 | B |
3 | B |
3 | C |
4 | D |
4 | C |
In R, how do I filter for VisitIDs as long as they contain Item A & B?
Expected Outcome:
VisitID | Item |
1 | A |
1 | B |
1 | C |
1 | D |
2 | A |
2 | D |
2 | B |
I tried df %>% group_by(VisitID) %>% filter(any(Item == 'A' & Item == 'B')) but it doesn't work.
library(readr)
# Data from the question, so that the VisitID column used below exists
df <- read_delim("VisitID | Item
1 | A
1 | B
1 | C
1 | D
2 | A
2 | D
2 | B
3 | B
3 | C
4 | D
4 | C", delim = "|", trim_ws = TRUE)
Since you want both "A" and "B", you can use all:
library(dplyr)
df %>% group_by(VisitID) %>% filter(all(c("A", "B") %in% Item))
# VisitID Item
# <int> <chr>
#1 1 A
#2 1 B
#3 1 C
#4 1 D
#5 2 A
#6 2 D
#7 2 B
Or, if you want to use any, test each value separately:
df %>% group_by(VisitID) %>% filter(any(Item == 'A') && any(Item == 'B'))
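To see why the attempt in the question returns nothing, here is a small base R illustration on the items of a single visit (the vector `item` and the result names are made up for this example). `Item == 'A' & Item == 'B'` is evaluated element by element, and no single element can equal both values at once.

```r
# Items of one visit (hypothetical vector for illustration)
item <- c("A", "B", "C", "D")

# Element-wise: no single value is both "A" and "B", so this is always FALSE
both_at_once <- any(item == "A" & item == "B")

# Testing each value separately asks: is there an "A" somewhere and a "B" somewhere?
each_present <- any(item == "A") && any(item == "B")

# Equivalent check with %in%
via_in <- all(c("A", "B") %in% item)
```

Here `both_at_once` is `FALSE` while the other two are `TRUE`, which is why the filter only works once the two conditions are tested separately.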
An option with data.table:
library(data.table)
setDT(df)[, .SD[all(c("A", "B") %in% Item)], VisitID]
Hi, I have a df as below:
ID | Gender
1 | M
1 | F
2 | F
2 | F
2 | F
3 | M
3 | M
3 | F
4 | M
4 | M
4 | M
I'd like to filter for the distinct IDs that have more than 1 Gender (this is dirty data, as a person can't have more than 1 Gender).
Results should be:
ID | Gender
1 | M
1 | F
3 | M
3 | F
How can I go about this in R using dplyr?
Using dplyr,
library(dplyr)
df %>%
group_by(ID) %>%
filter(n_distinct(Gender) > 1) %>%
distinct(Gender)
which gives,
# A tibble: 4 x 2
# Groups: ID [2]
Gender ID
<chr> <int>
1 M 1
2 F 1
3 M 3
4 F 3
I have the following table:
+----+------------+----------+
| ID | Date | Variable |
+----+------------+----------+
| a | 12/03/2017 | d |
| a | 15/04/2017 | d |
| a | 20/06/2017 | c |
| b | 14/05/2017 | c |
| b | 15/08/2017 | c |
| b | 16/09/2017 | c |
+----+------------+----------+
For each ID, I'd like to have a check in a separate column which tells whether there was a "c" value after the occurrence of a "d" value, like this:
+----+------------+----------+-------+------------+
| ID | Date | Variable | Check | Date |
+----+------------+----------+-------+------------+
| a | 12/03/2017 | d | 1 | 20/06/2017 |
| a | 15/04/2017 | d | 1 | 20/06/2017 |
| a | 20/06/2017 | c | 1 | 20/06/2017 |
| b | 14/05/2017 | c | 0 | 0 |
| b | 15/08/2017 | c | 0 | 0 |
| b | 16/09/2017 | c | 0 | 0 |
+----+------------+----------+-------+------------+
It's not just about finding the occurrence of "c", but about seeing whether "c" occurs after "d" or not. It would also help to have the corresponding date in a separate column. I tried removing the duplicates and then identifying the lead value (or n of rows > 1), but is there a simpler way to do this?
Any dplyr or data.table approach would be most helpful.
A solution using dplyr. There must be a better way than this, but I think this should work. unique(Variable[!is.na(Variable)]) is to get a vector with only c("c", "d"), c("d", "c"), "c", or "d". If you are sure there are no NA, you can remove !is.na. Date[Variable %in% "c"][1] is to select the first date.
dat2 <- dat %>%
group_by(ID) %>%
mutate(Check = ifelse(identical(unique(Variable[!is.na(Variable)]), c("d", "c")),
1L, 0L)) %>%
mutate(Date2 = ifelse(Check == 1L, Date[Variable %in% "c"][1], "0")) %>%
ungroup()
dat2
# # A tibble: 6 x 5
# ID Date Variable Check Date2
# <chr> <chr> <chr> <int> <chr>
# 1 a 12/03/2017 d 1 20/06/2017
# 2 a 15/04/2017 d 1 20/06/2017
# 3 a 20/06/2017 c 1 20/06/2017
# 4 b 14/05/2017 c 0 0
# 5 b 15/08/2017 c 0 0
# 6 b 16/09/2017 c 0 0
DATA
dat <- read.table(text = "ID Date Variable
a '12/03/2017' d
a '15/04/2017' d
a '20/06/2017' c
b '14/05/2017' c
b '15/08/2017' c
b '16/09/2017' c",
header = TRUE, stringsAsFactors = FALSE)
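The `identical(unique(...), c("d", "c"))` test above depends on the exact set and order of distinct values, so a group such as d, c, d would not match. A position-based check is one possible alternative. The following base R sketch is not the answer's method; the helper `check_one` and the result `res` are hypothetical names, and it assumes the rows of each ID are already sorted by date.

```r
dat <- data.frame(
  ID       = rep(c("a", "b"), each = 3),
  Date     = c("12/03/2017", "15/04/2017", "20/06/2017",
               "14/05/2017", "15/08/2017", "16/09/2017"),
  Variable = c("d", "d", "c", "c", "c", "c"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: flag a group if any "c" comes after its first "d"
check_one <- function(g) {
  first_d <- match("d", g$Variable)                 # NA if the group has no "d"
  after_d <- !is.na(first_d) & seq_len(nrow(g)) > first_d
  hit     <- which(g$Variable == "c" & after_d)     # positions of "c" after "d"
  g$Check <- as.integer(length(hit) > 0)
  g$Date2 <- if (length(hit) > 0) g$Date[hit[1]] else "0"
  g
}

# Apply per ID and stack the pieces back together
res <- do.call(rbind, lapply(split(dat, dat$ID), check_one))
res
```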
A data.table solution. As also suggested by @RYoda, you can use data.table::shift to test for your condition and then merge the results back into the original dataset:
check <- dat[, {
idx <- Variable =='d' & shift(Variable, type="lead") == "c"
list(MatchDate=ifelse(any(idx), shift(Date, type="lead", fill=NA_character_)[idx][1L], "0"),
Check=as.integer(any(idx)))
}, by=.(ID)]
dat[check, on=.(ID)]
# ID Date Variable MatchDate Check
# 1: a 12/03/2017 d 20/06/2017 1
# 2: a 15/04/2017 d 20/06/2017 1
# 3: a 20/06/2017 c 20/06/2017 1
# 4: b 14/05/2017 c 0 0
# 5: b 15/08/2017 c 0 0
# 6: b 16/09/2017 c 0 0
data:
library(data.table)
dat <- data.table(ID=rep(c('a','b'), each=3),
Date=c("12/03/2017","15/04/2017","20/06/2017","14/05/2017","15/08/2017","16/09/2017"),
Variable=c('d','d','c','c','c','c'))
One solution uses fill from the tidyr package. The approach is as follows:
First populate Check and c_Date for rows with Variable as c. Then fill the rows above using the fill function on both the Check and c_Date columns. This step will populate the desired values in rows with the d value. Finally, just replace the values of Check and c_Date for rows having Variable as c.
Note: OP suggested that Check for rows with Variable as c can be either 0 or 1. My solution has considered it to be 0.
# Data
df <- read.table(text = "ID Date Variable
a 12/03/2017 d
a 15/04/2017 d
a 20/06/2017 c
b 14/05/2017 c
b 15/08/2017 c
b 16/09/2017 c", header = T, stringsAsFactors = F)
df$Date <- as.POSIXct(df$Date, format = "%d/%m/%Y")
library(dplyr)
library(tidyr)
df %>% group_by(ID) %>%
arrange(ID, Date) %>%
mutate(Check = ifelse(Variable == "c", 1L, NA),
c_Date = ifelse(Variable == "c", as.character(Date), NA) ) %>%
fill(Check, .direction = "up") %>%
fill(c_Date, .direction = "up") %>%
mutate(Check = ifelse(Variable == "c", 0L, Check),
c_Date = ifelse(Variable == "c", NA, c_Date) )
# Result
# ID Date Variable Check c_Date
# <chr> <dttm> <chr> <int> <chr>
# 1 a 2017-03-12 00:00:00 d 1 2017-06-20
# 2 a 2017-04-15 00:00:00 d 1 2017-06-20
# 3 a 2017-06-20 00:00:00 c 0 <NA>
# 4 b 2017-05-14 00:00:00 c 0 <NA>
# 5 b 2017-08-15 00:00:00 c 0 <NA>
# 6 b 2017-09-16 00:00:00 c 0 <NA>
I will post a reproducible Example.
id <- c(1,1,1,1,2,2,1,1)
group <- c("a","b","c","d","a","b","c","d")
df <- data.frame(id, group)
I want something like this as end result.
+====+========+========+
| id | group1 | group2 |
+====+========+========+
| 1 | a | b |
+----+--------+--------+
| 1 | b | c |
+----+--------+--------+
| 1 | c | d |
+----+--------+--------+
| 1 | d | - |
+----+--------+--------+
| 2 | a | b |
+----+--------+--------+
| 2 | b | - |
+----+--------+--------+
| 1 | c | d |
+----+--------+--------+
| 1 | d | - |
+----+--------+--------+
Note that the order of the IDs matters. I have another column with a timestamp.
One solution with dplyr and rleid from data.table:
library(dplyr)
df %>%
mutate(id2 = data.table::rleid(id)) %>%
group_by(id2) %>%
mutate(group2 = lead(group))
# A tibble: 8 x 4
# Groups: id2 [3]
id group id2 group2
<dbl> <fct> <int> <fct>
1 1.00 a 1 b
2 1.00 b 1 c
3 1.00 c 1 d
4 1.00 d 1 NA
5 2.00 a 2 b
6 2.00 b 2 NA
7 1.00 c 3 d
8 1.00 d 3 NA
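If pulling in data.table only for `rleid` is undesirable, the run ids can be built with base R's `rle`. This is a sketch of an equivalent approach, not part of the answer above; the column name `run` is made up for the example.

```r
id <- c(1, 1, 1, 1, 2, 2, 1, 1)
group <- c("a", "b", "c", "d", "a", "b", "c", "d")
df <- data.frame(id, group, stringsAsFactors = FALSE)

# Number each run of consecutive equal ids: 1 1 1 1 2 2 3 3
runs <- rle(df$id)
df$run <- rep(seq_along(runs$lengths), runs$lengths)

# Within each run, shift group up by one; the last row of a run gets NA
df$group2 <- ave(df$group, df$run, FUN = function(x) c(x[-1], NA))
df
```

Because the runs are numbered by position, the second block of id 1 gets its own run and is not merged with the first block.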
If I understood your question correctly, you can use the following function:
id <- c(1,1,1,1,2,2,1,1)
group <- c("a","b","c","d","a","b","c","d")
df <- data.frame(id, group)
add_group2 <- function(df) {
n <- nrow(df)  # use the argument, not the global `group` vector
group2 <- as.character(df$group[2:n])
group2 <- c(group2, "-")
group2[which(c(df$id[-n] - c(df$id[2:n]), 0) != 0)] <- "-"
return(data.frame(df, group2))
}
add_group2(df)
Result should be:
id group group2
1 1 a b
2 1 b c
3 1 c d
4 1 d -
5 2 a b
6 2 b -
7 1 c d
8 1 d -
There are so many posts on how to get the group-wise min or max with SQL. But how do you do it in R?
Let's say, you have got the following data frame
ID | t | value
a | 1 | 3
a | 2 | 5
a | 3 | 2
a | 4 | 1
a | 5 | 5
b | 2 | 2
b | 3 | 1
b | 4 | 5
For every ID, I don't want the min t, but the value at the min t.
ID | value
a | 3
b | 2
df is your data.frame -
library(data.table)
setDT(df) # convert to data.table in place
df[, value[which.min(t)], by = ID]
Output -
> df[, value[which.min(t)], by = ID]
ID V1
1: a 3
2: b 2
You are looking for tapply:
df <- read.table(textConnection("
ID | t | value
a | 1 | 3
a | 2 | 5
a | 3 | 2
a | 4 | 1
a | 5 | 5
b | 2 | 2
b | 3 | 1
b | 4 | 5"), header=TRUE, sep="|", strip.white=TRUE)
m <- tapply(1:nrow(df), df$ID, function(i) {
df$value[i[which.min(df$t[i])]]
})
# a b
# 3 2
Two more solutions (with sgibb's df):
sapply(split(df, df$ID), function(x) x$value[which.min(x$t)])
#a b
#3 2
library(plyr)
ddply(df, .(ID), function(x) x$value[which.min(x$t)])
# ID V1
#1 a 3
#2 b 2
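Another base R option, shown here as a sketch with made-up variable names (`ordered`, `first_rows`): sort by ID and t, then keep the first row of each ID. Unlike the `which.min` versions, this returns whole rows, so other columns come along for free.

```r
df <- data.frame(ID    = c("a", "a", "a", "a", "a", "b", "b", "b"),
                 t     = c(1, 2, 3, 4, 5, 2, 3, 4),
                 value = c(3, 5, 2, 1, 5, 2, 1, 5),
                 stringsAsFactors = FALSE)

# Sort so that the smallest t comes first within each ID
ordered <- df[order(df$ID, df$t), ]

# duplicated() is FALSE for the first occurrence of each ID, i.e. the min-t row
first_rows <- ordered[!duplicated(ordered$ID), c("ID", "value")]
first_rows
```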