There are so many posts on how to get the group-wise min or max with SQL. But how do you do it in R?
Let's say, you have got the following data frame
ID | t | value
a | 1 | 3
a | 2 | 5
a | 3 | 2
a | 4 | 1
a | 5 | 5
b | 2 | 2
b | 3 | 1
b | 4 | 5
For every ID, I don't want the min t, but the value at the min t.
ID | value
a | 3
b| 2
df is your data.frame -
library(data.table)
setDT(df) # convert to data.table in place
df[, value[which.min(t)], by = ID]
Output -
> df[, value[which.min(t)], by = ID]
ID V1
1: a 3
2: b 2
You are looking for tapply:
df <- read.table(textConnection("
ID | t | value
a | 1 | 3
a | 2 | 5
a | 3 | 2
a | 4 | 1
a | 5 | 5
b | 2 | 2
b | 3 | 1
b | 4 | 5"), header=TRUE, sep="|")
m <- tapply(1:nrow(df), df$ID, function(i) {
df$value[i[which.min(df$t[i])]]
})
# a b
# 3 2
Two more solutions (with sgibb's df):
sapply(split(df, df$ID), function(x) x$value[which.min(x$t)])
#a b
#3 2
library(plyr)
ddply(df, .(ID), function(x) x$value[which.min(x$t)])
# ID V1
#1 a 3
#2 b 2
Related
I have a df like this:
VisitID | Item |
1 | A |
1 | B |
1 | C |
1 | D |
2 | A |
2 | D |
2 | B |
3 | B |
3 | C |
4 | D |
4 | C |
In R, how do I filter for VisitIDs as long as they contain Item A & B?
Expected Outcome:
VisitID | Item |
1 | A |
1 | B |
1 | C |
1 | D |
2 | A |
2 | D |
2 | B |
I tried df %>% group_by(VisitID) %>% filter(any(Item == 'A' & Item == 'B')) but it doesn't work..
df <- read_delim("ID | Item
1 | A
1 | B
2 | A
3 | B
1 | C
4 | C
5 | B
3 | A
4 | A
5 | D", delim = "|", trim_ws = TRUE)
Since you want both "A" and "B" you can use all
library(dplyr)
df %>% group_by(VisitID) %>% filter(all(c("A", "B") %in% Item))
# VisitID Item
# <int> <chr>
#1 1 A
#2 1 B
#3 1 C
#4 1 D
#5 2 A
#6 2 D
#7 2 B
OR if you want to use any use them separately.
df %>% group_by(VisitID) %>% filter(any(Item == 'A') && any(Item == 'B'))
An otion with data.table
library(data.table)
setDT(df)[, .SD[all(c("A", "B") %in% Item)], VisitID]
I will post a reproducible Example.
id <- c(1,1,1,1,2,2,1,1)
group <- c("a","b","c","d","a","b","c","d")
df <- data.frame(id, group)
I want something like this as end result.
+====+========+========+
| id | group1 | group2 |
+====+========+========+
| 1 | a | b |
+----+--------+--------+
| 1 | b | c |
+----+--------+--------+
| 1 | c | d |
+----+--------+--------+
| 1 | d | - |
+----+--------+--------+
| 2 | a | b |
+----+--------+--------+
| 2 | b | - |
+----+--------+--------+
| 1 | c | d |
+----+--------+--------+
| 1 | d | - |
+----+--------+--------+
Just to mention the order of ID's matter. I have another column as timestamp.
One solution with dplyr and rleid from data.table:
library(dplyr)
df %>%
mutate(id2 = data.table::rleid(id)) %>%
group_by(id2) %>%
mutate(group2 = lead(group))
# A tibble: 8 x 4
# Groups: id2 [3]
id group id2 group2
<dbl> <fct> <int> <fct>
1 1.00 a 1 b
2 1.00 b 1 c
3 1.00 c 1 d
4 1.00 d 1 NA
5 2.00 a 2 b
6 2.00 b 2 NA
7 1.00 c 3 d
8 1.00 d 3 NA
If I understood correct your question, you can use the following function:
id <- c(1,1,1,1,2,2,1,1)
group <- c("a","b","c","d","a","b","c","d")
df <- data.frame(id, group)
add_group2 <- function(df) {
n <-length(group)
group2 <- as.character(df$group[2:n])
group2 <- c(group2, "-")
group2[which(c(df$id[-n] - c(df$id[2:n]), 0) != 0)] <- "-"
return(data.frame(df, group2))
}
add_group2(df)
Result should be:
id group group2
1 1 a b
2 1 b c
3 1 c d
4 1 d -
5 2 a b
6 2 b -
7 1 c d
8 1 d -
I have a data.table like so:
id | id2 | val
--------------
1 | 1 | A
1 | 2 | B
2 | 3 | C
2 | 4 | D
3 | 5 | E
3 | 6 | F
I want to group by the id column, and return the maximum id2 for that `id. Like so:
id | id2 | val
--------------
1 | 2 | B
2 | 4 | D
3 | 6 | F
It's easy in SQL:
SELECT id, MAX(id2) FROM tbl GROUP BY id;
But I want to know how to do this with data.table. So far I have:
tbl[, .(id2 = max(id2)), by = id]
but I don't know how to get the val part.
df <- read.table(header = T, text = "id id2 val
1 1 A
1 2 B
2 3 C
2 4 D
3 5 E
3 6 F")
library(data.table)
setDT(df)
df[, max_id2 := max(id2), by = id]
df <- df[id2 == max_id2, ]
df[, max_id2 := NULL]
id id2 val
1: 1 2 B
2: 2 4 D
3: 3 6 F
I have two dataframes: df_workingFile and df_groupIDs
df_workingFile:
ID | GroupID | Sales | Date
v | a1 | 1 | 2011
w | a1 | 3 | 2010
x | b1 | 8 | 2007
y | b1 | 3 | 2006
z | c3 | 2 | 2006
df_groupIDs:
GroupID | numIDs | MaxSales
a1 | 2 | 3
b1 | 2 | 8
c3 | 1 | 2
For df_groupIDs, I want to get the ID and Date of the event with the max sales in that group. So group "a1" has 2 events in df_workingFile, "v" and "w". I want to identify that event "w" has the Max sales value and bring it's information into df_groupIDs. The final output should look like this:
GroupID | numIDs | MaxSales | ID | Date
a1 | 2 | 3 | w | 2010
b1 | 2 | 8 | x | 2007
c3 | 1 | 2 | z | 2006
Now here's the problem. I wrote code that does this, but it's very inefficient and takes forever to process when I deal with datasets of 50-100K rows. I need help figuring out how to rewrite my code to be more efficient. Here's what I currently have:
i = 1
for (groupID in df_groupIDs$groupID) {
groupEvents <- subset(df_workingFile, df_workingFile$groupID == groupID)
index <- match(df_groupIDs$maxSales[i], groupEvents$Sales)
df_groupIDs$ID[i] = groupEvents$ID[index]
df_groupIDs$Date[i] = groupEvents$Date[index]
i = i+1
}
Using dplyr:
library(dplyr)
df_workingFile %>%
group_by(GroupID) %>% # for each group id
arrange(desc(Sales)) %>% # sort by Sales (descending)
slice(1) %>% # keep the top row
inner_join(df_groupIDs) # join to df_groupIDs
select(GroupID, numIDs, MaxSales, ID, Date)
# keep the columns you want in the order you want
Another simpler method, if the Sales are integers (and can thus be relied on for equality testing with the MaxSales column):
inner_join(df_groupIDs, df_workingFile,
by = c("GroupID" = "GroupID", "MaxSales" = "Sales"))
This makes use of a feature that SQLite has that if max is used on a line then it automatically brings along the row that the maximum came from.
library(sqldf)
sqldf("select g.GroupID, g.numIDs, max(w.Sales) MaxSales, w.ID, w.Date
from df_groupIDs g left join df_workingFile w using(GroupID)
group by GroupID")
giving:
GroupID numIDs MaxSales ID Date
1 a1 2 3 w 2010
2 b1 2 8 x 2007
3 c3 1 2 z 2006
Note: The two input data frames shown reproducibly are:
Lines1 <- "
ID | GroupID | Sales | Date
v | a1 | 1 | 2011
w | a1 | 3 | 2010
x | b1 | 8 | 2007
y | b1 | 3 | 2006
z | c3 | 2 | 2006"
df_workingFile <- read.table(text = Lines1, header = TRUE, sep = "|", strip.white = TRUE)
Lines2 <- "
GroupID | numIDs | MaxSales
a1 | 2 | 3
b1 | 2 | 8
c3 | 1 | 2"
df_groupIDs <- read.table(text = Lines2, header = TRUE, sep = "|", strip.white = TRUE)
This question already has answers here:
Reshaping data.frame from wide to long format
(8 answers)
Closed 6 years ago.
EDIT:
Upon further examination, this dataset is way more insane than I previously believed.
Values have been encapsulated in the column names!
My dataframe looks like this:
| ID | Year1_A | Year1_B | Year2_A | Year2_B |
|----|---------|---------|---------|---------|
| 1 | a | b | 2a | 2b |
| 2 | c | d | 2c | 2d |
I am searching for a way to reformat it as such:
| ID | Year | _A | _B |
|----|------|-----|-----|
| 1 | 1 | a | b |
| 1 | 2 | 2a | 2b |
| 2 | 1 | c | d |
| 2 | 2 | 2c | 2d |
The answer below is great, and works perfectly, but the issue is that the dataframe needs more work -- somehow possibly be spread back out, so that each row has 3 columns.
My best idea was to do merge(df, df, by="ID") and then filter out the unwanted rows but this is quickly becoming unwieldy.
df <- data.frame(ID = 1:2, Year1_A = c('a', 'c'), Year1_B = c('b','d' ), Year2_A = c('2a', '2c'), Year2_B = c('2b', '2d'))
library(tidyr)
# your example data
df <- data.frame(ID = 1:2, Year1_A = c('a', 'c'), Year1_B = c('b','d' ), Year2_A = c('2a', '2c'), Year2_B = c('2b', '2d'))
# the solution
df <- gather(df, Year, value, -ID)
# cleaning up
df$Year <- gsub("Year", "", df$Year)
Result:
> df
ID Year value
1 1 1_A a
2 2 1_A c
3 1 1_B b
4 2 1_B d
5 1 2_A 2a
6 2 2_A 2c
7 1 2_B 2b
8 2 2_B 2d