Quick way of matching data between two dataframes [R]

I have two dataframes: df_workingFile and df_groupIDs
df_workingFile:
ID | GroupID | Sales | Date
v | a1 | 1 | 2011
w | a1 | 3 | 2010
x | b1 | 8 | 2007
y | b1 | 3 | 2006
z | c3 | 2 | 2006
df_groupIDs:
GroupID | numIDs | MaxSales
a1 | 2 | 3
b1 | 2 | 8
c3 | 1 | 2
For df_groupIDs, I want to get the ID and Date of the event with the max Sales in that group. So group "a1" has 2 events in df_workingFile, "v" and "w". I want to identify that event "w" has the max Sales value and bring its information into df_groupIDs. The final output should look like this:
GroupID | numIDs | MaxSales | ID | Date
a1 | 2 | 3 | w | 2010
b1 | 2 | 8 | x | 2007
c3 | 1 | 2 | z | 2006
Now here's the problem. I wrote code that does this, but it's very inefficient and takes forever to process when I deal with datasets of 50-100K rows. I need help figuring out how to rewrite my code to be more efficient. Here's what I currently have:
i = 1
for (groupID in df_groupIDs$GroupID) {
  groupEvents <- subset(df_workingFile, df_workingFile$GroupID == groupID)
  index <- match(df_groupIDs$MaxSales[i], groupEvents$Sales)
  df_groupIDs$ID[i] = groupEvents$ID[index]
  df_groupIDs$Date[i] = groupEvents$Date[index]
  i = i + 1
}

The loop is slow because subset() rescans the entire data frame once per group; a grouped operation makes a single pass instead. Using dplyr:
library(dplyr)
df_workingFile %>%
  group_by(GroupID) %>%                        # for each group ID
  arrange(desc(Sales)) %>%                     # sort by Sales (descending)
  slice(1) %>%                                 # keep the top row per group
  inner_join(df_groupIDs, by = "GroupID") %>%  # join to df_groupIDs
  select(GroupID, numIDs, MaxSales, ID, Date)  # keep the columns you want, in order
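If your dplyr is version 1.0.0 or newer, slice_max() expresses the sort-and-take-top step more directly; a minimal sketch under that version assumption:
library(dplyr)
df_workingFile %>%
  group_by(GroupID) %>%
  slice_max(Sales, n = 1, with_ties = FALSE) %>%  # top-Sales row per group
  inner_join(df_groupIDs, by = "GroupID") %>%
  select(GroupID, numIDs, MaxSales, ID, Date)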
Another, simpler method, if the Sales are integers (and can thus be relied on for equality testing against the MaxSales column):
inner_join(df_groupIDs, df_workingFile,
           by = c("GroupID" = "GroupID", "MaxSales" = "Sales"))

This makes use of a SQLite feature: when max() is used in a select, the other selected columns are automatically taken from the row the maximum came from.
library(sqldf)
sqldf("select g.GroupID, g.numIDs, max(w.Sales) MaxSales, w.ID, w.Date
from df_groupIDs g left join df_workingFile w using(GroupID)
group by GroupID")
giving:
GroupID numIDs MaxSales ID Date
1 a1 2 3 w 2010
2 b1 2 8 x 2007
3 c3 1 2 z 2006
Note: The two input data frames shown reproducibly are:
Lines1 <- "
ID | GroupID | Sales | Date
v | a1 | 1 | 2011
w | a1 | 3 | 2010
x | b1 | 8 | 2007
y | b1 | 3 | 2006
z | c3 | 2 | 2006"
df_workingFile <- read.table(text = Lines1, header = TRUE, sep = "|", strip.white = TRUE)
Lines2 <- "
GroupID | numIDs | MaxSales
a1 | 2 | 3
b1 | 2 | 8
c3 | 1 | 2"
df_groupIDs <- read.table(text = Lines2, header = TRUE, sep = "|", strip.white = TRUE)

Related

join each row to the whole second table in R dplyr [duplicate]

This question already has answers here:
Cartesian product with dplyr
(7 answers)
Closed 1 year ago.
I have two tables:
table 1:
| | a | b |
|---|----|----|
| 1 | a1 | b1 |
| 2 | a2 | b2 |
and
table 2:
| | c | d |
|---|----|----|
| 1 | c1 | d1 |
| 2 | c2 | d2 |
I want to join them in a way that each row of table 1 binds column-wise with table 2, to get this result:
| | a | b | c | d |
|---|----|----|----|----|
| 1 | a1 | b1 | c1 | d1 |
| 2 | a1 | b1 | c2 | d2 |
| 3 | a2 | b2 | c1 | d1 |
| 4 | a2 | b2 | c2 | d2 |
I feel like this is a duplicate question, but I could not find the right wording and search terms to find the answer.
There is no need to join; we can use tidyr::expand_grid:
library(dplyr)
library(tidyr)
table1 <- tibble(a = c("a1", "a2"),
                 b = c("b1", "b2"))
table2 <- tibble(c = c("c1", "c2"),
                 d = c("d1", "d2"))
expand_grid(table1, table2)
#> # A tibble: 4 x 4
#> a b c d
#> <chr> <chr> <chr> <chr>
#> 1 a1 b1 c1 d1
#> 2 a1 b1 c2 d2
#> 3 a2 b2 c1 d1
#> 4 a2 b2 c2 d2
Created on 2021-09-17 by the reprex package (v2.0.1)
I found a crude answer:
table1$key <- 1
table2$key <- 1
result <- left_join(table1, table2, by = "key") %>%
  select(-key)
Any better answer is much appreciated.
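For reference, the dummy key is avoidable. Starting from the original table1 and table2 (no key column added), base R's merge() performs a cross join when by = NULL, and dplyr 1.1.0 added a dedicated cross_join(); a sketch, assuming dplyr >= 1.1.0 for the second variant:
# base R: merging with no join columns yields the Cartesian product
merge(table1, table2, by = NULL)

# dplyr >= 1.1.0
library(dplyr)
cross_join(table1, table2)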

Aggregate rows into new column based on common value in another column in R

I have two data frames
df1 is like this
| NOC | 2007 | 2008 |
|:---- |:------:| -----:|
| A | 100 | 5 |
| B | 100 | 5 |
| C | 100 | 5 |
| D | 20 | 2 |
| E | 10 | 12 |
| F | 2 | 1 |
df2
| NOC | GROUP |
|:---- |:------:|
| A | aa|
| B | aa |
| C | aa |
| D | bb |
| E | bb |
| F | cc |
I would like to create a new df3 which aggregates the columns 2007 and 2008 by group identity, assigning each row the sum over its group, so my df3 would look like this:
| NOC | 2007 | 2008 | GROUP | s2007 | s2008 |
|:----|-----:|-----:|:------|------:|------:|
| A   | 100  | 5    | aa    | 300   | 15    |
| B   | 100  | 5    | aa    | 300   | 15    |
| C   | 100  | 5    | aa    | 300   | 15    |
| D   | 20   | 2    | bb    | 30    | 14    |
| E   | 10   | 12   | bb    | 30    | 14    |
| F   | 2    | 1    | cc    | 2     | 1     |
My code is not very efficient. I first merged df1 with df2 by NOC into df3:
df3 <- merge(df1, df2, by = "NOC", all.x = TRUE)
then used dplyr's summarise to create s2007 and s2008 in df4 (the numeric column names need backticks):
df3 %>%
  group_by(GROUP) %>%
  summarise(num = n(),
            s2007 = sum(`2007`), s2008 = sum(`2008`)) -> df4
then I merged df1 with df4 again to create my final dataframe.
I am wondering about two things:
is there a more efficient way?
since my dataframe contains annual data for 2007-2030, I am currently writing out the summarise call for each year; is there a faster way to summarise all the columns except NOC?
Thank you!
Before the answer, a small piece of advice: never give your columns purely numeric names; it can cause many glitches.
library(tidyverse)
df1 %>%
  left_join(df2, by = 'NOC') %>%
  group_by(GROUP) %>%
  mutate(across(c(`2007`, `2008`), ~sum(.), .names = 's.{.col}'))
# A tibble: 6 x 6
# Groups: GROUP [3]
NOC `2007` `2008` GROUP s.2007 s.2008
<chr> <int> <int> <chr> <int> <int>
1 A 100 5 aa 300 15
2 B 100 5 aa 300 15
3 C 100 5 aa 300 15
4 D 20 2 bb 30 14
5 E 10 12 bb 30 14
6 F 2 1 cc 2 1
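To cover every year column (2007 through 2030) without listing each one, across() accepts a tidyselect helper; a sketch, assuming the year columns are the only numeric ones:
library(dplyr)
df1 %>%
  left_join(df2, by = 'NOC') %>%
  group_by(GROUP) %>%
  mutate(across(where(is.numeric), sum, .names = 's.{.col}'))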

R: How to filter column as long as it contains combination of values?

I have a df like this:
VisitID | Item |
1 | A |
1 | B |
1 | C |
1 | D |
2 | A |
2 | D |
2 | B |
3 | B |
3 | C |
4 | D |
4 | C |
In R, how do I filter for VisitIDs as long as they contain Item A & B?
Expected Outcome:
VisitID | Item |
1 | A |
1 | B |
1 | C |
1 | D |
2 | A |
2 | D |
2 | B |
I tried df %>% group_by(VisitID) %>% filter(any(Item == 'A' & Item == 'B')), but it doesn't work.
library(readr)
df <- read_delim("VisitID | Item
1 | A
1 | B
1 | C
1 | D
2 | A
2 | D
2 | B
3 | B
3 | C
4 | D
4 | C", delim = "|", trim_ws = TRUE)
The condition Item == 'A' & Item == 'B' tests each element against both values at once, so it is always FALSE. Since you want both "A" and "B" present in the group, use all:
library(dplyr)
df %>% group_by(VisitID) %>% filter(all(c("A", "B") %in% Item))
# VisitID Item
# <int> <chr>
#1 1 A
#2 1 B
#3 1 C
#4 1 D
#5 2 A
#6 2 D
#7 2 B
Or, if you want to use any, apply it to each value separately:
df %>% group_by(VisitID) %>% filter(any(Item == 'A') && any(Item == 'B'))
An option with data.table:
library(data.table)
setDT(df)[, .SD[all(c("A", "B") %in% Item)], VisitID]
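For comparison, a base R sketch on the same data: collect the VisitIDs that contain both items, then keep their rows.
# VisitIDs that appear with "A" and also with "B"
ids <- intersect(df$VisitID[df$Item == "A"], df$VisitID[df$Item == "B"])
df[df$VisitID %in% ids, ]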

How to reshape a dataframe in R, when values and variables are in the column names? [duplicate]

This question already has answers here:
Reshaping data.frame from wide to long format
(8 answers)
Closed 6 years ago.
EDIT:
Upon further examination, this dataset is way more insane than I previously believed.
Values have been encapsulated in the column names!
My dataframe looks like this:
| ID | Year1_A | Year1_B | Year2_A | Year2_B |
|----|---------|---------|---------|---------|
| 1 | a | b | 2a | 2b |
| 2 | c | d | 2c | 2d |
I am searching for a way to reformat it as such:
| ID | Year | _A | _B |
|----|------|-----|-----|
| 1 | 1 | a | b |
| 1 | 2 | 2a | 2b |
| 2 | 1 | c | d |
| 2 | 2 | 2c | 2d |
The answer below is great, and works perfectly, but the issue is that the dataframe needs more work: it somehow has to be spread back out so that each ID/Year pair sits on one row with the _A and _B values in separate columns.
My best idea was to do merge(df, df, by="ID") and then filter out the unwanted rows, but this is quickly becoming unwieldy.
library(tidyr)
# your example data
df <- data.frame(ID = 1:2, Year1_A = c('a', 'c'), Year1_B = c('b','d' ), Year2_A = c('2a', '2c'), Year2_B = c('2b', '2d'))
# the solution
df <- gather(df, Year, value, -ID)
# cleaning up
df$Year <- gsub("Year", "", df$Year)
Result:
> df
ID Year value
1 1 1_A a
2 2 1_A c
3 1 1_B b
4 2 1_B d
5 1 2_A 2a
6 2 2_A 2c
7 1 2_B 2b
8 2 2_B 2d
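To get all the way to the desired layout in one step, newer tidyr (>= 1.0.0) can pull both the year and the destination column name out of the headers with the special ".value" sentinel; a sketch under that version assumption (the leading underscores are dropped from the new column names):
library(tidyr)
df <- data.frame(ID = 1:2, Year1_A = c('a', 'c'), Year1_B = c('b', 'd'),
                 Year2_A = c('2a', '2c'), Year2_B = c('2b', '2d'))
pivot_longer(df, -ID,
             names_to = c("Year", ".value"),
             names_pattern = "Year(\\d+)_(.*)")
# gives columns ID, Year, A, B with one row per ID/Year combination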

R: group-wise min or max

There are so many posts on how to get the group-wise min or max with SQL. But how do you do it in R?
Let's say, you have got the following data frame
ID | t | value
a | 1 | 3
a | 2 | 5
a | 3 | 2
a | 4 | 1
a | 5 | 5
b | 2 | 2
b | 3 | 1
b | 4 | 5
For every ID, I don't want the min t, but the value at the min t.
ID | value
a | 3
b | 2
Assuming df is your data.frame:
library(data.table)
setDT(df) # convert to data.table in place
df[, value[which.min(t)], by = ID]
Output -
> df[, value[which.min(t)], by = ID]
ID V1
1: a 3
2: b 2
You are looking for tapply:
df <- read.table(textConnection("
ID | t | value
a | 1 | 3
a | 2 | 5
a | 3 | 2
a | 4 | 1
a | 5 | 5
b | 2 | 2
b | 3 | 1
b | 4 | 5"), header=TRUE, sep="|")
m <- tapply(1:nrow(df), df$ID, function(i) {
df$value[i[which.min(df$t[i])]]
})
# a b
# 3 2
Two more solutions (with sgibb's df):
sapply(split(df, df$ID), function(x) x$value[which.min(x$t)])
#a b
#3 2
library(plyr)
ddply(df, .(ID), function(x) x$value[which.min(x$t)])
# ID V1
#1 a 3
#2 b 2
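For completeness, a dplyr sketch of the same group-wise operation (assuming dplyr >= 1.0.0, which introduced slice_min()):
library(dplyr)
df %>%
  group_by(ID) %>%
  slice_min(t, n = 1) %>%  # row with the smallest t per ID
  select(ID, value)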
