Merging two data frames of different length by group id [duplicate] - r

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 3 years ago.
I am trying to merge two data frames by group id. However, the two data frames are not the same length: some elements of certain groups are missing from the second data frame. In the merged result, the missing elements of a group should be NAs.
The data looks something like this:
df1 <- data.frame(id = c(1,1,1,2,3,3,4), x = c("a", "b", "c", "d", "e", "f", "g"))
df2 <- data.frame(id = c(1,1,2,3,4), y = c("A", "B", "D", "E", "G"))
Ideally, the result would look like this:
id x y
1 a A
1 b B
1 c <NA>
2 d D
3 e E
3 f <NA>
4 g G
It would be great if the code worked for additional columns that also correspond to the same group ids but may miss elements at different places.
I have tried full_join and merge so far, but without success: they just repeat the y values instead of introducing NAs.
I know there are similar questions out there, but I have found none that solves this problem. Any help is appreciated.

This data.table solution might work.
First, create a row ID within each group. Then join by id and these row IDs.
library(data.table)
dt1 <- data.table(id = c(1,1,1,2,3,3,4), x = c("a", "b", "c", "d", "e", "f", "g"))
dt2 <- data.table(id = c(1,1,2,3,4), y = c("A", "B", "D", "E", "G"))
# number rows within each group
dt1[, row_id := seq_len(.N), by = id]
dt2[, row_id := seq_len(.N), by = id]
dt1[dt2, y := i.y, on = .(id, row_id)][, row_id := NULL][]
# id x y
# 1: 1 a A
# 2: 1 b B
# 3: 1 c <NA>
# 4: 2 d D
# 5: 3 e E
# 6: 3 f <NA>
# 7: 4 g G
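For comparison, a dplyr sketch of the same row-number idea (assuming dplyr is installed). Because the join is on id plus the within-group row number, additional columns in df2 are carried across the same way, with NAs wherever a group runs short:
library(dplyr)
df1 <- data.frame(id = c(1,1,1,2,3,3,4), x = c("a", "b", "c", "d", "e", "f", "g"))
df2 <- data.frame(id = c(1,1,2,3,4), y = c("A", "B", "D", "E", "G"))
df1 %>%
  group_by(id) %>%
  mutate(row_id = row_number()) %>%   # number rows within each group
  ungroup() %>%
  left_join(df2 %>% group_by(id) %>% mutate(row_id = row_number()) %>% ungroup(),
            by = c("id", "row_id")) %>%
  select(-row_id)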

Adding new information to a table upon matching rows

I have very basic knowledge of R. I have two tabs (A and B) with rows I want to compare - some values match and some don't. I want R to find the matching elements and add the text value "E" to a pre-existing row in tab A if this is the case.
Example:
Tab A
ID Existing?
1 A
2 B
3 C
4 D
5 E
Tab B
ID
1 D
2 B
3 Y
4 A
5 W
Upon match:
Tab A
ID Existing?
1 A E
2 B E
3 C
4 D E
5 E
I have found information online on how to match tables but none on how to write new information when the match takes place.
Please explain like I'm 5... I have no programming background.
Thank you in advance!
Use match to find which elements of df1$ID also occur in df2$ID (match returns the position of the match, or NA if there is none), and ifelse to recode matched values with "E" and NA otherwise.
df1 <- data.frame(ID = LETTERS[1:5])
df2 <- data.frame(ID = c("D", "B", "Y", "A", "W"))
df1$Existing <- ifelse(!is.na(match(df1$ID, df2$ID)), "E", NA)
ID Existing
1 A E
2 B E
3 C <NA>
4 D E
5 E <NA>
Another solution - using dplyr - would be to join the two dataframes, where you have added the column Existing to the one being joined:
library(dplyr, warn.conflicts = FALSE)
df1 <- tibble(ID = LETTERS[1:5])
df2 <- tibble(ID = c("D", "B", "Y", "A", "W"))
df1 %>%
left_join(df2 %>% mutate(Existing = "E"))
#> Joining, by = "ID"
#> # A tibble: 5 x 2
#> ID Existing
#> <chr> <chr>
#> 1 A E
#> 2 B E
#> 3 C <NA>
#> 4 D E
#> 5 E <NA>
This will set all matching IDs to E and all non-matching to NA.
# data
tab1 <- structure(list(ID = c("A", "B", "C", "D", "E"),
                       Existing = rep(NA_character_, 5)),
                  class = "data.frame", row.names = c(NA, -5L))
tab2 <- structure(list(ID = c("D", "B", "Y", "A", "W")),
                  class = "data.frame", row.names = c(NA, -5L))
There are many ways to skin this cat. In base R, you could try, e.g.,
tab1$Existing[tab1$ID %in% tab2$ID] <- 'E'
In practice, for anything more complicated than tables with 6 rows, you could try dplyr:
library(dplyr)
tab1 %>% mutate(Existing = ifelse(ID %in% tab2$ID, 'E', NA))
Another useful tool -- with a slightly differing syntax -- is data.table.
library(data.table)
setDT(tab1)   # converts in place; no reassignment needed
setDT(tab2)
tab1[, Existing := ifelse(ID %in% tab2$ID, 'E', NA)]
Note that here, mutate and := play roughly the same role. If you work more with R, you will probably develop an affinity with one of the "dialects" above.
EDIT: To drop the rows with NA values (in dplyr), you could either do:
tab1 %>%
  mutate(Existing = ifelse(ID %in% tab2$ID, 'E', NA)) %>%
  filter(!is.na(Existing))
Or piggy-backing on #jpiversen's solution:
df1 %>%
  inner_join(df2 %>% mutate(Existing = "E"))
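A data.table analogue of that inner join, as a sketch (reusing the tab1/tab2 data.tables from above; nomatch = NULL keeps matching rows only):
tab1[tab2, on = "ID", nomatch = NULL][, Existing := "E"][]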

Efficiently Repeating Observations by Group

I am trying to find an efficient way to repeat rows by group in data.table for only some groups. Please consider the following example:
library(data.table)
DT <- data.table(x = c("A","A", "B", "B", "C","C", "D","D"),
y = 1:8)
This dataset looks like:
head(DT)
x y
1: A 1
2: A 2
3: B 3
4: B 4
5: C 5
6: C 6
Say I have a separate vector rep <- c("A", "A", "A", "B", "B", "C"). Given this vector, I want to be able to repeat all rows of A three times (due to the cardinality of "A" in rep) and all rows associated with B two times. Thus, the final dataset should look like:
x y
1: A 1
2: A 2
3: A 1
4: A 2
5: A 1
6: A 2
7: B 3
8: B 4
9: B 3
10: B 4
11: C 5
12: C 6
Notice that I did not repeat "C" because the cardinality of "C" is only 1 in rep. I have a hackish way of doing this at the moment, but I'm wondering if there is a more efficient data.table way of doing the above.
Thank you!
P.S. The reason I am doing this is because I am doing some matching with replacement in my regressions and sometimes, the same control firm is assigned to more than one treatment firm.
A data.table merge won't give you the same ordering, but you aren't supposed to rely on row order in data.tables anyway:
merge(DT, data.frame(x=rep), by="x")
x y
1: A 1
2: A 1
3: A 1
4: A 2
5: A 2
6: A 2
7: B 3
8: B 3
9: B 4
10: B 4
11: C 5
12: C 6
One solution is to build a table of the repeated group values and join onto it (an inner join also drops groups, like "D", that never appear in the vector):
library(data.table)
library(dplyr)
DT <- data.table(x = c("A","A", "B", "B", "C","C", "D","D"),
                 y = 1:8)
rep_vec <- c("A", "A", "A", "B", "B", "C")
rep_DT <- DT %>%
  inner_join(data.frame(group = rep_vec), by = c("x" = "group"))
Are you sure duplicating rows in a dataframe is your ideal option though?
We can do
DT[ data.table(x = v1)[, .N, x], on = .(x)][rep(seq_len(.N), N)]
Or to return in the same order
DT[, .(y = list(y)), x][data.table(x = v1), on = .(x)][, .(x, y = unlist(y))]
data
v1 <- c("A", "A", "A", "B", "B", "C")
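A base R sketch of the same idea, using DT and v1 from above: count how often each group appears in v1 and repeat each row accordingly (groups absent from v1, like "D", are dropped; rows are repeated row-wise, matching the merge output above):
counts <- table(v1)                        # A = 3, B = 2, C = 1
keep <- DT$x %in% names(counts)            # drop groups that never appear in v1
DT[keep][rep(seq_len(sum(keep)), counts[DT$x[keep]])]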

Identifying distinct arrangements of values by group in a data.frame

I have a large data frame whose primary organization is a grouping column, with groups that are all of identical length (3 in the toy example).
df <- data.frame(groups = c("gr1","gr1","gr1","gr2","gr2","gr2","gr3","gr3","gr3"),
no = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
colA = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
colB = c("a", "b", "c", "X_", "b", "c", "a", "b", "c"),
colC = c("a", "b", "c", "X_", "b", "c", "c", "b", "a"))
df
> groups no colA colB colC
> 1 gr1 1 a a a
> 2 gr1 2 b b b
> 3 gr1 3 c c c
> 4 gr2 1 a X_ X_
> 5 gr2 2 b b b
> 6 gr2 3 c c c
> 7 gr3 1 a a c
> 8 gr3 2 b b b
> 9 gr3 3 c c a
I want to identify, for each column, which group is the first example of a unique arrangement of values. So for colA it should return (T, F, F), since all three groups are identical and only group one is the first unique one. For colB it should return (T, T, F), since there are two distinct groups and only the 3rd is identical to the 1st. And for colC it should be (T, T, T), since the order of items matters.
So the final output could be a matrix like this
colA colB colC
> gr1 T T T
> gr2 F T T
> gr3 F F T
I think I could figure this out by breaking the data frame down into pairs of group and colA/B/C, identifying which ones are identical, storing the results in a vector, and then reassembling the whole deal. But I am seeing a ton of for-loops and have a hard time thinking about how to vectorize this. I have been using dplyr a bit, but I don't yet see how it can help.
Maybe there's a decent way to unstack each of the columns based on the groups and then run a comparison across the relevant subsets of new (and shorter) columns?
Edited to add:
Maybe group_by %>% summarize is a way to get at this. If the summary can essentially concatenate all values in a group per column into a really long string I could then see which of those is distinct per group?
Second edit:
I got as far as:
d1 <- df %>% group_by(groups) %>% summarise(colB = paste(unique(colB), collapse = ', ')) %>% distinct(colB)
which puts out
> # A tibble: 2 x 1
> colB
> <chr>
> 1 a, b, c
> 2 X_, b, c
It identifies the distinct groups, but I now have to figure out how to compare them against the full set of groups to get T/F for each group.
Here's a base R approach:
cols <- grep('col', names(df))
# collapse each group's values into one string per column, then flag first occurrences
cbind(unique(df[1]), sapply(df[cols], function(x)
  !duplicated(by(x, df$groups, paste0, collapse = '-'))))
# groups colA colB colC
#1 gr1 TRUE TRUE TRUE
#4 gr2 FALSE TRUE TRUE
#7 gr3 FALSE FALSE TRUE
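The unstack idea from the question can also work per column, as a rough sketch that relies on the equal group sizes:
u <- unstack(df, colB ~ groups)   # one column per group: gr1, gr2, gr3
!duplicated(as.list(u))           # TRUE for the first occurrence of each arrangement
# [1] TRUE TRUE FALSE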
Your summarize idea is spot on:
df %>%
  group_by(groups) %>%
  summarize(across(starts_with("col"), paste, collapse = ""), .groups = "drop") %>%
  mutate(across(starts_with("col"), ~ !duplicated(.)))
# # A tibble: 3 x 4
# groups colA colB colC
# <chr> <lgl> <lgl> <lgl>
# 1 gr1 TRUE TRUE TRUE
# 2 gr2 FALSE TRUE TRUE
# 3 gr3 FALSE FALSE TRUE
With "data.table" you can try:
library(data.table)
cols <- c("colA", "colB", "colC")
fun <- function(x) !duplicated(x)
as.data.table(df)[, lapply(.SD, toString), groups, .SDcols = cols][
, (cols) := lapply(.SD, fun), .SDcols = cols][]
# groups colA colB colC
# 1: gr1 TRUE TRUE TRUE
# 2: gr2 FALSE TRUE TRUE
# 3: gr3 FALSE FALSE TRUE

R - building new variables from sequenced data

This is an update / follow-up on this question. The answer outlined there doesn't meet the new requirements.
I am looking for an efficient way (data.table?) to construct two new measures for each ID.
Measure 1 and Measure 2 need to meet the following conditions:
Condition 1:
Find a sequence of three rows for which:
the first count > 0,
the second count > 1, and
the third count == 1.
Condition 2 for Measure 1:
take the elements in product of the third row of the sequence that are:
in the product of the second row of the sequence, and
NOT in the stock of the first row of the sequence.
Condition 2 for Measure 2:
take the elements in product of the third row of the sequence that are:
NOT in the product of the second row of the sequence, and
NOT in the stock of the first row of the sequence.
Data:
df2 <- data.frame(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
product = c("A", "B", "C", "A,C,E", "A,B", "A,B,C", "D", "A", "B", "A", "A", "A,B,C", "D", "D"),
stock = c("A", "A,B", "A,B,C", "A,B,C,E", "A,B,C,E", "A,B,C,E", "A,B,C,D,E", "A", "A,B", "A,B", "A", "A,B,C", "A,B,C,D", "A,B,C,D"))
> df2
ID seqs count product stock
1 1 1 2 A A
2 1 2 1 B A,B
3 1 3 3 C A,B,C
4 1 4 1 A,C,E A,B,C,E
5 1 5 1 A,B A,B,C,E
6 1 6 2 A,B,C A,B,C,E
7 1 7 3 D A,B,C,D,E
8 2 1 1 A A
9 2 2 2 B A,B
10 2 3 1 A A,B
11 3 1 3 A A
12 3 2 1 A,B,C A,B,C
13 3 3 4 D A,B,C,D
14 3 4 1 D A,B,C,D
The desired output looks like this:
ID seq1 seq2 seq3 measure1 measure2
1: 1 2 3 4 C E
2: 2 1 2 3
3: 3 2 3 4 D
How would you code this?
A few things you need to know to be able to do this:
shift, to compare values across rows within your groups
separate_rows, to split the comma-separated strings into a normalised one-element-per-row view
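A quick illustration of both helpers, as a sketch:
library(data.table)
library(tidyr)
shift(c(2, 1, 3), type = "lead")
# [1]  1  3 NA   -- each row sees the next row's value
separate_rows(data.frame(id = 1, product = "A,C,E"), product, sep = ",")
# one row per element: A, C, E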
library(data.table)
dt <- data.table(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
product = c("A", "B", "C", "A,C,E", "A,B", "A,B,C", "D", "A", "B", "A", "A", "A,B,C", "D", "D"),
stock = c("A", "A,B", "A,B,C", "A,B,C,E", "A,B,C,E", "A,B,C,E", "A,B,C,D,E", "A", "A,B", "A,B", "A", "A,B,C", "A,B,C,D", "A,B,C,D"))
# shift within each ID so a window never spans two IDs
dt[, count.2 := shift(count, type = "lead"), by = ID]
dt[, count.3 := shift(count, n = 2, type = "lead"), by = ID]
dt[, product.2 := shift(product, type = "lead"), by = ID]
dt[, product.3 := shift(product, n = 2, type = "lead"), by = ID]
dt <- dt[count > 0 & count.2 > 1 & count.3 == 1]
dt <- unique(dt, by = "ID")
library(tidyr)
dt.measure <- separate_rows(dt, product.3, sep = ",")
dt.measure <- separate_rows(dt.measure, stock, sep = ",")
dt.measure <- separate_rows(dt.measure, product, sep = ",")
dt.measure[, measure.1 := (product.3 == product.2 & product.3 != stock)]
dt.measure[, measure.2 := (product.3 != product.2 & product.3 != stock)]
res <- dt.measure[,
.(
measure.1 = max(ifelse(measure.1, product.3, NA_character_), na.rm = TRUE),
measure.2 = max(ifelse(measure.2, product.3, NA_character_), na.rm = TRUE)
),
ID
]
dt <- merge(dt, res, by = "ID")
dt[, .(ID, measure.1, measure.2)]
# ID measure.1 measure.2
# 1: 1 C E
# 2: 2 <NA> <NA>
# 3: 3 D <NA>
I'm not sure what the criterion for efficiency is, but here's an approach using embed and tidyverse style. It filters down so you are working with less and less data.
Loading up the data and packages (note that later on, setdiff and intersect come from dplyr):
library(purrr)
library(dplyr)
df1 <- data.frame(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
product = c("A", "B", "C", "A,C,E", "A,B",
"A,B,C", "D", "A", "B", "A", "A",
"A,B,C", "D", "D"),
stock = c("A", "A,B", "A,B,C", "A,B,C,E", "A,B,C,E",
"A,B,C,E", "A,B,C,D,E", "A", "A,B", "A,B", "A",
"A,B,C", "A,B,C,D", "A,B,C,D"),
stringsAsFactors = FALSE)
Define a helper function to evaluate condition 1
meetsCond1 <- function(rseg) {
  seg <- rev(rseg)   # embed() returns each window in reverse time order
  all(seg[1] > 0, seg[2] > 1, seg[3] == 1)
}
The embed function wraps a time series into a matrix where essentially each row is a window of the length of interest. Using apply, you filter down to which rows start relevant sequences.
cond1Match <- embed(df1$count, 3) %>%
  apply(1, meetsCond1) %>%
  which()
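The helper reverses its input because embed puts the most recent value first in each row, as a quick sketch shows:
embed(1:5, 3)
#      [,1] [,2] [,3]
# [1,]    3    2    1
# [2,]    4    3    2
# [3,]    5    4    3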
You can translate that back to final products, the previous products, and stock rows of interest to determine the measures by adding offsets. Split them into a list of individual components.
finalProds <- df1$product[cond1Match + 2] %>% strsplit(",")
prevProds <- df1$product[cond1Match + 1] %>% strsplit(",")
initialStock <- df1$stock[cond1Match] %>% strsplit(",")
For both measures, neither of them can be in the stock.
notStock <- map2(finalProds, initialStock, ~.x[!(.x %in% .y)])
Then generate your data.frame by retrieving the seqs and ID values of the window. The measures then are just the intersect and setdiff of the final products with those in the previous rows.
data.frame(ID = df1$ID[cond1Match],
           seq1 = df1$seqs[cond1Match],
           seq2 = df1$seqs[cond1Match + 1],
           seq3 = df1$seqs[cond1Match + 2],
           measure1 = imap_chr(notStock,
                               ~intersect(.x, prevProds[[.y]]) %>%
                                 {if (length(.) == 0) "" else paste(., collapse = ",")}),
           measure2 = imap_chr(notStock,
                               ~setdiff(.x, prevProds[[.y]]) %>%
                                 {if (length(.) == 0) "" else paste(., collapse = ",")}),
           stringsAsFactors = FALSE) %>%
  slice(match(unique(ID), ID))
which yields the desired output, which appears to be limited to one row per ID. In the original post, you specify you want all matches reported; removing the slice call would then instead yield
#> ID seq1 seq2 seq3 measure1 measure2
#> 1 1 2 3 4 C E
#> 2 1 6 7 1
#> 3 2 1 2 3
#> 4 2 3 1 2 C
#> 5 3 2 3 4 D
If you're looking to really squeeze out efficiency, you might be able to gain some by inlining the definitions of finalProds, prevProds, and initialStock instead of assigning them to variables first. I would imagine that unless your set of matches is really large, the gain would be negligible.
A rolling window approach using data.table with base R code in j:
library(data.table)
cols <- c("product", "stock")
setDT(df2)[, (cols) := lapply(.SD, function(x) strsplit(as.character(x), split=",")), .SDcols=cols]
ans <- df2[,
transpose(lapply(1L:(.N-2L), function(k) {
if(count[k]>0 && count[k+1L]>1 && count[k+2L]==1) {
m1 <- setdiff(intersect(product[[k+2L]], product[[k+1L]]), stock[[k]])
m2 <- setdiff(setdiff(product[[k+2L]], product[[k+1L]]), stock[[k]])
c(seq1=seqs[k], seq2=seqs[k+1L], seq3=seqs[k+2L],
measure1=if(length(m1) > 0) paste(m1, collapse=",") else "",
measure2=if(length(m2) > 0) paste(m2, collapse=",") else "")
}
}), ignore.empty=TRUE),
ID]
setnames(ans, names(ans)[-1L], c(paste0("seq", 1:3), paste0("measure", 1:2)))
ans
output:
ID seq1 seq2 seq3 measure1 measure2
1: 1 2 3 4 C E
2: 2 1 2 3
3: 3 2 3 4 D

melt() is using all column names as id variables

So, with this dummy dataset
test_species <- c("a", "b", "c", "d", "e")
test_abundance <- c(4, 7, 15, 2, 9)
df <- rbind(test_species, test_abundance)
df <- as.data.frame(df)
colnames(df) <- c("a", "b", "c", "d", "e")
df <- dplyr::slice(df, 2)
we get a dataframe that's something like this:
a b c d e
4 7 15 2 9
I'd like to transform it into something like
species abundance
a 4
b 7
c 15
d 2
e 9
using the reshape2 function melt(). I tried the code
library(reshape2)
melted_df <- melt(df,
                  variable.name = "species",
                  value.name = "abundance")
but that tells me: "Using a, b, c, d, e as id variables", and the end result looks like this:
a b c d e
4 7 15 2 9
What am I doing wrong, and how can I fix it?
You can define it in the correct shape from the start, using only base library functions:
> data.frame(species=test_species, abundance=test_abundance)
species abundance
1 a 4
2 b 7
3 c 15
4 d 2
5 e 9
rbind is what causes the odd behaviour: it combines the two vectors into a character matrix (the numbers get coerced to strings), so every column of the resulting data frame is non-numeric, and melt then uses all of them as id variables.
A fairly basic fix is:
test_species <-c("a", "b", "c", "d", "e")
test_abundance <- c(4, 7, 15, 2, 9)
df <- data.frame(test_species, test_abundance) #skip rbind and go straight to df
colnames(df) <- c('species', 'abundance') #colnames correct
This skips rbind entirely and gives the desired outcome.
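If you do want to keep the rbind-built data frame, you can instead tell melt that every column is a measure variable (a sketch, assuming reshape2; the melted values come back as text and need converting):
library(reshape2)
melted_df <- melt(df,
                  measure.vars = names(df),
                  variable.name = "species",
                  value.name = "abundance")
# values were coerced to character (or factor) by rbind, so convert back
melted_df$abundance <- as.numeric(as.character(melted_df$abundance))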
