Compare and merge two dataframes - r

I have the following two dataframes in R:
df1 = data.frame(c("A", "A", "A", "B", "B"), c(1, 11, 21, 35, 45), c(6, 20, 30, 40, 60), c(1, 2, 3, 4, 5))
colnames(df1) = c("X", "Y", "Z", "score")
df1
  X  Y  Z score
1 A  1  6     1
2 A 11 20     2
3 A 21 30     3
4 B 35 40     4
5 B 45 60     5
df2 = data.frame(c("A", "A", "A", "A", "B", "B", "B", "C"), c(1, 6, 21, 50, 20, 31, 50, 10), c(5, 20, 30, 60, 30, 40, 60, 20), c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"))
colnames(df2) = c("X", "Y", "Z", "out")
df2
  X  Y  Z out
1 A  1  5  x1
2 A  6 20  x2
3 A 21 30  x3
4 A 50 60  x4
5 B 20 30  x5
6 B 31 40  x6
7 B 50 60  x7
8 C 10 20  x8
For every row in df1, I want to check:
is there a match with the value in 'X' and any other 'X' value from df2
if the above is true: I want to check if the values from 'Y' and 'Z' are in the range of the values 'Y' and 'Z' from df2
if both are true: then I want to add the value from 'out' to df1.
This is what the output should look like:
output = data.frame(c("A", "A", "A", "B", "B"), c(1, 11, 21, 35, 45), c(6, 20, 30, 40, 60), c(1, 2, 3, 4, 5), c("x1, x2", "x2", "x3", "x6", "x7"))
colnames(output) = c("X", "Y", "Z", "score", "out")
  X  Y  Z score    out
1 A  1  6     1 x1, x2
2 A 11 20     2     x2
3 A 21 30     3     x3
4 B 35 40     4     x6
5 B 45 60     5     x7
The original df1 is kept with an extra column 'out' that is added.
Row 1 of 'output' contains 'x1, x2' in column 'out' because the values in column 'X' match and the range 1 to 6 overlaps the ranges in rows 1 and 2 of df2.
I've asked this question before (Compare values from two dataframes and merge) where it is suggested to use the foverlaps function. However because of the different columns between df1 and df2 and the extra rows in df2, I cannot make it work.

Here are two possible ways: a) using the newly implemented non-equi joins feature of data.table, and b) using foverlaps, since you specifically mentioned it.
a) non-equi joins
dt2[dt1, on=.(X, Z>=Y, Y<=Z),
.(score, out=paste(out, collapse=",")),
by=.EACHI]
where dt1 and dt2 are data.tables corresponding to df1 and df2. Note that you'll have to swap the column names Z and Y in the result (since the column names come from dt2 but the values come from dt1).
The rows of dt2 that match each row in dt1 are found based on the condition provided to the on argument, and .() is evaluated for each set of matching rows (because of by=.EACHI).
b) foverlaps
setkey(dt1, X, Y, Z)
olaps <- foverlaps(dt2, dt1, type="any", nomatch=0L)
olaps[, .(score=score[1L], out=paste(out, collapse=",")), by=.(X,Y,Z)]
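For readers who want to see the overlap rule itself, here is a minimal base R sketch. It is not part of the answer above and will be much slower than data.table on large inputs. It assumes that two closed ranges overlap when each one starts no later than the other ends, which is the condition the non-equi join expresses.

```r
df1 <- data.frame(X = c("A", "A", "A", "B", "B"),
                  Y = c(1, 11, 21, 35, 45),
                  Z = c(6, 20, 30, 40, 60),
                  score = c(1, 2, 3, 4, 5),
                  stringsAsFactors = FALSE)
df2 <- data.frame(X = c("A", "A", "A", "A", "B", "B", "B", "C"),
                  Y = c(1, 6, 21, 50, 20, 31, 50, 10),
                  Z = c(5, 20, 30, 60, 30, 40, 60, 20),
                  out = paste0("x", 1:8),
                  stringsAsFactors = FALSE)

# For each row of df1, collect the 'out' values of all df2 rows that
# share the same X and whose [Y, Z] range overlaps the df1 range.
df1$out <- vapply(seq_len(nrow(df1)), function(i) {
  hit <- df2$X == df1$X[i] &
         df2$Y <= df1$Z[i] &   # df2 range starts before df1 range ends
         df2$Z >= df1$Y[i]     # df2 range ends after df1 range starts
  if (any(hit)) paste(df2$out[hit], collapse = ", ") else NA_character_
}, character(1))

df1
```

Rows of df1 with no overlapping partner get NA here, which is a choice made for this sketch rather than something the question specifies.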

library(dplyr)
df1 = data.frame(c("A", "A", "A", "B", "B"), c(1, 11, 21, 35, 45),
c(6, 20, 30, 40, 60), c(1, 2, 3, 4, 5), stringsAsFactors = F)
colnames(df1) = c("X", "Y", "Z", "score")
df2 = data.frame(c("A", "A", "A", "A", "B", "B", "B", "C"), c(1, 6, 21, 50, 20, 31, 50, 10),
c(5, 20, 30, 60, 30, 40, 60, 20),
c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"), stringsAsFactors = F)
colnames(df2) = c("X", "Y", "Z", "out")
df1 %>%
left_join(df2, by="X") %>% # join on main column
rowwise() %>% # for each row
mutate(counter = sum(seq(Y.x, Z.x) %in% seq(Y.y, Z.y))) %>% # get how many elements of those ranges overlap
filter(counter > 0) %>% # keep rows with overlap
group_by(X, Y.x, Z.x, score) %>% # for each combination of those columns
summarise(out = paste(out, collapse=", ")) %>% # combine out column
ungroup() %>%
rename(Y = Y.x,
Z = Z.x)
# # A tibble: 5 × 5
# X Y Z score out
# <chr> <dbl> <dbl> <dbl> <chr>
# 1 A 1 6 1 x1, x2
# 2 A 11 20 2 x2
# 3 A 21 30 3 x3
# 4 B 35 40 4 x6
# 5 B 45 60 5 x7
The above process is based on dplyr package and involves a join and some grouping and filtering. If your initial datasets (df1, df2) are extremely large then the join will create an even bigger dataset that will need some time to be created.
Also, note that this process works with character and not factor variables. The process might convert factor variables to character if it tries to join factor variables with different levels.
I'd suggest you run the chained commands step by step to see how it works and spot if I missed anything that might lead to bugs in the code.
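One caveat is that the seq() comparison assumes integer endpoints. A possible alternative, sketched here in base R rather than dplyr, joins on X and then compares the range endpoints directly. The `.2` suffix and the object names `m` and `res` are just names chosen for this illustration.

```r
df1 <- data.frame(X = c("A", "A", "A", "B", "B"),
                  Y = c(1, 11, 21, 35, 45),
                  Z = c(6, 20, 30, 40, 60),
                  score = c(1, 2, 3, 4, 5),
                  stringsAsFactors = FALSE)
df2 <- data.frame(X = c("A", "A", "A", "A", "B", "B", "B", "C"),
                  Y = c(1, 6, 21, 50, 20, 31, 50, 10),
                  Z = c(5, 20, 30, 60, 30, 40, 60, 20),
                  out = paste0("x", 1:8),
                  stringsAsFactors = FALSE)

# Join on X; df2's Y and Z are renamed to Y.2 and Z.2.
m <- merge(df1, df2, by = "X", suffixes = c("", ".2"))

# Keep only pairs whose ranges overlap (works for non-integer endpoints too).
m <- m[m$Y.2 <= m$Z & m$Z.2 >= m$Y, ]

# Collapse the matching 'out' values for each original df1 row.
res <- aggregate(out ~ X + Y + Z + score, data = m,
                 FUN = paste, collapse = ", ")
res
```

Like the dplyr version, this still builds the full join on X before filtering, so the same memory concern applies to very large inputs.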

Here is another option using sqldf:
library(sqldf)
xx=sqldf('select t1.*,t2.out from df1 t1 left join df2 t2 on t1.X=t2.X and ((t2.Y between t1.Y and t1.Z) or (t2.Z between t1.Y and t1.Z))')
aggregate(xx[ncol(xx)], xx[-ncol(xx)], FUN = function(X) paste(unique(X), collapse=", "))


How to calculate the sum of rows in group only when the condition is met [duplicate]

I have a data frame similar to this:
data.frame(Group1 = c("A", "A", "A", "A"),
Group2 = c("X", "X", "X", "Y"),
ValueA = c(20, 40, 50, 80),
ValueB = c(0, 0, 70, 60))
I want to calculate the sum of ValueA within each group defined by Group1 and Group2, counting only the rows where ValueB is 0.
My expected output is:
data.frame(Group1 = c("A", "A", "A", "A"),
Group2 = c("X", "X", "X", "Y"),
ValueA = c(20, 40, 50, 80),
ValueB = c(0, 0, 70, 60),
SumA_whenBis0 = c(60, 60, 0, 0))
We can use a grouped mutate and sum ValueA only where ValueB == 0 by subsetting it, sum(ValueA[ValueB == 0]):
library(dplyr)
df <- data.frame(Group1 = c("A", "A", "A", "A"),
Group2 = c("X", "X", "X", "Y"),
ValueA = c(20, 40, 50, 80),
ValueB = c(0, 0, 70, 60))
df %>%
group_by(Group1, Group2) %>%
mutate(SumA_whenBis0 = sum(ValueA[ValueB == 0]))
#> # A tibble: 4 x 5
#> # Groups: Group1, Group2 [2]
#> Group1 Group2 ValueA ValueB SumA_whenBis0
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 A X 20 0 60
#> 2 A X 40 0 60
#> 3 A X 50 70 60
#> 4 A Y 80 60 0
Created on 2023-02-14 by the reprex package (v2.0.1)
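Note that the grouped mutate above gives every row in a group the group's conditional sum, so row 3 shows 60 while the expected output in the question shows 0. If rows with a non-zero ValueB should get 0 instead, a base R sketch along these lines may help (the helper name `grp_sum` is made up for this example):

```r
df <- data.frame(Group1 = c("A", "A", "A", "A"),
                 Group2 = c("X", "X", "X", "Y"),
                 ValueA = c(20, 40, 50, 80),
                 ValueB = c(0, 0, 70, 60))

# Group-wise sum of ValueA, counting only rows where ValueB is 0.
grp_sum <- ave(df$ValueA * (df$ValueB == 0), df$Group1, df$Group2, FUN = sum)

# Report the sum only on the rows that contributed to it, 0 elsewhere.
df$SumA_whenBis0 <- ifelse(df$ValueB == 0, grp_sum, 0)
df
```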

Divide one table by another, with matching index

I have two tables with a shared index, and I want to divide one by the other. This can be done by dividing the two data frames directly, but that seems arbitrary (how would I know I am dividing by the right number?) and it does not preserve the index, so I want to do the division by matching rows with the same index. What's the best way to do this? Is there a best practice for table division in this case?
tb1 <- data.frame(index = c(1, 2, 3), total_1 = c(100, 450, 300), total_2 = c(20, 39, 60))
tb2 <- data.frame(index = c(1, 2, 3), unit_1 = c(4, 2, 3), unit_2 = c(2, 3, 6))
tb1[,-1]/tb2[,-1]
total_1 total_2
1 25 10
2 225 13
3 100 10
In another case, two index columns must match:
tb2 <- data.frame(index_1 = c("a", "b", "b"), index_2 = c("c", "d", "b"), unit_1 = c(4, 2, 3), unit_2 = c(2, 3, 6))
tb1 <- data.frame(index_1 = c("a", "b", "b"), index_2 = c("c", "d", "b"), total_1 = c(100, 450, 300), total_2 = c(20, 39, 60))
If both data frames have the same index values and the same number of rows, one way is to order both by 'index' so that the rows line up, and then do the division:
tb1new <- tb1[order(tb1$index),]
tb2new <- tb2[order(tb2$index),]
tb1new[-1] <- tb1new[-1]/tb2new[-1]
Or we can first check that the 'index' columns are identical and divide only if they are:
i1 <- isTRUE(all.equal(tb1$index, tb2$index))
if(i1) tb1[-1]/tb2[-1]
Another option is a join:
library(data.table)
nm1 <- c('total_1', 'total_2')
nm2 <- c('unit_1', 'unit_2')
setDT(tb1)[tb2, (nm1) := .SD/mget(nm2), on = .(index), .SDcols = nm1]
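If the two tables hold the same index values but in a different row order, match() can line the rows up without sorting either table. The following is a small base R sketch under that assumption; tb2 is deliberately shuffled here to show the alignment, and `pos` and `res` are names invented for the example.

```r
tb1 <- data.frame(index = c(1, 2, 3),
                  total_1 = c(100, 450, 300),
                  total_2 = c(20, 39, 60))
# Same rows as in the question, but in a different order.
tb2 <- data.frame(index = c(3, 1, 2),
                  unit_1 = c(3, 4, 2),
                  unit_2 = c(6, 2, 3))

# Position in tb2 of each index value in tb1.
pos <- match(tb1$index, tb2$index)
stopifnot(!anyNA(pos))   # every index in tb1 must exist in tb2

# Divide row by row after aligning tb2 to tb1, and keep the index column.
res <- cbind(tb1["index"], tb1[-1] / tb2[pos, -1])
res
```

The division is still positional by column, so this relies on the total and unit columns appearing in the same order in both tables.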
You can perform a join and then divide the columns. In base R:
result <- merge(tb1, tb2, by = c('index_1', 'index_2'))
result
# index_1 index_2 total_1 total_2 unit_1 unit_2
#1 a c 100 20 4 2
#2 b b 300 60 3 6
#3 b d 450 39 2 3
total_cols <- grep('total', names(result), value = TRUE)
unit_cols <- grep('unit', names(result), value = TRUE)
result[total_cols]/result[unit_cols]
# total_1 total_2
#1 25 10
#2 100 10
#3 225 13
Maybe this is not the most efficient solution but here is another way:
library(dplyr)
library(tidyr)
# For one index matching
tb1 %>%
left_join(tb2, by = "index") %>%
mutate(result_1 = get(paste("total", 1, sep = "_")) / get(paste("unit", 1, sep = "_")),
result_2 = get(paste("total", 2, sep = "_")) / get(paste("unit", 2, sep = "_"))) %>%
select(!starts_with(c("total", "unit")))
index result_1 result_2
1 1 25 10
2 2 225 13
3 3 100 10
# For two indices matching
tb1 %>%
left_join(tb2, by = c("index_1", "index_2")) %>%
mutate(result_1 = get(paste("total", 1, sep = "_")) / get(paste("unit", 1, sep = "_")),
result_2 = get(paste("total", 2, sep = "_")) / get(paste("unit", 2, sep = "_"))) %>%
select(!starts_with(c("total", "unit")))
index_1 index_2 result_1 result_2
1 a c 25 10
2 b d 225 13
3 b b 100 10

Using 3 different data to output 4th dataframe

I’m having trouble working with 3 different sets of data (df1, df2, vec1) to output a third dataframe df3. I have 2 dataframes df1 and df2. In df1, each letter in X1 corresponds to a value in X2. In df2, X3 represents a numerical value found in vec1 and X4 represents a letter or multiple letters from df1$X1. I’m looking to scan the letters found in df2$X4 and see if there is a sequential order of N values determined from df2$X3 in vec1, and then remove any letters that do not fit this criterion.
For example, in df2[1, ], the letters are “A, B, D” and the value is 3. Looking at vec1, the max sequential order that includes the value 3 is “2, 3, 4, 5”, meaning df2[1, 2] should be replaced with “A, D” instead of “A, B, D”. The final output should look like df3. Any ideas would be greatly appreciated.
df1 <- data.frame(c("A", "B", "C", "D"), c(4, 8, 1, 3))
colnames(df1) <- c("X1", "X2")
df2 <- data.frame(c(3, 21, 27, 34, 35, 46), c("A, B, D", "A, C", NA, "B", "B, D", "C"))
colnames(df2) <- c("X3", "X4")
vec1 <- c(2, 3, 4, 5, 21, 22, 23, 27, 33, 34, 35, 36, 37, 38, 39, 46)
df3 <- data.frame(c(3, 21, 27, 34, 35, 46), c("A, D", "C", NA, NA, "D", NA))
This is not elegant but it may do what you need it to do.
First, create a list that contains consecutive integers:
vec1_seq <- split(vec1, cumsum(c(0, diff(vec1) > 1)))
$`0`
[1] 2 3 4 5
$`1`
[1] 21 22 23
$`2`
[1] 27
$`3`
[1] 33 34 35 36 37 38 39
$`4`
[1] 46
Then, do the following. Check for X3 in each element of the list, and determine the length if contained in that element. Then, keep only those letters that meet the length requirement:
cbind(df2,
X5 = apply(df2, 1, function(x) {
l <- length(unlist(vec1_seq[sapply(seq_along(vec1_seq), function(i) {
as.numeric(x[["X3"]]) %in% vec1_seq[[i]]
})]))
toString(na.omit(as.vector(sapply(trimws(unlist(strsplit(x[["X4"]], ","))), function(i) {
ifelse(i == df1[["X1"]] & df1[["X2"]] <= l, i, NA)
}))))
}))
It seems that "C" should remain for row 6; if that is incorrect let me know.
Output
  X3      X4   X5
1  3 A, B, D A, D
2 21    A, C    C
3 27    <NA>
4 34       B
5 35    B, D    D
6 46       C    C
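The same idea can be written a little more directly by computing, for every value in vec1, the length of the consecutive run it belongs to. This is only a sketch of an alternative, not the answerer's code; `run_id`, `run_len`, `len_x3` and `need` are names chosen here, and rows where no letter survives are returned as NA rather than as an empty string.

```r
df1 <- data.frame(X1 = c("A", "B", "C", "D"), X2 = c(4, 8, 1, 3),
                  stringsAsFactors = FALSE)
df2 <- data.frame(X3 = c(3, 21, 27, 34, 35, 46),
                  X4 = c("A, B, D", "A, C", NA, "B", "B, D", "C"),
                  stringsAsFactors = FALSE)
vec1 <- c(2, 3, 4, 5, 21, 22, 23, 27, 33, 34, 35, 36, 37, 38, 39, 46)

# Label each consecutive run, then record the run length for every value.
run_id  <- cumsum(c(0, diff(vec1) > 1))
run_len <- ave(vec1, run_id, FUN = length)

# Run length that applies to each X3 value.
len_x3 <- run_len[match(df2$X3, vec1)]

# Required run length per letter, looked up by name.
need <- setNames(df1$X2, df1$X1)

df2$X5 <- mapply(function(letters_chr, len) {
  if (is.na(letters_chr)) return(NA_character_)
  l <- trimws(strsplit(letters_chr, ",")[[1]])
  kept <- l[need[l] <= len]          # keep letters whose requirement is met
  if (length(kept)) toString(kept) else NA_character_
}, df2$X4, len_x3, USE.NAMES = FALSE)

df2
```

As in the answer above, "C" is kept for row 6 because its required length of 1 is met by the single-value run at 46.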

Creating an additional dataframe using two existing dataframe

Input Data Frame
Data Frame 1 (example - nrow = 100)
Col A | Col B | Col C
a 1 2
a 3 4
b 5 6
c 9 10
Data frame 2 (example - nrow = 200)
Col A | Col B | Col E
a 1 22
a 31 41
a 3 63
b 5 6
b 11 13
c 9 20
I want to create a third data set that contains each of the additional rows found in Data Frame 2 for each Col A entry.
Output File (nrow = 200-100 = 100)
Col A | Col B | Col E
a 31 41
b 11 13
You could add row numbers within each ColA group to each data frame, and then do an anti_join on the group and the row number:
library(tidyverse)
df2 %>%
group_by(ColA) %>%
mutate(rn = row_number()) %>%
anti_join(df1 %>% group_by(ColA) %>% mutate(rn = row_number()), by = c("ColA", "rn")) %>%
select(-rn)
Output
# A tibble: 2 x 3
# Groups: ColA [2]
ColA ColD ColE
<chr> <dbl> <dbl>
1 a 51 63
2 b 11 13
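The row-number idea can also be sketched in base R with ave(), for anyone who would rather not load the tidyverse. This assumes, as the approach above does, that the "additional" rows are simply the ones beyond the count already present in df1 for each ColA value; `rn1`, `rn2` and `extra` are names made up for the example.

```r
df1 <- data.frame(ColA = c("a", "a", "b", "c"),
                  ColB = c(1, 3, 5, 9),
                  ColC = c(2, 4, 6, 10),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ColA = c("a", "a", "a", "b", "b", "c"),
                  ColD = c(12, 31, 51, 71, 11, 93),
                  ColE = c(22, 41, 63, 86, 13, 20),
                  stringsAsFactors = FALSE)

# Row number within each ColA group.
rn1 <- ave(seq_along(df1$ColA), df1$ColA, FUN = seq_along)
rn2 <- ave(seq_along(df2$ColA), df2$ColA, FUN = seq_along)

# Keep df2 rows whose (ColA, row number) pair has no partner in df1.
extra <- df2[!(paste(df2$ColA, rn2) %in% paste(df1$ColA, rn1)), ]
row.names(extra) <- NULL
extra
```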
If we need it in a loop, create an empty dataset with the column names of the second dataset. Then loop over the unique values of 'ColA' from the second dataset: subset 'df2', get the difference in the number of rows between that subset and the corresponding rows of 'df1' ('cnt'), and rbind 'out' with the last 'cnt' rows of the subset.
#// Create an empty dataset structure
out <- data.frame(ColA = character(), ColD = numeric(), ColE = numeric())
# // Get the unique values of the column
un1 <- unique(df2$ColA)
# // Loop over the unique values
for(un in un1) {
# // subset the dataset df2
tmp <- subset(df2, ColA == un)
# // get a difference in row count
cnt <- nrow(tmp) - sum(df1$ColA == un)
# // use the count to subset the subset of df2
# // rbind and assign back to the original out
out <- rbind(out, tail(tmp, cnt))
}
row.names(out) <- NULL
out
# ColA ColD ColE
#1 a 51 63
#2 b 11 13
For multiple columns, we could paste them together to create a single key column:
df1 <- data.frame(ColA = c('a', 'a', 'b', 'c'), ColB = c(1, 3, 5, 9),
ColC = c(2, 4, 6, 10))
df2 <- data.frame(ColA = c('a', 'a', 'a', 'b', 'b', 'c'),
ColB = c(1, 31, 3, 5, 11, 9), ColE = c(22, 41, 63, 6, 13, 20))
Create the function:
f1 <- function(data1, data2, by_cols) {
# // Create an empty dataset structure
# // Get the unique value by pasting the by_cols
data2$new <- do.call(paste, data2[by_cols])
data1$new <- do.call(paste, data1[by_cols])
out <- data2[0,]
un1 <- unique(data2$new)
# // Loop over the unique values
for(un in un1) {
# // subset the second dataset
tmp <- subset(data2, new == un)
# // get the difference in row count
cnt <- nrow(tmp) - sum(data1$new == un)
# // use the count to subset the subset of data2
# // rbind and assign back to the original out
out <- rbind(out, tail(tmp, cnt))
}
out$new <- NULL
row.names(out) <- NULL
out
}
f1(df1, df2, c("ColA", "ColB"))
# ColA ColB ColE
#1 a 31 41
#2 b 11 13
data
df1 <- structure(list(ColA = c("a", "a", "b", "c"), ColB = c(1, 3, 5,
9), ColC = c(2, 4, 6, 10)), class = "data.frame", row.names = c(NA,
-4L))
df2 <- structure(list(ColA = c("a", "a", "a", "b", "b", "c"), ColD = c(12,
31, 51, 71, 11, 93), ColE = c(22, 41, 63, 86, 13, 20)), class = "data.frame",
row.names = c(NA,
-6L))

Match value from one dataframe to values from a second dataframe of different length

I have two dataframes like so
df_1 <- data.frame(Min = c(1, 4, 9, 25),
Max = c(3, 7, 14, 100))
df_2 <- data.frame(Value = c(5, 2, 33),
Symbol = c("B", "A", "D"))
I want to attach df_2$Symbol to df_1 based on whether or not df_2$Value falls between df_1$Min and df_1$Max. If there's no df_2$Value in the appropriate range I'd like NA instead:
df_target <- data.frame(
Min = c(1, 4, 9, 25),
Max = c(3, 7, 14, 100),
Symbol = c("A", "B", NA, "D")
)
If df_1 and df_2 were of equal lengths this would be simple with findInterval or something with cut but alas...
A solution in either base or tidyverse would be appreciated.
We could use a non-equi join
library(data.table)
setDT(df_1)[df_2, Symbol := Symbol, on = .(Min < Value, Max > Value)]
df_1
# Min Max Symbol
#1: 1 3 A
#2: 4 7 B
#3: 9 14 <NA>
#4: 25 100 D
Or we can use fuzzy_left_join:
library(fuzzyjoin)
library(dplyr) # provides %>%
fuzzy_left_join(df_1, df_2, by = c('Min' = 'Value',
'Max' = 'Value'), match_fun = list(`<`, `>`)) %>%
dplyr::select(-Value)
# Min Max Symbol
#1 1 3 A
#2 4 7 B
#3 9 14 <NA>
#4 25 100 D
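Since the question asks for base R or tidyverse, here is a plain base R sketch of the same range lookup. It assumes strict inequalities, as in the two answers above, and takes the first matching Value when more than one falls inside a range; that tie-breaking rule is an assumption of this sketch, not something stated in the question.

```r
df_1 <- data.frame(Min = c(1, 4, 9, 25),
                   Max = c(3, 7, 14, 100))
df_2 <- data.frame(Value = c(5, 2, 33),
                   Symbol = c("B", "A", "D"),
                   stringsAsFactors = FALSE)

# For each range in df_1, find the df_2 values that fall strictly inside it.
df_1$Symbol <- vapply(seq_len(nrow(df_1)), function(i) {
  hit <- df_2$Value > df_1$Min[i] & df_2$Value < df_1$Max[i]
  if (any(hit)) df_2$Symbol[which(hit)[1]] else NA_character_
}, character(1))

df_1
```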
