Match in R while disregarding order [duplicate] - r

This question already has an answer here:
Match Dataframes Excluding Last Non-NA Value and disregarding order
(1 answer)
Closed 5 years ago.
I am trying to do a match in R regardless of the order of the columns.
Basically, if all of the values in the columns of df2, from column 2 to the end, are found in a row of df1 (after Partner), then that row of df1 should match.
Here's the catch: disregard the last non-NA value in each row when doing this match, but include it in the final output.
After the match, determine whether that last non-NA value exists in any of the columns of its respective row.
df1
Partner Col1 Col2 Col3 Col4
A A1 A2 NA NA
B A2 B9 NA NA
C B7 V9 C1 N9
D Q1 Q3 Q4 NA
df2
lift rule1 rule2 rule3
11 A2 A1 A9
10 A1 A3 NA
11 B9 A2 D7
10 Q4 Q1 NA
11 A2 B9 B1
How do I match df1 with df2 so that the following happens:
1) Disregards the order of the columns found in both dataframes.
2) Then determines whether the last non-NA value exists in the matched row.
Final output:
df3
Partner Col1 Col2 Col3 Col4 lift rule1 rule2 rule3 EXIST?
A A1 A2 NA NA 11 A2 A1 A9 YES
A A1 A2 NA NA 10 A1 A3 NA NOPE
B A2 B9 NA NA 11 B9 A2 D7 YES
B A2 B9 NA NA 11 A2 B9 B1 YES
D Q1 Q3 Q4 NA 10 Q4 Q1 NA YES

I get one more B match than you, but this solution is very close to what you want. First add an id column to each data frame, since the ids are used to reconstruct the data afterwards. To perform the match, melt each data frame to long format with gather from tidyr, then join them with inner_join from dplyr. Finally, cbind the original data.frames, indexed by the matched ids.
library(tidyr);library(dplyr)
df1 <- read.table(text="Partner Col1 Col2 Col3 Col4
A A1 A2 NA NA
B A2 B9 NA NA
C B7 V9 C1 N9
D Q1 Q3 Q4 NA",header=TRUE, stringsAsFactors=FALSE)
df2 <- read.table(text="lift rule1 rule2 rule3
11 A2 A1 A9
10 A1 A3 NA
11 B9 A2 D7
10 Q4 Q1 NA
11 A2 B9 B1",header=TRUE, stringsAsFactors=FALSE)
df1 <- cbind(df1_id=1:nrow(df1),df1)
df2 <- cbind(df2_id=1:nrow(df2),df2)
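# Melt to long format with gather
d11 <- df1 %>% gather(Col, Value, starts_with("C"))
# Drop NAs, then drop the last non-NA value of each df1 row
d11 <- d11 %>% na.omit() %>% group_by(df1_id) %>% slice(-n())
d22 <- df2 %>% gather(Rule, Value, starts_with("r"))
# Join on the shared Value column
res <- inner_join(d11, d22, by = "Value")
# Rebuild the wide rows from the matched ids
cbind(df1[res$df1_id, ], df2[res$df2_id, ])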
df1_id Partner Col1 Col2 Col3 Col4 df2_id lift rule1 rule2 rule3
1 1 A A1 A2 <NA> <NA> 2 10 A1 A3 <NA>
1.1 1 A A1 A2 <NA> <NA> 1 11 A2 A1 A9
2 2 B A2 B9 <NA> <NA> 1 11 A2 A1 A9
2.1 2 B A2 B9 <NA> <NA> 5 11 A2 B9 B1
2.2 2 B A2 B9 <NA> <NA> 3 11 B9 A2 D7
4 4 D Q1 Q3 Q4 <NA> 4 10 Q4 Q1 <NA>
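The question's rule (every df2 value except the last non-NA one must appear somewhere in a df1 row, in any order) can also be sketched in base R as a subset test. This is only an illustration, not the answer's method: the helper row_values and the column name last are hypothetical, and the EXIST? flag is left out because the question does not pin down its criterion.

```r
df1 <- data.frame(
  Partner = c("A", "B", "C", "D"),
  Col1 = c("A1", "A2", "B7", "Q1"),
  Col2 = c("A2", "B9", "V9", "Q3"),
  Col3 = c(NA, NA, "C1", "Q4"),
  Col4 = c(NA, NA, "N9", NA),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  lift  = c(11, 10, 11, 10, 11),
  rule1 = c("A2", "A1", "B9", "Q4", "A2"),
  rule2 = c("A1", "A3", "A2", "Q1", "B9"),
  rule3 = c("A9", NA, "D7", NA, "B1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the non-NA values of each row, as a list of vectors
row_values <- function(d) {
  lapply(seq_len(nrow(d)), function(i) {
    v <- unlist(d[i, ], use.names = FALSE)
    v[!is.na(v)]
  })
}
vals1 <- row_values(df1[-1])  # drop Partner
vals2 <- row_values(df2[-1])  # drop lift

# Test every (df1 row, df2 row) pair
pairs <- expand.grid(i = seq_along(vals1), j = seq_along(vals2))
keep <- mapply(function(i, j) {
  key <- head(vals2[[j]], -1)  # disregard the last non-NA value
  length(key) > 0 && all(key %in% vals1[[i]])
}, pairs$i, pairs$j)
pairs <- pairs[keep, ]

# Rebuild wide rows and carry the disregarded value along
result <- cbind(df1[pairs$i, ], df2[pairs$j, ])
result$last <- vapply(vals2[pairs$j], function(v) tail(v, 1), character(1))
result
```

On the sample data this keeps five pairs (Partner A, A, B, D, B), one fewer B match than the join-based approach, because sharing a single value is not enough under the subset test.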

Related

Calculate medians of multiple columns in a data frame

I would like to calculate the median of my df database below.
In this case, I would like to get the median results for columns A1 through A10 and return the results for the columns separately.
Thanks!
#database
df <- structure(
list(D1 = c("a","a","b","b","b"),
D2 = c("c","d","c","d","c"), D3 = c("X","X","Y","Z","Z"), A1=c(1,2,3,4,5),A2=c(4,2,3,4,4), A3=c(1,2,3,4,6),
A4=c(1,9,4,4,6),A5=c(1,4,3,9,6),A6=c(1,2,4,4,8),A7=c(1,1,3,4,7),A8=c(1,6,4,4,2),A9=c(1,2,3,4,6),A10=c(1,5,3,2,7)),
class = "data.frame", row.names = c(NA, -5L))
If you want to keep it simple:
apply(df[, 4:13], 2, median)
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10
3 4 3 4 4 4 3 4 3 3
We can loop over the numeric columns and get the median
library(dplyr)
df %>%
summarise(across(where(is.numeric), median))
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10
1 3 4 3 4 4 4 3 4 3 3
Or use colMedians from matrixStats
library(matrixStats)
colMedians(as.matrix(df[startsWith(names(df), "A")]))
[1] 3 4 3 4 4 4 3 4 3 3
Or in base R
sapply(df[startsWith(names(df), "A")], median)
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10
3 4 3 4 4 4 3 4 3 3

How to find common rows (considering vice versa format) of 2 dataframe in R

I want to find the common rows between 2 dataframes. To find common rows, I can use inner_join(), semi_join(), and merge(). I have gone through different posts, but these operations do not fulfill my purpose because my data is a little different.
Sometimes a pair in one dataframe appears reversed in the other (vice versa), like the 3rd and 5th rows of dataframe-1 and dataframe-2. Dataframe-1 contains A3 A1 0.75 but dataframe-2 contains A1 A3. I would like to treat these 2 rows as the same.
My first dataframe looks like
query target weight
1 A1 A2 0.60
2 A2 A5 0.50
3 A3 A1 0.75
4 A4 A5 0.88
5 A5 A3 0.99
6 (+)-1(10),4-Cadinadiene Falcarinone-10 0.09
7 Leucodelphinidin-100 (+)-1(10),4-Cadinadiene 0.876
8 Lignin (2E,7R,11R)-2-Phyten-1-ol 0.778
9 (2E,7R,11R)-2-Phyten-1-ol Leucodelphinidin 0.55
10 Falcarinone Lignin 1
11 A1 (+)-1(10),4-Cadinadiene 1
12 A2 Lignin-10 1
13 A3 (2E,7R,11R)-2-Phyten-1-ol 1
14 Falcarinone A6 1
15 A4 Leucodelphinidin 1
16 A4 Leucodelphinidin 1
17 Falcarinone A100 1
18 A4 Falcarinone 1
the second dataframe looks like
query target
1 A1 A2
2 A2 A5
3 A1 A3 // Missing in the output
4 A4 A5
5 A3 A5 // Missing in the output
6 A3 (2E,7R,11R)-2-Phyten-1-ol
7 (+)-1(10),4-Cadinadiene Falcarinone
8 Leucodelphinidin (+)-1(10),4-Cadinadiene-100
9 Lignin-2 (2E,7R,11R)-2-Phyten-1-ol
10 A11 (+)-1(10),4-Cadinadiene
11 A2 Lignin
12 A3 (2E,7R,11R)-2-Phyten-1-0l
13 Falcarinone A60
14 A4 Leucodelphinidin // Missing in the output
The code I am using
output <- semi_join(Dataframe-1, Dataframe-2) OR
output <- inner_join(df_only_dd, sample_data_dd_interaction)
The output I am getting
query target weight
1 A1 A2 0.60
2 A2 A5 0.50
But, my expected output is like this
query target weight
1 A1 A2 0.60
2 A2 A5 0.50
3 A3 A1 0.75
4 A4 A5 0.88
5 A5 A3 0.99
6 A4 Leucodelphinidin 1
Reproducible code is given below
df_1 <- read.table(text="query target weight
A1 A2 0.6
A2 A5 0.5
A3 A1 0.75
A4 A5 0.88
A5 A3 0.99
(+)-1(10),4-Cadinadiene Falcarinone 0.09
Leucodelphinidin (+)-1(10),4-Cadinadiene 0.876
Lignin (2E,7R,11R)-2-Phyten-1-ol 0.778
(2E,7R,11R)-2-Phyten-1-ol Leucodelphinidin 0.55
Falcarinone Lignin 1
A1 (+)-1(10),4-Cadinadiene 1
A2 Lignin 1
A3 (2E,7R,11R)-2-Phyten-1-ol 1
Falcarinone A6 1
A4 Leucodelphinidin 1
A4 Leucodelphinidin 1
Falcarinone A100 1
A5 Falcarinone 1", header=TRUE)
df_2 <- read.table(text="query target
A1 A2
A2 A5
A1 A3
A4 A5
A3 A5
(+)-1(10),4-Cadinadiene Falcarinone
Leucodelphinidin (+)-1(10),4-Cadinadiene-100
Lignin-2 (2E,7R,11R)-2-Phyten-1-ol
A11 (+)-1(10),4-Cadinadiene
A2 Lignin
A3 (2E,7R,11R)-2-Phyten-1-0l
Falcarinone A6
A4 Leucodelphinidin ", header=TRUE)
Any kind of suggestion is appreciated.
You could write a small function that sorts the values of the first two columns within each row of both data frames, then merge them.
sc <- function(x, i) setNames(cbind(data.frame(t(apply(x[i], 1, sort))), x[-i]), names(x))
res <- merge(sc(df_1, 1:2), sc(df_2, 1:2))
res[!duplicated(res), ] ## remove duplicates
# query target weight
# 1 (+)-1(10),4-Cadinadiene Falcarinone 0.09
# 2 A1 A2 0.60
# 3 A1 A3 0.75
# 4 A2 A5 0.50
# 5 A2 Lignin 1.00
# 6 A3 A5 0.99
# 7 A4 A5 0.88
# 8 A4 Leucodelphinidin 1.00
# 10 A6 Falcarinone 1.00
Edit
Solution with data.table which should be more memory efficient.
library(data.table)
setDT(df_1)[,c("query", "target") := list(pmin(query,target), pmax(query,target))]
setDT(df_2)[,c("query", "target") := list(pmin(query,target), pmax(query,target))]
res <- merge(df_1[!duplicated(df_1),], df_2, allow.cartesian=TRUE)
res
# query target weight
# 1: (+)-1(10),4-Cadinadiene Falcarinone 0.09
# 2: A1 A2 0.60
# 3: A1 A3 0.75
# 4: A2 A5 0.50
# 5: A2 Lignin 1.00
# 6: A3 A5 0.99
# 7: A4 A5 0.88
# 8: A4 Leucodelphinidin 1.00
# 9: A6 Falcarinone 1.00
To get back "data.frame"s, just do e.g. setDF(res).
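Both solutions above rewrite query and target into sorted order, so a row such as A3 A1 comes back as A1 A3. If the original orientation of the first data frame should be kept, as in the question's expected output, one possible base R sketch is to build an order-independent key and use it only for filtering. The helper pair_key and the tab separator are assumptions made for illustration, and only a small subset of the question's rows is used here.

```r
df_1 <- data.frame(
  query  = c("A1", "A2", "A3", "A4", "A5", "A4", "A4"),
  target = c("A2", "A5", "A1", "A5", "A3",
             "Leucodelphinidin", "Leucodelphinidin"),
  weight = c(0.60, 0.50, 0.75, 0.88, 0.99, 1, 1),
  stringsAsFactors = FALSE
)
df_2 <- data.frame(
  query  = c("A1", "A2", "A1", "A4", "A3", "A4"),
  target = c("A2", "A5", "A3", "A5", "A5", "Leucodelphinidin"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the same key for (x, y) and (y, x).
# Assumes the names themselves never contain a tab character.
pair_key <- function(d) {
  paste(pmin(d$query, d$target), pmax(d$query, d$target), sep = "\t")
}

# Keep df_1 rows whose unordered pair also occurs in df_2
common <- df_1[pair_key(df_1) %in% pair_key(df_2), ]
common <- common[!duplicated(common), ]  # drop exact duplicate rows
common
```

Because the key is used only for the lookup, the weight column and the original column order of df_1 are left untouched.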
Maybe you can try:
output <- merge(df_1, df_2, all = TRUE)
and then check for duplicated rows regardless of ordering, something like:
same.rows <- duplicated(t(apply(output, 1, sort)))
which returns a vector of flags
FALSE FALSE FALSE TRUE FALSE FALSE TRUE
You can then keep the rows which are FALSE
output[!same.rows, ]
query target weight
1 A1 A2 0.60
2 A1 A3 0.75
3 A2 A5 0.50
5 A3 A5 0.99
6 A4 A5 0.88
Does it make sense?

How to create an incrementing variable with 2 variables in R?

I would like to create an incrementing variable (Id1 or Id2) from 2 others variables (Var1 and Var2).
Thank you.
Elodie
EDIT (reproducible example for Aaron Montgomery)
I want to create an incrementing variable: "Id". The value of "Id" changes only if VarA is a new value and VarB is also a new value. See in particular where Id = 4 in the expected table.
data_example <- data.table::fread("
VarA VarB
A1 B1
A1 B2
A1 B3
A1 B4
A2 B5
A3 B6
A4 B7
A5 B7
A5 B8
A6 B9
A7 B10
A8 B10
A9 B10")
Expected table
VarA VarB Id
A1 B1 1
A1 B2 1
A1 B3 1
A1 B4 1
A2 B5 2
A3 B6 3
A4 B7 4
A5 B7 4
A5 B8 4
A6 B9 5
A7 B10 6
A8 B10 6
A9 B10 6
Here is one solution using the tidyverse
library(tidyverse)
data_example <- data.table::fread("
Var1 Var2 Id1 Id2
604211 1001 3 1
604211 1093 3 1
604211 1146 3 1
604211 1319 3 1
635348 1002 5 2
634849 1005 5 2
620861 1004 4 3
622281 1004 4 3
622281 1041 4 3
600044 1100 1 4
600049 1033 2 5
607692 1033 2 5
612595 1033 2 5")
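# group_indices() inside mutate() is deprecated in dplyr >= 1.0.0;
# cur_group_id() is the current equivalent
data_example %>%
arrange(Var1, Var2) %>%
group_by(Var1) %>%
mutate(id1 = cur_group_id()) %>%
group_by(Var2) %>%
mutate(id2 = cur_group_id()) %>%
ungroup()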
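For the VarA/VarB example in the question's EDIT, the rule "Id changes only when both VarA and VarB are new" can be sketched in base R with a running count. This is a minimal sketch under one loud assumption: "new" is read as "different from the previous row", so rows that share a value must already be adjacent, as they are in the example. The names new_a and new_b are made up for illustration.

```r
data_example <- data.frame(
  VarA = c("A1", "A1", "A1", "A1", "A2", "A3", "A4",
           "A5", "A5", "A6", "A7", "A8", "A9"),
  VarB = c("B1", "B2", "B3", "B4", "B5", "B6", "B7",
           "B7", "B8", "B9", "B10", "B10", "B10"),
  stringsAsFactors = FALSE
)

n <- nrow(data_example)
# TRUE where the value differs from the previous row (first row counts as new)
new_a <- c(TRUE, data_example$VarA[-1] != data_example$VarA[-n])
new_b <- c(TRUE, data_example$VarB[-1] != data_example$VarB[-n])

# Start a new Id only when both variables changed at once
data_example$Id <- cumsum(new_a & new_b)
data_example
```

On this data the result matches the expected table (1, 1, 1, 1, 2, 3, 4, 4, 4, 5, 6, 6, 6). If linked rows can be scattered through the table, this row-by-row comparison is not enough and a grouping over connected values would be needed instead.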

Combine two dataframes same/different names [duplicate]

This question already has answers here:
Combine two data frames by rows (rbind) when they have different sets of columns
(14 answers)
Closed 3 years ago.
I have 2 dataframes. I am trying to combine them by rows, keeping not only the columns with common names but also those with different names, and filling in NA where the respective value is not found.
I tried a normal rbind, but it requires the same column names.
Dataframes:
d1 <- data.frame(a=c('a1','a2','a3'), b = c("a51","a52","a53"), d = c(12,13,14))
d2 <- data.frame(a=c('a4','a5','a6'), g = c("a151","a152","a153"), k = c(122,123,124))
Expected Output:
a b d g k
1 a1 a51 12 <NA> NA
2 a2 a52 13 <NA> NA
3 a3 a53 14 <NA> NA
4 a4 <NA> NA a151 122
5 a5 <NA> NA a152 123
6 a6 <NA> NA a153 124
Here is an option with bind_rows
library(dplyr)
bind_rows(d1, d2)
# a b d g k
#1 a1 a51 12 <NA> NA
#2 a2 a52 13 <NA> NA
#3 a3 a53 14 <NA> NA
#4 a4 <NA> NA a151 122
#5 a5 <NA> NA a152 123
#6 a6 <NA> NA a153 124
Or using rbindlist with fill = TRUE, so that columns missing from one input are filled with NA
library(data.table)
rbindlist(list(d1, d2), fill = TRUE)
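For comparison, the same result can be sketched in base R without extra packages: add each missing column as NA, put the columns in a common order, then call rbind. The helper name pad is hypothetical.

```r
d1 <- data.frame(a = c("a1", "a2", "a3"), b = c("a51", "a52", "a53"),
                 d = c(12, 13, 14), stringsAsFactors = FALSE)
d2 <- data.frame(a = c("a4", "a5", "a6"), g = c("a151", "a152", "a153"),
                 k = c(122, 123, 124), stringsAsFactors = FALSE)

all_cols <- union(names(d1), names(d2))

# Hypothetical helper: add any missing column as NA, then reorder columns
pad <- function(d) {
  for (col in setdiff(all_cols, names(d))) d[[col]] <- NA
  d[all_cols]
}

combined <- rbind(pad(d1), pad(d2))
combined
```

rbind matches columns by name once both inputs carry the same set, and the NA placeholders are coerced to the type of the real values in the other data frame.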

Using Reshape to Combine Columns [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 4 years ago.
I have this dataset that I'm trying to melt and combine "Debit" and "Credit" into the same column.
random
Address ID Debit Credit
1 tower1 A1 33 NA
2 happy1 A2 NA 24
3 today2 A3 145 NA
4 yesterday3 A4 122 NA
5 random3 A5 NA 14143
random <- melt(random, id = c("Address", "ID"))
Address ID variable value
1 tower1 A1 Debit 33
2 happy1 A2 Debit NA
3 today2 A3 Debit 145
4 yesterday3 A4 Debit 122
5 random3 A5 Debit NA
6 tower1 A1 Credit NA
7 happy1 A2 Credit 24
8 today2 A3 Credit NA
9 yesterday3 A4 Credit NA
10 random3 A5 Credit 14143
random[!(is.na(random$value)| random$value == ""),] #to remove NA and join them together
I'm wondering if it is possible to achieve my final dataset directly via reshape package?
This is the final dataset I hope to obtain
Address ID variable value
1 tower1 A1 Debit 33
3 today2 A3 Debit 145
4 yesterday3 A4 Debit 122
7 happy1 A2 Credit 24
10 random3 A5 Credit 14143
We can use gather to convert the dataframe into long format and then use na.omit to remove NA rows.
library(tidyverse)
df %>%
gather(key, value, -c(Address, ID)) %>%
na.omit()
# Address ID key value
#1 tower1 A1 Debit 33
#3 today2 A3 Debit 145
#4 yesterday3 A4 Debit 122
#7 happy1 A2 Credit 24
#10 random3 A5 Credit 14143
gather also has na.rm parameter to remove NA rows
df %>% gather(key, value, -c(Address, ID), na.rm = TRUE)
With reshape2 you can add na.rm = TRUE to remove NA rows
library(reshape2)
melt(df, id = c("Address", "ID"), na.rm = TRUE)
# Address ID variable value
#1 tower1 A1 Debit 33
#3 today2 A3 Debit 145
#4 yesterday3 A4 Debit 122
#7 happy1 A2 Credit 24
#10 random3 A5 Credit 14143
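If no reshaping package is available, the same long table can be built by hand in base R. This is a sketch that assumes the question's random data frame and its two value columns, Debit and Credit; the name long is made up for illustration.

```r
random <- data.frame(
  Address = c("tower1", "happy1", "today2", "yesterday3", "random3"),
  ID      = c("A1", "A2", "A3", "A4", "A5"),
  Debit   = c(33, NA, 145, 122, NA),
  Credit  = c(NA, 24, NA, NA, 14143),
  stringsAsFactors = FALSE
)

# Stack Debit on top of Credit, repeating the id columns for each block
long <- data.frame(
  Address  = rep(random$Address, 2),
  ID       = rep(random$ID, 2),
  variable = rep(c("Debit", "Credit"), each = nrow(random)),
  value    = c(random$Debit, random$Credit),
  stringsAsFactors = FALSE
)

# Drop the rows that carry no value
long <- long[!is.na(long$value), ]
long
```

The surviving row names (1, 3, 4, 7, 10) line up with the final dataset shown in the question.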
