Match corresponding observations from two different datasets [duplicate] - r

I would like to find matching observations in two different datasets based on two variables.
The first dataset "df1" consists of the following two variables:
SessionID MarkerID
14 5
14 5
14 5
14 8
17 9
17 9
17 8
17 2
17 9
The other dataset "df2" consists of the same two variables:
SessionID MarkerID
14 5
17 8
17 2
Now, I would like to add another variable "Match" to df1 that shows if a match was found between the two datasets (Match = 1) or not (Match = 0) for an observation. The observation should have the same value for both the SessionID AND MarkerID.
The desired output looks as follows:
SessionID MarkerID Match
14 5 1
14 5 1
14 5 1
14 8 0
17 9 0
17 9 0
17 8 1
17 2 1
17 9 0
Reproducible example:
SessionID <- c(14,14,14,14,17,17,17,17,17)
MarkerID <- c(5,5,5,8,9,9,8,2,9)
df1 <- as.data.frame(cbind(SessionID, MarkerID))
SessionID <- c(14,17,17)
MarkerID <- c(5,8,2)
df2 <- as.data.frame(cbind(SessionID,MarkerID))
I have tried the following code but it did not produce the desired output:
df1$Match <- 0
df1$Match[which(df1$MarkerID == df2$MarkerID & df1$SessionID == df2$SessionID )] <- 1

Here is a possibility using match
df1$Match <- ifelse(is.na(match(
paste(df1$SessionID, df1$MarkerID, sep = "_"),
paste(df2$SessionID, df2$MarkerID, sep = "_"))), 0, 1)
df1
# SessionID MarkerID Match
#1 14 5 1
#2 14 5 1
#3 14 5 1
#4 14 8 0
#5 17 9 0
#6 17 9 0
#7 17 8 1
#8 17 2 1
#9 17 9 0
Explanation: We concatenate SessionID and MarkerID entries in both data.frames and use match to identify matching rows; ifelse marks matching entries with 1 and NA (non-matching) entries with 0.
If you want to avoid ifelse you can also do
df1$Match <- as.numeric(!is.na(match(
paste(df1$SessionID, df1$MarkerID, sep = "_"),
paste(df2$SessionID, df2$MarkerID, sep = "_"))))
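A minimal, self-contained sketch of the same key-matching idea, using the question's data. The helper name make_key is made up for illustration. It also shows why an explicit separator matters: paste0() has no sep argument (anything passed as sep is simply pasted on as one more string), so keys built without a real separator can collide.

```r
# Data from the question
df1 <- data.frame(SessionID = c(14, 14, 14, 14, 17, 17, 17, 17, 17),
                  MarkerID  = c(5, 5, 5, 8, 9, 9, 8, 2, 9))
df2 <- data.frame(SessionID = c(14, 17, 17),
                  MarkerID  = c(5, 8, 2))

# Hypothetical helper: build one key per row from both columns
make_key <- function(d) paste(d$SessionID, d$MarkerID, sep = "_")

# A row of df1 matches if its key occurs anywhere in df2
df1$Match <- as.numeric(make_key(df1) %in% make_key(df2))

# Without a separator, different (SessionID, MarkerID) pairs can share a key
collides_without_sep <- paste0(1, 45) == paste0(14, 5)                       # both "145"
collides_with_sep    <- paste(1, 45, sep = "_") == paste(14, 5, sep = "_")   # "1_45" vs "14_5"
```

With the example data the separator makes no difference, but it is a cheap safeguard for real IDs.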

This works for me:
df1$Match <- as.numeric(do.call(paste, c(df1, sep = "-")) %in% do.call(paste, c(df2, sep = "-")))

You can use left_join
library(dplyr)
df1 %>%
left_join(df2 %>% mutate(Match = 1), by = c('SessionID', 'MarkerID')) %>%
mutate(Match = ifelse(is.na(Match), 0 , Match))
# SessionID MarkerID Match
# 1 14 5 1
# 2 14 5 1
# 3 14 5 1
# 4 14 8 0
# 5 17 9 0
# 6 17 9 0
# 7 17 8 1
# 8 17 2 1
# 9 17 9 0
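If dplyr is not available, roughly the same left join can be sketched in base R with merge(). This is only an illustration under a few assumptions: the helper column row and the names flags and out are made up here, and the unique() call is an added guard against duplicated rows in df2 (which would otherwise multiply rows of df1 in any join), not something the answer above includes.

```r
# Data from the question
df1 <- data.frame(SessionID = c(14, 14, 14, 14, 17, 17, 17, 17, 17),
                  MarkerID  = c(5, 5, 5, 8, 9, 9, 8, 2, 9))
df2 <- data.frame(SessionID = c(14, 17, 17),
                  MarkerID  = c(5, 8, 2))

# merge() sorts by the key columns, so remember the original row order
df1$row <- seq_len(nrow(df1))

# One flag row per distinct (SessionID, MarkerID) pair in df2
flags <- cbind(unique(df2), Match = 1)

# Left join: keep every row of df1, unmatched rows get NA in Match
out <- merge(df1, flags, by = c("SessionID", "MarkerID"), all.x = TRUE)
out$Match[is.na(out$Match)] <- 0

# Restore the original order and drop the helper column
out <- out[order(out$row), c("SessionID", "MarkerID", "Match")]
rownames(out) <- NULL
```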

Related

Add a unique identifier to the same column value in R data frame

I have a data frame as follows:
index val sample_id
1 1 14 5
2 2 22 6
3 3 1 6
4 4 25 7
5 5 3 7
6 6 34 7
For each row whose sample_id occurs more than once, I would like to append a unique identifier as follows:
index val sample_id
1 1 14 5
2 2 22 6-A
3 3 1 6-B
4 4 25 7-A
5 5 3 7-B
6 6 34 7-C
Any suggestion? Thank you for your help.
Base R
dat$id2 <- ave(dat$sample_id, dat$sample_id,
FUN = function(z) if (length(z) > 1) paste(z, LETTERS[seq_along(z)], sep = "-") else as.character(z))
dat
# index val sample_id id2
# 1 1 14 5 5
# 2 2 22 6 6-A
# 3 3 1 6 6-B
# 4 4 25 7 7-A
# 5 5 3 7 7-B
# 6 6 34 7 7-C
tidyverse
library(dplyr)
dat %>%
group_by(sample_id) %>%
mutate(id2 = if (n() > 1) paste(sample_id, LETTERS[row_number()], sep = "-") else as.character(sample_id)) %>%
ungroup()
Minor note: it might be tempting to drop the as.character(z) from either or both of the code blocks. In the base R version, nothing changes for this data, because base R silently coerces the whole result to character as soon as one group returns strings. If you rely on that and need the new field to always be character, be aware that in the rare case where every sample_id is unique, the column will remain integer. dplyr guards against this much more carefully; if you run the tidyverse code without as.character, you'll see an error.
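The base R corner case described in the note can be checked with a small sketch. The data frame here is a made-up variant of the question's data in which every sample_id is unique, and the function names tag_sloppy and tag_strict are hypothetical.

```r
# Made-up variant of the question's data: every sample_id occurs exactly once
dat <- data.frame(index = 1:3, val = c(14, 22, 1), sample_id = c(5L, 6L, 7L))

# Version without as.character(): singleton groups are returned unchanged
tag_sloppy <- function(z) {
  if (length(z) > 1) paste(z, LETTERS[seq_along(z)], sep = "-") else z
}

# Version with as.character(): every group returns character
tag_strict <- function(z) {
  if (length(z) > 1) paste(z, LETTERS[seq_along(z)], sep = "-") else as.character(z)
}

sloppy <- ave(dat$sample_id, dat$sample_id, FUN = tag_sloppy)
strict <- ave(dat$sample_id, dat$sample_id, FUN = tag_strict)

class(sloppy)  # no group ever produced a string, so the type is not changed
class(strict)  # character regardless of the data
```

Keeping as.character() makes the column type independent of whether duplicates happen to exist in the data.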
Using dplyr:
library(dplyr)
dplyr::group_by(df, sample_id) %>%
dplyr::mutate(sample_id = paste(sample_id, LETTERS[seq_along(sample_id)], sep = "-"))
index val sample_id
<int> <dbl> <chr>
1 1 14 5-A
2 2 22 6-A
3 3 1 6-B
4 4 25 7-A
5 5 3 7-B
6 6 34 7-C
If you just want to create unique tags for the same sample_id, maybe you can try make.unique like below
transform(
df,
sample_id = ave(as.character(sample_id),sample_id,FUN = function(x) make.unique(x,sep = "_"))
)
which gives
index val sample_id
1 1 14 5
2 2 22 6
3 3 1 6_1
4 4 25 7
5 5 3 7_1
6 6 34 7_2

conditional merge or left join two dataframes in R

I am trying to add additional data from a reference table onto my primary dataframe. I see that similar questions have been asked about this, but I can't find anything for my specific case.
An example of my data frame is set up like this
df <- data.frame("participant" = rep(1:3,9), "time" = rep(1:9, each = 3))
lookup <- data.frame("start.time" = c(1,5,8), "end.time" = c(3,6,10), "var1" = c("A","B","A"),
"var2" = c(8,12,3), "var3"= c("fast","fast","slow"))
print(df)
participant time
1 1 1
2 2 1
3 3 1
4 1 2
5 2 2
6 3 2
7 1 3
8 2 3
9 3 3
10 1 4
11 2 4
12 3 4
13 1 5
14 2 5
15 3 5
16 1 6
17 2 6
18 3 6
19 1 7
20 2 7
21 3 7
22 1 8
23 2 8
24 3 8
25 1 9
26 2 9
27 3 9
> print(lookup)
start.time end.time var1 var2 var3
1 1 3 A 8 fast
2 5 6 B 12 fast
3 8 10 A 3 slow
What I want to do is merge or join these two dataframes in a way which also includes the times in between both the start and end time of the look up data frame. So the columns var1, var2 and var3 are added onto the df at each instance where the time lies between the start time and end time.
For example, in the above case the first row of the lookup table has a start time of 1 and an end time of 3, so for times 1, 2 and 3 of each participant, the data from that first row should be added.
The output should look something like this:
print(output)
participant time var1 var2 var3
1 1 1 A 8 fast
2 2 1 A 8 fast
3 3 1 A 8 fast
4 1 2 A 8 fast
5 2 2 A 8 fast
6 3 2 A 8 fast
7 1 3 A 8 fast
8 2 3 A 8 fast
9 3 3 A 8 fast
10 1 4 <NA> NA <NA>
11 2 4 <NA> NA <NA>
12 3 4 <NA> NA <NA>
13 1 5 B 12 fast
14 2 5 B 12 fast
15 3 5 B 12 fast
16 1 6 B 12 fast
17 2 6 B 12 fast
18 3 6 B 12 fast
19 1 7 <NA> NA <NA>
20 2 7 <NA> NA <NA>
21 3 7 <NA> NA <NA>
22 1 8 A 3 slow
23 2 8 A 3 slow
24 3 8 A 3 slow
25 1 9 A 3 slow
26 2 9 A 3 slow
27 3 9 A 3 slow
I realise that column names don't match and they should for merging data sets.
One option would be to use the sqldf package, and phrase your problem as a SQL left join:
library(sqldf)
sql <- "SELECT t1.participant, t1.time, t2.var1, t2.var2, t2.var3
FROM df t1
LEFT JOIN lookup t2
ON t1.time BETWEEN t2.\"start.time\" AND t2.\"end.time\""
output <- sqldf(sql)
A dplyr solution:
library(dplyr)
output <- df %>%
# Create an id for the join
mutate(merge_id=1) %>%
# Use full join to create all the combinations between the two datasets
full_join(lookup %>% mutate(merge_id=1), by="merge_id") %>%
# Keep only the rows that we want
filter(time >= start.time, time <= end.time) %>%
# Select the relevant variables
select(participant,time,var1:var3) %>%
# Right join with initial dataset to get the missing rows
right_join(df, by = c("participant","time")) %>%
# Sort to match the formatting asked by OP
arrange(time, participant)
This produces the output asked by OP, but it will only work for data of reasonable size, as the full join produces a data frame with number of rows equal to the product of the number of rows of both initial datasets.
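When the size of the cross product is a concern, one possible alternative is a base R sketch built on findInterval(), which looks up the interval for each time directly. It assumes that the lookup intervals are sorted by start.time and do not overlap; those assumptions hold for the example data but are not stated in the question. The name idx is made up for illustration.

```r
# Data from the question
df <- data.frame(participant = rep(1:3, 9), time = rep(1:9, each = 3))
lookup <- data.frame(start.time = c(1, 5, 8), end.time = c(3, 6, 10),
                     var1 = c("A", "B", "A"),
                     var2 = c(8, 12, 3),
                     var3 = c("fast", "fast", "slow"))

# Index of the last interval whose start is <= time (0 if there is none).
# Assumes lookup is sorted by start.time and intervals do not overlap.
idx <- findInterval(df$time, lookup$start.time)
idx[idx == 0] <- NA

# Discard candidates where the time lies past the end of that interval
idx[!is.na(idx) & df$time > lookup$end.time[idx]] <- NA

# Copy the lookup columns row by row; an NA index yields NA values
output <- df
for (v in c("var1", "var2", "var3")) {
  output[[v]] <- lookup[[v]][idx]
}
```

This keeps the result at the same number of rows as df throughout, at the cost of the sorted, non-overlapping assumption.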
Using tidyverse and creating an auxiliary table:
library(tidyverse)
df <- data.frame("participant" = rep(1:3,9), "time" = rep(1:9, each = 3))
lookup <- data.frame("start.time" = c(1,5,8), "end.time" = c(3,6,10), "var1" = c("A","B","A"),
"var2" = c(8,12,3), "var3"= c("fast","fast","slow"))
lookup_extended <- lookup %>%
mutate(time = map2(start.time, end.time, ~ c(.x:.y))) %>%
unnest(time) %>%
select(-start.time, -end.time)
df2 <- df %>%
left_join(lookup_extended, by = "time")

Recreating a dataframe by using conditions from two different columns

I have a massive data frame that looks like this:
df = data.frame(year = c(rep(1998,5),rep(1999,5)),
loc = c(10,rep(14,4),rep(10,2),rep(14,3)),
sitA = c(rep(0,3),1,1,0,1,0,1,1),
sitB = c(1,0,1,0,1,rep(0,4),1),
n = c(2,13,2,9,4,7,2,7,7,4))
df
year loc sitA sitB n
1 1998 10 0 1 2
2 1998 14 0 0 13
3 1998 14 0 1 2
4 1998 14 1 0 9
5 1998 14 1 1 4
6 1999 10 0 0 7
7 1999 10 1 0 2
8 1999 14 0 0 7
9 1999 14 1 0 7
10 1999 14 1 1 4
As you can see, there are years, localities, two different situations (denoted as sitA and sitB) and finally the counts of these records (column n).
I want to create a new data frame with one row per year and locality, where the counts for each combination of situation A and B are stored in separate columns, as in the desired output below:
df.new
year loc sitB.0.sitA.0 sitB.0.sitA.1 sitB.1.sitA.0 sitB.1.sitA.1
1 1998 10 0 0 2 0
2 1998 14 13 9 2 4
3 1999 10 7 2 0 0
4 1999 14 7 7 0 4
The tricky part is that the original data frame doesn't include all of the conditions; it only has the ones where the count is above 0. So the new data frame should have "0" for the conditions missing from the original. Because of this, well-known functions such as melt (reshape) or aggregate failed to solve my issue. A little help would be appreciated.
A tidyverse method: we first prepend the column names to the values of the sit.. columns. Then we unite them into one column and finally spread the values.
library(tidyverse)
df[3:4] <- lapply(names(df)[3:4], function(x) paste(x, df[, x], sep = "."))
df %>%
unite(key, sitA, sitB, sep = ".") %>%
spread(key, n, fill = 0)
# year loc sitA.0.sitB.0 sitA.0.sitB.1 sitA.1.sitB.0 sitA.1.sitB.1
#1 1998 10 0 2 0 0
#2 1998 14 13 2 9 4
#3 1999 10 7 0 2 0
#4 1999 14 7 0 7 4
If the position of the columns is not fixed you can use grep first
cols <- grep("^sit", names(df))
df[cols] <- lapply(names(df)[cols], function(x) paste(x, df[, x], sep = "."))
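As a base R counterpart, the zero-filling can also be sketched with xtabs(), which returns 0 for every combination that never occurs in the data. The column names group and key are made up for this sketch, and the result is a contingency table rather than the exact data frame layout requested in the question.

```r
# Data from the question
df <- data.frame(year = c(rep(1998, 5), rep(1999, 5)),
                 loc  = c(10, rep(14, 4), rep(10, 2), rep(14, 3)),
                 sitA = c(rep(0, 3), 1, 1, 0, 1, 0, 1, 1),
                 sitB = c(1, 0, 1, 0, 1, rep(0, 4), 1),
                 n    = c(2, 13, 2, 9, 4, 7, 2, 7, 7, 4))

# One label per row of the result and one label per output column
df$group <- paste(df$year, df$loc)
df$key   <- paste0("sitA.", df$sitA, ".sitB.", df$sitB)

# Sum n within each group/key cell; combinations absent from df become 0
tab <- xtabs(n ~ group + key, data = df)
tab
```

If a plain data frame is needed, as.data.frame.matrix(tab) is one way to convert the table, with the group labels ending up as row names.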

R merge two data.frame by id and sub-id while changing column names?

I have two dataframes of this format.
df1:
id x y
1 2 3
2 4 5
3 6 7
4 8 9
5 1 1
df2:
id id2 v v2
1 t 11 21
1 b 12 22
2 t 13 23
2 b 14 24
3 t 15 25
3 b 16 26
4 b 17 27
So the id from the main data frame df1 appears in df2 sometimes twice (the maximum), sometimes once, and sometimes not at all. The expected result would be:
df_merged:
id x y v.t v2.t v.b v2.b
1 2 3 11 21 12 22
2 4 5 13 23 14 24
3 6 7 15 25 16 26
4 8 9 NA NA 17 27
5 1 1 NA NA NA NA
I have used merge but due to the fact that id2 in df2 doesn't match, I get two instances of id in df_merged like so:
id x y v v2
1 ...
1 ...
Thanks in advance!
We can start by adjusting df2 to the right format then do a normal joining.
library(dplyr)
library(tidyr)
df2 %>% gather(key,val,-id,-id2) %>% #Reshape v and v2 from wide to long format
mutate(new_key=paste0(key,'.',id2)) %>% #Combine key and id2 into new_key
select(-id2,-key) %>% #Drop the unnecessary columns
spread(new_key,val) %>% #Reshape back to wide format, one row per id
right_join(df1) %>% #Right join to keep all rows of df1, matching on id
select(id,x,y,v.t,v2.t,v.b,v2.b) #Rearrange the columns
Joining, by = "id"
id x y v.t v2.t v.b v2.b
1 1 2 3 11 21 12 22
2 2 4 5 13 23 14 24
3 3 6 7 15 25 16 26
4 4 8 9 NA NA 17 27
5 5 1 1 NA NA NA NA
You can solve this just using merge. Split df2 based on whether id2 equals b or t. Merge these two new objects with df1, and finally merge them together. The code includes one additional step to also include data found in df1 but not df2.
dfb <- merge(df1, df2[df2$id2=='b',], by='id')
dft <- merge(df1, df2[df2$id2=='t',], by='id')
dfRest <- df1[!df1$id %in% df2$id,]
dfAll <- merge(dfb[,c('id','x','y','v','v2')], dft[,c('id','v','v2')], by='id', all.x=T)
merge(dfAll, dfRest, all.x=T, all.y=T)
id x y v.x v2.x v.y v2.y
1 1 2 3 12 22 11 21
2 2 4 5 14 24 13 23
3 3 6 7 16 26 15 25
4 4 8 9 17 27 NA NA
5 5 1 1 NA NA NA NA
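Another base R possibility, sketched here as an alternative rather than taken from either answer, is to widen df2 with reshape() and then do a single merge(). The names wide and merged are made up for illustration, and the data is typed in from the question.

```r
# Data from the question
df1 <- data.frame(id = 1:5, x = c(2, 4, 6, 8, 1), y = c(3, 5, 7, 9, 1))
df2 <- data.frame(id  = c(1, 1, 2, 2, 3, 3, 4),
                  id2 = c("t", "b", "t", "b", "t", "b", "b"),
                  v   = 11:17,
                  v2  = 21:27)

# One row per id; v and v2 get a suffix for each value of id2 (v.t, v2.t, v.b, v2.b)
wide <- reshape(df2, idvar = "id", timevar = "id2", direction = "wide")

# Keep every row of df1; ids missing from df2 get NA in the new columns
merged <- merge(df1, wide, by = "id", all.x = TRUE)
merged
```

Unlike splitting on 'b' and 't' by hand, this does not need to know the id2 values in advance.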

How to give a "/" in a column name to a dataframe in R?

I wish to give a "/" (forward slash) in a column name in a dataframe. Any idea how?
I tried following to no avail,
tmp1 <- data.frame("Cost/Day"=1:10,"Days"=11:20)
tmp1
Cost.Day Days
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20
I then tried this, it worked.
tmp <- data.frame(1:10,11:20)
colnames(tmp) <- c("Cost/Day","Days")
tmp
Cost/Day Days
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20
I would prefer giving the name while constructing the dataframe itself. I tried escaping it but it still didn't work.
tmp2 <- data.frame("Cost\\/Day"=1:10,"Days"=11:20)
tmp2
You can use check.names=FALSE in the data.frame call. By default, it is TRUE, and when it is TRUE, the function make.names changes the column names, i.e.
make.names('Cost/Day')
#[1] "Cost.Day"
So, try
dat <- data.frame("Cost/Day"=1:10,"Days"=11:20, check.names=FALSE)
head(dat,2)
# Cost/Day Days
#1 1 11
#2 2 12
The specific lines in the data.frame function that change the column names are
--------
if (check.names)
vnames <- make.names(vnames, unique = TRUE)
names(value) <- vnames
--------
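Once a column has a non-syntactic name such as "Cost/Day", it has to be quoted with backticks or accessed by string. A short sketch of the usual ways to do that follows; the variable name total is made up for illustration.

```r
# Build the data frame with the name kept as-is
dat <- data.frame("Cost/Day" = 1:10, "Days" = 11:20, check.names = FALSE)
names(dat)

# Access by string with [[ ]]
dat[["Cost/Day"]]

# Access with $ needs backticks, because Cost/Day is not a syntactic name
dat$`Cost/Day`

# Backticks also work inside expressions evaluated against the data frame
total <- with(dat, `Cost/Day` * Days)
```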

Resources