Merging two dataframes on multiple columns - r

I have two data frames that I am trying to merge:
set.seed(123)
df1 <- data.frame(ID=sample(letters[1:6],10,replace=TRUE))
df2 <- data.frame(
ID1 = letters[1:2],
ID2 = letters[3:4],
ID3 = letters[5:6],
V1 = c(23.32,21.24),
V2 = c(45.32,47.21)
)
Post merging, I want my df1 to contain the columns V1 and V2 along with ID. I have tried using merge, left_join and inner_join (from dplyr) but can't figure out how to use the by argument. The ID column from df1 could exist in any of the three columns (ID1, ID2 and ID3) of df2. How can I achieve this?

You have to reshape df2 into long format first, then join:
library(dplyr)
library(tidyr)
df2 %>%
gather(IDnr, ID, 1:3) %>%
left_join(df1, ., by = 'ID')
# alternative:
df1 %>%
left_join(., df2 %>% gather(IDnr, ID, 1:3), by = 'ID')
The result:
ID V1 V2 IDnr
1 d 21.24 47.21 ID2
2 e 23.32 45.32 ID3
3 f 21.24 47.21 ID3
4 d 21.24 47.21 ID2
5 f 21.24 47.21 ID3
6 c 23.32 45.32 ID2
7 a 23.32 45.32 ID1
8 e 23.32 45.32 ID3
9 a 23.32 45.32 ID1
10 d 21.24 47.21 ID2
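gather() has since been superseded in tidyr by pivot_longer(). A minimal sketch of the same reshape-then-join idea, assuming tidyr 1.0.0 or later. The name df2_long is introduced here for illustration, and stringsAsFactors = FALSE is an addition so the result does not depend on the R version:

```r
library(dplyr)
library(tidyr)

set.seed(123)
df1 <- data.frame(ID = sample(letters[1:6], 10, replace = TRUE),
                  stringsAsFactors = FALSE)
df2 <- data.frame(
  ID1 = letters[1:2],
  ID2 = letters[3:4],
  ID3 = letters[5:6],
  V1 = c(23.32, 21.24),
  V2 = c(45.32, 47.21),
  stringsAsFactors = FALSE
)

# Stack the three ID columns into one "ID" column; "IDnr" records
# which original column each ID came from.
df2_long <- df2 %>%
  pivot_longer(cols = c(ID1, ID2, ID3), names_to = "IDnr", values_to = "ID")

# Every row of df1 is kept and picks up V1 and V2 from its matching ID.
result <- left_join(df1, df2_long, by = "ID")
result
```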

The by argument specifies which columns to join on. You mainly need it when the key columns are named differently in the left and right tables; if they share the same name, the join picks them up automatically.
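As a small illustration of by with differently named keys, here is a sketch using two toy tables (left, right and the column x are made up for this example). A named vector maps the left-hand column name to the right-hand one:

```r
library(dplyr)

left  <- data.frame(ID = c("a", "b", "c"), x = 1:3,
                    stringsAsFactors = FALSE)
right <- data.frame(ID1 = c("a", "c"), V1 = c(23.32, 21.24),
                    stringsAsFactors = FALSE)

# "ID" in the left table is matched against "ID1" in the right table.
# The key keeps its left-hand name in the result.
joined <- left_join(left, right, by = c("ID" = "ID1"))
joined
```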
However, there is a simpler way to do what you want. First, reshape df2 so that it has a single ID column (assuming each ID appears only once across the three ID columns). You can do this by building three separate data frames and stacking them with bind_rows.
Once df2 is reshaped, you can do a right join. df1 is on the right side of the join, so every record in df1 is kept whether or not it has a match in df2 (where there is no match, V1 and V2 will be NA).
With the sample df1 provided, the IDs are repeated rather than unique, which may give results you don't expect, so I have redefined df1 to contain unique IDs only. If your IDs are not unique, you can group by ID and aggregate before doing the join.
set.seed(123)
#df1 <- data.frame(ID=sample(letters[1:6],10,replace=TRUE)) #This one has repeated IDs
df1 <- data.frame(ID=letters[1:6])
df2 <- data.frame(
ID1 = letters[1:2],
ID2 = letters[3:4],
ID3 = letters[5:6],
V1 = c(23.32,21.24),
V2 = c(45.32,47.21)
)
library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.4.2
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- bind_rows(df2 %>% select(ID=ID1, V1, V2),
df2 %>% select(ID=ID2, V1, V2),
df2 %>% select(ID=ID3, V1, V2)) %>%
right_join(df1)
#> Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector
#> Joining, by = "ID"
#> Warning: Column `ID` joining character vector and factor, coercing into
#> character vector
df
#> ID V1 V2
#> 1 a 23.32 45.32
#> 2 b 21.24 47.21
#> 3 c 23.32 45.32
#> 4 d 21.24 47.21
#> 5 e 23.32 45.32
#> 6 f 21.24 47.21
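The "aggregate before joining" step mentioned for repeated IDs could look like the sketch below. It assumes a simple count per ID is the aggregation you want; the names df2_long, df1_counts and n_obs are introduced here for illustration:

```r
library(dplyr)

set.seed(123)
# The original df1 with repeated IDs
df1 <- data.frame(ID = sample(letters[1:6], 10, replace = TRUE),
                  stringsAsFactors = FALSE)

# df2 already reshaped to one ID column (same values as the stacked result)
df2_long <- data.frame(
  ID = letters[1:6],
  V1 = rep(c(23.32, 21.24), times = 3),
  V2 = rep(c(45.32, 47.21), times = 3),
  stringsAsFactors = FALSE
)

# Collapse df1 to one row per ID, remembering how often each ID occurred
df1_counts <- df1 %>% count(ID, name = "n_obs")

# One row per distinct ID, with its count and the values from df2
result <- left_join(df1_counts, df2_long, by = "ID")
result
```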


Convert every n # of rows to columns and stack them in R?

I have a tab-delimited text file with a series of timestamped data. I've read it into R using read.delim() and it gives me all the data as characters in a single column. Example:
df <- data.frame(c("2017","A","B","C","2018","X","Y","Z","2018","X","B","C"))
colnames(df) <- "col1"
df
I want to convert every n # of rows (in this case 4) to columns and stack them without using a for loop. Desired result:
col1 <- c("2017","2018","2018")
col2 <- c("A","X","X")
col3 <- c("B","Y","B")
col4 <- c("C","Z","C")
df2 <- data.frame(col1, col2, col3, col4)
df2
I created a for loop, but it can't handle the millions of rows in my df. Should I convert to a matrix? Would converting to a list help? I tried as.matrix(read.table()) and unlist() but without success.
You could use tidyr to reshape the data into the form you want. You first need to mutate the data to identify which output row each value belongs to and which column it goes in.
Assuming you know there are 4 groups (n = 4) you could do something like the following with the help of the dplyr package.
library(tidyr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
n <- 4
df <- data.frame(x = c("2017","A","B","C","2018","X","Y","Z","2018","X","B","C")) %>%
mutate(cols = rep(1:n, n()/n),
id = rep(1:(n()/n), each = n))
pivot_wider(df, id_cols = id, names_from = cols, values_from = x, names_prefix = "cols")
#> # A tibble: 3 × 5
#> id cols1 cols2 cols3 cols4
#> <int> <chr> <chr> <chr> <chr>
#> 1 1 2017 A B C
#> 2 2 2018 X Y Z
#> 3 3 2018 X B C
Or, in base R, you could use the split function on the vector and then use do.call to build the data frame:
df <- data.frame(x = c("2017","A","B","C","2018","X","Y","Z","2018","X","B","C"))
split_df <- setNames(split(df$x, rep(1:4, 3)), paste0("cols", 1:4))
do.call("data.frame", split_df)
#> cols1 cols2 cols3 cols4
#> 1 2017 A B C
#> 2 2018 X Y Z
#> 3 2018 X B C
Created on 2022-02-01 by the reprex package (v2.0.1)
The easiest way would be to create a matrix with matrix(ncol=x, byrow=TRUE), then convert back to data.frame. Should be quite fast too.
df |>
unlist() |>
matrix(ncol=4, byrow = TRUE) |>
as.data.frame() |>
setNames(paste0('col', 1:4))
col1 col2 col3 col4
1 2017 A B C
2 2018 X Y Z
3 2018 X B C
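The matrix approach can be wrapped in a small helper that also checks the input length. This is only a sketch: stack_every_n is a hypothetical name, and it assumes the number of values is an exact multiple of n:

```r
# Hypothetical helper: turn every n consecutive values into one row
stack_every_n <- function(x, n) {
  # Fail early if the last record would be incomplete
  stopifnot(length(x) %% n == 0)
  out <- as.data.frame(matrix(x, ncol = n, byrow = TRUE),
                       stringsAsFactors = FALSE)
  setNames(out, paste0("col", seq_len(n)))
}

x <- c("2017", "A", "B", "C", "2018", "X", "Y", "Z", "2018", "X", "B", "C")
res <- stack_every_n(x, 4)
res
```

Because matrix() fills a preallocated block in one pass, this avoids the row-by-row growth that usually makes a for loop slow on large inputs.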

How do I merge data frames (and duplicate values) by the nearest date for each individual ID?

I have two data frames that I am trying to join by date (grouped by individual).
I have made example data frames of both (the real df1 is 5700 rows, and the real df2 is 287 rows).
df1 has IDs (including some not in df2), dates, and behavior values.
df2 has IDs (though fewer than df1), dates (fewer than those in df1), and hormone values.
My goal is to match the hormones for a given individual from the nearest date in df2 to the nearest date in df1 (matching as closely as possible but only duplicating the values of hormones from df2 in df1 when the nearest dates are less than or equal to 2 days apart).
I would like to have the hormones that don't match a behavioral observation printed at the bottom of the new data frame with their date such that they aren't lost (example in df3)
df1
ID Date behavior
a 1-12-2020 0
b 1-12-2020 1
b 1-13-2020 1
c 1-12-2020 2
d 1-12-2020 0
c 1-13-2020 1
c 1-14-2020 0
c 1-15-2020 1
c 1-16-2020 2
df2
ID Date hormone
a 1-10-2020 20
b 1-18-2019 70
c 1-10-2020 80
c 1-16-2020 90
#goal dataframe
df3
ID Date behavior hormone
a 1-12-2020 0 20
b 1-12-2020 1 NA [> 2 days from hormone]
b 1-13-2020 1 NA [> 2 days from hormone]
c 1-12-2020 2 80
d 1-12-2020 0 NA [no matching individual in df2]
c 1-13-2020 1 NA [> 2 days from hormone]
c 1-14-2020 0 90
c 1-15-2020 1 90
c 1-16-2020 2 90
b 1-18-2019 NA 70 [unmatched hormone at bottom of df3]
Here is the code to create these data frames:
df1 <- data.frame(ID = c("a", "b", "b", "c", "d", "c", "c","c", "c"),
date = c("1-12-2020", "1-12-2020", "1-13-2020", "1-12-2020", "1-12-2020","1-13-2020","1-14-2020","1-15-2020","1-16-2020"),
behavior = c(0,1,1,2,0,1,0,1,2) )
df2 <- data.frame(ID = c("a", "b", "c", "c"),
date = c("1-10-2020", "1-18-2019", "1-10-2020", "1-16-2020"),
hormone = c(20,70,80,90) )
# Parse the month-day-year strings into Date objects
# (needed for the nearest-date function to work)
df1$date <- as.Date(df1$date, format = "%m-%d-%Y")
df2$date <- as.Date(df2$date, format = "%m-%d-%Y")
I have been able to use a function from a previous question on the forum (link and code below) to match the nearest dates and duplicate to fill, but am not able to limit the time frame of matches, or print unmatched dates in new rows. Is there a way to do this?
This is what I started working from (code below):
How to match by nearest date from two data frames?
# Function to get the index specifying closest or after
Ind_closest_or_after <- function(d1, d2){
which.min(ifelse(d1 - d2 < 0, Inf, d1 - d2))
}
# Calculate the indices
closest_or_after_ind <- map_int(.x = df1$date, .f = Ind_closest_or_after, d2 = df2$date)
# Add index columns to the data frames and join
df2 <- df2 %>%
mutate(ind = 1:nrow(df2))
df1 <- df1 %>%
mutate(ind = closest_or_after_ind)
df3<-left_join(df2, df1, by = 'ind')
This answer seems the closest but doesn't limit the values:
Merge two data frames by nearest date and ID
#function to do all but limit dates and print unmatched
library(data.table)
setDT(df2)[, date := date]
df2[df1, on = .(ID, date = date), roll = 'nearest']
You can join the tables by filtering all possible combinations (cross product using expand_grid):
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
df1 <- data.frame(ID = c("a", "b", "b", "c", "d", "c", "c","c", "c"),
date = c("1-12-2020", "1-12-2020", "1-13-2020", "1-12-2020", "1-12-2020","1-13-2020","1-14-2020","1-15-2020","1-16-2020"),
behavior = c(0,1,1,2,0,1,0,1,2) )
df2 <- data.frame(ID = c("a", "b", "c", "c"),
date = c("1-10-2020", "1-18-2019", "1-10-2020", "1-16-2020"),
hormone = c(20,70,80,90) )
joined <-
df1 %>%
rename_all(~ paste0(., ".1")) %>%
expand_grid(df2 %>% rename_all(~ paste0(., ".2"))) %>%
mutate(across(starts_with("date"), ~ .x %>% parse_date(format = "%m-%d-%Y"))) %>%
mutate(time_diff = abs(date.1 - date.2)) %>%
filter(time_diff <= days(2) & ID.1 == ID.2) %>%
select(ID = ID.1, behavior = behavior.1, hormone = hormone.2)
joined
#> # A tibble: 5 x 3
#> ID behavior hormone
#> <chr> <dbl> <dbl>
#> 1 a 0 20
#> 2 c 2 80
#> 3 c 0 90
#> 4 c 1 90
#> 5 c 2 90
df1 %>%
left_join(joined) %>%
full_join(df2) %>%
as_tibble() %>%
distinct(ID, behavior, .keep_all = TRUE) %>%
arrange(ID, behavior)
#> Joining, by = c("ID", "behavior")
#> Joining, by = c("ID", "date", "hormone")
#> # A tibble: 9 x 4
#> ID date behavior hormone
#> <chr> <chr> <dbl> <dbl>
#> 1 a 1-12-2020 0 20
#> 2 a 1-10-2020 NA 20
#> 3 b 1-12-2020 1 NA
#> 4 b 1-18-2019 NA 70
#> 5 c 1-14-2020 0 90
#> 6 c 1-13-2020 1 90
#> 7 c 1-12-2020 2 80
#> 8 c 1-10-2020 NA 80
#> 9 d 1-12-2020 0 NA
Created on 2022-02-18 by the reprex package (v2.0.0)
This results in one row for each (ID, behavior) pair. To get one row per ID and date instead, pass ID, date to distinct().
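If the goal is exactly the df3 from the question (nearest hormone within 2 days per behavior row, with unused hormone rows appended at the bottom), one possible sketch is below. It is an alternative to the code above, not a correction of it; b_id, h_id, gap, max_gap and matched are names introduced here, and ties between equally near dates are broken arbitrarily:

```r
library(dplyr)

df1 <- data.frame(
  ID = c("a", "b", "b", "c", "d", "c", "c", "c", "c"),
  date = as.Date(c("1-12-2020", "1-12-2020", "1-13-2020", "1-12-2020",
                   "1-12-2020", "1-13-2020", "1-14-2020", "1-15-2020",
                   "1-16-2020"), format = "%m-%d-%Y"),
  behavior = c(0, 1, 1, 2, 0, 1, 0, 1, 2),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  ID = c("a", "b", "c", "c"),
  date = as.Date(c("1-10-2020", "1-18-2019", "1-10-2020", "1-16-2020"),
                 format = "%m-%d-%Y"),
  hormone = c(20, 70, 80, 90),
  stringsAsFactors = FALSE
)

max_gap <- 2                      # largest allowed distance in days
df1$b_id <- seq_len(nrow(df1))    # row ids to track behavior rows
df2$h_id <- seq_len(nrow(df2))    # row ids to track hormone rows

# All same-ID pairs, then keep the nearest hormone within max_gap days
matched <- merge(df1, df2, by = "ID", suffixes = c("", ".h")) %>%
  mutate(gap = abs(as.numeric(difftime(date, date.h, units = "days")))) %>%
  filter(gap <= max_gap) %>%
  arrange(b_id, gap) %>%
  distinct(b_id, .keep_all = TRUE) %>%
  select(b_id, h_id, hormone)

# Behavior rows first, then hormone rows that were never used
df3 <- df1 %>%
  left_join(matched, by = "b_id") %>%
  select(ID, date, behavior, hormone) %>%
  bind_rows(
    df2 %>% filter(!h_id %in% matched$h_id) %>% select(ID, date, hormone)
  )
df3
```

Note that merge() builds every same-ID pair before filtering, so on much larger tables a rolling join (as in the data.table attempt in the question) may scale better.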

How to move dataframe variable names to first row and add new variable names to multiple dataframes in a list?

library(purrr)
library(tibble)
library(dplyr)
Starting list of dataframes
lst <- list(df1 = data.frame(X.1 = as.character(1:2),
heading = letters[1:2]),
df2 = data.frame(X.32 = as.character(3:4),
another.topic = paste("Line ", 1:2)))
lst
#> $df1
#> X.1 heading
#> 1 1 a
#> 2 2 b
#>
#> $df2
#> X.32 another.topic
#> 1 3 Line 1
#> 2 4 Line 2
Expected "combined" dataframe, with new consistent variable names, and old variable names in the first row of each constituent dataframe.
#> id h1 h2
#> 1 df1 X.1 heading
#> 2 df1 1 a
#> 3 df1 2 b
#> 4 df2 X.32 another.topic
#> 5 df2 3 Line 1
#> 6 df2 4 Line 2
add_row requires "Name-value pairs, passed on to tibble(). Values can be defined only for columns that already exist in .data and unset columns will get an NA value."
Which is what I think I have achieved with this:
df_nms <-
map(lst, names) %>%
map(set_names)
#> $df1
#> X.1 heading
#> "X.1" "heading"
#>
#> $df2
#> X.32 another.topic
#> "X.32" "another.topic"
But I cannot tie up the last bit: using a purrr function to add the names to the head of each dataframe. I've tried numerous variations with map2 and pmap; the closest I can get at present is below. (If I treat add_row as a formula, prefixing it with ~ and removing the .y, I get a new first row populated with NAs.) I think I'm missing how to pass the name-value pairs to the add_row function.
map2(lst, df_nms, add_row(.x, .y, .before = 1)) %>%
map(set_names, c("h1", "h2")) %>%
map_dfr(bind_rows, .id = "id")
#> Error in add_row(.x, .y, .before = 1): object '.x' not found
A pointer to resolve this last step would be most appreciated.
Not quite sure how to do this via purrr map functions, but here is an alternative:
library(dplyr)
bind_rows(lapply(lst, function(i){d1 <- as.data.frame(matrix(names(i), ncol = ncol(i)));
rbind(d1, setNames(i, names(d1)))}), .id = 'id')
# id V1 V2
#1 df1 X.1 heading
#2 df1 1 a
#3 df1 2 b
#4 df2 X.32 another.topic
#5 df2 3 Line 1
#6 df2 4 Line 2
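For a purrr-only route, the attempt in the question appears to need two changes: wrap add_row in a ~ lambda so that .x and .y exist, and splice the names in as name-value pairs with !!!. A sketch, assuming every column is character (stringsAsFactors = FALSE is added so this also holds on R versions before 4.0):

```r
library(purrr)
library(tibble)
library(dplyr)

lst <- list(
  df1 = data.frame(X.1 = as.character(1:2),
                   heading = letters[1:2],
                   stringsAsFactors = FALSE),
  df2 = data.frame(X.32 = as.character(3:4),
                   another.topic = paste("Line ", 1:2),
                   stringsAsFactors = FALSE)
)

# Named character vectors: each column name mapped to itself
df_nms <- map(lst, ~ set_names(names(.x)))

combined <- map2(lst, df_nms,
                 # !!! splices the named list into add_row as col = value pairs
                 ~ add_row(.x, !!!as.list(.y), .before = 1)) %>%
  map(set_names, c("h1", "h2")) %>%
  bind_rows(.id = "id")
combined
```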
Here's an approach using map, rbindlist from data.table and some base R functions:
library(purrr)
library(dplyr)
library(data.table)
map(lst, ~ as.data.frame(unname(rbind(colnames(.x),as.matrix(.x))))) %>%
rbindlist(idcol = "id")
# id V1 V2
#1: df1 X.1 heading
#2: df1 1 a
#3: df1 2 b
#4: df2 X.32 another.topic
#5: df2 3 Line 1
#6: df2 4 Line 2
Alternatively we could use map_df if we use colnames<-:
map_df(lst, ~ as.data.frame(rbind(colnames(.x),as.matrix(.x))) %>%
`colnames<-`(.,paste0("h",seq(1,dim(.)[2]))), .id = "id")
# id h1 h2
#1 df1 X.1 heading
#2 df1 1 a
#3 df1 2 b
#4 df2 X.32 another.topic
#5 df2 3 Line 1
#6 df2 4 Line 2
Key things here are:
Use as.matrix to get rid of the factor / character incompatibility.
Remove names with unname or set them with colnames<-.
Use the idcol = (rbindlist) or .id = (map_df) argument to get the names of the list as a column.
I altered your sample data a bit, setting stringsAsFactors to FALSE when creating the data.frames in lst.
Here is a solution using data.table::rbindlist().
#sample data
lst <- list(df1 = data.frame(X.1 = as.character(1:2),
heading = letters[1:2],
stringsAsFactors = FALSE), # !! <--
df2 = data.frame(X.32 = as.character(3:4),
another.topic = paste("Line ", 1:2),
stringsAsFactors = FALSE) # !! <--
)
DT <- data.table::rbindlist( lapply( lst, function(x) rbind( names(x), x ) ),
use.names = FALSE, idcol = "id" )
data.table::setnames(DT, names( lst[[1]] ), c("h1", "h2") )
# id h1 h2
# 1: df1 X.1 heading
# 2: df1 1 a
# 3: df1 2 b
# 4: df2 X.32 another.topic
# 5: df2 3 Line 1
# 6: df2 4 Line 2

Reshape a dataframe in R with non numeric values

I have a data frame with non-numeric values with the following format:
DF1:
col1 col2
1 a b
2 a c
3 z y
4 z x
5 a d
6 m n
I need to convert it into this format,
DF2:
col1 col2 col3 col4
1 a b c d
2 z y x NA
3 m n NA NA
With col1 as the primary key (not sure if this is the right terminology in R), and the rest of the columns contain the elements associated with that key (as seen in DF1).
DF2 will include more columns compared to DF1 depending upon the number of elements associated with any key.
Some columns will have no value resulting from different number of elements associated with each key, represented as NA (as shown in DF2).
The column names could be anything.
I have tried to use the reshape(), melt() + cast(), even a generic for loop where I use cbind and try to delete the row.
It is part of a very big dataset with over 50 million rows. I might have to use cloud services for this task but that is a different discussion.
I am new to R so there might be some obvious solution which I am missing.
Any help would be much appreciated.
-Thanks
If this is a big dataset, we can use data.table
library(data.table)
setDT(DF1)[, i1:=paste0("col", seq_len(.N)+1L), col1]
dcast(DF1, col1~i1, value.var='col2')
# col1 col2 col3 col4
#1: a b c d
#2: m n NA NA
#3: z y x NA
Using dplyr and tidyr :
library(tidyr)
library(dplyr)
DF <- data_frame(col1 = c("a", "a", "z", "z", "a", "m"),
col2 = c("b", "c", "y", "x", "d", "n"))
# you need another column to serve as the key for spreading
DF %>%
group_by(col1) %>%
mutate(colname = paste0("col", 1:n() + 1)) %>%
spread(colname, col2)
#> Source: local data frame [3 x 4]
#> Groups: col1 [3]
#>
#> col1 col2 col3 col4
#> (chr) (chr) (chr) (chr)
#> 1 a b c d
#> 2 m n NA NA
#> 3 z y x NA
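spread() has since been superseded by pivot_wider(). A sketch of the same idea with the newer function, assuming tidyr 1.0.0 or later (the object name wide is introduced here):

```r
library(dplyr)
library(tidyr)

DF <- tibble(col1 = c("a", "a", "z", "z", "a", "m"),
             col2 = c("b", "c", "y", "x", "d", "n"))

wide <- DF %>%
  group_by(col1) %>%
  # Number the values within each key: col2, col3, col4, ...
  mutate(colname = paste0("col", row_number() + 1)) %>%
  ungroup() %>%
  # Keys with fewer values get NA in the columns they do not reach
  pivot_wider(names_from = colname, values_from = col2)
wide
```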

left_join two data frames and overwrite

I'd like to merge two data frames where df2 overwrites any values that are NA or present in df1. Merge data frames and overwrite values provides a data.table option, but I'd like to know if there is a way to do this with dplyr. I've tried all of the _join options but none seem to do this. Is there a way to do this with dplyr?
Here is an example:
df1 <- data.frame(y = c("A", "B", "C", "D"), x1 = c(1,2,NA, 4))
df2 <- data.frame(y = c("A", "B", "C"), x1 = c(5, 6, 7))
Desired output:
y x1
1 A 5
2 B 6
3 C 7
4 D 4
I think what you want is to keep the values of df2 and only add the rows of df1 that are not present in df2, which is what anti_join does:
"anti_join return all rows from x where there are not matching values in y, keeping just columns from x."
My solution:
df3 <- anti_join(df1, df2, by = "y") %>% bind_rows(df2)
Warning messages:
1: In anti_join_impl(x, y, by$x, by$y) :
joining factors with different levels, coercing to character vector
2: In rbind_all(x, .id) : Unequal factor levels: coercing to character
> df3
Source: local data frame [4 x 2]
y x1
(chr) (dbl)
1 D 4
2 A 5
3 B 6
4 C 7
This line gives the desired output (in a different order), but you should pay attention to the warning messages: when working with your dataset, be sure to read y as a character variable.
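Newer dplyr releases also offer rows_update() for this kind of overwrite. A sketch, assuming dplyr 1.0.0 or later, that every key in df2 exists in df1, and that keys in df2 are unique (stringsAsFactors = FALSE is added to sidestep the factor warnings above):

```r
library(dplyr)

df1 <- data.frame(y = c("A", "B", "C", "D"), x1 = c(1, 2, NA, 4),
                  stringsAsFactors = FALSE)
df2 <- data.frame(y = c("A", "B", "C"), x1 = c(5, 6, 7),
                  stringsAsFactors = FALSE)

# Overwrite x1 in df1 wherever df2 has a row with the same y;
# rows of df1 without a match (here D) are left untouched.
df3 <- rows_update(df1, df2, by = "y")
df3
```

Unlike the anti_join approach, this keeps the row order of df1.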
This is the idiom I now use, as, in addition, it handles keeping columns that are not part of the update table. I use some different names than from the OP, but the flavor is similar.
The one thing I do is create a variable for the keys used in the join, as I use that in a few spots. But otherwise, it does what is desired.
In itself it doesn't handle the action of, for example, "update this row if a value is NA", but you should exercise that condition when creating the join table.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
.keys <- c("key1", "key2")
.base_table <- tribble(
~key1, ~key2, ~val1, ~val2,
"A", "a", 0, 0,
"A", "b", 0, 1,
"B", "a", 1, 0,
"B", "b", 1, 1)
.join_table <- tribble(
~key1, ~key2, ~val2,
"A", "b", 100,
"B", "a", 111)
# This works
df_result <- .base_table %>%
# Pull off rows from base table that match the join table
semi_join(.join_table, .keys) %>%
# Drop cols from base table that are in join table, except for the key columns
select(-all_of(setdiff(names(.join_table), .keys))) %>%
# Left join on the join table columns
left_join(.join_table, .keys) %>%
# Remove the matching rows from the base table, and bind on the newly joined result from above.
bind_rows(.base_table %>% anti_join(.join_table, .keys))
df_result %>%
print()
#> # A tibble: 4 x 4
#> key1 key2 val1 val2
#> <chr> <chr> <dbl> <dbl>
#> 1 A b 0 100
#> 2 B a 1 111
#> 3 A a 0 0
#> 4 B b 1 1
Created on 2019-12-12 by the reprex package (v0.3.0)
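For the "update this row only if the value is NA" case mentioned above, one option is to join the new values in under a suffixed name and fill the gaps with coalesce(). A sketch using the data from the question (the suffix ".new" and the name patched are choices made here):

```r
library(dplyr)

df1 <- data.frame(y = c("A", "B", "C", "D"), x1 = c(1, 2, NA, 4),
                  stringsAsFactors = FALSE)
df2 <- data.frame(y = c("A", "B", "C"), x1 = c(5, 6, 7),
                  stringsAsFactors = FALSE)

patched <- df1 %>%
  # Bring in the candidate values as x1.new
  left_join(df2, by = "y", suffix = c("", ".new")) %>%
  # Keep the existing x1 and fall back to x1.new only where x1 is NA
  mutate(x1 = coalesce(x1, x1.new)) %>%
  select(-x1.new)
patched
```

Here only row C changes, because it is the only row where df1 had a missing value.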
