How to merge two files in R with the same column? - r

I want to merge two files in R on the shared ID column, as below:
File1:
ID feature1 feature2 feature3
A,B,C 1 100 150
D,F 2 200 500
G,R 2 200 600
H 6 500 800
S 8 600 700
File2:
ID feature4 feature5
A 5 4
F 6 7
G 4 3
H 8 2
P 2 1
OUTPUT:
ID feature1 feature2 feature3 ID feature4 feature5
A,B,C 1 100 150 A 5 4
D,F 2 200 500 F 6 7
G,R 2 200 600 G 4 3
H 6 500 800 H 8 2
S 8 600 700 * * *
* * * * P 2 1

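With dplyr and full_join: split the comma-separated IDs into one row per ID, join on the split IDs, then arrange so that NAs sort last within each ID and drop the unmatched duplicates.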
library(dplyr)
library(tidyr) # unnest
full_join(df1 %>%
mutate(ID_1 = strsplit(ID, ",")) %>%
unnest(ID_1), df2, c("ID_1" = "ID")) %>%
arrange(ID, feature4) %>%
filter(!(duplicated(ID) & is.na(feature4)))
# A tibble: 6 × 7
ID feature1 feature2 feature3 ID_1 feature4 feature5
<chr> <int> <int> <int> <chr> <int> <int>
1 A,B,C 1 100 150 A 5 4
2 D,F 2 200 500 F 6 7
3 G,R 2 200 600 G 4 3
4 H 6 500 800 H 8 2
5 S 8 600 700 S NA NA
6 NA NA NA NA P 2 1
Data
df1 <- structure(list(ID = c("A,B,C", "D,F", "G,R", "H", "S"), feature1 = c(1L,
2L, 2L, 6L, 8L), feature2 = c(100L, 200L, 200L, 500L, 600L),
feature3 = c(150L, 500L, 600L, 800L, 700L)), class = "data.frame", row.names = c(NA,
-5L))
df2 <- structure(list(ID = c("A", "F", "G", "H", "P"), feature4 = c(5L,
6L, 4L, 8L, 2L), feature5 = c(4L, 7L, 3L, 2L, 1L)), class = "data.frame", row.names = c(NA,
-5L))

You can use fuzzyjoin and customize a vectorized matching function (like grepl below) given two columns, returning TRUE or FALSE as to whether they are a match.
library(fuzzyjoin)
fuzzy_full_join(df1, df2, by = "ID", match_fun = Vectorize(\(x, y) grepl(y, x)))
# ID.x feature1 feature2 feature3 ID.y feature4 feature5
# 1 A,B,C 1 100 150 A 5 4
# 2 D,F 2 200 500 F 6 7
# 3 G,R 2 200 600 G 4 3
# 4 H 6 500 800 H 8 2
# 5 S 8 600 700 <NA> NA NA
# 6 <NA> NA NA NA P 2 1
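Note that grepl(y, x) matches substrings, so an ID such as "A" would also match a hypothetical entry like "AB,C". Below is a minimal sketch of a stricter matcher, assuming the IDs are separated by plain commas without spaces. token_match is a hypothetical helper name, not part of fuzzyjoin; it returns one TRUE/FALSE per pair, so it should be usable as match_fun in place of the grepl wrapper.

```r
# Hypothetical helper: TRUE when y[i] is exactly one of the
# comma-separated tokens of x[i] (no substring matches).
token_match <- function(x, y) {
  mapply(
    function(xi, yi) yi %in% strsplit(xi, ",", fixed = TRUE)[[1]],
    x, y,
    USE.NAMES = FALSE
  )
}

# "A" is a token of "A,B,C" but not of "AB,C"
token_match(c("A,B,C", "AB,C"), c("A", "A"))

# grepl, by contrast, reports a match for the substring
grepl("A", "AB,C")
```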

Related

R Create multiple rows from 1 row based on presence of values in certain columns

I have a data frame that looks like the following:
ID Date Participant_1 Participant_2 Participant_3 Covariate 1 Covariate 2 Covariate 3
1 9/1 A B 16 2 1
2 5/4 B 4 2 2
3 6/3 C A B 8 3 6
4 2/8 A 7 8 4
5 9/3 C A 7 1 3
I need to expand this data frame so that a row is present for all of the participants present at each event "ID", with the date and all other variables in all the created rows. The multiple participant columns would now only be one column for participant. The output would therefore be:
ID Date Participant Covariate 1 Covariate 2 Covariate 3
1 9/1 A 16 2 1
1 9/1 B 16 2 1
2 5/4 B 4 2 2
3 6/3 C 8 3 6
3 6/3 A 8 3 6
3 6/3 B 8 3 6
4 2/8 A 7 8 4
5 9/3 C 7 1 3
5 9/3 A 7 1 3
Is there a way to do this efficiently? Perhaps with a pivot function?
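We can use pivot_longer, then drop the helper column, move Participant into place, and remove the NA rows:
library(dplyr) # %>%, select, relocate
library(tidyr) # pivot_longer, drop_na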
df %>%
pivot_longer(starts_with("Participant"), values_to = "Participant") %>%
select(-name) %>%
relocate(Participant, .before = Covariate_1) %>%
drop_na()
# A tibble: 9 × 6
ID Date Participant Covariate_1 Covariate_2 Covariate_3
<int> <chr> <chr> <int> <int> <int>
1 1 9/1 A 16 2 1
2 1 9/1 B 16 2 1
3 2 5/4 B 4 2 2
4 3 6/3 C 8 3 6
5 3 6/3 A 8 3 6
6 3 6/3 B 8 3 6
7 4 2/8 A 7 8 4
8 5 9/3 C 7 1 3
9 5 9/3 A 7 1 3
Here's the example data used:
df <- structure(list(ID = 1:5, Date = c("9/1", "5/4", "6/3", "2/8",
"9/3"), Participant_1 = c("A", "B", "C", "A", "C"), Participant_2 = c("B",
NA, "A", NA, "A"), Participant_3 = c(NA, NA, "B", NA, NA), Covariate_1 = c(16L,
4L, 8L, 7L, 7L), Covariate_2 = c(2L, 2L, 3L, 8L, 1L), Covariate_3 = c(1L,
2L, 6L, 4L, 3L)), class = "data.frame", row.names = c(NA, -5L
))
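For comparison, the same reshape can be sketched in base R without any packages. This is only an illustration on a shortened copy of the example data; part_cols, idx and long are names introduced here, and the Participant column ends up last rather than before the covariates.

```r
# Shortened copy of the example data (first three events, one covariate)
df <- data.frame(
  ID = 1:3,
  Date = c("9/1", "5/4", "6/3"),
  Participant_1 = c("A", "B", "C"),
  Participant_2 = c("B", NA, "A"),
  Participant_3 = c(NA, NA, "B"),
  Covariate_1 = c(16L, 4L, 8L),
  stringsAsFactors = FALSE
)

part_cols <- grep("^Participant", names(df), value = TRUE)

# Repeat every event once per participant column
idx <- rep(seq_len(nrow(df)), each = length(part_cols))

long <- data.frame(
  df[idx, setdiff(names(df), part_cols)],
  # t() puts each event's participants in one column, so as.vector()
  # reads them event by event, matching the order of idx
  Participant = as.vector(t(as.matrix(df[part_cols]))),
  row.names = NULL
)

# Drop the empty participant slots
long <- long[!is.na(long$Participant), ]
long
```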

Conditionally take value from column1 if the column1 name == first(value) from column2 BY GROUP

I have this fake dataframe:
df <- structure(list(Group = c(1L, 1L, 2L, 2L), A = 1:4, B = 5:8, C = 9:12,
X = c("A", "A", "B", "B")), class = "data.frame", row.names = c(NA, -4L))
Group A B C X
1 1 1 5 9 A
2 1 2 6 10 A
3 2 3 7 11 B
4 2 4 8 12 B
I am trying to mutate a new column that, within each group, takes the first value of the column whose name is stored in column X:
Desired output:
Group A B C X new_col
1 1 5 9 A 1
1 2 6 10 A 1
2 3 7 11 B 7
2 4 8 12 B 7
My try so far:
library(dplyr)
df %>%
group_by(Group) %>%
mutate(across(c(A,B,C), ~ifelse(first(X) %in% colnames(.), first(.), .), .names = "new_{.col}"))
Group A B C X new_A new_B new_C
<int> <int> <int> <int> <chr> <int> <int> <int>
1 1 1 5 9 A 1 5 9
2 1 2 6 10 A 1 5 9
3 2 3 7 11 B 3 7 11
4 2 4 8 12 B 3 7 11
One option might be:
df %>%
rowwise() %>%
mutate(new_col = get(X)) %>%
group_by(Group, X) %>%
mutate(new_col = first(new_col))
Group A B C X new_col
<int> <int> <int> <int> <chr> <int>
1 1 1 5 9 A 1
2 1 2 6 10 A 1
3 2 3 7 11 B 7
4 2 4 8 12 B 7
Using by, and adding 1 to the group number to select the column. This assumes the candidate columns directly follow the "Group" column, in the same order as the group numbers, as in the example.
transform(df, new_col=do.call(rbind, by(df, df$Group, \(x)
cbind(paste(x$X, x[1, x$Group[1] + 1])))))
# Group A B C X new_col
# 1 1 1 5 9 A A 1
# 2 1 2 6 10 A A 1
# 3 2 3 7 11 B B 7
# 4 2 4 8 12 B B 7
Note: R version 4.1.2 (2021-11-01).
Data:
df <- structure(list(Group = c(1L, 1L, 2L, 2L), A = 1:4, B = 5:8, C = 9:12,
X = c("A", "A", "B", "B")), class = "data.frame", row.names = c(NA,
-4L))
In base R, we may use row/column indexing
df$new_col <- df[2:4][cbind(match(unique(df$Group), df$Group)[df$Group],
match(df$X, names(df)[2:4]))]
df$new_col
[1] 1 1 7 7
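The indexing above relies on the fact that a two-column matrix of (row, column) pairs picks out one cell per pair. A small sketch of the same idea, written as a variant that does not assume the groups are numbered 1, 2, ...; first_row, col_idx and vals are names introduced here for illustration.

```r
df <- data.frame(
  Group = c(1L, 1L, 2L, 2L),
  A = 1:4, B = 5:8, C = 9:12,
  X = c("A", "A", "B", "B"),
  stringsAsFactors = FALSE
)

value_cols <- c("A", "B", "C")

# Row index of the first row of each row's group
first_row <- match(df$Group, df$Group)

# Column position (within value_cols) named by X
col_idx <- match(df$X, value_cols)

# One (row, column) pair per output value
vals <- as.matrix(df[value_cols])[cbind(first_row, col_idx)]
vals
```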

Check if value of column A is present in the same row or previous rows of column B

I have this dataframe:
df <- structure(list(A = 1:5, B = c(1L, 5L, 2L, 3L, 3L)),
class = "data.frame", row.names = c(NA, -5L))
A B
1 1 1
2 2 5
3 3 2
4 4 3
5 5 3
I would like to get this result:
A B Result
1 1 1 B
2 2 5 <NA>
3 3 2 <NA>
4 4 3 <NA>
5 5 3 B
Strategy:
Check whether A == B; if so, assign "B" to a new column Result, otherwise NA.
But also make this check against all PREVIOUS rows of B.
AIM:
I want to learn how to check if a certain value of column A say in row 5
is in the previous rows of column B (eg. row 1-4).
I hope the following code fits your general cases
transform(
df,
Result = replace(rep(NA, length(B)), match(A, B) <= seq_along(A), "B")
)
which gives
A B Result
1 1 1 B
2 2 5 <NA>
3 3 2 <NA>
4 4 3 <NA>
5 5 3 B
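Why this works: match(A, B) returns the position of the first occurrence of each A value in B, so the value appears in the same or an earlier row exactly when that position is not greater than the current row number. The sketch below checks this against a direct row-by-row test; hit and brute are names introduced here.

```r
A <- 1:5
B <- c(1L, 5L, 2L, 3L, 3L)

# Position of the first occurrence of each A value in B (NA if absent)
first_pos <- match(A, B)

# TRUE when that first occurrence is in the current row or above
hit <- !is.na(first_pos) & first_pos <= seq_along(A)

# Direct check: is A[i] among B[1..i]?
brute <- vapply(seq_along(A), function(i) A[i] %in% B[seq_len(i)], logical(1))

Result <- ifelse(hit, "B", NA_character_)
Result
```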
Here is a dplyr::rowwise approach:
library(dplyr)
df %>%
rowwise %>%
mutate(result = ifelse(A %in% .[seq(cur_group_rows()),]$B, "B", NA))
#> # A tibble: 5 x 3
#> # Rowwise:
#> A B result
#> <int> <int> <chr>
#> 1 1 1 B
#> 2 2 5 <NA>
#> 3 3 2 <NA>
#> 4 4 3 <NA>
#> 5 5 3 B
Created on 2021-08-26 by the reprex package (v0.3.0)
Just some minor changes to @ThomasIsCoding's answer to make it dplyr. Slightly more laid-out to be easier to read, in my opinion.
library(tidyverse)
df <- structure(list(A = 1:5, B = c(1L, 5L, 2L, 3L, 3L)),
class = "data.frame", row.names = c(NA, -5L))
match(df$A, df$B)
#> [1] 1 3 4 NA 2
df %>% mutate(Result = if_else(match(A, B) <= row_number(),
"B",
NA_character_))
#> A B Result
#> 1 1 1 B
#> 2 2 5 <NA>
#> 3 3 2 <NA>
#> 4 4 3 <NA>
#> 5 5 3 B
Created on 2021-08-26 by the reprex package (v1.0.0)
We can use
library(dplyr)
library(purrr)
df %>%
mutate(Result = map_chr(row_number(), ~ case_when(A[.x] %in% B[seq(.x)]~ "B")))
-output
A B Result
1 1 1 B
2 2 5 <NA>
3 3 2 <NA>
4 4 3 <NA>
5 5 3 B

R Fill in missing rows

I have a question similar to this one: Fill in missing rows in R
However, the gaps I need to fill are not only months, but also missing years in between for one ID. This is an example:
structure(list(ID = c("A", "A", "A", "A", "A", "B", "B", "B",
"B"), A = c(1L, 1L, 3L, 3L, 3L, 2L, 2L, 2L, 3L), B = c(1L, 2L,
1L, 2L, 3L, 1L, 2L, 3L, 3L), Var1 = 12:4), class = "data.frame", row.names = c(NA,
-9L))
ID A B Var1
1 A 1 1 12
2 A 1 2 11
3 A 3 1 10
4 A 3 2 9
5 A 3 3 8
6 B 2 1 7
7 B 2 2 6
8 B 2 3 5
9 B 3 3 4
And this is what I want it to look like:
ID A B Var1
1 A 1 1 12
2 A 1 2 11
3 A 1 3 0
4 A 2 1 0
5 A 2 2 0
6 A 2 3 0
7 A 3 1 10
8 A 3 2 9
9 A 3 3 8
10 B 2 1 7
11 B 2 2 6
12 B 2 3 5
13 B 3 1 0
14 B 3 2 0
15 B 3 3 4
Has someone an idea how to solve it? I have already played around with the solutions mentioned above.
library(tidyverse)
df <- structure(list(ID = c("A", "A", "A", "A", "A", "B", "B", "B",
"B"), A = c(1L, 1L, 3L, 3L, 3L, 2L, 2L, 2L, 3L), B = c(1L, 2L,
1L, 2L, 3L, 1L, 2L, 3L, 3L), Var1 = 12:4), class = "data.frame", row.names = c(NA,
-9L))
df %>%
complete(ID, A, B, fill = list(Var1 = 0))
#> # A tibble: 18 x 4
#> ID A B Var1
#> <chr> <int> <int> <dbl>
#> 1 A 1 1 12
#> 2 A 1 2 11
#> 3 A 1 3 0
#> 4 A 2 1 0
#> 5 A 2 2 0
#> 6 A 2 3 0
#> 7 A 3 1 10
#> 8 A 3 2 9
#> 9 A 3 3 8
#> 10 B 1 1 0
#> 11 B 1 2 0
#> 12 B 1 3 0
#> 13 B 2 1 7
#> 14 B 2 2 6
#> 15 B 2 3 5
#> 16 B 3 1 0
#> 17 B 3 2 0
#> 18 B 3 3 4
Created on 2021-03-03 by the reprex package (v1.0.0)
You could use the solution described there altering it slightly for your problem.
df
full <- with(df, unique(expand.grid(ID = ID, A = A, B = B)))
complete <- merge(df, full, by = c('ID', 'A', 'B'), all.y = TRUE)
complete$Var1[is.na(complete$Var1)] <- 0
Just in case somebody else has the same question, this is what I came up with, thanks to the answers provided:
library(tidyverse)
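# ID is the grouping column, so it is not listed again inside complete();
# recent tidyr versions reject completing on a grouping column
df %>% group_by(ID) %>% complete(A = full_seq(A, 1), B, fill = list(Var1 = 0))
Because the completion happens within each ID, this avoids creating rows for A values outside the range observed for that ID.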
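The same per-ID completion can be sketched in base R, which may help to see what the grouped complete/full_seq call is doing. This is an illustration only; grid and out are names introduced here, and it assumes A should be filled with every integer between its minimum and maximum within each ID.

```r
df <- data.frame(
  ID = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  A = c(1L, 1L, 3L, 3L, 3L, 2L, 2L, 2L, 3L),
  B = c(1L, 2L, 1L, 2L, 3L, 1L, 2L, 3L, 3L),
  Var1 = 12:4,
  stringsAsFactors = FALSE
)

# For each ID, build every (A, B) combination over that ID's own range of A
grid <- do.call(rbind, lapply(split(df, df$ID), function(g) {
  expand.grid(
    ID = g$ID[1],
    A = seq(min(g$A), max(g$A)),
    B = sort(unique(g$B)),
    stringsAsFactors = FALSE
  )
}))

# Left join the observed values onto the grid and fill the gaps with 0
out <- merge(grid, df, by = c("ID", "A", "B"), all.x = TRUE)
out$Var1[is.na(out$Var1)] <- 0L
out <- out[order(out$ID, out$A, out$B), ]
out
```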

R- Specific merging of rows in a dataframe within unique groups

I have a huge data frame in R like the following:
df <- data.frame("ITEM" = c(1,1,1,2,2,3,3,3,3,4,4),
"ID" = c("A","B","C","D","E","F","G","A","B","C","H"),
"Score" = c(7,8,7,3,5,4,6,9,10,5,8),
"Date" = c("1/1/2018","1/3/2018","1/6/2018","1/7/2017","1/10/2017","1/1/2003","1/3/2004","1/5/2008","1/7/2010","1/8/2010","1/3/2011"))
ITEM ID Score Date
1 1 A 7 1/1/2018
2 1 B 8 1/3/2018
3 1 C 7 1/6/2018
4 2 D 3 1/7/2017
5 2 E 5 1/10/2017
6 3 F 4 1/1/2003
7 3 G 6 1/3/2004
8 3 A 9 1/5/2008
9 3 B 10 1/7/2010
10 4 C 5 1/8/2010
11 4 H 8 1/3/2011
The data is already grouped by unique items and in ascending date order. I would like to transpose the data into the following:
ITEM ID Score Date ID_2 Score_2 Date_2
1 1 A 7 1/1/2018 B 8 1/3/2018
2 1 B 8 1/3/2018 C 7 1/6/2018
4 2 D 3 1/7/2017 E 5 1/10/2017
6 3 F 4 1/1/2003 G 6 1/3/2004
7 3 G 6 1/3/2004 A 9 1/5/2008
8 3 A 9 1/5/2008 B 10 1/7/2010
10 4 C 5 1/8/2010 H 8 1/3/2011
Each item has an owner and is transferred to another person and given a score. E.g. Item 1 is held by A who gets a score of 7, then it moves to B who scores 8, then C who scores 7.
I would like to get it in the above format...to merge each row with the above row (but within the item groups) - I tried reshaping the data using dcast from what I know, but you would get ID_3, ID_4 columns as well for some items whereas I only want the columns for ID_2, Score_2 and Date_2.
Any ideas? Thanks.
Based on the expected output, we could split by 'ITEM', cbind the rows with the lag of rows and then convert the list of data.frame to a single data.frame with rbind
out <- do.call(rbind, lapply(split(df, df$ITEM),
function(x) cbind(x[-nrow(x), ], x[-1, -1])))
row.names(out) <- NULL
out
# ITEM ID Score Date ID Score Date
#1 1 A 7 1/1/2018 B 8 1/3/2018
#2 1 B 8 1/3/2018 C 7 1/6/2018
#3 2 D 3 1/7/2017 E 5 1/10/2017
#4 3 F 4 1/1/2003 G 6 1/3/2004
#5 3 G 6 1/3/2004 A 9 1/5/2008
#6 3 A 9 1/5/2008 B 10 1/7/2010
#7 4 C 5 1/8/2010 H 8 1/3/2011
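If the second set of columns should carry the _2 suffix from the question, a variant without split is to pair each row with the next one whenever both belong to the same item. A minimal sketch, assuming the rows are already sorted by ITEM as stated in the question; keep is a name introduced here.

```r
df <- data.frame(
  ITEM = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L),
  ID = c("A", "B", "C", "D", "E", "F", "G", "A", "B", "C", "H"),
  Score = c(7L, 8L, 7L, 3L, 5L, 4L, 6L, 9L, 10L, 5L, 8L),
  Date = c("1/1/2018", "1/3/2018", "1/6/2018", "1/7/2017", "1/10/2017",
           "1/1/2003", "1/3/2004", "1/5/2008", "1/7/2010", "1/8/2010",
           "1/3/2011"),
  stringsAsFactors = FALSE
)

n <- nrow(df)

# Rows whose following row belongs to the same item
keep <- which(df$ITEM[-n] == df$ITEM[-1])

# Bind each kept row to its successor, renaming the successor's columns
out <- cbind(
  df[keep, ],
  setNames(df[keep + 1, -1], paste0(names(df)[-1], "_2"))
)
row.names(out) <- NULL
out
```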
Or using tidyverse
library(tidyverse)
df %>%
group_by(ITEM) %>%
nest %>%
mutate(data = map(data, ~ bind_cols(.x[-nrow(.x), ], .x[-1, ]))) %>%
unnest
# A tibble: 7 x 7
# ITEM ID Score Date ID1 Score1 Date1
# <int> <chr> <int> <chr> <chr> <int> <chr>
#1 1 A 7 1/1/2018 B 8 1/3/2018
#2 1 B 8 1/3/2018 C 7 1/6/2018
#3 2 D 3 1/7/2017 E 5 1/10/2017
#4 3 F 4 1/1/2003 G 6 1/3/2004
#5 3 G 6 1/3/2004 A 9 1/5/2008
#6 3 A 9 1/5/2008 B 10 1/7/2010
#7 4 C 5 1/8/2010 H 8 1/3/2011
data
df <- structure(list(ITEM = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 4L,
4L), ID = c("A", "B", "C", "D", "E", "F", "G", "A", "B", "C",
"H"), Score = c(7L, 8L, 7L, 3L, 5L, 4L, 6L, 9L, 10L, 5L, 8L),
Date = c("1/1/2018", "1/3/2018", "1/6/2018", "1/7/2017",
"1/10/2017", "1/1/2003", "1/3/2004", "1/5/2008", "1/7/2010",
"1/8/2010", "1/3/2011")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11"))
