I need to combine two data frames (df1 and df2) by matching up two site columns of each data frame to produce a third data frame (df3).
df1 = data.frame(Site.1=c("A","A","B"),
Site.2=c("B","C","C"),
Score1=c(60,70,80))
df1
Site.1 Site.2 Score1
1 A B 60
2 A C 70
3 B C 80
df2 = data.frame(Site.1=c("B","A","A"),
Site.2=c("C","B","C"),
Score2=c(10,20,30))
df2
Site.1 Site.2 Score2
1 B C 10
2 A B 20
3 A C 30
df3 = data.frame(Site.1=c("A","A","B"),
Site.2=c("B","C","C"),
Score1=c(60,70,80),
Score2=c(20,30,10))
df3
Site.1 Site.2 Score1 Score2
1 A B 60 20
2 A C 70 30
3 B C 80 10
You want the merge function. Because the columns you want to match on already have the same names in both data frames, you don't need to do anything special: by default merge joins on all columns the two data frames have in common. If the names differed, you would use the by.x and by.y arguments of merge.
df1 = data.frame(Site.1=c("A","A","B"),Site.2=c("B","C","C"),Score1=c(60,70,80))
df2 = data.frame(Site.1=c("B","A","A"),Site.2=c("C","B","C"), Score2=c(10,20,30))
df3 = data.frame(Site.1=c("A","A","B"),Site.2=c("B","C","C"), Score1=c(60,70,80),Score2=c(20,30,10))
df3
# Merge gives you what you want
merge(df1, df2)
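For the case mentioned above where the key columns are named differently, here is a minimal sketch. The data frame df2b and its column names From and To are hypothetical, invented only to illustrate by.x and by.y; they are not part of the question.

```r
df1 <- data.frame(Site.1 = c("A", "A", "B"),
                  Site.2 = c("B", "C", "C"),
                  Score1 = c(60, 70, 80))

# Hypothetical variant of df2 whose key columns are named From and To
df2b <- data.frame(From = c("B", "A", "A"),
                   To = c("C", "B", "C"),
                   Score2 = c(10, 20, 30))

# by.x lists the key columns of the first data frame, by.y those of the
# second; they are paired by position (Site.1 with From, Site.2 with To).
# The result keeps the key names given in by.x.
df3b <- merge(df1, df2b,
              by.x = c("Site.1", "Site.2"),
              by.y = c("From", "To"))
df3b
```

Note that merge sorts the result on the key columns by default, so the row order can differ from df1 in general.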
dplyr may be helpful here.
library(dplyr)
df1 = data.frame(Site.1 = c("A", "A", "B"),
Site.2 = c("B", "C", "C"),
Score1 = c(60, 70, 80))
df2 = data.frame(Site.1 = c("B", "A", "A"),
Site.2 = c("C", "B", "C"),
Score2 = c(10, 20, 30))
inner_join(df1, df2)
Related
I have the following data:
names <- c("a", "b", "c", "d")
scores <- c(95, 55, 100, 60)
df <- cbind.data.frame(names, scores)
I want to "extend" this data frame to make name pairs for every possible combination of names without repetition like so:
names_1 <- c("a", "a", "a", "b", "b", "c")
names_2 <- c("b", "c", "d", "c", "d", "d")
scores_1 <- c(95, 95, 95, 55, 55, 100)
scores_2 <- c(55, 100, 60, 100, 60, 60)
df_extended <- cbind.data.frame(names_1, names_2, scores_1, scores_2)
In the extended data, scores_1 are the scores for the corresponding name in names_1, and scores_2 are for names_2.
The following bit of code makes the appropriate name pairs. But I do not know how to get the scores in the right place after that.
t(combn(df$names,2))
The final goal is to get the row-wise difference between scores_1 and scores_2.
df_extended$score_diff <- abs(df_extended$scores_1 - df_extended$scores_2)
df_ext <- data.frame(t(combn(df$names, 2,\(x)c(x, df$scores[df$names %in%x]))))
df_ext <- setNames(type.convert(df_ext, as.is =TRUE), c('name_1','name_2', 'type_1', 'type_2'))
df_ext
name_1 name_2 type_1 type_2
1 a b 95 55
2 a c 95 100
3 a d 95 60
4 b c 55 100
5 b d 55 60
6 c d 100 60
names <- c("a", "b", "c", "d")
scores <- c(95, 55, 100, 60)
df <- cbind.data.frame(names, scores)
library(tidyverse)
map(df, ~combn(x = .x, m = 2)%>% t %>% as_tibble) %>%
imap_dfc(~set_names(x = .x, nm = paste(.y, seq(ncol(.x)), sep = "_"))) %>%
mutate(score_diff = scores_1 - scores_2)
#> # A tibble: 6 × 5
#> names_1 names_2 scores_1 scores_2 score_diff
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 a b 95 55 40
#> 2 a c 95 100 -5
#> 3 a d 95 60 35
#> 4 b c 55 100 -45
#> 5 b d 55 60 -5
#> 6 c d 100 60 40
Created on 2022-06-06 by the reprex package (v2.0.1)
First, we can create a new data frame with the unique combinations of names. Then, we can merge on the scores to match the names for both names_1 and names_2 to get the final data.frame.
names <- c("a", "b", "c", "d")
scores <- c(95, 55, 100, 60)
df <- cbind.data.frame(names, scores)
new_df <- data.frame(t(combn(df$names,2)))
names(new_df)[1] <- "names_1"; names(new_df)[2] <- "names_2"
new_df <- merge(new_df, df, by.x = 'names_1', by.y = 'names')
new_df <- merge(new_df, df, by.x = 'names_2', by.y = 'names')
names(new_df)[3] <- "scores_1"; names(new_df)[4] <- "scores_2"
> new_df
names_2 names_1 scores_1 scores_2
1 b a 95 55
2 c a 95 100
3 c b 55 100
4 d a 95 60
5 d b 55 60
6 d c 100 60
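As an alternative to the answers above, the pairs can be built from row indices instead of from the names themselves. This is only a sketch in base R, assuming each name appears once in df; it also computes the absolute difference the question asks for.

```r
names <- c("a", "b", "c", "d")
scores <- c(95, 55, 100, 60)
df <- data.frame(names, scores)

# Each column of idx holds the two row numbers of one unordered pair
idx <- combn(nrow(df), 2)

# Look up names and scores by row number, so they always stay aligned
df_extended <- data.frame(
  names_1  = df$names[idx[1, ]],
  names_2  = df$names[idx[2, ]],
  scores_1 = df$scores[idx[1, ]],
  scores_2 = df$scores[idx[2, ]]
)

# Row-wise absolute difference between the two scores
df_extended$score_diff <- abs(df_extended$scores_1 - df_extended$scores_2)
df_extended
```

Working with indices avoids a second lookup step, because the name and the score of each pair member come from the same row of df.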
Given two dataframes df1 and df2 as follows:
df1:
df1 <- structure(list(A = 1L, B = 2L, C = 3L, D = 4L, G = 5L), class = "data.frame", row.names = c(NA,
-1L))
Out:
A B C D G
1 1 2 3 4 5
df2:
df2 <- structure(list(Col1 = c("A", "B", "C", "D", "X"), Col2 = c("E",
"Q", "R", "Z", "Y")), class = "data.frame", row.names = c(NA,
-5L))
Out:
Col1 Col2
1 A E
2 B Q
3 C R
4 D Z
5 X Y
I need to rename the columns of df1 using df2, except column G, since it is not in df2's Col1.
I use df2$Col2[match(names(df1), df2$Col1)], based on an answer to another question, but it returns "E" "Q" "R" "Z" NA: column G becomes NA, whereas I want it to keep its original name.
The expected result:
E Q R Z G
1 1 2 3 4 5
How can I deal with this issue? Thanks.
By using na.omit (it's a little bit messy):
colnames(df1)[na.omit(match(names(df1), df2$Col1))] <- df2$Col2[na.omit(match(names(df1), df2$Col1))]
df1
E Q R Z G
1 1 2 3 4 5
I was able to reproduce your error with
df2 <- data.frame(
Col1 = c("H","I","K","A","B","C","D"),
Col2 = c("a1","a2","a3","E","Q","R","Z")
)
The problem is the order of df2$Col1 and names(df1) in match.
na.omit(match(names(df1), df2$Col1))
gives [1] 4 5 6 7. These are row positions in df2, not column positions in df1, and df1 has only 5 columns, so they cannot be used to index colnames(df1).
To index df1, we should swap the arguments of match: na.omit(match(df2$Col1, names(df1))) gives [1] 1 2 3 4. The right-hand side must then take Col2 from the same rows of df2, in the same order:
colnames(df1)[na.omit(match(df2$Col1, names(df1)))] <- df2$Col2[df2$Col1 %in% names(df1)]
This works.
A solution using the rename_with function from the dplyr package.
library(dplyr)
df3 <- df2 %>%
filter(Col1 %in% names(df1))
df4 <- df1 %>%
rename_with(.cols = all_of(df3$Col1), .fn = function(x) df3$Col2[match(x, df3$Col1)])
df4
# E Q R Z G
# 1 1 2 3 4 5
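The renaming can also be sketched in base R by computing the match once and only overwriting the names that were actually found. The helper variables idx and found are illustrative names, not taken from the answers above.

```r
df1 <- data.frame(A = 1L, B = 2L, C = 3L, D = 4L, G = 5L)
df2 <- data.frame(Col1 = c("A", "B", "C", "D", "X"),
                  Col2 = c("E", "Q", "R", "Z", "Y"),
                  stringsAsFactors = FALSE)

# For every column of df1: the row of df2 that holds its name, or NA
idx <- match(names(df1), df2$Col1)

# Only columns with a match are renamed; the others keep their name
found <- !is.na(idx)
names(df1)[found] <- df2$Col2[idx[found]]
names(df1)
```

Because idx is aligned with names(df1), this does not depend on the order of the rows in df2.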
Let's assume there are two huge data frames of different lengths, each with two columns, like:
df1 df2
A 1 C X
A 1 D X
B 4 C X
A 1 F X
B 4 A X
B 4 B X
C 7 B X
Each time there is a match in the first columns, X should be replaced with the value from column 2 of df1. If the first column of df2 contains elements that are not in the first column of df1 (F, D), X should be replaced with 0.
Because the data frames are huge, a loop inside a loop would be too slow.
The solution should look like this:
df1 df2
A 1 C 7
A 1 D 0
B 4 C 7
A 1 F 0
B 4 A 1
B 4 B 4
C 7 B 4
Thank you in advance.
As there are duplicate rows in 'df1', we can get the unique rows
df3 <- unique(df1)
Then, use match to get the index
i1 <- match(df2$Col1, df3$Col1)
and based on the index, assign
df2$Col2 <- df3$Col2[i1]
If there are no matches, it would be NA, which can be changed to 0
df2$Col2[is.na(df2$Col2)] <- 0
df2
# Col1 Col2
#1 C 7
#2 D 0
#3 C 7
#4 F 0
#5 A 1
#6 B 4
#7 B 4
Or this can be done with data.table by joining on the 'Col1' and assigning the 'Col2' (after removing the Col2 from the second data) with the Col2 from 'df3'
library(data.table)
setDT(df2)[, Col2 := NULL][df3, Col2 := Col2, on = .(Col1)]
data
df1 <- structure(list(Col1 = c("A", "A", "B", "A", "B", "B", "C"), Col2 = c(1,
1, 4, 1, 4, 4, 7)), class = "data.frame", row.names = c(NA, -7L
))
df2 <- structure(list(Col1 = c("C", "D", "C", "F", "A", "B", "B"), Col2 = c("X",
"X", "X", "X", "X", "X", "X")), class = "data.frame", row.names = c(NA,
-7L))
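The same lookup can also be written with a named vector, which some readers find easier to follow than an index from match. This is a sketch under the assumption that every key in df1 maps to a single value; the names lookup and vals are illustrative.

```r
df1 <- data.frame(Col1 = c("A", "A", "B", "A", "B", "B", "C"),
                  Col2 = c(1, 1, 4, 1, 4, 4, 7),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Col1 = c("C", "D", "C", "F", "A", "B", "B"),
                  Col2 = "X",
                  stringsAsFactors = FALSE)

# One entry per distinct key: names are the keys, elements are the values
df3 <- unique(df1)
lookup <- setNames(df3$Col2, df3$Col1)

# Indexing by name returns NA for keys that are missing from the lookup
vals <- unname(lookup[df2$Col1])
vals[is.na(vals)] <- 0

df2$Col2 <- vals
df2
```

If a key in df1 had several different values, indexing by name would silently take the first one, so that case would need to be resolved before building the lookup.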
Let's say we have two data frames:
df1 <- data.frame(A = letters[1:3], B = letters[4:6], C = letters[7:9], stringsAsFactors = FALSE)
A B C
1 a d g
2 b e h
3 c f i
df2 <- data.frame(V1 = 1:3, V2 = 4:6, V3 = 7:9)
V1 V2 V3
1 1 4 7
2 2 5 8
3 3 6 9
I need to build a function that takes as input a single value or a vector containing elements from one of the data frames and returns the elements from the other data frame according to their positional indexes.
The function should work like this:
> matchdf(values = c("a", "e", "i"), dfin = df1, dfout = df2)
[1] 1 5 9
> matchdf(values = c(1, 5, 9), dfin = df2, dfout = df1)
[1] "a" "e" "i"
> matchdf(values = c(1, 1, 1), dfin = df2, dfout = df1)
[1] "a" "a" "a"
This is what I have tried so far:
library(dplyr)
toVec <- function(df) df %>% as.matrix %>% as.vector
matchdf <- function(values, dfin, dfout) toVec(dfout)[toVec(dfin) %in% values]
# But sometimes the output values aren't in correct order:
> matchdf(c("c", "i", "h"), df1, df2)
[1] 3 8 9
# should output 3 9 8
> matchdf(values = c("a", "a", "a"), dfin = df1, dfout = df2)
[1] 1
# Should output 1 1 1
Feel free to use data.table or/and dplyr if it eases the task. I would prefer a solution without for loops.
Assumptions:
elements from df1 are different from df2
dim(df1) = dim(df2)
matchdf <- function(values, dfin, dfout){
unlist(sapply(values,
function(val) dfout[dfin == val],
USE.NAMES = FALSE)
)
}
matchdf(c("c", "i", "h"), df1, df2)
#should output 3 9 8
[1] 3 9 8
matchdf(values = c("a", "a", "a"), dfin = df1, dfout = df2)
#should output 1 1 1
[1] 1 1 1
matchdf(values = c("X", "Y", "a"), dfin = df1, dfout = df2)
#should output vector, not list
[1] 1
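A vectorised variant is sketched below. It relies on the stated assumptions (both data frames have the same dimensions, and each value occurs once in dfin) and uses match on the flattened matrices. The function name matchdf2 is made up here to keep it apart from the answer's matchdf. Unlike the version above, it returns NA for values that are not found instead of dropping them.

```r
df1 <- data.frame(A = letters[1:3], B = letters[4:6], C = letters[7:9],
                  stringsAsFactors = FALSE)
df2 <- data.frame(V1 = 1:3, V2 = 4:6, V3 = 7:9)

matchdf2 <- function(values, dfin, dfout) {
  # match() returns, for each value, its column-major position in dfin;
  # the same position is then read from dfout
  as.matrix(dfout)[match(values, as.matrix(dfin))]
}

matchdf2(c("c", "i", "h"), df1, df2)   # order of the input is preserved
matchdf2(c(1, 1, 1), df2, df1)         # repeated values are repeated
matchdf2(c("X", "a"), df1, df2)        # unknown values become NA
```

Whether NA or silent dropping is preferable depends on the use case; NA keeps the output the same length as the input.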
I'm looking to obtain a subset of my first, larger, dataframe 'df1' by selecting rows which contain particular combinations in the first two variables, as specified in a smaller 'df2'. For example:
df1 <- data.frame(ID = c("A", "A", "A", "B", "B", "B"),
day = c(1, 2, 2, 1, 2, 3), value = seq(4,9))
df1 # my actual df has 20 variables
ID day value
A 1 4
A 2 5
A 2 6
B 1 7
B 2 8
B 3 9
df2 <- data.frame(ID = c("A", "B"), day = c(2, 1))
df2 # this df remains at 2 variables
ID day
A 2
B 1
Where the output would be:
ID day value
A 2 5
A 2 6
B 1 7
Any help would be much appreciated, thanks!
This is a good use of the merge function.
df1 <- data.frame(ID = c("A", "A", "A", "B", "B", "B"),
day = c(1, 2, 2, 1, 2, 3), value = seq(4,9))
df2 <- data.frame(ID = c("A", "B"), day = c(2, 1))
merge(df1,
df2,
by = c("ID", "day"))
Which gives output:
ID day value
1 A 2 5
2 A 2 6
3 B 1 7
Here is a dplyr solution:
library("dplyr")
semi_join(df1, df2, by = c("ID", "day"))
# ID day value
# 1 A 2 5
# 2 A 2 6
# 3 B 1 7
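For comparison, a base R sketch that filters df1 without joining: build one key per row from the two columns and keep the rows whose key occurs in df2. The separator "|" and the names key1, key2 and subset_df are assumptions made for this example; the separator must not occur in the ID values.

```r
df1 <- data.frame(ID = c("A", "A", "A", "B", "B", "B"),
                  day = c(1, 2, 2, 1, 2, 3), value = seq(4, 9))
df2 <- data.frame(ID = c("A", "B"), day = c(2, 1))

# Combine both matching columns into a single key per row
key1 <- paste(df1$ID, df1$day, sep = "|")
key2 <- paste(df2$ID, df2$day, sep = "|")

# Keep the rows of df1 whose key appears anywhere in df2
subset_df <- df1[key1 %in% key2, ]
subset_df
```

Like semi_join, this keeps the original row order of df1 and never duplicates rows, even if df2 contains the same combination more than once; merge would repeat the matching rows in that case.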