Joining dataframes of different dimensions with varying merge by criterion - r

Good evening, I am trying to merge a couple datasets and my normal tools in R are failing me tonight. Consider df1 and df2 below.
df1 = data.frame(a = c("a", "b", "c"),
b = c("1", "2", "3"),
c = c("x", "y", "z"))
df2 = data.frame(a = c("1", "b", "c", "d", "e"),
b = c("a", "2", "3", "4", "5"),
d = c("x2", "y2", "z2", "x3", "y3"))
In both cases, columns a and b are supposed to act as grouping variables. For example, in df1, when a = "a" and b = "1", then c = "x". Given the structure of the data I am working with, the order of a and b does not matter: if a were "1" and b were "a", c would still equal "x". Herein lies the problem. I would like to merge df1 with a second data frame, df2, which is similarly structured but contains a new variable, d. As can be seen, df2 includes some a and b combinations that are reversed compared to df1. In addition, df2 has some extra observations.
The desired dataframe I am looking for looks like this:
desired = data.frame(a = c("a", "b", "c"),
b = c("1", "2", "3"),
c = c("x", "y", "z"),
d = c("x2", "y2", "z2"))
As can be seen, the original column structure of a, b and c is preserved, and column d has been added. However, no new observations have been added.
I have tried using merge() with varying combinations of by.x, by.y.
I also tried various left_join and inner_join calls, but I keep getting whopping data sets that still don't handle the mismatch in the a/b columns.
Thanks for any thoughts or help you might be able to provide.
Cheers

You can left_join df2 with df1 twice and use coalesce -
library(dplyr)
df1 %>%
left_join(df2, by = c("a"="a", "b"="b")) %>%
left_join(df2, by = c("a"="b", "b"="a")) %>%
mutate(
d = coalesce(d.x, d.y)
) %>%
select(a,b,c,d)
a b c d
1 a 1 x x2
2 b 2 y y2
3 c 3 z z2

Good morning. It appears the actual order of a and b does matter for merging, so sort the a/b pair within each row of df2 (or maybe of both data frames) first.
df2[1:2] <- t(apply(df2[1:2], 1, sort, decreasing=TRUE))
merge(df1, df2)
# a b c d
# 1 a 1 x x2
# 2 b 2 y y2
# 3 c 3 z z2
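A related idea, sketched here in base R only: build an order-independent key from the a/b pair with pmin() and pmax(), then merge on that key. The helper add_key and the column name key are made up for this illustration and are not part of either answer above. The sketch assumes the "|" separator never occurs in a or b.

```r
df1 <- data.frame(a = c("a", "b", "c"),
                  b = c("1", "2", "3"),
                  c = c("x", "y", "z"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(a = c("1", "b", "c", "d", "e"),
                  b = c("a", "2", "3", "4", "5"),
                  d = c("x2", "y2", "z2", "x3", "y3"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: the smaller value always goes first, so
# ("a", "1") and ("1", "a") produce the same key.
add_key <- function(df) {
  df$key <- paste(pmin(df$a, df$b), pmax(df$a, df$b), sep = "|")
  df
}

k1 <- add_key(df1)
k2 <- add_key(df2)

# Left join on the key; only the key and d are taken from df2,
# so a and b keep the orientation they have in df1.
result <- merge(k1, k2[c("key", "d")], by = "key", all.x = TRUE)

# merge() sorts by key, so restore df1's row order and drop the key.
result <- result[order(match(result$key, k1$key)), c("a", "b", "c", "d")]
result
```

Because the same function builds the key on both sides, the match does not depend on how the locale sorts digits relative to letters.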

Related

Is there a way to replace rows in one dataframe with another in R?

I'm trying to figure out how to replace rows in one dataframe with another by matching the values of one of the columns. Both dataframes have the same column names.
Ex:
df1 <- data.frame(x = c(1,2,3,4), y = c("a", "b", "c", "d"))
df2 <- data.frame(x = c(1,2), y = c("f", "g"))
Is there a way to replace the rows of df1 with the same row in df2 where they share the same x variable? It would look like this.
data.frame(x = c(1,2,3,4), y = c("f","g","c","d"))
I've been working on this for a while and this is the closest I've gotten -
df1[which(df1$x %in% df2$x),]$y <- df2[which(df1$x %in% df2$x),]$y
But it just replaces the values with NA.
Does anyone know how to do this?
We can use match:
inds <- match(df1$x, df2$x)
df1$y[!is.na(inds)] <- df2$y[na.omit(inds)]
df1
# x y
#1 1 f
#2 2 g
#3 3 c
#4 4 d
First off, well done on producing a nice reproducible example that's directly copy-pastable. That always helps, especially with an example of the expected output. Nice one!
You have several options, but let's look at why your solution doesn't quite work:
First of all, I tried copy-pasting your last line into a new session and got the dreaded factor-error:
Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = 1:2) :
invalid factor level, NA generated
If we look at your data frames df1 and df2 with the str function, you will see that they do not contain text but factors. These are not text - in short they represent categorical data (male vs. female, scores A, B, C, D, and F, etc.) and are really integers that have a text as label. So that could be your issue.
Running your code gives a warning because you are trying to import new factors (labels) into df1 that don't exist. And R doesn't know what to do with them, so it just inserts NA-values.
As r2evans showed in his answer, you can pass stringsAsFactors = FALSE to stop strings being converted to factors - you can even go as far as disabling it on a session-wide basis using options(stringsAsFactors=FALSE) (and I've heard it will be disabled by default in the forthcoming R 4.0 - yay!).
After disabling stringsAsFactors, your code works - or does it? Try this on for size:
df2 <- df2[c(2,1),]
df1[which(df1$x %in% df2$x),]$y <- df2[which(df1$x %in% df2$x),]$y
What's in df1 now? Not quite right anymore.
In the first line, I swapped the two rows in df2 and lo and behold, the replaced values in df1 were swapped. Why is that?
Let's deconstruct your statement df2[which(df1$x %in% df2$x),]$y
The call df1$x %in% df2$x returns a logical vector (boolean) saying which elements of df1$x are found in df2$x - i.e. the first two and not the last two. But it doesn't say which position in the first vector corresponds to which position in the second.
Calling which(df1$x %in% df2$x) then reduces the logical vector to the indices that were TRUE. Again, we do not know which elements correspond to which.
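To make the difference concrete, here is a small sketch that assumes character columns (stringsAsFactors = FALSE) and a df2 whose rows are in reversed order. The variable names in_df2, pos and hit are only for illustration.

```r
df1 <- data.frame(x = c(1, 2, 3, 4), y = c("a", "b", "c", "d"),
                  stringsAsFactors = FALSE)
# df2 with its rows in the "wrong" order
df2 <- data.frame(x = c(2, 1), y = c("g", "f"),
                  stringsAsFactors = FALSE)

# %in% only answers "is this value present?" (TRUE TRUE FALSE FALSE)
in_df2 <- df1$x %in% df2$x

# match() answers "at which position in df2$x?" (2 1 NA NA)
pos <- match(df1$x, df2$x)

# Use the positions to pull the right replacement for each matched row
hit <- !is.na(pos)
df1$y[hit] <- df2$y[pos[hit]]
df1
```

With match(), the row with x = 1 receives "f" and the row with x = 2 receives "g", regardless of the row order in df2.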
For solutions, I would recommend r2evans's answer, as it doesn't rely on extra packages (although data.table and dplyr are two powerful packages to get to know).
In his solution, he uses merge to perform a "full join" which matches rows based on the value, rather than - well, what you did. With transform, he assigns new variables within the context of the data.frame returned from the merge function called in the first argument.
I think what you need here is a "merge" or "join" operation.
(I add stringsAsFactors=FALSE to the frames so that the merging and later work is without any issue, as factors can be disruptive sometimes.)
Base R:
df1 <- data.frame(x = c(1,2,3,4), y = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
df2 <- data.frame(x = c(1,2), y = c("f", "g"), stringsAsFactors = FALSE)
merge(df1, df2, by = "x", all = TRUE)
# x y.x y.y
# 1 1 a f
# 2 2 b g
# 3 3 c <NA>
# 4 4 d <NA>
transform(merge(df1, df2, by = "x", all = TRUE), y = ifelse(is.na(y.y), y.x, y.y))
# x y.x y.y y
# 1 1 a f f
# 2 2 b g g
# 3 3 c <NA> c
# 4 4 d <NA> d
transform(merge(df1, df2, by = "x", all = TRUE), y = ifelse(is.na(y.y), y.x, y.y), y.x = NULL, y.y = NULL)
# x y
# 1 1 f
# 2 2 g
# 3 3 c
# 4 4 d
Dplyr:
library(dplyr)
full_join(df1, df2, by = "x") %>%
mutate(y = coalesce(y.y, y.x)) %>%
select(-y.x, -y.y)
# x y
# 1 1 f
# 2 2 g
# 3 3 c
# 4 4 d
A join option with data.table where we join on the 'x' column, assign the values of 'y' in second dataset (i.y) to the first one with :=
library(data.table)
setDT(df1)[df2, y := i.y, on = .(x)]
NOTE: It is better to use stringsAsFactors = FALSE (which is the default from R 4.0.0 onward); otherwise both datasets need to have all factor levels in common.

R - How to use dplyr left_join by column index?

How can I use a column index with dplyr::left_join (and the rest of its family)?
Example (by column names):
library(dplyr)
data1 <- data.frame(var1 = c("a", "b", "c"), var2 = c("d", "d", "f"))
data2 = data.frame(alpha = c("d", "f"), beta = c(20, 30))
left_join(data1, data2, by = c("var2" = "alpha"))
However, replacing by = c("var2" = "alpha") with by = c(data1[,2] = data2[,1]) results in this error:
by must be a (named) character vector, list, or NULL for natural
joins (not recommended in production code), not logical.
I need to use the column position so that I can loop over the joins inside new functions.
How can I do it?
Using dplyr:
# rename_at changes alpha into var2 in data2
left_join(data1, rename_at(data2, 1, ~ names(data1)[2]), by = names(data1)[2])
# output
var1 var2 beta
1 a d 20
2 b d 20
3 c f 30
Using base R:
merge(data1, data2, by.x = 2, by.y = 1, all.x = TRUE, all.y = FALSE)
# output
var2 var1 beta
1 d a 20
2 d b 20
3 f c 30
I don't know how you're going to use the column index but a hacky solution is the following:
#make a named vector for the by argument, see ?left_join
join_var <- names(data2)[1] #change index here based on data2
names(join_var) <- names(data1)[2] #change index here based on data1
left_join(data1, data2, by = join_var)
Depending on the final output you desire by using the column index, there is probably a more appropriate solution than this.
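If the goal is to call this from a loop or from other functions, the index-to-name translation can be wrapped up once. The following is only a sketch in base R. The function name left_join_by_pos and its arguments ix and iy are invented here and do not exist in dplyr or base R.

```r
data1 <- data.frame(var1 = c("a", "b", "c"), var2 = c("d", "d", "f"),
                    stringsAsFactors = FALSE)
data2 <- data.frame(alpha = c("d", "f"), beta = c(20, 30),
                    stringsAsFactors = FALSE)

# Hypothetical helper: left join x and y using column positions.
# ix is the key column position in x, iy the key column position in y.
left_join_by_pos <- function(x, y, ix, iy) {
  merge(x, y, by.x = names(x)[ix], by.y = names(y)[iy], all.x = TRUE)
}

out <- left_join_by_pos(data1, data2, ix = 2, iy = 1)
out

# The same positions can also be turned into the named character
# vector that dplyr's `by` argument expects: c(var2 = "alpha")
by_vec <- setNames(names(data2)[1], names(data1)[2])
by_vec
```

Note that merge() puts the key column first and sorts by it, so the column and row order can differ from what left_join returns.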

How to combine columns in a data frame so that they overlap in R?

Basically, I have data from a between subjects study design that looks like this:
> head(have, 3)
  A B C
1   b
2 a
3     c
Here, A, B, C are various conditions of the study, so each subject (indicated by each row), only has a value for one of the conditions. I would like to combine these so that it looks like this:
> head(want, 3)
all
1 b
2 a
3 c
How can I combine the columns so that they "overlap" like this?
So far, I have tried using some of dplyr's join functions, but they haven't worked out for me. I appreciate any guidance in combining my columns in this way.
We can use pmax
want <- data.frame(all= do.call(pmax, have))
Or using dplyr
transmute(have, all= pmax(A, B, C))
# all
#1 b
#2 a
#3 c
data
have <- structure(list(A = c("", "a", ""), B = c("b", "", ""),
C = c("", "", "c")), .Names = c("A", "B", "C"),
class = "data.frame", row.names = c("1", "2", "3"))
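Why pmax works here: the empty string sorts before any non-empty string, so the row-wise maximum is the one filled cell. The sketch below checks this and shows an alternative that simply pastes the cells together. It assumes the empty cells are "" (not NA) and that each row has exactly one non-empty value. The names all_pmax and all_paste are illustrative only.

```r
have <- data.frame(A = c("", "a", ""),
                   B = c("b", "", ""),
                   C = c("", "", "c"),
                   stringsAsFactors = FALSE)

# Row-wise maximum across the columns: "" loses to any real value
all_pmax <- do.call(pmax, have)

# Alternative: concatenating the cells gives the same result when
# only one cell per row is non-empty
all_paste <- do.call(paste0, have)

want <- data.frame(all = all_pmax, stringsAsFactors = FALSE)
want
```

If a row could hold values in more than one column, the two approaches diverge: pmax keeps only the largest string, while paste0 glues them all together.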

Computing correlation of vectors by factor label

I have two data frames. The first one, df1, is a matrix of vectors with labeled columns, like the following:
df1 <- data.frame(A=rnorm(10), B=rnorm(10), C=rnorm(10), D=rnorm(10), E=rnorm(10))
> df1
A B C D E
-0.3200306 0.4370963 -0.9146660 1.03219577 0.5215359
-0.3193144 0.8900656 -1.1720264 -0.42591761 0.1936993
0.4897262 -1.3970806 0.6054637 0.12487936 1.0149530
0.3772420 0.8726322 0.3250020 -0.36952560 -0.5447512
-0.6921561 -0.6734468 0.3500812 -0.53373720 -0.6129472
0.2540649 -1.1911106 -0.3266428 0.14013437 1.0830148
0.6606825 -0.8942715 1.1099637 -1.52416540 -0.2383048
1.4767074 -2.1492360 0.2441242 -0.36136344 0.5589114
-0.5338117 -0.2357821 0.7694879 -0.21652356 0.3185631
3.4215916 -0.3157938 0.8895597 0.09946069 -1.0961730
The second data frame, df2, contains items that match the colnames of df1. Example:
group <- c("1", "1", "2", "2", "3", "3")
S1 <- c("A", "D", "E", "C", "B", "D")
S2 <- c("D", "B", "A", "C", "B", "A")
S3 <- c("B", "C", "A", "E", "E", "A")
df2 <- data.frame(group,S1, S2, S3)
> df2
group S1 S2 S3
1 A D B
1 D B C
2 E A A
2 C C E
3 B B E
3 D A A
I would like to compute the correlations between the column vectors in df1 that correspond to the labeled items in df2. Specifically, the vectors that match cor(df2$S1, df2$S2) and cor(df2$S1, df2$S3).
The output should be something like this:
group S1 S2 S3 cor.S1.S2 cor.S1.S3
1 A D B 0.003825055 -0.2817946
1 D B C -0.2817946 -0.4928023
2 E A A -0.3856809 -0.3856809
2 C C E 1 -0.3862433
3 B B E 1 -0.3888541
3 D A A 0.003825055 0.003825055
I've been trying to resolve this with cbind[] but keep running into problems such as the 'x' must be numeric error with cor. Thanks in advance for any help!
You can do this with mapply().
my.cor <- function(x,y) {
cor(df1[,x],df1[,y])
}
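# as.character() guards against factor columns (the default before R 4.0.0)
df2$cor.S1.S2 <- mapply(my.cor, as.character(df2$S1), as.character(df2$S2), USE.NAMES = FALSE)
df2$cor.S1.S3 <- mapply(my.cor, as.character(df2$S1), as.character(df2$S3), USE.NAMES = FALSE)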
Another approach would be to get the correlation between the matrices/data.frames after subsetting the columns of 'df1' with the columns of 'df2', take the diag, and assign the output as new columns in 'df2'. Here, I am using lapply as we have to do both 'S1 vs S2' and 'S1 vs S3'.
df2[c('cor.S1.S2', 'cor.S1.S3')] <- lapply(c('S2', 'S3'),
function(x) diag(cor(df1[, df2[,x]], df1[,df2$S1])))
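For a self-contained check of the idea, here is a sketch that looks up each pair of columns by name and computes one correlation per row of df2. It assumes df2's label columns are character (not factor). The helper name pair_cor is made up for this example, and set.seed is there only to make the random data repeatable.

```r
set.seed(1)
df1 <- data.frame(A = rnorm(10), B = rnorm(10), C = rnorm(10),
                  D = rnorm(10), E = rnorm(10))
df2 <- data.frame(group = c("1", "1", "2", "2", "3", "3"),
                  S1 = c("A", "D", "E", "C", "B", "D"),
                  S2 = c("D", "B", "A", "C", "B", "A"),
                  S3 = c("B", "C", "A", "E", "E", "A"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: for each i, correlate the df1 column named x[i]
# with the df1 column named y[i]
pair_cor <- function(x, y) {
  vapply(seq_along(x),
         function(i) cor(df1[[x[i]]], df1[[y[i]]]),
         numeric(1))
}

df2$cor.S1.S2 <- pair_cor(df2$S1, df2$S2)
df2$cor.S1.S3 <- pair_cor(df2$S1, df2$S3)
df2
```

Rows where both labels are the same (for example C vs C) come out as 1, and A vs D equals D vs A, which is a quick way to sanity-check the result.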

Converting given list into dataframe

I have the following list:
$id1
$id1[[1]]
A B
"A" "B"
$id1[[2]]
A B
"A" "A1"
$id2
$id2[[1]]
A B
"A2" "B2"
In R-pastable form:
dat = structure(list(SampleTable = structure(list(id2 = list(structure(c("90", "7"), .Names = c("T", "G")), structure(c("90", "8"), .Names = c("T", "G"))), id1 = structure(c("1", "1"), .Names = c("T", "G"))), .Names = c("id2", "id1"))), .Names = "SampleTable")
I want this given list to be converted into following dataframe:
id1 A B
id1 A A1
id2 A2 B2
Your data structure (apparently a named list of unnamed lists of 1-row data.frames) is a bit complicated: the easiest may be to use a loop to build the data.frame.
It can be done directly with do.call, lapply and rbind, but it is not very readable, even if you are familiar with those functions.
# Sample data
d <- list(
id1 = list(
data.frame( x=1, y=1 ),
data.frame( x=2, y=2 )
),
id2 = list(
data.frame( x=3, y=3 ),
data.frame( x=4, y=4 )
),
id3 = list(
data.frame( x=5, y=5 ),
data.frame( x=6, y=6 )
)
)
# Convert
d <- data.frame(
id=rep(names(d), unlist(lapply(d,length))),
do.call( rbind, lapply(d, function(u) do.call(rbind, u)) )
)
Another solution, using a loop, for when you have a ragged data structure containing vectors (not data.frames), as explained in the comments.
d <- structure(list(SampleTable = structure(list(id2 = list(structure(c("90", "7"), .Names = c("T", "G")), structure(c("90", "8"), .Names = c("T", "G"))), id1 = structure(c("1", "1"), .Names = c("T", "G"))), .Names = c("id2", "id1"))), .Names = "SampleTable")
result <- list()
for(i in seq_along(d$SampleTable)) {
id <- names(d$SampleTable)[i]
block <- d$SampleTable[[i]]
if(is.atomic(block)) {
block <- list(block)
}
for(row in block) {
result <- c(result, list(data.frame(id, as.data.frame(t(row)))))
}
}
result <- do.call(rbind, result)
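The same ragged-list handling can also be written without an explicit loop. This is a sketch applied to the small list printed at the top of the question (named character vectors under id1 and id2). The object name lst is an assumption, since the question only shows the printed form.

```r
# The list as printed in the question; `lst` is an assumed name
lst <- list(
  id1 = list(c(A = "A", B = "B"), c(A = "A", B = "A1")),
  id2 = list(c(A = "A2", B = "B2"))
)

rows <- lapply(names(lst), function(id) {
  block <- lst[[id]]
  # A bare vector (not wrapped in a list) is treated as a single row
  if (is.atomic(block)) block <- list(block)
  # Stack the named vectors into a matrix, then attach the id
  data.frame(id = id, do.call(rbind, block), stringsAsFactors = FALSE)
})

result <- do.call(rbind, rows)
result
```

Each element of rows is a small data.frame for one id, so the final rbind just stacks them in the order of names(lst).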
NOTE! I could not get melt and cast working on this kind of ragged data (I tried for over an hour...). I am going to leave this answer here to show that, for this kind of operation, the reshape package could also be used.
Using the example data of vincent, you can use melt and cast from the reshape package:
library(reshape)
res = cast(melt(d))[-1]
names(res) = c("id","x","y")
res
id x y
1 id1 1 1
2 id2 3 3
3 id3 5 5
4 id1 2 2
5 id2 4 4
6 id3 6 6
The order in the resulting data.frame is not the same, but the result is identical. And the code is a bit shorter. I use the [-1] to delete the first column which is also returned by melt. This additional variable is the column index of the individual data.frames in the list of lists. Just have a look at the result of melt(d), that will hopefully make it more clear.
This is a bit messier than you let on. That dat object has an extra "layer" above it, so it is easier to work with dat[[1]]:
dfrm <- data.frame(dat[[1]], stringsAsFactors=FALSE)
names(dfrm) <- sub("\\..+$", "", names(dfrm))
> dfrm
id2 id2 id1
T 90 90 1
G 7 8 1
> t(dfrm)
T G
id2 "90" "7"
id2 "90" "8"
id1 "1" "1"

Resources