How to merge data frames in R using *alternative* columns - r

I'm trying to merge 2 data frames in R, but I have two different columns with different types of ID variable. Sometimes a row will have a value for one of those columns but not the other. I want to consider them both, so that if one frame is missing a value for one of the columns then the other will be used.
> df1 <- data.frame(first = c('a', 'b', NA), second = c(NA, 'q', 'r'))
> df1
first second
1 a <NA>
2 b q
3 <NA> r
> df2 <- data.frame(first = c('a', NA, 'c'), second = c('p', 'q', NA))
> df2
first second
1 a p
2 <NA> q
3 c <NA>
I want to merge these two data frames and get 2 rows:
row 1, because it has the same value for "first"
row 2, because it has the same value for "second"
row 3 would be dropped, because df1 has a value for "second", but not "first", and df2 has the reverse
It's important that NAs are ignored and don't "match" in this case.
I can get kinda close:
> merge(df1,df2, by='first', incomparables = c(NA))
first second.x second.y
1 a <NA> p
> merge(df1,df2, by='second', incomparables = c(NA))
second first.x first.y
1 q b <NA>
But I can't rbind these two data frames together because they have different column names, and it doesn't seem like the "R" way to do it (in the near future, I'll have a 3rd, 4th and even 5th type of ID).
Is there a less clumsy way to do this?
Edit: Ideally, the output would look like this:
> df3 <- data.frame(first = c('a', 'b'), second = c('p','q'))
> df3
first second
1 a p
2 b q
row 1, has matched because the column "first" has the same value in both data frames, and it fills in the value for "second" from df2
row 2, has matched because the column "second" has the same value in both data frames, and it fills in the value for "first" from df1
there is no row 3, because there is no column that has a value in both data frames

With sqldf this can be done in a single join, because SQL lets a join match on any of several conditions using OR. NULL never equals NULL in SQL, so NA values do not match each other:
library(sqldf)
df <- sqldf("select a.*, b.*
from df1 a
join df2 b
ON a.first = b.first
OR a.second = b.second")
library(dplyr)
# If first is NA, take the value from first..3 (the copy that came from df2); otherwise keep first. Same for second.
df %>% mutate(first = ifelse(is.na(first), first..3, first),
              second = ifelse(is.na(second), second..4, second)) %>%
  # Discard first..3 and second..4 since we no longer need them
  select(-first..3, -second..4)
first second
1 a p
2 b q
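For readers who would rather stay in base R, or who expect a 3rd or 4th ID column later, the same idea can be sketched as a loop over the key columns: merge on each key in turn, fill every other ID column from whichever side has a value, and stack the pieces. The helper name merge_on_any and the id_cols argument are illustrative and not part of the answer above. The sketch assumes both data frames contain only the ID columns, and when both sides carry a non-NA value for a non-key column it keeps the one from the first data frame.

```r
# Illustrative helper (hypothetical name): match rows of x and y on ANY of the
# columns listed in id_cols, never treating NA keys as equal.
merge_on_any <- function(x, y, id_cols) {
  pieces <- lapply(id_cols, function(key) {
    # incomparables = NA stops NA keys from matching each other
    m <- merge(x, y, by = key, incomparables = NA, suffixes = c(".x", ".y"))
    for (col in setdiff(id_cols, key)) {
      from_x <- m[[paste0(col, ".x")]]
      from_y <- m[[paste0(col, ".y")]]
      # Prefer the value from x; fall back to y when x is NA
      m[[col]] <- ifelse(is.na(from_x), from_y, from_x)
    }
    m[, id_cols, drop = FALSE]
  })
  # A pair of rows that matches on several keys shows up once per key,
  # so drop the repeats after stacking
  unique(do.call(rbind, pieces))
}

df1 <- data.frame(first = c("a", "b", NA), second = c(NA, "q", "r"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(first = c("a", NA, "c"), second = c("p", "q", NA),
                  stringsAsFactors = FALSE)
df3 <- merge_on_any(df1, df2, id_cols = c("first", "second"))
```

Adding another ID type then only means extending id_cols, provided the column exists in both data frames.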

Related

Create a Column with Unique values from Lists Columns

I have a dataset in RStudio in which some columns contain lists. Here is an example where column "a" and column "c" contain a list in each row.
What am I looking for?
I need to create a new column that collects the unique values from columns a, b and c, skipping NA or null values.
The expected result is the column "desired_result".
test <- tibble(a = list(c("x1","x2"), c("x1","x3"),"x3"),
b = c("x1", NA,NA),
c = list(c("x1","x4"),"x4","x2"),
desired_result = list(c("x1","x2","x4"),c("x1","x3","x4"),c("x2","x3")))
What have I tried so far?
I tried the following, but it does not produce the expected result shown in column "desired_result":
test$attempt_1_ <-lapply(apply((test[, c("a","b","c"), drop = T]),
MARGIN = 1, FUN= c, use.names= FALSE),unique)
We may use pmap to loop over the corresponding elements of 'a' to 'c', remove the NA values (na.omit) and store the sorted unique values as a list in 'desired_result2':
library(dplyr)
library(purrr)
test <- test %>%
mutate(desired_result2 = pmap(across(a:c), ~ sort(unique(na.omit(c(...))))))
Checking against the OP's expected result:
> all.equal(test$desired_result, test$desired_result2)
[1] TRUE
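The same row-wise combination can be sketched in base R with Map, which walks the three columns in parallel. This is only an illustration under a few assumptions: the columns are held as plain vectors and lists rather than in a tibble, and the names combine_unique and c_col are made up for the example.

```r
# The three columns from the question, kept as plain R objects
a <- list(c("x1", "x2"), c("x1", "x3"), "x3")
b <- c("x1", NA, NA)
c_col <- list(c("x1", "x4"), "x4", "x2")

# Hypothetical helper: flatten one "row", drop NAs, keep sorted unique values
combine_unique <- function(...) {
  values <- c(...)
  sort(unique(values[!is.na(values)]))
}

# Map calls combine_unique once per row position across a, b and c_col
result <- unname(Map(combine_unique, a, b, c_col))
```

The result is a list with one character vector per row, so it can be assigned straight to a list column.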

duplicate 'row.names' are not allowed - R

I am new to R and trying to implement a differential gene expression analysis.
I'm trying to store gene names as rownames so that I can create a DGEList object.
asthma <- read.csv("Asthma_3 groups-Our study gene expression.csv")
head(asthma, 10)
dim(asthma)
asthma <- na.omit(asthma)
distinct(asthma)
countdata <- asthma[,-1]
head(countdata)
rownames(countdata) <- asthma[,1]
I am getting this error:
Error in `.rowNamesDF<-`(x, value = value) : duplicate 'row.names' are not allowed
The first column in asthma likely has duplicate values. There are two options I can think of:
Can the first column be combined with another column to generate a new column with unique values that can be used as the rownames?
If not, you can probably use make.names().
Here is a reproducible example.
df = data.frame(col1 = c('A', 'A', 'B'), col2 = c(1, 2, 3))
df
That defines a data.frame that looks like this
col1 col2
1 A 1
2 A 2
3 B 3
The data.frame by default has rownames 1, 2, 3. If you try this
rownames(df) = df[,1]
you get an error because df[,1] has 'A' twice, so it can't be used as a rowname without modification. You can use make.names to create rownames with unique values like this:
unique.col1 = make.names(df[,1], unique=T)
unique.col1
This results in
"A" "A.1" "B"
Note that the .1 was added to the second A to make it different from the first A. Then define the rownames as unique.col1:
rownames(df) = unique.col1
df
The data.frame df now looks like this
col1 col2
A A 1
A.1 A 2
B B 3
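The first option mentioned above, combining the first column with another one, might look like the sketch below. It reuses the small example data frame; the names dup_values and combined are illustrative, and the sketch assumes that the pair (col1, col2) is unique in every row, which is checked before the row names are assigned.

```r
df <- data.frame(col1 = c("A", "A", "B"), col2 = c(1, 2, 3),
                 stringsAsFactors = FALSE)

# Which values of col1 occur more than once? Useful for diagnosing the error.
dup_values <- unique(df$col1[duplicated(df$col1)])

# Build candidate row names from both columns
combined <- paste(df$col1, df$col2, sep = "_")

# anyDuplicated() returns 0 when every value is unique
stopifnot(anyDuplicated(combined) == 0)
rownames(df) <- combined
```

Looking at dup_values first is worthwhile with gene expression data, since repeated gene names may be something to resolve rather than rename.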

adding unique rows from one data frame to another

I have a data frame which comprises a subset of records contained in a 2nd data frame. I would like to add the record rows of the 2nd data frame that are not common in the first data frame to the first... Thank you.
If you want all unique rows from both dataframes, this would work:
df1 <- data.frame(X = c('A','B','C'), Y = c(1,2,3))
df2 <- data.frame(X = 'A', Y = 1)
df <- rbind(df1,df2)
no.dupes <- df[!duplicated(df),]
no.dupes
# X Y
#1 A 1
#2 B 2
#3 C 3
But it won't work if there are duplicate rows in either data frame that you want to preserve.
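If duplicates inside the first data frame do need to survive, one possible base R sketch is to append only those rows of the second data frame that have no counterpart in the first. The helper row_key is hypothetical: it pastes all columns of a row into one string, so it assumes both data frames have the same columns in the same order and that the pasted text identifies a row unambiguously.

```r
df1 <- data.frame(X = c("A", "A", "B"), Y = c(1, 1, 2), stringsAsFactors = FALSE)
df2 <- data.frame(X = c("A", "C"), Y = c(1, 3), stringsAsFactors = FALSE)

# Hypothetical helper: one string per row, built from every column
row_key <- function(d) do.call(paste, c(d, sep = "\r"))

# Rows of df2 whose key does not occur anywhere in df1
new_rows <- df2[!(row_key(df2) %in% row_key(df1)), , drop = FALSE]

# df1 is kept untouched, including its repeated row
combined <- rbind(df1, new_rows)
```

Here the repeated A/1 row of df1 is kept twice, and only the C/3 row is added from df2.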
You should look at dplyr's distinct() and bind_rows() functions.
Better still, provide some dummy data to work on and the expected output.
Suppose you have two data frames a and b, and you want to add the unique rows of a to b:
a = data.frame(
x = c(1,2,3,1,4,3),
y = c(5,2,3,5,3,3)
)
b = data.frame(
x = c(6,2,2,3,3),
y = c(19,13,12,3,1)
)
library(dplyr)
distinct(a) %>% bind_rows(.,b)

Select row by level of a factor

I have a data frame, df2, containing observations grouped by an ID factor that I would like to subset. I have used another function to identify which rows within each factor group I want to select. This is shown below in df:
df <- data.frame(ID = c("A","B","C"),
pos = c(1,3,2))
df2 <- data.frame(ID = c(rep("A",5), rep("B",5), rep("C",5)),
obs = c(1:15))
In df, pos corresponds to the index of the row that I want to select within the factor level mentioned in ID, not in the whole data frame df2. I'm looking for a way to select the rows for each ID according to the right index (that is, their row number within each factor level of df2).
So, in this example, I want to select the first value in df2 with ID == 'A', the third value in df2 with ID == 'B' and the second value in df2 with ID == 'C'.
This would then give me:
df3 <- data.frame(ID = c("A", "B", "C"),
obs = c(1, 8, 12))
dplyr
library(dplyr)
merge(df,df2) %>%
group_by(ID) %>%
filter(row_number() == pos) %>%
select(-pos)
# ID obs
# 1 A 1
# 2 B 8
# 3 C 12
base R
df2m <- merge(df,df2)
do.call(rbind,
by(df2m, df2m$ID, function(SD) SD[SD$pos[1], setdiff(names(SD),"pos")])
)
by splits the merged data frame df2m by df2m$ID and operates on each part; it returns results in a list, so they must be rbinded together at the end. Each subset of the data (associated with each value of ID) is filtered by pos and deselects the "pos" column using normal data.frame syntax.
data.table, suggested by @DavidArenburg in a comment
library(data.table)
setkey(setDT(df2),"ID")[df][,
.SD[pos[1L], !"pos", with=FALSE]
, by = ID]
The first part -- setkey(setDT(df2),"ID")[df] -- is the merge. After that, the resulting table is split by = ID, and each Subset of Data, .SD is operated on. pos[1L] is subsetting in the normal way, while !"pos", with=FALSE corresponds to dropping the pos column.
See @eddi's answer for a better data.table approach.
Here's the base R solution:
df2$pos <- ave(df2$obs, df2$ID, FUN=seq_along)
merge(df, df2)
ID pos obs
1 A 1 1
2 B 3 8
3 C 2 12
If df2 is sorted by ID, you can just do df2$pos <- sequence(table(df2$ID)) for the first line.
Using data.table version 1.9.5+:
setDT(df2)[df, .SD[pos], by = .EACHI, on = 'ID']
which merges on ID column, then selects the pos row for each of the rows of df.
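For comparison, here is a base R sketch that avoids the merge entirely by looking up row numbers per group. It is an illustration only: the names rows_by_id and picked are made up, and it assumes every ID in df also occurs in df2. A pos larger than the group size would yield an NA row.

```r
df <- data.frame(ID = c("A", "B", "C"), pos = c(1, 3, 2),
                 stringsAsFactors = FALSE)
df2 <- data.frame(ID = rep(c("A", "B", "C"), each = 5), obs = 1:15,
                  stringsAsFactors = FALSE)

# Row numbers of df2, grouped by ID (df2 does not need to be sorted)
rows_by_id <- split(seq_len(nrow(df2)), df2$ID)

# For each row of df, take the pos-th row number of the matching group
picked <- unname(mapply(function(rows, p) rows[p],
                        rows_by_id[as.character(df$ID)], df$pos))

df3 <- df2[picked, ]
```

Because only integer row numbers are shuffled around, df2 itself is never copied or modified.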

Intersecting multiple columns between two data frames

I have two data frames with 2 columns in each. For example:
df.1 = data.frame(col.1 = c("a","a","a","a","b","b","b","c","c","d"), col.2 = c("b","c","d","e","c","d","e","d","e","e"))
df.2 = data.frame(col.1 = c("b","b","b","a","a","e"), col.2 = c("a","c","e","c","e","c"))
and I'm looking for an efficient way to look up the row index in df.2 of every col.1 col.2 row pair of df.1. Note that a row pair in df.1 may appear in df.2 in reverse order (for example df.1[1,], which is "a","b" appears in df.2[1,] as "b","a"). That doesn't matter to me. In other words, as long as a row pair in df.1 appears in any order in df.2 I want its row index in df.2, otherwise it should return NA. One more note, row pairs in both data frames are unique - meaning each row pair appears only once.
So for these two data frames the return vector would be:
c(1,4,NA,5,2,NA,3,NA,6,NA)
Maybe something using the dplyr package. First make the reference frame:
use row_number() to number the rows by their index efficiently.
use select to "flip" the column variables.
Build the two halves:
library(dplyr)
df_ref_top <- df.2 %>% mutate(n=row_number())
df_ref_btm <- df.2 %>% select(col.1=col.2, col.2=col.1) %>% mutate(n=row_number())
Then bind them together:
df_ref <- rbind(df_ref_top,df_ref_btm)
Left join and extract the index vector to get your answer:
left_join(df.1,df_ref)$n
# Per @thelatemail's comment, here's a more elegant approach:
match(apply(df.1,1,function(x) paste(sort(x),collapse="")),
apply(df.2,1,function(x) paste(sort(x),collapse="")))
# My original answer, for reference:
# Check for matches with both orderings of df.2's columns
match.tmp = cbind(match(paste(df.1[,1],df.1[,2]), paste(df.2[,1],df.2[,2])),
match(paste(df.1[,1],df.1[,2]), paste(df.2[,2],df.2[,1])))
# Convert to single vector of match indices
match.index = apply(match.tmp, 1,
function(x) ifelse(all(is.na(x)), NA, max(x, na.rm=TRUE)))
[1] 1 4 NA 5 2 NA 3 NA 6 NA
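A vectorised variant of the same order-independent matching can be sketched with pmin and pmax, which avoids calling sort once per row. The helper name pair_key is hypothetical. A tab is used as separator on the assumption that it never occurs in the data; with collapse="" and longer values, pairs such as ("ab", "c") and ("a", "bc") would produce the same key.

```r
df.1 <- data.frame(col.1 = c("a","a","a","a","b","b","b","c","c","d"),
                   col.2 = c("b","c","d","e","c","d","e","d","e","e"),
                   stringsAsFactors = FALSE)
df.2 <- data.frame(col.1 = c("b","b","b","a","a","e"),
                   col.2 = c("a","c","e","c","e","c"),
                   stringsAsFactors = FALSE)

# Hypothetical helper: one key per row, with the two values in sorted order
pair_key <- function(d) {
  x <- as.character(d[[1]])
  y <- as.character(d[[2]])
  # pmin/pmax compare element-wise, so each pair is ordered without a loop
  paste(pmin(x, y), pmax(x, y), sep = "\t")
}

# Row index in df.2 of every pair of df.1, NA where there is no match
idx <- match(pair_key(df.1), pair_key(df.2))
```

The as.character() calls matter on older R versions, where data.frame() turns strings into factors and pmin would not accept them.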
Here's a little function that tests a few of the looping options in R (which was not really intentional, but it happened).
check.rows <- function(data1, data2)
{
    df1 <- as.matrix(data1)
    df2 <- as.matrix(data2)
    ll <- vector('list', nrow(df1))
    for(i in seq(nrow(df1))){
        ll[[i]] <- sapply(seq(nrow(df2)), function(j) df2[j,] %in% df1[i,])
    }
    h <- sapply(ll, function(x) which(apply(x, 2, all)))
    sapply(h, function(x) ifelse(is.double(x), NA, x))
}
check.rows(df.1, df.2)
## [1] 1 4 NA 5 2 NA 3 NA 6 NA
And here's a benchmark when row dimensions are increased for both df.1 and df.2. Not too bad I guess, considering the 24 checks on each of 40 rows.
> dim(df.11); dim(df.22)
[1] 40 2
[1] 24 2
> f <- function() check.rows(df.11, df.22)
> microbenchmark(f())
## Unit: milliseconds
## expr min lq median uq max neval
## f() 75.52258 75.94061 76.96523 78.61594 81.00019 100
1) sort/merge First sort the values within each row of df.2, creating df.2.s, and append a row number column. Then merge this new data frame with df.1 (whose rows are already sorted in the question):
df.2.s <- replace(df.2, TRUE, t(apply(df.2, 1, sort)))
df.2.s$row <- 1:nrow(df.2.s)
merge(df.1, df.2.s, all.x = TRUE)$row
The result is:
[1] 1 4 NA 5 2 NA 3 NA 6 NA
2) sqldf Since dot is an SQL operator, rename the data frames to df1 and df2. Note that for the same reason the column names will be transformed to col_1 and col_2 when df1 and df2 are automatically uploaded to the backend database. We sort the values within each row of df2 using min and max and left join the result to df1 (whose rows are already sorted):
df1 <- df.1
df2 <- df.2
library(sqldf)
sqldf("select b.rowid row
from df1
left join
(select min(col_1, col_2) col_1, max(col_1, col_2) col_2 from df2) b
using (col_1, col_2)")$row
REVISED Some code improvements. Added second solution.
