Merge two dataframes containing duplicate elements in R

Given two dataframes, foo and bar, whose row and column names partially overlap:
foo <- iris[1:10,-c(4,5)]
# Sepal.Length Sepal.Width Petal.Length
# 1 5.1 3.5 1.4
# 2 4.9 3.0 1.4
# 3 4.7 3.2 1.3
# 4 4.6 3.1 1.5
# 5 5.0 3.6 1.4
# 6 5.4 3.9 1.7
# 7 4.6 3.4 1.4
# 8 5.0 3.4 1.5
# 9 4.4 2.9 1.4
# 10 4.9 3.1 1.5
bar <- iris[3:13,-c(3,5)]
bar[1:8, ] <- bar[1:8, ] * 2
# Sepal.Length Sepal.Width Petal.Width
# 3 9.4 6.4 0.4
# 4 9.2 6.2 0.4
# 5 10.0 7.2 0.4
# 6 10.8 7.8 0.8
# 7 9.2 6.8 0.6
# 8 10.0 6.8 0.4
# 9 8.8 5.8 0.4
# 10 9.8 6.2 0.2
# 11 5.4 3.7 0.2
# 12 4.8 3.4 0.2
# 13 4.8 3.0 0.1
How can I merge the dataframes such that both rows and columns are padded for missing cases, while prioritising the results of one dataframe for overlapping elements? In this example, it is the overlapping results in bar that I wish to prioritise.
merge(..., by = "row.names", all = TRUE) is close, in that it retains all 13 rows, and returns missing values as NA:
foobar <- merge(foo, bar, by = "row.names", all = TRUE)
# Row.names Sepal.Length.x Sepal.Width.x Petal.Length Sepal.Length.y Sepal.Width.y Petal.Width
# 1 1 5.1 3.5 1.4 NA NA NA
# 2 10 4.9 3.1 1.5 9.8 6.2 0.2
# 3 11 NA NA NA 5.4 3.7 0.2
# 4 12 NA NA NA 4.8 3.4 0.2
# 5 13 NA NA NA 4.8 3.0 0.1
# 6 2 4.9 3.0 1.4 NA NA NA
# 7 3 4.7 3.2 1.3 9.4 6.4 0.4
# 8 4 4.6 3.1 1.5 9.2 6.2 0.4
# 9 5 5.0 3.6 1.4 10.0 7.2 0.4
# 10 6 5.4 3.9 1.7 10.8 7.8 0.8
# 11 7 4.6 3.4 1.4 9.2 6.8 0.6
# 12 8 5.0 3.4 1.5 10.0 6.8 0.4
# 13 9 4.4 2.9 1.4 8.8 5.8 0.4
However, it creates a distinct column for each column in the constituent dataframes, even when they share names.
The desired output would be as such:
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1 5.1 3.5 1.4 NA # unique to foo
# 2 4.9 3.0 1.4 NA # unique to foo
# 3 9.4 6.4 1.3 0.4 # overlap, retained from bar
# 4 9.2 6.2 1.5 0.4 #
# 5 10.0 7.2 1.4 0.4 # .
# 6 10.8 7.8 1.7 0.8 # .
# 7 9.2 6.8 1.4 0.6 # .
# 8 10.0 6.8 1.5 0.4 #
# 9 8.8 5.8 1.4 0.4 #
# 10 9.8 6.2 1.5 0.2 # overlap, retained from bar
# 11 5.4 3.7 NA 0.2 # unique to bar
# 12 4.8 3.4 NA 0.2 # unique to bar
# 13 4.8 3.0 NA 0.1 # unique to bar
My intuition is to subset the data into two disjoint sets, and the set of intersecting elements in bar, then merge these, but I'm sure there is a more elegant solution!
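For what it's worth, spelling that intuition out in base R might look something like the sketch below (using the foo and bar defined above; the suffixes and the loop over the shared column names are just one way to do it, not necessarily the elegant solution I'm after):
foo2 <- foo; foo2$id <- rownames(foo)
bar2 <- bar; bar2$id <- rownames(bar)
# Outer join on the row-name column keeps all 13 rows; suffixes tag the shared columns
m <- merge(foo2, bar2, by = "id", all = TRUE, suffixes = c(".foo", ".bar"))
# For the columns present in both, prefer bar and fall back to foo where bar is NA
for (col in c("Sepal.Length", "Sepal.Width")) {
  m[[col]] <- ifelse(is.na(m[[paste0(col, ".bar")]]),
                     m[[paste0(col, ".foo")]],
                     m[[paste0(col, ".bar")]])
  m[paste0(col, c(".foo", ".bar"))] <- NULL
}
# Restore numeric row order and the column layout of the desired output
m <- m[order(as.numeric(m$id)), ]
rownames(m) <- m$id
m <- m[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]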

The package plyr is awesome for this sort of thing. Just do:
library(plyr)
foo$ID <- row.names(foo)
bar$ID <- row.names(bar)
foobar <- join(foo, bar, type = "full", by = "ID")
Joining by row.names didn't work, as Flodl noted in the comments, which is why I made a new "ID" column.

I see the glowing recommendation for plyr::join but do not see how it is much different from what base merge offers:
merge(foo, bar, by=c("Sepal.Length", "Sepal.Width"), all=TRUE)

Related

How to overwrite entries in a data frame by entries from a smaller dataframe?

I am trying to join two dataframes. The smaller is a subset of the larger, with updated values. I wish to keep all rows and columns in the larger dataframe, but overwrite values with the values in the smaller where the row ID and column correspond.
I can't see that any of the normal dplyr or base join operations (left, right, outer, inner) can easily achieve this. I am therefore looking for a join function/operation that does what I want.
df1 <- structure(list(
  ID = as.factor(c(1, 2, 5, 6)),
  Sepal.Width = c(4.5, 7, 3.2, 3.1),
  Petal.Length = c(1.8, 2.4, 3.3, 6.5),
  Petal.Width = c(1.2, 7.2, 3.2, 3.2)),
  row.names = c(NA, 4L), class = "data.frame")
df2 <- cbind(data.frame(ID = as.factor(1:10)), iris[1:10, 1:5])
A data.frame: 4 × 4
ID Sepal.Width Petal.Length Petal.Width
<fct> <dbl> <dbl> <dbl>
1 1 4.5 1.8 1.2
2 2 7.0 2.4 7.2
3 5 3.2 3.3 3.2
4 6 3.1 6.5 3.2
A data.frame: 10 × 6
ID Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<fct> <dbl> <dbl> <dbl> <dbl> <fct>
1 1 5.1 3.5 1.4 0.2 setosa
2 2 4.9 3.0 1.4 0.2 setosa
3 3 4.7 3.2 1.3 0.2 setosa
4 4 4.6 3.1 1.5 0.2 setosa
5 5 5.0 3.6 1.4 0.2 setosa
6 6 5.4 3.9 1.7 0.4 setosa
7 7 4.6 3.4 1.4 0.3 setosa
8 8 5.0 3.4 1.5 0.2 setosa
9 9 4.4 2.9 1.4 0.2 setosa
10 10 4.9 3.1 1.5 0.1 setosa
I want to merge these into one:
A data.frame: 10 × 6
ID Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<fct> <dbl> <dbl> <dbl> <dbl> <fct>
1 1 5.1 4.5 1.8 1.2 setosa #<-- Updated rows
2 2 4.9 7.0 2.4 7.2 setosa #<-- Updated rows
3 3 4.7 3.2 1.3 0.2 setosa
4 4 4.6 3.1 1.5 0.2 setosa
5 5 5.0 3.2 3.3 3.2 setosa #<-- Updated rows
6 6 5.4 3.1 6.5 3.2 setosa #<-- Updated rows
7 7 4.6 3.4 1.4 0.3 setosa
8 8 5.0 3.4 1.5 0.2 setosa
9 9 4.4 2.9 1.4 0.2 setosa
10 10 4.9 3.1 1.5 0.1 setosa
#              ↑           ↑            ↑
#              Updated columns
Have you tried the (relatively) new function rows_update from dplyr, which does this?
library(dplyr)
df2 %>% rows_update(df1, by = 'ID')
# ID Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 1 5.1 4.5 1.8 1.2 setosa
#2 2 4.9 7.0 2.4 7.2 setosa
#3 3 4.7 3.2 1.3 0.2 setosa
#4 4 4.6 3.1 1.5 0.2 setosa
#5 5 5.0 3.2 3.3 3.2 setosa
#6 6 5.4 3.1 6.5 3.2 setosa
#7 7 4.6 3.4 1.4 0.3 setosa
#8 8 5.0 3.4 1.5 0.2 setosa
#9 9 4.4 2.9 1.4 0.2 setosa
#10 10 4.9 3.1 1.5 0.1 setosa
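(A hedged aside, not part of the original answer: rows_update() errors by default if df1 contained an ID that is absent from df2; in that situation dplyr's rows_upsert() updates the matching rows and appends the unmatched ones instead.)
# Same call shape as rows_update; unmatched IDs in df1 would be added as new rows
df2 %>% rows_upsert(df1, by = 'ID')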
We can also use {powerjoin}:
library(powerjoin)
power_left_join(df2, df1, by = "ID", conflict = coalesce_yx)
#> ID Sepal.Length Species Sepal.Width Petal.Length Petal.Width
#> 1 1 5.1 setosa 4.5 1.8 1.2
#> 2 2 4.9 setosa 7.0 2.4 7.2
#> 3 3 4.7 setosa 3.2 1.3 0.2
#> 4 4 4.6 setosa 3.1 1.5 0.2
#> 5 5 5.0 setosa 3.2 3.3 3.2
#> 6 6 5.4 setosa 3.1 6.5 3.2
#> 7 7 4.6 setosa 3.4 1.4 0.3
#> 8 8 5.0 setosa 3.4 1.5 0.2
#> 9 9 4.4 setosa 2.9 1.4 0.2
#> 10 10 4.9 setosa 3.1 1.5 0.1
It moves the conflicted columns to the end, though.
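If the original column order matters, one option (just a sketch, re-selecting df2's column names after the join) is:
library(powerjoin)
library(dplyr)
# Resolve conflicts in favour of df1, then put the columns back in df2's order
power_left_join(df2, df1, by = "ID", conflict = coalesce_yx) %>%
  select(all_of(names(df2)))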

How to drop a column from a list?

Good afternoon:
Suppose I have the following list of dataframes:
[[4]]
[[4]]$L.1
Sepal.Length Sepal.Width Petal.Length Petal.Width v
1 5.1 3.5 1.4 0.2 1
5 5.0 3.6 1.4 0.2 1
6 5.4 3.9 1.7 0.4 1
11 5.4 3.7 1.5 0.2 1
16 5.7 4.4 1.5 0.4 1
19 5.7 3.8 1.7 0.3 1
20 5.1 3.8 1.5 0.3 1
21 5.4 3.4 1.7 0.2 1
[[4]]$L.2
Sepal.Length Sepal.Width Petal.Length Petal.Width v
2 4.9 3.0 1.4 0.2 2
3 4.7 3.2 1.3 0.2 2
4 4.6 3.1 1.5 0.2 2
7 4.6 3.4 1.4 0.3 2
8 5.0 3.4 1.5 0.2 2
9 4.4 2.9 1.4 0.2 2
10 4.9 3.1 1.5 0.1 2
12 4.8 3.4 1.6 0.2 2
13 4.8 3.0 1.4 0.1 2
[[4]]$L.3
Sepal.Length Sepal.Width Petal.Length Petal.Width v
15 5.8 4.0 1.2 0.2 3
17 5.4 3.9 1.3 0.4 3
136 7.7 3.0 6.1 2.3 3
My question is: how do I drop the column v?
I tried without success:
lapply(L, "[", -v)
Thank you in advance for your help!
Try this approach:
#Code
L <- lapply(L, function(x){x$v<-NULL;x})
Or with dplyr:
#Code 2
L <- lapply(L, function(x){x %>% dplyr::select(-v)})
L <- lapply(L, function(x) x[, -5]) # where 5 is the column number of v, dropped by position in each data frame
In base R, we can use setdiff
L1 <- lapply(L, function(x) x[setdiff(names(x), 'v')])
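A side note (my explanation, not from the answers above): the original attempt lapply(L, "[", -v) fails because v is not an object in the workspace; giving "[" the column position (or a quoted name via setdiff, as above) fixes it:
L1 <- lapply(L, "[", -5)   # same idea as the OP's attempt, with the column given by position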

Re-order rows of an R dataframe based on a column/label

My current dataframe in R has the following dimensions: nrow = 605, ncol = 1514.
The first column indicates the class/label, and my dataset has only two classes, namely setosa and iris.
test[1:5,]
class id1 id2...
1: setosa 2 4.....
2: setosa 2 5 .....
3: setosa 5 4 .....
4: iris 5 9......
5: iris 7 9 ....
However, the dataframe is currently ordered by class: rows 2-233 of my dataframe correspond to class setosa, and class iris runs from row 234 to the end. I want the dataset rearranged so that the samples are mixed up.
The expected output should be of the following form: if I take df[1:10, ], i.e. the first 10 lines of the dataframe, I should see samples of both iris and setosa. Any ideas or suggestions on how to do this?
library( tidyverse )
iris[1:10,]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
# 7 4.6 3.4 1.4 0.3 setosa
# 8 5.0 3.4 1.5 0.2 setosa
# 9 4.4 2.9 1.4 0.2 setosa
# 10 4.9 3.1 1.5 0.1 setosa
df <- iris %>%
  group_by( Species ) %>%
  mutate( id = row_number() ) %>%
  arrange( id ) %>%
  select( -id )
df[1:10,]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fct>
# 1 5.1 3.5 1.4 0.2 setosa
# 2 7 3.2 4.7 1.4 versicolor
# 3 6.3 3.3 6 2.5 virginica
# 4 4.9 3 1.4 0.2 setosa
# 5 6.4 3.2 4.5 1.5 versicolor
# 6 5.8 2.7 5.1 1.9 virginica
# 7 4.7 3.2 1.3 0.2 setosa
# 8 6.9 3.1 4.9 1.5 versicolor
# 9 7.1 3 5.9 2.1 virginica
# 10 4.6 3.1 1.5 0.2 setosa
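If the goal is simply to shuffle the rows at random rather than interleave the classes, a minimal base R sketch (assuming the data lives in df) would be:
set.seed(42)                 # only for reproducibility
df <- df[sample(nrow(df)), ]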

Selectively Remove Column Values in R Data Frame

Example
Suppose in the famous iris data set, I have determined that when Sepal.Length > 5.0, there was an error in my measurement device.
In this contrived example, I would like to keep the Sepal.Length column with its original value, but change the remaining columns to NA if the Sepal.Length > 5.0 for that row.
As an example, this:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Would become this:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 NA NA NA NA
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 NA 1.7 NA NA
I could certainly do this manually via vectorization, with something along the lines of:
iris$Sepal.Width <- ifelse(iris$Sepal.Length > 5.0, NA, iris$Sepal.Width)
In this approach however, I would need to manually specify every column.
Question
I strongly suspect there is a clever way to tackle this via either purrr or dplyr. Nevertheless, I've gotten myself down a pmap / modify_at rabbit hole. Any suggestions towards elegance would be much appreciated.
Thanks!
library(data.table)
dt <- copy(iris)
setDT(dt)
dt[Sepal.Length > 5.0, (which(!names(dt) == "Sepal.Length")) := NA]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1: 5.1 NA NA NA NA
# 2: 4.9 3.0 1.4 0.2 setosa
# 3: 4.7 3.2 1.3 0.2 setosa
# 4: 4.6 3.1 1.5 0.2 setosa
# 5: 5.0 3.6 1.4 0.2 setosa
# ---
# 146: 6.7 NA NA NA NA
# 147: 6.3 NA NA NA NA
# 148: 6.5 NA NA NA NA
# 149: 6.2 NA NA NA NA
# 150: 5.9 NA NA NA NA
An alternative would be simply to use this (it is only handy if you are interested in all the columns from the second one onwards):
iris[iris$Sepal.Length > 5.0, 2:ncol(iris)] <- NA
# And the output for first six rows
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 NA NA NA <NA>
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 NA NA NA <NA>
It sounds like this would work for you:
my_clip <- function(x, z) ifelse(z>5, NA, x)
iris %>% mutate_at(vars(-Sepal.Length), my_clip, z=.$Sepal.Length)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 NA NA NA NA
# 2 4.9 3.0 1.4 0.2 1
# 3 4.7 3.2 1.3 0.2 1
# 4 4.6 3.1 1.5 0.2 1
# 5 5.0 3.6 1.4 0.2 1
# 6 5.4 NA NA NA NA
We use mutate_at to grab all the columns we want to transform, and then, since you can't easily reference other columns inside the mutate_at function, we pass in the threshold column as a separate parameter using the .$ syntax.
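Note that mutate_at is now superseded; a roughly equivalent sketch with across() and replace() (my wording, assuming dplyr >= 1.0) would be:
library(dplyr)
# Inside across(), other columns of the data such as Sepal.Length are still visible
iris %>% mutate(across(-Sepal.Length, ~ replace(.x, Sepal.Length > 5, NA)))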
Since you asked for a purrr example, here goes, although I prefer the data.table answer already proposed:
library(purrr)
library(tidyr)
iris %>%
  nest(-Sepal.Length) %>%
  mutate(data = ifelse(Sepal.Length > 5.0,
                       map(data, function(x) x * NA),
                       data)) %>%
  unnest()
With magrittr you could do this :
library(magrittr)
iris %>% head %>% inset(.$Sepal.Length > 5,-1,NA)
or using base R instead of magrittr (same output, just an uglier function :) and you still need magrittr or dplyr for the pipes):
iris %>% head %>% `[<-`(.$Sepal.Length > 5,-1,NA)
-1 is the index of the column you want to keep, negated.
result
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 NA NA NA <NA>
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 NA NA NA <NA>

How to drop identical columns when combining data frames?

How can I remove identical columns when combining two data frames?
Consider the dummy example below:
data(iris)
iris2 <- iris
iris2[ 2:7, c(1,3,5)] <- NA
Xa <- cbind(iris, iris2)
head(Xa)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##1 5.1 3.5 1.4 0.2 setosa 5.1 3.5 1.4 0.2 setosa
##2 4.9 3.0 1.4 0.2 setosa NA 3.0 NA 0.2 <NA>
##3 4.7 3.2 1.3 0.2 setosa NA 3.2 NA 0.2 <NA>
##4 4.6 3.1 1.5 0.2 setosa NA 3.1 NA 0.2 <NA>
##5 5.0 3.6 1.4 0.2 setosa NA 3.6 NA 0.2 <NA>
##6 5.4 3.9 1.7 0.4 setosa NA 3.9 NA 0.4 <NA>
It is very easy to drop columns with the same name:
Xa <- Xa[ , !(duplicated(names(Xa)))]
head(Xa)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##1 5.1 3.5 1.4 0.2 setosa
##2 4.9 3.0 1.4 0.2 setosa
##3 4.7 3.2 1.3 0.2 setosa
##4 4.6 3.1 1.5 0.2 setosa
##5 5.0 3.6 1.4 0.2 setosa
##6 5.4 3.9 1.7 0.4 setosa
But not all of the dropped columns have the same contents. How can I drop only identical columns (same name and same contents) from a data frame?
The expected result is:
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length Petal.Length Species
## 1 5.1 3.5 1.4 0.2 setosa 5.1 1.4 setosa
## 2 4.9 3.0 1.4 0.2 setosa NA NA <NA>
## 3 4.7 3.2 1.3 0.2 setosa NA NA <NA>
## 4 4.6 3.1 1.5 0.2 setosa NA NA <NA>
## 5 5.0 3.6 1.4 0.2 setosa NA NA <NA>
## 6 5.4 3.9 1.7 0.4 setosa NA NA <NA>
You could do
Xa[!duplicated.default(Xa)]
# or
Xa[, !duplicated.default(Xa)]
# or, as mentioned by #akrun in a comment
Xa[!duplicated(c(Xa))]
Whichever way, the columns are renamed automatically (as data.frame() usually does) so that there are no longer any dupes among them.
We can't use vanilla duplicated here because it would dispatch to duplicated.data.frame, which compares rows to find duplicates, while duplicated.default compares the elements of a vector. A data.frame is a vector (a list) of pointers to column vectors, so that's why duplicated.default works in this case. duplicated(c(Xa)) or duplicated(as.list(Xa)) also work because they turn Xa from a data.frame into a plain list.
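A tiny illustration of that difference (my own toy example, not from the original answer):
d <- data.frame(a = 1:3, b = 1:3)
duplicated(d)          # row-wise: FALSE FALSE FALSE (no duplicated rows)
duplicated.default(d)  # over the underlying list of columns: FALSE TRUE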
Based on the accepted answer, I came up with a very simple function for this task:
rm.df.dupl <- function(x){
stopifnot(is.data.frame(x))
x <- x[ , !duplicated.default(x)]
return(x)
}
All you have to do now is:
rm.df.dupl(Xa)
