Finding Differences Between Two Dataframes in R

I have two dataframes: 1) an old dataframe (let's call it "df1") and 2) an updated dataframe ("df2"). I need to identify what has been added to or removed from df1 to create df2. So, I need a new dataframe with a new column identifying what rows should be added to or removed from df1 in order to get df2.
The two dataframes are of differing lengths, and Vessel_ID is the only unique identifier.
Here is a reproducible example:
df1 <- data.frame(Name = c('Vessel1', 'Vessel2', 'Vessel3', 'Vessel4', 'Vessel5'),
                  Vessel_ID = c('1', '2', '3', '4', '5'),
                  special_NO = c(10, 20, 30, 40, 50),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Name = c('Vessel1', 'x', 'y', 'Vessel3', 'x', 'Vessel6'),
                  Vessel_ID = c('1', '6', '7', '3', '5', '10'),
                  special_NO = NA,
                  stringsAsFactors = FALSE)
Ideally I would want an output like this:
df3
   Name Vessel_ID special_NO add_remove
Vessel2         2         20     remove
Vessel4         4         40     remove
Vessel6        10         NA        add
      x         6         NA        add
      y         7         NA        add
Also, if the Vessel_ID matches, I want to substitute the special_NO from df1 for NA in df2...but maybe that's for another question.
I tried adding a new column to both df1 and df2 to identify which dataframe each row originally belonged to, then merged the dataframes and used the duplicated() function. This seemed to work, but I still couldn't tell which rows to add and which to remove, and I got different results depending on whether I specified fromLast=TRUE or fromLast=FALSE.

An approach using bind_rows
library(dplyr)
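# Tag each frame's rows, stack them, and keep Vessel_IDs that appear in only one frame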
bind_rows(df1 %>% mutate(add_remove = "remove"),
          df2 %>% mutate(add_remove = "add")) %>%
  group_by(Vessel_ID) %>%
  filter(n() == 1) %>%
  ungroup()
# A tibble: 5 × 4
  Name    Vessel_ID special_NO add_remove
  <chr>   <chr>          <dbl> <chr>
1 Vessel2 2                 20 remove
2 Vessel4 4                 40 remove
3 x       6                 NA add
4 y       7                 NA add
5 Vessel6 10                NA add
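As for the side question of filling in special_NO where Vessel_ID matches, a minimal base R sketch (assuming Vessel_ID is unique within each frame):
# For each row of df2, find the matching row of df1 by Vessel_ID (NA if none)
idx <- match(df2$Vessel_ID, df1$Vessel_ID)
# Fill df2's missing special_NO values from df1 where a match exists
df2$special_NO <- ifelse(is.na(df2$special_NO), df1$special_NO[idx], df2$special_NO)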

Thanks for the comment! That looks like it would work too. Here's another solution a friend gave me, mostly in base R:
df1$old_new <- "old"
df2$old_new <- "new"
# Use dplyr's full_join() to join both data.frames on Name and Vessel_ID
df.comb <- dplyr::full_join(df1, df2, by = c("Name", "Vessel_ID"))
# If you want to go fully base, merge() gives the same result:
# df.comb <- merge(df1, df2, by = c("Name", "Vessel_ID"), all = TRUE, sort = FALSE)
# Create a new column that sets the 'status' of a row
# If old_new.x is NA, that row came from df2, so it is "new"
df.comb$status[is.na(df.comb$old_new.x)] <- "new"
# If old_new.x is not NA and old_new.y is NA then that row was in df1, but isn't in df2, so it has been "deleted"
df.comb$status[!is.na(df.comb$old_new.x) & is.na(df.comb$old_new.y)] <- "deleted"
# If old_new.x is not NA and old_new.y is not NA then that row was in both df1 and df2 = "same"
df.comb$status[!is.na(df.comb$old_new.x) & !is.na(df.comb$old_new.y)] <- "same"
# Only keep the columns you need. Note the join suffixes special_NO from both
# frames as special_NO.x / special_NO.y; take df1's version and rename it back
df.comb <- df.comb[, c("Name", "Vessel_ID", "special_NO.x", "status")]
names(df.comb)[3] <- "special_NO"
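For the example data the result should look roughly like this (note that, because the join is on both Name and Vessel_ID, df2's row x/5 does not match df1's Vessel5/5, so each shows up separately rather than as "same"):
     Name Vessel_ID special_NO  status
1 Vessel1         1         10    same
2 Vessel2         2         20 deleted
3 Vessel3         3         30    same
4 Vessel4         4         40 deleted
5 Vessel5         5         50 deleted
6       x         6         NA     new
7       y         7         NA     new
8       x         5         NA     new
9 Vessel6        10         NA     new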

Related

Merge two datasets using multiple column checks

I have two dataframes, with one ID variable in the first df ("ID") and three in the second df ("SIC", "Ur", "Sonst"). Now I am trying to merge these two datasets by checking whether the "ID" in the first df matches SIC, Ur, or Sonst in the respective row. Here is my reproducible example:
df1 <- data.frame(ID = c("A", "B", "C", "D"),
                  Value = c(1:4))
df2 <- data.frame(SIC = c("B", NA, NA, NA, NA, NA),
                  Ur = c(NA, "C", NA, NA, NA, NA),
                  Sonst = c(NA, NA, "A", NA, NA, NA),
                  Age = c(14:19))
Now I want a final df with only the IDs and all information from the first df (as it is the target df), plus the corresponding Age information if ID matches SIC, Ur, or Sonst. I have tried dplyr and merge() approaches but did not come up with a proper solution. I'm thankful for any suggestions.
An approach using dplyr's left_join with tidyr's unite
library(dplyr)
library(tidyr)
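# unite() collapses SIC:Sonst into a single ID column, dropping NAs (na.rm = TRUE)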
left_join(df1, df2 %>% unite("ID", SIC:Sonst, na.rm=T))
Joining with `by = join_by(ID)`
ID Value Age
1 A 1 16
2 B 2 14
3 C 3 15
4 D 4 NA
or an inner_join if you only want A, B and C to show up
inner_join(df1, df2 %>% unite("ID", SIC:Sonst, na.rm=T))
Joining with `by = join_by(ID)`
ID Value Age
1 A 1 16
2 B 2 14
3 C 3 15
Perhaps the most convenient way is the tidyverse approach nicely shown by Andre Wildberg. You can also do it using the base R merge() function, but in this case we first need a nested ifelse() to collect the non-missing values of the three columns from df2 into a single ID column:
df2$ID <- ifelse(!is.na(df2$SIC), df2$SIC,
                 ifelse(!is.na(df2$Ur), df2$Ur, df2$Sonst))
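If there were more identifier columns, a sketch that generalizes the nested ifelse() (still base R; takes the first non-NA value in each row):
id_cols <- c("SIC", "Ur", "Sonst")
# apply() over rows: keep the first non-NA entry, or NA if the whole row is NA
df2$ID <- apply(df2[id_cols], 1, function(r) r[!is.na(r)][1])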
Then merge the two dfs:
df3 <- merge(df1, df2, by= "ID", all.x = TRUE)
Discard unwanted columns from merged data (df3):
df3 <- df3[, c("ID", "Value", "Age")]
df3
ID Value Age
A 1 16
B 2 14
C 3 15
D 4 NA

How to join a data frame at the bottom of an existing one without rewriting the whole thing?

df1 <- data.frame(c(1, 2), c(3, 4))
colnames(df1) <- c("V1", "V2")
df2 <- data.frame(c(2, 3), c(5, 6))
colnames(df2) <- c("V1", "V2")
How can I add df2 at the bottom of df1 without using rbind, which requires rewriting the entire dataframe?
We can do an assignment by creating a new row in 'df1' after extracting the first column of 'df2'
df1[nrow(df1)+1, ] <- df2[[1]]
df1
# V1 V2
#1 1 3
#2 2 4
#3 2 3
NOTE: The OP originally showed a dataset 'df2' with a single column. It is assumed that the number of rows in that column equals the number of columns in 'df1'.
With the new dataset, we can use
df1[nrow(df1) + seq_len(nrow(df2)),] <- df2
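For the example data this appends both rows of df2 in place:
df1
#  V1 V2
#1  1  3
#2  2  4
#3  2  5
#4  3  6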

How to find indices of specific rows in a dataframe in R

I have a dataframe, A, which looks like this:
col1 col2 col3
  NL    6    9
  UK    5    5
  US    9    7
and I have a dataframe, B, consisting of a subset of the rows of the large dataframe looking like this:
col1 col2 col3
  NL    6    9
  UK    5    5
Now, I want to find the indices of the rows from B in A, so it should return 1 and 2. Does someone know how to do this?
EDIT
Next, I also want to find the indices of the rows in A when I have only the first two columns in B. So, in that case it should also return 1 and 2. Does anyone have an idea how to do this?
Generally, match gets the index. In our case, an approach is to paste the rows together and get the index with match:
match(do.call(paste, df2), do.call(paste, df1))
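For example, with data frames matching A and B above (a sketch assuming the columns are named col1, col2 and col3):
df1 <- data.frame(col1 = c("NL", "UK", "US"), col2 = c(6, 5, 9), col3 = c(9, 5, 7))
df2 <- df1[1:2, ]
match(do.call(paste, df2), do.call(paste, df1))
#[1] 1 2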
If only a subset of the columns share the same column names, get the vector of common column names with intersect, subset both datasets, paste, and get the index with match:
nm1 <- intersect(names(df1), names(df2))
match(do.call(paste, df2[nm1]), do.call(paste, df1[nm1]))
Another option is a join: create a row index in both datasets, join df2 (B) to df1 (A), and extract A's row index
library(dplyr)
df2 %>%
  mutate(rn = row_number()) %>%
  left_join(df1 %>%   # join B's rows against A (df1), not df2 again
              mutate(rn = row_number()),
            by = c('col1', 'col2', 'col3')) %>%
  pull(rn.y)

Create data frame, matching based on first elements of a list

I want to create a data frame based on the first element of a list. Specifically, I have
One vector containing variable names (names1);
One list whose elements each contain two variables (some names and the values);
And the end product should be a data.frame with "names1" that contains as many lines as cases that match.
If there is no match between a specific list element and the vector, it should be NA.
The values can also be factors or strings.
names1 <- c("a", "b", "c")
dat1 <- data.frame(names1 = c("a", "b", "c", "f"), values = c("val1", 13, 11, 0))
dat1$values <- as.factor(dat1$values)
dat2 <- data.frame(names1 = c("a", "b", "x"), values = c(12, 10, 2))
dat2$values <- as.factor(dat2$values)
list1 <- list(dat1, dat2)
The result should be a new data frame with the "names1" variables as columns and one row per list element, holding the values that match:
   a  b  c
val1 13 11
  12 10 NA
One option would be to loop through the list ('list1'), filter the 'names1' column based on the 'names1' vector, convert it to a single dataset while creating an identification column with .id, spread from 'long' to 'wide' and remove the 'grp' column
library(tidyverse)
map_df(list1, ~ .x %>%
         filter(names1 %in% !!names1), .id = 'grp') %>%
  spread(names1, values) %>%
  select(-grp)
# a b c
#1 25 13 11
#2 12 10 NA
Or another option is to bind the datasets together with bind_rows, create a grouping id 'grp' to identify the list element, filter the rows by keeping only 'names1' values that match the 'names1' vector, and spread from 'long' to 'wide'
bind_rows(list1, .id = 'grp') %>%
  filter(names1 %in% !!names1) %>%
  spread(names1, values)
NOTE: It is better not to use base function names (such as names) when naming objects. Also, to avoid confusion, the vector should ideally be named differently from the dataframe column; here both are called names1, which is why !! (or .GlobalEnv$) is needed to force the vector rather than the column.
It can also be done with only base R. Create a group identifier with Map, rbind the list elements into a single dataset, subset the rows by keeping only the values from the 'names1' vector, and reshape from 'long' to 'wide'
df1 <- subset(do.call(rbind, Map(cbind, list1, ind = seq_along(list1))),
              names1 %in% .GlobalEnv$names1)
reshape(df1, idvar = 'ind', direction = 'wide', timevar = 'names1')[-1]
A mix of base R and dplyr. For every list element we create a dataframe with one row, row-bind them together with dplyr's bind_rows (the older rbind_list is defunct), and then subset only the columns we need using names1.
library(dplyr)
bind_rows(lapply(list1, function(x)
  setNames(data.frame(t(x$values)), x$names1)))[names1]
# a b c
# <dbl> <dbl> <dbl>
#1 25 13 11
#2 12 10 NA
Output without subset looks like this
bind_rows(lapply(list1, function(x) setNames(data.frame(t(x$values)), x$names1)))
# a b c x
# <dbl> <dbl> <dbl> <dbl>
#1 25 13 11 NA
#2 12 10 NA 2
In base R
t(sapply(list1, function(x) setNames(x$values[match(names1, x$names1)], names1)))
# a b c
# [1,] 25 13 11
# [2,] 12 10 NA
Using base R only
body <- do.call('rbind', lapply(list1, function(list.element) {
  element.vals <- list.element[['values']]
  element.names <- list.element[['names1']]
  names(element.vals) <- element.names
  return.vals <- element.vals[names1]
  if (all(is.na(return.vals))) NULL else return.vals
}))
df <- as.data.frame(body)
names(df) <- names1
df
For the sake of completeness, here is a data.table approach using dcast() and rowid():
library(data.table)
nam <- names1 # avoid name conflict with column name
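# Stack the list, keep only known names, cast to one row per list element
# with rowid(), then drop the rowid helper column that dcast() produces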
rbindlist(list1)[names1 %in% nam, dcast(.SD, rowid(names1) ~ names1)][, names1 := NULL][]
a b c
1: val1 13 11
2: 12 10 <NA>
Or, more concisely, pick columns after reshaping:
library(data.table)
rbindlist(list1)[, dcast(.SD, rowid(names1) ~ names1)][, .SD, .SDcols = names1]

Merging two data.frames by key column

I have two dataframes. In the first one, I have a KEY/ID column and two variables:
KEY V1 V2
1 10 2
2 20 4
3 30 6
4 40 8
5 50 10
In the second dataframe, I have a KEY/ID column and a third variable
KEY V3
1 5
2 10
3 20
I would like to extract the rows of the first dataframe that are also in the second dataframe by matching them according to the KEY column. I would also like to add the V3 column to final dataset.
KEY V1 V2 V3
1 10 2 5
2 20 4 10
3 30 6 20
These are my attempts using the subset() and merge() functions:
subset(data1, data1$KEY == data2$KEY)
merge(data1, data2, by.x = "KEY", by.y = "KEY")
Neither of them does the task.
Any hint would be appreciated. Thank you!
merge(data1, data2, by="KEY") should do it!
If what you want is an inner join, then your merge() attempt should do it. If it doesn't, check the classes of the KEY columns in both tables using class(data1$KEY) and class(data2$KEY).
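As an aside, the subset() attempt fails because == compares the two KEY vectors element-wise with recycling. A working base R filter would be the following sketch (note it keeps only data1's columns, without V3):
# Keep rows of data1 whose KEY appears anywhere in data2$KEY
subset(data1, KEY %in% data2$KEY)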
Apart from these and the merge() suggested by Christian, you can use:
library(plyr)
join(data1, data2, by="KEY", type="inner")
or
library(data.table)
setkey(data1, KEY)
setkey(data2, KEY)
data1[data2]  # keyed join: for each row of data2, the matching row of data1
You could use a dplyr *_join. Given the sample data, both of the following would give the same result:
library(dplyr)
df_merged <- inner_join(data1, data2, by = 'KEY')
df_merged <- right_join(data1, data2, by = 'KEY')
An inner_join returns all rows from data1 where there are matching values in data2, and all columns from both.
A right_join returns all rows from data2, and all columns from both. They coincide here because every KEY in data2 also appears in data1; if data2 had a KEY missing from data1, right_join would keep that row (with NA for V1 and V2) while inner_join would drop it.
