Left join Merge data.table [duplicate] - r

This question already has answers here:
merge data.frame but keep only unique columns?
(2 answers)
Closed 5 years ago.
I want to do a left join of two data.frames in R, using the data.table library. What I have:
library(data.table)
id<-c("a1","a2","a3","a4")
id2<-c("a2","a3","a1","a4")
y<-c(1,2,3,4)
z<-c(3,5,6,7)
k<-c(1,3,8,7)
df1<-data.table(id,y,z)
id<-c("a2","a3","a1","a4")
df2<-data.table(id,k,y)
I want the result to be a new data.table produced by a LEFT JOIN, that is:
result--> id,x,y,z
I use this as a guide:
https://rstudio-pubs-static.s3.amazonaws.com/52230_5ae0d25125b544caab32f75f0360e775.html
merge(df1,df2,by="id",all.x=TRUE)
But this returns:
id y.x z x y.y
1: a1 1 3 3 3
2: a2 2 5 0 1
3: a3 3 6 2 2
4: a4 4 7 1 4
The problem is that column y is duplicated, and I want it to appear only once.
I have tried all=FALSE, all.x=T, ... but I can't get what I want.
I have also tried other solutions, such as the one proposed in: left join in data.table
setkey(df1,id)
setkey(df2,id)
df1[df2]
But this again duplicates the y column.
id y z k i.y
1: a1 1 3 8 3
2: a2 2 5 1 1
3: a3 3 6 3 2
4: a4 4 7 7 4
How can I do it?

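You can avoid the duplicated column by dropping y from one of the tables before merging. Try dplyr::left_join(df1, df2[, -c("y")], by = "id") or merge(df1, df2[, -c("y")], by = "id", all.x = TRUE). Note that all.x = TRUE is what makes merge a left join; without it merge performs an inner join.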
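A minimal sketch of this idea with the question's data. The `!"y"` form is data.table's column-exclusion syntax in j, equivalent to `-c("y")`. The update join at the end is an alternative that the answer does not mention; it is shown only as an assumption about what may also fit, and `res`/`res2` are illustrative names.

```r
library(data.table)

df1 <- data.table(id = c("a1", "a2", "a3", "a4"), y = c(1, 2, 3, 4), z = c(3, 5, 6, 7))
df2 <- data.table(id = c("a2", "a3", "a1", "a4"), k = c(1, 3, 8, 7), y = c(1, 2, 3, 4))

# Left join after dropping the clashing column y from the right-hand table
res <- merge(df1, df2[, !"y"], by = "id", all.x = TRUE)

# Alternative (not from the answer): update join that adds only k to a copy of df1
res2 <- copy(df1)
res2[df2, on = "id", k := i.k]

print(res)
```

Both results carry the columns id, y, z, k with y appearing once.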

Related

In data.table in R, how can we create an sequenced indicator variable by the values of two columns? [duplicate]

This question already has answers here:
data.table "key indices" or "group counter"
(2 answers)
Create a new data frame column based on the values of two other columns
(2 answers)
Closed 4 years ago.
In the data.table package in R, for a given data table, I am wondering how an indicator index can be created for the values that are the same in two columns. For example, for the following data table,
> M <- data.table(matrix(c(2,2,2,2,2,2,2,5,2,5,3,3,3,6), ncol = 2, byrow = T))
> M
V1 V2
1: 2 2
2: 2 2
3: 2 2
4: 2 5
5: 2 5
6: 3 3
7: 3 6
I would like to create a new column that essentially orders the values that are the same for each row of the two columns, so that I can get something like:
> M
V1 V2 Index
1: 2 2 1
2: 2 2 1
3: 2 2 1
4: 2 5 2
5: 2 5 2
6: 3 3 3
7: 3 6 4
I essentially would like to repeat values of .N above, is there a nice way to do it?
We can use .GRP after grouping by 'V1' and 'V2'
M[, Index := .GRP, .(V1, V2)]
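As a self-contained sketch with the question's data: `.GRP` is the integer group counter that data.table exposes inside j, and with `by` the groups are numbered in order of first appearance.

```r
library(data.table)

M <- data.table(matrix(c(2, 2, 2, 2, 2, 2, 2, 5, 2, 5, 3, 3, 3, 6),
                       ncol = 2, byrow = TRUE))

# .GRP is 1 for the first (V1, V2) combination encountered, 2 for the next, ...
M[, Index := .GRP, by = .(V1, V2)]

print(M)
```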

join/merge data frames in R [duplicate]

This question already has answers here:
Merge dataframes of different sizes
(4 answers)
Left join using data.table
(3 answers)
Closed 5 years ago.
I would like to join similar data frames:
input:
x <- data_frame(a=c(1,2,3,4),b=c(4,5,6,7),c=c(1,NA,NA,NA))
y <- data_frame(a=c(2,3),b=c(5,6),c=c(1,2))
desired output:
z <- data_frame(a=c(1,2,3,4),b=c(4,5,6,7),c=c(1,1,2,NA))
I tried
x <- data_frame(a=c(1,2,3,4),b=c(4,5,6,7),c=c(1,NA,NA,NA))
y <- data_frame(a=c(2,3),b=c(5,6),c=c(1,2))
z <- merge(x,y, all=TRUE)
but it has one inconvenience:
a b c
1 1 4 1
2 2 5 1
3 2 5 NA
4 3 6 2
5 3 6 NA
6 4 7 NA
It duplicates the rows that match between the two data frames. Is there a way to get the desired output without having to delete the unwanted rows afterwards?
EDIT
I cannot simply delete the rows with NA: the x data frame contains rows with NA that are not in the y data frame. If I did that, I would delete the 4th row of x (4 7 NA).
Thanks for help
You can use an update join with the data.table package:
# load the package and convert the data frames to data.tables
library(data.table)
setDT(x)
setDT(y)
# update join
x[y, on = .(a, b), c := i.c][]
which gives:
a b c
1: 1 4 1
2: 2 5 1
3: 3 6 2
4: 4 7 NA
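The same update join as a self-contained sketch, building the inputs directly with data.table() instead of converting them with setDT (an assumption made only to keep the example short). Note that `:=` modifies x by reference and overwrites c on every matched row, which is harmless here because the matched rows hold NA.

```r
library(data.table)

x <- data.table(a = c(1, 2, 3, 4), b = c(4, 5, 6, 7), c = c(1, NA, NA, NA))
y <- data.table(a = c(2, 3), b = c(5, 6), c = c(1, 2))

# For rows of x that match y on (a, b), take c from y (i.c); other rows are untouched
x[y, on = .(a, b), c := i.c]

print(x)
```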

Update columns in one data.frame using another dataframe with columns of the same name

I have a dataframe, D1, with these columns:
a b c
3 4 2
2 1 2
2 0 3
and another, D2, with these columns
b c
2 1
3 2
4 4
I want to build another dataframe with all D2 columns, and D1 columns that are not in D2. I mean, D3 would be this:
a b c
3 2 1
2 3 2
2 4 4
There are a lot of columns. Is it possible to build D3 without explicitly referencing them?
We can use setdiff to find the columns of D1 that are not in D2 and bind them to D2:
cbind(D1[setdiff(names(D1), names(D2))], D2)
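A runnable sketch of this with the question's D1 and D2 rebuilt as data frames (the values are copied from the tables in the question):

```r
D1 <- data.frame(a = c(3, 2, 2), b = c(4, 1, 0), c = c(2, 2, 3))
D2 <- data.frame(b = c(2, 3, 4), c = c(1, 2, 4))

# Columns that exist only in D1
only_in_D1 <- setdiff(names(D1), names(D2))

# Keep those columns from D1 and take everything else from D2
D3 <- cbind(D1[only_in_D1], D2)

print(D3)
```

This relies on D1 and D2 having the same number of rows in the same order, since cbind pairs rows by position.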

Create a new dataframe according to the contrast between two similar df [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 5 years ago.
I have a dataframe made like this:
X Y Z T
1 2 4 2
3 2 1 4
7 5 NA 3
After several steps (it does not matter which), I obtained this df:
X Y Z T
1 2 4 2
3 2 NA 4
7 5 NA 3
I want to obtain a new dataframe containing only the rows that didn't change during those steps; the result would be:
X Y Z T
1 2 4 2
7 5 NA 3
How can I do this?
One option with base R would be to paste the rows of each dataset together and compare (==) to create a logical vector, which we use for subsetting the old dataset (the unchanged rows are identical in both)
dfO[do.call(paste, dfO) == do.call(paste, df),]
# X Y Z T
#1 1 2 4 2
#3 7 5 NA 3
where 'dfO' is the old dataset and 'df' is the new
You can use dplyr's intersect function:
library(dplyr)
intersect(d1, d2)
# X Y Z T
#1 1 2 4 2
#2 7 5 NA 3
This is a data.frame-equivalent of base R's intersect function.
In case you're working with data.tables, that package also provides such a function:
library(data.table)
setDT(d1)
setDT(d2)
fintersect(d1, d2)
# X Y Z T
#1: 1 2 4 2
#2: 7 5 NA 3
Another dplyr solution: semi_join.
dt1 %>% semi_join(dt2, by = colnames(.))
X Y Z T
1 1 2 4 2
2 7 5 NA 3
Data
dt1 <- read.table(text = "X Y Z T
1 2 4 2
3 2 1 4
7 5 NA 3",
header = TRUE, stringsAsFactors = FALSE)
dt2 <- read.table(text = " X Y Z T
1 2 4 2
3 2 NA 4
7 5 NA 3",
header = TRUE, stringsAsFactors = FALSE)
I am afraid that neither semi_join nor intersect nor merge is the correct answer. merge and intersect will not handle duplicate rows properly, and semi_join will change the order of the rows.
From this perspective, I think the only correct one so far is akrun's.
You could also do something like:
df1[rowSums(((df1 == df2) | (is.na(df1) & is.na(df2))), na.rm = TRUE) == ncol(df1),]
But I think akrun's way is more elegant and likely to perform better in terms of speed.
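The two position-wise comparisons discussed above (the paste approach and the rowSums approach) can be sketched side by side on the question's data. The names `old`, `new`, `keep_paste`, and `keep_cells` are illustrative, and both approaches assume the two data frames have the same rows in the same order.

```r
old <- read.table(text = "X Y Z T
1 2 4 2
3 2 1 4
7 5 NA 3", header = TRUE)

new <- read.table(text = "X Y Z T
1 2 4 2
3 2 NA 4
7 5 NA 3", header = TRUE)

# Approach 1: paste each row into one string and compare the strings
keep_paste <- do.call(paste, old) == do.call(paste, new)

# Approach 2: compare cell by cell, treating NA in both tables as equal
same_cell <- (old == new) | (is.na(old) & is.na(new))
keep_cells <- rowSums(same_cell, na.rm = TRUE) == ncol(old)

unchanged <- old[keep_paste, ]
print(unchanged)
```

Both logical vectors select rows 1 and 3, so the row names of the original data frame are preserved in the result.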

Delete Duplicates when Merging DF [duplicate]

This question already has answers here:
Select only the first row when merging data frames with multiple matches
(4 answers)
Closed 5 years ago.
I know, I know.... Another merging Df question, please hear me out as I have searched SO for an answer on this but none has come.
I am merging two data frames, one smaller than the other, with a left merge, to match the longer DF to the smaller one.
This works well except for one issue: rows get added to the left (smaller) df when the right (longer) df has duplicates.
An Example:
Row<-c("a","b","c","d","e")
Data<-(1:5)
df1<-data.frame(Row,Data)
Row2<-c("a","b","b","c","d","e","f","g","h")
Data2<-(1:9)
df2<-data.frame(Row2,Data2)
names(df2)<-c("Row","Data2")
DATA<-merge(x = df1, y = df2, by = "Row", all.x = TRUE)
>DATA
Row Data Data2
1 a 1 1
2 b 2 2
3 b 2 3
4 c 3 4
5 d 4 5
6 e 5 6
See the extra "b" row?, that is what I want to get rid of, I want to keep the left DF, but very strictly, as in if there are 5 rows in DF1, when merged I want there to only be 5 rows.
Like this...
Row Data Data2
1 a 1 1
2 b 2 2
3 c 3 4
4 d 4 5
5 e 5 6
Where it only takes the first match and moves on.
I realize the merge function is only doing its job here, so is there another way to get my expected result? Or is there a post-merge modification that should be done instead?
Thank you for your help and time.
Research:
How to join (merge) data frames (inner, outer, left, right)?
deleting duplicates
Merging two data frames with different sizes and missing values
We can use the duplicated function as follows:
DATA[!duplicated(DATA$Row),]
Row Data Data2
1 a 1 1
2 b 2 2
4 c 3 4
5 d 4 5
6 e 5 6
It is also possible to drop the duplicated keys from df2 before merging:
merge(x = df1, y = df2[!duplicated(df2$Row),], by = "Row", all.x = TRUE)
# Row Data Data2
#1 a 1 1
#2 b 2 2
#3 c 3 4
#4 d 4 5
#5 e 5 6
Since you only want the first row and don't care what variables are chosen, then you can use this code (before you merge):
Row2<-c("a","b","b","c","d","e","f","g","h")
Data2<-(1:9)
df2<-data.frame(Row2,Data2)
library(dplyr)
df2 %>%
group_by(Row2) %>%
slice(1)
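To carry that through to the merge itself, here is a hedged sketch in dplyr. It swaps in distinct(..., .keep_all = TRUE), which keeps the first row per key, in place of group_by plus slice(1), and it assumes the key column in df2 has already been renamed to Row as in the question. `df2_first` and `result` are illustrative names.

```r
library(dplyr)

df1 <- data.frame(Row = c("a", "b", "c", "d", "e"), Data = 1:5,
                  stringsAsFactors = FALSE)
df2 <- data.frame(Row = c("a", "b", "b", "c", "d", "e", "f", "g", "h"), Data2 = 1:9,
                  stringsAsFactors = FALSE)

# Keep only the first row for each key in the longer table
df2_first <- df2 %>% distinct(Row, .keep_all = TRUE)

# The left join now cannot add rows to df1, because each key matches at most once
result <- left_join(df1, df2_first, by = "Row")

print(result)
```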
