join/merge data frames in R [duplicate] - r

This question already has answers here:
Merge dataframes of different sizes
(4 answers)
Left join using data.table
(3 answers)
Closed 5 years ago.
I would like to join similar data frames:
input:
x <- data_frame(a=c(1,2,3,4),b=c(4,5,6,7),c=c(1,NA,NA,NA))
y <- data_frame(a=c(2,3),b=c(5,6),c=c(1,2))
desired output:
z <- data_frame(a=c(1,2,3,4),b=c(4,5,6,7),c=c(1,1,2,NA))
I tried
x <- data_frame(a=c(1,2,3,4),b=c(4,5,6,7),c=c(1,NA,NA,NA))
y <- data_frame(a=c(2,3),b=c(5,6),c=c(1,2))
z <- merge(x,y, all=TRUE)
but it has one inconvenience:
a b c
1 1 4 1
2 2 5 1
3 2 5 NA
4 3 6 2
5 3 6 NA
6 4 7 NA
It duplicates the rows that appear in both data frames. Is there a way to get the desired output without having to delete the unwanted rows afterwards?
EDIT
I cannot simply delete the rows with NA: the x data frame contains rows with NA that have no match in the y data frame. If I did that, I would also delete the 4th row of x (4 7 NA).
Thanks for help

You can use an update join with the data.table package:
# load the package and convert the data frames to data.tables
library(data.table)
setDT(x)
setDT(y)
# update join
x[y, on = .(a, b), c := i.c][]
which gives:
a b c
1: 1 4 1
2: 2 5 1
3: 3 6 2
4: 4 7 NA
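For readers who want to stay in base R, a comparable result can be sketched with a left join on the key columns followed by a fill step. This is only an illustration: it assumes a and b together identify a row, it uses plain data.frame instead of data_frame, and the helper column name c.y comes from the suffixes argument chosen here. Unlike the update join above, it only fills the NA values in x's c column and does not overwrite values that are already present.

```r
x <- data.frame(a = c(1, 2, 3, 4), b = c(4, 5, 6, 7), c = c(1, NA, NA, NA))
y <- data.frame(a = c(2, 3), b = c(5, 6), c = c(1, 2))

# Left join on the key columns only, so every row of x appears exactly once.
# y's c column arrives as the helper column c.y (assumed name, set via suffixes).
z <- merge(x, y, by = c("a", "b"), all.x = TRUE, suffixes = c("", ".y"))

# Where x had no value for c, take the one from y; then drop the helper column.
z$c <- ifelse(is.na(z$c), z$c.y, z$c)
z$c.y <- NULL
z
```

Joining on a and b only (instead of on all shared columns, which is what merge(x, y, all = TRUE) does) is what avoids the doubled rows.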

Related

Repeating rows in data frame by using the content of a column in R [duplicate]

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 2 years ago.
I want to create a data frame by repeating each row the number of times given in one of its columns. Below is the source data frame.
data.frame(c("a","b","c"), c(4,5,6), c(2,2,3)) -> df
colnames(df) <- c("sample", "measurement", "repeat")
df
sample measurement repeat
1 a 4 2
2 b 5 2
3 c 6 3
I want to repeat the rows according to the values in the "repeat" column to get a data frame like the one below. Ideally, I would like to have a function to do this.
sample measurement repeat
1 a 4 2
2 a 4 2
3 b 5 2
4 b 5 2
5 c 6 3
6 c 6 3
7 c 6 3
Thanks in advance!
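Solved. df[rep(rownames(df), df[["repeat"]]), ] did the job. (repeat is a reserved word in R, so the column has to be accessed as df[["repeat"]] or df$`repeat` rather than df$repeat.)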
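Since the question asks for a function, the one-liner can be wrapped as in the sketch below. The function name repeat_rows and its arguments are made up for illustration. Because repeat is a reserved word in R, the sketch looks the column up by name with [[ ]].

```r
# Hypothetical helper: repeat each row of `data` as many times as the
# value stored in the column named by `times_col`.
repeat_rows <- function(data, times_col) {
  idx <- rep(seq_len(nrow(data)), times = data[[times_col]])
  out <- data[idx, , drop = FALSE]
  rownames(out) <- NULL  # renumber rows 1..n instead of 1, 1.1, 2, ...
  out
}

df <- data.frame(sample = c("a", "b", "c"),
                 measurement = c(4, 5, 6),
                 `repeat` = c(2, 2, 3),
                 check.names = FALSE)  # keep the column name "repeat" as is

result <- repeat_rows(df, "repeat")
result
```

Indexing by row position rather than by row name keeps the helper independent of whatever row names the input happens to have.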

Select column dynamically based on value from another column in R [duplicate]

This question already has answers here:
Select values from different columns based on a variable containing column names [duplicate]
(2 answers)
Closed 4 years ago.
How do I use the values of one column in a data.table as column names, so that each row fetches its value from the column it names?
library(data.table)
a = c(2,3,5)
b = c(5,7,7)
c = c(1,2,3)
x = c('a','b','c')
dt <- data.table(a,b,c,x)
> dt
a b c x
1: 2 5 1 a
2: 3 7 2 b
3: 5 7 3 c
The output I want has a column y whose value in each row is taken from the column named in x.
dt
a b c x y
1: 2 5 1 a 2
2: 3 7 2 b 7
3: 5 7 3 c 3
I tried
dt[,get(x)]
dt[,match(x,colnames(dt))]
Loop over the rows by grouping on the row sequence, extract each value with get, and assign the result to create 'y':
dt[, y := .SD[, get(x), seq_len(.N)]$V1]
dt
# a b c x y
#1: 2 5 1 a 2
#2: 3 7 2 b 7
#3: 5 7 3 c 3
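A loop-free variant can be sketched with matrix indexing: convert the value columns to a matrix and pick one cell per row using a two-column (row, column) index. This is an alternative to the answer above, not a restatement of it, and it assumes the value columns all share one type (numeric here). The names value_cols and m are introduced only for this illustration.

```r
library(data.table)

dt <- data.table(a = c(2, 3, 5), b = c(5, 7, 7), c = c(1, 2, 3),
                 x = c("a", "b", "c"))

# Columns that x is allowed to point at (assumed to be all numeric)
value_cols <- c("a", "b", "c")
m <- as.matrix(dt[, value_cols, with = FALSE])

# For row i, take the cell in the column whose name is stored in x[i]
dt[, y := m[cbind(seq_len(.N), match(x, value_cols))]]
dt
```

match(x, value_cols) turns each column name into a column position, and cbind pairs it with the row number, so the matrix is indexed cell by cell.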

In data.table in R, how can we create a sequenced indicator variable by the values of two columns? [duplicate]

This question already has answers here:
data.table "key indices" or "group counter"
(2 answers)
Create a new data frame column based on the values of two other columns
(2 answers)
Closed 4 years ago.
In the data.table package in R, for a given data table, I am wondering how an indicator index can be created for the values that are the same in two columns. For example, for the following data table,
> M <- data.table(matrix(c(2,2,2,2,2,2,2,5,2,5,3,3,3,6), ncol = 2, byrow = TRUE))
> M
V1 V2
1: 2 2
2: 2 2
3: 2 2
4: 2 5
5: 2 5
6: 3 3
7: 3 6
I would like to create a new column that essentially orders the values that are the same for each row of the two columns, so that I can get something like:
> M
V1 V2 Index
1: 2 2 1
2: 2 2 1
3: 2 2 1
4: 2 5 2
5: 2 5 2
6: 3 3 3
7: 3 6 4
I essentially would like to repeat values of .N above. Is there a nice way to do it?
We can use .GRP after grouping by 'V1' and 'V2'
M[, Index := .GRP, .(V1, V2)]
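As a self-contained sketch of the same idea, the snippet below rebuilds M from the question and adds the counter. .GRP is data.table's group counter; with by, groups are numbered in order of first appearance, which matches the desired output here.

```r
library(data.table)

M <- data.table(matrix(c(2, 2, 2, 2, 2, 2, 2, 5, 2, 5, 3, 3, 3, 6),
                       ncol = 2, byrow = TRUE))

# .GRP is 1 for the first (V1, V2) combination encountered, 2 for the next, ...
M[, Index := .GRP, by = .(V1, V2)]
M
```

If the same (V1, V2) pair reappeared further down the table, it would receive the index it was given the first time, not a new one.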

Left join Merge data.table [duplicate]

This question already has answers here:
merge data.frame but keep only unique columns?
(2 answers)
Closed 5 years ago.
I want to do a left join of two data.frames in R, using the data.table library. What I have:
library(data.table)
id<-c("a1","a2","a3","a4")
id2<-c("a2","a3","a1","a4")
y<-c(1,2,3,4)
z<-c(3,5,6,7)
k<-c(1,3,8,7)
df1<-data.table(id,y,z)
id<-c("a2","a3","a1","a4")
df2<-data.table(id,k,y)
I want the result to be a new data.table that is the result of a LEFT JOIN, that is:
result--> id,x,y,z
I use this as a guide:
https://rstudio-pubs-static.s3.amazonaws.com/52230_5ae0d25125b544caab32f75f0360e775.html
merge(df1,df2,by="id",all.x=TRUE)
But this returns:
id y.x z x y.y
1: a1 1 3 3 3
2: a2 2 5 0 1
3: a3 3 6 2 2
4: a4 4 7 1 4
The problem with this is that column y is duplicated, and I want it to appear only once.
I have tried all=FALSE, all.x=T, ... but I can't achieve what I want.
I have also tried other solutions, such as the one proposed in: left join in data.table
setkey(df1,id)
setkey(df2,id)
df1[df2]
But this again duplicates the y column.
id y z k i.y
1: a1 1 3 8 3
2: a2 2 5 1 1
3: a3 3 6 3 2
4: a4 4 7 7 4
How can I do it?
You can merge df1 and df2 after removing column y from one of the tables. Try dplyr::left_join(df1, df2[, -c("y")], by = "id") or merge(df1, df2[, -c("y")], by = "id", all.x = TRUE). Without all.x = TRUE, merge performs an inner join.
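A runnable sketch of this suggestion, using the data from the question, is shown below. The second variant is an update join that adds only column k to a copy of df1 by reference; it is offered as an alternative under the assumption that k is the only column wanted from df2. The names result and result2 are illustrative.

```r
library(data.table)

df1 <- data.table(id = c("a1", "a2", "a3", "a4"),
                  y = c(1, 2, 3, 4), z = c(3, 5, 6, 7))
df2 <- data.table(id = c("a2", "a3", "a1", "a4"),
                  k = c(1, 3, 8, 7), y = c(1, 2, 3, 4))

# Variant 1: drop y from df2 before merging, and keep all rows of df1
result <- merge(df1, df2[, -c("y")], by = "id", all.x = TRUE)

# Variant 2: update join on a copy of df1, pulling in only column k
result2 <- copy(df1)[df2, on = "id", k := i.k]

result
```

Both variants keep df1's own y column, so no y.x / y.y or i.y duplicates appear.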

Create a new dataframe according to the contrast between two similar df [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 5 years ago.
I have a dataframe made like this:
X Y Z T
1 2 4 2
3 2 1 4
7 5 NA 3
After several steps (it doesn't matter which), I obtained this df:
X Y Z T
1 2 4 2
3 2 NA 4
7 5 NA 3
I want to obtain a new data frame containing only the rows that did not change during those steps; the result would be:
X Y Z T
1 2 4 2
7 5 NA 3
How can I do this?
One option with base R is to paste the rows of each dataset together and compare them (==) to create a logical vector, which we then use to subset the old dataset:
dfO[do.call(paste, dfO) == do.call(paste, df),]
# X Y Z T
#1 1 2 4 2
#3 7 5 NA 3
where 'dfO' is the old dataset and 'df' is the new one.
You can use dplyr's intersect function:
library(dplyr)
intersect(d1, d2)
# X Y Z T
#1 1 2 4 2
#2 7 5 NA 3
This is a data.frame-equivalent of base R's intersect function.
In case you're working with data.tables, that package also provides such a function:
library(data.table)
setDT(d1)
setDT(d2)
fintersect(d1, d2)
# X Y Z T
#1: 1 2 4 2
#2: 7 5 NA 3
Another dplyr solution: semi_join.
dt1 %>% semi_join(dt2, by = colnames(.))
X Y Z T
1 1 2 4 2
2 7 5 NA 3
Data
dt1 <- read.table(text = "X Y Z T
1 2 4 2
3 2 1 4
7 5 NA 3",
header = TRUE, stringsAsFactors = FALSE)
dt2 <- read.table(text = " X Y Z T
1 2 4 2
3 2 NA 4
7 5 NA 3",
header = TRUE, stringsAsFactors = FALSE)
I am afraid that none of semi_join, intersect, or merge is the correct answer. merge and intersect will not handle duplicate rows properly, and semi_join will change the order of the rows.
From this perspective, I think the only correct one so far is akrun's.
You could also do something like:
df1[rowSums(((df1 == df2) | (is.na(df1) & is.na(df2))), na.rm = TRUE) == ncol(df1),]
But I think akrun's way is more elegant and likely to perform better in terms of speed.
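The NA-aware comparison can be spelled out step by step as in the sketch below, using the data from the question rebuilt with data.frame(). It is the same idea as the one-liner, only split into named steps; the intermediate names same_cell and unchanged are made up for illustration. It assumes both data frames have the same shape and row order.

```r
dt1 <- data.frame(X = c(1, 3, 7), Y = c(2, 2, 5), Z = c(4, 1, NA), T = c(2, 4, 3))
dt2 <- data.frame(X = c(1, 3, 7), Y = c(2, 2, 5), Z = c(4, NA, NA), T = c(2, 4, 3))

# A cell counts as unchanged if the values are equal or both are NA.
# Comparing a value with NA yields NA, so those cells are set to FALSE.
same_cell <- (dt1 == dt2) | (is.na(dt1) & is.na(dt2))
same_cell[is.na(same_cell)] <- FALSE

# Keep the rows in which every cell is unchanged
unchanged <- dt1[rowSums(same_cell) == ncol(dt1), ]
unchanged
```

Because rows are compared position by position, duplicate rows and row order are preserved, which is the concern raised about the join-based answers.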
