Delete Duplicates when Merging DF [duplicate] - r

This question already has answers here:
Select only the first row when merging data frames with multiple matches
(4 answers)
Closed 5 years ago.
I know, I know... another data frame merging question. Please hear me out: I have searched SO for an answer to this and found none.
I am merging two data frames, one smaller than the other, using a left merge to match the longer data frame to the smaller one.
This works well except for one issue: rows get added to the left (smaller) data frame when the right (longer) data frame has duplicates.
An Example:
Row<-c("a","b","c","d","e")
Data<-(1:5)
df1<-data.frame(Row,Data)
Row2<-c("a","b","b","c","d","e","f","g","h")
Data2<-(1:9)
df2<-data.frame(Row2,Data2)
names(df2)<-c("Row","Data2")
DATA<-merge(x = df1, y = df2, by = "Row", all.x = TRUE)
>DATA
Row Data Data2
1 a 1 1
2 b 2 2
3 b 2 3
4 c 3 4
5 d 4 5
6 e 5 6
See the extra "b" row? That is what I want to get rid of. I want to keep the left data frame strictly: if there are 5 rows in df1, the merged result should also have exactly 5 rows.
Like this...
Row Data Data2
1 a 1 1
2 b 2 2
3 c 3 4
4 d 4 5
5 e 5 6
Where it only takes the first match and moves on.
I realize the merge function is only doing its job here, so is there another way to get my expected result? Or is there a post-merge modification that should be done instead?
Thank you for your help and time.
Research:
How to join (merge) data frames (inner, outer, left, right)?
deleting duplicates
Merging two data frames with different sizes and missing values

We can use the duplicated function as follows:
DATA[!duplicated(DATA$Row),]
Row Data Data2
1 a 1 1
2 b 2 2
4 c 3 4
5 d 4 5
6 e 5 6
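If the goal is only to pull the first matching Data2 value into df1, a lookup with match() may avoid the extra rows in the first place, because match() returns the position of the first match only. A minimal sketch, rebuilding the question's df1 and df2 (stringsAsFactors = FALSE is my addition so that Row is compared as character):

```r
df1 <- data.frame(Row = c("a", "b", "c", "d", "e"), Data = 1:5,
                  stringsAsFactors = FALSE)
df2 <- data.frame(Row = c("a", "b", "b", "c", "d", "e", "f", "g", "h"),
                  Data2 = 1:9, stringsAsFactors = FALSE)

# match() gives, for each df1$Row, the index of its FIRST occurrence in
# df2$Row (NA when there is no match), so df1 keeps exactly its own rows.
DATA <- df1
DATA$Data2 <- df2$Data2[match(df1$Row, df2$Row)]
DATA
```

This only carries over one column at a time; for several columns, dropping duplicates from df2 before merging is probably simpler.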

It's also possible to drop the duplicates from df2 inside the merge call:
merge(x = df1, y = df2[!duplicated(df2$Row), ], by = "Row", all.x = TRUE)
# Row Data Data2
#1 a 1 1
#2 b 2 2
#3 c 3 4
#4 d 4 5
#5 e 5 6

Since you only want the first row per key and don't care which of the duplicate rows is kept, you can use this code (before you merge):
Row2<-c("a","b","b","c","d","e","f","g","h")
Data2<-(1:9)
df2<-data.frame(Row2,Data2)
library(dplyr)
# keep only the first row for each value of Row2
df2 %>%
group_by(Row2) %>%
slice(1) %>%
ungroup()

Related

Repeating rows in data frame by using the content of a column in R [duplicate]

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 2 years ago.
I want to create a data frame by repeating rows by using content of a column in a data frame. Below is the source data frame.
data.frame(c("a","b","c"), c(4,5,6), c(2,2,3)) -> df
colnames(df) <- c("sample", "measurement", "repeat")
df
sample measurement repeat
1 a 4 2
2 b 5 2
3 c 6 3
I want to repeat the rows by using the "repeat" column and its content to get a data frame like the one below. Ideally, I would like to have a function to do this.
sample measurement repeat
1 a 4 2
2 a 4 2
3 b 5 2
4 b 5 2
5 c 6 3
6 c 6 3
7 c 6 3
Thanks in advance!
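Solved. df[rep(rownames(df), df[["repeat"]]), ] did the job. (The column has to be accessed as df[["repeat"]] or df$`repeat`, because repeat is a reserved word in R.)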
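Since the question asks for a function, the one-liner above could be wrapped roughly as follows. This is only a sketch: repeat_rows and its times_col argument are hypothetical names, and the counts column is assumed to hold non-negative whole numbers.

```r
# Hypothetical helper: repeat each row of `df` as many times as given in
# the column named by `times_col` (assumed non-negative whole numbers).
repeat_rows <- function(df, times_col = "repeat") {
  idx <- rep(seq_len(nrow(df)), times = df[[times_col]])
  out <- df[idx, , drop = FALSE]
  rownames(out) <- NULL  # renumber rows 1..n instead of 1, 1.1, 2, ...
  out
}

# check.names = FALSE keeps the column name "repeat" as written
df <- data.frame(sample = c("a", "b", "c"),
                 measurement = c(4, 5, 6),
                 `repeat` = c(2, 2, 3),
                 check.names = FALSE)
res <- repeat_rows(df)
res
```

Indexing by seq_len(nrow(df)) rather than rownames(df) avoids relying on the row names being unique strings.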

How to use bind_rows() and ignore column names [duplicate]

This question already has answers here:
Simplest way to get rbind to ignore column names
(2 answers)
Closed 4 years ago.
This question has probably been answered before, but I can't seem to find the answer. How do you use bind_rows() to just union the two tables and ignore the column names?
The documentation on bind_rows() has the following example:
#Columns don't need to match when row-binding
bind_rows(data.frame(x = 1:3), data.frame(y = 1:4))
This returns column x and y. How do I just get a single column back without having to change the column names?
Desired output, I don't really care what the column name ends up being:
x
1 1
2 2
3 3
4 1
5 2
6 3
7 4
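You can do this with a short helper function:
library(dplyr)
force_bind = function(df1, df2) {
# rename df2's columns to match df1 so that bind_rows() stacks them
colnames(df2) = colnames(df1)
bind_rows(df1, df2)
}
force_bind(data.frame(x = 1:3), data.frame(y = 1:4))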
Output:
x
1 1
2 2
3 3
4 1
5 2
6 3
7 4
I think we still need to change the names here:
bind_rows(data.frame(x = 1:3), setNames(data.frame(y = 1:4), "x"))
x
1 1
2 2
3 3
4 1
5 2
6 3
7 4

Create a new dataframe according to the contrast between two similar df [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 5 years ago.
I have a dataframe made like this:
X Y Z T
1 2 4 2
3 2 1 4
7 5 NA 3
After several steps (which ones is not important), I obtained this data frame:
X Y Z T
1 2 4 2
3 2 NA 4
7 5 NA 3
I want to obtain a new data frame containing only the rows that did not change during those steps. The result would be:
X Y Z T
1 2 4 2
7 5 NA 3
How can I do this?
One option in base R is to paste the rows of each dataset together and compare them (==) to create a logical vector, which we then use to subset the dataset:
dfO[do.call(paste, dfO) == do.call(paste, df),]
# X Y Z T
#1 1 2 4 2
#3 7 5 NA 3
where 'dfO' is the old dataset and 'df' is the new one.
You can use dplyr's intersect function:
library(dplyr)
intersect(d1, d2)
# X Y Z T
#1 1 2 4 2
#2 7 5 NA 3
This is a data.frame-equivalent of base R's intersect function.
In case you're working with data.tables, that package also provides such a function:
library(data.table)
setDT(d1)
setDT(d2)
fintersect(d1, d2)
# X Y Z T
#1: 1 2 4 2
#2: 7 5 NA 3
Another dplyr solution: semi_join.
dt1 %>% semi_join(dt2, by = colnames(.))
X Y Z T
1 1 2 4 2
2 7 5 NA 3
Data
dt1 <- read.table(text = "X Y Z T
1 2 4 2
3 2 1 4
7 5 NA 3",
header = TRUE, stringsAsFactors = FALSE)
dt2 <- read.table(text = " X Y Z T
1 2 4 2
3 2 NA 4
7 5 NA 3",
header = TRUE, stringsAsFactors = FALSE)
I am afraid that neither semi_join, nor intersect, nor merge is a correct answer. merge and intersect will not handle duplicate rows properly, and semi_join will change the order of the rows.
From this perspective, I think the only correct one so far is akrun's.
You could also do something like:
df1[rowSums(((df1 == df2) | (is.na(df1) & is.na(df2))), na.rm = TRUE) == ncol(df1),]
But I think akrun's way is more elegant and likely to perform better in terms of speed.
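The row-wise comparison above can be made reproducible with the question's data. The sketch below is one possible reading, not a definitive implementation: unchanged_rows is a hypothetical helper name, and it assumes both data frames have the same shape and row order.

```r
# Hypothetical helper: keep the rows of `old` that are identical in `new`,
# treating NA == NA as equal. Assumes same dimensions and same row order.
unchanged_rows <- function(old, new) {
  stopifnot(identical(dim(old), dim(new)))
  same <- (old == new) | (is.na(old) & is.na(new))
  same[is.na(same)] <- FALSE      # NA versus a non-NA value counts as a change
  old[rowSums(same) == ncol(old), , drop = FALSE]
}

dt1 <- data.frame(X = c(1, 3, 7), Y = c(2, 2, 5), Z = c(4, 1, NA), T = c(2, 4, 3))
dt2 <- data.frame(X = c(1, 3, 7), Y = c(2, 2, 5), Z = c(4, NA, NA), T = c(2, 4, 3))
res <- unchanged_rows(dt1, dt2)
res
```

Because the rows are compared position by position, the original row order and any duplicate rows are preserved.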

How to merge two dataframes R [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 6 years ago.
I have two data frames with some overlapping variables and some not. Each variable has an attribute (the frequency of the variable), and I need to combine the two into one data frame with two columns of attributes, one corresponding to the first data frame and the other to the second data frame, in which the union of all the variables is represented.
dataframe 1:
var frequency
a 3
b 2
d 5
dataframe 2:
var frequency
a 2
b 3
c 3
Resulting dataframe:
var frequency1 frequency2
a 3 2
b 2 3
c 0 3
d 5 0
Thanks for your help.
This seems to work for me:
df1 = read.csv('df1.csv')
df2 = read.csv('df2.csv')
df1$frequency1 = df1$frequency
df2$frequency2 = df2$frequency
df1$frequency = NULL
df2$frequency = NULL
df = merge(df1, df2, by = 'var', all = TRUE)
print(df)
The idea is that if you want frequency1 and frequency2 to be the names in the final merged dataframe, you can rename them in df1 and df2 before merging. This produces:
var frequency1 frequency2
1 a 3 2
2 b 2 3
3 d 5 NA
4 c NA 3
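The question asks for 0 rather than NA where a variable is missing from one data frame. A possible follow-up, sketched here with the example data typed in directly (instead of the CSV files, which are not available) and with merge's suffixes argument doing the renaming:

```r
df1 <- data.frame(var = c("a", "b", "d"), frequency = c(3, 2, 5),
                  stringsAsFactors = FALSE)
df2 <- data.frame(var = c("a", "b", "c"), frequency = c(2, 3, 3),
                  stringsAsFactors = FALSE)

# suffixes renames the clashing "frequency" columns during the merge
res <- merge(df1, df2, by = "var", all = TRUE, suffixes = c("1", "2"))
res[is.na(res)] <- 0            # absent from one frame means a frequency of 0
res <- res[order(res$var), ]    # explicit sort on the key
res
```

Replacing every NA with 0 is only safe under the assumption that the inputs contain no genuine missing values.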

How to mix dataframes in R [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 7 years ago.
I have the following situation: two huge data frames, X and Y (about 13 million rows and 11 columns each), and I need to merge them in a specific way.
The X dataframe example is
A 1 2 3
B 3 2 4
C 1 6 8
The Y dataframe is
A 9 1 8
B 3 1 7
D 2 9 4
I have to mix them with the following logic:
If the first element of a row in Y is present in X, then I have to append the Y data to the X row.
If the first element of a row in Y is not present in X, then I have to append zeroes and then append the Y data.
For all the X rows not present in Y, I have to append zeroes.
The mix result should be like this:
A 1 2 3 9 1 8 (A found in Y, so the Y data is appended)
B 3 2 4 3 1 7 (B found in Y, so the Y data is appended)
C 1 6 8 0 0 0 (C not found in Y, so zeroes are added)
D 0 0 0 2 9 4 (D not found in X, so zeroes are added and then the Y data is appended)
I tried to go row by row, but it takes ages, and I need a solution that takes one or two instructions.
Thanks
Without a reproducible example I can't test this, but I think you want:
library(dplyr)
z <- full_join(x, y, by = "FirstColumn")  # "FirstColumn" stands for the name of the key column
z[is.na(z)] <- 0
This assumes there are no NAs in the original data.
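For reference, the same idea can be tried in base R on the small example from the question. This is a sketch only: the column names key, x1..x3 and y1..y3 are invented here, since the question does not give any, and it has not been tried on data of the size described.

```r
x <- data.frame(key = c("A", "B", "C"),
                x1 = c(1, 3, 1), x2 = c(2, 2, 6), x3 = c(3, 4, 8),
                stringsAsFactors = FALSE)
y <- data.frame(key = c("A", "B", "D"),
                y1 = c(9, 3, 2), y2 = c(1, 1, 9), y3 = c(8, 7, 4),
                stringsAsFactors = FALSE)

# all = TRUE keeps keys found in only one of the two frames (a full outer join)
z <- merge(x, y, by = "key", all = TRUE)
z[is.na(z)] <- 0   # fill the unmatched side with zeroes
z
```

As with the dplyr version, the zero-fill step assumes the original data contain no NAs of their own.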
