I have the following two data frames:
DF1
T1 ID Type
1 A L
2 B Y
3 C B
4 D U
5 E Z
DF2
T1 ID Type
1 A L
2 B Y
3 F K
4 G I
5 H T
Now I want to combine DF1 and DF2 so that every row in New_Data is unique, based on the ID column of both data frames.
Required Dataframe:
New_Data
T1 ID Type
1 A L
2 B Y
3 C B
4 D U
5 E Z
3 F K
4 G I
5 H T
I think you can just use
unique(rbind(DF1,DF2))
Row-bind the two data frames, then drop duplicates based on the ID column or on the ID + Type columns (when a key is repeated, distinct keeps the first occurrence, so the rows from the later data frame in bind_rows are the ones dropped):
library(dplyr)
bind_rows(DF1, DF2) %>% distinct(ID, Type, .keep_all = TRUE)
# T1 ID Type
#1 1 A L
#2 2 B Y
#3 3 C B
#4 4 D U
#5 5 E Z
#6 3 F K
#7 4 G I
#8 5 H T
Based on ID column only:
bind_rows(DF1, DF2) %>% distinct(ID, .keep_all = TRUE)
# T1 ID Type
#1 1 A L
#2 2 B Y
#3 3 C B
#4 4 D U
#5 5 E Z
#6 3 F K
#7 4 G I
#8 5 H T
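If dplyr is not available, the same ID-based de-duplication can be sketched in base R with duplicated(). The data frames below are re-typed from the question, and the name combined is only illustrative.

```r
# Data re-typed from the question
DF1 <- data.frame(T1 = 1:5, ID = c("A", "B", "C", "D", "E"),
                  Type = c("L", "Y", "B", "U", "Z"), stringsAsFactors = FALSE)
DF2 <- data.frame(T1 = 1:5, ID = c("A", "B", "F", "G", "H"),
                  Type = c("L", "Y", "K", "I", "T"), stringsAsFactors = FALSE)

# Stack the rows, then keep only the first occurrence of each ID
combined <- rbind(DF1, DF2)
New_Data <- combined[!duplicated(combined$ID), ]
rownames(New_Data) <- NULL  # renumber the rows 1..n
print(New_Data)
```

Unlike unique(), which compares whole rows, this keeps the first row for each ID even when T1 or Type differ between the two data frames.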
I'm not sure if this is exactly what you wanted, but to combine the dataframes, you can use the merge function:
# merge two data frames by ID
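New_Data <- merge(DF1, DF2, by = "ID", all = TRUE)
Setting all = TRUE makes this a full outer join: every ID that appears in DF1 or DF2 gets a row in New_Data, and an ID present in both is not duplicated (as long as each ID is unique within each data frame). Because only ID is used as the key, the other shared columns come back with suffixes, as T1.x, Type.x, T1.y and Type.y. For further information, I suggest looking up inner and outer joins as well as the documentation for the merge function.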
Edit: Binding the rows will also work if you don't want to deal with merging. Row binding stacks one data frame vertically on top of the other. To order the stacked data alphabetically by ID, you could try:
New_Data <- unique(rbind(DF1, DF2))
New_Data <- New_Data[order(New_Data$ID), ]
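To see how merging and row binding differ on the question's data, here is a small base R comparison. The data frames are re-typed from the question; merged and stacked are illustrative names.

```r
# Data re-typed from the question
DF1 <- data.frame(T1 = 1:5, ID = c("A", "B", "C", "D", "E"),
                  Type = c("L", "Y", "B", "U", "Z"), stringsAsFactors = FALSE)
DF2 <- data.frame(T1 = 1:5, ID = c("A", "B", "F", "G", "H"),
                  Type = c("L", "Y", "K", "I", "T"), stringsAsFactors = FALSE)

# Full outer join on ID: one row per ID, but T1 and Type appear twice
merged <- merge(DF1, DF2, by = "ID", all = TRUE)
print(names(merged))

# Row bind, drop fully identical rows, then sort by ID:
# the three original columns are kept as they are
stacked <- unique(rbind(DF1, DF2))
stacked <- stacked[order(stacked$ID), ]
print(stacked)
```

Both results have one row per ID here, but only the stacked version has the T1, ID, Type layout the question asks for.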
I have two different data.tables. I need to merge them and sum the values of rows that share the same X. Examples of the two tables are given as input below, followed by the expected output.
Input
Table 1
X  A  B
A  3
B  4  6
C  5
D  9 12
Table 2
X  A  B
A  1  5
B  6  8
C  7 14
D  5
E  1  1
F  2  3
G  5  6
Expected Output:
X A B
A 4 5
B 10 14
C 12 14
D 14 12
E 1 1
F 2 3
G 5 6
We can do this by row-binding the two tables and then computing a grouped sum:
library(data.table)
rbindlist(list(df1, df2))[, lapply(.SD, sum, na.rm = TRUE), by = X]
# X A B
#1: A 4 5
#2: B 10 14
#3: C 12 14
#4: D 14 12
#5: E 1 1
#6: F 2 3
#7: G 5 6
Or using a similar approach with dplyr
library(dplyr)
bind_rows(df1, df2) %>%
group_by(X) %>%
summarise_all(sum, na.rm = TRUE)
Note: Here, we assume that the blanks are NA and the 'A' and 'B' columns are numeric/integer class
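The same bind-then-sum idea can also be sketched in base R with aggregate(). The tables below are re-typed from the question under the assumption stated in the note above (blanks are NA, and A and B are numeric); combined and result are illustrative names.

```r
# Tables re-typed from the question, with blanks entered as NA
df1 <- data.frame(X = c("A", "B", "C", "D"),
                  A = c(3, 4, 5, 9),
                  B = c(NA, 6, NA, 12), stringsAsFactors = FALSE)
df2 <- data.frame(X = c("A", "B", "C", "D", "E", "F", "G"),
                  A = c(1, 6, 7, 5, 1, 2, 5),
                  B = c(5, 8, 14, NA, 1, 3, 6), stringsAsFactors = FALSE)

# Stack both tables, then sum A and B within each value of X;
# na.rm = TRUE is passed on to sum() so missing values are skipped
combined <- rbind(df1, df2)
result <- aggregate(combined[c("A", "B")], by = list(X = combined$X),
                    FUN = sum, na.rm = TRUE)
print(result)
```

The data frame method of aggregate() is used here rather than the formula interface, because the formula interface drops rows containing NA by default before summing.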
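Merge your tables together first, then do the sums. After the merge, the value columns are named A.x, B.x (from df1) and A.y, B.y (from df2), so sum each pair. If you later want to drop the individual values you can do so easily.
out <- merge(df1, df2, by = "X", all = TRUE) # full outer join on X
out$A <- rowSums(cbind(out$A.x, out$A.y), na.rm = TRUE)
out$B <- rowSums(cbind(out$B.x, out$B.y), na.rm = TRUE)
out$A.x <- out$A.y <- out$B.x <- out$B.y <- NULL # drop original values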
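The code below does the job for all numeric columns at once. Note that full_join without a by argument joins on every shared column (X, A and B here), so a row that is identical in both tables would be counted only once.
library(dplyr)
Table = Table1 %>% full_join(Table2) %>%
group_by(X) %>% summarise_all(sum, na.rm = TRUE)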
Suppose I have two data frames:
df1 <- data.frame(A = 1:6, B = 7:12, C = rep(1:2, 3))
df2 <- data.frame(C = 1:2, D = c("A", "B"))
I want to create a new column E in df1 whose value depends on column C, which links to column D in df2. For example, C in the first row of df1 is 1, and C = 1 in df2 corresponds to D = "A", so E in the first row of df1 should be taken from column A, i.e., 1.
As suggested by Select values from different columns based on a variable containing column names, I can achieve this by two steps:
library(data.table)
setDT(df1)
setDT(df2)
df3 <- df1[df2, on = "C"] # step 1 combines the two data.tables
df3[, E := .SD[[.BY[[1]]]], by = D] # step 2
My question is: could we do this in one step? Furthermore, as my data is relatively large, the first step in this solution takes a lot of time. Could we do this in a faster way?
Any suggestions?
Here's how I would do it:
df1[df2, on=.(C), D := i.D][, E := .SD[[.BY$D]], by=D]
A B C D E
1: 1 7 1 A 1
2: 2 8 2 B 8
3: 3 9 1 A 3
4: 4 10 2 B 10
5: 5 11 1 A 5
6: 6 12 2 B 12
This adds the columns to df1 by reference instead of making a new table and so I guess is more efficient than building df3. Also, since they're added to df1, the rows retain their original ordering.
You can try this. Here the C column is used directly as the position of the column in df1 to take the value from:
setDT(df1)
df1[, e := eval(parse(text = names(df1)[C])), by = 1:nrow(df1)]
df1
A B C e
1: 1 7 1 1
2: 2 8 2 8
3: 3 9 1 3
4: 4 10 2 10
5: 5 11 1 5
6: 6 12 2 12
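For readers without data.table, the same per-row lookup can be sketched in base R with matrix indexing, which avoids both the join and the row-by-row evaluation. This is only an illustration on the question's toy data; col_name, value_cols and m are hypothetical helper names.

```r
df1 <- data.frame(A = 1:6, B = 7:12, C = rep(1:2, 3))
df2 <- data.frame(C = 1:2, D = c("A", "B"), stringsAsFactors = FALSE)

# For every row of df1, look up the name of the column to read from
col_name <- df2$D[match(df1$C, df2$C)]

# Put the candidate columns into a matrix, then pick one cell per row
# using a two-column (row, column) index matrix
value_cols <- unique(df2$D)
m <- as.matrix(df1[value_cols])
df1$E <- m[cbind(seq_len(nrow(df1)), match(col_name, value_cols))]
print(df1)
```

The approach assumes the candidate columns share one type (all numeric here), since as.matrix() would otherwise coerce them to a common type.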
Say I have two data frames:
df1<- data.frame(id1=c('A','B','C','D','P'),
id2=c('P','H','Q','S','A'),weight=c(3,4,2,7,3), stringsAsFactors=FALSE)
df2<- data.frame(id1=c('A','H','Q','D','P'),
id2=c('P','B','C','S','Z'),var=c(2,1,2,2,1), stringsAsFactors=FALSE)
I want to join these two data frames by id1 and id2, but sometimes the ids are switched between the two tables. For instance, the second and third records of each frame should be treated as the same, and the output in the merged table should be:
B H 4 1
C Q 2 2
I thought about sorting the columns first and then doing the merge, but this approach does not work because not all the records appear in both tables (even after sorting, you can have id1 and id2 switched). This is a toy example, but in the actual application id1 and id2 are long strings.
What's a way to tackle this task?
Here is a solution that creates an intermediate column combining both ids in sorted order:
df1$key <- with(df1,mapply(function(x,y){
paste(sort(c(x,y)),collapse="")
},id1,id2))
df2$key <- with(df2,mapply(function(x,y){
paste(sort(c(x,y)),collapse="")
},id1,id2))
merge(df1,df2,by="key")
# key id1.x id2.x weight id1.y id2.y var
# 1 AP A P 3 A P 2
# 2 AP P A 3 A P 2
# 3 BH B H 4 H B 1
# 4 CQ C Q 2 Q C 2
# 5 DS D S 7 D S 2
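Since the question says the real ids are long strings, pasting them with collapse="" could in principle make distinct pairs collide (for example "AB" + "C" and "A" + "BC" both give "ABC"). A possible variant, sketched below, builds the key with the vectorised pmin/pmax and a separator. The separator "|" is an assumption: it must be a string that never occurs inside an id.

```r
df1 <- data.frame(id1 = c('A', 'B', 'C', 'D', 'P'),
                  id2 = c('P', 'H', 'Q', 'S', 'A'),
                  weight = c(3, 4, 2, 7, 3), stringsAsFactors = FALSE)
df2 <- data.frame(id1 = c('A', 'H', 'Q', 'D', 'P'),
                  id2 = c('P', 'B', 'C', 'S', 'Z'),
                  var = c(2, 1, 2, 2, 1), stringsAsFactors = FALSE)

# Order-independent key: smaller id first, larger id second,
# joined by a separator assumed not to appear in any id
df1$key <- paste(pmin(df1$id1, df1$id2), pmax(df1$id1, df1$id2), sep = "|")
df2$key <- paste(pmin(df2$id1, df2$id2), pmax(df2$id1, df2$id2), sep = "|")

merged <- merge(df1, df2, by = "key")
print(merged)
```

pmin and pmax work on character vectors element by element, so no mapply loop is needed, which may also help on larger tables.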
Let's say I have a data frame df:
df<- data.frame(id1=c('A','B','C','D','P'), id2=c('P','H','Q','S','A'),weight=c(3,4,2,7,3))
id1 id2 weight
1 A P 3
2 B H 4
3 C Q 2
4 D S 7
5 P A 3
This data frame is the edge list representation of a weighted-undirected graph. In this example, I want to remove either the first row or the last row since they are repeated edges. Of course, I want to do the same with all the repeated edges.
I tried this:
w=df[!duplicated(df[,c('id1', 'id2','weight')]),]
but this is not enough.
We can use pmin/pmax
df[!duplicated(cbind(pmin(df$id1, df$id2), pmax(df$id1, df$id2))),]
# id1 id2 weight
#1 A P 3
#2 B H 4
#3 C Q 2
#4 D S 7
data
df<- data.frame(id1=c('A','B','C','D','P'),
id2=c('P','H','Q','S','A'),weight=c(3,4,2,7,3), stringsAsFactors=FALSE)
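To make it clearer why the attempt in the question is not enough, the sketch below runs both versions side by side. The names attempt, lo, hi and dedup are illustrative only.

```r
df <- data.frame(id1 = c('A', 'B', 'C', 'D', 'P'),
                 id2 = c('P', 'H', 'Q', 'S', 'A'),
                 weight = c(3, 4, 2, 7, 3), stringsAsFactors = FALSE)

# Attempt from the question: (A, P) and (P, A) are different rows
# as far as duplicated() is concerned, so nothing is removed
attempt <- df[!duplicated(df[, c('id1', 'id2', 'weight')]), ]

# Put each edge into a canonical order first (smaller id, larger id),
# then look for duplicates among the canonical pairs
lo <- pmin(df$id1, df$id2)
hi <- pmax(df$id1, df$id2)
dedup <- df[!duplicated(data.frame(lo, hi)), ]

print(nrow(attempt))
print(dedup)
```

The first occurrence of each undirected edge is the one that is kept, so row 1 (A, P) survives and row 5 (P, A) is dropped.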
I have a data.frame
df1=data.frame(f=LETTERS[1:4],v=c(1:4))
f v
1 A 1
2 B 2
3 C 3
4 D 4
The first column is a factor, and I have another data frame that maps these values to other values, which are also factors:
df2=data.frame(f=LETTERS[1:7],f2=letters[26:20])
f f2
1 A z
2 B y
3 C x
4 D w
5 E v
6 F u
7 G t
I am wondering how to write a function so that I can alter the values from the first column of df1 to what they map to from df2. I would like to get:
f v
1 z 1
2 y 2
3 x 3
4 w 4
I tried a for loop with no success. Any suggestions are greatly appreciated.
Note: this is a simplified example of my work. A merge would add too many columns to work with and I don't think the extra memory storage would be very useful
We can use match
df1$f <- df2$f2[match(df1$f, df2$f)]
df1
# f v
#1 z 1
#2 y 2
#3 x 3
#4 w 4
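A closely related idiom is a named lookup vector, which reads a bit like a dictionary. This is a sketch that assumes character columns (stringsAsFactors = FALSE); lookup is an illustrative name.

```r
df1 <- data.frame(f = LETTERS[1:4], v = 1:4, stringsAsFactors = FALSE)
df2 <- data.frame(f = LETTERS[1:7], f2 = letters[26:20], stringsAsFactors = FALSE)

# Named vector: the names are the old values, the elements the new values
lookup <- setNames(df2$f2, df2$f)

# Index the lookup by name to translate every value of df1$f
df1$f <- unname(lookup[df1$f])
print(df1)
```

If the columns are factors, wrap them in as.character() first, because indexing with a factor uses its integer codes rather than its labels.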
You can use merge
merge(df1,df2,by = "f")[,c(1,3,2)]
f f2 v
1 A z 1
2 B y 2
3 C x 3
4 D w 4
library(dplyr)
left_join(df1,df2)
You could try using the merge function to merge the two tables, then specify which columns you want to keep.
For example:
df1 <- data.frame(f=LETTERS[1:4],v=c(1:4))
df2 <- data.frame(f=LETTERS[1:7],f2=letters[26:20])
merge(df1, df2, by = "f")[, c("f2", "v")]
f2 v
1 z 1
2 y 2
3 x 3
4 w 4