Say I have two data frames:
df1 <- data.frame(id1 = c('A','B','C','D','P'),
                  id2 = c('P','H','Q','S','A'),
                  weight = c(3,4,2,7,3), stringsAsFactors = FALSE)
df2 <- data.frame(id1 = c('A','H','Q','D','P'),
                  id2 = c('P','B','C','S','Z'),
                  var = c(2,1,2,2,1), stringsAsFactors = FALSE)
I want to join these two data frames by id1 and id2, but sometimes the two ids are switched between the tables. For instance, the second and third records of each frame refer to the same pairs, and the merged table should contain:
B H 4 1
C Q 2 2
I thought about sorting the columns first and then merging, but that approach does not work because not all the records appear in both tables (even after sorting, you can still have id1 and id2 switched). This is a toy example; in the actual application id1 and id2 are long strings.
What's a way to tackle this task?
Here is a solution: create an intermediate column that combines both ids in sorted order.
df1$key <- with(df1, mapply(function(x, y) {
  paste(sort(c(x, y)), collapse = "")
}, id1, id2))
df2$key <- with(df2, mapply(function(x, y) {
  paste(sort(c(x, y)), collapse = "")
}, id1, id2))
merge(df1,df2,by="key")
# key id1.x id2.x weight id1.y id2.y var
# 1 AP A P 3 A P 2
# 2 AP P A 3 A P 2
# 3 BH B H 4 H B 1
# 4 CQ C Q 2 Q C 2
# 5 DS D S 7 D S 2
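Note that the merged result above repeats the AP pair, because df1 contains it in both orders (rows 1 and 2 of the output). If each pair should appear only once, one option (a sketch, not part of the answer above) is to drop duplicate keys within each frame before merging:

```r
df1 <- data.frame(id1 = c('A','B','C','D','P'),
                  id2 = c('P','H','Q','S','A'),
                  weight = c(3,4,2,7,3), stringsAsFactors = FALSE)
df2 <- data.frame(id1 = c('A','H','Q','D','P'),
                  id2 = c('P','B','C','S','Z'),
                  var = c(2,1,2,2,1), stringsAsFactors = FALSE)

# Same order-insensitive key as above, factored into a helper.
make_key <- function(d) mapply(function(x, y) paste(sort(c(x, y)), collapse = ""),
                               d$id1, d$id2)
df1$key <- make_key(df1)
df2$key <- make_key(df2)

# De-duplicate each frame on the key before joining, so the pair that
# appears in both orders (A-P and P-A) contributes one row.
merged <- merge(df1[!duplicated(df1$key), ], df2[!duplicated(df2$key), ], by = "key")
merged$key
# "AP" "BH" "CQ" "DS"
```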
So, I'm trying to read an Excel file. Some of the rows are empty for some of the columns but not for all of them. I want to skip all the rows that are not complete, i.e., that don't have information in all of the columns. For example:
In this case I would like to skip lines 1, 5, 6, 7, 8, and so on.
There is probably a more elegant way of doing it, but one possible solution is to count the number of non-NA elements per row and keep only the rows where that count equals the number of columns.
Using this dummy example:
df <- data.frame(A = LETTERS[1:6],
B = c(sample(1:10,5),NA),
C = letters[1:6])
A B C
1 A 5 a
2 B 9 b
3 C 1 c
4 D 3 d
5 E 4 e
6 F NA f
Using apply, you can count, for each row, the number of non-NA elements:
v <- apply(df,1, function(x) length(na.omit(x)))
[1] 3 3 3 3 3 2
Then keep only the rows whose count equals the number of columns (i.e., the complete rows):
df1 <- df[v == ncol(df),]
A B C
1 A 5 a
2 B 9 b
3 C 1 c
4 D 3 d
5 E 4 e
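As an aside, base R's complete.cases() expresses the same "count equals ncol" condition in one call, flagging rows with no NA in any column. A small sketch (with fixed values instead of sample() so the result is reproducible):

```r
df <- data.frame(A = LETTERS[1:6],
                 B = c(1, 9, 2, 3, 4, NA),  # fixed values for reproducibility
                 C = letters[1:6])

# complete.cases() returns TRUE for rows without any NA.
df1 <- df[complete.cases(df), ]
nrow(df1)
# 5 : the row with NA in B is dropped
```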
Does it answer your question?
I have below mentioned dataframe:
DF1
T1 ID Type
1 A L
2 B Y
3 C B
4 D U
5 E Z
DF2
T1 ID Type
1 A L
2 B Y
3 F K
4 G I
5 H T
Now I want to merge DF1 and DF2, but every row in New_Data should be unique based on the ID column of both data frames.
Required Dataframe:
New_Data
T1 ID Type
1 A L
2 B Y
3 C B
4 D U
5 E Z
3 F K
4 G I
5 H T
I think you can just use
unique(rbind(DF1,DF2))
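Note that unique(rbind(...)) compares entire rows. If uniqueness should be judged on the ID column alone, a base-R sketch using duplicated() keeps the first occurrence of each ID (so rows from DF1 win ties):

```r
DF1 <- data.frame(T1 = 1:5, ID = c("A","B","C","D","E"),
                  Type = c("L","Y","B","U","Z"), stringsAsFactors = FALSE)
DF2 <- data.frame(T1 = 1:5, ID = c("A","B","F","G","H"),
                  Type = c("L","Y","K","I","T"), stringsAsFactors = FALSE)

combined <- rbind(DF1, DF2)
# duplicated() marks the second and later occurrences of each ID,
# so this keeps the first row seen for every ID.
New_Data <- combined[!duplicated(combined$ID), ]
New_Data$ID
# "A" "B" "C" "D" "E" "F" "G" "H"
```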
Row-bind the two data frames, then drop duplicates based on the ID column (or on ID + Type together); with distinct, rows from later data frames in bind_rows that duplicate the id columns of an earlier row are dropped:
library(dplyr)
bind_rows(DF1, DF2) %>% distinct(ID, Type, .keep_all = TRUE)
# T1 ID Type
#1 1 A L
#2 2 B Y
#3 3 C B
#4 4 D U
#5 5 E Z
#6 3 F K
#7 4 G I
#8 5 H T
Based on ID column only:
bind_rows(DF1, DF2) %>% distinct(ID, .keep_all = TRUE)
# T1 ID Type
#1 1 A L
#2 2 B Y
#3 3 C B
#4 4 D U
#5 5 E Z
#6 3 F K
#7 4 G I
#8 5 H T
I'm not sure if this is exactly what you wanted, but to combine the dataframes, you can use the merge function:
# merge two data frames by ID
New_Data <- merge(DF1, DF2, by = "ID", all = TRUE)
The all = TRUE argument means that there will be a row in New_Data for every ID in DF1 and every ID in DF2 (a full outer join); without it, merge keeps only IDs present in both (an inner join). The merge should not duplicate rows. For further information, I suggest looking up inner and outer joins as well as the documentation for the merge function.
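A small sketch of the difference between the default inner join and all = TRUE, using the frames from the question:

```r
DF1 <- data.frame(T1 = 1:5, ID = c("A","B","C","D","E"),
                  Type = c("L","Y","B","U","Z"), stringsAsFactors = FALSE)
DF2 <- data.frame(T1 = 1:5, ID = c("A","B","F","G","H"),
                  Type = c("L","Y","K","I","T"), stringsAsFactors = FALSE)

inner <- merge(DF1, DF2, by = "ID")              # only IDs in both frames: A, B
full  <- merge(DF1, DF2, by = "ID", all = TRUE)  # every ID from either frame
nrow(inner)  # 2
nrow(full)   # 8
```

Keep in mind that merge produces a wide result: the non-key columns are suffixed .x/.y (T1.x, Type.x, ...), so this differs from the stacked layout shown in the question.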
Edit: Binding the rows will also work if you don't want to deal with merging. Row binding performs a vertical stacking of one data frame on top of the other. To order the stacked data alphabetically by ID, you could try:
New_Data <- unique(rbind(DF1, DF2))
New_Data <- New_Data[order(New_Data$ID), ]
Suppose I have two data frames:
df1 <- data.frame(A = 1:6, B = 7:12, C = rep(1:2, 3))
df2 <- data.frame(C = 1:2, D = c("A", "B"))
I want to create a new column E in df1 whose value is based on column C, which can be matched to column D in df2. For example, the C value in the first row of df1 is 1, and the value 1 of column C in df2 corresponds to "A" in column D, so the E value for that row should come from column A of df1, i.e., 1.
As suggested by Select values from different columns based on a variable containing column names, I can achieve this by two steps:
setDT(df1)
setDT(df2)
df3 <- df1[df2, on = "C"] # step 1 combines the two data.tables
df3[, E := .SD[[.BY[[1]]]], by = D] # step 2
My question is: could we do this in one step? Furthermore, as my data is relatively large, the first step of this solution takes a lot of time. Could we do it in a faster way?
Any suggestions?
Here's how I would do it:
df1[df2, on=.(C), D := i.D][, E := .SD[[.BY$D]], by=D]
A B C D E
1: 1 7 1 A 1
2: 2 8 2 B 8
3: 3 9 1 A 3
4: 4 10 2 B 10
5: 5 11 1 A 5
6: 6 12 2 B 12
This adds the columns to df1 by reference instead of making a new table and so I guess is more efficient than building df3. Also, since they're added to df1, the rows retain their original ordering.
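If the by-group step is still a bottleneck, a base-R alternative (a sketch, not part of the answer above) does the lookup with vectorized matrix indexing: map each row's C to a column name via df2, then index the A/B block by (row, column) pairs in one step:

```r
df1 <- data.frame(A = 1:6, B = 7:12, C = rep(1:2, 3))
df2 <- data.frame(C = 1:2, D = c("A", "B"), stringsAsFactors = FALSE)

# For each row of df1, find the column name that its C value maps to in df2.
col_name <- df2$D[match(df1$C, df2$C)]

# Index the matrix by (row, column) pairs: one value per row, fully vectorized.
m <- as.matrix(df1[, c("A", "B")])
col_idx <- match(col_name, colnames(m))
df1$E <- m[cbind(seq_len(nrow(df1)), col_idx)]
df1$E
# 1 8 3 10 5 12
```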
You can try this; the C column indicates which column of df1 to take the value from:
setDT(df1)
df1[, e := eval(parse(text = names(df1)[C])), by = 1:nrow(df1)]
df1
A B C e
1: 1 7 1 1
2: 2 8 2 8
3: 3 9 1 3
4: 4 10 2 10
5: 5 11 1 5
6: 6 12 2 12
Let's say I have a data frame df:
df <- data.frame(id1 = c('A','B','C','D','P'),
                 id2 = c('P','H','Q','S','A'),
                 weight = c(3,4,2,7,3))
id1 id2 weight
1 A P 3
2 B H 4
3 C Q 2
4 D S 7
5 P A 3
This data frame is the edge list representation of a weighted-undirected graph. In this example, I want to remove either the first row or the last row since they are repeated edges. Of course, I want to do the same with all the repeated edges.
I tried this:
w <- df[!duplicated(df[, c('id1', 'id2', 'weight')]), ]
but this is not enough.
We can use pmin/pmax
df[!duplicated(cbind(pmin(df$id1, df$id2), pmax(df$id1, df$id2))),]
# id1 id2 weight
#1 A P 3
#2 B H 4
#3 C Q 2
#4 D S 7
data
df<- data.frame(id1=c('A','B','C','D','P'),
id2=c('P','H','Q','S','A'),weight=c(3,4,2,7,3), stringsAsFactors=FALSE)
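If id1/id2 are factors, or if there are more than two id columns, a more general (if slower) variant of the same idea builds an order-insensitive key by sorting within each row:

```r
df <- data.frame(id1 = c('A','B','C','D','P'),
                 id2 = c('P','H','Q','S','A'),
                 weight = c(3,4,2,7,3), stringsAsFactors = FALSE)

# Sort the id values within each row and paste them into a canonical key,
# so A-P and P-A get the same key. apply() coerces rows to character,
# so this also works when the id columns are factors.
key <- apply(df[c("id1", "id2")], 1, function(x) paste(sort(x), collapse = "-"))
deduped <- df[!duplicated(key), ]
```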
We have a data frame as below:
raw<-data.frame(v1=c("A","B","C","D"),v2=c(NA,"B","C","A"),v3=c(NA,"A",NA,"D"),v4=c(NA,"D",NA,NA))
I need a result data frame in the following format:
result<-data.frame(v1=c("A","B","C","D"), v2=c(3,2,2,3))
I used the following code to get the count within one particular column:
count_raw<-sqldf("SELECT DISTINCT(v1) AS V1, COUNT(v1) AS count FROM raw GROUP BY v1")
This returns the count of unique values for an individual column only.
Any help would be highly appreciated.
Use this
table(unlist(raw))
Output
A B C D
3 2 2 3
For data-frame-type output, wrap this with as.data.frame.table:
as.data.frame.table(table(unlist(raw)))
Output
Var1 Freq
1 A 3
2 B 2
3 C 2
4 D 3
If you want a total count,
sapply(unique(raw[!is.na(raw)]), function(i) length(which(raw == i)))
#A B C D
#3 2 2 3
We can use apply with MARGIN = 1
cbind(raw[1], v2=apply(raw, 1, function(x) length(unique(x[!is.na(x)]))))
If it is for each column
sapply(raw, function(x) length(unique(x[!is.na(x)])))
Or if we need the count based on all the columns, convert to matrix and use the table
table(as.matrix(raw))
# A B C D
# 3 2 2 3
If your data frame contains only character values, as in your example, you can unlist it and use unique; to count the frequencies, use count from plyr:
> library(plyr)
> raw<-data.frame(v1=c("A","B","C","D"),v2=c(NA,"B","C","A"),v3=c(NA,"A",NA,"D"),v4=c(NA,"D",NA,NA))
> unique(unlist(raw))
[1] A B C D <NA>
Levels: A B C D
> count(unlist(raw))
x freq
1 A 3
2 B 2
3 C 2
4 D 3
5 <NA> 6