Special removal of duplicated rows in R data frame

Let's say I have a data frame df defined as:
df <- data.frame(id1 = c('A','B','C','D','P'),
                 id2 = c('P','H','Q','S','A'),
                 weight = c(3,4,2,7,3))
  id1 id2 weight
1   A   P      3
2   B   H      4
3   C   Q      2
4   D   S      7
5   P   A      3
This data frame is the edge-list representation of a weighted, undirected graph. In this example, I want to remove either the first row or the last row, since they describe the same edge. Of course, I want to do the same with all the repeated edges.
I tried this:
w=df[!duplicated(df[,c('id1', 'id2','weight')]),]
but this is not enough: duplicated only catches rows where id1 and id2 appear in the same order.

We can use pmin/pmax to put each id pair into a canonical order before checking for duplicates:
df[!duplicated(cbind(pmin(df$id1, df$id2), pmax(df$id1, df$id2))),]
#  id1 id2 weight
#1   A   P      3
#2   B   H      4
#3   C   Q      2
#4   D   S      7
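To see why this works: pmin and pmax compare the two id columns element-wise, so each edge collapses to the same canonical (min, max) pair regardless of the direction in which it was stored. They rely on < comparisons, which are not meaningful for unordered factors, hence the stringsAsFactors = FALSE in the data block below. A quick illustration:
# both orientations of the A-P edge reduce to the same sorted pair
pmin(c("A", "P"), c("P", "A"))  # "A" "A"
pmax(c("A", "P"), c("P", "A"))  # "P" "P"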
data
df <- data.frame(id1 = c('A','B','C','D','P'),
                 id2 = c('P','H','Q','S','A'),
                 weight = c(3,4,2,7,3), stringsAsFactors = FALSE)

Related

How to skip not completely empty rows in R

So, I'm trying to read an Excel file. What happens is that some of the rows are empty for some of the columns but not for all of them. I want to skip all the rows that are not complete, i.e., that don't have information in all of the columns. For example, in this case I would like to skip lines 1, 5, 6, 7, 8, and so on.
There is probably a more elegant way of doing this, but a possible solution is to count the number of non-NA elements per row and keep only the rows where that count equals the number of columns.
Using this dummy example:
df <- data.frame(A = LETTERS[1:6],
                 B = c(sample(1:10, 5), NA),
                 C = letters[1:6])
  A  B C
1 A  5 a
2 B  9 b
3 C  1 c
4 D  3 d
5 E  4 e
6 F NA f
Using apply, you can count, for each row, the number of non-NA elements:
v <- apply(df,1, function(x) length(na.omit(x)))
[1] 3 3 3 3 3 2
And then keep only the rows whose count equals the number of columns (i.e., the complete rows):
df1 <- df[v == ncol(df),]
  A B C
1 A 5 a
2 B 9 b
3 C 1 c
4 D 3 d
5 E 4 e
Does that answer your question?
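For what it's worth, base R's complete.cases() expresses the same idea in a single step: it returns TRUE for every row that has no NA in any column.
# equivalent to keeping rows whose non-NA count equals ncol(df)
df1 <- df[complete.cases(df), ]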

Conditionally update Dataframe from second dataframe in R

I have 2 data frames and would like to use the second to update the first. The problem, though, is that both data frames contain all the entries, but with different amounts of data, as shown below (DF3 is the desired result):
  DF1        DF2          DF3
  X Y        X Y          X Y
  1 A        1 B          1 B
  2 <NA>     2 B          2 B
  3 <NA>     3 C    -->   3 C
  4 D        4 <NA>       4 D
  5 E        5 <NA>       5 E
It should be a simple update where an entry in DF1 is updated whenever the corresponding entry in DF2 is not NA.
I first thought of removing the NAs from the second data frame:
DF2sub <- subset(DF2, !is.na(Y))
DF3 <- transform(DF1, Y = DF2sub$Y[match(X,DF2sub$X)])
but the resulting code does the following
DF3
  X    Y
1 1    B
2 2    B
3 3    C
4 4 <NA>
5 5 <NA>
You can use the which function to obtain the indices of the non-NA and NA values and stitch the two subsets together, like this:
DF3 <- rbind(DF2[which(!is.na(DF2$Y)), ], DF1[which(is.na(DF2$Y)), ])
Hope this solves your issue.
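As a sketch of an alternative, assuming (as in the example) that DF1 and DF2 have the same X values in the same order, you can update DF1 in place with logical indexing instead of rebuilding it with rbind; this also preserves the original row order without any re-sorting:
DF1 <- data.frame(X = 1:5, Y = c("A", NA, NA, "D", "E"), stringsAsFactors = FALSE)
DF2 <- data.frame(X = 1:5, Y = c("B", "B", "C", NA, NA), stringsAsFactors = FALSE)
upd <- !is.na(DF2$Y)      # positions where DF2 supplies a value
DF1$Y[upd] <- DF2$Y[upd]  # overwrite only those positions
DF1
#   X Y
# 1 1 B
# 2 2 B
# 3 3 C
# 4 4 D
# 5 5 E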

Particular join in R data frame

Say I have two data frames:
df1 <- data.frame(id1 = c('A','B','C','D','P'),
                  id2 = c('P','H','Q','S','A'),
                  weight = c(3,4,2,7,3), stringsAsFactors = FALSE)
df2 <- data.frame(id1 = c('A','H','Q','D','P'),
                  id2 = c('P','B','C','S','Z'),
                  var = c(2,1,2,2,1), stringsAsFactors = FALSE)
I want to join these two data frames by id1 and id2, but sometimes the records are switched between the two tables. For instance, the second and third records of each frame should match, and the output in the merged table should be:
B H 4 1
C Q 2 2
I thought about sorting the columns first and then merging, but this approach does not work because not all the records appear in both tables (even after sorting, id1 and id2 can still be switched). This is a toy example, but in the actual application id1 and id2 are long strings.
What's a way to tackle this task?
Here is a solution that creates an intermediate column combining both ids in sorted order.
df1$key <- with(df1, mapply(function(x, y) {
  paste(sort(c(x, y)), collapse = "")
}, id1, id2))
df2$key <- with(df2, mapply(function(x, y) {
  paste(sort(c(x, y)), collapse = "")
}, id1, id2))
merge(df1,df2,by="key")
#   key id1.x id2.x weight id1.y id2.y var
# 1  AP     A     P      3     A     P   2
# 2  AP     P     A      3     A     P   2
# 3  BH     B     H      4     H     B   1
# 4  CQ     C     Q      2     Q     C   2
# 5  DS     D     S      7     D     S   2
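Since id1 and id2 are plain character strings here, the intermediate key can also be built without mapply by reusing the vectorized pmin/pmax idea from the first question; a sketch that produces the same sorted key:
df1$key <- paste0(pmin(df1$id1, df1$id2), pmax(df1$id1, df1$id2))
df2$key <- paste0(pmin(df2$id1, df2$id2), pmax(df2$id1, df2$id2))
merge(df1, df2, by = "key")
For long strings this avoids calling a function once per row, which can matter at scale.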

Count of unique values across all columns in a data frame

We have a data frame as below:
raw <- data.frame(v1 = c("A","B","C","D"),
                  v2 = c(NA,"B","C","A"),
                  v3 = c(NA,"A",NA,"D"),
                  v4 = c(NA,"D",NA,NA))
I need a result data frame in the following format:
result <- data.frame(v1 = c("A","B","C","D"), v2 = c(3,2,2,3))
I used the following code to get the count for one particular column:
count_raw<-sqldf("SELECT DISTINCT(v1) AS V1, COUNT(v1) AS count FROM raw GROUP BY v1")
This returns the count of unique values for an individual column only.
Any help would be highly appreciated.
Use this
table(unlist(raw))
Output
A B C D
3 2 2 3
For data-frame-style output, wrap this in as.data.frame.table:
as.data.frame.table(table(unlist(raw)))
Output
  Var1 Freq
1    A    3
2    B    2
3    C    2
4    D    3
If you want the total count for each value across the whole data frame,
sapply(unique(raw[!is.na(raw)]), function(i) length(which(raw == i)))
#A B C D
#3 2 2 3
We can use apply with MARGIN = 1 to get the count of unique non-NA values for each row:
cbind(raw[1], v2=apply(raw, 1, function(x) length(unique(x[!is.na(x)]))))
If it is needed for each column:
sapply(raw, function(x) length(unique(x[!is.na(x)])))
Or if we need the count based on all the columns, convert to a matrix and use table:
table(as.matrix(raw))
# A B C D
# 3 2 2 3
If you have only character values in your data frame, as in your example, you can unlist it and use unique; to count the frequencies, use count from plyr:
> library(plyr)
> raw<-data.frame(v1=c("A","B","C","D"),v2=c(NA,"B","C","A"),v3=c(NA,"A",NA,"D"),v4=c(NA,"D",NA,NA))
> unique(unlist(raw))
[1] A B C D <NA>
Levels: A B C D
> count(unlist(raw))
     x freq
1    A    3
2    B    2
3    C    2
4    D    3
5 <NA>    6
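If the exact v1/v2 layout from the question is wanted, the table() result can be reshaped by hand; note that table() drops NAs by default. A small sketch:
tab <- table(unlist(raw))
result <- data.frame(v1 = names(tab), v2 = as.vector(tab))
result
#   v1 v2
# 1  A  3
# 2  B  2
# 3  C  2
# 4  D  3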

Counting number of unique rows that have repeated records in one column

This is what my dataframe looks like:
a <- c(1,1,4,4,5)
b <- c(1,2,3,3,5)
c <- c(1,4,4,4,5)
d <- c(2,2,4,4,5)
e <- c(1,5,3,3,5)
df <- data.frame(a,b,c,d,e)
I'd like to write something that returns all unique instances of vectors a,b,c,d that have a repeated value in vector e.
For example:
a b c d e
1 1 1 1 2 1
2 1 2 4 2 5
3 4 3 4 4 3
4 4 3 4 4 3
5 5 5 5 5 5
Rows 3 and 4 are exactly the same through column d (the combination 4,3,4,4), so only one instance of them should be returned, but they have 2 repeated values in column e. I would want a count of those: the combination 4,3,4,4 has 2 repeated values in column e.
The expected output would be how many times a certain combination such as 4,3,4,4 had repeated values in column e. So in this case it would be something like:
a b c d e
4 3 4 4 2
Both R and SQL work, whatever does the job.
I believe the following gives you a start on your question. First, create a "key" variable (here named key_abcd, using tidyr::unite to combine columns a, b, c, and d). Then count e by this key_abcd variable; the group_by is implicit.
library(tidyr)
library(dplyr)
df <- data.frame(a, b, c, d, e)
df %>%
  unite(key_abcd, a, b, c, d) %>%
  count(key_abcd, e)
# key_abcd e n
# (chr) (dbl) (int)
# 1 1_1_1_2 1 1
# 2 1_2_4_2 5 1
# 3 4_3_4_4 3 2
# 4 5_5_5_5 5 1
It appears from how you've worded the question that you are only interested in combinations occurring more than once; therefore, you could add %>% filter(n > 1) to the above code, as in the sketch below.
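Putting it together, the full pipeline with that filter would look like this (column names as in the code above):
library(tidyr)
library(dplyr)
df %>%
  unite(key_abcd, a, b, c, d) %>%  # collapse a-d into a single key
  count(key_abcd, e) %>%           # count rows per (key, e) pair
  filter(n > 1)                    # keep only the repeated ones
#   key_abcd     e     n
# 1  4_3_4_4     3     2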
