I have a table with two columns of IDs; the table implies that the ID in column 1 is related to the ID in column 2. The schema is such that each relation appears twice (say IDs A and B are related; then the entry appears once as A to B and once as B to A). Sample table below:
ID.1 ID.2
A B
A C
B C
C B
C A
B A
D E
E F
F E
D F
E D
F D
(e.g. for A, B, C we see A & B are related, A & C are related, and B & C are related; I tag all of them as one house and give it a unique ID)
Output
ID.1 ID.2 HouseID
A B X1
A C X1
B C X1
C B X1
C A X1
B A X1
D E X2
E F X2
F E X2
D F X2
E D X2
F D X2
How do I get the above in R? And what if I add transitive logic, for example: A is related to B and A is related to C, hence B must also be related to C?
As I understand the question, @Scarabee had it right. The answer depends on maximal cliques, but the bounty shows that the OP did not consider that a full answer. This answer pushes that through to assigning the HouseID.
library(igraph)
## Your sample data
Edges1 = read.table(text="ID.1 ID.2
A B
A C
B C
C B
C A
B A
D E
E F
F E
D F
E D
F D",
header=TRUE, stringsAsFactors=FALSE)
g1 = graph_from_edgelist(as.matrix(Edges1), directed=FALSE)
plot(g1)
MC1 = max_cliques(g1)
MC1
[[1]]
+ 3/6 vertices, named, from 8123133:
[1] A B C
[[2]]
+ 3/6 vertices, named, from 8123133:
[1] D E F
This gives the maximal cliques (the houses), but we need to construct the HouseID variable.
Edges1$HouseID = apply(Edges1, 1,
function(e)
which(sapply(MC1, function(mc) all(e %in% names(unclass(mc))))))
Edges1
ID.1 ID.2 HouseID
1 A B 1
2 A C 1
3 B C 1
4 C B 1
5 C A 1
6 B A 1
7 D E 2
8 E F 2
9 F E 2
10 D F 2
11 E D 2
12 F D 2
The outer apply loops through the edges. The inner sapply checks which clique (house) contains both nodes from the edge.
This provides the structure that the question asked for. But as @Scarabee pointed out, a node may belong to more than one maximal clique (house). That is not necessarily a problem, as the requested structure assigns the HouseID to edges. Here is an example with a node that belongs to two houses.
Edges3 = read.table(text="ID.1 ID.2
A B
A C
B C
D E
D A
E A",
header=TRUE, stringsAsFactors=FALSE)
g3 = graph_from_edgelist(as.matrix(Edges3), directed=FALSE)
plot(g3)
MC3 = max_cliques(g3)
Edges3$HouseID = apply(Edges3, 1,
function(e)
which(sapply(MC3, function(mc) all(e %in% names(unclass(mc))))))
Edges3
ID.1 ID.2 HouseID
1 A B 2
2 A C 2
3 B C 2
4 D E 1
5 D A 1
6 E A 1
In this case, we can still assign a HouseID to each edge, even though the node A is in two different houses. Notice that the edge A-B has HouseID = 2, but edge D-A has HouseID = 1. The HouseID is a property of the edge, not the node.
However, there is still a problem. It is possible for both ends of an edge to belong to two houses and one cannot assign a single house to the edge.
Edges4 = read.table(text="ID.1 ID.2
A B
A C
B C
D A
D B",
header=TRUE, stringsAsFactors=FALSE)
g4 = graph_from_edgelist(as.matrix(Edges4), directed=FALSE)
plot(g4)
MC4 = max_cliques(g4)
MC4
[[1]]
+ 3/4 vertices, named, from fbd5929:
[1] A B C
[[2]]
+ 3/4 vertices, named, from fbd5929:
[1] A B D
The edge A-B belongs to two maximal cliques. As @Scarabee said, the question is not actually well-defined for all graphs.
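For the transitive part of the question (A is related to B and A is related to C, so B and C belong together), connected components give a grouping that is well-defined for every graph, because each node belongs to exactly one component. A minimal sketch, reusing the g1 and Edges1 objects built above:
## Group by connected component: every edge inside one component gets the
## same HouseID, and the grouping is transitive by construction.
comp <- components(g1)
Edges1$HouseID <- paste0("X", comp$membership[Edges1$ID.1])
Edges1
This overwrites the clique-based HouseID from above and reproduces the X1/X2 labels in the question's expected output.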
Related
I got these two data frames:
a <- c('A','B','C','D','E','F','G','H')
b <- c(1,2,1,3,1,3,1,6)
c <- c('K','K','H','H','K','K','H','H')
frame1 <- data.frame(a,b,c)
a <- c('A','A','B','B','C','C','D','D','E','E','F','F','G','H','H')
d <- c(5,5,6,3,1,9,1,0,2,3,6,5,5,5,4)
e <- c('W','W','D','D','D','D','W','W','D','D','W','W','D','W','W')
frame2<- data.frame(a,d,e)
And now I want to include the column 'e' from 'frame2' into 'frame1' depending on the matching value in column 'a' of both data frames. Note: 'e' is the same for all rows with the same value in 'a'.
The result should look like this:
a b c e
1 A 1 K W
2 B 2 K D
3 C 1 H D
4 D 3 H W
5 E 1 K D
6 F 3 K W
7 G 1 H D
8 H 6 H W
Any suggestions?
You can use match to find the matching values in column 'a' of both data frames:
frame1$e <- frame2$e[match(frame1$a, frame2$a)]
frame1
# a b c e
#1 A 1 K W
#2 B 2 K D
#3 C 1 H D
#4 D 3 H W
#5 E 1 K D
#6 F 3 K W
#7 G 1 H D
#8 H 6 H W
or using merge:
merge(frame1, frame2[!duplicated(frame2$a), c("a", "e")], all.x=TRUE)
You can perform a join on the 'a' column of both data frames and keep only the values that match: do a left join, and afterwards drop the columns from the second data frame that aren't needed.
Using dplyr :
library(dplyr)
frame2 %>%
distinct(a, e, .keep_all = TRUE) %>%
right_join(frame1, by = 'a') %>%
select(-d) %>%
arrange(a)
# a e b c
#1 A W 1 K
#2 B D 2 K
#3 C D 1 H
#4 D W 3 H
#5 E D 1 K
#6 F W 3 K
#7 G D 1 H
#8 H W 6 H
A scientific publication published a pancreatic cancer classifier and I want to use this classifier on my own expression set. The only information they provide is a data frame with centroids (rows: genes x columns: subtypes) (https://doi.org/10.1053/j.gastro.2018.08.033, supplementary table 2). Up until now I haven't figured out how to reproduce this classification model for prediction.
All the packages I found calculate the centroids from expression data and labels, and output a model to predict on a new set. Unfortunately, the labels are not published with the article, so recalculating the centroids is not possible.
Question: How can I use the centroids to classify another expression set?
You can use k-Nearest Neighbors with only the centroids. Just use the centroids as the training data with k = 1, so that each sample is assigned the class of its single nearest centroid. Since you do not provide any data, I will give an example using the iris data. The specific centroids don't matter here, but they must be in a data frame with the same format as the data that you wish to classify. You can call the classes whatever you want; I just called them A, B, and C.
## Define some centroids
Centroids = aggregate(iris[,1:4], list(iris$Species), mean)[,-1]
library(class)
knn(train=Centroids, test=iris[,1:4], k=1, cl=c("A", "B", "C"))
This returns a factor of 150 predicted classes (levels A, B, C), one for each row of iris.
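If you prefer not to depend on the class package, here is a hedged base R sketch of the same nearest-centroid idea: compute the distance from every sample to every centroid and take the label of the closest one (it assumes the Centroids object and the A/B/C labels from above).
## Nearest-centroid classification by hand (sketch; assumes Centroids as above)
labels <- c("A", "B", "C")
D <- as.matrix(dist(rbind(Centroids, iris[, 1:4])))  # all pairwise distances
D <- D[-(1:3), 1:3]                                  # rows: samples, columns: centroids
pred <- labels[max.col(-D)]                          # pick the smallest distance per row
table(pred, iris$Species)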
I want to join two dataframes by two columns they have in common but I do not want mutual pairs to be considered as duplicates.
Sample dataframes look like:
>df
letter1 letter2 value
d e 1
c d 2
c e 4
>dc
letter1 letter2
a e
c a
c d
c e
d a
d c
d e
e a
I want to join them by the first two columns, leaving in the third column the value in df$value and NA if the row does not exist in df. I have tried:
s <- join(dc,df, by = c("letter1","letter2"))
>s
letter1 letter2 value
a e NA
c a NA
c d 2
c e 4
d a NA
d c 2
d e 1
e a NA
Here, the pair d c is considered the same as c d and the value in the third column is the same. What I want is d c being considered as non-present in df, so their row value is NA. My desired output is:
>s
letter1 letter2 value
a e NA
c a NA
c d 2
c e 4
d a NA
d c NA
d e 1
e a NA
How can I join the dataframes so mutual pairs are considered different combinations?
UPDATE: I am sorry but I have just realized there was a problem with my input dataframes and that the join line I was trying actually works. I will accept the first answer that also works to give credit to the author.
We can use apply to change the order
df[1:2] <- t(apply(df[1:2], 1, sort))
dc[] <- t(apply(dc, 1, sort))
and then do the join
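A sketch of that join step, assuming plyr's join as in the question (merge(dc, df, by = c("letter1", "letter2"), all.x = TRUE) behaves the same way here); note that after the sort, a reversed pair such as d c matches the same row as c d:
library(plyr)
## Join on the now canonically ordered pairs
join(dc, df, by = c("letter1", "letter2"))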
You could use merge instead of join:
merge(dc,df, by = c("letter1","letter2"),all=TRUE)
#Creating the data frames
df <- data.frame(letter1=c("d","c","c"),
letter2=c("e","d","e"),
value=c(1,2,4))
dc <- data.frame(letter1=c("a","c","c","c","d","d","d","e"),
letter2=c("e","a","d","e","a","c","e","a"))
# Merging the data frames
dout <- merge(df,dc,by=c("letter1","letter2"),all=T)
# Outcome
letter1 letter2 value
1 c d 2
2 c e 4
3 c a NA
4 d e 1
5 d a NA
6 d c NA
7 a e NA
8 e a NA
I have the below diagram
The red arrows represent weighting factors for each node in the direction they point to. The input file contains the value of each factor and its direction.
Can this factor diagram be plotted with R?
First, some dummy data which (I hope) emulates yours (hard to say, considering how little information you gave):
ow <- expand.grid(c(1.5,2.5),c(1.5,2.5))
row.names(ow)<-letters[1:4]
pw <- expand.grid(1:3,1:3)
row.names(pw)<-LETTERS[1:9]
B <- rbind(expand.grid("a",row.names(pw)[c(1,2,4,5)]),
expand.grid("b",row.names(pw)[c(2,3,5,6)]),
expand.grid("c",row.names(pw)[c(4,5,7,8)]),
expand.grid("d",row.names(pw)[c(5,6,8,9)]))
B <- cbind(B,abs(rnorm(16)))
So we have:
# The location of your oil wells:
ow
Var1 Var2
a 1.5 1.5
b 2.5 1.5
c 1.5 2.5
d 2.5 2.5
# Of your production wells:
pw
Var1 Var2
A 1 1
B 2 1
C 3 1
D 1 2
E 2 2
F 3 2
G 1 3
H 2 3
I 3 3
#And a b value for each pair of neighbouring oil/production wells:
Var1 Var2 abs(rnorm(16))
1 a A 1.78527757
2 a B 1.61794028
3 a D 1.80234599
4 a E 0.04202002
5 b B 0.90265280
6 b C 1.05214769
7 b E 0.67932237
8 b F 0.11497430
9 c D 0.26288589
10 c E 0.50745137
11 c G 0.74102529
12 c H 1.43919338
13 d E 1.04111278
14 d F 0.49372216
15 d H 0.21500663
16 d I 0.20156929
And here is a simple function that plots more or less the kind of graph you showed:
weirdplot <- function(ow_loc, pw_loc, B,
pch_ow=19, pch_pw=17,
col_ow="green", col_pw="blue", col_b="red", breaks){
# with ow_loc and pw_loc the locations of your wells
# B the correspondance table
# pch_ow and pch_pw the point type for the wells
# col_b, col_ow and col_pw the colors for the arrows and the wells
# and breaks a vector of size categories for b values
plot(pw_loc,type="n")
b<-cut(B[,3], breaks=breaks)
for(i in 1:nrow(B)){
start=ow_loc[row.names(ow_loc)==B[i,1],]
end=pw_loc[row.names(pw_loc)==B[i,2],]
arrows(x0=start[,1],y0=start[,2],
x1=end[,1], y1=end[,2], lwd=b[i], col=col_b)
}
points(pw_loc, pch=pch_pw, col=col_pw)
points(ow_loc, pch=pch_ow, col=col_ow)
}
So with the values we created earlier:
weirdplot(ow, pw, B, breaks=c(0,0.5,1,1.5,2))
It's not particularly pretty but it should get you started.
I have two data frames in R.
dataframe 1
A B C D E F G
1 2 a a a a a
2 3 b b b c c
4 1 e e f f e
dataframe 2
X Y Z
1 2 g
2 1 h
3 4 i
1 4 j
I want to match dataframe1's columns A and B with dataframe2's columns X and Y. It is NOT an order-sensitive comparison, i.e. row 1 of dataframe 1 (A=1, B=2) is considered the same as both row 1 (X=1, Y=2) and row 2 (X=2, Y=1) of dataframe 2.
When a match is found, I would like to add columns C, D, E, F, G of dataframe1 to the matched row of dataframe2, as follows, with NA where there is no match.
Final dataframe
X Y Z C D E F G
1 2 g a a a a a
2 1 h a a a a a
3 4 i na na na na na
1 4 j e e f f e
I only know how to do matching on a single column; matching on two exchangeable columns and merging the two dataframes based on the matching results is difficult for me. Please kindly suggest a smart way of doing this.
For ease of discussion (thanks to the comments by Vincent and DWin on my previous question that I should post the code), here is the code for loading dataframes 1 and 2 into R:
df1 <- data.frame(A = c(1,2,4), B=c(2,3,1), C=c('a','b','e'),
D=c('a','b','e'), E=c('a','b','f'),
F=c('a','c','f'), G=c('a','c', 'e'))
df2 <- data.frame(X = c(1,2,3,1), Y=c(2,1,4,4), Z=letters[7:10])
The following works, but no doubt can be improved.
I first create a little helper function that performs a row-wise sort on A and B (and renames them to V1 and V2).
replace_index <- function(dat){
x <- as.data.frame(t(sapply(seq_len(nrow(dat)),
function(i)sort(unlist(dat[i, 1:2])))))
names(x) <- paste("V", seq_len(ncol(x)), sep="")
data.frame(x, dat[, -(1:2), drop=FALSE])
}
replace_index(df1)
V1 V2 C D E F G
1 1 2 a a a a a
2 2 3 b b b c c
3 1 4 e e f f e
This means you can use a straight-forward merge to combine the data.
merge(replace_index(df1), replace_index(df2), all.y=TRUE)
V1 V2 C D E F G Z
1 1 2 a a a a a g
2 1 2 a a a a a h
3 1 4 e e f f e j
4 3 4 <NA> <NA> <NA> <NA> <NA> i
This is slightly clunky and has some potential collision and order issues, but it works with your example
df1a <- df1; df1a$A <- df1$B; df1a$B <- df1$A #reverse A and B
merge(df2, rbind(df1,df1a), by.x=c("X","Y"), by.y=c("A","B"), all.x=TRUE)
to produce
X Y Z C D E F G
1 1 2 g a a a a a
2 1 4 j e e f f e
3 2 1 h a a a a a
4 3 4 i <NA> <NA> <NA> <NA> <NA>
One approach would be to create an id key for matching that is order invariant.
# create id key to match
require(plyr)
df1 = adply(df1, 1, transform, id = paste(min(A, B), "-", max(A, B)))
df2 = adply(df2, 1, transform, id = paste(min(X, Y), "-", max(X, Y)))
# combine data frames using `match`
cbind(df2, df1[match(df2$id, df1$id),3:7])
This produces the output
X Y Z id C D E F G
1 1 2 g 1 - 2 a a a a a
1.1 2 1 h 1 - 2 a a a a a
NA 3 4 i 3 - 4 <NA> <NA> <NA> <NA> <NA>
3 1 4 j 1 - 4 e e f f e
You could also join the tables both ways (X == A and Y == B, then X == B and Y == A) and rbind them. This will produce duplicate pairs where one way yielded a match and the other yielded NA, so you would then reduce duplicates by slicing only a single row for each X-Y combination, the one without NA if one exists.
library(dplyr)
m <- left_join(df2,df1,by = c("X" = "A","Y" = "B"))
n <- left_join(df2,df1,by = c("Y" = "A","X" = "B"))
rbind(m,n) %>%
group_by(X,Y) %>%
arrange(C,D,E,F,G) %>% # sort to put NA rows on bottom of pairs
slice(1) # take top row from combination
Produces:
Source: local data frame [4 x 8]
Groups: X, Y
X Y Z C D E F G
1 1 2 g a a a a a
2 1 4 j e e f f e
3 2 1 h a a a a a
4 3 4 i NA NA NA NA NA
Here's another possible solution in base R. This solution cbind()s new key columns (K1 and K2) to both data.frames using the vectorized pmin() and pmax() functions to derive the canonical order of the key columns, and merges on those:
merge(
  cbind(df2, K1=pmin(df2$X,df2$Y), K2=pmax(df2$X,df2$Y)),
  cbind(df1, K1=pmin(df1$A,df1$B), K2=pmax(df1$A,df1$B)),
  all.x=TRUE
)[,-c(1:2,6:7)]
## X Y Z C D E F G
## 1 1 2 g a a a a a
## 2 2 1 h a a a a a
## 3 1 4 j e e f f e
## 4 3 4 i <NA> <NA> <NA> <NA> <NA>
Note that the use of pmin() and pmax() is only possible for this problem because you only have two key columns; if you had more, you'd have to use some kind of apply+sort solution to achieve the canonical key order for merging, similar to what @Andrie does in his helper function, which would work for any number of key columns but would be less performant.
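For illustration, a hedged sketch of that apply+sort generalization (the canon_key() helper is hypothetical, not from any package): build a canonical key by sorting each row of the key columns and pasting the result together, then merge on that key.
## Hypothetical helper: canonical key from any number of key columns
## (assumes the key columns share a type so sort() orders them consistently)
canon_key <- function(d, cols) {
  apply(d[cols], 1, function(r) paste(sort(r), collapse="-"))
}
df2$key <- canon_key(df2, c("X", "Y"))
df1$key <- canon_key(df1, c("A", "B"))
## A and B are dropped before merging; the key column is dropped at the end
merge(df2, df1[-(1:2)], by="key", all.x=TRUE)[-1]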