Is this factors diagram plottable with R? - r

I have the below diagram
These red arrows represent weighting factors for each node in the direction that they are pointing to. The input file is the value of the factor and the direction.
Is this factors diagram plottable with R?

First some dummy data which (I hope) emulate yours (which is hard to say considering how little information you gave):
ow <- expand.grid(c(1.5,2.5),c(1.5,2.5))
row.names(ow)<-letters[1:4]
pw <- expand.grid(1:3,1:3)
row.names(pw)<-LETTERS[1:9]
B <- rbind(expand.grid("a",row.names(pw)[c(1,2,4,5)]),
expand.grid("b",row.names(pw)[c(2,3,5,6)]),
expand.grid("c",row.names(pw)[c(4,5,7,8)]),
expand.grid("d",row.names(pw)[c(5,6,8,9)]))
B <- cbind(B,abs(rnorm(16)))
So we have:
# The location of your oil wells:
ow
Var1 Var2
a 1.5 1.5
b 2.5 1.5
c 1.5 2.5
d 2.5 2.5
# Of your production wells:
pw
Var1 Var2
A 1 1
B 2 1
C 3 1
D 1 2
E 2 2
F 3 2
G 1 3
H 2 3
I 3 3
# And a b value for each pair of neighbouring oil/production wells:
B
Var1 Var2 abs(rnorm(16))
1 a A 1.78527757
2 a B 1.61794028
3 a D 1.80234599
4 a E 0.04202002
5 b B 0.90265280
6 b C 1.05214769
7 b E 0.67932237
8 b F 0.11497430
9 c D 0.26288589
10 c E 0.50745137
11 c G 0.74102529
12 c H 1.43919338
13 d E 1.04111278
14 d F 0.49372216
15 d H 0.21500663
16 d I 0.20156929
And here is a simple function that plots more or less the kind of graph you showed:
weirdplot <- function(ow_loc, pw_loc, B,
                      pch_ow=19, pch_pw=17,
                      col_ow="green", col_pw="blue", col_b="red", breaks){
  # with ow_loc and pw_loc the locations of your wells
  # B the correspondence table
  # pch_ow and pch_pw the point type for the wells
  # col_b, col_ow and col_pw the colors for the arrows and the wells
  # and breaks a vector of size categories for b values
  plot(pw_loc, type="n")
  # convert the size categories to integer line widths (1, 2, 3, ...)
  b <- as.integer(cut(B[,3], breaks=breaks))
  for(i in 1:nrow(B)){
    # look the wells up in the arguments, not in the global ow and pw
    start <- ow_loc[row.names(ow_loc)==B[i,1],]
    end <- pw_loc[row.names(pw_loc)==B[i,2],]
    arrows(x0=start[,1], y0=start[,2],
           x1=end[,1], y1=end[,2], lwd=b[i], col=col_b)
  }
  points(pw_loc, pch=pch_pw, col=col_pw)
  points(ow_loc, pch=pch_ow, col=col_ow)
}
So with the values we created earlier:
weirdplot(ow, pw, B, breaks=c(0,0.5,1,1.5,2))
It's not particularly pretty but it should get you started.
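The arrow widths in the function come from binning the b values with cut(). As a small sketch of that step on its own (the sample values and the names b_values, width_class and too_big are made up for illustration), this shows how each value lands in a width class and what happens to a value outside the breaks:

```r
# Illustrative b values; the breaks are the ones passed to weirdplot() above
b_values <- c(0.04, 0.68, 1.05, 1.79)
breaks <- c(0, 0.5, 1, 1.5, 2)

# cut() bins each value into (0,0.5], (0.5,1], (1,1.5], (1.5,2];
# as.integer() turns the bin into a line width of 1, 2, 3 or 4
width_class <- as.integer(cut(b_values, breaks = breaks))
width_class

# A value beyond the last break falls in no bin and becomes NA
too_big <- as.integer(cut(2.7, breaks = breaks))
too_big
```

Since abs(rnorm(16)) can occasionally exceed 2, it may be worth extending the last break (for example to Inf) so that no arrow ends up with an NA width.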

Related

R - Adding a total row in Excel output

I want to add a total row (as in the Excel tables) while writing my data.frame in a worksheet.
Here is my present code (using openxlsx):
writeDataTable(wb=WB, sheet="Data", x=X, withFilter=F, bandedRows=F, firstColumn=T)
X contains a data.frame with 8 character variables and 1 numeric variable. Therefore the total row should only contain a total for the numeric column. (Ideally I could switch on the Excel total row feature while writing the table to the workbook object, as I did with firstColumn, rather than adding a total row manually.)
I searched for a solution both on StackOverflow and in the official openxlsx documentation, but to no avail. Please suggest solutions using openxlsx.
EDIT:
Adding data sample:
A B C D E F G H I
a b s r t i s 5 j
f d t y d r s 9 s
w s y s u c k 8 f
After Total row:
A B C D E F G H I
a b s r t i s 5 j
f d t y d r s 9 s
w s y s u c k 8 f
na na na na na na na 22 na
library(janitor)
adorn_totals(df, "row")
#> A B C D E F G H I
#> a b s r t i s 5 j
#> f d t y d r s 9 s
#> w s y s u c k 8 f
#> Total - - - - - - 22 -
If you prefer empty space instead of - in the character columns you can specify fill = "" or fill = NA.
Assuming your data is stored in a data.frame called df:
df <- read.table(text =
"A B C D E F G H I
a b s r t i s 5 j
f d t y d r s 9 s
w s y s u c k 8 f",
header = TRUE,
stringsAsFactors = FALSE)
You can create a row using lapply
totals <- lapply(df, function(col) {
  if (is.numeric(col)) sum(col) else NA
})
and add it to df using rbind()
df <- rbind(df, totals)
head(df)
A B C D E F G H I
1 a b s r t i s 5 j
2 f d t y d r s 9 s
3 w s y s u c k 8 f
4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> 22 <NA>

Detect Networks/communities and assign IDs

I have a table with two ID columns. Each row means that the ID in column 1 is related to the ID in column 2. Every relation appears twice: if A and B are related, there is one row for A to B and one for B to A. Sample table below:
ID.1 ID.2
A B
A C
B C
C B
C A
B A
D E
E F
F E
D F
E D
F D
(For example, for A, B and C we see that A and B are related, A and C are related, and B and C are related, so I tag all of them as one house and give it a unique ID.)
Output
ID.1 ID.2 HouseID
A B X1
A C X1
B C X1
C B X1
C A X1
B A X1
D E X2
E F X2
F E X2
D F X2
E D X2
F D X2
How do I get the above in R? And what if I add transitive logic: for example, A is related to B and A is related to C, hence B must also be related to C?
As I understand the question, @Scarabee had it right. The answer depends on maximal cliques, but the bounty shows that the OP did not consider that a full answer. This answer pushes that through to assigning the HouseID.
library(igraph)
## Your sample data
Edges1 = read.table(text="ID.1 ID.2
A B
A C
B C
C B
C A
B A
D E
E F
F E
D F
E D
F D",
header=TRUE, stringsAsFactors=FALSE)
g1 = graph_from_edgelist(as.matrix(Edges1), directed=FALSE)
plot(g1)
MC1 = max_cliques(g1)
MC1
[[1]]
+ 3/6 vertices, named, from 8123133:
[1] A B C
[[2]]
+ 3/6 vertices, named, from 8123133:
[1] D E F
This gives the maximal cliques (the houses), but we need to construct the HouseID variable.
Edges1$HouseID = apply(Edges1, 1,
function(e)
which(sapply(MC1, function(mc) all(e %in% names(unclass(mc))))))
Edges1
ID.1 ID.2 HouseID
1 A B 1
2 A C 1
3 B C 1
4 C B 1
5 C A 1
6 B A 1
7 D E 2
8 E F 2
9 F E 2
10 D F 2
11 E D 2
12 F D 2
The outer apply loops through the edges. The inner sapply checks which clique (house) contains both nodes from the edge.
This provides the structure that the question asked for. But as @Scarabee pointed out, a node may belong to more than one maximal clique (house). That is not exactly a problem, as the requested structure assigns the HouseID to edges. Here is an example with a node that belongs to two houses.
Edges3 = read.table(text="ID.1 ID.2
A B
A C
B C
D E
D A
E A",
header=TRUE, stringsAsFactors=FALSE)
g3 = graph_from_edgelist(as.matrix(Edges3), directed=FALSE)
plot(g3)
MC3 = max_cliques(g3)
Edges3$HouseID = apply(Edges3, 1,
function(e)
which(sapply(MC3, function(mc) all(e %in% names(unclass(mc))))))
Edges3
ID.1 ID.2 HouseID
1 A B 2
2 A C 2
3 B C 2
4 D E 1
5 D A 1
6 E A 1
In this case, we can still assign a HouseID to each edge, even though the node A is in two different houses. Notice that the edge A-B has HouseID = 2, but edge D-A has HouseID = 1. The HouseID is a property of the edge, not the node.
However, there is still a problem. It is possible for both ends of an edge to belong to the same two houses, in which case one cannot assign a single house to the edge.
Edges4 = read.table(text="ID.1 ID.2
A B
A C
B C
D A
D B",
header=TRUE, stringsAsFactors=FALSE)
g4 = graph_from_edgelist(as.matrix(Edges4), directed=FALSE)
plot(g4)
MC4 = max_cliques(g4)
MC4
[[1]]
+ 3/4 vertices, named, from fbd5929:
[1] A B C
[[2]]
+ 3/4 vertices, named, from fbd5929:
[1] A B D
The edge A-B belongs to two maximal cliques. As @Scarabee said, the question is not actually well-defined for all graphs.
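For the transitive reading raised in the question (A related to B and A related to C implies B related to C), one option is to label connected components instead of maximal cliques, since every node then belongs to exactly one group. The following is only a sketch under that assumption: house_by_component() is a hypothetical helper name, not part of igraph, and the "X" prefix simply mimics the IDs in the question.

```r
library(igraph)

# Hypothetical helper: one HouseID per connected component
house_by_component <- function(edges) {
  g <- graph_from_data_frame(edges, directed = FALSE)
  # components() returns a membership vector named by vertex
  membership <- components(g)$membership
  # both ends of an edge share a component, so the first column is enough
  paste0("X", unname(membership[as.character(edges[[1]])]))
}

# The sample data from the question
Edges1 <- data.frame(
  ID.1 = c("A", "A", "B", "C", "C", "B", "D", "E", "F", "D", "E", "F"),
  ID.2 = c("B", "C", "C", "B", "A", "A", "E", "F", "E", "F", "D", "D"),
  stringsAsFactors = FALSE)
Edges1$HouseID <- house_by_component(Edges1)

# The ambiguous example: A-B sits in two maximal cliques,
# but all four nodes form a single connected component
Edges4 <- data.frame(
  ID.1 = c("A", "A", "B", "D", "D"),
  ID.2 = c("B", "C", "C", "A", "B"),
  stringsAsFactors = FALSE)
Edges4$HouseID <- house_by_component(Edges4)
```

With this definition the Edges4 case is no longer ambiguous, at the cost of merging groups that are only loosely connected, which may or may not be what "house" is meant to capture.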

Reshape a correlation matrix, including each pair of variables only once

I have a table like this:
A B C D E
7 1 6 8 7
9 3 9 5 9
4 6 2 1 10
10 5 3 4 1
1 3 5 9 3
6 4 8 7 6
I am in the process of finding the correlation of each variable with every other variable in the table. This is the R code I use:
test <- read.csv("D:/AB/test.csv")
iterations <- ncol(test)
correlation <- matrix(ncol = 3 , nrow = iterations * iterations)
for (k in 1:iterations) {
for (l in 1:iterations){
corr <- cor(test[,k], test[,l])
corr_string_A <- names(test[k])
corr_string_B <- names(test[l])
correlation[l + ((k-1) * iterations),] <- rbind(corr_string_A, corr_string_B, corr)
}
}
The following is the output that I received:
Var1 Var2 value
1 A A 1.00000000
2 B A 0.50018605
3 C A -0.35747393
4 D A -0.25670054
5 E A -0.02974821
6 A B 0.50018605
7 B B 1.00000000
8 C B 0.56070716
9 D B 0.46164928
10 E B 0.16813991
11 A C -0.35747393
12 B C 0.56070716
13 C C 1.00000000
14 D C 0.52094589
15 E C 0.23190036
16 A D -0.25670054
17 B D 0.46164928
18 C D 0.52094589
19 D D 1.00000000
20 E D -0.39223227
21 A E -0.02974821
22 B E 0.16813991
23 C E 0.23190036
24 D E -0.39223227
25 E E 1.00000000
However, I don't want the values from the upper triangle; i.e., no diagonal values should occur, and each unique combination should appear only once. The final output should look like:
Var1 Var2 value
1 B A 0.50018605
2 C A -0.35747393
3 D A -0.25670054
4 E A -0.02974821
5 C B 0.56070716
6 D B 0.46164928
7 E B 0.16813991
8 D C 0.52094589
9 E C 0.23190036
10 E D -0.39223227
I understand that there are a few techniques like reshape using which the above output can be achieved, but I want to make the above R code to suit and produce the above mentioned results.
I believe the "n" in the second for loop should be made to change dynamically which can help achieving this. However I am not sure how to make this work.
You can convert your correlation matrix to the 3-column format with as.data.frame and as.table, and then limiting to values above or below the diagonal can be done with subset.
subset(as.data.frame(as.table(cor(dat))),
match(Var1, names(dat)) > match(Var2, names(dat)))
# Var1 Var2 Freq
# 2 B A -0.02299154
# 3 C A 0.23155350
# 4 D A -0.28036851
# 5 E A -0.05230260
# 8 C B -0.58384036
# 9 D B -0.80175393
# 10 E B 0.00000000
# 14 D C 0.52094589
# 15 E C 0.23190036
# 20 E D -0.39223227
Note that for larger datasets this should be much more efficient than separately calling cor on pairs of variables because cor is vectorized, and further it's clearly a lot less typing.
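An alternative way to express the same idea in base R is to index the correlation matrix with lower.tri(). This is a sketch rather than the answerer's code: it assumes the data from the question is typed in as a data frame called test, and cor_long is just an illustrative name.

```r
# The table from the question, entered by hand
test <- data.frame(A = c(7, 9, 4, 10, 1, 6),
                   B = c(1, 3, 6, 5, 3, 4),
                   C = c(6, 9, 2, 3, 5, 8),
                   D = c(8, 5, 1, 4, 9, 7),
                   E = c(7, 9, 10, 1, 3, 6))

cm <- cor(test)

# Row/column positions of the cells strictly below the diagonal
idx <- which(lower.tri(cm), arr.ind = TRUE)

# One row per unordered pair, diagonal excluded
cor_long <- data.frame(Var1 = rownames(cm)[idx[, "row"]],
                       Var2 = colnames(cm)[idx[, "col"]],
                       value = cm[idx],
                       stringsAsFactors = FALSE)
cor_long
```

Because which() walks the matrix column by column, the pairs come out in the order requested in the question (B-A, C-A, D-A, E-A, C-B, ...).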
If you really must keep the looping code, then you can achieve your desired result with small changes to the pair of for loops and some book keeping about the row of correlation that you are computing:
iterations <- ncol(test)
correlation <- matrix(ncol = 3 , nrow = choose(iterations, 2))
pos <- 1
for (k in 2:iterations) {
for (l in 1:(k-1)){
corr <- cor(test[,k], test[,l])
corr_string_A <- names(test[k])
corr_string_B <- names(test[l])
correlation[pos,] <- rbind(corr_string_A, corr_string_B, corr)
pos <- pos+1
}
}
However I really wouldn't suggest this looping solution; it would be better to use the one-liner I provided and then to handle all generated NA values afterward.
From the OP's loop output, we can subset the rows,
df1[!duplicated(t(apply(df1[1:2], 1, sort))) & df1[,1]!=df1[,2],]
# Var1 Var2 value
#2 B A 0.50018605
#3 C A -0.35747393
#4 D A -0.25670054
#5 E A -0.02974821
#8 C B 0.56070716
#9 D B 0.46164928
#10 E B 0.16813991
#14 D C 0.52094589
#15 E C 0.23190036
#20 E D -0.39223227
Or as I mentioned (first) in the comments, just use
cor(test)

Replacing header in data frame based on values in second data frame

Say I have a data frame which looks like this:
df.A
A B C
x 1 3 4
y 5 4 6
z 8 9 1
And I want to replace the column names in the first based on column values in a second:
df.B
Low High
A D
B F
C G
Such that I get:
df.A
D F G
x 1 3 4
y 5 4 6
z 8 9 1
How would I do it?
I have tried extracting the vector df.B$High from df.B and using this in names(df.A), but everything is in alphabetical order and shifted over one. Furthermore, this only works if the order of columns in df.A is conserved with respect to the elements in df.B$High, which is not always the case (and in my real example there is no numeric or alphabetical way to sort the two to the same order). So I think I need an rbind-type argument for matching elements, but I'm not sure.
Thanks!
You can use rename from plyr:
library(plyr)
dat <- read.table(text = " A B C
x 1 3 4
y 5 4 6
z 8 9 1",header = TRUE,sep = "")
new <- read.table(text = "Low High
A D
B F
C G",header = TRUE,sep = "")
rename(dat,replace = setNames(new$High,new$Low))
D F G
x 1 3 4
y 5 4 6
z 8 9 1
using match:
df.A <- read.table(sep=" ", header=T, text="
A B C
x 1 3 4
y 5 4 6
z 8 9 1")
df.B <- read.table(sep=" ", header=T, text="
Low High
A D
B F
C G")
df.C <- df.A
names(df.C) <- df.B$High[match(names(df.A), df.B$Low)]
df.C
# D F G
# x 1 3 4
# y 5 4 6
# z 8 9 1
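The match() lookup assumes every column of df.A has an entry in df.B$Low; a column without one would end up with an NA name. As a hedged sketch of one way to guard against that, the example below uses a named vector as the lookup and keeps the old name when no mapping exists. The extra column Q and the shuffled row order of df.B are made up to exercise that case.

```r
df.A <- data.frame(A = c(1, 5, 8), B = c(3, 4, 9), C = c(4, 6, 1),
                   Q = c(0, 0, 0),               # no mapping for Q
                   row.names = c("x", "y", "z"))
df.B <- data.frame(Low = c("C", "A", "B"),       # deliberately not in df.A's order
                   High = c("G", "D", "F"),
                   stringsAsFactors = FALSE)

# Named vector: names are the old headers, values the new ones
lookup <- setNames(df.B$High, df.B$Low)

# Look each current name up; unmatched names come back as NA
new_names <- unname(lookup[names(df.A)])

# Fall back to the existing name where there is no mapping
names(df.A) <- ifelse(is.na(new_names), names(df.A), new_names)
names(df.A)
```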
You can play games with the row names of df.B to make a lookup more convenient:
rownames(df.B) <- df.B$Low
names(df.A) <- df.B[names(df.A),"High"]
df.A
## D F G
## x 1 3 4
## y 5 4 6
## z 8 9 1
Here's an approach abusing factor:
f <- factor(names(df.A), levels=df.B$Low)
levels(f) <- df.B$High
f
## [1] D F G
## Levels: D F G
names(df.A) <- f
## Desired results

merge two dataframe based on matching two exchangable columns in each dataframe

I have two dataframe in R.
dataframe 1
A B C D E F G
1 2 a a a a a
2 3 b b b c c
4 1 e e f f e
dataframe 2
X Y Z
1 2 g
2 1 h
3 4 i
1 4 j
I want to match dataframe 1's columns A and B with dataframe 2's columns X and Y. The comparison is NOT order-sensitive, i.e. row 1 of dataframe 1 (A=1, B=2) is considered the same as both row 1 (X=1, Y=2) and row 2 (X=2, Y=1) of dataframe 2.
When a match is found, I would like to add columns C, D, E, F and G of dataframe 1 to the matched row of dataframe 2, with NA where there is no match, as follows:
Final dataframe
X Y Z C D E F G
1 2 g a a a a a
2 1 h a a a a a
3 4 i na na na na na
1 4 j e e f f e
I only know how to match on a single column; matching on two exchangeable columns and merging the two data frames based on the result is difficult for me. Please suggest a smart way of doing this.
For ease of discussion (thanks to Vincent and DWin for commenting on my previous question that I should test the code), here is the code for loading dataframes 1 and 2 into R:
df1 <- data.frame(A = c(1,2,4), B=c(2,3,1), C=c('a','b','e'),
D=c('a','b','e'), E=c('a','b','f'),
F=c('a','c','f'), G=c('a','c', 'e'))
df2 <- data.frame(X = c(1,2,3,1), Y=c(2,1,4,4), Z=letters[7:10])
The following works, but no doubt can be improved.
I first create a little helper function that performs a row-wise sort on the first two columns (A and B for df1) and renames them to V1 and V2.
replace_index <- function(dat){
x <- as.data.frame(t(sapply(seq_len(nrow(dat)),
function(i)sort(unlist(dat[i, 1:2])))))
names(x) <- paste("V", seq_len(ncol(x)), sep="")
data.frame(x, dat[, -(1:2), drop=FALSE])
}
replace_index(df1)
V1 V2 C D E F G
1 1 2 a a a a a
2 2 3 b b b c c
3 1 4 e e f f e
This means you can use a straight-forward merge to combine the data.
merge(replace_index(df1), replace_index(df2), all.y=TRUE)
V1 V2 C D E F G Z
1 1 2 a a a a a g
2 1 2 a a a a a h
3 1 4 e e f f e j
4 3 4 <NA> <NA> <NA> <NA> <NA> i
This is slightly clunky, and has some potential collision and order issues but works with your example
df1a <- df1; df1a$A <- df1$B; df1a$B <- df1$A #reverse A and B
merge(df2, rbind(df1,df1a), by.x=c("X","Y"), by.y=c("A","B"), all.x=TRUE)
to produce
X Y Z C D E F G
1 1 2 g a a a a a
2 1 4 j e e f f e
3 2 1 h a a a a a
4 3 4 i <NA> <NA> <NA> <NA> <NA>
One approach would be to create an id key for matching that is order invariant.
# create id key to match
require(plyr)
df1 = adply(df1, 1, transform, id = paste(min(A, B), "-", max(A, B)))
df2 = adply(df2, 1, transform, id = paste(min(X, Y), "-", max(X, Y)))
# combine data frames using `match`
cbind(df2, df1[match(df2$id, df1$id),3:7])
This produces the output
X Y Z id C D E F G
1 1 2 g 1 - 2 a a a a a
1.1 2 1 h 1 - 2 a a a a a
NA 3 4 i 3 - 4 <NA> <NA> <NA> <NA> <NA>
3 1 4 j 1 - 4 e e f f e
You could also join the tables both ways (X == A and Y == B, then X == B and Y == A) and rbind them. This will produce duplicate pairs where one way yielded a match and the other yielded NA, so you would then reduce duplicates by slicing only a single row for each X-Y combination, the one without NA if one exists.
library(dplyr)
m <- left_join(df2,df1,by = c("X" = "A","Y" = "B"))
n <- left_join(df2,df1,by = c("Y" = "A","X" = "B"))
rbind(m,n) %>%
group_by(X,Y) %>%
arrange(C,D,E,F,G) %>% # sort to put NA rows on bottom of pairs
slice(1) # take top row from combination
Produces:
Source: local data frame [4 x 8]
Groups: X, Y
X Y Z C D E F G
1 1 2 g a a a a a
2 1 4 j e e f f e
3 2 1 h a a a a a
4 3 4 i NA NA NA NA NA
Here's another possible solution in base R. This solution cbind()s new key columns (K1 and K2) to both data.frames using the vectorized pmin() and pmax() functions to derive the canonical order of the key columns, and merges on those:
keyed2 <- cbind(df2, K1=pmin(df2$X,df2$Y), K2=pmax(df2$X,df2$Y))
keyed1 <- cbind(df1, K1=pmin(df1$A,df1$B), K2=pmax(df1$A,df1$B))
merge(keyed2, keyed1, all.x=TRUE)[,-c(1:2,6:7)]
## X Y Z C D E F G
## 1 1 2 g a a a a a
## 2 2 1 h a a a a a
## 3 1 4 j e e f f e
## 4 3 4 i <NA> <NA> <NA> <NA> <NA>
Note that the use of pmin() and pmax() is only possible for this problem because you only have two key columns; if you had more, then you'd have to use some kind of apply+sort solution to achieve the canonical key order for merging, similar to what @Andrie does in his helper function, which would work for any number of key columns, but would be less performant.
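The apply+sort idea mentioned in the note can be sketched as follows. canonical_keys() is a hypothetical helper name; it assumes at least two key columns, all of the same type and with no missing values (sort() would silently drop NAs). It appends sorted copies of the keys as K1, K2, ... and leaves the original columns in place.

```r
# Hypothetical helper: append order-invariant key columns K1, K2, ...
canonical_keys <- function(dat, key_cols) {
  # sort the key values within each row; t() puts rows back as rows
  sorted <- t(apply(dat[, key_cols, drop = FALSE], 1, sort))
  colnames(sorted) <- paste0("K", seq_len(ncol(sorted)))
  cbind(dat, as.data.frame(sorted))
}

# The data frames from the question
df1 <- data.frame(A = c(1, 2, 4), B = c(2, 3, 1), C = c("a", "b", "e"),
                  D = c("a", "b", "e"), E = c("a", "b", "f"),
                  F = c("a", "c", "f"), G = c("a", "c", "e"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(X = c(1, 2, 3, 1), Y = c(2, 1, 4, 4), Z = letters[7:10],
                  stringsAsFactors = FALSE)

# Merge on the canonical keys, keeping every row of df2
merged <- merge(canonical_keys(df2, c("X", "Y")),
                canonical_keys(df1, c("A", "B")),
                by = c("K1", "K2"), all.x = TRUE)
merged
```

The same call would work with three or more key columns by passing longer key_cols vectors, which is the case pmin() and pmax() cannot cover.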
