Fixing node positions when plotting networks in R

mynet is a network object with 93 vertices and three vertex attributes: sex, indegree, and outdegree. Another network object, simnet, is a simulated version of the network. The nodes and degree distributions are the same, but some edges have been rewired.
I plot them side by side...
par(mfrow=c(1,2))
plot(mynet, vertex.col="sex", main="mynet")
plot(simnet, vertex.col="sex", main="simnet")
...and get the following result (plots omitted: the two networks are drawn with different node layouts):
This would be much more useful if I could fix the node locations in both plots, as it would make the differences in edges very clear. Is there a way to do this with the base plot() function? If not, what is the simplest way to do this without manually entering coordinates for each node?

There is a way to do this by setting the layout in advance of plotting and using the same layout for both plots. We can do this using the names of the nodes, since the same nodes appear in each graph. The approach is a little hacky but seems to work. Example code below:
library(igraph)
# Make some fake networks
set.seed(42)
df1 <- data.frame(e1 = sample(1:5, 10, replace = T),
                  e2 = sample(1:5, 10, replace = T))
df2 <- data.frame(e1 = sample(1:5, 10, replace = T),
                  e2 = sample(1:5, 10, replace = T))
# the original
g1 <- graph_from_data_frame(df1, directed = F)
# the 'simulations'
g2 <- graph_from_data_frame(df2, directed = F)
# set up the plot
par(mfrow=c(1,2))
# we set the layout
lo <- layout_with_kk(g1)
# this is a matrix of positions. Positions
# refer to the order of the nodes
head(lo)
#> [,1] [,2]
#> [1,] -0.03760207 0.08115827
#> [2,] 1.06606602 0.35564140
#> [3,] -1.09026110 0.28291157
#> [4,] -0.90060771 -0.72591181
#> [5,] 0.67151585 -1.82471026
V(g1)
#> + 5/5 vertices, named, from 418e4e6:
#> [1] 5 2 4 3 1
# If the layout has names for the rows then we can
# use those names to fiddle with the order
row.names(lo) <- names(V(g1))
# plot with layout
plot(g1, layout = lo)
# plot with layout but reorder the layout to match the order
# in which nodes appear in g2
plot(g2, layout = lo[names(V(g2)), ])
Created on 2018-11-15 by the reprex package (v0.2.1)
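If you want to stay with statnet network objects rather than converting to igraph, a similar trick should work with plot.network itself: if I remember right, it invisibly returns the coordinates it used and accepts a coord argument. A minimal sketch, assuming mynet and simnet store their vertices in the same order:
library(network)
par(mfrow = c(1, 2))
# plot.network invisibly returns the vertex coordinates it used...
coords <- plot(mynet, vertex.col = "sex", main = "mynet")
# ...which can be fed back in, so both plots share one layout
plot(simnet, vertex.col = "sex", main = "simnet", coord = coords)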

Related

fill NA raster cells using focal defined by boundary

I have a raster and a shapefile. The raster contains NAs and I am filling them using the focal function:
library(terra)
v <- vect(system.file("ex/lux.shp", package="terra"))
r <- rast(system.file("ex/elev.tif", package="terra"))
r[45:60, 45:60] <- NA
r_fill <- terra::focal(r, 5, mean, na.policy="only", na.rm=TRUE)
However, there are some NA still left. So I do this:
na_count <- terra::freq(r_fill, value = NA)
while(na_count$count != 0){
  r_fill <- terra::focal(r_fill, 5, mean, na.policy="only", na.rm=TRUE)
  na_count <- terra::freq(r_fill, value = NA)
}
Once all NA's are filled, I clip the raster again using the shapefile
r_fill <- terra::crop(r_fill, v, mask = T, touches = T)
This is what my before and after look like (before/after figures omitted):
I wonder whether the while loop is an efficient way to fill the NAs, or whether there is a way to determine in advance how many times I have to run focal to fill all the NAs in the raster.
Perhaps we can, or want to, dispense with the while() altogether by making a better estimate of focal()'s w= argument in a world where r, as ground truth, isn't available. Were it available, we could readily derive the value of w directly:
r <- rast(system.file("ex/elev.tif", package="terra"))
# and its variants
r2 <- r
r2[45:60, 45:60] <- NA
freq(r2, value=NA) - freq(r, value=NA)
layer value count
1 0 NA 256
sqrt((freq(r2, value=NA) - freq(r, value=NA))$count)
[1] 16
which might be a good value for w=. Introducing another variant:
r3 <- r
r3[40:47, 40:47] <- NA
r3[60:67, 60:67] <- NA
r3[30:37, 30:37] <- NA
r3[70:77, 40:47] <- NA
rm(r)
We no longer have our ground truth. How might we estimate an edge length for w=? Turning to boundaries() with its default values (inner):
r2_bi <- boundaries(r2)
r3_bi <- boundaries(r3)
# examining some properties of r2_bi, r3_bi
freq(r2_bi, value=1)$count
[1] 503
freq(r3_bi, value=1)$count
[1] 579
freq(r2_bi, value=1)$count/freq(r2_bi, value = 0)$count
[1] 0.1306833
freq(r3_bi, value=1)$count/freq(r3_bi, value = 0)$count
[1] 0.1534588
sum(freq(r2_bi, value=1)$count,freq(r2_bi, value = 0)$count)
[1] 4352
sum(freq(r3_bi, value=1)$count,freq(r3_bi, value = 0)$count)
[1] 4352
Taken in reverse order, the sum[s] and freq[s] suggest that while the total area of the NA regions (let's call them holes) is the same, the holes differ in number, and r2's are generally larger than r3's. This is also clear from the first pair of freq[s].
Now we drift into some voodoo, hocus pocus, in pursuit of a better edge estimate:
sum(freq(r2)$count) - sum(freq(r2, value = NA)$count)
[1] 154
sum(freq(r3)$count) - sum(freq(r3, value = NA)$count)
[1] 154
sqrt(sum(freq(r3)$count) - sum(freq(r3, value = NA)$count))
[1] 12.40967
freq(r2_bi, value=1)$count/freq(r2_bi, value = 0)$count
[1] 0.1306833
freq(r2_bi, value=0)$count/freq(r2_bi, value = 1)$count
[1] 7.652087
freq(r3_bi, value=1)$count/freq(r3_bi, value = 0)$count
[1] 0.1534588
Taking the larger, i.e. freq(r2_bi, ...)'s 7.652087:
7.652087/0.1306833
[1] 58.55444
154+58
[1] 212
sqrt(212)
[1] 14.56022
round(sqrt(212)+1)
[1] 16
Well, except for that +1 part, maybe still a decent estimate for w=, to be used on both r2 and r3 if called upon to find a better w, and perhaps to obviate the need for while().
Another approach to looking for squares and their edges:
wtf3 <- values(r3_bi$elevation)
wtf2 <- values(r2_bi$elevation)
wtf2_tbl_df2 <- as.data.frame(table(rle(as.vector(is.na(wtf2)))$lengths))
wtf3_tbl_df2 <- as.data.frame(table(rle(as.vector(is.na(wtf3)))$lengths))
names(wtf2_tbl_df2)
[1] "Var1" "Freq"
wtf2_tbl_df2[which(wtf2_tbl_df2$Var1 == wtf2_tbl_df2$Freq), ]
Var1 Freq
14 16 16
wtf3_tbl_df2[which(wtf3_tbl_df2$Freq == max(wtf3_tbl_df2$Freq)), ]
Var1 Freq
7 8 35
35/8
[1] 4.375 # 4 squares of 8 with 3 8 length vectors
Bringing in v finally, and filling:
v <- vect(system.file("ex/lux.shp", package="terra"))
r2_fill_17 <- focal(r2, 16 + 1 , mean, na.policy='only', na.rm = TRUE)
r3_fill_9 <- focal(r3, 8 + 1 , mean, na.policy='only', na.rm = TRUE)
r2_fill_17_cropv <- crop(r2_fill_17, v, mask = TRUE, touches = TRUE)
r3_fill_9_cropv <- crop(r3_fill_9, v, mask = TRUE, touches = TRUE)
And I now appreciate your while() approach, as your r2 looks better, more naturally transitioned, though r3 looks fine. In my few, brief experiments with windows smaller than the hole, e.g. focal(r2, 9), I got the sense it would take 2 passes to fill, which suggests focal(r2, 5) would take 4.
I guess it would be worthwhile to further determine the fill:hole:raster proportions for deciding when to deploy a while().
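That pass count has a simple geometry behind it: with na.policy = "only", a pass can only fill NA cells within floor(w/2) cells of a non-NA cell, so a square hole of side s shrinks by 2*floor(w/2) per pass. A minimal sketch of that arithmetic (my own back-of-envelope assumption about ideal square holes, not a terra function):
# passes needed to fill a square hole of side s with window width w,
# assuming each pass fills a border floor(w/2) cells deep on every side
passes_needed <- function(s, w) ceiling(s / (2 * (w %/% 2)))
passes_needed(16, 5)   # 4, matching the focal(r2, 5) experiment
passes_needed(16, 9)   # 2, matching the focal(r2, 9) experiment
passes_needed(16, 17)  # 1: the window spans the whole hole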

Word graph in R using Igraph

I have a simple problem. I have 2 text documents and I want to make a graph of each document with igraph or another similar library. I actually want to make one large graph combining the subgraphs of the two documents. I tried the following code:
> library(tm)
> Topic1 = c("I love Pakistan")
> Topic2 = c("Pakistan played well")
> src = data.frame(Topic1,Topic2)
> mycorpus = Corpus(VectorSource(src))
> tdm = as.matrix(TermDocumentMatrix(mycorpus))
Now, I don't know what I should do next.
The first graph, for Topic1, will have 3 nodes and 3 edges; similarly, the second graph, for Topic2, will have 3 nodes and 3 edges. Now, I want to merge these two graphs into one graph. The large graph will then have 5 nodes and 6 edges, where the node Pakistan will have 4 edges.
Can anybody help me?
Finally, I got the solution myself. First, we make a graph of the terms from Topic1, using every term that has a frequency greater than 0.
tdm = as.matrix(TermDocumentMatrix(mycorpus))
# terms that occur in document 1
x = names(tdm[,1][tdm[,1]>0])
# connect every pair of co-occurring terms
k = t(combn(x,2))
g = graph_from_edgelist(k, directed = FALSE)
plot(g)
# same for document 2
x2 = names(tdm[,2][tdm[,2]>0])
k2 = t(combn(x2,2))
g2 = graph_from_edgelist(k2, directed = FALSE)
plot(g2)
# merge the two edge lists into one graph
E1 = get.edgelist(g)
E2 = get.edgelist(g2)
E3 = rbind(E1,E2)
g3 = graph_from_edgelist(E3, directed = FALSE)
plot(g3)
# drop duplicate edges and self-loops
g3 = simplify(g3, remove.multiple = TRUE, remove.loops = TRUE)
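For what it's worth, the rbind of edge lists can likely be skipped: igraph's union() merges named graphs by vertex name, so (an alternative sketch, not the original poster's method) the same combined graph should fall out of:
# union() matches vertices of named graphs by name
g3 <- simplify(union(g, g2), remove.multiple = TRUE, remove.loops = TRUE)
plot(g3)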

igraph: summarize each node's neighbours characteristics

With an igraph object I would like to capture some features of each node's neighbours, for example the average degree of its neighbours.
I came up with this code, which is inelegant and quite slow.
How should I rethink it for large and complex networks?
library(igraph)
# Toy example
set.seed(123)
g <- erdos.renyi.game(10,0.2)
# Loop to calculate average degree of each node's neighbourhood
s <- character(0)
for(i in 1:gorder(g)){
  n <- ego_size(g, nodes = i, order = 1, mindist = 1)
  node_of_interest <- unique(unlist(ego(g, nodes = i, order = 1, mindist = 1)))
  m <- mean(degree(g, v = node_of_interest, loops = TRUE, normalized = FALSE)-1)
  s <- rbind(s,data.frame(node = i, neighbours = n, mean = m))
}
Expanding the data structure with rbind in a loop can get quite slow in R, because at every step it needs to allocate the space for the new object and then copy it (see section 24.6 here). Also, you might be computing the degree of a node many times, if it is the neighbour of multiple nodes.
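The usual fix for the first issue, independent of igraph, is to collect the per-node results in a pre-allocated list and bind once at the end; a minimal sketch of that pattern applied to the loop above, though the vectorized rewrite below is better still:
# grow a list (cheap), then rbind once at the end
res <- vector("list", gorder(g))
for (i in seq_len(gorder(g))) {
  nbrs <- unique(unlist(ego(g, nodes = i, order = 1, mindist = 1)))
  res[[i]] <- data.frame(node = i,
                         neighbours = length(nbrs),
                         mean = mean(degree(g, v = nbrs) - 1))
}
s <- do.call(rbind, res)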
A possibly better alternative could be:
# add vertex id (not really necessary)
V(g)$name <- V(g)
# add degree to the graph
V(g)$degree <- degree(g, loops = TRUE, normalized = FALSE)
# get a list of neighbours, for each node
g_ngh <- neighborhood(g, mindist = 1)
# write a function that gets the means
get.mean <- function(x){
  mean(V(g)$degree[x]-1)
}
# apply the function, add result to the graph
V(g)$av_degr_nei <- sapply(g_ngh, get.mean)
# get data into dataframe, if necessary
d_vert_attr <- as_data_frame(g, what = "vertices")
d_vert_attr
name degree av_degr_nei
1 1 0 NaN
2 2 1 2.0000000
3 3 2 1.0000000
4 4 1 1.0000000
5 5 2 1.0000000
6 6 1 1.0000000
7 7 3 0.6666667
8 8 1 0.0000000
9 9 1 0.0000000
10 10 0 NaN
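Note also that igraph has a built-in for exactly this statistic: knn() returns, per vertex, the average degree of its neighbours. A quick cross-check (up to the -1 adjustment the answer applies to each degree):
# knn()$knn is the average neighbour degree per vertex;
# isolated vertices come back as NaN, as in the table above
avg_nbr_degree <- knn(g)$knn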

Prediction with lda in R: Warning message: 'newdata' had 1600 rows but variables found have 200 rows

I am new to R. I am trying to use lda to classify all points in a generated grid. The training set is two point groups randomly generated using rmvnorm(n, mean, sigma). Here is my code:
# number of samples
n=100;
# parameters: G1
meanG1 = matrix(
  c(2, 2),        # the data elements
  nrow=1,         # number of rows
  ncol=2,         # number of columns
  byrow = TRUE)   # fill matrix by rows
sigmaG1 = matrix(
  c(1,0,0,1),     # the data elements
  nrow=2,         # number of rows
  ncol=2,         # number of columns
  byrow = TRUE)   # fill matrix by rows
library(mvtnorm)
# Generating a matrix G1 with normal distribution
G1 = rmvnorm(n, meanG1, sigmaG1)
# parameters: G2
meanG2 = matrix(
  c(0, 0),            # the data elements
  nrow=1,             # number of rows
  ncol=2,             # number of columns
  byrow = TRUE)       # fill matrix by rows
sigmaG2 = matrix(
  c(1,0.75,0.75,1),   # the data elements
  nrow=2,             # number of rows
  ncol=2,             # number of columns
  byrow = TRUE)       # fill matrix by rows
# Generating a matrix G2 with normal distribution
G2 = rmvnorm(n, meanG2, sigmaG2)
# adding a column as a label = 1 to G1 matrix
G1 = cbind(G1, 1 )
# adding a column as a label = 2 to G2 matrix
G2 = cbind(G2, 2 )
# Concatenate both matrices
G = rbind(G1,G2)
# Transforming Matrix into dataFrame
bothGroupsWithLabel <- as.data.frame(G)
# Shuffling data row-wise
bothGroupsWithLabel <- bothGroupsWithLabel[sample(nrow(bothGroupsWithLabel)),]
# plotting the generated matrices
plot(c(G1[,1]),c(G1[,2]),col="red")
points(c(G2[,1]),c(G2[,2]),col="blue")
# Generating a grid
K = 40;
seqx1 = seq(min(G1[,1]),max(G1[,1]),length = K)
seqx2 = seq(min(G1[,2]),max(G1[,2]),length = K)
myGrid = expand.grid(z1=seqx1,z2=seqx2);
plot(myGrid[,1],myGrid[,2])
library(MASS)
# Creating a model
model.lda = lda(bothGroupsWithLabel[,3] ~bothGroupsWithLabel[,1]+bothGroupsWithLabel[,2] , data = bothGroupsWithLabel);
Ypred = predict(model.lda, newdata=myGrid);
Ypredgrid = Ypred$class
Here is a part of my data bothGroupsWithLabel
V1 V2 V3
69 2.0683949 0.5779272 1
53 2.1261046 2.0420350 1
118 -1.4502033 -1.4775360 2
148 1.1705251 1.5437296 2
195 0.3100763 -0.2594026 2
40 1.8573633 3.7717020 1
and
myGrid
z1 z2
1 0.1048024 -0.2034172
2 0.2227540 -0.2034172
3 0.3407055 -0.2034172
4 0.4586571 -0.2034172
5 0.5766086 -0.2034172
6 0.6945602 -0.2034172
My grid consists of 40*40 points, hence the myGrid data frame has 1600 rows and 2 columns. The bothGroupsWithLabel data frame has 200 rows and 3 columns: the first two columns are the coordinates of the points and the third column holds the labels. My problem is that when I call predict(model.lda, newdata=myGrid) I get this warning message:
Warning message:
'newdata' had 1600 rows but variables found have 200 rows
What am I missing here? Can anyone please help me?
The problem is the way that you generated your model. When using a formula and data=..., it is better to just use the variable names. For this to work, you must also make the variable names match in newdata. So, when you create myGrid, add the line:
names(myGrid) = c("V1", "V2")
and then make your last few lines be:
model.lda = lda(V3 ~ V1 + V2 , data = bothGroupsWithLabel);
Ypred = predict(model.lda, newdata=myGrid);
Ypredgrid = Ypred$class
That should get what you want.
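To sanity-check the fix visually, one option (my own sketch, not part of the original answer) is to colour the grid by predicted class and overlay the training points:
# colour each grid point by predicted class (class 1 -> pink, class 2 -> lightblue)
plot(myGrid$V1, myGrid$V2, col = c("pink", "lightblue")[as.numeric(Ypredgrid)],
     pch = 20, cex = 0.5)
points(G1[,1], G1[,2], col = "red")
points(G2[,1], G2[,2], col = "blue")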

Cleaning a graph with R package "igraph"

I need to "clean" a graph in R. By cleaning, I mean that I need to delete all nodes which are not linked with a specific one. For instance, if my graph has 4 nodes, with these edges:
1 to 3
1 to 2
4 to 2
I want to keep only the nodes linked with node 1, plus node 1 itself; so to say, I need to delete node 4.
Is there any way with igraph to build an algorithm which can do that for very, very big graphs (like more than 1000 nodes and 1,000,000 edges)?
Use subcomponent and induced.subgraph:
edges_df <- data.frame(from = c(1, 1, 4), to = c(3, 2, 2))
g1 <- graph.data.frame(edges_df, directed = TRUE)
g2 <- induced.subgraph(g1, subcomponent(g1, "1", mode = "out"))
As for the "big" graphs: 1000 is not so big. On my laptop:
system.time({
  g3 <- graph.full(n = 1000, directed = TRUE)
  g4 <- induced.subgraph(g3, subcomponent(g3, 1, mode = "out"))
})
# user system elapsed
# 0.47 0.10 0.57
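To get closer to the sizes in the question, the same approach can be timed on a random graph with a million edges; a sketch, assuming igraph's sample_gnm() (a numeric vertex id here, since this graph is unnamed):
set.seed(1)
# 10,000 nodes, 1,000,000 directed edges
g_big <- sample_gnm(n = 1e4, m = 1e6, directed = TRUE)
system.time(
  g_sub <- induced.subgraph(g_big, subcomponent(g_big, 1, mode = "out"))
)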
