Generating random graph in r

I would like to generate a random graph in R using any package.
The desired output would be a two column matrix with the first column listing agents and the second column their connections of the following form:
1 3
1 4
1 6
1 7
2 2
2 5
3 9
3 11
3 32
3 43
3 2
4 5
I would like to be able to specify the average degree and minimum and maximum number of contacts.
What is the easiest way of doing this?

Since you don't specify a need for anything other than just a graph, we can do this very simply:
actor <- sample(1:4, 10, replace=TRUE)
receiver <- sample(3:43, 10, replace=TRUE)
graph <- cbind(actor,receiver)
If you want something more specific, have a look at igraph, for instance:
library(igraph)
graph <- erdos.renyi.game(21, 0.3, type = "gnp",
directed = FALSE, loops = FALSE)
# here 0.3 is the probability of a tie and 21 is the number of nodes
# type = "gnp" selects the G(n, p) model (the original call passed both type names)
# this is a one-mode network
Or use the bipartite package, which focuses specifically on two-mode networks:
library(bipartite)
web <- genweb(N1 = 5, N2 = 10, dens = 2)
web2edges(web,return=TRUE)
# here N1 is the number of nodes in set 1 and N2 the number of nodes in set 2
# and dens the average number of ties per node
There are many things to take into account, for instance whether you want to constrain the degree distribution, the probability of ties between agents, etc.
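To get closer to what the question asks for (a two-column edge list with a chosen average degree), one option is a G(n, m) random graph, where the number of edges m fixes the average degree at 2m/n. The following is a minimal sketch, assuming igraph 1.0 or later (for sample_gnm() and as_edgelist()); the values of n and avg_degree are illustrative. Minimum and maximum degree are not enforced by this model, so the sketch only reports them.

```r
library(igraph)

set.seed(42)            # make the example reproducible
n <- 50                 # number of agents (illustrative value)
avg_degree <- 4         # desired average degree (illustrative value)

# In an undirected graph the average degree is 2 * m / n,
# so m = n * avg_degree / 2 edges give the requested average.
m <- n * avg_degree / 2
g <- sample_gnm(n, m, directed = FALSE, loops = FALSE)

# Two-column matrix with one row per tie
edges <- as_edgelist(g)
head(edges)

# Inspect the realised degrees; min and max are NOT constrained here
deg <- degree(g)
c(mean = mean(deg), min = min(deg), max = max(deg))
```

If hard bounds on the number of contacts are needed, drawing a bounded degree sequence and passing it to a degree-sequence generator such as sample_degseq() is a possible next step.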

Related

problem with plotting several plots with mcmc_areas() function

So, I wanted to create several plots of my posterior sample using bayesplot.
My data contains several draws divided into features (the column "correlation" contains the posterior values):
head(cors_feat[cors_feat$feature==1,])
feature correlation
1 1 -0.5002517
2 1 0.1964202
3 1 -0.8603918
4 1 -0.7111672
5 1 0.1760033
6 1 0.2788863
head(cors_feat[cors_feat$feature==2,])
feature correlation
40001 2 -0.7149123
40002 2 -0.7666210
40003 2 -0.7542937
40004 2 -0.2094619
40005 2 0.4089077
40006 2 -0.8550481
I loop through the features, and in the loop I use the function mcmc_areas() to create the plots:
lst <- list()
for (i in 1:length(unique(cors_feat$feature))){
lst <- c(lst,mcmc_areas(cors_feat[cors_feat$feature==i,], pars = c("correlation"), prob = 0.95))
}
When I try to use the plot_grid function from the "cowplot" library I get the following error:
plot_grid(plotlist = lst)
There were 50 or more warnings (use warnings() to see the first 50)
And most importantly, I cannot see any plot.
How can I solve this issue?
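One possible cause, offered as an assumption rather than a confirmed diagnosis: if the plot objects are classed lists (as ggplot objects have traditionally been), then c(lst, plot) splices the components of each plot into lst instead of appending the plot itself. The sketch below reproduces that behaviour in base R with a stand-in object called fake_plot, which is hypothetical and not part of bayesplot or cowplot.

```r
# Stand-in for a plot object: a classed list (assumption; no plotting
# package is needed to show the effect)
fake_plot <- structure(list(data = 1:3, layers = list(), theme = list()),
                       class = "fake_plot")

# Pattern from the question: c() splices the object's components into the list
flattened <- list()
for (i in 1:2) flattened <- c(flattened, fake_plot)
length(flattened)   # counts individual components, not plots

# Assigning by index keeps each object intact
kept <- list()
for (i in 1:2) kept[[i]] <- fake_plot
length(kept)        # one entry per object
```

Under that assumption, the loop in the question would store each plot with lst[[i]] <- mcmc_areas(...) before calling plot_grid(plotlist = lst).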

Hierarchical clustering and k means

I want to run a hierarchical cluster analysis. I am aware of the hclust() function but not how to use this in practice; I'm stuck with supplying the data to the function and processing the output.
The main issue is that I would like to cluster on a given measurement.
I would also like to compare the hierarchical clustering with that produced by kmeans(). Again I am not sure how to call this function or use/manipulate the output from it.
My data are similar to:
df<-structure(list(id=c(111,111,111,112,112,112), se=c(1,2,3,1,2,3),t1 = c(1, 2, 1, 1,1,3),
t2 = c(1, 2, 2, 1,1,4), t3 = c(1, 0, 0, 0,2,1), t4 = c(2, 5, 7, 7,1,2),
t5 = c(1, 0, 1, 1,1,1),t6 = c(1, 1, 1, 1,1,1), t7 = c(1, 1, 1 ,1,1,1), t8=c(0,0,0,0,0,0)), row.names = c(NA,
6L), class = "data.frame")
I would like to run the hierarchical cluster analysis to identify the optimum number of clusters.
How can I run clustering based on a predefined measurement - in this case for example to cluster measurement number 2?

For hierarchical clustering there is one essential element you have to define: the method for computing the distance between clusters (the linkage). Clustering involves a fair amount of judgment, so you have to choose the number of clusters based on how the data points are distributed. The code below shows how to do this. We will compare three linkage methods using your data df and the function hclust():
The first method is average linkage, which takes the distance between two clusters to be the mean of all pairwise distances between their points. We omit the first variable, as it is an id:
#Method 1
hc.average <- hclust(dist(df[,-1]),method='average')
The second method is complete linkage, which uses the largest pairwise distance between the points of two clusters:
#Method 2
hc.complete<- hclust(dist(df[,-1]),method='complete')
The third method is single linkage, which uses the smallest pairwise distance between the points of two clusters:
#Method 3
hc.single <- hclust(dist(df[,-1]),method='single')
With all three models fitted, we can analyze the groups.
We can choose the number of clusters by cutting the hierarchical tree at a given height: at the largest height there is only one cluster, containing the whole dataset. It is common to choose an intermediate value for the height.
With the average method, a height of 3 produces four groups and a height of around 4.5 produces two groups:
plot(hc.average, xlab='')
Output: (dendrogram image not included)
With the complete method the results are similar, but the scale of the height axis has changed.
plot(hc.complete, xlab='')
Output: (dendrogram image not included)
Finally, the single method produces a different grouping. There are three groups, and even with an intermediate choice of height you will always get that number of clusters:
plot(hc.single, xlab='')
Output: (dendrogram image not included)
You can use whichever method you prefer to assign clusters to your data with the cutree() function, where you pass the model object and the number of clusters. One way to judge clustering performance is to check how homogeneous the groups are; that depends on the researcher's criteria. Next, here is how to add the cluster to your data. I will choose the last model and three groups:
#Add cluster
df$Cluster <- cutree(hc.single,k = 3)
Output:
id se t1 t2 t3 t4 t5 t6 t7 t8 Cluster
1 111 1 1 1 1 2 1 1 1 0 1
2 111 2 2 2 0 5 0 1 1 0 2
3 111 3 1 2 0 7 1 1 1 0 2
4 112 1 1 1 0 7 1 1 1 0 2
5 112 2 1 1 2 1 1 1 1 0 1
6 112 3 3 4 1 2 1 1 1 0 3
The function cutree() also has an argument called h, where you can set the height discussed above instead of the number of clusters k.
As for clustering on a particular measurement, you could scale your data excluding the desired variable, so that this variable is on a different scale and has more influence on the results of your clustering.
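The question also asks how to compare the hierarchical result with kmeans(), which the answer above does not cover. A minimal sketch, assuming three clusters and the same columns as in the hclust() calls (everything except id); the seed and nstart value are arbitrary choices. Cluster labels are arbitrary in both methods, so the cross-tabulation shows whether the groupings agree, not whether the numbers match.

```r
# Data from the question
df <- data.frame(
  id = c(111, 111, 111, 112, 112, 112),
  se = c(1, 2, 3, 1, 2, 3),
  t1 = c(1, 2, 1, 1, 1, 3), t2 = c(1, 2, 2, 1, 1, 4),
  t3 = c(1, 0, 0, 0, 2, 1), t4 = c(2, 5, 7, 7, 1, 2),
  t5 = c(1, 0, 1, 1, 1, 1), t6 = c(1, 1, 1, 1, 1, 1),
  t7 = c(1, 1, 1, 1, 1, 1), t8 = c(0, 0, 0, 0, 0, 0)
)
features <- df[, -1]   # drop the id column, as in the answer

# Hierarchical clustering (single linkage), cut into 3 groups
hc <- hclust(dist(features), method = "single")
hc_cluster <- cutree(hc, k = 3)

# k-means with the same number of groups; nstart tries several random starts
set.seed(1)
km <- kmeans(features, centers = 3, nstart = 10)

# Rows: hclust groups, columns: kmeans groups
comparison <- table(hclust = hc_cluster, kmeans = km$cluster)
comparison
```

The assigned cluster for each row is in km$cluster, and km$centers holds the cluster means, which can help when interpreting the groups.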

Extract graphs based on identifier and calculate network measures in igraph

I want to separately analyze groups within a network. For example, the UK faculty data in the igraphdata package has some network data with group information on the node level.
library(igraph)
library(igraphdata)
data("UKfaculty")
V(UKfaculty)$Group
I want to extract networks based on the 4 groups and run a few calculations on the extracted graph (density, average degree, diameter, clustering coefficient, etc.) and store this information based on the groups in a dataframe. I want to calculate the measures only based on the nodes within a group, not on the whole network level (e.g. calculating only centrality based on connections in group 1, not taking connections to other groups into account).
Group density diameter
1 x x
2 x x
3 x x
Any idea how to efficiently do this?

You can use induced_subgraph to extract the subgraphs based on a list of vertices for every group.
library(igraph)
library(igraphdata)
data("UKfaculty")
ig <- UKfaculty
# `list` of vertices for every group
idx <- split(V(ig), V(ig)$Group)
# Create subgraphs based on the `list` of vertices
lst <- lapply(idx, function(v) induced_subgraph(ig, v))
It's then straightforward to calculate any subgraph-specific metrics, e.g.
do.call(rbind, lapply(lst, function(ig)
data.frame(
Group = unique(V(ig)$Group),
density = edge_density(ig),
diameter = diameter(ig))))
# Group density diameter
#1 1 0.3001894 21
#2 2 0.3561254 12
#3 3 0.2807018 14
#4 4 1.0000000 12
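The same pattern extends to the other measures mentioned in the question (average degree, clustering coefficient). The sketch below is an illustration on a toy random graph with made-up group labels rather than on UKfaculty, so that it runs without igraphdata; the column names avg_degree and clustering are illustrative choices.

```r
library(igraph)

set.seed(1)
g <- sample_gnp(20, 0.4)                 # toy undirected graph (not UKfaculty)
V(g)$Group <- rep(1:4, each = 5)         # hypothetical group labels

# Vertex ids per group, then one induced subgraph per group
idx <- split(seq_len(vcount(g)), V(g)$Group)
subs <- lapply(idx, function(v) induced_subgraph(g, v))

# One row of within-group measures per subgraph
res <- do.call(rbind, lapply(names(subs), function(grp) {
  sg <- subs[[grp]]
  data.frame(
    Group      = grp,
    density    = edge_density(sg),
    avg_degree = mean(degree(sg)),
    diameter   = diameter(sg),
    clustering = transitivity(sg, type = "global")  # NaN if there are no connected triples
  )
}))
res
```

Because each measure is computed on the induced subgraph, ties to vertices in other groups are ignored, which is what the question asks for.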

Calculate furthest distance in given time or best time for given distance

I have imported data from my GPS tracker and I am trying to figure out how best to calculate the furthest distance run in a given time (e.g. 12 minutes) or the best time for a given distance (e.g. 5 miles). Given that the observations are taken at unequal intervals and my speed is not constant, I have data like the table below:
x <- read.table(header=T, sep="", stringsAsFactors = FALSE,text="
time dist
4 3
5 4
5 6
3 2
5 5
4 5
4 3
4 2
5 6")
My best attempt so far is to generate a new dataset where time advances by one unit per row. It is then relatively easy to calculate the furthest distance in a given time. The downsides are that a) I would need to repeat the same logic for the best time (generating data with unit distance), and b) it seems quite a sub-optimal solution for data with thousands of data points.
# Generate data frame where each row represents one unit of time
z_adj <- data.frame(
time = unlist(sapply(x$time, function(s) rep(s, each = s))),
dist = unlist(sapply(seq_along(x$dist), function(s) rep(x$dist[s], each = x$time[s])))
)
z_adj$seq_time <- seq_along(z_adj$time)
z_adj$time_dist <- z_adj$dist / z_adj$time
# Furthest distance given time
# Time 10
z_adj$in_t10 <- sapply(z_adj$seq_time, function(s) sum(z_adj$dist[s:(s+9)]))
z_adj$in_t10[which(z_adj$in_t10 == max(z_adj$in_t10, na.rm = T))]
# Fastest time given distance
# ... would need to do the above again with constant distance :/
Is there a more straightforward approach to accomplish this?

You could use something like this:
x <- read.table(header=T, sep="", stringsAsFactors = FALSE,text="
time dist
4 3
5 4
5 6
3 2
5 5
4 5
4 3
4 2
5 6")
# Add starting point and cumulative time/distance
x <- rbind(c(0,0), x)
x$total_time <- cumsum(x$time)
x$total_dist <- cumsum(x$dist)
# function to interpolate and calculate lagging differences
foo <- function(x, y, n) {
interpolation <- approx(x, y, xout = seq(min(x), max(x)))
diff(interpolation$y, lag = n)
}
# Max distance in ten units of time
max(foo(x$total_time, x$total_dist, 10))
# Min time for ten units of distance
min(foo(x$total_dist, x$total_time, 10))
BTW, in your code you should sum over z_adj$time_dist instead of z_adj$dist to get the correct result.
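As a check on that remark, the following sketch compares the expanded-rows approach from the question (summing distance per unit of time) with the interpolation approach from the answer, for a window of ten time units. The variable names per_unit, windows, cum_time and cum_dist are illustrative; the data are those from the question.

```r
x <- data.frame(time = c(4, 5, 5, 3, 5, 4, 4, 4, 5),
                dist = c(3, 4, 6, 2, 5, 5, 3, 2, 6))

# Approach from the question, with the fix: one entry per unit of time,
# each holding the distance covered in that unit (dist / time)
per_unit <- rep(x$dist / x$time, times = x$time)
windows <- sapply(seq_along(per_unit),
                  function(s) sum(per_unit[s:(s + 9)]))  # NA when the window runs past the end
best_expanded <- max(windows, na.rm = TRUE)

# Interpolation approach from the answer: cumulative distance at each
# whole time unit, then differences ten units apart
cum_time <- c(0, cumsum(x$time))
cum_dist <- c(0, cumsum(x$dist))
interp <- approx(cum_time, cum_dist, xout = seq(min(cum_time), max(cum_time)))
best_interp <- max(diff(interp$y, lag = 10))

# Both approaches give the same maximum distance
all.equal(best_expanded, best_interp)
```

Both versions assume constant speed within each recorded segment, which is what linear interpolation of the cumulative totals amounts to.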

Applying ezANOVA error work-around to Long Format data

I have a similar problem as described here:
https://stats.stackexchange.com/questions/58435/repeated-measures-error-in-r-ezanova-using-more-levels-than-subjects-balanced-d
Here is an example of what my dataframe looks like:
Participant Visual Audio StimCondition Accuracy
1 Bottom Circle 1st 2 Central Beeps AO2 0.92
1 SIM Circle Left Beep AO2 0.86
2 Bottom Circle 1st 2 Central Beeps CT4 0.12
2 SIM Circle Left Beep CT4 0.56
I have 3 Visual conditions, 5 Audio conditions & 5 StimConditions & 12 participants exposed to all conditions.
When I run the following ezANOVA:
Model <- ezANOVA(data = Shaped.means, dv = .(Accuracy), wid = .(Participant), within = .(Visual, Audio, StimCondition), type = 3, detailed = TRUE)
I get the same error as the linked question above. I have tried changing Type to equal 1 and it does return the output but minus the Sphericity Test.
I've tried to apply the solution to the linked question to my dataset but as mine is in Long Format I'm a bit lost as to what exactly I need to do to achieve the desired stats.
I'll keep playing with it my end but if anyone could help in the mean time it would be much appreciated.
Thanks.

Following the linked question, you don't have to change much. Assuming your dataset is exactly as you describe, the following should work for you.
Let's first create a dataset to reflect your description
set.seed(123) ## make reproducible
N <- 12 ## number of Participants
S <- 5 ## number of StimCondition groups
V <- 3 ## number of Visual groups
A <- 5 ## number of Audio groups
Accuracy <- abs(round(runif(N*V*S*A), 2)) ## simulated accuracy values in [0, 1], one per participant and condition
init.Df <- expand.grid(Participant=gl(N,1),
Visual=gl(V, 1),
Audio=gl(A, 1),
StimCondition=gl(S,1))
df <- cbind(init.Df, Accuracy)
Now we have a dataframe with 3 Visual conditions, 5 Audio conditions & 5 StimConditions & 12 participants exposed to all conditions. This should match the stage you are currently at. We can do the between-subjects call easily.
# If you just read in the data set and don't know how many subjects
# N <- length(unique(df$Participant))
fit <- lm(matrix(df[,c("Accuracy")], nrow=N) ~ 1)
For the factor component, this is the only real change. If you simply generate your model design, you can pass it to anova.
library(car)
# You can create your within design table
# You can get these values from your dataset as well
# V <- nlevels(df$Visual)
# A <- nlevels(df$Audio)
# S <- nlevels(df$StimCondition)
# If you want the labels with gl, you can use the levels function (e.g. labels=levels(df$Visual))
inDf <- expand.grid(Visual=gl(V, 1),
Audio=gl(A, 1),
StimCondition=gl(S,1))
# Test for Visual
anova(fit, M=~Visual, X=~1, idata=inDf, test="Spherical")
# Test for Audio
anova(fit, M=~Visual+Audio, X=~Visual, idata=inDf, test="Spherical")
# Test for Visual:Audio interaction
anova(fit, M=~Visual+Audio+Visual:Audio, X=~Visual+Audio, idata=inDf, test="Spherical")
#etc...
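One caveat when applying this to real long-format data: matrix(df[, "Accuracy"], nrow = N) only puts each participant in its own row if the rows are ordered with Participant varying fastest, then Visual, Audio and StimCondition, as expand.grid() produces them. A minimal sketch of sorting first, assuming the four columns are factors with the names used above; the simulated data and the shuffled object are only there to illustrate an arbitrary row order.

```r
set.seed(1)
long <- expand.grid(Participant = gl(12, 1), Visual = gl(3, 1),
                    Audio = gl(5, 1), StimCondition = gl(5, 1))
long$Accuracy <- round(runif(nrow(long)), 2)

# Simulate a data set whose rows arrive in arbitrary order
shuffled <- long[sample(nrow(long)), ]

# Sort so that Participant varies fastest, then Visual, Audio, StimCondition
ord <- with(shuffled, order(StimCondition, Audio, Visual, Participant))
sorted <- shuffled[ord, ]

# One row per participant, one column per Visual x Audio x StimCondition cell
N <- nlevels(sorted$Participant)
Y <- matrix(sorted$Accuracy, nrow = N)

# Sanity check: every entry in row i belongs to participant i
pid <- matrix(as.integer(sorted$Participant), nrow = N)
all(pid == row(pid))
```

With the rows sorted this way, the columns of Y line up with the rows of the inDf design table built by expand.grid() above, and Y can be used in the lm() call.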