Output of igraph clustering functions in R

I constructed a graph from a data frame using the igraph graph_from_data_frame function. My first two columns represent the edge list, and I have another column named "weight". There are several other attribute columns.
I then tried to find a community structure within my graph using cluster_fast_greedy.
data <- data %>% rename(weight = TH_LIEN_2)
graph <- graph_from_data_frame(data,directed=FALSE)
is_weighted(graph)
cluster_1 <- cluster_fast_greedy(graph, weights = NULL)
The output is a list of three (merges, modularity, membership), each containing some of my vertices.
However, the following returns "NULL":
cluster_1[["merges"]]
cluster_1[["modularity"]]
cluster_1[["membership"]]
(I believe cluster_1[["membership"]] is supposed to be a vector of integers indicating which cluster each vertex belongs to?)
I have tried different clustering methods (cluster_fast_greedy, cluster_label_prop, cluster_leading_eigen, cluster_spinglass, cluster_walktrap), with both weighted and unweighted graphs, and the output looks the same every time (the number of elements in the list varies from 1 to 4).
Does anyone have an idea of why it does that?
Thank you and have a nice day!
Cassandra

You should use the dollar sign $ to access the components of the communities object. For example
g <- make_full_graph(5) %du% make_full_graph(5) %du% make_full_graph(5)
g <- add_edges(g, c(1, 6, 1, 11, 6, 11))
fc <- cluster_fast_greedy(g)
and you will see
> str(fc)
Class 'communities' hidden list of 5
$ merges : num [1:14, 1:2] 3 4 5 1 12 13 15 11 7 8 ...
$ modularity: num [1:15] -6.89e-02 -4.59e-02 6.94e-18 6.89e-02 1.46e-01 ...
$ membership: num [1:15] 3 3 3 3 3 1 1 1 1 1 ...
$ algorithm : chr "fast greedy"
$ vcount : int 15
> fc$merges
[,1] [,2]
[1,] 3 2
[2,] 4 16
[3,] 5 17
[4,] 1 18
[5,] 12 14
[6,] 13 20
[7,] 15 21
[8,] 11 22
[9,] 7 9
[10,] 8 24
[11,] 10 25
[12,] 6 26
[13,] 27 19
[14,] 23 28
> fc$modularity
[1] -6.887052e-02 -4.591368e-02 6.938894e-18 6.887052e-02 1.460055e-01
[6] 1.689624e-01 2.148760e-01 2.837466e-01 3.608815e-01 3.838384e-01
[11] 4.297521e-01 4.986226e-01 5.757576e-01 3.838384e-01 -1.110223e-16
> fc$membership
[1] 3 3 3 3 3 1 1 1 1 1 2 2 2 2 2
> fc$algorithm
[1] "fast greedy"
> fc$vcount
[1] 15
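igraph also provides accessor functions for communities objects, which avoid relying on the internal list structure; note that in current igraph versions the [[ operator is redefined for communities objects to extract the members of a community by index, which is why indexing by a component name such as "merges" returns NULL. A minimal sketch using the fc object above:
membership(fc)   # integer vector giving the community of each vertex
sizes(fc)        # number of vertices in each community
modularity(fc)   # modularity score of the clustering
algorithm(fc)    # name of the algorithm, here "fast greedy"
fc[[1]]          # vertices belonging to community 1 (what [[ does for communities objects)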

Related

R is not ordering data correctly - skips E values

I am trying to order data by the column weightFisher. However, it is almost as if R does not treat the values written in e notation as small, because all the e values are skipped when I try to order from smallest to largest.
Code:
resultTable_bon <- GenTable(GOdata_bon,
weightFisher = resultFisher_bon,
weightKS = resultKS_bon,
topNodes = 15136,
ranksOf = 'weightFisher'
)
head(resultTable_bon)
#create Fisher ordered df
indF <- order(resultTable_bon$weightFisher)
resultTable_bonF <- resultTable_bon[indF, ]
what resultTable_bon looks like:
GO.ID Term Annotated Significant Expected Rank in weightFisher
1 GO:0019373 epoxygenase P450 pathway 19 13 1.12 1
2 GO:0097267 omega-hydroxylase P450 pathway 9 7 0.53 2
3 GO:0042738 exogenous drug catabolic process 10 7 0.59 3
weightFisher weightKS
1 1.9e-12 0.79744
2 7.9e-08 0.96752
3 2.5e-07 0.96336
what "ordered" resultTable_bonF looks like:
GO.ID Term Annotated Significant Expected Rank in weightFisher
17 GO:0014075 response to amine 33 7 1.95 17
18 GO:0034372 very-low-density lipoprotein particle re... 11 5 0.65 18
19 GO:0060710 chorio-allantoic fusion 6 4 0.35 19
weightFisher weightKS
17 0.00014 0.96387
18 0.00016 0.83624
19 0.00016 0.92286
As @bhas says, it appears to be working precisely as you want it to. Maybe it's the use of head() that's confusing you?
To put your mind at ease, try it with something simpler
dtf <- data.frame(a=c(1, 8, 6, 2)^-10, b=c(7, 2, 1, 6))
dtf
# a b
# 1 1.000000e+00 7
# 2 9.313226e-10 2
# 3 1.653817e-08 1
# 4 9.765625e-04 6
dtf[order(dtf$a), ]
# a b
# 2 9.313226e-10 2
# 3 1.653817e-08 1
# 4 9.765625e-04 6
# 1 1.000000e+00 7
Try the following:
resultTable_bon$weightFisher <- as.numeric(resultTable_bon$weightFisher)
Then:
resultTable_bonF <- resultTable_bon[order(resultTable_bon$weightFisher), ]
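The likely reason the as.numeric() step matters: GenTable() may return the p-value column as character, and order() on character strings sorts lexically rather than numerically. A small illustration with made-up values (not from the question's data):
p_chr <- c("0.00014", "1.9e-12", "2.5e-07")
p_chr[order(p_chr)]               # lexical order puts "0.00014" first
# [1] "0.00014" "1.9e-12" "2.5e-07"
p_num <- as.numeric(p_chr)
p_num[order(p_num)]               # numeric order: 1.9e-12 is smallest
# [1] 1.9e-12 2.5e-07 1.4e-04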

How can I tell a for loop in R to regenerate a sample if the sample contains a certain pair of species?

I am creating 1000 random communities (vectors) from a species pool of 128, with certain operations applied to each community and the result stored in a new vector. For simplicity, I have been practicing by writing code using 10 random communities from a species pool of 20. The problem is that there are a couple of pairs of species such that, if one of those pairs turns up in a random community, I need that community to be thrown out and a new one generated. I have been able to code it so that, if a pair is found in a community, that community (vector) is labeled NA. I also know how to tell the loop to skip that vector using the "next" command. But with both of these options, I do not get all of the communities that I need.
Here is my code using the NA option, but again that leaves me short of communities.
C<-c(1:20)
D<-numeric(10)
X<- numeric(5)
for (i in 1:10) {
  X <- sample(C, size = 5, replace = FALSE)
  if ("10" %in% X & "11" %in% X) X = NA else X = X
  if ("1" %in% X & "2" %in% X) X = NA else X = X
  print(X)
  D[i] <- sum(X)
}
print(D)
This is what my result looks like.
[1] 5 1 7 3 14
[1] 20 8 3 18 17
[1] NA
[1] NA
[1] 4 7 1 5 3
[1] 16 1 11 3 12
[1] 14 3 8 10 15
[1] 7 6 18 3 17
[1] 6 5 7 3 20
[1] 16 14 17 7 9
> print(D)
[1] 30 66 NA NA 20 43 50 51 41 63
Thanks so much!
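A minimal sketch of the regenerate-until-valid idea the title asks about (using repeat instead of NA or next; the forbidden pairs 10/11 and 1/2 are taken from the code above):
C <- 1:20
D <- numeric(10)
for (i in 1:10) {
  repeat {
    X <- sample(C, size = 5, replace = FALSE)
    # redraw if the sample contains both 10 and 11, or both 1 and 2
    if (!(all(c(10, 11) %in% X) || all(c(1, 2) %in% X))) break
  }
  print(X)
  D[i] <- sum(X)
}
print(D)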

Splitting a variable into equally sized groups

I have a continuous variable called Longitude (it corresponds to geographical longitude) that has 12465 unique values. I need to create a new variable called Longitude1024 that consists of the variable Longitude split into 1024 equally sized groups. I did that using the following function:
data$Longitude1024 <- as.factor( as.numeric( cut(data$Longitude,1024)))
However, the problem is that, when I use this function to create the new variable Longitude1024, the new variable ends up with only 651 unique values rather than 1024. Does anyone know what the problem is and how I could actually get a new variable with 1024 unique values?
Thanks a lot
Use rank, then scale it down. Here's an example with 10 groups:
x <- rnorm(124655)
g <- floor(rank(x) * 10 / (length(x) + 1))
table(g)
# g
# 0 1 2 3 4 5 6 7 8 9
# 12465 12466 12465 12466 12465 12466 12466 12465 12466 12465
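Applied to the question (a sketch; data and Longitude are the names used there), the same rank-and-scale idea would be:
# 1024 roughly equal-sized groups based on the rank of Longitude
data$Longitude1024 <- as.factor(
  floor(rank(data$Longitude) * 1024 / (length(data$Longitude) + 1))
)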
Short answer: try cut2 from the Hmisc package
Long answer
Example: split dat, which is 1000 unique values, into 100 equal groups of 10.
Doesn't work:
# dummy data
set.seed(321)
dat <- rexp(1000)
# all unique values
length(unique(dat))
[1] 1000
cut generates 100 levels
init_res <- cut(dat, 100)
length(unique(levels(init_res)))
[1] 100
But does not split the data into equally sized groups
init_grps <- split(dat, cut(dat, 100))
table(unlist(lapply(init_grps, length)))
0 1 2 3 4 5 6 7 9 10 11 13 15 17 18 19 22 23 24 25 27 37 38 44 47 50 63 71 72 77
42 9 8 4 1 3 1 3 2 1 2 1 1 1 2 1 1 1 2 2 2 1 1 1 1 1 1 2 1 1
Works with Hmisc::cut2
cut2 divides the vector into groups of equal length, as desired
require(Hmisc)
final_grps <- split(dat, cut2(dat, g=100))
table(unlist(lapply(final_grps, length)))
10
100
If you want, you can store the results in a data frame, for example
foobar <- do.call(rbind, final_grps)
head(foobar)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[0.000611,0.00514) 0.004345915 0.002192086 0.004849693 0.002911516 0.003421753 0.003159641 0.004855366 0.0006111574
[0.005137,0.01392) 0.009178133 0.005137309 0.008347482 0.007072484 0.008732725 0.009379002 0.008818794 0.0110489833
[0.013924,0.02004) 0.014283326 0.014356782 0.013923721 0.014290554 0.014895342 0.017992638 0.015608931 0.0173707930
[0.020041,0.03945) 0.023047527 0.020437743 0.026353839 0.036159321 0.024371834 0.026629812 0.020793695 0.0214221779
[0.039450,0.05912) 0.043379064 0.039450453 0.050806316 0.054778805 0.040093806 0.047228050 0.055058519 0.0446634954
[0.059124,0.07362) 0.069671018 0.059124220 0.063242564 0.064505875 0.072344089 0.067196661 0.065575249 0.0634142853
[,9] [,10]
[0.000611,0.00514) 0.002524557 0.003155055
[0.005137,0.01392) 0.008287758 0.011683228
[0.013924,0.02004) 0.018537469 0.014847937
[0.020041,0.03945) 0.026233400 0.020040981
[0.039450,0.05912) 0.041310471 0.058449603
[0.059124,0.07362) 0.063608022 0.066316782
Hope this helps
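For the question itself, the cut2 approach would be a one-liner (a sketch; data and Longitude are the names from the question):
require(Hmisc)
# quantile-based split of Longitude into 1024 groups of roughly equal size
data$Longitude1024 <- cut2(data$Longitude, g = 1024)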

Reduce the large dataset into smaller data set using R

I want to reduce a very large dataset with two variables into a smaller file. I need to find consecutive data points that have the same value, keep only the starting and ending points of each such run, and remove all the data points in between. For example, the sample dataset looks like the following:
363.54167 23.3699
363.58333 23.3699
363.625 0
363.66667 0
363.70833 126.16542
363.75 126.16542
363.79167 126.16542
363.83333 126.16542
363.875 126.16542
363.91667 0
363.95833 0
364 0
364.04167 0
364.08333 0
364.125 0
364.16667 0
364.20833 0
364.25 127.79872
364.29167 127.79872
364.33333 127.79872
364.375 127.79872
364.41667 127.79872
364.45833 127.79872
364.5 0
364.54167 0
364.58333 0
364.625 0
364.66667 0
364.70833 127.43202
364.75 135.44052
364.79167 135.25522
364.83333 135.12892
364.875 20.32986
364.91667 0
364.95833 0
Here, the first two points have the same value, i.e. 23.3699, so I will keep them as they are. I need to write a condition such that, if two or more consecutive data points have the same value, only the starting and ending data points are kept. The next two values also share the same value, i.e. 0, and I will keep those two. However, after that there are 5 data points with the same value; I want to keep just the two data points 363.70833 and 363.875 and remove the data points in between them. After that I will keep only the two data points with zero values, i.e. 363.91667 and 364.20833.
The sample output I am looking for is as follows:
363.54167 23.3699
363.58333 23.3699
363.625 0
363.66667 0
363.70833 126.16542
363.875 126.16542
363.91667 0
364.20833 0
364.25 127.79872
364.45833 127.79872
364.5 0
364.66667 0
364.70833 127.43202
364.75 135.44052
364.79167 135.25522
364.83333 135.12892
364.875 20.32986
364.91667 0
364.95833 0
If your data is in a dataframe DF with column names a and b, then
runs <- rle(DF$b)
firsts <- cumsum(c(0,runs$length[-length(runs$length)]))+1
lasts <- cumsum(runs$length)
edges <- unique(sort(c(firsts, lasts)))
DF[edges,]
gives
> DF[edges,]
a b
1 363.5417 23.36990
2 363.5833 23.36990
3 363.6250 0.00000
4 363.6667 0.00000
5 363.7083 126.16542
9 363.8750 126.16542
10 363.9167 0.00000
17 364.2083 0.00000
18 364.2500 127.79872
23 364.4583 127.79872
24 364.5000 0.00000
28 364.6667 0.00000
29 364.7083 127.43202
30 364.7500 135.44052
31 364.7917 135.25522
32 364.8333 135.12892
33 364.8750 20.32986
34 364.9167 0.00000
35 364.9583 0.00000
rle gives the lengths of the groups that have the same value (floating point precision may be an issue if you have more decimal places). firsts and lasts give the row index of the first row of a group and the last row of a group, respectively. Put the indexes together, sort them, and get rid of duplicates (since a group of size one will list the same row as the first and last) and then index DF by the row numbers.
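To see what rle() contributes here, a quick sketch on the first few values of the b column from the sample data:
runs <- rle(c(23.3699, 23.3699, 0, 0, 126.16542, 126.16542, 126.16542))
runs$lengths   # 2 2 3: two rows of 23.3699, two of 0, three of 126.16542
runs$values    # the distinct run values: 23.3699, 0, 126.16542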
I'd use rle here (no surprise to those who know me :-) . Keeping in mind that you will want to check for approximate equality to avoid floating-point rounding problems, here's the concept. rle will return two sequences, one of which tells you how many times a value is repeated and the other tells you the value itself. Since you want to keep only single or double values, we'll essentially "shrink" all sequence values which are longer.
Edit: I recognize that this is relatively clunky code and a gentle touch with melt/cast should be far more efficient. I just liked doing this.
df <- cbind(1:20, sample(1:3, 20, replace = TRUE))  # toy data: an index column and a value column
rdf <- rle(df[, 2])                                 # run-length encoding of the value column
lenfoo <- rdf$lengths
cfoo <- cumsum(lenfoo)                              # row index of the last row in each run
repfoo <- ifelse(lenfoo == 1, 1, 2)                 # output that row once for singletons, twice otherwise
outfoo <- matrix(ncol = 2)                          # note: this seeds the result with an all-NA first row
for (j in seq_along(cfoo)) outfoo <- rbind(outfoo, matrix(rep(df[cfoo[j], ], times = repfoo[j]), ncol = 2, byrow = TRUE))
Rgames> df
[,1] [,2]
[1,] 1 2
[2,] 2 2
[3,] 3 3
[4,] 4 3
[5,] 5 3
[6,] 6 3
[7,] 7 3
[8,] 8 2
[9,] 9 2
[10,] 10 3
[11,] 11 1
[12,] 12 2
[13,] 13 2
[14,] 14 3
[15,] 15 1
[16,] 16 2
[17,] 17 1
[18,] 18 2
[19,] 19 3
[20,] 20 1
Rgames> outfoo
[,1] [,2]
[1,] NA NA
[2,] 2 2
[3,] 2 2
[4,] 7 3
[5,] 7 3
[6,] 9 2
[7,] 9 2
[8,] 10 3
[9,] 11 1
[10,] 13 2
[11,] 13 2
[12,] 14 3
[13,] 15 1
[14,] 16 2
[15,] 17 1
[16,] 18 2
[17,] 19 3
[18,] 20 1
x = tapply(df[[1]], df[[2]], range)
gives the values
cbind(unlist(x, use.names=FALSE), as.numeric(rep(names(x), each=2)))
gets a matrix. More explicitly, and avoiding coercion to / from character vectors
u = unique(df[[2]])
rng = sapply(split(df[[1]], match(df[[2]], u)), range)
cbind(as.vector(rng), rep(u, each=2))
If the data is very large then sort by df[[1]] and find the first (min) and last (max) values of each element of df[[2]]; combine these
df = df[order(df[[1]]),]
res = rbind(df[!duplicated(df[[2]]),], df[!duplicated(df[[2]], fromLast=TRUE),])
res[order(res[[2]]),]
perhaps setting the row names of the subset to NULL.

Reducing crosstab size by frequency of responses

Excuse my neophyte question - I'm new to R and pretty unversed in statistics.
I have a simple contingency table representing the number of queries per user for a group of web pages gathered over a period of time. There are about 15,000 total observations. This works out to a table of around 100 users viewing 50 groups of pages.
Since a 50x100 matrix is unwieldy to visualize, I would like to present a subset of this table sorted by the largest aggregates - either column (page groups), row (users), or perhaps even the largest row-by-column counts. For example I might choose the top 20 users and the top 10 groups, or the top 99% row-by-column counts.
Ideally, I end up with a table that still represents the major interactions between the most represented users and the page groups.
Is this a reasonable approach? Will I lose a large amount of statistical significance, and is there a way to compare the significance before and after?
I must admit that I still don't know how to sort and subset a table based on two factors without resorting to row-by-column manipulation.
S <- trunc(10*runif(1000) )
R <- trunc(10*runif(1000))
RStab <- table(R, S)
str(RStab)
# 'table' int [1:10, 1:10] 6 12 10 13 10 7 8 6 9 10 ...
# - attr(*, "dimnames")=List of 2
# ..$ R: chr [1:10] "0" "1" "2" "3" ...
# ..$ S: chr [1:10] "0" "1" "2" "3" ...
rowSums( RStab[ order(rowSums(RStab)) , order(colSums(RStab) ) ])
# 8 0 1 3 2 5 9 4 6 7
# 90 94 96 99 100 101 101 103 107 109
colSums( RStab[ order(rowSums(RStab)) , order(colSums(RStab) ) ])
#  6  0  3  5  7   2   4   8   9   1
# 80 91 94 96 98 100 106 109 112 114
The 5 highest marginals for row and columns:
RStab[ order(rowSums(RStab)) , order(colSums(RStab) ) ][ 6:10, 6:10]
#-------------
S
R 2 4 8 9 1
5 14 10 12 10 12
9 6 8 9 10 13
4 10 10 8 8 18
6 9 12 12 17 8
7 14 10 14 12 9
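For the sizes in the question (top 20 users by top 10 page groups), the same marginal-total idea can be applied directly; tab here stands for the users-by-page-groups contingency table and is an assumed name:
# keep the 20 users (rows) and 10 page groups (columns) with the largest totals
top_users  <- order(rowSums(tab), decreasing = TRUE)[1:20]
top_groups <- order(colSums(tab), decreasing = TRUE)[1:10]
tab[top_users, top_groups]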
It does sound as though you might be a little shaky on the statistical issues. Can you explain more fully what you mean by "losing a large amount of significance"? What sort of statistical test were you thinking of?
