R igraph: how to find the largest community?

I use fastgreedy.community to generate a community object, which contains 15 communities. But how can I extract the largest community among these 15 communities?
Community sizes
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
1862 1708  763  974 2321 1164  649 1046    2    2    2    2    2    2    2
In this example, I want to extract community 5 for further use.
Thanks!

Assuming that your community object is named community.object, which(membership(community.object) == x) extracts the indices of the vertices in community x. If you want the largest community, you can set x to which.max(sizes(community.object)). Finally, you can use induced.subgraph to extract that particular community into a separate graph:
> x <- which.max(sizes(community.object))
> subg <- induced.subgraph(graph, which(membership(community.object) == x))
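Put together, a self-contained sketch (using a random graph as stand-in data, since the original graph isn't shown):
library(igraph)
set.seed(42)
g <- erdos.renyi.game(100, 0.05)        # stand-in example graph
community.object <- fastgreedy.community(g)
x <- which.max(sizes(community.object)) # index of the largest community
subg <- induced.subgraph(g, which(membership(community.object) == x))
vcount(subg)                            # number of vertices in the largest community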

Related

Return values with matching conditions in R

I would like to return values from another column that match certain conditions, based on a set of cut scores. If a cut score is not present in the variable, I would like to grab the closest larger value. Here is a snapshot of the dataset:
ids <- c(1,2,3,4,5,6,7,8,9,10)
scores.a <- c(512,531,541,555,562,565,570,572,573,588)
scores.b <- c(12,13,14,15,16,17,18,19,20,21)
data <- data.frame(ids, scores.a, scores.b)
> data
   ids scores.a scores.b
1    1      512       12
2    2      531       13
3    3      541       14
4    4      555       15
5    5      562       16
6    6      565       17
7    7      570       18
8    8      572       19
9    9      573       20
10  10      588       21
cuts <- c(531, 560, 571)
I would like to grab the scores.b value corresponding to the first cut score (531), which is 13. Then I'd grab the scores.b value corresponding to the second cut score (560); it does not occur in scores.a, so I take the closest larger scores.a value, 562, whose corresponding scores.b value is 16. Lastly, for the third cut score (571), I would like to get 19, the value corresponding to the closest larger scores.a value (572).
Here is what I would like to get.
      scores.b
cut.1       13
cut.2       16
cut.3       19
Any thoughts?
Thanks
We can use a rolling join; roll = -Inf rolls backwards, matching each cut to the next (closest larger) scores.a:
library(data.table)
setDT(data)[data.table(cuts = cuts), .(ids = ids, cuts, scores.b),
            on = .(scores.a = cuts), roll = -Inf]
#    ids cuts scores.b
# 1:   2  531       13
# 2:   5  560       16
# 3:   8  571       19
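For contrast, roll = Inf would instead carry the previous smaller scores.a forward; a quick sketch with the same data (expected matches written out from the table above):
setDT(data)[data.table(cuts = cuts), .(ids, cuts, scores.b),
            on = .(scores.a = cuts), roll = Inf]
#    ids cuts scores.b
# 1:   2  531       13
# 2:   4  560       15
# 3:   7  571       18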
Another option is findInterval from base R, after negating the values and reversing their order:
with(data, scores.b[rev(nrow(data) + 1 - findInterval(rev(-cuts), rev(-scores.a)))])
#[1] 13 16 19
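Negating the values and reversing makes the sequence increasing again, so findInterval's "last value <= x" logic picks the closest larger score. The intermediate steps with the data above:
rev(-cuts)                  # -571 -560 -531 (negated cuts, now increasing)
idx <- findInterval(rev(-cuts), rev(-scores.a))
idx                         # 3 6 9
rev(nrow(data) + 1 - idx)   # 2 5 8: rows of 531, 562, 572 in the original order
scores.b[c(2, 5, 8)]        # 13 16 19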
A third option keeps the other columns, which shows how the matches are made more clearly:
df1 <- data[match(seq_along(cuts), findInterval(data$scores.a, cuts)), ]
rownames(df1) <- paste("cuts", seq_along(cuts), sep = ".")
> df1
       ids scores.a scores.b
cuts.1   2      531       13
cuts.2   5      562       16
cuts.3   8      572       19

Hierarchical clustering with specific number of data in each cluster

I'm clustering a set of words using hierarchical clustering. I want each cluster to contain a fixed number of words, for example 2 or 3 words.
I'm trying to modify existing code for this clustering. I also set the values at max(d) to Inf:
Lm[min(d), ] <- sl
Lm[, min(d)] <- sl
if (length(cluster) > 2){ # if it's already a cluster with more than 2 points,
  # then don't cluster them again, by setting the values to Inf
  Lm[min(d), min(d)] <- Inf
  Lm[max(d), max(d)] <- Inf
  Lm[max(d), ] <- Inf
  Lm[, max(d)] <- Inf
  Lm[min(d), ] <- Inf
  Lm[, min(d)] <- Inf
}
However, it doesn't give me the expected results, and I was wondering whether this is the correct approach. How can I do this kind of constrained clustering in R?
Example of the results that I got:
row   V1   V2
166 -194  -38
167  166   -1
...
240  239  239
241  240  240
242  241  241
243  242  242
244  243  243
This will be tough to optimize, and it can produce arbitrarily bad results, because your size constraint goes against the principles of clustering.
Consider the one-dimensional data set -100, -1, 1, 100, and assume you want to limit the cluster size to 2 elements. Hierarchical clustering will first merge -1 and +1 because they are closest. Once they have reached the maximum size, the only remaining option is to cluster -100 and +100: the worst possible result, since this cluster is as wide as the entire data set.
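A toy run of that example (a sketch using single linkage; dist() accepts a plain numeric vector):
x <- c(-100, -1, 1, 100)
hc <- hclust(dist(x), method = "single")
hc$height # 2 99 99: the first, unconstrained merge joins -1 and +1 at distance 2
# With a cap of 2 elements per cluster, that pair is then full, and the only
# remaining merge is -100 with +100: a cluster as wide as the whole data set.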
Just to give you an example of what I meant with partitional clustering:
library(cluster)
data("ruspini")
desired_cluster_size <- 3L
corresponding_num_clusters <- round(nrow(ruspini) / desired_cluster_size)
km <- kmeans(ruspini, corresponding_num_clusters)
table(km$cluster)
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
 3  3  2  4  2  2  2  1  3  3  2  3  2  3  3  2  6  3  2  1  3  6  2  8  4
This can't guarantee how many observations you'll have in each group, and it's not deterministic, but it at least gives you an approximation: in the tabulated results you can see that most of the 25 clusters ended up with 2 or 3 elements.

Avoid memory increase in foreach loop in R

I am trying to create summary statistics combining two different spatial datasets: a big raster file and a polygon file. The idea is to get summary statistics of the raster values within each polygon.
Since the raster is too big to process at once, I create subtasks and process them in parallel, i.e. I process one polygon from the SpatialPolygonsDataFrame at a time.
The code works fine, but after around 100 iterations I run into memory problems. Here is my code and what I intend to do:
# session setup
library("raster")
library("rgdal")
# multicore processing
library("foreach")
library("doSNOW")
# assign three clusters to be used for the current R session
cluster = makeCluster(3, type = "SOCK", outfile = "")
registerDoSNOW(cluster)
getDoParWorkers() # check if it worked
# load base data
r.terra.2008 <- raster("~/terra.tif")
spodf.malha.2007 <- readOGR("~/", "composed")
# bring both data-sets to a common CRS
proj4string(r.terra.2008)
proj4string(spodf.malha.2007)
spodf.malha.2007 <- spTransform(spodf.malha.2007, CRSobj = CRS(projargs = proj4string(r.terra.2008)))
proj4string(r.terra.2008) == proj4string(spodf.malha.2007) # should be TRUE
# create a function to extract areas
function.landcover.sum <- function(r.landuse, spodf.pol){
  return(table(extract(r.landuse, spodf.pol)))
}
# apply it on one subset to see if it is working
function.landcover.sum(r.terra.2008, spodf.malha.2007[1, ])
## parallel loop
# define package(s) to be used in the parallel loop
l.packages <- c("raster", "sp")
# try a parallel loop for the first 6 polygons
l.results <- foreach(i = 1:6,
                     .packages = l.packages) %dopar% {
  print(paste("Processing Polygon ", i, ".", sep = ""))
  return(function.landcover.sum(r.terra.2008, spodf.malha.2007[i, ]))
}
The output is a list that looks like this:
l.results
[[1]]
9 10
193159 2567
[[2]]
7 9 10 12 14 16
17 256 1084 494 67 15
[[3]]
   3    5    6    7      9   10   11   12   14   16
2199 1327 8840 8579 194437 1061 1073 1834  222 1395
[[4]]
3 6 7 9 10 12 16
287 102 728 329057 1004 1057 31
[[5]]
3 5 6 7 9 12 16
21 6 20 495 184261 4765 28
[[6]]
6 7 9 10 12 14
161 161 386 943 205 1515
So the results are rather small and should not be the source of the memory allocation problem. Yet the following loop over the whole polygon dataset, which has >32,000 rows, exceeds 8 GB of memory after around 100 iterations:
# apply the parallel loop on the whole dataset
l.results <- foreach(i = 1:nrow(spodf.malha.2007),
                     .packages = l.packages) %dopar% {
  print(paste("Processing Polygon ", i, ".", sep = ""))
  return(function.landcover.sum(r.terra.2008, spodf.malha.2007[i, ]))
  # gc(reset=TRUE)        # does not resolve the problem
  # closeAllConnections() # does not resolve the problem
}
What am I doing wrong?
edit:
I tried (as suggested in the comments) to remove the object after each iteration in the internal loop, but it did not resolve the problem. I furthermore tried to avoid potential problems from repeated data imports by exporting the objects to the cluster environment up front:
clusterExport(cl = cluster,
              varlist = c("r.terra.2008", "function.landcover.sum", "spodf.malha.2007"))
This brought no major changes. My R version is 3.4 on a Linux platform, so the patch linked in the first comment should already be included in this version. I also tried the parallel package, as suggested in the first comment, but saw no difference.
You can try exact_extract from the exactextractr package. It is among the fastest and most memory-safe functions for extracting values from a raster. The main function is implemented in C++ and usually doesn't need parallelization. Since you do not provide any example data, I post an example with real data:
library(raster)
library(sf)
library(exactextractr)
# pull municipal boundaries for Brazil
brazil <- st_as_sf(getData('GADM', country = 'BRA', level = 2))
# pull gridded precipitation data
prec <- getData('worldclim', var = 'prec', res = 10)
# turn the precipitation data into a dummy land use map
lu <- prec[[1]]
values(lu) <- sample(1:10, ncell(lu), replace = TRUE)
plot(lu)
# extract the land use class of each pixel inside each polygon
ex <- exact_extract(lu, brazil)
# apply table() to the resulting list; just the first 5 elements to avoid long output
lapply(ex[1:5], function(x){
  # x[, 1] holds the cell values; by default exact_extract adds a second
  # column with the coverage fraction of each pixel in each polygon
  table(x[, 1])
})
Here is the example output:
[[1]]
1 2 4 6 7 9 10
1 1 1 2 3 1 1
[[2]]
2 3 4 5 6 7 8 10
2 4 3 2 1 2 2 2
[[3]]
1 2 4 6 7 8 9 10
4 5 1 1 4 2 5 5
[[4]]
1 2 3 4 5 6 7 8 9 10
2 2 4 2 2 4 1 4 1 2
[[5]]
3 4 5 6 8 10
2 3 1 1 2 3
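If you only need the per-polygon tables, you can also pass a summary function directly (per the exactextractr docs, fun may be an R function receiving the cell values and their coverage fractions), so the full per-polygon data frames are never accumulated in a list; a sketch with the objects from above:
# coverage is accepted but unused here; table() tallies the cell values per polygon
tabs <- exact_extract(lu, brazil, function(values, coverage) table(values))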

Treat variables as data_frame and other things

I guess I have a problem in R. I have this data frame (see at the bottom); I imported it via "Import Dataset from Text" as Weevils, then converted it with
as.data.frame(Weevils), and is.data.frame(Weevils) returning [1] TRUE proved that it is a data frame. Yet I cannot use the $ operator because the variables are all "atomic vectors"; I tried this instead:
pairs(x[Age_yrs] ~ x[Larvae_per_m²], col = x[Farmer], pch = 16)
but then this occurred:
Error in plot.xy(xy, type, ...) :
  numerische Farbe muss >= 0 sein, gefunden -2
which translates to "numeric colour must be >= 0, found -2": a negative value (for the Farmer?) was found, so the colours cannot be assigned. It is all supposed to look like this answer: https://stackoverflow.com/a/40829168/5987736 (thanks to Carles Mitjans!). Yet what came out in my case with pairs(x[Age_yrs] ~ x[Larvae_per_m²], pch = 16) was a plot with negative values, which is why the colour assignment fails.
So my questions are: why can't the variables in the Weevils data frame be treated as non-atomic vectors, i.e. why can't I use $? And why are the values negative, and what can I do to make them positive? Thanks for helping me!
Farmer Age_yrs Larvae_per_m²
1 Band 2 1315
2 Band 4 725
3 Band 6 90
4 Fechney 1 520
5 Fechney 3 285
6 Fechney 9 30
7 Mulholland 2 725
8 Mulholland 6 20
9 Adams 2 150
10 Adams 3 225
11 Forrester 1 455
12 Forrester 3 75
13 Bilborough 2 850
14 Bilborough 3 650
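For reference, a minimal sketch of the usual fix in the style of the linked answer: map Farmer to positive integer colour codes via factor() (assuming Weevils has the columns shown above; the backticks guard the non-ASCII column name):
Weevils <- as.data.frame(Weevils)                 # make sure it is a plain data frame
farmer.col <- as.integer(factor(Weevils$Farmer))  # positive colour codes 1, 2, ...
plot(Weevils$Age_yrs, Weevils$`Larvae_per_m²`,
     col = farmer.col, pch = 16,
     xlab = "Age_yrs", ylab = "Larvae_per_m²")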

How to prepare my data for a factorial repeated measures analysis?

Currently, my dataframe is in wide format and I want to do a factorial repeated measures analysis with two between-subject factors (sex & org) and a within-subject factor (tasktype). Below I've illustrated how my data look with a sample (the actual dataset has a lot more variables). The variables starting with '1_' and '2_' belong to measurements during task 1 and task 2 respectively; this means that 1_FD_H_org and 2_FD_H_org are the same measurement, taken during task 1 and task 2 respectively.
id sex org task1 task2 1_FD_H_org 1_FD_H_text 2_FD_H_org 2_FD_H_text 1_apv 2_apv
2 F T Correct 2 69.97 68.9 116.12 296.02 10 27
6 M T Correct 2 53.08 107.91 73.73 333.15 16 21
7 M T Correct 2 13.82 30.9 31.8 78.07 4 9
8 M T Correct 2 42.96 50.01 88.81 302.07 4 24
9 F H Correct 3 60.35 102.9 39.81 96.6 15 10
10 F T Incorrect 3 78.61 80.42 55.16 117.57 20 17
I want to analyze whether there is a difference between the two tasks on e.g. FD_H_org for the different groups/conditions (sex & org).
How do I reshape my data so I can analyze it with a model like this?
ezANOVA(data=df, dv=.(FD_H_org), wid=.(id), between=.(sex, org), within=.(task))
I think that the correct format of my data should look like this:
id sex org task outcome FD_H_org FD_H_text apv
2 F T 1 Correct 69.97 68.9 10
2 F T 2 2 116.12 296.02 27
6 M T 1 Correct 53.08 107.91 16
6 M T 2 2 73.73 333.15 21
But I'm not sure. I tried to achieve this with the reshape2 package but couldn't figure out how to do it. Anybody who can help?
I think you probably need to rebuild it by binding the two subsets of columns together with rbind(). The only issue here was that your outcomes implied different data types, so I forced them both to text:
require(plyr)
dt <- read.table(file = "dt.txt", header = TRUE, sep = " ") # this was to bring in your data
newtab = rbind(
  ddply(dt, .(id, sex, org), summarize, task = 1, outcome = as.character(task1),
        FD_H_org = X1_FD_H_org, FD_H_text = X1_FD_H_text, apv = X1_apv),
  ddply(dt, .(id, sex, org), summarize, task = 2, outcome = as.character(task2),
        FD_H_org = X2_FD_H_org, FD_H_text = X2_FD_H_text, apv = X2_apv)
)
newtab[order(newtab$id), ]
id sex org task outcome FD_H_org FD_H_text apv
1 2 F T 1 Correct 69.97 68.90 10
7 2 F T 2 2 116.12 296.02 27
2 6 M T 1 Correct 53.08 107.91 16
8 6 M T 2 2 73.73 333.15 21
3 7 M T 1 Correct 13.82 30.90 4
9 7 M T 2 2 31.80 78.07 9
4 8 M T 1 Correct 42.96 50.01 4
10 8 M T 2 2 88.81 302.07 24
5 9 F H 1 Correct 60.35 102.90 15
11 9 F H 2 3 39.81 96.60 10
6 10 F T 1 Incorrect 78.61 80.42 20
12 10 F T 2 3 55.16 117.57 17
EDIT: obviously you don't need plyr for this (and it may slow things down) unless you're doing further transformations. This is the code with no non-standard dependencies:
newcolnames <- c("id", "sex", "org", "task", "outcome", "FD_H_org", "FD_H_text", "apv")
# column 3 (org) is selected twice; the duplicate copy (org.1) is then
# overwritten with the task number
t1 <- dt[, c(1, 2, 3, 3, 4, 6, 8, 10)]
t1$org.1 <- 1
colnames(t1) <- newcolnames
t2 <- dt[, c(1, 2, 3, 3, 5, 7, 9, 11)]
t2$org.1 <- 2
t2$task2 <- as.character(t2$task2) # match task1's text type before rbind()
colnames(t2) <- newcolnames
newt <- rbind(t1, t2)
newt[order(newt$id), ]
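Since the question mentions reshaping packages: the same long format can also be produced in one call with base reshape(); a sketch, assuming the wide columns read in as X1_FD_H_org etc. (R prefixes names that start with a digit):
# make both task columns text first so they can share the outcome column
dt$task1 <- as.character(dt$task1)
dt$task2 <- as.character(dt$task2)
long <- reshape(dt, direction = "long",
                varying = list(c("task1", "task2"),
                               c("X1_FD_H_org", "X2_FD_H_org"),
                               c("X1_FD_H_text", "X2_FD_H_text"),
                               c("X1_apv", "X2_apv")),
                v.names = c("outcome", "FD_H_org", "FD_H_text", "apv"),
                timevar = "task", times = 1:2, idvar = "id")
long[order(long$id), ]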
