Normalisation in igraph weighted betweenness calculation in R

I have an undirected network with weighted edges that I am working with in igraph.
For a particular graph, out, I can calculate communities in igraph using the function edge.betweenness.community and the betweenness of each edge using edge.betweenness.
I can tell igraph to include the weights of each edge by writing the following:
largest <- which.max(sapply(modules, vcount))
out <- modules[largest][[1]]
bt <- edge.betweenness(out, weights = E(out)$value, directed = FALSE)
Returning:
bt
[1] 20.0 11.0 27.0 11.0 8.0 12.0 8.0 8.5 7.5 6.0 3.0 3.0 7.0 8.5 7.5 4.0 11.0
Where the weights are:
E(out)$value
[1] 0.2829 0.2880 0.2997 0.1842 0.2963 0.2714 0.2577 0.2850 0.2850 0.2577 0.2305 0.2305 0.2577 0.1488 0.1488 0.1215 0.2997
The weights in this case have limits 0 - 1, where 1 = highest cost to traverse an edge, 0 = lowest cost. However, these limits do not get passed to igraph in any of the betweenness calculations.
My question: How does igraph evaluate the lower and upper limits of the listed weights in terms of normalisation?
Does it automatically scale the weights based on the min and max values of the specified weights? (in this case min = 0.1215, max = 0.2997)
What I want: How do I tell it to take the true limits of the full data set (min = 0, max = 1) into account?
Additional Information:
If I multiply the weights E(out)$value by some constant and recalculate the betweenness, I get a similar answer (I assume there is some floating-point error and they are in fact the same):
new_weights <- as.numeric(E(out)$value*2.5)
new_weights
[1] 0.70725 0.72000 0.74925 0.46050 0.74075 0.67850 0.64425 0.71250 0.71250 0.64425 0.57625 0.57625 0.64425 0.37200 0.37200 0.30375 0.74925
bt <- edge.betweenness(out, weights = new_weights, directed = FALSE)
Giving:
bt
[1] 20 11 27 11 8 12 8 8 8 6 3 3 7 8 8 4 11
This implies there is some auto-scaling going on.
With this in mind, how do I manually scale the weights in the betweenness calculation to my required limits of 0 and 1?
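To make the comparison explicit, here is a small sketch of the check described above (it only uses the calls already shown; nothing extra about igraph's internals is assumed):

# Multiplying every weight by a positive constant leaves shortest paths
# unchanged, so the betweenness values agree (any tiny differences come from
# floating-point tie-breaking on equal-length paths).
bt_raw    <- edge.betweenness(out, weights = E(out)$value,       directed = FALSE)
bt_scaled <- edge.betweenness(out, weights = E(out)$value * 2.5, directed = FALSE)

# Adding a constant changes the relative path lengths, so this comparison
# shows whether the absolute weight values matter.
bt_shifted <- edge.betweenness(out, weights = E(out)$value + 0.5, directed = FALSE)

all.equal(bt_raw, bt_scaled)
all.equal(bt_raw, bt_shifted)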
Research:
Edit 4 June 2016 -
I have tried to review the source code for edge.betweenness on the igraph R GitHub page: https://github.com/igraph/rigraph/tree/dev/R
The closest function I could find was cluster_edge_betweenness at https://github.com/igraph/rigraph/blob/dev/R/community.R
This function makes a call to the C function C_R_igraph_community_edge_betweenness. The closest reference to this I could find in the igraph C documentation is igraph_community_edge_betweenness at https://github.com/igraph/igraph/blob/master/include/igraph_community.h
Neither of these sources, however, makes any reference to how the limits of the weights are handled.
Original Research:
I have looked through the igraph documentation on betweenness algorithms, and explored other questions related to normalisation, but have found nothing that deals specifically with the normalisation of the weights themselves.
Modularity calculation for weighted graphs in igraph
Calculation of betweenness in iGraph
http://igraph.org/r/doc/betweenness.html
The network data and visualisation are as follows:
plot(out)
get.data.frame(out)
from to value sourceID targetID
1 74 80 0.2829 255609 262854
2 74 61 0.2880 255609 179585
3 80 1085 0.2997 262854 3055482
4 1045 1046 0.1842 2970629 2971615
5 1046 1085 0.2963 2971615 3055482
6 1046 1154 0.2714 2971615 3087803
7 1085 1154 0.2577 3055482 3087803
8 1085 1187 0.2850 3055482 3101131
9 1085 1209 0.2850 3055482 3110186
10 1154 1243 0.2577 3087803 3130848
11 1154 1187 0.2305 3087803 3101131
12 1154 1209 0.2305 3087803 3110186
13 1154 1244 0.2577 3087803 3131379
14 1243 1187 0.1488 3130848 3101131
15 1243 1209 0.1488 3130848 3110186
16 1243 1244 0.1215 3130848 3131379
17 1243 1281 0.2997 3130848 3255811
(The weights in this case are in the frame$value column with limits 0 - 1, where 1 = highest cost to traverse an edge, 0 = lowest cost)

Related

Indexing through 'names' in a list and performing actions on contained values in R

I have a data set of counts from standard solutions passed through an instrument that analyses chemical concentrations (an ICPMS for those familiar). The data is over a range of different standards and for each standard I have four repeat measurements that I want to calculate the mean and variance of.
I'm importing the data from an Excel spreadsheet and then, following some housekeeping such as getting dates and times in the right format, I split the dataset up into a list identified by the name of the standard solution using Count11.sp <- split(Count11.raw, Count11.raw$Type). Count11.raw$Type then becomes the list element name, and I have the four count results for each chemical element in that list element.
So far so good.
I find I can get an average (mean, median, etc.) easily enough by identifying the list element specifically, i.e. mean(Count11.sp$'Ca40') or sapply(Count11.sp$'Ca40', median), but what I'm not able to do is automate that in a loop so that I can calculate the means for each standard and drop them into a numeric matrix for further manipulation. I can extract the list element names with names(), and I can even use a loop to make a vector of all the names and reference the specific list element using these in a for loop.
For instance Count11.sp[names(Count11.sp[i])] will extract the full list element no problem:
$`Post Ca45t`
Type Run Date 7Li 9Be 24Mg 43Ca 52Cr 55Mn 59Co 60Ni
77 Post Ca45t 1 2011-02-08 00:13:08 114 26101 4191 453525 2632 520 714 2270
78 Post Ca45t 2 2011-02-08 00:13:24 114 26045 4179 454299 2822 524 704 2444
79 Post Ca45t 3 2011-02-08 00:13:41 96 26372 3961 456293 2898 520 762 2244
80 Post Ca45t 4 2011-02-08 00:13:58 112 26244 3799 454702 2630 510 792 2356
65Cu 66Zn 85Rb 86Sr 111Cd 115In 118Sn 137Ba 140Ce 141Pr 157Gd 185Re 208Pb
77 244 1036 56 3081 44 520625 78 166 724 10 0 388998 613
78 250 982 70 3103 46 526154 76 174 744 16 4 396496 644
79 246 1014 36 3183 56 524195 60 198 744 2 0 396024 612
80 270 932 60 3137 44 523366 70 180 824 2 4 390436 632
238U
77 24
78 20
79 14
80 6
but sapply(Count11.sp[names(Count11.sp[i])], median) produces an error message: Error in median.default(X[[i]], ...) : need numeric data
while sapply(Input$`Post Ca45t`, median) ('Post Ca45t' being the name of Count11.sp[i] for i = 4) does exactly what I want and produces the median values (I can clean that vector up later for medians that don't make sense), e.g.
Type Run Date 7Li 9Be 24Mg
NA 2.5 1297109612.5 113.0 26172.5 4070.0
43Ca 52Cr 55Mn 59Co 60Ni 65Cu
454500.5 2727.0 520.0 738.0 2313.0 248.0
66Zn 85Rb 86Sr 111Cd 115In 118Sn
998.0 58.0 3120.0 45.0 523780.5 73.0
137Ba 140Ce 141Pr 157Gd 185Re 208Pb
177.0 744.0 6.0 2.0 393230.0 622.5
238U
17.0
Can anyone give me any insight into how I can automate (i.e. loop through) these names to produce one median vector per list element? I'm sure there's just some simple disconnect in my logic here that may be easily solved.
Update: I've solved the problem. The way to do so is to use tapply on the original dataset without the need to split it. tapply applies a function to subsets of the data defined by a grouping variable; in my case I can group by Count11.raw$Type and take the mean of each measurement column, e.g. sapply(Count11.raw[, 3:ncol(Count11.raw)], function(x) tapply(x, Count11.raw$Type, mean)). Job done.
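For completeness, a hedged sketch of the loop-over-names route originally asked about (it assumes the non-numeric columns such as Type and the date should simply be dropped before taking medians):

# One median vector per list element, collected into a matrix
med_matrix <- sapply(names(Count11.sp), function(nm) {
  el <- Count11.sp[[nm]]
  sapply(el[sapply(el, is.numeric)], median)  # medians of the numeric columns only
})
# Columns of med_matrix are the standard solutions, rows the measured elements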

How are we supposed to get at matrix diagonals and partial regression plots using R?

Given the data:
  farm    up right left
  24.3  34.3    50   45
  30.2  35.3    54   45
    49    45   540 4353
    70    60   334  343
    69    80    54  342
# for finding Studentized residuals vs fitted values
mod1 <- lm(farm ~ up + right + left)
plot(mod1)
# for finding Cook's distance
plot(cookd(lm(farm ~ up + right + left, data = data)))
This returns the error: could not find function "cookd"
I also don't know how to get the matrix diagonals or the partial regression plots, and I couldn't find much information online.
Please help, or correct me if I am wrong.
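For reference, a sketch of the usual route: cookd() came from old versions of the car package, while current base R provides cooks.distance() and hatvalues(), and car provides avPlots() for added-variable (partial regression) plots. The data frame name below simply follows the call in the question.

library(car)  # for avPlots()

mod1 <- lm(farm ~ up + right + left, data = data)

# Cook's distance (base R replacement for the old cookd())
plot(cooks.distance(mod1), type = "h", ylab = "Cook's distance")

# Diagonal of the hat matrix (the leverages), i.e. diag(H)
hatvalues(mod1)

# Added-variable / partial regression plots
avPlots(mod1)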

Issue with normalizing a variable

I am trying to normalize a variable using Box-Cox. However, I am receiving an error message:
boxcox_obj <- boxcox(alive_data_4$mosslpadeq)
Error in estimate_boxcox_lambda(x, ...) : x must be positive
I read online that you can get this message when the variable has negative values. However, that is not the case with this variable (see frequency below).
table(alive_data_4$mosslpadeq)
0 10 20 30 40 50 60 70 80 90 100
766 635 2141 1756 3355 1913 2095 1400 4498 1361 2228
Can someone advise?
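For context, a minimal check (an assumption on my part: it is the zeros in the table above, not negative values, that violate the strictly-positive requirement of Box-Cox, and the shift below is just one common workaround, not part of the original post):

min(alive_data_4$mosslpadeq)                        # 0, i.e. not strictly positive
boxcox_obj <- boxcox(alive_data_4$mosslpadeq + 1)   # hypothetical +1 shift before transforming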

Natural Neighbor Interpolation in R

I need to conduct Natural Neighbor Interpolation (NNI) via R in order to smooth my numeric data. For example, say I have very noisy data; my goal is to use NNI to model it neatly.
I have several hundred rows of data (one observation for each postcode), alongside latitudes and longitudes. I've made up some data below:
Postcode lat lon Value
200 -35.277272 149.117136 7
221 -35.201372 149.095065 38
800 -12.801028 130.955789 27
801 -12.801028 130.955789 3
804 -12.432181 130.84331 29
810 -12.378451 130.877014 20
811 -12.376597 130.850489 3
812 -12.400091 130.913672 42
814 -12.382572 130.853877 32
820 -12.410444 130.856124 39
821 -12.426641 130.882367 39
822 -12.799278 131.131697 49
828 -12.474896 130.907378 38
829 -14.460879 132.280002 34
830 -12.487233 130.972637 8
831 -12.480066 130.984006 49
832 -12.492269 130.990891 29
835 -12.48138 131.029173 33
836 -12.525546 131.103025 40
837 -12.460094 130.842663 39
838 -12.709507 130.995407 28
840 -12.717562 130.351316 22
841 -12.801028 130.955789 8
845 -13.038663 131.072091 19
846 -13.226806 131.098416 50
847 -13.824123 131.835799 11
850 -14.464497 132.262021 2
851 -14.464497 132.262021 23
852 -14.92267 133.064654 36
854 -16.81839 137.14707 17
860 -19.648306 134.186642 3
861 -18.94406 134.318373 8
862 -20.231104 137.762232 28
870 -12.436101 130.84059 24
871 -12.436101 130.84059 16
Is there any kind of package that will do this? I should mention that the only predictors I am using in this model are latitude and longitude. If there isn't a package that can do this, how can I implement it manually? I've searched extensively and I can't figure out how to implement this in R. I have seen one or two other SO posts, but they haven't assisted me in figuring this out.
Please let me know if there's anything I must add to the question. Thanks.
I suggest the following (a rough sketch of the workflow is given below):
1. Reproject the data to the corresponding UTM zone.
2. Use the R whitebox (WhiteboxTools) package to process the data using natural neighbour interpolation.
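A minimal sketch of that workflow, assuming the sample table above sits in a data frame called postcode_df (a hypothetical name) and that the whitebox wrapper for the NaturalNeighbourInterpolation tool is available; check the whitebox documentation for the exact function name and arguments:

library(sf)
library(whitebox)

# Build an sf point layer from the lat/lon table and reproject it.
# EPSG:32753 (WGS84 / UTM zone 53S) is only an illustrative choice --
# use the UTM zone that actually covers your data.
pts <- st_as_sf(postcode_df, coords = c("lon", "lat"), crs = 4326)
pts_utm <- st_transform(pts, 32753)

# WhiteboxTools works on files, so write the points to disk first
st_write(pts_utm, "points_utm.shp", delete_dsn = TRUE)

# Natural neighbour interpolation of the "Value" field onto a raster
wbt_natural_neighbour_interpolation(
  input     = "points_utm.shp",
  field     = "Value",
  output    = "value_nni.tif",
  cell_size = 1000
)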

Binning a dataframe with equal frequency of samples

I have binned my data using the cut function
breaks<-seq(0, 250, by=5)
data<-split(df2, cut(df2$val, breaks))
My split dataframe looks like
... ...
$`(15,20]`
val ks_Result c
15 60 237
18 70 247
... ...
$`(20,25]`
val ks_Result c
21 20 317
24 10 140
... ...
My bins looks like
> table(data)
data
(0,5] (5,10] (10,15] (15,20] (20,25] (25,30] (30,35]
0 0 0 7 128 2748 2307
(35,40] (40,45] (45,50] (50,55] (55,60] (60,65] (65,70]
1404 11472 1064 536 7389 1008 1714
(70,75] (75,80] (80,85] (85,90] (90,95] (95,100] (100,105]
2047 700 329 1107 399 376 323
(105,110] (110,115] (115,120] (120,125] (125,130] (130,135] (135,140]
314 79 1008 77 474 158 381
(140,145] (145,150] (150,155] (155,160] (160,165] (165,170] (170,175]
89 660 15 1090 109 824 247
(175,180] (180,185] (185,190] (190,195] (195,200] (200,205] (205,210]
1226 139 531 174 1041 107 257
(210,215] (215,220] (220,225] (225,230] (230,235] (235,240] (240,245]
72 671 98 212 70 95 25
(245,250]
494
When I take the mean of the bin counts, I get an average of ~900 samples per bin:
> mean(table(data))
[1] 915.9
I want to tell R to make irregular bins in such a way that each bin contains on average 900 samples (e.g. (0, 27] = 900, (27, 28.5] = 900, and so on). I found something similar here, but it deals with only one variable, not the whole dataframe.
I also tried the Hmisc package, but unfortunately the bins don't contain equal frequencies:
library(Hmisc)
data<-split(df2, cut2(df2$val, g=30, oneval=TRUE))
data<-split(df2, cut2(df2$val, m=1000, oneval=TRUE))
Assuming you want 50 equal-sized buckets (based on your seq statement), you can use something like:
df <- data.frame(var=runif(500, 0, 100)) # make data
cut.vec <- cut(
  df$var,
  breaks=quantile(df$var, 0:50/50), # breaks along 1/50 quantiles
  include.lowest=TRUE
)
df.split <- split(df, cut.vec)
Hmisc::cut2 has this option built in as well.
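For instance, a short sketch (g sets the number of quantile groups, so this mirrors the 50 quantile-based buckets above):

library(Hmisc)
df.split <- split(df, cut2(df$var, g = 50))  # ~equal-frequency buckets
sapply(df.split, nrow)                       # check the resulting bin sizes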
This can also be done with the function provided here by Joris Meys:
EqualFreq2 <- function(x, n){
  # Assign each value of x to one of n bins of (near-)equal frequency
  nx <- length(x)
  nrepl <- floor(nx/n)                 # base number of values per bin
  nplus <- sample(1:n, nx - nrepl*n)   # bins that get one extra value
  nrep <- rep(nrepl, n)
  nrep[nplus] <- nrepl + 1
  x[order(x)] <- rep(seq.int(n), nrep) # replace sorted values with bin indices
  x
}
data<-split(df2, EqualFreq2(df2$val, 25))
