How can I calculate Jaccard vertex similarity with weights in igraph (R)?

I have a square matrix that represents directed interactions, with values representing the magnitude of the "flow" from row i to column j.
mat <- structure(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.59734154600838,
0.962276996464401, 0.996554553573577, 0.988150008522967, 0.581536975261071,
0.280105566896129, 0.0520717823071291, 0.0443864046117343, 0.0162858335588474,
0, 0, 0, 0, 0, 0, 0, 0.111900863185923, 0.289483837277475, 0.338036619790556,
0.973201117894343, 0.876145758734938, 0.280105566896129, 0.245172586054694,
0.101440228047504, 0.0136022221272776, 0, 0, 0, 0, 0, 0, 0.073088274682518,
0.21588462733217, 0.258134862678946, 0.93528472971792, 0.921844796228768,
0.318790697187933, 0.280105566896129, 0.117928032625428, 0.016073037487081,
0, 0, 0, 0, 0, 0, 0, 0.0119602547215087, 0.0174757225504163,
0.443466799224191, 0.941024455005652, 0.632609306727839, 0.57418820480725,
0.280105566896129, 0.043827579210664, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0.0547471528159807, 0.884304818335752, 0.937495721370637,
0.925118019265575, 0.280105566896129, 0.055967839940851, 0.0122649398400715,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0.0679263578760456, 0.104884821422108,
0.569814755335506, 0.853130344409379, 0.280105566896129, 0.0728699300735904,
0.0339371561178606, 0.012188886551821, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0.0219303360220489, 0.843994038605239, 0.759918325154657,
0.280105566896129, 0.143508732965731, 0.0556400089034765, 0.0296286033644999,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0.421151438381493, 0.977746695038157,
0.499880491267235, 0.280105566896129, 0.116686808742586, 0.0639605586005988,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0.0495967410949283, 0.841406989124245,
0.85505217514437, 0.578265483357174, 0.280105566896129, 0.163154497800251,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.499941945587477, 0.993657104473566,
0.807475685951474, 0.45318772928331, 0.280105566896129), .Dim = c(15L,
15L))
I am interested in calculating the weighted linkage similarity (both in- and out-flows) of all vertices in the network, i.e. taking the magnitude of the flows into account.
Using igraph, I can calculate the Jaccard similarity, but without considering weights:
library(igraph)
bin <- mat
bin[bin > 0] <- 1
similarity(graph_from_adjacency_matrix(bin),
           mode = "all",
           method = "jaccard")
# this gives the same result as the one above
similarity(graph_from_adjacency_matrix(mat, weighted = TRUE),
           mode = "all",
           method = "jaccard")
Using the code from this blogpost, I was able to calculate the Jaccard similarity of outflows and inflows and combine them.
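In formula terms (my reading of that method): for vertices i and j, the weighted out-flow similarity is sum_k min(mat[i,k], mat[j,k]) / sum_k max(mat[i,k], mat[j,k]), the in-flow similarity does the same on the columns mat[k,i] and mat[k,j], and the two are then averaged.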
# outflow similarity
sim.jac.out <- matrix(0, nrow = nrow(mat), ncol = nrow(mat))
pairs <- t(combn(1:nrow(mat), 2))
for (i in 1:nrow(pairs)) {
  num <- sum(sapply(1:ncol(mat), function(x) min(mat[pairs[i, 1], x], mat[pairs[i, 2], x])))
  den <- sum(sapply(1:ncol(mat), function(x) max(mat[pairs[i, 1], x], mat[pairs[i, 2], x])))
  sim.jac.out[pairs[i, 1], pairs[i, 2]] <- num/den
  sim.jac.out[pairs[i, 2], pairs[i, 1]] <- num/den
}
sim.jac.out[which(is.na(sim.jac.out))] <- 0
diag(sim.jac.out) <- 1
# inflow similarity (same computation on the transposed matrix)
tmat <- t(mat)
sim.jac.in <- matrix(0, nrow = nrow(tmat), ncol = nrow(tmat))
pairs <- t(combn(1:nrow(tmat), 2))
for (i in 1:nrow(pairs)) {
  num <- sum(sapply(1:ncol(tmat), function(x) min(tmat[pairs[i, 1], x], tmat[pairs[i, 2], x])))
  den <- sum(sapply(1:ncol(tmat), function(x) max(tmat[pairs[i, 1], x], tmat[pairs[i, 2], x])))
  sim.jac.in[pairs[i, 1], pairs[i, 2]] <- num/den
  sim.jac.in[pairs[i, 2], pairs[i, 1]] <- num/den
}
sim.jac.in[which(is.na(sim.jac.in))] <- 0
diag(sim.jac.in) <- 1
# total similarity
sim.jac.all <- (sim.jac.in + sim.jac.out)/2
So the general question is, does this make sense?
But more specifically, I would be interested to know if there is a way to incorporate link weights in the calculation of similarity with igraph.
In my real dataset, I need to do this several times iteratively (swapping individuals), for a large number of networks, so my method would take forever. I believe igraph uses C++ under the hood.
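For reference, here is a vectorised base-R sketch of the same weighted (min/max) Jaccard computation; weighted_jaccard is just a name made up for illustration (it is not an igraph function), and it still loops over vertex pairs, only replacing the inner sapply calls with pmin()/pmax():
# sketch: weighted Jaccard = sum of pairwise minima / sum of pairwise maxima
weighted_jaccard <- function(m) {
  n <- nrow(m)
  sim <- diag(1, n)  # self-similarity set to 1, as above
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      num <- sum(pmin(m[i, ], m[j, ]))
      den <- sum(pmax(m[i, ], m[j, ]))
      sim[i, j] <- sim[j, i] <- if (den > 0) num/den else 0  # treat 0/0 as 0
    }
  }
  sim
}
# combine out-flow (rows) and in-flow (columns) similarity, as in the code above
sim.jac.all2 <- (weighted_jaccard(mat) + weighted_jaccard(t(mat)))/2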

Related

Contrasts between successive levels of a factor in R

I'm writing this post because I'm stuck in the analysis of a data file from a laboratory experiment.
In this experiment, I counted the number of females (of a small arthropod) present in a specific environment across 26 time points (TP). I want to know whether the number of females differed between each pair of successive time points (e.g. whether the count at TP 1 differs from TP 2, the count at TP 2 from TP 3, and so on).
The data frame has the following columns:
Replicate (the number of the replicate, from 1 to 8); TimePoint (the day on which the females were counted, from 1 to 26); Females (the number of females counted at each time point); and Block (the experiment had 2 blocks).
I've tried to do some successive contrasts, but I don't think it's the best way. This is my code:
library(lme4)      # for lmer()
library(multcomp)  # for glht() and mcp()
m1 <- lmer(Females ~ TimePoint + (1|Block))
Suc_contrasts2 <- glht(m1, linfct = mcp(TimePoint =
c(
"t1 - t2 == 0",
"t2 - t3 == 0",
"t3 - t4 == 0",
"t4 - t5 == 0",
"t5 - t6 == 0",
"t6 - t7 == 0",
"t7 - t8 == 0",
"t8 - t9 == 0",
"t9 - t10 == 0",
"t10 - t11== 0",
"t11 - t12 == 0",
"t12 - t13 == 0",
"t13 - t14 == 0",
"t14 - t15 == 0",
"t15 - t16 == 0",
"t16 - t17 == 0",
"t17 - t18 == 0",
"t18 - t19 == 0",
"t19 - t20 == 0",
"t20 - t21== 0",
"t21 - t22 == 0",
"t22 - t23 == 0",
"t23 - t24 == 0",
"t24 - t25 == 0",
"t25 - t26 == 0")))
summary(Suc_contrasts2)
summary(Suc_contrasts2, test = adjusted("bonferroni"))
I've been looking on Google for other ways to do planned comparisons, but everything I've found doesn't really fit my data set. I'm still new at this, so sorry for any newbie mistakes.
Thus my question is: is there a better way to compare the number of females between each pair of successive time points?
Edit 1:
I also tried doing contrasts like this, but the results don't seem right...
levels(TimePoint)
# [1] "t1" "t10" "t11" "t12" "t13" "t14" "t15" "t16" "t17" "t18" "t19" "t2" "t20" "t21" "t22" "t23" "t24" "t25" "t26"
# [20] "t3" "t4" "t5" "t6" "t7" "t8" "t9"
# tell R which TimePoints to compare
c1<- c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) #1v2
c2<- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0) #2v3
c3<- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, -1, 0, 0, 0, 0, 0) #3v4
c4<- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, -1, 0, 0, 0, 0) #4v5
c5<- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, -1, 0, 0, 0) #5v6
c5<- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, -1, 0, 0) #6v7
c6<- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, -1, 0) #7v8
c7<- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, -1) #8v9
c8<- c(0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1) #9v10
c9<- c(0, 1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) #10v11
c10<- c(0, 0, 1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) #11v12
c11<- c(0, 0, 0, 1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) #11v12
c12<- c(0, 0, 0, 0, 1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) #12v13
c13<- c(0, 0, 0, 0, 0, 1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) #13v14
c14<- c(0, 0, 0, 0, 0, 0, 1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) #14v15
c15<- c(0, 0, 0, 0, 0, 0, 0, 1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) #15v16
c16<- c(0, 0, 0, 0, 0, 0, 0, 0, 1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) #16v17
c17<- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) #17v18
c18<- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) #18v19
c19<- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) #19v20
c20<- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) #20v21
c21<- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) #21v22
c22<- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) #22v23
c23<- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0) #23v24
c24<- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, -1, 0, 0, 0, 0, 0, 0, 0, 0) #24v25
c25<- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, -1, 0, 0, 0, 0, 0, 0, 0) #25v26
# combine the above lines into a matrix
mat <- cbind(c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,c15,c16,c17,c18,c19,c20,c21,c22,c23,c24,c25)
# tell R that the matrix gives the contrasts you want
contrasts(TimePoint) <- mat
model2 <- aov(Females ~ TimePoint)
summary(model2)
# Df Sum Sq Mean Sq F value Pr(>F)
# line2$TimePoint 25 9694 387.8 6.939 <2e-16 ***
# Residuals 390 21794 55.9
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary.aov(model2, split=list(TimePoint=list("1v2"=1, "2v3" = 2, "3v4"=3, "4v5"=4, "5v6"=5, "6v7"=6, "7v8"=7, "8v9"=8, "9v10"=9, "10v11"=10, "11v12"=11, "12v13"=12, "13v14"=13, "14v15"=14, "15v16"=15, "16v17"=16, "17v18"=17, "18v19"=18, "19v20"=19, "20v21"=20, "21v22"=21, "22v23"=22, "23v24"=23, "24v25"=24, "25v26"=25)))
Thanks for your time,
André
Another option for fitting successive-differences contrasts:
m1 <- lmer(Females ~ TimePoint + (1|Block), contrasts = list(TimePoint = MASS::contr.sdif))
This doesn't take multiple testing into account (which you might get away with, since these are pre-planned contrasts); you could use p.adjust() on the p-values, as sketched below.
@AndreasM's points about the ordering of your factor, the choice of random vs fixed effects, etc., should definitely be taken into consideration.
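A minimal sketch of that p.adjust() step, assuming lme4 and lmerTest are loaded so that summary() of the fit reports p-values, and with df standing in for your data frame:
library(lme4)
library(lmerTest)  # adds p-values to summary() of lmer fits
m1 <- lmer(Females ~ TimePoint + (1|Block), data = df,
           contrasts = list(TimePoint = MASS::contr.sdif))
p_raw <- coef(summary(m1))[-1, "Pr(>|t|)"]       # drop the intercept row
p_adj <- p.adjust(p_raw, method = "bonferroni")  # or "holm", "BH", ...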
I think this website may help you: backward difference coding
Following the information there, difference contrasts between successive factor levels can be set up as shown below. Note that I only use a simple example with 5 factor levels.
# create dummy data
df <- expand.grid(TimePoint = c("t01", "t02", "t03", "t04", "t05"),
                  Replicate = 1:8, Block = 1:2)
set.seed(2)
df$Females <- runif(nrow(df), min = 0, max = 100)
# set backward difference contrasts
contrasts(df$TimePoint) <- matrix(c(-4/5,  1/5,  1/5,  1/5,  1/5,
                                    -3/5, -3/5,  2/5,  2/5,  2/5,
                                    -2/5, -2/5, -2/5,  3/5,  3/5,
                                    -1/5, -1/5, -1/5, -1/5,  4/5), ncol = 4)
When fitting a simple linear model, the parameter estimates correspond to the expected differences between successive time points, i.e., contrast "TimePoint1" corresponds to t2 - t1, contrast "TimePoint2" to t3 - t2, and so on.
# fit linear model
m1 <- lm(Females ~ TimePoint, data = df)
coef(m1)
# (Intercept)  TimePoint1  TimePoint2  TimePoint3  TimePoint4
#   50.295659  -10.116045    7.979465  -10.182389    2.209413
# mean by time point
with(df, tapply(Females, TimePoint, mean))
#      t01      t02      t03      t04      t05
# 57.23189 47.11584 55.09531 44.91292 47.12233
I want to add that I am not sure whether what you are trying to do is sensible, but I don't feel comfortable evaluating that; it would be a question for CrossValidated. I worry that treating 26 time points as categorical factor levels is not the best way to go. Also, in your initial code you fit a model treating block as a random factor; this does not make sense if block has only 2 levels (as you write), see for example here: Link
Finally, I noticed that in your example the factor levels of your TimePoint variable are not ordered correctly (t1, t10, t11, ... instead of t1, t2, t3, ...). You could change this, for instance, with this line of code:
df$TimePoint <- factor(df$TimePoint, levels = paste0("t", 1:26),
labels = paste0("t", sprintf("%02d", 1:26)))

Working on bipartite networks with igraph : problem with basic measures (density, normalized degree)

I'm new to bipartite network analysis and I have some trouble with basic measures.
I'm trying to work on bipartite networks without projecting them into one-mode graphs.
My problem comes from the fact that the igraph package allows you to create bipartite graphs, but its measures do not seem to adapt to the specific nature of these graphs.
So, my general question is: how do you proceed when working directly on bipartite networks?
Here is a concrete example with density.
## Working with an incidence matrix (sample) with 47 columns and 10 rows (unweighted / undirected)
# Want to compute basic global index like density with igraph
library(igraph)
g <- graph.incidence(m, directed = F )
graph.density(g) # result = 0.04636591
# Now trying to compute the basic density of a bipartite graph without igraph (number of edges divided by the product of the sizes of the two vertex sets)
library(Matrix)
d <- nnzero(m)/ (ncol(m)*nrow(m)) # result 0.1574468
# It seems that the bipartite package does the job
library(bipartite)
networklevel(m, index=c("connectance")) # result 0.1574468
But the bipartite package is very specific to ecology, and many of its measures are designed for food webs and interactions between species (and some, like the clustering coefficient, don't seem to take the bipartite nature of the graph into account, e.g. by counting 4-cycles).
So, are there simpler ways to work on bipartite networks with igraph? To measure some global indices (density, a clustering coefficient based on 4-cycles; I know that tnet does this but my actual networks are too large), and to normalize local indices like degree, closeness and betweenness centralities while taking the bipartite structure into account (as in Borgatti S.P., Everett M.G., 1997, « Network analysis of 2-mode data », Social Networks)?
Any advice will be appreciated !
Below is the code to reproduce the sample of my matrix "m":
m <- structure(c(1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,
1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0), .Dim = c(10L, 47L), .Dimnames = list(
c("02723", "13963", "F3238", "02194", "15051", "04477", "02164",
"06283", "04080", "08304"), c("1185241", "170063", "10350868",
"217831", "2210247", "2262963", "1816670", "1848354", "2232593",
"146214", "1880252", "2261639", "2262581", "2158177", "1850147",
"2262912", "146412", "2262957", "1566083", "1841811", "146384",
"216281", "2220957", "1846986", "1951567", "1581130", "105343",
"1580240", "170654", "1796236", "1835553", "1835848", "146400",
"1174872", "1283240", "2253354", "1283617", "146617", "160263",
"2263115", "184745", "1809858", "1496747", "10346824", "148730",
"2262582", "146268")))
Density: you already got it
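For completeness, a sketch of the same density taken directly from the igraph object, assuming g was built with graph.incidence() as above so that the logical vertex attribute type is set:
n1 <- sum(!V(g)$type)  # vertices of the first mode
n2 <- sum(V(g)$type)   # vertices of the second mode
ecount(g)/(n1 * n2)    # should match nnzero(m)/(nrow(m) * ncol(m))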
Degree
degv1 <- degree(g, V(g)[type == FALSE])
degv2 <- degree(g, V(g)[type == TRUE])
Normalized degree: divide by the number of vertices in the other node category
degnormv1 <- degv1/length(V(g)[type == TRUE])
degnormv2 <- degv2/length(V(g)[type == FALSE])
No answer regarding closeness, betweenness, or the clustering coefficient.
For the normalized degree, here is a solution without igraph:
normalizedegreeV1 <- data.frame(ND = colSums(m)/nrow(m))
normalizedegreeV2 <- data.frame(ND = rowSums(m)/ncol(m))
but that leaves the other questions about centrality measures open...

Replace character in a df for numeric vector in R

I would like to replace characters with specific numeric vectors.
I have this df:
First Second Third
A C D
F R K
and I also have vectors like these
A = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
R = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
N = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
I have tried several times but I can't do it. Does anyone have some advice or idea?
An option would be to unlist the data frame (converting to character if it is a factor) and then use mget to return the value of each of those objects in a list:
lst1 <- mget(as.character(unlist(df)))
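A self-contained sketch of that idea; the data frame and indicator vectors are reconstructed from the question, and the letters beyond the A, R, N shown there are assumed to follow the same pattern:
# one 20-element indicator vector per letter appearing in df (positions of the 1s assumed for illustration)
letters_used <- c("A", "C", "D", "F", "K", "R")
for (i in seq_along(letters_used)) {
  v <- numeric(20)
  v[i] <- 1
  assign(letters_used[i], v)
}
df <- data.frame(First = c("A", "F"), Second = c("C", "R"), Third = c("D", "K"),
                 stringsAsFactors = FALSE)
lst1 <- mget(as.character(unlist(df)))  # named list of the matching numeric vectors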

R: Why isn't this matrix 3d linear interpolation working correctly?

I have a matrix of values and zeros, where zero = NA. The values are interspersed around the matrix, and what I want to do is interpolate all the NA values. This is the data:
I'm trying to estimate all of these values from the known values in my matrix, weighting each one by its distance (so that the further away a point is, the less influence it has). This is what the interpolated result looks like:
As you can see, this method is not very effective: it does affect the NAs nearest to the known values, but they quickly converge to an average value. I think this is because it uses the ENTIRE RANGE, which has many ups and downs, rather than just the points nearest to each location.
Obviously, matrix operations aren't my specialty... what do I need to change to do the linear interpolation correctly?
Here's the code:
library(dplyr)
library(plotly)
Cont <- structure(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1816, 2320, 1406, 2028, 1760, 1932, 1630,
1835, 1873, 1474, 1671, 2073, 1347, 2131, 2038, 1969, 2036, 1602,
1986, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 2311, 1947, 2094, 1947, 2441, 1775, 1461, 1260,
1494, 2022, 1863, 1587, 2082, 1567, 1770, 2065, 1404, 1809, 1972,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 2314, 1595, 2065, 1870, 2178, 1410, 1994, 1979, 2111,
1531, 1917, 1559, 2109, 1921, 1606, 1469, 1601, 1771, 1771), .Dim = c(19L,
30L))
## First get real control values
idx <- which(Cont > 0, arr.ind=TRUE)
V <- Cont[idx]
ControlValues <- data.frame(idx,V)
## Make data.frame of values to fill
toFill <- which(Cont == 0, arr.ind=TRUE) %>% as.data.frame
toFill$V <- 0
## And now figure out the weighted value of each point
for (i in 1:nrow(toFill)) {
  CurrentPoint <- toFill[i,]
  Xs <- 1/abs(CurrentPoint[,1] - ControlValues[,1])
  Xs[is.infinite(Xs)] <- 0
  Xs <- Xs/sum(Xs)/100
  Ys <- 1/abs(CurrentPoint[,2] - ControlValues[,2])
  Ys[is.infinite(Ys)] <- 0
  Ys <- Ys/sum(Ys)/100
  ControlValues1 <- data.frame(Xs, Ys)
  toFill[i,3] <- sum(rowMeans(ControlValues1) * ControlValues$V)*100
}
## add back in the controls and reorder
bind_rows(ControlValues,toFill) -> Both
Both %>% arrange(row,col) -> Both
## and plot the new surface
NewCont <- matrix(Both$V,max(Both$row),max(Both$col),byrow = T)
plot_ly(z=NewCont, type="surface",showscale=FALSE)
One approach to interpolate and extrapolate data in R is to use the akima package. The following performs bi-linear interpolation followed by extrapolation using as input the known data points in the data frame ControlValues to fill the zeroes in Cont.
library(akima)
library(plotly)
NewCont <- akima::interp(x = ControlValues[,1], y = ControlValues[,2], z = ControlValues[,3],
                         xo = 1:nrow(Cont), yo = 1:ncol(Cont), linear = TRUE)$z
NewCont[,1:9] <- akima::interp.old(x = ControlValues[,1], y = ControlValues[,2],
                                   z = ControlValues[,3], xo = 1:nrow(Cont),
                                   yo = 1:9, ncp = 2, extrap = TRUE)$z
plot_ly(z = NewCont, type = "surface", showscale = FALSE)
Notes:
The first call to akima::interp performs the bi-linear interpolation. See the help page ?akima::interp for usage and details.
A key point is that the inputs x, y, and z for the known data points need not be on a x-y grid. In this case, these are the columns of ControlValues.
The output of akima::interp is a list whose z component is a matrix of interpolated values over the grid whose x and y coordinates are defined by the inputs xo and yo, respectively. In this case, these are just the row and column indices of Cont
As stated in the help page
z-values for points outside the convex hull are returned as NA.
In this case, the first nine columns of the output corresponding to yo=1:9 will be NAs.
The second call to akima::interp (actually akima::interp.old) performs the data extrapolation to fill in the NAs left by the first call. See this SO question/answer for the details of this usage.
The above approach gives the following result
Another approach to perform bi-linear interpolation is to use the interp.surface function in the fields package. This approach is mentioned because the implementation is an R script, which can be listed by typing the function name interp.surface at the R command line.
library(fields)
loc <- make.surface.grid(list(x = 1:nrow(Cont), y = 1:ncol(Cont)))
NewCont2 <- matrix(interp.surface(list(x = sort(unique(ControlValues[,1])),
                                       y = sort(unique(ControlValues[,2])),
                                       z = matrix(ControlValues[,3],
                                                  nrow = length(unique(ControlValues[,1])),
                                                  ncol = length(unique(ControlValues[,2])))),
                                  loc),
                   nrow = nrow(Cont), ncol = ncol(Cont))
NewCont2[,1:9] <- akima::interp.old(x = ControlValues[,1], y = ControlValues[,2],
                                    z = ControlValues[,3], xo = 1:nrow(Cont),
                                    yo = 1:9, ncp = 2, extrap = TRUE)$z
Here, the requirements are the opposite of those for akima::interp. Specifically, the known data points must lie on an x-y grid. However, the coordinates to interpolate need not be on a grid; instead they are given as a matrix of corresponding x and y column vectors, where each pair (x[i], y[i]) is a coordinate to interpolate. Since the data points in ControlValues are on a grid, these requirements are also satisfied in this case. See the help page ?interp.surface for usage and details.
Notes:
sort(unique(ControlValues[,1])) and sort(unique(ControlValues[,2])) simply gives the x and y coordinates for the grid of known data points
The z component in the list is simply the z values for the known data points reshaped as a matrix over the grid of known data points
The matrix of coordinates to interpolate is generated by make.surface.grid using as x and y coordinates the row and column indices of Cont, respectively
A coordinate to interpolate that lies outside the grid of known points will result in an interpolated value of NA
interp.surface returns a vector of z values corresponding to the coordinates to interpolate. This is then reshaped to a matrix over the grid of coordinates to interpolate, which has dimensions nrow(Cont) by ncol(Cont)
Finally, it is easy to verify that the two approaches give the same result
print(max(abs(NewCont - NewCont2)))
##[1] 4.547474e-13

Add consecutive elements of a vector until a value

I would like to calculate the minimum number of consecutive elements in a vector that, when added together (consecutively), reach at least a given value.
For example in the following vector
ev<-c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 2.7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3.27, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 370.33, 1375.4,
1394.03, 1423.8, 1360, 1269.77, 1378.8, 1350.37, 1425.97, 1423.6,
1363.4, 1369.87, 1365.5, 1294.97, 1362.27, 1117.67, 1026.97,
1077.4, 1356.83, 565.23, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 356.83,
973.5, 0, 240.43, 1232.07, 1440, 1329.67, 1096.87, 1331.37, 1305.03,
1328.03, 1246.03, 1182.3, 1054.53, 723.03, 1171.53, 1263.17,
1200.37, 1054.8, 971.4, 936.4, 968.57, 897.93, 1099.87, 876.43,
1095.47, 1132, 774.4, 1075.13, 982.57, 947.33, 1096.97, 929.83,
1246.9, 1398.2, 1063.83, 1223.73, 1174.37, 1248.5, 1171.63, 1280.57,
1183.33, 1016.23, 1082.1, 795.37, 900.83, 1159.2, 992.5, 967.3,
1440, 804.13, 418.17, 559.57, 563.87, 562.97, 1113.1, 954.87,
883.8, 1207.1, 1046.83, 995.77, 803.93, 1036.63, 946.9, 887.33,
727.97, 733.93, 979.2, 1176.8, 1241.3, 1435.6)
What is the minimum number of elements that, when added consecutively (in the order they appear in the vector), sum to, let's say, 20000?
To be more clear, I need the following:
Start with ev[1] and add consecutive elements until you reach 20000; record the number of elements you had to add as r[1]. Then start with ev[2] and add until you reach 20000, recording the number of elements as r[2], and so on. Do this for the entire length of ev, then return min(r).
For example, with
j <- c(1, 2, 3, 5, 7, 9, 2)
the minimum number of elements that, when added consecutively, give a sum greater than, let's say, 20 is 3 (5 + 7 + 9).
Thanks a lot
Well, I'll give it a shot: This one will find the length of the minimum sequence of numbers
that add up to or above max. It makes no claims to be fast, but it has O(2n) time complexity :-)
I made it return both the start index and the length.
f <- function(x, max=10) {
  s <- 0
  len <- Inf
  start <- 1
  j <- 1
  for (i in seq_along(x)) {
    s <- s + x[i]
    while (s >= max) {
      if (i-j+1 < len) {
        len <- i-j+1
        start <- j
      }
      s <- s - x[j]
      j <- j + 1
    }
  }
  list(start=start, length=len)
  # uncomment the line below if you don't need the start index...
  #len
}
r <- f(ev, 20000) # list(start=245, length=15)
sum(ev[seq(r$start, len=r$length)]) # 20275.42
# Test speed:
x <- sin(1:1e6)
system.time( r <- f(x, 1.9) ) # 1.54 secs
# Compile the function makes it 9x faster...
g <- compiler::cmpfun(f)
system.time( r <- g(x, 1.9) ) # 0.17 secs
library(zoo) # Needed for rollapply
N <- 20000   # The desired sum we want to achieve
j <- 0
for (i in 1:length(ev)) {
  k <- rollapply(ev, i, sum)
  j[i] <- max(k)
  if (j[i] >= N) {
    break
  }
}
i    # contains how many consecutive elements you need to sum (15)
j[i] # contains the corresponding sum (20275.42)
Currently this doesn't tell you where the specific subset occurs in the vector, but another use of rollapply could get you that information (see the sketch below).
There are other ways to do it, but if you have a really long vector this will break out of the loop so you don't calculate more than you need. The basic idea is to use rollapply to create a vector of the consecutive sums of length k and then find the maximum of that. If this is less than what we desire, do the same thing for sums of length k+1. Repeat until we find a sum that is at least as large as the desired threshold.
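For instance, a small sketch of that second rollapply step, reusing the i and N left over from the loop above, to locate where the qualifying window starts:
k <- rollapply(ev, i, sum)    # sums of all windows of length i
start <- which(k >= N)[1]     # first window of length i reaching the target
ev[start:(start + i - 1)]     # the elements making up that window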
Edit:
This appears to be about 100x faster. I haven't compared it to Tommy's answer (which is probably faster than this), but it provides a significant speedup compared to my original method.
Edit 2: Moving the [-n] and removing the suppressWarnings speeds this up quite a bit.
myfun <- function(ev, N){
  i <- 1
  n <- length(ev)
  j <- ev
  repeat{
    j <- j[-n] + ev[-c(1:i)]
    i <- i + 1
    n <- n - 1
    if (max(j) >= N | i > length(ev)) {
      break
    }
  }
  return(i)
}
myfun(ev, 20000)
# And stealing the idea from Tommy gives a nice speedup as well
myfuncomp <- compiler::cmpfun(myfun)
myfuncomp(ev, 20000)
myfunc3 <- compiler::cmpfun(myfun, options = list(optimize = 3))
myfunc3(ev, 20000)
library(rbenchmark) # For testing
# If you have Tommy's functions loaded as f and g you can compare
benchmark(f(ev, 20000), g(ev, 20000), myfun(ev, 20000), myfuncomp(ev, 20000), myfunc3(ev, 20000))
you mean something like this?
> sum(ifelse(cumsum(ev)<=200000, 1, 0))
[1] 364
I think this may be a Traveling Salesman Problem in disguise unless you put in some more constraints. You cannot necessarily start at the maximum of ev and go out in either direction, since it may be a local, non-dense maximum.
x <- 1:length(ev)
plot(x, ev)
lxy <- loess(ev ~ x)
lines(predict(lxy))
title(main = "loess() fit of ev")
But in the region of the most dense values the values are fairly flat.
y <- c(356.83,
973.5, 0, 240.43, 1232.07, 1440, 1329.67, 1096.87, 1331.37, 1305.03,
1328.03, 1246.03, 1182.3, 1054.53, 723.03, 1171.53, 1263.17,
1200.37, 1054.8, 971.4, 936.4, 968.57, 897.93, 1099.87, 876.43,
1095.47, 1132, 774.4, 1075.13, 982.57, 947.33, 1096.97, 929.83,
1246.9, 1398.2, 1063.83, 1223.73, 1174.37, 1248.5, 1171.63, 1280.57,
1183.33, 1016.23, 1082.1, 795.37, 900.83, 1159.2, 992.5, 967.3,
1440, 804.13, 418.17, 559.57, 563.87, 562.97, 1113.1, 954.87,
883.8, 1207.1, 1046.83, 995.77, 803.93, 1036.63, 946.9, 887.33,
727.97, 733.93, 979.2, 1176.8, 1241.3, 1435.6)
x <- 1:length(y)
lxyhi <- loess(y ~ x)
plot(x, y)
lines(predict(lxyhi))
