I'm generating a sequence of scale-free networks in which I can add and remove edges according to a value sampled from a uniform distribution. The following code works, but throws a warning once in a while (every 10 runs or so). The warning is:
Warning message:
In data.table::data.table(...) :
Item 1 is of size 64 but maximum size is 66 (recycled leaving remainder of 2 items)
I've seen this question, but I don't really understand the answer or whether it applies to my situation.
The code is:
library(igraph)
library(data.table)

create_graph_sequence = function(num_nodes, num_timesteps) {
  keep_graphs <- vector(mode="list", length=num_timesteps)
  proportions = runif(2)
  cat('proportions are: ', proportions)
  prop_add = proportions[1] #Let both follow a uniform distribution
  prop_del = proportions[2]
  min_num_edges = ceiling(num_nodes/2)
  g <- barabasi.game(num_nodes, power=1.2, directed=TRUE, algorithm="psumtree")
  keep_graphs[[1]] = g

  for(i in 2:num_timesteps) {
    print(i)
    edgelist = get.edgelist(keep_graphs[[i-1]]) #(g)

    #Add and remove edges per time step.
    add_edge_to_graph = function() {
      #Do not allow creation of loops! If farm a ships to farm b, then farm b cannot ship to farm a.
      #Do not allow self-loops! If farm a is in the network, it cannot ship to farm a.
      reverse_edgelist = cbind(edgelist[,2], edgelist[,1])
      self_edgelist = cbind(seq(1:num_nodes), seq(1:num_nodes))
      total_edges_not_to_repeat = rbind(edgelist, reverse_edgelist, self_edgelist)

      #Find two nodes that are not in the current edgelist.
      #1: get a (num_nodes)*2 matrix of possible edges
      possible_edges_1 = rep(seq(1:num_nodes), each=num_nodes)
      possible_edges_2 = rep(seq(1:num_nodes), num_nodes)
      possible_edges = cbind(possible_edges_1, possible_edges_2)
      possible_edges = data.matrix(possible_edges)

      DT1 <- data.table(possible_edges)
      DT2 <- data.table(cbind(total_edges_not_to_repeat, 0), key=paste0("V", seq(len=ncol(total_edges_not_to_repeat))))
      setnames(DT2, c(head(names(DT2), -1L), 'found'))
      da <- DT2[DT1, list(found=ifelse(is.na(found), 0, 1))]

      #Append found to the possible_edges
      dt1 <- cbind(DT1, da)

      #randomly select *prop_add* rows that have '0' in the found column and add the edges
      dt1 = data.matrix(dt1)
      select_0 = dt1[dt1[, "found"]==0,]
      new_edge_row = sample(nrow(select_0), ceiling(nrow(edgelist)*prop_add))
      new_edges = select_0[new_edge_row, 1:2] #possible_edges[new_edge_row,]

      #While not all new_edges fit the bill (self-loops, loops with other farms, etc.),
      #take a new sample.
      new_edges_df = as.data.frame(new_edges)
      tentr_df = as.data.frame(total_edges_not_to_repeat)
      while(any(do.call(paste0, new_edges_df) %in% do.call(paste0, tentr_df))) {
        new_edge_row = sample(nrow(select_0), ceiling(nrow(edgelist)*prop_add))
        new_edges = select_0[new_edge_row, 1:2]
        new_edges_df = as.data.frame(new_edges)
        tentr_df = as.data.frame(total_edges_not_to_repeat)
      }
      new_edges = t(as.matrix(new_edges_df)) #for some reason matrix -> df -> matrix transposes. Transposing back.

      #g2 = g
      print('i-1 is')
      print(i-1)
      #print(keep_graphs[[i-1]])
      g2 = keep_graphs[[i-1]]
      for(i in 1:nrow(new_edges)) {
        addthisedge = c(new_edges[i,][[1]], new_edges[i,][[2]])
        g2 = add_edges(g2, edges = addthisedge)
      }
      return(g2)
    }

    delete_edge_from_graph = function() {
      #Randomly select a second proportion *prop_del* of rows to delete
      #(including the edges from the new graph - this means the number of edges DOES NOT remain constant;
      #if prop_add = prop_del then yes, it stays constant;
      #if prop_del > prop_add then the graph gets smaller over time.)
      #BUT if the number to remove results in a graph with unacceptably few edges, then remove no edges.
      g2_edgelist = get.edgelist(g2)
      edges_to_remove = sample(nrow(g2_edgelist), floor(nrow(g2_edgelist)*prop_del))
      edgefunctiong2 = E(g2)
      if(nrow(g2_edgelist) - length(edgefunctiong2[edges_to_remove]) < min_num_edges) {
        print('g2')
        print(g2)
        return(g2)
      } else {
        g2 = g2 - edgefunctiong2[edges_to_remove]
        print('g2')
        print(g2)
        return(g2)
      }
    }

    g2 = add_edge_to_graph()
    g2 = delete_edge_from_graph()
    keep_graphs[[i]] = g2
  }
  return(keep_graphs)
}
You can call this MWE by:
kept_graphs = create_graph_sequence(8, 3)
This creates a sequence of 3 graphs with 8 nodes. It shouldn't take more than a few seconds to run on a basic laptop.
How can I get rid of this warning message?
The problem is in the line
dt1 <- cbind(DT1, da)
It sometimes happens that DT1 has 64 rows while da has 66. That warrants a warning because it doesn't look like typical recycling (e.g., one vector being (1,2,3,4,5,6) and the other (2,3), where we expect recycling to produce (2,3,2,3,2,3)).
Since it's not obvious what the function is supposed to do, I'm sure it's better if you fix the issue yourself. To reproduce the warning, run set.seed(123) before calling the function, and add if(i == 3) browser() right before dt1 <- cbind(DT1, da).
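In case it helps to see the mechanism in isolation, here is a small standalone sketch (my own guess at how da can end up longer than DT1, not a verified diagnosis of your data): a cbind() of two data.tables with different row counts reproduces a warning of this shape, and a keyed join returns one row per matching row of the keyed table, so duplicate rows in total_edges_not_to_repeat would inflate da.
library(data.table)
# Two data.tables of different lengths: cbind() recycles the shorter one and warns,
# much like the message in the question (exact wording depends on the data.table version)
DT1 <- data.table(a = 1:64)
da  <- data.table(found = rep(0, 66))
dt1 <- cbind(DT1, da)
# One way da can end up longer than DT1: duplicated keys in the keyed table of a join,
# because DT2[DT1] returns one row per matching row of DT2
DT2 <- data.table(V1 = c(1, 1, 2), V2 = c(2, 2, 3), found = 0, key = c("V1", "V2"))
DTi <- data.table(V1 = 1:2, V2 = 2:3)
nrow(DT2[DTi])  # 3 rows for a 2-row DTi, because the key (1, 2) appears twice in DT2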
Related
Please excuse the title for lack of a better phrase describing my question.
I'm running a cluster stability analysis function from the 'flexclust' package, which runs bootstrap sampling on your dataset and calculates this thing called the "Rand index" for each value of k (over a range I get to specify).
The function lets you try multiple distance metrics and clustering methods, and I want to run it for every distance & method combination and find the best k based on each k's mean + median.
I've basically written nested for loops, initializing a vector for each column (name, distance metric, method, and best k), and then calling data.frame() to stitch them all together.
###############################################################################################
library(flexclust)

df = data.frame(matrix(rbinom(10*100, 1, .5), ncol=4)) #random df for testing purpose

cl_stability <- function(df, df.name, k_low, k_high)
{
  cluster.distance = c("euclidean","manhattan")
  cluster.method = c("kmeans","hardcl","neuralgas")
  for (dist in cluster.distance)
  {
    for (method in cluster.method)
    {
      j = 1
      while (j <= length(cluster.distance)*length(cluster.method))
      {
        df.names = rep(c(df.name), length(cluster.distance)*length(cluster.method))
        distances = c()
        methods = c()
        best.k.s = c()
        ip = as.data.frame(bootFlexclust(df, k = k_low:k_high, multicore = TRUE,
                                         FUN = "cclust", dist = dist, method = method)@rand)
        best_k = names(which.max(apply(ip, 2, mean) + apply(ip, 2, median))) #this part runs fine when I run it outside of the function
        distances[j] = dist
        methods[j] = method
        best.k.s[j] = best_k
        j = j + 1
        final = data.frame(df.names, distances, methods, best.k.s)
      }
    }
  }
  return(final)
}
The expected result would be a dataframe with 7 columns: name, distance metric, method, best k, 2nd best, 3rd best, and the worst (based on the median + mean criteria).
https://imgur.com/a/KpFM04m
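For what it's worth, here is a minimal sketch of how one row per distance/method combination could be collected (only a skeleton under my own assumptions: best_k_for() is a hypothetical helper that just repackages the bootFlexclust + which.max step above, and the k range is arbitrary):
library(flexclust)
# Hypothetical helper: best k for a single distance/method combination,
# repackaging the bootFlexclust + which.max step from the question
best_k_for <- function(df, dist, method, k_low, k_high) {
  ip <- as.data.frame(bootFlexclust(df, k = k_low:k_high, multicore = TRUE,
                                    FUN = "cclust", dist = dist, method = method)@rand)
  names(which.max(apply(ip, 2, mean) + apply(ip, 2, median)))
}
combos <- expand.grid(distance = c("euclidean", "manhattan"),
                      method = c("kmeans", "hardcl", "neuralgas"),
                      stringsAsFactors = FALSE)
combos$best.k <- mapply(function(d, m) best_k_for(df, d, m, k_low = 2, k_high = 7),  # k range chosen arbitrarily
                        combos$distance, combos$method)
combos$df.name <- "df"  # record the dataset name alongside the results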
I have the following vector:
wss <- c(23265.2302840678, 4917.06943551649, 1330.49917983449, 288.050702912287,
216.182464712486, 203.769578557051, 151.991297068931, 139.635571841227,
118.285305833194, 117.164567420633, 105.397722980407, 95.4682187817563,
116.448588269066, 88.1287299776581, 83.9345098736843)
And if we plot it with the following code
plot(1:15, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")
we get the usual elbow-shaped curve.
By eye, we can see that around x-axis point 4 the values stop changing drastically and begin to plateau.
My question is: given the vector wss, how can we automatically detect the index 4 without looking at the plot?
Edit: This works better:
#change relative to the maximum change
threshold <- 0.1
d1 <- diff(wss)
# this assumes that the first value is the highest
## you could use max(d1) instead of d1[1]
which.max((d1 / d1[1]) < threshold) #results in 3
d1 <- diff(wss2) # wss2 is defined in the "Code to create graphs" section below
which.max(d1 / d1[1] < threshold) #results in 5
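If it helps, the same check can be wrapped in a small helper (just a repackaging of the snippet above, assuming the series is decreasing and the first drop is the largest):
find_plateau <- function(x, threshold = 0.1) {
  d1 <- diff(x)
  which.max(d1 / d1[1] < threshold)
}
find_plateau(wss)   # 3, as above
find_plateau(wss2)  # 5, as above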
Second Edit: This is somewhat subjective, but here's how my three methods compare for your two data sets. While it's easy to visualize what a plateau is, you need to be able to describe mathematically what a plateau is in order to automate its detection.
Original: If you know that the second derivative will flip from positive to negative, you can do this:
sec_der <- diff(wss, differences = 2)
inflection_pt <- which.min(sign(sec_der))
inflection_pt
For this data set, the result is 5, which corresponds to element 7 of the original data set (i.e., 151.991).
Instead of looking at inflection points, you could look at some relative percentage threshold.
threshold <- 0.06
which.min(sign(abs(diff(wss)) / wss[1:(length(wss)-1)] - threshold))
This also results in 5, using the first derivative approach.
Regardless, using the diff() function would be a key part of figuring this out in base R. Also see:
Finding the elbow/knee in a curve
Code to create graphs:
wss <- c(23265.2302840678, 4917.06943551649, 1330.49917983449, 288.050702912287,
216.182464712486, 203.769578557051, 151.991297068931, 139.635571841227,
118.285305833194, 117.164567420633, 105.397722980407, 95.4682187817563,
116.448588269066, 88.1287299776581, 83.9345098736843)
wss2 <- c(1970.08410513303, 936.826421218935, 463.151086710784, 310.219800983285, 227.747583214178, 191.601552329558, 159.703151798393, 146.881710048563, 138.699803963718, 134.534334658148)
data_list <- list(wss, wss2)
# Potential_methods -------------------------------------------------------
plateau_method = list(thresh_to_max = function(x) which.max(diff(x) / diff(x)[1] < threshold)
, inflection_pt = function(x) which.min(sign(diff(x, differences = 2)))
, deriv_to_raw = function(x) which.min(sign(abs(diff(x)) / x[1:(length(x)-1)] - threshold))
)
threshold <- 0.1
results <- t(sapply(plateau_method, mapply, data_list))
# graphing ----------------------------------------------------------------
par(mfrow = c(3,2))
apply(results, 1, function (x) {
for (i in seq_along(x)) {
plot(data_list[[i]],ylab="Within groups sum of squares", type = 'b', xlab = 'Number of Clusters')
abline(v = x[i])
}
} )
lapply(seq_along(names(plateau_method))
, function (i) {
mtext(paste(names(plateau_method)[i]
, "- \n"
, substring(plateau_method[i], 15))
, side = 3, line = -18*(i)+15, outer = TRUE)
})
mtext('Threshold = 0.1', side = 3, line = -53, outer = T)
I have a collection of DNA sequencing reads of various lengths, sorted from longest to shortest. I would like to know the largest number of reads I can include in a set such that the N50 of that set is above some threshold t.
For any given set of reads, the total amount of data is just the cumulative sum of the lengths of the reads. The N50 is defined as the length of the read such that half of the data are contained in reads at least that long.
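For a vector of read lengths sorted from longest to shortest, that definition translates almost directly into code (a minimal sketch; conventions differ on whether to use >= or >, and my loop below uses >):
# l: read lengths, sorted in decreasing order
n50 <- function(l) {
  cs <- cumsum(l)
  l[min(which(cs >= sum(l)/2))]  # length of the first read at which half the data is reached
}
n50(100:1)  # 71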
I have a solution below, but it is slow for very large read sets. I tried vectorising it, but this was slower (probably because my threshold is usually relatively large, such that my solution below stops calculating fairly early on).
Here's a worked example:
df = data.frame(l = 100:1) # read lengths
df$cs = cumsum(df$l) # getting the cumulative sum is easy and quick
t = 95 # let's imagine that this is my threshold N50
for(i in 1:nrow(df)){
N50 = df$l[min(which(df$cs>df$cs[i]/2))]
if(N50 < t){ break }
}
# the loop will have gone one too far, so I subtract one
number.of.reads = as.integer(i-1)
This works fine on small datasets, but my actual data are more like 5m reads that vary from ~200,000 down to 1 in length (longer reads are rarer), and I'm interested in an N50 of 100,000; at that scale it gets pretty slow.
This example is closer to something that's realistic. It takes ~15s on my desktop.
l = ceiling(runif(100000, min = 0, max = 19999))
l = sort(l, decreasing = T)
df = data.frame(l = l)
df$cs = cumsum(df$l)
t = 18000
for(i in 1:nrow(df)){
n = df$l[min(which(df$cs>df$cs[i]/2))]
if(n < t){ break }
}
result = as.integer(i-1)
So, I'm interested in any ideas, tips, or tricks to noticeably optimise this. It seems like this should be possible, but I'm out of ideas.
As n is decreasing with i, you should use a binary search algorithm.
binSearch <- function(min, max) {
  print(mid <- floor(mean(c(min, max))))
  if (mid == min) {
    if (df$l[min(which(df$cs > df$cs[min]/2))] < t) {
      return(min - 1)
    } else {
      return(max - 1)
    }
  }
  n = df$l[min(which(df$cs > df$cs[mid]/2))]
  if (n >= t) {
    return(binSearch(mid, max))
  } else {
    return(binSearch(min, mid))
  }
}
Then, just call
binSearch(1, nrow(df))
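If you would rather avoid recursion (purely a style choice, since the recursion depth is only about log2(nrow(df))), an equivalent iterative sketch could look like this; like the recursive version above, it assumes df and t exist in the calling environment and ignores the edge cases where the first row already fails the threshold or the last row still passes it:
binSearchIter <- function(lo, hi) {
  # invariant: the N50 at row lo is >= t and the N50 at row hi is < t
  while (hi - lo > 1) {
    mid <- floor((lo + hi) / 2)
    n <- df$l[min(which(df$cs > df$cs[mid] / 2))]
    if (n >= t) lo <- mid else hi <- mid
  }
  lo
}
binSearchIter(1, nrow(df))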
Since your data are ordered by DNA/read length, maybe you could avoid testing every single row. Instead, you can iterate and test a limited number of reasonably spaced rows at each iteration (using while(), for example), and so get progressively closer to your solution. This should make things much faster. Just make sure that once you get close to the solution, you stop iterating.
This is your solution:
set.seed(111)
l = ceiling(runif(100000, min = 0, max = 19999))
l = sort(l, decreasing = T)
df = data.frame(l = l)
df$cs = cumsum(df$l)
t = 18000
for(i in 1:nrow(df)){
n = df$l[min(which(df$cs>df$cs[i]/2))]
if(n < t){ break }
}
result = as.integer(i-1)
result
# 21216, in ~29 seconds
Instead of testing every row, let's set a range
i1 <- 1
i2 <- nrow(df)
i.range <- as.integer(seq(i1, i2, length.out = 10))
Now, test only these 10 rows. Get the closest one and "focus in" by re-defining the range. Stop when you cannot increase granularity.
while(sum(duplicated(i.range)) == 0){
  for(i in 1:length(i.range)){
    N50 = df$l[min(which(df$cs > df$cs[i.range[i]]/2))]
    if(N50 < t){ break }
  }
  #update i1 and i2
  i1 <- i.range[(i-1)]
  i2 <- i.range[i]
  i.range <- as.integer(seq(i1, i2, length.out = 10))
}

i.range <- seq(i1, i2, by=1)
for(i in i.range){
  N50 = df$l[min(which(df$cs > df$cs[i]/2))]
  if(N50 < t){ break }
}
result <- as.integer(i-1)
result
#21216, in ~ 0.06 seconds
Same result in a fraction of the time.
I've run into a little problem simulating the throw of dice. Basically I'm doing this to get familiar with loops and their output.
The intention is to simulate the throw of two dice as follows:
R = 100
d6 = c(1:6)
d = 60
DICE = NULL
for (i in 1:R)
{
i <- as.factor((sample(d6, size=d, replace = T)) + (sample(d6, size=d, replace = T)))
j <- summary(i)
DICE = rbind(DICE, j)
}
head(DICE)
HIS = colMeans(DICE)
boxplot(DICE)
title(main= "Result 2d6", ylab= "Throws", xlab="")
relHIS = (HIS / sum(HIS))*100
relHIS
Problems occur if the count in one category is 0 (the result did not occur in the sample). If this happens (randomly) in the first subsample, one or more of the categories (numbers 2-12) are missing. This causes problems ("number of columns of result is not a multiple of vector length (arg 2)") in the following subsamples.
I'm sure there is a really simple solution for this, by defining everything beforehand...
Thanks for your help!
Here are some fixes:
R = 100
d6 = c(1:6)
d = 60
DICE = matrix(nrow = R, ncol = 11) #pre-allocate
colnames(DICE) <- 2:12
for (i in 1:R)
{
  sim <- ordered((sample(d6, size=d, replace = T)) + (sample(d6, size=d, replace = T)),
                 levels = 2:12) #define the factor levels
  sumsim <- table(sim)
  DICE[i,] <- sumsim #sub-assign
}
head(DICE)
HIS = colMeans(DICE)
boxplot(DICE)
title(main= "Result 2d6", ylab= "Throws", xlab="")
prop.table(HIS) * 100
Always pre-allocate your result data structure. Growing it in a loop is terribly slow and you know how big it needs to be. Also, don't use the same symbol for the iteration variable and something else.
Omit as.factor() in the seventh line of your code.
I have been working on a project for which I need to find peaks and valleys in a dataset (not just the highest numbers per column, but all of the peaks and valleys).
I did manage to get it to work on 1 column, but I use a for-loop for that and I need to do this for about 50 columns, so I think I should use an 'apply' function. I just don't know how to do so. Can I put 'if' statements and such in an 'apply' function?
Here is what I used for checking one column:
('First' is the name of the dataset and 'Seq1' is the first column)
Lowest = 0
Highest = 0
Summits = vector('numeric')
Valleys = vector('numeric')
for (i in 1:length(First$Seq1))
{
  if (!is.na(First$Seq1[i+1]))
  {
    if (First$Seq1[i] < Lowest) {Lowest = First$Seq1[i]}
    if (First$Seq1[i] > Highest) {Highest = First$Seq1[i]}
    if (First$Seq1[i] > 0 && First$Seq1[i+1] < 0)
    { Summits <- append(Summits, Highest, after=length(Summits)) }
    if (First$Seq1[i] < 0 && First$Seq1[i+1] > 0)
    { Valleys <- append(Valleys, Lowest, after=length(Valleys)) }
  }
}
Sure you can! I would first define a helper function that defines what is to be done with one specific column, and then call that function within apply:
HelperFun <- function(x) {
# your code from above, replacing 'Seq1' by x
}
apply(First, 2, HelperFun)
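For concreteness, this is roughly what the helper could look like; it just wraps the loop from the question and returns both results in a list (the names and the list return value are my own choices):
HelperFun <- function(x) {
  Lowest <- 0
  Highest <- 0
  Summits <- vector('numeric')
  Valleys <- vector('numeric')
  for (i in 1:length(x)) {
    if (!is.na(x[i + 1])) {
      if (x[i] < Lowest) Lowest <- x[i]
      if (x[i] > Highest) Highest <- x[i]
      if (x[i] > 0 && x[i + 1] < 0) Summits <- c(Summits, Highest)
      if (x[i] < 0 && x[i + 1] > 0) Valleys <- c(Valleys, Lowest)
    }
  }
  list(summits = Summits, valleys = Valleys)
}
# one result per column, even when columns yield different numbers of peaks
results <- lapply(First, HelperFun)
I've used lapply() here because it iterates over the columns of a data frame directly and always returns a plain list; the apply() call above works as well, it just goes through a matrix conversion first.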
An *apply function is not better for this than a for loop, provided you don't grow an object in the for loop. You must never use append in a loop. Pre-allocate your results object and fill it.
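To illustrate the pre-allocation idea on the summit part of your loop (a sketch with my own names: allocate to the maximum possible size, fill, then trim):
x <- First$Seq1                        # one column, as in the question
Summits <- numeric(length(x))          # pre-allocate to the worst case (every point a summit)
n_summits <- 0
Highest <- 0
for (i in seq_len(length(x) - 1)) {
  if (!is.na(x[i + 1])) {
    if (x[i] > Highest) Highest <- x[i]
    if (x[i] > 0 && x[i + 1] < 0) {
      n_summits <- n_summits + 1       # fill the next free slot instead of calling append()
      Summits[n_summits] <- Highest
    }
  }
}
Summits <- Summits[seq_len(n_summits)] # drop the unused tail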
This finds all local minima on a grid:
#an example
set.seed(42)
plane <- matrix(rnorm(100, sd = 5), 10)
#plot
library(raster)
plot(raster(plane))
#initialize a logical matrix
res <- matrix(TRUE, ncol = ncol(plane), nrow = nrow(plane))
#check for each subgrid of 2 times 2 cells which of the cells is the minimum
for (i in 1:(nrow(plane) - 1)) {
  for (j in 1:(ncol(plane) - 1)) {
    inds <- as.matrix(expand.grid(r = i + 0:1, c = j + 0:1))
    #cell must be a minimum of all 4 subgrids it is part of
    res[inds] <- res[inds] & plane[inds] == min(plane[inds])
  }
}
print(res)
plane[res]
#[1] -13.282277 -8.906542 -8.585043 -12.071038 -3.919195 -14.965450 -5.215595 -5.498904 -5.971644 -2.380870 -7.296070
#highlight local minima
plot(rasterToPolygons(raster(res)), border = t(res), add = TRUE)
library(reshape2)
res1 <- melt(res)
res1 <- res1[res1$value,]
text(x = res1$Var2 /10 - 0.05,
y = 1-res1$Var1 /10 + 0.05,
labels = round(plane[res],1))
I've assumed here that diagonal neighbors are counted as neighbors and not only neighbors in the same column or row. But this would be trivial to change.
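For example, a version that only treats cells in the same row or column as neighbours could look like this (my own sketch, not part of the answer above):
#local minima considering only up/down/left/right neighbours (no diagonals)
res4 <- matrix(TRUE, nrow = nrow(plane), ncol = ncol(plane))
for (i in seq_len(nrow(plane))) {
  for (j in seq_len(ncol(plane))) {
    neighbours <- c(if (i > 1) plane[i - 1, j],
                    if (i < nrow(plane)) plane[i + 1, j],
                    if (j > 1) plane[i, j - 1],
                    if (j < ncol(plane)) plane[i, j + 1])
    res4[i, j] <- all(plane[i, j] <= neighbours)
  }
}
plane[res4]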
I know that this is not the solution you want, since you have one-dimensional time series, but here is a (more direct) variation on Roland's solution.
#example data
set.seed(42)
plane <- matrix(rnorm(100, sd = 5), 10)
library(raster)
r <- raster(plane)
f <- focal(r, matrix(1,3,3), min, pad=TRUE, na.rm=TRUE)
x <- r == f
mins <- mask(r, x, maskvalue=FALSE)
pts <- rasterToPoints(mins)
cells <- cellFromXY(x, pts)
r[cells]
plot(r)
text(mins, digits=1)
plot(rasterToPolygons(mins), add=TRUE)