I have the following function that I intend to run on a large scale with 10000 random networks.
library(igraph)

m.ci.fail <- function(graph) {
  ## delete one vertex chosen at random
  fail <- function(net) {
    vids <- sample(V(net), 1)
    delete.vertices(net, vids)
  }
  ## size of the largest component of net, relative to that of graph
  compsize <- function(net, graph) {
    b <- clusters(graph)
    c <- clusters(net)
    max(c$csize) / max(b$csize)
  }
  nodes <- 1:vcount(graph)
  R <- 10000
  cpmat <- matrix(nrow = length(nodes), ncol = R)
  for (i in 1:R) {
    gr <- rewire(graph, mode = "simple", niter = 10000)
    E(gr)$weight <- E(graph)$weight
    grr <- gr   # intact copy of the rewired graph, used as the reference
    cp <- numeric(length(nodes))
    for (t in 1:(length(nodes) - 1)) {
      gr <- fail(gr)
      cp[t] <- compsize(gr, grr)
    }
    cpmat[, i] <- cp
  }
  return(cpmat)
}
EDIT: I'm trying to create a randomly re-sampled distribution based on the original graph so I can get a confidence interval and later compare the range of random failures to centrality-based sequential deletions. Testing it as it is took hours with a small graph (30 nodes). I figure that if I could find a way to vectorize the random failure function, this would be faster. I'm trying to vectorize the two for loops, but the fail function is making it a pain. Any suggestion on how I can do that?
Thanks in advance,
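One possible direction (a rough, untested sketch rather than a drop-in replacement): pre-sample the whole deletion order for a replicate up front and measure the giant-component fraction after each cumulative removal with sapply(). It calls clusters() the same number of times, but each replicate becomes a single expression, which makes it easy to hand whole replicates to sapply() or parallel::mclapply() instead of the outer for loop.
random.failure <- function(graph) {
  base      <- max(clusters(graph)$csize)           # giant component of the intact graph
  del.order <- sample(vcount(graph))                # one random removal order
  sapply(1:(vcount(graph) - 1), function(k) {
    g.k <- delete.vertices(graph, del.order[1:k])   # remove the first k vertices
    max(clusters(g.k)$csize) / base
  })
}
## one replicate (length vcount(graph) - 1), analogous to one column of cpmat:
## random.failure(rewire(graph, mode = "simple", niter = 10000))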
I am trying to create a function where a Monte Carlo simulation is applied to two of the variables in a DCF model in RStudio. It is supposed to take a first value FCF_0 and apply a specific growth rate to it, FCF_0*(1 + growth), which is the first input variable, up to period 6; each period takes the previous FCF and keeps growing. After that I would like to discount each cash flow to get its present value, FCF_n*(1/((1+WACC)^n)), where WACC is the second variable to simulate.
So far I have a function that calculates the FCFs, but with a vector of specific growth values; it is shown below, followed by my attempt at the Monte Carlo version, which I think is wrong.
Could you please help me understand how to create both simulations, and whether I need two functions or can do everything in one? I would expect the function to return the sum of all present values, with each sum being one element of a vector of 10,000 simulations. I am new at this, and even though I have been reading for almost two weeks, I don't get how to create these simulations.
Thank you very much!
revfunc <- function(hist, growth) {
  rval <- c()
  help <- c(hist)
  for (i in growth) {
    help <- help * (1 + i)
    rval <- c(rval, help)
  }
  return(rval)
}
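As a side note (not from the original post), assuming growth is a vector of per-period rates, the same cumulative path can be produced without the loop:
revfunc_vec <- function(hist, growth) hist * cumprod(1 + growth)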
Monte Carlo Simulations
pvffcf_function <- function(fcf0, growth, wacc) {
  rval1 <- c()
  help <- c(fcf0)
  pvs <- rval1 * (1 / ((1 + wacc)^n))
  random_growth <- rnorm(n = 10000, mean(fcfgrowth), sd(fcfgrowth))
  wacc <- rnorm(n = 10000, 0.03804, 0.007711)
  pvffcf <- sum(freecashflows)
  for (i in growth) {
    help <- help * (1 + i)
    rval1 <- c(rval1, help)
  }
  return(freecashflows)
}
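Since the attempt above refers to objects (n, fcfgrowth, freecashflows) that are never defined, here is a rough sketch of one way the whole simulation could be assembled. It is an assumption about the intended model, not a repair of the code above: one simulated growth rate is applied for 6 periods, the WACC moments are the ones given in the question, and fcf0 = 100 and the growth moments are placeholder numbers to be replaced with real ones.
simulate_pv <- function(fcf0, growth, wacc, periods = 6) {
  fcfs <- fcf0 * cumprod(rep(1 + growth, periods))   # FCF_1 ... FCF_6
  sum(fcfs / (1 + wacc)^(1:periods))                 # sum of discounted FCFs
}

n_sims       <- 10000
growth_draws <- rnorm(n_sims, mean = 0.05,    sd = 0.01)       # placeholder moments
wacc_draws   <- rnorm(n_sims, mean = 0.03804, sd = 0.007711)   # moments from the question
pv_sims      <- mapply(simulate_pv, growth = growth_draws, wacc = wacc_draws,
                       MoreArgs = list(fcf0 = 100))            # placeholder FCF_0
pv_sims is then the vector of 10,000 simulated present-value sums the question asks for.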
I am trying to understand what stats::kmeans does differently from the simple version explained e.g. on Wikipedia. I am honestly so supremely clueless.
Reading the help on kmeans I learned that the default algorithm is Hartigan–Wong, not the more basic method, so there should be a difference, but playing around with some normally distributed variables I couldn't find a case where they differed substantially and predictably.
For reference, this is the (utterly horrible) code I tested it against:
## squared Euclidean metric
my_metric <- function(x = vector(), y = vector()) {
  stopifnot(length(x) == length(y))
  sum((x - y)^2)
}

## data: xy data
## k: number of groups
my_kmeans <- function(data, k, maxIt = 10) {
  ## get the length, check that the two columns match and that enough data is provided
  l <- length(data[, 1])
  stopifnot(l == length(data[, 2]))
  stopifnot(l > k)
  ## generate the starting points
  ms <- data[sample(1:l, k), ]
  ## append a group column g and initialize `last`
  data$g <- 0
  last <- data$g
  it <- 0
  repeat {
    it <- it + 1
    ## iterate through each data point and assign it to the nearest cluster
    for (i in 1:l) {
      distances <- rep(Inf, k)   # was hard-coded to length 3, which breaks for k != 3
      for (j in 1:k) {
        distances[j] <- my_metric(data[i, c(1, 2)], ms[j, ])
      }
      data$g[i] <- which.min(distances)
    }
    ## update the cluster centers
    for (i in 1:k) {
      points_in_cluster <- data[data$g == i, 1:2]
      ms[i, ] <- c(mean(points_in_cluster[, 1]), mean(points_in_cluster[, 2]))
    }
    ## break condition: nothing changed, or the iteration limit was reached
    if (my_metric(last, data$g) == 0 | it > maxIt) {
      break
    }
    last <- data$g
  }
  data
}
First off, this was a duplicate (as I just found out) of this post.
But I will still try to give an example: when the clusters are well separated, Lloyd tends to leave the centers inside the clusters they start in, meaning that some clusters may end up split between centers while others get lumped together.
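A hedged illustration of how to probe this (a sketch, not from the original answer): simulate three well-separated blobs and run both algorithms from the same starting centers. With one starting point per blob, as below, the two partitions should agree; re-running with all three starts drawn from a single blob (e.g. starts <- x[1:3, ]) is the kind of configuration where Lloyd can stay stuck while Hartigan-Wong escapes.
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0,  sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 5,  sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 10, sd = 0.3), ncol = 2))
starts <- x[c(10, 60, 110), ]                 # same starting centers for both runs
fit_lloyd <- kmeans(x, centers = starts, algorithm = "Lloyd")
fit_hw    <- kmeans(x, centers = starts)      # Hartigan-Wong is the default
table(Lloyd = fit_lloyd$cluster, HW = fit_hw$cluster)   # compare the two partitions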
I am currently doing research to predict where kudzu (an invasive vine) will spread in Oklahoma over a five-year period by using Monte Carlo simulation. I have created a raster with presence points and loaded it into R.
For each Monte Carlo simulation (each year), I am running 6000 iterations to provide accurate results. However, due to the "for" loops, this is taking a long time. The first year usually finishes running in 3 days; however, the second year has been running for over 3 weeks and still is not complete.
Is there any way to speed this process up?
Each year builds off the previous one. I have provided the code below for the first two years:
library(raster)
OK.rast16 <- raster("OK_rast20161.tif")
p.a16 <- as.matrix(OK.rast16)
table(p.a16)
# Set the random number seed so results can be reproduced if needed
set.seed(10)
drow.pa16 <- 133.7197873         # distance of grid cell (meters) in n-s direction
trow.pa16 <- length(p.a16[, 1])  # total number of rows
dcol.pa16 <- 133.7197873         # distance of grid cell (meters) in e-w direction
tcol.pa16 <- length(p.a16[1, ])  # total number of columns
##### Year 1 of infection in 2016 #####
kudzu_sim1_16 <- matrix(0, trow.pa16, tcol.pa16)
for (m in 1:6000) {
  OK.kudzu_16 <- p.a16   # initialize matrix of annual dispersal
  for (i in 1:trow.pa16) {
    for (j in 1:tcol.pa16) {
      if (!is.na(p.a16[i, j]) & p.a16[i, j] == 1) {
        for (k in 1:trow.pa16) {
          for (l in 1:tcol.pa16) {
            if (!is.na(OK.kudzu_16[k, l]) & OK.kudzu_16[k, l] == 0) {
              distcalc <- sqrt((abs(i - k) * drow.pa16)^2 + (abs(j - l) * dcol.pa16)^2)
              prob <- exp(0.0369599 - 0.00474401 * distcalc)
              if (prob > runif(1)) { OK.kudzu_16[k, l] <- 1 }
            }
          }
        }
      }
    }
  }
  kudzu_sim1_16 <- OK.kudzu_16 + kudzu_sim1_16
}
##### Year 2 of infection in 2016 #####
kudzu_sim2_16 <- matrix(0, trow.pa16, tcol.pa16)
for (m in 1:6000) {
  OK.kudzu1_16 <- OK.kudzu_16   # initialize matrix of annual dispersal
  for (i in 1:trow.pa16) {
    for (j in 1:tcol.pa16) {
      if (!is.na(OK.kudzu_16[i, j]) & OK.kudzu_16[i, j] == 1) {
        for (k in 1:trow.pa16) {
          for (l in 1:tcol.pa16) {
            if (!is.na(OK.kudzu1_16[k, l]) & OK.kudzu1_16[k, l] == 0) {
              distcalc <- sqrt((abs(i - k) * drow.pa16)^2 + (abs(j - l) * dcol.pa16)^2)
              prob <- exp(0.0369599 - 0.00474401 * distcalc)
              if (prob > runif(1)) { OK.kudzu1_16[k, l] <- 1 }
            }
          }
        }
      }
    }
  }
  kudzu_sim2_16 <- OK.kudzu1_16 + kudzu_sim2_16
}
Here is the raster to load to start the code:
kudzu in OK
Originally a comment, but it rapidly exceeded the length limit:
1) 133mx133m is a very small grid size for spatial simulations on something as large as an entire state. It might help to find a way to make resolution a parameter of your simulation (rather than a hard-wired number), streamline the code so that it runs well at a larger resolution, then increase the resolution. The raster function has an optional parameter named res which can be used to control the resolution.
2) While vectorization will surely help, it is unlikely to transform an algorithm which runs for weeks with no output into one which runs in a reasonable amount of time. Perhaps you need to fundamentally rethink your algorithm. You seem to be comparing every grid cell with every other grid cell. That doesn't strike me as biologically plausible. Kudzu spreads locally. Why should what happens next year to a given 133m x 133m cell in the Oklahoma panhandle depend on the current state of another cell over by Lake Texoma? If your simulation has any biological realism, exp(0.0369599-0.00474401*distcalc) should be negligibly small for two such cells, but your code doesn't neglect it. An algorithm which is in some sense localized might be better; a sketch of that idea follows point 4 below.
3) There are an awful lot of entries in your matrix which correspond to points outside of Oklahoma. Unless your model is designed to see how kudzu also diffuses over a large part of Texas, those might be irrelevant for your program. If so, a fundamentally different data structure (e.g. a list of locations) which only has entries for points in Oklahoma might be preferable. Or, maybe not. Just something to think about.
4) For more detailed help, it would help if you explain what your algorithm actually is (and not just what it intends to do). It isn't completely obvious in a quick read of your code.
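To make point 2) concrete, here is a rough sketch of the localization idea (an assumption about how the update could be restructured, not your current algorithm). It keeps the same kernel exp(0.0369599 - 0.00474401*d), but only visits cells within the radius at which that kernel falls below a small threshold, and it wraps one year of dispersal in a function so the year-1 and year-2 blocks do not have to be written out twice.
eps   <- 1e-6                                   # ignore dispersal probabilities below this
d.max <- (0.0369599 - log(eps)) / 0.00474401    # distance at which the kernel drops to eps
w     <- ceiling(d.max / drow.pa16)             # window half-width, in grid cells

spread_once <- function(occ, drow, dcol, w) {
  new <- occ
  sources <- which(occ == 1, arr.ind = TRUE)    # currently occupied cells
  for (p in seq_len(nrow(sources))) {
    i <- sources[p, 1]; j <- sources[p, 2]
    for (k in max(1, i - w):min(nrow(occ), i + w)) {
      for (l in max(1, j - w):min(ncol(occ), j + w)) {
        if (!is.na(new[k, l]) && new[k, l] == 0) {
          d <- sqrt(((i - k) * drow)^2 + ((j - l) * dcol)^2)
          if (exp(0.0369599 - 0.00474401 * d) > runif(1)) new[k, l] <- 1
        }
      }
    }
  }
  new
}
## one Monte Carlo draw of year 1 would then be something like:
## OK.kudzu_16 <- spread_once(p.a16, drow.pa16, dcol.pa16, w)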
As someone relatively new to R, I'm having an issue with creating a for loop.
I have a very large data set with 9000 observations and 25 categorical variables, which I've transformed into binary data and performed hierarchical clustering on. Now I want to try k-modes clustering to produce an Elbow Plot using the within-cluster simple-matching distance for each cluster, which is output by kmodes$withindiff. I can sum this for each of k in 1:8 clusters to get the Elbow Plot.
library(klaR)
WCSM <- numeric(8)   # the results vector has to exist before it is indexed
for (k in 1:8) {
  WCSM[k] <- sum(kmodes(data, k, iter.max = 100)$withindiff)
}
plot(1:8, WCSM, type = "b", xlab = "Number of Clusters",
     ylab = "Within-Cluster Simple-Matching Distance Summed",
     main = "K-modes Elbow Plot")
My issue is that I want further output from k-modes. For each k in 1:8 I would like to get the vector of integers indicating the cluster to which each object is allocated, given by kmodes$cluster. I need a for loop that runs through each k in 1:8 and saves each of the outputs into 8 separate vectors, but I don't know how to write such a loop. I could just run the 8 lines of code below separately, but they each take 15 minutes with iter.max=10, so increasing this to iter.max=100 will need to be left running overnight, and a loop would be useful.
cl.kmodes2=kmodes(data, 2,iter.max=100)
cl.kmodes3=kmodes(data, 3,iter.max=100)
cl.kmodes4=kmodes(data, 4,iter.max=100)
cl.kmodes5=kmodes(data, 5,iter.max=100)
cl.kmodes6=kmodes(data, 6,iter.max=100)
cl.kmodes7=kmodes(data, 7,iter.max=100)
cl.kmodes8=kmodes(data, 8,iter.max=100)
Ultimately I want to compare the results from the hierarchical binary clustering to the k-modes clustering by getting the Adjusted Rand Index. For example, cutting the tree at k=4 for the hierarchical cluster and comparing this to a 4 cluster solution from k-modes:
dist.binary = dist(data, method="binary")
cl.binary = hclust(dist.binary, method="complete")
hcl.4 = cutree(cl.binary, k = 4)
tab = table(hcl.4, cl.kmodes4$cluster)
library(e1071)
classAgreement(tab)
I agree with Imo, using a list is the best solution.
If you don't want to do that, you could also use assign() to create a separate object in every iteration:
library(klaR)
for(k in 1:8) {
assign(paste("cl.kmodes", k, sep = ""), kmodes(data, k, iter.max = 100))
}
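One caveat with the assign() route (not part of the original answer): the objects then have to be fetched back by name, e.g. with get() or mget(), which is part of why the list approach below is usually nicer.
sum(get("cl.kmodes4")$withindiff)            # fetch one fitted object by name
allFits <- mget(paste0("cl.kmodes", 1:8))    # or gather them all into a list after the fact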
The best method is to put the output from your clusters into a named list:
library(klaR)
myClusterList <- list()
for(k in 1:8) {
myClusterList[[paste0("k.", k)]] <- kmodes(data, k, iter.max = 100)
}
You can then pull out any of the contents easily:
sum(myClusterList[["k.1"]]$withindiff)
or
sum(myClusterList[[1]]$withindiff)
You can also save the list to use in future R sessions, see ?save.
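Tying this back to the question (a sketch of my own, assuming hcl.4 from the question is still in the workspace): both the elbow plot and the Adjusted Rand Index comparison can be driven straight from the list.
WCSM <- sapply(myClusterList, function(cl) sum(cl$withindiff))
plot(1:8, WCSM, type = "b", xlab = "Number of Clusters",
     ylab = "Within-Cluster Simple-Matching Distance Summed",
     main = "K-modes Elbow Plot")

library(e1071)
tab <- table(hcl.4, myClusterList[["k.4"]]$cluster)   # 4-cluster solutions compared
classAgreement(tab)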
I have to do a Monte Carlo approach for an AR(1) time series. I have to generate 10,000 time series of length 100, and afterwards I have to get the first-lag autocorrelation rho_1 for every time series. My problem is that I just get NA values for the autocorrelation, and the calculation takes way too much time. I have no problem with computing the AR(1) time series.
Thank you for your help :)
gen_ar <- function(a, b, length, start) {
  z <- rep(0, length)
  e <- rnorm(n = length, sd = 1)
  z[1] <- start
  for (i in 2:length) {
    z[i] <- a + b * z[i - 1] + e[i]
  }
  z
}
mc <- matrix(rep(0, 10000000), nrow = 10000)
for (i in 1:10000) {
  mc[i, ] <- gen_ar(0.99, 1, 100, 0)
}

ac <- matrix(rep(0, 10000), nrow = 1)
for (i in 1:10000) {
  for (j in 1:99) {
    ac[i] <- cor(mc[i, j], mc[i, j + 1])
  }
}
Statistics aside, I think this achieves your goals, and I don't get NAs. I changed the way it was done because you said it was going slowly.
mc <- matrix(NA_real_, nrow = 100, ncol = 100)   # 100 series of length 100; scale ncol up to 10000 for the full run
for (i in seq_len(ncol(mc))) {
  mc[, i] <- arima.sim(model = list(ar = 0.99), n = 100, sd = 1) + 1
}
myAR <- function(x) {
  cor(x[-1], x[-length(x)])
}
answer <- apply(mc, 2, myAR)
I skipped the last set of nested for loops and replaced them with apply(). It seems easier to read, and is likely faster. Also, to use apply(), I created a function called myAR, which computes the lag-1 correlation from the whole series at once; your nested for() loops called cor() on two single numbers at a time, and the correlation of two scalars is undefined, which is why you were getting NA.
Now, there are a couple of statistical adjustments that I made. Primarily, these were in the simulation step.
First, your simulated AR(1) process has a coefficient that is equal to 1, which seems odd to me (this would not be stationary, and arima.sim() won't even let you simulate this type of process).
Moreover, your "a" parameter adds 1 to the time series at each time step. In other words, your time series is monotonically increasing from 1 to 100 because the coefficient is equal to 1. This too would make your time series nonstationary, and with such a strong positive slope the cor() function would likely return 1 as the estimated correlation, regardless of the value of the simulated AR coefficient. I assume that you wanted the long-term mean to hover near 1, so the 1 is simply added to the entire time series after it is simulated, not iteratively at each time step.
Assuming that you did want to generate a nonstationary time series by adding some constant (a) at each time step, you could do the following:
myInnov <- function(N = 100, a = 1, SD = 1) { a + rnorm(n = N, sd = SD) }

mc2 <- matrix(NA_real_, nrow = 100, ncol = 1E5)
for (i in seq_len(1E5)) {
  mc2[, i] <- arima.sim(model = list(ar = 0.99), n = 100,
                        innov = myInnov(a = 1, N = 100, SD = 1)) + 1
}
I hope that this helps.
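As a quick sanity check (my addition, not part of the answer above), the vector of estimated lag-1 autocorrelations can be summarized directly; for series of length 100 with ar = 0.99 the estimates tend to sit noticeably below 0.99, but none of them should be NA.
summary(answer)    # distribution of the estimated rho_1 values
hist(answer, breaks = 20, main = "Estimated lag-1 autocorrelation", xlab = "rho_1 estimate")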