SI model in Rhadoop

I want to measure the diffusion of information on my graph using the SI model. I define a set of initially infected nodes. I based my code on "Susceptible-Infected model for network diffusion" and adapted it to my needs, but when I run it on a graph of 5000 nodes, it runs for hours. Here is my code:
get_infected1 = function(g, transmission_rate, diffusers){
infected=list()
Susceptible<-setdiff(V(g)$name,diffusers)
toss = function(freq) {
tossing = NULL
coins = c(1, 0)
probabilities = c(transmission_rate, 1-transmission_rate )
for (i in 1:freq ) tossing[i] = sample(coins, 1, rep=TRUE, prob=probabilities)
tossing = sum(tossing)
return (tossing)
}
infected[[1]] = diffusers
update_diffusers = function(diffusers){
nearest_neighbors<-data.frame()
for (i in 1:length(diffusers)){
L<-as.character(diffusers[i])
Nei1 <- unique(neighbors(g,(V(g)$name == L),1))
Nei1<-intersect(Susceptible,Nei1)
nearest_neighbors1 = data.frame(table(unlist(Nei1)))
nearest_neighbors = unique(rbind(nearest_neighbors,nearest_neighbors1))
}
nearest_neighbors = subset(nearest_neighbors, !(nearest_neighbors[,1]%in%diffusers))
keep = unlist(lapply(nearest_neighbors[,2],toss))
new = as.numeric(as.character(nearest_neighbors[,1][keep >= 1]))
for (j in 1:length(new)){ #fill the vector
c<-new[j]
vec[j]<-V(g)$name[c]
}
new_infected = as.vector(vec)
diffusers = unique(c(diffusers, new_infected))
return(diffusers)
}
# get infected nodes
total_time = 1
node_number=vcount(g)
while(length(Susceptible) > 0){
infected[[total_time+1]] = sort(update_diffusers(infected[[total_time]]))
Susceptible<-setdiff(Susceptible, infected[[total_time+1]])
total_time = total_time + 1
}
# return the infected nodes list
return(infected)
}
Each initially infected node infects its neighbors with some probability, so the output is the list of infected nodes at each step.
I want to adapt this code to run on RHadoop, but I am new to RHadoop. I don't know exactly what I should modify, or how to load my graph into Hadoop. Any suggestions?
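For reference, the per-step update described above (one Bernoulli trial per contact between an infected node and a susceptible neighbor) can be sketched without the per-node loops. This is only a base-R sketch, not an RHadoop implementation: it assumes the graph is available as an undirected two-column edge list with integer node ids 1..n, and the names si_step, si_run and max_steps are made up for the illustration.

```r
# Assumption: `edges` is a two-column matrix of integer node ids (undirected graph).
si_step <- function(edges, infected, transmission_rate) {
  # List every edge in both directions, so each row reads (source, target).
  both <- rbind(edges, edges[, 2:1, drop = FALSE])
  # Keep only contacts from an infected source to a susceptible target.
  active <- both[both[, 1] %in% infected & !(both[, 2] %in% infected), , drop = FALSE]
  # One Bernoulli trial per contact, drawn in a single vectorised call.
  hit <- runif(nrow(active)) < transmission_rate
  sort(unique(c(infected, active[hit, 2])))
}

si_run <- function(edges, seeds, transmission_rate, max_steps = 1000) {
  infected <- list(sort(unique(seeds)))
  n <- max(edges)
  step <- 1
  # max_steps (hypothetical guard) stops the loop if some node is unreachable.
  while (length(infected[[step]]) < n && step <= max_steps) {
    infected[[step + 1]] <- si_step(edges, infected[[step]], transmission_rate)
    step <- step + 1
  }
  infected
}

set.seed(42)
edges <- cbind(1:9, 2:10)  # a path graph on 10 nodes
history <- si_run(edges, seeds = 1, transmission_rate = 0.5)
```

Each element of history is the set of infected nodes at that step, as in the original function. Whether this is fast enough for 5000 nodes without Hadoop would have to be measured.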

Related

Improving the speed

Here is my code in Julia. I would like to improve its speed, since it is slow for large datasets. I have provided the code with a small example so it can be executed and produce results. I think the bottleneck is the find function in the loop, which makes the code very slow, but I don't know how to replace it with something faster.
A = [[1,2,3,4,5], [2,3,4,5,6,7,8], [4,7,8,9], [9,10], [2,3,4,5]]
mx = maximum(maximum(A))
idx_new = zeros(Int, mx)
flag = ones(Int, mx);
Hscore = rand(1, length(A))
thresh = 0.2 * sum(Hscore)
acc_q = 0
pos = sortperm(vec(Hscore))
iter = 1
while acc_q < thresh
acc_q = acc_q + Hscore[pos[iter]]
nd = A[pos[iter]]
fd_flag = flag[nd]
cc = in.(fd_flag, 2)
node = nd[findall(x->x==0, cc)]
dd = nd[findall(x->x!=0, cc)]
TF = isempty(dd)
if TF == true
q_val = Hscore[pos[iter]]
acc_q = acc_q + q_val
idx_new[vec(node)] .= (val + 1)
flag[node] .= 2
val = val + 1;
iter = iter + 1
end # end of if TF
end ## end of while loop

While "please improve my code" is not the right style of question for Stack Overflow, in general, when you search many times for an element among a large number of options, these are the first two approaches to consider:
Sort the list of elements (with sort!) and use searchsorted to find the desired element.
Use Set(mylist) to create a hash set and then search within the set.

Storing information during optim()

I have a general function; I have provided an example below using simple linear regression:
x = 1:30
y = 0.7 * x + 32
Data = rnorm(30, mean = y, sd = 2.5);
lin = function(pars = c(grad,cons)) {
expec = pars[1] * x + pars[2];
SSE = sum((Data - expec)^2)
return(SSE)
}
start_vals = c(0.2,10)
lin(start_vals)
estimates = optim(par = start_vals, fn = lin);
## plot the data
Fit = estimates$par[1] * x + estimates$par[2]
plot(x,Data)
lines(x, Fit, col = "red")
So that's straightforward. What I want is to store the expectation for the last set of parameters, so that once I have finished optimizing I can view it. I have tried using a global container and populating it when the function is executed, but it doesn't work, e.g.
Expectation = c();
lin = function(pars = c(grad,cons)) {
expec = pars[1] * x + pars[2];
Expectation = expec;
SSE = sum((Data - expec)^2)
return(SSE)
}
start_vals = c(0.2,10)
estimates = optim(par = start_vals, fn = lin);
Expectation ## print the expectation that would relate to estimates$par
I know that this is trivial to do outside of the function, but my actual problem (which is analogous to this) is much more complex. Basically I need to return internal information that can't be retrospectively calculated. Any help is much appreciated.

You should use <<- instead of = in your lin function: Expectation <<- expec. The operators <<- and ->> are normally only used in functions, and cause a search to be made through parent environments for an existing definition of the variable being assigned.
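A minimal sketch of that change, reusing the names from the question. One caveat that is my own addition: the last evaluation optim makes is not guaranteed to be at the returned optimum, so the sketch calls lin once more with estimates$par to make sure the stored value matches it.

```r
set.seed(1)
x <- 1:30
y <- 0.7 * x + 32
Data <- rnorm(30, mean = y, sd = 2.5)

Expectation <- NULL  # global container, overwritten on every call to lin()

lin <- function(pars) {
  expec <- pars[1] * x + pars[2]
  Expectation <<- expec  # assign in the enclosing (here: global) environment
  sum((Data - expec)^2)
}

estimates <- optim(par = c(0.2, 10), fn = lin)

# Re-evaluate at the returned parameters so Expectation relates to estimates$par.
lin(estimates$par)
Expectation
```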

Tensorflow: 6 layer CNN: OOM (use 10Gb GPU memory)

I am using the following code to run a 6-layer CNN with 2 FC layers on top (on a Tesla K-80 GPU).
Somehow, it consumes the entire 10 GB of memory and dies with an out-of-memory error. I know that I can reduce batch_size and run again, but I also want to run with 15 or 20 CNN layers. What's wrong with the following code, and why does it take all the memory? How should I run the code for a 15-layer CNN?
Code:
import os
import tensorflow as tf
import model
with tf.Graph().as_default() as g_train:
    filenames = tf.train.match_filenames_once(FLAGS.train_dir+'*.tfrecords')
    filename_queue = tf.train.string_input_producer(filenames, shuffle=True, num_epochs=FLAGS.num_epochs)
    feats,labels = get_batch_input(filename_queue, batch_size=FLAGS.batch_size)
    ### feats size=(batch_size, 100, 50)
    logits = model.inference(feats, FLAGS.batch_size)
    loss = model.loss(logits, labels, feats)
    tvars = tf.trainable_variables()
    global_step = tf.Variable(0, name='global_step', trainable=False)
    # Add to the Graph operations that train the model.
    train_op = model.training(loss, tvars, global_step, FLAGS.learning_rate, FLAGS.clip_gradients)
    # Add the Op to compare the logits to the labels during evaluation.
    eval_correct = model.evaluation(logits, labels, feats)
    summary_op = tf.merge_all_summaries()
    saver = tf.train.Saver(tf.all_variables(), max_to_keep=15)
    # The op for initializing the variables.
    init_op = tf.initialize_all_variables()
    sess = tf.Session()
    sess.run(init_op)
    summary_writer = tf.train.SummaryWriter(FLAGS.model_dir,
                                            graph=sess.graph)
    # Start input enqueue threads.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        step = 0
        while not coord.should_stop():
            _, loss_value = sess.run([train_op, loss])
            if step % 100 == 0:
                # Two placeholders for two values (the original format string expected a third, a duration).
                print('Step %d: loss = %.2f' % (step, loss_value))
                # Update the events file.
                summary_str = sess.run(summary_op)
                summary_writer.add_summary(summary_str, step)
            if (step == 0) or (step + 1) % 1000 == 0 or (step + 1) == FLAGS.max_steps:
                ckpt_model = os.path.join(FLAGS.model_dir, 'model.ckpt')
                saver.save(sess, ckpt_model, global_step=step)
                #saver.save(sess, FLAGS.model_dir, global_step=step)
            step += 1
    except tf.errors.OutOfRangeError:
        print('Done training for %d epochs, %d steps.' % (FLAGS.num_epochs, step))
    finally:
        coord.join(threads)
        sess.close()
###################### File model.py ####################
import tensorflow as tf
def conv2d(x, W, b, strides=1):
    # Conv2D wrapper, with bias and relu activation
    x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1],
                     padding='SAME')
    x = tf.nn.bias_add(x, b)
    return tf.nn.relu(x)
def maxpool2d(x, k=2,s=2):
    # MaxPool2D wrapper
    return tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, s, s, 1],
                          padding='SAME')
def inference(feats,batch_size):
    #feats size (batch_size,100,50,1) #batch_size=256
    conv1_w=tf.get_variable("conv1_w", [filter_size,filter_size,1,256],initializer=tf.uniform_unit_scaling_initializer())
    conv1_b=tf.get_variable("conv1_b",[256])
    conv1 = conv2d(feats, conv1_w, conv1_b,2)
    conv1 = maxpool2d(conv1, k=2,s=2)
    ### This was replicated for 6 layers and the 2 FC connected layers are added
    return logits
def training(loss, train_vars, global_step, learning_rate, clip_gradients):
    # Add a scalar summary for the snapshot loss.
    tf.scalar_summary(loss.op.name, loss)
    grads, _ = tf.clip_by_global_norm(tf.gradients(loss, train_vars,aggregation_method=1), clip_gradients)
    optimizer = tf.train.AdamOptimizer(learning_rate)
    train_op = optimizer.apply_gradients(zip(grads, train_vars), global_step=global_step)
    return train_op

I am not too sure what the model Python library is. If it is something you wrote and you can change the settings of the optimizer, I would suggest the following, which I use in my own code:
train_step = tf.train.AdamOptimizer(learning_rate).minimize(cost, aggregation_method = tf.AggregationMethod.EXPERIMENTAL_ACCUMULATE_N)
By default the aggregation_method is ADD_N, but changing it to EXPERIMENTAL_ACCUMULATE_N or EXPERIMENTAL_TREE can save a lot of memory. The main memory hog in these programs is that TensorFlow must keep the output values of every neuron so that it can compute the gradients. Changing the aggregation_method helps a lot in my experience.
Also, by the way, I don't think there is anything wrong with your code. I can run out of memory on small conv-nets as well.

Catching the print of the function

I am using the fda package, in particular the function fRegress. This function calls another function, eigchk, which checks whether the coefficient matrix is singular.
Here is the function as the package authors (J. O. Ramsay, Giles Hooker, and Spencer Graves) wrote it.
eigchk <- function(Cmat) {
# check Cmat for singularity
eigval <- eigen(Cmat)$values
ncoef <- length(eigval)
if (eigval[ncoef] < 0) {
neig <- min(length(eigval),10)
cat("\nSmallest eigenvalues:\n")
print(eigval[(ncoef-neig+1):ncoef])
cat("\nLargest eigenvalues:\n")
print(eigval[1:neig])
stop("Negative eigenvalue of coefficient matrix.")
}
if (eigval[ncoef] == 0) stop("Zero eigenvalue of coefficient matrix.")
logcondition <- log10(eigval[1]) - log10(eigval[ncoef])
if (logcondition > 12) {
warning("Near singularity in coefficient matrix.")
cat(paste("\nLog10 Eigenvalues range from\n",
log10(eigval[ncoef])," to ",log10(eigval[1]),"\n"))
}
}
As you can see, the last if condition checks whether logcondition is greater than 12 and then prints the range of the eigenvalues.
The following code implements regularization with a roughness penalty. The code is taken from the book "Functional Data Analysis with R and MATLAB".
annualprec = log10(apply(daily$precav,2,sum))
tempbasis =create.fourier.basis(c(0,365),65)
tempSmooth=smooth.basis(day.5,daily$tempav,tempbasis)
tempfd =tempSmooth$fd
templist = vector("list",2)
templist[[1]] = rep(1,35)
templist[[2]] = tempfd
conbasis = create.constant.basis(c(0,365))
betalist = vector("list",2)
betalist[[1]] = conbasis
SSE = sum((annualprec - mean(annualprec))^2)
Lcoef = c(0,(2*pi/365)^2,0)
harmaccelLfd = vec2Lfd(Lcoef, c(0,365))
betabasis = create.fourier.basis(c(0, 365), 35)
lambda = 10^12.5
betafdPar = fdPar(betabasis, harmaccelLfd, lambda)
betalist[[2]] = betafdPar
annPrecTemp = fRegress(annualprec, templist, betalist)
betaestlist2 = annPrecTemp$betaestlist
annualprechat2 = annPrecTemp$yhatfdobj
SSE1.2 = sum((annualprec-annualprechat2)^2)
RSQ2 = (SSE - SSE1.2)/SSE
Fratio2 = ((SSE-SSE1.2)/3.7)/(SSE1/30.3)
resid = annualprec - annualprechat2
SigmaE. = sum(resid^2)/(35-annPrecTemp$df)
SigmaE = SigmaE.*diag(rep(1,35))
y2cMap = tempSmooth$y2cMap
stderrList = fRegress.stderr(annPrecTemp, y2cMap, SigmaE)
betafdPar = betaestlist2[[2]]
betafd = betafdPar$fd
betastderrList = stderrList$betastderrlist
betastderrfd = betastderrList[[2]]
As the penalty factor the authors use a certain lambda.
The following code implements the search for the appropriate lambda.
loglam = seq(5,15,0.5)
nlam = length(loglam)
SSE.CV = matrix(0,nlam,1)
for (ilam in 1:nlam) {
lambda = 10^loglam[ilam]
betalisti = betalist
betafdPar2 = betalisti[[2]]
betafdPar2$lambda = lambda
betalisti[[2]] = betafdPar2
fRegi = fRegress.CV(annualprec, templist,
betalisti)
SSE.CV[ilam] = fRegi$SSE.CV
}
By varying loglam and cross-validating, I expect to find the best lambda. However, if loglam is too long, or its values drive the coefficient matrix toward singularity, I receive the following message:
Log10 Eigenvalues range from
-5.44495317739048 to 6.78194912518214
It is created by the function eigchk, as mentioned above.
Now my question: is there any way to catch this so-called warning? By catch I mean some function or method that alerts me when this has happened, so that I can adjust the values of loglam. Since there is no actual warning definition in the function besides this printed message, I have run out of ideas.
Thank you all a lot for your suggestions.

If by "catch the warning" you mean being alerted that there is a potential problem with loglam, then you might want to look at the try and tryCatch functions. With them you can define the behavior you want when a warning condition is raised.
If you just want to store the printed output (which the question title suggests, but may not be what you want), then look into capture.output.
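A minimal sketch of both options. Since the example should run without the fda package, noisy_check is a hypothetical stand-in that, like the eigchk listing in the question, calls warning() and then prints with cat(); the numbers it prints are placeholders.

```r
# Hypothetical stand-in for eigchk: raises a warning, then prints a message.
noisy_check <- function(logcondition) {
  if (logcondition > 12) {
    warning("Near singularity in coefficient matrix.")
    cat("\nLog10 Eigenvalues range from\n", -5.4, " to ", 6.8, "\n")
  }
  invisible(logcondition)
}

# Option 1: tryCatch stops at the first warning and returns a flag instead.
flagged <- tryCatch({
  noisy_check(13)
  FALSE
}, warning = function(w) {
  message("Caught: ", conditionMessage(w))
  TRUE
})

# Option 2: capture.output collects what cat()/print() wrote to the console.
printed <- suppressWarnings(capture.output(noisy_check(13)))
```

In the cross-validation loop from the question, the same tryCatch wrapper could go around the fRegress.CV call, for example to record which values of loglam triggered the warning. Note that tryCatch abandons the call at the warning; withCallingHandlers is the base-R alternative if the computation should continue.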

Repeat function until index value exceeds threshold in R

I need to apply a function until an index limit (distance, in this case) is met. I'm trying to figure out a way to apply the function repeatedly while avoiding recursion issues.
Example:
I want to apply the following code until total_dist = flight_distance (2500 km). The distance traveled in a given flight depends on the energy available. The flight proceeds as a series of jumps and stops--expending and obtaining energy, respectively. If there is enough energy at the start, the flight can be finished in only two jumps (with one stop). Sometimes two or more stops are necessary. I can't know this ahead of time.
So how can I modify jump_metrics to get it to repeat until the total distance covered is 2500?
flight_distance = 2500
flight_markers = c(1:flight_distance)
TD_cost_km = rnorm(2500, 5, 1)
potential_stops = c(1:(flight_distance-1))
cumulative_flight_cost = vector("list", length=length(potential_stops))
for(i in 1:length(potential_stops)) {
cumulative_flight_cost[[i]] = cumsum(TD_cost_km[flight_markers>potential_stops[i]])
}
max_fuel_quantiles = seq(0, 1, length=flight_distance)
jump_metrics = function() {
start_fuel_prob = qbeta(runif(flight_distance, pbeta(max_fuel_quantiles,1,1),
pbeta(max_fuel_quantiles,1,1)), 1.45*2, 1)
start_energy_level_est =
rnorm(1, sample(1544 + (start_fuel_prob * 7569), 1, replace=T),
0.25)
start_max_energy = ifelse(start_energy_level_est < 1544, 1544,
start_energy_level_est)
fuel_level = start_max_energy - cumulative_flight_cost[[1]]
dist_traveled = length(fuel_level[fuel_level>(max(fuel_level)*0.2)])
dist_left = flight_distance - dist_traveled
partial_dist = 1 + dist_traveled
dist_dep_max_energy = c(rep(start_max_energy, length(1:dist_left)),
seq(start_max_energy, 1544,
length.out=length((dist_left+1):flight_distance)))
next_max_energy = dist_dep_max_energy[partial_dist]
next_fuel_level = next_max_energy - cumulative_flight_cost[[partial_dist]]
next_dist_traveled = length(next_fuel_level[next_fuel_level >
(max(next_fuel_level)*0.2)])
total_dist = next_dist_traveled + partial_dist
list(partial_dist, next_dist_traveled, total_dist)
}
jump_metrics()
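One common pattern for "repeat until a threshold is reached" is a plain while loop that carries the running total forward, which avoids recursion altogether. The sketch below shows only the loop structure: one_jump is a hypothetical stand-in for the energy-based calculation inside jump_metrics, and its distance range (500 to 1500 km) is made up for the illustration.

```r
# Hypothetical stand-in for one jump: distance covered when starting at `start`,
# capped so the flight never overshoots flight_distance.
one_jump <- function(start, flight_distance) {
  min(sample(500:1500, 1), flight_distance - start)
}

fly <- function(flight_distance = 2500) {
  total_dist <- 0
  jumps <- numeric(0)
  # Keep jumping until the whole distance has been covered.
  while (total_dist < flight_distance) {
    d <- one_jump(total_dist, flight_distance)
    jumps <- c(jumps, d)
    total_dist <- total_dist + d
  }
  list(jumps = jumps, stops = length(jumps) - 1, total_dist = total_dist)
}

set.seed(1)
result <- fly()
```

The number of stops falls out of the loop (length(jumps) - 1), so it does not need to be known ahead of time.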
