Julia parallel for loop with two reductions - julia

I would like to perform two reductions in a parallel for loop in Julia. I am trying to compute the error in a random forest inside the parallel for loop as each tree is built. Any ideas?
Current:
forest = #parallel (vcat) for i in 1:ntrees
inds = rand(1:Nlabels, Nsamples)
build_tree(labels[inds], features[inds,:], nsubfeatures)
end
What I want, intuitively is to do an addition inside this for loop as well to get the out of bag error. This is how I would wish for it to work:
forest, ooberror = #parallel (vcat, +) for i in 1:ntrees
inds = rand(1:Nlabels, Nsamples)
tree = build_tree(labels[inds], features[inds,:], nsubfeatures)
error = geterror(ids, features, tree)
(tree, error)
end

Using a type might be best in terms of simplicity and clarity, e.g.
type Forest
trees :: Vector
error
end
join(a::Forest, b::Forest) = Forest(vcat(a.trees,b.trees), a.error+b.error)
#...
forest = #parallel (join) for i in 1:ntrees
inds = rand(1:Nlabels, Nsamples)
tree = build_tree(labels[inds], features[inds,:], nsubfeatures)
error = geterror(ids, features, tree)
Forest(tree, error)
end

Related

How to weight observations in mxnet?

I am new to neural networks and the mxnet package in R. I want to do a logistic regression on my predictors since my observations are probabilities varying between 0 and 1. I'd like to weight my observations by a vector obsWeights I have, but I'm not sure where to implement the weights. There seems to be a weight= option in mx.symbol.FullyConnected but if I try weight=obsWeights I get the following error message
Error in mx.varg.symbol.FullyConnected(list(...)) :
Cannot find argument 'weight', Possible Arguments:
----------------
num_hidden : int, required
Number of hidden nodes of the output.
no_bias : boolean, optional, default=False
Whether to disable bias parameter.
How should I proceed to weight my observations? Here is my code at the moment.
# Prepare data
train.mm = model.matrix(obs ~ . , data = train_data)
train_label = train_data$obs
# Normalize
train.mm = apply(train.mm, 2, function(x) (x-min(x))/(max(x)-min(x)))
# Create MXDataIter compatible iterator
batch_size = 128
train.iter = mx.io.arrayiter(data=t(train.mm), label=train_label,
batch.size=batch_size, shuffle=T)
# Symbolic model definition
data = mx.symbol.Variable('data')
fc1 = mx.symbol.FullyConnected(data=data, num.hidden=128, name='fc1')
act1 = mx.symbol.Activation(data=fc1, act.type='relu', name='act1')
final = mx.symbol.FullyConnected(data=act1, num.hidden=1, name='final')
logistic = mx.symbol.LogisticRegressionOutput(data=final, name='logistic')
# Run model
mxnet_train = mx.model.FeedForward.create(
symbol = logistic,
X = train.iter,
initializer = mx.init.Xavier(rnd_type = 'gaussian', factor_type = 'avg', magnitude = 2),
num.round = 25)
Assigning the fully connected weight argument is not what you want to do at any rate. That weight is a reference to parameters of the layer; i.e., what you multiply in the inputs by to get output values These are the parameter values you're trying to learn.
If you want to make some samples matter more than others, then you'll need to adjust the loss function. For example, multiply the usual loss function by your weights so that they do not contribute as much to the overall average loss.
I do not believe the standard Mxnet loss functions have a spot for assigning weights (that is LogisticRegressionOutput won't cover this). However, you can make your own cost function that does. This would involve passing your final layer through a sigmoid activation function to first generate the usual logistic regression output value. Then pass that into the loss function you define. You could do squared error, but for logistic regression you'll probably want to use the cross entropy function:
l * log(y) + (1 - l) * log(1 - y),
where l is the label and y is the predicted value.
Ideally, you'd write a symbol with an efficient definition of the gradient (Mxnet has a cross entropy function, but its for softmax input, not a binary output. You could translate your output to two outputs with softmax as an alternative, but that seems less easy to work with in this case), but the easiest path would be to let Mxnet do its autodiff on it. Then you multiply that cross entropy loss by the weights.
I haven't tested this code, but you'd ultimately have something like this (this is what you'd do in python, should be similar in R):
label = mx.sym.Variable('label')
out = mx.sym.Activation(data=final, act_type='sigmoid')
ce = label * mx.sym.log(out) + (1 - label) * mx.sym.log(1 - out)
weights = mx.sym.Variable('weights')
loss = mx.sym.MakeLoss(weigths * ce, normalization='batch')
Then you want to input your weight vector into the weights Variable along with your normal input data and labels.
As an added tip, the output of an mxnet network with a custom loss via MakeLoss outputs the loss, not the prediction. You'll probably want both in practice, in which case its useful to group the loss with a gradient-blocked version of the prediction so that you can get both. You'd do that like this:
pred_loss = mx.sym.Group([mx.sym.BlockGrad(out), loss])

Torch7 ClassNLLCriterion()

I've been trying for a whole day to get my code to work but it fails despite the inputs and outputs being consistent.
Someone mentioned somewhere that classnllcliterion does not accept values less than or equal to zero.
How am I supposed to go about training this network.
here is part of my code, I suppose it fails when in backward here the models output may contain -ve values.
However when I switch to meansquarederror criterion, the code works just fine.
ninputs = 22; noutputs = 3
hidden =22
model = nn.Sequential()
model:add(nn.Linear(ninputs, hidden)) -- define the only module
model:add(nn.Tanh())
model:add(nn.Linear(hidden, noutputs))
model:add(nn.LogSoftMax())
----------------------------------------------------------------------
-- 3. Define a loss function, to be minimized.
-- In that example, we minimize the Mean Square Error (MSE) between
-- the predictions of our linear model and the groundtruth available
-- in the dataset.
-- Torch provides many common criterions to train neural networks.
criterion = nn.ClassNLLCriterion()
----------------------------------------------------------------------
-- 4. Train the model
i=1
mean = {}
std = {}
-- To minimize the loss defined above, using the linear model defined
-- in 'model', we follow a stochastic gradient descent procedure (SGD).
-- SGD is a good optimization algorithm when the amount of training data
-- is large, and estimating the gradient of the loss function over the
-- entire training set is too costly.
-- Given an arbitrarily complex model, we can retrieve its trainable
-- parameters, and the gradients of our loss function wrt these
-- parameters by doing so:
x, dl_dx = model:getParameters()
-- In the following code, we define a closure, feval, which computes
-- the value of the loss function at a given point x, and the gradient of
-- that function with respect to x. x is the vector of trainable weights,
-- which, in this example, are all the weights of the linear matrix of
-- our model, plus one bias.
feval = function(x_new)
-- set x to x_new, if differnt
-- (in this simple example, x_new will typically always point to x,
-- so the copy is really useless)
if x ~= x_new then
x:copy(x_new)
end
-- select a new training sample
_nidx_ = (_nidx_ or 0) + 1
if _nidx_ > (#csv_tensor)[1] then _nidx_ = 1 end
local sample = csv_tensor[_nidx_]
local target = sample[{ {23,25} }]
local inputs = sample[{ {1,22} }] -- slicing of arrays.
-- reset gradients (gradients are always accumulated, to accommodate
-- batch methods)
dl_dx:zero()
-- evaluate the loss function and its derivative wrt x, for that sample
local loss_x = criterion:forward(model:forward(inputs), target)
model:backward(inputs, criterion:backward(model.output, target))
-- return loss(x) and dloss/dx
return loss_x, dl_dx
end
The error received is
/home/stormy/torch/install/bin/luajit:
/home/stormy/torch/install/share/lua/5.1/nn/THNN.lua:110: Assertion
`cur_target >= 0 && cur_target < n_classes' failed. at
/home/stormy/torch/extra/nn/lib/THNN/generic/ClassNLLCriterion.c:45
stack traceback: [C]: in function 'v'
/home/stormy/torch/install/share/lua/5.1/nn/THNN.lua:110: in function
'ClassNLLCriterion_updateOutput'
...rmy/torch/install/share/lua/5.1/nn/ClassNLLCriterion.lua:43: in
function 'forward' nn.lua:178: in function 'opfunc'
/home/stormy/torch/install/share/lua/5.1/optim/sgd.lua:44: in
function 'sgd' nn.lua:222: in main chunk [C]: in function 'dofile'
...ormy/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in
main chunk [C]: at 0x00405d50
The error message results from passing in targets that are out of bounds.
For example:
m = nn.ClassNLLCriterion()
nClasses = 3
nBatch = 10
net_output = torch.randn(nBatch, nClasses)
targets = torch.Tensor(10):random(1,3) -- targets are between 1 and 3
m:forward(net_output, targets)
m:backward(net_output, targets)
Now, see the bad example (that you suffer from)
targets[5] = 13 -- an out of bounds set of classes
targets[4] = 0 -- an out of bounds set of classes
-- these lines below will error
m:forward(net_output, targets)
m:backward(net_output, targets)

Multichannel sequence analysis through WeightedCluster package

I would like to apply the functions available in the WeightedCluster package to analyze multichannel sequences I obtained through TraMineR. I am trying so, but due to the fact that multichannel sequences are lists composed by each channel separatedly, I get errors in functions like seqtreedisplay() and all those which require a sequence object.
This is an example:
fullsequences <- list(
work_sequence2 = work_sequence[which(rownames(work_sequence) %in% commonid),],
educ_sequence2 = educ_sequence[which(rownames(educ_sequence) %in% commonid),],
part_sequence2 = part_sequence[which(rownames(part_sequence) %in% commonid),],
kid_sequence2 = kid_sequence[which(rownames(kid_sequence) %in% commonid),]
) # a total of 926 with complete sequences on all channels
multidist <- seqdistmc(
channels = fullsequences,
method = "OM",
norm = FALSE,
sm = list("TRATE","TRATE","TRATE","TRATE"),
with.missing=FALSE,
full.matrix=TRUE,
link="sum")
clusterward <- hclust(as.dist(multidist), method = "ward")
seqtreedisplay(as.seqtree(clusterward, ncluster = 5,
seqdata = fullsequences , diss = multidist))
Error in seqlegend(seqdata, fontsize = legend.fontsize, title = "Legend", :
data is not a sequence object, use seqdef function to create one
Is there a method to use the functionalities of WeightedCluster package upon a multichannel-type object (list of sequences). I am specially interested in using the Partition Around Medioids algorithm with initial ward clusters (function wcKMedioids()). If it is not possible, which is the best alternative to cluster multichannels in R?
Thanks a lot in advance!
The as.seqtree function (from WeightedCluster) requires an object of class stslist (as produced by the TraMineR seqdef function) as seqdata argument. In your case, fullsequences is a list of such objects (the list of parallel sequences), which is NOT itself of class stslist. This causes the error.
Even if you would be able to define a tree of parallel sequences, the problem would be that the seqtreedisplay does not know how to plot parallel sequences. This means that you would have to define a plot function for a list of state sequences and, using the more general disstreedisplay function instead of seqtreedisplay, pass the plot function as imagefunc argument.
To summarize, there are two problems. First you need some as.disstree equivalent of as.seqtree that would work for hierarchical clustering of non-stslist objects. Second, you need a plot function for parallel sequences. The first problem is purely technical and should be easily solved. The second is more conceptual.

How do I replace the bootstrap step in the package randomForest r

First some background info, which is probably more interesting on stats.stackexchange:
In my data analysis I try to compare the performance of different machine learning methods on time series data (regression, not classification). So for example I have trained a Boosting trained model and compare this with a Random Forest trained model (R package randomForest).
I use time series data where the explanatory variables are lagged values of other data and the dependent variable.
For some reason the Random Forest severely underperforms. One of the problems I could think of is that the Random Forest performs a sampling step of the training data for each tree. If it does this to time series data, the autoregressive nature of the series is completely removed.
To test this idea, I would like to replace the (bootstrap) sampling step in the randomForest() function with a so called block-wise bootstrap step. This basically means I cut the training set into k parts, where k<<N, where each k-th part is in the original order. If I sample these k parts, I could still benefit from the 'randomness' in the Random Forest, but with the time series nature left largely intact.
Now my problem is this:
To achieve this I would normally copy the existing function and edit the desired step/lines.
randomForest2 <- randomForest()
But the randomForest() function seems to be a wrapper for another wrapper for deeper underlying functions. So how can I edit the actual bootstrap step in the randomForest() function and still run the rest of the function regularly?
So for me the solution wasn't editing the existing randomForest function. Instead I coded the block-wise bootstrap myself, using the split2 function given by Soren H. Welling to create the blocks. Once I had my data block-wise bootstrapped, I looked for a package (rpart) that performed just a single Regression Tree and aggregated it myself (taking the means).
The result for my actual data is a slightly but consistently improved version over the normal random forest performance in terms of RMSPE.
For the code below the performance seems to be a coin-toss.
Taking Soren's code as an example it looks a bit like this:
library(randomForest)
library(doParallel) #parallel package and mclapply is better for linux
library(rpart)
#parallel backend ftw
nCPU = detectCores()
cl = makeCluster(nCPU)
registerDoParallel(cl)
#simulated time series(y) with time roll and lag=1
timepoints=1000;var=6;noise.factor=.2
#past to present orientation
y = sin((1:timepoints)*pi/30) * 1000 +
sin((1:timepoints)*pi/40) * 1000 + 1:timepoints
y = y+rnorm(timepoints,sd=sd(y))*noise.factor
plot(y,type="l")
#convert to absolute change, with lag=1
dy = c(0,y[-1]-y[-length(y)]) # c(0,t2-t1,t3-t2,...)
#compute lag
dy = dy + rnorm(timepoints)*sd(dy)*noise.factor #add noise
dy = c(0,y[-1]-y[-length(y)]) #convert to absolute change, with lag=1
dX = sapply(1:40,function(i){
getTheseLags = (1:timepoints) - i
getTheseLags[getTheseLags<1] = NA #remove before start timePoints
dx.lag.i = dy[getTheseLags]
})
dX[is.na(dX)]=-100 #quick fix of when lag exceed timeseries
pairs(data.frame(dy,dX[,1:5]),cex=.2)#data structure
#make train- and test-set
train=1:600
dy.train = dy[ train]
dy.test = dy[-train]
dX.train = dX[ train,]
dX.test = dX[-train,]
#classic rf
rf = randomForest(dX.train,dy.train,ntree=500)
print(rf)
#like function split for a vector without mixing
split2 = function(aVector,splits=31) {
lVector = length(aVector)
mod = lVector %% splits
lBlocks = rep(floor(lVector/splits),splits)
if(mod!=0) lBlocks[1:mod] = lBlocks[1:mod] + 1
lapply(1:splits,function(i) {
Stop = sum(lBlocks[1:i])
Start = Stop - lBlocks[i] + 1
aVector[Start:Stop]
})
}
#create a list of block-wise bootstrapped samples
aBlock <- list()
numTrees <- 500
splits <- 40
for (ttt in 1:numTrees){
aBlock[[ttt]] <- unlist(
sample(
split2(1:nrow(dX.train),splits=splits),
splits,
replace=T
)
)
}
#put data into a dataframe so rpart understands it
df1 <- data.frame(dy.train, dX.train)
#perform regression trees for Blocks
rfBlocks = foreach(aBlock = aBlock,
.packages=("rpart")) %dopar% {
dBlock = df1[aBlock,]
rf = predict( rpart( dy.train ~., data = dBlock, method ="anova" ), newdata=data.frame(dX.test) )
}
#predict test, make results table
#use rowMeans to aggregate the block-wise predictions
results = data.frame(predBlock = rowMeans(do.call(cbind.data.frame, rfBlocks)),
true=dy.test,
predBootstrap = predict(rf,newdata=dX.test)
)
plot(results[,1:2],xlab="OOB-CV predicted change",
ylab="trueChange",
main="black bootstrap and blue block train")
points(results[,3:2],xlab="OOB-CV predicted change",
ylab="trueChange",
col="blue")
#prediction results
print(cor(results)^2)
stopCluster(cl)#close cluster
To directly alter sampling of randomForest(type="reggression"): Learn basic C programming, download from cran source code randomForest.4.6-10.tar.gz, (if windows install Rtools), (if OSX install Xcode), install and open Rstudio, start new project, choose package, unpack ...tar.gz into folder, look into src folder, open regrf.c, checkout line 151 and 163. Write new sampling strategy, press occationally Ctrl+Shift+B package to rebuild/compile and overwrite randomForest library, correct stated compile errors, test occasionally if package still works, spend some hours figuring out the old uninformative code, perhaps change description file, namespace file, and some few other references so the package will change name to randomForestMod, rebuild, voilla.
A more easy way not changing the randomForest is described below. Any trees with the same feature inputs can be patched together with the function randomForest::combine, so you can design your sampling regime in pure R code. I thought it actually was a bad idea, but for this very naive simulation it actually works with similar/slightly better performance! Remember to not predict the absolute target value, but instead a stationary derivative such as relative change, absolute change etc. If predicting the absolute value, RF will fall back to predicting tomorrow is something pretty close of today. Which is a trivial useless information.
edited code [22:42 CEST]
library(randomForest)
library(doParallel) #parallel package and mclapply is better for linux
#parallel backend ftw
nCPU = detectCores()
cl = makeCluster(nCPU)
registerDoParallel(cl)
#simulated time series(y) with time roll and lag=1
timepoints=1000;var=6;noise.factor=.2
#past to present orientation
y = sin((1:timepoints)*pi/30) * 1000 +
sin((1:timepoints)*pi/40) * 1000 + 1:timepoints
y = y+rnorm(timepoints,sd=sd(y))*noise.factor
plot(y,type="l")
#convert to absolute change, with lag=1
dy = c(0,y[-1]-y[-length(y)]) # c(0,t2-t1,t3-t2,...)
#compute lag
dy = dy + rnorm(timepoints)*sd(dy)*noise.factor #add noise
dy = c(0,y[-1]-y[-length(y)]) #convert to absolute change, with lag=1
dX = sapply(1:40,function(i){
getTheseLags = (1:timepoints) - i
getTheseLags[getTheseLags<1] = NA #remove before start timePoints
dx.lag.i = dy[getTheseLags]
})
dX[is.na(dX)]=-100 #quick fix of when lag exceed timeseries
pairs(data.frame(dy,dX[,1:5]),cex=.2)#data structure
#make train- and test-set
train=1:600
dy.train = dy[ train]
dy.test = dy[-train]
dX.train = dX[ train,]
dX.test = dX[-train,]
#classic rf
rf = randomForest(dX.train,dy.train,ntree=500)
print(rf)
#like function split for a vector without mixing
split2 = function(aVector,splits=31) {
lVector = length(aVector)
mod = lVector %% splits
lBlocks = rep(floor(lVector/splits),splits)
if(mod!=0) lBlocks[1:mod] = lBlocks[1:mod] + 1
lapply(1:splits,function(i) {
Stop = sum(lBlocks[1:i])
Start = Stop - lBlocks[i] + 1
aVector[Start:Stop]
})
}
nBlocks=10 #combine do not support block of unequal size
rfBlocks = foreach(aBlock = split2(train,splits=nBlocks),
.combine=randomForest::combine,
.packages=("randomForest")) %dopar% {
dXblock = dX.train[aBlock,] ; dyblock = dy.train[aBlock]
rf = randomForest(x=dXblock,y=dyblock,sampsize=length(dyblock),
replace=T,ntree=50)
}
print(rfBlocks)
#predict test, make results table
results = data.frame(predBlock = predict(rfBlocks,newdata=dX.test),
true=dy.test,
predBootstrap = predict(rf,newdata=dX.test))
plot(results[,1:2],xlab="OOB-CV predicted change",
ylab="trueChange",
main="black bootstrap and blue block train")
points(results[,3:2],xlab="OOB-CV predicted change",
ylab="trueChange",
col="blue")
#prediction results
print(cor(results)^2)
stopCluster(cl)#close cluster

(R) function: object not found: environment depth fine?

I'm puzzled by a function error & would appreciate some insight.
The function, very briefly, automates the multiple processes involved in Boosted Regression Trees using gbm.step & other gbm's.
"gbm.auto" <- function (grids, samples, 3 parameters) {
starts 2 counters, require(gbm), does various small processing jobs with grids & samples
for parameter 1{
for parameter 2{
for parameter 3{
Runs 2 BRTs per parameter-combination loop, generates & iteratively updates a 'best' BRT for each, adds to counters. Extensive use of samples.
}}}
closes the loops, function continues as the first } is still open.
The next BRT can't find samples, even though it's at the same environment depth (1?) as the pre-loop processing jobs which used it successfully. Furthermore, adding "globalsamples<<-samples" after the }}} loop successfully saves the object, suggesting that samples is still available. Adding env1,2 & 3<<-environment() before the {{{ loop, within it & after it results in Environment for all three. Also suggesting it's all the same function environment & samples should be available.
What am I missing here? Thanks in advance!
Edit: exact message:
Error in eval(expr, envir, enclos) : object 'samples' not found
Function - loads removed & compacted but still gives same error message:
"gbm.auto" <-
function (samples, expvar, resvar, tc, lr, bf)
{ # open function
require(gbm)
require(dismo)
# create binary (0/1) response variable, for bernoulli BRTs
samples$brv <- ifelse(samples[resvar] > 0, 1, 0)
brvcol <- which(colnames(samples)=="brv") # brv column number for BRT
for(j in tc){ # permutations of tree complexity
for(k in lr){ # permutations of learning rate
for(l in bf){ # permutations of bag fraction
Bin_Best_Model<- gbm.step(data=samples,gbm.x = expvar, gbm.y = brvcol, family = "bernoulli", tree.complexity = j, learning.rate = k, bag.fraction = l)
}}} # close loops, producing all BRT/GBM objects & continue through model selection
Bin_Best_Simp_Check <- gbm.simplify(Bin_Best_Model) # simplify model
# if best number of variables to remove isn't 0 (i.e. it's worth simplifying), re-run the best model (Bin_Best_Model, using gbm.call to get its values)
# with just-calculated best number of variables to remove, removed. gbm.x asks which number of drops has the minimum mean (lowest point on the line)
# & that calls up the list of predictor variables with those removed, from $pred.list
if(min(Bin_Best_Simp_Check$deviance.summary$mean) < 0)
assign("Bin_Best_Simp", gbm.step(data = samples,
gbm.x = Bin_Best_Simp_Check$pred.list[[which.min(Bin_Best_Simp_Check$deviance.summary$mean)]],
gbm.y = brvcol, family = "bernoulli", tree.complexity = j, learning.rate = k, bag.fraction = l))
}
Read in data:
mysamples<-data.frame(response=round(sqrt(rnorm(5000, mean= 2.5, sd=1.5)^2)),
depth=sqrt(rnorm(5000, mean= 35, sd=24)^2),
temp=rnorm(5000, mean= 15, sd=1.2),
sal=rnorm(5000, mean= 34, sd=0.34))
Run this: gbm.auto(expvar=c(2,3,4),resvar=1,samples=mysamples,tc=2,lr=0.00000000000000001,bf=0.5)
Problem now: this causes a different error because my fake data are somehow wrong. ARGHG!
Edit: rounded the response data to integers and kept shrinking the learning rate until it runs. If it doesn't work for you, add zeroes until it does.
Edit: so this worked on my computer but reading it back to a clean sheet from online fails on a DIFFERENT count:
Error in var(cv.cor.stats, use = "complete.obs") :
no complete element pairs
In cor(y_i, u_i) : the standard deviation is zero
Is it allowed to attach or link to a csv of a small clip of my data? I'm currently burrowing deeper & deeper into bugfixing problems created by using fake data which I'm only using for this question, & thus getting off topic from the actual problem. Exasperation mode on!
Cheers
Edit2: if this is allowed: 1000row 4column csv link here: https://drive.google.com/file/d/0B6LsdZetdypkaC1WYXpKU3ZScjQ

Resources