I've been trying to use the msm package to model an 8-state, multi-state Markov chain. My data set contains about 11,000 subjects, with slightly over 100k observations in total.
I try to run the msm function on several subsets of the data by taking the head of the data set, like so:
mm2myTrajectoryMSM <- msm(role ~ year, subject = authId, data = head(mm2myMarkovRoles[, 1:3, with = FALSE], 7000),
                          qmatrix = trajectory.qmatrix, death = 1, control = list(trace = 1, REPORT = 1))
So far, I have not been able to get past ~7000 lines. Looking at the report output, I noticed that the function freezes when the iter value turns negative. For example, here is the run with the first 10k rows of the data:
initial value 19017.328402
iter 2 value 17808.111677
iter 3 value 17707.483305
iter 4 value -346782.085429 (freeze)
But it works with the first 20k rows
initial value 38101.266287
iter 2 value 35871.849676
iter 3 value 35796.410415
iter 4 value -721867.559664
iter 4 value -721867.559664
final value -721867.559664
converged
But not with 50k rows
initial value 92846.642840
iter 2 value 88466.007605
iter 3 value 88310.215979
iter 4 value 88276.433502
iter 5 value 88247.381022
iter 6 value -983685.709474
But it works for 60,010 and 80,007 rows (I'm capturing full records of subjects), and beyond that I cannot tell whether the system freezes or the analysis is simply taking a very long time. The single CPU assigned to the task is maxed out, but I am nowhere near my RAM limit (< 1% of the 96 GB on the server).
I have two questions:
1) Why does the function (arbitrarily?) hang on certain subsets of the data?
2) How can I estimate the run time of this function? The last time I let it run, it went for over two days. Oddly, the computation time for many of the runs appeared to scale sub-linearly, but once I crossed some threshold the scaling seems to change entirely.
Are you running msm 1.5?
In the changelog (http://cran.r-project.org/web/packages/msm/ChangeLog) it is mentioned that a bug was fixed that led to infinite loops on Windows.
If your time series has several short jumps, you might get a log-likelihood underflow. You can investigate this using fixedpars = TRUE in the msm call (then extract the log-likelihood and look for underflow/overflow).
If something is wrong you'll get very long running times (hard to predict).
Also try scaling your likelihood values by passing fnscale = 100000 in the control list.
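Concretely, those two checks could look something like this (a sketch only, reusing the data, qmatrix and subject names from your call; I haven't run it on your data):

library(msm)

# 1. Evaluate the likelihood at the initial values without optimising
#    (fixedpars = TRUE) and inspect it for underflow/overflow.
fit0 <- msm(role ~ year, subject = authId,
            data = head(mm2myMarkovRoles[, 1:3, with = FALSE], 10000),
            qmatrix = trajectory.qmatrix, death = 1,
            fixedpars = TRUE)
logLik(fit0)   # an absurdly large magnitude here points to numerical trouble

# 2. Rescale the objective for the optimiser via optim's fnscale.
fit1 <- msm(role ~ year, subject = authId,
            data = head(mm2myMarkovRoles[, 1:3, with = FALSE], 10000),
            qmatrix = trajectory.qmatrix, death = 1,
            control = list(fnscale = 100000, trace = 1, REPORT = 1))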
I previously asked the following question
Permutation of n bernoulli random variables in R
The answer to that question works great as long as n is relatively small (< 30); otherwise the following error occurs: Error: cannot allocate vector of size 4.0 Gb. I can get the code to run with somewhat larger values by using my desktop at work, but eventually the same error occurs. Even for values my computer can handle, say 25, the code is extremely slow.
The purpose of this code is to calculate the difference between the CDF of an exact distribution (hence the permutations) and a normal approximation. I randomly generate some data, calculate the test statistic, and then I need to determine the CDF by summing all the permutations that result in a smaller test statistic value, divided by the total number of permutations.
My thought is to generate the permutations one at a time, note whether each is smaller than my observed value, and then go on to the next one, i.e. loop over all possible permutations. But I can't just build a data frame of all the permutations to loop over, because that would cause the exact same size and speed issues.
Long story short: I need to generate all possible permutations of 1's and 0's for n Bernoulli trials, but I need to do this one at a time, such that all of them are generated and none is generated more than once, for arbitrary n. For n = 3 there are 2^3 = 8 of them; I would first generate
000
calculate if my test statistic was greater (1 or 0) then generate
001
calculate again, then generate
010
calculate, then generate
100
calculate, then generate
011
etc until 111
I'm fine with this being a loop over 2^n that outputs the permutation at each step but doesn't save them all somewhere. Also, I don't care what order they are generated in; the above is just how I would list them out if I were doing it by hand.
In addition, if there is any way to speed up the previous code, that would also be helpful.
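For concreteness, this is the kind of structure I have in mind, just as a sketch (test_stat() and obs_stat below are hypothetical placeholders for my real statistic and observed value):

# Loop over all 2^n patterns without storing them, deriving the i-th 0/1
# pattern from the loop index via intToBits (works for n up to 31).
test_stat <- function(x) sum(x)  # placeholder statistic
obs_stat  <- 2                   # placeholder observed value

n <- 3
count <- 0
for (i in 0:(2^n - 1)) {
  perm <- as.integer(intToBits(i))[1:n]  # the i-th pattern of 0s and 1s
  count <- count + (test_stat(perm) <= obs_stat)
}
count / 2^n  # exact CDF value at obs_stat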
A good solution for your problem is iterators. There is a package called arrangements that is able to generate permutations in an iterative fashion. Observe:
library(arrangements)

# initialize the iterator
iperm <- ipermutations(0:1, 3, replace = TRUE)
for (i in 1:(2^3)) {
  print(iperm$getnext())
}
[1] 0 0 0
[1] 0 0 1
.
.
.
[1] 1 1 1
It is written in C and is very efficient. You can also generate m permutations at a time like so:
iperm$getnext(m)
This allows for better performance because the next permutations are being generated by a for loop in C as opposed to a for loop in R.
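For example, a batched version might look like this (a sketch; I'm assuming getnext() returns NULL once the iterator is exhausted):

library(arrangements)

iperm <- ipermutations(0:1, 25, replace = TRUE)

repeat {
  batch <- iperm$getnext(10000)   # a matrix with up to 10000 rows per batch
  if (is.null(batch)) break       # assumed behaviour when the iterator is exhausted
  # compare each row's statistic to the observed value here, e.g.
  # count <- count + sum(rowSums(batch) <= obs_stat)
}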
If you really need to ramp up performance, you can use the parallel package.
iperm <- ipermutations(0:1, 40, replace = TRUE)

parallel::mclapply(1:100, function(x) {
  myPerms <- iperm$getnext(10000)
  # do something
}, mc.cores = parallel::detectCores() - 1)
Note: All code is untested.
I have a big dataset (around 100k rows) with 2 columns referencing a device_id and a date and the rest of the columns being attributes (e.g. device_repaired, device_replaced).
I'm building a ML algorithm to predict when a device will have to be maintained. To do so, I want to calculate certain features (e.g. device_reparations_on_last_3days, device_replacements_on_last_5days).
I have a function that subsets my dataset and returns a calculation:
For the specified device,
That happened before the day in question,
As long as there's enough data (e.g. if I want last 3 days, but only 2 records exist this returns NA).
Here's a sample of the data and the function outlined above:
data = data.frame(device_id = c(rep(1, 5), rep(2, 10)),
                  day = c(1:5, 1:10),
                  device_repaired = sample(0:1, 15, replace = TRUE),
                  device_replaced = sample(0:1, 15, replace = TRUE))

# Example: how many times device 1 was repaired over the last 2 days before day 3
# => getCalculation(3, 1, data, "device_repaired", 2)
getCalculation <- function(fday, fdeviceid, fdata, fattribute, fpreviousdays){
  # Subset dataset
  df = subset(fdata, day < fday & day > (fday - fpreviousdays - 1) & device_id == fdeviceid)
  # Make sure there's enough data; if so, make calculation
  if (nrow(df) < fpreviousdays) {
    calculation = NA
  } else {
    calculation = sum(df[, fattribute])
  }
  return(calculation)
}
My problem is that the number of available attributes (e.g. device_repaired) and of features to calculate (e.g. device_reparations_on_last_3days) has grown considerably, and my script now takes around 4 hours to execute, since I need to loop over each row and calculate all these features.
I'd like to vectorize this logic using some apply approach, which would also allow me to parallelize its execution, but I don't know if/how it's possible to pass these extra arguments to an lapply-style call.
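One way this could look (a sketch, not necessarily the fastest approach) is mapply, which iterates over fday and fdeviceid in parallel while passing the fixed arguments through MoreArgs; parallel::mcmapply is the forked equivalent. The column name device_repaired_last_2days is just an illustrative feature name.

# Row-wise feature using the existing getCalculation(), driven by mapply
data$device_repaired_last_2days <- mapply(
  getCalculation,
  fday      = data$day,
  fdeviceid = data$device_id,
  MoreArgs  = list(fdata = data, fattribute = "device_repaired", fpreviousdays = 2)
)

# Parallel version (Unix-like systems only, since mcmapply uses forking)
# data$device_repaired_last_2days <- parallel::mcmapply(
#   getCalculation,
#   fday = data$day, fdeviceid = data$device_id,
#   MoreArgs = list(fdata = data, fattribute = "device_repaired", fpreviousdays = 2),
#   mc.cores = parallel::detectCores() - 1
# )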
I have a data.table in R with 150,000 rows in it.
I use 9 features, and training takes more than 30 minutes; I didn't wait any longer than that.
I also tried it on 500 rows (it takes 0.2 sec) and on 5,000 rows (71.2 sec).
So how should I train my model on all the data, or maybe you can give me some other advice?
Here is the console log:
train1 <- train[1:5000,]+1
> f1 = as.formula("target~ v1+ v2+ v3+ v4+ v5+ v6+ v7+ v8+ v9")
> a=Sys.time()
> nn <-neuralnet(f1,data=train1, hidden = c(4,2), err.fct = "ce", linear.output = TRUE)
Warning message:
'err.fct' was automatically set to sum of squared error (sse), because the response is not binary
> b=Sys.time()
> difftime(b,a,units = "secs")
Time difference of 71.2000401 secs
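For reference, this is roughly how I've been timing runs at different sizes (a sketch; train is as above, and I set err.fct = "sse" explicitly to match the warning):

library(neuralnet)

f1 <- as.formula("target ~ v1 + v2 + v3 + v4 + v5 + v6 + v7 + v8 + v9")
sizes <- c(500, 2000, 5000)
timings <- sapply(sizes, function(n) {
  train_n <- train[1:n, ] + 1
  system.time(
    neuralnet(f1, data = train_n, hidden = c(4, 2),
              err.fct = "sse", linear.output = TRUE)
  )["elapsed"]
})
cbind(rows = sizes, seconds = timings)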
This is to be expected in my experience; there are a lot of calculations involved in neural nets. I personally have one written in Python (2 hidden layers, including a momentum term) with about 38,000 patterns of 56 inputs and 3 outputs. Splitting them into 8,000-pattern chunks took about 10 minutes per run and just under a week to learn to my satisfaction.
The whole set of 38,000 needed more hidden nodes to store all the patterns, and that took over 6 hours to go through one cycle and over 3 months to learn. Neural networks are a very powerful tool, but they come at a price in my experience; others may have better implementations, but every comparison of classification algorithms I have seen mentions the time to learn as being significant.
I am trying to create a simple loop to generate a Wright-Fisher simulation of genetic drift with the sample() function (I'm actually not dead-set on using this function, but, in my naivety, it seems like the right way to go). I know that sample() randomly selects values from a vector based on certain probabilities. My goal is to create a system that will keep running making random selections from successive sets. For example, if it takes some original set of values and samples a second set, I'd like the loop to take another random sample from the second set (using the probabilities that were defined earlier).
I'd like to just learn how to do this in a very general way, so the specific probabilities and elements are arbitrary at this point. The only things that matter are (1) that every element can be repeated and (2) that the size of the set stays constant across generations, per Wright-Fisher. As an example, I've been playing with the following:
V <- c(1,1,2,2,2,2)
sample(V, size=6, replace=TRUE, prob=c(1,1,1,1,1,1))
Regrettably, my issue is that I don't have any code to share yet, precisely because I'm not sure how to start writing this kind of loop. I know that for() loops are used to repeat a block of code multiple times, so my guess is to start there. However, from what I've read about them, it seems you have to start with a loop variable (typically i). There isn't any variable in this sampling that seems explicitly obvious, which isn't to say one couldn't be made up.
If you wanted to repeatedly sample from a population with replacement for a total of iter iterations, you could use a for loop:
set.seed(144) # For reproducibility
population <- init.population
for (i in seq_len(iter)) {
  population <- sample(population, replace = TRUE)
}
population
# [1] 1 1 1 1 1 1
Data:
init.population <- c(1, 1, 2, 2, 2, 2)
iter <- 100
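As an extension, a minimal sketch that also records the frequency of the 1 allele in each generation, so the drift itself can be plotted:

set.seed(144)
population <- init.population
freq1 <- numeric(iter)                     # proportion of 1s per generation
for (g in seq_len(iter)) {
  population <- sample(population, replace = TRUE)
  freq1[g] <- mean(population == 1)
}
plot(freq1, type = "l", xlab = "Generation", ylab = "Frequency of allele 1")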
I have a large list of 15,000 elements, each containing 10 numbers (data).
I am doing a time-series cluster analysis using
distmatrix <- dist(data, method = "DTW")
This has now been running for 24 hours. Is it likely to complete any time soon? Is there a way of checking on its progress? I don't want to abort just in case it's about to finish.
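One rough way to gauge whether it will finish (a sketch, assuming data is the list described above and the same dist()/DTW setup): dist() computes on the order of n*(n-1)/2 pairwise DTW distances, so you can time a small subset and scale quadratically.

n_small <- 500
small   <- data[1:n_small]   # first 500 series (adjust the subsetting if data is a matrix)
t_small <- system.time(dist(small, method = "DTW"))["elapsed"]

n_full <- 15000
t_small * (n_full / n_small)^2 / 3600   # crude estimate of the full run, in hours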