I've created an SVM in R using the kernlab package, but it's running incredibly slowly (20,000 predictions take ~45 seconds on the win64 R distribution). CPU is running at 25% and RAM utilization is a mere 17%, so it's not a hardware bottleneck. Similar calculations using the data mining algorithms in SQL Server Analysis Services run about 40x faster.
Through trial and error, we discovered that the laplacedot kernel gives us the best results by a wide margin. rbfdot is about 15% less accurate but twice as fast (still too slow). vanilladot is the fastest: it runs more or less instantly, but its accuracy is far too low to use.
We'd ideally like to use the laplacedot kernel but to do so we need a massive speedup. Does anyone have any ideas on how to do this?
Here is some profiling information I generated using Rprof. It looks like most of the time is spent in low-level math calls (the rest of the profile consists of data similar to rows 16-40). These calls should run very quickly, but it looks like the code is just not optimized (and I don't know where to start).
http://pastebin.com/yVPC66Be
Edit: Sample code to reproduce:
dummy.length = 20000;
source.data = as.matrix(cbind(sample(1:dummy.length)/1300, sample(1:dummy.length)/1900))
colnames(source.data) <- c("column1", "column2")
y.value = as.matrix((sample(1:dummy.length) + 9) / 923)
model <- ksvm(source.data, y.value, type="eps-svr", kernel="laplacedot", C=1, kpar=list(sigma=3));
My actual source data has 7 numeric (floating-point) columns and 20,000 rows. The model takes about 2-3 minutes to train. The next call generates the predictions and consistently takes ~40 seconds to run:
predictions <- predict(model, source.data)
Edit 2: The laplacedot kernel computes its value for two vectors using the following formula. This corresponds rather closely with the profr output. Strangely, the unary minus (just before the round function) appears to consume about 50% of the runtime.
return(exp(-sigma * sqrt(-(round(2 * crossprod(x, y) - crossprod(x,x) - crossprod(y,y), 9)))))
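For reference, that expression is just the expansion ||x - y||^2 = x'x + y'y - 2x'y under the square root; a plain-R restatement of the same kernel value (my sketch, not kernlab's code; x and y are numeric vectors, sigma > 0) is:
laplace.kernel <- function(x, y, sigma) {
  # Laplace (laplacedot) kernel: k(x, y) = exp(-sigma * ||x - y||)
  exp(-sigma * sqrt(sum((x - y)^2)))
}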
Edit 3: Added sample code to reproduce - this gives me about the same runtimes as my actual data.
SVM itself is a slow algorithm: training time typically grows at least quadratically, roughly O(n^2), in the number of training examples.
SMO (Sequential Minimal Optimization, http://en.wikipedia.org/wiki/Sequential_minimal_optimization) is an algorithm for efficiently solving the optimization problem which arises during the training of support vector machines.
libsvm (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) and liblinear are two open-source implementations.
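As a hedged illustration only (this is not the asker's setup, and e1071 has no laplacedot kernel), you could compare prediction speed against libsvm via the e1071 package on the same dummy data, using an RBF kernel as a stand-in:
library(e1071)
# eps-regression on the dummy data from the question; gamma = 3 is a rough stand-in for kpar sigma
libsvm.model <- svm(x = source.data, y = as.numeric(y.value), type = "eps-regression", kernel = "radial", cost = 1, gamma = 3)
libsvm.pred <- predict(libsvm.model, source.data)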
Related
I'm estimating a non-linear system (via seemingly unrelated regressions, SUR) using the systemfit package (the nlsystemfit() function), with 4 equations, 32 parameters to estimate (!) and 412 observations. But my code is taking forever (my laptop is not a super-powerful one, though). So far the process has been running for 13 hours. I'm not an expert in computational matters, but someone once explained to me the concept of time complexity of algorithms (big-O): the time to run a given algorithm depends on some functional relation to the number of observations and/or coefficients.
Hence, I'm thinking of just stopping the process, simplifying the model (temporarily), and running something smaller first, just to check whether the estimated parameters make sense. Then I would run the whole model.
But all this only makes sense if I can change key elements of my model that significantly reduce the processing time. That's why I was searching Google for the time complexity of the nlm package (nlsystemfit() relies on nlm), but without success. So this is my question: does anybody know where I can find that information, or at least have advice on how to test non-linear systems before running the whole model?
Since you didn't provide any substantial information about your model, or any code for it, it's hard to suggest a specific improvement for your situation.
From what you said:
Hence, I'm thinking of just stopping the process, simplifying the model (temporarily), and running something smaller first, just to check whether the estimated parameters make sense. Then I would run the whole model.
It seems what you need is benchmarking: measuring the time your code takes to execute (although benchmarking can also cover memory usage or other performance metrics).
There are quite a few ways to benchmark code in R. These include calling Sys.time() or system.time() just before and right after your algorithm/function executes, or using libraries such as rbenchmark (a simple wrapper around system.time), tictoc, bench and microbenchmark.
Among these, the last two are preferable: the bench package provides bench::system_time(), a higher-precision alternative to system.time(), and microbenchmark is known to reliably measure and compare the execution time of R expressions/algorithms.
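For example, a minimal sketch of both approaches (the nlsystemfit() arguments below are placeholders for your own model specification, not working code):
# Coarse timing of a single run
start <- Sys.time()
fit <- nlsystemfit("SUR", eqns, startvals, data = mydata)   # placeholder call
Sys.time() - start

# Repeated, higher-precision timing of a (simplified) model
library(microbenchmark)
microbenchmark(simplified = nlsystemfit("SUR", eqns_small, startvals_small, data = mydata), times = 3)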
This is a newbie question. I am trying to minimize the following QP problem:
x'Qx + b'x + c,   subject to A*x >= lb
where:
x is a vector of coordinates,
Q is a sparse, strongly diagonally dominant, symmetric matrix, typically of size 500,000 x 500,000 to 1M x 1M
b is a vector of constants
c is a constant
A is an identity matrix
lb is a vector containing lower bounds on vector x
Following are the packages I have tried:
Optim.jl: It has a primal interior-point algorithm for simple "box" constraints. I have tried playing around with the inner_optimizer, setting it to GradientDescent() or ConjugateGradient(). No matter what, this seems to be very slow for my problem set.
IterativeSolvers.jl: It has a conjugate gradient solver, but there is no way to impose constraints on the QP problem.
MathProgBase.jl: It gives access to the Ipopt solver, which handles quadratic programs. It works wonderfully for small datasets, typically around a 3K x 3K matrix, but it takes too long for the kind of datasets I am looking at. I am aware that changing Ipopt's linear system solver from MUMPS to HSL or WSMP may produce a significant improvement, but is there a way to plug third-party linear system solvers into Ipopt through Julia?
OSQP.jl: This again takes too long to converge for the data sets that I am interested in.
Also, if anybody has worked with datasets this large, can they suggest a way to solve a problem of this scale quickly in Julia using the existing packages?
You can try the OSQP solver with different parameters to speed up convergence for your specific problem. In particular:
If you have multiple cores, MKL Pardiso can significantly reduce the execution time. You can find details on how to install it here (it basically consists of running the default MKL installer). After that, you can use it in OSQP as follows:
model = OSQP.Model()
OSQP.setup!(model; P=Q, q=b, A=A, l=lb, u=ub, linsys_solver="mkl pardiso")
results = OSQP.solve!(model)
The number of iterations depends on your step size rho. OSQP automatically updates it, trying to find the best one. If you have a specific problem, you can disable the automatic update and tune rho yourself. Here is an example that fixes a specific rho:
model = OSQP.Model()
OSQP.setup!(model; P=Q, q=b, A=A, l=lb, u=ub, linsys_solver="mkl pardiso",
adaptive_rho=false, rho=1e-3)
results = OSQP.solve!(model)
I suggest you try different rho values, maybe log-spaced between 1e-06 and 1e+06.
You can also rescale the problem data so that the condition number of your matrices is not too high; this can significantly reduce the number of iterations.
I'm pretty sure that if you follow these three steps you can make OSQP work quite well. I am happy to try OSQP on your problem if you are willing to share your data (I am one of the developers).
Slightly unrelated: you can call OSQP using MathProgBase.jl and JuMP.jl. It also supports the newer MathOptInterface.jl package, which will replace MathProgBase.jl in the next version of JuMP.
I want to use glm( ... , family = "binomial") to do a logistic regression on my big dataset, which has 80,000,000 rows and 125 columns in a data.frame. But when I run it in RStudio, it just crashes.
So I wonder what the time complexity of glm() is, and whether there are any ways to handle data of this size. Someone suggested running the code from the command line: does that make any difference? (I tried, but it doesn't seem to work either.)
Memory requirement: R has to load the entire dataset into memory (RAM). Your dataset (assuming 32-bit entries) is roughly 37 gigabytes -- much larger than the amount of RAM you have on your computer. Therefore, it crashes. You cannot use R for this dataset unless you use special big-data packages, and I'm not sure it's even feasible then.
There are other tools that do not need to load the whole dataset into memory to work with it, so it might be wise to use one of them.
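To illustrate the "special big data packages" route, here is a hedged sketch using biglm::bigglm, which fits a GLM from data chunks so the full 80M-row table never has to sit in RAM at once. The callback below walks an in-memory data frame purely to show the protocol, and the column names (y, x1, x2) are made up; for your data you would read each chunk from disk or a database instead.
library(biglm)

# bigglm accepts a function data(reset): reset = TRUE rewinds to the start,
# reset = FALSE returns the next chunk of rows, or NULL when there are no more.
make_chunks <- function(df, chunksize = 1e5) {
  i <- 0
  function(reset = FALSE) {
    if (reset) { i <<- 0; return(NULL) }
    if (i >= nrow(df)) return(NULL)
    rows <- seq(i + 1, min(i + chunksize, nrow(df)))
    i <<- max(rows)
    df[rows, , drop = FALSE]
  }
}

fit <- bigglm(y ~ x1 + x2, data = make_chunks(mydata), family = binomial())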
Time complexity for GLMs: if N = # of observations (usually # of rows) and p = # of variables (usually # of columns), most standard GLM fitting algorithms (iteratively reweighted least squares) cost O(Np^2 + p^3) per iteration: O(Np^2) to form the weighted cross-products and O(p^3) to solve the resulting p x p system.
For your situation, Np^2 = 8*10^7 * 125^2, roughly 10^12 operations per iteration, which is still barely in the realm of possibility, but you would probably need more than one modern PC running for at least a few days.
I have a training set of size 38 MB (12 attributes with 420,000 rows). I am running the R snippet below to train the model using randomForest. This is taking hours for me.
rf.model <- randomForest(
Weekly_Sales~.,
data=newdata,
keep.forest=TRUE,
importance=TRUE,
ntree=200,
do.trace=TRUE,
na.action=na.roughfix
)
I think it is taking a long time to execute because of na.roughfix; there are so many NAs in the training set.
Could someone let me know how I can improve the performance?
My system configuration is:
Intel(R) Core i7 CPU @ 2.90 GHz
RAM - 8 GB
HDD - 500 GB
64 bit OS
(The tl;dr is you should a) increase nodesize to >> 1 and b) exclude very low-importance feature columns, maybe even excluding (say) 80% of your columns. Your issue is almost surely not na.roughfix, but if you suspect it, run na.roughfix separately as a standalone step before calling randomForest. Get that red herring out of the way first.)
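A minimal sketch of pulling the imputation out as its own step (the ntree and nodesize values here are illustrative, not tuned):
library(randomForest)
newdata.imputed <- na.roughfix(newdata)   # medians for numeric columns, modes for factors
rf.model <- randomForest(Weekly_Sales ~ ., data = newdata.imputed,
                         ntree = 200, nodesize = 420, importance = TRUE)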
Now, all of the following advice only applies until you blow out your memory limits, so measure your memory usage and make sure you're not exceeding it. (Start with ridiculously small parameters, then scale them up, measure the runtime, and keep checking that it didn't increase disproportionately.)
The main parameters affecting the performance of randomForest are:
mtry (less is faster)
ntrees
number of features/cols in data - more is quadratically slower, or worse! See below
number of observations/rows in data
ncores (more is faster - as long as parallel option is being used)
some performance boost by setting importance=F and proximity=F (don't compute proximity matrix)
Never ever use the insane default nodesize=1, for classification! In Breiman's package, you can't directly set maxdepth, but use nodesize as a proxy for that, and also read all the good advice at: CrossValidated: "Practical questions on tuning Random Forests"
So here your data has 4.2e+5 rows; if each node shouldn't be smaller than ~0.01% of the data, try nodesize=42. (First try nodesize=420 (0.1%), see how fast it is, then rerun, adjusting nodesize down. Empirically determine a good nodesize for this dataset.)
runtime is roughly proportional to 2^D_max, where D_max is the maximum tree depth; smaller nodesize means deeper trees and hence longer runtimes
optionally you can also speed up by using sampling; see the strata and sampsize arguments (a one-line sketch follows)
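A minimal hedged example of the sampling idea, reusing newdata.imputed from the na.roughfix sketch above (the sampsize value, roughly 10% of the rows, is a guess to be tuned):
rf.model <- randomForest(Weekly_Sales ~ ., data = newdata.imputed, ntree = 200, nodesize = 420, sampsize = 42000)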
Then a first-order estimate of runtime, denoting mtry=M, ntrees=T, ncores=C, nfeatures=F, nrows=R, maxdepth=D_max, is:
Runtime proportional to: T * F^2 * (R^1.something) * 2^D_max / C
(Again, all bets are off if you exceed memory. Also, try running on only one core, then 2, then 4, and verify you actually get a linear speedup, not a slowdown.)
(The effect of large R is worse than linear, maybe quadratic, since tree-partitioning has to consider all partitions of the data rows; certainly it's somewhat worse than linear. Check that by using sampling or indexing to give it only, say, 10% of the rows.)
Tip: keeping lots of low-importance junk features quadratically increases runtime for a sublinear increase in accuracy. This is because at each node we must consider all possible feature selections (or whatever number mtry allows), and within each tree we must consider many of the (F-choose-mtry) possible combinations of features.
So here's my methodology, doing "fast-and-dirty feature-selection for performance":
generate a tree normally (slow), although use a sane nodesize=42 or larger
look at rf$importance or randomForest::varImpPlot(). Pick only the top-K features, where you choose K; for a silly-fast example, choose K=3. Save that entire ranking for future reference.
now rerun the tree but only give it newdata[,importantCols]
confirm that speed is quadratically faster, and oob.error is not much worse
once you know your variable importances, you can turn importance off (importance=FALSE)
tweak mtry and nodesize (tweak one at a time), rerun and measure speed improvement
plot your performance results on logarithmic axes
post us the results! Did you corroborate the above? Any comments on memory usage?
(Note that the above is not a statistically valid procedure for actual feature-selection, do not rely on it for that, read randomForest package for the actual proper methods for RF-based feature-selection.)
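A hedged sketch of the methodology above (K, nodesize and ntree are illustrative; "%IncMSE" is the permutation-importance column randomForest reports for regression forests; newdata.imputed is from the na.roughfix step above):
rf.full <- randomForest(Weekly_Sales ~ ., data = newdata.imputed,
                        ntree = 50, nodesize = 420, importance = TRUE)
imp <- importance(rf.full)[, "%IncMSE"]
topK <- names(sort(imp, decreasing = TRUE))[1:3]        # keep only the top K = 3 features
rf.small <- randomForest(x = newdata.imputed[, topK], y = newdata.imputed$Weekly_Sales,
                         ntree = 200, nodesize = 420, importance = FALSE)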
I suspect do.trace might also consume time... instead of do.trace = TRUE, you can use do.trace = 5 (print progress only every 5 trees) just to get some feel for the error. For a large dataset, do.trace = TRUE takes up a lot of time as well.
Another thing I noticed:
the correct argument name is ntree, not ntrees (as written in the list above).
The default is 500 in the randomForest package.
Another option is to use more recent packages that are purpose-built for high-dimensional / high-volume datasets. They run their code in lower-level languages (C++ and/or Java) and in certain cases use parallelization.
I'd recommend taking a look into these three:
ranger (written in C++)
randomForestSRC (written in C++)
h2o (Java-based - needs Java version 8 or higher)
Also, some additional reading to help you decide which package to choose: https://arxiv.org/pdf/1508.04409.pdf
Page 8 shows benchmarks comparing ranger against randomForest as data size grows - ranger is far faster because its runtime grows linearly, rather than non-linearly, with rising tree/sample/split/feature sizes.
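For instance, a hedged ranger equivalent of the randomForest call in the question (argument values are assumptions; ranger's tree-count argument is num.trees, and it does not impute, so NAs are handled up front with na.roughfix from the randomForest package):
library(ranger)
library(randomForest)   # only for na.roughfix()
rf.ranger <- ranger(Weekly_Sales ~ ., data = na.roughfix(newdata),
                    num.trees = 200, importance = "impurity", num.threads = 4)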
Good Luck!
I am using the library e1071. In particular, I'm using the svm function. My dataset has 270 fields and 800,000 rows. I've been running this program for 24+ hours now, and I have no idea if it's hung or still running properly. The command I issued was:
svmmodel <- svm(V260 ~ ., data=traindata);
I'm on Windows, and in Task Manager the status of Rgui.exe is "Not Responding". Did R crash already? Are there any tips or tricks for gauging what's happening inside R or the SVM learning process?
If it helps, here are some additional things I noticed using resource monitor (in windows):
CPU usage is at 13% (stable)
Number of threads is at 3 (stable)
Memory usage is at 10,505.9 MB +/- 1 MB (fluctuates)
As I'm writing this thread, I also see "similar questions" and am clicking on them. It seems that SVM training time is quadratic or cubic in the number of examples. Still, if it's reasonable to wait beyond 24+ hours I will wait, but if not, I will have to eliminate SVM as a viable predictive model.
As mentioned in the answer to this question, "SVM training can be arbitrary long" depending on the parameters selected.
If I remember correctly from my ML class, running time is roughly proportional to the square of the number of training examples, so for 800k examples you probably do not want to wait.
Also, as an anecdote, I once ran e1071 in R for more than two days on a smaller data set than yours. It eventually completed, but the training took too long for my needs.
Keep in mind that most ML algorithms, including SVM, will usually not achieve the desired result out of the box. Therefore, when you are thinking about how fast you need it to run, keep in mind that you will have to pay the running time every time you tweak a tuning parameter.
Of course you can reduce this running time by sampling down to a smaller training set, with the understanding that you will be learning from less data.
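A minimal sketch of that subsampling idea (the 40,000-row sample size, 5% of the data, is arbitrary):
set.seed(1)
idx <- sample(nrow(traindata), size = 40000)
svmmodel.small <- svm(V260 ~ ., data = traindata[idx, ])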
By default the svm function from e1071 uses a radial basis kernel, which makes SVM induction computationally expensive. You might want to consider using a linear kernel (argument kernel="linear") or a specialized library like LiblineaR, which is built for large datasets. Your dataset is really large, though, and if a linear kernel does not do the trick then, as suggested by others, you can use a subset of your data to generate the model.
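Hedged sketches of both suggestions (they assume V260 is the target, the other columns are numeric predictors, and the chosen LiblineaR type is just one of its SVC variants; see ?LiblineaR for the full list):
# e1071 with a linear kernel
svmmodel.linear <- svm(V260 ~ ., data = traindata, kernel = "linear")

# LiblineaR takes a numeric matrix plus a target vector
library(LiblineaR)
x <- as.matrix(traindata[, setdiff(names(traindata), "V260")])
y <- traindata$V260
linmodel <- LiblineaR(data = x, target = y, type = 1, cost = 1)   # type 1: L2-regularized L2-loss SVC (dual)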