I want to use glm(..., family = "binomial") to fit a logistic regression on my big dataset, which has 80,000,000 rows and 125 columns stored as a data.frame. But when I run it in RStudio, it just crashes.
So I wonder: what is the time complexity of glm(), and are there any ways to handle data of this size? Someone suggested running the code from the command line instead: does that make any difference? (I tried, but it doesn't seem to work either.)
Memory requirement: R has to load the entire dataset into memory (RAM). Assuming 4-byte (32-bit) entries, your dataset is roughly 80,000,000 × 125 × 4 bytes ≈ 40 GB (about 37 GiB) -- much larger than the amount of RAM you have on your computer (and R stores numeric columns as 8-byte doubles by default, so the real footprint could be twice that). That is why it crashes. You cannot use R for this dataset unless you use special big-data packages, and even then I'm not sure it is feasible.
Other tools and languages can inspect the data without loading all of it into memory, so it might be wise to look into those, or into out-of-core approaches within R, as sketched below.
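If you do want to stay in R, one option is the biglm package, whose bigglm() function fits a GLM by streaming the data in chunks so the full dataset never has to be in RAM at once. Below is a minimal sketch, assuming the data sit in a headered CSV called "data.csv" and that y, x1, x2 are placeholder column names; the chunked reader follows the data-function interface documented for bigglm().

library(biglm)

make_chunk_reader <- function(path, chunk_rows = 100000) {
  header <- names(read.csv(path, nrows = 1))
  con <- NULL
  open_file <- function() {
    con <<- file(path, open = "r")
    readLines(con, n = 1)                      # discard the header row
  }
  function(reset = FALSE) {
    if (reset) {                               # bigglm restarts each IRLS pass
      if (!is.null(con)) close(con)
      open_file()
      return(NULL)
    }
    if (is.null(con)) open_file()
    chunk <- tryCatch(
      read.csv(con, nrows = chunk_rows, header = FALSE, col.names = header),
      error = function(e) NULL)                # no lines left: end of this pass
    if (is.null(chunk) || nrow(chunk) == 0) NULL else chunk
  }
}

fit <- bigglm(y ~ x1 + x2, data = make_chunk_reader("data.csv"),
              family = binomial())
summary(fit)

Each IRLS pass rereads the whole file, so this trades memory for a lot of I/O time; passing colClasses to read.csv would also keep column types consistent across chunks.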
Time complexity for GLMs: with N = number of observations (usually rows) and p = number of variables (usually columns), each iteration of the standard IRLS fitting algorithm costs roughly O(Np^2 + p^3): O(Np^2) to form the weighted cross-product matrix and O(p^3) to solve the resulting p × p system.
For your data that works out to roughly 10^12 operations per iteration, which is still barely in the realm of possibility, but you would probably need more than one modern PC running for at least a few days.
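As a quick back-of-the-envelope check, using the values from the question:

N <- 8e7         # rows
p <- 125         # columns
N * p^2 + p^3    # ~1.25e12 floating-point operations per IRLS pass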
Related
I'm trying to run rfsrc() (a random survival forest) on a data frame of 6,500 records with 59 variables:
rfsrc_test <- rfsrc(Surv(TIME, DIED) ~ ., data=test, nsplit=10, na.action = "na.impute")
It seems to work when I run it on 1500 records, but crashes on the entire dataset.
It crashes R without any specific error - sometimes it gives "exceptional processing error".
Any thoughts on how to debug this? I skimmed the data for malformed rows without any luck.
We do not know the size of each record, nor the complexity of the variables.
I have encountered similar situations when hitting the RAM ceiling. R keeps its working data in memory and is not designed for massive data sets. Parallel or out-of-core processing could help, but base R is not built for it, so the next suggestion is simply to buy more RAM.
My approach would be to reduce the number of variables until you can process all 6,500 records (to confirm the problem really is the size of the data set). Then I'd pre-screen the fitness of each variable, e.g. with a GLM, and keep the variables that explain a large amount of the variation and minimise the residual. Then I'd rerun the survival analysis on that reduced set of variables (a sketch of this follows after the next point).
One other thing to check is the time variable: how many distinct values does it have? The survival forest saves a cumulative hazard function for each node, so if the number of unique time points in the dataset is large, the CHFs grow large as well. I had to round my time variable, and this significantly reduced the run time.
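A hedged sketch of both suggestions (pre-screening variables and rounding the time variable), assuming the same test data frame from the question with TIME and DIED columns; the univariate Cox screen and the 0.05 cutoff are illustrative choices, not part of the original answer:

library(survival)
library(randomForestSRC)

## Round the time variable to reduce the number of unique event times,
## which shrinks the per-node cumulative hazard functions.
test$TIME <- round(test$TIME)

## Pre-screen each candidate predictor with a univariate Cox model and keep
## those with a small p-value (0.05 is arbitrary; factors with many levels
## may need extra handling).
candidates <- setdiff(names(test), c("TIME", "DIED"))
pvals <- sapply(candidates, function(v) {
  fit <- coxph(as.formula(paste("Surv(TIME, DIED) ~", v)), data = test)
  summary(fit)$coefficients[1, "Pr(>|z|)"]
})
keep <- names(pvals)[pvals < 0.05]

## Re-run the survival forest on the reduced variable set.
f <- as.formula(paste("Surv(TIME, DIED) ~", paste(keep, collapse = " + ")))
rfsrc_test <- rfsrc(f, data = test, nsplit = 10, na.action = "na.impute")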
I have a huge CSV file, 1.37 GB, and when running my glm in R it crashes because I do not have enough memory allocated -- the usual error.
Is there no alternative to the ff and bigmemory packages? They do not work well for me because my columns are a mix of integers and characters, and with both packages it seems I have to declare each column as a single type, either character or integer.
It is almost 2018 and we are about to put people on Mars; is there no simple "read.csv.xxl" function we can use?
I would first point out that just because your data file takes 1.37 GB on disk does not mean 1.37 GB of memory is enough to carry out all the calculations glm() needs. Most likely, at least one of those calculations will spike to a multiple of 1.37 GB.
For the second part, a practical workaround is to take a reasonable subsample of your 1.37 GB data set. Do you really need to build the model on every data point in the original file, or would, say, a 10% subsample give you a model that is just as statistically sound? Reducing the size of the data set solves the memory problem in R.
Keep in mind here that R runs completely in-memory, meaning that once you have exceeded available memory, you may be out of luck.
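A minimal sketch of the subsampling idea, assuming the file is called "big.csv" and the outcome column is named y (both placeholders), and that the raw file itself still fits in memory even though the full glm() fit does not:

set.seed(1)
dat <- read.csv("big.csv")                                # the raw file usually still fits
sub <- dat[sample(nrow(dat), size = round(0.10 * nrow(dat))), ]
rm(dat); gc()                                             # free the full copy before fitting
fit <- glm(y ~ ., data = sub, family = binomial)
summary(fit)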
I am using the e1071 library, in particular its svm function. My dataset has 270 fields and 800,000 rows. I've been running this for more than 24 hours now and have no idea whether it's hung or still running properly. The command I issued was:
svmmodel <- svm(V260 ~ ., data=traindata);
I'm on Windows, and in Task Manager the status of Rgui.exe is "Not Responding". Has R already crashed? Are there any tips or tricks for gauging what is happening inside R or inside the SVM learning process?
If it helps, here are some additional things I noticed in Resource Monitor (on Windows):
CPU usage is at 13% (stable)
Number of threads is at 3 (stable)
Memory usage is at 10,505.9 MB +/- 1 MB (fluctuates)
While writing this question I also looked through the "similar questions" links. It seems that SVM training is quadratic or cubic in the number of rows. Still, after 24+ hours: if it's reasonable to keep waiting I will wait, but if not, I'll have to eliminate SVM as a viable predictive model.
As mentioned in the answer to this question, "SVM training can be arbitrary long" depending on the parameters selected.
If I remember correctly from my ML class, running time is roughly proportional to the square of the number of training examples, so for 800k examples you probably do not want to wait.
Also, as an anecdote, I once ran e1071 in R for more than two days on a smaller data set than yours. It eventually completed, but the training took too long for my needs.
Keep in mind that most ML algorithms, including SVM, usually do not achieve the desired result out of the box, so when deciding how fast you need training to be, remember that you will pay the full running time again every time you tweak a tuning parameter.
Of course you can reduce this running time by sampling down to a smaller training set, with the understanding that you will be learning from less data.
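A hedged sketch combining both points: tune the parameters on a small random subsample first, so each tweak does not cost another full 24-hour run. traindata and V260 come from the question; the subsample size and the parameter grids are arbitrary placeholders.

library(e1071)

set.seed(1)
sub <- traindata[sample(nrow(traindata), 20000), ]      # ~2.5% of the rows
tuned <- tune.svm(V260 ~ ., data = sub,
                  gamma = 2^(-4:0), cost = 2^(0:3))     # illustrative grids
summary(tuned)
tuned$best.parameters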
By default the svm() function from e1071 uses a radial basis kernel, which makes SVM training computationally expensive. You might consider a linear kernel (argument kernel = "linear") or a specialized library such as LiblineaR, which is built for large datasets. Your dataset is really large, though, and if the linear kernel does not do the trick then, as others have suggested, you can use a subset of your data to build the model.
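A hedged sketch of both options, assuming traindata and V260 from the question and that the predictors are (or can be made) numeric; type = 1 selects L2-regularized L2-loss SVC in LiblineaR.

library(e1071)
library(LiblineaR)

## Option 1: linear kernel in e1071 (much cheaper than the default RBF).
svm_linear <- svm(V260 ~ ., data = traindata, kernel = "linear")

## Option 2: LiblineaR expects a numeric feature matrix and a separate target.
x <- as.matrix(traindata[, setdiff(names(traindata), "V260")])
y <- traindata$V260
m <- LiblineaR(data = x, target = y, type = 1)   # L2-regularized L2-loss SVC (dual)
pred <- predict(m, newx = x)$predictions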
For SVM modeling in R I have used the kernlab package (the ksvm method) on a Windows XP machine with 2 GB of RAM. But with 201,497 data rows I cannot provide enough memory for the modeling (I get the error: cannot allocate vector of size greater than 2.7 GB).
Therefore I tried Amazon micro and large instances for the SVM modeling, but they have the same issue as my local machine (cannot allocate vector of size greater than 2.7 GB).
Can anyone suggest a solution to this big-data modeling problem, or is there something wrong with my approach?
Without a reproducible example it is hard to say if the dataset is just too big, or if some parts of your script are suboptimal. A few general pointers:
Take a look at the High-Performance Computing Task View, which lists the main R packages relevant for working with big data.
You use your entire dataset to train the model. You could instead take a subset (say 10%) and fit the model on that. Repeating this a few times will show whether the fit is sensitive to which subset of the data you use (see the sketch after these pointers).
Some analysis techniques, e.g. PCA, can be performed by processing the data iteratively, i.e. in chunks. This makes analyses on very big datasets (>> 100 GB) possible. I'm not sure whether this is possible with kernlab.
Check if the R version you are using is 64 bit.
This earlier question might be of interest.
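A hedged sketch of the repeated-subsample pointer above, assuming a data frame called dat with an outcome column named label (both placeholders) and using ksvm() with its defaults:

library(kernlab)

fit_on_subsample <- function(dat, frac = 0.10) {
  idx <- sample(nrow(dat), size = round(frac * nrow(dat)))
  ksvm(label ~ ., data = dat[idx, ])
}

set.seed(42)
models <- lapply(1:5, function(i) fit_on_subsample(dat))

## Compare training errors (or held-out accuracy) across the fits to see
## how sensitive the model is to which 10% of the rows was used.
sapply(models, error)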
I've created an SVM in R using the kernlab package, but it's running incredibly slowly (20,000 predictions take ~45 seconds on the win64 R distribution). CPU usage sits at 25% and RAM utilization at a mere 17%, so it's not a hardware bottleneck. Similar calculations using data-mining algorithms in SQL Server Analysis Services run about 40x faster.
Through trial and error, we discovered that the laplacedot kernel gives us the best results by a wide margin. rbfdot is about 15% less accurate but twice as fast (still too slow). The fastest is vanilladot: it runs more or less instantly, but its accuracy is far too low to be usable.
We'd ideally like to use the laplacedot kernel but to do so we need a massive speedup. Does anyone have any ideas on how to do this?
Here is some profiling information I generated using Rprof. It looks like most of the time is spent in low-level math calls (the rest of the profile consists of data similar to rows 16-40). This should run very quickly, but it looks like the code is just not optimized (and I don't know where to start).
http://pastebin.com/yVPC66Be
Edit: Sample code to reproduce:
library(kernlab)

dummy.length <- 20000
source.data <- as.matrix(cbind(sample(1:dummy.length) / 1300,
                               sample(1:dummy.length) / 1900))
colnames(source.data) <- c("column1", "column2")
y.value <- as.matrix((sample(1:dummy.length) + 9) / 923)
model <- ksvm(source.data[, ], y.value, type = "eps-svr",
              kernel = "laplacedot", C = 1, kpar = list(sigma = 3))
The source data has 7 numeric columns (floating point) and 20,000 rows. This takes about 2-3 minutes to train. The next call generates the predictions and consistently takes 40 seconds to run:
predictions <- predict(model, source.data)
Edit 2: The laplacedot kernel computes the kernel value for two vectors using the formula below, which corresponds rather closely with the profiling output. Strangely, the unary minus (just before the round call) appears to consume about 50% of the runtime.
return(exp(-sigma * sqrt(-(round(2 * crossprod(x, y) - crossprod(x,x) - crossprod(y,y), 9)))))
Edit 3: Added sample code to reproduce - this gives me about the same runtimes as my actual data.
SVM itself is a very slow algorithm: its training time is roughly O(n^2) in the number of training examples.
SMO (Sequential Minimal Optimization, http://en.wikipedia.org/wiki/Sequential_minimal_optimization) is an algorithm for efficiently solving the optimization problem that arises during the training of support vector machines.
libsvm (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) and liblinear are two open-source implementations.
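As a rough empirical check of that scaling, you could time ksvm() on growing subsets of the sample data from the question (source.data and y.value come from the reproduction code; the subset sizes are arbitrary) and watch how the runtime grows:

library(kernlab)

ns <- c(1000, 2000, 4000, 8000)
times <- sapply(ns, function(n) {
  idx <- sample(nrow(source.data), n)
  system.time(
    ksvm(source.data[idx, ], y.value[idx], type = "eps-svr",
         kernel = "laplacedot", C = 1, kpar = list(sigma = 3))
  )["elapsed"]
})
cbind(n = ns, seconds = times)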