The problem:
I am estimating fixed effects models using the felm function in the lfe package. I estimate several models, but the largest includes approximately 40 independent variables plus county and year fixed effects (~3000 levels and 14 levels, respectively) using ~50 million observations. When I run the model, I get the following error:
Error in Crowsum(xz, ia) :
long vectors not supported yet: ../../src/include/Rinlinedfuns.h:519
Calls: felm -> felm.mm -> newols -> Crowsum
I realize that long vectors contain 2^31 or more elements; with ~50 million observations and ~40 regressors, intermediate matrices can easily exceed that limit, so I assume that felm produces such long vectors while estimating the model. Details about my resources: I have access to high-performance computing with multiple nodes, each with multiple cores; the node with the largest memory has 1012GB.
(1) Is support for "long vectors" something that can only be added by the author of the lfe package?
(2) If so, are there other options for FE regressions on large data (provided that one has access to large amounts of memory and/or cluster computing, as I do)? (If I need to make a separate post to address this question more specifically, I can do that as well.)
Please note that I know there is a similar post about this problem: Error in LFE - Long Vectors Not Supported - R. 3.4.3
However, I made a new question for two reasons: (1) that question was unfocused--it was unclear what feedback to provide, and I didn't want to assume that what I wanted to know was the same as what its author wanted; and (2) even if I edited that question, its author left out details that I think could be relevant.
It appears that long vectors won't be supported by the lfe package unless the author adds support.
Other routes for performing regression analysis with large datasets include the biglm package. biglm lets you run a linear or logistic regression on one chunk of data and then update the estimates as you feed it further chunks. However, this is a sequential process--it cannot be performed in parallel--and thus may simply take too long on data of this size.
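For illustration, here is a minimal sketch of chunked estimation with biglm; the data frame df, the formula, and the chunk size are placeholders, not details from the question:

library(biglm)

# split row indices into chunks of ~1 million rows
chunks <- split(seq_len(nrow(df)), ceiling(seq_len(nrow(df)) / 1e6))

# fit on the first chunk, then fold in the remaining chunks sequentially
fit <- biglm(y ~ x1 + x2, data = df[chunks[[1]], ])
for (i in chunks[-1]) {
  fit <- update(fit, df[i, ])
}
summary(fit)

If the data do not fit in memory at all, each chunk would instead be read from disk (or a database) just before the update() call.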
Another option, if you have access to more advanced computing resources (more cores, more RAM), is the partools package, which provides tools for working with R's parallel package. It includes a wrapper for R's lm called calm (chunk-averaged linear model), which, like biglm, regresses on chunks, but allows the process to be done in parallel. The vignette for partools is an excellent resource for learning how to use the package (even for those who haven't used the parallel package before): https://cran.r-project.org/web/packages/partools/index.html
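A rough sketch of the calm workflow, assuming a data frame df already loaded on the manager node and a placeholder formula (the worker count, names, and result field follow the partools documentation, not the original answer):

library(partools)

cls <- makeCluster(8)                       # worker count is an assumption
setclsinfo(cls)                             # prepare the cluster for partools
distribsplit(cls, "df")                     # distribute df in chunks across workers
fit <- calm(cls, "y ~ x1 + x2, data = df")  # lm() arguments passed as a string
fit$tht                                     # chunk-averaged coefficient estimates
stopCluster(cls)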
It appears that calm doesn't report standard errors, but for datasets of the size that require parallel computing, standard errors are probably not relevant -- especially if the dataset contains data on the entire population being studied.
Related
Where can I obtain the relevant arguments for R's randomForest for which I need to provide signatures in dotCall64?
I have 14 million cases and about 20 predictors. I tried running randomForest on my university's high-performance computing cluster (Unix). I got the error "long vectors (argument 24) are not supported in .C".
I have successfully run it for up to about 6 million cases.
I know just about enough R to run statistical tests on PCs and clusters. I have no knowledge of C, or of how to obtain, read, and edit whatever low-level code randomForest is based on, if any.
Thanks.
I am trying to see if there is an analogous solution based on this:
using long vectors in R with glmnet and dotCall64
I'm aware that the Amelia R package provides some support for parallel multiple imputation (MI). However, preliminary analysis of my study's data revealed that the data is not multivariate normal, so, unfortunately, I can't use Amelia. Consequently, I've switched to the mice R package for MI, as this package can perform MI on data that is not multivariate normal.
Since the MI process via mice is very slow (currently I'm using an AWS m3.large 2-core instance), I've started wondering whether it's possible to parallelize the procedure to save processing time. Based on my review of the mice documentation, the corresponding JSS paper, and mice's source code, it appears that the package currently doesn't support parallel operations. This is unfortunate, because IMHO the MICE algorithm is naturally parallel, so a parallel implementation should be relatively easy and would yield significant savings in both time and resources.
Question: Has anyone tried to parallelize MI in the mice package, either externally (via R's parallel facilities) or internally (by modifying the source code), and what were the results, if any? Thank you!
Recently, I tried to parallelize multiple imputation (MI) via the mice package externally, that is, by using R's multiprocessing facilities, in particular the parallel package, which comes standard with the base R distribution. Basically, the solution is to use the mclapply() function to distribute a pre-calculated share of the total number of needed imputations to each core and then combine the resulting imputed data into a single object. Performance-wise, the results of this approach are beyond my most optimistic expectations: the processing time decreased from 1.5 hours to under 7 minutes(!), and that's on only two cores. I removed one multilevel factor, but it shouldn't have much effect. Regardless, the result is unbelievable!
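A minimal sketch of this approach, assuming a data frame mydata and 10 total imputations (both placeholders); ibind() is mice's function for combining two mids objects:

library(mice)
library(parallel)

n.cores <- 2
n.imp <- 10   # total number of imputations wanted

# run an equal share of the imputations on each core, with distinct seeds
imp.list <- mclapply(seq_len(n.cores), function(i) {
  mice(mydata, m = n.imp / n.cores, seed = i, printFlag = FALSE)
}, mc.cores = n.cores)

# fold the per-core mids objects into a single mids object
imp <- Reduce(ibind, imp.list)

Note that mclapply() forks processes and so works on Unix-alikes but not on Windows, where parLapply() with an explicit cluster would be needed instead.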
Stata includes a command (wntestq) that it calls the "portmanteau Q test for white noise." There seem to be a variety of related tests in different packages in R. That said, most of these seem designed specifically for data in various time series formats, and I could find none that operate on a single variable.
"Portmanteau" refers to a family of statistical tests. In time series analysis, portmanteau tests are used for testing for autocorrelation of residuals in a model. The most commonly used test is the Ljung-Box test. Although it's buried in a citation in the manual, it seems that is the test that the Stata command wntestq has implemented.
R implements the same test in a function called Box.test() which is in the stats package that comes included with R. As you can see in the documentation for that function, Box.test() actually implements two tests: the Ljung-Box text that Stata uses and the Box-Pierce test. According to some sources, Box-Pierce was found to include a seemingly trivial simplification which can lead to nasty effects.[1][2] For that reasons, and because the defaults are different in R and Stata, it is worth noting that the Box-Pierce version is default in R.
The test will consider a certain number of autocorrelation coefficients (i.e., up to lag h) and there is no obvious default to select (see this question on the statistics StackExchange for a much more detailed discussion). Another important difference that will lead to different results is that the default h or number of lags will be different in Stata and R. By default, R will set h to 1* while Stata will set h to [n/2]-2 or 40, whichever is smaller.
Although there are many reasons you might not want the default, the following R function will reproduce the default behavior of the Stata command:
q.test <- function(x) {
  # Stata's default lag: min([n/2] - 2, 40)
  Box.test(x, type = "Ljung-Box", lag = min(floor(length(x)/2) - 2, 40))
}
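As a quick check (a hypothetical example, not part of the original answer), applying the function to simulated white noise should generally yield a large p-value:

set.seed(1)
q.test(rnorm(500))   # white noise: expect no evidence against the null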
I am using the library e1071, in particular its svm function. My dataset has 270 fields and 800,000 rows. I've been running this program for over 24 hours now, and I have no idea whether it's hung or still running properly. The command I issued was:
svmmodel <- svm(V260 ~ ., data=traindata);
I'm on Windows, and in the Task Manager the status of Rgui.exe is "Not Responding". Did R crash already? Are there any tips or tricks for gauging what's happening inside R or the SVM learning process?
If it helps, here are some additional things I noticed using resource monitor (in windows):
CPU usage is at 13% (stable)
Number of threads is at 3 (stable)
Memory usage is at 10,505.9 MB +/- 1 MB (fluctuates)
As I'm writing this, I also see the "similar questions" suggestions and am clicking through them. It seems that SVM training is quadratic or cubic in the number of examples. Still, after 24+ hours, if it's reasonable to wait, I will wait, but if not, I will have to eliminate SVM as a viable predictive model.
As mentioned in the answer to this question, "SVM training can be arbitrarily long" depending on the parameters selected.
If I remember correctly from my ML class, running time is roughly proportional to the square of the number of training examples, so for 800k examples you probably do not want to wait.
Also, as an anecdote, I once ran e1071 in R for more than two days on a smaller data set than yours. It eventually completed, but the training took too long for my needs.
Keep in mind that most ML algorithms, including SVM, will usually not achieve the desired result out of the box. Therefore, when thinking about how fast you need it to run, remember that you will pay the running time again every time you tweak a tuning parameter.
Of course you can reduce this running time by sampling down to a smaller training set, with the understanding that you will be learning from less data.
By default the svm function from e1071 uses a radial basis kernel, which makes SVM induction computationally expensive. You might want to consider a linear kernel (argument kernel="linear") or a specialized library like LiblineaR, which is built for large datasets. But your dataset is really large, and if a linear kernel does not do the trick then, as suggested by others, you can use a subset of your data to generate the model; both options are sketched below.
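A hedged sketch of those suggestions, reusing traindata and V260 from the question (the subset size and LiblineaR model type are assumptions):

library(e1071)

# linear kernel instead of the default radial basis kernel
svmmodel <- svm(V260 ~ ., data = traindata, kernel = "linear")

# or LiblineaR, which is designed for large-scale linear classification
library(LiblineaR)
x <- as.matrix(traindata[, setdiff(names(traindata), "V260")])
m <- LiblineaR(data = x, target = traindata$V260, type = 2)  # L2-regularized L2-loss SVC

# or train on a random subset first to gauge feasibility
idx <- sample(nrow(traindata), 50000)
svmsub <- svm(V260 ~ ., data = traindata[idx, ], kernel = "linear")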
I am using a package called missForest to estimate the missing values in my data set.
My question is: how can we parallelize this process to shorten the time that it takes to get the results?
Please refer to this example (from the missForest package):
data(iris)
summary(iris)
The data contains four continuous and one categorical variable.
Artificially produce missing values using the prodNA function:
set.seed(81)
iris.mis <- prodNA(iris, noNA = 0.2)
summary(iris.mis)
Impute missing values, providing the complete matrix for illustration. Use 'verbose' to see what happens between iterations:
iris.imp <- missForest(iris.mis, xtrue = iris, verbose = TRUE)
Yesterday I submitted version 1.4 of missForest to CRAN; the Windows and Linux packages are ready, the Mac version will follow soon.
The new function has an additional argument, "parallelize", which allows you either to compute the individual forests in parallel (parallelize="forests") or to compute the forests for several variables at the same time (parallelize="variables"). The default setting is no parallel computing (parallelize="no").
Do not forget to register a suitable parallel backend, e.g. using the package "doParallel", before trying it for the first time. The "doParallel" vignette gives an illustrative example in Section 4.
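A minimal sketch, reusing the iris.mis example from above (the core count is an assumption):

library(missForest)
library(doParallel)

registerDoParallel(cores = 4)   # register a parallel backend first
iris.imp <- missForest(iris.mis, parallelize = "forests")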
Due to some other details I had to temporarily remove the "missForest" vignette from the package. But I will resolve this in due course and release it as version 1.4-1.
It's a bit tricky to do a good job of parallelizing the missForest function. There seem to be two basic ways to do it:
Create the randomForest model objects in parallel;
Execute multiple randomForest operations (create model and predict) in parallel for each of the columns of the data frame that contain NA's.
Method 1 is rather easy to implement, except that you have to compute the error estimates yourself, since the randomForest combine function doesn't compute them for you. However, if the randomForest objects don't take that long to compute and there are many columns containing NA's, you may get very little speedup, if any, even though the operations take a long time in aggregate. A sketch of this method follows.
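A hedged sketch of method 1, growing sub-forests in parallel with foreach and combining them; x (predictors), y (response), the tree counts, and the worker count are all placeholders:

library(randomForest)
library(doParallel)

registerDoParallel(cores = 4)

# grow four sub-forests of 125 trees each and merge them into one forest
rf <- foreach(nt = rep(125, 4), .combine = randomForest::combine,
              .packages = "randomForest") %dopar% {
  randomForest(x, y, ntree = nt)
}

# combine() does not aggregate the OOB error estimates, so those must be
# computed separately, as noted above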
Method 2 is a bit harder to implement because the sequential algorithm updates the columns of the xmis data frame after each randomForest operation. I think the right way to parallelize this is to process n columns at a time in parallel (where n is the number of worker processes), which requires another loop around groups of n columns in order to process all of the columns of the data frame. My experiments suggest that unless this is done, the outer loop takes longer to converge, losing the benefit of executing in parallel.
In general, to get a performance improvement you will need to implement both of these methods and choose between them based on your input data. If you have just a few columns with NA's but the randomForest models take a long time to compute, you should choose method 1. If you have many columns with NA's, you should probably choose method 2, even if the individual randomForest models take a long time to compute, because this can be done more efficiently, although it's possible that it will still require an extra iteration of the outer while loop.
In the process of experimenting with missForest, I eventually developed a parallel version of the package. I put the modified version of library.R on a GitHub Gist; however, it isn't trivial to use in that form, especially without documentation. So I contacted the author of missForest, and he is very interested in incorporating at least some of my modifications into the official package, so hopefully the next version of missForest posted to CRAN will support parallel execution.