The problem:
I am estimating fixed effects models using the felm function in the lfe package. I estimate several models, but the largest includes approximately 40 dependent variables plus county and year fixed effects (~3000 levels and 14 levels, respectively) using ~50 million observations. When I run the model, I get the following error:
Error in Crowsum(xz, ia) :
long vectors not supported yet: ../../src/include/Rinlinedfuns.h:519
Calls: felm -> felm.mm -> newols -> Crowsum
I realize that long vectors contain 2^31 or more elements, so I assume that felm produces these long vectors in estimating the model. Details about my resources: I have access to high-performance computing with multiple nodes, each with multiple cores; the node with the largest memory has 1012GB.
(1) Is support for "long vectors" something that can only be added by the author of lfe package?
(2) If so, are there other options for FE regressions on large data (provided that one has access to large amounts of memory and/or cluster computing, as I do)? (If I need to make a separate post to address this question more specifically, I can do that as well.)
Please note that I know there is a similar post about this problem: Error in LFE - Long Vectors Not Supported - R. 3.4.3
However, I made a new question for two reasons: (1) the author's question was unfocused (it was unclear what feedback to provide, and I didn't want to assume that what I wanted to know was the same as what the author wanted); and (2) even if I edited that question, the original author left out details that I thought could be relevant.
It appears that long vectors won't be supported by the lfe package unless the author adds support.
Other routes for performing regression analysis with large datasets include the biglm package. biglm lets you run a linear or logistic regression on one chunk of the data and then update the estimates as you feed in further chunks. However, this is a sequential process, which means it cannot be performed in parallel, so it may take too long (or still run into memory issues).
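For illustration, here is a minimal sketch of that chunked-update pattern with biglm; the file names, formula, and variable names are placeholders, not anything from the original question:

```r
library(biglm)

# Minimal sketch of the chunked-update pattern. It assumes the data have been
# pre-split into files chunk_01.csv, chunk_02.csv, ... with columns y, x1, x2.
# (If fixed effects are added as factor() dummies, every chunk must be coded
# with the same factor levels so the model matrices line up.)
files <- sort(list.files(pattern = "^chunk_.*\\.csv$"))

fit <- biglm(y ~ x1 + x2, data = read.csv(files[1]))  # fit on the first chunk
for (f in files[-1]) {
  fit <- update(fit, read.csv(f))  # fold in each further chunk sequentially
}
summary(fit)
```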
Another option, if you have access to more advanced computing resources (more cores, more RAM), is the partools package, which provides tools for working with R's parallel package. It provides a wrapper for R's lm called calm (chunk-averaged linear model), which, like biglm, regresses on chunks, but allows the process to be done in parallel. The vignette for partools is an excellent resource for learning how to use the package (even for those who haven't used the parallel package before): https://cran.r-project.org/web/packages/partools/index.html
It appears that calm doesn't report standard errors, but for datasets of the size that requires parallel computing, standard errors are probably not relevant anyway -- especially if the dataset contains data on the entire population being studied.
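For what it's worth, a rough sketch of how calm might be called is below. The exact arguments follow my reading of the partools vignette (cluster set up with setclsinfo, data split with distribsplit, lm arguments passed as a string) and should be checked against the vignette; all data and variable names are placeholders:

```r
library(parallel)
library(partools)

# Rough sketch only: the calm() call follows my reading of the partools
# vignette and may need adjusting. Data set and variable names are placeholders.
cls <- makeCluster(8)              # one worker per core
setclsinfo(cls)                    # partools bookkeeping on the cluster

mydata <- read.csv("big_data.csv") # placeholder data set
distribsplit(cls, "mydata")        # split mydata into chunks, one per worker

# calm() runs lm() on each chunk in parallel and averages the estimates;
# the lm() arguments are passed as a single string.
fit <- calm(cls, "y ~ x1 + x2, data = mydata")
fit$tht                            # chunk-averaged coefficients (per the vignette)

stopCluster(cls)
```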
I am working on a small project in survival analysis and making use of the paper:
Lagakos and De Gruttola - Analysis of Doubly-Censored Survival Data, with Application to AIDS.
Before implementing their code in R, I wanted to know whether there were already packages that implement the method they describe. I am aware of the package dblcens; however, its functions use an EM algorithm rather than the algorithm described by Lagakos and De Gruttola.
Has this algorithm already been implemented in some R package?
Thank you.
R is described as:
R is a language and environment for statistical computing and graphics.
Why is R not just called a language for statistical computing, but also an environment for statistical computing?
I would like, very humbly, to understand the emphasis on environment.
R is called a language and environment for statistical computing because, unsurprisingly, it is both.
The environment refers to the fact that R is a program that can perform statistical operations, just like SPSS (although that has a GUI), SAS, or Stata. You can perform ANOVA, linear regression, t-tests, or almost any other statistical method you would need.
The language refers to the fact that R is a functional and object-oriented programming language, which is what you use inside that environment. Anything you create is an object, and you can easily create new functions that perform new tasks.
So the overall package includes a statistical environment as well as the R programming language, which is commonly just referred to as 'R' overall.
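As a tiny illustration of that last point (the names here are arbitrary), a fitted model is an ordinary object and a new function is created by ordinary assignment:

```r
# A fitted model is just an object you can store, inspect, and pass around
fit <- lm(mpg ~ wt, data = mtcars)
class(fit)        # "lm"
summary(fit)

# Defining a new function is an ordinary assignment in the language
standardize <- function(x) (x - mean(x)) / sd(x)
standardize(mtcars$wt)
```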
From What is R?:
The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software.
Incidentally, @Christoph is right that there are also environments within R (like the global environment, or environments local to functions), but I don't think that is what the term refers to here.
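For completeness, a small base-R example of those language-level environments, which are distinct from the 'environment' the quote is talking about:

```r
x <- 1
environmentName(globalenv())  # "R_GlobalEnv": where top-level objects like x live

f <- function() {
  y <- 2
  environment()               # the environment local to this call of f()
}
f()
exists("y")                   # FALSE: y existed only inside f()'s environment
```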
Why the emphasis on environment? Officially stated here:
The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software.
As stated in the quote: many other applications, some mentioned by @Phil, have grown in capability incrementally over a long period of time. That incremental addition of capabilities often leads to a quirky product that is frustrating to use.
By emphasizing environment, R is sending a message along the lines of: we aren't that ancient program that has been patched for forty years and is frustrating to use; R is superior, integrated software that you won't dread using.
Side note: RStudio is officially separate from R, but I consider R and RStudio as one, and together they are definitely an environment that just works.
I'm looking to port our home-grown platform of various machine learning algorithms from C# to a more robust data mining platform such as R. While it's obvious R is great at many types of data mining tasks, it is not clear to me if it can be used for text classification.
Specifically, we extract a list of bigrams from the text and then classify it into one of 15 different categories, e.g.:
Bigram list: jewelry, books, watches, shoes, department store
-> Category: Shopping
We'd want to both train the models in R as well as hook up to a database to perform this on a larger scale.
Can it be done in R?
Hmm, I am only just starting to look into machine learning myself, but I might have a suggestion: have you considered Weka? There is a bunch of various algorithms around, and there is some documentation. Plus, there is an R package, RWeka, that makes use of the Weka jars.
EDIT:
There is also a nice, comprehensive read by Witten et al., Data Mining: Practical Machine Learning Tools and Techniques, which contains an extensive description of Weka among other interesting things. Look into the API opportunities it covers.
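To give a flavour of how the RWeka route might look, here is a minimal sketch; the bigram indicator columns and category labels are made-up stand-ins for whatever your own feature extraction produces:

```r
library(RWeka)  # needs Java; provides Weka classifiers such as J48 (C4.5 trees)

# Made-up toy data: one row per document, 0/1 indicators for a few terms/bigrams,
# plus the target category. In practice this matrix would come from your own
# bigram extraction (or from the tm package).
train <- data.frame(
  jewelry          = c(1, 0, 0, 1),
  books            = c(1, 0, 1, 0),
  department_store = c(0, 0, 0, 1),
  stock_price      = c(0, 1, 0, 0),
  category = factor(c("Shopping", "Finance", "Shopping", "Shopping"))
)

model <- J48(category ~ ., data = train)           # train the classifier
predict(model, newdata = train[1, -ncol(train)])   # classify a new feature vector
```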
Instead of starting to code in Matlab, I recently started learning R, mainly because it is open-source. I am currently working in data mining and machine learning field. I found many machine learning algorithms implemented in R, and I am still exploring different packages implemented in R.
I have a quick question: how would you compare R to Matlab for data mining applications, in terms of popularity, pros and cons, industry and academic acceptance, etc.? Which one would you choose, and why?
I went through various comparisons of Matlab vs R on various metrics, but I am specifically interested in their applicability to data mining and ML.
Since both languages are pretty new to me, I was just wondering whether R would be a good choice or not.
I appreciate any kind of suggestions.
For the past three years or so, I have used R daily, and the largest portion of that daily use is spent on machine learning/data mining problems.
I was an exclusive Matlab user while at university; at the time I thought it was an excellent set of tools/platform, and I am sure it still is today. The Neural Network Toolbox, the Optimization Toolbox, the Statistics Toolbox, and the Curve Fitting Toolbox are each highly desirable (if not essential) for someone using MATLAB for ML/data mining work, yet they are all separate from the base MATLAB environment -- in other words, they have to be purchased separately.
My Top 5 list for Learning ML/Data Mining in R:
Mining Association Rules in R
This refers to a couple of things. First, a group of R packages whose names all begin with arules (available from CRAN); you can find the complete list (arules, arulesViz, etc.) on the project homepage. Second, all of these packages are based on a data-mining technique known as Market Basket Analysis, or alternatively Association Rules. In many respects, this family of algorithms is the essence of data mining: exhaustively traverse large transaction databases and find above-average associations or correlations among the fields (variables or features) in those databases. In practice, you connect them to a data source and let them run overnight. The central R package in the set mentioned above is called arules; on the CRAN package page for arules, you will find links to a couple of excellent secondary sources (vignettes, in R's lexicon) on the arules package and on the Association Rules technique in general. (A minimal apriori run is sketched at the end of this answer.)
The standard reference, The Elements of Statistical Learning by Hastie et al.
The most current edition of this book is available in digital form for free. Likewise, at the book's website (linked to just above), all data sets used in ESL are available for free download. (As an aside, I have the free digital version; I also purchased the hardback version from BN.com; all of the color plots in the digital version are reproduced in the hardbound version.) ESL contains thorough introductions to at least one exemplar from most of the major ML rubrics -- e.g., neural networks, SVM, KNN; unsupervised techniques (LDA, PCA, MDS, SOM, clustering); numerous flavors of regression; CART; Bayesian techniques; as well as model aggregation techniques (Boosting, Bagging) and model tuning (regularization). Finally, get the R package that accompanies the book from CRAN (which will save you the trouble of having to download and enter the data sets).
CRAN Task View: Machine Learning
The 3,500+ packages available for R are divided up by domain into about 30 package families or 'Task Views'. Machine Learning is one of these families. The Machine Learning Task View contains about 50 or so packages. Some of these packages are part of the core distribution, including e1071 (a sprawling ML package that includes working code for quite a few of the usual ML categories).
Revolution Analytics Blog
With a particular focus on the posts tagged Predictive Analytics.
ML in R tutorial, comprising a slide deck and R code, by Josh Reich
A thorough study of the code would, by itself, be an excellent introduction to ML in R.
And one final resource that I think is excellent, but didn't make it into the top 5:
A Guide to Getting Started in Machine Learning [in R]
posted at the blog A Beautiful WWW
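Since the arules entry above is the most code-oriented of these resources, here is the minimal apriori run promised there, using the Groceries transaction data that ships with the package; the support and confidence thresholds are arbitrary:

```r
library(arules)

data("Groceries")  # example transaction data bundled with arules

# Mine association rules above (arbitrary) support and confidence thresholds
rules <- apriori(Groceries,
                 parameter = list(support = 0.01, confidence = 0.5))

inspect(head(sort(rules, by = "lift"), 5))  # look at the strongest rules
```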
Please look at the CRAN Task Views, and in particular at the CRAN Task View on Machine Learning and Statistical Learning, which summarises this nicely.
Both Matlab and R are good if you are doing matrix-heavy operations, because they can use highly optimized low-level code (BLAS libraries and such) for this.
However, there is more to data mining than just crunching matrices. A lot of people totally neglect the whole data organization aspect of data mining (as opposed to, say, plain machine learning).
And once you get to data organization, R and Matlab are a pain. Try implementing an R*-tree in R or Matlab to take an O(n^2) algorithm down to O(n log n) runtime. First of all, it goes completely against the way R and Matlab are designed (use bulk math operations wherever possible); secondly, it will kill your performance. Interpreted R code, for example, seems to run at around 50% of the speed of C code (try R's built-in k-means vs. the flexclust k-means), and the BLAS libraries are optimized to an insane level, exploiting cache sizes, data alignment, and advanced CPU features. If you are adventurous, try implementing a manual matrix multiplication in R or Matlab and benchmarking it against the native one.
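To make that last suggestion concrete, here is the kind of benchmark I mean, in R; the matrix size is arbitrary, and the native %*% call is the one that dispatches to BLAS:

```r
set.seed(1)
n <- 300
A <- matrix(rnorm(n * n), n, n)
B <- matrix(rnorm(n * n), n, n)

# Naive triple-loop multiplication in interpreted R
naive_mm <- function(A, B) {
  C <- matrix(0, nrow(A), ncol(B))
  for (i in seq_len(nrow(A)))
    for (j in seq_len(ncol(B)))
      for (k in seq_len(ncol(A)))
        C[i, j] <- C[i, j] + A[i, k] * B[k, j]
  C
}

system.time(C1 <- naive_mm(A, B))  # interpreted loops: seconds
system.time(C2 <- A %*% B)         # BLAS-backed native multiply: milliseconds
all.equal(C1, C2)                  # same result, wildly different runtime
```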
Don't get me wrong. There is a lot of stuff where R and Matlab are just elegant and excellent for prototyping. You can solve a lot of things in just 10 lines of code and get decent performance out of it. Writing the same thing by hand would take hundreds of lines and probably be 10x slower. But sometimes you can gain a whole level of algorithmic complexity, which for large data sets does beat the optimized matrix operations of R and Matlab.
If you want to scale up to "Hadoop size" in the long run, you will have to think about data layout and organization too, unless all you need is a linear scan over the data. But then again, you could also just sample!
Yesterday I found two new books about data mining. This series of books, entitled 'Data Mining', addresses the need by presenting in-depth descriptions of novel mining algorithms and many useful applications. In addition to helping readers understand each section deeply, the two books present useful hints and strategies for solving problems in the following chapters. The progress of data mining technology and its large public popularity establish a need for a comprehensive text on the subject. The books are "New Fundamental Technologies in Data Mining" (http://www.intechopen.com/books/show/title/new-fundamental-technologies-in-data-mining) and "Knowledge-Oriented Applications in Data Mining" (http://www.intechopen.com/books/show/title/knowledge-oriented-applications-in-data-mining). These are open access books, so you can download them for free or just read them on an online reading platform, like I do. Cheers!
We should not forget the original sources of these two pieces of software: scientific computing and signal processing led to Matlab, while statistics led to R.
I used Matlab a lot at university, since we had it installed on Unix and open to all students. However, the price of Matlab is too high, especially compared to free R. If your major focus is not matrix computation and signal processing, R should work well for your needs.
I think it also depends on which field of study you are in. I know people in coastal research who use a lot of Matlab. Using R in such a group would make your life more difficult: if a colleague has solved a problem, you can't reuse the solution because it was written in Matlab.
I would also look at the capabilities of each when dealing with large amounts of data. I know that R can have problems with this, which might be restrictive if you are used to an iterative data mining process, for example looking at multiple models concurrently. I don't know whether MATLAB has a data limitation.
I admit to favoring MATLAB for data mining problems, and I give some of my reasoning here:
Why MATLAB for Data Mining?
I will admit to only a passing familiarity with R/S-Plus, but I'll make the following observations:
R definitely has more of a statistical focus than MATLAB. I prefer building my own tools in MATLAB, so that I know exactly what they're doing, and I can customize them, but this is more of a necessity in MATLAB than it would be in R.
Code for new statistical techniques (spatial statistics, robust statistics, etc.) often appears early in S-Plus (I assume this carries over to R, at least to some extent).
Some years ago, I found S-Plus (the commercial implementation of the S language, from which R also derives) to have an extremely limited capacity for data. I cannot say what the state of R/S-Plus is today, but you may want to check whether your data will fit into such tools comfortably.