Is there any Python equivalent of R's biglm? - r

I have used biglm in R and found it very useful. Now I need the same type of functionality in python. Any ideas? I have seen that patsy/statsmodels has an incremental mode, but have not been able to find any samples to copy/adapt. Any pointers would be much appreciated.

from a related answer of Nathaniel Smith on the statsmodels mailing list
My incremental LS code might be useful here, it's basically the same
problem:
https://github.com/njsmith/pyrerp/blob/master/pyrerp/incremental_ls.py#L330
The new X'X is the sum of the old X'Xs, then you have to re-do the
scaling and inversion to get the new vcov matrix for the estimates.
Should be doable so long as you know how many data points are in each
and the various sums-of-squares. (The code I linked has some extra
complexity because of handling a particular sort of heteroskedasticity
via FGLS, but it can pretty much be ignored.)
statsmodels doesn't have anything in this area yet.
There is an incremental OLS function in statsmodels, however that was written as helper function for cusum tests (in memory) and hasn't been used or checked for any other purpose:
http://statsmodels.sourceforge.net/devel/generated/statsmodels.stats.diagnostic.recursive_olsresiduals.html

Related

R numerical method similar to Vpasolve in Matlab

I am trying to solve a numerical equation in R but would want a method which perform similar to vpasolve in Matlab. I have a non linear equation (involving lot of log functions) which when solve in R with uniroot gives me complete different answer compared to what vpasolve gives in matlab.
First, a word of caution: it's often much more productive to learn that there's a better way to do something than the way you are used to doing.
edit
I went back to MATLAB and realized that the "vpa" collection is using extended precision. Is that absolutely necessary for your purposes? If not, then my suggestions below may suffice.
If you do require extended precision, then perhaps Rmpfr::unirootR function will suffice. I would like to point out that, since all these solvers are generating an approximate solution (as opposed to analytic), the use of extended precision operations seems a bit pointless.
Next, you need to determine whether MATLAB::vpasolve or uniroot is getting you the correct answer. Or maybe you simply are converging to a root that's not the one you want, in which case you need to read up on setting limits on the starting conditions or the search region.
Finally, in addition to uniroot, I recommend you learn to use the R packages BBsolve , nleqslv, rootsolve, and ktsolve (disclaimer: I am the owner and maintainer of ktsolve). These packages are pretty flexible and may lead you to better solutions to your original problem.

Equivalent to fitcdiscr in R (regarding Coeffs.linear and Coeffs.Const)

I am currently translating some MATLAB scripts to R for Multivariate Data Analysis. Currently I am trying to generate the same data as the Coeffs.Linear and Coeffs.Const part of the fitdiscr function in MATLAB.
The code being used is:
fitcdiscr(data, groups, 'DiscrimType', 'linear');
The data consists of 3 groups.
Unfortunately the R function seems to do the LDA only for two LDs and MATLAB seems to always compare all groups in all constellations. Does anybody have an idea how I could obtain that data?
I suspect you mean information on the implementation of various MATLAB function, which would be doc <functionname> (doc fitcdiscr would yield this documentation page on fitcdscr) to get the documentation, and edit <functionname> to get the implementation, if it is not obscured by The MathWorks. If those two do not give you enough information, I'm afraid you're out of luck, since not all TMW codes are available non-obscured.
fitcdiscr is non-obscured, although very brief; it's just a wrapper for some other functions. Keep doing edit <functionname> and doc <functionname> and see how deep the rabbit hole takes you.
NB: there's no built-in function called fitdiscr, but the syntax you describe is that of fitcdiscr (note the c), so I used that as examples. If the actual function being called is named fitdiscr, it's custom-made and you'll have to spit through its file by edit fitdiscr and hope for the best.

Using all cores for R MASS::stepAIC process

I've been struggling to perform this sort of analysis and posted on the stats site about whether I was taking things in the right direction, but as I've been investigating I've also found that my lovely beefy processor (linux OS, i7) is only actually using 1 of its cores. Turns out this is default behaviour, but I have a fairly large dataset and between 40 and 50 variables to select from.
A stepAIC function that is checking various different models seems like the ideal sort of thing for parellizing, but I'm a relative newb with R and I only have sketchy notions about parallel computing.
I've taken a look at the documentation for the packages parallel, and snowfall, but these seems to have some built-in list functions for parallelisation and I'm not sure how to morph the stepAIC into a form that can be run in parellel using these packages.
Does anyone know 1) whether this is a feasible exercise, 2) how to do what I'm looking to do and can give me a sort of basic structure/list of keywords I'll need?
Thanks in advance,
Steph
I think that a process in which a step depends on de last (as in step wise selection) is not trivial to do in parallel.
The simplest way to do something in parallel I know is:
library(doMC)
registerDoMC()
l <- foreach(i=1:X) %dopar% { fun(...) }
in my poor understanding of stepwise one extracts variables (or add forward/backward) of a model and measure the fitting in each step. If extracting a variable the model fit is best you keep this model, for example. In the foreach parallel function each step is blind to other step, maybe you could write your own function to perform this task as in
http://beckmw.wordpress.com/tag/stepwise-selection/
I looked for this code, and seems to me that you could use parallel computing with the vif_func function...
I think you also should check optimized codes to do that task as in the package leaps
http://cran.r-project.org/web/packages/leaps/index.html
hope this helps...

How does r calculate the p-values in logistic regression

What type of p-values do R calculate in a binomial logistic regression, and where is this documented?
When i read the documentation for ?glm() I find no reference to the calculation of the p-values.
The p-values are calculated by the function summary.glm. See ?summary.glm for a (very brief) bit about how those are calculated.
For more information, look at the source code by typing
summary.glm
at the R command prompt. There you will find the lines of code where an object pvalue is created. Follow the code back to see how the components of the p-value calculation are (conditionally) calculated.
The authors of R wrote the help system with several principles in mind: compactness (don't write more than is needed, it's not a textbook), accuracy, and a curious and well-educated audience. It really was written for other statisticians. The "curious" part of that opening sentence was included to raise the question why you did not also follow the various links in the ?glm page: to summary.glm where you would have found one answer to your ambiguous question or to anova.glm where you would have found another possible answer. The help-authors do expect that you will follow those links and read the whole page and execute the examples. You will notice that even after you get to summary.glm that there is no mention of "binary logistic regression" since they pretty much assume that you are well-grounded in statistics and have copy of McCullagh and Nelder handy, or if not that you will go read the references.
The other principle: sometimes it is the code itself (given the open-source nature of R) that performs the documentation. Technically glm doesn't print anything and print.glm doesn't print p-values. It would be print.summary.glm or print.anova.glm that would be doing any printing. Part of learning R is learning that the results printed to the console will have gone through a eval-print loop and that output can be tailored with object-class-specific functions.
These assumptions are just part of what many people see as a "steep learning curve for R" (although I would have called it a shallow curve if plotted with time/effort on x-axis.)

Implementation of Particle Swarm Optimization Algorithm in R

I'm checking a simple moving average crossing strategy in R. Instead of running a huge simulation over the 2 dimenional parameter space (length of short term moving average, length of long term moving average), I'd like to implement the Particle Swarm Optimization algorithm to find the optimal parameter values. I've been browsing through the web and was reading that this algorithm was very effective. Moreover, the way the algorithm works fascinates me...
Does anybody of you guys have experience with implementing this algorithm in R? Are there useful packages that can be used?
Thanks a lot for your comments.
Martin
Well, there is a package available on CRAN called pso, and indeed it is a particle swarm optimizer (PSO).
I recommend this package.
It is under actively development (last update 22 Sep 2010) and is consistent with the reference implementation for PSO. In addition, the package includes functions for diagnostics and plotting results.
It certainly appears to be a sophisticated package yet the main function interface (the function psoptim) is straightforward--just pass in a few parameters that describe your problem domain, and a cost function.
More precisely, the key arguments to pass in when you call psoptim:
dimensions of the problem, as a vector
(par);
lower and upper bounds for each
variable (lower, upper); and
a cost function (fn)
There are other parameters in the psoptim method signature; those are generally related to convergence criteria and the like).
Are there any other PSO implementations in R?
There is an R Package called ppso for (parallel PSO). It is available on R-Forge. I do not know anything about this package; i have downloaded it and skimmed the documentation, but that's it.
Beyond those two, none that i am aware of. About three months ago, I looked for R implementations of the more popular meta-heuristics. This is the only pso implementation i am aware of. The R bindings to the Gnu Scientific Library GSL) has a simulated annealing algorithm, but none of the biologically inspired meta-heuristics.
The other place to look is of course the CRAN Task View for Optimization. I did not find another PSO implementation other than what i've recited here, though there are quite a few packages listed there and most of them i did not check other than looking at the name and one-sentence summary.

Resources