R: Winsorizing (robustHD) not compatible with NAs?

I want to use the winsorize function provided in the "robustHD" package, but it does not seem to work with NAs, as can be seen in the example:
## generate data
set.seed(1234) # for reproducibility
x <- rnorm(10) # standard normal
x[1] <- x[1] * 10 # introduce outlier
x[11] <- NA ## adding NA
## winsorize data
x
winsorize(x)
I googled the problem but didn't find a solution, or even anyone with a similar problem. Is winsorizing perhaps considered a "bad" technique, or how else can you explain this lack of information?

If you only have a vector to winsorize, the winsor2 function defined here can be easily modified by setting na.rm = TRUE for the median and mad functions in the code. That provides the same functionality as winsorize{robustHD}, with one difference: winsorize calls robStandardize, which includes some adjustment for very small values. I don't understand what it's doing, so caveat emptor if you forgo it.
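A minimal sketch of what that modification might look like, assuming the usual median/MAD cutoffs; the function name winsorizeNA and the const = 2 default are illustrative, not part of robustHD, and no robStandardize-style adjustment is applied:
winsorizeNA <- function(x, const = 2) {
  center <- median(x, na.rm = TRUE)   ## robust center
  scale  <- mad(x, na.rm = TRUE)      ## robust scale
  lo <- center - const * scale
  hi <- center + const * scale
  pmin(pmax(x, lo), hi)               ## NAs simply stay NA
}
With the vector from the question, winsorizeNA(x) shrinks the outlier toward the cutoff and leaves the NA in place.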
If you want to winsorize the individual columns of a matrix (as opposed to the multivariate winsorization using a tolerance ellipse available as another option in winsorize) you should be able to poach the necessary code from winsorize.default and standardize. They do the same thing as winsor2 but in matrix form. Again, you need to add your own na.rm = TRUE settings into the functions as needed.
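For the per-column case, a rough sketch (using the winsorizeNA helper sketched above; this is univariate, column-by-column winsorization, not the tolerance-ellipse version):
X <- matrix(rnorm(50), ncol = 5)
X[sample(length(X), 5)] <- NA          ## sprinkle in some NAs
X_wins <- apply(X, 2, winsorizeNA)     ## winsorize each column separately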

Some possibly useful thoughts:
Stack Overflow is a programming board, where programming-related questions are asked and answered. For questions about whether certain statistical procedures are appropriate or considered "bad", you are more likely to find knowledgeable people on Cross Validated.
A statistical method and the implementation of that method in a particular software environment are often rather independent. That is to say, if the developer of a package has not included certain features (e.g., NA handling) in the package, this does not necessarily mean much for the method per se. Having said that, of course it can. The only way to be sure whether the omission of a feature was intentional is to ask the package's developer. If the question is more about the statistics and the validity of the method in the presence of missing values, Cross Validated is likely to be more helpful.
I don't know why you can't find any information on this topic. I can confidently say though that this is the very first time I have heard the term "winsorized". I actually had to look it up, and I can surely say that I have never encountered this approach, and I would personally never use it.
A simple solution to your problem, from a computational point of view, would be to omit all incomplete cases before you start working with the function (a sketch follows below). It also makes intuitive sense that cases with missing values cannot easily be winsorized: the mean and standard deviation have to be computed on the complete cases anyway, and it is then unclear which value to assign to the cases with missing values, since they are not necessarily outliers, even though they could be.
If omitting incomplete cases is not an option for you, you may want to look for imputation methods (on CV).
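For a single vector, a sketch of the complete-case route, assuming x is the vector from the question and winsorize is the robustHD function: drop the NAs before the call, then put them back in their original positions.
library(robustHD)
ok <- !is.na(x)
y <- x
y[ok] <- winsorize(x[ok])   ## winsorize the complete cases only
y                           ## NAs remain NA in their original positions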

Related

R numerical method similar to vpasolve in MATLAB

I am trying to solve a numerical equation in R, and would like a method that performs similarly to vpasolve in MATLAB. I have a nonlinear equation (involving a lot of log functions) which, when solved in R with uniroot, gives me a completely different answer from what vpasolve gives in MATLAB.
First, a word of caution: it's often much more productive to learn that there's a better way to do something than the way you are used to doing it.
Edit:
I went back to MATLAB and realized that the "vpa" collection is using extended precision. Is that absolutely necessary for your purposes? If not, then my suggestions below may suffice.
If you do require extended precision, then perhaps the Rmpfr::unirootR function will suffice. I would point out that, since all these solvers generate an approximate solution (as opposed to an analytic one), the use of extended-precision operations seems a bit pointless.
Next, you need to determine whether MATLAB::vpasolve or uniroot is getting you the correct answer. Or maybe you simply are converging to a root that's not the one you want, in which case you need to read up on setting limits on the starting conditions or the search region.
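As a hedged illustration of that last point: uniroot only searches inside the interval you give it, so the choice of interval decides which root you converge to. The function f below is made up for illustration; with log terms, make sure the interval stays inside the function's domain.
f <- function(x) log(x) + x - 3                        ## toy equation with a log term
uniroot(f, interval = c(0.1, 10), tol = 1e-12)$root    ## converges to the root near 2.2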
Finally, in addition to uniroot, I recommend you learn to use the R packages BBsolve, nleqslv, rootSolve, and ktsolve (disclaimer: I am the owner and maintainer of ktsolve). These packages are pretty flexible and may lead you to better solutions to your original problem.

Behaviour of dfmax in glmnet

(NB: This is a slightly modified version of a post I'd made on a different forum. I received no responses there, hence the post here. If this is not allowed, please let me know and I will take down the question.)
I am new to glmnet, so I do not yet understand fully what the various
parameters do. I am trying to build a multinomial classifier which restricts
the number of features used in the model. From reading the docs and some
answers on this forum, I understand dfmax is the way to do it. I
played around with it a bit; I have a couple of questions and would appreciate some help:
Setup
For a particular dataset, I want to restrict the number of features to 3;
the original data has 126 features. Here's what I run:
library(glmnet)
library(broom)   ## tidy() comes from broom
fit <- glmnet(data.matrix(X), data.matrix(y), family='multinomial', dfmax=3)
d <- data.frame(tidy(fit))
This is the value of d:
My questions about the output:
I see multiple values of lambda in there; it looks like glmnet tries to fit lambdas that get the number of terms close to dfmax=3. So it's less like the LARS algorithm (where we move stagewise by adding variables and can stop at an exact number of variables) and more about finding the right lambdas for regularization that lead to the intended dfmax. Is this right?
I'm guessing alpha plays a role in how close we can get to dfmax. At alpha=1 we're doing the lasso, so it's easier to get close to dfmax, compared to alpha=0, where we're doing ridge. Is this understanding correct?
A "neighborhood" of dfmax is the
best we can do it'd seem. Or am I missing a parameter that gets me
to the model with the exact dfmax (FYI: alpha=1 doesn't seem to get
me to the precise number of non zero terms either, at least on this
dataset).
In the first solution (step=1), no variables are used. Does this mean the relative odds equal a constant?
What does pmax do?
Thanks in advance!
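Not a full answer, but a sketch of one common workaround, assuming the X and y from the setup above: fit the entire lambda path without dfmax, then pick the largest step whose number of nonzero variables (stored in fit$df) stays at or below the target.
library(glmnet)
fit <- glmnet(data.matrix(X), data.matrix(y), family = 'multinomial')
target <- 3
ok <- which(fit$df <= target)        ## steps along the path that respect the limit
lambda_sel <- fit$lambda[max(ok)]    ## smallest lambda still within the limit
coef(fit, s = lambda_sel)            ## coefficients actually used at that lambda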

R: [Indicspecies package] multipatt function: extract values from summary.multipatt

I am working with the multipatt function from the 'indicspecies' package and am unable to extract values from its summary. Unfortunately I can't print the whole summary and am left with only partial information for my model. The reason is the huge amount of data that needs to be printed from the summary (300,000 different species, 3 groups, 6 combinations to compare).
This is what happens when the summary is saved (preceding code included):
x <- multipatt(data, ...)
sumx <- summary(x)
sumx
NULL
str(sumx)
NULL
So, the summary does not work exactly like a generic summary. It seems that the function is based around the older indval function from the 'labdsv' package (which is mentioned in the documentation). I found an archived thread where a similar problem is discussed: http://r.789695.n4.nabble.com/extract-values-from-summary-of-function-indval-of-the-package-labdsv-td4637466.html
but it seems unresolved (and is not about exactly the same function, rather the base indval function).
I was wondering if anyone has experience with the indicspecies package and knows a way to extract the info from the summary.
It is possible to extract significance and other information from the other data saved in the model object, but it would be nice to just get a quick, complete overview of the data.
ps. I tried
options(max.print=1000000)
but this didn't solve it for me.
I used to capture the summary output for a multipatt object, but don't any more because the p-values reported are not corrected for multiple testing. To answer the OP's question, you can capture the summary output using capture.output, e.g.
dat.multipatt.summary <- capture.output(summary(dat.multipatt, indvalcomp=TRUE))
Again, I do not recommend this. It is very important to correct the p-values for multiple testing, so the summary output actually isn't helpful. To be clear ?multipatt states:
"sign Data table with results of the best matching pattern, the association value and the degree of statistical significance of the association (i.e. p-values from permutation test). Note that p-values are not corrected for multiple testing."
I just posted an answer for how to correct the p-values here https://stats.stackexchange.com/questions/370724/indiscpecies-multipatt-and-overcoming-multi-comparrisons/401277#401277
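A rough sketch of that correction, assuming dat.multipatt is the fitted multipatt object from above and that its sign component holds the raw permutation p-values as described in ?multipatt (the p.value.fdr column name is just illustrative):
pvals <- dat.multipatt$sign$p.value                        ## raw permutation p-values
dat.multipatt$sign$p.value.fdr <- p.adjust(pvals, "fdr")   ## e.g. Benjamini-Hochberg correction
head(dat.multipatt$sign)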
I don't have any experience with this package and since you haven't provided the data, it's difficult to reproduce. But since summary is returning NULL, are you sure your x is computed properly? Check the object.size or class or something else of x to see if it indeed has any content.
Also, instead of accessing all the contents of summary(x) together, you can use @ to access slots of it (similar to $ in a data frame).
If you need further assistance, it'd be better to provide at least a small subset or some other sample data so that the community can work with it.

Feature selection on subsets of feature set

I am trying to do feature selection using the Boruta package in R. The problem is that my feature set is way too large (70,518 features), and therefore the data frame is too large (2 GB) and cannot be processed by the Boruta package at once. I am wondering if I can split the data frame into several sets, each containing a smaller number of features? This sounds a bit weird to me, as I am not sure the algorithm can correctly identify the weights if not all features are present.
If not, I would be very grateful if someone can suggest an alternative way of doing it.
I think your best bet in this case might be to first try to filter out features that are either low-information (e.g., ~zero variance) or highly correlated.
The caret package has some useful functions to help with this.
For example, findCorrelation() can be used to easily remove redundant features:
library(caret)
cor_mat <- cor(dat, method = 'spearman')   ## correlation matrix (keep dat intact)
cor_mat[is.na(cor_mat)] <- 0               ## treat undefined correlations as zero
features_to_ignore <- findCorrelation(cor_mat, cutoff = 0.75, verbose = FALSE)
dat <- dat[, -features_to_ignore]          ## drop the redundant columns from the data
This will drop enough features that no remaining pair has a Spearman correlation of 0.75 or higher.
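In the same spirit, caret's nearZeroVar() can flag (near-)constant columns before the correlation filter; a sketch, assuming dat is the original feature data frame:
library(caret)
nzv <- nearZeroVar(dat)                    ## column indices with ~zero variance
if (length(nzv) > 0) dat <- dat[, -nzv]    ## drop them before further filtering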
I'm going to start by asking why you believe this can even work. In this case, not only is p >> n, but p >>>>>> n. You're always going to find spurious associations. More than that, even if you could do this (say, by renting a sufficiently large machine from a cloud computing service, which is the method I'd suggest), you're looking at an absurd amount of computation, since the computational complexity of building a single decision tree is O(n * v log(v)), where n is the number of records and v is the number of fields in each record. Building an RF takes that much for each tree.
Instead of solving the problem as stated, you might want to rethink it from the ground up. What are you really trying to do here? Can you go back to first principles and rethink that?

In R, how can I make a vector y whose components are derived from a normal distribution?

I am a novice in R programming.
I would like to ask experts here a question concerning a code of R.
First, let a vector x be c(2,5,3,6,5)
I would like to make another vector y whose i-th component is drawn from N(x[1]+...+x[i], 1)
(i.e., the i-th component of y follows a normal distribution with variance 1 and mean equal to the sum from x[1] (=2) to x[i], for i = 1,2,3,4,5).
For example, the third component of y follows a normal distribution with mean x[1]+x[2]+x[3] = 2+5+3 = 10 and variance 1.
I want to know R code that makes the vector y described above "without using repetition syntax such as for, while, etc."
Since I am a novice at R programming and have a congenitally poor sense of computational statistics, I can't seem to hit on an ingenious piece of R code at all.
Please let me know R code that makes the vector explained above without using repetition syntax such as for, while, etc.
Thank you very much in advance for your thoughtful answer.
You can do
rnorm(length(x), mean = cumsum(x), sd = 1)
rnorm is part of the family of functions associated with the normal distribution *norm. To see how a function with a known name works, use
help("rnorm") # or ?rnorm
cumsum takes the cumulative sum of a vector.
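For instance, with the vector from the question (the seed here is chosen only to make the illustration reproducible):
x <- c(2, 5, 3, 6, 5)
cumsum(x)                                   ## 2 7 10 16 21 -- the desired means
set.seed(1)
y <- rnorm(length(x), mean = cumsum(x), sd = 1)
y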
Finding functionality
In R, it's generally a safe bet that most functionality you can think of has been implemented by someone already. So, for example, in the OP's case, it is not necessary to roll a custom loop.
The same naming convention as *norm is followed for other distributions, e.g., rbinom. You can follow the link at the bottom of ?rnorm to reach ?Distributions, which lists others in base R.
If you are starting from scratch and don't know the names of any related functions, consider using the built-in search tools, like:
help.search("normal distribution") # or ??"normal distribution"
If this reveals nothing and yet you still think a function must exist, consider installing and loading the sos package, which allows
findFn("{cumulative mean}") # or ???"{cumulative mean}"
findFn("{the pareto distribution}") # or ???"{the pareto distribution}"
Beyond that, there are other online resources, like Google, that are good. However, a question about functionality on Stack Overflow is a risky proposition, since it will not be received well (downvoted and closed as a "tool request") if the implementation of the desired functionality is nonexistent or unknown to folks here. Stack Overflow's new "Documentation" subsite will hopefully prove to be a resource for finding R functions as well.
