nzv filter for continuous features in caret - r

I am a beginner to practical machine learning using R, specifically caret.
I am currently applying a random forest algorithm to a microbiome dataset. The values are relative-abundance transformed, so if my features are columns, the values across all columns of Row 1 sum to 1.
It is common to have cells with a lot of 0 values.
I typically use the default nzv preprocessing option in caret.
Default:
a. One unique value across the entire dataset
b. Few unique values relative to the number of samples in the dataset (< 10%)
c. Large ratio of the frequency of the most common value to the frequency of the second most common value (the cutoff used is > 19)
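For reference, these defaults correspond to the freqCut (95/5 = 19) and uniqueCut (10) arguments of nearZeroVar(); running it with saveMetrics = TRUE shows the per-feature metrics before anything is dropped (abund is a stand-in name for my feature table):
library(caret)
# abund: stand-in for the relative-abundance feature table
nzv_metrics <- nearZeroVar(abund, freqCut = 95/5, uniqueCut = 10, saveMetrics = TRUE)
head(nzv_metrics)   # freqRatio, percentUnique, zeroVar and nzv for each feature
# keep only the features not flagged as near-zero variance
abund_filtered <- abund[, !nzv_metrics$nzv, drop = FALSE]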
So is this function not actually calculating variance, but rather determining the frequency of occurrence of feature values and filtering based on that frequency? If so, is it only safe to use for discrete/categorical variables?
I have a large number of features in my dataset (~12k), many of which might be singletons or have a zero value in a lot of samples.
My question: Is nzv suitable for such a continuous, zero-inflated dataset?
What pre-processing options would you recommend?
When I use the default nzv I drop a tonne of features (from ~12k to ~2,700) in the final table.
I do want a less noisy dataset, but at the same time I do not want to lose good features.
This is my first question and I am willing to revise, edit and resubmit if required.
Any solutions will be appreciated.
Thanks a tonne!

Related

Find samples from numeric vector that have a predefined mean value

I am using historical yearly rainfall data to devise 'what if' scenarios of altered rainfall in ecological models. To do that, I am trying to sample actual rainfall values to create samples of rainfall years that meet a certain criterion (such as a sample of rainfall years that are 10% wetter than the historical average).
I have come up with a relatively simple brute force method, described below, that works OK if I have a single criterion (such as a target mean value):
rainfall_values = c(270.8, 150.5, 486.2, 442.3, 397.7,
                    593.4191, 165.608, 116.9841, 265.69, 217.934, 358.138, 238.25,
                    449.842, 507.655, 344.38, 188.216, 210.058, 153.162, 232.26,
                    266.02801, 136.918, 230.634, 474.984, 581.156, 674.618, 359.16)
#brute force
sample_size = 10       #number of years included in each sample
n_replicates = 1000    #number of total samples calculated
target = mean(rainfall_values) * 1.1   #try to find samples that are 10% wetter than the historical mean
tolerance = 0.01 * target              #how close do we want to get to the target specified above?
#create a large matrix of samples
sampled_DF = t(replicate(n_replicates, sample(x = rainfall_values, size = sample_size, replace = TRUE)))
#calculate the mean of each sample
Sampled_mean_vals = apply(sampled_DF, 1, mean)
#keep only the samples that meet the criterion
Sampled_DF_on_target = sampled_DF[Sampled_mean_vals > (target - tolerance) & Sampled_mean_vals < (target + tolerance), ]
The problem is that I will eventually have multiple criteria to match (not only a mean target, but also a standard deviation, autocorrelation coefficients, etc.). With more complex multivariate targets, this brute force method becomes really inefficient at finding matches: I essentially have to look over millions of samples, which takes days even when parallelized...
So - my question is - is there any way to implement this search using an optimization algorithm or some other non-brute-force approach?
Some approaches to this kind of question are covered in this link. One respondent refers to what you call the "brute force" method as the "rejection" method.
This link addresses a related question.
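As a rough illustration of a non-brute-force alternative (a simple hill-climbing search over single-year swaps, not the code behind either link), you could start from one random sample and keep any swap that reduces a combined penalty over all target statistics. This sketch reuses rainfall_values, sample_size and target from the question and adds a made-up standard-deviation target:
# squared distance from the targets; add more terms (autocorrelation, etc.) as needed
objective <- function(s, target_mean, target_sd) {
  (mean(s) - target_mean)^2 + (sd(s) - target_sd)^2
}
target_sd <- sd(rainfall_values)   # example second criterion
set.seed(42)
current <- sample(rainfall_values, sample_size, replace = TRUE)
for (i in 1:20000) {
  candidate <- current
  candidate[sample(sample_size, 1)] <- sample(rainfall_values, 1)   # swap one year
  if (objective(candidate, target, target_sd) <= objective(current, target, target_sd)) {
    current <- candidate   # keep the swap if it is no worse
  }
}
objective(current, target, target_sd)   # how close did we get?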

Clara clustering on binary data with R

I have a pretty big dataframe (~120k rows, 24 columns) on which I'd like to perform clustering with the pam algorithm. All the columns are binary variables, where 1 represents the presence of the attribute and 0 the absence.
I saw that a way of doing this with such a big dataset is through the clara algorithm, which is implemented in the {cluster} package. The problem is that in the documentation I see that clara takes as input:
a data matrix or data frame, each row corresponds to an observation, and each column corresponds to a variable. All variables must be numeric. Missing values (NAs) are allowed.
So how am I supposed to apply the clara algorithm on my data?
Normally it should handle categorical variables since the pam algorithm doesn't compute means, but I couldn't find any useful information online except for this question, which remained unanswered.
I could simply convert my columns to numeric in order for the algorithm to work, but I'm afraid this would not be the correct solution to handle binary data with this algorithm.
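For concreteness, the conversion I am considering looks roughly like this (df is a stand-in name for my data frame); I picked the manhattan metric because, between 0/1 rows, the manhattan distance is just the number of mismatching attributes:
library(cluster)
# df: stand-in for the 120k x 24 data frame of 0/1 columns
x <- data.matrix(df)   # numeric matrix; factor columns become their integer codes
cl <- clara(x, k = 4, metric = "manhattan", samples = 50)   # manhattan on 0/1 rows = number of mismatches
table(cl$clustering)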

In R's randomForest package, do factors have to be explicitly labeled as factors?

Or will the package realize that they are not continuous and treat them as factors? I know that, for classification, the feature being classified does need to be a factor. But what about predictive features? I've run it on a couple of toy datasets, and I get slightly different results depending on whether categorical features are numeric or factors, but the algorithm is random, so I do not know if the difference in my results is meaningful.
Thank you!
Yes, there is a difference between the two. If you want a variable to be treated as a factor, you should specify it as such and not leave it as numeric.
For categorical data (this is actually a very good answer on CrossValidated):
A split on a factor with N levels is actually a selection of one of the 2^N - 2 possible non-empty, non-full subsets of levels (equivalently, 2^(N-1) - 1 distinct splits, since a subset and its complement define the same split). So the algorithm will check all the possible combinations and choose the one that produces the best split
For numerical data (as seen here):
Numerical predictors are sorted; then, for every value, the Gini impurity or entropy is calculated, and a threshold is chosen that gives the best split.
So yes, it makes a difference whether you add it as a factor or as a numeric variable. How much of a difference depends on the actual data.
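As a small illustration (toy data, made-up column names), the same predictor stored as numeric versus as a factor leads to two different searches for splits, and usually to two different forests:
library(randomForest)
set.seed(1)
df <- data.frame(
  y     = factor(sample(c("a", "b"), 200, replace = TRUE)),
  color = sample(1:4, 200, replace = TRUE),   # categorical, but stored as numeric
  size  = rnorm(200)
)
rf_numeric <- randomForest(y ~ ., data = df)   # color split on a numeric threshold
df$color   <- factor(df$color)
rf_factor  <- randomForest(y ~ ., data = df)   # color split over subsets of its levels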

Sampling according to distribution from a large vector in R

I have a large vector of 11 billion values. The distribution of the data is not known, and therefore I would like to sample 500k data points based on the existing probabilities/distribution. In R there is a limit on the number of values that can be loaded into a vector (2^31 - 1), which is why I plan to do the sampling manually.
Some information about the data: the data are just integers, and many of them are repeated multiple times.
large.vec <- c(1, 2, 3, 4, 1, 1, 8, 7, 4, 1, ..., 216280)
To create the probabilities of 500k samples across the distribution I will first create the probability sequence.
prob.vec <- seq(0, 1, length.out = 500000)
Next, convert these probabilities to position in the original sequence.
position.vec <- prob.vec * 11034432564
The reason I created the position vector is so that I can pick the data point at a specific position after I order the population data.
Now I count the occurrences of each integer value in the population, create a data frame with the integer values and their counts, and also create the interval for each of these values:
integer.values       counts  lw.interval  up.interval
             0  300,000,034            0  300,000,034
             1  169,345,364  300,000,034  469,345,398
             2  450,555,321  469,345,399  919,900,719
...
Now, using the position vector, I identify which interval each position value falls in and, based on that, take the integer value of that interval.
This way I believe I have a sample of the population. I got a large chunk of the idea from this reference:
Calculate quantiles for large data.
I wanted to know if there is a better approach, or whether this approach could reasonably, albeit crudely, give me a good sample of the population.
This process does take a reasonable amount of time, as the position vector has to go through all possible intervals in the data frame. For that I have made it parallel using RHIPE.
I understand that I will be able to do this only because the data can be ordered.
I am not trying to randomly sample here, I am trying to "sample" the data keeping the underlying distribution intact. Mainly reduce 11 billion to 500k.
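For reference, the vectorised interval lookup I have in mind looks roughly like this (counts_df and its column names are stand-ins for the table above):
# cumulative upper bounds of the intervals (as.numeric avoids integer overflow)
cum_counts <- cumsum(as.numeric(counts_df$counts))
total_n    <- tail(cum_counts, 1)
# evenly spaced positions across the ordered population, as with position.vec
positions  <- round(seq(1, total_n, length.out = 500000))
# map every position to its interval in one call instead of looping over intervals
idx        <- findInterval(positions, c(0, cum_counts), left.open = TRUE)
sample.vec <- counts_df$integer.values[idx]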

R cluster analysis Ward auto deleting outliers

How can I write R code to duplicate cluster analyses done in SAS that involved
method=Ward and the TRIM=10 option to automatically delete 10% of the cases as outliers? (This dataset has 45 variables, each with some outlier responses.)
When I searched for R cluster analysis using Ward's method, the trim option was described as something that shortens names rather than something that removes outliers.
If I don't trim the datasets before the cluster analysis, one big cluster emerges with lots of single-case "clusters" representing outlying individuals. With the outlying 10% of cases automatically removed, 3 or 4 meaningful clusters emerge. There are too many variables and cases for me to remove the outliers on a case-by-case basis.
Thanks!
You haven't provided any information on how you want to identify outliers. Assuming the simplest case of removing the top and the bottom 5% of cases for every variable (i.e. on a variable-by-variable basis), you could do this with the quantile function.
Illustrating using the example from the link above, you could do something like:
duration <- faithful$eruptions
duration[duration <= quantile(duration, 0.95) & duration > quantile(duration, 0.05)]
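Extending that to all 45 variables at once and then running Ward's method might look roughly like the sketch below (df is a stand-in for your data frame). Note that this per-variable trim is only one reading of the requirement, not SAS's TRIM= option, which trims by estimated density, and trimming every variable independently can easily remove more than 10% of the cases:
# keep rows inside the 5th-95th percentile band on every variable
keep <- Reduce(`&`, lapply(df, function(x)
  x > quantile(x, 0.05) & x <= quantile(x, 0.95)))
df_trimmed <- df[keep, ]
# Ward's method on the trimmed, scaled data
hc <- hclust(dist(scale(df_trimmed)), method = "ward.D2")
clusters <- cutree(hc, k = 4)
table(clusters)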
