R: binning problem with bin widths in multiples of a consistent width

I have been searching for R cutting or binning packages but I could not quite find what I really want.
I have a dataset of 1000 variables. Some columns might have values ranging from 0.01 to 0.2, while others might range from 0 to 2000, and some might contain negative numbers.
I would like to plot a histogram for each variable, but with more consistent bin labels, i.e. I would like the bin width to be a multiple of 1, 2.5 or 5 (or, for decimal numbers, maybe 0.01, 0.02 or 0.05). I am flexible about the number of bins varying between 20 and 40 (they can be fixed if that's easier), and I do not care much about the amount of data in each bin.
The reason for this is that I might get new data for the same variables, and I would like consistent binning so that their distributions, and perhaps model results, land in the same bins. There are simply too many variables to do this manually.
Any thoughts on how to write a function that returns bins consistent between the old and new data, before I actually get the new data?
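Here is a rough sketch of the kind of function I have in mind; nice_breaks is just a name I made up (not from an existing package), and the candidate widths are my own guess at a sensible set:

nice_breaks <- function(x, target.bins = 30, candidates = c(1, 2, 2.5, 5)) {
  span      <- diff(range(x, na.rm = TRUE))
  raw.width <- span / target.bins
  # allowed widths are candidate * 10^k, e.g. 0.01, 0.02, 0.025, 0.05, ..., 1, 2, 2.5, 5, 10
  widths <- sort(as.vector(outer(candidates, 10^(-10:10))))
  width  <- widths[which(widths >= raw.width)[1]]
  # anchor the breaks on a multiple of the width so old and new data share the same bins
  lower <- floor(min(x, na.rm = TRUE) / width) * width
  upper <- ceiling(max(x, na.rm = TRUE) / width) * width
  seq(lower, upper, by = width)
}

x <- rnorm(1000, mean = 50, sd = 12)
hist(x, breaks = nice_breaks(x))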


Plotting difference between datasets using ggplot

I'm struggling with something I'm trying to do in R.
I have two datasets with the same (categorical) columns but different values. I want to compare the count of each combination of columns (e.g. male and married, female and single etc) visually.
This is easy enough to do with ggplot's geom_bar for each dataset, and I know I can put the counts for each dataset next to each other by binding them and setting position = "dodge".
My question is whether there's an easy way to plot the difference between the two counts for each pair of variables, and whether there's a way of changing the default 'count' method in geom_count (ironic, I know) to something else (like a proportion or, in this case, a predefined set of values for the difference).
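For concreteness, this is roughly the kind of thing I am after; df1, df2 and the columns sex and marital are made-up names standing in for my real data:

library(dplyr)
library(ggplot2)

# count each combination in both datasets and join the counts
diffs <- full_join(
  count(df1, sex, marital, name = "n1"),
  count(df2, sex, marital, name = "n2"),
  by = c("sex", "marital")
)
diffs <- mutate(diffs,
                n1   = coalesce(n1, 0L),
                n2   = coalesce(n2, 0L),
                diff = n1 - n2)

# geom_col() plots a precomputed value rather than geom_bar()'s default count
ggplot(diffs, aes(x = interaction(sex, marital), y = diff)) +
  geom_col() +
  labs(x = "combination", y = "count difference (df1 - df2)")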
Thanks

Peculiarity with Scale and Z-Score

I was attempting to scale my data in R after doing some research on the function, which seems to compute (x - mean) / std.dev. This was just what I was looking for, so I scaled my dataframe in R. I also want to make sure my assumptions are correct so that I don't draw wrong conclusions.
Assumption
R scales each column independently. Therefore, column 1 will have its own mean and standard deviation. Column 2 will have its own.
Assuming I have a dataset of 100,000 rows and I scale 3 columns: if I then remove every row with a Z-score over 3 or under -3 in any of those columns, I could have up to roughly 100,000 * 0.003 * 3 = 900 rows removed!
However, when I actually truncated my data, I was left with 94,798 of my 100,000 rows, i.e. 5,202 rows were removed.
Does this mean my assumption about scale was wrong, and that it doesn't scale by column?
Update
So I ran a test and did the Z-score conversion on my own. The same number of rows was removed in the end, so I believe scale does work column by column. Now I'm just curious why far more than the expected ~0.3% per column is removed when only values more than 3 standard deviations out are dropped.
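This is roughly the check I mean, with df and the three column names as placeholders for my real data:

z <- scale(df[, c("col1", "col2", "col3")])   # each column uses its own mean and sd

# rows where at least one of the three z-scores falls outside [-3, 3]
outlier.row <- apply(abs(z) > 3, 1, any)
sum(outlier.row)                # rows that would be removed
nrow(df) - sum(outlier.row)     # rows that would be kept

# per-column tail fraction; for normally distributed data each is roughly 0.27%
colMeans(abs(z) > 3)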

nzv filter for continuous features in caret

I am a beginner to practical machine learning using R, specifically caret.
I am currently applying a random forest algorithm to a microbiome dataset. The values are relative-abundance transformed, so if my features are the columns, the values in each row sum to 1.
It is common to have cells with a lot of 0 values.
Typically I used the default nzv preprocessing feature in caret.
By default, a feature is flagged if it has:
a. only one unique value across the entire dataset (zero variance), or
b. few unique values relative to the number of samples in the dataset (< 10 %), and
c. a large ratio of the frequency of the most common value to the frequency of the second most common value (cutoff used is > 19)
So is this function not actually calculating variance, but rather determining the frequency of occurrence of feature values and filtering based on that frequency? If so, is it only safe to use for discrete/categorical variables?
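For reference, this is how I understand those defaults mapping onto the nearZeroVar arguments; abund is just a placeholder for my relative-abundance table:

library(caret)

nzv.info <- nearZeroVar(abund,
                        freqCut     = 95/5,  # ratio of most common to second most common value (19)
                        uniqueCut   = 10,    # % unique values below which a feature is flagged
                        saveMetrics = TRUE)
head(nzv.info)                      # freqRatio, percentUnique, zeroVar, nzv for each feature
filtered <- abund[, !nzv.info$nzv]  # drop the flagged features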
I have a large number of features in my dataset (~12k), many of which might be singletons or have a zero value in a lot of samples.
My question: Is nzv suitable for such a continuous, zero inflated dataset?
What pre-processing options would you recommend?
When I use the default nzv I am dropping a tonne of features (from 12k down to ~2,700) in the final table.
I do want a less noisy dataset, but at the same time I do not want to lose good features.
This is my first question and I am willing to revise, edit and resubmit if required.
Any solutions will be appreciated.
Thanks a tonne!

Sampling according to distribution from a large vector in R

I have a large vector of 11 billion values. The distribution of the data is not known, and therefore I would like to sample 500k data points based on the existing probabilities/distribution. In R there is a limit on the number of values that can be loaded into a vector (2^31 - 1), which is why I plan to do the sampling manually.
Some information about the data: The data is just integers. And many of them are repeated multiple times.
large.vec <- c(1, 2, 3, 4, 1, 1, 8, 7, 4, 1, ..., 216280)
To spread the 500k samples across the distribution, I will first create a probability sequence.
prob.vec <- seq(0, 1, length.out = 500000)
Next, convert these probabilities to positions in the original sequence.
position.vec <- pmax(1, round(prob.vec * 11034432564))  # round to whole-number positions and avoid position 0
The reason I created the position vector is so that I can pick the data point at each specific position after I order the population data.
Now I count the occurrences of each integer value in the population, create a data frame with the integer values and their counts, and also create the interval for each of these values:
integer.values       counts  lw.interval  up.interval
             0  300,000,034            0  300,000,034
             1  169,345,364  300,000,034  469,345,398
             2  450,555,321  469,345,399  919,900,719
...
Now, using the position vector, I identify which interval each position value falls into and, based on that, take the integer value for that interval.
This way I believe I have a sample of the population. I got a large chunk of the idea from this reference,
Calculate quantiles for large data.
I wanted to know if there is a better approach, or whether this approach could reasonably, albeit crudely, give me a good sample of the population?
This process does take a considerable amount of time, as the position vector has to go through all possible intervals in the data frame. For that I have made it parallel using RHIPE.
I understand that I will be able to do this only because the data can be ordered.
I am not trying to randomly sample here; I am trying to "sample" the data while keeping the underlying distribution intact, mainly to reduce 11 billion values to 500k.
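Here is a small-scale sketch of the count/interval idea on toy data; in the real problem the counts would come from the 11-billion-value population and n.samples would be 500k:

set.seed(1)
toy.vec <- sample(0:10, 1e6, replace = TRUE, prob = (11:1) / sum(11:1))

# count each integer value and build cumulative (upper) interval bounds
counts         <- as.data.frame(table(toy.vec), stringsAsFactors = FALSE)
counts$toy.vec <- as.integer(counts$toy.vec)
counts$up      <- cumsum(counts$Freq)

n.total   <- sum(counts$Freq)
n.samples <- 500
positions <- pmax(1, round(seq(0, 1, length.out = n.samples) * n.total))

# findInterval() maps each position to the row whose interval contains it
idx        <- findInterval(positions, counts$up, left.open = TRUE) + 1
sample.vec <- counts$toy.vec[idx]

# the sample should roughly reproduce the population proportions
round(prop.table(table(sample.vec)), 3)
round(prop.table(table(toy.vec)), 3)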

R cluster analysis Ward auto deleting outliers

How can I write R code to duplicate cluster analyses done in SAS that used METHOD=WARD with the TRIM=10 option, which automatically deletes 10% of the cases as outliers? (This dataset has 45 variables, each with some outlying responses.)
When I searched for R cluster analysis using Ward's method, the trim option was described as something that shortens names rather than something that removes outliers.
If I don't trim the datasets before the cluster analysis, one big cluster emerges with lots of single-case "clusters" representing outlying individuals. With the outlying 10% of cases automatically removed, 3 or 4 meaningful clusters emerge. There are too many variables and cases for me to remove the outliers on a case-by-case basis.
Thanks!
You haven't provided any information on how you want to identify outliers. Assuming the simplest case of removing the top and the bottom 5% of cases for every variable (i.e. on a variable-by-variable basis), you could do this with the quantile function.
To illustrate, using the example from the link above, you could do something like:
duration <- faithful$eruptions
# keep only the cases between the 5th and 95th percentiles of eruption duration
duration[duration <= quantile(duration, 0.95) & duration >= quantile(duration, 0.05)]
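To extend this to all 45 variables and then run Ward's clustering, a sketch along these lines might work; df is a placeholder for your data, and the 5%/95% cutoffs would need to be adjusted to match the behaviour of SAS's TRIM=10:

df <- faithful   # stand-in for your 45-variable dataset

# flag rows that lie inside the 5th-95th percentile range for every variable
inside <- Reduce(`&`, lapply(df, function(x) {
  q <- quantile(x, c(0.05, 0.95), na.rm = TRUE)
  x >= q[1] & x <= q[2]
}))
trimmed <- df[which(inside), ]

# "ward.D2" is the hclust method implementing Ward's criterion
hc <- hclust(dist(scale(trimmed)), method = "ward.D2")
plot(hc)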
