Define sample size using simple random sampling - r

I am trying to run a PCA, but I have too much data (20k observations) the resolution is too low.
I am using sample_n(df, replace = TRUE, n) [from dplyr] to reduce the size and have a better fit.
My question is: what is the best technique to define (or estimate) the sample size (n)?
If I have 20k observations (different sites, different times of the year, relatively well homogeneous), which cutoff should I use: 5%, 10%, 20%?
Could you give me a reference to your suggestion?
Thank you in advance for your comments.

I would make a loop with different sample sizes, I dont believe there is a clear cut/off just you could do with train/test (although we have piplines, but you know what I mean the 70/30 cutoff). The only thing I would check is if sample_n is still not too clustered and values are relatively equally represented.
If you are firm with k-means clustering, there we have the "elbow method", which is a little bit subjective where is the best amount of clusters (although we measure RSS), you just have to try a lot of iterations and loops.
You know with neural networks when you have e.g. a million observations you can reduce test set to e.g. 5 or 10 % because in absolute values you still have a lot of cases.
In summary:
I think that it needs a practical test like the elbow method in clustering. Becaue its can be very specific to your data.
I hope my answer is to at least to some value to you, I have no journal reference atm.

Related

R h2o model sizes on disk

I am using the h2o package to train a GBM for a churn prediction problem.
all I wanted to know is what influences the size of the fitted model saved on disk (via h2o.saveModel()), but unfortunately I wasn't able to find an answer anywhere.
more specifically, when I tune the GBM to find the optimal hyperparameters (via h2o.grid()) on 3 non-overlapping rolling windows of the same length, I obtain models whose sizes are not comparable (i.e. 11mb, 19mb and 67mb). the hyperparameters grid is the same, and also the train set sizes are comparable.
naturally the resulting optimized hyperparameters are different across the 3 intervals, but I cannot see how this can produces such a difference in the model sizes.
moreover, when I train the actual models based on those hyperparameters sets, I end up with models with different sizes as well.
any help is appreciated!
thank you
ps. I'm sorry but I cannot share any dataset to make it reproducible (due to privacy restrictions)
It’s the two things you would expect: the number of trees and the depth.
But it also depends on your data. For GBM, the trees can be cut short depending on the data.
What I would do is export MOJOs and then visualize them as described in the document below to get more details on what was really produced:
http://docs.h2o.ai/h2o/latest-stable/h2o-genmodel/javadoc/index.html
Note the 60 MB range does not seem overly large, in general.
If you look at the model info you will find out things about the number of trees, their average depth, and so on. Comparing those between the three best models should give you some insight into what is making the models large.
From R, if m is your model, just printing it gives you most of that information. str(m) gives you all the information that is held.
I think it is worth investigating. The cause is probably that two of those data windows are relatively clear-cut, and only a few fields can define the trees, whereas the third window of data is more chaotic (in the mathematical sense), and you get some deep trees being made as it tries to split that apart into decision trees.
Looking into that third window more deeply might suggest some data engineering you could do, that would make it easier to learn. Or, it might be a difference in your data. E.g. one column is all NULL in your 2016 and 2017 data, but not in your 2018 data, because 2018 was the year you started collecting it, and it is that extra column that allows/causes the trees to become deeper.
Finally, maybe the grid hyperparameters are unimportant as regards performance, and this a difference due to noise. E.g. you have max_depth as a hyperparameter, but the influence on MSE is minor, and noise is a large factor. These random differences could allow your best model to go to depth 5 for two of your data sets (but 2nd best model was 0.01% worse but went to depth 20), but go to depth 30 for your third data set (but 2nd best model was 0.01% worse but only went to depth 5).
(If I understood your question correctly, you've eliminated this as a possibility, as you then trained all three data sets on the same hyperparameters? But I thought I'd include it, anyway.)

K-Means Distance Measure - Large Data and mixed Scales

I´ve a question regarding k-means clustering. We have a dataset with 120,000 observations and need to compute a k-means cluster solution with R. The problem is that k-means usually use Euclidean Distance. Our dataset consists of 3 continous variables, 11 ordinal (Likert 0-5) (i think it would be okay to handle them like continous) and 5 binary variables. Do you have any suggestion for a distance measure that we can use for our k-means approach with regards to the "large" dataset? We stick to k-means, so I really hope one of you has a good idea.
Cheers,
Martin
One approach would be to normalize the features and then just use the 11-dimensional
Euclidean Distance. Cast the binary values to 0/1 (Well, it's R, so it does that anyway) and go from there.
I don't see an immediate problem with this method other than k-means in 11 dimensions will definitely be hard to interpret. You could try to use a dimensionality reduction technique and hopefully make the k-means output easier to read, but you know way more about the data set than we ever could, so our ability to help you is limited.
You can certainly encode there binary variables as 0,1 too.
It is a best practise in statistics to not treat likert scale variables as numeric, because of that uneven distribution.
But I don't you will get meaningful k-means clusters. That algorithm is all about computing means. That makes sense on continuous variables. Discrete variables usually lack "resolution" for this to work well. Three mean then degrades to a "frequency" and then the data should be handled very differently.
Do not choose the problem by the hammer. Maybe your data is not a nail; and even if you'd like to make it with kmeans, it won't solve your problem... Instead, formulate your problem, then choose the right tool. So given your data, what is a good cluster? Until you have an equation that measures this, handing the data won't solve anything.
Encoding the variables to binary will not solve the underlying problem. Rather, it will only aid in increasing the data dimensionality, an added burden. It's best practice in statistics to not alter the original data to any other form like continuous to categorical or vice versa. However, if you are doing so, i.e. the data conversion then it must be in sync with the question to solve as well as you must provide valid justification.
Continuing further, as others have stated, try to reduce the dimensionality of the dataset first. Check for issues like, missing values, outliers, zero variance, principal component analysis (continuous variables), correspondence analysis (for categorical variables) etc. This can help you reduce the dimensionality. After all, data preprocessing tasks constitute 80% of analysis.
Regarding the distance measure for mixed data type, you do understand the mean in k will work only for continuous variable. So, I do not understand the logic of using the algorithm k-means for mixed datatypes?
Consider choosing other algorithm like k-modes. k-modes is an extension of k-means. Instead of distances it uses dissimilarities (that is, quantification of the total mismatches between two objects: the smaller this number, the more similar the two objects). And instead of means, it uses modes. A mode is a vector of elements that minimizes the dissimilarities between the vector itself and each object of the data.
Mixture models can be used to cluster mixed data.
You can use the R package VarSelLCM which models, within each cluster, the continuous variables by Gaussian distributions and the ordinal/binary variables.
Moreover, missing values can be managed by the model at hand.
A tutorial is available at: http://varsellcm.r-forge.r-project.org/

How to take a Probability Proportional to Size (PPS) Unequal Probability sample using R?

I have very little programming experience, but I'm working on a statistics project and would like to generate an unequal probability sample where the inclusion probability of a unit is based on its size (PPS).
Basically, I have two datasets:
ds1 lists US states and the parameter I'm trying to estimate
ds2 has the population size of each state.
My questions:
I want to use R to select a random sample from the first dataset using inclusion probabilities based on the population of each state (second dataset).
Also is there any way to use R to calculate these Generalized Unequal Probability Estimator formulas?
Also just a note on the formulas: pi_i is inclusion probability and pi_ij is joint inclusion probability.
There is a package for the same in R - pps and the documentation is here.
Also, there is another package called survey with a bit of documentation here.
I'm not sure of the difference between the two and haven't used them myself. Hope this is what you're looking for.
Yes, that's called weighted sampling. Simply set the weight to the size of the state, strictly you don't even need to normalize them by 1/sum(sizes) although it's always good practice to. There are tons of duplicate posts on SO showing how to do weighted sampling.
The only tiny complication is that you need to do a join() of the datasets ds1, ds2. Show us what code you've tried if it's causing problems. Recommend you use either dplyr or data.table.
Your second question should be asked as a separate question, and is offtopic on SO, or at least won't get a great response - best to ask statistical questions at sister site CrossValidated

Pre-defining clusters in r

I have a pretty big data table (about 100.000 observations) that I'd like to use for clustering. Since some of the data is categorical, I've tried using "gower distance" and then hclust() with the "ward" method.
The data itself is very heterogeneous, which is why I'd like to sort of "pre-cluster" the data and then do the actual cluster analysis. Have any of you done this before and can point me in the right direction? I'm at a loss at the moment :(
With the mentioned methods, I don't really get useful clusters.
Thanks guys, I really appreciate every tip I can get.
Edit: I think that I didn't really explain my problem right, so here's another attempt: let's say, that I have a dataset containing brands of cars and some of their features. Before clustering them by features I would like to precluster them by brand. So all BMW e.g. are in the same cluster and so on.. and only after that I would like to cluster by features, so I should get a cluster with fast cars etc.
does anybody know, how to do this in R?
this does not describe my dataset, but maybe the question I'm having is clearer now.
You should start with a sample first.
Once you get good results on the sample, try to reproduce it on a different sample. Once the results are stable, you can either try to scale the algorithm to the entire data set (maybe try doubling first), or you can train a classifier and predict the clusters of the remaining data. With most clustering algorithms, a 1 nearest neighbor classifier will be very good.

How to normalize benchmark results to obtain distribution of ratios correctly?

To give a bit of the context, I am measuring the performance of virtual machines (VMs), or systems software in general, and usually want to compare different optimizations for performance problem. Performance is measured in absolute runtime for a number of benchmarks, and usually for a number of configurations of a VM variating over used number of CPU cores, different benchmark parameters, etc. To get reliable results, each configuration is measure like 100 times. Thus, I end up with quite a number of measurements for all kind of different parameters where I am usually interested in the speedup for all of them, comparing the VM with and the VM without a certain optimization.
What I currently do is to pick one specific series of measurements. Lets say the measurements for a VM with and without optimization (VM-norm/VM-opt) running benchmark A, on 1 core.
Since I want to compare the results of the different benchmarks and number of cores, I can not use absolute runtime, but need to normalize it somehow. Thus, I pair up the 100 measurements for benchmark A on 1 core for VM-norm with the corresponding 100 measurements of VM-opt to calculate the VM-opt/VM-norm ratios.
When I do that taking the measurements just in the order I got them, I obviously have quite a high variation in my 100 resulting VM-opt/VM-norm ratios. So, I thought, ok, let's assume the variation in my measurements come from non-deterministic effects and the same effects cause variation in the same way for VM-opt and VM-norm. So, naively, it should be ok to sort the measurements before pairing them up. And, as expected, that reduces the variation of course.
However, my half-knowledge tells me that is not the best way and perhaps not even correct.
Since I am eventually interested in the distribution of those ratios, to visualize them with beanplots, a colleague suggested to use the cartesian product instead of pairing sorted measurements. That sounds like it would account better for the random nature of two arbitrary measurements paired up for comparison. But, I am still wondering what a statistician would suggest for such a problem.
In the end, I am really interested to plot the distribution of ratios with R as bean or violin plots. Simple boxplots, or just mean+stddev tell me too few about what is going on. These distributions usually point at artifacts that are produced by the complex interaction on these much to complex computers, and that's what I am interested in.
Any pointers to approaches of how to work with and how to produce such ratios in a correct way a very welcome.
PS: This is a repost, the original was posted at https://stats.stackexchange.com/questions/15947/how-to-normalize-benchmark-results-to-obtain-distribution-of-ratios-correctly
I found it puzzling that you got such a minimal response on "Cross Validated". This does not seem like a specific R question, but rather a request for how to design an analysis. Perhaps the audience there thought you were asking too broad a question, but if that is the case then the [R] forum is even worse, since we generally tackle problems where data is actually provided. We deal with the requests for implementation construction in our language. I agree that violin plots are preferred to boxplots for the examination of distributions (when there is sufficient data and I am not sure that 100 samples per group makes the grade in that instance), but in any case that means the "R answer" is that you just need to refer to the proper R help page:
library(lattice)
?xyplot
?panel.violin
Further comments would require more details and preferably some data examples constructed in R. You may want to refer to the page where "great question design is outlined".
One further graphical method: If you are interested in the ratios of two paired variates but do not want to "commit" to just x/y, then you can examine them by plotting and then plotting iso-ratio lines by repeatedly using abline(a=0, b= ). I think 100 samples is pretty "thin" for doing density estimates, but there are 2d density methods if you can gather more data.

Resources