I have been looking for a way to calculate the minimum number of samples required Ne(min) to train a classification model when the dataset is not normally distributed. A research paper suggests the following :
if the data are not normally distributed, an exponential relationship between d and N will be
assumed and the number of samples that are required may be as plentiful as:
Ne(min) = Dsteps^d
where Dsteps is the discrete number of steps per feature.
d: dimension of the dataset.
....
It
is useful to think of a histogram approach to understand this relationship. If we want to construct a histogram from data with at least one sample in each bin and with Dsteps discrete steps per feature, we will require at least Dsteps^d samples.
The number of samples required to model the data accurately is in this case an exponential function of d.
I will be very grateful if someone can help me to get/calculate this measure: the discrete number of steps per feature.
An explanation with R or Matlab code would be very helpful. Thank you :D
Edit:
Paper reference: Christiaan Maarten Van Der Walt: Data Measure that Characterises Classification Problems, 2008.
Related
I have a three armed clinical trial with two treatments and one placebo. I am to compare the Restricted Mean Survival Time among the three.
I need to find the sample sizes for each arms (equal allocation)
I know that R software has survRM2 package for calculating sample sizes for a RMST of two armed trial. The code as used is:
library(SSRMST) ssrmst(ac_rate=ac_rate, ac_period=ac_period, tot_time=tot_time, tau=tau, shape0=shape0, scale0=scale0, shape1=shape1, scale1=scale1, margin=margin, seed=seed)
So, my question is, how do I use this package to calculate sample size for a three armed trial (with equal allocation).
How will the above code modify?
Any guidance will be very helpful.
I am trying to carry out hierarchical cluster analysis (based on Ward's method) on a large dataset (thousands of records and 13 variables) representing multi-species observations of marine predators, to identify possible significant clusters in species composition.
Each record has date, time etc and presence/absence data (0 / 1) for each species.
I attempted hierarchical clustering with the function pvclust. I transposed the data (pvclust works on transposed tables), then I ran pvclust on the data selecting Jacquard distances (“binary” in R) as a distance measure (suitable for species pres/abs data) and Ward’s method (“ward.D2”). I used “parallel = TRUE” to reduce computation time. However, using a default of nboots= 1000, my computer was not able to finish the computation in hours and finally I got ann error, so I tried with lower nboots (100).
I cannot provide my dataset here, and I do not think it makes sense to provide a small test dataset, as one of the main issues here seems to be the size itself of the dataset. However, I am providing the lines of code I used for the transposition, clustering and plotting:
tdata <- t(data)
cluster <- pvclust(tdata, method.hclust="ward.D2", method.dist="binary",
nboot=100, parallel=TRUE)
plot(cluster, labels=FALSE)
This is the dendrogram I obtained (never mind the confusion at the lower levels due to overlap of branches).
As you can see, the p-values for the higher ramifications of the dendrogram all seem to be 0.
Now, I understand that my data may not be perfect, but I still think there is something wrong with the method I am using, as I would not expect all these values to be zero even with very low significance in the clusters.
So my questions would be
is there anything I got wrong in the pvclust function itself?
may my low nboots (due to “weak” computer) be a reason for the non-significance of my results?
are there other functions in R I could try for hierarchical clustering that also deliver p-values?
Thanks in advance!
.............
I have tried to run the same code on a subset of 500 records with nboots = 1000. This worked in a reasonable computation time, but the output is still not very satisfying - see dendrogram2 .dendrogram obtained for a SUBSET of 500 records and nboots=1000
I am running an NMDS and have a few questions regarding the envfit() function in the vegan package. I have read the documentation for this function and numerous posts on SO and others about vegan, envfit(), and species scores in general.
I have seen both envfit() and wascore() used to calculate species scores for ordination techniques. By default, metaMDS() uses wascore(). This uses weighted averaging, which I understand. I am having a harder time understanding envfit(). Do envfit() and wascore( yield the same results? Is wascore() preferable given that it is the default? I realize that in some situations, wascore() might not be an option (ie. negative values), as mentioned in this post. How to get 'species score' for ordination with metaMDS()?
Given that envfit() and wascore() both seem to be used for species scores, they should yield similar results, right? I am hoping that we could do a proof of this here...
The following shows species scores determined using metaMDS() using the default wascore():
data(varespec)
ord <- metaMDS(varespec)
species.scores <- as.data.frame(scores(ord, "species"))
species.scores
wascore() makes sense to me, it uses weighted averaging. There is a good explanation of weighted averaging for species scores in Analysis of Ecological Data by McCune and Grace (2002) p. 150.
Could somebody help me breakdown envfit?
species.envfit <- envfit(ord, varespec, choices = c(1,2), permutations = 999)
species.scores.envfit <- as.data.frame(scores(species.envfit, display = "vectors"))
species.scores.envfit
"The values that you see in the table are the standardised coefficients from the linear regression used to project the vectors into the ordination. These are directions for arrows of unit length." - comment from Plotted envfit vectors not matching NMDS scores
^Could somebody please show me what linear model is being run here and what standardized value is being extracted?
species.scores
species.scores.envfit
These values are very different from each other. What am I missing here?
This is my first SO post, please have mercy. I would have asked a question on some of the other relevant threads, but I am the dregs of SO and don't even have the reputation to comment.
Thanks!
Q: Do wascores() and envfit() give the same result?
No they do not give the same result as these are doing two quite different things. In this answer I have explained how envfit() works. wascores() takes the coordinates of the points in the nmds space and computes the mean on each dimension, weighting observations by the abundance of the species at each point. Hence the species score returned by wascores() is a weighted centroid in the NMDS space for each species, where the weights are the abundances of the species. envfit() fits vectors that point in the direction of increasing abundance. This implies a plane over the NMDS ordination where abundance increase linearly from any point on the plane as you move parallel to the arrow, whereas wascores() are best thought of as optima, where the abundance declines as you move away from the weighted centroid, although I think this analogy is looser than say with a CA ordination.
The issue about being optimal or not, is an issue if you passed in standardised data; as the answer you linked to shows, this would imply negative weights which doesn't work. Typically one doesn't standardise species abundances — there are transformations that we apply like converting to proportions, square root or log transformations, normalizing the data to the interval 0-1 — but these wouldn't give you negative abundances so you;re less likely to run into that issue.
envfit() in an NMDS is not necessarily a good thing as we wouldn't expect abundances to vary linearly over the ordination space. The wascores() are better as they imply non-linear abundances, but they are a little hackish in NMDS. ordisurf() is a better option in general as it adds a GAM (smooth) surface instead of the plane implied by the vectors, but you can't show more than one or a few surfaces on the ordination, whereas you can add as many species WA scores or arrows as you want.
The basic issue here is the assumption that envfit() and wascores() should give the same results. There is no reason to assume that as these are fundamentally different approaches to computing "species scores" for NMDS and each comes with it's own assumptions and advantages and disadvantages.
I have a matrix of 62 columns and 181408 rows that I am going to be clustering using k-means. What I would ideally like is a method of identifying what the optimum number of clusters should be. I have tried implementing the gap statistic technique using clusGap from the cluster package (reproducible code below), but this produces several error messages relating to the size of the vector (122 GB) and memory.limitproblems in Windows and a "Error in dist(xs) : negative length vectors are not allowed" in OS X. Does anyone has any suggestions on techniques that will work in determining optimum number of clusters with a large dataset? Or, alternatively, how to make my code function (and does not take several days to complete)? Thanks.
library(cluster)
inputdata<-matrix(rexp(11247296, rate=.1), ncol=62)
clustergap <- clusGap(inputdata, FUN=kmeans, K.max=12, B=10)
At 62 dimensions, the result will likely be meaningless due to the curse of dimensionality.
k-means does a minimum SSQ assignment, which technically equals minimizing the squared Euclidean distances. However, Euclidean distance is known to not work well for high dimensional data.
If you don't know the numbers of the clusters k to provide as parameter to k-means so there are three ways to find it automaticaly:
G-means algortithm: it discovers the number of clusters automatically using a statistical test to decide whether to split a k-means center into two. This algorithm takes a hierarchical approach to detect the number of clusters, based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution (continuous function which approximates the exact binomial distribution of events), and if not it splits the cluster. It starts with a small number of centers, say one cluster only (k=1), then the algorithm splits it into two centers (k=2) and splits each of these two centers again (k=4), having four centers in total. If G-means does not accept these four centers then the answer is the previous step: two centers in this case (k=2). This is the number of clusters your dataset will be divided into. G-means is very useful when you do not have an estimation of the number of clusters you will get after grouping your instances. Notice that an inconvenient choice for the "k" parameter might give you wrong results. The parallel version of g-means is called p-means. G-means sources:
source 1
source 2
source 3
x-means: a new algorithm that efficiently, searches the space of cluster locations and number of clusters to optimize the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) measure. This version of k-means finds the number k and also accelerates k-means.
Online k-means or Streaming k-means: it permits to execute k-means by scanning the whole data once and it finds automaticaly the optimal number of k. Spark implements it.
This is from RBloggers.
https://www.r-bloggers.com/k-means-clustering-from-r-in-action/
You could do the following:
data(wine, package="rattle")
head(wine)
df <- scale(wine[-1])
wssplot <- function(data, nc=15, seed=1234){
wss <- (nrow(data)-1)*sum(apply(data,2,var))
for (i in 2:nc){
set.seed(seed)
wss[i] <- sum(kmeans(data, centers=i)$withinss)}
plot(1:nc, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")}
wssplot(df)
this will create a plot like this.
From this you can choose the value of k to be either 3 or 4. i.e
there is a clear fall in 'within groups sum of squares' when moving from 1 to 3 clusters. After three clusters, this decrease drops off, suggesting that a 3-cluster solution may be a good fit to the data.
But like Anony-Mouse pointed out, the curse of dimensionality affects due to the fact that euclidean distance being used in k means.
I hope this answer helps you to a certain extent.
I'm currently trying to use the fft function in R to transform measured soil temperature at a certain depths so as to model soil temperatures and heat fluxes at different depths.
I wanted to clarify some points regarding the fft function in R as i'm currently experiencing problems implementing this procedure.
So I have a df containing the date and time and soil temperatures at 5cm (T5) depth for a period of several months. According to the literature, it is possible to simulate temperatures and heat fluxes at different depths based on a fast Fourier transform of the measured data.
So my first step was naturally DF$FFT = fft (DF$T5)
From which I receive a series of complex numbers (Cn) i.e. the respective real (an) and imaginary (bn) numbers.
According to the literature, I can then recreate the T5 data with a formula based on outputs from the aforementioned fft.
*T_(0,t )= meanT + ∑ (An sin〖nωt+φ〗) ̅
NB the summed term is summed between n=1 and M, the highest harmonic
where T o,t is the temperature at given time point, mean Temperature over the period, t is the time and...
An = (2/sqrt(N))*|Cn|
|Cn| = modulus of the complex number of the nth harmonic Mod (DF$FFT)
phi = arctan (an/bn) i.e. arctan (Re(DF$FFT)/Im(DF$FFT)
omega = (2*pi/N)
Unfortunately based on the output of the fft in R i cannot recreate the temperature values using the above formula. I realise i can recreate the data using
fft (fft(DF$T5), inverse = T)/length (DF$T5)
However i need to be able to do it with the above equation so as to use the terms from this equation to model temperatures at other depths. Could anyone lend a hand in where i may be going wrong with the procedure i have described above. For example the above procedure was implemented in paper where the fft function from Mathcad was used! I am not looking here for a quick fix solution to my problem, so i understand that more data and info would be handy if that were the case. What i am looking for though is a bit of guidance with e.g. any peculiarities of the R fft that i should be aware of.
If anyone could help in any way possible it would be most appreciated. Also if anyone needs more info regarding my problem please do ask
thanks a lot
Brad