k-means for many identical points in R

Suppose I have a one-dimensional data set that contains many identical values, for example S = c(rep(4, times = 1000), rep(5, times = 808), rep(9, times = 990)). Is there an efficient way to do k-means on such data in R? In my actual data I have only around 20 distinct values, but each of them appears around 100,000 times, and kmeans runs very slowly. So I wonder if there is a more efficient way.

K-means can be implemented with case weights. It's straightforward to do so.
But IIRC the version included with R is not implemented this way. The version in flexclust may be, but it's pure R and much, much slower.
Either way, you will want this implemented in Fortran or C, like the regular kmeans function. Maybe you can find a package that already has a good implementation.
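A minimal sketch of the aggregate-then-weight idea in plain R (the helper name weighted_kmeans_1d and the random initialisation are illustrative, not from any package): collapse the data to its distinct values and their counts, then run Lloyd's algorithm with weighted means, so each iteration touches ~20 points instead of millions.
weighted_kmeans_1d <- function(x, k, iter.max = 100) {
  tab  <- table(x)                     # collapse identical points
  vals <- as.numeric(names(tab))       # the distinct values
  w    <- as.numeric(tab)              # how often each value occurs (its weight)
  centers <- sample(vals, k)           # crude random initialisation
  for (it in seq_len(iter.max)) {
    # assign every distinct value to its nearest centre
    d2  <- outer(vals, centers, function(v, c) (v - c)^2)
    grp <- apply(d2, 1, which.min)
    # recompute each centre as the weighted mean of its members
    new_centers <- sapply(seq_len(k), function(j) {
      idx <- grp == j
      if (any(idx)) weighted.mean(vals[idx], w[idx]) else centers[j]
    })
    if (max(abs(new_centers - centers)) < 1e-12) break
    centers <- new_centers
  }
  list(centers = sort(centers), cluster = grp)
}

# example with the data from the question
S <- c(rep(4, 1000), rep(5, 808), rep(9, 990))
weighted_kmeans_1d(S, k = 2)
This is only a sketch of the weighted approach; a C or Fortran implementation, as suggested above, would be faster still on larger alphabets.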

Related

Feature selection on subsets of feature set

I am trying to do feature selection using the Boruta package in R. The problem is that my feature set is far too large (70,518 features), so the resulting data frame (about 2 GB) cannot be processed by the Boruta package in one go. I am wondering if I can split the data frame into several sets, each containing a smaller number of features? This sounds a bit odd to me, as I am not sure the algorithm can correctly estimate feature importances if not all features are present.
If not, I would be very grateful if someone could suggest an alternative way of doing it.
I think your best bet in this case might be to first try to filter out features that are either low-information (e.g. near-zero variance) or highly correlated.
The caret package has some useful functions to help with this.
For example, findCorrelation() can be used to remove redundant features:
library(caret)
cor_mat <- cor(dat, method = 'spearman')     # correlation matrix of the features
cor_mat[is.na(cor_mat)] <- 0
features_to_ignore <- findCorrelation(cor_mat, cutoff = 0.75, verbose = FALSE)
dat <- dat[, -features_to_ignore]            # drop the flagged columns from the original data
This removes features so that no pair of the remaining features has an absolute Spearman correlation of 0.75 or higher.
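For the low-information side mentioned above, caret's nearZeroVar() can be used in the same spirit, before the correlation filter (dat is the original feature data frame, as in the snippet above):
library(caret)

nzv <- nearZeroVar(dat)                  # column positions of (near-)zero-variance features
if (length(nzv) > 0) dat <- dat[, -nzv]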
I'm going to start by asking why you believe this can even work. In this case, not only is p >> n, but p >>>>>> n. You're always going to find spurious associations. More than that, even if you could do this (say by renting a sufficiently large machine from a cloud computing service, which is the approach I'd suggest), you're looking at an absurd amount of computation, since the computational complexity of building a single decision tree is O(n * v log(v)), where n is the number of records and v is the number of fields in each record. Building an RF takes that much for each tree.
Instead of solving the problem as stated, you might want to rethink it from the ground up. What are you really trying to do here? Can you go back to first principles and rethink that?

How can I efficiently best fit large data with large numbers of variables

I have a data set with 10 million rows and 1,000 variables, and I want to fit a model to those variables so I can predict the value for a new row. I am using Jama's QR decomposition to do it (better suggestions welcome, but I think this question applies to any implementation). Unfortunately that takes too long.
It appears I have two choices. Either I can split the data into, say, 1,000 chunks of 10,000 rows each and then average the results. Or I can add up every, say, 100 rows and feed those combined rows into the QR decomposition.
One or both approaches may be mathematical disasters, and I'm hoping someone can point me in the right direction.
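Not part of the answer below, but one concrete way to make the chunked-fitting idea work without any averaging, sketched in R with illustrative names (dat, y): the biglm package maintains a compact incremental QR-style summary, so feeding the rows in chunks yields exactly the same least-squares fit as processing everything at once.
library(biglm)

# split the row indices into chunks of 10,000
chunks <- split(seq_len(nrow(dat)), ceiling(seq_len(nrow(dat)) / 10000))

# fit on the first chunk, then fold in the remaining chunks one at a time
fit <- biglm(y ~ ., data = dat[chunks[[1]], ])
for (idx in chunks[-1]) fit <- update(fit, dat[idx, ])

coef(fit)   # identical to the full-data least-squares solution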
For such big datasets I'd say you need to use HDF5. HDF5 is the Hierarchical Data Format, version 5. It has a C/C++ implementation with APIs and bindings for many other languages. HDF5 uses B-trees to index its datasets.
HDF5 is supported by Java, MATLAB, Scilab, Octave, Mathematica, IDL, Python, R, and Julia.
Unfortunately I don't know more than this about it, but I'd suggest you begin your research with a simple exploratory internet search!
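As a minimal illustration of using HDF5 from R (via the Bioconductor package rhdf5; the file and object names here are illustrative), you can write the data once and then read row blocks on demand instead of holding everything in memory:
library(rhdf5)                              # Bioconductor package

h5createFile("big.h5")
h5write(X, "big.h5", "X")                   # X: the full data matrix, written once

# later, read only a block of rows at a time
block <- h5read("big.h5", "X", index = list(1:10000, NULL))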

R equivalent to MATLAB griddata, scatteredInterpolant, and/or TriScatteredInterp

We do a lot of full-field 3D numerical simulations (CFD, FEA, etc.). The solutions take a long time to run. We often interpolate from existing solutions rather than rerun every case. We also interpolate between multiple solutions, which leads to even higher-dimensional interpolation (like adding time, so x,y,z,t,v).
MATLAB does a great job of reading data V at an irregular grid of X,Y,Z coordinates and interpolating from V using griddata, scatteredInterpolant, and/or TriScatteredInterp. For a variety of reasons, I've switched to R. This remains one key area where I've not been able to find as good an R equivalent. 'akima' only does x,y,V (not x,y,z,V, much less even higher dimensions like x,y,z,t,v).
The next best thing I've found has been kriging. But kriging behaves more like model fitting and projection, and often does not behave well between irregular grid points. So it's not nearly as robust as simple direct linear interpolation.
MATLAB has had griddata for several decades. It's hard to believe R doesn't have an equivalent out there. Any suggestions? Or is there at least a way to use kriging to yield effectively the same result as a direct linear interpolation?
Jonathan
You might start by looking at the package "tripack" to do the Delaunay triangulation, which gives you the first step in duplicating scatteredInterpolant().
interpp() (in the akima package) is the R equivalent of MATLAB's scatteredInterpolant() for the bivariate case.
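A minimal bivariate sketch with akima::interpp (x, y are the scattered coordinates, v the sampled field, and xq, yq the query locations, all illustrative); it performs piecewise-linear interpolation over the Delaunay triangulation, which is what scatteredInterpolant does in 2-D, but it does not extend to three or more dimensions:
library(akima)

# x, y : scattered coordinates;  v : field values at (x, y)
vi <- interpp(x, y, v, xo = xq, yo = yq)   # xq, yq: points to interpolate at
vi$z                                       # interpolated values at the query points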

Parallelized multidimensional numerical integration in R?

I am working on a problem that needs numerical integration of a bivariate function, where each evaluation of the function takes about 1 minute. Since numerical integration on a single core would evaluate the function thousands to tens of thousands of times, I would like to parallelize the calculation. Right now I am using a brute-force approach that evaluates the function on a naive grid of points and adds them up with appropriate area multipliers. This is definitely not efficient, and I suspect any modern multidimensional numerical integration algorithm would achieve the same precision with far fewer function evaluations. There are many packages in R that calculate 2-D integrals much more efficiently and accurately (e.g. R2Cuba), but I haven't found anything that can easily be parallelized on a cluster with SGE-managed job queues. Since this is only a small part of a bigger research problem, I would like to see if this can be done with reasonable effort before I try to parallelize one of the cubature-rule based methods in R by myself.
I have found that using a sparse grid achieves the best compromise between speed and accuracy in multi-dimensional integration, and it's easily parallelized on a cluster because it involves no sequential steps. It won't be as accurate as sequentially adaptive integration algorithms, but it is much better than the naive method because it provides a much sparser grid of points to evaluate on each core.
The following R code deals with 2-dimensional integration but can easily be modified for higher dimensions. The apply() call towards the end can easily be parallelized on a cluster (see the sketch after the function).
sg.int <- function(g, ..., lower, upper)
{
  require("SparseGrid")
  lower <- floor(lower)
  upper <- ceiling(upper)
  if (any(lower > upper)) stop("lower must be smaller than upper")
  # lower-left corners of the unit cells tiling the integration region
  gridss <- as.matrix(expand.grid(seq(lower[1], upper[1] - 1, by = 1),
                                  seq(lower[2], upper[2] - 1, by = 1)))
  # sparse-grid nodes and weights on the unit square (KPU rule, accuracy level k = 5)
  sp.grid <- createIntegrationGrid('KPU', dimension = 2, k = 5)
  # shift the unit-square nodes into every cell (offset added row-wise to each node)
  nodes   <- do.call(rbind, lapply(seq_len(nrow(gridss)),
                                   function(i) sweep(sp.grid$nodes, 2, gridss[i, ], "+")))
  weights <- rep(sp.grid$weights, nrow(gridss))
  # evaluate the integrand at every node; this apply() is the step to parallelize
  gx.sp  <- apply(nodes, 1, g, ...)
  val.sp <- gx.sp %*% weights
  val.sp
}
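One way to parallelize that final apply() step, sketched with the base parallel package (a local PSOCK cluster here; on an SGE-managed queue you would instead size the cluster to the slots the scheduler grants and export whatever objects g needs):
library(parallel)

cl <- makeCluster(detectCores())        # or the number of cores SGE allocated to the job
# clusterExport(cl, varlist = c(...))   # export any objects that g depends on
gx.sp <- parApply(cl, nodes, 1, g)      # drop-in replacement for apply(nodes, 1, g, ...);
                                        # extra arguments to g can be appended as usual
stopCluster(cl)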

Parallel computing for TraMineR

I have a large dataset with more than 250,000 observations, and I would like to use the TraMineR package for my analysis. In particular, I would like to use the commands seqtree and seqdist, which work fine when I use, for example, a subsample of 10,000 observations. The limit my computer can manage is around 20,000 observations.
I would like to use all the observations, and I do have access to a supercomputer that should be able to do just that. However, this doesn't help much, as the process runs on a single core only. Hence my question: is it possible to apply parallel computing techniques to the above-mentioned commands? Or are there other ways to speed up the process? Any help would be appreciated!
The internal seqdist function is written in C++ and has numerous optimizations. For this reason, if you want to parallelize seqdist, you need to do it in C++. The loop is located in the source file "distancefunctions.cpp", and you need to look at the two loops around line 300 in the function "cstringdistance" (sorry, but all comments are in French). A second important optimization is that memory is shared between all computations; for this reason, I think parallelization would be very complicated.
Apart from selecting a sample, you should consider the following optimizations:
Aggregation of identical sequences (see here: Problem with big data (?) during computation of sequence distances using TraMineR); a minimal sketch of this is given after this list.
If relevant, you can try to reduce the time granularity. Distance computation time grows roughly quadratically with sequence length. See https://stats.stackexchange.com/questions/43601/modifying-the-time-granularity-of-a-state-sequence
Reducing the time granularity may also increase the number of identical sequences, and hence the impact of the first optimization.
There is a hidden option in seqdist to use an optimized version of the optimal matching algorithm. It is still in the testing phase (that's why it is hidden), but it should replace the current algorithm in a future version. To use it, set method="OMopt" instead of method="OM". Depending on your sequences, it may reduce computation time.
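A minimal sketch of the first optimization, aggregating identical sequences before computing distances (seq_df is an illustrative data frame whose columns are the state variables; seqdef() accepts case weights, so seqdist() then only has to work on the distinct sequences):
library(TraMineR)

# one string key per case to detect duplicated sequences
key  <- do.call(paste, c(seq_df, sep = "|"))
keep <- !duplicated(key)

# keep one representative per distinct sequence, weighted by its frequency
agg_w   <- as.numeric(table(key)[key[keep]])
seq_agg <- seqdef(seq_df[keep, ], weights = agg_w)

# distances between the (far fewer) distinct sequences
d <- seqdist(seq_agg, method = "OM", indel = 1, sm = "TRATE")
The linked question above discusses how to carry results computed on the aggregated sequences back to the full sample.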
