How to generate normal distributed multidimensional points - r

I need to generate random multidimensional clustered data. For this I want to generate a few uniformly distributed multidimensional points (centers) and then many normally distributed points around each of them. How can I set a vector (a multidimensional point) as the mean of the normal distribution? I see that the function rnorm accepts vectors for its mean and sd parameters, but I really don't understand how that works.

Package mnormt, function rmnorm()
set.seed(2)
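library(mnormt)
# Build a random covariance matrix. crossprod(A) = t(A) %*% A is symmetric and
# positive semi-definite, which A + t(A) alone does not guarantee.
A <- matrix(rchisq(4, 2), 2)
varcov <- crossprod(A)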
rmnorm(1000, mean=c(0,1), varcov=varcov)

Function rmvnorm.mixt() in package ks is another good alternative. If you install this package and open the vignette (ks: Kernel density estimation for bivariate data), you will find an example that builds a 'dumbbell' density with this function (see page 1). You can also use the rmnorm() function (already proposed here) to sample from the same density, as follows:
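# The mixture weights 4/11, 4/11 and 3/11 set how many points each component
# contributes (200:200:150); they must not multiply the sampled values.
xy <- rbind(rmnorm(200, c(-2, 2), diag(2)),
            rmnorm(200, c(2, -2), diag(2)),
            rmnorm(150, c(0, 0), matrix(c(0.8, -0.72, -0.72, 0.8), 2, 2)))
plot(xy)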
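As for the original question about rnorm itself: it recycles its mean argument element by element, so a matrix holding one center per point can be passed directly. A minimal base-R sketch, assuming independent coordinates with a common standard deviation (the names k, d, n_per, centers, cluster and pts, and all numeric values, are made up for illustration):

```r
set.seed(1)
k <- 3        # number of clusters (illustrative)
d <- 2        # number of dimensions (illustrative)
n_per <- 100  # points per cluster (illustrative)

# Uniformly distributed cluster centers, one row per center.
centers <- matrix(runif(k * d, min = 0, max = 10), nrow = k)

# Cluster label of every point, then the matching center for every point.
cluster <- rep(seq_len(k), each = n_per)
means <- centers[cluster, , drop = FALSE]  # (k * n_per) x d matrix

# rnorm() reads 'means' as a vector in column-major order, which is the same
# order matrix() uses to fill 'pts', so pts[i, j] is drawn around means[i, j].
pts <- matrix(rnorm(length(means), mean = means, sd = 0.5), ncol = d)

plot(pts, col = cluster, pch = 16)
points(centers, pch = 4, cex = 2, lwd = 2)
```

For correlated coordinates within a cluster, a multivariate sampler such as rmnorm() is still needed; this sketch only covers the diagonal-covariance case.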

Related

Silhouette plot in R

I have a set of data containing:
item, associated cluster, silhouette coefficient. I can further augment this data set with more information if necessary.
I would like to generate a silhouette plot in R. I am having trouble with this because the examples I came across use the built-in kmeans (or a related) clustering function and plot the result. I want to bypass this step and produce the plot for my own clustering algorithm, but I cannot work out the correct arguments to pass to the plot function.
Thank you.
EDIT
Data set example https://pastebin.mozilla.org/8853427
What I've tried is loading the dataset and passing it to the plot function using various arguments based on https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/silhouette.html
Function silhouette in package cluster can do the plots for you. It just needs a vector of cluster membership (produced from whatever algorithm you choose) and a dissimilarity matrix (probably best to use the same one used in producing the clusters). For example:
library (cluster)
library (vegan)
data(varespec)
dis = vegdist(varespec)
res = pam(dis,3) # or whatever your choice of clustering algorithm is
sil = silhouette (res$clustering,dis) # or use your cluster vector
dev.new() # open a separate graphics device (windows() exists only on Windows); RStudio sometimes does not display silhouette plots correctly
plot(sil)
EDIT: For k-means (which uses squared Euclidean distance)
library (vegan)
library (cluster)
data(varespec)
dis = dist(varespec)^2
res = kmeans(varespec,3)
sil = silhouette (res$cluster, dis)
dev.new() # portable replacement for windows()
plot(sil)
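If the silhouette coefficients are already computed, as in the question, the plot can also be drawn without calling silhouette() at all. A rough base-R sketch, assuming a data frame with columns item, cluster and sil (these column names and the values below are invented for illustration, not taken from the linked data set):

```r
# Hypothetical precomputed results: one row per item.
df <- data.frame(
  item    = paste0("i", 1:9),
  cluster = rep(1:3, each = 3),
  sil     = c(0.8, 0.6, 0.7, 0.2, -0.1, 0.4, 0.5, 0.9, 0.3)
)

# Group by cluster and sort by decreasing silhouette width inside each cluster,
# which is the usual layout of a silhouette plot.
sorted <- df[order(df$cluster, -df$sil), ]

# barplot() draws the first bar at the bottom, so reverse to get cluster 1 on top.
barplot(rev(sorted$sil),
        horiz = TRUE,
        names.arg = rev(sorted$item),
        col = rev(sorted$cluster) + 1,
        border = NA, las = 1,
        xlab = "Silhouette width")
abline(v = mean(df$sil), lty = 2)  # average silhouette width
```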

Time taken to krige in gstat package in R

The following R program creates an interpolated surface from 470 data points of the Walker Lake data, using the gstat package.
source("D:/kriging/allfunctions.r") # Reads in all functions.
source("D:/kriging/panel.gamma0.r") # Reads in panel function for xyplot.
library(lattice) # Needed for "xyplot" function.
library(geoR) # Needed for "polygrid" function.
library(akima)
library(gstat);
library(sp);
walk470 <- read.table("D:/kriging/walk470.txt",header=T)
attach(walk470)
coordinates(walk470) = ~x+y
walk.var1 <- variogram(v ~ x+y,data=walk470,width=10) # the lag width has to be tuned; it changes the number of point pairs per bin
plot(walk.var1,xlab="Distance",ylab="Semivariance",main="Variogram for V, Lag Spacing = 10")
model1.out <- fit.variogram(walk.var1,vgm(70000,"Sph",40,20000))
plot(walk.var1, model=model1.out,xlab="Distance",ylab="Semivariance",main="Variogram for V, Lag Spacing = 10")
poly <- chull(coordinates(walk470))
plot(coordinates(walk470),type="n",xlab="X",ylab="Y",cex.lab=1.6,main="Plot of Sample and Prediction Sites",cex.axis=1.5,cex.main=1.6)
lines(coordinates(walk470)[poly,])
poly.in <- polygrid(seq(2.5,247.5,5),seq(2.5,297.5,5),coordinates(walk470)[poly,])
points(poly.in)
points(coordinates(walk470),pch=16)
coordinates(poly.in) <- ~ x+y
krige.out <- krige(v ~ 1, walk470,poly.in, model=model1.out)
print(krige.out)
For each of the 2688 prediction points, this program computes:
a (470x470) matrix inversion
a (470x470) by (470x1) matrix multiplication
Is the gstat package using some smart method for this calculation? I learned from a previous Stack Overflow question that it uses Cholesky decomposition for the matrix inversion. Is it normal for one machine to compute this so quickly?
It uses LDL' decomposition, which is similar to Cholesky. As you are using global kriging, the covariance matrix needs to be decomposed only once; then, for each prediction point, a system is solved by forward and back substitution, which is O(n^2). No 470x470 matrix ever gets inverted, nor are solutions obtained by multiplying with an inverse. Inverses are notational devices, but are avoided as a computational strategy when possible. In R, for instance, compare the runtime of solve(A,b) with that of solve(A) %*% b.
Use the source, Luke!
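The "factor once, solve many times" idea can be sketched in base R. This is only an illustration, not gstat's code: it uses chol() (Cholesky) in place of LDL', and the covariance matrix A, the nugget value and the random right-hand sides B are invented for the example.

```r
set.seed(42)
n <- 470   # number of data points, as in the question
m <- 2688  # number of prediction points, as in the question

# Made-up symmetric positive definite "covariance" matrix:
# exponential covariance of random 2-D locations plus a small nugget.
locs <- matrix(runif(2 * n), ncol = 2)
A <- exp(-as.matrix(dist(locs)) / 0.3) + diag(0.01, n)

# One made-up right-hand side per prediction point.
B <- matrix(rnorm(n * m), nrow = n)

# Factor once: A = t(R) %*% R with R upper triangular.
t_factor <- system.time(R <- chol(A))

# Each right-hand side then costs only two triangular solves.
t_solve <- system.time(X_chol <- backsolve(R, forwardsolve(t(R), B)))

# The textbook formula with an explicit inverse, for comparison.
t_inv <- system.time(X_inv <- solve(A) %*% B)

print(rbind(factor = t_factor, solve = t_solve, inverse = t_inv)[, "elapsed"])
print(max(abs(X_chol - X_inv)))  # the two results agree up to rounding error
```

Timings depend on the machine and the BLAS in use, so none are quoted here.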

Density of a Two-Piece Normal (or Split Normal) Distribution

Is there a density function for the two-piece normal distribution on CRAN? Thought I would check before I code one. I have checked the distribution task view. It is not listed there. I have looked in a couple of likely packages, but to no avail.
Update: I have added dsplitnorm, psplitnorm, qsplitnorm and rsplitnorm functions to the fanplot package.
If you choose to construct your own version of the distribution, you might be interested in distr. It (and the related packages distrEx, distrSim, distrTEst, distrTeach and distrDoc) have been written to provide a unified interface for constructing new distributions from existing ones. (I constructed this example with the help of the wonderful vignette that accompanies the distrDoc package and which can be gotten by typing vignette("distr").)
This implements the split normal distribution, which may not be exactly what you are after. Using the distr toolset, though, it shouldn't be too hard to adjust this to fit your exact needs.
library(distr)
## Construct the distribution object.
## Here, it's a split normal distribution with mode=0, and lower- and
## upper-half standard deviations of 2 and 1, respectively.
splitNorm <- UnivarMixingDistribution(Truncate(Norm(0,2), upper=0),
Truncate(Norm(0,1), lower=0),
mixCoeff=c(0.5, 0.5))
## Construct its density function ...
dsplitNorm <- d(splitNorm)
## ... and a function for sampling random variates from it
rsplitNorm <- r(splitNorm)
## Compare the density it returns to that from dnorm()
dsplitNorm(-1)
# [1] 0.1760327
dnorm(-1, sd=2)
# [1] 0.1760327
## Sample and plot a million random variates from the distribution
x <- rsplitNorm(1e6)
hist(x, breaks=100, col="grey")
## Plot the distribution's continuous density
plot(splitNorm, to.draw.arg="d")
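Note that the equal mixing weights above give a density that jumps at the mode. If the continuous two-piece normal is wanted, its usual textbook density can be written directly in base R. A minimal sketch under that assumption (the function name dtwopiece and its argument names are made up here and are not the fanplot or distr interface):

```r
# Two-piece normal density: a half-normal with sd1 below the mode and
# sd2 above it, scaled by a common constant so that the two halves meet
# at the mode and the whole density integrates to 1.
dtwopiece <- function(x, mode = 0, sd1 = 1, sd2 = 1) {
  A <- sqrt(2 / pi) / (sd1 + sd2)
  s <- ifelse(x <= mode, sd1, sd2)
  A * exp(-(x - mode)^2 / (2 * s^2))
}

# Probability mass on each side of the mode; it is sd1/(sd1+sd2) on the
# left and sd2/(sd1+sd2) on the right, not 0.5 each.
left  <- integrate(dtwopiece, -Inf, 0, sd1 = 2, sd2 = 1)$value
right <- integrate(dtwopiece, 0, Inf, sd1 = 2, sd2 = 1)$value
print(c(left = left, right = right, total = left + right))

curve(dtwopiece(x, sd1 = 2, sd2 = 1), from = -6, to = 4, n = 501)
```

With distr, the same continuous density should result from using mixCoeff proportional to the two standard deviations instead of c(0.5, 0.5).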

analytical derivative of splinefun()

I'm trying to fit a natural cubic spline to probabilistic data (probabilities that a random variable is smaller than certain values) to obtain a cumulative distribution function, which works well enough using splinefun():
cutoffs <- c(-90,-60,-30,0,30,60,90,120)
probs <- c(0,0,0.05,0.25,0.5,0.75,0.9,1)
CDF.spline <- splinefun(cutoffs,probs, method="natural")
plot(cutoffs,probs)
curve(CDF.spline(x), add=TRUE, col=2, n=1001)
I would then, however, like to use the density function, i.e. the derivative of the spline, to perform various calculations (e.g. to obtain the expected value of the random variable).
Is there any way of obtaining this derivative as a function rather than just evaluated at a discrete number of points via CDF.spline(x, deriv=1)?
This is pretty close to what I'm looking for, but alas the example doesn't seem to work in R version 2.15.0.
Barring an analytical solution, what's the cleanest numerical way of going about this?
If you change the environment assignment line for g in the code that Berwin Turlach provided on R-help to this:
environment(g) <- environment(f)
... you succeed in R 2.15.1.
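Another route that may be enough here: the function returned by splinefun() evaluates the exact derivative of the fitted spline at any point through its deriv argument, so a thin wrapper already behaves like a density function. A sketch using the data from the question (the names pdf.spline, xf, lo, hi and ev are made up for illustration):

```r
cutoffs <- c(-90, -60, -30, 0, 30, 60, 90, 120)
probs   <- c(0, 0, 0.05, 0.25, 0.5, 0.75, 0.9, 1)
CDF.spline <- splinefun(cutoffs, probs, method = "natural")

# First derivative of the spline, usable like any other R function.
pdf.spline <- function(x) CDF.spline(x, deriv = 1)

# Expected value: integrate x * f(x) over the support, one knot interval at
# a time, because the integrand is a plain polynomial between knots.
xf <- function(x) x * pdf.spline(x)
lo <- head(cutoffs, -1)
hi <- tail(cutoffs, -1)
ev <- sum(mapply(function(a, b) integrate(xf, a, b)$value, lo, hi))
print(ev)

curve(pdf.spline(x), from = -90, to = 120, n = 1001)
```

A natural spline is not guaranteed to be monotone, so the resulting "density" can dip below zero in places; splinefun(method = "hyman") or "monoH.FC" would be worth trying if that matters.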

Is there an R library that estimates a multivariate natural cubic spline (or similar) function?

note: originally posted on Cross Validated (stats SE) on 07-26-2011, with no correct answers to date.
Background
I have a model, f, where Y=f(X)
X is an n x m matrix of samples from m parameters and Y is the n x 1 vector of model outputs.
f is computationally intensive, so I would like to approximate f using a multivariate cubic spline through (X,Y) points, so that I can evaluate Y at a larger number of points.
Question
Is there an R function that will calculate an arbitrary relationship between X and Y?
Specifically, I am looking for a multivariate version of the splinefun function, which generates a spline function for the univariate case.
e.g. this is how splinefun works for the univariate case
x <- 1:100
y <- runif(100)
foo <- splinefun(x,y, method = "monoH.FC")
foo(x) #returns y, as example
The test that the function interpolates exactly through the points is successful:
all(y == foo(1:100))
## TRUE
What I have tried
I have reviewed the mda package, and it seems that the following should work:
library(mda)
x <- data.frame(a = 1:100, b = 1:100/2, c = 1:100*2)
y <- runif(100)
foo <- mars(x,y)
predict(foo, x) #all the same value
however the function does not interpolate exactly through the design points:
all(y == predict(foo,x))
## FALSE
I also could not find a way to implement a cubic-spline in either the gam, marss, or earth packages.
Actually, several packages can do it. The one I use is the "rms" package, which has rcs, but the survival package also has pspline and the splines package has the ns function. "Natural splines" (constructed with ns) are also cubic splines. You will need to form a multivariate fitting function with the '*' operator in the model formula, creating "crossed" spline terms.
Note that the example you offered was not sufficiently rich: its columns b and c are exact multiples of a.
I guess I am confused that you want exact fits. R is a statistical package. Approximate estimation is the goal. Generally exact fits are more of a problem because they lead to multicollinearity.
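The crossed natural-spline terms mentioned in this answer might look as follows with splines::ns and lm. This is a sketch on invented data (the test function, the noise level, df = 4 and all object names are assumptions), and it gives a smooth approximation rather than an exact interpolant:

```r
library(splines)

set.seed(1)
n <- 200
dat <- data.frame(a = runif(n), b = runif(n))
# Made-up "expensive model" output with a little noise.
dat$y <- sin(2 * pi * dat$a) * cos(pi * dat$b) + rnorm(n, sd = 0.05)

# ns(a) * ns(b) expands to both main effects plus all pairwise products
# of the two spline bases, i.e. a tensor-product surface in a and b.
fit <- lm(y ~ ns(a, df = 4) * ns(b, df = 4), data = dat)

# Evaluate the fitted surface on a denser grid of new points.
grid <- expand.grid(a = seq(0.05, 0.95, length.out = 20),
                    b = seq(0.05, 0.95, length.out = 20))
grid$pred <- predict(fit, newdata = grid)

print(summary(fit)$r.squared)
```

The number of coefficients grows multiplicatively with each crossed input, so this approach gets expensive quickly as the number of parameters m increases.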
Have a look at the DiceKriging package which was developed to undertake tasks like this.
http://cran.r-project.org/web/packages/DiceKriging/index.html
I've provided an example application at
https://stats.stackexchange.com/questions/13510/fitting-multivariate-natural-cubic-spline/65012#65012
I'm not sure if this is precisely what you are looking for, but you could try Tps() in the R package fields. It's meant for thin-plate spline interpolation (the 2D equivalent of cubic splines) of spatial data, but will take up to four covariates, although it will expect them to be Euclidean x, y, z + time, so you need to be clear that you are selecting the correct options for your particular case. If you want to interpolate, set the smoothing parameter lambda to zero. You might also try the function polymars() in the R package polspline.
