Change specific values with mean based on neighbors - r

I have a dataset with values from 100 to 200, but there are a few spikes in the data.
I don't want to smooth the whole dataset with rollmean or rollapply.
I want it to work this way:
find the spikes with a condition (value > 300)
replace these too-large values with the mean/median calculated from the 10 nearest neighbouring values.
Example in pseudo-code:
data[n] = spike
data[n] = mean(from data[n-5] to data[n+5])
It's like applying a window function not to the whole dataset, but only at certain points in the data.
Thank you in advance

I like this question. This is a typical moving average / k-nearest-neighbour estimation problem, a nonparametric approach. The following should work.
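foo <- function(x, thresh = 300, h = 5, window.fun = mean) {
  # locations of the spikes
  spikes.loc <- which(x > thresh)
  # window bounds, clipped to the valid index range
  low.bound <- pmax(spikes.loc - h, 1)
  up.bound <- pmin(spikes.loc + h, length(x))
  x.hat <- x
  # seq_along() also handles the case of no spikes
  for (i in seq_along(spikes.loc)) x.hat[spikes.loc[i]] <- window.fun(x[low.bound[i]:up.bound[i]])
  return(x.hat)
}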
This function takes your original observation vector x, a threshold, a window size (a smoothing parameter) and a user-specified window function. The returned value is the vector of smoothed data; it differs from the original data only at the spike points. A common choice of window function is a density function, so you end up with a weighted average of all neighbouring data.
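As a small illustration of the same idea, here is a sketch of a variant that drops any value above the threshold from the window before averaging, so the spike does not contribute to its own replacement. The name replace_spikes and the synthetic data are assumptions made for this example only, not part of the answer above.

```r
# Hypothetical helper: replace spikes using only non-spike neighbours
replace_spikes <- function(x, thresh = 300, h = 5, window.fun = median) {
  out <- x
  for (i in which(x > thresh)) {
    # window of up to h neighbours on each side, clipped at the ends
    idx <- max(1, i - h):min(length(x), i + h)
    # drop the spike itself and any spiky neighbours
    idx <- idx[x[idx] <= thresh]
    if (length(idx) > 0) out[i] <- window.fun(x[idx])
  }
  out
}

# Synthetic data: values in [100, 200] with three injected spikes
set.seed(42)
x <- runif(50, 100, 200)
spike.pos <- c(1, 20, 21)
x[spike.pos] <- c(900, 500, 750)
y <- replace_spikes(x)
```

Only the positions in spike.pos change; every other element of y is identical to x.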
Please note, I am assuming your data are evenly spaced, so a simple index range x[i-h] : x[i+h] gives a reasonable neighbourhood. In a more general setting, a window is based on Euclidean distance, which naively costs O(N*N), where N is the number of observations; that is expensive.
In R, there are built-in nonparametric estimation / smoothing tools. The most basic one is kernel smoothing, a generalization of the moving average. It uses an FFT algorithm for fast computation at O(N log(N)) cost. Please see ?ksmooth. More advanced are the KernSmooth and sm packages.
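A possible way to use ksmooth for this task is sketched below, under the assumption that the smoother is fitted to the non-spike observations only and evaluated only at the spike locations. The bandwidth of 10 and the synthetic data are arbitrary choices for illustration.

```r
# Synthetic data: values in [100, 200] with two injected spikes
set.seed(1)
x <- runif(100, 100, 200)
x[c(30, 70)] <- c(450, 800)

idx <- seq_along(x)          # evenly spaced "time" index
is.spike <- x > 300

# Kernel-weighted average of the clean neighbours, evaluated at the spikes only
fit <- ksmooth(idx[!is.spike], x[!is.spike], kernel = "normal",
               bandwidth = 10, x.points = idx[is.spike])

x.hat <- x
x.hat[is.spike] <- fit$y
```

Because each replacement is a weighted average of clean values, it stays within the range of the clean data.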

Related

Using FFT in R to Determine Density Function for IID Sum

The goal is to compute the density function of a sum of n IID random variables via the density function of one of these random variables by:
Transforming the density function into the characteristic function via fft
Raising the characteristic function to the power n
Transforming the resulting characteristic function back into the density function of interest via fft(inverse=TRUE)
The below is my naive attempt at this:
sum_of_n <- function(density, n, xstart, xend, power_of_2)
{
x <- seq(from=xstart, to=xend, by=(xend-xstart)/(2^power_of_2-1))
y <- density(x)
fft_y <- fft(y)
fft_sum_of_y <- (fft_y ^ n)
sum_of_y <- Re(fft(fft_sum_of_y, inverse=TRUE))
return(sum_of_y)
}
In the above, density is an arbitrary density function: for example
density <- function(x){return(dgamma(x = x, shape = 2, rate = 1))}
n indicates the number of IID random variables being summed. xstart and xend are the start and end of the approximate support of the random variable. power_of_2 is the power of 2 length for the numeric vectors used. As I understand things, lengths of powers of two increase the efficiency of the fft algorithm.
I understand at least partially why the above does not work as intended in general. Firstly, the values themselves will not be scaled correctly, as fft(inverse=TRUE) does not normalize by default. However, I find that the values are still not correct when I divide by the length of the vector i.e.
sum_of_y <- sum_of_y / length(sum_of_y)
which based on my admittedly limited understanding of fft is the normalizing calculation. Secondly, the resulting vector will be out of phase due to (someone correct me on this if I am wrong) the shifting of the zero frequency that occurs when fft is performed. I have tried to use, for example, pracma's fftshift and ifftshift, but they do not appear to address this problem correctly. For symmetric distributions e.g. normal, this is not difficult to address since the phase shift is typically exactly half, so that an operation like
sum_of_y <- c(sum_of_y[(length(y)/2+1):length(y)], sum_of_y[1:(length(y)/2)])
works as a correction. However, for asymmetric distributions like the gamma distribution above this fails.
In conclusion, are there adjustments to the code above that will result in an appropriately scaled and appropriately shifted final density function for the IID sum?
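One possible set of adjustments is sketched below; it is an illustration, not a definitive answer. It assumes the density is negligible outside [xstart, xend]. The sketch converts density values to probability masses by multiplying by the grid spacing, zero-pads the vector so that the circular convolution implied by fft does not wrap around, divides the inverse transform by the padded length, and places the result on a grid that starts at n * xstart. The name sum_of_n_fft is hypothetical.

```r
sum_of_n_fft <- function(density, n, xstart, xend, power_of_2) {
  N  <- 2^power_of_2
  x  <- seq(xstart, xend, length.out = N)
  dx <- x[2] - x[1]
  p  <- density(x) * dx                 # probability mass per grid cell

  # The n-fold sum occupies n * (N - 1) + 1 grid points; pad to a power of 2
  len <- n * (N - 1) + 1
  M   <- nextn(len, factors = 2)
  p.pad <- c(p, rep(0, M - N))

  # Convolution theorem: n-fold convolution = inverse FFT of FFT^n,
  # with the 1/M normalization that fft(inverse = TRUE) leaves out
  mass <- Re(fft(fft(p.pad)^n, inverse = TRUE)) / M

  keep <- seq_len(len)
  data.frame(x = n * xstart + (keep - 1) * dx,   # support of the sum
             y = mass[keep] / dx)                # back from mass to density
}

# Check against a case with a known answer: the sum of 3 Gamma(2, 1)
# variables is Gamma(6, 1)
res <- sum_of_n_fft(function(x) dgamma(x, shape = 2, rate = 1),
                    n = 3, xstart = 0, xend = 30, power_of_2 = 10)
max.err <- max(abs(res$y - dgamma(res$x, shape = 6, rate = 1)))
```

With this layout no fftshift is needed: the first element of the result corresponds to n * xstart, and the shift seen in the original attempt comes from the wrap-around of the unpadded circular convolution.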

nstart for k-means in R

Search results in numerous places report that the argument nstart in R's function kmeans sets a number of iterations of the algorithm and chooses 'the best one', see e.g. https://datascience.stackexchange.com/questions/11485/k-means-in-r-usage-of-nstart-parameter. Can anyone provide any clarity on how it does this, i.e. by what measure does it define best?
Secondly: R's kmeans function takes an argument centers. Here, as typical in k-means, it is possible to initialise the centroids before the algorithm begins expectation-maximisation, by choosing as initial centroids rows (data-points) from within your data-set. (You could supply, in vector form, points not present in your data-set as well, with considerably greater effort. In this case you could in theory choose the global optimum as your centroids. This is not what I'm asking for.) When nstart or the seed randomises initializations, I am quite sure that it does so by picking a random choice of centroids from your data-set and starting from those (not just a random set of points within the space).
In general, therefore, I'm looking for a way to get a good (e.g. best out of $n$ trials, or best from nstart) set of starting data-instances from the data-set as initial centroids. Is there any way of extracting the 'winning' (=best) set of initial centroids from nstart (which I could then use, say, in the centers parameter in future)? Any other streamlined & quick way to get a very good set of starting centroids (presumably, reasonably close to where the cluster centres will end up being)?
Is there perhaps, at least, a way to extract from a given kmeans run, what initial centroids it chose to start with?
The criterion that kmeans tries to minimize is the trace of the within scatter matrix, i.e. (unfortunately, this forum does not support LaTeX, but you hopefully can read it nevertheless):
$$ \operatorname{trace}(S_w) = \sum_{k=1}^K \sum_{x \in C_k} \|x - \mu_k\|^2 $$
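As a quick numerical check of this criterion, the sketch below recomputes the sum of squared distances to the assigned cluster centers by hand and compares it with the tot.withinss component returned by kmeans. The seed is arbitrary.

```r
set.seed(1)
x  <- as.matrix(iris[, -5])
km <- kmeans(x, centers = 3)

# Squared distance of every point to the center of its own cluster
manual <- sum((x - km$centers[km$cluster, ])^2)

all.equal(manual, km$tot.withinss)
```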
Concerning the best starting point: obviously, the "best" starting point would be the cluster centers eventually chosen by kmeans. These are returned in the component centers:
km <- kmeans(iris[,-5], 3)
print(km$centers)
If you are looking for the best random start point, you can create random start points yourself (with runif), do this nstart times and evaluate which initial configuration leads to the smallest km$tot.withinss:
nstart <- 10
K <- 3 # number of clusters
D <- 4 # data point dimension
# select possible range
r.min <- apply(iris[,-5], MARGIN=2, FUN=min)
r.max <- apply(iris[,-5], MARGIN=2, FUN=max)
for (i in 1:nstart) {
centers <- data.frame(runif(K, r.min[1], r.max[1]))  # first dimension; d is not defined yet here
for (d in 2:D) {
centers <- cbind(centers, data.frame(runif(K, r.min[d], r.max[d])))
}
names(centers) <- names(iris[,-5])
# call kmeans with centers and compare tot.withinss
# ...
}
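The comparison step left open above could look like the following sketch. Note that it swaps in a different initialisation: instead of uniform random points in the bounding box, it draws K distinct data rows as starting centers, which avoids clusters that are empty from the start. The names ntrial, trials and best are hypothetical.

```r
set.seed(1)
x      <- as.matrix(iris[, -5])
x.uniq <- unique(x)      # avoid drawing two identical rows as centers
K      <- 3
ntrial <- 10             # plays the role of nstart

# Run kmeans from several random sets of data rows, remembering each start
trials <- lapply(seq_len(ntrial), function(i) {
  init <- x.uniq[sample(nrow(x.uniq), K), , drop = FALSE]
  list(init = init, fit = kmeans(x, centers = init))
})

# Pick the trial with the smallest total within-cluster sum of squares
tot  <- sapply(trials, function(tr) tr$fit$tot.withinss)
best <- trials[[which.min(tot)]]

best$init                # the "winning" initial centroids

# Reusing them via the centers argument reproduces the same fit
refit <- kmeans(x, centers = best$init)
```

Keeping init next to fit in each list element is what makes the starting centroids recoverable afterwards; kmeans itself does not return them.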

Calculating lm() within a loop

Objective: The overall objective is to calculate the confidence interval (CI) for samples of various sizes (n = 2, 4, ..., 1024) drawn with rnorm, 10,000 times each, and then count the number of times each one fails (this likely requires a counter and an if/else statement). Finally, the results are to be plotted.
I am trying to calculate the CI of the mean for several simulations at each sample size; however, I am first trying to get the code working for one specific sample size, a = 8.
The problem I have is that I do not know how to fit a linear model to each row. Would anyone know how I can do this? Here is what I have so far:
a <- 8
n.sim.3 <- 10000
for (i in a) {
  r.mat <- matrix(rnorm(i * n.sim.3), nrow = n.sim.3, ncol = i)
  # The lm command is where I'm stuck; I don't think this is correct
  lm.tmp <- apply(r.mat, 1, lm(n.sim.3 ~ 1))
  confint.tmp <- confint(lm.tmp)
}
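One way the per-row fit might be written is sketched below, assuming that a "failure" means the 95% interval does not contain the true mean of 0. The anonymous function passed to apply fits an intercept-only model to each row, and confint returns the interval for that intercept. The number of simulations is reduced to 1000 here only to keep the example fast.

```r
set.seed(1)
a     <- 8
n.sim <- 1000
r.mat <- matrix(rnorm(a * n.sim), nrow = n.sim, ncol = a)

# One intercept-only lm per row; each confint() call yields (lower, upper)
ci <- t(apply(r.mat, 1, function(row) confint(lm(row ~ 1))))
colnames(ci) <- c("lower", "upper")

# A simulation "fails" when the interval misses the true mean 0
fails <- ci[, "lower"] > 0 | ci[, "upper"] < 0
mean(fails)   # should be close to the nominal 5%
```

For an intercept-only model this interval is the usual t interval, mean ± qt(0.975, a - 1) * sd / sqrt(a), so the lm call could be replaced by that formula if speed matters.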

How to calculate the volume under a surface defined by discrete data?

I need to determine the volume beneath a series of surfaces represented by discrete data points. In my data, each sample is stored as a separate data frame within a list of data frames. Here is some (small) example data:
df1 <- data.frame(x=c(2,2,2,3,3,3,4,4,4,5,5,5,6,6,6),
y=c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3),
z=c(0,2,0,4,6,7,3,2,1,2,7,8,9,4,2))
df2 <- data.frame(x=c(2,2,2,3,3,3,4,4,4,5,5,5,6,6,6),
y=c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3),
z=c(1,1,2,3,5,6,2,1,3,3,8,9,8,3,1))
DF <- list(df1,df2)
Answers to similar questions are either in other languages (matlab, python), or the answers do not contain useable script to address the problem (as here). I can think of two acceptable ways to estimate the volume beneath each surface: 1) write out a discretized version of simpson's rule as a function in R that is applied across the list of data frames (DF); 2) calculate an arbitrary relationship between x, y, and z and use multivariate numerical integration to find the volume under the surface (with functions like simpson2d / quad2d in the package pracma or adaptIntegrate in cubature).
Regarding the first approach, the formula for the composite simpson's rule (that I would like to use) is here, but due to its complexity, I have been unsuccessful in writing a working double summation function. In this expression, I(lambda(em) lambda(ex)) is equal to z in the above datasets at each x,y grid point, and Delta(em) and Delta(ex) represent the interval between x and y points.
The second approach would essentially extend the approach found here to multivariate spline fits and then pass the predicted z values as a function for integration. Here's what I have tried so far for this approach:
require(pracma)
df1.loess <- loess(z ~ x + y, data=DF[[1]])
mod.fun <- function(x,y) predict(df1.loess, newdata=x,y)
simpson2d(mod.fun, x=c(2,6), y=c(1,3))
But this does not yield useful results.
In reality, I have a list of almost 100 data frames for individual samples, so I really need to be able to express the solution as a series of lapply functions that automate these calculations across all data frames in the list. An example looks something like this:
require(akima)
DF.splines <- lapply(DF, function(x,y,z) interp(x = "x", y = "y", z = "z",
linear=F, nx=4, ny=2))
Unfortunately, this produces an exception for missing values and Infs. I'm extremely open to any suggestions for how to successfully implement one of these strategies, or to utilize a different (simpler?) approach. Could a kriging function (like km in the DiceKriging package) produce a better fit that could be passed on for numerical integration?
I am assuming that the surface mesh is defined by connecting points via straight lines. Then you can find the volume beneath that surface via
1. triangular tessellation of the (x,y) grid into triangles T_i with area A_i
2. finding the corresponding z values Z_i for each of the triangles T_i
3. calculating the volume V_i of the truncated prisms (defined by T_i and Z_i) via V_i=A_i*sum(Z_i)/3 (see https://en.wikipedia.org/wiki/Prism_(geometry) and https://math.stackexchange.com/questions/2371139/volume-of-truncated-prism)
4. summing up all truncated prism volumes V_i
Keep in mind, however, that the volume does depend on your tessellation and that the tessellation is not unique. But your problem is not fully defined in the sense that it does not describe how one should interpolate between points. So any approach to calculate a volume will have to make additional assumptions.
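To make the prism formula concrete, here is a base-R sketch for the special case of a complete rectangular grid. It makes one particular tessellation choice as an assumption: every grid cell is split into two triangles along the diagonal from its lower-left to its upper-right corner. The function name prism_volume_grid is hypothetical, and a different tessellation can give a different volume.

```r
prism_volume_grid <- function(df) {
  xs <- sort(unique(df$x))
  ys <- sort(unique(df$y))
  # z[i, j] is the height at (xs[i], ys[j]); assumes a complete grid
  z <- matrix(NA_real_, length(xs), length(ys))
  z[cbind(match(df$x, xs), match(df$y, ys))] <- df$z

  total <- 0
  for (i in seq_len(length(xs) - 1)) {
    for (j in seq_len(length(ys) - 1)) {
      A <- (xs[i + 1] - xs[i]) * (ys[j + 1] - ys[j]) / 2   # triangle area
      # Two truncated prisms per cell: V = A * sum(Z) / 3
      lower <- z[i, j] + z[i + 1, j] + z[i + 1, j + 1]
      upper <- z[i, j] + z[i, j + 1] + z[i + 1, j + 1]
      total <- total + A / 3 * (lower + upper)
    }
  }
  total
}

# Flat surface of height 1 over [2, 6] x [1, 3]: the volume is the area, 8
flat <- expand.grid(x = 2:6, y = 1:3)
flat$z <- 1
prism_volume_grid(flat)
```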
Going back to my solution approach, points 1 and 2 can be achieved via the geometry package.
Here is some code:
library(geometry)
getVolume=function(df) {
#find triangular tessellation of (x,y) grid
res=delaunayn(as.matrix(df[,-3]),full=TRUE,options="Qz")
#calculates sum of truncated prism volumes
sum(mapply(function(triPoints,A) A/3*sum(df[triPoints,"z"]),
split.data.frame(res$tri,seq_along(res$areas)),
res$areas))
}
sapply(DF,getVolume)
#[1] 32.50000 30.33333
Since it's hard to check whether the results are consistent, here is a simple example where we know the right answer. It's a cube with side length 2 where we have cut out a wedge along the x axis. The cut-out region is 1/4 of the total volume.
cutOutCube=expand.grid(c(0,1,2),c(0,1,2))
colnames(cutOutCube)=c("x","y")
cutOutCube$z=ifelse(cutOutCube$x==1,1,2)
sapply(list(cutOutCube),getVolume)
#[1] 6
That's correct since 2^3*(1-1/4)=6.
Another sanity check can be performed by calculating the "complement" of the volume with respect to a simple cuboid where all z values are set to the max z value (in your case max(z)=9 in both cases). The simple cuboid volumes are 72 for both of your cases. Now let's define the complement surfaces and sum up volume and complement volume:
df1c=df1
df1c$z=max(df1c$z)-df1c$z
df2c=df2
df2c$z=max(df2c$z)-df2c$z
DFc=list(df1c,df2c)
sapply(DFc,getVolume)+sapply(DF,getVolume)
#[1] 72 72
So volume and complement volume give the right simple cuboid volume in both cases.
You could approximate the surface through a "barycentric Lagrangian" approach as implemented in function barylag2d in the pracma package. Then, to avoid any vectorization problems, apply the Gaussian quadrature rules explicitly.
library(pracma)
df1 <- data.frame(x=c(2,2,2,3,3,3,4,4,4,5,5,5,6,6,6),
y=c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3),
z=c(0,2,0,4,6,7,3,2,1,2,7,8,9,4,2))
# Define the nodes in x- and y-direction
xn <- df1$x[c(1,4,7,10,13)]
yn <- df1$y[1:3]
# Define the matrix representing the function
m1 <- matrix(df1$z, nrow=5, byrow=TRUE)
f <- function(x, y)
c(pracma::barylag2d(m1, xn, yn, x, y))
# 32 nodes in integration intervals
n <- 32
xa <- 2; xb <- 6; ya <- 1; yb <- 3
# Apply quadrature rules explicitly
cx <- gaussLegendre(n, xa, xb)
x <- cx$x; wx <- cx$w
cy <- gaussLegendre(n, ya, yb)
y <- cy$x; wy <- cy$w
# Sum weights * values over all nodes
I <- 0
for (i in 1:n) {
for (j in 1:n) {
I <- I + wx[i] * wy[j] * f(x[i], y[j])
}
}
I # 40.37037
The integral value of 40 seems reasonable given the data. simpson2d or quad2d will not work in this setting.
You may check whether adaptIntegrate works with the function f defined this way.

Random Walks and Gaussian (Normal) Distribution in R

I'm very new to coding in R (and to coding in general). I've created a distribution using a random walk with the following code:
set.seed(124)
norm <- rnorm(1000)
mean(norm)
mean(norm)^2
sd(norm)
d <- density(norm)
plot(d)
Now I want to create a function of n-steps using the above distribution. The function calculates the expected values based on the probability of moving n-steps to the left or right from the center. I have no idea where to begin.
Any direction would be greatly appreciated.
Thanks
If each normally distributed variate is your step size (positive moves right and negative moves left), then the cumulative sum of your random draws represents your current position. You can compute that with the cumsum function in R:
set.seed(144)
pos <- cumsum(rnorm(1000))
plot(seq_along(pos), pos, xlab="Step Number", ylab="Current Position")
Using replicate and logical operations, you can simulate any number of different questions about random walks. For instance "with what probability does the value of the random walk exceed 100 within the first 1000 steps" could be simulated with:
set.seed(144)
exceed.100 <- replicate(100000, any(cumsum(rnorm(1000)) >= 100))
mean(exceed.100)
# [1] 0.00173
From these 100k replicates, it looks like the probability is around 0.17% that the random walk will exceed 100 during the first 1000 steps.
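The same building blocks can be wrapped into a function of the number of steps, which is one possible reading of what the question asks for. The sketch below assumes standard normal steps and estimates the mean and variance of the position after n steps by simulation; the name walk_summary and the default of 2000 replicates are hypothetical choices.

```r
walk_summary <- function(n, reps = 2000) {
  # One column per simulated walk, one row per step
  steps <- matrix(rnorm(n * reps), nrow = n)
  pos   <- apply(steps, 2, cumsum)
  final <- pos[n, ]                 # position after n steps
  c(mean = mean(final), var = var(final))
}

set.seed(144)
s <- walk_summary(100)
s
```

For independent standard normal steps the position after n steps has expected value 0 and variance n, so the estimates for n = 100 should land near 0 and 100.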

Resources