R : how to use variables for vector indices? - r

I'm new user of R, and trying to generate a k-moving average graph with sine function which involves random number(in range [-0.5,+0.5]) noise.
So what I have to do is calculate a mean of consecutive (2*k+1) elements in noised-sine vector but however, the code with "HELP" below, it's not working as I expected... :(
The code seems to calculate the mean of 1 through (i-k)th element.
What's wrong with it? Help please!
set.seed(1)
x = seq(0,2*pi,pi/50)
sin_graph <- sin(x)
noise <- runif(101, -0.5, 0.5)
sin_noise <- sin_graph + noise
plot(x,sin_noise, ylim=c(-2,2))
lines(x,sin_graph, col="red")
k<-1
MA<-0
while (k<=1){
i <- k+1
MA_vector <- rep(NA, times=101)
while (i<=101-k){
MA_vector[i] <- mean(sin_noise[i-k:i+k]) #HELP!
i <- i+1
}
print(MA_vector)
plot(x, MA_vector, ylim=c(-2,2))
lines(x,sin_graph, col="red")
k<-k+1
}

As it stands, it's substracting a vector of k:i from i and then adding k. : takes precedent over mathematical operators. By using brackets (see code below), it evaluates i-k and i+k and creates a vector with min and max as results of the evaluations. I get another smooth function.
MA_vector[i] <- mean(sin_noise[(i-k):(i+k)])

Related

Plotting an 'n' sized vector between a given function with given interval in R

Let me make my question clear because I don't know how to ask it properly (therefore I don't know if it was answered already or not), I will go through my whole problem:
There is a given function (which is the right side of an explicit first order differential equation if it matters):
f = function(t,y){
-2*y+3*t
}
Then there's a given interval from 'a' to 'b', this is the range the function is calculated in with 'n' steps, so the step size in the interval (dt) is:
dt=abs(a-b)/n
In this case 'a' is always 0 and 'b' is always positive, so 'b' is always greater than 'a' but I tried to be generic.
The initial condition:
yt0=y0
The calculation that determines the vector:
yt=vector("numeric",n)
for (i in 1:(n-1))
{
yt[1]=f(0,yt0)*dt+yt0
yt[i+1]=(f(dt*i,yt[i]))*dt+yt[i]
}
The created vector is 'n' long, but this is an approximate solution to the differential equation between the interval ranging from 'a' to 'b'. And here comes my problem:
When I try plotting it alongside the exact solution (using deSolve), it is not accurate. The values of the vector are accurate, but it does not know that these values belong to an approximate function that's between the interval range 'a' to 'b' .
That's why the graphs of the exact and approximate solution are not matching at all. I feel pretty burnt out, so I might not describe my issue properly, but is there a solution to this? To make it realise that its values are between 'a' and 'b' on the 'x' axis and not between '1' and 'n'?
I thank you all for the answers in advance!
The deSolve lines I used (regarding 'b' is greater than 'a'):
df = function(t, y, params) list(-2*y+3*t)
t = seq(a, b, length.out = n)
ddf = as.data.frame(ode(yt0, t, df, parms=NULL))
I tried to reconstruct the comparison between an "approximate" solution using a loop (that is in fact the Euler method), and a solution with package deSolve. It uses the lsoda solver by default that is more precise than Euler'S method, but it is of course also an approximation (default relative and absolute tolerance set to 1e-6).
As the question missed some concrete values and the plot functions, it was not clear where the original problem was, but the following example may help to re-formulate the question. I assume that the problem may be confusion between t (absolute time) and dt between the two approaches. Compare the lines marked as "original code" with the "suggestion":
library(deSolve)
f = function(t, y){
-2 * y + 3 * t
}
## some values
y0 <- 0.1
a <- 3
b <- 5
n <- 100
## Euler method using a loop
dt <- abs(a-b)/n
yt <- vector("numeric", n)
yt[1] <- f(0, y0) * dt + y0 # written before the loop
for (i in 1:(n-1)) {
#yt[i+1] = (f( dt * i, yt[i])) * dt + yt[i] # original code
yt[i+1] <- (f(a + dt * i, yt[i])) * dt + yt[i] # suggestion
}
## Lsoda integration wit package deSolve
df <- function(t, y, params) list(-2*y + 3*t)
t <- seq(a, b, length.out = n)
ddf = as.data.frame(ode(y0, t, df, parms=NULL))
## Plot of both solutions
plot(ddf, type="l", lwd=5, col="orange", ylab="y", las=1)
lines(t, yt, lwd=2, lty="dashed", col="blue")
legend("topleft", c("deSolve", "for loop"),
lty=c("solid", "dashed"), lwd=c(5, 2), col=c("orange", "blue"))

R Function to Find Derivative of Every Point in Time Series

I have a smoothed time series and want to find the instantaneous velocity of the function at any point along the line.
What I want to do is take a series of values: ex(1,6,5,4,3,5,6,7,1)
and return the derivative of each relative to the function of the entire series, such that at every point in time, I know what direction the line is trending.
I am new to R, but know there must be a way.
Any tips?
Ex:
library(smoother)
data(BJsales)
m <- data.frame(BJsales)
x.smth <- as.data.frame(smth.gaussian(m$BJsales,tails=TRUE,alpha = 5))
x.smth.ts <- cbind(seq(1:nrow(m)),x.smth)
colnames(x.smth.ts) <- c("x","y")
x.smth.ts
plot(x.smth.ts$y~x.smth.ts$x)
Desired output:
df with 2 columns: x, deriv.of.y
Edit: Final Result thanks to G5W
TS with Color by Derivative
Your proposed example using the BJSales data is decidedly not differentiable,
so instead I will show the derivative of a much smoother function. If your real data is smooth, this should work for you.
The simplest way to approximate the derivative is simply to use finite differences.
f'(x) ≈ (f(x+h) - f(x))/h
## Smooth sample function
x = seq(0,10,0.1)
y = x/2 + sin(x)
plot(x,y, pch=20)
## Simplest - first difference
d1 = diff(y)/diff(x)
d1 = c(d1[1],d1)
Let's use it to plot a tangent line as an error check. I picked a place to draw the tangent line arbitrarily: the 18th point, x=1.7
plot(x,y, type="l")
abline(y[18]-x[18]*d1[18], d1[18])
To get the data.frame that you requested, you just need
Derivative = data.frame(x, d1)

colorRampPalette in R more than 2 clusters

I'd want to color my points with different colors in a fuzzy way, according to a probabilistic function associated to the points.
I've managed for 2 cases. First I'm building my dataset and a probabilities associated given 2 clusters.
set.seed(16)
rbPal <- colorRampPalette(c('yellow','red'))
(mu1<-c(0,0)) # vector mean multinom 1
(S1<-matrix(c(0.1,0,0,0.6),2)) # var/cov matrix multinom 1
(mu2<-c(3,0)) # vector mean multino 2
(S2<- matrix(c(1,0,0,0.1),2)) # var/cov matrix multinom 2
x1<-mvrnorm(n=100, mu=mu1,Sigma=S1 )
x2<-mvrnorm(n=100, mu=mu2,Sigma=S2 )
x<-rbind(x1,x2) # Dataset
euc.dist<-function (a,b){
sqrt(sum((a-b)^2))
}
randC<-x[sample(nrow(x),2),]
Distmatrix<- t(apply(x,1,function(r) apply(randC,1, function(s) euc.dist(r,
s))))
mat<-matrix(,200,2)
mat<-apply(mat,2,function(x) x=apply(Distmatrix,1, prod))/Distmatrix
P<-t(apply(mat, 1, function(x) x/sum(x)))
D4<-data.frame(x,P)
D4$Col <- rbPal(10)[as.numeric(cut(D4$X1.1,breaks = 10))]
plot(D4$X1,D4$X2,pch = 20,col = D4$Col, cex=1.2)
points(randC, col="red")
That's what I get imagining 2 points as centroid of a cluster.
What if I wanted to do the same color job imagining more than 2 clusters?
So I should have:
[...]
set.seed(50)
rbPal <- colorRampPalette(c('yellow','red',"green"))
mat<-matrix(,200,3)
randC<-x[sample(nrow(x),3),]
Distmatrix<- t(apply(x,1,function(r) apply(randC,1, function(s) euc.dist(r,
s))))
mat<-apply(mat,2,function(x) x=apply(Distmatrix,1, prod))/Distmatrix
P<-t(apply(mat, 1, function(x) x/sum(x)))
D4<-data.frame(x,P)
D4$Col <- rbPal(10)[as.numeric(cut(D4$X1.1,breaks = 10))]
plot(D4$X1,D4$X2,pch = 20,col = D4$Col, cex=1.2)
points(randC, col="red")
That's wrong, cause I want that each centroid has the maximum value for a color and then shade according to the distance depending on which cluster.
You may need to do the mixing function yourself.
If you have more than two clusters, a linear color space is not enough anymore.
The easiest choice is a linear mixing in each component. Straight forward to implement. For more advanced cases, you may want "balanced" points (where all distances are equal) to be gray, and not the average color.
As an ad-hoc solution, you could also set up palettes for each cluster, from gray to the clusters color. Then use (x_j-x_i)/x_j of the ith palette as value, where x_i is the smallest, and x_j the second smallest value. If x_i=x_j, the value will be 0 (gray). If x_i=0, the value will be 1.
This is probably quite pretty, but can be misleading because it doesn't use the same scaling everywhere.
I think I found a good solution, here's the cose:
set.seed(50)
mat<-matrix(,200,3)
randC<-x[sample(nrow(x),3),]
Distmatrix<- t(apply(x,1,function(r) apply(randC,1, function(s) euc.dist(r,
s))))
mat<-apply(mat,2,function(x) x=apply(Distmatrix,1, prod))/Distmatrix
P<-t(apply(mat, 1, function(x) x/sum(x)))
D4<-data.frame(x,P)
rbPal<-list()
for(i in 1:k){
rbPal[[i]] <- colorRampPalette(c('white',col=I(i+1)))
}
for(i in 1:k){
D4[[dim(D4)[2]+1]] <- rbPal[[i]](10)[as.numeric(cut(D4[[2+i]],breaks = 10))]
}
for(i in 1:k){
D4[[dim(D4)[2]+1]]<-t(col2rgb(D4[[dim(D4)[2]-k+1]]))
}
prova<-matrix(0,dim(D4)[1],3)
for(i in 1:k){
prova<-prova+D4[,(dim(D4)[2]-k+i)]*P[,i]
}
prova[is.nan(prova)] <- 0
provcol=apply(prova,1, function(x) rgb(x[1], x[2], x[3], maxColorValue=255))
plot(D4$X1,D4$X2,pch = 20,col = provcol, cex=1.5)
points(randC, col="red")
I basically created k different color palette each of them starting from white, which is the color in common of everybody. Then, according to probabilities, I mixed the rgb values of the k cluster probabilities with a weighted mixing.

Identify all local extrema of a fitted smoothing spline via R function 'smooth.spline'

I have a 2-dimensional data set.
I use the R's smooth.spline function to smooth my points graph following an example in this article:
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.smooth.spline.html
So that I get the spline graph similar to the green line on this picture
I'd like to know the X values, where the first derivative of the smoothing spline equals zero (to determine exact minimum or maximum).
My problem is that my initial dataset (or a dataset that I could auto-generate) to feed into the predict() function does not contain such exact X values that correspond to the smoothing spline extrema.
How can I find such X values?
Here is the picture of the first derivative of the green spline line above
But exact X coordinate of extremums are still not exact.
My approximate R script to generate the pictures looks like the following
sp1 <- smooth.spline(df)
pred.prime <- predict(sp1, deriv=1)
pred.second <- predict(sp1, deriv=2)
d1 <- data.frame(pred.prime)
d2 <- data.frame(pred.second)
dfMinimums <- d1[abs(d1$y) < 1e-4, c('x','y')]
I think that there are two problems here.
You are using the original x-values and they are spaced too far apart AND
Because of the wide spacing of the x's, your threshold for where you consider the derivative "close enough" to zero is too high.
Here is basically your code but with many more x values and requiring smaller derivatives. Since you do not provide any data, I made a coarse approximation to it that should suffice for illustration.
## Coarse approximation of your data
x = runif(300, 0,45000)
y = sin(x/5000) + sin(x/950)/4 + rnorm(300, 0,0.05)
df = data.frame(x,y)
sp1 <- smooth.spline(df)
Spline code
Sx = seq(0,45000,10)
pred.spline <- predict(sp1, Sx)
d0 <- data.frame(pred.spline)
pred.prime <- predict(sp1, Sx, deriv=1)
d1 <- data.frame(pred.prime)
Mins = which(abs(d1$y) < mean(abs(d1$y))/150)
plot(df, pch=20, col="navy")
lines(sp1, col="darkgreen")
points(d0[Mins,], pch=20, col="red")
The extrema look pretty good.
plot(d1, type="l")
points(d1[Mins,], pch=20, col="red")
The points identified look like zeros of the derivative.
You can use my R package SplinesUtils: https://github.com/ZheyuanLi/SplinesUtils, which can be installed by
devtools::install_github("ZheyuanLi/SplinesUtils")
The function to be used are SmoothSplinesAsPiecePoly and solve. I will just use the example under the documentation.
library(SplinesUtils)
## a toy dataset
set.seed(0)
x <- 1:100 + runif(100, -0.1, 0.1)
y <- poly(x, 9) %*% rnorm(9)
y <- y + rnorm(length(y), 0, 0.2 * sd(y))
## fit a smoothing spline
sm <- smooth.spline(x, y)
## coerce "smooth.spline" object to "PiecePoly" object
oo <- SmoothSplineAsPiecePoly(sm)
## plot the spline
plot(oo)
## find all stationary / saddle points
xs <- solve(oo, deriv = 1)
#[1] 3.791103 15.957159 21.918534 23.034192 25.958486 39.799999 58.627431
#[8] 74.583000 87.049227 96.544430
## predict the "PiecePoly" at stationary / saddle points
ys <- predict(oo, xs)
#[1] -0.92224176 0.38751847 0.09951236 0.10764884 0.05960727 0.52068566
#[7] -0.51029209 0.15989592 -0.36464409 0.63471723
points(xs, ys, pch = 19)
One caveat in the #G5W implementation that I found is that it sometimes returns multiple records close around extrema instead of a single one. On the diagram they cannot be seen, since they all fall into one point effectively.
The following snippet from here filters out single extrema points with the minimum value of the first derivative:
library(tidyverse)
df2 <- df %>%
group_by(round(y, 4)) %>%
filter(abs(d1) == min(abs(d1))) %>%
ungroup() %>%
select(-5)

Detecting dips in a 2D plot

I need to automatically detect dips in a 2D plot, like the regions marked with red circles in the figure below. I'm only interested in the "main" dips, meaning the dips have to span a minimum length in the x axis. The number of dips is unknown, i.e., different plots will contain different numbers of dips. Any ideas?
Update:
As requested, here's the sample data, together with an attempt to smooth it using median filtering, as suggested by vines.
Looks like I need now a robust way to approximate the derivative at each point that would ignore the little blips that remain in the data. Is there any standard approach?
y <- c(0.9943,0.9917,0.9879,0.9831,0.9553,0.9316,0.9208,0.9119,0.8857,0.7951,0.7605,0.8074,0.7342,0.6374,0.6035,0.5331,0.4781,0.4825,0.4825,0.4879,0.5374,0.4600,0.3668,0.3456,0.4282,0.3578,0.3630,0.3399,0.3578,0.4116,0.3762,0.3668,0.4420,0.4749,0.4556,0.4458,0.5084,0.5043,0.5043,0.5331,0.4781,0.5623,0.6604,0.5900,0.5084,0.5802,0.5802,0.6174,0.6124,0.6374,0.6827,0.6906,0.7034,0.7418,0.7817,0.8311,0.8001,0.7912,0.7912,0.7540,0.7951,0.7817,0.7644,0.7912,0.8311,0.8311,0.7912,0.7688,0.7418,0.7232,0.7147,0.6906,0.6715,0.6681,0.6374,0.6516,0.6650,0.6604,0.6124,0.6334,0.6374,0.5514,0.5514,0.5412,0.5514,0.5374,0.5473,0.4825,0.5084,0.5126,0.5229,0.5126,0.5043,0.4379,0.4781,0.4600,0.4781,0.3806,0.4078,0.3096,0.3263,0.3399,0.3184,0.2820,0.2167,0.2122,0.2080,0.2558,0.2255,0.1921,0.1766,0.1732,0.1205,0.1732,0.0723,0.0701,0.0405,0.0643,0.0771,0.1018,0.0587,0.0884,0.0884,0.1240,0.1088,0.0554,0.0607,0.0441,0.0387,0.0490,0.0478,0.0231,0.0414,0.0297,0.0701,0.0502,0.0567,0.0405,0.0363,0.0464,0.0701,0.0832,0.0991,0.1322,0.1998,0.3146,0.3146,0.3184,0.3578,0.3311,0.3184,0.4203,0.3578,0.3578,0.3578,0.4282,0.5084,0.5802,0.5667,0.5473,0.5514,0.5331,0.4749,0.4037,0.4116,0.4203,0.3184,0.4037,0.4037,0.4282,0.4513,0.4749,0.4116,0.4825,0.4918,0.4879,0.4918,0.4825,0.4245,0.4333,0.4651,0.4879,0.5412,0.5802,0.5126,0.4458,0.5374,0.4600,0.4600,0.4600,0.4600,0.3992,0.4879,0.4282,0.4333,0.3668,0.3005,0.3096,0.3847,0.3939,0.3630,0.3359,0.2292,0.2292,0.2748,0.3399,0.2963,0.2963,0.2385,0.2531,0.1805,0.2531,0.2786,0.3456,0.3399,0.3491,0.4037,0.3885,0.3806,0.2748,0.2700,0.2657,0.2963,0.2865,0.2167,0.2080,0.1844,0.2041,0.1602,0.1416,0.2041,0.1958,0.1018,0.0744,0.0677,0.0909,0.0789,0.0723,0.0660,0.1322,0.1532,0.1060,0.1018,0.1060,0.1150,0.0789,0.1266,0.0965,0.1732,0.1766,0.1766,0.1805,0.2820,0.3096,0.2602,0.2080,0.2333,0.2385,0.2385,0.2432,0.1602,0.2122,0.2385,0.2333,0.2558,0.2432,0.2292,0.2209,0.2483,0.2531,0.2432,0.2432,0.2432,0.2432,0.3053,0.3630,0.3578,0.3630,0.3668,0.3263,0.3992,0.4037,0.4556,0.4703,0.5173,0.6219,0.6412,0.7275,0.6984,0.6756,0.7079,0.7192,0.7342,0.7458,0.7501,0.7540,0.7605,0.7605,0.7342,0.7912,0.7951,0.8036,0.8074,0.8074,0.8118,0.7951,0.8118,0.8242,0.8488,0.8650,0.8488,0.8311,0.8424,0.7912,0.7951,0.8001,0.8001,0.7458,0.7192,0.6984,0.6412,0.6516,0.5900,0.5802,0.5802,0.5762,0.5623,0.5374,0.4556,0.4556,0.4333,0.3762,0.3456,0.4037,0.3311,0.3263,0.3311,0.3717,0.3762,0.3717,0.3668,0.3491,0.4203,0.4037,0.4149,0.4037,0.3992,0.4078,0.4651,0.4967,0.5229,0.5802,0.5802,0.5846,0.6293,0.6412,0.6374,0.6604,0.7317,0.7034,0.7573,0.7573,0.7573,0.7772,0.7605,0.8036,0.7951,0.7817,0.7869,0.7724,0.7869,0.7869,0.7951,0.7644,0.7912,0.7275,0.7342,0.7275,0.6984,0.7342,0.7605,0.7418,0.7418,0.7275,0.7573,0.7724,0.8118,0.8521,0.8823,0.8984,0.9119,0.9316,0.9512)
yy <- runmed(y, 41)
plot(y, type="l", ylim=c(0,1), ylab="", xlab="", lwd=0.5)
points(yy, col="blue", type="l", lwd=2)
EDITED : function strips the regions to contain nothing but the lowest part, if wanted.
Actually, Using the mean is easier than using the median. This allows you to find regions where the real values are continuously below the mean. The median is not smooth enough for an easy application.
One example function to do this would be :
FindLowRegion <- function(x,n=length(x)/4,tol=length(x)/20,p=0.5){
nx <- length(x)
n <- 2*(n %/% 2) + 1
# smooth out based on means
sx <- rowMeans(embed(c(rep(NA,n/2),x,rep(NA,n/2)),n),na.rm=T)
# find which series are far from the mean
rlesx <- rle((sx-x)>0)
# construct start and end of regions
int <- embed(cumsum(c(1,rlesx$lengths)),2)
# which regions fulfill requirements
id <- rlesx$value & rlesx$length > tol
# Cut regions to be in general smaller than median
regions <-
apply(int[id,],1,function(i){
i <- min(i):max(i)
tmp <- x[i]
id <- which(tmp < quantile(tmp,p))
id <- min(id):max(id)
i[id]
})
# return
unlist(regions)
}
where
n determines how much values are used to calculate the running mean,
tol determines how many consecutive values should be lower than the running mean to talk about a low region, and
p determines the cutoff used (as a quantile) for stripping the regions to their lowest part. When p=1, the complete lower region is shown.
Function is tweaked to work on data as you presented, but the numbers might need to be adjusted a bit to work with other data.
This function returns a set of indices, which allows you to find the low regions. Illustrated with your y vector :
Lows <- FindLowRegion(y)
newx <- seq_along(y)
newy <- ifelse(newx %in% Lows,y,NA)
plot(y, col="blue", type="l", lwd=2)
lines(newx,newy,col="red",lwd="3")
Gives :
You have to smooth the graph in some way. Median filtration is quite useful for that purpose (see http://en.wikipedia.org/wiki/Median_filter). After smoothing, you will simply have to search for the minima, just as usual (i.e. search for the points where the 1st derivative switches from negative to positive).
A simpler answer (which also does not require smoothing) could be provided by adapting the maxdrawdown() function from the tseries. A drawdown is commonly defined as the retreat from the most-recent maximum; here we want the opposite. Such a function could then be used in a sliding window over the data, or over segmented data.
maxdrawdown <- function(x) {
if(NCOL(x) > 1)
stop("x is not a vector or univariate time series")
if(any(is.na(x)))
stop("NAs in x")
cmaxx <- cummax(x)-x
mdd <- max(cmaxx)
to <- which(mdd == cmaxx)
from <- double(NROW(to))
for (i in 1:NROW(to))
from[i] <- max(which(cmaxx[1:to[i]] == 0))
return(list(maxdrawdown = mdd, from = from, to = to))
}
So instead of using cummax(), one would have to switch to cummin() etc.
My first thought was something much cruder than filtering. Why not look for the big drops followed by long enough stable periods?
span.b <- 20
threshold.b <- 0.2
dy.b <- c(rep(NA, span.b), diff(y, lag = span.b))
span.f <- 10
threshold.f <- 0.05
dy.f <- c(diff(y, lag = span.f), rep(NA, span.f))
down <- which(dy.b < -1 * threshold.b & abs(dy.f) < threshold.f)
abline(v = down)
The plot shows that it's not perfect, but it doesn't discard the outliers (I guess it depends on your take on the data).

Resources