Lagged exponential moving average of a vector - r

Given a simple vector of 82 observations:
x = c(102, 104, 89, 89, 76, 95, 88, 112, 81, 101, 101, 104, 94, 111, 108, 104, 93, 92, 86, 113, 93, 100, 92, 80, 92, 126, 102, 109, 104, 95, 84, 81, 103, 83, 103, 83, 58, 109, 89, 93, 104, 104, 123, 104, 93, 76, 103, 103, 100, 105, 108, 90, 122, 103, 114, 102, 87, 98, 88, 107, 102, 80, 81, 96, 107, 105, 113, 98, 93, 104, 94, 107, 107, 97, 102, 82, 90, 97, 124, 109, 96, 92)
I would like to compute the EMA (Exponential Moving Average) of this vector in such a way that:
The 1st element of the new vector should be NA
The 2nd element should be the first element of the original vector
The 3rd element should be the EMA of the first and second elements of the original vector
The 4th element should be the EMA of the first three elements of the original vector
...
The 82nd element should be the EMA of all the values of the original vector except for the last one
The idea is to place greater weight on the most recent vector elements, and also that the last elements of the new vector are affected (although infinitesimally) by the first elements of the original vector.
I tried achieving this using the function EMA from the package TTR and lag from dplyr
> library(dplyr)
> library(TTR)
> lag(EMA(x, 1, ratio = 2/(81+1)))
[1] NA 102.00000 102.04878 101.73052 101.42002 100.80002 100.65855 100.34981 100.63396 100.15508 100.17569
[12] 100.19579 100.28858 100.13520 100.40020 100.58556 100.66884 100.48179 100.27492 99.92675 100.24561 100.06889
[23] 100.06721 99.87045 99.38580 99.20566 99.85918 99.91139 100.13307 100.22738 100.09989 99.70721 99.25093
[34] 99.34237 98.94378 99.04271 98.65143 97.65993 97.93651 97.71855 97.60346 97.75948 97.91168 98.52360
[45] 98.65717 98.51919 97.96994 98.09262 98.21231 98.25592 98.42041 98.65405 98.44298 99.01754 99.11468
[56] 99.47773 99.53925 99.23342 99.20333 98.93008 99.12691 99.19698 98.72876 98.29635 98.24035 98.45400
[67] 98.61365 98.96454 98.94102 98.79611 98.92304 98.80296 99.00289 99.19794 99.14433 99.21398 98.79413
[78] 98.57964 98.54111 99.16206 99.40201 99.31903
But that's definitely not the result I was looking for... What am I doing wrong?
I wasn't able to find any comprehensive documentation about the ratio argument on the internet, and I'm not sure I've got this clear.
Can anyone please help me?
To make things clearer, the result I have reached so far is the following:
> library(runner)
> mean_run(x, k = 7, lag = 1)
[1] NA 102.00000 103.00000 98.33333 96.00000 92.00000 92.50000 91.85714 93.28571 90.00000 91.71429
[12] 93.42857 97.42857 97.28571 100.57143 100.00000 103.28571 102.14286 100.85714 98.28571 101.00000 98.42857
[23] 97.28571 95.57143 93.71429 93.71429 99.42857 97.85714 100.14286 100.71429 101.14286 101.71429 100.14286
[34] 96.85714 94.14286 93.28571 90.28571 85.00000 88.57143 89.71429 88.28571 91.28571 91.42857 97.14286
[45] 103.71429 101.42857 99.57143 101.00000 100.85714 100.28571 97.71429 98.28571 97.85714 104.42857 104.42857
[56] 106.00000 106.28571 103.71429 102.28571 102.00000 99.85714 99.71429 94.85714 91.85714 93.14286 94.42857
[67] 96.85714 97.71429 97.14286 99.00000 102.28571 102.00000 102.00000 102.28571 100.00000 100.57143 99.00000
[78] 97.00000 97.42857 99.85714 100.14286 100.00000
So this is the Simple Moving Average (SMA) over k = 7 observations that I obtained with the mean_run function from the runner package.
Now I would like to "improve" this moving average by placing exponentially increasing weights on the observations and making sure that the last element is also affected by the first one (the weight for that observation should be as close as possible to 0). That means the window sizes for the rolling average will be:
n=0 for the 1st element (i.e. NA)
n=1 for the 2nd element (i.e the 1st element of the original vector)
n=2 for the 3rd element (i.e. the EMA of the 1st and 2nd elements)
n=3 for the 4th element (i.e. the EMA of the 1st,2nd and 3rd elements)
...
n=81 for the 82nd element (i.e. the EMA of the first 81 elements)
I still wasn't able to find any good documentation about the ratio argument (i.e. alpha); I think it can be set arbitrarily, but I'm not sure about that.
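To make the recursion explicit, what I'm after could be written as a simple loop; a minimal sketch (with alpha = 2/(81+1) chosen arbitrarily, as an assumption):
alpha <- 2/(81 + 1)                  # smoothing factor, chosen arbitrarily
ema_lag <- rep(NA_real_, length(x))  # 1st element stays NA
ema_lag[2] <- x[1]                   # 2nd element = 1st original value
for (i in 3:length(x)) {
  # EMA of the first i-1 original values, updated recursively
  ema_lag[i] <- alpha * x[i - 1] + (1 - alpha) * ema_lag[i - 1]
}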

Assuming that you intended what you wrote, i.e. a lagged exponential moving average and not the lagged weighted moving average defined in a comment, we define the iteration in iter and then use Reduce like this:
alfa <- 2/(81+1)
# one EMA update step: y is the previous EMA value, x is the new observation
iter <- function(y, x) alfa * x + (1-alfa) * y
# accumulate the EMA over x, drop the last value and prepend NA to lag it by one
ema <- c(NA, head(Reduce(iter, tail(x, -1), init = x[1], acc = TRUE), -1))
# check
identical(ema[1], NA_real_)
## [1] TRUE
identical(ema[2], x[1])
## [1] TRUE
identical(ema[3], alfa * x[2] + (1-alfa) * x[1])
## [1] TRUE
identical(ema[4], alfa * x[3] + (1-alfa) * ema[3])
## [1] TRUE
The lagged weighted moving average in the comment is not an exponential moving average and is unlikely to be what you want, but just to show how to implement it: if the second argument to rollapply is a list containing a vector, then that vector is regarded as the offsets to use.
library(zoo)
c(NA, x[1], rollapplyr(x, list(-seq(2)), weighted.mean, c(alfa, 1-alfa)))
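Here list(-seq(2)) specifies the offsets -1 and -2, so each output value is the weighted mean of the two observations immediately preceding it, with weights alfa and 1-alfa.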

Related

Extract clusters of similar continuous numbers from vectors

I am attempting an analysis that requires the extraction of some (2 or 3) consecutive values on which to perform further analysis later.
I have two vectors: a is the output from a machine of consecutive cellular signals, and b is the same output shifted by 1. This notation is used to understand the variability between one signal and the next one.
a <- c(150, 130, 135, 180, 182, 190, 188, 195, 170, 140, 120, 130, 180, 181)
b <- c(130, 135, 180, 182, 190, 188, 195, 170, 140, 120, 130, 180, 181, 130)
What I am trying to do is to identify the most homogeneous (stable) region (i.e. one value is similar to the following) in this set of data.
The idea I had was to perform a subtraction between a and b and consider the absolute value:
c <- abs(a-b)
which gives
c
[1] 20 5 45 2 8 2 7 25 30 20 10 50 1 51
Now, if I want the 3 closest consecutive points, I can clearly see that the sequence 2 8 2 is by far the one that I would consider, but I have no idea on how I can automatically extract these 3 values, especially from arrays of hundreds of data points.
Initial data:
a <- c(150, 130, 135, 180, 182, 190, 188, 195, 170, 140, 120, 130, 180, 181)
b <- c(130, 135, 180, 182, 190, 188, 195, 170, 140, 120, 130, 180, 181, 130)
Find absolute difference between two vectors:
res <- abs(a - b)
For each element in res, get its neighbours and calculate the sum of absolute differences:
# with res[(x-1):(x+1)] we extract x and its neighbors
resSimilarity <- sapply(seq_along(res), function(x) sum(res[(x-1):(x+1)]))
resPosition <- which.min(resSimilarity)
# [1] 5
To extract values from original vectors use:
a[(resPosition - 1):(resPosition + 1)]
# [1] 180 182 190
b[(resPosition - 1):(resPosition + 1)]
# [1] 182 190 188
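One caveat (my note, not part of the original answer): at the ends of the vector the window is truncated, so the first position sums only two values and the last yields NA. If you want to restrict the search to full three-element windows, a small adjustment keeps the same idea:
# only consider positions with a complete window of 3 values
inner <- 2:(length(res) - 1)
resSimilarity <- sapply(inner, function(x) sum(res[(x - 1):(x + 1)]))
resPosition <- inner[which.min(resSimilarity)]
# [1] 5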
Here is one more alternative:
a <- c(150, 130, 135, 180, 182, 190, 188, 195, 170, 140, 120, 130, 180, 181)
b <- c(130, 135, 180, 182, 190, 188, 195, 170, 140, 120, 130, 180, 181, 130)
res <- abs(a-b)
> which.min(diff(c(0, cumsum(res)), lag=3))
[1] 4
> res[(4):(4+2)]
[1] 2 8 2
The above code uses cumsum to get the cumulative sums of your absolute differences. Then it calls diff with lag=3 to get the differences between each element and the element 3 positions away from it. Finally it takes the position where the increase in cumulative sum over successive 3 elements was the smallest.
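To make the equivalence explicit (a small check of my own, not from the original answer), the lag-3 difference of the cumulative sums at position i is just sum(res[i:(i+2)]):
i <- which.min(diff(c(0, cumsum(res)), lag = 3))  # i = 4
sum(res[i:(i + 2)])                               # 12, the smallest 3-element sum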

Plot the ranges of values in R

I am interested in plotting the range of values of variables so that the names appear on the Y-axis and the range on the X-axis, for a better visualization.
I have used the following code:
primer_matrix1a <- matrix(
c(
"EF1", 65, 217,
"EF6", 165, 197,
"EF14", 96, 138,
"EF15", 103, 159,
"EF20", 86, 118,
"G9", 115, 173,
"G25", 112, 140,
"BE22", 131, 135,
"TT20", 180, 190
)
,nrow=9,ncol=3,byrow = T)
# Format data
Primer_name <- primer_matrix1a[,1]
Primer_name <- matrix(c(Primer_name),nrow = 9,byrow = T)
Primer_values<- matrix(c(as.numeric(primer_matrix1a[ ,2-3])),nrow = 9,ncol = 2,byrow = T)
Primer_Frame <- data.frame(Primer_name,Primer_values)
colnames(Primer_Frame) <- c("Primer","min","max")
Primer_Frame$mean<- mean(c(Primer_Frame$min,Primer_Frame$max))
ggplot(Primer_Frame, aes(x=Primer))+
geom_linerange(aes(ymin=min,ymax=max),linetype=2,color="blue")+
geom_point(aes(y=min),size=3,color="red")+
geom_point(aes(y=max),size=3,color="red")+
theme_bw()
but the plot is weird: EF15 goes from 103 to 159 while G9 goes from 115 to 173, yet in the plot they do not overlap, so I am doing something wrong.
It looks like something is getting muddled when you are joining the matrix, but the approach is already more complex than it should be, so you might want to start afresh. It is probably easiest to convert it to a data frame and then format it there, rather than fiddling around with all the matrix functions:
df <- as.data.frame(primer_matrix1a)
names(df)<- c("Primer","min","max")
df$min <- as.numeric(as.character(df$min)) # Converts factor to numeric
df$max <- as.numeric(as.character(df$max))
df$mean <- mean(c(df$min,df$max)) # note: a single overall mean recycled to every row (not used in the plot)
library(ggplot2)
ggplot(df, aes(x=Primer))+
geom_linerange(aes(ymin=min,ymax=max),linetype=2,color="blue")+
geom_point(aes(y=min),size=3,color="red")+
geom_point(aes(y=max),size=3,color="red")+
theme_bw()
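Since you wanted the primer names on the Y-axis and the ranges on the X-axis, one option (an addition of mine, not part of the original answer) is to flip the coordinates:
ggplot(df, aes(x=Primer))+
geom_linerange(aes(ymin=min,ymax=max),linetype=2,color="blue")+
geom_point(aes(y=min),size=3,color="red")+
geom_point(aes(y=max),size=3,color="red")+
coord_flip()+ # puts Primer on the Y-axis and the values on the X-axis
theme_bw()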

I want to create a nested function in R

For example, I have variables
Store.Number <- c("105, 105, 105, 106, 106, 106, 110, 110, 110, 110")
Date <- c("2017-01-04", "2016-07-06", "2016-04-04", "2017-01-31", "2016-10-31",
"2016-05-11", "2017-01-26", "2016-10-28", "2016-07-20", "2016-04-27")
Jan012016 <- c("369",NA,NA.......)
a <- as.Date("01/01/2016", "%m/%d/%Y")
I want to write a function such that first loop should check for
for i=1
Store.Number[i+1,1] = Store.Number [i,1]
If "True"
abs(Date[i,2] - Date[i+1,2])
Else
abs(Date[i,2] - a)
example:
Store.Number[2,1] = Store.Number [1,1] > TRUE
Jan012016 = abs(Date[2,2] - Date[1,2])
Else
Jan012016 = abs(Date[i,2] - a)
The functions most commonly used in R to split the data into subsets and perform calculations are probably tapply and aggregate. In this case, tapply gives output closest to what you want, so an example is:
Store.Number <- c(105, 105, 105, 106, 106, 106, 110, 110, 110, 110)
Date <- as.Date(c("2017-01-04", "2016-07-06", "2016-04-04", "2017-01-31", "2016-10-31",
"2016-05-11", "2017-01-26", "2016-10-28", "2016-07-20", "2016-04-27"), "%Y-%m-%d")
a <- as.Date("01/01/2016", "%m/%d/%Y")
store.diffs <- tapply(Date, Store.Number, function(x)abs(diff(c(a,x))))
Jan012016 <- do.call(c,store.diffs)
names(Jan012016) <- NULL
Jan012016
# Time differences in days
# [1] 369 182 93 396 92 173 391 90 100 84
Note that I've got rid of the quotes in your definition of Store.Number, and converted Date from character to Date.
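If you prefer the result aligned with the original row order rather than concatenated group by group (an alternative sketch of mine, not part of the original answer), ave does the same per-store calculation in place:
# same per-store differences, returned in the original row order (in days)
Jan012016 <- ave(as.numeric(Date), Store.Number,
                 FUN = function(d) abs(diff(c(as.numeric(a), d))))
# should match the tapply result above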

What is the meaning of "out" object for box plot in R?

Suppose this is my data set:
ID<- seq(1:50)
mou<-sample(c(2000, 2500, 440, 4990, 23000, 450, 3412, 4958,745,1000), 50, replace= TRUE)
calls<-sample(c(50, 51, 12, 60, 90, 888, 444, 668, 16, 89, 222,33, 243, 239, 333, 645,23, 50,555), 50, replace= TRUE)
rev<- sample(c(100, 345, 758, 44, 58, 334, 50000, 888, 205, 940,298, 754), 50, replace= TRUE)
dt<- data.frame(mou, calls, rev)
I did the box plot for calls and while analyzing it, I saw the following objects for the boxplot.
x<-boxplot(dt$calls)
names(x)
> names(x)
[1] "stats" "n" "conf" "out" "group" "names"
Looking at the output for x$stats, I figured that the stats object gives me the lower whisker, the lower hinge, the median, the upper hinge and the upper whisker for each group. But I am a little bit confused about what the object "out" really means. Does it signify the outlier values or something else?
The out object for my boxplot gives the following results:
> x$out
[1] 555 10000 555 555 555 555 555 10000
It gives you: "The values of any data points which lie beyond the extremes of the whiskers"
Take a look here for more insight.
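So yes, these are the outlier values in the boxplot sense: points lying beyond the whisker extremes. A small sketch (my addition) showing two equivalent ways to reproduce x$out:
# values beyond the whisker extremes stored in x$stats (rows 1 and 5)
dt$calls[dt$calls < x$stats[1] | dt$calls > x$stats[5]]
# equivalently, via boxplot.stats, the helper that boxplot() uses internally
boxplot.stats(dt$calls)$out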

Find entry that causes low p-value

In R I have 2 vectors
u <- c(109, 77, 57, 158, 60, 63, 42, 20, 139, 15, 64, 18)
v <- c(734, 645, 1001, 1117, 1071, 687, 162, 84, 626, 64, 218, 79)
I want to test H: u and v are independent so I run a chi-square test:
chisq.test( as.data.frame( rbind(u,v) ) )
and get a very low p-value, meaning that I can reject H, i.e. that u and v are not independent.
But when I type
chisq.test(u,v)
I get a p-value of 0.23, which means that I can accept H.
Which one of these two tests should I choose?
Furthermore I want to find the entries in these vectors that causes this low p-value. Any ideas how to do this?
The test statistic uses the sum of squared standardised residuals. You can look at these values to get an idea of the importance of particular values:
m <- chisq.test(u, v)
residuals(m) # Pearson residuals: (observed - expected) / sqrt(expected)
m$stdres     # standardised residuals
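If you keep the contingency-table form of the test (the one that produced the low p-value), the same idea applies; a sketch of mine for locating the most influential cells:
m2 <- chisq.test(rbind(u, v)) # 2 x 12 contingency table, as in the first test
m2$stdres                     # standardised residuals per cell
# position of the largest absolute standardised residual
which(abs(m2$stdres) == max(abs(m2$stdres)), arr.ind = TRUE)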
