Generating sliding window to subset data for prediction task


I want to write a sliding-window function that uses a model trained on years t, t+1, and t+2 to predict the outcome in year t+3. For 10 years of data, the function should create 7 train-test splits and make 7 predictions (for years t+3 through t+9).
I came up with the following code, but the result doesn't look right. Not only do the object lengths differ, but even when I work through the prediction task manually, predict() generates 3 predicted values for a single year's outcome, which doesn't make sense. I would be grateful if someone could point out the source of the error.
# generate the data
set.seed(123)
df <- data.frame(year = 2000:2009, # T = 10
y = c(1, 1, 1, 1, 0, 0, 1, 0, 0, 0),
var1 = runif(10, min=0, max=1),
var2 = runif(10, min=1, max=2))
# store predicted values in a list
pred <- list()
# loop from the 1st year to the T-3 year
for(i in 2000:2007){
df_sub1 <- subset(df, year == c(i, i+1, i+2))
mod <- glm(y~var1+var2, data=df_sub1, family=binomial())
df_sub2 <- subset(df, year == i+3)
pred[[i]] <- predict(mod, data=df_sub2, type = "response")
}
# error message
Error in family$linkfun(mustart) :
Argument mu must be a nonempty numeric vector
In addition: Warning messages:
1: In year == c(i, i + 1, i + 2) :
longer object length is not a multiple of shorter object length
2: In year == c(i, i + 1, i + 2) :
longer object length is not a multiple of shorter object length

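The error and warnings come from using == when the right-hand side has length > 1: the shorter vector is recycled and compared element by element, so some windows select no rows at all. Use %in% instead. Also pass the test rows to predict() as newdata, not data; a data argument is ignored, so predict() returns the fitted values for the 3 training rows.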
pred <- vector('list', 8)
names(pred) <- 2000:2007
for(i in 2000:2007){
df_sub1 <- subset(df, year %in% c(i, i+1, i+2))
mod <- glm(y~var1+var2, data=df_sub1, family=binomial())
df_sub2 <- subset(df, year == (i+3))
pred[[as.character(i)]] <- tryCatch(predict(mod,
newdata=df_sub2, type = "response"), error = function(e) NA_real_)
}
-output
> pred
$`2000`
4
1
$`2001`
5
1
$`2002`
6
1
$`2003`
7
2.220446e-16
$`2004`
8
0.1467543
$`2005`
9
0.001408577
$`2006`
10
2.220446e-16
$`2007`
[1] NA
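To make both fixes concrete, here is a minimal base R sketch. The names window, eq_rows, in_rows and fit are illustrative, and lm() on the built-in mtcars data stands in for the glm() model only to keep the example free of separation warnings.

```r
year <- 2000:2009
window <- c(2001, 2002, 2003)

# `==` recycles `window` along `year` and compares position by position,
# so rows match only where the recycled values happen to line up.
eq_rows <- suppressWarnings(year == window)

# `%in%` asks, for each year, whether it appears anywhere in `window`.
in_rows <- year %in% window

sum(eq_rows)   # 0 rows selected: the training subset is empty
sum(in_rows)   # 3 rows selected

# predict() has no `data` argument; it is silently ignored and the fitted
# values of the training rows come back instead of a prediction for the new row.
fit <- lm(mpg ~ wt, data = mtcars[1:10, ])
n_wrong <- length(predict(fit, data = mtcars[11, ]))     # 10 fitted values
n_right <- length(predict(fit, newdata = mtcars[11, ]))  # 1 prediction
```

An empty training subset is what triggers the "Argument mu must be a nonempty numeric vector" error in glm().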

Here is another way with one of package zoo's functions to apply a function to a rolling window. The function to be applied, roll_pred is almost a copy&paste of akrun's, only the creation of the subsets is different.
# generate the data
set.seed(123)
df <- data.frame(year = 2000:2009, # T = 10
y = c(1, 1, 1, 1, 0, 0, 1, 0, 0, 0),
var1 = runif(10, min=0, max=1),
var2 = runif(10, min=1, max=2))
library(zoo, quietly = TRUE)
#>
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#>
#> as.Date, as.Date.numeric
roll_pred <- function(year, X) {
i <- match(year, X$year)
df_sub1 <- X[i, ]
mod <- glm(y ~ var1 + var2, data = df_sub1, family = binomial())
df_sub2 <- X[ i[length(year)] + 1, ]
tryCatch(predict(mod, newdata = df_sub2, type = "response"),
error = function(e) NA_real_)
}
rollapplyr(df$year, 3, roll_pred, X = df)
#> 4 5 6 7 8 9
#> 1.000000e+00 1.000000e+00 1.000000e+00 2.220446e-16 1.467543e-01 1.408577e-03
#> 10 NA
#> 2.220446e-16 NA
Created on 2022-06-05 by the reprex package (v2.0.1)
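The windowing itself can also be sketched without zoo. This is a minimal base R sketch, assuming the years are consecutive and sorted; the names width, starts and splits are illustrative and do not come from the answers above. It only builds the train/test year sets, and stops at the last window that still has a following test year, which gives the 7 splits the question asks for.

```r
years <- 2000:2009
width <- 3

# Start positions of every window that still has a test year after it.
starts <- seq_len(length(years) - width)

splits <- lapply(starts, function(s) {
  list(train = years[s:(s + width - 1)],  # three consecutive training years
       test  = years[s + width])          # the year right after the window
})

length(splits)       # 7
splits[[1]]$train    # 2000 2001 2002
splits[[7]]$test     # 2009
```

Each element of splits can then be used with subset(df, year %in% train) and subset(df, year == test).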

Related

Mode imputation for categorical variables in a dataframe

I have a data frame (cat_df) that contains only categorical variables. I want to replace the missing values in each variable with that variable's mode.
I tried the following code, but it's not working.
Way -1
cat_df[is.na(cat_df)] <- modefunc(cat_df, na.rm = TRUE)
cat_df
modefunc <- function(x){
tabresult <- tabulate(x)
themode <- which(tabresult == max(tabresult))
if(sum(tabresult == max(tabresult))>1) themode <- NA
return(themode)
}
Error in modefunc(cat_df, na.rm = TRUE) :
unused argument (na.rm = TRUE)
Way -2
cat_df[is.na(cat_df)] <- my_mode(cat_df[!is.na(cat_df)])
cat_df
my_mode <- function(x){
unique_x <- unique(x)
mode <- unique_x[which.max(tabulate(match(x,unique_x)))]
mode
}
The above code is not imputing the mode values.
Is there any other way to impute mode values for categorical variables in a data frame?

Update:
This Mode function is for dataframes:
my_mode <- function (x, na.rm) {
xtab <- table(x)
xmode <- names(which(xtab == max(xtab)))
if (length(xmode) > 1) xmode <- ">1 mode"
return(xmode)
}
for (var in 1:ncol(cat_df)) {
if (class(cat_df[,var])=="numeric") {
cat_df[is.na(cat_df[,var]),var] <- mean(cat_df[,var], na.rm = TRUE)
} else if (class(cat_df[,var]) %in% c("character", "factor")) {
cat_df[is.na(cat_df[,var]),var] <- my_mode(cat_df[,var], na.rm = TRUE)
}
}
This mode function is for vectors
Try this and please let me know.
# distinct non-missing values in the vector
values <- unique(cat_column[!is.na(cat_column)])
# mode of cat_column
themode <- values[which.max(tabulate(match(cat_column, values)))]
#assign missing vector
imputevector <- cat_column
imputevector[is.na(imputevector)] <- themode

User Defined Function
Here is the mode function I use with an additional line to choose a single mode in the event there are actually multiple modes:
my_mode <- function(x) {
ux <- unique(x)
tab <- tabulate(match(x, ux))
mode <- ux[tab == max(tab)]
ifelse(length(mode) > 1, sample(mode, 1), mode)
}
# single mode
cat_col_1 <- c(1, 1, 2, NA)
cat_col_1
#> [1] 1 1 2 NA
cat_col_1[is.na(cat_col_1)] <- my_mode(cat_col_1)
cat_col_1
#> [1] 1 1 2 1
# random sample among multimodal
cat_col_2 <- c(1, 1, 2, 2, NA)
cat_col_2
#> [1] 1 1 2 2 NA
cat_col_2[is.na(cat_col_2)] <- my_mode(cat_col_2)
cat_col_2
#> [1] 1 1 2 2 2
DescTools::Mode()
But other folks have written mode functions. One possibility is in the DescTools package and is named Mode().
Because it returns multiple modes in the event there are more than one, you would need to decide what to do in that event.
Here is an example that randomly samples, with replacement, as many modes as there are missing values to replace.
# single mode
cat_col_3 <- c(1, 1, 2, NA)
cat_col_3
#> [1] 1 1 2 NA
cat_col_3_modes <- DescTools::Mode(cat_col_3, na.rm = TRUE)
cat_col_3_nmiss <- sum(is.na(cat_col_3))
cat_col_3[is.na(cat_col_3)] <- sample(cat_col_3_modes, cat_col_3_nmiss, TRUE)
cat_col_3
#> [1] 1 1 2 1
# random sample among multimodal
cat_col_4 <- c(1, 1, 2, 2, NA, NA)
cat_col_4
#> [1] 1 1 2 2 NA NA
cat_col_4_modes <- DescTools::Mode(cat_col_4, na.rm = TRUE)
cat_col_4_nmiss <- sum(is.na(cat_col_4))
cat_col_4[is.na(cat_col_4)] <- sample(cat_col_4_modes, cat_col_4_nmiss, TRUE)
cat_col_4
#> [1] 1 1 2 2 2 1
Created on 2021-04-16 by the reprex package (v1.0.0)
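For a whole data frame of categorical columns, the vector approach can be applied column by column. The following is a minimal base R sketch, not a definitive implementation: first_mode and impute_mode are hypothetical helper names, and ties are broken by taking the first mode in order of appearance rather than by sampling. That choice also sidesteps a pitfall of sample(), which treats a single number m as 1:m.

```r
# Hypothetical helper: most frequent non-missing value of a vector.
# Ties go to the value that appears first (an arbitrary choice).
first_mode <- function(x) {
  ux <- unique(x[!is.na(x)])
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical helper: replace the NAs in every column with that column's mode.
impute_mode <- function(df) {
  df[] <- lapply(df, function(col) {
    col[is.na(col)] <- first_mode(col)
    col
  })
  df
}

cat_df <- data.frame(
  colour = c("red", "blue", NA, "red"),
  size   = c("S", NA, "M", "M"),
  stringsAsFactors = FALSE
)
imputed <- impute_mode(cat_df)
imputed
```

Assigning through df[] keeps the result a data frame with the original column names.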

Efficient Monte Carlo simulation over a grid in R

I am running a Monte Carlo simulation of a multinomial logit, so I have a function that generates the data and estimates the model. I also want to generate different datasets over a grid of values, varying both the number of individuals (n.indiv) and the number of answers given by each individual (n.choices).
So far I have managed to solve it, but I ended up with a nested for-loop over the grid of possible values for the number of individuals (n.indiv_list) and the number of answers per individual (n.choices_list). I am worried about the efficiency of this last bit of code, with its double for-loop running over all combinations. There is probably a vectorized way to do it that I am missing (or maybe not?).
Finally, and this is mostly a matter of style, I end up with multiple objects, one per grid combination, each holding the fitted models under an informative name. It would be great to collect all of them in a single list, but with the current structure I am not sure how to do it. Thank you in advance!
1) Function that generates data and estimates the model.
library(dplyr)
library(VGAM)
library(mlogit)
#function that generates the data and estimates the model.
mlogit_sim_data <- function(...){
# generating number of (n.alter) X (n.choices)
df <- data.frame(id= rep(seq(1,n.choices ),n.alter ))
# id per individual
df <- df %>%
group_by(id) %>%
mutate(altern = sequence(n()))%>%
arrange(id)
#Repeated scheme for each individual + id_ind
df <- cbind(df[rep(1:nrow(df), n.indiv), ], id_ind = rep(1:n.indiv, each = nrow(df)))
## creating attributes
df<- df %>%
mutate(
x1=rlnorm(n.indiv*n.alter),
x2=rlnorm(n.indiv*n.alter),
)%>%
group_by(altern) %>%
mutate(
id_choice = sequence(n()))%>%
group_by(id_ind) %>%
mutate(
z1 = rpois(1,lambda = 25),
z2 = rlnorm(1,meanlog = 5, sdlog = 0.5),
z3 = ifelse(runif(1, min = 0 , max = 1) > 0.5 , 1 , 0)
)
# Observed utility
df$V1 <- with(df, b1 * x1 + b2 * x2 )
#### Generate Response Variable ####
fn_choice_generator <- function(V){
U <- V + rgumbel(length(V), 0, 1)
1L * (U == max(U))
}
# Using fn_choice_generator to generate 'choice' columns
df <- df %>%
group_by(id_choice) %>%
mutate(across(starts_with("V"),
fn_choice_generator, .names = "choice_{.col}")) %>% # generating choice(s)
select(-starts_with("V")) %>% ##drop V variables.
select(-c(id,id_ind))
tryCatch(
{
model_result <- mlogit(choice_V1 ~ 0 + x1 + x2 |1 ,
data = df,
idx = c("id_choice", "altern"))
return(model_result)
},
error = function(e){
return(NA)
}
)
}
2) Grid search over possible combinations of the data
#List with the values that varies in the simulation
#number of individuals
n.indiv_list <- c(1, 15, 100, 500 )
#number of choice situations
n.choices_list <- c(1, 2, 4, 8, 10)
# Values that remains constant across simulations
#set number of alternatives
n.alter <- 3
## Real parameters
b1 <- 1
b2 <- 2
#Number of reps
nreps <- 10
#Set seed
set.seed(777)
#iteration over different values in the simulation
for(i in n.indiv_list) {
for(j in n.choices_list) {
n.indiv <- i
n.choices <- j
assign(paste0("m_ind_", i, "_choices_", j), lapply(X = 1:nreps, FUN = mlogit_sim_data))
}
}

You can vectorize using the map2 function of the purrr package:
library(tidyverse)
n.indiv_list <- c(1, 15, 100, 500 )
#number of choice situations
n.choices_list <- c(1, 2, 4, 8, 10)
l1 <- length(n.indiv_list)
l2 <- length(n.choices_list)
v1 <- rep(n.indiv_list, each = l2)
v2 <- rep(n.choices_list, l1) #v1, v2 generate all pairs
> v1
[1] 1 1 1 1 1 15 15 15 15 15 100 100 100 100 100 500 500 500 500 500
> v2
[1] 1 2 4 8 10 1 2 4 8 10 1 2 4 8 10 1 2 4 8 10
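# mlogit_sim_data() reads n.indiv and n.choices from the global environment,
# so set them there before each batch of replications
result <- map2(v1, v2, function(v1, v2) {
n.indiv <<- v1
n.choices <<- v2
lapply(X = 1:nreps, FUN = mlogit_sim_data)
})
names(result) <- paste0("m_ind_", v1, "_choices_", v2)
result will be a named list of your function outputs, one element per combination.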
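The same grid can be built with expand.grid and walked with Map in base R. This is a sketch under stated assumptions: sim_once is a hypothetical stand-in for mlogit_sim_data that takes its settings as arguments instead of reading globals, and it returns a single number rather than a fitted model.

```r
# Hypothetical stand-in for the simulation: settings arrive as arguments,
# and the "result" is just the mean of some random draws.
sim_once <- function(n.indiv, n.choices) {
  mean(rnorm(n.indiv * n.choices))
}

n.indiv_list   <- c(1, 15, 100, 500)
n.choices_list <- c(1, 2, 4, 8, 10)
nreps <- 10

# One row per (n.indiv, n.choices) combination.
grid <- expand.grid(n.indiv = n.indiv_list, n.choices = n.choices_list)

set.seed(777)
results <- Map(
  function(ni, nc) replicate(nreps, sim_once(ni, nc), simplify = FALSE),
  grid$n.indiv, grid$n.choices
)
names(results) <- paste0("m_ind_", grid$n.indiv, "_choices_", grid$n.choices)

length(results)                           # 20 combinations
length(results[["m_ind_15_choices_4"]])   # 10 replications
```

Passing the settings as arguments keeps every combination self-contained, and the named list replaces the separate objects created with assign(). Note that this is still a loop under the hood, so it tidies the code rather than speeding it up.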

How to find the local minimum value in r? [duplicate]

I'm looking for a computationally efficient way to find local maxima/minima for a large list of numbers in R.
Hopefully without for loops...
For example, if I have a datafile like 1 2 3 2 1 1 2 1, I want the function to return 3 and 7, which are the positions of the local maxima.

diff(diff(x)) (or diff(x,differences=2): thanks to #ZheyuanLi) essentially computes the discrete analogue of the second derivative, so should be negative at local maxima. The +1 below takes care of the fact that the result of diff is shorter than the input vector.
edit: added #Tommy's correction for cases where delta-x is not 1...
tt <- c(1,2,3,2,1, 1, 2, 1)
which(diff(sign(diff(tt)))==-2)+1
My suggestion above ( http://statweb.stanford.edu/~tibs/PPC/Rdist/ ) is intended for the case where the data are noisier.
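Spelled out step by step on the example from the question (a base R sketch; slope, turn, maxima and minima are illustrative names):

```r
tt <- c(1, 2, 3, 2, 1, 1, 2, 1)

slope <- sign(diff(tt))   # +1 rising, -1 falling, 0 flat
turn  <- diff(slope)      # change of direction between neighbouring steps

maxima <- which(turn == -2) + 1   # rising immediately followed by falling
minima <- which(turn ==  2) + 1   # falling immediately followed by rising

maxima   # 3 7
minima   # empty: the valley at positions 5 and 6 is flat
```

The flat valley splits the change of direction into two steps of +1, so the == 2 test never fires there.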

#Ben's solution is pretty sweet. It doesn't handle the following cases though:
# all these return numeric(0):
x <- c(1,2,9,9,2,1,1,5,5,1) # duplicated points at maxima
which(diff(sign(diff(x)))==-2)+1
x <- c(2,2,9,9,2,1,1,5,5,1) # duplicated points at start
which(diff(sign(diff(x)))==-2)+1
x <- c(3,2,9,9,2,1,1,5,5,1) # start is maxima
which(diff(sign(diff(x)))==-2)+1
Here's a more robust (and slower, uglier) version:
localMaxima <- function(x) {
# Use -Inf instead if x is numeric (non-integer)
y <- diff(c(-.Machine$integer.max, x)) > 0L
rle(y)$lengths
y <- cumsum(rle(y)$lengths)
y <- y[seq.int(1L, length(y), 2L)]
if (x[[1]] == x[[2]]) {
y <- y[-1]
}
y
}
x <- c(1,2,9,9,2,1,1,5,5,1)
localMaxima(x) # 3, 8
x <- c(2,2,9,9,2,1,1,5,5,1)
localMaxima(x) # 3, 8
x <- c(3,2,9,9,2,1,1,5,5,1)
localMaxima(x) # 1, 3, 8

Use the zoo library function rollapply:
x <- c(1, 2, 3, 2, 1, 1, 2, 1)
library(zoo)
xz <- as.zoo(x)
rollapply(xz, 3, function(x) which.min(x)==2)
# 2 3 4 5 6 7
#FALSE FALSE FALSE TRUE FALSE FALSE
rollapply(xz, 3, function(x) which.max(x)==2)
# 2 3 4 5 6 7
#FALSE TRUE FALSE FALSE FALSE TRUE
Then pull the index using the 'coredata' for those values where 'which.max' is a "center value" signaling a local maximum. You could obviously do the same for local minima using which.min instead of which.max.
rxz <- rollapply(xz, 3, function(x) which.max(x)==2)
index(rxz)[coredata(rxz)]
#[1] 3 7
I am assuming you do not want the starting or ending values, but if you do, you could pad the ends of your vectors before processing, rather like telomeres do on chromosomes.
(I'm noting the ppc package ("Peak Probability Contrasts") for doing mass spectrometry analyses, simply because I was unaware of its availability until reading #BenBolker's comment above, and I think adding these few words will increase the chances that someone with a mass-spec interest will see this on a search.)
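The padding idea can be sketched in base R, without zoo. The assumption here is that an end value should count as a maximum whenever it exceeds its single neighbour, so the vector is padded with -Inf on both sides; padded and is_peak are illustrative names.

```r
x <- c(3, 2, 9, 1, 5)
n <- length(x)

# Pad both ends so the first and last elements get an artificial neighbour.
padded <- c(-Inf, x, -Inf)

# Compare every original element with its left and right neighbour.
centre  <- padded[2:(n + 1)]
is_peak <- centre > padded[1:n] & centre > padded[3:(n + 2)]

which(is_peak)   # 1 3 5
```

Padding with Inf instead and flipping the comparisons gives the corresponding minima.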

I took a stab at this today. I know you said hopefully without for loops but I stuck with using the apply function. Somewhat compact and fast and allows threshold specification so you can go greater than 1.
The function:
inflect <- function(x, threshold = 1){
up <- sapply(1:threshold, function(n) c(x[-(seq(n))], rep(NA, n)))
down <- sapply(-1:-threshold, function(n) c(rep(NA,abs(n)), x[-seq(length(x), length(x) - abs(n) + 1)]))
a <- cbind(x,up,down)
list(minima = which(apply(a, 1, min) == a[,1]), maxima = which(apply(a, 1, max) == a[,1]))
}
To visualize it and play with thresholds, you can run the following code:
# Pick a desired threshold # to plot up to
n <- 2
# Generate Data
randomwalk <- 100 + cumsum(rnorm(50, 0.2, 1)) # climbs upwards most of the time
bottoms <- lapply(1:n, function(x) inflect(randomwalk, threshold = x)$minima)
tops <- lapply(1:n, function(x) inflect(randomwalk, threshold = x)$maxima)
# Color functions
cf.1 <- grDevices::colorRampPalette(c("pink","red"))
cf.2 <- grDevices::colorRampPalette(c("cyan","blue"))
plot(randomwalk, type = 'l', main = "Minima & Maxima\nVariable Thresholds")
for(i in 1:n){
points(bottoms[[i]], randomwalk[bottoms[[i]]], pch = 16, col = cf.1(n)[i], cex = i/1.5)
}
for(i in 1:n){
points(tops[[i]], randomwalk[tops[[i]]], pch = 16, col = cf.2(n)[i], cex = i/1.5)
}
legend("topleft", legend = c("Minima",1:n,"Maxima",1:n),
pch = rep(c(NA, rep(16,n)), 2), col = c(1, cf.1(n),1, cf.2(n)),
pt.cex = c(rep(c(1, c(1:n) / 1.5), 2)), cex = .75, ncol = 2)

There are some good solutions provided, but it depends on what you need.
Just diff(tt) returns the differences.
You want to detect when you go from increasing values to decreasing values. One way to do this is provided by #Ben:
diff(sign(diff(tt)))==-2
The problem here is that this will only detect changes that go immediately from strictly increasing to strictly decreasing.
A slight change will allow for repeated values at the peak (returning TRUE for the last occurrence of the peak value):
diff(diff(x)>=0)<0
Then, you simply need to properly pad the front and back if you want to detect maxima at the beginning or end of the vector.
Here's everything wrapped in a function (including finding of valleys):
which.peaks <- function(x,partial=TRUE,decreasing=FALSE){
if (decreasing){
if (partial){
which(diff(c(FALSE,diff(x)>0,TRUE))>0)
}else {
which(diff(diff(x)>0)>0)+1
}
}else {
if (partial){
which(diff(c(TRUE,diff(x)>=0,FALSE))<0)
}else {
which(diff(diff(x)>=0)<0)+1
}
}
}
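A short check of the difference between the strict test and the plateau-tolerant one, using one of the vectors with repeated peak values from earlier in the thread (base R; strict and plateau are illustrative names):

```r
x <- c(1, 2, 9, 9, 2, 1, 1, 5, 5, 1)

# Strict version: needs a rise immediately followed by a fall.
strict <- which(diff(sign(diff(x))) == -2) + 1

# Plateau-tolerant version: a flat step counts as "not falling yet",
# so the last element of each plateau is reported.
plateau <- which(diff(diff(x) >= 0) < 0) + 1

strict    # empty
plateau   # 4 9
```

Positions 4 and 9 are the last 9 and the last 5, matching the "last occurrence" behaviour described above.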

Late to the party, but this might be of interest to others. You can nowadays use the (internal) function find_peaks from the ggpmisc package. You can parametrize it using the threshold, span and strict arguments. Since the ggpmisc package is meant to be used with ggplot2, you can directly plot minima and maxima using the stat_peaks and stat_valleys functions:
set.seed(1)
x <- 1:10
y <- runif(10)
# Maxima
x[ggpmisc:::find_peaks(y)]
[1] 4 7
y[ggpmisc:::find_peaks(y)]
[1] 0.9082078 0.9446753
# Minima
x[ggpmisc:::find_peaks(-y)]
[1] 5
y[ggpmisc:::find_peaks(-y)]
[1] 0.2016819
# Plot
ggplot(data = data.frame(x, y), aes(x = x, y = y)) + geom_line() + stat_peaks(col = "red") + stat_valleys(col = "green")

Answer by #42- is great, but I had a use case where I didn't want to use zoo. It's easy to implement this with dplyr using lag and lead:
library(dplyr)
test = tibble(x = sample(1:10, 20, replace = TRUE))
mutate(test, local.minima = if_else(lag(x) > x & lead(x) > x, TRUE, FALSE))
Like the rollapply solution, you can control the window size and edge cases through the lag/lead arguments n and default, respectively.

In the case I'm working on, duplicates are frequent. So I have implemented a function that allows finding first or last extrema (min or max):
locate_xtrem <- function (x, last = FALSE)
{
# use rle to deal with duplicates
x_rle <- rle(x)
# force the first value to be identified as an extrema
first_value <- x_rle$values[1] - x_rle$values[2]
# differentiate the series, keep only the sign, and use 'rle' function to
# locate increase or decrease concerning multiple successive values.
# The result values is a series of (only) -1 and 1.
#
# ! NOTE: with this method, last value will be considered as an extrema
diff_sign_rle <- c(first_value, diff(x_rle$values)) %>% sign() %>% rle()
# this vector will be used to get the initial positions
diff_idx <- cumsum(diff_sign_rle$lengths)
# find min and max
diff_min <- diff_idx[diff_sign_rle$values < 0]
diff_max <- diff_idx[diff_sign_rle$values > 0]
# get the min and max indexes in the original series
x_idx <- cumsum(x_rle$lengths)
if (last) {
min <- x_idx[diff_min]
max <- x_idx[diff_max]
} else {
min <- x_idx[diff_min] - x_rle$lengths[diff_min] + 1
max <- x_idx[diff_max] - x_rle$lengths[diff_max] + 1
}
# just get number of occurences
min_nb <- x_rle$lengths[diff_min]
max_nb <- x_rle$lengths[diff_max]
# format the result as a tibble
bind_rows(
tibble(Idx = min, Values = x[min], NB = min_nb, Status = "min"),
tibble(Idx = max, Values = x[max], NB = max_nb, Status = "max")) %>%
arrange(.data$Idx) %>%
mutate(Last = last) %>%
mutate_at(vars(.data$Idx, .data$NB), as.integer)
}
The answer to the original question is:
> x <- c(1, 2, 3, 2, 1, 1, 2, 1)
> locate_xtrem(x)
# A tibble: 5 x 5
Idx Values NB Status Last
<int> <dbl> <int> <chr> <lgl>
1 1 1 1 min FALSE
2 3 3 1 max FALSE
3 5 1 2 min FALSE
4 7 2 1 max FALSE
5 8 1 1 min FALSE
The result indicates that the second minimum is equal to 1 and that this value is repeated twice starting at index 5. A different result can therefore be obtained by asking the function to find the last occurrences of the local extrema instead:
> locate_xtrem(x, last = TRUE)
# A tibble: 5 x 5
Idx Values NB Status Last
<int> <dbl> <int> <chr> <lgl>
1 1 1 1 min TRUE
2 3 3 1 max TRUE
3 6 1 2 min TRUE
4 7 2 1 max TRUE
5 8 1 1 min TRUE
Depending on the objective, it is then possible to switch between the first and the last value of a local extremum. The second result, with last = TRUE, could also be obtained from an operation between columns "Idx" and "NB"...
Finally, to deal with noise in the data, a function could be implemented to remove fluctuations below a given threshold. The code is not shown since it goes beyond the initial question. I have wrapped it in a package (mainly to automate the testing process), and I give an example below:
x_series %>% xtrem::locate_xtrem()
x_series %>% xtrem::locate_xtrem() %>% remove_noise()

Here's the solution for minima:
#Ben's solution
x <- c(1,2,3,2,1,2,1)
which(diff(sign(diff(x)))==+2)+1 # 5
Please note the cases in Tommy's post!
#Tommy's solution:
localMinima <- function(x) {
# Use Inf instead if x is numeric (non-integer)
y <- diff(c(.Machine$integer.max, x)) > 0L
rle(y)$lengths
y <- cumsum(rle(y)$lengths)
y <- y[seq.int(1L, length(y), 2L)]
if (x[[1]] == x[[2]]) {
y <- y[-1]
}
y
}
x <- c(1,2,9,9,2,1,1,5,5,1)
localMinima(x) # 1, 7, 10
x <- c(2,2,9,9,2,1,1,5,5,1)
localMinima(x) # 7, 10
x <- c(3,2,9,9,2,1,1,5,5,1)
localMinima(x) # 2, 7, 10
Please note: neither localMaxima nor localMinima can handle duplicated maxima/minima at the start!

I had some trouble getting the locations to work in previous solutions and came up with a way to grab the minima and maxima directly. The code below does this and plots the result, marking the minima in red and the maxima in green. Unlike which.max(), this pulls all indices of the minima/maxima out of a data frame. The zero prepended inside the innermost diff() call makes up for the fact that diff() returns a result one element shorter than its input, which saves having to add an offset outside the logical expression. It doesn't matter much, but I feel it's a cleaner way to do it.
# create example data called stockData
stockData = data.frame(x = 1:30, y=rnorm(30,7))
# get the location of the minima/maxima. note the added zero offsets
# the location to get the correct indices
min_indexes = which(diff( sign(diff( c(0,stockData$y)))) == 2)
max_indexes = which(diff( sign(diff( c(0,stockData$y)))) == -2)
# get the actual values where the minima/maxima are located
min_locs = stockData[min_indexes,]
max_locs = stockData[max_indexes,]
# plot the data and mark minima with red and maxima with green
plot(stockData$y, type="l")
points( min_locs, col="red", pch=19, cex=1 )
points( max_locs, col="green", pch=19, cex=1 )

This function by Timothée Poisot is handy for noisy series:
May 3, 2009
An Algorithm To Find Local Extrema In A Vector
Filed under: Algorithm — Tags: Extrema, Time series — Timothée Poisot # 6:46pm
I spend some time looking for an algorithm to find local extrema in
a vector (time series). The solution I used is to “walk” through the
vector by step larger than 1, in order to retain only one value even
when the values are very noisy (see the picture at the end of the
post).
It goes like this :
findpeaks <- function(vec,bw=1,x.coo=c(1:length(vec)))
{
pos.x.max <- NULL
pos.y.max <- NULL
pos.x.min <- NULL
pos.y.min <- NULL
for(i in 1:(length(vec)-1)) {
if((i+1+bw)>length(vec)){
sup.stop <- length(vec)}else{sup.stop <- i+1+bw
}
if((i-bw)<1){inf.stop <- 1}else{inf.stop <- i-bw}
subset.sup <- vec[(i+1):sup.stop]
subset.inf <- vec[inf.stop:(i-1)]
is.max <- sum(subset.inf > vec[i]) == 0
is.nomin <- sum(subset.sup > vec[i]) == 0
no.max <- sum(subset.inf > vec[i]) == length(subset.inf)
no.nomin <- sum(subset.sup > vec[i]) == length(subset.sup)
if(is.max & is.nomin){
pos.x.max <- c(pos.x.max,x.coo[i])
pos.y.max <- c(pos.y.max,vec[i])
}
if(no.max & no.nomin){
pos.x.min <- c(pos.x.min,x.coo[i])
pos.y.min <- c(pos.y.min,vec[i])
}
}
return(list(pos.x.max,pos.y.max,pos.x.min,pos.y.min))
}
Link to original blog post

In the pracma package, use the findpeaks function:
library(pracma)
tt <- c(1,2,3,2,1, 1, 2, 1)
tt_peaks <- findpeaks(tt, zero = "0", peakpat = NULL,
minpeakheight = -Inf, minpeakdistance = 1, threshold = 0, npeaks = 0, sortstr = FALSE)
[,1] [,2] [,3] [,4]
[1,] 3 3 1 5
[2,] 2 7 6 8
That returns a matrix with 4 columns.
The first column holds the heights of the local peaks.
The second column holds their indices.
The third and fourth columns hold the start and end of each peak (with potential overlap).
See https://www.rdocumentation.org/packages/pracma/versions/1.9.9/topics/findpeaks for details.
One caveat: I used it on a series of non-integers, and every peak was reported one index too late; I do not know why. So I had to manually subtract 1 from my index vector (no big deal).

Finding local maxima and minima for a less straightforward sequence, e.g. 1 0 1 1 2 0 1 1 0 1 1 1 0 1: I would place the maxima at positions (1), 5, 7.5, 11 and (14), and the minima at 2, 6, 9 and 13.
#Position 1 1 1 1 1
# 1 2 3 4 5 6 7 8 9 0 1 2 3 4
x <- c(1,0,1,1,2,0,1,1,0,1,1,1,0,1) #Frequency
# p v p v p v p v p p..Peak, v..Valley
peakPosition <- function(x, inclBorders=TRUE) {
if(inclBorders) {y <- c(min(x), x, min(x))
} else {y <- c(x[1], x)}
y <- data.frame(x=sign(diff(y)), i=1:(length(y)-1))
y <- y[y$x!=0,]
idx <- diff(y$x)<0
(y$i[c(idx,F)] + y$i[c(F,idx)] - 1)/2
}
#Find Peaks
peakPosition(x)
#1.0 5.0 7.5 11.0 14.0
#Find Valleys
peakPosition(-x)
#2 6 9 13
peakPosition(c(1,2,3,2,1,1,2,1)) #3 7

We see many nice functions and ideas with different features here. One issue of almost all examples is the efficiency. Many times we see the use of complex functions like diff() or for()-loops, which become slow when large data sets are involved. Let me introduce an efficient function I use every day, with minimal features, but very fast:
#' Local Maxima Function amax()
#'
#' The purpose is to detect all local maxima in a real valued vector.
#' If the first element x[1] is the global maximum, it is ignored,
#' because there is no information about the previous element. If there
#' is a plateau, the first edge is detected.
#'
#' @param x numeric vector
#' @return returns the indices of local maxima. If x[1] = max, then
#' it is ignored.
amax <- function(x)
{
a1 <- c(0,x,0)
a2 <- c(x,0,0)
a3 <- c(0,0,x)
e <- which((a1 >= a2 & a1 > a3)[2:(length(x))])
if(!is.na(e[1] == 1))
if(e[1]==1)
e <- e[-1]
if(length(e) == 0) e <- NaN
return (e)
}
a <- c(1,2,3,2,1,5,5,4)
amax(a) # 3, 6

I posted this elsewhere, but I think this is an interesting way to go about it. I'm not sure what its computational efficiency is, but it's a very concise way of solving the problem.
vals=rbinom(1000,20,0.5)
text=paste0(substr(format(diff(vals),scientific=TRUE),1,1),collapse="")
sort(na.omit(c(gregexpr('[ ]-',text)[[1]]+1,ifelse(grepl('^-',text),1,NA),
ifelse(grepl('[^-]$',text),length(vals),NA))))

An enhancement (a fast and simple method) of the formula proposed by #BEN, covering the cases raised by #TOMMY:
the following recursive formula handles any case:
dx=c(0,sign(diff(x)))
numberofzeros= length(dx) - sum(abs(dx)) -1 # to find the number of zeros
# in the dx minus the first one
# which is added intentionally.
#running recursive formula to clear middle zeros
# iterate for the number of zeros
for (i in seq_len(numberofzeros)){
dx = sign(2*dx + c(0,rev(sign(diff(rev(dx))))))
}
Now, the formula provided by #Ben Bolker can be used with a little change:
plot(x)
points(which(diff(dx)==2),x[which(diff(dx)==2)],col = 'blue')#Local MIN.
points(which(diff(dx)==-2),x[which(diff(dx)==-2)],col = 'red')#Local MAX.

I liked #mikeck's solution so that I wouldn't have to convert my dataframes back and forth from a zoo object. But I also wanted to use a window wider than 1. Their solution only looks at the xth value away from the value of interest, not the values within x distance. Here is what I came up with. You would need to add an extra lag/lead line for every value away from the value of interest that you want to look.
library(dplyr)
x <- data.frame(AIC = c(98, 97, 96, 97, 98, 99, 98, 98, 97, 96, 95, 94, 93, 92, 93, 94, 95, 96, 95, 94, 93, 92, 91, 90, 89, 88))
x <- x %>%
mutate(local.minima = if_else(lag(AIC) > AIC & lead(AIC) > AIC &
lag(AIC, 2) > AIC & lead(AIC, 2) > AIC &
lag(AIC, 3) > AIC & lead(AIC, 3) > AIC, TRUE, FALSE),
local.minima = if_else(is.na(local.minima), TRUE, local.minima))
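The same window test can be written for any width without adding a lag/lead pair per step. This is a minimal base R sketch, not a drop-in replacement: is_local_min is a hypothetical helper, and it assumes that neighbours beyond the ends of the vector are simply ignored, which plays a role similar to turning the NA results into TRUE above.

```r
# Hypothetical helper: TRUE where x[i] is strictly smaller than every
# neighbour within `k` positions; neighbours past the ends are ignored.
is_local_min <- function(x, k = 1) {
  n <- length(x)
  vapply(seq_len(n), function(i) {
    nb <- setdiff(max(1, i - k):min(n, i + k), i)  # neighbour positions
    all(x[i] < x[nb])
  }, logical(1))
}

aic <- c(98, 97, 96, 97, 98, 99, 98, 98, 97, 96, 95, 94, 93,
         92, 93, 94, 95, 96, 95, 94, 93, 92, 91, 90, 89, 88)

which(is_local_min(aic, k = 3))   # 3 14 26
```

Changing k changes the window width, so no extra comparison lines are needed.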

Find peaks in R plot [duplicate]

I'm looking for a computationally efficient way to find local maxima/minima for a large list of numbers in R.
Hopefully without for loops...
For example, if I have a datafile like 1 2 3 2 1 1 2 1, I want the function to return 3 and 7, which are the positions of the local maxima.

diff(diff(x)) (or diff(x,differences=2): thanks to #ZheyuanLi) essentially computes the discrete analogue of the second derivative, so should be negative at local maxima. The +1 below takes care of the fact that the result of diff is shorter than the input vector.
edit: added #Tommy's correction for cases where delta-x is not 1...
tt <- c(1,2,3,2,1, 1, 2, 1)
which(diff(sign(diff(tt)))==-2)+1
My suggestion above ( http://statweb.stanford.edu/~tibs/PPC/Rdist/ ) is intended for the case where the data are noisier.

#Ben's solution is pretty sweet. It doesn't handle the follwing cases though:
# all these return numeric(0):
x <- c(1,2,9,9,2,1,1,5,5,1) # duplicated points at maxima
which(diff(sign(diff(x)))==-2)+1
x <- c(2,2,9,9,2,1,1,5,5,1) # duplicated points at start
which(diff(sign(diff(x)))==-2)+1
x <- c(3,2,9,9,2,1,1,5,5,1) # start is maxima
which(diff(sign(diff(x)))==-2)+1
Here's a more robust (and slower, uglier) version:
localMaxima <- function(x) {
# Use -Inf instead if x is numeric (non-integer)
y <- diff(c(-.Machine$integer.max, x)) > 0L
rle(y)$lengths
y <- cumsum(rle(y)$lengths)
y <- y[seq.int(1L, length(y), 2L)]
if (x[[1]] == x[[2]]) {
y <- y[-1]
}
y
}
x <- c(1,2,9,9,2,1,1,5,5,1)
localMaxima(x) # 3, 8
x <- c(2,2,9,9,2,1,1,5,5,1)
localMaxima(x) # 3, 8
x <- c(3,2,9,9,2,1,1,5,5,1)
localMaxima(x) # 1, 3, 8

Use the zoo library function rollapply:
x <- c(1, 2, 3, 2, 1, 1, 2, 1)
library(zoo)
xz <- as.zoo(x)
rollapply(xz, 3, function(x) which.min(x)==2)
# 2 3 4 5 6 7
#FALSE FALSE FALSE TRUE FALSE FALSE
rollapply(xz, 3, function(x) which.max(x)==2)
# 2 3 4 5 6 7
#FALSE TRUE FALSE FALSE FALSE TRUE
Then pull the index using the 'coredata' for those values where 'which.max' is a "center value" signaling a local maximum. You could obviously do the same for local minima using which.min instead of which.max.
rxz <- rollapply(xz, 3, function(x) which.max(x)==2)
index(rxz)[coredata(rxz)]
#[1] 3 7
I am assuming you do not want the starting or ending values, but if you do , you could pad the ends of your vectors before processing, rather like telomeres do on chromosomes.
(I'm noting the ppc package ("Peak Probability Contrasts" for doing mass spectrometry analyses, simply because I was unaware of its availability until reading #BenBolker's comment above, and I think adding these few words will increase the chances that someone with a mass-spec interest will see this on a search.)

I took a stab at this today. I know you said hopefully without for loops but I stuck with using the apply function. Somewhat compact and fast and allows threshold specification so you can go greater than 1.
The function:
inflect <- function(x, threshold = 1){
up <- sapply(1:threshold, function(n) c(x[-(seq(n))], rep(NA, n)))
down <- sapply(-1:-threshold, function(n) c(rep(NA,abs(n)), x[-seq(length(x), length(x) - abs(n) + 1)]))
a <- cbind(x,up,down)
list(minima = which(apply(a, 1, min) == a[,1]), maxima = which(apply(a, 1, max) == a[,1]))
}
To a visualize it/play with thresholds you can run the following code:
# Pick a desired threshold # to plot up to
n <- 2
# Generate Data
randomwalk <- 100 + cumsum(rnorm(50, 0.2, 1)) # climbs upwards most of the time
bottoms <- lapply(1:n, function(x) inflect(randomwalk, threshold = x)$minima)
tops <- lapply(1:n, function(x) inflect(randomwalk, threshold = x)$maxima)
# Color functions
cf.1 <- grDevices::colorRampPalette(c("pink","red"))
cf.2 <- grDevices::colorRampPalette(c("cyan","blue"))
plot(randomwalk, type = 'l', main = "Minima & Maxima\nVariable Thresholds")
for(i in 1:n){
points(bottoms[[i]], randomwalk[bottoms[[i]]], pch = 16, col = cf.1(n)[i], cex = i/1.5)
}
for(i in 1:n){
points(tops[[i]], randomwalk[tops[[i]]], pch = 16, col = cf.2(n)[i], cex = i/1.5)
}
legend("topleft", legend = c("Minima",1:n,"Maxima",1:n),
pch = rep(c(NA, rep(16,n)), 2), col = c(1, cf.1(n),1, cf.2(n)),
pt.cex = c(rep(c(1, c(1:n) / 1.5), 2)), cex = .75, ncol = 2)

There are some good solutions provided, but which one fits depends on what you need.
Just diff(tt) returns the differences.
You want to detect when you go from increasing values to decreasing values. One way to do this is provided by @Ben:
diff(sign(diff(tt)))==-2
The problem here is that this only detects changes that go immediately from strictly increasing to strictly decreasing.
A slight change allows for repeated values at the peak (returning TRUE for the last occurrence of the peak value):
diff(diff(x)>=0)<0
Then you simply need to pad the front and back properly if you want to detect maxima at the beginning or end of the vector.
Here's everything wrapped in a function (including finding valleys):
which.peaks <- function(x,partial=TRUE,decreasing=FALSE){
if (decreasing){
if (partial){
which(diff(c(FALSE,diff(x)>0,TRUE))>0)
}else {
which(diff(diff(x)>0)>0)+1
}
}else {
if (partial){
which(diff(c(TRUE,diff(x)>=0,FALSE))<0)
}else {
which(diff(diff(x)>=0)<0)+1
}
}
}
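
To make the two expressions used inside the function easier to follow, here is a small worked sketch with the intermediate values written out. The vectors x and x2 are made up for illustration.

```r
# Plateau at the peak: the value 9 occurs at positions 3 and 4.
x <- c(1, 2, 9, 9, 2, 1)

rising <- diff(x) >= 0          # TRUE TRUE TRUE FALSE FALSE
drop   <- diff(rising) < 0      # FALSE FALSE TRUE FALSE
inner_peaks <- which(drop) + 1  # 4 (last occurrence of the peak value)

# Padding with TRUE in front and FALSE at the back removes the need for
# the +1 offset and lets the first and last elements qualify as peaks.
padded_peaks <- which(diff(c(TRUE, diff(x) >= 0, FALSE)) < 0)     # 4

# A series that starts at a maximum: only the padded form reports index 1.
x2 <- c(3, 2, 9, 9, 2, 1)
padded_peaks2 <- which(diff(c(TRUE, diff(x2) >= 0, FALSE)) < 0)   # 1 4
```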

Late to the party, but this might be of interest to others. Nowadays you can use the (internal) function find_peaks from the ggpmisc package. You can parametrize it using the threshold, span and strict arguments. Since the ggpmisc package is designed for use with ggplot2, you can plot minima and maxima directly using the stat_peaks and stat_valleys functions:
set.seed(1)
x <- 1:10
y <- runif(10)
# Maxima
x[ggpmisc:::find_peaks(y)]
[1] 4 7
y[ggpmisc:::find_peaks(y)]
[1] 0.9082078 0.9446753
# Minima
x[ggpmisc:::find_peaks(-y)]
[1] 5
y[ggpmisc:::find_peaks(-y)]
[1] 0.2016819
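# Plot (stat_peaks and stat_valleys come from ggpmisc, which builds on ggplot2)
library(ggplot2)
library(ggpmisc)
ggplot(data = data.frame(x, y), aes(x = x, y = y)) + geom_line() + stat_peaks(col = "red") + stat_valleys(col = "green")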

The answer by @42- is great, but I had a use case where I didn't want to use zoo. It's easy to implement this with dplyr using lag and lead:
library(dplyr)
test = tibble(x = sample(1:10, 20, replace = TRUE))
mutate(test, local.minima = if_else(lag(x) > x & lead(x) > x, TRUE, FALSE))
Like the rollapply solution, you can control the window size and edge cases through the lag/lead arguments n and default, respectively.
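
As a rough illustration of the default argument, the sketch below treats the missing neighbors at both ends as Inf, so the first and last values can qualify as minima. The choice of Inf and the example vector are assumptions made for this sketch, not part of the answer.

```r
library(dplyr)

x <- c(1, 2, 3, 2, 1, 1, 2, 1)

# Without `default`, lag()/lead() return NA at the edges and the comparison
# is NA there. With default = Inf, the edges are compared against a
# neighbor that is larger than any finite value.
is_min <- lag(x, default = Inf) > x & lead(x, default = Inf) > x
min_idx <- which(is_min)  # 1 8
```

Because the comparison is strict, the repeated value at positions 5 and 6 is not reported; use >= on one side if plateaus should count.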

In the case I'm working on, duplicates are frequent. So I have implemented a function that allows finding first or last extrema (min or max):
library(dplyr)  # provides %>%, tibble(), bind_rows(), arrange(), mutate()
locate_xtrem <- function (x, last = FALSE)
{
# use rle to deal with duplicates
x_rle <- rle(x)
# force the first value to be identified as an extremum
first_value <- x_rle$values[1] - x_rle$values[2]
# difference the series, keep only the sign, and use the 'rle' function to
# locate increases or decreases spanning multiple successive values.
# The resulting values are a series of (only) -1 and 1.
#
# ! NOTE: with this method, the last value will be considered an extremum
diff_sign_rle <- c(first_value, diff(x_rle$values)) %>% sign() %>% rle()
# this vector will be used to get the initial positions
diff_idx <- cumsum(diff_sign_rle$lengths)
# find min and max
diff_min <- diff_idx[diff_sign_rle$values < 0]
diff_max <- diff_idx[diff_sign_rle$values > 0]
# get the min and max indexes in the original series
x_idx <- cumsum(x_rle$lengths)
if (last) {
min <- x_idx[diff_min]
max <- x_idx[diff_max]
} else {
min <- x_idx[diff_min] - x_rle$lengths[diff_min] + 1
max <- x_idx[diff_max] - x_rle$lengths[diff_max] + 1
}
# just get the number of occurrences
min_nb <- x_rle$lengths[diff_min]
max_nb <- x_rle$lengths[diff_max]
# format the result as a tibble
bind_rows(
tibble(Idx = min, Values = x[min], NB = min_nb, Status = "min"),
tibble(Idx = max, Values = x[max], NB = max_nb, Status = "max")) %>%
arrange(.data$Idx) %>%
mutate(Last = last) %>%
mutate_at(vars(.data$Idx, .data$NB), as.integer)
}
The answer to the original question is:
> x <- c(1, 2, 3, 2, 1, 1, 2, 1)
> locate_xtrem(x)
# A tibble: 5 x 5
Idx Values NB Status Last
<int> <dbl> <int> <chr> <lgl>
1 1 1 1 min FALSE
2 3 3 1 max FALSE
3 5 1 2 min FALSE
4 7 2 1 max FALSE
5 8 1 1 min FALSE
The result indicates that the second minimum is equal to 1 and that this value is repeated twice, starting at index 5. A different result can therefore be obtained by asking the function to find the last occurrence of each local extremum instead:
> locate_xtrem(x, last = TRUE)
# A tibble: 5 x 5
Idx Values NB Status Last
<int> <dbl> <int> <chr> <lgl>
1 1 1 1 min TRUE
2 3 3 1 max TRUE
3 6 1 2 min TRUE
4 7 2 1 max TRUE
5 8 1 1 min TRUE
Depending on the objective, it is then possible to switch between the first and the last occurrence of a local extremum. The second result, with last = TRUE, could also be obtained from an operation on the columns "Idx" and "NB".
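
The operation on "Idx" and "NB" mentioned here can be sketched as follows. The two vectors are copied from the first result table (last = FALSE); the names first_idx, nb and last_idx are made up for this illustration.

```r
# Values copied from the result of locate_xtrem(x) shown above.
first_idx <- c(1L, 3L, 5L, 7L, 8L)  # column Idx (first occurrence)
nb        <- c(1L, 1L, 2L, 1L, 1L)  # column NB (number of repeats)

# The last occurrence of each extremum is the first one shifted by the
# run length minus one.
last_idx <- first_idx + nb - 1L     # 1 3 6 7 8
```

This reproduces the Idx column of the second table, obtained with last = TRUE.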
Finally, to deal with noise in the data, a function could be implemented to remove fluctuations below a given threshold. The code is not shown since it goes beyond the initial question. I have wrapped it in a package (mainly to automate the testing process), and I give an example call below:
x_series %>% xtrem::locate_xtrem()
x_series %>% xtrem::locate_xtrem() %>% remove_noise()

Here's the solution for minima:
@Ben's solution:
x <- c(1,2,3,2,1,2,1)
which(diff(sign(diff(x)))==+2)+1 # 5
Please note the cases in Tommy's post!
@Tommy's solution:
localMinima <- function(x) {
# Use Inf instead if x is numeric (non-integer)
y <- diff(c(.Machine$integer.max, x)) > 0L
y <- cumsum(rle(y)$lengths)
y <- y[seq.int(1L, length(y), 2L)]
if (x[[1]] == x[[2]]) {
y <- y[-1]
}
y
}
x <- c(1,2,9,9,2,1,1,5,5,1)
localMinima(x) # 1, 7, 10
x <- c(2,2,9,9,2,1,1,5,5,1)
localMinima(x) # 7, 10
x <- c(3,2,9,9,2,1,1,5,5,1)
localMinima(x) # 2, 7, 10
Please note: neither localMaxima nor localMinima can handle duplicated maxima/minima at the start!

I had some trouble getting the locations to work in previous solutions and came up with a way to grab the minima and maxima directly. The code below does this and plots the result, marking the minima in red and the maxima in green. Unlike which.max(), this pulls all indices of the minima/maxima out of a data frame. A zero is prepended inside the innermost diff() call to compensate for the fact that each call to diff() returns a result one element shorter than its input. Padding there saves having to add an offset outside of the logical expression. It doesn't matter much, but I feel it's a cleaner way to do it.
# create example data called stockData
stockData = data.frame(x = 1:30, y=rnorm(30,7))
# get the location of the minima/maxima. note the added zero offsets
# the location to get the correct indices
min_indexes = which(diff( sign(diff( c(0,stockData$y)))) == 2)
max_indexes = which(diff( sign(diff( c(0,stockData$y)))) == -2)
# get the actual values where the minima/maxima are located
min_locs = stockData[min_indexes,]
max_locs = stockData[max_indexes,]
# plot the data and mark minima with red and maxima with green
plot(stockData$y, type="l")
points( min_locs, col="red", pch=19, cex=1 )
points( max_locs, col="green", pch=19, cex=1 )
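
The effect of the prepended zero can be checked on a small deterministic vector. This is a sketch with made-up data: it compares the padded form against the usual form with an outer +1 offset, and then shows one side effect of the padding that is easy to miss.

```r
y <- c(5, 7, 6, 4, 8, 3)

# Padded form from the answer: no offset needed.
max_padded <- which(diff(sign(diff(c(0, y)))) == -2)   # 2 5
# Unpadded form: the result is shifted by one, so add 1.
max_offset <- which(diff(sign(diff(y))) == -2) + 1     # 2 5

# Side effect: the zero acts as a virtual neighbor in front of y[1], so a
# positive first value followed by a drop is reported as a maximum.
y2 <- c(7, 5, 6)
max_padded2 <- which(diff(sign(diff(c(0, y2)))) == -2)  # 1
```

With data centered around 7, as in stockData, the first point can therefore be flagged as a maximum whenever the second point is lower.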

This function by Timothée Poisot is handy for noisy series:
May 3, 2009
An Algorithm To Find Local Extrema In A Vector
Filed under: Algorithm | Tags: Extrema, Time series | Timothée Poisot @ 6:46pm
I spend some time looking for an algorithm to find local extrema in
a vector (time series). The solution I used is to “walk” through the
vector by step larger than 1, in order to retain only one value even
when the values are very noisy (see the picture at the end of the
post).
It goes like this :
findpeaks <- function(vec,bw=1,x.coo=c(1:length(vec)))
{
pos.x.max <- NULL
pos.y.max <- NULL
pos.x.min <- NULL
pos.y.min <- NULL
for(i in 1:(length(vec)-1)) {
if((i+1+bw)>length(vec)){
sup.stop <- length(vec)}else{sup.stop <- i+1+bw
}
if((i-bw)<1){inf.stop <- 1}else{inf.stop <- i-bw}
subset.sup <- vec[(i+1):sup.stop]
subset.inf <- vec[inf.stop:(i-1)]
is.max <- sum(subset.inf > vec[i]) == 0
is.nomin <- sum(subset.sup > vec[i]) == 0
no.max <- sum(subset.inf > vec[i]) == length(subset.inf)
no.nomin <- sum(subset.sup > vec[i]) == length(subset.sup)
if(is.max & is.nomin){
pos.x.max <- c(pos.x.max,x.coo[i])
pos.y.max <- c(pos.y.max,vec[i])
}
if(no.max & no.nomin){
pos.x.min <- c(pos.x.min,x.coo[i])
pos.y.min <- c(pos.y.min,vec[i])
}
}
return(list(pos.x.max,pos.y.max,pos.x.min,pos.y.min))
}
Link to original blog post

In the pracma package, use the findpeaks function:
library(pracma)
tt <- c(1,2,3,2,1, 1, 2, 1)
tt_peaks <- findpeaks(tt, zero = "0", peakpat = NULL,
minpeakheight = -Inf, minpeakdistance = 1, threshold = 0, npeaks = 0, sortstr = FALSE)
[,1] [,2] [,3] [,4]
[1,] 3 3 1 5
[2,] 2 7 6 8
That returns a matrix with 4 columns.
The first column holds the heights of the local peaks.
The 2nd column holds the indices of the peaks.
The 3rd and 4th columns are the start and end of each peak (with potential overlap).
See https://www.rdocumentation.org/packages/pracma/versions/1.9.9/topics/findpeaks for details.
One caveat: I used it on a series of non-integers, and the reported index was one too late for all peaks; I do not know why. So I had to manually subtract 1 from my index vector (no big deal).
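
As a small usage sketch, assuming pracma is installed and relying on the default arguments (which match the explicit ones in the call above), the peak positions for the question's vector can be read from the second column:

```r
library(pracma)

tt <- c(1, 2, 3, 2, 1, 1, 2, 1)
pk <- findpeaks(tt)

peak_height <- pk[, 1]  # heights of the peaks
peak_idx    <- pk[, 2]  # positions of the peaks in tt

# Sanity check: the value at each reported index equals the reported height.
heights_match <- all(tt[peak_idx] == peak_height)
```

Running a check like heights_match on your own series is a quick way to see whether the off-by-one behavior mentioned in the caveat applies to your data.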

Finding local maxima and minima for a not-so-easy sequence, e.g. 1 0 1 1 2 0 1 1 0 1 1 1 0 1: I would give their positions as (1), 5, 7.5, 11 and (14) for maxima and 2, 6, 9, 13 for minima.
#Position                1 1 1 1 1
#      1 2 3 4 5 6 7 8 9 0 1 2 3 4
x <- c(1,0,1,1,2,0,1,1,0,1,1,1,0,1) #Frequency
#      p v     p v  p  v   p   v p  p..Peak, v..Valley
peakPosition <- function(x, inclBorders=TRUE) {
if(inclBorders) {y <- c(min(x), x, min(x))
} else {y <- c(x[1], x)}
y <- data.frame(x=sign(diff(y)), i=1:(length(y)-1))
y <- y[y$x!=0,]
idx <- diff(y$x)<0
(y$i[c(idx,F)] + y$i[c(F,idx)] - 1)/2
}
#Find Peaks
peakPosition(x)
#1.0 5.0 7.5 11.0 14.0
#Find Valleys
peakPosition(-x)
#2 6 9 13
peakPosition(c(1,2,3,2,1,1,2,1)) #3 7

We see many nice functions and ideas with different features here. One issue with almost all of the examples is efficiency. Many of them use functions like diff() or for()-loops, which can become slow when large data sets are involved. Let me introduce an efficient function I use every day, with minimal features, but very fast:
Local Maxima Function amax()
The purpose is to detect all local maxima in a real valued vector.
If the first element x[1] is the global maximum, it is ignored,
because there is no information about the previous element. If there
is a plateau, the first edge is detected.
@param x numeric vector
@return returns the indices of local maxima. If x[1] = max, then
it is ignored.
amax <- function(x)
{
a1 <- c(0,x,0)
a2 <- c(x,0,0)
a3 <- c(0,0,x)
e <- which((a1 >= a2 & a1 > a3)[2:(length(x))])
if(!is.na(e[1] == 1))
if(e[1]==1)
e <- e[-1]
if(length(e) == 0) e <- NaN
return (e)
}
a <- c(1,2,3,2,1,5,5,4)
amax(a) # 3, 6

I posted this elsewhere, but I think this is an interesting way to go about it. I'm not sure what its computational efficiency is, but it's a very concise way of solving the problem.
vals=rbinom(1000,20,0.5)
text=paste0(substr(format(diff(vals),scientific=TRUE),1,1),collapse="")
sort(na.omit(c(gregexpr('[ ]-',text)[[1]]+1,ifelse(grepl('^-',text),1,NA),
ifelse(grepl('[^-]$',text),length(vals),NA))))
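
To see what the string trick is doing, here is a simplified sketch on a short vector. It swaps the format()/substr() step for an explicit ifelse() so that the sign string is easy to read, and it only covers interior peaks, not the two end checks from the code above. The variable names are made up for this illustration.

```r
vals <- c(1, 3, 2, 2, 5, 4)

# One character per difference: "-" for a decrease, " " otherwise.
signs <- ifelse(diff(vals) < 0, "-", " ")
sign_text <- paste0(signs, collapse = "")   # " -  -"

# A peak is a non-decrease immediately followed by a decrease, i.e. the
# two-character pattern " -". The match position is the index of the
# rise, so the peak itself sits one position later.
hits <- gregexpr(" -", sign_text, fixed = TRUE)[[1]]
peaks <- as.integer(hits) + 1L              # 2 5
```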

An enhancement (a fast and simple method) of the formula proposed by @BEN, addressing the cases raised by @TOMMY:
The following iterative formula handles repeated values (plateaus):
dx=c(0,sign(diff(x)))
numberofzeros= length(dx) - sum(abs(dx)) -1 # to find the number of zeros
# in the dx minus the first one
# which is added intentionally.
#running iterative formula to clear middle zeros
# iterate for the number of zeros
for (i in seq_len(numberofzeros)){ # seq_len() skips the loop when there are no zeros
dx = sign(2*dx + c(0,rev(sign(diff(rev(dx))))))
}
Now, the formula provided by @Ben Bolker can be used with a little change:
plot(x)
points(which(diff(dx)==2),x[which(diff(dx)==2)],col = 'blue')#Local MIN.
points(which(diff(dx)==-2),x[which(diff(dx)==-2)],col = 'red')#Local MAX.
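
The zero-filling step can be traced on one of the plateau vectors from Tommy's answer. This is a worked sketch of the same formula with the intermediate result written out; only the variable name n_zeros differs from the code above.

```r
x <- c(1, 2, 9, 9, 2, 1, 1, 5, 5, 1)

dx <- c(0, sign(diff(x)))          # 0 1 1 0 -1 -1 0 1 0 -1
n_zeros <- length(dx) - sum(abs(dx)) - 1   # 3 zeros besides the leading one

# Each pass replaces a zero by the sign of the element before it, so a
# plateau inherits the direction of the slope that led into it.
for (i in seq_len(n_zeros)) {
  dx <- sign(2 * dx + c(0, rev(sign(diff(rev(dx))))))
}
# dx is now: 0 1 1 1 -1 -1 -1 1 1 -1

local_max <- which(diff(dx) == -2)   # 4 9
local_min <- which(diff(dx) == 2)    # 7
```

The reported positions are the last element of each plateau (x[4] = 9, x[9] = 5, x[7] = 1), and no +1 offset is needed because dx starts with the extra zero.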

I liked @mikeck's solution because I didn't have to convert my data frames back and forth from a zoo object. But I also wanted to use a window wider than 1. Their solution only looks at the xth value away from the value of interest, not all the values within x distance. Here is what I came up with. You need to add an extra lag/lead line for every step away from the value of interest that you want to look.
library(dplyr)
x <- data.frame(AIC = c(98, 97, 96, 97, 98, 99, 98, 98, 97, 96, 95, 94, 93, 92, 93, 94, 95, 96, 95, 94, 93, 92, 91, 90, 89, 88))
x <- x %>%
mutate(local.minima = if_else(lag(AIC) > AIC & lead(AIC) > AIC &
lag(AIC, 2) > AIC & lead(AIC, 2) > AIC &
lag(AIC, 3) > AIC & lead(AIC, 3) > AIC, TRUE, FALSE),
local.minima = if_else(is.na(local.minima), TRUE, local.minima))
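
Instead of writing one lag/lead line per step, the comparisons can be generated for any window width. The sketch below is one possible way to do that, assuming dplyr is available; is_local_min is a hypothetical helper name, and unlike the code above it leaves edge positions as NA rather than turning them into TRUE.

```r
library(dplyr)

# Hypothetical helper: TRUE where v[i] is strictly smaller than every value
# within k steps on both sides, NA where the window runs off either end
# (unless some comparison inside the window is already FALSE).
is_local_min <- function(v, k = 1) {
  checks <- lapply(seq_len(k), function(n) lag(v, n) > v & lead(v, n) > v)
  Reduce(`&`, checks)
}

aic <- c(98, 97, 96, 97, 98, 99, 98, 98, 97, 96, 95, 94, 93, 92, 93, 94,
         95, 96, 95, 94, 93, 92, 91, 90, 89, 88)

min_pos <- which(is_local_min(aic, k = 3))  # 14
```

which() drops the NA entries, so only position 14 (the value 92) is reported here; the minimum at position 3 has no third neighbor on its left and stays NA.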

Find the lower points of the two data columns and compare [duplicate]

I'm looking for a computationally efficient way to find local maxima/minima for a large list of numbers in R.
Hopefully without for loops...
For example, if I have a datafile like 1 2 3 2 1 1 2 1, I want the function to return 3 and 7, which are the positions of the local maxima.

diff(diff(x)) (or diff(x, differences=2); thanks to @ZheyuanLi) essentially computes the discrete analogue of the second derivative, so it should be negative at local maxima. The +1 below takes care of the fact that the result of diff is shorter than the input vector.
edit: added @Tommy's correction for cases where delta-x is not 1...
tt <- c(1,2,3,2,1, 1, 2, 1)
which(diff(sign(diff(tt)))==-2)+1
My suggestion above ( http://statweb.stanford.edu/~tibs/PPC/Rdist/ ) is intended for the case where the data are noisier.
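
For readers who want to see why the one-liner works, here are the intermediate vectors for the example from the question, written out step by step:

```r
tt <- c(1, 2, 3, 2, 1, 1, 2, 1)

d1 <- diff(tt)   # 1 1 -1 -1 0 1 -1   first differences
s  <- sign(d1)   # 1 1 -1 -1 0 1 -1   direction only: up, down or flat
d2 <- diff(s)    # 0 -2 0 1 1 -2      -2 marks a switch from up to down

# d2[i] compares the steps around tt[i + 1], hence the +1.
peaks <- which(d2 == -2) + 1   # 3 7
```

Taking sign() first is what makes the test independent of the step size: any rise followed directly by any fall gives exactly -2.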

@Ben's solution is pretty sweet. It doesn't handle the following cases though:
# all these return numeric(0):
x <- c(1,2,9,9,2,1,1,5,5,1) # duplicated points at maxima
which(diff(sign(diff(x)))==-2)+1
x <- c(2,2,9,9,2,1,1,5,5,1) # duplicated points at start
which(diff(sign(diff(x)))==-2)+1
x <- c(3,2,9,9,2,1,1,5,5,1) # start is maxima
which(diff(sign(diff(x)))==-2)+1
Here's a more robust (and slower, uglier) version:
localMaxima <- function(x) {
# Use -Inf instead if x is numeric (non-integer)
y <- diff(c(-.Machine$integer.max, x)) > 0L
y <- cumsum(rle(y)$lengths)
y <- y[seq.int(1L, length(y), 2L)]
if (x[[1]] == x[[2]]) {
y <- y[-1]
}
y
}
x <- c(1,2,9,9,2,1,1,5,5,1)
localMaxima(x) # 3, 8
x <- c(2,2,9,9,2,1,1,5,5,1)
localMaxima(x) # 3, 8
x <- c(3,2,9,9,2,1,1,5,5,1)
localMaxima(x) # 1, 3, 8

Use the zoo library function rollapply:
x <- c(1, 2, 3, 2, 1, 1, 2, 1)
library(zoo)
xz <- as.zoo(x)
rollapply(xz, 3, function(x) which.min(x)==2)
# 2 3 4 5 6 7
#FALSE FALSE FALSE TRUE FALSE FALSE
rollapply(xz, 3, function(x) which.max(x)==2)
# 2 3 4 5 6 7
#FALSE TRUE FALSE FALSE FALSE TRUE
Then pull the index using the 'coredata' for those values where 'which.max' is a "center value" signaling a local maximum. You could obviously do the same for local minima using which.min instead of which.max.
rxz <- rollapply(xz, 3, function(x) which.max(x)==2)
index(rxz)[coredata(rxz)]
#[1] 3 7
I am assuming you do not want the starting or ending values, but if you do, you could pad the ends of your vectors before processing, rather like telomeres do on chromosomes.
(I'm noting the ppc package ("Peak Probability Contrasts") for doing mass spectrometry analyses, simply because I was unaware of its availability until reading @BenBolker's comment above, and I think adding these few words will increase the chances that someone with a mass-spec interest will see this on a search.)
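
The padding idea can be sketched like this, assuming zoo is installed. Padding with -Inf is a choice made for this illustration (use Inf when looking for minima); the example vector is made up.

```r
library(zoo)

x <- c(3, 1, 2, 1, 4)

# One -Inf on each side gives every original element a full window of 3.
padded <- c(-Inf, x, -Inf)
is_max <- rollapply(padded, 3, function(w) which.max(w) == 2)

# Window number i is centered on padded[i + 1], which is x[i], so the
# positions of TRUE are already indices into the original vector.
max_idx <- which(as.vector(is_max))   # 1 3 5
```

With the padding in place, the first and last elements are reported as maxima whenever they exceed their single real neighbor.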
