How to find clusters of values over threshold for timeseries

I have a time series and need to find clusters of values over a threshold, then plot each cluster on a separate plot.
Here is my example code. Unfortunately, I don't know how to generate sample values that are well clustered.
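# Load the packages used below (tibble(), ggplot(), date_format())
library(tibble)
library(ggplot2)
library(scales)
# Generate sample data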
Sys.setlocale("LC_ALL","English")
set.seed(8)
# This first draw is immediately overwritten by the rpois() draw below
Values <- sample(0:100, 24241, replace = TRUE)
Values <- rpois(24241, lambda = 60)
start <- as.POSIXct("2012-01-15 06:10:00")
interval <- 15
end <- start + as.difftime(4, units="days") + as.difftime(5, units = "hours")
DateTimes <- seq(from=start, by=interval, to=end)
my_data_sample <- tibble(datetime = DateTimes, Value = Values)
threshold <- 82
ggplot(data = my_data_sample, aes(x = datetime, y = Value)) +
geom_line(size = 1, color = "darkgreen") +
geom_hline(yintercept=threshold, linetype="dashed", color = "red") +
theme_bw() +
labs(
x= "" ,
y = "",
title = paste("Threshold:", threshold )
) +
scale_x_datetime(date_breaks = "8 hour", labels = date_format("%b %d - %H:%M")) +
theme(axis.text.x = element_text(angle = 25, vjust = 1.0, hjust = 1.0))
Here is what I need:
I need to find clusters of values over the threshold (consecutive or near each other), sort those clusters by length in seconds (longest clusters) or by sum of values (most powerful clusters), and plot, say, the top 3 of those time periods on separate plots.
Any suggestions on how to do that?

You can find runs of values that satisfy a condition (here, Value > threshold) using run-length encoding (RLE). At the RLE level, you can bridge sub-threshold gaps that are too short and drop over-threshold runs that are too short. Adjust the two run_threshold values until the result matches your data.
# Put some actual deviating runs in the data
my_data_sample$Value[5001:5100] <- rpois(100, lambda = 80)
my_data_sample$Value[10001:11000] <- rpois(1000, lambda = 80)
threshold <- 82
rle <- rle(my_data_sample$Value > threshold)
# Find sub-threshold values in between super-threshold values,
# convert these to other class
run_threshold <- 20
rle$values[!rle$values & rle$lengths < run_threshold] <- TRUE
# Restructure rle
rle <- rle(inverse.rle(rle))
# Find short super-threshold values to filter
run_threshold <- 5
rle$values[rle$values & rle$lengths < run_threshold] <- FALSE
rle <- rle(inverse.rle(rle))
# Find run starts and ends
rle_start <- {rle_end <- cumsum(rle$lengths)} - rle$lengths + 1
# Format as data.frame for ggplot
rle_df <- data.frame(
min = my_data_sample$datetime[rle_start],
max = my_data_sample$datetime[rle_end],
value = rle$values
)
ggplot(data = my_data_sample, aes(x = datetime, y = Value)) +
geom_line(size = 1, color = "darkgreen") +
geom_rect(aes(xmin = min, xmax = max, ymin = 0, ymax = 10, fill = value),
data = rle_df, inherit.aes = FALSE) +
geom_hline(yintercept=threshold, linetype="dashed", color = "red") +
theme_bw() +
labs(
x= "" ,
y = "",
title = paste("Threshold:", threshold )
) +
scale_x_datetime(date_breaks = "8 hour", labels = date_format("%b %d - %H:%M")) +
theme(axis.text.x = element_text(angle = 25, vjust = 1.0, hjust = 1.0))
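
The code above marks the runs but stops before the ranking step from the question. The following is a minimal sketch of that step in base R, not part of the original answer: rank_clusters, max_gap and min_len are hypothetical names (the last two stand in for the two run_threshold values), and the toy data is made up for illustration.

```r
# Hypothetical helper: summarise over-threshold clusters and rank them by duration.
rank_clusters <- function(datetime, value, threshold,
                          max_gap = 20, min_len = 5) {
  r <- rle(value > threshold)
  # Bridge short sub-threshold gaps between over-threshold runs
  r$values[!r$values & r$lengths < max_gap] <- TRUE
  r <- rle(inverse.rle(r))
  # Drop over-threshold runs that are still too short
  r$values[r$values & r$lengths < min_len] <- FALSE
  r <- rle(inverse.rle(r))

  run_end   <- cumsum(r$lengths)
  run_start <- run_end - r$lengths + 1
  keep <- which(r$values)

  out <- data.frame(start = datetime[run_start[keep]],
                    end   = datetime[run_end[keep]])
  out$duration_secs <- as.numeric(difftime(out$end, out$start, units = "secs"))
  # Sum of all values inside each cluster, including bridged gaps
  out$total <- vapply(keep, function(i) {
    as.numeric(sum(value[run_start[i]:run_end[i]]))
  }, numeric(1))
  out[order(-out$duration_secs), ]
}

# Toy series (assumption): 15-second spacing, two clusters, one short gap
datetime <- as.POSIXct("2012-01-15 06:10:00", tz = "UTC") + 15 * (0:299)
value <- rep(50, 300)
value[51:80]   <- 90
value[65:67]   <- 50   # short dip that gets bridged
value[201:260] <- 95

clusters <- rank_clusters(datetime, value, threshold = 82)
top3 <- head(clusters, 3)
```

To rank by "power" instead, order by the total column: clusters[order(-clusters$total), ]. Each row of top3 gives a start and end time that can be used to subset the original data frame and draw one plot per cluster.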

Related

Draw arrow on ggplot with dates as variable, by specifying co-ordinates rather than using the units of the x- and y-variables

I am attempting to plot a patient's blood test results as a time series. I have managed to do this and included a reference range shaded between two y-intercepts. My problem is that annotate() and geom_segment() expect the arrow coordinates in the units of my independent variable, which is, unhelpfully, a date (YYYY-MM-DD).
Is it possible to get R to ignore the units of the x- and y-axes and specify the arrow coordinates as if they were on a grid?
result <- runif(25, min = 2.0, max = 3.5)
start_date <- ymd("2021-08-16")
end_date <- ymd("2022-10-29")
date <- sample(seq(start_date, end_date, by = "days"), 25, replace = TRUE)
q <- data.table(result, date)
ggplot(q, aes(x = date, y = result)) +
geom_line() +
geom_point(aes(x = date, y = result), shape = 21, size = 3) +
scale_x_date(limits = c(min(q$date), max(q$date)),
breaks = date_breaks("1 month"),
labels = date_format("%b %Y")) +
ylab("Corrected calcium (mmol/L)")+
xlab("Date of blood test") +
ylim(1,4)+
geom_ribbon(aes(ymin=2.1, ymax=2.6), fill="grey", alpha=0.2, colour="grey")+
geom_vline(xintercept=as.numeric(q$date[c(3, 2)]),
linetype=4, colour="black") +
theme(axis.text.x = element_text(angle = 45)) + theme_prism(base_size = 10) +
annotate("segment", x = 1, y = 2, xend = 3, yend = 4, arrow = arrow(length = unit(0.15, "cm")))
The error produced is Error: Invalid input: date_trans works with objects of class Date only.
I can confirm that:
> class(q$date)
[1] "Date"
I've just gone with test co-ordinates (1,2,3,4) for the annotate("segment"...). Ideally, I want the arrow to point to a specific data point on the plot to indicate when the patient went on treatment.
Many thanks,
Sandro

You don't need to convert to points or coordinates. Just use the actual values from your data frame. I am just subsetting within annotate using a hard coded index (you can also automate this of course), but you will need to "remind" R that you are dealing with dates - thus the added lubridate::as_date call.
library(ggplot2)
library(lubridate)
result <- runif(25, min = 2.0, max = 3.5)
start_date <- ymd("2021-08-16")
end_date <- ymd("2022-10-29")
date <- sample(seq(start_date, end_date, by = "days"), 25, replace = TRUE)
q <- data.frame(result, date)
## I am arranging the data frame by date
q <- dplyr::arrange(q, date)
ggplot(q, aes(x = date, y = result)) +
geom_line() +
## for the start, use any x and y so the arrow starts wherever you want it to start
## for the end, use the x and y of one row of your data frame, in this case row 20
annotate(geom = "segment",
x = as_date(q$date[2]), xend = as_date(q$date[20]),
y = min(q$result), yend = q$result[20],
arrow = arrow(),
size = 2, color = "red")
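
If you still prefer to think in "grid" positions, one option is to convert a fraction of the axis range into a Date yourself and pass that to annotate(). This is a minimal base R sketch under that assumption; frac_to_date is a hypothetical helper, not a ggplot2 function.

```r
# Hypothetical helper: map a fraction of the date range
# (0 = earliest date, 1 = latest date) to a Date value.
frac_to_date <- function(dates, frac) {
  rng <- range(dates)
  # Date differences are in days; Date + number of days is again a Date
  rng[1] + frac * as.numeric(rng[2] - rng[1])
}

dates <- as.Date(c("2021-08-16", "2022-10-29"))
x_start <- frac_to_date(dates, 0.25)  # a quarter of the way along the x-axis
x_end   <- frac_to_date(dates, 0.50)  # halfway along the x-axis
```

The resulting x_start and x_end are Date objects, so they can be used for x and xend in annotate("segment", ...) without triggering the date_trans error.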

Shading every other vertical month, alternating white and gray

I have boxplots of data with one set of boxplots for each month. But together, they get busy and it isn't clear which box goes with which date. Is there a way to shade every other vertical month light gray so I can easily see which ones go with which month?
Edit: I'm already using geom_polygon for another part of the plot that I commented out for now.
date <- seq(as.Date('2015-09-15'), as.Date('2016-09-30'), by = "2 days")
x <- rnorm(length(date))
date <- date[order(x)]
Date <- format(as.Date(date), "%Y-%m")
values <- rnorm(length(x),.16,.01)
type <- c(rep("a",63),rep("b",64),rep("c",64))
new.table <- as.data.frame.matrix(table(Date,type))
dataset <- data.frame(values,date,type,Date)
if(length(which(levels(factor(type))=="c"))==0){
count.data <- rep(0,length(levels(factor(Date))))
}else{count.data <- new.table[,names(new.table)=="c"]}
ly <- length(count.data)
max.count <- max(count.data)
max.right <- max.count*4
max.box <- max(dataset$values,na.rm=T)
min.box <- 5/4*min(dataset$values,na.rm=T)-max.box/4
box.25 <- (max.box-min.box)/4
x <- c(0:(ly+1),c((ly+1):0))
y <- c(0,count.data,rep(0,(ly+3)))*(box.25/max.count)+min.box
poly.data <- data.frame(x,y)
dates1 <- levels(factor(dataset$Date))
noB <- length(dates1)
new.table$Date <- rownames(new.table)
library(tidyverse)
library(gridExtra)
library(ggthemes)
library(ggplot2)
p <- ggplot(dataset,aes(x=Date,y=values,fill=type))+
geom_boxplot(position=position_dodge(width = 0.7))+
stat_boxplot(geom="errorbar",width=0.7)+
coord_cartesian(ylim = c(min.box,max.box))+
#geom_polygon(data=poly.data,mapping = aes(x=x,y=y),fill="grey30")+
#scale_y_continuous(sec.axis = sec_axis(~(.-min.box)*max.count/box.25, name = "Sec axis"),breaks = scales::pretty_breaks(n = 10))+
labs(title="Boxplot of values Over Time",y="values",x="Date (year-month)")+
theme_classic(base_size=15)+
theme(axis.text.x = element_text(angle = ifelse(noB>15,45,0), hjust=ifelse(noB>15,1,0.5)),panel.grid.major=element_line("light grey"))
p

I would use geom_rect with a separate data.frame (here: shades)
shades <- data.frame(xmin=seq(1.5,length(unique(dataset$Date))-1.5, 2),
xmax=seq(2.5,length(unique(dataset$Date))+.5, 2),
ymin=0, ymax=Inf)
ggplot(dataset,aes(x=Date,y=values,fill=type))+
geom_boxplot(position=position_dodge(width = 0.7))+
stat_boxplot(geom="errorbar",width=0.7)+
coord_cartesian(ylim = c(min.box,max.box))+
#geom_polygon(data=poly.data,mapping = aes(x=x,y=y),fill="grey30")+
#scale_y_continuous(sec.axis = sec_axis(~(.-min.box)*max.count/box.25, name = "Sec axis"),breaks = scales::pretty_breaks(n = 10))+
labs(title="Boxplot of values Over Time",y="values",x="Date (year-month)")+
theme_classic(base_size=15)+
theme(axis.text.x = element_text(angle = ifelse(noB>15,45,0), hjust=ifelse(noB>15,1,0.5)),panel.grid.major=element_line("light grey")) +
geom_rect(inherit.aes = F, data = shades, mapping = aes(xmin=xmin, xmax=xmax, ymin = ymin, ymax = ymax), alpha = 0.2)
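
Note that the two seq() calls above can return vectors of different lengths when the number of months is even (for 12 months, 5 values for xmin and 6 for xmax), in which case data.frame() fails. A minimal sketch that derives xmax from xmin avoids this; make_shades is a hypothetical helper name, and ymin = -Inf is my assumption so the band always reaches the bottom of the panel.

```r
# Hypothetical helper: one rectangle around every second category position
# (positions 2, 4, 6, ... on a discrete x-axis with n_levels categories).
make_shades <- function(n_levels) {
  if (n_levels < 2) {
    xmin <- numeric(0)  # nothing to shade with fewer than two categories
  } else {
    xmin <- seq(1.5, n_levels, by = 2)
  }
  data.frame(xmin = xmin, xmax = xmin + 1,
             ymin = rep(-Inf, length(xmin)),
             ymax = rep(Inf, length(xmin)))
}

shades_12 <- make_shades(12)
shades_13 <- make_shades(13)
```

The result can be passed to geom_rect() exactly like shades above, for example make_shades(length(unique(dataset$Date))).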

stat_function not transitioning over transition_states

I'm trying to write my own Central Limit Theorem demonstration using ggplot2 and am unable to get my stat_function to display a changing normal distribution.
Below is my code. I want the normal distribution in stat_function to transition through different states; specifically, I'm hoping for its standard deviation to change to correspond with each value in dataset. Any help would be greatly appreciated.
#library defs
library(gganimate)
library(ggplot2)
library(transformr)
#initialization for distribution, rolls, and vectors
k = 2
meanr = 1/k
sdr = 1/k
br = sdr/10
rolls <- 200
avg <- 1
dataset <- 1
s <- 1
#loop through to create vectors of sample statistics from 200 samples of size i
#avg is sample average, s is standard deviations of sample means, and dataset is the indexes to run the transition states
for (i in c(1:40)){
for (j in 1:rolls){
avg <- c(avg,mean(rexp(i,k)))
}
dataset <- c(dataset, rep(i,rolls))
s <- c(s,rep(sdr/sqrt(i),rolls))
}
#remove initialized vector information as it was only created to start loops
avg <- avg[-1]
dataset <- dataset[-1]
s <- s[-1]
#dataframe (the undefined rn vector from the original post is dropped; it is not used in the plot)
a <- data.frame(avgf=avg, datasetf = dataset,sf = s)
#plot histogram, density function, and normal distribution
ggplot(a,aes(x=avg,y=s))+
geom_histogram(aes(y = ..density..), binwidth = br,fill='beige',col='black')+
geom_line(aes(y = ..density..,colour = 'Empirical'),lwd=2, stat = 'density') +
stat_function(fun = dnorm, aes(colour = 'Normal', y = s),lwd=2,args=list(mean=meanr,sd = mean(s)))+
scale_y_continuous(labels = scales::percent_format()) +
scale_color_discrete(name = "Densities", labels = c("Empirical", "Normal"))+
labs(x = 'Sample Average',title = 'Sample Size: {closest_state}')+
transition_states(dataset,4,4)+ view_follow(fixed_x = TRUE)

I think it's difficult to use stat_function here because the dnorm function that you are passing includes a grouped variable (mean(s)). There is no way to indicate that you wish to group s by the dataset column, and the transition_states function doesn't filter the whole data frame. You could use transition_filter to filter the whole data frame, but this would be laborious.
It's not much work to just add a dnorm to your input data frame and plot it as a line, particularly since the rest of your code can be simplified substantially. Here's a fully reproducible example:
library(gganimate)
library(ggplot2)
library(transformr)
k <- 2
meanr <- sdr <- 1/k
br <- sdr/10
rolls <- 200
a <- do.call(rbind, lapply(1:40, function(i){
data.frame(avg = replicate(rolls, mean(rexp(i, k))),
dataset = rep(i, rolls),
x = seq(0, 2, length.out = rolls),
s = dnorm(seq(0, 2, length.out = rolls),
meanr, sdr/sqrt(i))) }))
ggplot(a, aes(x = avg, group = dataset)) +
geom_histogram(aes(y = ..density..), fill = 'beige',
colour = "black", binwidth = br) +
geom_line(aes(y = ..density.., colour = 'Empirical'),
lwd = 2, stat = 'density', alpha = 0.5) +
geom_line(aes(x = x, y = s, colour = "Normal"), size = 2, alpha = 0.5) +
scale_y_continuous(labels = scales::percent_format()) +
coord_cartesian(xlim = c(0, 2)) +
scale_color_discrete(name = "Densities", labels = c("Empirical", "Normal")) +
labs(x = 'Sample Average', title = 'Sample Size: {closest_state}') +
transition_states(dataset, 4, 4) +
view_follow(fixed_x = TRUE, fixed_y = TRUE)
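
The normal curve above uses sdr/sqrt(i) as its standard deviation, which is the standard error of the mean of i exponential draws. As a quick sanity check of that formula, here is a small simulation sketch in base R; the sample size n = 25 and the number of repetitions are arbitrary choices of mine.

```r
set.seed(42)
k <- 2
meanr <- sdr <- 1 / k   # mean and sd of an exponential with rate k
n <- 25                 # sample size for each mean (arbitrary)
rolls <- 20000          # number of simulated sample means (arbitrary)

sample_means <- replicate(rolls, mean(rexp(n, k)))

empirical_sd   <- sd(sample_means)
theoretical_sd <- sdr / sqrt(n)   # value used for the normal curve at i = n
```

The two standard deviations should agree closely, which is why the normal curve narrows as the sample size grows in the animation.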

ggplot change order of continuous y axis values

I am trying to plot shift data by hour (integer) ordered by 3 different shifts worked (8-16, 16-24, 24-8) by day as the x-axis. The hours I have are 24hr format and I want to plot them not in numerical order (0-24) but by the shift order (8-16, 16-24, 24-8).
Here is the code to create the data and make the plot. I want to put the 0-8 chunk above the 16-24 chunk.
set.seed(123)
Hour = sample(0:24, 500, replace=T)
Day = sample(0:1, 500, replace=T)
dat <- as.tibble(cbind(Hour, Day)) %>%
mutate(Day = factor(ifelse(Day == 0, "Mon", "Tues")),
Shift = cut(Hour, 3, labels = c("0-8", "8-16", "16-24")),
Exposure = factor(sample(0:1, 500, replace=T)))
ggplot(dat, aes(x = Day, y = Hour)) +
geom_jitter(aes(color = Exposure, shape = Exposure)) +
geom_hline(yintercept = 8) +
geom_hline(yintercept = 16) +
theme_classic()
It is an interesting problem. I have tried recoding a new hour variable that is in the order I want, but then I'm not sure how to plot it while still displaying the standard 24hr values on the axis.
How would I accomplish this ordering?

Not sure if I completely understand, but if you facet your table on the Shift column, it should do what you want. First you must factor the Shift column to the order you specify:
dat$Shift <- factor(dat$Shift, levels = c("0-8", "16-24", "8-16"))
ggplot(dat, aes(x = Day, y = Hour)) +
geom_jitter(aes(color = Exposure, shape = Exposure)) +
facet_grid(Shift ~ ., scales = "free") +
theme_classic()

set.seed(123)
Hour = sample(0:24, 500, replace=T)
Day = sample(0:1, 500, replace=T)
dat <- as.tibble(cbind(Hour, Day)) %>%
mutate(Day = factor(ifelse(Day == 0, "Mon", "Tues")),
Shift = cut(Hour, 3, labels = c("0-8", "8-16", "16-24")),
Exposure = factor(sample(0:1, 500, replace=T)))
dat$Shift <- factor(dat$Shift, levels=rev(levels(dat$Shift)))
ggplot(dat, aes(x = Day, y = Shift)) +
geom_jitter(aes(color = Exposure, shape = Exposure)) +
geom_hline(yintercept = 8) +
geom_hline(yintercept = 16) +
theme_classic()
You just need to reverse the factor levels of Shift.
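
Neither answer keeps a single continuous 24hr axis, which is what the question describes. One possible approach, sketched here as an assumption rather than taken from the answers, is to rotate the hours so the axis starts at 08:00 and then relabel the axis with the original clock hours. The names shift_hour, axis_breaks and axis_labels are hypothetical.

```r
# Rotate the clock so that 08:00 maps to 0; hours 0-7 wrap around to 16-23.
# Note that hour 24 maps to the same position as hour 0.
shift_hour <- function(hour) (hour - 8) %% 24

# Axis positions in rotated units, labelled with the original clock hours
axis_breaks <- seq(0, 24, by = 4)
axis_labels <- (axis_breaks + 8) %% 24

example_hours   <- c(8, 16, 0, 7, 23)
example_shifted <- shift_hour(example_hours)
```

In the plot, you would map y = shift_hour(Hour) and add scale_y_continuous(breaks = axis_breaks, labels = axis_labels). The horizontal lines at 8 and 16 then still separate the three shifts, with the 8-16 shift at the bottom and the 0-8 shift at the top.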

Conditional stat_summary for ggplot in R

I'd like to write some conditional stats in my graph if the data is bigger than a certain value.
With the kind help of Jack Ryan (Cut data and access groups to draw percentile lines), I could create the following script that groups data into hours and plots the result:
# Read example data
A <- read.csv(url('http://people.ee.ethz.ch/~hoferr/download/data-20130812.csv'))
# Libraries
library(doBy)
library(ggplot2)
library(plyr)
library(reshape2)
library(MASS)
library(scales)
# Sample size function
give.n <- function(x){
return(c(y = min(x) - 0.2, label = length(x)))
}
# Calculate gaps
gaps <- rep(NA, length(A$Timestamp))
times <- A$Timestamp
loss <- A$pingLoss
gap.start <- 1
gap.end <- 1
for(i in 2:length(A$Timestamp))
{ #For all rows
if(is.na(A$pingRTT.ms.[i]))
{ #Currently no connection
if(!is.na(A$pingRTT.ms.[i-1]))
{ #Connection lost now
gap.start <- i
}
if(!is.na(A$pingRTT.ms.[i+1]))
{ # Connection restores next time
gap.end <- i+1
gaps[gap.start] <- as.numeric(A$Timestamp[gap.end]-A$Timestamp[gap.start], units="secs")
loss[gap.start] <- gap.end - gap.start
}
}
}
H <- data.frame(times, gaps, loss)
H <- H[complete.cases(H),]
C <- H
C$dates <- strptime(C$times, "%Y-%m-%d %H:%M:%S")
C$h1 <- C$dates$hour
# Calculate percentiles
cuts <- c(1, .75, .5, .25, 0)
c <- ddply(C, .(h1), function (x) { summarise(x, y = quantile(x$gaps, cuts)) } )
c$cuts <- cuts
c <- dcast(c, h1 ~ cuts, value.var = "y")
c.h1.melt <- melt(c, id.vars = "h1")
p <- ggplot(c.h1.melt, aes(x = h1, y = value, color = variable)) +
geom_point(size = 4) +
stat_summary(fun.data = max.n, geom = "text", fun.y = max, colour = "red", angle = 90, size=4) +
scale_colour_brewer(palette="RdYlBu", name="Percentile", guide = guide_legend(reverse=TRUE)) +
scale_x_continuous(breaks=0:23, limits = c(0,23)) +
annotation_logticks(sides = "lr") +
theme_bw() +
scale_y_log10(breaks=c(1e0,1e1,1e2,1e3,1e4), labels = trans_format("log10", math_format(10^.x)), limits=c(1e0,1e4)) +
xlab("Hour of day") + ylab("Ping gaps [s]")
p
p <- ggplot(c.m1.melt, aes(x = m1/60, y = value, color = variable)) +
geom_point(size = 1) +
stat_summary(fun.data = give.n, geom = "text", fun.y = median, angle = 90, size=4) +
stat_summary(fun.data = max.n, geom = "text", fun.y = max, colour = "red", angle = 90, size=4) +
scale_colour_brewer(palette="RdYlBu", name="Percentile", guide = guide_legend(reverse=TRUE)) +
scale_x_continuous(breaks=0:23, limits = c(0,24)) +
annotation_logticks(sides = "lr") +
theme_bw() +
scale_y_log10(breaks=c(1e0,1e1,1e2,1e3,1e4), labels = trans_format("log10", math_format(10^.x)), limits=c(1e0,1e4)) +
xlab("Time of day") + ylab("Ping gaps [s]")
p
This creates an hourly grouped plot of gaps with the length of the longest gaps written right next to the data points:
Below is the minutely grouped plot. The numbers are unreadable, which is why I'd like to add the stats conditionally: only if the gap is longer than 5 minutes, or only for the ten longest gaps, or something similar.
I tried to just change the stat function to
max.n.filt <- function(x){
filter = 300
if ( x > filter ) {
return(c(y = max(x) + 0.4, label = round(max(10^x),2)))
} else {
return(c(y=x, label = ""))
}
}
and use this for the minutely grouped plot. But I got this error:
Error in list_to_dataframe(res, attr(.data, "split_labels")) :
Results do not have equal lengths
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Error in if (nrow(layer_data) == 0) return() : argument is of length zero
Calls: print ... print.ggplot -> ggplot_gtable -> Map -> mapply -> <Anonymous>
In addition: Warning message:
Removed 6 rows containing missing values (geom_point).
In addition, in the hourly plot, I'd like to write the number of samples per hour right next to the length of the gaps. I think I can add a new column to the c data frame, but unfortunately I can't find a way to do this.
Any help is very much appreciated.

See ?stat_summary.
fun.data : Complete summary function. Should take data frame as input
and return data frame as output
Your function max.n.filt uses an if() statement that tries to evaluate the condition x > filter. But if() expects a single TRUE or FALSE. When length(x) > 1, older versions of R evaluate the condition for the first value of x only (with a warning), and R 4.2.0 and later raise an error instead. When used on a data frame in those older versions, this returns a list cobbled together from the original input x and whatever label the if() statement returns.
> max.n.filt(data.frame(x=c(10,15,400)))
$y.x
[1] 10 15 400
$label
[1] ""
Try a function that uses ifelse() instead:
max.n.filt2 <- function(x){
filter = 300 # whatever threshold
y = ifelse( x > filter, max(x) + 1, x[,1] )
label = ifelse( x > filter, round(max(x),2), NA )
return(data.frame(y=y[,1], label=label[,1]))
}
> max.n.filt2(data.frame(x=c(10,15,400)))
y label
1 10 NA
2 15 NA
3 401 400
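
The difference between if() and ifelse() is easier to see on a plain vector. A minimal sketch, where the gaps values and the limit of 300 are made up for illustration:

```r
gaps  <- c(10, 15, 400)
limit <- 300  # hypothetical threshold, as in the function above

# ifelse() tests every element and returns a vector of the same length
labels <- ifelse(gaps > limit, as.character(round(gaps, 2)), NA_character_)
y_pos  <- ifelse(gaps > limit, gaps + 1, gaps)
```

Only the third element exceeds the limit, so only that element gets a label and a shifted position; the other labels are NA and would not be drawn.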
Alternatively, you might just find it easier to use geom_text(). I can't reproduce your example, but here's a simulated dataset:
set.seed(101)
sim_data <- expand.grid(m1=1:1440, variable=factor(c(0,0.25,0.5,0.75,1)))
sim_data$sample_size <- sapply(1:1440, function(.) sample(1:25, 1, replace=T))
sim_data$value = t(sapply(1:1440, function(.) quantile(rgamma(sim_data$sample_size, 0.9, 0.5),c(0,0.25,0.5,0.75,1))))[1:(1440*5)]
Pass only the points you wish to label to geom_text(). Recent ggplot2 versions no longer support the older subset = .(...) argument, so the filtered rows go in through the data argument instead:
ggplot(sim_data, aes(x = m1/60, y = value, color = variable)) +
geom_point(size = 4) + geom_text(aes(label=round(value)), data = subset(sim_data, variable == 1 & value > 25), angle = 90, size = 4, colour = "red", hjust = -0.5)
If you have a column of sample sizes, those can be incorporated into label with paste():
ggplot(sim_data, aes(x = m1/60, y = value, color = variable)) +
geom_point(size = 4) + geom_text(aes(label=paste(round(value),", N=",sample_size)), data = subset(sim_data, variable == 1 & value > 25), angle = 90, size = 4, colour = "red", hjust = -0.25)
(or create a separate column in your data with whatever labels you want.) If you're asking about how to retrieve the sample sizes, you could modify your call to ddply() like this:
...
c2 <- ddply(C, .(h1), function (x) { cbind(summarise(x, y = quantile(x$gaps, cuts)), n=nrow(x)) } )
c2$cuts <- cuts
c2 <- dcast(c2, h1 + n ~ cuts, value.var = "y")
c2.h1.melt <- melt(c2, id.vars = c("h1","n"))
...
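
If you would rather not go through plyr for the sample sizes, the same idea can be sketched in base R. This is an illustration on made-up data: C_toy stands in for the data frame C from the question, and summary_df is a hypothetical name.

```r
# Toy stand-in for C: hour of day (h1) and gap length in seconds (gaps)
C_toy <- data.frame(h1   = c(0, 0, 0, 1, 1, 2),
                    gaps = c(5, 12, 30, 7, 9, 100))

# Number of gaps per hour
n_per_hour <- aggregate(gaps ~ h1, data = C_toy, FUN = length)
names(n_per_hour)[2] <- "n"

# Longest gap per hour, joined with the count and turned into a label
max_per_hour <- aggregate(gaps ~ h1, data = C_toy, FUN = max)
summary_df <- merge(max_per_hour, n_per_hour, by = "h1")
summary_df$label <- paste0(summary_df$gaps, ", N=", summary_df$n)
```

The label column can then be mapped to the label aesthetic in geom_text().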
