I need to scan nearly a million datapoints and determine whether they lie below or above a threshold. I have the threshold defined globally, and I have a simple predefined function
function.lower.penalty <- function(i, j){
  if( i < j ){
    #if gate condition is met, flip the gate flag:
    n <- 1
  } else {
    n <- 0
  }
  return(n)
}
that I call with mapply, which will write a 0/1 flag column in my dataframe:
df[, "outside.highpass"] <- mapply(function.lower.penalty, i = df$somesignal.found.in.df, j = df$some.threshold.found.in.df)
This is pretty straightforward, and I can flag dozens of signals with their respective thresholds like this in a second big dataframe. Also, depending on how the comparison is written, the code flags signals as either below or above the threshold (meaning I also have a function.higher.penalty).
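For what it's worth, with a single global threshold a plain vectorized comparison does the same job without mapply; a minimal sketch, using the same placeholder column names:
# vectorized equivalent of the mapply call above (column names are placeholders)
df$outside.highpass <- as.integer(df$somesignal.found.in.df < df$some.threshold.found.in.df)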
Now I was asked to make a more complex threshold that has the shape of a multisegmented line.
What is the fastest way to flag datapoints given that you have only the corner points of the multisegmented line (I can guess them from how the line was drawn), visible here?
Until now I had a predefined threshold (gray, 0.2) and used mapply to scan the signal drawn on the x-axis. I just used a function that returns 0 or 1 depending on whether a datapoint is smaller or bigger than the threshold. Now I need a multisegmented line like the one drawn in red to do the same job.
Edit: Using the suggestion from det I was able to flag datapoints in the dataframe. However, it seems that some datapoints close to the defined line are wrongly assigned, see here. I am wondering how to work around this, or whether it is a drawing error?
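A quick numeric check, sketched against the f defined in the answer below, can distinguish real misassignment from a rendering artifact:
# hypothetical check: signed offset of each point from the evaluated line
df$delta <- df$x - f(df$y)
# inspect points that sit close to the line; if the sign of delta agrees with
# the assigned flag, the flags are correct and the mismatch is only visual
head(df[abs(df$delta) < 0.05, ])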
You can create a function which returns a piecewise linear function based on the given points:
library(dplyr)  # for between() and the pipe used below

piecewiseLinear <- function(x.var, y.var){
  stopifnot(length(x.var) == length(y.var), sum(duplicated(x.var)) == 0)

  # sort the corner points by x
  p <- order(x.var)
  x.var <- x.var[p]
  y.var <- y.var[p]

  # slope and intercept of each segment
  k <- diff(y.var) / diff(x.var)
  l <- -1 * k * head(x.var, -1) + head(y.var, -1)

  function(x){
    # locate the segment each x falls into; reject values outside the range
    ind <- findInterval(x, x.var)
    if(!all(between(ind, 1, length(x.var) - 1))) stop("wrong input")
    x * k[ind] + l[ind]
  }
}
For example:
point_df <- tribble(
  ~x, ~y,
  3, 0,
  5, 2,
  3, 3,
  5, 4
)
f <- piecewiseLinear(point_df$y, point_df$x)
(in your picture you have a piecewise linear function, but with x treated as the dependent variable)
and on an example dataset you get something like this:
library(ggplot2)

set.seed(123)
tibble(
  x = runif(1000, 0, 6),
  y = runif(1000, 0, 4)
) %>%
  mutate(color = ifelse(x > f(y), "red", "blue")) %>%
  ggplot(aes(x, y)) +
  geom_point(aes(color = color)) +
  scale_color_identity() +
  geom_path(data = point_df)
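To tie this back to the question's 0/1 flag column, you can compare each point against the fitted boundary directly; a sketch, assuming the same x/y roles as above:
# 1 = point lies beyond the multisegmented line (analogous to outside.highpass)
df$outside.multiseg <- as.integer(df$x > f(df$y))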
When plotting the ratio between two variables, their relative order is often of no concern, yet depending on which variable is in the numerator, its relative size is constrained either to (0, 1) or (1, Inf), which is somewhat unintuitive and breaks symmetry. I want to plot ratios "symmetrically", without resorting to a symmetric log scale, by having a y-axis that goes 1/4, 1/3, 1/2, 1, 2, 3, 4 or, equivalently, 4^-1, 3^-1, 2^-1, 1, 2, 3, 4 at regular intervals. I've come up with the following:
library(scales)  # for trans_new() (and extended_breaks(), used below)

symmult <- function(x){
  isf <- is.finite(x) & (x > 0)
  xf <- x[isf]
  xf <- ifelse(xf >= 1,
               xf - 1,
               1 - (1 / xf))
  x[isf] <- xf
  x[!isf] <- NA
  x[!is.finite(x)] <- NA
  return(x)
}

symmultinv <- function(x){
  isf <- is.finite(x)
  xf <- x[isf]
  xf <- ifelse(x[isf] >= 0,
               x[isf] + 1,
               -1 / (x[isf] - 1))
  x[isf] <- xf
  x[!isf] <- NA
  x[!is.finite(x)] <- NA
  return(x)
}
sym_mult_trans <- function(){ trans_new("sym_mult", symmult, symmultinv) }
x <- c(-4:-2, 1:4)
x[x<1] <- 1/abs(x[x<1])
ggplot() +
geom_point(aes(x=x, y=x)) +
scale_y_continuous(trans="sym_mult")
The transformation works, but I cannot get the axis labels etc. to work for any 0 < x < 1 without setting them manually. Any help would be greatly appreciated.
You can create bespoke 'breaks' and 'format' functions that you can use inside trans_new (or pass to scale_y_continuous directly via its breaks and labels parameters).
For the breaks function, remember it will take as input a length-two numeric vector representing the range of the y axis. You must then convert this to a number of appropriate breaks. Here, if the minimum of the range is less than one, we take its reciprocal, find the pretty breaks between one and that number, then take the reciprocal of the output. We concatenate that onto pretty breaks between 1 and our range maximum:
# Define breaks function
symmult_breaks <- function(x) {
  c(1 / extended_breaks(5)(c(1 / x[x < 1], 1)),
    extended_breaks(5)(c(1, x[x >= 1])))
}
For the labelling function, remember, it needs to take as input the vector of numbers produced by our breaks function. We can paste a 1/ in front of the reciprocal of numbers less than one, but leave numbers of 1 or more unaltered:
# Define labelling function
symmult_labs <- function(x) {
  labs <- character(length(x))
  labs[x >= 1] <- as.character(x[x >= 1])
  labs[x < 1] <- paste("1", as.character(1 / x[x < 1]), sep = "/")
  labs
}
So your full new transformation becomes:
# Use our four functions to define the whole transformation:
sym_mult_trans <- function() {
  trans_new(name = "sym_mult",
            transform = symmult,
            inverse = symmultinv,
            breaks = symmult_breaks,
            format = symmult_labs)
}
And your plot becomes:
ggplot() +
geom_point(aes(x = x, y = x)) +
scale_y_continuous(trans = "sym_mult")
I have the following vector:
wss <- c(23265.2302840678, 4917.06943551649, 1330.49917983449, 288.050702912287,
216.182464712486, 203.769578557051, 151.991297068931, 139.635571841227,
118.285305833194, 117.164567420633, 105.397722980407, 95.4682187817563,
116.448588269066, 88.1287299776581, 83.9345098736843)
And if we plot it with the following code
plot(1:15, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")
we can get this:
By eye we can see that at x-axis point 4 the values stop changing drastically and begin to plateau.
My question is: given the vector wss, how can we automatically detect the index 4 without looking at the plot?
Edit: This works better:
#change relative to the maximum change
threshold <- 0.1
d1 <- diff(wss)
# this assumes that the first value is the highest
## you could use max(d1) instead of d1[1]
which.max((d1 / d1[1]) < threshold) #results in 3
d1 <- diff(wss2)  # wss2 is defined in the 'Code to create graphs' section below
which.max(d1 / d1[1] < threshold) #results in 5
Second Edit: This is somewhat subjective, but here's how my three methods compare for your two data sets. While it's easy to visualize what a plateau is, you need to be able to describe a plateau in math terminology in order to automate its detection.
Original: If you know that the second derivative will flip from positive to negative, you can do this:
sec_der <- diff(wss, differences = 2)
inflection_pt <- which.min(sign(sec_der))
inflection_pt
For this data set the result is 5, which corresponds to index 7 of the original data (i.e., 151.991), because taking second differences shortens the series by two.
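A minimal sketch of mapping that index back to the original series:
# diff(wss, differences = 2) drops the first two elements,
# so index 5 in sec_der corresponds to index 5 + 2 = 7 in wss
wss[inflection_pt + 2]  # 151.9913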
Instead of looking at inflection points, you could instead look at some relative percent threshold.
threshold <- 0.06
which.min(sign(abs(diff(wss)) / wss[1:(length(wss)-1)] - threshold))
This results in 5 as well using the first derivative approach.
Regardless, using the diff() function would be a key part of figuring this out in base R. Also see:
Finding the elbow/knee in a curve
Code to create graphs:
wss <- c(23265.2302840678, 4917.06943551649, 1330.49917983449, 288.050702912287,
216.182464712486, 203.769578557051, 151.991297068931, 139.635571841227,
118.285305833194, 117.164567420633, 105.397722980407, 95.4682187817563,
116.448588269066, 88.1287299776581, 83.9345098736843)
wss2 <- c(1970.08410513303, 936.826421218935, 463.151086710784, 310.219800983285,
          227.747583214178, 191.601552329558, 159.703151798393, 146.881710048563,
          138.699803963718, 134.534334658148)
data_list <- list(wss, wss2)
# Potential_methods -------------------------------------------------------
plateau_method = list(thresh_to_max = function(x) which.max(diff(x) / diff(x)[1] < threshold)
, inflection_pt = function(x) which.min(sign(diff(x, differences = 2)))
, deriv_to_raw = function(x) which.min(sign(abs(diff(x)) / x[1:(length(x)-1)] - threshold))
)
threshold <- 0.1
results <- t(sapply(plateau_method, mapply, data_list))
# graphing ----------------------------------------------------------------
par(mfrow = c(3,2))
apply(results, 1, function(x) {
  for (i in seq_along(x)) {
    plot(data_list[[i]], ylab = "Within groups sum of squares",
         type = 'b', xlab = 'Number of Clusters')
    abline(v = x[i])
  }
})
lapply(seq_along(names(plateau_method))
, function (i) {
mtext(paste(names(plateau_method)[i]
, "- \n"
, substring(plateau_method[i], 15))
, side = 3, line = -18*(i)+15, outer = TRUE)
})
mtext('Threshold = 0.1', side = 3, line = -53, outer = TRUE)
I have a function that uses matplot to plot some data. Data structure is like this:
test = data.frame(x = 1:10, a = 1:10, b = 11:20)
matplot(test[,-1])
matlines(test[,1], test[,-1])
So far so good. However, if there are missing values in the data set, then there are gaps in the resulting plot, and I would like to avoid those by connecting the edges of the gaps.
test$a[3:4] = NA
test$b[7] = NA
matplot(test[,-1])
matlines(test[,1], test[,-1])
In the real situation this is inside a function, the dimensions of the matrix are bigger, and the number of rows and columns and the position of the non-overlapping missing values may change between calls, so I'd like a solution that can handle this flexibly. I also need to use matlines.
I was thinking of filling the gaps with interpolated data, but maybe there is a better solution.
I came across this exact situation today, but I didn't want to interpolate values - I just wanted the lines to "span the gaps", so to speak. I came up with a solution that, in my opinion, is more elegant than interpolating, so I thought I'd post it even though the question is rather old.
The problem causing the gaps is that there are NAs between consecutive values. So my solution is to 'shift' the column values so that there are no NA gaps. For example, a column consisting of c(1,2,NA,NA,5) would become c(1,2,5,NA,NA). I do this with a function called shift_vec_na() in an apply() loop. The x values also need to be adjusted, so we can make the x values into a matrix using the same principle, but using the columns of the y matrix to determine which values to shift.
Here's the code for the functions:
# x -> vector
# bool -> boolean vector; must be same length as x. The values of x where bool
# is TRUE will be 'shifted' to the front of the vector, and the back of the
# vector will be all NA (i.e. the number of NAs in the resulting vector is
# sum(!bool))
# returns the 'shifted' vector (will be the same length as x)
shift_vec_na <- function(x, bool){
  n <- sum(bool)
  if(n < length(x)){
    x[1:n] <- x[bool]
    x[(n + 1):length(x)] <- NA
  }
  return(x)
}
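A quick check with the example from above:
v <- c(1, 2, NA, NA, 5)
shift_vec_na(v, !is.na(v))
#> [1]  1  2  5 NA NA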
# x -> vector
# y -> matrix, where nrow(y) == length(x)
# returns a list of two elements ('x' and 'y') that contain the 'adjusted'
# values that can be used with 'matplot()'
adj_data_matplot <- function(x, y){
  y2 <- apply(y, 2, function(col_i){
    return(shift_vec_na(col_i, !is.na(col_i)))
  })
  x2 <- apply(y, 2, function(col_i){
    return(shift_vec_na(x, !is.na(col_i)))
  })
  return(list(x = x2, y = y2))
}
Then, using the sample data:
test <- data.frame(x = 1:10, a = 1:10, b = 11:20)
test$a[3:4] <- NA
test$b[7] <- NA
lst <- adj_data_matplot(test[,1], test[,-1])
matplot(lst$x, lst$y, type = "b")
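Since the question also needs matlines(), the same adjusted data should work there; a minimal sketch:
# points first, then the gap-spanning lines
matplot(lst$x, lst$y, type = "p", pch = 1)
matlines(lst$x, lst$y)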
You could use the na.interpolation function from the imputeTS package:
test = data.frame(x = 1:10, a = 1:10, b = 11:20)
test$a[3:4] = NA
test$b[7] = NA
matplot(test[,-1])
matlines(test[,1], test[,-1])
library('imputeTS')
test <- na.interpolation(test, option = "linear")
matplot(test[,-1])
matlines(test[,1], test[,-1])
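Note that newer versions of imputeTS (3.x and later) renamed the function to na_interpolation(); assuming such a version, the equivalent call would be:
# imputeTS >= 3.0 uses snake_case names
test <- na_interpolation(test, option = "linear")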
I also had the same issue today. In my context I was not permitted to interpolate. I am providing here a minimal but sufficiently general working example of what I did. I hope it helps someone:
mymatplot <- function(data, main=NULL, xlab=NULL, ylab=NULL, ...){
  # graphical set-up of the window
  plot.new()
  plot.window(xlim=c(1, ncol(data)), ylim=range(data, na.rm=TRUE))
  mtext(text = xlab, side = 1, line = 3)
  mtext(text = ylab, side = 2, line = 3)
  mtext(text = main, side = 3, line = 0)
  axis(1L)
  axis(2L)
  # plot the data: one line per row, skipping NAs
  for(i in 1:nrow(data)){
    nin.na <- !is.na(data[i,])
    lines(x=which(nin.na), y=data[i, nin.na], col = i, ...)
  }
}
The core 'trick' is in x=which(nin.na). It aligns the data points of the line consistently with the indices of the x axis.
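For example, for a single row:
row <- c(5, NA, 7, NA, 9)
nin.na <- !is.na(row)
which(nin.na)   # 1 3 5 -> the x positions the line passes through
row[nin.na]     # 5 7 9 -> the matching y values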
The lines
plot.new()
plot.window(xlim=c(1,ncol(data)), ylim=range(data, na.rm=TRUE))
mtext(text = xlab,side = 1, line = 3)
mtext(text = ylab,side = 2, line = 3)
mtext(text = main,side = 3, line = 0)
axis(1L)
axis(2L)
draw the graphical part of the window.
range(data, na.rm=TRUE) adapts the plot to a proper size being able to include all data points.
mtext(...) is used to label the axes and provides the main title. The axes themselves are drawn by the axis(...) command.
The following for-loop plots the data.
The function header of mymatplot provides the ... argument for optionally passing typical plot parameters such as lty, lwd, cex, etc. These are passed on to lines().
A last word on the choice of colors: they are up to your taste.
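As a usage sketch with the example data from the earlier answers (note that mymatplot draws one line per row, so series stored in columns need transposing):
test <- data.frame(x = 1:10, a = 1:10, b = 11:20)
test$a[3:4] <- NA
test$b[7] <- NA
# transpose so each series (a, b) becomes a row
mymatplot(t(as.matrix(test[, -1])), main = "gap-spanning lines",
          xlab = "index", ylab = "value", lwd = 2)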
I have been working on a project for which I need to find peaks and valleys in a dataset (not just the highest number per column, but all of the peaks and valleys).
I did manage to get it to work on one column, but I used a for-loop for that, and I need to do this for about 50 columns, so I think I should use an 'apply' function. I just don't know how. Can I put 'if' statements and such in an 'apply' function?
Here is what I used for checking one column:
('First' is the name of the dataset and 'Seq1' is the first column)
Lowest <- 0
Highest <- 0
Summits <- vector('numeric')
Valleys <- vector('numeric')
for (i in 1:length(First$Seq1))
{
  if (!is.na(First$Seq1[i+1]))
  {
    if (First$Seq1[i] < Lowest)  {Lowest  <- First$Seq1[i]}
    if (First$Seq1[i] > Highest) {Highest <- First$Seq1[i]}
    if (First$Seq1[i] > 0 && First$Seq1[i+1] < 0)
      { Summits <- append(Summits, Highest, after=length(Summits)) }
    if (First$Seq1[i] < 0 && First$Seq1[i+1] > 0)
      { Valleys <- append(Valleys, Lowest, after=length(Valleys)) }
  }
}
Sure you can! I would first define a helper function that does what you need for one specific column, and then call that function within apply:
HelperFun <- function(x) {
  # your code from above, replacing 'Seq1' by x
}
apply(First, 2, HelperFun)
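For instance, a hypothetical helper (an illustration, not the asker's exact Summits/Valleys logic) that flags local peaks and valleys via sign changes of diff():
find_extrema <- function(x) {
  d <- diff(sign(diff(x)))          # turning points of the series
  list(peaks   = which(d < 0) + 1,  # indices of local maxima
       valleys = which(d > 0) + 1)  # indices of local minima
}
# one result per column; lapply keeps list results per column and avoids
# apply()'s coercion of the data frame to a matrix
extrema <- lapply(First, find_extrema)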
An *apply function is not better for this than a for loop, provided you don't grow an object in the for loop. You must never use append in a loop. Pre-allocate your results object and fill it.
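As a minimal illustration of that pattern (with a hypothetical filler computation):
n <- length(First$Seq1)
res <- numeric(n)             # pre-allocate once
for (i in seq_len(n)) {
  res[i] <- First$Seq1[i]^2   # fill by index instead of append()-ing
}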
This finds all local minima on a grid:
#an example
set.seed(42)
plane <- matrix(rnorm(100, sd = 5), 10)
#plot
library(raster)
plot(raster(plane))
#initialize a logical matrix
res <- matrix(TRUE, ncol = ncol(plane), nrow = nrow(plane))
#check for each subgrid of 2 times 2 cells which of the cells is the minimum
for (i in 1:(nrow(plane) - 1)) {
  for (j in 1:(ncol(plane) - 1)) {
    inds <- as.matrix(expand.grid(r = i + 0:1, c = j + 0:1))
    #cell must be a minimum of all 4 subgrids it is part of
    res[inds] <- res[inds] & plane[inds] == min(plane[inds])
  }
}
print(res)
plane[res]
#[1] -13.282277 -8.906542 -8.585043 -12.071038 -3.919195 -14.965450 -5.215595 -5.498904 -5.971644 -2.380870 -7.296070
#highlight local minima
plot(rasterToPolygons(raster(res)), border = t(res), add = TRUE)
library(reshape2)
res1 <- melt(res)
res1 <- res1[res1$value,]
text(x = res1$Var2 / 10 - 0.05,
     y = 1 - res1$Var1 / 10 + 0.05,
     labels = round(plane[res], 1))
I've assumed here that diagonal neighbors are counted as neighbors and not only neighbors in the same column or row. But this would be trivial to change.
I know that this is not the solution you want (you have one-dimensional time series), but here is a (more direct) variation on Roland's solution.
#example data
set.seed(42)
plane <- matrix(rnorm(100, sd = 5), 10)
library(raster)
r <- raster(plane)
f <- focal(r, matrix(1,3,3), min, pad=TRUE, na.rm=TRUE)
x <- r == f
mins <- mask(r, x, maskvalue=FALSE)
pts <- rasterToPoints(mins)
cells <- cellFromXY(x, pts)
r[cells]
plot(r)
text(mins, digits=1)
plot(rasterToPolygons(mins), add=TRUE)
I'm trying to plot a dataset over time (on a timeframe of ms/s). I need to show the order of events, the type of event, and the duration of each event plus the time between events. The dataset consists of a start time, an end time, and a category.
I got close with this code, which someone used to answer a similar question back in '11, but I found that I couldn't get it to colour the events according to the category, and I don't understand what the code is doing well enough to fix the issue.
zucchini <- function(st, en, mingap = 1)
{
  # sort intervals by start time (ties broken by duration)
  i <- order(st, en - st)
  st <- st[i]
  en <- en[i]
  last <- r <- 1
  # greedily collect non-overlapping intervals for the current row
  while( sum( ok <- (st > (en[last] + mingap)) ) > 0 )
  {
    last <- which(ok)[1]
    r <- append(r, last)
  }
  if( length(r) == length(st) )
    return( list(c = list(st[r], en[r]), n = 1) )
  # recurse on the remaining intervals for the next row
  ne <- zucchini(st[-r], en[-r])
  return( list(c = c(list(st[r], en[r]), ne$c), n = ne$n + 1) )
}
coliflore <- function(st, en)
{
  zu <- zucchini(st, en, mingap = 1)
  plot.new()
  plot.window(xlim = c(min(st), max(en)), ylim = c(0, zu$n + 1))
  box(); axis(1)
  for(i in seq(1, 2*zu$n, 2))
  {
    x1 <- zu$c[[i]]
    x2 <- zu$c[[i+1]]
    for(j in 1:length(x1))
      rect(x1[j], (i+1)/2, x2[j], (i+1)/2 + 0.5, col = data$Type, border = "black")
  }
  legend('bottomright', legend = levels(data$Type), col = 1:10, cex = 0.8, pch = 1)
}
st <- data$Time
en <- data$End
coliflore(st,en)
The current code outputs this. As best as I can tell, it is assigning all boxes the same colour: that of the category of the first data point.
Does anyone know either: how to get this code to assign colours to the boxes based on a category, or how to accomplish this kind of plotting another way?
It's a little hard for me to see what's going on without a toy dataset for your example. For maximum control over coloring in plots, I like to add a color column to the dataframe, or create a vector of color values for use in plotting, instead of using the factor levels to generate colors (e.g. data$Type). For instance, if I want factors 1:3 to be red, green, and blue:
# create data frame with X, Y coordinates and 3 factor levels
toy_data <- data.frame(X = 1:9, Y = 9:1, Factor = rep(1:3, times = 3))

# color lookup function: maps factor values 1:3 to color names
colFxn <- function(val){
  cw_df <- data.frame(value = 1:3, color = c("red", "green", "blue"))
  return(cw_df[cw_df$value %in% val, ]$color)
}

# create a vector of colors to use for plotting
col_vec <- sapply(toy_data$Factor, colFxn)

# plot
plot(toy_data$X, toy_data$Y, col = col_vec)
I prefer this option because of the control I have over my colors. It can also be extended to transparent colors by changing the alpha value with the rgb() function, or by using a color palette available through many packages.
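For example, a semi-transparent variant via rgb(), where alpha runs from 0 (fully transparent) to 1 (opaque):
# 50% transparent versions of the same three colors, indexed by factor value
col_vec_alpha <- c(rgb(1, 0, 0, alpha = 0.5),
                   rgb(0, 1, 0, alpha = 0.5),
                   rgb(0, 0, 1, alpha = 0.5))[toy_data$Factor]
plot(toy_data$X, toy_data$Y, col = col_vec_alpha, pch = 19)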