Finding the elbow/knee in a curve - r

I have these data:
x <- c(6.626,6.6234,6.6206,6.6008,6.5568,6.4953,6.4441,6.2186,6.0942,5.8833,5.702,5.4361,5.0501,4.744,4.1598,3.9318,3.4479,3.3462,3.108,2.8468,2.3365,2.1574,1.899,1.5644,1.3072,1.1579,0.95783,0.82376,0.67734,0.34578,0.27116,0.058285)
y <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32)
which look like:
plot(x,y)
And I want to find a way to get the elbow/knee point at around x=6.5
I thought that fitting a loess curve and then taking the second derivative may work but:
plot(x,predict(loess(y ~ x)),type="l")
doesn't look like it'll do the job.
Any idea?

I think you want to find the points where the derivative of the function y=f(x) has a huge jump in value, that is, where the second derivative is large in magnitude. You can try the following. As you can see, there can be one or many such points, depending on the threshold (for a huge jump) that we choose:
get.elbow.points.indices <- function(x, y, threshold) {
d1 <- diff(y) / diff(x) # first derivative
d2 <- diff(d1) / diff(x[-1]) # second derivative
indices <- which(abs(d2) > threshold)
return(indices)
}
# first approximate the function, since we have only a few points
ap <- approx(x, y, n=1000, yleft=min(y), yright=max(y))
x <- ap$x
y <- ap$y
indices <- get.elbow.points.indices(x, y, 1e4) # threshold for huge jump = 1e4
x[indices]
#[1] 6.612851 # there is one such point
plot(x, y, pch=19)
points(x[indices], y[indices], pch=19, col='red')
indices <- get.elbow.points.indices(x, y, 1e3) # threshold for huge jump = 1e3
x[indices]
#[1] 0.3409794 6.4353456 6.5931286 6.6128514 # there are 4 such points
plot(x, y, pch=19)
points(x[indices], y[indices], pch=19, col='red')
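To see how strongly the result depends on the threshold, one can count the flagged points over a range of thresholds. The following is a minimal sketch, assuming the data from the question and the get.elbow.points.indices function from this answer; the names xs, ys, thresholds and counts, and the particular threshold values, are only illustrative.

```r
# Data from the question
x <- c(6.626,6.6234,6.6206,6.6008,6.5568,6.4953,6.4441,6.2186,6.0942,5.8833,
       5.702,5.4361,5.0501,4.744,4.1598,3.9318,3.4479,3.3462,3.108,2.8468,
       2.3365,2.1574,1.899,1.5644,1.3072,1.1579,0.95783,0.82376,0.67734,
       0.34578,0.27116,0.058285)
y <- 1:32

get.elbow.points.indices <- function(x, y, threshold) {
  d1 <- diff(y) / diff(x)        # first derivative
  d2 <- diff(d1) / diff(x[-1])   # second derivative
  which(abs(d2) > threshold)
}

# Linear interpolation on a fine, evenly spaced grid (as in the answer)
ap <- approx(x, y, n = 1000)
xs <- ap$x
ys <- ap$y

# Number of flagged points for several arbitrarily chosen thresholds
thresholds <- c(1e2, 1e3, 1e4, 1e5)
counts <- sapply(thresholds, function(th) {
  length(get.elbow.points.indices(xs, ys, th))
})
data.frame(threshold = thresholds, n_points = counts)
```

A larger threshold can only keep the same points or drop some, so the counts never increase as the threshold grows. Picking the threshold is therefore the real modelling decision in this approach.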

You can now find the knees/elbows with different methods by using the maxcurv function in the soilphysics package.

There is also the kneepointDetection function in the SamSPECTRAL package (Bioconductor):
https://www.rdocumentation.org/packages/SamSPECTRAL/versions/1.26.0/topics/kneepointDetection
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("SamSPECTRAL")


Remove outliers by condition from list of data frames

I am trying to create a function that removes multiple outliers, identified via Cook's distance, from a list of data frames.
There are some problems at the moment:
Can I formulate part 1 as a function? I tried several things that did not work out. I want to use several different variables for the lm, so it would be great if I could use column numbers and the usual data frame indexing syntax as input arguments.
Part 2: the file names of the plots are not correct. The first observation in each data frame from the list is used as the file name. How can I correct this?
Part 3: data frames without the outliers are not created. The function ends after the message is printed. I can't find my mistake.
data(iris)
iris.lst <- split(iris[, 1:2], iris$Species)
new_names <- c(paste0(unlist(levels(iris$Species)),"_data"))
for (i in 1:length(iris.lst)) {
assign(new_names[i], iris.lst[[i]])
}
# Part 1: Then cooks distances
fit <- lapply(mget(ls(pattern = "_data")),
function(x) lm(x[,1] ~ x[,3], data = x))
cooksd <-lapply(fit,cooks.distance)
# Part 2: Plot each data frame with suspected outlier
plots <- function(x){
jpeg(file=paste0(names(x),".jpeg")) # file names are numbers
#par(mfrow=c(2,1))
plot(x, pch="*", cex=2, main="Influential cases by Cooks distance") # plot cook's distance
abline(h = 3*mean(x, na.rm=T), col="red") # add cutoff line
text(x=1:length(x)+1, y=x, labels=ifelse(x > 3*mean(x, na.rm=T),
names(x),""), col="red")
dev.off()
}
myplots <- lapply(cooksd, plots)
# Part 3: give me new data frames without influential cases
show_influential_cases <- function(x){
# invisible(cooksd[["n_OG"]] <- lapply(cooksd, length)
influential <- lapply(x,function(x) names(x)[x > 3*mean(x, na.rm=T)])
test <- as.data.frame(unlist(influential))[,1]
test <- as.numeric(test)
}
tested <- show_influential_cases(result)
cleaned_data <- add_new[-tested,] # removing outliers by indexing
Could someone please help me to improve my code?
Many thanks,
Nadine
In general, it is not good practice to create multiple data frames in the global environment. Lists are usually a better option because they are easier to manage.
Part 1 -
You can combine multiple steps in one lapply function. Here in part 1 we apply lm and cooks.distance function together in the same lapply call.
master_data <- split(iris[, 1:2], iris$Species)
data <- lapply(master_data, function(x) {
cooks.distance(lm(Sepal.Length ~ Sepal.Width, data = x))
})
new_names <- paste0(levels(iris$Species),"_data")
names(data) <- new_names
Part 2 -
lapply does not have access to the names of the list. Pass them separately and use Map to call the plots function.
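# x: named vector of Cook's distances; y: name of the data frame (used as the file name)
plots <- function(x, y){
cutoff <- 3*mean(x, na.rm=TRUE)
jpeg(file=paste0(y,".jpeg"))
plot(x, pch="*", cex=2, main="Influential cases by Cooks distance")
abline(h = cutoff, col="red") # add cutoff line
# label influential cases with their row names, as in the question
text(x=seq_along(x)+1, y=x, labels=ifelse(x > cutoff, names(x), ""), col="red")
dev.off()
}
Map(plots, data, names(data))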
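The difference between lapply and Map here can be seen on a toy list. This is only a small sketch with made-up values; vals, from_lapply and from_map are illustrative names, not part of the original code.

```r
# Toy stand-in for the list of Cook's distance vectors:
# list names are data set names, element names are observation names
vals <- list(
  setosa_data    = c("1" = 0.1, "2" = 0.5),
  virginica_data = c("101" = 0.2, "102" = 0.3, "103" = 0.9)
)

# Inside lapply, names(v) are the names of the element itself
# (the observations), not the name of the list entry
from_lapply <- lapply(vals, function(v) names(v)[1])
from_lapply$setosa_data    # "1"

# Map walks over the values and the list names in parallel
from_map <- Map(function(v, nm) paste0(nm, ".jpeg"), vals, names(vals))
from_map$setosa_data       # "setosa_data.jpeg"
```

This is why the original plots function ended up using the first observation as the file name: paste0(names(x), ".jpeg") produced one name per observation, and only the first one was used.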
Part 3 -
I am not entirely clear about how you want Part 3 to work, so for now this returns each data frame with the influential cases (Cook's distance above 3 times the mean) removed.
remove_influential_cases <- function(x, y){
inds <- x > 3*mean(x, na.rm=TRUE)
y[!inds, ]
}
result <- Map(remove_influential_cases, data, master_data)
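Regarding the wish to use different variables for the lm: one possible way, sketched here under my own assumptions, is to pass a model formula as an argument. The helper names cooks_by_group and drop_influential are hypothetical, and the formula below is just an example.

```r
# Hypothetical helper: Cook's distances per data frame for any model formula
cooks_by_group <- function(data_list, formula) {
  lapply(data_list, function(d) cooks.distance(lm(formula, data = d)))
}

# Hypothetical helper: drop rows whose Cook's distance exceeds 3 times the mean
drop_influential <- function(cd, d) {
  d[!(cd > 3 * mean(cd, na.rm = TRUE)), , drop = FALSE]
}

# Keep all four numeric columns so that different formulas can be tried
master_data <- split(iris[, 1:4], iris$Species)

cd <- cooks_by_group(master_data, Sepal.Length ~ Sepal.Width + Petal.Length)
cleaned <- Map(drop_influential, cd, master_data)

sapply(cleaned, nrow)  # rows kept per species
```

Using a formula with column names, rather than column numbers, keeps the call readable and lets lm find the variables in each data frame via its data argument.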

Understanding Sample Sizes in qqplots with R

I'm trying to plot QQ graphs with the MASS Boston data set and compare how the plots change as the number of random data points increases. I'm looking at the R documentation on qqnorm() but it doesn't seem to let me select an n value as a parameter? I'd like to plot the QQ plots of “random” samples of size 10, 100, and 1000, all from a normal distribution and for the same variable, all in a 3x1 matrix.
Example would be if I wanted to look at the QQplot for Boston Crime, how would I get
qqnorm(Boston$crim) #find how to set n = 10
qqnorm(Boston$crim) #find how to set n = 100
qqnorm(Boston$crim) #find how to set n = 1000
Also if someone could elaborate when to use qqplot() vs qqnorm(), I'd appreciate it.
I'm inclined to believe that I should use qqplot() as such, as it does seem to give me the output I want, but I want to make sure that using rnorm(n) and then using that variable as a second argument is okay to do:
x <- rnorm(10)
y <- rnorm(100)
z <- rnorm(1000)
par(mfrow = c(1,3))
qqplot(Boston$crim, x)
qqplot(Boston$crim, y)
qqplot(Boston$crim, z)
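The question is not entirely clear, but to plot samples of a vector, define a vector N of sample sizes and loop through it. The lapply loop samples from the vector, draws the Q-Q plot with its Q-Q line, and returns the qqnorm output. Note that the largest size is nrow(Boston) (506 rows) rather than 1000, because sampling without replacement cannot draw more values than the data contain.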
data(Boston, package = "MASS")
set.seed(2021)
N <- c(10, 100, nrow(Boston))
qq_list <- lapply(N, function(n){
subtitle <- paste("Sample size:", n)
i <- sample(nrow(Boston), n, replace = FALSE)
qq <- qqnorm(Boston$crim[i], sub = subtitle)
qqline(Boston$crim[i])
qq
})
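On qqnorm() versus qqplot(): qqnorm() compares one sample with the theoretical quantiles of a normal distribution, while qqplot() compares the quantiles of two samples. The sketch below uses simulated data (an assumption, to keep it self-contained) to show that qqnorm() is equivalent to calling qqplot() with exact normal quantiles as the first argument; y, qq, theo and qp are illustrative names.

```r
set.seed(1)
y <- rexp(100)                       # any numeric sample; simulated here

# qqnorm: theoretical normal quantiles (x) against the sample (y)
qq <- qqnorm(y, plot.it = FALSE)

# The same theoretical quantiles, computed by hand
theo <- qnorm(ppoints(length(y)))
all.equal(sort(qq$x), theo)          # TRUE

# qqplot: quantiles of one sample against quantiles of another
qp <- qqplot(theo, y, plot.it = FALSE)
all.equal(qp$y, sort(y))             # TRUE
```

Passing rnorm(n) as the second argument of qqplot() does work, but it replaces the exact normal quantiles with a random sample, so the plot carries extra sampling noise that changes with every run. It also does not subsample the data itself, so it does not show how the Q-Q plot of Boston$crim behaves for smaller samples.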

Can someone explain what these lines of code mean?

I have been trying to find a way to make a scatter plot with colour intensity that is indicative of the density of points plotted in the area (it's a big data set with lots of overlap). I found these lines of code which allow me to do this but I want to make sure I actually understand what each line is actually doing.
Thanks in advance :)
get_density <- function(x, y, ...){
dens <- MASS::kde2d(x, y, ...)
ix <- findInterval(x, dens$x)
iy <- findInterval(y, dens$y)
ii <- cbind(ix, iy)
return(dens$z[ii])
}
set.seed(1)
dat <- data.frame(x = subset2$conservation.phyloP, y = subset2$gene.expression.RPKM)
dat$density <- get_density(dat$x, dat$y, n = 100)
Below is the function with some explanatory comments, let me know if anything is still confusing:
# The function "get_density" takes two arguments, called x and y
# The "..." allows you to pass other arguments
get_density <- function(x, y, ...){
# The "MASS::" means it comes from the MASS package, but makes it so you don't have to load the whole MASS package and can just pull out this one function to use.
# This is where the arguments passed as "..." (above) would get passed along to the kde2d function
dens <- MASS::kde2d(x, y, ...)
# These lines use the base R function "findInterval" to find, for each x and y value, the index of the kde2d grid interval it falls into
ix <- findInterval(x, dens$x)
iy <- findInterval(y, dens$y)
# This command "cbind" pastes the two sets of values together, each as one column
ii <- cbind(ix, iy)
# Indexing the density matrix "dens$z" with this two-column matrix picks one element per point: the estimated density at (row ix, column iy)
return(dens$z[ii])
}
# The "set.seed()" function makes sure that any randomness used by a function is the same if it is re-run (as long as the same number is used), so it makes code more reproducible
set.seed(1)
dat <- data.frame(x = subset2$conservation.phyloP, y = subset2$gene.expression.RPKM)
dat$density <- get_density(dat$x, dat$y, n = 100)
If your question is about the MASS::kde2d function itself, it might be better to rewrite this StackOverflow question to reflect that!
It looks like the same function is wrapped into a ggplot2 method described here, so if you switch to making your plot with ggplot2 you could give it a try.
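The two less obvious steps in get_density, findInterval and indexing a matrix with a two-column matrix, can be tried on a tiny example. This is a sketch with made-up numbers; grid, m, idx and picked are illustrative names only.

```r
# findInterval returns, for each value, the index of the grid point
# at or just below it
grid <- c(0, 1, 2, 3)
idx <- findInterval(c(0.5, 2.2, 3), grid)
idx                      # 1 3 4

# A small matrix standing in for dens$z
m <- matrix(1:16, nrow = 4, ncol = 4)

# Indexing with a two-column matrix picks one element per row of the index:
# (row 1, col 2), (row 3, col 2), (row 4, col 4)
ii <- cbind(c(1, 3, 4), c(2, 2, 4))
picked <- m[ii]
picked                   # 5 7 16
```

In get_density the same idea is applied to every data point: ix and iy locate the point on the kde2d grid, and dens$z[ii] looks up the density estimate at that grid position.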

How to orient color scheme along z axis in R persp function?

I have a matrix called ht2. I use persp function to generate a 3D view.
ht2 <- matrix(1, 29, 36)
ht2[4:26,4:33] <- 0
ht2[6:10,6:31] <- 3
ht2[13:17,6:31] <- 3
ht2[20:24,6:31] <- 3
persp(ht2, expand=0.03, theta=25, phi=25, shade=0.75, col=terrain.colors(999,alpha=1))
This gives me:
As you can see, the color changes from green to yellow to brown along the y-axis. However, I would rather have it change along the z-axis.
I'm looking for any simple way to do that.
I found a possible solution in this site:
https://stat.ethz.ch/pipermail/r-help/2003-July/036151.html
levelpersp <- function(x, y, z, colors=topo.colors, ...) {
## getting the value of the midpoint
zz <- (z[-1,-1] + z[-1,-ncol(z)] + z[-nrow(z),-1] + z[-nrow(z),-ncol(z)])/4
## calculating the breaks
breaks <- hist(zz, plot=FALSE)$breaks
## cutting up zz
cols <- colors(length(breaks)-1)
zzz <- cut(zz, breaks=breaks, labels=cols)
## plotting
persp(x, y, z, col=as.character(zzz), ...)
## return breaks and colors for the legend
list(breaks=breaks, colors=cols)
}
## Example
x <- seq(-10, 10, length=60)
y <- x
f <- function(x,y) { r <- sqrt(x^2+y^2); 10 * sin(r)/r }
z <- outer(x, y, f)
levelpersp(x, y, z, theta = 30, phi = 30, expand = 0.5)
Perhaps someone can suggest a way to apply this to the matrix in the original question.
In principle, you just have to pass col= a matrix of the colours you want to fill the facets with. As a simple example:
col=terrain.colors(max(ht2)+1)[ht2[-1,-1]+1]
(this simple version works only because ht2 contains non-negative integers, which can be used directly as indices)
This creates all the colours needed: terrain.colors(max(ht2)+1)
and then selects one for each facet based on one of its corners: [ht2[-1,-1]+1]
Anuj Sharma's answer is basically a nicer version of this. It assumes you have decimal numbers, so it bins them (the breaks and cutting-up steps), and instead of taking one corner it uses the height at the middle of each facet (the average of the four shifted matrices in the midpoint step).
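Applied to the ht2 matrix from the question, the midpoint-and-binning idea could look roughly like this. It is a sketch, not a definitive implementation: the number of bins (n_bins) and the names zfacet, pal, bin and facet_col are my own choices.

```r
# Matrix from the question
ht2 <- matrix(1, 29, 36)
ht2[4:26, 4:33] <- 0
ht2[6:10, 6:31] <- 3
ht2[13:17, 6:31] <- 3
ht2[20:24, 6:31] <- 3

nr <- nrow(ht2)
nc <- ncol(ht2)

# Height at the middle of each facet: average of its four corners
zfacet <- (ht2[-1, -1] + ht2[-1, -nc] + ht2[-nr, -1] + ht2[-nr, -nc]) / 4

# Bin the facet heights and map each bin to a colour
n_bins <- 20                                   # arbitrary choice
pal <- terrain.colors(n_bins)
bin <- cut(zfacet, breaks = n_bins, labels = FALSE)
facet_col <- pal[bin]

# One colour per facet: (nrow - 1) * (ncol - 1) values
persp(ht2, expand = 0.03, theta = 25, phi = 25, col = facet_col)
```

Because the colour of each facet now depends only on its height, the palette runs along the z-axis instead of being recycled across the grid.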

Construct a specific plot of time series using R

My problem is this: I generate a time series from a normal distribution and plot it, but I want to color in red the positive area between the series and the x-axis, and likewise the negative area between the x-axis and the series.
This is the code I use, but it does not work:
x1<-rnorm(250,0.4,0.9)
x <- as.matrix(x1)
t <- ts(x[,1], start=c(1,1), frequency=30)
plot(t,main="Daily closing price of Walterenergie",ylab="Adjusted close Returns",xlab="Times",col="blue")
plot(t,xlim=c(2,4),main="Daily closing price of Walterenergie",ylab="Adjusted close Returns",xlab="Times",col="blue")
abline(0,0)
z1<-seq(2,4,0.001)
cord.x <- c(2,z1,4)
cord.y <- c(0,t(z1),0)
polygon(cord.x,cord.y,col='red')
Edit: In response to OP's additional query.
library(ggplot2)
df <- data.frame(t=1:nrow(x),y=x)
df$fill <- ifelse(x>0,"Above","Below")
ggplot(df)+geom_line(aes(t,y),color="grey")+
geom_ribbon(aes(x=t,ymin=0,ymax=ifelse(y>0,y,0)),fill="red")+
geom_ribbon(aes(x=t,ymin=0,ymax=ifelse(y<0,y,0)),fill="blue")+
labs(title="Daily closing price of Walterenergie",
y="Adjusted close Returns",
x="Times")
Original response:
Is this what you had in mind?
library(ggplot2)
df <- data.frame(t=1:nrow(x),y=x)
ggplot(df)+geom_line(aes(t,y),color="grey")+
geom_ribbon(aes(x=t,ymin=0,ymax=y),fill="red")+
labs(title="Daily closing price of Walterenergie",
y="Adjusted close Returns",
x="Times")
This is some code I wrote a while ago for someone. In this case, two different colors are used for the positive and negative areas. Although this is not exactly what you're after, I thought I would share it.
# Set a seed to get a reproducible example
set.seed(12345)
num.points <- 100
# Create some data
x.vals <- 1:num.points
values <- rnorm(n=num.points, mean=0, sd=10)
# Plot the graph
plot(x.vals, values, t="o", pch=20, xlab="", ylab="", las=1)
abline(h=0, col="darkgray", lwd=2)
# We need to find the intersections of the curve with the x axis
# Those lie between positive and negative points
# When the sign changes the product between subsequent elements
# will be negative
crossings <- values[-length(values)] * values[-1]
crossings <- which(crossings < 0)
# You can draw the points to check (uncomment following line)
# points(x.vals[crossings], values[crossings], col="red", pch="X")
# We now find the exact intersections using a proportion
# See? Those high school geometry problems finally come in handy
intersections <- NULL
for (cr in crossings)
{
new.int <- cr + abs(values[cr])/(abs(values[cr])+abs(values[cr+1]))
intersections <- c(intersections, new.int)
}
# Again, let's check the intersections
# points(intersections, rep(0, length(intersections)), pch=20, col="red", cex=0.7)
last.intersection <- 0
for (i in intersections)
{
ids <- which(x.vals<=i & x.vals>last.intersection)
poly.x <- c(last.intersection, x.vals[ids], i)
poly.y <- c(0, values[ids], 0)
if (max(poly.y) > 0)
{
col="green"
}
else
{
col="red"
}
polygon(x=poly.x, y=poly.y, col=col)
last.intersection <- i
}
And here's the result!
Base plotting solution:
x1<-rnorm(250,0.4,0.9)
x <- as.matrix(x1)
# t <- ts(x[,1], start=c(1,1), frequency=30)
plot(x1,main="Daily closing price of Walterenergie",ylab="Adjusted close Returns",xlab="Times",col="blue", type="l")
polygon( c(0,1:250,251), c(0, x1, 0) , col="red")
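Note that this ignores the time-series plotting method, whose x-axis is harder to work with: a ts object is plotted against time(t), which starts at 1 and advances in steps of 1/frequency (here 1/30), not against the observation index. The version for the ts object is below (it requires t to be defined as in the question):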
plot(t,main="Daily closing price of Walterenergie",
ylab="Adjusted close Returns",xlab="Times",col="blue", type="l")
polygon( c(1,1+(0:250)/30), c(0, t, 0) , col="red")
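To shade the positive and negative areas in different colours with base graphics, one option is to clip the series at zero and draw two polygons against time(). This is a sketch under my own assumptions: the series is simulated with a fixed seed, and ret, tt, vals, pos and neg are illustrative names. Like the ifelse() version with geom_ribbon, clipping only approximates the exact zero crossings between two observations.

```r
set.seed(1)
ret <- ts(rnorm(250, 0.4, 0.9), start = c(1, 1), frequency = 30)

tt   <- as.numeric(time(ret))   # x positions actually used by plot.ts
vals <- as.numeric(ret)

pos <- pmax(vals, 0)            # series clipped from below at 0
neg <- pmin(vals, 0)            # series clipped from above at 0

plot(ret, main = "Daily closing price of Walterenergie",
     ylab = "Adjusted close Returns", xlab = "Times", col = "blue")
abline(h = 0)

# Close each polygon on the x-axis at both ends
px <- c(tt[1], tt, tt[length(tt)])
polygon(px, c(0, pos, 0), col = "red",  border = NA)
polygon(px, c(0, neg, 0), col = "blue", border = NA)
lines(ret)                      # redraw the series on top of the shading
```

Using time(ret) for the x coordinates avoids having to work out the 1/frequency scaling by hand.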
