Can someone explain what these lines of code mean? - r

I have been trying to find a way to make a scatter plot whose colour intensity indicates the density of points plotted in each area (it's a big data set with lots of overlap). I found these lines of code, which allow me to do this, but I want to make sure I understand what each line is doing.
Thanks in advance :)
get_density <- function(x, y, ...){
dens <- MASS::kde2d(x, y, ...)
ix <- findInterval(x, dens$x)
iy <- findInterval(y, dens$y)
ii <- cbind(ix, iy)
return(dens$z[ii])
}
set.seed(1)
dat <- data.frame(x = subset2$conservation.phyloP, y = subset2$gene.expression.RPKM)
dat$density <- get_density(dat$x, dat$y, n = 100)

Below is the function with some explanatory comments, let me know if anything is still confusing:
# The function "get_density" takes two arguments, called x and y
# The "..." allows you to pass other arguments
get_density <- function(x, y, ...){
# The "MASS::" prefix calls kde2d directly from the MASS package, so you don't have to attach the whole package with library(MASS).
# This is where the arguments passed as "..." (above) would get passed along to the kde2d function
dens <- MASS::kde2d(x, y, ...)
# These lines use the base R function "findInterval" to find, for each x and y value, the index of the grid interval (in dens$x and dens$y) that the value falls into
ix <- findInterval(x, dens$x)
iy <- findInterval(y, dens$y)
# This command "cbind" pastes the two sets of values together, each as one column
ii <- cbind(ix, iy)
# Indexing the density matrix dens$z with this two-column matrix returns one value per point: the estimated density at that point's grid cell (ix, iy)
return(dens$z[ii])
}
# The "set.seed()" function makes sure that any randomness used by a function is the same if it is re-run (as long as the same number is used), so it makes code more reproducible
set.seed(1)
dat <- data.frame(x = subset2$conservation.phyloP, y = subset2$gene.expression.RPKM)
dat$density <- get_density(dat$x, dat$y, n = 100)
If your question is about the MASS::kde2d function itself, it might be better to rewrite this StackOverflow question to reflect that!
It looks like the same function is wrapped into a ggplot2 method described here, so if you switch to making your plot with ggplot2 you could give it a try.
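To see what the findInterval step and the matrix indexing step in get_density do, here is a toy sketch in base R. The grid and matrix are made up for illustration and are not real kde2d output.

```r
# Hypothetical grid of break points, standing in for dens$x or dens$y
grid <- c(0, 1, 2, 3)

# findInterval() returns, for each value, the index of the grid interval it falls in
ix <- findInterval(c(0.5, 2.2, 3), grid)
ix
# [1] 1 3 4

# Hypothetical 4 x 4 matrix, standing in for dens$z
z <- matrix(1:16, nrow = 4)

# Indexing a matrix with a two-column matrix picks one cell per row of the index:
# here z[1, 2] and z[3, 4]
ii <- cbind(c(1, 3), c(2, 4))
picked <- z[ii]
picked
# [1]  5 15
```

In get_density the same idea is applied to every data point, so each point gets the density value of the grid cell it lands in.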

Related

Understanding Sample Sizes in qqplots with R

I'm trying to plot QQ graphs with the MASS Boston data set and comparing how the plots will change with increased random data points. I'm looking at the R documentation on qqnorm() but it doesn't seem to let me select an n value as a parameter? I'd like to plot the QQplots of “random” samples of size 10, 100, and 1000 samples all from a normal distribution for the same variable all in a 3x1 matrix.
Example would be if I wanted to look at the QQplot for Boston Crime, how would I get
qqnorm(Boston$crim) #find how to set n = 10
qqnorm(Boston$crim) #find how to set n = 100
qqnorm(Boston$crim) #find how to set n = 1000
Also if someone could elaborate when to use qqplot() vs qqnorm(), I'd appreciate it.
I'm inclined to believe that I should use qqplot() as such, as it does seem to give me the output I want, but I want to make sure that using rnorm(n) and then using that variable as a second argument is okay to do:
x <- rnorm(10)
y <- rnorm(100)
z <- rnorm(1000)
par(mfrow = c(1,3))
qqplot(Boston$crim, x)
qqplot(Boston$crim, y)
qqplot(Boston$crim, z)
The question is not clear but to plot samples of a vector, define a vector N of sample sizes and loop through it. The lapply loop will sample from the vector, plot with the Q-Q line and return the qqnorm plot.
data(Boston, package = "MASS")
set.seed(2021)
N <- c(10, 100, nrow(Boston))
qq_list <- lapply(N, function(n){
subtitle <- paste("Sample size:", n)
i <- sample(nrow(Boston), n, replace = FALSE)
qq <- qqnorm(Boston$crim[i], sub = subtitle)
qqline(Boston$crim[i])
qq
})
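On the qqplot() versus qqnorm() part of the question: qqnorm() compares one sample against theoretical normal quantiles, while qqplot() compares two samples against each other. A minimal sketch, using simulated skewed data as a stand-in for Boston$crim (the variable names here are illustrative, not from the question):

```r
set.seed(2021)
crim_like <- rexp(100)  # simulated right-skewed data, NOT the Boston data

# qqnorm(): sample quantiles against theoretical normal quantiles
qn <- qqnorm(crim_like, plot.it = FALSE)

# qqplot() against the same theoretical quantiles gives the same points, sorted
qp <- qqplot(qnorm(ppoints(length(crim_like))), crim_like, plot.it = FALSE)

all.equal(sort(qn$x), qp$x)
all.equal(sort(qn$y), qp$y)
```

Passing rnorm(n) as the second argument to qqplot() also runs, but it compares the data with a random normal sample rather than the exact normal quantiles, so the plot picks up extra sampling noise. For a normality check, qqnorm() is usually the more direct choice.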

How to use plot function to plot results of your own function?

I'm writing a short R package which contains a function. The function returns a list of vectors. I would like to use the plot function in order to plot by default a plot done with some of those vectors, add lines and add a new parameter.
As an example, if I use the survival package I can get the following:
library(survival)
data <- survfit(Surv(time, status == 2) ~ 1, data = pbc)
plot(data) # Plots the result of survfit
plot(data, conf.int = "none") # New parameter
In order to try to make a reproducible example:
f <- function(x, y){
b <- x^2
c <- y^2
d <- x+y
return(list(one = b, two = c, three = d))
}
dat <- f(3, 2)
So using plot(dat) I would like to get the same as plot(dat$one, dat$two). I would also like to add one more (new) parameter that could be set to TRUE/FALSE.
Is this possible?
I think you might be looking for classes. You can use the S3 system for this.
For your survival example, data has the class survfit (see class(data)). Then using plot(data) will look for a function called plot.survfit. That is actually a non-exported function in the survival package, at survival:::plot.survfit.
You can easily do the same for your package. For example, have a function that creates an object of class my_class, and then define a plotting method for that class:
f <- function(x, y){
b <- x^2
c <- y^2
d <- x+y
r <- list(one = b, two = c, three = d)
class(r) <- c('my_class', 'list') # this is the important bit: the most specific class goes first.
r
}
plot.my_class <- function(x, ...) {
# "..." matches the plot() generic and lets callers pass graphical parameters through
plot(x$one, x$two, ...)
}
Now your code should work:
dat <- f(3, 2)
plot(dat)
You can put anything in plot.my_class you want, including additional arguments, as long as your first argument is x and is the my_class object.
plot now calls plot.my_class, since dat is of class my_class.
You can also add other methods, e.g. for print.
There are many different plotting functions that can be called with plot for different classes, see methods(plot)
Also see Hadley's Advanced R book chapter on S3.
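As a sketch of the "extra TRUE/FALSE parameter" part of the question, the method can take its own argument alongside "...". The argument name show_three and the reference line it draws are made up for illustration.

```r
# Constructor: returns a list tagged with a hypothetical class name
f <- function(x, y) {
  structure(list(one = x^2, two = y^2, three = x + y), class = "my_class")
}

# Plot method with an extra TRUE/FALSE argument; 'show_three' is a made-up name
plot.my_class <- function(x, show_three = FALSE, ...) {
  plot(x$one, x$two, ...)
  if (show_three) {
    # add a horizontal reference line at the value stored in 'three'
    abline(h = x$three, lty = 2)
  }
  invisible(x)
}

dat <- f(3, 2)
plot(dat)                     # same as plot(dat$one, dat$two)
plot(dat, show_three = TRUE)  # same plot, plus the extra line if it is in range
```

Inside a package, the method also has to be registered in the NAMESPACE file, for example with S3method(plot, my_class).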

Finding the elbow/knee in a curve

I have these data:
x <- c(6.626,6.6234,6.6206,6.6008,6.5568,6.4953,6.4441,6.2186,6.0942,5.8833,5.702,5.4361,5.0501,4.744,4.1598,3.9318,3.4479,3.3462,3.108,2.8468,2.3365,2.1574,1.899,1.5644,1.3072,1.1579,0.95783,0.82376,0.67734,0.34578,0.27116,0.058285)
y <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32)
which look like:
plot(x,y)
And I want to find a way to get the elbow/knee point at around x=6.5
I thought that fitting a loess curve and then taking the second derivative may work but:
plot(x,predict(loess(y ~ x)),type="l")
doesn't look like it'll do the job.
Any idea?
I think you want to find the points where the derivative of the function y=f(x) has a huge jump in value. You can try the following. As you can see, there can be one or many such points, depending on the threshold (for a huge jump) we choose:
get.elbow.points.indices <- function(x, y, threshold) {
d1 <- diff(y) / diff(x) # first derivative
d2 <- diff(d1) / diff(x[-1]) # second derivative
indices <- which(abs(d2) > threshold)
return(indices)
}
# first approximate the function, since we have only a few points
ap <- approx(x, y, n=1000, yleft=min(y), yright=max(y))
x <- ap$x
y <- ap$y
indices <- get.elbow.points.indices(x, y, 1e4) # threshold for huge jump = 1e4
x[indices]
#[1] 6.612851 # there is one such point
plot(x, y, pch=19)
points(x[indices], y[indices], pch=19, col='red')
indices <- get.elbow.points.indices(x, y, 1e3) # threshold for huge jump = 1e3
x[indices]
#[1] 0.3409794 6.4353456 6.5931286 6.6128514 # there are 4 such points
plot(x, y, pch=19)
points(x[indices], y[indices], pch=19, col='red')
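To make the finite-difference idea easier to check by hand, here is the same function applied to toy data with a single known kink. The toy data and the names x_toy and y_toy are made up for illustration.

```r
get.elbow.points.indices <- function(x, y, threshold) {
  d1 <- diff(y) / diff(x)        # slope of each segment
  d2 <- diff(d1) / diff(x[-1])   # change in slope between neighbouring segments
  which(abs(d2) > threshold)
}

# Toy data: slope 1 up to x = 5, slope 10 afterwards
x_toy <- 0:10
y_toy <- ifelse(x_toy < 5, x_toy, 5 + 10 * (x_toy - 5))

idx <- get.elbow.points.indices(x_toy, y_toy, threshold = 1)
idx
# [1] 5

# d2[i] compares the segments on either side of x[i + 1],
# so the kink itself sits at x_toy[idx + 1]
x_toy[idx + 1]
# [1] 5
```

With the 1000 interpolated points used above, this one-position offset is too small to matter, but it is worth knowing about on coarse data.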
You can now find the knees/elbows with different methods by using the maxcurv function in the soilphysics package.
There is this library
https://www.rdocumentation.org/packages/SamSPECTRAL/versions/1.26.0/topics/kneepointDetection
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("SamSPECTRAL")

R: display values in levelplot stratified by a grouping variable

In this following example, I need to display the values for each of the cells in each of the panels stratified by the grouping variable class:
library("lattice")
x <- seq(pi/4, 5*pi, length.out=5)
y <- seq(pi/4, 5*pi, length.out=5)
r1 <- as.vector(sqrt(outer(x^2, y^2, "+")))
r2 <- as.vector(sqrt(outer(x^2, y^2, "/")))
grid1 <- grid2 <- expand.grid(x=x, y=y)
grid1$z <- cos(r1^2)*exp(-r1/(pi^3))
grid2$z <- cos(r2^2)*exp(-r2/(pi^3))
grid <- rbind(grid1, grid2)
grid$class <- c(rep("addition",length(x)^2), rep("division", length(x)^2))
p <- levelplot(z~x*y | factor(class), grid,
panel=function(...) {
arg <- list(...)
panel.levelplot(...)
panel.text(arg$x, arg$y, round(arg$z,1))})
print(p)
However, the cell values are superimposed on each other because the panel option does not distinguish between the two groups. How can I get the values to display correctly in each group?
Slightly behind the scenes, lattice uses an argument called subscripts to subset data for display in different panels. Often, it does so without you needing to be aware of it, but this is not one of those cases.
A look at the source code for panel.levelplot reveals that it handles subscripts on its own. args(panel.levelplot) shows that it's among the function's formal arguments, and the function's body shows how it uses them.
panel.text(), (really just a wrapper for lattice:::ltext.default()), on the other hand, doesn't know about or do anything with subscripts. From within a call to panel.text(x,y,z), the x, y, and z that are seen are the full columns of the data.frame grid, which is why you saw the overplotting that you did.
To plot text for the values that are a part of the current panel, you need to make explicit use of the subscripts argument, like this:
myPanel <- function(x, y, z, ..., subscripts) {
panel.levelplot(x=x, y=y, z=z, ..., subscripts=subscripts)
panel.text(x = x[subscripts],
y = y[subscripts],
labels = round(z[subscripts], 1))
}
p <- levelplot(z~x*y | factor(class), grid, panel = myPanel)
print(p)
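To see what subscripts contains, the sketch below records it from inside the panel function on a small made-up grid. The data frame toy and the list seen are illustrative names.

```r
library(lattice)

# Small made-up grid with two groups of 4 cells each
toy <- expand.grid(x = 1:2, y = 1:2, class = c("a", "b"))
toy$z <- seq_len(nrow(toy))

seen <- list()  # filled in by the panel function below
p_toy <- levelplot(z ~ x * y | class, toy,
  panel = function(x, y, z, ..., subscripts) {
    seen[[length(seen) + 1]] <<- subscripts   # record this panel's row indices
    panel.levelplot(x = x, y = y, z = z, ..., subscripts = subscripts)
    panel.text(x[subscripts], y[subscripts], labels = z[subscripts])
  })
print(p_toy)

length(toy$z)  # the panel function is handed all 8 values of z ...
seen           # ... and subscripts says which rows belong to each panel
```

Each panel should report only the row indices of its own group, which is why indexing x, y and z by subscripts stops the labels from overlapping.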

How can I auto-title a plot with the R call that produced it?

R's plotting is great for data exploration, as it often has very intelligent defaults. For example, when plotting with a formula the labels for the plot axes are derived from the formula. In other words, the following two calls produce the same output:
plot(x~y)
plot(x~y, xlab="x", ylab="y")
Is there any way to get a similar "intelligent auto-title"?
For example, I would like to call
plot(x~y, main=<something>)
And produce the same output as calling
plot(x~y, main="plot(x~y)")
Where the <something> inserts the call used using some kind of introspection.
Is there a facility for doing this in R, either through some standard mechanism or an external package?
edit: One suggestion was to specify the formula as a string, and supply that as the argument to a formula() call as well as main. This is useful, but it misses out on parameters that can affect a plot, such as using subsets of data. To elaborate, I'd like
x<-c(1,2,3)
y<-c(1,2,3)
z<-c(0,0,1)
d<-data.frame(x,y,z)
plot(x~y, subset(d, z==0), main=<something>)
To have the same effect as
plot(x~y, subset(d, z==0), main="plot(x~y, subset(d, z==0))")
I don't think this can be done without writing a thin wrapper around plot(). The reason is that R evaluates "supplied arguments" in the evaluation frame of the calling function, in which there's no way to access the current function call (see here for details).
By contrast, "default arguments" are evaluated in the evaluation frame of the function, from where introspection is possible. Here are a couple of possibilities (differing just in whether you want "myPlot" or "plot" to appear in the title):
## Function that reports actual call to itself (i.e. 'myPlot()') in plot title.
myPlot <- function(x,...) {
cl <- deparse(sys.call())
plot(x, main=cl, ...)
}
## Function that 'lies' and says that plot() (rather than myPlot2()) called it.
myPlot2 <- function(x,...) {
cl <- sys.call()
cl[[1]] <- as.symbol("plot")
cl <- deparse(cl)
plot(x, main=cl, ...)
}
## Try them out
x <- 1:10
y <- 1:10
par(mfcol=c(1,2))
myPlot(x,y)
myPlot2(y~x)
Here's a more general solution:
plotCaller <- function(plotCall, ...) {
main <- deparse(substitute(plotCall))
main <- paste(main, collapse="\n")
eval(as.call(c(as.list(substitute(plotCall)), main=main, ...)))
}
## Try _it_ out
plotCaller(hist(rnorm(9999), breaks=100, col="red"))
library(lattice)
plotCaller(xyplot(rnorm(10)~1:10, pch=16))
## plotCaller will also pass through additional arguments, so they take effect
## without being displayed
plotCaller(xyplot(rnorm(10)~1:10), pch=16)
deparse will attempt to break deparsed lines if they get too long (the default is 60 characters). When it does this, it returns a vector of strings. plot methods assume that 'main' is a single string, so the line main <- paste(main, collapse='\n') deals with this by concatenating all the strings returned by deparse, joining them using \n.
Here is an example of where this is necessary:
plotCaller(hist(rnorm(9999), breaks=100, col="red", xlab="a rather long label",
ylab="yet another long label"))
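The splitting behaviour of deparse can be checked without drawing anything. A minimal sketch, using quote() to hold the same long call (the variable names are illustrative):

```r
long_call <- quote(hist(rnorm(9999), breaks = 100, col = "red",
                        xlab = "a rather long label",
                        ylab = "yet another long label"))

pieces <- deparse(long_call)
length(pieces) > 1   # TRUE: the deparsed call is split over several strings

title <- paste(pieces, collapse = "\n")
length(title)        # 1: a single string, safe to pass as 'main'
```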
Of course there is! Here ya go:
x = rnorm(100)
y = sin(x)
something = "y~x"
plot(formula(something),main=something)
You might be thinking of match.call. However, that only works when it is called inside a function, not when it is passed in as an argument. You could write a wrapper function that calls match.call and passes everything else on to plot, or use substitute to capture the call and add the title to it before evaluating:
x <- runif(25)
y <- rnorm(25, x, .1)
myplot <- function(...) {
tmp <- match.call()
plot(..., main=deparse(tmp))
}
myplot( y~x )
myplot( y~x, xlim=c(-.25,1.25) )
## or
myplot2 <- function(FUN) {
tmp1 <- substitute(FUN)
tmp2 <- deparse(tmp1)
tmp3 <- as.list(tmp1)
tmp4 <- as.call(c(tmp3, main=tmp2))
eval(tmp4)
}
myplot2( plot(y~x) )
myplot2( plot(y~x, xlim=c(-.25,1.25) ) )
