Subscript after superscript using bquote in R

I am trying to plot a specific R-squared metric, the R2 relative to the 1:1 line. Here is some code to generate data and a plot. (I realize I am only calculating the standard R-squared here, but that's fine for the purposes of working out the code.)
#generate data.
set.seed(1234)
x <- rnorm(100)
y <- x*0.7 + rnorm(100)
mod <- lm(y~x)
#develop rsq label.
rsq <- round(summary(mod)$r.squared, 2)
rsq.1.lab <- bquote(R^2 [1:1] == .(rsq))
#drop plot and rsq label.
plot(y ~ x)
mtext(rsq.1.lab, side = 3, line = -2, adj = 0.05)
The plot looks like this:
This is pretty close, but the subscript is actually relative to the superscript, rather than being relative to the letter R. How can I change this? Looking for solutions that use base R, ideally keeping bquote().

Use { } grouping so that the subscript attaches to the whole group:
rsq.1.lab <- bquote({R^2} [1:1] == .(rsq))
or
rsq.1.lab <- bquote({R [1:1]}^2 == .(rsq))
or even a somewhat ridiculous
rsq.1.lab <- bquote(R * atop(2, "1:1") == .(rsq))
though we can reduce the font size a little using
rsq.1.lab <- bquote(R * scriptstyle(atop(2, "1:1")) == .(rsq))
Much of this is suggested/documented in ?plotmath.
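To see why the braces matter, it can help to inspect the expression that bquote() builds. The sketch below is only an illustration (the value 0.42 is a made-up stand-in for rsq); it relies on R's operator precedence, where `[` binds more tightly than `^`, so without braces the subscript is attached to the 2 rather than to R^2.

```r
# Hypothetical value standing in for the computed R-squared.
rsq <- 0.42

# Without grouping: parsed as R ^ (2[1:1]), so the subscript belongs to the 2.
ungrouped <- bquote(R^2 [1:1] == .(rsq))

# With grouping: parsed as {R^2}[1:1], so the subscript belongs to R^2.
grouped <- bquote({R^2} [1:1] == .(rsq))

# The left-hand side of `==` is element 2 of each call; its first element
# is the outermost operator.
top_ungrouped <- as.character(ungrouped[[2]][[1]])
top_grouped <- as.character(grouped[[2]][[1]])
print(c(ungrouped = top_ungrouped, grouped = top_grouped))
```

The outermost operator differs between the two calls, which is what plotmath then follows when it positions the subscript.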

Related

Understanding Sample Sizes in qqplots with R

I'm trying to plot QQ graphs with the MASS Boston data set and compare how the plots change as the number of random data points increases. I'm looking at the R documentation on qqnorm(), but it doesn't seem to let me select an n value as a parameter. I'd like to plot the QQ plots of “random” samples of size 10, 100, and 1000, all from a normal distribution for the same variable, all in a 3x1 matrix.
Example would be if I wanted to look at the QQplot for Boston Crime, how would I get
qqnorm(Boston$crim) #find how to set n = 10
qqnorm(Boston$crim) #find how to set n = 100
qqnorm(Boston$crim) #find how to set n = 1000
Also if someone could elaborate when to use qqplot() vs qqnorm(), I'd appreciate it.
I'm inclined to believe that I should use qqplot() as such, as it does seem to give me the output I want, but I want to make sure that using rnorm(n) and then using that variable as a second argument is okay to do:
x <- rnorm(10)
y <- rnorm(100)
z <- rnorm(1000)
par(mfrow = c(1,3))
qqplot(Boston$crim, x)
qqplot(Boston$crim, y)
qqplot(Boston$crim, z)
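The question is not entirely clear, but to plot samples of a vector, define a vector N of sample sizes and loop over it. The lapply() loop below samples from the vector, draws the Q-Q plot with its reference line, and returns the coordinates computed by qqnorm(). The largest size is nrow(Boston) rather than 1000 because sampling without replacement cannot draw more rows than the data has.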
data(Boston, package = "MASS")
set.seed(2021)
N <- c(10, 100, nrow(Boston))
qq_list <- lapply(N, function(n){
subtitle <- paste("Sample size:", n)
i <- sample(nrow(Boston), n, replace = FALSE)
qq <- qqnorm(Boston$crim[i], sub = subtitle)
qqline(Boston$crim[i])
qq
})
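On the qqplot() versus qqnorm() part of the question, which the answer above does not cover: qqnorm() compares one sample against theoretical normal quantiles, while qqplot() compares the empirical quantiles of two samples. The following is a minimal sketch under that reading, with simulated normal samples in a 1x3 layout; the variable names are illustrative only.

```r
set.seed(2021)

# qqnorm(): one sample against theoretical normal quantiles,
# repeated for three sample sizes in a 1x3 layout.
op <- par(mfrow = c(1, 3))
for (n in c(10, 100, 1000)) {
  s <- rnorm(n)
  qqnorm(s, main = paste("n =", n))
  qqline(s)
}
par(op)

# qqplot(): empirical quantiles of two samples against each other.
# plot.it = FALSE returns the coordinates without drawing.
data(Boston, package = "MASS")
qq2 <- qqplot(Boston$crim, rnorm(1000), plot.it = FALSE)
str(qq2)
```

Passing rnorm(n) as the second argument of qqplot() is allowed, but the result then depends on that particular random draw, whereas qqnorm() uses the exact theoretical quantiles.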

Can someone explain what these lines of code mean?

I have been trying to find a way to make a scatter plot with colour intensity that is indicative of the density of points plotted in the area (it's a big data set with lots of overlap). I found these lines of code which allow me to do this but I want to make sure I actually understand what each line is actually doing.
Thanks in advance :)
get_density <- function(x, y, ...){
dens <- MASS::kde2d(x, y, ...)
ix <- findInterval(x, dens$x)
iy <- findInterval(y, dens$y)
ii <- cbind(ix, iy)
return(dens$z[ii])
}
set.seed(1)
dat <- data.frame(x = subset2$conservation.phyloP, y = subset2$gene.expression.RPKM)
dat$density <- get_density(dat$x, dat$y, n = 100)
Below is the function with some explanatory comments, let me know if anything is still confusing:
# The function "get_density" takes two arguments, called x and y
# The "..." allows you to pass other arguments
get_density <- function(x, y, ...){
# The "MASS::" means it comes from the MASS package, but makes it so you don't have to load the whole MASS package and can just pull out this one function to use.
# This is where the arguments passed as "..." (above) would get passed along to the kde2d function
dens <- MASS::kde2d(x, y, ...)
# These lines use the base R function "findInterval" to find, for each x and y value, the index of the grid cell (along dens$x and dens$y) that contains it
ix <- findInterval(x, dens$x)
iy <- findInterval(y, dens$y)
# This command "cbind" pastes the two sets of values together, each as one column
ii <- cbind(ix, iy)
# This line indexes the density matrix dens$z with the two-column matrix ii, so each point gets the density estimate at its (row, column) grid cell
return(dens$z[ii])
}
# The "set.seed()" function makes sure that any randomness used by a function is the same if it is re-run (as long as the same number is used), so it makes code more reproducible
set.seed(1)
dat <- data.frame(x = subset2$conservation.phyloP, y = subset2$gene.expression.RPKM)
dat$density <- get_density(dat$x, dat$y, n = 100)
If your question is about the MASS::kde2d function itself, it might be better to rewrite this StackOverflow question to reflect that!
It looks like the same function is wrapped into a ggplot2 method described here, so if you switch to making your plot with ggplot2 you could give it a try.
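Since subset2 is not available here, the following sketch runs the same lookup on simulated data (the rnorm() columns and the colour palette are assumptions for illustration) and uses the resulting density to colour a base R scatter plot.

```r
# Same density lookup as in the function above, run on simulated data.
get_density <- function(x, y, ...) {
  dens <- MASS::kde2d(x, y, ...)   # 2D kernel density on an n x n grid
  ix <- findInterval(x, dens$x)    # grid column index for each point
  iy <- findInterval(y, dens$y)    # grid row index for each point
  dens$z[cbind(ix, iy)]            # one density value per point
}

set.seed(1)
dat <- data.frame(x = rnorm(500), y = rnorm(500))  # stand-in for subset2
dat$density <- get_density(dat$x, dat$y, n = 100)

# Map density to a light-to-dark palette (palette choice is arbitrary).
cols <- colorRampPalette(c("lightblue", "darkblue"))(50)
plot(dat$x, dat$y, pch = 16,
     col = cols[cut(dat$density, breaks = 50, labels = FALSE)])
```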

R: display values in levelplot stratified by a grouping variable

In this following example, I need to display the values for each of the cells in each of the panels stratified by the grouping variable class:
library("lattice")
x <- seq(pi/4, 5*pi, length.out=5)
y <- seq(pi/4, 5*pi, length.out=5)
r1 <- as.vector(sqrt(outer(x^2, y^2, "+")))
r2 <- as.vector(sqrt(outer(x^2, y^2, "/")))
grid1 <- grid2 <- expand.grid(x=x, y=y)
grid1$z <- cos(r1^2)*exp(-r1/(pi^3))
grid2$z <- cos(r2^2)*exp(-r2/(pi^3))
grid <- rbind(grid1, grid2)
grid$class <- c(rep("addition",length(x)^2), rep("division", length(x)^2))
p <- levelplot(z~x*y | factor(class), grid,
panel=function(...) {
arg <- list(...)
panel.levelplot(...)
panel.text(arg$x, arg$y, round(arg$z,1))})
print(p)
However, the cell values are superimposed on each other because the panel option does not distinguish between the two groups. How can I get the values to display correctly in each group?
Slightly behind the scenes, lattice uses an argument called subscripts to subset data for display in different panels. Often, it does so without you needing to be aware of it, but this is not one of those cases.
A look at the source code for panel.levelplot reveals that it handles subscripts on its own. args(panel.levelplot) shows that subscripts is among the function's formal arguments, and the function's body shows how it uses them.
panel.text() (really just a wrapper for lattice:::ltext.default()), on the other hand, doesn't know about or do anything with subscripts. From within a call to panel.text(x,y,z), the x, y, and z that are seen are the full columns of the data.frame grid, which is why you saw the overplotting that you did.
To plot text for the values that are a part of the current panel, you need to make explicit use of the subscripts argument, like this:
myPanel <- function(x, y, z, ..., subscripts) {
panel.levelplot(x=x, y=y, z=z, ..., subscripts=subscripts)
panel.text(x = x[subscripts],
y = y[subscripts],
labels = round(z[subscripts], 1))
}
p <- levelplot(z~x*y | factor(class), grid, panel = myPanel)
print(p)
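To make the role of subscripts concrete, the sketch below records what a levelplot panel function receives on a tiny made-up grid (the data frame d and the list seen are illustrative names, not part of the original answer). It assumes, as described above, that each panel is handed the full z column plus the row indices that belong to it.

```r
library(lattice)

# Tiny illustrative grid: 2 x 2 cells in each of two panels.
d <- expand.grid(x = 1:2, y = 1:2, g = c("a", "b"))
d$z <- seq_len(nrow(d))

seen <- list()  # filled in as each panel is drawn
p <- levelplot(z ~ x * y | g, data = d,
  panel = function(x, y, z, ..., subscripts) {
    # Record the length of z and the row indices for this panel.
    seen[[length(seen) + 1]] <<- list(n_total = length(z), rows = subscripts)
    panel.levelplot(x, y, z, ..., subscripts = subscripts)
  })
print(p)    # panels are only evaluated when the plot is printed
str(seen)
```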

Rendering xtable with significant digits

I'd like to be able to render an xtable in an automatically run piece of code, i.e. NOT via copy-and-paste, while controlling the number of significant digits. The only way that I know to render an xtable on a standard plot device is by using grid.table, but that method ignores the digits directive and plots all available digits. Here's a code example. Any advice?
library(xtable)
library(gridExtra)
x = rnorm(100)
y = x + rnorm(100)
m = lm(y ~ x)
print(xtable(m)) #too many decimal places
print(xtable(m, digits = 2)) #this works
grid.table(xtable(m, digits=2)) #this doesn't!!!
None of the bits of advice here seem useful for automated rendering:
R: rendering xtable
If you convert everything to strings, you should be able to make this work:
x <- xtable(m)
x[] <- lapply(x, sprintf, fmt = "%0.2f")
grid.table(x)
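Note that "%0.2f" fixes the number of decimal places rather than significant digits. If significant digits are what is needed, one possible variant (a sketch, not part of the original answer) is to format the coefficient table with formatC(..., format = "g") before drawing it. The object name tab is illustrative, and the grid.table() call is guarded in case gridExtra is not installed.

```r
set.seed(1)
x <- rnorm(100)
y <- x + rnorm(100)
m <- lm(y ~ x)

# Coefficient table as a plain data frame.
tab <- as.data.frame(coef(summary(m)))

# Convert every column to strings with 3 significant digits.
tab[] <- lapply(tab, function(col) formatC(col, digits = 3, format = "g"))
print(tab)

# Draw the table on the current device if gridExtra is available.
if (requireNamespace("gridExtra", quietly = TRUE)) {
  gridExtra::grid.table(tab)
}
```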
I'm not sure of your final plot device, but for some purposes you can just skip xtable altogether:
library("broom")
library("gridExtra")
x = rnorm(100)
y = x + rnorm(100)
m = lm(y ~ x)
DF <- broom::tidy(m)
DF[,2:4] <- round(DF[,2:4], 2)
DF[,5] <- format(DF[,5], scientific = TRUE, digits = 4)
grid.table(DF)
Make sure you have the latest gridExtra. You can also control the appearance of the table in great detail, via themes (there is a vignette on the topic).

Utilise Surv object in ggplot or lattice

Anyone knows how to take advantage of ggplot or lattice in doing survival analysis? It would be nice to do a trellis or facet-like survival graphs.
So in the end I played around and sort of found a solution for a Kaplan-Meier plot. I apologize for the messy code in taking the list elements into a data frame, but I couldn't figure out another way.
Note: It only works with two levels of strata. If anyone knows how I can use x<-length(stratum) to do this, please let me know (in Stata I could append to a macro; I'm unsure how this works in R).
library(survival)
library(ggplot2)
ggkm <- function(time, event, stratum) {
  m2s <- Surv(time, as.numeric(event))
  fit <- survfit(m2s ~ stratum)
  #collect the survfit components in a data frame for ggplot.
  f <- data.frame(time = fit$time, surv = fit$surv,
                  strata = c(rep(names(fit$strata[1]), fit$strata[1]),
                             rep(names(fit$strata[2]), fit$strata[2])),
                  upper = fit$upper, lower = fit$lower)
  #the trailing "+" keeps the layers in the same expression.
  r <- ggplot(f, aes(x = time, y = surv, fill = strata, group = strata)) +
    geom_line() + geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.3)
  return(r)
}
I have been using the following code in lattice. The first function draws KM curves for one group and would typically be used as the panel.groups function, while the second adds the log-rank test p-value for the entire panel:
km.panel <- function(x,y,type,mark.time=T,...){
na.part <- is.na(x)|is.na(y)
x <- x[!na.part]
y <- y[!na.part]
if (length(x)==0) return()
fit <- survfit(Surv(x,y)~1)
if (mark.time){
cens <- which(fit$time %in% x[y==0])
panel.xyplot(fit$time[cens], fit$surv[cens], type="p",...)
}
panel.xyplot(c(0,fit$time), c(1,fit$surv),type="s",...)
}
logrank.panel <- function(x,y,subscripts,groups,...){
lr <- survdiff(Surv(x,y)~groups[subscripts])
otmp <- lr$obs
etmp <- lr$exp
df <- (sum(1 * (etmp > 0))) - 1
p <- 1 - pchisq(lr$chisq, df)
p.text <- paste("p=", signif(p, 2))
grid.text(p.text, 0.95, 0.05, just=c("right","bottom"))
panel.superpose(x=x,y=y,subscripts=subscripts,groups=groups,...)
}
The censoring indicator has to be 0-1 for this code to work. The usage would be along the following lines:
library(survival)
library(lattice)
library(grid)
data(colon) #built-in example data set
xyplot(status~time, data=colon, groups=rx, panel.groups=km.panel, panel=logrank.panel)
If you just use 'panel=panel.superpose' then you won't get the p-value.
I started out following almost exactly the approach you use in your updated answer. But the thing that's irritating about the survfit is that it only marks the changes, not each tick - e.g., it will give you 0 - 100%, 3 - 88% instead of 0 - 100%, 1 - 100%, 2 - 100%, 3 - 88%. If you feed that into ggplot, your lines will slope from 0 to 3, rather than remaining flat and dropping straight down at 3. That might be fine depending on your application and assumptions, but it's not the classic KM plot. This is how I handled the varying numbers of strata:
groupvec <- c()
for(i in seq_along(x$strata)){
groupvec <- append(groupvec, rep(x = names(x$strata[i]), times = x$strata[i]))
}
f$strata <- groupvec
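As a possible alternative to the loop, rep() is vectorised over its times argument, so the strata labels can be built in one call. For the sloping-lines problem described above, one option that the original answers do not use is ggplot2's geom_step(), which draws the flat-then-drop shape directly from the change points. The sketch below assumes the colon data set from the survival package and guards the plotting part in case ggplot2 is not installed.

```r
library(survival)

# Fit KM curves for the three treatment groups in the colon data.
fit <- survfit(Surv(time, status) ~ rx, data = colon)

# One label per row of fit$time, for any number of strata.
f <- data.frame(time = fit$time,
                surv = fit$surv,
                strata = rep(names(fit$strata), times = fit$strata))

# geom_step() keeps the curve flat between change points.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(f, ggplot2::aes(x = time, y = surv, colour = strata)) +
    ggplot2::geom_step()
  print(p)
}
```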
For what it's worth, this is how I ended up doing it - but this isn't really a KM plot, either, because I'm not calculating out the KM estimate per se (although I have no censoring, so this is equivalent... I believe).
survcurv <- function(surv.time, group = NA) {
#Must be able to coerce surv.time and group to vectors
if(!is.vector(as.vector(surv.time)) | !is.vector(as.vector(group))) {stop("surv.time and group must be coercible to vectors.")}
#Make sure that the surv.time is numeric
if(!is.numeric(surv.time)) {stop("Survival times must be numeric.")}
#Group can be just about anything, but must be the same length as surv.time
if(length(surv.time) != length(group)) {stop("The vectors passed to the surv.time and group arguments must be of equal length.")}
#What is the maximum number of ticks recorded?
max.time <- max(surv.time)
#What is the number of groups in the data?
n.groups <- length(unique(group))
#Use the number of ticks (plus one for t = 0) times the number of groups to
#create an empty skeleton of the results.
curves <- data.frame(tick = rep(0:max.time, n.groups), group = NA, surv.prop = NA)
#Add the group names, repeating each one across a full run of ticks so that
#every tick/group combination appears exactly once. (Plain recycling of
#unique(group) can pair ticks and groups incorrectly.)
curves$group <- rep(unique(group), each = max.time + 1)
#For each row, calculate the number of survivors in group[i] at tick[i]
for(i in seq_len(nrow(curves))){
curves$surv.prop[i] <- sum(surv.time[group %in% curves$group[i]] > curves$tick[i]) /
length(surv.time[group %in% curves$group[i]])
}
#Return the results, ordered by group and tick - easier for humans to read.
return(curves[order(curves$group, curves$tick), ])
}
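The belief stated above, that the simple surviving proportion matches the Kaplan-Meier estimate when nothing is censored, can be checked on a small example. The survival times below are made up for illustration; the comparison uses summary() on a survfit object to read off the KM estimate at each tick.

```r
library(survival)

# Made-up, fully observed survival times (no censoring).
surv.time <- c(2, 3, 3, 5, 8)
ticks <- 0:max(surv.time)

# Proportion still alive after each tick, as in survcurv().
prop <- sapply(ticks, function(t) mean(surv.time > t))

# Kaplan-Meier estimate at the same ticks, with every event observed.
fit <- survfit(Surv(surv.time, rep(1, length(surv.time))) ~ 1)
km <- summary(fit, times = ticks)$surv

print(data.frame(tick = ticks, prop = prop, km = km))
print(all.equal(prop, km))
```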
