Long vector-plot/Coverage plot in R - r

I really need your R skills here. I've been working on this plot for several days now. I'm an R newbie, so that might explain it.
I have sequence coverage data for chromosomes (basically a value for each position along the length of every chromosome, making the length of the vectors many millions). I want to make a nice coverage plot of my reads. This is what I got so far:
Looks alright, but I'm missing y-labels so I can tell which chromosome it is, and I've also been having trouble modifying the x-axis so that it ends where the coverage ends. Additionally, my own data is much, much bigger, so this plot in particular takes an extremely long time. That is why I tried HilbertVis plotLongVector. It works, but I can't figure out how to modify the x-axis or the labels, or how to make the y-axis logged. Also, the vectors all get the same length on the plot even though they are not equally long.
source("http://bioconductor.org/biocLite.R")
biocLite("HilbertVis")
library(HilbertVis)
chr1 <- abs(makeRandomTestData(len=1.3e+07))
chr2 <- abs(makeRandomTestData(len=1e+07))
par(mfcol=c(8, 1), mar=c(1, 1, 1, 1), ylog=T)
# 1st way of trying with some code I found on stackoverflow
# Chr1
plotCoverage <- function(chr1, start, end) { # Defines coverage plotting function.
plot.new()
plot.window(c(start, length(chr1)), c(0, 10))
axis(1, labels=F)
axis(4)
lines(start:end, log(chr1[start:end]), type="l")
}
plotCoverage(chr1, start=1, end=length(chr1)) # Plots coverage result.
# Chr2
plotCoverage <- function(chr2, start, end) { # Defines coverage plotting function.
plot.new()
plot.window(c(start, length(chr1)), c(0, 10))
axis(1, labels=F)
axis(4)
lines(start:end, log(chr2[start:end]), type="l")
}
plotCoverage(chr2, start=1, end=length(chr2)) # Plots coverage result.
# 2nd way of trying with plotLongVector
plotLongVector(chr1, bty="n", ylab="Chr1") # ylab doesn't work
plotLongVector(chr2, bty="n")
Then I have further vectors, called genes, that are of special interest. They are about the same length as the chromosome vectors, but in my data they contain more zeroes than values.
genes_chr1 <- abs(makeRandomTestData(len=1.3e+07))
genes_chr2 <- abs(makeRandomTestData(len=1e+07))
I would like these gene vectors plotted as red dots under the chromosomes! Basically, wherever the vector has a value (>0), it is shown as a dot (or line) under the long vector plot. I have no idea how to add this, but it seems fairly straightforward.
Please help me! Thank you so much.

DISCLAIMER: Please do not simply copy and paste this code and run it on every position of your chromosome. Please sample positions (for example, as #Gx1sptDTDa shows) and plot those. Otherwise you'd probably get a huge black filled rectangle after many, many hours, if your computer survives the drain.
Using ggplot2, this is really easily achieved using geom_area. Here, I've generated some random data for three chromosomes with 100 positions each (300 in total), just to show an example. You can build on this, I hope.
# construct test data with 3 chromosomes of 100 positions each
# and random coverage between 0 and 500
set.seed(45)
chr <- rep(paste0("chr", 1:3), each=100)
pos <- rep(1:100, 3)
cov <- sample(0:500, 300)
df <- data.frame(chr, pos, cov)
require(ggplot2)
p <- ggplot(data = df, aes(x=pos, y=cov)) + geom_area(aes(fill=chr))
p + facet_wrap(~ chr, ncol=1)
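The gene track and the logged y-axis asked about in the question could be layered onto the same kind of faceted plot. What follows is only a sketch under assumptions: the chromosome lengths and the gene vector are made up for illustration, geom_line is swapped in for geom_area so that the log scale has no zero baseline to deal with, and coverage is kept at 1 or above for the same reason.

```r
library(ggplot2)

set.seed(45)
# Toy coverage for 3 chromosomes of unequal length (hypothetical sizes)
len <- c(chr1 = 100, chr2 = 80, chr3 = 60)
df <- data.frame(
  chr = rep(names(len), times = len),
  pos = sequence(len),
  cov = sample(1:500, sum(len), replace = TRUE)
)
# Hypothetical gene signal: mostly zeros, a few positive values
df$gene <- rbinom(nrow(df), 1, 0.05) * runif(nrow(df))
genes <- df[df$gene > 0, ]  # keep only positions that carry a value

p <- ggplot(df, aes(x = pos, y = cov)) +
  geom_line(aes(colour = chr)) +
  # red ticks along the bottom of each panel mark the gene positions
  geom_rug(data = genes, aes(x = pos), sides = "b",
           colour = "red", inherit.aes = FALSE) +
  scale_y_log10() +
  # the strip text labels each panel with its chromosome; the shared
  # x-axis keeps shorter chromosomes visibly shorter
  facet_wrap(~ chr, ncol = 1)
print(p)
```

Passing scales = "free_x" to facet_wrap would instead make each panel end where its own chromosome ends.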

You could use the ggplot2 package.
I'm not sure what exactly you want, but here's what I did:
This has 7000 random data points (about double the amount of genes on Chromosome 1 in reality). I used alpha to show dense areas (not many here, as it's random data).
library(ggplot2)
Chr1_cov <- sample(1.3e+07,7000)
Chr1 <- data.frame(Cov=Chr1_cov,fil=1)
pl <- qplot(Cov,fil,data=Chr1,geom="pointrange",ymin=0,ymax=1.1,xlab="Chromosome 1",ylab="-",alpha=I(1/50))
print(pl)
And that's it. This ran in less than a second. ggplot2 has a humongous amount of settings, so just try some out. Use facets to create multiple graphs.
The code beneath is for a sort of moving average, and then plotting the output of that. It is not a real moving average, as a real moving average would have (almost) the same number of data points as the original and would only make the data smoother. This code, however, takes an average for every n points. It will of course run quite a bit faster, but you will lose a lot of detailed information.
VeryLongVector <- sample(500,1e+07,replace=TRUE)
movAv <- function(vector, n) {
  # Number of complete, non-overlapping bins of n points each
  chops <- length(vector) %/% n
  pos <- numeric(chops)
  Cov <- numeric(chops)
  for (c in seq_len(chops)) {
    idx <- ((c - 1) * n + 1):(c * n)  # indices of the c-th bin
    pos[c] <- median(idx)             # centre position of the bin
    Cov[c] <- mean(vector[idx])       # mean coverage within the bin
  }
  result <- data.frame(pos = pos, cov = Cov)
  return(result)
}
Chr1 <- movAv(VeryLongVector,10000)
qplot(pos,cov,data=Chr1,geom="line")
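The per-bin averaging above can also be sketched without an explicit loop, which tends to matter for vectors of many millions of points. This is a minimal sketch, assuming only complete bins are wanted; bin_means is a hypothetical helper name, not a function from any package. It reshapes the vector into a matrix with one bin per column and takes the column means.

```r
# Hypothetical helper: mean of every n consecutive points, no loop.
bin_means <- function(v, n) {
  chops <- length(v) %/% n                        # number of complete bins
  m <- matrix(v[seq_len(chops * n)], nrow = n)    # one bin per column
  data.frame(
    pos = (seq_len(chops) - 0.5) * n + 0.5,       # centre position of each bin
    cov = colMeans(m)                             # mean coverage per bin
  )
}

set.seed(1)
VeryLongVector <- sample(500, 1e6, replace = TRUE)
binned <- bin_means(VeryLongVector, 10000)
head(binned)
```

The resulting data frame has the same pos and cov columns as the loop version, so it can be passed to the same plotting call.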

Related

R - Histogram Doesn't show density due to magnitude of the Data

I have a vector called data of length approximately 444000, and almost all of the numeric values are between 1 and 100. I want to draw the histogram and draw the appropriate density on it. However, when I draw the histogram I get this:
hist(data,freq=FALSE)
What can I do to actually see a more detailed histogram? I tried the breaks argument; it helped, but the histogram is still really hard to see because it's so small. For example, I used breaks = 2000 and got this:
Is there something that I can do? Thanks!
Since you don't show data, I'll generate some random data:
d <- c(rexp(1e4, 100), runif(100, max=5e4))
hist(d)
Dealing with outliers like this, you can display the histogram of the logs, but that may be difficult to interpret:
If you are okay with showing a subset of the data, then you can filter the outliers out either dynamically (perhaps using quantile) or manually. The important thing when showing this visualization in your analysis is that if you must remove data for the plot, then be up-front about the removal. (This is terse ... it would also be informative to include the range and/or other properties of the omitted data, but that's subjective and will differ based on the actual data.)
quantile(d, seq(0, 1, len=11))
d2 <- d[ d < quantile(d, 0.90) ]
hist(d2)
txt <- sprintf("(%d points shown, %d excluded)", length(d2), length(d) - length(d2))
mtext(txt, side = 1, line = 3, adj = 1)
d3 <- d[ d < 10 ]
hist(d3)
txt <- sprintf("(%d points shown, %d excluded)", length(d3), length(d) - length(d3))
mtext(txt, side = 1, line = 3, adj = 1)
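The histogram of the logs mentioned above could look roughly like the following. It is a sketch on the same kind of random data, assuming all values are strictly positive (log10 is undefined otherwise); the choice of 50 breaks and the tick formatting are arbitrary. The x-axis ticks are relabelled in the original units, which may make the plot a little easier to interpret.

```r
set.seed(1)
d <- c(rexp(1e4, 100), runif(100, max = 5e4))

# Bin the log10 values without drawing, then draw with a custom axis
h <- hist(log10(d), breaks = 50, plot = FALSE)
plot(h, xaxt = "n", xlab = "d (log scale)", main = "Histogram of log10(d)")

# Tick positions are in log10 units; labels show the original values
ticks <- pretty(range(h$breaks))
axis(1, at = ticks, labels = format(10^ticks, scientific = TRUE))
```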

Generating histogram that can calculate percent recovered

I have the following dataset called df:
Amp Injected Recovered Percent less_0.1_True
0.13175 25.22161274 0.96055540 3.81 0
0.26838 21.05919344 21.06294791 100.02 1
0.07602 16.88526724 16.91541763 100.18 1
0.04608 27.50209048 27.55404507 100.19 0
0.01729 8.31489333 8.31326976 99.98 1
0.31867 4.14961918 4.14876247 99.98 0
0.28756 14.65843377 14.65248551 99.96 1
0.26177 10.64754579 10.76435667 101.10 1
0.23214 6.28826689 6.28564299 99.96 1
0.20300 17.01774090 1.05925850 6.22 0
...
Here, the less_0.1_True column flags whether the Recovered period was close enough to the Injected period to be considered a successful recovery. If the flag is 1, then it is a successful recovery. Based on this, I need to generate a plot (Henderson & Stassun, the Astrophysical Journal, 747:51, 2012) like the following:
I am not sure how to create a histogram like this. The closest I have been able to reproduce is a bar plot with the following code:
breaks <- seq(0,30,by=1)
df <- split(dat, cut(dat$Injected,breaks)) # I make bins with width = 1 day
x <- seq(1,30,by=1)
len <- numeric() #Here I store the total number of objects in each bin
sum <- numeric() #Here I store the total number of 1s in each bin
for (i in 1:30){
n <- nrow(df[[i]])
len <- c(len,n)
s <- sum(df[[i]]$less_0.1_True == 1, na.rm = TRUE)
sum <- c(sum,s)
}
percent = sum/len*100 #Here I calculate what the percentage is for each bin
barplot(percent, names = x, xlab = "Period [d]" , ylab = "Percent Recovered", ylim=c(0,100))
And it generates the following bar plot:
Obviously, this plot does not look like the first one and there are issues such as it does not show from 0 to 1 like the first graph (which I understand is the case because the latter is a bar graph and not a histogram).
Could anyone please guide me as to how I may reproduce the first figure based on my dataset?
If I run your code I get errors. You need to use border = NA to get rid of the bar borders:
set.seed(42)
hist(rnorm(1000,4), xlim=c(0,10), col = 'skyblue', border = NA, main = "Histogram", xlab = NULL)
Another example using ggplot2:
library(ggplot2)
ggplot(iris, aes(x=Sepal.Length))+
geom_histogram()
I finally found a solution to the problem in StackOverflow. I guess the solved question was worded differently than mine and so I could not find it when I was looking for it initially. The solution is here: How to plot a histogram with a custom distribution?
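For the percent-recovered plot itself, one possible sketch computes the per-bin percentage with cut and tapply instead of a loop, then draws bars with no gaps so that they sit on the bin edges like a histogram. It uses only the ten rows shown in the question, so most bins are empty; treating empty bins as 0 percent is an assumption made here for plotting, not something stated in the question.

```r
# The ten rows shown in the question (only the two columns that are needed)
dat <- data.frame(
  Injected = c(25.22161274, 21.05919344, 16.88526724, 27.50209048, 8.31489333,
               4.14961918, 14.65843377, 10.64754579, 6.28826689, 17.01774090),
  less_0.1_True = c(0, 1, 1, 0, 1, 0, 1, 1, 1, 0)
)

breaks <- seq(0, 30, by = 1)                 # bins of width 1 day
bin <- cut(dat$Injected, breaks)
# mean of a logical vector is the fraction of TRUEs; empty bins give NA
percent <- 100 * tapply(dat$less_0.1_True == 1, bin, mean)
percent[is.na(percent)] <- 0                 # assumption: empty bin shown as 0

# space = 0 makes bar i span [i - 1, i], so the axis ticks land on bin edges
barplot(unname(percent), space = 0, border = NA, col = "skyblue",
        xlab = "Period [d]", ylab = "Percent Recovered", ylim = c(0, 100))
axis(1, at = seq(0, 30, by = 5))
```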

Stretch x-axis between two values

I have to plot several IR spectra. The x-axis of these plots has to be stretched between 2000 and 500. I've tried axis(side=1,at=c(4000,3500,2000,1500,1000,500)), but this does not produce the same distance between the labels. I've searched for nearly 2 hours but can't figure out how to achieve this.
Help would be appreciated.
Thanks in advance
I don't think that there's a particularly clean way to do this in base graphics. No doubt there's something in one of the many graphics packages that would do it, but here's my workaround for what I think you're trying to do.
#Some data to plot
x <- 0:4000
y <- sin(x/100)
#A function to do the stretching that you describe
stretcher <- function(x)
{
lower <- 500 ##lower end of expansion
upper <- 2000 ##upper end of expansion
stretchfactor <- 3 ##must be greater than 1, factor of expansion
x[x>upper] <- x[x>upper] + (stretchfactor-1) * (upper-lower)
x[x<=upper & x>lower] <- (x[x<=upper & x>lower] - lower) * stretchfactor + lower
x
}
#Create the plot
plot(stretcher(x),y,axes=FALSE)
labels <- c(4000,3500,3000,2500,2000,1500,1000,500)
box()
axis(2)
axis(1,labels=labels,at=stretcher(labels))
I'd also emphasise the breaks with something like:
abline(v=stretcher(2000),col='red',lty=2)
abline(v=stretcher(500),col='red',lty=2)
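To see what the transformation does to individual values, here is a small worked sketch. stretch_axis is a hypothetical variant of the function above with the limits and the factor as arguments; the arithmetic is the same piecewise mapping.

```r
# Hypothetical generalisation of stretcher(): limits and factor are arguments
stretch_axis <- function(x, lower = 500, upper = 2000, factor = 3) {
  ifelse(x > upper,
         x + (factor - 1) * (upper - lower),          # shift everything above
         ifelse(x > lower,
                (x - lower) * factor + lower,         # expand the middle range
                x))                                   # leave the low end alone
}

stretch_axis(c(0, 500, 1000, 2000, 4000))
# 0 500 2000 5000 7000
```

The 1500-unit range between 500 and 2000 ends up 4500 units wide, and everything above 2000 is shifted by the extra 3000 so the mapping stays continuous and increasing.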

Concatenate a list for plot labels

I want to create intervals (discretize/bin) of continuous variables to plot a choropleth map using ggplot. After reading various threads, I decided to use cut and quantile to eliminate the problems of: a) manually creating bins, and b) taking care of dominant states (otherwise, I had to manually create bins, look at the map, and readjust the bins).
However, I am facing another problem now. Intervals coming out of cut are hardly pretty. So, I am trying to follow this example and this example to come up with my pretty labels.
Here is my list:
x <- seq(1,50)
Rounded quantiles:
qs_x <- round(quantile(x, probs=c(seq(0,0.8,by=0.2),0.9)))
which results:
 0% 20% 40% 60% 80% 90%
  1  11  21  30  40  45
Using these cuts, I want to come up with these labels:
1-11, 12-21, 22-30, 31-40, 41-45, 45+
I am sure there is an easy solution to convert a list using some apply function, but I am not well-versed with those functions.
Help appreciated.
A 3-liner produces the output you want, without using apply.
labels <- paste(qs_x+1, qs_x[-1], sep="-")
labels[1] <- paste(qs_x[1], qs_x[2], sep="-")
labels[length(labels)] <- paste(tail(qs_x, 1), "+", sep = "")
The first line constructs labels of the form (x1 + 1) - x2, the second line fixes the first label, and the third line fixes the last label. Here is the output
> labels
[1] "1-11" "12-21" "22-30" "31-40" "41-45" "45+"
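Once the labels exist, they can be handed to cut so that the binned variable carries them directly. This is a sketch under one assumption: the data are integers, so that "12-21" correctly describes the interval (11, 21]. An open-ended top break (Inf) is added here so that values above the 90% quantile fall into the "45+" bin.

```r
x <- seq(1, 50)
qs_x <- round(quantile(x, probs = c(seq(0, 0.8, by = 0.2), 0.9)))
n <- length(qs_x)

# Same labels as above, built in one step
labels <- c(paste(qs_x[1], qs_x[2], sep = "-"),
            paste(qs_x[2:(n - 1)] + 1, qs_x[3:n], sep = "-"),
            paste0(qs_x[n], "+"))

# Six intervals from seven break points; include.lowest keeps x == 1
binned <- cut(x, breaks = c(qs_x, Inf), labels = labels, include.lowest = TRUE)
table(binned)
```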

Utilise Surv object in ggplot or lattice

Anyone knows how to take advantage of ggplot or lattice in doing survival analysis? It would be nice to do a trellis or facet-like survival graphs.
So in the end I played around and sort of found a solution for a Kaplan-Meier plot. I apologize for the messy code in taking the list elements into a dataframe, but I couldn't figure out another way.
Note: It only works with two levels of strata. If anyone knows how I can use x<-length(stratum) to do this, please let me know (in Stata I could append to a macro; I'm unsure how this works in R).
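library(survival)
library(ggplot2)
ggkm <- function(time, event, stratum) {
  m2s <- Surv(time, as.numeric(event))
  fit <- survfit(m2s ~ stratum)
  # Collect the survfit list elements into a data frame
  f <- data.frame(time = fit$time, surv = fit$surv,
                  upper = fit$upper, lower = fit$lower)
  # Label each row with its stratum (two strata only)
  f$strata <- c(rep(names(fit$strata[1]), fit$strata[1]),
                rep(names(fit$strata[2]), fit$strata[2]))
  # Each + ends its line so that the layers are added to the plot
  r <- ggplot(f, aes(x = time, y = surv, fill = strata, group = strata)) +
    geom_line() +
    geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.3)
  return(r)
}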
I have been using the following code in lattice. The first function draws KM-curves for one group and would typically be used as the panel.groups function, while the second adds the log-rank test p-value for the entire panel:
km.panel <- function(x,y,type,mark.time=T,...){
na.part <- is.na(x)|is.na(y)
x <- x[!na.part]
y <- y[!na.part]
if (length(x)==0) return()
fit <- survfit(Surv(x,y)~1)
if (mark.time){
cens <- which(fit$time %in% x[y==0])
panel.xyplot(fit$time[cens], fit$surv[cens], type="p",...)
}
panel.xyplot(c(0,fit$time), c(1,fit$surv),type="s",...)
}
logrank.panel <- function(x,y,subscripts,groups,...){
lr <- survdiff(Surv(x,y)~groups[subscripts])
otmp <- lr$obs
etmp <- lr$exp
df <- (sum(1 * (etmp > 0))) - 1
p <- 1 - pchisq(lr$chisq, df)
p.text <- paste("p=", signif(p, 2))
grid.text(p.text, 0.95, 0.05, just=c("right","bottom"))
panel.superpose(x=x,y=y,subscripts=subscripts,groups=groups,...)
}
The censoring indicator has to be 0-1 for this code to work. The usage would be along the following lines:
library(survival)
library(lattice)
library(grid)
data(colon) #built-in example data set
xyplot(status~time, data=colon, groups=rx, panel.groups=km.panel, panel=logrank.panel)
If you just use 'panel=panel.superpose' then you won't get the p-value.
I started out following almost exactly the approach you use in your updated answer. But the thing that's irritating about the survfit is that it only marks the changes, not each tick - e.g., it will give you 0 - 100%, 3 - 88% instead of 0 - 100%, 1 - 100%, 2 - 100%, 3 - 88%. If you feed that into ggplot, your lines will slope from 0 to 3, rather than remaining flat and dropping straight down at 3. That might be fine depending on your application and assumptions, but it's not the classic KM plot. This is how I handled the varying numbers of strata:
groupvec <- c()
for(i in seq_along(x$strata)){
groupvec <- append(groupvec, rep(x = names(x$strata[i]), times = x$strata[i]))
}
f$strata <- groupvec
For what it's worth, this is how I ended up doing it - but this isn't really a KM plot, either, because I'm not calculating out the KM estimate per se (although I have no censoring, so this is equivalent... I believe).
survcurv <- function(surv.time, group = NA) {
#Must be able to coerce surv.time and group to vectors
if(!is.vector(as.vector(surv.time)) | !is.vector(as.vector(group))) {stop("surv.time and group must be coercible to vectors.")}
#Make sure that the surv.time is numeric
if(!is.numeric(surv.time)) {stop("Survival times must be numeric.")}
#Group can be just about anything, but must be the same length as surv.time
if(length(surv.time) != length(group)) {stop("The vectors passed to the surv.time and group arguments must be of equal length.")}
#What is the maximum number of ticks recorded?
max.time <- max(surv.time)
#What is the number of groups in the data?
n.groups <- length(unique(group))
#Use the number of ticks (plus one for t = 0) times the number of groups to
#create an empty skeleton of the results.
curves <- data.frame(tick = rep(0:max.time, n.groups), group = NA, surv.prop = NA)
#Add the group names - each group is repeated once per tick, so that every
#group is paired with every tick.
curves$group <- rep(unique(group), each = length(0:max.time))
#For each row, calculate the number of survivors in group[i] at tick[i]
for(i in seq_len(nrow(curves))){
curves$surv.prop[i] <- sum(surv.time[group %in% curves$group[i]] > curves$tick[i]) /
length(surv.time[group %in% curves$group[i]])
}
#Return the results, ordered by group and tick - easier for humans to read.
return(curves[order(curves$group, curves$tick), ])
}
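The complaint above, that survfit only reports the times at which the curve changes, can possibly be worked around by asking summary() for the estimate at a regular grid of ticks. The sketch below assumes the lung data set shipped with the survival package and an arbitrary grid of 100-day ticks; neither comes from the answers above.

```r
library(survival)

# lung ships with survival; status is coded 1 = censored, 2 = dead
fit <- survfit(Surv(time, status) ~ sex, data = lung)

# Ask for the KM estimate at every tick, not just at the change points.
# Ticks beyond the last observation of a stratum are dropped by default.
ticks <- seq(0, 1000, by = 100)
s <- summary(fit, times = ticks)

# One row per stratum and tick, ready for plotting
km <- data.frame(time = s$time, surv = s$surv, strata = s$strata)
head(km)
```

Feeding km into ggplot with geom_step() rather than geom_line() would keep the curve flat between ticks, and because s$strata is already a factor covering all groups, no per-stratum bookkeeping is needed.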

Resources