Aligning multiple xts time series plots - r

I have an xts object with 4 columns. The first 3 columns are the mean and lower and upper confidence bounds for a proportion. The 4th column is the sample size. Since the scales are different, I thought it would make sense to plot the first 3 columns on one graph, and plot the 4th on a separate graph, right below it. Any suggestions on how to do this?
Here's code to build an xts object that's like the one I have:
startTime = Sys.time()
n = 10
d = seq(startTime,startTime+n*24*60*60,by="1 day")
a = sample(10000,length(d),replace=TRUE)
p = runif(length(d))
l = p/2
u = p+(p+1)/2
x= xts(p,d)
x = cbind(x,l,u,a)
colnames(x) = c("prop","low","high","size")

It's easy to do if you use plot.zoo. Something like this will get you started:
library(xts)
data(sample_matrix)
x <- as.xts(sample_matrix)
plot.zoo(x, screens=c(1,1,1,2))
There are tons of examples in ?plot.zoo; make sure to check them out.

Related

How to add labels to original data given clustering result using hclust

Just say I have some unlabeled data which I know should be clustered into six catergories, like for example this dataset:
library(tidyverse)
ts <- read_table(url("http://kdd.ics.uci.edu/databases/synthetic_control/synthetic_control.data"), col_names = FALSE)
If I create an hclust object with a sample of 60 from the original dataset like so:
n <- 10
s <- sample(1:100, n)
idx <- c(s, 100+s, 200+s, 300+s, 400+s, 500+s)
ts.samp <- ts[idx,]
observedLabels <- c(rep(1,n), rep(2,n), rep(3,n), rep(4,n), rep(5,n), rep(6,n))
# compute DTW distances
library(dtw)#Dynamic Time Warping (DTW)
distMatrix <- dist(ts.samp, method= 'DTW')
# hierarchical clustering
hc <- hclust(distMatrix, method='average')
I know that I can then add the labels to the dendrogram for viewing like this:
observedLabels <- c(rep(1,), rep(2,n), rep(3,n), rep(4,n), rep(5,n), rep(6,n))
plot(hc, labels=observedLabels, main="")
However, I would like to the correct labels to the initial data frame that was clustered. So for ts.samp I would like to add a extra column with the correct label that each observation has been clustered into.
It would seems that ts.samp$cluster <- hc$label should add the cluster to the data frame, however hc$label returns NULL.
Can anyone help with extracting this information?
You need to define a level where you cut your dendrogram, this will form the groups.
Use:
labels <- cutree(hc, k = 3) # you set the number of k that's more appropriate, see how to read a dendrogram
ts.samp$grouping <- labels
Let's look at the dendrogram in order to find the best number for k:
plot(hc, main="")
abline(h=500, col = "red") # cut at height 500 forms 2 groups
abline(h=300, col = "blue") # cut at height 300 forms 3/4 groups
It looks like either 2 or 3 might be good. You need to find the highest jump in the vertical lines (Height).
Use the horizontal lines at that height and count the cluster "formed".

Gradient colours in ggplot (relatively simple)

I have a dataframe which I have constructed by interpolating a series of origin destination points (they relate to a cycle share scheme that used to run in Seattle).
I've called the dataframe interpolated_flows:
line_id long lat seg_num count
1 1 -122.3170 47.61855 1 155
2 1 -122.3170 47.61911 2 155
3 1 -122.3170 47.61967 3 155
4 1 -122.3170 47.62023 4 155
5 1 -122.3169 47.62079 5 155
6 1 -122.3169 47.62135 6 155
What I would like to do (and I think is relatively simple if you know ggplot) is to plot these flows (lines) with the width of a line determined by the count and the gradient determined by the seg_num.
This is my attempt so far:
#Create variables to store relevant data for simplicity of code
X <- interpolated_flows$long
Y <- interpolated_flows$lat
sgn <- interpolated_flows$seg_num
ct <- interpolated_flows$count
#Create a map from flow data and include the bounded box as a base
g <- ggplot(interpolated_flows,aes(x=X, y=Y),group=interpolated_flows$line_id,color=sgn)
map <- ggmap(seattle_map,base_layer = g)
map <- map + geom_path(size=as.numeric(ct)/100,alpha=0.4)+
scale_alpha_continuous(range = c(0.03, 0.3))+coord_fixed(ratio=1.3)+
scale_colour_gradient(high="red",low="blue")
png(filename='Seattle_flows_gradient.png')
print(map)
dev.off()
And I end up with the image attached. I have spent a long time playing around with various parameters in the plotting part of the code but without success so would really appreciate if someone could point me in the right direction.
Edit:
base <- ggplot(interpolated_flows,aes(x=X, y=Y))
map <- ggmap(seattle_map,base_layer = g)
map <- map+geom_path(aes(color=seg_num,size=as.numeric(count)))+
scale_size_continuous(name="Journey Count",range=c(0.05,0.4))+
scale_color_gradient(name="Journey Path",high="white",low="blue",breaks=c(1,10), labels=c('Origin','Destination'))+
coord_fixed(ratio=1.3)+scale_x_continuous("", breaks=NULL)+
scale_y_continuous("", breaks=NULL)
png(filename='Seattle_flows_gradient.png')
print(map)
dev.off()
This is the plot I have now got to which looks like this. I have only two questions - 1) does anyone know a way to improve the resolution of the background map? I tried changing the zoom parameter in the get_map function but it didn't seem to help. 2) The lines I have plotted seem very 'white' heavy. It doesn't look to me like the gradient is evenly distributed. Anyone have any ideas why this would be and how to fix?
See if this suits you. I have create a new dataset so as to see diffencies. Once the data.frame is created you can use it as your first ggplot argument and reference columns by their names as Mako212 say.
long<-seq(-122,-123,length.out = 6)
lat<-seq(47,48,length.out = 6)
seg_num<-seq(1,6,1)
count<-seq(155,165,length.out = 6)
interpolated_flows<-data.frame(long,lat,seg_num,count,stringsAsFactors = false)
base_plot<-ggplot(interpolated_flows,aes(x=long, y=lat))
base_plot+
geom_path(aes(color=seg_num,size=as.numeric(count/100),alpha=lat))+
#notice that size, color and alpha are into aethetic
scale_size_continuous(name="Count")+
scale_alpha_continuous(name="Latitude",range = c(0.03, 0.3))+ #you won't need it if you don't want variable transparency
#just put the desired value into the aethteic
scale_color_gradient(name="Seg_num",high="red",low="blue")+
coord_fixed(ratio=1.3)
Hope it helps

R - Histogram Doesn't show density due to magnitude of the Data

I have a vector called data with length 444000 approximately, and most of the numeric values are between 1 and 100 (almost all of them). I want to draw the histogram and draw the the appropriate density on it. However, when I draw the histogram I get this:
hist(data,freq=FALSE)
What can I do to actually see a more detailed histogram? I tried to use the breaks code, it helped, but it's really hard do see the histogram, because it's so small. For example I used breaks = 2000 and got this:
Is there something that I can do? Thanks!
Since you don't show data, I'll generate some random data:
d <- c(rexp(1e4, 100), runif(100, max=5e4))
hist(d)
Dealing with outliers like this, you can display the histogram of the logs, but that may difficult to interpret:
If you are okay with showing a subset of the data, then you can filter the outliers out either dynamically (perhaps using quantile) or manually. The important thing when showing this visualization in your analysis is that if you must remove data for the plot, then be up-front when the removal. (This is terse ... it would also be informative to include the range and/or other properties of the omitted data, but that's subjective and will differ based on the actual data.)
quantile(d, seq(0, 1, len=11))
d2 <- d[ d < quantile(d, 0.90) ]
hist(d2)
txt <- sprintf("(%d points shown, %d excluded)", length(d2), length(d) - length(d2))
mtext(txt, side = 1, line = 3, adj = 1)
d3 <- d[ d < 10 ]
hist(d3)
txt <- sprintf("(%d points shown, %d excluded)", length(d3), length(d) - length(d3))
mtext(txt, side = 1, line = 3, adj = 1)

Long vector-plot/Coverage plot in R

I really need your R skills here. Been working with this plot for several days now. I'm a R newbie, so that might explain it.
I have sequence coverage data for chromosomes (basically a value for each position along the length of every chromosome, making the length of the vectors many millions). I want to make a nice coverage plot of my reads. This is what I got so far:
Looks alright, but I'm missing y-labels so I can tell which chromosome it is, and also I've been having trouble modifying the x-axis, so it ends where the coverage ends. Additionally, my own data is much much bigger, making this plot in particular take extremely long time. Which is why I tried this HilbertVis plotLongVector. It works but I can't figure out how to modify it, the x-axis, the labels, how to make the y-axis logged, and the vectors all get the same length on the plot even though they are not equally long.
source("http://bioconductor.org/biocLite.R")
biocLite("HilbertVis")
library(HilbertVis)
chr1 <- abs(makeRandomTestData(len=1.3e+07))
chr2 <- abs(makeRandomTestData(len=1e+07))
par(mfcol=c(8, 1), mar=c(1, 1, 1, 1), ylog=T)
# 1st way of trying with some code I found on stackoverflow
# Chr1
plotCoverage <- function(chr1, start, end) { # Defines coverage plotting function.
plot.new()
plot.window(c(start, length(chr1)), c(0, 10))
axis(1, labels=F)
axis(4)
lines(start:end, log(chr1[start:end]), type="l")
}
plotCoverage(chr1, start=1, end=length(chr1)) # Plots coverage result.
# Chr2
plotCoverage <- function(chr2, start, end) { # Defines coverage plotting function.
plot.new()
plot.window(c(start, length(chr1)), c(0, 10))
axis(1, labels=F)
axis(4)
lines(start:end, log(chr2[start:end]), type="l")
}
plotCoverage(chr2, start=1, end=length(chr2)) # Plots coverage result.
# 2nd way of trying with plotLongVector
plotLongVector(chr1, bty="n", ylab="Chr1") # ylab doesn't work
plotLongVector(chr2, bty="n")
Then I have another vector called genes that are of special interest. They are about the same length as the chromosome-vectors but in my data they contain more zeroes than values.
genes_chr1 <- abs(makeRandomTestData(len=1.3e+07))
genes_chr2 <- abs(makeRandomTestData(len=1e+07))
These gene vectors I would like plotted as a red dot under the chromosomes! Basically, if the vector has a value there (>0), it is presented as a dot (or line) under the long vector plot. This I have not idea how to add! But it seems fairly straightforward.
Please help me! Thank you so much.
DISCLAIMER: Please do not simply copy and paste this code to run off the entire positions of your chromosome. Please sample positions (for example, as #Gx1sptDTDa shows) and plot those. Otherwise you'd probably get a huge black filled rectangle after many many hours, if your computer survives the drain.
Using ggplot2, this is really easily achieved using geom_area. Here, I've generated some random data for three chromosomes with 300 positions, just to show an example. You can build up on this, I hope.
# construct a test data with 3 chromosomes and 100 positions
# and random coverage between 0 and 500
set.seed(45)
chr <- rep(paste0("chr", 1:3), each=100)
pos <- rep(1:100, 3)
cov <- sample(0:500, 300)
df <- data.frame(chr, pos, cov)
require(ggplot2)
p <- ggplot(data = df, aes(x=pos, y=cov)) + geom_area(aes(fill=chr))
p + facet_wrap(~ chr, ncol=1)
You could use the ggplot2 package.
I'm not sure what exactly you want, but here's what I did:
This has 7000 random data points (about double the amount of genes on Chromosome 1 in reality). I used alpha to show dense areas (not many here, as it's random data).
library(ggplot2)
Chr1_cov <- sample(1.3e+07,7000)
Chr1 <- data.frame(Cov=Chr1_cov,fil=1)
pl <- qplot(Cov,fil,data=Chr1,geom="pointrange",ymin=0,ymax=1.1,xlab="Chromosome 1",ylab="-",alpha=I(1/50))
print(pl)
And that's it. This ran in less than a second. ggplot2 has a humongous amount of settings, so just try some out. Use facets to create multiple graphs.
The code beneath is for a sort of moving average, and then plotting the output of that. It is not a real moving average, as a real moving average would have (almost) the same amount of data points as the original - it will only make the data smoother. This code, however, takes an average for every n points. It will of course run quite a bit faster, but you will loose a lot of detailed information.
VeryLongVector <- sample(500,1e+07,replace=TRUE)
movAv <- function(vector,n){
chops <- as.integer(length(vector)/n)
count <- 0
pos <- 0
Cov <-0
pos[1:chops] <- 0
Cov[1:chops] <- 0
for(c in 1:chops){
tmpcount <- count + n
tmppos <- median(count:tmpcount)
tmpCov <- mean(vector[count:tmpcount])
pos[c] <- tmppos
Cov[c] <- tmpCov
count <- count + n
}
result <- data.frame(pos=pos,cov=Cov)
return(result)
}
Chr1 <- movAv(VeryLongVector,10000)
qplot(pos,cov,data=Chr1,geom="line")

How to make 3-D time series plots

I have time series data on 10 individuals, similar to what is created below. I am looking for a way to make a 3-D plot similar to this example:
Except I want each unique ID to be where year is in the picture
set.seed(123)
ID <- rep(1:10, each = 500)
Time = rep(c(1:500),10)
Var <- rnorm(5000)
data <- data.frame(
ID = factor(ID),
Time = Time,
Variable = Var
)
As a very quick, and ugly start, try
library(plot3D)
## rearrange data into matrix form
m <- matrix(
data$Variable,
nrow=length(unique(data$ID)))
hist3D(z = m)
This doesn't look at all like your example plot; on the other hand, your data don't look much like the data in this plot. Things I haven't played around with yet:
axis labels
making sure the matrix is actually oriented in the right way (this will be a lot easier with real data!)
changing aspect ratio

Resources