I have the following dataset called df:
Amp Injected Recovered Percent less_0.1_True
0.13175 25.22161274 0.96055540 3.81 0
0.26838 21.05919344 21.06294791 100.02 1
0.07602 16.88526724 16.91541763 100.18 1
0.04608 27.50209048 27.55404507 100.19 0
0.01729 8.31489333 8.31326976 99.98 1
0.31867 4.14961918 4.14876247 99.98 0
0.28756 14.65843377 14.65248551 99.96 1
0.26177 10.64754579 10.76435667 101.10 1
0.23214 6.28826689 6.28564299 99.96 1
0.20300 17.01774090 1.05925850 6.22 0
...
Here, the less_0.1_True column flags whether the Recovered period was close enough to the Injected period to be considered a successful recovery. If the flag is 1, it is a successful recovery. Based on this, I need to generate a plot (Henderson & Stassun, The Astrophysical Journal, 747:51, 2012) like the following:
I am not sure how to create a histogram like this. The closest I have come to reproducing it is a bar plot with the following code:
breaks <- seq(0,30,by=1)
df <- split(dat, cut(dat$Injected,breaks)) # I make bins with width = 1 day
x <- seq(1,30,by=1)
len <- numeric() #Here I store the total number of objects in each bin
sum <- numeric() #Here I store the total number of 1s in each bin
for (i in 1:30){
n <- nrow(df[[i]])
len <- c(len,n)
s <- sum(df[[i]]$less_0.1_True == 1, na.rm = TRUE)
sum <- c(sum,s)
}
percent = sum/len*100 #Here I calculate what the percentage is for each bin
barplot(percent, names = x, xlab = "Period [d]" , ylab = "Percent Recovered", ylim=c(0,100))
And it generates the following bar plot:
Obviously, this plot does not look like the first one, and it has issues: for example, it does not show the range from 0 to 1 like the first graph (which I understand is because mine is a bar graph and not a histogram).
Could anyone please guide me as to how I may reproduce the first figure based on my dataset?
If I run your code I get errors. You need to use border = NA to get rid of the bar borders:
set.seed(42)
hist(rnorm(1000,4), xlim=c(0,10), col = 'skyblue', border = NA, main = "Histogram", xlab = NULL)
Another example using ggplot2:
library(ggplot2)
ggplot(iris, aes(x=Sepal.Length))+
  geom_histogram()
I finally found a solution to the problem in StackOverflow. I guess the solved question was worded differently than mine and so I could not find it when I was looking for it initially. The solution is here: How to plot a histogram with a custom distribution?
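For readers who land here, below is a minimal base R sketch of one way to get a histogram-style figure from binned recovery percentages. It is not the code from the linked answer. The simulated dat is a hypothetical stand-in that only reuses the column names from the question, and the bars are drawn with rect() on a numeric axis so that they span 0 to 30 without gaps.

```r
set.seed(1)
# Hypothetical stand-in for the real data (same column names as in the question)
dat <- data.frame(Injected = runif(500, 0, 30))
dat$less_0.1_True <- rbinom(500, 1, 0.7)

breaks <- seq(0, 30, by = 1)                      # bins of width 1 day
bin <- cut(dat$Injected, breaks, include.lowest = TRUE)

# The mean of a 0/1 flag is the fraction of successes; empty bins give NA
percent <- tapply(dat$less_0.1_True, bin, mean) * 100
percent[is.na(percent)] <- 0

# Empty plot on a numeric x-axis, then one rectangle per bin
plot(NA, xlim = range(breaks), ylim = c(0, 100),
     xlab = "Period [d]", ylab = "Percent Recovered")
rect(head(breaks, -1), 0, tail(breaks, -1), percent, col = "grey80")
```

Because the x-axis is numeric rather than categorical, the first bar starts at 0 and ends at 1, which is the behaviour the bar plot in the question was missing.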
Say I have some unlabeled data which I know should be clustered into six categories, for example this dataset:
library(tidyverse)
ts <- read_table(url("http://kdd.ics.uci.edu/databases/synthetic_control/synthetic_control.data"), col_names = FALSE)
If I create an hclust object with a sample of 60 from the original dataset like so:
n <- 10
s <- sample(1:100, n)
idx <- c(s, 100+s, 200+s, 300+s, 400+s, 500+s)
ts.samp <- ts[idx,]
observedLabels <- c(rep(1,n), rep(2,n), rep(3,n), rep(4,n), rep(5,n), rep(6,n))
# compute DTW distances
library(dtw)#Dynamic Time Warping (DTW)
distMatrix <- dist(ts.samp, method= 'DTW')
# hierarchical clustering
hc <- hclust(distMatrix, method='average')
I know that I can then add the labels to the dendrogram for viewing like this:
observedLabels <- c(rep(1,n), rep(2,n), rep(3,n), rep(4,n), rep(5,n), rep(6,n))
plot(hc, labels=observedLabels, main="")
However, I would like to add the correct labels to the initial data frame that was clustered. So for ts.samp I would like to add an extra column with the label of the cluster that each observation has been assigned to.
It would seem that ts.samp$cluster <- hc$label should add the cluster to the data frame, however hc$label returns NULL.
Can anyone help with extracting this information?
You need to define a level where you cut your dendrogram, this will form the groups.
Use:
labels <- cutree(hc, k = 3) # you set the number of k that's more appropriate, see how to read a dendrogram
ts.samp$grouping <- labels
Let's look at the dendrogram in order to find the best number for k:
plot(hc, main="")
abline(h=500, col = "red") # cut at height 500 forms 2 groups
abline(h=300, col = "blue") # cut at height 300 forms 3/4 groups
It looks like either 2 or 3 might be good. You need to find the highest jump in the vertical lines (Height).
Draw a horizontal line at that height and count the clusters it forms.
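As a small self-contained illustration of the same idea, here is a sketch on made-up data. The toy data frame and the choice of k = 3 are assumptions for the example only; it uses plain Euclidean distances rather than DTW so that it runs without extra packages.

```r
set.seed(1)
# Hypothetical toy data: three well-separated groups of 10 points each
toy <- data.frame(x = c(rnorm(10, 0), rnorm(10, 10), rnorm(10, 20)),
                  y = c(rnorm(10, 0), rnorm(10, 10), rnorm(10, 20)))

hc_toy <- hclust(dist(toy), method = "average")

# cutree() returns one cluster id per row, in the original row order,
# so it can be assigned straight back to the data frame
toy$cluster <- cutree(hc_toy, k = 3)
table(toy$cluster)
```

The same assignment works for ts.samp once a suitable k (or a cut height via the h argument) has been chosen.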
I have a dataframe which I have constructed by interpolating a series of origin destination points (they relate to a cycle share scheme that used to run in Seattle).
I've called the dataframe interpolated_flows:
line_id long lat seg_num count
1 1 -122.3170 47.61855 1 155
2 1 -122.3170 47.61911 2 155
3 1 -122.3170 47.61967 3 155
4 1 -122.3170 47.62023 4 155
5 1 -122.3169 47.62079 5 155
6 1 -122.3169 47.62135 6 155
What I would like to do (and I think is relatively simple if you know ggplot) is to plot these flows (lines) with the width of a line determined by the count and the gradient determined by the seg_num.
This is my attempt so far:
#Create variables to store relevant data for simplicity of code
X <- interpolated_flows$long
Y <- interpolated_flows$lat
sgn <- interpolated_flows$seg_num
ct <- interpolated_flows$count
#Create a map from flow data and include the bounded box as a base
g <- ggplot(interpolated_flows,aes(x=X, y=Y),group=interpolated_flows$line_id,color=sgn)
map <- ggmap(seattle_map,base_layer = g)
map <- map + geom_path(size=as.numeric(ct)/100,alpha=0.4)+
scale_alpha_continuous(range = c(0.03, 0.3))+coord_fixed(ratio=1.3)+
scale_colour_gradient(high="red",low="blue")
png(filename='Seattle_flows_gradient.png')
print(map)
dev.off()
And I end up with the image attached. I have spent a long time playing around with various parameters in the plotting part of the code but without success so would really appreciate if someone could point me in the right direction.
Edit:
base <- ggplot(interpolated_flows,aes(x=X, y=Y))
map <- ggmap(seattle_map,base_layer = base)
map <- map+geom_path(aes(color=seg_num,size=as.numeric(count)))+
scale_size_continuous(name="Journey Count",range=c(0.05,0.4))+
scale_color_gradient(name="Journey Path",high="white",low="blue",breaks=c(1,10), labels=c('Origin','Destination'))+
coord_fixed(ratio=1.3)+scale_x_continuous("", breaks=NULL)+
scale_y_continuous("", breaks=NULL)
png(filename='Seattle_flows_gradient.png')
print(map)
dev.off()
This is the plot I have now got to which looks like this. I have only two questions - 1) does anyone know a way to improve the resolution of the background map? I tried changing the zoom parameter in the get_map function but it didn't seem to help. 2) The lines I have plotted seem very 'white' heavy. It doesn't look to me like the gradient is evenly distributed. Anyone have any ideas why this would be and how to fix?
See if this suits you. I have created a new dataset so that the differences are visible. Once the data.frame is created, you can use it as the first ggplot argument and reference columns by name, as Mako212 says.
long<-seq(-122,-123,length.out = 6)
lat<-seq(47,48,length.out = 6)
seg_num<-seq(1,6,1)
count<-seq(155,165,length.out = 6)
interpolated_flows<-data.frame(long,lat,seg_num,count,stringsAsFactors = FALSE)
base_plot<-ggplot(interpolated_flows,aes(x=long, y=lat))
base_plot+
geom_path(aes(color=seg_num,size=as.numeric(count/100),alpha=lat))+
#notice that size, color and alpha are inside the aesthetic
scale_size_continuous(name="Count")+
scale_alpha_continuous(name="Latitude",range = c(0.03, 0.3))+ #you won't need it if you don't want variable transparency
#in that case just set alpha to the desired constant in geom_path(), outside aes()
scale_color_gradient(name="Seg_num",high="red",low="blue")+
coord_fixed(ratio=1.3)
Hope it helps
I have a vector called data with a length of approximately 444000, and almost all of the numeric values are between 1 and 100. I want to draw the histogram and draw the appropriate density on it. However, when I draw the histogram I get this:
hist(data,freq=FALSE)
What can I do to actually see a more detailed histogram? I tried to use the breaks argument, and it helped, but it's really hard to see the histogram because it's so small. For example, I used breaks = 2000 and got this:
Is there something that I can do? Thanks!
Since you don't show data, I'll generate some random data:
d <- c(rexp(1e4, 100), runif(100, max=5e4))
hist(d)
When dealing with outliers like this, you can display the histogram of the logs, but that may be difficult to interpret:
If you are okay with showing a subset of the data, then you can filter the outliers out either dynamically (perhaps using quantile) or manually. The important thing when showing this visualization in your analysis is that if you must remove data for the plot, then be up-front about the removal. (This is terse ... it would also be informative to include the range and/or other properties of the omitted data, but that's subjective and will differ based on the actual data.)
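A minimal sketch of the log approach, regenerating the same kind of random data so the snippet stands alone. The seed, the choice of log10 and breaks = 50 are assumptions for illustration:

```r
set.seed(42)
d <- c(rexp(1e4, 100), runif(100, max = 5e4))

# Histogram of the base-10 logs; each unit on the x-axis is a factor of 10
h <- hist(log10(d), breaks = 50,
          main = "Histogram of log10(d)", xlab = "log10(d)")
```

This only works when all values are strictly positive, since the log of zero or a negative number is not finite.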
quantile(d, seq(0, 1, len=11))
d2 <- d[ d < quantile(d, 0.90) ]
hist(d2)
txt <- sprintf("(%d points shown, %d excluded)", length(d2), length(d) - length(d2))
mtext(txt, side = 1, line = 3, adj = 1)
d3 <- d[ d < 10 ]
hist(d3)
txt <- sprintf("(%d points shown, %d excluded)", length(d3), length(d) - length(d3))
mtext(txt, side = 1, line = 3, adj = 1)
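Since the question also asks for a density on top of the histogram, here is a sketch of how that might look on the filtered data. It assumes the default kernel density estimate from density() is acceptable; the seed, breaks, and colour are illustrative choices.

```r
set.seed(42)
d <- c(rexp(1e4, 100), runif(100, max = 5e4))
d3 <- d[d < 10]                      # same manual cutoff as above

# freq = FALSE puts the histogram on the density scale,
# so the curve from density() is directly comparable
hist(d3, freq = FALSE, breaks = 50,
     main = "Histogram with density", xlab = "d3")
dens <- density(d3)
lines(dens, col = "red", lwd = 2)
```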
When I plot the following example:
Participant <- c(1:12)
AnswersDay1 <- c(9,3,9,13,7,12,10,7,9,0,12,11)
Day1Group <- c(0,1,0,1, 0, 1, 0,1,0,1, 0, 1)
PushFrame <- data.frame(Participant, AnswersDay1, Day1Group)
plot(AnswersDay1, Day1Group)
The plot shows only ten dots instead of the 12 values in the data.frame. I figured out that this is because three of the points have exactly the same coordinates.
Is it possible to somehow illustrate this inside the plot? Maybe that bigger dots are used when they have the same value or something like this?
1) sunflowerplot You may prefer to use a sunflowerplot which shows duplicate points as a single point with a spoke for each occurrence. No packages needed.
sunflowerplot(AnswersDay1, Day1Group)
(continued after graph)
2) jitter The other common technique is to use jitter which slightly moves duplicate points. In this example we jitter the Y variable but one could alternately jitter the X variable or both. No packages needed.
set.seed(123) # set seed of random number generator for reproducibility
plot(AnswersDay1, jitter(Day1Group))
(continued after graph)
3) cex If you really do want to use size as an indicator of how many duplicates then create a new data frame which contains the number of duplicates of each point (in the Participant column of ag) and then plot as shown. Again, no packages needed.
ag <- aggregate(Participant ~., PushFrame, length)
plot(Day1Group ~ AnswersDay1, ag, cex = Participant, pch = 20)
Yes, there’s absolutely a way of doing this: set the cex appropriately:
plot(AnswersDay1, Day1Group, cex = point_size)
How do you get the point size corresponding to each entry? Well, you count them using table:
tab = table(AnswersDay1, Day1Group)
This is what tab looks like:
Day1Group
AnswersDay1 0 1
0 0 1
3 0 1
7 1 1
9 3 0
10 1 0
11 0 1
12 1 1
13 0 1
That is, for each data point in AnswersDay1 it tells you how often that point appears. Now you just need to index it using AnswersDay1 and Day1Group:
point_size = diag(tab[as.character(AnswersDay1), as.character(Day1Group)])
Note the as.character — this is necessary since the names in the table are character strings, and using numeric indices here would index the wrong element. diag gives us back just the diagonal of the resulting matrix, which is what we’re after here.
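As a possible variation (not part of the answer above), the same counts can be looked up without building the full 12 x 12 matrix, by indexing the table with a two-column character matrix. A minimal sketch:

```r
AnswersDay1 <- c(9,3,9,13,7,12,10,7,9,0,12,11)
Day1Group <- c(0,1,0,1,0,1,0,1,0,1,0,1)
tab <- table(AnswersDay1, Day1Group)

# Each row of idx names one (row, column) cell of tab,
# so tab[idx] returns exactly one count per data point
idx <- cbind(as.character(AnswersDay1), as.character(Day1Group))
point_size <- tab[idx]

plot(AnswersDay1, Day1Group, cex = point_size)
```

Here point_size is 3 for the three observations at (9, 0) and 1 for every other point.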
You can use the scales package to alter the transparency of your points; overlapping points will then appear darker (more opaque):
library(scales)
plot(AnswersDay1, Day1Group, pch = 20, cex= 2, col = alpha('black', 0.35))
The alpha parameter varies from 1 (no transparency) to 0 (complete transparency).
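If you would rather not load an extra package, a similar effect should be possible with adjustcolor() from base R's grDevices. This is a sketch of that alternative; the variable name faded_black is made up for the example.

```r
AnswersDay1 <- c(9,3,9,13,7,12,10,7,9,0,12,11)
Day1Group <- c(0,1,0,1,0,1,0,1,0,1,0,1)

# adjustcolor() returns a hex colour string with an alpha channel appended
faded_black <- adjustcolor("black", alpha.f = 0.35)
plot(AnswersDay1, Day1Group, pch = 20, cex = 2, col = faded_black)
```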
A couple more possibilities:
@KonradRudolph's solution is already implemented in plotrix::sizeplot().
PushFrame <- data.frame(Participant=1:12,
AnswersDay1=c(9,3,9,13,7,12,10,7,9,0,12,11),
Day1Group=c(0,1,0,1, 0, 1, 0,1,0,1, 0, 1))
library(plotrix)
with(PushFrame,sizeplot(AnswersDay1,Day1Group))
In ggplot2, stat_sum() automatically counts coincident values and scales the size accordingly ...
library(ggplot2); theme_set(theme_bw())
ggplot(PushFrame,aes(AnswersDay1,Day1Group))+stat_sum()
I really need your R skills here. I've been working on this plot for several days now. I'm an R newbie, so that might explain it.
I have sequence coverage data for chromosomes (basically a value for each position along the length of every chromosome, making the length of the vectors many millions). I want to make a nice coverage plot of my reads. This is what I got so far:
It looks alright, but I'm missing y-labels so I can tell which chromosome it is, and I've also been having trouble modifying the x-axis so that it ends where the coverage ends. Additionally, my own data is much, much bigger, which makes this plot in particular take an extremely long time. That is why I tried HilbertVis plotLongVector. It works, but I can't figure out how to modify the x-axis or the labels, or how to make the y-axis logged, and the vectors all get the same length on the plot even though they are not equally long.
source("http://bioconductor.org/biocLite.R")
biocLite("HilbertVis")
library(HilbertVis)
chr1 <- abs(makeRandomTestData(len=1.3e+07))
chr2 <- abs(makeRandomTestData(len=1e+07))
par(mfcol=c(8, 1), mar=c(1, 1, 1, 1), ylog=T)
# 1st way of trying with some code I found on stackoverflow
# Chr1
plotCoverage <- function(chr1, start, end) { # Defines coverage plotting function.
plot.new()
plot.window(c(start, length(chr1)), c(0, 10))
axis(1, labels=F)
axis(4)
lines(start:end, log(chr1[start:end]), type="l")
}
plotCoverage(chr1, start=1, end=length(chr1)) # Plots coverage result.
# Chr2
plotCoverage <- function(chr2, start, end) { # Defines coverage plotting function.
plot.new()
plot.window(c(start, length(chr1)), c(0, 10))
axis(1, labels=F)
axis(4)
lines(start:end, log(chr2[start:end]), type="l")
}
plotCoverage(chr2, start=1, end=length(chr2)) # Plots coverage result.
# 2nd way of trying with plotLongVector
plotLongVector(chr1, bty="n", ylab="Chr1") # ylab doesn't work
plotLongVector(chr2, bty="n")
Then I have another vector called genes that are of special interest. They are about the same length as the chromosome-vectors but in my data they contain more zeroes than values.
genes_chr1 <- abs(makeRandomTestData(len=1.3e+07))
genes_chr2 <- abs(makeRandomTestData(len=1e+07))
I would like these gene vectors plotted as red dots under the chromosomes! Basically, if the vector has a value there (>0), it is presented as a dot (or line) under the long vector plot. I have no idea how to add this, but it seems like it should be fairly straightforward.
Please help me! Thank you so much.
DISCLAIMER: Please do not simply copy and paste this code and run it on every position of your chromosome. Please sample positions (for example, as @Gx1sptDTDa shows) and plot those. Otherwise you'd probably get a huge black filled rectangle after many, many hours, if your computer survives the drain.
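One simple way to sample positions is to keep every k-th one before plotting. The sketch below assumes a regular thinning step of 1000 and uses a made-up coverage vector (cov_full); both are illustrative choices, not a recommendation for real data.

```r
set.seed(45)
# Hypothetical coverage vector with one value per position
cov_full <- sample(0:500, 1e6, replace = TRUE)

# Keep every 1000th position only
keep <- seq(1, length(cov_full), by = 1000)
cov_small <- data.frame(pos = keep, cov = cov_full[keep])

plot(cov_small$pos, cov_small$cov, type = "l",
     xlab = "Position", ylab = "Coverage")
```

Regular thinning can miss narrow peaks, so averaging within windows may be preferable when short features matter.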
Using ggplot2, this is really easily achieved using geom_area. Here, I've generated some random data for three chromosomes with 100 positions each (300 in total), just to show an example. You can build on this, I hope.
# construct test data with 3 chromosomes and 100 positions each
# and random coverage between 0 and 500
set.seed(45)
chr <- rep(paste0("chr", 1:3), each=100)
pos <- rep(1:100, 3)
cov <- sample(0:500, 300)
df <- data.frame(chr, pos, cov)
require(ggplot2)
p <- ggplot(data = df, aes(x=pos, y=cov)) + geom_area(aes(fill=chr))
p + facet_wrap(~ chr, ncol=1)
You could use the ggplot2 package.
I'm not sure what exactly you want, but here's what I did:
This has 7000 random data points (about double the amount of genes on Chromosome 1 in reality). I used alpha to show dense areas (not many here, as it's random data).
library(ggplot2)
Chr1_cov <- sample(1.3e+07,7000)
Chr1 <- data.frame(Cov=Chr1_cov,fil=1)
pl <- qplot(Cov,fil,data=Chr1,geom="pointrange",ymin=0,ymax=1.1,xlab="Chromosome 1",ylab="-",alpha=I(1/50))
print(pl)
And that's it. This ran in less than a second. ggplot2 has a humongous amount of settings, so just try some out. Use facets to create multiple graphs.
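For the red marks under the coverage trace that the question asks about, a base-graphics sketch is shown below. It assumes that tick marks along the x-axis drawn with rug() are an acceptable stand-in for dots; cov_chr and genes_chr are made-up vectors, and the coverage is thinned before plotting.

```r
set.seed(1)
n <- 1e5
cov_chr <- rpois(n, 20) + 1          # hypothetical coverage, shifted so log() is finite
genes_chr <- numeric(n)              # hypothetical gene vector: mostly zeros
genes_chr[sample(n, 50)] <- 1

idx <- seq(1, n, by = 100)           # thin the coverage before plotting
plot(idx, log(cov_chr[idx]), type = "l",
     xlab = "Position", ylab = "log(coverage)")

# One red tick on the x-axis for every position where the gene vector is > 0
gene_pos <- which(genes_chr > 0)
rug(gene_pos, col = "red", lwd = 2)
```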
The code beneath is for a sort of moving average, and then plotting the output of that. It is not a real moving average, as a real moving average would have (almost) the same number of data points as the original and would only make the data smoother. This code, however, takes one average for every n points. It will of course run quite a bit faster, but you will lose a lot of detailed information.
VeryLongVector <- sample(500,1e+07,replace=TRUE)
movAv <- function(vector,n){
  chops <- as.integer(length(vector)/n)  # number of complete windows of n points
  pos <- numeric(chops)                  # midpoint of each window
  Cov <- numeric(chops)                  # mean coverage in each window
  for(i in seq_len(chops)){
    first <- (i-1)*n + 1                 # first index of window i
    last <- i*n                          # last index of window i (windows do not overlap)
    pos[i] <- median(first:last)
    Cov[i] <- mean(vector[first:last])
  }
  result <- data.frame(pos=pos,cov=Cov)
  return(result)
}
Chr1 <- movAv(VeryLongVector,10000)
qplot(pos,cov,data=Chr1,geom="line")
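For comparison, a true moving average of the kind described above keeps one value per input point. A minimal sketch using stats::filter with a centred window is below; the short vector x and the window of 3 are only for illustration.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
n <- 3

# Equal weights of 1/n over a centred window of n points;
# the ends, where the window is incomplete, come back as NA
ma <- stats::filter(x, rep(1 / n, n), sides = 2)
```

The result has the same length as x, which is why this approach smooths the data but does not reduce the number of points to plot.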