When I plot the following example:
Participant <- c(1:12)
AnswersDay1 <- c(9,3,9,13,7,12,10,7,9,0,12,11)
Day1Group <- c(0,1,0,1, 0, 1, 0,1,0,1, 0, 1)
PushFrame <- data.frame(Participant, AnswersDay1, Day1Group)
plot(AnswersDay1, Day1Group)
The plot shows only ten dots instead of the 12 values in the data.frame. I figured out that this happens because three of the points have exactly the same values, so they are drawn on top of each other.
Is it possible to illustrate this in the plot somehow? For example, could bigger dots be used where several points share the same values?
1) sunflowerplot You may prefer to use a sunflowerplot which shows duplicate points as a single point with a spoke for each occurrence. No packages needed.
sunflowerplot(AnswersDay1, Day1Group)
(continued after graph)
2) jitter The other common technique is to use jitter which slightly moves duplicate points. In this example we jitter the Y variable but one could alternately jitter the X variable or both. No packages needed.
set.seed(123) # set seed of random number generator for reproducibility
plot(AnswersDay1, jitter(Day1Group))
(continued after graph)
3) cex If you really do want to use size as an indicator of how many duplicates then create a new data frame which contains the number of duplicates of each point (in the Participant column of ag) and then plot as shown. Again, no packages needed.
ag <- aggregate(Participant ~., PushFrame, length)
plot(Day1Group ~ AnswersDay1, ag, cex = Participant, pch = 20)
Yes, there’s absolutely a way of doing this: set the cex appropriately:
plot(AnswersDay1, Day1Group, cex = point_size)
How do you get the point size corresponding to each entry? Well, you count them using table:
tab = table(AnswersDay1, Day1Group)
This is what tab looks like:
           Day1Group
AnswersDay1 0 1
         0  0 1
         3  0 1
         7  1 1
         9  3 0
         10 1 0
         11 0 1
         12 1 1
         13 0 1
That is, for each data point in AnswersDay1 it tells you how often that point appears. Now you just need to index it using AnswersDay1 and Day1Group:
point_size = diag(tab[as.character(AnswersDay1), as.character(Day1Group)])
Note the as.character — this is necessary since the names in the table are character strings, and using numeric indices here would index the wrong element. diag gives us back just the diagonal of the resulting matrix, which is what we’re after here.
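For larger data sets, indexing the table with every observation builds an n-by-n matrix just to read its diagonal. As a possible alternative (a sketch, not part of the answer above), base R's ave can return one count per observation directly. The data vectors are repeated here so the snippet runs on its own:

```r
AnswersDay1 <- c(9, 3, 9, 13, 7, 12, 10, 7, 9, 0, 12, 11)
Day1Group   <- c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1)

# For every observation, count how many observations share the same
# (AnswersDay1, Day1Group) combination
point_size <- ave(seq_along(AnswersDay1), AnswersDay1, Day1Group, FUN = length)

plot(AnswersDay1, Day1Group, cex = point_size, pch = 20)
```

Here the three (9, 0) observations each get size 3 and every other point gets size 1.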
You can use the scales package to make your points semi-transparent; overlapping points will then appear darker (more opaque):
library(scales)
plot(AnswersDay1, Day1Group, pch = 20, cex= 2, col = alpha('black', 0.35))
the alpha parameter varies from 1 (no transparency) to 0 (complete transparency).
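If you would rather not load an extra package, the same effect should be achievable with adjustcolor from grDevices, which ships with base R. This is only a sketch of that alternative; the variable name semi_black is made up for illustration:

```r
AnswersDay1 <- c(9, 3, 9, 13, 7, 12, 10, 7, 9, 0, 12, 11)
Day1Group   <- c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1)

# alpha.f scales the opacity: 1 keeps the colour fully opaque, 0 makes it invisible
semi_black <- adjustcolor("black", alpha.f = 0.35)

plot(AnswersDay1, Day1Group, pch = 20, cex = 2, col = semi_black)
```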
A couple more possibilities:
@KonradRudolph's solution is already implemented in plotrix::sizeplot().
PushFrame <- data.frame(Participant=1:12,
AnswersDay1=c(9,3,9,13,7,12,10,7,9,0,12,11),
Day1Group=c(0,1,0,1, 0, 1, 0,1,0,1, 0, 1))
library(plotrix)
with(PushFrame,sizeplot(AnswersDay1,Day1Group))
In ggplot2, stat_sum() automatically counts coincident values and scales the size accordingly ...
library(ggplot2); theme_set(theme_bw())
ggplot(PushFrame,aes(AnswersDay1,Day1Group))+stat_sum()
I am trying to plot a single image that contains 35 ggplots. The order of the plots in the single image is fixed and is shown below.
I also want blank grids as shown in the grid image. Each grid should have the plot with a particular drug number. I have a data frame "drug_dctv2" which I am splitting, and making into a list to read data into the for loop.
The problem is: In plot_list[[i]], only the last plot is saved 35 times with i (1 to 35). I am also not sure how to save the plots in the particular order as shown in the grid.
Through my internet search, I found libraries such as "cowplot" and "gridExtra", but I couldn't find a proper way to use them here.
I made a plot layout file which contains the drug names in the following order as shown in the grid image and in place of blank spaces, I inserted "tab". But I do not find a way to proceed from there.
I am new to R. Any help and suggestion will be appreciated.
The data set looks like the sample below. Each drug has 10 data points.
**Drug_name conc viab**
Drug_1 1 1.0265
Drug_1 0.1 1.2365
Drug_1 0.01 0.5896
-- -- --
Drug_2 1 2.0584
Drug_2 0.1 1.0277
Drug_2 0.01 1.5696
-- -- --
#
split <- split(file,rep(1:35,each=10)) #### this will be used in the for loop
plot_list = list()
for(i in 1:length(split))
{
data <- split[[i]]
c <- data$conc
v <- data$viab
p = ggplot(data = data,aes(x=c,y=v))+geom_point()+ylim(0,1.5)+
scale_x_continuous(trans='log10')+
theme(axis.text = element_blank(),axis.title = element_blank()) +
geom_line(data=line_data, aes(x=x,y=y2),color ="red",size=1)
plot_list[[i]] = p
}
Thank you in advance !!
ggplot2, like many tidyverse packages, uses lazy non-standard evaluation. The expression you provide inside aes is not evaluated until the plot is built (e.g. printed or saved).
The expression in your question refers to the vectors c and v defined in the for loop. These vectors are overwritten on each iteration, but the aes call only stores the unevaluated names c and v together with the environment where the for loop is running. All 35 plots therefore use whatever c and v contain when the plot is printed or saved, which is the data from the last iteration.
You can, as mentioned in the comments, use a column from the data frame directly, since ggplot evaluates the data frame when ggplot() is called.
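A minimal sketch of using the columns directly, with a made-up toy data frame standing in for the question's data (the names toy and by_drug and all values are illustrative only, and the red reference line is left out):

```r
library(ggplot2)

# Toy stand-in for the question's data: 3 drugs, 4 concentrations each
toy <- data.frame(
  Drug_name = rep(paste0("Drug_", 1:3), each = 4),
  conc = rep(c(1, 0.1, 0.01, 0.001), times = 3),
  viab = c(1.0, 0.9, 0.6, 0.3,  1.2, 1.1, 0.8, 0.5,  0.7, 0.6, 0.4, 0.2)
)
by_drug <- split(toy, toy$Drug_name)

plot_list <- list()
for (i in seq_along(by_drug)) {
  # conc and viab are looked up in the data frame stored with each plot,
  # not in loop variables that are overwritten on the next iteration
  plot_list[[i]] <- ggplot(by_drug[[i]], aes(x = conc, y = viab)) +
    geom_point() +
    scale_x_continuous(trans = "log10") +
    ylim(0, 1.5)
}
```

Because each plot carries its own copy of the data, the for loop now yields a different plot per drug.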
An alternative, if you want to keep using c and v, is to make sure each iteration runs in its own environment, so that the references to c and v stored by ggplot point to a different c and v for each plot. This can be done, for instance, by replacing the for loop with an lapply call.
plot_list <- lapply(split, function(data_drug) {
c <- data_drug$conc
v <- data_drug$viab
ggplot(data = data_drug,aes(x=c,y=v))+geom_point()+ylim(0,1.5)+
scale_x_continuous(trans='log10')+
theme(axis.text = element_blank(),axis.title = element_blank()) +
geom_line(data=line_data, aes(x=x,y=y2),color ="red",size=1)
})
This is one beautiful example where a for loop and an lapply call produce different results and it's a great learning experience about non standard evaluation and variable environments.
To combine the plots look at cowplot::plot_grid https://wilkelab.org/cowplot/articles/plot_grid.html
Something like this should work
library(cowplot)
plot_grid(
plot_list[[35]], plot_list[[5]], plot_list[[3]], plot_list[[2]],
plot_list[[34]], plot_list[[1]], plot_list[[4]], plot_list[[6]],
plot_list[[32]], plot_list[[8]], NULL, NULL,
plot_list[[30]], plot_list[[7]], plot_list[[33]] , NULL,
labels = "AUTO", ncol = 4
)
You can put all the function arguments in a list and use do.call to call the function with the arguments:
plot_order <- c(
35, 5, 3, 2,
34, 1, 4, 6,
32, 8, NA, NA
)
plot_grid_args <- c(plot_list[plot_order], list(ncol = 4))
do.call(plot_grid, plot_grid_args)
I was finally able to solve this problem.
I made a variable, layout, holding the position of each drug in the split list, in the order the drugs appear on the grid. For example, drug_35 has to come first on the grid and sits at position 35 in the split list, so 35 comes first in layout, and so on.
I made a text file with the grid layout shown in the image above, read that file into the R script, and built the layout variable from it with a few lines of code. For the sake of simplicity I am not showing those lines here, but I hope the concept is clear.
lay <- read.delim("layout.txt",stringsAsFactors = FALSE,sep = "\t", header = F)
lay1 = c(t(lay))
col_n = ncol(lay)
row_n = nrow(lay)
split <- split(file,rep(1:35,each=10))
## layout = 35 5 3 2 34 1 4 6 32 8 0 0 30 7 33 .....
## 0 means blank spaces
png("PLOT.png", width = 6, height = 10, units = "in", res = 400)
par(mfrow=c(row_n,col_n),mar=c(2,0.7,1.5,0.5)) ## margins: bottom, left, top and right
for(i in layout)
{
if(i== 0) { frame(); next; }
## When i is 0, frame() leaves that grid cell blank and next skips to the following entry
data <- split[[i]]
c <- data$conc
v <- data$viab
plot(c,v,xlab = NULL,ylab = NULL, axes = F,log = "x")
}
dev.off()
I have a dataframe which I have constructed by interpolating a series of origin destination points (they relate to a cycle share scheme that used to run in Seattle).
I've called the dataframe interpolated_flows:
  line_id      long      lat seg_num count
1       1 -122.3170 47.61855       1   155
2       1 -122.3170 47.61911       2   155
3       1 -122.3170 47.61967       3   155
4       1 -122.3170 47.62023       4   155
5       1 -122.3169 47.62079       5   155
6       1 -122.3169 47.62135       6   155
What I would like to do (and I think is relatively simple if you know ggplot) is to plot these flows (lines) with the width of a line determined by the count and the gradient determined by the seg_num.
This is my attempt so far:
#Create variables to store relevant data for simplicity of code
X <- interpolated_flows$long
Y <- interpolated_flows$lat
sgn <- interpolated_flows$seg_num
ct <- interpolated_flows$count
#Create a map from flow data and include the bounded box as a base
g <- ggplot(interpolated_flows,aes(x=X, y=Y),group=interpolated_flows$line_id,color=sgn)
map <- ggmap(seattle_map,base_layer = g)
map <- map + geom_path(size=as.numeric(ct)/100,alpha=0.4)+
scale_alpha_continuous(range = c(0.03, 0.3))+coord_fixed(ratio=1.3)+
scale_colour_gradient(high="red",low="blue")
png(filename='Seattle_flows_gradient.png')
print(map)
dev.off()
And I end up with the image attached. I have spent a long time playing around with various parameters in the plotting part of the code but without success so would really appreciate if someone could point me in the right direction.
Edit:
base <- ggplot(interpolated_flows,aes(x=X, y=Y))
map <- ggmap(seattle_map,base_layer = base)
map <- map+geom_path(aes(color=seg_num,size=as.numeric(count)))+
scale_size_continuous(name="Journey Count",range=c(0.05,0.4))+
scale_color_gradient(name="Journey Path",high="white",low="blue",breaks=c(1,10), labels=c('Origin','Destination'))+
coord_fixed(ratio=1.3)+scale_x_continuous("", breaks=NULL)+
scale_y_continuous("", breaks=NULL)
png(filename='Seattle_flows_gradient.png')
print(map)
dev.off()
This is the plot I have now got to which looks like this. I have only two questions - 1) does anyone know a way to improve the resolution of the background map? I tried changing the zoom parameter in the get_map function but it didn't seem to help. 2) The lines I have plotted seem very 'white' heavy. It doesn't look to me like the gradient is evenly distributed. Anyone have any ideas why this would be and how to fix?
See if this suits you. I have created a new dataset so that the differences are visible. Once the data.frame is created, you can pass it as the first ggplot argument and reference columns by name, as Mako212 says.
long<-seq(-122,-123,length.out = 6)
lat<-seq(47,48,length.out = 6)
seg_num<-seq(1,6,1)
count<-seq(155,165,length.out = 6)
interpolated_flows<-data.frame(long,lat,seg_num,count,stringsAsFactors = FALSE)
base_plot<-ggplot(interpolated_flows,aes(x=long, y=lat))
base_plot+
geom_path(aes(color=seg_num,size=as.numeric(count/100),alpha=lat))+
#notice that size, color and alpha are inside aes()
scale_size_continuous(name="Count")+
scale_alpha_continuous(name="Latitude",range = c(0.03, 0.3))+ #you won't need this if you don't want variable transparency
#in that case, just set a fixed alpha value outside aes()
scale_color_gradient(name="Seg_num",high="red",low="blue")+
coord_fixed(ratio=1.3)
Hope it helps
I have a vector called data with a length of approximately 444000, and almost all of its numeric values are between 1 and 100. I want to draw the histogram and overlay the appropriate density on it. However, when I draw the histogram I get this:
hist(data,freq=FALSE)
What can I do to actually see a more detailed histogram? I tried the breaks argument and it helped, but the histogram is still really hard to see because it's so small. For example, I used breaks = 2000 and got this:
Is there something that I can do? Thanks!
Since you don't show data, I'll generate some random data:
d <- c(rexp(1e4, 100), runif(100, max=5e4))
hist(d)
When dealing with outliers like this, you can display a histogram of the logs, but that may be difficult to interpret.
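A minimal sketch of the log-scale histogram, assuming the same kind of simulated data and that every value is strictly positive (the seed, the choice of log10 and the number of breaks are arbitrary):

```r
set.seed(1)  # arbitrary seed, only for reproducibility
d <- c(rexp(1e4, 100), runif(100, max = 5e4))  # simulated data with outliers

log_d <- log10(d)  # only defined for strictly positive values
hist(log_d, breaks = 50, xlab = "log10(d)", main = "Histogram of log10(d)")
```

The x axis is now in orders of magnitude, which is why readers may find the shape harder to relate back to the raw values.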
If you are okay with showing a subset of the data, then you can filter the outliers out either dynamically (perhaps using quantile) or manually. The important thing when showing this visualization in your analysis is that if you must remove data for the plot, you are up-front about the removal. (The label used below is terse; it would also be informative to include the range and/or other properties of the omitted data, but that's subjective and will differ based on the actual data.)
quantile(d, seq(0, 1, len=11))
d2 <- d[ d < quantile(d, 0.90) ]
hist(d2)
txt <- sprintf("(%d points shown, %d excluded)", length(d2), length(d) - length(d2))
mtext(txt, side = 1, line = 3, adj = 1)
d3 <- d[ d < 10 ]
hist(d3)
txt <- sprintf("(%d points shown, %d excluded)", length(d3), length(d) - length(d3))
mtext(txt, side = 1, line = 3, adj = 1)
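The question also asks for a density drawn over the histogram. One way this might be done on the trimmed data is to plot the histogram on the density scale and add a kernel density estimate with lines. This is a sketch under the same simulated-data assumption; the seed, the number of breaks and the default bandwidth are arbitrary choices:

```r
set.seed(1)  # arbitrary seed, only for reproducibility
d <- c(rexp(1e4, 100), runif(100, max = 5e4))
d2 <- d[d < quantile(d, 0.90)]  # drop the top 10% as above

dens <- density(d2)                       # kernel density estimate
h <- hist(d2, breaks = 50, plot = FALSE)  # compute the bins without drawing

# Draw on the density scale so that bars and curve share the same y axis
plot(h, freq = FALSE, ylim = c(0, max(h$density, dens$y)),
     main = "Histogram with density estimate", xlab = "d2")
lines(dens, col = "red", lwd = 2)
```

Note that a kernel estimate smooths across the lower boundary at zero, so the curve can sit below the first bars for strongly skewed data like this.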
I have the following dataset called df:
Amp Injected Recovered Percent less_0.1_True
0.13175 25.22161274 0.96055540 3.81 0
0.26838 21.05919344 21.06294791 100.02 1
0.07602 16.88526724 16.91541763 100.18 1
0.04608 27.50209048 27.55404507 100.19 0
0.01729 8.31489333 8.31326976 99.98 1
0.31867 4.14961918 4.14876247 99.98 0
0.28756 14.65843377 14.65248551 99.96 1
0.26177 10.64754579 10.76435667 101.10 1
0.23214 6.28826689 6.28564299 99.96 1
0.20300 17.01774090 1.05925850 6.22 0
...
Here, the less_0.1_True column flags whether the Recovered period was close enough to the Injected period to be considered a successful recovery. If the flag is 1, it is a successful recovery. Based on this, I need to generate a plot (Henderson & Stassun, the Astrophysical Journal, 747:51, 2012) like the following:
I am not sure how to create a histogram like this. The closest I have come to reproducing it is a bar plot made with the following code:
breaks <- seq(0,30,by=1)
df <- split(dat, cut(dat$Injected,breaks)) # I make bins with width = 1 day
x <- seq(1,30,by=1)
len <- numeric() #Here I store the total number of objects in each bin
sum <- numeric() #Here I store the total number of 1s in each bin
for (i in 1:30){
n <- nrow(df[[i]])
len <- c(len,n)
s <- sum(df[[i]]$less_0.1_True == 1, na.rm = TRUE)
sum <- c(sum,s)
}
percent = sum/len*100 #Here I calculate what the percentage is for each bin
barplot(percent, names = x, xlab = "Period [d]" , ylab = "Percent Recovered", ylim=c(0,100))
And it generates the following bar plot:
Obviously, this plot does not look like the first one and there are issues such as it does not show from 0 to 1 like the first graph (which I understand is the case because the latter is a bar graph and not a histogram).
Could anyone please guide me as to how I may reproduce the first figure based on my dataset?
If I run your code I get errors. You need to use border = NA to get rid of the bar borders:
set.seed(42)
hist(rnorm(1000,4), xlim=c(0,10), col = 'skyblue', border = NA, main = "Histogram", xlab = NULL)
Another example using ggplot2:
library(ggplot2)
ggplot(iris, aes(x=Sepal.Length))+
  geom_histogram()
I finally found a solution to the problem in StackOverflow. I guess the solved question was worded differently than mine and so I could not find it when I was looking for it initially. The solution is here: How to plot a histogram with a custom distribution?
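For reference, the per-bin recovery percentage from the question can also be computed without an explicit loop and drawn with touching bars so that it reads like a histogram. This is only a sketch with made-up data standing in for the Injected and less_0.1_True columns; it is not the solution from the linked question:

```r
set.seed(42)  # made-up data, for illustration only
dat <- data.frame(Injected = runif(500, 0, 30))
dat$less_0.1_True <- rbinom(500, 1, prob = 1 - dat$Injected / 40)

breaks <- seq(0, 30, by = 1)  # one-day bins
bins <- cut(dat$Injected, breaks, include.lowest = TRUE)

# The mean of a 0/1 flag is the fraction of successes in each bin
percent <- as.numeric(tapply(dat$less_0.1_True, bins, mean)) * 100

# width = 1 and space = 0 place the bars on [0, 1], [1, 2], ..., so the
# x axis can be labelled in days like a histogram
barplot(percent, width = 1, space = 0, col = "grey80",
        xlab = "Period [d]", ylab = "Percent Recovered", ylim = c(0, 100))
axis(1, at = seq(0, 30, by = 5))
```

Bins with no observations come out as NA and are simply left empty.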
I really need your R skills here. I've been working on this plot for several days now. I'm an R newbie, so that might explain it.
I have sequence coverage data for chromosomes (basically a value for each position along the length of every chromosome, making the length of the vectors many millions). I want to make a nice coverage plot of my reads. This is what I got so far:
It looks alright, but I'm missing y-labels to tell which chromosome is which, and I've also had trouble modifying the x-axis so that it ends where the coverage ends. Additionally, my own data is much, much bigger, which makes this particular plot take an extremely long time. That is why I tried plotLongVector from HilbertVis. It works, but I can't figure out how to modify the x-axis or the labels, or how to make the y-axis logarithmic, and the vectors all get the same length on the plot even though they are not equally long.
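# biocLite.R has been retired; current Bioconductor releases install via BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("HilbertVis")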
library(HilbertVis)
chr1 <- abs(makeRandomTestData(len=1.3e+07))
chr2 <- abs(makeRandomTestData(len=1e+07))
par(mfcol=c(8, 1), mar=c(1, 1, 1, 1), ylog=T)
# 1st way of trying with some code I found on stackoverflow
# Chr1
plotCoverage <- function(chr1, start, end) { # Defines coverage plotting function.
plot.new()
plot.window(c(start, length(chr1)), c(0, 10))
axis(1, labels=F)
axis(4)
lines(start:end, log(chr1[start:end]), type="l")
}
plotCoverage(chr1, start=1, end=length(chr1)) # Plots coverage result.
# Chr2
plotCoverage <- function(chr2, start, end) { # Defines coverage plotting function.
plot.new()
plot.window(c(start, length(chr1)), c(0, 10))
axis(1, labels=F)
axis(4)
lines(start:end, log(chr2[start:end]), type="l")
}
plotCoverage(chr2, start=1, end=length(chr2)) # Plots coverage result.
# 2nd way of trying with plotLongVector
plotLongVector(chr1, bty="n", ylab="Chr1") # ylab doesn't work
plotLongVector(chr2, bty="n")
Then I have another vector called genes that are of special interest. They are about the same length as the chromosome-vectors but in my data they contain more zeroes than values.
genes_chr1 <- abs(makeRandomTestData(len=1.3e+07))
genes_chr2 <- abs(makeRandomTestData(len=1e+07))
I would like these gene vectors plotted as red dots under the chromosomes. Basically, wherever the vector has a value (>0), it should be shown as a dot (or line) under the long vector plot. I have no idea how to add this, but it seems like it should be fairly straightforward.
Please help me! Thank you so much.
DISCLAIMER: Please do not simply copy and paste this code and run it on every position of your chromosome. Sample positions first (for example, as @Gx1sptDTDa shows) and plot those. Otherwise you will probably get a huge black filled rectangle after many, many hours, if your computer survives the drain.
Using ggplot2, this is easily achieved with geom_area. Here, I've generated some random data for three chromosomes with 100 positions each (300 rows in total), just to show an example. You can build on this, I hope.
# construct a test data with 3 chromosomes and 100 positions
# and random coverage between 0 and 500
set.seed(45)
chr <- rep(paste0("chr", 1:3), each=100)
pos <- rep(1:100, 3)
cov <- sample(0:500, 300)
df <- data.frame(chr, pos, cov)
require(ggplot2)
p <- ggplot(data = df, aes(x=pos, y=cov)) + geom_area(aes(fill=chr))
p + facet_wrap(~ chr, ncol=1)
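To make the advice in the disclaimer concrete, here is a rough sketch of thinning a long coverage vector before plotting. The vector full_cov is made up, and keeping every 1000th position is an arbitrary choice:

```r
library(ggplot2)

set.seed(45)
# Made-up coverage vector standing in for one real chromosome
full_cov <- sample(0:500, 1e6, replace = TRUE)

# Keep every 1000th position; the step size is an arbitrary choice
keep <- seq(1, length(full_cov), by = 1000)
thinned <- data.frame(chr = "chr1", pos = keep, cov = full_cov[keep])

ggplot(thinned, aes(x = pos, y = cov)) + geom_area(aes(fill = chr))
```

Plain thinning can miss narrow peaks, so averaging over windows may be preferable when those matter.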
You could use the ggplot2 package.
I'm not sure what exactly you want, but here's what I did:
This has 7000 random data points (about double the amount of genes on Chromosome 1 in reality). I used alpha to show dense areas (not many here, as it's random data).
library(ggplot2)
Chr1_cov <- sample(1.3e+07,7000)
Chr1 <- data.frame(Cov=Chr1_cov,fil=1)
pl <- qplot(Cov,fil,data=Chr1,geom="pointrange",ymin=0,ymax=1.1,xlab="Chromosome 1",ylab="-",alpha=I(1/50))
print(pl)
And that's it. This ran in less than a second. ggplot2 has a humongous amount of settings, so just try some out. Use facets to create multiple graphs.
The code below computes a sort of moving average and then plots the output. It is not a real moving average: a real moving average would have (almost) the same number of data points as the original and would only make the data smoother. This code instead takes one average for every n points. It will of course run quite a bit faster, but you will lose a lot of detailed information.
VeryLongVector <- sample(500,1e+07,replace=TRUE)
movAv <- function(vector,n){
chops <- as.integer(length(vector)/n)
count <- 0
pos <- 0
Cov <-0
pos[1:chops] <- 0
Cov[1:chops] <- 0
for(c in 1:chops){
tmpcount <- count + n
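tmppos <- median((count + 1):tmpcount)   # centre of the current block of n positions
tmpCov <- mean(vector[(count + 1):tmpcount])   # start at count + 1 so blocks do not overlap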
pos[c] <- tmppos
Cov[c] <- tmpCov
count <- count + n
}
result <- data.frame(pos=pos,cov=Cov)
return(result)
}
Chr1 <- movAv(VeryLongVector,10000)
qplot(pos,cov,data=Chr1,geom="line")
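The same per-block averaging can probably be done without an explicit loop by reshaping the vector into a matrix with one block per column. This is a sketch of that idea, not the answer's own code; block_means is a made-up name, and any incomplete block at the end is dropped:

```r
block_means <- function(x, n) {
  chops <- length(x) %/% n                       # number of complete blocks
  m <- matrix(x[seq_len(chops * n)], nrow = n)   # one block of n values per column
  data.frame(
    pos = (seq_len(chops) - 0.5) * n + 0.5,      # centre index of each block
    cov = colMeans(m)                            # mean of each block
  )
}

set.seed(1)  # arbitrary seed, only for reproducibility
long_vector <- sample(500, 1e6, replace = TRUE)
chr_avg <- block_means(long_vector, 10000)
plot(chr_avg$pos, chr_avg$cov, type = "l", xlab = "pos", ylab = "cov")
```

colMeans works on the whole matrix at once, which tends to be noticeably faster than looping over blocks in R.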