I am trying to make feature plots of my top 20 genes. For this I created a list of data frames; each data frame consists of several columns containing values and names.
One of these columns is the gene column. My code turns the first 20 genes into feature plots. The problem is that some of the data frames contain
fewer than 20 genes, which causes my code to abort.
Because I want a maximum of 5 feature plots per page, I cannot just define a counter.
Thank you for the input.
Example of my list of data frames
listGroups
group1_2: 'data.frame': 68 obs. of 7 variables:
..$ p_val: num [1:68] 1.15 1.43 ...
..$ score: num [1:68] 15.5 27.14 ...
..$ gene: Factor w/ 68 levels "BRA1", "NED",...: 41 52 ...
group2_3: 'data.frame': 3 obs. of 7 variables:
..$ p_val: num [1:3] 1.15 1.43 ...
..$ score: num [1:3] 15.5 27.14 ...
..$ gene: Factor w/ 3 levels "BCL12", "DEF1",...: 41 52 ...
Code
groupNames <- c("cluster1_2","cluster2_3","cluster3_4","cluster4_5","cluster5_6")
for (i in 1:length(listGroups)) {
Grouplist <- listGroups[[i]]
genesList <- Grouplist['gene']
lengths(genesList)
print(groupNames[i])
# Make Featureplots for top20 DE genes per cluster_group
pdf(file=paste0(sampleFolder,"/Featureplots_cluster_",groupNames[i],"_",sampleName,".pdf"))
print(FeaturePlot(object = seuratObj, features = c(as.character(genesList[1:5,]))))
print(FeaturePlot(object = seuratObj, features = c(as.character(genesList[6:10,]))))
print(FeaturePlot(object = seuratObj, features = c(as.character(genesList[11:15,]))))
print(FeaturePlot(object = seuratObj, features = c(as.character(genesList[16:20,]))))
dev.off()
}
For each gene list you could make one plot with your choice of genes, like this (the plot would look fine at a larger size, as in your PDF):
Use combine = FALSE and limit the number of features to plot to something like rownames(pbmc_small)[1 : min(20, nrow(pbmc_small))] to avoid errors.
Then take the resulting list of single plots (which allows for theming) and arrange them with cowplot::plot_grid.
Instead of plotting within the function (plot(out)), you could export to PDF (maybe pass the filename as a second argument to the function).
library(Seurat)
genelist <- list(
l1 = sample(rownames(pbmc_small), 23),
l2 = sample(rownames(pbmc_small), 14),
l3 = sample(rownames(pbmc_small), 4))
plotFeatures <- function(x){
p <- FeaturePlot(object = pbmc_small,
features = x[1 : min(20, length(x))],
combine = FALSE, label.size = 2)
out <- cowplot::plot_grid(plotlist = p, ncol = 5, nrow = 4)
plot(out)
}
lapply(genelist, plotFeatures)
Not tested, but something like this should work. Instead of calling print four times with five hard-coded genes each, we call it in a loop, once per chunk of up to five genes. With 10 genes the for loop prints twice, with 20 genes four times, etc.:
groupNames <- c("cluster1_2","cluster2_3","cluster3_4","cluster4_5","cluster5_6")
for (i in 1:length(listGroups)) {
Grouplist <- listGroups[[i]]
# at most the top 20 genes, as a character vector
genesList <- head(as.character(Grouplist$gene), 20)
print(groupNames[i])
# Make Featureplots for top20 DE genes per cluster_group
# make chunks of 5 each.
myChunks <- split(genesList, ceiling(seq_along(genesList)/5))
pdf(file=paste0(sampleFolder,"/Featureplots_cluster_",groupNames[i],"_",sampleName,".pdf"))
# loop through the chunks, plotting up to 5 genes each time.
for (x in seq_along(myChunks)) {
print(FeaturePlot(object = seuratObj, features = myChunks[[x]]))
}
dev.off()
}
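The chunking step can be checked on its own, without Seurat. The following is a minimal base R sketch; chunk_genes is a hypothetical helper name and the gene names are made up. It caps the list at 20 genes and splits it into pages of at most 5, so a short list simply yields fewer (or shorter) pages instead of an error.

```r
# Hypothetical helper (not part of Seurat): keep at most `max_genes` genes
# and split them into pages of at most `per_page` genes each.
chunk_genes <- function(genes, max_genes = 20, per_page = 5) {
  genes <- head(as.character(genes), max_genes)
  split(genes, ceiling(seq_along(genes) / per_page))
}

# Made-up gene names, only to show the page sizes
pages_3  <- chunk_genes(paste0("gene", 1:3))   # one page with 3 genes
pages_14 <- chunk_genes(paste0("gene", 1:14))  # pages of 5, 5 and 4 genes
pages_68 <- chunk_genes(paste0("gene", 1:68))  # four pages of 5 genes (top 20 only)

lengths(pages_3)
lengths(pages_14)
lengths(pages_68)
```

Each element of the returned list could then be passed as the features argument of one FeaturePlot call.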
Thanks to the input of zx8754 and user12728748, I found two solutions for my problem.
for (i in 1:length(listGroups)) {
Grouplist <- listGroups[[i]]
genesList <- Grouplist['gene']
print(groupNames[i])
## Solution 1
# Only the first 4 chunks (at most 20 genes) are plotted
# make chunks of 5 each.
myChunks <- split(genesList,ceiling(seq(lengths(genesList))/5))
# Make Featureplots for top20 DE genes per cluster_group
pdf(file=paste0(sampleFolderAggr,"results/Featureplots_",groupNames[i],"_",sampleNameAggr,".pdf"))
# loop through the chunks, plotting up to 5 genes each time.
for(x in 1:min(4, length(myChunks))){
# Create a list of 5 genes
my5Genes <- as.list(myChunks[[x]])
print(FeaturePlot(object = seuratObj, features = c(as.character(my5Genes$gene))))
}
dev.off()
## Solution 2
pdf(file=paste0(sampleFolderAggr,"results/Featureplots_",groupNames[i],"_",sampleNameAggr,".pdf"))
plotFeatures <- function(x){
p <- FeaturePlot(object = seuratObj, features = c(as.character(x[1: min(20, lengths(x)),])), combine = FALSE, label.size = 2)
out <- cowplot::plot_grid(plotlist = p, ncol = 5, nrow = 4)
# Make Featureplots for top20 DE genes per cluster_group
plot(out)
}
# genesList is the one-column data frame defined at the top of the loop
plotFeatures(genesList)
dev.off()
}
I am trying to draw a boxplot in R where the input file has multiple columns and each column has a different number of rows. Following the help given in this question:
boxplot of vectors with different length
I am trying:
x <- read.csv( 'filename.csv', header = T )
plot(
1, 1,
xlim=c(1,ncol(x)), ylim=range(x[-1,], na.rm=TRUE),
xaxt='n', xlab='', ylab=''
)
axis(1, labels=colnames(x), at=1:ncol(x))
for(i in 1:ncol(x)) {
p <- x[,i]
boxplot(p, add=T, at=i)
}
I want to plot the values on a log scale, but when I set log = "y" I get the following error:
Error in xypolygon(xx, yy, lty = "blank", col = boxfill[i]) :
plot.new has not been called yet
Following is the sample of my input csv data:
A B C D
2345.42 932.19 40.8 26.19
138.48 1074.1 4405.62 4077.16
849.35 0.0 1451.66 1637.39
451.38 146.22 4579.6 5133.14
5749.01 7250.08 12.23 0.09
4125.48 129.46 49.51
440.38 6405.02
Your data as a reproducible example
Note I had to remove an extra element
library(data.table)
df <- fread("A,B,C,D
2345.42,932.19,40.8,26.19
138.48,1074.1,4405.62,4077.16
849.35,0.0,1451.66,1637.39
451.38,146.22,4579.6,5133.14
5749.01,7250.08,12.23,0.09
4125.48,129.46,49.51,440.38", sep=",", header=T)
dplyr and tidyr solution
library(dplyr)
library(tidyr)
df1 <- df %>%
replace(.==0,NA) %>% # make 0 into NA
gather(var,values,A:D) %>% # convert from wide (4-col) to long (2-col) format
mutate(values = log10(values)) # log10 transform
If you want log2, simply replace log10 with log2
Output
boxplot(values ~ var, df1)
A little extra
For a log10 scale, I like to add 1 to my values before transforming, because log10(x) is negative for 0 < x < 1. This sets the minimum value on the plot to 0, since 0 + 1 = 1 and log10(1) = 0.
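A small numeric sketch of the difference, using a few values taken from the sample data (the vector name x is only for illustration). With the + 1 shift, zeros no longer need to be replaced by NA first, although whether the shift is appropriate depends on the data.

```r
# A few values from the sample data, including a zero and a value below 1
x <- c(0, 0.09, 12.23, 932.19)

plain   <- log10(x)      # -Inf for 0 and a negative value for 0.09
shifted <- log10(x + 1)  # 0 for 0, and no negative values

plain
shifted

# log1p() computes log(1 + x); dividing by log(10) gives the same result
shifted_alt <- log1p(x) / log(10)
```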
I have an R data.frame:
> str(trainTotal)
'data.frame': 1000 obs. of 41 variables:
$ V1 : num 0.299 -1.174 1.192 1.573 -0.613 ...
$ V2 : num -1.227 0.332 -0.414 -0.58 -0.644 ...
etc.
$ V40 : num 0.101 -1.818 2.987 1.883 0.408 ...
$ Class: int 1 0 0 1 0 1 0 1 1 0 ...
and I would like to draw a 3D scatter plot of Class "0" in blue and Class "1" in red according to V13, V5, and V24.
V13, V5, V24 are the top variables when sorted by scaled variance, so my intuition tells me the 3D visualization could be interesting. Not sure if that makes sense.
How can I plot this with R ?
Edit:
I have tried the following:
install.packages("Rcmdr")
library(Rcmdr)
scatter3d(x=trainTotal[[13]], y= trainTotal[[5]], z= trainTotal[[24]], point.col = as.numeric(as.factor(trainTotal[,41])), size = 10)
which gives me this plot:
I am not sure how to read this plot.
I would prefer to see only dots of two colors, for a start.
Maybe something like this? Using scatterplot3d.
library(scatterplot3d)
#random data
DF <- data.frame(V13 = sample(1:100, 10, T), V5 = sample(1:100, 10, T), V24 = sample(1:100, 10, T), class = sample(0:1, 10, T))
#plot
scatterplot3d(x = DF$V13, y = DF$V5, z = DF$V24, color = c("blue", "red")[as.factor(DF$class)], pch = 19)
This gives:
In scatterplot3d there is also an angle argument for different views.
Perspective issues mean that static 3d plots are mostly horrible and misleading. If you really want a 3d scatterplot, it's best to draw one where you can view it from different angles. The rgl package allows this.
EDIT: I've updated the plot to use colours, in this case picked using the colorspace package, though you can define them however you like. Specifying attributes for points is described on the ?rgl.material help page.
library(rgl)
library(colorspace)
n_points <- 50
n_groups <- 5
some_data <- data.frame(
x = seq(0, 1, length.out = n_points),
y = runif(n_points),
z = rnorm(n_points),
group = gl(n_groups, n_points / n_groups)
)
colors <- rainbow_hcl(n_groups)
with(some_data, points3d(x, y, z, color = colors[group], size = 7))
axes3d()
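Applied to the two-class case from the question, the same idea might look like the sketch below. train_demo is simulated stand-in data, not the real trainTotal, and the rgl call is only attempted in an interactive session where rgl is installed.

```r
set.seed(42)
# Simulated stand-in for trainTotal; only the columns used here are generated
train_demo <- data.frame(V13 = rnorm(100), V5 = rnorm(100), V24 = rnorm(100),
                         Class = sample(0:1, 100, replace = TRUE))

# Class 0 -> "blue", Class 1 -> "red"
point_cols <- c("blue", "red")[train_demo$Class + 1]

# Interactive 3D view that can be rotated with the mouse
if (interactive() && requireNamespace("rgl", quietly = TRUE)) {
  rgl::plot3d(train_demo$V13, train_demo$V5, train_demo$V24,
              col = point_cols, size = 5,
              xlab = "V13", ylab = "V5", zlab = "V24")
}
```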
I need to calculate a moving average and standard deviation for a moving window. This is simple enough with the caTools package!
... However, having defined my moving window, I would like to take the average of ONLY those values within the window whose corresponding values of other variables meet certain criteria. For example, I would like to calculate a moving Temperature average, using only the values within the window (e.g. +/- 2 days) for which, say, Relative Humidity is above 80%.
Could anybody help point me in the right direction? Here is some example data:
da <- data.frame(matrix(c(12,15,12,13,8,20,18,19,20,80,79,91,92,70,94,80,80,90),
ncol = 2, byrow = FALSE))
names(da) = c("Temp", "RH")
Thanks,
Brad
I haven't used caTools, but in the help text for the (presumably) most relevant function in that package, ?runmean, you see that x, the input data, can be either "a numeric vector [...] or matrix with n rows". In your case the matrix alternative is the relevant one: you wish to calculate the mean of a focal variable, Temp, conditional on a second variable, RH, so the function needs access to both variables. However, "[i]f x is a matrix than each column will be processed separately". Thus, I don't think caTools can solve your problem. Instead, I would suggest rollapply in the zoo package. rollapply has the argument by.column, whose default is TRUE: "If TRUE, FUN is applied to each column separately". As explained above, we need access to both columns in the function, so we set by.column to FALSE.
library(zoo)
# First, specify a function to apply to each window: mean of Temp where RH > 80
meanfun <- function(x) mean(x[(x[ , "RH"] > 80), "Temp"])
# Apply the function to windows of size 3 in your data 'da'.
meanTemp <- rollapply(data = da, width = 3, FUN = meanfun, by.column = FALSE)
meanTemp
# If you want to add the means to 'da',
# you need to make it the same length as number of rows in 'da'.
# This can be achieved by the `fill` argument,
# where we can pad the resulting vector of running means with NA
meanTemp <- rollapply(data = da, width = 3, FUN = meanfun, by.column = FALSE, fill = NA)
# Add the vector of means to the data frame
da2 <- cbind(da, meanTemp)
da2
# even smaller example to make it easier to see how the function works
da <- data.frame(Temp = 1:9, RH = rep(c(80, 81, 80), each = 3))
meanTemp <- rollapply(data = da, width = 3, FUN = meanfun, by.column = FALSE, fill = NA)
da2 <- cbind(da, meanTemp)
da2
# Temp RH meanTemp
# 1 1 80 NA
# 2 2 80 NaN
# 3 3 80 4.0
# 4 4 81 4.5
# 5 5 81 5.0
# 6 6 81 5.5
# 7 7 80 6.0
# 8 8 80 NaN
# 9 9 80 NA
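To see what happens inside each window, here is a base R sketch of the same calculation. cond_roll_mean is a hypothetical helper written for this illustration; it is not part of zoo or caTools. It should reproduce the meanTemp column above for the small example.

```r
# Hypothetical helper: centred rolling mean of `value`, using only the
# window positions where `keep` is TRUE.
cond_roll_mean <- function(value, keep, half_width = 1) {
  n <- length(value)
  vapply(seq_len(n), function(i) {
    lo <- i - half_width
    hi <- i + half_width
    if (lo < 1 || hi > n) return(NA_real_)  # incomplete window, like fill = NA
    idx <- lo:hi
    mean(value[idx][keep[idx]])             # NaN when nothing in the window qualifies
  }, numeric(1))
}

da <- data.frame(Temp = 1:9, RH = rep(c(80, 81, 80), each = 3))
# half_width = 1 corresponds to width = 3 in rollapply
da$meanTemp <- cond_roll_mean(da$Temp, da$RH > 80, half_width = 1)
da
```

For the +/- 2 days window from the question, half_width = 2 would correspond to width = 5.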
I'm trying to make a hexbin representation of data in several categories. The problem is, facetting these bins seems to make all of them different sizes.
set.seed(1) #Create data
bindata <- data.frame(x=rnorm(100), y=rnorm(100))
fac_probs <- dnorm(seq(-3, 3, length.out=26))
fac_probs <- fac_probs/sum(fac_probs)
bindata$factor <- sample(letters, 100, replace=TRUE, prob=fac_probs)
library(ggplot2) #Actual plotting
library(hexbin)
ggplot(bindata, aes(x=x, y=y)) +
geom_hex() +
facet_wrap(~factor)
Is it possible to set something to make all these bins physically the same size?
As Julius says, the problem is that hexGrob doesn't get the information about the bin sizes, and guesses it from the differences it finds within the facet.
Obviously, it would make sense to hand dx and dy to a hexGrob -- not having the width and height of a hexagon is like specifying a circle by center without giving the radius.
Workaround:
The resolution strategy works if the facet contains two adjacent hexagons that differ in both x and y. So, as a workaround, I'll manually construct a data.frame containing the x and y center coordinates of the cells, the factor for facetting, and the counts:
In addition to the libraries specified in the question, I'll need
library (reshape2)
and also bindata$factor actually needs to be a factor:
bindata$factor <- as.factor (bindata$factor)
Now, calculate the basic hexagon grid
h <- hexbin (bindata, xbins = 5, IDs = TRUE,
xbnds = range (bindata$x),
ybnds = range (bindata$y))
Next, we need to calculate the counts depending on bindata$factor
counts <- hexTapply (h, bindata$factor, table)
counts <- t (simplify2array (counts))
counts <- melt (counts)
colnames (counts) <- c ("ID", "factor", "counts")
As we have the cell IDs, we can merge this data.frame with the proper coordinates:
hexdf <- data.frame (hcell2xy (h), ID = h@cell)
hexdf <- merge (counts, hexdf)
Here's what the data.frame looks like:
> head (hexdf)
ID factor counts x y
1 3 e 0 -0.3681728 -1.914359
2 3 s 0 -0.3681728 -1.914359
3 3 y 0 -0.3681728 -1.914359
4 3 r 0 -0.3681728 -1.914359
5 3 p 0 -0.3681728 -1.914359
6 3 o 0 -0.3681728 -1.914359
Plotting this with ggplot (using the command below) yields the correct bin sizes, but the figure looks a bit odd: 0-count hexagons are drawn, but only where some other facet has this bin populated. To suppress them, we can set the counts there to NA and make the na.value completely transparent (it defaults to grey50):
hexdf$counts [hexdf$counts == 0] <- NA
ggplot(hexdf, aes(x=x, y=y, fill = counts)) +
geom_hex(stat="identity") +
facet_wrap(~factor) +
coord_equal () +
scale_fill_continuous (low = "grey80", high = "#000040", na.value = "#00000000")
yields the figure at the top of the post.
This strategy works as long as the binwidths are correct without facetting. If the binwidths are set very small, the resolution may still yield too large dx and dy. In that case, we can supply hexGrob with two adjacent bins (but differing in both x and y) with NA counts for each facet.
dummy <- hgridcent (xbins = 5,
xbnds = range (bindata$x),
ybnds = range (bindata$y),
shape = 1)
dummy <- data.frame (ID = 0,
factor = rep (levels (bindata$factor), each = 2),
counts = NA,
x = rep (dummy$x [1] + c (0, dummy$dx/2),
nlevels (bindata$factor)),
y = rep (dummy$y [1] + c (0, dummy$dy ),
nlevels (bindata$factor)))
An additional advantage of this approach is that we can delete all the rows with 0 counts already in counts, in this case reducing the size of hexdf by roughly 3/4 (122 rows instead of 520):
counts <- counts [counts$counts > 0 ,]
hexdf <- data.frame (hcell2xy (h), ID = h@cell)
hexdf <- merge (counts, hexdf)
hexdf <- rbind (hexdf, dummy)
The plot looks exactly the same as above, but you can visualize the difference with na.value not being fully transparent.
more about the problem
The problem is not unique to facetting but occurs always if too few bins are occupied, so that no "diagonally" adjacent bins are populated.
Here's a series of more minimal data that shows the problem:
First, I trace ggplot2:::hexBin so that I capture both the center coordinates of the hexagonal grid it uses and the object returned by hexbin:
trace (ggplot2:::hexBin, exit = quote ({trace.grid <<- as.data.frame (hgridcent (xbins = xbins, xbnds = xbnds, ybnds = ybnds, shape = ybins/xbins) [1:2]); trace.h <<- hb}))
Set up a very small data set:
df <- data.frame (x = 3 : 1, y = 1 : 3)
And plot:
p <- ggplot(df, aes(x=x, y=y)) + geom_hex(binwidth=c(1, 1)) +
coord_fixed (xlim = c (0, 4), ylim = c (0,4))
p # needed for the tracing to occur
p + geom_point (data = trace.grid, size = 4) +
geom_point (data = df, col = "red") # data pts
str (trace.h)
Formal class 'hexbin' [package "hexbin"] with 16 slots
..@ cell : int [1:3] 3 5 7
..@ count : int [1:3] 1 1 1
..@ xcm : num [1:3] 3 2 1
..@ ycm : num [1:3] 1 2 3
..@ xbins : num 2
..@ shape : num 1
..@ xbnds : num [1:2] 1 3
..@ ybnds : num [1:2] 1 3
..@ dimen : num [1:2] 4 3
..@ n : int 3
..@ ncells: int 3
..@ call : language hexbin(x = x, y = y, xbins = xbins, shape = ybins/xbins, xbnds = xbnds, ybnds = ybnds)
..@ xlab : chr "x"
..@ ylab : chr "y"
..@ cID : NULL
..@ cAtt : int(0)
I repeat the plot, leaving out data point 2:
p <- ggplot(df [-2,], aes(x=x, y=y)) + geom_hex(binwidth=c(1, 1)) + coord_fixed (xlim = c (0, 4), ylim = c (0,4))
p
p + geom_point (data = trace.grid, size = 4) + geom_point (data = df, col = "red")
str (trace.h)
Formal class 'hexbin' [package "hexbin"] with 16 slots
..@ cell : int [1:2] 3 7
..@ count : int [1:2] 1 1
..@ xcm : num [1:2] 3 1
..@ ycm : num [1:2] 1 3
..@ xbins : num 2
..@ shape : num 1
..@ xbnds : num [1:2] 1 3
..@ ybnds : num [1:2] 1 3
..@ dimen : num [1:2] 4 3
..@ n : int 2
..@ ncells: int 2
..@ call : language hexbin(x = x, y = y, xbins = xbins, shape = ybins/xbins, xbnds = xbnds, ybnds = ybnds)
..@ xlab : chr "x"
..@ ylab : chr "y"
..@ cID : NULL
..@ cAtt : int(0)
Note that the results from hexbin are on the same grid: the cell numbers did not change (cell 5 is simply no longer populated and thus not listed), and the grid dimensions and ranges did not change either. But the plotted hexagons changed dramatically.
Also notice that hgridcent forgets to return the center coordinates of the first cell (lower left),
even though that cell gets populated:
df <- data.frame (x = 1 : 3, y = 1 : 3)
p <- ggplot(df, aes(x=x, y=y)) + geom_hex(binwidth=c(0.5, 0.8)) +
coord_fixed (xlim = c (0, 4), ylim = c (0,4))
p # needed for the tracing to occur
p + geom_point (data = trace.grid, size = 4) +
geom_point (data = df, col = "red") + # data pts
geom_point (data = as.data.frame (hcell2xy (trace.h)), shape = 1, size = 6)
Here, the rendering of the hexagons cannot possibly be correct - they do not belong to one hexagonal grid.
I tried to replicate your solution with the same data set using lattice hexbinplot. Initially, it gave me an error xbnds[1] < xbnds[2] is not fulfilled. This error was due to wrong numeric vectors specifying range of values that should be covered by the binning. I changed those arguments in hexbinplot, and it somehow worked. Not sure if it helps you to solve it with ggplot, but it's probably some starting point.
library(lattice)
library(hexbin)
hexbinplot(y ~ x | factor, bindata, xbnds = "panel", ybnds = "panel", xbins=5,
layout=c(7,3))
EDIT
Although rectangular bins with stat_bin2d() work just fine:
ggplot(bindata, aes(x=x, y=y, group=factor)) +
facet_wrap(~factor) +
stat_bin2d(binwidth=c(0.6, 0.6))
There are two source files that we are interested in: stat-binhex.r and geom-hex.r, mainly hexBin and hexGrob functions.
As @Dinre mentioned, this issue is not really related to faceting. What we can see is that binwidth is not ignored but is used in a special way in hexBin, and that this function is applied to every facet separately. After that, hexGrob is applied to every facet. To be sure, you can inspect them with e.g.
trace(ggplot2:::hexGrob, quote(browser()))
trace(ggplot2:::hexBin, quote(browser()))
Hence this explains why sizes differ - they depend on both binwidth and the data of each facet itself.
It is difficult to keep track of the process because of various coordinates transforms, but notice that the output of hexBin
data.frame(
hcell2xy(hb),
count = hb@count,
density = hb@count / sum(hb@count, na.rm=TRUE)
)
always seems to look quite ordinary, and that hexGrob, which is responsible for drawing the hex bins (it calls polygonGrob), is where the distortion arises. When there is only one hex bin in a facet, there is a more serious anomaly:
dx <- resolution(x, FALSE)
dy <- resolution(y, FALSE) / sqrt(3) / 2 * 1.15
in ?resolution we can see
Description
The resolution is the smallest non-zero distance between adjacent
values. If there is only one unique value, then the resolution is
defined to be one.
for this reason (resolution(x, FALSE) == 1 and resolution(y, FALSE) == 1) the x coordinates of polygonGrob of the first facet in your example are
[1] 1.5native 1.5native 0.5native -0.5native -0.5native 0.5native
and if I am not wrong, in this case native units are like npc, so they should be between 0 and 1. That is, in the case of a single hex bin it goes out of range because of resolution(). This function is also the reason for the distortion that @Dinre mentioned, even with up to several hex bins.
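The effect of resolution() on sparse facets can be imitated in a few lines of base R. resolution_sketch below is a simplified stand-in written for this illustration, based only on the description quoted above; it is not ggplot2's actual source, and the centre coordinates are made up.

```r
# Simplified stand-in for resolution(x, FALSE): smallest distance between
# distinct values, defined as 1 when there is only one unique value.
resolution_sketch <- function(x) {
  ux <- sort(unique(x))
  if (length(ux) < 2) return(1)
  min(diff(ux))
}

# Hypothetical x centres of occupied bins on one shared grid (spacing 0.5)
all_centres  <- c(0, 0.5, 1, 1.5, 2)
sparse_facet <- c(0, 2)   # only two far-apart bins are occupied
single_facet <- 1.5       # only one bin is occupied

resolution_sketch(all_centres)   # the true grid spacing
resolution_sketch(sparse_facet)  # four times the grid spacing
resolution_sketch(single_facet)  # 1, unrelated to the grid spacing
```

Because the value is recomputed from whatever bins happen to be occupied in each facet, sparse facets end up with a different inferred hexagon size than dense ones.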
So for now there does not seem to be an option to have hex bins of equal size. A temporary (and, for a large number of factors, very inconvenient) solution could begin with something like this:
library(gridExtra)
set.seed(2)
bindata <- data.frame(x = rnorm(100), y = rnorm(100))
fac_probs <- c(10, 40, 40, 10)
bindata$factor <- sample(letters[1:4], 100,
replace = TRUE, prob = fac_probs)
binwidths <- list(c(0.4, 0.4), c(0.5, 0.5),
c(0.5, 0.5), c(0.4, 0.4))
plots <- mapply(function(w,z){
ggplot(bindata[bindata$factor == w, ], aes(x = x, y = y)) +
geom_hex(binwidth = z) + theme(legend.position = 'none')
}, letters[1:4], binwidths, SIMPLIFY = FALSE)
do.call(grid.arrange, plots)
I also did some fiddling around with the hex plots in 'ggplot2', and I was able to consistently produce significant bin distortion when a factor's population was reduced to 8 or below. I can't explain why this is happening without digging down into the package source (which I am reluctant to do), but I can tell you that sparse factors seem to consistently wreck the hex bin plotting in 'ggplot2'.
This suggests to me that the size and shape of a particular hex bin in 'ggplot2' is related to a calculation that is unique to each facet, instead of doing a single calculation for the group and plotting the data afterwards. This is somewhat reinforced by the fact that I can reproduce the distortion in any given facet by plotting only that single factor, like so:
ggplot(bindata[bindata$factor=="e",], aes(x=x, y=y)) +
geom_hex()
This feels like something that should be elevated to the package maintainer, Hadley Wickham (h.wickham at gmail.com). This info is publicly available from CRAN.
Update: I sent an email to Hadley Wickham asking if he would take a look at this question, and he confirmed that this behavior is indeed a bug.