I am trying to make feature plots for my top 20 genes. To do this I created a list of data frames; each data frame consists of several columns containing values and names.
One of these columns is the gene column. My code turns the first 20 genes into feature plots. The problem is that some of the data frames contain
fewer than 20 genes, which causes my code to abort.
Because I want a maximum of 5 feature plots per page, I cannot simply define a counter.
Thank you for any input.
Example of my list of data frames
listGroups
group1_2: 'data.frame': 68 obs. of 7 variables:
..$ p_val: num [1:68] 1.15 1.43 ...
..$ score: num [1:68] 15.5 27.14 ...
..$ gene: Factor w/ 68 levels "BRA1", "NED",...: 41 52 ...
group2_3: 'data.frame': 3 obs. of 7 variables:
..$ p_val: num [1:3] 1.15 1.43 ...
..$ score: num [1:3] 15.5 27.14 ...
..$ gene: Factor w/ 3 levels "BCL12", "DEF1",...: 41 52 ...
Code
groupNames <- c("cluster1_2","cluster2_3","cluster3_4","cluster4_5","cluster5_6")
for (i in 1:length(listGroups)) {
Grouplist <- listGroups[[i]]
genesList <- Grouplist['gene']
lengths(genesList)
print(groupNames[i])
# Make Featureplots for top20 DE genes per cluster_group
pdf(file=paste0(sampleFolder,"/Featureplots_cluster_",groupNames[i],"_",sampleName,".pdf"))
print(FeaturePlot(object = seuratObj, features = c(as.character(genesList[1:5,]))))
print(FeaturePlot(object = seuratObj, features = c(as.character(genesList[6:10,]))))
print(FeaturePlot(object = seuratObj, features = c(as.character(genesList[11:15,]))))
print(FeaturePlot(object = seuratObj, features = c(as.character(genesList[16:20,]))))
dev.off()
}
For each gene list you could make one plot with your choice of genes, like this (the plot will look fine at a larger size, as in your PDF):
Use combine = FALSE and limit the number of features to plot to something like rownames(pbmc_small)[1:min(20, nrow(pbmc_small))] to avoid errors.
Then export the list of single plots (this allows for theming) and plot into the PDF using cowplot::plot_grid.
Instead of plotting within the function (plot(out)), you could export to PDF (maybe pass the filename as a second argument to the function).
library(Seurat)
genelist <- list(
l1 = sample(rownames(pbmc_small), 23),
l2 = sample(rownames(pbmc_small), 14),
l3 = sample(rownames(pbmc_small), 4))
plotFeatures <- function(x){
p <- FeaturePlot(object = pbmc_small,
features = x[1 : min(20, length(x))],
combine = FALSE, label.size = 2)
out <- cowplot::plot_grid(plotlist = p, ncol = 5, nrow = 4)
plot(out)
}
lapply(genelist, plotFeatures)
Not tested, but something like this should work. Instead of calling print four times with five genes each, we call it in a loop, n times depending on the number of genes. With 10 genes the for loop prints twice, with 20 genes it prints 4 times, etc.:
groupNames <- c("cluster1_2","cluster2_3","cluster3_4","cluster4_5","cluster5_6")
for (i in 1:length(listGroups)) {
Grouplist <- listGroups[[i]]
genesList <- Grouplist['gene']
#lengths(geneList)
print(groupNames[i])
# Make Featureplots for top20 DE genes per cluster_group
# make chunks of 5 each.
# genesList is a one-column data frame, so chunk the gene names themselves
geneNames <- as.character(genesList$gene)
myChunks <- split(geneNames, ceiling(seq_along(geneNames)/5))
pdf(file=paste0(sampleFolder,"/Featureplots_cluster_",groupNames[i],"_",sampleName,".pdf"))
# loop through genes plotting 5 genes each time.
for (x in seq_along(myChunks)) {
print(FeaturePlot(object = seuratObj, features = myChunks[[x]]))
}
dev.off()
}
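The chunking step can be tried on its own, without Seurat. The following is a minimal sketch in base R; the helper name chunk_top_genes and its arguments top_n and per_page are hypothetical and not part of the answers above. It caps the gene vector at 20 entries first, so short lists no longer cause an out-of-range index, and then splits it into pages of at most 5.

```r
# Hypothetical helper: keep at most `top_n` genes, then split into pages
# of at most `per_page` genes each. Works for fewer than `top_n` genes.
chunk_top_genes <- function(genes, top_n = 20, per_page = 5) {
  genes <- head(as.character(genes), top_n)   # head() never over-indexes
  if (length(genes) == 0) return(list())
  split(genes, ceiling(seq_along(genes) / per_page))
}

# Made-up gene names, only for illustration
pages <- chunk_top_genes(paste0("gene", 1:23))
lengths(pages)                        # number of genes on each page
chunk_top_genes(c("BCL12", "DEF1", "XYZ"))
```

Each element of the returned list could then be passed as the features argument inside the PDF loop, one print call per page.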
Thanks to the input of zx8754 and user12728748, I found two solutions to my problem.
for (i in 1:length(listGroups)) {
Grouplist <- listGroups[[i]]
genesList <- Grouplist['gene']
print(groupNames[i])
## Solution 1
# Here all genes are printed. I didn't find a way yet to limited to 20
# make chunks of 5 each.
myChunks <- split(genesList,ceiling(seq(lengths(genesList))/5))
# Make Featureplots for top20 DE genes per cluster_group
pdf(file=paste0(sampleFolderAggr,"results/Featureplots_",groupNames[i],"_",sampleNameAggr,".pdf"))
# loop through genes plotting 5 genes each time.
for(x in 1:min(5, length(myChunks))){
# Create a list of 5 genes
my5Genes <- as.list(myChunks[[x]])
print(FeaturePlot(object = seuratObj, features = c(as.character(my5Genes$gene))))
}
dev.off()
## Solution 2
pdf(file=paste0(sampleFolderAggr,"results/Featureplots_",groupNames[i],"_",sampleNameAggr,".pdf"))
plotFeatures <- function(x){
p <- FeaturePlot(object = seuratObj, features = c(as.character(x[1: min(20, lengths(x)),])), combine = FALSE, label.size = 2)
out <- cowplot::plot_grid(plotlist = p, ncol = 5, nrow = 4)
# Make Featureplots for top20 DE genes per cluster_group
plot(out)
}
# genesList is the one-column data frame of this group, so call the function on it directly
plotFeatures(genesList)
dev.off()
}
I have the following dataset. For each month and site, I'm trying to use a particular package (EcoSimR) to estimate overlap (RA4 model).
I want to achieve two things:
install.packages("EcoSimR")
library(EcoSimR)
set.seed(111)
month <- rep(c("J","J","J","F"), each = 4)
site <- rep(c("1","2","3","1"), each = 4)
species <- rep(c("A","B","C","D"), times = 4)
q1 <- rnorm(16,5,1)
q2 <- rnorm(16,5,1)
q3 <- rnorm(16,5,1)
q4 <- rnorm(16,5,1)
q5 <- rnorm(16,5,1)
df <- data.frame(month, site, species,q1,q2,q3,q4,q5)
df.site <- df[df$month == "J" & df$site == "1",]
df.site <- df.site[,-c(1,2,3)]
RA4model <- niche_null_model(speciesData=df.site,
algo="ra4", metric="pianka",
suppressProg=TRUE,nReps=5000)
summary(RA4model)
First, I want to store certain values from the summary in the corresponding rows of a data frame.
Here is the output of summary(RA4model) for month "J" and site "1":
# Time Stamp: Sun Jan 8 11:43:05 2023
# Reproducible: FALSE
# Number of Replications: 5000
# Elapsed Time: 1 secs
# Metric: pianka
# Algorithm: ra4
# Observed Index: 0.96516
# Mean Of Simulated Index: 0.95101
# Variance Of Simulated Index: 9.4906e-05
# Lower 95% (1-tail): 0.93823
# Upper 95% (1-tail): 0.96979
# Lower 95% (2-tail): 0.93729
# Upper 95% (2-tail): 0.97272
# Lower-tail P = 0.8982
# Upper-tail P = 0.1018
# Observed metric > 4491 simulated metrics
# Observed metric < 509 simulated metrics
# Observed metric = 0 simulated metrics
# Standardized Effect Size (SES): 1.453
I can store the observed index and the mean simulated index, but I don't know how to get the lower one-tailed value and the SES from the summary output into the dataset:
df.out <- df[,c(1,2,3)]
df.out$Obs <- RA4model$Obs
df.out$Sim <- mean(RA4model$Sim)
df.out$lower_1tailP <- NA # placeholder: this should be 0.93
df.out$SES <- NA # placeholder: this should be 1.453
Next, I want to loop this so that it does it for each unique(month,site). So the final dataframe looks something like this:
month site Obs Sim lower-1tailP SES
J 1 .. .. .. ..
J 2 .. .. .. ..
J 3 .. .. .. ..
F 1 .. .. .. ..
You can create a helper function get_eco_sim_result(), which returns a list of the parameters of interest:
get_eco_sim_result <- function(spd, algo= "ra4", metric = "pianka", nReps=500) {
model = niche_null_model(speciesData = spd,
algo = algo,metric =metric, nReps = nReps, suppressProg = TRUE
)
return(list(
Obs = model$Obs,
Sim = mean(model$Sim),
lower_1tailp = quantile(model$Sim,0.05),
SES = (model$Obs - mean(model$Sim))/sd(model$Sim)
))
}
Then use lapply() to apply that helper function to each subset of the data; here I obtain the subsets using split(). By wrapping the returned list in data.frame(), you can subsequently use do.call() with rbind() to combine the results into one data frame:
do.call(
rbind, lapply(split(df, list(month,site), drop=T), \(d) {
data.frame(get_eco_sim_result(d[,-c(1,2,3)], nReps=5000))
})
)
Output:
Obs Sim lower_1tailp SES
F.1 0.9641760 0.9546306 0.9429722 1.0966176
J.1 0.9651613 0.9508969 0.9381335 1.4635265
J.2 0.9931026 0.9842322 0.9800247 2.9524328
J.3 0.9726413 0.9674799 0.9586858 0.7669562
Your original question, however, is perhaps less about the helper function and the loop, but rather about how the parameters in summary(RA4model) are estimated? Use getAnywhere(summary.nichenullmod) to see how these are estimated.
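As a quick sanity check of the SES formula used in the helper, the value printed by summary() can be reproduced from the other printed numbers. This is only a sketch: it uses the rounded figures from the summary output in the question, and the vector sim below is a synthetic stand-in (an assumption), not the real model$Sim.

```r
# Values copied from the summary() output shown in the question
obs      <- 0.96516      # Observed Index
sim_mean <- 0.95101      # Mean Of Simulated Index
sim_var  <- 9.4906e-05   # Variance Of Simulated Index

# Standardized effect size: (observed - mean of simulated) / sd of simulated
ses <- (obs - sim_mean) / sqrt(sim_var)
ses   # close to the reported 1.453, up to rounding of the inputs

# The lower one-tailed 95% bound is the 5% quantile of the simulated values.
# `sim` is a synthetic stand-in here, NOT the real simulated indices.
set.seed(111)
sim <- rnorm(5000, mean = sim_mean, sd = sqrt(sim_var))
lower_1tail <- unname(quantile(sim, 0.05))
lower_1tail
```

With the real model object, model$Sim would take the place of sim, as in get_eco_sim_result() above.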
I have the following PCA data on which I am doing k-means clustering:
head(pcdffinal)
PC1 PC2 PC3 PC4 PC5 PC6
1 -9.204228 -2.73517110 2.7975063 0.6794614 -0.84627095 0.4455297
2 2.927245 0.05666389 0.5085896 0.1472800 0.18193152 0.1041490
3 -4.667932 -1.98176361 2.2751862 0.5347725 -0.43314927 0.3222719
4 -1.366505 -0.40858595 0.5005192 0.4507366 -0.54996933 0.5533013
5 -4.689454 -2.77185636 2.4323856 0.7387788 0.49237229 -0.4817083
6 -3.477046 -1.84904214 1.5539558 0.5463861 -0.03231143 0.2814843
opt.cluster<-3
set.seed(115)
pccomp.km <- kmeans(pcdffinal,opt.cluster,nstart=25)
head(pccomp.km$cluster)
[1] 2 1 2 2 2 2
barplot(table(pccomp.km$cluster), col="steelblue")
pccomp.km$tot.withinss #For total within cluster sum of squares.
[1] 13172.59
We can also use a plot to illustrate the groups that the data have been arranged into.
par(mfrow=c(1,1))
plot(pcdffinal[,1:2],col=(pccomp.km$cluster+1),main=paste('K-Means Clustering result with k = ', opt.cluster,sep=" "),pch=20,cex=2)
points(pccomp.km$centers, pch=15,cex=2)#plotting the centres of the cluster as black squares
library("factoextra")
fviz_cluster(pccomp.km, data = pcdffinal, frame.type = "convex")+ theme_minimal()
df.num_kmeans<-df.num
df.num_kmeans$cluster.kmeans <- pccomp.km$cluster # vector of cluster assignments from kmeans(), added as a column to the original dataset
# save this dataset & kmeans model for further use
saveRDS(pccomp.km, "kmeans_model.RDS")
write.csv(df.num_kmeans,"dfnum_kmeans.cluster.csv")
library(cluster)
clusplot(df.num_kmeans,pccomp.km$cluster,color = TRUE,shade=TRUE,labels = 2,lines = 0)
library(ggfortify)
autoplot(pccomp.km, data=pcdffinal, frame=TRUE,frame.type='norm')
I would like to run k-means iteratively for a range of k, say k = 2:6, each time making the plots for the respective k and saving the model and the data as a CSV, with separate files for each k.
I need help converting the above code into a loop with the counter i going from 2 to 6.
original data:
head(df.num_kmeans)
datausage mou revenue calldrop handset2g handset3g smartphone
1 896804.7 2854801 40830.404 27515 7930 19040 20810
2 155932.1 419109 5512.498 5247 2325 2856 3257
3 674983.3 2021183 25252.265 21068 6497 13056 14273
4 522787.2 1303221 14547.380 8865 4693 9439 10746
5 523465.7 1714641 24177.095 25441 8668 12605 14766
6 527062.3 1651303 20153.482 18219 6822 11067 12994
rechargecount rechargesum arpu subscribers
1 4461 235430 197704.10 105822
2 843 39820 34799.21 18210
3 2944 157099 133842.38 71351
4 2278 121697 104681.58 44975
5 2802 144262 133190.55 75860
6 2875 143333 119389.91 63740
Using random forest for accuracy comparison
dfnum.kmeans <- read.csv("dfnum_kmeans.cluster.csv")
table(dfnum.kmeans$cluster.kmeans) # size of each cluster
convert cluster var into a factor
dfnum.kmeans$cluster.kmeans <- as.factor(dfnum.kmeans$cluster.kmeans)
is.factor(dfnum.kmeans$cluster.kmeans)
create training and test sets (75:25 split) using 'caret' package
set.seed(128) # for reproducibility
inTrain_kmeans <- caret::createDataPartition(y = dfnum.kmeans$cluster.kmeans, p = 0.75, list = FALSE)
training_kmeans <- dfnum.kmeans[inTrain_kmeans, ]
testing_kmeans <- dfnum.kmeans[-inTrain_kmeans, ]
set.seed(122)
control <- trainControl(method = "repeatedcv", number = 10,allowParallel = TRUE)
modFit.rfcaret_kmeans <- caret::train(cluster.kmeans~ ., method = "rf",data = training_kmeans, trControl = control, number = 25)
modFit.rfcaret_kmeans$finalModel
pred.test_kmeans = predict(modFit.rfcaret_kmeans, testing_kmeans); confusionMatrix(pred.test_kmeans, testing_kmeans$cluster.kmeans )
confusionMatrix(pred.test_kmeans, testing_kmeans$cluster.kmeans )$overall[1]
Assuming that your original dataframe is df.num, the following could save all the files (for different k values) in your working directory:
for (k in 2:6) {
set.seed(115)
pccomp.km <- kmeans(pcdffinal,k,nstart=25)
print(head(pccomp.km$cluster)) # inside a loop, output must be printed explicitly
print(paste(k, pccomp.km$tot.withinss)) #For total within cluster sum of squares.
png(paste0('kmeans_proj_',k, '.png'))
par(mfrow=c(1,1))
plot(pcdffinal[,1:2],col=(pccomp.km$cluster+1),main=paste('K-Means Clustering result with k = ', k,sep=" "),pch=20,cex=2)
points(pccomp.km$centers, pch=15,cex=2)#plotting the centres of the cluster as black squares
dev.off()
png(paste0('kmeans_fviz_',k, '.png'))
print(fviz_cluster(pccomp.km, data = pcdffinal, frame.type = "convex")+ theme_minimal())
dev.off()
df.num_kmeans<-df.num
df.num_kmeans$cluster.kmeans <- pccomp.km$cluster # vector of cluster assignments from kmeans(), added as a column to the original dataset
saveRDS(pccomp.km, paste0("kmeans_model_", k, ".RDS"))
write.csv(df.num_kmeans,paste0("dfnum_kmeans_", k, ".cluster.csv"))
png(paste0('clusplot_',k, '.png'))
clusplot(df.num_kmeans,pccomp.km$cluster,color = TRUE,shade=TRUE,labels = 2,lines = 0)
dev.off()
png(paste0('autoplot_',k, '.png'))
print(autoplot(pccomp.km, data=pcdffinal, frame=TRUE,frame.type='norm'))
dev.off()
}
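The same loop can also be written as a function that is applied to each k, which makes it easy to collect the total within-cluster sum of squares in one table. The sketch below uses only base R; the data frame scores is synthetic stand-in data (pcdffinal is not available here), and fit_one_k and out_dir are hypothetical names. The plotting calls are left out for brevity.

```r
# Synthetic stand-in for the PCA scores; replace with pcdffinal in practice
set.seed(115)
scores <- as.data.frame(matrix(rnorm(200 * 3), ncol = 3,
                               dimnames = list(NULL, c("PC1", "PC2", "PC3"))))
out_dir <- tempdir()   # hypothetical output folder

# Fit k-means for one k, save the model and the labelled data,
# and return a one-row summary
fit_one_k <- function(k, data, out_dir) {
  set.seed(115)
  km <- kmeans(data, centers = k, nstart = 25)
  labelled <- cbind(data, cluster.kmeans = km$cluster)
  saveRDS(km, file.path(out_dir, paste0("kmeans_model_", k, ".RDS")))
  write.csv(labelled,
            file.path(out_dir, paste0("dfnum_kmeans_", k, ".cluster.csv")),
            row.names = FALSE)
  data.frame(k = k, tot.withinss = km$tot.withinss)
}

wss <- do.call(rbind, lapply(2:6, fit_one_k, data = scores, out_dir = out_dir))
print(wss)
```

Returning a small data frame per k keeps the side effects (files on disk) separate from the numbers you may want to compare across k.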
After fitting a Tree with party::ctree() I want to create a table to characterise the branches.
I have fitted these variables
> summary(juridicos_segmentar)
actividad_economica
Financieras : 89
Gubernamental : 48
Sector Primario : 34
Sector Secundario:596
Sector Terciario :669
ingresos_cut
(-Inf,1.03e+08] :931
(1.03e+08,4.19e+08]:252
(4.19e+08,1.61e+09]:144
(1.61e+09, Inf] :109
egresos_cut
(-Inf,6e+07] :922
(6e+07,2.67e+08] :256
(2.67e+08,1.03e+09]:132
(1.03e+09, Inf] :126
patrimonio_cut
(-Inf,2.72e+08] :718
(2.72e+08,1.46e+09]:359
(1.46e+09,5.83e+09]:191
(5.83e+09, Inf] :168
op_ingreso_cut
(-Inf,3] :1308
(3,7] : 53
(7,22] : 44
(22, Inf]: 31
The first one is categorical and the others are ordinal and I fitted them to
another factor variable
> summary(as.factor(segmento))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
27 66 30 39 36 33 39 15 84 70 271 247 101 34 100 74 47 25 48 50
I used the following code
library(party)
fit_jur <- ctree(cluster ~ .,
data=data.frame(juridicos_segmentar, cluster=as.factor(segmento)))
to get this tree
> fit_jur
Conditional inference tree with 31 terminal nodes
Response: cluster
Inputs: actividad_economica, ingresos_cut, egresos_cut, patrimonio_cut, op_ingreso_cut
Number of observations: 1436
1) actividad_economica == {Financieras}; criterion = 1, statistic = 4588.487
2) ingresos_cut <= (4.19e+08,1.61e+09]; criterion = 1, statistic = 62.896
3) egresos_cut <= (6e+07,2.67e+08]; criterion = 1, statistic = 22.314
4)* weights = 70
3) egresos_cut > (6e+07,2.67e+08]
5)* weights = 10
2) ingresos_cut > (4.19e+08,1.61e+09]
6)* weights = 9
plot of part of the tree
What I want is a table where every row is a path from the root node to a leaf, giving the prediction of the variable segmento, and every column is the split condition on one variable. Something like this:
actividad economica ingresos (rango) egresos (rango) patrimonio (rango) operaciones de ingreso segmento
Sector Primario <=261.000.000 18
Sector Primario >261.000.000 20
The problem is that there are several leaves to characterise, and sometimes a variable appears several times in one path, so I'd like to combine those conditions, i.e. intersect the ranges.
I've thought of data.tree::ToDataFrameTable but I've got no idea of how it works with party.
Thank you very much guys!
library(partykit)
fit_jur <- ctree(cluster ~ .,
data=data.frame(juridicos_segmentar, cluster=as.factor(segmento)))
pathpred <- function(object, ...)
{
## coerce to "party" object if necessary
if(!inherits(object, "party")) object <- as.party(object)
## get standard predictions (response/prob) and collect in data frame
rval <- data.frame(response = predict(object, type = "response", ...))
rval$prob <- predict(object, type = "prob", ...)
## get rules for each node
rls <- partykit:::.list.rules.party(object)
## get predicted node and select corresponding rule
rval$rule <- rls[as.character(predict(object, type = "node", ...))]
return(rval)
}
ct_pred_jur <- unique(pathpred(fit_jur)[c(1,3)])
write.csv2(ct_pred_jur,'parametrizacion_juridicos.csv')
Thank you, Achim Zeileis, for pointing me in this direction. I couldn't intersect the rules that refer to the same variable, i.e. evaluate the '&'s. That problem is still open.
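For the open part, one possible direction is to parse each rule string and keep only the tightest bounds per variable. The following is a minimal sketch in base R under strong assumptions: it only handles numeric conditions of the form "var <= value" or "var > value" joined by " & " (for example "Temp <= 82 & Wind > 6.9 & Temp > 77"); rules on factors or ordered factors would need separate handling. The function name intersect_rule is hypothetical.

```r
# Hypothetical helper: collapse repeated numeric conditions in one rule string.
# Assumes every condition looks like "var <= 3" or "var > 1.5".
intersect_rule <- function(rule) {
  conds <- strsplit(rule, " & ", fixed = TRUE)[[1]]
  m <- regmatches(conds, regexec("^ *([^ ]+) *(<=|>) *([^ ]+) *$", conds))
  if (any(lengths(m) != 4)) stop("unsupported condition in rule: ", rule)
  vars   <- vapply(m, `[`, character(1), 2)
  ops    <- vapply(m, `[`, character(1), 3)
  values <- as.numeric(vapply(m, `[`, character(1), 4))
  out <- lapply(unique(vars), function(v) {
    lower <- values[vars == v & ops == ">"]    # all "greater than" bounds
    upper <- values[vars == v & ops == "<="]   # all "at most" bounds
    data.frame(variable = v,
               lower = if (length(lower)) max(lower) else -Inf,
               upper = if (length(upper)) min(upper) else Inf)
  })
  do.call(rbind, out)
}

# Made-up rule string in the assumed format
intersect_rule("Temp <= 82 & Wind > 6.9 & Temp > 77")
```

Each row then reads as lower < variable <= upper, which is the intersected range for that variable along the path.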
You can convert both party class (from partykit) and BinaryTree (from party) to a data.tree, and use it for conversion to data frame and/or printing. For example like this:
library(party)
library(data.tree) # provides as.Node() and ToDataFrameTable()
airq <- subset(airquality, !is.na(Ozone))
airct <- ctree(Ozone ~ ., data = airq,
controls = ctree_control(maxsurrogate = 3))
tree <- as.Node(airct)
df <- ToDataFrameTable(tree,
"pathString",
"label",
criterion = function(x) round(x$criterion$maxcriterion, 3),
statistic = function(x) round(max(x$criterion$statistic), 3)
)
df
This will print like so:
pathString label criterion statistic
1 1/2/3 weights = 10 0.000 0.000
2 1/2/4/5 weights = 48 0.936 6.141
3 1/2/4/6 weights = 21 0.891 5.182
4 1/7/8 weights = 30 0.675 3.159
5 1/7/9 weights = 7 0.000 0.000
Plotting:
# plot a subtree
subtree <- Clone(tree$`2`)
SetNodeStyle(subtree,
style = "filled,rounded",
shape = "box",
fillcolor = "GreenYellow",
fontname = "helvetica",
label = function(x) x$label,
tooltip = function(x) round(x$criterion$maxcriterion, 3))
plot(subtree)
And the result will look like this:
Using the following code:
library("ggplot2")
require(zoo)
args <- commandArgs(TRUE)
input <- read.csv(args[1], header=F, col.names=c("POS","ATT"))
id <- args[2]
prot_len <- nrow(input)
manual <- prot_len/100 # 4.3
att_name <- "Entropy"
att_zoo <- zoo(input$ATT)
att_avg <- rollapply(att_zoo, width = manual, by = manual, FUN = mean, align = "left")
autoplot(att_avg, col="att1") + labs(x = "Positions", y = att_name, title="")
With data:
> str(input)
'data.frame': 431 obs. of 2 variables:
$ POS: int 1 2 3 4 5 6 7 8 9 10 ...
$ ATT: num 0.652 0.733 0.815 1.079 0.885 ...
I do:
I would like to load input2, which has a different length (and therefore a different x-axis), and overlay the two curves in the same plot. By overlay I mean that both curves should share the same plot area, so I will "ignore" the overlapping axis labels and titles. I would like to compare the shapes regardless of the length of the input.
First I tried generating a toy input2 by changing the manual value, so that I have att_avg2 in which manual equals e.g. 7. Between the original autoplot and the new autoplot I add par(new=TRUE), but this does not give the output I expect. Any hint on how to do this? Maybe it's better to convert att_avg from a zoo series to a data.frame and not use autoplot? Thanks
UPDATE, response to G. Grothendieck:
If I do:
[...]
att_zoo <- zoo(input$ATT)
att_avg <- rollapply(att_zoo, width = manual, by = manual, FUN = mean, align = "left") #manual=4.3
att_avg2 <- rollapply(att_zoo, width = 7, by = 7, FUN = mean, align = "left")
autoplot(cbind(att_avg, att_avg2), facet=NULL) +
labs(x = "Positions", y = att_name, title="")
I get
and a warning message:
Removed 1 rows containing missing values (geom_path).
par is used with classic graphics, not for ggplot2. If you have two zoo series just cbind or merge the series together and autoplot them using facet=NULL:
library(zoo)
library(ggplot2)
z1 <- zoo(1:3) # length 3
z2 <- zoo(5:1) # length 5
autoplot(cbind(z1, z2), facet = NULL)
Note: The question omitted input2 so there could be some additional considerations from aspects not shown.
I'm trying to make a hexbin representation of data in several categories. The problem is, facetting these bins seems to make all of them different sizes.
set.seed(1) #Create data
bindata <- data.frame(x=rnorm(100), y=rnorm(100))
fac_probs <- dnorm(seq(-3, 3, length.out=26))
fac_probs <- fac_probs/sum(fac_probs)
bindata$factor <- sample(letters, 100, replace=TRUE, prob=fac_probs)
library(ggplot2) #Actual plotting
library(hexbin)
ggplot(bindata, aes(x=x, y=y)) +
geom_hex() +
facet_wrap(~factor)
Is it possible to set something to make all these bins physically the same size?
As Julius says, the problem is that hexGrob doesn't get the information about the bin sizes, and guesses it from the differences it finds within the facet.
Obviously, it would make sense to hand dx and dy to a hexGrob -- not having the width and height of a hexagon is like specifying a circle by center without giving the radius.
Workaround:
The resolution strategy works if the facet contains two adjacent hexagons that differ in both x and y. So, as a workaround, I'll manually construct a data.frame containing the x and y center coordinates of the cells, the factor for facetting, and the counts:
In addition to the libraries specified in the question, I'll need
library (reshape2)
and also bindata$factor actually needs to be a factor:
bindata$factor <- as.factor (bindata$factor)
Now, calculate the basic hexagon grid
h <- hexbin (bindata, xbins = 5, IDs = TRUE,
xbnds = range (bindata$x),
ybnds = range (bindata$y))
Next, we need to calculate the counts depending on bindata$factor
counts <- hexTapply (h, bindata$factor, table)
counts <- t (simplify2array (counts))
counts <- melt (counts)
colnames (counts) <- c ("ID", "factor", "counts")
As we have the cell IDs, we can merge this data.frame with the proper coordinates:
hexdf <- data.frame (hcell2xy (h), ID = h@cell)
hexdf <- merge (counts, hexdf)
Here's what the data.frame looks like:
> head (hexdf)
ID factor counts x y
1 3 e 0 -0.3681728 -1.914359
2 3 s 0 -0.3681728 -1.914359
3 3 y 0 -0.3681728 -1.914359
4 3 r 0 -0.3681728 -1.914359
5 3 p 0 -0.3681728 -1.914359
6 3 o 0 -0.3681728 -1.914359
Plotting this with ggplot (use the command below) yields the correct bin sizes, but the figure looks a bit odd: 0-count hexagons are drawn, but only where some other facet has this bin populated. To suppress the drawing, we can set the counts there to NA and make the na.value completely transparent (it defaults to grey50):
hexdf$counts [hexdf$counts == 0] <- NA
ggplot(hexdf, aes(x=x, y=y, fill = counts)) +
geom_hex(stat="identity") +
facet_wrap(~factor) +
coord_equal () +
scale_fill_continuous (low = "grey80", high = "#000040", na.value = "#00000000")
yields the figure at the top of the post.
This strategy works as long as the binwidths are correct without facetting. If the binwidths are set very small, the resolution may still yield too large dx and dy. In that case, we can supply hexGrob with two adjacent bins (but differing in both x and y) with NA counts for each facet.
dummy <- hgridcent (xbins = 5,
xbnds = range (bindata$x),
ybnds = range (bindata$y),
shape = 1)
dummy <- data.frame (ID = 0,
factor = rep (levels (bindata$factor), each = 2),
counts = NA,
x = rep (dummy$x [1] + c (0, dummy$dx/2),
nlevels (bindata$factor)),
y = rep (dummy$y [1] + c (0, dummy$dy ),
nlevels (bindata$factor)))
An additional advantage of this approach is that we can delete all the rows with 0 counts already in counts, in this case reducing the size of hexdf by roughly 3/4 (122 rows instead of 520):
counts <- counts [counts$counts > 0 ,]
hexdf <- data.frame (hcell2xy (h), ID = h@cell)
hexdf <- merge (counts, hexdf)
hexdf <- rbind (hexdf, dummy)
The plot looks exactly the same as above, but you can visualize the difference with na.value not being fully transparent.
more about the problem
The problem is not unique to facetting but occurs always if too few bins are occupied, so that no "diagonally" adjacent bins are populated.
Here's a series of more minimal data that shows the problem:
First, I trace hexBin so that I capture all center coordinates of the hexagonal grid that ggplot2:::hexBin uses, as well as the object returned by hexbin:
trace (ggplot2:::hexBin, exit = quote ({trace.grid <<- as.data.frame (hgridcent (xbins = xbins, xbnds = xbnds, ybnds = ybnds, shape = ybins/xbins) [1:2]); trace.h <<- hb}))
Set up a very small data set:
df <- data.frame (x = 3 : 1, y = 1 : 3)
And plot:
p <- ggplot(df, aes(x=x, y=y)) + geom_hex(binwidth=c(1, 1)) +
coord_fixed (xlim = c (0, 4), ylim = c (0,4))
p # needed for the tracing to occur
p + geom_point (data = trace.grid, size = 4) +
geom_point (data = df, col = "red") # data pts
str (trace.h)
Formal class 'hexbin' [package "hexbin"] with 16 slots
..@ cell : int [1:3] 3 5 7
..@ count : int [1:3] 1 1 1
..@ xcm : num [1:3] 3 2 1
..@ ycm : num [1:3] 1 2 3
..@ xbins : num 2
..@ shape : num 1
..@ xbnds : num [1:2] 1 3
..@ ybnds : num [1:2] 1 3
..@ dimen : num [1:2] 4 3
..@ n : int 3
..@ ncells: int 3
..@ call : language hexbin(x = x, y = y, xbins = xbins, shape = ybins/xbins, xbnds = xbnds, ybnds = ybnds)
..@ xlab : chr "x"
..@ ylab : chr "y"
..@ cID : NULL
..@ cAtt : int(0)
I repeat the plot, leaving out data point 2:
p <- ggplot(df [-2,], aes(x=x, y=y)) + geom_hex(binwidth=c(1, 1)) + coord_fixed (xlim = c (0, 4), ylim = c (0,4))
p
p + geom_point (data = trace.grid, size = 4) + geom_point (data = df, col = "red")
str (trace.h)
Formal class 'hexbin' [package "hexbin"] with 16 slots
..@ cell : int [1:2] 3 7
..@ count : int [1:2] 1 1
..@ xcm : num [1:2] 3 1
..@ ycm : num [1:2] 1 3
..@ xbins : num 2
..@ shape : num 1
..@ xbnds : num [1:2] 1 3
..@ ybnds : num [1:2] 1 3
..@ dimen : num [1:2] 4 3
..@ n : int 2
..@ ncells: int 2
..@ call : language hexbin(x = x, y = y, xbins = xbins, shape = ybins/xbins, xbnds = xbnds, ybnds = ybnds)
..@ xlab : chr "x"
..@ ylab : chr "y"
..@ cID : NULL
..@ cAtt : int(0)
note that the results from hexbin are on the same grid (cell numbers did not change, just cell 5 is not populated any more and thus not listed), grid dimensions and ranges did not change. But the plotted hexagons did change dramatically.
Also notice that hgridcent forgets to return the center coordinates of the first cell (lower left).
Though it gets populated:
df <- data.frame (x = 1 : 3, y = 1 : 3)
p <- ggplot(df, aes(x=x, y=y)) + geom_hex(binwidth=c(0.5, 0.8)) +
coord_fixed (xlim = c (0, 4), ylim = c (0,4))
p # needed for the tracing to occur
p + geom_point (data = trace.grid, size = 4) +
geom_point (data = df, col = "red") + # data pts
geom_point (data = as.data.frame (hcell2xy (trace.h)), shape = 1, size = 6)
Here, the rendering of the hexagons cannot possibly be correct - they do not belong to one hexagonal grid.
I tried to replicate your solution with the same data set using lattice hexbinplot. Initially, it gave me an error xbnds[1] < xbnds[2] is not fulfilled. This error was due to wrong numeric vectors specifying range of values that should be covered by the binning. I changed those arguments in hexbinplot, and it somehow worked. Not sure if it helps you to solve it with ggplot, but it's probably some starting point.
library(lattice)
library(hexbin)
hexbinplot(y ~ x | factor, bindata, xbnds = "panel", ybnds = "panel", xbins=5,
layout=c(7,3))
EDIT
Although rectangular bins with stat_bin2d() work just fine:
ggplot(bindata, aes(x=x, y=y, group=factor)) +
facet_wrap(~factor) +
stat_bin2d(binwidth=c(0.6, 0.6))
There are two source files that we are interested in: stat-binhex.r and geom-hex.r, mainly hexBin and hexGrob functions.
As @Dinre mentioned, this issue is not really related to faceting. What we can see is that binwidth is not ignored but is used in a special way in hexBin, and that this function is applied to every facet separately. After that, hexGrob is applied to every facet. To be sure, you can inspect them with e.g.
trace(ggplot2:::hexGrob, quote(browser()))
trace(ggplot2:::hexBin, quote(browser()))
Hence this explains why sizes differ - they depend on both binwidth and the data of each facet itself.
It is difficult to keep track of the process because of various coordinates transforms, but notice that the output of hexBin
data.frame(
hcell2xy(hb),
count = hb@count,
density = hb@count / sum(hb@count, na.rm=TRUE)
)
always seems to look quite ordinary, and that hexGrob is responsible for drawing the hex bins and hence for the distortion, i.e. it is the function that calls polygonGrob. When there is only one hex bin in a facet, there is a more serious anomaly.
dx <- resolution(x, FALSE)
dy <- resolution(y, FALSE) / sqrt(3) / 2 * 1.15
in ?resolution we can see
Description
The resolution is is the smallest non-zero distance between adjacent
values. If there is only one unique value, then the resolution is
defined to be one.
for this reason (resolution(x, FALSE) == 1 and resolution(y, FALSE) == 1) the x coordinates of polygonGrob of the first facet in your example are
[1] 1.5native 1.5native 0.5native -0.5native -0.5native 0.5native
and if I am not wrong, in this case native units are like npc, so they should be between 0 and 1. That is, in the case of a single hex bin it goes out of range because of resolution(). This function is also the reason for the distortion that @Dinre mentioned, even with up to several hex bins.
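To make the effect of resolution() concrete, here is a minimal sketch in base R. The function resolution_sketch is a hypothetical, simplified stand-in that follows the description quoted from ?resolution (smallest non-zero distance between values, or one if there is only a single unique value); it is not the ggplot2 implementation itself.

```r
# Simplified stand-in for the behaviour described in ?resolution:
# smallest distance between distinct values, or 1 if only one unique value
resolution_sketch <- function(x) {
  ux <- sort(unique(x))
  if (length(ux) < 2) return(1)
  min(diff(ux))
}

# x centres of occupied bins on a grid with spacing 1
resolution_sketch(c(1, 2, 3))   # all grid columns occupied
resolution_sketch(c(1, 3))      # middle column empty: spacing looks doubled
resolution_sketch(2)            # single bin: falls back to 1
```

Because the guess depends only on which bins happen to be occupied in a facet, two facets on the same underlying grid can end up with different hexagon sizes.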
So for now there does not seem to be an option to have hex bins of equal size. A temporary (and, for a large number of factors, very inconvenient) solution could begin with something like this:
library(gridExtra)
set.seed(2)
bindata <- data.frame(x = rnorm(100), y = rnorm(100))
fac_probs <- c(10, 40, 40, 10)
bindata$factor <- sample(letters[1:4], 100,
replace = TRUE, prob = fac_probs)
binwidths <- list(c(0.4, 0.4), c(0.5, 0.5),
c(0.5, 0.5), c(0.4, 0.4))
plots <- mapply(function(w,z){
ggplot(bindata[bindata$factor == w, ], aes(x = x, y = y)) +
geom_hex(binwidth = z) + theme(legend.position = 'none')
}, letters[1:4], binwidths, SIMPLIFY = FALSE)
do.call(grid.arrange, plots)
I also did some fiddling around with the hex plots in 'ggplot2', and I was able to consistently produce significant bin distortion when a factor's population was reduced to 8 or below. I can't explain why this is happening without digging down into the package source (which I am reluctant to do), but I can tell you that sparse factors seem to consistently wreck the hex bin plotting in 'ggplot2'.
This suggests to me that the size and shape of a particular hex bin in 'ggplot2' is related to a calculation that is unique to each facet, instead of doing a single calculation for the group and plotting the data afterwards. This is somewhat reinforced by the fact that I can reproduce the distortion in any given facet by plotting only that single factor, like so:
ggplot(bindata[bindata$factor=="e",], aes(x=x, y=y)) +
geom_hex()
This feels like something that should be elevated to the package maintainer, Hadley Wickham (h.wickham at gmail.com). This info is publicly available from CRAN.
Update: I sent an email to Hadley Wickham asking if he would take a look at this question, and he confirmed that this behavior is indeed a bug.