how to change geom_tile scale for very small values? - r

I have a data frame of pairwise comparisons in which each value represents the similarity between two objects. One real object was compared against several random ones, which gave very small similarity values. I also compared the random objects against each other, which gave much higher values. I now want to plot everything together as a heatmap. The problem is that the very small similarities I want to highlight get the same colour as the not-so-small values from the random-versus-random comparisons. This is clearly a scale problem, but I don't know how to control the colour scale. The following code generates a heatmap that shows the issue: the first column is yellowish, which is fine, but so are other tiles that have much higher, non-comparable values. How can I colour the tiles according to the actual scale?
The code:
library(ggplot2)
set.seed(131)
#number of comparisons in the original data: 1 value versus n=10
n <- 10
#generate real data (very small values)
fakeRealData <- runif(n, min=0.00000000000001, max=0.00000000000002)
#and create the data structure
realD <- cbind.data.frame(rowS=rep("fakeRealData", n), colS=paste("rnd", seq(1, n, by=1), sep=" "), Similarity=fakeRealData, stringsAsFactors=F)
#the same for random data, n=10 random comparisons make for a n by n matrix
rndN <- n*n
randomData <- data.frame(matrix(runif(rndN), nrow=n, ncol=n))
rowS <- vector()
#for each column of randomData
for (r in seq(1, n, by=1)) {
#create a vector of the first rowname, then the second, the third, etc etc which is long as the number of columns
rowS <- append(rowS, rep(paste("rnd", r, sep=" "), n))
}
#and create the random data structure
randomPVs <- cbind.data.frame(rowS=rowS, colS=rep(paste("rnd", seq(1, n, by=1), sep=" "), n), Similarity=unlist(randomData), stringsAsFactors=F)
#eventually put everything together
everything <- rbind.data.frame(randomPVs, realD)
#and finally plot the heatmap
heaT <- ggplot(everything, aes(rowS, colS, fill=Similarity)) +
geom_tile() +
scale_fill_distiller(palette = "YlGn", direction=2) +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))+
xlab("")+
ylab("")
plot(heaT)

Here are three approaches:
Add geom_text to your plot to show the values when color differences are small.
heaT <- ggplot(everything, aes(rowS, colS)) +
geom_tile(aes(fill=Similarity)) +
scale_fill_distiller(palette = "YlGn", direction=2) +
geom_text(aes(label = round(Similarity, 2))) +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("") +
ylab("")
Use the values argument of scale_fill_distiller to set a nonlinear scale. I added an extra break point at 0.01 to the otherwise linear scale to accentuate the difference between 0 and small nonzero numbers. I left the rest of the scale linear.
heaT <- ggplot(everything, aes(rowS, colS)) +
geom_tile(aes(fill=Similarity)) +
scale_fill_distiller(palette = "YlGn", direction=2,
values = c(0, 0.01, seq(0.05, 1, 0.05))) +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("") +
ylab("")
Transform your scale as Richard mentioned in the comments. Note that this will mess with the values in the legend, so either rename it or hide it.
heaT <- ggplot(everything, aes(rowS, colS)) +
geom_tile(aes(fill=Similarity)) +
scale_fill_distiller(palette = "YlGn", direction=2, trans = "log10",
name = "log10(Similarity)") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))+
xlab("")+
ylab("")
Try combinations of these approaches and see what you like.
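To see why the log transform helps, here is a minimal base-R sketch of where values land along a colour gradient once they are rescaled to the 0 to 1 range. The values in sim and the rescale01 helper are made up for illustration and are not part of ggplot2; treat this as an approximation of what a continuous fill scale does, not as its actual implementation.

```r
# Hypothetical similarity values: two tiny "real" ones and three larger random ones
sim <- c(1e-14, 2e-14, 0.01, 0.5, 1)

# Illustrative helper (not a ggplot2 function): rescale a numeric vector to [0, 1]
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

# Position of each value along the gradient on a linear scale
linear_pos <- rescale01(sim)

# Position of each value along the gradient after a log10 transform
log_pos <- rescale01(log10(sim))

round(linear_pos, 3)
round(log_pos, 3)
```

On the linear scale the tiny values and 0.01 sit within about 0.01 of each other, so they get almost the same colour. After log10 the tiny values stay near 0 while 0.01 moves to roughly 0.86 of the way along the gradient.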

Related

Looking for a way to plot the following correlation data using ggplot

I'm trying to visualize the following data and would appreciate some advice. Basically I ran a bunch of correlations and want to visualize if variable A or variable B is more strongly correlated with Height, Weight, Volume, etc.
variable <- c('A','B','A','B','A','B')
outcome <- c('Height','Height','Weight', 'Weight', 'Volume', 'Volume')
correlation_coeff <- c(0.76, 0.65, 0.77,0.56,0.91,-0.34)
p_value<- c(0.04,0.03,0.01,0.02,0.001,0.09)
data <- data.frame(variable, outcome, correlation_coeff, p_value)
Since this is not a matrix of correlation coefficients (I never looked at the correlation between Height and Weight, for example) I'm not really sure what to do. Normally I just use the ggcorrplot() function but in this case it's obviously not going to work. Any ideas?
You could plot the correlation plot directly with geom_tile, which closely resembles the look of ggcorrplot.
You can optionally have the p values overlaid:
ggplot(data, aes(variable, outcome, fill = correlation_coeff)) +
geom_tile(color = "black") +
geom_text(aes(label = paste("p =", p_value)), size = 10) +
scale_fill_gradientn(colors = c("blue", "white", "red"), limits = c(-1, 1)) +
theme_minimal() +
theme(axis.title = element_blank(),
axis.text = element_text(size = 16)) +
labs(fill = "Correlation coefficient") +
coord_equal()
ggcorrplot() takes a matrix as an input, so you just need to turn your data into a matrix:
variable <- c('A','B')
outcome <- c('Height','Weight', 'Volume')
correlation_coeff <- c(0.76, 0.65, 0.77,0.56,0.91,-0.34)
p_value<- c(0.04,0.03,0.01,0.02,0.001,0.09)
data <- matrix(correlation_coeff, nrow = 2, ncol = 3, dimnames = list(variable,outcome))
library(ggcorrplot)
ggcorrplot(data)
That should make a 2x3 correlation chart in the style you're looking for.
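If you would rather not retype the vectors, one possible way to build the same matrix from the original long data frame is tapply() from base R. This is only a sketch: the object name corr_mat is my own, and it assumes there is exactly one coefficient per variable/outcome pair.

```r
# Long-format data as given in the question
data <- data.frame(
  variable = c("A", "B", "A", "B", "A", "B"),
  outcome = c("Height", "Height", "Weight", "Weight", "Volume", "Volume"),
  correlation_coeff = c(0.76, 0.65, 0.77, 0.56, 0.91, -0.34)
)

# Reshape to a variable-by-outcome matrix; mean() just returns the single
# value in each cell because every pair occurs once
corr_mat <- with(data, tapply(correlation_coeff, list(variable, outcome), mean))

corr_mat
```

The columns come out in alphabetical order (Height, Volume, Weight), and the resulting matrix can be passed to ggcorrplot() in the same way as the hand-built one.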

How to apply/loop the same function to a group of similar objects in R

I have a big datatable, containing metadata such as different drug treatments to samples under different conditions and numeric features of measurements.
Mini dummy dataset:
#I only used one sample here for simplicity. You can imagine there are multiple sample IDs
#and sometimes the same ID but different timestamp.
#for each sample, the sample will have 3 levels conditions,
#for each treatment(here's only 1, R), multiple dosages per condition
#on top of the condition&dose there is next layer to have combo or not.
#from id ~combo they are factors or at least I believe so
#after all these, the real measured variables V1,2.....
set.seed(123)
id <- rep("S112",30)
timestamp <- rep("T4",30)
condit <- rep(c("uns","2S","3S"), 10)
treatment <- rep("R", 30)
dose <- rep(c("0.1","1"),each=15)
combo <-rep(c("none","I10","I100"),each=10)
v1 <-rnorm(30,0.5)
v2 <-rnorm(30,0.05)
v3 <-rnorm(30,0.1)
df <- data.frame(id,timestamp,condit,treatment,dose,combo,v1,v2,v3)
Now I can visualise a single treatment, R, across the different conditions, dosages and combinations:
#import libs
library(ggplot2)
library(dplyr)
library(tidyr)
library(wesanderson) #I have great movie taste I know
# now I look at treatment of interest
R <- df[df$treatment=="R" & df$combo == "I10",]
#table to long
R_long <- gather(R,7:9, key = bin, value = value, -id, -timestamp, -condit )
#plot it
b<- ggplot(R_long, aes(x=bin, y=id,fill=value))
pal <- wes_palette("Zissou1", 100, type = "continuous")
R_map <- b +
geom_tile()+
scale_fill_gradientn(colors=pal)+
facet_grid(dose~condit)+
theme(text = element_text(size = 40,face="bold")) +
theme(legend.text = element_text(size=35, face="bold"))+
theme(axis.text.x = element_text(angle=45, hjust=1)) +
theme(legend.key.size = unit(2, "cm"))+
xlab("Bins")+
ylab("Sample ID")+
ggtitle("Plot of treatment R")
ggsave(R_map,file="R.pdf",width=30,height=30)
This works, but I want to do the same for a group of drugs in the real dataset instead of just the one treatment R. I'm guessing that R, being a vectorised language, should allow me to put the treatments I want in a vector such as c(R1, R2, R3, R4) and apply the above code to that vector. How could I achieve that?
Note: I'm sorry for this poorly worded question. I honestly am too basic in R to even ask the key questions. So feel free to help me edit this (the COVID part is for the good spirit)
Thank you
You could also use map or map2 from purrr
Here is an example with map2 that lets you pass in a second variable from a vector, such as combo:
library(purrr)
#get data
set.seed(123)
id <- rep("S112",30)
timestamp <- rep("T4",30)
condit <- rep(c("uns","2S","3S"), 10)
treatment <- rep("R", 30)
dose <- rep(c("0.1","1"),each=15)
combo <-rep(c("none","I10","I100"),each=10)
v1 <-rnorm(30,0.5)
v2 <-rnorm(30,0.05)
v3 <-rnorm(30,0.1)
df <- data.frame(id,timestamp,condit,treatment,dose,combo,v1,v2,v3)
# make function if you use map, just remove y from the function and replace y with a variable within the function definition
makeplot <- function(x,y) {
# get data
R <- df[df$treatment==x & df$combo %in% unlist(strsplit(y, split="|", fixed=TRUE)),] #allows or statements in the vector
R_long <- gather(R, 7:9, key = bin, value = value, -id, -timestamp, -condit)
# make plot
pal <- wes_palette("Zissou1", 100, type = "continuous")
R_map <- ggplot(R_long, aes(x=bin, y=id, fill=value)) +
geom_tile() +
scale_fill_gradientn(colors = pal) +
facet_grid(dose ~ condit) +
theme(text = element_text(size = 40, face="bold")) +
theme(legend.text = element_text(size = 35, face = "bold")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(legend.key.size = unit(2, "cm")) +
xlab("Bins") +
ylab("Sample ID") +
ggtitle(paste("Plot of treatment", x))
ggsave(R_map, file = paste0(x,"_",paste0(gsub("|","_",y,fixed=TRUE)), ".pdf"), width = 30, height = 30)
}
#make vectors, the function above allows you to combine combos with |
treatment <- c("R","R","R")
combo <- c("none","I10","none|I10")
#apply
map2(treatment,combo,makeplot)
PS: I hope that your research will aid finding an effective off-label treatment for COVID-19 patients.
Since you are already faceted, you can't go that direction, but if you want one drug per loop, then wrap it in a for loop:
for (tr in unique(df$treatment)) {
R <- df[df$treatment==tr & df$combo == "I10",]
R_long <- gather(R, 7:9, key = bin, value = value, -id, -timestamp, -condit)
#plot it
pal <- wes_palette("Zissou1", 100, type = "continuous")
R_map <- ggplot(R_long, aes(x=bin, y=id, fill=value)) +
geom_tile() +
scale_fill_gradientn(colors = pal) +
facet_grid(dose ~ condit) +
theme(text = element_text(size = 40, face="bold")) +
theme(legend.text = element_text(size = 35, face = "bold")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(legend.key.size = unit(2, "cm")) +
xlab("Bins") +
ylab("Sample ID") +
ggtitle(paste("Plot of treatment", tr))
ggsave(R_map, file = paste0(tr, ".pdf"), width = 30, height = 30)
}
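Another option, sketched here under some assumptions, is to split the data by treatment and lapply() a plotting function over the pieces, which gives you a named list of plots to save afterwards. The treatment names R1 and R2, the reduced dummy data and the function name plot_treatment are all hypothetical, and I use tidyr::pivot_longer() in place of gather().

```r
library(ggplot2)
library(tidyr)

# Reduced dummy data with two hypothetical treatments
set.seed(123)
df <- data.frame(
  id = "S112",
  condit = rep(c("uns", "2S", "3S"), 4),
  treatment = rep(c("R1", "R2"), each = 6),
  dose = rep(c("0.1", "1"), each = 3, times = 2),
  v1 = rnorm(12, 0.5),
  v2 = rnorm(12, 0.05)
)

# Build one faceted heatmap from the rows of a single treatment
plot_treatment <- function(d) {
  long <- pivot_longer(d, cols = c(v1, v2), names_to = "bin", values_to = "value")
  ggplot(long, aes(x = bin, y = id, fill = value)) +
    geom_tile() +
    facet_grid(dose ~ condit) +
    ggtitle(paste("Plot of treatment", d$treatment[1]))
}

# One plot per treatment, named after the treatment
plots <- lapply(split(df, df$treatment), plot_treatment)
names(plots)

# To write the files, something like this should work:
# for (nm in names(plots)) ggsave(paste0(nm, ".pdf"), plots[[nm]])
```

Keeping the plots in a list means you can inspect or tweak them before saving, which a loop that calls ggsave() directly does not allow.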

Set the width and gap in geom_bar in a large dataset with a lot of unique values

I have the dataframe below:
res<-sample.int(2187, 2187)
freq<-floor(runif(2187, 95,105))
t<-data.frame(res,freq)
and I'm trying to create a bar chart from it. Even though I use the width and color arguments, I still cannot create space between the bars, and the bars come out black instead of the selected fill.
library(ggplot2)
require(scales)
ggplot(t,width=0.1)+
geom_bar(aes(x=res,y=freq ,fill = (t$res==101)),
color = "black",stat = "identity") +
scale_fill_manual(values=c("darkblue", "lightblue"), guide = F) +
theme_classic(base_size = 16)+ theme(legend.position = "none")+
scale_x_discrete(breaks = seq(80, 115, 5))+ scale_y_continuous(labels = comma)
Note that this code works nice for a dataset with much fewer unique values like:
fac<-factor(rep(c(80,85,100,100.5,100.7,101,101.5,110,105),2000000))
res<-data.frame(fac)
new<-data.frame(table(res))
require(scales)
ggplot(new,width=0.1)+
geom_bar(aes(x=res,y=Freq ,fill = (new$res==101)),
color = "black",stat = "identity") +
scale_fill_manual(values=c("darkblue", "lightblue"), guide = F) +
theme_classic(base_size = 16)+ theme(legend.position = "none")+
scale_x_discrete(breaks = seq(80, 115, 5))+ scale_y_continuous(labels = comma)
Maybe I am completely wrong, but if I understand correctly, the OP wants to reproduce the second chart from scratch using a sample of random numbers instead of already tabulated counts.
To create a histogram / bar chart, we only need a vector of random numbers (wrapped in a data.frame for ggplot) and can let geom_bar() do the counting. In addition, a particular bar will be highlighted.
By using floor(), the random numbers are already binned but are still treated as continuous by ggplot(). Therefore, they need to be turned into a factor.
# create data
set.seed(123L) # ensure random data are reproducible
t <- data.frame(res = floor(runif(2187, 95, 105)))
library(ggplot2)
ggplot(t) +
aes(x = as.factor(res), fill = res == 101) +
geom_bar() +
theme_classic(base_size = 16) +
scale_fill_manual(values = c("darkblue", "lightblue"), guide = FALSE) +
xlab("res") +
ylab("freq")
Edit: geom_histogram()
There is an alternative approach using geom_histogram().
geom_histogram() does all steps in one go: The binning (no need to use floor()) as well as counting and plotting:
set.seed(123L) # ensure random data are reproducible
t2 <- data.frame(res = runif(2187, 95,105)) # floor() omitted here
ggplot(t2) +
aes(x = res, fill = floor(res) == 101) +
geom_histogram(breaks = seq(95, 105, 1), closed = "left") +
theme_classic(base_size = 16) +
scale_fill_manual(values = c("darkblue", "lightblue"), guide = FALSE) +
xlab("res") +
ylab("freq")
Here, the breaks parameter was used to specify the bin boundaries explicitly. Alternatively, the number of bins or the width of the bins can be specified. This gives flexibility to play around with the parameters.
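As a quick sanity check that the two approaches count the same thing, here is a small base-R sketch comparing floor() plus table() with left-closed bins of width 1. It uses hist() from base R with plot = FALSE as a stand-in for the binning, not ggplot2's own code, so take it as an illustration only.

```r
set.seed(123L)
res <- runif(2187, 95, 105)

# Approach 1: bin by truncating to the integer below, then count
counts_floor <- table(floor(res))

# Approach 2: left-closed bins [95, 96), [96, 97), ... computed by hist()
h <- hist(res, breaks = seq(95, 105, 1), right = FALSE, plot = FALSE)

# Both give one count per integer bin
identical(as.integer(counts_floor), h$counts)
```

This is why closed = "left" matters in the geom_histogram() call: with right-closed bins a value sitting exactly on a boundary would be counted in the bin below, which no longer matches floor().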
Edit 2
The OP has asked about the case where the random numbers are uniformly distributed between 100 and 1015. With an adjustment to the sequence of breaks,
set.seed(123L) # ensure random data are reproducible
t3 <- data.frame(res = runif(2187, 100, 1015))
ggplot(t3) +
aes(x = res, fill = floor(res) == 101) +
geom_histogram(breaks = seq(100, 1015, 1), closed = "left") +
theme_classic(base_size = 16) +
scale_fill_manual(values = c("darkblue", "lightblue"), guide = FALSE) +
xlab("res") +
ylab("freq")
returns a chart with over 900 bars, one for each bin of width 1. Depending on the screen resolution, not all of them are visible, as already explained by Jon Spring.
Therefore, it might be more suitable to reduce the number of bins, e.g., to 100 bins:
ggplot(t3) +
aes(x = res, fill = floor(res) == 101) +
geom_histogram(bins = 100L) +
theme_classic(base_size = 16) +
scale_fill_manual(values = c("darkblue", "lightblue"), guide = FALSE) +
xlab("res") +
ylab("freq")
Please note that 101 is still highlighted in the lower left corner.
Edit -- added alternate solutions at bottom.
If you have over 2,000 bars, and each one has a black outline 1 pixel wide on each side, that'll take something on the order of 6,000 horizontal pixels (ignoring anti-aliasing) to see one with a different fill. Most screens have much lower resolution than that.
If you must use bars, and must show every value, one option would be to drop the outline with color = NA and set width = 1 (as a term in the geom_col/geom_bar call) so there's no distracting blank space between bars. Even then, the different color at res == 101 is only visible at certain resolutions. (That might vary on device settings and anti-aliasing.)
ggplot(t)+
geom_col(aes(x=res,y=freq , fill = (res==101)),
color = NA, width = 1) +
scale_fill_manual(values=c("darkblue", "lightblue"), guide = F) +
theme_classic(base_size = 16) +
scale_x_continuous(breaks = c(500*0:4, 101))
If you must show all 2000 points, but want to highlight one, it might make sense to use a different geom that spreads the data out to use more of the available space.
For instance, we might use geom_point or geom_jitter to plot all the coordinates in 2d space. Here, I highlight the element with res == 101. I use arrange to make sure the special dot gets plotted last so that it doesn't get occluded.
library(dplyr)
ggplot(t %>% arrange(res == 101),
aes(x = res, y = freq,
fill = res == 101,
size = res == 101)) +
geom_jitter(shape = 21, stroke = 0.1)
Or we might plot the data as a line, highlighting the special dot on its own:
ggplot(t, aes(res, freq)) +
geom_line(color = "gray70") +
geom_point(data = subset(t, res == 101)) +
expand_limits(y=0)

ggplot2 and regression lines and R^2 values

I know there have been a number of entries with regards to adding R^2 values to plots, but I am having trouble following the codes. I am graphing a scatter plot with three categories. I have added a linear regression line for each one. I would now like to add r^2 values for each but I can't figure out how to do this.
My code:
veg <- read.csv("latandwtall2.csv", header=TRUE)
library("ggplot2")
a <- ggplot(veg, aes(x=avglat, y=wtfi, color=genus)) + geom_point(shape=19, size=4)
b <- a + scale_colour_hue(l=50) + stat_smooth(method = "lm", formula = y ~ x, size = 1, se = FALSE)
c <- b + labs(x="Latitude", y="Weight (g)")
d <- c + theme_bw()
e <- d + theme(panel.grid.minor=element_blank(), panel.grid.major=element_blank())
#changes size of text
f <- e + theme(
axis.title.x = element_text(color="black", vjust=-0.35, size=15, face="bold"),
axis.title.y = element_text(color="black" , vjust=0.35, size=15, face="bold")
)
g <- f + theme(legend.key=element_rect(fill='white'))
g
Any help with how to add R^2 values would be greatly appreciated. Thanks!
If you build a data frame with the r-squared values, you might be able to (mostly) automate the positioning of the annotation text by including it as a call to geom_text.
Here's a toy example. The rsq data frame is used in geom_text to place the r-squared labels. In this case, I've set it up to put the labels just after the highest x-value and the predict function gets the y-value. It's probably too much work for a single plot, but if you're doing this a lot, you can turn it into a function so that you don't have to repeat the set-up code every time, and maybe add some fancier logic to make label placement more flexible:
library(reshape2) # For melt function
# Fake data
set.seed(12)
x = runif(100, 0, 10)
dat = data.frame(x, y1 = 2*x + 3 + rnorm(100, 0, 5),
y2 = 4*x + 20 + rnorm(100, 0, 10))
dat.m = melt(dat, id.var="x")
# linear models
my1 = lm(y1 ~ x, data=dat)
my2 = lm(y2 ~ x, data=dat)
# Data frame for adding r-squared values to plot
rsq = data.frame(model=c("y1","y2"),
r2=c(summary(my1)$adj.r.squared,
summary(my2)$adj.r.squared),
x=max(dat$x),
y=c(predict(my1, newdata=data.frame(x=max(dat$x))),
predict(my2, newdata=data.frame(x=max(dat$x)))))
ggplot() +
geom_point(data=dat.m, aes(x, value, colour=variable)) +
geom_smooth(data=dat.m, aes(x, value, colour=variable),
method="lm", se=FALSE) +
geom_text(data=rsq, aes(label=paste("r^2 == ", round(r2,2)),
x=1.05*x, y=y, colour=model, hjust=0.5),
size=4.5, parse=TRUE)
I can't really reproduce what you're doing, but you need to use annotate().
Something that could work (putting the R2 on the 10th point) would be:
R2 = 0.4
i = 10
text = paste("R-squared = ", R2, sep="")
g = g + annotate("text", x=veg$avglat[i], y=veg$wtfi[i], label=text, family="Calibri", colour="red", vjust = -2, hjust = 1)
Use vjust and hjust to adjust the position of the text relative to the point (change i to pick another point), and fill the variable R2 with your computed R-squared. You can choose any point you like or enter the x,y coordinates manually; it's up to you. Does that help?
PS. I put extra parameters (font, colours) so that you have the flexibility to change them.
Build the model separately, get the R^2 from there, and add it to the plot. I'll give you some dummy code, but it would be of better quality if you had given us a sample data frame.
r2 = summary(lm(wtfi ~ avglat, data=veg))$r.squared
#to piggyback on Romain's code...
i=10
g = g + annotate("text", x=veg$avglat[i], y=veg$wtfi[i], label=round(r2,2), family="Calibri", colour="red", vjust = -2, hjust = 1)
The way I wrote it here you don't need to hard-code the R^2 value in.
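Since the question fits one regression line per genus, you would probably want one R^2 per genus rather than a single value for the whole data set. Here is a rough sketch with made-up data: the genus names, the simulated values and the object names r2 and r2_labels are assumptions of mine, while the column names follow the question.

```r
# Made-up data with the same column names as in the question
set.seed(1)
veg <- data.frame(
  genus = rep(c("genusA", "genusB", "genusC"), each = 20),
  avglat = runif(60, 30, 50)
)
veg$wtfi <- 2 * veg$avglat + rnorm(60, sd = 5)

# Fit one linear model per genus and extract its R-squared
r2 <- sapply(split(veg, veg$genus), function(d) {
  summary(lm(wtfi ~ avglat, data = d))$r.squared
})

# One label per genus, ready to be used as the data of a geom_text() layer
r2_labels <- data.frame(
  genus = names(r2),
  label = paste0("R^2 = ", round(r2, 2))
)
r2_labels
```

You would still need to add x and y columns to r2_labels to position each label, for example the maximum latitude and the predicted weight at that point, as in the geom_text answer above.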

Gap Y axis in ggplot

I have the below plot of ggplot with most Y values between 0-200, and one value ~3000:
I want to "zoom" on most of the values, but still show the high value
I wrote the following code:
Figure_2 <- ggplot(data = count_df, aes(x=count_df$`ng`,
y=count_df$`Number`)) +
geom_point(col = "darkmagenta") + ggtitle("start VS Number") +
xlab(expression(paste("start " , mu, "l"))) + ylab("Number") +
theme(plot.title = element_text(hjust = 0.5, color="orange", size=14,
face="bold.italic"),
axis.title.x = element_text(color="#993333", size=10, face = "bold"),
axis.title.y = element_text(color="#993333", size=10,face = "bold"))
Anybody knows how to achieve that?
One possible solution uses facet_grid. I do not have the OP's exact data, but the idea is to group the y-axis into ranges. The OP mentions two ranges for Number: 0 - 200 and ~3000.
Hence, we can divide Number by 2000 and round up to turn it into a factor with two levels, one per range. That is, factor(ceiling(Number/2000)) gives level 1 for the low values and level 2 for the high one.
Let's take similar data as OP and try our approach:
# Data
count_df <- data.frame(ng = 1:30, Number = sample(200:220, 30, TRUE))
# Change one value high as 3000
count_df$Number[20] <- 3000
library(ggplot2)
ggplot(data = count_df, aes(x=ng, y=Number)) +
geom_point() +
facet_grid(factor(ceiling(Number/2000))~., scales = "free_y") +
ggtitle("start VS Number") +
xlab(expression(paste("start " , mu, "l")))
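To make the grouping step concrete, here is a tiny base-R sketch of what the faceting expression evaluates to. The four values are made up, and panel is just an illustrative name.

```r
# Made-up values: three in the low range and one outlier
Number <- c(200, 210, 220, 3000)

# Values up to 2000 fall in panel 1, values above 2000 and up to 4000 in panel 2
panel <- factor(ceiling(Number / 2000))

table(panel)
```

Note that a value of exactly 0 would give ceiling(0) = 0 and therefore a third panel, so the cut-off may need adjusting if your data contain zeros.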

Resources