R heat map: Ordering by value; label issues - r

I am looking to improve upon output I implemented in R based on Jeromy's answer here (thanks!). Mine is a 31x31 matrix with positive and negative values, and uses basically the same ggplot2 code:
library(ggplot2)
library(reshape)
z<-cor(insheet3,use="complete.obs",method="kendall")
zm<-melt(z)
ggplot(zm, aes(X1,X2, fill=value)) + geom_tile() +
scale_fill_gradient2(low = "blue", high = "dark violet")
I need to change three things:
1. Right now, the rows appear in reverse alphabetical order, which means no visible data trends. How can I influence the order of the rows and columns, such that either:
   A. (Preferred:) The columns are ordered by correlation value (negative to positive or vice versa), as they are in the ellipse package output on that same page; or
   B. The columns are manually ordered, so that I can group similar variables?
2. Along the bottom X-axis, my variable names are overlapping dramatically and are unreadable. They need to remain long (e.g., OrthoPhos, Ammonia, Residential...), so how can I rotate their labels 90 degrees?
3. Is there a way to remove the "X1" and "X2" labels along each axis?
Thank you!

Following what I'll call an extensive/religious R journey into correlation matrix possibilities, I wanted to share what I'm finally going to use. Also, thanks to the previous answerers; I've found that there are many "right" answers to this.
Since my reviewers insisted I include numbers and not just colors, and that I stay away from more "confusing" and "busy" output like a correlogram, I finally found image() and based my final output on this example. Thanks, @Marcinthebox.
Also to appease StackOverflow, here is a link to the image, rather than the image itself.
Because some of these specifications took a while to figure out and were critical to the final output, here's my code, shortened as much as I could.
#Subsetting to only the vectors I want to see in the correlation, as ordered
insheet<-subset(insheet1,
select=c("Cond", "CL", "SO4", "TN", "TP", "OrthoPhos", "DO", ...., "Rural"))
#Defining "high" and "low" colors
library(colorspace)
mycolors<-diverge_hcl(8, h = c(8, 240), c = 80, l = c(50,100), power = 1)
#Correlating them into a matrix
sheet<-cor(insheet,use="complete.obs")
#Making it!
par(mar=c(5.5, 5.5, 2, 1)) #Widens the margins for the axis labels; must be set before plotting
n<-ncol(sheet)
image(x=seq_len(n), y=seq_len(n), z=sheet, ann=FALSE,
col=mycolors, xaxt='n', yaxt='n')
#Print each correlation, rounded to 2 decimals, in the middle of its cell
text(expand.grid(x=seq_len(n), y=seq_len(n)),
labels=round(c(sheet),2), cex=0.5)
#Label both axes with the variable names, rotated to be perpendicular to the axis
axis(1, seq_len(n), colnames(sheet), las=2)
axis(2, seq_len(n), colnames(sheet), las=2)
I was also able to for-loop this to output multiple .wmf files, once errors were suppressed. Too bad I couldn't visualize significant p-values as well... another time. Thanks!
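On the p-value point: one possible way to flag significant cells is to compute a matrix of pairwise p-values with cor.test() and append a star to the printed number. This is only a sketch under assumptions: mtcars stands in for the real data, the 0.05 cutoff and the blue-white-red palette are arbitrary choices, and no multiple-testing correction is applied. Note also that cor.test() uses pairwise complete cases, which only matches use="complete.obs" when there are no missing values.

```r
# Stand-in data (an assumption): a few mtcars columns replace the real sheet.
dat <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]
sheet <- cor(dat, use = "complete.obs")
n <- ncol(dat)

# Pairwise p-values; the diagonal is left as NA.
pmat <- matrix(NA_real_, n, n, dimnames = dimnames(sheet))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    if (i != j) pmat[i, j] <- cor.test(dat[[i]], dat[[j]])$p.value
  }
}

# Cell labels: rounded correlation, plus "*" where p < 0.05 (arbitrary cutoff).
stars <- ifelse(!is.na(c(pmat)) & c(pmat) < 0.05, "*", "")
cell_labels <- paste0(round(c(sheet), 2), stars)

mycolors <- colorRampPalette(c("blue", "white", "red"))(8)
par(mar = c(5.5, 5.5, 2, 1))
image(x = seq_len(n), y = seq_len(n), z = sheet, ann = FALSE,
      col = mycolors, xaxt = "n", yaxt = "n")
text(expand.grid(x = seq_len(n), y = seq_len(n)), labels = cell_labels, cex = 0.7)
axis(1, seq_len(n), colnames(sheet), las = 2)
axis(2, seq_len(n), colnames(sheet), las = 2)
```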

I assume that you mean "clustering" for point 1.?
For such tasks I prefer the heatmap.2() function from the gplots package, which offers various clustering options.
For point 2 and 3: The heatmap.2() function will also take care of the 90º rotation and the labels since it is using a data matrix as input instead of a data table.
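For readers who would rather keep the ggplot2 code from the question, the three points could be handled roughly as sketched below. The assumptions: mtcars stands in for insheet3, and "ordering by correlation" is read as the clustering interpretation suggested above (hierarchical clustering on 1 - correlation), which is only one of several possible orderings. The long-format data frame is built with base R instead of reshape, so the column names are assigned by hand.

```r
# Stand-in data (an assumption): mtcars replaces the insheet3 data frame.
z <- cor(mtcars, use = "complete.obs", method = "kendall")

# Point 1A: cluster on 1 - correlation so that strongly correlated
# variables end up next to each other.
ord <- hclust(as.dist(1 - z))$order
var_order <- colnames(z)[ord]
# Point 1B: for manual grouping, replace var_order with a hand-written
# character vector that lists every column name in the desired order.

# Long format via base R; rename the columns to match the question's code.
zm <- as.data.frame(as.table(z))
names(zm) <- c("X1", "X2", "value")

# ggplot2 draws discrete axes in factor-level order, so set the levels.
zm$X1 <- factor(zm$X1, levels = var_order)
zm$X2 <- factor(zm$X2, levels = var_order)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(zm, aes(X1, X2, fill = value)) +
    geom_tile() +
    scale_fill_gradient2(low = "blue", high = "darkviolet") +
    labs(x = NULL, y = NULL) +  # point 3: drop the "X1"/"X2" axis titles
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))  # point 2
  print(p)
}
```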

Related

Manually set breaks and keep equal distance between them

I'd like to plot a dataset that consists of two vectors of length 100. The mean difference of the vectors being high and the variance of each of them being considerably smaller, it is quite difficult to plot both vectors and still be able to see the variation within each vector.
I'd like to be able to manually set the breaks so that we can see both the difference between the vectors and the variation within them.
Consider this data set
a=rnorm(100,sd=0.005)+1
b=rnorm(100,sd=0.005)+10
vec = c(a,b)
Neither plot(vec) nor plot(vec,log="y") gives satisfying results, as it is not possible to distinguish the variation within the vector (see picture).
I'd like the breaks on the y-axis to be (min(a), max(a), 5, min(b), max(b)) (and get equal distance between them). How could one achieve that?
Depending on exactly what you are trying to do, a simple transformation of the data in each part of the vector might be enough:
vec2 <- c( (a - min(a))/ (max(a)-min(a)) , 3 + (b - min(b))/ (max(b)-min(b)) )
plot(vec2, axes=F)
box()
axis(1)
axis(2, at=c(0,1,2,3,4), labels = round(c(min(a), max(a), 5, min(b), max(b)),2))
Alternative approaches might be a custom transformation in ggplot, a secondary axis in ggplot, breaking the graph into facets, or using ggbreak.
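The same idea can be written more generally as a piecewise-linear mapping from the requested breaks to equally spaced positions, using approx(). This is a sketch rather than part of the answer above; it assumes the breaks are strictly increasing and that every data value lies between the first and last break. The seed is only there to make the example reproducible.

```r
set.seed(42)
a <- rnorm(100, sd = 0.005) + 1
b <- rnorm(100, sd = 0.005) + 10
vec <- c(a, b)

# Requested axis breaks; they must be strictly increasing.
breaks <- c(min(a), max(a), 5, min(b), max(b))
positions <- seq_along(breaks) - 1  # 0, 1, 2, 3, 4: equal spacing

# Linear interpolation between consecutive breaks maps each value
# to its position on the equally spaced axis.
pos <- approx(x = breaks, y = positions, xout = vec)$y

plot(pos, axes = FALSE, xlab = "Index", ylab = "Value")
box()
axis(1)
axis(2, at = positions, labels = round(breaks, 2))
```

Within each segment the scale is still linear, but the scale differs from one segment to the next, so distances on the plot are only comparable inside a segment.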

Manually creating an object that looks like a heatmap color key

I'm trying to create a key for a heatmap, but as far as I know, I cannot use the existing tools for adding a legend, since I've generated the colors myself (I manually turn a scaled variable into rgb values for a short rainbow, [255,0,0] to [0,0,255]).
Basically, all I want to do is use the rightmost 10th of the screen to create a rectangle with these 10 colors: "#0000FF", "#0072FF", "#00E3FF", "#00FFAA", "#00FF38", "#39FF00", "#AAFF00", "#FFE200", "#FF7100", "#FF0000"
with three numerical labels - at 0, max/2, and max
In essence, I want to manually produce an object that looks like a rudimentary heatmap color key.
As far as I know, split.screen can only split the screen in half, which isn't what I'm looking for. I want the graphic I already know how to produce to take up the leftmost 90% of the screen, and I want this colored rectangle to take up the other 10%.
Thanks.
EDIT: I greatly appreciate the advice about the best way to make the plot. That said, I would still like to know the best way to do the task originally asked: creating the legend by hand. I am already able to produce the exact heatmap graphic that I'm looking for. The false coloring wasn't the only problem I was having with ggplot; it was just the final factor convincing me to switch. I need a non-ggplot solution.
EDIT #2: This is close to the solution I am looking for, except this only goes up to 10 instead of accepting a maximum value as a parameter (I will be running this code on multiple data-sets, all with different maximum values - I want the legend to reflect this). Additionally, if I change the size of the graph, the key falls apart into disconnected squares.
Take a look at the layout() function. I think you want something like this:
layout(matrix(c(1,2), 1, 2, byrow = TRUE), widths=c(9,1))
## plot heatmap
## plot legend
I would also recommend the ggplot2 package and the geom_tile function which will take care of all of this for you.
Assuming your data is in a data frame with the x and y coordinates and heatmap value (e.g. gdat <- data.frame(x_coord=c(1,2,...), y_coord=c(1,1,...), val=c(6,2,...))) Then you should be able to produce your desired heat map plot with the following ggplot command:
ggplot(gdat) + geom_tile(aes(x=x_coord, y=y_coord, fill=val)) +
scale_fill_gradient(low="#0000FF", high="#FF0000")
To get your data into this format you may want to look into the very useful reshape2 package.
Given the strict no-ggplot restriction, here is how one could produce the plot with just base R.
colors <- c("#0000FF", "#0072FF", "#00E3FF", "#00FFAA", "#00FF38",
"#39FF00", "#AAFF00", "#FFE200", "#FF7100", "#FF0000")
layout(matrix(c(1,2), 1, 2, byrow = TRUE), widths=c(9,1))
plot(rnorm(20), rnorm(20), col=sample(colors, 20, replace=TRUE))
par(mar=c(0,0,0,0))
plot(x=rep(1,10), y=1:10, col=colors, pch=15, cex=7.1)
You may have to adjust the cex for your device.
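To address the two points in EDIT #2 (a maximum value passed as a parameter, and a key that does not break apart when the device is resized), one option is to draw the key with image() instead of plotting symbols, since image() fills adjoining rectangles. The helper name draw_color_key and the value max_value = 40 are hypothetical, and the margins are a guess for a default-sized device; a narrower device or longer tick labels may need a wider key panel.

```r
# Hypothetical helper (not from the answer above): draws a vertical colour
# key running from 0 to max_value in the current layout panel.
draw_color_key <- function(colors, max_value) {
  n <- length(colors)
  ticks <- c(0, max_value / 2, max_value)
  # One row of z, n columns; x and y give the cell boundaries.
  image(x = c(0, 1),
        y = seq(0, max_value, length.out = n + 1),
        z = matrix(seq_len(n), nrow = 1),
        col = colors, axes = FALSE, xlab = "", ylab = "")
  axis(4, at = ticks, las = 1, cex.axis = 0.8)  # labels at 0, max/2, max
  box()
  invisible(ticks)
}

colors <- c("#0000FF", "#0072FF", "#00E3FF", "#00FFAA", "#00FF38",
            "#39FF00", "#AAFF00", "#FFE200", "#FF7100", "#FF0000")
max_value <- 40  # stand-in for the maximum of the current data set

layout(matrix(c(1, 2), nrow = 1), widths = c(9, 1))
par(mar = c(5, 4, 4, 1))
# Stand-in for the real heatmap graphic.
plot(rnorm(20), rnorm(20), col = sample(colors, 20, replace = TRUE), pch = 16)
par(mar = c(5, 0.2, 4, 2.2))  # narrow margins so the key fits in 10% of the width
ticks <- draw_color_key(colors, max_value)
layout(1)  # back to a single panel
```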

How do I get x-axis labels to show in R Barplot? [duplicate]

This question already has answers here:
How to display all x labels in R barplot?
Probably pretty simple but I am a newb to R (and to stack)...
Can't seem to get the car names to show on the x-axis of my barplot.
I tried pasting in the example given in the "How to display all x labels in R barplot?" Question but that didn't work
My code is below. Does that code work for anybody else?
#plot of efficiency of 4 cylinder cars
#get 4 cylinder cars separately
fourcyl <- subset(mtcars, cyl == 4)
#barplot in descending order... need to add in car names.
barplot(fourcyl$mpg[order(fourcyl$mpg, decreasing = TRUE)],
horiz=FALSE,
ylab = "Miles per Gallon",
main = "Efficiency for 4 cylinder vehicles",
ylim = c(0,35))
@Pascal's comment links to two possible solutions, but the bottom line is that you need to add the car names manually.
To know which car names to use takes a first step: if you look at mtcars, you'll see that they don't appear under a column header, meaning they are the row names. To get at them, simply:
carnames <- rownames(fourcyl)[ order(fourcyl$mpg, decreasing=TRUE) ]
From here, you need to know how and where to add them. Perhaps the first place many people look is to axis, where you'd do something like:
axis(side=1, at=1:length(carnames), labels=carnames)
But you'd be disappointed on at least two counts: first, you don't see all of the names, since axis courteously ensures they don't overlap by omitting some; second, the ones that do show are not aligned properly under the corresponding vertical bar.
To fix the first, you can try rotating the text. You could use las (see help(par)) and do something like:
axis(side=1, at=1:length(carnames), labels=carnames, las=2)
But again you'll be a little disappointed in that many of the names will run over the default bottom margin (and disappear). You can fix this with a preceding par(mar=...) (again, see the help there and play with it some to find the right parameters), but there are solutions that provide slightly better methods (aesthetically), two of which are mentioned in @Pascal's link (really, go there).
The other problem -- where to put the labels -- is resolved by reading help(barplot) and noticing that the return value from barplot(...) is a matrix providing the mid-point of each of the bars. Odd, perhaps, but it is what it is (and it has a good reason, somewhere). So, capture this and you'll be nearly home-free:
bp <- barplot(fourcyl$mpg[order(fourcyl$mpg, decreasing = TRUE)],
horiz=FALSE, ylab = "Miles per Gallon",
main = "Efficiency for 4 cylinder vehicles",
ylim = c(0,35))
Now, to copy one of the link's suggestions, try:
text(x=bp[,1], y=-1, adj=c(1, 1), carnames, cex=0.8, srt=45, xpd=TRUE)
(No need for the axis command, just bp <- barplot(...), carnames <- ..., and text(...).)
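A different route, not the one taken above, is to let barplot() place the labels itself through its names.arg argument and rotate them with las = 2. The sketch below assumes a bottom margin of 8 lines and cex.names = 0.8 are enough for these names on your device; both are guesses to adjust.

```r
fourcyl <- subset(mtcars, cyl == 4)
# Sort the whole data frame so the row names stay matched to the mpg values.
sorted <- fourcyl[order(fourcyl$mpg, decreasing = TRUE), ]

par(mar = c(8, 4, 4, 1))  # larger bottom margin for the rotated names
bp <- barplot(sorted$mpg,
              names.arg = rownames(sorted),  # one label per bar
              las = 2,                       # labels perpendicular to each axis
              cex.names = 0.8,
              ylab = "Miles per Gallon",
              main = "Efficiency for 4 cylinder vehicles",
              ylim = c(0, 35))
```

The labels here are vertical rather than at 45 degrees; the text() approach above remains the way to get slanted labels.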

Coloring scatterplot in R based on fold enrichment

I'm very new to R and have tried to search around for an answer to my question, but couldn't find quite what I was looking for (or I just couldn't figure out the right keywords to include!). I think this is a fairly common task in R though, I am just very new.
I have an x vs y scatterplot and I want to color those points for which there is at least a 2-fold enrichment, i.e. where x/y>=2. Since my values are expressed as log2 values, the transformed value needs to be x/y>=4.
I currently have the scatterplot plotted with
plot(log2(counts[,40]), log2(counts[,41]))
where counts is an imported .csv file and 40 & 41 are my columns of interest.
I've also created a column for fold change using
counts$fold<-counts[,41]/counts[,40]
I don't know how to incorporate these two pieces of information... Ultimately I want a graph that looks something like the example here: http://s17.postimg.org/s3k1w8r7j/error_messsage_1.png
where those points that are at least two-fold enriched will be colored in blue.
Any help would be greatly appreciated. Thanks!
Is this what you're looking for:
# Fake data
dat = data.frame(x=runif(100,0,50), y = rnorm(100, 10, 2))
plot(dat$x, dat$y, col=ifelse(dat$x/dat$y > 4, "blue", "red"), pch=16)
The ifelse statement creates a vector of "blue" and "red" (or whatever colors you want) based on the values of dat$x/dat$y and plot uses that to color the points.
This might be helpful if you've never worked with colors in R.
Another option is to use ggplot2 instead of base graphics. Here's an example:
library(ggplot2)
ggplot(dat, aes(x,y, colour=cut(x/y, breaks=c(-1000,4,1000),
labels=c("<=4",">4")))) +
geom_point(size=5) +
labs(colour="x/y")
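One note on the threshold: if the plotted values are log2-transformed, a 2-fold change corresponds to a difference of 1 on the log2 scale, because log2(2) = 1, rather than to a ratio of 4. The sketch below colors points on that basis. It assumes enrichment means column 41 divided by column 40, matching the counts$fold definition in the question, and uses made-up count data with hypothetical column names control and treated.

```r
set.seed(1)
# Fake counts (an assumption): "control" and "treated" stand in for
# counts[,40] and counts[,41]; +1 avoids taking log2 of zero.
counts <- data.frame(control = rpois(200, 50) + 1,
                     treated = rpois(200, 80) + 1)

lx <- log2(counts$control)
ly <- log2(counts$treated)

log2_fc <- ly - lx        # equals log2(treated / control)
enriched <- log2_fc >= 1  # a difference of 1 on the log2 scale is 2-fold

plot(lx, ly, pch = 16,
     col = ifelse(enriched, "blue", "grey40"),
     xlab = "log2(control)", ylab = "log2(treated)")
abline(a = 1, b = 1, lty = 2)  # boundary line: ly = lx + 1
```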

How to extract coordinates to plot line segments connecting legend keys in ggplot2?

I've long puzzled over a concise way to communicate significance of an interaction between numeric and categorical variables in a line plot (response on the Y-axis, numeric predictor variable on the X-axis, and each level of the categoric variable a line of a different color or pattern plotted on those axes). I finally came up with the idea of drawing the traditional "brackets and p-values" connecting legend keys instead of lines of data.
Here is a mockup of what I mean:
library(ggplot2);
mydat <- do.call(rbind,lapply(1:3,function(ii) data.frame(
y=seq(0,10)*c(.695,.78,1.39)[ii]+c(.322,.663,.847)[ii],
a=factor(ii-1),b=0:10)));
myplot <- ggplot(data=mydat,aes(x=b,y=y,colour=a,group=a)) +
geom_line()+theme(legend.position=c(.1,.9));
# Plotting with p-value bracket:
myplot +
# The three line segments making up the bracket
geom_segment(x=1.2,xend=1.2,y=13.8,yend=13) +
geom_segment(x=1.1,xend=1.2,y=13,yend=13) +
geom_segment(x=1.1,xend=1.2,y=13.8,yend=13.8) +
# The text accompanying the bracket.
geom_text(label='p < 0.001',x=2,y=13.4);
This is less cluttered than trying to plot brackets someplace on the line-plot itself.
The problem is that the x and y values for the geom_segments and geom_text were obtained by trial and error and for another dataset these coordinates would be completely wrong. That's a problem if I'm trying to write a function whose purpose is to automate the process of pulling these contrasts out of models and plotting them (kind of like the effects package, but with more flexibility about how to represent the data).
My question is: is there a way to somehow pull the actual coordinates of each box comprising the legend and convert them to the scale used by geom_segment and geom_text, or manually specify the coordinates of each box when creating the myplot object, or reliably predict where the individual boxes will be and convert them to the plot's scale given that myplot$theme$legend.position returns 0.1 0.9?
I'd like to do this within ggplot2, because it's robust, elegant, and perfect for all the other things I want to do with my script. I'm open to using additional packages that extend ggplot2 and I'm also open to other approaches to visually indicating significance level on line-plots. However, suggestions that amount to "you shouldn't even do that" are not constructive-- because whether or not I personally agree with you, my collaborators and their editors don't read Stackoverflow (unfortunately).
Update:
This question kind of simplifies to: if the myplot$theme$legend.key.height is in lines and myplot$theme$legend.position seems to be roughly in fractions of the overall plot area (but not exactly) how can I convert these to the units in which the x and y axes are delineated, or alternatively, convert the x and y axis scales to the units of legend.key.height and legend.position?
I don't know the answer to your question as posed. But, another, definitely quickly do-able if less fancy approach to convey the information is to change the names of the levels so that the level names include significance codes. In your first example, you could use
levels(mydat$a) <- list("0" = "0", "1 *" = "1", "2 *" = "2")
And then the legend will reflect this:
With more levels and combos of significance, you could probably work out a set of symbols. Then mention in your figure legend the p level reflected in each set of symbols.
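If the labels are to be generated by a function rather than typed by hand, the symbols could be derived from a vector of p-values. The following is a sketch only: the p-values are invented, the reference level is given NA, and the cutoffs (0.001, 0.01, 0.05) are the conventional star thresholds rather than anything specified in the question.

```r
# Hypothetical p-values for each level's contrast; NA marks the reference level.
p_values <- c("0" = NA, "1" = 0.0004, "2" = 0.03)

# Map each p-value to a significance code.
stars <- as.character(cut(p_values,
                          breaks = c(0, 0.001, 0.01, 0.05, 1),
                          labels = c("***", "**", "*", ""),
                          include.lowest = TRUE))
stars[is.na(stars)] <- ""
new_labels <- trimws(paste(names(p_values), stars))

# Stand-in for mydat$a: relabel the levels, matching by the old level name.
a <- factor(c("0", "1", "2", "1"))
levels(a) <- new_labels[match(levels(a), names(p_values))]
levels(a)
```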
This might be a related way to convey the information: The figure below is produced by rxnNorm in HandyStuff here. Unfortunately, this is another non-answer as I have not been able to make this work with the new version of ggplot2. Hopefully I can figure it out soon.
My answer is not using ggplot2, but the lattice package. I think dotplot is what I would use if I want to compare a continuous variable versus categorical variables.
Here I use dotplot in two ways: one where each level of a gets its own panel, and another where I reproduce your plot, with the lines grouped by a.
library(lattice)
library(latticeExtra) ## to get ggplot2 theme
#y versus levels of B, in different panel of A
p1 <- dotplot(b~y|a ,
data = mydat,
groups = a,
type = c("p", "h"),
main = "interaction between numeric and categorical variables ",
xlab = "continuous value",
par.settings = ggplot2like())
#y versus levels of B , grouped by a(color and line are defined by a)
p2 <- dotplot(b~y, groups= a ,
data = mydat,
type = c("l"),
main = "interaction between numeric and categorical variables ",
xlab = "continuous value",
par.settings = ggplot2like())
library(gridExtra) ## to arrange many grid plots
grid.arrange(p1,p2)

Resources