Using R package pheatmap to draw heatmaps. Is there a way to assign a color to NAs in the input matrix? It seems NA gets colored in white by default.
E.g.:
library(pheatmap)
m<- matrix(c(1:100), nrow= 10)
m[1,1]<- NA
m[10,10]<- NA
pheatmap(m, cluster_rows=FALSE, cluster_cols=FALSE)
Thanks
It is possible, but requires some hacking.
First of all let's see how pheatmap draws a heatmap. You can check that just by typing pheatmap in the console and scrolling through the output, or alternatively using edit(pheatmap).
You will find that colours are mapped using
mat = scale_colours(mat, col = color, breaks = breaks)
The scale_colours function seems to be an internal function of the pheatmap package, but we can check the source code using
getAnywhere(scale_colours)
Which gives
function (mat, col = rainbow(10), breaks = NA)
{
mat = as.matrix(mat)
return(matrix(scale_vec_colours(as.vector(mat), col = col,
breaks = breaks), nrow(mat), ncol(mat), dimnames = list(rownames(mat),
colnames(mat))))
}
Now we need to check scale_vec_colours, that turns out to be:
function (x, col = rainbow(10), breaks = NA)
{
return(col[as.numeric(cut(x, breaks = breaks, include.lowest = T))])
}
So, essentially, pheatmap is using cut to decide which colours to use.
Let's try and see what cut does if there are NAs around:
as.numeric(cut(c(1:100, NA, NA), seq(0, 100, 10)))
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3
[29] 3 3 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6
[57] 6 6 6 6 7 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 8 8 9 9 9 9
[85] 9 9 9 9 9 9 10 10 10 10 10 10 10 10 10 10 NA NA
It returns NA! So, here's your issue!
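The lookup quoted above can be reproduced in a few lines of base R to see the effect directly. This is only a sketch: lookup_colours is a hypothetical stand-in for the internal scale_vec_colours, and the "green" fallback is an arbitrary choice for illustration.

```r
# Hypothetical re-implementation of the colour lookup quoted above
lookup_colours <- function(x, col, breaks) {
  col[as.numeric(cut(x, breaks = breaks, include.lowest = TRUE))]
}

pal  <- rainbow(10)
x    <- c(5, 55, NA)
cols <- lookup_colours(x, pal, seq(0, 100, 10))

# cut() yields NA for the missing value, so indexing the palette gives NA too
is.na(cols)

# One possible workaround (an assumption, not what pheatmap does internally):
# replace the missing colours with a colour of your choice
cols[is.na(cols)] <- "green"
```

An NA colour is simply not painted by the drawing functions, which is why the cell appears white.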
Now, how do we get around it?
The easiest thing is to let pheatmap draw the heatmap, then overplot the NA values as we like.
Looking again at the pheatmap function you'll see it uses the grid package for plotting (see also this question: R - How do I add lines and text to pheatmap?)
So you can use grid.rect to add rectangles to the NA positions.
What I would do is find the coordinates of the heatmap border by trial and error, then work from there to plot the rectangles.
For instance:
library(pheatmap)
library(grid) # grid.rect() and gpar() come from the grid package
m<- matrix(c(1:100), nrow= 10)
m[1,1]<- NA
m[10,10]<- NA
hmap <- pheatmap(m, cluster_rows=FALSE, cluster_cols=FALSE)
# These values were found by trial and error
# They WILL be different on your system and will vary when you change
# the size of the output, you may want to take that into account.
min.x <- 0.005
min.y <- 0.01
max.x <- 0.968
max.y <- 0.990
width <- 0.095
height <- 0.095
coord.x <- seq(min.x, max.x-width, length.out=ncol(m))
coord.y <- seq(max.y-height, min.y, length.out=nrow(m))
for (x in seq_along(coord.x))
{
for (y in seq_along(coord.y))
{
if (is.na(m[y,x])) # y indexes rows, x indexes columns
grid.rect(coord.x[x], coord.y[y], just=c("left", "bottom"),
width, height, gp = gpar(fill = "green"))
}
}
A better solution would be to hack the code of pheatmap using the edit function and have it deal with NAs as you wish...
Actually, the question is easy now. The current pheatmap function has incorporated a parameter for assigning a color to "NA", na_col. Example:
na_col = "grey90"
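Applied to the example matrix from the question, the call might look like the sketch below. The requireNamespace() guard and the choice of "green" are my additions for illustration; na_col itself is the parameter named above.

```r
m <- matrix(1:100, nrow = 10)
m[1, 1] <- NA
m[10, 10] <- NA

# Guarded so the snippet still runs when pheatmap is not installed
if (requireNamespace("pheatmap", quietly = TRUE)) {
  pheatmap::pheatmap(m, cluster_rows = FALSE, cluster_cols = FALSE,
                     na_col = "green")
}
```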
You can enable assigning a colour by using the developer version of pheatmap from github. You can do this using devtools:
#this part loads the dev pheatmap package from github
if (!require("devtools")) {
install.packages("devtools", dependencies = TRUE)
library(devtools)
}
install_github("raivokolde/pheatmap")
Now you can use the parameter "na_col" in the pheatmap function:
pheatmap(..., na_col = "grey", ...)
(edit)
Don't forget to load it afterwards. Once it is installed, you can treat it as any other installed package.
If you don't mind using heatmap.2 from gplots instead, there's a convenient na.color argument. Taking the example data m from above:
library(gplots)
heatmap.2(m, Rowv = F, Colv = F, trace = "none", na.color = "Green")
If you want the NAs to be grey, you can simply force the "NA" as double.
m[is.na(m)] <- as.double("NA")
pheatmap(m, cluster_rows=F, cluster_cols=F)
I have a large data set that I want to represent with a network graph using igraph. I just don't understand how to get the colors right. Is it possible to get an igraph plot in which each edge has the same color as its vertex? In my example below, I would like to color vertices and edges according to the status 'sampled' or 'unsampled'. Another problem is that not all the edges appear in the plot, and I don't understand why.
My code so far is:
d <- data.frame(individual=c(1:10), mother_id = c(0,0,0,0,0,1,3,7,6,7), father_id = c(0,0,0,0,0,4,1,6,7,6) , generation = c(0,0,0,0,0,1,1,2,2,2), status=c("sampled","unsampled","unsampled","sampled",'sampled',"sampled","unsampled","unsampled","sampled",'sampled'))
#Just some settings for layout plot
g <- d$generation
n <- nrow(d)
pos <- matrix(data = NA, nrow = n, ncol = 2)
pos[, 2] <- max(g) - g
pos[, 1] <- order(g, partial = order(d$individual, decreasing = TRUE)) - cumsum(c(0, table(g)))[g + 1]
#Plotting the igraph
G <- graph_from_data_frame(d)
plot(G, rescale = T, vertex.label = d$individual, layout = pos,
edge.arrow.mode = "-",
vertex.color = d$status,
edge.color = d$status,
asp = 0.35)
My question is somewhat similar to this question, but I would like to do it with igraph package.
Ggraph node color to match edge color
Thanks for your help
If you plot(G) you will see that the graph built from the data frame is most likely not what you expect. That is why you don't see all the edges (i.e. the column father_id is not used as an edge endpoint at all).
By default igraph takes the first column as "from" and the second one as "to". That is why you see 1 to 0, 2 to 0 and so on.
You can fix this by passing in two objects, one with the edges and their attributes, and one with the nodes and their attributes.
It is not so clear to me where the edges should be. However, your code should look something like this:
library(igraph)
dd <- read.table(text = "
from to type
1 6 A
3 7 B
7 8 A
6 9 B
7 10 A
4 6 B
1 7 A
6 8 B
7 9 B
6 10 A ", header=T )
nodes <- data.frame(id=unique(c(dd$from, dd$to)) )
nodes$type <- sample(LETTERS[1:2], 8, replace = T )
nodes$x <- c(8,3,5,7,1,2,4,10) # this is for the layout
nodes$y <- c(1, 2, 4, 5, 6, 8, 5, 7)
nodes
id type x y
1 1 B 8 1
2 3 A 3 2
3 7 B 5 4
4 6 A 7 5
5 4 A 1 6
6 8 B 2 8
7 9 A 4 5
8 10 A 10 7
G <- graph_from_data_frame(dd, vertices = nodes ) # directed T or F?
V(G)$color <- ifelse( V(G)$type == "A", "pink", "skyblue")
E(G)$color <- ifelse( E(G)$type == "A", "pink", "skyblue")
edge_attr(G)
vertex_attr(G)
plot(G)
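If the edges are meant to run from each parent to its child, the edge table does not have to be typed by hand; it can be derived from the question's data. The sketch below rests on two assumptions of mine: an id of 0 means "parent unknown", and an edge takes the status of the child it points to. The pink/skyblue palette follows the answer above.

```r
d <- data.frame(
  individual = 1:10,
  mother_id  = c(0, 0, 0, 0, 0, 1, 3, 7, 6, 7),
  father_id  = c(0, 0, 0, 0, 0, 4, 1, 6, 7, 6),
  status     = c("sampled", "unsampled", "unsampled", "sampled", "sampled",
                 "sampled", "unsampled", "unsampled", "sampled", "sampled"),
  stringsAsFactors = FALSE
)

# One edge per known parent (assumption: 0 means "parent unknown")
edges <- rbind(
  data.frame(from = d$mother_id, to = d$individual),
  data.frame(from = d$father_id, to = d$individual)
)
edges <- edges[edges$from != 0, ]

# Assumption: an edge is coloured by the status of the child it points to
status_colours <- c(sampled = "pink", unsampled = "skyblue")
edges$status <- d$status[match(edges$to, d$individual)]
edges$color  <- unname(status_colours[edges$status])

# Vertex table: first column is the vertex name, the rest become attributes
vertices <- data.frame(name = d$individual, status = d$status,
                       color = unname(status_colours[d$status]))

# Guarded so the data preparation still runs without igraph installed
if (requireNamespace("igraph", quietly = TRUE)) {
  G <- igraph::graph_from_data_frame(edges, directed = TRUE, vertices = vertices)
  plot(G, edge.arrow.mode = "-")
}
```

Because both tables carry a color column, plot() picks the colours up as vertex and edge attributes without extra arguments.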
I am trying to color in the area from the top of the graph (y=5) down to the line in the original plot by creating a polygon and filling it in. I'm screwing up the point generation somehow. Can someone explain what's wrong here? (I didn't mean to fill in the triangle.)
half_instances<-c(0,5,2)
Ts<-c(1,2,3)
xpairs<-c(Ts, rep(5,length(half_instances)))
ypairs<-c(Ts,half_instances)
xpairs #1 2 3 5 5 5
ypairs #1 2 3 0 5 2
plot(Ts,half_instances,type="l")
polygon(xpairs,ypairs)
You've mixed up your x and y values: the 5s need to go into the vector of y coordinates.
half_instances<-c(0,5,2)
Ts<-c(1,2,3)
xpairs <- c(Ts, rev(Ts))
xpairs # 1 2 3 3 2 1 = original x-values from left to right for the bottom half, then go back from right to left by using the reverse of the original x-values
ypairs <- c(half_instances, rep(5, length(half_instances)))
ypairs # 0 5 2 5 5 5 = original y-values for bottom half, then fill up with 5's for the top half
plot(Ts, half_instances,type="l")
polygon(xpairs, ypairs, col="red")
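The same construction can be wrapped in a small helper so it works for any line and any top value. fill_above is a hypothetical name introduced here for illustration; it only packages the c(Ts, rev(Ts)) idea from the code above.

```r
# Build polygon coordinates for the area between a line and a horizontal top edge
fill_above <- function(x, y, top) {
  list(
    x = c(x, rev(x)),              # left to right along the line, then back
    y = c(y, rep(top, length(y)))  # the line itself, then the flat top edge
  )
}

Ts <- c(1, 2, 3)
half_instances <- c(0, 5, 2)
p <- fill_above(Ts, half_instances, top = 5)

plot(Ts, half_instances, type = "l")
polygon(p$x, p$y, col = "red")
```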
Because you have points at x = 5, you need to modify xlim if you want to see the whole polygon:
half_instances<-c(0,5,2)
Ts<-c(1,2,3)
xpairs<-c(Ts, rep(5,length(half_instances)))
ypairs<-c(half_instances,Ts)
xpairs #1 2 3 5 5 5
ypairs #0 5 2 1 2 3
plot(Ts,half_instances,type="l",xlim=c(1,5))
polygon(xpairs,ypairs)
I'm not entirely sure what you're trying to do, but hopefully the following code helps:
half_instances<-c(0,5,2)
Ts<-c(1,2,3)
xpairs<-c(Ts, rep(5,length(half_instances)))
ypairs<-c(Ts,half_instances)
xpairs #1 2 3 5 5 5
ypairs #1 2 3 0 5 2
points <- cbind(Ts, half_instances)
# Set up basic plot
plot(points, type="l")
# Create the outside polygon...
maxX <- max(points[, 1])
minX <- min(points[, 1])
maxY <- max(points[, 2])
minY <- min(points[, 2])
borderPoints <- matrix(c(minX,minY, minX,maxY, maxX,maxY), ncol=2, byrow=TRUE)
linePoints <- points[nrow(points):1, ]
outside <- rbind(borderPoints, linePoints)
# ...and plot it in blue
polygon(outside, border=NA, col='blue')
# Create the inside polygon and plot it in red
inside <- rbind(points, points[1,]) # close the polygon by returning to the first point
polygon(inside, col='red', border=NA)
# Redraw the initial line if you want
lines(points, col='black', lwd=2)
I have a csv that looks similar to the following.
Library Parameter1 Parameter2 Parameter3
A 3 6 4
A 4 6 3
A 7 8 9
B 2 10 7
B 4 4 5
B 3 5 4
C 4 6 4
C 6 3 12
C 5 6 8
I would like to be able to create a function to create a histogram for a specific library and parameter e.g., histogram of the frequency of Parameter 2 in Library B.
I kind of know how to use the histogram function. Here's what I have right now:
### x = "Parameter"
histogram <- function(x) {hist(filename[[x]], main = "Normalized",
xlab = "x", ylab = "Frequency", breaks = ceiling(sqrt(nrow(filename))))}
Edit: This is the actual data frame I am working with. It is quite large so I couldn't put the dput in here???
https://www.dropbox.com/s/2ivbhc7wyqms0fy/All-Norm.csv?dl=0
(Sorry if I've done anything incorrectly, still very new.)
One solution would be to subset your data first:
sub <- subset(yourdata, Library == "B")$Parameter2
hist(sub) # base hist(): the histogram() wrapper above expects a column name, not a vector of values
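The subsetting step can also be folded into the function itself, so that the library and the parameter are both arguments. A minimal sketch, assuming the data frame has a Library column as in the sample; library_hist is a hypothetical name, and the breaks rule is carried over from the question's own function.

```r
dat <- data.frame(
  Library    = rep(c("A", "B", "C"), each = 3),
  Parameter1 = c(3, 4, 7, 2, 4, 3, 4, 6, 5),
  Parameter2 = c(6, 6, 8, 10, 4, 5, 6, 3, 6),
  Parameter3 = c(4, 3, 9, 7, 5, 4, 4, 12, 8)
)

# Histogram of one parameter within one library
library_hist <- function(data, lib, param) {
  values <- data[data$Library == lib, param]
  hist(values,
       main = paste(param, "in library", lib),
       xlab = param, ylab = "Frequency",
       breaks = ceiling(sqrt(length(values))))
}

h <- library_hist(dat, "B", "Parameter2")
```

hist() returns its bin counts invisibly, so h can be inspected afterwards if needed.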
This is a really simple ggplot I just put together
Code:
dat <- data.frame(Library = c("A","A","A","B","B","B","C","C","C"),
Parameter1=c(3,4,7,2,4,3,4,6,5),
Parameter2 = c(6,6,8,10,4,5,6,3,6),
Parameter3=c(4,3,9,7,5,4,4,12,8))
dat <- data.table::melt(data.table::as.data.table(dat), id.vars="Library") # data.table's melt() needs a data.table
library(ggplot2)
ggplot(dat,aes(x = value)) + geom_histogram() + facet_grid(Library~variable)
Obviously this could be cleaned up a lot, but this is a place to start.
I have two data.frames called outlier and data.
outlier just holds the row numbers of the points that need to be coloured.
data has 1000 rows.
It has two columns called x and y.
If a row number exists in outlier, I want its dot in the plot to be red, otherwise black.
plot(data$x, data$y, col=ifelse(??,"red","black"))
Something should be in ?? .
Hi this way works for me using ifelse, let me know what you think:
outlier <- sample(1:100, 50)
data <- data.frame(x = 1:100, y = rnorm(n = 100))
plot(
data[ ,1], data[ ,2]
,col = ifelse(row.names(data) %in% outlier, "red", "blue")
,type = "h"
)
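The same ifelse idea can test row positions directly instead of row names, which also fills in the ?? from the question. A small sketch with made-up data; it assumes outlier is (or has been reduced to) a plain vector of row numbers.

```r
outlier <- c(1, 2, 7, 9)
data <- data.frame(x = 1:10, y = 1:10)

# TRUE for rows whose position is listed in outlier
point_col <- ifelse(seq_len(nrow(data)) %in% outlier, "red", "black")

plot(data$x, data$y, col = point_col, pch = 19)
```

If outlier really is a data.frame, pull the relevant column out first (for example outlier[[1]]) before using %in%.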
I think this can be accomplished by creating a new color column in your data frame:
data$color <- "black"
Then set the outliers to a different value:
data[outlier,"color"] <- "red"
I don't have your exact data, but I think I got something similar to what you wanted using the following:
outlier <- c(1, 2, 7, 9)
data <- data.frame(x=c(1,2,3,4,5,6,7,8,9,10),
y=c(1,2,3,4,5,6,7,8,9,10))
data$color <- "black"
data[outlier,"color"] <- "red"
data
x y color
1 1 1 red
2 2 2 red
3 3 3 black
4 4 4 black
5 5 5 black
6 6 6 black
7 7 7 red
8 8 8 black
9 9 9 red
10 10 10 black
Finally plot using the new value in data:
plot(data$x, data$y, col=data$color)
I'd like to create a variable that bins values from another variable based on a binwidth
The data would look something like this if I wanted to create a bin variable based on counts where:
1 to 5 = 1
6 to 10 = 2
11 to 15 = 3
Without recoding each bin by hand, is there a function that can do something like this in R?
Since it looks like you want to get a numeric rather than a factor result, try something like trunc((mydata$count-1)/5)+1
e.g.
mydata$bucket = trunc((mydata$count-1)/5)+1
There's also the ceiling function, which is a little simpler:
mydata$bucket = ceiling(mydata$count/5)
see ?round
So on your data:
mydata = data.frame(spend=c(21,32,34,43,36,39,33,47,47,47,25,50,44,44) ,
count=c(3L,1L,2L,15L,1L,8L,1L,11L,15L,11L,3L,12L,11L,4L) )
mydata$bucket = ceiling(mydata$count/5)
Which gives:
> mydata
spend count bucket
1 21 3 1
2 32 1 1
3 34 2 1
4 43 15 3
5 36 1 1
6 39 8 2
7 33 1 1
8 47 11 3
9 47 15 3
10 47 11 3
11 25 3 1
12 50 12 3
13 44 11 3
14 44 4 1
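As a quick check, the two formulas above can be compared with each other and with cut() on the same counts. The cut() call with labels = FALSE is my addition for comparison; it returns integer bin codes for the intervals (0,5], (5,10] and (10,15].

```r
count <- c(3L, 1L, 2L, 15L, 1L, 8L, 1L, 11L, 15L, 11L, 3L, 12L, 11L, 4L)

by_trunc   <- trunc((count - 1) / 5) + 1
by_ceiling <- ceiling(count / 5)
by_cut     <- cut(count, breaks = seq(0, 15, by = 5), labels = FALSE)

# All three approaches assign the same bucket to every count
all(by_trunc == by_ceiling)
all(by_ceiling == by_cut)
```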
Yeah, it's called the cut function:
?cut
You can use the generic cut() function. For a numeric vector x, the method has these arguments:
> args(cut.default)
function (x, breaks, labels = NULL, include.lowest = FALSE, right = TRUE,
dig.lab = 3L, ordered_result = FALSE, ...)
The argument breaks is central here. It is either a number of intervals or a vector of “breakpoints” defining your intervals. Note that all intervals are by default left-open and right-closed (right = TRUE), so by creating an object x containing the numbers from 1 to 100 and defining a vector of breakpoints (brk) {1, 20, 50, 100}, you will get these results (after using table() on the result):
> x <- 1:100
> brk <- c(1,20,50,100)
> table(cut(x = x, breaks = brk))
(1,20] (20,50] (50,100]
19 30 50
You can see that the first interval is $(1,\,20]$, so 1 is not part of it and the first observation will become a missing value NA (as all other observations outside the defined intervals).
By setting include.lowest = TRUE, R includes the lowest value (i.e., the first interval will be closed), so I think this will produce what you want:
> x <- 1:100
> brk <- c(1,20,50,100)
> table(cut(x = x, breaks = brk, include.lowest = TRUE))
[1,20] (20,50] (50,100]
20 30 50
Setting right = FALSE reverses the whole process: intervals become left-closed and right-open, and include.lowest will close the last interval (i.e., include the highest value in the last category).
As the resulting object will be of class "factor", you might consider setting ordered_result to TRUE, producing an ordered factor object (classes "ordered" and "factor").
Labelling, etc. is optional (see ?cut).
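To make the effect of right concrete, here is the same x and brk with right = FALSE, with and without include.lowest. This is only an illustrative sketch using the objects defined above.

```r
x   <- 1:100
brk <- c(1, 20, 50, 100)

# Intervals are now [1,20), [20,50), [50,100): the value 100 falls outside
open_right <- cut(x, breaks = brk, right = FALSE)
table(open_right)
sum(is.na(open_right))

# include.lowest = TRUE closes the last interval, so 100 is kept
closed_last <- cut(x, breaks = brk, right = FALSE, include.lowest = TRUE)
table(closed_last)
```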
The cut function can also return plain integer bin codes instead of a factor; you just need to set the labels parameter to FALSE:
myData$bucket <- cut(myData$counts, breaks = 30, labels = FALSE)