Heatmap using ggplot2 in R

My Data looks like this:
data1 <- read.table(text = "District Block IE FE AOE CLE CS
A A1 4.87 17.54 13.85 9.01 45.27
B B1 8.19 20.83 14.59 7.04 50.65
C C1 8.71 19.16 16.54 8.24 52.65
D D1 2.43 11.77 11.51 6.96 32.67
E E1 6.85 13.54 14.54 5.7 40.63
F F1 7.02 19.96 13.96 3.82 44.76
G G1 2.55 11.64 8.74 5.06 27.99
H H1 9.81 20.2 12.62 5.95 48.58
I I1 6.56 15.49 12.32 8.08 42.45
J J1 9.47 22.86 25 22.73 80.06
K K1 10.2 20.18 20.14 20.06 70.58
L L1 9.52 14.86 16.95 18.23 59.56", header = TRUE)
I have created a data matrix from the data frame. My initial code looks like this:
row.names(data1)<-data1$Column1
data1<-select(data1,-c(1))
data1<-data.matrix(data1)
data1_heatmap<-heatmap(data1,Rowv = NA,Colv = NA,col=heat.colors(256),scale = "none",margins = c(12,3))
Whenever I use the above code, it plots the whole sheet.
I have 2 issues:
I need to show the cell values that are present in the data matrix.
I also need to specify a color range for each column from IE to CS. For example, in the IE column, values less than 4.87 are red, 6.56 to 6.85 are orange, and values greater than 8.17 are green. So basically a user-defined range for each column.

Try this with ggplot2 (starting with the original data1):
library(ggplot2)
library(reshape2)
data1 <- data1[, -1] # drop the District column (base R, so dplyr is not needed)
data1 <- melt(data1, id = 'Block') # long format: Block, variable, value
# bin the values; note that the same breaks are applied to every column
data1$value <- cut(data1$value, breaks = c(-Inf, 4.87, 6.56, 6.58, 8.17, 14, 19, 21, Inf), right = FALSE)
ggplot(data = data1, aes(x = Block, y = variable)) +
  geom_tile(aes(fill = value), colour = "white") +
  scale_fill_brewer(palette = "PRGn")
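The question also asks for cell values and for a separate range per column. A minimal sketch of one way to do that is below. It uses only three rows and two columns of the question's data, and the break points in col_breaks and the low/mid/high labels are made up for illustration, not taken from the question.

```r
library(ggplot2)

# Small subset of the question's data, in wide form
wide <- data.frame(
  Block = c("A1", "B1", "C1"),
  IE = c(4.87, 8.19, 8.71),
  FE = c(17.54, 20.83, 19.16)
)

# Hypothetical per-column break points (assumed for illustration)
col_breaks <- list(
  IE = c(-Inf, 5, 8.5, Inf),
  FE = c(-Inf, 18, 20, Inf)
)

# Reshape to long form with base R, keeping the raw values for the labels
long <- data.frame(
  Block = rep(wide$Block, times = length(col_breaks)),
  variable = rep(names(col_breaks), each = nrow(wide)),
  value = unlist(wide[names(col_breaks)], use.names = FALSE)
)

# Bin each column with its own breaks
long$level <- NA_character_
for (v in names(col_breaks)) {
  idx <- long$variable == v
  long$level[idx] <- as.character(
    cut(long$value[idx], breaks = col_breaks[[v]],
        labels = c("low", "mid", "high"), right = FALSE)
  )
}
long$level <- factor(long$level, levels = c("low", "mid", "high"))

# Fill by bin, print the raw value in each tile
p <- ggplot(long, aes(x = variable, y = Block)) +
  geom_tile(aes(fill = level), colour = "white") +
  geom_text(aes(label = value)) +
  scale_fill_manual(values = c(low = "red", mid = "orange", high = "green"))
p
```

Because the binning happens per column before plotting, the fill scale only needs to know the three bin names, while geom_text still shows the original numbers.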

How to break datasets in r into new data sets using empty rows

I am trying to automate the process of separating this data into datasets based on its trials. The original data is a list of values taken at several sites, and I want to break them into individual sets without having to use notation like c(1:312), because the number of rows in each trial varies. Each trial starts with a header, like d9, and ends with a blank row before the next header. How can I separate the data into new data frames using the headers/empty rows?
This is for analyzing water data: Depth, Temperature, DO, and Salinity. The end goal is to create a graph of each trial showing the differences in Temperature across the trials.
Data Set (starting at row 1299)
1299 NA
1300 d4
1301 0.00
1302 0.18
1303 0.20
1304 0.31
1305 0.49
1306 0.76
1307 1.12
1308 1.51
1309 1.82
1310 1.92
1311 2.08
1312 2.35
1313 2.41
1314 2.48
1315 2.68
1316 2.97
1317 3.22
1318 3.33
1319 3.40
1320 3.55
1321 3.81
1322 4.05
1323 4.30
1324 4.41
1325 4.46
1326 4.56
1327 4.61
1328 4.62
1329 4.55
1330 4.54
1331 4.56
1332 4.49
1333 4.38
1334 4.38
1335 4.55
1336 4.71
1337 4.91
1338 5.14
1339 5.22
1340 5.26
1341 NA
1342 d11
1343 0.00
1344 0.22
1345 0.22
1346 0.27
D9 <- Data[3:314,]
D12 <- Data[317:517,]
D3 <- Data[520:703,]
D15 <- Data[706:795,]
D14 <- Data[798:853,]
D2 <- Data[856:939,]
D13 <- Data[942:975,]
D1 <- Data[978:1029,]
D6 <- Data[1032:1113,]
D5 <- Data[1116:1171,]
D7 <- Data[1174:1230,]
D8 <- Data[1233:1298,]
D4 <- Data[1301:1340,]
D11 <- Data[1343:1392,]
D10 <- Data[1395:1493,]
We can create a list using split along with grepl and cumsum:
lst <- lapply(split.data.frame(x = df, cumsum(grepl('d\\d+',df$V2))),
function(x) {
names(x)[2] <- as.character(x[1,'V2']) # name the column after the trial header
x[-1,] # drop the header row and return the rest
})
data
df <- structure(list(V1 = 1299:1346, V2 = c(NA, "d4", "0.00", "0.18",
"0.20", "0.31", "0.49", "0.76", "1.12", "1.51", "1.82", "1.92",
"2.08", "2.35", "2.41", "2.48", "2.68", "2.97", "3.22", "3.33",
"3.40", "3.55", "3.81", "4.05", "4.30", "4.41", "4.46", "4.56",
"4.61", "4.62", "4.55", "4.54", "4.56", "4.49", "4.38", "4.38",
"4.55", "4.71", "4.91", "5.14", "5.22", "5.26", NA, "d11", "0.00",
"0.22", "0.22", "0.27")), class = "data.frame", row.names = c(NA, -48L))
Note: it is advisable to keep your data frames in a list instead of assigning each one into the global environment.
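To see why cumsum(grepl(...)) works as a grouping key, here is a small sketch on a toy column in the same layout as the question. The toy values and the names is_header, group_id, keep, pieces and trials are made up for illustration. Unlike the answer above, it also drops the blank (NA) separator rows and converts the values to numeric, which is an assumption about what is wanted for plotting.

```r
# Toy data: NA separators and "d<number>" headers, as in the question
df <- data.frame(
  V1 = 1:9,
  V2 = c(NA, "d4", "0.00", "0.18", NA, "d11", "0.00", "0.22", "0.27"),
  stringsAsFactors = FALSE
)

is_header <- grepl("^d\\d+$", df$V2)  # TRUE on header rows; NA gives FALSE
group_id <- cumsum(is_header)         # increases by 1 at every header

# Drop rows before the first header and the blank separator rows
keep <- group_id > 0 & !is.na(df$V2)
pieces <- split(df$V2[keep], group_id[keep])

# Remove the header from each piece and convert the rest to numeric
trials <- lapply(pieces, function(x) as.numeric(x[-1]))
names(trials) <- vapply(pieces, function(x) x[1], character(1), USE.NAMES = FALSE)

str(trials)
```

Each element of trials is then a numeric vector named after its header, so a single trial can be accessed as trials$d4 or trials[["d11"]].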

plotting with specific values for heatmap in pheatmap

I have a data frame like this:
gene s1 s2 s3
1 -3.83 -8.17 -8.59
2 0.33 -4.51 -7.27
3 0.15 -5.26 -6.2
4 -0.08 -6.13 -5.95
5 -1.15 -4.82 -5.75
6 -0.99 -4.11 -4.85
7 0.42 -4.18 -4.54
8 -0.32 -3.43 -4.4
9 -0.72 -3.37 -4.39
I want to make a heatmap using pheatmap where anything below -4 is green, anything over +4 is red, and everything in between is a shade between green and red. I also don't want to scale my data, and I don't want any clustering. I have this code so far in R:
d <- read.table("test.txt", header = TRUE, sep = "\t", row.names = 1, quote = "")
pheatmap(as.matrix(d), # matrix
scale = "none", # no scaling applied
cluster_cols=FALSE, # do not cluster columns
cluster_rows = FALSE,
treeheight_row=0, # do not show row dendrogram
show_rownames=FALSE, # do not show row names i.e gene names
main = "test.txt",
color = colorRampPalette(c("#0016DB","#FFFFFF","#FFFF00"))(50)
)
How can I plot this with the color scheme I mentioned above.
Thanks
d <-read.table(text="gene s1 s2 s3
1 -3.83 -8.17 -8.59
2 0.33 -4.51 -7.27
3 0.15 -5.26 -6.20
4 -0.08 -6.13 -5.95
5 -1.15 -4.82 -5.75
6 -0.99 -4.11 -4.85
7 0.42 -4.18 -4.54
8 -0.32 -3.43 -4.40
9 -0.72 -3.37 -4.39", header=T)
library(pheatmap)
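# breaks: one bin below -4, fine steps from -4 to 4, one bin above 4
my_colors <- c(min(d), seq(-4, 4, by = 0.01), max(d))
# pheatmap expects one colour fewer than there are breaks
my_palette <- c("green",
                colorRampPalette(colors = c("green", "red"))(n = length(my_colors) - 3),
                "red")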
pheatmap(as.matrix(d),
scale = "none",
cluster_cols=FALSE,
cluster_rows = FALSE,
treeheight_row=0,
show_rownames=FALSE,
main = "test.txt",
color = my_palette,
breaks = my_colors)
Created on 2019-05-29 by the reprex package (v0.3.0)
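The way breaks and colours line up can be checked without pheatmap. The sketch below is a base R illustration only, with made-up values and coarse steps of 2 instead of 0.01. It uses -Inf and Inf as outer breaks for simplicity, whereas the answer above passes min(d) and max(d) to pheatmap.

```r
# Made-up values: one below -4, two inside the range, one above 4
vals <- c(-8.59, -3.83, 0.42, 5)

# 7 breaks define 6 intervals
breaks <- c(-Inf, seq(-4, 4, by = 2), Inf)

# One colour per interval: solid green, 4 gradient steps, solid red
palette <- c("green", colorRampPalette(c("green", "red"))(4), "red")

# Index of the interval each value falls into, then its colour
bin <- cut(vals, breaks = breaks, labels = FALSE, include.lowest = TRUE)
cols <- palette[bin]
data.frame(vals, bin, cols)
```

The first and last intervals catch everything outside -4 to 4, which is why the palette starts and ends with a fixed colour.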

R generate bins from a data frame respecting blanks

I need to generate bins from a data.frame based on the values of one column. I have tried the function "cut".
For example: I want to create bins of air temperature values in the column "AirTDay" in a data frame:
AirTDay (oC)
8.16
10.88
5.28
19.82
23.62
13.14
28.84
32.21
17.44
31.21
I need the bin intervals to include all values in a range of 2 degrees centigrade from that initial value (i.e. 8-9.99, 10-11.99, 12-13.99...), to be labelled with the average value of the range (i.e. 9.5, 10.5, 12.5...), and to respect blank cells, returning "NA" in the bins column.
The output should look as:
Air_T (oC) TBins
8.16 8.5
10.88 10.5
5.28 NA
NA
19.82 20.5
23.62 24.5
13.14 14.5
NA
NA
28.84 28.5
32.21 32.5
17.44 18.5
31.21 32.5
I've gotten as far as:
setwd('C:/Users/xxx')
temp_data <- read.csv("temperature.csv", sep = ",", header = TRUE)
TAir <- temp_data$AirTDay
Tmin <- round(min(TAir, na.rm = FALSE), digits = 0) # start at the minimum value
Tmax <- round(max(TAir, na.rm = FALSE), digits = 0)
int <- 2 # bin ranges 2 degrees
mean_int <- int/2
int_range <- seq(Tmin, Tmax + int, int) # generate bin sequence
bin_label <- seq(Tmin + mean_int, Tmax + mean_int, int) # generate labels
temp_data$TBins <- cut(TAir, breaks = int_range, ordered_result = FALSE, labels = bin_label)
The output table looks correct, but for some reason it shows an additional sequential column, shifts the column names, and collapses all values, eliminating the blank cells. Something like this:
Air_T (oC) TBins
1 8.16 8.5
2 10.88 10.5
3 5.28 NA
4 19.82 20.5
5 23.62 24.5
6 13.14 14.5
7 28.84 28.5
8 32.21 32.5
9 17.44 18.5
10 31.21 32.5
Any ideas on where I am going wrong and how to solve it?
v <- ceiling(max(dat$V1, na.rm = TRUE))
breaks <- seq(8, v, 2)
labels <- seq(8.5, length.out = length(breaks) - 1, by = 2) # one label per interval
transform(dat, Tbins = cut(V1, breaks, labels))
V1 Tbins
1 8.16 8.5
2 10.88 10.5
3 5.28 <NA>
4 NA <NA>
5 19.82 18.5
6 23.62 22.5
7 13.14 12.5
8 NA <NA>
9 NA <NA>
10 28.84 28.5
11 32.21 <NA>
12 17.44 16.5
13 31.21 30.5
This result follows the logic given. The intervals are:
paste(seq(8,v,2),seq(9.99,v,by=2),sep="-")
[1] "8-9.99" "10-11.99" "12-13.99" "14-15.99" "16-17.99" "18-19.99" "20-21.99"
[8] "22-23.99" "24-25.99" "26-27.99" "28-29.99" "30-31.99"
From this we can tell that 19.82 lies between 18 and 19.99, so it is given the value 18.5, just as 10.88 lies between 10 and 11.99 and is assigned the value 10.5.
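A self-contained sketch of the same idea follows, using a few of the question's values in a toy data frame. It makes two assumptions that differ slightly from the code above: right = FALSE, so that each bin is closed on the left as in "8-9.99", and an upper break extended far enough to cover the maximum. The label offset of 0.5 follows the answer's convention.

```r
# Toy data: a few of the question's values, with a blank cell as NA
dat <- data.frame(V1 = c(8.16, 10.88, 5.28, NA, 19.82, 32.21))

width <- 2
lower <- 8
# Extend the upper break so the maximum value falls inside the last bin
upper <- lower + width * ceiling((max(dat$V1, na.rm = TRUE) - lower) / width)

breaks <- seq(lower, upper, by = width)
labels <- head(breaks, -1) + 0.5   # one label per interval

# right = FALSE gives intervals [8, 10), [10, 12), ...
dat$TBins <- cut(dat$V1, breaks = breaks, labels = labels, right = FALSE)

# cut() returns a factor; convert via character to get numeric labels
dat$TBins_num <- as.numeric(as.character(dat$TBins))
dat
```

Values below the first break (5.28) and blank cells both come out as NA, and the rows stay in their original positions.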

ggplot2 boxplot for each variable with unequal distance

I have the following data frame:
date DGS1MO DGS3MO DGS6MO DGS1 DGS2 DGS3 DGS5 DGS7 DGS10 DGS20 DGS30
1 2006-02-28 4.47 4.62 4.74 4.73 4.69 4.67 4.61 4.57 4.55 4.70 4.51
2 2006-03-31 4.65 4.63 4.81 4.82 4.82 4.83 4.82 4.83 4.86 5.07 4.90
3 2006-04-28 4.60 4.77 4.91 4.90 4.87 4.87 4.92 4.98 5.07 5.31 5.17
4 2006-05-31 4.75 4.86 5.08 5.07 5.04 5.03 5.04 5.06 5.12 5.35 5.21
5 2006-06-30 4.54 5.01 5.24 5.21 5.16 5.13 5.10 5.11 5.15 5.31 5.19
6 2006-07-31 5.02 5.10 5.18 5.11 4.97 4.93 4.91 4.93 4.99 5.17 5.07
Using melt (from reshape2) I got this data frame:
date variable value
1 2006-02-28 DGS1MO 4.47
2 2006-03-31 DGS1MO 4.65
3 2006-04-28 DGS1MO 4.60
4 2006-05-31 DGS1MO 4.75
5 2006-06-30 DGS1MO 4.54
6 2006-07-31 DGS1MO 5.02
As you can see I have 1, 3, 6 month, along with 10, 20, 30 year time horizons. I would like to plot box-and-whisker plot for each of these columns and have the following code:
bwplot <- ggplot(df, aes(x = variable, y = value, color = variable)) +
stat_boxplot(geom = "errorbar") +
geom_boxplot()
bwplot
However, the issue is that the distance (space) between the boxplots is the same for every variable. Ideally, there should be a very small distance between the boxplots for 1 month and 3 months, and the gap between the boxplots for 10 years and 20 years should be wide. To remedy this, I have tried to convert the variables into numbers (1/12, 3/12, 6/12, 1, 2, etc.) and then tried this code:
levels(df$variable) <- c(0.083, 0.25, 0.5, 1, 2, 3, 5, 7, 10, 20, 30)
bwplot <- ggplot(df, aes(x = as.numeric(as.character(df$variable)), y = value, color = variable)) +
stat_boxplot(geom = "errorbar") +
geom_boxplot()
bwplot
But what I am getting is only one huge boxplot for the entire time horizon followed by this warning msg:
Warning messages:
1: Continuous x aesthetic -- did you forget aes(group=...)?
If I try
group = variable
I get
Error: Continuous value supplied to discrete scale
What is the right way of doing this?
Thanks.
s <- data.frame(date = seq(as.Date("2006-02-01"), by = "month", length.out = 6),
                M1 = rnorm(6, 5, 0.5), M3 = rnorm(6, 5, 0.5), M6 = rnorm(6, 5, 0.5),
                Y1 = rnorm(6, 5, 0.5), Y2 = rnorm(6, 5, 0.5), Y3 = rnorm(6, 5, 0.5),
                Y10 = rnorm(6, 5, 0.5), Y20 = rnorm(6, 5, 0.5), Y30 = rnorm(6, 5, 0.5))
library(ggplot2)
library(reshape2)
s.melted<-melt(s, id.var="date")
#Create an axis where the numbers represent the number of months elapsed
s.melted$xaxis <-c("M"=1, "Y"=12)[sub("(M|Y)([0-9]+)","\\1",s.melted$variable)] * as.numeric(sub("(M|Y)([0-9]+)","\\2",s.melted$variable))
s.melted[sample(1:nrow(s.melted),6),]
date variable value xaxis
23 2006-06-01 Y1 4.645595 12
38 2006-03-01 Y10 5.190710 120
25 2006-02-01 Y2 4.831788 24
50 2006-03-01 Y30 3.892580 360
39 2006-04-01 Y10 4.513831 120
31 2006-02-01 Y3 4.357127 36
# Show one tick per variable, placed at its number of months
bwplot <- ggplot(s.melted, aes(x = xaxis, y = value, color = variable)) +
  stat_boxplot(geom = "errorbar") +
  geom_boxplot() +
  scale_x_continuous(breaks = unique(s.melted$xaxis),
                     labels = unique(as.character(s.melted$variable)))
bwplot
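The answer uses made-up column names (M1, Y10, ...). For the names in the question (DGS1MO, DGS10, ...), the conversion to months could look like the sketch below. It assumes that a trailing "MO" means months and that a bare number means years; the names vars, num, is_month and months are hypothetical.

```r
# A few of the question's column names
vars <- c("DGS1MO", "DGS3MO", "DGS6MO", "DGS1", "DGS2", "DGS10", "DGS30")

# Pull out the number, then check whether the name ends in "MO"
num <- as.numeric(sub("^DGS(\\d+)(MO)?$", "\\1", vars))
is_month <- grepl("MO$", vars)

# Months stay as they are; years are converted to months
months <- ifelse(is_month, num, num * 12)
setNames(months, vars)
```

The resulting vector can be looked up by variable name to build the xaxis column used in the plot above.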

Multidimensional scaling plot in R

I have a dataset ("data") that looks like this:
PatientID Visit Var1 Var2 Var3 Var4 Var5
1 ID1 0 44.28 4.57 23.56 4.36 8.87
2 ID1 1 58.60 5.34 4.74 3.76 6.96
3 ID1 2 72.44 11.18 21.22 2.15 8.34
4 ID2 0 65.98 6.91 8.57 1.19 7.39
5 ID2 1 10.33 38.27 0.48 14.41 NA
6 ID2 2 69.45 11.18 20.69 2.15 8.34
7 ID3 0 69.16 6.17 10.98 1.91 6.12
8 ID3 1 86.02 3.28 16.29 4.28 5.74
9 ID3 2 69.45 NA 20.69 2.15 8.34
10 ID4 0 98.55 26.75 2.89 3.92 2.19
11 ID4 1 32.66 14.38 4.96 1.13 4.78
12 ID4 2 70.45 11.42 21.78 2.15 8.34
I need to generate an MDS plot with all data points. I also need the visit points to be linked by a line and coloured green for visit 1, red for visit 2 and black for visit 3 (consistent colours for all individuals).
My code looks like this (quite lengthy, but it doesn't work):
data.cor <- cor(t(data[,3:7]), use = "pairwise.complete.obs", method = "spearman")
dim(data.cor)
dim(data)
rownames(data.cor) <- paste0(data$PatientID, "V", data$Visit)
colnames(data.cor) <- paste0(data$PatientID, "V", data$Visit)
c <- dist(data.cor)
fit <- cmdscale(c,eig=TRUE, k=2)
ff <- fit$points
ff <- as.data.frame(ff)
ff$pair <- paste0(substr(rownames(ff),1,6))
ff$pair <- factor(ff$pair)
pc.pair.distances <- matrix(nrow = nlevels(ff$pair), ncol = 1)
for(i in 1:nlevels(ff$pair)){
pair2 <- ff[ff$pair %in% levels(ff$pair)[i] , ]
pc.pair.distances[i,1] <- sqrt(
((pair2[1,1] - pair2[2,1]) * (pair2[1,1] - pair2[2,1]))
+ ((pair2[1,2] - pair2[2,2]) * (pair2[1,2] - pair2[2,2]))
)
rm(pair2)
}
plot(ff[,1], ff[,2], xlab="Principal 1", ylab="Principal 2", type = "n", las = 1)
for(i in 1:nlevels(ff$pair)){
lines(ff[ff$pair == levels(ff$pair)[i],1], ff[ff$pair == levels(ff$pair)[i],2], col = "grey")
}
points(ff[,1], ff[,2], xlab="Coordinate 1", ylab="Coordinate 2", type = "p",
pch = ifelse(grepl(x = substr(rownames(ff), 7,8), "V1"), 20, 18),
cex = 1.3)
I would really appreciate your help.
I suggest modifying your data.frame to add one column for the visit number and one for the individual's ID, using the function sapply.
ff$visit <- sapply(ff$pair,function(x){substr(x,5,5)})
ff$indiv <- sapply(ff$pair,function(x){substr(x,3,3)})
The ggplot2 library is then very useful for plotting the data. First, draw the points:
g <- ggplot(ff,aes(V1,V2))+geom_point(aes(color=visit))
Then add a line for each individual:
for (i in unique(ff$indiv)){
g <- g+geom_line(data=ff[ff$indiv==i,],aes(V1,V2))
}
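The loop can also be replaced by a group aesthetic. The sketch below is one possible variant, not the answer's exact method. The coordinates are made up and stand in for fit$points, and the colour mapping (visit 0 green, 1 red, 2 black) is an assumption based on the visit numbers in the question's data. It uses geom_path so that points are joined in row order (visit order) rather than sorted by x.

```r
library(ggplot2)

# Made-up MDS coordinates standing in for fit$points
ff <- data.frame(
  V1 = c(0.1, 0.4, 0.3, -0.2, -0.5, -0.1),
  V2 = c(0.2, -0.1, 0.5, 0.3, -0.4, 0.0),
  row.names = c("ID1V0", "ID1V1", "ID1V2", "ID2V0", "ID2V1", "ID2V2")
)

# Split row names such as "ID1V0" into individual and visit
ff$indiv <- sub("V\\d+$", "", rownames(ff))
ff$visit <- factor(sub("^.*V", "", rownames(ff)))

g <- ggplot(ff, aes(V1, V2)) +
  geom_path(aes(group = indiv), colour = "grey") +   # one line per individual
  geom_point(aes(colour = visit), size = 3) +
  scale_colour_manual(values = c("0" = "green", "1" = "red", "2" = "black"))
g
```

Splitting on the "V" with sub keeps working when patient IDs have more than one digit, where fixed substr positions would not.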
