I want to create a heatmap using R.
Here is what my dataset looks like:
sortCC
Genus Location Number propn
86 Flavobacterium CC 580 0.3081827843
130 Algoriphagus CC 569 0.3023379384
88 Joostella CC 175 0.0929861849
215 Paracoccus CC 122 0.0648246546
31 Leifsonia CC 48 0.0255047821
sortN
Genus Location Number propn
119 Niastella N 316 0.08205661
206 Aminobacter N 252 0.06543755
51 Nocardioides N 222 0.05764736
121 Niabella N 205 0.05323293
257 Pseudorhodoferax??? N 193 0.05011685
149 Pedobacter N 175 0.04544274
Here is the code I have so far:
row.names(sortCC) <- sortCC$Genus
sortCC_matrix <- data.matrix(sortCC)
sortCC_heatmap <- heatmap(sortCC_matrix, Rowv=NA, Colv=NA, col = cm.colors(256), scale="column", margins=c(5,10))
I was going to generate 2 separate heatmaps, but when I ran the code above the result looked wrong.
Questions: 1) Is it possible to combine the two data sets, since they have the same columns (Genus) and differ only in location, number and proportion? 2) If it is not possible to combine the two, how do I exclude the Location column from the heatmap?
Any suggestions will be much appreciated! Thanks!
Since you have the same columns, you can bind your data.frames and use facets to tell them apart. Here is a solution based on ggplot2:
dat <- rbind(sortCC,sortN)
library(ggplot2)
ggplot(dat, aes(y = factor(Number),x = factor(Genus))) +
geom_tile(aes(fill = propn)) +
theme_bw() +
theme(axis.text.x=element_text(angle=90)) +
facet_grid(Location~.)
To remove the extra column, you can use subset:
subset(dat,select=-c(Location))
If you still want to merge the data by Genus, you can do this, for example:
sortCC <- subset(sortCC,select=-c(Location))
sortN <- subset(sortN,select=-c(Location))
merge(sortCC,sortN,by='Genus')
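As a rough sketch of how the merged data could feed base heatmap(): the original attempt probably looked wrong because data.matrix() turns the character/factor columns (Genus, Location) into integer codes and plots them alongside the numbers. The version below keeps only propn from each data frame. It uses a few rows from the question; all = TRUE, the suffixes, and treating a genus missing from one location as 0 are assumptions of this sketch, not something the question states.

```r
# A few rows copied from the question's two data frames
sortCC <- data.frame(
  Genus    = c("Flavobacterium", "Algoriphagus", "Joostella"),
  Location = "CC",
  Number   = c(580, 569, 175),
  propn    = c(0.3081827843, 0.3023379384, 0.0929861849)
)
sortN <- data.frame(
  Genus    = c("Niastella", "Aminobacter", "Pedobacter"),
  Location = "N",
  Number   = c(316, 252, 175),
  propn    = c(0.08205661, 0.06543755, 0.04544274)
)

# Keep only Genus and propn, then merge; all = TRUE keeps genera that
# occur in only one of the two locations
merged <- merge(sortCC[c("Genus", "propn")], sortN[c("Genus", "propn")],
                by = "Genus", all = TRUE, suffixes = c("_CC", "_N"))

# Numeric matrix with genera as row names, one column per location
prop_matrix <- as.matrix(merged[c("propn_CC", "propn_N")])
rownames(prop_matrix) <- merged$Genus

# ASSUMPTION: a genus absent from a location has proportion 0
prop_matrix[is.na(prop_matrix)] <- 0

heatmap(prop_matrix, Rowv = NA, Colv = NA, scale = "none",
        col = cm.colors(256), margins = c(5, 10))
```

With scale = "none" the colours reflect the raw proportions, so the two locations can be compared directly.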
I have a table which has multiple columns and rows. I want to access each value by its column name and row name, and make a plot with these values.
The table looks like this with 101 columns:
IDs Exam1 Exam2 Exam3 Exam4 .... Exam100
Ellie 12 48 33 64
Kate 98 34 21 76
Joe 22 53 49 72
Van 77 40 12
Xavier 88 92
What I want is to be able to look up the marks for a given row (IDs) and a given column (exams), like this:
table[Ellie,Exam3] --> 48
table[Ellie,Exam100] --> 64
table[Ellie,Exam2] --> (empty)
Then, with these numbers, I want to see the distribution of how Ellie did in Exam2, Exam3 and Exam100 compared with the rest of the exams.
I have almost figured out this part with R:
library(data.table)
library(ggplot2)
pdf("distirbution_given_row.pdf")
selectedvalues <- c(table[Ellie,Exam3] ,table[Ellie,Exam100])
library(plyr)
cdat <- ddply(selected values, "IDs", summarise, exams.mean=mean(exams))
selectedvaluesggplot <- ggplot(selectedvalues, aes(x=IDs, colour=exams)) + geom_density() + geom_vline(data=cdat, aes(xintercept=exams.mean, colour=IDs), linetype="dashed", size=1)
dev.off()
This should plot Ellie's marks for the exams of interest against the rest of her marks (a blank should not be treated as zero; it stays blank).
Red: Marks for Exam3, 100 and 2 , Blue: The marks for the remaining 97 exams
(The code and the plot are taken as an example of ggplot2 from this link.)
All ideas are appreciated!
For accessing your data at least you can do the following:
library(dplyr) # provides %>% and select()
df=data.frame(IDs=c("Ellie","Kate","Joe","Van","Xavier"),Exam1=c(12,98,22,77,NA),Exam2=c(NA,34,53,NA,NA),
Exam3=c(48,21,49,40,NA),Exam4=c(33,76,NA,12,88))
row.names(df)=df$IDs
df=df%>%select(-IDs)
> df['Joe','Exam2']
[1] 53
Now I have prepared an example with randomly generated numbers to illustrate what you could do. First let us create an example data frame:
df=as.data.frame(matrix(rnorm(505,50,10),ncol=101))
colnames(df)=c("IDs",paste0("Exam",as.character(1:100)))
df$IDs=c("Ellie","Kate","Joe","Van","Xavier")
To work with ggplot it is recommended to convert it to long format (gather() comes from tidyr):
library(tidyr)
df0=df%>%gather(key="exams",value="score",-IDs)
From here on you can play with your variables as desired. For instance plotting the density of the score per ID:
ggplot(df0, aes(x=score,col=IDs)) + geom_density()
or selecting only Exams 2,3,100 and plotting density for different exams
df0=df0%>%filter(exams=="Exam2"|exams=="Exam3"|exams=="Exam100")
ggplot(df0, aes(x=score,col=exams)) + geom_density()
IIUC, you want to plot, for each ID, the selected exams against all the other exams. Consider the following steps:
Reshape your data to long format, replacing NAs with zero if needed.
Run by() to subset the data by IDs, then build the mean aggregate data and the ggplots.
Within by(), create a SelectValues indicator column for the selected exams, then graph with a vertical line at each group mean.
Data
txt = 'IDs Exam1 Exam2 Exam3 Exam4 Exam100
Ellie 12 NA 48 33 64
Kate 98 34 21 76 NA
Joe 22 53 49 NA 72
Van 77 NA 40 12 NA
Xavier NA NA NA 88 92'
exams_df <- read.table(text=txt, header = TRUE)
# ADD OTHER EXAM COLUMNS (SEEDED FOR REPRODUCIBILITY)
set.seed(444)
exams_df[paste0("Exam", 5:99)] <- replicate(99-4, sample(100, 5)) # 5:99, not seq(5:99), which yields 1:95 and overwrites Exam1-Exam4
Reshape and Graph
library(ggplot2) # ONLY PACKAGE NEEDED
# FILL NA
exams_df[is.na(exams_df)] <- 0
# RESHAPE (BASE R VERSION)
exams_long_df <- reshape(exams_df,
timevar = "Exam",
times = names(exams_df)[grep("Exam", names(exams_df))],
v.names = "Score",
varying = names(exams_df)[grep("Exam", names(exams_df))],
new.row.names = 1:1000,
direction = "long")
# GRAPH BY EACH ID
by(exams_long_df, exams_long_df$IDs, FUN=function(df) {
df$SelectValues <- ifelse(df$Exam %in% c("Exam1", "Exam3", "Exam100"), "Select Exams", "All Else")
cdat <- aggregate(Score ~ SelectValues, df, FUN=mean)
ggplot(df, aes(Score, colour=SelectValues)) +
geom_density() + xlim(-50, 120) +
labs(title=paste(df$IDs[[1]], "Density Plot of Scores"), x ="Exam Score", y = "Density") +
geom_vline(data=cdat, aes(xintercept=Score, colour=SelectValues), linetype="dashed", size=1)
})
Output
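The question asks that blanks not be counted as zero, whereas the code above fills NAs with 0 before averaging. A minimal sketch of the alternative, dropping missing scores instead, is shown here on the five exam columns from the sample data. The long-format construction and the choice of Exam2, Exam3 and Exam100 as the selected exams (taken from the question) are assumptions of this sketch.

```r
txt <- 'IDs Exam1 Exam2 Exam3 Exam4 Exam100
Ellie 12 NA 48 33 64
Kate 98 34 21 76 NA
Joe 22 53 49 NA 72
Van 77 NA 40 12 NA
Xavier NA NA NA 88 92'
exams_df <- read.table(text = txt, header = TRUE)

# Long format in base R: one row per (ID, exam) pair
exam_cols <- grep("^Exam", names(exams_df), value = TRUE)
long <- data.frame(
  IDs   = rep(exams_df$IDs, times = length(exam_cols)),
  Exam  = rep(exam_cols, each = nrow(exams_df)),
  Score = unlist(exams_df[exam_cols], use.names = FALSE)
)

# Drop missing scores instead of replacing them with 0
long <- long[!is.na(long$Score), ]

# Flag the exams of interest (Exam2, Exam3, Exam100 as in the question)
long$Group <- ifelse(long$Exam %in% c("Exam2", "Exam3", "Exam100"),
                     "Select Exams", "All Else")

# Mean score per ID and group, computed only over exams actually taken
means <- aggregate(Score ~ IDs + Group, data = long, FUN = mean)
print(means)
```

The `long` and `means` data frames can then be passed to geom_density() and geom_vline() in the same way as `df` and `cdat` above.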
I'm more or less a newbie in R. I worked with it during my university studies, but that was a long time ago...
I have a table with 4 columns: vine ID, and 3 columns for NDVI (a vegetation index) values at 3 dates.
ID 09052017 25052017 16062017
1 233 244 238
2 225 234 247
3 224 231 245
4 124 115 124
I know how to read my table, create variables with it, select columns or rows, make a plot(x,y).
My goal is to draw, for each ID, a line through its 3 NDVI values, with all lines in the same graph window.
But I'm a little bit confused about how to do what I want.
Can somebody give me some ideas on how to create this?
Like this?
library(ggplot2)
library(dplyr)
library(tidyr)
df %>%
gather(date, NDVI, -ID) %>%
ggplot(aes(x = as.Date(date, '%d%m%Y'), y = NDVI, group = ID, col = factor(ID))) +
geom_line() +
xlab("Date")
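The pipeline above assumes a data frame `df` whose column names are the raw date strings. A minimal sketch of one way to get there, plus a base-graphics version of the same plot, follows. Reading with check.names = FALSE is an assumption about how the table is loaded; without it, read.table() would rename the date columns to syntactically valid names such as X09052017.

```r
# check.names = FALSE keeps "09052017" etc. as column names
df <- read.table(text = "ID 09052017 25052017 16062017
1 233 244 238
2 225 234 247
3 224 231 245
4 124 115 124", header = TRUE, check.names = FALSE)

# Parse the column names (day, month, year) into Date objects
dates <- as.Date(names(df)[-1], format = "%d%m%Y")

# Empty plot spanning all dates and NDVI values, then one line per vine ID
plot(range(dates), range(df[-1]), type = "n", xlab = "Date", ylab = "NDVI")
for (k in seq_len(nrow(df))) {
  lines(dates, as.numeric(df[k, -1]), col = k, type = "b")
}
legend("bottomright", legend = df$ID, col = seq_len(nrow(df)), lty = 1, title = "ID")
```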
I have an xyplot grouped by a factor. I plot salinity (AvgSal = Y) against time (DayN = X) for 16 different sites, site being the factor (SiteCode). I want all of the site plots stacked so I set the layout to one column with 16 rows.
First issue: I would like to remove the strip above each plot that contains only the SiteCode label, as it takes up a lot of space. Instead, I could introduce a second column with the SiteCode names or introduce a legend in the same strip as the plot. Can anyone tell me how to remove the label strip and introduce labelling in a different fashion?
Here's the code:
Sample Data
zz <- "SiteCode DayN AvgSal
1 CC 157 29.25933
2 CC 184 29.68447
3 DW 160 26.47328
4 DW 190 29.07192
5 FP 157 30.40344
6 FP 184 30.58842
7 IN 157 30.25319
8 IN 184 29.20716
9 IP 156 29.09548
10 IP 187 27.86887
11 LB 162 27.58603
12 LB 191 28.86910
13 LR 160 28.06035
14 LR 190 29.52723
15 PB 159 30.10903
16 PB 188 29.46113
17 PG 161 29.67765
18 PG 189 28.90864
19 SA 162 23.23362
20 SA 190 26.96549
21 SH 156 24.86752
22 SH 187 23.12184
23 SP 161 18.95347
24 SP 189 19.16433
25 VC 162 29.49714
26 VC 186 29.66493
27 WP 157 27.33631
28 WP 183 27.18465
29 YB 157 30.50193
30 YB 183 30.49824
31 ZZ 159 30.14175
32 ZZ 186 29.44860"
Data <- read.table(text=zz, header = TRUE)
library(lattice) # provides xyplot()
xyplot(AvgSal~DayN | factor(SiteCode),
layout = c(1, 16),
xlab = "Time (Day of the year)",
ylab = "Average Salinity (PSU)",
strip = function(bg = 'white', ...) strip.default(bg = 'white', ...),
data = Data, type = c("a","p"))
Second issue: The strips are ordered alphabetically by SiteCode, or in the order in which the sites were entered into the csv data file. I would like to order them from highest to lowest average salinity, but I do not know how to achieve this. Can anyone help?
I have tried using order() to sort the data by ascending salinity before running the plot, but this doesn't seem to work, even when I remove the rownames.
I also tried the solution in How to change the order of the panels in simple Lattice graphs, by assigning set levels, i.e.
levels(Data$SiteCode) <- c("SP", "SA", "SH", "LB", "DW",
"LR", "PG", "VC", "ZZ", "PB",
"WP", "IP", "IN", "CC", "FP", "YB")
This seemed to change the label above each panel, but it did not change the corresponding plots, leaving plots with the wrong label. It also seems like an inefficient way to go about it if I want to do this process for a large number of variables.
Any help will be really appreciated! Cheers :)
The solutions always seem so simple, in hindsight.
Second issue (panel order): I had to use levels() and reorder() inside the factor() call, where X is the numeric variable I want to order SiteCode by.
xyplot(AvgSal ~ DayN | factor(SiteCode, levels = levels(reorder(SiteCode, X))),
First issue (strip): This turned out to be very simple, once I knew what I was doing. I just had to 'turn off' the strip; the following gets rid of the title strip altogether.
strip = FALSE
In addition, I decided having the strip vertically aligned on the left would be nice:
strip = FALSE,
strip.left = strip.custom(horizontal = FALSE)
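Putting both pieces together, a minimal sketch on three of the sites from the sample data might look like the following. Ordering by the negated per-site mean of AvgSal (to get highest salinity first) and as.table = TRUE (so the first level is drawn in the top panel rather than the bottom one) are assumptions of this sketch and are not part of the answer above.

```r
library(lattice)

# Three sites copied from the sample data
zz <- "SiteCode DayN AvgSal
CC 157 29.25933
CC 184 29.68447
SP 161 18.95347
SP 189 19.16433
YB 157 30.50193
YB 183 30.49824"
Data <- read.table(text = zz, header = TRUE)

# reorder() sorts levels by the per-site mean of its second argument;
# negating AvgSal puts the highest-salinity site first
Data$SiteOrdered <- reorder(factor(Data$SiteCode), -Data$AvgSal)

p <- xyplot(AvgSal ~ DayN | SiteOrdered, data = Data,
            layout = c(1, nlevels(Data$SiteOrdered)),
            as.table = TRUE,   # first level at the top
            strip = FALSE,     # no strip above each panel
            strip.left = strip.custom(horizontal = FALSE),
            xlab = "Time (Day of the year)",
            ylab = "Average Salinity (PSU)",
            type = c("a", "p"))
print(p)
```

Unlike assigning to levels() directly, reorder() only changes the order of the levels and leaves each observation attached to its own site, so labels and panels stay matched.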
I have a data frame, and I want to combine the values of its first two columns with the names of the variable columns to create the legend.
Inside df the data are grouped by letters from a to h. In particular, I want to combine the values in the AC and AR columns with the DQ0:DQ2 variable names and show them in the legend in that format,
something like 78_256_DQ0, 78_256_DQ1 and 78_256_DQ2 for data group a,
and the same for the rest of the letters in df.
my reproducible df like this;
df <- do.call(rbind,lapply(1,function(x){
AC <- as.character(rep(rep(c(78,110),each=10),times=3))
AR <- as.character(rep(rep(c(256,320,384),each=20),times=1))
V <- rep(c(seq(2,40,length.out=5),seq(-2,-40,length.out=5)),times=2)
DQ0 = sort(replicate(6, runif(10,0.001:1)))
DQ1 = sort(replicate(6, runif(10,0.001:1)))
DQ2 = sort(replicate(6, runif(10,0.001:1)))
No = c(replicate(1,rep(letters[1:6],each=10)))
data.frame(AC,AR,V,DQ0,DQ1,DQ2,No)
}))
head(df)
AC AR V DQ0 DQ1 DQ2 No
1 78 256 2.0 0.003944916 0.00902776 0.00228837 a
2 78 256 11.5 0.006629239 0.01739512 0.01649540 a
3 78 256 21.0 0.048515226 0.02034436 0.04525160 a
4 78 256 30.5 0.079483625 0.04346118 0.04778420 a
5 78 256 40.0 0.099462310 0.04430493 0.05086738 a
6 78 256 -2.0 0.103686255 0.04440260 0.09931459 a
*****************************************************
library(reshape2)
df_new <- melt(df,id=c("V","No"),measure=c("DQ0","DQ1","DQ2"))
library(ggplot2)
ggplot(df_new,aes(y=value,x=V,group=No,colour=No))+
geom_point()+
geom_line()
UPDATE
After #... answer I made a little progress. His solution is only partially OK, because of what happens when we melt with names:
df$names <- interaction(df$AC,df$AR,names(df)[4:6])
df_new <- melt(df,id=c("V","No","names"),measure=c("DQ0","DQ1","DQ2"))
This command plots 4 rows for each group a to h.
the output becomes like this;
head(df)
AC AR V DQ0 DQ1 DQ2 No names
1 78 256 2.0 0.002576547 0.04294134 0.008302918 a 78.256.DQ0
2 78 256 11.5 0.010150299 0.04570650 0.011749370 a 78.256.DQ1
3 78 256 21.0 0.012540026 0.06977744 0.013887357 a 78.256.DQ2
4 78 256 30.5 0.036532977 0.11460343 0.071172301 a 78.256.DQ0
5 78 256 40.0 0.042801967 0.11518191 0.073756228 a 78.256.DQ1
6 78 256 -2.0 0.043275144 0.13033194 0.076569977 a 78.256.DQ2
**************************************************************
and with modification of the plot command
ggplot(df_new,aes(y=value,x=V,lty=variable,colour=names))+
geom_point()+
geom_line()
The output format I would prefer is one where I can refer to all rows of DQ0, DQ1 and DQ2 inside each group. Any suggestions?
last condition
You can use df$names <- interaction(v$AC,v$AR,DQ0) and then also set names as an id in your melt command. Later you use color=names in your aes function.
This adds a names column holding the combination of the given columns. You can also set sep='_' if you prefer that over the default '.'.
If you now use this column for colouring, you will get those labels as legend names.
Finally I found a way using gather from tidyr.
df_gather <- df %>% gather(DQ, value,-No, -AC, -AR, -V)
and using the interaction function from @drmariod's answer
df_gather$names <- interaction(df_gather$AC,df_gather$AR,df_gather$DQ)
and here is the result of this question:)
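For reference, the reshape-then-label idea can be sketched end to end as follows. The four-row data frame and its DQ values are made up for illustration, base stack() stands in for gather(), and sep = "_" is chosen to match the 78_256_DQ0 format asked for in the question.

```r
# Tiny made-up stand-in for df (values are illustrative only)
df <- data.frame(
  AC  = c("78", "78", "110", "110"),
  AR  = c("256", "256", "320", "320"),
  V   = c(2, 40, 2, 40),
  DQ0 = c(0.01, 0.10, 0.20, 0.30),
  DQ1 = c(0.02, 0.12, 0.22, 0.32),
  DQ2 = c(0.03, 0.13, 0.23, 0.33),
  No  = c("a", "a", "b", "b")
)

# Long format: stack() puts DQ0, DQ1, DQ2 underneath each other
# ("values" holds the numbers, "ind" the original column name)
dq <- stack(df[c("DQ0", "DQ1", "DQ2")])
long <- cbind(df[rep(seq_len(nrow(df)), 3), c("AC", "AR", "V", "No")], dq)

# Legend label per row, built AFTER reshaping so that each value is
# paired with its own DQ column name
long$names <- interaction(long$AC, long$AR, long$ind, sep = "_", drop = TRUE)

# Plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(long, ggplot2::aes(x = V, y = values, colour = names)) +
    ggplot2::geom_point() +
    ggplot2::geom_line()
  print(p)
}
```

Building the label before reshaping, as in the UPDATE above, recycles the three DQ names down the rows of the wide table, which is why the labels there did not line up with the values.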
I am a beginner in R. I want to plot the numerical relationships between different columns of a data frame.
Currently I have the following data frame:
topN Precision Recall F1Score udim idim tdim
10 50 0.02712121 0.2843955 0.04951998 67 78 50
40 50 0.02515152 0.2584113 0.04584124 67 156 50
70 50 0.02539924 0.2585877 0.04625516 67 234 50
100 50 0.02608365 0.2735997 0.04762680 133 78 50
130 50 0.02431818 0.2504262 0.04433146 133 156 50
160 50 0.02425856 0.2448997 0.04414439 133 234 50
190 50 0.02418251 0.2498824 0.04409746 200 78 50
220 50 0.02342205 0.2436125 0.04273533 200 156 50
250 50 0.02136882 0.2179636 0.03892181 200 234 50
I want to plot the 3D relation between udim, idim and F1Score. I am using the persp() function in R. I want to check whether using t() on z is the right thing to do.
So
x is udim: 67 133 200
y is idim: 78 156 234
z is their corresponding F1Score value in the data frame.
I use the following codes:
plot.data <- read.table(plot.file, sep=",", header=T)
# plot.file is the data frame file location
udim <- as.factor(plot.data$udim)
u <-as.integer(levels(udim))
idim <- as.factor(plot.data$idim)
i <- as.integer(levels(idim))
t <- as.integer(levels(as.factor(plot.data$tdim)))
z <- outer(u, i, FUN = function(u, i){
ss <- subset(plot.data, tdim == 50 & topN == 50) #udim == u & idim == i &
ss$F1Score
})
persp(u, i, t(z), theta=45, phi=45, shade = 0.45, xlab="user dim",
ylab="item dim", zlab="F1 Score", scale=TRUE)
I got the following plot:
Am I plotting it right?
Is this the easiest/normal way to tackle such a task?
Actually, my data frame has more rows with different values of topN and tdim, so is it possible to add one or two more dimensions, say tdim and topN, to show the numerical relationships between that many columns in one plot?
Your graph already looks nice and I cannot answer your second question.
However, I want to present you another option for 3-way graphs.
Although they are usually quite confusing, I found an appealing way to make use of 3D scatterplots.
Using scatterplot3d and animation, as well as some third-party software like ImageMagick (http://imagemagick.org), you can create animated pictures of 3D scatterplots, which are certainly an option for presenting data on a computer.
Sample for your data (I don't have the animation package installed right now, so I can only give you the syntax for the plot):
library(scatterplot3d)
F1Score <- c(0.04951998,0.04584124,0.04625516,0.04762680,0.04433146,0.04414439,0.04409746,0.04273533,0.03892181)
udim <- c(67,67,67,133,133,133,200,200,200)
idim <- c(78,156,234,78,156,234,78,156,234)
for (j in seq(5, 175, by = 5)) {
scatterplot3d(udim, idim, F1Score, angle = j)
Sys.sleep(0.042) # for 24 fps when looking at it in R
}
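On the question of whether t(z) is right: in the question's code, outer() fills the matrix column by column with F1Score in data-frame order, so rows end up indexed by idim and columns by udim, and the transpose is indeed what puts udim on the rows for persp(u, i, ...). A sketch that avoids relying on row order builds z by cross-tabulating instead. The data frame below re-enters the nine rows shown in the question; using tapply() with mean is an assumption of this sketch (each cell has exactly one value here).

```r
# The nine rows shown in the question (topN == 50, tdim == 50)
plot.data <- data.frame(
  udim    = rep(c(67, 133, 200), each = 3),
  idim    = rep(c(78, 156, 234), times = 3),
  F1Score = c(0.04951998, 0.04584124, 0.04625516,
              0.04762680, 0.04433146, 0.04414439,
              0.04409746, 0.04273533, 0.03892181)
)

# Rows indexed by udim, columns by idim, regardless of row order
z <- with(plot.data, tapply(F1Score, list(udim, idim), mean))
u <- as.numeric(rownames(z))
i <- as.numeric(colnames(z))

# z already has dimensions length(u) x length(i), so no t() is needed
persp(u, i, z, theta = 45, phi = 45, shade = 0.45,
      xlab = "user dim", ylab = "item dim", zlab = "F1 Score")
```

For further dimensions such as tdim or topN, one option is to subset plot.data by each value and draw one such surface per subset.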