I have a data matrix with thousands of rows, like this:
file_A file_B file_C file_D
Carbohydrate metabolism 69370 67839 68914 67272
Energy metabolism 40223 40750 39450 39735
Lipid metabolism 22333 21668 22421 21773
Nucleotide metabolism 18449 18389 17560 18263
Amino acid metabolism 63739 63441 62797 63106
Metabolism of other amino acids 19075 19068 18896 18836
I want to create a heatmap of only the 50 highest-value rows for file_A, B, C, and D.
How can I do that?
Assuming you want the top 50 rows for the sum of file_A through file_D, you can do so with dplyr pretty easily:
your_dataframe %>%
  mutate(fileSum = select(., file_A:file_D) %>% rowSums()) %>%
  arrange(desc(fileSum)) %>%
  head(50)
From there, you can pipe into ggplot for your desired visual, save it as a separate dataframe, or whatever you need to do.
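For instance, a minimal sketch of that ggplot step (assuming the row labels live in a column called pathway; adjust the name to match your data):
library(dplyr)
library(tidyr)
library(ggplot2)

your_dataframe %>%
  mutate(fileSum = select(., file_A:file_D) %>% rowSums()) %>%
  arrange(desc(fileSum)) %>%
  head(50) %>%
  pivot_longer(file_A:file_D, names_to = "file", values_to = "value") %>%  # long format for ggplot
  ggplot(aes(x = file, y = pathway, fill = value)) +  # pathway is an assumed column name
  geom_tile()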
First, determine the maximum value of each row, then sort in descending order and pick the top 50. Then plot, e.g. using pheatmap.
library(pheatmap)
# toy example
df <- data.frame(iris[, 1:4], row.names=make.unique(as.character(iris$Species)))
# pick top 50 rows with highest values
top <- df[order(apply(df, 1, max), decreasing = TRUE)[1:50],]
# plot heatmap
pheatmap::pheatmap(top)
Edit:
If I misunderstood and you want the sums of the rows, then use
top <- df[order(rowSums(df), decreasing = TRUE)[1:50], ]
instead.
Edit #2:
If you want the top 50 for each column (i.e., each file), as suggested by dc37, then you can use
top <- df[unique(unlist(lapply(df, function(x) order(x, decreasing = TRUE)[1:50]))),]
instead.
Maybe I misunderstood your question, but my understanding is that you want to make a heatmap of the top 50 values of file_A, the top 50 of file_B, the top 50 of file_C, and the top 50 of file_D. Am I right?
If so, it means that you need not just 50 rows but potentially up to 200 (depending on whether the same row is in the top 50 for all files or only one).
Here is a dummy example of a large dataframe corresponding to yours:
row <- expand.grid(LETTERS, letters, LETTERS)
row$Row = paste(row$Var1, row$Var2, row$Var3, sep = "")
df <- data.frame(row = row$Row,
                 file_A = sample(10000:99000, nrow(row), replace = TRUE),
                 file_B = sample(10000:99000, nrow(row), replace = TRUE),
                 file_C = sample(10000:99000, nrow(row), replace = TRUE),
                 file_D = sample(10000:99000, nrow(row), replace = TRUE))
> head(df)
row file_A file_B file_C file_D
1 AaA 54418 65384 43526 86870
2 BaA 57098 75440 92820 27695
3 CaA 71172 59942 12626 53196
4 DaA 54976 25370 43797 30770
5 EaA 56631 73034 50746 77878
6 FaA 45245 57979 72878 94381
In order to get a heatmap using ggplot2, you need the following organization: one column for the x values, one column for the y values, and one column that serves as the categorical variable mapped to fill.
To get that, you need to reshape your dataframe into a longer format. You could use the pivot_longer function from the tidyr package (see the aside after the output below), but since you have thousands of rows, I would rather recommend data.table, which is faster for this kind of operation.
library(data.table)
DF <- melt(setDT(df), measure.vars = c("file_A", "file_B", "file_C", "file_D"),
           value.name = "Value", variable.name = "File")
row File Value
1: AaA file_A 54418
2: BaA file_A 57098
3: CaA file_A 71172
4: DaA file_A 54976
5: EaA file_A 56631
6: FaA file_A 45245
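As an aside, a tidyr sketch that produces the same long format (slower on very large data, but the result is equivalent):
library(tidyr)
DF <- pivot_longer(df, cols = file_A:file_D, names_to = "File", values_to = "Value")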
Now we can use dplyr to get only the top 50 values for each file:
library(dplyr)
Extract_DF <- DF %>%
  group_by(File) %>%
  arrange(desc(Value)) %>%
  slice(1:50)
# A tibble: 200 x 3
# Groups: File [4]
row File Value
<fct> <fct> <int>
1 PaH file_A 98999
2 RwX file_A 98996
3 JjQ file_A 98992
4 SfA file_A 98990
5 TrI file_A 98989
6 WgU file_A 98975
7 DnZ file_A 98969
8 TdK file_A 98965
9 YlS file_A 98954
10 FeZ file_A 98954
# … with 190 more rows
Now to plot this as a heatmap we can do:
library(ggplot2)
ggplot(Extract_DF, aes(y = row, x = File, fill = Value)) +
  geom_tile(color = "black") +
  scale_fill_gradient(low = "red", high = "green")
And you get:
[heatmap: one tile per row/File combination, filled by Value]
I intentionally left the y-axis labeling in place, even though it is not elegant, just so you can see how the graph is organized. All the white spots are rows that are in the top 50 for one column but not for the others.
If you are looking for only the top 50 values across all columns, you can use @Jon's answer and then the last part of my answer to get a heatmap with ggplot2.
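For example, a minimal sketch combining the two (top 50 rows by row sum, then the same reshape-and-plot steps as above; column names as in the dummy data):
top50 <- df[order(rowSums(df[, c("file_A", "file_B", "file_C", "file_D")]),
                  decreasing = TRUE)[1:50], ]
DF50 <- melt(setDT(top50), measure.vars = c("file_A", "file_B", "file_C", "file_D"),
             value.name = "Value", variable.name = "File")
ggplot(DF50, aes(y = row, x = File, fill = Value)) +
  geom_tile(color = "black") +
  scale_fill_gradient(low = "red", high = "green")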
Here is another approach using rank. I am using a matrix, but it should work just as easily on a data.frame. Using the volcano dataset, each column is reverse-ranked (i.e., the highest value gets the lowest reverse rank); values whose reverse rank is at most 50 are marked TRUE (1), all others FALSE (0). I include a plot of the scaled version of the matrix to show that the results correctly identify the highest values in each column.
# example data
M <- volcano
# for reference - each column is centered and scaled
Msc <- scale(M)
# return TRUE if the reverse rank is within the top 50 highest values
Ma <- apply(M, 2, function(x){
  ran <- length(x) - rank(x, ties.method = "average")
  ran <= 50
})
colSums(Ma)
png("tmp.png", width = 7.5, height = 2.5, units = "in", res = 400)
op <- par(mfcol = c(1,3), mar = c(1,1,1.5,1), oma = c(2,2,0,0))
image(M, xlab = "", ylab = "", xaxt = "n", yaxt = "n"); mtext("original")
image(Msc, xlab = "", ylab = "", xaxt = "n", yaxt = "n"); mtext("scaled")
image(Ma, xlab = "", ylab = "", xaxt = "n", yaxt = "n"); mtext("top 50 for each column")
mtext(text = "rows", side = 1, line = 0, outer = TRUE)
mtext(text = "columns", side = 2, line = 0, outer = TRUE)
par(op)
dev.off()
I have scanned the internet for days without luck looking for help on this issue. Any suggestions would be highly appreciated (especially in a tidyverse-friendly syntax)!
I have a tibble with approx. 4300 rows/observations and 320 columns. One column is my dependent variable, a continuous numeric column called "RR" (response ratios). My goal is to bin the RR values into 10 factor levels, for later use in machine-learning classification.
I have experimented with the cut() function with this code:
df <- era.af.Al_noNaN %>%
  rationalize() %>%
  drop_na(RR) %>%
  mutate(RR_MyQuantile = cut(RR,
                             breaks = unique(quantile(RR, probs = seq.int(0, 1, by = 1 / numbers_of_bins))),
                             include.lowest = TRUE))
But I have no luck: my bins come out with equal n in each, which does not reflect the distribution of the data. I have studied https://towardsdatascience.com/understanding-feature-engineering-part-1-continuous-numeric-data-da4e47099a7b a bit, but I simply cannot achieve the same in R.
[Image: distribution of my RR values grouped into equal-n classes, which is not what I want]
You can try hist() to get the breaks. It is meant for plotting histograms, but it also returns the associated data as a side effect. In the example below, plotting is suppressed with plot = FALSE to expose the breaks data, which is then used in cut(). This gives you cutoffs that maintain the distribution of the variable.
hist(iris$Sepal.Length, breaks = 5, plot = FALSE)
# $breaks
# [1] 4 5 6 7 8
#
# $counts
# [1] 32 57 49 12
#
# ...<omitted>
library(dplyr)

breaks <- hist(iris$Sepal.Length, breaks = 5, plot = FALSE)$breaks
dat <- iris %>%
  mutate(sepal_length_group = cut(Sepal.Length, breaks = breaks))
dat %>%
  count(sepal_length_group)
# sepal_length_group n
# 1 (4,5] 32
# 2 (5,6] 57
# 3 (6,7] 49
# 4 (7,8] 12
Thank you!
I also experimented with cut() and then count(). I then use labels = FALSE to get integer labels that can be used in a new mutate() call to build a column with character names for the interval groups (see the sketch after the code below).
numbers_of_bins = 10
df <- era.af.Al_noNaN %>%
  rationalize() %>%
  drop_na(RR) %>%
  mutate(RR_MyQuantile = cut(RR,
                             breaks = unique(quantile(RR, probs = seq.int(0, 1, by = 1 / numbers_of_bins))),
                             include.lowest = TRUE))
head(df$RR_MyQuantile, 10)

df %>%
  group_by(RR_MyQuantile) %>%
  count()
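For instance, a sketch of the labels = FALSE idea (RR_bin_name is a column name I made up for illustration):
df <- df %>%
  mutate(RR_bin_id = cut(RR,
                         breaks = unique(quantile(RR, probs = seq.int(0, 1, by = 1 / numbers_of_bins))),
                         include.lowest = TRUE,
                         labels = FALSE),           # integer codes 1..k instead of interval labels
         RR_bin_name = paste0("bin_", RR_bin_id))   # readable character names built from the codes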
I'm pretty new to R and I have a problem with plotting a barplot out of my data which looks like this:
condition answer
2 H
1 H
8 H
5 W
4 M
7 H
9 H
10 H
6 H
3 W
The data consists of 100 rows with the conditions 1 to 10, each randomly generated 10 times (10 rows for condition 1, 10 rows for condition 8, ...). Each condition also has an answer, which can be H for Hit, M for Miss, or W for Wrong.
I want to plot the number of hits for each condition in a barplot (for example, 8 hits out of 10 for condition 1). For that, I tried the following in ggplot2:
ggplot(data=test, aes(x=test$condition, fill=answer=="H"))+
geom_bar()+labs(x="Conditions", y="Hitrate")+
coord_cartesian(xlim = c(1:10), ylim = c(0:10))+
scale_x_continuous(breaks=seq(1,10,1))
And it looked like this:
[Image: stacked bar chart in which a red FALSE layer covers the blue TRUE bars]
This is actually exactly what I need, except for the red color, which covers everything. You can see that conditions 3 to 5 have no blue bar, because there are no hits for these conditions.
Is there any way to get rid of the red color and to count the number of hits for each condition? I tried dplyr's count function, but it only reports the number of H's when there are some for a given condition; conditions 3 to 5 were simply ignored by count, without even a 0 in the output, and I still need those numbers for the plot.
I'm sorry for this particularly long post, but I'm really at the end of my knowledge here. I'd be open to suggestions or alternatives! Thanks in advance!
This is a situation where a little preprocessing goes a long way. I made sample data that would recreate the issue, i.e. has cases where there won't be any "H"s.
Instead of relying on ggplot to aggregate data in the way you want it, use proper tools. Since you mention dplyr::count, I use dplyr functions.
The preprocessing task is to count observations with answer "H", including cases where the count is 0. To make sure all combinations are retained, convert condition to a factor and set .drop = F in count, which is in turn passed to group_by.
library(dplyr)
library(ggplot2)
set.seed(529)
test <- data.frame(condition = rep(1:10, times = 10),
                   answer = c(sample(c("H", "M", "W"), 50, replace = T),
                              sample(c("M", "W"), 50, replace = T)))

hit_counts <- test %>%
  mutate(condition = as.factor(condition)) %>%
  filter(answer == "H") %>%
  count(condition, .drop = F)
hit_counts
#> # A tibble: 10 x 2
#> condition n
#> <fct> <int>
#> 1 1 0
#> 2 2 1
#> 3 3 4
#> 4 4 2
#> 5 5 3
#> 6 6 0
#> 7 7 3
#> 8 8 2
#> 9 9 1
#> 10 10 1
Then just plot that. geom_col is the version of geom_bar for when you already have your y values, instead of having ggplot tally them up for you.
ggplot(hit_counts, aes(x = condition, y = n)) +
geom_col()
One option is to just filter out anything but where answer == "H" from your dataset, and then plot.
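A minimal sketch of that filter-first route, using the test data from the question (note that conditions with zero hits will simply be absent from this plot):
test %>%
  filter(answer == "H") %>%
  ggplot(aes(x = condition)) +
  geom_bar() +
  labs(x = "Conditions", y = "Hits")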
An alternative is to use a grouped bar plot, made by setting position = "dodge":
test <- data.frame(condition = rep(1:10, each = 10),
answer = sample(c('H', 'M', 'W'), 100, replace = T))
ggplot(data = test) +
  geom_bar(aes(x = condition, fill = answer), position = "dodge") +
  labs(x = "Conditions", y = "Hitrate") +
  coord_cartesian(xlim = c(1:10), ylim = c(0:10)) +
  scale_x_continuous(breaks = seq(1, 10, 1))
Also note that if the condition is actually a categorical variable, it may be better to make it a factor:
test$condition <- as.factor(test$condition)
This means that you don't need the scale_x_continuous call, and that the grid lines will be cleaner.
Another option is to pick your fill colors explicitly and make FALSE transparent by using scale_fill_manual. Since FALSE comes alphabetically first, the first value to specify is FALSE, the second TRUE.
ggplot(data = test, aes(x = condition, fill = answer == "H")) +
  geom_bar() +
  labs(x = "Conditions", y = "Hitrate") +
  coord_cartesian(xlim = c(1:10), ylim = c(0:10)) +
  scale_x_continuous(breaks = seq(1, 10, 1)) +
  scale_fill_manual(values = c(alpha("red", 0), "cadetblue")) +
  guides(fill = F)
I want to select the top 10 most-voted restaurants and plot them together, i.e. create a plot that shows the restaurant names and their votes.
I used:
topTenVotes <- top_n(dataSet, 10, Votes)
and it showed me all columns of the dataset for the rows with the top 10 highest votes; however, I want just the number of votes and the restaurant names.
My question is: how do I select only the top 10 highest vote counts and their restaurant names, and plot them together?
expected output:
Restaurant Names Votes
A 300
B 250
C 230
D 220
E 210
F 205
G 200
H 194
I 160
J 120
K 34
And then a bar plot that shows these restaurant names and their votes
Another simple approach with base functions, creating a named vector as an intermediate variable:
df <- data.frame(Names = LETTERS, Votes = sample(40:400, length(LETTERS)))
x <- df$Votes
names(x) <- df$Names # x <- setNames(df$Votes, df$Names) is another approach
barplot(sort(x, decreasing = TRUE)[1:10], xlab = "Restaurant Name", ylab = "Votes")
Or a one-line solution with base functions:
barplot(sort(xtabs(Votes ~ Names, df), decreasing = TRUE)[1:10], xlab = "Restaurant Names")
I'm not seeing a data set to use, so here's a minimal example to show how it might work:
library(tidyverse)
df <-
  tibble(
    restaurant = c("res1", "res2", "res3", "res4"),
    votes = c(2, 5, 8, 6)
  )

df %>%
  arrange(-votes) %>%
  head(3) %>%
  ggplot(aes(x = reorder(restaurant, votes), y = votes)) +
  geom_col() +
  coord_flip()
The top_n command also works in this case but is designed for grouped data.
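A sketch of the top_n route on the same toy df (with only 4 rows it returns everything, but on real data it keeps the 10 highest):
df %>%
  top_n(10, votes) %>%
  ggplot(aes(x = reorder(restaurant, votes), y = votes)) +
  geom_col() +
  coord_flip()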
It's also possible to do this with base functions; sorting with order() is both efficient and concise:
#toy data
d <- data.frame(Names = sample(LETTERS, size = 15), value = rnorm(n = 15, mean = 25, sd = 10))
head(d)
Names value
1 D 25.592749
2 B 28.362303
3 H 1.576343
4 L 28.718517
5 S 27.648078
6 Y 29.364797
#reorder by value and retain the top 10
newdata <- d[order(d$value, decreasing = TRUE)[1:10], ]
newdata
Names value
8 W 45.11330
13 K 36.50623
14 P 31.33122
15 T 30.28397
6 Y 29.36480
7 Q 29.29337
4 L 28.71852
10 Z 28.62501
2 B 28.36230
5 S 27.64808
First time question asker here. I wasn't able to find an answer to this question in other posts (love stackexchange, btw).
Anyway...
I'm creating a rarefaction curve via the vegan package, and I'm getting a very messy plot with a very thick black bar at the bottom that obscures some low-diversity sample lines.
Ideally, I would like to generate a plot with all of my lines (169; I could reduce this to 144) as a composite graph, coloring by sample year and using a different line type for each pond (i.e., 2 sample years: 2016 and 2017; 3 ponds: 1, 2, and 5). I've used phyloseq to create an object with all my data, then separated my OTU abundance table from my metadata into distinct objects (jt = OTU table and sampledata = metadata). My current code:
jt <- as.data.frame(t(j))  # transpose to make it compatible with the commands below
rarecurve(jt
          , step = 100
          , sample = 6000
          , main = "Alpha Rarefaction Curve"
          , cex = 0.2
          , color = sampledata$PondYear)
# A very small subset of the sample metadata
Pond Year
F16.5.d.1.1.R2 5 2016
F17.1.D.6.1.R1 1 2017
F16.1.D15.1.R3 1 2016
F17.2.D00.1.R2 2 2017
Here is an example of how to plot a rarefaction curve with ggplot. I used data available in the phyloseq package available from bioconductor.
To install phyloseq:
source('http://bioconductor.org/biocLite.R')
biocLite('phyloseq')
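(Note: on current Bioconductor releases biocLite has been retired; the equivalent install is:)
install.packages("BiocManager")
BiocManager::install("phyloseq")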
library(phyloseq)
Other libraries needed:
library(tidyverse)
library(vegan)
Data:
mothlist <- system.file("extdata", "esophagus.fn.list.gz", package = "phyloseq")
mothgroup <- system.file("extdata", "esophagus.good.groups.gz", package = "phyloseq")
mothtree <- system.file("extdata", "esophagus.tree.gz", package = "phyloseq")
cutoff <- "0.10"
esophman <- import_mothur(mothlist, mothgroup, mothtree, cutoff)
Extract the OTU table, transpose, and convert to a data frame:
otu <- otu_table(esophman)
otu <- as.data.frame(t(otu))
sample_names <- rownames(otu)
out <- rarecurve(otu, step = 5, sample = 6000, label = T)
Now you have a list in which each element corresponds to one sample.
Clean the list up a bit:
rare <- lapply(out, function(x){
  b <- as.data.frame(x)
  b <- data.frame(OTU = b[,1], raw.read = rownames(b))
  b$raw.read <- as.numeric(gsub("N", "", b$raw.read))
  return(b)
})
Label the list:
names(rare) <- sample_names
Convert to a data frame:
rare <- map_dfr(rare, function(x){
  z <- data.frame(x)
  return(z)
}, .id = "sample")
Let's see how it looks:
head(rare)
sample OTU raw.read
1 B 1.000000 1
2 B 5.977595 6
3 B 10.919090 11
4 B 15.826125 16
5 B 20.700279 21
6 B 25.543070 26
Plot with ggplot2:
ggplot(data = rare) +
  geom_line(aes(x = raw.read, y = OTU, color = sample)) +
  scale_x_continuous(labels = scales::scientific_format())
The vegan plot:
rarecurve(otu, step = 5, sample = 6000, label = T) #low step size because of low abundance
One can make an additional column of groupings and color according to that.
Here is an example of how to add another grouping. Let's assume you have a table of the form:
groupings <- data.frame(sample = c("B", "C", "D"),
                        location = c("one", "one", "two"), stringsAsFactors = F)
groupings
sample location
1 B one
2 C one
3 D two
where samples are grouped according to another feature. You could use lapply or map_dfr to go over groupings$sample and label rare$location.
rare <- map_dfr(groupings$sample, function(x){     # loop over samples
  z <- rare[rare$sample == x,]                     # subset rare according to sample
  loc <- groupings$location[groupings$sample == x] # subset groupings according to sample; if more than one grouping, repeat for all
  z <- data.frame(z, loc)                          # make a new data frame from the subsets
  return(z)
})
head(rare)
sample OTU raw.read loc
1 B 1.000000 1 one
2 B 5.977595 6 one
3 B 10.919090 11 one
4 B 15.826125 16 one
5 B 20.700279 21 one
6 B 25.543070 26 one
Let's make a decent plot out of this:
ggplot(data = rare) +
  geom_line(aes(x = raw.read, y = OTU, group = sample, color = loc)) +
  geom_text(data = rare %>%                       # here we need the coordinates of the labels
              group_by(sample) %>%                # first group by sample
              summarise(max_OTU = max(OTU),       # find the max OTU
                        max_raw = max(raw.read)), # find the max raw read
            aes(x = max_raw, y = max_OTU, label = sample),
            check_overlap = T, hjust = 0) +
  scale_x_continuous(labels = scales::scientific_format()) +
  theme_bw()
I know this is an older question but I originally came here for the same reason and along the way found out that in a recent (2021) update vegan has made this a LOT easier.
This is an absolutely bare-bones example.
Ultimately we're going to be plotting the final result in ggplot so you'll have full customization options, and this is a tidyverse solution with dplyr.
library(vegan)
library(dplyr)
library(ggplot2)
I'm going to use the dune data within vegan and generate a column of random metadata for the site.
data(dune)
metadata <- data.frame("Site" = as.factor(1:20),
                       "Vegetation" = rep(c("Cactus", "None")))
Now we will run rarecurve, but provide the argument tidy = TRUE which will export a dataframe rather than a plot.
One thing to note here is that I have also used the step argument. The default step is 1, and this means by default you will get one row per individual per sample in your dataset, which can make the resulting dataframe huge. Step = 1 for dune gave me over 600 rows. Reducing the step too much will make your curves blocky, so it will be a balance between step and resolution for a nice plot.
Then I piped a left join right into the rarecurve call:
dune_rare <- rarecurve(dune,
                       step = 2,
                       tidy = TRUE) %>%
  left_join(metadata)
Now it will be plottable in ggplot, with a color/colour call to whatever metadata you attached.
From here you can customize other aspects of the plot as well.
ggplot(dune_rare) +
  geom_line(aes(x = Sample, y = Species, group = Site, colour = Vegetation)) +
  theme_bw()
Good day!
I've got a table of two columns. In the first column (x) there are values that I want to divide into categories according to a specified bin width (in my case, 300). Then, using these categories, I want to sum the values in the other column (v). For instance, using my test data: the first category is from 65100 to 65400 (65100 < x <= 65400), and so on.
The desired result is a table of two columns: the first holds the categories of x; the second holds the sums of the corresponding values of v.
Thank you!!!
# data
set.seed(1)
x <- sample(seq(65100, 67900, by=5), 100, replace = TRUE)
v <- sample(seq(1000, 8000), 100, replace = TRUE)
tabl <- data.frame(x=c(x), v=c(v))
attach(tabl)
#categories
seq(((min(x) - min(x)%%300) + 300), ((max(x) - max(x)%%300) + 300), by =300)
I understood you want to:
Cut vector x,
Using pre-calculated cut-off thresholds
Compute sums over vector v using those groupings
This is one line of code with data.table and chaining. Your data are in a data.table named DT.
DT[, CUT := cut(x, breaks)][, sum(v), by=CUT]
Explanation:
First, assign cut-offs to variable breaks like so.
breaks <- seq(((min(x) - min(x) %% 300) + 300), ((max(x) - max(x) %% 300) + 300), by =300)
Second, compute a new column CUT to group rows by the data in breaks.
DT[, CUT := cut(x, breaks)]
Third, sum on column v in groups, using by=. I have chained this operation with the previous.
DT[, CUT := cut(x, breaks)][, sum(v), by=CUT]
Convert your data.frame to data.table like so.
library(data.table)
DT <- as.data.table(tabl)
This is the final result:
CUT V1
1: (6.57e+04,6.6e+04] 45493
2: (6.6e+04,6.63e+04] 77865
3: (6.66e+04,6.69e+04] 22893
4: (6.75e+04,6.78e+04] 61738
5: (6.54e+04,6.57e+04] 44805
6: (6.69e+04,6.72e+04] 64079
7: NA 33234
8: (6.72e+04,6.75e+04] 66517
9: (6.63e+04,6.66e+04] 43887
10: (6.78e+04,6.81e+04] 172
You can dress this up to improve aesthetics. For example, you can reset the factor levels for ease of reading.
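For example, a small sketch that stores the result and sorts it by the interval factor so the bins read in order (res is a name introduced here for illustration):
res <- DT[, CUT := cut(x, breaks)][, sum(v), by = CUT]
setorder(res, CUT)  # data.table sorts a factor column by its level order
res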
When I use dplyr, I am used to doing it like this, although I like the cut solution too.
# data
set.seed(1)
x <- sample(seq(65100, 67900, by=5), 100, replace = TRUE)
v <- sample(seq(1000, 8000), 100, replace = TRUE)
tabl <- data.frame(group=c(x), value=c(v))
attach(tabl)
#categories
s <- seq(((min(x) - min(x)%%300) + 300), ((max(x) - max(x)%%300) + 300), by =300)
tabl %>%
  rowwise() %>%
  mutate(g = s[min(which(group < s), na.rm = T)]) %>%
  ungroup() %>%
  group_by(g) %>%
  summarise(sumvalue = sum(value))
result:
g sumvalue
<dbl> <int>
65400 28552
65700 49487
66000 45493
66300 77865
66600 43887
66900 21187
67200 65785
67500 66517
67800 61738
68100 1722
Try this (no package needed):
s <- seq(65100, max(tabl$x) + 300, 300)
tabl$col <- cut(tabl$x, breaks = s, labels = 1:10)  # keep as a factor so the bins sort numerically in aggregate
df <- aggregate(v ~ col, tabl, sum)
# col v
# 1 1 33234
# 2 2 44805
# 3 3 45493
# 4 4 77865
# 5 5 43887
# 6 6 22893
# 7 7 64079
# 8 8 66517
# 9 9 61738
# 10 10 1722