I applied UMAP dimensionality reduction to my data and clustered the result. I got three different clusters:
I have a data frame that specifies which cluster each sample belongs to, with the sample names as row names. Here is a subsample of it; let's call it df_cluster:
structure(list(X1 = c(17.6942795910888, 16.5328416912875, 15.0031683863395,
16.3550118351627, 17.6931159161312, 16.9869249394253, 16.3790173297882,
15.8964870189374, 17.1055608092973, 16.4568632337052), X2 = c(-1.64953541728691,
0.185674946464158, -1.38521677790428, -0.448487127519734, -1.63670327964466,
-0.456667476792068, -0.091689040488956, -1.77486494294163, -1.86407675524967,
0.14666260432486), cluster = c(1L, 2L, 2L, 1L, 2L, 1L, 3L, 3L,
1L, 3L)), row.names = c("Patient1", "Patient13", "Patient2", "Patient99",
"Patient10", "Patient43", "Patient167", "Patient8", "Patient17", "Patient16"
), class = "data.frame")
The samples in df_cluster are the same as in the original data, data, which I used for the clustering. It is simply a matrix with the samples as rows and the features as columns, and looks something like this:
structure(c(-0.0741098696855045, -0.094401270881699, 0.0410284948786532,
-0.163302950330185, -0.0942478217207681, -0.167314411991775,
-0.118272811489486, -0.0366277340916379, -0.0349008907108641,
-0.167823357941815, -0.178835447722468, -0.253897294559596, -0.0372301980787381,
-0.230579110769457, -0.224125346052727, -0.196933050675633, -0.344608041139497,
-0.0550538743643369, -0.157003425700701, -0.162295446209879,
-0.0384421660291032, -0.0275306107582565, 0.186447606591857,
-0.124972070102036, -0.15348122673842, -0.106812144494277, -0.104757782473888,
0.0686746776877563, -0.0662055287009653, 0.00388752358937872), dim = c(10L,
3L), dimnames = list(c("Patient1", "Patient13", "Patient2", "Patient99",
"Patient10", "Patient43", "Patient167", "Patient8", "Patient17", "Patient16"
), c("Feature1", "Feature2",
"Feature3")))
I just want to view each of those features (the columns of data) in each cluster, using a box plot or a violin plot, as a comparison between the clusters.
The x-axis would show clusters 1, 2, and 3, and the y-axis the values. Each feature gets its own plot. I've drawn an example by hand to make it clearer:
You could use facets.
But first you need to pivot the data frame to long format, so that each row holds one patient, one feature name, and one value.
df_cluster <- structure(list(X1 = c(17.6942795910888, 16.5328416912875, 15.0031683863395,
16.3550118351627, 17.6931159161312, 16.9869249394253, 16.3790173297882,
15.8964870189374, 17.1055608092973, 16.4568632337052), X2 = c(-1.64953541728691,
0.185674946464158, -1.38521677790428, -0.448487127519734, -1.63670327964466,
-0.456667476792068, -0.091689040488956, -1.77486494294163, -1.86407675524967,
0.14666260432486), cluster = c(1L, 2L, 2L, 1L, 2L, 1L, 3L, 3L,
1L, 3L)), row.names = c("Patient1", "Patient13", "Patient2", "Patient99",
"Patient10", "Patient43", "Patient167", "Patient8", "Patient17", "Patient16"
), class = "data.frame")
data <- structure(c(-0.0741098696855045, -0.094401270881699, 0.0410284948786532,
-0.163302950330185, -0.0942478217207681, -0.167314411991775,
-0.118272811489486, -0.0366277340916379, -0.0349008907108641,
-0.167823357941815, -0.178835447722468, -0.253897294559596, -0.0372301980787381,
-0.230579110769457, -0.224125346052727, -0.196933050675633, -0.344608041139497,
-0.0550538743643369, -0.157003425700701, -0.162295446209879,
-0.0384421660291032, -0.0275306107582565, 0.186447606591857,
-0.124972070102036, -0.15348122673842, -0.106812144494277, -0.104757782473888,
0.0686746776877563, -0.0662055287009653, 0.00388752358937872), dim = c(10L,
3L), dimnames = list(c("Patient1", "Patient13", "Patient2", "Patient99",
"Patient10", "Patient43", "Patient167", "Patient8", "Patient17", "Patient16"
), c("Feature1", "Feature2",
"Feature3")))
library(tidyverse)
data %>%
  as.data.frame() %>%
  rownames_to_column("Patient") %>%
  # Attach each patient's cluster label
  left_join(df_cluster %>% rownames_to_column("Patient") %>% select(Patient, cluster),
            by = "Patient") %>%
  pivot_longer(-c(cluster, Patient)) %>%  # Pivot to long format: columns `name` and `value`
  ggplot(aes(as.factor(cluster), value)) +
  geom_boxplot() +
  facet_grid(~ name)  # One panel per feature
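Since violin plots were also mentioned, here is a minimal sketch of the same pivot-then-facet idea with geom_violin(). The toy matrix `feat`, the `clusters` data frame and the seed are made up for illustration (they are not the data above), and `scales = "free_y"` is an optional choice that gives each feature its own y range.

```r
library(dplyr)
library(tidyr)
library(tibble)
library(ggplot2)

set.seed(1)  # arbitrary seed, only so the toy data is reproducible

# Hypothetical stand-ins for `data` and `df_cluster`
patients <- paste0("Patient", 1:30)
feat <- matrix(rnorm(90), nrow = 30,
               dimnames = list(patients, paste0("Feature", 1:3)))
clusters <- data.frame(cluster = rep(1:3, each = 10), row.names = patients)

# One row per patient and feature
long <- feat %>%
  as.data.frame() %>%
  rownames_to_column("Patient") %>%
  left_join(rownames_to_column(clusters, "Patient"), by = "Patient") %>%
  pivot_longer(-c(Patient, cluster), names_to = "feature", values_to = "value")

p <- ggplot(long, aes(factor(cluster), value)) +
  geom_violin() +
  facet_wrap(~ feature, scales = "free_y") +
  labs(x = "cluster")
p
```

Swapping geom_violin() back to geom_boxplot() gives the box plot version with the same layout.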
Here is what it looks like after those edits - lines but no boxes.
Reproducible code:
df <- data.frame(SampleID = structure(c(5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L),
.Label = c("C004", "C005", "C007", "C009", "C010",
"C011", "C013", "C027", "C028", "C029",
"C030", "C031", "C032", "C033", "C034",
"C035", "C036", "C042", "C043", "C044",
"C045", "C046", "C047", "C048", "C049",
"C058", "C086"), class = "factor"),
Sequencing.Depth = c(1L, 2612L, 5223L, 7834L, 10445L, 13056L, 15667L, 18278L,
20889L, 23500L),
Observed.OTUs = c(1, 213, 289.5, 338, 377.8, 408.9, 434.4, 453.8, 472.1, NA),
Mange = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
.Label = c("N", "Y"), class = "factor"),
SpeciesCode = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L),
.Label = c("Cla", "Ucin", "Vvu"), class = "factor"))
In your aes, you can group by the interaction of your x values and your categorical variable to draw boxplots on a continuous x axis, and pass position = "identity" so that the boxes sit exactly at their x values instead of being dodged.
To add the line connecting the boxplots, I calculate the mean per species and x value using dplyr directly inside the ggplot call, but you could also calculate it beforehand and pass a second data frame.
Because your x values are spread widely, from 1 to 23500, you will also have to set the width of geom_boxplot in order to see a box and not a single line:
library(ggplot2)
library(dplyr)
ggplot(df,aes(x = Xvalues, y = Yvalues, color = Species,
group = interaction(Species, Xvalues)))+
geom_boxplot(position = "identity", width = 1000)+
geom_line(data = df %>%
group_by(Xvalues, Species) %>%
summarise(Mean = mean(Yvalues)),
aes(x = Xvalues, y = Mean,
color = Species, group = Species))
Applied to your dataset (based on the information in your code), you should try something like:
library(ggplot2)
library(dplyr)
ggplot(observedotusrare,
aes(x=Sequencing.Depth, y=Observed.OTUs,
color=SpeciesCode,
group = interaction(Sequencing.Depth, SpeciesCode))) +
geom_boxplot(position = "identity", width = 1000) +
geom_line(data = observedotusrare %>%
group_by(Sequencing.Depth, SpeciesCode) %>%
summarise(Mean = mean(Observed.OTUs, na.rm = TRUE)),
aes(x = Sequencing.Depth, y = Mean,
color = SpeciesCode, group = SpeciesCode))
Does that answer your question?
Reproducible example
df <- data.frame(Xvalues = rep(c(10,2000,23500), each = 30),
Species = rep(rep(LETTERS[1:3], each = 10),3),
Yvalues = c(rnorm(10,1,1),
rnorm(10,5,1),
rnorm(10,8,1),
rnorm(10,5,1),
rnorm(10,8,1),
rnorm(10,12,1),
rnorm(10,20,1),
rnorm(10,30,1),
rnorm(10,50,1)))
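The width of 1000 used above is specific to an x range of roughly 1 to 23500. A possible variation, sketched here on simulated data, derives the width from the data instead. The 5% fraction, the seed and the simulated `Yvalues` are arbitrary assumptions, not something taken from the question.

```r
library(ggplot2)
library(dplyr)

set.seed(42)  # arbitrary seed for the simulated data
df <- data.frame(Xvalues = rep(c(10, 2000, 23500), each = 30),
                 Species = rep(rep(LETTERS[1:3], each = 10), 3),
                 Yvalues = rnorm(90, mean = 10, sd = 2))

# Box width as a fraction of the x range (5% is an arbitrary choice)
box_width <- 0.05 * diff(range(df$Xvalues))

# One mean per Species and x position, used for the connecting line
means <- df %>%
  group_by(Xvalues, Species) %>%
  summarise(Mean = mean(Yvalues), .groups = "drop")

p <- ggplot(df, aes(x = Xvalues, y = Yvalues, color = Species,
                    group = interaction(Species, Xvalues))) +
  geom_boxplot(position = "identity", width = box_width) +
  geom_line(data = means,
            aes(x = Xvalues, y = Mean, color = Species, group = Species))
p
```

Computing `means` outside the plot call also makes it easy to inspect the values that the line connects.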
mydat=structure(list(date = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L), .Label = c("01.01.2018", "02.01.2018"), class = "factor"),
x = structure(c(2L, 2L, 2L, 3L, 1L, 1L, 1L, 1L, 1L), .Label = c("e",
"q", "w"), class = "factor"), y = structure(c(2L, 2L, 2L,
3L, 1L, 1L, 1L, 1L, 1L), .Label = c("e", "q", "w"), class = "factor")), .Names = c("date",
"x", "y"), class = "data.frame", row.names = c(NA, -9L))
As we can see, x and y are grouping variables (the only group categories are q-q, w-w and e-e).
For 1 January:
q-q: count 3
w-w: count 1
Then for 2 January:
e-e: count 5
How can I display the count of each category in a graph like this? The dataset is large, so I need the graph for the month of January, with the plot showing the number of sold categories per day.
I didn't find your question entirely clear, but maybe this could help:
library(lubridate) # parse dates
library(tidyverse) # manipulate data and plot
# your data
mydat %>%
  mutate(group = paste(x, y, sep = '-'),            # category label, e.g. "q-q"
         day = day(dmy(as.character(date)))) %>%    # day of the month
  count(group, day, name = "cnt") %>%               # number of rows per category and day
  # now the plot (I'm not sure this is exactly what you are asking for)
  ggplot(aes(x = as.factor(day), y = cnt, group = group, fill = group, label = group)) +
  geom_bar(stat = 'identity', position = 'dodge') +
  geom_label(position = position_dodge(width = 1)) +
  theme(legend.position = "none")
Your question isn't as clear as I'd like, but I think you want to find how many of each group there are on each day, right?
You can use group_by from the dplyr package.
I created a new variable called group, which concatenates x and y.
library(dplyr)
mydata <- mydat %>%
  mutate(group = paste(x, y, sep = '-')) %>%
  group_by(date, group) %>%
  summarise(qtd = n())  # number of rows per date and group
Result:
date group qtd
01.01.2018 q-q 3
01.01.2018 w-w 1
02.01.2018 e-e 5
You can then plot this with the ggplot2 package, using facet_wrap to separate the plots by date:
ggplot(data = mydata, aes(x = group, y = qtd)) +
geom_bar(stat = 'identity') +
facet_wrap(~date)
Alternatively, you can map date to fill instead of faceting. That is sometimes better, especially if you have a lot of dates.
Code:
ggplot(data = mydata, aes(x = group, y = qtd, fill = date)) +
geom_bar(stat = 'identity')
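Since the question asks for a plot covering the month of January, one possible variation is to parse the dates into real Date values, keep only that month, and let count() do the tallying. This is only a sketch: the data frame below rewrites the nine sample rows as plain character columns, and the "2018-01" filter and one-day axis breaks are assumptions about what is wanted.

```r
library(dplyr)
library(ggplot2)

# The same nine rows as `mydat` above, written as plain character columns
mydat <- data.frame(
  date = c(rep("01.01.2018", 4), rep("02.01.2018", 5)),
  x = c("q", "q", "q", "w", rep("e", 5)),
  y = c("q", "q", "q", "w", rep("e", 5))
)

counts <- mydat %>%
  mutate(group = paste(x, y, sep = "-"),
         date = as.Date(date, format = "%d.%m.%Y")) %>%
  filter(format(date, "%Y-%m") == "2018-01") %>%   # keep January 2018 only
  count(date, group, name = "qtd")                 # rows per day and category

p <- ggplot(counts, aes(x = date, y = qtd, fill = group)) +
  geom_col(position = "dodge") +
  scale_x_date(date_breaks = "1 day", date_labels = "%d %b")
p
```

With a real date axis, days without any sales simply show up as gaps instead of being dropped from the axis.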
Good luck!
I'm having some trouble setting readable tick marks on my axes. The problem is that my data are at different magnitudes, so I'm not really sure how to go about it.
My data include ~400 different products, with 3/4 variables each, from two machines. I've pre-processed it into a data.table and used gather to convert it to long form; that part is fine.
Overview: Data is discrete, each X_________ on the x-axis represents a separate reading, and its relative values from machine 1/2 - the idea is to compare the two. The graphical format is perfect for my needs, I would just like to set the ticks at say, every 10 products on the x-axes, and at reasonable values on the y-axis.
Y_1: from 150 to 250
Y_2: from say, 1.5* to 2.5
Y_3: from say, 0.8* to 2.3
Y_4: from say, 0.4* to 1.5
*Bottom value, rounded down
Here's the code I'm using so far
var.Parameter <- c("Var1", "Var2", "Var3", "Var4")
MProduct$Parameter <- factor(MProduct$Parameter,
labels = var.Parameter)
labels_x <- MProduct$Lot[seq(0, 1626, by= 20)]
labels_y <- MProduct$Value[seq(0, 1626, by= 15)]
plot.MProduct <- ggplot(MProduct, aes(x = Lot,
y = Value,
colour = V4)) +
facet_grid(Parameter ~.,
scales = "free_y") +
scale_x_discrete(breaks=labels_x) +
scale_y_discrete(breaks=labels_y) +
geom_point() +
labs(title = "Product: Select Trends | 2018",
x = "Time (s)",
y = "Value") +
theme(axis.text.x = element_text (angle = 90,
hjust = 1,
vjust = 0.5))
# ggsave("MProduct.png")
plot.MProduct
Does anyone know how to make this graph more readable? Setting labels/breaks manually greatly limits flexibility and readability. There should be an option to set a tick at every Nth value, right? Same for y.
I need to apply this as a function to multiple datasets, so I'm not very happy about having to specify the column length of the "gathered" dataset every time either, which, in this case is 1626.
Since I'm here, I would also like to take the opportunity to ask about this code:
var.Parameter <- c("Var1", "Var2", "Var3", "Var4")
More often than not, I need to label my data in a specific order, which is not necessarily alphabetical. R, however, defaults to some kind of odd behaviour whereupon I have to plot and verify that the labels are indeed where they should be. Any clue how I could force them to be presented in order? As it is, my solution is to keep shifting their position in that line of code until it produces the graph correctly.
Many thanks.
Okay. I'm going to ignore the y-axis labels, because the defaults seem to work just fine as long as you don't overwrite them with your custom labels_y. Just let the defaults do their work. For the x-axis, here are a few options:
(A) Label every Nth product on the x-axis. Looking at ?scale_x_discrete, we can set labels to a function that takes all the levels of the factor and returns the labels we want. So we'll write a function factory: a function that returns a function that keeps every Nth label and blanks out the rest:
every_n_labeler = function(n = 3) {
function (x) {
ind = ((1:length(x)) - 1) %% n == 0
x[!ind] = ""
return(x)
}
}
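To see what such a labeller returns before handing it to a scale, it can be called directly on a character vector. A quick check, with the function repeated so the snippet runs on its own and with made-up lot names:

```r
every_n_labeler <- function(n = 3) {
  function(x) {
    keep <- (seq_along(x) - 1) %% n == 0   # TRUE for the 1st, (n+1)th, ... label
    x[!keep] <- ""                         # blank out everything else
    x
  }
}

lots <- paste0("X", 1:7)                   # made-up lot names
labelled <- every_n_labeler(3)(lots)
print(labelled)
# "X1" ""   ""   "X4" ""   ""   "X7"
```

The blanked labels keep their tick positions, so the points stay where they are and only the text is thinned out.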
Now let's use that as the labeler:
ggplot(df, aes(x = Lot,
y = Value,
colour = Machine)) +
facet_grid(Parameter ~ .,
scales = "free_y") +
geom_point() +
scale_x_discrete(labels = every_n_labeler(3)) +
labs(title = "Product: Select Trends | 2018",
x = "Time (s)",
y = "Value") +
theme(axis.text.x = element_text (
angle = 90,
hjust = 1,
vjust = 0.5
))
You can change every_n_labeler(3) to every_n_labeler(10) to label every 10th product.
(B) Maybe more appropriate: it seems like your x-axis is actually numeric and just happens to have an "X" in front of it. Let's convert it to numeric and let the defaults do the labeling work:
df$time = as.numeric(gsub(pattern = "X", replacement = "", x = df$Lot))
ggplot(df, aes(x = time,
y = Value,
colour = Machine)) +
facet_grid(Parameter ~ .,
scales = "free_y") +
geom_point() +
labs(title = "Product: Select Trends | 2018",
x = "Time (s)",
y = "Value") +
theme(axis.text.x = element_text (
angle = 90,
hjust = 1,
vjust = 0.5
))
With your full x range, I imagine that would look nice.
(C) But who wants to read those 9-digit numbers? You're labeling the x-axis "Time (s)", which makes me think it's actually a time, measured in seconds from some start time. I'll make up a start time of 2010-01-01 and convert these seconds to actual times, and then we get a nice date-time scale:
ggplot(df, aes(x = as.POSIXct(time, origin = "2010-01-01"),
y = Value,
colour = Machine)) +
facet_grid(Parameter ~ .,
scales = "free_y") +
geom_point() +
labs(title = "Product: Select Trends | 2018",
x = "Time (s)",
y = "Value") +
theme(axis.text.x = element_text (
angle = 90,
hjust = 1,
vjust = 0.5
))
If this is the real meaning behind your data, then using a date-time axis is a big step up for readability. (Again, notice that we are not specifying the breaks, the defaults work quite well.)
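The conversion behind options (B) and (C) can be checked on its own before plotting. A minimal sketch on the four lot names from the sample data; the 2010-01-01 origin is the made-up start time from above, and fixing the time zone to UTC is an extra assumption so that the result does not depend on the local machine:

```r
lot <- c("X180106482", "X180126485", "X180306523", "X180526326")

# (B) strip the leading "X" and treat the rest as a number of seconds
secs <- as.numeric(sub("^X", "", lot))

# (C) interpret those seconds as an offset from the assumed start time
when <- as.POSIXct(secs, origin = "2010-01-01", tz = "UTC")
print(when)
```

If the real start time is known, only the `origin` (and possibly `tz`) needs to change.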
Using this data (I subset your sample data down to 2 facets and used dput to make it copy/pasteable):
df = structure(list(Lot = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 1L, 2L, 3L, 4L, 1L,
2L, 3L, 4L, 1L), .Label = c("X180106482", "X180126485", "X180306523",
"X180526326"), class = "factor"), Value = c(201, 156, 253, 211,
178, 202.5, 203.4, 204.3, 205.2, 2.02, 2.17, 1.23, 1.28, 1.54,
1.28, 1.45, 1.61, 2.35, 1.34, 1.36, 1.67, 2.01, 2.06, 2.07, 2.19,
1.44, 2.19), Parameter = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("Var 1", "Var 2", "Var 3", "Var 4"
), class = "factor"), Machine = structure(c(2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Machine 1", "Machine 2"), class = "factor"),
time = c(180106482, 180126485, 180306523, 180526326, 180106482,
180126485, 180306523, 180526326, 180106482, 180106482, 180126485,
180306523, 180526326, 180106482, 180126485, 180306523, 180526326,
180106482, 180106482, 180126485, 180306523, 180526326, 180106482,
180126485, 180306523, 180526326, 180106482)), row.names = c(NA,
-27L), class = "data.frame")
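On the side question about label order: factor() sorts its levels alphabetically unless levels is given, and labels are matched to the levels by position, which is probably why the labels seem to move around. A minimal sketch of pinning the order explicitly; the raw codes `p_temp`, `p_flow` and so on are hypothetical and not taken from the question's data:

```r
raw <- c("p_temp", "p_flow", "p_volt", "p_amps", "p_flow")   # hypothetical codes

# Default: levels are sorted, so labels would land on
# p_amps, p_flow, p_temp, p_volt in that order
print(levels(factor(raw)))

# Explicit levels fix the order; labels are then matched position by position
param <- factor(raw,
                levels = c("p_temp", "p_flow", "p_volt", "p_amps"),
                labels = c("Var1", "Var2", "Var3", "Var4"))
print(levels(param))
print(as.character(param))
```

facet_grid lays out its panels in the order of the factor levels, so fixing the levels once fixes the panel order too.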
I'm trying to produce a point plot that reorders my values and also has a mean line over the values.
I can produce the plot with the mean line, or with the reordered values, but not both at the same time, because I get the error
"geom_path: Each group consists of only one observation. Do you need to adjust the group aesthetic?".
I believe I am getting the error because some of my data has only one observation, but I don't understand why this only becomes an issue with the reordered data.
In the end, all I want is to be able to show the means of the two different value groups for each x value.
Here is my sample code
library(ggplot2)
typ <- c("T", "N", "T", "T", "N")
samplenum <- c(7,7,6,8,8)
values <- c(1,2,1,3,2)
df = data.frame(typ, samplenum, values)
d <- ggplot(df, aes(x= reorder(samplenum, values), y= values))
d <- d + geom_point(position=position_jitter(width=0.15, height=0.05))
d <- d + aes(colour = factor(df$typ))
d <- d + stat_summary(fun.y = mean, geom="line")
d
Thank you for the help in advance.
This is what I am going for
Here are some intermediate pictures of what I have produced from my larger data set.
With Line but Not Reordered
Reordered but No Mean Line
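As the error message suggests, you need to adjust the group aesthetic. When you use reorder, x becomes a discrete scale, and by default every combination of discrete variables (here each x value within each typ) forms its own group. Each group then contains a single summary point, so no line can be drawn. You want the lines to connect across x values within each typ, so set group explicitly.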
You can try this
ggplot(df, aes(x = reorder(samplenum, values), y = values, colour = factor(typ))) +
  geom_jitter(width = 0.15, height = 0.05) +
  # `fun` replaces the deprecated `fun.y` in ggplot2 >= 3.3.0
  stat_summary(fun = mean, geom = "line", aes(group = factor(typ)))
(I altered your data slightly so that it contains more observations.)
data
df <- structure(list(typ = structure(c(2L, 1L, 2L, 2L, 1L, 2L, 1L,
2L, 2L, 1L, 2L, 1L, 2L, 2L, 1L), .Label = c("N", "T"), class = "factor"),
samplenum = c(7, 7, 6, 8, 8, 7, 7, 6, 8, 8, 7, 7, 6, 8, 8
), values = c(1L, 3L, 2L, 1L, 3L, 3L, 1L, 3L, 2L, 2L, 2L,
1L, 3L, 1L, 2L)), .Names = c("typ", "samplenum", "values"
), row.names = c(NA, -15L), class = "data.frame")
The resulting plot with your input data
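An alternative to stat_summary, sketched below, is to compute the means explicitly and draw them with geom_line, which makes it easier to see which values the line connects. The data frame repeats the altered data from above in a compact form; the helper names `sample_f` and `means` are made up for this sketch.

```r
library(ggplot2)
library(dplyr)

df <- data.frame(
  typ = rep(c("T", "N", "T", "T", "N"), 3),
  samplenum = rep(c(7, 7, 6, 8, 8), 3),
  values = c(1, 3, 2, 1, 3, 3, 1, 3, 2, 2, 2, 1, 3, 1, 2)
)

# Fix the x order once, so points and lines share the same discrete scale
ord <- levels(reorder(df$samplenum, df$values))
df$sample_f <- factor(df$samplenum, levels = ord)

# One mean per typ and sample
means <- df %>%
  group_by(typ, sample_f) %>%
  summarise(mean_value = mean(values), .groups = "drop")

p <- ggplot(df, aes(x = sample_f, y = values, colour = typ)) +
  geom_jitter(width = 0.15, height = 0.05) +
  geom_line(data = means, aes(y = mean_value, group = typ))
p
```

As with the stat_summary version, the explicit `group = typ` is what lets the line run across the discrete x values.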