R stairstep without upwards line after last data point - r

I'm plotting a cumulative step function, and I want to stop the line from jumping up after the last row in the dataset. This happens in both base R and ggplot2.
Is there a way to do it without specifying xlim to exclude the jump upwards?
data = data.frame(V1 = c(-0.1, 0, 0, 1, 1.1), V2 = c(0, 0, 0.7, 0.3, 0.3))
base R
plot(data$V1, cumsum(data$V2), type="s")
ggplot2
ggplot(data, aes(x=V1, y=cumsum(V2))) +
geom_step()

The step function is behaving correctly: sum(data$V2) is 1.3, which is identical to tail(cumsum(data$V2), 1), and that is where your line ends. However, if you insist on not drawing the last line segment, you can set the last value of data$V2 to 0. Example below:
library(ggplot2)
data = data.frame(V1 = c(-0.1, 0, 0, 1, 1.1), V2 = c(0, 0, 0.7, 0.3, 0.3))
ggplot(data, aes(x = V1, y = cumsum(c(head(V2, -1), 0)))) +
geom_step()
Note that the example doesn't generalise to multiple groups; pre-processing the data should help then.
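For the multi-group case, one possible way to do that pre-processing is sketched below. The `grp` column and the second group's values are made up for illustration; the idea is just to zero the last increment within each group before accumulating, using base R's `ave()`.

```r
# Hypothetical two-group data; the `grp` column and group "b" values are
# assumptions made for this illustration.
data <- data.frame(
  grp = rep(c("a", "b"), each = 5),
  V1  = rep(c(-0.1, 0, 0, 1, 1.1), times = 2),
  V2  = c(0, 0, 0.7, 0.3, 0.3,
          0, 0.2, 0.4, 0.1, 0.5)
)

# Within each group, replace the last increment with 0 before accumulating,
# so the final step stays flat instead of jumping up.
data$cum <- ave(data$V2, data$grp,
                FUN = function(v) cumsum(c(head(v, -1), 0)))

# Draw one step line per group if ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data, aes(x = V1, y = cum, colour = grp)) +
    geom_step()
  print(p)
}
```

Computing the column up front keeps the aesthetic mapping simple and avoids calling cumsum() inside aes(), which would accumulate across all rows regardless of group.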

Related

R code for plotting multiple line segments with unique R ranges

I know there are many many questions on here around plotting multiple lines in a graph in R, but I've been struggling with a more specific task. I would like to add multiple line segments to a graph using only the intercept and slope specified for each line. abline() would work great for this, except each line has a specific range on the X axis, and I do not want the line plotted beyond the range.
I managed to get the graph I want using plotrix, but I am hoping to publish the work, and the graph does not look up to par (very basic). I am somewhat familiar with ggplot, and think that graphs generated in ggplot look much better than what I have made, especially with the various themes available, but I cannot figure out how to do something similar using ggplot.
Code:
library(plotrix)
plot(1, type="n", xlab="PM2.5(ug/m3)", ylab="LogRR Preeclampsia ", xlim=c(0, 20), ylim=c(-1, 2.5))
ablineclip(a = 0, b = 0.3, x1=1.2, x2=3)
ablineclip(a = 0, b = 0.08, x1=8.0, x2=13.1)
ablineclip(a = 0, b = 0.5, x1=10.1, x2=18.9)
ablineclip(a = 0, b = 0.12, x1=2.6, x2=14.1)
Any help would be appreciated!
Thank you.
You can write a basic function doing a bit of algebra to calculate the start/stop points for the line segments and then feed that into ggplot. For example
to_points <- function(intercept, slope, start, stop) {
data.frame(
segment = seq_along(start),
xstart = start,
xend = stop,
ystart = intercept + slope*start,
yend = intercept + slope*stop)
}
And then use that with
library(ggplot2)
segments <- to_points(0, c(0.3, 0.08, 0.5, .12),
c(1.2, 8.0, 10.1, 2.6),
c(3, 13.1, 18.9, 14.1))
ggplot(segments) +
aes(xstart, ystart, xend=xend, yend=yend) +
geom_segment() +
coord_cartesian(xlim=c(0,20), ylim=c(-1, 2.5)) +
labs(x="PM2.5(ug/m3)", y="LogRR Preeclampsia ")
That will produce the following plot
(Note the third segment is outside the region you specified. You can drop the coord_cartesian to see all the segments.)
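To check which segments fall outside a given y range before plotting, a small sketch like the following may help. It reuses to_points with the slopes and x ranges from the question; the `outside` column and the `ylim` variable are names introduced here only for illustration.

```r
# Same helper as in the answer: endpoints from intercept, slope and x range.
to_points <- function(intercept, slope, start, stop) {
  data.frame(
    segment = seq_along(start),
    xstart  = start,
    xend    = stop,
    ystart  = intercept + slope * start,
    yend    = intercept + slope * stop
  )
}

# Slopes and x ranges taken from the question's ablineclip() calls.
segs <- to_points(0, c(0.3, 0.08, 0.5, 0.12),
                  c(1.2, 8.0, 10.1, 2.6),
                  c(3, 13.1, 18.9, 14.1))

# Flag segments that lie entirely above or below the requested y range.
ylim <- c(-1, 2.5)
segs$outside <- pmin(segs$ystart, segs$yend) > ylim[2] |
                pmax(segs$ystart, segs$yend) < ylim[1]
print(segs)
```

With a slope of 0.5 over x from 10.1 to 18.9, the third segment runs from y = 5.05 to y = 9.45, which is why it is flagged.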

How to put labels between columns in a bar plot in R?

I'm a beginner with R and looking for help with plotting.
I would like to make a distribution plot in R that looks like a histogram of continuous data bucketed into columns with x-axis labels between each column to denote the range captured in each column.
Instead of continuous data though, I only have the bucketed counts. I can create a plot with barplot, however I can't find a way to label BETWEEN the columns to denote the range captured in each bar.
I've tried barplot but cannot get the labels to fall between columns instead of being treated as column labels and falling directly beneath each column.
dat = list()
dat$freq = c(5,15,20,10)
dat$mid = c(-1.5,-.5,.5,1.5) #midpoint in each bucketed range
dat$perc = dat$freq/sum(dat$freq)
barplot(dat$perc, names.arg = dat$mid)
Each column is labeled with the midpoint. I would instead like the labels to be -2,-1,0,1,2 BETWEEN the columns.
Thank you!
edit: dput(dat) outputs:
list(freq = c(5, 15, 20, 10), mid = c(-1.5, -0.5, 0.5, 1.5), perc =
c(0.1, 0.3, 0.4, 0.2))
Is this what you're after?
df <- data.frame(freq = c(5, 15, 20, 10), mid = c(-1.5, -0.5, 0.5, 1.5), perc = c(0.1, 0.3, 0.4, 0.2))
I'm using the awesome and highly customisable library ggplot2 to plot this, which renders the plot as I think you want it. You can install this with install.packages('ggplot2'):
# install.packages('ggplot2')
library(ggplot2)
p <- ggplot(df)
p <- p + geom_bar(aes(mid, perc), stat='identity')
p
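If you would rather stay with base barplot, as in the question, one possible approach is sketched below. It assumes the bars should touch (space = 0) and have unit width, so that the bar edges sit at known positions and the axis labels can be drawn there; the `edges` variable is a name introduced for this illustration.

```r
dat <- data.frame(freq = c(5, 15, 20, 10),
                  mid  = c(-1.5, -0.5, 0.5, 1.5))
dat$perc <- dat$freq / sum(dat$freq)

# With space = 0 and width = 1 the bars touch; barplot() returns the
# x positions of the bar midpoints.
mids <- as.numeric(barplot(dat$perc, space = 0, width = 1))

# Bar edges are half a bar width either side of the midpoints.
edges <- c(mids - 0.5, max(mids) + 0.5)

# Draw the bucket boundaries as tick labels between the bars.
axis(side = 1, at = edges, labels = -2:2)
```

The same idea carries over to ggplot2: use a bar width equal to the bucket width and set the axis breaks at the bucket boundaries rather than at the midpoints.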

Plotting violin plots: when I add a sample to be displayed, the violin plots no longer show up

I am trying to display my results using violin plot and box plot at the same time.
I am using cell count to display the number of immune cells in different cancer samples/groups. When I plot the expression for 4 samples, everything works. When I add another sample (GTEx_M2), the violin plots for all other 4 samples disappear and I end up with only the box plots.
Any suggestion? Thanks in advance!
library(ggplot2)
library(ggpubr)
Cibersort7 = structure(list(
Hot_M1 = c(0.0214400757119873, 0.170557805230298, 0.0804456569076382,
0.0893978598771954, 0.134477669028274, 0, 0.0525708788146097,
0.0511711964723951, 0.126904881120795, 0.0485101553521798,
0.170894800822398, 0.106555021195299, 0.0970104286070479,
0.115825265978309, 0.0427923320117795, 0.0733825856784013,
0.0111265771852828, 0.0657019859547462, 0.11656416302191,
0.172002238486688, 0.0154591596631105, 0.0350445248592811,
0.0795539781894198, 0.0781276090630857, 0.0087982313041526,
0.0289274652853823, 0.0712661645666698, 0.0435482190581647,
0.0455556872660798, 0.0871522448556361),
Cold_M1 = c(0.0346024087291239, 0.0201947741817111, 0.0306194109725081,
0.0277445612030966, 0.00905915199266666, 0.00939058305405205,
0.0146535473252646, 0.0159980760737253, 0.147670469457772,
0.0426119074182886, 0.0219251208462312, 0.0128996237306264,
0.0094816829459359, 0.0219336027293415, 0.0438220246067735,
0.00950926112282649, 0.0838386603270565, 0.0486661009213444,
0.00651564872414969, 0.00110323590537234, 0.0807125087307139, 0,
0.037709808301658, 0, 0.0898041410439557, 0.0417739517920607, 0,
0.0202168551193018, 0.00176008746063679, 0.0161337603014608),
Hotnorm_M1 = c(0.00622155478760928, 0.00864956989565159, 0.0245812979257332,
0.0339687958970202, 8e-04, 0, 0.0582086801600888, 0,
0.03481918582501, 0.021338008027511, 0.0157360408231509,
0.00489068636912568, 0.0281166183638247, 0.0162726467268935,
0.0415769266772567, 0, 0.00344830695596762, 0.00196737745405557,
0.0075141479562764, 0.0232464687737552, 0, 0, 0.0289423690350636,
0.0218584208695064, 0.0255945495324721, 4e-04, 0.0221942067802419,
0.00476738514342175, 0.00722699142988291, 0.00974645683928458),
Coldnorm_M1 = c(0.0280536098964266, 0.0261826834038114, 0.0150413750071331, 0,
0.0199730743908202, 0.0115748800373456, 0.0275674859254823,
0.0168847795974374, 0.0140281070945953, 0.00907861159279308,
0, 0, 0, 0.0453414461512909, 0, 0.00730963773612433,
0.0236424416792874, 0.0866914356225127, 0.0246339344582405,
0.00881531992455549, 0.0140744199322424, 0, 0, 0,
0.0319211626770028, 0.00155291355277603, 0.00295913497381517,
0.00738775271575955, 0.0179786878323852, 0.00442919920031897),
GTEx_M1 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0.00551740159760184, 0, 0, 0, 0, 0)),
row.names = c(NA, -30L),
class = c("tbl_df", "tbl", "data.frame"))
This is a small part of my data that still shows the same issue I see.
y_axis = list(na.omit(Cibersort7$Hot_M1),
na.omit(Cibersort7$Cold_M1),
na.omit(Cibersort7$Hotnorm_M1),
na.omit(Cibersort7$Coldnorm_M1),
na.omit(Cibersort7$GTEx_M1))
groupname = groupexpression = data = violinPlot = pairwise_results = list(5)
for (i in 1:5){
groupname[[i]] = as.factor(colnames(Cibersort7[, i]))
groupexpression[[i]] = y_axis[[i]]
data[[i]] = data.frame("Sample" = groupname[[i]],
"Expression" = groupexpression[[i]])
}
dataframe = do.call(rbind, data)
dataframe$Sample = as.factor(dataframe$Sample)
my_comparisons = list(c("Hot_M1", "Cold_M1"),
c("Hot_M1", "Hotnorm_M1"),
c("Hot_M1", "GTEx_M1"),
c("Cold_M1", "Coldnorm_M1"),
c("Cold_M1", "GTEx_M1"))
violinPlot = ggplot(dataframe,
aes(x =Sample, y = Expression, fill = Sample)) +
geom_violin(trim = FALSE) +
geom_boxplot(width=0.1, fill="white") +
labs(title ="Distribution of M2 Macrophages",
x = "Tissue Samples", y = "Cibersort Count") +
theme_classic()
violinPlot
Here is what my violin plots look like:
Here is what they look like before adding the GTEx data:
And here are the GTEx violin plots when displayed alone:
I understand that my GTEx data is zero but why do the violin plots disappear?
geom_violin has an argument named scale, which takes on the default value "area". From ?geom_violin:
if "area" (default), all violins have the same area (before trimming
the tails). If "count", areas are scaled proportionally to the number
of observations. If "width", all violins have the same maximum width.
Since GTEx's Expression values are concentrated at 0, its density peaks sharply at that value. We can see it more obviously in a normal density plot, with each sample's line overlaid atop one another:
ggplot(dataframe,
aes(x = Expression, color = Sample)) +
geom_density() +
theme_classic()
With the default scale = "area" argument, including GTEx in the data means the violins for all other samples become a lot skinnier, and hence are almost completely covered by the boxplots. You'd still be able to see them if you comment out the boxplot layer.
You can set scale = "width" instead if you want comparable visibility between each violin. You may also want to highlight this to your target audience if you choose this option, as scale = "area" tends to be more common, and people may feel confused when some violins appear clearly larger than others.
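A rough way to see the size of this effect is to compare the peak heights of the kernel density estimates. The sketch below uses synthetic data (not the Cibersort values): one sample spread over a range, and one with nearly all values at zero, like GTEx. As far as I understand it, with scale = "area" the violin widths are relative to the tallest density in the panel, so a large ratio means the other violins get squeezed.

```r
set.seed(1)

# Synthetic stand-ins: a spread-out sample and one concentrated at zero.
spread       <- runif(30, min = 0, max = 0.17)
concentrated <- c(rep(0, 29), 0.0055)

# Peak height of the default kernel density estimate.
peak <- function(x) max(density(x)$y)

peak_spread       <- peak(spread)
peak_concentrated <- peak(concentrated)

# How much taller the concentrated sample's density peak is.
ratio <- peak_concentrated / peak_spread
print(ratio)
```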
ggplot(dataframe,
aes(x = Sample, y = Expression, fill = Sample)) +
geom_violin(trim = FALSE, scale = "width") +
geom_boxplot(width=0.1, fill="white") +
labs(title ="Distribution of M2 Macrophages",
x = "Tissue Samples", y = "Cibersort Count") +
theme_classic()
p.s. You can simplify your data processing steps, which are (from what I can tell) essentially a conversion from wide to long format. The usual way to do this is via melt (from reshape2 package) or gather (from tidyr package). Here's a possible implementation:
library(dplyr)
library(tidyr)
df2 <- Cibersort7 %>%
gather(Sample, Expression) %>%
mutate(Sample = factor(Sample, levels = colnames(Cibersort7)))
> all.equal(dataframe, as.data.frame(df2))
[1] TRUE
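In more recent versions of tidyr, gather is superseded by pivot_longer. A minimal sketch on a small made-up data frame (the `wide` object and its values are assumptions for illustration) might look like this. Note that pivot_longer returns rows in a different order from gather, which does not matter for plotting.

```r
library(tidyr)

# Small made-up wide table, one column per sample.
wide <- data.frame(Hot  = c(0.1, 0.2),
                   Cold = c(0.3, 0.4))

# Reshape to long format: one row per (Sample, Expression) pair.
long <- pivot_longer(wide,
                     cols      = everything(),
                     names_to  = "Sample",
                     values_to = "Expression")

# Keep the original column order as the factor level order.
long$Sample <- factor(long$Sample, levels = colnames(wide))
print(long)
```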
p.p.s. If there are multiple people commenting in your thread and you don't @ anyone in your reply, no one is going to get any notification about it, which is rather a waste if you've gone through all the trouble of improving your question. See here for an explanation of how the system works.

plotting two paths on ggtern plot in R

In the ggtern package in R, I am trying to plot two paths of different colors on the same ternary plot and label their starting points ONLY. Could someone show me how to do this? I can get the paths on single plots, but not together on the same one. Here is my example:
require(ggtern)
x <- data.frame(
A = c( 0, 0, 1, 0.1),
B = c( 0, 1, 0, 0.3) ,
C = c( 1, 0, 0, 0.6)
)
yy<-data.frame(
D= c(0.6, 0.2,0.8,0.33 ),
E= c(0.2, 0.8, 0.1,0.33),
F= c(0.2, 0.0, 0.1,0.33)
)
ggtern(data=x,aes(A,B,C)) +
geom_path(color="red")+
geom_point(type="l",shape=21,size=2) +
geom_text(label="", color="blue")+
theme_classic()
ggtern(data=yy,aes(D,E,F)) +
geom_path(color="blue")+
geom_point(type="l",shape=21,size=1) +
theme_classic()
Here I provide an answer to your question, also taking the opportunity to demonstrate some of the additional functionality of ggtern 2.0.1, which was published on CRAN a couple of days ago after completely re-writing the package to be compatible with ggplot2 2.0.0. A summary of the new functionality in ggtern 2.0.X can be found here:
Eric Fail is correct in saying that the best solution requires the data to be combined into a single dataframe, with the paths either grouped or mapped to a different variable for colour, in order to distinguish between them. An alternative is to create two path layers, with a local dataframe passed to each geometry, rather than using the global dataframe passed to the ggtern constructor.
In the following solution, I have combined the data, created a 'Series' variable (subsequently mapped to colour), and then made use of the new geom_label(...) geometry that comes with the new version of ggplot2. Since some of the points lie on the perimeter (and the labels extend beyond the perimeter), I have also applied a manual clipping mask under the layers, which suppresses ggtern's automatic clipping mask, normally rendered in the foreground. I have also applied the theme_rotate(...) convenience function for the purposes of demonstration, and made use of the limit_tern(...) convenience function to extend the range of the axes beyond the standard range of [0,1]. Finally, new labels have been created for the procession arrows, which are different from the apex labels.
The above solution can be produced with the following code:
require(ggtern)
df.A <- data.frame(
A = c( 0, 0, 1, 0.1),
B = c( 0, 1, 0, 0.3) ,
C = c( 1, 0, 0, 0.6)
)
df.B <-data.frame(
A= c(0.6, 0.2,0.8,0.33 ),
B= c(0.2, 0.8, 0.1,0.33),
C= c(0.2, 0.0, 0.1,0.33)
)
df = rbind(data.frame(df.A,Series='A'),
data.frame(df.B,Series='B'))
df$Label = 1:nrow(df)
ggtern(data=df,aes(A,B,C,colour=Series)) +
theme_dark() +
theme_legend_position('topleft') +
theme_showarrows() + custom_percent('%') +
theme_rotate(60) +
geom_mask() +
geom_path(size=1) +
geom_label(aes(label=Label),show.legend = F) +
limit_tern(1.1,1.1,1.1) +
labs(title ="Example Combined Paths",
Tarrow = "Value B",
Larrow = "Value A",
Rarrow = "Value C")
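The alternative mentioned earlier, two path layers each with its own local dataframe, could be sketched roughly as follows. This is an untested outline rather than the package author's recommended approach: it assumes that layers given a `data` argument inherit the A, B, C mapping from the ggtern call, and the label text ("start A", "start B") is made up for illustration. Only the first row of each dataframe is labelled, since the question asked for starting points only.

```r
library(ggtern)

df.A <- data.frame(A = c(0, 0, 1, 0.1),
                   B = c(0, 1, 0, 0.3),
                   C = c(1, 0, 0, 0.6))
df.B <- data.frame(A = c(0.6, 0.2, 0.8, 0.33),
                   B = c(0.2, 0.8, 0.1, 0.33),
                   C = c(0.2, 0.0, 0.1, 0.33))

p <- ggtern(data = df.A, aes(A, B, C)) +
  geom_path(colour = "red") +                   # path from the global data
  geom_path(data = df.B, colour = "blue") +     # path from a local dataframe
  geom_text(data = df.A[1, ], label = "start A", colour = "red") +
  geom_text(data = df.B[1, ], label = "start B", colour = "blue")
p
```

Because the first point of df.A lies on an apex, its label may be clipped unless a mask or wider limits are applied, as in the combined solution.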

R: Bar plot on a continuous x-axis (time-scaled)

I'm fairly new to R so please comment on anything you see.
I have data taken at different timepoints, under two conditions (for one timepoint), and I want to plot this as a bar plot with error bars and with the bars at the appropriate timepoint.
I currently have this (stolen from another question on this site):
library(ggplot2)
example <- data.frame(tp = factor(c(0, "14a", "14b", 24, 48, 72)), means = c(1, 2.1, 1.9, 1.8, 1.7, 1.2), std = c(0.3, 0.4, 0.2, 0.6, 0.2, 0.3))
ggplot(example, aes(x = tp, y = means)) +
geom_bar(position = position_dodge()) +
geom_errorbar(aes(ymin=means-std, ymax=means+std))
Now my timepoints are a factor, but the fact that there is an unequal distribution of measurements across time makes the plot less nice.
This is how I imagine the graph :
I find the ggplot2 package can give you very nice graphs, but I have a lot more difficulty understanding it than I have with other R stuff.
Before we get into R, you have to realize that even in a bar plot the x axis needs a numeric value. If you treat them as factors then the software assumes equal spacing between the bars by default. What would be the x-values for each of the bars in this case? It can be (0, 14, 14, 24, 48, 72) but then it will plot two bars at point 14 which you don't seem to want. So you have to come up with the x-values.
Joran provides an elegant solution by modifying the width of the bars at position 14. Modifying the code given by joran to make the bars fall at the right position in the x-axis, the final solution is:
library(ggplot2)
example <- data.frame(tp = factor(c(0, "14a", "14b", 24, 48, 72)), means = c(1, 2.1, 1.9, 1.8, 1.7, 1.2), std = c(0.3, 0.4, 0.2, 0.6, 0.2, 0.3))
example$tp1 <- gsub("a|b","",example$tp)
example$grp <- c('a','a','b','a','a','a')
example$tp2 <- as.numeric(example$tp1)
ggplot(example, aes(x = tp2, y = means,fill = grp)) +
geom_bar(position = "dodge",stat = "identity") +
geom_errorbar(aes(ymin=means-std, ymax=means+std),position = "dodge")
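If you prefer to control the bar positions yourself rather than rely on dodging, one option is to compute explicit x-values, shifting bars that share a timepoint to either side of it. The sketch below is one way to do that; `bar_width`, `n_at`, `idx` and `x` are names introduced here for illustration, and the width of 4 time units is an arbitrary choice.

```r
example <- data.frame(
  tp    = c("0", "14a", "14b", "24", "48", "72"),
  means = c(1, 2.1, 1.9, 1.8, 1.7, 1.2),
  std   = c(0.3, 0.4, 0.2, 0.6, 0.2, 0.3)
)

# Numeric timepoint: strip the trailing "a"/"b" suffix.
example$tp_num <- as.numeric(gsub("[ab]$", "", example$tp))

# Number of bars at each timepoint, and each bar's index within its timepoint.
n_at <- ave(example$tp_num, example$tp_num, FUN = length)
idx  <- ave(example$tp_num, example$tp_num, FUN = seq_along)

# Centre the group of bars on the timepoint, one bar width apart.
bar_width <- 4
example$x <- example$tp_num + (idx - (n_at + 1) / 2) * bar_width

# Plot bars and error bars at the computed positions if ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(example, aes(x = x, y = means)) +
    geom_col(width = bar_width) +
    geom_errorbar(aes(ymin = means - std, ymax = means + std), width = 1)
  print(p)
}
```

Here the two measurements at time 14 end up at x = 12 and x = 16, while all other bars stay centred on their timepoint.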
