Automatic label modification in ggplot graph in R - r

I have a data.frame of mutations and their frequencies in a subset of genes. The data.frame looks like:
a <- read.csv(text="gene,frequency,mutation,position
geneA,0.5,C > T,183
geneB,0.2,+T,22
geneC,0.3,A > G,539", stringsAsFactors=FALSE)
I have plotted the graph with ggplot2:
pie <- ggplot(a, aes(x="", y=frequency, fill=gene))+
geom_col(width = 1)+
scale_fill_manual(values=c("red","blue","yellow"))+
coord_polar("y", start=0, direction = -1)
Now I would like to label the mutations on the graph the way they appear in the literature: geneA c.183 C > T, with the gene name in italics. This is where I ran into a problem. I need to convert the values to character vectors, then merge them and add them to the graph.
An expression like
expression(paste(italic(a$gene), ^"c.", ^a$position, ^a$mutation))
does not work. I also tried the apply function, without success. Could you help me find a solution?
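One possible approach, offered as a sketch rather than a definitive answer: build the label strings with paste0() or sprintf() first, then hand them to geom_text(). The column names label_plain and label_math are made up for this illustration. The plotmath variant assumes parse = TRUE in geom_text() and quotes the mutation text so that ">" and "+" are treated as text, not operators.

```r
library(ggplot2)

a <- read.csv(text = "gene,frequency,mutation,position
geneA,0.5,C > T,183
geneB,0.2,+T,22
geneC,0.3,A > G,539", stringsAsFactors = FALSE)

# Plain-text label, one string per row (hypothetical column name)
a$label_plain <- paste0(a$gene, " c.", a$position, " ", a$mutation)

# plotmath label: gene name in italics, the rest as a quoted string
a$label_math <- sprintf("italic('%s')~'c.%s %s'",
                        a$gene, a$position, a$mutation)

pie <- ggplot(a, aes(x = "", y = frequency, fill = gene)) +
  geom_col(width = 1) +
  # parse = TRUE makes geom_text interpret the strings as plotmath
  geom_text(aes(label = label_math), parse = TRUE,
            position = position_stack(vjust = 0.5)) +
  scale_fill_manual(values = c("red", "blue", "yellow")) +
  coord_polar("y", start = 0, direction = -1)
pie
```

If italics are not needed, label_plain can be used instead and parse = TRUE dropped.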

Related

Scatter plot with ggplot2

I am trying to make a scatter plot with ggplot2. Below you can see the data and my code.
data=data.frame(
gross_i.2019=seq(1,101),
Prediction=seq(21,121))
ggplot(data=data, aes(x=gross_i.2019, y=Prediction, group=1)) +
geom_point()
This code produces the chart below.
Now I want the values on the scatter plot to have two different colors, one for gross_i.2019 and one for Prediction. I tried the code below with a different color, but it only changes the previous color into the new one.
sccater <- ggplot(data=data, aes(x=gross_i.2019, y=Prediction))
sccater + geom_point(color = "#00AFBB")
Can anybody help me make this plot with two different colors (e.g. black and red), one for gross_i.2019 and one for Prediction?
I may be confused about what you are trying to accomplish, but it doesn't seem like you have two groups of data to plot in two different colors. You have one dependent variable (Prediction) and one independent variable (gross_i.2019), and you are plotting the relationship between them. If Prediction and gross_i.2019 are both supposed to be dependent variables from different groups, you need a common independent variable to plot them against separately (time, for example). Then you can do something like geom_point(aes(color = group)).
Edit1: If you want the index (the row count of the dataset) to be your independent x axis, you could do the following:
library(tidyverse)
data=data.frame(gross_i.2019=seq(1,101),Prediction=seq(21,121))
#create a column for the index numbers
data$index <- c(1:101)
#using tidyr pivot your dataset to a tidy dataset (long not wide)
data <- data %>% pivot_longer(!index, names_to="group",values_to="count")
#assign the groups to colors
p<- ggplot(data=data, aes(x=index, y=count, color=group))
p1<- p + geom_point()
p1
This type of problem generally has to do with reshaping the data. ggplot2 expects the long format, and the data is in wide format. See this post on how to reshape the data from wide to long format.
long <- reshape(data,
ids = row.names(data),
varying = c("gross_i.2019", "Prediction"),
v.names = "line",
direction = "long")
long$time <- names(data)[long$time]
long$id <- as.numeric(long$id)
library(ggplot2)
ggplot(long, aes(id, line, color = time)) +
geom_point() +
scale_color_manual(values = c("#000000", "#00AFBB"))

Plotting each column of a dataframe as one line using ggplot

The whole dataset describes a module (or cluster if you prefer).
In order to reproduce the example, the dataset is available at:
https://www.dropbox.com/s/y1905suwnlib510/example_dataset.txt?dl=0
(54kb file)
You can read as:
test_example <- read.table(file='example_dataset.txt')
What I would like to have in my plot is this
On the plot, the x-axis is my Timepoints column, and the y-axis shows every column of the dataset except the last 3. Then I used facet_wrap() to group by the ConditionID column.
This is exactly what I want, but the way I achieved this was with the following code:
plot <- ggplot(dataset, aes(x=Timepoints))
plot <- plot + geom_line(aes(y=dataset[,1],colour = dataset$InModule))
plot <- plot + geom_line(aes(y=dataset[,2],colour = dataset$InModule))
plot <- plot + geom_line(aes(y=dataset[,3],colour = dataset$InModule))
plot <- plot + geom_line(aes(y=dataset[,4],colour = dataset$InModule))
plot <- plot + geom_line(aes(y=dataset[,5],colour = dataset$InModule))
plot <- plot + geom_line(aes(y=dataset[,6],colour = dataset$InModule))
plot <- plot + geom_line(aes(y=dataset[,7],colour = dataset$InModule))
plot <- plot + geom_line(aes(y=dataset[,8],colour = dataset$InModule))
...
As you can see it is not very automated. I thought about putting in a loop, like
columns <- dim(dataset)[2] - 3
for (i in seq(1:columns))
{
plot <- plot + geom_line(aes(y=dataset[,i],colour = dataset$InModule))
}
(plot <- plot + facet_wrap( ~ ConditionID, ncol=6) )
That doesn't work.
I found this topic
Use for loop to plot multiple lines in single plot with ggplot2 which corresponds to my problem.
I tried the solution given with the melt() function.
The problem is that when I use melt on my dataset, I lose the Timepoints column that I need for my x-axis. This is what I did:
data_melted <- dataset
as.character(data_melted$Timepoints)
dataset_melted <- melt(data_melted)
I tried using aggregate
aggdata <-aggregate(dataset, by=list(dataset$ConditionID), FUN=length)
Now with aggdata at least I have the information on how many Timepoints for each ConditionID I have, but I don't know how to proceed from here and combine this on ggplot.
Can anyone suggest an approach?
I know I could use the ugly solution of creating new datasets in a loop with rbind (also given in that link), but I don't want to do that, as it sounds really inefficient. I want to learn the right way.
Thanks
You have to specify id.vars in your call to melt.data.frame to keep all information you need. In the call to ggplot you then need to specify the correct grouping variable to get the same result as before. Here's a possible solution:
melted <- melt(dataset, id.vars=c("Timepoints", "InModule", "ConditionID"))
p <- ggplot(melted, aes(Timepoints, value, color = InModule)) +
geom_line(aes(group=paste0(variable, InModule)))
p
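To make the effect of id.vars concrete, here is a small sketch on invented data. The data frame toy and its values are hypothetical and only mimic the layout described in the question (measurement columns followed by Timepoints, InModule and ConditionID). It assumes melt() comes from the reshape2 package.

```r
library(reshape2)
library(ggplot2)

# Hypothetical stand-in for the real dataset: two measurement columns
# followed by the three columns that must survive the melt
toy <- data.frame(
  V1          = c(1, 2, 3, 2, 3, 4),
  V2          = c(2, 1, 2, 3, 2, 3),
  Timepoints  = c(0, 1, 2, 0, 1, 2),
  InModule    = c("yes", "yes", "yes", "no", "no", "no"),
  ConditionID = c("A", "A", "A", "B", "B", "B")
)

# id.vars are repeated for every measurement column, so Timepoints is kept
melted <- melt(toy, id.vars = c("Timepoints", "InModule", "ConditionID"))

# One line per original column within each facet
p <- ggplot(melted, aes(Timepoints, value, colour = InModule)) +
  geom_line(aes(group = interaction(variable, ConditionID))) +
  facet_wrap(~ ConditionID)
p
```

Each of the 6 rows appears once per measurement column, so melted has 12 rows, with the former column names stored in variable.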

ggplot2 use running variable to set geom_text label

I have two datasets in my diagram, "y1" and "baseline". As labels in a ggplot diagram, I want to use the difference between their y-values. I would like to compute this on the fly.
Dataframe:
df <- data.frame(c(10,20,40),c(0.1,0.2,0.3),c(0.05,0.1,0.2))
names(df)[1] <- "classes"
names(df)[2] <- "y1"
names(df)[3] <- "baseline"
df$classes <- factor(df$classes,levels=c(10,20,40), labels=c("10m","20m","40m"))
dfm=melt(df)
To start with, I defined a function which returns the y-value of the baseline corresponding to a particular x-value. I tested it and it works fine:
getBaselineY <- function(xValue){
return(dfm[dfm$classes==xValue & dfm$variable=="baseline",]$value[1])
}
Unfortunately, passing this function into the ggplot code only gives me the baseline y-value for the first x-value:
diagram <- ggplot(dfm, aes(x=classes, y=value, group=variable, colour=variable))
diagram <- diagram + geom_point() + geom_line()
diagram <- diagram + geom_text(aes(label=getBaselineY(classes)))
diagram <- diagram + theme_bw(base_size=16)
diagram
Nevertheless, replacing the function call with just the x-value gives me the respective x-value for each label:
diagram <- diagram + geom_text(aes(label=classes))
I don't understand why this happens or how best to solve it. Any help is highly appreciated!
Alternatively, this could be solved by calculating the difference beforehand and adding an additional column to the data frame:
df$Difference<-df$y1-df$baseline
dfm=melt(df,id.var=c(1,4))
And use it directly as geom_text label:
diagram <- diagram + geom_text(aes(label=Difference))
The problem is your function getBaselineY. I guess you wrote and tested it with a single xValue in mind. But you are passing a vector to the function and return only the first value.
To get the labels the way you described use an ifelse:
diagram + geom_text(aes(label = ifelse(variable == "baseline", value,
value - value[variable == "baseline"])))
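The vector-versus-scalar point can be checked outside ggplot. The sketch below rebuilds the long data by hand in base R (so it does not depend on melt()) and compares the original function with a vectorised variant. getBaselineYVec is a hypothetical name introduced only for this illustration.

```r
df <- data.frame(classes  = factor(c("10m", "20m", "40m")),
                 y1       = c(0.1, 0.2, 0.3),
                 baseline = c(0.05, 0.1, 0.2))

# Long format built by hand, same shape as the melted data in the question
dfm <- data.frame(classes  = rep(df$classes, 2),
                  variable = rep(c("y1", "baseline"), each = 3),
                  value    = c(df$y1, df$baseline))

# Original function: the trailing [1] keeps only one value
getBaselineY <- function(xValue) {
  dfm[dfm$classes == xValue & dfm$variable == "baseline", ]$value[1]
}

# Vectorised variant (hypothetical): look up each x-value separately
getBaselineYVec <- function(xValue) {
  base <- dfm[dfm$variable == "baseline", ]
  base$value[match(xValue, base$classes)]
}

single  <- getBaselineY(dfm$classes)     # length 1, recycled by ggplot
perItem <- getBaselineYVec(dfm$classes)  # one baseline value per row
```

ggplot2 evaluates aes() on the whole column at once, so a function used inside it needs to return one value per row, as the second variant does.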

log-scaled density plot: ggplot2 and freqpoly, but with points instead of lines

What I really want to do is plot a histogram with the y-axis on a log scale. Obviously this is a problem with the ggplot2 geom_histogram, since the bottom of each bar is at zero, and the log of that gives you trouble.
My workaround is to use the freqpoly geom, and that more or less does the job. The following code works just fine:
ggplot(zcoorddist) +
geom_freqpoly(aes(x=zcoord,y=..density..),binwidth = 0.001) +
scale_y_continuous(trans = 'log10')
The issue is that at the edges of my data, I get a couple of garish vertical lines that really throw you off visually when combining a bunch of these freqpoly curves in one plot. What I'd like to do is draw a point at every vertex of the freqpoly curve, with no lines connecting them. Is there a way to do this easily?
The easiest way to get the desired plot is to just recast your data. Then you can use geom_point. Since you don't provide an example, I used the standard example for geom_histogram to show this:
# load packages
require(ggplot2)
require(reshape)
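# get data (in current ggplot2 versions the movies data lives in the
# separate ggplot2movies package, so load that package first if needed)
data(movies)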
movies <- movies[, c("title", "rating")]
# here's the equivalent of your plot
ggplot(movies) + geom_freqpoly(aes(x=rating, y=..density..), binwidth=.001) +
scale_y_continuous(trans = 'log10')
# recast the data
df1 <- recast(movies, value~., measure.var="rating")
names(df1) <- c("rating", "number")
# alternative way to recast data
df2 <- as.data.frame(table(movies$rating))
names(df2) <- c("rating", "number")
df2$rating <- as.numeric(as.character(df2$rating))
# plot
p <- ggplot(df1, aes(x=rating)) + scale_y_continuous(trans="log10", name="density")
# with lines
p + geom_linerange(aes(ymax=number, ymin=.9))
# only points
p + geom_point(aes(y=number))
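An alternative sketch that skips the recasting step: stat_bin() accepts a geom argument, so the binned densities can be drawn as points directly. This assumes ggplot2 3.3.0 or later for after_stat(). The data frame d is a made-up stand-in for zcoorddist, and the bin width is chosen arbitrarily.

```r
library(ggplot2)

set.seed(42)
# Hypothetical stand-in for the zcoorddist data from the question
d <- data.frame(zcoord = rnorm(1000))

p <- ggplot(d, aes(x = zcoord)) +
  # same binning as geom_freqpoly, but drawn as unconnected points
  stat_bin(aes(y = after_stat(density)), geom = "point", binwidth = 0.1) +
  scale_y_continuous(trans = "log10")
p
```

Empty bins have a density of zero, which is not finite on a log scale, so ggplot2 drops those points with a warning instead of drawing vertical lines down to the axis.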

Combining two ecdf plots with different

At the moment I'm writing my bachelor thesis, and all of my plots are created with ggplot2. Now I need a plot of two ecdfs, but my problem is that the two dataframes have different lengths. Adding values to equalize the lengths would change the distribution, so my first idea is not an option. Yet ggplot2 does not seem to allow an ecdf plot from two dataframes of different lengths.
daten <- peptidPSMotherExplained[peptidPSMotherExplained$V3!=-1,]
daten <- cbind ( daten , "scoreDistance"= daten$V2-daten$V3 )
daten2 <- peptidPSMotherExplained2[peptidPSMotherExplained2$V3!=-1,]
daten2 <- cbind ( daten2 , "scoreDistance"= daten2$V2-daten2$V3 )
p <- ggplot(daten, aes(x = scoreDistance)) + stat_ecdf()
p <- p + geom_point(aes(x = daten2$lengthDistance))
p
With the normal plot function of R it is possible:
plot(ecdf(daten$scoreDistance))
plot(ecdf(daten2$scoreDistance),add=TRUE)
but it looks different from all of my other plots, and I dislike this.
Does anybody have a solution for me?
Thank you,
Tobias
Example:
df <-data.frame(scoreDifference = rnorm(10,0,12))
df2 <- data.frame(scoreDifference = rnorm(5,-3,9))
plot(ecdf(df$scoreDifference))
plot(ecdf(df2$scoreDifference),add=TRUE)
So how can I achieve this kind of plot in ggplot?
I don't know what geom one should use for such plots, but for combining two datasets you can simply specify the data in a new layer,
ggplot(df, aes(x = scoreDifference)) +
stat_ecdf(geom = "point") +
stat_ecdf(data=df2, geom = "point")
I think reshaping your data in the right way will probably make ggplot2 work for you:
df <-data.frame(scoreDiff1 = rnorm(10,0,12))
df2 <- data.frame(scoreDiff2 = rnorm(5,-3,9))
library('reshape2')
data <- merge(melt(df),melt(df2),all=TRUE)
Then, with data in the right shape, you can simply go on to plot the stuff with colour (or shape, or whatever you wish) to distinguish the two datasets:
p <- ggplot(data, aes(x = value, colour = variable)) + stat_ecdf()
Hope this is what you were looking for!?
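The same idea can be sketched without reshape2 by stacking the two data frames with rbind() and adding a column that records where each row came from. The column name source and the object combined are made up for this illustration. The differing lengths are not a problem, because stat_ecdf() computes one curve per group.

```r
library(ggplot2)

set.seed(1)
df  <- data.frame(scoreDifference = rnorm(10, 0, 12))
df2 <- data.frame(scoreDifference = rnorm(5, -3, 9))

# Stack the two data frames; 'source' (hypothetical name) marks the origin
combined <- rbind(cbind(df,  source = "df"),
                  cbind(df2, source = "df2"))

# One ecdf per source, distinguished by colour
p <- ggplot(combined, aes(x = scoreDifference, colour = source)) +
  stat_ecdf()
p
```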
