How to indicate factors in ggplot with horizontal line and Text - r

My data looks like this example:
dataExample<-data.frame(Time=seq(1:10),
Data1=runif(10,5.3,7.5),
Data2=runif(10,4.3,6.5),
Application=c("Substance1","Substance1","Substance1",
"Substance1","Substance2","Substance2","Substance2",
"Substance2","Substance1","Substance1"))
dataExample
Time Data1 Data2 Application
1 1 6.511573 5.385265 Substance1
2 2 5.870173 4.512775 Substance1
3 3 6.822132 5.109790 Substance1
4 4 5.940528 6.281412 Substance1
5 5 7.269394 4.680380 Substance2
6 6 6.122454 6.015899 Substance2
7 7 5.660429 6.113362 Substance2
8 8 6.649749 4.344978 Substance2
9 9 7.252656 4.764667 Substance1
10 10 7.204440 5.835590 Substance1
I would like to indicate at which times a substance was applied that differs from dataExample$Application[1].
Here is how I currently get this plotted, but I assume there is a much easier way to do it with ggplot.
library(reshape2)
library(ggplot2)
plotDataExample<-function(DataFrame){
longDF<-melt(DataFrame,id.vars=c("Time","Application"))
p=ggplot(longDF,aes(Time,value,color=variable))+geom_line()
maxValue=max(longDF$value)
minValue=min(longDF$value)
yAppLine=maxValue+((maxValue-minValue)/20)
xAppLine1=min(longDF$Time[which(longDF$Application!=longDF$Application[1])])
xAppLine2=max(longDF$Time[which(longDF$Application!=longDF$Application[1])])
lineData=data.frame(x=c(xAppLine1,xAppLine2),y=c(yAppLine,yAppLine))
xAppText=xAppLine1+(xAppLine2-xAppLine1)/2
yAppText=yAppLine+((maxValue-minValue)/20)
appText=longDF$Application[which(longDF$Application!=longDF$Application[1])[1]]
textData=data.frame(x=xAppText,y=yAppText,appText=appText)
p=p+geom_line(data=lineData,aes(x=x, y=y),color="black")
p=p+geom_text(data=textData,aes(x=x,y=y,label = appText),color="black")
return(p)
}
plotDataExample(dataExample)
Question:
Do you know a better way to get a similar result, so that I could indicate more than one factor level (e.g. Substance3, Substance4 ...)?

First, I made new sample data with more than two levels and with Substance2 appearing twice.
dataExample<-data.frame(Time=seq(1:10),
Data1=runif(10,5.3,7.5),
Data2=runif(10,4.3,6.5),
Application=c("Substance1","Substance1","Substance2",
"Substance2","Substance1","Substance1","Substance2",
"Substance2","Substance3","Substance3"))
I did not wrap this in a function, so that each step is visible.
Add a new column groups to the original data frame. It holds an identifier for each run of consecutive Applications: whenever the substance changes, a new group starts.
dataExample$groups<-c(cumsum(c(1,tail(dataExample$Application,n=-1)!=head(dataExample$Application,n=-1))))
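To make the grouping step concrete, here is a minimal base R sketch applied to the Application vector from the sample data. The variable names (app, changed, groups_cumsum, groups_rle) are only illustrative, and the rle() variant is an alternative added for comparison, not part of the original approach.

```r
# Application values from the sample data above
app <- c("Substance1", "Substance1", "Substance2", "Substance2",
         "Substance1", "Substance1", "Substance2", "Substance2",
         "Substance3", "Substance3")

# TRUE wherever an element differs from the one before it
changed <- tail(app, -1) != head(app, -1)

# Start at group 1 and increase the counter at every change
groups_cumsum <- cumsum(c(1, changed))

# Alternative (assumption, not in the original): number the runs found by rle()
runs <- rle(app)
groups_rle <- rep(seq_along(runs$lengths), runs$lengths)

print(groups_cumsum)
```

Both variants give one id per run, so the two separate Substance2 blocks end up in different groups.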
Convert the data to long format for plotting the data lines (melt() comes from reshape2).
library(reshape2)
longDF<-melt(dataExample,id.vars=c("Time","Application","groups"))
Calculate the positions of the substance indicators with ddply() from the plyr package. Only rows whose Application differs from the first Application value are used (that is the subset() call). The data are then grouped by Application and groups. For each group, the start, middle, and end positions on the x axis are calculated, and the y position is set to the maximum value + 0.3.
library(plyr)
lineData<-ddply(subset(dataExample,Application != dataExample$Application[1]),
.(Application,groups),
summarise,minT=min(Time),maxT=max(Time),
meanT=mean(Time),ypos=max(longDF$value)+0.3)
Now plot the longDF data with ggplot() and geom_line(), then add the segments above the lines with geom_segment() and the labels with annotate(), both using the new data frame lineData.
library(ggplot2)
ggplot(longDF,aes(Time,value,color=variable))+geom_line()+
geom_segment(data=lineData,aes(x=minT,xend=maxT,y=ypos,yend=ypos),inherit.aes=FALSE)+
annotate("text",x=lineData$meanT,y=lineData$ypos+0.1,label=lineData$Application)
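If plyr is not available, the same per-group positions can be sketched in base R. This is an assumption-labeled alternative, not the method used above: the data frame d is a hypothetical stand-in holding only Time and Application, and the names changed_rows and segs are illustrative. The ypos column is left out because it depends on the random data values.

```r
# Hypothetical stand-in for the sample data (only the columns needed here)
d <- data.frame(
  Time = 1:10,
  Application = rep(c("Substance1", "Substance2", "Substance1",
                      "Substance2", "Substance3"), each = 2),
  stringsAsFactors = FALSE
)

# Run identifier, as in the groups column above
d$groups <- cumsum(c(1, tail(d$Application, -1) != head(d$Application, -1)))

# Keep only rows that differ from the first Application value
changed_rows <- d[d$Application != d$Application[1], ]

# One summary row per run: start, end and midpoint on the x axis
segs <- do.call(rbind, lapply(split(changed_rows, changed_rows$groups), function(g) {
  data.frame(Application = g$Application[1],
             groups = g$groups[1],
             minT = min(g$Time),
             maxT = max(g$Time),
             meanT = mean(g$Time))
}))
rownames(segs) <- NULL
print(segs)
```

The resulting segs data frame has the same minT, maxT and meanT columns that geom_segment() and annotate() use above.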

Related

Trying to make a graph with multiple lines using ggplot

I am new to R and I have been trying to make a line graph with multiple lines. I tried the 'plot' function but didn't get the desired result, so I am now trying ggplot.
I keep running into error:
Aesthetics must be either length 1 or the same as the data (100): x
and there's obviously no graph output.
Any help is much appreciated
I have rearranged my data, before it had 4 separate columns for different consumer types but now I have merged them and made a column that identifies each consumer.
This is the part of the code that generates the error
ggplot(data=consumers,aes(x=scenarios,y=unitary.bill)) +
geom_line(aes(color=consumer.type,group=consumer.type))
my data looks like this:
scenario unitary.bill consumer.type
1 1 0.076536835 net.cons
2 2 0.075835361 net.cons
3 3 0.076696548 net.cons
4 4 0.076431602 net.cons
5 5 0.076816135 net.cons
.........
27 2 0.076794287 smart.cons
28 3 0.075555555 smart.cons
29 4 0.077126955 smart.cons
30 5 0.077925161 smart.cons
.......
100 25 0.049247761 smart.pros
I expect a line graph with four different colors (one for each consumer type) and the scenarios on the x-axis.
Thanks for all the help from Camille and Infominer. My code now looks like this (I added some more details)
ggplot(data=consumers,aes(x = scenarios,y = unitary.bill, colour= SMCs)) +
geom_line(size=1) + scale_colour_manual(values=c("indianred1", "yellowgreen","lightpink","springgreen4"))+
ggtitle(" Unitary bill for each SMC type at the end of the scenario runs")+
scale_x_continuous(breaks=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25))
and the graph looks as I wanted it to. However, putting some more distance between the title and the graph would make it prettier.

R Question: How can I create a histogram with 2 variables against each other?

Okay, let me be as clear as I can in my problem. I'm new to R, so your patience is appreciated.
I want to create a histogram using two different vectors. The first vector contains a list of models (products). These models are listed as either integers, strings, or NA. I'm not exactly sure how R is storing them (I assume they're kept as strings), or if that is a relevant issue. I also have a vector containing a list of incidents pertaining to that model. So for example, one row in the dataframe might be:
Model Incidents
XXX1991 7
How can I create a histogram where the number of incidents for each model is shown? So the histogram will look like
| =
| =
Frequency of | =
Incidents | = =
| = = =
| = = = = =
- - - - - -
Each different Model
Just to give a general idea.
I also need to be able to map everything out with standard deviation lines, so that it's easy to see which models are the least reliable. But that's not the main question here. I just don't want to do anything that will make me unable to use standard deviation in the future.
So far, all I really understand is how to make a histogram with the frequency marked, but for some reason, the x-axis is marked with numbers, not the models' names.
I don't really care if I have to download new packages to make this work, but I suspect that this already exists in basic R or ggplot2 and I'm just too dumb to figure it out.
Feel free to ask clarifying questions. Thanks.
EDIT: I forgot to mention, there are multiple rows of incidents listed under each model. So to add to my example earlier:
Model Incidents
XXX1991 7
XXX1991 1
XXX1991 19
3
5
XXX1002 9
XXX1002 4
etc . . .
I want to add up all the incidents for a model under one label.
I am assuming that you did not mean to leave the model blank in your example, so I filled in some values.
You can add up the number of incidents by model using aggregate then make the relevant plot using barplot.
## Example Data
data = read.table(text="Model Incidents
XXX1991 7
XXX1991 1
XXX1991 19
XXX1992 3
XXX1992 5
XXX1002 9
XXX1002 4",
header=TRUE)
TAB = aggregate(data$Incidents, list(data$Model), sum)
TAB
Group.1 x
1 XXX1002 13
2 XXX1991 27
3 XXX1992 8
barplot(TAB$x, names.arg=TAB$Group.1 )
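Since the question also mentions standard deviations, the same aggregation can be extended with sd. This is a minimal sketch using the example data above; the names incidents, totals, spread and summary_tab are illustrative, and the formula interface of aggregate() is used so that the result keeps the original column names.

```r
## Same example data as above
incidents <- read.table(text = "Model Incidents
XXX1991 7
XXX1991 1
XXX1991 19
XXX1992 3
XXX1992 5
XXX1002 9
XXX1002 4",
header = TRUE)

# Total and standard deviation of incidents per model
totals <- aggregate(Incidents ~ Model, data = incidents, FUN = sum)
spread <- aggregate(Incidents ~ Model, data = incidents, FUN = sd)

# One row per model with both statistics side by side
summary_tab <- merge(totals, spread, by = "Model", suffixes = c(".sum", ".sd"))
print(summary_tab)
```

The Incidents.sum column can be passed to barplot() exactly like TAB$x above, with Model as names.arg.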

Statistical testing for each row to create a volcano plot in R?

I currently have a dataset in R where for each gene, I have the log2 expression for four daphnia exposed to copper (columns 2:5), and four control daphnia (columns 6:9). The head of the data frame is shown below:
GENE_ID s_MC13_A9_Cu s_MC13_A10_Cu s_MC13_A11_Cu s_MC13_A12_Cu
1 GENE:JGI_V11_100009 3.202503 3.152049 3.171360 3.164072
2 GENE:JGI_V11_100036 3.241468 3.226221 3.217426 3.208969
3 GENE:JGI_V11_100044 3.220307 3.223803 3.171068 3.228047
4 GENE:JGI_V11_100045 3.030001 3.017256 3.028523 3.013106
5 GENE:JGI_V11_100062 3.487595 3.471572 3.517525 3.503789
6 GENE:JGI_V11_100066 3.576387 3.671755 3.589585 3.573903
s_MC13_C1_C s_MC13_C2_C s_MC13_C3_C s_MC13_C4_C
1 3.154395 3.174957 3.153786 3.188665
2 3.201010 3.211897 3.201822 3.220532
3 3.201586 3.183351 3.178189 3.178734
4 3.021061 3.023119 3.068359 3.053681
5 3.525903 3.532097 3.532113 3.530310
6 3.624656 3.534701 3.697618 3.513394
I want to create a volcano plot illustrating differential expression, so I first need to produce a P value for each of my genes. I thought about using a t-test or eBayes to do so; is this the best option for microarray data? If so, how do I go about doing this, thereby generating a P value for every gene in the first column?
Thanks
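As a rough illustration of the per-gene t-test idea, here is a base R sketch. The matrix expr is randomly generated, hypothetical data standing in for the eight expression columns (four copper, four control); with the real data frame it would be built from columns 2:5 and 6:9. For microarray data, moderated statistics such as limma's lmFit() followed by eBayes() are a common choice, but that is not shown here.

```r
set.seed(1)

# Hypothetical log2 expression matrix: 6 genes x 8 samples (4 Cu, 4 control)
expr <- matrix(rnorm(48, mean = 3.2, sd = 0.05), nrow = 6,
               dimnames = list(paste0("gene", 1:6),
                               c(paste0("Cu", 1:4), paste0("C", 1:4))))
treated <- 1:4
control <- 5:8

# Welch t-test per row (gene); keep only the p-value
p_value <- apply(expr, 1, function(row) {
  t.test(row[treated], row[control])$p.value
})

# Values are already log2, so the difference of means is the log2 fold change
log2_fc <- rowMeans(expr[, treated]) - rowMeans(expr[, control])

# Data for a volcano plot: effect size vs. significance
volcano <- data.frame(gene = rownames(expr),
                      log2_fc = log2_fc,
                      neg_log10_p = -log10(p_value),
                      p_adj = p.adjust(p_value, method = "BH"))
print(volcano)
```

Plotting log2_fc against neg_log10_p, for example with plot(), then gives the volcano shape. With thousands of genes, the adjusted p-values are usually the safer basis for calling a gene significant.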

Dot Plots with multiple categories - R

I'm definitely a neophyte to R for visualizing data, so bear with me.
I'm looking to create side-by-side dot plots of seven categorical samples with many gene expression values corresponding with individual gene names. mydata.csv file looks like the following
B27 B28 B30 B31 LTNP5.IFN.1 LTNP5.IFN.2 LTNP5.IL2.1
1 13800.91 13800.91 13800.91 13800.91 13800.91 13800.91 13800.91
2 6552.52 5488.25 3611.63 6552.52 6552.52 6552.52 6552.52
3 3381.70 1533.46 1917.30 2005.85 3611.63 4267.62 5488.25
4 2985.37 1188.62 1051.96 1362.32 2717.68 2985.37 5016.01
5 1917.30 2862.19 2625.29 2493.26 2428.45 2717.68 4583.02
6 990.69 777.97 1269.05 1017.26 5488.25 5488.25 4267.62
I would like each sample data to be organized in its own dot plot in one graph. Additionally, if I could point out individual data points of interest, that would be great.
Thanks!
You can use base R, but you need to convert to matrix first.
dotchart(as.matrix(df))
or, we can transpose the matrix to arrange it by sample:
dotchart(t(as.matrix(df)))
Considering your [toy] data is stored in a data frame called a:
library(reshape2)
library(ggplot2)
a$trial<-1:dim(a)[1] # also, nrow(a)
b<-melt(data = a,measure.vars = colnames(a)[1:7],id.vars = "trial")
b$variable<-as.factor(b$variable)
ggplot(b,aes(trial,value))+geom_point()+facet_wrap(~variable)
What we did:
Loaded the required libraries (reshape2 to convert from wide to long and ggplot2 to, well, plot), melted the data into long format (more difficult to read, easier to process), and then plotted with ggplot.
I introduced trial to index each "run" in which the variables were measured, and plotted trial against value at each level of variable. The facet_wrap part puts each plot into its own panel, one per level of variable.

R histogram from already summarized count

I have a really huge file, so I had to count the frequencies for the histogram outside R.
I couldn't find the correct answer in existing threads. Everything I tried led to a bar plot or to failure (R's errors would not let me plot it as a histogram the way I tried).
The file looks like this (it's tab delimited):
freq cov
394104974 1
387288861 3
141169009 4
105488813 2
60039934 6
45109486 5
26318120 7
9691068 8
7532886 9
3973434 10
It has something like 3k lines.
How can I plot this with ggplot2 as a nice histogram? (cov column holds x axis values)
Cheers,
Irek
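One possible approach, sketched under assumptions: since the counts already exist, the bars can be drawn directly from them with geom_col() at full width, which looks like a histogram for integer cov values. The data frame counts below holds only the ten rows shown above; for the real file, read.table() with sep = "\t" and header = TRUE would be the assumed way to load it. The ggplot2 part is guarded so the data preparation runs even where ggplot2 is not installed.

```r
# The ten rows shown in the question
counts <- read.table(text = "freq cov
394104974 1
387288861 3
141169009 4
105488813 2
60039934 6
45109486 5
26318120 7
9691068 8
7532886 9
3973434 10",
header = TRUE)

# Sort by the x-axis value so the table reads like a histogram
counts <- counts[order(counts$cov), ]

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Bars of width 1 touch each other, mimicking a histogram
  p <- ggplot(counts, aes(x = cov, y = freq)) +
    geom_col(width = 1)
  print(p)
}
```

In base R, barplot(counts$freq, names.arg = counts$cov, space = 0) gives a similar picture without extra packages.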
