What is the best way to plot a graph with a continuous variable on the x axis and the ratio of successes on the y axis? For example, with this data:
x <- c(.1,.3,.4,.5,.6,.3,.4,.6,.7,.8)
y <- c(0,0,0,0,0,1,1,1,1,1)
df <- cbind(x,y)
plot(x,y)
I want to see the values on the y axis as a ratio instead of 0 and 1, but I need to aggregate the x values, since the real data is continuous rather than .1, .2, etc.
For .3 on the x axis, for example, the point should have a y value of .5 (instead of one point at 1 and one at 0).
I want to model success but I don't know what type of model to use, linear or something else. I would like to see the shape of the curve and then find a proper fit.
Thanks!
OK I'm guessing you mean this?
> R <- tapply(y, x, mean)
> R
0.1 0.3 0.4 0.5 0.6 0.7 0.8
0.0 0.5 0.5 0.0 0.5 1.0 1.0
> plot(as.numeric(names(R)), R, type='h', ylim=c(0,1))
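If the real x values are continuous and rarely repeat, one option is to bin them first and take the mean of y within each bin. The following is only a sketch under that assumption; the bin edges and the names `bins` and `ratio` are illustrative choices, not something from the question.

```r
# Data from the question
x <- c(.1, .3, .4, .5, .6, .3, .4, .6, .7, .8)
y <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)

# Assumed bin edges; pick edges that suit the real data
bins <- cut(x, breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1))

# Mean of a 0/1 variable within a bin is the success ratio of that bin
# (bins with no observations come back as NA)
ratio <- tapply(y, bins, mean)

# Plot the ratio per bin; the x axis shows the bin labels
barplot(ratio, ylim = c(0, 1), xlab = "x (binned)", ylab = "success ratio")
```

Wider bins give smoother but coarser ratios, so the shape of the curve can depend noticeably on the edges chosen.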
I am making a volcano plot in R. I have a huge range of pvalues and log2fold changes. I set an xlim and ylim because I want to focus in on the central region of the plot. However, naturally setting my limits excludes some of my data. I would like to have the data outside of my axes limits displayed at my limits. So for example, a fold change of 4 would be displayed as a point just outside of my xlim of 2.
with(mydata, plot(ExpLogRatio, -log10(Expr_p_value), pch=20, main = "Volcano Plot",xlim=c(-2,2),ylim=c(0,40)))
This works but cuts out some of my data points (those with a fold change above 2 or below -2, and those with a -log10(p-value) above 40).
if I understand correctly, I'd just use pmin and pmax to limit your values, e.g.:
values = seq(-3, 3, len=21)
pmin(pmax(values, -2), 2)
gives back:
[1] -2.0 -2.0 -2.0 -2.0 -1.8 -1.5 -1.2 -0.9 -0.6 -0.3 0.0 0.3 0.6 0.9 1.2
[16] 1.5 1.8 2.0 2.0 2.0 2.0
i.e. the values are now limited to the range [-2, +2].
applying this to your data, you'd do something like:
with(mydata, {
lratio <- pmin(pmax(ExpLogRatio, -2.1), 2.1)
pch <- ifelse(ExpLogRatio == lratio, 20, 4)
plot(lratio, -log10(Expr_p_value), pch=pch, ylim=c(0, 40))
})
you'll probably want to set xlab and main to set titles, but I've not included that to keep the answer tidier. also extending this to the y-axis would obviously be easy
note I've also changed the plotting point style to indicate which points were truncated
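Extending the same idea to the y-axis could look roughly like the sketch below. The simulated `mydata`, the `clamp` helper, and the y limit of 41 are assumptions made only so the example runs on its own; the column names follow the question.

```r
# Hypothetical stand-in for the asker's `mydata`
set.seed(1)
mydata <- data.frame(
  ExpLogRatio  = rnorm(200, sd = 1.5),
  Expr_p_value = 10^(-rexp(200, rate = 1 / 15))
)

# Illustrative helper: limit v to the interval [lo, hi]
clamp <- function(v, lo, hi) pmin(pmax(v, lo), hi)

with(mydata, {
  logp   <- -log10(Expr_p_value)
  lratio <- clamp(ExpLogRatio, -2.1, 2.1)  # clamp x just outside +/-2
  lp     <- pmin(logp, 41)                 # clamp y just above 40
  # A point is truncated if either coordinate was changed by clamping
  truncated <- ExpLogRatio != lratio | logp != lp
  plot(lratio, lp, pch = ifelse(truncated, 4, 20),
       xlim = c(-2.2, 2.2), ylim = c(0, 42))
})
```

Points drawn as crosses sit on the clamped border, so the axis limits are set slightly wider than the clamp values to keep them visible.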
I have raw data where I want to see what kind of cutoff level results in what percentage of observations above the cutoff level. Here is the simulation:
data<-rnorm(100,50,30)
prop.table(table(data>10))
prop.table(table(data>20))
prop.table(table(data>30))
prop.table(table(data>40))
prop.table(table(data>50))
prop.table(table(data>60))
prop.table(table(data>70))
prop.table(table(data>80))
prop.table(table(data>90))
Here is the output:
FALSE TRUE
0.1 0.9
FALSE TRUE
0.16 0.84
FALSE TRUE
0.29 0.71
FALSE TRUE
0.36 0.64
FALSE TRUE
0.51 0.49
FALSE TRUE
0.61 0.39
FALSE TRUE
0.75 0.25
FALSE TRUE
0.86 0.14
FALSE TRUE
0.91 0.09
But this is a crude and inefficient way, as you would agree. Instead of calculating the percentage for each cutoff value one by one, I want to build a plot of that relationship, where the X axis represents the range of all possible cutoff levels and the Y axis represents percentages from 0 to 100. Something similar to this:
Please ignore the axis labels etc of the plot, this is only to provide a general example. Any suggestions?
I believe you are looking for the ecdf() function to create an empirical cumulative distribution function.
data<-rnorm(1000,50,30)
a = ecdf(data)
plot(a)
You write:
I have raw data where I want to see what kind of cutoff level results
in what percentage of observations above the cutoff level.
Taking what you write literally, then you want the proportion of observations above the cutoff. Say the cutoff is X. The empirical CDF gives you the value P(x <= X), i.e. the proportion below the cutoff. If you want the value corresponding to P(x > X), you can use the equality P(x > X) = 1-P(x <= X).
For instance:
data<-rnorm(100,50,30) # your data
dat <- data.frame(x = sort(data)) # into sorted dataframe
dat$ecdf <- ecdf(data)(dat$x) # get cdf values for each x value
dat$above <- with(dat, 1-ecdf) # get values above
plot(dat$x, dat$above)
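The repeated prop.table(table(data > k)) calls from the question can also be collapsed into a single pass over a grid of cutoffs. This is a small sketch, with the grid `cutoffs` and the name `prop_above` chosen for illustration; it relies only on the fact that the mean of a logical vector is the proportion of TRUE values.

```r
set.seed(42)                      # seed chosen arbitrarily for reproducibility
data <- rnorm(100, 50, 30)

cutoffs <- seq(10, 90, by = 10)   # assumed grid of cutoff levels

# Proportion of observations strictly above each cutoff
prop_above <- sapply(cutoffs, function(k) mean(data > k))

plot(cutoffs, prop_above, type = "b", ylim = c(0, 1),
     xlab = "cutoff", ylab = "proportion above cutoff")
```

These values agree with 1 - ecdf(data)(cutoffs), which is the identity used above.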
Having said all this, you are presenting the ECDF of a Gaussian distribution after all, which may indicate that you are looking for the ECDF itself. In that case, as already outlined in Vincent's answer, you can simply plot the ecdf values instead of above. Here is an example where I plot both.
To address your comment, I print one with a smooth line, using geom_smooth instead of geom_line.
library(ggplot2); library(scales)
ggplot(dat, aes(x=x)) +
geom_line(aes(y=ecdf), col="red" ) + # P(x<=X) in red
geom_smooth(aes(y=above), col="blue") + # Smooth version of P(x > X)
labs(y="Proportion", x="Variate") +
scale_y_continuous(labels=percent)
If you prefer the smoothed line to be drawn without the surrounding confidence band, you can add the option se=FALSE. See ?geom_smooth.
To achieve something similar with base plot, you can use
plot(dat$x, dat$above, type="n")
lines(loess.smooth(dat$x, dat$above, span=1/6))
though you may have to play around with the span parameter. This will give the following image:
I'm using Gnuplot 4.6. I have data files each containing 3 columns of data: X coordinate, Y coordinate and temperature. I wish to make an animation of plots of temperature as a function of X and Y coordinates. For this I'm using the following script:
set pm3d map; set palette;
do for [n=0:200] {splot sprintf("Temperature.%04d.dbl", n) binary array=100:100:1 form="%double" title 'file number'.n}
My problem is with the fact that after a few plots, the distribution of colors changes, both in the plot and in the legend. This makes the reading from the graph really hard.
I consulted the following post:
gnuplot heat map color range
and since the range of the temperature variable is from 0.0 to 1.2 I thought to use:
set zrange [0.0:1.2]; set cbrange [0.0:1.2];
but it doesn't help and the temperature color continues to be autoscaled from plot to plot. Any suggestions?
In addition to setting cbrange, you could try defining your own palette by
set palette defined (0 "black",\
0.2 "red",\
0.4 "orange-red",\
0.6 "orange",\
0.8 "yellow",\
1.0 "light-green",\
1.2 "green")
Or if you want discrete values:
set palette defined (0 "black",\
0.2 "black",\
0.2 "red",\
0.4 "red",\
0.4 "orange-red",\
0.6 "orange-red",\
0.6 "orange",\
0.8 "orange",\
0.8 "yellow",\
1.0 "yellow",\
1.0 "light-green",\
1.2 "light-green")
I have generated a pearson similarity matrix and plotted the results using pheatmap (clustered using hclust, method = "complete"). I'd like to output the ordered matrix, but in R the default seems to be just to alphabetize everything.
Here is my code:
df <- cor(t(genes), method = "pearson")
pheatmap(df, clustering_method = "complete")
head(genes)
             pre       early        mid       late        end
AAC1   2.0059007  3.64679740  3.0092533  2.4936171  2.2693034
AAC3  -1.6843969 -1.62572636 -0.7654462 -1.5827481 -1.6059080
AAD10  2.6012529  2.05759631  1.3665322  1.4590833  0.3778324
AAD14  0.5047704  0.76021375  0.1825944  0.6111774  0.1174208
AAD15  7.6017557  8.52315453  7.2605744  6.9029452  5.9028824
AAD16  1.2018193 -0.03285354  0.2229450 -0.1337033  0.2198542
This is what the current output (df) looks like:
    A   B    C   D
A   1 0.5 0.25 0.1
B 0.1   1  0.1 0.5
C 0.5 0.2    1 0.2
D   0 0.1  0.7   1
How can I output the similarity matrix as ordered by hclust?
I've looked, but I haven't been able to find anything that quite accomplishes what I need. Thanks in advance for your help!
(also sorry I don't know how to properly format everything yet)
EDIT: maybe some visuals would help. My clustered pheatmap output looks like this: ordered heatmap
I can see groups of genes that behave similarly, but because there are so many it's impossible/useless to read the labels. I want to find out which genes cluster together, but I can't output the ordered matrix.
When I plot the data without clustering it looks like this: unclustered heatmap
So the output/data I can get is pretty much useless for further analysis.
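One possible way to get the matrix in clustered order is to run hclust directly and index the matrix by the resulting order. The sketch below uses made-up data in place of `genes` and assumes Euclidean distance between rows with complete linkage. pheatmap also returns its hclust objects (as tree_row and tree_col), so reading the order from that return value should be equivalent, but that route is not shown here.

```r
# Hypothetical stand-in for the `genes` matrix from the question
set.seed(1)
genes <- matrix(rnorm(6 * 5), nrow = 6,
                dimnames = list(paste0("gene", 1:6),
                                c("pre", "early", "mid", "late", "end")))

df <- cor(t(genes), method = "pearson")    # gene-by-gene similarity matrix

# Cluster the rows of the similarity matrix (assumed: Euclidean distance)
hc <- hclust(dist(df), method = "complete")

# Reorder rows and columns by the dendrogram order
ordered <- df[hc$order, hc$order]

# Cluster membership per gene; k = 2 is an arbitrary illustrative choice
groups <- cutree(hc, k = 2)
```

The `ordered` matrix can then be written out with write.csv, and `groups` lists which genes fall into the same cluster.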
I have a question about the gplots package. I want to use the heatmap.2 function, but with the color key centered on 1 instead of 0. Normally, with symkey=TRUE and col=redgreen(), the color bar is built like this:
red = -2 to -0.5
black=-0.5 to 0.5
green= 0.5 to 2
Now I want to create a color bar like this:
red= -1 to 0.8
black= 0.8 to 1.2
green= 1.2 to 3
Is something like this possible?
Thank you!
If you look at the heatmap.2 help file, it looks like you want the breaks argument. From the help file:
breaks (optional) Either a numeric vector indicating the splitting points for binning x into colors, or a integer number of break points to be used, in which case the break points will be spaced equally between min(x) and max(x)
So, you use breaks to specify the cutoff points for each colour. e.g.:
library(gplots)
# make up a bunch of random data from -1, -.9, -.8, ..., 2.9, 3
# 10x10
x = matrix(sample(seq(-1,3,by=.1),100,replace=TRUE),ncol=10)
# plot. We want -1 to 0.8 being red, 0.8 to 1.2 being black, 1.2 to 3 being green.
heatmap.2(x, col=redgreen, breaks=c(-1,0.8,1.2,3))
The crucial bit is the breaks=c(-1,0.8,1.2,3) being your cutoffs.
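To see which values end up in which colour, the binning can be imitated with base R's cut(). This is only an illustration of the idea; heatmap.2 may treat values that fall exactly on a break differently, and the names `vals` and `bin` are made up for the example.

```r
breaks <- c(-1, 0.8, 1.2, 3)          # same cutoffs as in the heatmap.2 call
cols   <- c("red", "black", "green")  # one colour per interval

vals <- c(-1, 0, 0.9, 1.1, 1.5, 3)    # a few sample values

# Assign each value to an interval; include.lowest keeps -1 in the first bin
bin <- cut(vals, breaks = breaks, labels = cols, include.lowest = TRUE)

print(data.frame(value = vals, colour = bin))
```

Note that n break points define n - 1 intervals, so four breaks pair with three colours.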