Labelling issue after using gap.plot to create an axis break - r

I've been struggling to get a plot that shows my data accurately, and spent a while getting gap.plot up and running. After doing so, I have an issue with labelling the points.
Just plotting my data ends up with this:
Plot of abundance data, basically two different tiers of data at ~38,000, and between 1 - 50
As you can see, that doesn't clearly show either the top or the bottom sections of my plots well enough to distinguish anything.
Using gap plot, I managed to get:
gap.plot of abundance data, 100 - 37000 missed, labels only appearing on the lower tier
The code for my two plots is pretty simple:
plot(counts.abund1,pch=".",main= "Repeat 1")
text(counts.abund1, labels=row.names(counts.abund1), cex= 1.5)
gap.plot(counts.abund1[,1],counts.abund1[,2],gap=c(100,38000),gap.axis="y",xlim=c(0,60),ylim=c(0,39000))
text(counts.abund1, labels=row.names(counts.abund1), cex= 1.5)
But I don't know why/can't figure out why the labels (which are just the letters that the points denote) are not being applied the same in the two plots.
I'm kind of out of my depth trying this bit, very little idea how to plot things like this nicely, never had data like it when learning.
The data this comes from is originally a large (10,000 x 10,000 matrix) that contains a random assortment of letters a to z, then has replacements and "speciation" or "immigration" which results in the first lot of letters at ~38,000, and the second lot normally below 50.
The code I run after getting that matrix to get the rank abundance is:
##Abundance 1
counts1 <- as.data.frame(as.list(table(neutral.v1)))
counts.abund1<-rankabundance(counts1)
With neutral.v1 being the matrix.
The data frame for counts.abund1 looks like (extremely poorly formatted, sorry):
rank abundance proportion plower pupper accumfreq logabun rankfreq
a 1 38795 3.9 NaN NaN 3.9 4.6 1.9
x 2 38759 3.9 NaN NaN 7.8 4.6 3.8
j 3 38649 3.9 NaN NaN 11.6 4.6 5.7
m 4 38639 3.9 NaN NaN 15.5 4.6 7.5
and continues for all the variables. I only use Rank and Abundance right now, with the a,x,j,m just the variable that applies to, and what I want to use as the labels on the plot.
Any advice would be really appreciated. I can't really shorten the code too much or provide the matrix because the type of data is quite specific, as are the quantities in a sense.
As I mentioned, I've been using gap.plot to just create a break in the axis, but if there are better solutions to plotting this type of data I'd be absolutely all ears.
Really sorry that this is a mess of a question, bit frazzled on the whole thing right now.

gap.plot() doesn't draw two plots but one plot by decreasing upper section's value, drawing additional box and rewriting axis tick labels. So, the upper region's y-coordinate is neither equivalent to original value nor axis tick labels. The real y-coordinate in upper region is "original value" - diff(gap).
gap.plot(counts.abund1[,1], counts.abund1[,2], gap=c(100,38000), gap.axis="y",
xlim=c(0,60), ylim=c(0,39000))
text(counts.abund1, labels=row.names(counts.abund1), cex= 1.5)
text(counts.abund1[,1], counts.abund1[,2] - diff(c(100, 38000)), labels=row.names(counts.abund1), cex=1.5)
# the example data I used
set.seed(1)
counts.abund1 <- data.frame(rank = 1:50,
abundance = c(rnorm(25, 38500, 100), rnorm(25, 30, 20)))

Related

finding different y values along curve in r

So I have plotted a curve, and have had a look in both my book and on stack but can not seem to find any code to instruct R to tell me the value of y when along curve at 70 x.
curve(
20*1.05^x,
from=0, to=140,
xlab='Time passed since 1890',
ylab='Population of Salmon',
main='Growth of Salmon since 1890'
)
So in short, I would like to know how to command R to give me the number of salmon at 70 years, and at other times.
Edit:
To clarify, I was curious how to command R to show multiple Y values for X at an increase of 5.
salmon <- data.frame(curve(
20*1.05^x,
from=0, to=140,
xlab='Time passed since 1890',
ylab='Population of Salmon',
main='Growth of Salmon since 1890'
))
salmon$y[salmon$x==70]
1 608.5285
This salmon data.frame gives you all of the data.
head(salmon)
x y
1 0.0 20.00000
2 1.4 21.41386
3 2.8 22.92768
4 4.2 24.54851
5 5.6 26.28392
6 7.0 28.14201
If you can also use inequalities to check the number of salmon in given ranges using the syntax above.
It's also simple to answer the 2nd part of your question using this object:
salmon$z <- salmon$y*5 # I am using * instead of + to make the plot more clear
plot(x=salmon$x,y=salmon$z, xlab='Time passed since 1890', ylab='Population of Salmon',type="l")
lines(salmon$x,salmon$y, col="blue")
curve is plotting the function 20*1.05^x
so just plug any value you want in that function instead of x, e.g.
> 20*1.05^70
[1] 608.5285
>
20*1.05^(seq(from=0, to=70, by=10))
Was all I had to do, I had forgotten until Ed posted his reply that I could type a function directly into R.

Identifying data points amongst background noise for binned data R

Not sure whether this should go on cross validated or not but we'll see. Basically I obtained data from an instrument just recently (masses of compounds from 0 to 630) which I binned into 0.025 bins before plotting a histogram as seen below:-
I want to identify the bins that are of high frequency and that stands out from against the background noise (the background noise increases as you move from right to left on the a-xis). Imagine drawing a curve line ontop of the points that have almost blurred together into a black lump and then selecting the bins that exists above that curve to further investigate, that's what I'm trying to do. I just plotted a kernel density plot to see if I could over lay that ontop of my histogram and use that to identify points that exist above the plot. However, the density plot in no way makes any headway with this as the densities are too low a value (see the second plot). Does anyone have any recommendations as to how I Can go about solving this problem? The blue line represents the density function plot overlayed and the red line represents the ideal solution (need a way of somehow automating this in R)
The data below is only part of my dataset so its not really a good representation of my plot (which contains just about 300,000 points) and as my bin sizes are quite small (0.025) there's just a huge spread of data (in total there's 25,000 or so bins).
df <- read.table(header = TRUE, text = "
values
1 323.881306
2 1.003373
3 14.982121
4 27.995091
5 28.998639
6 95.983138
7 2.0117459
8 1.9095478
9 1.0072853
10 0.9038475
11 0.0055748
12 7.0964916
13 8.0725191
14 9.0765316
15 14.0102531
16 15.0137390
17 19.7887675
18 25.1072689
19 25.8338140
20 30.0151683
21 34.0635308
22 42.0393751
23 42.0504938
")
bin <- seq(0, 324, by = 0.025)
hist(df$values, breaks = bin, prob=TRUE, col = "grey")
lines(density(df$values), col = "blue")
Assuming you're dealing with a vector bin.densities that has the densities for each bin, a simple way to find outliers would be:
look at a window around each bin, say +- 50 bins
current.bin <- 1
window.size <- 50
window <- bin.densities[current.bin-window.size : current.bin+window.size]
find the 95% upper and lower quantile value (or really any value you think works)
lower.quant <- quantile(window, 0.05)
upper.quant <- quantile(window, 0.95)
then say that the current bin is an outlier if it falls outside your quantile range.
this.is.too.high <- (bin.densities[current.bin] > upper.quant
this.is.too.low <- (bin.densities[current.bin] < lower.quant)
#final result
this.is.outlier <- this.is.too.high | this.is.too.low
I haven't actually tested this code, but this is the general approach I would take. You can play around with window size and the quantile percentages until the results look reasonable. Again, not exactly super complex math but hopefully it helps.

Retain relationship between x and y data values in R plots

I'm fairly new to R but I am trying to create line graphs that monitor growth of bacteria over the course of time. I can successfully do this but the resulting graph isn't to my satisfaction. This is because I'm not using evenly spaced time increments although R plots these increments equally. Here is some sample data to give you and idea of what I'm talking about.
x=c(.1,.5,.6,.7,.7)
plot(x,type="o",xaxt="n",xlab="Time (hours)",ylab="Growth")
axis(1,at=1:5,lab=c(0,24,72,96,120))
As you can see there are 48 hours between 24 and 72 but this is evenly distributed on the graph, is there anyway I can adjust the scale to more accurately display my data?
It's always best in R to use data structures that exhibit the relationships between your data. Instead of defining growth and time as two separate vectors, use a data frame:
growth <- c(.1,.5,.6,.7,.7)
time <- c(0,24,72,96,120)
df <- data.frame(time,growth)
print(df)
time growth
1 0 0.1
2 24 0.5
3 72 0.6
4 96 0.7
5 120 0.7
plot(df, type="o")
Not sure if this produces the exact x-axis labels that you want, but you should be free to edit the graph now without changing the relationship between the growth and time variables.
x=data.frame(x=c(.1,.5,.6,.7,.7), y=c(0,24,72,96,120))
plot(x$y, x$x,type="o",xaxt="n",xlab="Time (hours)",ylab="Growth")

How to separate the two leftmost bins of a histogram in R

Suppose I need to plot a dataset like below:
set.seed(1)
dataset <- sample(1:7, 1000, replace=T)
hist(dataset)
As you can see in the plot below, the two leftmost bins do not have any space between them unlike the rest of the bins.
I tried changing xlim, but it didn't work. Basically I would like to have each number (1 to 7) represented as a bin, and additionally, I would like any two adjacent bins to have space beween them...Thanks!
The best way is to set the breaks argument manually. Using the data from your code,
hist(dataset,breaks=rep(1:7,each=2)+c(-.4,.4))
gives the following plot:
The first part, rep(1:7,each=2), is what numbers you want the bars centered around. The second part controls how wide the bars are; if you change it to c(-.49,.49) they'll almost touch, if you change it to c(-.3,.3) you get narrower bars. If you set it to c(-.5,.5) then R yells at you because you aren't allowed to have the same number in your breaks vector twice.
Why does this work?
If you split up the breaks vector, you get one part that looks like this:
> rep(1:7,each=2)
[1] 1 1 2 2 3 3 4 4 5 5 6 6 7 7
and a second part that looks like this:
> c(-.4,.4)
[1] -0.4 0.4
When you add them together, R loops through the second vector as many times as needed to make it as long as the first vector. So you end up with
1-0.4 1+0.4 2-0.4 2+0.4 3-0.4 3+0.4 [etc.]
= 0.6 1.4 1.6 2.4 2.6 3.4 [etc.]
Thus, you have one bar from 0.6 to 1.4--centered around 1, with width 2*.4--another bar from 1.6 to 2.4 centered around 2 with with 2*.4, and so on. If you had data in between (e.g. 2.5) then the histogram would look kind of silly, because it would create a bar from 2.4 to 2.6, and the bar widths would not be even (since that bar would only be .2 wide, while all the others are .8). But with only integer values that's not a problem.
You need six bars NOT seven bars; that is what your histogram has space for. But then you end up generating seven bars. That is the bug.
do sample(1:6, 1000, replace=T) instead of sample(1:7, 1000, replace=T)
If you do need seven bars, then seed with 0

Boxplot with axis limits in R

I have data in tab delimited format with nearly 400 columns filled with values ie
X Y Z A B C
2.34 .89 1.4 .92 9.40 .82
6.45 .04 2.55 .14 1.55 .04
1.09 .91 4.19 .16 3.19 .56
5.87 .70 3.47 .80 2.47 .90
Now I want visualize the data using box plot method.Though it is difficult to view 400 in single odf,I want split into 50 each.ie(50 x 8).Here is the code I used:
boxplot(data[1:50],xlab="Samples",xlim=c(0.001,70),log="xy",
pch='.',col=rainbow(ncol(data[1:50)))
but I got the following error:
In plot.window(xlim = xlim, ylim = ylim, log = log, yaxs = pars$yaxs)
: nonfinite axis limits [GScale(-inf,4.4591,2, .); log=1]
I want to view the box plots for 400 samples with 50 each in a 8 different pdf....Please do help me in getting better visualization.
Others have already pointed out that actual boxplots are not going to work well. However, there is a very efficient way to visually scan all of your variables: Simply plot their distributions as an image (i.e. heatmap). Here is an example showing how it is really quite easy to get the gist of 400 variables and 80,000 individual data points!
# Simulate some data
set.seed(12345)
n.var = 400
n.obs = 200
data = matrix(rnorm(n.var*n.obs), nrow=n.obs)
# Summarize data
breaks = seq(min(data), max(data), length.out=51)
histdata = apply(data, 2, function(x) hist(x, plot=F, breaks=breaks)$counts)
# Plot
dev.new(width=4, height=4)
image(1:n.var, breaks, t(histdata), xlab='Variable Index', ylab='Histogram Bin')
This will be most useful if all your variables are comparable, or are at least sorted into rational groups. hclust and the heatmap functions can also be helpful here for more complicated displays. Good luck!
I agree that you will have to do something a bit drastic to distinguish 400 boxes in the same graph. The code below uses two tricks: (1) reverse the usual x-y order so that it's easier to read the labels (plotted on the y axis); (2) send the output to a tall, skinny PDF file so that you can scroll through it at your leisure. I also opted to sort the variables by mean, to make the plot easier to interpret -- that would be optional, but I suspect you'd have a hard time looking up a particular category in a 400-box plot in any case ...
nc <- 400
z <- as.data.frame(matrix(rnorm(nc*100),ncol=nc))
library(ggplot2)
m <- melt(z)
m <- transform(m,variable=reorder(variable,value))
pdf(width=10,height=50,file="boxplot.pdf")
print(ggplot(m,aes(x=variable,y=value))+geom_boxplot()+coord_flip())
dev.off()
Considering that you are plotting 400 boxes in your box plot, I am not surprised that you are having trouble seeing them. Suppose that you have a monitor that is 1024 pixels wide. Your application will only be able to display the boxes as two pixels wide. Even with larger screens you will not increase the number of pixels by much (a screen with 2000 pixels will at most show you boxes that are 5 pixels wide).
I would suggest plotting your boxes on two or more separate plots.

Resources