R Plot - Multiple Variables, Numeric and Categorical, Layered - r

Still learning R, and have been struggling with plotting. Below is part of my data, and I will try to explain the type of plot:
> head(bees.net.counts)
Month Block Treatment Flower Bee_Richness Bee_Abundance
1 May 1 UB POSI 1 1
2 May 2 DS ERST 4 38
3 May 2 UB RUBU 2 2
4 May 3 DS ERST 3 4
5 May 3 DS TROH 1 10
6 May 3 GS ERST 1 1
I want to make a plot where Flower is on the x-axis (there are 54 different ones), Bee_Richness or Bee_Abundance is on the y-axis, different colored symbols for Block (n=4) and amount of shading in each of those symbols for Treatment (n=3) (ie Block 1 Treatment UB is a red circle unfilled, Block 1 Treatment DS is a circle with half shaded red, and Block 1 Treatment GS is fully shaded red).
The problem I have is that each line is plotted instead of putting every point above a specific flower spp (there are multiple rows that have, say, CHFA, but those represent different Blocks and Treatments).
I have also tried this by month, where I separated the four months to make different graphs (to limit the length of the x-axis). There are 10 records in May, with 4 different flower species. I still can't figure out a way to do this.
Thank you for your help!!
Edit: Here is what I hope to get = plot idea

This uses the idea of @d.b's solution, but improves the axis labels.
flower <- as.factor(df$Flower)  # convert once so levels() also works when Flower is a character column
plot(x = as.numeric(flower), df$Bee_Richness,
pch = as.numeric(as.factor(df$Block)),
col = as.numeric(as.factor(df$Treatment)),
xaxt="n", xlab="Flower", ylab="Richness")
axis(1, at=seq_along(levels(flower)),
labels=levels(flower))
Some added explanation
As you requested, the character is based on the Block.
The color is based on the Treatment. Let's look at the
color/Treatment. The trick is that when you make Treatment a factor,
each value is internally represented as an integer, so you can
use as.numeric on the factor and it translates
DS to 1, GS to 2 and UB to 3. That makes the argument
col = as.numeric(as.factor(df$Treatment))
give DS color 1 and so on. R uses the numbers 1-8 as some
easy-to-access colors. Since you only need 3, this works fine.
Similarly,
pch = as.numeric(as.factor(df$Block))
picks characters 1 through 3 for the three Block values in the small test data.
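To make the factor-to-integer mapping concrete, here is a minimal sketch using the six rows shown in the question. The legend calls are an addition for illustration, not part of the original answer, and the names trt and blk are just placeholders.

```r
# The six rows from head(bees.net.counts) in the question
df <- data.frame(
  Flower       = c("POSI", "ERST", "RUBU", "ERST", "TROH", "ERST"),
  Block        = c(1, 2, 2, 3, 3, 3),
  Treatment    = c("UB", "DS", "UB", "DS", "DS", "GS"),
  Bee_Richness = c(1, 4, 2, 3, 1, 1)
)

flower <- as.factor(df$Flower)
trt    <- as.factor(df$Treatment)   # levels sort alphabetically: DS, GS, UB
blk    <- as.factor(df$Block)

as.numeric(trt)                     # 3 1 3 1 1 2  (UB -> 3, DS -> 1, GS -> 2)

plot(as.numeric(flower), df$Bee_Richness,
     pch = as.numeric(blk), col = as.numeric(trt),
     xaxt = "n", xlab = "Flower", ylab = "Richness")
axis(1, at = seq_along(levels(flower)), labels = levels(flower))

# Legends reuse the same integer codes, so they stay in sync with the plot
legend("topright", legend = levels(trt), col = seq_along(levels(trt)),
       pch = 16, title = "Treatment")
legend("top", legend = levels(blk), pch = seq_along(levels(blk)),
       title = "Block")
```

Because both the plot and the legends derive their codes from the same factor levels, adding more Blocks or Treatments does not require changing the calls.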

Related

Identifying data points amongst background noise for binned data R

Not sure whether this should go on Cross Validated or not, but we'll see. I recently obtained data from an instrument (masses of compounds from 0 to 630), which I binned into bins of width 0.025 before plotting the histogram shown below:
I want to identify the high-frequency bins that stand out against the background noise (the background noise increases as you move from right to left on the x-axis). Imagine drawing a curved line on top of the points that have almost blurred together into a black lump, and then selecting the bins that rise above that curve for further investigation; that's what I'm trying to do. I plotted a kernel density estimate to see whether I could overlay it on my histogram and use it to identify points that lie above it. However, the density plot doesn't help, because the densities are far too low (see the second plot). Does anyone have any recommendations as to how I can go about solving this problem? The blue line represents the overlaid density estimate and the red line represents the ideal solution (I need a way of somehow automating this in R).
The data below is only part of my dataset, so it's not really a good representation of my plot (which contains about 300,000 points), and as my bin width is quite small (0.025) there's a huge spread of data (in total there are 25,000 or so bins).
df <- read.table(header = TRUE, text = "
values
1 323.881306
2 1.003373
3 14.982121
4 27.995091
5 28.998639
6 95.983138
7 2.0117459
8 1.9095478
9 1.0072853
10 0.9038475
11 0.0055748
12 7.0964916
13 8.0725191
14 9.0765316
15 14.0102531
16 15.0137390
17 19.7887675
18 25.1072689
19 25.8338140
20 30.0151683
21 34.0635308
22 42.0393751
23 42.0504938
")
bin <- seq(0, 324, by = 0.025)
hist(df$values, breaks = bin, prob=TRUE, col = "grey")
lines(density(df$values), col = "blue")
Assuming you're dealing with a vector bin.densities that has the densities for each bin, a simple way to find outliers would be:
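Look at a window around each bin, say +/- 50 bins. Note that `:` binds tighter than `+` and `-`, so the bounds are computed first, and they are clamped so the window never runs past the ends of the vector:
current.bin <- 1
window.size <- 50
window.start <- max(1, current.bin - window.size)
window.end <- min(length(bin.densities), current.bin + window.size)
window <- bin.densities[window.start:window.end]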
Find the 5% (lower) and 95% (upper) quantiles of the window (or really any cutoffs you think work):
lower.quant <- quantile(window, 0.05)
upper.quant <- quantile(window, 0.95)
Then say that the current bin is an outlier if it falls outside your quantile range.
this.is.too.high <- (bin.densities[current.bin] > upper.quant)
this.is.too.low <- (bin.densities[current.bin] < lower.quant)
#final result
this.is.outlier <- this.is.too.high | this.is.too.low
I haven't actually tested this code, but this is the general approach I would take. You can play around with window size and the quantile percentages until the results look reasonable. Again, not exactly super complex math but hopefully it helps.
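The single-bin check above can be applied to every bin in a loop. The following is only a sketch under assumptions: flag_high_bins is a hypothetical helper name, the toy counts are made up for illustration, and only the upper quantile is used because the question is about bins that stand out above the noise. With real data, the per-bin values could be taken from hist(df$values, breaks = bin, plot = FALSE)$counts.

```r
# Hypothetical helper: TRUE for each bin whose value exceeds the rolling
# upper quantile of its neighbourhood (window.size bins on either side).
flag_high_bins <- function(counts, window.size = 50, prob = 0.95) {
  n <- length(counts)
  vapply(seq_len(n), function(i) {
    lo <- max(1, i - window.size)   # clamp the window at the vector ends
    hi <- min(n, i + window.size)
    counts[i] > quantile(counts[lo:hi], prob, names = FALSE)
  }, logical(1))
}

# Toy data (made up): noisy background with one obvious spike at bin 100
set.seed(1)
counts <- rpois(200, lambda = 5)
counts[100] <- 60

high <- flag_high_bins(counts)
which(high)   # indices of flagged bins; bin 100 is among them
```

A rolling quantile follows the rising background automatically, which is why a single global threshold is not needed. A stricter prob (say 0.99) flags fewer bins.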

Connecting grouped dots/points on a scatter plot based on distance

I have 2 sets of depth point measurements, for example:
> a
depth value
1 2 2
2 4 3
3 6 4
4 8 5
5 16 40
6 18 45
7 20 58
> b
depth value
1 10 10
2 12 20
3 14 35
I want to show both groups in one figure plotted with depth and with different symbols as you can see here
plot(a$value, a$depth, type='b', col='green', pch=15)
points(b$value, b$depth, type='b', col='red', pch=14)
The plot seems okay, but the annoying part is that the green symbols are all connected (though I do want connecting lines). I want a connection only where a group has continuous data points at 2 m intervals, i.e. the symbols should be connected with a line from 2 to 8 m (green), then the group B symbols should be connected from 10 to 14 m (red), and then the group A symbols should be connected again (green). In other words, I do NOT want to see a connection between the 8 m sample and the 16 m sample for group A.
An easy solution may be dividing group A into two parts (say, A-shallow and A-deep) and then plotting A-shallow, B, and A-deep separately. But this is completely impractical because I have thousands of data points with hundreds of groups, i.e. I have to produce many depth profiles. Therefore, there has to be a way to program this so that dots are NOT connected beyond a prescribed depth interval (e.g. 2 m in this case) for a particular group of samples. Any ideas?
If plot or lines encounters an NA value, it will automatically break the line. Using that, we can insert NA values for the missing measurements in your data, and that fixes the problem. One way is this:
rng<-range(range(a$depth), range(b$depth))
rng<-seq(rng[1], rng[2], by=2)
aa<-rep(NA, length(rng))
aa[match(a$depth, rng)]<-a$value
bb<-rep(NA, length(rng))
bb[match(b$depth, rng)]<-b$value
plot(aa, rng, type='b', col='green', pch=15)
points(bb, rng, type='b', col='red', pch=14)
Which produces
Note that this code assumes that all depth measurements are evenly divisible by 2.
I'm not sure if you really have separate data.frames for all of your groups, but there may be better ways to fill in missing values depending on your real data structure.
We can use the fact that lines will put a break in wherever there is an NA, as MrFlick suggests. There might be a simpler way, though:
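# Merge the two sets together; all=TRUE keeps every depth, with NA where a group has no measurement
merged = merge(a, b, by='depth', all=TRUE)
# Plot the lines (value.x comes from a, value.y from b)
plot(merged$value.x, merged$depth, type='b', col='green', pch=15)
points(merged$value.y, merged$depth, type='b', col='red', pch=14)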
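The merge only leaves an NA where another group happens to have a measurement in the gap. For many groups, a more direct option is to insert an NA row wherever consecutive depths within one group are further apart than the allowed interval. This is a minimal sketch; with_gaps and max.gap are hypothetical names, and the data frames are the a and b from the question.

```r
a <- data.frame(depth = c(2, 4, 6, 8, 16, 18, 20),
                value = c(2, 3, 4, 5, 40, 45, 58))
b <- data.frame(depth = c(10, 12, 14), value = c(10, 20, 35))

# Hypothetical helper: add an all-NA row after every row that is followed
# by a depth jump larger than max.gap, so type='b' breaks the line there.
with_gaps <- function(d, max.gap = 2) {
  d <- d[order(d$depth), ]
  n <- nrow(d)
  gap.after <- which(diff(d$depth) > max.gap)   # rows followed by a gap
  slots <- sort(c(seq_len(n), gap.after + 0.5)) # half-steps mark NA rows
  out <- d[match(slots, seq_len(n)), ]          # unmatched slots give NA rows
  rownames(out) <- NULL
  out
}

a2 <- with_gaps(a)   # NA row inserted between 8 m and 16 m
b2 <- with_gaps(b)   # no gaps, so unchanged

plot(a2$value, a2$depth, type = 'b', col = 'green', pch = 15)
points(b2$value, b2$depth, type = 'b', col = 'red', pch = 14)
```

Because the helper works on one data frame at a time, it could be applied per group with lapply over a list of groups, without assuming that depths are multiples of 2.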

Retain relationship between x and y data values in R plots

I'm fairly new to R, but I am trying to create line graphs that monitor the growth of bacteria over time. I can successfully do this, but the resulting graph isn't to my satisfaction. This is because my time increments are not evenly spaced, yet R plots them as if they were. Here is some sample data to give you an idea of what I'm talking about.
x=c(.1,.5,.6,.7,.7)
plot(x,type="o",xaxt="n",xlab="Time (hours)",ylab="Growth")
axis(1,at=1:5,lab=c(0,24,72,96,120))
As you can see, there are 48 hours between 24 and 72, but the points are evenly spaced on the graph. Is there any way I can adjust the scale to display my data more accurately?
It's always best in R to use data structures that exhibit the relationships between your data. Instead of defining growth and time as two separate vectors, use a data frame:
growth <- c(.1,.5,.6,.7,.7)
time <- c(0,24,72,96,120)
df <- data.frame(time,growth)
print(df)
time growth
1 0 0.1
2 24 0.5
3 72 0.6
4 96 0.7
5 120 0.7
plot(df, type="o")
Not sure if this produces the exact x-axis labels that you want, but you should be free to edit the graph now without changing the relationship between the growth and time variables.
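x=data.frame(x=c(.1,.5,.6,.7,.7), y=c(0,24,72,96,120))  # column x holds growth, column y holds time
plot(x$y, x$x,type="o",xaxt="n",xlab="Time (hours)",ylab="Growth")
axis(1,at=x$y)  # xaxt="n" suppresses the default axis, so draw ticks at the actual time points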

Box plot in r: 3 time points, 2 treatments with 2 factors on the same graph

I would like to show a simple box plot of abundance at 3 time points (0, 7 and 28). I want to split the plots by treatment (i.e. CO2 level/temperature), nested within each time point. Essentially, I have 2 box plots per time point, one for each of the 2 treatments. I was going to use an overlay, but because I have 2 box plots for each time point I am finding it tricky to work out the correct code.
Thanks

Plot points for every 15 minutes

I have a text file containing numbers (of float type) that represent times in seconds. I wish to plot the number of occurrences in every 15-minute interval. A sample of my file:
0.128766
2.888977
25.087900
102.787657
400.654768
879.090874
903.786754
1367.098789
1456.678567
1786.564569
1909.567567
For the first 900 seconds (15 minutes), there are 6 occurrences. I want to plot that point on the y-axis first. Then from 900 to 1800 (the next 15 minutes), there are 4 occurrences, so I want to plot 4 on the y-axis next. This should go on...
I know the basic plot() function, but I don't know how to plot counts for every 15 minutes. If there is a link that explains this, please point me to it.
Use findInterval():
counts <- table(findInterval(x, seq(0, max(x), 900)))
counts
1 2 3
6 4 1
It's easy to plot:
plot(counts)
To build on Andrie's answer: you can use plot(counts, type = 'p') to plot points or plot(counts, type = 'l') to plot a connected line. If you want to plot a fitted curve for the counts, you would need to model it, for example using ?lm or ?nls.
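One caveat with table(findInterval(...)) is that a 15-minute interval with no occurrences is dropped from the table rather than shown as 0. The following sketch keeps empty intervals by using tabulate() instead; the vector x is typed in from the sample in the question (with a real file it could be read with scan()), and the axis labels are only a suggestion.

```r
# Sample times in seconds, copied from the question
x <- c(0.128766, 2.888977, 25.087900, 102.787657, 400.654768, 879.090874,
       903.786754, 1367.098789, 1456.678567, 1786.564569, 1909.567567)

breaks <- seq(0, max(x), by = 900)   # interval starts: 0, 900, 1800

# tabulate() counts every interval index from 1 to nbins, including empty ones
counts <- tabulate(findInterval(x, breaks), nbins = length(breaks))
counts                               # 6 4 1

# Put the real interval start times on the x-axis instead of the index
plot(breaks / 60, counts, type = "b",
     xlab = "Interval start (minutes)", ylab = "Occurrences")
```

Plotting against breaks / 60 keeps the x-axis in real time units, so a quiet interval shows up as a point at zero instead of silently shifting the later points left.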
