Grouping extreme value bins into one "> x" bin - r

Does a function / method already exist to determine the frequency of data greater than some value? Similar to the Excel frequency distribution, I would like to group extreme values into the last bin (e.g., >120 as in image). I have been doing this manually by first using the hist function and then summing the counts for breaks greater than a given value.

Here's one option:
d <- rlnorm(1000, 3)
d.cut <- cut(d, c(seq(0, 120, 10), Inf))
hist(as.numeric(d.cut), breaks=0:13, xaxt='n', xlab='',
col=1, border=0, main='', cex.axis=0.8, las=1)
axis(1, at=0:13, labels=c(seq(0, 120, 10), '>120'), cex.axis=0.8)
box()
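If only the counts are needed rather than a plot, the same cut() call can be tabulated directly. This is a minimal sketch, assuming the same kind of simulated data as above; brks and bin.labels are names introduced here for illustration, and the seed is arbitrary.

```r
set.seed(1)                       # arbitrary seed, only for reproducibility
d <- rlnorm(1000, 3)

# Breaks every 10 up to 120, then one open-ended bin for everything above
brks <- c(seq(0, 120, 10), Inf)
bin.labels <- c(paste(seq(0, 110, 10), seq(10, 120, 10), sep = "-"), ">120")

d.cut <- cut(d, breaks = brks, labels = bin.labels)
freq <- table(d.cut)
freq

# The last bin holds every value above 120
as.integer(freq[">120"]) == sum(d > 120)
```

The resulting table can also be passed straight to barplot() if a plot is wanted after all.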

Related

misplaced label on scatter plot data

I am quite new to R and was wondering if anyone could help with this problem:
I am trying to graph a set of data. I use plot to plot the scatter data and use text to add labels to the values. However the last label is misplaced on the graph and I can't figure out why. Below is the code:
#specify the dataset
x<-c(1:10)
#find p: the percentile of each data in the dataset
y=quantile(x, probs=seq(0,1,0.1), na.rm=FALSE, type=5)
#print the values of p
y
#plot p against x
plot(y, tck=0.02, main="Percentile Graph of Dataset D", xlab="Data of the dataset", ylab="Percentile", xlim=c(0, 11), ylim=c(0, 11), pch=10, seq(1, 11, 1), col="blue", las=1, cex.lab=0.9, cex.axis=0.9, cex.main=0.9)
#change the x-axis scale
axis(1, seq(1, 11, 1), tck=0.02)
#draw disconnected line segments
abline(h = 1:11, v = 1:11, col = "#EDEDED")
#Add data labels to the graph
text(y, x, labels= (y), cex=0.6, pos=1, col="red")
Your probs request returns 11 values, but you only have 10 x values. In text(y, x, ...) the 10 values of x supply the vertical coordinates, so R recycles them and the 11th label is plotted at y = 1. How to fix this depends on what you are trying to do. Perhaps in your probs sequence you want seq(0, 1, length.out = 10)?
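The length mismatch is easy to check directly. The sketch below assumes the goal is one percentile per data point, which may not be what the asker intended; y10 is a name introduced here for illustration.

```r
x <- 1:10
y <- quantile(x, probs = seq(0, 1, 0.1), type = 5)
length(x)   # 10
length(y)   # 11

# One possible fix: request exactly as many quantiles as there are data points
y10 <- quantile(x, probs = seq(0, 1, length.out = length(x)), type = 5)

plot(x, y10, pch = 10, col = "blue", las = 1,
     xlab = "Data of the dataset", ylab = "Percentile")
# Both coordinate vectors now have the same length, so nothing is recycled
text(x, y10, labels = round(y10, 2), cex = 0.6, pos = 1, col = "red")
```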

Create sample vector data in R with a skewed distribution with limited range

I want to create in R a sample vector of data in R, in which I can control the range of values selected, so I think I want to use sample to limit the range of values generated rather than an rnorm-type command that generates a range of values based upon the type of distribution, variance, SD, etc.
So I'm looking to do a sample with a specified range (e.g. 1-5) for a skewed distribution something like this:
x=rexp(100,1/10)
Here's what I have but does not provide a skewed distribution:
y=sample(1:5,234, replace=T)
How can I have my cake (limited range) and eat it too (skewed distribution), so to speak.
Thanks
set.seed(3)
hist(sample(1:10, size = 100, replace = TRUE, prob = 10:1))
The beta distribution takes values from 0 to 1. If you want your values to be from 0 to 5 for instance, then you can multiply them by 5. Finally, you can get a "skewness" with the beta distribution.
For example, for the skewness you can get these three types:
And using R and beta distribution you can get similar distributions as follows. Notice that the Green Vertical line refers to mean and the Red to median:
x= rbeta(10000,5,2)
hist(x, main="Negative or Left Skewness", freq=FALSE)
lines(density(x), col='red', lwd=3)
abline(v = c(mean(x),median(x)), col=c("green", "red"), lty=c(2,2), lwd=c(3, 3))
x= rbeta(10000,2,5)
hist(x, main="Positive or Right Skewness", freq=FALSE)
lines(density(x), col='red', lwd=3)
abline(v = c(mean(x),median(x)), col=c("green", "red"), lty=c(2,2), lwd=c(3, 3))
x= rbeta(10000,5,5)
hist(x, main="Symmetrical", freq=FALSE)
lines(density(x), col='red', lwd=3)
abline(v = c(mean(x),median(x)), col=c("green", "red"), lty=c(2,2), lwd=c(3, 3))
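To land on a range such as 1 to 5 instead of 0 to 1, the beta draws can be shifted and stretched. This is a sketch under assumed values: lo, hi and the shape parameters are illustrative choices, and rounding to integers is just one option.

```r
set.seed(3)
lo <- 1; hi <- 5                     # hypothetical target range

# rbeta() returns values between 0 and 1; shift and stretch them onto lo..hi
x <- lo + (hi - lo) * rbeta(10000, 2, 5)   # shape1 < shape2 gives right skew
range(x)
mean(x) > median(x)                  # typical of a right-skewed sample

hist(x, main = "Right-skewed values between 1 and 5", freq = FALSE)

# If integers are required, rounding is one simple (assumed) option
xi <- round(x)
table(xi)
```

Swapping the two shape parameters mirrors the skew to the left.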
To better see what the sample function is doing with integers, use the barplot function, not the histogram function:
set.seed(3)
barplot(table(sample(1:10, size = 100, replace = TRUE, prob = 10:1)))

Plotting a histogram with custom breaks

I have a vector like:
K <- rnorm(10000, mean=100)
I want to create a histogram of K with custom breaks (and labels) like <20, 20-50, 50-75, 75-99, =100, >400, etc.
Any ideas?
With base plotting, you might be better off cutting the vector first, and then using barplot or the plot method for tables.
For example:
K <- rnorm(10000, mean=100, sd = 100)
K.cut <- cut(K, c(-Inf, 20, 50, 75, 100, 400, Inf))
plot(table(K.cut), xaxt='n', ylab='K')
axis(1, at=1:6, labels=c('< 20', '20-50', '50-75', '75-100', '100-400', '> 400'))
box(bty='L')
xax <- barplot(table(K.cut), xaxt='n')
axis(1, at=xax, labels=c('< 20', '20-50', '50-75', '75-100', '100-400', '> 400'))
box(bty='L')
Note that by default, cut includes the upper (but not the lower) bound in each bin, so for example the 20-50 bin includes any 50s, but the 20s will be included in the lower adjacent bin.
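The boundary behaviour can be checked on a few values that sit exactly on the breaks. A small sketch; v and brks are illustrative names, and right = FALSE is the standard cut() argument for flipping which side is closed.

```r
v <- c(20, 50, 75)
brks <- c(-Inf, 20, 50, 75, 100, 400, Inf)

# Default (right = TRUE): each bin includes its upper bound,
# so 50 lands in the 20-50 bin
right.closed <- cut(v, brks)
as.character(right.closed)

# right = FALSE: each bin includes its lower bound instead,
# so 50 lands in the 50-75 bin
left.closed <- cut(v, brks, right = FALSE)
as.character(left.closed)
```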
Try the ggplot2 version:
library(ggplot2)
ggplot()+ geom_histogram(aes(K))
Many options are available for tweaking.

Adding nonlinear line with stacked bar plot in r

I would like to add a curved line to fit the dark bars of this supply cost curve (like the red line that appears in image). The height of the dark bars represent the range in uncertainty in their costs (costrange). I am using fully transparent values (costtrans) to stack the bars above a certain level
This is my code:
costtrans<-c(10,10,20,28,30,37,50,50,55,66,67,70)
costrange<-c(15,30,50,21,50,20,30,40,45,29,30,20)
cost3<-table(costtrans,costrange)
cost3<-c(10,15,10,30,20,50,28,21,30,50,37,20,50,30,50,40,55,45,66,29,67,30,70,20)
costmat <- matrix(data=cost3,ncol=12,byrow=FALSE)
Dark <- rgb(99/255,99/255,99/250,1)
Transparent<-rgb(99/255,99/255,99/250,0)
production<-c(31.6,40.9,3.7,3.7,1,0.3,1.105,0.5,2.3,0.7,0.926,0.9)
par(xaxs='i',yaxs='i')
par(mar=c(4, 6, 4, 4))
barplot(costmat,production, space=0, main="Supply Curve", col=c(Transparent, Dark), border=NA, xlab="Quantity", xlim=c(0,100),ylim=c(0, 110), ylab="Supply Cost", las=1, bty="l", cex.lab=1.25,axes=FALSE)
axis(1, at=seq(0,100, by=5), las=1, cex.axis=1.25)
axis(2, at=seq(0,110, by=10), las=1, cex.axis=1.25)
Image to describe what I am looking for:
I guess it really depends how you want to calculate the line...
One first option would be:
# Save the barplot coordinates into a variable
bp <- barplot(costmat,production, space=0, main="Supply Curve",
col=c(Transparent, Dark), border=NA, xlab="Quantity",
xlim=c(0,100), ylim=c(0, 110), ylab="Supply Cost", las=1,
bty="l", cex.lab=1.25,axes=FALSE)
axis(1, at=seq(0,100, by=5), las=1, cex.axis=1.25)
axis(2, at=seq(0,110, by=10), las=1, cex.axis=1.25)
# Find the mean y value for each box
mean.cost <- (costmat[1,]+colSums(costmat))/2
# Add a line through the points
lines(bp, mean.cost, col="red", lwd=2)
This gives a red line through the middle of each dark bar.
For a smoother line, you could fit some sort of regression, for instance a LOESS regression:
# Perform a LOESS regression
# To allow for extrapolation, you may want to add
# control = loess.control(surface = "direct")
model <- loess(mean.cost~bp, span=1)
# Predict values in the 0:100 range.
# Note that, unless you allow extrapolation (see above)
# by default only values in the range of the original data
# will be predicted.
pr <- predict(model, newdata=data.frame(bp=0:100))
lines(0:100, pr, col="red", lwd=2)
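The extrapolation remark in the comments above can be seen without drawing anything. The sketch below rebuilds the bar midpoints by hand, on the assumption that barplot() with width = production and space = 0 places them at the cumulative widths minus half a width; fit.default, fit.direct and grid are names introduced here.

```r
production <- c(31.6, 40.9, 3.7, 3.7, 1, 0.3, 1.105, 0.5, 2.3, 0.7, 0.926, 0.9)
costtrans  <- c(10, 10, 20, 28, 30, 37, 50, 50, 55, 66, 67, 70)
costrange  <- c(15, 30, 50, 21, 50, 20, 30, 40, 45, 29, 30, 20)

# Bar midpoints, assumed to match what barplot() returns for these widths
bp <- cumsum(production) - production / 2
mean.cost <- costtrans + costrange / 2    # middle of each dark bar

fit.default <- loess(mean.cost ~ bp, span = 1)
fit.direct  <- loess(mean.cost ~ bp, span = 1,
                     control = loess.control(surface = "direct"))

grid <- data.frame(bp = 0:100)
pr.default <- predict(fit.default, newdata = grid)
pr.direct  <- predict(fit.direct,  newdata = grid)

# The default fit returns NA outside the observed bp range;
# the direct fit extrapolates and returns a value everywhere
anyNA(pr.default)
anyNA(pr.direct)
```

Extrapolated values far from the data should be treated with caution, since LOESS has little information there.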

Change the number of tick marks on a figure in R

I created a figure of two plots (two years) of climate data (temp and precip) that looks exactly like I want it, except that one of my axes has too many tick marks. With everything I have going on with this figure, I can't find a way to specify fewer tick marks without messing up other parts. I would also like to specify where the tick marks are. Here is the figure:
You can see that the tick marks for the top axis just blur together and the numbers chosen are not very meaningful to me. How can I tell R what I really want?
Here are the datasets I am using: cobs10 and cobs11.
And here is my code:
par(mfrow=c(2,1))
par(mar = c(5,4,4,4) + 0.3)
plot(cobs10$day, cobs10$temp, type="l", col="red", yaxt="n", xlab="", ylab="",
ylim=c(-25, 30))
axis(side=3, col="black", at=cobs10$day, labels=cobs10$gdd)
at = axTicks(3)
mtext("Thermal Units", side=3, las=0, line = 3)
axis(side=2, col='red', labels=FALSE)
at= axTicks(2)
mtext(side=2, text= at, at = at, col = "red", line = 1, las=0)
mtext("Temperature (C)", side=2, las=0, line=3)
par(new=TRUE)
plot(cobs10$gdd, cobs10$precip, type="h", col="blue", yaxt="n", xaxt="n", ylab="",
xlab="")
axis(side=4, col='blue', labels=FALSE)
at = axTicks(4)
mtext(side = 4, text = at, at = at, col = "blue", line = 1,las=0)
mtext("Precipitation (cm)", side=4, las=0, line = 3)
par(mar = c(5,4,4,4) + 0.3)
plot(cobs11$day, cobs11$temp, type="l", col="red", yaxt="n", xlab="Day of Year",
ylab="", ylim=c(-25, 30))
axis(side=3, col="black", at=cobs11$day, labels=cobs11$gdd)
at = axTicks(3)
mtext("", side=3, las=0, line = 3)
axis(side=2, col='red', labels=FALSE)
at= axTicks(2)
mtext(side=2, text= at, at = at, col = "red", line = 1, las=0)
mtext("Temperature (C)", side=2, las=0, line=3)
par(new=TRUE)
plot(cobs11$gdd, cobs11$precip, type="h", col="blue", yaxt="n", xaxt="n", ylab="",
xlab="", ylim=c(0,12))
axis(side=4, col='blue', labels=FALSE)
at = axTicks(4)
mtext(side = 4, text = at, at = at, col = "blue", line = 1,las=0)
mtext("Precipitation (cm)", side=4, las=0, line = 3)
Thanks for thinking about it.
You've pretty much got the solution already:
axis(side=3, col="black", at=cobs10$day, labels=cobs10$gdd)
Except, you are asking to have ticks and labels at every single entry.
Take a look at the function pretty:
at <- pretty(cobs10$day)
at
# [1] 0 100 200 300 400
These are where the ticks should be placed on the x-axis. Now you need to find the corresponding labels. This is not straightforward, but we will get:
lbl <- which(cobs10$day %in% at)
lbl
# [1] 100 200 300
lbl <- c(0, cobs10$gdd[lbl])
axis(side=3, at=at[-5], labels=lbl)
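The same lookup can be written with match(), which keeps tick positions and labels aligned even when some pretty() values have no matching day. This is a sketch on synthetic data standing in for cobs10 (the real file is not reproduced here); unlike the code above, it drops unmatched ticks rather than labelling position 0 by hand.

```r
# Synthetic stand-in for cobs10: one row per day, gdd accumulating over the year
cobs <- data.frame(day = 1:365)
cobs$gdd <- round(cumsum(1 + 10 * sin(pi * cobs$day / 365)^2))

at   <- pretty(cobs$day)         # candidate tick positions
idx  <- match(at, cobs$day)      # row of each tick, NA if that day is absent
keep <- !is.na(idx)

plot(cobs$day, cobs$gdd, type = "l", xlab = "Day of year", ylab = "gdd")
# Ticks only where a day exists, labelled with the gdd value on that day
axis(side = 3, at = at[keep], labels = cobs$gdd[idx[keep]])
```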
Update
I've been a bit annoyed by your use of three different series in a single plot. There are many reasons this is troublesome.
Having two y-axes is always troublesome; see this article from Stephen Few (go to page 5 for my favorite example). In your case it is not that serious, due to the nature of the plots and your use of colours to indicate which y-axis the values belong to. But still, on principle.
Axis ticks should follow a fixed function, e.g. linear or logarithmic. With your Thermal Units, they appear "randomly" (I know that is not the case, but to an outsider they do).
We have to do something about your x-axis ticks, which just refer to "day of year".
First up, we take a look at your data and see what can be done naively. We recognize that your ''date'' variable is actual dates. Let's exploit it and make R aware of it!
cobs10 <- read.table('cobs10.txt',as.is=TRUE)
cobs10$date <- as.Date(cobs10$date)
plot(temp ~ date, data=cobs10, type='l')
Here, I really like the x-axis ticks and had some trouble replicating it. ''pretty'' on dates insisted on either 4 ticks or 12 ticks. But we will come back to that later.
Next, we can do something about the overlay plotting. Here I use ''par(mfrow=c(3,1))'' to instruct R to stack three plots in a single window; with these multiple plots we can differentiate between inner and outer margins. The ''mar'' and ''oma'' arguments refer to the inner and outer margins.
Let's put all three variables together!
par(mfrow=c(3,1), mar=c(0.6, 5.1, 0, 0.6), oma=c(5.1, 0, 1, 0))
plot(temp ~ date, data=cobs10, type='l', ylab='Temperature (C)')
plot(precip ~ date, data=cobs10, type='l', ylab='Precipitation (cm)')
plot(gdd ~ date, data=cobs10, type='l', ylab='Thermal units')
This looks okay, but the x-axis ticks of each plot sit on top of the plot below it. Not good. Naturally, we can disable ticks in the first two plots (with ''plot(..., xaxt='n')''), but this will distort the bottom plot. So you will need to do so for all three plots and then add the axis to the outer plotting region.
par(mfrow=c(3,1), mar=c(0.6, 5.1, 0, 0.6), oma=c(5.1, 0, 1, 0))
plot(temp ~ date, data=cobs10, type='l', xaxt='n', ylab='Temperature (C)')
plot(precip ~ date, data=cobs10, type='l', xaxt='n', ylab='Precipitation (cm)')
plot(gdd ~ date, data=cobs10, type='l', xaxt='n', ylab='Thermal units')
ticks <- seq(from=min(cobs10$date), by='2 months', length=7)
lbl <- strftime(ticks, '%b')
axis(side=1, outer=TRUE, at=ticks, labels=lbl)
mtext('2010', side=1, outer=TRUE, line=3, cex=0.67)
Since ''pretty'' doesn't behave as we want it to, we use ''seq'' to make the sequence of x-axis ticks. Then we format the dates to display just the abbreviated month name; note that this depends on the locale settings (I live in Denmark), see ''locale''.
To add the axis-ticks and a label to the outer region, we must remember to specify ''outer=TRUE''; otherwise it is added to the last subplot.
Also note that I specified ''cex=0.67'' to match the font size of the x-axis to the y-axis.
Now I agree that displaying the thermal units in an individual subplot is not optimal, although it is the correct way of displaying them. But there was the issue with the ticks. What we really want is to show some nice values that make it clear they are not linearly spaced. Your data does not necessarily contain these nice values, so we will have to interpolate them ourselves.
For this, I use ''splinefun'':
lbl <- c(0, 2, 200, 1000, 2000, 3000, 4000)
thermals <- splinefun(cobs10$gdd, cobs10$date) # thermals is a function that returns the date (as an integer) for a requested value
thermals(lbl)
## [1] 14649.00 14686.79 14709.55 14761.28 14806.04 14847.68 14908.45
ticks <- as.Date(thermals(lbl), origin='1970-01-01') # remember to specify an origin when converting an integer to a Date.
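The inverse interpolation can be tried on synthetic data. This is a sketch only: the day, date and gdd vectors below are made up to stand in for cobs10, gdd is constructed to be strictly increasing so that every requested value maps to a single date, and inv is a name introduced here.

```r
day  <- 1:365
date <- as.Date("2010-01-01") + day - 1
gdd  <- cumsum(1 + 10 * sin(pi * day / 365)^2)   # strictly increasing

# Swap the roles of x and y: feed in a gdd value, get back a (numeric) date
inv <- splinefun(gdd, as.numeric(date))

lbl   <- c(200, 500, 1000, 1500, 2000)           # "nice" thermal-unit values
ticks <- as.Date(inv(lbl), origin = "1970-01-01")
ticks

plot(date, gdd, type = "l", xlab = "2010", ylab = "Thermal units")
# Unevenly spaced top-axis ticks, placed where gdd reaches each nice value
axis(side = 3, at = ticks, labels = lbl)
```

With real data that contains flat stretches (repeated gdd values), the inverse is not unique, so the tick dates there are only approximate.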
Now that the thermal ticks are in place, let's try it.
par(mfrow=c(2,1), mar=c(0.6, 5.1, 0, 0.6), oma=c(5.1, 0, 4, 0))
plot(temp ~ date, data=cobs10, type='l', xaxt='n', ylab='Temperature (C)')
plot(precip ~ date, data=cobs10, type='l', xaxt='n', ylab='Precipitation (cm)')
usr <- par('usr')
x.pos <- (usr[2]+usr[1])/2
ticks <- seq(from=min(cobs10$date), by='2 months', length=7)
lbl <- strftime(ticks, '%b')
axis(side=1, outer=TRUE, at=ticks, labels=lbl)
mtext('2010', side=1, at=x.pos, line=3)
lbl <- c(0, 2, 200, 1000, 2000, 3000, 4000)
thermals <- splinefun(cobs10$gdd, cobs10$date) # thermals is a function that returns the date (as an integer) for a requested value
ticks <- as.Date(thermals(lbl), origin='1970-01-01') # remember to specify an origin when converting an integer to a Date.
axis(side=3, outer=TRUE, at=ticks, labels=lbl)
mtext('Thermal units', side=3, line=15, at=x.pos)
Update: I changed the mtext function calls in the last code block to ensure that the x-axis texts are centred on the plotting region, not the entire region. You might want to tweak the vertical position by changing the line argument.
