Box plot with only one Whisker line? - r

I wanted to compare a variable among two different places. For this I used the following code in R:
boxplot(variable~place)
My place variable has two levels. However, the box plot I get looks odd. Do you know what the problem might be?

As has been noted in comments, there's nothing wrong with the boxplots. The fact that there is just one whisker each is a reflection of your data.
Boxplots are an extremely instructive type of graphic: they not only show the location and spread of data but also indicate skewness. The interpretation of the boxplot depends on its ‘syntax’, i.e., its main graphical elements. There are five such elements:
1. bold horizontal line: depicts the median (the middle value of a sorted distribution)
2. box: represents the interquartile range (IQR)
3. whiskers: separate values lying outside the IQR (but still somewhat typical of the data) from outliers
4. empty circles: depict outliers (values surprisingly large or small given all values considered)
5. notches (optional): give a rough impression of the significance of the difference between the medians
The fact that there's just one whisker in each boxplot is, then, due to the extreme skewness of your data: in the case of box 1 the upper limit of the values is the upper limit of the IQR, and in the case of box 2 there exists no value smaller than the median!
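A small sketch of how this can happen, using made-up values rather than your data: when the upper quartile already equals the maximum, `boxplot.stats()` reports the same number for the upper hinge and the end of the upper whisker, so no upper whisker is drawn.

```r
# Hypothetical, strongly left-skewed sample: most values sit at the maximum.
x <- c(1, 2, 3, 5, 5, 5, 5, 5)

# stats = lower whisker end, lower hinge, median, upper hinge, upper whisker end
s <- boxplot.stats(x)
print(s$stats)

# The upper whisker has zero length when its end coincides with the upper hinge.
upper_whisker_missing <- s$stats[5] == s$stats[4]
print(upper_whisker_missing)

# The resulting plot shows a lower whisker only.
boxplot(x, main = "Only a lower whisker")
```

Running `boxplot.stats()` on each of your two groups should show the same pattern for whichever whisker is missing.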
Hope this helps.

Related

How to make a histogram smaller inside another line plot

I wish to make a plot just like the following:
plot(cp1cold,"overall",col=4,ylab="RR",xlab="Temperature",xlim=c(-5,35),
ylim=c(-0.5,3.5),axes=F,lwd=1.5)
lines(cp1hot,"overall",ci="area",col=2,lwd=1.5)
axis(1,at=-1:7*5)
axis(2,at=c(1:7*0.5))
title("Overall cumulative association and temperature distribution")
mtext("London 1993-2006",cex=0.75)
par(new=T)
hist(lndn$tmean,xlim=c(-5,35),ylim=c(0,1200),axes=F,ann=F,col=grey(0.95),breaks=30)
abline(v=quantile(lndn$tmean,c(0.01,0.99)),lty=2)
abline(v=cen,lty=3)
axis(4,at=0:4*100)
mtext("Freq",4,line=2.5,at=200,cex=0.8)
The code can be found here: https://github.com/gasparrini/2014_gasparrini_BMCmrm_Rcodedata/blob/master/03.graphs.R
However, no matter what I try, when I use my own data the plot comes out with a huge histogram.
Does anyone know how to make this histogram smaller?
Note: I have tried a different layout, putting each plot in its own row, but it was not aesthetically pleasing.
Your data produce bar heights that are too high for the histogram's y-axis. The upper limit (1200) is lower than your tallest bars, so the histogram exceeds its boundaries. You must raise the upper limit in "ylim=c(0,1200)".
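One way to pick that limit without trial and error is to compute it from the histogram itself. The sketch below uses simulated temperatures (the `tmean` vector and the factor of 3 are assumptions for illustration, not values from the linked code) and an empty placeholder plot in place of the exposure-response curve.

```r
set.seed(1)
# Hypothetical daily mean temperatures; stands in for lndn$tmean or your own data.
tmean <- rnorm(20000, mean = 12, sd = 6)

# Bin the data without drawing, to learn the height of the tallest bar.
h <- hist(tmean, breaks = 30, plot = FALSE)

# Make the y-axis three times the tallest bar, so the histogram
# fills roughly the bottom third of the plotting region.
y_upper <- 3 * max(h$counts)

# Placeholder for the main curve (the plot()/lines() calls in the question).
plot(c(-5, 35), c(-0.5, 3.5), type = "n", xlab = "Temperature", ylab = "RR")

# Overlay the histogram on the same device with its own y scale.
par(new = TRUE)
hist(tmean, xlim = c(-5, 35), ylim = c(0, y_upper), breaks = 30,
     axes = FALSE, ann = FALSE, col = grey(0.95))
axis(4)
```

A larger multiplier shrinks the histogram further; a multiplier of 1 lets it fill the whole panel.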

Is it possible to edit the numbers displayed on an axis without moving the points that were plotted?

I've come up with a graph (a scatterplot) of log(1+inf) (inf = number of people infected with a given disease) on the y-axis against one of the explanatory variables in my model, in this case the population density (pop./km²), on the x-axis. The log transformation was used merely for visualization, because it spreads out the distribution of the data and allows for more aesthetically appealing plots. Basically, what I want is for both axes to show the values of the variables before the log transformation. The dots need to be plotted as in plot(log(1+inf),log(populational_density)), but the numbers on the axes should refer to plot(inf,populational_density). I've provided a picture of my graph with some manual editing on the y-axis to show you the idea of what I want.
The numbers in red would be the 'inf' values equivalent to log(inf).
Please bear in mind that those values in red do not correspond to reality.
I understand the whole concept of y = f(x), but I've been asked to provide it. Is this possible? I'm using the ggplot2 package for plotting.
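One possible approach, sketched under assumptions: keep plotting the transformed values, but place the axis breaks at the transformed positions and label them with the original values. The data frame `dat` and the tick values below are made up for illustration; only `scale_x_continuous()` and `scale_y_continuous()` with their `breaks` and `labels` arguments are relied on.

```r
library(ggplot2)

set.seed(42)
# Hypothetical data standing in for the variables described in the question.
dat <- data.frame(
  populational_density = exp(runif(100, 0, 8)),
  inf = rpois(100, lambda = 50)
)

# Tick values on the original scale (chosen by hand for this example).
inf_ticks  <- c(0, 10, 30, 100)
dens_ticks <- c(1, 10, 100, 1000)

# Points stay at their log positions; only the tick labels change.
p <- ggplot(dat, aes(x = log(populational_density), y = log1p(inf))) +
  geom_point() +
  scale_x_continuous(breaks = log(dens_ticks), labels = dens_ticks) +
  scale_y_continuous(breaks = log1p(inf_ticks), labels = inf_ticks) +
  labs(x = "Population density (pop./km^2)", y = "Infected")
print(p)
```

Here `log1p(x)` is the same as `log(1 + x)`, so a label such as 10 sits exactly where a point with inf = 10 is drawn.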

Confidence interval square in a plot with one variable in each axis in ggplot

Although it might sound easy at first, I do not have a scatterplot, and I think that is what makes this question challenging. I have this plot, which comes from this question.
In short, each axis represents a variable that is not connected to the other. It is not an XY scatterplot, as you can see.
I would like to know whether it is possible to draw the 95% confidence interval for the mean of both variables, and to draw a square in the middle of the plot representing the overlapping area of both datasets.
The result might look similar to this, bearing in mind that the 95% CLs shown do not correspond to reality (they only illustrate how it might appear):
Here is another question that deals with this situation, but without using ggplot.

ggplot draw multiple plots by levels of a variable

I have a sample dataset
d=data.frame(n=rep(c(1,1,1,1,1,1,2,2,2,3),2),group=rep(c("A","B"),each=20),stringsAsFactors = F)
And I want to draw two separate histograms based on group variable.
I tried this method suggested by @jenesaisquoi in a separate post here:
Generating Multiple Plots in ggplot by Factor
ggplot(data=d)+geom_histogram(aes(x=n,y=..count../sum(..count..)),binwidth = 1)+facet_wrap(~group)
It did the trick but if you look closely, the proportions are wrong. It didn't calculate the proportion for each group but rather a grand proportion. I want the proportion to be 0.6 for number 1 for each group, not 0.3.
Then I tried the dplyr package, and it didn't even create two graphs: it ignored the group_by command. The proportion is right this time, though.
d%>%group_by(group)%>%ggplot(data=.)+geom_histogram(aes(x=n,y=..count../sum(..count..)),binwidth = 1)
Finally I tried factoring with color
ggplot(data=d)+geom_histogram(aes(x=n,y=..count../sum(..count..),color=group),binwidth = 1)
But the result is far from ideal. I was going to accept one output but with the bins side by side, not on top of each other.
In conclusion, I want to draw two separate histograms with correct proportions calculated within each group. If there is no easy way to do this, I can live with one graph but having the bins side by side, and with correct proportions for each group. In this example, number 1 should have 0.6 as its proportion.
Changing ..count../sum(..count..) to ..density.. gives you the desired proportion:
ggplot(data=d)+geom_histogram(aes(x=n,y=..density..),binwidth = 1)+facet_wrap(~group)
You actually have the separation of charts by variable correct! Especially with ggplot, you sometimes need to consider the scales of the graph separately from the shape. facet_wrap splits your data into panels regardless of scale, so it behaves the same no matter what your axes are. You could also try adding scale_y_log10() as a layer: you'll notice that the overall shape and style of your graph stay the same and only the axes change.
What you actually need is a fix to your scales. Understandable: frequency plots can be confusing. ..count../sum(..count..) treats each bin as an independent unit, regardless of its value. See a good explanation of this here: Show % instead of counts in charts of categorical variables
What you want is ..density.., which is the count divided by the total count and by the bin width, so with binwidth = 1 it equals the proportion within each panel. The difference is subtle in principle, but the important bit is that the value on the x-axis matters. For an extreme case of this, see here: Normalizing y-axis in histograms in R ggplot to proportion, where tiny x-axis values produced huge densities.
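The arithmetic can be checked by hand with base R on the sample data from the question. This is only a sketch of the numbers behind the two aesthetics, not of ggplot2's internals; the variable names other than `d` are made up.

```r
# Sample data from the question.
d <- data.frame(n = rep(c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3), 2),
                group = rep(c("A", "B"), each = 20),
                stringsAsFactors = FALSE)

# Counts per group (rows) and value of n (columns).
counts <- table(d$group, d$n)

# Dividing by the grand total reproduces the 0.3 seen with facet_wrap.
grand_prop <- counts / sum(counts)

# Dividing by each group's own total gives the wanted 0.6.
within_prop <- prop.table(counts, margin = 1)

# Density is count / (group total * bin width); with binwidth = 1
# it coincides with the within-group proportion.
binwidth <- 1
dens <- counts / (rowSums(counts) * binwidth)

print(grand_prop)
print(within_prop)
```

With any other bin width the density and the proportion differ by that factor, which is why densities can exceed 1 for very narrow bins.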
Your original code will still work, just substituting the aesthetics I described above.
ggplot(data=d)+geom_histogram(aes(x=n,y=..density..),binwidth = 1)+facet_wrap(~group)
If you're still confused about density, so are lots of people. Hadley Wickham wrote a long piece about it; you can find it here: http://vita.had.co.nz/papers/density-estimation.pdf

Splitting lme residual plot into separate boxplots

Using the basic plot function (plot.intervals.lmList) from an lme model (called meef1), I produced a massive graph of boxplots. My vector v2andv3commoditycombined has 98 levels.
plot(meef1, v2andv3commoditycombined~resid(.))
I would like to separate by the grouping values of my variable v2andv3commoditycombined to either graph them separately, order them, or exclude some. I'm not sure if there is code to do this or if I have to extract information from the lme output. If that is the case, I'm not sure what to extract to create the boxplots as extracting the residuals returns only one value for each level. If this is impossible, any advice on how to space out the commodity names would be equally helpful.
Thank you.
For each level of v2andv3commoditycombined, what exactly would you like your Y axis and your X axis to be? Since you're splitting the plots by v2andv3commoditycombined, you obviously can't also use that as one of your axes.
Let's pretend you just want to do the traditional residuals on the Y axis and fitted values on the X axis, in a separate plot for each of the 98 levels. You can change the code to plot whatever it is you actually want to plot.
As per ?plot.lme, you would do something like this:
plot(meef1,resid(.,type='pearson',level=1)~fitted(.,level=1)|v2andv3commoditycombined);
Make sure you stretch out your plot window beforehand so that it's nice and big, otherwise you might get an error saying something about margins. The following might produce a better-looking plot:
plot(meef1,resid(.,type='pearson',level=1)~fitted(.,level=1)|v2andv3commoditycombined,pch='.',cex=1.5,abline=0);
Since it wasn't clear from your question I went ahead and assumed you're interested in the individual level residuals (i.e. how much each datapoint differs from the predicted value given its random variables), and that you have one level of nesting in your random formula. If you want population residuals (i.e. how much each datapoint differs from the average predicted value), change both instances of level to say level=0. If you have K levels of nesting, change them to level=K and good luck.
I also assumed you wanted standardized residuals (because you can use the convenient rule of thumb that absolute values greater than 3 are possible outliers, regardless of what scale the original data are on). If not, see ?residuals.lme for other valid options for the type argument.
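To make the level and type options concrete, here is a sketch on the Orthodont data that ships with nlme. The model `fit` and the conditioning factor `Sex` are stand-ins for meef1 and v2andv3commoditycombined, not your actual model.

```r
library(nlme)

# Stand-in model: Orthodont ships with nlme; it is not the asker's data.
fit <- lme(distance ~ age, random = ~ 1 | Subject, data = Orthodont)

r_pop <- resid(fit, level = 0)                    # population residuals
r_ind <- resid(fit, level = 1)                    # individual-level residuals
r_std <- resid(fit, level = 1, type = "pearson")  # standardized residuals

# One residuals-vs-fitted panel per level of a grouping factor (here Sex).
print(plot(fit,
           resid(., type = "pearson", level = 1) ~ fitted(., level = 1) | Sex,
           abline = 0))

# Rule of thumb from above: |standardized residual| > 3 flags possible outliers.
possible_outliers <- which(abs(r_std) > 3)
print(possible_outliers)
```

Both kinds of raw residual add back to the same observed response when combined with the fitted values at the matching level, which is a quick sanity check that the levels are being read as intended.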
Oh, and the name of your variables suggests that you're looking at some sort of financial time series. If so, have a look at ACF(meef1) to see if there is a lot of autocorrelation. If there is, you could remedy it by instead fitting a model where the response (Y) variable is diff(...) the original variable. If you're seeing really skewed residuals, you might consider log-transforming your response variable before taking the diff.
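A minimal sketch of both suggestions, again using the Orthodont data from nlme as a stand-in for your model and a made-up three-point series for the transformation:

```r
library(nlme)

# Stand-in model on the Orthodont data shipped with nlme (not the asker's data).
fit <- lme(distance ~ age, random = ~ 1 | Subject, data = Orthodont)

# Empirical autocorrelation of the within-group residuals, up to lag 3.
acf_tab <- ACF(fit, maxLag = 3)
print(acf_tab)

# Differencing a log-transformed response, on a hypothetical series:
# diff(log(y)) is the log of the ratio between consecutive values.
y <- c(100, 110, 121)
log_diffs <- diff(log(y))
print(log_diffs)
```

For grouped data the differencing would need to be done within each group before refitting, which this sketch does not attempt.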
