I'm having trouble interpreting what binwidth means in ggplot2, and I'm looking for a precise definition of it.
For example:
#this example is taken from ggplot2: Elegant Graphics for Data Analysis
library(ggplot2)
qplot(percbelowpoverty, data = midwest,binwidth=1)
How do I interpret binwidth=1? What are its units? How does that relate to the number of bins that are calculated? I have no clue, and I'm not finding ?stat_bin to be helpful in answering my question:
binwidth
The width of the bins. Can be specified as a numeric value, or a function that calculates width from x. The default is to use 'bins' bins that cover the range of the data. You should always override this value, exploring multiple widths to find the best to illustrate the stories in your data.
The bin width of a date variable is the number of days in each time; the bin width of a time variable is the number of seconds.
Maybe I just don't know where to look for documentation of things like this, because I'm having difficulty understanding a number of related issues (such as what the "weight" aesthetic is all about).
I think I've answered my own question. I was having trouble because I misread the x-axis units. The percentages in the midwest$percwhite column are not stored as percentages (i.e., 96.7 is meant to be read as 96.7%, but as data it is just the number 96.7). That is why I was confused about how to interpret the binwidth argument. Now I see that it has the standard interpretation that MrFlick provided in the comments:
Setting binwidth=1 means each bin should be one x unit wide, e.g. (1-2], (2-3], (3-4], etc. The units are whatever units the midwest$percbelowpoverty values are in.
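To see how binwidth relates to the number of bins, here is a minimal sketch (the arithmetic is my illustration, not from the thread): the number of bins is roughly the range of the data divided by the binwidth.
library(ggplot2)
rng <- diff(range(midwest$percbelowpoverty))  # span of the data, in x units
rng / 1                                       # approximate number of bins when binwidth = 1
ggplot(midwest, aes(percbelowpoverty)) +
  geom_histogram(binwidth = 1)                # each bar spans one unit of percbelowpoverty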
Related
I am writing a module that creates a scatter plot from a 2 dimensional array of numbers provided by the user (x and y values). It is intended that the graph axis will be scaled to the value required to encompass all the input numbers, while also rounding to an aesthetically pleasing value. For example, if the maximum value entered is 4.56, I would like it to round the maximum axis value to 5. If the maximum value is 850, I'd like it to round the axis to 1000.
This initially seems like a simple task. Simply take the max value and round up. However, what makes it difficult is my module could be dealing with input values as small as 0.00000001 or as large as many billions.
Can anyone suggest a workflow for making this happen? I don't need the code itself, just the process required. The only way I have come up with is an extremely cumbersome iterative approach that still handles unusual values rather poorly.
Any advice would be most appreciated!
Thanks,
Greg
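One standard workflow (a sketch, not from the original thread; base R's pretty() uses a similar idea when it picks axis breaks): take the base-10 logarithm to find the order of magnitude, round the mantissa up to the next "nice" value such as 1, 2, 5 or 10, then scale back. The helper name round_up_nice is hypothetical.
round_up_nice <- function(x, nice = c(1, 2, 5, 10)) {
  exponent <- floor(log10(x))  # order of magnitude: 850 -> 2, 0.00000001 -> -8
  mantissa <- x / 10^exponent  # scaled into [1, 10): 850 -> 8.5
  nice[which(nice >= mantissa)[1]] * 10^exponent
}
round_up_nice(4.56)        # 5
round_up_nice(850)         # 1000
round_up_nice(0.00000001)  # 1e-08
Because log10() does the normalization, the same few lines handle values from 1e-8 up to billions without any iteration.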
I have a sample dataset
d=data.frame(n=rep(c(1,1,1,1,1,1,2,2,2,3),2),group=rep(c("A","B"),each=20),stringsAsFactors = F)
And I want to draw two separate histograms based on the group variable.
I tried the method suggested by @jenesaisquoi in a separate post here:
Generating Multiple Plots in ggplot by Factor
ggplot(data=d)+geom_histogram(aes(x=n,y=..count../sum(..count..)),binwidth = 1)+facet_wrap(~group)
It did the trick, but if you look closely, the proportions are wrong. It didn't calculate the proportion within each group but rather a grand proportion across both groups. I want the proportion to be 0.6 for the number 1 in each group, not 0.3.
Then I tried the dplyr package, and it didn't even create two graphs; it ignored the group_by command. The proportion is right this time, though.
d%>%group_by(group)%>%ggplot(data=.)+geom_histogram(aes(x=n,y=..count../sum(..count..)),binwidth = 1)
Finally I tried factoring with color:
ggplot(data=d)+geom_histogram(aes(x=n,y=..count../sum(..count..),color=group),binwidth = 1)
But the result is far from ideal. I would accept a single graph, but with the bars side by side rather than stacked on top of each other.
In conclusion, I want to draw two separate histograms with proportions calculated within each group. If there is no easy way to do that, I can live with one graph that has the bars side by side, with the correct proportions for each group. In this example, the number 1 should have a proportion of 0.6.
Changing ..count../sum(..count..) to ..density.. gives you the desired proportion. Density is the count divided by the total count times the bin width, so with binwidth = 1 it is exactly the within-group proportion:
ggplot(data=d)+geom_histogram(aes(x=n,y=..density..),binwidth = 1)+facet_wrap(~group)
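And if you would rather have the side-by-side bars you mentioned, a sketch along the same lines (ggplot2 computes the stat separately per fill group, so the proportions stay within-group):
ggplot(data=d)+geom_histogram(aes(x=n,y=..density..,fill=group),binwidth = 1,position = "dodge")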
You actually have the separation of charts by variable correct! Especially with ggplot, you sometimes need to consider the scales of the graph separately from its shape. facet_wrap() applies a new layer to your data regardless of scale: it behaves the same no matter what your axes are. You could also try adding scale_y_log10() as a layer, and you'll notice that the overall shape and style of your graph stay the same; you've only changed the axes.
What you actually need is a fix to your scales. Understandable - frequency plots can be confusing. ..count../sum(..count..) treats each bin as an independent unit regardless of its value, and the sum runs over all panels, which is why you got the grand proportion. See a good explanation of this here: Show % instead of counts in charts of categorical variables
What you want is ..density.., which is the count divided by the total count times the bin width, so with binwidth = 1 it is exactly the per-panel proportion. The difference is subtle in principle, but the important bit is that the value on the x-axis matters. For an extreme case of this, see here: Normalizing y-axis in histograms in R ggplot to proportion, where tiny x-axis values produced huge densities.
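To make that concrete with the data above (after data.frame() recycles n, d has 40 rows, 20 per group, and 12 rows with n == 1 in each group):
12 / 40        # ..count../sum(..count..): the sum runs over both panels -> 0.3
12 / (20 * 1)  # ..density.. with binwidth = 1, computed per panel       -> 0.6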
Your original code will still work, just substituting the aesthetics I described above.
ggplot(data=d)+geom_histogram(aes(x=n,y=..density..),binwidth = 1)+facet_wrap(~group)
If you're still confused about density, so are lots of people. Hadley Wickham wrote a long piece about it, you can find that here: http://vita.had.co.nz/papers/density-estimation.pdf
This seems like a simple question, but I can't seem to find an answer anywhere. In the R {wordcloud} package, the wordcloud function has a scale argument you can set. The full documentation (here: https://cran.r-project.org/web/packages/wordcloud/wordcloud.pdf) says: "A vector of length 2 indicating the range of the size of the words."
I can't seem to make any sense of the values though, and I can't find any other documentation. For instance, examples have scale=c(4,.5) or scale=c(8,.3). What do these numbers mean?
I've messed around with different values a bit, but I can't seem to figure out the pattern.
Thanks in advance for any help,
Seth
wordcloud internally calculates
size <- (scale[1] - scale[2]) * normedFreq + scale[2]
where normedFreq is the word's frequency rescaled to the unit interval, and size is then used as the text size in calls to the graphics functions strheight and strwidth, which are described as follows:
These functions compute the width or height, respectively, of the given strings or mathematical expressions s[i] on the current plotting device in user coordinates, inches or as fraction of the figure width par("fin").
So, long story short, it's the text height and width: scale[1] is the size of the largest (most frequent) words and scale[2] the size of the smallest.
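A worked example of the formula, assuming normedFreq is a word's frequency divided by the maximum frequency (1 for the most frequent word, approaching 0 for very rare ones):
scale <- c(4, 0.5)
(scale[1] - scale[2]) * 1 + scale[2]  # most frequent word: size 4
(scale[1] - scale[2]) * 0 + scale[2]  # vanishingly rare word: size 0.5
So scale=c(4,.5) means the biggest words are drawn at size 4 and the smallest shrink towards 0.5.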
I want to plot some data with barplot. Rather, I want to make a bar graph, and barplot seemed the logical choice. I am plotting just fine, but I was wondering if there is a way to intelligently scale the y-axis to round up from the highest count.
For example I set the y-axis in this case to be 30, because I knew that Strand.22 had 27 counts in it:
barplot(unlist(d), ylim=c(0,30), xlab="Forward Reverse", ylab="Counts")
In the future, I want this script to run on its own, so it would be optimal for the y-axis to choose its own ylim. Short of pulling the information out of my d variable, I can't think of a good way to do this. Is there an easy way to do this with barplot? Would some other plotting function work better? I have seen things about ggplot2, but it seemed super complex and I wasn't sure it would do anything better.
EDIT: If I do not choose a ylim, it picks one automatically, and this is what it decided was best. I disagree with its choice.
If you don't specify ylim, R will come up with something based on the data. (Sounds like you don't like its choice, which is fair.)
If you specify something based on the data like:
barplot(unlist(d), ylim=c(0,1.1*max(unlist(d))))
R will draw you a plot that reflects the maximum value of data. That example just takes the maximum of your values and multiplies that by 1.1 (this could be any number) to give it a little extra height. R does something similar to this when you make a scatterplot but it handles barplots slightly differently.
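If you want the limit rounded up to a "nice" number rather than padded by a fixed 10%, base R's pretty() is a reasonable fit (a sketch, reusing your d and labels):
counts <- unlist(d)
barplot(counts, ylim = c(0, max(pretty(counts))),
        xlab = "Forward Reverse", ylab = "Counts")
# pretty() returns rounded break values covering the data,
# so a maximum count of 27 gives an upper limit of 30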
Using the basic plot function (plot.intervals.lmList) from an lme model (called meef1), I produced a massive graph of boxplots. My vector v2andv3commoditycombined has 98 levels.
plot(meef1, v2andv3commoditycombined~resid(.))
I would like to separate by the grouping values of my variable v2andv3commoditycombined to either graph them separately, order them, or exclude some. I'm not sure if there is code to do this or if I have to extract information from the lme output. If that is the case, I'm not sure what to extract to create the boxplots as extracting the residuals returns only one value for each level. If this is impossible, any advice on how to space out the commodity names would be equally helpful.
Thank you.
For each level of v2andv3commoditycombined, what exactly would you like your Y axis and your X axis to be? Since you're splitting the plots by v2andv3commoditycombined, you obviously can't also use that as one of your axes.
Let's pretend you just want to do the traditional residuals on the Y axis and fitted values on the X axis, in a separate plot for each of the 98 levels. You can change the code to plot whatever it is you actually want.
As per ?plot.lme, you would do something like this:
plot(meef1,resid(.,type='pearson',level=1)~fitted(.,level=1)|v2andv3commoditycombined);
Make sure you stretch out your plot window beforehand so that it's nice and big, otherwise you might get an error saying something about margins. The following might produce a better-looking plot:
plot(meef1,resid(.,type='pearson',level=1)~fitted(.,level=1)|v2andv3commoditycombined,pch='.',cex=1.5,abline=0);
Since it wasn't clear from your question, I went ahead and assumed you're interested in the individual-level residuals (i.e. how much each data point differs from the value predicted given its random effects), and that you have one level of nesting in your random formula. If you want population residuals (i.e. how much each data point differs from the average predicted value), change both instances of level=1 to level=0. If you have K levels of nesting, change them to level=K and good luck.
I also assumed you wanted standardized residuals (because you can use the convenient rule of thumb that absolute values greater than 3 are possible outliers, regardless of what scale the original data are on). If not, see ?residuals.lme for other valid options for the type argument.
Oh, and the names of your variables suggest that you're looking at some sort of financial time series. If so, have a look at ACF(meef1) to see if there is a lot of autocorrelation. If there is, you could remedy it by instead fitting a model where the response (Y) variable is the diff() of the original variable. If you're seeing really skewed residuals, you might consider log-transforming the response before taking the diff.
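A quick sketch of that check, assuming meef1 was fit with nlme (whose plot method for ACF objects draws the correlogram, with alpha adding rough significance bands):
library(nlme)
acf_res <- ACF(meef1, resType = "pearson")  # autocorrelation of within-group residuals
plot(acf_res, alpha = 0.05)                 # lags poking past the bands suggest autocorrelation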