I am graphing measured results versus expected results from a model, grouped by categories (the category in the boxplot below is one of a few different ones I'm using). For each data point, I subtracted the expected from the observed to determine the difference. My task is to modify the model to minimize the difference.
I would like to add the significance level to this chart, but all the resources I am finding compare the means of the categories to one another. In this case, I would like to know whether each category's mean is significantly different from 0. I can run this test one category at a time, selecting the data points that fall within each category and testing for a difference from 0, but this seems inefficient.
Is there a way to automatically generate this and plot it? stat_compare_means seemed promising but I couldn't figure out how to make it work, while stat_pvalue_manual may hold more promise if I figure out how to code this.
Thanks in advance!
Sample boxplot (too new to add preview)
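One way to avoid running the tests by hand is sketched below. It is only an illustration under stated assumptions: the data frame `df` and its columns `category` and `diff` (observed minus expected) are made-up stand-ins for the real data, and a one-sample t-test is assumed to be the appropriate test. It runs `t.test(x, mu = 0)` within every category and writes the resulting p-values above the boxes with `geom_text`.

```r
library(ggplot2)

# Hypothetical data: `diff` = observed - expected, `category` = grouping variable
set.seed(42)
df <- data.frame(
  category = rep(c("A", "B", "C"), each = 20),
  diff     = c(rnorm(20, mean = 0), rnorm(20, mean = 1.5), rnorm(20, mean = -2))
)

# One-sample t-test against mu = 0 within every category
p_values <- sapply(split(df$diff, df$category),
                   function(x) t.test(x, mu = 0)$p.value)

# One label per category, placed just above that category's largest value
labels <- data.frame(
  category = names(p_values),
  y        = as.numeric(tapply(df$diff, df$category, max)[names(p_values)]) + 0.5,
  label    = paste0("p = ", signif(p_values, 2))
)

p <- ggplot(df, aes(x = category, y = diff)) +
  geom_boxplot() +
  geom_hline(yintercept = 0, linetype = "dashed") +   # reference line at zero
  geom_text(data = labels, aes(x = category, y = y, label = label))
print(p)
```

With several categories it may be worth adjusting for multiple testing first, for example with `p.adjust(p_values, method = "holm")`, and plotting the adjusted values instead.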
I have a plm model with two dummy independent variables, each with an interaction effect with another dummy variable. I have fixed effects (with a within estimator) as well. The resulting coefficients tell me how the interaction effect differs for each treatment. I now want to plot the difference-in-differences (DID) based on these coefficients to show how the interaction effect differs for each independent variable (treatment). The plot would not use the panel structure, so the x axis would not be time but the interaction, i.e. t=1 (no interaction) and t=2 (interaction). However, I cannot find a way to produce a DID plot like this from the fitted model. I would like to keep using plm because of the fixed effects I have built in, and I feel there should be a way to make this DID plot even if the x axis is the interaction and not a time variable.
I have tried various suggestions for predicted values and other ways to form the plot, but none have worked thus far. The closest I have come to visualizing the effect would be simple box plots of the group means, but I definitely want to make the DID plot based off of the regression output, so these would not suffice.
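A rough sketch of one possible approach is to build the plotted points directly from the coefficients. Everything specific here is an assumption: the model is taken to have the form y ~ z + treat_a:z + treat_b:z, the names `z`, `treat_a` and `treat_b` are hypothetical, and the vector `b` holds made-up numbers standing in for `coef()` of the fitted plm object. Because the within estimator absorbs group levels, the lines show the change relative to each group's own baseline rather than absolute predictions.

```r
library(ggplot2)

# Hypothetical coefficients standing in for coef(fit) from a plm "within" model
# of the assumed form y ~ z + treat_a:z + treat_b:z (names and values are made up)
b <- c(z = 0.5, "treat_a:z" = 1.2, "treat_b:z" = -0.4)

# All combinations of the interaction dummy and the (assumed) groups
did <- expand.grid(z = c(0, 1),
                   group = c("control", "treat_a", "treat_b"),
                   stringsAsFactors = FALSE)

# Implied change relative to each group's baseline at z = 0
did$effect <- did$z * (b[["z"]] +
                       (did$group == "treat_a") * b[["treat_a:z"]] +
                       (did$group == "treat_b") * b[["treat_b:z"]])

p <- ggplot(did, aes(x = factor(z, labels = c("no interaction", "interaction")),
                     y = effect, colour = group, group = group)) +
  geom_line() +
  geom_point() +
  labs(x = NULL, y = "change relative to baseline")
print(p)
```

The gap between a treatment line and the control line at the right-hand point is that treatment's interaction coefficient, which is the difference-in-differences quantity under these assumptions.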
I am trying to figure out how to normalise some positively skewed data.
data
I really need it to have some semblance of a normal distribution, but I have already tried log-transforming and it simply does not work. I get this kind of distribution.
log.data
I also tried sqrt(), but still no joy.
Should I just get rid of some of the extreme values on the tail? Why is log() not really doing much in terms of normalising my data?
Log-transforming your data won't necessarily remove the skew, but it does compress the range of the data along the axis it is applied to. Read this paper about using log transformations.
Nevertheless, according to your image, a simple log transformation compressed your x-axis from a range of about 1.2e+07 to a range of about 0.2.
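The point that a log transform does not always remove skew can be checked numerically. The sketch below uses simulated data only (none of it comes from the question), and `skewness` is a small helper defined here, not a library function. It compares a lognormal sample, which `log()` makes normal, with a skewed sample that sits on a large offset so that all values differ by only a small factor. A post-log range as narrow as 0.2 would be consistent with the second situation, where `log()` is almost a straight line over the data.

```r
# Sample skewness (third standardised moment); helper defined for this sketch
skewness <- function(x) mean((x - mean(x))^3) / mean((x - mean(x))^2)^1.5

set.seed(1)
lognormal <- rlnorm(5000)                    # log() of this is normal by construction
offset    <- 1e7 + rexp(5000, rate = 1e-5)   # skewed, but values span a narrow ratio

# Skewness before and after the log transform
print(c(before = skewness(lognormal), after = skewness(log(lognormal))))
print(c(before = skewness(offset),    after = skewness(log(offset))))

# Range of the offset sample on the log scale: small, so log() is nearly linear here
print(diff(range(log(offset))))
```

If the real data look like the second case, subtracting a sensible baseline before taking logs, or using a different transformation, may be more effective than trimming the tail.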
I've come up with a graph (a scatterplot) of log(1+inf) (inf = number of people infected with a given disease) on the y-axis against one of the explanatory variables in my model, in this case the population density (pop./km²), on the x-axis. The log transformation was used merely for visualization, because it spreads out the distribution of the data and allows for more aesthetically appealing plots. Basically, what I want is for both axes to show the values of the variables before the log transformation. The dots need to be plotted as in plot(log(1+inf),log(populational_density)), but the numbers on the axes should refer to plot(inf,populational_density). I've provided a picture of my graph with some manual editing on the y-axis to show you the idea of what I want.
The numbers in red would be the 'inf' values equivalent to log(inf);
Please bear in mind that those values in red do not correspond to reality.
I understand the whole concept of y = f(x), but I've been asked to provide it. Is this possible? I'm using the ggplot2 package for plotting.
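A minimal sketch of one way to get this in ggplot2 is to plot the untransformed variables and let the scales do the transformation, so the points sit where the logged values would be while the tick labels stay in original units. The data frame `dat` is simulated for illustration, and the break values are arbitrary choices. The `trans` argument is used here; newer ggplot2 releases name it `transform`.

```r
library(ggplot2)

# Hypothetical data standing in for the real `inf` and population density columns
set.seed(1)
dat <- data.frame(populational_density = 10^runif(200, min = 0, max = 4))
dat$inf <- rpois(200, lambda = 5 * sqrt(dat$populational_density))

p <- ggplot(dat, aes(x = populational_density, y = inf)) +
  geom_point() +
  # x positions follow log10(density); labels show the raw density
  scale_x_log10(breaks = c(1, 10, 100, 1000, 10000)) +
  # y positions follow log(1 + inf), so zero counts are allowed; labels show raw counts
  # (in recent ggplot2 versions this argument is called `transform`)
  scale_y_continuous(trans = "log1p", breaks = c(0, 10, 100, 1000)) +
  labs(x = "population density (pop./km²)", y = "number infected")
print(p)
```

Because the transformation lives in the scale, no manual relabelling of the axis is needed and the labels always match the plotted positions.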
I have a sample dataset
d=data.frame(n=rep(c(1,1,1,1,1,1,2,2,2,3),2),group=rep(c("A","B"),each=20),stringsAsFactors = F)
And I want to draw two separate histograms based on group variable.
I tried the method suggested by @jenesaisquoi in a separate post here:
Generating Multiple Plots in ggplot by Factor
ggplot(data=d)+geom_histogram(aes(x=n,y=..count../sum(..count..)),binwidth = 1)+facet_wrap(~group)
It did the trick but if you look closely, the proportions are wrong. It didn't calculate the proportion for each group but rather a grand proportion. I want the proportion to be 0.6 for number 1 for each group, not 0.3.
Then I tried the dplyr package, and it didn't even create two graphs: it ignored the group_by command. The proportion is right this time, though.
d%>%group_by(group)%>%ggplot(data=.)+geom_histogram(aes(x=n,y=..count../sum(..count..)),binwidth = 1)
Finally I tried factoring with color
ggplot(data=d)+geom_histogram(aes(x=n,y=..count../sum(..count..),color=group),binwidth = 1)
But the result is far from ideal. I could accept a single plot, but with the bars side by side rather than on top of each other.
In conclusion, I want to draw two separate histograms with correct proportions calculated within each group. If there is no easy way to do this, I can live with one graph but having the bins side by side, and with correct proportions for each group. In this example, number 1 should have 0.6 as its proportion.
Changing ..count../sum(..count..) to ..density.. gives you the desired proportion:
ggplot(data=d)+geom_histogram(aes(x=n,y=..density..),binwidth = 1)+facet_wrap(~group)
You actually have the separation of charts by variable correct! With ggplot especially, it helps to think about the scales of a graph separately from how it is split into panels. facet_wrap only divides your data into panels; it does not change how the y values are computed, and it behaves the same no matter what your axes are. You could also try adding scale_y_log10() as a layer, and you'll notice that the overall layout of your graph is the same; you've just changed the axes.
What you actually need is a fix to your scales. Understandable: frequency plots can be confusing. ..count../sum(..count..) is evaluated over all bins of the layer at once, across both panels, so every bar is divided by the grand total (40) rather than by its own group's total (20). See a good explanation of this here: Show % instead of counts in charts of categorical variables
What you want is ..density.., which is the count divided by the group's total count and by the bin width, computed separately in each panel. With binwidth = 1 this is exactly the within-group proportion. The difference is subtle in principle, but the important bit is that the bin width, and therefore the scale of the x-axis, matters. For an extreme case of this, see here: Normalizing y-axis in histograms in R ggplot to proportion, where tiny x-axis values produced huge densities.
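To see why the bin width matters, here is a small base-R check. It uses `hist()` rather than ggplot2, on the assumption that both define density as count / (total * bin width), and it takes the `n` values of a single group from the sample data.

```r
# The `n` values of one group in the sample data (20 observations)
x <- rep(c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3), 2)

# Bin width 1: density equals the proportion in each bin
h_one  <- hist(x, breaks = seq(0.5, 3.5, by = 1),   plot = FALSE)
# Bin width 0.5: same data, but every non-empty bin's density doubles
h_half <- hist(x, breaks = seq(0.5, 3.5, by = 0.5), plot = FALSE)

print(h_one$density)    # 0.6 0.3 0.1
print(h_half$density)   # 1.2 0.0 0.6 0.0 0.2 0.0

# Density times bin width always sums to 1, whatever the bin width
print(sum(h_half$density * diff(h_half$breaks)))
```

So density can be read as a proportion only when the bin width is 1; otherwise multiply it by the bin width to get back to proportions.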
Your original code will still work once you substitute the aesthetic I described above:
ggplot(data=d)+geom_histogram(aes(x=n,y=..density..),binwidth = 1)+facet_wrap(~group)
If you're still confused about density, so are lots of people. Hadley Wickham wrote a long piece about it, which you can find here: http://vita.had.co.nz/papers/density-estimation.pdf
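For the fallback mentioned in the question (a single plot with the bars side by side), one possible sketch is to compute the within-group proportions first and draw them with `geom_col`. The table step uses base R; `tab` and `prop` are names introduced here for illustration. The second plot is a faceted variant that assumes a ggplot2 version providing `after_stat()`, and multiplies density by the bin width so the y values remain proportions for any binwidth.

```r
library(ggplot2)

# Sample data from the question
d <- data.frame(n = rep(c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3), 2),
                group = rep(c("A", "B"), each = 20),
                stringsAsFactors = FALSE)

# Proportion of each value of n within its own group (each row of the table sums to 1)
tab <- as.data.frame(prop.table(table(group = d$group, n = d$n), margin = 1),
                     responseName = "prop")
print(tab)

# One plot, bars side by side
p_dodge <- ggplot(tab, aes(x = n, y = prop, fill = group)) +
  geom_col(position = "dodge") +
  labs(y = "proportion within group")
print(p_dodge)

# Faceted histogram; density * width is the per-panel proportion for any binwidth
p_facet <- ggplot(d) +
  geom_histogram(aes(x = n, y = after_stat(density * width)), binwidth = 1) +
  facet_wrap(~group) +
  labs(y = "proportion within group")
print(p_facet)
```

Precomputing the proportions keeps the arithmetic visible in `tab`, which makes it easy to confirm that the value 1 has a proportion of 0.6 in each group before plotting.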