Confidence interval square in a plot with one variable in each axis in ggplot - r

Although it might sound easy at first, I do not have a scatterplot, and I think that is what makes this question challenging. I have this plot, which comes from this question.
Summing up, each axis represents a variable that is not connected to the other. It is not an XY scatterplot, as you see.
I would like to know whether it is possible to draw the 95% confidence interval for the mean of each variable, and to draw a square in the middle of the plot representing the overlapping area of both datasets.
The result might be something similar to this, bearing in mind that the 95% CLs shown do not correspond to reality (they are just there to illustrate how it might appear):
Here is another question which deals with this situation, but not using ggplot.
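A minimal sketch of one way this could be done in ggplot2. It assumes the two variables live in separate numeric vectors (x_values and y_values are hypothetical names, simulated here) and that a t-based 95% interval for each mean is acceptable. The rug marks only stand in for the original plot, which is not reproduced here.

```r
library(ggplot2)

set.seed(1)
# Hypothetical data: two unrelated variables, possibly of different lengths
x_values <- rnorm(40, mean = 10, sd = 2)
y_values <- rnorm(55, mean = 5, sd = 1)

# t-based 95% confidence interval for each mean
ci_x <- t.test(x_values, conf.level = 0.95)$conf.int
ci_y <- t.test(y_values, conf.level = 0.95)$conf.int

# One-row data frame describing the rectangle spanned by both intervals
ci_box <- data.frame(xmin = ci_x[1], xmax = ci_x[2],
                     ymin = ci_y[1], ymax = ci_y[2])

p <- ggplot() +
  # raw values of each variable shown along its own axis
  geom_rug(data = data.frame(x = x_values), aes(x = x), sides = "b") +
  geom_rug(data = data.frame(y = y_values), aes(y = y), sides = "l") +
  # rectangle where both confidence intervals apply
  geom_rect(data = ci_box,
            aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
            fill = "steelblue", alpha = 0.3, colour = "steelblue") +
  # the pair of sample means
  annotate("point", x = mean(x_values), y = mean(y_values))
print(p)
```

The rectangle is simply the product of the two one-dimensional intervals, so it does not imply any joint 95% coverage for the pair of means.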

Related

Jump y axis values when highest value is too far away from the other points

Basically I'm building an area graph with Chart.js. The data that I'm using to build the graph usually contains a peak that is much higher than the rest of the points, so the y-axis range becomes too large to notice the difference between the lower points, and they look almost like a line parallel to the x-axis, as we can see in this image:
Graph with problems
The solution I want to try is to skip the values from the y-axis between the lower points and the peak of the graph, and accomplish a graph presentation similar to this one:
Solution graph sketch
As we can see in this sketch, the y-axis has a normal scale until 300, but then, as the next point is too far away from the other ones, the y-axis values are skipped.
So what I want to know is whether this jump in the values of the y-axis is possible to achieve with this library (Chart.js) and, if so, where I can find documentation about it, because I already looked everywhere and couldn't find a thing. If not, I would ask you for recommendations of any other libraries where I could achieve this.

Is it possible to edit the numbers displayed on an axis without moving the points that were plotted?

I've come up with a graph (a scatterplot) of log(1+inf) (inf = number of people infected with a given disease) on the y-axis against one of the explanatory variables in my model, in this case the population density (pop./km²), on the x-axis. The log transformation was used merely for visualization, because it spreads the distribution of the data and allows for more aesthetically appealing plots. Basically, what I want is for both axes to show the values of the variables before the log transformation. The dots need to be plotted like plot(log(1+inf),log(populational_density)), but the numbers on the axes should refer to plot(inf,populational_density). I've provided a picture of my graph with some manual editing on the y-axis to show you the idea of what I want.
The numbers in red would be the 'inf' values equivalent to log(inf).
Please bear in mind that those values in red do not correspond to reality.
I understand the whole concept of y = f(x), but I've been asked to provide it. Is this possible? I'm using the ggplot2 package for plotting.
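A possible approach, sketched under assumptions: instead of transforming the data, let the scale do the transformation, so the points are positioned on the log scale while the tick labels stay in the original units. The data frame df and the break values below are made up for illustration; only the column names follow the question.

```r
library(ggplot2)

set.seed(42)
# Hypothetical data standing in for the model variables
df <- data.frame(populational_density = exp(runif(100, 0, 8)))
df$inf <- rpois(100, lambda = 0.05 * df$populational_density + 1)

p <- ggplot(df, aes(x = populational_density, y = inf)) +
  geom_point() +
  # positions follow log(x), labels show the untransformed x values
  scale_x_continuous(trans = "log", breaks = c(1, 10, 100, 1000)) +
  # positions follow log(1 + y), labels show the untransformed y values
  scale_y_continuous(trans = "log1p", breaks = c(0, 1, 5, 10, 50, 100))
print(p)
```

Recent ggplot2 releases prefer the argument name transform over trans, so depending on the installed version this may emit a deprecation message.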

ggplot draw multiple plots by levels of a variable

I have a sample dataset
d=data.frame(n=rep(c(1,1,1,1,1,1,2,2,2,3),2),group=rep(c("A","B"),each=20),stringsAsFactors = F)
And I want to draw two separate histograms based on group variable.
I tried this method suggested by @jenesaisquoi in a separate post here
Generating Multiple Plots in ggplot by Factor
ggplot(data=d)+geom_histogram(aes(x=n,y=..count../sum(..count..)),binwidth = 1)+facet_wrap(~group)
It did the trick but if you look closely, the proportions are wrong. It didn't calculate the proportion for each group but rather a grand proportion. I want the proportion to be 0.6 for number 1 for each group, not 0.3.
Then I tried the dplyr package, and it didn't even create two graphs: it ignored the group_by command, although the proportion is right this time.
d%>%group_by(group)%>%ggplot(data=.)+geom_histogram(aes(x=n,y=..count../sum(..count..)),binwidth = 1)
Finally I tried factoring with color
ggplot(data=d)+geom_histogram(aes(x=n,y=..count../sum(..count..),color=group),binwidth = 1)
But the result is far from ideal. I was going to accept one output but with the bins side by side, not on top of each other.
In conclusion, I want to draw two separate histograms with correct proportions calculated within each group. If there is no easy way to do this, I can live with one graph but having the bins side by side, and with correct proportions for each group. In this example, number 1 should have 0.6 as its proportion.
Changing ..count../sum(..count..) to ..density.. gives you the desired proportion:
ggplot(data=d)+geom_histogram(aes(x=n,y=..density..),binwidth = 1)+facet_wrap(~group)
You actually have the separation of charts by variable correct! Especially with ggplot, you sometimes need to consider the scales of the graph separately from the shape. facet_wrap applies a new layer to your data, regardless of scale. It will behave the same, no matter what your axes are. You could also try adding scale_y_log10() as a layer, and you'll notice that the overall shape and style of your graph is the same; you've just changed the axes.
What you actually need is a fix to your scales. Understandable - frequency plots can be confusing. ..count../sum(..count..) treats each bin as an independent unit, regardless of its value. See a good explanation of this here: Show % instead of counts in charts of categorical variables
What you want is ..density.., which is the count divided by the total count and by the bin width, so with binwidth = 1 it equals the proportion within each panel. The difference is subtle in principle, but the important bit is that the value on the x-axis matters. For an extreme case of this, see here: Normalizing y-axis in histograms in R ggplot to proportion, where tiny x-axis values produced huge densities.
Your original code will still work, just substituting the aesthetics I described above.
ggplot(data=d)+geom_histogram(aes(x=n,y=..density..),binwidth = 1)+facet_wrap(~group)
If you're still confused about density, so are lots of people. Hadley Wickham wrote a long piece about it, you can find that here: http://vita.had.co.nz/papers/density-estimation.pdf
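For the fallback the question mentions (one graph with the bars side by side), here is a hedged sketch that swaps in a different technique: the within-group proportions are computed up front with base R, then drawn with geom_col. The names props and proportion are introduced only for this example.

```r
library(ggplot2)

d <- data.frame(n = rep(c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3), 2),
                group = rep(c("A", "B"), each = 20),
                stringsAsFactors = FALSE)

# Proportion of each value of n within each group (each group's rows sum to 1)
props <- as.data.frame(prop.table(table(group = d$group, n = d$n), margin = 1),
                       responseName = "proportion")
print(props)

# One panel, bars for the two groups placed side by side
p <- ggplot(props, aes(x = n, y = proportion, fill = group)) +
  geom_col(position = "dodge")
print(p)
```

Precomputing the table also makes it easy to check the numbers before plotting: the value 1 gets a proportion of 0.6 in each group, as the question expects.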

heatmap with R,ggmap and ggplot

I want to plot incidents on a map (San Francisco). As I have far too many incidents (800k points), I end up with an overplotting problem. To avoid this, I want to make a 2-dimensional density in order to get the desired insight. The problem is that while the incidents are spread all over the map, geom_density2d only covers a small area of the city. Of course, the expected outcome is a density that covers nearly all the city. Any ideas why this happens?
CODE
a<-get_map("San Francisco",zoom=12,source='osm')
ggmap(a,extent='device')+ geom_density2d(data=train,aes(x=X,y=Y))+
stat_density2d(data=train,aes(x=X,y=Y,fill=..level..,alpha=..level..),
geom='polygon')
--------------------------------------------------------------
First of all, @ajrwhite, thanks for your answer and attitude, dude. You are also right that when dealing with datasets this big you have to subset in order to experiment. As far as the number of bins is concerned, I was assuming that, as with geom_density, the optimal kernel bandwidth/number of bins is calculated internally. As it seems, in the 2-dimensional case you have to adjust it yourself.
Now, my problem as you mentioned was that I never thought that crimes in the city would be so concentrated. The discovery was so clear that my output seemed false. As it turns out, this is the case in the city. There is also a more detailed approach on the various visualizations of this dataset by this guy.
https://www.kaggle.com/mircat/sf-crime/violent-crime-mapping
Finally, thank you for the redirection. There is indeed extensive covering of the subject.
So I grabbed the San Francisco Crime data from Kaggle, which I suspect is the dataset you are using.
First, a suggestion - given that there are 878,049 rows in this dataset, take a sample of 5,000 and use that to experiment with plots. It will save you a lot of time:
train_reduced = train[sample(1:nrow(train), 5000),]
You can then easily plot individual cases to get a better feeling for what's happening:
ggmap(a,extent='device') + geom_point(aes(x=X, y=Y), data=train_reduced)
And now we can see that the coordinates and the data are correctly aligned:
So your problem is simply that crime is concentrated in the north-east of the city.
Returning to your density contours, we can use the bins argument to increase the precision of our contour intervals:
ggmap(a,extent='device') +
geom_density2d(data=train_reduced,aes(x=X,y=Y), bins=30) +
stat_density2d(data=train_reduced,aes(x=X,y=Y,fill=..level.., alpha=..level..), geom='polygon')
Which gives us a more informative plot spreading out more into the low-crime areas of the city:
There are countless ways of improving the aesthetics and consistency of these plots, but these have already been covered elsewhere on StackOverflow, for example:
How to make a ggplot2 contour plot analogue to lattice:filled.contour()?
Filled contour plot with R/ggplot/ggmap
If you use a smaller sample of your dataset, you should be able to experiment with these ideas very quickly and find the parameters that best suit your requirements. The ggplot2 documentation is excellent, by the way.
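To try the sampling and bins ideas without downloading a map, here is a rough sketch on simulated points. Everything about the data is an assumption: train is a made-up stand-in with one dense cluster plus a sparse background, and only the column names X and Y follow the question.

```r
library(ggplot2)

set.seed(123)
# Hypothetical stand-in for the full dataset: dense cluster plus sparse background
train <- data.frame(
  X = c(rnorm(90000, -122.41, 0.010), runif(10000, -122.51, -122.37)),
  Y = c(rnorm(90000, 37.78, 0.008), runif(10000, 37.71, 37.81))
)

# Work on a random sample while experimenting
train_reduced <- train[sample(nrow(train), 5000), ]

# Default number of contour levels versus a finer set of levels
p_default <- ggplot(train_reduced, aes(x = X, y = Y)) + geom_density_2d()
p_bins30  <- ggplot(train_reduced, aes(x = X, y = Y)) + geom_density_2d(bins = 30)
print(p_bins30)
```

Comparing p_default and p_bins30 should show the same effect as on the map: with more levels, contours also appear in the low-density areas.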

Applying functions from histograms - in R

I have a very basic grasp of stats, and a very basic grasp of R so please bear with me.
I have survey data which shows the weekly expenditure of a number of respondents. I have put this into a histogram, and have plotted a density function as well. So far so good.
How do I then apply this curve to a larger population? Say that I know that the population of my town is 25000. How can I apply that to the density curve to arrive at a new histogram and the data table behind it?
I hope this is an appropriate question, thank you.
It is not exactly clear what you want to do.
If you only have data on the sample, then the best estimate that you have of the histogram/density for the population is the histogram/density of the sample. The only difference would be the scale on the y-axis. Personally I think the tick marks on the y-axis should be ignored (and my preference would be that the tick labels were never plotted), since it is really the shape of the histogram/density that is important, and the tick labels can change based on things that don't change the meaning. If you really feel the need to have the tick labels represent population values, then see the axis function.
If you want something more than this then give us a better description of what you are trying to accomplish.
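If a data table is what is wanted, one way to read the question is to scale each bin's sample proportion up to the population size. This is only a sketch: the expenditure values are simulated, and it assumes the survey is representative of the town. It uses a rescaled bar plot rather than the axis function mentioned above.

```r
set.seed(7)
# Hypothetical survey: weekly expenditure of 200 respondents
expenditure <- rgamma(200, shape = 4, scale = 25)
population  <- 25000

# Compute the histogram bins without drawing anything
h <- hist(expenditure, plot = FALSE)

# Scale the sample proportion in each bin up to the population size
scaled <- data.frame(
  from = head(h$breaks, -1),
  to = tail(h$breaks, -1),
  sample_count = h$counts,
  estimated_population = h$counts / sum(h$counts) * population
)
print(scaled)

# Same shape as the sample histogram; only the y-axis scale changes
barplot(scaled$estimated_population, names.arg = scaled$from,
        xlab = "Weekly expenditure (lower bin edge)",
        ylab = "Estimated number of people")
```

The estimated counts add up to the population by construction, which illustrates the point made above: the shape stays the same and only the y-axis scale changes.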
