Applying functions from histograms in R

I have a very basic grasp of stats, and a very basic grasp of R so please bear with me.
I have survey data which shows the weekly expenditure of a number of respondents. I have put this into a histogram, and have plotted a density function as well. So far so good.
How do I then apply this curve to a larger population? Say that I know that the population of my town is 25000. How can I apply that to the density curve to arrive at a new histogram and the data table behind it?
I hope this is an appropriate question, thank you.

It is not exactly clear what you want to do.
If you only have data on the sample, then the best estimate you have of the population histogram/density is the sample histogram/density; the only difference would be the scale on the y-axis. Personally, I think the tick marks on the y-axis should be ignored (my preference would be that the tick labels were never plotted), since it is really the shape of the histogram/density that matters, and the tick labels can change based on things that don't change the meaning. If you really feel the need to have the tick labels represent population values, then see the axis function.
If you want something more than this then give us a better description of what you are trying to accomplish.
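As a rough illustration of the rescaling described above, here is a minimal sketch that keeps the sample's bin proportions and multiplies them by the population size. The `spend` vector is simulated and purely hypothetical, and the sketch assumes the sample is representative of the town.

```r
# Hypothetical sample of weekly expenditure; replace with the real survey column.
set.seed(42)
spend <- rgamma(200, shape = 2, scale = 30)

pop_size <- 25000                 # town population from the question
h <- hist(spend, plot = FALSE)    # bin the sample without drawing

# Share of the sample in each bin, scaled up to the population.
pop_table <- data.frame(
  lower         = head(h$breaks, -1),
  upper         = tail(h$breaks, -1),
  sample_count  = h$counts,
  est_pop_count = h$counts / sum(h$counts) * pop_size
)

# Same shape as the sample histogram; only the y-axis scale changes.
h_pop <- h
h_pop$counts <- pop_table$est_pop_count
plot(h_pop, main = "Estimated population histogram",
     xlab = "Weekly expenditure", ylab = "Estimated number of residents")
```

`pop_table` is the "data table behind" the new histogram; its `est_pop_count` column sums to the population size.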

Related

Need Help Creating a specific kind of Isotope Plot in ggplot2

I hope that you are doing well. I am currently trying to replicate a type of isotope plot that's common in my field. Essentially, it's the result of a compound-specific stable isotope analysis.
The x and y axes represent delta values that are plotted against isotopic references from animals (ellipses) to identify animals by their signature. The ellipses represent a 95% CI.
I'm a beginner in R. I've managed to get the scatter plot to work, but I don't understand how to create CI ellipses from reference data. Would anyone here know how to do this?
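One possible starting point, sketched in base R with made-up reference values: compute the ellipse that would contain 95% of a bivariate normal distribution fitted to one reference group. Note this is a data ellipse, not a confidence region for the mean, and the column names `delta_a` and `delta_b` and the helper `ellipse_points` are hypothetical.

```r
# Hypothetical reference values for one animal group (two delta values).
set.seed(1)
ref <- data.frame(delta_a = rnorm(30, -25, 1.0),
                  delta_b = rnorm(30, -27, 1.5))

# Points on the ellipse that would hold `level` of a bivariate normal
# distribution with the sample mean and covariance of `xy`.
ellipse_points <- function(xy, level = 0.95, n = 100) {
  centre <- colMeans(xy)
  radius <- sqrt(qchisq(level, df = 2))
  theta  <- seq(0, 2 * pi, length.out = n)
  circle <- cbind(cos(theta), sin(theta))
  pts <- sweep(radius * (circle %*% chol(cov(xy))), 2, centre, "+")
  colnames(pts) <- colnames(xy)
  pts
}

ell <- ellipse_points(ref)
plot(ref,
     xlim = range(c(ell[, 1], ref[, 1])),
     ylim = range(c(ell[, 2], ref[, 2])))
lines(ell)
```

In ggplot2, `stat_ellipse()` with a `level` argument draws a comparable data ellipse per group, which may be the more direct route if the reference data are already in a data frame.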

Is it possible to edit the numbers displayed on an axis without moving the points that were plotted?

I've come up with a graph (a scatterplot) of log(1+inf) (inf = number of people infected with a given disease) on the y-axis against one of the explanatory variables in my model, in this case the population density (pop./km²), on the x-axis. The log transformation was used merely for visualization, because it spreads out the distribution of the data and allows for more aesthetically appealing plots. Basically, what I want is for both axes to show the values of the variables before the log transformation. The dots need to be plotted as in plot(log(1+inf), log(populational_density)), but the numbers on the axes should refer to plot(inf, populational_density). I've provided a picture of my graph with some manual editing on the y-axis to show you the idea of what I want.
The numbers in red would be the 'inf' values equivalent to log(inf);
Please, bear in mind that those values in red do not correspond to reality.
I understand the whole concept of y = f(x), but I've been asked to provide it. Is this possible? I'm using the ggplot2 package for plotting.
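A minimal sketch of the idea, using base graphics rather than ggplot2 and simulated data: plot the transformed values, then put tick marks at the transformed positions while labelling them with the untransformed values. The tick values `x_ticks` and `y_ticks` are arbitrary choices for illustration.

```r
# Hypothetical data standing in for the infection counts and density.
set.seed(7)
populational_density <- runif(50, 10, 5000)
inf <- rpois(50, lambda = populational_density / 20)

x_ticks <- c(10, 100, 1000, 5000)  # labels in original density units
y_ticks <- c(0, 10, 100, 250)      # labels in original counts

# Plot the transformed values but suppress the default axes...
plot(log(populational_density), log(1 + inf), axes = FALSE,
     xlab = "Population density (pop. per km^2)", ylab = "Number infected")
# ...then place ticks at transformed positions, labelled untransformed.
axis(1, at = log(x_ticks), labels = x_ticks)
axis(2, at = log(1 + y_ticks), labels = y_ticks)
box()
```

The points do not move; only the labels change. The same approach should carry over to ggplot2 by passing transformed positions to `breaks` and the original values to `labels` in `scale_x_continuous()` and `scale_y_continuous()`.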

Confidence interval square in a plot with one variable in each axis in ggplot

Although it might sound easy at first, I do not have a scatterplot, and I think that is what makes this question challenging. I have this plot, which comes from this question.
Summing up, each axis represents a variable that is not connected to the other. It is not an XY scatterplot, as you see.
I would like to know whether it is possible to trace the 95% confidence interval for the mean of each variable and draw a square in the middle of the plot representing the area where the two intervals overlap.
The result might be something similar to this, bearing in mind that the 95% confidence limits shown do not correspond to reality (they just illustrate how it might appear):
Here is another question that deals with this situation, but not using ggplot.
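A minimal sketch of one way to do this, in base R with simulated data: compute a t-based 95% confidence interval for each mean separately and draw the rectangle spanned by the two intervals. The samples `a` and `b` are hypothetical, and the intervals assume approximately normal sample means.

```r
# Hypothetical, unrelated samples for the two axes.
set.seed(3)
a <- rnorm(40, mean = 10, sd = 2)
b <- rnorm(25, mean = 50, sd = 8)

ci_a <- t.test(a)$conf.int  # 95% CI for the mean of a (default level)
ci_b <- t.test(b)$conf.int  # 95% CI for the mean of b

# Empty plot with one variable per axis, values shown as rugs.
plot(NA, xlim = range(a), ylim = range(b), xlab = "a", ylab = "b")
rug(a, side = 1)
rug(b, side = 2)
# Rectangle covering both confidence intervals at once.
rect(ci_a[1], ci_b[1], ci_a[2], ci_b[2], border = "red")
```

In ggplot2 the same rectangle could be added with `annotate("rect", xmin = ..., xmax = ..., ymin = ..., ymax = ...)` using the four interval endpoints.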

heatmap with R,ggmap and ggplot

I want to plot incidents on a map (San Francisco). As I have far too many incidents (800k points), I end up with an overplotting problem. To avoid this, I want to draw a two-dimensional density in order to get the desired insight. The problem is that while the incidents are spread all over the map, geom_density2d only covers a small area of the city. Of course, the expected outcome is a density that covers nearly all of the city. Any ideas why this happens?
CODE
a<-get_map("San Francisco",zoom=12,source='osm')
ggmap(a,extent='device')+ geom_density2d(data=train,aes(x=X,y=Y))+
stat_density2d(data=train,aes(x=X,y=Y,fill=..level..,alpha=..level..),
geom='polygon')
--------------------------------------------------------------
First, @ajrwhite, thanks for your answer and attitude, dude. You are also right that when dealing with datasets this big you have to subset in order to experiment. As far as the number of bins is concerned, I was thinking that, as with geom_density, the optimal kernel bandwidth/number of bins is calculated internally. As it seems, in the two-dimensional case you have to adjust it yourself.
Now, my problem as you mentioned was that I never thought that crimes in the city would be so concentrated. The discovery was so clear that my output seemed false. As it turns out, this is the case in the city. There is also a more detailed approach on the various visualizations of this dataset by this guy.
https://www.kaggle.com/mircat/sf-crime/violent-crime-mapping
Finally, thank you for the redirection. There is indeed extensive covering of the subject.
So I grabbed the San Francisco Crime data from Kaggle, which I suspect is the dataset you are using.
First, a suggestion - given that there are 878,049 rows in this dataset, take a sample of 5,000 and use that to experiment with plots. It will save you a lot of time:
train_reduced = train[sample(1:nrow(train), 5000),]
You can then easily plot individual cases to get a better feeling for what's happening:
ggmap(a,extent='device') + geom_point(aes(x=X, y=Y), data=train_reduced)
And now we can see that the coordinates and the data are correctly aligned:
So your problem is simply that crime is concentrated in the north-east of the city.
Returning to your density contours, we can use the bins argument to increase the precision of our contour intervals:
ggmap(a,extent='device') +
geom_density2d(data=train_reduced,aes(x=X,y=Y), bins=30) +
stat_density2d(data=train_reduced,aes(x=X,y=Y,fill=..level.., alpha=..level..), geom='polygon')
Which gives us a more informative plot spreading out more into the low-crime areas of the city:
There are countless ways of improving the aesthetics and consistency of these plots, but these have already been covered elsewhere on StackOverflow, for example:
How to make a ggplot2 contour plot analogue to lattice:filled.contour()?
Filled contour plot with R/ggplot/ggmap
If you use a smaller sample of your dataset, you should be able to experiment with these ideas very quickly and find the parameters that best suit your requirements. The ggplot2 documentation is excellent, by the way.
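The sampling step suggested earlier in this answer can be made reproducible with a fixed seed, so repeated experiments use the same subset. This is only a sketch: the `train` data frame below is simulated as a stand-in for the real data, and the seed value is arbitrary.

```r
# Hypothetical stand-in for the `train` data frame (X = longitude, Y = latitude).
set.seed(2024)
train <- data.frame(X = rnorm(100000, -122.42, 0.03),
                    Y = rnorm(100000, 37.77, 0.02))

# Never ask for more rows than exist, then draw rows without replacement.
n_keep <- min(5000, nrow(train))
train_reduced <- train[sample(nrow(train), n_keep), ]
```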

Intelligent Y Axis Scaling BarPlot R

I want to plot some data with barplot. Rather, I want to make a bar graph and barplot seemed the logical choice. I am plotting just fine but I was wondering if there is a way to intelligently scale the y axis to round up from the highest count.
For example, I set the y-axis limit in this case to 30, because I knew that Strand.22 had 27 counts in it: barplot(unlist(d), ylim=c(0,30), xlab="Forward Reverse", ylab="Counts")
In the future, I want this script to run on its own, so it would be optimal for the y-axis to choose its own ylim. Short of pulling the information out of my 'd' variable, I can't think of a good way to do this. Is there an easy way to do this with barplot? Would some other plotting function work better? I have seen things about ggplot2, but it seemed super complex and I wasn't sure that it would do anything better.
EDIT: If I do not choose a ylim, it picks one automatically, and this is what it decided was best.
I disagree with its choice.
If you don't specify ylim, R will come up with something based on the data. (Sounds like you don't like its choice, which is fair.)
If you specify something based on the data like:
barplot(unlist(d), ylim=c(0,1.1*max(unlist(d))))
R will draw you a plot that reflects the maximum value of data. That example just takes the maximum of your values and multiplies that by 1.1 (this could be any number) to give it a little extra height. R does something similar to this when you make a scatterplot but it handles barplots slightly differently.
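If a round upper limit is preferred over a fixed 10% margin, one option is to let `pretty()` pick it. A minimal sketch, using a hypothetical `counts` vector in place of `unlist(d)`:

```r
# Hypothetical counts standing in for unlist(d); the largest is 27.
counts <- c(Strand.1 = 12, Strand.7 = 19, Strand.22 = 27)

# pretty() returns round tick values covering the range, so its maximum
# is a round number at or above the tallest bar.
y_top <- max(pretty(c(0, max(counts))))

barplot(counts, ylim = c(0, y_top),
        xlab = "Forward Reverse", ylab = "Counts")
```

With a maximum count of 27 this gives an upper limit of 30, matching the value chosen by hand in the question.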
