I want to plot incidents on a map of San Francisco. Since there are far too many incidents (800k points), I end up with an overplotting problem, so to avoid it I want to build a two-dimensional density in order to extract the desired insight. The problem is that while the incidents are spread all over the map, geom_density2d only covers a small area of the city. The expected outcome, of course, is a density that covers nearly the whole city. Any ideas why this happens?
library(ggmap)  # ggmap loads ggplot2 as well

# Basemap of San Francisco, with density contours and filled density polygons
a <- get_map("San Francisco", zoom = 12, source = "osm")
ggmap(a, extent = "device") +
  geom_density2d(data = train, aes(x = X, y = Y)) +
  stat_density2d(data = train, aes(x = X, y = Y, fill = ..level.., alpha = ..level..),
                 geom = "polygon")
--------------------------------------------------------------
First of all, @ajrwhite, thanks for your answer and attitude, dude. You are also right that when dealing with datasets this big you have to subset in order to experiment. As far as the number of bins is concerned, I assumed that, as with geom_density, the optimal kernel bandwidth / number of bins is calculated internally. As it turns out, in the two-dimensional case you have to adjust it yourself.
Now, my problem, as you mentioned, was that I never imagined crimes in the city would be so concentrated. The pattern was so stark that my output seemed wrong. As it turns out, that really is the case in this city. There is also a more detailed look at the various visualizations of this dataset here:
https://www.kaggle.com/mircat/sf-crime/violent-crime-mapping
Finally, thank you for the links. There is indeed extensive coverage of the subject there.
So I grabbed the San Francisco Crime data from Kaggle, which I suspect is the dataset you are using.
First, a suggestion - given that there are 878,049 rows in this dataset, take a sample of 5,000 and use that to experiment with plots. It will save you a lot of time:
train_reduced <- train[sample(nrow(train), 5000), ]
You can then easily plot individual cases to get a better feeling for what's happening:
ggmap(a, extent = 'device') + geom_point(aes(x = X, y = Y), data = train_reduced)
And now we can see that the coordinates and the data are correctly aligned:
So your problem is simply that crime is concentrated in the north-east of the city.
Returning to your density contours, we can use the bins argument to increase the precision of our contour intervals:
ggmap(a, extent = 'device') +
  geom_density2d(data = train_reduced, aes(x = X, y = Y), bins = 30) +
  stat_density2d(data = train_reduced, aes(x = X, y = Y, fill = ..level.., alpha = ..level..),
                 geom = 'polygon')
Which gives us a more informative plot spreading out more into the low-crime areas of the city:
There are countless ways of improving the aesthetics and consistency of these plots, but these have already been covered elsewhere on StackOverflow, for example:
How to make a ggplot2 contour plot analogue to lattice:filled.contour()?
Filled contour plot with R/ggplot/ggmap
If you use a smaller sample of your dataset, you should be able to experiment with these ideas very quickly and find the parameters that best suit your requirements. The ggplot2 documentation is excellent, by the way.
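For example, here is a minimal sketch of adjusting the kernel bandwidth directly via the h argument, which stat_density2d passes on to the underlying MASS::kde2d call (the values below are illustrative guesses, not tuned for this dataset):

# h = c(x_bandwidth, y_bandwidth) in map units; smaller values give tighter,
# more local contours, larger values smooth the density surface out
ggmap(a, extent = 'device') +
  stat_density2d(data = train_reduced,
                 aes(x = X, y = Y, fill = ..level.., alpha = ..level..),
                 geom = 'polygon', bins = 30, h = c(0.02, 0.02))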
Related
I have a fair amount of experience analyzing RNA-Seq data, but I am looking for new ways to visualize the data. I typically use heat maps and volcano plots, but I'd like to make this plot, which is from this paper. I can make this type of plot with rlog-transformed data before doing DEG analysis, but I want to color dots based on statistically significant expression differences.
I've searched online and have not been able to find a good way to create this plot. Thanks in advance for any advice.
This question is more about bioinformatics, so it may be better to post it on Biostars.
In any case, you could draw a scatter plot with ggscatter (from the ggpubr package) or with ggplot2 and colour the statistically significant genes conditionally, as in the sketch below.
Please provide a sample of your data.
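For example, here is a minimal ggplot2 sketch, assuming a hypothetical results data frame res with columns baseMean, log2FoldChange and padj (adjust the column names and cutoffs to your own DEG output):

library(ggplot2)

# Flag genes as significant using illustrative cutoffs (padj < 0.05, |log2FC| > 1)
res$significant <- ifelse(!is.na(res$padj) & res$padj < 0.05 &
                            abs(res$log2FoldChange) > 1, "yes", "no")

# MA-style scatter: mean expression vs log2 fold change, coloured by significance
ggplot(res, aes(x = log10(baseMean + 1), y = log2FoldChange, colour = significant)) +
  geom_point(size = 0.8, alpha = 0.6) +
  scale_colour_manual(values = c(no = "grey60", yes = "red"))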
I have a data frame with several million points in it - each having two values.
When I plot this like this:
plot(myData)
All the points are plotted, but the plot is quite busy, so I thought I'd plot it as a line:
plot(myData, type="l")
But while the x axis doesn't change (i.e. goes from 0 to 7e+07), the actual plotting stops at about 3e+07 and I don't actually get a proper line plot either.
Is there a limitation on line plotting?
Update
If I use
plot(myData, type="h")
I get correct and usable output, but I still wonder why the type="l" option fails so badly.
Further update
I am plotting a time series - here is one output using type="h":
That's perfectly usable, but having a line would allow me to compare several outputs.
Graphical representation of high-dimensional data is a growing issue in data analysis. The problem, actually, is not creating the graph. The problem is making the graph capable of communicating information that we can turn into useful knowledge. Allow me to present an example to illustrate this point, considering a dataset with a million observations, that is, not that big.
x <- rnorm(10^6, 0, 1)
y <- rnorm(10^6, 0, 1)
Let's plot it. Yes, R can easily manage such a problem. But can we? Probably not.
After all, what kind of information can we deduce from a hard stain of ink? Probably no more than a tasseographer trying to divine the future in the patterns of tea leaves, coffee grounds, or wine sediments.
plot(x, y)
A different approach is offered by the smoothScatter function, which creates a density plot of bivariate data. Here we create two examples.
First, with defaults.
smoothScatter(x, y)
Second, the bandwidth is specified to be a little larger than the default, and five points are shown using a different symbol, pch = 3.
smoothScatter(x, y, bandwidth=c(5,1)/(1/3), nrpoints=5, pch=3)
As you can see, the problem is not completely solved. Nevertheless, we get a better grasp of the distribution of our data. This kind of approach is still in development, and several related matters are still being discussed and refined. If it looks like a more suitable way to represent your big dataset, I suggest you visit this blog, which discusses the issue thoroughly.
For what it's worth, all the evidence I have is that the computer - even though it was a lump of big iron - ran out of memory.
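If memory is indeed the limit, one workaround (a sketch of my own, assuming the x values in myData are already sorted) is to thin the series before drawing the line:

# Keep roughly every k-th row so the line plot stays light; with a few
# million rows, k = 100 still leaves tens of thousands of points
k <- 100
idx <- seq(1, nrow(myData), by = k)
plot(myData[idx, ], type = "l")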
I have asked this question in the GIS part of Stack Exchange (https://gis.stackexchange.com/questions/95265/r-how-to-create-a-pre-determined-number-of-identical-square-polygons-to-use-fo). I am asking it here as well because it touches on topics of wider interest (e.g. calculation of density) - I hope not to be penalised for this! :)
I am trying to plot crime data density (again!) over a city map, say of NY. This is a well-known problem and there are plenty of examples (http://www.obscureanalytics.com/2012/12/07/visualizing-baltimore-with-r-and-ggplot2-crime-data/). Those methods plot the crime density through isoclines, while I need to represent it through identical density squares of a pre-determined area (and the area / side length may change from one iteration to the next). This is actually done in commercially available COTS packages like PredPol (see http://www.predpol.com). The reason for representing crime density through squares is that the squares are the hotspot areas to be patrolled, and their size will influence the overall number of police officers required.
This is what I am trying to achieve:
I would like to be able to create identical square polygons of a pre-determined size to overlay on the map (is it a raster? apologies, but I've just started to learn to spell GIS!)
I would like to use the above squares as items to colour, as in a choropleth map (i.e. different colouring according to the frequency of crime in the area), probably using ggplot2 or similar.
This should allow me to see how the density of crimes per square kilometre varies as the size (i.e. the area) of the squares changes, proposing different patrolling areas.
I do not have a clue whether it is possible to use R to create regularly shaped square polygons of a pre-defined size for this (as the code snippet below attests). Any help or links to examples are welcome.
I would also be glad to get some indication of alternative ways to calculate the density. I have used stat_density2d (part of ggplot2) but maybe there are better / faster ways?
(In hindsight, do I need a density function at all? I just need to count the crimes in each cell and colour-plot it accordingly...)
This is where I got to:
library(rgdal)
library(raster)
library(sp)

# NY boroughs shapefile downloaded from the NY website
shp <- readOGR(dsn = "nybb_14a_av", layer = "nybb")

# Empty raster covering the extent of the shapefile, with 0.05-unit cells
r <- raster(extent(shp))
res(r) <- 0.05

# Using BoroCode as an experiment...
r <- rasterize(shp, r, field = "BoroCode")
plot(r)
plot(shp, lwd = 10, add = TRUE)
# I don't know the result of the above: the laptop basically hangs while
# processing plot(r) :)
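For what it's worth, here is a minimal sketch of the count-per-square idea, assuming a data frame crimes with coordinate columns X and Y (that data frame is not part of the code above) and reusing the shapefile extent:

library(raster)

# Hypothetical crime points: two-column matrix of coordinates
pts <- as.matrix(crimes[, c("X", "Y")])

# Empty raster over the area of interest; res() sets the square size,
# so changing it reruns the analysis with a different cell area
grid <- raster(extent(shp))
res(grid) <- 0.01

# Count how many points fall in each cell and plot the counts as colours
counts <- rasterize(pts, grid, field = 1, fun = "count")
plot(counts)
plot(shp, add = TRUE)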
I have a very basic grasp of stats, and a very basic grasp of R so please bear with me.
I have survey data which shows the weekly expenditure of a number of respondents. I have put this into a histogram, and have plotted a density function as well. So far so good.
How do I then apply this curve to a larger population? Say that I know that the population of my town is 25000. How can I apply that to the density curve to arrive at a new histogram and the data table behind it?
I hope this is an appropriate question, thank you.
It is not exactly clear what you want to do.
If you only have data on the sample, then the best estimate you have of the histogram/density for the population is the histogram/density of the sample; the only difference would be the scale on the y-axis. Personally, I think the tick marks on the y-axis should be ignored (my preference would be for the tick labels never to be plotted), since it is really the shape of the histogram/density that is important, and the tick labels can change based on things that don't affect the meaning. If you really feel the need to have the tick labels represent population values, then see the axis function.
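For example, a minimal sketch, assuming your weekly expenditures are in a vector called spend and a town population of 25000:

pop <- 25000  # assumed town population

# Compute the histogram without plotting, rescale the counts to the
# population, then plot; the shape is identical, only the y-axis changes
h <- hist(spend, plot = FALSE)
h$counts <- h$counts / sum(h$counts) * pop
plot(h, ylab = "Estimated number of people", main = "Weekly expenditure")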
If you want something more than this then give us a better description of what you are trying to accomplish.
Background:
I'm running a Monte Carlo simulation to show that a particular process (a cumulative mean) does not converge over time, and often diverges wildly in simulation (the expectation of the random variable = infinity). I want to plot about 10 of these simulations on a line chart, where the x axis has the iteration number, and the y axis has the cumulative mean up to that point.
Here's my problem:
I'll run the first simulation (each sim. having 10,000 iterations), and build the main plot based on its current range. But often one of the simulations will have a range a few orders of magnitude larger than the first one, so the plot flies outside of the original range. So, is there any way to dynamically update the ylim or xlim of a plot upon adding a new set of points or lines?
I can think of two workarounds for this:
1. Store each simulation, then pick the one with the largest range, and build the base graph off of that (not elegant, and I'd have to store a lot of data in memory, but probably laptop-friendly). [EDIT: as Marek points out, this is not a memory-intense example, but if you know of a nice solution that would support far more iterations, such that memory becomes an issue (think high-dimensional walks that require much, much larger MC samples for convergence), then jump right in.]
2. Find a seed that appears to build a nice-looking version of it, and set the ylim manually, which would make the demonstration reproducible.
Naturally I'm holding out for something more elegant than my workarounds. Hoping this isn't too pedestrian a problem, since I imagine it's not uncommon with simulations in R. Any ideas?
I'm not sure if this is possible using base graphics; if someone has a solution I'd love to see it. However, graphics systems based on grid (lattice and ggplot2) allow the graphics object to be saved and updated. It's insanely easy in ggplot2.
require(ggplot2)
make some data and get the range:
foo <- data.frame(data = rnorm(100), numb = seq_len(100))
make an initial ggplot object and plot it:
p <- ggplot(foo, aes(numb, data)) + geom_line()
p
make some more data and add it to the plot
foo <- data.frame(data = rnorm(200), numb = seq_len(200))
p <- p + geom_line(aes(numb, data), data = foo, colour = "red")
plot the new object
p
I think (1) is the best option. I actually don't think it's inelegant. I think it would be more computationally intensive to redraw every time you hit a point greater than xlim or ylim.
Also, I saw in Peter Hoff's book about Bayesian statistics a cool use of ts() instead of lines() for cumulative sums/means. It looks pretty spiffy:
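Something along these lines (my own sketch, not the code from the book, using a Cauchy sample as an example of a cumulative mean that fails to converge):

set.seed(1)

# Ten cumulative-mean paths from a heavy-tailed distribution; the sample
# mean of a Cauchy never settles down, so the paths wander wildly
sims <- replicate(10, {
  x <- rcauchy(10000)
  cumsum(x) / seq_along(x)
})

# Plot all paths as one multivariate time series: the common y-axis is
# computed from every path at once, so nothing flies off the plot
ts.plot(ts(sims), gpars = list(col = 1:10, xlab = "iteration",
                               ylab = "cumulative mean"))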