R crashed when using geom_point on a large data frame

Background: I have a large data frame, data_2014, containing ~1,000,000 rows like this:
library(tidyverse)
tibble(
  date_time = "4/1/2014 0:11:00",
  Lat = 40.7690,
  Lon = -73.9549,
  Base = "B02512"
)
Problem: I want to create a plot like this
This is what I've attempted to do:
library(tidyverse)
library(ggthemes)
library(scales)
min_lat <- 40.5774
max_lat <- 40.9176
min_long <- -74.15
max_long <- -73.7004
ggplot(data_2014, aes(Lon, Lat)) +
  geom_point(size = 1, color = "chocolate") +
  scale_x_continuous(limits = c(min_long, max_long)) +
  scale_y_continuous(limits = c(min_lat, max_lat)) +
  theme_map() +
  ggtitle("NYC Map Based on Uber Rides Data (April-September 2014)")
However, when I ran this code, RStudio crashed. I'm not sure how to fix or improve this. Are there any suggestions?

A million points is a lot for ggplot2, but doable if your computer is good enough. Yours may or may not be. Short of getting a bigger computer, here's what you can do.
This is spatial data, so use the sf package.
library(sf)
data_2014 <- st_as_sf(data_2014, coords = c('Lon', 'Lat')) %>%
  st_set_crs(4326)
If you're only plotting the points, get rid of the columns of data you don't need. I'm guessing they might include trip distance, time, borough, etc. Use dplyr's select, or whatever other method you're familiar with.
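For example, here is a minimal sketch assuming the only extra columns are the ones in your sample row (date_time and Base); swap in whatever your real frame actually contains:
library(dplyr)
# Keep only what the plot needs. With sf objects the geometry column is
# retained automatically by select(), so just drop the extras.
data_2014 <- data_2014 %>%
  select(-date_time, -Base)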
Try plotting some of the data, then a little more. See where your computer slows down and stop there. You can plot rows 1:n, or sample some number of rows.
# try starting with 100,000 and go up from there.
n <- 100000
ggplot(data_2014[1:n, ]) +
  geom_sf()
# Alternatively, sample a fraction of the data.
# Start with ~10% and go up until R crashes again.
data_2014 %>%
  sample_frac(0.1) %>%
  ggplot() +
  geom_sf()

Related

ggplot par new=TRUE option

I am trying to plot 400 ECDF graphs in one image using ggplot.
As far as I know, ggplot does not support the par(new=TRUE) option.
So the first solution I thought of was to use the grid.arrange function in the gridExtra package.
However, the ECDFs I am generating are produced inside a for loop.
Below is my code; you can ignore the data-processing steps.
i = 1
for (i in 1:400) {
  test <- subset(df, code == temp[i, ])
  test <- test[c(order(test$Distance)), ]
  test$AI_ij <- normalize(test$AI_ij)
  AI = test$AI_ij
  ggplot(test, aes(AI)) +
    stat_ecdf(geom = "step") +
    scale_y_continuous(labels = scales::percent) +
    theme_bw() +
    new_theme +
    xlab("Calculated Accessibility Value") +
    ylab("Percent")
}
So I have the values stored in "AI" inside the for loop.
In this case, how should I plot all 400 graphs in the same chart?
This is not the way to put multiple lines on a ggplot. To do this, it is far easier to pass all of your data together and map code to the "group" aesthetic to give you one ecdf line for each code.
By far the hardest part of answering this question was attempting to reverse-engineer your data set. The following data set should be close enough in structure and naming to allow the code to be run on your own data.
library(dplyr)
library(BBmisc)
library(ggplot2)
set.seed(1)
all_codes <- apply(expand.grid(1:16, LETTERS), 1, paste0, collapse = "")
temp <- data.frame(sample(all_codes, 400), stringsAsFactors = FALSE)
df <- data.frame(code = rep(all_codes, 100),
                 Distance = sqrt(rnorm(41600)^2 + rnorm(41600)^2),
                 AI_ij = rnorm(41600),
                 stringsAsFactors = FALSE)
Since you only want the 400 codes from temp that appear in df to be shown on the plot, you can use dplyr::filter to keep the rows where code %in% temp[[1]], rather than iterating through them one element at a time.
You can then group_by code, and arrange by Distance within each group before normalizing AI_ij, so there is no need to split your data frame into a new subset for every line: the data is processed all at once and the data frame is kept together.
Finally, you plot this using the group aesthetic. Note that because you have 400 lines on one plot, you need to make each line faint in order to see the overall pattern more clearly. We do this by setting the alpha value to 0.05 inside stat_ecdf.
Note also that there are multiple packages with a function called normalize and I don't know which one you are using. I have guessed you are using BBmisc.
So you can get rid of the loop and do:
df %>%
  filter(code %in% temp[[1]]) %>%
  group_by(code) %>%
  arrange(Distance, .by_group = TRUE) %>%
  mutate(AI = normalize(AI_ij)) %>%
  ggplot(aes(AI, group = code)) +
  stat_ecdf(geom = "step", alpha = 0.05) +
  scale_y_continuous(labels = scales::percent) +
  theme_bw() +
  xlab("Calculated Accessibility Value") +
  ylab("Percent")

Speed up rendering of large heatmap from ggplot in R

I am trying to plot a large heatmap, generated with ggplot, in R. Ultimately, I would like to 'polish' this heat map using Illustrator.
Sample code:
# Load packages (tidyverse)
library(tidyverse)
# Create dataframe
df <- expand.grid(x = seq(1,100000), y = seq(1,100000))
# add variable: performance
set.seed(123)
df$z <- rnorm(nrow(df))
ggplot(data = df, aes(x = x, y = y)) +
  geom_raster(aes(fill = z))
Although I save the plot as a vector image (.pdf, which is not that large), the PDF loads very slowly when opened. I suspect every individual point in the data frame is rendered when the file is opened.
I have read other posts (e.g. Data exploration in R: display heatmap of large matrix, quickly?) that use image() to visualize matrices; however, I would like to use ggplot so I can modify the image.
Question: How do I speed up the rendering of this plot? Is there a way to speed this process up (besides lowering the resolution of the plot) while keeping the image vectorized? Is it possible to downsample a vectorized ggplot?
The first thing I tried was stat_summary_2d to get average binning, but it seemed slow and also created some artifacts on the right and top edges:
library(tidyverse)
df <- expand.grid(x = seq(1,1000), y = seq(1,1000))
set.seed(123)
df$z <- rnorm(nrow(df))
print(object.size(df), units = "Mb")
#15.4 Mb
ggplot(data = df, aes(x = x, y = y, z = z)) +
  stat_summary_2d(bins = c(100, 100)) +  # 10x downsample, in this case
  scale_x_continuous(breaks = 100 * 0:10) +
  labs(title = "stat_summary_2d, 1000x1000 downsampled to 100x100")
Even though this is much smaller than your suggested data, it still took about 3 seconds to plot on my machine, and it had artifacts on the top and right edges, presumably because the edge bins are smaller and therefore show more variation.
It got slower from there when I tried a larger grid like the one you are requesting.
(As an aside, it may be worth clarifying that a vector graphic file like a PDF, unlike a raster graphic, can be resized without loss of resolution. However, in this use case the output is a 10,000-megapixel raster image, far beyond the limits of human perception, that is being exported into a vector format, where each "pixel" becomes a very tiny rectangle in the PDF. That use of a vector format could be useful for certain unusual cases, like if you ever need to blow up your heatmap without loss of resolution onto a gigantic surface, like a football field. But it sounds like in this case it might be the wrong tool for the job, since you're putting heaps of data into the vector file that won't be perceptible.)
What worked more efficiently was to do the averaging with dplyr before ggplot. With that, I could take a 10k x 10k array and downsample it 100x before sending to ggplot. This necessarily reduces the resolution, but I don't understand the value in this use case of preserving resolution beyond human abilities to perceive it.
Here's some code to do the bucketing ourselves and then plot the downsampled version:
# Using a 10k x 10k array (1527.1 Mb when initialized), e.g.
# df <- expand.grid(x = seq(1, 10000), y = seq(1, 10000)); set.seed(123); df$z <- rnorm(nrow(df))
downsample <- 100
df2 <- df %>%
  group_by(x = downsample * round(x / downsample),
           y = downsample * round(y / downsample)) %>%
  summarise(z = mean(z))

ggplot(df2, aes(x = x, y = y)) +
  geom_raster(aes(fill = z)) +
  scale_x_continuous(breaks = 1000 * 0:10) +
  labs(title = "10,000x10,000 downsampled to 100x100")
Your reproducible example just shows noise, so it's hard to know what kind of output you would like.
One way would be to follow #dww's suggestion and use geom_hex to show aggregated data.
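For illustration, here is a rough sketch of that aggregation idea on a smaller 1000x1000 grid (my own sketch, not #dww's code), using stat_summary_hex, which averages z within each hexagon and needs the hexbin package installed:
library(tidyverse)
df <- expand.grid(x = seq(1, 1000), y = seq(1, 1000))
set.seed(123)
df$z <- rnorm(nrow(df))
# Each hex shows the mean of z inside it, so far fewer shapes end up in the file.
ggplot(df, aes(x = x, y = y, z = z)) +
  stat_summary_hex(bins = 100) +
  labs(title = "1000x1000 summarised into hexagonal bins")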
Another way, since you ask "Is it possible to downsample a vectorized ggplot?", is to use dplyr::sample_frac or dplyr::sample_n in the data argument of your geom_raster. I have to use a smaller data frame than in your example, though, or I can't build the df at all.
library(tidyverse)
# Create dataframe
df <- expand.grid(x = seq(1,1000), y = seq(1,1000))
# add variable: performance
set.seed(123)
df$z <- rnorm(nrow(df))
ggplot(data = df, aes(x = x, y = y)) +
  geom_raster(aes(fill = z), . %>% sample_frac(0.1))
If you want to start from your high-resolution ggplot object, you can do the following for the same effect:
gg <- ggplot(data = df, aes(x = x, y = y)) +
  geom_raster(aes(fill = z))
gg$data <- sample_frac(gg$data, 0.1)
gg

How to use cshapes and ggplot2 to make a choropleth map in R?

I'm having trouble doing something very basic. I've done this hundreds of times with no problem with other maps, but I can't get a cshapes shapefile to map properly using ggplot2 (as an example, I'm trying to map "AREA" as the fill, which is a variable that comes with the cshapes shapefile). Here is the code I'm using:
library(cshapes)
library(ggplot2)
library(plyr)  # needed for join()
world <- cshp(date=as.Date("2009-1-1"))
world@data$id <- rownames(world@data)
world.df = fortify(world, region="COWCODE")
world.df <- join(world.df, world@data, by="id")
ggplot() +
  geom_polygon(data = world.df,
               aes(x = long, y = lat, group = group, fill = AREA)) +
  coord_equal()
What I end up with is a map that is missing data for the eastern hemisphere. Not sure what's going on; any assistance is much appreciated.
The id you created did not match the id in world.df, so NAs were introduced when joining by id.
If you instead set region to SP_ID and join by SP_ID, it works:
world <- cshp(date=as.Date("2009-1-1"))
world.df = fortify(world, region="SP_ID")
names(world.df)[6] <- "SP_ID"
world.df <- join(world.df, world@data)
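The plot call from the question should then work unchanged on the rejoined data frame:
ggplot() +
  geom_polygon(data = world.df,
               aes(x = long, y = lat, group = group, fill = AREA)) +
  coord_equal()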
OK, so I figured out the problem. When I inspected the data frame created by fortify() and then re-merged it with the original data, I noticed that NAs were produced in the merge, and I wasn't sure why. So I checked the help page for fortify() to see if I was missing an argument and, lo and behold, it says: "Rather than using this function, I now recommend using the broom package, which implements a much wider range of methods. fortify may be deprecated in the future." I had never seen this before, and it likely explains why I never had trouble in the past. So I checked out library(broom); the equivalent function is tidy(), which works just fine, like so:
library(broom)
library(cshapes)
library(ggplot2)
library(plyr)   # for join()
library(dplyr)
world <- cshp(date=as.Date("2009-1-1"))
world@data$id <- rownames(world@data)
world.df = tidy(world)
world.df$arrange <- 1:192609  # needs to be reordered (something fortify did automatically)
world.df <- join(world.df, world@data, by="id")
world.df <- arrange(world.df, arrange)
ggplot() +
  geom_polygon(data = world.df,
               aes(x = long, y = lat, group = group, fill = AREA)) +
  coord_equal()
Which produces the following:

Average values of a point dataset to a grid dataset

I am relatively new to ggplot, so please forgive me if some of my problems are really simple or not solvable at all.
What I am trying to do is generate a "heat map" of a country where the filling of the shape is continuous. Furthermore, I have the shape of the country as .RData. I used Hadley Wickham's script to transform my SpatialPolygon data into a data frame. The long and lat data of my data frame now looks like this:
head(my_df)
long lat group
6.527187 51.87055 0.1
6.531768 51.87206 0.1
6.541202 51.87656 0.1
6.553331 51.88271 0.1
This long/lat data draws the outline of Germany. The rest of the data frame is omitted here since I think it is not needed. I also have a second data frame of values for certain long/lat points, which looks like this:
my_fixed_points
long lat value
12.817 48.917 0.04
8.533 52.017 0.034
8.683 50.117 0.02
7.217 49.483 0.0542
What I would like to do now is colour each point of the map according to the average value over all the fixed points that lie within a certain distance of that point. That way I would get an (almost) continuous colouring of the whole map of the country.
What I have so far is the map of the country plotted with ggplot2
ggplot(my_df, aes(long, lat)) +
  geom_polygon(aes(group = group), fill = "white") +
  geom_path(color = "white", aes(group = group)) +
  coord_equal()
My first idea was to generate points that lie within the map that has been drawn, and then calculate the value for every generated point my_generated_point, like so:
value_vector <- subset(my_fixed_points,
                       spDistsN1(cbind(my_fixed_points$long, my_fixed_points$lat),
                                 c(my_generated_point$long, my_generated_point$lat),
                                 longlat = TRUE) < 50,
                       select = value)
point_value <- mean(value_vector)
I haven't found a way to generate these points, though. And as with the whole problem, I don't even know if it is possible to solve it this way. My question is whether there is a way to generate these points and/or whether there is another way to arrive at a solution.
Solution
Thanks to Paul I almost got what I wanted. Here is an example with sample data for the Netherlands.
library(ggplot2)
library(sp)
library(automap)
library(rgdal)
library(scales)
#get the spatial data for the Netherlands
con <- url("http://gadm.org/data/rda/NLD_adm0.RData")
print(load(con))
close(con)
#transform them into the right format for autoKrige
gadm_t <- spTransform(gadm, CRS=CRS("+proj=merc +ellps=WGS84"))
#generate some random values that serve as fixed points
value_points <- spsample(gadm_t, type="stratified", n = 200)
values <- data.frame(value = rnorm(dim(coordinates(value_points))[1], 0 ,1))
value_df <- SpatialPointsDataFrame(value_points, values)
#generate a grid that can be estimated from the fixed points
grd = spsample(gadm_t, type = "regular", n = 4000)
kr <- autoKrige(value~1, value_df, grd)
dat = as.data.frame(kr$krige_output)
#draw the generated grid with the underlying map
ggplot(gadm_t, aes(long, lat)) +
  geom_polygon(aes(group = group), fill = "white") +
  geom_path(color = "white", aes(group = group)) +
  coord_equal() +
  geom_tile(aes(x = x1, y = x2, fill = var1.pred), data = dat) +
  scale_fill_continuous(low = "white", high = muted("orange"), name = "value")
I think what you want is something along these lines. I predict that this homebrew is going to be terribly inefficient for large datasets, but it works on a small example dataset. I would look into kernel densities and maybe the raster package. But maybe this suits you well...
The following snippet of code calculates, for each point of a grid overlaying the original point dataset, the mean cadmium concentration of all observations closer than 1000 m.
library(sp)
library(ggplot2)
library(scales)  # for muted()
loadMeuse()      # if loadMeuse() is unavailable: data(meuse); coordinates(meuse) <- ~x+y
# Generate a grid to sample on
bb = bbox(meuse)
grd = spsample(meuse, type = "regular", n = 4000)
# Come up with the mean cadmium value
# of all points < 1000 m away.
mn_value = sapply(1:length(grd), function(pt) {
  d = spDistsN1(meuse, grd[pt, ])
  return(mean(meuse[d < 1000, ]$cadmium))
})
# Make a new object
dat = data.frame(coordinates(grd), mn_value)
ggplot(aes(x = x1, y = x2, fill = mn_value), data = dat) +
  geom_tile() +
  scale_fill_continuous(low = "white", high = muted("blue")) +
  coord_equal()
which leads to the following image:
An alternative approach is to use an interpolation algorithm; one example is kriging. This is quite easy using the automap package (spot the self-promotion: I wrote the package):
library(automap)
# If meuse.grid is not loaded yet:
# data(meuse.grid); coordinates(meuse.grid) <- ~x+y; gridded(meuse.grid) <- TRUE
kr = autoKrige(cadmium ~ 1, meuse, meuse.grid)
dat = as.data.frame(kr$krige_output)
ggplot(aes(x = x, y = y, fill = var1.pred), data = dat) +
  geom_tile() +
  scale_fill_continuous(low = "white", high = muted("blue")) +
  coord_equal()
which leads to the following image:
However, without knowing what your goal is with this map, it is hard for me to tell exactly what you want.
This slideshow offers another approach: see page 18 for a description of the approach and page 21 for a view of what the results looked like for the slide-maker.
Note, however, that the slide-maker used the sp package and its spplot function rather than ggplot2 and its plotting functions.
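For orientation, here is a minimal spplot call on the meuse data used above; this is my own illustration of the sp/spplot route, not the slide-maker's code:
library(sp)
data(meuse)
coordinates(meuse) <- ~x + y   # promote to a SpatialPointsDataFrame
# Colour the sample points by their cadmium concentration
spplot(meuse, "cadmium")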

How to create histogram in R with CSV time data?

I have CSV data of a log for 24 hours that looks like this:
svr01,07:17:14,'u1@user.de','8.3.1.35'
svr03,07:17:21,'u2@sr.de','82.15.1.35'
svr02,07:17:30,'u3@fr.de','2.15.1.35'
svr04,07:17:40,'u2@for.de','2.1.1.35'
I read the data with tbl <- read.csv("logs.csv")
How can I plot this data in a histogram to see the number of hits per hour?
Ideally, I would get 4 bars per hour, one each for svr01, svr02, svr03, and svr04.
Thank you for helping me here!
I don't know if I understood you correctly, so I will split my answer into parts. The first part is how to convert your time into a vector you can use for plotting.
a) Converting your data into hours:
#df being the dataframe
df$timestamp <- strptime(df$timestamp, format="%H:%M:%S")
df$hours <- as.numeric(format(df$timestamp, format="%H"))
hist(df$hours)
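Note that the log lines in the question have no header row, so the above assumes the columns were given names when the file was read in; for example (the column names here are my assumption):
# Hypothetical column names for the four fields in the log; adjust as needed.
df <- read.csv("logs.csv", header = FALSE,
               col.names = c("server", "timestamp", "user", "ip"),
               stringsAsFactors = FALSE)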
This gives you a histogram of hits over all servers. If you want to split the histograms by server, this is one way, but of course there are numerous others:
b) Making a histogram with ggplot2
#install.packages("ggplot2")
require(ggplot2)
ggplot(data = df) + geom_histogram(aes(x = hours), binwidth = 1) + facet_wrap(~ server)
# or use a colour fill instead
ggplot(data = df) + geom_histogram(aes(x = hours, fill = server), binwidth = 1)
c) You could also use another package:
require(plotrix)
l <- split(df$hours, f=df$server)
multhist(l)
The example outputs are shown below. The third makes comparison easier, but ggplot2 simply looks better, I think.
EDIT
Here is how these solutions would look:
first solution:
second solution:
third solution:
An example dataset:
dat = data.frame(server = paste("svr", round(runif(1000, 1, 10)), sep = ""),
                 time = Sys.time() + sort(round(runif(1000, 1, 36000))))
The trick I use is to create a new variable which only specifies in which hour the hit was recorded:
dat$hr = strftime(dat$time, "%H")
Now we can use some plyr magic (count() comes from plyr):
library(plyr)
library(ggplot2)
hits_hour = count(dat, vars = c("server", "hr"))
And create the plot:
ggplot(data = hits_hour) + geom_bar(aes(x = hr, y = freq, fill = server), stat="identity", position = "dodge")
Which looks like:
I don't really like this plot, I'd be more in favor of:
ggplot(data = hits_hour) + geom_line(aes(x = as.numeric(hr), y = freq)) + facet_wrap(~ server, nrow = 1)
Which looks like:
Putting all the facets in one row allows easy comparison of the number of hits between the servers. This will look even better when using real data instead of my random data.
