I am relatively new to ggplot, so please forgive me if some of my problems are really simple or not solvable at all.
What I am trying to do is generate a "heat map" of a country where the filling of the shape is continuous. I have the shape of the country as .RData, and I used Hadley Wickham's script to transform my SpatialPolygon data into a data frame. The long and lat data of my data frame now looks like this
long lat group
6.527187 51.87055 0.1
6.531768 51.87206 0.1
6.541202 51.87656 0.1
6.553331 51.88271 0.1
This long/lat data draws the outline of Germany. The rest of the data frame is omitted here since I think it is not needed. I also have a second data frame of values for certain long/lat points. This looks like this
long lat value
12.817 48.917 0.04
8.533 52.017 0.034
8.683 50.117 0.02
7.217 49.483 0.0542
What I would like to do now is colour each point of the map according to the average value of all the fixed points that lie within a certain distance of that point. That way I would get an (almost) continuous colouring of the whole map of the country.
What I have so far is the map of the country plotted with ggplot2
ggplot(my_df,aes(long,lat)) + geom_polygon(aes(group=group), fill="white") +
geom_path(color="white",aes(group=group)) + coord_equal()
My first idea was to generate points that lie within the map that has been drawn, and then calculate the value for every generated point my_generated_point like so
# keep the fixed points within 50 km of the generated point
value_vector <- subset(my_fixed_points,
  spDistsN1(cbind(my_fixed_points$long, my_fixed_points$lat),
    c(my_generated_point$long, my_generated_point$lat), longlat=TRUE) < 50,
  select = value)
# subset() returns a data frame, so average its value column
point_value <- mean(value_vector$value)
I haven't found a way to generate these points, though. And as with the whole problem, I don't even know whether it can be solved this way. My question is whether there is a way to generate these points, and/or whether there is another way to reach a solution.
Thanks to Paul I almost got what I wanted. Here is an example with sample data for the Netherlands.
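library(ggplot2)
library(sp)
library(rgdal)    # spTransform methods
library(automap)  # autoKrige
library(scales)   # muted
#get the spatial data for the Netherlands
con <- url("http://gadm.org/data/rda/NLD_adm0.RData")
load(con)   # loads the object `gadm` used below
close(con)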
#transform them into the right format for autoKrige
gadm_t <- spTransform(gadm, CRS=CRS("+proj=merc +ellps=WGS84"))
#generate some random values that serve as fixed points
value_points <- spsample(gadm_t, type="stratified", n = 200)
values <- data.frame(value = rnorm(dim(coordinates(value_points))[1], 0 ,1))
value_df <- SpatialPointsDataFrame(value_points, values)
#generate a grid that can be estimated from the fixed points
grd = spsample(gadm_t, type = "regular", n = 4000)
kr <- autoKrige(value~1, value_df, grd)
dat = as.data.frame(kr$krige_output)
#draw the generated grid with the underlying map
ggplot(gadm_t,aes(long,lat)) + geom_polygon(aes(group=group), fill="white") + geom_path(color="white",aes(group=group)) + coord_equal() +
geom_tile(aes(x = x1, y = x2, fill = var1.pred), data = dat) + scale_fill_continuous(low = "white", high = muted("orange"), name = "value")
I think what you want is something along these lines. I predict that this homebrew is going to be terribly inefficient for large datasets, but it works on a small example dataset. I would look into kernel densities and maybe the raster package. But maybe this suits you well...
The following snippet of code calculates the mean value of cadmium concentration of a grid of points overlaying the original point dataset. Only points closer than 1000 m are considered.
library(sp)
library(ggplot2)
library(scales)  # muted()
# Load the meuse point data and turn it into a spatial object
data(meuse)
coordinates(meuse) = ~x+y
# Generate a grid to sample on
bb = bbox(meuse)
grd = spsample(meuse, type = "regular", n = 4000)
# Come up with mean cadmium value
# of all points < 1000m.
mn_value = sapply(1:length(grd), function(pt) {
  d = spDistsN1(meuse, grd[pt,])
  return(mean(meuse[d < 1000,]$cadmium))
})
# Make a new object
dat = data.frame(coordinates(grd), mn_value)
ggplot(aes(x = x1, y = x2, fill = mn_value), data = dat) +
  geom_tile() +
  scale_fill_continuous(low = "white", high = muted("blue"))
which leads to the following image:
An alternative approach is to use an interpolation algorithm. One example is kriging. This is quite easy using the automap package (spot the self promotion :), I wrote the package):
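library(automap)
# meuse.grid holds the prediction locations
data(meuse.grid)
coordinates(meuse.grid) = ~x+y
kr = autoKrige(cadmium~1, meuse, meuse.grid)
dat = as.data.frame(kr$krige_output)
ggplot(aes(x = x, y = y, fill = var1.pred), data = dat) +
  geom_tile() +
  scale_fill_continuous(low = "white", high = muted("blue"))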
which leads to the following image:
However, without knowledge as to what your goal is with this map, it is hard for me to see what you want exactly.
This slideshow offers another approach--see page 18 for a description of the approach and page 21 for a view of what the results looked like for the slide-maker.
Note however that the slide-maker used the sp package and the spplot function rather than ggplot2 and its plotting functions.
I am using an excel sheet for data. One column has FIPS numbers for GA counties and the other is labeled Count with numbers 1 - 5. I have made a map with these values using the following code:
carrierdata <- import("GA Info.xlsx")
plot_usmap( data = carrierdata, values = "Count", "counties", include = c("GA"), color="black") +
scale_fill_continuous(low = "#56B1F7", high = "#132B43", name="Count", label=scales::comma)+
theme(plot.background=element_rect(), legend.position="right")
I've included the picture of the map I get and a sample of the data I am using. Can anyone help me put the actual Count numbers on each county?
The usmap package is a good source for county maps, but the data it contains is in the format of data frames of x, y co-ordinates of county outlines, whereas you need the numbers plotted in the center of the counties. The package doesn't seem to contain the center co-ordinates for each county.
Although it's a bit of a pain, it is worth converting the map into a formal sf data frame format to give better plotting options, including the calculation of the centroid for each county. First, we'll load the necessary packages, get the Georgia data and convert it to sf format:
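library(usmap)
library(sf)
library(ggplot2)
# County outlines for Georgia only
d <- us_map("counties")
d <- d[d$abbr == "GA",]
# One sf polygon per county, built from its outline coordinates
GAc <- lapply(split(d, d$county), function(x) st_polygon(list(cbind(x$x, x$y))))
GA <- st_sfc(GAc, crs = usmap_crs()@projargs)
GA <- st_sf(data.frame(fips = unique(d$fips), county = names(GAc), geometry = GA))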
Now, obviously I don't have your numeric data, so I'll have to make some up, equivalent to the data you are importing from Excel. I'll assume your own carrierdata has a column named "fips" and another called "values":
carrierdata <- data.frame(fips = GA$fips, values = sample(5, nrow(GA), TRUE))
So now we left_join our imported data to the GA county data:
GA <- dplyr::left_join(GA, carrierdata, by = "fips")
And we can calculate the center point for each county:
GA$centroids <- st_centroid(GA$geometry)
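To give some intuition for what a "center point" means here, this is a minimal base R sketch of the area-weighted centroid of a single simple polygon ring (the shoelace formula). The function name polygon_centroid and the square used to try it out are made up for illustration; it is not a description of how sf implements st_centroid, which remains the thing to use on real county data.

```r
# Area-weighted centroid of one simple (non self-intersecting) polygon ring.
# x, y: vertex coordinates in order; the ring is closed automatically.
polygon_centroid <- function(x, y) {
  n <- length(x)
  if (x[1] != x[n] || y[1] != y[n]) {
    x <- c(x, x[1])
    y <- c(y, y[1])
  }
  i <- seq_len(length(x) - 1)
  # Cross product of each pair of consecutive vertices
  cross <- x[i] * y[i + 1] - x[i + 1] * y[i]
  area <- sum(cross) / 2  # signed area of the ring
  c(x = sum((x[i] + x[i + 1]) * cross) / (6 * area),
    y = sum((y[i] + y[i + 1]) * cross) / (6 * area))
}

# A 2 x 2 square with its lower-left corner at the origin
ctr <- polygon_centroid(c(0, 2, 2, 0), c(0, 0, 2, 2))
print(ctr)
```

For oddly shaped counties the centroid can fall near or even outside the outline, so it is worth checking where the labels end up.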
All that's left now is to plot the result:
ggplot(GA) +
geom_sf(aes(fill = values)) +
geom_sf_text(aes(label = values, geometry = centroids), colour = "white")
I am trying to plot a large heatmap, generated with ggplot, in R. Ultimately, I would like to 'polish' this heat map using Illustrator.
Sample code:
# Load packages
library(tidyverse)
# Create dataframe
df <- expand.grid(x = seq(1,100000), y = seq(1,100000))
# add variable: performance
df$z <- rnorm(nrow(df))
ggplot(data = df, aes(x = x, y = y)) +
geom_raster(aes(fill = z))
Although I save the plot as a vectorized image (.pdf; that is not that large), the pdf is loading very slowly when opening. I expect that every individual point in the data frame is rendered when opening the file.
I have read other posts (e.g. Data exploration in R: display heatmap of large matrix, quickly?) that use image() to visualize matrices, however I would like to use ggplot to modify the image.
Question: How do I speed up the rendering of this plot? Is there a way (besides lowering the resolution of the plot), while keeping the image vectorized, to speed this process up? Is it possible to downsample a vectorized ggplot?
The first thing I tried was stat_summary_2d to get average binning, but it seemed slow and also created some artifacts on the right and top edges:
df <- expand.grid(x = seq(1,1000), y = seq(1,1000))
df$z <- rnorm(nrow(df))
print(object.size(df), units = "Mb")
#15.4 Mb
ggplot(data = df, aes(x = x, y = y, z = z)) +
stat_summary_2d(bins = c(100,100)) + #10x downsample, in this case
scale_x_continuous(breaks = 100*0:10) +
labs(title = "stat_summary_2d, 1000x1000 downsampled to 100x100")
Even though this is much smaller than your suggested data, this still took about 3 seconds to plot on my machine, and had artifacts on the top and right edges, I presume due to those bins being smaller ones from the edges, leaving more variation.
It got slower from there when I tried a larger grid like you are requesting.
(As an aside, it may be worth clarifying that a vector graphic file like a PDF, unlike a raster graphic, can be resized without loss of resolution. However, in this use case, the output is a 10,000-megapixel raster, far beyond the limits of human perception, that is being exported into a vector format, where each "pixel" becomes a very tiny rectangle in the PDF. That use of a vector format could be useful in certain unusual cases, for example if you ever need to blow up your heatmap without loss of resolution onto a gigantic surface, like a football field. But it sounds like in this case it might be the wrong tool for the job, since you're putting heaps of data into the vector file that won't be perceptible.)
What worked more efficiently was to do the averaging with dplyr before ggplot. With that, I could take a 10k x 10k array and downsample it 100x before sending to ggplot. This necessarily reduces the resolution, but I don't understand the value in this use case of preserving resolution beyond human abilities to perceive it.
Here's some code to do the bucketing ourselves and then plot the downsampled version:
library(dplyr)
# Using 10k x 10k array, 1527.1 Mb when initialized
downsample <- 100
df2 <- df %>%
group_by(x = downsample * round(x / downsample),
y = downsample * round(y / downsample)) %>%
summarise(z = mean(z))
ggplot(df2, aes(x = x, y = y)) +
geom_raster(aes(fill = z)) +
scale_x_continuous(breaks = 1000*0:10) +
labs(title = "10,000x10,000 downsampled to 100x100")
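A variation on the bucketing above, as a base R sketch: rounding to the nearest multiple centers the buckets on multiples of downsample, so the buckets at the edges of the grid can collect fewer cells than the interior ones. Using integer division instead gives every bucket the same number of cells. The small 100 x 100 grid, the deterministic z values, and the column names xb and yb are assumptions made only to keep the example quick and checkable.

```r
# Small stand-in grid so the sketch runs quickly
df <- expand.grid(x = 1:100, y = 1:100)
df$z <- df$x + df$y  # deterministic values instead of noise

downsample <- 10
# Integer division assigns cells 1..10 to bucket 0, 11..20 to bucket 1, ...
df$xb <- (df$x - 1) %/% downsample
df$yb <- (df$y - 1) %/% downsample

# Mean of z per bucket
df_small <- aggregate(z ~ xb + yb, data = df, FUN = mean)

# Number of original cells that fell into each bucket
cells_per_bucket <- table(df$xb, df$yb)
print(range(cells_per_bucket))
```

The resulting df_small can be passed to geom_raster in the same way as df2, with xb and yb as the coordinates.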
Your reproducible example just shows noise so it's hard to know what kind of output you would like.
One way would be to follow @dww's suggestion and use geom_hex to show aggregated data.
Another way, as you ask "Is it possible to downsample a vectorized ggplot?", is to use dplyr::sample_frac or dplyr::sample_n in the data argument of your geom_raster. I have to take a smaller sample than in your example though or I can't build the df.
# Create dataframe
df <- expand.grid(x = seq(1,1000), y = seq(1,1000))
# add variable: performance
df$z <- rnorm(nrow(df))
ggplot(data = df, aes(x = x, y = y)) +
geom_raster(aes(fill = z), . %>% sample_frac(0.1))
If you want to start from your high resolution ggplot object you can do for the same effect:
gg <- ggplot(data = df, aes(x = x, y = y)) +
geom_raster(aes(fill = z))
gg$data <- sample_frac(gg$data,0.1)
After searching around a lot, asking, and doing some code, I kinda got the bare minimum for doing kriging in R's gstat.
Using 4 points (I know, totally bad), I kriged the unsampled points located between them. But in actuality, I don't need all of those points. Inside that area, there is a smaller subarea... this area is the one I actually need.
Long story short.. I have measurements taken from 4 weather stations that report rainfall data. The lat and long coordinates for these points are:
lat long
7.16 124.21
8.6 123.35
8.43 124.28
8.15 125.08
My road to kriging can be seen through my previous questions on StackOverflow.
This: Create variogram in R's gstat package
And this: Create Grid in R for kriging in gstat
I know that the image has the following coordinates (at least according to my estimates):
Leftmost: 124 13ish 0 E(DMS)
Rightmost : 124 20ish 0 E
Topmost coordinates: 124 17ish 0 E
Bottommost coordinates: 124 16ish 0 E
The coordinates will have to be converted, but I don't think that matters here; it is easier to deal with later.
The image is also irregular (but aren't they all though).
Think of it like a doughnut: you krige the whole circular shape of the doughnut, but you only need the area covered by the hole, so you remove or at least disregard the values you got from the doughnut itself.
I have an image (.jpg) of the area in question, I will have to convert the image into a shapefile or some other vector format using QGIS or similar software. After that, I will have to insert that vector image inside the 4 point kriged area, so I know which coordinates to actually consider and which ones to remove.
Finally, I take the values of the area covered by the image and store them into a csv or database.
Anybody know how I can start with this? Total noob at R and statistics. Thanks to anyone who responds.
I just want to know if it's possible and, if it is, get some tips. Thanks again.
Might as well also post my script:
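library(sp)          # coordinates
library(gstat)       # variogram, fit.variogram, krige
library(ggplot2)
library(gridExtra)   # grid.arrange
library(RPostgreSQL) # PostgreSQL driver for dbDriver
library(dplyr) # for "glimpse"
library(scales) # for "comma"
drv <- dbDriver("PostgreSQL")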
con <- dbConnect(drv, dbname="Rainfall Data", host="localhost", port=5432,
user="postgres", password="postgres")
day_1 <- dbGetQuery(con, "SELECT lat, long, rainfall FROM cotabato.sample")
coordinates(day_1) <- ~ lat + long
x.range <- as.integer(c(7.0,9.0))
y.range <- as.integer(c(123.0,126.0))
grid <- expand.grid(x=seq(from=x.range[1], to=x.range[2], by=0.05),
y=seq(from=y.range[1], to=y.range[2], by=0.05))
coordinates(grid) <- ~x+y
plot(grid, cex=1.5)
points(day_1, col='red')
title("Interpolation Grid and Sample Points")
day_1.vgm <- variogram(rainfall~1, day_1, width = 0.02, cutoff = 1.8)
day_1.fit <- fit.variogram(day_1.vgm, model=vgm("Sph", psill = 8000, range = 1))
plot(day_1.vgm, day_1.fit)
plot1 <- day_1 %>% as.data.frame %>%
ggplot(aes(lat, long)) + geom_point(size=1) + coord_equal() +
ggtitle("Points with measurements")
plot2 <- grid %>% as.data.frame %>%
ggplot(aes(x, y)) + geom_point(size=1) + coord_equal() +
ggtitle("Points at which to estimate")
grid.arrange(plot1, plot2, ncol = 2)
# grid was already converted to a spatial object above, so coordinates() is not set again here
day_1.kriged <- krige(rainfall~1, day_1, grid, model=day_1.fit)
day_1.kriged %>% as.data.frame %>%
ggplot(aes(x=x, y=y)) + geom_tile(aes(fill=var1.pred)) + coord_equal() +
scale_fill_gradient(low = "yellow", high="red") +
scale_x_continuous(labels=comma) + scale_y_continuous(labels=comma)
write.csv(day_1.kriged, file = "Day_1.csv")
EDIT: The code has changed since the last time. But that doesn't matter, I guess; I just want to know if it's possible, and whether anybody can provide the simplest example showing that it is. I can derive the solution to my own problem from there.
Let me know if you find this useful:
"Think of it like a doughnut, you krige the whole circular shape of the doughnut but you only need the area covered by the hole so you remove or at least disregard the values you got from the doughnut itself."
For this you load your vector data:
donut <- rgdal::readOGR('/variogram', 'donut')
day_1@proj4string@projargs <- "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0" # Because the donut shape has this CRS
plot(donut, axes = TRUE, col = 3)
plot(day_1, col = 2, pch = 20, add = TRUE)
Then you delete the 'external ring' and keep the inner one. You also mark the remaining ring as no longer being a hole:
hole <- donut # to keep the original shape
hole@polygons[1][[1]]@Polygons[1] <- NULL
hole@polygons[1][[1]]@Polygons[1][[1]]@hole <- FALSE
plot(hole, axes = TRUE, col = 4, add = TRUE)
After that you check which points are inside the new blue 'hole' vector layer:
over.pts <- over(day_1, hole)
day_1_subset <- day_1[!is.na(over.pts$Id), ]
plot(donut, axes = TRUE, col = 3)
plot(hole, col = 4, add = TRUE)
plot(day_1, col = 2, pch = 20, add = TRUE)
plot(day_1_subset, col = 'white', pch = 1, cex = 2, add = TRUE)
write.csv(day_1_subset@data, 'myfile.csv') # write intersected points table
write.csv(as.data.frame(coordinates(day_1_subset)), 'myfile_coords.csv') # write intersected points coords (separate file name so the table above is not overwritten)
writeOGR(day_1_subset, 'path', 'mysubsetlayer', driver = 'ESRI Shapefile') # write intersected points shape
With this code you can solve the 'ring' or doughnut 'hole' problem if you already have the shapefile.
If you have an image and want to clip it, try the following.
In case you load a raster (get a basemap image from the web):
coordDf <- as.data.frame(coordinates(day_1)) # get basemap from points
# coordDf <- data.frame(hole@polygons[1][[1]]@Polygons[1][[1]]@coords) # get basemap from hole
colnames(coordDf) <- c('x', 'y')
imag <- dismo::gmap(coordDf, lonlat = TRUE)
myimag <- raster::crop(day_1.kriged, hole)
plot(day_1, add = TRUE, col = 2)
In case you use day_1.kriged:
myCropKrig<- raster::crop(day_1.kriged, hole)
myCropKrig %>% as.data.frame %>%
ggplot(aes(x=x, y=y)) + geom_tile(aes(fill=var1.pred)) + coord_equal() +
scale_fill_gradient(low = "yellow", high="red") +
scale_x_continuous(labels=comma) + scale_y_continuous(labels=comma) +
geom_point(data=coordDf[!is.na(over.pts$Id), ], aes(x=x, y=y), color="blue", size=3, shape=20)
And "Finally, I take the values of the area covered by the image and store them into a csv or database."
write.csv(as.data.frame(myCropKrig), 'myCropKrig.csv')
I hope you find this useful and that I have understood what you meant.
To simplify your question:
- You want to delineate an area based on an image that is not georeferenced.
- You want to extract the results of an interpolation only for this area.
A few steps are required:
1. Use QGIS to georeference your image (Raster > Georeferencer). You need a georeferenced map in the background to help. This creates a raster object with spatial information.
2. Two possibilities:
2.a. The central part of your image has a color that can be used directly as a mask in R (e.g. all green pixels in the middle of red pixels).
2.b. If not, use QGIS to manually delineate a polygon of the area (Layer > Create Layer > New Shapefile > Polygon).
3. Import your raster or polygon shapefile into R.
4. Use the function raster::mask to extract the values of your interpolation using the raster image or the SpatialPolygon.
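To illustrate what the masking step does, here is a minimal base R sketch that keeps only the grid points falling inside a polygon, using a plain ray-casting test as a stand-in for raster::mask. The function point_in_polygon, the 10 x 10 grid, the fake pred column and the rectangular polygon are all hypothetical and only there for illustration; with real data the polygon would come from the shapefile and the values from the kriging output.

```r
# Ray-casting test: TRUE for points strictly inside a simple polygon.
# px, py: point coordinates; vx, vy: polygon vertices in order (ring not repeated).
point_in_polygon <- function(px, py, vx, vy) {
  n <- length(vx)
  inside <- rep(FALSE, length(px))
  j <- n
  for (i in seq_len(n)) {
    # Does the edge (j -> i) cross the horizontal ray going right from the point?
    crosses <- ((vy[i] > py) != (vy[j] > py)) &
      (px < (vx[j] - vx[i]) * (py - vy[i]) / (vy[j] - vy[i]) + vx[i])
    inside <- xor(inside, crosses)
    j <- i
  }
  inside
}

# Hypothetical prediction grid with cell centers at 0.5, 1.5, ..., 9.5
grid_df <- expand.grid(x = seq(0.5, 9.5, by = 1), y = seq(0.5, 9.5, by = 1))
grid_df$pred <- seq_len(nrow(grid_df))  # stand-in for var1.pred

# Hypothetical area of interest: a rectangle from (2, 3) to (6, 7)
keep <- point_in_polygon(grid_df$x, grid_df$y,
                         vx = c(2, 6, 6, 2), vy = c(3, 3, 7, 7))
masked <- grid_df[keep, ]
print(nrow(masked))
```

The masked data frame can then be written out with write.csv, which corresponds to the final "store them into a csv" step of the question.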
I need to create a buffer zone on the set of data points with x and y coordinates (grey points on the graph).
Unfortunately, I don’t have a perimeter border of the points, from which to create a buffer.
I was trying to calculate the perimeter using chull function, however it is not working properly (orange area).
I can calculate the border points using max/min functions for the data by some step (let's say 10 m, red dots), and try to calculate the buffer from those points.
Is anyone aware of a more correct and cleaner way to calculate the buffer zone for a set of points?
You could do a tessellation around the points. Points at the border will have much larger polygons.
library(deldir)
library(ggplot2)
triang <- deldir(data$x, data$y)
border <- triang$summary
# flag points whose tile area exceeds a threshold chosen for this dataset
border$Selected <- border$dir.area > 260
ggplot(border[order(border$Selected), ], aes(x = x, y = y, colour = Selected)) + geom_point()
Thanks a lot for your suggestions and comments.
Indeed, it was my fault for omitting the alphahull package.
After identifying the border with ashape, I created a buffer polygon and identified the data that lie inside and outside the buffer. The challenge was to correctly extract the polygon from ashape, but the solution on RPubs saved me.
You can also see the graphical example here.
## load
library(ggplot2); library(alphahull);
library(igraph); library(rgeos)
## Load the data
#Remove the duplicates in the data to do the chull calculation
data <- data.df[!duplicated(paste(data.df$xsite, data.df$ysite, sep ="_")), c("xsite","ysite") ]
#calculate the chull with alpha 20
data.chull <- ashape(data, alpha = 20)
## Below is the code to extract polygon from the ashape chull function
## credit to: http://rpubs.com/geospacedman/alphasimple
order.chull <- graph.edgelist(cbind(as.character(data.chull$edges[, "ind1"]), as.character(data.chull$edges[,"ind2"])), directed = FALSE)
cutg <- order.chull - E(order.chull)[1]
ends <- names(which(degree(cutg) == 1))
path <- get.shortest.paths(cutg, ends[1], ends[2])[[1]]
pathX <- as.numeric(V(order.chull)[unlist(path[[1]])]$name)
pathX = c(pathX, pathX[1])
data.chull <- as.data.frame(data.chull$x[pathX, ])
## Create a spatial object from the polygon and apply a buffer to
## Then extract the data to the dataframe.
data.chull.poly <- SpatialPolygons(list(Polygons(list(Polygon(as.matrix(data.chull))),"s1")))
data.chull.poly.buff <- gBuffer(data.chull.poly, width = -10)
data.buffer <- fortify(data.chull.poly.buff)[c("long","lat")]
## Identify the data that are inside the buffer polygon
data$posit <- "Outside"
data$posit[point.in.polygon(data$xsite,data$ysite,data.buffer$long,data.buffer$lat) %in% c(1,2,3)] <- "Inside"
## Plot the results
ggplot() +
theme_bw()+xlab("X coordinates (m)")+ylab("Y coordinates (m)") +
geom_point(data = data, aes(xsite, ysite, color = posit))+
geom_polygon(data = data.chull, aes(V1, V2), color = "black", alpha = 0)+
geom_polygon(data = data.buffer, aes(long, lat), color = "blue", alpha = 0)
I have a question regarding data handling in R. I have two datasets. Both are originally .csv files.
I've prepared two example Datasets:
Table A - Persons
Table B - City
To make this as little work as possible, here is the corresponding R code for loading and visualizing the data.
# Read csv files
# check pastebin links and save content to persons.csv and city.csv.
persons_dataframe = read.csv("persons.csv", header = TRUE)
city_dataframe = read.csv("city.csv", header = TRUE)
# plot them on a map
# load used packages
library(ggplot2)
library(ggmap)  # geocode, get_googlemap, ggmap
persons_ggplot2 <- persons_dataframe
city_ggplot2 <- city_dataframe
gc <- geocode('new york, usa')
center <- as.numeric(gc)
G <- ggmap(get_googlemap(center = center, color = 'color', scale = 4, zoom = 10, maptype = "terrain", frame=T), extent="panel")
G1 <- G + geom_point(aes(x=POINT_X, y=POINT_Y ),data=city_dataframe, shape = 22, color="black", fill = "yellow", size = 4) + geom_point(aes(x=POINT_X, y=POINT_Y ),data=persons_dataframe, shape = 8, color="red", size=2.5)
As a result I have a map which visualizes all cities and persons.
My problem: All persons are distributed only on these three cities.
My questions:
A more general question: Is this a problem for R?
I want to create something like a bubble map, which visualizes the number of persons at one position. For example: in City A there are 20 persons, in City B there are 5 persons. The position of City A should get a bigger bubble than City B.
I want to create a label which states the number of persons at a certain position. I've already tried to do this with the ggplot2 geom_text options, but I can't figure out how to sum up all points at a certain position and write this to a label.
A more theoretical approach (maybe I'll come back to this later on): I want to create something like a density map / cluster map, which shows the area with the highest number of persons. I've already searched for some packages which I could use. Suggested ones were SpatialEpi, spatstat and DCluster. My question: Do I need the distance from the persons to a certain object (let's say a supermarket) to perform a cluster analysis?
Hopefully, these were not too many questions.
Any help is much appreciated. Thanks in advance!
Btw: Is there any better help to prepare a question containing example datasets? Should I upload a file somewhere or is the pastebin way okay?
You can create the bubble chart by counting the number of persons in each city and mapping the size of the points to the counts:
library(plyr)  # count() with a vars argument and a freq result column comes from plyr
persons_count <- count(persons_dataframe, vars = c("city", "POINT_X", "POINT_Y"))
G + geom_point(aes(x=POINT_X, y=POINT_Y, size=freq),data=persons_count, color="red")
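If you would rather not load plyr, the same kind of count table can be built in base R. This is only a sketch: the tiny persons data frame below is made up, and it assumes the columns are called city, POINT_X and POINT_Y as in the code above.

```r
# Made-up stand-in for persons_dataframe: one row per person
persons <- data.frame(
  city    = c("A", "A", "A", "B", "C", "C"),
  POINT_X = c(-74.00, -74.00, -74.00, -73.90, -73.80, -73.80),
  POINT_Y = c(40.70, 40.70, 40.70, 40.80, 40.60, 40.60)
)

# Add a column of ones and sum it per city/position to get the counts
persons$freq <- 1L
persons_count <- aggregate(freq ~ city + POINT_X + POINT_Y,
                           data = persons, FUN = sum)
print(persons_count)
```

The resulting persons_count has one row per position with a freq column, so it can be used in the geom_point and geom_text calls in the same way.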
You can map the counts to the area of the points, which perhaps gives a better sense of the relative sizes:
G + geom_point(aes(x=POINT_X, y=POINT_Y, size=freq),data=persons_count, color="red") +
scale_size_area(breaks = unique(persons_count$freq))
You can add the frequency labels, though this is somewhat redundant with the size scale legend:
G + geom_point(aes(x=POINT_X, y=POINT_Y, size=freq),data=persons_count, color="red") +
geom_text(aes(x = POINT_X, y=POINT_Y, label = freq), data=persons_count) +
scale_size_area(breaks = unique(persons_count$freq))
You can't really plot densities with your example data because you only have three points. But if you had more fine-grained location information you could calculate and plot the densities using the stat_density2d function in ggplot2.