Handling many points at one position in R

I have a question regarding data handling in R. I have two datasets. Both are originally .csv files.
I've prepared two example datasets:
Table A - Persons
http://pastebin.com/HbaeqACi
Table B - City
http://pastebin.com/Fyj66ahq
To keep the effort as low as possible, here is the corresponding R code for loading and visualizing the data:
# Read csv files
# check pastebin links and save content to persons.csv and city.csv.
persons_dataframe = read.csv("persons.csv", header = TRUE)
city_dataframe = read.csv("city.csv", header = TRUE)
# plot them on a map
# load used packages
library(RgoogleMaps)
library(ggplot2)
library(ggmap)
library(sp)
persons_ggplot2 <- persons_dataframe
city_ggplot2 <- city_dataframe
gc <- geocode('new york, usa')
center <- as.numeric(gc)
G <- ggmap(get_googlemap(center = center, color = 'color', scale = 4, zoom = 10, maptype = "terrain", frame=T), extent="panel")
G1 <- G + geom_point(aes(x=POINT_X, y=POINT_Y ),data=city_dataframe, shape = 22, color="black", fill = "yellow", size = 4) + geom_point(aes(x=POINT_X, y=POINT_Y ),data=persons_dataframe, shape = 8, color="red", size=2.5)
plot(G1)
As a result I have a map which visualizes all cities and persons.
My problem: all persons are distributed across only these three cities.
My questions:
A more general question: Is this a problem for R?
I want to create something like a bubble map, which visualizes the number of persons at one position. For example: in City A there are 20 persons, in City B there are 5 persons. The position of City A should get a bigger bubble than that of City B.
I want to create a label which states the number of persons at a certain position. I've already tried to do this with ggplot2's geom_text options, but I can't figure out how to sum up all points at a certain position and write the result to a label.
A more theoretical approach (maybe I'll come back to this later on): I want to create something like a density map / cluster map, which shows the area with the highest number of persons. I've already searched for some packages which I could use. Suggested ones were SpatialEpi, spatstat and DCluster. My question: Do I need the distance from the persons to a certain object (let's say a supermarket) to perform a cluster analysis?
Hopefully, these were not too many questions.
Any help is much appreciated. Thanks in advance!
Btw: Is there a better way to prepare a question containing example datasets? Should I upload a file somewhere, or is the pastebin way okay?

You can create the bubble chart by counting the number of persons in each city and mapping the size of the points to the counts:
library(plyr)
persons_count <- count(persons_dataframe, vars = c("city", "POINT_X", "POINT_Y"))
G + geom_point(aes(x=POINT_X, y=POINT_Y, size=freq),data=persons_count, color="red")
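Since the pastebin data may not be available, here is a minimal sketch of just the counting step on made-up data. The column names follow the question, but the cities and coordinates are invented for illustration, and base R's aggregate() stands in for plyr's count().

```r
# Hypothetical stand-in for persons_dataframe (all values are made up)
persons_demo <- data.frame(
  city    = c("A", "A", "A", "B", "B", "C"),
  POINT_X = c(-74.0, -74.0, -74.0, -73.9, -73.9, -73.8),
  POINT_Y = c(40.7, 40.7, 40.7, 40.8, 40.8, 40.6)
)

# Add a helper column of ones and sum it per unique position
persons_demo$freq <- 1
persons_count_demo <- aggregate(freq ~ city + POINT_X + POINT_Y,
                                data = persons_demo, FUN = sum)
persons_count_demo
```

The result has one row per position with a freq column, which is the shape the geom_point call above expects.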
You can map the counts to the area of the points, which perhaps gives a better sense of the relative sizes:
G + geom_point(aes(x=POINT_X, y=POINT_Y, size=freq),data=persons_count, color="red") +
scale_size_area(breaks = unique(persons_count$freq))
You can add the frequency labels, though this is somewhat redundant with the size scale legend:
G + geom_point(aes(x=POINT_X, y=POINT_Y, size=freq),data=persons_count, color="red") +
geom_text(aes(x = POINT_X, y=POINT_Y, label = freq), data=persons_count) +
scale_size_area(breaks = unique(persons_count$freq))
You can't really plot densities with your example data because you only have three points. But if you had more fine-grained location information you could calculate and plot the densities using the stat_density2d function in ggplot2.
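As a rough sketch of what that could look like, the following uses simulated point locations (not the question's data) and draws kernel density contours with stat_density_2d, which is the current spelling of stat_density2d in ggplot2. The centre and spread of the points are arbitrary assumptions.

```r
library(ggplot2)

set.seed(1)
# Simulated person locations scattered around a made-up centre
pts <- data.frame(
  POINT_X = rnorm(500, mean = -74.0, sd = 0.05),
  POINT_Y = rnorm(500, mean = 40.7, sd = 0.05)
)

p_density <- ggplot(pts, aes(x = POINT_X, y = POINT_Y)) +
  geom_point(alpha = 0.3) +
  stat_density_2d(colour = "red")  # contour lines of a 2D kernel density estimate
p_density
```

To draw the contours on the map instead, the same layer could presumably be added to G with its own data and aes arguments, as in the geom_point calls above.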

Related

Plotting lines between two points in ggplot2

I'm looking for a way to represent a vector coming off of a point given angle and magnitude in ggplot. I've calculated what the endpoint of these vectors should be, but can't figure out a way to plot this properly in ggplot2. In short, given an observation with (X,Y,vec.x,vec.y), how can I plot a line from (X,Y) to (vec.x,vec.y) that does not show (vec.x,vec.y)?
My first instinct was to use geom_line, but this seems to rely on connecting different observations, so I would need to separate each observation into two observations, one with the original point and one with the vector endpoint. However, this seems fairly messy and like there should be a cleaner way to achieve this. Furthermore, this would make it complicated to show the original points but hide the vector points, as they would be plotted within the same geom_point call.
Here's a sample dataset in the form I'm talking about:
test <- tibble(
x = c(1,2,3,4,5),
y = c(5,4,3,2,1),
vec.x = c(1.5,2.5,3.5,4.5,5.5),
vec.y = c(4,3,2,1,0)
)
test %>%
ggplot() +
geom_point(aes(x=x,y=y),color='red') +
geom_point(aes(x=vec.x,y=vec.y),color='blue')
What I'm hoping to achieve is this, but without the blue dots:
Any thoughts? Apologies if this is a duplicated issue. I did some Googling and was unable to find a similar question for ggplot.
test %>%
ggplot() +
geom_point(aes(x=x,y=y),color='red') +
geom_point(aes(x=vec.x,y=vec.y),color='blue') +
geom_segment(
aes(x = x,y = y, xend = vec.x,yend = vec.y),
arrow = arrow(length = unit(0.03,units = "npc")),
size = 1
)
Reference: https://ggplot2.tidyverse.org/reference/geom_segment.html
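The snippet above still draws the blue endpoints. A minimal variant that leaves them out, assuming the same test data as in the question, simply omits the second geom_point layer so the endpoints only appear as arrow tips:

```r
library(ggplot2)
library(tibble)

test <- tibble(
  x = c(1, 2, 3, 4, 5),
  y = c(5, 4, 3, 2, 1),
  vec.x = c(1.5, 2.5, 3.5, 4.5, 5.5),
  vec.y = c(4, 3, 2, 1, 0)
)

# Only the start points get a point layer; the endpoints are
# visible solely as the tips of the segments
p_vec <- ggplot(test) +
  geom_point(aes(x = x, y = y), colour = "red") +
  geom_segment(
    aes(x = x, y = y, xend = vec.x, yend = vec.y),
    arrow = arrow(length = unit(0.03, units = "npc"))
  )
p_vec
```

Dropping the arrow argument gives plain line segments if arrowheads are not wanted.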

How to plot density of points in one dimension with different factors in ggplot2

I am attempting to place individual points on a plot using ggplot2, however as there are many points, it is difficult to gauge how densely packed the points are. Here, there are two factors being compared against a continuous variable, and I want to change the color of the points to reflect how closely packed they are with their neighbors. I am using the geom_point function in ggplot2 to plot the points, but I don't know how to feed it the right information on color.
Here is the code I am using:
library(ggplot2)
s1 = rnorm(1000, 1, 10)
s2 = rnorm(1000, 1, 10)
# One task label per value: 1000 for task 1, then 1000 for task 2
data = data.frame(task_number = factor(rep(c(1, 2), each = 1000)),
S = c(s1, s2))
ggplot(data, aes(x = task_number, y = S)) + geom_point()
Which generates this plot:
However, I want it to look more like this image, but with one dimension rather than two (which I borrowed from this website: https://slowkow.com/notes/ggplot2-color-by-density/):
How do I change the colors of the first plot so it resembles that of the second plot?
I think the tricky thing about this is you want to show the original values, and evaluate the density at those values. I borrowed ideas from here to achieve that.
library(dplyr)
data = data %>%
group_by(task_number) %>%
# Use approxfun to interpolate the density back to
# the original points
mutate(dens = approxfun(density(S))(S))
ggplot(data, aes(x = task_number, y = S, colour = dens)) +
geom_point() +
scale_colour_viridis_c()
Result:
One could, of course, come up with a measure of proximity to neighbouring values for each value. However, wouldn't adjusting the transparency basically achieve the same goal (gauging how densely packed the points are)?
geom_point(alpha=0.03)
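Spelled out as a complete call, the transparency idea might look like the sketch below. The data is simulated in the same shape as in the question (two tasks, 1000 values each); the seed and the exact alpha value are arbitrary choices.

```r
library(ggplot2)

set.seed(42)
# Same shape as the question's data: two tasks, 1000 values each
data_demo <- data.frame(
  task_number = factor(rep(c(1, 2), each = 1000)),
  S = c(rnorm(1000, 1, 10), rnorm(1000, 1, 10))
)

# Low alpha: overlapping points accumulate into darker regions
p_alpha <- ggplot(data_demo, aes(x = task_number, y = S)) +
  geom_point(alpha = 0.03)
p_alpha
```

If the points are still hard to read on a single vertical line, a small horizontal jitter (for example geom_jitter with a narrow width and height = 0) may help, at the cost of no longer showing exact x positions.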

Controlling alpha in ggparcoord (from GGally package)

I am trying to build from a question similar to mine (and from which I borrowed the self-contained example and title inspiration). I am trying to apply transparency individually to each line of a ggparcoord or somehow add two layers of ggparcoord on top of the other. The detailed description of the problem and format of data I have for the solution to work is provided below.
I have a dataset with thousands of lines; let's call it x.
library(GGally)
x = data.frame(a=runif(100,0,1),b=runif(100,0,1),c=runif(100,0,1),d=runif(100,0,1))
After clustering this data I also get a set of 5 lines, let's call this dataset y.
y = data.frame(a=runif(5,0,1),b=runif(5,0,1),c=runif(5,0,1),d=runif(5,0,1))
In order to see the centroids y overlaying x I use the following code. First I append y to x so that the 5 rows are at the bottom of the final data frame. This ensures ggparcoord draws them last, so they stay on top of all the data:
df <- rbind(x,y)
Next I create a new column for df, following the advice in the question I referred to, so that I can color the centroids differently and tell them apart from the data:
df$cluster = "data"
df$cluster[(nrow(df)-4):(nrow(df))] <- "centroids"
Finally I plot it:
p <- ggparcoord(df, columns=1:4, groupColumn=5, scale="globalminmax", alphaLines = 0.99) + xlab("Sample") + ylab("log(Count)")
p + scale_colour_manual(values = c("data" = "grey","centroids" = "#94003C"))
The problem I am stuck with starts at this stage. On my original data, plotting x alone doesn't give much insight since it is a dense mass of lines (on this data, that is equivalent to using ggparcoord as above on x instead of df):
By reducing alphaLines considerably (to 0.05), I can naturally see some clusters due to the overlapping of the lines (this is again ggparcoord run on x, with reduced alphaLines):
It makes more sense to observe the centroids added to df on top of the second plot, not the first.
However, since everything is in a single data frame, applying such a low value of alphaLines makes the centroid lines disappear. My only option is then to use ggparcoord (as provided above) on df without decreasing alphaLines:
My goal is to have the red lines (centroid lines) on top of the second figure with very low alpha. There are two approaches I have thought of so far, but I couldn't get either working:
(1) Is there any way to create a column on the dataframe, similar to what is done for the color, such that I can specify the alpha value for each line?
(2) I originally attempted to create two different ggparcoords and "sum them up" hoping to overlay but an error was raised.
The question may contain too much detail, but I thought this could motivate better the applicability of the answer to serve the interest of other readers.
The answer I am looking for would use the provided data variables in their current format and generate the plot described above. Better ways to restructure the data are also welcome, but using the current structure is preferred.
In this case I think it is easier to just use ggplot and build the graph yourself. We make slight adjustments to how the data is represented (we put it in long format), and then we make the parallel coordinates plot. We can now map any aesthetic you like to cluster.
library(ggplot2)
library(dplyr)
library(tidyr)
# I start the same as you
x <- data.frame(a=runif(100,0,1),b=runif(100,0,1),c=runif(100,0,1),d=runif(100,0,1))
y <- data.frame(a=runif(5,0,1),b=runif(5,0,1),c=runif(5,0,1),d=runif(5,0,1))
# I find this an easier way to combine the two data.frames, and have an id column
df <- bind_rows(data = x, centroids = y, .id = 'cluster')
# We need to add id's, so we know which points to connect with a line
df$id <- 1:nrow(df)
# Put the data into long format
df2 <- gather(df, 'column', 'value', a:d)
# And plot:
ggplot(df2, aes(column, value, alpha = cluster, color = cluster, group = id)) +
geom_line() +
scale_colour_manual(values = c("data" = "grey", "centroids" = "#94003C")) +
scale_alpha_manual(values = c("data" = 0.2, "centroids" = 1)) +
theme_minimal()
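In current versions of tidyr, gather() is superseded by pivot_longer(). A sketch of the same reshaping step with pivot_longer(), using the same simulated x and y (the seed is an added assumption for reproducibility), could look like this:

```r
library(dplyr)
library(tidyr)

set.seed(1)
x <- data.frame(a = runif(100), b = runif(100), c = runif(100), d = runif(100))
y <- data.frame(a = runif(5), b = runif(5), c = runif(5), d = runif(5))

# Combine both data frames and record their origin in 'cluster'
df <- bind_rows(data = x, centroids = y, .id = "cluster")
df$id <- seq_len(nrow(df))

# pivot_longer() is the newer tidyr equivalent of gather()
df2 <- pivot_longer(df, cols = a:d, names_to = "column", values_to = "value")
head(df2)
```

The resulting df2 has the same columns as the gather() version (cluster, id, column, value), so it can be passed to the ggplot call above unchanged.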

ggplot boxplots with scatterplot overlay (same variables)

I'm an undergrad researcher and I've been teaching myself R over the past few months. I just started trying ggplot, and have run into some trouble. I've made a series of boxplots looking at the depth of fish at different acoustic receiver stations. I'd like to add a scatterplot that shows the depths of the receiver stations. This is what I have so far:
data <- read.csv(".....MPS.csv", header=TRUE)
df <- data.frame(f1=factor(data$Tagging.location), #$
f2=factor(data$Station),data$Detection.depth)
df2 <- data.frame(f2=factor(data$Station), data$depth)
df$f1f2 <- interaction(df$f1, df$f2) #$
plot1 <- ggplot(aes(y = data$Detection.depth, x = f2, fill = f1), data = df) + #$
geom_boxplot() + stat_summary(fun.data = give.n, geom = "text",
position = position_dodge(height = 0, width = 0.75), size = 3)
plot1+xlab("MPS Station") + ylab("Depth(m)") +
theme(legend.title=element_blank()) + scale_y_reverse() +
coord_cartesian(ylim=c(150, -10))
plot2 <- ggplot(aes(y=data$depth, x=f2), data=df2) + geom_point()
plot2+scale_y_reverse() + coord_cartesian(ylim=c(150,-10)) +
xlab("MPS Station") + ylab("Depth (m)")
Unfortunately, since I'm a new user in this forum, I'm not allowed to upload images of these two plots. My x-axis is "Stations" (which has 12 options) and my y-axis is "Depth" (0-150 m). The boxplots are colour-coded by tagging site (which has 2 options). The depths are coming from two different columns in my spreadsheet, and they cannot be combined into one.
My goal is to combine those two plots by adding "plot2" (the station depth scatterplot) to the "plot1" boxplots (detection depths). They both look at the same variables (depth and station) and must share the same y-axis scale.
I think I could figure out a messy workaround if I were using the R base program, but I would like to learn ggplot properly, if possible. Any help is greatly appreciated!
Update: I was confused by the language used in the original post, and wrote a slightly more complicated answer than necessary. Here is the cleaned up version.
Step 1: Setting up. Here, we make sure the depth values in both data frames have the same variable name (for readability).
df <- data.frame(f1=factor(data$Tagging.location), f2=factor(data$Station), depth=data$Detection.depth)
df2 <- data.frame(f2=factor(data$Station), depth=data$depth)
Step 2: Now you can plot this with the 'ggplot' function and split the data by using the `col=f1` argument. We'll plot the detection data separately, since that requires a boxplot, and then we'll plot the depths of the stations with colored points (assuming each station only has one depth). We specify the two different plots by referencing the data from within the 'geom' functions, instead of specifying the data inside the main 'ggplot' function. It should look something like this:
ggplot()+geom_boxplot(data=df, aes(x=f2, y=depth, col=f1)) + geom_point(data=df2, aes(x=f2, y=depth), colour="blue") + scale_y_reverse()
In this plot example, we use boxplots to represent the detection data and color those boxplots by the site label. The stations, however, we plot separately using a specific color of points, so we will be able to see them clearly in relation to the boxplots.
You should be able to adjust the plot from here to suit your needs.
I've created some dummy data and loaded it into the chart to show you what it would look like. Keep in mind that this is purely random data and doesn't really make sense.
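The dummy data behind that chart is not shown here. The following sketch uses separately made-up data with the same column names (f1, f2, depth); the station names, site labels and depth ranges are all invented for illustration.

```r
library(ggplot2)

set.seed(123)
stations <- paste0("S", 1:4)  # hypothetical station names

# Made-up detections: 50 per station, randomly assigned to two tagging sites
df <- data.frame(
  f1 = factor(sample(c("Site1", "Site2"), 200, replace = TRUE)),
  f2 = factor(rep(stations, each = 50)),
  depth = runif(200, min = 0, max = 100)
)

# One made-up depth per receiver station
df2 <- data.frame(
  f2 = factor(stations),
  depth = c(110, 120, 130, 140)
)

# Each geom gets its own data, so both share one set of axes
p_overlay <- ggplot() +
  geom_boxplot(data = df, aes(x = f2, y = depth, col = f1)) +
  geom_point(data = df2, aes(x = f2, y = depth), colour = "blue") +
  scale_y_reverse()
p_overlay
```

Because both layers are drawn in one plot, the reversed y-axis scale is shared automatically, which is what the question asked for.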

Average values of a point dataset to a grid dataset

I am relatively new to ggplot, so please forgive me if some of my problems are really simple or not solvable at all.
What I am trying to do is generate a "heat map" of a country where the fill of the shape is continuous. I have the shape of the country as .RData, and I used Hadley Wickham's script to transform my SpatialPolygon data into a data frame. The long and lat data of my data frame now looks like this
head(my_df)
long lat group
6.527187 51.87055 0.1
6.531768 51.87206 0.1
6.541202 51.87656 0.1
6.553331 51.88271 0.1
This long/lat data draws the outline of Germany. The rest of the data frame is omitted here since I think it is not needed. I also have a second data frame of values for certain long/lat points. This looks like this
my_fixed_points
long lat value
12.817 48.917 0.04
8.533 52.017 0.034
8.683 50.117 0.02
7.217 49.483 0.0542
What I would like to do now is colour each point of the map according to the average value of all the fixed points that lie within a certain distance of that point. That way I would get an (almost) continuous colouring of the whole map of the country.
What I have so far is the map of the country plotted with ggplot2
ggplot(my_df,aes(long,lat)) + geom_polygon(aes(group=group), fill="white") +
geom_path(color="white",aes(group=group)) + coord_equal()
My first idea was to generate points that lie within the map that has been drawn, and then calculate the value for every generated point my_generated_point like so
value_vector <- subset(my_fixed_points,
spDistsN1(cbind(my_fixed_points$long, my_fixed_points$lat),
c(my_generated_point$long, my_generated_point$lat), longlat=TRUE) < 50,
select = value)
point_value <- mean(value_vector$value)
I haven't found a way to generate these points, though. And as with the whole problem, I don't even know if it is possible to solve it this way. My question now is whether there is a way to generate these points and/or whether there is another way to arrive at a solution.
Solution
Thanks to Paul I almost got what I wanted. Here is an example with sample data for the Netherlands.
library(ggplot2)
library(sp)
library(automap)
library(rgdal)
library(scales)
#get the spatial data for the Netherlands
con <- url("http://gadm.org/data/rda/NLD_adm0.RData")
print(load(con))
close(con)
#transform them into the right format for autoKrige
gadm_t <- spTransform(gadm, CRS=CRS("+proj=merc +ellps=WGS84"))
#generate some random values that serve as fixed points
value_points <- spsample(gadm_t, type="stratified", n = 200)
values <- data.frame(value = rnorm(dim(coordinates(value_points))[1], 0 ,1))
value_df <- SpatialPointsDataFrame(value_points, values)
#generate a grid that can be estimated from the fixed points
grd = spsample(gadm_t, type = "regular", n = 4000)
kr <- autoKrige(value~1, value_df, grd)
dat = as.data.frame(kr$krige_output)
#draw the generated grid with the underlying map
ggplot(gadm_t,aes(long,lat)) + geom_polygon(aes(group=group), fill="white") + geom_path(color="white",aes(group=group)) + coord_equal() +
geom_tile(aes(x = x1, y = x2, fill = var1.pred), data = dat) + scale_fill_continuous(low = "white", high = muted("orange"), name = "value")
I think what you want is something along these lines. I predict that this homebrew is going to be terribly inefficient for large datasets, but it works on a small example dataset. I would look into kernel densities and maybe the raster package. But maybe this suits you well...
The following snippet of code calculates the mean cadmium concentration on a grid of points overlaying the original point dataset. For each grid point, only observations closer than 1000 m are considered.
library(sp)
library(ggplot2)
library(automap) # provides loadMeuse()
library(scales) # provides muted()
loadMeuse()
# Generate a grid to sample on
bb = bbox(meuse)
grd = spsample(meuse, type = "regular", n = 4000)
# Come up with mean cadmium value
# of all points < 1000m.
mn_value = sapply(1:length(grd), function(pt) {
d = spDistsN1(meuse, grd[pt,])
return(mean(meuse[d < 1000,]$cadmium))
})
# Make a new object
dat = data.frame(coordinates(grd), mn_value)
ggplot(aes(x = x1, y = x2, fill = mn_value), data = dat) +
geom_tile() +
scale_fill_continuous(low = "white", high = muted("blue")) +
coord_equal()
which leads to the following image:
An alternative approach is to use an interpolation algorithm. One example is kriging. This is quite easy using the automap package (spot the self promotion :), I wrote the package):
library(automap)
kr = autoKrige(cadmium~1, meuse, meuse.grid)
dat = as.data.frame(kr$krige_output)
ggplot(aes(x = x, y = y, fill = var1.pred), data = dat) +
geom_tile() +
scale_fill_continuous(low = "white", high = muted("blue")) +
coord_equal()
which leads to the following image:
However, without knowledge as to what your goal is with this map, it is hard for me to see what you want exactly.
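For readers who want to try the radius-based averaging from the first snippet without the spatial packages, here is a minimal base R sketch on made-up data. It assumes planar coordinates in metres and plain Euclidean distance, which is a simplification compared with the sp-based version; all names and values are illustrative.

```r
set.seed(1)
# Made-up observations: planar coordinates in metres plus a value
obs <- data.frame(x = runif(50, 0, 5000), y = runif(50, 0, 5000))
obs$value <- rnorm(50)

# Regular grid covering the same area
grid <- expand.grid(x = seq(0, 5000, by = 250), y = seq(0, 5000, by = 250))

radius <- 1000
grid$mn_value <- sapply(seq_len(nrow(grid)), function(i) {
  # Euclidean distance from this grid cell to every observation
  d <- sqrt((obs$x - grid$x[i])^2 + (obs$y - grid$y[i])^2)
  near <- d < radius
  # NA where no observation lies within the radius
  if (any(near)) mean(obs$value[near]) else NA_real_
})
head(grid)
```

The resulting grid data frame can be plotted with geom_tile in the same way as dat above, mapping fill to mn_value.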
This slideshow offers another approach--see page 18 for a description of the approach and page 21 for a view of what the results looked like for the slide-maker.
Note however that the slide-maker used the sp package and the spplot function rather than ggplot2 and its plotting functions.
