I am trying to plot a scatterplot with continuous variables on both the x and y axes.
Here is a simulation that fairly accurately represents what the data look like (a couple of vectors hold integers in the real thing, but that's immaterial here):
library(tidyverse)
set.seed(243)
DemoDataTable <- tibble(Treatment = rep(c(0, 1), times = 20), ## each treatment level appears 20 times (40 rows in total)
                        CorrectAnswers = rnorm(40, 1, 2),
                        Age = rnorm(40, 20, 5),
                        PositiveAffect = rnorm(40, 1.5, 1),
                        NegativeAffect = rnorm(40, 2, 1),
                        TreatmentExpectancy = rnorm(40, 2, 1)) %>%
  mutate(Treatment = as.factor(Treatment))
In my real data, positive affect is the column 'Positive affect2' and negative affect is the column 'Negative affect2', as used in the code below.
I am aware there are some NaN values in the x and y variables and have attempted to remove them; the rest are continuous data on both axes. I have also checked the data frame using the glimpse function: only Treatment shows up as a factor, and the rest are typed as 'double', which to my understanding means numeric data.
The code:
ggplot(Data object, aes(x = 'Negative affect2', y = 'Positive affect2'), na.rm = TRUE) +
geom_point(colour = "blue") +
scale_x_continuous(name = "Negative affect") +
scale_y_continuous(name = "Positive affect") +
theme_minimal()
However, when I attempt to run the code, I get the error message: 'Discrete value supplied to continuous scale'.
I have also tried using the na.omit function in place of na.rm, but this made no difference.
EDIT:
I have changed the column names from "Negative Affect2" and "Positive Affect2" to "NegativeAffect" and "PositiveAffect", but this has made no difference to the error.
I have found that when running the code for the scatterplot line by line (as shown below), I immediately hit problems at line 2: it only gives me a single data point on the x and y axes (whether I insert na.rm = T or not).
ggplot(RTdataset2, aes(x = "NegativeAffect", y = "PositiveAffect", na.rm = T)) +
geom_point(colour = "blue")
However, I have not used the summarise function from dplyr at any point, so I am not sure why this is, as it should plot all the data points from the sample against each other. I believe this is where the problem arises, as otherwise it would not state that I am trying to apply a discrete variable to a continuous scale, as in the error above.
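One likely cause, offered as a sketch rather than a confirmed diagnosis of the real data: inside aes(), a quoted name such as "NegativeAffect" is treated as a constant string, not as a column, so every row is mapped to the same single discrete value. That would explain both the lone point and the 'Discrete value supplied to continuous scale' error. The example below uses simulated columns like those above with bare column names; `demo` and `demo_spaces` are made-up data frames for illustration only.

```r
library(ggplot2)

set.seed(243)
# Simulated stand-in for the two affect columns.
demo <- data.frame(
  NegativeAffect = rnorm(40, 2, 1),
  PositiveAffect = rnorm(40, 1.5, 1)
)

# Bare (unquoted) column names map the data; na.rm belongs in the geom.
p <- ggplot(demo, aes(x = NegativeAffect, y = PositiveAffect)) +
  geom_point(colour = "blue", na.rm = TRUE) +
  scale_x_continuous(name = "Negative affect") +
  scale_y_continuous(name = "Positive affect") +
  theme_minimal()

# Column names that contain spaces need backticks, not quotes.
demo_spaces <- setNames(demo, c("Negative affect2", "Positive affect2"))
p_spaces <- ggplot(demo_spaces, aes(x = `Negative affect2`, y = `Positive affect2`)) +
  geom_point(colour = "blue", na.rm = TRUE)

# Number of points each plot would actually draw.
n_points <- nrow(layer_data(p, 1))
n_points_spaces <- nrow(layer_data(p_spaces, 1))
```

Printing `p` or `p_spaces` should then draw one point per row instead of a single point.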
I'm working with data that represent n signals and their values at m different time points (m ~ 1500, n ~ 50). I came up with an algorithm to check which values are representative for each signal and which are not, so for each of the n signals I get m True/False values that tell me whether each value is representative. I also know that the True/False values are grouped pretty well: if I take two close-by values, the probability that they are the same is much higher than the probability that they are different. I'm working in R.
I want to visualize the output of this algorithm. To do this, I'm trying to make a raster plot in which every cell is filled according to the value of the signal and every group of True/False values has a visible border. Similar to the picture below, but a bit different.
I'm also open to suggestions of other ways how to visualize this data.
I have one matrix n * m that gives me data to make a raster plot and another matrix n * m (of True/False values) that tells me which points are good/bad.
The code below is the closest I have got to an answer, but there is still a border inside each group, which does not work for me.
library(ggplot2)
library(tidyr) # for gather()
n = 5
m = 10
dataMatrix <- data.frame(cbind(1:m, matrix(runif(n * m), ncol = n)))
trueMatrix <- data.frame(cbind(1:m, matrix(rep(c(T, T, T, F, F), m), ncol = n)))
dataMatrix <- gather(dataMatrix, key = Signal, value = Value, -1)
trueMatrix <- gather(trueMatrix, key = TrueFalse, value = Value, -1)
dataMatrix$Representativeness <- trueMatrix$Value
dataMatrix$Signal <- as.integer(gsub("X", "", dataMatrix$Signal))
ggplot(data = dataMatrix) +
geom_raster(mapping = aes(x = X1, y = Signal, fill = Value)) +
scale_fill_gradient2(low = "blue", high = "red", na.value = "black", name = "") +
geom_rect(mapping = aes(xmin = X1 - 0.5, xmax = X1 + 0.5,
ymin = Signal - 0.5, ymax = Signal + 0.5),
size = 1, fill = NA, colour = dataMatrix$Representativeness)
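One possible way to get borders only between groups, sketched under assumptions: instead of drawing a rectangle around every cell, draw a line segment only on those cell edges where the True/False flag differs between two neighbouring cells. The toy data and the names `cells`, `Time`, `Representative` and `flag_at` are hypothetical and not taken from the code above.

```r
library(ggplot2)

set.seed(1)
n <- 5   # number of signals
m <- 10  # number of time points

# Toy long-format data; Time, Signal and Representative are assumed names.
cells <- expand.grid(Time = 1:m, Signal = 1:n)
cells$Value <- runif(n * m)
# Toy flags: one rectangular block of non-representative values.
cells$Representative <- !(cells$Time >= 4 & cells$Time <= 7 &
                          cells$Signal >= 2 & cells$Signal <= 4)

# Flag of the cell at (time, signal), or NA if that cell does not exist.
flag_at <- function(time, signal) {
  idx <- match(paste(time, signal), paste(cells$Time, cells$Signal))
  cells$Representative[idx]
}
right <- flag_at(cells$Time + 1, cells$Signal)
above <- flag_at(cells$Time, cells$Signal + 1)

# Keep only the edges where the flag changes between neighbouring cells.
v_edges <- cells[!is.na(right) & right != cells$Representative, ]
h_edges <- cells[!is.na(above) & above != cells$Representative, ]

p <- ggplot(cells) +
  geom_raster(aes(x = Time, y = Signal, fill = Value)) +
  scale_fill_gradient(low = "blue", high = "red", name = "") +
  # Vertical segments on the right-hand edge of each selected cell.
  geom_segment(data = v_edges, colour = "black",
               aes(x = Time + 0.5, xend = Time + 0.5,
                   y = Signal - 0.5, yend = Signal + 0.5)) +
  # Horizontal segments on the top edge of each selected cell.
  geom_segment(data = h_edges, colour = "black",
               aes(x = Time - 0.5, xend = Time + 0.5,
                   y = Signal + 0.5, yend = Signal + 0.5))
```

This only outlines boundaries between groups; the outer frame of the raster is not drawn, which may or may not be what is wanted.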
Problem:
1.) I have a shapefile that looks like this:
Extreme values for coordinates are: xmin = 300,000, xmax = 620,000, ymin = 31,000 and ymax = 190,000.
2.) I have a dataset of approx. 2 million points (every point is inside the given polygon); each one belongs to one of 5 different categories.
Now, for every grid point inside the border (the distance between grid points has to be 10, so that would give us 580,800,000 points), I want to determine a color depending on the category of the nearest point in the dataset.
In the end I would like to draw a ggplot in which the color of every point depends on its category (so I'll use 5 different colors).
What I have so far:
My ideas for a solution are not optimized, and it takes R forever to determine the category of every point inside the polygon.
1.) I created a new dataset of points in the shape of a rectangle spanning the extreme coordinate values, with 10 units between points. From this new dataset I selected the points that fall inside the polygon border (with the function pnt.in.poly from the package SDMTools). Then I wanted to find, for every point in the polygon, the nearest point in the dataset and take its category, but I never managed to get a subset from 580,800,000 points (obviously).
2.) I tried to take the 2 million points and color an area around each of them, depending on its category, but that did not work properly.
I know that it is not possible to plot so many points and see the difference between a plot with 200,000,000 points and a plot with 1,000,000 points, but I would like to have accurate coloring when zooming in on (drawing) only one small spot in the polygon (of size 100 x 100, for example).
Question: Is there a better way of coloring so many points in a polygon (by creating a new shapefile or grouping points)?
Thank you for your ideas!
It’s really helpful if you include some data with your question, even (especially) if it’s a toy data set. As you didn’t, I’ve made a toy example. First, I define a simple shape data frame and a data frame of synthetic data that includes x, y, and grp (i.e., a categorical variable with 5 levels). I crop the latter to the former and plot the results.
# Dummy shape function
df_shape <- data.frame(x = c(0, 0.5, 1, 0.5, 0),
y = c(0, 0.2, 1, 0.8, 0))
# Load library
library(ggplot2)
library(sgeostat) # For in.polygon function
# Data frame of synthetic data: random [x, y] and category (grp)
df_synth <- data.frame(x = runif(500),
y = runif(500),
grp = factor(sample(1:5, 500, replace = TRUE)))
# Remove points outside polygon
df_synth <- df_synth[in.polygon(df_synth$x, df_synth$y, df_shape$x, df_shape$y), ]
# Plot shape and synthetic data
g <- ggplot(df_shape, aes(x = x, y = y)) + geom_path(colour = "#FF3300", size = 1.5)
g <- g + ggthemes::theme_clean()
g <- g + geom_point(data = df_synth, aes(x = x, y = y, colour = grp))
g
Next, I create a regular grid and crop that using the polygon.
# Create a grid
df_grid <- expand.grid(x = seq(0, 1, length.out = 50),
y = seq(0, 1, length.out = 50))
# Check if grid points are in polygon
df_grid <- df_grid[in.polygon(df_grid$x, df_grid$y, df_shape$x, df_shape$y), ]
# Plot shape and show points are inside
g <- ggplot(df_shape, aes(x = x, y = y)) + geom_path(colour = "#FF3300", size = 1.5)
g <- g + ggthemes::theme_clean()
g <- g + geom_point(data = df_grid, aes(x = x, y = y))
g
To classify each point on this grid by the nearest point in the synthetic data set, I use knn or k-nearest-neighbours with k = 1. That gives something like this.
# Classify grid points according to synthetic data set using k-nearest neighbour
df_grid$grp <- class::knn(df_synth[, 1:2], df_grid[, c("x", "y")], df_synth[, 3]) # select x, y explicitly so the line can be re-run
# Show categorised points
g <- ggplot()
g <- g + ggthemes::theme_clean()
g <- g + geom_point(data = df_grid, aes(x = x, y = y, colour = grp))
g
So, that's how I'd address that part of your question about classifying points on a grid.
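As a sanity check on the k = 1 step, here is a small sketch (with made-up data, not the data frames above) showing that class::knn with k = 1 gives the same labels as a brute-force search for the closest labelled point. The names `train`, `labels`, `query` and `brute` are just for illustration.

```r
set.seed(42)
# Labelled points (stand-in for the categorised data set).
train <- data.frame(x = runif(200), y = runif(200))
labels <- factor(sample(1:5, 200, replace = TRUE))
# Query points (stand-in for the grid).
query <- data.frame(x = runif(25), y = runif(25))

# Brute force: for each query point, take the label of the closest training point
# (squared Euclidean distance is enough for finding the minimum).
nearest_idx <- apply(query, 1, function(q) {
  which.min((train$x - q[1])^2 + (train$y - q[2])^2)
})
brute <- labels[nearest_idx]

# The same classification via class::knn with a single neighbour.
via_knn <- class::knn(train, query, labels, k = 1)

# TRUE when both approaches assign the same category to every query point.
agree <- all(as.character(brute) == as.character(via_knn))
```

The brute-force version is only there to show what knn is doing; for millions of points the compiled knn call (or a spatial index) is the practical choice.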
The other part of your question seems to be about resolution. If I understand correctly, you want the same resolution even if you're zoomed in. Also, you don't want to plot so many points when zoomed out, as you can't even see them. Here, I create a plotting function that lets you specify the resolution. First, I plot all the points in the shape with 50 points in each direction. Then, I plot a subregion (i.e., zoom) but keep the number of points in each direction the same, so that it looks much like the previous plot in terms of the number of dots.
res_plot <- function(xlim, xn, ylim, yn, df_data, df_sh){
  # Create a grid
  df_gr <- expand.grid(x = seq(xlim[1], xlim[2], length.out = xn),
                       y = seq(ylim[1], ylim[2], length.out = yn))
  # Check if grid points are in polygon
  df_gr <- df_gr[in.polygon(df_gr$x, df_gr$y, df_sh$x, df_sh$y), ]
  # Classify grid points according to synthetic data set using k-nearest neighbour
  df_gr$grp <- class::knn(df_data[, 1:2], df_gr, df_data[, 3])
  g <- ggplot()
  g <- g + ggthemes::theme_clean()
  g <- g + geom_point(data = df_gr, aes(x = x, y = y, colour = grp))
  g <- g + xlim(xlim) + ylim(ylim)
  g
}
# Example plot
res_plot(c(0, 1), 50, c(0, 1), 50, df_synth, df_shape)
# Same resolution, but different limits
res_plot(c(0.25, 0.75), 50, c(0, 1), 50, df_synth, df_shape)
Created on 2019-05-31 by the reprex package (v0.3.0)
Hopefully, that addresses your question.
I have a dataset with 29 columns and 2500 rows resulting from a test. Three columns need to be represented on a plot: the first two are simple X,Y coordinate pairs representing actual X,Y positions on an image used in the test, and the third is the participants' response, a simple yes or no answer (recorded as 1 and -1 respectively).
Each X,Y coordinate was used multiple times in the test, and I'm trying to get an overall bias for each point. The values can be found by a simple sum of the Y,N answers. My problem is that I can't plot the "sum" of the answers, only the density of the yes and no answers separately. I need to show the overall bias towards yes or no for each point, so having two plots or simply plotting the two sets of results on the same plot is of little value.
In the code I'm using, the X value is audioDim1a and the Y value is audioDim2. Two data frames are used, each a reduced subset: one includes all the Y answers and the other all the N answers.
This code uses the two N and Y data frames:
ggplot() +
xlim(0, 110) + ylim(0, 150) +
  stat_density_2d(data = test_plot_N, aes(audioDim1a, audioDim2, alpha = "density", fill = "density"),
                  geom = "polygon", size = 0.2, contour = T, n = 150, h = 20, bins = 10, colour = "purple") +
  stat_density_2d(data = test_plot_Y, aes(audioDim1a, audioDim2, alpha = "density", fill = "density"),
                  geom = "polygon", size = 0.2, contour = T, n = 150, h = 20, bins = 10, colour = "green") +
  geom_point(data = test_plot_N, aes(audioDim1a, audioDim2), colour = "blue", size = 1)
If I use a dataset (see below) with the Y and N answers combined, I hoped that wherever the numbers of Y and N answers were equal the density would come out as 0 and the contour fill would therefore be clear/white. This does not happen; it seems to simply show a count of responses rather than an arithmetic sum.
ggplot() +
xlim(0, 110) + ylim(0, 150) +
stat_density_2d(data = test_plot, aes(audioDim1a, audioDim2, alpha="density", fill = "density"), geom = "polygon", size = 0.2, contour = T, n = 150, h = 20, bins = 10, colour = "purple") +
geom_point(data = test_plot_N, aes(audioDim1a, audioDim2), colour = "blue", size = 1)
Do I need to supply the data set and the full R code I'm using?
Any help would be really appreciated.
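A density estimate counts rows and ignores their sign, which would explain the behaviour described above. One possible alternative, sketched here on made-up data, is to sum the +1/-1 answers per position first and then map that sum to a diverging fill with a white midpoint. The column name `response` and the data frame `test_toy` are hypothetical; only audioDim1a and audioDim2 come from the question.

```r
library(ggplot2)

set.seed(7)
# Toy stand-in for the combined data: a 5 x 5 set of positions, each shown 10 times.
positions <- expand.grid(audioDim1a = seq(10, 90, by = 20),
                         audioDim2  = seq(20, 120, by = 25))
test_toy <- positions[rep(seq_len(nrow(positions)), each = 10), ]
# Hypothetical response column: +1 for yes, -1 for no.
test_toy$response <- sample(c(-1, 1), nrow(test_toy), replace = TRUE)

# Net bias per position: the sum of the +1 / -1 answers.
bias <- aggregate(response ~ audioDim1a + audioDim2, data = test_toy, FUN = sum)

# Diverging fill: purple for a "no" bias, green for "yes", white where they cancel.
p <- ggplot(bias, aes(audioDim1a, audioDim2, fill = response)) +
  geom_tile() +
  scale_fill_gradient2(low = "purple", mid = "white", high = "green",
                       midpoint = 0, name = "Net bias")
```

With irregularly spaced positions, geom_point with a colour scale may be a better fit than geom_tile.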
My initial goal was to plot a population of individual points and then draw a convex hull enclosing 80% of that population centered on the mass of the population.
After trying a number of ideas, the best solution I came up with was to use ggplot's stat_density2d. While this works great for a qualitative analysis, I still need to indicate an 80% boundary. I started out looking for a way to outline the 80th percentile population boundary, but I can work with an 80% probability density boundary instead.
Here's where I'm looking for help. The bin parameter for kde2d (used by stat_density2d) is not clearly documented. If I set bin = 4 in the example below, am I correct in interpreting the central (green) region as containing a 25% probability mass and the combined yellow, red, and green areas as representing a 75% probability mass? If so, by changing bin to 5, would the area enclosed then equal an 80% probability mass?
library(ggplot2)
set.seed(1)
n <- 100
df <- data.frame(x = rnorm(n, 0, 1), y = rnorm(n, 0, 1))
TestData <- ggplot(data = df) +
  stat_density2d(aes(x = x, y = y, fill = as.factor(..level..)),
                 bins = 4, geom = "polygon") + # stray trailing comma removed
  geom_point(aes(x = x, y = y)) +
  scale_fill_manual(values = c("yellow", "red", "green", "royalblue", "black"))
TestData
I repeated a number of test cases and manually counted the excluded points [would love to find a way to count them based on what ..level.. they were contained within] but given the random nature of the data (both my real data and the test data) the number of points outside of the stat_density2d area varied enough to warrant asking for help.
Summarizing, is there a practical means of drawing a polygon around the central 80% of the population of points in the data frame? Or, barring that, am I safe to use stat_density2d and set bin equal to 5 to produce an 80% probability mass?
Excellent answer from Bryan Hanson dispelling the fuzzy notion that I could pass an undocumented bin parameter in stat_density2d. The results looked close at values for bin around 4 to 6, but as he stated, the actual function is unknown and therefore not usable.
I used HPDregionplot, as provided in the accepted answer by DWin, to solve my problem. To that, I added a center of gravity (COGravity) and point in polygon (pnt.in.poly) from the SDMTools package to complete the analysis.
library(MASS)
library(coda)
library(SDMTools)
library(emdbook)
library(ggplot2)
theme_set(theme_bw(16))
set.seed(1)
n=100
df <- data.frame(x=rnorm(n, 0, 1), y=rnorm(n, 0, 1))
HPDregionplot(mcmc(data.matrix(df)), prob=0.8)
with(df, points(x,y))
ContourLines <- as.data.frame(HPDregionplot(mcmc(data.matrix(df)), prob=0.8))
df$inpoly <- pnt.in.poly(df, ContourLines[, c("x", "y")])$pip
dp <- df[df$inpoly == 1,]
COG100 <- as.data.frame(t(COGravity(df$x, df$y)))
COG80 <- as.data.frame(t(COGravity(dp$x, dp$y)))
TestData <- ggplot(data = df) +
  stat_density2d(aes(x = x, y = y, fill = as.factor(..level..)),
                 bins = 5, geom = "polygon") +
geom_point(aes(x = x, y = y, colour = as.factor(inpoly)), alpha = 1) +
geom_point(data=COG100, aes(COGx, COGy),colour="white",size=2, shape = 4) +
geom_point(data=COG80, aes(COGx, COGy),colour="green",size=4, shape = 3) +
geom_polygon(data = ContourLines, aes(x = x, y = y), color = "blue", fill = NA) +
scale_fill_manual(values = c("yellow","red","green","royalblue", "brown", "black", "white", "black", "white","black")) +
scale_colour_manual(values = c("red", "black"))
TestData
nrow(dp)/nrow(df) # actual fraction of population members inside the 80% probability polygon
Alright, let me start by saying I'm not entirely sure of this answer, and it's only a partial answer! There is no bin parameter for MASS::kde2d, which is the function used by stat_density2d. Looking at the help page for kde2d and the code for it (seen simply by typing the function name in the console), I think the bin parameter is h (how these functions know to pass bin to h is not clear, however). Following the help page, we see that if h is not provided, it is computed by MASS::bandwidth.nrd. The help page for that function says this:
# The function is currently defined as
function(x)
{
r <- quantile(x, c(0.25, 0.75))
h <- (r[2] - r[1])/1.34
4 * 1.06 * min(sqrt(var(x)), h) * length(x)^(-1/5)
}
Based on this, I think the answer to your last question ("Am I safe...") is definitely no. r in the above function is what you need for your assumption to be safe, but it is clearly modified, so you are not safe. HTH.
Additional thought: Do you have any evidence that your code is using your bins argument? I'm wondering if it is being ignored. If so, try passing h in place of bins and see if it listens.
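Whether the bins argument has any effect can be checked empirically. The sketch below is only a suggestion and the helper `count_levels` is a made-up name: it counts how many distinct contour levels the density layer produces for two different values of bins. It says nothing about how much probability mass each level encloses.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(x = rnorm(100, 0, 1), y = rnorm(100, 0, 1))

# Hypothetical helper: number of distinct contour levels drawn for a given bins value.
count_levels <- function(bins) {
  p <- ggplot(df, aes(x = x, y = y)) + stat_density_2d(bins = bins)
  length(unique(layer_data(p, 1)$level))
}

levels_4 <- count_levels(4)
levels_10 <- count_levels(10)
```

If the two counts differ, the argument is being used, although only to choose how many contour levels are drawn.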
HPDregionplot in package:emdbook is supposed to do that. It does use MASS::kde2d, but it normalizes the result. To my mind, it has the disadvantage that it requires an mcmc object.
library(MASS)
library(coda)
library(emdbook) # provides HPDregionplot
HPDregionplot(mcmc(data.matrix(df)), prob=0.8)
with(df, points(x,y))
Building on the answer by 42, I've simplified HPDregionplot() to reduce dependencies and remove the requirement to work with mcmc objects. The function works on a two-column data.frame and creates no intermediate plots. Note, however, that this approach breaks as soon as grDevices::contourLines() returns multiple contours.
hpd_contour <- function (x, n = 50, prob = 0.95, ...) {
  post1 <- MASS::kde2d(x[[1]], x[[2]], n = n, ...)
  dx <- diff(post1$x[1:2])
  dy <- diff(post1$y[1:2])
  sz <- sort(post1$z)
  c1 <- cumsum(sz) * dx * dy
  levels <- sapply(prob, function(x) {
    approx(c1, sz, xout = 1 - x)$y
  })
  as.data.frame(grDevices::contourLines(post1$x, post1$y, post1$z, levels = levels))
}
library(ggplot2)
library(dplyr) # for %>%, group_by() and do() used further down
theme_set(theme_bw(16))
set.seed(1)
n=100
df <- data.frame(x=rnorm(n, 0, 1), y=rnorm(n, 0, 1))
ContourLines <- hpd_contour(df, prob=0.8)
ggplot(df, aes(x = x, y = y)) +
stat_density2d(aes(fill = as.factor(..level..)), bins=5, geom = "polygon") +
geom_point() +
geom_polygon(data = ContourLines, color = "blue", fill = NA) +
scale_fill_manual(values = c("yellow","red","green","royalblue", "brown", "black", "white", "black", "white","black")) +
scale_colour_manual(values = c("red", "black"))
Moreover, the workflow now easily extends to grouped data.
ContourLines <- iris[, c("Species", "Sepal.Length", "Sepal.Width")] %>%
group_by(Species) %>%
do(hpd_contour(.[, c("Sepal.Length", "Sepal.Width")], prob=0.8))
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point(size = 3, alpha = 0.6) +
geom_polygon(data = ContourLines, fill = NA) +
guides(color = "none") +
theme(plot.margin = margin())
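The limitation noted for hpd_contour() (it breaks when contourLines() returns several contours) could be worked around along the following lines. This is a sketch, not a tested replacement: `hpd_contour_multi` and the `piece` column are names introduced here, and the two-cluster data set is made up to provoke more than one contour.

```r
library(ggplot2)

# Variant of hpd_contour() that keeps every contour piece and labels it.
hpd_contour_multi <- function(x, n = 50, prob = 0.95, ...) {
  dens <- MASS::kde2d(x[[1]], x[[2]], n = n, ...)
  dx <- diff(dens$x[1:2])
  dy <- diff(dens$y[1:2])
  sz <- sort(dens$z)
  c1 <- cumsum(sz) * dx * dy
  # Density level whose upper region holds roughly `prob` of the estimated mass.
  level <- approx(c1, sz, xout = 1 - prob)$y
  pieces <- grDevices::contourLines(dens$x, dens$y, dens$z, levels = level)
  # Stack all pieces into one data frame, one id per closed or open contour.
  do.call(rbind, lapply(seq_along(pieces), function(i) {
    data.frame(piece = i, x = pieces[[i]]$x, y = pieces[[i]]$y)
  }))
}

set.seed(1)
# Made-up data with two well separated clusters.
two_clusters <- data.frame(x = c(rnorm(100, -4, 0.5), rnorm(100, 4, 0.5)),
                           y = c(rnorm(100, -4, 0.5), rnorm(100, 4, 0.5)))

# lims is passed on to MASS::kde2d so the density grid extends beyond the data range.
hull <- hpd_contour_multi(two_clusters, prob = 0.8, lims = c(-10, 10, -10, 10))

p <- ggplot(two_clusters, aes(x = x, y = y)) +
  geom_point() +
  geom_polygon(data = hull, aes(group = piece), color = "blue", fill = NA)
```

Grouping by `piece` in geom_polygon keeps separate contours from being joined into one shape.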