How do I plot a probability heatmap with ggplot2?

Say I have a bunch of data with x and y coordinates and TRUE/FALSE for each row:
library(tidyverse)
set.seed(666) #666 for the devil
x <- rnorm(1000, 50, 10)
y <- sample(1:100, 1000, replace = T)
result <- sample(c(T, F), 1000, prob = c(1, 9),replace = T)
data <- tibble(x, y, result)
Now I want to make a plot that shows how likely each area is to be TRUE, based on that data. I could group the data into little squares (or whatever), calculate the percentage of TRUE in each, and plot that, but I wonder whether there is something in ggplot2 that will do this for me automatically.

ggplot(data, aes(x = x, y = y, z = as.numeric(result))) +
  stat_summary_2d(bins = 20, color = "grey", fun = mean) +
  theme_classic()
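For comparison, the manual route mentioned in the question (bin the points, then compute the share of TRUE per cell) could look roughly like the sketch below. The 20 x 20 grid and the names x_bin, y_bin, p_true and cells are illustrative assumptions, not part of the original post.

```r
library(dplyr)
library(ggplot2)

set.seed(666)
data <- tibble(
  x = rnorm(1000, 50, 10),
  y = sample(1:100, 1000, replace = TRUE),
  result = sample(c(TRUE, FALSE), 1000, prob = c(1, 9), replace = TRUE)
)

# Assign each point to one of 20 x 20 equal-width cells (grid size is an assumption)
cells <- data %>%
  mutate(
    x_bin = cut(x, breaks = 20, labels = FALSE),
    y_bin = cut(y, breaks = 20, labels = FALSE)
  ) %>%
  group_by(x_bin, y_bin) %>%
  summarise(p_true = mean(result), n = n(), .groups = "drop")

# One tile per non-empty cell, filled by the observed share of TRUE
p <- ggplot(cells, aes(x_bin, y_bin, fill = p_true)) +
  geom_tile(colour = "grey")
p
```

Keeping the per-cell count n alongside p_true makes it easy to spot cells whose proportion rests on only one or two points.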

Not completely in ggplot2, but the following produces what I think you are asking for:
library(tidyverse)
library(broom)
set.seed(666) #666 for the devil
data.frame(x = rnorm(1000, 50, 10),
           y = sample(1:100, 1000, replace = T),
           result = sample(c(T, F), 1000, prob = c(1, 9), replace = T)) %>%
  do(augment(glm(result ~ x * y, data = ., family = "binomial"), type.predict = "response")) %>%
  ggplot(aes(x, y, color = .fitted)) +
  geom_point()
Using geom_hex instead of geom_point also looks interesting.

This seems to be another solution. Note that it plots the number of TRUE observations per bin rather than a proportion:
library(tidyverse)
#Preparing data
set.seed(666) #666 for the devil
data <- tibble(x = rnorm(1000, 50, 10),
               y = sample(1:100, 1000, replace = T),
               result = sample(c(T, F), 1000,
                               prob = c(1, 9), replace = T)) %>%
  filter(result == TRUE)
#Plotting with ggplot
ggplot(data, aes(x, y)) +
  geom_bin2d()

Related

Binning two columns of data frame together in R

I would like to bin two columns of a dataset simultaneously to create one common binned column. The simple code is as follows
x <- sample(100)
y <- sample(100)
data <- data.frame(x, y)
xbin <- seq(from = 0, to = 100, by = 10)
ybin <- seq(from = 0, to = 100, by = 10)
Any help is appreciated!
Not sure if this is what you are looking for:
library(tidyverse)
x <- sample(100)
y <- sample(100)
data <- data.frame(x, y)
xbin <- seq(from = 0, to = 100, by = 10)
ybin <- seq(from = 0, to = 100, by = 10)
data <- data %>%
  dplyr::mutate(
    x_binned = cut(x, breaks = seq(0,100,10)),
    y_binned = cut(y, breaks = seq(0,100,10))
  )
data %>%
  ggplot() +
  geom_bin_2d(
    aes(x = x_binned, y = y_binned), binwidth = c(10,10), colour = "red") +
  theme_minimal()
After asking in the comments I am still not quite sure what the desired answer would look like, but I hope that one of the two variants in the code below will work for you:
x <- sample(100)
y <- sample(100)
data <- data.frame(x, y)
xbin <- seq(from = 0, to = 100, by = 10)
ybin <- seq(from = 0, to = 100, by = 10)
data$xbin <- cut(data$x, breaks = xbin, ordered = TRUE)
data$ybin <- cut(data$y, breaks = ybin, ordered = TRUE)
data$commonbin1 <- paste0(data$xbin, data$ybin)
data$commonbin2 <- paste0("(",as.numeric(data$xbin),";", as.numeric(data$ybin),")")
head(data, 20)
This will construct a common binning variable commonbin1 that includes the bin-limits in the names of the bins and commonbin2 which will be easier to compare to the plot mentioned in the comment.
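As a possible alternative to pasting labels together, base R's interaction() can build the combined bin as a single factor. This is only a sketch: the seed, the separator and the column name commonbin are assumptions made for illustration.

```r
set.seed(1)  # seed added here only to make the sketch reproducible
x <- sample(100)
y <- sample(100)
data <- data.frame(x, y)
breaks <- seq(from = 0, to = 100, by = 10)

data$xbin <- cut(data$x, breaks = breaks)
data$ybin <- cut(data$y, breaks = breaks)

# interaction() builds one factor whose levels are the xbin/ybin combinations;
# drop = TRUE keeps only combinations that actually occur in the data
data$commonbin <- interaction(data$xbin, data$ybin, sep = " x ", drop = TRUE)

# Number of observations per combined bin
head(table(data$commonbin))
```

Because the result is a factor rather than a character string, it can be passed straight to table(), split() or a grouping step.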

ggplot2 - how to adjust stat_bin and stat to use calculation of a different variable

The goal is to generate a "histogram" of x where the bars are sum(y)/count(x), where y is another variable describing the data. The point is to use ggplot binning to do the grouping part. I do not want to calculate the binning myself and then perform the calculation.
example:
library(ggplot2)
library(data.table)
k <- runif(1000)
k <- k[order(k)]
y <- c(rbinom(n = 500, size = 1, prob = .05), rbinom(n = 500, size = 1, prob = .95))
w <- data.table(k, y)
so a plot(w$k, w$y) gives
so theoretically what I am looking for looks like this:
ggplot(w, aes(k)) + geom_histogram(aes(y = stat(sum(y)/count)))
but it generates this:
Not sure if this is what you want but sum(y) is going to be the same for all bars.
library(ggplot2)
library(data.table)
set.seed(13434)
k <- runif(1000)
k <- k[order(k)]
y <- c(rbinom(n = 500, size = 1, prob = .05), rbinom(n = 500, size = 1, prob = .95))
w <- data.table(k, y)
constant_value <- sum(w$y)
ggplot(w, aes(k)) + geom_histogram(aes(y = stat(constant_value/count)))
gives exactly the same plot as
ggplot(w, aes(k)) + geom_histogram(aes(y = stat(sum(w$y)/count)))
Edit:
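Not sure if this helps you. Here I sort by k and group the observations into consecutive bins of about 30 points each (note that ggplot2's histogram default is 30 bins, not a bin width of 30):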
library(tidyverse)
w %>%
  arrange(k) %>%
  mutate(bin = cut_interval(1:length(k), length=30, labels=FALSE)) %>%
  group_by(bin) %>%
  summarise(mean_y = mean(y),
            mean_k = mean(k),
            width = max(k) - min(k)) %>%
  ggplot(aes(mean_k, mean_y, width=width)) +
  geom_bar(stat="identity") +
  labs(x="k", y="mean y")
which makes this figure:
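ggplot2 also provides stat_summary_bin(), which bins x and applies a summary function to y within each bin, so it may be closer to what the question asks for. The sketch below assumes a ggplot2 version that accepts the fun argument (3.3.0 or later) and uses a plain data.frame instead of a data.table.

```r
library(ggplot2)

set.seed(13434)
k <- sort(runif(1000))
y <- c(rbinom(n = 500, size = 1, prob = .05), rbinom(n = 500, size = 1, prob = .95))
w <- data.frame(k, y)

# stat_summary_bin() bins k and applies `fun` to the y values in each bin;
# for a 0/1 variable, mean(y) equals sum(y) / count
p <- ggplot(w, aes(k, y)) +
  stat_summary_bin(fun = mean, geom = "bar", bins = 30)
p
```

The binning is left entirely to ggplot2 here, so changing bins is the only edit needed to try a different resolution.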

Kmean clustering in ggplot

I am using the k-means algorithm in R in order to separate variables. I would like to plot the results with ggplot, which I was able to manage;
however, the results seem to differ between ggplot and cluster::clusplot.
So I wanted to ask what I am missing. For example, I know that the scaling is different, but I was wondering why all points are inside the bounds when using clusplot but not when using ggplot.
Is it just because of the scaling?
So are the two results below exactly the same?
library(cluster)
library(ggfortify)
x <- rbind(matrix(rnorm(2000, sd = 123), ncol = 2),
matrix(rnorm(2000, mean = 800, sd = 123), ncol = 2))
colnames(x) <- c("x", "y")
x <- data.frame(x)
A <- kmeans(x, centers = 3, nstart = 50, iter.max = 500)
cluster::clusplot(cbind(x$x, x$y), A$cluster, color = T, shade = T)
autoplot(kmeans(x, centers = 3, nstart = 50, iter.max = 500), data = x, frame.type = 'norm')
For me, I get the same plot using either clusplot or ggplot. But to use ggplot, you first have to run a PCA on your data in order to get the same plot as clusplot. Maybe that is where your issue lies.
Here, with your example, I did:
library(ggplot2)
x <- rbind(matrix(rnorm(2000, sd = 123), ncol = 2),
           matrix(rnorm(2000, mean = 800, sd = 123), ncol = 2))
colnames(x) <- c("x", "y")
x <- data.frame(x)
A <- kmeans(x, centers = 3, nstart = 50, iter.max = 500)
cluster::clusplot(cbind(x$x, x$y), A$cluster, color = T, shade = T)
# PCA scores plus the k-means cluster labels (the column is named A.cluster)
pca_x = princomp(x)
x_cluster = data.frame(pca_x$scores,A$cluster)
ggplot(x_cluster, aes(x = Comp.1, y = Comp.2, color = as.factor(A.cluster), fill = as.factor(A.cluster))) + geom_point() +
  stat_ellipse(type = "t",geom = "polygon",alpha = 0.4)
The plot using clusplot
And the one using ggplot:
Hope it helps you figure out the reason for your different plots.
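To illustrate why plotting PCA scores changes the axes but not the clustering, here is a small sketch. The seed and the reduced sample size are assumptions chosen to keep it quick; it checks only that the unscaled two-variable PCA from princomp() is a rotation of the centred data.

```r
set.seed(42)  # seed and sample size are assumptions for this sketch
x <- rbind(matrix(rnorm(200, sd = 123), ncol = 2),
           matrix(rnorm(200, mean = 800, sd = 123), ncol = 2))
colnames(x) <- c("x", "y")
x <- data.frame(x)

A <- kmeans(x, centers = 3, nstart = 50, iter.max = 500)
pca_x <- princomp(x)

# With two variables, the PCA scores are the centred data rotated onto the
# principal axes, so pairwise distances between points are unchanged
same_distances <- isTRUE(all.equal(as.vector(dist(x)),
                                   as.vector(dist(pca_x$scores))))
same_distances

# The cluster labels come from kmeans() on the original data either way;
# only the coordinate system used for drawing differs
table(A$cluster)
```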

Linear model of geom_histogram data

I'm working with a dataset in which I have a continuous variable x and categorical variables y and z. Something like this:
set.seed(222)
df = data.frame(x = c(0, c(1:99) + rnorm(99, mean = 0, sd = 0.5), 100),
                y = rep(50, times = 101)-(seq(0, 50, by = 0.5))+rnorm(101, mean = 30, sd = 20),
                z = rnorm(101, mean = 50, sd= 10))
df$positive.y = sapply(df$y,
                       function(x){
                         if (x >= 50){"Yes"} else {"No"}
                       })
df$positive.z = sapply(df$z,
                       function(x){
                         if (x >= 50){"Yes"} else {"No"}
                       })
Then, using this dataset, I can create histograms to see whether there is a correlation between x and positive.y (or positive.z). With 10 bins it is clear that x correlates with positive.y, but not with positive.z:
ggplot(df,
       aes(x = x, fill = positive.y))+
  geom_histogram(position = "fill", bins = 10)
ggplot(df,
       aes(x = x, fill = positive.z))+
  geom_histogram(position = "fill", bins = 10)
Now from this I want two things:
Extract the actual data points to supply them to corr.test() function or something like that.
Add geom_smooth(method = "lm") to plot I have.
I tried to add "bin" column to the df, like this:
df$bin = sapply(df$x,
                function(x){
                  if (x <= 10){1}
                  else if (x > 10 & x <= 20) {20}
                  else if .......
                })
Then, using tapply(), I count the number of "Yes" and "No" for each df$bin and convert it to a percentage.
But in this case, each time I change the number of bins in the histogram, I have to rewrite and rerun this part of the code, which is tedious and consumes a lot of computer time if the dataset is large.
Is there a more straightforward way to achieve the same result?
I don't see a good justification for adding an lm line. Logistic regression is the appropriate model and doesn't require binning:
df$positive.y <- factor(df$positive.y)
mod <- glm(positive.y ~ x, data = df, family = "binomial")
summary(mod)
anova(mod)
library(ggplot2)
ggplot(df,
       aes(x = x, fill = positive.y))+
  geom_histogram(position = "fill", bins = 10) +
  stat_function(fun = function(x) predict(mod, newdata = data.frame(x = x),
                                          type = "response"),
                size = 2)
If you need an R² value (why?), there are different pseudo-R² available for GLMs, e.g.,
library(fmsb)
NagelkerkeR2(mod)
#$N
#[1] 101
#
#$R2
#[1] 0.4074274
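If the binned proportions themselves are still wanted, one way to avoid rewriting the binning code whenever the bin count changes is to let cut() create the bins from a single parameter. The helper share_yes below is hypothetical, and its equal-width bins need not coincide exactly with the bin edges geom_histogram chooses.

```r
set.seed(222)
df <- data.frame(x = c(0, c(1:99) + rnorm(99, mean = 0, sd = 0.5), 100))
df$y <- rep(50, times = 101) - seq(0, 50, by = 0.5) + rnorm(101, mean = 30, sd = 20)
df$positive.y <- ifelse(df$y >= 50, "Yes", "No")

# Hypothetical helper: share of "Yes" per equal-width bin of x
share_yes <- function(x, flag, n_bins = 10) {
  bin <- cut(x, breaks = n_bins, include.lowest = TRUE)
  tapply(flag == "Yes", bin, mean)
}

share_yes(df$x, df$positive.y, n_bins = 10)
share_yes(df$x, df$positive.y, n_bins = 5)   # changing the bin count needs no rewrite
```

The vectorised ifelse() and cut() calls also avoid the element-by-element sapply() loop, which should help with larger datasets.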

Create Color Palette with fixed breakpoints in R

Dear stackoverflow Community,
I have a vector with different correlation values, which I want to link to corresponding color codes (let's say -1="Dark Red", 0 ="Light Gray", 1="Dark Green"). So, for example, if my maximum value in the correlation would be 0.75, the corresponding color value should be a "Lighter green". Is there any solution to achieve this in R?
Thank you!
What you're looking for is ggplot2::scale_colour_gradient2(). Since you didn't provide any example data (which I highly recommend in the future; it encourages answers and helps answerers tailor their responses to your actual data structure), I concocted the following simple example:
library(ggplot2)
set.seed(123)
n <- 1000
corrs <- seq(-0.9, 0.9, length.out = 10)
vals <- matrix(0, nrow = 0, ncol = 2)
for ( corr in corrs ) {
    tmp <- mvtnorm::rmvnorm(n/10, sigma = matrix(c(1, corr, corr, 1), nrow = 2))
    # print(cor(tmp)) # If you want to do QA
    vals <- rbind(vals, tmp)
}
df <- data.frame(var1 = vals[ , 1], var2 = vals[ , 2],
                 corr = rep(corrs, each = n/10))
ggplot(df, aes(x = var1, y = var2, colour = corr)) +
    geom_point(shape = 1) +
    scale_colour_gradient2(low = "darkred", mid = "gray", high = "darkgreen")
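If the colour codes are needed outside a plot, a base R sketch along the following lines may help. The helper name corr_to_colour is hypothetical; it pins -1, 0 and 1 to the three anchor colours regardless of the range actually present in the data. Within ggplot2, passing limits = c(-1, 1) to scale_colour_gradient2() should pin the ends of the scale in a similar way.

```r
# Hypothetical helper: map correlations in [-1, 1] to hex colours with fixed anchors
corr_to_colour <- function(r) {
  stopifnot(all(r >= -1 & r <= 1))
  ramp <- grDevices::colorRamp(c("darkred", "lightgray", "darkgreen"))
  # Rescale [-1, 1] to [0, 1] so that -1, 0 and 1 hit the three anchor colours;
  # round() guards against truncation when rgb() converts to integers
  grDevices::rgb(round(ramp((r + 1) / 2)), maxColorValue = 255)
}

corr_to_colour(c(-1, 0, 0.75, 1))
```

A maximum correlation of 0.75 then lands on a lighter green between the gray and dark green anchors, as the question describes.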
