Plotting extreme values on a histogram with ggplot - r

Data:
data = data.frame(rnorm(250, 90, sd = 30))
I want to create a histogram with bins of a fixed width, but with all observations that are bigger than one arbitrary number or lower than another arbitrary number grouped into their own bins. To take the above data as an example, I want binwidth = 10, but all values above 100 together in one bin and all values below 20 together in their own bin.
I looked at some answers, but they make no sense to me since they are mostly code. I would appreciate it greatly if somebody can explain the steps.

The examples below show how to create the desired histogram in base graphics and with ggplot2. Note that the resulting histogram will be quite distorted compared to one with a constant break size.
Base Graphics
The R function hist creates the histogram and allows us to set whatever bins we want using the breaks argument:
# Fake data
set.seed(1049)
dat = data.frame(value=rnorm(250, 90, 30))
hist(dat$value, breaks=c(min(dat$value), seq(20,100,10), max(dat$value)))
In the code above c(min(dat$value), seq(20,100,10), max(dat$value)) sets breaks that start at the lowest data value and end at the highest data value. In between we use seq to create a sequence of breaks that goes from 20 to 100 by increments of 10. Here's what the plot looks like:
ggplot2
library(ggplot2)
ggplot(dat, aes(value)) +
  # after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
  geom_histogram(breaks=c(min(dat$value), seq(20,100,10), max(dat$value)),
                 aes(y=after_stat(density)), color="grey30", fill=hcl(240,100,65)) +
  theme_light()
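To see what these breaks do, it can help to tabulate the bins instead of plotting them. The following is a minimal sketch and not part of the original answer: the names brks and bins are illustrative, and sort(unique(...)) is an added safeguard for samples that happen to have no value below 20 or above 100.

```r
# Same fake data as above
set.seed(1049)
dat <- data.frame(value = rnorm(250, 90, 30))

# One wide bin below 20, bins of width 10 from 20 to 100, one wide bin above 100
brks <- sort(unique(c(min(dat$value), seq(20, 100, 10), max(dat$value))))

# plot = FALSE returns the bin counts and densities without drawing anything
h <- hist(dat$value, breaks = brks, plot = FALSE)

bins <- data.frame(
  lower   = head(h$breaks, -1),
  upper   = tail(h$breaks, -1),
  count   = h$counts,
  density = h$density
)
print(bins)
```

The two outer bins are wider than 10, so comparing raw counts across bins would be misleading. Density (count divided by n times the bin width) keeps bar areas comparable, which is why both plots above use a density scale.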

Related

How to plot visualization of missing values for big data in R?

I would like to draw a plot of missing values for a big data (1000 variables), I tried vis_miss function as follows
library(naniar)
vis_miss(predictors, warn_large_data=TRUE)
However, it shows the names of the variables after drawing the plot, which are barely readable as there are too many variables. I was wondering:
1. Is there any way to remove variable names from the x axis?
2. Is there any other beautiful way to draw a missing value plot for big data?
The vis_miss() function is ggplot-based, so you can change it relatively easily.
Regarding your question:
if there is any way to remove variable names from the x axis
You can remove them using e.g.
library(ggplot2) # theme() and element_blank() are ggplot2 functions
vis_miss(predictors, warn_large_data=TRUE) +
  theme(axis.text.x = element_blank())
Or alter them using e.g.
vis_miss(predictors, warn_large_data=TRUE) +
theme(axis.text.x = element_text(size = 6, angle = 60))
And for your other question:
Is there any other beautiful way to draw a missing value plot for big data?
Without a sample of your actual data it is hard to say what would be best, but there are some suggestions such as gg_miss_upset() here: https://cran.r-project.org/web/packages/naniar/vignettes/naniar-visualisation.html
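With around 1000 variables, another option is to summarise missingness per variable first and plot only the summary. The following is a minimal base R sketch under stated assumptions: predictors is simulated here as a stand-in for the real data, and the names miss_prop and top are illustrative.

```r
# Simulated stand-in for the real data: 200 rows, 1000 variables, 5000 missing cells
set.seed(42)
m <- matrix(rnorm(200 * 1000), nrow = 200)
m[sample(length(m), 5000)] <- NA
predictors <- as.data.frame(m)

# Proportion of missing values in each variable
miss_prop <- colMeans(is.na(predictors))

# Keep only the 20 variables with the most missing values so labels stay readable
top <- sort(miss_prop, decreasing = TRUE)[1:20]

barplot(top, las = 2, cex.names = 0.6,
        ylab = "Proportion missing",
        main = "20 variables with the most missing values")
```

A histogram of miss_prop (hist(miss_prop)) gives an overview across all variables without needing any axis labels per variable.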

Using multiple summary statistics in a ggplot2 plot

I'm analysing some house sale transaction data, and I want to produce a geographic plot with the colour indicating average price per (hex-binned) region. Some regions have limited data, and I want to indicate this by adjusting the opacity to reflect the number of points in each region.
This would require me to calculate two statistics for each hex: average price and number of points. The ggplot2 package makes it very easy to calculate and plot one statistic in a chart, but I can't figure out how to calculate two.
To illustrate the point:
library(ggplot2)
N = 1000;
df_demo = data.frame(A=runif(N), B=runif(N), C=runif(N)) # dummy data
# I want to produce a hex-binned version of this:
ggplot(data=df_demo) + geom_point(mapping=aes(x=A, y=B, color=C))
# It's easy to get each hex's average price *or* its point density:
ggplot(data=df_demo) + stat_summary_hex(mapping=aes(x=A,y=B,z=C), fun=mean) # color = average of C across hex, but opacity can't be adjusted
ggplot(data=df_demo) + geom_hex(mapping=aes(x=A, y=B, color=C, alpha=..ndensity..)) # opacity = normalised # of points, but color is *total* value which is wrong
I would like to combine the effects of the last two lines, but that doesn't seem to be an option: the ..ndensity.. statistic doesn't work in the context of stat_summary_hex(), and geom_hex() won't calculate the mean value.
Is there a way to do this that I'm overlooking? Alternatively, is there an obvious way of precomputing the statistics needed before constructing the plot? E.g. by determining the expected hex for each datum during my dplyr pipeline.
One hint that there may not be an easy solution is this non-CRAN package which - if I've understood correctly - solves more or less this problem. However, I'd rather not rely on out-of-CRAN code if at all possible, so I'm holding onto hope that I've missed something obvious.
What about a different geom? E.g. geom_tile - you can create cuts for each dimension (A/B) and then pre-calculate mean and number for each tile and then plot like this:
library(tidyverse)
N = 1000;
df_demo = data.frame(A=runif(N), B=runif(N), C=runif(N)) %>%
  mutate(cuts_a= cut(A, breaks = 20), cuts_b= cut(B, breaks = 20)) %>%
  group_by(cuts_a, cuts_b) %>%
  # summarise() keeps one row per tile, so tiles are not drawn on top of each other
  summarise(mean_c = mean(C), n_obs = n(), .groups = "drop")
# Tile version: fill = mean of C per tile, alpha = number of points per tile
ggplot(data=df_demo) +
  geom_tile(mapping=aes(x=cuts_a, y=cuts_b, fill=mean_c, alpha = n_obs))
Created on 2020-02-13 by the reprex package (v0.3.0)

How to create a heat map with the number of repetition inside a certain range value

I have a dataset that looks like this one, with month (mese) in one column and the corresponding value in the other column and I'm trying to create a heatmap with the month(s) on the x axis, different "intervals" on the y axis (e.g. from 0 to 10, 10 to 20, 20 to 30 etc.) and the number of times a certain range of value repeats itself inside the month for each range.
I tried to use the cut function for both the x and the y axis in order to create a number of ranges of values, then putting everything into a table and plotting it with this code
x_c <- cut(x, 12)
y_c <- cut(y, 50)
z <- table(x_c, y_c)
image2D(z=z, border="black")
but it doesn't seem to work: the scale is always from 0 to 1 (and I need the actual values)... is there an easier solution?
Essentially, I need the end result to look something like this (sorry for my very poor paint skills): i.e. the level of sulphate is higher during the winter than the summer and the majority of the data follow a "curve" that reflect this tendency
You can use geom_bin2d from ggplot2. You can define the number of bins:
library(ggplot2)
ggplot(data, aes(mese, nnso4)) +
geom_bin2d(bins=c(12,50)) +
scale_fill_gradient(low="yellow", high="red")
You can change the fill scale, for instance viridis package has some options.
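If you also want the actual counts as numbers, they can be tabulated directly. The following is a minimal base R sketch with simulated data standing in for the real dataset; only the column names mese and nnso4 come from the question, and the seasonal pattern, the 10 value ranges, and the object names are assumptions for illustration.

```r
# Simulated stand-in: 600 observations with higher values in winter months
set.seed(7)
n <- 600
mese <- sample(1:12, n, replace = TRUE)
nnso4 <- 20 + 10 * cos(2 * pi * (mese - 1) / 12) + rnorm(n, sd = 4)

# Split the values into 10 ranges and count observations per month and range
y_c <- cut(nnso4, breaks = 10)
counts <- table(mese = factor(mese, levels = 1:12), range = y_c)
print(counts)

# Draw the raw counts as a heat map: months on x, value ranges on y
z <- matrix(as.integer(counts), nrow = nrow(counts))
image(x = 1:12, y = seq_len(ncol(counts)), z = z,
      col = rev(heat.colors(12)),
      xlab = "Month", ylab = "Value range (bin index)")
```

Each cell of counts is the number of observations in that month and value range, so the colours reflect actual counts and not a rescaled 0 to 1 value.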

Plot Line Chart of Binary Variable Against Continuous Data

I am looking for a way to better visualize the relationship between an independent continuous variable and a binary response variable.
I am trying to understand how I can add a 2nd y axis to the existing plot I have below. I want to get a sense of the response rate over different numerical ranges visually.
How can I add in the response percent at any given histogram bin? For example if there were 10 observations in a bin and 2 were the positive class, then this would show a response of 20%.
Ideally it's possible that this would be dynamic in that I might change the # of bins. For instance, I have 10 here, I might want 20 the next time.
This would be a connected line-chart with the corresponding percentages from #1 on the right y axis.
Or in other words, I want a line chart of the positive class to be displayed as a line chart with % show in Y axis.
library(mlbench)
library(tidyverse)
data(Sonar) ## from mlbench
library(ggplot2)
ggplot(Sonar, aes(x=V11, fill=Class)) +
geom_histogram(col='black', bins = 10) +
scale_fill_manual(values=c("purple", "green")) +
labs(title = "Count Left Y Axis; 'R' class percent of BIN in Right Y Axis" ,
x = 'Variable Value in this case V11', y ='Count of Observations' )
Not sure if this is what you are after but the description you gave sounded very similar to a conditional density plot.
ggplot probably has an alternative to this, but with base R:
cdplot(Class ~ V1, Sonar, col=c("cornflowerblue", "orange"), main="Conditional density plot")
And the result:
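For the per-bin percentage described in the question, the rate can also be computed directly and drawn as a connected line. The following is a minimal base R sketch, not the answerer's method: it uses simulated data in place of Sonar, and the names x, cls, n_bins and rate are illustrative.

```r
# Simulated stand-in: the chance of class "R" increases with x
set.seed(1)
x <- runif(500)
cls <- factor(ifelse(runif(500) < x, "R", "M"))

# Change n_bins to get a different number of bins
n_bins <- 10
bin <- cut(x, breaks = n_bins)

# Share of the positive class "R" within each bin
rate <- tapply(cls == "R", bin, mean)
print(round(100 * rate, 1))

# Connected line chart of the response percentage per bin
plot(seq_along(rate), 100 * rate, type = "b", xaxt = "n",
     xlab = "Bin of x", ylab = "Percent of class R", ylim = c(0, 100))
axis(1, at = seq_along(rate), labels = names(rate), las = 2, cex.axis = 0.6)
```

A bin with 10 observations of which 2 are "R" would show 20 here. Bins with no observations come out as NA and leave a gap in the line.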

Plot two datasets with different scales on the same graph, same axis in R

I have two datasets, that I'd like to see on a single scatterplot with a single axis. One dataset has Y values ranging from 0 to 0.0006, the other between 0 and 1.
Each dataset has 50 entries.
In R, is there a way of changing the scale of the y axis at the 0.0006 mark to show detail in both halves of the graph, e.g., the range of 0 - 0.0006 and 0.0006 - 1 would be the same size on the graph.
I did this using a log scale. This is a sample dataset, which doesn't go all the way to 1 but tops out around 0.07.
I'm still open to other techniques as this one gives too much emphasis to the 0.0006-0 range.
You can scale your data for plotting, then call axis twice:
y1<-runif(50,0,0.0006)
y2<-runif(50,0.0006,1)
x<-runif(50)
y1.scaled<-y1*(0.5/0.0006)
y2.scaled<-(y2-0.0006)*(1-0.5)/(1-0.0006) + 0.5
plot(c(0,1),c(0,1),col=NA,yaxt='n',ylab="",xlab="")
points(x,y1.scaled,pch=20,col="red")
points(x,y2.scaled,pch=21,col="black")
axis(2,at=seq(0,0.5,length.out = 3), labels = c(0,0.0003,0.0006), col="red")
axis(2,at=seq(0.5,1,length.out = 3), labels = seq(0.0006,1,length.out=3))
See this post for how to re-scale a set of numbers with a known min and max to any other min and max:
How to scale down a range of numbers with a known min and max value
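The rescaling used for y1.scaled and y2.scaled above is the same linear mapping applied with different endpoints. As a minimal sketch, it can be wrapped in a small helper; rescale_range is a hypothetical name and not a base R function.

```r
# Map x linearly from [from_min, from_max] onto [to_min, to_max]
rescale_range <- function(x, from_min, from_max, to_min, to_max) {
  (x - from_min) / (from_max - from_min) * (to_max - to_min) + to_min
}

# Lower dataset goes to the bottom half of the plot, upper dataset to the top half
y1 <- c(0, 0.0003, 0.0006)
y2 <- c(0.0006, 0.5003, 1)
y1.scaled <- rescale_range(y1, 0, 0.0006, 0, 0.5)
y2.scaled <- rescale_range(y2, 0.0006, 1, 0.5, 1)
print(y1.scaled)
print(y2.scaled)
```

The same function gives the tick positions for axis(): pass the original values you want labelled through it to get the at argument.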
Assuming you have two different datasources (and that values from either source can be <0.0006) we could combine them, create an indicator for whether or not the value is <0.0006, and then use a facet_wrap with free scales. Something like this:
library(ggplot2)
set.seed(1)
y1<-runif(50,0,0.0006)
y2<-runif(50,0,1)
x<-1:50
df<-as.data.frame(rbind(cbind(y1,x),cbind(y2,x))) #Combine data
df$y1 <- as.numeric(as.character(df$y1))
df$x <- as.numeric(as.character(df$x))
df$group <- (df$y1 <= 0.0006) #Create group
#ggplot with facet
ggplot(data=df) + geom_point(aes(y=y1,x=x)) + facet_wrap(~group,scales="free")
