Limits and Breaks - r

For this question, I have been able to split the data into two histograms with one being income above the median and the other being income below median. The following code is what I've done so far:
library(openintro)
data("countyComplete")
attach(countyComplete)
median(median_household_income, na.rm = FALSE)
x<-subset(countyComplete,median_household_income > 42445)
y<-subset(countyComplete,median_household_income < 42445)
par(mfrow=c(1,2))
hist(x$median_household_income, main="Income Above Median" )
hist(y$median_household_income,main = "Income Below Median")
However, I am a bit confused about how to force the histograms to use the same limits on the y axis, as well as the same breaks. Could someone point me in the right direction? I tried to do this:
par(mfrow=c(1,2))
hist(x$median_household_income,
breaks=seq(0,100,by=5),
freq = FALSE,
ylim=c(0,.15),
xlim = range(breaks),
main="Income Above Median")
hist(y$median_household_income, main = "Income Below Median")
But I only get one histogram showing up on my plot screen and the console says
"Error in hist.default(x$median_household_income, breaks = seq(0, 100, :
some 'x' not counted; maybe 'breaks' do not span range of 'x'".
What do I do?

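I would forget the breaks argument. It doesn't make sense here: you are plotting values below and above the median, so the two ranges do not overlap. The error itself arises because seq(0, 100, by = 5) stops at 100 while the incomes are in the tens of thousands, so the breaks do not span the data.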
As for the histograms, I have precomputed the median, and the maximum value of the density.
library(openintro)
data("countyComplete")
med <- median(countyComplete$median_household_income, na.rm = FALSE)
x <- subset(countyComplete, median_household_income > med)
y <- subset(countyComplete, median_household_income < med)
hx <- hist(x$median_household_income, plot = FALSE)
hy <- hist(y$median_household_income, plot = FALSE)
MaxY <- max(c(hx$density, hy$density))
op <- par(mfrow = c(1, 2))
hist(x$median_household_income, main = "Income Above Median",
freq = FALSE, ylim = c(0, MaxY))
hist(y$median_household_income, main = "Income Below Median",
freq = FALSE, ylim = c(0, MaxY))
par(op)
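If common breaks are still wanted, one option is to compute a single set of breaks that spans both subsets and pass it to both histograms. The sketch below is only an illustration: it uses simulated incomes (the vector income is a made-up stand-in for countyComplete$median_household_income) so that it runs without the openintro package.

```r
# Simulated incomes; a hypothetical stand-in for the real column
set.seed(1)
income <- round(rlnorm(3000, meanlog = log(42000), sdlog = 0.25))
med <- median(income)
above <- income[income > med]
below <- income[income < med]

# One set of breaks covering the full range, so no value is left uncounted
brks <- pretty(range(income), n = 30)

# Compute both histograms without plotting to find a shared y limit
hx <- hist(above, breaks = brks, plot = FALSE)
hy <- hist(below, breaks = brks, plot = FALSE)
max_y <- max(hx$density, hy$density)

# Draw the precomputed histograms side by side on identical axes
op <- par(mfrow = c(1, 2))
plot(hx, freq = FALSE, xlim = range(brks), ylim = c(0, max_y),
     main = "Income Above Median")
plot(hy, freq = FALSE, xlim = range(brks), ylim = c(0, max_y),
     main = "Income Below Median")
par(op)
```

Because both panels share brks, each histogram will show empty bins on the side of the median it does not cover, which is the trade-off mentioned above.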

Related

Plot percentage change figure with 95% CI and stats

I am planning to reproduce the attached figure, but I have no clue how to do so:
Let's say I am using the CO2 example dataset, and I would like to plot the relative change of the Uptake according to the Treatment. Instead of having the three variables in the example figure, I would like to show the different Plants grouped for each day/Type.
So far, I managed only to get this bit of code, but this is far away from what it should look like.
aov1 <- aov(CO2$uptake~CO2$Type+CO2$Treatment+CO2$Plant)
plot(TukeyHSD(aov1, conf.level=.95))
The axes should be switched, and I would like to indicate statistically significant changes with letters or stars.
You can do this by building it in base R - this should get you started. See comments in code for each step, and I suggest running it line by line to see what's being done to customize for your specifications:
Set up data
# Run model
aov1 <- aov(CO2$uptake ~ CO2$Type + CO2$Treatment + CO2$Plant)
# Organize plot data
aov_plotdata <- data.frame(coef(aov1), confint(aov1))[-1,] # remove intercept
aov_plotdata$coef_label <- LETTERS[1:nrow(aov_plotdata)] # Example labels
Build plot
#set up plot elements
xvals <- 1:nrow(aov_plotdata)
yvals <- range(aov_plotdata[,2:3])
# Build plot
plot(x = range(xvals), y = yvals, type = 'n', axes = FALSE, xlab = '', ylab = '') # set up blank plot
points(x = xvals, y = aov_plotdata[,1], pch = 19, col = xvals) # add in point estimate
segments(x0 = xvals, y0 = aov_plotdata[,2], y1 = aov_plotdata[,3], lty = 1, col = xvals) # add in 95% CI lines
axis(1, at = xvals, label = aov_plotdata$coef_label) # add in x axis
axis(2, at = seq(floor(min(yvals)), ceiling(max(yvals)), 10)) # add in y axis
segments(x0=min(xvals), x1 = max(xvals), y0=0, lty = 2) #add in midline
legend(x = max(xvals)-2, y = max(yvals), aov_plotdata$coef_label, bty = "n", # add in legend
pch = 19,col = xvals, ncol = 2)
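For the two remaining requests (switched axes and significance markers), a possible starting point is sketched below. It is an assumption-laden illustration rather than the model from the question: it fits a simpler lm(uptake ~ Type + Treatment) on the built-in CO2 data, draws the estimates horizontally, and uses the conventional 0.001 / 0.01 / 0.05 thresholds for the stars.

```r
# Simplified model (an assumption; not the aov from the question)
fit  <- lm(uptake ~ Type + Treatment, data = CO2)
est  <- coef(fit)[-1]                              # drop intercept
ci   <- confint(fit)[-1, , drop = FALSE]           # 95% confidence intervals
pval <- summary(fit)$coefficients[-1, "Pr(>|t|)"]  # p-values per coefficient

# Map p-values to stars using conventional cut-offs
stars <- as.character(cut(pval, breaks = c(0, 0.001, 0.01, 0.05, 1),
                          labels = c("***", "**", "*", ""),
                          include.lowest = TRUE))

ypos <- seq_along(est)
xlim <- range(ci, 0) + c(-1, 2)                    # leave room for the stars

op <- par(mar = c(5, 9, 2, 2))                     # wide left margin for labels
plot(x = xlim, y = range(ypos) + c(-0.5, 0.5), type = "n",
     yaxt = "n", xlab = "Estimate (95% CI)", ylab = "")
segments(x0 = ci[, 1], x1 = ci[, 2], y0 = ypos)    # horizontal CI lines
points(x = est, y = ypos, pch = 19)                # point estimates
abline(v = 0, lty = 2)                             # reference line at zero
axis(2, at = ypos, labels = names(est), las = 1)   # coefficient names
text(x = ci[, 2], y = ypos, labels = stars, pos = 4)
par(op)
```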

Calculate intersection point of two density curves in R

I have two vectors of 1000 values (a and b), from which I created density plots and histograms. I would like to retrieve the coordinates (or just the y value) where the two plots cross (it does not matter if it detects several crossings, I can discriminate them afterwards). Please find the data in the following link. Sample Data
xlim = c(min(c(a,b)), max(c(a,b)))
hist(a, breaks = 100,
freq = F,
xlim = xlim,
xlab = 'Test Subject',
main = 'Difference plots',
col = rgb(0.443137, 0.776471, 0.443137, 0.5),
border = rgb(0.443137, 0.776471, 0.443137, 0.5))
lines(density(a))
hist(b, breaks = 100,
freq = F,
col = rgb(0.529412, 0.807843, 0.921569, 0.5),
border = rgb(0.529412, 0.807843, 0.921569, 0.5),
add = T)
lines(density(b))
Using locator() is not optimal, since I need to retrieve this from several plots (but I will use that approach if nothing else is viable). Thanks for your help.
We calculate the density curves for both series, taking care to use the same range. Then, we compare whether the y-value for a is greater than b at each x-value. When the outcome of this comparison flips, we know the lines have crossed.
df <- merge(
as.data.frame(density(a, from = xlim[1], to = xlim[2])[c("x", "y")]),
as.data.frame(density(b, from = xlim[1], to = xlim[2])[c("x", "y")]),
by = "x", suffixes = c(".a", ".b")
)
df$comp <- as.numeric(df$y.a > df$y.b)
df$cross <- c(NA, diff(df$comp))
points(df[which(df$cross != 0), c("x", "y.a")])
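The same sign-flip idea can be written without merge, with a linear interpolation step to place each crossing between two grid points. This is a minimal sketch under stated assumptions: the vectors a and b are simulated stand-ins for the linked sample data, and the grid size n = 1024 is an arbitrary choice.

```r
# Simulated stand-ins for the two vectors from the question
set.seed(42)
a <- rnorm(1000, mean = 0, sd = 1)
b <- rnorm(1000, mean = 2, sd = 1)

# Evaluate both densities on the same grid
lims <- range(c(a, b))
da <- density(a, from = lims[1], to = lims[2], n = 1024)
db <- density(b, from = lims[1], to = lims[2], n = 1024)

# The difference changes sign wherever the curves cross
d   <- da$y - db$y
idx <- which(diff(sign(d)) != 0)

# Linear interpolation between the two grid points around each sign change
x_cross <- da$x[idx] - d[idx] * (da$x[idx + 1] - da$x[idx]) / (d[idx + 1] - d[idx])
y_cross <- approx(da$x, da$y, xout = x_cross)$y

crossings <- data.frame(x = x_cross, y = y_cross)
crossings
```

After drawing the histograms and density lines, points(crossings) would mark the intersections on the existing plot.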

Histogram Normalization in R

I'm new to R and my question might be a little silly, but any help is appreciated. I want to graphically explore a sample to find an appropriate distribution from which the sample could have been drawn. But when I plot a histogram of the sample, the density of the sample exceeds the theoretical maximum value of 1.
How do I adjust this error? Do I need to transform the data or do I have to adjust the bins of the histogram?
My R code:
dataSample = read.table("sample6.txt", fill = TRUE)
sampleMatrix = as.matrix(dataSample)
sampleVector = as.vector(sampleMatrix)
h = hist(sampleVector, plot=F)
x =c(min(sampleVector, na.rm=T), max(sampleVector, na.rm=T))
ylim = range(0, h$density) # y limits from the precomputed histogram densities
hist(sampleVector, prob = T, col = "lightgreen", xlim = x,
ylim = ylim, main = "Histogram of data sample", xlab = "sample", ylab = "density")
This is my data sample:
0.5604785 0.0231508 0.2715692 0.2464922 0.2743465
0.434444 0.1779845 1.163666 0.5195378 0.08565649
0.2003622 0.3372351 0.02383633 0.2765776 0.1596984
0.3688299 0.2727399 0.3578011 0.4405475 0.07207568
0.424764 1.406219 1.12157 2.170512 0.6944183
2.429551 0.889546 0.1930762 0.579666 0.06834702
0.03690897 0.391838 1.019549 0.272865 0.1993042
0.02951076 0.3739699 0.2612313 1.988982 1.100386
0.9509101 1.978394 0.2469858 0.1256963 1.645895
0.1024105 0.336701 0.1322722 0.3881196 1.152153
0.6207026 1.506684 0.2826296
Thanks in advance!
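A density above 1 is not necessarily an error: with prob = T it is the bar areas (height times bin width), not the bar heights, that sum to 1, so bins narrower than 1 can be taller than 1. The short check below illustrates this with a made-up sample on a narrow range (x is hypothetical, not the data above).

```r
# Hypothetical sample confined to a narrow range
set.seed(7)
x <- runif(200, min = 0, max = 0.5)

h <- hist(x, plot = FALSE)

max(h$density)                    # bar heights can exceed 1
sum(h$density * diff(h$breaks))   # but the total bar area is 1
```

The same check can be run on the histogram object h from the question's own code.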

r : ecdf over histogram

In R, with ecdf I can plot an empirical cumulative distribution function
plot(ecdf(mydata))
and with hist I can plot a histogram of my data
hist(mydata)
How can I plot the histogram and the ecdf in the same plot?
EDIT
I am trying to make something like this:
https://mathematica.stackexchange.com/questions/18723/how-do-i-overlay-a-histogram-with-a-plot-of-cdf
Also a bit late, here's another solution that extends @Christoph's solution with a second y-axis.
par(mar = c(5,5,2,5))
set.seed(15)
dt <- rnorm(500, 50, 10)
h <- hist(
dt,
breaks = seq(0, 100, 1),
xlim = c(0,100))
par(new = T)
ec <- ecdf(dt)
plot(x = h$mids, y=ec(h$mids)*max(h$counts), col = rgb(0,0,0,alpha=0), axes=F, xlab=NA, ylab=NA)
lines(x = h$mids, y=ec(h$mids)*max(h$counts), col ='red')
axis(4, at=seq(from = 0, to = max(h$counts), length.out = 11), labels=seq(0, 1, 0.1), col = 'red', col.axis = 'red')
mtext(side = 4, line = 3, 'Cumulative Density', col = 'red')
The trick is the following: you don't add a line to your plot but draw another plot on top, which is why we need par(new = T). Then you have to add the right-hand y-axis afterwards (otherwise it would be plotted over the y-axis on the left).
Credits go here (@tim_yates's answer) and there.
There are two ways to go about this. One is to ignore the different scales and use relative frequency in your histogram. This results in a harder to read histogram. The second way is to alter the scale of one or the other element.
I suspect this question will soon become interesting to you, particularly @hadley's answer.
ggplot2 single scale
Here is a solution in ggplot2. I am not sure you will be satisfied with the outcome though because the CDF and histograms (count or relative) are on quite different visual scales. Note this solution has the data in a dataframe called mydata with the desired variable in x.
library(ggplot2)
set.seed(27272)
mydata <- data.frame(x= rexp(333, rate=4) + rnorm(333))
ggplot(mydata, aes(x)) +
stat_ecdf(color="red") +
geom_bar(aes(y = (..count..)/sum(..count..)))
base R multi scale
Here I will rescale the empirical CDF so that instead of a max value of 1, its maximum value is whatever bin has the highest relative frequency.
h <- hist(mydata$x, freq=F)
ec <- ecdf(mydata$x)
lines(x = knots(ec),
y=(1:length(mydata$x))/length(mydata$x) * max(h$density),
col ='red')
You can try a ggplot approach with a second axis:
library(tidyverse) # provides tibble, the pipe and ggplot2
set.seed(15)
a <- rnorm(500, 50, 10)
# calculate ecdf with binsize 30
binsize=30
df <- tibble(x=seq(min(a), max(a), diff(range(a))/binsize)) %>%
bind_cols(Ecdf=with(.,ecdf(a)(x))) %>%
mutate(Ecdf_scaled=Ecdf*max(a))
# plot
ggplot() +
geom_histogram(aes(a), bins = binsize) +
geom_line(data = df, aes(x=x, y=Ecdf_scaled), color=2, size = 2) +
scale_y_continuous(name = "Density",sec.axis = sec_axis(trans = ~./max(a), name = "Ecdf"))
Edit
Since the scaling was wrong, I added a second solution, calculating everything in advance:
binsize=30
a_range= floor(range(a)) +c(0,1)
b <- seq(a_range[1], a_range[2], round(diff(a_range)/binsize)) %>% floor()
df_hist <- tibble(a) %>%
mutate(gr = cut(a,b, labels = floor(b[-1]), include.lowest = T, right = T)) %>%
count(gr) %>%
mutate(gr = as.character(gr) %>% as.numeric())
# calculate ecdf with binsize 30
df <- tibble(x=b) %>%
bind_cols(Ecdf=with(.,ecdf(a)(x))) %>%
mutate(Ecdf_scaled=Ecdf*max(df_hist$n))
ggplot(df_hist, aes(gr, n)) +
geom_col(width = 2, color = "white") +
geom_line(data = df, aes(x=x, y=Ecdf*max(df_hist$n)), color=2, size = 2) +
scale_y_continuous(name = "Density",sec.axis = sec_axis(trans = ~./max(df_hist$n), name = "Ecdf"))
As already pointed out, this is problematic because the plots you want to merge have such different y-scales. You can try
set.seed(15)
mydata<-runif(50)
hist(mydata, freq=F)
lines(ecdf(mydata))
Although a bit late... Another version which is working with preset bins:
set.seed(15)
dt <- rnorm(500, 50, 10)
h <- hist(
dt,
breaks = seq(0, 100, 1),
xlim = c(0,100))
ec <- ecdf(dt)
lines(x = h$mids, y=ec(h$mids)*max(h$counts), col ='red')
lines(x = c(0,100), y=c(1,1)*max(h$counts), col ='red', lty = 3) # indicates 100%
x90 <- h$mids[which.min(abs(ec(h$mids) - 0.9))] # x value (not index) where the ECDF is closest to 90%
lines(x = c(x90, x90), y = c(0, max(h$counts)), col ='black', lty = 3) # indicates where 90% is reached
(Only the second y-axis is not working yet...)
In addition to the previous answers, I wanted to have ggplot do the tedious calculation (in contrast to @Roman's solution, which was kindly updated upon my request), i.e., calculate and draw the histogram and calculate and overlay the ECDF. I came up with the following (shown here with example data):
library(ggplot2)
set.seed(15)
mydata <- data.frame(x = rnorm(500, 50, 10)) # example data
# 1. Prepare the plot
plot <- ggplot(mydata, aes(x)) + geom_histogram(bins = 30)
# 2. Get the max value of Y axis as calculated in the previous step
maxPlotY <- max(ggplot_build(plot)$data[[1]]$y)
# 3. Overlay scaled ECDF and add secondary axis
plot +
stat_ecdf(aes(y=..y..*maxPlotY)) +
scale_y_continuous(name = "Density", sec.axis = sec_axis(trans = ~./maxPlotY, name = "ECDF"))
This way you don't need to calculate everything beforehand and feed the results to ggplot. Just sit back and let it do everything for you!

ScatterPlot and ONLY one Histogram plot together

I want to visualize time series data with a 'scatter plot' and a histogram on the right side, but I haven't been able to figure out how to turn OFF the histogram on the upper side.
Code Example:
install.packages("psych")
library(psych)
data = matrix(rnorm(n=100000,mean=2,sd=1.5), nrow = 100, ncol=1000)
fs = list()
fs$p_Z = 1*(data>2)
n_p = 1;
for(i in floor(seq(1,dim(data)[2],length.out=n_p)))
{
scatter.hist(x = rep(1:length(data[,i])), y = data[,i],
xlab = 'observations',
ylab = 'log(TPM)',
title = 'Mixture Plot',
col = c("red","blue")[fs$p_Z[,i]+1],
correl = FALSE, ellipse = FALSE, smooth = FALSE)
}
Result:
Expected Result:
Same as the one I have but with no histogram on the upper side. I.e., ONLY the histogram on the right side for log(TPM).
Note: I am using psych package, scatter.hist function which seemed easy and nice to use, but couldn't find how to turn off one histogram.
Where flexibility ends, hacking begins. If you look at the scatter.hist function, you will see that it is a pretty basic use of R base graphics. The following modified code would create the plot you want:
scat.hist <- function(x, y, xlab = NULL, ylab = NULL, title = "", ...) {
## Create layout
layout(matrix(c(1,2),1,2,byrow=TRUE), c(3,1), c(1,3))
## Plot scatter
par(mar=c(5,5,3,1))
plot(x= x, y = y, xlab = xlab, ylab = ylab, main = title, ...)
## Right histogram
yhist <- hist(y, plot = FALSE, breaks = 11)
par(mar=c(5,2,3,1))
mp <- barplot(yhist$density, space=0, horiz=TRUE, axes = FALSE)
## Density
d <- density(y, na.rm = TRUE, bw = "nrd", adjust = 1.2)
temp <- d$y
d$y <- (mp[length(mp)] - mp[1] + 1) * (d$x - min(yhist$breaks))/(max(yhist$breaks) - min(yhist$breaks))
d$x <- temp
lines(d)
}
Let's try it for the first column:
i = 1
scat.hist(x = seq_along(data[,i]), y = data[,i], col = c("red", "blue")[fs$p_Z[,i]+1], xlab = 'observations', ylab = 'log(TPM)', title = 'Mixture Plot')
