How to split the x-axis into deciles in R and plot with ggplot

Hi, I am wondering how to split the x-axis into deciles in R and plot the result with ggplot.
I currently have age range data and NO2 pollution data. The two datasets share the same geographic reference, named ward. I wish to plot my demographic data in quantiles that each contain an equal number of wards (298 in total).
I tried quantile regression in R, using the following:
library(SparseM)
library(quantreg)
mydata<- read.csv("M:/Desktop10/Test2.csv")
attach(mydata)
Y <- cbind(NO2.value)
X <- cbind(age.0.to.4, age..5.to.9, age.10.to.14, age.15.to.19, age.20.to.24, age.25.to.29, age.30.to.44, age.45.to.59, age.60.to.64, age.65.to.74, age.75.to.84, age.85.to.89, age.above.90)
quantreg.all <- rq(Y ~ X, tau = seq(0.05, 0.95, by = 0.05), data=mydata)
quantreg.plot <- summary(quantreg.all)
plot(quantreg.plot)
But what I get is not what I expected, as the y-axis is not the NO2 data.
The ideal plot is attached:
Many thanks for your help and suggestions.

If I understand your question, I think the cut function combined with the quantile function will create the deciles. Here's an example with fake data.
In the code below, we use the cut function to split the data into deciles and we use the quantile function to set the breaks argument for cut. This tells cut to group the data into 10 groups of equal size, from smallest values of NO2 to largest.
group_by(age) means we create the deciles separately for each age group. This means that there are equal numbers of subjects within each decile in a given age group, but the NO2 cutoff values for each decile are different for different age groups. To create deciles over the data as a whole, just remove group_by(age). This will result in the same NO2 cutoff values for each decile across all age groups, but within a given age group, the number of subjects will not be the same in each decile.
library(tidyverse)
# Fake data
set.seed(2)
dat = data.frame(NO2=c(runif(600, 0, 10), runif(400, 1, 11)),
age=rep(c("0-10","11-20"), c(600,400)))
# Create decile groups
dat = dat %>%
group_by(age) %>%
mutate(decile = cut(NO2, breaks=quantile(NO2, probs=seq(0,1,0.1)),
labels=10:1, include.lowest=TRUE),
decile = fct_rev(decile))
Now we plot using ggplot2. The stat_summary function returns the mean for each decile in each age group.
ggplot(dat, aes(decile, NO2, colour=age, group=age)) +
stat_summary(fun=mean, geom="line") +
stat_summary(fun=mean, geom="point") +
expand_limits(y=0) +
theme_bw()
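To see the difference between the two grouping choices described above, the decile sizes can be tabulated directly. This is a minimal sketch on the same fake data. The helper name to_decile is hypothetical and introduced only for this illustration, and it labels deciles from 1 (lowest NO2) to 10 (highest) rather than using the reversed labels above.

```r
library(dplyr)

# Same fake data as in the answer
set.seed(2)
dat <- data.frame(NO2 = c(runif(600, 0, 10), runif(400, 1, 11)),
                  age = rep(c("0-10", "11-20"), c(600, 400)))

# Hypothetical helper: assign each value to one of 10 quantile bins
to_decile <- function(x) {
  cut(x, breaks = quantile(x, probs = seq(0, 1, 0.1)),
      labels = 1:10, include.lowest = TRUE)
}

# Deciles computed separately within each age group
by_age <- dat %>%
  group_by(age) %>%
  mutate(decile = to_decile(NO2)) %>%
  ungroup() %>%
  count(age, decile)

# Deciles computed once over the whole data set
overall <- dat %>%
  mutate(decile = to_decile(NO2)) %>%
  count(age, decile)

by_age   # equal counts per decile within each age group
overall  # counts per decile generally differ within an age group
```

With group_by(age), every decile of an age group holds a tenth of that group's rows. Without it, the cutoffs are shared across groups, so the per-group counts are no longer balanced.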

Related

Probability density plots from count/value ("binned") data

I have a data set which contains 27M samples per day. I can reduce this, using count(), to 1500 samples per day, without loss.
When I come to plot, for example, histograms from this, I can use stat="identity" to process the count data considerably faster than the original data.
Is there a similar way to process the count data to obtain ridges using ggridges::geom_density_ridges(), or similar, to get the probability density without having to process the original data set?
It sounds like your current set-up is something like this (obviously with far more cases): a data frame containing a large vector of numeric measurements, with at least one grouping variable to specify different ridge lines.
We will stick to 2000 samples rather than 27M samples for demonstration purposes:
set.seed(1)
df <- data.frame(x = round(c(rnorm(1000, 35, 5), rnorm(1000, 60, 12))),
group = rep(c('A', 'B', 'C'), len = 2000))
We can reduce these 2000 observations down to ~200 by using count, and plot with geom_histogram using stat = 'identity':
library(dplyr); library(ggplot2)
df %>%
group_by(x, group) %>%
count() %>%
ggplot(aes(x, y = n, fill = group)) +
geom_histogram(stat = 'identity', color = 'black')
But we want to create density ridgelines from these 200 rows of counts rather than the original data. Of course, we could uncount them and create a density ridgeline normally, but this would be tremendously inefficient. What we can do is use the counts as weights for a density calculation. It seems that geom_density_ridges doesn't take a weight parameter, but stat_density does, and you can tell it to use the density_ridges geom. This allows us to pass our counts as weights for the density calculation.
library(ggridges)
df %>%
group_by(x, group) %>%
count() %>%
ggplot(aes(x, fill = group)) +
stat_density(aes(weight = n, y = group, height = after_stat(density)),
geom = 'density_ridges', position = 'identity')
Note that this should give us the same result as creating a ridgeline from our whole data set before counting, since our 'bins' are unique interval values. If your real data is binning continuous data before counting, you will have a slightly less accurate kernel density estimate when using count data, depending on how 'thin' your bins are.
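The equivalence between a weighted density on counts and a plain density on the raw values can be checked with base R's density(), which accepts a weights argument. This is a sketch under one assumption: the bandwidth is fixed by hand (bw = 3 is an arbitrary value chosen here), because the default bandwidth rule in density() is computed from the x values without taking the weights into account.

```r
set.seed(1)
x <- round(c(rnorm(1000, 35, 5), rnorm(1000, 60, 12)))

# Collapse to one row per distinct value, with its count
counts <- as.data.frame(table(x), stringsAsFactors = FALSE)
counts$x <- as.numeric(counts$x)

bw <- 3  # arbitrary bandwidth, fixed so both estimates use the same kernel width

# Density from all 2000 raw values
d_raw <- density(x, bw = bw)

# Density from the distinct values only, weighted by their relative frequency
d_wtd <- density(counts$x, weights = counts$Freq / sum(counts$Freq), bw = bw)

# The two curves should agree up to floating-point error
all.equal(d_raw$y, d_wtd$y)
```

If the bandwidth is left at its default, the two estimates may differ, since the rule of thumb then sees only the distinct values rather than the full sample.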

How can I plot 3 repeat observations per sample on a scatter in R?

I have a dataframe with the following columns; Sample, Read_length, Length, Rep, Year, Sex. Each unique sample has 6 Length values (2 Read_length conditions x 3 Reps). I would like to plot Length vs Year in such a way that each group of 3 repeats is visually linked on the plot, so I can see the variation. I am using colour and point shape to distinguish between the 2 read-lengths and between Male & Female.
ggplot(data1, aes(x = Year, y = Length, shape = Sex, colour = Read_length)) + geom_point(size = 3) + scale_shape_manual(values = c(1, 4))
Is there a way to group first by read_length, and then by sample name, to generate the groups of three (and how to then plot that)?
Take your input data and use group_by() from dplyr. This lets dplyr verbs and many other tidyverse functions process each sample separately. Note that ggplot2 itself does not use dplyr groups; it relies on its own group aesthetic.
data1 %>% group_by(Sample)
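Within ggplot2, points are usually linked through the group aesthetic. The following is a minimal sketch, not the asker's real data: the data frame below is made up to match the described columns, and the read-length labels, years and sample names are assumptions. It draws one line per Sample and Read_length combination, which joins the three repeats.

```r
library(ggplot2)

# Hypothetical data with the columns described in the question
set.seed(1)
data1 <- expand.grid(Rep = 1:3,
                     Read_length = c("100bp", "150bp"),
                     Sample = paste0("S", 1:4),
                     stringsAsFactors = FALSE)
data1$Year   <- rep(c(2001, 2005, 2010, 2014), each = 6)  # one year per sample
data1$Sex    <- rep(c("F", "M", "F", "M"), each = 6)      # one sex per sample
data1$Length <- rnorm(nrow(data1), mean = 50, sd = 5)

p <- ggplot(data1, aes(x = Year, y = Length, shape = Sex, colour = Read_length)) +
  # One line per Sample x Read_length combination joins its three repeats
  geom_line(aes(group = interaction(Sample, Read_length))) +
  geom_point(size = 3) +
  scale_shape_manual(values = c(1, 4))
print(p)
```

Because the three repeats of a sample share the same Year, each group appears as a short vertical segment whose length shows the spread of its repeats.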

plotting two categorical vectors in ggridges

I have a dataset with a few organisms, which I would like to plot on my y-axis, against date, which I would like to plot on the x-axis. However, I want the fluctuation of the curve to represent the abundance of the organisms. I.e I would like to plot a time series with the relative abundance separated by the organism to show similar patterns with time.
However, of course, plotting just date against an organism does not yield any information on the abundance. So, my question is, is there a way to make the curve represent abundance using ggridges?
Here is my code for an example dataset:
set.seed(1)
Data <- data.frame(
Abundance = sample(1:100),
Organism = sample(c("organism1", "organism2"), 100, replace = TRUE)
)
Date = rep(seq(from = as.Date("2016-01-01"), to = as.Date("2016-10-01"), by =
'month'),times=10)
Data <- cbind(Date, Data)
ggplot(Data, aes(x = Abundance, y = Organism)) +
geom_density_ridges(scale=1.15, alpha=0.6, color="grey90")
This produces a plot with the two organisms, but I want the date on the x-axis rather than abundance, and I cannot get that to work. I have read that you need to specify group=Date or convert the date to a Julian day, but neither changes the fact that I cannot incorporate abundance into the plot.
Does anyone have an example of a plot with date vs. a categorical variable (i.e. organism) plotted against a continuous variable in ggridges?
I really like the output from ggridges and would like to be able to use it for these visualizations. Thank you in advance for your help!
Cheers,
Anni
To use geom_density_ridges, it'll help to reshape the data to show observations in separate rows, vs. as summarized by Abundance.
library(ggplot2); library(ggridges); library(dplyr)
# Uncount copies the row "Abundance" number of times
Data_sum <- Data %>%
tidyr::uncount(Abundance)
ggplot(Data_sum, aes(x = Date, y = Organism)) +
ggridges::geom_density_ridges(scale=1, alpha=0.6, color="grey90")
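To make the reshaping step concrete, here is a small sketch of what tidyr::uncount() does on a made-up two-row frame (the frame tiny and its values are illustrative only). Each row is repeated as many times as its Abundance value, and the Abundance column is dropped by default.

```r
library(tidyr)

# Illustrative data: one summarized row per organism
tiny <- data.frame(Organism = c("organism1", "organism2"),
                   Abundance = c(3, 1))

# Repeat each row Abundance times; the weight column is removed by default
expanded <- uncount(tiny, Abundance)
expanded
nrow(expanded)  # 4 rows: three for organism1, one for organism2
```

After this step every row stands for one observation, which is the shape a density estimate over Date expects.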

Proper display of confidence interval in R using ggplot

I'm trying to make a plot that will represent 2 measurements(prr and ebgm) for different adverse reactions of different drugs grouped by age category like so:
library(ggplot2)
strata <- factor(c("Neonates", "Infants", "Children", "Adolescents", "Pediatrics"), levels=c("Neonates", "Infants", "Children", "Adolescents", "Pediatrics"), order=T)
Data <- data.frame(
strata = sample(strata, 200, replace=T),
drug=sample(c("ibuprofen", "clarithromycin", "fluticasone"), 200, replace=T), # 20 drugs
reaction=sample(c("Liver Injury", "Sepsis", "Acute renal failure", "Anaphylaxis"), 200, replace=T),
measurement=sample(c("prr", "EBGM"), 200, replace=T),
value_measurement=sample(runif(16), 200, replace=T),
lower_CI=sample(runif(6), 200, replace=T),
upper_CI=sample(runif(5), 200, replace=T)
)
g <- ggplot(Data, aes(x=strata, y=value_measurement, fill=measurement, group=measurement))+
geom_histogram(stat="identity", position="dodge")+
facet_wrap(~reaction)+
geom_errorbar(aes(x=strata, ymax=upper_CI, ymin=lower_CI), position="dodge", stat="identity")
ggsave(file="meh.png", plot=g)
The upper and lower CI are the confidence interval limits of the measurement. Given that I have a confidence interval for each measurement, I want each bar of the histogram to show its corresponding confidence interval, but what I get is as follows.
Graph:
Any ideas how to place those nasty conf intervals properly? Thank you!
Later edit: in the original data for a given drug I have many rows each containing an adverse reaction, the age category and each of these categories has 2 measurements: prr or EBGM and the corresponding confidence interval. This is not reflected in the data simulation.
The problem is that each of your bars is really multiple bars plotted over each other, because you have more than one row of data for each combination of reaction, strata, and measurement. (You're getting multiple error bars for the same reason.)
You can see this in the code below, where I've changed geom_histogram to geom_bar and added alpha=0.3 and colour="grey40" to show the multiple overlapping bars. I've also commented out the error bars.
ggplot(Data, aes(x=strata, y=value_measurement, fill=measurement, group=measurement)) +
geom_bar(stat="identity", position="dodge", alpha=0.3, colour="grey40") +
facet_wrap(~reaction) #+
# geom_errorbar(aes(x=strata, ymax=upper_CI, ymin=lower_CI),
# position="dodge", stat="identity")
You can fix this by adding another column to your data that adds a grouping category by which you can separate these bars. For example, in the code below we add a new column called count that just assigns numbers 1 through n for each row of data within each combination of reaction and strata. We sort by measurement so that each measurement type will be kept together in the count sequence.
library(dplyr)
Data = Data %>% group_by(reaction, strata) %>%
arrange(measurement) %>%
mutate(count = 1:n())
Now plot the data:
ggplot(Data, aes(x=strata, y=value_measurement,
fill=measurement, group=count)) +
geom_bar(stat="identity", position=position_dodge(0.7), width=0.6) +
facet_wrap(~reaction, ncol=1) +
geom_errorbar(aes(x=strata, ymax=upper_CI, ymin=lower_CI, group=count),
position=position_dodge(0.7), stat="identity", width=0.3)
Now you can see the separate bars, along with their error bars (which are weird, but only because they're fake data).

Nested tables and calculating summary statistics with confidence intervals in R

This question is about the statistical program R.
Data
I have a data frame, study_data, that has 100 rows, each representing a different person, and three columns, gender, height_category, and freckles. The variable gender is a factor and takes the value of either "male" or "female". The variable height_category is also a factor and takes the value of "tall" or "short". The variable freckles is a continuous, numeric variable that states how many freckles that individual has.
Here are some example data (thanks to Roland for this):
set.seed(42)
DF <- data.frame(gender=sample(c("m","f"),100,T),
height_category=sample(c("tall","short"),100,T),
freckles=runif(100,0,100))
Question 1
I would like to create a nested table that divides these patients into "male" versus "female", further subdivides them into "tall" versus "short", and then calculates the number of patients in each sub-grouping along with the median number of freckles with the lower and upper 95% confidence interval.
Example
The table should look something like what is shown below, where the # signs are replaced with the appropriate calculated results.
gender height_category n median_freckles LCI UCI
male tall # # # #
short # # # #
female tall # # # #
short # # # #
Question 2
Once these results have been calculated, I would then like to create a bar graph. The y axis will be the median number of freckles. The x axis will be divided into male versus female. However, these sections will be subdivided by height category (so there will be a total of four bars in groups of two). I'd like to overlay the 95% confidence bands on top of the bars.
What I've tried
I know that I can make a nested table using the MASS library and xtabs command:
ftable(xtabs(formula = ~ gender + height_category, data = study_data))
However, I'm not sure how to incorporate calculating the median of the number of freckles into this command and then getting it to show up in the summary table. I'm also aware that ggplot2 can be used to make bar graphs, but am not sure how to do this given that I can't calculate the data that I need in the first place.
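You should really provide a reproducible example. Anyway, you may find library(plyr) helpful. Be careful with these confidence intervals: they are t-based intervals around the mean, which rely on the group means being approximately normal, and that approximation can be poor in small groups (a common rule of thumb is n < 30).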
library(plyr)
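# DF is the example data frame from the question
# LCI/UCI form a t-based interval around the mean (standard error = sd/sqrt(n))
ddply(DF, .(gender, height_category), summarize,
n=length(freckles), median_freckles=median(freckles),
LCI=qt(.025, df=length(freckles) - 1)*sd(freckles)/sqrt(length(freckles))+mean(freckles),
UCI=qt(.975, df=length(freckles) - 1)*sd(freckles)/sqrt(length(freckles))+mean(freckles))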
EDIT: I forgot to add the bit on the plot. Assuming we save the previous result as tab:
library(ggplot2)
# tab already has one row per bar, so it can be plotted without reshaping
dodge <- position_dodge(width=0.9)
ggplot(tab, aes(fill=height_category, x=gender, y=median_freckles)) +
geom_bar(stat="identity", position=dodge) +
geom_errorbar(aes(ymax=UCI, ymin=LCI), position=dodge, width=0.25)
set.seed(42)
DF <- data.frame(gender=sample(c("m","f"),100,T),
height_category=sample(c("tall","short"),100,T),
freckles=runif(100,0,100))
library(plyr)
res <- ddply(DF,.(gender,height_category),summarise,
n=length(na.omit(freckles)),
median_freckles=quantile(freckles,0.5,na.rm=TRUE),
LCI=quantile(freckles,0.025,na.rm=TRUE),
UCI=quantile(freckles,0.975,na.rm=TRUE))
library(ggplot2)
p1 <- ggplot(res,aes(x=gender,y=median_freckles,ymin=LCI,ymax=UCI,
group=height_category,fill=height_category)) +
geom_bar(stat="identity",position="dodge") +
geom_errorbar(position="dodge")
print(p1)
#a better plot that doesn't require to precalculate the stats
library(Hmisc)
p2 <- ggplot(DF,aes(x=gender,y=freckles,colour=height_category)) +
stat_summary(fun.data="median_hilow",geom="pointrange",position = position_dodge(width = 0.4))
print(p2)
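The LCI and UCI columns above are a t-based interval around the mean in the first snippet and the 2.5% and 97.5% quantiles of the data in the second. If an interval for the median itself is wanted, one possible alternative (a swapped-in technique, not what either answer does) is a percentile bootstrap. The sketch below uses base R only. The helper name boot_median_ci and the choice of 2000 resamples are assumptions made for illustration.

```r
set.seed(42)
DF <- data.frame(gender = sample(c("m", "f"), 100, TRUE),
                 height_category = sample(c("tall", "short"), 100, TRUE),
                 freckles = runif(100, 0, 100))

# Hypothetical helper: percentile bootstrap interval for the median
# B is the number of resamples (an arbitrary choice here)
boot_median_ci <- function(x, B = 2000, level = 0.95) {
  meds <- replicate(B, median(sample(x, replace = TRUE)))
  alpha <- (1 - level) / 2
  unname(quantile(meds, c(alpha, 1 - alpha)))
}

# One numeric vector of freckle counts per gender x height combination
groups <- split(DF$freckles, list(DF$gender, DF$height_category), drop = TRUE)

# Build the summary table row by row
res_boot <- do.call(rbind, lapply(names(groups), function(g) {
  x <- groups[[g]]
  ci <- boot_median_ci(x)
  data.frame(group = g, n = length(x), median_freckles = median(x),
             LCI = ci[1], UCI = ci[2])
}))
res_boot
```

The resulting table has the same columns as the requested layout and could be passed to geom_bar and geom_errorbar in the same way as the tables above.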
