R ggplot: How to Show Probability of Two Variables

I have a distribution of data that is shown below in image 1. My goal is to show the likelihood that a variable is below a particular value for both X and for Y. For instance, I'd like to have a good way to show that ~95% of values are below 8000 on X-axis and below 6500 on the Y-axis. I am confident that there is a simple answer to this. I apologize if this has been asked many times before.
plot1 <- df %>% ggplot(mapping = aes(x = FLUID_TOT)) + stat_ecdf() + theme_bw()
plot2 <- df %>% ggplot(mapping = aes(x = FLUID_TOT, y = y)) + geom_point() + theme_bw()

Related

Why is the R script provided by the PubMed site not executing? Is it possible to make it run?

If possible, I need help understanding why the code below is not working. I found this code on the page: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3817376/. Would it be possible for an expert member to adapt it so that it works?
library(ggplot2)
library(nlme)
head(Theoph)
ggplot(data=Theoph, aes(x=Time, y=conc, group=Subject)) + geom_line() + labs(x=“Time (hr)”, y=“Concentration (mg/L)”)
p <- ggplot(data=Theoph, aes(x=Time, y=conc, group=Subject)) + geom_line() + labs(x=“Time (hr)”, y=“Concentration (mg/L)”) + stat_summary(fun.y=median, geom=“line”,aes(x=ntpd, y=conc, group=1), color=“red”, size=1)
print(p) # “p” is a ggplot object
# create a flag for body weight
Theoph$WT <- ifelse(Theoph$Wt<70, “WT < 70kg”, “WT >= 70kg”)
p + facet_grid(.~WT)
There are a couple things to help you run this.
First, you have curly/smart quotes “ in your code, and should just use plain quotes ". Sometimes we get this excess formatting when we copy/paste code from other sources like this.
Second, you need to use the supplementary materials to calculate ntpd and add it to the Theoph dataset.
Below is code that seemed to work at my end to reproduce the spaghetti plots.
library(ggplot2)
library(nlme)
# Reference:
# https://ascpt.onlinelibrary.wiley.com/doi/10.1038/psp.2013.56
head(Theoph)
ggplot(data = Theoph, aes(x = Time, y = conc, group = Subject)) +
geom_line() +
labs(x = "Time (hr)", y = "Concentration (mg/L)")
##################################################################################
## we need some data manipulation for Figure 1(e) and Figure (f)
## below code is how to calculate approximate ntpd (nominal post time dose)
## "ntpd" is used for summarizing conc data (calculate mean at each time point)
## create body weight category for <70 kg or >=70 kg
##################################################################################
#--create a cut (time intervals)
Theoph$cut <- cut(Theoph$Time, breaks=c(-0.1,0,1,1.5, 2,3,4,6,8,12,16,20,24))
#--make sure each time point has reasonable data
table(Theoph$cut)
#--calculate approximate ntpd
library(plyr)
tab <- ddply(Theoph, .(cut), summarize, ntpd = round(mean(Time, na.rm = TRUE), 2))
#--merge ntpd into Theoph data
Theoph <- merge(Theoph, tab, by = c("cut"), all.x = TRUE)
#--sort the data by Subject and Time, select only necessary columns
Theoph <- Theoph[order(Theoph$Subject, Theoph$Time), c("Subject","Wt","Dose","Time","conc","ntpd")]
#--create body weight category for <70 kg or >=70 kg for Figure 1(f)
Theoph$WT <- ifelse(Theoph$Wt<70, "WT < 70kg", "WT >= 70kg")
#--end of data manipulation
##################################################################################
p <- ggplot(data = Theoph, aes(x=Time, y=conc, group=Subject)) +
geom_line() +
labs(x="Time (hr)", y="Concentration (mg/L)") +
stat_summary(fun = median, geom = "line", aes(x = ntpd, y = conc, group = 1), color = "red", size=1)
print(p)
p + facet_grid(. ~ WT)
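The curly-quote problem from the first point can also be repaired programmatically before pasted code is run. This is a minimal sketch and not part of the original answer; the variable names `pasted` and `fixed` are made up for illustration.

```r
# A line of code as it might arrive after copy/paste, with curly quotes
# (written with \u escapes so this script itself stays plain ASCII).
pasted <- "labs(x = \u201cTime (hr)\u201d)"

# Replace left and right curly double quotes with plain double quotes.
fixed <- gsub("\u201c", "\"", pasted, fixed = TRUE)
fixed <- gsub("\u201d", "\"", fixed, fixed = TRUE)

# The repaired text parses as valid R code.
parse(text = fixed)
```

For a whole script, the same two `gsub()` calls could be applied to the result of `readLines()` before writing the file back out.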

How to graph "before and after" measures using ggplot with connecting lines and subsets?

I’m totally new to ggplot, relatively fresh with R and want to make a smashing "before-and-after" scatterplot with connecting lines to illustrate the movement in percentages of different subgroups before and after a special training initiative. I’ve tried some options, but have yet to:
show each individual observation separately (now same values are overlapping)
connect the related before and after measures (x=0 and X=1) with lines to more clearly illustrate the direction of variation
subset the data along class and id using shape and colors
How can I best create a scatter plot using ggplot (or other) fulfilling the above demands?
Main alternative: geom_point()
Here is some sample data and example code using geom_point():
x <- c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1) # 0=before, 1=after
y <- c(45,30,10,40,10,NA,30,80,80,NA,95,NA,90,NA,90,70,10,80,98,95) # percentage of "feelings of peace"
class <- c(0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1) # 0=multiple days 1=one day
id <- c(1,1,2,3,4,4,4,4,5,6,1,1,2,3,4,4,4,4,5,6) # id = per individual
df <- data.frame(x,y,class,id)
ggplot(df, aes(x=x, y=y), fill=id, shape=class) + geom_point()
Alternative: scale_size()
I have explored stat_sum() to summarize the frequencies of overlapping observations, but then I am not able to subset using colors and shapes because of the overlap.
ggplot(df, aes(x=x, y=y)) +
stat_sum()
Alternative: geom_dotplot()
I have also explored geom_dotplot() to clarify the overlapping observations that arise from using geom_point(), as in the example below. However, I have yet to understand how to combine the before and after measures in the same plot.
df1 <- df[1:10,] # data before
df2 <- df[11:20,] # data after
p1 <- ggplot(df1, aes(x=x, y=y)) +
geom_dotplot(binaxis = "y", stackdir = "center",stackratio=2,
binwidth=(1/0.3))
p2 <- ggplot(df2, aes(x=x, y=y)) +
geom_dotplot(binaxis = "y", stackdir = "center",stackratio=2,
binwidth=(1/0.3))
grid.arrange(p1,p2, nrow=1) # GridExtra package
Or maybe it is better to summarize the data by x, id, and class as the mean/median of y, filter out ids that produce NAs (e.g. ids 3 and 6), and connect the points by lines? If you don't really need to show variability within ids (which could be the case if the plot only illustrates tendencies), you can do it this way:
library(ggplot2)
library(dplyr)
#library(ggthemes)
df <- df %>%
  group_by(x, id, class) %>%
  summarize(y = median(y, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(
    id = factor(id),
    x = factor(x, labels = c("before", "after")),
    # class is coded 0 = multiple days, 1 = one day in the question
    class = factor(class, labels = c("multiple days", "one day"))
  ) %>%
  group_by(id) %>%
  mutate(nas = any(is.na(y))) %>%
  ungroup() %>%
  filter(!nas) %>%
  select(-nas)
ggplot(df, aes(x = x, y = y, col = id, group = id)) +
  geom_point(aes(shape = class)) +
  geom_line(show.legend = FALSE) +
  #theme_few() +
  #theme(legend.position = "none") +
  ylab("Feelings of peace, %") +
  xlab("")
Here's one possible solution for you.
First, to have color and shape determined by variables, you need to put them inside the aes() function. I converted several variables to factors, so labs() is used to fix the labels so that they appear as "x" rather than "factor(x)".
To address multiple points, one solution is to use geom_smooth with method = "lm". This plots the regression line, instead of connecting all the dots.
The option se = FALSE prevents confidence intervals from being plotted - I don't think they add a lot to your plot, but play with it.
Connecting the dots is done by geom_line - feel free to try that as well.
Within geom_point, the option position = position_jitter(width = .1) adds random noise to the x-axis so points do not overlap.
ggplot(df, aes(x=factor(x), y=y, color=factor(id), shape=factor(class), group = id)) +
geom_point(position = position_jitter(width = .1)) +
geom_smooth(method = 'lm', se = FALSE) +
labs(
x = "x",
color = "ID",
shape = 'Class'
)

Adding multiple points to a ggplot ecdf plot

I'm trying to generate a CDF plot for some of my data using only ggplot. I would also like to be able to plot an arbitrary number of percentiles as points on top. I have a solution that works for adding a single point to my curve but fails for multiple values.
This works for plotting one percentile value
TestDf <- as.data.frame(rnorm(1000))
names(TestDf) <- c("Values")
percentiles <- c(0.5)
ggplot(data = TestDf, aes(x = Values)) +
stat_ecdf() +
geom_point(aes(x = quantile(TestDf$Values, percentiles),
y = percentiles))
However this fails
TestDf <- as.data.frame(rnorm(1000))
names(TestDf) <- c("Values")
percentiles <- c(0.25,0.5,0.75)
ggplot(data = TestDf, aes(x = Values)) +
stat_ecdf() +
geom_point(aes(x = quantile(TestDf$Values, percentiles),
y = percentiles))
With error
Error: Aesthetics must be either length 1 or the same as the data (1000): x, y
How can I add an arbitrary number of points to a stat_ecdf() plot?
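You need to supply the points as a new data frame through the layer's data argument, rather than computing them inside aes(). Aesthetics in aes() are evaluated against the data frame passed to the original ggplot() call (the one used for the CDF), so they must have length 1 or one value per row of that data frame.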
ggplot(data = TestDf, aes(x = Values)) +
stat_ecdf() +
geom_point(data = data.frame(x=quantile(TestDf$Values, percentiles),
y=percentiles), aes(x=x, y=y))
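Here is a sketch of the same idea with the percentile points built as a named data frame first, which makes it easy to confirm that the point layer carries one row per percentile rather than one per observation. The names `pts` and `p`, and the call to `set.seed()`, are additions for illustration only.

```r
library(ggplot2)

set.seed(1)  # only so the example is reproducible
TestDf <- data.frame(Values = rnorm(1000))
percentiles <- c(0.25, 0.5, 0.75)

# One row per requested percentile: x is the quantile, y is the probability.
pts <- data.frame(
  x = quantile(TestDf$Values, percentiles),
  y = percentiles
)

p <- ggplot(TestDf, aes(x = Values)) +
  stat_ecdf() +
  # This layer gets its own data, so its aesthetics are checked against
  # nrow(pts) (3 rows) instead of nrow(TestDf) (1000 rows).
  geom_point(data = pts, aes(x = x, y = y))

p
```

Because `pts` is an ordinary data frame, changing `percentiles` to any other vector of probabilities changes the number of points without touching the plotting code.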

ggplot2 boxplot medians aren't plotting as expected

So, I have a fairly large dataset (Dropbox: csv file) that I'm trying to plot using geom_boxplot. The following produces what appears to be a reasonable plot:
require(reshape2)
require(ggplot2)
require(scales)
require(grid)
require(gridExtra)
df <- read.csv("\\Downloads\\boxplot.csv", na.strings = "*")
df$year <- factor(df$year, levels = c(2010,2011,2012,2013,2014), labels = c(2010,2011,2012,2013,2014))
d <- ggplot(data = df, aes(x = year, y = value)) +
geom_boxplot(aes(fill = station)) +
facet_grid(station~.) +
scale_y_continuous(limits = c(0, 15)) +
theme(legend.position = "none")
d
However, when you dig a little deeper, problems creep in that freak me out. When I labeled the boxplot medians with their values, the following plot results.
df.m <- aggregate(value~year+station, data = df, FUN = function(x) median(x))
d <- d + geom_text(data = df.m, aes(x = year, y = value, label = value))
d
The medians plotted by geom_boxplot aren't at the medians at all. The labels are plotted at the correct y-axis values, but the middle lines of the boxplots are definitely not at the medians. I've been stumped by this for a few days now.
What is the reason for this? How can this type of display be produced with correct medians? How can this plot be debugged or diagnosed?
The solution to this question is in the application of scale_y_continuous. ggplot2 will perform operations in the following order:
Scale Transformations
Statistical Computations
Coordinate Transformations
In this case, because limits are set on the y scale, ggplot2 drops the data outside those limits before it computes the boxplot statistics, so the hinges and medians are based only on values between 0 and 15. The medians calculated by aggregate() and used in geom_text() are based on the entire dataset. The boxplot medians and the text labels can therefore differ.
The solution is to omit the scale_y_continuous instruction and instead use:
d <- ggplot(data = df, aes(x = year, y = value)) +
geom_boxplot(aes(fill = station)) +
facet_grid(station~.) +
theme(legend.position = "none") +
coord_cartesian(ylim = c(0, 15))
This allows ggplot2 to calculate the boxplot hinge stats using the entire dataset, while limiting the plot size of the figure.
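The difference between the two approaches can be seen on a toy dataset. This is a minimal sketch with made-up data (the data frame `toy` and the object names are hypothetical); it uses `layer_data()` to read back the median that ggplot2 computed for the boxplot layer.

```r
library(ggplot2)

# Toy data: two of the five values lie above the intended axis limit of 15.
toy <- data.frame(g = "a", value = c(1, 2, 3, 20, 30))

base <- ggplot(toy, aes(x = g, y = value)) + geom_boxplot()

# Limits on the scale: out-of-range values are dropped before the stats.
p_scale <- base + scale_y_continuous(limits = c(0, 15))
# Limits on the coordinate system: stats use all data, only the view is zoomed.
p_coord <- base + coord_cartesian(ylim = c(0, 15))

# 'middle' is the median line computed for the boxplot layer.
# suppressWarnings() hides the "removed rows" warning from the scale version.
middle_scale <- suppressWarnings(layer_data(p_scale, 1)$middle)
middle_coord <- layer_data(p_coord, 1)$middle

middle_scale       # median of the values that survive the scale limits
middle_coord       # median of all five values
median(toy$value)  # what aggregate() would report
```

Comparing `middle_scale` with `median(toy$value)` is also a quick way to diagnose this kind of mismatch on real data.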

Violin Plot (geom_violin) with aggregated values

I would like to create violin plots with aggregated data. My data has a category, a value column and a count column:
data <- data.frame(category = rep(LETTERS[1:3],3),
value = c(1,1,1,2,2,2,3,3,3),
count = c(3,2,1,1,2,3,2,1,3))
If I create a simple violin plot it looks like this:
plot <- ggplot(data, aes(x = category, y = value)) + geom_violin()
plot
That is not what I wanted. A solution would be to reshape the dataframe by multiplying the rows of each category-value combination. The problem is that my counts go up to millions which takes hours to be plotted! :-(
Is there a solution with my data?
Thanks in advance!
You can submit a weight when calculating the areas.
plot2 <- ggplot(data, aes(x = category, y = value, weight = count)) + geom_violin()
plot2
You will get warning messages that the weights do not add to one, but that is ok. See here for similar/related discussion.
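Why weighting can stand in for repeating rows can be illustrated with base R's `density()`, on the assumption that the violin outline is a kernel density estimate of this kind. The sketch below uses the counts for category A from the question and fixes the bandwidth by hand (the value 0.5 is arbitrary) so that the two estimates are comparable. It is not a description of ggplot2 internals.

```r
# Aggregated data for one category: each value with its count.
value <- c(1, 2, 3)
count <- c(3, 1, 2)

# Option 1: expand to one row per observation (slow for millions of rows).
expanded <- rep(value, times = count)
d_expanded <- density(expanded, bw = 0.5)

# Option 2: keep the aggregated rows and pass normalized counts as weights.
d_weighted <- density(value, weights = count / sum(count), bw = 0.5)

# Both estimates are evaluated on the same grid and should agree
# up to floating-point error.
all.equal(d_expanded$y, d_weighted$y)
```

The weighted version only ever handles one row per category-value combination, which is why it stays fast when the counts run into the millions.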
Using stat="identity" and specifying a violinwidth aesthetic appears to work, although I had to put in a fudge factor:
ggplot(data, aes(x = category, y = value)) +
geom_violin(stat="identity",aes(violinwidth=0.2*count))
