ggplot with stat_summary for mean along time represented by days - r

I have this data representing the value of a variable Q1 along time.
The time is not represented by dates, it is represented by the number of days since one event.
https://www.mediafire.com/file/yfzbx67yivvvkgv/dat.xlsx/file
I'm trying to plot the mean value of Q1 along time, like in here
Plotting average of multiple variables in time-series using ggplot
I'm using this code
library(Hmisc)
ggplot(dat,aes(x=days,y=Q1,colour=type,group=type)) +
stat_summary(fun.data = "mean_cl_boot", geom = "smooth")

Besides the code, which does not appear to work with the new ggplot2 version, you also have the problem that your data is not really suited for that kind of plot. This code achieves what you wanted to do:
dat <- rio::import("dat.xlsx")
library(ggplot2)
library(dplyr)
dat %>%
ggplot(aes(x = days, y = Q1, colour = type, group = type)) +
geom_smooth(stat = 'summary', fun.data = mean_cl_boot)
But the plot doesn't really tell you anything, simply because there aren't enough values in your data. Most often there seems to be only one value per day, the values jump quickly up and down, and the gaps between days are sometimes quite big.
You can see this when you group the values into timespans instead. Here I used round(days, -2) which will round to the nearest 100 (e.g., 756 is turned into 800, 301 becomes 300, 49 becomes 0):
dat %>%
mutate(days = round(days, -2)) %>%
ggplot(aes(x = days, y = Q1, colour = type, group = type)) +
geom_smooth(stat = 'summary', fun.data = mean_cl_boot)
This should be the same plot as linked but with huge confidence intervals. Which is not surprising since, as mentioned, values quickly alternate between values 1-5. I hope that helps.
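To make the rounding step concrete, here is a minimal base R sketch. The `toy` data frame and its values are made up for illustration and are not taken from dat.xlsx; it only shows how round(x, -2) groups days into bins and how a mean per bin could be computed.

```r
# The three examples from the text: round to the nearest 100
days <- c(756, 301, 49)
binned <- round(days, -2)
binned  # 800 300 0

# Hypothetical data (not from dat.xlsx): a few Q1 scores at irregular days
toy <- data.frame(days = c(12, 49, 301, 340, 756, 790),
                  Q1   = c(1, 5, 2, 4, 3, 5))

# Bin the days, then average Q1 within each bin
toy$bin <- round(toy$days, -2)
bin_means <- tapply(toy$Q1, toy$bin, mean)
bin_means
```

With more observations per bin, the bootstrap confidence interval from mean_cl_boot has something to work with, which is the point of grouping the days.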

Related

R, ggplot, How do I keep related points together when using jitter?

One of the variables in my data frame is a factor denoting whether an amount was gained or spent. Every event has a "gain" value; there may or may not be a corresponding "spend" amount. Here is an image with the observations overplotted:
Adding some random jitter helps visually, however, the "spend" amounts are divorced from their corresponding gain events:
I'd like to see the blue circles "bullseyed" in their gain circles (where the "id" are equal), and jittered as a pair. Here are some sample data (three days) and code:
library(ggplot2)
ccode<-c(Gain="darkseagreen",Spend="darkblue")
ef<-data.frame(
date=as.Date(c("2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03")),
site=c("Castle","Temple","Temple","Temple","Temple","Temple","Palace","Palace","Castle","Castle","Castle","Temple","Temple","Palace","Palace","Castle","Castle","Castle","Castle","Castle","Temple","Temple","Palace","Castle","Temple","Temple","Temple","Temple","Temple","Palace","Palace","Castle","Castle","Castle","Temple","Temple","Palace","Palace","Castle","Castle","Castle","Castle","Castle","Temple","Temple","Palace"),
id=c("C123","T101","T93","T94","T95","T96","P102","P96","C126","C127","C128","T100","T98","P100","P98","C129","C130","C131","C132","C133","T104","T99","P99","C123","T101","T93","T94","T95","T96","P102","P96","C126","C127","C128","T100","T98","P100","P98","C129","C130","C131","C132","C133","T104","T99","P99"),
gainspend=c("Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend"),
amount=c(6,14,34,31,3,10,6,14,2,16,16,14,1,1,15,11,8,7,2,10,15,4,3,NA,NA,4,5,NA,NA,NA,NA,NA,NA,2,NA,1,NA,3,NA,NA,2,NA,NA,2,NA,3))
#▼ 3 day, points centered
ggplot(ef,aes(date,site)) +
geom_point(aes(size=amount,color=gainspend),alpha=0.5) +
scale_color_manual(values=ccode) +
scale_size_continuous(range=c(1,15),breaks=c(5,10,20))
#▼ 3 day, jittered
ggplot(ef,aes(date,site)) +
geom_point(aes(size=amount,color=gainspend),alpha=0.5,position=position_jitter(w=0,h=0.2)) +
scale_color_manual(values=ccode) +
scale_size_continuous(range=c(1,15),breaks=c(5,10,20))
My main idea is the old "add jitter manually" approach. I'm wondering if a nicer approach could be something like plotting little pie charts as points a la package scatterpie.
In this case you could add a random number for the amount of jitter to each ID so points within groups will be moved the same amount. This takes doing work outside of ggplot2.
First, draw the "jitter" to add for each ID. Since a categorical axis is 1 unit wide, I choose numbers between -.3 and .3. I use dplyr for this work and set the seed so you will get the same results.
library(dplyr)
set.seed(16)
ef2 = ef %>%
group_by(id) %>%
mutate(jitter = runif(1, min = -.3, max = .3)) %>%
ungroup()
Then the plot. I use a geom_blank() layer so that the categorical site axis is drawn before I add the jitter. I convert site to be numeric from a factor and add the jitter on; this only works for factors so luckily categorical axes in ggplot2 are based on factors.
Now paired IDs move together.
ggplot(ef2, aes(x = date, y = site)) +
geom_blank() +
geom_point(aes(size = amount, color = gainspend,
y = as.numeric(factor(site)) + jitter),
alpha=0.5) +
scale_color_manual(values = ccode) +
scale_size_continuous(range = c(1, 15), breaks = c(5, 10, 20))
#> Warning: Removed 15 rows containing missing values (geom_point).
Created on 2021-09-23 by the reprex package (v2.0.0)
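As a quick sanity check of the "one jitter value per ID" idea, here is a small base R sketch. The `toy` data frame is a hypothetical four-row stand-in for ef, and `offsets` is a name introduced only for this example; it is not part of the answer above.

```r
set.seed(16)

# Hypothetical stand-in for ef: two ids, each with a Gain and a Spend row
toy <- data.frame(id        = c("C123", "T101", "C123", "T101"),
                  gainspend = c("Gain", "Gain", "Spend", "Spend"))

# Draw one offset per unique id, then look it up for every row
ids <- unique(toy$id)
offsets <- setNames(runif(length(ids), min = -0.3, max = 0.3), ids)
toy$jitter <- unname(offsets[toy$id])

# Every id should carry exactly one distinct jitter value
n_per_id <- tapply(toy$jitter, toy$id, function(v) length(unique(v)))
n_per_id
```

If any id showed more than one distinct value here, its Gain and Spend points would be drawn at different heights.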
You can add some jitter by id outside the ggplot() call.
# draw exactly one jitter value per unique id
ids <- unique(ef$id)
jj <- data.frame(id = ids, jtr = runif(length(ids), -0.3, 0.3))
ef <- merge(ef, jj, by = 'id')
ef$sitej <- as.numeric(factor(ef$site)) + ef$jtr
But you need to make site integer/numeric to do this. So when it comes to making the plot, you need to manually add axis labels with scale_y_continuous(). (Update: the geom_blank() trick from aosmith above is a better solution!)
ggplot(ef,aes(date,sitej)) +
geom_point(aes(size=amount,color=gainspend),alpha=0.5) +
scale_color_manual(values=ccode) +
scale_size_continuous(range=c(1,15),breaks=c(5,10,20)) +
scale_y_continuous(breaks = 1:3, labels= sort(unique(ef$site)))
This seems to work, but there are still a few gain/spend circles without a partner--perhaps there is a problem with the id variable.
Perhaps someone else has a better approach!

Aside from binning in geom_histogram, I learned another way you can categorize continuous variable on the x-axis but cannot find online

I recall learning online about (I believe) three options for binning a continuous variable into discrete groups, but I cannot locate them anymore. Basically, I have an x scale of 1 through 60 (seconds), but because there are so many distinct values, my sample size is small for each one. I'd like to bin it into six groups of ten (1-9 seconds, 10-19 seconds, etc.) so that more samples go into each average (the y column).
I put some code below to show my basic starting point.
ggplot(data, aes(Seconds, Percentage))+
geom_histogram()+
scale_x_continuous(breaks = 1:60)
One approach would be to specify geom_histogram(binwidth = 10). But on its own this doesn't give you much control over where the bins fall, so they won't necessarily be aligned to 1-10, 11-20, etc. (the boundary and center arguments of geom_histogram() let you shift them).
set.seed(0)
data = data.frame(Seconds = rnorm(1000, mean = 30, sd = 9))
range(data$Seconds)
ggplot(data, aes(Seconds))+
geom_histogram(binwidth = 10) +
scale_x_continuous(breaks = 1:60)
Another option is to do it yourself, and count how many observations in each bin. floor(your_var/binsize)*binsize is a nice way to get bins like you describe.
library(dplyr)
binsize = 10
data %>%
count(bin = floor(Seconds/binsize)*binsize) %>%
ggplot(aes(bin + binsize/2, n)) + geom_col()
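Since the question asks for an average of the y column per bin rather than a count, the same floor() trick can be sketched in base R as follows. The `toy` data and its Percentage values are invented for illustration.

```r
binsize <- 10

# Hypothetical data: one Percentage value per observed Seconds value
toy <- data.frame(Seconds    = c(3, 9, 10, 19, 27, 60),
                  Percentage = c(10, 20, 30, 50, 40, 70))

# Lower edge of the bin each observation falls into
toy$bin <- floor(toy$Seconds / binsize) * binsize
toy$bin  # 0 0 10 10 20 60

# Mean Percentage within each bin
bin_means <- tapply(toy$Percentage, toy$bin, mean)
bin_means
```

Note that with this definition a value of exactly 60 lands in its own bin starting at 60, so the bin edges may need adjusting depending on how the endpoints should be treated.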

How to shade under part of a line from a dataset

I have a simple plot of some data from an experiment.
plot(x=sample95$PositionA, y=sample95$AbsA, xlab=expression(position (mm)), ylab=expression(A[260]), type='l')
I would like to shade a particular area under the line, let's say from 35-45mm. From what I've searched so far, I think I need to use the polygon function, but I'm unsure how to assign vertices from a big dataset like this. Every example I've seen so far uses a normal curve.
Any help is appreciated, I am very new to R/RStudio!
Here is a solution using tidyverse tools including ggplot2. I use the built in airquality dataset as an example.
This first part is just to put the data in a format that we can plot by combining the month and the day into a single date. You can just substitute date for PositionA in your data.
library(tidyverse)
df <- airquality %>%
as_tibble() %>%
magrittr::set_colnames(str_to_lower(colnames(.))) %>%
mutate(date = as.Date(str_c("1973-", month, "-", day)))
This is the plot code. In ggplot2, we start with the function ggplot() and add geom functions to it with + to create the plot in layers.
The first function, geom_line, joins up all observations in the order that they appear based on the x variable, so it makes the line that we see. Each geom needs a particular mapping to an aesthetic, so here we want date on the x axis and temp on the y axis, so we write aes(x = date, y = temp).
The second function, geom_ribbon, is designed to plot bands at particular x values between a ymax and a ymin. This lets us shade the area underneath the line by choosing a constant ymin = 55 (a value lower than the minimum temperature) and setting ymax = temp.
We shade a specific part of the chart by specifying the data argument. Normally geom functions act on the dataset inherited from ggplot(), but you can override this by passing a data argument to an individual geom. Here we use filter so that geom_ribbon only plots the points where the date is in June.
ggplot(df) +
geom_line(aes(x = date, y = temp)) +
geom_ribbon(
data = filter(df, date >= as.Date("1973-06-01") & date < as.Date("1973-07-01")),
mapping = aes(x = date, ymax = temp, ymin = 55)
)
This gives the chart below:
Created on 2018-02-20 by the reprex package (v0.2.0).
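For the base R route mentioned in the question, a rough polygon() sketch might look like this. The vectors `pos` and `abs260` are made-up stand-ins for sample95$PositionA and sample95$AbsA, and shading down to a baseline of 0 is an assumption; the real data may call for a different baseline.

```r
# Hypothetical stand-in for the experimental data (not the real sample95)
pos    <- seq(0, 80, by = 0.5)          # position in mm
abs260 <- exp(-((pos - 40) / 10)^2)     # a smooth made-up absorbance trace

# Keep only the points between 35 and 45 mm
in_range <- pos >= 35 & pos <= 45

# Polygon vertices: drop to the baseline at the left edge, follow the
# curve, then drop back to the baseline at the right edge
poly_x <- c(pos[in_range][1], pos[in_range], tail(pos[in_range], 1))
poly_y <- c(0, abs260[in_range], 0)

plot(pos, abs260, type = "l",
     xlab = "position (mm)", ylab = expression(A[260]))
polygon(poly_x, poly_y, col = "grey80", border = NA)
lines(pos, abs260)  # redraw the line on top of the shading
```

The vertices come straight from the data, so no analytical curve is needed; the only extra points are the two baseline corners.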

ggplot2 density-plot with discrete data

I want to create a density plot with the following data:
interval        fr    mi     ab
0x            9765  3631  12985
1x            2125  2656    601
2x            1299  2493    191
3x             493  2234     78
4x             141  1559     20
5x and more     75  1325     23
On the X-Axis I want to have the Intervals and on the Y-Axis I want to have the density of "fr", "mi" and "ab" in different colors.
My imagination was something like this graph.
My problem is that I don't know how to get the density on the Y-Axis. I tried it with geom_density, but it didn't work. The best result I accomplished was using the following code:
DS29 <-as.data.frame(DS29)
DS29$interval <- factor(DS29$interval, levels = DS29$interval)
DS29 <- melt (DS29,id=c("interval"))
output$DS51<- renderPlot({
plot_tab6 <- ggplot(DS29, aes(x= interval,y = value, fill=variable, group = variable)) +
geom_col()+
geom_line()
return(plot_tab6)
})
This gives me the following plot, which is not the result I want to have. Do you have an idea how I could get to my wanted result? Thank you very much.
Seeing your sample data, I am not sure geom_density is what you want. If you type ?geom_density, you will see some example code. Taking one example from the help page may show what is missing.
ggplot(diamonds, aes(depth, fill = cut, colour = cut)) +
geom_density(alpha = 0.1) +
xlim(55, 70)
For the x-axis, depth is a continuous variable, not a categorical variable. Your current data has a categorical variable on the x-axis. With geom_density, you are estimating the density of something at each value along the x-axis. The example code above shows that diamonds classified as "Ideal" have high density around 61.5-62, suggesting that the largest proportion of "Ideal" diamonds have a depth value around 61.5-62. Indeed, the mean depth of "Ideal" diamonds is 61.71. This means that you need multiple data points to calculate a density. Your data has only one data point per interval for each group (e.g., ab, fr, mi). So, I do not think your data is ready for calculating density.
If you want to draw a graphic similar to what you suggested in your question using the current data, I think you need to 1) convert interval to a numeric variable, 2) transform the data into long format, and 3) use stat_smooth.
library(tidyverse)
mydf %>%
mutate(interval = as.numeric(sub(x = as.character(interval), pattern = "x.*", replacement = ""))) %>%  # "x.*" also strips " and more" from the last level
gather(key = group, value = value, - interval) -> temp
ggplot(temp, aes(x = interval, y = value, fill = group)) +
stat_smooth(geom = "area", span = 0.4, method = "loess", alpha = 0.4)
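The code above assumes a data frame called mydf. A minimal base R sketch of how it could be built from the table in the question, with the interval converted to a number, is shown below. The `props` step (each count divided by its column total) is only one possible reading of "density" for this kind of data and is an assumption, not something the question states.

```r
# Data copied from the table in the question
mydf <- data.frame(
  interval = c("0x", "1x", "2x", "3x", "4x", "5x and more"),
  fr = c(9765, 2125, 1299, 493, 141, 75),
  mi = c(3631, 2656, 2493, 2234, 1559, 1325),
  ab = c(12985, 601, 191, 78, 20, 23)
)

# Strip everything from the "x" onwards so "5x and more" becomes 5
interval_num <- as.numeric(sub("x.*", "", mydf$interval))
interval_num  # 0 1 2 3 4 5

# Assumption: express each group's counts as a share of that group's total
props <- as.data.frame(lapply(mydf[c("fr", "mi", "ab")],
                              function(v) v / sum(v)))
props
```

Plotting shares instead of raw counts puts the three groups on a comparable scale, which may be closer to what a density plot would convey.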

Add line for mean, mean + sd, and mean - sd to multiple factor scatterplot in R

I have data of the form
cvar1  cvar2  numvar
a      x      0.1
a      y      0.2
b      x      0.15
b      y      0.25
That is, two categorical variables, and one numerical variable.
Using ggplot2, I can get a nice scatter plot that plots the data for each combination of cvar1 and cvar2 by doing qplot(y=numvar, x=interaction(cvar1, cvar2)). This gives me several columns of points like this:
To each of these columns I would like to add a small horizontal line representing the mean of the data points in that column. And a similar small horizontal line for the mean + sd and the mean - sd. (Kind of a bastardized box plot, but with all points visible and using mean and sd rather than median and IQR.) Thanks in advance!
You can create a table that contains the mean and mean +/- sd for each group of points. Then you can plot lines using geom_segment().
First, I create some sample data:
set.seed(1245)
data <- data.frame(cvar1 = rep(letters[1:2], each = 12),
cvar2 = rep(letters[25:26], times = 12),
numvar = runif(2*12))
This creates the table with the values that you need using dplyr and tidyr:
library(dplyr)
library(tidyr)
summ <- group_by(data, cvar1, cvar2) %>%
summarise(mean = mean(numvar),
low = mean - sd(numvar),
high = mean + sd(numvar)) %>%
gather(variable, value, mean:high)
The three steps do the following: first, the data is split into groups; then, for each group, the three required values are calculated; finally, the data is converted to long format, which is what ggplot() needs. (Maybe you are more familiar with melt(), which does basically the same thing as gather().)
And finally, this creates the plot:
library(ggplot2)
ggplot(data) + geom_point(aes(x = interaction(cvar1, cvar2), y = numvar)) +
geom_segment(data = summ,
aes(x = as.numeric(interaction(cvar1, cvar2)) - .5,
xend = as.numeric(interaction(cvar1, cvar2)) + .5,
y = value, yend = value, colour = variable))
You probably won't want the colours. I just added them to make the example more clear.
geom_segment() needs the start and end coordinates of each line to be specified. Because interaction(cvar1, cvar2) is a factor, it needs to be converted to numeric before it is possible to do arithmetic with it. I added and subtracted 0.5 from the numeric position, which makes the lines quite wide. Choosing a smaller value will make the lines shorter.
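The factor-to-number conversion can be checked on its own with a few lines of base R. The vectors and the `half_width` value below are illustrative only.

```r
# One row per group, in the same spirit as the summary table
f <- interaction(c("a", "b", "a", "b"), c("y", "y", "z", "z"))

levels(f)            # "a.y" "b.y" "a.z" "b.z"
pos <- as.numeric(f) # integer codes of the levels: 1 2 3 4

# Start and end of a short horizontal segment centred on each group
half_width <- 0.2    # illustrative; the answer above uses 0.5
segs <- data.frame(group = f,
                   x     = pos - half_width,
                   xend  = pos + half_width)
segs
```

The integer codes follow the order of the factor levels, which is also the order in which ggplot2 places the groups along a discrete axis, so the segments line up with the columns of points.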
