Getting an error when using gather to create clustered bar chart - r

I am trying to create a clustered bar chart and am trying to use the gather function to get the correct bars to group together. When I do this, I get the error that the gather function could not be found. I have dplyr and magrittr installed. Any thoughts on how to make this work or if there is a better way to create the bar chart grouped by test and date?
Plaster <- Plaster_2019_Data %>%
gather("pH", "Temperature", "Surface", -Date)
Data:
Surface pH Temperature Date
12.08 8.56 11.16 5/13/2019
11.68 8.90 8.76 5/29/2019
8.69 9.07 14.65 6/10/2019
2.26 7.49 17.51 6/24/2019
4.54 7.77 23.82 7/8/2019
2.13 8.17 25.29 8/5/2019
6.34 8.62 26.50 8/19/2019
9.33 9.03 24.31 9/4/2019
10.98 8.58 21.02 9/16/2019
9.59 8.61 17.33 9/30/2019
16.07 8.70 10.39 10/14/2019
9.12 8.07 6.38 11/14/2019

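gather is exported by tidyr, not by dplyr or magrittr, so tidyr has to be loaded. Note that require only loads a package; it does not install it (run install.packages("tidyr") first if needed).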
require('tidyr')
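Loading tidyr may not be enough on its own: the call in the question appears to pass "pH" and "Temperature" as the names of the new key and value columns rather than as columns to gather. A minimal sketch of what was probably intended, using the first three rows of the question's data as a stand-in and the hypothetical column names Test and value:

```r
library(tidyr)

# Small stand-in for Plaster_2019_Data (first three rows of the question's data)
Plaster_2019_Data <- data.frame(
  Surface     = c(12.08, 11.68, 8.69),
  pH          = c(8.56, 8.90, 9.07),
  Temperature = c(11.16, 8.76, 14.65),
  Date        = c("5/13/2019", "5/29/2019", "6/10/2019")
)

# key and value name the two new columns ("Test" and "value" are assumed names);
# -Date gathers every column except Date, which stays as an identifier
Plaster <- gather(Plaster_2019_Data, key = "Test", value = "value", -Date)
Plaster
```

Each Date now appears once per test, which is the long layout ggplot2 needs for grouped bars.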

As mentioned by @akrun, you need tidyr. Moreover, gather has been superseded by pivot_longer, which is the recommended function going forward (https://tidyr.tidyverse.org/reference/gather.html).
Since the range of values differs considerably between your tests, I would suggest using facet_wrap with free scales so that each test gets a readable panel.
Altogether, you can write something like this:
library(tidyr)
library(ggplot2)
library(dplyr)
# Parse the dates so that scale_x_date can be used
df$Date <- as.Date(df$Date, format = "%m/%d/%Y")
df %>%
  pivot_longer(-Date, names_to = "Test", values_to = "value") %>%
  ggplot(aes(x = Date, y = value, fill = Test)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  facet_wrap(. ~ Test, scales = "free") +
  scale_x_date(date_labels = "%b %d", date_breaks = "2 weeks") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Data
structure(list(Surface = c(12.08, 11.68, 8.69, 2.26, 4.54, 2.13,
6.34, 9.33, 10.98, 9.59, 16.07, 9.12), pH = c(8.56, 8.9, 9.07,
7.49, 7.77, 8.17, 8.62, 9.03, 8.58, 8.61, 8.7, 8.07), Temperature = c(11.16,
8.76, 14.65, 17.51, 23.82, 25.29, 26.5, 24.31, 21.02, 17.33,
10.39, 6.38), Date = structure(c(18029, 18045, 18057, 18071,
18085, 18113, 18127, 18143, 18155, 18169, 18183, 18214), class = "Date")), row.names = c(NA,
-12L), class = "data.frame")

Related

Seasonal plots for multiple variables using forecast::ggseasonplot

I'm trying to draw a seasonal plots for dataframe df, which contains time series data for 2 variables (value1 and value2):
df <- structure(list(date = structure(c(18292, 18321, 18352, 18382,
18413, 18443, 18474, 18505, 18535, 18566, 18596, 18627, 18658,
18686, 18717, 18747, 18778, 18808, 18839, 18870, 18900, 18931,
18961, 18992), class = "Date"), value1 = c(-2.94, -40.61, -6.89,
3.04, -3.5, 0.18, 6.79, 9.08, 9.35, 10.92, 20.53, 18.04, 24.6,
154.6, 30.4, 32.1, 27.7, 32.1, 19.2, 25.4, 28, 26.9, 21.7, 20.9
), value2 = c(-12.66, 7.56, -1.36, -14.39, -16.18, 3.29, -0.69,
-1.6, 13.47, 4.83, 4.56, 7.58, 28.7, 18.9, 39.1, 44, 52, 37.1,
28.2, 32.7, 17.2, 20.4, 31.4, 19.5)), class = "data.frame", row.names = c(NA,
-24L))
We could draw two time series in one plot using:
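library(reshape2) # melt(); assumed source, data.table also provides a melt()
library(ggplot2)
library(dplyr)
meltdf <- melt(df, id = 'date')
meltdf %>%
  ggplot(aes(x = date, y = value, colour = variable, group = variable)) +
  geom_point() +
  geom_line()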
But I would like to use ggseasonplot to draw two separate plots for value1 and value2, each similar to the following plot:
library(forecast)
ggseasonplot(AirPassengers, col=rainbow(12), year.labels=TRUE)
The problem I ran into is how to convert each subset of the data frame to a ts object:
meltdf %>%
filter(variable=='value1') %>%
as.ts() %>%
ggseasonplot(col=rainbow(12), year.labels=TRUE)
Thanks.
Update: implement with ggplot2 only:
library(lubridate) # year() and the update() method for dates
meltdf %>%
  filter(variable == 'value2') %>%
  select(-variable) %>%
  mutate(
    year = factor(year(date)),     # use year to define separate curves
    date = update(date, year = 1)  # use a constant year for the x-axis
  ) %>%
  ggplot(aes(date, value, color = year)) +
  scale_x_date(date_breaks = "1 month", date_labels = "%b") +
  geom_line() +
  geom_point()
This is what the feasts package is for -- handling time series graphics with multiple series in the same data frame. Here is how to do it with the sample data provided.
library(tsibble)
library(feasts)
library(tidyr)
library(dplyr)
# Convert to tsibble object and plot using gg_season()
df %>%
pivot_longer(value1:value2) %>%
mutate(date = yearmonth(date)) %>%
as_tsibble(index = date, key = name) %>%
gg_season(value)
Created on 2022-02-14 by the reprex package (v2.0.1)
See https://otexts.com/fpp3/seasonal-plots.html for more examples.
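For the original question of how to get a ts object that ggseasonplot accepts, a minimal sketch with base R's ts() is given below. It assumes the series is monthly and starts in January 2020 (the first date in df, 18292 days after 1970-01-01, is 2020-01-31); the names first_date and ts1 are made up for illustration, and the plotting step only runs if forecast is installed.

```r
# value1 column from the question's df
value1 <- c(-2.94, -40.61, -6.89, 3.04, -3.5, 0.18, 6.79, 9.08, 9.35, 10.92,
            20.53, 18.04, 24.6, 154.6, 30.4, 32.1, 27.7, 32.1, 19.2, 25.4,
            28, 26.9, 21.7, 20.9)

# First entry of df$date; used to derive the start year and month
first_date <- as.Date("2020-01-31")

# Monthly series: frequency = 12, start = c(year, month)
ts1 <- ts(
  value1,
  start = c(as.integer(format(first_date, "%Y")),
            as.integer(format(first_date, "%m"))),
  frequency = 12
)

# Seasonal plot, one line per year (requires the forecast package)
if (requireNamespace("forecast", quietly = TRUE)) {
  forecast::ggseasonplot(ts1, col = rainbow(2), year.labels = TRUE)
}
```

The same steps applied to value2 would give the second plot. The pipe in the question fails because as.ts() receives a whole data frame with no frequency or start information.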

How to customize colors for lines and points in feasts::gg_season()

I'm able to convert the following df to tsibble object and plot using gg_season():
library(tsibble)
library(feasts)
library(tidyr)
library(dplyr)
df <- structure(list(date = structure(c(18292, 18321, 18352, 18382,
18413, 18443, 18474, 18505, 18535, 18566, 18596, 18627, 18658,
18686, 18717, 18747, 18778, 18808, 18839, 18870, 18900, 18931,
18961, 18992), class = "Date"), value1 = c(-2.94, -40.61, -6.89,
3.04, -3.5, 0.18, 6.79, 9.08, 9.35, 10.92, 20.53, 18.04, 24.6,
154.6, 30.4, 32.1, 27.7, 32.1, 19.2, 25.4, 28, 26.9, 21.7, 20.9
), value2 = c(-12.66, 7.56, -1.36, -14.39, -16.18, 3.29, -0.69,
-1.6, 13.47, 4.83, 4.56, 7.58, 28.7, 18.9, 39.1, 44, 52, 37.1,
28.2, 32.7, 17.2, 20.4, 31.4, 19.5)), class = "data.frame", row.names = c(NA,
-24L))
# Convert to tsibble object and plot using gg_season()
df %>%
pivot_longer(value1:value2) %>%
mutate(date = yearmonth(date)) %>%
mutate(year = year(date)) %>%
as_tsibble(index = date, key = name) %>%
gg_season(value) +
geom_point() # +
# scale_color_manual(values = c('2020' = 'blue', '2021' = 'red'))
Now I am trying to customize the colors by year, i.e., blue for 2020 and red for 2021. I added scale_color_manual(values = c('2020' = 'blue', '2021' = 'red')), but it did not work. How can I do that correctly? Thanks.
Reference:
how to change the color in geom_point or lines in ggplot
...
gg_season(value, pal = c("#3333FF", "#FF3333")) +
geom_point()
The year scale here is a continuous one (explaining why the scale_color_manual line produces "Error: Continuous value supplied to discrete scale"). But we can give gg_season a vector of color codes to use in its pal parameter.

Can I create scatterplots with "paired circles" in R using ggplot2

First, to clarify the title: I am trying to create a single scatterplot. The nature of my data is such that there are two of each observation, and I would like each pair of observations to be "connected" in the scatterplot by a line or arrow between the two points.
To help with the question, here's a short dataset:
structure(list(evToRevJun15 = c(4.56, 1.35, 1.26, 5.99, 2.79,
6.97, 4.9, 2.28, 1.26, 4.83, 2, 2.36, 4.91, 2.31, 2.47), evToGiJun15 = c(21.71,
5, 4.85, 23.04, 21.46, 34.85, 44.53, 12.67, 9.69, 21.96, 11.76,
19.67, 11.69, 6.42, 5.74), evToRevDec18 = c(1.99, 5.92, 2.13,
6.6, 5.84, 4.32, 6.38, 6.77, 4.92, 2.67, 4.48, 6.69, 1.36, 3.79,
2.41), evToGiDec18 = c(7.37, 24.67, 7.89, 34.74, 19.47, 15.43,
33.58, 39.84, 28.94, 11.61, 17.23, 44.6, 7.56, 8.24, 5.74)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -15L))
> head(zed)
# A tibble: 6 x 4
evToRevJun15 evToGiJun15 evToRevDec18 evToGiDec18
<dbl> <dbl> <dbl> <dbl>
1 4.56 21.7 1.99 7.37
2 1.35 5 5.92 24.7
3 1.26 4.85 2.13 7.89
4 5.99 23.0 6.6 34.7
5 2.79 21.5 5.84 19.5
6 6.97 34.8 4.32 15.4
The two evToRev columns are for the X-axis, and the two evToGi columns are for the Y-axis, and therefore each row in the dataframe constitutes two points in the graph.
Here is an example that sort of highlights what I'm going for, but not exactly. Imagine this graph, but instead of 5 points for Messi, there would be 2 points for Messi, 2 for Angel di Maria, 2 for Neymar, etc.
Any thoughts or help on this would be great! Please let me know if I can add additional clarification.
Edit: The 2nd and 3rd graphs in this article are a better example of what I'm going for.
The first step in achieving this is reshaping the data into a format that works better with ggplot - once you've done that, the actual plotting code is pretty simple:
library(tidyverse)
df_long = df %>%
  # Need an id that will keep observations together
  # once they've been split into separate rows
  mutate(id = 1:n()) %>%
  gather(key = "key", value = "value", -id) %>%
  mutate(Time = str_sub(key, nchar(key) - 4),
         Type = str_remove(key, Time)) %>%
  select(-key) %>%
  # In this case we don't want the data entirely
  # 'long' since evToRev and evToGi will be
  # mapped separately to x and y
  spread(Type, value)
df_long %>%
  ggplot(aes(x = evToRev, y = evToGi, colour = Time)) +
  # group aesthetic controls which points are connected
  geom_line(aes(group = id), colour = "grey40") +
  geom_point(size = 3) +
  theme_bw()
Result:
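The reshaping can probably be done more neatly with tidyr::pivot_longer().
When this answer was written that function was only available in the development version of tidyr (it has been in the released package since tidyr 1.0.0), so I've used gather and spread.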
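With tidyr 1.0.0 or later, a pivot_longer() version of the reshaping might look like the sketch below. It uses the first two rows of the question's data as a stand-in, and the regular expression in names_pattern is an assumption based on the four column names shown there.

```r
library(dplyr)
library(tidyr)

# Stand-in for the question's data (first two rows only)
zed <- data.frame(
  evToRevJun15 = c(4.56, 1.35),
  evToGiJun15  = c(21.71, 5),
  evToRevDec18 = c(1.99, 5.92),
  evToGiDec18  = c(7.37, 24.67)
)

df_long <- zed %>%
  # id keeps the two points of each original row together
  mutate(id = row_number()) %>%
  # ".value" turns the first regex group (evToRev / evToGi) into columns;
  # the second group (Jun15 / Dec18) becomes the Time column
  pivot_longer(
    -id,
    names_to = c(".value", "Time"),
    names_pattern = "(evToRev|evToGi)(.*)"
  )
df_long
```

This yields one row per id and Time with separate evToRev and evToGi columns, which is the layout the plotting code above expects.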

Flow duration curve (fdc) extract low threshold

I am a newbie working with streamflow duration curves and the function fdc.
I am working with more than 300 series and I am interested in saving the low quartile threshold Qlow.thr value that appears in the plot generated:
Here is the reproducible example:
dat <- c(13.05, 90.29, 5.68, 49.13, 26.39, 15.06, 23.39, 17.98, 4.21, 2.51, 38.29, 8.57, 2.48 , 3.78, 18.09 ,15.16, 13.46, 8.69, 6.85, 11.97, 12.10, 9.87 ,21.89, 2.60 ,2.40, 27.40, 4.94, 83.17 ,12.10, 5.08 ,12.42, 6.19 ,3.60 ,32.58, 53.69, 38.49,3.61, 14.84, 34.48, 1.91, 21.79, 31.53, 6.70, 9.52, 22.64, 1.80 , 8.13, 10.60, 12.73, 4.17, 6.70 ,16.45)
fdc(dat,plot = T,lQ.thr=0.8,ylab='Hm3',main='Upstream monthly duration curve',thr.shw=TRUE)
The fdc function returns a vector of probabilities, but I am not sure how to convert these probabilities to the original units and select the 80% percentile value expressed in Hm3 as I would do with pnorm, for example, in case of working with normal probabilities.
Thank you so much.
You can construct the FDC yourself by
dat <- c(13.05, 90.29, 5.68, 49.13, 26.39, 15.06, 23.39, 17.98,
4.21, 2.51, 38.29, 8.57, 2.48 , 3.78, 18.09 ,15.16,
13.46, 8.69, 6.85, 11.97, 12.10, 9.87 ,21.89, 2.60,
2.40, 27.40, 4.94, 83.17 ,12.10, 5.08 ,12.42, 6.19,
3.60 ,32.58, 53.69, 38.49,3.61, 14.84, 34.48, 1.91,
21.79, 31.53, 6.70, 9.52, 22.64, 1.80 , 8.13, 10.60,
12.73, 4.17, 6.70 ,16.45)
dat <- sort(dat, decreasing = T)
df <- data.frame(x = 100/length(dat) * 1:length(dat), y = dat)
plot(x = df$x, y = df$y, type = "l", log = "y")
So the sorted flow data is simply plotted against the percentage exceedance scale. This scale is created by dividing 100% by the number of data points which gives us the increment for each point.
Therefore
quantile(dat, probs = c(0.2, 0.8), type = 1)
gives you your desired results.
Notice that the computation of the quantile differs in fdc. It seems like they just use
p <- c(0.8, 0.2)
dat[round(p * length(dat))]
> [1] 4.21 27.40
to compute the values.
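To compare the two approaches side by side without calling fdc, the following sketch computes the thresholds both ways. The index rule is the one guessed at above, not something taken from the fdc source, and the variable names are made up for illustration.

```r
dat <- c(13.05, 90.29, 5.68, 49.13, 26.39, 15.06, 23.39, 17.98,
         4.21, 2.51, 38.29, 8.57, 2.48, 3.78, 18.09, 15.16,
         13.46, 8.69, 6.85, 11.97, 12.10, 9.87, 21.89, 2.60,
         2.40, 27.40, 4.94, 83.17, 12.10, 5.08, 12.42, 6.19,
         3.60, 32.58, 53.69, 38.49, 3.61, 14.84, 34.48, 1.91,
         21.79, 31.53, 6.70, 9.52, 22.64, 1.80, 8.13, 10.60,
         12.73, 4.17, 6.70, 16.45)

flows <- sort(dat, decreasing = TRUE)  # largest flow first
p <- c(0.8, 0.2)                       # exceedance probabilities (low, high)

# Rule that fdc seems to use: pick the flow at the rounded position
thr_fdc_style <- flows[round(p * length(flows))]

# Quantile-based alternative: exceedance p corresponds to quantile 1 - p
thr_quantile <- quantile(dat, probs = 1 - p, type = 1, names = FALSE)

thr_fdc_style  # the source reports 4.21 and 27.40 for this rule
thr_quantile
```

The low-flow threshold is the first element of either vector. For a batch of several hundred series, the same few lines could be wrapped in a function and applied with sapply.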

Rolling Regression Data Frame

Appreciate this may have been asked before but I have not found a clear solution to work over a data frame.
I want to run a rolling linear regression over a lookback window of 5 days (kept small so that I can illustrate it here).
So far I am trying:
rollingbeta <- rollapply(df,
width=5,
FUN = function(Z)
{
t = lm(formula=y_Close ~ x_Close+0, data = as.data.frame(Z));
return(t$coef)[1]
},
by.column=FALSE, align="right",fill = NA)
head(rollingbeta,100)
However, I expect to get a single beta for each rolling lookback window. Instead I get an output with 10 columns.
> NCOL(rollingbeta)
[1] 10
Can anyone assist?
Here is dummy data (save to .txt and read)
df <- read.table("your_dir/df.txt",header=TRUE, sep="", stringsAsFactors=FALSE)
Date open.x high.x low.x x_Close volume.x open.y high.y low.y y_Close volume.y x.y.cor
1451 2010-01-04 57.32 58.13 57.32 57.85 442900 6.61 6.8400 6.61 6.83 833100 NA
1452 2010-01-05 57.90 58.33 57.54 58.20 436900 6.82 7.1200 6.80 7.12 904500 NA
1453 2010-01-06 58.20 58.56 58.01 58.42 850600 7.05 7.3800 7.05 7.27 759800 NA
1454 2010-01-07 58.31 58.41 57.14 57.90 463600 7.24 7.3000 7.06 7.11 557800 NA
1455 2010-01-08 57.45 58.62 57.45 58.47 206500 7.08 7.3500 6.95 7.29 588100 NA
1456 2010-01-11 58.79 59.00 57.22 57.73 331900 7.38 7.4500 7.17 7.22 450500 NA
1457 2010-01-12 57.20 57.21 56.15 56.34 428500 7.15 7.1900 6.87 7.00 694700 NA
1458 2010-01-13 56.32 56.66 54.83 56.56 577500 7.05 7.1700 6.98 7.15 528800 NA
1459 2010-01-14 56.51 57.05 55.37 55.53 368100 7.08 7.1701 7.08 7.11 279900 NA
1460 2010-01-15 56.59 56.59 55.19 55.84 417900 7.03 7.0500 6.95 7.03 407600 NA
The output of the first rolling linear regression should be:
NA NA NA NA NA 0.1229065
A faster alternative than wibeasley's answer is to use the rollRegres package as follows
ds <- structure(list(Date = structure(
c(14613, 14614, 14615, 14616, 14617, 14620, 14621, 14622, 14623, 14624), class = "Date"),
open.x = c(57.32, 57.9, 58.2, 58.31, 57.45, 58.79, 57.2, 56.32, 56.51, 56.59),
high.x = c(58.13, 58.33, 58.56, 58.41, 58.62, 59, 57.21, 56.66, 57.05, 56.59),
low.x = c(57.32, 57.54, 58.01, 57.14, 57.45, 57.22, 56.15, 54.83, 55.37, 55.19),
x_Close = c(57.85, 58.2, 58.42, 57.9, 58.47, 57.73, 56.34, 56.56, 55.53, 55.84),
volume.x = c(442900L, 436900L, 850600L, 463600L, 206500L, 331900L, 428500L, 577500L, 368100L, 417900L),
open.y = c(6.61, 6.82, 7.05, 7.24, 7.08, 7.38, 7.15, 7.05, 7.08, 7.03),
high.y = c(6.84, 7.12, 7.38, 7.3, 7.35, 7.45, 7.19, 7.17, 7.1701, 7.05),
low.y = c(6.61, 6.8, 7.05, 7.06, 6.95, 7.17, 6.87, 6.98, 7.08, 6.95),
y_Close = c(6.83, 7.12, 7.27, 7.11, 7.29, 7.22, 7, 7.15, 7.11, 7.03),
volume.y = c(833100L, 904500L, 759800L, 557800L, 588100L, 450500L, 694700L, 528800L, 279900L, 407600L)),
row.names = c(NA, -10L), class = "data.frame")
# we get the same
library(roll)
library(rollRegres)
X <- as.matrix(ds$x_Close)
Y <- ds$y_Close
Ymat <- as.matrix(Y)
all.equal(
roll_lm(x = X, y = Ymat, intercept = FALSE, width = 5L)$coefficients[, 2],
drop(roll_regres.fit(x = X, y = Y, width = 5L)$coefs),
check.attributes = FALSE)
#R [1] TRUE
You can also fit the model with a formula as with lm using the roll_regres function
all.equal(
roll_lm(x = X, y = Ymat, intercept = FALSE, width = 5L)$coefficients[, 2],
drop(roll_regres(y_Close ~ x_Close - 1, ds, width = 5L)$coefs),
check.attributes = FALSE)
#R [1] TRUE
Here is a benchmark of the computation speed
# We add more observations to get a more interesting example
set.seed(1)
n <- 250 * 5 # 5 years of trading data
X <- as.matrix(rnorm(n))
Y <- rnorm(n)
Ymat <- as.matrix(Y)
microbenchmark::microbenchmark(
roll_lm(x = X, y = Ymat, intercept = FALSE, width = 5L),
roll_regres.fit(x = X, y = Y, width = 5L),
times = 1e3)
#R Unit: microseconds
#R                                                     expr   min    lq  mean median    uq    max neval
#R  roll_lm(x = X, y = Ymat, intercept = FALSE, width = 5L) 663.7 739.9 834.2  777.1 860.2 3972.3  1000
#R                roll_regres.fit(x = X, y = Y, width = 5L) 186.9 204.6 237.4  224.8 248.3  546.4  1000
Consider using the roll package.
library(magrittr); requireNamespace("roll")
ds <- readr::read_csv(
" Date, open.x, high.x, low.x, x_Close, volume.x, open.y, high.y, low.y, y_Close, volume.y
2010-01-04, 57.32, 58.13, 57.32, 57.85, 442900, 6.61, 6.8400, 6.61, 6.83, 833100
2010-01-05, 57.90, 58.33, 57.54, 58.20, 436900, 6.82, 7.1200, 6.80, 7.12, 904500
2010-01-06, 58.20, 58.56, 58.01, 58.42, 850600, 7.05, 7.3800, 7.05, 7.27, 759800
2010-01-07, 58.31, 58.41, 57.14, 57.90, 463600, 7.24, 7.3000, 7.06, 7.11, 557800
2010-01-08, 57.45, 58.62, 57.45, 58.47, 206500, 7.08, 7.3500, 6.95, 7.29, 588100
2010-01-11, 58.79, 59.00, 57.22, 57.73, 331900, 7.38, 7.4500, 7.17, 7.22, 450500
2010-01-12, 57.20, 57.21, 56.15, 56.34, 428500, 7.15, 7.1900, 6.87, 7.00, 694700
2010-01-13, 56.32, 56.66, 54.83, 56.56, 577500, 7.05, 7.1700, 6.98, 7.15, 528800
2010-01-14, 56.51, 57.05, 55.37, 55.53, 368100, 7.08, 7.1701, 7.08, 7.11, 279900
2010-01-15, 56.59, 56.59, 55.19, 55.84, 417900, 7.03, 7.0500, 6.95, 7.03, 407600"
)
runs <- roll::roll_lm(
x = as.matrix(ds$x_Close),
y = as.matrix(ds$y_Close),
width = 5,
intercept = FALSE
)
# Nested in a named-column, within a matrix, within a list.
ds$beta <- runs$coefficients[, "x1"]
ds$beta
# [1] NA NA NA NA 0.1224813
# [6] 0.1238653 0.1242478 0.1246279 0.1256553 0.1259121
Double-check the alignment of the variables in your dataset. x_Close is around 50, while y_Close is around 7. That might explain the small disparity between the expected 0.1229065 and the 0.1224813 value above.
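As a cross-check that needs no extra packages, the rolling beta can also be computed by hand. For a regression through the origin the slope is sum(x * y) / sum(x^2) within each window, so a short loop over the x_Close and y_Close columns above is enough. This is only a sketch for verifying the numbers, not a replacement for the faster packages.

```r
x <- c(57.85, 58.20, 58.42, 57.90, 58.47, 57.73, 56.34, 56.56, 55.53, 55.84)  # x_Close
y <- c(6.83, 7.12, 7.27, 7.11, 7.29, 7.22, 7.00, 7.15, 7.11, 7.03)            # y_Close
width <- 5

# Right-aligned rolling slope of y ~ x + 0; the first width - 1 entries stay NA
beta <- rep(NA_real_, length(x))
for (i in seq(width, length(x))) {
  w <- (i - width + 1):i
  beta[i] <- sum(x[w] * y[w]) / sum(x[w]^2)
}

# lm() on the first window, for comparison with beta[5]
beta_lm <- unname(coef(lm(y[1:5] ~ x[1:5] + 0)))

beta
beta_lm
```

The fifth entry agrees with the 0.1224813 reported by roll_lm above, which suggests the expected 0.1229065 came from differently aligned inputs rather than from the rolling code.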
