Can I create scatterplots with "paired circles" in R using ggplot2?

First, to clarify the title: I am trying to create a single scatterplot. The nature of my data is such that there are two of each observation, and I would like each pair of observations to be "connected" in the scatterplot via a line or arrow between the two points.
To help with the question, here's a short dataset:
structure(list(evToRevJun15 = c(4.56, 1.35, 1.26, 5.99, 2.79,
6.97, 4.9, 2.28, 1.26, 4.83, 2, 2.36, 4.91, 2.31, 2.47), evToGiJun15 = c(21.71,
5, 4.85, 23.04, 21.46, 34.85, 44.53, 12.67, 9.69, 21.96, 11.76,
19.67, 11.69, 6.42, 5.74), evToRevDec18 = c(1.99, 5.92, 2.13,
6.6, 5.84, 4.32, 6.38, 6.77, 4.92, 2.67, 4.48, 6.69, 1.36, 3.79,
2.41), evToGiDec18 = c(7.37, 24.67, 7.89, 34.74, 19.47, 15.43,
33.58, 39.84, 28.94, 11.61, 17.23, 44.6, 7.56, 8.24, 5.74)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -15L))
> head(zed)
# A tibble: 6 x 4
evToRevJun15 evToGiJun15 evToRevDec18 evToGiDec18
<dbl> <dbl> <dbl> <dbl>
1 4.56 21.7 1.99 7.37
2 1.35 5 5.92 24.7
3 1.26 4.85 2.13 7.89
4 5.99 23.0 6.6 34.7
5 2.79 21.5 5.84 19.5
6 6.97 34.8 4.32 15.4
The two evToRev columns are for the X-axis, and the two evToGi columns are for the Y-axis, and therefore each row in the dataframe constitutes two points in the graph.
Here is an example that sort of highlights what I'm going for, but not exactly. Imagine this graph, but instead of 5 points for Messi, there would be 2 points for Messi, 2 for Angel di Maria, 2 for Neymar, etc.
Any thoughts or help on this would be great! Please let me know if I can add additional clarification.
Edit: The 2nd and 3rd graphs in this article are a better example of what I'm going for.

The first step in achieving this is reshaping the data into a format that works better with ggplot - once you've done that, the actual plotting code is pretty simple:
library(tidyverse)

df_long <- df %>%
  # Need an id that will keep observations together
  # once they've been split into separate rows
  mutate(id = 1:n()) %>%
  gather(key = "key", value = "value", -id) %>%
  mutate(Time = str_sub(key, nchar(key) - 4),
         Type = str_remove(key, Time)) %>%
  select(-key) %>%
  # In this case we don't want the data entirely
  # 'long', since evToRev and evToGi will be
  # mapped separately to x and y
  spread(Type, value)

df_long %>%
  ggplot(aes(x = evToRev, y = evToGi, colour = Time)) +
  # The group aesthetic controls which points are connected
  geom_line(aes(group = id), colour = "grey40") +
  geom_point(size = 3) +
  theme_bw()
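If you want arrows rather than plain lines, as the question mentions, one option (a sketch, assuming the same df_long as above) is to swap geom_line for geom_path, which draws segments in row order and accepts an arrow:

```r
library(tidyverse)
library(grid)  # for arrow() and unit()

df_long %>%
  # For these particular labels, sorting Time in descending order
  # happens to put "Jun15" before "Dec18", so arrows point Jun15 -> Dec18
  arrange(id, desc(Time)) %>%
  ggplot(aes(x = evToRev, y = evToGi, colour = Time)) +
  geom_path(aes(group = id), colour = "grey40",
            arrow = arrow(length = unit(0.2, "cm"), type = "closed")) +
  geom_point(size = 3) +
  theme_bw()
```

Note that geom_path (unlike geom_line) does not reorder points by x, so the arrow direction follows whatever row order each id group has.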
Result:
The reshaping could probably be done more neatly using tidyr::pivot_longer(),
but at the time of writing that's still only available in the development version, so I've used gather and spread.
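For reference, once pivot_longer()/pivot_wider() reach CRAN, the same reshape might look like this (an untested sketch mirroring the logic above):

```r
library(tidyverse)

df_long <- df %>%
  mutate(id = row_number()) %>%
  pivot_longer(-id, names_to = "key", values_to = "value") %>%
  mutate(Time = str_sub(key, -5),            # last 5 chars: "Jun15"/"Dec18"
         Type = str_remove(key, Time)) %>%   # what's left: "evToRev"/"evToGi"
  select(-key) %>%
  pivot_wider(names_from = Type, values_from = value)
```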

Related

Show the x value on the x-axis for each point in a ggplot

I have a ggplot with some points and I would like to show all the x values on the x-axis.
day = c(1,2,3,4,5,6,7,8,9,10)
value =c(4.63, 3.58, 3.52, 4.42, 4.09, 3.51, 3.95, 4.51, 3.45 ,1.76)
data <- tibble(day, value)
data %>% ggplot(aes(day,value)) + geom_point()
So instead of showing 2.5, 5.0, 7.5, 10, I would like to display all the values from 1 to 10.
How can I do this?
Make your x-axis a factor with as.factor:
day = as.factor(c(1,2,3,4,5,6,7,8,9,10))
value =c(4.63, 3.58, 3.52, 4.42, 4.09, 3.51, 3.95, 4.51, 3.45 ,1.76)
data <- tibble(day, value)
data %>% ggplot(aes(day,value)) + geom_point()
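If you'd rather keep day numeric (for instance so that spacing stays meaningful when days are not equally spaced), an alternative sketch is to set the axis breaks explicitly:

```r
library(tibble)
library(ggplot2)

day <- 1:10
value <- c(4.63, 3.58, 3.52, 4.42, 4.09, 3.51, 3.95, 4.51, 3.45, 1.76)
data <- tibble(day, value)

data %>%
  ggplot(aes(day, value)) +
  geom_point() +
  scale_x_continuous(breaks = 1:10)  # label every day from 1 to 10
```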

Getting an error when using gather to create clustered bar chart

I am trying to create a clustered bar chart, using the gather function to get the correct bars to group together. When I do this, I get an error saying the gather function could not be found. I have dplyr and magrittr installed. Any thoughts on how to make this work, or whether there is a better way to create the bar chart grouped by test and date?
Plaster <- Plaster_2019_Data %>%
gather("pH", "Temperature", "Surface", -Date)
Data:
Surface pH Temperature Date
12.08 8.56 11.16 5/13/2019
11.68 8.90 8.76 5/29/2019
8.69 9.07 14.65 6/10/2019
2.26 7.49 17.51 6/24/2019
4.54 7.77 23.82 7/8/2019
2.13 8.17 25.29 8/5/2019
6.34 8.62 26.50 8/19/2019
9.33 9.03 24.31 9/4/2019
10.98 8.58 21.02 9/16/2019
9.59 8.61 17.33 9/30/2019
16.07 8.70 10.39 10/14/2019
9.12 8.07 6.38 11/14/2019
The gather function comes from tidyr, not dplyr or magrittr, so that package needs to be installed and loaded:
library(tidyr)
As mentioned by @akrun, you need tidyr. Note that the function pivot_longer is intended to replace gather in the near future (https://tidyr.tidyverse.org/reference/gather.html).
Also, since the ranges of values differ quite a bit between your tests, I would suggest using facet_wrap to make a nicer plot.
Altogether, you can write something like this:
df$Date <- as.Date(df$Date, format = "%m/%d/%Y")

library(tidyr)
library(ggplot2)
library(dplyr)

df %>%
  pivot_longer(-Date, names_to = "Test", values_to = "value") %>%
  ggplot(aes(x = Date, y = value, fill = Test)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  facet_wrap(. ~ Test, scales = "free") +
  scale_x_date(date_labels = "%b %d",
               date_breaks = "2 weeks") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Data
structure(list(Surface = c(12.08, 11.68, 8.69, 2.26, 4.54, 2.13,
6.34, 9.33, 10.98, 9.59, 16.07, 9.12), pH = c(8.56, 8.9, 9.07,
7.49, 7.77, 8.17, 8.62, 9.03, 8.58, 8.61, 8.7, 8.07), Temperature = c(11.16,
8.76, 14.65, 17.51, 23.82, 25.29, 26.5, 24.31, 21.02, 17.33,
10.39, 6.38), Date = structure(c(18029, 18045, 18057, 18071,
18085, 18113, 18127, 18143, 18155, 18169, 18183, 18214), class = "Date")), row.names = c(NA,
-12L), class = "data.frame")

Using predicted log change values from regression to predict future prices

I have a problem I was hoping someone could help me with. I have a data set showing the prices of multiple goods (each a column) on a daily basis over several years. I've run the regression below on part of my data frame, then created predicted values for the rest of the time period. My predicted values are the log changes in the price of pet. For clarification, I already have the actual prices of pet; I am just trying to predict them using this method.
lin <- lm(diff(log(pet)) ~ diff(log(bron)) + diff(log(yen)) +
            diff(yal) - 1, data = codData[1:634,])
predictions <- predict(lin, newdata = codData[635:1025,])
My problem now is getting the actual predicted price of pet. I would normally do this by multiplying (1 + the first predicted log change) by the last known price of pet, which gives the first predicted price; I would then multiply (1 + the second predicted log change) by that newly predicted value, and so on. I'm not sure how to do this in R, though. Does anyone have any ideas?
Thanks ahead of time!
Code to get sample data
codData <- structure(list(date = structure(c(1306800000, 1306886400, 1306972800,
1307059200, 1307318400, 1307404800, 1307491200, 1307577600, 1307664000,
1307923200, 1308009600, 1308096000, 1308182400, 1308268800, 1308528000,
1308614400, 1308700800, 1308787200, 1308873600, 1309132800, 1309219200,
1309305600, 1309392000, 1309478400, 1309824000, 1309910400, 1309996800,
1310083200, 1310342400, 1310428800, 1310515200, 1310601600, 1310688000,
1310947200, 1311033600, 1311120000, 1311206400, 1311292800, 1311552000,
1311638400, 1311724800, 1311811200, 1311897600, 1312156800, 1312243200,
1312329600, 1312416000, 1312502400, 1312761600, 1312848000, 1312934400,
1313020800, 1313107200, 1313366400, 1313452800, 1313539200, 1313625600,
1313712000, 1313971200, 1314057600, 1314144000, 1314230400, 1314316800,
1314576000, 1314662400, 1314748800), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), yal = c(3.05, 2.96, 3.04, 2.99, 3.01, 3.01,
2.98, 3.01, 2.99, 3, 3.11, 2.98, 2.93, 2.94, 2.97, 2.99, 3.01,
2.93, 2.88, 2.95, 3.05, 3.14, 3.18, 3.22, 3.16, 3.12, 3.17, 3.03,
2.94, 2.92, 2.92, 2.98, 2.94, 2.94, 2.91, 2.96, 3.03, 2.99, 3.03,
2.99, 3.01, 2.98, 2.82, 2.77, 2.66, 2.64, 2.47, 2.58, 2.4, 2.2,
2.17, 2.34, 2.24, 2.29, 2.23, 2.17, 2.08, 2.07, 2.1, 2.15, 2.29,
2.23, 2.19, 2.28, 2.19, 2.23), sp = c(1345.2, 1314.55, 1312.9399,
1300.16, 1286.17, 1284.9399, 1279.5601, 1289, 1270.98, 1271.83,
1287.87, 1265.42, 1267.64, 1271.5, 1278.36, 1295.52, 1287.14,
1283.5, 1268.45, 1280.1, 1296.67, 1307.41, 1320.64, 1339.67,
1337.88, 1339.22, 1353.22, 1343.8, 1319.49, 1313.64, 1317.72,
1308.87, 1316.14, 1305.4399, 1326.73, 1325.84, 1343.8, 1345.02,
1337.4301, 1331.9399, 1304.89, 1300.67, 1292.28, 1286.9399, 1254.05,
1260.34, 1200.0699, 1199.38, 1119.46, 1172.53, 1120.76, 1172.64,
1178.8101, 1204.49, 1192.76, 1193.89, 1140.65, 1123.53, 1123.8199,
1162.35, 1177.6, 1159.27, 1176.8, 1210.08, 1212.92, 1218.89),
pet = c(102.63, 100.13, 100.54, 100.49, 98.85, 98.98, 100.93,
101.71, 99.02, 96.98, 99.17, 95.29, 94.96, 92.96, 93.25,
93.4, 94.59, 91.75, 91.25, 90.81, 92.89, 94.93, 94.92, 94.7,
96.8, 96.64, 98.49, 96.31, 95.05, 96.77, 97.89, 95.73, 97.32,
96, 97.7, 98.14, 99.25, 99.82, 99.13, 99.44, 97.31, 97.13,
95.92, 95.33, 93.25, 91.93, 86.44, 87.07, 80.74, 81.12, 81.55,
85.46, 85.25, 87.89, 86.93, 87.45, 81.58, 82.63, 84.12, 86.12,
85.17, 84.94, 85.42, 87.45, 88.76, 88.91), bron = c(419.25,
409.5, 409.7, 412.4, 412.25, 414.65, 411.25, 410.5, 404.45,
403.38, 415.85, 411.63, 412.3, 410.05, 407.7, 408.35, 405.85,
406.58, 408.45, 407.2, 409.85, 421.8, 426.45, 430.25, 432.95,
432.4, 442.15, 439.08, 434.5, 438.52, 438.52, 437.95, 440.73,
440.55, 446.45, 442.42, 437.92, 440.2, 440.33, 447.3, 443.15,
447.3, 448.3, 441, 438.3, 433.65, 421.4, 412.35, 393.05,
403.55, 389.5, 404.1, 399.5, 403.67, 399.25, 404, 394.13,
396.85, 393.98, 401.25, 401.27, 409.17, 409.8, 409.5, 414.7,
418.2), yen = c(929.87, 932.16, 927.79, 922.76, 925.77, 921.77,
925.73, 926.87, 934, 929.98, 928.28, 939.99, 939.99, 934.44,
934.93, 929.78, 932.43, 936.68, 940.12, 938.95, 935.56, 930.47,
927.23, 925.86, 929.43, 932.42, 930.49, 931.15, 939.64, 938.86,
929.71, 930.59, 929.31, 931.59, 929.23, 925.3, 919.2, 919.95,
918.83, 912.58, 917.17, 919.02, 915.52, 918.61, 920.61, 918.09,
932.46, 926.3, 931.17, 921.45, 931.42, 929.27, 929.41, 922.31,
923.17, 920.27, 926.05, 924.52, 926.53, 923.23, 926.24, 929.12,
923.74, 922.74, 924.79, 925.04)), row.names = c(NA, -66L), class = c("tbl_df",
"tbl", "data.frame"))
I recommend reading this post to get familiar with the fable package: https://www.mitchelloharawild.com/blog/fable/
library(tidyverse)
library(lubridate)
library(tsibble)
library(fable)

df_example <- codData %>%
  mutate(simple_date = as_date(date)) %>%
  select(-date) %>%
  as_tsibble(index = simple_date) %>%
  tsibble::fill_gaps() %>%
  tidyr::fill(yal:yen)

fit <- df_example %>%
  filter(simple_date < yearmonth("2011 08")) %>%
  model(linear_reg = TSLM(log(pet) ~ log(bron) + log(yen) + log(yal)))

forecasts_result <- fit %>% forecast(df_example)

forecasts_result %>%
  filter(simple_date >= yearmonth("2011 08")) %>%
  autoplot(df_example)

forecasts_result %>%
  accuracy(df_example)
#> # A tibble: 1 x 9
#> .model .type ME RMSE MAE MPE MAPE MASE ACF1
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 linear_reg Test -3.34 6.30 4.70 -3.97 5.34 NaN 0.947
Another option is to use a VAR model:
var_fit <- df_example %>%
  filter(simple_date < yearmonth("2011 08")) %>%
  model(VAR_MODEL = VAR(vars(yal, pet, bron, yen) ~ AR(7)))

forecast_result_var <- var_fit %>%
  forecast(h = 31)

forecast_result_var %>%
  autoplot(df_example)

forecast_result_var %>%
  accuracy(df_example)
#> # A tibble: 4 x 10
#> .model .response .type ME RMSE MAE MPE MAPE MASE ACF1
#> <chr> <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 VAR_MODEL yal Test -0.448 0.477 0.450 -20.2 20.2 4.53 0.713
#> 2 VAR_MODEL pet Test -10.7 11.3 10.7 -12.6 12.6 4.33 0.639
#> 3 VAR_MODEL bron Test -32.0 34.7 32.0 -7.95 7.95 4.75 0.746
#> 4 VAR_MODEL yen Test 41.8 45.3 41.8 4.52 4.52 6.19 0.827
Created on 2020-01-05 by the reprex package (v0.3.0)
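The iterative reconstruction described in the question can also be done directly in base R; a minimal sketch, assuming predictions holds the predicted log changes and that row 634 of the full data set holds the last observed price:

```r
last_price <- codData$pet[634]

# Exact back-transform of log differences: p_t = p_{t-1} * exp(d_t),
# so the whole path is a cumulative sum inside exp()
pred_prices <- last_price * exp(cumsum(predictions))

# Or the (1 + change) approximation described in the question,
# which is close for small log changes
pred_prices_approx <- last_price * cumprod(1 + predictions)
```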

Flow duration curve (fdc) extract low threshold

I am a newbie working with streamflow duration curves and the fdc function.
I am working with more than 300 series, and I am interested in saving the low-flow quantile threshold Qlow.thr value that appears in the generated plot:
Here is the reproducible example:
dat <- c(13.05, 90.29, 5.68, 49.13, 26.39, 15.06, 23.39, 17.98,
         4.21, 2.51, 38.29, 8.57, 2.48, 3.78, 18.09, 15.16,
         13.46, 8.69, 6.85, 11.97, 12.10, 9.87, 21.89, 2.60,
         2.40, 27.40, 4.94, 83.17, 12.10, 5.08, 12.42, 6.19,
         3.60, 32.58, 53.69, 38.49, 3.61, 14.84, 34.48, 1.91,
         21.79, 31.53, 6.70, 9.52, 22.64, 1.80, 8.13, 10.60,
         12.73, 4.17, 6.70, 16.45)
fdc(dat, plot = TRUE, lQ.thr = 0.8, ylab = 'Hm3',
    main = 'Upstream monthly duration curve', thr.shw = TRUE)
The fdc function returns a vector of probabilities, but I am not sure how to convert these probabilities back to the original units and select the 80% exceedance value expressed in Hm3, as I would do with pnorm, for example, when working with normal probabilities.
Thank you so much.
You can construct the FDC yourself by
dat <- c(13.05, 90.29, 5.68, 49.13, 26.39, 15.06, 23.39, 17.98,
4.21, 2.51, 38.29, 8.57, 2.48 , 3.78, 18.09 ,15.16,
13.46, 8.69, 6.85, 11.97, 12.10, 9.87 ,21.89, 2.60,
2.40, 27.40, 4.94, 83.17 ,12.10, 5.08 ,12.42, 6.19,
3.60 ,32.58, 53.69, 38.49,3.61, 14.84, 34.48, 1.91,
21.79, 31.53, 6.70, 9.52, 22.64, 1.80 , 8.13, 10.60,
12.73, 4.17, 6.70 ,16.45)
dat <- sort(dat, decreasing = T)
df <- data.frame(x = 100/length(dat) * 1:length(dat), y = dat)
plot(x = df$x, y = df$y, type = "l", log = "y")
So the sorted flow data is simply plotted against the percentage exceedance scale. This scale is created by dividing 100% by the number of data points which gives us the increment for each point.
Therefore
quantile(dat, p = c(0.2, 0.8), type = 1)
gives you your desired results.
Notice that fdc computes the quantile differently. It seems to just use
p <- c(0.8, 0.2)
dat[round(p * length(dat))]
> [1] 4.21 27.40
to compute the values.

Creating a 2D-grid or raster in R comparing all respondents with all variables

reproducible example for my data:
df_1 <- data.frame(cbind("Thriving" = c(2.33, 4.21, 6.37, 5.28, 4.87, 3.92, 4.16, 5.53), "Satisfaction" = c(3.45, 4.53, 6.01, 3.87, 2.92, 4.50, 5.89, 4.72), "Wellbeing" = c(2.82, 3.45, 5.23, 3.93, 6.18, 4.22, 3.68, 4.74), "id" = c(1:8)))
As you can see, it includes three variables of psychological measures and one identifier with an id for each respondent.
Now, my aim is to create a 2D grid giving a nice overview of all the values for all respondents on each of the variables. So on the x-axis I would have the id of each respondent and on the y-axis all variables, with the colour of each field depending on the value: 1 to 3 in red, 3 to 5 in yellow and 5 to 7 in green. The style of the grid should be like this image.
All I have achieved so far is the following code, which gathers all the variables/items into one column so they can be portrayed together on the y-axis; the id is of course in its own column, as are the values:
df_1 %>%
  select("Thr" = Thriving, "Stf" = Satisfaction, "Wb" = Wellbeing, "id" = id) %>%
  na.omit() %>%
  gather(key = "variable", value = "value", -id)
I am looking for a solution that works without storing the data in a new frame.
Also, I am looking for a solution that would scale to 100 or more respondents and up to about 40 variables. It would not matter if each rectangle became very small; I just want a colour display that gives a quick sense of where an organisation is scoring low or high, and how it is doing overall.
Thanks for reading, very grateful for any help!
There is probably a better graphics-oriented approach, but you can do this with base plot by treating your data as a raster:
library(raster)

df_1 <- cbind("Thriving" = c(2.33, 4.21, 6.37, 5.28, 4.87, 3.92, 4.16, 5.53),
              "Satisfaction" = c(3.45, 4.53, 6.01, 3.87, 2.92, 4.50, 5.89, 4.72),
              "Wellbeing" = c(2.82, 3.45, 5.23, 3.93, 6.18, 4.22, 3.68, 4.74),
              "id" = 1:8)

r <- raster(ncol = nrow(df_1), nrow = 3, xmn = 0, xmx = 8, ymn = 0, ymx = 3)
values(r) <- as.vector(as.matrix(df_1[, 1:3]))

plot(r, axes = FALSE, box = FALSE, asp = NA)
axis(1, at = seq(-0.5, 8.5, 1), 0:9)
axis(2, at = seq(-0.5, 3.5, 1), c("", colnames(df_1)), las = 1)
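To get the colour bands from the question (1 to 3 red, 3 to 5 yellow, 5 to 7 green), you can pass breaks and col to plot(); a sketch building on the raster r above:

```r
# Fixed class breaks: values in [1,3) red, [3,5) yellow, [5,7] green
plot(r, breaks = c(1, 3, 5, 7), col = c("red", "yellow", "green"),
     axes = FALSE, box = FALSE, asp = NA)
axis(1, at = seq(-0.5, 8.5, 1), 0:9)
axis(2, at = seq(-0.5, 3.5, 1), c("", colnames(df_1)), las = 1)
```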