Using predicted log change values from regression to predict future prices - r

I have a problem I was hoping someone could help me with. I have a data set which shows the prices of multiple goods (each a column) on a daily basis for some years. I've run a regression like the one below on part of my data frame, and then created predicted values for the rest of the time period I have. My predicted values are the log changes in the price of pet. To clarify, I already have all of the actual values for the price of pet; I am just trying to predict them using this method.
lin <- lm(diff(log(pet)) ~ diff(log(bron)) + diff(log(yen)) +
            diff(yal) - 1, data = codData[1:634,])
predictions <- (predict(lin, newdata = codData[635:1025,]))
My problem now is that I want to get the actual predicted value of the price of pet, which I would normally do by multiplying the first predicted log change + 1 by the first price of pet which I want to predict, which would get me the first predicted value of the price of pet. I would then multiply the second predicted log change + 1 by that newly predicted value of pet, and so on and so forth. I'm not sure how I can do this in R though. Does anyone have any ideas?
Thanks ahead of time!
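A minimal sketch of the chaining described above, assuming the predictions are log changes. Because diff() applied to rows 635:1025 drops the first row, the first prediction is the change from row 635 to row 636, so the starting price would be pet[635]. The toy vectors below (pet_toy, pred_log_change) are made up for illustration and are not taken from codData.

```r
# Toy stand-ins (hypothetical values, not from codData)
pet_toy <- c(100, 102, 101)               # known prices; the last one is the starting point
pred_log_change <- c(0.01, -0.005, 0.02)  # stands in for predict(lin, newdata = ...)

start_price <- pet_toy[length(pet_toy)]

# Exact inversion of a log change: p_t = p_{t-1} * exp(change_t),
# so the chaining reduces to a cumulative sum inside exp()
pred_price <- start_price * exp(cumsum(pred_log_change))

# The "+ 1" version from the question is a first-order approximation
pred_price_approx <- start_price * cumprod(1 + pred_log_change)

print(pred_price)
print(pred_price_approx)
```

With the real data this would presumably become codData$pet[635] * exp(cumsum(predictions)), with the results lining up with rows 636:1025.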
Code to get sample data
codData <- structure(list(date = structure(c(1306800000, 1306886400, 1306972800,
1307059200, 1307318400, 1307404800, 1307491200, 1307577600, 1307664000,
1307923200, 1308009600, 1308096000, 1308182400, 1308268800, 1308528000,
1308614400, 1308700800, 1308787200, 1308873600, 1309132800, 1309219200,
1309305600, 1309392000, 1309478400, 1309824000, 1309910400, 1309996800,
1310083200, 1310342400, 1310428800, 1310515200, 1310601600, 1310688000,
1310947200, 1311033600, 1311120000, 1311206400, 1311292800, 1311552000,
1311638400, 1311724800, 1311811200, 1311897600, 1312156800, 1312243200,
1312329600, 1312416000, 1312502400, 1312761600, 1312848000, 1312934400,
1313020800, 1313107200, 1313366400, 1313452800, 1313539200, 1313625600,
1313712000, 1313971200, 1314057600, 1314144000, 1314230400, 1314316800,
1314576000, 1314662400, 1314748800), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), yal = c(3.05, 2.96, 3.04, 2.99, 3.01, 3.01,
2.98, 3.01, 2.99, 3, 3.11, 2.98, 2.93, 2.94, 2.97, 2.99, 3.01,
2.93, 2.88, 2.95, 3.05, 3.14, 3.18, 3.22, 3.16, 3.12, 3.17, 3.03,
2.94, 2.92, 2.92, 2.98, 2.94, 2.94, 2.91, 2.96, 3.03, 2.99, 3.03,
2.99, 3.01, 2.98, 2.82, 2.77, 2.66, 2.64, 2.47, 2.58, 2.4, 2.2,
2.17, 2.34, 2.24, 2.29, 2.23, 2.17, 2.08, 2.07, 2.1, 2.15, 2.29,
2.23, 2.19, 2.28, 2.19, 2.23), sp = c(1345.2, 1314.55, 1312.9399,
1300.16, 1286.17, 1284.9399, 1279.5601, 1289, 1270.98, 1271.83,
1287.87, 1265.42, 1267.64, 1271.5, 1278.36, 1295.52, 1287.14,
1283.5, 1268.45, 1280.1, 1296.67, 1307.41, 1320.64, 1339.67,
1337.88, 1339.22, 1353.22, 1343.8, 1319.49, 1313.64, 1317.72,
1308.87, 1316.14, 1305.4399, 1326.73, 1325.84, 1343.8, 1345.02,
1337.4301, 1331.9399, 1304.89, 1300.67, 1292.28, 1286.9399, 1254.05,
1260.34, 1200.0699, 1199.38, 1119.46, 1172.53, 1120.76, 1172.64,
1178.8101, 1204.49, 1192.76, 1193.89, 1140.65, 1123.53, 1123.8199,
1162.35, 1177.6, 1159.27, 1176.8, 1210.08, 1212.92, 1218.89),
pet = c(102.63, 100.13, 100.54, 100.49, 98.85, 98.98, 100.93,
101.71, 99.02, 96.98, 99.17, 95.29, 94.96, 92.96, 93.25,
93.4, 94.59, 91.75, 91.25, 90.81, 92.89, 94.93, 94.92, 94.7,
96.8, 96.64, 98.49, 96.31, 95.05, 96.77, 97.89, 95.73, 97.32,
96, 97.7, 98.14, 99.25, 99.82, 99.13, 99.44, 97.31, 97.13,
95.92, 95.33, 93.25, 91.93, 86.44, 87.07, 80.74, 81.12, 81.55,
85.46, 85.25, 87.89, 86.93, 87.45, 81.58, 82.63, 84.12, 86.12,
85.17, 84.94, 85.42, 87.45, 88.76, 88.91), bron = c(419.25,
409.5, 409.7, 412.4, 412.25, 414.65, 411.25, 410.5, 404.45,
403.38, 415.85, 411.63, 412.3, 410.05, 407.7, 408.35, 405.85,
406.58, 408.45, 407.2, 409.85, 421.8, 426.45, 430.25, 432.95,
432.4, 442.15, 439.08, 434.5, 438.52, 438.52, 437.95, 440.73,
440.55, 446.45, 442.42, 437.92, 440.2, 440.33, 447.3, 443.15,
447.3, 448.3, 441, 438.3, 433.65, 421.4, 412.35, 393.05,
403.55, 389.5, 404.1, 399.5, 403.67, 399.25, 404, 394.13,
396.85, 393.98, 401.25, 401.27, 409.17, 409.8, 409.5, 414.7,
418.2), yen = c(929.87, 932.16, 927.79, 922.76, 925.77, 921.77,
925.73, 926.87, 934, 929.98, 928.28, 939.99, 939.99, 934.44,
934.93, 929.78, 932.43, 936.68, 940.12, 938.95, 935.56, 930.47,
927.23, 925.86, 929.43, 932.42, 930.49, 931.15, 939.64, 938.86,
929.71, 930.59, 929.31, 931.59, 929.23, 925.3, 919.2, 919.95,
918.83, 912.58, 917.17, 919.02, 915.52, 918.61, 920.61, 918.09,
932.46, 926.3, 931.17, 921.45, 931.42, 929.27, 929.41, 922.31,
923.17, 920.27, 926.05, 924.52, 926.53, 923.23, 926.24, 929.12,
923.74, 922.74, 924.79, 925.04)), row.names = c(NA, -66L), class = c("tbl_df",
"tbl", "data.frame"))

I recommend reading this post to get familiar with the fable package: https://www.mitchelloharawild.com/blog/fable/
library(tidyverse)
library(lubridate)
library(tsibble)
library(fable)
df_example <- codData %>%
  mutate(simple_date = as_date(date)) %>%
  select(-date) %>%
  as_tsibble(index = simple_date) %>%
  tsibble::fill_gaps() %>%
  tidyr::fill(yal:yen)
fit <- df_example %>%
  filter(simple_date < yearmonth("2011 08")) %>%
  model(linear_reg = TSLM(log(pet) ~ log(bron) + log(yen) + log(yal)))
forecasts_result <- fit %>% forecast(df_example)
forecasts_result %>%
  filter(simple_date >= yearmonth("2011 08")) %>%
  autoplot(df_example)
forecasts_result %>%
  accuracy(df_example)
#> # A tibble: 1 x 9
#>   .model     .type    ME  RMSE   MAE   MPE  MAPE  MASE  ACF1
#>   <chr>      <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 linear_reg Test  -3.34  6.30  4.70 -3.97  5.34   NaN 0.947
Another option is using the VAR model
var_fit <- df_example %>%
  filter(simple_date < yearmonth("2011 08")) %>%
  model(VAR_MODEL = VAR(vars(yal, pet, bron, yen) ~ AR(7)))
forecast_result_var <- var_fit %>%
  forecast(h = 31)
forecast_result_var %>%
  autoplot(df_example)
forecast_result_var %>%
  accuracy(df_example)
#> # A tibble: 4 x 10
#>   .model    .response .type      ME   RMSE    MAE    MPE  MAPE  MASE  ACF1
#>   <chr>     <fct>     <chr>   <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>
#> 1 VAR_MODEL yal       Test   -0.448  0.477  0.450 -20.2  20.2   4.53 0.713
#> 2 VAR_MODEL pet       Test  -10.7   11.3   10.7   -12.6  12.6   4.33 0.639
#> 3 VAR_MODEL bron      Test  -32.0   34.7   32.0    -7.95  7.95  4.75 0.746
#> 4 VAR_MODEL yen       Test   41.8   45.3   41.8     4.52  4.52  6.19 0.827
Created on 2020-01-05 by the reprex package (v0.3.0)

Related

How can I plot data from reactive dataframe and fix the error: subscript out of bounds

I am completely new to R Shiny apps and I am currently trying to create a simple app to visualize some gene expression data from an RNA-sequencing experiment. I am trying to pass a textInput (a gene name) from the UI so that the server can use it to look up the data for that specific gene in a data frame and plot it using ggplot. My code (so far) is as follows:
library(shiny)
library(dplyr)
library(ggplot2)
ui <- fluidPage(
textInput("gene", "Please enter a gene of interest (Examples: IL2, CD28, LDLR):"),
plotOutput("plot")
)
server <- function(input, output) {
data1 <- reactive({
#Read data file and convert to a matrix
raw_df <- read_csv("C:/path/file.csv")
matsymbol <- as.matrix(raw_df[, 2:21])
row.names(matsymbol) <- raw_df$...1
## The gene is found in the matsymbol to extract all normalized read counts
## A matrix of dim=5x4 is formed
gene_counts <- t(matrix(matsymbol[input$gene,], nrow=4))
## Define rownames for the matrix
rownames(gene_counts) <- c("Non-stimulated",
"Stimulated, 24h",
"Stimulated, 48h",
"Stimulated, 48h + LV",
"Stimulated, 72h + LV")
## Calculate rowMeans and rowSDs for each row in the matrix
row_means_gene <- rowMeans(gene_counts)
row_sds_gene <- rowSds(gene_counts)
## Collect to a dataframe which can be used for ggplot
df_gene <- as.data.frame(cbind(row_means_gene, row_sds_gene))
})
output$plot <- renderPlot({
req(data1())
## Plot the expression using ggplot
p_gene <- ggplot(df_gene, aes(x=rownames(df_gene), y=row_means_gene, fill = rownames(df_gene))) +
geom_bar(stat="identity", color="grey", position=position_dodge(), width = 0.7) +
geom_errorbar(aes(ymin=row_means_gene-row_sds_gene, ymax=row_means_gene+row_sds_gene), width=0.2,
position=position_dodge(.9), color = "#404040") +
scale_fill_manual("Condition", values = c("Non-stimulated" = "blue",
"Stimulated, 24h" = "red",
"Stimulated, 48h" = "green",
"Stimulated, 48h + LV" = "yellow",
"Stimulated, 72h + LV" = "black")) +
labs(x="Condition", y = "Normalized expression (read counts) +/- s.d.") +
ggtitle(label = "Normalized expression") +
theme(plot.title = element_text(color = "black", size = 12, face = "bold", hjust = 0.5)) +
theme(axis.text.x = element_text(angle = 30, hjust=1))
p_gene
})
}
shinyApp(ui, server)
The data from the raw_df (first col are gene names):
  ...1           A1      A2      A3      A4      B1      B2      B3      B4      C1      C2      C3      C4      D1      D2      D3      D4      E1      E2      E3      E4
  <chr>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 DDX11L1      135.    2.45    24.0    8.77       0 9.14e-1  5.85e0 9.27e-1  1.12e0 1.97e+0       0  5.03e0  7.05e0  2.03e0 9.27e-1 3.27e+0 8.53e-1       0 9.38e-1  1.90e0
2 MIR6859-1    1.16    24.5    32.0    13.2 1.03e+1 9.14e-1  1.42e1 9.27e+0  3.35e0       0 8.76e+0  3.35e0  6.04e0  2.03e0 1.20e+1 3.27e+0       0  1.66e1 2.06e+1  1.05e1
3 DDX11L1      59.2    6.12    34.7    23.4 1.87e+0 1.83e+0  8.35e0 7.41e+0  1.12e0 4.92e+0 1.75e+0  8.38e0  5.03e0  3.05e0 3.71e+0 4.90e+0 1.11e+1  2.38e0 5.63e+0  7.61e0
4 MIR6859-2    16.2    6.12    2.67    10.2 9.36e-1 5.48e+0  4.18e0 9.27e-1  2.23e0       0 8.76e-1       0  5.03e0  1.02e0       0 8.17e-1 1.71e+0  3.96e0 3.75e+0  1.90e0
5 FAM87B       55.7    44.1    56.0    142. 3.74e+0 8.22e+0  1.67e0 2.59e+1  3.35e0 9.84e-1 5.25e+0  2.18e1  1.01e0  7.12e0 2.78e+0 2.29e+1 1.54e+1  7.92e0 1.59e+1  3.52e1
6 LINC00115    81.2    73.4    30.7    127. 8.42e+0 9.14e+0  1.42e1 1.76e+1  1.12e1 6.89e+0 7.88e+0  1.01e1  9.06e0  9.16e0 1.39e+1 1.63e+1 7.68e+0  1.74e1 2.72e+1  3.61e1
Data from the df_gene (example gene):
                     row_means_gene row_sds_gene
Non-stimulated               0.0000       0.0000
Stimulated, 24h           2692.3108    2627.0944
Stimulated, 48h            827.6718     823.3256
Stimulated, 48h + LV      1762.1866    1995.1224
Stimulated, 72h + LV       122.9962     144.6482
I learned that I should wrap my data in reactive() when using input$, but I am running into several problems. First, I am getting this error: error in evaluating the argument 'x' in selecting a method for function 't': subscript out of bounds. I suspect it has something to do with the reactivity, and I have looked for an answer on multiple forums, but I just can't figure out how to fix it. Second, I am not sure what to pass to ggplot: the data frame (df_gene) is inside the reactive element data1, so how can I tell ggplot where to find it? Can I use something like data1()$df_gene?
I really hope this is not trivial/a stupid question, otherwise I apologize.
Thank you so much in advance and all the best,
Lasse
I don't think this is related to reactivity. One caveat first: I rarely work with row names, so I may be missing something.
However, I have checked the row names and the subsetting, and found the following:
raw_df <- data.frame(
stringsAsFactors = FALSE,
...1 = c("DDX11L1",
"MIR6859-1","DDX11L1","MIR6859-2","FAM87B",
"LINC00115"),
A1 = c(135, 1.16, 59.2, 16.2, 55.7, 81.2),
A2 = c(2.45, 24.5, 6.12, 6.12, 44.1, 73.4),
A3 = c(24, 32, 34.7, 2.67, 56, 30.7),
A4 = c(8.77, 13.2, 23.4, 10.2, 142, 127),
B1 = c(0, 10.3, 1.87, 0.936, 3.74, 8.42),
B2 = c(0.914, 0.914, 1.83, 5.48, 8.22, 9.14),
B3 = c(5.85, 14.2, 8.35, 4.18, 1.67, 14.2),
B4 = c(0.927, 9.27, 7.41, 0.927, 25.9, 17.6),
C1 = c(1.12, 3.35, 1.12, 2.23, 3.35, 11.2),
C2 = c(1.97, 0, 4.92, 0, 0.984, 6.89),
C3 = c(0, 8.76, 1.75, 0.876, 5.25, 7.88),
C4 = c(5.03, 3.35, 8.38, 0, 21.8, 10.1),
D1 = c(7.05, 6.04, 5.03, 5.03, 1.01, 9.06),
D2 = c(2.03, 2.03, 3.05, 1.02, 7.12, 9.16),
D3 = c(0.927, 12, 3.71, 0, 2.78, 13.9),
D4 = c(3.27, 3.27, 4.9, 0.817, 22.9, 16.3),
E1 = c(0.853, 0, 11.1, 1.71, 15.4, 7.68),
E2 = c(0, 16.6, 2.38, 3.96, 7.92, 17.4),
E3 = c(0.938, 20.6, 5.63, 3.75, 15.9, 27.2),
E4 = c(1.9, 10.5, 7.61, 1.9, 35.2, 36.1)
)
matsymbol <- as.matrix(raw_df[, 2:21])
row.names(matsymbol) <- raw_df$...1
input_gene <- "What should I choose?"
t(matrix(matsymbol[input_gene ,], nrow=4))
This gives you the same error whenever input$gene (input_gene above) does not exist in the data frame / matrix. More precisely, the code below gives the same error:
input_gene <- ""
t(matrix(matsymbol[input_gene ,], nrow=4))
This happens because an empty textInput (and it is empty when the app starts) has the value "".
You can use the code below to get an empty matrix instead of an error:
t(matrix(matsymbol[rownames(matsymbol) == input_gene ,], nrow=4))
But you will run into another problem if you get a matrix with 0 rows, because the next lines, i.e.:
rownames(gene_counts) <- c("Non-stimulated",
"Stimulated, 24h",
"Stimulated, 48h",
"Stimulated, 48h + LV",
"Stimulated, 72h + LV")
won't work: a matrix with 0 rows cannot take five row names, and its rownames() is NULL, as in this example:
rownames(matrix()[FALSE, ])
In other words, you would first need to check that the matrix has 5 rows before assigning the 5-element character vector as its row names. Alternatively, it might be better to let the user choose the gene from a predefined list of genes; check out the selectInput() function. Because you are reading the file in the server part, you would also need to familiarize yourself with the updateSelectInput() function. Think about it and try it if you think it is worth doing; if you then need help with "How to display values in selectInput()?", you can always ask another question here.
Nota bene: I had a problem with the function rowSds(). I found that it comes from the genefilter package (matrixStats also provides a function of that name), but I couldn't install genefilter (not available for my version of R), so I didn't check the later steps in your app.
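As a rough sketch of the check suggested above, the lookup can be moved into a small helper that returns NULL when the gene is missing instead of failing. The name gene_summary and the toy matrix are hypothetical, and sd() via apply() is swapped in for rowSds() so that no extra package is needed.

```r
# Hypothetical helper: per-condition mean and sd for one gene,
# or NULL when the gene is not among the row names of `mat`
gene_summary <- function(mat, gene) {
  idx <- match(gene, rownames(mat))   # first matching row, NA if absent
  if (is.na(idx)) return(NULL)
  # 20 values -> 5 conditions (rows) x 4 replicates (columns)
  counts <- matrix(mat[idx, ], ncol = 4, byrow = TRUE)
  data.frame(
    condition = c("Non-stimulated", "Stimulated, 24h", "Stimulated, 48h",
                  "Stimulated, 48h + LV", "Stimulated, 72h + LV"),
    mean = rowMeans(counts),
    sd = apply(counts, 1, sd)
  )
}

# Toy matrix with made-up counts: 2 genes x 20 samples
toy_mat <- matrix(1:40, nrow = 2, byrow = TRUE,
                  dimnames = list(c("IL2", "CD28"), NULL))
res <- gene_summary(toy_mat, "IL2")
print(res)
print(is.null(gene_summary(toy_mat, "")))  # empty text input -> NULL
```

Inside the app, data1 could then be something like reactive({ gene_summary(matsymbol, input$gene) }), and renderPlot() could start with req(data1()) followed by df_gene <- data1(). Since req() stops quietly on NULL, nothing is drawn until a known gene has been entered.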

Forecasting Using Group and Regressors in Prophet

I'm trying to use the prophet library to predict y using Group and Regressors. My code and the errors I receive are below.
In Model1, I receive this error:
Error in setup_dataframe(object, df) : Regressor "x1" missing from dataframe
In Model2, the model runs, but I can't figure out how to add the regressors x1 and x2.
library(prophet)
library(dplyr)
df <- data.frame(ds = rep(c("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-05",
"2020-01-06", "2020-01-07", "2020-01-08", "2020-01-09", "2020-01-10", "2020-01-11", "2020-01-12",
"2020-01-13", "2020-01-14", "2020-01-15"), 2),
group = rep(c("A", "B"), each = 15),
y = c(8.15, 1.74, 2.97, 2.36, 0.94, 1.84, 3.17, 12.51, 0.63, 6.92, 5.51,
7.50, -2.47, 4.38, 6.28, 7.69, 2.89, 3.77, 7.27, -1.19, 4.64, 9.49, 5.43, 0.36, 14.12,
8.77, -3.05, -0.72, 10.99, 10.33),
x1 = c(3.11, 2.16, 0.91, 2.78, 0.06, 1.12, 1.73, 3.95, 1.43, 3.40, 2.37, 1.80, 0.95,
1.66, 3.06, -0.23, 3.11, 3.07, -0.39, 0.13, 4.38, 2.15, 1.61, 1.54, 5.50, 2.21,
0.89, 3.24, 4.27, 2.55),
x2 = c(2.52, -0.21, 1.03, -0.21, 0.44, 0.36 , 0.72, 4.28, -0.40, 1.76, 1.57,
2.85, -1.71, 1.36, 1.61, 3.96, -0.11 , 0.35, 3.83, -0.66, 0.13, 3.67, 1.91, -0.59, 4.31,
3.28, -1.97, -1.98, 3.36, 3.89))
df$ds <- as.Date(df$ds)
# Model 1
Model1 <- function(df) {
  m <- prophet(seasonality.mode = 'multiplicative')
  m <- add_regressor(m, 'x1')
  m <- add_regressor(m, 'x2')
  m <- fit.prophet(m, df)
  future <- make_future_dataframe(m, periods = 5, freq = 'day')
  mod1 <- predict(m, future)
  return(mod1)
}
mod1 <- df %>%
  group_by(group) %>%
  do(Model1(.)) %>%
  dplyr::select(ds, group, yhat)
# Model 2
library(prophet)
library(dplyr)
library(purrr)
library(tidyr)
Model2 <- df %>%
  nest(-group) %>%
  mutate(m = map(data, prophet)) %>%
  mutate(future = map(m, make_future_dataframe, period = 5)) %>%
  mutate(forecast = map2(m, future, predict))

Can I create scatterplots with "paired circles" in R using ggplot2

First, to clarify the title: I am trying to create a single scatterplot. The nature of my data is such that there are two of each observation, and I would like each pair of observations to be "connected" in the scatterplot via a line or arrow between the two points.
To help with the question, here's a short dataset:
structure(list(evToRevJun15 = c(4.56, 1.35, 1.26, 5.99, 2.79,
6.97, 4.9, 2.28, 1.26, 4.83, 2, 2.36, 4.91, 2.31, 2.47), evToGiJun15 = c(21.71,
5, 4.85, 23.04, 21.46, 34.85, 44.53, 12.67, 9.69, 21.96, 11.76,
19.67, 11.69, 6.42, 5.74), evToRevDec18 = c(1.99, 5.92, 2.13,
6.6, 5.84, 4.32, 6.38, 6.77, 4.92, 2.67, 4.48, 6.69, 1.36, 3.79,
2.41), evToGiDec18 = c(7.37, 24.67, 7.89, 34.74, 19.47, 15.43,
33.58, 39.84, 28.94, 11.61, 17.23, 44.6, 7.56, 8.24, 5.74)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -15L))
> head(zed)
# A tibble: 6 x 4
  evToRevJun15 evToGiJun15 evToRevDec18 evToGiDec18
         <dbl>       <dbl>        <dbl>       <dbl>
1         4.56       21.7          1.99        7.37
2         1.35        5            5.92       24.7
3         1.26        4.85         2.13        7.89
4         5.99       23.0          6.6        34.7
5         2.79       21.5          5.84       19.5
6         6.97       34.8          4.32       15.4
The two evToRev columns are for the X-axis, and the two evToGi columns are for the Y-axis, and therefore each row in the dataframe constitutes two points in the graph.
Here is an example that sort of highlights what I'm going for, but not exactly. Imagine this graph, but instead of 5 points for Messi, there would be 2 points for Messi, 2 for Angel di Maria, 2 for Neymar, etc.
Any thoughts or help on this would be great! Please let me know if I can add additional clarification.
Edit: The 2nd and 3rd graphs in this article are a better example of what I'm going for.
The first step in achieving this is reshaping the data into a format that works better with ggplot - once you've done that, the actual plotting code is pretty simple:
library(tidyverse)
# df is the data frame from the question (shown there as zed)
df_long = df %>%
  # Need an id that will keep observations together
  # once they've been split into separate rows
  mutate(id = 1:n()) %>%
  gather(key = "key", value = "value", -id) %>%
  # Time is the last five characters of the column name (e.g. "Jun15")
  mutate(Time = str_sub(key, nchar(key) - 4),
         Type = str_remove(key, Time)) %>%
  select(-key) %>%
  # In this case we don't want the data entirely
  # 'long' since evToRev and evToGi will be
  # mapped separately to x and y
  spread(Type, value)
df_long %>%
  ggplot(aes(x = evToRev, y = evToGi, colour = Time)) +
  # group aesthetic controls which points are connected
  geom_line(aes(group = id), colour = "grey40") +
  geom_point(size = 3) +
  theme_bw()
Result:
The reshaping could probably be done more neatly using tidyr::pivot_longer(),
but that was only available in the dev version when this was written (it has since been released in tidyr 1.0.0), so I've used gather and spread.
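For comparison, here is a sketch of the same reshaping with pivot_longer(), assuming tidyr 1.0.0 or later is installed. The two-row data frame zed_small is a made-up subset of the question's data, and the names_pattern regular expression is my own choice for splitting off the five-character time suffix.

```r
library(dplyr)
library(tidyr)

# Made-up two-row subset of the question's data
zed_small <- data.frame(
  evToRevJun15 = c(4.56, 1.35),
  evToGiJun15  = c(21.71, 5),
  evToRevDec18 = c(1.99, 5.92),
  evToGiDec18  = c(7.37, 24.67)
)

df_long <- zed_small %>%
  mutate(id = row_number()) %>%
  # ".value" keeps evToRev / evToGi as separate columns;
  # the last five characters of each name become Time
  pivot_longer(-id,
               names_to = c(".value", "Time"),
               names_pattern = "(.+)(.{5})$")

print(df_long)
```

The result has one row per id and Time, with evToRev and evToGi as columns, so it can be passed to the same ggplot call as above.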

Flow duration curve (fdc) extract low threshold

I am a newbie working with streamflow duration curves and the function fdc.
I am working with more than 300 series and I am interested in saving the low quartile threshold Qlow.thr value that appears in the plot generated:
Here is the reproducible example:
dat <- c(13.05, 90.29, 5.68, 49.13, 26.39, 15.06, 23.39, 17.98, 4.21, 2.51, 38.29, 8.57, 2.48 , 3.78, 18.09 ,15.16, 13.46, 8.69, 6.85, 11.97, 12.10, 9.87 ,21.89, 2.60 ,2.40, 27.40, 4.94, 83.17 ,12.10, 5.08 ,12.42, 6.19 ,3.60 ,32.58, 53.69, 38.49,3.61, 14.84, 34.48, 1.91, 21.79, 31.53, 6.70, 9.52, 22.64, 1.80 , 8.13, 10.60, 12.73, 4.17, 6.70 ,16.45)
fdc(dat,plot = T,lQ.thr=0.8,ylab='Hm3',main='Upstream monthly duration curve',thr.shw=TRUE)
The fdc function returns a vector of probabilities, but I am not sure how to convert these probabilities to the original units and select the 80% percentile value expressed in Hm3 as I would do with pnorm, for example, in case of working with normal probabilities.
Thank you so much.
You can construct the FDC yourself by
dat <- c(13.05, 90.29, 5.68, 49.13, 26.39, 15.06, 23.39, 17.98,
4.21, 2.51, 38.29, 8.57, 2.48 , 3.78, 18.09 ,15.16,
13.46, 8.69, 6.85, 11.97, 12.10, 9.87 ,21.89, 2.60,
2.40, 27.40, 4.94, 83.17 ,12.10, 5.08 ,12.42, 6.19,
3.60 ,32.58, 53.69, 38.49,3.61, 14.84, 34.48, 1.91,
21.79, 31.53, 6.70, 9.52, 22.64, 1.80 , 8.13, 10.60,
12.73, 4.17, 6.70 ,16.45)
dat <- sort(dat, decreasing = T)
df <- data.frame(x = 100/length(dat) * 1:length(dat), y = dat)
plot(x = df$x, y = df$y, type = "l", log = "y")
So the sorted flow data is simply plotted against the percentage exceedance scale. This scale is created by dividing 100% by the number of data points which gives us the increment for each point.
Therefore
quantile(dat, probs = c(0.2, 0.8), type = 1)
gives you your desired results.
Notice that the computation of the quantile differs in fdc. It seems like they just use
p <- c(0.8, 0.2)
dat[round(p * length(dat))]
> [1] 4.21 27.40
to compute the values.
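Since the question involves more than 300 series, the indexing above can be wrapped in a small function and applied to each series. This is only a sketch: flow_at_exceedance is a hypothetical name, and it mimics the rounding rule guessed at above rather than calling fdc itself.

```r
# Hypothetical helper: flow that is exceeded a fraction p of the time,
# using the same round(p * n) indexing on the decreasingly sorted flows
flow_at_exceedance <- function(q, p) {
  q_sorted <- sort(q, decreasing = TRUE)
  # note: round(p * n) can be 0 for very small p, which returns numeric(0)
  q_sorted[round(p * length(q_sorted))]
}

dat <- c(13.05, 90.29, 5.68, 49.13, 26.39, 15.06, 23.39, 17.98,
         4.21, 2.51, 38.29, 8.57, 2.48, 3.78, 18.09, 15.16,
         13.46, 8.69, 6.85, 11.97, 12.10, 9.87, 21.89, 2.60,
         2.40, 27.40, 4.94, 83.17, 12.10, 5.08, 12.42, 6.19,
         3.60, 32.58, 53.69, 38.49, 3.61, 14.84, 34.48, 1.91,
         21.79, 31.53, 6.70, 9.52, 22.64, 1.80, 8.13, 10.60,
         12.73, 4.17, 6.70, 16.45)

q_low <- flow_at_exceedance(dat, 0.8)
print(q_low)

# Applying it to several series held in a list (here the same series twice)
series_list <- list(a = dat, b = dat)
thresholds <- sapply(series_list, flow_at_exceedance, p = 0.8)
print(thresholds)
```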

How can I take log of particular columns?

I have this data
structure(list(DATE = 19620101:19620106, PRECIP = c(10.54, 6.39,
0.01, 0, 0.02, 20.94), OBS_Q = c(2.39, 2.38, 2.22, 2.24, 2.26,
5.13), swb = c(4.11, 3.92, 3.8, 3.8, 3.77, 7.16), gr4j = c(3.7,
4.24, 3.73, 3.24, 2.98, 4.93), isba = c(4.82, 3.44, 4.18, 3.01,
2.88, 6.35), noah = c(3.11, 2.84, 2.57, 2.59, 2.84, 4.83), sac = c(2.83,
2.84, 2.73, 2.87, 2.94, 6), swap = c(2.16, 2.56, 2.08, 2.19,
3.65, 4.43), vic.mm.day. = c(3.44, 3.14, 3.37, 3.15, 2.88, 4.52
)), .Names = c("DATE", "PRECIP", "OBS_Q", "swb", "gr4j", "isba",
"noah", "sac", "swap", "vic.mm.day."), row.names = c(NA, 6L), class = "data.frame")
And a function that does three things
Subsets based on what 'model' is mentioned(one of the column names of df)
Outputs dfcal with 3 columns
Outputs dfcallog with the log of those values
The function
dataprep.allcal<-function(df,model){
  dfcal<<-subset(df, select=c("DATE",model, "OBS_Q"))
  dfcallog<-subset(df, select=c("DATE",model, "OBS_Q"))
  cols<-colnames(dfcallog)
  dfcallog[cols] <<- log(dfcallog[cols])
}
dataprep.allcal(df=df,model='sac')
The problem is that this takes the log of the DATE column as well.
Can you tell me how I can apply the log only to the model and OBS_Q columns?
This is a way to do it by excluding DATE from the cols variable:
dataprep.allcal<-function(df,model){
  dfcal<<-subset(df, select=c("DATE",model, "OBS_Q"))
  #need to use the <<- operator below
  dfcallog<<-subset(df, select=c("DATE",model, "OBS_Q"))
  cols<-colnames(dfcallog)
  cols<-cols[!cols %in% 'DATE']
  dfcallog[cols] <<- log(dfcallog[cols])
}
dataprep.allcal(df=df,model='sac')
Output:
> dfcal
      DATE  sac OBS_Q
1 19620101 2.83  2.39
2 19620102 2.84  2.38
3 19620103 2.73  2.22
4 19620104 2.87  2.24
5 19620105 2.94  2.26
6 19620106 6.00  5.13
> dfcallog
      DATE      sac     OBS_Q
1 19620101 1.040277 0.8712934
2 19620102 1.043804 0.8671005
3 19620103 1.004302 0.7975072
4 19620104 1.054312 0.8064759
5 19620105 1.078410 0.8153648
6 19620106 1.791759 1.6351057
Note:
I don't know if this is a typo, but the line dfcallog<-subset(df, select=c("DATE",model, "OBS_Q")) should use the <<- operator. Creating dfcallog with <- inside the function and then modifying it with <<- in the last command produces an "object not found" error, because <<- looks for the variable in the enclosing (parent) environments rather than in the function's own environment.
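As an alternative sketch that avoids <<- altogether, the function can return both data frames in a list. The name dataprep_allcal and the three-row data frame are hypothetical; the idea is the same as above, with only the non-DATE columns logged.

```r
# Hypothetical variant: return both data frames instead of assigning globally
dataprep_allcal <- function(df, model) {
  keep <- c("DATE", model, "OBS_Q")
  dfcal <- df[, keep]
  dfcallog <- dfcal
  # take logs of every selected column except DATE
  log_cols <- setdiff(keep, "DATE")
  dfcallog[log_cols] <- lapply(dfcallog[log_cols], log)
  list(dfcal = dfcal, dfcallog = dfcallog)
}

# Small made-up example using the first rows of the question's data
df_small <- data.frame(
  DATE  = 19620101:19620103,
  OBS_Q = c(2.39, 2.38, 2.22),
  sac   = c(2.83, 2.84, 2.73)
)
res <- dataprep_allcal(df_small, model = "sac")
print(res$dfcal)
print(res$dfcallog)
```

The caller then picks out res$dfcal and res$dfcallog explicitly, so nothing in the global environment is overwritten as a side effect.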
