For the following time series data:
# 1. Dates at a 15-day frequency:
dates = seq(as.Date("2016-09-01"), as.Date("2020-07-30"), by = 15) # 96 observations
# 2. Water content in crops at the corresponding times:
water <- c(0.5702722, 0.5631781, 0.5560839, 0.5555985, 0.5519783, 0.5463459,
0.5511598, 0.546652, 0.5361545, 0.530012, 0.5360571, 0.5396569,
0.5683526, 0.6031535, 0.6417821, 0.671358, 0.7015542, 0.7177007,
0.7103561, 0.7036985, 0.6958607, 0.6775161, 0.6545367, 0.6380155,
0.6113306, 0.5846186, 0.5561815, 0.5251135, 0.5085149, 0.495352,
0.485819, 0.4730029, 0.4686458, 0.4616468, 0.4613918, 0.4615532,
0.4827496, 0.5149105, 0.5447824, 0.5776764, 0.6090217, 0.6297454,
0.6399422, 0.6428941, 0.6586344, 0.6507473, 0.6290631, 0.6011123,
0.5744375, 0.5313527, 0.5008027, 0.4770338, 0.4564025, 0.4464508,
0.4309046, 0.4351668, 0.4490393, 0.4701232, 0.4911582, 0.5162941,
0.5490387, 0.5737573, 0.6031149, 0.6400073, 0.6770058, 0.7048311,
0.7255012, 0.739107, 0.7338938, 0.7265202, 0.6940718, 0.6757214,
0.6460862, 0.6163091, 0.5743775, 0.5450822, 0.5057753, 0.4715266,
0.4469859, 0.4303232, 0.4187793, 0.4119401, 0.4201316, 0.426369,
0.4419331, 0.4757525, 0.5070846, 0.5248457, 0.5607567, 0.5859825,
0.6107531, 0.6201754, 0.6356589, 0.6336177, 0.6275579, 0.6214981)
I want to fit a double-logistic function curve to the data.
I found some examples and packages that can be of help,
https://greenbrown.r-forge.r-project.org/man/FitDoubleLogElmore.html
and an example here: "Indexes overlap error when using dplyr to run a function".
However, the examples given only consider annual time series.
I have tried to fit the function as:
x <- ts(water, start = c(2016,17), end = c(2020, 16), frequency = 24)
smooth.water = FitDoubleLogBeck(x, weighting = T, hessian = F, plot = T, ninit = 10)
plot(water)
plot(smooth.water$predicted)
plot(water - smooth.water$predicted)
However, this function does not seem to fit the entire time series. How can I run the function to fit the entire time series? Also, I noticed the output differs between runs, and I am not sure what causes that.
FitDoubleLogBeck can deal only with one year of data, so you need to analyze the data year by year: take a one-year window and fit each year separately.
As for the different results at different runs: the algorithm randomly chooses its initial parameters. The graph of a double-logistic curve is bell shaped, but you are applying the algorithm to "sine"-like data when it expects a single "bell". It then treats the water data as a cloud of points, so the results are meaningless and very sensitive to the initial parameter settings.
Code:
library(greenbrown)  # provides FitDoubleLogBeck

set.seed(123)
par(mfrow = c(1, 3))
# water vector taken from the question above
x <- ts(water, start = c(2016, 17), end = c(2020, 16), frequency = 24)
res <- sapply(2017:2019, function(year) {
  # extract one full year (24 observations) and fit it separately
  x2 <- as.vector(window(x, start = c(year, 1), end = c(year, 24)))
  smooth.water2 = FitDoubleLogBeck(x2, weighting = T, hessian = F, plot = T, ninit = 10)
  title(main = year)
  c(year = year, smooth.water2$params)
})
t(res)
Output:
year mn mx sos rsp eos rau
[1,] 2017 -0.7709318 0.17234293 16.324163 -0.6133117 6.750885 -0.7618376
[2,] 2018 -0.8900971 0.09398673 7.529345 0.6701200 17.319465 0.8277409
[3,] 2019 -4.7669470 -0.34648434 15.930455 -0.2570877 10.690043 -0.2267284
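If you also need a single fitted series for the whole period, you can collect the predicted values from each yearly fit and concatenate them. A minimal sketch, assuming FitDoubleLogBeck returns the fitted values in $predicted as used in the question:
# collect the fitted values year by year (greenbrown loaded above)
fits <- lapply(2017:2019, function(year) {
  x2 <- as.vector(window(x, start = c(year, 1), end = c(year, 24)))
  FitDoubleLogBeck(x2, weighting = T, hessian = F, plot = F, ninit = 10)$predicted
})
# concatenate into one ts aligned with the original data
fitted.full <- ts(unlist(fits), start = c(2017, 1), frequency = 24)
plot(window(x, start = c(2017, 1), end = c(2019, 24)))
lines(fitted.full, col = "red")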
I have a model, and I want to generate random initial conditions, run the model, and save the output so that each simulation is a replicate. But I have a hard time interpreting and implementing loops (and I also know they are not always the best to use in R), so I'm struggling.
My ultimate goal is to iterate the simulation across 10 different random initial conditions, and save the output of the ODE including a column for simulation number.
First I have my random initial conditions:
library(deSolve)
states <- c(r = runif(1, min = 0.1, max = 25), # resource state variable
            c = runif(1, min = 0.1, max = 10)) # consumer state variable
Then I have my parameters and model:
parameters <- c(g=5, # resource growth rate
K=25, # resource carrying capacity
a=1, # consumer attack rate
h=1, # consumer handling time
e=0.9, # consumer conversion efficiency
m=0.5, # consumer mortality rate
avgrain = 1500, # average rainfall
A = 1000, # amplitude of the rainfall oscillation
w = 0.6, # angular frequency of the rainfall oscillation
phi = 8.5, # phase shift of the rainfall oscillation
ropt1 = 1500, # optimal rainfall for resource growth
s1 = 1000, # standard deviation for plant growth rate as a function of rainfall
ropt2 = 1000, # optimal rainfall for herbivore attack (feeding) rate
s2 = 500, # standard deviation for herbivore attack rate as a function of rainfall
avgtemp = 20, # average temperature
A_temp = 7, # amplitude of the temperature oscillation
w_temp = 0.5, # angular frequency of the temperature oscillation
phi_temp = 0.5, # phase shift of the temperature oscillation
topt1 = 13, # optimal temperature for resource growth
ts1 = 10 # standard deviation for plant growth rate as a function of temperature
)
model <- function(t, states, parameters) {
  with(as.list(c(states, parameters)), {
    # rainfall time series function
    rain <- avgrain + (A*sin((w*t)+phi))
    # temperature time series function
    temp = avgtemp + (A_temp*sin((w_temp*t)+phi_temp))
    # dynamic g and a equations
    dg_both <- (exp(-(rain - ropt1)^2/(s1^2))) + (exp(-(temp - topt1)^2/(ts1^2)))
    da = exp(-(rain - ropt2)^2/(s2^2))
    # rate of change of state variables
    dr <- dg_both*r*(1-(r/K)) - ((c*da*r)/(1+(da*h*r)))
    dc <- ((c*e*da*r)/(1+(da*h*r))) - c*m
    # return rate of change
    list(c(dr, dc), rain = rain, temp = temp, dg_both = dg_both, da = da)
  })
}
times <- seq(0, 200, by = 1)
out <- ode(y = states, times = times, func = model, parms = parameters, method="lsoda")
Would I do this with a for loop? Thank you in advance!
Here is one of the other approaches mentioned by @Ben Bolker: we use replicate instead of a loop. This has the advantage that we don't need to create a list() for the results beforehand.
N <- 10
res <- replicate(N, ode(y = c(r = runif(1, min = 0.1, max = 25),
                              c = runif(1, min = 0.1, max = 10)),
                        times = times, func = model,
                        parms = parameters, method = "lsoda"),
                 simplify = FALSE)
plot(out, res)
As an additional goody, we can plot the results using deSolve's built-in plotting function; this of course also works with res from Ben's approach. The resulting data structure can then be simplified to a matrix or array, either with do.call(rbind, res) as in Ben's example or with the simplify option of replicate, as sketched below.
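For example, to flatten res into one long matrix, with or without a replicate index (a sketch building on the res object from above):
## combine all replicates into one long matrix
res_mat <- do.call(rbind, res)
## or keep a replicate index, analogous to the for-loop answer below
res_idx <- do.call(rbind, Map(cbind, run = seq_along(res), res))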
Yes, a for loop will be fine. There are lots of other slightly fancier ways to do this (replicate or lapply from base R, purrr::map_dfr from tidyverse ...), but they won't save you any time or memory — they're just a slightly more elegant way to do the same thing.
set.seed(101)
N <- 10
res <- list()
for (i in 1:N) {
  ## pick new initial conditions
  cur_states <- c(r = runif(1, min = 0.1, max = 25),
                  c = runif(1, min = 0.1, max = 10))
  ## run model and attach index column to the matrix
  res[[i]] <- cbind(run = i,
                    ode(y = cur_states, times = times, func = model,
                        parms = parameters, method = "lsoda"))
}
## combine individual runs into one long matrix
res_final <- do.call(rbind, res)
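If you prefer a data frame (e.g. for ggplot2), the combined matrix converts directly:
res_df <- as.data.frame(res_final)
head(res_df) # columns: run, time, r, c, rain, temp, dg_both, da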
I have a time series for which I need PSD values using R. The data was sampled at non-uniform intervals, but I did a spline interpolation with the predict command to interpolate readings at exactly 0.01-second intervals. I could obtain amplitude values from spec.pgram quite correctly, but they are not PSD values. However, the PSD values from the pspectrum command of the psd package only cover 0 to 0.5 Hz, while my area of interest extends to about 1.2 Hz. The time series is: here
Note that your time points are not equidistant. For the sake of this answer, we'll assume a frequency of 12 samples per second.
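If you first want to resample onto an exactly equidistant grid yourself (as you did with predict on a spline fit), base R can do this directly. A minimal sketch, assuming a two-column data.frame x with time in seconds and the reading:
# interpolate onto an even grid of 12 samples per second
grid <- seq(min(x[, 1]), max(x[, 1]), by = 1/12)
x_even <- spline(x[, 1], x[, 2], xout = grid)$y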
You have to specify the frequency for psd::pspectrum. Assuming your data is loaded as a data.frame called x:
out <- pspectrum(x[, 2], x.frqsamp = 12)
plot(out)
The pspectrum function also has a more detailed plot:
out <- pspectrum(x[, 2], x.frqsamp = 12, plot = TRUE)
Alternative
You can also use stats::spectrum, but it will require you to create a ts object:
our_ts <- ts(data = x[, 2],
             start = 0,
             frequency = 12)
plot(stats::spectrum(our_ts))
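This also explains why your original pspectrum output stopped at 0.5 Hz: x.frqsamp defaults to 1, so the spectrum only extends to the Nyquist frequency of 0.5. With your 0.01-second interpolation grid the true rate is 100 samples per second, so the Nyquist frequency is 50 Hz, well beyond the 1.2 Hz region of interest:
fs <- 1 / 0.01 # 100 samples per second after interpolation
out <- pspectrum(x[, 2], x.frqsamp = fs)
max(out$freq) # extends to the Nyquist frequency, fs/2 = 50 Hz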
EDIT: Given new dataset (freq = 100)
x <- read.csv("test2.csv", header = F)
out <- pspectrum(x[, 2], x.frqsamp = 100)
out$freq[which.max(out$spec)]
# [1] 0.265708
our_ts <- ts(data = x[, 2], start = 4, frequency = 100)
out2 <- stats::spectrum(our_ts)
out2$freq[which.max(out2$spec)]
# [1] 0.2777778
I'm trying to build a forecast to predict future values of a keyword from Google Trends data.
My data is the daily indexed search volume from Jan 1 to Jun 30, 2020 for a keyword, saved in a CSV file:
Date | Keyword
2020-01-01 | 55
2020-01-02 | 79
2020-01-03 | 29
...
2020-06-29 | 19
2020-06-30 | 32
My R code seems to work okay until it generates the forecasts.
library(forecast)
data <- read.csv("<file path>.csv", header=TRUE)
#build time series data
#start and end periods of observed data
inds <- seq(as.Date("2020-01-01"), as.Date("2020-06-30"), by = "day")
#the frequency = 7 days (i.e. week)
sts <- ts(data$Keyword, start = c(2020, as.numeric(format(inds[1], "%j"))), frequency = 7)
#generate the forecast
model.ets <- ets(sts, model = "ANA")
fc.ets <- forecast(model.ets, h = 60)
plot(fc.ets)
The problem I'm having is that the forecast simply repeats the same pattern (doesn't seem to take into account the error, trend and/or seasonality to adjust the predictions).
I think I need to adjust the forecast() function but not sure how to do it.
In this case we have a daily series, spanning less than a year, that appears to display weekly seasonality. Note that, as described in https://otexts.com/fpp2/ts-objects.html [2.1 - ts objects], the frequency given to the ts object is 52.18, which is 365.25/7, the number of weeks in a year (taking leap years into account). This seasonality rules out the use of ets models, which can't handle data with frequency greater than 24 unless used in combination with STL (Seasonal and Trend decomposition using Loess). As such, I would recommend exploring other models. The STL + ETS(A, Ad, N) point forecasts [2nd best model] look most realistic, but the range of the prediction intervals is much larger than for the TBATS(1, {0,0}, 0.92, {<52.18, 6>}) model [best model]. Please see and play around with the code below:
library(forecast)

ts_ausAirBnb <- ts(ausAirBnb$airbnb_australia_, start = min(ausAirBnb$day), frequency = 52.18)
plot(decompose(ts_ausAirBnb))
snaivefit <- snaive(ts_ausAirBnb)
snaivefcast <- forecast(snaivefit, h = 60)
aafit <- auto.arima(ts_ausAirBnb)
aafcast <- forecast(aafit, h = 60)
stlfcast <- stlf(ts_ausAirBnb, h = 60) # stlf() returns a forecast object directly
stlmfit <- stlm(ts_ausAirBnb)
stlmfcast <- forecast(stlmfit, h = 60)
tbatsfit <- tbats(ts_ausAirBnb)
tbatsfcast <- forecast(tbatsfit, h = 60)
nnetfit <- nnetar(ts_ausAirBnb)
nnetfcast <- forecast(nnetfit, h = 60)
autoplot(snaivefcast)
autoplot(aafcast)
autoplot(stlfcast)
autoplot(stlmfcast)
autoplot(tbatsfcast)
autoplot(nnetfcast)
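To decide between the candidates more formally, hold out the last few weeks and compare out-of-sample accuracy. A sketch using accuracy() from the forecast package, assuming the series spans at least two seasonal periods plus the holdout:
# hold out the last 4 weeks as a test set
n <- length(ts_ausAirBnb)
train <- window(ts_ausAirBnb, end = time(ts_ausAirBnb)[n - 28])
test <- window(ts_ausAirBnb, start = time(ts_ausAirBnb)[n - 27])

fc_stlf <- stlf(train, h = 28)
fc_tbats <- forecast(tbats(train), h = 28)

accuracy(fc_stlf, test) # the "Test set" rows give out-of-sample errors
accuracy(fc_tbats, test)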
A hidden Markov model (HMM) is one in which you observe a sequence of observations, but do not know the sequence of states the model went through to generate the observations. Analyses of hidden Markov models seek to recover the sequence of hidden states from the observed data.
I have data with both observations and hidden states (observations are of continuous values) where the hidden states were tagged by an expert. I would like to train a HMM that would be able - based on a (previously unseen) sequence of observations - to recover the corresponding hidden states.
Is there any R package to do that? The existing packages I have studied (depmixS4, HMM, seqHMM - the last for categorical data only) only allow you to specify the number of hidden states.
EDIT:
Example:
data.tagged.by.expert = data.frame(
  hidden.state = c("Wake", "REM", "REM", "NonREM1", "NonREM2", "REM", "REM", "Wake"),
  sensor1 = c(1,1.2,1.2,1.3,4,2,1.78,0.65),
  sensor2 = c(7.2,5.3,5.1,1.2,2.3,7.5,7.8,2.1),
  sensor3 = c(0.01,0.02,0.08,0.8,0.03,0.01,0.15,0.45)
)
data.newly.measured = data.frame(
  sensor1 = c(2,3,4,5,2,1,2,4,5,8,4,6,1,2,5,3,2,1,4),
  sensor2 = c(2.1,2.3,2.2,4.2,4.2,2.2,2.2,5.3,2.4,1.0,2.5,2.4,1.2,8.4,5.2,5.5,5.2,4.3,7.8),
  sensor3 = c(0.23,0.25,0.23,0.54,0.36,0.85,0.01,0.52,0.09,0.12,0.85,0.45,0.26,0.08,0.01,0.55,0.67,0.82,0.35)
)
I would like to create an HMM with discrete time t, where the random variable x(t) represents the hidden state at time t, x(t) ∈ {"Wake", "REM", "NonREM1", "NonREM2"}, and three continuous random variables sensor1(t), sensor2(t), sensor3(t) represent the observations at time t.
model.hmm = learn.model(data.tagged.by.expert)
Then I would like to use the created model to estimate hidden states responsible for newly measured observations
hidden.states = estimate.hidden.states(model.hmm, data.newly.measured)
Data (training/testing)
To be able to run the learning methods for the Naive Bayes classifier, we need a longer data set:
states = c("NonREM1", "NonREM2", "NonREM3", "REM", "Wake")
artificial.hypnogram = rep(c(5,4,1,2,3,4,5), times = c(40,150,200,300,50,90,30))
data.tagged.by.expert = data.frame(
  hidden.state = states[artificial.hypnogram],
  sensor1 = log(artificial.hypnogram) + runif(n = length(artificial.hypnogram), min = 0.2, max = 0.5),
  sensor2 = 10*artificial.hypnogram + sample(c(-8:8), size = length(artificial.hypnogram), replace = T),
  sensor3 = sample(1:100, size = length(artificial.hypnogram), replace = T)
)
hidden.hypnogram = rep(c(5,4,1,2,4,5), times = c(10,10,15,10,10,3))
data.newly.measured = data.frame(
  sensor1 = log(hidden.hypnogram) + runif(n = length(hidden.hypnogram), min = 0.2, max = 0.5),
  sensor2 = 10*hidden.hypnogram + sample(c(-8:8), size = length(hidden.hypnogram), replace = T),
  sensor3 = sample(1:100, size = length(hidden.hypnogram), replace = T)
)
Solution
In this solution, we use the Viterbi algorithm combined with a Naive Bayes classifier.
At each clock time t, a Hidden Markov Model consists of
an unobserved state (denoted as hidden.state in this case) taking a finite number of states
states = c("NonREM1", "NonREM2", "NonREM3", "REM", "Wake")
a set of observed variables (sensor1, sensor2, sensor3 in this case)
Transition matrix
A new state is entered based upon a transition probability distribution (transition matrix). This can easily be computed from data.tagged.by.expert, e.g. using
library(markovchain)
trans_p <- markovchainFit(data.tagged.by.expert$hidden.state)$estimate # estimated transition matrix
Emission matrix
After each transition is made, an observation (sensor_i) is produced according to a conditional probability distribution (emission matrix) that depends only on the current value of hidden.state. We will replace the emission matrices with a Naive Bayes classifier.
library(caret)
library(klaR)
library(e1071)
model = train(hidden.state ~ .,
              data = data.tagged.by.expert,
              method = 'nb',
              trControl = trainControl(method = 'cv', number = 10))
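caret's cross-validation output gives a first impression of how well the sensors separate the states:
model$results # cross-validated accuracy and kappa of the Naive Bayes fits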
Viterbi algorithm
To solve the problem, we use the Viterbi algorithm with an initial probability of 1 for the "Wake" state and 0 otherwise (we expect the patient to be awake at the beginning of the experiment).
# we expect the patient to be awake in the beginning
start_p = c(NonREM1 = 0,NonREM2 = 0,NonREM3 = 0, REM = 0, Wake = 1)
# Naive Bayes model
model_nb = model$finalModel
# the observations
observations = data.newly.measured
nObs <- nrow(observations) # number of observations
nStates <- length(states) # number of states
# T1, T2 initialization
T1 <- matrix(0, nrow = nStates, ncol = nObs) #define two 2-dimensional tables
row.names(T1) <- states
T2 <- T1
# emission probabilities for the first observation
Byj <- predict(model_nb, newdata = observations[1, ])$posterior
# init first column of T1
for (s in states)
  T1[s, 1] = start_p[s] * Byj[1, s]
# fill T1 and T2 tables
for (j in 2:nObs) {
  Byj <- predict(model_nb, newdata = observations[j, ])$posterior
  for (s in states) {
    res <- (T1[, j-1] * trans_p[, s]) * Byj[1, s]
    T2[s, j] <- states[which.max(res)]
    T1[s, j] <- max(res)
  }
}
# backtrack the best path
result <- rep("", times = nObs)
result[nObs] <- names(which.max(T1[, nObs]))
for (j in nObs:2) {
  result[j-1] <- T2[result[j], j]
}
# show the result
result
# show the original artificial data
states[hidden.hypnogram]
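A quick sanity check is the per-time-point agreement between the decoded path and the artificial ground truth:
mean(result == states[hidden.hypnogram]) # share of correctly decoded states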
References
To read more about the problem, see Vomlel, Jiří and Kratochvíl, Václav: Dynamic Bayesian Networks for the Classification of Sleep Stages, Proceedings of the 11th Workshop on Uncertainty Processing (WUPES'18), pp. 205-215, Eds: Kratochvíl, Václav and Vejnarová, Jiřina (Třeboň, CZ, 2018).
I have a seasonal time series (7-day period): daily data for 30 days.
What is the best approach for a reasonable forecast?
The time series contains orders made with an app; it shows a seasonality of 1 week (lower sales at the beginning of the week).
I tried the Holt-Winters approach with this code:
(m <- HoltWinters(ts,seasonal = "mult"))
plot(m)
plot(fitted(m))
but it gives me an error like:
Error in decompose(ts(x[1L:wind], start = start(x), frequency = f), seasonal) :
  time series has no or less than 2 periods
What do you suggest?
EDIT:
data here
You must first create a ts object. Assuming your data is called df:
ts <- ts(df$install, frequency = 7)
(m <- HoltWinters(ts,seasonal = "mult"))
plot(m)
plot(fitted(m))
Then you can make prediction like (10 steps-ahead):
predict(m, n = 10)
Time Series:
Start = c(4, 5)
End = c(5, 7)
Frequency = 7
fit
[1,] 1028.8874
[2,] 1178.4244
[3,] 1372.5466
[4,] 1165.2337
[5,] 866.6185
[6,] 711.6965
[7,] 482.2550
[8,] 719.0593
[9,] 807.6147
[10,] 920.3250
The question about the best method is too difficult to answer in general. Usually one compares the performance of different models on their out-of-sample accuracy and chooses the one with the best result, as sketched below.
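For example, with 30 daily observations you can hold out the final week and compare Holt-Winters against a seasonal naive benchmark. A sketch using the forecast package and the ts object created above:
library(forecast)

train <- window(ts, end = c(4, 2)) # first 23 days (frequency 7)
test <- window(ts, start = c(4, 3)) # final 7 days

hw_fc <- forecast(HoltWinters(train, seasonal = "mult"), h = 7)
sn_fc <- snaive(train, h = 7)

accuracy(hw_fc, test) # compare the "Test set" error rows
accuracy(sn_fc, test)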
You can use df$data to keep the dates that correspond to each day in the ts series.
ts_series <- ts(df$install, frequency = 7)
ts_dates <- as.Date(df$data, format = "%d/%m/%Y")
In a similar way, dates for the forecasted values can be kept in another sequence
m <- HoltWinters(ts_series, seasonal = "mult")
predict_values <- predict(m, 10)
predict_dates <- seq.Date(tail(ts_dates, 1) + 1, length.out = 10, by = "day")
With the date sequence, the daily series can be plotted with correctly formatted dates on the x axis. More control over the x-axis ticks can be obtained with the axis.Date function, as shown after the base plot below.
plot(ts_dates, ts_series, type = "o",
     ylim = c(0, 4000),
     xlim = c(ts_dates[1], tail(predict_dates, 1)),
     xlab = "Date", ylab = "install", las = 1)
lines(predict_dates, predict_values, lty = 2, col = "blue", lwd = 2)
grid()
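For instance, monthly ticks with custom labels (suppress the default axis with xaxt = "n", then draw it with axis.Date):
plot(ts_dates, ts_series, type = "o", xaxt = "n",
     ylim = c(0, 4000),
     xlim = c(ts_dates[1], tail(predict_dates, 1)),
     xlab = "Date", ylab = "install", las = 1)
axis.Date(1, at = seq(min(ts_dates), max(predict_dates), by = "month"),
          format = "%d %b")
lines(predict_dates, predict_values, lty = 2, col = "blue", lwd = 2)
grid()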