This might be more of a math question than an R question, but here goes...
I'm trying to generate a low frequency oscillator (LFO2) whose frequency is controlled by another low frequency oscillator (LFO1). LFO1 has a frequency of 0.02 Hz, while I want LFO2 to have a frequency that oscillates between 0.00 and 0.11 Hz depending on the output of LFO1.
# length in seconds
track_length <- 356
upsample <- 10 # upsample the signal
# LFO rates (Hz)
rate1 <- 0.02
rate2_range <- list(0.00, 0.11)
# make plot of LFO1
x1 <- 1:(track_length*upsample)/upsample
amp <- (rate2_range[[2]] - rate2_range[[1]])/2
y1 <- amp*cos(2*pi*rate1*x1) + amp
plot(x1, y1, type='l')
The variable frequency for LFO2 generated by LFO1 looks exactly as I expected.
So I go on to make LFO2 using the output of LFO1 like so:
# make plot of LFO2
x2 <- x1
y2 <- cos(2*pi*y1*x2)
plot(x2, y2, type='l')
However, the output of LFO2 is not what I expected... It seems to keep getting faster, and it also has some peaks that don't swing over the full range. I don't understand this, as the only thing I'm adjusting is the frequency, and it should never exceed 0.11 Hz. At first I thought it might be an undersampling issue, but I get the same results when upsampling the time series to any degree.
Any idea what I'm missing here?
The "frequency" of cos(f(t)) is not f(t). It's the derivative of f(t).
You have:
y1(t) = A*cos(2πf1t) + A
y2(t) = cos(2πy1(t))
If the frequency you want is A*cos(2πf1t) + A, then you need to integrate that to get the argument to cos:
y1(t) = A*sin(2πf1t)/(2πf1) + A*t
y2(t) = cos(2πy1(t))
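In general, for a signal cos(2πφ(t)) the instantaneous frequency is φ'(t), so to realize a target frequency f(t) you set φ(t) = ∫₀ᵗ f(τ) dτ; the y1(t) above is exactly that integral for f(t) = A*cos(2πf1t) + A.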
In R:
# length in seconds
track_length <- 356
upsample <- 10 # upsample the signal
# LFO rates (Hz)
rate1 <- 0.02
rate2_range <- list(0.00, 2)
# make integral of LFO1
x1 <- 1:(track_length*upsample)/upsample
amp <- (rate2_range[[2]] - rate2_range[[1]])/2
y1 <- amp*sin(2*pi*rate1*x1)/(2*pi*rate1) + amp*x1
plot(x1, y1, type='l')
# make plot of LFO2
x2 <- x1
y2 <- cos(2*pi*y1) # x1 is already in seconds, so y1 is already the phase in cycles
plot(x2, y2, type='l')
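If the modulating frequency is ever something that can't be integrated in closed form, the phase can also be built numerically. A minimal sketch using the same variables as above, approximating the integral with a cumulative sum scaled by the sample spacing 1/upsample:
# instantaneous frequency produced by LFO1 (in Hz)
inst_freq <- amp*cos(2*pi*rate1*x1) + amp
# numeric integral of the frequency: phase in cycles
phase <- cumsum(inst_freq)/upsample
y2_num <- cos(2*pi*phase)
plot(x1, y2_num, type='l')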
You are not rescaling the result by amp as you did in the first plot, so it is normal to see the cos output alternating between -1 and 1. You need to rescale the formula using max(y1) and min(y1). So the code below,
y2 <- vector()
amp <- (max(y1) - min(y1))/2
for(i in 1:length(y1)) {
y2[i] <- amp * cos(2*pi*y1[i]*x2[i]) + amp # x2[i], not the whole x2 vector
}
plot(x2, y2, type='l',col="blue")
grid(nx = NULL, ny = NULL, col = "lightgray", lty = "dotted")
gives this plot,
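As an aside, cos is vectorized in R, so the loop is not needed; assuming the element-wise pairing of y1 and x2 is what is intended, the same y2 comes from a one-liner:
y2 <- amp * cos(2*pi*y1*x2) + amp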
After running my logistic regression, I am to plot the data I have created, and then plot the line described by the coefficients of my model.
library("dplyr") # filter and reformat data frames
library("tidyr") # make data tidy
# Create a vector with 200 elements
df_vector <- c(1:200)
# For the first 100 variables,
for (i in 1: 100) {
# Set the normal distribution with mean -5 and sd 5
v1 <- rnorm(100, -5,5)
# For the second 100
for (i in 101:200)
# Set the normal distribution with mean 4 and sd 8
v2 <- rnorm(100,4,8)
}
# Input the values of v1 and v2 into a data frame
df_vector <- data.frame(v1,v2)
# Combine the 2 columns of data into one single vector with a length of 200
Uni_dataframe <- data.frame(d = unlist(df_vector, use.names = FALSE))
# Create another vector with first half of the values = 0 and second half = 1
resp_vector <- c(1:200)
# For the first 100 variables
for (i in 1:100) {
# Set the first 100 variables = 0
resp1 <-(rep(0, times = 100 ))
# For the second 100 variables
for (i in 101:200)
# Set the second 100 variables = 1
resp2 <- (rep(1, times = 100 ))
}
# Input both resp1 and resp2 into a data frame
resp_vector <- data.frame(resp1, resp2)
# Combine the 2 columns of data into a single vector with length 200
Uni_resp_vector <- data.frame (d = unlist(resp_vector))
The 2 vectors I have created above are Uni_dataframe and Uni_resp_vector.
I am then tasked to run a logistic regression according to the following code and function.
log.likelihood(params, x, y)
logistic.fit(xtrain, ytrain)
logistic.predict(xtest, fit)
log.likelihood = function(params, x, y) {
x = cbind(rep(1, nrow(x)), x)
Bx.sum = params %*% t(x)
t1 = sum((1-y)*Bx.sum)
t2 = sum(log(1+sapply(-Bx.sum, exp, simplify=T)))
likelihood = -(t1+t2)
return(-likelihood)
}
model <- optim(params = Uni_dataframe, fn = Uni_resp_vector, log.likelihood, data, method='BFGS')
However, I am unsure of what arguments to put in optim(), and I am getting this error from optim: cannot coerce type 'closure' to vector of type 'double'. Is there an issue with the 'data' argument?
Any help in churning out a logistic regression model would be appreciated. Any suggestions on how it should be plotted? Thank you!
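For reference, optim's first argument, par, is the vector of starting parameter values, and fn is the function to be minimized; any further data are passed through its ... argument. A minimal sketch consistent with the log.likelihood signature above (the starting values c(0, 0) are an arbitrary choice, not part of the assignment):
x <- as.matrix(Uni_dataframe) # 200 x 1 predictor matrix
y <- Uni_resp_vector$d        # 200 binary responses
model <- optim(par = c(0, 0), fn = log.likelihood, x = x, y = y, method = "BFGS")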
I constructed a binary logistic model. The response variable is binary. There are 4 regressors - 2 binary and 2 integer. I want to find the outliers and delete them. For this I have created some plots:
par(mfrow = c(2,2))
plot(hat.ep,rstudent.ep,col="#E69F00", main="hat-values versus studentized residuals",
xlab="Hat value", ylab="Studentized residual")
dffits.ep <- dffits(model_logit)
plot(id,dffits.ep,type="l", col="#E69F00", main="Index Plot",
xlab="Identification", ylab="DFFITS")
cov.ep <- covratio(model_logit)
plot(id,cov.ep,type="l",col="#E69F00", main="Covariance Ratio",
xlab="Identification", ylab="Covariance Ratio")
cook.ep <- cooks.distance(model_logit)
plot(id,cook.ep,type="l",col="#E69F00", main="Cook's Distance",
xlab="Identification", ylab="Cook's Distance")
According to the plots there is an outlier. How can I identify which observation is the outlier?
I have tried :
> outlierTest(model_logit)
No Studentized residuals with Bonferonni p < 0.05
Largest |rstudent|:
rstudent unadjusted p-value Bonferonni p
1061 1.931043 0.053478 NA
Are there some other functions for outlier detection?
Well, this answer comes quite late; I'm unsure whether you have found an answer already. In the absence of a minimal reproducible example, I'll attempt to answer the question using some dummy data and two custom functions. For a given continuous variable, outliers are commonly defined as observations lying more than 1.5*IQR below the 25th or above the 75th percentile, where IQR, the 'Inter Quartile Range', is the difference between the 75th and 25th percentiles. The functions below use an even simpler rule, flagging values greater than 3 standard deviations (an IQR-based sketch follows at the end). I also recommend you see this post, which contains far better solutions than my crude answer.
> df <- data.frame(X = c(NA, rnorm(1000), runif(20, -20, 20)), Y = c(runif(1000),rnorm(20, 2), NA), Z = c(rnorm(1000, 1), NA, runif(20)))
> head(df)
X Y Z
1 NA 0.8651 0.2784
2 -0.06838 0.4700 2.0483
3 -0.18734 0.9887 1.8353
4 -0.05015 0.7731 2.4464
5 0.25010 0.9941 1.3979
6 -0.26664 0.6778 1.1277
> boxplot(df$Y) # notice the outliers above the top whisker
Now I'll create one custom function to detect the outliers and another to replace the outlier values with NA.
# this function will return the indices of the outlier values
> findOutlier <- function(data, cutoff = 3) {
## Calculate the sd
sds <- apply(data, 2, sd, na.rm = TRUE)
## Identify the cells with value greater than cutoff * sd (column-wise);
## note that this flags only unusually large values, not unusually small ones
result <- mapply(function(d, s) {
which(d > cutoff * s)
}, data, sds)
result
}
# check for outliers
> outliers <- findOutlier(df)
# custom function to remove outliers
> removeOutlier <- function(data, outliers) {
result <- mapply(function(d, o) {
res <- d
res[o] <- NA
return(res)
}, data, outliers)
return(as.data.frame(result))
}
> filterData <- removeOutlier(df, outliers)
> boxplot(filterData$Y)
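Finally, here is the IQR-based sketch promised above - my own illustration rather than a package function - again applied column-wise to df:
# return, for each column, the indices outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
findOutlierIQR <- function(data) {
  lapply(data, function(d) {
    qs <- quantile(d, probs = c(0.25, 0.75), na.rm = TRUE)
    iqr <- qs[2] - qs[1]
    which(d < qs[1] - 1.5*iqr | d > qs[2] + 1.5*iqr)
  })
}
outliersIQR <- findOutlierIQR(df)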
I am trying to run a Monte Carlo simulation of a difference in differences estimator, but I am running into an error. Here is the code I am running:
# Set the random seed
set.seed(1234567)
library(MonteCarlo)
#Set up problem, doing this before calling the function
# set sample size
n<- 400
# set true parameters: betas and sd of u
b0 <- 1 # intercept for control data (b0 in diffndiff)
b1 <- 1 # shift on both control and treated after treatment (b1 in diffndiff)
b2 <- 2 # difference between intercept on control vs. treated (b2 - this is the level difference pre-treatment to compare to the coef on treat)
b3 <- 3 # shift after treatment that is only for the treated group (b3 - this is the coefficient of interest in diffndiff)
b4 <- 0 # parallel time trend (not measured in diffndiff); biases b0, b1 but not the b3 we care about
b5 <- 0 # allows for the treated group trend to shift after treatment (0 if parallel trends holds)
su <- 4 # std. dev for errors
dnd <- function(n,b0,b1,b2,b3,b4,b5,su){
#initialize a time vector (set observations equal to n)
timelength = 10
t <- c(1:timelength)
num_obs_per_period = n/timelength # allows for multiple observations in one time period (can simulate multiple states within one group or something)
t0 <- c(1:timelength)
for (p in 1:(num_obs_per_period-1)){
t <- c(t,t0)
}
T <- 5 # set treatment period (note: this masks R's TRUE shorthand T)
g <- t > T
post <- as.numeric(g)
# assign equal amounts of observations to each group to start with (would like to allow selection into treatment at some point)
treat <- c(rep(0, round(n/2)), rep(1, round(n/2)))
u <- rnorm(n,0,su) #This assumes the mean error is zero
#create my y vector now from the data
y<- b0 + b1*post + b2*treat + b3*treat*post + b4*t + b5*(t-T)*treat*post +u
interaction <- treat*post
#run regression
olsres <- lm(y ~ post + treat + interaction)
olsres$coefficients
# assign the coefficients
bhat0<- olsres$coefficients[1]
bhat1 <- olsres$coefficients[2]
bhat2<- olsres$coefficients[3]
bhat3<- olsres$coefficients[4]
bhat3_stderr <- coef(summary(olsres))[4, "Std. Error"] # row 4 is the interaction term
# Here I will use bhat3 to conduct a t-test and determine if this was a pass or a fail
tval <- (bhat3-b3)/ bhat3_stderr
# decision at the 5% significance level (FALSE indicates |t| < 1.96, so we fail to reject the null)
decision <- abs(tval) > 1.96
decision <- unname(decision)
return(list(decision))
}
#Define a parameter grid to simulate over
from <- -5
to <- 5
increment <- .25
gridparts<- c(from , to , increment)
b5_grid <- seq(from = gridparts[1], to = gridparts[2], by = gridparts[3])
parameter <- list("n" = n, "b0" = b0, "b1" = b1, "b2" = b2, "b3" = b3,
                  "b4" = b4, "b5" = b5_grid, "su" = su)
#Now simulate this multiple times in a monte carlo setting
results <- MonteCarlo(func = dnd ,nrep = 100, param_list = parameter)
And the error that comes up is:
Error in results[[i]] <- array(NA, dim = c(dim_vec, nrep)) :
  attempt to select less than one element in integerOneIndex
This leads me to believe that something, somewhere, is attempting to access the "0th" element of a vector, which doesn't exist in R as far as I understand. However, I can't tell whether this comes from my code or from inside the package, and I can't make sense of the code that runs when I call the package.
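For what it's worth, a zero index in an [[<- assignment reproduces that message exactly:
l <- list()
l[[0]] <- NA # Error: attempt to select less than one element in integerOneIndex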
I am also open to hearing about other methods that will essentially replace simulate() from Stata.
The function passed to MonteCarlo must return a list with named components. Changing the return statement at the end of dnd to
return(list("decision" = decision))
should work.
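Once it runs, the results over the b5 grid can be tabulated with the package's MakeTable function; if I remember the API correctly, something along these lines:
results <- MonteCarlo(func = dnd, nrep = 100, param_list = parameter)
MakeTable(output = results, rows = "b5", cols = "n", digits = 2)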
I am trying to nowcast a time series (Y) using another time series (X) as a predictor. X and Y are cointegrated. Y is monthly data from Jan 2012 to Oct 2016 and X runs from Jan 2012 to Feb 2017.
So I ran a VECM as shown in this video: https://www.youtube.com/watch?v=x9DcUA9puY0
Then, to obtain predicted values, I transformed it into a VAR with the vec2var command, following this topic: https://stats.stackexchange.com/questions/223888/how-to-forecast-from-vecm-in-r
But I cannot forecast Y with known X the way it can be done with the predict function for a linear regression model. I also cannot obtain the modelled Y (Y hat) values.
This is my code:
# Cointegrated_series is a ZOO object, which contains two time series X and Y
library("zoo")
library("xts")
library("urca")
library("vars")
# Obtain lag length
Lagl <- VARselect(Cointegrated_series)$selection[[1]]
#Conduct Eigen test
cointest <- ca.jo(Cointegrated_series,K=Lagl,type = "eigen", ecdet = "const",
spec = "transitory")
#Fit VECM
vecm <- cajorls(cointest)
#Transform VECM to VAR
var <- vec2var(cointest)
Then I tried to use the predict function in different ways: predict(var), predict(var, newdata = 50), predict(var, newdata = 1000) - the result is the same.
I also tried the tsDyn package and the newdata argument of the predict method, as mentioned here: https://stats.stackexchange.com/questions/261849/prediction-from-vecm-in-r-using-external-forecasts-of-regressors?rq=1
It is not working. My newdata is a ZOO object where the X series has values from Nov 2016 to Feb 2017 and the Y series is all NAs, so the method returns NAs in the forecast:
# Cointegrated_series is a ZOO object, which contains
#two time series X and Y from Jan 2012 to Oct 2016. Both X and Y are values.
# newDat is a ZOO object, which contains two time series
#X and Y from Nov 2016 to Feb 2017. X are values, Y are NAs.
library(tsDyn)
vecm <-VECM(Cointegrated_series, lag=2)
predict(vecm,newdata = newDat, n.ahead=5)
This is a result:
Y X
59 NA NA
60 NA NA
61 NA NA
62 NA NA
63 NA NA
For example, this is what I get after calling predict without the newdata argument:
predict(vecm, n.ahead=5)
Y X
59 65.05233 64.78006
60 70.54545 73.87368
61 75.65266 72.06513
62 74.76065 62.97242
63 70.03992 55.81045
So, my main questions are:
How to nowcast Y with known X, using VEC model in R?
How to obtain modelled Y (Y hat) values?
Besides that, I also couldn't find an answer on these questions:
How to obtain the Akaike criterion (AIC) for a VECM in R?
Do the vars and urca packages provide F and t statistics for VECM?
UPD 10.04.2017
I slightly edited the question. I noticed that my problem is a "ragged edge" problem, so it's incorrect to call it "forecasting" - it is "nowcasting".
UPD 11.04.2017
Thank you for answering!
Here is the full code:
library("lubridate")
library("zoo")
library("xts")
library("urca")
library("vars")
library("forecast")
Dat <- dget(file = "https://getfile.dokpub.com/yandex/get/https://yadi.sk/d/VJpQ75Rz3GsDKN")
NewDat <- dget(file = "https://getfile.dokpub.com/yandex/get/https://yadi.sk/d/T7qxxPUq3GsDLc")
Lagl <- VARselect(Dat)$selection[[1]]
#vars package
cointest_e <- ca.jo(Dat,K=Lagl,type = "eigen", ecdet = "const",
spec = "transitory")
vecm <- cajorls(cointest_e)
var <- vec2var(cointest_e)
Predict1 <- predict(var)
Predict2 <- predict(var, newdata = NewDat)
Predict1$fcst$Y
Predict2$fcst$Y
Predict1$fcst$Y == Predict2$fcst$Y
Predict1$fcst$X == Predict2$fcst$X
#As we see, Predict1 and Predict2 are identical, so the information in NewDat
#was not taken into account.
library("tsDyn")
vecm2 <-VECM(Dat, lag=3)
predict(vecm2)
predict(vecm2, newdata=NewDat)
If dget returns an error, please download my data here:
https://yadi.sk/d/VJpQ75Rz3GsDKN - for Dat
https://yadi.sk/d/T7qxxPUq3GsDLc - for NewDat
About nowcasting
By nowcasting I mean current-month or previous-month forecasts of unavailable data using currently available data. Here are some references:
Giannone, Reichlin, Small: Nowcasting: The real-time informational content of macroeconomic data (2008)
Bańbura, Giannone, Modugno, Reichlin: Now-Casting and the Real-time Data Flow (2013)
Marcellino, Schumacher: Factor MIDAS for Nowcasting and Forecasting with Ragged-Edge Data: A Model Comparison for German GDP (2010)
I feel your question is more about how to do nowcasting for cointegrated variables, so let's see later how to implement it in R.
In general, according to Granger's representation theorem, cointegrated variables can be represented in multiple forms:
Long term relationship: contemporaneous values of y and x
VECM representation: (diffs of) y and x explained by (diffs of) their lags, and the error-correction term from the previous period.
So I am not sure how you would do nowcasting in the VECM representation, since it includes only past values. I can see two possibilities:
Do nowcasting based on the long-term relationship. So you just run standard OLS, and predict from there.
Do nowcasting based on a structural VECM, where you add contemporaneous values of the variables you know (X). In R, you would do this with package urca; you need, though, to check whether the predict function will allow you to add known X values.
Regarding the long-term relationship approach, what is interesting is that you can obtain forecasts for X and Y based on the VECM (without known X) and from the long-term relationship with known X. This gives you a way to gauge the accuracy of your model (comparing known and predicted X), which you could use to create a forecast averaging scheme for your Y; see the sketch below.
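To make the first option concrete, here is a minimal sketch of nowcasting from the long-term relationship, assuming Dat and NewDat carry the columns Y and X as in your code:
# Engle-Granger style first step: estimate the long-run relation by OLS on the levels
lt <- lm(Y ~ X, data = as.data.frame(Dat))
# nowcast Y over the ragged edge, where only X is known
y_hat <- predict(lt, newdata = as.data.frame(NewDat))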
First of all, thank you so much @Matifou for your awesome package.
I am late in responding, but I was also struggling with the same question and did not find a solution. That is why I implemented the following function; I hope it will be useful to some people:
#' @title Special predict method for VECM models
#' @description Predict method for VECM models for the case where all the endogenous
#' variables but one are known. For the moment it is only valid for one cointegration equation.
#' @param object an object of class 'VECM'
#' @param new_data a data frame containing the forecast of all the endogenous variables
#' but one; if there are exogenous variables, their forecast must be provided too.
#' @param predicted_var a string with the endogenous variable to be predicted.
#' @return A list with the predicted variable, the predicted values and a data frame with the
#' detailed values used for the construction of the forecast.
#' @examples
#' data(zeroyld)
#' # Fit a VECM with Johansen MLE estimator:
#' vecm.jo <- VECM(zeroyld, lag=2, estim="ML")
#' predict.vecm(vecm.jo, new_data = data.frame("long.run" = c(7:10)), predicted_var = "short.run")
predict.vecm <- function(object, new_data, predicted_var){
if (inherits(object, "VECM")) {
# Only valid for VECM models
# Get endogenous and exogenous variables
summary_vecm <- summary(object)
model_vars <- colnames(object$model)
endovars <- sub("Equation ", "", rownames(summary_vecm$bigcoefficients))
if (!(predicted_var %in% endovars)) {
stop("You must provide a valid endogenus variable.")
}
exovars <- NULL
if (object$exogen) {
ind_endovars <-
unlist(sapply(endovars, function(x) grep(x, model_vars), simplify = FALSE))
exovars <- model_vars[-ind_endovars]
exovars <- exovars[exovars != "ECT"]
}
# First step: join new_data and the last (lag + 1) values of the calibration data
new_data <- data.frame(new_data)
if (!all(colnames(new_data) %in% c(endovars, exovars))) {
stop("new_data must have valid endogenus or exogenus column names.")
}
# Endogenous variables except the one to be predicted
endovars2 <- endovars[endovars != predicted_var]
new_data <- new_data[, c(endovars2, exovars), drop = FALSE]
new_data <- cbind(NA, new_data)
colnames(new_data) <- c(predicted_var, endovars2, exovars)
# Previous values to obtain lag values and first differences (lags + 1)
dt_tail <- data.frame(tail(object$model[, c(endovars, exovars), drop = FALSE], object$lag + 1))
new_data <- rbind(dt_tail, new_data)
# Second step: get the long-run relationship forecast (ECT term)
ect_vars <- rownames(object$model.specific$beta)
if ("const" %in% ect_vars) {
new_data$const <- 1
}
ect_coeff <- object$model.specific$beta[, 1]
new_data$ECT_0 <-
apply(sweep(new_data[, ect_vars], MARGIN = 2, ect_coeff, `*`), MARGIN = 1, sum)
# Get ECT-1 (Lag 1)
new_data$ECT <- as.numeric(quantmod::Lag(new_data$ECT_0, 1))
# Third step: get differences of the endogenous and exogenous variables provided in new_data
diff_data <- apply(new_data[, c(endovars, exovars)], MARGIN = 2, diff)
colnames(diff_data) <- paste0("DIFF_", c(endovars, exovars))
diff_data <- rbind(NA, diff_data)
new_data <- cbind(new_data, diff_data)
# Fourth step: get the lags of the differenced endogenous variables
for (k in 1:object$lag) {
iter <- myLag(new_data[, paste0("DIFF_", endovars)], k)
colnames(iter) <- paste0("DIFF_", endovars, " -", k)
new_data <- cbind(new_data, iter)
}
# Fifth step: recursive calculation
vecm_vars <- colnames(summary_vecm$bigcoefficients)
if ("Intercept" %in% vecm_vars) {
new_data$Intercept <- 1
}
vecm_vars[!(vecm_vars %in% c("ECT", "Intercept"))] <-
paste0("DIFF_", vecm_vars[!(vecm_vars %in% c("ECT", "Intercept"))])
equation <- paste("Equation", predicted_var)
equation_coeff <- summary_vecm$coefficients[equation, ]
predicted_var2 <- paste0("DIFF_", predicted_var)
for (k in (object$lag + 2):nrow(new_data)) {
# Estimate y_diff
new_data[k, predicted_var2] <-
sum(sweep(new_data[k, vecm_vars], MARGIN = 2, equation_coeff, `*`))
# Estimate y_diff lags
for (j in 1:object$lag) {
new_data[, paste0(predicted_var2, " -", j)] <-
as.numeric(quantmod::Lag(new_data[, predicted_var2], j))
}
# Estimate y
new_data[k, predicted_var] <-
new_data[(k - 1), predicted_var] + new_data[k, predicted_var2]
# Estimate ECT
new_data[k, "ECT_0"] <- sum(sweep(new_data[k, ect_vars], MARGIN = 2, ect_coeff, `*`))
if (k < nrow(new_data)) {
new_data[k + 1, "ECT"] <- new_data[k, "ECT_0"]
}
}
predicted_values <- new_data[(object$lag + 2):nrow(new_data), predicted_var]
} else {
stop("You must provide a valid VECM model.")
}
return(
list(
predicted_variable = predicted_var,
predicted_values = predicted_values,
data = new_data
)
)
}
# Lag function applied to dataframes
myLag <- function(data, lag) data.frame(unclass(data[c(rep(NA, lag), 1:(nrow(data) - lag)), ]))
@Andrey Goloborodko, in your example, you should apply:
NewDat <- NewDat[-1,] # only the new data needs to be provided
predict.vecm(vecm2, new_data=NewDat, predicted_var = "Y")
# $predicted_variable
# [1] "Y"
#
# $predicted_values
# [1] 65.05233 61.29563 59.45109
#
# $data
# Y X ECT_0 ECT DIFF_Y DIFF_X DIFF_Y -1 DIFF_X -1 DIFF_Y -2
# jul. 2016 92.40506 100 -29.0616718 NA NA NA NA NA NA
# ago. 2016 94.03255 78 -0.7115037 -29.0616718 1.627486 -22 NA NA NA
# sep. 2016 78.84268 53 14.4653067 -0.7115037 -15.189873 -25 1.627486 -22 NA
# oct. 2016 67.99277 52 4.8300645 14.4653067 -10.849910 -1 -15.189873 -25 1.627486
# nov. 2016 65.05233 51 3.1042967 4.8300645 -2.940435 -1 -10.849910 -1 -15.189873
# dic. 2016 61.29563 50 0.5622618 3.1042967 -3.756702 -1 -2.940435 -1 -10.849910
# ene. 2017 59.45109 55 -7.3556104 0.5622618 -1.844535 5 -3.756702 -1 -2.940435
# DIFF_X -2 DIFF_Y -3 DIFF_X -3 Intercept
# jul. 2016 NA NA NA 1
# ago. 2016 NA NA NA 1
# sep. 2016 NA NA NA 1
# oct. 2016 -22 NA NA 1
# nov. 2016 -25 1.627486 -22 1
# dic. 2016 -1 -15.189873 -25 1
# ene. 2017 -1 -10.849910 -1 1