I am using the R package MARSS to run a dynamic factor analysis (DFA). I have 8 time series, and every series has at least 1 NA value (ranging from 1 to 20 NAs out of 50 years per series).
When I ran the model with just 23 years of data (the years in which no series had an NA), it reached both Abstol and log-log convergence after 293,368 iterations (maxit was set to 1,000,000). However, when I tried again with the full time series, I got only Abstol convergence after 1,000,000 iterations, and the run took 2 days.
I can't find any guidance on how many NA values a DFA can handle, or on what is typically used for maxit. Are there any tools to determine whether a time series has too many NA values for a DFA?
Here is how I have specified the model. Note: I haven't provided any data because I don't think anyone wants to run this model given how long it currently takes.
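For reference, the per-series NA counts quoted above were tallied with plain base R on the 8 x 50 matrix that gets passed to MARSS below:
na_per_series <- rowSums(is.na(data))  # NA count for each of the 8 time series
sum(is.na(data))                       # 84 NAs in total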
library(MARSS)
listMod = list(m = mm, R = "diagonal and unequal")  # mm = number of latent trends (set earlier)
listInit = list(x0 = matrix(rep(0, mm), mm, 1))
listCont = list(maxit = 1000000, allow.degen = TRUE)
dfa1 <- MARSS(y = data,        # matrix with 50 columns (years) & 8 rows (one per time series); 84 NA values
              form = "dfa",
              z.score = FALSE, # each time series was centred and scaled (mean = 0, sd = 1) while preparing the dataset
              model = listMod,
              inits = listInit,
              control = listCont)
Results:
Warning! Abstol convergence only. Maxit (=1e+06) reached before log-log convergence.
Alert: Numerical warnings were generated. Print the $errors element of output to see the warnings.
MARSS fit is
Estimation method: kem
Convergence test: conv.test.slope.tol = 0.5, abstol = 0.001
WARNING: Abstol convergence only no log-log convergence.
maxit (=1e+06) reached before log-log convergence.
The likelihood and params might not be at the ML values.
Try setting control$maxit higher.
Convergence warnings
2998019 warnings. First 10 shown. Type cat(object$errors) to see the full list.
Warning: the R.(Y1,Y1) parameter value has not converged.
Warning: the R.(Y2,Y2) parameter value has not converged.
Warning: the R.(Y7,Y7) parameter value has not converged.
Warning: the logLik parameter value has not converged.
Type MARSSinfo("convergence") for more info on this warning.
MARSSkem warnings. Type MARSSinfo() for help.
iter=412 Setting element of R to 0, blocked. See MARSSinfo("R0blocked"). The error is due to the following MARSSkemcheck errors.
MARSSkemcheck error: t=1: For method=kem (EM), if an element of the diagonal of R is 0, the corresponding row of Z must be fixed. See MARSSinfo('AZR0').
iter=413 Setting element of R to 0, blocked. See MARSSinfo("R0blocked"). The error is due to the following MARSSkemcheck errors.
MARSSkemcheck error: t=1: For method=kem (EM), if an element of the diagonal of R is 0, the corresponding row of Z must be fixed. See MARSSinfo('AZR0').
iter=414 Setting element of R to 0, blocked. See MARSSinfo("R0blocked"). The error is due to the following MARSSkemcheck errors.
MARSSkemcheck error: t=1: For method=kem (EM), if an element of the diagonal of R is 0, the corresponding row of Z must be fixed. See MARSSinfo('AZR0').
iter=415 Setting element of R to 0, blocked. See MARSSinfo("R0blocked"). The error is due to the following MARSSkemcheck errors.
MARSSkemcheck error: t=1: For method=kem (EM), if an element of the diagonal of R is 0, the corresponding row of Z must be fixed. See MARSSinfo('AZR0').
Related
I have a data frame where one column is a mix of positive and negative numbers, and the first entry is NA. I'm trying to run the shape function as
shape(data$col, models = 30, start = 30, end = 400, ci = .90, reverse = TRUE, auto.scale = TRUE)
where the data in 'col' is [NA, -0.2663194135, -3.7665034719, -0.2072122334, 1.5721742718, -9.142419, -8.954330, -5.167314, 11.805930, 9.533830, 7.065835]
but I get an error that says
Error in optim(theta, negloglik, hessian = TRUE, ..., tmp = excess) :
non-finite value supplied by optim
Can someone help me figure out what it means? I've googled it but haven't found anything concrete.
It's not clear what you are trying to do here. Calling shape allows you to see how altering the threshold or nextremes parameters in the gpd function will alter the xi parameter of the resulting generalised Pareto distribution model.
There are a few reasons why the example you supplied doesn't work. Let's first of all show an example of what does work. The exponential distribution is a special case of a GPD with mu = 0 and xi = 0, so a sample drawn from the exponential distribution should do the trick:
library(evir) # For the shape() function
set.seed(69) # Makes this example reproducible
x <- rexp(300) # Random sample of 300 elements drawn from exponential distribution
shape(x)
Fine — this runs without error and produces the expected plot.
However, your sample contains an NA. What happens if we make a single value NA in our sample?
x[1] <- NA
shape(x)
#> Error in optim(theta, negloglik, hessian = TRUE, ..., tmp = excess) :
#> non-finite value supplied by optim
So, no NAs allowed.
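For the toy sample, dropping the NA before the call makes things work again (a quick sketch using the same x as above):
x_clean <- x[!is.na(x)]  # 299 values left
shape(x_clean)           # fits and plots as before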
Unfortunately, you will find that you still get the same error if you remove the NA from your own data. There are two reasons for this. Firstly, you have only about 10 non-NA values. What happens if we try a similarly short exponential sample?
shape(rexp(9))
#> Error in optim(theta, negloglik, hessian = TRUE, ..., tmp = excess) :
#> non-finite finite-difference value [1]
We will find that the model will fail to fit with fewer than about 16 data points.
But that's not the only problem. What if we try to get a plot for data that can't plausibly have been drawn from a generalised Pareto distribution?
# Maybe a uniform distribution?
shape(runif(300, 1, 10))
#> Error in optim(theta, negloglik, hessian = TRUE, ..., tmp = excess) :
#> non-finite finite-difference value [1]
#> In addition: Warning message:
#> In sqrt(diag(varcov)) : NaNs produced
#>
So in effect, you need a bigger sample with no NAs, and it needs to conform approximately to a GPD, otherwise the gpd function will throw an error.
I might be able to help if you let us know the bigger picture of what you are trying to do.
I am working with the h2o glrm function. When I try to pass the loss_by_col argument to specify a different loss function for each column of my data frame (I have normal, Poisson, and binomial variables, so I am passing "Quadratic", "Poisson", and "Logistic" losses), the objective is not computed: testmodel@model$objective returns NaN. At the same time, summary shows that a few iterations were made and the objective was NA for all of them. The quality of the model is very bad, but the archetypes are somehow computed, so I am confused. How should I pass a different loss for each variable in my dataset? Here is a (I hope) reproducible example:
library(h2o)
h2o.init()

df <- data.frame(p1 = rpois(100, 5), n1 = rnorm(100), b1 = rbinom(100, 1, 0.5))
df$b1 <- factor(df$b1)
h2df <- as.h2o(df)
testmodel <- h2o.glrm(h2df,
                      k = 3,
                      loss_by_col = c("Poisson", "Quadratic", "Logistic"),
                      transform = "STANDARDIZE")
testmodel@model$objective
summary(testmodel)
plot(testmodel)
Please note that there is a jira ticket for this here
It's interesting that you don't get an error when you run your code snippet. When I run it, I get the following error:
Error: DistributedException from localhost/127.0.0.1:54321: 'Poisson loss L(u,a) requires variable a >= 0', caused by java.lang.AssertionError: Poisson loss L(u,a) requires variable a >= 0
I can resolve this error by removing transform = "STANDARDIZE", because standardization can produce negative values, which the Poisson loss cannot handle. For more information on what the transformations do, take a look at the user guide; for convenience, here is how standardize is defined there: "Standardize: Standardizing subtracts the mean and then divides each variable by its standard deviation."
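So, as a minimal sketch, the same call with the transform argument simply dropped should run (whether your real data then needs some other pre-processing is a separate question):
testmodel <- h2o.glrm(h2df,
                      k = 3,
                      loss_by_col = c("Poisson", "Quadratic", "Logistic"))
testmodel@model$objective  # should now be finite rather than NaN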
Mac OS 10.9.5, R 3.2.3, MuMIn_1.15.6, lme4_1.1-10
Reproducible example code, using example data
The MuMIn user guide recommends using na.action = na.fail; otherwise the dredge function will not work, which I have indeed found:
Error in dredge: 'global.model''s 'na.action' argument is not set and options('na.action') is "na.omit".
However, when I try to run a glmer model with na.action=na.fail, I get this:
Error in na.fail.default(list(pr = c(0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, :
missing values in object
Do I have any options other than removing every observation with an NA? My full data set consists of 10,000 observations and has 23 predictor variables, which have NAs for different observations. Removing every observation with an NA will waste some data, which I'm looking to avoid.
It is difficult to know what you are asking.
From ?MuMIn::dredge: "Use of na.action = "na.omit" (R's default) or "na.exclude" in global.model must be avoided, as it results with sub-models fitted to different data sets, if there are missing values. Error is thrown if it is detected."
In your example, leaving the default options(na.action = na.omit) works fine:
library(lme4)  # for glmer()

options()$na.action
mod.na.omit <- glmer(formula = pr ~ yr + soil_dist + sla_raw +
                       yr:soil_dist + yr:sla_raw + (1 | plot) + (1 | subplot),
                     data = coldat,
                     family = binomial)
But, options(na.action = na.fail) causes glmer to fail (as expected from the documentation).
If you look at the number of rows in coldat, the complete cases of coldat, and the data actually used by mod.na.omit, you get the following:
> # number of rows in coldat
> nrow(coldat)
[1] 3171
> # number of complete cases in coldat
> nrow(coldat[complete.cases(coldat), ])
[1] 2551
> # number of rows in data included in glmer model when using 'na.omit'
> length(mod.na.omit@frame$pr)
[1] 2551
From the example data you provided, the complete cases of coldat and the rows of coldat included by glmer when using na.omit (mod.na.omit@frame) yield the same number of rows, but it is conceivable that as predictors are added, this may no longer be the case (i.e., the number of rows in mod.na.omit@frame could exceed the complete cases of coldat). In this scenario (as the documentation states), there is a risk of sub-models being fitted to different data sets as dredge generates the models. So, rather than potentially fitting sub-models to different data sets, dredge takes a conservative approach to NAs and throws an error.
So, you basically either have to remove the incomplete cases (which you indicated is something you don't want to do) or interpolate the missing values. I typically avoid interpolation if there are large blocks of missing data which make estimating a value fraught, and remove incomplete cases instead.
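If you go the complete-cases route, a minimal sketch (using the coldat data and the model from above, with lme4 already loaded) would be:
library(MuMIn)
coldat_cc <- coldat[complete.cases(coldat), ]  # keep only rows with no NAs anywhere
mod.cc <- glmer(pr ~ yr + soil_dist + sla_raw + yr:soil_dist + yr:sla_raw +
                  (1 | plot) + (1 | subplot),
                data = coldat_cc,
                family = binomial,
                na.action = na.fail)  # no NAs left, so na.fail is safe and dredge will accept the model
dd <- dredge(mod.cc)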
I am trying to forecast a time series while regressing on temperature. The residuals show different behaviour at low and high temperatures, so I want to use a piecewise linear approach, i.e. learn different coefficients for temperatures above and below 35 degrees.
The data are in a data frame with columns data$x, data$Season, and data$Temp.
library(forecast)  # for stlf()

# Create data frame
len <- 365*3 + 1 + 31
x <- rnorm(len, mean = 4000000, sd = 100000)
Season <- c(rep(3,62), rep(4,91), rep(1,90), rep(2,92), rep(3,92), rep(4,91), rep(1,90), rep(2,92), rep(3,92), rep(4,91), rep(1,91), rep(2,92), rep(3,61))
Temp <- rnorm(len, mean = 20, sd = 5)
data <- data.frame(x, Season, Temp)

# Create model matrix
season_dummy <- model.matrix(~ as.factor(data$Season) + 0)
Temp_max <- pmax(0, data$Temp - 35)                       # creates 0, or the amount above the threshold
Temp_restore <- restore_temp_up(Temp_max, data$Temp, 35)  # restores the difference to the original value
Temp_season_matrix_max <- Temp_restore * season_dummy

# Create time series and forecast
data_ts <- ts(data$x[1:1000], freq = 365, start = c(2009, 182))
len_train <- length(data_ts)
xreg1 <- Temp_season_matrix_max[1:len_train, ]
newxreg1 <- Temp_season_matrix_max[(len_train+1):(len_train+30), ]
stlf(data_ts, method = "arima", h = 30, xreg = xreg1, newxreg = newxreg1, s.window = "periodic")
> Error in optim(init[mask], armaCSS, method = optim.method, hessian = FALSE, :
non-finite value supplied by optim
Error in auto.arima(x, xreg = xreg, seasonal = FALSE, ...) :
No suitable ARIMA model found
In addition: Warning message:
In auto.arima(x, xreg = xreg, seasonal = FALSE, ...) :
Unable to calculate AIC offset
>
Other threads suggest changing the solver method from CSS to ML, but I can't edit these parameters in stlf. The help file shows an optional parameter "forecastfunction", but there are no examples or real explanation of how to use it.
Note: when I set the threshold temperature to, say, 20 instead of 35, this works OK. I am sure it is because the xreg matrix containing temperatures above 35 degrees is sparse (most temperatures are below this value), but I am not sure how to get around this.
(I have included the code for restore_temp_up below; it is possibly inefficient, but is included here for completeness.)
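For example, a quick check of how sparse these regressors actually are (using the objects defined above):
sum(data$Temp > 35)                    # with Temp ~ N(20, 5) this is almost always 0
colSums(Temp_season_matrix_max != 0)   # non-zero entries in each seasonal temperature column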
restore_temp_up <- function(x, original, k) {
  if (!is.vector(x))
    stop('x must be a vector')
  for (i in 1:length(x)) {
    if (!is.na(x[i])) {
      if (x[i] > 0) {
        x[i] <- x[i] + k         # add the threshold back to recover the original value
      }
      if (original[i] == k) {
        x[i] <- original[i]      # if the original was exactly k, the difference is 0, so restore it explicitly
      }
    }
  }
  return(x)
}
Your design matrix is rank deficient, so the regression is singular. To see this:
> eigen(t(xreg1) %*% xreg1)$val
[1] 1321.223 0.000 0.000 0.000
You cannot fit a regression model with a rank deficient design matrix.
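A quick way to see which columns are the problem (a sketch using the xreg1 matrix built in the question):
qr(xreg1)$rank        # effective rank; here it is less than ncol(xreg1) = 4
colSums(xreg1 != 0)   # non-zero entries in each season's temperature column
With Temp drawn from N(20, 5), values above 35 are vanishingly rare, so most or all of these columns are entirely zero. Dropping the all-zero columns, lowering the threshold, or pooling the seasons would make the matrix full rank.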
I want to estimate the coefficients for an AR process based on weekly data where the lags occur at t-1, t-52, and t-53. I will naturally lose a year of data to do this.
So far I have tried:
lags <- rep(0, 54)   # 53 AR coefficients plus the intercept; 0 = fix the term at zero
lags[1]  <- NA       # NA = estimate this term
lags[52] <- NA
lags[53] <- NA
testResults <- arima(data, order = c(53, 0, 0), fixed = lags)
Basically, I tried using an ARIMA and shutting off the MA/differencing terms. I used 0s for the terms I wanted to exclude (and for the intercept), and NAs for the terms I wanted to estimate.
I get the following error:
Error in optim(init[mask], armafn, method = optim.method, hessian =TRUE, :
non-finite finite-difference value [1]
In addition: Warning message:
In arima(data, order = c(53, 0, 0), fixed = lags) :
some AR parameters were fixed: setting transform.pars = FALSE
I'm hoping there is an easier method or potential solution to this error. I want to avoid creating columns with the lagged variables and simply running a regression. Thanks!