I have the following code:
z7 <- function(data, k, e){
  require(zoo)
  df = data
  r = df$ROA
  t = df$t
  EA = df$EA
  k = k
  e = e
  # Estimate rolling linear models
  models = rollapply(df, width = k, FUN = function(z)
    coef(lm(r ~ t, data = as.data.frame(z))), by.column = FALSE, align = "right")
  # Extract residuals from the models
  res = rollapply(df, width = k, FUN = function(x)
    residuals(lm(r ~ t, data = as.data.frame(x))), by.column = FALSE, align = "right")
  # Standard deviation and mean of residuals, on a row basis
  s = as.data.frame(apply(res, 1, sd))
  m = as.data.frame(apply(res, 1, mean)) # note that this is approximately 0 due to detrending
  # Combine the data and define n as the number of rows in the dataset
  dataset = cbind(models, res, m, s)
  n = as.vector(nrow(dataset))
  n
  dataset
  # Compute predictions at k+1
  for(i in n){
    x = k + 1
    preds = dataset$`(Intercept)` + dataset$t*(x)
    x = x + 1
  }
  # Compute coefficient of variation
  for(j in n){
    n2 = k + 1
    tau = ((1 + 1 / (4*(n2))) * (dataset$apply.res..1..sd./dataset$apply.res..1..mean.))
  }
  dataset3 = cbind(dataset, tau)
  dataset3
  # Compute mean of chi distribution and the adjusted standard deviation
  Mchi <- sqrt(2)*((gamma((k+1)/2))/gamma(k/2))
  S = s*Mchi*(k+1)/sqrt(k)
  # Compute z7, checking whether the adjusted sd or the cv should be used
  for(i in nrow(dataset3)){
    if (abs(dataset3$tau*dataset3$preds) < e) {
      z = -(dataset3$EA + dataset3$preds) / S
    } else {
      z = -(dataset3$EA + dataset3$preds) / (dataset3$tau*dataset3$preds)
    }
  }
}
As you can see, I am writing a function that computes an adjusted standardised score. Typically, the Z-score is defined as (x - mean)/sd.
In this case, we take into account the fact that x is a nonstationary random variable, so the measure must be estimated on a rolling basis and constructed iteratively over the observations.
Here df is the dataset of interest, k is the window length used for estimating the rolling linear models, and e is a threshold: when the coefficient-of-variation-based denominator is smaller than e in absolute value, the heteroscedasticity-adjusted standard deviation is used instead.
I am getting an error when I run my function with the following test measures:
t = seq(0,15,1)
r = (100+50*sin(0.8*t))
EA = rnorm(0:15)
df = data.frame(t,r,EA)
test = z7(df, 3, 0.00000000001)
The error is:
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 14, 0
The traceback is:
5.
stop(gettextf("arguments imply differing number of rows: %s",
paste(unique(nrows), collapse = ", ")), domain = NA)
4.
data.frame(..., check.names = FALSE)
3.
cbind(deparse.level, ...)
2.
cbind(dataset, tau)
1.
z7(df, 3, 1e-11)
How can I fix this error? Also, is there a way to simplify my code?
Thank you.
I think the error occurs at this line:
tau = ((1 + 1 / (4*(n2))) * (dataset$apply.res..1..sd./dataset$apply.res..1..mean.))
I changed it to
tau = ((1 + 1 / (4*(n2))) * (dataset$`apply(res, 1, sd)`/dataset$`apply(res, 1, mean)`))
And in the last for loop I guess there is a problem with dataset3$preds
>dataset3$preds
NULL
Also, at the beginning you assign r = df$ROA, but your test data frame has no ROA column, so this sets r to NULL.
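Putting those points together, here is a minimal sketch of the changes (fragments to drop into the corresponding places inside z7(), not a standalone script; it assumes your test data, so the response column is r rather than ROA, and keeps the rest of the function as it is):
r <- df$r   # the test data frame has no ROA column

# use the backtick column names, so tau gets one value per row instead of length zero
tau <- (1 + 1/(4*(k + 1))) *
  (dataset$`apply(res, 1, sd)` / dataset$`apply(res, 1, mean)`)

# keep the predictions at k+1 and bind them in, so dataset3$preds is no longer NULL
preds <- dataset$`(Intercept)` + dataset$t * (k + 1)
dataset3 <- cbind(dataset, tau, preds)   # tau and preds now match nrow(dataset), so cbind() works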
Hope that was useful!
Greetings
WW
I have a task to produce forecasts with Brown's double exponential smoothing in a rolling window of 60 observations, from a dataset of 228 observations and 17 variables. I will only predict one of those 17 variables.
I also need to plot the original data and the predicted data.
How can this be done?
First, I create this function to calculate the prediction:
estwo_brown = function(x, m, lambda) {
  # Number of observations
  TT = length(x)
  # Looking at the lecture - we need M1,t
  so = c()
  # We need M2,t
  st = c()
  # We need the level component, Lt
  at = c()
  # We need the trend component, Tt
  bt = c()
  # We need the predicted values, Y_t+h
  fs = c()
  # We set up initial values
  so[1] = x[1]
  st[1] = x[1]
  # Now as before, we loop from the second to the last observation
  # and use equations from the lecture (2:TT)
  for (i in 2:TT) {
    so[i] = lambda * x[i-1] + (1-lambda) * so[i-1]
    st[i] = lambda * so[i] + (1-lambda) * st[i-1]
    at[i] = 2*so[i] - st[i]
    bt[i] = lambda / (1 - lambda) * (so[i] - st[i])
    fs[i+m] = at[i] + bt[i] * m
  }
  # Combining results
  res = matrix(NA, nrow = TT, ncol = 6)
  colnames(res) = c('Y', 'Mt1', 'Mt2', 'at', 'bt', 'Ft')
  res[,'Y'] = x
  res[,'Mt1'] = so
  res[,'Mt2'] = st
  res[,'at'] = at
  res[,'bt'] = bt
  res[,'Ft'] = fs[1:TT]
  results = list()
  results[['results']] = as.data.frame(res)
  results[['outpred']] = fs[length(fs)]
  return(results)
}
and then I calculate the model using this:
b_double_exp = estwo_brown(x = dataset$column7, m = 1, lambda = 0.5)
But this will predict over the whole dataset. How can I predict in a rolling window of 60 observations? The rolling window should slide one observation ahead at a time.
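One possible approach (an untested sketch, assuming dataset$column7 holds the 228 observations to be forecast) is to refit estwo_brown() on each 60-observation window and keep the one-step-ahead forecast that it returns as outpred:
# rolling one-step-ahead forecasts from a 60-observation window (sketch)
x      <- dataset$column7          # series to forecast (assumed column name from above)
n      <- length(x)                # 228 in your case
window <- 60
preds  <- rep(NA, n)               # forecasts aligned with the original series

for (start in 1:(n - window)) {
  win <- x[start:(start + window - 1)]            # current 60-observation window
  fit <- estwo_brown(x = win, m = 1, lambda = 0.5)
  preds[start + window] <- fit$outpred            # forecast for the observation after the window
}

# plot original data and the rolling predictions
plot(x, type = "l", xlab = "Observation", ylab = "column7")
lines(preds, col = "red")
legend("topleft", legend = c("observed", "predicted"), col = c("black", "red"), lty = 1)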
I have run a multiple imputation (m = 45, 10 iterations) using the MICE package, and want to calculate Cronbach's alpha for a number of ordinal scales in the data. Is there a function in R that could assist me in calculating the alpha coefficient across the imputed datasets in a manner that would satisfy Rubin's rules for pooling estimates?
We may exploit pool.scalar from the mice package, which performs pooling of univariate estimates according to Rubin's rules.
Since you have not provided a reproducible example yourself, I will provide one.
set.seed(123)
# sample survey responses
df <- data.frame(
x1 = c(1,2,2,3,2,2,3,3,2,3,
1,2,2,3,2,2,3,3,2,3,
1,2,2,3,2,2,3,3,2,3),
x2 = c(1,1,1,2,3,3,2,3,3,3,
1,1,1,2,3,3,2,3,3,3,
1,2,2,3,2,2,3,3,2,3),
x3 = c(1,1,2,1,2,3,3,3,2,3,
1,1,2,1,2,3,3,3,2,3,
1,2,2,3,2,2,3,3,2,3)
)
# function to column-wise generate missing values (MCAR)
create_missings <- function(data, prob) {
x <- replicate(ncol(data),rbinom(nrow(data), 1, prob))
for(k in 1:ncol(data)) {
data[, k] <- ifelse(x[, k] == 1, NA, data[,k])
}
data
}
df <- create_missings(df, prob = 0.2)
# multiple imputation ----------------------------------
library(mice)
imp <- mice(df, m = 10, maxit = 20)
# extract the completed data in long format
implong <- complete(imp, 'long')
We need a function that computes Cronbach's alpha and also yields an estimate of its standard error, which can be used in a call to pool.scalar() later on. Since there is no readily available formula for estimating the standard error of alpha analytically, we also deploy a bootstrap procedure to estimate it.
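For reference, the alpha computed inside the function below is the usual variance form of Cronbach's alpha,
$$ \alpha = \frac{p}{p-1}\left(1 - \frac{\sum_{j=1}^{p} s_j^2}{s_T^2}\right), $$
where $p$ is the number of items, $s_j^2$ the variance of item $j$ and $s_T^2$ the variance of the total score (the row sums).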
The function cronbach_fun() takes the following arguments:
list_compl_data: a data frame containing the completed data of one imputation (taken from a mids object).
boot: a logical indicating whether a non-parametrical bootstrap should be conducted.
B: an integer specifying the number of bootstrap samples to be taken.
ci: a logical indicating whether a confidence interval around alpha should be estimated.
cronbach_fun <- function(list_compl_data, boot = TRUE, B = 1e4, ci = FALSE) {
n <- nrow(list_compl_data); p <- ncol(list_compl_data)
total_variance <- var(rowSums(list_compl_data))
item_variance <- sum(apply(list_compl_data, 2, sd)^2)
alpha <- (p/(p - 1)) * (1 - (item_variance/total_variance))
out <- list(alpha = alpha)
boot_alpha <- numeric(B)
if (boot) {
for (i in seq_len(B)) {
boot_dat <- list_compl_data[sample(seq_len(n), replace = TRUE), ]
total_variance <- var(rowSums(boot_dat))
item_variance <- sum(apply(boot_dat, 2, sd)^2)
boot_alpha[i] <- (p/(p - 1)) * (1 - (item_variance/total_variance))
}
out$var <- var(boot_alpha)
}
if (ci){
out$ci <- quantile(boot_alpha, c(.025,.975))
}
return(out)
}
Now that we have our function to do the 'heavy lifting', we can run it on all m completed data sets, after which we can obtain Q and U (which are required for the pooling of the estimates). Consult ?pool.scalar for more information.
m <- length(unique(implong$.imp))
boot_alpha <- rep(list(NA), m)
for (i in seq_len(m)) {
set.seed(i) # fix random number generator
sub <- implong[implong$.imp == i, -c(1,2)]
boot_alpha[[i]] <- cronbach_fun(sub)
}
# obtain Q and U (see ?pool.scalar)
Q <- sapply(boot_alpha, function(x) x$alpha)
U <- sapply(boot_alpha, function(x) x$var)
# pooled estimates
pool_estimates <- function(x) {
out <- c(
alpha = x$qbar,
lwr = x$qbar - qt(0.975, x$df) * sqrt(x$t),
upr = x$qbar + qt(0.975, x$df) * sqrt(x$t)
)
return(out)
}
Output
# Pooled estimate of alpha (95% CI)
> pool_estimates(pool.scalar(Q, U))
alpha lwr upr
0.7809977 0.5776041 0.9843913
I am trying to figure out how to sample from a custom density in rJAGS but am running into issues. Having searched the site, I saw that there is a zeros (or ones) trick that can be employed based on BUGS code, but I am having a hard time with its implementation in rJAGS. I think I am doing it correctly but keep getting the following error:
Error in jags.model(model1.spec, data = list(x = x, N = N), n.chains = 4, :
Error in node dpois(lambda)
Length mismatch in Node::setValue
Here is my rJAGS code for reproducibility:
library(rjags)
set.seed(4)
N = 100
x = rexp(N, 3)
L = quantile(x, prob = 1) # Censoring point
censor = ifelse(x <= L, 1, 0) # Censoring indicator
x[censor == 1] <- L
model1.string <-"
model {
for (i in 1:N){
x[i] ~ dpois(lambda)
lambda <- -N*log(1-exp(-(1/mu)))
}
mu ~ dlnorm(mup, taup)
mup <- log(.0001)
taup <- 1/49
R <- 1 - exp(-(1/mu) * .0001)
}
"
model1.spec<-textConnection(model1.string)
jags <- jags.model(model1.spec,
data = list('x' = x,
'N' = N),
n.chains=4,
n.adapt=100)
Here, my negative log likelihood of the density I am interested in is -N*log(1-exp(-(1/mu))). Is there an obvious mistake in the code?
Using the zeros trick, the variable on the left-hand side of the dpois() relationship has to be an N-length vector of zeros. The variable x should show up in the likelihood somewhere. Here is an example using the normal distribution.
set.seed(519)
N <- 100
x <- rnorm(100, mean=3)
z <- rep(0, N)
C <- 10
pi <- pi
model1.string <-"
model {
for (i in 1:N){
lambda[i] <- pow(2*pi*sig2, -0.5) * exp(-.5*pow(x[i]-mu, 2)/sig2)
loglam[i] <- log(lambda[i]) + C
z[i] ~ dpois(loglam[i])
}
mu ~ dnorm(0,.1)
tau ~ dgamma(1,.1)
sig2 <- pow(tau, -1)
sumLL <- sum(log(lambda[]))
}
"
model1.spec<-textConnection(model1.string)
set.seed(519)
jags <- jags.model(model1.spec,
data = list('x' = x,
'z' = z,
'N' = N,
'C' = C,
'pi' = pi),
inits = function()list(tau = 1, mu = 3),
n.chains=4,
n.adapt=100)
samps1 <- coda.samples(jags, c("mu", "sig2"), n.iter=1000)
summary(samps1)
Iterations = 101:1100
Thinning interval = 1
Number of chains = 4
Sample size per chain = 1000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
mu 4.493 2.1566 0.034100 0.1821
sig2 1.490 0.5635 0.008909 0.1144
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
mu 0.6709 3.541 5.218 5.993 7.197
sig2 0.7909 0.999 1.357 1.850 2.779
I need to compare the variances of several independent samples. I don't have the data stored in vectors; I only know the mean, standard deviation and sample size of each sample. Does anyone know a way to test whether the variances are equal with only those three statistics in R?
Here is an implementation of the Bartlett test that doesn't require the raw samples, only their sizes and their standard deviations or variances.
The arguments are:
n: a vector of sample sizes;
S: a vector of standard deviations or variances;
se: a logical value; if TRUE, S contains standard deviations, if FALSE, variances.
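For reference, the function computes the usual Bartlett statistic, which only needs the group sizes $n_i$ and variances $S_i^2$:
$$ X^2 = \frac{(N-k)\ln S_p^2 \;-\; \sum_{i=1}^{k}(n_i-1)\ln S_i^2}{1 + \dfrac{1}{3(k-1)}\left(\sum_{i=1}^{k}\dfrac{1}{n_i-1} - \dfrac{1}{N-k}\right)}, \qquad S_p^2 = \frac{\sum_{i=1}^{k}(n_i-1)S_i^2}{N-k}, $$
with $N = \sum_i n_i$; under the null hypothesis $X^2$ follows a chi-squared distribution with $k-1$ degrees of freedom. This is exactly what the code below computes from n and S.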
Tested below with data set iris.
Bartlett_test <- function(n, S, se = TRUE){
  dname <- deparse(substitute(S))
  N <- sum(n)                                # total number of observations
  k <- length(n)                             # number of groups
  S2 <- if(se) S^2 else S                    # convert standard deviations to variances if needed
  S2p <- sum((n - 1)*S2)/(N - k)             # pooled variance
  numer <- (N - k)*log(S2p) - sum((n - 1)*log(S2))
  denom <- 1 + (sum(1/(n - 1)) - 1/(N - k))/(3*(k - 1))  # Bartlett correction factor
  statistic <- c(X2 = numer/denom)
  parameter <- k - 1
  p.value <- pchisq(statistic, df = parameter, lower.tail = FALSE)
  ht <- list(
    statistic = statistic,
    data.name = dname,
    parameter = parameter,
    p.value = p.value,
    method = "Bartlett test of homogeneity of variances",
    alternative = "there are at least two unequal variances"
  )
  class(ht) <- "htest"
  ht
}
n <- with(iris, tapply(Sepal.Length, Species, FUN = length))
s <- with(iris, tapply(Sepal.Length, Species, FUN = sd))
s2 <- with(iris, tapply(Sepal.Length, Species, FUN = var))
Bartlett_test(n, s)
Bartlett_test(n, s2, se = FALSE)
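As a quick sanity check (possible here because we do have the raw iris data), the statistic should agree with the built-in test:
# built-in Bartlett test on the raw data, for comparison
bartlett.test(Sepal.Length ~ Species, data = iris)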
I would like some help with my nested loop, which is not returning the values I expect. I am new to nested loops, so please bear with me. I want to calculate a new independent variable for a logistic regression model, based on different calculations of the original variables. Specifically, I have six variables x1...x6, and I create three new variables (newvar1, newvar2, newvar3) by extracting a percentile from pairs of the original variables. I then combine these three new variables via subtraction to form a final variable, which is the independent variable of a logistic regression model. That final variable is then evaluated by the AIC of the logistic regression model.
I need to determine the optimal combination of percentile values forming newvar1, newvar2, and newvar3 that gives me the best logistic regression model. To do this I have attempted to create a three-level nested loop like this:
df <- data.frame(x1 = rnorm(100),
                 x2 = rnorm(100),
                 x3 = rnorm(100),
                 x4 = rnorm(100),
                 x5 = rnorm(100),
                 x6 = rnorm(100),
                 y  = as.factor(runif(100) <= .70))
n = 1
AIC = NULL
for (i in 0.1:n){
  for (j in 0.1:n){
    for (k in 0.1:n){
      df$newvar1 <- apply(df[,1:2], 1, quantile, probs = i, na.rm = T)
      df$newvar2 <- apply(df[,3:4], 1, quantile, probs = j, na.rm = T)
      df$newvar3 <- apply(df[,5:6], 1, quantile, probs = k, na.rm = T)
      df$finalvar <- df$newvar1 - df$newvar2 - df$newvar3
      model <- glm(y ~ finalvar, data = df, family = "binomial")
      AIC[i] <- as.numeric(model$aic)
    }
  }
}
I would like to provide a sequence of 11 values (0, 0.1, 0.2, ..., 0.9, 1) to the "probs" argument of the quantile function, and I would like to get the AIC for each possible combination of quantile settings (11*11*11 = 1331 models). Thus the AIC variable should in the end hold one value per combination. However, when I run the above code I get an empty numeric value for AIC. How can I get this code to run properly and give me the values for all possible models?
Thanks!
EDIT: this isn't the solution but provides part of the answer, I think. In my previous code "n" was less than one, so only a single iteration was performed; (obviously) "n" needs to be greater than one. The reason it was less than 1 before is that the "probs" argument to quantile requires a value between 0 and 1. To overcome this, the parameter passed to probs is now divided by 10. Now with AIC[i] I can get a vector of 10 values, but I still don't understand how to get the full 10*10*10 (or 11*11*11) set representing all combinations.
New code:
n = 10
AIC = NULL
for (i in 1:n){
  for (j in 1:n){
    for (k in 1:n){
      df$newvar1 <- apply(df[,1:2], 1, quantile, probs = i/10, na.rm = T)
      df$newvar2 <- apply(df[,3:4], 1, quantile, probs = j/10, na.rm = T)
      df$newvar3 <- apply(df[,5:6], 1, quantile, probs = k/10, na.rm = T)
      df$finalvar <- df$newvar1 - df$newvar2 - df$newvar3
      model <- glm(y ~ finalvar, data = df, family = "binomial")
      AIC[i] <- as.numeric(model$aic)
    }
  }
}
First of all, AIC is an R function, so I've changed the name to aic.
Second, in your code's innermost loop you index by i only, even though you have three indices. So maybe this is what you really need:
n = 10
aic = array(0, dim = c(n, n, n))               # changed
for (i in 1:n){
  for (j in 1:n){
    for (k in 1:n){
      df$newvar1 <- apply(df[,1:2], 1, quantile, probs = i/10, na.rm = T)
      df$newvar2 <- apply(df[,3:4], 1, quantile, probs = j/10, na.rm = T)
      df$newvar3 <- apply(df[,5:6], 1, quantile, probs = k/10, na.rm = T)
      df$finalvar <- df$newvar1 - df$newvar2 - df$newvar3
      model <- glm(y ~ finalvar, data = df, family = "binomial")
      aic[i, j, k] <- as.numeric(model$aic)    # changed
    }
  }
}
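Once the array is filled you can, for example, locate the combination with the lowest AIC and map the indices back to the probs values (a small sketch):
# index (i, j, k) of the smallest AIC
best <- which(aic == min(aic), arr.ind = TRUE)
best
# corresponding probs values passed to quantile()
best / 10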