I have a time series data set that I would like to cluster with several different clustering techniques and then evaluate each method. I tried code adapted from an example in the clusterSim package, but I got this error:
Error in x[ind, ] - centerI : non-conformable arrays
I have no idea why I got this error.
Here is my code:
library(dtwclust)
library(TSclust)
library(dplyr)
library(clusterSim)
data("interest.rates")
rate.change <- diff(log(interest.rates), 1)
md <- diss(rate.change, "AR.MAH")$p_value
min_nc=2
max_nc=8
res <- array(0, c(max_nc-min_nc+1, 2))
res[,1] <- min_nc:max_nc
clusters <- NULL
for (nc in min_nc:max_nc)
{
hc <- hclust(md, method="average")
cl2 <- cutree(hc, k=nc)
res[nc-min_nc+1, 2] <- DB <- index.DB(rate.change, cl2, md,centrotypes="medoids")$DB
clusters <- rbind(clusters, cl2)
}
The original code is as follows:
# Example 3
library(clusterSim)
data(data_ratio)
md <- dist(data_ratio, method="euclidean")
# nc - number_of_clusters
min_nc=2
max_nc=8
res <- array(0, c(max_nc-min_nc+1, 2))
res[,1] <- min_nc:max_nc
clusters <- NULL
for (nc in min_nc:max_nc)
{
hc <- hclust(md, method="complete")
cl2 <- cutree(hc, k=nc)
res[nc-min_nc+1, 2] <- DB <- index.DB(data_ratio, cl2, centrotypes="centroids")$DB
clusters <- rbind(clusters, cl2)
}
library(caret)
library(mvtnorm)
library(ncvreg)
library(glmnet)
library(flare)
air <- read.csv(file = 'C:/Users/42stu/Downloads/AirI(1).csv', check.names = FALSE, stringsAsFactors = FALSE)
for(x in 1:nrow(air)){
for (y in 1:ncol(air)){
air[x, y] <- as.numeric(air[x,y])
if(!is.numeric(air[x, y])){
print(x)
print(y)
}
}
}
air_shuffled <- air[sample(nrow(air)),]
dataset <- air_shuffled
folds <- cut(seq(1,nrow(dataset)),breaks=10,labels=FALSE)
for(i in 1:10){
testIndexes <- which(folds==i,arr.ind=TRUE)
testData <- dataset[testIndexes,]
testDatax <- testData[,2:ncol(testData)]
testDatay <- testData[,1]
trainData <- dataset[-testIndexes, ]
trainDatax <- trainData[,2:ncol(trainData)]
trainDatay <- t(trainData[,1])
sqrt_lasso <- slim(trainDatax, trainDatay, method = "lq")
}
My code is included above. I have converted everything to a numeric type; why am I still getting an error? Is there something wrong with the size of the matrices? Apologies if this isn't a good question; this is my first time posting on StackOverflow and my first semester using R.
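One thing that may be worth checking, assuming some columns of the CSV were read as character: assigning as.numeric(air[x, y]) back into a single cell does not change the type of the column, because R coerces the value back to the column's existing type. The sketch below uses a small made-up data frame (air_demo is hypothetical, not the real file) to show this, and then converts whole columns before building a numeric matrix and a response vector.

```r
# Hypothetical stand-in for the CSV: every column was read as character.
air_demo <- data.frame(y  = c("1.5", "2.5", "3.5", "4.5"),
                       x1 = c("10", "20", "30", "40"),
                       stringsAsFactors = FALSE)

# Cell-by-cell assignment: the value is coerced back to character.
air_demo[1, 1] <- as.numeric(air_demo[1, 1])
class(air_demo$y)   # the column type has not changed

# Converting whole columns does change the type.
air_num <- as.data.frame(lapply(air_demo, as.numeric))

# A numeric design matrix and a plain response vector of matching length.
X <- as.matrix(air_num[, -1, drop = FALSE])
Y <- air_num[, 1]
dim(X)
length(Y)
```

Whether this resolves the error depends on what the error message actually says, which the question does not include.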
I am trying to write r2, rmse, coefficients, and standardized coefficients from stepAIC to a .CSV file:
NO3_lmres_ClimateOnly <- data.frame()
for (i in unique(Data$SeasAlltxt)){
print (i)
subdata1 <- subset(Data, SeasAlltxt == i)
for (j in unique(Data$ALSCIDtxtall)){
subdata2 <- subset(subdata1, ALSCIDtxtall == j)
fit <- lm(NO3resid~Avg94NO3+MaxDepth_m+MaxDepthDOY+FirstZeroDOY+PeakToGone+PRISMppt+PRISMtmax, data = subdata2, na.action = na.omit)
step <- stepAIC(fit, direction="both")
rmse <- round(sqrt(mean(resid(step)^2)), 3)
r2 <- round(summary(step)$r.squared, 3)
coefs <- summary(step)$coefficients
stdcoefs <- lm.beta(step)
stdcoefs <- unname(stdcoefs)
params <- names(stdcoefs)
tempvalues <- data.frame(i,j,rmse,r2,coefs,stdcoefs,params)
colnames(tempvalues) <- c('SeasAlltxt', 'ALSCIDtxtall', 'rmse', 'r2', 'coef', 'stdcoef','param')
NO3_lmres_ClimateOnly <- rbind(NO3_lmres_ClimateOnly,tempvalues)
}
}
write.csv(NO3_lmres_ClimateOnly, file = "NO3_ClimateOnly_stats.csv")
However, the above code produces this error:
Error in data.frame(i, j, rmse, r2, coefs, stdcoefs, params) :
arguments imply differing number of rows: 1, 3, 2, 0
I would also like to write the p-value associated with each parameter to the output table.
Any suggestions for how to accomplish this?
Maybe you want to change your code:
for (j in unique(subdata1$ALSCIDtxtall))
...
coefs <- summary(step)$coefficients[,1]
...
tempvalues <- data.frame(t(c(i,j,rmse,r2,coefs,stdcoefs,params)),stringsAsFactors=F)
colnames(tempvalues ) <- c('SeasAlltxt', 'ALSCIDtxtall', 'rmse', 'r2', names(coefs), paste('stdcoef:',params),params)
But the final rbind will give you an error when stepAIC selects a different number of coefficients.
Think of using a list instead:
Define NO3_lmres_ClimateOnly <- list() and cont=1 outside the for loops,
then change the following lines:
tempvalues <- data.frame(t(c(i,j,rmse,r2,coefs,stdcoefs,params)),stringsAsFactors=F)
colnames(tempvalues ) <- c('SeasAlltxt', 'ALSCIDtxtall', 'rmse', 'r2', names(coefs), paste('stdcoef:',params),params)
NO3_lmres_ClimateOnly[[cont]] <- tempvalues
cont=cont+1
Good luck!!
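As a rough sketch of the list idea, here is a self-contained version that stores one row per selected coefficient (long format), so models with different numbers of terms can still be combined, and that also keeps the p-value of each parameter. Everything in it is a stand-in: mtcars split by cyl replaces Data split by season and lake, and stats::step is used in place of MASS::stepAIC so that no extra package is needed. Standardized coefficients are left out.

```r
# Hypothetical stand-in data: mtcars grouped by cyl instead of season/lake.
results <- list()
cont <- 1
for (g in sort(unique(mtcars$cyl))) {
  sub <- subset(mtcars, cyl == g)
  fit <- lm(mpg ~ wt + hp + qsec, data = sub)
  # stats::step used here as a stand-in for MASS::stepAIC
  sel <- step(fit, direction = "both", trace = 0)
  # Coefficient matrix: Estimate, Std. Error, t value, Pr(>|t|)
  cf <- summary(sel)$coefficients
  results[[cont]] <- data.frame(
    group   = g,
    rmse    = round(sqrt(mean(resid(sel)^2)), 3),
    r2      = round(summary(sel)$r.squared, 3),
    param   = rownames(cf),
    coef    = cf[, "Estimate"],
    p_value = cf[, "Pr(>|t|)"],
    row.names = NULL
  )
  cont <- cont + 1
}
# Every element has the same columns, so rbind works regardless of
# how many terms each selected model kept.
all_results <- do.call(rbind, results)
head(all_results)
```

The combined data frame can then be passed to write.csv() as in the question.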
I'm writing a function for forecasting sales based on trend and seasonality. I use dummy variables to represent seasonality and a time variable for the trend. Here is the code:
forecast<-function(data, time, fn) {
n <- length(data)
seasonal <- factor(cycle(data))
new_data <- seq(from=cycle(data)[n]+1, length=fn, by=1)
new_seasonal <- factor(new_data)
trend <- rep(0, n)
new_trend <- rep(0, n)
if (time > 0) {
trend <- vector()
new_trend <- vector()
ttrend <- seq(from=1, to=n, by=1)
tnew_trend <- seq(from=n+1, length=fn, by=1)
for(i in 1:time) {
trend <- cbind(trend, ttrend^i)
new_trend <- cbind(new_trend, tnew_trend^i)
}
model_trend <- lm(data ~ seasonal + trend)
} else {
model_trend <- lm(data ~ seasonal)
}
df <- data.frame(new_seasonal, new_trend)
p <- predict(model_trend, df)
}
forecast(data = dane.ts[,"SALES"], time=2, fn=5)
However, I get this warning:
Warning message:
'newdata' had 5 rows but variable(s) found have 104 rows
It seems that something is wrong with this function. I would appreciate any help.
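A common cause of this warning is that the column names in the data frame passed to predict() do not match the variable names in the model formula. Here the model is fitted on seasonal and trend, while df contains new_seasonal and new_trend, so predict() falls back to the original 104-row variables. A minimal sketch of one way around it, assuming a simulated monthly series in place of dane.ts[, "SALES"]: fit from a data frame and build newdata with the same column names. The function name forecast_trend_season and the use of poly(..., raw = TRUE) for the polynomial trend are my own choices, and the sketch only covers degree >= 1.

```r
# Hypothetical monthly series standing in for dane.ts[, "SALES"].
set.seed(1)
season_effect <- c(5, -3, 2, 0, 1, -4, 6, 3, -2, 0, 4, -1)
sales <- ts(100 + 0.5 * (1:48) + rep(season_effect, 4) + rnorm(48),
            frequency = 12)

forecast_trend_season <- function(y, degree, h) {
  n <- length(y)
  freq <- frequency(y)
  # Training data: response, time index, and season as a factor.
  train <- data.frame(
    y      = as.numeric(y),
    t      = seq_len(n),
    season = factor(as.integer(cycle(y)), levels = seq_len(freq))
  )
  # Future seasons wrap around the period (e.g. 12 is followed by 1).
  last_season   <- as.integer(cycle(y))[n]
  future_season <- ((last_season + seq_len(h) - 1) %% freq) + 1
  # newdata uses the SAME column names as the training data.
  new <- data.frame(
    t      = n + seq_len(h),
    season = factor(future_season, levels = seq_len(freq))
  )
  fit <- lm(y ~ season + poly(t, degree, raw = TRUE), data = train)
  predict(fit, newdata = new)
}

fc <- forecast_trend_season(sales, degree = 2, h = 5)
fc
```

Because the names in newdata match the formula, predict() returns exactly h values and no warning about row counts is raised.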
I am having trouble finding the most efficient way to calculate a rolling linear regression over an xts object with multiple columns. I have searched and read several previous questions here on Stack Overflow.
This question and answer comes close, but not close enough in my opinion, as I want to calculate multiple regressions with the dependent variable unchanged in all of them. I have tried to produce an example with random data:
require(xts)
require(RcppArmadillo) # Load libraries
data <- matrix(sample(1:10000, 1500), 1500, 5, byrow = TRUE) # Random data
data[1000:1500, 2] <- NA # insert NAs to make it more similar to true data
data <- xts(data, order.by = as.Date(1:1500, origin = "2000-01-01"))
NR <- nrow(data) # number of observations
NC <- ncol(data) # number of factors
obs <- 30 # required number of observations for rolling regression analysis
info.names <- c("res", "coef")
info <- array(NA, dim = c(NR, length(info.names), NC))
colnames(info) <- info.names
The array is created in order to store multiple variables (residuals, coefficients etc.) over time and per factor.
loop.begin.time <- Sys.time()
for (j in 2:NC) {
cat(paste("Processing residuals for factor:", j), "\n")
for (i in obs:NR) {
regression.temp <- fastLm(data[i:(i-(obs-1)), j] ~ data[i:(i-(obs-1)), 1])
residuals.temp <- regression.temp$residuals
info[i, "res", j] <- round(residuals.temp[1] / sd(residuals.temp), 4)
info[i, "coef", j] <- regression.temp$coefficients[2]
}
}
loop.end.time <- Sys.time()
print(loop.end.time - loop.begin.time) # prints the loop runtime
As the loop shows, the idea is to run a rolling regression over 30 observations with data[, 1] as the dependent variable (factor) every time against one of the other factors. I have to store the 30 residuals in a temporary object in order to standardize them, as fastLm does not calculate standardized residuals.
The loop is extremely slow and becomes cumbersome as the number of columns (factors) in the xts object grows: around 100 to 1,000 columns would take an eternity. I hope someone has more efficient code for running rolling regressions over a large data set.
It should be pretty quick if you go down to the level of the math of the linear regression. If X is the independent variable and Y is the dependent variable, the coefficients are given by
Beta = inv(t(X) %*% X) %*% (t(X) %*% Y)
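As a quick sanity check of that formula, here is a small sketch on simulated data (the variable names and numbers are made up) comparing it with lm(). It assumes X includes a column of ones when an intercept is wanted.

```r
set.seed(42)
x <- rnorm(30)
y <- 2 + 3 * x + rnorm(30)

# Design matrix with an intercept column.
X <- cbind(1, x)

# Normal equations: Beta = inv(t(X) %*% X) %*% (t(X) %*% Y)
beta <- solve(t(X) %*% X) %*% (t(X) %*% y)

# Reference fit for comparison.
beta_lm <- coef(lm(y ~ x))

all.equal(as.numeric(beta), as.numeric(beta_lm))
```

In practice solve(crossprod(X), crossprod(X, y)) avoids forming the explicit inverse and gives the same coefficients.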
I'm a little confused about which variable you want to be the dependent one and which the independent one, but hopefully solving a similar problem below will help you as well.
In the example below I take 1000 variables instead of the original 5 and do not introduce any NAs.
require(xts)
data <- matrix(sample(1:10000, 1500000, replace=T), 1500, 1000, byrow = TRUE) # Random data
data <- xts(data, order.by = as.Date(1:1500, origin = "2000-01-01"))
NR <- nrow(data) # number of observations
NC <- ncol(data) # number of factors
obs <- 30 # required number of observations for rolling regression analysis
Now we can calculate the coefficients using Joshua's TTR package.
library(TTR)
loop.begin.time <- Sys.time()
in.dep.var <- data[,1]
xx <- TTR::runSum(in.dep.var*in.dep.var, obs)
coeffs <- do.call(cbind, lapply(data, function(z) {
xy <- TTR::runSum(z * in.dep.var, obs)
xy/xx
}))
loop.end.time <- Sys.time()
print(loop.end.time - loop.begin.time) # prints the loop runtime
Time difference of 3.934461 secs
res.array = array(NA, dim=c(NR, NC, obs)) # each slice is NR x NC, matching coredata() below
for(z in seq(obs)) {
res.array[,,z] = coredata(data - lag.xts(coeffs, z-1) * as.numeric(in.dep.var))
}
res.sd <- apply(res.array, c(1,2), function(z) z / sd(z))
If I haven't made any errors in the indexing res.sd should give you the standardized residuals. Please feel free to fix this solution to correct any bugs.
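Note that xy/xx is the slope of a regression through the origin (no intercept). If an intercept is wanted, as in the fastLm call in the question, rolling sums are still enough: the slope is (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2) over each window. Below is a base R sketch of that idea on simulated data; roll_sum and roll_slope are hypothetical helper names, and cumsum() is used here instead of TTR::runSum only to keep the example free of dependencies.

```r
# Rolling sum over the last `width` values, NA until the window is full.
roll_sum <- function(v, width) {
  cs <- cumsum(c(0, v))
  out <- rep(NA_real_, length(v))
  idx <- width:length(v)
  out[idx] <- cs[idx + 1] - cs[idx - width + 1]
  out
}

# Rolling OLS slope of y on x, with an intercept in the model.
roll_slope <- function(x, y, width) {
  sx  <- roll_sum(x, width)
  sy  <- roll_sum(y, width)
  sxy <- roll_sum(x * y, width)
  sxx <- roll_sum(x * x, width)
  (width * sxy - sx * sy) / (width * sxx - sx^2)
}

set.seed(1)
x <- rnorm(200)
y <- 1 + 0.5 * x + rnorm(200)
slopes <- roll_slope(x, y, 30)

# Compare one window against lm() directly.
i <- 75
check <- unname(coef(lm(y[(i - 29):i] ~ x[(i - 29):i]))[2])
all.equal(slopes[i], check)
```

With very large values the differences of cumulative sums can lose precision, so it is worth checking a few windows against lm() on the real data.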
Here is a much faster way to do it with the rollRegres package:
library(xts)
library(RcppArmadillo)
#####
# simulate data
set.seed(50554709)
data <- matrix(sample(1:10000, 1500), 1500, 5, byrow = TRUE) # Random data
# data[1000:1500, 2] <- NA # only focus on the parts that are computed
data <- xts(data, order.by = as.Date(1:1500, origin = "2000-01-01"))
#####
# setup for solution in OP
NR <- nrow(data)
NC <- ncol(data)
obs <- 30L
info.names <- c("res", "coef")
info <- array(NA, dim = c(NR, length(info.names), NC))
colnames(info) <- info.names
#####
# solve with rollRegres
library(rollRegres)
loop.begin.time <- Sys.time()
X <- cbind(1, drop(data[, 1]))
out <- lapply(2:NC, function(j){
fit <- roll_regres.fit(
y = data[, j], x = X, width = obs, do_compute = c("sigmas"))
# are you sure you want the residual of the first and not the last
# observation in each window?
idx <- 1:(nrow(data) - obs + 1L)
idx_tail <- idx + obs - 1L
resids <- c(rep(NA_real_, obs - 1L),
data[idx, j] - rowSums(fit$coefs[idx_tail, ] * X[idx, ]))
# the package uses the unbiased estimator so we have to multiply by this factor
# to get the same result as sd()
sds <- fit$sigmas * sqrt((obs - 2L) / (obs - 1L))
unclass(cbind(coef = fit$coefs[, 2L], res = drop(round(resids / sds, 4))))
})
loop.end.time <- Sys.time()
print(loop.end.time - loop.begin.time)
#R Time difference of 0.03123808 secs
#####
# solve with original method
loop.begin.time <- Sys.time()
for (j in 2:NC) {
cat(paste("Processing residuals for factor:", j), "\n")
for (i in obs:NR) {
regression.temp <- fastLm(data[i:(i-(obs-1)), j] ~ data[i:(i-(obs-1)), 1])
residuals.temp <- regression.temp$residuals
info[i, "res", j] <- round(residuals.temp[1] / sd(residuals.temp), 4)
info[i, "coef", j] <- regression.temp$coefficients[2]
}
}
#R Processing residuals for factor: 2
#R Processing residuals for factor: 3
#R Processing residuals for factor: 4
#R Processing residuals for factor: 5
loop.end.time <- Sys.time()
print(loop.end.time - loop.begin.time) # prints the loop runtime
#R Time difference of 7.554767 secs
#####
# check that results are the same
all.equal(info[, "coef", 2L], out[[1]][, "coef"])
#R [1] TRUE
all.equal(info[, "res" , 2L], out[[1]][, "res"])
#R [1] TRUE
all.equal(info[, "coef", 3L], out[[2]][, "coef"])
#R [1] TRUE
all.equal(info[, "res" , 3L], out[[2]][, "res"])
#R [1] TRUE
all.equal(info[, "coef", 4L], out[[3]][, "coef"])
#R [1] TRUE
all.equal(info[, "res" , 4L], out[[3]][, "res"])
#R [1] TRUE
all.equal(info[, "coef", 5L], out[[4]][, "coef"])
#R [1] TRUE
all.equal(info[, "res" , 5L], out[[4]][, "res"])
#R [1] TRUE
Do notice this comment inside the above solution:
# are you sure you want the residual of the first and not the last
# observation in each window?
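On the scaling factor used in the solution above: for a regression with an intercept and one slope, the unbiased residual standard deviation divides the residual sum of squares by n - 2, whereas sd() of the residuals divides by n - 1. A small sketch on simulated data (all names and values are made up) shows that multiplying by sqrt((n - 2) / (n - 1)) converts one into the other.

```r
set.seed(7)
n <- 30
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

# Unbiased estimate: sqrt(RSS / (n - 2))
sigma_unbiased <- summary(fit)$sigma

# sd() of the residuals: sqrt(RSS / (n - 1)), since they have mean zero
sd_resid <- sd(resid(fit))

# Rescale the unbiased estimate to match sd()
rescaled <- sigma_unbiased * sqrt((n - 2) / (n - 1))
all.equal(sd_resid, rescaled)
```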
Here is a comparison to Sameer's answer:
library(rollRegres)
require(xts)
data <- matrix(sample(1:10000, 1500000, replace=T), 1500, 1000, byrow = TRUE) # Random data
data <- xts(data, order.by = as.Date(1:1500, origin = "2000-01-01"))
NR <- nrow(data) # number of observations
NC <- ncol(data) # number of factors
obs <- 30 # required number of observations for rolling regression analysis
loop.begin.time <- Sys.time()
X <- cbind(1, drop(data[, 1]))
out <- lapply(2:NC, function(j){
fit <- roll_regres.fit(
y = data[, j], x = X, width = obs, do_compute = c("sigmas"))
# are you sure you want the residual of the first and not the last
# observation in each window?
idx <- 1:(nrow(data) - obs + 1L)
idx_tail <- idx + obs - 1L
resids <- c(rep(NA_real_, obs - 1L),
data[idx, j] - rowSums(fit$coefs[idx_tail, ] * X[idx, ]))
# the package uses the unbiased estimator so we have to multiply by this factor
# to get the same result as sd()
sds <- fit$sigmas * sqrt((obs - 2L) / (obs - 1L))
unclass(cbind(coef = fit$coefs[, 2L], res = drop(round(resids / sds, 4))))
})
loop.end.time <- Sys.time()
print(loop.end.time - loop.begin.time)
#R Time difference of 0.9019711 secs
The time includes the time used to compute the standardized residuals.