I need help solving this error, I am not sure how to, but it seems as when I run the "dp_stat" in the end, an error appears when I do my OLS model. Essentially I try to find the cumulative difference between a benchmark and a given individual predictive model. Here dp is my independent variable and sg is my dependent variable. datanu is my excel data. I'm not sure how to attach the data here, however here is a link to the excel and the code: https://drive.google.com/drive/folders/12BOuNBODURIP7CQIBZWMmHFc1d7zXHxN?usp=sharing If anyone has a fix it would mean the world!
"#Error in lag():! n must be a positive integer, not a double vector of length 1."
rm(list= ls()) # Clear global environment
invisible(gc()) # Free up unused R-occupied memory
cat("\014") # Clear console output: equivalent to ctrl + L
library("tseries")
library("readxl")
library("Metrics")
library("lubridate")
library("ggplot2")
library("data.table")
library("dyn")
library("reshape2")
#header TRUE fordi første row er navne.
datanu <- read_xlsx("~/Documents/6.semester/Bachelor/Data/datanu.xlsx",
na = "NaN",
sheet = "datax",
)
myts <- ts(datanu, start=c(1872, 1), end=c(2020, 12), frequency=12)
plot(myts[, c("dp", "dy", "ep", "de")])
get_statistics <- function(myts, dp, sg, h=1, start=1872, end=2020, est_periods_OOS = 20) {
#### IS ANALYSIS
#1. Historical mean model for en portefølje
avg <- mean(window(myts, start, end)[, sg], na.rm=TRUE)
IS_error_N <- (window(myts, start, end)[, sg] - avg)
#2. OLS model
#reg <- dyn$lm(sg ~ lag(as.numeric(dp), 1), data=window(myts, start, end))
reg <- dyn$lm(eval(parse(text=sg)) ~ lag(eval(parse(text=dp)), -1), data=window(myts, start, end)) #Error in `lag()`:! `n` must be a positive integer, not a double vector of length 1.
IS_error_A <- reg$residuals
#OOS ANALYSIS
OOS_error_N <- numeric(end - start - est_periods_OOS)
OOS_error_A <- numeric(end - start - est_periods_OOS)
#anvender kun information op til forecasten er lavet.
j <- 0
for (i in (start + est_periods_OOS):(end-1)) {
j <- j + 1
#Get the actual ERP that you want to predict
actual_ERP <- as.numeric(window(myts, i+1, i+1)[, sg])
#1. Historical mean model
OOS_error_N[j] <- actual_ERP - mean(window(myts, start, i)[, sg], na.rm=TRUE)
#2. OLS model
reg_OOS <- dyn$lm(eval(parse(text=sg)) ~ lag(eval(parse(text=dp)), -1),
data=window(myts, start, i))
#Compute_error
df <- data.frame(x=as.numeric(window(myts, i, i)[, dp]))
names(df) <- dp
pred_ERP <- predict.lm(reg_OOS, newdata=df)
OOS_error_A[j] <- pred_ERP - actual_ERP
}
#Compute statistics
MSE_N <- mean(OOS_error_N^2)
MSE_A <- mean(OOS_error_A^2)
T <- length(!is.na(myts[, sg]))
OOS_R2 <- 1 - MSE_A/MSE_N
#Is the -1 enough (maybe -2 needed because of lag)?
OOS_oR2 <- OOS_R2 - (1-OOS_R2)*(reg$df.residual)/(T - 1)
dRMSE <- sqrt(MSE_N) - sqrt(MSE_A)
##
#### CREATE PLOT
IS <- cumsum(IS_error_N[2:length(IS_error_N)]^2)-cumsum(IS_error_A^2)
OOS <- cumsum(OOS_error_N^2)-cumsum(OOS_error_A^2)
df <- data.frame(x=seq.int(from=start + 1 + est_periods_OOS, to=end),
IS=IS[(1 + est_periods_OOS):length(IS)],
OOS=OOS) #Because you lose one observation due to the lag
#Shift IS errors vertically, so that the IS line begins
# at zero on the date of first OOS prediction. (se Goyal/Welch (2008, side 1465))
df$IS <- df$IS - df$IS[1]
df <- melt(df, id.var="x")
plotGG <- ggplot(df) +
geom_line(aes(x=x, y=value,color=variable)) +
geom_rect(data=data.frame(),#Needed by ggplot2, otherwise not transparent
aes(xmin=2008, xmax=2010,ymin=-0.2,ymax=0.2),
fill='red',
alpha=0.1) +
scale_y_continuous('Cumulative SSE Difference', limits=c(-0.2, 0.2)) +
scale_x_continuous('Year')
##
return(list(IS_error_N = IS_error_N,
IS_error_A = reg$residuals,
OOS_error_N = OOS_error_N,
OOS_error_A = OOS_error_A,
IS_R2 = summary(reg)$r.squared,
IS_aR2 = summary(reg)$adj.r.squared,
OOS_R2 = OOS_R2,
OOS_oR2 = OOS_oR2,
dRMSE = dRMSE,
plotGG = plotGG))
}
dp_stat <- get_statistics(myts, "dp", "sg", start=1872)
dp_stat$plotGG
As the error message states, n must be a positive integer, not a double vector of length 1. The error comes from you providing n = -1 (i.e., a negative number) as an argument. I assume your idea is to have a negative number of positions to lag by. However, the lag() function only accepts a positive number of lag positions. Instead of lag(), you should use lead() with n = 1 to achieve the desired result.
Related
So, just a touch of backstory. I've been learning biostatistics in the past 4-5 months in university, 6 months of biomathematics before that. I only started deep diving into programming around 5 days ago.
I've been trying to redo t.test() with my own function.
test2 = function(t,u){
T = (mean(t) - u) / ( sd(t) / sqrt(length(t)))
t1=round(T, digits=5)
df=length(t)
cat(paste('t - value =', t1,
'\n','df =', df-1,
'\n','Alternative hipotézis: a minta átlag nem egyenlő a hipotetikus átlaggal'))
}
I tried searching the formula for the p-value, I found one, but when I used it, my value was different from the one within the t.test.
The t-value and the df do match t.test().
I highly appreciate any help, thank you.
P.s: Don't worry about the last line, it's in Hungarian.
The p-value can be derived from the probability function of the t distribution pt. Using this and making the notation more common with sample x and population mean mu we can use something like:
test2 <- function(x, u){
t <- (mean(x) - u) / (sd(x) / sqrt(length(x)))
df <- length(x) - 1
cat('t-value =', t, ', df =', df, ', p =', 2 * (1 - pt(q=t, df=df)), '\n')
}
set.seed(123) # remove this for other random values
## random sample
x <- rnorm(10, mean=5.5)
## population mean
mu <- 5
## own function
test2(x, mu)
## one sample t-test from R
t.test(x, mu=mu)
We get for the own test2:
t-value = 1.905175 , df = 9, p = 0.08914715
and for R's t.test
One Sample t-test
data: x
t = 1.9052, df = 9, p-value = 0.08915
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
4.892330 6.256922
sample estimates:
mean of x
5.574626
The definitive source of what R is doing is the source code. If you look at the source code for stats:::t.test.default (which you can get by typing stats:::t.test.default into the console, without parentheses at the end and hitting enter), you'll see that for a single-sample test like the one you're trying to do above, you would get the following:
nx <- length(x)
mx <- mean(x)
vx <- var(x)
df <- nx - 1
stderr <- sqrt(vx/nx)
tstat <- (mx - mu)/stderr
if (alternative == "less") {
pval <- pt(tstat, df)
}
else if (alternative == "greater") {
pval <- pt(tstat, df, lower.tail = FALSE)
}
else {
pval <- 2 * pt(-abs(tstat), df)
}
These are the relevant pieces (there's a lot more code in there, too).
The basic gist is that I have a set of housing data that I need to create a model for to minimize the predicted price vs actual price of house based on the dataset. So I created this bit of code to essentially test for a range of different numerators and find the one that minimized the difference between them. I'm using the median instead of the mean as the data isn't exactly normal.
Since I only have experience with lm(), I'm using that to create the coefficients and C values. But since the model likes exponents, I have to also test various exponents. It does this for each of the variables and then goes back to the first and re-evaluates it based on the other exponents. The model starts out with all the exponents ending up equal to 1. So the same as the basic linear model. I know that this is probably horribly inefficient and probably uses a lot of code in a somewhat wasteful, but I'm in my first r class so sorry about the mess and/or convoluted coding logic.
Is there any way to do this same thing but being more efficient. Also, I can't really decrease the number of variables as the model likes having more variables and produces a greater margin of error when they aren't present.
w <- seq(1,10000,1)
r <- seq(1,10000,1)
t <- seq(1,10000,1)
z <- seq(1,10000,1)
s <- seq(1,10000,1)
coef_1 <- c(6000,6000,6000,6000,6000,6000,6000,6000)
v <- rep(6000, each = 8)
for(l_1 in 1:10){
for(t_1 in 1:8){
for(i in 1:10000){
t = t_1
coef_1[t] = i
mod5 <- lm(log(SALE_PRC) ~ I(TOT_LVG_AREA^((coef_1[1]-5000)/1000)) + I(LND_SQFOOT^((coef_1[2]-5000)/1000)) + I(RAIL_DIST^((coef_1[3]-5000)/1000)) + I(OCEAN_DIST^((coef_1[4]-5000)/1000)) + I(CNTR_DIST^((coef_1[5]-5000)/1000)) + I(HWY_DIST^((coef_1[6]-5000)/1000)) + I(structure_quality^((coef_1[7]-5000)/1000)) + SUBCNTR_DI + SPEC_FEAT_VAL + (exp(((coef_1[8]-5000)/1000)*SPECIAL_RATIO)) + age, data = kaggle_transform_final)
kaggle_new <- kaggle_transform_final %>%
add_predictions(model = mod5, var = "prediction") %>%
mutate(new_predict = exp(prediction)) %>%
mutate(new_difference = abs((new_predict-SALE_PRC))/SALE_PRC) %>%
mutate(average_percent_difference = median(new_difference)) %>%
mutate(mean_percent_difference = mean(new_difference)) %>%
mutate(quart_75 = quantile(new_difference,.75))
w[i] = kaggle_new$average_percent_difference[1]
r[i] = kaggle_new$mean_percent_difference[1]
t[i] = kaggle_new$quart_75[1]
z[i] = i
s[i] = (i-5000)/1000
if(i%%100 ==0){show(i)}
}
u <- data.frame(median_diff = w, mean_diff = r, quart_75 = t, actual = s, number = z) %>%
arrange(median_diff)
coef_1[t_1] <- u$number[1]
v[t_1] <- u$actual[1]
show(coef_1)
}
coef_1 <- coef_1
}
So I am running a survival analysis on my dataset of google playstor downloads.
My analysis using survreg only provides me with nas for coefficients though.
"(5 not defined because of singularities)"
If I use a normal lm regression this problem does not occur. This would not work however since all observations of the dependent variable are right censored for a different number (the numeric value is also the limit).
My original dataset: https://www.kaggle.com/lava18/google-play-store-apps
So here I will show you my entire code. It might be a bit long so scroll to the end for the survival analysis, but I wanted to give you the ability to fully comprehend.
library(readxl)
Dataset <- read_excel("Thesis/googleplaystore.xlsx")
View(Dataset)
#selecteer 500 apps
set.seed(1998)
dataset <- Dataset[sample(nrow(Dataset), 500), ]
View(dataset)
#Lastupdated --> days_since
end <- matrix( c("2018-08-31"), nrow=500, ncol=1, byrow=FALSE)
end <- format(as.Date(end), "%Y/%m/%d")
View(end)
dataset$`Last Updated` <- as.Date(dataset$`Last Updated`,
format = "%B %d, %Y")
dataset$`Last Updated` <- format(as.Date(dataset$`Last Updated`), "%Y/%m/%d")
View(dataset)
install.packages('lubridate')
library(lubridate)
elapsed.time <- dataset$`Last Updated` %--% end
View(elapsed.time)
dataset$days_since <- as.duration(elapsed.time) / ddays(1)
View(dataset)
# + verwijdern uit aantal installs
dataset$Install <- gsub("\\+","", dataset$Installs)
View(dataset)
dataset$Install <- gsub(",","", dataset$Install)
# installs en price numeric maken
typeof(dataset$Install)
dataset$Install <- as.numeric(dataset$Install)
View(dataset)
typeof(dataset$Rating)
dataset$Rating <- as.numeric(dataset$Rating)
typeof(dataset$Reviews)
typeof(dataset$Price)
dataset$Price <- gsub("\\$","", dataset$Price)
dataset$Price <- as.numeric(dataset$Price)
typeof(dataset$days_since)
#Tobit Survival analyses
library(help=survival)
library(survival)
dataset$ins_cen <- matrix( c("0"), nrow=500, ncol=1, byrow=FALSE)
typeof(dataset$ins_cen)
dataset$ins_cen <- as.numeric(dataset$ins_cen)
install.packages('tidyverse')
library(tidyverse)
dataset_2 <- dataset %>% filter(!is.na(dataset$Rating))
View(dataset_2)
dataset_2$dum_cen <- ifelse(dataset_2$ins_cen == 0, 0, 1)
dataset_2$dum_fac <- as.factor(dataset_2$dum_cen)
survreg(Surv(Install, ins_cen, type= 'right') ~ Rating + Price + Reviews + days_since,
dist="gaussian", data = dataset_2)
cor(dataset)
#CRCH
install.packages('crch')
library(crch)
View(dataset)
CRCH <- crch(Install ~ Rating + Price + Size + Reviews +days_since + `Current Ver` + Category, data = dataset, dist = 'gaussian', right = dataset_2$Install)
I tried turning the event into an dummy variable and a factor but both options do not work. The dummy variable changes nothing, while the factor variable gives an error.
Error in survreg(Surv(Install, dum_fac, type = "right") ~ Rating +
Price + : multi-state survival is not supported
Thanks for any help.
Sorry if I am asking stupid questions but I am still learning and can't figure my problem out.
p.s. I also tried to solve my problem using crch() but this lead to a different error, where I can't seem to wrap my head around either.
Error in optim(par = start, fn = loglikfun, gr = gradfun, method =
method, : non-finite value supplied by optim
Edit: I noticed I left character variables in the crch code.
When this is removed from the formula I get a different error.
Error in solve.default(hessfun(par)) : system is computationally
singular: reciprocal condition number = 7.31468e-142
CRCH code:
#CRCH
install.packages('crch')
library(crch)
View(dataset)
CRCH <- crch(Install ~ Rating + Price + Reviews +days_since, data = dataset, dist = 'gaussian', left = -Inf, right = dataset_2$Install)
x = Price + Size + Reviews +days_since + `Current Ver` + Category
I have an array of outputs from hundreds of segmented linear models (made using the segmented package in R). I want to be able to use these outputs on new data, using the predict function. To be clear, I do not have the segmented linear model objects in my workspace; I just saved and reimported the relevant outputs (e.g. the coefficients and breakpoints). For this reason I can't simply use the predict.segmented function from the segmented package.
Below is a toy example based on this link that seems promising, but does not match the output of the predict.segmented function.
library(segmented)
set.seed(12)
xx <- 1:100
zz <- runif(100)
yy <- 2 + 1.5*pmax(xx-35,0) - 1.5*pmax(xx-70,0) +
15*pmax(zz-0.5,0) + rnorm(100,0,2)
dati <- data.frame(x=xx,y=yy,z=zz)
out.lm<-lm(y~x,data=dati)
o<-## S3 method for class 'lm':
segmented(out.lm,seg.Z=~x,psi=list(x=c(30,60)),
control=seg.control(display=FALSE))
# Note that coefficients with U in the name are differences in slopes, not slopes.
# Compare:
slope(o)
coef(o)[2] + coef(o)[3]
coef(o)[2] + coef(o)[3] + coef(o)[4]
# prediction
pred <- data.frame(x = 1:100)
pred$dummy1 <- pmax(pred$x - o$psi[1,2], 0)
pred$dummy2 <- pmax(pred$x - o$psi[2,2], 0)
pred$dummy3 <- I(pred$x > o$psi[1,2]) * (coef(o)[2] + coef(o)[3])
pred$dummy4 <- I(pred$x > o$psi[2,2]) * (coef(o)[2] + coef(o)[3] + coef(o)[4])
names(pred)[-1]<- names(model.frame(o))[-c(1,2)]
# compute the prediction, using standard predict function
# computing confidence intervals further
# suppose that the breakpoints are fixed
pred <- data.frame(pred, predict(o, newdata= pred,
interval="confidence"))
# Try prediction using the predict.segment version to compare
test <- predict.segmented(o)
plot(pred$fit, test, ylim = c(0, 100))
abline(0,1, col = "red")
# At least one segment not being predicted correctly?
Can I use the base r predict() function (not the segmented.predict() function) with the coefficients and break points saved from segmented linear models?
UPDATE
I figured out that the code above has issues (don't use it). Through some reverse engineering of the segmented.predict() function, I produced the design matrix and use that to predict values instead of directly using the predict() function. I do not consider this a full answer of the original question yet because predict() can also produce confidence intervals for the prediction, and I have not yet implemented that--question still open for someone to add confidence intervals.
library(segmented)
## Define function for making matrix of dummy variables (this is based on code from predict.segmented())
dummy.matrix <- function(x.values, x_names, psi.est = TRUE, nameU, nameV, diffSlope, est.psi) {
# This function creates a model matrix with dummy variables for a segmented lm with two breakpoints.
# Inputs:
# x.values: the x values of the segmented lm
# x_names: the name of the column of x values
# psi.est: this is legacy from the predict.segmented function, leave it set to 'TRUE'
# obj: the segmented lm object
# nameU: names (class character) of 3rd and 4th coef, which are "U1.x" "U2.x" for lm with two breaks. Example: names(c(obj$coef[3], obj$coef[4]))
# nameV: names (class character) of 5th and 6th coef, which are "psi1.x" "psi2.x" for lm with two breaks. Example: names(c(obj$coef[5], obj$coef[6]))
# diffSlope: the coefficients (class numeric) with the slope differences; called U1.x and U2.x for lm with two breaks. Example: c(o$coef[3], o$coef[4])
# est.psi: the estimated break points (class numeric); these are the estimated breakpoints from segmented.lm. Example: c(obj$psi[1,2], obj$psi[2,2])
#
n <- length(x.values)
k <- length(est.psi)
PSI <- matrix(rep(est.psi, rep(n, k)), ncol = k)
newZ <- matrix(x.values, nrow = n, ncol = k, byrow = FALSE)
dummy1 <- pmax(newZ - PSI, 0)
if (psi.est) {
V <- ifelse(newZ > PSI, -1, 0)
dummy2 <- if (k == 1)
V * diffSlope
else V %*% diag(diffSlope)
newd <- cbind(x.values, dummy1, dummy2)
colnames(newd) <- c(x_names, nameU, nameV)
} else {
newd <- cbind(x.values, dummy1)
colnames(newd) <- c(x_names, nameU)
}
# if (!x_names %in% names(coef(obj.seg)))
# newd <- newd[, -1, drop = FALSE]
return(newd)
}
## Test dummy matrix function----------------------------------------------
set.seed(12)
xx<-1:100
zz<-runif(100)
yy<-2+1.5*pmax(xx-35,0)-1.5*pmax(xx-70,0)+15*pmax(zz-.5,0)+rnorm(100,0,2)
dati<-data.frame(x=xx,y=yy,z=zz)
out.lm<-lm(y~x,data=dati)
#1 segmented variable, 2 breakpoints: you have to specify starting values (vector) for psi:
o<-segmented(out.lm,seg.Z=~x,psi=c(30,60),
control=seg.control(display=FALSE))
slope(o)
plot.segmented(o)
summary(o)
# Test dummy matrix fn with the same dataset
newdata <- dati
nameU1 <- c("U1.x", "U2.x")
nameV1 <- c("psi1.x", "psi2.x")
diffSlope1 <- c(o$coef[3], o$coef[4])
est.psi1 <- c(o$psi[1,2], o$psi[2,2])
test <- dummy.matrix(x.values = newdata$x, x_names = "x", psi.est = TRUE,
nameU = nameU1, nameV = nameV1, diffSlope = diffSlope1, est.psi = est.psi1)
# Predict response variable using matrix multiplication
col1 <- matrix(1, nrow = dim(test)[1])
test <- cbind(col1, test) # Now test is the same as model.matrix(o)
predY <- coef(o) %*% t(test)
plot(predY[1,])
lines(predict.segmented(o), col = "blue") # good, predict.segmented gives same answer
I have been creating a few technical indicators using Quantmod's NewTa function.
I've been trying to create a custom indicator that ideally should be charted using ChartSeries. This indicator should show the slope of the line of the 50 day EMA of the adjusted closing price.
getSymbols("NOVO-B.CO")
p <- na.omit('NOVO-B.CO')
FiftyEMA <- function(x){
MA <- removeNA((EMA(p[,6],n=50)))
}
SlopeFiftyEMA <- function(x){
run=(FiftyEMA(y)/FiftyEMA(x))
}
Slope.Indicator <- newTA(SlopeFiftyEMA,legend.name = "50 Day EMA Slope of Line Indicator")
Slope.Indicator()
This gives me the error: Error in get.current.chob() : improperly set or missing graphics device
I also tried a new code that gives me an actual INDICATOR! Please let me know what you think (if you think it looks correct or not):
First I export the data to excel: (the stock data is still denoted as p)
write.csv(p,"data")
import data
x <- data[,1]
y <- data[,7]
MA <- removeNA(EMA(y,n=50))
length(MA)
length of MA = 1923
l=1:1923
SlopeFiftyEMA <- function(x){
(diff(MA)/diff(l))
}
Slope.Indicator <- newTA(SlopeFiftyEMA,legend.name = "50 Day EMA Slope of Line Indicator")
twelvemonths="last 12 months"
chartSeries(p,subset = twelvemonths,theme = 'white',up.col = 'blue',dn.col = 'grey',name ="Custom Indicators")
Slope.Indicator()
Any Input anyone? Last time I posted there was no indicator
Thanks in advance!
Your first error seems to exist because you don't call chartSeries before calling Slope.indicator(). But your code is a bit messy, including not defining y (maybe you introduce it later in import data).
The approach presented here will plot the slope of the MA according to linear regression, using chart_Series (arguably cleaner plots than the original chartSeries). Two types of slopes are computed, including the one you proposed, which is the differences of the EMA.
getSymbols(c("NOVO-B.CO"))
x <- `NOVO-B.CO`
x[, c(1:4, 6)] <- na.locf(x[, c(1:4, 6)])
x$EMA <- EMA(Cl(x), n = 50)
x <- merge(x, rollSFM(Ra = x[, "EMA"], Rb = 1:NROW(x), n = 20))
x <- merge(x, setNames(diff(x$EMA), "diff1"))
chart_Series(x, subset = "2016/")
add_TA(x$EMA, on = 1, col = "purple")
# Plot the slope of the MA:
add_TA(x$beta, col = "green")
# Plot the 1 lag diff of the moving average:
add_TA(x$diff1, lty = 2)