I am trying to (re)build a basic prediction model of the S&P 500 index (the data originates from Yahoo Finance).
I ran into some difficulties with the "ordering" of my data set.
While building data.model, the following error occurs:
Error in xts(new.x, x.index) : NROW(x) must match length(order.by)
After some research I realized the problem is with the ordering: the data seems to lack the ordering that the underlying zoo package requires.
Is there an elegant way to solve this issue? Thanks in advance.
library(xts)
library(tseries)
library(quantmod)
GSPC <- as.xts(get.hist.quote("^GSPC", start = "1970-01-02",
                              quote = c("Open", "High", "Low", "Close", "Volume", "AdjClose")))
head(GSPC)
T.ind <- function(quotes, tgt.margin = 0.025, n.days = 10) {
  v <- apply(HLC(quotes), 1, mean)
  r <- matrix(NA, ncol = n.days, nrow = NROW(quotes))
  for (x in 1:n.days) r[, x] <- Next(Delt(v, k = x), x)
  x <- apply(r, 1, function(x) sum(x[x > tgt.margin | x < -tgt.margin]))
  if (is.xts(quotes))
    xts(x, time(quotes))
  else x
}
myATR <- function(x) ATR(HLC(x))[, "atr"]
mySMI <- function(x) SMI(HLC(x))[, "SMI"]
myADX <- function(x) ADX(HLC(x))[, "ADX"]
myAroon <- function(x) aroon(x[, c("High", "Low")])$oscillator
myBB <- function(x) BBands(HLC(x))[, "pctB"]
myChaikinVol <- function(x) Delt(chaikinVolatility(x[, c("High", "Low")]))[, 1]
myCLV <- function(x) EMA(CLV(HLC(x)))[, 1]
myEMV <- function(x) EMV(x[, c("High", "Low")], x[, "Volume"])[, 2]
myMACD <- function(x) MACD(Cl(x))[, 2]
myMFI <- function(x) MFI(x[, c("High", "Low", "Close")], x[, "Volume"])
mySAR <- function(x) SAR(x[, c("High", "Close")])[, 1]
myVolat <- function(x) volatility(OHLC(x), calc = "garman")[, 1]
library(randomForest)
data.model <- specifyModel(T.ind(GSPC) ~ Delt(Cl(GSPC),k=1:10) +
myATR(GSPC) + mySMI(GSPC) + myADX(GSPC) + myAroon(GSPC) +
myBB(GSPC) + myChaikinVol(GSPC) + myCLV(GSPC) +
CMO(Cl(GSPC)) + EMA(Delt(Cl(GSPC))) + myEMV(GSPC) +
myVolat(GSPC) + myMACD(GSPC) + myMFI(GSPC) + RSI(Cl(GSPC)) +
mySAR(GSPC) + runMean(Cl(GSPC)) + runSD(Cl(GSPC)))
traceback() reveals the error occurs in the Delt(Cl(GSPC),k=1:10) call:
> Delt(Cl(GSPC),k=1:10)
Error in xts(new.x, x.index) : NROW(x) must match length(order.by)
Delt expects a (m x 1) object but you're passing a (m x 2) object. This is because GSPC has two columns that are matched by Cl ("Close" and "AdjClose"). This will probably cause headaches in other areas too...
Cl expects objects like those returned by getSymbols, where the adjusted close column is named "Adjusted". If you need to use get.hist.quote for some reason, just rename the "AdjClose" column after you download the data.
Delt(Cl(GSPC), k = 1:10)
## Error in xts(x, order.by = order.by, frequency = frequency, ...
## NROW(x) must match length(order.by)
colnames(GSPC) <- c("Open", "High", "Low", "Close", "Volume", "Adjusted")
Delt(Cl(GSPC), k = 1:10)  # works now
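If you don't strictly need get.hist.quote, the getSymbols route avoids the name clash entirely, because only one column name contains "Close". A minimal sketch, assuming a working Yahoo connection and current quantmod defaults:
library(quantmod)

GSPC2 <- getSymbols("^GSPC", from = "1970-01-02", auto.assign = FALSE)
colnames(GSPC2)                 # "GSPC.Open" ... "GSPC.Adjusted"
ncol(Cl(GSPC2))                 # 1 -- only "GSPC.Close" matches Cl()
head(Delt(Cl(GSPC2), k = 1:10))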
I wasted hours running into this error. Whether or not I had the exact same problem, I'll show how I resolved this error message, in case it saves you the pain I went through.
I imported an Excel or CSV file (I tried both) through several importing functions, then tried to convert my data (as either a data.frame or .zoo object) into an xts object, and kept getting errors, this one included.
I tried creating a vector of dates separately to pass in as the order.by parameter. I tried making sure the date vector and the rows of the data.frame were the same length. Sometimes it worked and sometimes it didn't, for reasons I can't explain. Even when it did work, R had coerced all my numeric data into character data, which caused me endless problems later. (Watch out for coercion, I learned.)
These errors kept happening until I did two things:
For the xts conversion I used the date column from the imported Excel sheet as the order.by parameter with an as.Date() modifier, AND I *dropped the date column during the conversion to xts.*
Here's the working code:
library(readxl)  # for read_excel()
library(xts)

xl_sheet <- read_excel("../path/to/my_excel_file.xlsx")
sheet_xts <- xts(xl_sheet[-1], order.by = as.Date(xl_sheet$date))
Note my date column was the first column, so the xl_sheet[-1] removed the first column.
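One extra guard worth adding (my sketch, not part of the original fix): coerce the value columns to numeric explicitly before the conversion, so the character coercion mentioned above surfaces as inspectable NAs rather than a silently character-typed xts object.
vals <- as.data.frame(lapply(xl_sheet[-1], as.numeric))  # bad cells become NA with a warning
sheet_xts <- xts(vals, order.by = as.Date(xl_sheet$date))
str(sheet_xts)  # the core data should now be numeric, not character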
I cannot seem to even create a reproducible example on this as it works fine when I go through the code one line at a time.
The error message I get is as follows:
"Error in testData[, colCheck][length(testData[, colCheck])] - testData[, :
non-numeric argument to binary operator "
Both colCheck and testData$linearcorrd15N are numeric and, like I said, the calculation works fine when I run that line on its own. The error comes only when I call the function as QTest(df, colCheck).
Here is an example of what some of the code looks like. It will not produce an error, but maybe you can see something that I don't.
library(dplyr)  # for arrange()

QTest <- function(testData, colCheck)
  #%#
  # testData <- The entire data frame for the std/ref that has too high
  #             of a SD, so the data frame can be returned without the outlier
  # colCheck <- The column name for values that were flagged for having too high of a SD
  # This Q test info provided by: https://www.statisticshowto.com/dixons-q-test/
  #%#
{
  # Get the mean of the highest and lowest values
  testData <- arrange(testData, desc(testData[, colCheck]))
  len <- length(testData[, colCheck]) - 1
  high <- sapply(1:len, function(i) testData[, colCheck][i])
  meanhigh <- mean(high)
  testData <- arrange(testData, testData[, colCheck])
  low <- sapply(1:len, function(i) testData[, colCheck][i])
  meanlow <- mean(low)
  # If the mean of the lowest numbers is lower than the mean of the highest numbers, do this
  if (meanlow < meanhigh) {
    QexpVal <- abs((testData[, colCheck][2] - testData[, colCheck][1]) /
                   (testData[, colCheck][length(testData[, colCheck])] - testData[, colCheck][1]))
    outlier <- testData[, colCheck][1]
    closest <- testData[, colCheck][2]
  # else if the mean of the lowest numbers is higher than the mean of the highest numbers, do this
  } else {
    QexpVal <- abs((testData[, colCheck][length(testData[, colCheck]) - 1] - testData[, colCheck][length(testData[, colCheck])]) /
                   (testData[, colCheck][length(testData[, colCheck])] - testData[, colCheck][1]))
    outlier <- testData[, colCheck][length(testData[, colCheck])]
    closest <- testData[, colCheck][length(testData[, colCheck]) - 1]
  }
  return(QexpVal)
}
df <- data.frame(Row = c(1, 2, 3, 4, 5),
                 Identifier.2 = "36-UWSIF-UT Glut1",
                 linearcorrd15N = c(-11.63433, -22.13869, -57.21795, -17.06438, -16.23358))
colCheck <- as.numeric(grep("linearcorrd15N", colnames(df)))
QTestCorrVals <- QTest(df, colCheck)
It seems you really overcomplicate this function by pushing the whole table into the function, looping over everything, and reading values back out of the whole table...
Just getting meanhigh and meanlow requires no more than this:
v <- df[, colCheck]
v <- v[order(v)]
n <- length(v)
meanhigh <- mean(v[2:n])
meanlow <- mean(v[1:(n - 1)])
Or, if you use decreasing ordering:
v <- df[, colCheck]
v <- v[order(v, decreasing = TRUE)]
n <- length(v)
meanhigh <- mean(v[1:(n - 1)])
meanlow <- mean(v[2:n])
Full function
Here is the full code using this approach. I agree this is not the specific question you asked, but the way you coded it is highly inefficient and error prone: every time you take the whole data.frame, subset it, and recalculate lengths. You also only have to order once; if the lowest value is at the top, the highest is by definition at the bottom. Then play around with 1 for the first, 2 for the second, n for the last, and n - 1 for the second-to-last value.
df <- data.frame(Row = c(1, 2, 3, 4, 5),
                 Identifier.2 = "36-UWSIF-UT Glut1",
                 linearcorrd15N = c(-11.63433, -22.13869, -57.21795, -17.06438, -16.23358))
colCheck <- as.numeric(grep("linearcorrd15N", colnames(df)))
QTest <- function(v) {
  v <- v[order(v)]
  n <- length(v)
  meanhigh <- mean(v[2:n])
  meanlow <- mean(v[1:(n - 1)])
  if (meanlow < meanhigh) {
    QexpVal <- abs((v[2] - v[1]) / (v[n] - v[1]))
    outlier <- v[1]
    closest <- v[2]
  } else {
    QexpVal <- abs((v[n - 1] - v[n]) / (v[n] - v[1]))
    outlier <- v[n]
    closest <- v[n - 1]
  }
  return(QexpVal)
}
QTestCorrVals <- QTest(df[, colCheck])
Side note
Using a column index number works slightly differently depending on whether your data is a data.frame or a data.table:
library(data.table)

class(df)        # "data.frame"
df[, colCheck]   # returns the column

dt <- data.table(df)
class(dt)        # "data.table" "data.frame"
dt[, ..colCheck] # the ..prefix tells data.table that colCheck holds a column index
dt[, colCheck]   # throws an error
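If you prefer plain numeric indexing with a data.table, the with = FALSE idiom does the same job (a small sketch):
dt[, colCheck, with = FALSE]  # j now behaves like data.frame column indexing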
I have written a short function to clean some dataframes that I have in a list. When selecting columns using the df[,1] method, my function doesn't work. However when I select using df$Column it does. Why is this?
columns_1 <- function(x) {
  x[,1] <- dmy_hm(x[,1])
  x[,2] <- NULL
  x[,3] <- as.numeric(x[,3])
  x[,4] <- NULL
  return(x)
}
MS_ <- lapply(MS_, columns_1)
columns_2 <- function(x) {
  x$DateTime <- dmy_hm(x$DateTime)
  x$LogSeconds <- NULL
  x$Pressure <- as.numeric(x$Pressure)
  x$Temperature <- NULL
  return(x)
}
MS_ <- lapply(MS_, columns_2)
The function columns_2 produces the desired results (all dataframes in list are cleaned). columns_1 returns the error message:
Error in FUN(X[[i]], ...) :
(list) object cannot be coerced to type 'double'
In addition: Warning message:
All formats failed to parse. No formats found.
The issue would be that the assignment MS_ <- lapply(MS_, columns_1) overwrote MS_ after the first run, so some columns were already lost the next time the function was applied.
library(lubridate)
MS_ <- lapply(MS_, columns_1)
Instead, it can be done by assigning the result to a different object:
MS2_ <- lapply(MS_, columns_1)
data
set.seed(24)
df1 <- data.frame(DateTime = format(Sys.Date() + 1:5, "%d-%m-%Y %H:%M"),
LogSeconds = 1:5,
Pressure = rnorm(5), Temperature = rnorm(5, 25),
stringsAsFactors = FALSE)
MS_ <- list(df1, df1)
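With that example data, a sketch of the diagnosis (my reconstruction, not part of the original answer): the first pass over the fresh list works, but feeding the already-cleaned result back through columns_1 fails, because dmy_hm() then receives POSIXct values and the numeric column positions no longer point at the intended columns.
library(lubridate)

MS2_ <- lapply(MS_, columns_1)  # first pass on the fresh data: fine
str(MS2_[[1]])                  # DateTime is parsed, LogSeconds is gone

try(lapply(MS2_, columns_1))    # second pass on the cleaned list fails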
Here is some example data:
set.seed(1234) # Make the results reproducible
count <- 100
cs1 <- round(rchisq(count, 1), 2)
cs2 <- round(rchisq(count, 2), 2)
c(rep("Present", 30), rep("Absent", 30), rep("NA", 40)) -> temp
temp[temp == "NA"] <- NA
as.factor(temp) -> temp
temp1 <- round(rnorm(count, 3), 2)
temp1[7] <- NA
temp2 <- round(rnorm(count, 7), 2)
temp2[54] <- NA
c(rep("Yes", 30), rep("No", 30), rep("Maybe", 30), rep("NA", 10)) -> temp3
temp3[temp3 == "NA"] <- NA
as.factor(temp3) -> temp3
c(rep("Group A", 55), rep("Group B", 45)) -> temp4
as.factor(temp4) -> temp4
mydata <- data.frame(cs1, cs2, temp, temp1, temp2, temp3, temp4)
mydata$cs2[56:100] <- NA ; mydata
I know I can compute summary statistics for each variable stratified by temp4 like so:
by(mydata, mydata$temp4, summary)
However, I would also like to compute either a t.test or a chisq.test for each variable stratified by temp4. I've tried simply modifying the above code to do that but it always gives me an error. It seems the error stems from the fact that some of the variables in the data frame are numeric (and thus, would need a t.test) while others are factors (and thus, would need a chisq.test).
Is there a simple way to tell R to check the variable to see what kind it is, and then run the appropriate test, all at once? And to still print out all of the results even if it encounters an error?
I am not worried about the appropriateness of doing this (e.g., I am aware of the risks of multiple testing, etc) but rather just need to know how to do it. Thanks!
You can use lapply to loop through the variables and decide inside the anonymous function which test to conduct.
When an error occurs, it's caught by tryCatch and instead of a test result the final list will have the error message as a member.
tests_list <- lapply(mydata[-ncol(mydata)], function(x) {
  tryCatch({
    if (is.numeric(x)) {
      if (length(levels(mydata$temp4)) == 2) {
        t.test(x ~ temp4, data = mydata)
      } else {
        aov(x ~ temp4, data = mydata)
      }
    } else {
      tbl <- table(x, mydata$temp4)
      chisq.test(tbl)
    }
  }, error = function(e) e)
})
err <- sapply(tests_list, inherits, "error")
tests_list$cs1
tests_list$temp3
tests_list[err]  # single brackets: logical subsetting keeps the members that errored
Yes, you can loop through designated columns, keeping temp4 as the factor, and check class of each column (named x within the anonymous function). You can use sapply or apply(X, MARGIN = 2, FUN ...). Note that I'm explicitly subsetting mydata because I find it more explicit and readable.
sapply(mydata[, c("cs1", "cs2", "temp", "temp1", "temp2", "temp3")], FUN = function(x, group) {
  if (is.numeric(x)) {
    # perform t-test (wrap in tryCatch, as in the other answer, if a
    # column can fail -- e.g. cs2 is all NA within Group B here)
    return(t.test(x ~ group))
  }
  if (is.factor(x)) {
    # perform chi-square test on the x-by-group contingency table
    return(chisq.test(table(x, group)))
  }
}, group = mydata$temp4)
I am writing a function to group together actions I regularly take on time series data. I have included all libraries I am using in the script as I think my issue may be to do with plyr / dplyr being (rightly) super specific about the environment of each variable.
The first function works great, but in the second one R doesn't recognise the input 'x' and spits out the error: 'Error in eval(predvars, data, env) : object 'x' not found.'
Why is this happening?
library(plyr)
library(dplyr)
library(caret)
library(xts)
library(forecast)
library(imputeTS)
library(lubridate)
x1 = arima.sim(list(order = c(0,1,0)), n = 119)
timetrend <- function(x, starts, ends, frequency) {
  y <- list()
  y[[1]] <- ts(x, start = starts, end = ends, frequency = frequency)
  y[[2]] <- decompose(y[[1]])
  y[[3]] <- y[[1]] - y[[2]]$seasonal - y[[2]]$random
  return(y)
}

plottime <- function(x) { # takes a timetrend list as input
  t <- tslm(x[[3]] ~ trend)
  plot(x[[3]])
  lines(t$fitted.values)
  return(t)
}
Use the functions from here:
result <- timetrend(x = x1,
starts = c(2000, 01, 01), ends = c(2009, 12, 01), frequency = 12)
plottime(x = result)
I could make it work with the following code.
plottime <- function(x) { # takes a timetrend list as input
  y <- x[[3]]
  t <- tslm(formula = y ~ trend)
  plot(x[[3]])
  lines(t$fitted.values)
  return(t)
}
I am not sure why it is happening; maybe the use of the indexing expression x[[3]] in the formula argument is the problem?
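A minimal check of that guess (a sketch, not authoritative): the failure shows up when the indexing expression sits in a formula built inside a function, which is exactly plottime()'s situation, and binding the series to a plain name first is what makes it visible to tslm().
library(forecast)

f_bad  <- function(x) tslm(x[[3]] ~ trend)              # as in the question
f_good <- function(x) { y <- x[[3]]; tslm(y ~ trend) }  # as in the fix

try(f_bad(result))  # reproduces "object 'x' not found"
f_good(result)      # fits the linear trend model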
Here is my code:
data <- data.frame(matrix(0, nrow = 9, ncol = 2))
data[, 1] <- c(0, 15, 41, 81, 146, 211, 438, 958, 1733)
data[, 2] <- c(0.000000, 5.7013061, 13.2662515, 26.0874534, 42.2710547,
               55.6050052, 75.597474, 112.6755999, 109.45890071)
rownames(data) <- c("E0_TAP", "E3_TAP", "E4_TAP", "E5_TAP", "E6_TAP",
                    "E7_TAP", "E8_TAP", "E10_TAP", "E12_TAP")
colnames(data) <- c("S", "v")
This is the light saturation curve of photosystem II in Chlamydomonas reinhardtii. I would like to find the best fit for my curve using the Michaelis-Menten model. I tried the drm() command from the drc package this way:
library(drc)
model.drm <- drm(v ~ cluster(S), data = data, fct = MM.2())
When I run this code the calculation of the fitting starts, but it's interrupted by an error that I do not really comprehend:
Error in parse(text = paste(paste(rep("c(", nrep - 1), collapse = ""), :
<text>:2:39: unexpected ')'
1: mu[(1+( 1 * (i - 1))),] %*%
2: mu[( 2 + ( 1 * (i - 1))),drop=FALSE,])
^
In addition: Warning message:
In cbind(mu[, 2:(nclass - 1)], 1) - mu[, seq(nclass - 1)] :
longer object length is not a multiple of shorter object length
Timing stopped at: 0 0 0
Although I will keep trying to solve the problem by myself, I would really appreciate it if someone could help me fix it quicker or find an alternative way to perform the analysis.
Thanks in advance!
Thanks to the help of a friend, here follows the answer:
library(drc)

data <- data.frame(matrix(0, nrow = 9, ncol = 2))
data[, 1] <- c(0, 15, 41, 81, 146, 211, 438, 958, 1733)
data[, 2] <- c(0.000000, 5.7013061, 13.2662515, 26.0874534, 42.2710547,
               55.6050052, 75.597474, 112.6755999, 109.45890071)
rownames(data) <- c("E0_TAP", "E3_TAP", "E4_TAP", "E5_TAP", "E6_TAP",
                    "E7_TAP", "E8_TAP", "E10_TAP", "E12_TAP")
colnames(data) <- c("S", "v")

data <- t(data)                  # transpose
data1 <- cbind(data, data)       # duplicate
data1 <- cbind(data1, data1)     # quadruplicate
data <- as.data.frame(t(data1))  # transpose back

model.drm <- drm(v ~ cluster(S), data = data, fct = MM.2())  # fitting analysis

S <- data[, 1]
v <- data[, 2]
mml <- data.frame(S = seq(0, max(S) + 9000, length.out = 200))
mml$v <- predict(model.drm, newdata = mml)

s <- mml[, 1]
v <- mml[, 2]
plot(s, v)
lines(s, v, lty = 2, col = "red", lwd = 3)

coeff <- as.data.frame(coef(summary(model.drm)))
The issue comes from the dataset itself: to bypass the error, an n-uplication of my data was needed. I assume it would be even better to have more replicates of the experiment instead of cloning the selfsame data.
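As an alternative that avoids duplicating the data altogether, the same Michaelis-Menten curve v = Vm * S / (K + S) can be fitted with base R's nls(). This is my sketch, with rough eyeballed starting values (Vm near the plateau, K near the half-saturation point), and it assumes plain least squares is acceptable for these data:
# rebuild the original 9-point data frame from the question
dat <- data.frame(S = c(0, 15, 41, 81, 146, 211, 438, 958, 1733),
                  v = c(0.000000, 5.7013061, 13.2662515, 26.0874534, 42.2710547,
                        55.6050052, 75.597474, 112.6755999, 109.45890071))

mm.nls <- nls(v ~ Vm * S / (K + S), data = dat,
              start = list(Vm = max(dat$v), K = 200))
summary(mm.nls)

# plot the fitted curve over the observations
S_grid <- seq(0, max(dat$S), length.out = 200)
plot(dat$S, dat$v, xlab = "S", ylab = "v")
lines(S_grid, predict(mm.nls, newdata = data.frame(S = S_grid)),
      lty = 2, col = "red", lwd = 3)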
Please leave a comment!