variable lengths differ error when rollapply lm - r

I am trying to run a rolling-window regression on a number of time series but have run into a strange problem. The following code reproduces my data. I have a data frame of returns named "rt" and a data frame of factors named "factors". I then define a function that returns the regression intercept (the constant term).
mat<-as.data.frame(matrix(runif(88*6), nrow = 88, ncol = 6))
colnames(mat)<-c("MKT","SMB","HML","AA","BB","CC")
rt<-mat[,c(4,6)]
factors<-mat[,c(1:3)]
coeffstat_alpha<-function(x){
fit<-lm(x~MKT+SMB+HML,data=factors,na.action=na.omit)
nn<-c(t(coeftest(fit)))[1]
return(nn)
}
When I run this function on the whole sample, it works.
apply(rt,2,FUN=coeffstat_alpha)
But when I apply the function with rollapply, I get this error message:
rollapply(rt[,1],width=24,FUN=coeffstat_alpha,by=1,align="left")
"Error in model.frame.default(formula = x ~ MKT + SMB + HML, data = factors, :
variable lengths differ (found for 'MKT')"
I have tried to fix the problem by searching online but couldn't find a post with a similar question. Can anyone help? Thanks!

As the error message suggests, the variable lengths differ: rollapply passes the function an x of length 24 (the window width), while the factors data frame still has all 88 rows. For lm to run, x and the factors must have the same number of rows. You can change the function to take row indices and subset both with them:
library(lmtest)
coeffstat_alpha<-function(x){
fit<-lm(rt[x, 1]~MKT+SMB+HML,data=factors[x, ],na.action=na.omit)
nn<-c(t(coeftest(fit)))[1]
return(nn)
}
and call it with sapply over the window start positions:
sapply(1:(nrow(rt)-23), function(x) coeffstat_alpha(x:(x+23)))
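The index-based approach above can be sketched with base R only. This is a minimal illustration, not the answer's exact code: it reads the intercept with coef() instead of coeftest() so that lmtest is not needed, and the helper name window_alpha and the set.seed call are assumptions added for the example.

```r
# Reproduce the question's data (seed added only to make the sketch repeatable)
set.seed(1)
mat <- as.data.frame(matrix(runif(88 * 6), nrow = 88, ncol = 6))
colnames(mat) <- c("MKT", "SMB", "HML", "AA", "BB", "CC")
rt <- mat[, c(4, 6)]
factors <- mat[, 1:3]
width <- 24

# Hypothetical helper: intercept of the regression fitted on rows `idx` only,
# so the response and the factors always have the same number of rows
window_alpha <- function(idx, col = 1) {
  dat <- cbind(y = rt[idx, col], factors[idx, ])
  fit <- lm(y ~ MKT + SMB + HML, data = dat)
  unname(coef(fit)[1])
}

# One window per start position: rows 1..24, 2..25, ..., 65..88
starts <- seq_len(nrow(rt) - width + 1)
alphas <- sapply(starts, function(s) window_alpha(s:(s + width - 1)))
length(alphas)  # 65 windows
```

Because each window subsets the response and the factors with the same indices, the "variable lengths differ" error cannot occur.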

Related

Can't run glm due to the following error: "variable lengths differ (found for 'data')"

I am trying to run a regression using the glm function, but I keep getting the same error message: "variable lengths differ (found for 'data')". I can't see how my data could differ in length, as I use a sample of 1000 for both my dependent and independent variables. I take a sample of my total data because I have more than a million observations and I want to check that the model works properly (running it on all the data takes a very long time). This is the code I use:
sample = sample(1:nrow(agg), 1000, replace = FALSE)
y=agg$TO_DEFAULT_IN_12M_INDICATOR[sample]
test <- glm(as.factor(y) ~., data = as.factor(agg[sample,]), family = binomial)
#coef(full.model)
Here agg contains all my data, and my y is an indicator variable of 0s and 1s. Does anyone know how I could fix this problem?

Error in array, regression loop using "plyr"

Good morning,
I'm currently trying to run a truncated regression in a loop over my dataset. Below is a reproducible example of my data frame.
library(plyr)
library(truncreg)
df <- data.frame("grid_id" = rep(c(1,2), 6),
"htcm" = rep(c(160,170,175), 4),
stringsAsFactors = FALSE)
View(df)
I then tried to run a truncated regression on the variable "htcm", grouped by grid_id, to obtain only the coefficients (intercept and sigma), which I stored in a data frame. This code is based on the ideas of @hadley
reg <- dlply(df, "grid_id", function(.)
truncreg(htcm ~ 1, data = ., point = 160, direction = "left")
)
regcoef <- ldply(reg, coef)
While this code works for one of my three datasets, I get error messages for the other two. The datasets do not differ in any column, only in their number of rows
(df1: 4,000; df2: 100,000; df3: 13,000)
The error message is
"Error in array(x, c(length(x), 1L), if (!is.null(names(x))) list(names(x), : 'data' must be of type vector, was 'NULL'"
I do not even know how to build an example that reproduces this error, because the code works fine with one of my three datasets.
I have already accounted for missing values in both columns.
Does anyone have an idea what I could fix in this code?
Thanks!!
EDIT:
I think I found the origin of the error. A truncated regression model estimates a standard deviation, which requires more than one observation in every group. Some of my groups contain only n = 1 observation, so the standard deviation equals zero, which leads my code to encounter a NULL vector. How can I drop the groups with fewer than two observations within the regression code?
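The filtering step asked about in the edit could be sketched as follows, in base R and independent of truncreg. The toy data and the names group_size and df_kept are assumptions for illustration, and whether single-observation groups really cause the error is the asker's own hypothesis.

```r
# Toy data: grid_id 2 has a single observation, the other groups have more
df <- data.frame(
  grid_id = c(1, 1, 2, 3, 3, 3),
  htcm    = c(160, 170, 175, 160, 170, 175)
)

# ave() returns, for every row, the size of the group that row belongs to
group_size <- ave(df$htcm, df$grid_id, FUN = length)

# Keep only rows from groups with at least two observations
df_kept <- df[group_size >= 2, ]
unique(df_kept$grid_id)  # 1 3
```

The filtered data frame could then be passed to dlply in place of df.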

error with rda test in vegan r package. Variable not being read correctly

I am trying to perform a simple RDA using the vegan package to test the effects of depth, basin and sector on genetic population structure using the following data frame.
datafile.
The "ALL" variable is the genetic population assignment (structure).
In case the link to my data doesn't work well, I'll paste a snippet of my data frame here.
I read in the data this way:
RDAmorph_Oct6 <- read.csv("RDAmorph_Oct6.csv")
My problems are two-fold:
1) I can't seem to get my genetic variable to read correctly. I have tried three things to fix this.
gen=rda(ALL ~ Depth + Basin + Sector, data=RDAmorph_Oct6, na.action="na.exclude")
Error in eval(specdata, environment(formula), enclos = globalenv()) :
object 'ALL' not found
In addition: There were 12 warnings (use warnings() to see them)
so, I tried things like:
> gen=rda("ALL ~ Depth + Basin + Sector", data=RDAmorph_Oct6, na.action="na.exclude")
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
so I specified numeric
> RDAmorph_Oct6$ALL = as.numeric(RDAmorph_Oct6$ALL)
> gen=rda("ALL ~ Depth + Basin + Sector", data=RDAmorph_Oct6, na.action="na.exclude")
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
I am really baffled. I've also tried specifying each variable with dataset$variable, but this doesn't work either.
The strange thing is, I can get an rda to work if I look at the effects of the environmental variables on a different, composite variable
MC = RDAmorph_Oct6[,5:6]
H_morph_var=rda(MC ~ Depth + Basin + Sector, data=RDAmorph_Oct6, na.action="na.exclude")
Note that I did try to just extract the ALL column for the genetic rda above. This didn't work either.
Regardless, this leads to my second problem.
When I try to plot the rda I get a super weird plot. Note the five dots in three places. I have no idea where these come from.
I will have to graph the genetic rda, and I figure I'll come up with the same issue, so I thought I'd ask now.
I've been through several tutorials and tried many iterations of each issue. What I have provided here is, I think, the best summary. If anyone can give me some clues, I would much appreciate it.
The documentation, ?rda, says that the left-hand side of the formula specifying your model needs to be a data matrix. You can't pass it the name of a variable in the data object as the left-hand side (or at least if this was ever anticipated, doing so exposes bugs in how we parse the formula which is what leads to further errors).
What you want is a data frame containing a variable ALL for the left-hand side of the formula.
This works:
library('vegan')
df <- read.csv('~/Downloads/RDAmorph_Oct6.csv')
ALL <- df[, 'ALL', drop = FALSE]
Notice the drop = FALSE, which stops R from dropping the empty dimension (i.e. converting the single-column data frame to a vector).
Then your original call works:
ord <- rda(ALL ~ Basin + Depth + Sector, data = df, na.action = 'na.exclude')
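The effect of drop = FALSE described above can be checked on a small toy data frame. This sketch uses made-up values and base R only; it does not call rda.

```r
# Toy data frame with made-up values, standing in for the real CSV
df <- data.frame(ALL = c(0.1, 0.9, 0.4), Depth = c(10, 20, 30))

as_vector <- df[, "ALL"]                # single column is simplified to a vector
as_frame  <- df[, "ALL", drop = FALSE]  # single column stays a data frame

is.data.frame(as_vector)  # FALSE
is.data.frame(as_frame)   # TRUE
dim(as_frame)             # 3 1
```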
The problem is that rda expects a separate df for the first part of the formula (ALL in your code), and does not use the one in the data = argument.
As mentioned above, you can create a new df with the variable needed for the analysis, but here's a one-line solution that should also work:
gen <- rda(RDAmorph_Oct6$ALL ~ Depth + Basin + Sector, data = RDAmorph_Oct6, na.action = na.exclude)
This is partly similar to Gavin Simpson's answer. There is also a problem with the categorical vectors in your data frame. You can use library(data.table) and its rowid function to set the categorical variables to unique integers or, preferably, not use them at all. I also wanted to set the ID vector as site names, but I am too lazy now.
library(vegan)
library(data.table)
RDAmorph_Oct6 <- read.csv("C:/........../RDAmorph_Oct6.csv")
#remove NAs before. I like looking at my dataframes before I analyze them.
RDAmorph_Oct6 <- na.omit(RDAmorph_Oct6)
#I removed one duplicate
RDAmorph_Oct6 <- RDAmorph_Oct6[!duplicated(RDAmorph_Oct6$ID),]
#Create vector with only ALL
ALL <- RDAmorph_Oct6$ALL
#Create data frame with only numeric vectors and remove ALL
dfn <- RDAmorph_Oct6[,-c(1,4,11,12)]
#Select all categorical vectors.
dfc <- RDAmorph_Oct6[,c(1,11,12)]
#Give the categorical vectors unique integers; this doesn't happen for ID (why?).
dfc2 <- as.data.frame(apply(dfc, 2, function(x) rowid(x)))
#Bind back with numeric data frame
dfnc <- cbind.data.frame(dfn, dfc2)
#Select only what you need
df <- dfnc[c("Depth", "Basin", "Sector")]
#The rest you know
rda.out <- rda(ALL ~ ., data=df, scale=TRUE)
plot(rda.out, scaling = 2, xlim=c(-3,2), ylim=c(-1,1))
#Also plot correlations
plot(cbind.data.frame(ALL, df))
Sector and depth have the highest variation, which is almost to be expected since only three vectors are used. The integers assigned to the categorical vectors probably have no meaning at all: the function assigns unique integers from top to bottom to each following unique character string. I am also not really sure which question you want to answer. Once that is clear, you can organize the data frame accordingly.

Non-numeric argument to binary operator, CSV

I've seen that other people have struggled with this before, but I didn't manage to solve my problem with those posts. I get the error 'non-numeric argument to binary operator'. The following reproducible example works:
x=rnorm(1000)+sin(c(1:1000)/100)#random data + sine wave superimposed
par(mfrow=c(2,2))
plot(x)# plot random data
plot(filter(x,rep(1/100,100)))
plot(x-filter(x,rep(1/100,100)))
# variances of variable, long term variability and short term variability
var(x)
var(filter(x, rep(1/100,100)),na.rm=T)
var(x-filter(x, rep(1/100,100)),na.rm=T)
However, I of course want to use my own dataset, which is a CSV file, and this is when the error occurs. It must have something to do with the data format, because the same thing happens when I export the random data to CSV:
x=rnorm(1000)+sin(c(1:1000)/100)#random data + sine wave superimposed
write.csv(x,"dat.csv")
and then try to read in dat.csv
y <- read.csv("dat.csv", header=TRUE, stringsAsFactors=FALSE)
par(mfrow=c(2,2))
plot(y)
plot(filter(y,rep(1/100,100)))
plot(y-filter(y,rep(1/100,100)))
[...] I get the error
Error in x - filter(x, rep(1/100, 100)) :
non-numeric argument to binary operator
Calls: plot
In addition: Warning message:
In plot(x - filter(x, rep(1/100, 100))) :
Incompatible methods ("Ops.data.frame", "Ops.ts") for "-"
Execution halted
Why are the values not numeric? I don't get it. Thanks for your help!
I rewrote the post a little so the x variable isn't reused for both input and output. The value from read.csv() is now y. Notice it's a data.frame, while x is an ordinary numeric vector.
To get the second set of graphs to behave like the first, extract the x column from y (called y1 below), then pass that vector to plot() and filter() (here stats::filter, not the dplyr function of the same name).
y <- read.csv("dat.csv", header=TRUE, stringsAsFactors=FALSE)
y1 <- y$x # Extract the column named x (the first column, X, holds the row names written by write.csv)
par(mfrow=c(2,2))
plot(y1)
plot(filter(y1,rep(1/100,100)))
plot(y1-filter(y1,rep(1/100,100)))
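The round trip through CSV can be checked without plotting. This is a minimal sketch: the temporary file path and the set.seed call are assumptions added so the example is self-contained, and stats::filter is called explicitly in case dplyr is attached.

```r
set.seed(42)
x <- rnorm(1000) + sin((1:1000) / 100)  # random data + sine wave superimposed

# Write to a temporary file instead of "dat.csv" to keep the sketch tidy
csv_path <- tempfile(fileext = ".csv")
write.csv(x, csv_path)
y <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)

class(y)  # "data.frame": this is why `y - filter(y, ...)` fails
names(y)  # "X" "x": row names written by write.csv, then the data

# Extract the numeric column before filtering and subtracting
y1 <- y$x
long_term  <- stats::filter(y1, rep(1 / 100, 100))
short_term <- y1 - long_term

var(y1)
var(long_term, na.rm = TRUE)
var(short_term, na.rm = TRUE)
```

Passing row.names = FALSE to write.csv would avoid the extra X column in the first place.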

Use of randomforest() for classification in R?

I originally had a data frame composed of 12 columns in N rows. The last column is my class (0 or 1). I had to convert my entire data frame to numeric with
training <- sapply(training.temp,as.numeric)
But then I thought I needed the class column to be a factor column to use the randomForest() tool as a classifier, so I did
training[,"Class"] <- factor(training[,ncol(training)])
I proceed to creating the tree with
training_rf <- randomForest(Class ~., data = trainData, importance = TRUE, do.trace = 100)
But I'm getting two errors:
1: In Ops.factor(training[, "Status"], factor(training[, ncol(training)])) :
<= this is not relevant for factors (roughly translated)
2: In randomForest.default(m, y, ...) :
The response has five or fewer unique values. Are you sure you want to do regression?
I would appreciate it if someone could point out the formatting mistake I'm making.
Thanks!
So the issue is actually quite simple. It turns out my training data was an atomic vector (sapply returns a matrix, not a data frame), so it first had to be converted to a data frame. I needed to add the following line:
training <- as.data.frame(training)
Problem solved!
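Why the conversion matters can be seen on a toy example in base R. The data below is made up, and m, m2 and the column name a are hypothetical names used only for this sketch: a matrix holds a single atomic type, so a factor column cannot live in it, while a data frame column can be a factor.

```r
# Made-up stand-in for training.temp
training.temp <- data.frame(
  a     = c("1.5", "2.5", "3.5"),
  Class = c("0", "1", "0"),
  stringsAsFactors = FALSE
)

m <- sapply(training.temp, as.numeric)
is.matrix(m)  # TRUE: sapply simplified the result to a matrix

# Assigning a factor into a matrix column does not keep the factor class
m2 <- m
m2[, "Class"] <- factor(m2[, "Class"])
is.factor(m2[, "Class"])  # FALSE

# After converting to a data frame, the column can be a real factor
training <- as.data.frame(m)
training$Class <- factor(training$Class)
is.factor(training$Class)  # TRUE
levels(training$Class)     # "0" "1"
```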
First, your coercion to a factor does not work as written. Second, you should always use indexing when specifying an RF model. Here are changes to your code that should make it work.
library(randomForest)
# sapply returns a matrix, so convert to a data frame before adding a factor column
training <- as.data.frame(sapply(training.temp,as.numeric))
training[,"Class"] <- as.factor(training[,"Class"])
training_rf <- randomForest(x=training[,1:(ncol(training)-1)], y=training[,"Class"],
importance=TRUE, do.trace=100)
# You can also coerce to a factor directly in the model statement
training_rf <- randomForest(x=training[,1:(ncol(training)-1)], y=as.factor(training[,"Class"]),
importance=TRUE, do.trace=100)
