Error in for loop: replacement has length zero (R)

I am new to R and trying to do coursework on factor analysis with it.
I have two data sets, FundReturn (120 rows, 14 columns) and Factors (120 rows, 30 columns), and I want to run a one-factor regression for every possible pair of factor and fund, starting with the first 60 observations. With the parameters estimated, I calculate the predicted value of the 61st fund return from the 61st value of the factor. Then the estimation window is expanded by one observation, new parameters are estimated on the updated sample, the predicted value of the 62nd fund return is calculated, and so on. In total, 60 predictions are made, stored in Predictions = array(1, dim = c(60, 30, 14)), so I can compare them with the realized values.
The following is the code I used; it produces this error:
Error in Predictions[p, fa, fu] <- coeff[1, p, fa, fu] + coeff[2, p, fa, :
replacement has length zero
Can anyone spot the problem? Any help is much appreciated.
Predictions <- array(1, dim = c(60, 30, 14))
coeff <- array(1, dim = c(3, 60, 30, 14))
v1 <- 1:30
v2 <- 1:60
v3 <- 1:14
for (fu in v3) {
  for (fa in v1) {
    for (p in v2) {
      y1 <- FundReturn[1:(59 + p), fu]
      x1 <- Factors[1:(59 + p), fa]
      Model <- lm(y1 ~ x1 + lag(y1))
      coeff[1:3, p, fa, fu] <- Model[["coefficients"]]
      Predictions[p, fa, fu] <- coeff[1, p, fa, fu] +
        coeff[2, p, fa, fu] * Factors[60 + p, fa] +
        coeff[3, p, fa, fu] * FundReturn[59 + p, fu]
    }
  }
}
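For reference, a minimal sketch of the expanding-window loop with the lag built explicitly: stats::lag() does not shift the values of a plain numeric vector inside lm() (it only adjusts the time attribute of a ts object), so the lagged return below is constructed by hand. Everything reuses the question's names and dimensions and assumes plain data frames.
Predictions <- array(NA_real_, dim = c(60, 30, 14))
coeff <- array(NA_real_, dim = c(3, 60, 30, 14))
for (fu in 1:14) {
  for (fa in 1:30) {
    for (p in 1:60) {
      y <- FundReturn[1:(59 + p), fu]
      x <- Factors[1:(59 + p), fa]
      ylag <- c(NA, y[-length(y)])  # explicit one-period lag, padded with NA
      fit <- lm(y ~ x + ylag)
      coeff[, p, fa, fu] <- coef(fit)
      Predictions[p, fa, fu] <- coef(fit)[1] +
        coef(fit)[2] * Factors[60 + p, fa] +
        coef(fit)[3] * FundReturn[59 + p, fu]
    }
  }
}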

Related

Fitting a truncated binomial distribution to data in R

I have discrete count data indicating the number of successes in 10 binomial trials for a pilot sample of 46 cases. (Larger samples will follow once I have the analysis set up.) The zero class (no successes in 10 trials) is missing, i.e. each datum is an integer value between 1 and 10 inclusive. I want to fit a truncated binomial distribution with no zero class, in order to estimate the underlying probability p. I can do this adequately on an Excel spreadsheet using least squares with Solver, but because I want to calculate bootstrap confidence intervals on p, I am trying to implement it in R.
Frankly, I am struggling to understand how to code this. This is what I have so far:
# load required packages
library(fitdistrplus)
library(truncdist)
library(mc2d)

d <- detections.data$x

ptruncated.binom <- function(q, p) {
  ptrunc(q, "binom", a = 1, b = Inf, p)
}
dtruncated.binom <- function(x, p) {
  dtrunc(x, "binom", a = 1, b = Inf, p)
}
fit.tbin <- fitdist(d, "truncated.binom", method = "mle", start = list(p = 0.1))
I have had lots of error messages which I have solved by guesswork, but the latest one has me stumped and I suspect I am totally misunderstanding something.
Error in checkparamlist(arg_startfix$start.arg, arg_startfix$fix.arg, :
'start' must specify names which are arguments to 'distr'.
I think this means I must specify starting values for x in dtrunc and q in ptrunc, but I am really unclear what they should be.
Any help would be very gratefully received.
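If it is easier, the zero-truncated density can be written directly from dbinom(), avoiding dtrunc()'s argument-passing pitfalls. Here is a sketch under the question's assumptions (10 trials per case, zero class removed); the names dtbinom/ptbinom are made up for this example, and fitdist() finds them through its d<name>/p<name> naming convention:
library(fitdistrplus)
# zero-truncated binomial: renormalise after removing the zero class
dtbinom <- function(x, prob, size = 10) {
  dbinom(x, size, prob) / (1 - dbinom(0, size, prob))
}
ptbinom <- function(q, prob, size = 10) {
  pmax(0, (pbinom(q, size, prob) - dbinom(0, size, prob)) /
    (1 - dbinom(0, size, prob)))
}
fit.tbin <- fitdist(d, "tbinom", method = "mle", start = list(prob = 0.1))
fit.tbin$estimate  # MLE of prob; bootdist(fit.tbin) can then give bootstrap CIs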

Compare PCs to data with lsfit()

I have a data frame with 2000 observations (rows) and 600 variables (columns). See reproducible example:
lst <- list()  # avoid masking base::list
for (i in 1:600) {
  lst[[i]] <- sample(seq(0, 0.6, l = 2000))
}
df <- as.data.frame(do.call(cbind, lst))
I want to perform PCA on the variables and then use lsfit() to compare the fit between the principal components and the data (as well as some other data, but this is left out here). My first issue is that when I perform PCA on the data set as it is, my principal components have length 2000. I would expect them to have length 600. However, this is resolved by transposing the data frame.
pc_model <- prcomp(df, center = FALSE, rank. = 3)
pcs <- pc_model$x  # wrong length, why?
df_trans <- as.data.frame(t(df))
pc_model2 <- prcomp(df_trans, center = FALSE, rank. = 3)
pcs2 <- pc_model2$x  # correct length, why?
My next issue is that when I try to use lsfit() to compare my 2000 observations to the principal components, I get all sorts of complaints:
fit <- lsfit(df_trans, pcs2) # Error in lsfit(df_trans, pcs2) : only 600 cases, but 2001 variables
fit2 <- lsfit(df, pcs2) # Error in complete.cases(x, y, wt) : not all arguments have the same length
fit3 <- lsfit(df[1,], pcs2[,1]) # Error in complete.cases(x, y, wt) : not all arguments have the same length
With the transposed data frame, lsfit() complains that I have too many variables. With the non-transposed data frame, it argues that the arguments don't have the same length, even when I only feed it one row from df (length 600) and one column from pcs2 (length 600). How do I get the least-squares fits between my PCs and my 2000 observations?
First, pc_model$x is just the coordinates of the observations in the new space defined by the axes (PC1, PC2, PC3), so you will have as many rows as there are observations, i.e. 2000 rows for 2000 observations.
lsfit(X, Y) tries to fit the model Y = Xb + e, where Y and e are (N, M) matrices, X is an (N, K) matrix and b is a (K, M) matrix. K is the number of variables used in the estimation (K = number of columns of the original X matrix, plus 1 if you want the intercept coefficient, which is the default), and N >= K is required for the regression to be computable.
Running fit2 <- lsfit(df, pcs) gives correct output, as the conditions are satisfied: the row counts match and N = 2000 >= K = 601.
The error "Error in lsfit(df_trans, pcs2) : only 600 cases, but 2001 variables" is caused by df_trans having 2000 columns (2001 variables with the intercept) while pcs2 has only 600 rows. Selecting at most the first 599 columns circumvents the error: lsfit(df_trans[, 1:599], pcs2).
The error "not all arguments have the same length" comes from the complete.cases() call inside lsfit: df and pcs2 have different row counts, and this error is thrown before lsfit reaches its own check on mismatched row numbers.
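Putting the shape rules together with the question's objects (a sketch reusing df, pcs, df_trans and pcs2 from above), these calls satisfy the N >= K requirement:
fit <- lsfit(df, pcs)  # N = 2000 rows, K = 600 + 1 = 601
fit2 <- lsfit(df_trans[, 1:599], pcs2)  # N = 600 rows, K = 599 + 1 = 600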

Fama-MacBeth Regression in R with pmg

In the past few days I have been trying to work out how to run Fama-MacBeth regressions in R. It is advised to use the plm package with pmg(), but every attempt I make returns the error that I have an insufficient number of time periods.
My dataset consists of 2828419 observations with 13 columns of variables, on which I want to run multiple cross-sectional regressions.
My firms are identified by seriesid, I have a variable date, and I want to run the following Fama-MacBeth regressions:
totret ~ size
totret ~ momentum
totret ~ reversal
totret ~ volatility
totret ~ value + size
totret ~ value + size + momentum
totret ~ value + size + momentum + reversal + volatility
I have been using this command:
fpmg <- pmg(totret ~ momentum, Data, index = c("date", "seriesid"))
Which returns: Error in pmg(totret ~ mom, Dataset, index = c("seriesid", "datem")) : Insufficient number of time periods
I tried it with my dataset as a data.table, a data.frame and a pdata.frame. Switching the index around does not work either.
My data contains NAs as well.
Can anyone fix this, or suggest a different way for me to do Fama-MacBeth?
This is almost certainly due to NAs in the variables in your formula. The error message is not very helpful: it is probably not a case of "too few time periods to estimate" and very likely a case of "there are firm/unit IDs that are not represented across all time periods" because rows with missing data are dropped.
You have two options: impute the missing data, or drop observations with missing data (the latter being a quick test that the model works without missing points, before deciding on an approach that is valid for estimation).
If the missingness in your data is truly random, you might be okay just dropping observations with missingness. Otherwise you should probably impute. A common strategy here is to impute multiple times - at least 5 - and then estimate for each of those 5 resulting data sets and average the effect together. Amelia or mice are very strong imputation packages. I like Amelia because with one call you can impute n times for that many resulting data sets and it's easy to pass in a set of variables to not impute (e.g., id variable or time period) with the idvars parameter.
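As a quick sketch of the drop-first test (totret, momentum, date, seriesid and Data are the question's names; this assumes Data behaves like a plain data frame):
library(plm)
keep <- complete.cases(Data[, c("totret", "momentum", "date", "seriesid")])
fpmg <- pmg(totret ~ momentum, Data[keep, ], index = c("date", "seriesid"))
summary(fpmg)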
EDIT: I dug into the source code to see where the error is triggered, and here is the issue - again likely caused by missing data, but it does interact with your degrees of freedom:
...
# Part of the pmg source where the error is triggered; for context:
# X   = model matrix of the RHS of your model including the intercept, so X[, 1] is all 1s
# k   = number of coefficients, determined by length(coef(plm.model))
# ind = vector of ID values
# t is therefore the minimum number of occurrences of any single unique ID
t <- min(tapply(X[, 1], ind, length))
# If the minimum number of times a single ID appears across time is
# less than the number of coefficients + 1, you do not have enough time
# points (for that ID / those IDs) to estimate:
if (t < (k + 1))
  stop("Insufficient number of time periods")
That is what triggers your error. So imputation is definitely a solution, but there might be just a single offender in your data; importantly, once this condition is satisfied, your model will run just fine with missing data.
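A quick way to spot such offenders directly (a sketch using the question's Data and seriesid; k is the number of model coefficients, e.g. 2 for totret ~ momentum):
counts <- table(Data$seriesid)  # observations per firm ID
k <- 2
names(counts)[counts < k + 1]  # the IDs that trigger the stop()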
I recently got the Fama-MacBeth regression working in R.
Starting from a data table with all of the characteristics in the rows, the following works and gives the option of weighting the regression or weighting equally (remove weights = marketcap for equal weighting). totret is a total-return variable and logmarket is the logarithm of market capitalization.
library(dplyr)
logmarket <- df %>%
  group_by(date) %>%
  summarise(constant = summary(lm(totret ~ logmarket, weights = marketcap))$coefficients[1],
            rsquared = summary(lm(totret ~ logmarket, weights = marketcap))$r.squared,
            beta     = summary(lm(totret ~ logmarket, weights = marketcap))$coefficients[2])
You obtain a data frame with monthly alphas (constant), betas (beta) and R-squareds (rsquared).
To retrieve coefficients with t-statistics in a dataframe:
library(lmtest)  # for coeftest()
Summarystatistics <- as.data.frame(matrix(data = NA, nrow = 6, ncol = 1))
names(Summarystatistics) <- "logmarket"
row.names(Summarystatistics) <- c("constant", "t-stat (constant)", "beta",
                                  "t-stat (beta)", "R^2", "observations")
Summarystatistics[1, 1] <- mean(logmarket$constant)
Summarystatistics[2, 1] <- coeftest(lm(logmarket$constant ~ 1))[1, 3]
Summarystatistics[3, 1] <- mean(logmarket$beta)
Summarystatistics[4, 1] <- coeftest(lm(logmarket$beta ~ 1))[1, 3]
Summarystatistics[5, 1] <- mean(logmarket$rsquared)
Summarystatistics[6, 1] <- nrow(subset(df, !is.na(logmarket)))
Some values of "seriesid" have only one observation, which is why pmg() gives the error. If you do something like this (with the variable names you use), it will stop the error:
try2 <- try2 %>%
  group_by(cusip) %>%
  mutate(flag = as.integer(n() == 1)) %>%  # flag IDs with a single observation
  ungroup() %>%
  filter(flag == 0)

Calculating correlation between residuals of linear regression with NAs and independent variable in R

I am trying to calculate the correlation coefficient between the residuals of a linear regression and the independent variable p.
Basically, the linear regression estimates current sales as a function of the current price p and the past price p1.
The vector of current prices mydf$p has length 8, but the residuals form a vector of length 7, because one observation was dropped due to the NA value of p1.
# lag vector and pad with NAs
# Source: http://heuristically.wordpress.com/2012/10/29/lag-function-for-data-frames/
lagpad <- function(x, k) {
  if (!is.vector(x))
    stop('x must be a vector')
  if (!is.numeric(x))
    stop('x must be numeric')
  if (!is.numeric(k))
    stop('k must be numeric')
  if (1 != length(k))
    stop('k must be a single number')
  c(rep(NA, k), x)[1:length(x)]
}

mydf <- data.frame(p = c(10, 8, 10, 9, 10, 9, 10, 8))
mydf$p1 <- lagpad(mydf$p, 1)
mydf$sales <- with(mydf, 200 - 15 * p + 5 * p1) + rnorm(nrow(mydf), 0, 0.13)

model <- lm(sales ~ p + p1, data = mydf)
print(summary(model))
print(cor(residuals(model), mydf$p))
# Error in cor(residuals(model), mydf$p) : incompatible dimensions
In this particular case, it is easy to use mydf$p[2:8] instead of mydf$p.
In general, however, there may be multiple rows at random locations where NAs cause rows to be deleted.
How do I access the independent variables that were actually used in the regression, after the rows containing NAs have been removed?
One of my attempts was based on the R documentation for lm: I tried to access the "x" matrix through model[['x']], but that did not work.
You can get the actual data used to fit the model from model$model, and from there the p column:
cor(residuals(model), model$model$p)
Alternatively, is.na(mydf$p1) will tell you which rows in mydf have an NA in column p1:
cor(residuals(model), mydf$p[!is.na(mydf$p1)])
In general, is.na(x) tells us whether elements in x are NA or not:
> is.na(c(1,2,NA,4,NA,6))
[1] FALSE FALSE TRUE FALSE TRUE FALSE
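Another general route (a sketch reusing model and mydf from above): lm() records the rows it dropped in the fit's na.action component, so they can be excluded from any column of the original data frame:
omitted <- na.action(model)  # indices of rows removed by na.omit
cor(residuals(model), mydf$p[-omitted])  # only valid when at least one row was dropped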
model.matrix(model) seems to be what you are looking for.
You can then select the variables you want with [ ] and the column number or name.
The x matrix is only created if you specify x = TRUE in your call to lm(); then model$x will give you the matrix (this is more idiomatic than model[['x']]).
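For instance (a sketch with the question's objects):
model <- lm(sales ~ p + p1, data = mydf, x = TRUE)  # keep the model matrix
cor(residuals(model), model$x[, "p"])  # column "p" of the stored design matrix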
lm() handles missing values by completely omitting any observation where a value is missing. Maybe you want to do something like:
cor(residuals(model), mydf$p[!is.na(mydf$p1)])
(filtering on p1, the column that actually contains the NA)?

Maximum first derivative for values in a data frame in R

Good day, I am looking for some help processing my dataset. I have 14000 rows and 500 columns, and I am trying to get the maximum value of the first derivative for individual rows across different column groups. My data are saved as a data frame with the first column being the name of a variable. The data look like this:
Species Spec400 Spec405 Spec410 Spec415
1 AfricanOilPalm_1_Lf_1 0.2400900 0.2318345 0.2329633 0.2432734
2 AfricanOilPalm_1_Lf_10 0.1783162 0.1808581 0.1844433 0.1960315
3 AfricanOilPalm_1_Lf_11 0.1699646 0.1722618 0.1615062 0.1766804
4 AfricanOilPalm_1_Lf_12 0.1685733 0.1743336 0.1669799 0.1818896
5 AfricanOilPalm_1_Lf_13 0.1747400 0.1772355 0.1735916 0.1800227
For each of the entries in the Species column, I want to get the maximum derivative from, for example, Spec495 to Spec550. This is what I did before I ran into errors.
x <- c(495, 500, 505, 510, 515, 520, 525, 530, 535, 540, 545, 550)  # x values of reflectance (Spec495 to Spec550)
y.data.f <- hsp[, 21:32]        # row values for the required columns
y <- as.numeric(y.data.f[1, ])  # convert just the first row of data to a vector
library(pspline)  # using a spline so a derivative may be calculated from a list of numeric values
I really wanted to avoid using a loop because of the time it takes, but this is the only way I know of thus far:
for (j in 1:nrow(y.data.f)) {
  y <- as.numeric(y.data.f[j, ])
  a1d <- max(predict(sm.spline(x, y), x, 1))
  write.table(a1d, file = "a1-d-appended.csv", sep = ",",
              col.names = FALSE, append = TRUE)
}
This loop runs up to the 7861st value and then gives this error:
Error in smooth.Pspline(x = ux, y = tmp[, 1], w = tmp[, 2], method = method, :
NA/NaN/Inf in foreign function call (arg 6)
I am sure there must be a way to avoid using a loop, maybe using the plyr package, but I can't figure out how, nor which package would be best for getting the maximum derivative.
Can anyone offer some insight or suggestions? Thanks in advance.
First differences are the numerical analog of first derivatives when the x-dimension is evenly spaced. So something along the lines of:
which.max(diff(predict(sm.spline(x, y))$ysmth))
... will return the location of the maximum (positive) slope of the smoothed spline. If you want the maximal slope allowing it to be either negative or positive, wrap the diff(...) result in abs(). If you are having difficulties with non-finite values, then indexing with is.finite will clear both Inf and NaN difficulties:
predy <- predict(sm.spline(x, y))$ysmth
predx <- predict(sm.spline(x, y))$x
is.na(predy) <- !is.finite(predy)  # mark Inf/NaN values as NA
plot(predx, predy,  # NA values will not blow up R's plotting functions,
                    # they just create discontinuities
     main = "First Derivative")
