What does lag(log(emp), 1:2) mean when using pgmm function? - r

I tried an example regarding pgmm function in plm package. The codes are as follows:
library(plm)
data("EmplUK", package = "plm")
## Arellano and Bond (1991), table 4 col. b
z1 <- pgmm(log(emp) ~ lag(log(emp), 1:2) + lag(log(wage), 0:1)
+ log(capital) + lag(log(output), 0:1) | lag(log(emp), 2:99),
data = EmplUK, effect = "twoways", model = "twosteps")
summary(z1, robust = FALSE)
I am not sure the meaning of lag(log(emp), 1:2) and also lag(log(emp), 2:99). Does lag(log(emp), 1:2) mean that from one unit to two unit lag value of log(emp) and lag(log(emp), 2:99) from two units to 99 units' lag value of log(emp)?
And also sometimes I got an error when running the regression in summary part but sometimes there was no such error (the codes are the same):
Error in !class_ind : invalid argument type
Can anyone help me with these problems?That's the error here

log, a base R function, gives you the (natural) logarithm, in this case of variable emp.
lag of package plm can be given a second argument, called k, like in your example. By looking at ?plm::lag.plm it becomes clear: k is
an integer, the number of lags for the lag and lead methods (can also
be negative). For the lag method, a positive (negative) k gives lagged
(leading) values. For the lead method, a positive (negative) k gives
leading (lagged) values, thus, lag(x, k = -1) yields the same as
lead(x, k = 1). If k is an integer with length > 1 (k = c(k1, k2,
...)), a matrix with multiple lagged pseries is returned
Thus, instead of typing lag twice to have the first and second lag:
(lag(<your_variable>, 1)
lag(<your_variable>, 2)
one can simply type
lag(<your_variable>, k = 1:2), or without the named argument
lag(<your_variable>, 1:2).
Setting k to 2:99 gives you the 2nd to 99th lags.
The number refers to the number of time periods the lagging is applied to, not to the number of individuals (units) as the lagging is applied to all individuals.
You may want to run the example in ?plm::lag.plm to aid understanding of that function.

Related

R - Extract coefficients from a factor of lm object using conditions

I have fitted a lm with the following code:
Eq1_females = <- lm(earnings ~ event_time + factor(age) + factor(year) - 1, data=females)
Now, I would like to calculate a predicted value based on the factor coefficients, but this predicted value depends on certain conditions in the data. I therefore create a list of the coefficients and I now want to extract the factor coefficients if age = k and year = y, but it keeps returning 0 or NA. However, if I input a number (e.g. 34) instead of k, it does give the right value. I tried two different codes:
estimates <- coef(Eq1_females)
k = females$age[1]
Eq1_females$coefficients["factor(age)k"]
and
estimates <- coef(Eq1_females)
k = females$age[1]
beta_age = estimates[grep("^factor\\(age\\)k", names(estimates))]
(note that in the end, I would like to loop over different rows of females$age)
What does work, is calculating
beta_age = estimates[grep("^factor\\(age\\)34", names(estimates))]
Could anyone tell me if there is a way of also getting the code to work with k in the beta_age formula?
Thanks a lot in advance!
Answer
Paste the right number to the regex pattern using paste0:
beta = estimates[grep(paste0("^factor\\(Petal.Width\\)", k), names(estimates))]
This returns:
factor(Petal.Width)0.2
3.764947
Rationale
In "^factor\\(age\\)k", it will treat k as the literal k. However, you are referring to variable k. By using paste(..., sep = "") or paste0(...) you can simply paste k to the base pattern.

Fama Macbeth Regression in R pmg

In the past few days I have been trying to find how to do Fama Macbeth regressions in R. It is advised to use the plm package with pmg, however every attempt I do returns me that I have an insufficient number of time periods.
My Dataset consists of 2828419 observations with 13 columns of variables of which I am looking to do multiple cross-sectional regressions.
My firms are specified by seriesis, I have got a variable date and want to do the following Fama Macbeth regressions:
totret ~ size
totret ~ momentum
totret ~ reversal
totret ~ volatility
totret ~ value size
totret ~ value + size + momentum
totret ~ value + size + momentum + reversal + volatility
I have been using this command:
fpmg <- pmg(totret ~ momentum, Data, index = c("date", "seriesid")
Which returns: Error in pmg(totret ~ mom, Dataset, index = c("seriesid", "datem")) : Insufficient number of time periods
I tried it with my dataset being a datatable, dataframe and pdataframe. Switching the index does not work as well.
My data contains NAs as well.
Who can fix this, or find a different way for me to do Fama Macbeth?
This is almost certainly due to having NAs in the variables in your formula. The error message is not very helpful - it is probably not a case of "too few time periods to estimate" and very likely a case of "there are firm/unit IDs that are not represented across all time periods" due to missing data being dropped.
You have two options - impute the missing data or drop observations with missing data (the latter being a quick test that the model works without missing points before deciding what you want to do that is valid for estimtation).
If the missingness in your data is truly random, you might be okay just dropping observations with missingness. Otherwise you should probably impute. A common strategy here is to impute multiple times - at least 5 - and then estimate for each of those 5 resulting data sets and average the effect together. Amelia or mice are very strong imputation packages. I like Amelia because with one call you can impute n times for that many resulting data sets and it's easy to pass in a set of variables to not impute (e.g., id variable or time period) with the idvars parameter.
EDIT: I dug into the source code to see where the error was triggered and here is what the issue is - again likely caused by missing data, but it does interact with your degrees of freedom:
...
# part of the code where error is triggered below, here is context:
# X = matrix of the RHS of your model including intercept, so X[,1] is all 1s
# k = number of coefficients used determined by length(coef(plm.model))
# ind = vector of ID values
# so t here is the minimum value from a count of occurrences for each unique ID
t <- min(tapply(X[,1], ind, length))
# then if the minimum number of times a single ID appears across time is
# less than the number of coefficients + 1, you do not have enough time
# points (for that ID/those IDs) to estimate.
if (t < (k + 1))
stop("Insufficient number of time periods")
That is what is triggering your error. So imputation is definitely a solution, but there might be a single offender in your data and importantly, once this condition is satisfied your model will run just fine with missing data.
Lately, I fixed the Fama Macbeth regression in R.
From a Data Table with all of the characteristics within the rows, the following works and gives the opportunity to equally weight or apply weights to the regression (remove the ",weights = marketcap" for equally weighted). totret is a total return variable, logmarket is the logarithm of market capitalization.
logmarket<- df %>%
group_by(date) %>%
summarise(constant = summary(lm(totret~logmarket, weights = marketcap))$coefficient[1], rsquared = summary(lm(totret~logmarket*, weights = marketcap*))$r.squared, beta= summary(lm(totret~logmarket, weights = marketcap))$coefficient[2])
You obtain a DataFrame with monthly alphas (constant), betas (beta), the R squared (rsquared).
To retrieve coefficients with t-statistics in a dataframe:
Summarystatistics <- as.data.frame(matrix(data=NA, nrow=6, ncol=1)
names(Summarystatistics) <- "logmarket"
row.names(Summarystatistics) <- c("constant","t-stat", "beta", "tstat", "R^2", "observations")
Summarystatistics[1,1] <- mean(logmarket$constant)
Summarystatistics[2,1] <- coeftest(lm(logmarket$constant~1))[1,3]
Summarystatistics[3,1] <- mean(logmarket$beta)
Summarystatistics[4,1] <- coeftest(lm(logmarket$beta~1))[1,3]
Summarystatistics[5,1] <- mean(logmarket$rsquared)
Summarystatistics[6,1] <- nrow(subset(df, !is.na(logmarket)))
There are some entries of "seriesid" with only one entry. Therefore the pmg gives the error. If you do something like this (with variable names you use), it will stop the error:
try2 <- try2 %>%
group_by(cusip) %>%
mutate(flag = (if (length(cusip)==1) {1} else {0})) %>%
ungroup() %>%
filter(flag == 0)

R: multicollinearity issues using glib(), Bayesian Model Averaging (BMA-package)

I am experiencing difficulties estimating a BMA-model via glib(), due to multicollinearity issues, even though I have clearly specified which columns to use. Please find the details below.
The data I'll be using for the estimation via Bayesian Model Averaging:
Cij <- c(357848,766940,610542,482940,527326,574398,146342,139950,227229,67948,
352118,884021,933894,1183289,445745,320996,527804,266172,425046,
290507,1001799,926219,1016654,750816,146923,495992,280405,
310608,1108250,776189,1562400,272482,352053,206286,
443160,693190,991983,769488,504851,470639,
396132,937085,847498,805037,705960,
440832,847631,1131398,1063269,
359480,1061648,1443370,
376686,986608,
344014)
n <- length(Cij);
TT <- trunc(sqrt(2*n))
i <- rep(1:TT,TT:1); #row numbers: year of origin
j <- sequence(TT:1) #col numbers: year of development
k <- i+j-1 #diagonal numbers: year of payment
#Since k=i+j-1, we have to leave out another dummy in order to avoid multicollinearity
k <- ifelse(k == 2, 1, k)
I want to evaluate the effect of i and j both via levels and factors, but of course not in the same model. Since I can decide to include i and j as factors, levels, or not include them at all and for k either to include as level, or exclude, there are a total of 18 (3x3x2) models. This brings us to the following data frame:
X <- data.frame(Cij,i.factor=as.factor(i),j.factor=as.factor(j),k,i,j)
X <- model.matrix(Cij ~ -1 + i.factor + j.factor + k + i + j,X)
X <- as.data.frame(X[,-1])
Next, via the following declaration I specify which variables to consider in each of the 18 models. According to me, no linear dependence exists in these specifications.
model.set <- rbind(
c(rep(0,9),rep(0,9),0,0,0),
c(rep(0,9),rep(0,9),0,1,0),
c(rep(0,9),rep(0,9),0,0,1),
c(rep(0,9),rep(0,9),1,0,0),
c(rep(1,9),rep(0,9),0,0,0),
c(rep(0,9),rep(1,9),0,0,0),
c(rep(0,9),rep(0,9),0,1,1),
c(rep(0,9),rep(0,9),1,1,0),
c(rep(0,9),rep(1,9),0,1,0),
c(rep(0,9),rep(0,9),1,0,1),
c(rep(1,9),rep(0,9),0,0,1),
c(rep(1,9),rep(0,9),1,0,0),
c(rep(0,9),rep(1,9),1,0,0),
c(rep(1,9),rep(1,9),0,0,0),
c(rep(0,9),rep(0,9),1,1,1),
c(rep(0,9),rep(1,9),1,1,0),
c(rep(1,9),rep(0,9),1,0,1),
c(rep(1,9),rep(1,9),1,0,0))
Then I call the glib() function, telling it to select the specified columns from X according to model.set.
library(BMA)
model.glib <- glib(X,Cij,error="poisson", link="log",models=model.set)
which results in the error
Error in glim(x, y, n, error = error, link = link, scale = scale) : X matrix is not full rank
The function first checks whether the matrix is f.c.r, before it evaluates which columns to select from X via model.set. How do I circumvent this, or is there any other way to include all 18 models in the glib() function?
Thank you in advance.

cost function in cv.glm of boot library in R

I am trying to use the crossvalidation cv.glm function from the boot library in R to determine the number of misclassifications when a glm logistic regression is applied.
The function has the following signature:
cv.glm(data, glmfit, cost, K)
with the first two denoting the data and model and K specifies the k-fold.
My problem is the cost parameter which is defined as:
cost: A function of two vector arguments specifying the cost function
for the crossvalidation. The first argument to cost should correspond
to the observed responses and the second argument should correspond to
the predicted or fitted responses from the generalized linear model.
cost must return a non-negative scalar value. The default is the
average squared error function.
I guess for classification it would make sense to have a function which returns the rate of misclassification something like:
nrow(subset(data, (predict >= 0.5 & data$response == "no") |
(predict < 0.5 & data$response == "yes")))
which is of course not even syntactically correct.
Unfortunately, my limited R knowledge let me waste hours and I was wondering if someone could point me in the correct direction.
It sounds like you might do well to just use the cost function (i.e. the one named cost) defined further down in the "Examples" section of ?cv.glm. Quoting from that section:
# [...] Since the response is a binary variable an
# appropriate cost function is
cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
This does essentially what you were trying to do with your example. Replacing your "no" and "yes" with 0 and 1, lets say you have two vectors, predict and response. Then cost() is nicely designed to take them and return the mean classification rate:
## Simulate some reasonable data
set.seed(1)
predict <- seq(0.1, 0.9, by=0.1)
response <- rbinom(n=length(predict), prob=predict, size=1)
response
# [1] 0 0 0 1 0 0 0 1 1
## Demonstrate the function 'cost()' in action
cost(response, predict)
# [1] 0.3333333 ## Which is right, as 3/9 elements (4, 6, & 7) are misclassified
## (assuming you use 0.5 as the cutoff for your predictions).
I'm guessing the trickiest bit of this will be just getting your mind fully wrapped around the idea of passing a function in as an argument. (At least that was for me, for the longest time, the hardest part of using the boot package, which requires that move in a fair number of places.)
Added on 2016-03-22:
The function cost(), given above is in my opinion unnecessarily obfuscated; the following alternative does exactly the same thing but in a more expressive way:
cost <- function(r, pi = 0) {
mean((pi < 0.5) & r==1 | (pi > 0.5) & r==0)
}
I will try to explain the cost function in simple words. Let's take
cv.glm(data, glmfit, cost, K) arguments step by step:
data
The data consists of many observations. Think of it like series of numbers or even.
glmfit
It is generalized linear model, which runs on the above series. But there is a catch it splits data into several parts equal to K. And runs glmfit on each of them separately (test set), taking the rest of them as training set. The output of glmfit is a series consisting of same number of elements as the split input passed.
cost
Cost Function. It takes two arguments first the split input series(test set), and second the output of glmfit on the test input. The default is mean square error function.
.
It sums the square of difference between observed data point and predicted data point. Inside the function a loop runs over the test set (output and input should have same number of elements) calculates difference, squares it and adds to output variable.
K
The number to which the input should be split. Default gives leave one out cross validation.
Judging from your cost function description. Your input(x) would be a set of numbers between 0 and 1 (0-0.5 = no and 0.5-1 = yes) and output(y) is 'yes' or 'no'. So error(e) between observation(x) and prediction(y) would be :
cost<- function(x, y){
e=0
for (i in 1:length(x)){
if(x[i]>0.5)
{
if( y[i]=='yes') {e=0}
else {e=x[i]-0.5}
}else
{
if( y[i]=='no') {e=0}
else {e=0.5-x[i]}
}
e=e*e #square error
}
e=e/i #mean square error
return (e)
}
Sources : http://www.cs.cmu.edu/~schneide/tut5/node42.html
The cost function can optionally be defined if there is one you prefer over the default average squared error. If you wanted to do so then the you would write a function that returns the cost you want to minimize using two inputs: (1) the vector of known labels that you are predicting, and (2) the vector of predicted probabilities from your model for those corresponding labels. So for the cost function that (I think) you described in your post you are looking for a function that will return the average number of accurate classifications which would look something like this:
cost <- function(labels,pred){
mean(labels==ifelse(pred > 0.5, 1, 0))
}
With that function defined you can then pass it into your glm.cv() call. Although I wouldn't recommend using your own cost function over the default one unless you have reason to. Your example isn't reproducible, so here is another example:
> library(boot)
>
> cost <- function(labels,pred){
+ mean(labels==ifelse(pred > 0.5, 1, 0))
+ }
>
> #make model
> nodal.glm <- glm(r ~ stage+xray+acid, binomial, data = nodal)
> #run cv with your cost function
> (nodal.glm.err <- cv.glm(nodal, nodal.glm, cost, nrow(nodal)))
$call
cv.glm(data = nodal, glmfit = nodal.glm, cost = cost, K = nrow(nodal))
$K
[1] 53
$delta
[1] 0.8113208 0.8113208
$seed
[1] 403 213 -2068233650 1849869992 -1836368725 -1035813431 1075589592 -782251898
...
The cost function defined in the example for cv.glm clearly assumes that the predictions are probabilities, which would require the type="response" argument in the predict function. The documentation from library(boot) should state this explicitly. I would otherwise be forced to assume that the default type="link" is used inside the cv.glm function, in which case the cost function would not work as intended.

Multtest package in R returning not properly adjusting p-values

I am trying to use the R package multtest to adjust a list of p-values for multiple testing. However, multtest will only return a list of "1" characters of equal length to the list of p-values that were analyzed.
The input file is a text file in which the pvalues are separated by newline characters. A segment of the file is reproduced below:
0.182942602
0.333002877
0.282000206
0.161501682
0.161501682
I downloaded the multtest package (multtest_2.14.0) from Bioconductor, and am running it in R version x64 2.15.2. Does anyone know if there is a compatibility problem between multtest and R 2.15.2?
My code:
library(multtest, verbose = FALSE)
table1 <- read.table("p-values.txt", header = FALSE, colClasses = "double")
table2 <-as.vector(as.double(table1[,1]))
results<-p.adjust(table2, method = c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none"))
write.table(results, file = "output.txt")
This is not an error- this is the correct adjustment when there are no p-values that can be considered significant within that vector of p-values.
Your code performs the Holm correction (method takes only one argument, and in this case will use the "holm" method, the first item in your vector). The Holm method will correctly return all ones in the case that
min(p) * length(p) > 1
In that situation (using this multiple hypothesis testing framework), there are no p-values in the vector that can be considered significant.
If you'd like to see the gory details, the code for the holm method (taken directly from the multtest package) is
i <- seq_len(lp)
o <- order(p)
ro <- order(o)
pmin(1, cummax((n - i + 1L) * p[o]))[ro]
where p is the input vector, and lp and n are the length of the vector. That expression (n - i + 1L) * p[o] is saying "for each item in the sorted list, take n+1 minus its index, then multiply it by the value". For the minimum item, that is (n + 1 - 1) * min(p) -> n * min(p). The cummax means the cumulative maximum- which means that none of the subsequent items can be smaller than the first value. And pmin(1, ...) means that for every item in the vector, if the item is greater than 1, set the value equal to 1 (since a p-value about 1 is meaningless).
This means that if n * min(p) is greater than one, then the adjusted p-value of the smallest item is 1, which means the adjusted p-value of every item must be 1.

Resources