I am trying to run a simple event study in R using the market model. I am using the eventstudies package, and one of the steps is calculating the abnormal returns, i.e. ar <- es$z.e - esMean.
The authors of the package have provided example data for the xts object (StockPriceReturns), the events (SplitDates) and the market (OtherReturns). It looks like the following:
> library(eventstudies)
> data(StockPriceReturns)
> data(SplitDates)
> head(SplitDates)
unit when
5 BHEL 2011-10-03
6 Bharti.Airtel 2009-07-24
8 Cipla 2004-05-11
9 Coal.India 2010-02-16
10 Dr.Reddy 2001-10-10
11 HDFC.Bank 2011-07-14
> head(StockPriceReturns)
Bajaj.Auto
2010-07-01 0.5277396
2010-07-02 -1.7309383
2010-07-05 -0.2530097
2010-07-06 -0.3167551
2010-07-07 -1.2771502
2010-07-08 -0.2827092
With this data, I would like to do an event study; see the R code below:
# 10-day window around the event
es <- phys2eventtime(z=StockPriceReturns, SplitDates, width = 10)
es.cmr <- constantMeanReturn(es$z.e[which(attributes(es$z.e)$index%in%-30:-11), ], residual = FALSE)
ar <- es$z.e - es.cmr
ar <- window(ar, start = -1, end = 10)
car <- remap.cumsum(ar, is.pc = FALSE, base = 0)
rowMeans(car, na.rm = TRUE)
So, if I run ar <- es$z.e - es.cmr it returns:
> ar <- es$z.e - es.cmr
Error in `-.default`(es$z.e, es.cmr) : non-conformable arrays
I have looked at the original function (available at their GitHub) to figure out this error, but I ran out of ideas how to debug it. I hope someone can help me sort it out.
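The "non-conformable arrays" message is what base R raises when two arrays with different dim attributes are combined elementwise. I have not verified what constantMeanReturn() returns with residual = FALSE, so the following is only a sketch on made-up data: the matrix z is a hypothetical stand-in for es$z.e. It reproduces the error and shows one way to subtract a per-event estimation-window mean with sweep():

```r
# Hypothetical stand-in for es$z.e: rows are event times, columns are events.
set.seed(1)
z <- matrix(rnorm(41 * 3), nrow = 41, ncol = 3,
            dimnames = list(as.character(-30:10), c("A", "B", "C")))

# Estimation window: event times -30 to -11.
est <- z[as.character(-30:-11), , drop = FALSE]
mu  <- matrix(colMeans(est), nrow = 1)   # 1 x 3 matrix of window means

# A 41 x 3 matrix minus a 1 x 3 matrix fails, as in the question.
msg <- tryCatch(z - mu, error = function(e) conditionMessage(e))
print(msg)

# sweep() subtracts one mean per column, whatever the number of rows.
ar <- sweep(z, 2, colMeans(est), FUN = "-")
```

If the real objects behave the same way, comparing dim(es$z.e) with dim(es.cmr) should show where the shapes diverge.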
Regarding the market model: does anybody know how to do the same with the marketModel function of the eventstudies package? I'm not able to create es.mm (see below) because of the error message: NROW(firm.returns) == NROW(market.returns) is not TRUE
es.mm <- marketModel(firm.returns = es$z.e[which(attributes(es$z.e)$index%in%-30:-11), ], market.returns = OtherReturns[, "NiftyIndex"], residual = FALSE)
Thank you very much for your help.
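The NROW check in the error message suggests that the firm and market series need the same number of rows, which an event-time subset of es$z.e and the calendar-time OtherReturns series would not have. I can't confirm the intended usage of marketModel() from the question alone, so this is a base-R sketch of the idea on simulated data: align the two series on their common dates first, then fit the market model r_firm = alpha + beta * r_market + e with lm(). All names and numbers below are hypothetical.

```r
set.seed(42)
dates  <- seq(as.Date("2010-07-01"), by = "day", length.out = 60)
market <- data.frame(date = dates, mkt = rnorm(60))

# The firm series covers only part of the calendar, so the row counts differ.
firm <- data.frame(date = dates[11:50])
firm$ret <- 0.1 + 1.2 * market$mkt[11:50] + rnorm(40, sd = 0.5)

# Keep only the dates present in both series, then estimate alpha and beta.
both <- merge(firm, market, by = "date")
fit  <- lm(ret ~ mkt, data = both)

# Residuals of the fitted market model over the aligned dates.
ar <- residuals(fit)
print(coef(fit))
```

In a real event study the coefficients would be estimated on the estimation window only and then applied to the event window; the sketch only shows the alignment step that the row-count check seems to require.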
I am applying the lmer() function across all columns in a data frame. I have made a list of variables and used lapply. Below is the code:
varlist=names(Genus_abundance)[5:ncol(Genus_abundance)]
lapply(varlist, function(x){lmer(substitute(i ~ Status + (1|Match), list(i=as.name(x), data=Genus_abundance, na.action = na.exclude)))})
However, I keep getting this error:
Error in eval(predvars, data, env) : object 'Acetatifactor' not found
I have checked and Acetatifactor is in the Genus_abundance dataframe.
I'm a bit stuck on where it's going wrong.
EDIT:
Added a working example:
set.seed(43)
n <- 6
dat <- data.frame(id=1:n, Status=rep(LETTERS[1:2], n/2), age= sample(18:90, n, replace=TRUE), match=1:n, Acetatifactor=runif(n), Acutalibacter=runif(n), Adlercreutzia=runif(n))
head(dat)
id Status age match Acetatifactor Acutalibacter Adlercreutzia
1 1 A 49 1 0.1861022 0.1364904 0.8626298
2 2 B 31 2 0.7297301 0.8246794 0.3169752
3 3 A 23 3 0.4118721 0.5923042 0.2592606
4 4 B 64 4 0.4140497 0.7943970 0.7422665
5 5 A 60 5 0.4803101 0.7690324 0.7473611
6 6 B 79 6 0.4274945 0.9180564 0.9179040
lapply(varlist,
function(x){lmer(substitute(i ~ status + (1|match), list(i=as.name(x))),
data=dd)
})
The specific problem here is misplaced parentheses. You should close the substitute(..., list(i=as.name(x))) with three close-parentheses so that the whole chunk is properly understood as the first argument to lmer().
More generally I agree with @Kat in the comments that this is a good place to look. Since your arguments are already strings (not symbols) you don't really need all of the substitute() business and could use
fit_fun <- function(v) {
lmer(reformulate(c("status", "(1|match)"), response = v),
data = dd, na.action = na.exclude)
}
lapply(varlist, fit_fun)
Or you could use refit to fit the first column, then update the fit with each of the next columns. For large models this is much more efficient.
m1 <- lmer(resp1 ~ status + (1|match), ...)
m_other <- lapply(dd[-(1:3)], refit, object = m1)
c(list(m1), m_other)
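To see what reformulate() builds before handing it to lmer(), the formulas can be constructed on their own in base R. This is a small sketch using the column names from the working example in the question (Status, match and the three genus columns); no model is fitted here.

```r
# Response names taken from the example data frame in the question.
vars <- c("Acetatifactor", "Acutalibacter", "Adlercreutzia")

# One formula per response; the random-effect term is passed as a string.
forms <- lapply(vars, function(v) {
  reformulate(c("Status", "(1|match)"), response = v)
})
names(forms) <- vars

print(forms[["Acetatifactor"]])
```

Each element of forms is an ordinary formula object, so it can be passed as the first argument of lmer() together with data = and na.action =.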
I've got this data processing:
library(text2vec)
##Using perplexity for hold out set
t1 <- Sys.time()
perplex <- c()
for (i in 3:25){
set.seed(17)
lda_model2 <- LDA$new(n_topics = i)
doc_topic_distr2 <- lda_model2$fit_transform(x = dtm, progressbar = F)
set.seed(17)
sample.dtm2 <- itoken(rawsample$Abstract,
preprocessor = prep_fun,
tokenizer = tok_fun,
ids = rawsample$id,
progressbar = F) %>%
create_dtm(vectorizer,vtype = "dgTMatrix", progressbar = FALSE)
set.seed(17)
new_doc_topic_distr2 <- lda_model2$transform(sample.dtm2, n_iter = 1000,
convergence_tol = 0.001, n_check_convergence = 25,
progressbar = FALSE)
perplex[i] <- text2vec::perplexity(sample.dtm2, topic_word_distribution =
lda_model2$topic_word_distribution,
doc_topic_distribution = new_doc_topic_distr2)
}
print(difftime(Sys.time(), t1, units = 'sec'))
I know there are a lot of questions like this, but I haven't been able to find an answer that fits my situation. Above is a perplexity calculation for a Latent Dirichlet Allocation model with 3 to 25 topics. I want to find the best value among those, meaning the elbow or knee of the curve. The values can be treated as a simple numeric vector, which looks like this:
1 NA
2 NA
3 222.6229
4 210.3442
5 200.1335
6 190.3143
7 180.4195
8 174.2634
9 166.2670
10 159.7535
11 153.7785
12 148.1623
13 144.1554
14 141.8250
15 138.8301
16 134.4956
17 131.0745
18 128.8941
19 125.8468
20 123.8477
21 120.5155
22 118.4426
23 116.4619
24 113.2401
25 114.1233
plot(perplex)
This is what the plot looks like
I would say that the elbow is at 13 or 16, but I'm not completely sure, and I want the exact number as output. I saw in this paper that f''(x) / (1+f'(x)^2)^1.5 is the knee formula. I tried it as follows, and it says 18:
> d1 <- diff(perplex) # first derivative
> d2 <- diff(d1) / diff(perplex[-1]) # second derivative
> knee <- (d2)/((1+(d1)^2)^1.5)
Warning message:
In (d2)/((1 + (d1)^2)^1.5) :
longer object length is not a multiple of shorter object length
> which.min(knee)
[1] 18
I can't fully figure this thing out. Would someone like to share how I could get the exact ideal topics number according to perplexity as an outcome?
Found this: "The LDA model with the optimal coherence score, obtained with an elbow method (the point with maximum absolute second derivative) (...)" in this paper, so this code does the job: d1 <- diff(perplex); k <- which.max(abs(diff(d1) / diff(perplex[-1])))
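The length warning in the question arises because d1 and d2 have different lengths. Also, as far as I can tell, with topic counts spaced one apart the second difference alone already approximates f''(x), so the extra division by diff(perplex[-1]) is not part of the usual finite-difference formula. The following is a sketch, under those assumptions, that keeps the derivative vectors aligned and maps positions back to topic counts. It uses the perplexity values listed in the question.

```r
perplex <- c(NA, NA, 222.6229, 210.3442, 200.1335, 190.3143, 180.4195,
             174.2634, 166.2670, 159.7535, 153.7785, 148.1623, 144.1554,
             141.8250, 138.8301, 134.4956, 131.0745, 128.8941, 125.8468,
             123.8477, 120.5155, 118.4426, 116.4619, 113.2401, 114.1233)

k <- which(!is.na(perplex))        # topic counts that were evaluated (3..25)
p <- perplex[k]

d1 <- diff(p)                      # forward first differences, length n - 1
d2 <- diff(p, differences = 2)     # second differences, length n - 2

# Central first difference at the interior points, same length as d2.
d1c <- (head(d1, -1) + tail(d1, -1)) / 2

# Curvature formula from the question, evaluated at interior points only.
kappa <- d2 / (1 + d1c^2)^1.5

interior   <- k[-c(1, length(k))]  # topic counts matching d2 and kappa
elbow_d2   <- interior[which.max(abs(d2))]
elbow_curv <- interior[which.max(abs(kappa))]
print(c(elbow_d2 = elbow_d2, elbow_curv = elbow_curv))
```

Both criteria are sensitive to noise in the individual values and, for the curvature version, to the scale of the two axes, so it may be worth smoothing the curve or rescaling it before trusting a single number.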
I have just begun learning to code in R, and I tried to do a classification with C5.0. I ran into some problems that I don't understand, and I would be grateful for help. Below is the code I learned from someone and tried to run on my own data:
require(C50)
data.resultc50 <- c()
prematrixc50 <- c()
for(i in 3863:3993)
{
needdata$class <- as.factor(needdata$class)
trainc50 <- C5.0(class ~ ., needdata[1:3612,], trials=5, control=C5.0Control(noGlobalPruning = TRUE, CF = 0.25))
predc50 <- predict(trainc50, newdata=testdata[i, -1], trials=5, type="class")
data.resultc50[i-3862] <- sum(predc50==testdata$class[i])/length(predc50)
prematrixc50[i-3862] <- as.character.factor(predc50)
}
Below are the two objects, needdata and testdata, used in the code above, with the first few rows of each:
class Volume MA20 MA10 MA120 MA40 MA340 MA24 BIAS10
1 1 2800 8032.00 8190.9 7801.867 7902.325 7367.976 1751 7.96
2 1 2854 8071.40 8290.3 7812.225 7936.550 7373.624 1766 6.27
3 0 2501 8117.45 8389.3 7824.350 7973.250 7379.444 1811 5.49
4 1 2409 8165.40 8488.1 7835.600 8007.900 7385.294 1825 4.02
# the above is "needdata" and actually has 15 variables with 3862 obs.
class Volume MA20 MA10 MA120 MA40 MA340 MA24 BIAS10
1 1 2800 8032.00 8190.9 7801.867 7902.325 7367.976 1751 7.96
2 1 2854 8071.40 8290.3 7812.225 7936.550 7373.624 1766 6.27
3 0 2501 8117.45 8389.3 7824.350 7973.250 7379.444 1811 5.49
4 1 2409 8165.40 8488.1 7835.600 8007.900 7385.294 1825 4.02
# the above is "testdata" and has 15 variables with 4112 obs.
The data above contain the factor class with values 0 and 1. After running the code I got the warning below:
In predict.C5.0(trainc50, newdata = testdata[i, -1], trials = 5, ... :
'trials' should be <= 1 for this object. Predictions generated using 1 trials
And when I try to look at the object trainc50 just created, I noticed the number of boosting iterations is 1 due to early stopping as shown below:
# trainc50
Call:
C5.0.formula(formula = class ~ ., data = needdata[1:3612, ],
trials = 5, control = C5.0Control(noGlobalPruning = TRUE,
CF = 0.25), earlyStopping = FALSE)
Classification Tree
Number of samples: 3612
Number of predictors: 15
Number of boosting iterations: 5 requested; 1 used due to early stopping
Non-standard options: attempt to group attributes, no global pruning
I also tried to plot the decision tree and got the error below:
plot(trainc50)
Error in if (!n.cat[i]) { : argument is of length zero
In addition: Warning message:
In 1:which(out == "Decision tree:") : numerical expression has 2 elements: only the first used
Does that mean my code is too poor to perform further trials while running C5.0? What is wrong? Can someone please explain why I encounter early stopping and what the error and warning messages mean? How can I fix it? If anyone can help me I'll be very thankful.
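Two things stand out in the printed call, as far as I can tell from the C50 documentation: earlyStopping = FALSE was passed to C5.0() itself rather than inside C5.0Control(), which is where that option is defined, and the model is refit on every pass of the loop even though the training rows never change. The sketch below uses the built-in iris data as a stand-in, since needdata and testdata are not available here, and it only runs the model if the C50 package is installed.

```r
has_c50 <- requireNamespace("C50", quietly = TRUE)

if (has_c50) {
  set.seed(1)
  train_rows <- sample(nrow(iris), 100)
  train <- iris[train_rows, ]
  test  <- iris[-train_rows, ]

  # Fit once, outside any loop; earlyStopping goes inside C5.0Control().
  fit <- C50::C5.0(Species ~ ., data = train, trials = 5,
                   control = C50::C5.0Control(noGlobalPruning = TRUE,
                                              CF = 0.25,
                                              earlyStopping = FALSE))

  # predict() handles all rows at once, so no per-row loop is needed.
  pred <- predict(fit, newdata = test[, names(test) != "Species"],
                  type = "class")
  accuracy <- mean(pred == test$Species)
  print(accuracy)
}
```

Whether boosting then actually uses all five iterations depends on the data. The sketch does not address the plot() error.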
I used the approach described in
http://r-project-thanos.blogspot.tw/2014/09/plot-c50-decision-trees-in-r.html
using the function
C5.0.graphviz(firandomf,
"a.txt",
fontname='Arial',
col.draw='black',
col.font='blue',
col.conclusion='lightpink',
col.question='grey78',
shape.conclusion='box3d',
shape.question='diamond',
bool.substitute=c('None', 'yesno', 'truefalse', 'TF'),
prefix=FALSE,
vertical=TRUE)
And in the command line:
pip install graphviz
dot -Tpng ~/plot/a.txt > ~/plot/a.png
I am receiving the error "Error in x[i, ] : subscript out of bounds" while using the TukeyC library for R.
I am attempting to run an ANOVA followed by the Tukey HSD post-hoc test. The code (below) works fine for my initial dataset richORIG.
library(TukeyC)
avORIG <- with(richORIG, aov(rich ~ ClimDiv_ORIG, data=richORIG))
summary(avORIG)
tkORIG <- TukeyC(x=avORIG, which='ClimDiv_ORIG')
summary(tkORIG)
plot(tkORIG)
Resulting in:
avORIG <- with(richORIG, aov(rich ~ ClimDiv_ORIG, data=richORIG))
> summary(avORIG)
Df Sum Sq Mean Sq F value Pr(>F)
ClimDiv_ORIG 8 488.7 61.09 76.17 <2e-16 ***
Residuals 1413 1133.2 0.80
> tkORIG <- TukeyC(x=avORIG, which='ClimDiv_ORIG')
trace: TukeyC(x = avORIG, which = "ClimDiv_ORIG")
> summary(tkORIG)
Goups of means at sig.level = 0.05
Means G1 G2 G3 G4 G5
12 5.10 a
23 4.70 b
11 4.68 b
22 4.50 b
13 4.24 c
21 4.03 c
33 3.26 d
32 3.14 d
31 2.38 e
> plot(tkORIG) ##I can't post the picture of the plot w/o appropriate reputation
I tried this again with my rich2080 dataset:
av2080 <- with(rich2080, aov(rich ~ ClimDiv_2080, data=rich2080))
summary(av2080)
tk2080 <- TukeyC(x=av2080, which='ClimDiv_2080')
summary(tk2080)
plot(tk2080)
But I get the error below:
> av2080 <- with(rich2080, aov(richtimestep ~ ClimDiv_2080, data=rich2080))
> summary(av2080)
Df Sum Sq Mean Sq F value Pr(>F)
ClimDiv_2080 8 16.2 2.0264 7.574 5.97e-10
Residuals 1416 378.9 0.2676
> tk2080 <- TukeyC(x=av2080, which='ClimDiv_2080')
trace: TukeyC(x = av2080, which = "ClimDiv_2080")
Error in x[i, ] : subscript out of bounds
The code works for all of my other data-sets (ex: simpORIG, simp2080, shanORIG, shan2080 etc.)
I came across this StackOverflow Q/A in reference to my question, but I cannot figure out what code to change in my particular circumstance.
Here are the options available from the debug menu, but as I sift through the ls(), I don't know which value to change...
Enter a frame number, or 0 to exit
1: TukeyC(x = av2080, which = "ClimDiv_2080")
2: TukeyC.aov(x = av2080, which = "ClimDiv_2080")
3: make.TukeyC.test(r = r, MSE = MSE, m.inf = m.inf, ord = ord, sig.level
4: make.TukeyC.groups(dif)
How would I go about fixing this error to get my results?
I had this exact error, but with the TukeyHSD() function in the base stats package. I have no idea why it happened. In short, yesterday it was working and today it's not. Really, really strange...
I removed the stats package (BAD IDEA) in the hope of reinstalling it, but couldn't. Eventually I just started over by downloading the latest R version.
Sorry that's not more specific, but I couldn't find an explanation for why the function suddenly stopped working, and I presume TukeyC() works in a very similar way to TukeyHSD().
It was like in Jurassic Park when they had to switch the power back on: long, complicated and scary...
I am trying to run some summary statistics on a large data set where the groups = (Entry + Plant). I am using the summaryBy() function, and it appears to be working fine for most of my variables. It is, however, transforming one of my variables (YieldPlant) using an unknown function and improperly calculating means and standard deviations. Here is some sample output:
> library(doBy)
> SP.data <- read.csv("~/Desktop/2014 Summer Research/Within-Line Variation Trial/2014 Heirloom Variation Trial.csv", na.string = c("NA"))
> head(SP.data$YieldPlant, n=10)
[1] NA NA NA NA 16.16 18.58 11.2 10.95 11.61 13.94
> summaryTRAITS <- summaryBy(YieldPlant ~ Entry + Plant, data=SP.data, FUN = function(Plant) { c(m=mean(Plant, na.rm=T), s=sd(Plant, na.rm=T))})
> head(summaryTRAITS$YieldPlant.m, n=10)
[1] NaN 307.8571 444.0000 364.0000 179.5714 354.2857 592.1429 521.3333 729.8571 322.4286
The "YieldPlant" should be much smaller than R is recognizing. I'd appreciate any help you all can offer. Thanks!
Hannah
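One thing worth checking, though it is only a guess without seeing the CSV: if YieldPlant contains any non-numeric entry (a stray "n/a" or trailing text, say), read.csv() reads the column as character or factor rather than numeric, and averaging a factor's integer level codes gives large, meaningless values. The printed 11.2 next to 16.16 hints at this, since a numeric vector would normally print as 11.20. Running str(SP.data$YieldPlant) would show the column type. A minimal sketch with made-up values:

```r
# Hypothetical column with one non-numeric entry, read as a factor.
raw <- c("16.16", "18.58", "11.2", "n/a", "10.95")
f   <- factor(raw)

# as.numeric() on a factor returns the level codes, not the yields.
codes <- as.numeric(f)

# Converting through character recovers the values; "n/a" becomes NA.
fixed <- suppressWarnings(as.numeric(as.character(f)))

print(mean(codes))
print(mean(fixed, na.rm = TRUE))
```

If that turns out to be the cause, converting the column once after reading the file, or adding the offending string to na.strings, should let summaryBy() compute the means on the real values.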