I am attempting a simple event study in R, using data retrieved from Wharton Research Data Services (WRDS). I am not completely new to R, but I would describe my expertise as intermediate. Here is the problem: I am using the eventstudies package, and one of the steps is converting physical dates to event-time dates with the phys2eventtime(..) function. This function takes several arguments:
z: time series data for which the event frame is to be generated, in the form of an xts object.
events: a data frame with two columns, unit and when. unit holds the name of the column whose response is to be measured on the event date, while when holds the event date.
width: the number of days on each side of the event date. For a given width, if there is any NA in the event window, the last observation is carried forward.
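As a rough sketch of the shapes described above, an events table could be built as follows. The values are made up, and the assumption that when should be of class Date is inferred from the packaged SplitDates example rather than confirmed from the documentation.

```r
# Hypothetical events table: one row per event.
# `unit` names a column of the returns object, `when` is the event date.
events <- data.frame(
  unit = c("11754", "10104"),
  when = as.Date(c("2012-01-05", "2012-01-24")),
  stringsAsFactors = FALSE
)

# Inspect the column classes before handing the table to phys2eventtime()
sapply(events, class)
```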
The authors of the package have provided an example for the xts object (StockPriceReturns) and for Events (SplitDates). This looks like the following:
> data(StockPriceReturns)
> data(SplitDates)
> head(SplitDates)
unit when
5 BHEL 2011-10-03
6 Bharti.Airtel 2009-07-24
8 Cipla 2004-05-11
9 Coal.India 2010-02-16
10 Dr.Reddy 2001-10-10
11 HDFC.Bank 2011-07-14
> head(StockPriceReturns)
2000-04-03 -8.3381609
2000-04-04 0.5923550
2000-04-05 6.8097616
2000-04-06 -0.9448889
2000-04-07 7.6843828
2000-04-10 4.1220462
2000-04-11 -1.9078480
2000-04-12 -8.3286900
2000-04-13 -3.8876847
2000-04-17 -8.2886060
So I have constructed my data in the same way, an xts object (DS_xts) and a data.frame (cDS) with the columns "unit" and "when". This is how it looks:
> head(DS_xts)
2011-01-03 0.024247
2011-01-04 0.039307
2011-01-05 0.010589
2011-01-06 -0.022172
2011-01-07 0.018057
2011-01-10 0.041488
> head(cDS)
unit when
1 11754 2012-01-05
2 10104 2012-01-24
3 61241 2012-01-31
4 13928 2012-02-07
5 14656 2012-02-08
6 60097 2012-02-14
These are similar in my opinion, but how it looks does not tell the whole story. I am quite certain that my problem is in how I have constructed these two objects. Below is my R code:
library(xts)
library(eventstudies)

DS = read.csv("ReturnData.csv")
cDS = read.csv("EventData.csv")
#Calculate Abnormal Returns
#Clean up and let only necessary columns remain
DS = DS[, c("PERMNO", "DATE", "AR")]
cDS = cDS[, c("PERMNO", "DATE")]
#Generate correct date format according to R's as.Date
for (i in 1:nrow(DS)) {
  DS$DATE[i] = format(as.Date(toString(DS$DATE[i]), format = "%Y %m %d"), format = "%Y-%m-%d")
}
for (i in 1:nrow(cDS)) {
  cDS$DATE[i] = format(as.Date(toString(cDS$DATE[i]), format = "%Y %m %d"), format = "%Y-%m-%d")
}
#Rename cDS columns according to phys2eventtime format
colnames(cDS)[1] = "unit"
colnames(cDS)[2] = "when"
#Create list of unique PERMNO's
PERMNO <- unique(DS$PERMNO)
for (i in 1:length(PERMNO)) {
  #Subset based on PERMNO
  DStmp <- DS[DS$PERMNO == PERMNO[i], ]
  #Remove PERMNO column and rename AR to PERMNO
  DStmp <- DStmp[, c("DATE", "AR")]
  colnames(DStmp)[2] = as.character(PERMNO[i])
  dates <- as.Date(DStmp$DATE)
  DStmp <- DStmp[, -c(1)]
  #Create a temporary XTS object
  DStmp_xts <- xts(DStmp, order.by = dates)
  #If first iteration, just create new variable, otherwise merge
  if (i == 1) {
    DS_xts <- DStmp_xts
  } else {
    DS_xts <- merge(DS_xts, DStmp_xts, all = TRUE)
  }
}
#Renaming columns for matching
colnames(DS_xts) <- c(PERMNO)
#Making sure classes are the same
cDS$unit <- as.character(cDS$unit)
eventList <- phys2eventtime(z = DS_xts, events = cDS, width = 10)
So, if I run phys2eventtime(..) it returns:
> eventList <- phys2eventtime(z = DS_xts, events = cDS, width = 10)
Error in if ((location <= 1) | (location >= length(x))) { :
missing value where TRUE/FALSE needed
In addition: Warning message:
In findInterval(when, index(x)) : NAs introduced by coercion
I have looked at the original function (it is available at their GitHub, can't use more than two links yet) to figure out this error, but I ran out of ideas how to debug it. I hope someone can help me sort it out. As a final note, I have also looked at another (magnificent) answer related to this R package (question: "format a zoo object with “dimnames”=List of 2"), but it wasn't enough to help me solve it (or I couldn't yet comprehend it).
Here is the link for the two CSV files if you would like to reproduce my error (or solve it!).
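One thing that may be worth checking, given the "NAs introduced by coercion" warning: the loops above store the result of format(), which is a character string, so cDS$when ends up as character rather than Date. The following is a minimal sketch of a vectorised conversion that keeps the Date class. It uses a toy data frame in place of EventData.csv and assumes, hypothetically, that the raw dates are stored as yyyymmdd integers.

```r
# Toy stand-in for EventData.csv; the yyyymmdd layout is an assumption
cDS <- data.frame(
  PERMNO = c(11754L, 10104L),
  DATE   = c(20120105L, 20120124L)
)

# Convert the whole column at once and keep it as class Date
cDS$DATE <- as.Date(as.character(cDS$DATE), format = "%Y%m%d")

# Rename to the column names used for the events argument
colnames(cDS) <- c("unit", "when")
cDS$unit <- as.character(cDS$unit)

str(cDS)
```

Whether this resolves the error depends on how phys2eventtime() handles the when column internally, so it is a starting point for debugging rather than a confirmed fix.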


R/zoo: duplicate index entries in ‘order.by’ are not unique

I have an Excel file containing 3 columns of data against a column of time at one-hour intervals. I tried to convert the data into a zoo object, but every time I try, I get a warning that says "In zoo(y, order.by = index(x), ...) : some methods for “zoo” objects do not work if the index entries in ‘order.by’ are not unique".
> datos_meterologicos <- read_excel(datos, sheet = "Precip")
> idx <- as.Date(datos_meterologicos$Fecha)
> date.matrix <- as.matrix(datos_meterologicos[,-1])
> date.xts <- as.xts(date.matrix, order.by = idx)
> date.zoo <- as.zoo(date.xts)
Warning message:
In zoo(y, order.by = index(x), ...) :
some methods for “zoo” objects do not work if the index entries in ‘order.by’ are not unique
I looked up solutions to other questions with the same problem, so I tried the following code:
datos_meterologicos$Fecha <- read.zoo(datos_meterologicos, FUN=as.POSIXct, format = "%Y/%m/%d %H:%M", tz="UTC")
But I get the same error.
The data is right here
You are transforming your datetime values into dates with as.Date. You need to keep the time as well; otherwise you have 24 values for one day instead of one value per day and hour. Using as.POSIXct will preserve your times.
idx <- as.POSIXct(datos_meterologicos$Fecha)
# rest of your code...
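To see the difference on a small scale, here is a sketch with three made-up hourly timestamps (not the poster's data). as.Date drops the time of day, so all three collapse onto the same index value, while as.POSIXct keeps them distinct.

```r
# Three hypothetical hourly readings on the same day
stamps <- c("2019-01-01 00:00", "2019-01-01 01:00", "2019-01-01 02:00")

# as.Date() keeps only the calendar day, so the index has duplicates
as_dates <- as.Date(stamps)

# as.POSIXct() keeps the hour and minute, so every entry is unique
as_times <- as.POSIXct(stamps, format = "%Y-%m-%d %H:%M", tz = "UTC")

has_dup_dates <- anyDuplicated(as_dates) > 0
has_dup_times <- anyDuplicated(as_times) > 0
print(c(dates = has_dup_dates, times = has_dup_times))
```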

Using R to create a new dataframe using tests from values in another dataframe

I have salesorder data in the following (sample) format:
salesorder <- c('TM001', 'TM002', 'TM003', 'TM004')
esttxndate <- as.Date(c('2018-10-01', '2018-10-01', '2018-10-04', '2018-10-06'))
potxndate <- as.Date(c('2018-10-07', '2018-10-06', '2018-10-14', '2018-10-18'))
intxndate <- as.Date(c('2018-11-06', '2018-11-05', '2018-11-13', '2018-11-17'))
salesorder <- data.frame(salesorder, esttxndate, potxndate, intxndate)
salesorder esttxndate potxndate intxndate
1 TM001 2018-10-01 2018-10-07 2018-11-06
2 TM002 2018-10-01 2018-10-06 2018-11-05
3 TM003 2018-10-04 2018-10-14 2018-11-13
4 TM004 2018-10-06 2018-10-18 2018-11-17
I am trying to create a new dataframe which looks at the dates of each salesorder and outputs the status on each date:
date TM001 TM002 TM003 TM004
01 2018-10-01 est est dne dne
02 2018-10-02 est est dne dne
07 2018-10-07 pro pro est est
32 2018-11-01 pro pro pro pro
37 2018-11-06 inv inv pro pro
48 2018-11-17 inv inv inv inv
I was able to get the list of dates out using the min and max functions (saved as mindate & maxdate). I then started a new data.frame with the values from the date range as:
mindate <- min(esttxndate, potxndate, intxndate)
maxdate <- max(esttxndate, potxndate, intxndate)
dates <- data.frame(as.Date(as.Date(mindate):as.Date(maxdate), origin="1970-01-01"))
names(dates)[1] <- "date"
I am at a loss for what to do next as I have tried to utilize user-defined functions and applying across rows on both the newly created dates dataframe and on the previous salesorder dataframe.
I am coming from a background in Stata and was able to produce the desired dataset by first going through and saving temp values for each date (ex. local variable potxndate_TM001 = 2018-10-07)
ds *date
foreach dt in `r(varlist)' {
    forval i = 1/`=_N' {
        local so = salesorder[`i']
        local `dt'_`so' = `dt'[`i']
    }
}
Once all the dates are saved as local variables I dropped all the variables besides salesorder, transposed the table and created a new variable date ranging from the minimum date to the maximum date. I then ran the following to get the values based on the date column and the locally saved variables.
ds TM*
foreach so in `r(varlist)' {
    forval i = 1/`=_N' {
        if `intxndate_`so'' <= date[`i'] {
            replace `so' = "inv" in `i'
        }
        else if `potxndate_`so'' <= date[`i'] {
            replace `so' = "pro" in `i'
        }
        else if `esttxndate_`so'' <= date[`i'] {
            replace `so' = "est" in `i'
        }
        else if `esttxndate_`so'' > date[`i'] {
            replace `so' = "dne" in `i'
        }
    }
}
I believe there is a way to do this in R without creating the intermediate local variables / modifying original dataset, which should be much more efficient and faster (?).
Loops in R tend to be slow; a faster solution is to use functional programming tools, such as those in the purrr package or the apply function, rather than loops as you used in Stata.
To solve this, I have written my own function, most_recent_txn which returns the status of a given sales order on a given date, then applied this function to all dates in the vector dates$date, using purrr::map_chr().
Then, to do this for all sales orders (rows in the original dataframe, salesorder), I have written a function which carries this out for a given row and applied this to all rows using the apply function.
most_recent_txn <- function(as_of_date, order_dates) {
  # return the column name of the last txn step completed, as of the date given.
  last_step = "dne"
  # if there is any recorded activity at that point, we assign the most recent
  # activity to last step
  if (any(t(order_dates) <= as_of_date)) {
    last_step = names(order_dates)[max(which(t(order_dates) <= as_of_date))]
  }
  last_step
}
progress_of_sales_order <- function(order) {
status = cbind(dates$date,
This is not the most elegant solution: it relies on the column order of the salesorder dataframe to indicate the step in the process, and it implicitly assumes that the status of a sales order cannot go backward (e.g. a purchase order after an invoice).
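A self-contained sketch of the "most recent step" idea follows. It is not the answer's original code: it uses base vapply instead of purrr::map_chr so that it runs without extra packages, and the function name most_recent_step, the step labels, and the check dates are all hypothetical. It shares the assumption that the steps are stored in chronological order.

```r
# Return the label of the latest step whose date is on or before as_of_date,
# or "dne" when the order does not exist yet.
most_recent_step <- function(as_of_date, order_dates) {
  done <- which(order_dates <= as_of_date)
  if (length(done) == 0) {
    return("dne")
  }
  names(order_dates)[max(done)]
}

# Dates for one sales order (TM001 in the question), labelled by step
order_dates <- as.Date(c("2018-10-01", "2018-10-07", "2018-11-06"))
names(order_dates) <- c("est", "pro", "inv")

# A few dates to evaluate; indexing keeps the Date class intact
check_days <- as.Date(c("2018-09-30", "2018-10-02", "2018-10-07", "2018-11-17"))
steps <- vapply(
  seq_along(check_days),
  function(i) most_recent_step(check_days[i], order_dates),
  character(1)
)
print(data.frame(date = check_days, status = steps))
```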
I created a function called status:
status <- function(tstdate, estdate, podate, invdate) {
  ifelse(tstdate >= invdate, "inv",
    ifelse(tstdate >= podate, "pro",
      ifelse(tstdate >= estdate, "est", "dne")))
}
I was then able to run the following across the dates:
final <- data.frame(apply(dates, c(1, 2), function(x) {
  status(x, salesorder$esttxndate, salesorder$potxndate, salesorder$intxndate)
}))
The rest is formatting to get the data frame to my liking.
Minor note: this solution has rows and columns reversed compared with the desired result in the question. The drawback of this layout is that the columns cannot simply be named after the dates, as syntactic names in R cannot start with a number.
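For comparison, here is a sketch that combines the nested ifelse idea with the sample data from the question and produces the layout the question asks for (one row per date, one column per sales order). The helper name status_on and the object names all_dates, wide and final are illustrative, and the approach assumes the three dates of an order are always in chronological order.

```r
# Sample data from the question
salesorder <- data.frame(
  salesorder = c("TM001", "TM002", "TM003", "TM004"),
  esttxndate = as.Date(c("2018-10-01", "2018-10-01", "2018-10-04", "2018-10-06")),
  potxndate  = as.Date(c("2018-10-07", "2018-10-06", "2018-10-14", "2018-10-18")),
  intxndate  = as.Date(c("2018-11-06", "2018-11-05", "2018-11-13", "2018-11-17")),
  stringsAsFactors = FALSE
)

# Every calendar day between the first estimate and the last invoice
all_dates <- seq(min(salesorder$esttxndate), max(salesorder$intxndate), by = "day")

# Status of one order on each date in d (vectorised over d)
status_on <- function(d, est, po, inv) {
  ifelse(d >= inv, "inv",
    ifelse(d >= po, "pro",
      ifelse(d >= est, "est", "dne")))
}

# One column of statuses per sales order
wide <- sapply(seq_len(nrow(salesorder)), function(i) {
  status_on(all_dates, salesorder$esttxndate[i],
            salesorder$potxndate[i], salesorder$intxndate[i])
})
colnames(wide) <- as.character(salesorder$salesorder)

final <- data.frame(date = all_dates, wide, stringsAsFactors = FALSE)
print(final[final$date %in% as.Date(c("2018-10-01", "2018-10-07", "2018-11-06")), ])
```

Because the comparison is vectorised over the whole date range, only one pass per sales order is needed and no intermediate variables have to be stored.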

Scrape number of articles on a topic per year from NYT and WSJ?

I would like to create a data frame that scrapes the NYT and WSJ and has the number of articles on a given topic per year. That is:
year NYT WSJ
2011 2 3
2012 10 7
I found this tutorial for the NYT, but it is not working for me :_(. When I get to line 30 I get this error:
> cts <- as.data.frame(table(dat))
Error in provideDimnames(x) :
length of 'dimnames' [1] not equal to array extent
Any help would be much appreciated.
PS: This is my code that is not working (a NYT API key is needed):
# Need to install from source
# then load:
### set parameters ###
api <- "API key goes here" ###### <<<API key goes here!!
q <- "MOOCs" # Query string, use + instead of space
records <- 500 # total number of records to return, note limitations above
# calculate parameter for offset
os <- 0:(records/10-1)
# read first set of data in
uri <- paste ("", q, "&offset=", os[1], "&fields=date&api-key=", api, sep="")
raw.data <- readLines(uri, warn="F") # get them
res <- fromJSON(raw.data) # tokenize
dat <- unlist(res$results) # convert the dates to a vector
# read in the rest via loop
for (i in 2:length(os)) {
  # concatenate URL for each offset
  uri <- paste ("", q, "&offset=", os[i], "&fields=date&api-key=", api, sep="")
  raw.data <- readLines(uri, warn="F")
  res <- fromJSON(raw.data)
  dat <- append(dat, unlist(res$results)) # append
}
# aggregate counts for dates and coerce into a data frame
cts <- as.data.frame(table(dat))
# establish date range
dat.conv <- strptime(dat, format="%Y%m%d") # need to convert dat into POSIX format for this
daterange <- c(min(dat.conv), max(dat.conv))
dat.all <- seq(daterange[1], daterange[2], by="day") # all possible days
# compare dates from counts dataframe with the whole data range
# assign 0 where there is no count, otherwise take count
# (take out PSD at the end to make it comparable)
dat.all <- strptime(dat.all, format="%Y-%m-%d")
# can't seem to be able to compare POSIX objects with %in%, so coerce them to character for this:
freqs <- ifelse(as.character(dat.all) %in% as.character(strptime(cts$dat, format="%Y%m%d")), cts$Freq, 0)
plot (freqs, type="l", xaxt="n", main=paste("Search term(s):",q), ylab="# of articles", xlab="date")
axis(1, 1:length(freqs), dat.all)
lines(lowess(freqs, f=.2), col = 2)
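Since the stated goal is a count of articles per year rather than per day, the aggregation step could be sketched as below. The dat vector here is made-up data standing in for the yyyymmdd date strings collected from the API; no request is made, and the column name year is illustrative.

```r
# Hypothetical publication dates in the yyyymmdd form used above
dat <- c("20110105", "20110712", "20120301", "20120302", "20121111")

# The first four characters are the year; tabulate them
cts <- as.data.frame(table(year = substr(dat, 1, 4)), stringsAsFactors = FALSE)
print(cts)
```

The same table could be built separately for each source and the two merged on year to get the NYT and WSJ columns side by side.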
UPDATE: the repo is now at
There is an RNYTimes package created by Duncan Temple-Lang, but it is outdated because the NYTimes API is on v2 now. I've been working on one for the political endpoints only, but that is not relevant for you.
I'm rewiring RNYTimes right now...Install from github. You need to install devtools first to get install_github
Then try your search with that, e.g,
library(RNYTimes); library(plyr)
moocs <- searchArticles("MOOCs", key = "<yourkey>")
This gives you number of articles found
[1] 121
You could get word counts for each article by
as.numeric(sapply(moocs$response$docs, "[[", 'word_count'))
[1] 157 362 1316 312 2936 2973 355 1364 16 880

Create a trading day calendar from scratch

I just spent a day debugging some R code only to find that the problem I was having was caused by a missing date in the data returned by Yahoo using getSymbols. At the time I write this Yahoo is returning this:
QQQ.Open QQQ.High QQQ.Low QQQ.Close QQQ.Volume QQQ.Adjusted
2014-01-03 87.27 87.35 86.62 86.64 35723700 86.64
2014-01-06 86.66 86.76 86.00 86.32 32073100 86.32
2014-01-07 86.72 87.25 86.56 87.12 25860600 87.12
2014-01-08 87.14 87.55 86.95 87.31 27197400 87.31
2014-01-09 87.63 87.64 86.72 87.02 23674700 87.02
2014-01-13 87.18 87.48 85.68 86.01 48842300 86.01
2014-01-14 86.30 87.72 86.30 87.65 37178900 87.65
2014-01-15 88.03 88.54 87.94 88.37 39835600 88.37
2014-01-16 88.30 88.51 88.16 88.38 31630100 88.38
2014-01-17 88.11 88.37 87.67 87.88 36895800 87.88
which is missing 2014-01-10. That date is returned for other ETFs. I expect that Yahoo will fix the data one of these days (the data is on Google) but for now it is wrong which caused my code some fits.
To address this issue I want to check my data to ensure that there is data for all dates the markets were open. If there's a canned way to do this in some package I'd appreciate info on that but to that end I started writing some code using the timeDate package. However I have ended up with xts index questions I don't understand. The code follows:
library(timeDate)
library(quantmod)
MyZone = "UTC"
Sys.setenv(TZ = MyZone)
YearStart = "1990"
YearEnd = "2014"
currentYear = getRmetricsOptions("currentYear")
dateStart = paste0(YearStart, "-01-01")
dateEnd = paste0(YearEnd, "-12-31")
DayCal = timeSequence(from = dateStart, to = dateEnd, by="day", zone = MyZone)
TradingCal = DayCal[isBizday(DayCal, holidayNYSE())]
testSym = "QQQ"
getSymbols(testSym, src="yahoo", from = dateStart, to = dateEnd)
testData = get(testSym)
tail(testData, n=10)
#Save date range of data being checked
firstIndex = index(testData)[1]
lastIndex = index(testData)[nrow(testData)]
#Create an xts series covering all dates
AllDates = xts(x=rep(1, length.out=length(TradingCal)), order.by = TradingCal, tzone = MyZone)
#Create an xts object that has all dates covered
#by testSym but using calendar I created
CheckData = subset(AllDates, ((index(AllDates)>=firstIndex) && (index(AllDates)<=lastIndex)))
The goal here was to create a 'known good calendar' which I could use to create a simple xts object. With that object I would then check whether every index in that object had a corresponding index in the data being tested. However I'm not getting that far as it appears my indexes are not compatible. When I run the code I get this at the end:
> CheckData = subset(AllDates, ((index(AllDates)>=firstIndex) && (index(AllDates)<=lastIndex))
+ )
Error in `>=.default`(index(AllDates), firstIndex) :
comparison (5) is possible only for atomic and list types
> class(index(AllDates))
[1] "timeDate"
attr(,"package")
[1] "timeDate"
> class(index(testData))
[1] "Date"
Can someone show me the errors of my ways here so that I can move forward? Thanks!
You need to convert TradingCal to Date:
TradingDates <- as.Date(TradingCal)
And here's another way to find index values in TradingDates that aren't in your testData index.
AllDates <- xts(,TradingDates)
testSubset <- paste(start(testData), end(testData), sep="/")
CheckData <- merge(AllDates, testData)[testSubset]
BadDates <- CheckData[]
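Once both sides are plain Date vectors, the check for missing trading days can also be sketched without xts at all. The two vectors below are hypothetical stand-ins for TradingDates and index(testData), using the dates from the question.

```r
# Stand-in for the known-good trading calendar (TradingDates)
expected <- as.Date(c("2014-01-08", "2014-01-09", "2014-01-10", "2014-01-13"))

# Stand-in for index(testData): the dates the data source actually returned
observed <- as.Date(c("2014-01-08", "2014-01-09", "2014-01-13"))

# Trading days within the observed range that have no data
in_range <- expected >= min(observed) & expected <= max(observed)
missing_days <- expected[in_range & !(expected %in% observed)]
print(missing_days)
```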

R subsetting by date range

This seems simple enough, and I've been through all the similar questions and applied them all... I'm either getting nothing or everything...
I'm trying to look at water temperatures (WTEMP) for a specific date range (SAMPLE_DATE), 2007-06-01 to 2007-09-30, from my data (allconmon).
Here is my code so far...
bydate<-subset(allconmon, allconmon$SAMPLE_DATE > as.Date("2007-06-01") & allconmon$SAMPLE_DATE < as.Date("2007-09-30"))
I've also tried this but get errors:
bydate2<- as.xts(allconmon$WTEMP, order.by = allconmon$SAMPLE_DATE)
Error in xts(x, order.by = order.by, frequency = frequency, .CLASS = "double", : order.by requires an appropriate time-based object
Not sure what I'm doing wrong here... it seems to work for other people.
I highly recommend using the zoo package in R when dealing with time series data.
The operation you mentioned is actually a window function in zoo.
Here is the example from ?window:
window(presidents, 1960, c(1969,4)) # values in the 1960's
window(presidents, deltat = 1) # All Qtr1s
window(presidents, start = c(1945,3), deltat = 1) # All Qtr3s
window(presidents, 1944, c(1979,2), extend = TRUE)
pres <- window(presidents, 1945, c(1949,4)) # values in the 1940's
window(pres, 1945.25, 1945.50) <- c(60, 70)
window(pres, 1944, 1944.75) <- 0 # will generate a warning
window(pres, c(1945,4), c(1949,4), frequency = 1) <- 85:89
Here is a list of papers from JSS demonstrating the usage of the zoo package, as well as how to reshape your data, which I found very inspiring.
I figured it out! On multiple levels... First off, I didn't notice that R did something funky with my sample date label when I loaded the data from a text file... probably my fault...
Here is a small sample of the data set. It's 5,573,301 observations of 30 variables.
Notice the funky symbol in front of sample date... not sure why R did that...
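The odd prefix is consistent with a UTF-8 byte order mark (BOM) at the start of the file being read as part of the first header name. That is an assumption, since the file itself is not shown. As a small sketch, the following writes a temporary tab-separated file with a BOM (made-up contents) and reads it back with fileEncoding = "UTF-8-BOM", which asks R to drop the mark.

```r
# Build a tiny tab-separated file that starts with a UTF-8 BOM (EF BB BF)
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("SAMPLE_DATE\tWTEMP\n2007-06-01\t19.0\n"), con)
close(con)

# Reading with the BOM-aware encoding keeps the header name clean
x <- read.csv(tmp, sep = "\t", header = TRUE, fileEncoding = "UTF-8-BOM")
print(names(x))

unlink(tmp)
```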
However, here is what I did (I changed the name to x, as allconmon was a bit excessive):
x <- read.csv(file = "C:/Users/Desktop/cmon2001-08.txt",quote = "",header = TRUE,sep = "\t", na.strings = c("","NULL"))
x$month <- months(as.Date(x$ï..SAMPLE_DATE, "%Y-%m-%d"))
x$year <- substr(as.character(x$ï..SAMPLE_DATE), 1, 4)
y <- x[x$month == 'June' | x$month == 'July' | x$month == 'August' | x$month == 'September' ,]
so now I was able to subset all my data by those 4 months and then later by year, station, and water temp....
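For the date-range subset the question originally asked for, a base R sketch is shown below. It uses a made-up four-row data frame in place of allconmon and assumes the dates arrive as yyyy-mm-dd text, which has to be converted to class Date before any comparison with as.Date() values will behave as expected.

```r
# Toy stand-in for allconmon; the values are invented for illustration
allconmon <- data.frame(
  SAMPLE_DATE = c("2007-05-31", "2007-06-01", "2007-08-15", "2007-10-01"),
  WTEMP       = c(18.2, 19.0, 26.4, 17.1),
  stringsAsFactors = FALSE
)

# Convert the text column to Date so that range comparisons are meaningful
allconmon$SAMPLE_DATE <- as.Date(allconmon$SAMPLE_DATE, format = "%Y-%m-%d")

# Keep rows from 2007-06-01 to 2007-09-30 inclusive
bydate <- subset(
  allconmon,
  SAMPLE_DATE >= as.Date("2007-06-01") & SAMPLE_DATE <= as.Date("2007-09-30")
)
print(bydate)
```

Unlike matching on month names, this keeps the range tied to one year and allows arbitrary start and end days.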
