Change nberDates() into a time series in R for subsetting

nberDates() in the tis package gives a list of recession start and end dates.
What's the slickest, shortest way to turn this into a set of dummies for subsetting an existing time series?
So, nberDates itself yields...
> nberDates()
Start End
[1,] 18570701 18581231
[2,] 18601101 18610630
[3,] 18650501 18671231
[4,] 18690701 18701231
[5,] 18731101 18790331
[6,] 18820401 18850531
and str(nberDates()) says the type is "Named num."
I have another time series object in xts which currently looks like this...
> head(mydata)
value
1966-01-01 15
1967-01-01 16
1968-01-01 20
1969-01-01 21
1970-01-01 18
1971-01-01 12
I'd like to have a second variable, recess, that is 1 during recessions:
> head(mydata)
value recess
1966-01-01 15 0
1967-01-01 16 0
1968-01-01 20 0
1969-01-01 21 0
1970-01-01 18 1
1971-01-01 12 0
(My goal is that I'd like to be able to compare values in recessions against values out of recessions.)
The clunky thing I'm trying that isn't working is this...
(index(mydata) > as.Date(as.character(nberDates()[,1]), format="%Y%m%d")) &
  (index(mydata) < as.Date(as.character(nberDates()[,2]), format="%Y%m%d"))
But this yields...
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Warning messages:
1: In `>.default`(index(mydata), as.Date(as.character(nberDates()[, :
longer object length is not a multiple of shorter object length
2: In `<.default`(index(mydata), as.Date(as.character(nberDates()[, :
longer object length is not a multiple of shorter object length
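(The warning is the clue: `>` compares two vectors element by element and recycles the shorter one, so each date is lined up against one recession row rather than being tested against every interval. A minimal illustration:)

```r
# `>` recycles the shorter vector instead of comparing all pairs:
# here 10, 20, 30 are compared against 5, 25, 5 (the 5 is recycled),
# with a warning because the lengths are not multiples of each other.
c(10, 20, 30) > c(5, 25)
```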
I know I can solve this with a clunky for-loop, but that always suggests to me I'm doing R wrong.
Any suggestions?

The following should do it:
sapply(index(mydata),
       function(x) any(x >= as.Date(as.character(nberDates()[,1]), format="%Y%m%d") &
                       x <= as.Date(as.character(nberDates()[,2]), format="%Y%m%d")))
sapply walks the index and checks, for each date, whether it falls within any of the NBER intervals.
Note, however, that as written this repeats the as.Date conversion of the raw NBER data once for every element of mydata, so you may want to do the conversion once, save the resulting Date vectors, and run the check against those.
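Hoisting the conversion out of the loop might look like the following sketch (two recession windows are hard-coded as stand-ins for nberDates() rows so it runs without tis installed):

```r
# Stand-in for nberDates(): start/end columns in yyyymmdd form.
rd <- rbind(c(19691201, 19701101), c(19731101, 19750301))
starts <- as.Date(as.character(rd[, 1]), format = "%Y%m%d")
ends   <- as.Date(as.character(rd[, 2]), format = "%Y%m%d")

# With real data, `dates` would be index(mydata).
dates  <- as.Date(c("1966-01-01", "1970-01-01", "1971-01-01"))
recess <- as.integer(sapply(dates, function(d) any(d >= starts & d <= ends)))
recess
# [1] 0 1 0
```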

Here's another solution that uses some handy behavior in merge.xts.
library(xts)
library(tis) # for nberDates()
# make two xts objects filled with ones
# 1) recession start dates
# 2) recession end dates
rd <- apply(nberDates(),2,as.character)
ones <- rep(1,nrow(rd))
rStart <- xts(ones, as.Date(rd[,1],"%Y%m%d"))
rEnd <- xts(ones, as.Date(rd[,2],"%Y%m%d"))
# merge recession start/end date objects with empty xts
# object containing indexes from mydata, filled with zeros
# and take the cumulative sum (by column)
rx <- cumsum(merge(rStart,rEnd,xts(,index(mydata)),fill=0))
# the recess column = (cumulative # of recessions started at date D) -
# (cumulative # of recessions ended at date D)
mydata$recess <- (rx[,1]-rx[,2])[index(mydata)]
Alternatively, you could just use the USREC series from FREDII.
library(quantmod)
getSymbols("USREC",src="FRED")
mydata2 <- merge(mydata, USREC, all=FALSE)
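A further vectorised route (my own suggestion, not from the answers above): because recessions are sorted and non-overlapping, findInterval() can locate the most recent start date for each observation in one pass, and a single comparison against the matching end date finishes the job. A sketch with hard-coded dates:

```r
starts <- as.Date(c("1969-12-01", "1973-11-01"))
ends   <- as.Date(c("1970-11-01", "1975-03-01"))

dates <- as.Date(c("1966-01-01", "1970-01-01", "1972-01-01", "1974-06-01"))
# pos[i] = number of recession starts at or before dates[i];
# 0 means the date precedes the first recession.
pos    <- findInterval(as.numeric(dates), as.numeric(starts))
recess <- as.integer(pos > 0 & dates <= ends[pmax(pos, 1)])
recess
# [1] 0 1 0 1
```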

Tentatively, I'm using nested loops as follows. Still looking for a better answer!
mydata$recess <- 0
for (x in seq_len(nrow(mydata))) {
  for (y in seq_len(nrow(nberDates()))) {
    if (index(mydata)[x] >= as.Date(as.character(nberDates()[y,1]), format="%Y%m%d") &&
        index(mydata)[x] <= as.Date(as.character(nberDates()[y,2]), format="%Y%m%d")) {
      mydata$recess[x] <- 1
    }
  }
}

Related

How does R use square brackets to return values in a vector?

I came across a question like this: "retrieve all values less than or equal to 5 from a vector of sequence 1 through 9 having a length of 9". Now based on my knowledge so far, I did trial & error, then I finally executed the following code:
vec <- c(1:9) ## assigns to vec
lessThanOrEqualTo5 <- vec[vec <= 5]
lessThanOrEqualTo5
[1] 1 2 3 4 5
I know that the code vec <= 5 returns the following logical vector:
[1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
So my question is: how does R use this logical vector to return the values satisfying the condition, given that the call effectively becomes vec[c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)]?
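For what it's worth, the behaviour is easy to poke at directly: `[` with a logical index keeps the elements in the TRUE positions, and a shorter logical index is recycled along the vector:

```r
vec  <- 1:9
keep <- vec <= 5          # c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)
vec[keep]                 # elements in the TRUE positions: 1 2 3 4 5

# A shorter logical index is recycled, which is handy but a common bug source:
vec[c(TRUE, FALSE)]       # every other element: 1 3 5 7 9
```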

Count by row with variable criteria

I have a data.frame in which I want to perform a count by row versus a specified criterion. The part I cannot figure out is that I want a different count criterion for each row.
Say I have 10 rows, I want 10 different criteria for the 10 rows.
I tried: count.above <- rowSums(Data > rate), where rate is a vector holding the 10 criteria, but R used only the first as the criterion for the whole frame.
I imagine I could split my frame into 10 vectors and perform this task, but I thought there would be some simple way to do this without resorting to that.
Edit: this depends on whether you want to operate over rows or columns. See below:
This is a job for mapply and Reduce. Suppose you have a data frame along the lines of
df1 <- data.frame(a=1:10,b=2:11,c=3:12)
Let's say we want to count the rows where a>6, b>3 and c>5. This is done with mapply:
mapply(">",df1,c(6,3,5),SIMPLIFY=FALSE)
$a
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
$b
[1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
$c
[1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Now we use Reduce to find those which are all TRUE:
Reduce("&",mapply(">",df1,c(6,3,5),SIMPLIFY=FALSE))
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
Lastly, we use sum to add them all up:
sum(Reduce("&",mapply(">",df1,c(6,3,5),SIMPLIFY=FALSE)))
[1] 4
If you want a result for each row rather than a global aggregate, then apply is the function to use:
apply(df1,1,function(v) sum(v>c(6,3,5)))
[1] 0 0 1 2 2 2 3 3 3 3
Given the dummy data (from @zx8754's solution)
# dummy data
df1 <- data.frame(matrix(1:15, nrow = 3))
myRate <- c(7, 5, 1)
Solution using apply
Courtesy of @JDL
rowSums(apply(df1, 2, function(v) v > myRate))
Alternative solution using the Reduce pattern
Reduce(function(l, v) cbind(l[,1] + (l[,2] > myRate), l[,-2:-1]),
1:ncol(df1),
cbind(0, df1))

reporting identical values across columns in matrix

I have a matrix that I am performing a for loop over. I want to know if the values of position i in the for loop exist anywhere else in the matrix, and if so, report TRUE. The matrix looks like this
dim
x y
[1,] 5 1
[2,] 2 2
[3,] 5 1
[4,] 5 9
In this case, dim[1,] is the same as dim[3,] and should therefore report TRUE if I am in position i=1 in the for loop. I could write another for loop to deal with this, but I am sure there are more clever and possibly vectorized ways to do this.
We can use duplicated
duplicated(m1)|duplicated(m1, fromLast=TRUE)
#[1] TRUE FALSE TRUE FALSE
The duplicated(m1) call gives a logical vector of TRUE/FALSE values: a row that repeats an earlier row is marked TRUE.
duplicated(m1)
#[1] FALSE FALSE TRUE FALSE
In this case, the third row duplicates the first row. If we need both the first and third rows flagged, we can also run duplicated from the reverse side and combine the two results with | to make both positions TRUE, i.e.
duplicated(m1, fromLast=TRUE)
#[1] TRUE FALSE FALSE FALSE
duplicated(m1)|duplicated(m1, fromLast=TRUE)
#[1] TRUE FALSE TRUE FALSE
According to ?duplicated, the input data can be
x: a vector or a data frame or an array or ‘NULL’.
data
m1 <- cbind(x=c(5,2,5,5), y=c(1,2,1,9))

R: Choosing specific number of combinations from all possible combinations

Let's say we have the following dataset
set.seed(144)
dat <- matrix(rnorm(100), ncol=5)
The following function creates all possible combinations of columns and removes the first
(cols <- do.call(expand.grid, rep(list(c(F, T)), ncol(dat)))[-1,])
# Var1 Var2 Var3 Var4 Var5
# 2 TRUE FALSE FALSE FALSE FALSE
# 3 FALSE TRUE FALSE FALSE FALSE
# 4 TRUE TRUE FALSE FALSE FALSE
# ...
# 31 FALSE TRUE TRUE TRUE TRUE
# 32 TRUE TRUE TRUE TRUE TRUE
My question is: how can I calculate single, binary and triple combinations only?
Choosing the rows with no more than 3 TRUE values via cols[rowSums(cols) < 4L, ] works for this vector.
However, it gives following error for larger vectors mainly because of the error in expand.grid with long vectors:
Error in rep.int(seq_len(nx), rep.int(rep.fac, nx)) :
invalid 'times' value
In addition: Warning message:
In rep.fac * nx : NAs produced by integer overflow
Any suggestion that would allow me to compute single, binary and triple combinations only?
You could try either
cols[rowSums(cols) < 4L, ]
Or
cols[Reduce(`+`, cols) < 4L, ]
You can use this solution:
col.i <- do.call(c,lapply(1:3,combn,x=5,simplify=F))
# [[1]]
# [1] 1
#
# [[2]]
# [1] 2
#
# <...skipped...>
#
# [[24]]
# [1] 2 4 5
#
# [[25]]
# [1] 3 4 5
Here, col.i is a list every element of which contains column indices.
How it works: combn generates all combinations of the numbers 1 to 5 (requested by x=5) taken m at a time (simplify=FALSE keeps the result as a list). lapply loops m over 1 to 3 and returns a list of lists, and do.call(c, ...) flattens that into a plain list.
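A quick sanity check of that construction: the list should hold choose(5,1) + choose(5,2) + choose(5,3) = 25 index vectors, with the single, pair, and triple blocks starting at positions 1, 6, and 16:

```r
col.i <- do.call(c, lapply(1:3, combn, x = 5, simplify = FALSE))
length(col.i)                  # 25
lengths(col.i)[c(1, 6, 16)]    # 1 2 3
```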
You can use col.i to get certain columns from dat, e.g. dat[, col.i[[1]], drop=F]. Here 1 is the index of a column combination, so any number from 1 to 25 works; drop=F ensures that picking a single column from dat does not simplify the result to a vector, which might cause unexpected program behavior. Another option is to use lapply, e.g.
lapply(col.i, function(cols) dat[,cols])
which will return a list of data frames each containing a certain subset of columns of dat.
In case you want to get column indices as a boolean matrix, you can use:
col.b <- t(sapply(col.i,function(z) 1:5 %in% z))
# [,1] [,2] [,3] [,4] [,5]
# [1,] TRUE FALSE FALSE FALSE FALSE
# [2,] FALSE TRUE FALSE FALSE FALSE
# [3,] FALSE FALSE TRUE FALSE FALSE
# ...
[UPDATE]
More efficient realization:
library("gRbase")
coli <- function(x=5, m=3) {
  col.i <- do.call(c, lapply(1:m, combnPrim, x=x, simplify=F))
  z <- lapply(seq_along(col.i), function(i) x*(i-1) + col.i[[i]])
  v.b <- rep(F, x*length(col.i))
  v.b[unlist(z)] <- TRUE
  matrix(v.b, ncol=x, byrow=TRUE)
}
coli(70,5) # takes about 30 sec on my desktop

Measure the max value of all previous values in a data frame

I am trying to make a function that will determine whether a value in a column of a data frame is a new high. So for example if I had the following data:
x <- rnorm(10,100,sd=5)
x <- data.frame(x)
How can I return TRUE or FALSE in a new column that takes into account only the previous values? The resulting table would look something like:
x new.max
1 102.42810 NA
2 109.22762 TRUE
3 101.97970 FALSE
4 101.49303 FALSE
5 93.30595 FALSE
6 96.77199 FALSE
7 110.96441 TRUE
8 96.27485 FALSE
9 101.77163 FALSE
10 100.78992 FALSE
If I try
x$new.max <- ifelse ( x$x == max(x$x) , TRUE, FALSE )
The resulting table is below, as it calculates the maximum value of the entire column instead of a subset of all the previous values.
x new.max
1 102.42810 FALSE
2 109.22762 FALSE
3 101.97970 FALSE
4 101.49303 FALSE
5 93.30595 FALSE
6 96.77199 FALSE
7 110.96441 TRUE
8 96.27485 FALSE
9 101.77163 FALSE
10 100.78992 FALSE
There is a built-in function that computes the running maximum, called cummax().
diff(cummax(x$x)) will be non-zero at positions where a new maximum is achieved (there's no entry for the first element of x$x, which is always a new maximum).
Putting the pieces together:
x$new.max <- c(TRUE, diff(cummax(x$x)) > 0)
I've set the first element to TRUE, but it could just as well be NA.
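Putting that together on a small deterministic vector (fixed values in place of rnorm so the output is reproducible):

```r
x <- data.frame(x = c(102, 109, 101, 93, 110, 96))
# cummax gives the running maximum; diff > 0 marks where it increased.
x$new.max <- c(TRUE, diff(cummax(x$x)) > 0)
x$new.max
# [1]  TRUE  TRUE FALSE FALSE  TRUE FALSE
```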