Measure the max value of all previous values in a data frame - r

I am trying to write a function that determines whether each value in a column of a data frame is a new high. For example, suppose I have the following data:
x <- rnorm(10,100,sd=5)
x <- data.frame(x)
How can I return TRUE or FALSE in a new column, taking into account only the previous values? The resulting table would look something like:
x new.max
1 102.42810 NA
2 109.22762 TRUE
3 101.97970 FALSE
4 101.49303 FALSE
5 93.30595 FALSE
6 96.77199 FALSE
7 110.96441 TRUE
8 96.27485 FALSE
9 101.77163 FALSE
10 100.78992 FALSE
If I try
x$new.max <- ifelse ( x$x == max(x$x) , TRUE, FALSE )
The resulting table is below. It compares each value against the maximum of the entire column instead of the maximum of the previous values only.
x new.max
1 102.42810 FALSE
2 109.22762 FALSE
3 101.97970 FALSE
4 101.49303 FALSE
5 93.30595 FALSE
6 96.77199 FALSE
7 110.96441 TRUE
8 96.27485 FALSE
9 101.77163 FALSE
10 100.78992 FALSE

R has a built-in function, cummax(), that computes the running maximum.
diff(cummax(x$x)) is positive exactly at the positions where a new maximum is reached. It has no entry for the first element of the column, which is trivially a new maximum.
Putting the pieces together (using the column x$x, since x is a data frame in the question):
x$new.max <- c(TRUE, diff(cummax(x$x)) > 0)
I've set the first element to TRUE, but it could just as well be NA.
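As a quick check, here is a sketch that applies the running-maximum idea to the ten values from the question's example table. The values are hard-coded so the result is reproducible, running_max is only an illustrative name, and NA is used for the first row to match the desired output.

```r
# Values copied from the example table in the question
x <- data.frame(x = c(102.42810, 109.22762, 101.97970, 101.49303, 93.30595,
                      96.77199, 110.96441, 96.27485, 101.77163, 100.78992))

# Running maximum of the column (never decreases)
running_max <- cummax(x$x)

# A new high occurs wherever the running maximum increases;
# the first row has no previous values, so it is marked NA here
x$new.max <- c(NA, diff(running_max) > 0)

print(x)
```

Rows 2 and 7 come out TRUE, as in the table the question asks for.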

Related

How does R use square brackets to return values in a vector?

I came across a question like this: "retrieve all values less than or equal to 5 from a vector of sequence 1 through 9 having a length of 9". Now based on my knowledge so far, I did trial & error, then I finally executed the following code:
vec <- c(1:9) ## assigns to vec
lessThanOrEqualTo5 <- vec[vec <= 5]
lessThanOrEqualTo5
[1] 1 2 3 4 5
I know that the code vec <= 5 returns the following logical vector:
[1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
So my question is: how does R use these logical values to return the elements that satisfy the condition, given that the code effectively has a structure like vec[TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE]?
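A minimal sketch of what happens, in base R: a logical index keeps the elements at the positions where the index is TRUE and drops those where it is FALSE, so it gives the same result as indexing by the positions that which() reports. The names keep, by_logical and by_position are made up for illustration.

```r
vec <- 1:9

# Logical vector with one entry per element of vec
keep <- vec <= 5

# Logical indexing: element i is kept when keep[i] is TRUE
by_logical <- vec[keep]

# Equivalent positional indexing: which() returns the TRUE positions (1 to 5)
by_position <- vec[which(keep)]

print(by_logical)
print(identical(by_logical, by_position))
```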

R: make 2 subset vectors so that values are different index-wise

I want to make 2 vectors subsetting from the same data, with replace=TRUE.
Even if both vectors can contain the same values, they cannot be the same at the same index position.
For example:
> set.seed(1)
> a <- sample(15, 10, replace=T)
> b <- sample(15, 10, replace=T)
> a
[1] 4 6 9 14 4 14 15 10 10 1
> b
[1] 4 3 11 6 12 8 11 15 6 12
> a==b
[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
In this case, vectors a and b contain the same value at index 1 (value==4), which is wrong for my purposes.
Is there an easy way to correct this?
And can it be done on the subset step?
Or should I go through a loop checking element by element and if the values are identical, make another selection for b[i] and check again if it's not identical ad infinitum?
many thanks!
My idea is, instead of getting 2 samples of length 10 with replacement, to get 10 samples of length 2 without replacement:
library(purrr)
l <- rerun(10,sample(15,2,replace=FALSE))
Each element of l is an integer vector of length two. Those two integers are guaranteed to be different because we specified replace=FALSE in sample.
# extract the first element of each pair in l; this is a
a <- map_int(l,`[[`,1)
# extract the second element of each pair in l; this is b
b <- map_int(l,`[[`,2)
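The same idea can be sketched in base R without purrr, assuming replicate() is an acceptable stand-in for rerun(). The name pairs is illustrative only.

```r
set.seed(1)

# Draw 10 pairs; each pair is sampled without replacement,
# so its two values always differ. Result is a 2 x 10 matrix.
pairs <- replicate(10, sample(15, 2, replace = FALSE))

a <- pairs[1, ]  # first value of every pair
b <- pairs[2, ]  # second value of every pair

print(all(a != b))
```

Values can still repeat within a and within b, because the ten pairs are drawn independently of each other.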
How about a two-stage sampling process?
set.seed(1)
x <- 1:15
a <- sample(x, 10, replace = TRUE)
b <- sapply(a, function(v) sample(x[x != v], 1))
a != b
#[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
We first draw samples a; then for every sample from a, we draw a new sample from the set of values x excluding the current sample from a. Since we're doing this one-sample-at-a-time, we automatically allow for sampling with replacement.

Count by row with variable criteria

I have a data.frame in which I want to perform a count by row versus a specified criterion. The part I cannot figure out is that I want a different count criterion for each row.
Say I have 10 rows, I want 10 different criteria for the 10 rows.
I tried: count.above <- rowSums(Data > rate), where rate is a vector with the 10 criteria, but R used only the first as the criterion for the whole frame.
I imagine I could split my frame into 10 vectors and perform this task, but I thought there would be some simple way to do this without resorting to that.
Edit: this depends whether you want to operate over rows or columns. See below:
This is a job for mapply and Reduce. Suppose you have a data frame along the lines of
df1 <- data.frame(a=1:10,b=2:11,c=3:12)
Let's say we want to count the rows where a>6, b>3 and c>5. This is done with mapply:
mapply(">",df1,c(6,3,5),SIMPLIFY=FALSE)
$a
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
$b
[1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
$c
[1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Now we use Reduce to find those which are all TRUE:
Reduce("&",mapply(">",df1,c(6,3,5),SIMPLIFY=FALSE))
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
Lastly, we use sum to add them all up:
sum(Reduce("&",mapply(">",df1,c(6,3,5),SIMPLIFY=FALSE)))
[1] 4
If you want a result for each row rather than a global aggregate, then apply is the function to use:
apply(df1,1,function(v) sum(v>c(6,3,5)))
[1] 0 0 1 2 2 2 3 3 3 3
Given the dummy data (from @zx8754's solution)
# dummy data
df1 <- data.frame(matrix(1:15, nrow = 3))
myRate <- c(7, 5, 1)
Solution using apply
Courtesy of @JDL
rowSums(apply(df1, 2, function(v) v > myRate))
Alternative solution using the Reduce pattern
Reduce(function(l, v) cbind(l[,1] + (l[,2] > myRate), l[,-2:-1]),
1:ncol(df1),
cbind(0, df1))
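For a count with one threshold per row, a possibly simpler sketch relies on R recycling a vector down the columns of a matrix: when myRate has one entry per row, as.matrix(df1) > myRate compares every element with its own row's threshold. This assumes all columns are numeric and that length(myRate) equals nrow(df1). The names counts and check are illustrative.

```r
# Same dummy data as above
df1 <- data.frame(matrix(1:15, nrow = 3))
myRate <- c(7, 5, 1)

# Column-major recycling lines myRate[i] up with row i in every column
counts <- rowSums(as.matrix(df1) > myRate)

# Cross-check with an explicit row-by-row count
check <- sapply(seq_len(nrow(df1)), function(i) sum(df1[i, ] > myRate[i]))

print(counts)
print(all(counts == check))
```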

Writing a boolean matrix to a string

I have a lower triangular matrix containing TRUE/FALSE values. The matrix is created from a pairwise.t.test and a comparison to the acceptable p-value (p<0.05 => TRUE).
I am trying to output the TRUE entries of the matrix as a string in a specific format, without using a mess of if conditions. I was thinking of matrix products or sums to achieve it, but there may be no elegant solution. If you think it's impossible, I would like to know that as well, so I don't bang my head against the wall forever.
The formatting:
If a pair of values (ex:1,2) are TRUE, we output it as "1≠2".
If a value is TRUE with multiple values (ex: 1 with 2,3), we output it as "1≠2,3".
If a value is TRUE with everyone (ex: 1 with 2,3,4), we use the word "all" => output is "1≠all"
If 2 pairs (ex:1,2 and 3,4) are TRUE, we separate them with a space. output is "1≠2 3≠4"
If everything is TRUE, we output "all≠"
As of now, I am doing it manually so I don't really have any code to show. I am open to any ideas :)
Examples:
1 2 3
2 TRUE NA NA
3 TRUE TRUE NA
4 TRUE TRUE FALSE
The string for this matrix would be "1,2≠all" because 1 and 2 are true with everyone.
1 2 3
2 FALSE NA NA
3 TRUE TRUE NA
4 TRUE TRUE FALSE
The string for this matrix would be "1,2≠3,4" because 1 is true with 3,4 and 2 is true with 3,4.
Test matrices:
mTest = matrix(c(T,T,T,NA,F,T,NA,NA,F),nrow=3,ncol=3) # "1≠all 2≠3"
row.names(mTest) <- c(2,3,4) ; colnames(mTest) <- c(1,2,3)
mTest[] = c(T,F,T,NA,F,T,NA,NA,F) # "1≠2 1,2≠4"
mTest[] = c(T,T,T,NA,T,F,NA,NA,T) # "1,3≠all"
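One possible starting point, sketched here under the assumption that listing the individual TRUE pairs is a useful first step: which() with arr.ind = TRUE returns the row and column positions of the TRUE cells and skips NA cells. This does not implement the grouping rules above (merging into "1≠2,3", "all", and so on). The names idx and pairs are illustrative.

```r
# First test matrix from the question, with its row and column labels
mTest <- matrix(c(TRUE, TRUE, TRUE, NA, FALSE, TRUE, NA, NA, FALSE),
                nrow = 3, ncol = 3,
                dimnames = list(c("2", "3", "4"), c("1", "2", "3")))

# Positions of TRUE cells; column 1 of idx is the row index,
# column 2 is the column index. NA cells are ignored by which().
idx <- which(mTest, arr.ind = TRUE)

# One "column≠row" string per TRUE cell, in column-major order
pairs <- paste0(colnames(mTest)[idx[, 2]], "≠", rownames(mTest)[idx[, 1]])

print(pairs)
```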

Change nberDates() into a time series in R for subsetting

nberDates() in the tis package gives a list of recession start and end dates.
What's the slickest, shortest way to turn this into a set of dummies for subsetting an existing time series?
So, nberDates itself yields...
> nberDates()
Start End
[1,] 18570701 18581231
[2,] 18601101 18610630
[3,] 18650501 18671231
[4,] 18690701 18701231
[5,] 18731101 18790331
[6,] 18820401 18850531
and str(nberDates()) says the type is "Named num."
I have another time series object in xts which currently looks like this...
> head(mydata)
value
1966-01-01 15
1967-01-01 16
1968-01-01 20
1969-01-01 21
1970-01-01 18
1971-01-01 12
I'd like to have a second variable, recess, that is 1 during recessions:
> head(mydata)
value recess
1966-01-01 15 0
1967-01-01 16 0
1968-01-01 20 0
1969-01-01 21 0
1970-01-01 18 1
1971-01-01 12 0
(My goal is that I'd like to be able to compare values in recessions against values out of recessions.)
The clunky thing I'm trying that isn't working is this...
((index(mydata) > as.Date(as.character(nberDates()[,1]),format="%Y%m%d")) & (index(mydata) < as.Date(as.character(nberDates()[,2]),format="%Y%m%d")))
But this yields...
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Warning messages:
1: In `>.default`(index(mydata), as.Date(as.character(nberDates()[, :
longer object length is not a multiple of shorter object length
2: In `<.default`(index(mydata), as.Date(as.character(nberDates()[, :
longer object length is not a multiple of shorter object length
I know I can solve this with a clunky for-loop, but that always suggests to me I'm doing R wrong.
Any suggestions?
The following should do it:
sapply(index(mydata), function(x) any(((x >= as.Date(as.character(nberDates()[,1]),format="%Y%m%d") & (x <= as.Date(as.character(nberDates()[,2]),format="%Y%m%d"))))))
sapply basically goes through the vector and checks for each element if it falls within one of the NBER intervals.
Note, however, that as written this converts the raw NBER data to dates (as.Date) once for every element of mydata, so you may want to do the conversion once, save the result in a temporary object, and then run the check against that.
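A sketch of that "convert once" variant in base R, so it runs without tis or xts. It assumes a two-column numeric matrix shaped like the nberDates() output (the two rows are copied from the printout in the question) and uses made-up dates standing in for index(mydata). The names nber, starts, ends, idx and recess are illustrative.

```r
# Stand-in for nberDates(): two rows copied from the question's printout
nber <- cbind(Start = c(18570701, 18601101), End = c(18581231, 18610630))

# Convert the start and end columns to Date once, outside the loop
starts <- as.Date(as.character(nber[, "Start"]), format = "%Y%m%d")
ends   <- as.Date(as.character(nber[, "End"]),   format = "%Y%m%d")

# Hypothetical dates standing in for index(mydata)
idx <- as.Date(c("1858-01-01", "1859-01-01", "1861-01-01"))

# TRUE when a date falls inside any [start, end] interval
recess <- vapply(seq_along(idx),
                 function(i) any(idx[i] >= starts & idx[i] <= ends),
                 logical(1))

print(data.frame(date = idx, recess = recess))
```

With the real data, starts and ends would come from nberDates() and idx from index(mydata). as.integer(recess) gives the 0/1 dummy.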
Here's another solution that uses some handy behavior in merge.xts.
library(xts)
library(tis) # for nberDates()
# make two xts objects filled with ones
# 1) recession start dates
# 2) recession end dates
rd <- apply(nberDates(),2,as.character)
ones <- rep(1,nrow(rd))
rStart <- xts(ones, as.Date(rd[,1],"%Y%m%d"))
rEnd <- xts(ones, as.Date(rd[,2],"%Y%m%d"))
# merge recession start/end date objects with empty xts
# object containing indexes from mydata, filled with zeros
# and take the cumulative sum (by column)
rx <- cumsum(merge(rStart,rEnd,xts(,index(mydata)),fill=0))
# the recess column = (cumulative # of recessions started at date D) -
# (cumulative # of recessions ended at date D)
mydata$recess <- (rx[,1]-rx[,2])[index(mydata)]
Alternatively, you could just use the USREC series from FREDII.
library(quantmod)
getSymbols("USREC",src="FRED")
mydata2 <- merge(mydata, USREC, all=FALSE)
Tentatively, I'm using nested loops as follows. Still looking for a better answer!
mydata$recess <- 0
for (x in seq(1,dim(mydata)[1])){
  for (y in seq(1,dim(nberDates())[1])){
    if (index(mydata)[x] >= as.Date(as.character(nberDates()[y,1]),format="%Y%m%d") &
        index(mydata)[x] <= as.Date(as.character(nberDates()[y,2]),format="%Y%m%d")){
      mydata$recess[x] <- 1
    }
  }
}
