Matrix of booleans based on quantile in R - r

I have a matrix whose columns are stock returns and whose rows are dates, which looks like this:
ES1.Index VG1.Index TY1.Comdty RX1.Comdty GC1.Comdty
1999-01-05 0.009828476 0.012405717 -0.003058466 -0.0003480884 -0.001723317
1999-01-06 0.021310816 0.027030061 0.001883240 0.0017392317 0.002425398
1999-01-07 -0.001952962 -0.016130850 -0.002826191 -0.0011591516 0.013425435
1999-01-08 0.007989946 -0.004071275 -0.005913678 0.0016224363 -0.001363540
I'd like to have a function that returns a matrix with the same column-names and row-names filled with 1s and 0s based on whether each observation within each row-vector belongs or not to some group within two given quantiles.
For example, I may want to divide each row vector into 3 groups and have 1s for all observations falling within the 2nd group and 0s elsewhere. The result being something looking like:
ES1.Index VG1.Index TY1.Comdty RX1.Comdty GC1.Comdty
1999-01-05 0 0 1 1 0
1999-01-06 1 0 0 1 0
1999-01-07 0 1 0 0 1
1999-01-08 0 0 1 0 1
(The 1s and 0s in my example are meant to be just a visual outcome, the numbers aren't accurate)
Which would be the least verbose way to get to that?

Taking the intermediate steps of finding the quantiles and testing against them is not necessary. Only the ordinal properties of each vector matter.
# set bounds
lb = 1/3
ub = 2/3
# find ranks
p = t(apply(m,1,rank))/ncol(m)
# test ranks against bounds
+( p >= lb & p <= ub )
ES1.Index VG1.Index TY1.Comdty RX1.Comdty GC1.Comdty
1999-01-05 0 0 0 1 1
1999-01-06 0 0 1 0 1
1999-01-07 1 0 1 0 0
1999-01-08 0 1 0 0 1
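The rank-based test above can be wrapped in a small helper. This is only a sketch: the function name quantile_flags is hypothetical, and the matrix m is rebuilt here from the four rows shown in the question so that the example runs on its own.

```r
# Sample matrix rebuilt from the question (the name m is an assumption)
m <- matrix(
  c( 0.009828476,  0.012405717, -0.003058466, -0.0003480884, -0.001723317,
     0.021310816,  0.027030061,  0.001883240,  0.0017392317,  0.002425398,
    -0.001952962, -0.016130850, -0.002826191, -0.0011591516,  0.013425435,
     0.007989946, -0.004071275, -0.005913678,  0.0016224363, -0.001363540),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("1999-01-05", "1999-01-06", "1999-01-07", "1999-01-08"),
    c("ES1.Index", "VG1.Index", "TY1.Comdty", "RX1.Comdty", "GC1.Comdty")))

# Hypothetical helper: 1 where the within-row rank fraction lies in [lb, ub]
quantile_flags <- function(m, lb, ub) {
  p <- t(apply(m, 1, rank)) / ncol(m)  # rank fractions, computed row by row
  +(p >= lb & p <= ub)                 # unary + turns TRUE/FALSE into 1/0
}

flags <- quantile_flags(m, 1/3, 2/3)
```

Because apply returns its per-row results as columns, the t() call is what restores the original orientation and keeps the row and column names in place.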

We can use apply with MARGIN=1 to loop over the rows, cut each row vector with breaks given by its quantiles, and transpose the result so that rows again correspond to dates.
t(apply(df1, 1, function(x) {
x1 <- cut(x, breaks= quantile(x, seq(0, 1,1/3)))
+(levels(x1)[2]== x1 & !is.na(x1))}))

Related

Random value in column in R

Does anyone have an idea how to generate a column of random values in which exactly one randomly chosen row is marked with the number "1"? All other rows should be "0".
I need a function for this in R.
Here is what I need, shown in pictures:
df <- data.frame(subject = 1, choice = 0, price75 = c(0,0,0,1,1,1,0,1))
This command will update the choice column to contain a single random row with value of 1 each time it is called. All other rows values in the choice column are set to 0.
df$choice <- +(seq_along(df$choice) == sample(nrow(df), 1))
integer(nrow(DF)) creates a vector of zeros, and [<- places a 1 at the position drawn by sample(nrow(DF), 1L).
DF <- data.frame(subject=1, choice="", price75=c(0,0,0,1,1,1,0,1))
DF$choice <- `[<-`(integer(nrow(DF)), sample(nrow(DF), 1L), 1L)
DF
# subject choice price75
#1 1 0 0
#2 1 0 0
#3 1 0 0
#4 1 1 1
#5 1 0 1
#6 1 0 1
#7 1 0 0
#8 1 0 1
> x <- rep(0, 10)
> x[sample(1:10, 1)] <- 1
> x
[1] 0 0 0 0 0 0 0 1 0 0
There are many ways to set a random value in a row or column in R:
df<-data.frame(x=rep(0,10)) #make dataframe df, with column x, filled with 10 zeros.
set.seed(2022) #set a random seed - this is for repeatability
#two base methods for sampling:
#sample.int(n=10, size=1) # sample an integer from 1 to 10, sample size of 1
#sample(x=1:10, size=1) # sample from 1 to 10, sample size of 1
df$x[sample.int(n=10, size=1)] <- 1 # randomly selecting one of the ten rows, and replacing the value with 1
df
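The answers above all follow the same pattern: start from a vector of zeros and overwrite one randomly chosen position. A minimal sketch of that pattern as a reusable function, assuming a hypothetical name random_one_hot and reusing the data frame from the first answer:

```r
# Hypothetical helper: integer vector of length n with a single 1 at a random position
random_one_hot <- function(n) {
  out <- integer(n)            # n zeros
  out[sample.int(n, 1)] <- 1L  # pick one position uniformly at random
  out
}

set.seed(2022)  # only for repeatability
df <- data.frame(subject = 1, choice = 0, price75 = c(0, 0, 0, 1, 1, 1, 0, 1))
df$choice <- random_one_hot(nrow(df))
```

Each call draws a new position, so the seed is needed only when the same row should be marked on every run.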

How to do queries for counts of matrix elements with values in given range

I'm working on a project that is looking at regrowth of trees after a deforestation event. To simplify the data set for this question, I have a matrix (converted from data frame), which has 10 columns corresponding to years 2001-2010.
-1 indicates a change point in the data, when a previously forested plot was deforested. 1 indicates when a previously deforested region became forested. 0s indicate no change in state.
I found this link which I think does what I need to do, except in python/c++. Since I did the rest of my analyses in R, I want to stick with it.
So I was trying to translate some of the code to R, but I've been having problems.
This is my sample data set. One alternative I considered: if I could identify the index of the (-1) and then the index of the 1, I could subtract the two indices to get the difference (and then subtract 1, because the difference also counts the year of the 1 itself).
# Example data
head(tcc_change)
id 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
1 1 0 0 0 0 0 -1 0 0 1 0
2 2 0 0 0 -1 0 0 1 0 0 0
3 3 0 0 0 -1 0 0 0 1 0 0
4 4 0 -1 0 0 0 0 1 0 0 0
5 5 0 0 0 1 0 0 -1 1 0 0
# Indexing attempt
tcc_change$loss_init <- apply(tcc_change, 1, function(x) match(-1, x[1:10], nomatch = 99))
tcc_change$gain <- apply(tcc_change, 1, function(x) match(1, x[1:10], nomatch=99))
This method has a lot of problems though. What if there's a 1 before a (-1), for example. I'd like to figure out a better way to do this analysis, similar to the logical structure in the link above, but I don't know how to do this in R.
Ideally I'd like to identify points where there was deforestation (-1) and then regrowth (1) and then count the zeroes in between. The number of zeroes in between would be posted to a new column. This would give me a better idea of how long it takes for a plot to become forested after a deforestation event. If there are no zeroes in between (like row 5), I would want the code to output '0'.
Sorry, my function may only handle the simple case. Hope that helps.
First, your code has an issue: when you search for the index, you include the id column as well (in x[1:10]). If you want to exclude it, you can use x[-1] to drop the first column; the index will then count from the second column onwards.
tcc_change$loss_init <- apply(tcc_change, 1, function(x) match(-1, x[1:10], nomatch = 99))
tcc_change$gain <- apply(tcc_change, 1, function(x) match(1, x[1:10], nomatch=99))
I adjusted your approach: first get the index of the -1, then use match again to search for the index of the first 1 after that -1. Once that is found, subtracting 1 gives the number of zeroes in between:
get_interval = function(x){
init = match(-1, x[-1])
interval = match(1, x[-(1:(init+1))]) - 1
return(interval)
}
> apply(tcc_change, 1, get_interval)
[1] 2 2 3 4 0
Hope that helps.
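As a hedged extension of get_interval, the sketch below returns NA when a row has no -1, or no 1 after the first -1, instead of failing. The name zeros_between is hypothetical, and the year matrix is rebuilt from the sample rows in the question without the id column.

```r
# Hypothetical helper: x holds the yearly change codes for one plot (no id column)
zeros_between <- function(x) {
  loss <- match(-1, x)                      # first deforestation event
  if (is.na(loss) || loss == length(x)) return(NA_integer_)
  after <- x[(loss + 1):length(x)]          # years following the loss
  gain <- match(1, after)                   # first regrowth after the loss
  if (is.na(gain)) return(NA_integer_)
  gain - 1L                                 # number of years between the two events
}

# Sample rows from the question, years 2001-2010
years <- rbind(
  c(0,  0, 0,  0, 0, -1,  0, 0, 1, 0),
  c(0,  0, 0, -1, 0,  0,  1, 0, 0, 0),
  c(0,  0, 0, -1, 0,  0,  0, 1, 0, 0),
  c(0, -1, 0,  0, 0,  0,  1, 0, 0, 0),
  c(0,  0, 0,  1, 0,  0, -1, 1, 0, 0))
colnames(years) <- 2001:2010

gaps <- apply(years, 1, zeros_between)
```

Like the answer above, this only looks at the first -1 in each row; plots with several loss events would need a different treatment.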

Add index to runs of positive or negative values of certain length

I have a data frame that contains 100,000 rows. It looks like this:
Value
1
2
-1
-2
0
3
4
-1
3
I want to create an extra column (column B), which consists of 0s and 1s.
It is 0 by default, but once there are 5 consecutive data points that are all positive OR all negative, it should be 1. This applies only to consecutive values (e.g. when a positive run is interrupted by a negative number, the count starts again).
Value B
1 0
2 0
1 0
2 0
2 1
3 1
4 1
-1 0
3 0
I tried different loops, but they didn't work. I also tried converting the whole data frame to a list (and looping over the list), unfortunately without success.
Here's an approach that uses the rollmean function from the zoo package.
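library(zoo)
set.seed(1000)
df = data.frame(Value = sample(-9:9, 1000, replace = TRUE))
# sign of each value: -1, 0 or 1 (named s to avoid masking base::sign)
s = sign(df$Value)
# right-aligned rolling mean over 5 values; the first 4 positions are filled with 0
rolling = rollmean(s, k = 5, fill = 0, align = "right")
df$B = as.numeric(abs(rolling) == 1)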
I generated 1000 values with positive and negative sets.
Extract the sign of the values - this will be -1 for negative, 1 for positive and 0 for 0
Calculate the right aligned rolling mean of 5 values (it will average x[1:5], x[2:6], ...). This will be 1 or -1 if all the values in a row are positive or negative (respectively)
Take the absolute value and store the comparison against 1. This is a logical vector that turns into 0s and 1s based on your conditions.
Note - there's no need for loops. This can all be vectorised (once we have the rolling mean calculated).
This will work. Not the most efficient way to do it but the logic is pretty transparent -- just check if there's only one unique sign (i.e. +, -, or 0) for each sequence of five adjacent rows:
dat <- data.frame(Value=c(1,2,1,2,2,3,4,-1,3))
dat$new_col <- NA
dat$new_col[1:4] <- 0
for (x in 5:nrow(dat)){
if (length(unique(sign(dat$Value[(x-4):x])))==1){
dat$new_col[x] <- 1
} else {
dat$new_col[x] <- 0
}
}
Use the cumsum(...diff(...) <condition>) idiom to create a grouping variable, and ave to calculate the indices within each group.
d$B2 <- ave(d$Value, cumsum(c(0, diff(sign(d$Value)) != 0)), FUN = function(x){
as.integer(seq_along(x) > 4)})
# Value B B2
# 1 1 0 0
# 2 2 0 0
# 3 1 0 0
# 4 2 0 0
# 5 2 1 1
# 6 3 1 1
# 7 4 1 1
# 8 -1 0 0
# 9 3 0 0
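The same run-based idea can also be sketched with rle from base R. The function name flag_runs is hypothetical, and one assumption differs from the loop answer above: runs of zeros are never flagged, since the question asks only about positive or negative runs.

```r
# Hypothetical helper: 1 from the k-th element of each positive or negative run onwards
flag_runs <- function(x, k = 5) {
  r <- rle(sign(x))                     # runs of identical sign
  pos <- sequence(r$lengths)            # position of each element within its run
  run_sign <- rep(r$values, r$lengths)  # sign of the run each element belongs to
  as.integer(pos >= k & run_sign != 0)  # assumption: runs of zeros are not flagged
}

b <- flag_runs(c(1, 2, 1, 2, 2, 3, 4, -1, 3))
```

On the example vector from the question this flags the 5th to 7th elements, matching the desired column B.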

R- Find Unique Permutations of Values

I am hoping to create all possible permutations of a vector containing two different values, in which I control the proportion of each of the values.
For example, if I have a vector of length three and I want all possible combinations containing a single 1, my desired output is a list looking like this:
list.1 <- list(c(1,0,0), c(0,1,0), c(0,0,1))
In contrast, if I want all possible combinations containing three 1s, my desired output is a list looking like this:
list.3 <- list(c(1,1,1))
To put it another way, the pattern of the 1 and 0 values matter, but all 1s should be treated as identical to all other 1s.
Based on searching here and elsewhere, I've tried several approaches:
expand.grid(0:1, 0:1, 0:1) # this includes all possible combinations of 1, 2, or 3 ones
permn(c(0,1,1)) # this does not treat the ones as identical (e.g. it produces (0,1,1) twice)
unique(permn(c(0,1,1))) # this does the job!
So, using the function permn from the package combinat seems promising. However, when I scale this up to my actual problem (a vector of length 20, with 50% 1s and 50% 0s), I run into problems:
unique(permn(c(rep(1,10), rep(0, 10))))
# returns the error:
Error in vector("list", gamma(n + 1)) :
vector size specified is too large
My understanding is that this is happening because, in the call to permn, it makes a list containing all possible permutations, even though many of them are identical, and this list is too large for R to handle.
Does anyone have a suggestion for how to work around this?
Sorry if this has been answered previously - there are many, many SO questions containing similar language but different problems, and I have not been able to find a solution which meets my needs!
It should not be a dealbreaker that expand.grid includes all permutations. Just add a subset after:
combinations <- function(size, choose) {
d <- do.call("expand.grid", rep(list(0:1), size))
d[rowSums(d) == choose,]
}
combinations(size=10, choose=3)
# Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10
# 8 1 1 1 0 0 0 0 0 0 0
# 12 1 1 0 1 0 0 0 0 0 0
# 14 1 0 1 1 0 0 0 0 0 0
# 15 0 1 1 1 0 0 0 0 0 0
# 20 1 1 0 0 1 0 0 0 0 0
# 22 1 0 1 0 1 0 0 0 0 0
...
The problem is indeed that you are initially computing all factorial(20) (~10^18) permutations, which will not fit in your memory.
What you are looking for is an efficient way to compute multiset permutations. The multicool package can do this:
library(multicool)
res <- allPerm(initMC(c(rep(0,10),rep(1,10) )))
This computation takes about two minutes on my laptop, but is definitely feasible.
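Another way to avoid enumerating all permutations is to choose the positions of the 1s directly with combn from base R, which yields exactly choose(n, k) arrangements. This is a sketch, not taken from the answers above; the name binary_perms is hypothetical and it assumes 1 <= k <= n.

```r
# Hypothetical helper: every 0/1 vector of length n with exactly k ones, one per row
binary_perms <- function(n, k) {
  idx <- combn(n, k)                              # k x choose(n, k) matrix of positions
  out <- matrix(0L, nrow = ncol(idx), ncol = n)   # one row per arrangement
  rows <- rep(seq_len(ncol(idx)), each = k)       # row index for each chosen position
  out[cbind(rows, as.vector(idx))] <- 1L          # set the chosen positions to 1
  out
}

small <- binary_perms(3, 1)
big <- binary_perms(20, 10)
```

The result is a matrix rather than a list; if a list is needed, something like asplit(small, 1) or split(small, row(small)) could convert it.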

Working with matrices in r

I'm working on code to construct an option pricing matrix. What I have at the moment is the values along the diagonal part of the matrix. Currently I'm working in a matrix with 4 rows and 4 columns. What I'm attempting to do is to use the values in the diagonal part of the matrix to give values in the lower triangle of the matrix. So for my matrix Omat, Omat[1,1]+Omat[2,2] will give a value for [2,1], Omat[2,2]+Omat[3,3] will give a value for [3,2]. Then using these created values, Omat[2,1]+Omat[3,2] will give a value for [3,1].
My attempt:
Omat = diag(2, 4, 4)
Omat[j+i,j] <- Omat[i-1,j]+Omat[i,j+1]
Any ideas on how one could go about this?
What I currently have, a 4 row by 4 col matrix:
Omat
# 2 0 0 0
# 0 2 0 0
# 0 0 2 0
# 0 0 0 2
What I've been attempting to create, a 4 row by 4 col matrix:
0 0 0 0
4 0 0 0
8 4 0 0
16 8 4 0
You could try calculating successive diagonals underneath the main diagonal. Code could look like:
Omat = diag(2,4)
for(i in 1:(nrow(Omat)-1)) {
for( j in (i+1):nrow(Omat)) {
Omat[j,j-i] <- Omat[j,j-i+1] + Omat[j-1,j-i]
}
}
diag(Omat) <- 0
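The diagonal-by-diagonal loop above can be wrapped in a function that accepts any vector of diagonal values. This is a sketch under that assumption; the name fill_lower is hypothetical.

```r
# Hypothetical helper: d is the vector of values on the main diagonal
fill_lower <- function(d) {
  n <- length(d)
  M <- diag(d, n)                   # n x n matrix with d on the diagonal
  for (k in seq_len(n - 1)) {       # k-th diagonal below the main one
    for (j in (k + 1):n) {
      # each entry is the sum of its right-hand and upper neighbours
      M[j, j - k] <- M[j, j - k + 1] + M[j - 1, j - k]
    }
  }
  diag(M) <- 0                      # clear the starting values, as in the desired output
  M
}

res <- fill_lower(rep(2, 4))
```

Working outwards one diagonal at a time guarantees that both neighbours of an entry have been computed before the entry itself is filled.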
I am probably missing something, but why not do this:
n <- nrow(Omat)
for (i in 2:n){
# fill each row from right to left, so that Omat[i,j+1] is already computed
for (j in (i-1):1){
Omat[i,j] <- Omat[i-1,j] + Omat[i,j+1]
}
}
diag(Omat) <- 0
David.
