I have a data frame with two columns in R, and I want to create a third column that rolls over both columns with a lag of 2 and checks whether a condition is satisfied, as described in the table below.
The condition is a rolling ifelse and goes like this:
IF -A1<B3<A1 TRUE ELSE FALSE
IF -A2<B4<A2 TRUE ELSE FALSE
IF -A3<B5<A3 TRUE ELSE FALSE
IF -A4<B6<A4 TRUE ELSE FALSE
A   B  CHECK
1   4  NA
2   5  NA
3   6  FALSE
4   1  TRUE
5  -4  FALSE
6   1  TRUE
How can I do it in R? Is there a base R function for this, or a way within the dplyr framework?
Since R is vectorized, you can do that with one command, using for instance dplyr::lag:
library(dplyr)
df %>%
  mutate(CHECK = -lag(A, n = 2) < B & B < lag(A, n = 2))
  A  B CHECK
1 1  4    NA
2 2  5    NA
3 3  6 FALSE
4 4  1  TRUE
5 5 -4 FALSE
6 6  1  TRUE
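Since the question also asks about base R, here is a minimal sketch of the same check without dplyr, assuming the same data frame df as above; it builds the 2-row lag by hand:
# base R sketch: shift A down by two positions (padding with NA), then test
lagA <- c(NA, NA, head(df$A, -2))        # A lagged by 2
df$CHECK <- -lagA < df$B & df$B < lagA   # -A[i-2] < B[i] < A[i-2]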
Here is a simple example of trying to extract some rows from a data.table. With what appear to be the same logical vectors, I get different answers:
library(data.table)
a <- data.table(a = 1:10, b = 10:1)
a # so here is the data we are working with
a b
1: 1 10
2: 2 9
3: 3 8
4: 4 7
5: 5 6
6: 6 5
7: 7 4
8: 8 3
9: 9 2
10: 10 1
Let's extract just the first column, since I need to specify the column number dynamically as part of my processing:
col <- 1L # get column 1 ('a')
x <- a[[col]] > 5 # logical vector specifying condition
x
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
str(x)
logi [1:10] FALSE FALSE FALSE FALSE FALSE TRUE ...
Look at the structure of a[[col]] > 5:
a[[col]] > 5
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
str(a[[col]] > 5)
logi [1:10] FALSE FALSE FALSE FALSE FALSE TRUE ...
This looks very much like 'x', so why do these two different ways of indexing 'a' give different results?
a[x] # using 'x' as the logical vector
a b
1: 6 5
2: 7 4
3: 8 3
4: 9 2
5: 10 1
a[a[[col]] > 5] # using the expression as the logical vector
Empty data.table (0 rows) of 2 cols: a,b
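What seems to be going on (a hedged explanation, not a definitive account of data.table internals): the i expression is evaluated with the table's columns in scope, so inside a[...] the symbol a refers to the column a rather than to the data.table. a[[col]] therefore takes the first element of that column (1L), and 1L > 5 is a single FALSE that selects no rows.
# hedged sketch of the likely cause: inside i, `a` is the column, not the table
a[a > 5]             # `a` here is the column vector, so this returns rows 6 to 10
# workaround: build the logical vector outside the call (as with `x` above),
# or avoid giving the table the same name as one of its columns
idx <- a[[col]] > 5
a[idx]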
I am trying to figure out how to count the number of rows from when one column says TRUE to when the other column says TRUE. I attempted to use run length encoding but couldn't figure out how to get the alternating values from each column.
set.seed(42)
s <- sample(c(0, 1, 2, 3), 500, replace = TRUE)
isOverbought <- s == 1
isOverSold <- s == 0
head(cbind(isOverbought, isOverSold), 20)
res <- rle(isOverSold)
tt <- res[res$values == 0]  # getting when Oversold is true
> head(cbind(isOverbought, isOverSold), 20)
      isOverbought isOverSold
 [1,]        FALSE      FALSE
 [2,]        FALSE      FALSE
 [3,]         TRUE      FALSE  <- starting condition is overbought
 [4,]        FALSE      FALSE
 [5,]        FALSE      FALSE
 [6,]        FALSE      FALSE
 [7,]        FALSE      FALSE
 [8,]        FALSE       TRUE  <- is oversold; length from overbought to oversold = 5
 [9,]        FALSE      FALSE
[10,]        FALSE      FALSE
[11,]         TRUE      FALSE  <- is overbought; length from oversold to overbought = 3
[12,]        FALSE      FALSE
[13,]        FALSE      FALSE
[14,]         TRUE      FALSE
[15,]         TRUE      FALSE
[16,]        FALSE      FALSE
[17,]        FALSE      FALSE
[18,]        FALSE       TRUE  <- is oversold; length from overbought to oversold = 7
[19,]         TRUE      FALSE  <- is overbought; length from oversold to overbought = 1
[20,]        FALSE      FALSE
GOAL
overboughtTOoversold oversoldTOoverbought
5 3
7 1
The following is sufficient to solve your problem.
## `a` to `b`
a2b <- function(a, b) {
  x <- which(a)      ## position of `TRUE` in `a`
  y <- which(b)      ## position of `TRUE` in `b`
  z <- which(a | b)  ## position of all `TRUE`
  end <- match(y, z)                      ## match for end position
  start <- c(1L, end[-length(end)] + 1L)  ## start position
  valid <- end > start                    ## remove cases with `end = start`
  z[end[valid]] - z[start[valid]]
}

## cross `a` and `b`
axb <- function(a, b) {
  if (any(a & b))
    stop("Invalid input! `a` and `b` can't have TRUE at the same time!")
  x <- a2b(a, b); y <- a2b(b, a)
  if (which(a)[1L] < which(b)[1L]) cbind(a2b = x, b2a = c(NA_integer_, y))
  else cbind(a2b = c(NA_integer_, x), b2a = y)
}
For your isOverbought and isOverSold, we obtain:
result <- axb(isOverbought, isOverSold)
head(result)
# a2b b2a
#[1,] 5 NA
#[2,] 7 3
#[3,] 3 1
#[4,] 8 5
#[5,] 2 6
#[6,] 10 2
Since isOverbought has the first TRUE before isOverSold, the first element of the 2nd column is NA.
The assumption for this answer is that there is at least one overbought/oversold transition (in either direction) and hence at least two rows in the data. This condition can easily be checked by counting the number of overbought and oversold conditions and making sure that both counts are at least one.
The key is to remove the consecutive overbought and oversold conditions so that we only have alternating overbought and oversold conditions. One way to do this is:
## detect where we are overbought and oversold
i1 <- which(isOverbought)
i2 <- which(isOverSold)
## concatenate into one vector
i3 <- c(i1,i2)
## sort these and get the indices from the sort
i4 <- order(i3)
## at this point consecutive overbought or oversold conditions
## will be marked by a difference of 1 in i4 while alternating
## conditions will be marked by something other than 1. So
## filter those out to get i6. BTW, consecutive here does not mean
## consecutive rows in the data but consecutive occurrence of
## either overbought or oversold conditions without an intervening
## condition of the other. The assumption for at least one transition
## in the data is needed for this to work.
i5 <- diff(i4)
i6 <- i4[c(1,which(i5 != 1)+1)]
## then recover the alternating rows of overbought and oversold conditions in i7
i7 <- i3[i6]
## take the difference and format the output
## I need to credit @akrun for this part
i8 <- diff(i7)
## need to determine which is first
if (i1[1] < i2[1]) {
overboughtTOoversold <- i8[c(TRUE, FALSE)]
oversoldTOoverbought <- i8[c(FALSE, TRUE)]
} else {
overboughtTOoversold <- i8[c(FALSE, TRUE)]
oversoldTOoverbought <- i8[c(TRUE, FALSE)]
}
d1 <- cbind(overboughtTOoversold, oversoldTOoverbought)
print(head(d1))
## overboughtTOoversold oversoldTOoverbought
##[1,] 5 3
##[2,] 7 1
##[3,] 3 5
##[4,] 8 6
##[5,] 2 2
##[6,] 10 4
The cbind may generate a warning that the columns are not the same length. To get rid of that, just pad with NA at the end as appropriate.
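For example, a minimal sketch of that padding, using the fact that assigning a longer length to a numeric vector fills the new positions with NA:
# pad the shorter vector with NA so cbind() does not recycle or warn
n <- max(length(overboughtTOoversold), length(oversoldTOoverbought))
length(overboughtTOoversold) <- n
length(oversoldTOoverbought) <- n
d1 <- cbind(overboughtTOoversold, oversoldTOoverbought)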
A more compact version of the above is:
i3 <- c(which(isOverbought), which(isOverSold))
i4 <- order(i3)
i8 <- diff(i3[i4[c(1,which(diff(i4) != 1)+1)]])
if (which(isOverbought)[1] < which(isOverSold)[1]) {
overboughtTOoversold <- i8[c(TRUE, FALSE)]
oversoldTOoverbought <- i8[c(FALSE, TRUE)]
} else {
overboughtTOoversold <- i8[c(FALSE, TRUE)]
oversoldTOoverbought <- i8[c(TRUE, FALSE)]
}
d1 <- cbind(overboughtTOoversold, oversoldTOoverbought)
Here is a short version:
Create a vector called mktState. Encode it with 1 if overbought is TRUE, -1 if oversold is TRUE, and NA if both of the first two columns are FALSE. (You are interested only in days where the market state switches.)
Use na.locf from the zoo package to fill the NAs with the last observation carried forward.
Now use the rle function:
library(zoo)
mktState <- ifelse(isOverbought, 1, ifelse(isOverSold, -1, NA))
mktState <- na.locf(mktState)
to get 'overbought' runs:
> rle(mktState)$lengths[rle(mktState)$values == 1]
[1] 5 7 3 8 2 10 7 3 1 2 4 2 5 6 3 11 4 1 5 2 4 6 1 1 8
[26] 7 3 1 1 1 1 3 2 3 1 6 1 1 1 3 2 4 2 1 6 8 8 1 5 15
[51] 2 5 4 2 1 1 3 4 7 1 7 11 1 3 4 2 4 1
and this will give you the 'oversold' runs:
> rle(mktState)$lengths[rle(mktState)$values == -1]
[1] 3 1 5 6 2 4 1 4 3 3 3 5 2 4 1 14 2 2 10 3 7 1 13 1 1
[26] 3 3 1 6 5 2 1 8 7 2 3 1 1 3 5 1 1 2 3 1 2 2 3 3 1
[51] 8 9 4 2 1 6 2 1 3 2 4 5 1 3 7 4 2 2
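To arrange those two sets of run lengths in the two-column GOAL layout, a hedged sketch (it assumes, as in this data, that the first state is overbought, and note that the final run simply extends to the end of the series rather than to a following switch):
runs <- rle(mktState)
ob <- runs$lengths[runs$values == 1]    # lengths of overbought runs
os <- runs$lengths[runs$values == -1]   # lengths of oversold runs
n <- max(length(ob), length(os))
length(ob) <- n                         # pad the shorter vector with NA
length(os) <- n
head(cbind(overboughtTOoversold = ob, oversoldTOoverbought = os))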
Here's a [somewhat long] tidyverse version:
library(dplyr)
library(tidyr)
# put vectors in a data.frame
data.frame(isOverbought, isOverSold) %>%
  # evaluate each row separately
  rowwise() %>%
  # add column with name of event for any TRUE, else NA
  mutate(change_type = ifelse(isOverbought | isOverSold, names(.)[c(isOverbought, isOverSold)], NA)) %>%
  # reset grouping
  ungroup() %>%
  # replace NA values with last non-NA value
  fill(change_type) %>%
  # add a run ID that increments each time change_type changes
  mutate(changes = data.table::rleid(change_type)) %>%
  # count number of rows in each changes and change_type grouping
  count(changes, change_type) %>%
  # remove leading NAs
  na.omit() %>%
  # reset grouping
  ungroup() %>%
  # group the run IDs into pairs with integer division
  mutate(changes = changes %/% 2) %>%
  # spread to wide form
  spread(change_type, n) %>%
  # get rid of extra column
  select(-changes)
## # A tibble: 68 x 2
## isOverbought isOverSold
## * <int> <int>
## 1 5 3
## 2 7 1
## 3 3 5
## 4 8 6
## 5 2 2
## 6 10 4
## 7 7 1
## 8 3 4
## 9 1 3
## 10 2 3
## # ... with 58 more rows
I have a data frame with two columns. I want to add a new column to df that indicates, for each row, whether the value in column B matches the value in column A.
I tried:
df<-data.frame(A=c("1","test","2","3",NA,"Test", NA),B=c("1","No Match","No Match","3",NA,"Test", "No Match"))
df[df$A == df$B ]
However, I get:
Error in Ops.factor(df$A, df$B) : level sets of factors are different
Any recommendation on what I am doing wrong?
Deal with the NA values first, then add your column:
> df[is.na(df)]=""
> df$New = with(df, A==B)
> df
     A        B   New
1    1        1  TRUE
2 test No Match FALSE
3    2 No Match FALSE
4    3        3  TRUE
5                TRUE
6 Test     Test  TRUE
7      No Match FALSE
Or remove NA from your initial data.frame with df = df[complete.cases(df),] and then add the column.
If you really want to have FALSE when there is NA in the A or B column:
> transform(df, New=ifelse(is.na(A)|is.na(B), FALSE, df$A==df$B))
A B New
1 1 1 TRUE
2 test No Match FALSE
3 2 No Match FALSE
4 3 3 TRUE
5 <NA> <NA> FALSE
6 Test Test TRUE
7 <NA> No Match FALSE
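As for the original error: comparing two factors with different level sets is what triggers it, and data.frame() turns those strings into factors under pre-4.0 defaults (stringsAsFactors = TRUE). A hedged sketch of a comparison that sidesteps the factor issue while keeping the NA rows visible:
# compare as character so differing factor levels don't matter;
# rows with NA in either column yield NA instead of an error
df$New <- as.character(df$A) == as.character(df$B)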
With the data below, I'm trying to match each test column (test_A, etc.) to its corresponding time column (time_A, etc.) where the test is TRUE, and then find the minimum of all the TRUE-test times.
     ID time_A time_B time_C test_A test_B test_C min_true_time
[1,]  1      2      3      4  FALSE   TRUE  FALSE             ?
[2,]  2     -4      5      6   TRUE   TRUE  FALSE             ?
[3,]  3      6      1     -2   TRUE   TRUE   TRUE             ?
[4,]  4     -2      3      4   TRUE  FALSE  FALSE             ?
My actual data set is quite large so my attempts at if and for loops have failed miserably. But I can't make any progress on an apply function.
A negative time, say the -2 in row 3, would still be considered the minimum for that row.
Any suggestions are gladly welcomed.
You don't give much information, but I think this does what you need. No idea if it is efficient enough, since you don't say how big your dataset actually is.
#I assume your data is in a data.frame:
df <- read.table(text="ID time_A time_B time_C test_A test_B test_C
1 1 2 3 4 FALSE TRUE FALSE
2 2 -4 5 6 TRUE TRUE FALSE
3 3 6 1 -2 TRUE TRUE TRUE
4 4 -2 3 4 TRUE FALSE FALSE")
# loop over all rows, subset columns 2:4 with the logical columns 5:7, then take the min
df$min_true_time <- sapply(1:nrow(df), function(i) min(df[i,2:4][unlist(df[i,5:7])]))
df
# ID time_A time_B time_C test_A test_B test_C min_true_time
#1 1 2 3 4 FALSE TRUE FALSE 3
#2 2 -4 5 6 TRUE TRUE FALSE -4
#3 3 6 1 -2 TRUE TRUE TRUE -2
#4 4 -2 3 4 TRUE FALSE FALSE -2
Another way, which might be faster (I'm not in the mood for benchmarking):
m <- as.matrix(df[,2:4])
m[!df[,5:7]] <- NA
df$min_true_time <- apply(m,1,min,na.rm=TRUE)
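A hedged variation on the same idea that avoids the warning apply() can emit when a row has no TRUE test (the minimum of an all-NA row): mark failed tests with +Inf instead of NA and take element-wise minima with pmin.
m <- as.matrix(df[, 2:4])
m[!as.matrix(df[, 5:7])] <- Inf                   # a failed test can never be the minimum
df$min_true_time <- do.call(pmin, as.data.frame(m))
# rows with no TRUE test come out as Inf instead of triggering a warning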