Counting lengths between alternating columns in R

I am trying to figure out how to count the number of rows from when one column says TRUE to when the other column says TRUE. I attempted to use run length encoding (rle) but couldn't figure out how to get the alternating values from each column.
set.seed(42)
s <- sample(c(0, 1, 2, 3), 500, replace = TRUE)
isOverbought <- s == 1
isOverSold <- s == 0
head(cbind(isOverbought, isOverSold), 20)
res <- rle(isOverSold)
tt <- res[res$values == 0] # attempt to get the runs where oversold is true (doesn't work)
> head(cbind(isOverbought, isOverSold), 20)
      isOverbought isOverSold
 [1,]        FALSE      FALSE
 [2,]        FALSE      FALSE
 [3,]         TRUE      FALSE  <- starting condition is overbought
 [4,]        FALSE      FALSE
 [5,]        FALSE      FALSE
 [6,]        FALSE      FALSE
 [7,]        FALSE      FALSE
 [8,]        FALSE       TRUE  <- is oversold. length from overbought to oversold = 5
 [9,]        FALSE      FALSE
[10,]        FALSE      FALSE
[11,]         TRUE      FALSE  <- is overbought. length from oversold to overbought = 3
[12,]        FALSE      FALSE
[13,]        FALSE      FALSE
[14,]         TRUE      FALSE
[15,]         TRUE      FALSE
[16,]        FALSE      FALSE
[17,]        FALSE      FALSE
[18,]        FALSE       TRUE  <- is oversold. length from overbought to oversold = 7
[19,]         TRUE      FALSE  <- is overbought. length from oversold to overbought = 1
[20,]        FALSE      FALSE
GOAL

overboughtTOoversold oversoldTOoverbought
                   5                    3
                   7                    1

This is sufficient to solve your problem.
## `a` to `b`
a2b <- function (a, b) {
  y <- which(b)       ## positions of `TRUE` in `b`
  z <- which(a | b)   ## positions of all `TRUE`
  end <- match(y, z)                      ## end position of each measurement (a `b` event)
  start <- c(1L, end[-length(end)] + 1L)  ## start position (first event after the previous `b`)
  valid <- end > start                    ## drop cases with `end = start` (no `a` in between)
  z[end[valid]] - z[start[valid]]
}
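To see the logic on a tiny made-up input (a sketch): a lone TRUE in `a` at position 3 and a lone TRUE in `b` at position 8 should give a distance of 5.
a <- seq_len(10) == 3
b <- seq_len(10) == 8
a2b(a, b)
# [1] 5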
## cross `a` and `b`
axb <- function (a, b) {
  if (any(a & b))
    stop("Invalid input! `a` and `b` can't have TRUE at the same time!")
  x <- a2b(a, b)
  y <- a2b(b, a)
  if (which(a)[1L] < which(b)[1L]) cbind(a2b = x, b2a = c(NA_integer_, y))
  else cbind(a2b = c(NA_integer_, x), b2a = y)
}
For your isOverbought and isOverSold, we obtain:
result <- axb(isOverbought, isOverSold)
head(result)
# a2b b2a
#[1,] 5 NA
#[2,] 7 3
#[3,] 3 1
#[4,] 8 5
#[5,] 2 6
#[6,] 10 2
Since isOverbought has the first TRUE before isOverSold, the first element of the 2nd column is NA.

The assumption for this answer is that there is at least one overbought/oversold transition (in either direction), and hence at least two signal rows in the data. This condition can easily be checked by counting the overbought and oversold occurrences and making sure both counts are at least one.
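For example, a minimal check along these lines (a sketch; stopifnot() aborts with an error if either signal never occurs):
stopifnot(sum(isOverbought) >= 1, sum(isOverSold) >= 1)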
The key is to remove the consecutive overbought and oversold conditions so that we only have alternating overbought and oversold conditions. One way to do this is:
## detect where we are overbought and oversold
i1 <- which(isOverbought)
i2 <- which(isOverSold)
## concatenate into one vector
i3 <- c(i1,i2)
## sort these and get the indices from the sort
i4 <- order(i3)
## at this point consecutive overbought or oversold conditions
## will be marked by a difference of 1 in i4 while alternating
## conditions will be marked by something other than 1. So
## filter those out to get i6. BTW, consecutive here does not mean
## consecutive rows in the data but consecutive occurrence of
## either overbought or oversold conditions without an intervening
## condition of the other. The assumption for at least one transition
## in the data is needed for this to work.
i5 <- diff(i4)
i6 <- i4[c(1,which(i5 != 1)+1)]
## then recover the alternating rows of overbought and oversold conditions in i7
i7 <- i3[i6]
## take the difference and format the output
## credit to @akrun for this part
i8 <- diff(i7)
## need to determine which is first
if (i1[1] < i2[1]) {
  overboughtTOoversold <- i8[c(TRUE, FALSE)]
  oversoldTOoverbought <- i8[c(FALSE, TRUE)]
} else {
  overboughtTOoversold <- i8[c(FALSE, TRUE)]
  oversoldTOoverbought <- i8[c(TRUE, FALSE)]
}
d1 <- cbind(overboughtTOoversold, oversoldTOoverbought)
print(head(d1))
## overboughtTOoversold oversoldTOoverbought
##[1,] 5 3
##[2,] 7 1
##[3,] 3 5
##[4,] 8 6
##[5,] 2 2
##[6,] 10 4
The cbind may generate a warning that the columns are not the same length. To get rid of that, just pad with NA at the end as appropriate.
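For example, one way to pad (a sketch; it relies on the fact that assigning a longer length to a vector extends it with NA):
n <- max(length(overboughtTOoversold), length(oversoldTOoverbought))
length(overboughtTOoversold) <- n # assigning a longer length pads with NA
length(oversoldTOoverbought) <- n
d1 <- cbind(overboughtTOoversold, oversoldTOoverbought)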
A more compact version of the above is:
i3 <- c(which(isOverbought), which(isOverSold))
i4 <- order(i3)
i8 <- diff(i3[i4[c(1,which(diff(i4) != 1)+1)]])
if (which(isOverbought)[1] < which(isOverSold)[1]) {
  overboughtTOoversold <- i8[c(TRUE, FALSE)]
  oversoldTOoverbought <- i8[c(FALSE, TRUE)]
} else {
  overboughtTOoversold <- i8[c(FALSE, TRUE)]
  oversoldTOoverbought <- i8[c(TRUE, FALSE)]
}
d1 <- cbind(overboughtTOoversold, oversoldTOoverbought)

Here is a short version:
Create a vector called mktState: encode it with 1 if overBought is TRUE, -1 if overSold is TRUE, and NA if both columns are FALSE (you are interested only in days where the market state switches).
Use na.locf() from the zoo package to fill the NAs with the last observation carried forward (leading NAs, before the first signal, are dropped by default).
Now use the rle function.
library(zoo) # for na.locf()
mktState <- ifelse(df$overBought, 1, ifelse(df$overSold, -1, NA))
mktState <- na.locf(mktState)
to get 'overbought' runs:
> rle(mktState)$lengths[rle(mktState)$values == 1]
[1] 5 7 3 8 2 10 7 3 1 2 4 2 5 6 3 11 4 1 5 2 4 6 1 1 8
[26] 7 3 1 1 1 1 3 2 3 1 6 1 1 1 3 2 4 2 1 6 8 8 1 5 15
[51] 2 5 4 2 1 1 3 4 7 1 7 11 1 3 4 2 4 1
and this will give you the 'oversold' runs:
> rle(mktState)$lengths[rle(mktState)$values == -1]
[1] 3 1 5 6 2 4 1 4 3 3 3 5 2 4 1 14 2 2 10 3 7 1 13 1 1
[26] 3 3 1 6 5 2 1 8 7 2 3 1 1 3 5 1 1 2 3 1 2 2 3 3 1
[51] 8 9 4 2 1 6 2 1 3 2 4 5 1 3 7 4 2 2
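A small refinement (a sketch): call rle() once and index its pieces rather than computing it twice:
r <- rle(mktState)
overbought_runs <- r$lengths[r$values == 1]
oversold_runs <- r$lengths[r$values == -1]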

Here's a [somewhat long] tidyverse version:
library(dplyr)
library(tidyr)

# put vectors in a data.frame
data.frame(isOverbought, isOverSold) %>%
  # evaluate each row separately
  rowwise() %>%
  # add a column with the name of the event for any TRUE, else NA
  mutate(change_type = ifelse(isOverbought | isOverSold,
                              names(.)[c(isOverbought, isOverSold)], NA)) %>%
  # reset grouping
  ungroup() %>%
  # replace NA values with the last non-NA value
  fill(change_type) %>%
  # add a run ID that increments whenever change_type changes
  mutate(changes = data.table::rleid(change_type)) %>%
  # count the number of rows in each changes/change_type grouping
  count(changes, change_type) %>%
  # remove leading NAs
  na.omit() %>%
  # reset grouping
  ungroup() %>%
  # pair consecutive runs with integer division
  mutate(changes = changes %/% 2) %>%
  # spread to wide form
  spread(change_type, n) %>%
  # get rid of the extra column
  select(-changes)
## # A tibble: 68 x 2
## isOverbought isOverSold
## * <int> <int>
## 1 5 3
## 2 7 1
## 3 3 5
## 4 8 6
## 5 2 2
## 6 10 4
## 7 7 1
## 8 3 4
## 9 1 3
## 10 2 3
## # ... with 58 more rows
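As an aside, spread() is superseded in current tidyr by pivot_wider(); an equivalent final reshaping step (a sketch) would be pivot_wider(names_from = change_type, values_from = n).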

Extract and plot n rows before and after the onset of each event, centered on zero

Objective: From a time-series df, make a plot of each occurrence of a particular state (or factor level) with x timepoints before, and y timepoints after, the onset (i.e., first row) of that state. The graph should be centered on zero (on the x-axis), such that the x timepoints before the event are negative values and the y timepoints after the event are positive values. This is the same principle as a peristimulus time histogram.
The data: I have time-series data where different states can occur for variable amounts of time. First I use run length encoding (rle) to determine the start and stop of each run of each state (not shown). Second, I use a function, similar to the one described here, to return, say, one row above and two rows below the onset of a particular state (state "A" in the example below). Here's what that data looks like.
library(dplyr)
library(tidyr)

df <- data.frame(
  state = c("A","A","A","A","A","B","A","A","X","Y","Z","A","A","A","B","A","A"),
  start = c("start","NA","NA","NA","NA","NA","start","NA","NA","NA","NA","start","NA","NA","NA","start","NA"),
  rleGroup = c("1","1","1","1","1","2","3","3","4","5","6","7","7","7","8","9","9"),
  data = runif(17)
)
df <- df %>% tidyr::unite(stateStart, c(state, start), sep = ".", remove = FALSE)
stateStart state start rleGroup data
1 A.start A start 1 0.85118187
2 A.NA A NA 1 0.23502147
3 A.NA A NA 1 0.97435662
4 A.NA A NA 1 0.45669042
5 A.NA A NA 1 0.48271803
6 B.NA B NA 2 0.80561653
7 A.start A start 3 0.27228361
8 A.NA A NA 3 0.07008506
9 X.NA X NA 4 0.44101076
10 Y.NA Y NA 5 0.95173954
11 Z.NA Z NA 6 0.65693316
12 A.start A start 7 0.45831802
13 A.NA A NA 7 0.83629347
14 A.NA A NA 7 0.62107270
15 B.NA B NA 8 0.53294588
16 A.start A start 9 0.08533221
17 A.NA A NA 9 0.28805362
extract.with.context <- function(x, colname, rows, after = 0, before = 0) {
  match.idx <- which(x[[colname]] %in% rows)   # rows matching the target value(s)
  span <- seq(from = -before, to = after)      # offsets to keep around each match
  extend.idx <- c(outer(match.idx, span, `+`)) # all offsets applied to all matches
  extend.idx <- Filter(function(i) i > 0 & i <= nrow(x), extend.idx) # drop out-of-range indices
  extend.idx <- sort(unique(extend.idx))
  return(x[extend.idx, , drop = FALSE])
}
extracted = extract.with.context(x=df, colname="stateStart", rows=c("A.start"), after = 2, before = 1)
stateStart state start rleGroup data
1 A.start A start 1 0.85118187
2 A.NA A NA 1 0.23502147
3 A.NA A NA 1 0.97435662
6 B.NA B NA 2 0.80561653
7 A.start A start 3 0.27228361
8 A.NA A NA 3 0.07008506
9 X.NA X NA 4 0.44101076
11 Z.NA Z NA 6 0.65693316
12 A.start A start 7 0.45831802
13 A.NA A NA 7 0.83629347
14 A.NA A NA 7 0.62107270
15 B.NA B NA 8 0.53294588
16 A.start A start 9 0.08533221
17 A.NA A NA 9 0.28805362
The problem: I want to plot each extracted run of state A (i.e., one timepoint before and two timepoints after A.start). One thought is to make a unique identifier for each run grouping of state "A" (Question 1), and then make a time sequence counter that reflects the desired rows before and after the onset of state A (Question 2).
QUESTION NUMBER 1.
Create a unique identifier for each "plotGroup", defined as each run of state A (i.e., a row before A.start and two rows after A.start)
I tried this, but it's not quite working
extracted %>%
  mutate(plotGroup = cumsum(lag(state) == "A" & state != "A"))
It should look like this:
extracted$plotGroup <- c("0","0","0","1","1","1","1","2","2","2","2","3","3","3")
stateStart state start rleGroup data plotGroup
1 A.start A start 1 0.85118187 0
2 A.NA A NA 1 0.23502147 0
3 A.NA A NA 1 0.97435662 0
6 B.NA B NA 2 0.80561653 1
7 A.start A start 3 0.27228361 1
8 A.NA A NA 3 0.07008506 1
9 X.NA X NA 4 0.44101076 1
11 Z.NA Z NA 6 0.65693316 2
12 A.start A start 7 0.45831802 2
13 A.NA A NA 7 0.83629347 2
14 A.NA A NA 7 0.62107270 2
15 B.NA B NA 8 0.53294588 3
16 A.start A start 9 0.08533221 3
17 A.NA A NA 9 0.28805362 3
QUESTION NUMBER 2. Create a "counter," centered on zero, of one row above and two rows after the A.start
I have no idea how to do this! But presumably it can make use of "span" in the function.
This is my desired output
extracted$span <- c("0","1","2","-1","0","1","2","-1","0","1","2","-1","0","1")
stateStart state start rleGroup data plotGroup span
1 A.start A start 1 0.85118187 0 0
2 A.NA A NA 1 0.23502147 0 1
3 A.NA A NA 1 0.97435662 0 2
6 B.NA B NA 2 0.80561653 1 -1
7 A.start A start 3 0.27228361 1 0
8 A.NA A NA 3 0.07008506 1 1
9 X.NA X NA 4 0.44101076 1 2
11 Z.NA Z NA 6 0.65693316 2 -1
12 A.start A start 7 0.45831802 2 0
13 A.NA A NA 7 0.83629347 2 1
14 A.NA A NA 7 0.62107270 2 2
15 B.NA B NA 8 0.53294588 3 -1
16 A.start A start 9 0.08533221 3 0
17 A.NA A NA 9 0.28805362 3 1
Ultimate objective: plot data by span for each individual plotGroup.
ggplot(data = extracted, aes(x = span, y = data, group = plotGroup)) + geom_line()
This approach subdivides the data into plotGroups, where each group starts one step before each new A (except for the first group), and the counter is set to zero at each group's first A. The number of division points prior is determined by the n in lead(), and we can add a filter to limit the points after.
# edit to fix first group counting
library(dplyr)
library(ggplot2)

df %>%
  mutate(start = state == "A" & lag(state, default = "") != "A") %>%
  mutate(plotGroup = cumsum(lead(start, n = 1, default = FALSE))) %>%
  group_by(plotGroup) %>%
  mutate(counter = row_number() - row_number()[start]) %>%
  ungroup() %>%
  filter(counter <= 2) %>%
  ggplot(aes(counter, data, group = plotGroup)) +
  geom_line()
Result before plotting:
# A tibble: 14 × 6
state start rleGroup data plotGroup counter
<chr> <lgl> <chr> <dbl> <int> <int>
1 A TRUE 1 0.0198 0 0
2 A FALSE 1 0.338 0 1
3 A FALSE 1 0.635 0 2
4 B FALSE 2 0.0138 1 -1
5 A TRUE 3 0.218 1 0
6 A FALSE 3 0.208 1 1
7 X FALSE 4 0.0934 1 2
8 Z FALSE 6 0.499 2 -1
9 A TRUE 7 0.0417 2 0
10 A FALSE 7 0.934 2 1
11 A FALSE 7 0.507 2 2
12 B FALSE 8 0.555 3 -1
13 A TRUE 9 0.158 3 0
14 A FALSE 9 0.437 3 1
# define the number of rows you want before and after the zero-centered point
after <- 2
before <- 1

# made-up data
df <- data.frame(
  state = c("A","A","A","A","A","B","A","A","X","Y","Z","A","A","A","B","A","A"),
  start = c("start","NA","NA","NA","NA","NA","start","NA","NA","NA","NA","start","NA","NA","NA","start","NA"),
  rleGroup = c("1","1","1","1","1","2","3","3","4","5","6","7","7","7","8","9","9"),
  data = runif(17)
)
df <- df %>% tidyr::unite(stateStart, c(state, start), sep = ".", remove = FALSE)
# extract the rows before and after the onset of a particular state
extract.with.context <- function(x, colname, rows, after = 0, before = 0) {
  match.idx <- which(x[[colname]] %in% rows)
  span <- seq(from = -before, to = after)
  extend.idx <- c(outer(match.idx, span, `+`))
  extend.idx <- Filter(function(i) i > 0 & i <= nrow(x), extend.idx)
  extend.idx <- sort(unique(extend.idx))
  return(x[extend.idx, , drop = FALSE])
}
extracted.df = extract.with.context(x=df, colname="stateStart", rows=c("A.start"), after = after, before = before)
# Create plotGroup.
# If we key off the starting cue (TRUE/FALSE) and start counting when lead() (by "before") is TRUE,
# we get the correct plotGroup regardless of whether the desired state is in the first row or not.
boo <- extracted.df %>%
  dplyr::mutate(start2 = state == "A" & lag(state, default = "") != "A") %>%
  mutate(plotGroup = cumsum(lead(start2, n = before, default = FALSE)))
# create the counter/sequence to zero the graph
counter <- rep(NA, times = length(boo$start)) # make an empty counter
starts <- which(boo$start == "start")         # find the start positions
counter[starts] <- 0
for (i in 1:after) {      # for every position after a start, up to "after"
  indexes <- starts + i   # indices "i" positions after each start
  indexes_1 <- indexes[which(indexes %in% 1:length(counter))] # keep only indices within the counter's length
  counter[indexes_1] <- i # for those indices, put in the count, i
}
for (i in 1:before) {     # same as for "after", but in reverse for "before"
  indexes <- starts - i
  indexes_1 <- indexes[which(indexes %in% 1:length(counter))]
  counter[indexes_1] <- -i
}
boo$span <- counter
boo
stateStart state start rleGroup data start2 plotGroup span
1 A.start A start 1 0.22771277 TRUE 0 0
2 A.NA A NA 1 0.39769158 FALSE 0 1
3 A.NA A NA 1 0.42416120 FALSE 0 2
6 B.NA B NA 2 0.06402964 FALSE 1 -1
7 A.start A start 3 0.22233942 TRUE 1 0
8 A.NA A NA 3 0.77667057 FALSE 1 1
9 X.NA X NA 4 0.36675437 FALSE 1 2
11 Z.NA Z NA 6 0.49100719 FALSE 2 -1
12 A.start A start 7 0.26012695 TRUE 2 0
13 A.NA A NA 7 0.88900224 FALSE 2 1
14 A.NA A NA 7 0.59714172 FALSE 2 2
15 B.NA B NA 8 0.15040234 FALSE 3 -1
16 A.start A start 9 0.85581300 TRUE 3 0
17 A.NA A NA 9 0.15780435 FALSE 3 1
# plot
ggplot(data = boo, aes(x = span, y = data, group = plotGroup)) +
  geom_line()

is.na or complete.cases in R using column number

In this example I need to drop all rows with NA values. I tried
drop <- is.na(df[,c(3,4,5)])
Error in df[, c(3, 4, 5)] : incorrect number of dimensions
My dataframe has 5 columns.
I am not trying to select columns by column name.
I also tried
df[complete.cases(df[ , 3:5]),]
Same error: incorrect number of dimensions.
Dropping missing values from vectors
The errors indicate that your data are likely a vector, not a data.frame. Accordingly, there are no rows or columns (it has no dim) and so using [,] is throwing errors. To support this, below I create a vector, reproduce the errors, and demonstrate how to drop missing values from it.
# Create vector, show it's a vector
vec <- c(NA,1:4)
vec
#> [1] NA 1 2 3 4
is.vector(vec)
#> [1] TRUE
# Reproduces your errors for both methods
is.na(vec[ ,2:3])
#> Error in vec[, 2:3]: incorrect number of dimensions
vec[complete.cases(vec[ , 2:3]), ]
#> Error in vec[, 2:3]: incorrect number of dimensions
# Remove missing values from the vector
vec[!is.na(vec)]
#> [1] 1 2 3 4
vec[complete.cases(vec)]
#> [1] 1 2 3 4
I'll additionally show you below how to check if your data object is a data.frame and how to omit rows with missing values in case it is.
Create data and check it's a data.frame
# Create an example data.frame
set.seed(123)
N <- 10
df <- data.frame(
  x1 = sample(c(NA_real_, 1, 2, 3), N, replace = TRUE),
  x2 = sample(c(NA_real_, 1, 2, 3), N, replace = TRUE),
  x3 = sample(c(NA_real_, 1, 2, 3), N, replace = TRUE)
)
print(df)
#> x1 x2 x3
#> 1 2 3 NA
#> 2 2 1 3
#> 3 2 1 NA
#> 4 1 NA NA
#> 5 2 1 NA
#> 6 1 2 2
#> 7 1 3 3
#> 8 1 NA 1
#> 9 2 2 2
#> 10 NA 2 1
# My hunch is that you are not using a data.frame. You can check as follows:
class(df)
#> [1] "data.frame"
Approaches to removing rows with missing values from data.frames
Your first approach returns logical values indicating whether each value is missing in the specified columns. You can then use rowSums to drop rows with missing values, per below.
# Example: shows whether values are missing for second and third columns
miss <- is.na(df[ ,2:3])
print(miss)
#> x2 x3
#> [1,] FALSE TRUE
#> [2,] FALSE FALSE
#> [3,] FALSE TRUE
#> [4,] TRUE TRUE
#> [5,] FALSE TRUE
#> [6,] FALSE FALSE
#> [7,] FALSE FALSE
#> [8,] TRUE FALSE
#> [9,] FALSE FALSE
#> [10,] FALSE FALSE
# We can sum all of these values by row (`TRUE` = 1, `FALSE` = 0 in R) and keep only
# those rows that sum to 0 to remove missing values. Notice that the row names
# retain the original numbering.
df[rowSums(miss) == 0, ]
#> x1 x2 x3
#> 2 2 1 3
#> 6 1 2 2
#> 7 1 3 3
#> 9 2 2 2
#> 10 NA 2 1
Your second approach is to use complete.cases. This also works and produces the same result as the first approach.
miss_cases <- df[complete.cases(df[ ,2:3]), ]
miss_cases
#> x1 x2 x3
#> 2 2 1 3
#> 6 1 2 2
#> 7 1 3 3
#> 9 2 2 2
#> 10 NA 2 1
A third approach is to use na.omit(); however, it doesn't let you specify columns, so just use complete.cases instead if you need to filter on specific columns.
na.omit(df)
#> x1 x2 x3
#> 2 2 1 3
#> 6 1 2 2
#> 7 1 3 3
#> 9 2 2 2
A fourth approach is to use the tidyr package, where the appeal is that you can use column indices as well as unquoted column names. Note that this also renumbers the row names.
library(tidyr)
drop_na(df, 2:3)
#> x1 x2 x3
#> 1 2 1 3
#> 2 1 2 2
#> 3 1 3 3
#> 4 2 2 2
#> 5 NA 2 1

Count instances of overlap in two vectors in R

I am hoping to create a matrix that shows a count of instances of overlapping values for a grouping variable based on a second variable. Specifically, I am hoping to determine the degree to which primary studies overlap across meta-analyses in order to create a network diagram.
So, in this example, I have three meta-analyses that include some portion of three primary studies.
df <- data.frame(metas = c(1,1,1,2,3,3), studies = c(1,3,2,1,2,3))
metas studies
1 1 1
2 1 3
3 1 2
4 2 1
5 3 2
6 3 3
I would like it to return:
v1 v2 v3
1 3 1 2
2 1 1 0
3 2 0 2
The value in row 1, column 1 indicates that Meta-analysis 1 had three studies in common with itself (i.e., it included three studies). Row 1, column 2 indicates that Meta-analysis 1 had one study in common with Meta-analysis 2. Row 1, column 3 indicates that Meta-analysis 1 had two studies in common with Meta-analysis 3.
I believe you are looking for a symmetric matrix of intersecting studies.
dfspl <- split(df$studies, df$metas)
out <- outer(seq_along(dfspl), seq_along(dfspl),
             function(a, b) lengths(Map(intersect, dfspl[a], dfspl[b])))
out
# [,1] [,2] [,3]
# [1,] 3 1 2
# [2,] 1 1 0
# [3,] 2 0 2
If you need names on them, you can go with the names as defined by df$metas:
rownames(out) <- colnames(out) <- names(dfspl)
out
# 1 2 3
# 1 3 1 2
# 2 1 1 0
# 3 2 0 2
If you need the names defined as v plus the meta name, go with
rownames(out) <- colnames(out) <- paste0("v", names(dfspl))
out
# v1 v2 v3
# v1 3 1 2
# v2 1 1 0
# v3 2 0 2
If you need to understand what this is doing, outer creates an expansion of the two argument vectors, and passes them all at once to the function. For instance,
outer(seq_along(dfspl), seq_along(dfspl), function(a, b) { browser(); 1; })
# Called from: FUN(X, Y, ...)
# debug at #1: [1] 1
# Browse[2]> a
# [1] 1 2 3 1 2 3 1 2 3
# Browse[2]> b
# [1] 1 1 1 2 2 2 3 3 3
What we ultimately want to do is find the intersection of each pair of studies.
dfspl[[1]]
# [1] 1 3 2
dfspl[[3]]
# [1] 2 3
intersect(dfspl[[1]], dfspl[[3]])
# [1] 3 2
length(intersect(dfspl[[1]], dfspl[[3]]))
# [1] 2
Granted, we are doing each intersection twice (once for 1 and 3, once for 3 and 1, which gives the same result), so this is a little inefficient; it would be better to compute only the upper or lower triangle and transfer it to the other.
Edited for a more efficient process (calculating each intersection pair only once, and never calculating self-intersections):
eg <- expand.grid(a = seq_along(dfspl), b = seq_along(dfspl))
eg <- eg[ eg$a < eg$b, ]
eg
# a b
# 4 1 2
# 7 1 3
# 8 2 3
lens <- lengths(Map(intersect, dfspl[eg$a], dfspl[eg$b]))
lens
# 1 1 2 ## btw, these are just names, from eg$a
# 1 2 0
out <- matrix(nrow = length(dfspl), ncol = length(dfspl))
out[ cbind(eg$a, eg$b) ] <- lens
out
# [,1] [,2] [,3]
# [1,] NA 1 2
# [2,] NA NA 0
# [3,] NA NA NA
out[ lower.tri(out) ] <- t(out)[ lower.tri(out) ] ## mirror the upper triangle (correct for any matrix size)
diag(out) <- lengths(dfspl)
out
# [,1] [,2] [,3]
# [1,] 3 1 2
# [2,] 1 1 0
# [3,] 2 0 2
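As an aside, the same symmetric count matrix can be obtained from a 0/1 incidence table and a crossproduct; a sketch, assuming each study appears at most once per meta-analysis (true of the sample data):
tab <- table(df$studies, df$metas) # studies x metas incidence
crossprod(tab)                     # metas x metas overlap counts
#   1 2 3
# 1 3 1 2
# 2 1 1 0
# 3 2 0 2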
Same idea as @r2evans, also base R (and a bit less eloquent) (edited as required):
# Create df using sample data:
df <- data.frame(metas = c(1,1,1,2,3,3), studies = c(1,7,2,1,2,3))
# Test each value in the metas vector for equality against every unique value
# in the dataframe, then row-sum the logical matrix (v1 must exist before diag(v1) is called):
v1 <- rowSums(data.frame(sapply(df$metas, `==`, unique(unlist(df)))))
# Construct a symmetric matrix from the vector:
m1 <- diag(v1)
m1[, 1] <- m1[1, ] <- v1
# Coerce the matrix to a dataframe, setting the names as desired and dropping non-matches:
keep <- rowSums(m1) > 0
df_2 <- setNames(data.frame(m1[keep, keep]), paste0("v", seq_len(sum(keep))))

Finding maximum value before sequence repeats in R

I have a data set that looks like the following
xx = c(1:5, 1:9, 1:7)
# [1] 1 2 3 4 5 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7
I would like to know the index of 1 and the maximum value before the sequence begins again. For example:
[1] 1, 1
[2] 5, 5
[3] 6, 1
[4] 14, 9
[5] 15, 1
[6] 21, 7
and so on.....
An option would be (assuming that the vector contains only sequences that restart at 1):
v1 <- which(xx == 1)
v2 <- c(rbind(v1, c(v1[-1]-1, length(xx))))
cbind(ind = v2, value = xx[v2])
# ind value
#[1,] 1 1
#[2,] 5 5
#[3,] 6 1
#[4,] 14 9
#[5,] 15 1
#[6,] 21 7
Or another option is to split the indices of 'xx' on the run groupings and get the first and last element of each group:
ind <- unlist(lapply(split(seq_along(xx), cumsum(xx == 1)),
                     function(x) x[c(1, length(x))]))
cbind(ind, value = xx[ind])
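For the example vector this reproduces the same index/value pairs as the first option (the row labels, which come from the unlisted names, are omitted here):
# ind value
#   1     1
#   5     5
#   6     1
#  14     9
#  15     1
#  21     7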
This can be answered with a somewhat ugly (but efficient!) lapply:
a <- lapply(1:length(xx), function(x) {
  if (x == length(xx)) {
    c(x, xx[x])                 # the last element always closes a run
  } else if (xx[x] == 1) {
    c(x, 1)                     # a 1 always starts a run
  } else if (xx[x] > xx[x + 1]) {
    c(x, xx[x])                 # the value drops next, so this is a run maximum
  }                             # otherwise NULL, which unlist() drops
})
matrix(unlist(a), ncol = 2, byrow = TRUE)
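For the example vector this gives:
#      [,1] [,2]
# [1,]    1    1
# [2,]    5    5
# [3,]    6    1
# [4,]   14    9
# [5,]   15    1
# [6,]   21    7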

Group index from column labeling the last element in each group

I'm trying to subset a data frame. The data frame is to be broken into subsets, where the last element in each subset has a "TRUE" value in the "bool" column. Consider the following data frame:
df <- data.frame(c(3,1,3,4,1,1,4), rnorm(7))
df <- cbind(df, df[,1] != 1)
names(df) <- c("ind", "var", "bool")
df
# ind var bool
# 1 3 0.02343906 TRUE
# 2 1 0.94786193 FALSE
# 3 3 0.50632766 TRUE
# 4 4 0.24655548 TRUE
# 5 1 -1.58103304 FALSE
# 6 1 0.73999468 FALSE
# 7 4 0.10929906 TRUE
Row 1 should be a subset, rows 2 and 3 should be a subset, row 4 a subset and then rows 5 through 7 a subset. The code I have below works (I can subset on the new column), but I was wondering if there was a more "R" way of doing it.
index <- 1
for (i in 1:nrow(df)) {
  if (df$bool[i]) {
    df$index[i] <- index
    index <- index + 1
  } else {
    df$index[i] <- index
  }
}
df
# ind var bool index
# 1 3 0.02343906 TRUE 1
# 2 1 0.94786193 FALSE 2
# 3 3 0.50632766 TRUE 2
# 4 4 0.24655548 TRUE 3
# 5 1 -1.58103304 FALSE 4
# 6 1 0.73999468 FALSE 4
# 7 4 0.10929906 TRUE 4
The first thought I would have would be to use the cumulative sum (cumsum) on the bool column to get the group indices -- this will increase the index value by 1 every time the bool value is TRUE:
df$index <- cumsum(df$bool)
df
# ind var bool index
# 1 3 -1.0712125 TRUE 1
# 2 1 0.4994369 FALSE 1
# 3 3 2.1335274 TRUE 2
# 4 4 -1.5950432 TRUE 3
# 5 1 0.5919880 FALSE 3
# 6 1 2.7039831 FALSE 3
# 7 4 -1.3526646 TRUE 4
This is not quite right because all the observations before the TRUE of each group are assigned to the previous group. We can fix that by adding 1 for all the observations with bool set to FALSE:
df$index <- cumsum(df$bool) + !df$bool
df
# ind var bool index
# 1 3 -1.0712125 TRUE 1
# 2 1 0.4994369 FALSE 2
# 3 3 2.1335274 TRUE 2
# 4 4 -1.5950432 TRUE 3
# 5 1 0.5919880 FALSE 4
# 6 1 2.7039831 FALSE 4
# 7 4 -1.3526646 TRUE 4
Splitting the data frame into a list of subsets can now be achieved efficiently with subsets <- split(df, df$index).
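For instance, with the data above (a sketch; the list names come from the index values):
subsets <- split(df, df$index)
subsets[["1"]]
#   ind        var bool index
# 1   3 -1.0712125 TRUE     1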
