I have a very large data set with 250 string and numeric variables. I want to compare the columns pairwise, one pair after another: the first variable with the second, the third with the fourth, the fifth with the sixth, and so on.
For example (the data set is structured like this), I want to compare number.x with number.y, day.x with day.y, school.x with school.y, etc.
number.x<-c(1,2,3,4,5,6,7)
number.y<-c(3,4,5,6,1,2,7)
day.x<-c(1,3,4,5,6,7,8)
day.y<-c(4,5,6,7,8,7,8)
school.x<-c("a","b","b","c","n","f","h")
school.y<-c("a","b","b","c","m","g","h")
city.x<- c(1,2,3,7,5,8,7)
city.y<- c(1,2,3,5,5,7,7)
You mean, something like this?
> number.x == number.y
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE
> length(which(number.x==number.y))
[1] 1
> school.x == school.y
[1] TRUE TRUE TRUE TRUE FALSE FALSE TRUE
> test.day <- day.x == day.y
> test.day
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE
EDIT: Given your example variables above, we have:
df <- data.frame(number.x,
number.y,
day.x,
day.y,
school.x,
school.y,
city.x,
city.y,
stringsAsFactors=FALSE)
n <- ncol(df) # no of columns (assumed EVEN number)
k <- 1
comp <- list() # comparisons will be stored here
while (k <= n-1) {
l <- (k+1)/2
comp[[l]] <- df[,k] == df[,k+1]
k <- k+2
}
After which, you'll have:
> comp
[[1]]
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[[2]]
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE
[[3]]
[1] TRUE TRUE TRUE TRUE FALSE FALSE TRUE
[[4]]
[1] TRUE TRUE TRUE FALSE TRUE FALSE TRUE
To get the comparison result between columns k and k+1 (with k odd), look at element (k+1)/2 of comp; e.g. to get the comparison between columns 7 & 8, look at comp element 8/2 = 4:
> comp[[4]]
[1] TRUE TRUE TRUE FALSE TRUE FALSE TRUE
EDIT 2: To have the comparisons as new columns in the dataframe:
new.names <- paste0('V', seq_len(n/2)) # V1, V2, ..., one name per compared pair
cc <- as.data.frame(comp, optional=TRUE)
names(cc) <- new.names
df.new <- cbind(df, cc)
After which, you have:
> df.new
number.x number.y day.x day.y school.x school.y city.x city.y V1 V2 V3 V4
1 1 3 1 4 a a 1 1 FALSE FALSE TRUE TRUE
2 2 4 3 5 b b 2 2 FALSE FALSE TRUE TRUE
3 3 5 4 6 b b 3 3 FALSE FALSE TRUE TRUE
4 4 6 5 7 c c 7 5 FALSE FALSE TRUE FALSE
5 5 1 6 8 n m 5 5 FALSE FALSE FALSE TRUE
6 6 2 7 7 f g 8 7 FALSE TRUE FALSE FALSE
7 7 7 8 8 h h 7 7 TRUE TRUE TRUE TRUE
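The index bookkeeping in the while loop above can also be avoided by pairing odd and even column positions directly. The following is a minimal sketch on a small toy data frame (the data and the naming of the results after the `.x` columns are assumptions made for illustration, not part of the answer above):

```r
# Toy data mirroring the question's layout: .x and .y columns alternate.
df <- data.frame(
  number.x = c(1, 2, 3), number.y = c(3, 2, 3),
  school.x = c("a", "b", "c"), school.y = c("a", "b", "d"),
  stringsAsFactors = FALSE
)

odd  <- seq(1, ncol(df), by = 2)  # positions of the .x columns
even <- odd + 1                   # positions of the matching .y columns

# One logical vector per pair of adjacent columns.
comp <- Map(function(i, j) df[[i]] == df[[j]], odd, even)

# Name each comparison after its variable (assumes the ".x" suffix).
names(comp) <- sub("\\.x$", "", names(df)[odd])

comp$number
comp$school
```

Like the loop, this assumes an even number of columns with each pair adjacent; `as.data.frame(comp)` gives the comparisons as columns that can be bound to `df`.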
I have a data.frame similar to this:
mydf=data.frame(LETTERS=LETTERS, rev_letters=rev(letters), var1=c(rep('a',10),rep('b',10),rep('c',6)), value=1:26)
> head(mydf)
LETTERS rev_letters var1 value
1 A z a 1
2 B y a 2
3 C x a 3
4 D w a 4
5 E v a 5
6 F u a 6
I want to select the row indexes that correspond to the columns and values stored in a list, like this one:
mylist=list(LETTERS=c('A','M','X'), var1='b')
> mylist
$LETTERS
[1] "A" "M" "X"
$var1
[1] "b"
I would like to do something like the following, but for all columns and values at once:
> which(mydf[,names(mylist)[1]] %in% mylist[[1]])
[1] 1 13 24
... or even better as a TRUE/FALSE variable:
> mydf[,names(mylist)[1]] %in% mylist[[1]]
[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[25] FALSE FALSE
The idea is to end up with a single variable of all the indexes for all the columns and values in the list; in the example above, the result would be:
> indexes
[1] 1 11 12 13 14 15 16 17 18 19 20 24
... or the TRUE/FALSE counterpart:
> indexes
[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
[13] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE
[25] FALSE FALSE
Thanks!
With %in% + sapply:
mydf=data.frame(LETTERS=LETTERS, rev_letters=rev(letters), var1=c(rep('a',10),rep('b',10),rep('c',6)), value=1:26)
mylist = list(LETTERS = c('A','M','X'), var1 = 'b')
rowSums(sapply(names(mylist), function(x) mydf[[x]] %in% mylist[[x]])) != 0
# [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[11] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#[21] FALSE FALSE FALSE TRUE FALSE FALSE
which(rowSums(sapply(names(mylist), function(x) mydf[[x]] %in% mylist[[x]])) != 0)
#[1] 1 11 12 13 14 15 16 17 18 19 20 24
Loop through names and use which:
sort(unique(unlist(sapply(names(mylist), function(i){
which(mydf[, i] %in% mylist[[ i ]])
}))))
# [1] 1 11 12 13 14 15 16 17 18 19 20 24
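Both answers above combine the per-column matches with a logical OR. A sketch of the same idea that makes the OR explicit with `Map()` and `Reduce()` (base R only; the data frame is trimmed to the two columns that the list refers to):

```r
mydf <- data.frame(
  LETTERS = LETTERS,
  var1 = c(rep("a", 10), rep("b", 10), rep("c", 6))
)
mylist <- list(LETTERS = c("A", "M", "X"), var1 = "b")

# One logical vector per list entry: does the column value appear in the entry?
hits <- Map(function(col, vals) mydf[[col]] %in% vals, names(mylist), mylist)

# A row is selected if any of the columns matched.
indexes <- Reduce(`|`, hits)

which(indexes)
```

Replacing `|` with `&` would instead select rows that match every entry of the list.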
I have a group of columns for each time and I want to convert it to a lot of boolean columns (one by category) with mutate() and across() like that :
data <- data.frame(category_t1 = c("A","B","C","C","A","B"),
category_t2 = c("A","C","B","B","B",NA),
category_t3 = c("C","C",NA,"B",NA,"A"))
data %>% mutate(across(starts_with("category"),
~case_when(.x == "A" ~ TRUE, !is.na(.x) ~ FALSE),
.names = "{str_replace(.col, 'category', 'A')}"),
across(starts_with("category"),
~case_when(.x == "B" ~ TRUE, !is.na(.x) ~ FALSE),
.names = "{str_replace(.col, 'category', 'B')}"),
across(starts_with("category"),
~case_when(.x == "C" ~ TRUE, !is.na(.x) ~ FALSE),
.names = "{str_replace(.col, 'category', 'C')}"))
Which makes :
category_t1 category_t2 category_t3 A_t1 A_t2 A_t3 B_t1 B_t2 B_t3 C_t1 C_t2
1 A A C TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
2 B C C FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
3 C B <NA> FALSE FALSE NA FALSE TRUE NA TRUE FALSE
4 C B B FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE
5 A B <NA> TRUE FALSE NA FALSE TRUE NA FALSE FALSE
6 B <NA> A FALSE NA TRUE TRUE NA FALSE FALSE NA
It works, but I would like to know if there is a better way, because here I am repeating the same code 3 times instead of writing it once (and imagine if I had to repeat it 10 times...). I thought I could do it with map() but I didn't manage to make it work.
I think there is a problem because of .names argument in across() that cannot connect with the string I use in case_when().
I think maybe there is something to do in the ... argument, like :
data %>% mutate(across(starts_with("category"),
~case_when(.x == mod ~ TRUE, !is.na(.x) ~ FALSE),
mod = levels(as.factor(data$category_t1)),
.names = "{str_replace(.col, 'category', mod)}"))
But of course that doesn't work here. Do you know how to do that ?
Thanks a lot.
We may use table in across
library(dplyr)
library(stringr)
library(tidyr)
data %>%
mutate(across(everything(), ~ as.data.frame.matrix(table(row_number(), .x) *
NA^(is.na(.x)) > 0),
.names = "{str_remove(.col, 'category_')}")) %>%
unpack(where(is.data.frame), names_sep = ".")
-output
# A tibble: 6 × 12
category_t1 category_t2 category_t3 t1.A t1.B t1.C t2.A t2.B t2.C t3.A t3.B t3.C
<chr> <chr> <chr> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
1 A A C TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
2 B C C FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
3 C B <NA> FALSE FALSE TRUE FALSE TRUE FALSE NA NA NA
4 C B B FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE
5 A B <NA> TRUE FALSE FALSE FALSE TRUE FALSE NA NA NA
6 B <NA> A FALSE TRUE FALSE NA NA NA TRUE FALSE FALSE
Or use model.matrix from base R
data1 <- replace(data, is.na(data), "NA")
lvls <- lapply(data1, \(x) levels(factor(x, levels = c("NA", "A", "B", "C"))))
m1 <- model.matrix(~ 0 + ., data = data1, xlev = lvls)
out <- cbind(data, m1[, -grep("NA", colnames(m1))] > 0)
-output
out
category_t1 category_t2 category_t3 category_t1A category_t1B category_t1C category_t2A category_t2B category_t2C category_t3A category_t3B category_t3C
1 A A C TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
2 B C C FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
3 C B <NA> FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE
4 C B B FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE
5 A B <NA> TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
6 B <NA> A FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
> colnames(out)
[1] "category_t1" "category_t2" "category_t3"
[4] "category_t1A" "category_t1B" "category_t1C"
[7] "category_t2A" "category_t2B" "category_t2C"
[10] "category_t3A" "category_t3B" "category_t3C"
Or another option with table
cbind(data, do.call(cbind.data.frame,
lapply(data, \(x) (table(seq_along(x), x)* NA^is.na(x)) > 0)))
-output
category_t1 category_t2 category_t3 category_t1.A category_t1.B category_t1.C category_t2.A category_t2.B category_t2.C category_t3.A category_t3.B
1 A A C TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
2 B C C FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
3 C B <NA> FALSE FALSE TRUE FALSE TRUE FALSE NA NA
4 C B B FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE
5 A B <NA> TRUE FALSE FALSE FALSE TRUE FALSE NA NA
6 B <NA> A FALSE TRUE FALSE NA NA NA TRUE FALSE
category_t3.C
1 TRUE
2 TRUE
3 NA
4 FALSE
5 NA
6 FALSE
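The `NA^is.na(x)` factor in the two `table` answers is what turns the rows of missing observations into NA. A small sketch of why it works, on a made-up three-element vector:

```r
x <- c("A", NA, "B")

# In R, anything^0 is 1 (even NA), while NA^1 stays NA.
mask <- NA^is.na(x)  # 1 for observed values, NA for missing ones

# table() drops NA by default, so row 2 is all zeros here.
tab <- table(seq_along(x), x)

# The mask is recycled down the rows: 0 * NA is NA, so row 2 becomes NA.
res <- (tab * mask) > 0
res
```

Without the mask, the row for the missing value would be all FALSE rather than all NA.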
This is not a tidyverse option (although it is pipe-compatible), but it is easily done with the fastDummies package:
fastDummies::dummy_cols(data, ignore_na = TRUE)
category_t1 category_t2 category_t3 category_t1_A category_t1_B category_t1_C category_t2_A category_t2_B category_t2_C category_t3_A category_t3_B category_t3_C
1 A A C 1 0 0 1 0 0 0 0 1
2 B C C 0 1 0 0 0 1 0 0 1
3 C B <NA> 0 0 1 0 1 0 NA NA NA
4 C B B 0 0 1 0 1 0 0 1 0
5 A B <NA> 1 0 0 0 1 0 NA NA NA
6 B <NA> A 0 1 0 NA NA NA 1 0 0
purrr's map_dfc could match well with your current approach:
library(dplyr)
library(purrr)
library(stringr) # needed for str_replace() in .names
bind_cols(data,
map_dfc(LETTERS[1:3], \(letter) { mutate(data,
across(starts_with("category"),
~ case_when(.x == letter ~ TRUE, !is.na(.x) ~ FALSE),
.names = paste0("{str_replace(.col, 'category', '", letter, "')}")),
.keep = "none") }
)
)
Or skip the bind_cols and use .keep = ifelse(letter == "A", "all", "none").
Output:
category_t1 category_t2 category_t3 A_t1 A_t2 A_t3 B_t1 B_t2 B_t3 C_t1 C_t2 C_t3
1 A A C TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
2 B C C FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE
3 C B <NA> FALSE FALSE NA FALSE TRUE NA TRUE FALSE NA
4 C B B FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE
5 A B <NA> TRUE FALSE NA FALSE TRUE NA FALSE FALSE NA
6 B <NA> A FALSE NA TRUE TRUE NA FALSE FALSE NA FALSE
A base solution with nested lapply():
cbind(data, lapply(data, \(x) {
lev <- levels(factor(x))
sapply(setNames(lev, lev), \(y) x == y)
}))
category_t1 category_t2 category_t3 category_t1.A category_t1.B category_t1.C category_t2.A category_t2.B category_t2.C category_t3.A category_t3.B category_t3.C
1 A A C TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
2 B C C FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
3 C B <NA> FALSE FALSE TRUE FALSE TRUE FALSE NA NA NA
4 C B B FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE
5 A B <NA> TRUE FALSE FALSE FALSE TRUE FALSE NA NA NA
6 B <NA> A FALSE TRUE FALSE NA NA NA TRUE FALSE FALSE
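The answers above differ in how they treat missing values: the `==`-based ones return NA for a missing category, while the model.matrix output shows FALSE. A minimal illustration of the two behaviours in base R, on a made-up vector:

```r
x <- c("A", NA, "B")

eq_result <- x == "A"    # comparison with NA yields NA
in_result <- x %in% "A"  # %in% never returns NA; a missing value is FALSE

eq_result
in_result
```

Which one is appropriate depends on whether a missing category should stay unknown or count as "not this category".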
The following figure shows what I want to do:
Grow a tree with rpart for some dataset
Create a table with one row per observation in the original data set and one column per node in the tree, plus an id. The nodes columns should take the value 1 if the observation belongs to that node and zero otherwise.
This is some code that I wrote:
library(rpart)
library(rattle)
data <- kyphosis
fit <- rpart(Age ~ Number + Start, data = kyphosis)
fancyRpartPlot(fit)
nodeNumbers <- as.numeric(rownames(fit$frame))
paths <- path.rpart(fit, nodeNumbers)
for(i in 1:length(nodeNumbers)){
nodeNumber <- nodeNumbers[i]
data[,paste0('gp', nodeNumber)] <- NA
path <- paths[[i]]
if(length(path) == 1) # i.e. we're at the root
data[,paste0('gp', nodeNumber)] <- 1 else
print('help')
}
data
Is there a package out there to do what I need? The only way that I can think of doing it is with some regular expression magic for the paths object. My guess/hope is that there is an easier way of doing this.
Is there a package out there to do what I need?
AFAIK, no, but this works in rpart version 4.1.13:
# function to get the binary matrix OP wants given the leaf index
get_nodes <- function(object, where){
rn <- row.names(object$frame)
edges <- descendants(as.numeric(rn))
o <- t(edges)[where, , drop = FALSE]
colnames(o) <- paste0("GP", rn)
o
}
environment(get_nodes) <- environment(rpart)
# use function
nodes <- get_nodes(fit, fit$where)
head(nodes, 9)
#R GP1 GP2 GP3 GP6 GP7 GP14 GP15
#R [1,] TRUE FALSE TRUE FALSE TRUE TRUE FALSE
#R [2,] TRUE FALSE TRUE FALSE TRUE FALSE TRUE
#R [3,] TRUE FALSE TRUE FALSE TRUE TRUE FALSE
#R [4,] TRUE TRUE FALSE FALSE FALSE FALSE FALSE
#R [5,] TRUE FALSE TRUE FALSE TRUE FALSE TRUE
#R [6,] TRUE FALSE TRUE TRUE FALSE FALSE FALSE
#R [7,] TRUE FALSE TRUE TRUE FALSE FALSE FALSE
#R [8,] TRUE FALSE TRUE TRUE FALSE FALSE FALSE
#R [9,] TRUE FALSE TRUE TRUE FALSE FALSE FALSE
# compare with
head(data, 9)
#R Kyphosis Age Number Start
#R 1 absent 71 3 5
#R 2 absent 158 3 14
#R 3 present 128 4 5
#R 4 absent 2 5 1
#R 5 absent 1 4 15
#R 6 absent 1 2 16
#R 7 absent 61 2 17
#R 8 absent 37 3 16
#R 9 absent 113 2 16
Here is the full code which fits the model, creates a function that can get the end leaf for a new data set, and creates and uses the above function
# do as OP
library(rpart)
library(rattle)
data <- kyphosis
fit <- rpart(Age ~ Number + Start, data = kyphosis)
fancyRpartPlot(fit)
# function that gives us the leaf index
get_where <- function(object, newdata, na.action = na.pass){
if (is.null(attr(newdata, "terms"))) {
Terms <- delete.response(object$terms)
newdata <- model.frame(Terms, newdata, na.action = na.action,
xlev = attr(object, "xlevels"))
if (!is.null(cl <- attr(Terms, "dataClasses")))
.checkMFClasses(cl, newdata, TRUE)
}
pred.rpart(object, rpart.matrix(newdata))
}
environment(get_where) <- environment(rpart)
# check that we get the correct value
where <- get_where(fit, data)
stopifnot(isTRUE(all.equal(
fit$frame$yval[where], unname(predict(fit, newdata = data)))))
# function to get the binary matrix OP wants given the leaf index
get_nodes <- function(object, where){
rn <- row.names(object$frame)
edges <- descendants(as.numeric(rn))
o <- t(edges)[where, , drop = FALSE]
colnames(o) <- paste0("GP", rn)
o
}
environment(get_nodes) <- environment(rpart)
# use function
nodes <- get_nodes(fit, where)
head(nodes, 9)
#R GP1 GP2 GP3 GP6 GP7 GP14 GP15
#R [1,] TRUE FALSE TRUE FALSE TRUE TRUE FALSE
#R [2,] TRUE FALSE TRUE FALSE TRUE FALSE TRUE
#R [3,] TRUE FALSE TRUE FALSE TRUE TRUE FALSE
#R [4,] TRUE TRUE FALSE FALSE FALSE FALSE FALSE
#R [5,] TRUE FALSE TRUE FALSE TRUE FALSE TRUE
#R [6,] TRUE FALSE TRUE TRUE FALSE FALSE FALSE
#R [7,] TRUE FALSE TRUE TRUE FALSE FALSE FALSE
#R [8,] TRUE FALSE TRUE TRUE FALSE FALSE FALSE
#R [9,] TRUE FALSE TRUE TRUE FALSE FALSE FALSE
# compare with
head(data, 9)
#R Kyphosis Age Number Start
#R 1 absent 71 3 5
#R 2 absent 158 3 14
#R 3 present 128 4 5
#R 4 absent 2 5 1
#R 5 absent 1 4 15
#R 6 absent 1 2 16
#R 7 absent 61 2 17
#R 8 absent 37 3 16
#R 9 absent 113 2 16
The code is from rpart:::predict.rpart and rpart::path.rpart. You can, of course, merge the get_where and get_nodes function if you want.
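As a possible alternative that avoids calling rpart's unexported helpers, the sketch below relies only on `fit$where`, the row names of `fit$frame`, and rpart's node numbering convention (the children of node k are numbered 2k and 2k+1, so the parent of k is k %/% 2). The helper `ancestors()` is a hypothetical name introduced here, and the sketch only covers the training data, not new observations:

```r
library(rpart)

fit <- rpart(Age ~ Number + Start, data = kyphosis)

node_ids <- as.numeric(rownames(fit$frame))  # node numbers, in frame order
leaf <- node_ids[fit$where]                  # leaf node of each training row

# Hypothetical helper: a node plus all of its ancestors up to the root (1).
ancestors <- function(k) {
  out <- k
  while (k > 1) {
    k <- k %/% 2
    out <- c(out, k)
  }
  out
}

# One row per observation, one column per node.
membership <- t(sapply(leaf, function(k) node_ids %in% ancestors(k)))
colnames(membership) <- paste0("GP", node_ids)

head(membership)
```

Multiplying the matrix by 1 (or using `+membership`) gives the 0/1 coding asked for in the question.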
Given a data.frame with some type of a flag or identifier column, I would like to be able to flag the surrounding (leading and lagging) records by some time window parameter, n. So given:
df <- data.frame(
id = letters[1:26],
flag = FALSE
)
df$flag[10] <- TRUE
df$flag[17] <- TRUE
I would like to write something like:
flag_surrounding <- function(flag, n) {
# should flag surrounding -n to +n records with condition flag
}
# expected results for n = 2, n = 1...
df
# id flag flag_n2 flag_n1
# 1 a FALSE FALSE FALSE
# 2 b FALSE FALSE FALSE
# 3 c FALSE FALSE FALSE
# 4 d FALSE FALSE FALSE
# 5 e FALSE FALSE FALSE
# 6 f FALSE FALSE FALSE
# 7 g FALSE FALSE FALSE
# 8 h FALSE TRUE FALSE
# 9 i FALSE TRUE TRUE
# 10 j TRUE TRUE TRUE
# 11 k FALSE TRUE TRUE
# 12 l FALSE TRUE FALSE
# 13 m FALSE FALSE FALSE
# 14 n FALSE FALSE FALSE
# 15 o FALSE TRUE FALSE
# 16 p FALSE TRUE TRUE
# 17 q TRUE TRUE TRUE
# 18 r FALSE TRUE TRUE
# 19 s FALSE TRUE FALSE
# 20 t FALSE FALSE FALSE
# 21 u FALSE FALSE FALSE
# 22 v FALSE FALSE FALSE
# 23 w FALSE FALSE FALSE
# 24 x FALSE FALSE FALSE
# 25 y FALSE FALSE FALSE
# 26 z FALSE FALSE FALSE
I started writing some things using dplyr::lead and dplyr::lag and variants with cumsum, but I felt like this is already in a package somewhere, but couldn't find it quickly (and not really sure how to phrase this as a question for googling) - maybe someone has better recall than me :)
The following does the trick (using ideas from this post), but feels a bit clunky and error prone. I'd be curious to get other approaches/techniques and/or something more robust from a package.
library(dplyr)
flag_surrounding <- function(flag, n) {
as.logical(cumsum(lead(flag, n, default = FALSE)) - cumsum(lag(flag, n + 1, default = FALSE)))
}
df %>%
mutate(flag_n2 = flag_surrounding(flag, 2),
flag_n1 = flag_surrounding(flag, 1))
Here's a simple solution in base:
set.seed(4)
df <- data.frame(
id = letters[1:26],
flag = as.logical(rbinom(n = 26, size = 1, prob = 0.1))
)
lead_lag_flag = function(x, n) {
flagged = which(x)
to_flag = sapply(flagged, function(z) (z - n):(z + n))
to_flag = pmax(0, to_flag)
to_flag = pmin(length(x), to_flag)
to_flag = unique(to_flag)
new_flag = rep(FALSE, length(x))
new_flag[to_flag] = TRUE
return(new_flag)
}
df$flag_n1 = lead_lag_flag(df$flag, 1)
df$flag_n2 = lead_lag_flag(df$flag, 2)
df
# id flag flag_n1 flag_n2
# 1 a FALSE FALSE FALSE
# 2 b FALSE FALSE FALSE
# 3 c FALSE FALSE FALSE
# 4 d FALSE FALSE FALSE
# 5 e FALSE FALSE FALSE
# 6 f FALSE FALSE TRUE
# 7 g FALSE TRUE TRUE
# 8 h TRUE TRUE TRUE
# 9 i TRUE TRUE TRUE
# 10 j FALSE TRUE TRUE
# 11 k FALSE FALSE TRUE
# 12 l FALSE FALSE TRUE
# 13 m FALSE TRUE TRUE
# 14 n TRUE TRUE TRUE
# 15 o FALSE TRUE TRUE
# 16 p FALSE TRUE TRUE
# 17 q TRUE TRUE TRUE
# 18 r FALSE TRUE TRUE
# 19 s TRUE TRUE TRUE
# 20 t FALSE TRUE TRUE
# 21 u FALSE TRUE TRUE
# 22 v TRUE TRUE TRUE
# 23 w FALSE TRUE TRUE
# 24 x FALSE FALSE TRUE
# 25 y FALSE FALSE FALSE
# 26 z FALSE FALSE FALSE
Another base alternative:
n <- 1
nm <- paste0("flag", n)
i <- -n:n
df[ , nm] <- FALSE
ix <- rep(which(df$flag), each = length(i)) + i
ix <- ix[ix > 0 & ix <= nrow(df)] # drop indices that fall outside the data frame
df[ix, nm] <- TRUE
df
# id flag flag1
# 1 a FALSE FALSE
# 2 b FALSE FALSE
# 3 c FALSE FALSE
# 4 d FALSE FALSE
# 5 e FALSE FALSE
# 6 f FALSE FALSE
# 7 g FALSE FALSE
# 8 h FALSE FALSE
# 9 i FALSE TRUE
# 10 j TRUE TRUE
# 11 k FALSE TRUE
# 12 l FALSE FALSE
# 13 m FALSE FALSE
# 14 n FALSE FALSE
# 15 o FALSE FALSE
# 16 p FALSE TRUE
# 17 q TRUE TRUE
# 18 r FALSE TRUE
# 19 s FALSE FALSE
# 20 t FALSE FALSE
# 21 u FALSE FALSE
# 22 v FALSE FALSE
# 23 w FALSE FALSE
# 24 x FALSE FALSE
# 25 y FALSE FALSE
# 26 z FALSE FALSE
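For comparison, here is a direct (if less efficient) sketch that checks, for every position, whether any flag lies within n records of it. The function name `flag_window` is made up for this example:

```r
flag_window <- function(flag, n) {
  len <- length(flag)
  vapply(seq_len(len), function(i) {
    # Clamp the window to the valid index range before looking for a flag.
    any(flag[max(1, i - n):min(len, i + n)])
  }, logical(1))
}

# Same flags as in the question: positions 10 and 17.
flag <- rep(FALSE, 26)
flag[c(10, 17)] <- TRUE

which(flag_window(flag, 1))
which(flag_window(flag, 2))
```

This does work proportional to length(flag) * n, so the index-based approaches above are preferable for long vectors; its advantage is that the edge handling is easy to read.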
I have data.frame which looks like this:
v1 <- c(1:10)
v2 <- c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
dfb <- data.frame(v1, v2)
> dfb
v1 v2
1 1 FALSE
2 2 FALSE
3 3 TRUE
4 4 FALSE
5 5 FALSE
6 6 FALSE
7 7 TRUE
8 8 FALSE
9 9 FALSE
10 10 FALSE
I need the following operations:
split the data.frame into intervals wherever v2 is TRUE
rows where v2 is TRUE will be the last element of an interval
if the last element is not TRUE, it will be treated as if it were (this can easily be achieved by setting the last position of the vector to TRUE)
print the first and last v1 values of each interval
After these operations, my results should look like this:
> df_final
Vx Vy
1 3
4 7
8 10
I've tried cumsum on v2 vector but TRUE values are treated as first interval element not last
> split(v2, cumsum(v2==TRUE))
$`0`
[1] FALSE FALSE
$`1`
[1] TRUE FALSE FALSE FALSE
$`2`
[1] TRUE FALSE FALSE FALSE
You can still use cumsum, you just have to slightly adjust v2:
v3 <- c(TRUE,v2[-length(v2)])
v3
[1] TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
res <- split(v2,cumsum(v3))
res[[length(res)]][length(res[[length(res)]])] <- TRUE # base R; last() would need dplyr or data.table
res
$`1`
[1] FALSE FALSE TRUE
$`2`
[1] FALSE FALSE FALSE TRUE
$`3`
[1] FALSE FALSE TRUE
df_final <- data.frame(Vx=which(v3),Vy=which(unlist(res,use.names=F)))
df_final
Vx Vy
1 1 3
2 4 7
3 8 10
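The shifted vector works because each TRUE in v2 should close an interval, so the next row has to open a new one. A compact sketch of the same idea that builds a group id and reads the first and last v1 value of each group (base R only; the name `grp` is introduced here for illustration):

```r
v1 <- 1:10
v2 <- c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)

# A new group starts at row 1 and on every row that follows a TRUE.
grp <- cumsum(c(TRUE, head(v2, -1)))

df_final <- data.frame(
  Vx = as.vector(tapply(v1, grp, function(z) z[1])),          # first value
  Vy = as.vector(tapply(v1, grp, function(z) z[length(z)]))   # last value
)
df_final
```

Because the last group simply runs to the end of the data, no special handling is needed when the final element of v2 is FALSE, and v1 does not have to be numeric.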
Get df_final
Vy <- unique(c(which(dfb$v2), nrow(dfb))) # unique() avoids a duplicate end if the last row is already TRUE
Vx <- c(1,Vy[-length(Vy)]+1)
df_final <- data.frame(Vx,Vy)
Split Df
library(data.table)
split_ind <- rleid(dfb$v2)-!(rleid(dfb$v2) %% 2)
split(dfb,split_ind)
I will also post my answer heavily inspired by Eldioo, this one is useful also when V1 are non numeric values and avoids using split and cumsum functions.
Input:
v1 <- letters[1:10]
v2 <- c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
dfb <- data.frame(v1, v2)
> dfb
v1 v2
1 a FALSE
2 b FALSE
3 c TRUE
4 d FALSE
5 e FALSE
6 f FALSE
7 g TRUE
8 h FALSE
9 i FALSE
10 j FALSE
Solution:
# data wrangling
library(data.table)
dfb["v3"] <- c(TRUE,dfb$v2[-length(dfb$v2)])
dfb["v4"] <- dfb$v2
dfb$v4[length(dfb$v4)] <- T
Vx <- which(dfb$v3)
Vy <- which(dfb$v4)
Vx <- dfb[Vx, ]$v1
Vy <- dfb[Vy, ]$v1
# for debugging purposes
dfb
v1 v2 v3 v4
1 a FALSE TRUE FALSE
2 b FALSE FALSE FALSE
3 c TRUE FALSE TRUE
4 d FALSE TRUE FALSE
5 e FALSE FALSE FALSE
6 f FALSE FALSE FALSE
7 g TRUE FALSE TRUE
8 h FALSE TRUE FALSE
9 i FALSE FALSE FALSE
10 j FALSE FALSE TRUE
# final results
data.frame(Vx, Vy)
Vx Vy
1 a c
2 d g
3 h j