R: expand sequence of binary values from two columns - r

I have a very complex dataset, but for simplicity I am going to use this data. Assume we have one column X holding numbers with gaps between them and another column that is binary 0/1. I want to expand X into a full sequence in steps of 1 and carry the binary value along. This will create new, longer columns (newX, newbinary) from the original X and binary columns.
X binary
1 1 1
2 3 0
3 5 0
4 6 1
5 8 1
6 10 0
I want to expand X column here and put the binary value next to it. Something like,
newX newbinary
1 1 1
2 2 1
3 3 0
4 4 0
5 5 0
6 6 1
7 7 1
8 8 1
9 9 1
10 10 0
My attempt is this. First, I created the new X column (expanded X).
newX <- seq(X[1], X[length])
Then I used a for loop to iterate through newX and compare each value with X to check whether it is equal to or less than that value. If newX[i] is not equal to X[i], put the previous binary value; otherwise put binary[i].
for (i in 1:newX[length])
{
newbinary= ifelse((newX != X)&(between(newX[i], X[i],X[j+1])), lag(binary), binary)
}
But this is not working and gives me this:
newX newbinary
1 1 1
2 2 NA
3 3 NA
4 4 NA
5 5 NA
6 6 NA
I don't really know how the newX column (longer one) will iterate through X (shorter one) and put the value correspondingly.
How can I achieve this in R?

Try this tidyverse approach creating a dataframe for the sequence, then use left_join() and finally fill() to complete the binary variable:
library(dplyr)
library(tidyr)
#Code
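# Build the full sequence, join the original rows by X, then carry binary downwards
newdf <- data.frame(X = seq(min(df$X), max(df$X), by = 1)) %>%
  left_join(df, by = "X") %>%
  fill(binary)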
Output:
X binary
1 1 1
2 2 1
3 3 0
4 4 0
5 5 0
6 6 1
7 7 1
8 8 1
9 9 1
10 10 0
Some data used:
#Data
df <- structure(list(X = c(1L, 3L, 5L, 6L, 8L, 10L), binary = c(1L,
0L, 0L, 1L, 1L, 0L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
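For comparison, the same expansion can be sketched in base R without a join. This is only an illustrative alternative, not part of the answer above. It assumes X is sorted in increasing order, and it uses findInterval() to look up, for every value of the expanded sequence, the last original X that is less than or equal to it. The names newX, idx and newdf are chosen here for illustration.

```r
# Data from the question
df <- data.frame(X = c(1L, 3L, 5L, 6L, 8L, 10L),
                 binary = c(1L, 0L, 0L, 1L, 1L, 0L))

# Expanded sequence covering the full range of X
newX <- seq(min(df$X), max(df$X))

# For each newX, the index of the last df$X that is <= newX
# (assumption: df$X is sorted in increasing order)
idx <- findInterval(newX, df$X)

# Pick the binary value belonging to that row
newdf <- data.frame(newX = newX, newbinary = df$binary[idx])
newdf
```

Because the lookup is vectorised, no loop over newX is needed, which is the step the original attempt struggled with.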

Does this work:
library(dplyr)
library(tidyr)
library(purrr)
df %>% mutate(y = lead(X, default = max(X)+1), newX = map2(X,y-1,`:`)) %>%
unnest(newX) %>% mutate(newbinary = binary) %>% select(newX, newbinary)
# A tibble: 10 x 2
newX newbinary
<int> <dbl>
1 1 1
2 2 1
3 3 0
4 4 0
5 5 0
6 6 1
7 7 1
8 8 1
9 9 1
10 10 0
Data used:
df
# A tibble: 6 x 2
X binary
<dbl> <dbl>
1 1 1
2 3 0
3 5 0
4 6 1
5 8 1
6 10 0
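The idea behind the lead()-based pipeline (each row covers every integer from its own X up to just before the next X) can also be written with rep() in base R. The following is a rough sketch under the assumption that X is sorted and contains no duplicates; the names times and res are illustrative only.

```r
X <- c(1, 3, 5, 6, 8, 10)
binary <- c(1, 0, 0, 1, 1, 0)

# Number of integers each row covers: from its X up to, but not including,
# the next X. The last row covers only itself, hence the max(X) + 1 sentinel.
times <- diff(c(X, max(X) + 1))

# Repeat each binary value that many times alongside the full sequence
res <- data.frame(newX = seq(min(X), max(X)),
                  newbinary = rep(binary, times = times))
res
```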

Assuming the data frame shown reproducibly in the Note at the end, the following one-liner uses read.zoo to convert it to a zoo object and then converts that to a ts object, expanding it at the same time. It then uses na.locf to fill in the NAs and fortify.zoo to convert the result to a data.frame. Finally, it sets the names on the data frame to the original names.
library(zoo)
setNames(fortify.zoo(na.locf(as.ts(read.zoo(DF)))), names(DF))
giving:
X binary
1 1 1
2 2 1
3 3 0
4 4 0
5 5 0
6 6 1
7 7 1
8 8 1
9 9 1
10 10 0
The above one-liner could alternatively be written as a pipeline:
library(zoo)
library(magrittr)
DF %>%
read.zoo %>%
as.ts %>%
na.locf %>%
fortify.zoo %>%
setNames(names(DF))
Note
Lines <- "
X binary
1 1 1
2 3 0
3 5 0
4 6 1
5 8 1
6 10 0"
DF <- read.table(text = Lines)

Related

How to count groupings of elements in base R or dplyr using multiple conditions?

I am trying to count the number of elements by groupings, subject to the condition that each grouping code ("Group") is > 0. Suppose we start with the below output DF generated via the code immediately beneath:
Element Group reSeq
<chr> <dbl> <int>
1 R 0 1
2 R 0 1
3 X 0 1
4 X 1 2
5 X 1 2
6 X 0 1
7 X 0 1
8 X 0 1
9 B 0 1
10 R 0 1
11 R 2 2
12 R 2 2
13 X 3 3
14 X 3 3
15 X 3 3
library(dplyr)
myDF <- data.frame(
Element = c("R","R","X","X","X","X","X","X","B","R","R","R","X","X","X"),
Group = c(0,0,0,1,1,0,0,0,0,0,2,2,3,3,3)
)
myDF %>% group_by(Element) %>% mutate(reSeq = match(Group, unique(Group)))
Instead, I would like the reSeq column to calculate and output as shown below with explanations to the right:
Element Group reSeq reSeq explanation
<chr> <dbl> <int>
1 R 0 1 1st instance of R (ungrouped)(Group = 0 means not grouped)
2 R 0 2 2nd instance of R (ungrouped)(Group = 0 means not grouped)
3 X 0 1 1st instance of X (ungrouped)(Group = 0 means not grouped)
4 X 1 2 2nd instance of X (grouped by Group = 1)
5 X 1 2 2nd instance of X (grouped by Group = 1)
6 X 0 3 3rd instance of X (ungrouped)
7 X 0 4 4th instance of X (ungrouped)
8 X 0 5 5th instance of X (ungrouped)
9 B 0 1 1st instance of B (ungrouped)
10 R 0 3 3rd instance of R (ungrouped)
11 R 2 4 4th instance of R (grouped by Group = 2)
12 R 2 4 4th instance of R (grouped by Group = 2)
13 X 3 6 6th instance of X (grouped by Group = 3)
14 X 3 6 6th instance of X (grouped by Group = 3)
15 X 3 6 6th instance of X (grouped by Group = 3)
Any recommendations for doing this? If possible, starting with the dplyr code I use above because I am fairly familiar with it.
If we use rowid from data.table, we can skip a couple of steps:
library(dplyr)
library(data.table)
library(tidyr)
myDF %>%
mutate(reSeq = rowid(Element) * NA^!(Group == 0 |!duplicated(Group))) %>%
group_by(Element) %>%
fill(reSeq) %>%
mutate(reSeq = match(reSeq, unique(reSeq))) %>%
ungroup
-output
# A tibble: 15 × 3
Element Group reSeq
<chr> <dbl> <int>
1 R 0 1
2 R 0 2
3 X 0 1
4 X 1 2
5 X 1 2
6 X 0 3
7 X 0 4
8 X 0 5
9 B 0 1
10 R 0 3
11 R 2 4
12 R 2 4
13 X 3 6
14 X 3 6
15 X 3 6
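The `NA^!(...)` expression in the code above is a compact masking idiom and may be hard to read at first. A minimal sketch of how it behaves, using made-up vectors keep and x that are not part of the original data:

```r
keep <- c(TRUE, FALSE, TRUE)
x <- c(10, 20, 30)

# !keep is TRUE where the value should be blanked out.
# NA^TRUE is NA^1, which is NA; NA^FALSE is NA^0, which is 1.
mask <- NA^!keep

# Multiplying keeps x where keep is TRUE and yields NA elsewhere
masked <- x * mask
masked
```

In the answer, the rows left as NA this way are the repeated rows of a nonzero Group, which fill() then overwrites with the value from the first row of that group.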
Below is what I managed to cobble together. Maybe there's a cleaner solution? Here's the code:
library(dplyr)
library(tidyr)
myDF %>%
group_by(Element) %>%
mutate(eleCnt = row_number()) %>%
ungroup()%>%
mutate(reSeq = ifelse(Group == 0 | Group != lag(Group), eleCnt,0)) %>%
mutate(reSeq = na_if(reSeq, 0)) %>%
group_by(Element) %>%
fill(reSeq) %>%
mutate(reSeq = match(reSeq, unique(reSeq))) %>%
ungroup
And here's the output:
# A tibble: 15 x 4
Element Group eleCnt reSeq
<chr> <dbl> <int> <int>
1 R 0 1 1
2 R 0 2 2
3 X 0 1 1
4 X 1 2 2
5 X 1 3 2
6 X 0 4 3
7 X 0 5 4
8 X 0 6 5
9 B 0 1 1
10 R 0 3 3
11 R 2 4 4
12 R 2 5 4
13 X 3 7 6
14 X 3 8 6
15 X 3 9 6
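One possibly shorter way to get the same numbering is a cumulative count of "starts" within each Element. This is only a sketch, not taken from either answer, and it rests on an assumption: within an Element, rows that share a nonzero Group code are adjacent. A row starts a new instance when its Group is 0 or differs from the previous Group of the same Element.

```r
myDF <- data.frame(
  Element = c("R","R","X","X","X","X","X","X","B","R","R","R","X","X","X"),
  Group = c(0,0,0,1,1,0,0,0,0,0,2,2,3,3,3)
)

# Within each Element: a row starts a new instance if it is ungrouped
# (Group == 0) or if its Group differs from the previous row's Group.
# cumsum() over those starts gives the running instance number.
count_instances <- function(g) cumsum(g == 0 | c(TRUE, diff(g) != 0))

# ave() applies the function per Element and puts results back in row order
myDF$reSeq <- ave(myDF$Group, myDF$Element, FUN = count_instances)
myDF
```

If a nonzero Group code could reappear later in non-adjacent rows of the same Element, this sketch would count it as a new instance, so the duplicated()-based answer may be the safer choice in that case.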

Simple operation with lagged values

I need to calculate line-wise simple operations using lagged values, for example the sum for a variable for the previous x years
I tried:
toy %>%
group_by(student) %>%
mutate(lag_passed = sum(lag(passed, n = 5, order_by = year, default = 0)))
toy %>%
group_by(student) %>%
arrange(year) %>%
mutate(lag_passed = lapply(passed, function(x) sum(lag(x, n = 5, default = 0))))
Reproducible example. Task: sum the number of passed tests in the previous five years.
toy <- data.frame(student = rep("A",10),
year=c(1:10),
passed=c(0,0,0,1,2,0,0,1,0,1))
student year passed
1 A 1 0
2 A 2 0
3 A 3 0
4 A 4 1
5 A 5 2
6 A 6 0
7 A 7 0
8 A 8 1
9 A 9 0
10 A 10 1
expected <- data.frame(student = rep("A",10),
year=c(1:10),
passed=c(0,0,0,1,2,0,0,1,0,1),
lag_passed=c(0,0,0,0,1,3,3,3,4,3))
student year passed lag_passed
1 A 1 0 0
2 A 2 0 0
3 A 3 0 0
4 A 4 1 0
5 A 5 2 1
6 A 6 0 3
7 A 7 0 3
8 A 8 1 3
9 A 9 0 4
10 A 10 1 3
runner::sum_run() will help here. Passing idx = year is optional unless some years are missing, in which case the window takes the missing years into account as well; that is not the case with the sample data. Grouping on student is added because, in practice, you will probably want to carry out the operation for each student. Note that the first row comes out as NA rather than 0, because no earlier years exist for it.
toy <- data.frame(student = rep("A",10),
year=c(1:10),
passed=c(0,0,0,1,2,0,0,1,0,1))
library(dplyr)
library(runner)
toy %>% group_by(student) %>%
mutate(lag_passed = sum_run(x = passed,
idx = year,
k = 5,
lag = 1))
#> # A tibble: 10 x 4
#> # Groups: student [1]
#> student year passed lag_passed
#> <chr> <int> <dbl> <dbl>
#> 1 A 1 0 NA
#> 2 A 2 0 0
#> 3 A 3 0 0
#> 4 A 4 1 0
#> 5 A 5 2 1
#> 6 A 6 0 3
#> 7 A 7 0 3
#> 8 A 8 1 3
#> 9 A 9 0 4
#> 10 A 10 1 3
Created on 2021-05-15 by the reprex package (v2.0.0)
Another rolling sum solution with zoo::rollapply:
f <- function(x) {zoo::rollapply(x, 6, sum, align = 'right', partial = TRUE) - x}
expected %>%
group_by(student) %>%
arrange(year) %>%
mutate(lag_passed2 = f(passed)) %>%
ungroup()
# student year passed lag_passed lag_passed2
# <chr> <int> <dbl> <dbl> <dbl>
# 1 A 1 0 0 0
# 2 A 2 0 0 0
# 3 A 3 0 0 0
# 4 A 4 1 0 0
# 5 A 5 2 1 1
# 6 A 6 0 3 3
# 7 A 7 0 3 3
# 8 A 8 1 3 3
# 9 A 9 0 4 4
# 10 A 10 1 3 3
lag_passed2, created with the helper function, is the same as lag_passed. The idea is to calculate a sliding-window sum with a window length of 6 (allowing partial windows at the beginning via partial = TRUE and align = 'right'), then subtract the passed value of the current year.
Note: the helper function f can be replaced by a simpler one that specifies the window using offsets and the default right alignment, as pointed out by @G. Grothendieck:
f <- function(x) zoo::rollapplyr(x, list(-seq(5)), sum, partial = TRUE, fill = 0)
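If adding a package is not desirable, the same "sum of the previous five rows" can be sketched in base R with cumulative sums. This is an illustrative alternative, not part of the answers above. It assumes the rows are already sorted by year, the years are consecutive, and there are no NA values; lag_sum is a hypothetical helper name.

```r
# Sum of the k values before each position (the current value is excluded)
lag_sum <- function(x, k = 5) {
  cs <- c(0, cumsum(x))      # cs[j + 1] holds sum(x[1:j]); cs[1] is 0
  i <- seq_along(x)
  # sum(x[max(1, i - k):(i - 1)]) is a difference of two running totals
  cs[i] - cs[pmax(i - k, 1)]
}

passed <- c(0, 0, 0, 1, 2, 0, 0, 1, 0, 1)
lag_passed <- lag_sum(passed)
lag_passed
```

For several students, the helper could be applied per group, for example with ave(toy$passed, toy$student, FUN = lag_sum) after sorting by year.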

Conditional values [duplicate]

This question already has answers here:
Convert numeric vector to binary (0/1) based on limit
(4 answers)
Closed 2 years ago.
I'm having trouble with some data, and I think it's easy to solve. I have a subset like this:
data <- data.frame("treat" = 1:10, "value" = c(12,32,41,0,12,13,11,0,12,0))
And what I need is a third column that returns to me the value "1" when the value on second column is different from 0, and returns "0" when the value on the second column is equal 0. Like this:
data$param <- c(1,1,1,0,1,1,1,0,1,0)
I tried to do this with if() and else, but I couldn't get it to work.
You can try:
data$param <- ifelse(data$value != 0, 1, 0)
or you can use the dplyr library:
library(dplyr)
data %>%
  mutate(param = case_when(value != 0 ~ 1, TRUE ~ 0))
or
data$param <- +(data$value != 0)
data$param <- as.integer(data$value != 0)
data
treat value param
1 1 12 1
2 2 32 1
3 3 41 1
4 4 0 0
5 5 12 1
6 6 13 1
7 7 11 1
8 8 0 0
9 9 12 1
10 10 0 0
Here is another alternative using cut function.
library(dplyr)
data %>%
mutate(param = cut(value, breaks = c(-Inf,0,max(value)), labels = c(0,1)))
# treat value param
# 1 1 12 1
# 2 2 32 1
# 3 3 41 1
# 4 4 0 0
# 5 5 12 1
# 6 6 13 1
# 7 7 11 1
# 8 8 0 0
# 9 9 12 1
# 10 10 0 0
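The approaches above print the same 0/1 column but do not return the same type, which can matter in later arithmetic. A small sketch comparing them on the question's values (the variable names p_ifelse, p_int and p_cut are made up for this illustration):

```r
value <- c(12, 32, 41, 0, 12, 13, 11, 0, 12, 0)

p_ifelse <- ifelse(value != 0, 1, 0)   # numeric (double)
p_int <- as.integer(value != 0)        # integer
p_cut <- cut(value, breaks = c(-Inf, 0, max(value)), labels = c(0, 1))  # factor

class(p_ifelse)
class(p_int)
class(p_cut)

# A factor has to go through as.character() to recover the numeric labels
p_cut_int <- as.integer(as.character(p_cut))
```

Note also that the cut() version maps every value at or below 0 to the label 0, so negative values would be labelled 0 there, whereas value != 0 would label them 1.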

Turning factor variable into a list of binary variable per row (trial) in R [duplicate]

This question already has answers here:
Transform one column from categoric to binary, keep the rest [duplicate]
(3 answers)
Closed 3 years ago.
A while ago I posted a question about how to convert a factor data.frame into a binary (one-hot encoded) data.frame here. Now I am trying to find the most efficient way to loop over trials (rows) and binarize a factor variable. A minimal example would look like this:
d = data.frame(
Trial = c(1,2,3,4,5,6,7,8,9,10),
Category = c('a','b','b','b','a','b','a','a','b','a')
)
d
Trial Category
1 1 a
2 2 b
3 3 b
4 4 b
5 5 a
6 6 b
7 7 a
8 8 a
9 9 b
10 10 a
While I would like to get this:
Trial a b
1 1 1 0
2 2 0 1
3 3 0 1
4 4 0 1
5 5 1 0
6 6 0 1
7 7 1 0
8 8 1 0
9 9 0 1
10 10 1 0
What would be the most efficient way of doing it?
Here is an option with pivot_wider. Create a column of 1's, then apply pivot_wider with names_from set to 'Category' and values_from set to the newly created column:
library(dplyr)
library(tidyr)
d %>%
mutate(n = 1) %>%
pivot_wider(names_from = Category, values_from = n, values_fill = list(n = 0))
# A tibble: 10 x 3
# Trial a b
# <dbl> <dbl> <dbl>
# 1 1 1 0
# 2 2 0 1
# 3 3 0 1
# 4 4 0 1
# 5 5 1 0
# 6 6 0 1
# 7 7 1 0
# 8 8 1 0
# 9 9 0 1
#10 10 1 0
A more efficient option would be data.table:
library(data.table)
dcast(setDT(d), Trial ~ Category, length)
It can also be done with base R
table(d)
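Note that table(d) returns a contingency table object rather than a data.frame with a Trial column. If a plain data.frame in the layout requested by the question is needed, one possible base R sketch is to build one indicator column per category level and bind it to Trial. The names lev, onehot and out are illustrative only.

```r
d <- data.frame(
  Trial = 1:10,
  Category = c("a", "b", "b", "b", "a", "b", "a", "a", "b", "a")
)

# Category levels in a fixed order
lev <- sort(unique(d$Category))

# One logical column per level, named after the level;
# the unary + converts TRUE/FALSE to 1/0
onehot <- +sapply(lev, function(k) d$Category == k)

# Bind the indicator columns to the Trial column
out <- cbind(d["Trial"], onehot)
out
```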

Finding values in consecutive rows

An example of the dataframe I have is given below.
ID X
1 1
2 2
3 1
4 0
5 0
6 1
7 4
8 5
9 6
10 7
11 0
12 0
I want to apply logic to it that looks to see whether 3 or more consecutive rows have a value >0 in it. If they do I want to flag them in another column. Hence the output will look as follows.
ID X Y
1 1 1
2 2 1
3 1 1
4 0 0
5 0 0
6 1 1
7 4 1
8 5 1
9 6 1
10 7 1
11 0 0
12 0 0
EXTENSION -
How would I get the following output, giving a different Y value for each group?
ID X Y
1 1 1
2 2 1
3 1 1
4 0 0
5 0 0
6 1 2
7 4 2
8 5 2
9 6 2
10 7 2
11 0 0
12 0 0
One option with base R: use rle to find the runs of adjacent values in 'X' that are greater than 0, then replicate the flag for each run based on its length:
df1$Y <- with(rle(df1$X > 0), as.integer(rep(values & lengths > 2, lengths)))
df1$Y
#[1] 1 1 1 0 0 1 1 1 1 1 0 0
For the updated case in the OP's post
df1$Y <- inverse.rle(within.list(rle(df1$X > 0), {
i1 <- values & (lengths > 2)
values[i1] <- seq_along(values[i1])}))
df1$Y
#[1] 1 1 1 0 0 2 2 2 2 2 0 0
Or using rleid from data.table
library(data.table)
setDT(df1)[, Y := as.integer((.N > 2) * (X > 0)),rleid(X > 0)]
data
df1 <- structure(list(ID = 1:12, X = c(1L, 2L, 1L, 0L, 0L, 1L, 4L, 5L,
6L, 7L, 0L, 0L)), class = "data.frame", row.names = c(NA, -12L
))
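To make the rle-based code above easier to follow, here is a step-by-step sketch of the intermediate values on the same X. It is a decomposition for illustration, not a different method; the names r, keep, Y_flag and Y_id are made up here.

```r
X <- c(1, 2, 1, 0, 0, 1, 4, 5, 6, 7, 0, 0)

# Run-length encode the condition X > 0
r <- rle(X > 0)
r$lengths   # length of each run
r$values    # whether each run is a run of positive values

# A run qualifies if it is positive and has at least 3 rows
keep <- r$values & r$lengths >= 3

# Simple 0/1 flag, expanded back to one value per row
Y_flag <- rep(as.integer(keep), r$lengths)

# Running id for qualifying runs, 0 for all other runs
Y_id <- rep(cumsum(keep) * keep, r$lengths)

Y_flag
Y_id
```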
We can use rleid from data.table to create groups, use it in ave to get the length of each group, and assign 1 to groups of positive values whose length is greater than or equal to 3. The extra X > 0 check keeps runs of three or more zeros from being flagged.
library(data.table)
df$Y <- as.integer(ave(df$X, rleid(df$X > 0), FUN = length) >= 3 & df$X > 0)
df
# ID X Y
#1 1 1 1
#2 2 2 1
#3 3 1 1
#4 4 0 0
#5 5 0 0
#6 6 1 1
#7 7 4 1
#8 8 5 1
#9 9 6 1
#10 10 7 1
#11 11 0 0
#12 12 0 0
EDIT
For the updated post, we could combine the rleid part above with dplyr:
library(dplyr)
library(data.table)
df %>%
group_by(group = rleid(X > 0)) %>%
mutate(Y = ifelse(n() >= 3 & X > 0 & row_number() == 1, 1, 0)) %>%
ungroup() %>%
mutate(Y = cumsum(Y) * Y) %>%
group_by(group) %>%
mutate(Y = first(Y)) %>%
ungroup() %>%
select(-group)
# ID X Y
# <int> <int> <dbl>
# 1 1 1 1
# 2 2 2 1
# 3 3 1 1
# 4 4 0 0
# 5 5 0 0
# 6 6 1 2
# 7 7 4 2
# 8 8 5 2
# 9 9 6 2
#10 10 7 2
#11 11 0 0
#12 12 0 0

Resources