Summarize Gaps in Binary Data using R

I am playing around with binary data.
I have data in columns in the following manner:
A B C D E F G H I J K L M N
-----------------------------------------------------
1 1 1 1 1 1 1 1 1 0 0 0 0 0
0 0 0 0 1 1 1 0 1 1 0 0 1 0
0 0 0 0 0 0 0 1 1 1 1 1 0 0
1 indicates that the system was on and 0 indicates that the system was off.
I am trying to figure out a way to summarize the gaps between the on/off transitions of these systems.
For example,
the first row stops working after 'I';
the second row works from 'E' to 'G', works again from 'I' to 'J' and at 'M', but is off otherwise.
Is there a way to summarize this?
I wish to see my result in the following form
row-number Number of 1's Range
------------ ------------------ ------
1 9 A-I
2 3 E-G
2 2 I-J
2 1 M
3 5 H-L

Here's a tidyverse solution (using the df1 defined in the data block further down):
library(tidyverse)
df1 %>%
  rowid_to_column() %>%
  gather(col, val, -rowid) %>%
  group_by(rowid) %>%
  # This counts the number of times a new streak starts
  mutate(grp_num = cumsum(val != lag(val, default = -99))) %>%
  filter(val == 1) %>%
  group_by(rowid, grp_num) %>%
  summarise(num_1s = n(),
            range = paste0(first(col), "-", last(col)))
## A tibble: 5 x 4
## Groups: rowid [3]
# rowid grp_num num_1s range
# <int> <int> <int> <chr>
#1 1 1 9 A-I
#2 2 2 3 E-G
#3 2 4 2 I-J
#4 2 6 1 M-M
#5 3 2 5 H-L
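This prints single-column streaks as M-M, while the desired output shows just M; making the range conditional in summarise() handles that. A minimal sketch of the same pipeline with that tweak (same packages, same df1):
df1 %>%
  rowid_to_column() %>%
  gather(col, val, -rowid) %>%
  group_by(rowid) %>%
  mutate(grp_num = cumsum(val != lag(val, default = -99))) %>%
  filter(val == 1) %>%
  group_by(rowid, grp_num) %>%
  summarise(num_1s = n(),
            # show single-column streaks as e.g. "M" instead of "M-M"
            range = if_else(first(col) == last(col), first(col),
                            paste0(first(col), "-", last(col))))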

An option with data.table: convert the 'data.frame' to 'data.table' while keeping the row names as a column 'rn' (setDT with keep.rownames = TRUE), melt from 'wide' to 'long' format with id.var = 'rn', create a run-length id (rleid) column on 'value' grouped by 'rn', subset the rows where 'value' is 1, summarise with the number of rows (.N) and the pasted range of 'variable' values, grouped by 'grp' and 'rn', assign the helper column 'grp' to NULL, and order by 'rn' if necessary.
library(data.table)
melt(setDT(df1, keep.rownames = TRUE), id.var = 'rn')[,
grp := rleid(value), rn][value == 1, .(NumberOfOnes = .N,
Range = paste(range(as.character(variable)), collapse="-")),
.(grp, rn)][, grp := NULL][order(rn)]
# rn NumberOfOnes Range
#1: 1 9 A-I
#2: 2 3 E-G
#3: 2 2 I-J
#4: 2 1 M-M
#5: 3 5 H-L
Or using base R with rle
do.call(rbind, apply(df1, 1, function(x) {
  rl <- rle(x)
  i1 <- rl$values == 1
  l1 <- rl$lengths[i1]
  nm1 <- tapply(names(x), rep(seq_along(rl$values), rl$lengths),
                FUN = function(y) paste(range(y), collapse = "-"))[i1]
  data.frame(NumberOfOnes = l1, Range = nm1)
}))
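The desired output also includes the row number, which the apply() version drops. A sketch that carries it along by looping over an index instead (assuming df1 from the data block below):
do.call(rbind, lapply(seq_len(nrow(df1)), function(i) {
  x <- unlist(df1[i, ])
  rl <- rle(x)
  i1 <- rl$values == 1
  # first and last column name of each run (a single name for runs of length 1)
  nm1 <- tapply(names(x), rep(seq_along(rl$values), rl$lengths),
                FUN = function(y) paste(unique(range(y)), collapse = "-"))[i1]
  data.frame(row_number = i, NumberOfOnes = rl$lengths[i1], Range = nm1)
}))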
data
df1 <- structure(list(A = c(1L, 0L, 0L), B = c(1L, 0L, 0L), C = c(1L,
0L, 0L), D = c(1L, 0L, 0L), E = c(1L, 1L, 0L), F = c(1L, 1L,
0L), G = c(1L, 1L, 0L), H = c(1L, 0L, 1L), I = c(1L, 1L, 1L),
J = c(0L, 1L, 1L), K = c(0L, 0L, 1L), L = c(0L, 0L, 1L),
M = c(0L, 1L, 0L), N = c(0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-3L))

Related

Subtract above number that is not NA

I have a table similar to this minimal example without the difference column:
trigger  values  difference
0        3
0        NA
0        NA
1        5       2
0        4
0        NA
1        10      6
I want to subtract the above number (and leave out the NAs) from the number at each trigger point (trigger = 1)
Is there a way to do this in R?
Edit:
I now have a situation where the triggers lie close together, as in this example:
trigger  values  difference
0        3
0        NA
0        NA
1        5       2
0        4
1        5       1
0        10
How can I tackle this problem?
Create a grouping column by taking the lag of the cumsum of 'trigger', then compute the difference between the last and first 'values' of each group and place it in the last row of the group:
library(dplyr)
df1 %>%
  group_by(grp = lag(cumsum(trigger), default = 0)) %>%
  mutate(difference = replace(rep(NA, n()), n(),
                              values[n()] - values[1])) %>%
  ungroup %>%
  select(-grp)
-output
# A tibble: 7 × 3
trigger values difference
<int> <int> <int>
1 0 3 NA
2 0 NA NA
3 0 NA NA
4 1 5 2
5 0 4 NA
6 0 NA NA
7 1 10 6
For the second case, we may need an if/else condition that checks the number of rows, i.e. only do the replace computation when the group has more than one row:
df2 %>%
  group_by(grp = lag(cumsum(trigger), default = 0)) %>%
  mutate(difference = if(n() > 1) replace(rep(NA, n()), n(),
                                          values[n()] - values[1]) else NA) %>%
  ungroup
-output
# A tibble: 7 × 4
trigger values grp difference
<int> <int> <dbl> <int>
1 0 3 0 NA
2 0 NA 0 NA
3 0 NA 0 NA
4 1 5 0 2
5 0 4 1 NA
6 1 5 1 1
7 0 10 2 NA
data
df1 <- structure(list(trigger = c(0L, 0L, 0L, 1L, 0L, 0L, 1L), values = c(3L,
NA, NA, 5L, 4L, NA, 10L)), class = "data.frame", row.names = c(NA,
-7L))
df2 <- structure(list(trigger = c(0L, 0L, 0L, 1L, 0L, 1L, 0L), values = c(3L,
NA, NA, 5L, 4L, 5L, 10L)), class = "data.frame", row.names = c(NA,
-7L))
# Import data: df => data.frame
df <- structure(list(trigger = c(0L, 0L, 0L, 1L, 0L, 0L, 1L), values = c(3L,
NA, NA, 5L, 4L, NA, 10L), diff_col = c(NA, NA, NA, 2L, -1L, NA,
6L)), row.names = c(NA, -7L), class = "data.frame")
# Create an empty vector: diff_col => integer vector
df$diff_col <- NA_integer_
# Difference the values vector, ignoring NAs:
# diff_col => integer vector
df[which(!(is.na(df$values)))[-1], "diff_col"] <- diff(
na.omit(
df$values
)
)
# Nullify the value if the trigger is 0:
# diff_col => integer vector
df$diff_col <- with(
df,
ifelse(
trigger == 0,
NA_integer_,
diff_col
)
)
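The same three steps also handle the second example where the triggers lie close together, since each position is differenced against the closest preceding non-NA value. A sketch, assuming df2 from the data block above:
# Repeat the base R steps on df2 (the example with adjacent triggers)
df2$diff_col <- NA_integer_
df2[which(!is.na(df2$values))[-1], "diff_col"] <- diff(na.omit(df2$values))
df2$diff_col <- with(df2, ifelse(trigger == 0, NA_integer_, diff_col))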

How can I count the frequency of AMT by ABCD?

C1:
AMT A B C D
1 13 0 1 0 0
2 17 0 0 1 0
3 19 0 0 0 1
4 1 0 0 1 0
5 9 0 1 0 0
How can I count the frequency of AMT by ABCD?
C2= t(as.matrix(C1[1])) %*% as.matrix(C1[2:5])
It gives me a result of Total Sum by Region.
My desired output is to combine A, B, C, D into one column (since they are binary) and then count the frequency by type, i.e.
AMT GROUP N
1 1 A 1
2 9 B 1
3 13 B 1
4 17 C 1
5 19 D 1
...
AMT is not limited to 1, 9, 13, 17, ...; it ranges from 0-30.
res <- C1 %>% group_by( ) %>% summarise(Freq=n())
library(tidyverse)
C1 %>%
  tidyr::pivot_longer(
    cols = A:D,
    names_to = "Names",
    values_to = "Values"
  ) %>%
  group_by(Names) %>%
  filter(Values == 1) %>%
  summarise(AMT = sum(AMT))
Output:
Names AMT
<chr> <dbl>
1 B 22
2 C 18
3 D 19
You can use max.col to get the column name which has value 1 in it.
library(dplyr)
C1 %>%
  transmute(AMT,
            GROUP = names(.)[-1][max.col(select(., -1))],
            N = 1) %>%
  arrange(AMT) -> res
res
# AMT GROUP N
#4 1 C 1
#5 9 B 1
#1 13 B 1
#2 17 C 1
#3 19 D 1
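Since AMT is not limited to these five values (it ranges from 0-30), the same AMT/GROUP combination can occur more than once; count() then gives the actual frequency instead of a constant 1. A sketch under that assumption, using the C1 from the data block below:
library(dplyr)
C1 %>%
  transmute(AMT,
            GROUP = names(.)[-1][max.col(select(., -1))]) %>%
  # tally how often each AMT/GROUP combination occurs
  count(AMT, GROUP, name = "N") %>%
  arrange(AMT)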
data
C1 <- structure(list(AMT = c(13L, 17L, 19L, 1L, 9L), A = c(0L, 0L,
0L, 0L, 0L), B = c(1L, 0L, 0L, 0L, 1L), C = c(0L, 1L, 0L, 1L,
0L), D = c(0L, 0L, 1L, 0L, 0L)), class = "data.frame", row.names = c(NA, -5L))

How to show percentage up to the current observation?

I am working with the data frame given in the dput below (without the IndicatorPercent column). I am wondering how I can create a new column which shows the percentage of the Indicator column over all previous observations within the group, producing the IndicatorPercent column shown in the dput.
Basically, the new column indicates the percentage (in decimal form) of the indicator up to that point within the group: it divides the sum of the Indicator column over previous observations by the count of previous observations within the group.
My first thought was to use group_by along with row_number in order to reference previous observations, but I couldn't figure out how to make it work.
Data:
structure(list(Group = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L), Indicator = c(1L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L), IndicatorPercent = c(NA,
1, 0.5, 0.67, 0.75, NA, 0, 0, 0, 0.25)), class = "data.frame", row.names = c(NA,
-10L))
We get the cummean of 'Indicator' after grouping by 'Group' and then take the lag of it:
library(dplyr)
df1 %>%
group_by(Group) %>%
mutate(IndicatorPercent = lag(cummean(Indicator))) %>%
ungroup
-output
# A tibble: 10 x 3
# Group Indicator IndicatorPercent
# <int> <int> <dbl>
# 1 1 1 NA
# 2 1 0 1
# 3 1 1 0.5
# 4 1 1 0.667
# 5 1 0 0.75
# 6 2 0 NA
# 7 2 0 0
# 8 2 0 0
# 9 2 1 0
#10 2 0 0.25
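The same result can also be written with the group_by/row_number idea from the question: the lagged running sum of 'Indicator' divided by the number of previous rows. A sketch, assuming the same df1:
library(dplyr)
df1 %>%
  group_by(Group) %>%
  # sum of previous Indicator values divided by the count of previous rows
  mutate(IndicatorPercent = lag(cumsum(Indicator)) / lag(row_number())) %>%
  ungroup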
If we want to do this based on the value of another column (e.g. a 'color' column), use replace:
library(tidyr)
df1 %>%
  group_by(Group) %>%
  mutate(IndicatorPercent = replace(rep(NA_real_, n()), color == 'red',
                                    lag(cummean(Indicator[color == "red"])))) %>%
  fill(IndicatorPercent) %>%
  ungroup
Or with data.table
library(data.table)
setDT(df1)[color == 'red',
IndicatorPercent := shift(cummean(Indicator)), Group][,
IndicatorPercent := nafill(IndicatorPercent, type = 'locf'), Group][]

Count values above 0 and count how many match a pattern in a row (in R)

I would like to count how many rows in each column are >0 and how many of those rows (that are >0) start with "mt-".
The result should also be in a data frame.
Here is an example.
df1
mt-abc 1 0 2
mt-dca 1 1 2
cla 0 2 0
dla 0 3 0
result
above0 2 3 2
mt 2 1 2
In base R you can do :
mat <- df[-1] > 0
rbind(above0 = colSums(mat),
mt = colSums(startsWith(df$V1, 'mt') & mat))
# V2 V3 V4
#above0 2 3 2
#mt 2 1 2
If the actual data has the numbers in the columns and the names in the rownames, we can do:
mat <- df > 0
rbind(above0 = colSums(mat),
mt = colSums(startsWith(rownames(df), 'mt') & mat))
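Since the question asks for the result as a data frame, the matrix returned by rbind() can be wrapped in as.data.frame(). A minimal sketch with the first variant, assuming the df from the data block below:
mat <- df[-1] > 0
res <- as.data.frame(rbind(
  above0 = colSums(mat),
  mt = colSums(startsWith(df$V1, 'mt') & mat)
))
res
#        V2 V3 V4
# above0  2  3  2
# mt      2  1  2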
data
df <- structure(list(V1 = c("mt-abc", "mt-dca", "cla", "dla"), V2 = c(1L,
1L, 0L, 0L), V3 = 0:3, V4 = c(2L, 2L, 0L, 0L)), class = "data.frame",
row.names = c(NA, -4L))
I don't think this is the most elegant approach in the tidyverse, but just out of curiosity:
library(tidyverse)
my_df <- data.frame(
  stringsAsFactors = FALSE,
  var = c("mt-abc", "mt-dca", "cla", "dla"),
  x = c(1L, 1L, 0L, 0L),
  y = c(0L, 1L, 2L, 3L),
  z = c(2L, 2L, 0L, 0L)
)
df_1 <- my_df %>%
  summarize(across(.cols = x:z, .fns = ~ sum(.x > 0))) %>%
  mutate(var = "above0")
df_2 <- my_df %>%
  filter(str_detect(var, "^mt")) %>%
  summarise(across(.cols = x:z, .fns = ~ sum(.x > 0))) %>%
  mutate(var = "mt")
bind_rows(df_1, df_2)
#> x y z var
#> 1 2 3 2 above0
#> 2 2 1 2 mt
Created on 2020-12-04 by the reprex package (v0.3.0)

Grouping by a column and counting number of positive and negative values corresponding to each value in R

I want to count the positive, negative, and zero values corresponding to each group after grouping by a column. My data looks like this:
dataset <- read.table(text =
"id value
1 4
1 -2
1 0
2 6
2 -4
2 -5
2 -1
3 0
3 0
3 -4
3 -5",
header = TRUE, stringsAsFactors = FALSE)
I want my result to look like this:
id num_pos_value num_neg_value num_zero_value
1 1 1 1
2 1 3 0
3 0 2 2
I also want to extend the columns of the above result by adding the sums of the positive and negative values:
id num_pos num_neg num_zero sum_pos sum_neg
1 1 1 1 4 -2
2 1 3 0 6 -10
3 0 2 2 0 -9
We create a group by 'id' and calculate the sum of each logical vector:
library(dplyr)
df1 %>%
  group_by(id) %>%
  summarise(num_pos = sum(value > 0),
            num_neg = sum(value < 0),
            num_zero = sum(value == 0))
# A tibble: 3 x 4
# id num_pos num_neg num_zero
# <int> <int> <int> <int>
#1 1 1 1 1
#2 2 1 3 0
#3 3 0 2 2
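The same summarise() call can be extended to cover the sum_pos and sum_neg columns asked for above by subsetting 'value' on its sign. A sketch, assuming the same df1:
df1 %>%
  group_by(id) %>%
  summarise(num_pos = sum(value > 0),
            num_neg = sum(value < 0),
            num_zero = sum(value == 0),
            # sums of only the positive / only the negative values
            sum_pos = sum(value[value > 0]),
            sum_neg = sum(value[value < 0]))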
Or get the table of sign of 'value' and spread it to 'wide'
library(tidyr)
df1 %>%
  group_by(id) %>%
  summarise(num = list(table(factor(sign(value), levels = -1:1)))) %>%
  unnest %>%
  # levels = -1:1 puts the counts in the order negative, zero, positive
  mutate(grp = rep(paste0("num", c("neg", "zero", "pos")), 3)) %>%
  spread(grp, num)
Or using count
df1 %>%
count(id, val = sign(value)) %>%
spread(val, n, fill = 0)
data
df1 <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
3L), value = c(4L, -2L, 0L, 6L, -4L, -5L, -1L, 0L, 0L, -4L, -5L
)), class = "data.frame", row.names = c(NA, -11L))
