I have a table similar to this minimal example without the difference column:
trigger
values
difference
0
3
0
NA
0
NA
1
5
2
0
4
0
NA
1
10
6
I want to subtract the above number (and leave out the NAs) from the number at each trigger point (trigger = 1)
Is there a way to do this in R?
Edit:
I have now the situation where the triggers lie close together like in this example:
trigger
values
difference
0
3
0
NA
0
NA
1
5
2
0
4
1
5
1
0
10
How can I tackle this problem?
Create a grouping column with cumsum on the 'trigger' and taking the lag, then do the difference between the first and last element and replace it as the last value per group
library(dplyr)
df1 %>%
group_by(grp = lag(cumsum(trigger), default = 0)) %>%
mutate(difference = replace(rep(NA, n()), n(),
values[n()] - values[1])) %>%
ungroup %>%
select(-grp)
-output
# A tibble: 7 × 3
trigger values difference
<int> <int> <int>
1 0 3 NA
2 0 NA NA
3 0 NA NA
4 1 5 2
5 0 4 NA
6 0 NA NA
7 1 10 6
For the second case, we may need a condition with if/else that checks the number of rows i.e. if the number of rows is greater than 1 only need the computation to replace
df2 %>%
group_by(grp = lag(cumsum(trigger), default = 0)) %>%
mutate(difference = if(n() > 1) replace(rep(NA, n()), n(),
values[n()] - values[1]) else NA) %>%
ungroup
-output
# A tibble: 7 × 4
trigger values grp difference
<int> <int> <dbl> <int>
1 0 3 0 NA
2 0 NA 0 NA
3 0 NA 0 NA
4 1 5 0 2
5 0 4 1 NA
6 1 5 1 1
7 0 10 2 NA
data
df1 <- structure(list(trigger = c(0L, 0L, 0L, 1L, 0L, 0L, 1L), values = c(3L,
NA, NA, 5L, 4L, NA, 10L)), class = "data.frame", row.names = c(NA,
-7L))
df2 <- structure(list(trigger = c(0L, 0L, 0L, 1L, 0L, 1L, 0L), values = c(3L,
NA, NA, 5L, 4L, 5L, 10L)), class = "data.frame", row.names = c(NA,
-7L))
# Import data: df => data.frame
df <- structure(list(trigger = c(0L, 0L, 0L, 1L, 0L, 0L, 1L), values = c(3L,
NA, NA, 5L, 4L, NA, 10L), diff_col = c(NA, NA, NA, 2L, -1L, NA,
6L)), row.names = c(NA, -7L), class = "data.frame")
# Create an empty vector: diff_col => integer vector
df$diff_col <- NA_integer_
# Difference the X.values vector, ignoring NAs:
# diff_col => integer vector
df[which(!(is.na(df$values)))[-1], "diff_col"] <- diff(
na.omit(
df$values
)
)
# Nullify the value if the trigger is 0:
# diff_col => integer vector
df$diff_col <- with(
df,
ifelse(
trigger == 0,
NA_integer_,
diff_col
)
)
Related
I am working with the following data frame:
I am wondering how I can create a new column which shows the percentage of the indicator column for all previous observations within the group. So the above data frame would become:
Basically, the new column just indicates the percentage (in decimal form) of the indicator up to that point within the group. It just divides the sum of the indicator column up to that point by the row count of previous observations within the group.
My first thought was to use group_by along with row_number in order reference previous observations, but I couldn't figure out how to make it work.
Data:
structure(list(Group = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L), Indicator = c(1L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L), IndicatorPercent = c(NA,
1, 0.5, 0.67, 0.75, NA, 0, 0, 0, 0.25)), class = "data.frame", row.names = c(NA,
-10L))
We get the cummean of the 'Indicator' after grouping by 'Group' and then get the lag on it
library(dplyr)
df1 %>%
group_by(Group) %>%
mutate(IndicatorPercent = lag(cummean(Indicator))) %>%
ungroup
-output
# A tibble: 10 x 3
# Group Indicator IndicatorPercent
# <int> <int> <dbl>
# 1 1 1 NA
# 2 1 0 1
# 3 1 1 0.5
# 4 1 1 0.667
# 5 1 0 0.75
# 6 2 0 NA
# 7 2 0 0
# 8 2 0 0
# 9 2 1 0
#10 2 0 0.25
If we want to do this based on value of other column, use replace
library(tidyr)
df1 %>%
group_by(Group) %>%
mutate(IndicatorPercent = replace(rep(NA_real_, n()),
color == 'red', lag(cummean(Indicator[color == "red"])))) %>%
fill(IndicatorPercent) %>%
ungroup
Or with data.table
library(data.table)
setDT(df1)[color == 'red',
IndicatorPercent := shift(cummean(Indicator)), Group][,
IndicatorPercent := nafill(IndicatorPercent, type = 'locf'), Group][]
X Y
1 1 2
2 2 4
3 NA NA
4 NA NA
5 NA NA
6 NA NA
7 1 4
8 2 6
9 1 8
10 1 10
It should be so: In the first case the average of the values 2 and 4 is 3 In the second case, the average of the values 4,6,8,10 is 7 and so on...
Your data:
df = data.frame(X=c(1,2,NA,NA,NA,NA,1,2,1,1),Y=c(2,4,NA,NA,NA,NA,4,6,8,10))
You can define rows with consecutive rows with no NAs using diff(complete.cases(..)) :
blocks = cumsum(c(0,diff(complete.cases(df)) != 0 ))
block_means = tapply(df$Y,blocks,mean)
0 1 2
3 NA 7
block_means[!is.na(block_means)]
0 2
3 7
Or if you don't need to know the order:
na.omit(as.numeric(tapply(df$Y,blocks,mean)))
[1] 3 7
We can create groups of continuous values using rleid from data.table , within each group calculate the mean of Y values/
library(dplyr)
df %>%
group_by(gr = data.table::rleid(is.na(Y))) %>%
summarise(Y = mean(Y, na.rm = TRUE)) %>%
filter(!is.na(Y)) -> df1
df1
# gr Y
# <int> <dbl>
#1 1 3
#2 3 7
data.table way of doing this would be :
library(data.table)
df1 <- setDT(df)[, .(Y = mean(Y, na.rm = TRUE)), rleid(is.na(Y))][!is.na(Y)]
data
df <- structure(list(X = c(1L, 2L, NA, NA, NA, NA, 1L, 2L, 1L, 1L),
Y = c(2L, 4L, NA, NA, NA, NA, 4L, 6L, 8L, 10L)),
class = "data.frame", row.names = c(NA, -10L))
I am playing around with binary data.
I have data in columns in the following manner:
A B C D E F G H I J K L M N
-----------------------------------------------------
1 1 1 1 1 1 1 1 1 0 0 0 0 0
0 0 0 0 1 1 1 0 1 1 0 0 1 0
0 0 0 0 0 0 0 1 1 1 1 1 0 0
1 - Indicating that the system was on and 0 indicating that the system was off
I am trying to figure out ways to figure out a way to summarize the gaps between the on/off transition of these systems.
For example,
for the first row, it stops working after 'I'
for the second row, it works from 'E' to 'G' and then works again in 'I' and 'M' but is off during other.
Is there a way to summarize this?
I wish to see my result in the following form
row-number Number of 1's Range
------------ ------------------ ------
1 9 A-I
2 3 E-G
2 2 I-J
2 1 M
3 5 H-L
Here's a tidyverse solution:
library(tidyverse)
df %>%
rowid_to_column() %>%
gather(col, val, -rowid) %>%
group_by(rowid) %>%
# This counts the number of times a new streak starts
mutate(grp_num = cumsum(val != lag(val, default = -99))) %>%
filter(val == 1) %>%
group_by(rowid, grp_num) %>%
summarise(num_1s = n(),
range = paste0(first(col), "-", last(col)))
## A tibble: 5 x 4
## Groups: rowid [3]
# rowid grp_num num_1s range
# <int> <int> <int> <chr>
#1 1 1 9 A-I
#2 2 2 3 E-G
#3 2 4 2 I-J
#4 2 6 1 M-M
#5 3 2 5 H-L
An option with data.table. Convert the 'data.frame' to 'data.table' while creating a row number column (setDT), melt from 'wide' to 'long' format specifying the id.var as row number column 'rn', create a run-lenght-id (rleid) column on the 'value' column grouped by 'rn', subset the rows where 'value' is 1, summarise with number of rows (.N), and pasted range of 'variable' values, grouped by 'grp' and 'rn', assign the columns not needed to NULL and order by 'rn' if necessary.
library(data.table)
melt(setDT(df1, keep.rownames = TRUE), id.var = 'rn')[,
grp := rleid(value), rn][value == 1, .(NumberOfOnes = .N,
Range = paste(range(as.character(variable)), collapse="-")),
.(grp, rn)][, grp := NULL][order(rn)]
# rn NumberOfOnes Range
#1: 1 9 A-I
#2: 2 3 E-G
#3: 2 2 I-J
#4: 2 1 M-M
#5: 3 5 H-L
Or using base R with rle
do.call(rbind, apply(df1, 1, function(x) {
rl <- rle(x)
i1 <- rl$values == 1
l1 <- rl$lengths[i1]
nm1 <- tapply(names(x), rep(seq_along(rl$values), rl$lengths),
FUN = function(y) paste(range(y), collapse="-"))[i1]
data.frame(NumberOfOnes = l1, Range = nm1)}))
data
df1 <- structure(list(A = c(1L, 0L, 0L), B = c(1L, 0L, 0L), C = c(1L,
0L, 0L), D = c(1L, 0L, 0L), E = c(1L, 1L, 0L), F = c(1L, 1L,
0L), G = c(1L, 1L, 0L), H = c(1L, 0L, 1L), I = c(1L, 1L, 1L),
J = c(0L, 1L, 1L), K = c(0L, 0L, 1L), L = c(0L, 0L, 1L),
M = c(0L, 1L, 0L), N = c(0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
I want to have a list of positive and negative values corresponding to each value that comes after grouping a column. My data looks like this:
dataset <- read.table(text =
"id value
1 4
1 -2
1 0
2 6
2 -4
2 -5
2 -1
3 0
3 0
3 -4
3 -5",
header = TRUE, stringsAsFactors = FALSE)
I want my result to look like this:
id num_pos_value num_neg_value num_zero_value
1 1 1 1
2 1 3 0
3 0 2 2
I want to extend the columns of the above result by adding sum of the positive and negative values.
id num_pos num_neg num_zero sum_pos sum_neg
1 1 1 1 4 -2
2 1 3 0 6 -10
3 0 2 2 0 -9
We create a group by 'id' and calculate the sum of logical vector
library(dplyr)
df1 %>%
group_by(id) %>%
summarise(num_pos = sum(value > 0),
num_neg = sum(value < 0),
num_zero = sum(value == 0))
# A tibble: 3 x 4
# id num_pos num_neg num_zero
# <int> <int> <int> <int>
#1 1 1 1 1
#2 2 1 3 0
#3 3 0 2 2
Or get the table of sign of 'value' and spread it to 'wide'
library(tidyr)
df1 %>%
group_by(id) %>%
summarise(num = list(table(factor(sign(value), levels = -1:1)))) %>%
unnest %>%
mutate(grp = rep(paste0("num", c("pos", "zero", "neg")), 3)) %>%
spread(grp, num)
Or using count
df1 %>%
count(id, val = sign(value)) %>%
spread(val, n, fill = 0)
data
df1 <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
3L), value = c(4L, -2L, 0L, 6L, -4L, -5L, -1L, 0L, 0L, -4L, -5L
)), class = "data.frame", row.names = c(NA, -11L))
I have data that looks like this
ID v1 v2
1 1 0
2 0 1
3 1 0
3 0 1
4 0 1
I want to replace all values with 'NA' if the ID occurs more than once in the dataframe. The final product should look like this
ID v1 v2
1 1 0
2 0 1
3 NA NA
3 NA NA
4 0 1
I could do this by hand, but I want R to detect all the duplicate cases (in this case two times ID '3') and replace the values with 'NA'.
Thanks for your help!
You could use duplicated() from either end, and then replace.
idx <- duplicated(df$ID) | duplicated(df$ID, fromLast = TRUE)
df[idx, -1] <- NA
which gives
ID v1 v2
1 1 1 0
2 2 0 1
3 3 NA NA
4 3 NA NA
5 4 0 1
This will also work if the duplicated IDs are not next to each other.
Data:
df <- structure(list(ID = c(1L, 2L, 3L, 3L, 4L), v1 = c(1L, 0L, 1L,
0L, 0L), v2 = c(0L, 1L, 0L, 1L, 1L)), .Names = c("ID", "v1",
"v2"), class = "data.frame", row.names = c(NA, -5L))
One more option:
df1[df1$ID %in% df1$ID[duplicated(df1$ID)], -1] <- NA
#> df1
# ID v1 v2
#1 1 1 0
#2 2 0 1
#3 3 NA NA
#4 3 NA NA
#5 4 0 1
data
df1 <- structure(list(ID = c(1L, 2L, 3L, 3L, 4L), v1 = c(1L, 0L, 1L,
0L, 0L), v2 = c(0L, 1L, 0L, 1L, 1L)), .Names = c("ID", "v1",
"v2"), class = "data.frame", row.names = c(NA, -5L))
Here is a base R method
# get list of repeated IDs
repeats <- rle(df$ID)$values[rle(df$ID)$lengths > 1]
# set the corresponding variables to NA
df[, -1] <- sapply(df[, -1], function(i) {i[df$ID %in% repeats] <- NA; i})
In the first line, we use rle to extract repeated IDs. In the second, we use sapply to loop through non-ID variables and replace IDs that repeat with NA for each variable.
Note that this assumes that the data set is sorted by ID. This may be accomplished with the order function. (df <- df[order(df$ID),]).
If the dataset is very large, you might break up the first function into two steps to avoid computing the rle twice:
dfRle <- rle(df$ID)
repeats <- dfRle$values[dfRle$lengths > 1]
data
df <- read.table(header=T, text="ID v1 v2
1 1 0
2 0 1
3 1 0
3 0 1
4 0 1")