How to count number of days between events in a dataset - r

I'm trying to restructure my data and recode a variable ('Event') so that I can determine the number of days between events. Essentially, I want to count the number of days that pass between events. Importantly, I only want to start the count after the first event has occurred for each person. Here is a sample dataframe:
Day = c(1:8,1:8)
Event = c(0,0,1,NA,0,0,1,0,0,1,NA,NA,0,1,0,1)
Person = c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2)
sample <- data.frame(Person,Day,Event);sample
I would like it to end up like this:
NewEvent = c(NA,NA,0,1,2,3,0,1,NA,0,1,2,3,0,1,0)
sample2 <- data.frame(Person,Day,NewEvent);sample2
I'm new to R, unfamiliar with loops or if statements, and I could not find a thread which already answered this type of issue, so any help would be greatly appreciated. Thank you!

One approach is to group on Person and count distinct occurrences of events with cumsum(Event == 1). Then group on both Person and that event counter to count the days passed since each distinct event. The solution is:
library(dplyr)
sample %>% group_by(Person) %>%
mutate(EventNum = cumsum(!is.na(Event) & Event == 1)) %>%
group_by(Person, EventNum) %>%
mutate(NewEvent = ifelse(EventNum ==0, NA, row_number() - 1)) %>%
ungroup() %>%
select(Person, Day, NewEvent) %>%
as.data.frame()
# Person Day NewEvent
# 1 1 1 NA
# 2 1 2 NA
# 3 1 3 0
# 4 1 4 1
# 5 1 5 2
# 6 1 6 3
# 7 1 7 0
# 8 1 8 1
# 9 2 1 NA
# 10 2 2 0
# 11 2 3 1
# 12 2 4 2
# 13 2 5 3
# 14 2 6 0
# 15 2 7 1
# 16 2 8 0
Note: If the data is not sorted on Day, add arrange(Day) to the code above.
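To make the note concrete, here is a self-contained sketch of the same pipeline with arrange(Day, .by_group = TRUE) added after the first group_by, so it also works when rows are not already sorted within each Person:

```r
library(dplyr)

# Sample data from the question
Day <- c(1:8, 1:8)
Event <- c(0,0,1,NA,0,0,1,0, 0,1,NA,NA,0,1,0,1)
Person <- c(rep(1, 8), rep(2, 8))
sample <- data.frame(Person, Day, Event)

out <- sample %>%
  group_by(Person) %>%
  arrange(Day, .by_group = TRUE) %>%       # sort within each Person
  mutate(EventNum = cumsum(!is.na(Event) & Event == 1)) %>%
  group_by(Person, EventNum) %>%
  mutate(NewEvent = ifelse(EventNum == 0, NA, row_number() - 1)) %>%
  ungroup() %>%
  select(Person, Day, NewEvent) %>%
  as.data.frame()
out
```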

How to use a for loop to change consecutive values in R?

How can I run a loop over multiple columns changing consecutive values to true values?
For example, if I have a dataframe like this...
Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1
I want to show the binned values...
Time Value Bin Subject_ID
1 6 1 1
2 4 2 1
4 8 3 1
1 2 4 1
Is there a way to do it in a loop?
I tried this code...
for (row in 2:nrow(df)) {
if(df[row - 1, "Subject_ID"] == df[row, "Subject_ID"]) {
df[row,1:2] = df[row,1:2] - df[row - 1,1:2]
}
}
But the code changed it line by line and did not give the correct values for each bin.
If you still insist on using a for loop, you can use the following solution. It's very simple, but you first have to create a copy of your data set, because your desired output values are differences between rows of the original data set. We create DF outside the for loop so its values remain intact; otherwise, each iteration would overwrite values that later iterations still need, and the final output would be incorrect:
df <- read.table(header = TRUE, text = "
Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1")
DF <- df[, c("Time", "Value")]
for(i in 2:nrow(df)) {
df[i, c("Time", "Value")] <- DF[i, ] - DF[i-1, ]
}
df
Time Value Bin Subject_ID
1 1 6 1 1
2 2 4 2 1
3 4 8 3 1
4 1 2 4 1
The problem with the code in the question is that after row i is changed the changed row is used in calculating row i+1 rather than the original row i. To fix that run the loop in reverse order. That is use nrow(df):2 in the for statement. Alternately try one of these which do not use any loops and also have the advantage of not overwriting the input -- something which makes the code easier to debug.
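The reverse-order fix described above can be sketched as follows; it is the question's own loop with nrow(df):2 in the for statement, so each row is differenced against the still-unmodified previous row:

```r
df <- read.table(header = TRUE, text = "
Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1")

# Run the loop backwards: row i is updated before row i-1 is ever
# touched, so no copy of the data is needed
for (row in nrow(df):2) {
  if (df[row - 1, "Subject_ID"] == df[row, "Subject_ID"]) {
    df[row, 1:2] <- df[row, 1:2] - df[row - 1, 1:2]
  }
}
df
```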
1) Base R Use ave to perform Diff by group where Diff uses diff to actually perform the differencing.
Diff <- function(x) c(x[1], diff(x))
transform(df,
Time = ave(Time, Subject_ID, FUN = Diff),
Value = ave(Value, Subject_ID, FUN = Diff))
giving:
Time Value Bin Subject_ID
1 1 6 1 1
2 2 4 2 1
3 4 8 3 1
4 1 2 4 1
2) dplyr Using dplyr we write the above except we use lag:
library(dplyr)
df %>%
group_by(Subject_ID) %>%
mutate(Time = Time - lag(Time, default = 0),
Value = Value - lag(Value, default = 0)) %>%
ungroup
giving:
# A tibble: 4 x 4
Time Value Bin Subject_ID
<dbl> <dbl> <int> <int>
1 1 6 1 1
2 2 4 2 1
3 4 8 3 1
4 1 2 4 1
or using across:
library(dplyr)
df %>%
group_by(Subject_ID) %>%
mutate(across(Time:Value, ~ .x - lag(.x, default = 0))) %>%
ungroup
Note
Lines <- "Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1"
df <- read.table(text = Lines, header = TRUE)
Here is a base R one-liner with diff in a lapply loop.
df[1:2] <- lapply(df[1:2], function(x) c(x[1], diff(x)))
df
# Time Value Bin Subject_ID
#1 1 6 1 1
#2 2 4 2 1
#3 4 8 3 1
#4 1 2 4 1
Data
df <- read.table(text = "
Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1
", header = TRUE)
dplyr one liner
library(dplyr)
df %>% mutate(across(c(Time, Value), ~c(first(.), diff(.))))
#> Time Value Bin Subject_ID
#> 1 1 6 1 1
#> 2 2 4 2 1
#> 3 4 8 3 1
#> 4 1 2 4 1

Efficient way to add sample information as new column to data set

I know how I can subset a data frame by sampling certain rows. However, I'm struggling with finding an easy (preferably tidyverse) way to just ADD the sampling information as a new column to my data set, i.e. I simply want to populate a new column with "1" if it is sampled and "0" if not.
I currently have this one, but it feels overly complicated. Note, in the example I want to sample 3 rows per group.
df <- data.frame(group = c(1,2,1,2,1,1,1,1,2,2,2,2,2,1,1),
var = 1:15)
library(tidyverse)
df <- df %>%
group_by(group) %>%
mutate(sampling_info = sample.int(n(), size = n(), replace = FALSE),
sampling_info = if_else(sampling_info <= 3, 1, 0))
You can try -
library(dplyr)
set.seed(123)
df %>%
arrange(group) %>%
group_by(group) %>%
mutate(sampling_info = as.integer(row_number() %in% sample(n(), size = 3))) %>%
ungroup
# group var sampling_info
# <dbl> <int> <int>
# 1 1 1 0
# 2 1 3 0
# 3 1 5 1
# 4 1 6 0
# 5 1 7 0
# 6 1 8 0
# 7 1 14 1
# 8 1 15 1
# 9 2 2 0
#10 2 4 1
#11 2 9 1
#12 2 10 0
#13 2 11 0
#14 2 12 1
#15 2 13 0
sample(n(), size = 3) will generate 3 random row numbers for each group and we assign 1 for those row numbers.
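For comparison, the same idea can be sketched in base R with ave(), flagging 3 randomly chosen rows per group (this is an alternative sketch, not part of the original answer):

```r
df <- data.frame(group = c(1,2,1,2,1,1,1,1,2,2,2,2,2,1,1),
                 var = 1:15)

set.seed(123)
# For each group, pick 3 random positions and mark them 1, the rest 0
df$sampling_info <- ave(seq_len(nrow(df)), df$group,
                        FUN = function(i) as.integer(seq_along(i) %in% sample(length(i), 3)))
df
```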

Insert missing rows in time series data

I have an incomplete time series dataframe and I need to insert rows of NAs for missing time stamps. There should always be 6 time stamps per day, which is indicated by the variable "Signal" (1-6) in the dataframe. I am trying to merge the incomplete dataframe A with a vector B containing all Signals. Simplified example data below:
B <- rep(1:6,2)
A <- data.frame(Signal = c(1,2,3,5,1,2,4,5,6), var1 = c(1,1,1,1,1,1,1,1,1))
Expected <- data.frame(Signal = c(1,2,3,NA,5,NA,1,2,NA,4,5,6), var1 = c(1,1,1,NA,1,NA,1,1,NA,1,1,1))
Note that B represents a dataframe with multiple variables, and the NAs in Expected are rows of NAs in the dataframe. Also, the actual dataframe has more observations (84 in total).
Would be awesome if you guys could help me out!
If you already know there are 6 timestamps in a day you can do this without B. We can create groups for each day and use complete to add the missing observations with NA.
library(dplyr)
library(tidyr)
A %>%
group_by(gr = cumsum(c(TRUE, diff(Signal) < 0))) %>%
complete(Signal = 1:6) %>%
ungroup() %>%
select(-gr)
# Signal var1
# <dbl> <dbl>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 NA
# 5 5 1
# 6 6 NA
# 7 1 1
# 8 2 1
# 9 3 NA
#10 4 1
#11 5 1
#12 6 1
If in the output you need Signal as NA for missing combination you can use
A %>%
group_by(gr = cumsum(c(TRUE, diff(Signal) < 0))) %>%
complete(Signal = 1:6) %>%
mutate(Signal = replace(Signal, is.na(var1), NA)) %>%
ungroup %>%
select(-gr)
# Signal var1
# <dbl> <dbl>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 NA NA
# 5 5 1
# 6 NA NA
# 7 1 1
# 8 2 1
# 9 NA NA
#10 4 1
#11 5 1
#12 6 1

Filter for first 5 observations per group in tidyverse

I have precipitation data of several different measurement locations and would like to filter for only the first n observations per location and per group of precipitation intensity using tidyverse functions.
So far, I've grouped the data by location and by precipitation intensity.
This is a minimal example (there are several observations of each rainfall intensity per location)
df <- data.frame(location = c(rep(1, 7), rep(2, 7)),
rain = c(1:7, 1:7))
location rain
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 1 7
8 2 1
9 2 2
10 2 3
11 2 4
12 2 5
13 2 6
14 2 7
I thought that it should be quite easy using group_by() and filter(), but so far, I haven't found an expression that would return only the first n observations per rain group per location.
df %>% group_by(rain, location) %>% filter(???)
You can do:
df %>%
group_by(location) %>%
slice(1:5)
location rain
<dbl> <int>
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 2 1
7 2 2
8 2 3
9 2 4
10 2 5
library(dplyr)
df %>%
group_by(location) %>%
filter(row_number() %in% 1:5)
Non-dplyr solutions (that also rearrange the rows)
# Base R
df[unlist(lapply(split(row.names(df), df$location), "[", 1:5)), ]
# data.table
library(data.table)
setDT(df)[, .SD[1:5], by = location]
An option in data.table
library(data.table)
setDT(df)[, .SD[seq_len(.N) <=5], location]
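For completeness, head() applied to .SD is another common data.table idiom for "first n rows per group" (shown here as an additional sketch on the question's data):

```r
library(data.table)

df <- data.frame(location = c(rep(1, 7), rep(2, 7)),
                 rain = c(1:7, 1:7))

# Take the first 5 rows of each location's sub-data.table
res <- setDT(df)[, head(.SD, 5), by = location]
res
```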

Filter all rows of a group according to specific member of group [duplicate]

This question already has an answer here:
How to filter (with dplyr) for all values of a group if variable limit is reached?
(1 answer)
Closed 5 years ago.
I want to filter an entire group based on a value at a specified row.
In the data below, I'd like to remove all rows of group ID according to the value of Metric for Hour == '2'. (Note that I am not trying to filter based on two conditions here; I'm trying to filter based on one condition at a specific row.)
Sample data:
ID <- c('A','A','A','A','A','B','B','B','B','C','C')
Hour <- c('0','2','5','6','9','0','2','5','6','0','2')
Metric <- c(3,4,1,6,7,8,8,3,6,1,1)
x <- data.frame(ID, Hour, Metric)
ID Hour Metric
1 A 0 3
2 A 2 4
3 A 5 1
4 A 6 6
5 A 9 7
6 B 0 8
7 B 2 8
8 B 5 3
9 B 6 6
10 C 0 1
11 C 2 1
I want to filter each ID based on whether Metric > 5 for Hour == '2'. The result should look like this (all rows of ID B are removed):
ID Hour Metric
1 A 0 3
2 A 2 4
3 A 5 1
4 A 6 6
5 A 9 7
10 C 0 1
11 C 2 1
A dplyr-based solution would be preferred, but any help is much appreciated.
Adapting How to filter (with dplyr) for all values of a group if variable limit is reached?
we get:
x %>%
group_by(ID) %>%
filter(any(Metric[Hour == '2'] <= 5))
# # A tibble: 7 x 3
# # Groups: ID [2]
# ID Hour Metric
# <fctr> <fctr> <dbl>
# 1 A 0 3
# 2 A 2 4
# 3 A 5 1
# 4 A 6 6
# 5 A 9 7
# 6 C 0 1
# 7 C 2 1
These types of problems can also be answered by first creating a by-group intermediate variable to flag whether rows should be removed.
Method 1:
x %>%
group_by(ID) %>%
mutate(keep_group = (any(Metric[Hour == '2'] <= 5))) %>%
ungroup %>%
filter(keep_group) %>%
select(-keep_group)
Method 2:
groups_to_keep <-
x %>%
filter(Hour == '2', Metric <= 5) %>%
select(ID) %>%
distinct() # N.B. this sorts groups_to_keep by ID which may not be desired
# ID
# 1 A
# 2 C
x %>%
inner_join(groups_to_keep, by = 'ID')
# ID Hour Metric
# 1 A 0 3
# 2 A 2 4
# 3 A 5 1
# 4 A 6 6
# 5 A 9 7
# 6 C 0 1
# 7 C 2 1
Method 3 - as suggested by @thelatemail (safe with respect to duplicates in ID):
groups_not_to_keep <-
x %>%
filter(Hour == 2, Metric > 5) %>%
select(ID)
x %>%
anti_join(groups_not_to_keep, by = 'ID')
"Not in" (i.e. !(... %in% ...)) should be useful here. Try this:
library(dplyr)
filter(x, Metric > 5 & Hour == '2')$ID # gives B
subset(x, !(ID %in% filter(x, Metric > 5 & Hour == '2')$ID))
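The same logic can be sketched in pure base R without dplyr: collect the offending IDs first, then drop every row belonging to them (an alternative sketch, equivalent to the subset() call above):

```r
ID <- c('A','A','A','A','A','B','B','B','B','C','C')
Hour <- c('0','2','5','6','9','0','2','5','6','0','2')
Metric <- c(3,4,1,6,7,8,8,3,6,1,1)
x <- data.frame(ID, Hour, Metric)

# IDs whose Metric exceeds 5 at Hour == '2'
bad <- unique(x$ID[x$Hour == '2' & x$Metric > 5])
res <- x[!(x$ID %in% bad), ]
res
```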
