I am trying to find a way to subset or filter my dataset (repeated measures of individuals) using a conditional statement on the first measure. In other words, I want to filter the dataset so that it only includes data, at all time points, for the individuals that have a specific condition present at time point 1.
Example Data:
Puck_Number <- c(1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6)
Date <- as.Date(c('2020-07-29','2020-07-29','2020-07-29','2020-07-29','2020-07-29','2020-07-29','2020-09-07','2020-09-07','2020-09-07','2020-09-07','2020-09-07','2020-09-07','2020-09-22','2020-09-22','2020-09-22','2020-09-22','2020-09-22','2020-09-22'))
Bleached <- c(1,0,1,1,0,1,1,0,1,1,0,1,0,0,0,1,0,1)
Alive <- c(1,1,1,1,1,1,1,1,1,1,0,1,0,1,0,1,0,1)
Data <- data.frame(Puck_Number, Date, Bleached, Alive)
Which will produce the following:
Puck_Number Date Bleached Alive
1 1 2020-07-29 1 1
2 2 2020-07-29 0 1
3 3 2020-07-29 1 1
4 4 2020-07-29 1 1
5 5 2020-07-29 0 1
6 6 2020-07-29 1 1
7 1 2020-09-07 1 1
8 2 2020-09-07 0 1
9 3 2020-09-07 1 1
10 4 2020-09-07 1 1
11 5 2020-09-07 0 0
12 6 2020-09-07 1 1
13 1 2020-09-22 0 0
14 2 2020-09-22 0 1
15 3 2020-09-22 0 0
16 4 2020-09-22 1 1
17 5 2020-09-22 0 0
18 6 2020-09-22 1 1
What I want to keep through filtering or subsetting are only the individuals that have a 1 in the Bleached column on the date '2020-07-29', together with the repeated measures of those individuals across the entire dataset.
So I am looking for the data to look like this:
Puck_Number Date Bleached Alive
1 1 2020-07-29 1 1
3 3 2020-07-29 1 1
4 4 2020-07-29 1 1
6 6 2020-07-29 1 1
7 1 2020-09-07 1 1
9 3 2020-09-07 1 1
10 4 2020-09-07 1 1
12 6 2020-09-07 1 1
13 1 2020-09-22 0 0
15 3 2020-09-22 0 0
16 4 2020-09-22 1 1
18 6 2020-09-22 1 1
The puck number is a unique identifier for each individual (repeated for each measurement), and I suspect it may help with this filtering, but I haven't come across a way to accomplish this with the R skills I have.
Try this base R approach: the inner lookup collects the puck numbers that were bleached on the first date, and %in% then keeps every row belonging to one of those pucks.
with(Data, Data[Puck_Number %in% Puck_Number[Date == as.Date("2020-07-29") & Bleached == 1], ])
Output
Puck_Number Date Bleached Alive
1 1 2020-07-29 1 1
3 3 2020-07-29 1 1
4 4 2020-07-29 1 1
6 6 2020-07-29 1 1
7 1 2020-09-07 1 1
9 3 2020-09-07 1 1
10 4 2020-09-07 1 1
12 6 2020-09-07 1 1
13 1 2020-09-22 0 0
15 3 2020-09-22 0 0
16 4 2020-09-22 1 1
18 6 2020-09-22 1 1
Or a tidyverse way
library(tidyverse)
subset <- Data %>%
  filter(Date == as.Date("2020-07-29") & Bleached == 1) %>%
  select(Puck_Number) %>%
  left_join(Data)
> subset
Puck_Number Date Bleached Alive
1 1 2020-07-29 1 1
2 3 2020-07-29 1 1
3 4 2020-07-29 1 1
4 6 2020-07-29 1 1
5 1 2020-09-07 1 1
6 3 2020-09-07 1 1
7 4 2020-09-07 1 1
8 6 2020-09-07 1 1
9 1 2020-09-22 0 0
10 3 2020-09-22 0 0
11 4 2020-09-22 1 1
12 6 2020-09-22 1 1
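A more direct tidyverse variant (a sketch against the same Data frame) skips the join entirely: group by the identifier and keep every group that has a bleached observation at baseline.

library(dplyr)
Data %>%
  group_by(Puck_Number) %>%
  # keep all rows of any puck that was bleached on the first survey date
  filter(any(Bleached == 1 & Date == as.Date("2020-07-29"))) %>%
  ungroup()

This also avoids naming the result subset, which shadows the base function of the same name.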
I want to keep one observation per ID for every 30 days. I want to do this by creating a variable that tells me which observations are kept inside (1) and which fall outside (0) of the filter.
Example
id date
1 3/1/2021
1 4/1/2021
1 5/1/2021
1 6/1/2021
1 2/2/2021
1 3/2/2021
1 5/2/2021
1 7/2/2021
1 9/2/2021
1 11/2/2021
1 13/2/2021
1 16/3/2021
2 5/1/2021
2 31/10/2021
2 9/1/2021
2 6/2/2021
2 1/6/2021
3 1/1/2021
3 1/6/2021
3 31/12/2021
4 5/5/2021
Expected result
id date count
1 3/1/2021 1
1 4/1/2021 0
1 5/1/2021 0
1 6/1/2021 0
1 2/2/2021 0
1 3/2/2021 1
1 5/2/2021 0
1 7/2/2021 0
1 9/2/2021 0
1 11/2/2021 0
1 13/2/2021 0
1 16/3/2021 1
2 5/1/2021 1
2 31/10/2021 1
2 9/1/2021 0
2 6/2/2021 1
2 1/6/2021 1
3 1/1/2021 1
3 1/6/2021 1
3 31/12/2021 1
4 5/5/2021 1
Here is a data.table approach:
library(data.table)
# sort by id and date
setkey(DT, id, date)
# create groups
DT[, group := rleid((as.numeric(date - date[1])) %/% 30), by = .(id)][]
# create count column
DT[, count := ifelse(group != shift(group, type = "lag", fill = 0), 1, 0), by = .(id)][]
# id date group count
# 1: 1 2021-01-03 1 1
# 2: 1 2021-01-04 1 0
# 3: 1 2021-01-05 1 0
# 4: 1 2021-01-06 1 0
# 5: 1 2021-02-02 2 1
# 6: 1 2021-02-03 2 0
# 7: 1 2021-02-05 2 0
# 8: 1 2021-02-07 2 0
# 9: 1 2021-02-09 2 0
#10: 1 2021-02-11 2 0
#11: 1 2021-02-13 2 0
#12: 1 2021-03-16 3 1
#13: 2 2021-01-05 1 1
#14: 2 2021-01-09 1 0
#15: 2 2021-02-06 2 1
#16: 2 2021-06-01 3 1
#17: 2 2021-10-31 4 1
#18: 3 2021-01-01 1 1
#19: 3 2021-06-01 2 1
#20: 3 2021-12-31 3 1
#21: 4 2021-05-05 1 1
# id date group count
sample data used (run this block before the code above)
DT <- fread("id date
1 3/1/2021
1 4/1/2021
1 5/1/2021
1 6/1/2021
1 2/2/2021
1 3/2/2021
1 5/2/2021
1 7/2/2021
1 9/2/2021
1 11/2/2021
1 13/2/2021
1 16/3/2021
2 5/1/2021
2 31/10/2021
2 9/1/2021
2 6/2/2021
2 1/6/2021
3 1/1/2021
3 1/6/2021
3 31/12/2021
4 5/5/2021")
# set date as actual date
DT[, date := as.Date(date, "%d/%m/%Y")]
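For comparison, a rough dplyr equivalent of the same binning idea (a sketch, assuming DT has been built and its date column converted as above; the bin numbers can differ from rleid() when a 30-day bin happens to be empty, but the count flag comes out the same):

library(dplyr)
DT %>%
  arrange(id, date) %>%
  group_by(id) %>%
  mutate(
    # 30-day bins counted from each id's first date
    group = as.numeric(date - first(date)) %/% 30 + 1,
    # flag the first observation of each bin
    count = as.integer(group != lag(group, default = 0))
  ) %>%
  ungroup()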
I'm trying to add two new variables to my data frame: a variable named start, which should be a running count from 0 up to however many rows there are for one group, and a second variable named stop, which is practically the same but starts at 1. The count should begin once the value of a second variable (Var1) scores >0 for the first time. It is also important that the count continues until the last row of the group (so it shouldn't stop if Var1 = 0 again) and that NAs are ignored, in the sense that counting continues through them.
Consider the following dataset as an example
ID Var1 start stop
1 0
1 1 0 1
1 4 1 2
1 2 2 3
1 NA 3 4
1 4 4 5
2 0
2 0
2 3 0 1
2 5 1 2
2 9 2 3
2 0 3 4
I don't really care what values start and stop take before Var1 > 0 occurs for the first time, so whether it's 0 or NA is not important.
Thanks very much in advance for your answers!!
A quick-and-dirty solution to the problem; it should work, just drop the intermediate helper columns with select() at the end.
library(tidyverse)
df_example <- read_table("ID Var1 start stop
1 0
1 1 0 1
1 4 1 2
1 2 2 3
1 NA 3 4
1 4 4 5
2 0
2 0
2 3 0 1
2 5 1 2
2 9 2 3
2 0 3 4")
df_example %>%
group_by(ID) %>%
mutate(greater_1 = if_else(replace_na(Var1,1) > 0,1,0),
run_sum = cumsum(greater_1),
to_fill = if_else(run_sum == 1,1,NA_real_)) %>%
fill(to_fill) %>%
mutate(end2 = cumsum(to_fill %>% replace_na(0)),
star2 = if_else(end2 -1 > 0,end2 -1,0))
#> # A tibble: 12 x 9
#> # Groups: ID [2]
#> ID Var1 start stop greater_1 run_sum to_fill end2 star2
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 NA NA 0 0 NA 0 0
#> 2 1 1 0 1 1 1 1 1 0
#> 3 1 4 1 2 1 2 1 2 1
#> 4 1 2 2 3 1 3 1 3 2
#> 5 1 NA 3 4 1 4 1 4 3
#> 6 1 4 4 5 1 5 1 5 4
#> 7 2 0 NA NA 0 0 NA 0 0
#> 8 2 0 NA NA 0 0 NA 0 0
#> 9 2 3 0 1 1 1 1 1 0
#> 10 2 5 1 2 1 2 1 2 1
#> 11 2 9 2 3 1 3 1 3 2
#> 12 2 0 3 4 0 3 1 4 3
Created on 2020-08-04 by the reprex package (v0.3.0)
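A more compact variant of the same idea condenses the helper columns into a single running flag (a sketch against df_example, with the tidyverse loaded as above; as in the question, NAs keep the counter running, and start2/stop2 are hypothetical names standing in for start/stop):

df_example %>%
  group_by(ID) %>%
  mutate(
    # TRUE from the first positive (or NA) Var1 onward, and it never flips back
    started = cumsum(replace_na(Var1, 1) > 0) > 0,
    stop2   = cumsum(started),
    start2  = pmax(stop2 - 1, 0)
  ) %>%
  ungroup()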
I have a grouped data structure (different households answering a weekly opinion poll) and I observe every household over 52 weeks (in the example, 4 weeks). Now I want to quantify the value of a household's response at a given point in time using entropy. The value of a household participating in the poll should be higher if the household didn't participate in the past weeks. So a household that always answers the poll should have a lower value across these 4 weeks than a household answering every two weeks has in the weeks when it does participate. It's important that, for a given household, the inequality measure varies over weeks.
What's the best way to do so? If it's entropy, how do I apply it to a panel data structure using R?
The data structure is as follows:
da_poll <- data.frame(household = c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
                      week = rep(1:4, 4),
                      participation = c(1,1,1,1,0,0,0,1,0,1,0,1,1,1,1,0))
da_poll
   household week participation
1          1    1             1
2          1    2             1
3          1    3             1
4          1    4             1
5          2    1             0
6          2    2             0
7          2    3             0
8          2    4             1
9          3    1             0
10         3    2             1
11         3    3             0
12         3    4             1
13         4    1             1
14         4    2             1
15         4    3             1
16         4    4             0
# 1 indicates participation, 0 no participation.
I have tried to group it by households, but then I only get one value for each household:
da_poll %>%
group_by(household) %>%
mutate(entropy = entropy(participation))
# A tibble: 16 x 4
# Groups: household [4]
household week participation entropy
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 1.39
2 1 2 1 1.39
3 1 3 1 1.39
4 1 4 1 1.39
5 2 1 0 0
6 2 2 0 0
7 2 3 0 0
8 2 4 1 0
9 3 1 0 0.693
10 3 2 1 0.693
11 3 3 0 0.693
12 3 4 1 0.693
13 4 1 1 1.10
14 4 2 1 1.10
15 4 3 1 1.10
16 4 4 0 1.10
If I group by household and week, I also get something strange:
da_poll %>%
group_by(household, week) %>%
mutate(entropy = entropy(participation))
# A tibble: 16 x 4
# Groups: household, week [16]
household week participation entropy
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 0
2 1 2 1 0
3 1 3 1 0
4 1 4 1 0
5 2 1 0 NA
6 2 2 0 NA
7 2 3 0 NA
8 2 4 1 0
9 3 1 0 NA
10 3 2 1 0
11 3 3 0 NA
12 3 4 1 0
13 4 1 1 0
14 4 2 1 0
15 4 3 1 0
16 4 4 0 NA
To calculate the entropy cumulatively you need to write your own cumulative function. There is probably a more tidyverse-idiomatic way to do it, but this is what I came up with. (The NAs in the output below appear for the leading all-zero stretches: entropy::entropy() treats its input as a vector of counts, and an all-zero count vector has no defined entropy.) Based on your post and your comments, though, entropy may not be the metric you are looking for.
# running entropy: the entropy of x[1:i] for every position i
cummulEntropy <- function(x){
  unlist(lapply(seq_along(x), function(i) entropy::entropy(x[1:i])))
}
da_poll %>%
group_by(household) %>%
mutate(entropy=cummulEntropy(participation))
# A tibble: 16 x 3
# Groups: household [4]
# household participation entropy
# <dbl> <dbl> <dbl>
# 1 1 1 0
# 2 1 1 0.693
# 3 1 1 1.10
# 4 1 1 1.39
# 5 2 0 NA
# 6 2 0 NA
# 7 2 0 NA
# 8 2 1 0
# 9 3 0 NA
#10 3 1 0
#11 3 0 0
#12 3 1 0.693
#13 4 1 0
#14 4 1 0.693
#15 4 1 1.10
#16 4 0 1.10
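If the underlying goal is "a response is worth more the longer the household has been silent", a simple recency measure may serve better than entropy. A hypothetical sketch, not part of the original answer (weeks_since and last_part are made-up names; it uses the week column in da_poll above):

library(dplyr)
da_poll %>%
  group_by(household) %>%
  mutate(
    # week of the most recent participation before this one (0 if none yet)
    last_part   = lag(cummax(week * participation), default = 0),
    # gap since that participation; a larger gap means a more valuable response
    weeks_since = week - last_part
  ) %>%
  ungroup()

For household 2 (silent for three weeks, then answering), the week-4 response scores 4, while the always-answering household 1 scores 1 every week.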
My data frame looks like this
> data
Date Dummy
1 2020-01-01 1
2 2020-01-02 0
3 2020-01-03 0
4 2020-01-04 0
5 2020-01-05 1
6 2020-01-06 1
7 2020-01-07 1
8 2020-01-08 0
9 2020-01-09 1
10 2020-01-10 1
11 2020-01-11 0
I want to create a new column which gives the cumulative frequency of the dummy values, i.e. a running count that only increases on the dates where the dummy is present. The final data set looks like this:
> data
Date Dummy Modified
1 2020-01-01 1 1
2 2020-01-02 0 1
3 2020-01-03 0 1
4 2020-01-04 0 1
5 2020-01-05 1 2
6 2020-01-06 1 3
7 2020-01-07 1 4
8 2020-01-08 0 4
9 2020-01-09 1 5
10 2020-01-10 1 6
11 2020-01-11 0 6
How can I achieve this in R, preferably with dplyr? Any help will be greatly appreciated.
We can just do a cumsum: the 0s add nothing, so the running total increments only on rows where Dummy is 1, which is exactly the conditional cumulative frequency requested.
cumsum(data$Dummy)
#[1] 1 1 1 1 2 3 4 4 5 6 6
This can be implemented within the %>% chain
library(dplyr)
data %>%
mutate(Modified = cumsum(Dummy))
I have a dataset as below. The outcome has no relationship with contact_date; when a subscriber responds to a cold call, we mark it as a successful contact attempt (1), otherwise (0). The count is how many times we have called the subscriber.
subscriber_id outcome contact_date queue multiple_number count
(int) (int) (date) (fctr) (int) (int)
1 1 1 2015-01-29 2 1 1
2 1 0 2015-02-21 2 1 2
3 1 0 2015-03-29 2 1 3
4 1 1 2015-04-30 2 1 4
5 2 0 2015-01-29 2 1 1
6 2 0 2015-02-21 2 1 2
7 2 0 2015-03-29 2 1 3
8 2 0 2015-04-30 2 1 4
9 2 1 2015-05-31 2 1 5
10 2 1 2015-08-25 5 1 6
11 2 0 2015-10-30 5 1 7
12 2 0 2015-12-14 5 1 8
13 3 1 2015-01-29 2 1 1
I would like to get the count number at the first outcome == 1 for each subscriber. Could you please tell me how I can get it? The final data set I would like is below.
(Please note that some subscribers may not have any successful call; in this case, I would like to mark first_success as 0.)
subscriber_id first_success
1 1
2 5
3 1
...
library(dplyr)
data %>%
  group_by(subscriber_id) %>%
  filter(outcome == 1) %>%
  slice(which.min(contact_date)) %>%
  data.frame() %>%
  select(subscriber_id, count)
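Note that the filter(outcome == 1) step drops subscribers who never had a success, while the question asks for first_success = 0 in that case. A sketch that keeps them (assuming, as in the sample, that count increases with contact_date within each subscriber):

library(dplyr)
data %>%
  arrange(subscriber_id, contact_date) %>%
  group_by(subscriber_id) %>%
  # take count at the first success, or 0L if the subscriber never succeeded
  summarise(first_success = if (any(outcome == 1)) count[which(outcome == 1)[1]] else 0L)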