I have been struggling for a while now, and I am just missing a step. I hope you can help with this final step.
Reprex
structure(list(record_id = c(110001, 110001, 110001, 110001,
110001, 110001, 110001, 110001, 110001, 110021, 110021, 110021,
110021, 110021, 110021, 110021, 110021, 110021, 110021, 110021,
110021, 110021), day_count = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1,
2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14), previous_treatment = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0
), treatment = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0), interruption_streak = c(1, 2, 3, 4, 5,
6, 7, 8, 9, 1, 2, 3, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)), row.names = c(NA,
-22L), groups = structure(list(record_id = c(110001, 110021),
.rows = structure(list(1:9, 10:22), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
Explanation
This is just an excerpt of the main dataset in which I have listed per participant per day how they were treated.
Here, you can see two study participants: record_id 110001 and 110021.
To count how many days their treatment was interrupted, I have created a streak counter, interruption_streak, with a count_streak function.
It is a function of treatment: when treatment = 0, it starts counting and keeps counting until treatment > 0.
Both treatment and previous_treatment can be 0 (no treatment) or 1,2,3 (treatment A,B,C)
However, as you can see in record_id 110001, you can't really call the first streak an interruption, because prior to day 1 he didn't receive any treatment at all (previous_treatment = 0). The same goes for the first streak of 110021.
The second streak of 110021 is the only valid one which I would like to consider as an interruption and keep in the dataset:
at day 5, it went from previous_treatment = 1 to treatment = 0.
Question
I would like to delete all streaks which started with a previous_treatment = 0 and keep all streaks which started with a previous_treatment > 0.
Thanks a lot in advance
You were close. Does this suffice?
df %>% group_by(record_id) %>%
filter(cumsum(previous_treatment) > 0)
record_id day_count previous_treatment treatment interruption_streak
<dbl> <dbl> <dbl> <dbl> <dbl>
1 110021 5 1 0 1
2 110021 6 0 0 2
3 110021 7 0 0 3
4 110021 8 0 0 4
5 110021 9 0 0 5
6 110021 10 0 0 6
7 110021 11 0 0 7
8 110021 12 0 0 8
9 110021 13 0 0 9
10 110021 14 0 0 10
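A note on the cumsum filter: it keeps every row after the first non-zero previous_treatment, not just the rows of valid streaks. If the real data also contains treated days, a more explicit variant is to tag each streak with the previous_treatment value at its start and filter on that. Below is a minimal base R sketch of that idea. It assumes rows are ordered by day within each participant and that interruption_streak == 1 marks the first row of a streak; the helper names streak_id, key and start_prev are made up for illustration.

```r
# Rebuild the excerpt from the question (same values, written compactly)
df <- data.frame(
  record_id = c(rep(110001, 9), rep(110021, 13)),
  day_count = c(1:9, 1:3, 5:14),
  previous_treatment = c(rep(0, 12), 1, rep(0, 9)),
  treatment = 0,
  interruption_streak = c(1:9, 1:3, 1:10)
)

# Number the streaks within each participant: a new streak starts
# wherever interruption_streak is 1 (assumption, see text above)
streak_id <- ave(as.integer(df$interruption_streak == 1),
                 df$record_id, FUN = cumsum)

# One key per participant and streak
key <- paste(df$record_id, streak_id)

# previous_treatment on the first row of each streak, repeated along the streak
start_prev <- ave(df$previous_treatment, key, FUN = function(v) v[1])

# Keep only streaks that started right after a treatment
kept <- df[start_prev > 0, ]
kept
```

On the excerpt this keeps the same ten rows as the cumsum version (record_id 110021, days 5 to 14).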
Related
I have the following data:
df<-structure(list(ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2), day = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
11, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10), x1 = c(15, 15, 15.2, 15.2,
15.3, 15.2, 15.3, 15, 15, 15.2, 15.3, 12, 12.1, 12.3, 12.2, 12,
12.4, 12.5, 12.4, 12.6, 12.7), x2 = c(1, 1, 0, 0, 0, 1, 0, 0,
0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -21L))
I want to generate a variable that indicates a change from 1 to 0 in x2, but only if the following 4 rows remain 0 (by ID). That is, I want the first occurrence of a change in x2 from 1 to 0 that lasts at least 4 days. The result should look like this:
df2<-structure(list(ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2), day = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
11, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10), x1 = c(15, 15, 15.2, 15.2,
15.3, 15.2, 15.3, 15, 15, 15.2, 15.3, 12, 12.1, 12.3, 12.2, 12,
12.4, 12.5, 12.4, 12.6, 12.7), x2 = c(1, 1, 0, 0, 0, 1, 0, 0,
0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1), x3 = c(0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -21L))
Here x3 takes the value 1 from the first time x2 stops for at least 4 days, and it stays 1 afterwards regardless of any re-occurrence.
I imagine there is a way to use lag or lead functions in dplyr, but I am not sure how to program the 'at least 4 days' condition.
Any suggestions?
We can use zoo::rollapply for a rolling-window calculation.
fun <- function(z) +(length(z) == 6 && z[1] == 1 && z[2] == 0 && all(z[-(1:2)] == 0))
df %>%
group_by(ID) %>%
mutate(x3a = cummax(zoo::rollapply(lead(x2), 6, fun, fill = 0))) %>%
ungroup()
# # A tibble: 21 x 6
# ID day x1 x2 x3 x3a
# <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 1 1 15 1 0 0
# 2 1 2 15 1 0 0
# 3 1 3 15.2 0 0 0
# 4 1 4 15.2 0 0 0
# 5 1 5 15.3 0 0 0
# 6 1 6 15.2 1 0 0
# 7 1 7 15.3 0 1 1
# 8 1 8 15 0 1 1
# 9 1 9 15 0 1 1
# 10 1 10 15.2 0 1 1
# # ... with 11 more rows
A tidyverse solution could (also) look as follows:
library(dplyr)
library(tidyr)
df %>%
group_by(ID) %>%
mutate(grp = cumsum(x2)) %>%
group_by(ID, grp) %>%
mutate(fourOrMore = n() > 4,
x3 = + lag(fourOrMore),
x3 = replace_na(x3, 0)) %>%
ungroup() %>%
select(- c("grp", "fourOrMore"))
# # A tibble: 21 × 5
# ID day x1 x2 x3
# <dbl> <dbl> <dbl> <dbl> <int>
# 1 1 1 15 1 0
# 2 1 2 15 1 0
# 3 1 3 15.2 0 0
# 4 1 4 15.2 0 0
# 5 1 5 15.3 0 0
# 6 1 6 15.2 1 0
# 7 1 7 15.3 0 1
# 8 1 8 15 0 1
# 9 1 9 15 0 1
# 10 1 10 15.2 0 1
# # … with 11 more rows
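As far as I can tell, the grouped version above falls back to 0 once x2 returns to 1 (ID 2, days 9 and 10), whereas the desired x3 stays 1 "regardless of re-occurrence". Here is a hedged base R sketch using rle() that keeps the flag switched on. The function name flag_stop and the threshold min_run = 5 (the row where x2 drops plus the 4 following rows) are my own assumptions, and the sketch assumes x2 contains only 0 and 1 with no NA.

```r
# Only the columns needed for the flag
df <- data.frame(
  ID = c(rep(1, 11), rep(2, 10)),
  x2 = c(1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
         1, 1, 0, 0, 0, 0, 0, 0, 1, 1)
)

# flag_stop is a hypothetical helper: 1 from the first qualifying zero-run onwards
flag_stop <- function(x2, min_run = 5) {
  r <- rle(x2)
  starts <- cumsum(r$lengths) - r$lengths + 1      # first index of each run
  # a qualifying run: zeros, long enough, and not the very first run
  # (so it is preceded by a run of ones)
  ok <- r$values == 0 & r$lengths >= min_run & seq_along(r$lengths) > 1
  out <- integer(length(x2))
  if (any(ok)) out[starts[which(ok)[1]]:length(x2)] <- 1L
  out
}

# Apply per ID
df$x3 <- ave(df$x2, df$ID, FUN = flag_stop)
df
```

Lowering min_run to 4 would treat a run of four zeros as long enough, which is how the n() > 4 condition above counts it.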
I have two data sets of different lengths. I want to create a new column in the dataset with more rows, based on filtering a specific column from the shorter df. I am getting the warning "Longer object length is not a multiple of shorter object length", and the result is also not correct. I created smaller example datasets and ran the same code on them, and there it works and gives correct results. I am not sure why the results are wrong on my original data and why I get the warning.
The example datasets are
structure(list(id = 1:10, activity = c(0, 0, 0, 0, 1, 0, 0, 1,
0, 0), code = c(2, 5, 11, 15, 3, 18, 21, 3, 27, 55)), class = "data.frame", row.names = c(NA,
-10L))
the second df
structure(list(id2 = 1:20, code2 = c(2, 5, 11, 15, 9, 18, 21,
3, 27, 55, 2, 5, 11, 15, 3, 18, 21, 3, 27, 55), d_Activity = c(0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0)), class = "data.frame", row.names = c(NA,
-20L))
I ran the following on both my original datasets, where I get the warning, and on these dummy dfs, where I get no warning and correct results.
data2 <- data2 %>%
mutate(d_Activity = ifelse(code2 %in% data1$code & activity == 1, 1,0))
Actually, this is not doing what you expect. Let me explain.
It works on the sample data because the number of rows in the larger df (20) is a multiple of the number of rows in the smaller df (10), so R silently recycles the shorter vector.
In your syntax, you are combining one complete vector with another complete vector (a column of the other df) element by element, because R operations are vectorised. When the lengths differ and one is not a multiple of the other, you get the warning and misaligned results.
One way to match one to many is the purrr::map family, where each individual value of the first argument (code2 here) is checked against another vector, i.e. the codes in df1 with activity == 1, which is not itself an argument of map.
df1 <- structure(list(id = 1:10, activity = c(0, 0, 0, 0, 1, 0, 0, 1,
0, 0), code = c(2, 5, 11, 15, 3, 18, 21, 3, 27, 55)), class = "data.frame", row.names = c(NA,
-10L))
df2 <- structure(list(id2 = 1:20, code2 = c(2, 5, 11, 15, 9, 18, 21,
3, 27, 55, 2, 5, 11, 15, 3, 18, 21, 3, 27, 55), d_Activity = c(0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0)), class = "data.frame", row.names = c(NA,
-20L))
library(tidyverse)
df2 %>%
mutate(d_Activity = map_int(code2, ~ +(.x %in% df1$code[df1$activity == 1])))
#> id2 code2 d_Activity
#> 1 1 2 0
#> 2 2 5 0
#> 3 3 11 0
#> 4 4 15 0
#> 5 5 9 0
#> 6 6 18 0
#> 7 7 21 0
#> 8 8 3 1
#> 9 9 27 0
#> 10 10 55 0
#> 11 11 2 0
#> 12 12 5 0
#> 13 13 11 0
#> 14 14 15 0
#> 15 15 3 1
#> 16 16 18 0
#> 17 17 21 0
#> 18 18 3 1
#> 19 19 27 0
#> 20 20 55 0
Created on 2021-06-17 by the reprex package (v2.0.0)
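For what it's worth, %in% is itself vectorised over its left-hand side, so the same column can be built without purrr. The sketch below is base R only; the intermediate name active_codes is made up for illustration. The important part is that the activity filter is applied inside df1 first, so that the only vector of length nrow(df2) in the expression is code2 and nothing is recycled.

```r
df1 <- data.frame(
  id = 1:10,
  activity = c(0, 0, 0, 0, 1, 0, 0, 1, 0, 0),
  code = c(2, 5, 11, 15, 3, 18, 21, 3, 27, 55)
)
df2 <- data.frame(
  id2 = 1:20,
  code2 = c(2, 5, 11, 15, 9, 18, 21, 3, 27, 55,
            2, 5, 11, 15, 3, 18, 21, 3, 27, 55)
)

# Codes that have activity == 1 in the shorter data frame
active_codes <- unique(df1$code[df1$activity == 1])

# One TRUE/FALSE per row of df2, converted to 1/0
df2$d_Activity <- as.integer(df2$code2 %in% active_codes)
df2
```

The warning in the original attempt most likely comes from combining a length-nrow(data2) logical vector with a length-nrow(data1) one via &; I can't confirm that without the real data, so treat it as a guess.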
I would like to create a column in a data.frame that holds, for each id, the first year in which a 1 appears.
That is, I have this data:
example <- structure(list(id = structure(c(1, 2, 3, 4, 5), class = "numeric"),
`2007` = c(0, 0, 0, 0, 0), `2008` = c(0, 0, 0, 0, 1), `2009` = c(1,
0, 0, 0, 0), `2010` = c(1, 0, 1, 0, 1), `2011` = c(0, 0,
0, 0, 0), `2012` = c(1, 0, 1, 1, 1), `2013` = c(1, 0, 1,
0, 1), `2014` = c(1, 1, 1, 1, 0), `2015` = c(1, 1, 0, 0,
0), `2016` = c(1, 1, 1, 0, 1)), row.names = c(NA, 5L), class = "data.frame")
And I would like to get the following:
example2 <- structure(list(id = structure(c(1, 2, 3, 4, 5), class = "numeric"),
`2007` = c(0, 0, 0, 0, 0), `2008` = c(0, 0, 0, 0, 1), `2009` = c(1,
0, 0, 0, 0), `2010` = c(1, 0, 1, 0, 1), `2011` = c(0, 0,
0, 0, 0), `2012` = c(1, 0, 1, 1, 1), `2013` = c(1, 0, 1,
0, 1), `2014` = c(1, 1, 1, 1, 0), `2015` = c(1, 1, 0, 0,
0), `2016` = c(1, 1, 1, 0, 1), situation = c(2009, 2014,
2010, 2012, 2008)), row.names = c(NA, 5L), class = "data.frame")
Is it possible to do that? Any help is welcome. Thanks.
Try this:
#Code
example$situation <- apply(example[,-1],1,function(x) names(x)[min(which(x==1))])
Output:
example
id 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 situation
1 1 0 0 1 1 0 1 1 1 1 1 2009
2 2 0 0 0 0 0 0 0 1 1 1 2014
3 3 0 0 0 1 0 1 1 1 0 1 2010
4 4 0 0 0 0 0 1 0 1 0 0 2012
5 5 0 1 0 1 0 1 1 0 0 1 2008
Or with dplyr and tidyr reshaping and merging:
library(dplyr)
library(tidyr)
#Code
example <- example %>%
left_join(
example %>% pivot_longer(-1) %>%
group_by(id) %>%
summarise(situation=name[min(which(value==1))])
)
Same output.
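Two details may matter on larger data: names() returns the year as a character string, and min(which(x == 1)) gives a warning and NA for a row that contains no 1 at all. A possible variant, sketched here under the assumption that every column except id is a year, uses max.col() and returns a numeric year (NA when there is no 1). The names years, hits and has_hit are only for illustration.

```r
# Same values as the example data, built from a matrix
m <- rbind(
  c(0, 0, 1, 1, 0, 1, 1, 1, 1, 1),
  c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1),
  c(0, 0, 0, 1, 0, 1, 1, 1, 0, 1),
  c(0, 0, 0, 0, 0, 1, 0, 1, 0, 0),
  c(0, 1, 0, 1, 0, 1, 1, 0, 0, 1)
)
colnames(m) <- 2007:2016
example <- data.frame(id = 1:5, m, check.names = FALSE)

years <- as.numeric(names(example)[-1])   # column names as numbers
hits <- example[, -1] == 1                # logical matrix, one row per id

# Column index of the first TRUE in each row
first_col <- max.col(hits, ties.method = "first")

# Rows without any 1 get NA instead of a misleading first column
has_hit <- unname(rowSums(hits) > 0)
example$situation <- ifelse(has_hit, years[first_col], NA)
example
```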
I have a data frame set up like the one below (plot vs species occurrence data).
df=data.frame(plot=c(1, 2, 3, 4, 5, 6, 7, 8, 9), speciesA=c(5, 0, 10, 0, 8, 45, 0, 0, 17), speciesB = c(0, 0, 0, 0, 0, 0, 0, 0, 0), speciesC = c(0.7, 0, 17, 0, 0, 8, 0, 9, 0), speciesD = c(1, 0, 0, 3, 0, 0, 0, 9, 1))
I need to be able to create a second data frame (or subset this one) that contains only species that occur in more than 4 plots. I used colSums to count the number of occurrences > 0 for each column, but cannot apply that to filtering the data frame.
colSums(df != 0)
df2 <- df[,which(apply(df,2,colSums)> 4)]
Any suggestions?
How about this...
df2 <- df[,colSums(df>0)>4]
df2
plot speciesA
1 1 5
2 2 0
3 3 10
4 4 0
5 5 8
6 6 45
7 7 0
8 8 0
9 9 17
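Note that the one-liner also tests the plot column, which only survives here because every plot number is greater than 0. If that is a concern, a slightly longer sketch is to count occurrences over the species columns only and always keep plot. This assumes plot is the only non-species column; the names species, n_plots and keep are illustrative.

```r
df <- data.frame(
  plot = 1:9,
  speciesA = c(5, 0, 10, 0, 8, 45, 0, 0, 17),
  speciesB = c(0, 0, 0, 0, 0, 0, 0, 0, 0),
  speciesC = c(0.7, 0, 17, 0, 0, 8, 0, 9, 0),
  speciesD = c(1, 0, 0, 3, 0, 0, 0, 9, 1)
)

# Every column except the plot identifier is treated as a species
species <- setdiff(names(df), "plot")

# Number of plots in which each species occurs
n_plots <- colSums(df[species] > 0)

# Species present in more than 4 plots
keep <- names(n_plots)[n_plots > 4]

df2 <- df[, c("plot", keep), drop = FALSE]
df2
```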
I have a bit of a question about obtaining a previous value from one column and putting it into another column. I have a data.table object (which can easily be converted into an xts object) as follows:
The dput output is as follows:
structure(list(Time = structure(c(1122855314, 1122855315, 1122855316,
1122855317, 1122855318, 1122855319, 1122855320, 1122955811, 1122955812,
1122955813, 1122955814, 1123027212, 1123027213, 1123027214, 1123027215,
1123027216, 1123027217), class = c("POSIXct", "POSIXt"), tzone = "Australia/Melbourne"),
`Inventory_{t}` = c(0, 2, 2, 2, 5, 8, 3, 7, 6, 6, 1, 0, 1,
1, 3, 3, 3), `Inventory_{t-1}` = c(0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `Delta Inventory_{t-1}` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), .Names = c("Time",
"Inventory_{t}", "Inventory_{t-1}", "Delta Inventory_{t-1}"), row.names = c(NA,
-17L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x00000000028b0788>)
I would like to 'fill in' the "Inventory_{t-1}" such that it takes the value which was in "Inventory_{t}" one second earlier and puts it into that cell. Similarly, for "Delta Inventory_{t-1}" I want it to be equal to Delta Inventory_{t-1} = Inventory_{t-1} - Inventory_{t-2}
I should also note that at the start of each new day, the initial values for "Inventory_{t-1}" and "Delta Inventory_{t-1}" must be 0.
With this information, I would like to get a new data.table/xts object which looks like this:
structure(list(Time = structure(c(1122855314, 1122855315, 1122855316,
1122855317, 1122855318, 1122855319, 1122855320, 1122955811, 1122955812,
1122955813, 1122955814, 1123027212, 1123027213, 1123027214, 1123027215,
1123027216, 1123027217), class = c("POSIXct", "POSIXt"), tzone = "Australia/Melbourne"),
`Inventory_{t}` = c(0, 2, 2, 2, 5, 8, 3, 7, 6, 6, 1, 0, 1,
1, 3, 3, 3), `Inventory_{t-1}` = c(0, 0, 2, 2, 2, 5, 8, 0,
7, 6, 6, 0, 0, 1, 1, 3, 3), `Delta Inventory_{t-1}` = c(0,
0, 2, 0, 0, 3, 3, 0, 7, -1, 0, 0, 0, 1, 0, 2, 0)), .Names = c("Time",
"Inventory_{t}", "Inventory_{t-1}", "Delta Inventory_{t-1}"), row.names = c(NA,
-17L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x00000000028b0788>)
The thing is, this issue is very straightforward for me to solve with loops, but since I have so much data I was hoping for a much faster way to do it. If anyone can help me out with this I'd really appreciate it. Thanks in advance.
This can be solved using the shift() function. The OP has requested to restart the calculation anew every day. This is accomplished by the by = parameter:
z[, `:=`(`Inventory_{t-1}` = shift(`Inventory_{t}`, fill = 0),
`Delta Inventory_{t-1}` = shift(`Inventory_{t}`, fill = 0) -
shift(`Inventory_{t}`, n = 2L, fill = 0)), by = .(Day = as.Date(Time))][]
Time Inventory_{t} Inventory_{t-1} Delta Inventory_{t-1}
1: 2005-08-01 10:15:14 0 0 0
2: 2005-08-01 10:15:15 2 0 0
3: 2005-08-01 10:15:16 2 2 2
4: 2005-08-01 10:15:17 2 2 0
5: 2005-08-01 10:15:18 5 2 0
6: 2005-08-01 10:15:19 8 5 3
7: 2005-08-01 10:15:20 3 8 3
8: 2005-08-02 14:10:11 7 0 0
9: 2005-08-02 14:10:12 6 7 7
10: 2005-08-02 14:10:13 6 6 -1
11: 2005-08-02 14:10:14 1 6 0
12: 2005-08-03 10:00:12 0 0 0
13: 2005-08-03 10:00:13 1 0 0
14: 2005-08-03 10:00:14 1 1 1
15: 2005-08-03 10:00:15 3 1 0
16: 2005-08-03 10:00:16 3 3 2
17: 2005-08-03 10:00:17 3 3 0
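To make the per-day restart easier to check, here is the same logic as a base R sketch without data.table. It is not a replacement for shift(), just an illustration of what the grouped lag does. lag0 is a hypothetical helper that shifts a vector by n positions and pads with 0, and the day is derived with format(), which uses the time zone stored on the POSIXct vector.

```r
time <- as.POSIXct(
  c(1122855314:1122855320, 1122955811:1122955814, 1123027212:1123027217),
  origin = "1970-01-01", tz = "Australia/Melbourne"
)
inv <- c(0, 2, 2, 2, 5, 8, 3, 7, 6, 6, 1, 0, 1, 1, 3, 3, 3)

# Calendar day in the vector's own time zone, used as the grouping key
day <- format(time, "%Y-%m-%d")

# lag0 (hypothetical helper): shift v by n positions, filling the start with 0
lag0 <- function(v, n = 1) c(rep(0, min(n, length(v))), head(v, -n))

# Lags are computed within each day, so every day starts again from 0
inv_t1 <- ave(inv, day, FUN = function(v) lag0(v, 1))   # Inventory_{t-1}
inv_t2 <- ave(inv, day, FUN = function(v) lag0(v, 2))   # Inventory_{t-2}
delta_t1 <- inv_t1 - inv_t2                             # Delta Inventory_{t-1}

data.frame(time, inv, inv_t1, delta_t1)
```

One thing to watch with as.Date() on POSIXct values is which time zone it uses to pick the calendar day; passing tz explicitly, or using format() as above, avoids surprises around midnight.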