I am not 100% sure how to formulate my question because I don't know the formal names for what I am trying to do with my dataset. Based on previous questions, there appears to be some way to do what I am after, but I am unable to make the logical jump from their problem to my own.
I have attached a sample of my data here.
The first thing I did with my data was add a column indicating which species (sps) are predators (coded as 1) and which species are prey (coded as 0).
#specify which are predators and prey
d1 = d1 %>%
group_by(sps) %>% #grouped by species
mutate(pp=ifelse(sps %in% c("MUXX", "MUVI","MEME"), 1,0)) #mutate to specify predators as 1 and prey as 0
My data is structured as such:
head(d1) #visualize the first few lines of the data
# A tibble: 6 x 8
# Groups: sps [4]
ID date km culv.id type sps time pp
<int> <fctr> <dbl> <fctr> <fctr> <fctr> <fctr> <dbl>
1 2012-06-19 80 A DCC MICRO 2:19 0
2 2012-06-21 80 A DCC MUXX 23:23 1
3 2012-07-15 80 A DCC MAMO 11:38 0
4 2012-07-20 80 A DCC MICRO 22:19 0
5 2012-07-29 80 A DCC MICRO 23:03 0
6 2012-08-07 80 A DCC PRLO 2:04 0
Here is also the output for dput(head(d1)):
structure(list(ID = c(1L, 2L, 3L, 4L, 5L, 8L), date = c("2012-06-19", "2012-06-21", "2012-07-15", "2012-07-20", "2012-07-29", "2012-08-07" ), km = c(80L, 80L, 80L, 80L, 80L, 80L), culv.id = c("A", "A", "A", "A", "A", "A"), type = c("DCC", "DCC", "DCC", "DCC", "DCC", "DCC"), sps = c("MICRO", "MUXX", "MAMO", "MICRO", "MICRO", "PRLO" ), time = c("2:19", "23:23", "11:38", "22:19", "23:03", "2:04" ), pp = c(0, 1, 0, 0, 0, 0)), .Names = c("ID", "date", "km", "culv.id", "type", "sps", "time", "pp"), row.names = c(NA, 6L ), class = "data.frame")
I also converted the time and date using the following code:
d1$datetime=strftime(paste(d1$date,d1$time),'%Y-%m-%d %H:%M',usetz=FALSE) #converting the date/time into a new format
The (most) relevant columns are date, time, and pp (where 1 = predator species and 0 = prey species).
I am now trying to figure out how to extract the following information (average +/- std):
average time between prey-prey observations
average time between prey-predator observations
average time between predator-predator observations
average time between predator-prey observations
To put one of these examples (#2) into words:
What is the average time between when a prey species (pp = 0) is first seen followed by a predator species (pp = 1)?
I am trying to figure out how to do this for my dataset overall first. I think that once I figure out how to do that, it should be fairly straightforward to restrict the data.
Here is a data.table (and lubridate) version that might address your problem:
Using a selection of your posted data (posted at bottom), with a slight modification to your datetime creation so that the format works with data.table:
d1$datetime <- as.POSIXct(strptime(paste(d1$date,d1$time),'%Y-%m-%d %H:%M'))
Convert to a data table:
d1 <- as.data.table(d1)
Calculate the time difference from each observation to the next one using non-equi self-joins: equal pp values (same class, i.e. prey to prey or predator to predator), greater pp (prey to pred), or smaller pp (pred to prey).
d1$class.class <- d1[d1, difftime(x.datetime, i.datetime, units = "days"),
on = .(datetime > datetime, pp == pp), mult = "first"]
d1$prey.pred <-d1[d1, x.datetime - i.datetime,
on = .(datetime > datetime, pp > pp ), mult = "first"]
d1$pred.prey <- d1[d1, x.datetime - i.datetime,
on = .(datetime > datetime, pp < pp), mult = "first"]
Gives you a column for each:
> head(d1[, 7:ncol(d1)])
time pp datetime class.class prey.pred pred.prey
1: 2:19 0 2012-06-19 02:19:00 26.388194 days 2.877778 days NA days
2: 23:23 1 2012-06-21 23:23:00 74.177083 days NA days 23.51042 days
3: 11:38 0 2012-07-15 11:38:00 5.445139 days 50.666667 days NA days
4: 22:19 0 2012-07-20 22:19:00 9.030556 days 45.221528 days NA days
5: 23:03 0 2012-07-29 23:03:00 8.125694 days 36.190972 days NA days
6: 2:04 0 2012-08-07 02:04:00 1.911111 days 28.065278 days NA days
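For readers less familiar with non-equi joins, the same idea can be sketched in base R. This is only an illustration under stated assumptions: the obs data frame and the helper days_to_next() are hypothetical names (not part of the answer above), the timestamps are parsed as UTC, and the helper always picks the earliest later row rather than relying on row order. It uses just the six rows from dput(head(d1)), so rows whose next match lies beyond those six come out as NA here.

```r
# Six sample rows taken from the question's dput(head(d1)) output
obs <- data.frame(
  date = c("2012-06-19", "2012-06-21", "2012-07-15",
           "2012-07-20", "2012-07-29", "2012-08-07"),
  time = c("2:19", "23:23", "11:38", "22:19", "23:03", "2:04"),
  pp   = c(0, 1, 0, 0, 0, 0)
)
obs$datetime <- as.POSIXct(paste(obs$date, obs$time),
                           format = "%Y-%m-%d %H:%M", tz = "UTC")

# Hypothetical helper: days from each row to the earliest strictly later row
# whose pp satisfies keep(pp, current pp); NA when no such row exists.
days_to_next <- function(datetime, pp, keep) {
  vapply(seq_along(datetime), function(i) {
    later <- which(datetime > datetime[i] & keep(pp, pp[i]))
    if (length(later) == 0) return(NA_real_)
    nxt <- later[which.min(datetime[later])]
    as.numeric(difftime(datetime[nxt], datetime[i], units = "days"))
  }, numeric(1))
}

obs$class.class <- days_to_next(obs$datetime, obs$pp, function(x, cur) x == cur)
obs$prey.pred   <- days_to_next(obs$datetime, obs$pp, function(x, cur) x > cur)
obs$pred.prey   <- days_to_next(obs$datetime, obs$pp, function(x, cur) x < cur)
```

The loop is quadratic in the number of rows, so for a large dataset the data.table join is likely the better choice; the sketch is only meant to make the matching rule explicit.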
And you can get summary statistics as you like:
# summary statistics per species
d1[, .(mean.same.class = mean(class.class, na.rm = TRUE),
       sd.same.class = sd(class.class, na.rm = TRUE),
       mean.prey.pred = mean(prey.pred, na.rm = TRUE),
       sd.prey.pred = sd(prey.pred, na.rm = TRUE),
       mean.pred.prey = mean(pred.prey, na.rm = TRUE),
       sd.pred.prey = sd(pred.prey, na.rm = TRUE)),
   by = sps]
sps mean.same.class sd.same.class mean.prey.pred sd.prey.pred mean.pred.prey sd.pred.prey
1: MICRO 7.886237 days 8.0547631 18.80733 days 15.504646 NaN days NA
2: MUXX 42.073611 days 45.4011658 NaN days NA 13.01366 days 9.315697
3: MAMO 5.445139 days NA 50.66667 days NA NaN days NA
4: PRLO 2.475694 days 0.7984414 26.62708 days 2.033914 NaN days NA
5: LEAM 2.897222 days NA 10.11597 days NA NaN days NA
Libraries: data.table, lubridate
Data:
> dput(d1)
structure(list(ID = c(1L, 2L, 3L, 4L, 5L, 8L, 9L, 10L, 11L, 12L,
13L, 14L, 15200L, 15201L, 15199L, 15177L, 15178L, 15204L, 15205L
), date = c("2012-06-19", "2012-06-21", "2012-07-15", "2012-07-20",
"2012-07-29", "2012-08-07", "2012-08-08", "2012-08-09", "2012-08-13",
"2012-08-13", "2012-08-25", "2012-08-27", "2012-09-04", "2012-09-09",
"2012-09-11", "2012-09-14", "2012-09-23", "2012-09-26", "2012-09-27"
), km = c(80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L,
80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L), culv.id = c("A", "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A"), type = c("DCC", "DCC", "DCC", "DCC", "DCC",
"DCC", "DCC", "DCC", "DCC", "DCC", "DCC", "DCC", "DCC", "DCC",
"DCC", "DCC", "DCC", "DCC", "DCC"), sps = c("MICRO", "MUXX",
"MAMO", "MICRO", "MICRO", "PRLO", "MICRO", "PRLO", "MICRO", "MICRO",
"LEAM", "MICRO", "MUXX", "MICRO", "MICRO", "MUXX", "MICRO", "MICRO",
"MICRO"), time = c("2:19", "23:23", "11:38", "22:19", "23:03",
"2:04", "23:56", "23:06", "0:04", "0:46", "0:51", "22:23", "3:38",
"21:08", "0:40", "2:55", "22:09", "20:46", "3:20"), pp = c(0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0)), class = "data.frame", .Names = c("ID",
"date", "km", "culv.id", "type", "sps", "time", "pp"), row.names = c(NA,
-19L))
Edit:
I'm not really sure about mixing tidyverse and data.table ideologies, but you could potentially do what you described in comments using do. For example, make a modified version of df:
d1 <- as.data.table(d1)
d1$datetime <- as.POSIXct(strptime(paste(d1$date,d1$time),'%Y-%m-%d %H:%M'))
d1Mod <- d1
d1Mod$km[10:nrow(d1Mod)] <- 90
Then, define the data.table bit as a function:
foo <- function(df_) {
df_$class.class <- df_[df_, difftime(x.datetime, i.datetime, units = "days"),
on = .(datetime > datetime, pp == pp), mult = "first"]
df_$prey.pred <-df_[df_, x.datetime - i.datetime,
on = .(datetime > datetime, pp > pp ), mult = "first"]
df_$pred.prey <- df_[df_, x.datetime - i.datetime,
on = .(datetime > datetime, pp < pp), mult = "first"]
return(df_)
}
Running d1 %>% group_by(km) %>% do(foo(as.data.table(.))) gets you the same output as in the original answer above (since all km values are 80). If you run it on the modified d1Mod you get an output that looks like it has been grouped by km:
> d1Mod %>%
+ group_by(km) %>%
+ do(foo(as.data.table(.)))
# A tibble: 19 x 12
# Groups: km [2]
ID date km culv.id type sps time pp datetime class.class prey.pred pred.prey
<int> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <dttm> <time> <time> <time>
1 1 2012-06-19 80 A DCC MICRO 2:19 0 2012-06-19 02:19:00 26.3881944 days 2.877778 days NA days
2 2 2012-06-21 80 A DCC MUXX 23:23 1 2012-06-21 23:23:00 NA days NA days 23.510417 days
3 3 2012-07-15 80 A DCC MAMO 11:38 0 2012-07-15 11:38:00 5.4451389 days NA days NA days
4 4 2012-07-20 80 A DCC MICRO 22:19 0 2012-07-20 22:19:00 9.0305556 days NA days NA days
5 5 2012-07-29 80 A DCC MICRO 23:03 0 2012-07-29 23:03:00 8.1256944 days NA days NA days
6 8 2012-08-07 80 A DCC PRLO 2:04 0 2012-08-07 02:04:00 1.9111111 days NA days NA days
7 9 2012-08-08 80 A DCC MICRO 23:56 0 2012-08-08 23:56:00 0.9652778 days NA days NA days
8 10 2012-08-09 80 A DCC PRLO 23:06 0 2012-08-09 23:06:00 3.0402778 days NA days NA days
9 11 2012-08-13 80 A DCC MICRO 0:04 0 2012-08-13 00:04:00 NA days NA days NA days
10 12 2012-08-13 90 A DCC MICRO 0:46 0 2012-08-13 00:46:00 12.0034722 days 22.119444 days NA days
11 13 2012-08-25 90 A DCC LEAM 0:51 0 2012-08-25 00:51:00 2.8972222 days 10.115972 days NA days
12 14 2012-08-27 90 A DCC MICRO 22:23 0 2012-08-27 22:23:00 12.9479167 days 7.218750 days NA days
13 15200 2012-09-04 90 A DCC MUXX 3:38 1 2012-09-04 03:38:00 9.9701389 days NA days 5.729167 days
14 15201 2012-09-09 90 A DCC MICRO 21:08 0 2012-09-09 21:08:00 1.1472222 days 4.240972 days NA days
15 15199 2012-09-11 90 A DCC MICRO 0:40 0 2012-09-11 00:40:00 12.8951389 days 3.093750 days NA days
16 15177 2012-09-14 90 A DCC MUXX 2:55 1 2012-09-14 02:55:00 NA days NA days 9.801389 days
17 15178 2012-09-23 90 A DCC MICRO 22:09 0 2012-09-23 22:09:00 2.9423611 days NA days NA days
18 15204 2012-09-26 90 A DCC MICRO 20:46 0 2012-09-26 20:46:00 0.2736111 days NA days NA days
19 15205 2012-09-27 90 A DCC MICRO 3:20 0 2012-09-27 03:20:00 NA days NA days NA days
However, you'll have to do some checking to make sure that the calculations are actually doing what you need. I don't have example output or actual km/year info to check these results against (read: I don't know what I'm looking at!).
Note also that I think arrange is irrelevant for the operations here, considering that the datetime gets sorted in the function.
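If you would rather avoid mixing the two packages, the same group-then-apply pattern can be sketched in base R with split(), lapply() and rbind(). This is a minimal illustration, not the method above: obs and add_gap() are hypothetical names, the per-group step here is simply "days since the previous observation in the same km group", and the timestamps are parsed as UTC.

```r
# Five illustrative rows: three at km 80 and two at km 90
obs <- data.frame(
  km = c(80, 80, 80, 90, 90),
  datetime = as.POSIXct(c("2012-06-19 02:19", "2012-06-21 23:23",
                          "2012-07-15 11:38", "2012-08-13 00:46",
                          "2012-08-25 00:51"),
                        format = "%Y-%m-%d %H:%M", tz = "UTC")
)

# Hypothetical per-group step: days since the previous observation in the group
add_gap <- function(g) {
  g <- g[order(g$datetime), ]
  g$gap_days <- c(NA, as.numeric(diff(g$datetime), units = "days"))
  g
}

# Split by km, apply the step to each piece, then stack the pieces again
out <- do.call(rbind, lapply(split(obs, obs$km), add_gap))
rownames(out) <- NULL
```

Any function that takes one group's rows and returns a data frame can be dropped in place of add_gap().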
I'll use the sample from the comments as an example:
d1 = structure(list(ID = c(1L, 2L, 3L, 4L, 5L, 8L), date = c("2012-06-19", "2012-06-21", "2012-07-15", "2012-07-20", "2012-07-29", "2012-08-07" ), km = c(80L, 80L, 80L, 80L, 80L, 80L), culv.id = c("A", "A", "A", "A", "A", "A"), type = c("DCC", "DCC", "DCC", "DCC", "DCC", "DCC"), sps = c("MICRO", "MUXX", "MAMO", "MICRO", "MICRO", "PRLO" ), time = c("2:19", "23:23", "11:38", "22:19", "23:03", "2:04" ), pp = c(0, 1, 0, 0, 0, 0)), .Names = c("ID", "date", "km", "culv.id", "type", "sps", "time", "pp"), row.names = c(NA, 6L ), class = "data.frame")
We add the datetime column just as you specified:
d1$datetime=strftime(paste(d1$date,d1$time),'%Y-%m-%d %H:%M',usetz=FALSE)
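First, add columns recording the class of the previous observation, the time since that observation, and the resulting prey/predator sequence (we remove the first row because it has no previous observation). The seque label is built as paste0(pp, prev), so "10" means a predator seen after a prey. Note that timedif is a plain number: subtracting POSIXct values lets R choose the units automatically, and for this sample they come out as days.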
d1 = d1 %>% mutate(prev = lag(pp))
d1 = d1 %>% mutate(timedif = as.numeric(as.POSIXct(datetime) - lag(as.POSIXct(datetime))))
d1 = d1[2:nrow(d1),] %>% mutate(seque = as.factor(paste0(pp,prev)))
At this point, your table looks like
> d1
ID date km culv.id type sps time pp datetime prev timedif seque
1 2 2012-06-21 80 A DCC MUXX 23:23 1 2012-06-21 23:23 0 2.877778 10
2 3 2012-07-15 80 A DCC MAMO 11:38 0 2012-07-15 11:38 1 23.510417 01
3 4 2012-07-20 80 A DCC MICRO 22:19 0 2012-07-20 22:19 0 5.445139 00
4 5 2012-07-29 80 A DCC MICRO 23:03 0 2012-07-29 23:03 0 9.030556 00
5 8 2012-08-07 80 A DCC PRLO 2:04 0 2012-08-07 02:04 0 8.125694 00
After that, just take the wanted statistics for each group by using
avg = d1 %>% group_by(seque) %>% summarise(mean(timedif))
sdevs = d1 %>% group_by(seque) %>% summarise(sd(timedif))
We obtain
>avg
# A tibble: 3 x 2
seque `mean(timedif)`
<fctr> <dbl>
1 00 7.533796
2 01 23.510417
3 10 2.877778
> sdevs
# A tibble: 3 x 2
seque `sd(timedif)`
<fctr> <dbl>
1 00 1.864554
2 01 NA
3 10 NA
Note that the standard deviation is not computed because we only have one observation in the sample dataset for these categories.
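Because the units of a POSIXct subtraction are picked automatically, it may be safer to ask for days explicitly. Here is a minimal base R sketch of the same consecutive-pair idea, assuming UTC timestamps; obs, gap_days and transition are illustrative names, and the labels are written as previous->current to make the direction easier to read.

```r
# The six sample observations from the question
obs <- data.frame(
  datetime = as.POSIXct(c("2012-06-19 02:19", "2012-06-21 23:23",
                          "2012-07-15 11:38", "2012-07-20 22:19",
                          "2012-07-29 23:03", "2012-08-07 02:04"),
                        format = "%Y-%m-%d %H:%M", tz = "UTC"),
  pp = c(0, 1, 0, 0, 0, 0)
)

n <- nrow(obs)
# Gap between each observation and the one before it, always in days
gap_days <- as.numeric(difftime(obs$datetime[-1], obs$datetime[-n], units = "days"))
# Label each pair as "<previous>-><current>"
transition <- paste0(ifelse(obs$pp[-n] == 1, "pred", "prey"), "->",
                     ifelse(obs$pp[-1] == 1, "pred", "prey"))

gap_mean <- tapply(gap_days, transition, mean)
gap_sd   <- tapply(gap_days, transition, sd)
```

With only one prey->pred and one pred->prey pair in the sample, their standard deviations are NA, as in the output above.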
id timepoint dv.a
1 baseline 100
1 1min 105
1 2min 90
2 baseline 70
2 1min 100
2 2min 80
3 baseline 80
3 1min 80
3 2min 90
I have repeated measures data for a given subject in long format as above. I'm looking to calculate percent change relative to baseline for each subject.
id timepoint dv pct.chg
1 baseline 100 100
1 1min 105 105
1 2min 90 90
2 baseline 70 100
2 1min 100 143
2 2min 80 114
3 baseline 80 100
3 1min 80 100
3 2min 90 113
df <- expand.grid( time=c("baseline","1","2"), id=1:4)
df$dv <- sample(100,12)
df %>% group_by(id) %>%
mutate(perc=dv*100/dv[time=="baseline"]) %>%
ungroup()
You want to do something for each 'id' group, so that's the group_by; then you need to create a new column, so there's a mutate. That new variable is the old dv, scaled by the value that dv takes at the baseline, hence the inner part of the mutate. Finally, ungroup() removes the grouping you applied.
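The same scaling can be sketched without dplyr by looking up each subject's baseline by id. This is just an illustration on the question's nine rows; the lookup vector baseline is a name introduced here, and the sketch assumes exactly one "baseline" row per id.

```r
# The question's data in long format
df <- data.frame(
  id        = rep(1:3, each = 3),
  timepoint = rep(c("baseline", "1min", "2min"), times = 3),
  dv        = c(100, 105, 90, 70, 100, 80, 80, 80, 90)
)

# Named lookup: baseline dv per subject, keyed by id rather than row position
is_base  <- df$timepoint == "baseline"
baseline <- df$dv[is_base]
names(baseline) <- df$id[is_base]

# Scale every row by its own subject's baseline
df$pct.chg <- unname(df$dv / baseline[as.character(df$id)]) * 100
```

Because the baseline is matched by id, the rows do not need to be sorted with the baseline first.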
Try creating a helper column, then group and arrange on it. Then use the window function first() in your mutate call (str_remove comes from stringr, the rest from dplyr):
df %>% mutate(clean_timepoint = str_remove(timepoint,"min") %>% if_else(. == "baseline", "0", .) %>% as.numeric()) %>%
group_by(id) %>%
arrange(id,clean_timepoint) %>%
mutate(pct.chg = (dv / first(dv)) * 100) %>%
select(-clean_timepoint)
In base R you can do this (it assumes exactly three rows per id, with the baseline row first):
for(i in 1:(NROW(df)/3)){
df[1+3*(i-1),4] <- 100
df[2+3*(i-1),4] <- df[2+3*(i-1),3]/df[1+3*(i-1),3]*100
df[3+3*(i-1),4] <- df[3+3*(i-1),3]/df[1+3*(i-1),3]*100
}
colnames(df)[4] <- "pct.chg"
output:
> df
id timepoint dv.a pct.chg
1 1 baseline 100 100.0000
2 1 1min 105 105.0000
3 1 2min 90 90.0000
4 2 baseline 70 100.0000
5 2 1min 100 142.8571
6 2 2min 80 114.2857
7 3 baseline 80 100.0000
8 3 1min 80 100.0000
9 3 2min 90 112.5000
Base R solution: (assuming "baseline" always appears as first record per group)
data.frame(do.call("rbind", lapply(split(df, df$id),
function(x){x$pct.change <- x$dv / x$dv[1] * 100; return(x)})), row.names = NULL)
Data:
df <- structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
timepoint = c(
"baseline",
"1min",
"2min",
"baseline",
"1min",
"2min",
"baseline",
"1min",
"2min"
),
dv = c(100L, 105L, 90L, 70L, 100L, 80L, 80L, 80L, 90L)
),
class = "data.frame",
row.names = c(NA,-9L)
)
I have a dataset that looks like below:
PPID join_date week date visit
A 2017-10-01 1 NA 0
A 2017-10-01 2 2017-10-08 2
A 2017-10-01 3 2017-10-15 1
A 2017-10-01 4 NA 0
B 2017-05-23 1 2017-05-21 4
B 2017-05-23 2 2017-05-28 2
B 2017-05-23 3 NA 0
week indicates the difference between the Sunday of the week of join_date and date in weeks (e.g. for participant B, the Sunday of the week of 2017-05-23 is 2017-05-21; thus participant B's week1 starts on 2017-05-21, and week2 starts on 2017-05-28).
My goal is to fill in date where it is currently NA, such that the output looks like below:
PPID join_date week date visit
A 2017-10-01 1 2017-10-01 0
A 2017-10-01 2 2017-10-08 2
A 2017-10-01 3 2017-10-15 1
A 2017-10-01 4 2017-10-22 0
B 2017-05-23 1 2017-05-21 4
B 2017-05-23 2 2017-05-28 2
B 2017-05-23 3 2017-06-04 0
The code I currently have is:
library(dplyr)
library(lubridate)
df2 <- df %>%
group_by(PPID) %>%
mutate(date = seq(unique(floor_date(as.Date(join_date), "weeks")),
unique(floor_date(as.Date(join_date), "weeks") + 7*(max(week)-1)),
by="week"))
The problem with this approach is that I'm working with a large dataset (~8 million observations) and it takes forever to run! I read in some posts that the date conversions and calculations (e.g. floor_date or as.Date) are what take so long, and was wondering if there are ways to make my code more efficient.
Thanks!
How about simply
df2$date = floor_date(df2$join_date, 'week') + 7*(df2$week-1)
# PPID join_date week date visit
# 1 A 2017-10-01 1 2017-10-01 0
# 2 A 2017-10-01 2 2017-10-08 2
# 3 A 2017-10-01 3 2017-10-15 1
# 4 A 2017-10-01 4 2017-10-22 0
# 5 B 2017-05-23 1 2017-05-21 4
# 6 B 2017-05-23 2 2017-05-28 2
# 7 B 2017-05-23 3 2017-06-04 0
Although this calculates floor_date for every row, it is vectorised rather than looping (as you did implicitly using by), so it should be fast enough for most purposes. If you need even more speed, you could subset on is.na(df2$date) to calculate only the rows you need to impute.
Data:
df2 = structure(list(PPID = c("A", "A", "A", "A", "B", "B", "B"), join_date = structure(c(17440,
17440, 17440, 17440, 17309, 17309, 17309), class = "Date"), week = c(1L,
2L, 3L, 4L, 1L, 2L, 3L), date = structure(c(NA, 17447, 17454,
NA, 17307, 17314, NA), class = "Date"), visit = c(0L, 2L, 1L,
0L, 4L, 2L, 0L)), row.names = c(NA, -7L), class = "data.frame")
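The NA-only variant mentioned above could look roughly like this in base R. It is a sketch under assumptions: week_start() is a hypothetical stand-in for floor_date(x, "week") that treats Sunday as the first day of the week (which matches the expected output in the question), and the data frame is retyped here with only the columns that matter.

```r
df2 <- data.frame(
  PPID      = c("A", "A", "A", "A", "B", "B", "B"),
  join_date = as.Date(c(rep("2017-10-01", 4), rep("2017-05-23", 3))),
  week      = c(1L, 2L, 3L, 4L, 1L, 2L, 3L),
  date      = as.Date(c(NA, "2017-10-08", "2017-10-15", NA,
                        "2017-05-21", "2017-05-28", NA))
)

# "%w" is the weekday as a number 0-6 with Sunday = 0, so subtracting it
# moves any date back to the Sunday of its week
week_start <- function(x) x - as.integer(format(x, "%w"))

# Only touch the rows that actually need imputing
missing <- is.na(df2$date)
df2$date[missing] <- week_start(df2$join_date[missing]) + 7 * (df2$week[missing] - 1)
```

Everything here is a single vectorised pass, so it should scale to millions of rows without a grouped operation.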
I have a dataset containing variables and a quantity of goods sold: for some days, however, there are no values.
I created a dataset with all 0 values in sales and all NA in the rest. How can I add those lines to the initial dataset?
At the moment, I have this:
sales
day month year employees holiday sales
1 1 2018 14 0 1058
2 1 2018 25 1 2174
4 1 2018 11 0 987
sales.NA
day month year employees holiday sales
1 1 2018 NA NA 0
2 1 2018 NA NA 0
3 1 2018 NA NA 0
4 1 2018 NA NA 0
I would like to create a new dataset, inserting the days where I have no observations, value 0 to sales, and NA on all other variables. Like this
new.data
day month year employees holiday sales
1 1 2018 14 0 1058
2 1 2018 25 1 2174
3 1 2018 NA NA 0
4 1 2018 11 0 987
I tried something like this:
merge(sales.NA,sales, all.y=T, by = c("day","month","year"))
But it does not work
Using dplyr, you could use a "right_join". For example:
sales <- data.frame(day = c(1,2,4),
month = c(1,1,1),
year = c(2018, 2018, 2018),
employees = c(14, 25, 11),
holiday = c(0,1,0),
sales = c(1058, 2174, 987)
)
sales.NA <- data.frame(day = c(1,2,3,4),
month = c(1,1,1,1),
year = c(2018,2018,2018, 2018)
)
right_join(sales, sales.NA)
This leaves you with
day month year employees holiday sales
1 1 1 2018 14 0 1058
2 2 1 2018 25 1 2174
3 3 1 2018 NA NA NA
4 4 1 2018 11 0 987
This leaves NA in sales where you want 0, but that could be fixed by including the sales data in sales.NA, or you could use "tidyr"
right_join(sales, sales.NA) %>% mutate(sales = replace_na(sales, 0))
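For comparison, here is roughly how the same result can be had with base merge(). In the attempt from the question, all.y = TRUE keeps every row of the second table (sales), so the missing day never appears; keeping every row of the full list of days is what is needed. The name all_days is introduced here for illustration and holds only the key columns.

```r
sales <- data.frame(day = c(1, 2, 4), month = 1, year = 2018,
                    employees = c(14, 25, 11), holiday = c(0, 1, 0),
                    sales = c(1058, 2174, 987))
# Every day that should appear in the result (key columns only)
all_days <- data.frame(day = c(1, 2, 3, 4), month = 1, year = 2018)

# all.x = TRUE keeps every row of all_days; unmatched days get NA everywhere
new.data <- merge(all_days, sales, by = c("day", "month", "year"), all.x = TRUE)
# Days without an observation had no sales
new.data$sales[is.na(new.data$sales)] <- 0
```

Dropping the employees, holiday and sales columns from the table of days also avoids the .x/.y suffixes that merge() adds when both tables share non-key columns.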
Here is another data.table solution:
jvars = c("day","month","year")
merge(sales.NA[, ..jvars], sales, by = jvars, all.x = TRUE)[is.na(sales), sales := 0L][]
day month year employees holiday sales
1: 1 1 2018 14 0 1058
2: 2 1 2018 25 1 2174
3: 3 1 2018 NA NA 0
4: 4 1 2018 11 0 987
Or with some neater syntax:
sales[sales.NA[, ..jvars], on = jvars][is.na(sales), sales := 0][]
Reproducible data:
sales <- structure(list(day = c(1L, 2L, 4L), month = c(1L, 1L, 1L), year = c(2018L,
2018L, 2018L), employees = c(14L, 25L, 11L), holiday = c(0L,
1L, 0L), sales = c(1058L, 2174L, 987L)), row.names = c(NA, -3L
), class = c("data.table", "data.frame"))
sales.NA <- structure(list(day = 1:4, month = c(1L, 1L, 1L, 1L), year = c(2018L,
2018L, 2018L, 2018L), employees = c(NA, NA, NA, NA), holiday = c(NA,
NA, NA, NA), sales = c(0L, 0L, 0L, 0L)), row.names = c(NA, -4L
), class = c("data.table", "data.frame"))
This answer uses the data.table package, since I am more familiar with its syntax, but regular data.frames should work in much the same way. I would also switch to a proper date format, which will make life easier for you down the line.
With this approach you would not need the sales.NA table at all: every day without sales simply shows up with NA values after the join.
library(data.table)
dt.dates <- data.table(Date = seq.Date(from = as.Date("2018-01-01"), to = as.Date("2018-12-31"),by = "day" ))
dt.sales <- data.table(day = c(1,2,4)
, month = c(1,1,1)
, year = c(2018,2018,2018)
, employees = c(14, 25, 11)
, holiday = c(0,1,0)
, sales = c(1058, 2174, 987)
)
dt.sales[, Date := as.Date(paste(year,month,day, sep = "-")) ]
merge( x = dt.dates
, y = dt.sales
, by.x = "Date"
, by.y = "Date"
, all.x = TRUE
)
> Date day month year employees holiday sales
1: 2018-01-01 1 1 2018 14 0 1058
2: 2018-01-02 2 1 2018 25 1 2174
3: 2018-01-03 NA NA NA NA NA NA
4: 2018-01-04 4 1 2018 11 0 987
...
I need some help working with consecutive results.
Here is my sample data:
df <- structure(list(idno = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
2, 2, 2), result = structure(c(1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L), .Label = c("Negative", "Positive"
), class = c("ordered", "factor")), samp_date = structure(c(15909,
15938, 15979, 16007, 16041, 16080, 16182, 16504, 16576, 16645,
16721, 16745, 17105, 17281, 17416, 17429), class = "Date")), class = "data.frame", row.names = c(NA,
-16L))
The 'idno' represents individual people who had a test with 'result' on a given date ('samp_date').
For each individual person, I need to find the earliest run of consecutive 'Negative' results and return the date of the first 'Negative' result in that run. To count, the consecutive negatives must span >30 days with no 'Positive' results in between.
The example answer for idno == 1 would be 2013-10-29, and 2015-11-06 for idno == 2.
I have tried using rle(as.character(df$result)) but have struggled to understand how to apply this to grouped data.
I would prefer an approach that uses dplyr or data.table.
Thanks for any help.
Similar to @MKR's answer, you can make a grouping variable and summarize in data.table:
library(data.table)
setDT(df)[, samp_date := as.IDate(samp_date)]
# summarize by grouping var g = rleid(idno, result)
runDT = df[, .(
start = first(samp_date),
end = last(samp_date),
dur = difftime(last(samp_date), first(samp_date), units="days")
), by=.(idno, result, g = rleid(idno, result))]
# idno result g start end dur
# 1: 1 Negative 1 2013-07-23 2013-07-23 0 days
# 2: 1 Positive 2 2013-08-21 2013-10-01 41 days
# 3: 1 Negative 3 2013-10-29 2015-07-29 638 days
# 4: 2 Positive 4 2015-10-13 2015-10-13 0 days
# 5: 2 Negative 5 2015-11-06 2016-10-31 360 days
# 6: 2 Positive 6 2017-04-25 2017-09-20 148 days
# find rows meeting the criterion
w = runDT[.(idno = unique(idno), result = "Negative", min_dur = 30),
on=.(idno, result, dur >= min_dur), mult="first", which=TRUE]
# filter
runDT[w]
# idno result g start end dur
# 1: 1 Negative 3 2013-10-29 2015-07-29 638 days
# 2: 2 Negative 5 2015-11-06 2016-10-31 360 days
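Since the question mentions struggling to apply rle() to grouped data, here is a minimal base R sketch of that route. The helper first_clear_date() is a hypothetical name, the data are retyped from the question's dput() output with result as plain character, and the threshold uses >= 30 days to mirror the answers here (switch to > if the span must be strictly longer).

```r
df <- data.frame(
  idno = c(rep(1, 10), rep(2, 6)),
  result = c("Negative", "Positive", "Positive", rep("Negative", 7),
             "Positive", "Negative", "Negative", rep("Positive", 3)),
  samp_date = as.Date(c(15909, 15938, 15979, 16007, 16041, 16080, 16182, 16504,
                        16576, 16645, 16721, 16745, 17105, 17281, 17416, 17429),
                      origin = "1970-01-01")
)

# Hypothetical helper: start date of the earliest Negative run spanning
# at least min_days, or NA if one person has no such run
first_clear_date <- function(d, min_days = 30) {
  d <- d[order(d$samp_date), ]
  runs <- rle(as.character(d$result))
  run_id <- rep(seq_along(runs$lengths), runs$lengths)  # run number per row
  for (r in which(runs$values == "Negative")) {
    dates <- d$samp_date[run_id == r]
    if (as.numeric(max(dates) - min(dates)) >= min_days) return(min(dates))
  }
  as.Date(NA)
}

# Apply the helper to each person separately
first_neg <- do.call(c, lapply(split(df, df$idno), first_clear_date))
```

The key step is rep(seq_along(runs$lengths), runs$lengths), which turns the rle() summary back into a run number for every row.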
A dplyr-based solution is to number each run of consecutive identical values in the result column, and then take the first row of each run that meets the criteria:
library(dplyr)
df %>% mutate(samp_date = as.Date(samp_date)) %>%
group_by(idno) %>%
arrange(samp_date) %>%
mutate(result_grp = cumsum(as.character(result)!=lag(as.character(result),default=""))) %>%
group_by(idno, result_grp) %>%
filter( result == "Negative" & (max(samp_date) - min(samp_date) )>=30) %>%
slice(1) %>%
ungroup() %>%
select(-result_grp)
# # A tibble: 2 x 3
# idno result samp_date
# <dbl> <ord> <date>
# 1 1.00 Negative 2013-10-29
# 2 2.00 Negative 2015-11-06
library(dplyr)
df %>% group_by(idno) %>%
mutate(time_diff = ifelse(result=="Negative" & lead(result)=='Negative', samp_date - lead(samp_date),0),
ConsNegDate = min(samp_date[which(abs(time_diff)>30)]))
# A tibble: 16 x 5
# Groups: idno [2]
idno result samp_date time_diff ConsNegDate
<dbl> <ord> <date> <dbl> <date>
1 1 Negative 2013-07-23 0 2013-10-29
2 1 Positive 2013-08-21 0 2013-10-29
3 1 Positive 2013-10-01 0 2013-10-29
4 1 Negative 2013-10-29 -34 2013-10-29
5 1 Negative 2013-12-02 -39 2013-10-29
6 1 Negative 2014-01-10 -102 2013-10-29
7 1 Negative 2014-04-22 -322 2013-10-29
8 1 Negative 2015-03-10 -72 2013-10-29
9 1 Negative 2015-05-21 -69 2013-10-29
10 1 Negative 2015-07-29 NA 2013-10-29
11 2 Positive 2015-10-13 0 2015-11-06
12 2 Negative 2015-11-06 -360 2015-11-06
13 2 Negative 2016-10-31 0 2015-11-06
14 2 Positive 2017-04-25 0 2015-11-06
15 2 Positive 2017-09-07 0 2015-11-06
16 2 Positive 2017-09-20 0 2015-11-06
I have a list of items with 2 dates (start date and end date) and duration in days (end date - start date). I want to cut them into bins to show the number of "active items" in each bin, i.e. if start date <= bin date and end date > bin date, the item should be counted in the bin.
Item StartDate EndDate Duration
Machine1 2005/01/21 2011/03/29 2258
Machine2 2004/05/12 2012/05/08 2918
Machine3 2004/10/15 2005/09/10 330
Machine4 2004/08/30 2011/08/02 2528
Machine5 2005/06/06 2010/12/03 2006
Machine6 2004/05/11 2007/03/17 1040
Machine7 2005/08/09 2011/05/30 2120
Machine8 2005/01/06 2012/06/07 2709
Machine9 2005/06/13 2008/08/28 1172
Machine10 2005/06/28 2010/04/08 1745
Machine11 2004/11/09 2007/05/14 916
Machine12 2005/05/26 2012/09/16 2670
Machine13 2004/05/28 2009/06/09 1838
Machine14 2005/01/06 2012/05/25 2696
Machine15 2005/08/20 2012/02/11 2366
Machine16 2004/08/02 2011/10/23 2638
Machine17 2004/08/10 2009/03/15 1678
Machine18 2005/05/08 2006/04/17 344
Machine19 2005/08/26 2006/07/24 332
Machine20 2004/03/30 2006/05/07 768
Bin counts that I want to produce:
2004/01/01 0
2005/01/01 9
2006/01/01 19
2007/01/01 16
2008/01/01 14
2009/01/01 13
2010/01/01 11
2011/01/01 9
2012/01/01 5
2013/01/01 0
As you can see, the totals of the bins do not add up to the total number of items, as you would expect with a traditional histogram.
I can do this with some verbose code, but I'm sure there must be some short way, using cut or split. I'm aware that the bin labels are off by one according to my definition above, but let's ignore that for now.
One way is:
#turn dates to actual dates
DF$StartDate <- as.Date(DF$StartDate, "%Y/%m/%d")
DF$EndDate <- as.Date(DF$EndDate, "%Y/%m/%d")
binDF[,1] <- as.Date(binDF[,1], "%Y/%m/%d")
counts <- colSums(sapply(binDF[,1], function(x) {DF$StartDate <= x & DF$EndDate > x}))
#> counts
#[1] 0 9 19 16 14 13 11 9 5 0
And as a complete dataframe:
resDF <- data.frame(dates = binDF[,1], counts = counts, stringsAsFactors = F)
#> resDF
# dates counts
#1 2004-01-01 0
#2 2005-01-01 9
#3 2006-01-01 19
#4 2007-01-01 16
#5 2008-01-01 14
#6 2009-01-01 13
#7 2010-01-01 11
#8 2011-01-01 9
#9 2012-01-01 5
#10 2013-01-01 0
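The same comparison can also be written with outer(), which builds the whole item-by-bin matrix in one step. This is only a sketch on the first three machines and the first three bins from the data (so the counts differ from the full result above); starts, ends, bins and active are illustrative names, and the dates are converted to numbers before comparing.

```r
# First three machines and first three bin dates from the question
starts <- as.Date(c("2005/01/21", "2004/05/12", "2004/10/15"), "%Y/%m/%d")
ends   <- as.Date(c("2011/03/29", "2012/05/08", "2005/09/10"), "%Y/%m/%d")
bins   <- as.Date(c("2004/01/01", "2005/01/01", "2006/01/01"), "%Y/%m/%d")

# active[i, j] is TRUE when item i started on or before bin j and ends after it
active <- outer(as.numeric(starts), as.numeric(bins), "<=") &
          outer(as.numeric(ends),   as.numeric(bins), ">")

# Number of active items per bin date
counts <- colSums(active)
```

The matrix has one row per item and one column per bin, so memory grows with the product of the two; for very long inputs the sapply() version may be easier on memory.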
The dataframes DF and binDF:
DF <- structure(list(Item = c("Machine1", "Machine2", "Machine3", "Machine4",
"Machine5", "Machine6", "Machine7", "Machine8", "Machine9", "Machine10",
"Machine11", "Machine12", "Machine13", "Machine14", "Machine15",
"Machine16", "Machine17", "Machine18", "Machine19", "Machine20"
), StartDate = c("2005/01/21", "2004/05/12", "2004/10/15", "2004/08/30",
"2005/06/06", "2004/05/11", "2005/08/09", "2005/01/06", "2005/06/13",
"2005/06/28", "2004/11/09", "2005/05/26", "2004/05/28", "2005/01/06",
"2005/08/20", "2004/08/02", "2004/08/10", "2005/05/08", "2005/08/26",
"2004/03/30"), EndDate = c("2011/03/29", "2012/05/08", "2005/09/10",
"2011/08/02", "2010/12/03", "2007/03/17", "2011/05/30", "2012/06/07",
"2008/08/28", "2010/04/08", "2007/05/14", "2012/09/16", "2009/06/09",
"2012/05/25", "2012/02/11", "2011/10/23", "2009/03/15", "2006/04/17",
"2006/07/24", "2006/05/07"), Duration = c(2258L, 2918L, 330L,
2528L, 2006L, 1040L, 2120L, 2709L, 1172L, 1745L, 916L, 2670L,
1838L, 2696L, 2366L, 2638L, 1678L, 344L, 332L, 768L)), .Names = c("Item",
"StartDate", "EndDate", "Duration"), class = "data.frame", row.names = c(NA,
-20L))
binDF <- structure(list(V1 = c("2004/01/01", "2005/01/01", "2006/01/01",
"2007/01/01", "2008/01/01", "2009/01/01", "2010/01/01", "2011/01/01",
"2012/01/01", "2013/01/01"), V2 = c(0L, 9L, 19L, 16L, 14L, 13L,
11L, 9L, 5L, 0L)), .Names = c("V1", "V2"), class = "data.frame", row.names = c(NA,
-10L))