I have the following dataframe:
test_df <- structure(list(system = c("A", "B", "B", "C", "D", "B", "B",
"C", "B", "B", "A", "D", "D", "B", "E", NA, NA, "B", "A", "D"
), type = c(2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L,
2L, 1L, 1L, 2L, 2L, 1L, 1L), start_date = structure(c(16567,
16604, 16324, 16595, 16111, 17597, 16784, 16648, 16121, 16549,
16438, 16484, 15997, 16488, 16708, 16121, 16327, 16329, 17010,
16342), class = "Date"), end_date = structure(c(16995, 16984,
16661, 16909, 16414, 17843, 16990, 16853, 16323, 16751, 16622,
16665, 16154, 16624, 16839, 16251, 16456, 16456, 17134, 16458
), class = "Date"), event_duration = c(428, 380, 337, 314, 303,
246, 206, 205, 202, 202, 184, 181, 157, 136, 131, 130, 129, 127,
124, 116)), row.names = c(NA, -20L), class = c("tbl_df", "tbl",
"data.frame"))
test_df
#> system type start_date end_date event_duration
#> 1 A 2 2015-05-12 2016-07-13 428
#> 2 B 2 2015-06-18 2016-07-02 380
#> 3 B 2 2014-09-11 2015-08-14 337
#> 4 C 2 2015-06-09 2016-04-18 314
#> 5 D 1 2014-02-10 2014-12-10 303
#> 6 B 1 2018-03-07 2018-11-08 246
#> 7 B 1 2015-12-15 2016-07-08 206
#> 8 C 2 2015-08-01 2016-02-22 205
#> 9 B 1 2014-02-20 2014-09-10 202
#> 10 B 2 2015-04-24 2015-11-12 202
#> 11 A 2 2015-01-03 2015-07-06 184
#> 12 D 2 2015-02-18 2015-08-18 181
#> 13 D 1 2013-10-19 2014-03-25 157
#> 14 B 2 2015-02-22 2015-07-08 136
#> 15 E 1 2015-09-30 2016-02-08 131
#> 16 <NA> 1 2014-02-20 2014-06-30 130
#> 17 <NA> 2 2014-09-14 2015-01-21 129
#> 18 B 2 2014-09-16 2015-01-21 127
#> 19 A 1 2016-07-28 2016-11-29 124
#> 20 D 1 2014-09-29 2015-01-23 116
For each system, I would like to plot differently colored sequences of segments, for each type of event, starting at start_date and ending at end_date. For example, for system A, I would like to plot two sequences of segments:
one, corresponding to events of type 1, containing a single segment, which starts at 2016-07-28 and ends at 2016-11-29
another, corresponding to events of type 2, containing two segments, one starting at 2015-01-03 and ending at 2015-07-06, and another starting at 2015-05-12 and ending at 2016-07-13. As you see, events in a sequence can overlap. I'm not sure how to make sure that the user can still distinguish among events: maybe one could use arrows, or vertical bars or whatever, to show the start and the end of an event.
Ideally, the plots for each system should be in different facets, because I believe that having all of them in the same plot would make a complete mess (of course, the actual data frame is much bigger than this sample data frame).
For system B, I would have 3 segments corresponding to events of type 1, 5 corresponding to events of type 2. And so on. How can I create the plot I desire? I would prefer a ggplot2 solution.
One option is to use jittering to avoid overplotting the start and end points. Whether this works well will depend on how many segments you're trying to plot.
To make sure both ends of each segment get the same vertical offset, add the jitter to the data frame itself and use that column to plot the segments:
library(ggplot2)
# Jitter once, so y and yend of a segment share the same offset
test_df$jitter <- jitter(test_df$type, amount = 0.25)
ggplot(test_df) +
geom_segment(aes(x=start_date, xend=end_date, y=jitter, yend=jitter)) +
facet_wrap(~system) +
scale_y_continuous(breaks=c(1,2), labels=c(1,2)) +
theme(panel.grid.minor.y = element_blank())
You could also use start and end indicators as you suggested, to help emphasize the ends of the segments, but this may just add more noise if the number of segments is large.
ggplot(test_df) +
geom_point(aes(x=start_date, y=jitter), size=1) +
geom_segment(aes(x=start_date, xend=end_date, y=jitter, yend=jitter),
arrow=arrow(30,unit(1.25,"mm"),"last","closed")) +
facet_wrap(~system) +
scale_y_continuous(breaks=c(1,2), labels=c(1,2)) +
theme(panel.grid.minor.y = element_blank())
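If random jitter is not wanted, a deterministic offset is another option. The following is only a sketch under a few assumptions: it uses the three system A events from the question, numbers the events within each system/type pair by start date, and nudges each one up by a fixed step. The names ev, k and y and the step of 0.1 are made up for this illustration, and the colour mapping is added because the question asked for differently coloured sequences.

```r
# Toy data: the three system A events from the question
ev <- data.frame(
  system     = c("A", "A", "A"),
  type       = c(2L, 2L, 1L),
  start_date = as.Date(c("2015-05-12", "2015-01-03", "2016-07-28")),
  end_date   = as.Date(c("2016-07-13", "2015-07-06", "2016-11-29")),
  stringsAsFactors = FALSE
)

# Order by start date, then number the events within each system/type pair
ev <- ev[order(ev$system, ev$type, ev$start_date), ]
ev$k <- ave(seq_along(ev$type), ev$system, ev$type, FUN = seq_along)

# Fixed vertical offset per event (0.1 is an arbitrary step for this sketch)
ev$y <- ev$type + (ev$k - 1) * 0.1

# Build the plot only if ggplot2 is installed; print(p) draws it
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(ev) +
    ggplot2::geom_segment(ggplot2::aes(x = start_date, xend = end_date,
                                       y = y, yend = y,
                                       colour = factor(type))) +
    ggplot2::facet_wrap(~system) +
    ggplot2::scale_y_continuous(breaks = c(1, 2))
}
```

With many events per type the fixed step would eventually run into the band of the next type, so the step would need to shrink (or the types be spaced further apart) for larger data.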
This seems like a simple enough thing, but I can't figure it out or find an answer online; apologies if it is something obvious. I have two separate data frames containing the same patients with the same unique identifier. Both datasets have time-varying variables, one continuous and one categorical. The time of each reading differs between the sets, but they share a common start point at time 1. I have tried to adapt the tmerge function from the survival package, but without luck, as I have neither a dichotomous outcome variable nor a single dataset with one row per patient.
A reprex for creating the datasets (df1 and df2) is below, along with an example of my desired combined output table for a single patient (ID 3); the output gets very long if done for all 4 patients.
Thanks for any possible help.
df1 <- structure(list(tstart = c(1, 1, 1, 1426, 1, 560, 567), tstop = c(2049,
3426, 1426, 1707, 560, 567, 4207), category = structure(c(1L,
1L, 1L, 2L, 1L, 4L, 2L), .Label = c("none", "high", "low", "moderate"
), class = "factor"), id = c(1L, 2L, 3L, 3L, 4L, 4L, 4L)), row.names = c(NA,
-7L), class = c("tbl_df", "tbl", "data.frame"))
df2 <- structure(list(tstart = c(1, 365, 730, 1, 365, 730, 1096, 2557,
1, 365, 730, 1096, 1826, 2557, 3652, 1), tstop = c(365, 730,
1096, 365, 730, 1096, 2557, 2582, 365, 730, 1096, 1826, 2557,
3652, 4864, 365), egfr = c(66, 62, 58, 54, 50, 43, 49, 51, 106,
103, 80, 92, 97, 90, 81, 51), id = c(1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 4L)), row.names = c(NA, -16L), class = c("tbl_df",
"tbl", "data.frame"))
df_example_patient_3 <- structure(list(id = c(3L, 3L, 3L,
3L, 3L, 3L,3L, 3L, 3L), tstart = c(1, 365, 730, 1096, 1426, 1707, 1826, 2557, 3652), tstop = c(365, 730,
1096, 1426, 1707, 1826, 2557, 3652, 4864), egfr = c(106, 103, 80, 92, 92, 92, 97, 90, 81), category = c("none", "none", "none", "none", "high", "high", "high", "high", "high")), row.names = c(NA, -9L), class = c("tbl_df",
"tbl", "data.frame"))
# DF1
tstart tstop category id
<dbl> <dbl> <fct> <int>
1 1 2049 none 1
2 1 3426 none 2
3 1 1426 none 3
4 1426 1707 high 3
5 1 560 none 4
6 560 567 moderate 4
7 567 4207 high 4
# DF2
tstart tstop egfr id
<dbl> <dbl> <dbl> <int>
1 1 365 66 1
2 365 730 62 1
3 730 1096 58 1
4 1 365 54 2
5 365 730 50 2
6 730 1096 43 2
7 1096 2557 49 2
8 2557 2582 51 2
9 1 365 106 3
10 365 730 103 3
11 730 1096 80 3
12 1096 1826 92 3
13 1826 2557 97 3
14 2557 3652 90 3
15 3652 4864 81 3
16 1 365 51 4
# Combined set
id tstart tstop egfr category
<int> <dbl> <dbl> <dbl> <chr>
1 3 1 365 106 none
2 3 365 730 103 none
3 3 730 1096 80 none
4 3 1096 1426 92 none
5 3 1426 1707 92 high
6 3 1707 1826 92 high
7 3 1826 2557 97 high
8 3 2557 3652 90 high
9 3 3652 4864 81 high
I had to do it this way to really work out the details.
First, I construct a full df1 with all the timestamps, including those of df2.
Then I proceed with multiple merges. This is not elegant, but it works:
library(data.table)
library(zoo)
# Proper data.tables
setDT(df1, key = c("id", "tstart"))
setDT(df2, key = c("id", "tstart"))
timestamps_by_id <- unique(rbind(
df1[, .(id, tstart)],
df1[, .(id, tstop)],
df2[, .(id, tstart)],
df2[, .(id, tstop)],
use.names = FALSE
))
setorder(timestamps_by_id, id, tstart)
# Merge to construct full df1
df1_full <- df1[timestamps_by_id]
df1_full[, category := na.locf(category), by = id]
df1_full[, tstop := shift(tstart, -1), by = id]
setkey(df1_full, id, tstart)
# Merge with df2
result <- na.omit(df2[df1_full, roll = TRUE])
result[, tstop := i.tstop]
print(result[id == 3, .(id, tstart, tstop, egfr, category)])
Or a more data.tabley solution using the more arcane foverlaps:
library(data.table)
# Proper data.tables
setDT(df1, key = c("id", "tstart", "tstop"))
setDT(df2, key = c("id", "tstart", "tstop"))
# We add an infinite upper range
proper_df1 <- rbind(
df1,
df1[, .SD[which.max(tstop)], by = .(id)][, .(id, tstart = tstop, tstop = Inf, category), ]
)
setkey(proper_df1, id, tstart, tstop)
overlaps <- foverlaps(df2, proper_df1, type = "any") # Overlap join
overlaps[
tstart %between% .(i.tstart, i.tstop) & tstart != 1,
i.tstart := tstart
]
overlaps[tstop %between% .(i.tstart, i.tstop), i.tstop := tstop]
print(overlaps[
id == 3,
.(id, "tstart" = i.tstart, "tstop" = i.tstop, category, egfr)
])
This messy dplyr solution seems to work for this particular dataset, but I don't know whether it would work for all datasets; the direction of the fill may need to be changed depending on the data.
library(tidyverse)
library(magrittr)
df1 %>%
bind_rows(df2) %>%
group_by(id) %>%
arrange(id, tstop) %>%
mutate(
tstart = case_when(
tstart < lag(tstop) ~ lag(tstop), TRUE ~ tstart)) %>%
fill(egfr, category, .direction = "updown") %>%
ungroup() %>%
filter(id == 3)
tstart tstop category id egfr
<dbl> <dbl> <fct> <int> <dbl>
1 1 365 none 3 106
2 365 730 none 3 103
3 730 1096 none 3 80
4 1096 1426 none 3 92
5 1426 1707 high 3 92
6 1707 1826 high 3 92
7 1826 2557 high 3 97
8 2557 3652 high 3 90
9 3652 4864 high 3 81
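For reference, the breakpoint idea behind the first data.table answer (collect every tstart/tstop, then look up the value in force at the start of each piece) can also be sketched in base R. This is only an illustration for patient 3, not a general replacement for the answers above. The names cat3, egfr3, cuts and out are made up here, and the sketch assumes that the last known category is carried forward past its final tstop, which is what the desired output in the question shows.

```r
# Patient 3 only, copied from df1 and df2 in the question
cat3 <- data.frame(
  tstart   = c(1, 1426),
  tstop    = c(1426, 1707),
  category = c("none", "high"),
  stringsAsFactors = FALSE
)
egfr3 <- data.frame(
  tstart = c(1, 365, 730, 1096, 1826, 2557, 3652),
  tstop  = c(365, 730, 1096, 1826, 2557, 3652, 4864),
  egfr   = c(106, 103, 80, 92, 97, 90, 81)
)

# Every time point at which either variable can change
cuts <- sort(unique(c(cat3$tstart, cat3$tstop, egfr3$tstart, egfr3$tstop)))

# Consecutive cut points define the combined intervals
out <- data.frame(tstart = head(cuts, -1), tstop = tail(cuts, -1))

# findInterval() returns the index of the last source row starting at or
# before each combined tstart, i.e. the value in force at that time
# (assumption: the last value is carried forward beyond its own tstop)
out$egfr     <- egfr3$egfr[findInterval(out$tstart, egfr3$tstart)]
out$category <- cat3$category[findInterval(out$tstart, cat3$tstart)]
```

To cover all patients, the same steps would have to be applied per id, for example inside lapply(split(...)).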
I would like to filter data frame using numeric vector. I am applying function below:
test_data <- exp_data[exp_data$Size_Change %in% vec_data,]
That's how example data looks like:
dput(exp_data)
structure(list(Name = c("Mark", "Greg", "Tomas", "Morka", "Pekka",
"Robert", "Tim", "Tom", "Bobby", "Terka"), Mode = c(1, 2, NA,
4, NA, 3, NA, 1, NA, 3), Change = structure(c(6L, 2L, 4L, 5L,
7L, 7L, 7L, 8L, 3L, 1L), .Label = c("D[+58], I[+12][+385]", "C[+58], K[+1206]",
"C[+58], P[+2074]", "C[+58], K[+2172]", "C[+58], K[+259]", "C[+58], K[+2665]",
"C[+58], T[+385]", "C[+58], C[+600]"), class = "factor"), Size = c(1335.261,
697.356, 1251.603, 920.43, 492.236, 393.991, 492.239, 727.696,
1218.933, 495.237), Place = c(3L, 4L, 3L, 2L, 4L, 5L, 4L, 3L,
3L, 4L), Size_Change = c(4004, 2786, 3753, 1840, 1966, 1966,
1966, 2181, 3655, 1978)), row.names = 2049:2058, class = "data.frame")
and vector used for filtering:
dput(vec_data)
c(4003, 2785, 954, 1129, 4013, 756, 1852, 2424, 1954, 246, 147,
234, 562, 1617, 2180, 888, 1176)
I mention tolerance because vec_data is not very precise: I expect differences of +1/-1 between the numbers, and after applying the function above, rows with such a difference are not kept. The difference may also be +12/-12 or +24/-24. Can I somehow take this into account while filtering?
Of course, one solution is probably to do something like (vec_data + 1) / (vec_data - 1) / (vec_data + 12), etc., make several filtering attempts, and finally rbind all the outputs, but I am looking for a more "elegant" way. It would also be great to add a column indicating how each row was matched: whether it was an exact number from vec_data or was modified by +1, +12, -24 or whatever. Please take into account that a combination of +1/-1 with any other modification is also possible. The additional column is not necessary if it makes things too complicated.
One option using dplyr, where df and vec stand for the question's exp_data and vec_data (tolerance = 1):
df %>%
filter(sapply(Size_Change, function(x) any(abs(x - vec) %in% 0:1)))
Name Mode Change Size Place Size_Change
1 Mark 1 C[+58], K[+2665] 1335.261 3 4004
2 Greg 2 C[+58], K[+1206] 697.356 4 2786
3 Tom 1 C[+58], C[+600] 727.696 3 2181
Tolerance = 14:
df %>%
filter(sapply(Size_Change, function(x) any(abs(x - vec) %in% 0:14)))
Name Mode Change Size Place Size_Change
1 Mark 1 C[+58], K[+2665] 1335.261 3 4004
2 Greg 2 C[+58], K[+1206] 697.356 4 2786
3 Morka 4 C[+58], K[+259] 920.430 2 1840
4 Pekka NA C[+58], T[+385] 492.236 4 1966
5 Robert 3 C[+58], T[+385] 393.991 5 1966
6 Tim NA C[+58], T[+385] 492.239 4 1966
7 Tom 1 C[+58], C[+600] 727.696 3 2181
The same logic with rowwise():
df %>%
rowwise() %>%
filter(any(abs(Size_Change - vec) %in% 0:1))
The most obvious approach is to filter on an inequality rather than an exact match (always recommended when comparing non-integer numerics):
comp <- function(x, yvec, tolerance = 1){
sapply(x, \(xi){any(abs(xi - yvec) <= tolerance)})
}
exp_data[comp(exp_data$Size_Change, vec_data),]
Name Mode Change Size Place Size_Change
2049 Mark 1 C[+58], K[+2665] 1335.261 3 4004
2050 Greg 2 C[+58], K[+1206] 697.356 4 2786
2056 Tom 1 C[+58], C[+600] 727.696 3 2181
# Tolerance = 2
# exp_data[comp(exp_data$Size_Change, vec_data, 2),]
What about using a tolerance function?
tol <- \(x, tol=1L) sapply(seq(-tol, tol, 1L), \(i) sweep(as.matrix(x), 1L, i))
exp_data[exp_data$Size_Change %in% tol(vec_data), ]
# Name Mode Change Size Place Size_Change
# 2049 Mark 1 C[+58], K[+2665] 1335.261 3 4004
# 2050 Greg 2 C[+58], K[+1206] 697.356 4 2786
# 2056 Tom 1 C[+58], C[+600] 727.696 3 2181
It defaults to tolerance ±1, if we want ±24 we may define it in the argument:
exp_data[exp_data$Size_Change %in% tol(vec_data, 24L), ]
# Name Mode Change Size Place Size_Change
# 2049 Mark 1 C[+58], K[+2665] 1335.261 3 4004
# 2050 Greg 2 C[+58], K[+1206] 697.356 4 2786
# 2052 Morka 4 C[+58], K[+259] 920.430 2 1840
# 2053 Pekka NA C[+58], T[+385] 492.236 4 1966
# 2054 Robert 3 C[+58], T[+385] 393.991 5 1966
# 2055 Tim NA C[+58], T[+385] 492.239 4 1966
# 2056 Tom 1 C[+58], C[+600] 727.696 3 2181
# 2058 Terka 3 D[+58], I[+12][+385] 495.237 4 1978
If you are wondering about the L in 24L: it is integer notation; you may also use tol=24 without any problems.
Note: R version 4.1.2 (2021-11-01)
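The question also asked for a column showing how each row was matched. One possible way to get that, sketched here in base R rather than taken from any of the answers above, is to compute the signed difference to the nearest reference value and filter on its absolute size. The names size_change, nearest, offset, keep and the tolerance of 12 are choices made for this illustration.

```r
# Size_Change column and reference vector from the question
size_change <- c(4004, 2786, 3753, 1840, 1966, 1966, 1966, 2181, 3655, 1978)
vec_data <- c(4003, 2785, 954, 1129, 4013, 756, 1852, 2424, 1954, 246, 147,
              234, 562, 1617, 2180, 888, 1176)
tolerance <- 12

# All pairwise differences: one row per observation, one column per reference
d <- outer(size_change, vec_data, "-")

# Column index of the closest reference value for each observation
nearest <- apply(abs(d), 1, which.min)

# Signed difference to that closest value, e.g. +1 or -12
offset <- d[cbind(seq_along(size_change), nearest)]

# Keep observations whose closest reference lies within the tolerance
keep <- abs(offset) <= tolerance
matched <- data.frame(Size_Change = size_change,
                      reference   = vec_data[nearest],
                      offset      = offset)[keep, ]
```

Because offset is just the raw signed difference, combinations such as 12 plus or minus 1 show up as 11 or 13 and are kept or dropped purely by the chosen tolerance. The outer() matrix has one cell per observation/reference pair, so this is only practical for moderately sized inputs.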
I am not 100% sure how to formulate my question because I don't know the formal names for what I am trying to do with my dataset. Based on previous questions, there appears to be some way to do what I am trying to do, but I am unable to make the logical jump from their problems to my own.
I have attached a sample of my data here.
The first thing I did with my data was add a column indicating which species (sps) are predators (coded as 1) and which species are prey (coded as 0).
#specify which are predators and prey
d1 = d1 %>%
group_by(sps) %>% #grouped by species
mutate(pp=ifelse(sps %in% c("MUXX", "MUVI","MEME"), 1,0)) #mutate to specify predators as 1 and prey as 0
My data is structured as such:
head(d1) #visualize the first few lines of the data
# A tibble: 6 x 8
# Groups: sps [4]
ID date km culv.id type sps time pp
<int> <fctr> <dbl> <fctr> <fctr> <fctr> <fctr> <dbl>
1 2012-06-19 80 A DCC MICRO 2:19 0
2 2012-06-21 80 A DCC MUXX 23:23 1
3 2012-07-15 80 A DCC MAMO 11:38 0
4 2012-07-20 80 A DCC MICRO 22:19 0
5 2012-07-29 80 A DCC MICRO 23:03 0
6 2012-08-07 80 A DCC PRLO 2:04 0
Here is also the output for dput(head(d1)):
structure(list(ID = c(1L, 2L, 3L, 4L, 5L, 8L), date = c("2012-06-19", "2012-06-21", "2012-07-15", "2012-07-20", "2012-07-29", "2012-08-07" ), km = c(80L, 80L, 80L, 80L, 80L, 80L), culv.id = c("A", "A", "A", "A", "A", "A"), type = c("DCC", "DCC", "DCC", "DCC", "DCC", "DCC"), sps = c("MICRO", "MUXX", "MAMO", "MICRO", "MICRO", "PRLO" ), time = c("2:19", "23:23", "11:38", "22:19", "23:03", "2:04" ), pp = c(0, 1, 0, 0, 0, 0)), .Names = c("ID", "date", "km", "culv.id", "type", "sps", "time", "pp"), row.names = c(NA, 6L ), class = "data.frame")
I also converted the time and date using the following code:
d1$datetime=strftime(paste(d1$date,d1$time),'%Y-%m-%d %H:%M',usetz=FALSE) #converting the date/time into a new format
The (most) relevant columns are date, time, and pp (where 1 = predator species and 0 = prey species).
I am now trying to figure out how to extract the following information (average +/- std):
average time between prey-prey observations
average time between prey-predator observations
average time between predator-predator observations
average time between predator-prey observations
To put one of these examples (#2) into words:
What is the average time between when a prey species (pp = 0) is first seen followed by a predator species (pp = 1)?
I am trying to figure out how to do this for my dataset overall first. I think that once I figure out how to do that, it should be fairly straightforward to restrict the data.
Here is a data.table (and lubridate) version that might address your problem:
Using a selection of your posted data (posted at bottom), with a slight modification to your datetime creation so that the format works with data.table:
d1$datetime <- as.POSIXct(strptime(paste(d1$date,d1$time),'%Y-%m-%d %H:%M'))
Convert to a data table:
d1 <- as.data.table(d1)
For each observation, calculate the time to the next observation of the same class (pp equal), the time from a prey observation to the next predator (pp greater), and the time from a predator observation to the next prey (pp less):
d1$class.class <- d1[d1, difftime(x.datetime, i.datetime, units = "days"),
on = .(datetime > datetime, pp == pp), mult = "first"]
d1$prey.pred <-d1[d1, x.datetime - i.datetime,
on = .(datetime > datetime, pp > pp ), mult = "first"]
d1$pred.prey <- d1[d1, x.datetime - i.datetime,
on = .(datetime > datetime, pp < pp), mult = "first"]
Gives you a column for each:
> head(d1[, 7:ncol(d1)])
time pp datetime class.class prey.pred pred.prey
1: 2:19 0 2012-06-19 02:19:00 26.388194 days 2.877778 days NA days
2: 23:23 1 2012-06-21 23:23:00 74.177083 days NA days 23.51042 days
3: 11:38 0 2012-07-15 11:38:00 5.445139 days 50.666667 days NA days
4: 22:19 0 2012-07-20 22:19:00 9.030556 days 45.221528 days NA days
5: 23:03 0 2012-07-29 23:03:00 8.125694 days 36.190972 days NA days
6: 2:04 0 2012-08-07 02:04:00 1.911111 days 28.065278 days NA days
And you can get summary statistics as you like:
d1[, .(mean.same.class = mean(class.class, na.rm = TRUE),
       sd.same.class = sd(class.class, na.rm = TRUE),
       mean.prey.pred = mean(prey.pred, na.rm = TRUE),
       sd.prey.pred = sd(prey.pred, na.rm = TRUE),
       mean.pred.prey = mean(pred.prey, na.rm = TRUE),
       sd.pred.prey = sd(pred.prey, na.rm = TRUE)),
   by = sps]
sps mean.same.class sd.same.class mean.prey.pred sd.prey.pred mean.pred.prey sd.pred.prey
1: MICRO 7.886237 days 8.0547631 18.80733 days 15.504646 NaN days NA
2: MUXX 42.073611 days 45.4011658 NaN days NA 13.01366 days 9.315697
3: MAMO 5.445139 days NA 50.66667 days NA NaN days NA
4: PRLO 2.475694 days 0.7984414 26.62708 days 2.033914 NaN days NA
5: LEAM 2.897222 days NA 10.11597 days NA NaN days NA
Libraries: data.table, lubridate
Data:
> dput(d1)
structure(list(ID = c(1L, 2L, 3L, 4L, 5L, 8L, 9L, 10L, 11L, 12L,
13L, 14L, 15200L, 15201L, 15199L, 15177L, 15178L, 15204L, 15205L
), date = c("2012-06-19", "2012-06-21", "2012-07-15", "2012-07-20",
"2012-07-29", "2012-08-07", "2012-08-08", "2012-08-09", "2012-08-13",
"2012-08-13", "2012-08-25", "2012-08-27", "2012-09-04", "2012-09-09",
"2012-09-11", "2012-09-14", "2012-09-23", "2012-09-26", "2012-09-27"
), km = c(80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L,
80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L), culv.id = c("A", "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A"), type = c("DCC", "DCC", "DCC", "DCC", "DCC",
"DCC", "DCC", "DCC", "DCC", "DCC", "DCC", "DCC", "DCC", "DCC",
"DCC", "DCC", "DCC", "DCC", "DCC"), sps = c("MICRO", "MUXX",
"MAMO", "MICRO", "MICRO", "PRLO", "MICRO", "PRLO", "MICRO", "MICRO",
"LEAM", "MICRO", "MUXX", "MICRO", "MICRO", "MUXX", "MICRO", "MICRO",
"MICRO"), time = c("2:19", "23:23", "11:38", "22:19", "23:03",
"2:04", "23:56", "23:06", "0:04", "0:46", "0:51", "22:23", "3:38",
"21:08", "0:40", "2:55", "22:09", "20:46", "3:20"), pp = c(0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0)), class = "data.frame", .Names = c("ID",
"date", "km", "culv.id", "type", "sps", "time", "pp"), row.names = c(NA,
-19L))
Edit:
I'm not really sure about mixing tidyverse and data.table ideologies, but you could potentially do what you described in the comments using do. For example, make a modified version of d1:
d1 <- as.data.table(d1)
d1$datetime <- as.POSIXct(strptime(paste(d1$date,d1$time),'%Y-%m-%d %H:%M'))
d1Mod <- d1
d1Mod$km[10:nrow(d1Mod)] <- 90
Then, define the data.table bit as a function:
foo <- function(df_) {
df_$class.class <- df_[df_, difftime(x.datetime, i.datetime, units = "days"),
on = .(datetime > datetime, pp == pp), mult = "first"]
df_$prey.pred <-df_[df_, x.datetime - i.datetime,
on = .(datetime > datetime, pp > pp ), mult = "first"]
df_$pred.prey <- df_[df_, x.datetime - i.datetime,
on = .(datetime > datetime, pp < pp), mult = "first"]
return(df_)
}
Running d1 %>% group_by(km) %>% do(foo(as.data.table(.))) gets you the same output as in the original answer above (since all km values are 80). If you run it on the modified d1Mod you get an output that looks like it has been grouped by km:
> d1Mod %>%
+ group_by(km) %>%
+ do(foo(as.data.table(.)))
# A tibble: 19 x 12
# Groups: km [2]
ID date km culv.id type sps time pp datetime class.class prey.pred pred.prey
<int> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <dttm> <time> <time> <time>
1 1 2012-06-19 80 A DCC MICRO 2:19 0 2012-06-19 02:19:00 26.3881944 days 2.877778 days NA days
2 2 2012-06-21 80 A DCC MUXX 23:23 1 2012-06-21 23:23:00 NA days NA days 23.510417 days
3 3 2012-07-15 80 A DCC MAMO 11:38 0 2012-07-15 11:38:00 5.4451389 days NA days NA days
4 4 2012-07-20 80 A DCC MICRO 22:19 0 2012-07-20 22:19:00 9.0305556 days NA days NA days
5 5 2012-07-29 80 A DCC MICRO 23:03 0 2012-07-29 23:03:00 8.1256944 days NA days NA days
6 8 2012-08-07 80 A DCC PRLO 2:04 0 2012-08-07 02:04:00 1.9111111 days NA days NA days
7 9 2012-08-08 80 A DCC MICRO 23:56 0 2012-08-08 23:56:00 0.9652778 days NA days NA days
8 10 2012-08-09 80 A DCC PRLO 23:06 0 2012-08-09 23:06:00 3.0402778 days NA days NA days
9 11 2012-08-13 80 A DCC MICRO 0:04 0 2012-08-13 00:04:00 NA days NA days NA days
10 12 2012-08-13 90 A DCC MICRO 0:46 0 2012-08-13 00:46:00 12.0034722 days 22.119444 days NA days
11 13 2012-08-25 90 A DCC LEAM 0:51 0 2012-08-25 00:51:00 2.8972222 days 10.115972 days NA days
12 14 2012-08-27 90 A DCC MICRO 22:23 0 2012-08-27 22:23:00 12.9479167 days 7.218750 days NA days
13 15200 2012-09-04 90 A DCC MUXX 3:38 1 2012-09-04 03:38:00 9.9701389 days NA days 5.729167 days
14 15201 2012-09-09 90 A DCC MICRO 21:08 0 2012-09-09 21:08:00 1.1472222 days 4.240972 days NA days
15 15199 2012-09-11 90 A DCC MICRO 0:40 0 2012-09-11 00:40:00 12.8951389 days 3.093750 days NA days
16 15177 2012-09-14 90 A DCC MUXX 2:55 1 2012-09-14 02:55:00 NA days NA days 9.801389 days
17 15178 2012-09-23 90 A DCC MICRO 22:09 0 2012-09-23 22:09:00 2.9423611 days NA days NA days
18 15204 2012-09-26 90 A DCC MICRO 20:46 0 2012-09-26 20:46:00 0.2736111 days NA days NA days
19 15205 2012-09-27 90 A DCC MICRO 3:20 0 2012-09-27 03:20:00 NA days NA days NA days
However, you'll have to do some checking to make sure that the calculations are actually doing what you need- I don't have example output or actual km/year info to truth these results against (read: I don't know what I'm looking at!).
Note also that I think arrange is irrelevant for the operations here, considering that the datetime gets sorted in the function.
I'll use the sample posted in the comments as an example:
d1 = structure(list(ID = c(1L, 2L, 3L, 4L, 5L, 8L), date = c("2012-06-19", "2012-06-21", "2012-07-15", "2012-07-20", "2012-07-29", "2012-08-07" ), km = c(80L, 80L, 80L, 80L, 80L, 80L), culv.id = c("A", "A", "A", "A", "A", "A"), type = c("DCC", "DCC", "DCC", "DCC", "DCC", "DCC"), sps = c("MICRO", "MUXX", "MAMO", "MICRO", "MICRO", "PRLO" ), time = c("2:19", "23:23", "11:38", "22:19", "23:03", "2:04" ), pp = c(0, 1, 0, 0, 0, 0)), .Names = c("ID", "date", "km", "culv.id", "type", "sps", "time", "pp"), row.names = c(NA, 6L ), class = "data.frame")
We add the datetime column just as you specified:
d1$datetime=strftime(paste(d1$date,d1$time),'%Y-%m-%d %H:%M',usetz=FALSE)
First, add a column giving the class of the previous observation (prev) and the time since that observation (timedif). The first row is then dropped because it has no previous observation. Note that timedif is a numeric value in days; the units are set explicitly because subtracting two POSIXct vectors picks its units automatically. The seque label pastes the current class before the previous one, so "10" means a predator seen after a prey.
library(dplyr)
d1 = d1 %>% mutate(prev = lag(pp))
d1 = d1 %>% mutate(timedif = as.numeric(difftime(as.POSIXct(datetime), lag(as.POSIXct(datetime)), units = "days")))
d1 = d1[2:nrow(d1),] %>% mutate(seque = as.factor(paste0(pp,prev)))
At this point, your table looks like
> d1
ID date km culv.id type sps time pp datetime prev timedif seque
1 2 2012-06-21 80 A DCC MUXX 23:23 1 2012-06-21 23:23 0 2.877778 10
2 3 2012-07-15 80 A DCC MAMO 11:38 0 2012-07-15 11:38 1 23.510417 01
3 4 2012-07-20 80 A DCC MICRO 22:19 0 2012-07-20 22:19 0 5.445139 00
4 5 2012-07-29 80 A DCC MICRO 23:03 0 2012-07-29 23:03 0 9.030556 00
5 8 2012-08-07 80 A DCC PRLO 2:04 0 2012-08-07 02:04 0 8.125694 00
After that, just take the wanted statistics for each group by using
avg = d1 %>% group_by(seque) %>% summarise(mean(timedif))
sdevs = d1 %>% group_by(seque) %>% summarise(sd(timedif))
We obtain
>avg
# A tibble: 3 x 2
seque `mean(timedif)`
<fctr> <dbl>
1 00 7.533796
2 01 23.510417
3 10 2.877778
> sdevs
# A tibble: 3 x 2
seque `sd(timedif)`
<fctr> <dbl>
1 00 1.864554
2 01 NA
3 10 NA
Note that the standard deviation is NA for the 01 and 10 sequences because the sample dataset contains only one observation of each.
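The same consecutive-observation summary can be sketched in base R with the transitions spelled out as "from->to", which may be easier to read than the 00/01/10 codes. This is only an illustration on the six sample rows; the names obs, cls, transition, gap_days, gap_mean and gap_sd are made up here, and the time zone is fixed to UTC as an assumption so that daylight saving changes cannot shift the gaps.

```r
# The six sample observations (datetime and predator flag only)
obs <- data.frame(
  datetime = as.POSIXct(c("2012-06-19 02:19", "2012-06-21 23:23",
                          "2012-07-15 11:38", "2012-07-20 22:19",
                          "2012-07-29 23:03", "2012-08-07 02:04"),
                        tz = "UTC"),
  pp = c(0, 1, 0, 0, 0, 0)
)
n <- nrow(obs)

# Gap between each observation and the one before it, always in days
gap_days <- as.numeric(difftime(obs$datetime[-1], obs$datetime[-n],
                                units = "days"))

# Label each gap as "earlier class->later class"
cls <- c("prey", "predator")[obs$pp + 1]
transition <- paste(cls[-n], cls[-1], sep = "->")

# Mean and standard deviation of the gap per transition type
gap_mean <- tapply(gap_days, transition, mean)
gap_sd   <- tapply(gap_days, transition, sd)
```

On this sample, gap_mean for "prey->prey" comes out at about 7.53 days, matching the 00 row above, and the two single-observation transitions again have an NA standard deviation.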
I would like to be able to aggregate survey data collected over a range of days into a single period. For example, I'd like to combine the first three dates (2015-03-17, 2015-03-23, 2015-03-26) into the period "March 2015". I will then use these periods to produce boxplots showing "Average.Count" for each period.
In all, I would like to make 4 unique periods:
March 2015 (first 3 dates in the table below)
September 2015 (dates 4-5 in the table below)
March 2016 (dates 6-15 in the table below)
September 2016 (dates 16-23 in the table below)
Here are the dataset headings.
head(Survival.Pre.Harvest)
Bay.Unique Date Average.Count Total.Predators Time Previous.Average.Count
2 1 2015-03-17 346.9 2 0 NA
3 1 2015-09-14 326.6 8 181 346.9
4 1 2016-02-29 322.6 3 349 326.6
7 2 2015-03-17 326.4 2 0 NA
8 2 2015-09-14 288.8 4 181 326.4
9 2 2016-02-29 271.4 6 349 288.8
These are the unique dates within the dataset.
table(Survival.Pre.Harvest$Date)
2015-03-17 2015-03-23 2015-03-26 2015-09-14 2015-09-15 2016-02-24 2016-02-25 2016-02-26 2016-02-29
9 3 1 9 3 4 6 6 5
2016-03-01 2016-03-02 2016-03-03 2016-03-04 2016-03-22 2016-03-23 2016-09-12 2016-09-13 2016-09-14
3 6 3 6 6 2 6 6 4
2016-09-20 2016-09-22 2016-10-18 2016-10-19 2016-10-20
7 10 4 3 14
Thanks in advance!
dput(head(Survival.Pre.Harvest))
structure(list(Bay.Unique = c(1, 1, 1, 2, 2, 2), Date = structure(c(16511,
16692, 16860, 16511, 16692, 16860), class = "Date"), Average.Count = c(346.9,
326.6, 322.6, 326.4, 288.8, 271.4), Total.Predators = c(2L, 8L,
3L, 2L, 4L, 6L), Time = c(0, 181, 349, 0, 181, 349), Previous.Average.Count = c(NA,
346.9, 326.6, NA, 326.4, 288.8)), .Names = c("Bay.Unique", "Date",
"Average.Count", "Total.Predators", "Time", "Previous.Average.Count"
), row.names = c(2L, 3L, 4L, 7L, 8L, 9L), class = "data.frame")
This should work: define break dates that bracket each survey period, then cut the dates into labelled bins.
library(lubridate)
library(ggplot2)
Survival.Pre.Harvest$Date <- ymd(Survival.Pre.Harvest$Date)
bks = ymd("2015-01-01", "2015-08-31", "2016-01-01", "2016-08-31", "2017-01-01")
lbs <- c("Mar2015", "Sep2015", "Mar2016", "Sep2016")
Survival.Pre.Harvest$yearmonth <- cut(Survival.Pre.Harvest$Date, breaks = bks, labels = lbs)
ggplot(Survival.Pre.Harvest, aes(x=yearmonth, y=Average.Count)) + geom_boxplot()
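To see how the breaks map individual dates onto periods, here is a small base-R check using one date from each survey round in the question. It is just a sketch: dates and period are names made up for this example, and as.Date is used instead of lubridate so that it runs without extra packages.

```r
# One date from each of the four survey rounds listed in the question
dates <- as.Date(c("2015-03-17", "2015-09-14", "2016-02-29", "2016-10-20"))

# Same break dates and labels as in the answer above
bks <- as.Date(c("2015-01-01", "2015-08-31", "2016-01-01",
                 "2016-08-31", "2017-01-01"))
lbs <- c("Mar2015", "Sep2015", "Mar2016", "Sep2016")

# Each date falls into the bin that starts at the closest earlier break
period <- cut(dates, breaks = bks, labels = lbs)
table(period)
```

Dates outside the first and last break would come back as NA, so the breaks need extending if later surveys are added.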
I have a data frame that looks like the following:
system Id initial final
665 9 16001 6070 6071
683 10 16001 6100 6101
696 11 16001 6101 6113
712 10 16971 6150 6151
715 11 16971 6151 6163
4966 7 4118 10238 10242
5031 9 4118 10260 10278
5088 10 4118 10279 10304
5115 11 4118 10305 10317
structure(list(system = c(9L, 10L, 11L, 10L, 11L, 7L, 9L, 10L,
11L), Id = c(16001L, 16001L, 16001L, 16971L, 16971L, 4118L, 4118L,
4118L, 4118L), initial = c(6070, 6100, 6101, 6150, 6151, 10238,
10260, 10279, 10305), final = c(6071, 6101, 6113, 6151, 6163,
10242, 10278, 10304, 10317)), .Names = c("system", "Id", "initial",
"final"), row.names = c(665L, 683L, 696L, 712L, 715L, 4966L,
5031L, 5088L, 5115L), class = "data.frame")
I would like to get a new data frame with the following structure:
Id system length initial final
1 16001 9,10,11 3 6070 6113
2 16971 10,11 2 6150 6163
3 4118 7 1 10238 10242
4 4118 9,10,11 3 10260 10317
structure(list(Id = c(16001L, 16971L, 4118L, 4118L), system = structure(c(3L,
1L, 2L, 3L), .Label = c("10,11", "7", "9,10,11"), class = "factor"),
length = c(3L, 2L, 1L, 3L), initial = c(6070L, 6150L, 10238L,
10260L), final = c(6113, 6163, 10242, 10317)), .Names = c("Id",
"system", "length", "initial", "final"), class = "data.frame", row.names = c(NA,
-4L))
The grouping is by Id and by runs of consecutive rows in which "system" increases by exactly one. I would also like a column listing the "system" values in each group and a column counting them, plus the first "initial" and the last "final" of each group.
Is it possible to do this in R?
Thanks.
You could use data.table. Convert the data.frame to a data.table (setDT), create a grouping variable "indx" by taking the difference between adjacent elements of "system" (diff(system)) and the cumsum of the resulting logical vector, then use "Id" and "indx" as grouping variables to get the statistics.
library(data.table)
setDT(df)[,list(system=toString(system), length=.N, initial=initial[1L],
final=final[.N]), by=list(Id,indx=cumsum(c(TRUE, diff(system)!=1)))][,
indx:=NULL][]
# Id system length initial final
#1: 16001 9, 10, 11 3 6070 6113
#2: 16971 10, 11 2 6150 6163
#3: 4118 7 1 10238 10242
#4: 4118 9, 10, 11 3 10260 10317
Or, based on @jazzurro's comment about using the first/last functions from dplyr,
library(dplyr)
df %>%
group_by(indx=cumsum(c(TRUE, diff(system)!=1)), Id) %>%
summarise(system=toString(system), length=n(),
initial=first(initial), final=last(final))
A solution without data.table, using plyr instead:
library(plyr)
func = function(subdf)
{
bool = c(diff(subdf$system),1)==1
ldply(split(subdf, bool), function(u){
data.frame(system = paste(u$system, collapse=','),
Id = unique(u$Id),
length = nrow(u),
initial= head(u,1)$initial,
final = tail(u,1)$final)
})
}
ldply(split(df, df$Id), func)
# .id system length Id initial final
#1 FALSE 7 1 4118 10238 10242
#2 TRUE 9,10,11 3 4118 10260 10317
#3 TRUE 9,10,11 3 16001 6070 6113
#4 TRUE 10,11 2 16971 6150 6163
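For completeness, the cumsum(c(TRUE, diff(system) != 1)) idea from the data.table and dplyr answers can be sketched in base R without any packages. This is an illustration only; new_run, run, pieces and res are names chosen for this example, and the check for a change of Id is folded into the run index here instead of being a second grouping variable.

```r
# Data from the question
df <- data.frame(
  system  = c(9L, 10L, 11L, 10L, 11L, 7L, 9L, 10L, 11L),
  Id      = c(16001L, 16001L, 16001L, 16971L, 16971L,
              4118L, 4118L, 4118L, 4118L),
  initial = c(6070, 6100, 6101, 6150, 6151, 10238, 10260, 10279, 10305),
  final   = c(6071, 6101, 6113, 6151, 6163, 10242, 10278, 10304, 10317)
)

# A new run starts at the first row, whenever system does not increase
# by exactly 1, or whenever Id changes
new_run <- c(TRUE, diff(df$system) != 1 | diff(df$Id) != 0)
run <- cumsum(new_run)

# Summarise each run: Id, systems involved, their count, first/last values
pieces <- lapply(split(df, run), function(g) {
  data.frame(Id      = g$Id[1],
             system  = toString(g$system),
             length  = nrow(g),
             initial = g$initial[1],
             final   = g$final[nrow(g)],
             stringsAsFactors = FALSE)
})
res <- do.call(rbind, pieces)
rownames(res) <- NULL
```

Like the answers above, this relies on the rows already being in the order in which the runs should be detected.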