I am working with some messy data that, after reading it in, appears as follows:
> glimpse(il_births)
Rows: 106
Columns: 22
$ x1989 <dbl> 190247, 928, 175, 187, 445, 57, 425, 41, 207, 166, 2662, 48…
$ x1990 <dbl> 195499, 960, 192, 195, 462, 68, 449, 53, 222, 187, 2574, 47…
$ x1991 <dbl> 194066, 971, 164, 195, 464, 72, 448, 54, 179, 211, 2562, 49…
$ x1992 <dbl> 190923, 881, 189, 185, 462, 72, 414, 55, 201, 161, 2426, 46…
$ x1993 <dbl> 190709, 893, 152, 206, 497, 50, 389, 75, 202, 183, 2337, 43…
$ x1994 <dbl> 189182, 865, 158, 200, 538, 58, 429, 48, 189, 171, 2240, 41…
$ x1995 <dbl> 185801, 828, 140, 202, 566, 58, 417, 48, 173, 166, 2117, 43…
$ x1996 <dbl> 183079, 830, 147, 194, 529, 58, 417, 49, 175, 150, 2270, 41…
$ x1997 <dbl> 180649, 812, 132, 193, 531, 64, 389, 37, 163, 185, 2175, 43…
$ x1998 <dbl> 182503, 862, 140, 201, 545, 41, 417, 57, 185, 188, 2128, 41…
$ x1999 <dbl> 182027, 843, 117, 188, 595, 51, 396, 47, 193, 191, 2194, 39…
$ x2000 <dbl> 185003, 825, 132, 184, 587, 63, 434, 51, 170, 181, 2260, 40…
$ x2001 <dbl> 184022, 866, 138, 196, 629, 57, 420, 49, 147, 215, 2312, 39…
$ x2002 <dbl> 180555, 760, 129, 172, 629, 54, 434, 48, 191, 185, 2226, 39…
$ x2003 <dbl> 182393, 794, 141, 239, 668, 76, 458, 58, 154, 208, 2288, 39…
$ x2004 <dbl> 180665, 802, 126, 209, 646, 56, 396, 51, 151, 181, 2291, 42…
$ x2005 <dbl> 178872, 883, 122, 189, 744, 54, 409, 58, 160, 199, 2490, 40…
$ x2006 <dbl> 180503, 805, 112, 215, 737, 57, 392, 55, 140, 177, 2455, 41…
$ x2007 <dbl> 180530, 890, 136, 185, 736, 60, 413, 49, 163, 195, 2508, 44…
$ x2008 <dbl> 176634, 817, 120, 173, 676, 64, 409, 59, 142, 200, 2482, 40…
$ x2009 <dbl> 171077, 804, 114, 198, 622, 65, 381, 53, 123, 164, 2407, 40…
$ county_name <chr> "ILLINOIS TOTAL", "ADAMS", "ALEXANDER", "BOND", "BOONE", "B…
The data comes from All Live Births In Illinois, 1989-2009. The data frame is difficult to work with because the years are the column headers, alongside a single column holding the counties. I would prefer the table be formatted so that there is a year column and a county column, and each row contains an observation for one year and one county. That would make the data easier to work with in ggplot, so I can make some quick visualizations.
I first tried transposing the data frame, but that leaves the counties as rows, so it does not help much.
I also tried using the pivot_longer() function, but I was not sure how to set its parameters for my case.
Any help or suggestions are appreciated!
I suspect a reading of pivot_longer's help page would have done the trick:
data - A data frame to pivot.
cols - Columns to pivot into longer format.
names_to - A character vector specifying the new column or columns to create from the information stored in the column names of data specified by cols.
values_to - A string specifying the name of the column to create from the data stored in cell values.
The other arguments are for more complex operations. To solve your case:
data should be il_births.
cols should be all the year columns; you can use any tidyselect method to get them, and the easiest in this case is to say "everything except county_name", i.e. -county_name.
names_to is the name of the column that will hold the years; it defaults to "name", but you can change it to "year" or anything else.
values_to is the name of the column that will hold the values; it defaults to "value", but you can change it here too.
pivot_longer(il_births, -county_name, names_to = "year")
Additionally, you can remove the "x"'s from the column names, and format the year column as numeric:
pivot_longer(il_births, -county_name, names_to = "year",
             names_prefix = "x", names_transform = list(year = as.numeric))
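If you also want to rename the value column in the same call, set values_to as well (a small variant of the call above; the name "births" is just an example, not from the original data):

pivot_longer(il_births, -county_name, names_to = "year", values_to = "births",
             names_prefix = "x", names_transform = list(year = as.numeric))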
Here's a full reprex of how you might read in and tidy the data. A plot is going to look very messy if you include all counties, so I have used slice_max to include only the five most populous counties. This line could be removed if you want to retain all the data:
library(tidyverse)

data <- "https://data.illinois.gov/dataset/" %>%
  paste0("ac7f40df-b256-4867-9953-78c8c4a52590/",
         "resource/d7ec861b-6b7c-4260-82d8-3f05f49053f9/",
         "download/data.csv") %>%
  read.csv(check.names = FALSE) %>%
  filter(row_number() != 1) %>%   # Drop the "ILLINOIS TOTAL" row
  slice_max(`_2009`, n = 5) %>%   # Remove this line to keep all data
  mutate(county_name = str_to_title(county_name)) %>%
  mutate(county_name = reorder(county_name, -`_2009`)) %>%
  pivot_longer(-county_name, names_to = "Year", values_to = "Births") %>%
  mutate(Year = as.numeric(substr(Year, 2, 5)))
This results in:
data
#> # A tibble: 105 x 3
#> county_name Year Births
#> <fct> <dbl> <dbl>
#> 1 "Cook " 1989 94096
#> 2 "Cook " 1990 97005
#> 3 "Cook " 1991 96387
#> 4 "Cook " 1992 95140
#> 5 "Cook " 1993 94614
#> 6 "Cook " 1994 92881
#> 7 "Cook " 1995 90029
#> 8 "Cook " 1996 87747
#> 9 "Cook " 1997 85589
#> 10 "Cook " 1998 85970
#> # ... with 95 more rows
Which we could plot like this:
ggplot(data, aes(Year, Births, color = county_name)) +
  geom_line(alpha = 0.5) +
  scale_y_continuous(labels = scales::comma) +
  geom_point() +
  theme_minimal(base_size = 16) +
  scale_color_brewer(palette = "Set1", name = "County") +
  ggtitle("Live births in five most populous Illinois counties, 1989-2009") +
  labs(caption = "Source: Illinois Department of Public Health")
Created on 2022-11-20 with reprex v2.0.2
I want to group daily data from Google Trends into weekly observations and smooth them with a 7-day centered moving average. How can I do this, and in which order?
Should I group the data first, or should I apply the centered moving average to the daily data first?
This is my data:
dput(multiTimeline)
structure(list(day = structure(c(1598400000, 1598486400, 1598572800,
1598659200, 1598745600, 1598832000, 1598918400, 1599004800, 1599091200,
1599177600, 1599264000, 1599350400, 1599436800, 1599523200, 1599609600,
1599696000, 1599782400, 1599868800, 1599955200, 1600041600, 1600128000,
1600214400, 1600300800, 1600387200, 1600473600, 1600560000, 1600646400,
1600732800, 1600819200, 1600905600, 1600992000, 1601078400, 1601164800,
1601251200, 1601337600, 1601424000, 1601510400, 1601596800, 1601683200,
1601769600, 1601856000, 1601942400, 1602028800, 1602115200, 1602201600,
1602288000, 1602374400, 1602460800, 1602547200, 1602633600, 1602720000,
1602806400, 1602892800, 1602979200, 1603065600, 1603152000, 1603238400,
1603324800, 1603411200, 1603497600, 1603584000, 1603670400, 1603756800,
1603843200, 1603929600, 1604016000, 1604102400, 1604188800, 1604275200,
1604361600, 1604448000, 1604534400, 1604620800, 1604707200, 1604793600,
1604880000, 1604966400, 1605052800, 1605139200, 1605225600, 1605312000,
1605398400, 1605484800, 1605571200, 1605657600, 1605744000, 1605830400,
1605916800, 1606003200, 1606089600), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), football = c(36, 36, 41, 60, 45, 38, 38, 39,
43, 49, 70, 49, 44, 46, 50, 62, 71, 92, 96, 61, 51, 45, 50, 58,
87, 81, 54, 50, 43, 49, 58, 97, 84, 55, 48, 41, 51, 56, 94, 83,
51, 47, 46, 49, 62, 97, 84, 51, 55, 51, 47, 52, 96, 79, 51, 49,
42, 44, 52, 100, 82, 49, 45, 41, 42, 50, 89, 73, 48, 40, 21,
29, 36, 75, 69, 45, 37, 39, 45, 51, 87, 69, 47, 48, 43, 37, 45,
79, 66, 46)), row.names = c(NA, -90L), class = c("tbl_df", "tbl",
"data.frame"))
Data is from 2020-08-26 to 2020-11-23.
I allowed myself to use the packages dplyr, to make data manipulation easier, and lubridate, which makes date manipulation easy.
The code is:
library(dplyr)
library(lubridate)

df2 <- multiTimeline %>%
  mutate(week = week(day)) %>%   # label each day with its week of the year
  group_by(week) %>%
  summarise(average = mean(football))
The only function I used from lubridate there was week(), if you're interested.
What I did was: first, I created another column (it could have been the same one, though) that states the week. Note that this only works because your column was already in date-time format (a plain date would have worked too, maybe even better). From there, I grouped by week and took the average. I hope I understood your question correctly and that this helps.
It worked; this was the output:
> df2
# A tibble: 13 x 2
week average
<dbl> <dbl>
1 35 42
2 36 48.6
3 37 69
4 38 60.7
5 39 62
6 40 60.4
7 41 63.4
8 42 60.7
9 43 59.1
10 44 54.7
11 45 44.6
12 46 55.1
13 47 52.7
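Note that week() numbers the weeks within each year, so if your data crossed a year boundary, days from different years would land in the same group. A sketch of a more robust grouping, assuming lubridate's floor_date() (with weeks starting on Monday here):

library(dplyr)
library(lubridate)

# Label each day with the Monday of the calendar week it falls in,
# which keeps groups distinct across year boundaries
df2 <- multiTimeline %>%
  mutate(week = floor_date(day, unit = "week", week_start = 1)) %>%
  group_by(week) %>%
  summarise(average = mean(football))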
You can use rollmean from the zoo package to do all of this as a one-liner.
multiTimeline$rolling <- zoo::rollmean(multiTimeline$football, 7, na.pad = TRUE)
multiTimeline
#> # A tibble: 90 x 3
#> day football rolling
#> <dttm> <dbl> <dbl>
#> 1 2020-08-26 00:00:00 36 NA
#> 2 2020-08-27 00:00:00 36 NA
#> 3 2020-08-28 00:00:00 41 NA
#> 4 2020-08-29 00:00:00 60 42
#> 5 2020-08-30 00:00:00 45 42.4
#> 6 2020-08-31 00:00:00 38 43.4
#> 7 2020-09-01 00:00:00 38 44.6
#> 8 2020-09-02 00:00:00 39 46
#> 9 2020-09-03 00:00:00 43 46.6
#> 10 2020-09-04 00:00:00 49 47.4
#> # ... with 80 more rows
If you want to pick out the smoothed average for each week running from Saturday to Friday, just use filter to select only Tuesdays: a 7-day window centered on a Tuesday spans the previous Saturday through the following Friday.
multiTimeline %>% filter(lubridate::wday(day) == 3)
#> # A tibble: 12 x 3
#> day football rolling
#> <dttm> <dbl> <dbl>
#> 1 2020-09-01 00:00:00 38 44.6
#> 2 2020-09-08 00:00:00 46 56
#> 3 2020-09-15 00:00:00 51 64.7
#> 4 2020-09-22 00:00:00 50 60.3
#> 5 2020-09-29 00:00:00 48 61.7
#> 6 2020-10-06 00:00:00 47 61.7
#> 7 2020-10-13 00:00:00 55 62.4
#> 8 2020-10-20 00:00:00 49 59
#> 9 2020-10-27 00:00:00 45 58.4
#> 10 2020-11-03 00:00:00 40 48
#> 11 2020-11-10 00:00:00 37 51.6
#> 12 2020-11-17 00:00:00 48 53.7
To show this is what you want, we can plot your data and the averaged line using ggplot:
ggplot(multiTimeline, aes(day, football)) +
  geom_line() +
  geom_line(data = multiTimeline %>% filter(lubridate::wday(day) == 3),
            aes(y = rolling), col = "red", lty = 2, size = 1.5)
I would like to modify the answer to the question here, or find a new solution, to include another column showing the second largest consecutive run of "0". My sample data and code are below; the function operates on the month columns, and the "Second largest run" column is what I hope to add. I am working with a large dataset, so the more efficient the better. Any ideas are appreciated, thanks.
sample data
structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9), V1 = c("A",
"B", "A", "B", "B", "A", "A", "B", "B"), V2 = c(21, 233, 185,
85, 208, 112, 238, 66, 38), V3 = c(149, 250, 218, 104, 62, 19,
175, 168, 28), Jan = c(10, 20, 10, 12, 76, 28, 137, 162, 101),
Feb = c(20, 25, 15, 0, 89, 0, 152, 177, 119), March = c(0,
28, 20, 14, 108, 0, 165, 194, 132), April = c(0, 34, 25,
16, 125, 71, 181, 208, 149), May = c(25, 0, 30, 22, 135,
0, 191, 224, 169), June = c(29, 0, 35, 24, 145, 0, 205, 244,
187), July = c(34, 0, 40, 28, 163, 0, 217, 256, 207), August = c(37,
0, 45, 29, 173, 0, 228, 276, 221), Sep = c(0, 39, 50, 31,
193, 0, 239, 308, 236), Oct = c(0, 48, 55, 35, 210, 163,
252, 0, 247), Nov = c(48, 55, 60, 40, 221, 183, 272, 0, 264
), Dec = c(50, 60, 65, 45, 239, 195, 289, 0, 277), `Second largest run` = c(1,
NA, NA, NA, NA, 2, NA, NA, NA), result = c(2, 4, -Inf, 1,
-Inf, 5, -Inf, 3, -Inf)), row.names = c(NA, -9L), class = c("tbl_df",
"tbl", "data.frame"))
code
most_consecutive_val = function(x, val = 0) {
  with(rle(x), max(lengths[values == val]))
}

test$result <- apply(test[, -c(1:4, 17)], MARGIN = 1, most_consecutive_val)
Rather than taking the max from the run length encoding (rle) output, we want to sort it and then extract the desired index. We get NA when we request an index that doesn't exist, for example in row 2, where there isn't a second run of zeroes.
ordered_runs = function(x, val = 0, idx = 1) {
  with(rle(x), sort(lengths[values == val], decreasing = TRUE))[idx]
}

test$result_1 <- apply(test[, -c(1:4, 17:18)], MARGIN = 1, ordered_runs, idx = 1)
test$result_2 <- apply(test[, -c(1:4, 17:18)], MARGIN = 1, ordered_runs, idx = 2)
The output is slightly different from what you expected: (1) it uses NA rather than -Inf, and (2) in your first row I believe there is a tie, with a second run of 2 zeroes.
> test[,c(1,17:20)]
# A tibble: 9 x 5
ID `Second largest run` result result_1 result_2
<dbl> <dbl> <dbl> <int> <int>
1 1 1 2 2 2
2 2 NA 4 4 NA
3 3 NA -Inf NA NA
4 4 NA 1 1 NA
5 5 NA -Inf NA NA
6 6 2 5 5 2
7 7 NA -Inf NA NA
8 8 NA 3 3 NA
9 9 NA -Inf NA NA
Here is an option using data.table which should be quite fast for OP's large dataset and also identifies all sequences of zeros simultaneously:
library(data.table)
setDT(DF)
cols <- c("Jan", "Feb", "March", "April", "May", "June",
          "July", "August", "Sep", "Oct", "Nov", "Dec")

# convert into a long format
m <- melt(DF, measure.vars = cols)[
  # identify consecutive runs of the same value and number the rows within each run
  order(ID), c("rl", "rw") := .(rl <- rleid(ID, value), rowid(rl))][
  # keep the last element of each run of zeros (that is the run's length)
  value == 0L, .(ID = ID[.N], len = rw[.N]), rl][
  # sort in descending order of run length
  order(ID, -len)]

# pivot into wide format and perform an update join
wide <- dcast(m, ID ~ rowid(ID), value.var = "len")
DF[wide, on = .(ID), (names(wide)) := mget(names(wide))]
output:
ID V1 V2 V3 Jan Feb March April May June July August Sep Oct Nov Dec 1 2
1: 1 A 21 149 10 20 0 0 25 29 34 37 0 0 48 50 2 2
2: 2 B 233 250 20 25 28 34 0 0 0 0 39 48 55 60 4 NA
3: 3 A 185 218 10 15 20 25 30 35 40 45 50 55 60 65 NA NA
4: 4 B 85 104 12 0 14 16 22 24 28 29 31 35 40 45 1 NA
5: 5 B 208 62 76 89 108 125 135 145 163 173 193 210 221 239 NA NA
6: 6 A 112 19 28 0 0 71 0 0 0 0 0 163 183 195 5 2
7: 7 A 238 175 137 152 165 181 191 205 217 228 239 252 272 289 NA NA
8: 8 B 66 168 162 177 194 208 224 244 256 276 308 0 0 0 3 NA
9: 9 B 38 28 101 119 132 149 169 187 207 221 236 247 264 277 NA NA
data:
DF <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9), V1 = c("A",
"B", "A", "B", "B", "A", "A", "B", "B"), V2 = c(21, 233, 185,
85, 208, 112, 238, 66, 38), V3 = c(149, 250, 218, 104, 62, 19,
175, 168, 28), Jan = c(10, 20, 10, 12, 76, 28, 137, 162, 101),
Feb = c(20, 25, 15, 0, 89, 0, 152, 177, 119), March = c(0,
28, 20, 14, 108, 0, 165, 194, 132), April = c(0, 34, 25,
16, 125, 71, 181, 208, 149), May = c(25, 0, 30, 22, 135,
0, 191, 224, 169), June = c(29, 0, 35, 24, 145, 0, 205, 244,
187), July = c(34, 0, 40, 28, 163, 0, 217, 256, 207), August = c(37,
0, 45, 29, 173, 0, 228, 276, 221), Sep = c(0, 39, 50, 31,
193, 0, 239, 308, 236), Oct = c(0, 48, 55, 35, 210, 163,
252, 0, 247), Nov = c(48, 55, 60, 40, 221, 183, 272, 0, 264
), Dec = c(50, 60, 65, 45, 239, 195, 289, 0, 277), `1` = c(2L,
4L, NA, 1L, NA, 5L, NA, 3L, NA), `2` = c(2L, NA, NA, NA,
NA, 2L, NA, NA, NA)), row.names = c(NA, -9L), class = "data.frame")
I currently have data spread across multiple columns in R. I am looking for a way to put this information into one column, as a vector for each of the individual rows.
Is there a function to do this?
For example, the data looks like this:
library(dplyr)

DF <- data.frame(id = rep(LETTERS, each = 1)[1:26],
                 replicate(26, sample(1001, 26)),
                 Class = sample(c("Yes", "No"), 26, TRUE))
DF <- select(DF, all_of(c("id", "X1", "X2", "X23", "Class")))
How can I merge the columns "X1","X2", "X23" into a vector containing numeric type variables for each of the IDs?
Like this?
library(reshape2)
melt(DF) %>% dcast(id ~ ., fun.aggregate = list)
Using id, Class as id variables
id .
1 A 422, 74, 439
2 B 879, 443, 923
3 C 575, 901, 749
4 D 813, 747, 21
5 E 438, 526, 675
6 F 863, 562, 474
7 G 103, 713, 918
8 H 585, 294, 525
9 I 115, 76, 175
10 J 953, 379, 926
11 K 679, 439, 377
12 L 816, 624, 538
13 M 678, 226, 142
14 N 667, 369, 586
15 O 795, 422, 248
16 P 165, 22, 612
17 Q 294, 476, 746
18 R 968, 368, 290
19 S 238, 481, 980
20 T 921, 482, 741
21 U 550, 15, 296
22 V 121, 358, 625
23 W 213, 313, 242
24 X 92, 77, 58
25 Y 607, 936, 350
26 Z 660, 42, 275
A note, though: I do not know your final use case, but this strikes me as something you probably do not want. It is often more advisable to stick to tidy data; see e.g. https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
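If you do want row-wise vectors but would rather stay within the tidyverse, here is a rough equivalent using a dplyr list-column (a sketch, assuming the DF defined in the question):

library(dplyr)

# Collect the chosen columns into a single list-column of numeric vectors, one per row
DF_vec <- DF %>%
  rowwise() %>%
  mutate(values = list(c(X1, X2, X23))) %>%
  ungroup() %>%
  select(id, Class, values)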
I have a dataset that looks like this:
> head(df)
# A tibble: 6 × 3
id tstart tstop
<dbl> <dttm> <dttm>
1 115 2016-01-04 19:14:06 2016-01-04 19:14:15
2 115 2016-01-04 19:14:15 2016-01-04 19:14:16
3 115 2016-01-04 19:14:16 2016-01-04 20:00:00
4 115 2016-01-04 20:00:00 2016-01-04 23:32:06
5 119 2016-01-09 12:56:49 2016-01-09 13:09:38
6 119 2016-01-09 19:21:30 2016-01-09 19:26:48
> dput(df)
structure(list(id = c(115, 115, 115, 115, 119, 119, 119, 119,
115, 119, 115, 115, 119, 119, 115, 115, 115, 115, 119, 115, 115,
119, 119, 115, 115, 119, 119, 119, 119, 119, 119, 119, 119, 119,
119, 115, 119, 119, 115, 119, 119, 115, 119, 115, 115, 115, 115,
115), tstart = structure(c(1451960046, 1451960055, 1451960056,
1451962800, 1452369409, 1452392490, 1452656773, 1452768075, 1453117929,
1453158614, 1453211410, 1453241664, 1453472208, 1453501656, 1453683210,
1453859618, 1453923350, 1454160212, 1454185221, 1454334295, 1454667974,
1454893810, 1455228853, 1455498598, 1455551174, 1455586503, 1455652857,
1455747333, 1455965433, 1456053421, 1456137889, 1456482398, 1456590733,
1456839351, 1456945452, 1457003430, 1457099049, 1457108703, 1457445523,
1457478749, 1457480525, 1457542159, 1457562948, 1458598425, 1458822311,
1458940977, 1459028316, 1459083563), class = c("POSIXct", "POSIXt"
), tzone = ""), tstop = structure(c(1451960055, 1451960056, 1451962800,
1451975526, 1452370178, 1452392808, 1452656986, 1452768517, 1453118186,
1453158918, 1453211770, 1453242132, 1453472619, 1453502485, 1453683500,
1453859899, 1453923567, 1454161008, 1454185580, 1454334848, 1454668930,
1454894182, 1455229448, 1455499217, 1455552432, 1455587211, 1455653538,
1455747987, 1455965658, 1456053774, 1456138469, 1456482801, 1456591336,
1456839506, 1456945790, 1457003644, 1457099216, 1457109800, 1457445783,
1457480525, 1457480533, 1457542907, 1457563544, 1458598877, 1458822887,
1458941209, 1459028558, 1459083990), class = c("POSIXct", "POSIXt"
))), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-48L), .Names = c("id", "tstart", "tstop"))
> head(df)
# A tibble: 6 × 3
id tstart tstop
<dbl> <dttm> <dttm>
1 115 2016-01-04 19:14:06 2016-01-04 19:14:15
2 115 2016-01-04 19:14:15 2016-01-04 19:14:16
3 115 2016-01-04 19:14:16 2016-01-04 20:00:00
4 115 2016-01-04 20:00:00 2016-01-04 23:32:06
5 115 2016-01-18 04:52:09 2016-01-18 04:56:26
6 115 2016-01-19 06:50:10 2016-01-19 06:56:10
I'm trying to create an event sequence, event.seq, where an event is defined as the continuation in time of the previous row. The sequence resets at every id change. The end dataframe I'm trying to get is:
> dput(df.out)
structure(list(id = c(115, 115, 115, 115, 115, 115, 115, 115,
115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115,
115, 115, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
119), tstart = structure(c(1451960046, 1451960055, 1451960056,
1451962800, 1453117929, 1453211410, 1453241664, 1453683210, 1453859618,
1453923350, 1454160212, 1454334295, 1454667974, 1455498598, 1455551174,
1457003430, 1457445523, 1457542159, 1458598425, 1458822311, 1458940977,
1459028316, 1459083563, 1452369409, 1452392490, 1452656773, 1452768075,
1453158614, 1453472208, 1453501656, 1454185221, 1454893810, 1455228853,
1455586503, 1455652857, 1455747333, 1455965433, 1456053421, 1456137889,
1456482398, 1456590733, 1456839351, 1456945452, 1457099049, 1457108703,
1457478749, 1457480525, 1457562948), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), tstop = structure(c(1451960055, 1451960056,
1451962800, 1451975526, 1453118186, 1453211770, 1453242132, 1453683500,
1453859899, 1453923567, 1454161008, 1454334848, 1454668930, 1455499217,
1455552432, 1457003644, 1457445783, 1457542907, 1458598877, 1458822887,
1458941209, 1459028558, 1459083990, 1452370178, 1452392808, 1452656986,
1452768517, 1453158918, 1453472619, 1453502485, 1454185580, 1454894182,
1455229448, 1455587211, 1455653538, 1455747987, 1455965658, 1456053774,
1456138469, 1456482801, 1456591336, 1456839506, 1456945790, 1457099216,
1457109800, 1457480525, 1457480533, 1457563544), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), event.seq = c(1, 1, 1, 1, 2, 3, 4,
5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1,
2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20, 21, 22, 23, 23, 24)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -48L), .Names = c("id", "tstart", "tstop",
"event.seq"))
> head(df.out)
# A tibble: 6 × 4
id tstart tstop event.seq
<dbl> <dttm> <dttm> <dbl>
1 115 2016-01-05 02:14:06 2016-01-05 02:14:15 1
2 115 2016-01-05 02:14:15 2016-01-05 02:14:15 1
3 115 2016-01-05 02:14:15 2016-01-05 03:00:00 1
4 115 2016-01-05 03:00:00 2016-01-05 06:32:06 1
5 115 2016-01-18 11:52:09 2016-01-18 11:56:26 2
6 115 2016-01-19 13:50:10 2016-01-19 13:56:09 3
This gets me closer, but not quite what I want:
df.2 <- df %>%
  arrange(id, tstart) %>%
  mutate(tstart.ahead = lead(tstart)) %>%
  mutate(tstop.behind = lag(tstop)) %>%
  mutate(event.seq.1 = as.numeric(tstop == tstart.ahead),
         event.seq.2 = as.numeric(tstart == tstop.behind)) %>%
  mutate(event.seq = pmax(event.seq.1, event.seq.2, na.rm = TRUE)) %>%
  select(id, tstart, tstop, event.seq)
This is a little tricky. Since you want to reset for each id, we'll definitely need to group_by(id). Then we create a column indicating whether each row is not a continuation of the previous row. Finally, we take the cumsum of this indicator: if a row is not a continuation, 1 is added and event.seq goes up; if it is a continuation, 0 is added and event.seq stays the same. We add 1 so the sequence starts at 1 rather than 0.
library(dplyr)

df.2 <- df %>%
  arrange(id, tstart) %>%
  group_by(id) %>%
  mutate(not_continued = c(0, (lag(tstop) != tstart)[-1]),
         event.seq = 1 + cumsum(not_continued)) %>%
  select(-not_continued)
all.equal(df.2, df.out)
# [1] TRUE
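For reference, the same indicator-plus-cumsum idea can be written in base R with ave() (a sketch, assuming df is sorted by id and tstart first):

df <- df[order(df$id, df$tstart), ]

# Within each id, flag rows that do not continue the previous row, then cumulate
df$event.seq <- ave(seq_along(df$id), df$id, FUN = function(i) {
  cumsum(c(1, df$tstart[i][-1] != df$tstop[i][-length(i)]))
})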