Time lag between sequential observations giving negative values - r

I am trying to calculate the time between sequential observations. I have attached a sample of my data here.
A subset of my data looks like:
head(d1) #visualize the first few lines of the data
date time year km sps pp datetime next timedif seque
<fct> <fct> <int> <dbl> <fct> <dbl> <chr> <dbl> <dbl> <fct>
2012/06/21 23:23 2012 80 MUXX 1 2012-06-21 23:23 0 4144 10
2012/07/15 11:38 2012 80 MAMO 0 2012-07-15 11:38 1 33855 01
2012/07/20 22:19 2012 80 MICRO 0 2012-07-20 22:19 0 7841 00
2012/07/29 23:03 2012 80 MICRO 0 2012-07-29 23:03 0 13004 00
2012/10/18 2:54 2012 80 MICRO 0 2012-10-18 02:54 0 -971 00
2012/10/23 2:49 2012 80 MICRO 0 2012-10-23 02:49 0 -1094 00
Where:
pp: which species (sps) are predators (coded as 1) and which are prey (coded as 0)
next: very next pp after the current observation
timedif: time difference between the current observation and the next one
seque: this should be the sequence order: where the first number is the current pp and the second number is the next pp
To generate the datetime column, I did this:
d1$datetime=strftime(paste(d1$date,d1$time),'%Y-%m-%d %H:%M',usetz=FALSE) #converting the date/time into a new format
To make the other columns I used the following code:
d1 = d1 %>%
ungroup() %>%
group_by(km, year) %>%
mutate(next = dplyr::lag(pp)) %>%
mutate(timedif = as.numeric(as.POSIXct(datetime) - lag(as.POSIXct(datetime))))
d1 = d1[2:nrow(d1),] %>% mutate(seque = as.factor(paste0(pp,prev)))
I have two questions:
My lag function appears to be recording the previous pp event, not the next pp event. How do I fix this?
My timedif calculation is giving me negative values, which shouldn't be possible. Why is that happening?
Just in case, here is the output for str(d1):
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 96 obs. of 10 variables:
$ date : Factor w/ 1093 levels "2012/05/30","2012/05/31",..: 23 47 52 61 71 76 76 88 90 98 ...
$ time : Factor w/ 1439 levels "0:00","0:01",..: 983 219 919 963 1016 5 47 52 923 1058 ...
$ year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
$ km : num 80 80 80 80 80 80 80 80 80 80 ...
$ sps : Factor w/ 17 levels "CACA","ERDO",..: 11 7 9 9 9 9 9 4 9 11 ...
$ pp : num 1 0 0 0 0 0 0 0 0 1 ...
$ datetime: chr "2012-06-21 23:23" "2012-07-15 11:38" "2012-07-20 22:19" "2012-07-29 23:03" ...
$ next : num 0 1 0 0 0 0 0 0 0 0 ...
$ timedif : num 4144 33855 7841 13004 14453 ...
$ seque : Factor w/ 4 levels "00","01","10",..: 3 2 1 1 1 1 1 1 1 3 ...
And also:
dput(d1[1:10,])
structure(list(
date = structure(c(23L, 47L, 52L, 61L, 71L, 76L, 76L, 88L, 90L, 98L),
.Label = c("2012/05/30", "2012/05/31", "2012/06/01", "2012/06/02", "2012/06/03", "2012/06/04", "2012/06/05", "2013/06/18", "2013/06/19", "2013/06/20", "2013/06/21", "2013/06/22", "2014/07/19", "2014/07/20", "2014/07/21", "2014/07/22", "2014/07/23", "2015/08/06", "2015/08/07", "2015/08/08", "2015/08/09", "2015/08/10"),
class = "factor"),
time = structure(c(983L, 219L, 919L, 963L, 1016L, 5L, 47L, 52L, 923L, 1058L),
.Label = c("0:00", "0:01", "0:02", "0:03", "0:04", "0:05", "0:06", "0:07", "0:33","0:34", "0:35", "0:36", "0:37","10:06", "10:07", "10:08", "10:09", "10:10", "10:11", "10:12", "10:13", "2:05", "2:06", "2:07", "2:08", "2:09", "2:10", "2:11", "9:54", "9:55", "9:56", "9:57", "9:58", "9:59"),
class = "factor"),
year = c(2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L),
km = c(80, 80, 80, 80, 80, 80, 80, 80, 80, 80),
sps = structure(c(11L, 7L, 9L, 9L, 9L, 9L, 9L, 4L, 9L, 11L),
.Label = c("CACA", "ERDO", "FEDO", "LEAM", "LOCA", "MAAM", "MAMO", "MEME", "MICRO", "MUVI", "MUXX", "ONZI", "PRLO", "TAHU", "TAST", "URAM", "VUVU"),
class = "factor"),
pp = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 1),
datetime = c("2012-06-21 23:23", "2012-07-15 11:38", "2012-07-20 22:19", "2012-07-29 23:03", "2012-08-08 23:56", "2012-08-13 00:04", "2012-08-13 00:46", "2012-08-25 00:51", "2012-08-27 22:23", "2012-09-04 03:38"),
prev = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0),
timedif = c(4144, 33855, 7841, 13004, 14453, 5768, 42, 17285, 4172, 10395),
seque = structure(c(3L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L), .Label = c("00", "01", "10", "11"),
class = "factor")),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -10L))
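A minimal sketch of one way to approach both questions, under two assumptions the post does not confirm: that the negative values come from rows that are not in chronological order within each km/year group, and that differences in minutes are wanted (the 33855 and 7841 in the sample match the minute gaps between consecutive datetimes). lag() looks at the previous row, so the next row needs the opposite shift, which is what dplyr::lead() does; the base R helper shift_up() below does the same thing. The names obs, when, next_pp and timedif_mins are made up for illustration. Note that next is a reserved word in R, so it cannot be used unquoted as a column name.

```r
# Three rows from the sample, deliberately out of chronological order
obs <- data.frame(
  km = c(80, 80, 80),
  year = c(2012L, 2012L, 2012L),
  pp = c(0, 1, 0),
  datetime = c("2012-07-15 11:38", "2012-06-21 23:23", "2012-07-20 22:19"),
  stringsAsFactors = FALSE
)

# Parse once, with an explicit format and time zone
obs$when <- as.POSIXct(obs$datetime, format = "%Y-%m-%d %H:%M", tz = "UTC")

# Sort chronologically inside each group before shifting anything
obs <- obs[order(obs$km, obs$year, obs$when), ]

# Value of the NEXT row within a group (the job dplyr::lead() does)
shift_up <- function(x) c(x[-1], NA)

obs$next_pp <- ave(obs$pp, obs$km, obs$year, FUN = shift_up)
next_secs <- ave(as.numeric(obs$when), obs$km, obs$year, FUN = shift_up)

# Fixed unit (minutes), so the scale never depends on automatic difftime units
obs$timedif_mins <- (next_secs - as.numeric(obs$when)) / 60
obs
```

Because each group is sorted first, every non-missing difference is non-negative; the last row of a group has no successor and gets NA.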

Related

How can I convert character dates into numerics? [closed]

Given a time series of cinema data, the "date" column is of interest. I would like to convert it into the format "YYYY/MM/DD". However, when I run my code:
CINEMA.TICKET$DATE <- as.Date(CINEMA.TICKET$date , format = "%y/%m/%d")
Two issues occur:
First, the dates are shown on the far right of the table as, e.g., "0005-05-20". Second, many entries disappear entirely. Can someone explain what I am doing wrong, and how I can do it properly?
film_code cinema_code total_sales tickets_sold tickets_out show_time occu_perc ticket_price ticket_use capacity date month quarter day newdate DATE
1 1492 304 3900000 26 0 4 4.26 150000 26 610.3286 5/5/2018 5 2 5 0005-05-20 2005-05-20
2 1492 352 3360000 42 0 5 8.08 80000 42 519.8020 5/5/2018 5 2 5 0005-05-20 2005-05-20
3 1492 489 2560000 32 0 4 20.00 80000 32 160.0000 5/5/2018 5 2 5 0005-05-20 2005-05-20
4 1492 429 1200000 12 0 1 11.01 100000 12 108.9918 5/5/2018 5 2 5 0005-05-20 2005-05-20
5 1492 524 1200000 15 0 3 16.67 80000 15 89.9820 5/5/2018 5 2 5 0005-05-20 2005-05-20
6 1492 71 1050000 7 0 3 0.98 150000 7 714.2857 5/5/2018 5 2 5 0005-05-20 2005-05-20
> str(CINEMA.TICKET)
As @Dave2e pointed out, you are looking for:
CINEMA.TICKET[, date := as.Date(date , format = "%d/%m/%Y")]
This assumes the input format is day/month/year, as in "30/5/2018". The question's only example, "5/5/2018", is ambiguous: it could be "%d/%m/%Y" or "%m/%d/%Y".
As for ordering columns use:
setcolorder(CINEMA.TICKET, c("c", "b", "a"))
where c,b,a are column names in their desired order
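To make the format codes concrete, here is a small sketch of why the original call misreads the string. It assumes the dates are month/day/year, which the question does not confirm. With "%y/%m/%d" the leading "5" is consumed as a two-digit year and the first two digits of "2018" as the day, which is consistent with the 2005-05-20 shown in the question's DATE column.

```r
raw <- "5/5/2018"

# The question's format: fields are read in year/month/day order
wrong <- as.Date(raw, format = "%y/%m/%d")

# Month/day/four-digit-year, matching the layout of the string (assumed)
right <- as.Date(raw, format = "%m/%d/%Y")

# An unambiguous day-first example, parsed with the day-first format
day_first <- as.Date("30/5/2018", format = "%d/%m/%Y")

# A Date prints with hyphens; format() gives slashes but returns character
slashes <- format(right, "%Y/%m/%d")
class(right)
class(slashes)
```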
lubridate probably does the trick
> lubridate::mdy("5/5/2018")
[1] "2018-05-05"
So you should use
library(lubridate)
library(tidyverse)
CINEMA.TICKET <- CINEMA.TICKET %>%
mutate(DATE=mdy(date))
Here is another option:
library(tidyverse)
output <- df %>%
mutate(date = as.Date(date, format="%m/%d/%Y"))
Output
film_code cinema_code total_sales tickets_sold tickets_out show_time occu_perc ticket_price ticket_use capacity date month quarter day
1 1492 304 3900000 26 0 4 4.26 150000 26 610.3286 2018-05-05 5 2 5
2 1492 352 3360000 42 0 5 8.08 80000 42 519.8020 2018-05-05 5 2 5
3 1492 489 2560000 32 0 4 20.00 80000 32 160.0000 2018-05-05 5 2 5
4 1492 429 1200000 12 0 1 11.01 100000 12 108.9918 2018-05-05 5 2 5
5 1492 524 1200000 15 0 3 16.67 80000 15 89.9820 2018-05-05 5 2 5
6 1492 71 1050000 7 0 3 0.98 150000 7 714.2857 2018-05-05 5 2 5
A Date object always prints as YYYY-MM-DD, so it cannot show forward slashes while remaining a Date. You can change the format with format(), but the result is a character vector again, not a Date.
class(output$date)
# [1] "Date"
output2 <- df %>%
mutate(date = as.Date(date, format="%m/%d/%Y")) %>%
mutate(date = format(date, "%Y/%m/%d"))
class(output2$date)
# [1] "character"
Data
df <-
structure(
list(
film_code = c(1492L, 1492L, 1492L, 1492L, 1492L,
1492L),
cinema_code = c(304L, 352L, 489L, 429L, 524L, 71L),
total_sales = c(3900000L,
3360000L, 2560000L, 1200000L, 1200000L, 1050000L),
tickets_sold = c(26L,
42L, 32L, 12L, 15L, 7L),
tickets_out = c(0L, 0L, 0L, 0L, 0L,
0L),
show_time = c(4L, 5L, 4L, 1L, 3L, 3L),
occu_perc = c(4.26,
8.08, 20, 11.01, 16.67, 0.98),
ticket_price = c(150000L, 80000L,
80000L, 100000L, 80000L, 150000L),
ticket_use = c(26L, 42L, 32L,
12L, 15L, 7L),
capacity = c(610.3286, 519.802, 160, 108.9918,
89.982, 714.2857),
date = c("5/5/2018", "5/5/2018", "5/5/2018", "5/5/2018",
"5/5/2018", "5/5/2018"),
month = c(5L, 5L, 5L, 5L, 5L, 5L),
quarter = c(2L,
2L, 2L, 2L, 2L, 2L),
day = c(5L, 5L, 5L, 5L, 5L, 5L)
),
class = "data.frame",
row.names = c(NA,-6L)
)

Calculating time lag between sequential events after grouping for subsets

I am trying to calculate the time between sequential observations for different combinations of my columns. I have attached a sample of my data here.
A subset of my data looks like:
head(d1) #visualize the first few lines of the data
date time year km sps pp datetime prev timedif seque
<fct> <fct> <int> <dbl> <fct> <dbl> <chr> <dbl> <dbl> <chr>
2012/06/09 2:22 2012 110 MICRO 0 2012-06-09 02:22 0 260. 00
2012/06/19 2:19 2012 80 MICRO 0 2012-06-19 02:19 1 4144 01
2012/06/19 22:15 2012 110 MICRO 0 2012-06-19 22:15 0 100. 00
2012/06/21 23:23 2012 80 MUXX 1 2012-06-21 23:23 0 33855 10
2012/06/24 2:39 2012 110 MICRO 0 2012-06-24 02:39 0 120. 00
2012/06/29 2:14 2012 110 MICRO 0 2012-06-29 02:14 0 43.7 00
Where:
pp: which species (sps) are predators (coded as 1) and which are prey (coded as 0)
prev: very next pp after the current observation
timedif: time difference (in seconds?) between the current observation and the next one
seque: this is the sequence order: where the first number is the current pp and the second number is the next pp
To generate the datetime column, I did this:
d1$datetime=strftime(paste(d1$date,d1$time),'%Y-%m-%d %H:%M',usetz=FALSE) #converting the date/time into a new format
To make the other columns I used the following code:
d1 = d1 %>%
  ungroup() %>%
  group_by(km, year) %>% #group by km and year because I don't want time differences calculated between different years or km (i.e., locations)
  arrange(datetime) %>%
  mutate(prev = dplyr::lead(pp)) %>% #`next` is a reserved word in R, so the column holding the next pp is called prev
  mutate(timedif = as.numeric(lead(as.POSIXct(datetime)) - as.POSIXct(datetime))) #subtract first, then convert the difference to numeric
d1 = d1[2:nrow(d1),] %>% mutate(seque = as.factor(paste0(pp,prev)))
I can then extract the average (geometric mean) time between sequences:
library(psych)
geo_avg = d1 %>% group_by(seque) %>% summarise(geometric.mean(timedif))
geo_avg
# A tibble: 6 x 2
seque `geometric.mean(timedif)`
<chr> <dbl>
1 00 58830. #prey followed by a prey
2 01 147062. #prey followed by a predator
3 0NA NA #prey followed by nothing (end of time series)
4 10 178361. #predator followed by prey
5 11 1820. #predator followed by predator
6 1NA NA #predator followed by nothing (end of time series)
I have one question that can be broken down into three parts.
How can I calculate the time difference between:
individuals of the same sps (for example, how long does it take for one MICRO to be followed by the next MICRO)?
opposite classifications, i.e. prey-predator (01 or 10) sequences, for each prey (pp = 0) or predator (pp = 1) sps (for example, how long does it take for the prey MICRO to be followed by any predator (pp = 1))?
the same classification, i.e. 00 or 11 sequences, for each prey (pp = 0) or predator (pp = 1) sps (for example, how long does it take for the prey MICRO to be followed by any other prey (pp = 0), MICRO or otherwise)?
I would like to be able to do something along these lines:
sps pp same_sps same_class opposite_class
MICRO 0 10 days 5 days 2 days
MUXX 1 15 days 20 days 12 days
etc
Just in case, here is the output for dput(d1[1:10,]):
structure(list(
date = structure(c(11L, 21L, 21L, 23L, 26L, 31L,32L, 37L, 38L, 39L), .Label = c("2012/05/30", "2012/05/31", "2012/06/01", "2015/08/19", "2015/08/20"), class = "factor"),
time = structure(c(742L, 739L, 915L, 983L, 759L, 734L, 897L, 769L, 901L, 14L), .Label = c("0:00", "0:01", "0:02", "0:03", "9:58", "9:59"), class = "factor"),
year = c(2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L),
km = c(110, 80, 110, 80, 110, 110, 110, 110, 110, 110),
sps = structure(c(9L, 9L, 9L, 11L, 9L, 9L, 9L, 9L, 9L, 9L), .Label = c("CACA", "ERDO", "FEDO", "LEAM", "LOCA", "MAAM", "MAMO", "MEME", "MICRO", "MUVI", "MUXX", "ONZI", "PRLO", "TAHU", "TAST", "URAM", "VUVU"), class = "factor"),
pp = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0),
datetime = c("2012-06-09 02:22", "2012-06-19 02:19", "2012-06-19 22:15", "2012-06-21 23:23"),
prev = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0),
timedif = c(259.883333333333, 4144, 100.4, 43.2, 2.2, 453.083333333333),
seque = c("00", "01", "00", "10", "00", "00", "00", "00", "00", "00")), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA, -10L),
groups = structure(list(km = c(80, 110), year = c(2012L, 2012L), .rows = list(c(2L, 4L), c(1L, 3L, 5L, 6L, 7L, 8L, 9L, 10L))), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"), .drop = TRUE))
I believe you could answer all three of your questions by adding a column for the next sps and including it and the current sps in your group_by:
d1 %>%
mutate(next_sps = lead(sps)) %>%
group_by(sps, next_sps, seque) %>%
summarise(AvgTime = mean(timedif))
# A tibble: 5 x 4
# Groups: sps, next_sps [?]
sps next_sps seque AvgTime
<fct> <fct> <chr> <dbl>
1 MICRO MICRO 00 1.19e+ 2
2 MICRO MICRO 01 4.14e+ 3
3 MICRO MUXX 00 1.00e+ 2
4 MICRO NA 00 1.01e-317
5 MUXX MICRO 10 4.32e+ 1
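For the first part of the question (time until the next individual of the same sps), a rough base R sketch is to add sps to the grouping so that each gap is measured only between rows of the same species. The toy data, the helper gap_to_next() and the column same_sps_days are invented for illustration, and days are an assumed unit.

```r
# Toy observations at one location/year (values are made up)
obs <- data.frame(
  km = 80,
  year = 2012L,
  sps = c("MICRO", "MUXX", "MICRO", "MICRO"),
  datetime = as.POSIXct(c("2012-06-01 00:00", "2012-06-02 00:00",
                          "2012-06-04 00:00", "2012-06-05 12:00"),
                        format = "%Y-%m-%d %H:%M", tz = "UTC"),
  stringsAsFactors = FALSE
)

# Chronological order within each location/year
obs <- obs[order(obs$km, obs$year, obs$datetime), ]

# Gap (in days) from each observation to the next one in the same group;
# the last observation of a group has no successor and gets NA
gap_to_next <- function(secs) c(diff(secs), NA) / 86400

# Grouping by sps as well restricts each gap to same-species pairs
obs$same_sps_days <- ave(as.numeric(obs$datetime),
                         obs$km, obs$year, obs$sps, FUN = gap_to_next)
obs
```

Averaging same_sps_days per sps would then give the same_sps column of the desired table; the class-based columns would need the grouping changed from sps to pp in the same way.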

Calculating lag over a month

I have this data:
library(dplyr)
glimpse(samp)
Observations: 15
Variables: 6
$ date <date> 2013-01-04, 2013-01-31, 2013-01-09, 2013-01-20, 2013-01-29, 2013...
$ shop_id <int> 4, 1, 30, 41, 26, 16, 25, 10, 29, 52, 54, 42, 8, 59, 31
$ item_id <int> 1904, 17880, 14439, 15010, 10917, 10331, 2751, 1475, 16071, 13901...
$ item_cnt_day <dbl> 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1
$ month <int> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
It's just a sample of a large data set, so there are gaps between the dates.
In the original data, the time series starts at 2013-01-01 and ends at 2015-11-30. My goal is to calculate the lag for one month. The problem is that the length of a month is not consistent (some months have 30 days, others have 31). To calculate a lag I have to supply a fixed number of rows, but no fixed number corresponds to a month. Is there a way to calculate the lag month-wise?
The target variable is item_cnt_day. The lag should be applied to the rolling mean. In this example each month has 5 observations, so the result should look like this:
library(RcppRoll)
library(dplyr)
samp %>%
mutate(r_mean_5 = lag(roll_meanr(item_cnt_day, 5), 1))
date shop_id item_id item_cnt_day month year r_mean_5
30717 2013-01-04 4 1904 1 1 2013 NA
43051 2013-01-31 1 17880 1 1 2013 NA
66273 2013-01-09 30 14439 1 1 2013 NA
105068 2013-01-20 41 15010 1 1 2013 NA
23332 2013-01-29 26 10917 1 1 2013 NA
28838 2013-02-22 16 10331 1 2 2013 1.0
40418 2013-02-08 25 2751 2 2 2013 1.0
62219 2013-02-12 10 1475 1 2 2013 1.2
98641 2013-02-16 29 16071 1 2 2013 1.2
21905 2013-02-23 52 13901 2 2 2013 1.2
32219 2013-03-31 54 2972 1 3 2013 1.4
45156 2013-03-17 42 11184 1 3 2013 1.4
69513 2013-03-24 8 19405 1 3 2013 1.2
110206 2013-03-10 59 2255 1 3 2013 1.2
24473 2013-03-07 31 15119 1 3 2013 1.2
Here is the dput().
structure(list(date = structure(c(15709, 15736, 15714, 15725,
15734, 15758, 15744, 15748, 15752, 15759, 15795, 15781, 15788,
15774, 15771), class = "Date"), shop_id = c(4L, 1L, 30L, 41L,
26L, 16L, 25L, 10L, 29L, 52L, 54L, 42L, 8L, 59L, 31L), item_id = c(1904L,
17880L, 14439L, 15010L, 10917L, 10331L, 2751L, 1475L, 16071L,
13901L, 2972L, 11184L, 19405L, 2255L, 15119L), item_cnt_day = c(1,
1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1), month = c(1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), year = c(2013L,
2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L,
2013L, 2013L, 2013L, 2013L, 2013L)), row.names = c(30717L, 43051L,
66273L, 105068L, 23332L, 28838L, 40418L, 62219L, 98641L, 21905L,
32219L, 45156L, 69513L, 110206L, 24473L), class = "data.frame")
Maybe this?
library(lubridate)
df$lag <- df$date %m-% months(1)
df$rollmean <- sapply(1:nrow(df), function(x) mean(df[df$date <= df$date[x] & df$date >= df$lag[x], "item_cnt_day" ]))
date shop_id item_id item_cnt_day month year lag rollmean
30717 2013-01-04 4 1904 1 1 2013 2012-12-04 1.000000
43051 2013-01-31 1 17880 1 1 2013 2012-12-31 1.000000
66273 2013-01-09 30 14439 1 1 2013 2012-12-09 1.000000
105068 2013-01-20 41 15010 1 1 2013 2012-12-20 1.000000
23332 2013-01-29 26 10917 1 1 2013 2012-12-29 1.000000
28838 2013-02-22 16 10331 1 2 2013 2013-01-22 1.166667
40418 2013-02-08 25 2751 2 2 2013 2013-01-08 1.200000
62219 2013-02-12 10 1475 1 2 2013 2013-01-12 1.200000
98641 2013-02-16 29 16071 1 2 2013 2013-01-16 1.166667
21905 2013-02-23 52 13901 2 2 2013 2013-01-23 1.285714
32219 2013-03-31 54 2972 1 3 2013 2013-02-28 1.000000
45156 2013-03-17 42 11184 1 3 2013 2013-02-17 1.200000
69513 2013-03-24 8 19405 1 3 2013 2013-02-24 1.000000
110206 2013-03-10 59 2255 1 3 2013 2013-02-10 1.166667
24473 2013-03-07 31 15119 1 3 2013 2013-02-07 1.333333
For every date, %m-% calculates the date one month earlier, accounting for the different lengths of the months (31, 30 or 28 days), and the result is stored in the column lag. Then, in sapply(), the mean of item_cnt_day is calculated over all observations whose date lies between lag and date for the current row.
So it doesn't matter how many elements there are for each month or how the elements are ordered.
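The window logic described above can also be sketched in base R, without lubridate. The helper minus_one_month() is a hypothetical stand-in for %m-% months(1): it steps back one month and clamps to the last day of that month, and it is only an approximation of lubridate's behaviour written for this example.

```r
# Dates and counts taken from the question's dput()
df <- data.frame(
  date = as.Date(c("2013-01-04", "2013-01-31", "2013-01-09", "2013-01-20", "2013-01-29",
                   "2013-02-22", "2013-02-08", "2013-02-12", "2013-02-16", "2013-02-23",
                   "2013-03-31", "2013-03-17", "2013-03-24", "2013-03-10", "2013-03-07")),
  item_cnt_day = c(1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1)
)

# Hypothetical helper: same day one month earlier, clamped to that month's end
minus_one_month <- function(d) {
  lt <- as.POSIXlt(d)
  lt$mon <- lt$mon - 1                                  # may overshoot, e.g. "Feb 31"
  prev_month_end <- as.Date(format(d, "%Y-%m-01")) - 1  # last day of previous month
  pmin(as.Date(lt), prev_month_end)
}

df$lag <- minus_one_month(df$date)

# Mean over every observation dated between lag and date of the current row
df$rollmean <- vapply(seq_len(nrow(df)), function(i) {
  in_window <- df$date >= df$lag[i] & df$date <= df$date[i]
  mean(df$item_cnt_day[in_window])
}, numeric(1))
df
```

Since each window is selected by date comparison rather than by row position, the row order and the number of observations per month do not affect the result.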
The date class supports seq for different time intervals (documentation).
So you can basically do:
calculate_lag <- function(date) {
return(seq(date, by = "1 month", length.out = 2)[2])
}
date_column <- as.Date(sapply(df$date, calculate_lag), origin="1970-01-01") # df$date stands in for your own Date column
I am not really familiar with calculating lag, but maybe that is what you want?
Data:
df <- structure(list(date = structure(c(15709, 15736, 15714, 15725,
15734, 15758, 15744, 15748, 15752, 15759, 15795, 15781, 15788,
15774, 15771), class = "Date"), shop_id = c(4L, 1L, 30L, 41L,
26L, 16L, 25L, 10L, 29L, 52L, 54L, 42L, 8L, 59L, 31L), item_id = c(1904L,
17880L, 14439L, 15010L, 10917L, 10331L, 2751L, 1475L, 16071L,
13901L, 2972L, 11184L, 19405L, 2255L, 15119L), item_cnt_day = c(1,
1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1), month = c(1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), year = c(2013L,
2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L,
2013L, 2013L, 2013L, 2013L, 2013L)), row.names = c(30717L, 43051L,
66273L, 105068L, 23332L, 28838L, 40418L, 62219L, 98641L, 21905L,
32219L, 45156L, 69513L, 110206L, 24473L), class = "data.frame")
Calculation:
df %>%
dplyr::mutate(days_in_month = lubridate::days_in_month(date)) %>%
tidyr::nest(-c(month, days_in_month)) %>%
dplyr::mutate(lag = purrr::map2(data, days_in_month, ~ stats::lag(.x$item_cnt_day, .y)))
EDIT based on comment:
maybe this then?
df %>%
tidyr::nest(-month) %>%
dplyr::mutate(
ndays = purrr::map_int(data, nrow),
lag = purrr::map2_dbl(data, ndays, ~ zoo::rollmean(.x$item_cnt_day, .y))
)

Combining and Transforming Data Frames in R

I have a bunch of data frames that look like this in R:
print(output[2])
Button Intensity Acc Intensity RT Time tdelta SubjectID CoupleID PrePost
1: 0 30 0 0.0 0 83325.87 0.000 1531 153 Post
2: 1 30 1 13.5 0 83362.65 36.782 1531 153 Post
3: 1 30 1 15.0 0 83376.68 14.027 1531 153 Post
4: 1 30 1 6.0 0 83392.27 15.585 1531 153 Post
5: 1 30 1 15.0 0 83398.77 6.507 1531 153 Post
print(output[1])
[[1]]
Button Intensity Acc Intensity RT Time tdelta SubjectID CoupleID PrePost
1: 0 30 0 0.0 0 77987.93 0.000 1531 153 Pre
2: 1 30 1 13.5 0 78084.57 96.639 1531 153 Pre
3: 1 30 1 15.0 0 78098.62 14.054 1531 153 Pre
4: 1 30 1 6.0 0 78114.13 15.508 1531 153 Pre
5: 1 30 1 15.0 0 78120.67 6.537 1531 153 Pre
I want to combine them into one big data frame that has the following logic and format:
SubjectID CoupleID PrePost Miss1RT Miss2RT Miss3RT Hit1RT Hit2RT Hit3RT
1531 153 Post 0.00 NA NA NA 36.78 14.027
1531 153 Pre 0.00 NA NA NA 96.638 14.054
if Button == 0, then it's a Miss, if it ==1, then it's a Hit. So, it should be something like:
for row in output[i].rows:
    if Button == 0:
        Miss1RT = tdelta
    elif Button == 1:
        Miss1RT = 'NA'
and then a flipped version where if Button is 1, Hit[i]RT is tdelta or else 'NA'.
There are 26 lines per data frame and each row is either a hit or a miss so there will be 26 Miss and 26 Hit columns and each SubjectID gets two rows - one for Pre and one for Post. So the column headers for the final output will be:
SubjectID CoupleID PrePost Miss1RT Miss2RT ...Miss26RT Hit1RT Hit2RT ... Hit26RT
I'm new to R and struggling with the proper syntax.
Something like this should work:
library(reshape2) # melt() and dcast() below come from reshape2
#Get data in structure OP has
output <- list(pre, post)
output2 <- lapply(output, function(x) cbind(x, num = paste0(1:nrow(x), "RT")))
pre_post <- do.call("rbind", output2)
#Perform actual calculations
pre_post$miss <- ifelse(pre_post$Button == 0, pre_post$tdelta, NA)
pre_post$hit <- ifelse(pre_post$Button == 1, pre_post$tdelta, NA)
pre_post_melted <- melt(pre_post, id.vars = c("SubjectID", "CoupleID", "num", "PrePost"), measure.vars = c("hit","miss"))
pre_post_res <- dcast(pre_post_melted, SubjectID + CoupleID + PrePost ~ variable + num, sep = "")
pre_post_res
#SubjectID CoupleID PrePost hit_1RT hit_2RT hit_3RT hit_4RT hit_5RT miss_1RT miss_2RT miss_3RT miss_4RT miss_5RT
#1 1531 153 Post NA 36.782 14.027 15.585 6.507 0 NA NA NA NA
#2 1531 153 Pre NA 96.639 14.054 15.508 6.537 0 NA NA NA NA
We transpose the data to dynamically create all the variables we want. We also stack the data to avoid repeated steps.
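If the exact column names from the question (Miss1RT, Hit1RT, ...) are wanted, a base R sketch without melt/dcast is to turn each data frame into a single wide row and stack the rows. The helper to_wide_row() is made up for this illustration, and the data are cut down to three trials and the columns that matter.

```r
pre <- data.frame(Button = c(0L, 1L, 1L), tdelta = c(0, 96.639, 14.054),
                  SubjectID = 1531L, CoupleID = 153L, PrePost = "Pre",
                  stringsAsFactors = FALSE)
post <- data.frame(Button = c(0L, 1L, 1L), tdelta = c(0, 36.782, 14.027),
                   SubjectID = 1531L, CoupleID = 153L, PrePost = "Post",
                   stringsAsFactors = FALSE)

# Hypothetical helper: one data frame in, one wide row out
to_wide_row <- function(x) {
  n <- seq_len(nrow(x))
  miss <- ifelse(x$Button == 0, x$tdelta, NA_real_)  # tdelta on a miss, NA otherwise
  hit  <- ifelse(x$Button == 1, x$tdelta, NA_real_)  # tdelta on a hit, NA otherwise
  names(miss) <- paste0("Miss", n, "RT")
  names(hit)  <- paste0("Hit", n, "RT")
  # Named lists passed to data.frame() become one column per element
  data.frame(x[1, c("SubjectID", "CoupleID", "PrePost")],
             as.list(miss), as.list(hit))
}

wide <- do.call(rbind, lapply(list(pre, post), to_wide_row))
wide
```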
Data:
pre <- structure(list(Button = c(0L, 1L, 1L, 1L, 1L), Intensity = c(30L,
30L, 30L, 30L, 30L), Acc = c(0L, 1L, 1L, 1L, 1L), Intensity = c(0,
13.5, 15, 6, 15), RT = c(0L, 0L, 0L, 0L, 0L), Time = c(77987.93,
78084.57, 78098.62, 78114.13, 78120.67), tdelta = c(0, 96.639,
14.054, 15.508, 6.537), SubjectID = c(1531L, 1531L, 1531L, 1531L,
1531L), CoupleID = c(153L, 153L, 153L, 153L, 153L), PrePost = c("Pre",
"Pre", "Pre", "Pre", "Pre")), .Names = c("Button", "Intensity",
"Acc", "Intensity", "RT", "Time", "tdelta", "SubjectID", "CoupleID",
"PrePost"), row.names = c(NA, -5L), class = "data.frame")
post <- structure(list(Button = c(0L, 1L, 1L, 1L, 1L), Intensity = c(30L,
30L, 30L, 30L, 30L), Acc = c(0L, 1L, 1L, 1L, 1L), Intensity = c(0,
13.5, 15, 6, 15), RT = c(0L, 0L, 0L, 0L, 0L), Time = c(83325.87,
83362.65, 83376.68, 83392.27, 83398.77), tdelta = c(0, 36.782,
14.027, 15.585, 6.507), SubjectID = c(1531L, 1531L, 1531L, 1531L,
1531L), CoupleID = c(153L, 153L, 153L, 153L, 153L), PrePost = c("Post",
"Post", "Post", "Post", "Post")), .Names = c("Button", "Intensity",
"Acc", "Intensity", "RT", "Time", "tdelta", "SubjectID", "CoupleID",
"PrePost"), row.names = c(NA, -5L), class = "data.frame")

R: Recursive Function to Move Through Dataset

Much of the following problem arises from the sheer size of the dataframe (198240 observations). I'll try to break it down as best as I can.
The Goal
I want to create a variable DURATION which is how long a house was sick.
The Known
Household ID and Week (There are 1120 houses and 177 weeks)
HDINC (Currently Sick variable)
HDINC_1 (Sick Week Prior variable)
The Problem
I don't understand how to get the function/loop to traverse the dataframe by household and by time concurrently.
I know it will be a function or loop that goes something like the following (Not in R-code, but in logic)
IF (hdinc > 0) #a house on a certain date is sick
{ Duration = 1 AND look at hdinc_1
IF (hdinc_1 = 0 )
{ Duration = Duration + 0
AND Go onto the next date for that house.
IF hdinc_1 >0 then #if the house was sick last week
{ Duration = Duration + 1
Go to SameHouse, Week-1 and look at hdinc_1 to see if it was sick the week prior
I am having trouble with the following:
Getting it to start on a particular observation based on household/date
Moving the function backwards or forwards while maintaining the household
Eventually getting the function to restart using a different household
I know this is really convoluted but I can't even get the loop to start to provide y'all sample code.
Sample Data:
dat <- structure(list(id_casa = c(802L, 802L, 802L, 802L, 802L, 802L, 802L, 955L, 955L, 955L, 955L), survdate = structure(c(3L, 10L, 5L, 1L, 2L, 4L, 11L, 6L, 7L, 8L, 9L), .Label = c("1/11/2006", "1/18/2006", "1/19/2005", "1/25/2006", "1/4/2006", "10/13/2004", "10/20/2004", "10/27/2004", "11/3/2004", "12/28/2005", "2/1/2006" ), class = "factor"), hdinc = c(125, 142.85715, 0, 0, 0, 142.85715, 0, 50, 32, 159, 2.5), hdinc_1 = c(0, 125, 142.85715, 0, 0, 0, 142.85715, 0, 50, 32, 159)), .Names = c("id_casa", "survdate", "hdinc", "hdinc_1"), class = "data.frame", row.names = c(NA, -11L))
Sample Output:
Using only base R:
# create sample data
sampleData <-
structure(list(id_casa = c(802L, 802L, 802L, 802L, 802L, 802L, 802L, 955L, 955L, 955L, 955L),
survdate = structure(c(3L, 10L, 5L, 1L, 2L, 4L, 11L, 6L, 7L, 8L, 9L),
.Label = c("1/11/2006", "1/18/2006", "1/19/2005", "1/25/2006", "1/4/2006", "10/13/2004", "10/20/2004", "10/27/2004", "11/3/2004", "12/28/2005", "2/1/2006" ), class = "factor"),
hdinc = c(125, 142.85715, 0, 0, 0, 142.85715, 0, 50, 32, 159, 2.5), hdinc_1 = c(0, 125, 142.85715, 0, 0, 0, 142.85715, 0, 50, 32, 159)),
.Names = c("id_casa", "survdate", "hdinc", "hdinc_1"), class = "data.frame", row.names = c(NA, -11L))
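# the rows must already be in chronological order within each house; survdate is a factor of m/d/Y strings,
# so sort on the parsed dates rather than on the factor itself, e.g.:
#sampleData <- sampleData[order(sampleData$id_casa, as.Date(as.character(sampleData$survdate), format = "%m/%d/%Y")),]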
sampleData$Duration <-
unlist(
by(sampleData,
INDICES=sampleData$id_casa,
FUN=function(house){
tail(Reduce(f=function(prv,nxt){if(nxt == 0) 0 else (prv+nxt)},
x=as.integer(house$hdinc > 0),init=0,accumulate=TRUE),-1)
}))
> sampleData
id_casa survdate hdinc hdinc_1 Duration
1 802 1/19/2005 125.0000 0.0000 1
2 802 12/28/2005 142.8571 125.0000 2
3 802 1/4/2006 0.0000 142.8571 0
4 802 1/11/2006 0.0000 0.0000 0
5 802 1/18/2006 0.0000 0.0000 0
6 802 1/25/2006 142.8571 0.0000 1
7 802 2/1/2006 0.0000 142.8571 0
8 955 10/13/2004 50.0000 0.0000 1
9 955 10/20/2004 32.0000 50.0000 2
10 955 10/27/2004 159.0000 32.0000 3
11 955 11/3/2004 2.5000 159.0000 4
We can use the function rle in combination with dplyr to find runs, and then remove those where the run is of wellness:
library(dplyr)
dat %>% group_by(id_casa) %>%
mutate(duration = unlist(lapply(rle(hdinc > 0)[["lengths"]], seq, from = 1))) %>%
mutate(duration = ifelse(hdinc > 0, as.numeric(duration), 0))
Source: local data frame [11 x 5]
Groups: id_casa [2]
id_casa survdate hdinc hdinc_1 duration
(int) (fctr) (dbl) (dbl) (dbl)
1 802 1/19/2005 125.0000 0.0000 1
2 802 12/28/2005 142.8571 125.0000 2
3 802 1/4/2006 0.0000 142.8571 0
4 802 1/11/2006 0.0000 0.0000 0
5 802 1/18/2006 0.0000 0.0000 0
6 802 1/25/2006 142.8571 0.0000 1
7 802 2/1/2006 0.0000 142.8571 0
8 955 10/13/2004 50.0000 0.0000 1
9 955 10/20/2004 32.0000 50.0000 2
10 955 10/27/2004 159.0000 32.0000 3
11 955 11/3/2004 2.5000 159.0000 4
How it works: first we find all the runs using rle:
rle(dat$hdinc>0)
Run Length Encoding
lengths: int [1:5] 2 3 1 1 4
values : logi [1:5] TRUE FALSE TRUE FALSE TRUE
We then make a seq from 1 to each of the lengths from the rle using lapply:
z <- unlist(lapply(rle(dat$hdinc > 0)[["lengths"]], seq, from = 1))
z
[1] 1 2 1 2 3 1 1 1 2 3 4
Then we filter that by whether it was sickness or wellness:
ifelse(dat$hdinc > 0, z, 0)
[1] 1 2 0 0 0 1 0 1 2 3 4
Using dplyr group_by we make sure we are running it on each id_casa by itself.
EDIT: In base:
dat$duration2 <- ifelse(dat$hdinc > 0,
unlist(by(dat, dat$id_casa, FUN = function(x) unlist(lapply(rle(x$hdinc > 0)[["lengths"]], seq, from = 1)))),
0)
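The same running count can also be sketched without rle, using ave() and cumsum(). This is an alternative formulation rather than either answer's method, and it assumes the rows are already in chronological order within each house. The vectors sick and spell are illustrative names.

```r
# id_casa and hdinc from the sample data in the question
dat <- data.frame(
  id_casa = c(802L, 802L, 802L, 802L, 802L, 802L, 802L, 955L, 955L, 955L, 955L),
  hdinc = c(125, 142.85715, 0, 0, 0, 142.85715, 0, 50, 32, 159, 2.5)
)

sick <- as.integer(dat$hdinc > 0)

# Within each house, count the well weeks seen so far; the count stays
# constant during a sick spell, so it labels each spell
spell <- ave(1L - sick, dat$id_casa, FUN = cumsum)

# Running number of sick weeks within each house and spell (0 on well weeks)
dat$duration <- ave(sick, dat$id_casa, spell, FUN = cumsum)
dat
```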

Resources