I am trying to calculate the time between sequential observations for different combinations of my columns. I have attached a sample of my data here.
A subset of my data looks like:
head(d1) #visualize the first few lines of the data
date time year km sps pp datetime prev timedif seque
<fct> <fct> <int> <dbl> <fct> <dbl> <chr> <dbl> <dbl> <chr>
2012/06/09 2:22 2012 110 MICRO 0 2012-06-09 02:22 0 260. 00
2012/06/19 2:19 2012 80 MICRO 0 2012-06-19 02:19 1 4144 01
2012/06/19 22:15 2012 110 MICRO 0 2012-06-19 22:15 0 100. 00
2012/06/21 23:23 2012 80 MUXX 1 2012-06-21 23:23 0 33855 10
2012/06/24 2:39 2012 110 MICRO 0 2012-06-24 02:39 0 120. 00
2012/06/29 2:14 2012 110 MICRO 0 2012-06-29 02:14 0 43.7 00
Where:
pp: which species (sps) are predators (coded as 1) and which are prey (coded as 0)
prev: very next pp after the current observation
timedif: time difference (in seconds?) between the current observation and the next one
seque: this is the sequence order: where the first number is the current pp and the second number is the next pp
To generate the datetime column, I did this:
d1$datetime=strftime(paste(d1$date,d1$time),'%Y-%m-%d %H:%M',usetz=FALSE) #converting the date/time into a new format
To make the other columns I used the following code:
d1 = d1 %>%
ungroup() %>%
group_by(km, year) %>% #group by km and year because I don't want time differences calculated between different years or km (i.e., locations)
arrange(datetime)%>%
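mutate(prev = dplyr::lead(pp)) %>% #pp of the following observation; `next` is a reserved word in R, so the column is named prev
mutate(timedif = as.numeric(lead(as.POSIXct(datetime)) - as.POSIXct(datetime))) #time from the current observation to the following one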
d1 = d1[2:nrow(d1),] %>% mutate(seque = as.factor(paste0(pp,prev)))
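On the unit of timedif: subtracting two POSIXct vectors lets R choose the unit automatically, so different groups can end up on different scales. A minimal sketch, using two timestamps from the km 110 rows above and assuming a UTC time zone, that pins the unit with difftime():

```r
# Two consecutive km 110 timestamps from the sample; tz = "UTC" is an assumption
t1 <- as.POSIXct("2012-06-09 02:22", tz = "UTC")
t2 <- as.POSIXct("2012-06-19 22:15", tz = "UTC")

# Plain subtraction: R picks the unit itself (days for this pair)
auto <- t2 - t1
units(auto)

# difftime() with an explicit unit keeps every row on the same scale
hrs  <- as.numeric(difftime(t2, t1, units = "hours"))
secs <- as.numeric(difftime(t2, t1, units = "secs"))
```

Here hrs comes out at roughly 259.88, which matches the first timedif value in the dput output further down, so that row appears to be in hours rather than seconds.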
I can then extract the average (geometric mean) time between sequences:
library(psych)
geo_avg = d1 %>% group_by(seque) %>% summarise(geometric.mean(timedif))
geo_avg
# A tibble: 6 x 2
seque `geometric.mean(timedif)`
<chr> <dbl>
1 00 58830. #prey followed by a prey
2 01 147062. #prey followed by a predator
3 0NA NA #prey followed by nothing (end of time series)
4 10 178361. #predator followed by prey
5 11 1820. #predator followed by predator
6 1NA NA #predator followed by nothing (end of time series)
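For reference, the geometric mean used above can be written in base R as the exponential of the mean of the logs. This is a sketch of the standard definition, with geo_mean as a hypothetical helper name; it assumes all time differences are strictly positive.

```r
# Geometric mean as exp(mean(log(x))); assumes every value is > 0
geo_mean <- function(x, na.rm = TRUE) {
  exp(mean(log(x), na.rm = na.rm))
}

geo_mean(c(1, 100))  # geometric mean of 1 and 100 is 10
```

Zero or negative time differences would break the log step, so it is worth checking timedif for those before summarising.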
I have one question that can be broken down into three parts.
How can I calculate the time difference between:
individuals of the same sps (for example, how long does it take for one MICRO to be followed by the next MICRO)?
opposite classifications, i.e. prey-predator (01 or 10) sequences, for each prey (pp = 0) or predator (pp = 1) sps (for example, how long does it take for the prey MICRO to be followed by any predator (pp = 1))?
the same classification (00 or 11 sequences) for each prey (pp = 0) or predator (pp = 1) sps (for example, how long does it take for the prey MICRO to be followed by any other prey (pp = 0), MICRO or otherwise)?
I would like to be able to do something along these lines:
sps pp same_sps same_class opposite_class
MICRO 0 10 days 5 days 2 days
MUXX 1 15 days 20 days 12 days
etc
Just in case, here is the output for dput(d1[1:10,]):
structure(list(
date = structure(c(11L, 21L, 21L, 23L, 26L, 31L,32L, 37L, 38L, 39L), .Label = c("2012/05/30", "2012/05/31", "2012/06/01", "2015/08/19", "2015/08/20"), class = "factor"),
time = structure(c(742L, 739L, 915L, 983L, 759L, 734L, 897L, 769L, 901L, 14L), .Label = c("0:00", "0:01", "0:02", "0:03", "9:58", "9:59"), class = "factor"),
year = c(2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L),
km = c(110, 80, 110, 80, 110, 110, 110, 110, 110, 110),
sps = structure(c(9L, 9L, 9L, 11L, 9L, 9L, 9L, 9L, 9L, 9L), .Label = c("CACA", "ERDO", "FEDO", "LEAM", "LOCA", "MAAM", "MAMO", "MEME", "MICRO", "MUVI", "MUXX", "ONZI", "PRLO", "TAHU", "TAST", "URAM", "VUVU"), class = "factor"),
pp = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0),
datetime = c("2012-06-09 02:22", "2012-06-19 02:19", "2012-06-19 22:15", "2012-06-21 23:23"),
prev = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0),
timedif = c(259.883333333333, 4144, 100.4, 43.2, 2.2, 453.083333333333),
seque = c("00", "01", "00", "10", "00", "00", "00", "00", "00", "00")), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA, -10L),
groups = structure(list(km = c(80, 110), year = c(2012L, 2012L), .rows = list(c(2L, 4L), c(1L, 3L, 5L, 6L, 7L, 8L, 9L, 10L))), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"), .drop = TRUE))
I believe you could answer all three of your questions by adding a column for the next sps and including it, together with the current sps, in your group_by:
d1 %>%
mutate(next_sps = lead(sps)) %>%
group_by(sps, next_sps, seque) %>%
summarise(AvgTime = mean(timedif))
# A tibble: 5 x 4
# Groups: sps, next_sps [?]
sps next_sps seque AvgTime
<fct> <fct> <chr> <dbl>
1 MICRO MICRO 00 1.19e+ 2
2 MICRO MICRO 01 4.14e+ 3
3 MICRO MUXX 00 1.00e+ 2
4 MICRO NA 00 1.01e-317
5 MUXX MICRO 10 4.32e+ 1
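To get closer to the sps / same_sps / same_class / opposite_class layout asked for, each transition can be labelled before summarising. The following is only a sketch on made-up toy data: the toy values, the type labels, and the choice to treat "same species" and "same class, different species" as mutually exclusive are all assumptions, and only directly consecutive observations are compared.

```r
library(dplyr)

# Toy observations at a single site (hypothetical values, not taken from d1)
toy <- data.frame(
  sps = c("MICRO", "MICRO", "MAMO", "MUXX", "MICRO", "MUXX"),
  pp  = c(0, 0, 0, 1, 0, 1),
  datetime = as.POSIXct("2012-06-01 00:00", tz = "UTC") +
    3600 * c(0, 2, 5, 6, 10, 11)
)

transitions <- toy %>%
  arrange(datetime) %>%
  mutate(
    next_sps = lead(sps),
    next_pp  = lead(pp),
    hours    = as.numeric(difftime(lead(datetime), datetime, units = "hours")),
    # Label each transition; conditions are checked in order
    type = case_when(
      sps == next_sps ~ "same_sps",
      pp == next_pp   ~ "same_class",
      TRUE            ~ "opposite_class"
    )
  ) %>%
  filter(!is.na(next_sps))  # the last row has no successor

# Mean waiting time per species and transition type
summary_tbl <- transitions %>%
  group_by(sps, pp, type) %>%
  summarise(mean_hours = mean(hours), .groups = "drop")
```

On the real data, group_by(km, year) would go before the mutate() so that transitions are not computed across sites or years, and mean() could be swapped for a geometric mean.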
Given a dataframe as follows:
df <- structure(list(year = c(2001L, 2001L, 2001L, 2001L, 2002L, 2002L,
2002L, 2002L, 2003L, 2003L, 2003L, 2003L), quater = c(1L, 2L,
3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), value = c(4L, 23L, 14L,
12L, 6L, 22L, 45L, 12L, 34L, 15L, 3L, 40L)), class = "data.frame", row.names = c(NA,
-12L))
Out:
year quater value
0 2001 1 4
1 2001 2 23
2 2001 3 14
3 2001 4 12
4 2002 1 6
5 2002 2 22
6 2002 3 45
7 2002 4 12
8 2003 1 34
9 2003 2 15
10 2003 3 3
11 2003 4 40
How could I plot a chart similar to the plot below:
Please note that year and quater in this dataset correspond to year and week in the plot above.
I need to first cut the value column by (0, 10], (10, 20], (20, 30], (30, 40], (40, 50] then plot them.
The code I have tried:
ggplot(df, aes(quater, year, fill= value)) +
geom_tile() +
scale_fill_gradient(low="white", high="red")
Out:
As you can see, the legend is different to what I need.
Thanks for your help.
You should first use cut to get the classes (as Ronak Shah already mentioned) and then you can use scale_fill_brewer to change the color of the tiles.
library(tidyverse)
df %>%
mutate(class = cut(value, seq(0, 50, 10))) %>%
ggplot(aes(quater, year, fill = class) ) +
geom_tile() +
scale_fill_brewer(type = "seq",
direction = 1,
palette = "RdPu")
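To see what cut() does with these breaks before plotting, here is a small sketch on the value column alone (bins is a hypothetical variable name):

```r
# The value column from the question
value <- c(4, 23, 14, 12, 6, 22, 45, 12, 34, 15, 3, 40)

# Right-closed intervals (0,10], (10,20], ..., (40,50]
bins <- cut(value, breaks = seq(0, 50, 10))

levels(bins)  # the five interval labels, used as legend entries
table(bins)   # how many values fall in each interval
```

Because the intervals are closed on the right, a value of exactly 40 lands in (30,40], and a value of 0 would become NA with these breaks.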
I want to count the number of days on which rain fell in each month, for different years and different locations.
This is my data:
Location Year Month Day Precipitation
A 2008 1 1 0
A 2008 1 2 8.32
A 2008 1 3 4.89
A 2008 1 4 0
I have up to 18 locations, the years run from 2008 to 2018 with 12 months in each, and a precipitation of 0 means no rain fell on that day.
You can use aggregate:
aggregate(cbind(days=x$Precipitation > 0), as.list(x[c("Location", "Year", "Month")]), sum)
# Location Year Month days
#1 A 2008 1 2
Data:
x <- structure(list(Location = structure(c(1L, 1L, 1L, 1L), .Label = "A", class = "factor"),
Year = c(2008L, 2008L, 2008L, 2008L), Month = c(1L, 1L, 1L,
1L), Day = 1:4, Precipitation = c(0, 8.32, 4.89, 0)), class = "data.frame", row.names = c(NA, -4L))
Based on the available information, a dplyr version:
df <- df %>%
filter(Precipitation != 0) %>%
group_by(Location, Year, Month) %>%
summarize(DaysOfRain = n())
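One thing to be aware of with the filter-then-count approach is that months without any rain disappear from the result. A sketch of a variant that keeps them, assuming that behaviour is wanted, using toy data with a completely dry second month (the month 2 rows are made up):

```r
library(dplyr)

# Month 1 rows are from the question; month 2 rows are hypothetical dry days
toy <- data.frame(
  Location = "A",
  Year = 2008L,
  Month = c(1L, 1L, 1L, 1L, 2L, 2L),
  Day = c(1:4, 1:2),
  Precipitation = c(0, 8.32, 4.89, 0, 0, 0)
)

# Summing a logical vector counts the TRUE values, so dry months give 0
rain_days <- toy %>%
  group_by(Location, Year, Month) %>%
  summarise(DaysOfRain = sum(Precipitation > 0), .groups = "drop")
```

This returns 2 rainy days for month 1 and 0 for month 2, instead of dropping month 2.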
I am trying to calculate the time between sequential observations. I have attached a sample of my data here.
A subset of my data looks like:
head(d1) #visualize the first few lines of the data
date time year km sps pp datetime next timedif seque
<fct> <fct> <int> <dbl> <fct> <dbl> <chr> <dbl> <dbl> <fct>
2012/06/21 23:23 2012 80 MUXX 1 2012-06-21 23:23 0 4144 10
2012/07/15 11:38 2012 80 MAMO 0 2012-07-15 11:38 1 33855 01
2012/07/20 22:19 2012 80 MICRO 0 2012-07-20 22:19 0 7841 00
2012/07/29 23:03 2012 80 MICRO 0 2012-07-29 23:03 0 13004 00
2012/10/18 2:54 2012 80 MICRO 0 2012-10-18 02:54 0 -971 00
2012/10/23 2:49 2012 80 MICRO 0 2012-10-23 02:49 0 -1094 00
Where:
pp: which species (sps) are predators (coded as 1) and which are prey (coded as 0)
next: very next pp after the current observation
timedif: time difference between the current observation and the next one
seque: this should be the sequence order: where the first number is the current pp and the second number is the next pp
To generate the datetime column, I did this:
d1$datetime=strftime(paste(d1$date,d1$time),'%Y-%m-%d %H:%M',usetz=FALSE) #converting the date/time into a new format
To make the other columns I used the following code:
d1 = d1 %>%
ungroup() %>%
group_by(km, year) %>%
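mutate(prev = dplyr::lag(pp)) %>% #`next` is a reserved word in R, so the column is created as prev (shown as next in the printed output)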
mutate(timedif = as.numeric(as.POSIXct(datetime) - lag(as.POSIXct(datetime))))
d1 = d1[2:nrow(d1),] %>% mutate(seque = as.factor(paste0(pp,prev)))
I have two questions:
My lag function appears to be recording the previous pp event, not the next pp event. How do I fix this?
My timedif calculation is giving me negative values, which shouldn't be possible. Why is that happening?
Just in case, here is the output for str(d1):
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 96 obs. of 10 variables:
$ date : Factor w/ 1093 levels "2012/05/30","2012/05/31",..: 23 47 52 61 71 76 76 88 90 98 ...
$ time : Factor w/ 1439 levels "0:00","0:01",..: 983 219 919 963 1016 5 47 52 923 1058 ...
$ year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
$ km : num 80 80 80 80 80 80 80 80 80 80 ...
$ sps : Factor w/ 17 levels "CACA","ERDO",..: 11 7 9 9 9 9 9 4 9 11 ...
$ pp : num 1 0 0 0 0 0 0 0 0 1 ...
$ datetime: chr "2012-06-21 23:23" "2012-07-15 11:38" "2012-07-20 22:19" "2012-07-29 23:03" ...
$ next : num 0 1 0 0 0 0 0 0 0 0 ...
$ timedif : num 4144 33855 7841 13004 14453 ...
$ seque : Factor w/ 4 levels "00","01","10",..: 3 2 1 1 1 1 1 1 1 3 ...
And also:
dput(d1[1:10,])
structure(list(
date = structure(c(23L, 47L, 52L, 61L, 71L, 76L, 76L, 88L, 90L, 98L),
.Label = c("2012/05/30", "2012/05/31", "2012/06/01", "2012/06/02", "2012/06/03", "2012/06/04", "2012/06/05", "2013/06/18", "2013/06/19", "2013/06/20", "2013/06/21", "2013/06/22", "2014/07/19", "2014/07/20", "2014/07/21", "2014/07/22", "2014/07/23", "2015/08/06", "2015/08/07", "2015/08/08", "2015/08/09", "2015/08/10"),
class = "factor"),
time = structure(c(983L, 219L, 919L, 963L, 1016L, 5L, 47L, 52L, 923L, 1058L),
.Label = c("0:00", "0:01", "0:02", "0:03", "0:04", "0:05", "0:06", "0:07", "0:33","0:34", "0:35", "0:36", "0:37","10:06", "10:07", "10:08", "10:09", "10:10", "10:11", "10:12", "10:13", "2:05", "2:06", "2:07", "2:08", "2:09", "2:10", "2:11", "9:54", "9:55", "9:56", "9:57", "9:58", "9:59"),
class = "factor"),
year = c(2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L),
km = c(80, 80, 80, 80, 80, 80, 80, 80, 80, 80),
sps = structure(c(11L, 7L, 9L, 9L, 9L, 9L, 9L, 4L, 9L, 11L),
.Label = c("CACA", "ERDO", "FEDO", "LEAM", "LOCA", "MAAM", "MAMO", "MEME", "MICRO", "MUVI", "MUXX", "ONZI", "PRLO", "TAHU", "TAST", "URAM", "VUVU"),
class = "factor"),
pp = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 1),
datetime = c("2012-06-21 23:23", "2012-07-15 11:38", "2012-07-20 22:19", "2012-07-29 23:03", "2012-08-08 23:56", "2012-08-13 00:04", "2012-08-13 00:46", "2012-08-25 00:51", "2012-08-27 22:23", "2012-09-04 03:38"),
prev = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0),
timedif = c(4144, 33855, 7841, 13004, 14453, 5768, 42, 17285, 4172, 10395),
seque = structure(c(3L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L), .Label = c("00", "01", "10", "11"),
class = "factor")),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -10L))
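A sketch of the likely fix, on three timestamps taken from the datetime column above but deliberately shuffled: lag() looks at the previous row and lead() at the following one, and both depend on row order. That the negative values come from rows not being sorted by time is an assumption, since the excerpt alone cannot confirm it. The pp values and the names dt and next_pp are illustrative.

```r
library(dplyr)

# Three timestamps from the question, deliberately out of order
toy <- data.frame(
  pp = c(0, 1, 0),
  datetime = c("2012-07-20 22:19", "2012-06-21 23:23", "2012-07-15 11:38")
)

fixed <- toy %>%
  mutate(dt = as.POSIXct(datetime, tz = "UTC")) %>%  # tz is an assumption
  arrange(dt) %>%                                    # sort before lead()/lag()
  mutate(
    next_pp = lead(pp),                              # pp of the following row
    timedif = as.numeric(difftime(lead(dt), dt, units = "mins"))
  )
```

After sorting, the first two differences are 33855 and 7841 minutes, and the last row is NA because nothing follows it. With grouped data, arrange(dt, .by_group = TRUE) after group_by(km, year) keeps the ordering within each group.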
I have this data:
library(dplyr)
glimpse(samp)
Observations: 15
Variables: 6
$ date <date> 2013-01-04, 2013-01-31, 2013-01-09, 2013-01-20, 2013-01-29, 2013...
$ shop_id <int> 4, 1, 30, 41, 26, 16, 25, 10, 29, 52, 54, 42, 8, 59, 31
$ item_id <int> 1904, 17880, 14439, 15010, 10917, 10331, 2751, 1475, 16071, 13901...
$ item_cnt_day <dbl> 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1
$ month <int> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
It's just a sample of a large data set, so there are jumps between the dates.
In the original data, the time series starts at 2013-01-01 and ends at 2015-11-30. My goal is to calculate a lag of one month. The problem is that the length of a month is not constant (some months have 30 days, others have 31). To calculate the lag I have to supply a number of rows, but, as mentioned, no fixed number works for every month. Is there a way to calculate the lag month-wise?
The target variable is item_cnt_day. The lag should be calculated for the rolling mean. In this example each month has 5 days, so the result should look like this:
library(RcppRoll)
library(dplyr)
samp %>%
mutate(r_mean_5 = lag(roll_meanr(item_cnt_day, 5), 1))
date shop_id item_id item_cnt_day month year r_mean_5
30717 2013-01-04 4 1904 1 1 2013 NA
43051 2013-01-31 1 17880 1 1 2013 NA
66273 2013-01-09 30 14439 1 1 2013 NA
105068 2013-01-20 41 15010 1 1 2013 NA
23332 2013-01-29 26 10917 1 1 2013 NA
28838 2013-02-22 16 10331 1 2 2013 1.0
40418 2013-02-08 25 2751 2 2 2013 1.0
62219 2013-02-12 10 1475 1 2 2013 1.2
98641 2013-02-16 29 16071 1 2 2013 1.2
21905 2013-02-23 52 13901 2 2 2013 1.2
32219 2013-03-31 54 2972 1 3 2013 1.4
45156 2013-03-17 42 11184 1 3 2013 1.4
69513 2013-03-24 8 19405 1 3 2013 1.2
110206 2013-03-10 59 2255 1 3 2013 1.2
24473 2013-03-07 31 15119 1 3 2013 1.2
Here is the dput().
structure(list(date = structure(c(15709, 15736, 15714, 15725,
15734, 15758, 15744, 15748, 15752, 15759, 15795, 15781, 15788,
15774, 15771), class = "Date"), shop_id = c(4L, 1L, 30L, 41L,
26L, 16L, 25L, 10L, 29L, 52L, 54L, 42L, 8L, 59L, 31L), item_id = c(1904L,
17880L, 14439L, 15010L, 10917L, 10331L, 2751L, 1475L, 16071L,
13901L, 2972L, 11184L, 19405L, 2255L, 15119L), item_cnt_day = c(1,
1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1), month = c(1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), year = c(2013L,
2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L,
2013L, 2013L, 2013L, 2013L, 2013L)), row.names = c(30717L, 43051L,
66273L, 105068L, 23332L, 28838L, 40418L, 62219L, 98641L, 21905L,
32219L, 45156L, 69513L, 110206L, 24473L), class = "data.frame")
Maybe this?
library(lubridate)
df$lag <- df$date %m-% months(1)
df$rollmean <- sapply(1:nrow(df), function(x) mean(df[df$date <= df$date[x] & df$date >= df$lag[x], "item_cnt_day" ]))
date shop_id item_id item_cnt_day month year lag rollmean
30717 2013-01-04 4 1904 1 1 2013 2012-12-04 1.000000
43051 2013-01-31 1 17880 1 1 2013 2012-12-31 1.000000
66273 2013-01-09 30 14439 1 1 2013 2012-12-09 1.000000
105068 2013-01-20 41 15010 1 1 2013 2012-12-20 1.000000
23332 2013-01-29 26 10917 1 1 2013 2012-12-29 1.000000
28838 2013-02-22 16 10331 1 2 2013 2013-01-22 1.166667
40418 2013-02-08 25 2751 2 2 2013 2013-01-08 1.200000
62219 2013-02-12 10 1475 1 2 2013 2013-01-12 1.200000
98641 2013-02-16 29 16071 1 2 2013 2013-01-16 1.166667
21905 2013-02-23 52 13901 2 2 2013 2013-01-23 1.285714
32219 2013-03-31 54 2972 1 3 2013 2013-02-28 1.000000
45156 2013-03-17 42 11184 1 3 2013 2013-02-17 1.200000
69513 2013-03-24 8 19405 1 3 2013 2013-02-24 1.000000
110206 2013-03-10 59 2255 1 3 2013 2013-02-10 1.166667
24473 2013-03-07 31 15119 1 3 2013 2013-02-07 1.333333
%m-% calculates, for every date, the date one month earlier while accounting for the different lengths of the months (31, 30, or 28 days), and stores it in the column lag. Then, in sapply(), the mean of item_cnt_day is calculated over all observations whose date lies between lag and date for the current row.
So it doesn't matter how many elements are there for each month or how the elements are ordered.
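A quick sketch of how %m-% behaves at a month end, using two dates from the table above:

```r
library(lubridate)

d <- as.Date(c("2013-03-31", "2013-03-17"))

# %m-% rolls back to the last valid day when the target day does not exist
one_month_back <- d %m-% months(1)
one_month_back
```

The first date maps to 2013-02-28 and the second to 2013-02-17, as in the lag column above. Plain subtraction of a period (d - months(1)) does not roll back in this way and can give NA for dates such as the 31st.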
The date class supports seq for different time intervals (documentation).
So you can basically do:
calculate_lag <- function(date) {
return(seq(date, by = "1 month", length.out = 2)[2])
}
date_column <- as.Date(sapply( _YOUR_DATAFRAME_ , calculate_lag), origin="1970-01-01")
I am not really familiar with calculating lags, but maybe that is what you want?
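A small sketch of what seq() does with a monthly step, on a mid-month date chosen for illustration. Note that by = "1 month" moves forward in time; a negative step moves backward.

```r
d <- as.Date("2013-03-15")

# Second element of a two-step monthly sequence
month_later   <- seq(d, by = "1 month",  length.out = 2)[2]
month_earlier <- seq(d, by = "-1 month", length.out = 2)[2]

month_later    # 2013-04-15
month_earlier  # 2013-02-15
```

For days that do not exist in the target month (for example the 31st), seq() can roll over into the following month rather than stopping at the month end, so results near month ends are worth checking.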
Data:
df <- structure(list(date = structure(c(15709, 15736, 15714, 15725,
15734, 15758, 15744, 15748, 15752, 15759, 15795, 15781, 15788,
15774, 15771), class = "Date"), shop_id = c(4L, 1L, 30L, 41L,
26L, 16L, 25L, 10L, 29L, 52L, 54L, 42L, 8L, 59L, 31L), item_id = c(1904L,
17880L, 14439L, 15010L, 10917L, 10331L, 2751L, 1475L, 16071L,
13901L, 2972L, 11184L, 19405L, 2255L, 15119L), item_cnt_day = c(1,
1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1), month = c(1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), year = c(2013L,
2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L,
2013L, 2013L, 2013L, 2013L, 2013L)), row.names = c(30717L, 43051L,
66273L, 105068L, 23332L, 28838L, 40418L, 62219L, 98641L, 21905L,
32219L, 45156L, 69513L, 110206L, 24473L), class = "data.frame")
Calculation:
df %>%
dplyr::mutate(days_in_month = lubridate::days_in_month(date)) %>%
tidyr::nest(-c(month, days_in_month)) %>%
dplyr::mutate(lag = purrr::map2(data, days_in_month, ~ stats::lag(.x$item_cnt_day, .y)))
EDIT based on comment:
maybe this then?
df %>%
tidyr::nest(-month) %>%
dplyr::mutate(
ndays = purrr::map_int(data, nrow),
lag = purrr::map2_dbl(data, ndays, ~ zoo::rollmean(.x$item_cnt_day, .y))
)
Apologies if this is unclear. Say I have a dataframe as such:
ID TIME AMOUNTSPENT
01 12:34 50
01 14:37 100
02 12:40 25
03 10:10 50
01 14:35 25
I would like to generate a lot of features from it, specifically features based on TIME, such as the mean spend per hour for each unique ID. This would typically generate 24 columns, one for each hour of the day. The resulting data frame would look something like this:
ID HOUR12MEANSPEND HOUR13MEANSPEND HOUR14MEANSPEND
01 37.5 0 100
I understand this is a complex problem to explain; even some tips on how to begin would be a massive help!
One way with dplyr and reshape2:
library(dplyr)
library(reshape2)
df %>%
#grouping - only by the hour
group_by(ID, TIME = substr(TIME, 1, 2)) %>%
#summarise
summarise(averagespend = mean(AMOUNTSPENT)) %>%
#cast time in columns
dcast(ID ~ TIME, value.var = 'averagespend')
Output:
ID 10 12 14
1 1 NA 50 62.5
2 2 NA 25 NA
3 3 50 NA NA
Data:
structure(list(ID = c(1L, 1L, 2L, 3L, 1L), TIME = structure(c(2L,
5L, 3L, 1L, 4L), .Label = c("10:10", "12:34", "12:40", "14:35",
"14:37"), class = "factor"), AMOUNTSPENT = c(50L, 100L, 25L,
50L, 25L)), .Names = c("ID", "TIME", "AMOUNTSPENT"), class = "data.frame", row.names = c(NA,
-5L))
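For comparison, the same hour-by-ID table can be sketched in base R with tapply(). This is an alternative to the dplyr/reshape2 route above, not a replacement for it; the data frame is rebuilt here with TIME as character, and hour and wide are hypothetical names.

```r
# Same values as the data above, with TIME stored as character
df <- data.frame(
  ID = c(1L, 1L, 2L, 3L, 1L),
  TIME = c("12:34", "14:37", "12:40", "10:10", "14:35"),
  AMOUNTSPENT = c(50, 100, 25, 50, 25)
)

# Everything before the colon is the hour; also works for times like "9:58"
hour <- sub(":.*", "", df$TIME)

# Mean spend for every ID x hour combination; empty cells become NA
wide <- tapply(df$AMOUNTSPENT, list(ID = df$ID, HOUR = hour), mean)
wide
```

The result is a matrix with one row per ID and one column per observed hour. Using sub() rather than substr(TIME, 1, 2) avoids picking up the colon when the hour has a single digit.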