I’m working with a dataset that contains GPS locations for a small group of polar bears. For every bear, there should theoretically be one location every 4 hours, but unfortunately the radio collars don’t always work perfectly and there are gaps in my data.
My goal is to produce a CSV that, for each bear, keeps only the longest run of consecutive locations between gaps.
For example, if a bear’s data is composed of 100 locations, then has one gap, and then 50 locations, I only want to subset the first 100 locations in the final csv.
Here is code to generate the kind of dataset I would use:
bears<-as.character(c(rep("bear1",times=5),rep("bear2",times=5)))
time<-c("2007-09-08 13:00:00","NA","2007-09-08 21:00:00","2007-09-09 1:00:00","NA","NA","2007-10-09 17:00:00","2007-10-09 1:00:00","NA","2007-10-09 9:00:00")
bear.data<-data.frame(bears,time)
Where:
bears refers to the individual bear.
time refers to the time at which a particular location is transmitted. When the collar fails to transmit a GPS location, this column has a value of NA.
Any help would be appreciated!!
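library(data.table) # for rleid()
library(dplyr)
library(lubridate) # for ymd_hms() and floor_date()
bear.data <- data.frame(bears, time) %>%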
mutate(time = ymd_hms(time),
helper = floor_date(time, unit = "year"),
seq = rleid(helper)) %>%
filter(!is.na(helper)) %>%
group_by(bears, seq) %>%
add_tally() %>% ungroup() %>%
group_by(bears) %>%
slice_max(n)
The problem can be thought of as finding the maximum length of blocks of boolean values per group:
bear.data$time <- as.Date(bear.data$time)
bear.data$not_na <- !is.na(bear.data$time)
bear.data$gap <- ave(bear.data$not_na, cumsum(!bear.data$not_na), FUN = cumsum)
aggregate(gap ~ bears, FUN = max, data=bear.data)
Output
> aggregate(gap ~ bears, FUN = max, data=bear.data)
bears gap
1 bear1 2
2 bear2 3
Data
bears time
1 bear1 2007-09-08 13:00:00
2 bear1 NA
3 bear1 2007-09-08 21:00:00
4 bear1 2007-09-09 1:00:00
5 bear1 NA
6 bear2 NA
7 bear2 2007-10-09 17:00:00
8 bear2 2007-10-09 17:00:00
9 bear2 2007-10-09 1:00:00
10 bear2 NA
11 bear2 2007-10-09 9:00:00
You can create a function that calculates the rows of the longest non-NA sequence for a bear. This function is based on rle() and is.na():
seq_max <- function(x) {
r <- rle(!is.na(x))
rd <- as.data.frame(unclass(r))
rd$ends <- cumsum(rd$lengths)
rd$starts <- c(1, rd$ends[-length(rd$ends)] + 1)
rd <- rd[rd$values, ]
rd <- rd[which.max(rd$lengths)[1], ]
seq(rd$starts, rd$ends)
}
Then you apply it to each bear. This is very convenient with dplyr:
library(dplyr)
bear.data %>%
group_by(bears) %>%
slice(seq_max(time))
If you were to do this in base R, first write a Mode function (it returns the most frequently occurring element):
Mode <- function(x){
y <- unique(x)
y[which.max(tabulate(match(x,y)))]
}
Now write a logical function that will give the maximum ids:
max_ids <- function(x){
id <- with(rle(x),rep(seq_along(values),lengths))
id == Mode(id) # Uses the mode function above
}
Use the two functions as follows:
subset(bear.data, ave(is.na(as.Date(time)), bears, FUN = max_ids))
bears time
3 bear1 2007-09-08 21:00:00
4 bear1 2007-09-09 1:00:00
7 bear2 2007-10-09 17:00:00
8 bear2 2007-10-09 1:00:00
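One caveat with run-based approaches like these: a block of NA values is also a run, so it is worth checking that the selected block is made of real locations. The following is a minimal base R sketch under that assumption. The helper name longest_valid_run is hypothetical, the sample data is made up so that bear2's longest run is a gap, and the CSV is written to a temporary file.

```r
# Hypothetical helper: flags the rows of the first longest run of non-NA values.
longest_valid_run <- function(x) {
  r <- rle(!is.na(x))
  ends <- cumsum(r$lengths)
  starts <- ends - r$lengths + 1
  keep <- logical(length(x))
  valid <- which(r$values)                      # runs made of real locations
  if (length(valid) > 0) {
    best <- valid[which.max(r$lengths[valid])]  # first longest valid run
    keep[starts[best]:ends[best]] <- TRUE
  }
  keep
}

# Made-up data: bear2's longest run is a gap of three NAs.
bear.data <- data.frame(
  bears = rep(c("bear1", "bear2"), each = 5),
  time = c("2007-09-08 13:00:00", NA, "2007-09-08 21:00:00", "2007-09-09 01:00:00", NA,
           NA, NA, NA, "2007-10-09 01:00:00", NA),
  stringsAsFactors = FALSE
)

# Apply the helper within each bear; ave() stores the logical flags as 0/1.
flag <- ave(seq_len(nrow(bear.data)), bear.data$bears,
            FUN = function(i) longest_valid_run(bear.data$time[i]))
result <- bear.data[flag == 1, ]

# Write the subset to a CSV file (a temporary path here).
out_file <- tempfile(fileext = ".csv")
write.csv(result, out_file, row.names = FALSE)
```

With this data the result keeps rows 3 and 4 for bear1 and row 9 for bear2. When two runs tie for the longest, the sketch keeps the first one.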
Related
I am working with these data in R. These are the first six rows (not counting the first column, which the write.csv function always adds):
> head(my_data)
client_id contract_id contract_start contract_end inter_complex
1 1 15/07/2019 15/07/2020 18092+18458i
3 3 1/01/2015 1/01/2015 16436+16436i
5 5 12/06/2020 12/06/2020 18425+18425i
13 13 1/01/2015 1/01/2015 16436+16436i
18 18 1/01/2015 1/01/2015 16436+16436i
19 19 1/01/2015 1/01/2015 16436+16436i
Each row represents a different contract. The variable inter_complex is a complex number whose real part is the numeric representation of the date when a contract started, and whose imaginary part analogously represents the date when a contract ended. In case you're wondering, you can obtain that column by executing this:
library(tidyverse)
library(lubridate)
chars_2_cplex = function(start, end) {
cbind(start, end) %>%
apply(2, compose(as.numeric, dmy)) %*% rbind(1, 1i)
}
my_data %>% transmute(inter_complex = chars_2_cplex(contract_start, contract_end))
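To make the encoding concrete, here is a small base R sketch for a single contract, using the dates from the first row of the data shown above. It only illustrates the representation and is not the tidyverse function itself.

```r
# Encode one contract's start and end dates as a single complex number.
start <- as.Date("15/07/2019", format = "%d/%m/%Y")
end   <- as.Date("15/07/2020", format = "%d/%m/%Y")
z <- complex(real = as.numeric(start), imaginary = as.numeric(end))
z   # 18092+18458i

# Decode it again: the real part is the start date, the imaginary part the end date.
decoded_start <- as.Date(Re(z), origin = "1970-01-01")
decoded_end   <- as.Date(Im(z), origin = "1970-01-01")
```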
What I want is, for each client id and each contract, to identify how many contracts associated with that same client id intersect with that contract. In other words, I want to create a new column called simultaneous that shows, for each row (i.e. for each contract), how many active contracts the corresponding client has during the period in which the current contract is active. If a contract does not intersect with any other contract, the value of simultaneous should be 1, since while that contract is active it is the only active contract that the client has.
I figured it would help to obtain the combinations of inter_complex, then turn those combinations of complex numbers into combinations of intervals, and then use lubridate's intersect function to discern whether or not each combination of intervals intersect. For that purpose, I have written the following code:
## This function turns complex numbers into intervals.
cplex_2_inter = function(x) {
start = x %>% Re() %>% as.integer()
end = x %>% Im() %>% as.integer()
interval(as_date(start), as_date(end))
}
## This other function returns a list whose j-th element is a data frame that shows the intersections
## between combinations of j + 1 intervals.
get_intersections = function(x) {
max_m = length(x)
output = vector(mode = "list", length = max_m - 1)
for (i in 2:max_m) {
output[[i - 1]] = combn(x, m = i) %>% t() %>% as.data.frame() %>%
mutate_all(cplex_2_inter) %>% rowid_to_column("id") %>%
pivot_longer(-id) %>% group_by(id) %>%
mutate(simultaneous = do.call(lubridate::intersect, as.list(value))) %>%
mutate(simultaneous = if_else(as.character(simultaneous) != "NA--NA", i, 1L))
}
return(output)
}
In order to get a better grasp on what the function get_intersections does, I propose that you run the following:
example = my_data %>% filter(client_id == 1) %>% pull(inter_complex) %>% get_intersections()
The data frame example[[1]] shows whether there are intersections (or, to use a better word, overlaps) between pairs of intervals. The data frame example[[2]] shows whether there are overlaps between groups of three intervals, and so on.
You may notice that according to example[[1]] the interval 2019-07-15 UTC--2020-07-15 UTC overlaps with some other interval —and hence, the associated value of simultaneous is 2— while, according to example[[2]], that very same interval is associated to a value of 3 for the variable simultaneous. Naturally, the idea is to assign to each interval its highest simultaneous value.
Since I do not care about global overlaps but rather about overlaps within each client id, I figured I would need to work on a grouped data frame. The furthest I got on this project was writing this:
my_data %>% group_by(client_id) %>% group_map(~ get_intersections(.x$inter_complex))
Now on to my questions. 1) I have executed the line above, but the process is not very efficient: it has been running for a bit more than a whole day and still hasn't finished. Recently I came across the concept of interval trees, but I'm not a computer scientist and I would need help to tackle this problem in a smarter way. 2) In case we stick to my not-so-smart approach, I would still need a function that accesses each element of the list returned by get_intersections, so as to identify and retrieve the highest simultaneous value associated with each interval. I would appreciate help on that as well.
Edit
Regarding Wimpel's answer, I have examined their data table and I found this.
> DT %>% filter(client_id == 502 & contract_id == 3093) %>%
> select(contract_start, contract_end, contract_intersect)
# Output
contract_start contract_end contract_intersect
1: 2018-01-11 2019-01-11 7
That is, the displayed contract allegedly overlaps with seven other contracts that the same client has.
On the other hand, let's see if this holds true when using my combinations-based approach.
combs_10_502 = my_data %>% filter(client_id == 502) %>% pull(inter_complex) %>%
combn(10) %>% t() %>% as.data.frame() %>% mutate_all(cplex_2_inter) %>%
rowid_to_column("id") %>% pivot_longer(-id) %>% group_by(id) %>%
mutate(simultaneous = do.call(lubridate::intersect, as.list(value))) %>%
ungroup() %>%
mutate(simultaneous = if_else(as.character(simultaneous) != "NA--NA", 10L, 1L))
> combs_10_502 %>% filter(simultaneous == 10) %>% slice(11:20)
# A tibble: 10 x 4
id name value simultaneous
<int> <chr> <Interval> <int>
1 24311 V1 2018-01-11 UTC--2019-01-11 UTC 10
2 24311 V2 2018-03-01 UTC--2019-03-01 UTC 10
3 24311 V3 2018-07-11 UTC--2019-07-11 UTC 10
4 24311 V4 2018-04-20 UTC--2019-04-20 UTC 10
5 24311 V5 2018-05-21 UTC--2019-05-21 UTC 10
6 24311 V6 2018-08-10 UTC--2019-08-10 UTC 10
7 24311 V7 2018-08-09 UTC--2019-08-09 UTC 10
8 24311 V8 2018-09-27 UTC--2019-09-27 UTC 10
9 24311 V9 2020-01-03 UTC--2021-01-03 UTC 10
10 24311 V10 2019-12-19 UTC--2020-12-19 UTC 10
The same contract is displayed on the first row of the tibble above. As can be seen, that contract actually overlaps with nine other contracts of the given client —those nine are displayed on the remaining rows—.
I don't know how Wimpel's solution got this wrong, but I checked that it does get the number of intersections right for several other contracts. Now I know that a data table-based solution is what I am looking for, since the processes are made very fast, but there seems to be an issue with the proposed solution.
I believe you are looking for something like this?
library(data.table)
DT <- fread("https://raw.githubusercontent.com/pazos-feren/Data/main/contracts.csv")
#set dates as real dates
DT[, contract_start := as.Date(contract_start, format = "%d/%m/%Y")]
DT[, contract_end := as.Date(contract_end, format = "%d/%m/%Y")]
setkey(DT, V1)
DT[DT, c("contract_intersect", "contract_intersect_ids") := {
val = DT[ !V1 == i.V1 & client_id == i.client_id &
contract_start <= i.contract_end & contract_end >= i.contract_start, ]
list( nrow(val), paste0(val$contract_id, collapse = ";") )
}, by = .EACHI]
# V1 client_id contract_id contract_start contract_end inter_complex contract_intersect contract_intersect_ids
# 1: 1 1 1 2019-07-15 2020-07-15 18092+18458i 2 4162;4168
# 2: 2 3 3 2015-01-01 2015-01-01 16436+16436i 0
# 3: 3 5 5 2020-06-12 2020-06-12 18425+18425i 0
# 4: 4 13 13 2015-01-01 2015-01-01 16436+16436i 0
# 5: 5 18 18 2015-01-01 2015-01-01 16436+16436i 0
# 6: 6 19 19 2015-01-01 2015-01-01 16436+16436i 0
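For comparison, a pairwise overlap count can also be sketched in base R without a join. This assumes that "simultaneous" means the number of the client's contracts (the contract itself included) whose period overlaps the given one, which is only one possible reading of the question. The function name count_overlaps and the four sample contracts are made up for illustration.

```r
# Hypothetical helper: for each interval, count the intervals that overlap it
# (closed intervals, the interval itself included).
count_overlaps <- function(start, end) {
  s <- as.numeric(start)
  e <- as.numeric(end)
  # overlap[i, j] is TRUE when s[i] <= e[j] and e[i] >= s[j]
  overlap <- outer(s, e, "<=") & outer(e, s, ">=")
  rowSums(overlap)
}

# Made-up contracts: client 1 has two overlapping contracts and one isolated one.
contracts <- data.frame(
  client_id = c(1, 1, 1, 2),
  contract_start = as.Date(c("2019-07-15", "2020-01-01", "2021-06-01", "2015-01-01")),
  contract_end = as.Date(c("2020-07-15", "2020-12-31", "2021-12-01", "2015-01-01"))
)

# Count overlaps separately within each client.
contracts$simultaneous <- ave(
  seq_len(nrow(contracts)), contracts$client_id,
  FUN = function(i) count_overlaps(contracts$contract_start[i], contracts$contract_end[i])
)
contracts$simultaneous   # 2 2 1 1
```

The pairwise matrix grows quadratically with the number of contracts per client, so this is only meant for checking a handful of clients by hand. It counts contracts that overlap the given one, not the largest group of contracts that are all active at the same moment.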
I have a data frame with dates and numbers called 'df'. I have another data frame with start and end dates called 'date_ranges'.
My goal is to filter/subset df so that it only shows the rows whose dates fall between the start and end dates in each row of date_ranges. Here is my code so far:
df_date <- as.Date((as.Date('2010-01-01'):as.Date('2010-04-30')))
df_numbers <- c(1:120)
df <- data.frame(df_date, df_numbers)
start_dates <- as.Date(c("2010-01-06", "2010-02-01", '2010-04-15'))
end_dates <- as.Date(c("2010-01-23", "2010-02-06", '2010-04-29'))
date_ranges <- data.frame(start_dates, end_dates)
# Attempting to filter df by start and end dates
for (i in range(date_ranges$start_dates)){
for (j in range(date_ranges$end_dates)){
print (
df %>%
filter(between(df_date, i, j)))
}
}
The first and third result of the nested for loop is what I want, but not the second result. The first and third give me the dates and values for df between their respective rows, but the second result is the range from the earliest date to the latest date. How can I fix this loop to exclude the second result?
A tidyverse approach could be to create a sequence between start and end_dates and join with df to keep only the dates which lie in the range.
library(dplyr)
date_ranges %>%
mutate(df_date = purrr::map2(start_dates, end_dates, seq, "day")) %>%
tidyr::unnest(df_date) %>%
select(-start_dates, -end_dates) %>%
left_join(df, by = 'df_date')
# A tibble: 39 x 2
# df_date df_numbers
# <date> <int>
# 1 2010-01-06 6
# 2 2010-01-07 7
# 3 2010-01-08 8
# 4 2010-01-09 9
# 5 2010-01-10 10
# 6 2010-01-11 11
# 7 2010-01-12 12
# 8 2010-01-13 13
# 9 2010-01-14 14
#10 2010-01-15 15
# … with 29 more rows
You can try looping over the row indices instead:
for (i in seq_along(date_ranges$start_dates)){
print (
df %>%
filter(between(df_date, date_ranges$start_dates[i], date_ranges$end_dates[i])))
}
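The reason the original loop misbehaves is that range() returns only the minimum and maximum of a vector, so it yields two values however many rows date_ranges has, whereas seq_along() yields one index per row. A small base R sketch of the difference, collecting one subset per range in a list (the name subsets is just illustrative):

```r
df <- data.frame(df_date = seq(as.Date("2010-01-01"), as.Date("2010-04-30"), by = "day"),
                 df_numbers = 1:120)
start_dates <- as.Date(c("2010-01-06", "2010-02-01", "2010-04-15"))
end_dates <- as.Date(c("2010-01-23", "2010-02-06", "2010-04-29"))

range(start_dates)      # two values: the earliest and the latest start date
seq_along(start_dates)  # 1 2 3, one index per date range

# One filtered data frame per (start, end) pair.
subsets <- lapply(seq_along(start_dates), function(i) {
  df[df$df_date >= start_dates[i] & df$df_date <= end_dates[i], ]
})
sapply(subsets, nrow)   # 18 6 15
```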
Base R solution:
# Your data creation can be simplified:
df <- data.frame(df_date = seq.Date(as.Date('2010-01-01', "%Y-%m-%d"), as.Date('2010-04-30', "%Y-%m-%d"),
by = 1), df_numbers = c(1:120))
# Store start and end date vectors to filter the data.frame:
start_dates <- as.Date(c("2010-01-06", "2010-02-01", '2010-04-15'))
end_dates <- as.Date(c("2010-01-23", "2010-02-06", '2010-04-29'))
# Subset the data to the rows whose date falls inside any of the ranges:
in_range <- sapply(seq_along(start_dates), function(i)
df$df_date >= start_dates[i] & df$df_date <= end_dates[i])
df[rowSums(in_range) > 0, ]
I have a large number of files (~1200), each of which contains a long time series of groundwater heights. The starting date and length of the series differ between files. There can be large gaps between dates, for example (a small part of such a file):
Date Height (cm)
14-1-1980 7659
28-1-1980 7632
14-2-1980 7661
14-3-1980 7638
28-3-1980 7642
14-4-1980 7652
25-4-1980 7646
14-5-1980 7635
29-5-1980 7622
13-6-1980 7606
27-6-1980 7598
14-7-1980 7654
28-7-1980 7654
14-8-1980 7627
28-8-1980 7600
12-9-1980 7617
14-10-1980 7596
28-10-1980 7601
14-11-1980 7592
28-11-1980 7614
11-12-1980 7650
29-12-1980 7670
14-1-1981 7698
28-1-1981 7700
13-2-1981 7694
17-3-1981 7740
30-3-1981 7683
14-4-1981 7692
14-5-1981 7682
15-6-1981 7696
17-7-1981 7706
28-7-1981 7699
28-8-1981 7686
30-9-1981 7678
17-11-1981 7723
11-12-1981 7803
18-2-1982 7757
16-3-1982 7773
13-5-1982 7753
11-6-1982 7740
14-7-1982 7731
15-8-1982 7739
14-9-1982 7722
14-10-1982 7794
15-11-1982 7764
14-12-1982 7790
14-1-1983 7810
28-3-1983 7836
28-4-1983 7815
31-5-1983 7857
29-6-1983 7801
28-7-1983 7774
24-8-1983 7758
28-9-1983 7748
26-10-1983 7727
29-11-1983 7782
27-1-1984 7801
28-3-1984 7764
27-4-1984 7752
28-5-1984 7795
27-7-1984 7748
27-8-1984 7729
28-9-1984 7752
26-10-1984 7789
28-11-1984 7797
18-12-1984 7781
28-1-1985 7833
21-2-1985 7778
22-4-1985 7794
28-5-1985 7768
28-6-1985 7836
26-8-1985 7765
19-9-1985 7760
31-10-1985 7756
26-11-1985 7760
20-12-1985 7781
17-1-1986 7813
28-1-1986 7852
26-2-1986 7797
25-3-1986 7838
22-4-1986 7807
27-5-1986 7785
24-6-1986 7787
26-8-1986 7744
23-9-1986 7742
22-10-1986 7752
1-12-1986 7749
17-12-1986 7758
I want to calculate the average height over 5 years. So, in the case of the example: 14-1-1980 + 5 years, 14-1-1985 + 5 years, .... The number of data points is different for each average. It is very likely that the date 5 years later will not be in the dataset as a data point. Hence, I think I need to tell R somehow to take an average over a certain timespan.
I searched on the internet but didn't find anything that fitted my needs. A lot of useful packages like uts, zoo and lubridate, and the function aggregate, came up. Instead of getting closer to the solution, I am getting more and more confused about which approach is best for my problem.
Thanks a lot in advance!
As @vagabond points out, you'll want to combine your 1200 files into a single data frame (the plyr package would allow you to do something simple like: data.all <- adply(dir([DATA FOLDER]), 1, read.csv)).
Once you have the data, the first step would be to transform the Date column into proper Date data. Right now the data appear to be strings, and we want them to have an underlying numerical representation (which the Date class provides):
library(lubridate)
df$date.new <- as.Date(dmy(df$Date))
Date Height date.new
1 14-1-1980 7659 1980-01-14
2 28-1-1980 7632 1980-01-28
3 14-2-1980 7661 1980-02-14
4 14-3-1980 7638 1980-03-14
5 28-3-1980 7642 1980-03-28
6 14-4-1980 7652 1980-04-14
Note that the date.new column looks like a string, but is in fact Date data, and can be handled with numerical operations (addition, comparison, etc.).
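As a quick illustration of that numerical representation, here is a base R sketch using the first date of the sample (the variable names are only for illustration):

```r
d <- as.Date("14-1-1980", format = "%d-%m-%Y")

days_since_epoch <- as.numeric(d)   # 3665 days since 1970-01-01
two_weeks_later <- d + 14           # adding days gives another Date: 1980-01-28
is_earlier <- d < as.Date("1980-02-14")          # TRUE
gap_in_days <- as.numeric(as.Date("1980-02-14") - d)   # 31
```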
Next, we might construct a set of date periods, over which we want to compute averages. Your example mentions 5 years, but with the data you provided, that's not a very illustrative example. So here I'm creating 1-year periods starting at every day between Jan 14 1980 and Jan 14 1985
date.start <- as.Date(as.Date('1980-01-14') : as.Date('1985-01-14'), origin = '1970-01-01')
date.end <- date.start %m+% years(1) # %m+% rolls 29 Feb back to 28 Feb; a plain + would give NA there
dates <- data.frame(start = date.start, end = date.end)
start end
1 1980-01-14 1981-01-14
2 1980-01-15 1981-01-15
3 1980-01-16 1981-01-16
4 1980-01-17 1981-01-17
5 1980-01-18 1981-01-18
6 1980-01-19 1981-01-19
Then we can use the dplyr package to move through each row of this data frame and compute a summary average of Height:
library(dplyr)
df.mean <- dates %>%
group_by(start, end) %>%
summarize(height.mean = mean(df$Height[df$date.new >= start & df$date.new < end]))
start end height.mean
<date> <date> <dbl>
1 1980-01-14 1981-01-14 7630.273
2 1980-01-15 1981-01-15 7632.045
3 1980-01-16 1981-01-16 7632.045
4 1980-01-17 1981-01-17 7632.045
5 1980-01-18 1981-01-18 7632.045
6 1980-01-19 1981-01-19 7632.045
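The same idea can be applied to the non-overlapping 5-year windows from the question. Below is a base R sketch, assuming the windows start at the first observation and that each window includes its start date but not its end date. It uses five rows picked from the sample data, and the variable names are illustrative.

```r
# Five observations taken from the sample file.
dates <- as.Date(c("14-1-1980", "28-3-1983", "18-12-1984", "28-1-1985", "17-12-1986"),
                 format = "%d-%m-%Y")
height <- c(7659, 7836, 7781, 7833, 7758)

# Window starts: the first date, then every 5 years up to the last observation.
breaks <- seq(min(dates), max(dates), by = "5 years")

# Index of the window each observation falls into.
window <- findInterval(as.numeric(dates), as.numeric(breaks))

# Mean height per window, labelled with the window's start date.
means <- tapply(height, window, mean)
result <- data.frame(start = breaks[as.integer(names(means))],
                     mean_height = as.numeric(means))
```

Here the windows start on 1980-01-14 and 1985-01-14. A window that contains no observations simply does not appear in the result.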
The foverlaps function is IMHO the perfect candidate for such a situation:
library(data.table)
library(lubridate)
# convert to a data.table with setDT()
# convert the 'Date'-column to date-format
# create a begin & end date for the required period
setDT(dat)[, Date := as.Date(Date, '%d-%m-%Y')
][, `:=` (begindate = Date, enddate = Date + years(1))]
# set the keys (necessary for the foverlaps function)
setkey(dat, begindate, enddate)
res <- foverlaps(dat, dat, by.x = c(1,3))[, .(moving.average = mean(i.Height)), Date]
the result:
> head(res,15)
Date moving.average
1: 1980-01-14 7633.217
2: 1980-01-28 7635.000
3: 1980-02-14 7637.696
4: 1980-03-14 7636.636
5: 1980-03-28 7641.273
6: 1980-04-14 7645.261
7: 1980-04-25 7644.955
8: 1980-05-14 7646.591
9: 1980-05-29 7647.143
10: 1980-06-13 7648.400
11: 1980-06-27 7652.900
12: 1980-07-14 7655.789
13: 1980-07-28 7660.550
14: 1980-08-14 7660.895
15: 1980-08-28 7664.000
Now you have, for each date, an average of all the values that lie between that date and one year ahead of it.
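For readers without data.table, the same forward-looking average can be sketched in base R. This assumes closed windows (both the date itself and the date exactly one year later count), uses four rows from the sample data, and is quadratic in the number of rows, so it is meant as an illustration rather than a replacement for foverlaps.

```r
dates <- as.Date(c("1980-01-14", "1980-07-14", "1981-01-14", "1981-07-17"))
height <- c(7659, 7654, 7698, 7706)

# For each date, average the heights observed from that date up to one year later.
forward_mean <- vapply(seq_along(dates), function(i) {
  window_end <- seq(dates[i], by = "year", length.out = 2)[2]  # same day next year
  mean(height[dates >= dates[i] & dates <= window_end])
}, numeric(1))
```

For the second date, for example, the window runs to 1981-07-14, so the observation from 1981-07-17 is left out and the mean is taken over 7654 and 7698.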
Hey I just tried after seeing your question!!! Ran on a sample data frame. Try it on yours after understanding the code and then let me know!
Btw, instead of an interval of 5 years, I used just 2 months (2*30 days, approximately 2 months) as the interval!
df = data.frame(Date = c("14-1-1980", "28-1-1980", "14-2-1980", "14-3-1980", "28-3-1980",
"14-4-1980", "25-4-1980", "14-5-1980", "29-5-1980", "13-6-1980",
"27-6-1980", "14-7-1980", "28-7-1980", "14-8-1980"), height = 1:14)
# as.Date(df$Date, "%d-%m-%Y")
df1 = data.frame(orig = NULL, dest = NULL, avg_ht = NULL)
orig = as.Date(df$Date, "%d-%m-%Y")[1]
dest = as.Date(df$Date, "%d-%m-%Y")[1] + 2*30 #approx 2 months
dest_final = as.Date(df$Date, "%d-%m-%Y")[14]
while (dest < dest_final){
m = mean(df$height[which(as.Date(df$Date, "%d-%m-%Y")>=orig &
as.Date(df$Date, "%d-%m-%Y")<dest )])
df1 = rbind(df1,data.frame(orig=orig,dest=dest,avg_ht=m))
orig = dest
dest = dest + 2*30
print(paste("orig:",orig, " + ","dest:",dest))
}
> df1
orig dest avg_ht
1 1980-01-14 1980-03-14 2.0
2 1980-03-14 1980-05-13 5.5
3 1980-05-13 1980-07-12 9.5
I hope this works for you as well
This is my best try, but please keep in mind that I am working with the years instead of the full date, i.e. based on the example you provided I am averaging over beginning of 1980- end of 1984.
dat<-read.csv("paixnidi.csv")
install.packages("stringr")
library(stringr)
dates<-dat[,1]
#extract the year of each measurement
years<-as.integer(str_sub(dat[,1], start= -4))
spread_y<-years[length(years)]-years[1]
ind<-list()
#find how many 5-year intervals there are (the first and last year both count)
groups<-ceiling((spread_y+1)/5)
meangroups<-matrix(0,ncol=2,nrow=groups)
k<-0
for (i in 1:groups){
#extract the indices of the dates vector within the 5-year period
ind[[i]]<-which(years>=(years[1]+k)&years<=(years[1]+k+4),arr.ind=TRUE)
meangroups[i,2]<-mean(dat[ind[[i]],2])
meangroups[i,1]<-(years[1]+k)
k<-k+5
}
colnames(meangroups)<-c("Year:Year+4","Mean Height (cm)")
I have two data frames: one containing time periods marked with unique character IDs, and another containing events with another set of unique IDs associated with them.
Period DF (code):
periodID <- c("P_UID_00", "P_UID_01", "P_UID_02", "P_UID_03")
periodStart <- as.POSIXct(c("2016/02/10 19:00", "2016/02/11 19:00",
"2016/02/12 19:00", "2016/02/13 19:00"))
periodEnd <- as.POSIXct(c("2016/02/10 21:00", "2016/02/11 21:00",
"2016/02/12 21:00", "2016/02/13 21:00"))
periodDF <- data.frame(periodID, periodStart, periodEnd)
Period DF:
periodID periodStart periodEnd
1 P_UID_00 2016-02-10 19:00:00 2016-02-10 21:00:00
2 P_UID_01 2016-02-11 19:00:00 2016-02-11 21:00:00
3 P_UID_02 2016-02-12 19:00:00 2016-02-12 21:00:00
4 P_UID_03 2016-02-13 19:00:00 2016-02-13 21:00:00
Event DF (code):
eventID <- c("E_UID_00", "E_UID_01", "E_UDI_02", "E_UID_03")
eventTime <- as.POSIXct(c("2016/02/09 19:55:01", "2016/02/11 19:12:01",
"2016/02/11 20:22:01", "2016/02/15 19:00:01"))
eventDF <- data.frame(eventID, eventTime)
Event DF:
eventID eventTime
1 E_UID_00 2016-02-09 19:55:01
2 E_UID_01 2016-02-11 19:12:01
3 E_UDI_02 2016-02-11 20:22:01
4 E_UID_03 2016-02-15 19:00:01
I want to map the event times in the second DF to the time periods in the first DF in order to match the ID of the event to the ID of the period. Essentially, the result table I want to see should look like:
eventID periodID
1 E_UID_00 NA
2 NA P_UID_00
3 E_UID_01 P_UID_01
4 E_UDI_02 P_UID_01
5 NA P_UID_02
6 NA P_UID_03
7 E_UID_03 NA
I suppose this can be achieved by using lubridate to transform the start and end columns in the first DF to intervals and then using some form of apply with an instant %within% interval combination, but I am not really familiar with lubridate and did not manage to produce working code.
Additional considerations:
- periods are completely arbitrary and can last from seconds to years
- periods never overlap, so this is not an issue
- more than one event could be associated with a time period
- it is possible for DFs to contain unassociatable events and time periods
- the solution must not include loops
- does not have to be solved with lubridate, in fact a solution with the base R will be even more welcome.
I actually managed to come up with the code that produces exactly what I wanted using lubridate. So if anyone knows how to do this in base OR simply a better way than the one suggested below, sharing this will be greatly appreciated!
First off, the start and end times in the period DF should be converted to lubridate intervals:
library(lubridate)
intervalsP <- interval(periodStart, periodEnd)
Step 2: A function should be created for checking if an instant is located within a list of intervals. The only reason I have created a separate function is to be able to use it with apply:
PeriodAssign <- function(x, y){
# x - instants
# y - intervals
variable1 <- mapply(`%within%`, x, y)
if (length(y[variable1]) != 0) {
as.character(y[variable1])
} else {
NA
}
}
NOTE: I had to use the interval to character coercion, because otherwise intervals were coerced to their length in seconds by the apply function and as such being not really useful for matching purposes - i.e. all four intervals in this example are the same length
Step 3: The function can then be used on the event DF, and both DFs can then be merged to produce the DF I was looking for:
eventDF$intervals <- lapply(eventTime, PeriodAssign, intervalsP)
periodDF$intervals <- as.character(intervalsP)
mergedDF <- merge(periodDF, eventDF, by = "intervals")
presentableDF <- mergedDF[, c(2, 5)]
# adding in the unmatched periods and events
tDF1 <- data.frame(periodDF[!(periodDF$periodID %in% presentableDF$periodID), 1], NA)
colnames(tDF1) <- c("periodID", "eventID")
presentableDF <- rbind(presentableDF, tDF1)
tDF2 <- data.frame(NA, eventDF[!(eventDF$eventID %in% presentableDF$eventID), 1])
colnames(tDF2) <- c("periodID", "eventID")
presentableDF <- rbind(presentableDF, tDF2)
presentableDF <- presentableDF[order(presentableDF[,1]),]
The eventual DF looks like:
> presentableDF
periodID eventID
3 P_UID_00 <NA>
1 P_UID_01 E_UID_01
2 P_UID_01 E_UDI_02
4 P_UID_02 <NA>
5 P_UID_03 <NA>
6 <NA> E_UID_00
7 <NA> E_UID_03
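Since a base R alternative without loops was asked for, here is one possible sketch. It relies on the stated condition that periods never overlap, so findInterval() can locate the last period that starts at or before each event. The data is retyped from the question (with the period ID spelled P_UID_02 as in the printed result), and the UTC time zone and the variable names idx, inside and result are assumptions of this sketch.

```r
periodDF <- data.frame(
  periodID = c("P_UID_00", "P_UID_01", "P_UID_02", "P_UID_03"),
  periodStart = as.POSIXct(c("2016-02-10 19:00", "2016-02-11 19:00",
                             "2016-02-12 19:00", "2016-02-13 19:00"), tz = "UTC"),
  periodEnd = as.POSIXct(c("2016-02-10 21:00", "2016-02-11 21:00",
                           "2016-02-12 21:00", "2016-02-13 21:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)
eventDF <- data.frame(
  eventID = c("E_UID_00", "E_UID_01", "E_UDI_02", "E_UID_03"),
  eventTime = as.POSIXct(c("2016-02-09 19:55:01", "2016-02-11 19:12:01",
                           "2016-02-11 20:22:01", "2016-02-15 19:00:01"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Sort periods by start; idx is the last period starting at or before each event (0 if none).
periodDF <- periodDF[order(periodDF$periodStart), ]
idx <- findInterval(as.numeric(eventDF$eventTime), as.numeric(periodDF$periodStart))

# An event belongs to that period only if it also falls before the period's end.
inside <- idx > 0
inside[inside] <- eventDF$eventTime[inside] <= periodDF$periodEnd[idx[inside]]

eventDF$periodID <- NA_character_
eventDF$periodID[inside] <- periodDF$periodID[idx[inside]]

# A full outer join keeps unmatched periods and unmatched events as well.
result <- merge(periodDF["periodID"], eventDF[c("eventID", "periodID")],
                by = "periodID", all = TRUE)
```

The result has seven rows: two events matched to P_UID_01, three periods without events, and two events without a period.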
I don't often have to work with dates in R, but I imagine this is fairly easy. I have daily data as below for several years with some values, and I want to get the sum of the values for each 8-day period. What is the best approach?
Any help you can provide will be greatly appreciated!
str(temp)
'data.frame':648 obs. of 2 variables:
$ Date : Factor w/ 648 levels "2001-03-24","2001-03-25",..: 1 2 3 4 5 6 7 8 9 10 ...
$ conv2: num -3.93 -6.44 -5.48 -6.09 -7.46 ...
head(temp)
Date amount
24/03/2001 -3.927020472
25/03/2001 -6.4427004
26/03/2001 -5.477592528
27/03/2001 -6.09462162
28/03/2001 -7.45666902
29/03/2001 -6.731540928
30/03/2001 -6.855206184
31/03/2001 -6.807210228
1/04/2001 -5.40278802
I tried to use the aggregate function, but for some reason it doesn't work and it aggregates in the wrong way:
z <- aggregate(amount ~ Date, timeSequence(from =as.Date("2001-03-24"),to =as.Date("2001-03-29"), by="day"),data=temp,FUN=sum)
I prefer the package xts for such manipulations.
I read your data as a zoo object. Note the flexibility of the format option.
library(xts)
ts.dat <- read.zoo(text ='Date amount
24/03/2001 -3.927020472
25/03/2001 -6.4427004
26/03/2001 -5.477592528
27/03/2001 -6.09462162
28/03/2001 -7.45666902
29/03/2001 -6.731540928
30/03/2001 -6.855206184
31/03/2001 -6.807210228
1/04/2001 -5.40278802',header=TRUE,format = '%d/%m/%Y')
Then I extract the end points of each period:
ep <- endpoints(ts.dat,'days',k=8)
Finally, I apply my function to the time series between those end points:
period.apply(x=ts.dat,ep,FUN=sum )
2001-03-29 2001-04-01
-36.13014 -19.06520
Use cut() in your aggregate() command.
Some sample data:
set.seed(1)
mydf <- data.frame(
DATE = seq(as.Date("2000/1/1"), by="day", length.out = 365),
VALS = runif(365, -5, 5))
Now, the aggregation. See ?cut.Date for details. You can specify the number of days you want in each group using cut:
output <- aggregate(VALS ~ cut(DATE, "8 days"), mydf, sum)
list(head(output), tail(output))
# [[1]]
# cut(DATE, "8 days") VALS
# 1 2000-01-01 8.242384
# 2 2000-01-09 -5.879011
# 3 2000-01-17 7.910816
# 4 2000-01-25 -6.592012
# 5 2000-02-02 2.127678
# 6 2000-02-10 6.236126
#
# [[2]]
# cut(DATE, "8 days") VALS
# 41 2000-11-16 17.8199285
# 42 2000-11-24 -0.3772209
# 43 2000-12-02 2.4406024
# 44 2000-12-10 -7.6894484
# 45 2000-12-18 7.5528077
# 46 2000-12-26 -3.5631950
rollapply. The zoo package has a rolling apply function which can also do non-rolling aggregations. First convert the temp data frame into zoo using read.zoo like this:
library(zoo)
zz <- read.zoo(temp)
and then it's just:
rollapply(zz, 8, sum, by = 8)
Drop the by = 8 if you want a rolling total instead.
(Note that the two versions of temp in your question are not the same. They have different column headings and the Date columns are in different formats. I have assumed the str(temp) output version here. For the head(temp) version one would have to add a format = "%d/%m/%Y" argument to read.zoo.)
aggregate. Here is a solution that does not use any external packages. It uses aggregate based on the original data frame.
ix <- 8 * ((1:nrow(temp) - 1) %/% 8 + 1)
aggregate(temp[2], list(period = temp[ix, 1]), sum)
Note that ix looks like this:
> ix
[1] 8 8 8 8 8 8 8 8 16
so it groups the indices of the first 8 rows, the second 8 and so on.
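Grouping by row index assumes one row per day with no missing days. A base R sketch that groups by elapsed days since the first date instead, so a gap in the dates does not shift the 8-day windows (the nine rows are the head(temp) sample from the question; the names period and result are illustrative):

```r
temp <- data.frame(
  Date = c("24/03/2001", "25/03/2001", "26/03/2001", "27/03/2001", "28/03/2001",
           "29/03/2001", "30/03/2001", "31/03/2001", "1/04/2001"),
  amount = c(-3.927020472, -6.4427004, -5.477592528, -6.09462162, -7.45666902,
             -6.731540928, -6.855206184, -6.807210228, -5.40278802),
  stringsAsFactors = FALSE
)

dates <- as.Date(temp$Date, format = "%d/%m/%Y")

# 0 for days 0-7 after the first date, 1 for days 8-15, and so on.
period <- (as.numeric(dates) - as.numeric(min(dates))) %/% 8

sums <- tapply(temp$amount, period, sum)
result <- data.frame(start = min(dates) + 8 * as.integer(names(sums)),
                     amount = as.numeric(sums))
```

For this sample the two periods start on 2001-03-24 and 2001-04-01, with the first covering eight days and the second a single day.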
Those are NOT Date classed variables. (No self-respecting program would display a date like that, not to mention the fact that these are labeled as factors.) [I later noticed these were not the same objects.] Furthermore, the timeSequence function (at least the one in the timeDate package) does not return a Date class vector either. So your expectation that there would be a "right way" for two disparate non-Date objects to be aligned in a sensible manner is ill-conceived. The irony is that just using the temp$Date column would have worked since :
> z <- aggregate(amount ~ Date, data=temp , FUN=sum)
> z
Date amount
1 1/04/2001 -5.402788
2 24/03/2001 -3.927020
3 25/03/2001 -6.442700
4 26/03/2001 -5.477593
5 27/03/2001 -6.094622
6 28/03/2001 -7.456669
7 29/03/2001 -6.731541
8 30/03/2001 -6.855206
9 31/03/2001 -6.807210
But to get it in 8 day intervals use cut.Date:
> z <- aggregate(temp$amount ,
list(Dts = cut(as.Date(temp$Date, format="%d/%m/%Y"),
breaks="8 day")), FUN=sum)
> z
Dts x
1 2001-03-24 -49.792561
2 2001-04-01 -5.402788
A cleaner approach that extends @G. Grothendieck's approach. Note: it does not take into account whether the dates are continuous or discontinuous; the sum is calculated over a fixed number of rows.
code
library(zoo) # provides rollapply()
interval = 8 # your desired date interval: 2 days, 3 days, or whatever
enddate = interval-1 # offset from the first row of a period to its last row
z <- aggregate(.~V1,data = df,sum) # aggregate sum of all duplicate dates
z$V1 <- as.Date(z$V1)
nrows = nrow(z) # must be computed after z has been created
data.frame ( Start.date = (z[seq(1, nrows, interval),1]),
End.date = z[seq(1, nrows, interval)+enddate,1],
Total.sum = rollapply(z$V2, interval, sum, by = interval, partial = TRUE))
output
Start.date End.date Total.sum
1 2000-01-01 2000-01-08 9.1395926
2 2000-01-09 2000-01-16 15.0343960
3 2000-01-17 2000-01-24 4.0974712
4 2000-01-25 2000-02-01 4.1102645
5 2000-02-02 2000-02-09 -11.5816277
data
df <- data.frame(
V1 = seq(as.Date("2000/1/1"), by="day", length.out = 365),
V2 = runif(365, -5, 5))