I am doing time series analysis. Part of my data is as follows:
# A tibble: 6 x 3
time DOY Value
<dttm> <dbl> <dbl>
1 2015-01-08 12:30:00 8 0.664
2 2015-01-08 13:00:00 8 0.647
3 2015-01-11 14:00:00 11 0.669
4 2015-01-11 15:00:00 11 0.644
5 2015-02-04 12:30:00 35 0.664
6 2015-02-04 13:00:00 35 0.647
I would like to calculate the maximum value over each stretch of 7 consecutive days of the data. For example:
# A tibble: 6 x 4
time DOY Value Max
<dttm> <dbl> <dbl> <dbl>
1 2015-01-08 12:30:00 8 0.664 11.669
2 2015-01-08 13:00:00 8 0.647 11.669
3 2015-01-11 14:00:00 11 0.669 11.669
4 2015-01-11 15:00:00 11 0.644 11.669
5 2015-02-04 12:30:00 35 0.664 35.664
6 2015-02-04 13:00:00 35 0.647 35.664
Welcome to R and Stack Overflow. As mentioned above, you will find many friends here if you provide a reproducible example and explain what you have done and/or where things go wrong for you. This helps others to help you.
Based on your data fragment, I show some basic operations that I think might help you. You may still need to adapt the principles to your problem case.
data
I turned your example into a tibble. Please note: when you work with dates, times, or date-times, I recommend you use the respective variable type. This will give you access to helpful functions, etc.
Please also note that you show a 6 x 3 tibble above. Splitting time into date and time-of-day, your data structure is really a 4-column tibble with Date, time, DOY, and Value!
library(dplyr) # basic dataframe/tibble operations
library(lubridate) # for datetime handling
df <- tribble(
~Date, ~time, ~DOY, ~Value
,"2015-01-08", "12:30:00", 8, 0.664
,"2015-01-08", "13:00:00", 8, 0.647
,"2015-01-11", "14:00:00", 11, 0.669
,"2015-01-11", "15:00:00", 11, 0.644
,"2015-02-04", "12:30:00", 35, 0.664
,"2015-02-04", "13:00:00", 35, 0.647
)
df <- df %>%
  mutate(timestamp = ymd_hms(paste(Date, time)))
This yields:
df
# A tibble: 6 x 5
Date time DOY Value timestamp
<chr> <chr> <dbl> <dbl> <dttm>
1 2015-01-08 12:30:00 8 0.664 2015-01-08 12:30:00
2 2015-01-08 13:00:00 8 0.647 2015-01-08 13:00:00
3 2015-01-11 14:00:00 11 0.669 2015-01-11 14:00:00
4 2015-01-11 15:00:00 11 0.644 2015-01-11 15:00:00
5 2015-02-04 12:30:00 35 0.664 2015-02-04 12:30:00
6 2015-02-04 13:00:00 35 0.647 2015-02-04 13:00:00
Note: timestamp is now a datetime object (dttm).
binning of data
It is not fully clear what your consecutive 7 days are and/or how you "group" them.
I assume you want to pick 7 days of a week.
As datetime is dttm, we can use the power of {lubridate} and extract the week from the datetime.
Note: you may want to bin/group your data differently. Think about what you want to achieve here and adapt this accordingly.
df <- df %>% mutate(bin = week(timestamp))
df
# A tibble: 6 x 6
Date time DOY Value timestamp bin
<chr> <chr> <dbl> <dbl> <dttm> <dbl>
1 2015-01-08 12:30:00 8 0.664 2015-01-08 12:30:00 2
2 2015-01-08 13:00:00 8 0.647 2015-01-08 13:00:00 2
3 2015-01-11 14:00:00 11 0.669 2015-01-11 14:00:00 2
4 2015-01-11 15:00:00 11 0.644 2015-01-11 15:00:00 2
5 2015-02-04 12:30:00 35 0.664 2015-02-04 12:30:00 5
6 2015-02-04 13:00:00 35 0.647 2015-02-04 13:00:00 5
If you want to work on "7 consecutive days", you will need to identify the groups of 7 days. Again, there are different ways to do this; check what the modulo and integer-division operators do and how to apply them to your DOY (see the clarification below).
operating on your groups/bins
You describe looking for the maximum per bin (7 days ~ week).
{dplyr} offers grouped operations for such problems. Read up on them:
df %>%
group_by(bin) %>%
summarise(MaxValue = max(Value) # we create a new variable and assign the max of each group to it
)
# A tibble: 2 x 2
bin MaxValue
<dbl> <dbl>
1 2 0.669
2 5 0.664
Obviously, you can perform many operations (summaries of your bins/groups).
Note: You can create bins on multiple variables. Read up on group_by() and summarise(..., .groups = "drop"), if you want to use this interim tibble for further calculations.
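For example, a minimal sketch of such a multi-variable grouping, reusing the df from above:
df %>%
  group_by(bin, Date) %>%                             # bin by week AND calendar date
  summarise(MaxValue = max(Value), .groups = "drop")  # returns an ungrouped tibble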
Hope this gets you started.
clarification on grouping by 7 days
If you have a sequence of (integer) numbers, there is a neat way to group this into n-element bins, i.e. using integer division.
In your case the data already comes with a day-of-year DOY variable. For completeness: with lubridate you can pull the DOY from a timestamp with the function yday(), i.e. df %>% mutate(DOY = yday(timestamp)).
# let's use integer division to group our DOYs into group of 7s
##--------- does not look at date or day
##--------- DOY 0-6 -> bin 0, DOY 7-13 -> bin 1, ..., DOY 35-41 -> bin 5
df <- df %>%
mutate(bin = DOY %/% 7)
This yields:
# A tibble: 6 x 6
Date time DOY Value timestamp bin
<chr> <chr> <dbl> <dbl> <dttm> <dbl>
1 2015-01-08 12:30:00 8 0.664 2015-01-08 12:30:00 1
2 2015-01-08 13:00:00 8 0.647 2015-01-08 13:00:00 1
3 2015-01-11 14:00:00 11 0.669 2015-01-11 14:00:00 1
4 2015-01-11 15:00:00 11 0.644 2015-01-11 15:00:00 1
5 2015-02-04 12:30:00 35 0.664 2015-02-04 12:30:00 5
6 2015-02-04 13:00:00 35 0.647 2015-02-04 13:00:00 5
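Side note: DOY %/% 7 puts DOY 7 into bin 1 together with days 8-13. If you prefer bins that run exactly over days 1-7, 8-14, ..., shift by one first:
# alternative binning: days 1-7 -> 0, days 8-14 -> 1, ...
df %>% mutate(bin = (DOY - 1) %/% 7)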
And then build your max summary as before on the (new) grouping variable:
df %>%
  group_by(bin) %>%
  summarise(MaxValue = max(Value))
# A tibble: 2 x 2
bin MaxValue
<dbl> <dbl>
1 1 0.669
2 5 0.664
For the example data given, the result is identical. However, with your full dataset and the offset between "weeks" (with their defined start date) vs cutting your DOYs into bins of 7 consecutive days, you will get a different summary (unless the first day of the week (*) coincides with DOY 1).
(*): in lubridate you can set weeks to start Monday or Sunday as a parameter (in case you ever need this).
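If you ever need that, a minimal sketch (assuming a reasonably recent lubridate and the timestamp column from above):
# floor_date() accepts week_start: 1 starts weeks on Monday, 7 on Sunday
df %>% mutate(week_begin = floor_date(timestamp, unit = "week", week_start = 1))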
Similarly to a question posted here, I want to compute the number of overlapping days between two periods, conditional on a third variable (location).
For each observation of the main dataset (DF) I have a starting and an end date, and a location (character) variable. The Events data comprises information on event location, starting date and end date. Multiple events in the same location and (partially) overlapping periods are allowed.
Thus, for each observation in DF, the period must be compared to the periods in an event dataset (Events). This means that the count of overlapping days between one period (DF) and multiple periods (Events) must be done net of overlapping days between two (or more) periods in the Events dataset.
An example of the data structure of my two data sources can be easily reproduced in R using this code (note that the location variable has been set to an integer for simplicity):
set.seed(1)
DF <- data.frame(
start = sample(seq(as.Date('2018-01-01'), as.Date('2018-04-30'), by="day"), 20),
end = sample(seq(as.Date('2018-05-01'), as.Date('2018-10-30'), by="day"), 20),
location = sample(seq(1:5)),20)
Events <- data.frame(
start = sample(seq(as.Date('2018-01-01'), as.Date('2018-04-30'), by="day"), 30),
end = sample(seq(as.Date('2018-05-01'), as.Date('2018-10-30'), by="day"), 30),
location = sample(seq(1:5)), 30 )
In the simple case in which the Events data reduces to only one event (and we do not care about the location), counting overlapping days for each observation in DF can be done easily with the following dplyr code (taken from Matthew Lundberg's answer here); also note that I have created another data frame with a single event (One_event):
library(dplyr)
One_event <- data.frame(
start = as.Date('2018-01-01'),
end = as.Date('2018-07-30'))
DF %>%
mutate(overlap = pmax(pmin(One_event$end, end) - pmax(One_event$start, start) + 1,0))
resulting in:
start end location X20 overlap
1 2018-02-01 2018-10-19 5 20 180 days
2 2018-02-14 2018-06-08 3 20 115 days
3 2018-03-09 2018-08-26 4 20 144 days
4 2018-04-17 2018-05-23 2 20 37 days
5 2018-01-24 2018-06-17 1 20 145 days
6 2018-04-14 2018-07-08 5 20 86 days
7 2018-04-18 2018-05-03 3 20 16 days
8 2018-03-16 2018-07-07 4 20 114 days
9 2018-03-12 2018-09-30 2 20 141 days
10 2018-01-07 2018-06-29 1 20 174 days
11 2018-01-23 2018-07-23 5 20 182 days
12 2018-01-20 2018-08-12 3 20 192 days
13 2018-04-23 2018-07-24 4 20 93 days
14 2018-02-11 2018-06-01 2 20 111 days
15 2018-03-23 2018-09-17 1 20 130 days
16 2018-02-22 2018-08-21 5 20 159 days
17 2018-04-24 2018-09-10 3 20 98 days
18 2018-04-13 2018-05-18 4 20 36 days
19 2018-02-08 2018-08-28 2 20 173 days
20 2018-03-20 2018-10-23 1 20 133 days
Now back to the original problem.
To compare the period of each observation in DF with the matching event(s) according to the observation's and event's location, I think it would be reasonable to use the apply function: subset the Events dataset according to the observation's location, and then run the mutate function for each row against that subset of the Events data (temp):
apply(DF, 1, function(x) {
  temp = Events[Events$location %in% x["location"], ]
  x %>%
    mutate(overlap = pmax(pmin(temp$end, end) - pmax(temp$start, start) + 1, 0))
})
There are several issues with this last part of the code. First, it does not work and gives an error message:
Error in UseMethod("mutate_") :
  no applicable method for 'mutate_' applied to an object of class "character"
Second, it does not account for two (or more periods) overlapping in the Events dataset.
Are you looking for this?
apply(DF, MARGIN = 1, function(x) {
  Events[Events$location == x["location"], ] %>%
    mutate(overlap = pmax(pmin(.data$end, x["end"]) - pmax(.data$start, x["start"])))
})
In my case this results in:
[[1]]
start end location X30 overlap
1 2018-02-01 2018-07-28 5 30 177 days
2 2018-04-14 2018-08-27 5 30 135 days
3 2018-01-23 2018-09-20 5 30 231 days
4 2018-02-22 2018-09-10 5 30 200 days
5 2018-04-04 2018-07-17 5 30 104 days
6 2018-02-06 2018-05-16 5 30 99 days
[[2]]
start end location X30 overlap
1 2018-01-24 2018-09-26 3 30 114 days
2 2018-01-07 2018-07-11 3 30 114 days
3 2018-03-23 2018-10-28 3 30 77 days
4 2018-03-20 2018-08-22 3 30 80 days
5 2018-01-26 2018-05-12 3 30 87 days
6 2018-01-31 2018-07-02 3 30 114 days
[[3]]
start end location X30 overlap
1 2018-03-09 2018-07-29 4 30 142 days
2 2018-03-16 2018-05-19 4 30 64 days
3 2018-04-23 2018-09-11 4 30 125 days
4 2018-04-13 2018-07-19 4 30 97 days
5 2018-03-05 2018-07-10 4 30 123 days
6 2018-02-05 2018-07-20 4 30 133 days
...
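The second issue from the question (periods overlapping within the Events data) is not handled above. A rough sketch of one way to address it, untested beyond the sample data and assuming that merging overlapping event intervals per location is acceptable:
library(dplyr)

# 1. Merge overlapping Event intervals within each location so that no day
#    is counted twice. A new block starts whenever an interval begins after
#    the running maximum of the previous end dates.
merged <- Events %>%
  group_by(location) %>%
  arrange(start, .by_group = TRUE) %>%
  mutate(block = cumsum(as.numeric(start) >
                          lag(cummax(as.numeric(end)), default = -Inf))) %>%
  group_by(location, block) %>%
  summarise(start = min(start), end = max(end), .groups = "drop")

# 2. For each DF row, sum the overlap against the (now disjoint) merged
#    intervals of the same location.
DF$overlap <- sapply(seq_len(nrow(DF)), function(i) {
  m  <- merged[merged$location == DF$location[i], ]
  ov <- as.numeric(pmin(m$end, DF$end[i]) - pmax(m$start, DF$start[i])) + 1
  sum(pmax(ov, 0))
})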
I'd like to create a time series chart in R that compares this year to last year. I'd like to use this year's dates as my x axis.
What is the function that, given a set of dates from 2 or more years, will make them all the same year?
Data example:
mydata <- data.frame(
date.col = as.Date(c('2014-05-23', '2014-05-24', '2014-05-25', '2015-05-23', '2015-05-24', '2015-05-25', '2016-05-23', '2016-05-24', '2016-05-25')),
value = c(10,23,15,13,26,17,22,30,19))
date.col value
1 2014-05-23 10
2 2014-05-24 23
3 2014-05-25 15
4 2015-05-23 13
5 2015-05-24 26
6 2015-05-25 17
7 2016-05-23 22
8 2016-05-24 30
9 2016-05-25 19
Expected output:
date.col value adj.date
1 2014-05-23 10 2016-05-23
2 2014-05-24 23 2016-05-24
3 2014-05-25 15 2016-05-25
4 2015-05-23 13 2016-05-23
5 2015-05-24 26 2016-05-24
6 2015-05-25 17 2016-05-25
7 2016-05-23 22 2016-05-23
8 2016-05-24 30 2016-05-24
9 2016-05-25 19 2016-05-25
Ideally, the function could fit easily into a dplyr chain:
mydata %>% normalize.dates(date.col, fit.to='2016') %>% ggplot(...)
This works well:
as.Date(format(date.col, '2016-%m-%d'))
Fit into a dplyr chain:
mydata %>% mutate(adj.date = as.Date(format(date.col, '2016-%m-%d'))) %>%
  ggplot(aes(x=adj.date)) + ...
Useful for time series plots when all dates need to be normalized for year over year comparison:
library(lubridate)  # for year()
mydata %>%
  mutate(adj.date = as.Date(format(date.col, '2016-%m-%d'))) %>%
  ggplot(aes(x=adj.date, y=value, color=as.factor(year(date.col)))) + geom_line()
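One caveat: because the trick rebuilds each date through string formatting, mapping 29 February onto a non-leap target year yields NA (2016 is a leap year, so the examples above are safe):
as.Date(format(as.Date('2016-02-29'), '2015-%m-%d'))
# [1] NA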
I guess I don't really know what to 'title' this question.
But I think this is quite a common data manipulation requirement.
I have data that has a periodic exchange between two parties of a quantity of a good. The exchanges are made hourly. Here is an example data frame:
df <- cbind.data.frame(Seller = as.character(c("A","A","A","A","A","A")),
Buyer = c("B","B","B","C","C","C"),
DateTimeFrom = c("1/07/2013 0:00","1/07/2013 9:00","1/07/2013 0:00","1/07/2013 6:00","1/07/2013 8:00","2/07/2013 9:00"),
DateTimeTo = c("1/07/2013 8:00","1/07/2013 15:00","2/07/2013 8:00","1/07/2013 9:00","1/07/2013 12:00","2/07/2013 16:00"),
Qty = c(50,10,20,25,5,5)
)
df$DateTimeFrom <- as.POSIXct(df$DateTimeFrom, format = '%d/%m/%Y %H:%M', tz = 'GMT')
df$DateTimeTo <- as.POSIXct(df$DateTimeTo, format = '%d/%m/%Y %H:%M', tz = 'GMT')
> df
Seller Buyer DateTimeFrom DateTimeTo Qty
1 A B 2013-07-01 00:00:00 2013-07-01 08:00:00 50
2 A B 2013-07-01 09:00:00 2013-07-01 15:00:00 10
3 A B 2013-07-01 00:00:00 2013-07-02 08:00:00 20
4 A C 2013-07-01 06:00:00 2013-07-01 09:00:00 25
5 A C 2013-07-01 08:00:00 2013-07-01 12:00:00 5
6 A C 2013-07-02 09:00:00 2013-07-02 16:00:00 5
So, for example, the first row of this data frame says that the Seller "A" sells 50 units of the good to the buyer "B" every hour from midnight on 1/7/13 until 8am on 1/7/13. You can also notice that some of these exchanges between the same two parties can overlap, but just with a different negotiated quantity.
What I need to do (and need your help with) is to generate a sequence covering all hours over this two-day period that sums the total quantity exchanged in each hour between each seller-buyer pair over all negotiations.
Here is what the resulting data frame would be:
DateTimeSeq <- data.frame(seq(ISOdate(2013,7,1,0),by = "hour", length.out = 48))
colnames(DateTimeSeq) <- c("DateTime")
#What the Answer should be
DateTimeSeq$QtyAB <- c(70,70,70,70,70,70,70,70,70,30,30,30,30,30,30,30,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
DateTimeSeq$QtyAC <- c(0,0,0,0,0,0,25,25,30,30,5,5,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,5,5,5,5,5,5,5,0,0,0,0,0,0,0)
> DateTimeSeq
DateTime QtyAB QtyAC
1 2013-07-01 00:00:00 70 0
2 2013-07-01 01:00:00 70 0
3 2013-07-01 02:00:00 70 0
4 2013-07-01 03:00:00 70 0
5 2013-07-01 04:00:00 70 0
6 2013-07-01 05:00:00 70 0
7 2013-07-01 06:00:00 70 25
8 2013-07-01 07:00:00 70 25
9 2013-07-01 08:00:00 70 30
10 2013-07-01 09:00:00 30 30
11 2013-07-01 10:00:00 30 5
12 2013-07-01 11:00:00 30 5
13 2013-07-01 12:00:00 30 5
14 2013-07-01 13:00:00 30 0
15 2013-07-01 14:00:00 30 0
.... etc
Anybody able to lend a hand?
Thanks,
A
Here is my solution, which uses the dplyr and reshape packages.
library(dplyr)
library(reshape)
First, we expand the data frame so that everything is in an hourly format. This can be done using the do() function from dplyr.
df %>% rowwise() %>%
do(data.frame(Seller=.$Seller,
Buyer=.$Buyer,
Qty=.$Qty,
DateTimeCurr=seq(from=.$DateTimeFrom, to=.$DateTimeTo, by="hour")))
Output:
Source: local data frame [66 x 4]
Groups: <by row>
Seller Buyer Qty DateTimeCurr
1 A B 50 2013-07-01 00:00:00
2 A B 50 2013-07-01 01:00:00
3 A B 50 2013-07-01 02:00:00
...
From there it is trivial to create the correct ids and summarise the totals using the group_by function.
df1 <- df %>% rowwise() %>%
do(data.frame(Seller=.$Seller,
Buyer=.$Buyer,
Qty=.$Qty,
DateTimeCurr=seq(from=.$DateTimeFrom, to=.$DateTimeTo, by="hour"))) %>%
group_by(Seller, Buyer, DateTimeCurr) %>%
summarise(TotalQty=sum(Qty)) %>%
mutate(id=paste0("Qty", Seller, Buyer))
Output:
Source: local data frame [48 x 5]
Groups: Seller, Buyer
Seller Buyer DateTimeCurr TotalQty id
1 A B 2013-07-01 00:00:00 70 QtyAB
2 A B 2013-07-01 01:00:00 70 QtyAB
3 A B 2013-07-01 02:00:00 70 QtyAB
From this dataframe, all we have to do is cast it into the format you have above.
> cast(df1, DateTimeCurr~ id, value="TotalQty")
DateTimeCurr QtyAB QtyAC
1 2013-07-01 00:00:00 70 NA
2 2013-07-01 01:00:00 70 NA
3 2013-07-01 02:00:00 70 NA
4 2013-07-01 03:00:00 70 NA
5 2013-07-01 04:00:00 70 NA
6 2013-07-01 05:00:00 70 NA
So the whole piece of code is:
df1 <- df %>% rowwise() %>%
do(data.frame(Seller=.$Seller,
Buyer=.$Buyer,
Qty=.$Qty,
DateTimeCurr=seq(from=.$DateTimeFrom, to=.$DateTimeTo, by="hour"))) %>%
group_by(Seller, Buyer, DateTimeCurr) %>%
summarise(TotalQty=sum(Qty)) %>%
mutate(id=paste0("Qty", Seller, Buyer))
cast(df1, DateTimeCurr~ id, value="TotalQty")
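As a possible final step (a sketch beyond the answer above): the cast drops hours in which nothing was traded and shows NA rather than 0. Merging onto the full 48-hour sequence from the question and zero-filling should reproduce the expected output:
wide <- cast(df1, DateTimeCurr ~ id, value = "TotalQty")
out  <- merge(DateTimeSeq["DateTime"], wide,
              by.x = "DateTime", by.y = "DateTimeCurr", all.x = TRUE)
out[is.na(out)] <- 0   # hours with no exchange get quantity 0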
I have one file (location) that has x,y coordinates and a date/time identification. I want to get information from a second table (weather) that has a "similar" date/time variable and the co-variables (temperature and wind speed). The trick is that the date/times are not exactly the same in both tables. I want to select the weather data that is closest in time to the location data. I know I need to do some loops, and that's about it.
Example location:

x  y  date/time
1  3  01/02/2003 18:00
2  3  01/02/2003 19:00
3  4  01/03/2003 23:00
2  5  01/04/2003 02:00

Example weather:

date/time         temp  wind
01/01/2003 13:00    12    15
01/02/2003 16:34    10    16
01/02/2003 20:55    14    22
01/02/2003 21:33    14    22
01/03/2003 00:22    13    19
01/03/2003 14:55    12    12
01/03/2003 18:00    10    12
01/03/2003 23:44     2    33
01/04/2003 01:55     6    22
So the final output would be a table with the location data correctly matched to the "best" weather data:
x y datetime datetime temp wind
1 3 01/02/2003 18:00 ---- 01/02/2003 16:34 10 16
2 3 01/02/2003 19:00 ---- 01/02/2003 20:55 14 22
3 4 01/03/2003 23:00 ---- 01/03/2003 00:22 13 19
2 5 01/04/2003 02:00 ---- 01/04/2003 01:55 6 22
Any suggestions on where to start? I am trying to do this in R.
I needed to bring that data in as date and time separately and then paste and format:
location$dt.time <- as.POSIXct(paste(location$date, location$time),
format="%m/%d/%Y %H:%M")
And the same for weather
Then, for each value of dt.time in location, find the entry in weather that has the smallest absolute time difference:
sapply(location$dt.time, function(x) which.min(abs(difftime(x, weather$dt.time))))
# [1] 2 3 8 9
cbind(location, weather[ sapply(location$dt.time,
function(x) which.min(abs(difftime(x, weather$dt.time)))), ])
x y date time dt.time date time temp wind dt.time
2 1 3 01/02/2003 18:00 2003-01-02 18:00:00 01/02/2003 16:34 10 16 2003-01-02 16:34:00
3 2 3 01/02/2003 19:00 2003-01-02 19:00:00 01/02/2003 20:55 14 22 2003-01-02 20:55:00
8 3 4 01/03/2003 23:00 2003-01-03 23:00:00 01/03/2003 23:44 2 33 2003-01-03 23:44:00
9 2 5 01/04/2003 02:00 2003-01-04 02:00:00 01/04/2003 01:55 6 22 2003-01-04 01:55:00
cbind(location, weather[
sapply(location$dt.time,
function(x) which.min(abs(difftime(x, weather$dt.time)))), ])[ #pick columns
c(1,2,5,8,9,10)]
x y dt.time temp wind dt.time.1
2 1 3 2003-01-02 18:00:00 10 16 2003-01-02 16:34:00
3 2 3 2003-01-02 19:00:00 14 22 2003-01-02 20:55:00
8 3 4 2003-01-03 23:00:00 2 33 2003-01-03 23:44:00
9 2 5 2003-01-04 02:00:00 6 22 2003-01-04 01:55:00
My results look a bit different from yours, but another reader has already questioned whether your hand-matched example output is correct.
One fast and short way may be using data.table.
If you create two data.table's X and Y, both with keys, then the syntax is :
X[Y,roll=TRUE]
We call that a rolling join because we roll the prevailing observation in X forward to match the row in Y. See the examples in ?data.table and the introduction vignette.
Another way to do this is the zoo package, which has na.locf (last observation carried forward); other packages may offer similar functionality.
I'm not sure if you mean closest in terms of location, or time. If location, and that location is x,y coordinates then you will need some distance measure in 2D space I guess. data.table only does univariate 'closest' e.g. by time. Reading your question for a 2nd time it does seem you mean closest in the prevailing sense though.
EDIT: Seen the example data now. data.table won't do this in one step because although it can roll forwards or backwards, it won't roll to the nearest. You could do it with an extra step using which=TRUE and then test whether the one after the prevailing was actually closer.
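Note: more recent versions of data.table do support roll = "nearest" directly, which removes that extra step. A sketch, assuming the location and weather data frames carry the dt.time POSIXct column built in the earlier answer:
library(data.table)
loc <- as.data.table(location)
wx  <- as.data.table(weather)
# for each row of loc, take the wx row nearest in time
wx[loc, on = .(dt.time), roll = "nearest"]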