Join data objects by date but with different intervals - r

I have run into this issue and I really have no clue how to solve it. I have two data.frames, both with date columns. However, the first one, which is a big object, contains measurements every 3 seconds, while the second contains measurements every 10 minutes. I want to add the measurement variable of object 2 to object 1 (something like a left_join or merge) by the date variable. My data looks like this (df1):
date_time             measurement1
yyyy-mm-dd HH:MM:03   val1
yyyy-mm-dd HH:MM:06   val2
df2:
date_time             measurement2
yyyy-mm-dd HH:10:00   val1
yyyy-mm-dd HH:20:00   val2
I hope that is enough info, otherwise please comment. I have explored foverlaps and fuzzyjoin but without success.
Thank you in advance
Here is what I have in a bit more detail (df1):
date_time            measurement1
05/06/2018 0:00:03   73
05/06/2018 0:00:06   73.5
05/06/2018 0:00:09   48.5
05/06/2018 0:00:12   50.7
05/06/2018 0:00:15   80
05/06/2018 0:00:18   81
The data continue like this for several months, with one row every 3 seconds.
df2:
date_time            measurement2
05/06/2018 0:00:00   110
05/06/2018 0:10:00   120
05/06/2018 0:20:00   180
What I want is this:
df:
date_time            measurement1   measurement2
05/06/2018 0:00:03   73             110
05/06/2018 0:00:06   73.5           110
05/06/2018 0:00:09   48.5           110
05/06/2018 0:00:12   50.7           110
05/06/2018 0:00:15   80             110
05/06/2018 0:00:18   81             110
I hope that is clearer now.
Thank you

Observations every 3 seconds give 20 observations per minute, hence 200 observations per 10-minute interval. If your data are complete, it suffices to stretch out the 10-minute-interval observations accordingly, i.e. repeat every 10-minute value 200 times so that it lines up with the 3-second values.
Try the following and tell me what you get
df1$measurement2 <- rep(df2$measurement2, each = 200)
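Note that the rep() approach only works if nrow(df1) is exactly 200 * nrow(df2) and no 3-second rows are missing. If that cannot be guaranteed, a lookup by timestamp may be more forgiving. The following is a minimal base R sketch, not a tested solution for the real data: the sample values are made up to mirror the question, the timestamps are assumed to be already parsed as POSIXct in UTC, and each df1 row is assumed to take the most recent df2 value at or before its own time.

```r
# Hypothetical sample data shaped like the question's df1 and df2
start <- as.POSIXct("2018-06-05 00:00:00", tz = "UTC")
df1 <- data.frame(
  date_time    = start + seq(3, 1200, by = 3),  # one row every 3 seconds
  measurement1 = seq_len(400)
)
df2 <- data.frame(
  date_time    = start + c(0, 600, 1200),       # one row every 10 minutes
  measurement2 = c(110, 120, 180)
)

# For each df1 timestamp, find the position of the latest df2 timestamp
# that is not after it (df2$date_time must be sorted in increasing order)
idx <- findInterval(as.numeric(df1$date_time), as.numeric(df2$date_time))
idx[idx == 0] <- NA  # df1 rows that precede the first df2 row get NA

df1$measurement2 <- df2$measurement2[idx]
```

With this lookup a row stamped exactly 0:10:00 picks up the 0:10:00 value, and gaps in df1 do not shift later rows. For very large tables, a rolling join in data.table is another option worth exploring.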

Related

Calculating week numbers WITHOUT a yearwise reset (i.e. week_id = 55 is valid and shows it is a year after) + with a specified start date

This probably seems straightforward, but I am pretty stumped.
I have a set of dates ~ August 1 of each year and need to sum sales by week number. The earliest date is 2008-12-08 (YYYY-MM-DD). I need to create a "week_id" field where week #1 begins on 2008-12-08. And the date 2011-09-03 is week 142. Note that this is different since the calculation of week number does not reset every year.
I am putting up a small example dataset here:
data <- data.frame(
dates = c("2008-12-08", "2009-08-10", "2010-03-31", "2011-10-16", "2008-06-03", "2009-11-14" , "2010-05-05", "2011-09-03"))
# the column is named "dates"; use the full name rather than relying on partial matching of `$`
data$date = as.Date(data$dates)
Any help is appreciated
data$week_id = as.numeric(data$date - as.Date("2008-12-08")) %/% 7 + 1
This would take the day difference between the two dates and find the integer number of 7 days elapsed. I add one since we want the dates where zero weeks have elapsed since the start to be week 1 instead of week 0.
dates date week_id
1 2008-12-07 2008-12-07 0 # added for testing
2 2008-12-08 2008-12-08 1
3 2008-12-09 2008-12-09 1 # added for testing
4 2008-12-14 2008-12-14 1 # added for testing
5 2008-12-15 2008-12-15 2 # added for testing
6 2009-08-10 2009-08-10 36
7 2010-03-31 2010-03-31 69
8 2011-10-16 2011-10-16 149
9 2008-06-03 2008-06-03 -26
10 2009-11-14 2009-11-14 49
11 2010-05-05 2010-05-05 74
12 2011-09-03 2011-09-03 143
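Since the stated goal is to sum sales by week number, the same formula can be wrapped in a small helper and combined with aggregate(). This is only a sketch: the function name week_id(), the sales data frame and its amount column are made up for illustration and do not come from the question.

```r
# Week number counted from a fixed start date, without a yearly reset
# (hypothetical helper; week 1 begins on `start`)
week_id <- function(dates, start) {
  as.numeric(as.Date(dates) - as.Date(start)) %/% 7 + 1
}

# Made-up sales records for illustration
sales <- data.frame(
  date   = as.Date(c("2008-12-08", "2008-12-14", "2008-12-15", "2011-09-03")),
  amount = c(10, 5, 7, 3)
)

sales$week_id <- week_id(sales$date, "2008-12-08")

# Total sales per week_id
weekly <- aggregate(amount ~ week_id, data = sales, FUN = sum)
```

Dates before the start date produce week numbers of 0 or below, as in the test rows above, so you may want to filter them out before aggregating.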

Time difference between rows of a dataframe

I have been searching the R part of StackOverflow for quite a while looking for a proper answer, but nothing that I saw seems to apply to my problem.
I have a dataset of this format ( I have adapted it for what seems to be the easiest way to work with, but the stop_sequence values are normally just incremental numbers for each stop) :
route_short_name trip_id direction_id departure_time stop_sequence
33A 1.1598.0-33A-b12-1.451.I 1 16:15:00 start
33A 1.1598.0-33A-b12-1.451.I 1 16:57:00 end
41C 10.3265.0-41C-b12-1.277.I 1 08:35:00 start
41C 10.3265.0-41C-b12-1.277.I 1 09:26:00 end
41C 100.3260.0-41C-b12-1.276.I 1 09:40:00 start
41C 100.3260.0-41C-b12-1.276.I 1 10:53:00 end
114 1000.987.0-114-b12-1.86.O 0 21:35:00 start
114 1000.987.0-114-b12-1.86.O 0 22:02:00 end
39 10000.2877.0-39-b12-1.242.I 1 11:15:00 start
39 10000.2877.0-39-b12-1.242.I 1 12:30:00 end
It is basically a bus trips dataset. All I want is to manage to get the duration of each trip, so something like that:
route_short_name trip_id direction_id duration
33A 1.1598.0-33A-b12-1.451.I 1 42
41C 10.3265.0-41C-b12-1.277.I 1 51
41C 100.3260.0-41C-b12-1.276.I 1 73
114 1000.987.0-114-b12-1.86.O 0 27
39 10000.2877.0-39-b12-1.242.I 1 75
I have tried a lot of things, but in no case have I managed to group the data by trip_id and then work on the two values of each group. I must have misunderstood something, but I do not know what.
Does anyone have a clue?
We can also do this without converting to 'wide' format (assuming that the 'stop_sequence' is 'start' followed by 'end' for each 'route_short_name', 'trip_id', and 'direction_id').
Convert 'departure_time' to a datetime column, group by 'route_short_name', 'trip_id', and 'direction_id', and take the difftime between the last and the first 'departure_time' in each group.
library(dplyr)
df1 %>%
  mutate(departure_time = as.POSIXct(departure_time, format = '%H:%M:%S')) %>%
  group_by(route_short_name, trip_id, direction_id) %>%
  # difftime's argument is `units`; 'mins' returns the difference in minutes
  summarise(duration = as.numeric(difftime(last(departure_time), first(departure_time), units = 'mins')))
# A tibble: 5 x 4
# Groups: route_short_name, trip_id [?]
# route_short_name trip_id direction_id duration
# <chr> <chr> <int> <dbl>
#1 114 1000.987.0-114-b12-1.86.O 0 27
#2 33A 1.1598.0-33A-b12-1.451.I 1 42
#3 39 10000.2877.0-39-b12-1.242.I 1 75
#4 41C 10.3265.0-41C-b12-1.277.I 1 51
#5 41C 100.3260.0-41C-b12-1.276.I 1 73
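The question mentions that stop_sequence is normally an incremental number rather than 'start'/'end'. In that case the duration can be taken as the span between the earliest and latest departure of each trip. The following base R sketch uses made-up trip ids and times, and it assumes that no trip crosses midnight (otherwise the times would need a date attached).

```r
# Hypothetical trips with numeric stop sequences
trips <- data.frame(
  trip_id        = c("A", "A", "A", "B", "B"),
  departure_time = c("16:15:00", "16:40:00", "16:57:00", "08:35:00", "09:26:00"),
  stop_sequence  = c(1, 2, 3, 1, 2),
  stringsAsFactors = FALSE
)

# Parse the clock times; the date part defaults to today and cancels out
trips$t <- as.POSIXct(trips$departure_time, format = "%H:%M:%S", tz = "UTC")

# Duration in minutes per trip: latest minus earliest departure
duration <- tapply(as.numeric(trips$t), trips$trip_id,
                   function(x) (max(x) - min(x)) / 60)
```

Here duration is a named vector indexed by trip_id; intermediate stops are simply ignored.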
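Try this. Right now your data frame is in "long" format, but the time difference is easier to calculate in "wide" format. The spread function from tidyr (loaded as part of the tidyverse) takes your data from long to wide. From there you can use mutate to add the new column you want. Passing units = "mins" to difftime keeps the difference in minutes; with the default units = "auto" the unit is chosen from the data and is not guaranteed to be minutes. The code assumes departure_time is stored as character, so it is converted to a datetime first.
library(tidyverse)
wide_df <-
  your_df %>%
  # assumption: departure_time is a character column such as "16:15:00"
  mutate(departure_time = as.POSIXct(departure_time, format = "%H:%M:%S")) %>%
  spread(key = stop_sequence, value = departure_time) %>%
  mutate(timediff = as.numeric(difftime(end, start, units = "mins")))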
If you want to learn more about "tidy" data (and spreading and gathering), see this link to Hadley's book

how to group sales data by 4 days from yesterday to start date in r?

Date Sales
3/11/2017 1
3/12/2017 0
3/13/2017 40
3/14/2017 47
3/15/2017 83
3/16/2017 62
3/17/2017 13
3/18/2017 58
3/19/2017 27
3/20/2017 17
3/21/2017 71
3/22/2017 76
3/23/2017 8
3/24/2017 13
3/25/2017 97
3/26/2017 58
3/27/2017 80
3/28/2017 77
3/29/2017 31
3/30/2017 78
3/31/2017 0
4/1/2017 40
4/2/2017 58
4/3/2017 32
4/4/2017 31
4/5/2017 90
4/6/2017 35
4/7/2017 88
4/8/2017 16
4/9/2017 72
4/10/2017 39
4/11/2017 8
4/12/2017 88
4/13/2017 93
4/14/2017 57
4/15/2017 23
4/16/2017 15
4/17/2017 6
4/18/2017 91
4/19/2017 87
4/20/2017 44
Here the current date is 20/04/2017. How can I group the data from 19/04/2017 back to 11/03/2017 into 4 equal parts and sum the sales for each part in R?
Eg :
library("xts")
ep <- endpoints(data, on = 'days', k = 4)
period.apply(data,ep,sum)
It is not working: it groups from the start date up to the current date, but I need to gather the data from yesterday (19/4/2017) back to the start date and split it into 4 equal parts.
Could anyone kindly guide me?
Thank you
Base R has a function cut.Date() which is built for the purpose.
However, the question is not fully clear on what the OP intends. My understanding of the requirements supplied in Q and additional comment is:
Take the sales data per day in Book1 but leave out the current day, i.e., use only completed days.
Group the data in four equal parts, i.e., four periods containing an equal number of days. (Note that the title of the Q and the attempt to use xts::endpoints() with k = 4 indicate that the OP might have a different intention: to group the data in periods of four days length each.)
Summarize the sales figures by period
For the sake of brevity, data.table is used here for data manipulation and aggregation, lubridate for date manipulation
library(data.table)
library(lubridate)
# coerce to data.table, convert Date column from character to class Date,
# exclude the actual date
temp <- setDT(Book1)[, Date := mdy(Book1$Date)][Date != today()]
# cut the date range in four parts
temp[, start_date_of_period := cut.Date(Date, 4)]
temp
# Date Sales start_date_of_period
# 1: 2017-03-11 1 2017-03-11
# 2: 2017-03-12 0 2017-03-11
# 3: 2017-03-13 40 2017-03-11
# ...
#38: 2017-04-17 6 2017-04-10
#39: 2017-04-18 91 2017-04-10
#40: 2017-04-19 87 2017-04-10
# Date Sales start_date_of_period
# aggregate sales by period
temp[, .(n_days = .N, total_sales = sum(Sales)), by = start_date_of_period]
# start_date_of_period n_days total_sales
#1: 2017-03-11 10 348
#2: 2017-03-21 10 589
#3: 2017-03-31 10 462
#4: 2017-04-10 10 507
Thanks to chaining, this can be put together in one statement without using a temporary variable:
setDT(Book1)[, Date := mdy(Book1$Date)][Date != today()][
, start_date_of_period := cut.Date(Date, 4)][
, .(n_days = .N, total_sales = sum(Sales)), by = start_date_of_period]
Note: If you want to reproduce the result in the future, you will have to replace the call to today(), which excludes the current day, by mdy("4/20/2017"), which is the last day in the sample data set supplied by the OP.
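If the other reading of the question is the intended one (periods of four days each, counted backwards from yesterday), integer division on the number of days before yesterday gives the grouping. This is a base R sketch of that alternative interpretation, not the approach used above; the data are the OP's sample, "yesterday" is hard-coded to 2017-04-19 for reproducibility, and the names sales, days_back and period are made up.

```r
# The OP's sample data: 41 consecutive days starting 2017-03-11
sales <- data.frame(
  Date  = as.Date("2017-03-11") + 0:40,
  Sales = c(1, 0, 40, 47, 83, 62, 13, 58, 27, 17,
            71, 76, 8, 13, 97, 58, 80, 77, 31, 78,
            0, 40, 58, 32, 31, 90, 35, 88, 16, 72,
            39, 8, 88, 93, 57, 23, 15, 6, 91, 87, 44)
)

yesterday <- as.Date("2017-04-19")  # in live use: Sys.Date() - 1

# Keep completed days only
completed <- sales[sales$Date <= yesterday, ]

# Period 0 covers yesterday and the three days before it, period 1 the
# four days before that, and so on back to the start date
days_back <- as.numeric(yesterday - completed$Date)
completed$period <- days_back %/% 4

result <- aggregate(Sales ~ period, data = completed, FUN = sum)
```

With 40 completed days this yields ten four-day periods; if the number of days were not a multiple of four, the oldest period would simply contain fewer days.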

Calendaring Monthly Usages for each Date

I have a data set with a start date, an end date and the usage for each record. I have calculated the number of days between the two dates and derived the daily usage. (I am okay with one flat usage per day for now.)
Now, what I want to achieve is the sum of the usage for each day of June across those time frames. For example, the first case will contribute just its DAILY_USAGE to June 1st:
START_DATE END_DATE x DAYS DAILY_USAGE
1 2015-05-01 2015-06-01 261605.00 32 8175.156250
For the second case, I want to add the usage 3905 to June 1st and also to June 2nd, because its time frame spans both June 1st and June 2nd.
2015-05-04 2015-06-02 117159.00 30 3905.3000000
I want to continue doing this for all 387 rows and at the end get the sum of usages for each day, but I do not know how to do this for hundreds of records.
This is what my data set looks like right now:
str(YYY)
'data.frame': 387 obs. of 5 variables:
$ START_DATE : Date, format: "2015-05-01" "2015-05-04" "2015-05-11" "2015-05-13" ...
$ END_DATE : Date, format: "2015-06-01" "2015-06-01" "2015-06-01" "2015-06-01" ...
$ x : num 261605 1380796 183 103 489 ...
$ DAYS : num 32 29 22 20 19 12 1 34 30 29 ...
$ DAILY_USAGE: num 8175.16 47613.66 8.32 5.13 25.74 ...
Also, the first rows:
START_DATE END_DATE x DAYS DAILY_USAGE
1 2015-05-01 2015-06-01 261605.00 32 8175.1562500
2 2015-05-04 2015-06-01 1380796.00 29 47613.6551724
6 2015-05-21 2015-06-01 1392.00 12 116.0000000
7 2015-06-01 2015-06-01 2503.00 1 2503.0000000
8 2015-04-30 2015-06-02 0.00 34 0.0000000
9 2015-05-04 2015-06-02 117159.00 30 3905.3000000
10 2015-05-05 2015-06-02 193334.00 29 6666.6896552
13 2015-05-04 2015-06-03 630.00 31 20.3225806
and so on........
Example of data sets and Results
I will call this data set EXAMPLE1 (mocked-up data covering 3 days).
START_DATE END_DATE x DAYS DAILY_USAGE
5/1/2015 6/1/2015 261605 32 8175.15625
5/4/2015 6/1/2015 1380796 29 47613.65517
5/11/2015 6/1/2015 183 22 8.318181818
4/30/2015 6/2/2015 0 34 0
5/20/2015 6/2/2015 70 14 5
6/1/2015 6/2/2015 569 2 284.5
6/1/2015 6/3/2015 582 3 194
6/2/2015 6/3/2015 6 2 3
For the above examples, answer should be like this
DAY USAGE
6/1/2015 56280.6296
6/2/2015 486.5
6/3/2015 197
HOW?
In Example 1, for June 1st, I have added the usages of all rows except the last one, because the last row's time frame does not include 06/01. It starts on 06/02 and ends on 06/03.
To get June 2nd, I have added all the usages from rows 4 to 8, because June 2nd lies between all of those start and end dates.
For June 3rd, I have only added the last two rows to get 197.
So which rows to sum depends on the time frame between START_DATE and END_DATE.
Hope this helps!
There must be an easier way to do this than writing 400 lines of if/else statements.
Thank you again for your time!!
-Gyve
library(lubridate)
indx <- lapply(unique(mdy(df[,2])), '%within%', interval(mdy(df[,1]), mdy(df[,2])))
cbind.data.frame(DAY=unique(df$END_DATE),
USAGE=unlist(lapply(indx, function(x) sum(df$DAILY_USAGE[x]))))
# DAY USAGE
# 1 6/1/2015 56280.63
# 2 6/2/2015 486.50
# 3 6/3/2015 197.00
Explanation
We can expand it to explain what is happening:
indx <- lapply(unique(mdy(df[,2])), '%within%', interval(mdy(df[,1]), mdy(df[,2])))
Each unique end date is tested for whether it falls within the range spanned by the dates in the first and second columns. mdy is a quick way to parse month/day/year strings into Date objects with lubridate. The operator %within% tests a date against an interval. We created the intervals with interval(mdy(df[,1]), mdy(df[,2])). The result is a list of logical indexes, one per unique end date, that we can subset the data by.
In our final data frame,
cbind.data.frame(DAY=unique(df$END_DATE),
creates the first column of dates.
And,
USAGE=unlist(lapply(indx, function(x) sum(df$DAILY_USAGE[x])))
takes the sum of df$DAILY_USAGE by the index that we created.
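The same interval test can be written in base R with plain date comparisons, which may be easier to adapt. The sketch below uses the EXAMPLE1 rows from the question with the dates already converted to class Date; the names usage, days and totals are made up. Unlike the code above, it loops over every calendar day in a chosen range rather than over the unique end dates, which is an assumption about what the final report should contain.

```r
# EXAMPLE1 from the question, with dates as class Date
usage <- data.frame(
  START_DATE  = as.Date(c("2015-05-01", "2015-05-04", "2015-05-11", "2015-04-30",
                          "2015-05-20", "2015-06-01", "2015-06-01", "2015-06-02")),
  END_DATE    = as.Date(c("2015-06-01", "2015-06-01", "2015-06-01", "2015-06-02",
                          "2015-06-02", "2015-06-02", "2015-06-03", "2015-06-03")),
  DAILY_USAGE = c(8175.15625, 47613.65517, 8.318181818, 0, 5, 284.5, 194, 3)
)

# Days to report on
days <- seq(as.Date("2015-06-01"), as.Date("2015-06-03"), by = "day")

# For each day, sum the daily usage of every row whose time frame covers it
totals <- data.frame(
  DAY   = days,
  USAGE = sapply(seq_along(days), function(i) {
    covers <- usage$START_DATE <= days[i] & usage$END_DATE >= days[i]
    sum(usage$DAILY_USAGE[covers])
  })
)
```

To cover the whole month, extend the end of the seq() call to "2015-06-30"; days that no row covers get a usage of 0.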

how to concatenate 2 columns in the same data set in R

I have a data set in the following order:
Date Time Open High Low Close Volume NumberOfTrades BidVolume AskVolume
1 2011/12/22 02:00:00 5805.5 5820.5 5804.0 5820.5 253 96 161 71
2 2011/12/22 02:01:00 5819.0 5820.0 5813.0 5817.0 77 57 43 23
3 2011/12/22 02:02:00 5816.5 5820.0 5816.0 5819.0 30 22 9 14
I need to add a column before column A (Date) that will be A+B ("Date" "Time"), and then I will be able to convert my data set to an XTS object (XTS needs a unique key).
The final result will be something like:
DateTime Date Time Open High Low Close Volume NumberOfTrades BidVolume AskVolume
1 2011/12/22 02:00:00 2011/12/22 02:00:00 5805.5 5820.5 5804.0 5820.5 253 96 161 71
Thanks
Use paste to combine the Date and Time columns and as.POSIXct to convert the string to date-time class.
Assuming your data frame is called df:
df$DateTime = as.POSIXct(paste(df$Date, df$Time))
After you've added DateTime to your data frame, per @RichardScriven's comment, you can rearrange the order of the columns as follows:
df = df[ , c(length(df), 1:(length(df)-1))]
Or, you can add DateTime as the first column as follows:
df = data.frame(DateTime=as.POSIXct(paste(df$Date, df$Time)), df)
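If you prefer not to rely on as.POSIXct guessing the string layout or using the session's time zone, the format and time zone can be spelled out. This is a small sketch on a cut-down copy of the question's rows; the data frame name prices and the choice of "UTC" are assumptions, so substitute the exchange's actual time zone.

```r
# Cut-down copy of the question's data (only a few columns kept)
prices <- data.frame(
  Date  = c("2011/12/22", "2011/12/22", "2011/12/22"),
  Time  = c("02:00:00", "02:01:00", "02:02:00"),
  Close = c(5820.5, 5817.0, 5819.0),
  stringsAsFactors = FALSE
)

# Combine and parse with an explicit format and time zone (UTC is assumed)
prices$DateTime <- as.POSIXct(paste(prices$Date, prices$Time),
                              format = "%Y/%m/%d %H:%M:%S", tz = "UTC")

# Move DateTime to the front by name rather than by position
prices <- prices[, c("DateTime", setdiff(names(prices), "DateTime"))]

# The index should be unique before it is used as a time-series key
stopifnot(anyDuplicated(prices$DateTime) == 0)
```

Selecting columns by name keeps the reordering correct even if more columns are added later.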
