I have been zoning in the R part of StackOverflow for quite a while looking for a proper answer but nothing that what saw seems to apply to my problem.
I have a dataset of this format ( I have adapted it for what seems to be the easiest way to work with, but the stop_sequence values are normally just incremental numbers for each stop) :
route_short_name trip_id direction_id departure_time stop_sequence
33A 1.1598.0-33A-b12-1.451.I 1 16:15:00 start
33A 1.1598.0-33A-b12-1.451.I 1 16:57:00 end
41C 10.3265.0-41C-b12-1.277.I 1 08:35:00 start
41C 10.3265.0-41C-b12-1.277.I 1 09:26:00 end
41C 100.3260.0-41C-b12-1.276.I 1 09:40:00 start
41C 100.3260.0-41C-b12-1.276.I 1 10:53:00 end
114 1000.987.0-114-b12-1.86.O 0 21:35:00 start
114 1000.987.0-114-b12-1.86.O 0 22:02:00 end
39 10000.2877.0-39-b12-1.242.I 1 11:15:00 start
39 10000.2877.0-39-b12-1.242.I 1 12:30:00 end
It is basically a bus trips dataset. All I want is to manage to get the duration of each trip, so something like that:
route_short_name trip_id direction_id duration
33A 1.1598.0-33A-b12-1.451.I 1 42
41C 10.3265.0-41C-b12-1.277.I 1 51
41C 100.3260.0-41C-b12-1.276.I 1 73
114 1000.987.0-114-b12-1.86.O 0 27
39 10000.2877.0-39-b12-1.242.I 1 75
I have tried a lot of things, but in no case have I managed to group the data by trip_id and then working on the two values at each time. I must have misunderstood something, but I do not know what.
Does anyone have a clue?
We can also do this without converting to 'wide' format (assuming that the 'stop_sequence' is 'start' followed by 'end' for each 'route_short_name', 'trip_id', and 'direction_id'.
Convert the 'departure_time' to a datetime column, grouped by 'route_short_name', 'trip_id', and 'direction_id', get the difftime of the last 'departure_time' with that of the 'first' 'departure_time'
df1 %>%
mutate(departure_time = as.POSIXct(departure_time, format = '%H:%M:%S')) %>%
group_by(route_short_name, trip_id, direction_id) %>%
summarise(duration = as.numeric(difftime(last(departure_time), first(departure_time), unit = 'min')))
# A tibble: 5 x 4
# Groups: route_short_name, trip_id [?]
# route_short_name trip_id direction_id duration
# <chr> <chr> <int> <dbl>
#1 114 1000.987.0-114-b12-1.86.O 0 27
#2 33A 1.1598.0-33A-b12-1.451.I 1 42
#3 39 10000.2877.0-39-b12-1.242.I 1 75
#4 41C 10.3265.0-41C-b12-1.277.I 1 51
#5 41C 100.3260.0-41C-b12-1.276.I 1 73
Try this. Right now you have your dataframe in "long" format, but it would be nice to have it in "wide" format to calculate the time difference. Using the spread function in the tidyverse package will take your data from long to wide. From there you can use the mutate function to add the new column you want. as.numeric(difftime(end,start)) will keep the difference unit in minutes.
library(tidyverse)
wide_df <-
spread(your_df,key = stop_sequence, value = departure_time) %>%
mutate(timediff = as.numeric(difftime(end,start)))
If you want to learn more about "tidy" data (and spreading and gathering), see this link to Hadley's book
Related
This probably seems straightforward, but I am pretty stumped.
I have a set of dates ~ August 1 of each year and need to sum sales by week number. The earliest date is 2008-12-08 (YYYY-MM-DD). I need to create a "week_id" field where week #1 begins on 2008-12-08. And the date 2011-09-03 is week 142. Note that this is different since the calculation of week number does not reset every year.
I am putting up a small example dataset here:
data <- data.frame(
dates = c("2008-12-08", "2009-08-10", "2010-03-31", "2011-10-16", "2008-06-03", "2009-11-14" , "2010-05-05", "2011-09-03"))
data$date = as.Date(data$date)
Any help is appreciated
data$week_id = as.numeric(data$date - as.Date("2008-12-08")) %/% 7 + 1
This would take the day difference between the two dates and find the integer number of 7 days elapsed. I add one since we want the dates where zero weeks have elapsed since the start to be week 1 instead of week 0.
dates date week_id
1 2008-12-07 2008-12-07 0 # added for testing
2 2008-12-08 2008-12-08 1
3 2008-12-09 2008-12-09 1 # added for testing
4 2008-12-14 2008-12-14 1 # added for testing
5 2008-12-15 2008-12-15 2 # added for testing
6 2009-08-10 2009-08-10 36
7 2010-03-31 2010-03-31 69
8 2011-10-16 2011-10-16 149
9 2008-06-03 2008-06-03 -26
10 2009-11-14 2009-11-14 49
11 2010-05-05 2010-05-05 74
12 2011-09-03 2011-09-03 143
I have a big data frame with dates and i need to check for the first date in a continuous way, as follows:
ID ID_2 END BEG
1 55 2017-06-30 2016-01-01
1 55 2015-12-31 2015-11-12 --> Gap (required date)
1 88 2008-07-26 2003-02-24
2 19 2014-09-30 2013-05-01
2 33 2013-04-30 2011-01-01 --> Not Gap (overlapping)
2 19 2012-12-31 2011-01-01
2 33 2010-12-31 2008-01-01
2 19 2007-12-31 2006-01-01
2 19 2005-12-31 1980-10-20 --> No actual Gap(required date)
As shown, not all the dates have overlapping and i need to return by ID (not ID_2) the date when the first gap (going backwards in time) appears. I've tried using for but it's extremely slow (dataframe has 150k rows). I've been messing around with dplyr and mutate as follows:
df <- df%>%
group_by(ID)%>%
mutate(END_lead = lead(END))
df$FLAG <- df$BEG - days(1) == df$END_lead
df <- df%>%
group_by(ID)%>%
filter(cumsum(cumsum(FLAG == FALSE))<=1)
But this set of instructions stops at the first overlapping, filtering the wrong date. I've tried anything i could think of, ordering in decreasing or ascending order, and using min and max but could not figure out a solution.
The actual result wanted would be:
ID ID_2 END BEG
1 55 2015-12-31 2015-11-12
2 19 2008-07-26 1980-10-20
Is there a way of doing this using dplyr,tidyr and lubridate?
A possible solution using dplyr:
library(dplyr)
df %>%
mutate_at(vars(END, BEG), funs(as.Date)) %>%
group_by(ID) %>%
slice(which.max(BEG > ( lead(END) + 1 ) | is.na(BEG > ( lead(END) + 1 ))))
With your last data, it gives:
# A tibble: 2 x 4
# Groups: ID [2]
ID ID_2 END BEG
<int> <int> <date> <date>
1 1 55 2015-12-31 2015-11-12
2 2 19 2005-12-31 1980-10-20
What the solution does is basically:
Changes the dates to Date format (no need for lubridate);
Groups by ID;
Selects the highest row that satisfies your criteria, i.e. the highest row which is either a gap (TRUE), or if there is no gap it is the first row (meaning it has a missing value when checking for a gap, this is why is.na(BEG > ( lead(END) + 1 ))).
I would use xts package, first creating xts objects for each ID you have, than use first() and last() function on each objects.
https://www.datacamp.com/community/blog/r-xts-cheat-sheet
I have an R dataframe df_demand with a date column (depdate) and a dependent variable column bookings. The duration is 365 days starting from 2017-11-02 and ending at 2018-11-01, sorted in ascending order.
We have booking data for only 279 days in the year.
dplyr::arrange(df_demand, depdate)
depdate bookings
1 2017-11-02 43
2 2017-11-03 27
3 2017-11-05 27
4 2017-11-06 22
5 2017-11-07 39
6 2017-11-08 48
.
.
279 2018-11-01 60
I want to introduce another column day_of_year in the following way:
depdate day_of_year bookings
1 2017-11-02 1 43
2 2017-11-03 2 27
3 2017-11-04 3 NA
4 2017-11-05 4 27
.
.
.
365 2018-11-01 365 60
I am trying to find the best possible way to do this.
In Python, I could use something like :
df_demand['day_of_year'] = df_demand['depdate'].sub(df_demand['depdate'].iat[0]).dt.days + 1
I wanted to know about an R equivalent of the same.
When I run
typeof(df_demand_2$depdate)
the output is
"double"
Am I missing something?
You can create a row for every date using the complete function from the tidyr package.
First, I'm creating a data frame with some sample data:
df <- data.frame(
depdate = as.Date(c('2017-11-02', '2017-11-03', '2017-11-05')),
bookings = c(43, 27, 27)
)
Next, I'm performing two operations. First, using tidyr::complete, I'm specifying all the dates I want in my analysis. I can do that using seq.Date, creating a sequence from the first to the last day.
Once that is done, the day_of_year column is simply equal to the row number.
df_complete <- tidyr::complete(df,
depdate = seq.Date(from = min(df$depdate), to = max(df$depdate), by = 1)
)
df_complete$day_of_year <- 1:nrow(df_complete)
> df_complete
#> # A tibble: 4 x 3
#> depdate bookings day_of_year
#> <date> <dbl> <int>
#> 1 2017-11-02 43 1
#> 2 2017-11-03 27 2
#> 3 2017-11-04 NA 3
#> 4 2017-11-05 27 4
An equivalent solution with the pipe operator from dplyr:
df %>%
complete(depdate = seq.Date(from = min(df$depdate), to = max(df$depdate), by = 1)) %>%
mutate(days_of_year = row_number())
I have a data set containing data for about 4.5 years. I'm trying to create two different data frames from this, for what I will call holiday and non-holiday periods. There are multiple periods per year, and these periods will repeat over multiple years.
For example, I'd like to choose a time period between Thanksgiving and New Year's Day, as well as periods prior to Valentine's Day and Mother's Day for each year, and make this my holiday data frame. Everything else would be non-holiday.
I apologize if this has been asked before, I just can't find it. I found a similar question for SQL, but I'm trying to figure out how to do this in R.
I've tried filtering and selecting, to no avail.
wine.holiday <- wine.sub2 %>%
select(total, cdate) %>%
subset(cdate>=2011-11-25, cdate<=2011-12-31)
wine.holiday
Source: local data frame [27,628 x 3]
Groups: clubgroup_id.x [112]
clubgroup_id.x total cdate
(chr) (dbl) (date)
1 1 45 2011-10-04
2 1 45 2011-10-08
3 1 45 2011-10-09
4 1 45 2011-10-09
5 1 45 2011-10-11
6 1 45 2011-10-15
7 1 45 2011-10-24
8 1 90 2011-11-13
9 1 45 2011-11-18
10 1 45 2011-11-26
.. ... ... ...
Clearly something isn't right, because not only is it not limiting the date range, but it's including a column in the data frame that I'm not even selecting.
As mentioned in the comments, dplyr uses filter not subset. Just a simple change to the code you've got (therefore not a complete solution to your issue, but hopefully helps) should get the subset working.
wine.holiday <- wine.sub2 %>%
select(total, cdate)
wine.holiday <- subset(wine.holiday, cdate>=as.Date("2011-11-25") & cdate<=as.Date("2011-12-31"))
wine.holiday
Or, to stick with dplyr piping:
wine.holiday <- wine.sub2 %>%
select(total, cdate) %>%
filter( cdate>=as.Date("2011-11-25") & cdate<=as.Date("2011-12-31") )
wine.holiday
EDIT to add: If the dplyr select isn't working (it looks fine to me), you could try this:
wine.holiday <- subset( wine.sub2, select = c( total, cdate ) )
wine.holiday <- subset(wine.holiday, cdate>=as.Date("2011-11-25") & cdate<=as.Date("2011-12-31"))
wine.holiday
You could, of course, combine those two lines into one. This makes it harder to read, but would probably improve the processing efficiency:
wine.holiday <- subset(wine.sub2, cdate>=as.Date("2011-11-25") & cdate<=as.Date("2011-12-31"), select=c(total,cdate) )
I figured out another method for this through looking through SO posts (took a while).
> library(dateTime)
> wine.holiday <- data.table(start = c(as.Date(USThanksgivingDay(2010:2020))),
+ end = as.Date(USNewYearsDay(2011:2021))-1)
> wine.holiday
start end
1: 2010-11-25 2010-12-31
2: 2011-11-24 2011-12-31
3: 2012-11-22 2012-12-31
4: 2013-11-28 2013-12-31
5: 2014-11-27 2014-12-31
6: 2015-11-26 2015-12-31
7: 2016-11-24 2016-12-31
8: 2017-11-23 2017-12-31
9: 2018-11-22 2018-12-31
10: 2019-11-28 2019-12-31
11: 2020-11-26 2020-12-31
I still need to figure out how to add other ranges (e.g. two weeks before Valentine's Day or Mother's Day) to this, and will update this answer if/when I figure it out.
I start with a data frame titled 'dat' in R that looks like the following:
datetime lat long id extra step
1 8/9/2014 13:00 31.34767 -81.39117 36 1 31.38946
2 8/9/2014 17:00 31.34767 -81.39150 36 1 11155.67502
3 8/9/2014 23:00 31.30683 -81.28433 36 1 206.33342
4 8/10/2014 5:00 31.30867 -81.28400 36 1 11152.88177
What I need to do is find out what days have less than 3 entries and remove all entries associated with those days from the original data.
I initially did this by the following:
library(plyr)
datetime<-dat$datetime
###strip the time down to only have the date no hh:mm:ss
date<- strptime(datetime, format = "%m/%d/%Y")
### bind the date to the old data
dat2<-cbind(date, dat)
### count using just the date so you can ID which days have fewer than 3 points
datecount<- count(dat2, "date")
datecount<- subset(datecount, datecount$freq < 3)
This end up producing the following:
row.names date freq
1 49 2014-09-26 1
2 50 2014-09-27 2
3 135 2014-12-21 2
Which is great, but I cannot figure out how to remove the entries from these days with less than three entries from the original 'dat' because this is a compressed version of the original data frame.
So to try and deal with this I have come up with another way of looking at the problem. I will use the strptime and cbind from above:
datetime<-dat$datetime
###strip the time down to only have the date no hh:mm:ss
date<- strptime(datetime, format = "%m/%d/%Y")
### bind the date to the old data
dat2<-cbind(date, dat)
And I will utilize the column titled "extra". I would like to create a new column which is the result of summing the values in this "extra" column by the simplified strptime dates. But find a way to apply this new value to all entries from that date, like the following:
date datetime lat long id extra extra_sum
1 2014-08-09 8/9/2014 13:00 31.34767 -81.39117 36 1 3
2 2014-08-09 8/9/2014 17:00 31.34767 -81.39150 36 1 3
3 2014-08-09 8/9/2014 23:00 31.30683 -81.28433 36 1 3
4 2014-08-10 8/10/2014 5:00 31.30867 -81.28400 36 1 4
5 2014-08-10 8/10/2014 13:00 31.34533 -81.39317 36 1 4
6 2014-08-10 8/10/2014 17:00 31.34517 -81.39317 36 1 4
7 2014-08-10 8/10/2014 23:00 31.34483 -81.39283 36 1 4
8 2014-08-11 8/11/2014 5:00 31.30600 -81.28317 36 1 2
9 2014-08-11 8/11/2014 13:00 31.34433 -81.39300 36 1 2
The code that creates the "extra_sum" column is what I am struggling with.
After creating this I can simply subset my data to all entries that have a value >2. Any help figuring out how to use my initial methodology or this new one to remove days with fewer than 3 entries from my initial data set would be much appreciated!
The plyr way.
library(plyr)
datetime <- dat$datetime
###strip the time down to only have the date no hh:mm:ss
date <- strptime(datetime, format = "%m/%d/%Y")
### bind the date to the old data
dat2 <-cbind(date, dat)
dat3 <- ddply(dat2, .(date), function(df){
if (nrow(df)>=3) {
return(df)
} else {
return(NULL)
}
})
I recommend using the data.table package
library(data.table)
dat<-data.table(dat)
dat$Date<-as.Date(as.character(dat$datetime), format = "%m/%d/%Y")
dat_sum<-dat[, .N, by = Date ]
dat_3plus<-dat_sum[N>=3]
dat<-dat[Date%in%dat_3plus$Date]