I'm using R and RStudio to analyse GTFS public transport feeds and to create timetable range plots using ggplot2. The code currently works fine but is quite slow, which is problematic when working with very big CSVs as is often the case here.
The slowest part of the code is as follows (with some context): a for loop that iterates through the data frame and subsets each unique trip into a temporary data frame from which the extreme arrival and departure values (first & last rows) are extracted:
# Creates an empty df to contain trip_id, trip start and trip end times
Trip_Times <- data.frame(Trip_ID = character(), Departure = character(), Arrival = character(), stringsAsFactors = FALSE)
# Creates a vector containing all trips of the analysed day
unique_trips = unique(stop_times$trip_id)
# Iterates through stop_times for each unique trip_id and populates previously created data frame
for (i in seq(from = 1, to = length(unique_trips), by = 1)) {
temp_df <- subset(stop_times, trip_id == unique_trips[i])
Trip_Times[nrow(Trip_Times) + 1, ] <- c(temp_df$trip_id[[1]], temp_df$departure_time[[1]], temp_df$arrival_time[[nrow(temp_df)]])
}
The stop_times df looks as follows with some feeds containing over 2.5 million lines giving around 200k unique trips, hence 200k loop iterations...
head(stop_times)
trip_id arrival_time departure_time stop_sequence
1 011_0840101_A14 7:15:00 7:15:00 1
2 011_0840101_A14 7:16:00 7:16:00 2
3 011_0840101_A14 7:17:00 7:17:00 3
4 011_0840101_A14 7:18:00 7:18:00 4
5 011_0840101_A14 7:19:00 7:19:00 5
6 011_0840101_A14 7:20:00 7:20:00 6
Would anyone be able to advise me on how to optimise this code to obtain faster results? I don't believe apply can be used here, but I may well be wrong.
This should be straightforward with dplyr...
library(dplyr)
Trip_Times <- stop_times %>%
group_by(trip_id) %>%
summarise(departure_time=first(departure_time),
arrival_time=last(arrival_time))
We can use data.table
library(data.table)
setDT(stop_times)[, .(departure_time = departure_time[1L],
arrival_time = arrival_time[.N]) , by = trip_id]
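Both answers above take the first and last row of each trip_id group, so they rely on stop_times already being ordered by stop_sequence within each trip. As a minimal sketch for feeds where that ordering is not guaranteed, the base R version below sorts first and then picks each trip's end rows with duplicated(). The toy data frame is made up for illustration and is deliberately shuffled; the column names follow the question.

```r
# Toy stand-in for stop_times (hypothetical values), deliberately unsorted
stop_times <- data.frame(
  trip_id        = c("A", "B", "A", "B", "A"),
  arrival_time   = c("7:17:00", "8:00:00", "7:15:00", "8:05:00", "7:16:00"),
  departure_time = c("7:17:00", "8:00:00", "7:15:00", "8:05:00", "7:16:00"),
  stop_sequence  = c(3, 1, 1, 2, 2),
  stringsAsFactors = FALSE
)

# Sort by trip and stop order so the first/last rows are the trip's true ends
st <- stop_times[order(stop_times$trip_id, stop_times$stop_sequence), ]

# First row of each trip, and last row of each trip (same trip order in both)
first_rows <- st[!duplicated(st$trip_id), ]
last_rows  <- st[!duplicated(st$trip_id, fromLast = TRUE), ]

Trip_Times <- data.frame(
  Trip_ID   = first_rows$trip_id,
  Departure = first_rows$departure_time,
  Arrival   = last_rows$arrival_time,
  stringsAsFactors = FALSE
)
```

The same sort could equally be done before the dplyr or data.table calls; the point is only that "first" and "last" mean something once the rows are in stop order.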
I want to distinctly count the number of customers who have purchased from the company between each SKU's first and last purchase date. This is after I have distinctly counted the number of customers for each SKU given in SQL (as well as finding the first and last purchase date).
I have code that successfully solves this problem; however, it uses a for loop and it is taking far too long because there are tens of thousands of SKUs. This is a short example of what my SKU table looks like:
SKUID <- c('123', '456', '789')
NumberOfCustomers <- c(204543, 92703, 305727)
SKUFirstPurchase <- c('2014-05-02', '2014-02-03', '2016-05-13')
SKULastPurchase <- c('2017-09-30', '2018-07-01', '2019-01-09')
SKUCount <- data.frame(SKUID, NumberOfCustomers,
SKUFirstPurchase, SKULastPurchase)
colnames(SKUCount) <- c('SKU', 'NumberOfCustomers',
'FirstPurchase', 'LastPurchase')
Then I have another table that is about 6 million rows long, a select distinct of the sales date and the CustomerID that I call OrderTable. I can't summarize the distinct count on a day-to-day basis and sum them together because this would double count customers who have purchased on separate days. I have to re-calculate the distinct count with every FirstPurchase/LastPurchase permutation that I see in my SKUCount table. From there, I use the following code to calculate the distinct number of customers in the given time frame:
library(dplyr)
for (i in 1:nrow(SKUCount))
{
SKUCount[i, c('DateCustomers')] <-
sapply(OrderTable %>%
filter(Date >= SKUCount[i,'FirstPurchase'],
Date <= SKUCount[i,'LastPurchase']) %>%
select(CustomerID),
function(x) length(unique(x)))
}
As I previously noted, this piece of code DOES work, but it's very slow (~0.5 second for each row). Is there a quicker way to calculate the distinct counts, or is there a more clever solution to my problem?
Try this one:
library("purrrlyr")
library("dplyr")
#First creating the datasets including OrderTable (please correct me if I got it wrong!):
SKUID <- c('123', '456', '789')
NumberOfCustomers <- c(204543, 92703, 305727)
SKUFirstPurchase <- c('2014-05-02', '2014-02-03', '2016-05-13')
SKULastPurchase <- c('2017-09-30', '2018-07-01', '2019-01-09')
SKUCount <- data.frame(SKUID, NumberOfCustomers,
SKUFirstPurchase, SKULastPurchase)
colnames(SKUCount) <- c('SKU', 'NumberOfCustomers',
'FirstPurchase', 'LastPurchase')
OrderTable <- data.frame(Date=c('2014-06-02', '2014-08-02', '2015-02-03', '2017-05-13'
,'2015-05-02', '2014-06-03', '2016-07-13', '2017-09-30', '2018-07-01', '2019-01-09'),
CustomerID=c('121','212','3434','24232','121','124','212','131','412','3634'))
#changing factors to date
SKUCount$FirstPurchase<-as.Date(SKUCount$FirstPurchase,format = "%Y-%m-%d")
SKUCount$LastPurchase<-as.Date(SKUCount$LastPurchase,format = "%Y-%m-%d")
OrderTable$Date<-as.Date(OrderTable$Date,format = "%Y-%m-%d")
#defining a function, named FUN, which restricts Date in OrderTable to the range between
#the two date arguments (FirstPurchase and LastPurchase) and returns the
#distinct count of CustomerIDs in OrderTable:
FUN <- function(FirstPurchase,LastPurchase){
Rtrn<-OrderTable %>%
filter(Date >= FirstPurchase,
Date <= LastPurchase) %>%
summarize(n_distinct(CustomerID))
as.numeric(Rtrn)
}
Next you want to take your dataset, SKUCount, and create a variable called DateCustomers by applying the function, FUN, to every row of it:
SKUCount %>%
rowwise() %>%
mutate(DateCustomers= FUN(FirstPurchase,LastPurchase))
# Source: local data frame [3 x 5]
# Groups: <by row>
#
# # A tibble: 3 x 5
# SKU NumberOfCustomers FirstPurchase LastPurchase DateCustomers
# <fct> <dbl> <date> <date> <dbl>
# 1 123 204543 2014-05-02 2017-09-30 6
# 2 456 92703 2014-02-03 2018-07-01 7
# 3 789 305727 2016-05-13 2019-01-09 5
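Note that rowwise() still scans the whole of OrderTable once per SKU. One possible way to cut that cost, offered only as a sketch, is to sort OrderTable by date once and use findInterval() to locate the contiguous block of rows belonging to each date range, so that unique() only sees the rows inside the range. The helper name count_in_range is made up for this example, and the data is the same toy data as above.

```r
# Same toy data as in the answer above
SKUCount <- data.frame(
  SKU = c("123", "456", "789"),
  FirstPurchase = as.Date(c("2014-05-02", "2014-02-03", "2016-05-13")),
  LastPurchase  = as.Date(c("2017-09-30", "2018-07-01", "2019-01-09"))
)
OrderTable <- data.frame(
  Date = as.Date(c("2014-06-02", "2014-08-02", "2015-02-03", "2017-05-13",
                   "2015-05-02", "2014-06-03", "2016-07-13", "2017-09-30",
                   "2018-07-01", "2019-01-09")),
  CustomerID = c("121", "212", "3434", "24232", "121",
                 "124", "212", "131", "412", "3634")
)

# Sort once so that every date range maps to one contiguous block of rows
ord <- OrderTable[order(OrderTable$Date), ]
date_num <- as.numeric(ord$Date)

# Hypothetical helper: distinct customers with first <= Date <= last
count_in_range <- function(first, last) {
  # number of rows strictly before 'first', plus one = first row in range
  lo <- findInterval(as.numeric(first), date_num, left.open = TRUE) + 1L
  # number of rows with Date <= last = last row in range
  hi <- findInterval(as.numeric(last), date_num)
  if (hi < lo) return(0L)
  length(unique(ord$CustomerID[lo:hi]))
}

SKUCount$DateCustomers <- vapply(
  seq_len(nrow(SKUCount)),
  function(i) count_in_range(SKUCount$FirstPurchase[i], SKUCount$LastPurchase[i]),
  integer(1)
)
```

If many SKUs share the same FirstPurchase/LastPurchase pair, computing the count once per unique pair and merging back would reduce the work further.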
I am just beginning with R and I have a beginner's question.
I have the following data frame (simplified):
Time: 00:01:00 00:02:00 00:03:00 00:04:00 ....
Flow: 2 4 5 1 ....
I would like to know the mean flow every two minutes instead of every minute. I need this for many hours of data.
I want to save those new means in a list. How can I do this using an apply function?
I assume you have continuous data without gaps, with values for Flow for every minute.
In base R we can use aggregate:
df.out <- data.frame(Time = df[seq(0, nrow(df) - 1, 2) + 1, "Time"]);
df.out$mean_2min = aggregate(
df$Flow,
by = list(rep(seq(1, nrow(df) / 2), each = 2)),
FUN = mean)[, 2];
df.out;
# Time mean_2min
#1 00:01:00 3
#2 00:03:00 3
Explanation: Extract only the odd rows from df; aggregate values in column Flow by every 2 rows, and store the mean in column mean_2min.
Sample data
df <- data.frame(
Time = c("00:01:00", "00:02:00", "00:03:00", "00:04:00"),
Flow = c(2, 4, 5, 1))
You can create a new variable in your data by rounding your time variable down to the nearest two minutes, then use a data.table grouping operation to calculate the mean for each two-minute period.
In order to help you precisely, you're gonna have to point out how your data is set up. If, for instance, your data is set up like this:
library(data.table)
dt = data.table(Time = c(0:3), Flow = c(2,4,5,1))
Then the following would work for you:
dt[, twomin := floor(Time/2)*2]
dt[, mean(Flow), by = twomin]
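If Time is stored as "HH:MM:SS" strings, as the question suggests, the same binning idea can be sketched in base R with split() and lapply(), which also returns the list the question asks for. This assumes one reading per minute, ignores the seconds field, and counts the two-minute bins from the first reading; the names minutes, bin and mean_list are just for this example.

```r
df <- data.frame(
  Time = c("00:01:00", "00:02:00", "00:03:00", "00:04:00"),
  Flow = c(2, 4, 5, 1),
  stringsAsFactors = FALSE
)

# Convert "HH:MM:SS" to minutes since midnight (seconds are ignored)
parts <- strsplit(df$Time, ":", fixed = TRUE)
minutes <- vapply(parts,
                  function(p) 60 * as.numeric(p[1]) + as.numeric(p[2]),
                  numeric(1))

# Two-minute bins counted from the first reading: 0, 0, 1, 1, ...
bin <- (minutes - minutes[1]) %/% 2

# Split Flow by bin and take the mean of each piece; the result is a named list
mean_list <- lapply(split(df$Flow, bin), mean)
```

Because the bins come from the timestamps and not from row positions, a missing minute leaves a bin with a single value in it without shifting every later pair.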
Here my time period range:
start_day = as.Date('1974-01-01', format = '%Y-%m-%d')
end_day = as.Date('2014-12-21', format = '%Y-%m-%d')
df = as.data.frame(seq(from = start_day, to = end_day, by = 'day'))
colnames(df) = 'date'
I need to create 10,000 data.frames with different fake years of 365 days each. This means that each of the 10,000 data.frames needs to have a different start and end of year.
In total df has got 14,965 days which, divided by 365 days = 41 years. In other words, df needs to be grouped 10,000 times differently by 41 years (of 365 days each one).
The start of each year has to be random, so it can be 1974-10-03, 1974-08-30, 1976-01-03, etc... and the remaining dates at the end df need to be recycled with the starting one.
The grouped fake years need to appear in a 3rd col of the data.frames.
I would put all the data.frames into a list but I don't know how to create the function which generates 10,000 different year's start dates and subsequently group each data.frame with a 365 days window 41 times.
Can anyone help me?
@gringer gave a good answer but it solved only 90% of the problem:
dates.df <- data.frame(replicate(10000, seq(sample(df$date, 1),
length.out=365, by="day"),
simplify=FALSE))
colnames(dates.df) <- 1:10000
What I need is 10,000 columns with 14,965 rows made by dates taken from df which need to be eventually recycled when reaching the end of df.
I tried to change length.out = 14965 but R does not recycle the dates.
Another option could be to change length.out = 1 and eventually add the remaining df rows for each column by maintaining the same order:
dates.df <- data.frame(replicate(10000, seq(sample(df$date, 1),
length.out=1, by="day"),
simplify=FALSE))
colnames(dates.df) <- 1:10000
How can I add the remaining df rows to each col?
The seq method also works if the to argument is unspecified, so it can be used to generate a specific number of days starting at a particular date:
> seq(from=df$date[20], length.out=10, by="day")
[1] "1974-01-20" "1974-01-21" "1974-01-22" "1974-01-23" "1974-01-24"
[6] "1974-01-25" "1974-01-26" "1974-01-27" "1974-01-28" "1974-01-29"
When used in combination with replicate and sample, I think this will give what you want in a list:
> replicate(2,seq(sample(df$date, 1), length.out=10, by="day"), simplify=FALSE)
[[1]]
[1] "1985-07-24" "1985-07-25" "1985-07-26" "1985-07-27" "1985-07-28"
[6] "1985-07-29" "1985-07-30" "1985-07-31" "1985-08-01" "1985-08-02"
[[2]]
[1] "2012-10-13" "2012-10-14" "2012-10-15" "2012-10-16" "2012-10-17"
[6] "2012-10-18" "2012-10-19" "2012-10-20" "2012-10-21" "2012-10-22"
Without the simplify=FALSE argument, it produces an array of integers (i.e. R's internal representation of dates), which is a bit trickier to convert back to dates. A slightly more convoluted way to do this and still produce Date output is to use data.frame on the unsimplified replicate result. Here's an example that will produce a 10,000-column data frame with 365 dates in each column (takes about 5s to generate on my computer):
dates.df <- data.frame(replicate(10000, seq(sample(df$date, 1),
length.out=365, by="day"),
simplify=FALSE));
colnames(dates.df) <- 1:10000;
> dates.df[1:5,1:5];
1 2 3 4 5
1 1988-09-06 1996-05-30 1987-07-09 1974-01-15 1992-03-07
2 1988-09-07 1996-05-31 1987-07-10 1974-01-16 1992-03-08
3 1988-09-08 1996-06-01 1987-07-11 1974-01-17 1992-03-09
4 1988-09-09 1996-06-02 1987-07-12 1974-01-18 1992-03-10
5 1988-09-10 1996-06-03 1987-07-13 1974-01-19 1992-03-11
To get the date wraparound working, a slight modification can be made to the original data frame, pasting a copy of itself on the end:
df <- as.data.frame(c(seq(from = start_day, to = end_day, by = 'day'),
seq(from = start_day, to = end_day, by = 'day')));
colnames(df) <- "date";
This is easier to code for downstream; the alternative being a double seq for each result column with additional calculations for the start/end and if statements to deal with boundary cases.
Now, instead of doing date arithmetic, each result column is a subset of the original data frame (where the arithmetic is already done), starting at one date in the first half of the frame and taking the next 14965 values. I'm using nrow(df)/2 in place of 14965 to keep the code more generic:
dates.df <-
as.data.frame(lapply(sample.int(nrow(df)/2, 10000),
function(startPos){
df$date[startPos:(startPos+nrow(df)/2-1)];
}));
colnames(dates.df) <- 1:10000;
>dates.df[c(1:5,(nrow(dates.df)-5):nrow(dates.df)),1:5];
1 2 3 4 5
1 1988-10-21 1999-10-18 2009-04-06 2009-01-08 1988-12-28
2 1988-10-22 1999-10-19 2009-04-07 2009-01-09 1988-12-29
3 1988-10-23 1999-10-20 2009-04-08 2009-01-10 1988-12-30
4 1988-10-24 1999-10-21 2009-04-09 2009-01-11 1988-12-31
5 1988-10-25 1999-10-22 2009-04-10 2009-01-12 1989-01-01
14960 1988-10-15 1999-10-12 2009-03-31 2009-01-02 1988-12-22
14961 1988-10-16 1999-10-13 2009-04-01 2009-01-03 1988-12-23
14962 1988-10-17 1999-10-14 2009-04-02 2009-01-04 1988-12-24
14963 1988-10-18 1999-10-15 2009-04-03 2009-01-05 1988-12-25
14964 1988-10-19 1999-10-16 2009-04-04 2009-01-06 1988-12-26
14965 1988-10-20 1999-10-17 2009-04-05 2009-01-07 1988-12-27
This takes a bit less time now, presumably because the date values have been pre-calculated.
Try this one, using subsetting instead:
start_day = as.Date('1974-01-01', format = '%Y-%m-%d')
end_day = as.Date('2014-12-21', format = '%Y-%m-%d')
date_vec <- seq.Date(from=start_day, to=end_day, by="day")
Now, I create a vector long enough so that I can use easy subsetting later on:
date_vec2 <- rep(date_vec,2)
Now, create the random start dates for 100 instances (replace this with 10000 for your application):
random_starts <- sample(1:14965, 100)
Now, create a list of dates by simply subsetting date_vec2 with your desired length:
dates <- lapply(random_starts, function(x) date_vec2[x:(x+14964)])
date_df <- data.frame(dates)
names(date_df) <- 1:100
date_df[1:5,1:5]
1 2 3 4 5
1 1997-05-05 2011-12-10 1978-11-11 1980-09-16 1989-07-24
2 1997-05-06 2011-12-11 1978-11-12 1980-09-17 1989-07-25
3 1997-05-07 2011-12-12 1978-11-13 1980-09-18 1989-07-26
4 1997-05-08 2011-12-13 1978-11-14 1980-09-19 1989-07-27
5 1997-05-09 2011-12-14 1978-11-15 1980-09-20 1989-07-28
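The question also asks for a column that groups each rotated date series into 41 fake years of 365 days. A minimal sketch of that step, assuming the grouping is simply consecutive blocks of 365 rows: the helper make_fake_years and the column name fake_year are made up here, and the wraparound uses modular indexing instead of a doubled vector.

```r
start_day <- as.Date("1974-01-01")
end_day   <- as.Date("2014-12-21")
date_vec  <- seq(start_day, end_day, by = "day")
n_days    <- length(date_vec)     # 14965
n_years   <- n_days %/% 365       # 41

# Hypothetical helper: rotate date_vec to begin at position start_pos and
# label consecutive blocks of 365 rows as fake years 1..n_years
make_fake_years <- function(start_pos) {
  # positions start_pos, start_pos + 1, ..., wrapping back to 1 after n_days
  idx <- (start_pos - 1 + seq_len(n_days) - 1) %% n_days + 1
  data.frame(
    date      = date_vec[idx],
    fake_year = rep(seq_len(n_years), each = 365)
  )
}

set.seed(1)                               # only to make the sketch reproducible
random_starts <- sample.int(n_days, 3)    # use 10000 for the real application
fake_list <- lapply(random_starts, make_fake_years)
```

Each element of fake_list is one data frame; keeping them in a list, as the question proposes, avoids building a single 10,000-column data frame.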
I have two columns of dates. Two example dates are:
Date1= "2015-07-17"
Date2="2015-07-25"
I am trying to count the number of Saturdays and Sundays between the two dates each of which are in their own column (5 & 7 in this example code). I need to repeat this process for each row of my dataframe. The end results will be one column that represents the number of Saturdays and Sundays within the date range defined by two date columns.
I can get the code to work for one row:
sum(weekdays(seq(Date1[1,5], Date2[1,7], "days")) %in% c("Saturday", "Sunday"))
The answer to this will be 3. But, if I take out the "1" in the row position of date1 and date2 I get this error:
Error in seq.Date(Date1[, 5], Date2[, 7], "days") :
'from' must be of length 1
How do I go line by line and have one vector that lists the number of Saturdays and Sundays between the two dates in column 5 and 7 without using a loop? Another issue is that I have 2 million rows and am looking for something with a little more speed than a loop.
Thank you!!
map2* functions from the purrr package will be a good way to go. They take two vector inputs (eg two date columns) and apply a function in parallel. They're pretty fast too (eg previous post)!
Here's an example. Note that the _int requests an integer vector back.
library(purrr)
# Example data
d <- data.frame(
Date1 = as.Date(c("2015-07-17", "2015-07-28", "2015-08-15")),
Date2 = as.Date(c("2015-07-25", "2015-08-14", "2015-08-20"))
)
# Wrapper function to compute number of weekend days between dates
n_weekend_days <- function(date_1, date_2) {
sum(weekdays(seq(date_1, date_2, "days")) %in% c("Saturday",'Sunday'))
}
# Iterate row wise
map2_int(d$Date1, d$Date2, n_weekend_days)
#> [1] 3 4 2
If you want to add the results back to your original data frame, mutate() from the dplyr package can help:
library(dplyr)
d <- mutate(d, end_days = map2_int(Date1, Date2, n_weekend_days))
d
#> Date1 Date2 end_days
#> 1 2015-07-17 2015-07-25 3
#> 2 2015-07-28 2015-08-14 4
#> 3 2015-08-15 2015-08-20 2
Here is a solution that uses dplyr to clean things up. It's not too difficult to use with() to assign the columns in the data frame directly.
Essentially, use a reference date and calculate the number of full weeks from it to the start date and to the end date (by floor or ceiling). Then take the difference between the two. The difference covers the days after the start date up to and including the end date, so a start date that itself falls on a Saturday or Sunday is not counted.
# weekdays(as.Date(0, origin = "1970-01-01")) -> "Thursday"
library(dplyr)
startDate <- as.Date(0, origin = "1970-01-01") # this is a Thursday
df <- data.frame(start = "2015-07-17", end = "2015-07-25")
df$start <- as.Date(df$start, format = "%Y-%m-%d")
df$end <- as.Date(df$end, format = "%Y-%m-%d")
# you can use with() to define the columns directly instead of %>%
# counting from a Thursday, Saturday is day offset 2 and Sunday is day offset 3
df <- df %>%
  mutate(originDate = startDate) %>%
  mutate(startDayDiff = as.numeric(start - originDate), endDayDiff = as.numeric(end - originDate)) %>%
  mutate(startWeekDiff = floor(startDayDiff / 7), endWeekDiff = floor(endDayDiff / 7)) %>%
  mutate(NumSatsStart = startWeekDiff + ifelse(startDayDiff %% 7 >= 2, 1, 0),
         NumSunsStart = startWeekDiff + ifelse(startDayDiff %% 7 >= 3, 1, 0),
         NumSatsEnd = endWeekDiff + ifelse(endDayDiff %% 7 >= 2, 1, 0),
         NumSunsEnd = endWeekDiff + ifelse(endDayDiff %% 7 >= 3, 1, 0)
  ) %>%
  mutate(NumSats = NumSatsEnd - NumSatsStart, NumSuns = NumSunsEnd - NumSunsStart)
Dates are stored as the number of days since 1970-01-01, a Thursday.
So the following is the number of Saturdays or Sundays since that date:
f <- function(d) {d <- as.numeric(d); r <- d %% 7; 2*(d %/% 7) + (r>=2) + (r>=3)}
For the number of Saturdays or Sundays between two dates, just subtract, after decrementing the start date to have an inclusive count.
g <- function(d1, d2) f(d2) - f(d1-1)
These are all vectorized functions so you can just call directly on the columns.
# Example data, as in Simon Jackson's answer
d <- data.frame(
Date1 = as.Date(c("2015-07-17", "2015-07-28", "2015-08-15")),
Date2 = as.Date(c("2015-07-25", "2015-08-14", "2015-08-20"))
)
As follows
within(d, end_days<-g(Date1,Date2))
# Date1 Date2 end_days
# 1 2015-07-17 2015-07-25 3
# 2 2015-07-28 2015-08-14 4
# 3 2015-08-15 2015-08-20 2
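Since the arithmetic in f() depends on the epoch falling on a Thursday, a brute-force cross-check can be reassuring before running it over 2 million rows. The sketch below repeats f and g from this answer and compares them with a day-by-day count on randomly generated date pairs. The helper brute and the random test dates are made up for the check, and as.POSIXlt()$wday (0 = Sunday, 6 = Saturday) is used instead of weekdays() so the check does not depend on the locale.

```r
# f and g as defined in the answer above
f <- function(d) {d <- as.numeric(d); r <- d %% 7; 2*(d %/% 7) + (r>=2) + (r>=3)}
g <- function(d1, d2) f(d2) - f(d1-1)

# Hypothetical reference implementation: enumerate every day and count weekends
brute <- function(d1, d2) {
  days <- seq(d1, d2, by = "day")
  sum(as.POSIXlt(days)$wday %in% c(0, 6))
}

set.seed(42)                                    # reproducible random test dates
d1 <- as.Date("2015-01-01") + sample(0:365, 200, replace = TRUE)
d2 <- d1 + sample(0:60, 200, replace = TRUE)    # end date on or after start date

slow <- vapply(seq_along(d1), function(i) brute(d1[i], d2[i]), integer(1))
fast <- g(d1, d2)

all(slow == fast)
```

The vectorised g() does a fixed amount of arithmetic per row, whereas the brute-force version builds a sequence for every row, which is where the speed difference on large data comes from.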
I'm stuck on a problem calculating travel dates. I have a data frame of departure dates and return dates.
Departure Return
1 7/6/13 8/3/13
2 7/6/13 8/3/13
3 6/28/13 8/7/13
I want to create and pass a function that will take these dates and form a list of all the days away. I can do this individually by turning each column into dates.
## Turn the departure and return dates into a readable format
dept_dates <- as.Date(travelDates$Departure, format = "%m/%d/%y")
retn_dates <- as.Date(travelDates$Return, format = "%m/%d/%y")
travel_dates <- na.omit(data.frame(dept_dates, retn_dates))
seq(from = travel_dates[1,1], to = travel_dates[1,2], by = 1)
This gives me [1] "2013-07-06" "2013-07-07"... and so on. I want to scale to cover the whole data frame, but my attempts have failed.
Here's one that I thought might work.
days_abroad <- data.frame()
get_days <- function(x,y){
all_days <- seq(from = x, to = y, by =1)
c(days_abroad, all_days)
return(days_abroad)
}
get_days(travel_dates$dept_dates, travel_dates$retn_dates)
I get this error:
Error in seq.Date(from = x, to = y, by = 1) : 'from' must be of length 1
There's probably a lot wrong with this, but what I would really like help on is how to run multiple dates through seq().
Sorry if this is simple (I'm still learning to think in R), and sorry too for any breaches in etiquette. Thank you.
EDIT: updated as per OP comment.
How about this:
travel_dates[] <- lapply(travel_dates, as.Date, format="%m/%d/%y")
dts <- with(travel_dates, mapply(seq, Departure, Return, by="1 day"))
This produces a list with as many items as you had rows in your initial table. You can then summarize (this will be data.frame with the number of times a date showed up):
data.frame(count=sort(table(Reduce(append, dts)), decreasing=TRUE))
# count
# 2013-07-06 3
# 2013-07-07 3
# 2013-07-08 3
# 2013-07-09 3
# ...
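One caveat with mapply: with its default SIMPLIFY = TRUE it returns a matrix of plain numbers when every trip happens to have the same length. A minimal sketch that always yields a list of Date vectors, using lapply over row indices, is shown below. The data frame reproduces the three rows from the question, and the names all_days and day_counts are just for this example.

```r
# The three trips from the question
travel_dates <- data.frame(
  Departure = c("7/6/13", "7/6/13", "6/28/13"),
  Return    = c("8/3/13", "8/3/13", "8/7/13"),
  stringsAsFactors = FALSE
)
travel_dates[] <- lapply(travel_dates, as.Date, format = "%m/%d/%y")

# One Date vector per trip; lapply() always returns a list
dts <- lapply(seq_len(nrow(travel_dates)), function(i) {
  seq(travel_dates$Departure[i], travel_dates$Return[i], by = "day")
})

# Combine into a single Date vector and count how often each day occurs
all_days <- do.call(c, dts)
day_counts <- table(format(all_days))
```

Here lengths(dts) gives the number of days away per trip, and day_counts is indexed by the date written as a "YYYY-MM-DD" string.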
OLD CODE:
The following gets the #days of each trip, rather than a list with the dates.
transform(travel_dates, days_away=Return - Departure + 1)
Which produces:
# Departure Return days_away
# 1 2013-07-06 2013-08-03 29 days
# 2 2013-07-06 2013-08-03 29 days
# 3 2013-06-28 2013-08-07 41 days
If you want to put days_away in a separate list, that is trivial, though it seems more useful to have it as an additional column to your data frame.