I am currently working on a project and I need some help. I want to predict the length of flight delays, using a statistical model. The data set does not contain the length of flight delays, but it can be calculated from the actual and scheduled departure times.
I will include a link if you want the whole dataset:
https://drive.google.com/file/d/11BXmJCB5UGEIRmVkM-yxPb_dHeD2CgXa/view?usp=sharing
I then ran the following code
Delays <- read.table("FlightDelays.csv", header=T, sep=",")
DepatureTime <- strptime(formatC(Delays$deptime, width = 4, format = "d", flag = "0"), "%H%M")
ScheduleTime <- strptime(formatC(Delays$schedtime, width = 4, format = "d", flag = "0"), "%H%M")
DelayTime <- as.numeric(difftime(DepatureTime, ScheduleTime))/60
DelayData <- data.frame(DelayTime, Delays)
The above code allowed me to get the delay time in minutes
For those of you who do not want to obtain the whole dataset I will now include a small example of some observations of the form
structure(list(schedtime = c(1455, 1640, 1245, 1715, 1039 , 2120), deptime = c(1455, 1640, 1245, 1709, 1035, 0010)), .Names = c("schedtime", "deptime"), row.names = c(NA, 6L), class = "data.frame")
If you run the code I showed at the beginning, the delay for the 6th observation comes out as -1270 minutes rather than 170 minutes. I believe strptime assumes you are still in the same day and doesn't recognise that the delay pushed the departure time into the early hours of the following day.
How can I get the code to recognise that a delay will sometimes push the departure time into the following day?
Thank you for any help
Using lubridate:
library(lubridate)
ScheduleTime <- as_datetime(formatC(Delays$schedtime, width = 4, format = "d", flag = "0"),format="%H%M")
DepatureTime <- as_datetime(formatC(Delays$deptime, width = 4, format = "d", flag = "0"),format="%H%M") + hours(ifelse(Delays$deptime < Delays$schedtime & Delays$schedtime > 2000,24,0))
DelayTime <- as.numeric(difftime(DepatureTime, ScheduleTime, units = "mins")) # ask for minutes explicitly instead of relying on the automatic unit
DelayData <- data.frame(DelayTime, Delays)
The problem is that you have to decide when a deptime smaller than schedtime means the flight left early and when it means the departure slipped past midnight. I don't see a general way around that.
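If the flight date isn't available, one alternative to the schedtime > 2000 cutoff is to work in minutes since midnight and treat any apparent early departure larger than some threshold as a rollover past midnight. This is only a sketch in base R: `to_minutes`, `delay_minutes` and the 12-hour `max_early` threshold are my own illustrative choices, not anything from the dataset, and the threshold is still a judgment call.

```r
# Convert an HHMM integer such as 1455 (or 10 for 00:10) to minutes since midnight.
to_minutes <- function(hhmm) (hhmm %/% 100) * 60 + hhmm %% 100

# Delay in minutes. ASSUMPTION: no flight leaves more than `max_early` minutes
# ahead of schedule, so a larger negative gap means the departure was on the next day.
delay_minutes <- function(schedtime, deptime, max_early = 720) {
  raw <- to_minutes(deptime) - to_minutes(schedtime)
  ifelse(raw < -max_early, raw + 24 * 60, raw)
}

example <- data.frame(
  schedtime = c(1455, 1640, 1245, 1715, 1039, 2120),
  deptime   = c(1455, 1640, 1245, 1709, 1035, 10)
)
example$DelayTime <- delay_minutes(example$schedtime, example$deptime)
example
```

On the six example rows this gives 0, 0, 0, -6, -4 and 170, so the last flight is treated as leaving 170 minutes late instead of 1270 minutes early.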
I have a csv that contains "Period", which are quarters, and "Percent". After reading the data into R, the "Period" column is "chr" and "Percent" column is "num". I want to change the quarter values to dates, so:
for (i in 1:length(sloos_tighten$Period)) {
sloos_tighten$Period[i] <- paste("Q", substring(sloos_tighten$Period[i], 6), "/", substring(sloos_tighten$Period[i], 1, 4), sep = "")
sloos_tighten$Period[i] <- as.Date(as.yearqtr(sloos_tighten$Period[i], format = "Q%q/%Y"))
}
where the first line in the for-loop changes the format of the quarter to be readable by as.yearqtr, and the second line changes the quarter to a date. The first line works as intended, but the second line converts the date to a four-digit number. I think this is because "Period" is of type "chr", but I don't know how to change it to date. I have tried to create a new column with type date, but I cannot find any resource online that explains it. Any help is appreciated. Thanks in advance.
> dput(head(sloos_tighten, 10))
structure(list(Period = c("1990:2", "1990:3", "1990:4", "1991:1",
"1991:2", "1991:3", "1991:4", "1992:1", "1992:2", "1992:3"),
`Large and medium` = c(54.4, 46.7, 54.2, 38.6, 20, 18.6,
16.7, 10, 3.5, -3.4), Small = c(52.7, 33.9, 40.7, 31.6, 6.9,
8.8, 7, 0, -7.1, -1.7)), row.names = c(NA, 10L), class = "data.frame")
^What the data looks like after import
The literal for loop is fine in a sense, but unfortunately there are two problems here:
1. There is a class problem here: if $Period is a string, then when you reassign one of its values with something of Date class, the date is coerced to a string of its underlying day count (hence the four-digit number). This is because in R data frames, with few exceptions, all values in a column must be the same type: a column is (almost always) a vector, and R treats vectors as homogeneous.
You can get around this by pre-allocating a vector of type Date and assigning it piecemeal:
newdate <- rep(Sys.Date()[NA], nrow(sloos_tighten)) # just to get the class right
for (i in 1:length(sloos_tighten$Period)) {
tmp <- paste("Q", substring(sloos_tighten$Period[i], 6), "/", substring(sloos_tighten$Period[i], 1, 4), sep = "")
newdate[i] <- as.Date(as.yearqtr(tmp, format = "Q%q/%Y"))
}
(But please, don't use this code, look at #2 below first.)
2. Not a problem per se, but an efficiency point: R is good at operating on whole vectors. If you reassign all of $Period in one step, it is faster.
# as.Date() does not understand "%q", so parse with zoo::as.yearqtr() first
sloos_tighten$Period <-
  as.Date(
    as.yearqtr(
      paste0(substring(sloos_tighten$Period, 6),
             "/", substring(sloos_tighten$Period, 1, 4)),
      format = "%q/%Y"))
This switches from paste(.., sep="") to paste0, a convenience function. Then, it removes the leading "Q" since really we don't keep it around, so why add it (other than perhaps declarative code). Last, it does a whole vector of strings at once.
(This is taking the data sight-unseen, so untested.)
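To make both points concrete without needing zoo, here is a small base R sketch. The first part reproduces the coercion that turned the dates into four-digit numbers; the second converts the whole column at once. `quarter_start` is a hypothetical helper name, and I'm assuming every Period looks like "1990:2" and that the first day of the quarter is the date you want.

```r
period <- c("1990:2", "1990:3", "1990:4", "1991:1")

# 1. Assigning a Date into a character vector stores its day count as text.
chr <- period
chr[1] <- as.Date("1990-04-01")
chr[1]
#> [1] "7395"

# 2. Vectorised conversion: quarter q starts in month 3 * (q - 1) + 1.
quarter_start <- function(period) {
  year    <- as.integer(substring(period, 1, 4))
  quarter <- as.integer(substring(period, 6))
  as.Date(sprintf("%d-%02d-01", year, 3L * (quarter - 1L) + 1L))
}

dates <- quarter_start(period)
class(dates)
dates
```

Because the result is built as a new vector and assigned in one step (for example `sloos_tighten$Period <- quarter_start(sloos_tighten$Period)`), the column ends up with class Date rather than character.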
I want to transform from chr to date format
I have this representing year -week:
2020-53
I've tried to do this:
mutate(semana=as_date(year_week,format="%Y-%U"))
but I get the same date for every row in the dataset: 2020-01-18
I also tried
mutate(semana=strptime(year_week, "%Y-%U"))
getting the same result
Here you can see the wrong conversion
Any ideas? Thanks
I think I've got something that does the job.
library(tidyverse)
library(lubridate)
# Set up table like example in post
trybble <- tibble(year_week = c("2020-53", rep("2021-01", 5)),
country = c("UK", "FR", "GER", "ITA", "SPA", "UK"))
# Function to go into mutate with given info of year and week
y_wsetter <- function(fixme, yeargoal, weekgoal) {
lubridate::year(fixme) <- yeargoal
lubridate::week(fixme) <- weekgoal
return(fixme)
}
# Making a random date so col gets set right
rando <- make_datetime(year = 2021, month = 1, day = 1)
# Show time
trybble <- trybble %>%
add_column(semana = rando) %>% # Set up col of dates to fix
mutate(yerr = substr(year_week, 1, 4)) %>% # Get year as chr
mutate(week = substr(year_week, 6, 7)) %>% # Get week as chr
mutate(semana2 = y_wsetter(semana,
as.numeric(yerr),
as.numeric(week))) %>% # fixed dates
select(-c(yerr, week, semana))
Notes:
If you somehow plug in a week greater than 53, lubridate doesn't mind, and goes forward a year.
I really struggled to get mutate to play nicely without writing my own function y_wsetter. In my experience with mutates with multiple inputs, or where I'm changing a "property" of a value instead of the whole value itself, I need to probably write a function. I'm using the lubridate package to change just the year or week based on your year_week column, so this is one such situation where a quick function helps mutate out.
I was having a weird time when I tried setting rando to Sys.Date(), so I manually set it to something using make_datetime. YMMV
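For comparison, here is a base R sketch that avoids the placeholder date entirely. It rests on an assumption the question doesn't confirm: that year_week holds ISO 8601 weeks (week 1 is the week containing 4 January, weeks start on Monday). Note this is not the same week definition that lubridate::week() uses, so the two approaches can give different dates. `iso_week_start` is a made-up helper name.

```r
# Monday of the given ISO week, for strings like "2020-53".
# ASSUMPTION: the weeks are ISO 8601 weeks.
iso_week_start <- function(year_week) {
  year <- as.integer(substr(year_week, 1, 4))
  week <- as.integer(substr(year_week, 6, 7))
  jan4 <- as.Date(sprintf("%d-01-04", year))   # always falls in ISO week 1
  wday <- as.integer(format(jan4, "%u"))       # 1 = Monday ... 7 = Sunday
  jan4 - (wday - 1) + 7 * (week - 1)
}

year_week <- c("2020-53", "2021-01")
semana <- iso_week_start(year_week)
semana
#> [1] "2020-12-28" "2021-01-04"
```

Since the function is vectorised, it can go straight into mutate(semana = iso_week_start(year_week)) without a helper column.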
How can I sum the total time per driver in R? Can someone help me?
Total time
Preferred end result
One recommendation to make: please do not use images to share data. Instead, use dput() of your data frame. See this post on making a reproducible example on SO.
One approach to this involves the tidyverse and lubridate packages (I am sure there are other solutions).
First, I would put your data into long form instead of wide. The times are then converted from %H:%M:%OS (with fractional seconds) to durations since midnight.
Then, for each driver, these times are summed up, and results are provided in different formats:
total_time1 - total number of seconds (with decimal places)
total_time2 - number of minutes (M) and number of decimal seconds (S)
total_time3 - total time in %M:%OS format (minutes and decimal seconds)
Edit: In addition, I have added two columns based on OP request:
total_time_minutes - total number of minutes (with decimal places)
avg_speed - average speed in km/h, assuming a total distance of 27004.65 meters
I hope this is helpful. Please let me know.
library(tidyverse)
library(lubridate)
df %>%
pivot_longer(cols = -lap) %>%
mutate(lap_time = as.numeric(as.POSIXct(value, format = "%H:%M:%OS", tz = "UTC")) -
as.numeric(as.POSIXct(Sys.Date(), tz = "UTC"))) %>%
group_by(name) %>%
summarise(total_time1 = sum(lap_time)) %>%
mutate(total_time2 = seconds_to_period(total_time1),
total_time3 = sprintf("%d:%.4f", minute(total_time2), second(total_time2)),
total_time_minutes = total_time1/60,
avg_speed = 3.6 * 27004.65/total_time1) %>%
as.data.frame()
Output
name total_time1 total_time2 total_time3 total_time_minutes avg_speed
1 Bottas 319.782 5M 19.7815999984741S 5:19.7816 5.32969 304.010
2 Hamilton 320.320 5M 20.3204002380371S 5:20.3204 5.33867 303.498
3 Leclerc 319.981 5M 19.98140001297S 5:19.9814 5.33302 303.820
4 Verstappen 318.220 5M 18.219899892807S 5:18.2199 5.30366 305.502
5 Vettel 318.625 5M 18.6247997283936S 5:18.6248 5.31041 305.114
Data
df <- structure(list(lap = 1:5, Bottas = c("00:01:04.9388", "00:01:03.7164",
"00:01:04.0028", "00:01:03.3424", "00:01:03.7812"), Hamilton = c("00:01:04.5280",
"00:01:03.7524", "00:01:03.9632", "00:01:04.3712", "00:01:03.7056"
), Leclerc = c("00:01:04.9812", "00:01:03.7740", "00:01:04.6026",
"00:01:03.3920", "00:01:03.2316"), Verstappen = c("00:01:04.1704",
"00:01:03.7383", "00:01:03.7128", "00:01:02.8460", "00:01:03.7524"
), Vettel = c("00:01:04.3632", "00:01:02.8244", "00:01:03.7164",
"00:01:03.8532", "00:01:03.8676")), class = "data.frame", row.names = c(NA,
-5L))
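If you'd rather not go through POSIXct and subtract midnight, the lap strings can also be turned into seconds by splitting on ":". This is just a base R sketch of the same idea; `to_seconds` is a hypothetical helper, and it assumes every value has the form HH:MM:SS.ssss. Only Bottas's laps from the data above are repeated here to keep it short.

```r
# Convert "HH:MM:SS.ssss" strings to seconds.
to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

bottas <- c("00:01:04.9388", "00:01:03.7164", "00:01:04.0028",
            "00:01:03.3424", "00:01:03.7812")

total_seconds <- sum(to_seconds(bottas))
total_minutes <- total_seconds / 60
avg_speed     <- 3.6 * 27004.65 / total_seconds  # km/h, same distance as above

round(c(total_seconds, total_minutes, avg_speed), 4)
```

The total comes to 319.7816 seconds, which agrees with the total_time1 value for Bottas in the output table.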
I'm fairly new to R and am trying to plot some expenditure data. I read the data in from Excel and then do some manipulation on the dates:
data <- read.csv("Spending2019.csv", header = T)
#converts time so R can use the dates
strdate <- strptime(data$DATE,"%m/%d/%Y")
newdate <- cbind(data,strdate)
finaldata <- newdate[order(strdate),]
This probably isn't the most efficient, but it gets me there :)
Here are the relevant columns of the first four lines of my finaldata dataframe:
dput(droplevels(finaldata[1:4,c(5,7)]))
structure(list(AMOUNT = c(25.13, 14.96, 43.22, 18.43), strdate = structure(c(1546578000,
1546750800, 1547010000, 1547010000), class = c("POSIXct", "POSIXt"
), tzone = "")), row.names = c(NA, 4L), class = "data.frame")
The full data set has 146 rows and the dates range from 1/4/2019 to 12/30/2019
I then plot the data
plot(finaldata$strdate,finaldata$AMOUNT, xlab = "Month", ylab = "Amount Spent")
and I get this plot
This is fine for me getting started, EXCEPT why is JAN repeated at the far right end? I have tried various forms of xlim and can't seem to get it to go away.
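One possible explanation, which I can't confirm without seeing the plot: the default axis extends a little past the last data point (30 December 2019), so the tick for 1 January 2020 also gets drawn and is labelled "Jan". If that is what's happening, you can suppress the default axis and draw the month ticks yourself. The sketch below uses simulated data as a stand-in for finaldata (the values are invented), and fixes the time zone to UTC for simplicity.

```r
# Simulated stand-in for finaldata: 146 purchases spread over 2019.
set.seed(1)
finaldata <- data.frame(
  strdate = as.POSIXct("2019-01-04", tz = "UTC") + sort(runif(146, 0, 360 * 86400)),
  AMOUNT  = round(runif(146, 5, 60), 2)
)

# One tick per month of 2019 only, so no tick exists for January 2020.
month_starts <- seq(as.POSIXct("2019-01-01", tz = "UTC"),
                    as.POSIXct("2019-12-01", tz = "UTC"),
                    by = "month")

plot(finaldata$strdate, finaldata$AMOUNT,
     xaxt = "n",                                   # suppress the default x axis
     xlab = "Month", ylab = "Amount Spent")
axis.POSIXct(1, at = month_starts, format = "%b")  # draw custom month labels
```

Using format = "%b %y" instead would keep a January 2020 tick distinguishable if you prefer to leave the axis range alone.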
In my data https://pastebin.com/CernhBCg I have irregular timestamps and a corresponding value. In addition to the irregularity, I have large gaps for which there are no rows in my data. I know, however, that the value is zero during those gaps, and I would like to fill the gaps with rows where value=0. How can I do this?
Data
> dput(head(hub2_select,10))
structure(list(time = structure(c(1492033212.648, 1492033212.659,
1492033212.68, 1492033212.691, 1492033212.702, 1492033212.724,
1492033212.735, 1492033212.757, 1492033212.768, 1492033212.779
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), value = c(3,
28, 246, 297, 704, 798, 1439, 1606, 1583, 1572)), .Names = c("time",
"value"), row.names = c(NA, 10L), class = "data.frame")
Please take the file I provided to see the data and read it into R with
library(readr)
df <- read_csv("data.csv", col_types = list(time = col_datetime(), value = col_double()))
Solutions
For one, the values to the left and right of a gap are usually 0 or 1, so that might help. I thought I'd use a rolling join, but from what I understand by now, this does not seem to be the way to go.
What works is
library(dplyr)
library(lubridate)
threshold_time = dseconds(2)
time_prev = df$time[1]
addrows = data.frame()
for (i in seq(2, nrow(df),1)){
time_current <- df$time[i]
if ((time_current - time_prev) > threshold_time){
time_add <- seq(time_prev, time_current, dseconds(0.1))
addrows = bind_rows(addrows, data.frame(time=time_add, value=rep(0, length(time_add))))
}
time_prev <- time_current
}
addrows$type <- 'filled'
df$type <- 'orig'
df_new <- bind_rows(df, addrows)
library(ggplot2)
ggplot(df_new, aes(time,value,color=type)) + geom_point()
But this solution is neither elegant nor efficient (I did not test efficiency though).
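The same idea can be written without growing a data frame inside the loop: find the gaps with diff(), build the filler timestamps for each gap, and bind once at the end. This is a sketch in base R, not something I've run on the full file. `fill_gaps` is a made-up name, it assumes the time column is sorted and in UTC (as in the dput above), and unlike the loop it does not repeat the two timestamps that border a gap.

```r
# Insert zero-valued rows every `step` seconds into gaps longer than `threshold` seconds.
# ASSUMPTIONS: df$time is sorted POSIXct in UTC; df has columns time and value.
fill_gaps <- function(df, threshold = 2, step = 0.1) {
  secs <- as.numeric(df$time)
  gap_start <- which(diff(secs) > threshold)   # index of the row just before each gap

  # Filler timestamps strictly inside each gap, as seconds since the epoch.
  fill_secs <- as.numeric(unlist(lapply(gap_start, function(i) {
    seq(secs[i] + step, secs[i + 1] - step, by = step)
  })))

  filled <- data.frame(
    time  = as.POSIXct(fill_secs, origin = "1970-01-01", tz = "UTC"),
    value = rep(0, length(fill_secs)),
    type  = rep("filled", length(fill_secs))
  )
  orig <- data.frame(time = df$time, value = df$value,
                     type = rep("orig", nrow(df)))

  out <- rbind(orig, filled)
  out[order(out$time), ]
}

# Tiny example: a 9-second gap between the 2nd and 3rd observation.
toy <- data.frame(
  time  = as.POSIXct(c(0, 1, 10, 11), origin = "1970-01-01", tz = "UTC"),
  value = c(3, 1, 0, 5)
)
toy_filled <- fill_gaps(toy, threshold = 2, step = 1)
table(toy_filled$type)
```

The result has the same time, value and type columns as df_new, so the ggplot call above should work on it unchanged.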
Honestly I haven't tried it yet (I had to switch to Python for other reasons, solved it there, and didn't get around to trying it out), but I am pretty sure https://cran.r-project.org/web/packages/padr/vignettes/padr.html would have been the answer. I just wanted to write this here for other readers with the same question.