T tests in R - unable to run together

I have an airline dataset from stat computing which I am trying to analyse.
There are variables DepTime and ArrDelay (Departure Time and Arrival Delay). I am trying to analyse how Arrival Delay varies with certain chunks of departure time. My objective is to find which time chunks a person should avoid when booking tickets in order to avoid arrival delay.
My understanding: if a one-tailed t test between arrival delays for dep time > 1800 and arrival delays for dep time > 1900 shows high significance, it means that one should avoid flights between 1800 and 1900. (Please correct me if I am wrong.) I want to run such tests for all departure hours.
Totally new to programming and data science. Any help would be much appreciated.
The data looks like this (shared as a screenshot); the highlighted columns are the ones I am analysing.

Sharing an image of the data is not the same as providing the data for us to work with...
That said I went and grabbed one year of data and worked this up.
flights <- read.csv("~/Downloads/1995.csv", header = TRUE)
flights <- flights[, c("DepTime", "ArrDelay")]
# Bin departure times (HHMM) into hours: subtracting 30 and then rounding to the
# nearest hundred maps, e.g., 1845 to 1800 and 1915 to 1900.
flights$Dep <- round(flights$DepTime - 30, digits = -2)
head(flights, n = 25)
# This tests each hour of departures against the entire day.
# Alternative is set to "less" because we want to know if a given hour
# has less delay than the day as a whole.
pVsDay <- tapply(flights$ArrDelay, flights$Dep,
                 function(x) t.test(x, flights$ArrDelay, alternative = "less"))
# This tests each hour of departures against every other hour of the day.
# Alternative is set to "less" because we want to know if a given hour
# has less delay than the other hours.
pAllvsAll <- tapply(flights$ArrDelay, flights$Dep,
                    function(x) tapply(flights$ArrDelay, flights$Dep, function(z)
                      t.test(x, z, alternative = "less")))
I'll let you figure out multiple hypothesis testing and the like.
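If you want a starting point for that, here is a minimal sketch (my addition, not part of the original answer) that pulls the p-values out of pVsDay and applies a Benjamini-Hochberg correction with base R's p.adjust():
# extract the p-value from each htest object returned by t.test()
pvals <- sapply(pVsDay, function(tt) tt$p.value)
# adjust for multiple comparisons; small adjusted p-values flag hours whose
# delays are less than the day-wide average
p.adjust(pvals, method = "BH")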


assign phase of the day to each date - but fast

I want to compare an element within a list to intervals within a data frame and assign the respective interval to that element.
In my case I want to get a phase of the day (i.e. morning, day, evening, night) for a measurement. I found the R package 'suncalc', which creates the intervals for such phases, and I also have a solution to assign these phases of the day, but it is very slow and I wonder how to do it faster.
# make a list of different days and times
times <- seq.POSIXt(from = Sys.time(),
                    to = Sys.time() + 2*24*60*60, length.out = 50)
# load the suncalc package
library(suncalc)
# a function to get a phase for one point in time
get.dayphase <- function(x){
  phases <- getSunlightTimes(date = as.Date(x, tz = Sys.timezone()),
                             lat = 52.52, lon = 13.40,
                             tz = Sys.timezone())
  if(x < phases$nightEnd) return("night_morning")
  if(x >= phases$nightEnd & x < phases$goldenHourEnd) return("morning")
  if(x >= phases$goldenHourEnd & x < phases$goldenHour) return("day")
  if(x >= phases$goldenHour & x < phases$night) return("evening")
  if(x >= phases$night) return("night_evening")
}
# use sapply to get a phase for each point in time of the list
df <- data.frame(time = times, dayphase = sapply(times, get.dayphase))
the desired but slow result:
head(df)
time dayphase
1 2019-09-05 16:12:08 day
2 2019-09-05 17:10:55 day
3 2019-09-05 18:09:41 day
4 2019-09-05 19:08:28 evening
5 2019-09-05 20:07:14 evening
6 2019-09-05 21:06:01 evening
Basically, this is what I want. But it is too slow when I run it on a lot of points in time. getSunlightTimes() can also take a list of dates and returns a data table, but I have no idea how to handle this to get the desired result.
Thanks for your help
What is most likely slowing your process down is the sapply function, which is basically a hidden for loop.
To improve performance you need to vectorize your code. getSunlightTimes() can take a vector of dates. Also, instead of a series of if statements, the case_when() function from the dplyr package simplifies the code and should reduce the number of logical operations.
library(dplyr)
library(suncalc)

times <- seq.POSIXt(from = Sys.time(),
                    to = Sys.time() + 2*24*60*60, length.out = 50)

# get the phases for all of the times in one vectorized call
phases <- getSunlightTimes(as.Date(times),
                           lat = 52.52, lon = 13.40,
                           tz = Sys.timezone(),
                           keep = c("nightEnd", "goldenHourEnd", "goldenHour", "night"))

dayphase <- case_when(
  times < phases$nightEnd      ~ "night_morning",
  times < phases$goldenHourEnd ~ "morning",
  times < phases$goldenHour    ~ "day",
  times < phases$night         ~ "evening",
  TRUE                         ~ "night_evening"
)
This should provide a significant improvement. Additional performance gains are possible if you have a large number of times on each day: in that case, calculate the phases data frame once per day and then use it as a lookup table for the individual times, as sketched below.
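A rough sketch of that lookup-table idea (my addition, not part of the original answer; it reuses the lat/lon and the times vector from above and joins on the calendar date):
library(dplyr)
library(suncalc)
# one row of sun phases per unique calendar date
phase_tbl <- getSunlightTimes(date = unique(as.Date(times)),
                              lat = 52.52, lon = 13.40,
                              tz = Sys.timezone(),
                              keep = c("nightEnd", "goldenHourEnd", "goldenHour", "night"))
# join each time to its date's phases, then classify as before
df <- data.frame(time = times, date = as.Date(times)) %>%
  left_join(phase_tbl, by = "date") %>%
  mutate(dayphase = case_when(
    time < nightEnd      ~ "night_morning",
    time < goldenHourEnd ~ "morning",
    time < goldenHour    ~ "day",
    time < night         ~ "evening",
    TRUE                 ~ "night_evening"
  ))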

Average of Data in accordance with similar time

I have a dataset containing solar power generation for 24 hours over many days. Now I have to find the average of the power generated for each time of day; for example, the average of the power generated at 9:00:00 AM. (A glimpse of the dataset was shared as a screenshot.)
Start by stripping out the time from the date-time variable.
Assuming your data is called myData
library(lubridate)
myData$Hour <- hour(strptime(myData$Time, format = "%Y-%m-%d %H:%M:%S"))
Then use ddply from the plyr package, which lets us apply a function to each subset of the data.
library(plyr)
myMeans <- ddply(myData[, c("Hour", "IT_solar_generation")], "Hour", numcolwise(mean))
The resulting data frame will have one column called Hour giving the hour of the day, and another with the mean generation at each hour.
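If you prefer to stay in base R, an equivalent one-liner (my addition, assuming the same myData and column names as above) is aggregate:
myMeans <- aggregate(IT_solar_generation ~ Hour, data = myData, FUN = mean)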
Now, on another but important note: when you ask a question, you should provide information on the attempts you've made so far to answer it. This isn't a help desk.

How to cluster according to hour of specific day

I have logs of the number of arrivals at a bank, every half an hour, for one month.
I am trying to find different cluster groups according to the number of "arrivals". I tried clustering by day, and I tried clustering by hour (not of a specific day). I would like to try clustering by the hour of a specific day.
An example:
Thursdays at 14:00 and Sundays at 15:00 are one cluster with an average of 10000 arrivals
Mondays at 13:00, Mondays at 10:00 and Tuesdays at 16:00 are one cluster with an average of 15000 arrivals.
all the rest are another cluster with an average of 2000 arrivals.
I have a csv file with the columns: Date, Day(1-7), Time, Arrivals
Until now I used this:
km <- kmeans(table, 3, 15)
plot(km)
(I tried 3 clusters.) This code clusters pairs of columns: it produces a 3x3 matrix of plots, one for each pair of the 3 columns.
Is there a way to do that?
k-means and similar algorithms will yield meaningless results on this kind of data.
The problem is you are using the wrong tool for the wrong problem on the wrong data.
Your data is: Date, Day(1-7), Time, Arrivals
K-means will try to minimize variance. But does variance make any sense on this data set? How do you know which k makes the most sense? Since Arrivals likely has the largest variance of these attributes, it will completely dominate your result.
The question you should first try to answer is: what is a good result? Then consider ways of visualizing the results to verify that you are on to something. And once you have visualized the data, consider ways to manually mark the desired result on the visualization; that may well be good enough for you, and better than hoping for k-means to yield a somewhat meaningful result, because on this kind of mixed-type data it usually does not work very well.
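To make the "visualize first" advice concrete, here is a small sketch (my addition, not part of the original answer; it assumes your data frame is called arrivals and has the Day, Time and Arrivals columns from the question, with Time stored as "HH:MM" strings): average the arrivals by day of week and hour and look at the pattern as a heatmap before reaching for any clustering algorithm.
# derive the hour from the "HH:MM" strings (an assumption about the format)
arrivals$Hour <- as.integer(substr(arrivals$Time, 1, 2))
# mean arrivals per (day of week, hour) cell
avg <- aggregate(Arrivals ~ Day + Hour, data = arrivals, FUN = mean)
# plot the table directly; busy slots are usually visible by eye
heatmap(xtabs(Arrivals ~ Day + Hour, data = avg),
        Rowv = NA, Colv = NA, scale = "none",
        xlab = "Hour of day", ylab = "Day of week (1-7)")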

Getting Date to Add Correctly

I have a 3000 x 1000 matrix of time series data going back 14 years that is updated every three months. I am forecasting out 9 months using this data, still keeping roughly a 3200 x 1100 matrix (mind you, these are rough numbers).
During the forecasting process I need the variables Year and Month to be calculated appropriately. I am trying to automate the process so I don't have to mess with the code any more; I can just run the code every three months and upload the projections into our database.
Below is the code I am using right now. As I said above, I do not want to have to look at the data or the code, just run the code every three months. Right now everything else is working as planned, but I still have to ensure the dates are appropriately annotated. The foo variables are changed for privacy purposes due to the nature of their names.
projection <- rbind(projection, data.frame(foo = forbar, bar = barfoo,
                                           Year = 2012, Month = 1:9,
                                           Foo = as.vector(fc$mean)))
I'm not sure exactly where the year/months are coming from, but if you want to refer to the current date for those numbers, here is an option (using the wonderful package, lubridate):
library(lubridate)
today <- Sys.Date()
projection <- rbind(projection, data.frame(foo = foobar, bar = barfoo,
                                           year = year(today),
                                           month = sapply(1:9, function(x) month(today + months(x))),
                                           Foo = as.vector(fc$mean)))
I hope this is what you're looking for.
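One caveat worth noting (my addition, not part of the original answer): with a fixed year = year(today), a nine-month window that crosses New Year is still labelled with the current year. A small variant derives both year and month from the same sequence of future dates, using lubridate's %m+% to step months safely:
library(lubridate)
future <- Sys.Date() %m+% months(1:9)  # the next nine month-steps from today
projection <- rbind(projection, data.frame(foo = foobar, bar = barfoo,
                                           year = year(future),
                                           month = month(future),
                                           Foo = as.vector(fc$mean)))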

Compute average over sliding time interval (7 days ago/later) in R

I've seen a lot of solutions for working with groups of times or dates, like aggregate to sum daily observations into weekly observations, or other solutions to compute a moving average, but I haven't found a way to do what I want, which is to pluck relative dates out of data keyed by an additional variable.
I have daily sales data for a bunch of stores. So that is a data.frame with columns
store_id date sales
It's nearly complete, but there are some missing data points, and those missing data points are having a strong effect on our models (I suspect). So I used expand.grid to make sure we have a row for every store and every date, but at this point the sales data for those missing data points are NAs. I've found solutions like
dframe[is.na(dframe)] <- 0
or
dframe$sales[is.na(dframe$sales)] <- mean(dframe$sales, na.rm = TRUE)
but I'm not happy with the RHS of either of those. I want to replace missing sales data with our best estimate, and the best estimate of sales for a given store on a given date is the average of the sales 7 days prior and 7 days later. E.g. for Sunday the 8th, the average of Sunday the 1st and Sunday the 15th, because sales is significantly dependent on day of the week.
So I guess I can use
dframe$sales[is.na(dframe$sales)] <- my_func(dframe)
where my_func(dframe) replaces every store's missing sales data with the average of that store's sales 7 days prior and 7 days later (ignoring, for the first go-around, the situation where one of those data points is also missing), but I have no idea how to write my_func in an efficient way.
How do I match up the store_id and the dates 7 days prior and future without using a terribly inefficient for loop? Preferably using only base R packages.
Something like:
with(
  dframe,
  ave(sales, store_id, FUN = function(x) {
    # assumes each store's rows are sorted by date with one row per date,
    # so an offset of 7 rows corresponds to 7 days
    naw <- which(is.na(x))
    x[naw] <- rowMeans(cbind(
      ifelse(naw > 7, x[pmax(naw - 7, 1)], NA),  # sales 7 days earlier (NA at the start of the series)
      x[naw + 7]                                 # sales 7 days later (NA past the end of the series)
    ))
    x
  })
)
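A usage sketch, not from the original answer: it assumes dframe has the store_id, date and sales columns described in the question, orders the rows so that offsets of +/-7 rows correspond to +/-7 calendar days, and writes the imputed values back.
# order by store and date so row offsets line up with calendar days,
# then overwrite sales with the imputed vector returned by ave()
dframe <- dframe[order(dframe$store_id, dframe$date), ]
dframe$sales <- with(dframe, ave(sales, store_id, FUN = function(x) {
  naw <- which(is.na(x))
  x[naw] <- rowMeans(cbind(ifelse(naw > 7, x[pmax(naw - 7, 1)], NA), x[naw + 7]))
  x
}))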
