Apply cumsum function using a condition - R

I am trying to calculate the maximum number of aircraft on the ground simultaneously at each station over the year, knowing that I have more than 300 stations and that the data is at day-and-hour granularity for one year.
So I thought of this solution: find the maximum per day and per station, then extract the maximum per station.
My data are in this format: station, aircraft, time, type (arrival at the station or departure from the station), and value, which is 1 for an arrival and -1 for a departure. I created this column to facilitate the count; the idea is to apply cumsum once the data are sorted by time for each station.
I need to create a function which groups the data by day and by station and computes the cumulative sum, but some planes have been sleeping at the station, so I need to delete them (the yellow lines in the screenshot).
The trick to detect these planes:
The Aircraft column lets us track each plane: generally it appears twice a day, once when it arrives and once when it leaves.
To detect the planes to skip, I have to look at two variables, Aircraft and Type: if the type is departure and the aircraft on that line appears only once that day (meaning there is no arrival for that flight), then I should not count it.
I was thinking of creating a function that groups by station and time, then applies cumsum while skipping the lines matching the condition above (type is departure and the aircraft appears only once that day).
Any Help??

I solved this by creating a new table with the table function to count the frequency of the aircraft by day, then joining this table to my original data so I could drop the rows with Freq equal to 1 and type departure.
library(dplyr)
# Count how often each aircraft appears on each day
tab <- as.data.frame(table(df$day, df$AIRCRAFT, useNA = "always"))
df_new <- df %>% left_join(tab, by = c("day" = "Var1", "AIRCRAFT" = "Var2"))
# Drop departures whose aircraft appears only once that day, then take the
# running sum of the +1/-1 value column (STATION column name assumed)
df_final <- df_new %>% filter(!(TYPE == "Departure" & Freq == 1)) %>%
  arrange(TIMESTAMP) %>% group_by(STATION, day) %>% mutate(ground = cumsum(value))
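With the sleeping planes removed, the answer to the original question is one grouped summary away. A minimal sketch of that final step (again assuming the STATION column):
# Maximum number of aircraft on the ground simultaneously, per station
max_ground <- df_final %>%
  group_by(STATION) %>%
  summarise(max_on_ground = max(ground, na.rm = TRUE))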

Related

Average after 2 group_by's in R

I am new to R and can't find the right syntax for a specific average I need. I have a large Fitbit dataset of heart rate per second for 30 people, for a month each. I want an average heart rate per day per person, to make the data easier to manage and to join with other Fitbit data.
[Screenshot: first few lines of the data]
The columns I have are Id (person Id), Time (date-time), and Value (heart rate). I already separated Time into two columns, one for the date and one for the time only. My idea is to group the information by person, then by date, and get one average number per person per day. But my code is not doing that.
hr_avg <- hr_per_second %>% group_by(Id) %>% group_by(Date) %>% summarize(mean(Value))
As a result I get an average by date only. I can't do this manually because the dataset is so big that Excel can't open it, and I can't upload it to BigQuery either, which is the database I learned to use during my data analysis course. Thanks.
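A second group_by() replaces the first grouping rather than nesting inside it, which is why the result is averaged by date only. A minimal fix is to pass both columns to a single group_by() call; avg_hr here is just an illustrative name:
hr_avg <- hr_per_second %>%
  group_by(Id, Date) %>%
  summarize(avg_hr = mean(Value))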

Netsuite saved search formula that sums distinct values within a date range

I am trying to create a saved search of time entries in NetSuite.
Under criteria, I have specified a date range. This varies, as it is an available filter.
For the purposes of this example, the date range is 1/4/2020 to 10/4/2020.
The first column ('Total Customer Billable Hours') sums all time entries that are coded against the project task type 'billable project'. The formula I am using for this is:
Formula (Numeric), sum, Case when {project.task_type}='Billable' then {durationdecimal} else 0 end
For the second column, I want the sum of hours the employee would normally work in the time period specified under criteria (1/4/2020 to 10/4/2020 in this example).
The formula I am using to sum this is:
Formula(numeric), sum, {timesheet.workcalendarhoursdecimal}
However, this is multiplying the employee's weekly hours by the number of time entries that make up the 'Total Customer Billable Hours' figure,
i.e. if an employee works a 40-hour week, the formula computes 40 x 36 (36 being the number of time entries that make up the customer billable figure in this example).
What would the correct formula be so that the second column only sums the employee's work calendar hours for the period specified in the criteria/available filter selection?
Try changing sum to maximum, so that the work calendar hours figure is not added up once for every time entry line in the group:
Formula(numeric), maximum, {timesheet.workcalendarhoursdecimal}

How do I compute the moving average age of a product from the first day it is purchased?

I have a series of random data points with order date, expiry date, and product category.
I am trying to compute two age variables. The first would compute the rolling age using just the order date (hence no expiry date) between order dates.
In other words, I don't need age determined relative to the current date, but relative to each data point in the data. As such, the average would keep increasing as more products are purchased and the first ones get old. The second age variable would essentially cap each order's age at its expiry date, which means the rolling average would not necessarily be increasing.
I am trying to do this using R or Tableau.
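Here is a rough R sketch of both variables, assuming Date-class columns named order_date and expiry_date (both names are assumptions): at each order, every earlier product's age is measured relative to that order's date, and the capped variant stops a product's age at its expiry date.
df <- df[order(df$order_date), ]
# Uncapped: mean age of all products purchased up to and including order i
df$avg_age <- sapply(seq_len(nrow(df)), function(i)
  mean(as.numeric(df$order_date[i] - df$order_date[1:i])))
# Capped: a product stops ageing once it expires
df$avg_age_capped <- sapply(seq_len(nrow(df)), function(i)
  mean(as.numeric(pmin(df$order_date[i], df$expiry_date[1:i]) - df$order_date[1:i])))
The same computation could be run within each product category if per-category averages are needed.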

Extracting summarized information from one large dataset and then storing in another dataset

I am relatively new to R and I am facing an issue in extracting summarized data from one dataset and then storing it in another dataset. I am presently trying to analyze a flight dataset along with weather information. I have a query in this regard, which is as follows:
I have one dataset for the flight data. This dataset comprises columns such as flight number, date, origin city name, origin state name, departure time, departure time block (each block of one hour duration, spanning the entire 24 hours), arrival time, arrival time block, flight distance, total flight time, and a binary response variable indicating whether the flight was delayed or not.
I have another dataset which comprises weather data. This dataset includes columns such as date, time of recording, hour block (each block of one hour duration, spanning the entire 24 hours), state, city, and the respective weather-related data such as temperature, humidity levels, wind speed, etc. For each city, state, date, and single hour block, there can be several entries in the weather dataset, indicating varying temperature levels at various times within the hour block.
I am trying to extract the temperature and humidity levels from the weather dataset and place this information in the flight dataset. For this purpose, I have created four new columns in the flight dataset: origin temperature, origin humidity, destination temperature, and destination humidity.
My objective is to store in the flight dataset the average temperature and humidity levels for the origin city during the departure time hour block, as well as for the destination city during the arrival time hour block. The reason I am choosing the average temperature and humidity is that there can be multiple entries in the weather dataset for the same hour block, city, and state combination.
I have tried to extract the origin temperature information from the weather dataset using the following loop:
# Column 13 is the new origin temperature column; columns 1, 5, 3 and 2 hold
# the date, departure time block, origin state and origin city
for (i in 1:nrow(flight)) {
  flight[[i, 13]] <- mean(weather$temperature[weather$date == flight[[i, 1]] &
                                              weather$time_blk == flight[[i, 5]] &
                                              weather$state == flight[[i, 3]] &
                                              weather$city == flight[[i, 2]]])
}
However, the weather dataset contains around 2 million rows and the flight dataset comprises around 20,000 observations. As a result, the above calculation takes a very long time to execute, and even then the origin temperature in the flight dataset is not calculated properly.
I am also attaching the images of sample flight and weather datasets for reference.
Can you please suggest another approach for obtaining the above information, considering the constraints of the large datasets and the time required for processing?
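One way around the loop is to pre-average the weather once per (date, hour block, state, city) combination and then merge that summary into the flight data in a single vectorised step. A sketch in base R; the flight column names dep_time_blk, origin_state and origin_city, and the weather column humidity, are assumptions standing in for the numeric column indices used in the loop:
# One row per date/hour block/state/city, averaged over all recordings
weather_avg <- aggregate(cbind(temperature, humidity) ~ date + time_blk + state + city,
                         data = weather, FUN = mean)
# A single merge replaces ~20,000 subset-and-mean passes over 2 million rows
flight <- merge(flight, weather_avg,
                by.x = c("date", "dep_time_blk", "origin_state", "origin_city"),
                by.y = c("date", "time_blk", "state", "city"),
                all.x = TRUE)
# Repeat with the arrival time block and destination columns to fill the
# destination temperature and humidity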

Compute average over sliding time interval (7 days ago/later) in R

I've seen a lot of solutions for working with groups of times or dates, like aggregate to sum daily observations into weekly observations, or other solutions to compute a moving average, but I haven't found a way to do what I want, which is to pluck relative dates out of data keyed by an additional variable.
I have daily sales data for a bunch of stores. So that is a data.frame with columns
store_id date sales
It's nearly complete, but there are some missing data points, and those missing data points are (I suspect) having a strong effect on our models. So I used expand.grid to make sure we have a row for every store and every date, but at this point the sales data for those missing data points are NAs. I've found solutions like
dframe[is.na(dframe)] <- 0
or
dframe$sales[is.na(dframe$sales)] <- mean(dframe$sales, na.rm = TRUE)
but I'm not happy with the RHS of either of those. I want to replace missing sales data with our best estimate, and the best estimate of sales for a given store on a given date is the average of the sales 7 days prior and 7 days later. E.g. for Sunday the 8th, the average of Sunday the 1st and Sunday the 15th, because sales depend significantly on the day of the week.
So I guess I can use
dframe$sales[is.na(dframe$sales)] <- my_func(dframe)
where my_func(dframe) replaces every store's sales data with the average of that store's sales 7 days prior and 7 days later (ignoring for the first go-around the situation where one of those data points is also missing), but I have no idea how to write my_func in an efficient way.
How do I match up the store_id and the dates 7 days prior and future without using a terribly inefficient for loop? Preferably using only base R packages.
Something like:
with(
  dframe,
  ave(sales, store_id, FUN = function(x) {
    # x is one store's sales, ordered by date with one row per day; assumes
    # no NA falls within 7 rows of the start or end of a store's block
    naw <- which(is.na(x))
    # replace each NA with the mean of the values 7 days later and 7 days earlier
    x[naw] <- rowMeans(cbind(x[naw + 7], x[naw - 7]))
    x
  })
)
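This returns the filled vector rather than modifying dframe in place, so assign it back, e.g. dframe$sales <- with(dframe, ...) as above.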
