Extracting summarized information from one large dataset and storing it in another dataset - r

I am relatively new to R and I am facing an issue extracting summarized data from one dataset and storing it in another. I am currently analyzing a flight dataset together with weather information, and I have a query in this regard:
I have one dataset for the flight data. This dataset comprises columns such as flight number, date, origin city name, origin state name, departure time, departure time block (each block of 1 hour duration spanning the entire 24 hours), arrival time, arrival time block, flight distance, total flight time, and a binary response variable indicating whether the flight was delayed.
I have another dataset with the weather data. It includes columns such as date, time of recording, hour block (each block of 1 hour duration spanning the entire 24 hours), state, city, and the weather variables such as temperature, humidity, wind speed, etc. For a given city, state, date, and hour block there can be several entries in the weather dataset, recording the varying temperature at different times within that hour block.
I am trying to extract the temperature and humidity from the weather dataset and place this information in the flight dataset. For this purpose I have created four new columns in the flight dataset: origin temperature, origin humidity, destination temperature, and destination humidity.
My objective is to store in the flight dataset the average temperature and humidity for the origin city during the departure hour block, and for the destination city during the arrival hour block. I am using averages because there can be multiple weather entries for the same hour block, city, and state combination.
I have tried to extract the origin temperature from the weather dataset using the following loop:
for (i in 1:nrow(flight)) {
  # flight columns used here: 1 = date, 2 = origin city, 3 = origin state,
  # 5 = departure time block, 13 = origin temperature
  flight[[i, 13]] <- mean(weather$temperature[weather$date == flight[[i, 1]] &
                                              weather$time_blk == flight[[i, 5]] &
                                              weather$state == flight[[i, 3]] &
                                              weather$city == flight[[i, 2]]])
}
However, the weather dataset contains around 2 million rows and the flight dataset has around 20,000 observations. As a result, the above calculation takes a very long time to run, and even then the origin temperature in the flight dataset is not calculated properly.
I am also attaching images of the sample flight and weather datasets for reference.
Can you please suggest another approach for obtaining this information, given the size of the datasets and the processing time required?
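One alternative worth sketching (the column names origin_state, origin_city, dep_time_blk, and weather$humidity below are assumptions based on the description above): rather than scanning the 2 million weather rows once per flight, average the weather data once per date/state/city/hour block and then join the result onto the flight data with dplyr:

library(dplyr)

# collapse the weather data to one row per date/state/city/hour block
weather_avg <- weather %>%
  group_by(date, state, city, time_blk) %>%
  summarize(avg_temp = mean(temperature),
            avg_humidity = mean(humidity),
            .groups = "drop")

# attach the origin-side averages; a second left_join() using the
# destination city/state and arrival time block fills the destination columns
flight <- flight %>%
  left_join(weather_avg,
            by = c("date",
                   "origin_state" = "state",
                   "origin_city"  = "city",
                   "dep_time_blk" = "time_blk")) %>%
  rename(origin_temp = avg_temp, origin_humidity = avg_humidity)

This does one grouped pass over the weather data and one keyed join, which should take seconds rather than hours.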

Related

Apply cumsum function using condition

I am trying to calculate, per station, the maximum number of aircraft on the ground simultaneously over the year. I have more than 300 stations, and the data covers one year at day-and-hour resolution.
So I thought of this solution: find the maximum per day and per station, then take the maximum per station.
My data are in this format: station, aircraft, time, type (arrival at the station or departure from the station), and a value column that is 1 for an arrival and -1 for a departure. I created this column to make the counting easier; the idea is to apply cumsum once the data are sorted by time for each station.
I need to create a function that groups the data by day and by station and computes the cumulative sum. However, some planes have been sleeping at the station, so I need to remove them (the yellow lines in the screenshot).
The trick to detect these planes: the Aircraft column lets us track each plane, and a plane generally appears twice a day, once when it arrives and once when it leaves. So I have to look at the Aircraft and Type variables: if the type is departure and that aircraft appears only once that day (meaning there is no arrival for that flight), then it should not be counted.
I was thinking of writing a function that groups by station and time and then applies cumsum while skipping the lines meeting the condition above (type is departure and the aircraft appears only once that day).
Any help?
I solved this by building a frequency table with table() to count how often each aircraft appears per day, joining that table back onto my original data frame, and then deleting the rows with Freq == 1 and TYPE == "Departure":
library(dplyr)
tab <- as.data.frame(table(df$day, df$AIRCRAFT, useNA = "always"))
df_new <- df %>% left_join(tab, by = c("day" = "Var1", "AIRCRAFT" = "Var2"))
df_final <- df_new %>%
  filter(!(TYPE == "Departure" & Freq == 1)) %>%  # drop departures with no arrival that day
  arrange(TIMESTAMP) %>%
  group_by(day) %>%
  mutate(ground = cumsum(VALUE))  # running count from the +1/-1 column (name assumed)
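For what it's worth, the counting and filtering can also be done in one pipeline with dplyr's add_count(), avoiding the intermediate table (a sketch reusing the column names above; the +1/-1 column name VALUE is an assumption):

library(dplyr)

df_final <- df %>%
  add_count(day, AIRCRAFT, name = "n_per_day") %>%     # appearances of each aircraft that day
  filter(!(TYPE == "Departure" & n_per_day == 1)) %>%  # skip departures with no matching arrival
  arrange(TIMESTAMP) %>%
  group_by(day) %>%
  mutate(ground = cumsum(VALUE))                       # running on-ground count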

Average after 2 group_by's in R

I am new to R and can't find the right syntax for a specific average I need. I have a large Fitbit dataset of heart rate per second for 30 people, covering a month each. I want an average heart rate per day per person, to make the data easier to manage and join with other Fitbit data.
First few lines of Data
The columns I have are Id (person ID), Time (date-time), and Value (heart rate). I have already separated Time into two columns, one for the date and one for the time only. My idea is to group the information by person, then by date, and get one average number per person per day, but my code is not doing that.
hr_avg <- hr_per_second %>% group_by(Id) %>% group_by(Date) %>% summarize(mean(Value))
As a result I get an average by date only. I can't do this manually because the dataset is so big that Excel can't open it, and I can't upload it to BigQuery either, which is the database I learned to use during my data analysis course. Thanks.
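For reference, a second group_by() replaces the first grouping rather than adding to it, which is why the result comes out averaged by date only. A minimal fix (assuming the column names shown above) is to group by both variables in a single call:

library(dplyr)

# group by person AND date together, then average the heart rate
hr_avg <- hr_per_second %>%
  group_by(Id, Date) %>%
  summarize(avg_hr = mean(Value), .groups = "drop")

Alternatively, group_by(Date, .add = TRUE) adds the second grouping variable instead of replacing the first.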

Assign column variables by date (R)

I hope we're all doing great.
I have several decades of daily rainfall data from several monitoring stations. The data all begin at separate dates. I have combined them into a single data frame with the date in the first column and the rainfall depth in the second column. I want to sort the variable 'Total' by the variable 'Date and time' (please see the links below).
ms1 <- read.csv("ms1.csv")
ms2 <- read.csv("ms2.csv")
# etc. etc.
df <- merge(ms1, ms2, by = "Date and Time")  # and so on for the other stations
The problem is that the range of dates differs for each monitoring station (CSV file). There may also be missing dates within a range. Is there a way around this?
Would I have to create a separate vector with the greatest possible date range, or would it automatically detect the earliest start date from the imported data?
[Screenshots: the data for monitoring station 1 (ms1) and for monitoring station 2 (ms2). Note: the data continues to the current date.]
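A sketch of one way around this (assuming each CSV reads in with columns Date.and.Time and Total): merge() with all = TRUE performs a full outer join, so every date that appears in any station is kept and stations without a reading on that date get NA. Reduce() applies the pairwise merge across all the stations, so no separate master date vector is needed:

files <- c("ms1.csv", "ms2.csv")  # one entry per monitoring station
stations <- lapply(files, read.csv)

# full outer join on the date column: all = TRUE keeps dates that are
# missing from some stations, filling their rainfall depth with NA
df <- Reduce(function(x, y) merge(x, y, by = "Date.and.Time", all = TRUE),
             stations)

Note that read.csv() converts a header like "Date and Time" to Date.and.Time by default, and merge() will suffix the duplicate Total columns as Total.x, Total.y, and so on.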

Using historical High/Low Tide data from NOAA to classify my data points as high or low based on date and time

I'm pretty new to R and I'm working on adding tide data to an animal movement data frame. My animal data includes a date_time column with multiple data points occurring in each day. The tide data also has multiple points per day, as tide typically changes between high and low 4 times per day. For this reason, I haven't tried to use a loop.
I'd like to create a new column in R that will classify each date/time into a "LOW" or "HIGH" category based on historical data from NOAA. So far, I've followed this: Classify date/time by tidal states
After organizing my data frames into tide2015 and rays2015, when I get to the last line of code:
Tidal.State15 <- tide2015$High.Low[pmax(findInterval(rays2015$Date - 3600*3, tide2015$datetime),
                                        findInterval(rays2015$Date + 3600*3, tide2015$datetime))]
I get the same error as the original post:
Error in `$<-.data.frame`(`*tmp*`, Tidal.State, value = c("L", "H", "L", : replacement has 38 rows, data has 39
I'm not sure how to get around this error. I know there are no duplicates in my data, all my dates and times are in POSIXct format, and all dates in my data are accounted for in the NOAA tide data.
Any suggestions would be greatly appreciated!
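One likely cause, offered as a guess from how findInterval() behaves: it returns 0 for any timestamp that falls before the first tide record, and indexing a vector with 0 silently drops that element, so the replacement vector ends up one element shorter than the data frame (38 vs. 39). Mapping the 0 indices to NA keeps one entry per row:

idx <- pmax(findInterval(rays2015$Date - 3600*3, tide2015$datetime),
            findInterval(rays2015$Date + 3600*3, tide2015$datetime))

# an index of 0 means the observation precedes the first tide record;
# indexing with NA returns NA instead of dropping the element
idx[idx == 0] <- NA
rays2015$Tidal.State15 <- tide2015$High.Low[idx]

Checking which(idx == 0) should point at the offending observation.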

R: subsetting timestamped dataframe periodically

I have a CSV file that contains many thousands of timestamped data points. The file includes the following columns: Date, Tag, East, North & DistFromMean. (A sample of the data was attached as an image.)
The data is recorded approximately every 15 minutes for 12 tags over a month. What I want to do is select subsets of the data at a fixed interval, starting from the first date entry, e.g. every 3 hours; but because the tags transmit at slightly different rates, I need minimum and maximum start and end times around each interval.
I have found a related previous question, but I don't understand the answer well enough to implement it.
The solution could first ask for the tag number, then for the period required, perhaps in minutes from the start time (i.e. every 3 hrs, or 180 minutes), and then for the minimum and maximum of the time range, both of which would be constant for whatever period was used. The minimum and maximum would probably need to be plus and minus 6 minutes from the selected period.
As the code below shows, I've managed to read in the file, convert the Date column to POSIXlt, and extract data within a specific time frame, but the part I'm stuck on is extracting the data every nth minute and within a range.
TestData <- read.csv("TestData.csv", header = TRUE, as.is = TRUE)
TestData$Date <- strptime(TestData$Date, "%d/%m/%Y %H:%M")
TestData[TestData$Date >= as.POSIXlt("2014-02-26 07:10:00") &
         TestData$Date < as.POSIXlt("2014-02-26 07:18:00"), ]
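A sketch of one way to do the periodic windowing (the function name, the example tag value, and the period/tol defaults are all hypothetical; it assumes Date has been parsed as above):

# keep rows within +/- tol minutes of each period boundary, starting
# from the first timestamp recorded for the chosen tag
subset_periodic <- function(data, tag, period = 180, tol = 6) {
  d <- data[data$Tag == tag, ]
  d$Date <- as.POSIXct(d$Date)  # POSIXct is simpler to index than POSIXlt
  targets <- seq(min(d$Date), max(d$Date), by = period * 60)
  # distance in seconds from each row to its nearest period boundary
  secs <- as.numeric(d$Date)
  tsec <- as.numeric(targets)
  nearest <- sapply(secs, function(t) min(abs(t - tsec)))
  d[nearest <= tol * 60, ]
}

every3h <- subset_periodic(TestData, tag = 12345, period = 180, tol = 6)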
