Using historical High/Low Tide data from NOAA to classify my data points as high or low based on date and time

I'm pretty new to R and I'm working on adding tide data to an animal movement data frame. My animal data includes a date_time column with multiple data points occurring in each day. The tide data also has multiple points per day, as tide typically changes between high and low 4 times per day. For this reason, I haven't tried to use a loop.
I'd like to create a new column in R that will classify each date/time into a "LOW" or "HIGH" category based on historical data from NOAA. So far, I've followed this: Classify date/time by tidal states
After organizing my data frames into tide2015 and rays2015, when I get to the last line of code:
Tidal.State15 = tide2015$High.Low[pmax(findInterval(rays2015$Date - 3600*3, tide2015$datetime),
                                       findInterval(rays2015$Date + 3600*3, tide2015$datetime))]
I get the same error as the original post:
Error in `$<-.data.frame`(`*tmp*`, Tidal.State, value = c("L", "H", "L", : replacement has 38 rows, data has 39
I'm not sure how to get around this error. I know there are no duplicates in my data, all my dates and times are in POSIXct format, and every date in my data is covered by the NOAA tide data.
Any suggestions would be greatly appreciated!
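One likely cause worth checking, sketched below: findInterval() returns 0 for any value that falls before the first breakpoint, and indexing a vector with 0 silently drops that element, which would make the replacement exactly one row shorter than the data frame. This is a diagnostic sketch using the object names from the post; the column names are assumed from the code shown.

```r
# Recompute the interval indices from the post
idx <- pmax(findInterval(rays2015$Date - 3600 * 3, tide2015$datetime),
            findInterval(rays2015$Date + 3600 * 3, tide2015$datetime))

# findInterval() gives 0 for any ray time earlier than the first tide
# record (note the -3 h shift can push a time before your first tide
# even if the date itself is covered); indexing with 0 drops the element,
# hence "replacement has 38 rows, data has 39"
table(idx == 0)

# Keep the row but mark its tidal state as unknown instead of dropping it:
idx[idx == 0] <- NA
rays2015$Tidal.State15 <- tide2015$High.Low[idx]
```

Indexing with NA returns NA rather than dropping the element, so the replacement length always matches nrow(rays2015).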

Related

Assign column variables by date (R)

I hope we're all doing great
I have several decades of daily rainfall data from several monitoring stations. Each station's record begins on a different date. I have combined them into a single data frame with the date in the first column and the rainfall depth in the second column. I want to sort the variable 'Total' by the variable 'Date and Time' (please see the links below).
ms1 <- read.csv('ms1.csv')
ms2 <- read.csv('ms2.csv')
etc.etc
df <- merge(ms1, ms2 etc. etc, by = "Date and Time")
The problem is that the range of dates differs for each monitoring station (csv file). There may also be missing dates within a range. Is there a way around this?
Would I have to create a separate vector with the greatest possible date range? Or would it automatically detect the earliest start date from the imported data?
for monitoring station 1 (ms1)
for monitoring station 2 (ms2)
Note: the data continues to the current date
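One way to handle the differing date ranges, sketched below: merge() with its default settings keeps only dates present in both frames, but with all = TRUE it performs a full outer join, keeping every date that appears in either station and filling the other station's rainfall with NA. Wrapping that in Reduce() chains the join across any number of stations. File and column names here are assumptions based on the post (note that read.csv usually converts a header like "Date and Time" to "Date.and.Time" unless check.names = FALSE is set).

```r
# Read all station files into a list
files <- c("ms1.csv", "ms2.csv")  # add the remaining stations here
stations <- lapply(files, read.csv, check.names = FALSE)

# Full outer join on the date column: all = TRUE keeps the union of
# dates, so no station's range is truncated and gaps show up as NA
df <- Reduce(function(x, y) merge(x, y, by = "Date and Time", all = TRUE),
             stations)
```

So there is no need to build a separate vector of all possible dates: the union of dates is detected automatically, and any date missing from a given station simply yields NA in that station's column.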

How Can I Create a Distribution Visualization for an Average Day?

In R, my dataframe ("sampledata") looks like this:
The timestamp column is POSIXct, format: "2018-10-01 00:03:23"
The state column is Factor w/ 3 levels "AVAILABLE", "MUST_NOT_RUN", "MUST_RUN"
There are 6 unique device_id. The timestamps for each device are not the same, meaning data was not always collected at the same minute for each device. In some cases, there are multiple records per minute for the same device.
I want to transform the data into a visualization that shows distribution of "state" across a "typical" day. Ideally, something like this:
I've tried to count each occurrence of "state" grouped by timestamp minutes but failed (Error: can't sum factors). I've been trying to use ggplot and geom_area for the visualization, but believe I need to restructure my data before it will work. Very new to R (obviously). Happy to read any tutorials or links provided as background and appreciate any help you can provide. Thanks!
Other information that may/may not be helpful:
There are a handful of columns in the dataframe not shown.
223,446 entries between 10/2/18 - 11/8/18.
You can take the hours from the timestamps and then compute proportions of your states by hour:
library(ggplot2)
library(plyr)

# get hours from timestamp
obj$hour <- as.POSIXlt(obj$timestamp)$hour

# get average state proportions per hour
plot_obj <- ddply(obj, .(hour),  # take data.frame "obj" and group by "hour"
                  function(x) with(x,
                    data.frame(100 * table(state) / length(state))))

ggplot(plot_obj, aes(x = hour, y = Freq, fill = state)) +
  geom_area()

How do I change a data frame into a time series?

I have 36 years of daily rainfall data. I want to analyze it as a time series, but it is still stored as a data frame. How do I change the data frame into a time series? The year, month, and date are in separate variables; how do I combine them so the full date is in a single column?
You could use a time series package for that, such as fpp, i.e. install.packages('fpp'). Since you don't give example code, I can't really help you properly with it, but it's quite easy.
ts(your_data, start = , frequency = ) At start = you put the year (or year and period) where the series begins, and at frequency = you put the number of observations per cycle: for daily data with a yearly seasonal cycle that is 365, regardless of how many years (36 here) the record spans.
You might want to check out https://robjhyndman.com/. He has an online (free) book available that walks you through the use of his package as well as providing useful information with respect to time series analysis.
Hope this helps.
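To make the advice above concrete, here is a minimal sketch. The object and column names (rain, depth) and the start date (1 January 1985) are assumptions for illustration; substitute your own.

```r
# Assumes `rain` is a data frame with one row per day, rainfall depth in
# column `depth`, ordered by date, starting 1 January 1985
rain_ts <- ts(rain$depth,
              start = c(1985, 1),   # year and period of the first observation
              frequency = 365)      # daily observations, yearly cycle

plot(rain_ts)  # base plot of the whole series
```

If your year, month, and day are in separate columns, combine them first with something like as.Date(paste(rain$year, rain$month, rain$day, sep = "-")) and sort by that date before calling ts(), since ts() assumes the values are already in time order.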

How to add dates to remote sensing data?

I am working with remote sensing data to create time series of NDVI and cloud cover. The data consist of 30 rows (latitude), 40 columns (longitude), and 212 matrix slices, which are the months from 2000-03 until 2017-10. However, R sees the data as an array and there is no date visible in the data, except that I know the 212 matrices are the months. How can I add a date to the matrix slices but still have the latitude and longitude to work with?
When I use:
NDVI_timeserie <- ts(name.ts.ndvi, frequency=12, start=c(2000,3))
This assigns the correct time to the data, but it is not saved as a date. So when I want to select only the rain season, I can't, because I can't select the months and years.
I am open to suggestions and advice of creating a time series.
Thank you so much!
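One possible approach, sketched below: rather than converting the array with ts(), keep the 30 x 40 x 212 array as-is and carry a parallel vector of Date objects for the third (time) dimension, then subset that dimension by month or year. The array name is taken from the post; the rain-season months are a placeholder assumption.

```r
# One Date per monthly slice: 212 months from March 2000 to October 2017
dates <- seq(as.Date("2000-03-01"), by = "month", length.out = 212)

# Optionally label the slices so the dates travel with the array
dimnames(name.ts.ndvi)[[3]] <- format(dates, "%Y-%m")

# Select only the rain-season slices (months assumed here: Nov-Apr),
# keeping the full latitude x longitude grid in the first two dimensions
rain_months <- c("11", "12", "01", "02", "03", "04")
ndvi_rain <- name.ts.ndvi[, , format(dates, "%m") %in% rain_months]
```

The same logical index on format(dates, "%Y") lets you pull out individual years, and the lat/lon structure is untouched because only the third dimension is subset.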

Extracting summarized information from one large dataset and then storing in another dataset

I am relatively new to R and I am facing an issue in extracting summarized data from one dataset and then storing it in another dataset. I am presently trying to analyze a flight dataset along with weather information, and I have a query in this regard, as follows:
I have one dataset for the flight data. This dataset comprises columns such as flight number, date, origin city name, origin state name, departure time, departure time block (each block of 1 hour duration, spanning the entire 24 hours), arrival time, arrival time block, flight distance, total flight time, and a binary response variable indicating whether the flight was delayed or not.
I have another dataset which comprises weather data. This dataset includes columns such as date, time of recording, hour block (each block of 1 hour duration, spanning the entire 24 hours), state, city, and the respective weather-related data such as temperature, humidity levels, wind speed, etc. For each city, state, date, and single hour block, there can be several entries in the weather dataset indicating varying temperature levels at various times in the hour block.
I am trying to extract the temperature and humidity levels from the weather dataset and place this information in the flight dataset. I have created four new columns in the flight dataset for origin temperature, origin humidity, destination temperature, and destination humidity for this purpose.
My objective is to store, in the flight dataset, the average temperature and humidity levels for the origin city during the departure hour block, as well as for the destination city during the arrival hour block. The reason I am choosing average temperature and humidity is that there can be multiple entries in the weather dataset for the same hour block, city, and state combination.
I have tried to extract the origin temperature information from weather dataset using the following loop:
for (i in 1:nrow(flight))
{
  # note: the original loop mixed `flight` and `train`; the same data
  # frame should be indexed throughout
  flight[[i, 13]] <- mean(weather$temperature[weather$date == flight[[i, 1]] &
                                              weather$time_blk == flight[[i, 5]] &
                                              weather$state == flight[[i, 3]] &
                                              weather$city == flight[[i, 2]]])
}
However, the weather dataset contains around 2 million rows and the flight dataset comprises around 20,000 observations. As a result, the above calculation takes a long time to execute, and even then the origin temperature in the flight dataset is not calculated properly.
I am also attaching the images of sample flight and weather datasets for reference.
Can you please suggest another approach for obtaining the above information, considering the size of the datasets and the time required for processing?
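One standard way to avoid the row-by-row loop, sketched below: collapse the weather data to one averaged row per city/state/date/hour block with aggregate(), then attach it to the flight data with a single vectorized merge(). All column names here are assumptions based on the description and should be replaced with the real ones.

```r
# 1. Average temperature and humidity once per city/state/date/hour block
#    (one pass over the 2M-row weather data instead of 20,000 subset scans)
weather_avg <- aggregate(cbind(temperature, humidity) ~ city + state + date + time_blk,
                         data = weather, FUN = mean)

# 2. Left join onto the flights by origin city and departure block;
#    all.x = TRUE keeps flights with no matching weather (they get NA)
flight <- merge(flight, weather_avg,
                by.x = c("origin_city", "origin_state", "date", "dep_time_blk"),
                by.y = c("city", "state", "date", "time_blk"),
                all.x = TRUE)

# 3. Repeat the merge for the destination city and arrival block,
#    renaming the temperature/humidity columns between the two joins
#    so the origin and destination pairs are kept separate.
```

Because both steps are vectorized, this typically runs in seconds where the nested loop takes minutes or hours, and it also sidesteps the flight/train indexing mix-up in the original loop.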