Daily count of individuals detected by each monitoring station - r

I have a very large (2+ years) data set of acoustic telemetry data. The study continuously monitors several animals fitted with acoustic tags, using stations throughout the study area to detect them.
I want to obtain a summary table that shows the number of individual tags detected by each station per day (e.g., "on 2019-04-02, five tags were detected at station 1, three tags at station 2"). Other time frames interest me too, but a daily count is the most relevant right now.
I haven't worked with R for long, so I'm having a hard time coming up with this kind of code, and I don't really have anything useful to show. I have several individuals and stations; below is an example of what the data looks like (time of day already dropped).
Thanks,
Date.and.time Station Transmitter
2019-04-02 Station_3 Tag_1
2019-04-02 Station_26 Tag_3
2019-04-02 Station_3 Tag_13
2019-04-02 Station_15 Tag_15
.
.
.
2021-10-14 Station_13 Tag_20
2021-10-15 Station_8 Tag_8
2021-10-15 Station_23 Tag_31
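One possible way to get the daily count described above is to group by day and station and count the distinct tags in each group. The following is a minimal base R sketch, not a definitive solution: the data frame name detections and the output column n_tags are made up for illustration, and the toy rows only mimic the columns shown in the example.

```r
# Toy data mirroring the column names in the example above
# (the object name `detections` is hypothetical).
detections <- data.frame(
  Date.and.time = as.Date(c("2019-04-02", "2019-04-02", "2019-04-02",
                            "2019-04-02", "2019-04-02", "2019-04-03")),
  Station = c("Station_3", "Station_3", "Station_3",
              "Station_26", "Station_15", "Station_3"),
  Transmitter = c("Tag_1", "Tag_13", "Tag_1",
                  "Tag_3", "Tag_15", "Tag_1"),
  stringsAsFactors = FALSE
)

# Count distinct tags per station per day; repeated detections of the
# same tag at the same station on the same day are counted once.
daily_counts <- aggregate(
  Transmitter ~ Date.and.time + Station,
  data = detections,
  FUN = function(x) length(unique(x))
)
names(daily_counts)[3] <- "n_tags"  # hypothetical column name

# Sort by day, then station, for readability.
daily_counts <- daily_counts[order(daily_counts$Date.and.time,
                                   daily_counts$Station), ]
print(daily_counts)
```

For other time frames, the same call should work after adding a grouping column, for example format(Date.and.time, "%Y-%m") for months, and using it in place of Date.and.time in the formula.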

Related

Time Series Analysis with weekly data, how to define frequency?

I am trying to analyze time series data from the Wooldridge Econometrics book containing weekly data on the New York Stock Exchange, beginning in January 1976 and ending in 1989.
I have never worked with the ts() function before, but I understand the general syntax. What I have difficulty with is how to define the frequency, since every fourth year has 366 instead of 365 days. The book states that when a date fell on a holiday or weekend, the next day on which the stock exchange was open was used.
So how exactly do I deal with this problem when creating a time series object?
Here is a screenshot of the first rows of the data frame:
[Screenshot: data frame of nyse]

How do I transform half-hourly data that does not span the whole day to a Time Series in R?

This is my first question on Stack Overflow, so sorry if the question is poorly put.
I am currently developing a project where I predict how much a person drinks each day. I currently have data that looks like this:
The menge column represents how much water a person has actually drunk in 30 minutes (so the first value represents the amount from 8:00 until just before 8:30, and so on). This is a one-day sample from 3 months of data. The day starts at 8 AM and ends at 8 PM.
I am trying to forecast the Time Series for each day. For example, given the first one or two time steps, we would predict the whole day and then we know how much in total the person has drunk until 8 PM.
I am trying to model this data as a Time Series object in R (Google Colab), in order to use Croston's Method for the forecasting. Using the ts() function, what should I set the frequency to knowing that:
The data is half-hourly
The data is from 8:00 till 20:00 each day (Does not span the whole day)
Would I need to make the data span the whole day by adding 0 values? Are there maybe better approaches for this? Thank you in advance.
When using the ts() function, the frequency defines the number of (usually regularly spaced) observations within a given time period. In your example, the observations are every 30 minutes between 8 AM and 8 PM, and the time period is 1 day. Choosing a period of 1 day assumes that the patterns within each day are of most interest here; you could also use 1 week.
So within each day of your data (8AM-8PM) you have 24 observations (24 half hours). So a suitable frequency for this data would be 24.
You can also pad the data with 0 values, however this isn't necessary and would complicate the model. If you padded the data so that it has observations for all half-hours of the day, the frequency would then be 48.
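As a rough illustration of the frequency choice above, here is a small sketch with simulated values. The numbers are invented, and only the name menge comes from the question; it shows a ts object with 24 observations per day and a daily total derived from it.

```r
# Simulated half-hourly intake for 3 days, 08:00-20:00, i.e. 24 values
# per day. The values are made up; intermittent data has many zeros.
set.seed(1)
menge <- rpois(24 * 3, lambda = 0.5) * 100

# frequency = 24: one period of the series corresponds to one observed
# day (8 AM to 8 PM), so no zero padding for the night is needed.
intake_ts <- ts(menge, start = c(1, 1), frequency = 24)

# Total amount per day, obtained by summing the 24 half-hour slots.
daily_total <- aggregate(intake_ts, nfrequency = 1, FUN = sum)

print(frequency(intake_ts))
print(daily_total)
```

If the series were padded to cover all 48 half-hours of each day, the same call would use frequency = 48 instead.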

Extracting summarized information from one large dataset and then storing in another dataset

I am relatively new to R and I am having trouble extracting summarized data from one dataset and storing it in another. I am trying to analyze a flight dataset together with weather information, and my question is as follows:
I have one dataset for the flight data. It comprises columns such as flight number, date, origin city name, origin state name, departure time, departure time block (each block lasting 1 hour, spanning the entire 24 hours), arrival time, arrival time block, flight distance, total flight time, and a binary response variable indicating whether the flight was delayed or not.
I have another dataset with weather data. It includes columns such as date, time of recording, hour block (each block lasting 1 hour, spanning the entire 24 hours), state, city, and the respective weather data such as temperature, humidity levels, wind speed etc. For each combination of city, state, date and hour block, there can be several entries in the weather dataset, recording different temperature levels at various times within the hour block.
I am trying to extract the temperature and humidity levels from the weather dataset and place this information in the flight dataset. For this purpose I have created four new columns in the flight dataset: origin temperature, origin humidity, destination temperature and destination humidity.
My objective is to store the average temperature and humidity levels for the origin city during the departure time hour block as well as for the destination city during the arrival time hour block in the flight dataset. The reason I am choosing average temperature and humidity is because there can be multiple entries in weather dataset for the same hour block,city and state combination.
I have tried to extract the origin temperature information from weather dataset using the following loop:
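for (i in seq_len(nrow(flight))) {
  # Average temperature over all weather rows matching this flight's
  # date, hour block, state and city (all looked up in flight, not train)
  flight[[i, 13]] <- mean(weather$temperature[
    weather$date == flight[[i, 1]] & weather$time_blk == flight[[i, 5]] &
      weather$state == flight[[i, 3]] & weather$city == flight[[i, 2]]
  ])
}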
However, the weather dataset contains around 2 million rows and the flight dataset around 20,000 observations. As a result, the above calculation takes a long time to execute, and even then the origin temperature in the flight dataset is not calculated properly.
I am also attaching images of sample flight and weather datasets for reference.
Can you please suggest another approach for obtaining this information, considering the size of the datasets and the processing time required?
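One common alternative to the row-by-row loop is to average the weather data once per date, hour block, state and city, and then join the result onto the flights. The sketch below uses base R with tiny made-up tables; every column name in it (dep_time_blk, origin_state, origin_city, origin_temperature, origin_humidity and so on) is an assumption, since the real column layout is only visible in the attached images.

```r
# Toy stand-ins for the two datasets; all column names are assumptions.
weather <- data.frame(
  date = c("2016-01-01", "2016-01-01", "2016-01-01", "2016-01-01"),
  time_blk = c("0800-0859", "0800-0859", "0900-0959", "0800-0859"),
  state = c("NY", "NY", "NY", "IL"),
  city = c("New York", "New York", "New York", "Chicago"),
  temperature = c(30, 34, 36, 20),
  humidity = c(60, 70, 65, 80),
  stringsAsFactors = FALSE
)
flight <- data.frame(
  flight_num = c(101, 102, 103),
  date = c("2016-01-01", "2016-01-01", "2016-01-02"),
  dep_time_blk = c("0800-0859", "0900-0959", "0800-0859"),
  origin_state = c("NY", "NY", "IL"),
  origin_city = c("New York", "New York", "Chicago"),
  stringsAsFactors = FALSE
)

# Step 1: collapse the weather data to one row per
# date / hour block / state / city, holding the averages.
weather_avg <- aggregate(
  cbind(temperature, humidity) ~ date + time_blk + state + city,
  data = weather, FUN = mean
)
names(weather_avg)[5:6] <- c("origin_temperature", "origin_humidity")

# Step 2: left join onto the flights. all.x = TRUE keeps flights with
# no matching weather rows; their averages are NA.
# Note that merge() may reorder the rows of flight.
flight <- merge(
  flight, weather_avg,
  by.x = c("date", "dep_time_blk", "origin_state", "origin_city"),
  by.y = c("date", "time_blk", "state", "city"),
  all.x = TRUE
)
print(flight)
```

The destination columns could presumably be filled the same way, with a second merge keyed on the arrival time block and destination city and state. For 2 million weather rows, the aggregation step runs once instead of once per flight.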

Statistical analysis on daily data

I have a number of data points that I am trying to extract a meaningful pattern from (or derive an equation that could then be predictive). I am trying to find a correlation (?) between RANK and DAILY SALES for any given ITEM.
So, for any given item, I have (say) two weeks of daily information, each day consists of a pairing of Inventory, and Rank.
ITEM #1
Monday: 20 in stock (rank 30)
Tuesday: 17 in stock (rank 29)
Wednesday: 14 in stock (rank 31)
The presumption is that 3 items were sold each day, and that selling ~3 a day is roughly what it means to have a rank of ~30.
Given information like this across a wide span (20,000 items, over 2 weeks) of inventory/rank/date pairings, I'd like to derive an equation/method of estimating what the daily sales would be for any given rank.
There's one problem:
The data isn't entirely clean, because -occasionally- the inventory fluctuates upward, either because of re-stocking, or because of returns. So for example, you might see something like
MONDAY: 30 in stock.
TUESDAY: 20 in stock.
WEDNESDAY: 50 in stock.
THURSDAY: 40 in stock.
FRIDAY: 41 in stock.
This indicates that between Tuesday and Wednesday 30 more were replenished, and that on Thursday one was returned.
I am planning to use the mean and standard deviation of daily sales for a given rank.
So, given any rank, I can predict the daily sales based on the mean and standard deviation values.
Is this the correct approach? Is there a better approach for this scenario?
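The plan described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the data frame inventory and its column names are invented, the ranks are made up, each day's sales are paired with that day's rank, and days on which stock went up are simply dropped rather than corrected.

```r
# Toy data using the stock levels from the example above; the ranks
# and all column names are hypothetical.
inventory <- data.frame(
  item = rep("item_1", 5),
  day = 1:5,
  stock = c(30, 20, 50, 40, 41),
  rank = c(30, 29, 31, 29, 30),
  stringsAsFactors = FALSE
)

# Day-over-day change in stock within each item. The first day of each
# item has no previous day, so its change is NA.
inventory <- inventory[order(inventory$item, inventory$day), ]
inventory$change <- ave(inventory$stock, inventory$item,
                        FUN = function(x) c(NA, diff(x)))

# Treat a decrease as sales. Increases (restocks, returns) are set to
# NA so they do not distort the estimate.
inventory$sales <- ifelse(inventory$change <= 0, -inventory$change, NA)

# Mean and standard deviation of daily sales per rank, using only the
# days with a usable sales figure.
clean <- inventory[!is.na(inventory$sales), ]
sales_mean <- aggregate(sales ~ rank, data = clean, FUN = mean)
sales_sd <- aggregate(sales ~ rank, data = clean, FUN = sd)
print(sales_mean)
print(sales_sd)
```

One caveat with this cleaning rule: a day with both sales and a restock or return still shows only the net change, so a small net decrease can understate the true sales.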
Sounds like this could be a good read for you: fpp.
It provides an introduction to time series forecasting. Time series forecasting has a lot of nuance, so it can trip people up pretty easily. Some of the issues you have already noted (e.g. seasonality). Others pertain to the statistical properties of such series of data. Take a look through this and

How to prepare my data for Neural Network training for natural gas (NATGAS) price predictions?

I want to create a neural network (NN) to predict future natural gas prices.
I'm not sure it's a simple time series problem:
Each month (so 12 of these) I follow a futures spread (e.g. Sep-Oct) until the front contract expires.
I follow each spread for approx. 60 days (data points).
For each of the data points I have other inputs, e.g. weather, inventories for the week, price of coal etc.
I have the previous 5 years of data for each spread for each month of the year.
I want the NN to learn whether it can predict the direction of the current month's spread for the next x days of the 60 days, given that I'll know weather, inventories, coal prices etc. (the feature vector) at the moment of prediction.
The question I'd like answered is whether it can predict: "given this year's inventories, weather patterns and coal price, where will the spread go over the last 20 days of the contract?"
Is this suitable for an NN-based predictor?
If so, how should I prepare my data?
I'm using MATLAB.
