Filling gaps of time data with zero-values - r

In my data (https://pastebin.com/CernhBCg) I have irregular timestamps and a corresponding value. In addition to the irregularity, there are large gaps for which I have no values. I know, however, that the value is zero during those gaps, and I would like to fill the gaps with rows where value = 0. How can I do this?
Data
> dput(head(hub2_select,10))
structure(list(time = structure(c(1492033212.648, 1492033212.659,
1492033212.68, 1492033212.691, 1492033212.702, 1492033212.724,
1492033212.735, 1492033212.757, 1492033212.768, 1492033212.779
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), value = c(3,
28, 246, 297, 704, 798, 1439, 1606, 1583, 1572)), .Names = c("time",
"value"), row.names = c(NA, 10L), class = "data.frame")
Please take the file I provided to see the data and read it into R with
library(readr)
df <- read_csv("data.csv", col_types = list(time = col_datetime(), value = col_double()))
Solutions
For one, the values to the left and right of a gap are usually 0 or 1, so that might help. I thought I'd use a rolling join, but from what I understand by now, this does not seem to be the way to go.
What works is
library(dplyr)
library(lubridate)
threshold_time = dseconds(2)
time_prev = df$time[1]
addrows = data.frame()
for (i in seq(2, nrow(df), 1)) {
  time_current <- df$time[i]
  if ((time_current - time_prev) > threshold_time) {
    time_add <- seq(time_prev, time_current, dseconds(0.1))
    addrows = bind_rows(addrows, data.frame(time = time_add, value = rep(0, length(time_add))))
  }
  time_prev <- time_current
}
addrows$type <- 'filled'
df$type <- 'orig'
df_new <- bind_rows(df, addrows)
library(ggplot2)
ggplot(df_new, aes(time,value,color=type)) + geom_point()
But this solution is neither elegant nor efficient (I did not test efficiency though).

Honestly, I haven't tried it yet (I had to switch to Python for other reasons and solved it there, and never got around to trying it in R), but I am pretty sure the padr package (https://cran.r-project.org/web/packages/padr/vignettes/padr.html) would have been the answer. I just wanted to write this here for other readers with the same question.
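For readers who want to stay in base R, here is a minimal sketch of the same gap-filling idea without a row-by-row loop. It is not the padr approach and was not part of the original post; the helper name fill_gaps and its arguments threshold and step (both in seconds) are made up for illustration. Unlike the loop in the question, it starts and stops one step inside each gap so existing timestamps are not duplicated, and it assumes threshold is larger than twice step.

```r
# Hypothetical helper (not from the post): insert zero-valued rows into every
# gap longer than `threshold` seconds, spaced `step` seconds apart.
# Assumes threshold > 2 * step so each fill sequence is non-empty.
fill_gaps <- function(df, threshold = 2, step = 0.1) {
  # Index of the row that precedes each large gap
  gap_start <- which(diff(as.numeric(df$time)) > threshold)
  df$type <- "orig"
  if (length(gap_start) == 0) return(df)
  filled <- lapply(gap_start, function(i) {
    # Stay one step inside the gap so existing rows are not duplicated
    data.frame(time = seq(df$time[i] + step, df$time[i + 1] - step, by = step),
               value = 0, type = "filled")
  })
  out <- rbind(df, do.call(rbind, filled))
  out <- out[order(out$time), ]
  rownames(out) <- NULL
  out
}

# Toy data with one 4-second gap between the 2nd and 3rd rows
toy <- data.frame(
  time = as.POSIXct(c(0, 1, 5, 6), origin = "1970-01-01", tz = "UTC"),
  value = c(3, 28, 246, 297)
)
res <- fill_gaps(toy, threshold = 2, step = 1)
```

The result keeps the type column from the question, so the same ggplot call can colour original and filled points.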

Related

Prediction on time series analysis using ARIMA in R

I am new to programming and am attempting to create a prediction model for multiple articles.
Unfortunately, using Excel or similar software is not possible for this task. Therefore, I have installed RStudio to solve this problem. My goal is to make an 18-month prediction for each article in my dataset using an ARIMA model.
However, I am currently facing an issue with the format of my data frame. Specifically, I am unsure of how my CSV should be structured to be read by my code.
I have attached an image of my current dataset in CSV format : https://i.stack.imgur.com/AQJx1.png
Here is my dput(sales_data) :
structure(list(X.Article.1.Article.2.Article.3 = c("janv-19;42;49;55", "f\xe9vr-19;56;58;38", "mars-19;55;59;76")), class = "data.frame", row.names = c(NA, -3L))
Here is the code I have constructed so far with the help of blogs and websites:
library(forecast)
library(reshape2)
sales_data <- read.csv("sales_data.csv", header = TRUE)
sales_data_long <- reshape2::melt(sales_data, id.vars = "Code Article")
for(i in 1:nrow(sales_data_long)) {
sales_data_article <- subset(sales_data_long, sales_data_long$`Code Article` == sales_data_long[i,"Code Article"])
sales_ts <- ts(sales_data_article$value, start = c(2010,6), frequency = 12)
arima_fit <- auto.arima(sales_ts)
arima_forecast <- forecast(arima_fit, h = 18)
print(arima_forecast)
print("Article: ", Code article[i])
}
With this code, RStudio gives me the following error : "Error: id variables not found in data: Code Article"
Currently, I am not interested in generating any plots or outputs. My main focus is on identifying the appropriate format for my data.
Do I need to modify my CSV file and separate each column using "," or ";"? Or, can I keep my data in its current format and make adjustments in the code instead?
Added the dput output, as per jrcalabrese's request.
I swapped reshape2 for its replacement, tidyr, and used pivot_longer. This no longer gives the error that occurred in reshape2::melt.
The CSV structure doesn't matter so much; your structure was fine.
Hope this helps! :-)
library(tidyr)
sales_data <- structure(list(var1 = c("Article 1", "Article 2", "Article 3"),
`janv-19` = c(42, 56, 55),
`fev-19` = c(49, 58, 59),
`mars-19` = c(55, 38, 76)),
row.names = c(NA, 3L), class = "data.frame")
sales_data_long <- sales_data |> pivot_longer(!var1,
names_to = "month",
values_to = "count")
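A side note on reading the file itself: the dput output in the question shows a single column holding strings such as "janv-19;42;49;55", which suggests the file is semicolon-separated while read.csv() splits on commas by default. The sketch below is an assumption-laden illustration, not part of the answer above: the inline text stands in for sales_data.csv, the first header "Month" is invented (the real header appears to be empty), and "fevr-19" is written without the accent to sidestep encoding issues.

```r
# Stand-in for the contents of sales_data.csv (the "Month" header is assumed)
csv_text <- "Month;Article 1;Article 2;Article 3
janv-19;42;49;55
fevr-19;56;58;38
mars-19;55;59;76"

# sep = ";" splits the columns; check.names = FALSE keeps "Article 1" as-is
sales_data <- read.csv(text = csv_text, sep = ";", check.names = FALSE)

# One monthly time series per article column, starting January 2019
series <- lapply(sales_data[-1], ts, start = c(2019, 1), frequency = 12)

# Each element of `series` could then be handed to forecast::auto.arima(),
# given enough observations (three months is far too few for a real fit).
```

With the real file, read.csv("sales_data.csv", sep = ";", check.names = FALSE) would be the corresponding call.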

R plotting annual data and "January" repeated at end of graph

I'm fairly new to R and am trying to plot some expenditure data. I read the data in from a CSV file exported from Excel and then do some manipulation on the dates:
data <- read.csv("Spending2019.csv", header = T)
#converts time so R can use the dates
strdate <- strptime(data$DATE,"%m/%d/%Y")
newdate <- cbind(data,strdate)
finaldata <- newdate[order(strdate),]
This probably isn't the most efficient, but it gets me there :)
Here are the relevant columns of the first four lines of my finaldata dataframe:
dput(droplevels(finaldata[1:4,c(5,7)]))
structure(list(AMOUNT = c(25.13, 14.96, 43.22, 18.43), strdate = structure(c(1546578000,
1546750800, 1547010000, 1547010000), class = c("POSIXct", "POSIXt"
), tzone = "")), row.names = c(NA, 4L), class = "data.frame")
The full data set has 146 rows and the dates range from 1/4/2019 to 12/30/2019
I then plot the data
plot(finaldata$strdate,finaldata$AMOUNT, xlab = "Month", ylab = "Amount Spent")
and I get this plot
This is fine for me getting started, EXCEPT why is JAN repeated at the far right end? I have tried various forms of xlim and can't seem to get it to go away.
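A likely explanation, not confirmed in this thread: plot() pads the x-axis slightly beyond the data, so with data ending on 12/30/2019 the tick for 1 January 2020 falls inside the plot region and is labelled with the month abbreviation only. One way to make this visible, or to control it, is to suppress the default axis and draw the ticks explicitly. The sketch below uses a made-up three-row stand-in for finaldata.

```r
# Made-up stand-in for finaldata (dates chosen to span the same year)
finaldata <- data.frame(
  strdate = as.POSIXct(c("2019-01-04", "2019-06-15", "2019-12-30"), tz = "UTC"),
  AMOUNT = c(25.13, 14.96, 43.22)
)

# Monthly tick positions from January 2019 through January 2020
ticks <- seq(as.POSIXct("2019-01-01", tz = "UTC"),
             as.POSIXct("2020-01-01", tz = "UTC"), by = "month")

# xaxt = "n" suppresses the default x-axis so it can be drawn by hand
plot(finaldata$strdate, finaldata$AMOUNT, xaxt = "n",
     xlab = "Month", ylab = "Amount Spent")

# Including the year in the label shows that the last tick is January 2020
axis.POSIXct(1, at = ticks, format = "%b %Y")
```

Dropping the last element of ticks (ticks[-length(ticks)]) would remove the trailing January label altogether.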

Issue with time variables using strptime

I am currently working on a project and I need some help. I want to predict the length of flight delays, using a statistical model. The data set does not contain the length of flight delays, but it can be calculated from the actual and scheduled departure times.
I will include a link if you want the whole dataset:
https://drive.google.com/file/d/11BXmJCB5UGEIRmVkM-yxPb_dHeD2CgXa/view?usp=sharing
I then ran the following code
Delays <- read.table("FlightDelays.csv", header=T, sep=",")
DepatureTime <- strptime(formatC(Delays$deptime, width = 4, format = "d", flag = "0"), "%H%M")
ScheduleTime <- strptime(formatC(Delays$schedtime, width = 4, format = "d", flag = "0"), "%H%M")
DelayTime <- as.numeric(difftime(DepatureTime, ScheduleTime))/60
DelayData <- data.frame(DelayTime, Delays)
The above code allowed me to get the delay time in minutes
For those of you who do not want to obtain the whole dataset I will now include a small example of some observations of the form
structure(list(schedtime = c(1455, 1640, 1245, 1715, 1039 , 2120), deptime = c(1455, 1640, 1245, 1709, 1035, 0010)), .Names = c("schedtime", "deptime"), row.names = c(NA, 6L), class = "data.frame")
If you run the code from the beginning, the delay for the 6th observation is -1270 minutes rather than 170 minutes. I believe strptime assumes both times are on the same day and doesn't recognise that the delay pushed the departure into the early hours of the following day.
How can I get the code to recognise that a delay will sometimes push the departure time into the following day?
Thank you for any help
Using lubridate:
library(lubridate)
ScheduleTime <- as_datetime(formatC(Delays$schedtime, width = 4, format = "d", flag = "0"),format="%H%M")
DepatureTime <- as_datetime(formatC(Delays$deptime, width = 4, format = "d", flag = "0"),format="%H%M") + hours(ifelse(Delays$deptime < Delays$schedtime & Delays$schedtime > 2000,24,0))
DelayTime <- difftime(DepatureTime, ScheduleTime, units = "mins")  # request minutes explicitly instead of dividing by 60
DelayData <- data.frame(DelayTime, Delays)
The problem is that you have to decide when a deptime smaller than schedtime indicates a shift to the next day rather than a flight leaving early. I don't see a general way around that.
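The same decision can also be written without any date-time parsing, working in minutes since midnight. This is only a sketch: the helper names to_minutes and delay_minutes are made up, and the cutoff max_early = 720 (treat anything that looks more than 12 hours early as a next-day departure) is an arbitrary assumption that plays the same role as the schedtime > 2000 rule above.

```r
# Hypothetical helper: convert HHMM stored as a number (e.g. 1455) to minutes
to_minutes <- function(hhmm) (hhmm %/% 100) * 60 + hhmm %% 100

# Hypothetical helper: delay in minutes, wrapping to the next day when the
# departure looks more than `max_early` minutes early (an assumed cutoff)
delay_minutes <- function(schedtime, deptime, max_early = 720) {
  d <- to_minutes(deptime) - to_minutes(schedtime)
  ifelse(d < -max_early, d + 1440, d)
}

# The six sample observations from the question
schedtime <- c(1455, 1640, 1245, 1715, 1039, 2120)
deptime   <- c(1455, 1640, 1245, 1709, 1035, 10)
delay <- delay_minutes(schedtime, deptime)
# delay is 0, 0, 0, -6, -4, 170
```

The sixth observation comes out as 170 minutes, while the two slightly early departures stay negative.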

Can't import R dump() data into JAGS, "unexpected LIST"

I used R dump() to create a data.txt file as specified by the latest JAGS manual, but I keep running into this error:
Reading data file data.txt
syntax error, unexpected LIST, expecting DOUBLE or NA or ASINTEGER or 'c'
The data.txt produced by dump(), from which I have removed the "L" assigned by R:
M <- 4
N <- 2
x <- structure(list(Var1 = c(0, 1, 0, 1), Var2 = c(0, 0, 1, 1)), .Names = c("Var1",
"Var2"), out.attrs = structure(list(dim = c(2, 2), dimnames = structure(list(
Var1 = c("Var1=0", "Var1=1"), Var2 = c("Var2=0", "Var2=1"
)), .Names = c("Var1", "Var2"))), .Names = c("dim", "dimnames"
)), class = "data.frame", row.names = c(NA, -4))
counts <- c(377558, 1001, 2000, 2000)
total <- 382559
If I remove x, the data will import correctly, but obviously that is not what I want. The strangest part is that if I use the rjags and R2jags packages instead, the whole thing works fine. Does anyone know how to format this data to work in JAGS?
As Martyn said over on the JAGS forum, a list (or data.frame) is not allowed in JAGS. You need to convert this to an array or matrix before using dump.
By the way, if you need to call JAGS externally then you might also want to check out the runjags package (on CRAN) which does a lot of the automation of creating files to call JAGS (try run.jags(..., method='interruptible', keep.jags.files='my_folder') for example). You will still need to convert your data frame to a matrix first though.
Matt
What seemed to fix this issue for me was a simple command per Martyn's suggestion on the JAGS board:
x <- as.matrix(x)

Creating Netcdf files issue

I have created some netCDF files in R before, but right now I am having a problem creating one that I don't know how to handle. I have been looking into the error, but I am not sure what causes it. Since my data is too long, I include a smaller sample to give an idea of the structure:
#data.frame with the date and the values
dat <-dput(y.or[1:10,])
structure(list(date = structure(c(852073200, 852159600, 852246000,
852332400, 852418800, 852505200, 852591600, 852678000, 852764400,
852850800), class = c("POSIXct", "POSIXt"), tzone = ""), dymax = c(79.125,
75.375, 78, 72.375, 76.375, 76.571, 76.125, 82.75, 86.125, 86
)), .Names = c("date", "dymax"), row.names = c("1997-01-01.01",
"1997-01-01.02", "1997-01-01.03", "1997-01-01.04", "1997-01-01.05",
"1997-01-01.06", "1997-01-01.07", "1997-01-01.08", "1997-01-01.09",
"1997-01-01.10"), class = "data.frame")
#****Creating Netcdf files********
#One lat and lon, and 5478 days (14 years)
missval <- -999
dimX <- dim.def.ncdf( "longitude", "degrees_east",10)
dimY <- dim.def.ncdf( "latitude", "degrees_north", 50)
dimT <- dim.def.ncdf("time",as.Date(dates[1]),as.numeric(dates))
#Def.variable
var <- var.def.ncdf(name="max8hO3","ppb",list(dimX,dimY,dimT), missval=missval, longname="max8hO3",prec="double")
#creating the file
fil <- create.ncdf("fileout.nc",var)
Then, before putting the variable into the file, I get:
Error in nc$var[[nc$varid2Rindex[varid]]] :
attempt to select less than one element
I am sure that I am missing something...but I don't know, any idea???
I really appreciate some help, thanks!
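One thing that stands out in the code above, although the thread does not confirm it as the cause of the error: the second argument of dim.def.ncdf() is the units, which are normally a character string such as "days since 1997-01-01", whereas the post passes a Date object. A minimal sketch of building a units string and matching numeric time values in base R follows; the dates vector here is made up, since the post's own dates object is not shown.

```r
# Made-up daily dates standing in for the post's `dates` vector
dates <- seq(as.Date("1997-01-01"), by = "day", length.out = 5)

# Reference date for the time axis
origin <- dates[1]

# Units as a plain character string
time_units <- paste("days since", format(origin, "%Y-%m-%d"))

# Time values as numeric offsets (in days) from the reference date
time_vals <- as.numeric(dates - origin)

# These two values would then take the place of the second and third
# arguments in the post's dim.def.ncdf("time", ...) call.
```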
