R: check consistency of separated time series tables

I have a time series table like this, which goes up to 2000 31 12 23 (12/31/2000 23:00):
I'd like to add temperature values from several weather stations to it. The problem is that the different time series obviously don't match in row count, so there must be gaps.
How can I check whether these data frames consistently follow the pattern of hours 00-23 and months 1-12, and find out where the gaps are?

If your data is in the format of the link, then you can probably convert it to a POSIXct object by doing the following (assuming your data frame is called data):
data <- as.data.frame(list(YY = rep("1962", 10),
                           MM = rep("01", 10),
                           DD = rep("01", 10),
                           HH = c("00", "01", "02", "03", "04",
                                  "05", "06", "07", "08", "09")))
# paste the date parts together and parse the result as POSIXct
date <- paste(data$YY, data$MM, data$DD, sep = "-")
data$dateTime <- as.POSIXct(paste(date, data$HH, sep = " "), format = "%Y-%m-%d %H")
That should put your data into a POSIXct format. If your temperature dataset also has a POSIXct column called "dateTime", you should be able to use the merge function to combine the two data frames:
temp <- as.data.frame(list(YY = rep("1962", 10),
                           MM = rep("01", 10),
                           DD = rep("01", 10),
                           HH = c("00", "01", "02", "03", "04",
                                  "05", "06", "07", "08", "09")))
date1 <- paste(temp$YY, temp$MM, temp$DD, sep = "-")
temp$dateTime <- as.POSIXct(paste(date1, temp$HH, sep = " "), format = "%Y-%m-%d %H")
temp$temp <- round(rnorm(10, 0, 5), 1)
temp <- temp[, c("dateTime", "temp")]
# let's say your temperature dataset is missing an entry for a certain timestamp;
# after this, the data frame won't have an entry for 02:00:00
temp <- temp[-3, ]
# an inner merge keeps only the timestamps present in both data frames
data1 <- merge(data, temp)
data1
# if you want to look at time differences you can try something like this
diff(data1$dateTime)
# a left merge keeps all timestamps and fills the temp value in as NA at 02:00:00
data2 <- merge(data, temp, all.x = TRUE)
data2
diff(data2$dateTime)
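To pinpoint where the gaps actually are, here is a minimal follow-up sketch using the data1 and data2 objects from above (the column names are the ones from this example):
# in the inner merge (data1), gaps show up as jumps between consecutive timestamps
d <- diff(data1$dateTime)
units(d) <- "hours"               # compare in hours regardless of the default units
data1$dateTime[which(d > 1)]      # last timestamp before each gap
# in the left merge (data2), gaps show up as NA temperatures
data2$dateTime[is.na(data2$temp)]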
I hope that helps. I often use the merge function when I'm trying to match up timestamps from ecological datasets.

Thank you for your answer and sorry for my late reply.
I couldn't have done it without your helpful hints, though I ended up merging all my time series in a slightly different way:
Sys.setenv(TZ = 'UTC')  # set the system time zone to UTC to avoid DST gaps
# create an empty hourly time series for the following join
start <- strptime("1962010100", format = "%Y%m%d%H")
end <- strptime("2000123123", format = "%Y%m%d%H")
series62_00 <- data.frame(
  MESS_DATUM = seq(start, end, by = "hour"), t = NA)
# join all the temperature series covering the same timespan using the "plyr" package
library("plyr")
t_allstations <- list(series62_00, t282, t867, t1270, t2261, t2503,
                      t2597, t3668, t3946, t4752, t5397, t5419, t5705)
t_omain_DWD <- join_all(t_allstations, by = "MESS_DATUM", type = "left")
Using join_all with type = "left" makes sure that the MESS_DATUM column is not changed and that missing temperature values are filled in as NAs.
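If you want to verify the result, one possible sanity check (a sketch, assuming the object names from the code above) is to compare the row count against the expected number of hours and to count the NAs per station column:
# the merged table should have one row per hour between start and end
nrow(t_omain_DWD) == length(seq(start, end, by = "hour"))
# number of missing temperature values per column, i.e. per station
colSums(is.na(t_omain_DWD))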

Related

Add multiple rows per observation

I'm trying to construct a panel dataset that can work as a "vessel" where I can put my real data.
I have information about 346 municipalities, and I want to add daily information for a total of 166 days. So, for each municipality, I want 166 rows (one per day). I've only managed to get a dataset with 57,436 rows (which is 346*166), but I can't find a way to include both the name of the municipality and the date; it's one or the other. Any ideas on how I can do this? The code that I'm using so far, which produces 346 observations per day, is the following:
comunas_panel <- data.frame()
for (i in 1:nrow(codigos_territoriales)) {
  dates <- data.frame(date = seq(from = as.Date("2019-10-18"),
                                 to = as.Date("2020-03-31"), by = 1))
  comunas_panel <- rbind(comunas_panel, dates)
}
Try the expand.grid() function.
It takes any number of arguments, which can be named whatever you choose, and each argument is a vector. The result is a data frame with columns named after the arguments, containing all possible combinations of the elements of the input vectors. So in this example I use your vector of 166 dates and cross it with a toy example of 3 municipality names to get a data frame with 166*3 = 498 rows and 2 columns (date and municipality).
date <- seq(from = as.Date("2019-10-18"), to = as.Date("2020-03-31"), by = 1)
municipalities <- c('name1', 'name2', 'name3') #etc.
comunas_panel <- expand.grid(municipality = municipalities, date = date)
Similar alternatives are expand_grid() from tidyr (part of the tidyverse) and CJ() from data.table.
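For reference, the same panel built with those two alternatives might look roughly like this (a sketch, reusing the municipalities and date vectors from above):
# tidyr (part of the tidyverse)
library(tidyr)
comunas_panel <- expand_grid(municipality = municipalities, date = date)
# data.table: CJ() builds a cross join
library(data.table)
comunas_panel <- CJ(municipality = municipalities, date = date)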

Creating time series in R

I have a CSV file containing data as follows:
date, group, integer_value
The dates run from 01-January-2013 to 31-October-2015 for the 20 groups contained in the data.
I want to create a time series for each of the 20 groups. But the dates are not continuous and have sporadic gaps, hence
group4series <- ts(group4, frequency = 365.25, start = c(2013,1,1))
works from a programming point of view but is not correct due to the gaps in the data.
How can I use the 'date' column of the data to create the time series instead of the usual 'frequency' parameter of the ts() function?
Thanks!
You could use zoo::zoo instead of ts.
Since you don't provide sample data, let's generate daily data and remove some days to introduce gaps.
set.seed(2018)
dates <- seq(as.Date("2015/12/01"), as.Date("2016/07/01"), by = "1 day")
dates <- dates[sample(length(dates), 100)]
We construct a sample data frame:
df <- data.frame(
  dates = dates,
  val = cumsum(runif(length(dates))))
To turn df into a zoo time series, you can do the following:
library(zoo)
ts <- with(df, zoo(val, dates))
Let's plot the time series:
plot.zoo(ts)
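If you then want to pad the gaps with NAs to get a regular daily series, one common zoo idiom (a sketch building on the ts object above) is to merge with an empty zoo series that has an entry for every day:
# empty zoo series indexed by every day in the range
all_days <- zoo(, seq(start(ts), end(ts), by = "day"))
# merging pads the missing days with NA
ts_full <- merge(ts, all_days)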

R: Pad dates with weekends or bank holidays

Assume the following dataset. I get closing prices for all working days, but rows are missing for dates with no observation. How can I add a row for each calendar day all the way to the present? The reason I need this is that I have to average by week, and having variable time windows makes that impossible.
Here is my code:
library(quantmod)
from <- "2012-09-01"
sym <- c("BARC")
prices <- Map(function(n) {
  print(n)
  tryCatch(getSymbols(n, src = "google", env = NULL, from = from)[, 4],
           error = function(e) NA)
}, sym)
N <- length(prices)
# identify symbols returning valid data
i <- !unlist(Map(function(i) is.na(prices[i]), seq(N)))
# combine returned prices list into a matrix, one column for each symbol
prices <- Reduce(cbind, prices[i])
colnames(prices) <- sym[i]
If you look at the "prices" object, you will see the point I am making.
You can create a blank xts with all the dates first, and then merge it with your prices object:
full_dates <- xts(, order.by = seq(from = start(prices), to = end(prices), by = "day"))
full_prices <- merge(full_dates, prices, all = TRUE)
You can also choose to fill the missing prices forward with the following:
na.locf(full_prices)
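Since src = "google" may no longer return data, here is a self-contained sketch of the same idea with made-up prices (the values and dates are invented for illustration; full_dates and full_prices match the names above):
library(xts)
# toy closing prices on a few non-consecutive working days
prices <- xts(c(100.2, 101.5, 99.8),
              order.by = as.Date(c("2012-09-03", "2012-09-04", "2012-09-07")))
full_dates <- xts(, order.by = seq(from = start(prices), to = end(prices), by = "day"))
full_prices <- merge(full_dates, prices, all = TRUE)
na.locf(full_prices)   # carry the last observed price forward over the gaps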

Pad data frame with missing dates in a series [duplicate]

This question already has an answer here: r - time series padding with NA (1 answer). Closed 8 years ago.
I've aggregated a data frame of rows representing events into another data frame of daily counts using aggregate(). The resulting frame is sorted by date, but it's missing days with zero counts, and I want to fill those days in to get a continuous daily series. The count frame looks something like this:
agg <- data.frame(
  date = as.Date(c("2013-04-02", "2013-04-04", "2013-04-07", "2013-04-08")),
  count = c(4, 2, 6, 1))
The way I previously solved this was by iterating through the frame to find non-continuous days and then rbinding subsets of the frame with an empty one. But this is an ugly solution, horrible to debug and painfully inefficient to boot. My thinking is that it would be better to generate a new data frame, populate it with the target date series...
target <- data.frame(
  date = seq(from = as.Date("2013-04-01"), to = as.Date("2013-04-10"), by = "day"),
  count = NA)
... and then somehow project counts from agg onto target using the matching dates. Does anyone know how I'd do this -- or have a better solution?
You're almost there. Just do:
merge(agg, target[-2], all.y = TRUE)
The [-2] subset is needed to drop the count column from target, as it is not needed in the merge. Alternatively, you could leave count out of target altogether:
target <- data.frame(
  date = seq(from = as.Date("2013-04-01"), to = as.Date("2013-04-10"), by = "day"))
merge(agg, target, all.y = TRUE)
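Since the days that are missing from agg represent zero events rather than unknown counts, you may want to turn the NAs into zeros afterwards; a small follow-up sketch:
result <- merge(agg, target, all.y = TRUE)
result$count[is.na(result$count)] <- 0
result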
As another solution, how about this?
library(dplyr)
other <- data.frame(date = seq(as.Date("2013-04-01"), as.Date("2013-04-10"), by = "day"),
                    count = 0)
other <- filter(other, !(date %in% agg$date))
join <- full_join(agg, other, by = c("date", "count")) %>% arrange(date)
It's a little messy, but it does the trick.
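As a side note, tidyr::complete() can do the padding and the zero-filling in one step; a sketch, not part of the original answers:
library(tidyr)
complete(agg,
         date = seq(as.Date("2013-04-01"), as.Date("2013-04-10"), by = "day"),
         fill = list(count = 0))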

Convert CSV with dates using lubridate

I have a dataset in CSV format that has two columns: Date and Value. There are hundreds of rows in the file. The date format in the file is YYYY-MM-DD. When I imported this dataset, the Date column was read in as a factor, so I cannot run a regression between the two variables.
I am very new to R, but I understand that lubridate can help me convert the data in the Date column. Could someone suggest what command I should use to do so? The file name is Test.csv.
Next time, please provide some test data and show what you tried. For variations, see ?as.Date and ?read.csv. The following does not use any packages:
# test data
Lines <- "Date,Value
2000-01-01,12
2001-01-01,13"
# DF <- read.csv("myfile.csv")
DF <- read.csv(text = Lines)
DF$Date <- as.Date(DF$Date)
plot(Value ~ Date, DF, type = "o")
giving:
> DF
        Date Value
1 2000-01-01    12
2 2001-01-01    13
Note: since your data is a time series, you might want to use a time series representation. In this case read.zoo automatically converts the first column to "Date" class:
library(zoo)
# z <- read.zoo("myfile.csv", header = TRUE, sep = ",")
z <- read.zoo(text = Lines, header = TRUE, sep = ",")
plot(z)
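Since the question asked about lubridate specifically: its ymd() parser does the same conversion as as.Date() here; a sketch using the same test data:
library(lubridate)
DF <- read.csv(text = Lines)
DF$Date <- ymd(DF$Date)   # parses "YYYY-MM-DD" strings into Date objects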
