R: How to arrange a daily time series of rows and columns to a single column? - r

I have a dataset, a daily timeseries and I want to arrange into a single column, this is my data:
Date Day 1 Day 2 Day 3 Day 4 Day 5 Day 6 .... Day 31
01/01/1964 0 0 0 0 0 0 3
01/02/1964 NA NA NA NA NA NA ...
01/03/1964 195 445 329 121 61,6 44 ...
01/04/1964 17,2 14,9 17,1 102 54,3 9,33 ...
I want this:
Day1 0
Day2 0
.
.
.
Day31 3
I having problems because of leap years that have 366 days, i trying this, but no succes, thanks in advanced.
EDIT:
I finally got it, but if anyone knows a more easy way, using some package or function, I'm grateful. Or I'll create my own function.
EDIT 2:
Now I have a problem, when I not start in the first month of a year.
rm(list = ls())
cat("\014")
setwd("C:/")
require(XLConnect)
# Load Streamflow Gauging Station
wb <- loadWorkbook("rainfall.xls")
Data<- readWorksheet(wb, sheet = "rainfall",header = FALSE,region = "B02:AF517")
R<- Data; ##1964 - 2006
sum(R[is.na(R)==FALSE])
# Number of days in each month
Ny<- c(31,28,31,30,31,30,31,31,30,31,30,31); # Normal Year
Ly<- c(31,29,31,30,31,30,31,31,30,31,30,31); # Leap/bissextile Year
S1<- c(1,0,0,0) # Leap year, normal year...
S2<- c(0,1,0,0) # Normal year, leap year...
S3<- c(0,0,1,0) #...
S4<- c(0,0,0,1) #...
Iab<- rep(S1,times=ceiling((nrow(R)/12)/4)); # Index of years
Iab<- Iab[1:(nrow(R)/12)];
Rnew<- matrix(numeric(0), 0,0);
#Organize data in a only collumn
for(i in 1:(nrow(R)/12)){
for(j in 1:12){
if(Iab[i]==0){
Rnew<-c(Rnew, t(R[12*(i-1)+j,1:Ny[j]]))
}else{
Rnew<-c(Rnew, t(R[12*(i-1)+j,1:Ly[j]]))
}
}
}
sum(R[is.na(R)==FALSE])==sum(Rnew[is.na(Rnew)==FALSE]) #Test for succes of organize
sum(R[is.na(R)==FALSE])
sum(Rnew[is.na(Rnew)==FALSE])

I have a similar problem. However in a way even worse, since I have discharge data (Brasilian ANA station) with several interruptions of several month and years. Vazao01 stands for the discharge at the first day of the month, Vazao02 for the second and the data frame goes up to Vazao31 (which is obviously NA for month with less days, but can as well be NA for existing days without record). The data looks like this and is the data.frame "ANAday"
Date Vazao01 Vazao02 Vazao03...
20 01.05.1989 3463.00 3476.500 3463.000
21 01.06.1989 1867.70 1835.900 1809.400
22 01.07.1989 809.90 798.200 774.800
23 01.08.1989 344.60 308.700 297.900
24 01.11.1989 376.50 388.100 391.000
25 01.12.1989 279.00 289.800 319.500
26 01.01.1990 1715.00 1649.000 1573.200
27 01.02.1990 1035.20 1005.800 972.200
28 01.03.1990 2905.60 2962.100 NA
29 01.06.1990 NA NA NA
30 01.07.1990 297.90 284.400 271.200
31 01.08.1990 228.00 223.200 218.400
32 01.08.1999 NA NA 144.000
33 01.09.1999 20.74 18.620 16.500
34 01.10.1999 119.85 111.450 95.385
35 01.11.1999 11.20 23.705 48.370
36 01.12.1999 160.10 179.000 187.400
37 01.01.2000 843.00 865.300 914.500
38 01.02.2000 1331.30 1368.900 1387.800
39 01.04.2000 1823.60 1808.000 1789.800
40 01.05.2000 1579.00 1524.100 1445.700
I made a list of the month with data
ANAm=as.Date(ANAday[,1], format="%d.%m.%Y")
format(ANAm, format="%Y-%m")
Than I used the "monthDays" function of the Hmisc package to list the number of days in each month
require(Hmisc)
nodm=monthDays(ANAm)
Nodm=cbind.data.frame(ANAm,nodm)
I prepared a data.frame for the data I want to have with 3 columns for "YEAR MONTH", "DAY" and "DISCHARGE"
ANATS=array(NA,c(1,3))
colnames(ANATS)=c("mY","d","Q")
And used a simple "for" loop to extract the data into one column according to the number of days in each month
for(i in 1:nrow(Nodm)){
selectANA=as.vector(ANAd[i,1:(Nodm[i,2]) ])
selectANA=as.vector(t(selectANA))##to generate a simple vector
dayANA=c(1:(Nodm[i,2]))
monthANA=rep(format(as.Date(Nodm[i,1]),format="%Y-%m"),times=as.numeric(Nodm[i,2]))
ANAts=cbind(monthANA,dayANA,auswahlANA)
ANATS<<-rbind(ANATS,ANAts)
}
The ANATS can than be transferred into a timeseries:
combine.date=as.character(paste(ANATS[,1],ANATS[,2],sep="-"))
DATE=as.Date(combine.date, format="%Y-%m-%d")
rownames(ANATS)=as.character(DATE)
ANATS=ANATS[-1,]
ANAXTS=as.xts(ANATS)

Maybe I'm having trouble understanding exactly what you're looking for, but are you trying to transpose the data?
t(data)

Related

How to use Sys.Date() To Extract Current Year? [duplicate]

This question already has answers here:
How can I get the extract the previous year (2020) using Sys.Date()?
(2 answers)
Closed 1 year ago.
I have manually separated my dataset (discrete_8) into 2 separate datasets (data & data2). 'Data' contains the data from this current year (2021), whereas 'Data2' contains data from previous years. Of course, this is based on the current year (2021), but I want to automate the line of code so that when the year 2022 comes, I will not have to edit the script to change 2021 to 2022. Should I use Sys.Date() for calling the most recent year? How would I go about incorporating sys.date() to partition the dataset?
Here is my code so far, where I partition the dataset:
data <- discrete_8 %>% filter(PS_DATE >= as.POSIXct("2021-01-01"))#current year
data2 <- discrete_8 %>% filter(PS_DATE < as.POSIXct("2021-01-01"))#past years
Here is what discrete_8 looks like:
X PS_DATE PS_NAME Control.Parameters.Cell.Return.Flow.Rate Control.Parameters.Harvest.Flow.Rate Control.Parameters.Microsparger.Total.Gas.Flow.Rate
1 0 2014-02-06 123 NA NA 1
2 1 2014-02-07 124 NA NA 1
3 2 2014-02-08 125 NA NA 1
4 3 2014-02-09 126 1.5 NA 1
5 4 2014-02-10 127 1.5 NA 1
6 5 2014-02-11 128 1.5 NA 1
There is somewhat tedious bug still present in that trunc(Sys.Date(), "year") does not give you Jan 01 of the current year -- it does in R-devel.
But you can build yourself a helper such as this:
> firstDay <- function() { d <- Sys.Date(); d - as.POSIXlt(d)$yday }
> firstDay()
[1] "2021-01-01"
and you can use that to compare. (Also, in the code you posted, as.Date() is simpler as you ignore hours/minutes/seconds here.)
one option can be the lubridate::floor_date() function:
lubridate::floor_date(Sys.Date(), unit = "years")
[1] "2021-01-01"
I use substr(Sys.Date(),1,4) to get the current year. In your code you can replace as.POSIXct("2021-01-01") with
as.POSIXct(paste0(substr(Sys.Date(),1,4),"-01-01"))
This will give the 1st of the current year in your datetime format.

Format function output as data frame

I use the following to sum several measures of TABR per Julian date in 5 separate years:
TABR_YearDay<-with(wsmr, tapply(TABR, list(Julian, Year),sum))
Which produces output like:
2015 2016 2017 2018 2019
33 NA NA NA 2 NA
....
80 NA 1 NA 21 NA
81 NA 47 NA 25 NA
82 NA 12 1 9 NA
But I want to convert these results into a dataframe with 6 columns Julian + 2015-2019.
I tried:
TABR_Day<-as.data.frame(TABR_YearDay)
But that seems to not produce a fully realized df: there is no column for Julian and if I want to call an individual variable, I have to use quotes around it like:
hist(TABR_Day$"2017")
Can you help me transition the function output to a dataframe with 6 viable columns?
The column is present in the rownames of the result. Column names usually don't start with a number, we can prepend the column name with the word 'Year' to make it 'Year_2015' etc and construct the final dataframe.
TABR_YearDay<-with(wsmr, tapply(TABR, list(Julian, Year),sum))
colnames(TABR_YearDay) <- paste0('Year_', colnames(TABR_YearDay))
TABR_Day <- data.frame(Julian = rownames(TABR_YearDay), TABR_YearDay)

How to make all the months to have an equal number of days (for example 22 days) for a MIDAS regression in R

This is a follow up question for these two posts.
How to deal with impossible dates for midasr package
https://stats.stackexchange.com/questions/77495/what-can-i-do-with-these-two-time-series
I need to use mls function in MIDAS package in R to transform the high frequency (daily) financial data to low frequency (quarterly) macroeconomic data.
The author #mpiktas mentioned
You must make all the months to have an equal number of days. And then
set frequency to that number. You can achieve that by discarding data,
padding NAs or extrapolating.
and
You could use zoo objects to make the padding easier, but in the end
simple numeric vector should be passed.
I tried different ways to search and did not find an easy way to implement.
I use dplyr to get each month to have 31 days with 7-11 NA.
# generate the date vector
library(midasr)
library(dplyr)
library(quantmod)
tsxdate <- as.Date( paste(1979, rep(1:12, each=31), 1:31, sep="-") )
for (year in 1980:2015){
tsxdate <- c(tsxdate,as.Date( paste(year, rep(1:12, each=31), 1:31, sep="-") ))
}
# transform to dataframe
tsxdate.df <- as.data.frame(tsxdate)
# get the stock market index from yahoo
tsxindex <- getSymbols("^GSPTSE",src="yahoo", from = '1977-01-01', auto.assign = FALSE)
# merge two data frame to get each month with 31 days
tsx.df <- left_join(tsxdate.df, tsxindex)
I doubt this caused a problem due to too many NAs.
I put the new daily data into MIDAS regression in R. It did not work. None of the weight functions work.
# since each month has 31 days. one quarter yy correspond to 93 days data.
midas_r(midas_r(yy~trend+fmls(zz,30,93,nealmon) ,start=list(zz=rep(0,4))), Ofunction="nls")
Could you tell me how to make all the months to have an equal number of days?
update:
Finally, I got a way in zoo package with aggregate and first function. It is not perfect, but it works and fast. first will add NAs according to the parameter.
I still need to figure out how to fit it into a MIDAS regression.
# get data
tsx <- getSymbols("^GSPTSE",src="yahoo", from = '1977-01-01', auto.assign = FALSE)
# subset
# generate a zoo object
library(zoo)
tsx.zoo <- zoo(tsx$GSPTSE.Adjusted)
# group by yearmonth and take first 22 days data.
days <-aggregate(tsx.zoo, as.yearmon, first, 22)
It looks like this: each row is one month with 22 days data.
Jun 1979 1614.29 NA NA NA NA NA NA NA NA NA
Jul 1979 1614.29 1598.73 1579.88 1582.57 1582.27 1576.19 1559.23 1529.81 1533.50 1547.66
Aug 1979 1554.14 1556.94 1553.84 1553.84 1551.95 1561.23 1562.52 1571.00 1578.08 1580.28
Sep 1979 1685.11 1657.58 1690.10 1720.92 1716.53 1711.34 1722.71 1714.63 1727.50 1724.51
Oct 1979 1749.05 1767.40 1775.98 1786.35 1800.12 1800.12 1735.88 1685.21 1681.52 1670.65
Nov 1979 1599.33 1606.81 1596.54 1592.94 1574.49 1569.20 1583.97 1608.70 1611.00 1619.78
Jun 1979 NA NA NA NA NA NA NA NA NA NA
Jul 1979 1556.94 1546.86 1548.46 1553.54 1542.07 1543.17 1552.85 1566.01 1573.99 1564.12
Aug 1979 1596.64 1602.82 1615.09 1636.53 1653.09 1660.97 1657.78 1665.46 1674.44 1674.64
Sep 1979 1714.73 1717.53 1732.59 1736.48 1731.19 1732.49 1746.75 1754.33 1747.45 NA
Oct 1979 1639.03 1613.19 1616.29 1635.34 1593.44 1533.40 1522.12 1534.49 1517.24 1523.92
Nov 1979 1628.55 1621.57 1624.36 1627.56 1620.27 1647.51 1677.93 1683.81 1690.70 1698.97
Jun 1979 NA NA
Jul 1979 1554.14 NA
Aug 1979 1674.24 1675.43
Sep 1979 NA NA
Oct 1979 1538.68 1552.25
update again:
#mpiktas gives a better and right way to do it.
1 NAs should be padded at beginning of each period.
2 Data should be gather in the frequency of response variable. In my case, it is quarterly.
His function can be used in aggregate function in zoo. I guess it do the same job as group_by plus do in dplyr: split, operate, and give back a list of results. I try this
tsxdaily <- aggregate(tsx.zoo, yearqtr, padd_nas, 66)
yearqtr is the frequency of response variable.
Here is one possible way of how to add NAs.
First, note that MIDAS regression puts the emphasis on the last values of the period, so you need to put NAs in front, not in the back.
Suppose that we have the following dummy data:
> dt <- data.frame(Day=1:10,Quarter=c(rep(1,6),rep(2,4)),value=1:10)
> dt
Day Quarter value
1 1 1 1
2 2 1 2
3 3 1 3
4 4 1 4
5 5 1 5
6 6 1 6
7 7 2 7
8 8 2 8
9 9 2 9
10 10 2 10
In this example there are two quarters, the first one has 6 days, the second one 4. Suppose we want to harmonize the data, so that the quarter has 7 days (for example).
Define simple function which adds NAs at the beginning of the data:
padd_nas <- function(x, desired_length) {
n <- length(x)
if(n < desired_length) {
c(rep(NA,desired_length-n),x)
} else {
tail(x,desired_length)
}
}
Here is an example illustrating how this function works:
> padd_nas(1:4,7)
[1] NA NA NA 1 2 3 4
>
Now add NAs for each quarter and make sure that the data is ordered by day:
library(dplyr)
pdt <- dt %>% arrange(Day) %>% group_by(Quarter) %>% do(pv = padd_nas(.$value, 7))
> pdt
Source: local data frame [2 x 2]
Groups: <by row>
Quarter pv
1 1 <int[7]>
2 2 <int[7]>
To get the padded result simply use unlist on column pv:
> pv <- pdt$pv %>% unlist
> pv
[1] NA 1 2 3 4 5 6 NA NA NA 7 8 9 10
Now we can prepared this for MIDAS regression with mls. Suppose that only last 3 days are relevant for each quarter:
> library(midasr)
> mls(pv, 0:2, 7)
X.0/m X.1/m X.2/m
[1,] 6 5 4
[2,] 10 9 8
Compare this with original data dt.
This approach can be generalized for any low and high frequency data configuration.

as.Date is throwing a row number mismatch, but all vectors are same length

The following (CSV) dataset has 3133 rows of expenses by day between 7/1/2000 and 12/31/2014:
head(d_exp_0014)
2000 7 6 792078.595 9
2000 7 7 140065.5 9
2000 7 11 190553.2 9
2000 7 12 119208.65 9
2000 7 16 1068156.293 9
2000 7 17 0 9
2000 7 21 457828.8033 9
2000 7 26 661445.0775 9
2000 7 28 211122.82 9
2000 8 2 273575.1733 8
The columns here are Year, Month, Day, Expense, and Count (for how many days of the each month had an expense).
I am trying to do a forecast out to the end of 2015, and need to deal with these messy date columns so I can slice and dice xts (?) objects with dplyr. ISOdate and as.Date functions are throwing this error:
> exp <- data.frame(data = d_exp_0014, Date = as.Date(paste(Year, Month, Day), format = "m%/d%/Y%"), Amount = Amount, Count = Count, t = c(1:3133))
Error in data.frame(data = d_exp_0014, Date = as.Date(paste(Year, Month, :
arguments imply differing number of rows: 3133, 3134
> length(d_exp_0014$Year)
[1] 3133
> length(d_exp_0014$Month)
[1] 3133
> length(d_exp_0014$Day)
[1] 3133
What am I doing wrong? And should I instead build a vector of 5296 continuous dates between 7/1/2000 and 12/31/2014 and merge my 3133 rows of observations to this table (thus effectively inserting '0' in the Amount column for days on which there were no payments)?
Several errors (but not from paste): I'm guessing you were taught to use attach. That is probably the source of this particular error. Start by
detach(d_exp_0014)
d_exp_0014 <- cbind(d_exp_0014,
myDate = with(d_exp_0014,
as.Date(paste(Year, Month, Day, sep="/"),
format = "%Y/%m/%d") # note % first then letter
)
)
Then you can add further columns as needed.

Count/frequency of observations by month

I have a dataset with 34 variables and about 25,000 observations. Each observation refers to a specific incident. Its formatted something like this:
no id date ....
1 363 006 2005-11-05
2 939 012 2007-01-23
3 541 090 2009-06-14
I'm wondering if someone could walk me through how to get the total number of incidents for each month over the 5 years contained in this dataset.
It seems like you just want to count the number of rows for each month, if so you can just use table():
> #make junk data
> data <- data.frame(no=rnorm(100),id=rnorm(100),date=seq(Sys.Date()-99,Sys.Date(),by="day"))
> table(format(data$date,"%b-%Y"))
Aug-2013 Jul-2013 Nov-2013 Oct-2013 Sep-2013
31 7 1 31 30
tapply( dfrm$no, sub(".+-(.+)-.+", "\\1", dfrm$date), sum, na.rm=TRUE)
For month and year in MM-YYYY format use this as you category code:
.... , sub("(.+)-(.+)-.+", "\\2-\\1", data$date) , ....

Resources