Format function output as data frame - r

I use the following to sum several measures of TABR per Julian date in 5 separate years:
TABR_YearDay<-with(wsmr, tapply(TABR, list(Julian, Year),sum))
Which produces output like:
2015 2016 2017 2018 2019
33 NA NA NA 2 NA
....
80 NA 1 NA 21 NA
81 NA 47 NA 25 NA
82 NA 12 1 9 NA
But I want to convert these results into a dataframe with 6 columns Julian + 2015-2019.
I tried:
TABR_Day<-as.data.frame(TABR_YearDay)
But that seems to not produce a fully realized df: there is no column for Julian and if I want to call an individual variable, I have to use quotes around it like:
hist(TABR_Day$"2017")
Can you help me transition the function output to a dataframe with 6 viable columns?

The Julian values are stored in the row names of the result. Syntactic column names in R cannot start with a digit, so prepend a prefix such as 'Year_' to the column names (giving 'Year_2015' etc.) and then construct the final data frame.
TABR_YearDay<-with(wsmr, tapply(TABR, list(Julian, Year),sum))
colnames(TABR_YearDay) <- paste0('Year_', colnames(TABR_YearDay))
TABR_Day <- data.frame(Julian = rownames(TABR_YearDay), TABR_YearDay)
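To see the steps end to end, here is a minimal sketch on made-up data. The toy `wsmr` values are assumptions for illustration only, and converting the row names with `as.integer()` plus `row.names = NULL` are optional additions, not part of the answer above.

```r
# Toy stand-in for the asker's `wsmr` data (values are invented for illustration)
wsmr <- data.frame(
  Julian = c(80, 80, 81, 81, 82),
  Year   = c(2016, 2018, 2016, 2018, 2017),
  TABR   = c(1, 21, 47, 25, 1)
)

# Sum TABR per Julian day (rows) and year (columns)
TABR_YearDay <- with(wsmr, tapply(TABR, list(Julian, Year), sum))

# Make the column names syntactic so they can be used with $ without quotes
colnames(TABR_YearDay) <- paste0("Year_", colnames(TABR_YearDay))

# Move the row names into a proper numeric Julian column
TABR_Day <- data.frame(
  Julian = as.integer(rownames(TABR_YearDay)),
  TABR_YearDay,
  row.names = NULL
)
```

After this, a call such as `hist(TABR_Day$Year_2017)` works without quoting the column name.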

Related

Selecting later date observation in panel data in R

I have the following panel data in R:
ID_column<- c("A","A","A","A","B","B","B","B")
Date_column<-c(20040131, 20041231,20051231,20061231, 20051231, 20061231, 20071231, 20081231)
Price_column<-c(12,13,17,19,35,38,39,41)
Data<- data.frame(ID_column, Date_column, Price_column)
#The data looks like this:
ID_column Date_column Price_column
1: A 20040131 12
2: A 20041231 13
3: A 20051231 17
4: A 20061231 19
5: B 20051231 35
6: B 20061231 38
7: B 20071231 39
8: B 20081231 41
My next aim is to convert the Date column, which is currently in a numeric YYYYMMDD format, into YYYY by taking the first four digits of each entry in the date column as follows:
Data$Date_column<- substr(Data$Date_column,1,4)
#The data then looks like:
ID_column Date_column Price_column
1 A 2004 12
2 A 2004 13
3 A 2005 17
4 A 2006 19
5 B 2005 35
6 B 2006 38
7 B 2007 39
8 B 2008 41
My ultimate goal is to use the plm package for panel data regression, but when I use pdata.frame to set the ID and Time variables as indices, I get error messages about duplicate ID/Time pairs (in this case rows 1 and 2, which would both get the tag A,2004). To solve this, I would like to delete row 1 in the original data and keep only the newer observation from the year 2004. This would then give me unique ID/Time pairs across the whole data set.
I was therefore hoping someone could help me with a loop or a package suggestion that keeps only the row with the later observation within a year whenever this occurs, and that also works on larger data sets. I believe this involves a few conditional commands that I am having difficulty putting together. A loop that checks whether the first four digits of consecutive date observations are identical and then deletes the one with the "smaller" date (keeps the "larger" date) would do it, but my experience with loops is very limited.
Kind regards and thank you!
I'd recommend keeping Date_column as a reference for picking the later observation and adding a new column that holds only the year, since you want the latest observation in each year.
library(dplyr)
Data$year <- substr(Data$Date_column, 1, 4)
Data$Date_column <- lubridate::ymd(Data$Date_column)
Data %>%
  arrange(desc(Date_column)) %>%
  distinct(ID_column, year, .keep_all = TRUE) %>%
  arrange(Date_column)
ID_column Date_column Price_column year
1 A 2004-12-31 13 2004
2 A 2005-12-31 17 2005
3 B 2005-12-31 35 2005
4 A 2006-12-31 19 2006
5 B 2006-12-31 38 2006
6 B 2007-12-31 39 2007
7 B 2008-12-31 41 2008
Because the rows are sorted by the actual date in descending order, distinct() keeps the latest row for each unique combination of ID and year and drops the older ones. Reverse the sort order to keep the oldest occurrence instead.
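If dplyr and lubridate are not available, the same idea can be sketched in base R. This is a swapped-in technique (ordering plus `duplicated(..., fromLast = TRUE)`), not the code from the answer, and the name `Latest` is chosen here only for illustration.

```r
# Rebuild the example data from the question
Data <- data.frame(
  ID_column    = c("A", "A", "A", "A", "B", "B", "B", "B"),
  Date_column  = c(20040131, 20041231, 20051231, 20061231,
                   20051231, 20061231, 20071231, 20081231),
  Price_column = c(12, 13, 17, 19, 35, 38, 39, 41)
)

# Keep the full date and add a separate year column
Data$year <- substr(Data$Date_column, 1, 4)

# Sort so that, within each ID, later dates come last
Data <- Data[order(Data$ID_column, Data$Date_column), ]

# Mark every row that is followed by another row with the same ID/year,
# then keep only the last (latest) row of each ID/year pair
keep   <- !duplicated(Data[c("ID_column", "year")], fromLast = TRUE)
Latest <- Data[keep, ]
```

`Latest` then has one row per ID/year pair, which is what pdata.frame needs.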

How to match third (or whatever) from the bottom in a rolling fashion in R?

Here is my example data frame with the expected output.
data.frame(index=c("3435pear","3435grape","3435apple","3435avocado","3435orange","3435kiwi","3436grapefruit","3436apple","3436banana","3436grape","3437apple","3437grape","3437avocado","3437orange","3438apple","3439apple","3440apple"),output=c("na","na","na","na","na","na","na","na","na","na","na","na","na","na","3435apple","3436apple","3437apple"))
index output
1 3435pear na
2 3435grape na
3 3435apple na
4 3435avocado na
5 3435orange na
6 3435kiwi na
7 3436grapefruit na
8 3436apple na
9 3436banana na
10 3436grape na
11 3437apple na
12 3437grape na
13 3437avocado na
14 3437orange na
15 3438apple 3435apple
16 3439apple 3436apple
17 3440apple 3437apple
I want to match the fruit that is third from the bottom as I go down the column. If there are not three previous fruits it should return NA. Once the 4th apple appears it matches the apple 3 before it, then the 5th apple appears it matches the one 3 before that one, and so on.
I was trying to use rollapply, match, and tail to make this work, but I don't know how to reference the current row for the matching. In excel I would use the large, if, and row functions to do this. Excel makes my computer grind for hours to calculate everything and I know R could do this in minutes(seconds?).
You can do this:
library(dplyr)
df %>%
  mutate(fruit = gsub("[0-9]", "", index)) %>%
  group_by(fruit) %>%
  mutate(new_output = lag(index, 3)) %>%
  ungroup() %>%
  select(-fruit)
Within each fruit group, new_output holds the index value lagged by 3. ungroup() comes before select() because dplyr keeps grouping columns in the result while the data is still grouped, so fruit could not be dropped earlier. I preserved the output column and saved my results in new_output so that you can compare.
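For comparison, here is a base R sketch of the same grouped lag, using `ave()` on row positions instead of dplyr. It is an alternative written for illustration, and the helper name `lag_pos` is an assumption, not something from the answer.

```r
# Example data from the question (index column only)
df <- data.frame(
  index = c("3435pear", "3435grape", "3435apple", "3435avocado", "3435orange",
            "3435kiwi", "3436grapefruit", "3436apple", "3436banana", "3436grape",
            "3437apple", "3437grape", "3437avocado", "3437orange",
            "3438apple", "3439apple", "3440apple"),
  stringsAsFactors = FALSE
)

# Strip the digits to get the fruit name used for grouping
fruit <- gsub("[0-9]", "", df$index)

# For each fruit, shift its row positions down by 3 (NA for the first three)
lag_pos <- ave(seq_along(fruit), fruit,
               FUN = function(i) c(rep(NA, 3), i)[seq_along(i)])

# Look up the index value found three occurrences earlier
df$new_output <- df$index[lag_pos]
```

Rows 15 to 17 pick up the apples from rows 3, 8 and 11, and every other row stays NA.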

How to make all the months to have an equal number of days (for example 22 days) for a MIDAS regression in R

This is a follow up question for these two posts.
How to deal with impossible dates for midasr package
https://stats.stackexchange.com/questions/77495/what-can-i-do-with-these-two-time-series
I need to use the mls function in the midasr package in R to align high-frequency (daily) financial data with low-frequency (quarterly) macroeconomic data.
The package author, @mpiktas, mentioned:
You must make all the months to have an equal number of days. And then
set frequency to that number. You can achieve that by discarding data,
padding NAs or extrapolating.
and
You could use zoo objects to make the padding easier, but in the end
simple numeric vector should be passed.
I searched in several ways and did not find an easy way to implement this.
I used dplyr to give each month 31 days, which leaves 7-11 NAs per month.
# generate the date vector
library(midasr)
library(dplyr)
library(quantmod)
tsxdate <- as.Date( paste(1979, rep(1:12, each=31), 1:31, sep="-") )
for (year in 1980:2015){
tsxdate <- c(tsxdate,as.Date( paste(year, rep(1:12, each=31), 1:31, sep="-") ))
}
# transform to dataframe
tsxdate.df <- as.data.frame(tsxdate)
# get the stock market index from yahoo
tsxindex <- getSymbols("^GSPTSE",src="yahoo", from = '1977-01-01', auto.assign = FALSE)
# merge two data frame to get each month with 31 days
tsx.df <- left_join(tsxdate.df, tsxindex)
I suspect this caused a problem because there are too many NAs.
I put the new daily data into a MIDAS regression in R. It did not work. None of the weight functions worked.
# since each month has 31 days. one quarter yy correspond to 93 days data.
midas_r(midas_r(yy~trend+fmls(zz,30,93,nealmon) ,start=list(zz=rep(0,4))), Ofunction="nls")
Could you tell me how to make all the months have an equal number of days?
update:
I finally found a way using the zoo package with the aggregate and first functions. It is not perfect, but it works and it is fast. first pads with NAs up to the length given by its parameter.
I still need to figure out how to fit it into a MIDAS regression.
# get data
tsx <- getSymbols("^GSPTSE",src="yahoo", from = '1977-01-01', auto.assign = FALSE)
# subset
# generate a zoo object
library(zoo)
tsx.zoo <- zoo(tsx$GSPTSE.Adjusted)
# group by yearmonth and take first 22 days data.
days <-aggregate(tsx.zoo, as.yearmon, first, 22)
It looks like this: each row is one month with 22 days data.
Jun 1979 1614.29 NA NA NA NA NA NA NA NA NA
Jul 1979 1614.29 1598.73 1579.88 1582.57 1582.27 1576.19 1559.23 1529.81 1533.50 1547.66
Aug 1979 1554.14 1556.94 1553.84 1553.84 1551.95 1561.23 1562.52 1571.00 1578.08 1580.28
Sep 1979 1685.11 1657.58 1690.10 1720.92 1716.53 1711.34 1722.71 1714.63 1727.50 1724.51
Oct 1979 1749.05 1767.40 1775.98 1786.35 1800.12 1800.12 1735.88 1685.21 1681.52 1670.65
Nov 1979 1599.33 1606.81 1596.54 1592.94 1574.49 1569.20 1583.97 1608.70 1611.00 1619.78
Jun 1979 NA NA NA NA NA NA NA NA NA NA
Jul 1979 1556.94 1546.86 1548.46 1553.54 1542.07 1543.17 1552.85 1566.01 1573.99 1564.12
Aug 1979 1596.64 1602.82 1615.09 1636.53 1653.09 1660.97 1657.78 1665.46 1674.44 1674.64
Sep 1979 1714.73 1717.53 1732.59 1736.48 1731.19 1732.49 1746.75 1754.33 1747.45 NA
Oct 1979 1639.03 1613.19 1616.29 1635.34 1593.44 1533.40 1522.12 1534.49 1517.24 1523.92
Nov 1979 1628.55 1621.57 1624.36 1627.56 1620.27 1647.51 1677.93 1683.81 1690.70 1698.97
Jun 1979 NA NA
Jul 1979 1554.14 NA
Aug 1979 1674.24 1675.43
Sep 1979 NA NA
Oct 1979 1538.68 1552.25
update again:
@mpiktas gives a better, correct way to do it.
1. NAs should be padded at the beginning of each period.
2. Data should be gathered at the frequency of the response variable. In my case, it is quarterly.
His function can be used in the aggregate function in zoo. I guess it does the same job as group_by plus do in dplyr: split, operate, and give back a list of results. I tried this:
tsxdaily <- aggregate(tsx.zoo, as.yearqtr, padd_nas, 66)
as.yearqtr maps each date to its quarter, which is the frequency of the response variable.
Here is one possible way of how to add NAs.
First, note that MIDAS regression puts the emphasis on the last values of the period, so you need to put NAs in front, not in the back.
Suppose that we have the following dummy data:
> dt <- data.frame(Day=1:10,Quarter=c(rep(1,6),rep(2,4)),value=1:10)
> dt
Day Quarter value
1 1 1 1
2 2 1 2
3 3 1 3
4 4 1 4
5 5 1 5
6 6 1 6
7 7 2 7
8 8 2 8
9 9 2 9
10 10 2 10
In this example there are two quarters, the first one has 6 days, the second one 4. Suppose we want to harmonize the data, so that the quarter has 7 days (for example).
Define a simple function which adds NAs at the beginning of the data:
padd_nas <- function(x, desired_length) {
  n <- length(x)
  if (n < desired_length) {
    # too short: pad with NAs in front
    c(rep(NA, desired_length - n), x)
  } else {
    # long enough: keep only the last desired_length values
    tail(x, desired_length)
  }
}
Here is an example illustrating how this function works:
> padd_nas(1:4,7)
[1] NA NA NA 1 2 3 4
>
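The function has a second branch that the example above does not exercise: when a period has more observations than desired, only the last ones are kept. A small sketch of both cases, where the variable names `short_period` and `long_period` are made up for illustration:

```r
padd_nas <- function(x, desired_length) {
  n <- length(x)
  if (n < desired_length) {
    c(rep(NA, desired_length - n), x)   # pad in front
  } else {
    tail(x, desired_length)             # drop the earliest values
  }
}

short_period <- padd_nas(1:4, 7)    # 4 values padded to 7 with leading NAs
long_period  <- padd_nas(1:10, 7)   # 10 values cut down to the last 7 (4 to 10)
```

Both results have length 7, which is what a fixed frequency in mls requires.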
Now add NAs for each quarter and make sure that the data is ordered by day:
library(dplyr)
pdt <- dt %>% arrange(Day) %>% group_by(Quarter) %>% do(pv = padd_nas(.$value, 7))
> pdt
Source: local data frame [2 x 2]
Groups: <by row>
Quarter pv
1 1 <int[7]>
2 2 <int[7]>
To get the padded result simply use unlist on column pv:
> pv <- pdt$pv %>% unlist
> pv
[1] NA 1 2 3 4 5 6 NA NA NA 7 8 9 10
Now we can prepare this for MIDAS regression with mls. Suppose that only the last 3 days are relevant for each quarter:
> library(midasr)
> mls(pv, 0:2, 7)
X.0/m X.1/m X.2/m
[1,] 6 5 4
[2,] 10 9 8
Compare this with original data dt.
This approach can be generalized for any low and high frequency data configuration.
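For readers without dplyr or midasr at hand, the padding and the selection of the last three days can be reproduced in base R as a cross-check. The split/lapply step and the matrix indexing below are stand-ins chosen for illustration, not the midasr implementation, and `m` and `last3` are hypothetical names.

```r
padd_nas <- function(x, desired_length) {
  n <- length(x)
  if (n < desired_length) c(rep(NA, desired_length - n), x) else tail(x, desired_length)
}

# Dummy data from the answer: quarter 1 has 6 days, quarter 2 has 4
dt <- data.frame(Day = 1:10, Quarter = c(rep(1, 6), rep(2, 4)), value = 1:10)
dt <- dt[order(dt$Day), ]

# Pad each quarter to 7 values and concatenate in quarter order
pv <- unlist(lapply(split(dt$value, dt$Quarter), padd_nas, desired_length = 7),
             use.names = FALSE)

# One row per quarter, one column per (padded) day
m <- matrix(pv, ncol = 7, byrow = TRUE)

# Last day, second-to-last day, third-to-last day of each quarter
last3 <- m[, 7:5]
```

`last3` holds 6, 5, 4 for the first quarter and 10, 9, 8 for the second, matching the mls output shown above.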

R: How to arrange a daily time series of rows and columns to a single column?

I have a dataset, a daily time series, that I want to arrange into a single column. This is my data:
Date Day 1 Day 2 Day 3 Day 4 Day 5 Day 6 .... Day 31
01/01/1964 0 0 0 0 0 0 3
01/02/1964 NA NA NA NA NA NA ...
01/03/1964 195 445 329 121 61,6 44 ...
01/04/1964 17,2 14,9 17,1 102 54,3 9,33 ...
I want this:
Day1 0
Day2 0
.
.
.
Day31 3
I am having problems because of leap years, which have 366 days. I am trying the code below, but without success. Thanks in advance.
EDIT:
I finally got it, but if anyone knows an easier way, using some package or function, I'd be grateful. Otherwise I'll create my own function.
EDIT 2:
Now I have a problem when the data does not start in the first month of a year.
rm(list = ls())
cat("\014")
setwd("C:/")
require(XLConnect)
# Load Streamflow Gauging Station
wb <- loadWorkbook("rainfall.xls")
Data<- readWorksheet(wb, sheet = "rainfall",header = FALSE,region = "B02:AF517")
R<- Data; ##1964 - 2006
sum(R[is.na(R)==FALSE])
# Number of days in each month
Ny<- c(31,28,31,30,31,30,31,31,30,31,30,31); # Normal Year
Ly<- c(31,29,31,30,31,30,31,31,30,31,30,31); # Leap/bissextile Year
S1<- c(1,0,0,0) # Leap year, normal year...
S2<- c(0,1,0,0) # Normal year, leap year...
S3<- c(0,0,1,0) #...
S4<- c(0,0,0,1) #...
Iab<- rep(S1,times=ceiling((nrow(R)/12)/4)); # Index of years
Iab<- Iab[1:(nrow(R)/12)];
Rnew<- matrix(numeric(0), 0,0);
# Organize the data into a single column
for(i in 1:(nrow(R)/12)){
  for(j in 1:12){
    if(Iab[i]==0){
      # normal year: take Ny[j] days from month j
      Rnew<-c(Rnew, t(R[12*(i-1)+j,1:Ny[j]]))
    }else{
      # leap year: take Ly[j] days from month j
      Rnew<-c(Rnew, t(R[12*(i-1)+j,1:Ly[j]]))
    }
  }
}
sum(R[is.na(R)==FALSE])==sum(Rnew[is.na(Rnew)==FALSE]) # Test whether the reorganization succeeded
sum(R[is.na(R)==FALSE])
sum(Rnew[is.na(Rnew)==FALSE])
I have a similar problem. In a way it is even worse, since I have discharge data (Brazilian ANA station) with several interruptions lasting months or years. Vazao01 stands for the discharge on the first day of the month, Vazao02 for the second, and the data frame goes up to Vazao31 (which is obviously NA for months with fewer days, but can also be NA for existing days without a record). The data looks like this and is stored in the data.frame "ANAday":
Date Vazao01 Vazao02 Vazao03...
20 01.05.1989 3463.00 3476.500 3463.000
21 01.06.1989 1867.70 1835.900 1809.400
22 01.07.1989 809.90 798.200 774.800
23 01.08.1989 344.60 308.700 297.900
24 01.11.1989 376.50 388.100 391.000
25 01.12.1989 279.00 289.800 319.500
26 01.01.1990 1715.00 1649.000 1573.200
27 01.02.1990 1035.20 1005.800 972.200
28 01.03.1990 2905.60 2962.100 NA
29 01.06.1990 NA NA NA
30 01.07.1990 297.90 284.400 271.200
31 01.08.1990 228.00 223.200 218.400
32 01.08.1999 NA NA 144.000
33 01.09.1999 20.74 18.620 16.500
34 01.10.1999 119.85 111.450 95.385
35 01.11.1999 11.20 23.705 48.370
36 01.12.1999 160.10 179.000 187.400
37 01.01.2000 843.00 865.300 914.500
38 01.02.2000 1331.30 1368.900 1387.800
39 01.04.2000 1823.60 1808.000 1789.800
40 01.05.2000 1579.00 1524.100 1445.700
I made a list of the months with data:
ANAm=as.Date(ANAday[,1], format="%d.%m.%Y")
format(ANAm, format="%Y-%m")
Then I used the "monthDays" function of the Hmisc package to list the number of days in each month:
require(Hmisc)
nodm=monthDays(ANAm)
Nodm=cbind.data.frame(ANAm,nodm)
I prepared a data.frame for the data I want to have with 3 columns for "YEAR MONTH", "DAY" and "DISCHARGE"
ANATS=array(NA,c(1,3))
colnames(ANATS)=c("mY","d","Q")
Then I used a simple "for" loop to extract the data into one column according to the number of days in each month:
# ANAd is not defined above; it is assumed to hold only the Vazao01..Vazao31 columns of ANAday
for(i in 1:nrow(Nodm)){
  selectANA=as.vector(t(ANAd[i,1:(Nodm[i,2])])) # daily values of month i as a simple vector
  dayANA=c(1:(Nodm[i,2]))
  monthANA=rep(format(as.Date(Nodm[i,1]),format="%Y-%m"),times=as.numeric(Nodm[i,2]))
  ANAts=cbind(monthANA,dayANA,selectANA)
  ANATS=rbind(ANATS,ANAts)
}
ANATS can then be converted into a time series:
combine.date=as.character(paste(ANATS[,1],ANATS[,2],sep="-"))
DATE=as.Date(combine.date, format="%Y-%m-%d")
rownames(ANATS)=as.character(DATE)
ANATS=ANATS[-1,]
ANAXTS=as.xts(ANATS)
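Another way to avoid hard-coding month lengths is to build a candidate date for every cell and let as.Date() reject the impossible ones (such as 30 February), which handles leap years and gaps between months automatically. The sketch below is base R only and runs on made-up values; the toy table `wide` and the column names `Day1` to `Day31` are assumptions modeled on the question's layout.

```r
# Toy wide table: one row per month, first column is the month's start date
# (dd/mm/yyyy as in the question), then one column per day of the month
wide <- data.frame(
  Date = c("01/01/1964", "01/02/1964"),
  matrix(seq_len(62), nrow = 2, byrow = TRUE,
         dimnames = list(NULL, paste0("Day", 1:31)))
)

month_start <- as.Date(wide$Date, format = "%d/%m/%Y")
vals <- as.matrix(wide[, -1])   # rows = months, columns = day 1..31

# One candidate date per cell, month by month
ym   <- rep(format(month_start, "%Y-%m"), each = 31)
day  <- rep(1:31, times = nrow(wide))
date <- as.Date(sprintf("%s-%02d", ym, day), format = "%Y-%m-%d")

# t() makes the values run day by day within each month
long <- data.frame(Date = date, value = as.vector(t(vals)))

# Impossible dates (e.g. 1964-02-30) parsed to NA, so drop those rows
long <- long[!is.na(long$Date), ]
```

With January and February of the leap year 1964, `long` ends up with 31 + 29 = 60 rows in a single value column.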
Maybe I'm having trouble understanding exactly what you're looking for, but are you trying to transpose the data?
t(data)

copy result of unique() string vector in a dataframe R

I am puzzled by something that I thought would easily work.
I have a dataframe with year, city, and species columns.
species City Year
80 Landpattedyr Sisimiut 2007
83 Landpattedyr Sisimiut 2008
87 Landpattedyr Sisimiut 2009
721733 Havpattedyr Upernavik 2010
721734 Havpattedyr Upernavik 2011
721735 Havpattedyr Upernavik 2007
I have used the function unique as follows
years<-unique(df$year)
city<-unique(df$City)
species<-unique(df$species)
Now I need to assign a value from each of those vectors to a dataframe row based on an index, for example:
hunting[1,]$year<-years[i]
hunting[1,]$group<-species[j]
hunting[1,]$city<-city[k]
The problem is that only year is copied properly while city and species in the hunting df show up as numbers. I can't figure out why this is happening. Can anybody help please?
year group city lat long total
1 2007 6 19 66.93 -53.66 4563
NA 2007 6 20 72.78 -56.15 91
3 2007 6 8 67.01 -50.72 388
4 2007 6 21 70.66 -52.12 280
5 2007 6 14 77.47 -69.23 469
6 2007 6 5 69.22 -51.10 1114
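The numbers appear because City and species are most likely factors: a factor stores integer codes plus a levels attribute, and the assignment copies the codes rather than the labels. To find out whether a column is a factor or a character vector, use is.factor(df$City) or is.character(df$City).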
In the case of a factor column, the (unique) levels are stored in the levels attribute, which can be accessed with
levels(df$City)
Note: this may include levels that are not present in the vector, for instance, if some rows have been removed or if some levels have been added.
To retrieve the unique elements of a factor or character vector, you can use this:
as.character(unique(df$City))
For factor columns, this will not return levels that are absent from the data.
Note: the last command is slightly more efficient than unique(as.character(df$City)), since the conversion is evaluated on a possibly shorter vector.
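A small sketch of the difference between factor codes and labels, on made-up values. The city "Nuuk" is added here only to show an unused level and is not from the question.

```r
# A factor whose level "Nuuk" is no longer used after subsetting
City <- factor(c("Sisimiut", "Sisimiut", "Upernavik", "Nuuk"))
City <- City[City != "Nuuk"]

all_levels <- levels(City)                  # still contains "Nuuk"
city_codes <- as.integer(unique(City))      # integer codes: what shows up as numbers
city_names <- as.character(unique(City))    # labels: what was actually wanted
```

Assigning `city_names[k]` instead of an element of the factor keeps the text labels in the target data frame.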
