Giving month names to a variable of numbers in R - r

I have a data set with a variable 'months' holding the values 1 to 12, but I need to change them to month names, i.e. "1" needs to become January and so on. What's the easiest way to do this?

R has a built-in vector called month.name. For your purpose you could do something like the following:
# Some dummy data
set.seed(1)
df <- data.frame(
month = sample(1:12, size = 10)
)
# Now use your integer month to subset month.name
df$month2 <- month.name[df$month] # Also has month.abb
df
month month2
1 9 September
2 4 April
3 7 July
4 1 January
5 2 February
6 5 May
7 3 March
8 8 August
9 6 June
10 11 November
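As a small follow-up to the month.abb comment above, here is a sketch showing that the same indexing works for abbreviated names, and that wrapping the result in a factor with calendar-ordered levels keeps the months from sorting alphabetically. The data and the column names `month_abb` and `month_fct` are made up for illustration.

```r
# Illustrative data, not the asker's data set
df <- data.frame(month = c(9, 4, 12, 1))

# Abbreviated names via the built-in month.abb vector
df$month_abb <- month.abb[df$month]

# Factor with levels in calendar order, so sorting follows the calendar
df$month_fct <- factor(month.name[df$month], levels = month.name)

# Rows ordered January, April, September, December
df[order(df$month_fct), ]
```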

Related

index a dataframe with repeated values according to vector

I am trying to average values in different months over vectors of dates. Basically, I have a dataframe with monthly values of a variable, and I'm trying to get a representative average of the experienced values for samples that sometimes span month boundaries.
I've ended up with a dataframe of monthly values, and vectors of the representative number of "month-year" combinations of every sampling duration (e.g. if a sample was out from Jan 28, 2000 to Feb 1, 2000, the vector would show 4 values of Jan 2000, 1 value of Feb 2000). Later I'm going to average the values with these weights, so it's important that the returned variable values appear in representative numbers.
I am having trouble figuring out how to index the dataframe pulling the representative value repeatedly. See below.
# data frame of monthly values
reprex_df <-
tribble(
~my, ~value,
"2000-01", 10,
"2000-02", 11,
"2000-03", 15,
"2000-04", 9,
"2000-05", 13
) %>%
as.data.frame()
# vector of month-year dates from Jan 28 to Feb 1:
reprex_vec <- c("2000-01","2000-01","2000-01","2000-01","2000-02")
# I want to index the df using the vector to get a return vector of
# January value*4, Feb value*1, or 10, 10, 10, 10, 11
# I tried this:
reprex_df[reprex_df$my %in% reprex_vec,"value"]
# but %in% only returns each value once ("10 11", not "10 10 10 10 11").
# is there a different way I should be indexing to account for repeated values?
# eventually I will take an average, e.g.:
mean(reprex_df[reprex_df$my %in% reprex_vec,"value"])
# but I want this average to equal 10.2 for mean(c(10,10,10,10,11)), not 10.5 for mean(c(10,11))
Simple tidy solution with inner_join:
dplyr::inner_join(reprex_df, data.frame(my = reprex_vec), by = "my")$value
in base R:
merge(reprex_df, list(my = reprex_vec))
my value
1 2000-01 10
2 2000-01 10
3 2000-01 10
4 2000-01 10
5 2000-02 11
Perhaps use match from base R to get the index
reprex_df[match(reprex_vec, reprex_df$my),]
my value
1 2000-01 10
1.1 2000-01 10
1.2 2000-01 10
1.3 2000-01 10
2 2000-02 11
Another base R option using setNames
with(
reprex_df,
data.frame(
my = reprex_vec,
value = setNames(value, my)[reprex_vec]
)
)
gives
my value
1 2000-01 10
2 2000-01 10
3 2000-01 10
4 2000-01 10
5 2000-02 11
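To connect these answers back to the average the question asks for: %in% builds a logical mask over the rows of reprex_df, so each row can appear at most once, whereas match returns one row position per element of reprex_vec. A minimal base R sketch using the question's data (only `idx` is a new name):

```r
reprex_df <- data.frame(
  my = c("2000-01", "2000-02", "2000-03", "2000-04", "2000-05"),
  value = c(10, 11, 15, 9, 13)
)
reprex_vec <- c("2000-01", "2000-01", "2000-01", "2000-01", "2000-02")

# One row position per element of reprex_vec, repeats included
idx <- match(reprex_vec, reprex_df$my)

reprex_df$value[idx]        # 10 10 10 10 11
mean(reprex_df$value[idx])  # 10.2
```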

Updating a numeric column with characters in R

I have a column like this of the Data data.frame:
Month
3
6
9
3
6
9
3
6
9
...
I want to replace 3 with March, 6 with June, and 9 with September. I know how to do it if I have two months, 3 and 10 for example, with: mutate(Data, Month=if_else(Month==3,"March","October")). How can I do it for three months?
Expected output:
Month
March
June
September
March
June
September
March
June
September
...
You could just use your numerical month values to access month.name, which is R's built-in vector of month names, starting at index 1:
Data <- data.frame(Month=c(3,6,9))
Data$MonthName <- month.name[Data$Month]
Data
Month MonthName
1 3 March
2 6 June
3 9 September
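If the labels are not the standard month names, or the codes do not run from 1 to 12, a named lookup vector does the same job. This is a sketch on made-up data; `lookup` is a hypothetical name, and any code missing from it comes back as NA.

```r
Data <- data.frame(Month = c(3, 6, 9, 3, 6, 9))

# Names are the codes (as character), values are the labels
lookup <- c("3" = "March", "6" = "June", "9" = "September")

# Index by name; unname() drops the names carried over from lookup
Data$Month <- unname(lookup[as.character(Data$Month)])
Data$Month
```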

How to complete missing values with Na in a list?

I have a data frame that has the following column: Tree ID, month, values. For some months, there is no recorded data, therefore those months do not exist in the data frame. I have completed the list with the missing months but now I do not know how to insert NA in the value column for the added months.
Example:
Tree.Id: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Month: Jan, Feb, Mar, May, Jun, Jul, Sept, Oct, Nov, Dec
Values: 1,0,1,1,0,2,1,1,0,2
The following months are missing: Apr, Aug.
I added them with the code below, and now I want for those 2 added months to introduce NA in the value column.
Here is what I tried:
tree_ls <- list()
for (i in unique(data$Tree.ID)){
mon1 <- data$month[data$Tree.ID == i] ### extract the month for every Tree iD
mon <- min(mon1, na.rm=T):max(mon1, na.rm=T) # completes the numbers with the missing month
dat1 <- data$value[data$Tree.ID == i]
......
After this step, I do not know how to create a list that will add NA for all the added months that were missing, so I will have lists of the same length.
Thanks
This is an old post, but I have a pretty good solution for this:
To begin, your small reproducible code should probably be the following:
id <- 1:10
month <- c("Jan", "Feb", "Mar", "May", "Jun", "Jul", "Sept", "Oct", "Nov", "Dec")
value <- c(1,0,1,1,0,2,1,1,0,2)
df <- data.frame(id=id, month=month, value=value)
> head(df)
id month value
1 1 Jan 1
2 2 Feb 0
3 3 Mar 1
4 4 May 1
5 5 Jun 0
6 6 Jul 2
Now simply create a data frame holding the complete domain, i.e. every month, including the ones where you want NAs to appear because data is missing.
completeMonths <- c("Jan", "Feb", "Mar", "Apr","May", "Jun", "Jul","Aug", "Sept", "Oct", "Nov", "Dec")
df2 <- data.frame(month=completeMonths)
> df2
month
1 Jan
2 Feb
3 Mar
4 Apr
5 May
6 Jun
7 Jul
8 Aug
9 Sept
10 Oct
11 Nov
12 Dec
Now we have a column with all the underlying values, so when we merge, we can fill the missing rows as NA with the following syntax:
merge(df, df2, by="month", all=TRUE)
With our results as follows:
month id value
1 Dec 10 2
2 Feb 2 0
3 Jan 1 1
4 Jul 6 2
5 Jun 5 0
6 Mar 3 1
7 May 4 1
8 Nov 9 0
9 Oct 8 1
10 Sept 7 1
11 Apr NA NA
12 Aug NA NA
Hope this helps, data wrangling sucks.
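Note that the merged rows do not come back in calendar order. One way to restore it, sketched here with the same example data (the `merged` name is illustrative), is to reorder with match against the complete month vector:

```r
id <- 1:10
month <- c("Jan", "Feb", "Mar", "May", "Jun", "Jul", "Sept", "Oct", "Nov", "Dec")
value <- c(1, 0, 1, 1, 0, 2, 1, 1, 0, 2)
df <- data.frame(id = id, month = month, value = value)

completeMonths <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                    "Jul", "Aug", "Sept", "Oct", "Nov", "Dec")
df2 <- data.frame(month = completeMonths)

# all = TRUE keeps months that appear in only one of the two data frames
merged <- merge(df, df2, by = "month", all = TRUE)

# Reorder the rows to follow completeMonths
merged <- merged[match(completeMonths, merged$month), ]
merged$value  # 1 0 1 NA 1 0 2 NA 1 1 0 2
```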
When you say that you have a data frame with some months that have "no recorded data" and therefore "do not exist", the fact that they are in the data frame at all means they have some representation. I'm going to guess that by "do not exist" you mean that they are blank strings, such as "". If that's the case, you can replace the blank strings with NA values using mutate in the dplyr package and ifelse in the base package as follows:
library(dplyr)
data_with_nas <- mutate(data, value = ifelse(value == "", NA, value))
That reads as "change the data data frame such that its value cells are replaced with NA if they were a blank string, or kept as is otherwise."
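The same replacement can be sketched in base R without dplyr. The `data` frame below is invented for illustration and assumes, as this answer does, that missing entries are stored as empty strings in a character column.

```r
data <- data.frame(month = c("Jan", "Feb", "Mar"),
                   value = c("1", "", "0"),
                   stringsAsFactors = FALSE)

# Overwrite empty strings with NA, leaving other cells untouched
data$value[data$value == ""] <- NA

# If the remaining entries are numeric text, convert the column afterwards
data$value <- as.numeric(data$value)
data$value
```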

Read Data into Time Series Object in R

My data looks as follows:
Month/Year;Number
01/2010; 1.0
02/2010;19.0
03/2010; 1.0
...
How can I read this into a ts(object) in R?
Try this (assuming your data is called df)
ts(df$Number, start = c(2010, 01), frequency = 12)
## Jan Feb Mar
## 2010 1 19 1
Edit: this will work only if you don't have missing dates and your data is in the correct order. For a more general solution see @Ananda's answer below.
I would recommend using zoo as a starting point. This will ensure that if there are any month/year combinations missing, they would be handled properly.
Example (notice that data for April is missing):
mydf <- data.frame(Month.Year = c("01/2010", "02/2010", "03/2010", "05/2010"),
Number = c(1, 19, 1, 12))
mydf
# Month.Year Number
# 1 01/2010 1
# 2 02/2010 19
# 3 03/2010 1
# 4 05/2010 12
library(zoo)
as.ts(zoo(mydf$Number, as.yearmon(mydf$Month.Year, "%m/%Y")))
# Jan Feb Mar Apr May
# 2010 1 19 1 NA 12
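For readers without zoo, roughly the same result can be sketched in base R: read the semicolon-separated text, build the full monthly sequence, and use match to leave NA where a month is absent. The inline text reuses this answer's example values; names such as `raw`, `dates`, `all_months`, and `number_ts` are illustrative, and the sketch assumes every row holds a valid MM/YYYY string.

```r
raw <- "Month/Year;Number
01/2010; 1.0
02/2010;19.0
03/2010; 1.0
05/2010;12.0"

# header = TRUE turns "Month/Year" into the syntactic name Month.Year
mydf <- read.table(text = raw, sep = ";", header = TRUE,
                   stringsAsFactors = FALSE)

# Anchor each month on its first day so it can be parsed as a Date
dates <- as.Date(paste0("01/", mydf$Month.Year), format = "%d/%m/%Y")

# Every month from first to last, then NA wherever a month has no row
all_months <- seq(min(dates), max(dates), by = "month")
filled <- mydf$Number[match(all_months, dates)]

number_ts <- ts(filled,
                start = c(as.integer(format(min(dates), "%Y")),
                          as.integer(format(min(dates), "%m"))),
                frequency = 12)
number_ts
```

This relies on the rows being parseable as dates, so a malformed entry would show up as NA in `dates` rather than being silently misplaced.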

How to make all the months to have an equal number of days (for example 22 days) for a MIDAS regression in R

This is a follow up question for these two posts.
How to deal with impossible dates for midasr package
https://stats.stackexchange.com/questions/77495/what-can-i-do-with-these-two-time-series
I need to use the mls function in the midasr package in R to transform the high frequency (daily) financial data to low frequency (quarterly) macroeconomic data.
The author, @mpiktas, mentioned:
You must make all the months to have an equal number of days. And then
set frequency to that number. You can achieve that by discarding data,
padding NAs or extrapolating.
and
You could use zoo objects to make the padding easier, but in the end
simple numeric vector should be passed.
I searched in different ways but did not find an easy way to implement this.
I used dplyr to give each month 31 days, with 7-11 NAs.
# generate the date vector
library(midasr)
library(dplyr)
library(quantmod)
tsxdate <- as.Date( paste(1979, rep(1:12, each=31), 1:31, sep="-") )
for (year in 1980:2015){
tsxdate <- c(tsxdate,as.Date( paste(year, rep(1:12, each=31), 1:31, sep="-") ))
}
# transform to dataframe
tsxdate.df <- as.data.frame(tsxdate)
# get the stock market index from yahoo
tsxindex <- getSymbols("^GSPTSE",src="yahoo", from = '1977-01-01', auto.assign = FALSE)
# merge two data frame to get each month with 31 days
tsx.df <- left_join(tsxdate.df, tsxindex)
I doubt this caused a problem due to too many NAs.
I put the new daily data into MIDAS regression in R. It did not work. None of the weight functions work.
# Since each month has 31 days, one quarter of yy corresponds to 93 days of data.
midas_r(midas_r(yy~trend+fmls(zz,30,93,nealmon) ,start=list(zz=rep(0,4))), Ofunction="nls")
Could you tell me how to make all the months to have an equal number of days?
update:
Finally, I found a way using aggregate from the zoo package together with the first function. It is not perfect, but it works and is fast. first will add NAs according to the parameter.
I still need to figure out how to fit it into a MIDAS regression.
# get data
tsx <- getSymbols("^GSPTSE",src="yahoo", from = '1977-01-01', auto.assign = FALSE)
# subset
# generate a zoo object
library(zoo)
tsx.zoo <- zoo(tsx$GSPTSE.Adjusted)
# group by yearmonth and take first 22 days data.
days <-aggregate(tsx.zoo, as.yearmon, first, 22)
It looks like this: each row is one month with 22 days of data (the 22 columns wrap across three blocks in the printout below).
Jun 1979 1614.29 NA NA NA NA NA NA NA NA NA
Jul 1979 1614.29 1598.73 1579.88 1582.57 1582.27 1576.19 1559.23 1529.81 1533.50 1547.66
Aug 1979 1554.14 1556.94 1553.84 1553.84 1551.95 1561.23 1562.52 1571.00 1578.08 1580.28
Sep 1979 1685.11 1657.58 1690.10 1720.92 1716.53 1711.34 1722.71 1714.63 1727.50 1724.51
Oct 1979 1749.05 1767.40 1775.98 1786.35 1800.12 1800.12 1735.88 1685.21 1681.52 1670.65
Nov 1979 1599.33 1606.81 1596.54 1592.94 1574.49 1569.20 1583.97 1608.70 1611.00 1619.78
Jun 1979 NA NA NA NA NA NA NA NA NA NA
Jul 1979 1556.94 1546.86 1548.46 1553.54 1542.07 1543.17 1552.85 1566.01 1573.99 1564.12
Aug 1979 1596.64 1602.82 1615.09 1636.53 1653.09 1660.97 1657.78 1665.46 1674.44 1674.64
Sep 1979 1714.73 1717.53 1732.59 1736.48 1731.19 1732.49 1746.75 1754.33 1747.45 NA
Oct 1979 1639.03 1613.19 1616.29 1635.34 1593.44 1533.40 1522.12 1534.49 1517.24 1523.92
Nov 1979 1628.55 1621.57 1624.36 1627.56 1620.27 1647.51 1677.93 1683.81 1690.70 1698.97
Jun 1979 NA NA
Jul 1979 1554.14 NA
Aug 1979 1674.24 1675.43
Sep 1979 NA NA
Oct 1979 1538.68 1552.25
update again:
@mpiktas gives a better and correct way to do it.
1. NAs should be padded at the beginning of each period.
2. Data should be gathered at the frequency of the response variable. In my case, it is quarterly.
His function can be used in the aggregate function in zoo. I guess it does the same job as group_by plus do in dplyr: split, operate, and give back a list of results. I tried this:
tsxdaily <- aggregate(tsx.zoo, as.yearqtr, padd_nas, 66)
as.yearqtr groups the data at the frequency of the response variable.
Here is one possible way to add NAs.
First, note that MIDAS regression puts the emphasis on the last values of the period, so you need to put NAs in front, not in the back.
Suppose that we have the following dummy data:
> dt <- data.frame(Day=1:10,Quarter=c(rep(1,6),rep(2,4)),value=1:10)
> dt
Day Quarter value
1 1 1 1
2 2 1 2
3 3 1 3
4 4 1 4
5 5 1 5
6 6 1 6
7 7 2 7
8 8 2 8
9 9 2 9
10 10 2 10
In this example there are two quarters, the first one has 6 days, the second one 4. Suppose we want to harmonize the data, so that the quarter has 7 days (for example).
Define a simple function which adds NAs at the beginning of the data:
padd_nas <- function(x, desired_length) {
n <- length(x)
if(n < desired_length) {
c(rep(NA,desired_length-n),x)
} else {
tail(x,desired_length)
}
}
Here is an example illustrating how this function works:
> padd_nas(1:4,7)
[1] NA NA NA 1 2 3 4
>
Now add NAs for each quarter and make sure that the data is ordered by day:
library(dplyr)
pdt <- dt %>% arrange(Day) %>% group_by(Quarter) %>% do(pv = padd_nas(.$value, 7))
> pdt
Source: local data frame [2 x 2]
Groups: <by row>
Quarter pv
1 1 <int[7]>
2 2 <int[7]>
To get the padded result simply use unlist on column pv:
> pv <- pdt$pv %>% unlist
> pv
[1] NA 1 2 3 4 5 6 NA NA NA 7 8 9 10
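The same padding can also be sketched in base R with split and lapply, which avoids do() (superseded in recent dplyr releases). It reuses padd_nas and the dummy data from above; `padded` is an illustrative name.

```r
padd_nas <- function(x, desired_length) {
  n <- length(x)
  if (n < desired_length) {
    c(rep(NA, desired_length - n), x)  # too short: pad NAs at the front
  } else {
    tail(x, desired_length)            # long enough: keep the last values
  }
}

dt <- data.frame(Day = 1:10, Quarter = c(rep(1, 6), rep(2, 4)), value = 1:10)
dt <- dt[order(dt$Day), ]

# Split values by quarter, pad each piece at the front, then flatten
padded <- unlist(lapply(split(dt$value, dt$Quarter), padd_nas, 7),
                 use.names = FALSE)
padded
```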
Now we can prepare this for MIDAS regression with mls. Suppose that only the last 3 days are relevant for each quarter:
> library(midasr)
> mls(pv, 0:2, 7)
X.0/m X.1/m X.2/m
[1,] 6 5 4
[2,] 10 9 8
Compare this with original data dt.
This approach can be generalized for any low and high frequency data configuration.
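To see where the numbers in the mls output come from, the result for this small example can be reproduced by hand in base R. This is only an illustration of the layout, not how midasr implements mls; `m` and `lags` are illustrative names.

```r
pv <- c(NA, 1:6, NA, NA, NA, 7:10)

# One row per low-frequency period, 7 high-frequency values per row
m <- matrix(pv, ncol = 7, byrow = TRUE)

# Lag 0 is the last value of each period, lag 1 the one before, and so on
lags <- m[, 7 - 0:2]
lags
```

Because the NAs sit at the front of each period, the last three columns contain only observed values, which is why front-padding is the right choice here.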
