Generate a sequence of quarter dates in R

I am new to R and have a data frame that looks something like this.
Date A B
1990 Q1 2 3
Q2 4 2
Q3 7 6
Q4 5 3
1991 Q1 7 6
Q2 1 8
Q3 7 6
Q4 9 2
1992 Q1 1 7
Q2 4 6
Q3 1 3
Q4 5 8
...
The column runs the full length of the data, and neither the start date nor the end date is fixed because the data is constantly updated. I would like to convert the Date column to a date class and achieve something like this:
Date A B
1990 Q1 2 3
1990 Q2 4 2
1990 Q3 7 6
1990 Q4 5 3
1991 Q1 7 6
1991 Q2 1 8
1991 Q3 7 6
1991 Q4 9 2
1992 Q1 1 7
1992 Q2 4 6
1992 Q3 1 3
1992 Q4 5 8
...
I thought of creating a new column of dates on the left, using the first date provided by the data (i.e. '1990 Q1') as the starting date and the number of rows as the length. I was looking at using the seq and as.yearqtr functions but can't work out the proper code for it. Does anyone know of a better way to do this?

To use the as.yearqtr function from the zoo package to create a year-quarter time series, first split the df$Date values into year and quarter strings, then use na.locf (also from zoo) to fill in each missing year with the value from the previous row, and finally convert to a zoo time series indexed by yearqtr dates. The code would look like this:
library(zoo)
# split Date into year and quarter strings
tmp <- t(sapply(strsplit(as.character(df$Date), " "), function(x) if(length(x)==1) c(NA, x) else x))
# use na.locf to replace NA with previous year
tmp <- paste(na.locf(tmp[,1]), tmp[,2])
# transform df into a zoo time series object with yearqtr dates
df_zoo <- zoo(df[,-1], order.by = as.yearqtr(tmp))
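Since the question also asks about getting a real date class, here is a minimal sketch of what a yearqtr value gives you once the strings are complete. The sample strings and the names x, q, start and qseq are only for illustration; the format is passed explicitly so nothing depends on format guessing.

```r
library(zoo)

# Complete "YYYY Qq" strings, as produced by the fill step above
x <- c("1990 Q1", "1990 Q2", "1991 Q4")
q <- as.yearqtr(x, format = "%Y Q%q")

# A yearqtr prints in the same "YYYY Qq" form
format(q)

# Convert to a Date holding the first day of each quarter
as.Date(q)

# A yearqtr is stored as year + (quarter - 1) / 4, so a run of
# consecutive quarters can be built from a starting quarter
start <- as.yearqtr("1990 Q1", format = "%Y Q%q")
qseq <- as.yearqtr(as.numeric(start) + (0:5) / 4)
format(qseq)
```

Keeping the column as yearqtr means it sorts and plots as a time index while still printing as quarters.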

We could do this in base R. Create a grouping variable using grepl and cumsum, extract the year substring from 'Date', replace the empty strings ('') with the year of their group using ave, and then paste the result with the quarter substring extracted using sub.
df$Date <- paste(ave(sub("\\s*Q.", "", df$Date),
cumsum(grepl("^\\d+", df$Date)), FUN = function(x) x[nzchar(x)]),
sub("^\\d+\\s+", "", df$Date))
df$Date
#[1] "1990 Q1" "1990 Q2" "1990 Q3" "1990 Q4" "1991 Q1" "1991 Q2"
#[7] "1991 Q3" "1991 Q4" "1992 Q1" "1992 Q2" "1992 Q3" "1992 Q4"
No additional packages are needed.
If we need a package solution, data.table can be used
library(data.table)
library(stringr)
setDT(df)[, Date:=sub("^(Q.*)", paste0(word(Date[1],1), " \\1") , Date),
cumsum(grepl("^\\d+" , Date))]
df
# Date A B
# 1: 1990 Q1 2 3
# 2: 1990 Q2 4 2
# 3: 1990 Q3 7 6
# 4: 1990 Q4 5 3
# 5: 1991 Q1 7 6
# 6: 1991 Q2 1 8
# 7: 1991 Q3 7 6
# 8: 1991 Q4 9 2
# 9: 1992 Q1 1 7
#10: 1992 Q2 4 6
#11: 1992 Q3 1 3
#12: 1992 Q4 5 8
data
df <- structure(list(Date = c("1990 Q1", "Q2", "Q3", "Q4", "1991 Q1",
"Q2", "Q3", "Q4", "1992 Q1", "Q2", "Q3", "Q4"), A = c(2L, 4L,
7L, 5L, 7L, 1L, 7L, 9L, 1L, 4L, 1L, 5L), B = c(3L, 2L, 6L, 3L,
6L, 8L, 6L, 2L, 7L, 6L, 3L, 8L)), .Names = c("Date", "A", "B"
), row.names = c(NA, -12L), class = "data.frame")
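The grouping idea used above (grepl to flag rows that carry a year, cumsum to number the groups) can also be written out step by step. This is only a sketch on a shortened copy of the Date column; it assumes the first row always carries a year, and the names has_year, group, years, quarter and filled are made up for the example.

```r
# Shortened copy of the Date column from the question
Date <- c("1990 Q1", "Q2", "Q3", "Q4", "1991 Q1", "Q2", "Q3", "Q4")

# TRUE where the row starts with a four-digit year
has_year <- grepl("^\\d{4}", Date)

# Running count of year rows: 1 1 1 1 2 2 2 2
group <- cumsum(has_year)

# One year per group, taken from the rows that have one
years <- sub("\\s.*$", "", Date[has_year])

# Quarter label from every row
quarter <- sub("^.*(Q[1-4])$", "\\1", Date)

# Index the years by group number to fill the gaps
filled <- paste(years[group], quarter)
filled
```

Indexing years by group does the same job as ave or na.locf here, without needing NA or empty-string placeholders.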

Here is a straightforward way to create the sequence you are looking for:
numrows<-10 #number of elements desired
#create the sequence of Date objects
qtrseq<-seq(as.Date("1990-01-01"), by="quarter", length.out = numrows)
#created vector for the formatted display
qtrformatted<-paste(as.POSIXlt(qtrseq)$year+1900, quarters(qtrseq))
The downside of this method and the other listed solutions is the loss of the Date object. There is no good way in base R to format the dates as Q1, Q2, ... and have the object remain a Date object. Depending on your application, it might be best to store the date sequence in the data frame and use the formatted quarter strings for output purposes only.
Best of luck.
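To tie this back to the question, here is a rough sketch that takes the starting quarter from the first row and the length from the number of rows, then keeps both a real Date column and a display label. It assumes the rows are consecutive quarters with no gaps; the column names QuarterStart and Label are invented for the example.

```r
# Small stand-in for the data frame in the question
df <- data.frame(Date = c("1990 Q1", "Q2", "Q3", "Q4", "1991 Q1", "Q2"),
                 A = c(2L, 4L, 7L, 5L, 7L, 1L),
                 stringsAsFactors = FALSE)

# Parse the year and quarter number from the first row only
first <- df$Date[1]
start_year <- as.integer(substr(first, 1, 4))
start_qtr <- as.integer(sub(".*Q", "", first))

# First day of the starting quarter (Q1 -> month 1, Q2 -> month 4, ...)
start_date <- as.Date(sprintf("%d-%02d-01", start_year, 3L * (start_qtr - 1L) + 1L))

# One Date per row, stepping a quarter at a time
df$QuarterStart <- seq(start_date, by = "quarter", length.out = nrow(df))

# Display label built from the Date column, for output only
df$Label <- paste(format(df$QuarterStart, "%Y"), quarters(df$QuarterStart))
df
```

If the data can have missing quarters, filling the year from the existing column (as in the other answers) is safer than regenerating the sequence.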

Assuming Date is a single character column, here's an option using tidyr:
library(tidyr)
# separate date into year and quarter, inserting NAs in year as necessary
df %>% separate(Date, into = c('year', 'quarter'), fill = 'left') %>%
# fill NAs with previous value
fill(year) %>%
# join year and quarter back into a single column
unite(Date, year, quarter, sep = ' ')
# Date A B
# 1 1990 Q1 2 3
# 2 1990 Q2 4 2
# 3 1990 Q3 7 6
# 4 1990 Q4 5 3
# 5 1991 Q1 7 6
# 6 1991 Q2 1 8
# 7 1991 Q3 7 6
# 8 1991 Q4 9 2
# 9 1992 Q1 1 7
# 10 1992 Q2 4 6
# 11 1992 Q3 1 3
# 12 1992 Q4 5 8
Data
df <- structure(list(Date = structure(c(1L, 4L, 5L, 6L, 2L, 4L, 5L,
6L, 3L, 4L, 5L, 6L), .Label = c("1990 Q1", "1991 Q1", "1992 Q1",
"Q2", "Q3", "Q4"), class = "factor"), A = c(2L, 4L, 7L, 5L, 7L,
1L, 7L, 9L, 1L, 4L, 1L, 5L), B = c(3L, 2L, 6L, 3L, 6L, 8L, 6L,
2L, 7L, 6L, 3L, 8L)), .Names = c("Date", "A", "B"), class = "data.frame", row.names = c(NA,
-12L))

Here is something you can try
library(dplyr); library(stringr); library(zoo)
df %>% mutate(Date = paste(na.locf(str_extract(Date, "^[0-9]{4}")),
str_extract(Date, "Q[1-4]$"), sep = " "))
Date A B
1 1990 Q1 2 3
2 1990 Q2 4 2
3 1990 Q3 7 6
4 1990 Q4 5 3
5 1991 Q1 7 6
6 1991 Q2 1 8
7 1991 Q3 7 6
8 1991 Q4 9 2
9 1992 Q1 1 7
10 1992 Q2 4 6
11 1992 Q3 1 3
12 1992 Q4 5 8

Related

How to calculate three year rolling return using R

I need to get a 3-year rolling return working (3-year return for each id, for each year).
I have tried to use the PerformanceAnalytics package but I keep getting an error that my data is not a time series.
When I use the function it says TRUE so I am completely stuck as to how to get the 3-year rolling return to work. So I just need someone to provide me with the R code that will produce the 3-year returns.
Here's a sample dataset
ppd_id FY TF_1YR
1 2001 -0.0636
1 2002 -0.0929
1 2003 0.1648
1 2004 0.1006
1 2005 0.1098
1 2006 0.0837
1 2007 0.1792
1 2008 -0.1521
1 2009 -0.1003
1 2010 0.0847
1 2011 0.0221
1 2012 0.1801
1 2013 0.146
1 2014 0.1202
1 2015 0.0105
1 2016 0.1022
1 2017 0.1286
1 2018 0.0929
Here's link to dataset
Here's my code
library(smooth)
library(readr)
pensionreturns <- read_csv("pensionreturns.csv")
sma(pensionreturns, h=
Assuming that:
we are starting out with the data frame DF2 in the Note at the end, which is the data in question duplicated so that there are 2 ids, and
the third column represents returns, so the 3-year return is the product of one plus each of the last 3 values (current value and prior 2), all minus 1, i.e. (1 + r0) * (1 + r1) * (1 + r2) - 1, where r0 is the current year's return, r1 is the prior year's return and r2 is the return in the year prior to that,
we can convert the data to the wide form zoo series z and then use rollapplyr. Omit the fill= argument if the NAs at the beginning are not needed. The result will be a zoo series of returns. (We could use fortify.zoo, see ?fortify.zoo, to convert it to a data frame, although it will be easier to perform further time series manipulations if you leave it as a time series.)
library(zoo)
z <- read.zoo(DF2, index = 2, split = 1, FUN = c)
rollapplyr(z + 1, 3, prod, fill = NA) - 1
giving this zoo series:
1 2
2001 NA NA
2002 NA NA
2003 -0.010609049 -0.010609049
2004 0.162883042 0.162883042
2005 0.422740161 0.422740161
2006 0.323680900 0.323680900
2007 0.418212355 0.418212355
2008 0.083530596 0.083530596
2009 -0.100440641 -0.100440641
2010 -0.172530498 -0.172530498
2011 -0.002527919 -0.002527919
2012 0.308343674 0.308343674
2013 0.382282521 0.382282521
2014 0.514952431 0.514952431
2015 0.297228567 0.297228567
2016 0.247648627 0.247648627
2017 0.257004321 0.257004321
2018 0.359505217 0.359505217
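For anyone who wants to check the compounding formula by hand, here is a base R sketch that computes the same 3-year return in long form, one id at a time. The helper name roll3 and the column name TF_3YR are made up for this example, and it assumes the rows are already sorted by FY within each ppd_id.

```r
# First four years of the sample data, duplicated for a second id
DF <- data.frame(ppd_id = 1L, FY = 2001:2004,
                 TF_1YR = c(-0.0636, -0.0929, 0.1648, 0.1006))
DF2 <- rbind(DF, transform(DF, ppd_id = 2L))

# (1 + r0) * (1 + r1) * (1 + r2) - 1 over the current and two prior values
roll3 <- function(r) {
  out <- rep(NA_real_, length(r))
  for (i in seq_along(r)) {
    if (i >= 3) out[i] <- prod(1 + r[(i - 2):i]) - 1
  }
  out
}

# ave() applies roll3 within each id and returns values in the original row order
DF2$TF_3YR <- ave(DF2$TF_1YR, DF2$ppd_id, FUN = roll3)
DF2
```

The first two years of each id come out as NA because there are not yet three returns to compound.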
Note
DF <- structure(list(ppd_id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), FY = 2001:2018, TF_1YR = c(-0.0636,
-0.0929, 0.1648, 0.1006, 0.1098, 0.0837, 0.1792, -0.1521, -0.1003,
0.0847, 0.0221, 0.1801, 0.146, 0.1202, 0.0105, 0.1022, 0.1286,
0.0929)), class = "data.frame", row.names = c(NA, -18L))
DF2 <- rbind(DF, transform(DF, ppd_id = 2))

How do I combine dates, regardless of a third variable in R?

The following is a data example,
Month Year Tornado Location
January 1998 3 Illinois
February 1998 2 Illinois
March 1998 5 Illinois
January 1998 1 Florida
January 2010 3 Illinois
Here is what I want it to look like essentially,
Date Tornado
1998-01 4
1998-02 2
1998-03 5
2010-01 3
So, I want to combine the Year and Month into one, new column. The locations do not matter, I want to know the total number of tornadoes for January, 1998, and etc.
I have the following code, but do not know how to change it to incorporate both the variables I want, or if this is even the correct code for what I am attempting to do.
mydata$Date <- format(as.Date(mydata$month), "%m-%Y")
The real dataset is far too large to fix manually. I am basically attempting to make this data into time series data.
You need to transform the data first and then sum by group, as described in "How to sum a variable by group":
aggregate(Tornado~Date, transform(df, Date = format(as.Date(paste(Month,Year,"01"),
"%B %Y %d"), "%Y-%m")), sum)
# Date Tornado
#1 1998-01 4
#2 1998-02 2
#3 1998-03 5
#4 2010-01 3
data
df <- structure(list(Month = structure(c(2L, 1L, 3L, 2L, 2L),
.Label = c("February", "January", "March"), class = "factor"),
Year = c(1998L, 1998L,1998L, 1998L, 2010L),
Tornado = c(3L, 2L, 5L, 1L, 3L), Location = structure(c(2L,
2L, 2L, 1L, 2L), .Label = c("Florida", "Illinois"), class = "factor")),
class = "data.frame", row.names = c(NA, -5L))
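One thing to be aware of: parsing month names with "%B" depends on the session locale. As an alternative sketch (not what the answer above does), the month number can be looked up in base R's built-in month.name constant, which always holds the English names, so no date parsing is needed. The name totals is just for this example.

```r
df <- data.frame(
  Month = c("January", "February", "March", "January", "January"),
  Year = c(1998L, 1998L, 1998L, 1998L, 2010L),
  Tornado = c(3L, 2L, 5L, 1L, 3L),
  Location = c("Illinois", "Illinois", "Illinois", "Florida", "Illinois")
)

# Month name -> month number via the built-in English month names,
# then build a "YYYY-MM" key
df$Date <- sprintf("%d-%02d", df$Year, match(as.character(df$Month), month.name))

# Sum tornadoes per year-month, ignoring Location
totals <- aggregate(Tornado ~ Date, data = df, FUN = sum)
totals
```

Because the key is zero-padded "YYYY-MM" text, sorting it as a string also sorts it chronologically.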
First, I combined Month and Year into a single variable called Date, converted it to the yearmon class from the zoo package, and then summed Tornado by Date.
library(tidyverse)
library(zoo)
df %>%
unite(Date, Month, Year) %>%
mutate(Date = as.yearmon(Date, format = '%B_%Y')) %>%
group_by(Date) %>%
summarise(Tornado = sum(Tornado))
# A tibble: 4 x 2
Date Tornado
<yearmon> <int>
1 Jan 1998 4
2 Feb 1998 2
3 Mar 1998 5
4 Jan 2010 3
If the day doesn't matter, you can do:
#library (tidyverse)
library(lubridate)
x$Date<-as_date(paste0(x$Year,x$Month,"-01"))
# A tibble: 5 x 4
Month Year Tornados Date
<chr> <dbl> <dbl> <date>
1 January 1998 3 1998-01-01
2 February 1998 2 1998-02-01
3 March 1998 5 1998-03-01
4 January 1998 1 1998-01-01
5 January 2010 3 2010-01-01

Linear regression on split data in R

I want to make groups of data where measurements are done in multiple Year on the same species at the same Lat and Long. Then, I want to run linear regression on all those groups (using N as dependent variable and Year as independent variable).
Practice dataset:
Species Year Lat Long N
1 1 1999 1 1 5
2 1 2001 2 1 5
3 2 2010 3 3 4
4 2 2010 3 3 2
5 2 2011 3 3 5
6 2 2012 3 3 8
7 3 2007 8 7 -10
8 3 2019 8 7 100
9 2 2000 1 1 5
First, I averaged data where multiple measurements were done in the same Year on the same Species at the same latitude and longitude. Then, I split the data based on Lat, Long and Species. However, this still groups rows together where Lat, Long and Species are not equal ($ '4'). Furthermore, I want to remove $'1', since I only want to use data where measurements were done in more than one Year. How do I do this?
Data <- read.table("Dataset.txt", header = TRUE)
Agr_Data <- aggregate(N ~ Lat + Long + Year + Species, data = Data, mean)
Split_Data <- split(Agr_Data, Agr_Data$Lat + Agr_Data$Long + Agr_Data$Species)
Regression_Data <- lapply(Split_Data, function(Split_Data) lm(N~Year, data = Split_Data) )
Split_Data
$`3`
Lat Long Year Species N
1 1 1 1999 1 5
$`4`
Lat Long Year Species N
2 2 1 2001 1 5
3 1 1 2000 2 5
$`8`
Lat Long Year Species N
4 3 3 2010 2 3
5 3 3 2011 2 5
6 3 3 2012 2 8
$`18`
Lat Long Year Species N
7 8 7 2007 3 -10
8 8 7 2019 3 100
Desired output:
Lat Long Species Coefficients
3 3 2 2.5
8 7 3 9.167
Base R solution:
# 1. Import data:
df <- structure(list(Species = c(1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 2L ),
Year = c(1999L, 2001L, 2010L, 2010L, 2011L, 2012L, 2007L, 2019L, 2000L),
Lat = c(1L, 2L, 3L, 3L, 3L, 3L, 8L, 8L, 1L),
Long = c(1L, 1L, 3L, 3L, 3L, 3L, 7L, 7L, 1L),
N = c(5L, 5L, 4L, 2L, 5L, 8L, -10L, 100L, 5L)),
class = "data.frame", row.names = c(NA, -9L ))
# 2. Aggregate data:
df <- aggregate(N ~ Lat + Long + Year + Species, data = df, mean)
# 3. Concatenate vecs to create grouping vec:
df$grouping_var <- paste(df$Species, df$Lat, df$Long, sep = ", ")
# 4. split apply combine lm (slope of N on Year; NA for groups with a single row):
coeff_n <- sapply(split(df, df$grouping_var),
function(x){
if (nrow(x) > 1) coef(lm(N ~ Year, data = x))[["Year"]] else NA_real_
}
)
# 5. Create a dataframe of coeffs, taking the group labels from the split itself:
coeff_df <- data.frame(grouping_var = names(coeff_n), coeff_n = unname(coeff_n))
# 6. Merge the dataframes together:
df <- merge(df, coeff_df, by = "grouping_var", all.x = TRUE)
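On the splitting problem in the question: adding Lat + Long + Species produces the same key for different combinations (1 + 1 + 2 and 2 + 1 + 1 both give 4), which is why unrelated rows end up together. Here is a sketch that splits on the combination itself, drops groups with only one year, and returns a table shaped like the desired output. The names agg, groups and result are made up for this example.

```r
Data <- data.frame(
  Species = c(1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 2L),
  Year = c(1999L, 2001L, 2010L, 2010L, 2011L, 2012L, 2007L, 2019L, 2000L),
  Lat = c(1L, 2L, 3L, 3L, 3L, 3L, 8L, 8L, 1L),
  Long = c(1L, 1L, 3L, 3L, 3L, 3L, 7L, 7L, 1L),
  N = c(5L, 5L, 4L, 2L, 5L, 8L, -10L, 100L, 5L)
)

# Average repeated measurements within the same year
agg <- aggregate(N ~ Lat + Long + Year + Species, data = Data, FUN = mean)

# Split on the combination of the three columns, not on their sum
groups <- split(agg, list(agg$Lat, agg$Long, agg$Species), drop = TRUE)

# Keep only groups measured in more than one year
groups <- Filter(function(g) length(unique(g$Year)) > 1, groups)

# One row per group with the slope of N on Year
result <- do.call(rbind, lapply(groups, function(g) {
  data.frame(Lat = g$Lat[1], Long = g$Long[1], Species = g$Species[1],
             Coefficients = coef(lm(N ~ Year, data = g))[["Year"]])
}))
rownames(result) <- NULL
result
```

With the practice dataset this leaves two groups, matching the two rows in the desired output.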

Reshape data set from wide to long format grouped by variable suffix

Similar to, yet different from, this post: Reshaping data.frame from wide to long format
I have a wide dataset with a unique ID variable and all other variables with a 4 digit year suffix:
ID MI1995 FRAC1995 MI1996 FRAC1996
1 2 3 2 4
7 3 10 12 1
10 1 2 1 1
I would like a long dataset grouped by the 4 digit variable suffix.
So each ID should have 1 row per year of 4 digit suffix:
ID YEAR MI FRAC
1 1995 2 3
1 1996 2 4
7 1995 3 10
7 1996 12 1
10 1995 1 2
10 1996 1 1
Base/generic solutions are preferred.
The main questions here are, how do I establish automatic cutpoints for the "varying" parameter in reshape, and how do I supply the "timevar" parameter from the variable suffix?
Using reshape we can set the cutpoints with sep="".
reshape(d, idvar="ID", varying=2:5, timevar="YEAR", sep="", direction="long")
# ID YEAR MI FRAC
# 1.1995 1 1995 2 3
# 7.1995 7 1995 3 10
# 10.1995 10 1995 1 2
# 1.1996 1 1996 2 4
# 7.1996 7 1996 12 1
# 10.1996 10 1996 1 1
Data
d <- structure(list(ID = c(1L, 7L, 10L), MI1995 = c(2L, 3L, 1L),
FRAC1995 = c(3L, 10L, 2L), MI1996 = c(2L, 12L, 1L),
FRAC1996 = c(4L, 1L, 1L)), row.names = c(NA, -3L),
class = "data.frame")
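If the number of year-suffixed columns changes over time, the varying argument does not have to be hard-coded as 2:5. A sketch under the assumption that every varying column ends in a four-digit year and has no separator (as in the question): pick them by pattern and let sep="" do the splitting. The names year_cols and long are only for illustration.

```r
d <- data.frame(ID = c(1L, 7L, 10L),
                MI1995 = c(2L, 3L, 1L), FRAC1995 = c(3L, 10L, 2L),
                MI1996 = c(2L, 12L, 1L), FRAC1996 = c(4L, 1L, 1L))

# Every column whose name ends in four digits is treated as time-varying
year_cols <- grep("[0-9]{4}$", names(d), value = TRUE)

long <- reshape(d, idvar = "ID", varying = year_cols,
                timevar = "YEAR", sep = "", direction = "long")

# Order by ID then YEAR to match the layout asked for in the question
long <- long[order(long$ID, long$YEAR), ]
rownames(long) <- NULL
long
```

Note that sep="" relies on a letter being directly followed by a digit, so names such as MI_1995 would need sep="_" instead.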

Apply function only to certain level of factor?

I have a data frame like so:
df <- structure(list(year = c(1990, 1990, 1990, 1990, 1990, 1990, 1990,
1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1991, 1991, 1991,
1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991,
1991), group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
value = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L,
13L, 14L, 15L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L,
15L, 16L, 17L, 18L, 19L)), .Names = c("year", "group", "value"
), row.names = c(NA, -30L), class = "data.frame")
> df
year group value
1 1990 A 1
2 1990 A 2
3 1990 A 3
4 1990 A 4
5 1990 A 5
6 1990 A 6
7 1990 B 7
8 1990 B 8
9 1990 B 9
10 1990 B 10
11 1990 B 11
12 1990 B 12
13 1990 B 13
14 1990 B 14
15 1990 B 15
16 1991 A 5
17 1991 A 6
18 1991 A 7
19 1991 A 8
20 1991 A 9
21 1991 A 10
22 1991 A 11
23 1991 A 12
24 1991 A 13
25 1991 A 14
26 1991 B 15
27 1991 B 16
28 1991 B 17
29 1991 B 18
30 1991 B 19
I need to apply a function for each year (I intend to do that with plyr and summarise) but only on the factor level with the most rows (A or B). Is there a way to automatically select this level (A or B) for each year?
df2 <- ddply(df, .(year), summarise, result="some operation on longest level")
desired output:
> df2
year group value result
1 1990 B 7 5
2 1990 B 8 4
3 1990 B 9 5
4 1990 B 10 3
5 1990 B 11 3
6 1990 B 12 8
7 1990 B 13 11
8 1990 B 14 7
9 1990 B 15 2
10 1991 A 5 10
11 1991 A 6 13
12 1991 A 7 9
13 1991 A 8 7
14 1991 A 9 6
15 1991 A 10 1
16 1991 A 11 15
17 1991 A 12 5
18 1991 A 13 5
19 1991 A 14 2
This might be another approach with dplyr:
library(dplyr)
# count rows per year/group, then keep only the largest group within each year
df <- df %>% group_by(year, group) %>% mutate(count = n()) %>% ungroup()
df <- df %>% group_by(year) %>% filter(count %in% max(count)) %>% mutate(result = sqrt(value))
df$count <- NULL
Since I am not sure what function you want to apply to get result, I used sqrt(value) as in @rbatt's answer.
Sorry, I don't use plyr myself, but here's how I might do it with base functions. Perhaps that will inspire a plyr solution for you.
#find largest groups for each year
maxgroups <- tapply(df$group, df$year, function(x) which.max(table(x)))
#create group names
maxpairs <- paste(names(maxgroups),levels(df$group)[maxgroups], sep=".")
#helper function
ifnotin<-function(val,set,ifnotin) {out<-val; out[!val%in%set]<-ifnotin; droplevels(out)}
#new factor indicating best group
tgroups <- ifnotin(interaction(df$year, df$group), maxpairs, NA)
#now transform the best groups by adding year to result (or whatever transformation you need to do)
transform(df, value=ifelse(!is.na(tgroups), value+year, value))
I wasn't sure whether your transformation needs to know which group/year it is for. If you just need to know whether a row is in a group that needs transformation, you could skip tgroups and just use
needstransform <- interaction(df$year, df$group) %in% maxpairs
but tgroups has NA values, which is useful for summaries such as tapply(df$value, droplevels(tgroups), mean).
I don't think this is a very good answer because it's super obfuscated (and it doesn't use your desired plyr approach), but maybe it will stimulate someone else's thinking:
Basically, you just need to know which values of group you want to look at for each year. Let's say you figure that out and store those values (in the same order as splits of the original data by year) in a variable called m, then you can mapply some function that subsets each split (of the data by year) by group and then does whatever other calculations you want.
do.call(rbind, mapply(function(x,y) {
tmp <- x[x$group==y,]
#fun(tmp) # apply your function to the relevant subset
}, split(df,df$year), m, SIMPLIFY=FALSE))
I thought of three different ways you could generate m. Here they are:
m <- with(df, levels(group)[apply(table(group, year), 2, which.max)])
m <- levels(df$group)[sapply(split(df, df$year), function(x) which.max(sapply(split(x, x$group), nrow)))]
m <- with(df, levels(group)[apply(tapply(year, list(group, year), length),2,which.max)])
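To make the idea concrete, here is a self-contained sketch that finds the largest group per year from a year-by-group count table and then subsets on it. It uses sqrt(value) as a stand-in for the real function, and the names counts, largest, keep and df2 are made up for this example. Ties go to the first level, because which.max returns the first maximum.

```r
# Same shape as the data in the question
df <- data.frame(
  year = rep(c(1990, 1991), each = 15),
  group = factor(c(rep("A", 6), rep("B", 9), rep("A", 10), rep("B", 5))),
  value = c(1:15, 5:19)
)

# Rows per year (rows of the table) and group (columns)
counts <- table(df$year, df$group)

# Name of the most frequent group in each year, named by year
largest <- colnames(counts)[apply(counts, 1, which.max)]
names(largest) <- rownames(counts)

# Keep only rows belonging to their year's largest group
keep <- as.character(df$group) == unname(largest[as.character(df$year)])
df2 <- df[keep, ]

# Placeholder for the real per-row operation
df2$result <- sqrt(df2$value)
rownames(df2) <- NULL
df2
```

The lookup vector largest plays the same role as m above, but it is indexed by year name rather than by position in the split.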
This is what I came up with:
library(plyr)
df2 <- ddply(
df,
.(year),
summarise,
result=sqrt(
# table(group), not table(df$group): count rows within the current year only
value[group==names(which.max(table(group)))]
)
)

Resources