How do I combine dates, regardless of a third variable in R?

The following is a data example,
Month Year Tornado Location
January 1998 3 Illinois
February 1998 2 Illinois
March 1998 5 Illinois
January 1998 1 Florida
January 2010 3 Illinois
Here is what I want it to look like essentially,
Date Tornado
1998-01 4
1998-02 2
1998-03 5
2010-01 3
So, I want to combine the Year and Month into one new column. The locations do not matter; I want to know the total number of tornadoes for January 1998, and so on.
I have the following code, but do not know how to change it to incorporate both the variables I want, or if this is even the correct code for what I am attempting to do.
mydata$Date <- format(as.Date(mydata$month), "%m-%Y")
The real dataset is far too large to fix manually. I am basically attempting to make this data into time series data.

You need to transform the data first (build a single Date value from Month and Year), and then apply the approach from "How to sum a variable by group":
aggregate(Tornado~Date, transform(df, Date = format(as.Date(paste(Month,Year,"01"),
"%B %Y %d"), "%Y-%m")), sum)
# Date Tornado
#1 1998-01 4
#2 1998-02 2
#3 1998-03 5
#4 2010-01 3
data
df <- structure(list(Month = structure(c(2L, 1L, 3L, 2L, 2L),
.Label = c("February", "January", "March"), class = "factor"),
Year = c(1998L, 1998L,1998L, 1998L, 2010L),
Tornado = c(3L, 2L, 5L, 1L, 3L), Location = structure(c(2L,
2L, 2L, 1L, 2L), .Label = c("Florida", "Illinois"), class = "factor")),
class = "data.frame", row.names = c(NA, -5L))

First, I combined Month and Year into a single variable called Date, converted it to a year-month value with as.yearmon from the zoo package, and then grouped by Date and summed the results.
library(tidyverse)
library(zoo)
df %>%
unite(Date, Month, Year) %>%
mutate(Date = as.yearmon(Date, format = '%B_%Y')) %>%
group_by(Date) %>%
summarise(Tornado = sum(Tornado))
# A tibble: 4 x 2
Date Tornado
<yearmon> <int>
1 Jan 1998 4
2 Feb 1998 2
3 Mar 1998 5
4 Jan 2010 3

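If the day doesn't matter, you can set it to the first of the month:
#library (tidyverse)
library(lubridate)
# paste0() without separators would give "1998January-01", which does not parse,
# so build "Year-Month-01" explicitly and let ymd() read the month name
x$Date <- ymd(paste(x$Year, x$Month, "01", sep = "-"))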
# A tibble: 5 x 4
Month Year Tornados Date
<chr> <dbl> <dbl> <date>
1 January 1998 3 1998-01-01
2 February 1998 2 1998-02-01
3 March 1998 5 1998-03-01
4 January 1998 1 1998-01-01
5 January 2010 3 2010-01-01
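The answers above build a date column; the monthly totals the question asks for still need a grouped sum. As a minimal base R sketch (assuming English month names in Month, and rebuilding the sample data inline so the block stands alone), the "YYYY-MM" key can be built without any date parsing by looking the month up in the built-in month.name vector:

```r
# Sample data from the question, rebuilt inline for illustration
df <- data.frame(
  Month    = c("January", "February", "March", "January", "January"),
  Year     = c(1998L, 1998L, 1998L, 1998L, 2010L),
  Tornado  = c(3L, 2L, 5L, 1L, 3L),
  Location = c("Illinois", "Illinois", "Illinois", "Florida", "Illinois")
)

# month.name is a built-in constant: match() turns "January" into 1, etc.
df$Date <- sprintf("%d-%02d", df$Year, match(df$Month, month.name))

# Sum tornadoes per year-month, ignoring Location
totals <- aggregate(Tornado ~ Date, data = df, FUN = sum)
totals
```

Because the key is a zero-padded "YYYY-MM" string, sorting it alphabetically also sorts it chronologically. This avoids locale-dependent "%B" parsing, at the cost of producing a character column rather than a date class.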

Related

How to calculate three year rolling return using R

I need to get a 3-year rolling return working (3-year return for each id, for each year).
I have tried to use the PerformanceAnalytics package but I keep getting an error that my data is not a time series.
When I use the function it says TRUE so I am completely stuck as to how to get the 3-year rolling return to work. So I just need someone to provide me with the R code that will produce the 3-year returns.
Here's a sample dataset
ppd_id FY TF_1YR
1 2001 -0.0636
1 2002 -0.0929
1 2003 0.1648
1 2004 0.1006
1 2005 0.1098
1 2006 0.0837
1 2007 0.1792
1 2008 -0.1521
1 2009 -0.1003
1 2010 0.0847
1 2011 0.0221
1 2012 0.1801
1 2013 0.146
1 2014 0.1202
1 2015 0.0105
1 2016 0.1022
1 2017 0.1286
1 2018 0.0929
Here's link to dataset
Here's my code
library(smooth)
library(readr)
pensionreturns <- read_csv("pensionreturns.csv")
sma(pensionreturns, h=
Assuming that:
we are starting out with the data frame DF2 in the Note at the end which is the data in question duplicated so that there are 2 id's
the third column represents returns, so the 3-year return is the product of one plus each of the last 3 values (the current value and the prior 2), minus 1, i.e. (1 + r0) * (1 + r1) * (1 + r2) - 1, where r0 is the current year's return, r1 is the prior year's return and r2 is the return in the year before that.
Given these assumptions, convert the data to the wide-form zoo series z and then use rollapplyr. Omit the fill= argument if the NAs at the beginning are not needed. The result will be a zoo series of returns. (We could use fortify.zoo, see ?fortify.zoo, to convert it to a data frame, although it will be easier to perform further time series manipulations if you leave it as a time series.)
library(zoo)
z <- read.zoo(DF2, index = 2, split = 1, FUN = c)
rollapplyr(z + 1, 3, prod, fill = NA) - 1
giving this zoo series:
1 2
2001 NA NA
2002 NA NA
2003 -0.010609049 -0.010609049
2004 0.162883042 0.162883042
2005 0.422740161 0.422740161
2006 0.323680900 0.323680900
2007 0.418212355 0.418212355
2008 0.083530596 0.083530596
2009 -0.100440641 -0.100440641
2010 -0.172530498 -0.172530498
2011 -0.002527919 -0.002527919
2012 0.308343674 0.308343674
2013 0.382282521 0.382282521
2014 0.514952431 0.514952431
2015 0.297228567 0.297228567
2016 0.247648627 0.247648627
2017 0.257004321 0.257004321
2018 0.359505217 0.359505217
Note
DF <- structure(list(ppd_id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), FY = 2001:2018, TF_1YR = c(-0.0636,
-0.0929, 0.1648, 0.1006, 0.1098, 0.0837, 0.1792, -0.1521, -0.1003,
0.0847, 0.0221, 0.1801, 0.146, 0.1202, 0.0105, 0.1022, 0.1286,
0.0929)), class = "data.frame", row.names = c(NA, -18L))
DF2 <- rbind(DF, transform(DF, ppd_id = 2))
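For readers who want to check the compounding formula without zoo, here is a minimal base R sketch of the same calculation. It assumes the rows are already sorted by FY within each ppd_id; the helper name roll3 and the column name TF_3YR are made up for illustration, and only the first six years of the sample are used.

```r
# First six years of the sample data, for one id
DF <- data.frame(
  ppd_id = 1L,
  FY     = 2001:2006,
  TF_1YR = c(-0.0636, -0.0929, 0.1648, 0.1006, 0.1098, 0.0837)
)

# Hypothetical helper: (1 + r0) * (1 + r1) * (1 + r2) - 1 for each year,
# NA where fewer than three years are available
roll3 <- function(r) {
  g <- 1 + r
  n <- length(g)
  out <- rep(NA_real_, n)
  if (n >= 3) {
    idx <- 3:n
    out[idx] <- g[idx] * g[idx - 1] * g[idx - 2] - 1
  }
  out
}

# ave() applies roll3 separately within each ppd_id
DF$TF_3YR <- ave(DF$TF_1YR, DF$ppd_id, FUN = roll3)
DF
```

The 2003 and 2004 values should agree with the first non-NA rows of the zoo output above; the long data frame layout is kept here, whereas the zoo version returns one column per id.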

Repeatedly compare same portion of dataset to other portions of dataset based on index value in R

I have a dataframe that looks like the following:
state year value
1 1980 4
1 1981 5
1 1982 4
2 1980 2
2 1981 3
2 1982 4
100 1980 3
100 1981 2
100 1982 5
In the actual dataset, there are more states than are shown here. I would like to make a comparison between state 100 and all other states.
Specifically, for each state, I would like to find the difference between the value given by that state for a particular year and the value given for state 100 for that same year. Below, I have shown how I could compare the value for year 1980 between state 1 and state 100.
df_1 <- df %>% filter(state == 1)
df_100 <- df %>% filter(state == 100)
df_1_1980 <- df_1 %>% filter(year == 1980)
df_100_1980 <- df_100 %>% filter(year == 1980)
difference <- df_1_1980$value - df_100_1980$value
How could I do this for all the other states and years in the dataframe?
One possibility I have considered is making a dataframe composed only of the data from state 100 and then connecting it to the original dataframe, like this:
state year value state100 year100 value100
1 1980 4 100 1980 3
1 1981 5 100 1981 2
1 1982 4 100 1982 5
2 1980 2 100 1980 3
2 1981 3 100 1981 2
2 1982 4 100 1982 5
I could then subtract df$value from df$value100 for each row. I assume there is a better way of doing this.
We can filter the 'state' that is not equal to 100, left_join with the dataset with 'state' 100, by 'year' and get the difference between the 'value' columns
library(dplyr)
df %>%
filter(state != 100) %>%
left_join(df %>%
filter(state == 100) %>%
select(-state), by = c('year')) %>%
transmute(state, year, value = value.x, difference = value.x - value.y)
# state year value difference
#1 1 1980 4 1
#2 1 1981 5 3
#3 1 1982 4 -1
#4 2 1980 2 -1
#5 2 1981 3 1
#6 2 1982 4 -1
data
df <- structure(list(state = c(1L, 1L, 1L, 2L, 2L, 2L, 100L, 100L,
100L), year = c(1980L, 1981L, 1982L, 1980L, 1981L, 1982L, 1980L,
1981L, 1982L), value = c(4L, 5L, 4L, 2L, 3L, 4L, 3L, 2L, 5L)),
class = "data.frame", row.names = c(NA,
-9L))
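The same comparison can be sketched in base R without a join, assuming each year appears exactly once for state 100. The names ref and others are introduced here only for illustration: match() looks up the state 100 value for each row's year.

```r
# Sample data from the question, rebuilt inline
df <- data.frame(
  state = c(1L, 1L, 1L, 2L, 2L, 2L, 100L, 100L, 100L),
  year  = rep(1980:1982, times = 3),
  value = c(4L, 5L, 4L, 2L, 3L, 4L, 3L, 2L, 5L)
)

ref    <- df[df$state == 100, ]   # reference rows (state 100)
others <- df[df$state != 100, ]   # every other state

# For each row, find the state 100 value of the same year and subtract it
others$difference <- others$value - ref$value[match(others$year, ref$year)]
others
```

If a year were missing for state 100, match() would return NA and the difference for that row would be NA, which mirrors what left_join does in the dplyr version.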

Sum variables using R by categories condition

I have a data frame that shows the number of publications by year. But I am interested just in Conference and Journals Publications. I would like to sum all other categories in Others type.
Examples of data frame:
year type n
1994 Conference 2
1994 Journal 3
1995 Conference 10
1995 Editorship 3
1996 Conference 20
1996 Editorship 2
1996 Books and Thesis 3
And the result would be:
year type n
1994 Conference 2
1994 Journal 3
1995 Conference 10
1995 Other 3
1996 Conference 20
1996 Other 5
With dplyr we can replace anything other than "Journal" or "Conference" with "Other" and then sum by year and type. The negative lookahead needs perl = TRUE.
library(dplyr)
df %>%
mutate(type = sub("^(?!Journal|Conference).*", "Other", type, perl = TRUE)) %>%
group_by(year, type) %>%
summarise(n = sum(n))
# year type n
# <int> <chr> <int>
#1 1994 Conference 2
#2 1994 Journal 3
#3 1995 Conference 10
#4 1995 Other 3
#5 1996 Conference 20
#6 1996 Other 5
We can use data.table
library(data.table)
library(stringr)
setDT(df1)[, .(n = sum(n)), .(year, type = str_replace(type,
'^(?!Journal|Conference).*', 'Other'))]
# year type n
#1: 1994 Conference 2
#2: 1994 Journal 3
#3: 1995 Conference 10
#4: 1995 Other 3
#5: 1996 Conference 20
#6: 1996 Other 5
levels(df$type)[levels(df$type) %in% c("Editorship", "Books_and_Thesis")] <- "Other"
aggregate(n ~ type + year, data=df, sum)
# type year n
# 1 Conference 1994 2
# 2 Journal 1994 3
# 3 Other 1995 3
# 4 Conference 1995 10
# 5 Other 1996 5
# 6 Conference 1996 20
Input data:
df <- structure(list(year = c(1994L, 1994L, 1995L, 1995L, 1996L, 1996L,
1996L), type = structure(c(2L, 3L, 2L, 1L, 2L, 1L, 1L), .Label = c("Other",
"Conference", "Journal"), class = "factor"), n = c(2L, 3L, 10L,
3L, 20L, 2L, 3L)), .Names = c("year", "type", "n"), row.names = c(NA, -7L), class = "data.frame")

generate seq of quarter date in R

I am new to R and I have a data frame that looks something like this.
Date A B
1990 Q1 2 3
Q2 4 2
Q3 7 6
Q4 5 3
1991 Q1 7 6
Q2 1 8
Q3 7 6
Q4 9 2
1992 Q1 1 7
Q2 4 6
Q3 1 3
Q4 5 8
...
The column continues like this all the way to the last row, and neither the start date nor the end date is fixed, as the data is constantly updated. I would like to convert the Date column into a date class and achieve something like this:
Date A B
1990 Q1 2 3
1990 Q2 4 2
1990 Q3 7 6
1990 Q4 5 3
1991 Q1 7 6
1991 Q2 1 8
1991 Q3 7 6
1991 Q4 9 2
1992 Q1 1 7
1992 Q2 4 6
1992 Q3 1 3
1992 Q4 5 8
...
I thought of creating a new column of dates on the left, using the first date provided by the data (i.e. '1990 Q1') as the starting date and the number of rows as the length. I was looking at using the seq and as.yearqtr commands but can't seem to work out proper code for it. Does anyone know of a better way to do this?
To use the as.yearqtr function from the zoo package to create a year-quarter time series, you can first split the df$Date values into year and quarter strings, use na.locf, also from the zoo package, to fill in missing values of year with the value from the previous row, and then transform to a zoo time series with year-quarter dates. The code would look like this:
library(zoo)
# split Date into year and quarter strings
# as.character() makes this work whether Date is a character or a factor column
tmp <- t(sapply(strsplit(as.character(df$Date), " "), function(x) if(length(x)==1) c(NA, x) else x))
# use na.locf to replace NA with previous year
tmp <- paste(na.locf(tmp[,1]), tmp[,2])
# transform df into a zoo time series object with yearqtr dates
df_zoo <- zoo(df[,-1], order.by = as.yearqtr(tmp))
We could do this in base R. Create a grouping variable using grep and cumsum, extract the numeric substring from 'Date', replace the '' values with the year values using ave, and then paste it with the quarter substring extracted using sub.
df$Date <- paste(ave(sub("\\s*Q.", "", df$Date),
cumsum(grepl("^\\d+", df$Date)), FUN = function(x) x[nzchar(x)]),
sub("^\\d+\\s+", "", df$Date))
df$Date
#[1] "1990 Q1" "1990 Q2" "1990 Q3" "1990 Q4" "1991 Q1" "1991 Q2"
#[7] "1991 Q3" "1991 Q4" "1992 Q1" "1992 Q2" "1992 Q3" "1992 Q4"
No additional packages needed.
If we need a package solution, data.table can be used
library(data.table)
library(stringr)
setDT(df)[, Date:=sub("^(Q.*)", paste0(word(Date[1],1), " \\1") , Date),
cumsum(grepl("^\\d+" , Date))]
df
# Date A B
# 1: 1990 Q1 2 3
# 2: 1990 Q2 4 2
# 3: 1990 Q3 7 6
# 4: 1990 Q4 5 3
# 5: 1991 Q1 7 6
# 6: 1991 Q2 1 8
# 7: 1991 Q3 7 6
# 8: 1991 Q4 9 2
# 9: 1992 Q1 1 7
#10: 1992 Q2 4 6
#11: 1992 Q3 1 3
#12: 1992 Q4 5 8
data
df <- structure(list(Date = c("1990 Q1", "Q2", "Q3", "Q4", "1991 Q1",
"Q2", "Q3", "Q4", "1992 Q1", "Q2", "Q3", "Q4"), A = c(2L, 4L,
7L, 5L, 7L, 1L, 7L, 9L, 1L, 4L, 1L, 5L), B = c(3L, 2L, 6L, 3L,
6L, 8L, 6L, 2L, 7L, 6L, 3L, 8L)), .Names = c("Date", "A", "B"
), row.names = c(NA, -12L), class = "data.frame")
Here is a straightforward way to create the sequence you are looking for:
numrows<-10 #number of elements desired
#create the sequence of Date objects
qtrseq<-seq(as.Date("1990-01-01"), by="quarter", length.out = numrows)
#created vector for the formatted display
qtrformatted<-paste(as.POSIXlt(qtrseq)$year+1900, quarters(qtrseq))
The downside of this method and the other listed solutions is the loss of the Date object. There is no good way in base R to format the dates as Q1, Q2, ... and have the object remain a Date object. Depending on your application, it might be best to store the date sequence in the data frame and use the formatted quarter strings for output purposes only.
Best of luck.
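To tie the sequence idea back to the question, here is a minimal base R sketch that derives the start date and the length from the data instead of hard-coding them. It assumes the first entry always has the form "YYYY Qn" and that no quarters are skipped; the variable names are made up for illustration.

```r
# Date column as it appears in the question (first two years only)
dates <- c("1990 Q1", "Q2", "Q3", "Q4", "1991 Q1", "Q2", "Q3", "Q4")

# Read the starting year and quarter from the first entry
start_year <- as.integer(substr(dates[1], 1, 4))
start_qtr  <- as.integer(substr(dates[1], 7, 7))

# First day of the starting quarter: Q1 -> month 1, Q2 -> month 4, ...
start <- as.Date(sprintf("%d-%02d-01", start_year, 3 * (start_qtr - 1) + 1))

# One Date per row, a quarter apart
qtrseq <- seq(start, by = "quarter", length.out = length(dates))

# Formatted labels for display; keep qtrseq if a real Date column is needed
qtrlabels <- paste(format(qtrseq, "%Y"), quarters(qtrseq))
qtrlabels
```

Keeping both qtrseq and qtrlabels in the data frame follows the suggestion above: the Date column drives any time-based computation, and the labels are used only for printing.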
Assuming Date is a single character column, here's an option using tidyr:
library(tidyr)
# separate date into year and quarter, inserting NAs in year as necessary
df %>% separate(Date, into = c('year', 'quarter'), fill = 'left') %>%
# fill NAs with previous value
fill(year) %>%
# join year and quarter back into a single column
unite(Date, year, quarter, sep = ' ')
# Date A B
# 1 1990 Q1 2 3
# 2 1990 Q2 4 2
# 3 1990 Q3 7 6
# 4 1990 Q4 5 3
# 5 1991 Q1 7 6
# 6 1991 Q2 1 8
# 7 1991 Q3 7 6
# 8 1991 Q4 9 2
# 9 1992 Q1 1 7
# 10 1992 Q2 4 6
# 11 1992 Q3 1 3
# 12 1992 Q4 5 8
Data
df <- structure(list(Date = structure(c(1L, 4L, 5L, 6L, 2L, 4L, 5L,
6L, 3L, 4L, 5L, 6L), .Label = c("1990 Q1", "1991 Q1", "1992 Q1",
"Q2", "Q3", "Q4"), class = "factor"), A = c(2L, 4L, 7L, 5L, 7L,
1L, 7L, 9L, 1L, 4L, 1L, 5L), B = c(3L, 2L, 6L, 3L, 6L, 8L, 6L,
2L, 7L, 6L, 3L, 8L)), .Names = c("Date", "A", "B"), class = "data.frame", row.names = c(NA,
-12L))
Here is something you can try
library(dplyr); library(stringr); library(zoo)
df %>% mutate(Date = paste(na.locf(str_extract(Date, "^[0-9]{4}")),
str_extract(Date, "Q[1-4]$"), sep = " "))
Date A B
1 1990 Q1 2 3
2 1990 Q2 4 2
3 1990 Q3 7 6
4 1990 Q4 5 3
5 1991 Q1 7 6
6 1991 Q2 1 8
7 1991 Q3 7 6
8 1991 Q4 9 2
9 1992 Q1 1 7
10 1992 Q2 4 6
11 1992 Q3 1 3
12 1992 Q4 5 8

R: Get the last entry from previous group

I have data like this:
Group Year Month Mean_Price
A 2013 6 200
A 2013 6 200
A 2014 2 100
A 2014 2 100
B 2014 1 130
I want to add another column which gets the last entry from the group above, like this:
Group Year Month Mean_Price Last_Mean_price
A 2013 6 200 x
A 2013 6 200 x
A 2014 2 100 200
A 2014 2 100 200 ---This is where I am facing a problem, as doing dplyr + lag just gets the previous row's entry and not the entry of the *last group's* last row.
B 2014 1 130 x
B 2014 4 140 130
All help will be appreciated. Thanks!
I had asked a related question here: Get the (t-1) data within groups
But then I wasn't grouping by years and months
This may be one way to go. I am not sure how you want to group your data. Here, I chose to group your data with GROUP, Year, and Month. First, I want to create a vector with all last elements from each group, which is foo.
group_by(mydf, Group, Year, Month) %>%
summarize(whatever = last(Mean_Price)) %>%
ungroup %>%
select(whatever) %>%
unlist -> foo
# whatever1 whatever2 whatever3 whatever4
# 200 100 130 140
Second, I arranged foo for our later process. Basically, I added x in the first position and deleted the last element in foo.
### Arrange a vector
foo <- c("x", foo[-length(foo)])
Third, I added row numbers for each group in mydf using mutate(). Then, I replaced all numbers but 1 with x.
group_by(mydf, Group, Year, Month) %>%
mutate(ind = row_number(),
ind = replace(ind, which(row_number(ind) != 1), "x")) -> temp
Finally, I identified rows which have 1 in ind and assigned the vector, foo to the rows.
temp$ind[temp$ind == 1] <- foo
temp
# Group Year Month Mean_Price ind
# (fctr) (int) (int) (int) (chr)
#1 A 2013 6 200 x
#2 A 2013 6 200 x
#3 A 2014 2 100 200
#4 A 2014 2 100 x
#5 B 2014 1 130 100
#6 B 2014 4 140 130
DATA
mydf <- structure(list(Group = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), Year = c(2013L, 2013L, 2014L, 2014L,
2014L, 2014L), Month = c(6L, 6L, 2L, 2L, 1L, 4L), Mean_Price = c(200L,
200L, 100L, 100L, 130L, 140L)), .Names = c("Group", "Year", "Month",
"Mean_Price"), class = "data.frame", row.names = c(NA, -6L))
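The output above fills only the first row of each period and carries values across groups, which differs from the table in the question. As an alternative sketch in base R (not the answerer's method), the previous period's last price can be computed once per (Group, Year, Month) and then mapped back onto every row. It assumes rows are in chronological order within each Group and uses NA where the question shows x; the names key, periods and lag1 are made up for illustration.

```r
# Sample data, rebuilt inline
mydf <- data.frame(
  Group      = c("A", "A", "A", "A", "B", "B"),
  Year       = c(2013L, 2013L, 2014L, 2014L, 2014L, 2014L),
  Month      = c(6L, 6L, 2L, 2L, 1L, 4L),
  Mean_Price = c(200L, 200L, 100L, 100L, 130L, 140L)
)

# One row per (Group, Year, Month) period, keeping the last row of each
key     <- paste(mydf$Group, mydf$Year, mydf$Month)
periods <- mydf[!duplicated(key, fromLast = TRUE), ]

# Shift the per-period price down by one period within each Group
lag1 <- function(x) c(NA, x[-length(x)])
periods$Last_Mean_Price <- ave(periods$Mean_Price, periods$Group, FUN = lag1)

# Copy the lagged value onto every original row of the matching period
period_key <- paste(periods$Group, periods$Year, periods$Month)
mydf$Last_Mean_Price <- periods$Last_Mean_Price[match(key, period_key)]
mydf
```

Lagging at the period level rather than the row level is what keeps both 2014-02 rows of group A at 200 and stops group B from inheriting a value from group A.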
