Create columns from aggregated row data in R

I have a data frame that contains historical price returns. The data is organized with date columns and many asset columns (denoted A1, A2, ...). Each asset column contains the price return for each historical date. I would like to process this data into a data frame with many asset columns and only one row, where that row contains the average over the original rows for each new column. The new column headers need to be the original asset name concatenated with date information. A simplified example of the original data follows:
> df <- read.csv("data.csv", header=T)
> df
Year Month A1 A2 A3
1 2015 Jan 1 1 1
2 2015 Feb 2 2 2
3 2015 Mar 3 3 3
4 2016 Jan 1 1 1
5 2016 Feb 2 2 2
6 2016 Mar 3 3 3
I used simple repeating numbers for the returns here. I am using a function that requires the data to be organized as follows:
> df2 <- read.csv("data2.csv", header=T)
> df2
Returns A1.Jan A1.Feb A1.Mar A2.Jan A2.Feb A2.Mar A3.Jan A3.Feb A3.Mar
1 Average 1 2 3 1 2 3 1 2 3
For clarity, A1.Jan contains the average of the Jan returns across all years. Thanks in advance for the insight and/or solution.

Take a look at the base function reshape. This is basically the same task as is solved by the last example on its help page:
reshape(df, idvar="Year", direction="wide", timevar="Month")
Year A1.Jan A2.Jan A3.Jan A1.Feb A2.Feb A3.Feb A1.Mar A2.Mar A3.Mar
1 2015 1 1 1 2 2 2 3 3 3
4 2016 1 1 1 2 2 2 3 3 3
You wanted the Year variable to remain as a column identifier but wanted the Month variable to act as a sequence that gets spread "wide".
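Note that the reshape() call above keeps one row per Year and does not average across years. As a minimal base R sketch of the averaging step (not part of the original answer, and assuming the asset columns are exactly A1, A2 and A3 as in the example), one could aggregate by Month first and then flatten the result into a single row:

```r
# Example data from the question
df <- data.frame(
  Year  = c(2015, 2015, 2015, 2016, 2016, 2016),
  Month = c("Jan", "Feb", "Mar", "Jan", "Feb", "Mar"),
  A1 = c(1, 2, 3, 1, 2, 3),
  A2 = c(1, 2, 3, 1, 2, 3),
  A3 = c(1, 2, 3, 1, 2, 3)
)

assets <- c("A1", "A2", "A3")  # assumption: asset columns are known by name

# Make Month a factor so groups come out in calendar order, not alphabetically
df$Month <- factor(df$Month, levels = month.abb)

# One row per Month, holding the mean of each asset across years
avg <- aggregate(df[assets], by = list(Month = df$Month), FUN = mean)

# Flatten column by column (A1 for all months, then A2, ...) and build matching names
vals <- unlist(avg[assets])
nms  <- as.vector(outer(as.character(avg$Month), assets,
                        function(m, a) paste(a, m, sep = ".")))

out <- cbind(Returns = "Average", as.data.frame(as.list(setNames(vals, nms))))
out
```

The column order here follows the asset order in `assets` and the calendar order of the months, which matches the layout requested in the question.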

With data.table you can do
library(data.table)
setDT(df)
df[, lapply(.SD, mean), .SDcols = names(df)[grep("^A", names(df))], by = Month
][, Returns := "Average"
][, melt(.SD, id = c("Month", "Returns"))
][, dcast(.SD, Returns ~ variable + Month, value.var = 'value', sep = ".")]
# Returns A1.Feb A1.Jan A1.Mar A2.Feb A2.Jan A2.Mar A3.Feb A3.Jan A3.Mar
#1: Average 2 1 3 2 1 3 2 1 3
In the first line we aggregate the data by Month. The part names(df)[grep("^A", names(df))] ensures that we only aggregate variables whose names start with the letter "A".
The second line creates the variable Returns, which contains the value "Average".
melt gathers your data into long format, and dcast finally spreads it into the desired output.
data
df <- structure(list(Year = c(2015L, 2015L, 2015L, 2016L, 2016L, 2016L
), Month = c("Jan", "Feb", "Mar", "Jan", "Feb", "Mar"), A1 = c(1L,
2L, 3L, 1L, 2L, 3L), A2 = c(1L, 2L, 3L, 1L, 2L, 3L), A3 = c(1L,
2L, 3L, 1L, 2L, 3L)), .Names = c("Year", "Month", "A1", "A2",
"A3"), class = "data.frame", row.names = c("1", "2", "3", "4",
"5", "6"))

Here's a tidyverse solution. I converted the months to a factor so they can be ordered, then used tidyr::gather() to convert the data to long format, dplyr::group_by() to group by month and asset, and dplyr::summarise() to compute the average:
library(dplyr)
library(tidyr)
df <- read.table(text = "
Year Month A1 A2 A3
1 2015 Jan 1 1 1
2 2015 Feb 2 2 2
3 2015 Mar 3 3 3
4 2016 Jan 1 1 1
5 2016 Feb 2 2 2
6 2016 Mar 3 3 3", header = T) %>%
as_tibble() # tbl_df() is deprecated since dplyr 1.0.0
df$Month <- df$Month %>%
factor(levels = format(ISOdate(2000, 1:12, 1), "%b"))
df_tidy <- df %>%
gather(asset, value, -Year, -Month) %>%
group_by(Month, asset) %>%
summarise(Average = mean(value)) %>%
arrange(asset, Month)
df_tidy
# # A tibble: 9 x 3
# # Groups: Month [3]
# Month asset Average
# <fct> <chr> <dbl>
# 1 Jan A1 1
# 2 Feb A1 2
# 3 Mar A1 3
# 4 Jan A2 1
# 5 Feb A2 2
# 6 Mar A2 3
# 7 Jan A3 1
# 8 Feb A3 2
# 9 Mar A3 3
# convert to wide format, as in OP - not sure of 'easy' way
# to order columns by asset.month other than using 'select()'
# (it currently sorts alphabetically).
df_tidy %>%
unite(Returns, c(asset, Month), sep = ".") %>%
spread(Returns, Average)
# # A tibble: 1 x 9
# A1.Feb A1.Jan A1.Mar A2.Feb A2.Jan A2.Mar A3.Feb A3.Jan A3.Mar
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2 1 3 2 1 3 2 1 3
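On the column-ordering question raised in the comment above: a possible alternative, sketched here under the assumption that a recent tidyr (1.0.0 or later) is available, is tidyr::pivot_wider(). By default it creates the new columns in order of first appearance in the data, so sorting by asset and Month beforehand gives the A1.Jan, A1.Feb, ... layout without a manual select(). This is an illustration, not part of the original answer:

```r
library(dplyr)
library(tidyr)

# Example data from the question
df <- data.frame(
  Year  = c(2015, 2015, 2015, 2016, 2016, 2016),
  Month = c("Jan", "Feb", "Mar", "Jan", "Feb", "Mar"),
  A1 = c(1, 2, 3, 1, 2, 3),
  A2 = c(1, 2, 3, 1, 2, 3),
  A3 = c(1, 2, 3, 1, 2, 3)
)

wide <- df %>%
  # calendar-ordered factor so Jan sorts before Feb
  mutate(Month = factor(Month, levels = month.abb)) %>%
  pivot_longer(-c(Year, Month), names_to = "asset", values_to = "value") %>%
  group_by(asset, Month) %>%
  summarise(Average = mean(value), .groups = "drop") %>%
  arrange(asset, Month) %>%
  # new columns are created in order of appearance, i.e. asset then month
  pivot_wider(names_from = c(asset, Month), values_from = Average,
              names_sep = ".")

names(wide)
```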

Related

Identify unique values within a multivariable subset

I have data that look like these:
Subject Site Date
1 2 '2020-01-01'
1 2 '2020-01-01'
1 2 '2020-01-02'
2 1 '2020-01-02'
2 1 '2020-01-03'
2 1 '2020-01-03'
And I'd like to create an order variable for unique dates by Subject and Site. i.e.
Want
1
1
2
1
2
2
I define a little wrapper:
rle <- function(x) cumsum(!duplicated(x))
and I notice inconsistent behavior when I supply:
have1 <- unlist(tapply(val$Date, val[, c( 'Site', 'Subject')], rle))
versus
have2 <- unlist(tapply(val$Date, val[, c('Subject', 'Site')], rle))
> have1
[1] 1 1 2 1 2 2
> have2
[1] 1 2 2 1 1 2
Is there any way to ensure that the natural ordering of the dataset is followed regardless of the specific columns supplied to the INDEX argument?
library(dplyr)
val %>%
group_by(Subject, Site) %>%
mutate(Want = match(Date, unique(Date))) %>%
ungroup
-output
# A tibble: 6 × 4
Subject Site Date Want
<int> <int> <chr> <int>
1 1 2 2020-01-01 1
2 1 2 2020-01-01 1
3 1 2 2020-01-02 2
4 2 1 2020-01-02 1
5 2 1 2020-01-03 2
6 2 1 2020-01-03 2
val$Want <- with(val, ave(as.integer(as.Date(Date)), Subject, Site,
FUN = \(x) match(x, unique(x))))
val$Want
[1] 1 1 2 1 2 2
data
val <- structure(list(Subject = c(1L, 1L, 1L, 2L, 2L, 2L), Site = c(2L,
2L, 2L, 1L, 1L, 1L), Date = c("2020-01-01", "2020-01-01", "2020-01-02",
"2020-01-02", "2020-01-03", "2020-01-03")),
class = "data.frame", row.names = c(NA,
-6L))
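As for why the two tapply() calls in the question disagree: tapply() returns an array with one cell per combination of grouping levels, and unlist() walks that array cell by cell, so the result follows the order of the groups rather than the order of the rows. ave() writes each group's result back into the original row positions instead. The sketch below is an illustration only; it reuses the question's cumsum(!duplicated(x)) idea under the hypothetical name first_seen_rank, and it assumes equal dates are adjacent within each group, as they are in the example:

```r
val <- data.frame(
  Subject = c(1L, 1L, 1L, 2L, 2L, 2L),
  Site    = c(2L, 2L, 2L, 1L, 1L, 1L),
  Date    = c("2020-01-01", "2020-01-01", "2020-01-02",
              "2020-01-02", "2020-01-03", "2020-01-03")
)

# Hypothetical helper: running count of distinct values seen so far
first_seen_rank <- function(x) cumsum(!duplicated(x))

# ave() returns a vector of the same type as its first argument,
# so the dates are passed as integers rather than strings
d <- as.integer(as.Date(val$Date))

want_a <- ave(d, val$Subject, val$Site, FUN = first_seen_rank)
want_b <- ave(d, val$Site, val$Subject, FUN = first_seen_rank)

want_a
want_b
```

Both calls give the same vector in row order, regardless of the order in which the grouping columns are supplied.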

Rearranging columns of a table in R

I have the following table that I want to modify
Debt2017 Debt2018 Debt2019 Cash2017 Cash2018 Cash2019 Year Other
2 4 3 5 6 7 2018 x
3 8 9 7 9 9 2017 y
So that the result is the following
Debt Cash FLAG After Other
2 5 0 x
3 7 1 x
8 9 1 y
9 9 1 y
Basically, I want to change the data so that I have the different years in different rows, eliminating the values for the year indicated in the column "Year" and adding a FLAG that tells me whether the data indicated in the row is from a previous (0) or following (1) year (with respect to the year indicated in the column "Year").
Furthermore, I also want to keep the column "Other".
Does anybody know how to do it in R?
library(dplyr)
library(tidyr)
df %>%
pivot_longer(Debt2017:Cash2019,
names_to = c(".value", "Year2"),
names_pattern = "(\\D+)(\\d+)") %>%
filter(Year != Year2) %>%
mutate(flag = +(Year2 > Year))
# # A tibble: 4 × 6
# Year Other Year2 Debt Cash flag
# <int> <chr> <chr> <int> <int> <int>
# 1 2018 x 2017 2 5 0
# 2 2018 x 2019 3 7 1
# 3 2017 y 2018 8 9 1
# 4 2017 y 2019 9 9 1
Data
df <- structure(list(Debt2017 = 2:3, Debt2018 = c(4L, 8L), Debt2019 = c(3L, 9L),
Cash2017 = c(5L, 7L), Cash2018 = c(6L, 9L), Cash2019 = c(7L, 9L),
Year = 2018:2017, Other = c("x", "y")), class = "data.frame", row.names = c(NA, -2L))
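In the answer above, Year2 comes out of pivot_longer() as a character column, so Year2 > Year is evaluated as a string comparison. That works for four-digit years, but a numeric comparison may be more robust. A minimal sketch, assuming a tidyr version that supports the names_transform argument (1.1.0 or later), converts Year2 to integer during the pivot:

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Debt2017 = c(2L, 3L), Debt2018 = c(4L, 8L), Debt2019 = c(3L, 9L),
  Cash2017 = c(5L, 7L), Cash2018 = c(6L, 9L), Cash2019 = c(7L, 9L),
  Year = c(2018L, 2017L), Other = c("x", "y")
)

res <- df %>%
  pivot_longer(Debt2017:Cash2019,
               names_to = c(".value", "Year2"),
               names_pattern = "(\\D+)(\\d+)",
               # parse the year part of the column name as an integer
               names_transform = list(Year2 = as.integer)) %>%
  filter(Year != Year2) %>%                   # drop the reference year itself
  mutate(flag = as.integer(Year2 > Year))     # 0 = earlier year, 1 = later year

res
```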

Merge two data frames by row and column names and by group

I have two data frames, df1 and df2, that look as follows:
df1<- data.frame(year, week, X1, X2)
df1
year week X1 X2
1 2010 1 2 3
2 2010 2 8 6
3 2011 1 7 5
firm<-c("X1", "X1", "X2")
year <- c(2010,2010,2011)
week<- c(1, 2, 1)
cost<-c(10,30,20)
df2<- data.frame(firm,year, week, cost)
df2
firm year week cost
1 X1 2010 1 10
2 X1 2010 2 30
3 X2 2011 1 20
I'd like to merge these so the final result (i.e. df3) looks as follows:
df3
firm year week cost Y
1 X1 2010 1 10 2
2 X1 2010 2 30 8
3 X2 2011 1 20 5
Where "Y" is a new variable that reflects the values of X1 and X2 for a particular year and week found in df1.
Is there a way to do this in R? Thank you in advance for your reply.
We can reshape the first dataset to 'long' format and then join it with the second dataset:
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = X1:X2, values_to = 'Y', names_to = 'firm') %>%
right_join(df2)
-output
# A tibble: 3 x 5
# year week firm Y cost
# <dbl> <dbl> <chr> <int> <dbl>
#1 2010 1 X1 2 10
#2 2010 2 X1 8 30
#3 2011 1 X2 5 20
data
df1 <- structure(list(year = c(2010L, 2010L, 2011L), week = c(1L, 2L,
1L), X1 = c(2L, 8L, 7L), X2 = c(3L, 6L, 5L)), class = "data.frame",
row.names = c("1",
"2", "3"))
df2 <- structure(list(firm = c("X1", "X1", "X2"), year = c(2010, 2010,
2011), week = c(1, 2, 1), cost = c(10, 30, 20)), class = "data.frame",
row.names = c(NA,
-3L))
Here is a base R option (borrowing the data from @akrun, thanks!)
q <- startsWith(names(df1),"X")
v <- cbind(df1[!q],stack(df1[q]),row.names = NULL)
df3 <- merge(setNames(v,c(names(df1)[!q],"Y","firm")),df2)
which gives
> df3
year week firm Y cost
1 2010 1 X1 2 10
2 2010 2 X1 8 30
3 2011 1 X2 5 20

How to deduplicate based upon an interval between dates in same column

I have a table that looks something like this:
ID Date Type
1 2019/03/12 A
1 2019/03/12 A
2 2019/01/07 A
2 2019/04/20 B
3 2019/02/09 C
4 2019/01/19 A
4 2019/01/23 A
I want to deduplicate this table by ID, but keep both rows when the dates listed are more than 7 days apart. If they are less than 7 days apart, then I want to keep only the earliest date.
Want:
ID Date Type
1 2019/03/12 A
2 2019/01/07 A
2 2019/04/20 B
3 2019/02/09 C
4 2019/01/19 A
I'm just struggling with where to start conceptually.
One option is to convert 'Date' to Date class (ymd from lubridate is used here), group by 'ID', and then filter to keep the first row of each group plus any row whose 'Date' is at least 7 days after the previous row:
library(dplyr)
library(lubridate)
df1 %>%
mutate(Date = ymd(Date)) %>%
group_by(ID) %>%
filter(c(TRUE, diff(Date) >= 7))
# A tibble: 5 x 3
# Groups: ID [4]
# ID Date Type
# <int> <date> <chr>
#1 1 2019-03-12 A
#2 2 2019-01-07 A
#3 2 2019-04-20 B
#4 3 2019-02-09 C
#5 4 2019-01-19 A
data
df1 <- structure(list(ID = c(1L, 1L, 2L, 2L, 3L, 4L, 4L), Date = c("2019/03/12",
"2019/03/12", "2019/01/07", "2019/04/20", "2019/02/09", "2019/01/19",
"2019/01/23"), Type = c("A", "A", "A", "B", "C", "A", "A")),
class = "data.frame", row.names = c(NA,
-7L))
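The same filter can be sketched in base R without extra packages. This is an illustration rather than part of the original answer, and it assumes the rows are already sorted by ID and Date. Like the dplyr version, it compares each row with the row immediately before it in the same ID, not with the last row that was kept:

```r
df1 <- data.frame(
  ID   = c(1L, 1L, 2L, 2L, 3L, 4L, 4L),
  Date = c("2019/03/12", "2019/03/12", "2019/01/07", "2019/04/20",
           "2019/02/09", "2019/01/19", "2019/01/23"),
  Type = c("A", "A", "A", "B", "C", "A", "A")
)

df1$Date <- as.Date(df1$Date, format = "%Y/%m/%d")

# Days since the previous row within each ID; Inf marks the first row of a group
gap <- ave(as.numeric(df1$Date), df1$ID,
           FUN = function(d) c(Inf, diff(d)))

# Keep the first row per ID and any row at least 7 days after its predecessor
res <- df1[gap >= 7, ]
res
```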

Create Binary Variable based on lag in R

I want to create a binary/indicator variable based on a lagged observation. I have a variable X1. The raw data looks like the sample below; the original data has close to 10K records.
X1
Diagnosis
1
2
3
4
Treatment
1
2
3
I want the output to look like this :
X1 NewVar
Diagnosis Diagnosis
1 Diagnosis
2 Diagnosis
3 Diagnosis
4 Diagnosis
Treatment Treatment
1 Treatment
2 Treatment
3 Treatment
Any help would be highly appreciated!
You can achieve this with cumsum. The cumsum starts a new group each time a Diagnosis or Treatment appears. Then NewVar in each group takes the value of the first X1 in that group:
library(dplyr)
dtf %>%
mutate(g = cumsum(X1 == 'Diagnosis' | X1 == 'Treatment')) %>%
group_by(g) %>%
mutate(NewVar = X1[1]) %>%
ungroup() %>% select(-g)
# # A tibble: 9 x 2
# X1 NewVar
# <fctr> <fctr>
# 1 Diagnosis Diagnosis
# 2 1 Diagnosis
# 3 2 Diagnosis
# 4 3 Diagnosis
# 5 4 Diagnosis
# 6 Treatment Treatment
# 7 1 Treatment
# 8 2 Treatment
# 9 3 Treatment
The dtf used in the code above:
> dput(dtf)
structure(list(X1 = structure(c(5L, 1L, 2L, 3L, 4L, 6L, 1L, 2L,
3L), .Label = c("1", "2", "3", "4", "Diagnosis", "Treatment"), class = "factor")), .Names = "X1", class = "data.frame", row.names = c(NA,
-9L))
Here is an option with data.table. After converting to a data.table with setDT(dtf), group by the cumulative sum of a logical vector that is TRUE where 'X1' consists only of letters, and assign 'NewVar' as the first element of 'X1' (X1[1]) in each group:
library(data.table)
setDT(dtf)[, NewVar := X1[1], cumsum(grepl('^[A-Za-z]+$', X1))]
dtf
# X1 NewVar
#1: Diagnosis Diagnosis
#2: 1 Diagnosis
#3: 2 Diagnosis
#4: 3 Diagnosis
#5: 4 Diagnosis
#6: Treatment Treatment
#7: 1 Treatment
#8: 2 Treatment
#9: 3 Treatment
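The same cumsum idea can also be written in base R with plain vector indexing. The sketch below is an illustration, not part of the original answers. It assumes the first row is a label row and that label rows contain only letters, and the names is_label and labels are hypothetical:

```r
dtf <- data.frame(
  X1 = c("Diagnosis", "1", "2", "3", "4", "Treatment", "1", "2", "3")
)

# TRUE on rows that hold a label rather than a number
is_label <- grepl("^[A-Za-z]+$", dtf$X1)

# The labels in order of appearance
labels <- dtf$X1[is_label]

# cumsum(is_label) is the index of the most recent label for every row
dtf$NewVar <- labels[cumsum(is_label)]
dtf
```

If the data started with a numeric row, cumsum(is_label) would be 0 there and the indexing would silently drop that element, so the first-row assumption matters.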
