I have two lists of data frames, int1 and int2. The end goal of this code is to assign names to the elements of int1 and int2. The rest of my workflow requires me to name list elements like this many times, so I was wondering how I could write a function, using base R functions, to reduce the number of keystrokes down the line. Any ideas?
library(lubridate)
library(tidyverse)
library(purrr)
date <- rep_len(seq(dmy("01-01-2011"), dmy("31-07-2011"), by = "days"), 200)
ID <- rep(c("A","B", "C"), 200)
df <- data.frame(date = date,
x = runif(length(date), min = 60000, max = 80000),
y = runif(length(date), min = 800000, max = 900000),
ID)
df$Month <- month(df$date)
# Create first list
int1 <- df %>%
# arrange(ID) %>% # skipped for readability of result
mutate(new = floor_date(date, '10 day')) %>%
mutate(new = if_else(day(new) == 31, new - days(10), new)) %>%
group_by(ID, new) %>%
filter(Month == "1") %>%
group_split()
# Create second list
int2 <- df %>%
# arrange(ID) %>% # skipped for readability of result
mutate(new = floor_date(date, '10 day')) %>%
mutate(new = if_else(day(new) == 31, new - days(10), new)) %>%
group_by(ID, new) %>%
filter(Month == "2") %>%
group_split()
# Expected Output
# Assign names to int1
names(int1) <- sapply(int1, function(x) paste(x$ID[1],
x$new[1], sep = "_"))
# Assign names to int2
names(int2) <- sapply(int2, function(x) paste(x$ID[1],
x$new[1], sep = "_"))
group_split() does not name the list elements. This is documented in ?group_split:
it does not name the elements of the list based on the grouping as this typically loses information and is confusing.
Instead, use split() from base R, which names each element by pasting the values of the 'ID' and 'new' columns together with a . separator.
int1 <- df %>%
# arrange(ID) %>% # skipped for readability of result
mutate(new = floor_date(date, '10 day')) %>%
mutate(new = if_else(day(new) == 31, new - days(10), new)) %>%
group_by(ID, new) %>%
filter(Month == "1") %>% ungroup %>%
{split(., .[c('ID', 'new')])}
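To make the naming behaviour concrete, here is a minimal base R sketch on a made-up data frame, `toy`, which stands in for the question's data. It also shows one thing worth checking on real data: by default `split()` keeps every combination of the grouping values, including combinations with no rows, and `drop = TRUE` removes those.

```r
# Made-up example data (an assumption, not the question's df)
toy <- data.frame(
  ID  = c("A", "A", "B"),
  new = as.Date(c("2011-01-01", "2011-01-11", "2011-01-01")),
  x   = 1:3
)

# Default: one element per ID/new combination, even the empty one
parts_all <- split(toy, toy[c("ID", "new")])
length(parts_all)

# drop = TRUE keeps only combinations that actually occur
parts <- split(toy, toy[c("ID", "new")], drop = TRUE)
names(parts)
```

The names are built as `ID.new`; the `sep` argument of `split()` changes the separator if `_` is preferred.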
Similarly for int2
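If the lists have to stay as produced by group_split(), the repeated naming step can be wrapped in a small base R helper. This is only a sketch: `name_by_first` and its arguments are hypothetical names chosen for illustration, and it assumes every list element has at least one row and contains the requested columns.

```r
# Hypothetical helper: name each data frame in a list after the first value
# of the chosen columns, joined with `sep`.
name_by_first <- function(lst, cols = c("ID", "new"), sep = "_") {
  names(lst) <- vapply(
    lst,
    function(d) {
      first_vals <- vapply(cols, function(col) as.character(d[[col]][1]),
                           character(1))
      paste(first_vals, collapse = sep)
    },
    character(1)
  )
  lst
}

# Made-up list standing in for int1 / int2
toy <- list(
  data.frame(ID = "A", new = as.Date("2011-01-01"), x = 1:2),
  data.frame(ID = "B", new = as.Date("2011-01-11"), x = 3:4)
)
toy <- name_by_first(toy)
names(toy)
```

With this in place, each later renaming would be a single call such as `int1 <- name_by_first(int1)`, or `name_by_first(int1, cols = c("ID", "year"))` for a different pair of columns.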
I have a data set that I split into two lists, int1 and int2.
library(lubridate)
library(tidyverse)
library(purrr)
date <- rep_len(seq(dmy("01-01-2011"), dmy("01-01-2013"), by = "days"), 300)
ID <- rep(c("A","B", "C"), 300)
df <- data.frame(date = date,
x = runif(length(date), min = 60000, max = 80000),
y = runif(length(date), min = 800000, max = 900000),
ID)
df$month <- month(df$date)
df$year <- year(df$date)
# Create first list
int1 <- df %>%
mutate(new = floor_date(date, '10 day')) %>%
mutate(new = if_else(day(new) == 31, new - days(10), new)) %>%
group_by(ID, new) %>%
filter(month == "1") %>%
group_split()
# Create second list
int2 <- df %>%
mutate(new = floor_date(date, '10 day')) %>%
mutate(new = if_else(day(new) == 31, new - days(10), new)) %>%
group_by(ID, new) %>%
filter(month == "2") %>%
group_split()
names(int1) <- sapply(int1, function(x) paste(x$ID[1],
x$year[1], sep = '_'))
names(int2) <- sapply(int2, function(x) paste(x$ID[1],
x$year[1], sep = '_'))
I then assign an attribute (match) to each list. I created a function, check, to grab this attribute more easily. I removed some elements from one list for this example.
int1 <- int1[-c(3,6)]
# Convenience function to grab the attributes for you
check <- function(x) {
return(attr(x, "match"))
}
# Add a "match" attribute holding the ID and interval start date of each list element
attr(int1, "match") <- data.frame(id = sapply(int1, function(x) paste(x$ID[1])),
interval_start_date = sapply(int1, function(x) paste(x$new[1]))
)
# Check the attributes
check(int1)
# Add a "match" attribute holding the ID and interval start date of each list element
attr(int2, "match") <- data.frame(id = sapply(int2, function(x) paste(x$ID[1])),
interval_start_date = sapply(int2, function(x) paste(x$new[1]))
)
# Check the attributes
check(int2)
I would like to remove the elements of one list that are not in the other, based on the attribute that I added. Specifically, I would like to remove any element that does not have a matching interval_start_date and ID in the other list. For the interval_start_date, only the year and the day have to match, as the month will most likely differ between the two lists. In this case, I would like int2 to match int1. Any thoughts on how I could do this? A base R method is preferred, if possible.
# Expected results
expected_int2 <- list(int2[[1]], int2[[2]], int2[[3]], int2[[4]], int2[[5]],
int2[[6]], int2[[7]])
names(expected_int2) <- sapply(int1, function(x) paste(x$ID[1],
x$year[1], sep = "_"))
We can create a logical index with %in% after pasting the 'id' together with the formatted 'interval_start_date', i.e. with the 'month' part removed:
i1 <- with(check(int2), paste(id, format(as.Date(interval_start_date),
"%Y-%d"))) %in% with(check(int1), paste(id,
format(as.Date(interval_start_date), "%Y-%d")))
which(i1)
# [1] 1 2 4 5 7 8 9
out <- int2[i1]
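The same idea on a tiny made-up example may make the matching rule easier to follow. The data frames `a` and `b` and the helper `make_key` are hypothetical stand-ins for `check(int1)`, `check(int2)` and the inline `paste()` call; only the year and day of the date enter the key.

```r
# Hypothetical helper: build an "id year-day" key, ignoring the month
make_key <- function(id, start) {
  paste(id, format(as.Date(start), "%Y-%d"))
}

# Made-up stand-ins for check(int1) and check(int2)
a <- data.frame(id = c("A", "B"),
                interval_start_date = c("2011-01-01", "2011-01-11"))
b <- data.frame(id = c("A", "A", "B"),
                interval_start_date = c("2011-02-01", "2011-02-11", "2011-02-11"))

# TRUE where a row of b has a counterpart in a
keep <- make_key(b$id, b$interval_start_date) %in%
  make_key(a$id, a$interval_start_date)
which(keep)
# [1] 1 3
```

Row 2 of `b` is dropped because `a` has no "A" entry starting on day 11.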
I have two lists that I am working with, int1 and int2. Both lists have similar names for their elements. I would like to remove the elements of one list, in this case int2, that are not present in the other list, int1. Is there a good way to do this in base R? I would like my results to look like expected_int2.
library(lubridate)
library(tidyverse)
library(purrr)
date <- rep_len(seq(dmy("01-01-2011"), dmy("31-07-2011"), by = "days"), 200)
ID <- rep(c("A","B", "C"), 200)
df <- data.frame(date = date,
x = runif(length(date), min = 60000, max = 80000),
y = runif(length(date), min = 800000, max = 900000),
ID)
df$Month <- month(df$date)
# Create first list
int1 <- df %>%
# arrange(ID) %>% # skipped for readability of result
mutate(new = floor_date(date, '10 day')) %>%
mutate(new = if_else(day(new) == 31, new - days(10), new)) %>%
group_by(ID, new) %>%
filter(Month == "1") %>%
group_split()
# Assign names to int1
names(int1) <- sapply(int1, function(x) paste(x$ID[1],
x$new[1], sep = "_"))
#Remove list elements for the example
int1 <- int1[-c(6, 8, 9)]
# Create second list
int2 <- df %>%
# arrange(ID) %>% # skipped for readability of result
mutate(new = floor_date(date, '10 day')) %>%
mutate(new = if_else(day(new) == 31, new - days(10), new)) %>%
group_by(ID, new) %>%
filter(Month == "2") %>%
group_split()
# Assign names to int2
names(int2) <- sapply(int2, function(x) paste(x$ID[1],
x$new[1], sep = "_"))
# Expected results
expected_int2 <- list(int2[[1]], int2[[2]], int2[[3]], int2[[4]], int2[[5]], int2[[6]])
names(expected_int2) <- sapply(int1, function(x) paste(x$ID[1],
x$new[1], sep = "_"))
We can remove the month part from the names and then use %in% to check which names of int2 also occur in int1, and subset int2 with that index:
i1 <- sub("(.*)-\\d+-(.*)", "\\1-\\2", names(int2)) %in%
sub("(.*)-\\d+-(.*)", "\\1-\\2", names(int1))
out <- int2[i1]
names(out) <- names(int1)
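To see what the regular expression does, here is a short sketch on made-up names of the form ID_YYYY-MM-DD. The wrapper `strip_month` is a hypothetical name used only for illustration.

```r
# Hypothetical wrapper around the sub() call used above:
# drops the middle "-MM-" part of names such as "A_2011-01-21"
strip_month <- function(nm) sub("(.*)-\\d+-(.*)", "\\1-\\2", nm)

strip_month(c("A_2011-01-21", "B_2011-02-01"))
# [1] "A_2011-21" "B_2011-01"

# Made-up name vectors standing in for names(int1) and names(int2)
n1 <- c("A_2011-01-01", "B_2011-01-21")
n2 <- c("A_2011-02-01", "A_2011-02-11", "B_2011-02-21")
strip_month(n2) %in% strip_month(n1)
# [1]  TRUE FALSE  TRUE
```

Note that copying the names across with `names(out) <- names(int1)` assumes the kept elements come out in the same order and number as the elements of int1, which is worth checking on real data.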
I am trying to filter a data set down to two months. I would like to keep the ID and year combinations that have data for both months, and remove the ID and year combinations that do not have an associated pair.
For example, if an ID and year has both January and July in the data set, I would like to include this ID and year in my filtered data. If an ID has only January and not July, I would like to remove this data and not include it in the filtered data set. Is there a good way to do this? Just a note that I wasn't sure how to simulate the uneven data set in the example.
After filtering for my desired output, I test by creating a list for each seasonal month where each ID and year has at least 15 rows associated with it.
library(lubridate)
library(dplyr)
set.seed(12345)
df <- tibble(
date = sample(seq(dmy("01-01-2010"), dmy("31-12-2013"), by = "days"),
1000, replace = TRUE),
x = runif(length(date), min = 60000, max = 80000),
y = runif(length(date), min = 800000, max = 900000),
ID = rep(1:5, 200),
month = month(date),
year =year(date)) %>%
arrange(ID, date)
df %>%
filter(month %in% c(1,7)) %>%
group_by(ID, year) %>%
mutate(complete = length(unique(month)) == 2) %>%
group_by(ID) %>%
filter(all(complete)) %>%
group_by(ID, year)
# Creates a list for each year and by ID
summer_list <- df %>%
filter(month %in% 7) %>%
filter(n() >= 15) %>%
group_split(year, ID)
# Renames the names in the list to AnimalID and year
names(summer_list) <- sapply(summer_list,
function(x) paste(x$ID[1],
x$year[1], sep = '_'))
# Creates a list for each year and by ID
winter_list <- df1 %>%
filter(month %in% 1) %>%
filter(n() >= 15) %>%
group_split(year, ID)
# Renames the names in the list to ID and year
names(winter_list) <- sapply(winter_list,
function(x) paste(x$ID[1],
x$year[1], sep = '_'))
You were really close. I think your filter can be simplified to the following. Just be sure to save it to df.
df <- df %>%
filter(month %in% c(1,7)) %>%
group_by(ID, year) %>%
mutate(complete = length(unique(month)) == 2) %>%
filter(complete)
# could add "%>% select(-c(complete))" to get rid of complete
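A small made-up example shows the effect of the pair check. The tibble `toy` is an assumption for illustration, and `n_distinct(month) == 2` is used here as a shorthand for `length(unique(month)) == 2`, applied directly inside filter() instead of through a helper column.

```r
library(dplyr)

# Made-up data: ID 1 has January and July, ID 2 has January and March
toy <- tibble(
  ID    = c(1, 1, 2, 2),
  year  = c(2010, 2010, 2010, 2010),
  month = c(1, 7, 1, 3)
)

paired <- toy %>%
  filter(month %in% c(1, 7)) %>%      # keep only the two months of interest
  group_by(ID, year) %>%
  filter(n_distinct(month) == 2) %>%  # keep groups that have both months
  ungroup()

paired
```

Only the two rows for ID 1 remain, because ID 2 has no July rows in 2010.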
On summer_list and winter_list, add a group_by between the filters. With the dataset you provided, there were no groups with 15 records, but I tested that this works by bumping up the size of df until I got some.
summer_list <- df %>%
filter(month == 7) %>% # used == since there's only one test value
group_by(ID, year) %>% # added this
filter(n() >= 15) %>%
group_split()
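The reason the extra group_by matters is that n() counts the rows of the current group, and on ungrouped data the whole table is one group. A minimal sketch on a made-up tibble `toy` (an assumption, not the question's data):

```r
library(dplyr)

toy <- tibble(ID = c(1, 1, 1, 2), year = 2010, month = 7)

# Ungrouped: n() is the total row count (4), so every row passes
ungrouped <- toy %>%
  filter(n() >= 3)

# Grouped: n() is the size of each (ID, year) group, so only ID 1 passes
grouped <- toy %>%
  group_by(ID, year) %>%
  filter(n() >= 3) %>%
  ungroup()

nrow(ungrouped)
nrow(grouped)
```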
There's also a typo in your first use of winter_list -- the input data is df1, but I think you want df. Hope this works!
Here's the complete code including the larger df:
library(lubridate)
library(dplyr)
set.seed(12345)
df <- tibble(
date = sample(seq(dmy("01-01-2010"), dmy("31-12-2013"), by = "days"),
4000, replace = TRUE),
x = runif(length(date), min = 60000, max = 80000),
y = runif(length(date), min = 800000, max = 900000),
ID = rep(1:5, 800),
month = month(date),
year =year(date)) %>%
arrange(ID, date)
df <- df %>%
filter(month %in% c(1,7)) %>%
group_by(ID, year) %>%
mutate(complete = length(unique(month)) == 2) %>%
filter(complete)
# could add "%>% select(-c(complete))" to get rid of complete
# Creates a list for each year and by ID
summer_list <- df %>%
filter(month == 7) %>%
group_by(ID, year) %>%
filter(n() >= 15) %>%
group_split()
# Renames the names in the list to AnimalID and year
names(summer_list) <- sapply(summer_list,
function(x) paste(x$ID[1],
x$year[1], sep = '_'))
# Creates a list for each year and by ID
winter_list <- df %>%
filter(month == 1) %>%
group_by(ID, year) %>%
filter(n() >= 15) %>%
group_split()
# Renames the names in the list to ID and year
names(winter_list) <- sapply(winter_list,
function(x) paste(x$ID[1],
x$year[1], sep = '_'))
I want to calculate the total number of negative values for each observation, using the previous 10 observations. I used the following code, but it does not work:
funda_addit <- funda_addit %>%
group_by(TICKER) %>%
arrange(year) %>%
mutate(NEG_EARN = rollapply(ni, 10, sum (ni<0), partial=TRUE)) %>%
ungroup()
What I actually want is to create a new variable, NEG_EARN, which is the number of negative values of the variable ni over the previous 10 observations (10 years in my data). I also tried the following code, but it does not work either:
funda_addit <- funda_addit %>%
group_by(TICKER) %>%
arrange(year) %>%
mutate(NEG_EARN = rollapply(ni, 10, length(which(ni<0)), partial=TRUE)) %>%
ungroup()
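You could create a vector cumsum(ni < 0) and then subtract a lagged version of that vector from it. The running count of negatives minus the running count 10 rows earlier is the number of negatives in the last 10 rows: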
funda_addit %>%
group_by(TICKER) %>%
arrange(year) %>%
mutate(neg_earn = {cs <- cumsum(ni < 0)
cs - lag(cs, 10, default = 0)})
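The trick can be checked in base R on a short made-up vector. In this sketch the window width `k` is 3 instead of 10 to keep the numbers readable, the lag is built by hand instead of with dplyr::lag, and the window includes the current observation.

```r
ni <- c(-1, 2, -3, -4, 5, -6)  # made-up values
k  <- 3                        # window width (10 in the answer above)

cs     <- cumsum(ni < 0)              # running count of negatives
lagged <- c(rep(0, k), head(cs, -k))  # same count, k positions earlier
rolling_neg <- cs - lagged
rolling_neg
# [1] 1 1 2 2 2 2

# Brute-force check: count negatives in each trailing window directly
brute <- sapply(seq_along(ni), function(i) sum(ni[max(1, i - k + 1):i] < 0))
all(rolling_neg == brute)
# [1] TRUE
```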
This is equivalent to akrun's answer if you change rollapply to rollapplyr (tested using akrun's example data and the function f1 defined there):
use_cumsum <-
funda_addit %>%
group_by(TICKER) %>%
arrange(year) %>%
mutate(neg_earn = {cs <- cumsum(ni < 0)
cs - lag(cs, 10, default = 0)})
use_rollapply <-
funda_addit %>%
group_by(TICKER) %>%
arrange(year) %>%
mutate(neg_earn = rollapplyr(ni, 10, FUN = f1, partial=TRUE))
all(use_cumsum == use_rollapply)
# [1] TRUE
rollapply expects a function in FUN, so we can pass an anonymous function (or create a named one) instead of an expression evaluated on the whole column 'ni':
library(dplyr)
library(zoo)
f1 <- function(x) sum(x < 0)
funda_addit %>%
group_by(TICKER) %>%
arrange(year) %>%
mutate(NEG_EARN = rollapplyr(ni, 10, FUN = f1, partial=TRUE)) %>%
ungroup()
EDIT: changed rollapply to rollapplyr (based on comments from @IceCreamToucan)
data
set.seed(24)
funda_addit <- data.frame(TICKER = rep(LETTERS[1:3], each = 20),
year = 1921:1940, ni = rnorm(60))
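To illustrate the difference between passing a value and passing a function, here is a minimal sketch on a made-up vector with a window of 3 instead of 10. `sum(ni < 0)` is evaluated once on the whole vector and yields a single number, whereas a function is called on each window separately.

```r
library(zoo)

ni <- c(-1, 2, -3, -4, 5, -6)  # made-up values

# A single number for the whole vector, not a function
sum(ni < 0)
# [1] 4

# A function applied to each right-aligned window; partial = TRUE lets the
# first windows be shorter than the full width
res <- rollapplyr(ni, 3, FUN = function(x) sum(x < 0), partial = TRUE)
res
```

Each element of `res` counts the negatives among the current value and up to two preceding values.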
This is my code, and I have a problem with group_by:
library(dplyr)
library(lubridate)
library(openxlsx)  # provides read.xlsx()
df <- read.xlsx("Data.xlsx", sheet = "Sector-STOXX600", startRow = 2,colNames = TRUE, detectDates = TRUE, skipEmptyRows = FALSE)
df[2:19] <- data.matrix(df[2:19])
percent_change2 <- function(x)last(x)/first(x) - 1
monthly_return <- df %>%
group_by(gr = floor_date(Date, unit = "month")) %>%
summarize_at(vars(-Date, -gr), percent_change2) %>%
ungroup() %>%
select(-gr) %>%
as.matrix()
I get this error:
"Error in is_character(x) : object 'gr' not found"
Here is a sample of the dataset :
Date .SXQR .SXTR .SXNR .SXMR .SXAR .SX3R .SX6R .SXFR .SXOR .SXDR .SX4R .SXRR .SXER
1 2000-01-03 364.94 223.93 489.04 586.38 306.56 246.81 385.36 403.82 283.78 455.39 427.43 498.08 457.57
2 2000-01-04 345.04 218.90 474.05 566.15 301.13 239.24 374.64 390.41 275.93 434.92 414.10 476.17 435.72
UPDATE
volatility_function<- function(x)sqrt(252) * sd(diff(log(x))) * 100
annualized_volatility <- df %>%
mutate(Date=ymd(Date)) %>%
group_by(gr = floor_date(Date, unit = "year")) %>%
select(gr,everything()) %>%
summarize_at(vars(-Date, -gr), volatility_function) %>%
ungroup() %>% select(-gr) %>%
as.matrix()
head(annualized_volatility,5)
I tried what @NeslonGon told me to do; however, I now get the same error in another function. What should I do?
The idea is that the grouping variable gr is already excluded from summarize_at, so vars() only needs to drop Date; referring to -gr there is what raises the error. The mutate and select calls are only for convenience and can be skipped.
df %>%
mutate(Date=ymd(Date)) %>%
group_by(gr = floor_date(Date, unit = "month")) %>%
select(gr,everything()) %>%
summarize_at(vars(-Date), percent_change2) %>%
ungroup() %>%
select(-gr) %>%
as.matrix()
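A self-contained sketch on made-up prices shows the pattern end to end. The tibble `prices` and its column `a` are assumptions for illustration; `percent_change2` is the function from the question.

```r
library(dplyr)
library(lubridate)

percent_change2 <- function(x) last(x) / first(x) - 1

# Made-up data: two observations in each of two months
prices <- tibble(
  Date = ymd(c("2000-01-03", "2000-01-31", "2000-02-01", "2000-02-29")),
  a    = c(100, 110, 200, 150)
)

monthly <- prices %>%
  group_by(gr = floor_date(Date, unit = "month")) %>%
  summarize_at(vars(-Date), percent_change2)  # gr is excluded automatically

monthly
```

The result has one row per month, with `a` equal to 0.10 for January and -0.25 for February. The same change, dropping `-gr` from vars(), should also apply to the yearly volatility version in the update.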