I am new to data tables in R and have managed to get 80% of the way through my analysis. The background is that I want to get the returns of a stock 5 days (before and after), and then 25 and 45 days after they report. I have successfully managed to do it for one set of dates (effectively hardcoding) but when I try and automate the process it falls apart.
I will start with my current formulas and then explain the data.
This formula successfully looks at the data tables and returns the sum that I need. The issue is that datem5 and V1 need to go through a loop (or mapply) to automate the process.
CQ_Date[CQ_DF[CQ_Date, sum(CQ), on = .(unit, date >= date1, date <= datem5),
by = .EACHI], newvar := V1, on = .(unit, date1=date)]
I tried this (along with many other variants). Please note the newvar needs to be addressed as well.
for (i in 1:4) {
CQ_Date[CQ_DF[CQ_Date, sum(CQ), on = .(unit, date >= date1, date <= cols[,..i]),
by = .EACHI], newvar := V1, on = .(unit, date1=date)]
}
but get this error
Error: argument specifying columns specify non existing column(s): cols[3]='cols[, ..i]'
Interestingly, when I try
for (i in 1:2) {
y <- cols[,..i]}
There is no issue.
Now, in terms of data:
cols just contains the column headings that I need from CQ_Date
cols <- data.table("datem5", "datep5", "datep20" , "datep45")
CQ_Date has the reporting dates for the stock CQ, such as the following
CQ_Date <- data.frame("date1" = anydate(c("2016-02-17", "2016-06-12", "2016-08-17")))
CQ_Date$datem5 <- CQ_Date$date1 - 5 # minus five days
CQ_Date$datep5 <- CQ_Date$date1 + 5 # plus five days
CQ_Date$datep20 <- CQ_Date$date1 + 20
CQ_Date$datep45 <- CQ_Date$date1 + 45
CQ_Date$unit <- 1 # I guess I need this for some sort of indexing
Then CQ_DF (it is the log returns for the stock) is formed by:
CQ_DF <- data.frame("unit" = rep(1,300))
CQ_DF$CQ <- rnorm(10)
CQ_DF$date <- seq(as.Date("2015-12-25"), by = "day", length.out = 300)
CQ_DF$unit <- 1
Before setting them as DT
setDT(CQ_DF)
setDT(CQ_Date)
Any help would be greatly appreciated. Note this uses
library(data.table)
library(anytime)
A simplified version is:
CQ_Date <- data.frame("date1" = c(10, 20))
CQ_Date$datep5 <- CQ_Date$date1 + 5 # plus five days
CQ_Date$datep20 <- CQ_Date$date1 + 10
CQ_Date$unit <- 1
CQ_DF <- data.frame("unit" = rep(1,100))
CQ_DF$CQ <- seq(1, by = 1, length.out = 100)
CQ_DF$date <- seq(1, by = 1, length.out = 100)
CQ_DF$unit <- 1
setDT(CQ_DF)
setDT(CQ_Date)
cols <- c("datep5", "datep20" )
tmp <- melt(CQ_Date, measure.vars = cols)
setDT(tmp)
tmp[CQ_DF[tmp, sum(CQ), on = .( unit, date >= date1, date <= value), by =
.EACHI],newvar := V1, on = .(unit, date1=date )]
The issue now is that the sum does not appear to be calculated correctly. It may have something to do with the "variable" column that melt creates.
Instead of using mapply or a for loop, try reshaping the dataset into long format using melt, create the sequence of dates between each start and end value, perform the join, and calculate the sum.
library(data.table)
cols <- c("datep5", "datep20" )
tmp <- melt(CQ_Date, measure.vars = cols)
tmp <- tmp[, list(date = seq(date1, value)), .(unit, variable, date1, value)]
tmp <- merge(tmp, CQ_DF, by = c('unit', 'date'))
tmp[, .(newvar = sum(CQ)), .(unit, variable, date1)]
# unit variable date1 newvar
#1: 1 datep5 10 75
#2: 1 datep20 10 165
#3: 1 datep5 20 135
#4: 1 datep20 20 275
If you need the data back in wide format you can use dcast.
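As a minimal sketch of that reshaping step, assuming the summed result has the columns shown in the output above (the object names `long` and `wide` are hypothetical, and the values are typed in by hand from that output):

```r
library(data.table)

# Long-format result, copied from the output shown above
long <- data.table(
  unit     = 1,
  variable = factor(c("datep5", "datep20", "datep5", "datep20"),
                    levels = c("datep5", "datep20")),
  date1    = c(10, 10, 20, 20),
  newvar   = c(75, 165, 135, 275)
)

# One row per unit/date1, one column per window
wide <- dcast(long, unit + date1 ~ variable, value.var = "newvar")
wide
```

Keeping `variable` as a factor (which is what melt returns) preserves the column order of the windows in the wide result.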
An equivalent tidyverse option is:
library(tidyverse)
CQ_Date %>%
pivot_longer(cols = all_of(cols)) %>%
mutate(date = map2(date1, value, seq)) %>%
unnest(date) %>%
left_join(CQ_DF, by = c('unit', 'date')) %>%
group_by(unit, name, date1) %>%
summarise(newvar = sum(CQ))
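If you would rather keep the non-equi join from the question, a loop can also work. The sketch below assumes the simplified data from the question and relies on `on` accepting character strings, so the upper bound can be pasted together per column. The `sum_` column prefix and the name `window_sums` are hypothetical. The results are assigned by position rather than joined back on date1, which avoids ambiguous matches when several rows share the same date1.

```r
library(data.table)

# Toy data mirroring the simplified example in the question
CQ_Date <- data.table(unit = 1, date1 = c(10, 20))
CQ_Date[, `:=`(datep5 = date1 + 5, datep20 = date1 + 10)]
CQ_DF <- data.table(unit = 1, date = as.numeric(1:100), CQ = as.numeric(1:100))

cols <- c("datep5", "datep20")
for (col in cols) {
  # Build the join condition as strings so the end-date column can vary
  window_sums <- CQ_DF[CQ_Date, sum(CQ),
                       on = c("unit", "date>=date1", paste0("date<=", col)),
                       by = .EACHI]
  # by = .EACHI gives one row per row of CQ_Date, in the same order
  set(CQ_Date, j = paste0("sum_", col), value = window_sums$V1)
}
CQ_Date
```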
I have a pretty large data set with users and their membership start and end dates. For each membership period there is one entry.
I have another dataset, which is coming from the support system, and it has records of user id's along with the dates of each system usage. This dataset is even larger, as there is one record for each usage.
I need to aggregate the second and combine with the first one, based on each user and membership period.
I tried a function with a for loop, but for an extremely large dataset (here we are talking about a few million rows) this will take ages.
Edit: The join or merge will not work, because here there are several ranges (between start and end dates) for each ID in the first frame. Each range has been assigned a number. (Period of membership) The second data frame has dates and IDs and the problem is finding the membership period for each ID & date by comparing it to the date ranges in the first frame.
Here is the code, along with mock datasets and what I want to achieve at the end:
library(data.table)
library(dplyr)
ids <- c(rep("id1", 5), rep("id2", 5), rep("id3", 5))
#
stdates <- c("2015-08-01", "2016-08-01", "2017-08-01", "2018-08-01", "2019-08-01",
"2013-05-07", "2014-05-07", "2015-05-07", "2016-05-07", "2017-05-07",
"2011-02-13", "2013-02-13", "2015-02-13", "2016-02-13", "2017-02-13")
#
endates <- c("2016-07-31", "2017-07-31", "2018-07-31", "2019-07-31", "2020-07-31",
"2014-05-06", "2015-05-06", "2016-05-06", "2017-05-06", "2018-05-06",
"2013-02-12", "2015-02-12", "2016-02-12", "2017-02-12", "2018-02-12")
#
# First dataset:
df <- data.table(id = ids,
stdate = stdates,
endate = endates)
#
df <- df %>%
arrange(id, desc(endate))
#
# Add the membership period number for each user:
setDT(df)
df[, counter := rowid(id)]
#
# Second dataset:
ids2 <- sample(df$id, 1000, replace = TRUE)
dates2 <- sample(seq(Sys.Date() - 7*365, Sys.Date() - 365, 1), 1000)
#
df2 <- data.table(id = ids2,
dateticket = dates2)
#
# Function
counterFunc <- function(d2, d1) {
d2$groupCounter <- NA
for (i in 1:nrow(d2)) {
crdate <- d2$dateticket[i]
idtemp <- d2$id[i]
dtemp <- d1 %>%
filter(id == idtemp) %>%
data.table()
dtemp[, drcode := ifelse(crdate >= stdate & crdate <= endate, 1, 0)]
if (length(unique(dtemp$drcode)) == 2) {
dtempgc <- dtemp[drcode == 1]$counter
d2$groupCounter[i] <- dtempgc
}
if (length(unique(dtemp$drcode)) != 2) {
d2$groupCounter[i] <- 0
}
print(i)
}
return(d2)
}
#
# The result I want to get without a for loop:
df2gc <- counterFunc(df2, df)
#
The operation you want to do is called "joining", so depending on the direction and completion of that "joining" there are some options.
Here is a simple example:
library(dplyr)
df1<-data.frame("ID"=c("1","2","3","1","2"),"First_Name"=c("A","B","C","D","E"))
df2<-data.frame("ID"=c("1","2","3"),"Last_Name"=c("Ko","Lo","To"))
left_join(df1,df2,by = "ID")
The result looks like this:
ID First_Name Last_Name
1 A Ko
2 B Lo
3 C To
1 D Ko
2 E Lo
left_join from the dplyr package simply looked up the relevant values in the look-up table (df2) and added them to the original table (df1, the left table) based on a "key" (by = "ID" in this case).
There are other operations that specify the terms of the join more precisely, but left_join should be helpful in your case.
EDIT:
I have better understood your problem now. Please check if this solves it:
library(tidyverse)
df %>%
mutate(stdate = as.Date(stdate), endate = as.Date(endate)) %>%
left_join(df2, by = "id") %>%
mutate(check = case_when(dateticket >= stdate & dateticket <= endate ~ "TRUE", TRUE ~ "FALSE")) %>%
filter(check == "TRUE")
Edit:
If the join fails with the error "Cannot allocate vector of size", please refer to this:
Left_join error cannot allocate vector of size
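If memory is the bottleneck, one option that avoids building the full id-by-id join first is a data.table non-equi update join. The following is only a sketch on a small hand-made sample (the object names `periods` and `tickets` are hypothetical stand-ins for df and df2), and it assumes the membership periods of one id do not overlap:

```r
library(data.table)

# Small stand-in for the membership table (df in the question)
periods <- data.table(
  id      = c("id1", "id1", "id2"),
  stdate  = as.Date(c("2015-08-01", "2016-08-01", "2013-05-07")),
  endate  = as.Date(c("2016-07-31", "2017-07-31", "2014-05-06")),
  counter = c(2L, 1L, 1L)
)

# Small stand-in for the ticket table (df2 in the question)
tickets <- data.table(
  id         = c("id1", "id1", "id2", "id2"),
  dateticket = as.Date(c("2015-09-15", "2017-01-10", "2013-06-01", "2020-01-01"))
)

# For each ticket, look up the period whose range contains the ticket date
tickets[periods, groupCounter := i.counter,
        on = .(id, dateticket >= stdate, dateticket <= endate)]

# Tickets outside every period get 0, as in the loop version
tickets[is.na(groupCounter), groupCounter := 0L]
tickets
```

The update happens by reference, so no intermediate table with one row per ticket-period combination is materialised.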
I am trying to sum an amount if a column of dates (z) falls within a specified range. Unfortunately my loop doesn't seem to work and I get a NULL output.
Sd <- as.Date('2017-01-01',tz = "GMT")
EndDate <- as.Date('2017-01-20',tz = "GMT")
Ed <- EndDate + 30
LTV1 <- while (Sd < Ed) {
Sd <- Sd + 1
LTV1 <- LTV %>% group_by(InstallationDate)%>% filter(z < Sd) %>%
summarize(Amount = sum(USAmount))
}
as.data.frame(LTV1)
Apologies everyone. I am quite new to this. Here is a reproducible example:
Sample <- as.data.frame(seq(as.Date("2017/01/01"), by = "day", length.out = 15))
Sample$Amount <- c(10,5,3,4,8,65,89,47,74,95,85,63,32,45,32)
colnames(Sample)[1] <- "date"
Sd <- as.Date('2017-01-01',tz = "GMT")
EndDate <- as.Date('2017-01-5',tz = "GMT")
Ed <- EndDate + 3
Sample1 <- while (Sd < Ed) {
Sd <- Sd + 1
Sample1 <- Sample %>% group_by(date)%>% filter(date < Sd) %>%
summarize(Amount = sum(Amount))
}
as.data.frame(Sample1)
Desired Output will be:
Dates: Day 1 Day 2 Day 3 .......................
Amount: 25 54 89 .......................
I think this is what you want:
LTV1 = LTV %>%
arrange(InstallationDate) %>%
group_by(InstallationDate) %>%
summarize(daily_amount = sum(USAmount)) %>%
ungroup() %>%
mutate(cumulative_amount = cumsum(daily_amount))
I believe the cumulative_amount column is what you are trying to create, though it's hard to tell and impossible to test since you haven't shared your input data reproducibly. I also can't tell whether the arrange, group_by, and summarize steps are needed: if your data is already in order by date, arrange isn't needed, and if your data only has one row per day, the grouping and summarizing aren't needed.
You assign (why?) the result of while to LTV1. A while loop always returns NULL, so that outer assignment overwrites whatever the loop body stored in LTV1.
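Both points can be checked in base R with the Sample data from the question. This is a small sketch; the names `res` and `cumulative_amount` are only for illustration.

```r
# The value of a while loop is always NULL, whatever the body does
i <- 0
res <- while (i < 3) i <- i + 1
is.null(res)

# Running total over the reproducible example, one row per day
Sample <- data.frame(
  date   = seq(as.Date("2017-01-01"), by = "day", length.out = 15),
  Amount = c(10, 5, 3, 4, 8, 65, 89, 47, 74, 95, 85, 63, 32, 45, 32)
)
Sample$cumulative_amount <- cumsum(Sample$Amount)
head(Sample)
```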
For the following dataset:
d = data.frame(date = as.Date(as.Date('2015-01-01'):as.Date('2015-04-10'), origin = "1970-01-01"),
group = rep(c('A','B','C','D'), 25), value = sample(1:100))
head(d)
date group value
1: 2015-01-01 A 4
2: 2015-01-02 B 32
3: 2015-01-03 C 46
4: 2015-01-04 D 40
5: 2015-01-05 A 93
6: 2015-01-06 B 10
.. can anyone advise a more elegant way to calculate a cumulative total of values by group than this (data.table) method?
library(data.table)
setDT(d)
d.cast = dcast.data.table(d, group ~ date, value.var = 'value', fun.aggregate = sum)
c.sum = d.cast[, as.list(cumsum(unlist(.SD))), by = group]
.. which is pretty clunky and yields a flat matrix that needs dplyr::gather or reshape2::melt to reformat.
Surely R can do better than this??
If you just want cumulative sums per group, then you can do
transform(d, new=ave(value,group,FUN=cumsum))
with base R.
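To see what ave does here, a tiny hand-made example (the data frame `toy` is hypothetical) makes the per-group running total easy to follow:

```r
# Two interleaved groups; ave() applies cumsum within each group
# and returns the results in the original row order
toy <- data.frame(group = c("A", "B", "A", "B"),
                  value = c(1, 2, 3, 4))
toy <- transform(toy, new = ave(value, group, FUN = cumsum))
toy
```

Group A accumulates 1 then 4, and group B accumulates 2 then 6, with the rows left where they were.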
This should work
library(dplyr)
d %>%
group_by(group) %>%
arrange(date) %>%
mutate(Total = cumsum(value))
As this question was tagged with data.table, you are probably looking for the following (a modification of @Frank's comment).
setDT(d)[order(date), new := cumsum(value), by = group]
This computes the cumulative sum within each group in date order (if the data is already sorted by date, you can drop order(date)) and updates your data set in place using the := operator. The rows themselves are not reordered.
Is this it?
sp <- split(d, d$group)
res <- lapply(seq_along(sp), function(i) cumsum(sp[[i]]$value))
res <- lapply(seq_along(res), function(i){
sp[[i]]$c.sum <- res[[i]]
sp[[i]]
})
res <- do.call(rbind, res)
res
This is related to an earlier question of mine, but I decided to ask another question for the sake of clarity, as the 'new' question is not directly related to the original. Briefly, I am using ddply to cumulatively sum a value for each of three years. My code takes data from the first year and repeats it in the second and third-year rows of the column. My guess is that each 1-year chunk is being copied to the whole of the column, but I don't understand why.
Q. How can I get a cumulatively summed value for each year, in the right rows of the designated column?
[Edit: the for loop - or something similar - is important, as ultimately I want to automagically calculate new columns based on a list of column names, rather than calculating each new column by hand. The loop iterates over the list of column names.]
I use the ddply and cumsum combination frequently so it is rather vexing to suddenly be having problems with it.
[Edit: this code has been updated to the solution I settled on, which is based on @Chase's answer below]
require(lubridate)
require(plyr)
require(xts)
require(reshape)
require(reshape2)
set.seed(12345)
# create dummy time series data
monthsback <- 24
startdate <- as.Date(paste(year(now()),month(now()),"1",sep = "-")) - months(monthsback)
mydf <- data.frame(mydate = seq(as.Date(startdate), by = "month", length.out = monthsback),
myvalue1 = runif(monthsback, min = 600, max = 800),
myvalue2 = runif(monthsback, min = 1900, max = 2400),
myvalue3 = runif(monthsback, min = 50, max = 80),
myvalue4 = runif(monthsback, min = 200, max = 300))
mydf$year <- as.numeric(format(as.Date(mydf$mydate), format="%Y"))
mydf$month <- as.numeric(format(as.Date(mydf$mydate), format="%m"))
# Select columns to process
newcolnames <- c('myvalue1','myvalue4','myvalue2')
# melt n' cast
mydf.m <- mydf[,c('mydate','year',newcolnames)]
mydf.m <- melt(mydf.m, measure.vars = newcolnames)
mydf.m <- ddply(mydf.m, c("year", "variable"), transform, newcol = cumsum(value))
mydf.m <- dcast(mydate ~ variable, data = mydf.m, value.var = "newcol")
colnames(mydf.m) <- c('mydate',paste(newcolnames, "_cum", sep = ""))
mydf <- merge(mydf, mydf.m, by = 'mydate', all = FALSE)
mydf
I don't really follow your for loop there, but are you overcomplicating things? Can't you just directly use transform and ddply?
#Make sure it's ordered properly
mydf <- mydf[order(mydf$year, mydf$month),]
#Use ddply to calculate the cumsum by year:
ddply(mydf, "year", transform,
cumsum1 = cumsum(myvalue1),
cumsum2 = cumsum(myvalue2))
#----------
mydate myvalue1 myvalue2 year month cumsum1 cumsum2
1 2010-05-01 744.1808 264.4543 2010 5 744.1808 264.4543
2 2010-06-01 775.1546 238.9828 2010 6 1519.3354 503.4371
3 2010-07-01 752.1965 269.8544 2010 7 2271.5319 773.2915
....
9 2011-01-01 745.5411 218.7712 2011 1 745.5411 218.7712
10 2011-02-01 797.9474 268.1834 2011 2 1543.4884 486.9546
11 2011-03-01 606.9071 237.0104 2011 3 2150.3955 723.9650
...
21 2012-01-01 690.7456 225.9681 2012 1 690.7456 225.9681
22 2012-02-01 665.3505 232.1225 2012 2 1356.0961 458.0906
23 2012-03-01 793.0831 206.0195 2012 3 2149.1792 664.1101
EDIT - this is untested as I don't have R on this machine, but this is what I had in mind:
require(reshape2)
mydf.m <- melt(mydf, measure.vars = newcolnames)
mydf.m <- ddply(mydf.m, c("year", "variable"), transform, newcol = cumsum(value))
dcast(mydate + year + month ~ variable, data = mydf.m, value.var = "newcol")
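If the goal is only to add one cumulative column per name in a list, a base R loop with ave is another option that skips the melt and cast round trip. This is a sketch on a small hand-made data frame (the values are made up, and the `_cum` suffix follows the naming used in the question); it assumes the rows are already in date order within each year.

```r
# Toy stand-in for mydf: two years, two value columns
mydf <- data.frame(year     = c(2010, 2010, 2011, 2011),
                   myvalue1 = c(1, 2, 3, 4),
                   myvalue4 = c(10, 20, 30, 40))

newcolnames <- c("myvalue1", "myvalue4")

# One new column per name, cumulative within each year
for (nm in newcolnames) {
  mydf[[paste0(nm, "_cum")]] <- ave(mydf[[nm]], mydf$year, FUN = cumsum)
}
mydf
```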