Summarizing data.table: creating multiple columns subset by date in R

I have data about ID and the corresponding amount over multiple years. Something like this:
ID <- c(rep("A", 5), rep("B", 7), rep("C", 3))
amount <- c(sample(1:10000, 15))
Date <- c("2016-01-22","2016-07-25", "2016-09-22", "2017-10-22", "2017-01-02",
"2016-08-22", "2016-09-22", "2016-10-22", "2017-08-22", "2017-09-22", "2017-10-22", "2018-08-22",
"2016-10-22","2017-10-25", "2018-10-22")
Now I want to analyse every year of every ID. Specifically, I am interested in the amount. First, I want to know the overall amount for every year. Then I also want to know the overall amount for the first 11 months, the first 10 months, the first 9 months and the first 8 months of every year. For this purpose I have calculated the cumulative sum for every ID per year as follows:
library(data.table)
myData <- data.table(ID, amount, Date)
# create cumsum per ID per year
myData[, Date := as.Date(Date, format = "%Y-%m-%d")]
# sort by ID and date in place so the cumulative sum runs chronologically
setorder(myData, ID, Date)
myData[, CumSum := cumsum(amount), by = .(ID, year(Date))]
How can I summarise the data.table so that I get columns amount9month, amount10month and amount11month for every ID in every year?

Between cumsum, by and dcast this is almost straightforward. The most difficult bit is dealing with the months that have no data in them. Hence this solution isn't as brief as it could have been, but it does things the "data.table way" and avoids slow operations like looping through rows.
# Just sort the formatting out first
myData[, Date:=as.Date(Date)]
myData[, `:=`(amount = as.numeric(amount),
year = year(Date),
month = month(Date))]
bycols <- c('ID', 'year', 'month')
# Summarise all transactions for the same ID in the same month
summary <- myData[, .(amt = sum(amount)), by=bycols]
# Create a skeleton table with all possible combinations of ID, year and month, to fill in any gaps.
skeleton <- myData[, CJ(ID, year, month = 1:12, unique = TRUE)]
# Join the skeleton to the actual data, to recreate the data but with no gaps in
result.long <- summary[skeleton, on=bycols, allow.cartesian=TRUE]
result.long[, amt.cum:=cumsum(fcoalesce(amt, 0)), by=c('ID', 'year')]
# Cast the data into wide format to have one column per month
result.wide <- dcast(result.long, ID + year ~ paste0('amount',month,'month'), value.var='amt.cum')
NB. If you don't have fcoalesce, update your data.table package.
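To see how the pieces fit together, here is a minimal sketch of the same pipeline on a tiny made-up table. The object names (dt, monthly, grid, long, wide) and the amounts are assumptions for illustration only; the sketch keeps just months 8 to 12 before casting, which is roughly what the question asks for.

```r
library(data.table)

# Hypothetical toy data: two IDs with a few transactions in 2016
dt <- data.table(
  ID     = c("A", "A", "A", "B"),
  Date   = as.Date(c("2016-01-22", "2016-09-22", "2016-12-01", "2016-10-05")),
  amount = c(100, 50, 25, 10)
)
dt[, `:=`(year = year(Date), month = month(Date))]

# Monthly totals, then a full ID x year x month grid to fill the gaps
monthly <- dt[, .(amt = sum(amount)), by = .(ID, year, month)]
grid    <- dt[, CJ(ID, year, month = 1:12, unique = TRUE)]
long    <- monthly[grid, on = .(ID, year, month)]

# Cumulative amount per ID and year; months without data count as 0
long[, amt.cum := cumsum(fcoalesce(amt, 0)), by = .(ID, year)]

# Keep only the months of interest and cast to one column per month
wide <- dcast(long[month %in% 8:12],
              ID + year ~ paste0("amount", month, "month"),
              value.var = "amt.cum")
print(wide)
```

In this toy case amount12month is the yearly total, and amount8month for ID B is 0 because B has no transactions before October.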

Which format do you want the result in? The requested cumulative sums are easy to get in two different formats:
# Prepare the data
ID <- c(rep("A", 5), rep("B", 7), rep("C", 3))
amount <- c(sample(1:1, 15, replace = TRUE))
Date <- c("2016-01-22","2016-07-25", "2016-09-22", "2017-10-22", "2017-01-02", "2016-08-22", "2016-09-22", "2016-10-22", "2017-08-22", "2017-09-22", "2017-10-22", "2018-08-22", "2016-10-22","2017-10-25", "2018-10-22")
myData <- data.frame(ID, amount, Date)
# Add year column
myData$Date <- as.Date(myData$Date, format = "%Y-%m-%d")
myData$year <- format(myData$Date,"%Y")
Please note that I set every amount to 1 for testing purposes, which makes the cumulative sums easy to check. Now the two solutions:
# Format 1
by(myData$amount, list(myData$ID, myData$year), cumsum, simplify = TRUE)
# Format 2
aggregate(myData$amount, list(ID = myData$ID, Date = myData$year), cumsum)
However, you might want the result as a new column in the data frame. A loop can do that:
# Format: New column
myData <- myData[order(myData$year, myData$ID, myData$Date), ] # sort by year, ID and date
myData$cumsum <- rep(0, nrow(myData))
for (r in seq_len(nrow(myData))) {
  if (r > 1 && myData$year[r-1] == myData$year[r] && myData$ID[r-1] == myData$ID[r]) {
    # same ID and year as the previous row: continue the running total
    myData$cumsum[r] <- myData$cumsum[r-1] + myData$amount[r]
  } else {
    # first row of a new ID/year group: restart the running total
    myData$cumsum[r] <- myData$amount[r]
  }
}
I do not know a smooth solution in base R. Maybe someone from the "dplyr faction" has a neat trick up their sleeve?
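For what it is worth, base R's ave() may cover this case without a loop. The sketch below uses a small made-up data frame with the same column names as above (ID, year, amount); the values are assumptions chosen only so the result is easy to check by eye.

```r
# Toy data mirroring the columns used above (values are made up)
myData <- data.frame(
  ID     = c("A", "A", "A", "B", "B"),
  year   = c("2016", "2016", "2017", "2016", "2016"),
  amount = c(1, 2, 3, 4, 5)
)

# ave() applies FUN within each ID/year group and returns the results
# in the original row order, so it can be assigned straight to a column
myData$cumsum <- ave(myData$amount, myData$ID, myData$year, FUN = cumsum)
print(myData)
```

As with the loop, the rows should be sorted by date within each group first if the running total is meant to be chronological.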

Related

How to randomize a date in R

I'm trying to back into a fake birthdate based on the age of a consumer. I'm using lubridate package. Here is my code:
ymd(today) - years(df$age) - months(sample(1:12, 1)) - days(sample(1:31, 1)).
I want to use this to generate a different dob that equals the age. When I run this inline it gives every row the same month and day and different year. I want the month and day to vary as well.
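You can build a date on 1 January of the birth year and then add a random number of days to it. Drawing one offset per row, rather than a single sample(..., 1) value that gets recycled, is what makes the month and day vary between rows.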
library(lubridate)
library(dplyr)
set.seed(5)
df <- data.frame(age = c(18, 33, 58, 63))
df %>%
  # replace = TRUE keeps the draw valid for data with more than 365 rows
  mutate(dob = make_date(year(Sys.Date()) - age, 1, 1) +
           duration(sample(0:364, n(), replace = TRUE), units = "days"))
In base R, we can subtract the age column from the current year to get the birth year, select a random month and day, paste the values together and create a Date object.
set.seed(123)
df <- data.frame(age = sample(100, 5))
as.Date(paste(as.integer(format(Sys.Date(), "%Y")) - df$age,
sprintf("%02d", sample(12, nrow(df))),
sprintf("%02d", sample(30, nrow(df))), sep = "-"))
#[1] "1990-01-29" "1940-06-14" "1978-09-19" "1933-05-16" "1928-04-03"
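However, sample(30, ...) can produce dates that do not exist, such as 30 February, which as.Date() cannot parse. You might need an extra check for February or, to be safe, sample the day from 1 to 28 instead of 1 to 30. Also note that sample(12, nrow(df)) draws without replacement, so it fails for more than 12 rows unless you add replace = TRUE.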
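One way to sidestep the invalid-date problem in base R is to add a random day offset to 1 January of the birth year, so every result is a real calendar date. This is only a sketch: the ages are made up, and birth_year and start are helper names introduced here for illustration.

```r
set.seed(1)
df <- data.frame(age = c(18, 33, 58, 63))  # made-up ages

# Birth year implied by the age, then 1 January of that year
birth_year <- as.integer(format(Sys.Date(), "%Y")) - df$age
start      <- as.Date(paste0(birth_year, "-01-01"))

# One independent offset per row; 0:364 always stays inside the birth year
df$dob <- start + sample(0:364, nrow(df), replace = TRUE)
print(df)
```

The offset range 0:364 never reaches 31 December in a leap year, which is a small bias but avoids spilling into the following year.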

fast way to create a date based lookup R

I have a pretty large data set with users and their membership start and end dates. For each membership period there is one entry.
I have another dataset, which is coming from the support system, and it has records of user id's along with the dates of each system usage. This dataset is even larger, as there is one record for each usage.
I need to aggregate the second and combine with the first one, based on each user and membership period.
I tried a function with a for loop, but for an extremely large dataset (here we are talking about a few million rows) this will take ages.
Edit: The join or merge will not work, because here there are several ranges (between start and end dates) for each ID in the first frame. Each range has been assigned a number. (Period of membership) The second data frame has dates and IDs and the problem is finding the membership period for each ID & date by comparing it to the date ranges in the first frame.
Here is the code, along with mock datasets and what I want to achieve at the end:
library(data.table)
library(dplyr)
ids <- c(rep("id1", 5), rep("id2", 5), rep("id3", 5))
#
stdates <- c("2015-08-01", "2016-08-01", "2017-08-01", "2018-08-01", "2019-08-01",
"2013-05-07", "2014-05-07", "2015-05-07", "2016-05-07", "2017-05-07",
"2011-02-13", "2013-02-13", "2015-02-13", "2016-02-13", "2017-02-13")
#
endates <- c("2016-07-31", "2017-07-31", "2018-07-31", "2019-07-31", "2020-07-31",
"2014-05-06", "2015-05-06", "2016-05-06", "2017-05-06", "2018-05-06",
"2013-02-12", "2015-02-12", "2016-02-12", "2017-02-12", "2018-02-12")
#
# First dataset:
df <- data.table(id = ids,
stdate = stdates,
endate = endates)
#
df <- df %>%
arrange(id, desc(endate))
#
# Add the membership period number for each user:
setDT(df)
df[, counter := rowid(id)]
#
# Second dataset:
ids2 <- sample(df$id, 1000, replace = TRUE)
dates2 <- sample(seq(Sys.Date() - 7*365, Sys.Date() - 365, 1), 1000)
#
df2 <- data.table(id = ids2,
dateticket = dates2)
#
# Function
counterFunc <- function(d2, d1) {
  d2$groupCounter <- NA
  for (i in 1:nrow(d2)) {
    crdate <- d2$dateticket[i]
    idtemp <- d2$id[i]
    dtemp <- d1 %>%
      filter(id == idtemp) %>%
      data.table()
    dtemp[, drcode := ifelse(crdate >= stdate & crdate <= endate, 1, 0)]
    if (length(unique(dtemp$drcode)) == 2) {
      dtempgc <- dtemp[drcode == 1]$counter
      d2$groupCounter[i] <- dtempgc
    }
    if (length(unique(dtemp$drcode)) != 2) {
      d2$groupCounter[i] <- 0
    }
    print(i)
  }
  return(d2)
}
#
# The result I want to get without a for loop:
df2gc <- counterFunc(df2, df)
#
The operation you want is called a "join". Depending on which rows you want to keep (left, right, inner or full join), there are several options.
Here is a simple example:
library(dplyr)
df1 <- data.frame(ID = c("1","2","3","1","2"), First_Name = c("A","B","C","D","E"))
df2 <- data.frame(ID = c("1","2","3"), Last_Name = c("Ko","Lo","To"))
left_join(df1, df2, by = "ID")
The result looks like this:
ID First_Name Last_Name
1  A          Ko
2  B          Lo
3  C          To
1  D          Ko
2  E          Lo
left_join from the dplyr package simply looked up the relevant values in the look-up table (df2) and added them to the original table (df1, the left table) based on a "key" (by = "ID" in this case).
There are other join operations that specify the matching terms in more detail, but left_join should be helpful in your case.
EDIT:
I have better understood your problem now. Please check if this solves it:
library(tidyverse)
df %>%
mutate(stdate = as.Date(stdate), endate = as.Date(endate)) %>%
left_join(df2, by = "id") %>%
mutate(check = case_when(dateticket >= stdate & dateticket <= endate ~ "TRUE", TRUE ~ "FALSE")) %>%
filter(check == "TRUE")
Edit:
For the "cannot allocate vector of size" error when joining, please refer to this question: "Left_join error cannot allocate vector of size".
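Since the question already uses data.table, a non-equi join may be worth trying here: it matches each ticket to the period whose date range contains it without first building every id-by-id combination. The sketch below is an assumption-laden miniature, not a drop-in solution: periods, tickets and matched are hypothetical names, the rows are made up, and it assumes the periods of one id do not overlap.

```r
library(data.table)

# Hypothetical miniature versions of the two tables from the question
periods <- data.table(
  id      = c("id1", "id1", "id2"),
  stdate  = as.Date(c("2015-08-01", "2016-08-01", "2013-05-07")),
  endate  = as.Date(c("2016-07-31", "2017-07-31", "2014-05-06")),
  counter = c(2L, 1L, 1L)
)
tickets <- data.table(
  id         = c("id1", "id1", "id2", "id2"),
  dateticket = as.Date(c("2015-09-15", "2016-12-01", "2013-06-01", "2020-01-01"))
)

# Non-equi join: for each ticket, find the period of the same id whose
# [stdate, endate] range contains the ticket date
matched <- periods[tickets,
                   on = .(id, stdate <= dateticket, endate >= dateticket),
                   .(id = i.id, dateticket = i.dateticket, groupCounter = x.counter)]

# Tickets outside every period come back as NA; the loop in the question used 0
matched[is.na(groupCounter), groupCounter := 0L]
print(matched)
```

The result has one row per ticket in the original ticket order, so it can be aggregated per id and groupCounter afterwards.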

sum up values of compressed time series over time

I will try to describe my problem via the code below. I have a 'compressed' time series in the data frame have. It contains the start and end date of each period together with a value that applies over that period. I want to expand the data as in the data frame want, to ultimately get to the data frame ultimately_want, which sums up the value per date. Maybe I do not need want and can get straight to ultimately_want somehow? Thanks.
library(dplyr)
start_date <- as.Date(c("2004-08-02", "2004-08-03"))
end_date <- as.Date(c("2004-08-04", "2004-08-05"))
value <- c(5, 6)
have <- data.frame(start_date, end_date, value)
have
date <- as.Date(c("2004-08-02", "2004-08-03", "2004-08-04", "2004-08-03", "2004-08-04", "2004-08-05"))
value <- c(5, 5, 5, 6, 6, 6)
want <- data.frame(date, value)
want
ultimately_want <- want %>%
group_by(date) %>%
summarise(total = sum(value))
ultimately_want
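Here is a data.table approach. Each row is expanded into one row per day between start_date and end_date, and the values are then summed by date: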
library(data.table)
setDT(have)[, .(value = value, date = seq(start_date, end_date, by = "day")),
by = 1:nrow(have)][,.(total = sum(value)), date][]
# date total
#1: 2004-08-02 5
#2: 2004-08-03 11
#3: 2004-08-04 11
#4: 2004-08-05 6
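The same expand-then-sum idea can be sketched in base R, in case data.table is not available. The name expanded is introduced here for illustration; the input is the have data frame from the question.

```r
start_date <- as.Date(c("2004-08-02", "2004-08-03"))
end_date   <- as.Date(c("2004-08-04", "2004-08-05"))
value      <- c(5, 6)
have <- data.frame(start_date, end_date, value)

# Expand each period into one row per day, carrying the period's value along
expanded <- do.call(rbind, lapply(seq_len(nrow(have)), function(i) {
  data.frame(date  = seq(have$start_date[i], have$end_date[i], by = "day"),
             value = have$value[i])
}))

# Sum the values that fall on the same date
ultimately_want <- aggregate(value ~ date, data = expanded, FUN = sum)
print(ultimately_want)
```

For many periods the row-by-row rbind will be slower than the data.table version, so this is mainly useful for small inputs or for checking results.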

How to check if an id comes into data on a particular date that it stays until an exit date

I have a data set that looks something like the one below. I want to check that if a particular id is present at the beginning of the year (in this case 1 Jan 2003), it is present every day until the end of the year (31 Dec 2003). The check should then start over at the beginning of the next year, since people might change from year to year but should not change within a year. If an id is not present on a certain day, I would like to know which day and which id.
I first started with a for loop and checked every two days but this is super inefficient since my data set spans roughly 50 years and will grow later on with new data.
dates <- rep(seq(as.Date("2003/01/01"), as.Date("2004/12/31"), "days"),each = 3)
id <- rep(1:3,times = length(unique(dates)))
df <- data.frame( dates = dates,id = id)
Edit: The chunk above has all the dates in it, but if I delete, for example, id = 1 on the second day, the code should tell me it is missing, so the count should no longer be the same. The line below deletes id = 1 on the second day.
df <- df[-4,]
The code below will make the same data set but delete id = 1 for jan 2, 2003 and jan 3, 2003. I am trying to get something that returns the id that is missing and the date.
dates <- rep(seq(as.Date("2003/01/01"), as.Date("2004/12/31"), "days"),each = 3)
id <- rep(1:3,times = length(unique(dates)))
df <- data.frame( dates = dates,id = id)
df <- df[-4,]
df <- df[-6,]
This code chunk counts the number of times a person appears in each year. If the count is 365 (or 366 in a leap year), the person was there every day of the year.
library(dplyr)
library(tidyr)
dates <- rep(seq(as.Date("2003/01/01"), as.Date("2004/12/31"), "days"),each = 3)
id <- rep(1:3,times = length(unique(dates)))
df <- data.frame( dates = dates,id = id)
dfx <- df %>%
mutate(yrs = lubridate::year(dates)) %>%
group_by(id, dates) %>%
filter(row_number()==1) %>%
group_by(id, yrs) %>%
tally
#remove values
dfa <- df[c(-4,-6),]
Then, in order to find the dates of the missing values, add an indicator column to the data set and fill in the missing dates by id. After this, the val column will be NA on the filled-in rows. Filter on those NAs to get the dates on which an id went missing.
dfx <- dfa %>%
mutate(val = 1) %>%
complete(nesting(id),
dates = seq(min(dates),max(dates),by = "day")) %>%
filter(is.na(val))
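The same idea can be sketched without tidyr by building every id and date combination and keeping the ones that do not occur in the data. This is a simplified illustration over a ten-day window rather than the per-year check described in the question; full and missing are names introduced here.

```r
# Ten days of data for ids 1 to 3, with id 1 removed on 2 Jan and 3 Jan
dates <- rep(seq(as.Date("2003-01-01"), as.Date("2003-01-10"), by = "day"), each = 3)
df <- data.frame(dates = dates, id = rep(1:3, times = 10))
df <- df[-c(4, 7), ]

# Every combination of date and id that should be present
full <- expand.grid(dates = unique(df$dates), id = unique(df$id))

# Keep the combinations that never appear in the actual data
key_full <- paste(full$id, full$dates)
key_have <- paste(df$id, df$dates)
missing  <- full[!key_full %in% key_have, ]
print(missing)
```

An id that is absent for the whole window would not show up in unique(df$id), so the list of expected ids would have to come from elsewhere in that case.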

Replace NA with Value with Previous Value

I have a column in a data frame that I created in R. After a certain month, the values become NA. I would like to replace each NA with the record from 12 months earlier. Is there a function in R for me to do this? Or do I have to write a loop?
So Jan-11 would then become 10, Feb-11 would become 11 and so forth.
EDIT:
I also tried:
for (i in 1:length(df$var)) {
df$var[i] <- ifelse(is.na(df$var[i]), df$var[i - 12],
df$var[i]) }
but the whole column ends up being NA.
Aha, from the last comment it sounds like you'd like a "chained" lag, where it uses the last value of that month that is available, however many years back you need to go.
Jan-11 will show the value 10, but when it comes to Jan-12, it shows
NA (when it should be 10).
Here's an approach that relies on first grouping by month, and then using tidyr::fill() to fill in from the last valid value for that month.
First, some fake data. (BTW it would be useful to include something like this in your question so that answerers don't have to retype your numbers or generate new ones.)
# Make fake data with 1 year values, 2 yrs NAs
library(lubridate)
set.seed(42);
data <- data.frame(
dates = seq.Date(from = ymd(20100101), to = ymd(20121201), by = "month"),
values = c(as.integer(rnorm(12, 10, 3)), rep(NA_integer_, 24))
)
# Group by months, fill within groups, ungroup.
library(tidyverse)
data_filled <- data %>%
group_by(month = month(dates)) %>%
fill(values) %>%
ungroup() %>%
arrange(dates)
I can't think of a way to do this without a loop, but this should give you what you need:
df <- data.frame(col1 = LETTERS[1:24],
col2 = c(rnorm(12), rep(NA, 12)))
for (i in seq_len(nrow(df))) {
  # only look back when a row 12 positions earlier exists
  if (i > 12 && is.na(df[i, 2])) {
    df[i, 2] <- df[i - 12, 2]
  }
}
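A loop-free variant in base R is possible if the rows are consecutive months in order, which is an assumption here. The sketch groups rows by their position within the 12-month cycle and carries the last available value forward inside each group; locf, month_pos and filled are helper names made up for this example.

```r
# Carry the last non-NA value forward within a vector
locf <- function(x) {
  idx <- cummax(seq_along(x) * !is.na(x))  # position of the latest non-NA so far
  idx[idx == 0] <- NA                      # nothing to carry yet
  x[idx]
}

# One year of values followed by two years of NAs
values <- c(1:12, rep(NA, 24))

# Rows 1, 13, 25 share a group, rows 2, 14, 26 the next, and so on
month_pos <- (seq_along(values) - 1) %% 12
filled <- ave(values, month_pos, FUN = locf)
print(filled)
```

Because the fill is chained within each group, a value from year one also reaches year three, which matches the "however many years back" behaviour described above.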
