I have a pretty large dataset with users and their membership start and end dates. There is one entry per membership period.
I have another dataset coming from the support system, which has records of user IDs along with the date of each system usage. This dataset is even larger, as there is one record per usage.
I need to aggregate the second dataset and combine it with the first one, based on each user and membership period.
I tried a function with a for loop, but for an extremely large dataset (here we are talking about a few million rows) this will take ages.
Edit: A plain join or merge will not work, because there are several ranges (between start and end dates) for each ID in the first frame. Each range has been assigned a number (the period of membership). The second data frame has dates and IDs, and the problem is finding the membership period for each ID and date by comparing it to the date ranges in the first frame.
Here is the code, along with mock datasets and what I want to achieve at the end:
library(data.table)
library(dplyr)
#
ids <- c(rep("id1", 5), rep("id2", 5), rep("id3", 5))
#
stdates <- c("2015-08-01", "2016-08-01", "2017-08-01", "2018-08-01", "2019-08-01",
             "2013-05-07", "2014-05-07", "2015-05-07", "2016-05-07", "2017-05-07",
             "2011-02-13", "2013-02-13", "2015-02-13", "2016-02-13", "2017-02-13")
#
endates <- c("2016-07-31", "2017-07-31", "2018-07-31", "2019-07-31", "2020-07-31",
             "2014-05-06", "2015-05-06", "2016-05-06", "2017-05-06", "2018-05-06",
             "2013-02-12", "2015-02-12", "2016-02-12", "2017-02-12", "2018-02-12")
#
# First dataset:
df <- data.table(id = ids,
                 stdate = stdates,
                 endate = endates)
#
df <- df %>%
  arrange(id, desc(endate))
#
# Add the membership period number for each user:
setDT(df)
df[, counter := rowid(id)]
#
# Second dataset:
ids2 <- sample(df$id, 1000, replace = TRUE)
dates2 <- sample(seq(Sys.Date() - 7*365, Sys.Date() - 365, 1), 1000)
#
df2 <- data.table(id = ids2,
                  dateticket = dates2)
#
# Function
counterFunc <- function(d2, d1) {
  d2$groupCounter <- NA
  for (i in 1:nrow(d2)) {
    crdate <- d2$dateticket[i]
    idtemp <- d2$id[i]
    dtemp <- d1 %>%
      filter(id == idtemp) %>%
      data.table()
    dtemp[, drcode := ifelse(crdate >= stdate & crdate <= endate, 1, 0)]
    if (length(unique(dtemp$drcode)) == 2) {
      dtempgc <- dtemp[drcode == 1]$counter
      d2$groupCounter[i] <- dtempgc
    }
    if (length(unique(dtemp$drcode)) != 2) {
      d2$groupCounter[i] <- 0
    }
    print(i)
  }
  return(d2)
}
#
# The result I want to get without a for loop:
df2gc <- counterFunc(df2, df)
#
The operation you want to do is called "joining"; depending on the direction and completeness of the join there are several options.
Here is a simple example:
library(dplyr)
df1 <- data.frame("ID" = c("1","2","3","1","2"), "First_Name" = c("A","B","C","D","E"))
df2 <- data.frame("ID" = c("1","2","3"), "Last_Name" = c("Ko","Lo","To"))
left_join(df1, df2, by = "ID")
The result looks like this:
  ID First_Name Last_Name
1  1          A        Ko
2  2          B        Lo
3  3          C        To
4  1          D        Ko
5  2          E        Lo
left_join from the dplyr package simply looked up the relevant values in the look-up table (df2) and added them to the original table (df1, the left table) based on a key (by = "ID" in this case).
There are other operations that constrain the join further, but left_join should be helpful in your case.
EDIT:
Now that I understand your problem better, please check whether this solves it:
library(tidyverse)
df %>%
  mutate(stdate = as.Date(stdate), endate = as.Date(endate)) %>%
  left_join(df2, by = "id") %>%
  mutate(check = case_when(dateticket >= stdate & dateticket <= endate ~ "TRUE",
                           TRUE ~ "FALSE")) %>%
  filter(check == "TRUE")
Edit:
For the "cannot allocate vector of size" error with the join, please refer to this question:
Left_join error cannot allocate vector of size
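If the full join runs out of memory, a data.table non-equi update join may sidestep the problem, since it looks up each ticket's membership period directly instead of materialising the cartesian product. A rough, untested sketch (note this is an alternative technique, not the left_join above):
library(data.table)
setDT(df)
setDT(df2)
df[, `:=`(stdate = as.Date(stdate), endate = as.Date(endate))]
# for each row of df, find matching df2 rows by id and date range and copy counter over
df2[df, groupCounter := i.counter, on = .(id, dateticket >= stdate, dateticket <= endate)]
df2[is.na(groupCounter), groupCounter := 0]  # tickets outside any membership period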
I have a situation where I think a loop would be appropriate to avoid repeating chunks of code.
I have two data frames which look like the following:
patid <- seq(1,10)
date_of_session <- sample(seq(as.Date("2010-01-01"), as.Date("2020-01-01"), by = "day"), 10)
date_of_referral <- sample(seq(as.Date("2010-01-01"), as.Date("2020-01-01"), by = "day"), 10)
df1 <- data.frame(patid, date_of_session, date_of_referral)
patid1 <- sample(seq(1,10), 50, replace = TRUE)
eventdate <- sample(seq(as.Date("2010-01-01"), as.Date("2020-01-01"), by = "day"), 50)
comorbidity <- sample(c("hypertension", "stroke", "AF"), 50, replace = TRUE)
df2 <- data.frame(patid1, eventdate, comorbidity)
I need to repeat the following code for each comorbidity in df2. It generates a binary (1/0) column for each comorbidity based on whether the earliest "eventdate" (diagnosis) came before "date_of_session" OR "date_of_referral" (if "date_of_session" is NA) for each patient.
library(tidyverse)
df2_comorb <- df2 %>%
  filter(comorbidity == "hypertension") %>%
  group_by(patid1) %>%
  filter(eventdate == min(eventdate)) %>%
  ungroup()
df1 <- left_join(df1, df2_comorb, by = c("patid" = "patid1"))
df1 <- df1 %>%
  mutate(hypertension_baseline = ifelse(eventdate < date_of_session | eventdate < date_of_referral, 1, 0)) %>%
  replace_na(list(hypertension_baseline = 0)) %>%
  select(-eventdate)
I'd like to avoid repeating the code for each of the 27 comorbid conditions in the full dataset. I figured a loop would be the best way to repeat this for each comorbidity, but I don't know how to approach writing one for this problem.
Any help would be appreciated.
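One possible approach is to loop over the comorbidity names and build each baseline column dynamically. A minimal, untested sketch that mirrors the single-comorbidity logic above (the helper names earliest and newcol are mine):
library(dplyr)
library(tidyr)
for (cm in unique(df2$comorbidity)) {
  # earliest diagnosis date per patient for this comorbidity
  earliest <- df2 %>%
    filter(comorbidity == cm) %>%
    group_by(patid1) %>%
    summarise(eventdate = min(eventdate), .groups = "drop")
  newcol <- paste0(cm, "_baseline")
  df1 <- df1 %>%
    left_join(earliest, by = c("patid" = "patid1")) %>%
    mutate(!!newcol := ifelse(eventdate < date_of_session |
                                eventdate < date_of_referral, 1, 0)) %>%
    replace_na(setNames(list(0), newcol)) %>%  # patients with no diagnosis get 0
    select(-eventdate)
}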
I am new to data tables in R and have managed to get 80% of the way through my analysis. The background is that I want to get the returns of a stock 5 days before and after, and then 20 and 45 days after, they report. I have successfully managed to do it for one set of dates (effectively hardcoding), but when I try to automate the process it falls apart.
I will start with my current formulas and then explain the data.
This formula successfully looks at the data tables and returns the sum that I need. The issue is that datem5 and V1 need to go through a loop (or mapply) to automate the process.
CQ_Date[CQ_DF[CQ_Date, sum(CQ), on = .(unit, date >= date1, date <= datem5),
              by = .EACHI], newvar := V1, on = .(unit, date1 = date)]
I tried this (along with many other variants). Please note the newvar needs to be addressed as well.
for (i in 1:4) {
  CQ_Date[CQ_DF[CQ_Date, sum(CQ), on = .(unit, date >= date1, date <= cols[,..i]),
                by = .EACHI], newvar := v, on = .(unit, date1 = date)]
}
but I get this error:
Error: argument specifying columns specify non existing column(s): cols[3]='cols[, ..i]'
Interestingly, when I try
for (i in 1:2) {
  y <- cols[,..i]
}
There is no issue.
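One workaround that might get around the error (an untested sketch reusing the hardcoded pattern above; cutoff and cols_v are my own hypothetical helper names) is to copy each cutoff column into a plain column first, so that on = sees a real column name rather than an expression, and to build the target column name dynamically:
cols_v <- c("datem5", "datep5", "datep20", "datep45")  # plain character vector
for (nm in cols_v) {
  CQ_Date[, cutoff := get(nm)]  # pull the current cutoff column out by name
  CQ_Date[CQ_DF[CQ_Date, sum(CQ), on = .(unit, date >= date1, date <= cutoff),
                by = .EACHI], (paste0("newvar_", nm)) := V1, on = .(unit, date1 = date)]
}
CQ_Date[, cutoff := NULL]  # drop the helper column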
Now, in terms of data:
cols just contains the column headings that I need from CQ_Date:
cols <- data.table("datem5", "datep5", "datep20" , "datep45")
CQ_Date has the reporting dates for the stock CQ, such as the following:
CQ_Date <- data.frame("date1" = anydate(c("2016-02-17", "2016-06-12", "2016-08-17")))
CQ_Date$datem5 <- CQ_Date$date1 - 5 # minus five days
CQ_Date$datep5 <- CQ_Date$date1 + 5 # plus five days
CQ_Date$datep20 <- CQ_Date$date1 + 20
CQ_Date$datep45 <- CQ_Date$date1 + 45
CQ_Date$unit <- 1 # I guess I need this for some sort of indexing
Then CQ_DF (it is the log returns for the stock) is formed by:
CQ_DF <- data.frame("unit" = rep(1, 300))
CQ_DF$CQ <- rnorm(300)
CQ_DF$date <- seq(as.Date("2015-12-25"), by = "day", length.out = 300)
CQ_DF$unit <- 1
Before setting them as data.tables:
setDT(CQ_DF)
setDT(CQ_Date)
Any help would be greatly appreciated. Note this uses:
library(data.table)
library(anytime)
A simplified version is:
CQ_Date <- data.frame("date1" = c(10, 20))
CQ_Date$datep5 <- CQ_Date$date1 + 5 # plus five days
CQ_Date$datep20 <- CQ_Date$date1 + 10 # plus ten days
CQ_Date$unit <- 1
CQ_DF <- data.frame("unit" = rep(1,100))
CQ_DF$CQ <- seq(1, by = 1, length.out = 100)
CQ_DF$date <- seq(1, by = 1, length.out = 100)
CQ_DF$unit <- 1
setDT(CQ_DF)
setDT(CQ_Date)
cols <- c("datep5", "datep20" )
tmp <- melt(CQ_Date, measure.vars = cols)
setDT(tmp)
tmp[CQ_DF[tmp, sum(CQ), on = .(unit, date >= date1, date <= value), by = .EACHI],
    newvar := V1, on = .(unit, date1 = date)]
The issue now is that the sum does not appear to work correctly. It may have something to do with the variable column created by melt.
Instead of using mapply or a for loop, try reshaping the dataset into long format with melt, creating a sequence between the numbers, performing the join and then calculating the sum.
library(data.table)
cols <- c("datep5", "datep20" )
tmp <- melt(CQ_Date, measure.vars = cols)
tmp <- tmp[, list(date = seq(date1, value)), .(unit, variable, date1, value)]
tmp <- merge(tmp, CQ_DF, by = c('unit', 'date'))
tmp[, .(newvar = sum(CQ)), .(unit, variable, date1)]
#   unit variable date1 newvar
#1:    1   datep5    10     75
#2:    1  datep20    10    165
#3:    1   datep5    20    135
#4:    1  datep20    20    275
If you need the data back in wide format you can use dcast.
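For example, a quick sketch (assuming the summary above is stored in res):
res <- tmp[, .(newvar = sum(CQ)), .(unit, variable, date1)]
dcast(res, unit + date1 ~ variable, value.var = "newvar")  # one column per cutoff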
The equivalent tidyverse option is:
library(tidyverse)
CQ_Date %>%
  pivot_longer(cols = all_of(cols)) %>%
  mutate(date = map2(date1, value, seq)) %>%
  unnest(date) %>%
  left_join(CQ_DF, by = c('unit', 'date')) %>%
  group_by(unit, name, date1) %>%
  summarise(newvar = sum(CQ))
I have data about ID and the corresponding amount over multiple years. Something like this:
ID <- c(rep("A", 5), rep("B", 7), rep("C", 3))
amount <- c(sample(1:10000, 15))
Date <- c("2016-01-22", "2016-07-25", "2016-09-22", "2017-10-22", "2017-01-02",
          "2016-08-22", "2016-09-22", "2016-10-22", "2017-08-22", "2017-09-22", "2017-10-22", "2018-08-22",
          "2016-10-22", "2017-10-25", "2018-10-22")
Now, I want to analyse every year of every ID. Specifically, I am interested in the amount. For one, I want to know the overall amount for every year. Then, I also want to know the overall amount for the first 11 months of every year, the first 10 months, the first 9 months and the first 8 months. For this purpose I have calculated the cumulative sum for every ID per year as follows:
myData <- cbind(ID, amount, Date)
myData <- as.data.table(myData)
# create cumsum per ID per year
myData$Date <- as.Date(myData$Date, format = "%Y-%m-%d")
myData <- myData[order(ID, Date)]
myData[, CumSum := cumsum(amount), by = .(ID, year(Date))]
How can I summarise the data.table such that I get columns amount9month, amount10month and amount11month for every ID in every year?
Between cumsum, by and dcast this is almost straightforward. The most difficult bit is dealing with months that have no data in them, so this solution isn't as brief as it could have been, but it does things the "data.table way" and avoids slow operations like looping through rows.
# Just sort the formatting out first
myData[, Date:=as.Date(Date)]
myData[, `:=`(amount = as.numeric(amount),
              year = year(Date),
              month = month(Date))]
bycols <- c('ID', 'year', 'month')
# Summarise all transactions for the same ID in the same month
summary <- myData[, .(amt = sum(amount)), by=bycols]
# Create a skeleton table with all possible combinations of ID, year and month, to fill in any gaps.
skeleton <- myData[, CJ(ID, year, month = 1:12, unique = TRUE)]
# Join the skeleton to the actual data, to recreate the data but with no gaps in
result.long <- summary[skeleton, on=bycols, allow.cartesian=TRUE]
result.long[, amt.cum:=cumsum(fcoalesce(amt, 0)), by=c('ID', 'year')]
# Cast the data into wide format to have one column per month
result.wide <- dcast(result.long, ID + year ~ paste0('amount',month,'month'), value.var='amt.cum')
NB. If you don't have fcoalesce, update your data.table package.
In which format do you want the result? You can get it easily in two different formats:
# Prepare the data
ID <- c(rep("A", 5), rep("B", 7), rep("C", 3))
amount <- c(sample(1:1, 15, replace = TRUE))
Date <- c("2016-01-22","2016-07-25", "2016-09-22", "2017-10-22", "2017-01-02", "2016-08-22", "2016-09-22", "2016-10-22", "2017-08-22", "2017-09-22", "2017-10-22", "2018-08-22", "2016-10-22","2017-10-25", "2018-10-22")
myData <- data.frame(ID, amount, Date)
# Add year column
myData$Date <- as.Date(myData$Date, format = "%Y-%m-%d")
myData$year <- format(myData$Date,"%Y")
Please note that I changed the amounts for testing purposes. Now two solutions.
# Format 1
by(myData$amount, list(myData$ID, myData$year), cumsum, simplify = TRUE)
# Format 2
aggregate(myData$amount, list(ID = myData$ID, Date = myData$year), cumsum)
However, you might want the result to be a new column in the data frame. You can do that as well:
# Format: New column
myData <- myData[order(myData$year, myData$ID), ]  # sort by year and ID
myData$cumsum <- rep(0, nrow(myData))
for (r in 1:nrow(myData)) {
  if (r > 1 && myData$year[r-1] == myData$year[r] && myData$ID[r-1] == myData$ID[r])
    myData$cumsum[r] <- myData$cumsum[r-1] + myData$amount[r]
  else
    myData$cumsum[r] <- myData$amount[r]
}
I do not know a smooth solution with base R. Maybe someone from the "dplyr faction" has a neat trick up their sleeve?
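For what it's worth, a possible dplyr version might look like this (a sketch, assuming the myData with the year column built above):
library(dplyr)
myData <- myData %>%
  arrange(year, ID, Date) %>%
  group_by(ID, year) %>%              # restart the running total per ID and year
  mutate(cumsum = cumsum(amount)) %>%
  ungroup()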
I have a dataframe where each entry relates to a job posting in the NHS specifying the week the job was posted, and what NHS Trust (and region) the job is in.
At the moment my dataframe looks something like this:
set.seed(1)
df1 <- data.frame(
  NHS_Trust = sample(1:30, 20, T),
  Week = sample(1:10, 20, T),
  Region = sample(1:15, 20, T))
And I would like to count the number of jobs for each week across each NHS Trust and assign that value to a new column 'jobs' so my dataframe looks like this:
set.seed(1)
df2 <- data.frame(
  NHS_Trust = rep(1:30, each = 10),
  Week = rep(seq(1, 10), 30),
  Region = rep(as.integer(runif(30, 1, 15)), 1, each = 10),
  Jobs = rpois(10*30, lambda = 2))
The dataframe will then be used to fit a Poisson longitudinal multilevel model of the number of jobs.
Using the data.table package you can group, count and assign to a new column in a single expression. The general syntax is dt[i, j, by]. The i part selects rows ("where"); it is empty here, so all rows are used in their original order. The j part says what to do: here, counting the number of occurrences using .N, which is then assigned to the new variable count with the assignment operator :=. The by part takes a list of variables, and the j operation is performed within each group.
library(data.table)
setDT(df1)
df1[, count := .N, by = .(NHS_Trust, Week, Region)]
A tidyverse approach would be
library(tidyverse)
df1 <- df1 %>%
group_by(NHS_Trust, Week, Region) %>%
count()
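Note that group_by() plus count() collapses the data to one row per group. If you instead want to keep every posting row and just add the count as a column (matching the data.table := version above), add_count() should be closer; a quick sketch:
df1 <- df1 %>%
  add_count(NHS_Trust, Week, Region, name = "Jobs")  # keeps all rows, adds the group size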
You can use count to count number of jobs across each Region, NHS_Trust and Week and use complete to fill in missing combinations.
library(dplyr)
df1 %>%
count(Region, NHS_Trust, Week, name = 'Jobs') %>%
tidyr::complete(Region, Week = 1:10, fill = list(Jobs = 0))
I guess I'm moving my comment to an answer:
df2 <- df1 %>% group_by(Region, NHS_Trust, Week) %>% count()
colnames(df2)[4] <- "Jobs"
df2$combo <- paste0(df2$Region, "_", df2$NHS_Trust, "_", df2$Week)
for (i in 1:length(unique(df2$Region))) {
  for (j in 1:length(unique(df2$NHS_Trust))) {
    for (k in 1:length(unique(df2$Week))) {
      curr_combo <- paste0(unique(df2$Region)[i], "_",
                           unique(df2$NHS_Trust)[j], "_",
                           unique(df2$Week)[k])
      if (!curr_combo %in% df2$combo) {
        curdat <- data.frame(unique(df2$Region)[i],
                             unique(df2$NHS_Trust)[j],
                             unique(df2$Week)[k],
                             0,
                             curr_combo,
                             stringsAsFactors = FALSE)
        #cat(curdat)
        names(curdat) <- names(df2)
        df2 <- rbind(as.data.frame(df2), curdat)
      }
    }
  }
}
tail(df2)
#      Region NHS_Trust Week Jobs  combo
# 4495     15         1    4    0 15_1_4
# 4496     15         1    5    0 15_1_5
# 4497     15         1    8    0 15_1_8
# 4498     15         1    3    0 15_1_3
# 4499     15         1    6    0 15_1_6
# 4500     15         1    9    0 15_1_9
The for loop here checks which Region-NHS_Trust-Week combinations are missing from df2 and appends them to df2 with a corresponding Jobs value of 0. The checking is done with the help of the new variable combo, which is just a concatenation of the values in the fields mentioned earlier, separated by underscores.
Edit: I am pretty sure the people here can come up with something more elegant than this.
My problem:
I have two data frames, one for industries and one for occupations. They are nested by state, and show employment.
I also have a concordance matrix, which shows the weights of each of the occupations in each industry.
I would like to create a new employment number in the Occupation data frame, using the Industry employments and the concordance matrix.
I have made dummy version of my problem - which I think is clear:
Update
I have solved the issue, but I would like to know if there is a more elegant solution. In reality my dimensions are 7 states * 200 industries * 350 occupations, so it becomes rather data hungry.
# create industry data frame
library(tidyverse)
set.seed(12345)
ind_df <- data.frame(State = c(rep("a", len = 6), rep("b", len = 6), rep("c", len = 6)),
                     industry = rep(c("Ind1","Ind2","Ind3","Ind4","Ind5","Ind6"), len = 18),
                     emp = rnorm(18, 20, 2))
# create occupation data frame
Occ_df <- data.frame(State = c(rep("a", len = 5), rep("b", len = 5), rep("c", len = 5)),
                     occupation = rep(c("Occ1","Occ2","Occ3","Occ4","Occ5"), len = 15),
                     emp = rnorm(15, 10, 1))
# create concordance matrix
Ind_Occ_Conc <- matrix(rnorm(6*5,1,0.5),6,5) %>% as.data.frame()
# name cols in the concordance matrix
colnames(Ind_Occ_Conc) <- unique(Occ_df$occupation)
rownames(Ind_Occ_Conc) <- unique(ind_df$industry)
# solution
Ind_combined <- cbind(Ind_Occ_Conc, ind_df)
Ind_combined <- Ind_combined %>%
group_by(State) %>%
mutate(Occ1 = emp*Occ1,
Occ2 = emp*Occ2,
Occ3 = emp*Occ3,
Occ4 = emp*Occ4,
Occ5 = emp*Occ5
)
Ind_combined <- Ind_combined %>%
gather(key = "occupation",
value = "emp2",
-State,
-industry,
-emp
)
Ind_combined <- Ind_combined %>%
group_by(State, occupation) %>%
summarise(emp2 = sum(emp2))
Occ_df <- left_join(Occ_df,Ind_combined)
My solution seems pretty inefficient, is there a better / faster way to do this?
Also, I am not quite sure how to get to this, but the expected outcome would be another column added to Occ_df called emp2, derived from the ind_df emp column and Ind_Occ_Conc. I have tried to step this out for Occupation 1; essentially Ind_Occ_Conc contains weights and the result is a weighted average, as sketched below.
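A sketch of that step for Occupation 1 in state "a" (Occ1_coeff and Ind are hypothetical helper names for the Occ1 weights and the state-a industry rows):
Occ1_coeff <- Ind_Occ_Conc$Occ1         # weights of each industry for Occ1
Ind <- ind_df %>% filter(State == "a")  # industry employment in state a
sum(Ind$emp * Occ1_coeff)               # weighted employment (emp2) for Occ1 in state a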
I'm not sure what you want to do with the sum(Ind$emp * Occ1_coeff) line, but maybe this is what you're looking for:
# Instead of doing the computation only for state a, get expected outcomes for all states (with dplyr):
Ind <- ind_df %>%
  group_by(State) %>%
  summarize(rez = sum(emp))
# Then do some computations on Ind, which is a N element vector (one for each state)
# ...
# And finally, join Ind and Occ_df using merge
Occ_df <- merge(x = Occ_df, y = Ind, by = "State", all = TRUE)
Final output would then have Ind values in a new column: one value for all a, one value for b and one value for c.
Hope it will help ;)