Choose semi-last observations based on date in data.table in R

I have a data.table with dates in it (stored as factor variables). I am computing lag values from these. How can I tell R to get the lag values only for the observations dated semi-last, i.e. the second most recent date? In this case that would be start == "01.01.2015".
example data:
library(data.table)

ID <- rep("A5", 15)
product <- rep(c("prod1", "prod2", "prod3", "prod55", "prod4", "prod9", "prod83"), 3)
start <- c(rep("01.01.2016", 3), rep("01.01.2015", 3), rep("01.01.2014", 3),
           rep("01.01.2013", 3), rep("01.01.2012", 3))
prodID <- c(3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 2, 1, 3, 1, 2)
mydata <- cbind(ID, product[1:15], start, prodID)
mydata <- as.data.table(mydata)
# nameCols (not defined in the original question) holds the names of the new lead columns, e.g.:
nameCols <- c("start_lead", "V2_lead")
mydata[, (nameCols) := shift(.SD, 3, fill = NA, "lead"), .SDcols = c("start", "V2"), by = "prodID"]
For now I have used this to get to my results:
mydata[start == "01.01.2015"]
The problem is that the semi-last date is not always the same date. I will be repeating this procedure many times and I want to avoid having to specify the date by hand. Any ideas?

Convert the start column to a date class, then sort the unique dates to pick the semi-last (second most recent) one.
library(data.table)
mydata[, start := as.IDate(start, '%d.%m.%Y')]
mydata[start == sort(unique(start), decreasing = TRUE)[2]]
# ID V2 start prodID
#1: A5 prod55 2015-01-01 3
#2: A5 prod4 2015-01-01 1
#3: A5 prod9 2015-01-01 2
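Since you will repeat this many times, you could wrap the idea in a small helper. A minimal sketch building on the code above; the function name nth_last and its n argument are just illustrative:
nth_last <- function(dt, n = 2) {
  # rows whose start is the n-th most recent date (assumes start has already been converted with as.IDate)
  dt[start == sort(unique(start), decreasing = TRUE)[n]]
}
nth_last(mydata)     # semi-last date
nth_last(mydata, 1)  # most recent date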

Related

Counting data.table entries between dates

I have a data table with a bunch of entries that have start and end dates, like so:
id  start       end
1   1958-01-03  1962-10-11
2   1961-02-23  2012-04-28
etc.
I want to count how many of these items were in operation, by month. So I tried to do this:
data.table(
  month = seq(as.Date('1950-01-01', '%Y-%m-%d'), as.Date('2021-09-01', '%Y-%m-%d'), 'months'),
  month_end = seq(as.Date('1950-02-01', '%Y-%m-%d'), as.Date('2021-10-01', '%Y-%m-%d'), 'months') - 1
) %>%
  .[, count := satcat[start >= month & month_end <= end, .N]] %>%
  .[]
What I get, however, is an error:
Warning message in `>.default`(start, month):
“longer object length is not a multiple of shorter object length”
Warning message in `<=.default`(month_end, end):
“longer object length is not a multiple of shorter object length”
and count is the same number for all rows. Why is this happening, and what is the correct way of doing this? I feel like there should be some apply solution, but I can't work it out.
The warnings arise because start/end (one value per entry) and month/month_end (one value per row of the new table) have different lengths, so R recycles the shorter vectors in the comparison instead of testing every entry against every month. This is a job for foverlaps:
library(data.table)
DT <- data.table(id = 1:2,
                 start = as.Date(c("1958-01-03", "1961-02-23")),
                 end = as.Date(c("1961-10-11", "2012-04-28")))
periods <- data.table(start = seq(as.Date('1950-01-01', '%Y-%m-%d'), as.Date('2021-09-01', '%Y-%m-%d'), 'months'),
                      end = seq(as.Date('1950-02-01', '%Y-%m-%d'), as.Date('2021-10-01', '%Y-%m-%d'), 'months') - 1)
setkey(DT, start, end)
setkey(periods, start, end)
res <- foverlaps(periods, DT, nomatch = NA)[, .(N = sum(!is.na(id))), by = .(i.start, i.end)]
plot(N ~ i.start, data = res, type = "s")
res[N == 2]
# i.start i.end N
#1: 1961-02-01 1961-02-28 2
#2: 1961-03-01 1961-03-31 2
#3: 1961-04-01 1961-04-30 2
#4: 1961-05-01 1961-05-31 2
#5: 1961-06-01 1961-06-30 2
#6: 1961-07-01 1961-07-31 2
#7: 1961-08-01 1961-08-31 2
#8: 1961-09-01 1961-09-30 2
#9: 1961-10-01 1961-10-31 2
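For completeness, the same counts can be written as a non-equi join instead of foverlaps. A sketch, assuming the DT and periods tables above and a data.table version with non-equi join support (1.9.8+):
# Overlap condition: DT start on or before the period end, DT end on or after the period start.
# Unmatched periods yield an NA id row, so count non-NA ids as in the answer above.
res2 <- DT[periods,
           on = .(start <= end, end >= start),
           .(N = sum(!is.na(id))),
           by = .EACHI]
# Note: the join columns in res2 keep DT's names but hold the period boundaries.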

Calculate type counts and add types separated by commas to a data.table

I have a dataframe like this,
ID <- c("ID001","ID001","ID001","ID002","ID002","ID002")
ToolID <- c("SWP","SWP","SWP","ISP","ISP","ISP")
Type <- c("A","B","C","D","E","A")
WHEN <- c("2017-08-15 12:44:11","2017-08-15 12:44:11","2017-08-14 19:07:11",
"2017-08-17 11:24:15","2017-08-17 11:24:15","2017-08-17 11:24:15")
df <- data.frame(ID,ToolID,Type,WHEN)
df$WHEN <- as.POSIXct(df$WHEN,format="%Y-%m-%d %H:%M:%S")
I am trying to put all the Types into one column separated by commas and to calculate the Type count per group of (ToolID, ID), keeping only the rows at max(WHEN), i.e. the most recent timestamp for the respective ID.
My desired output is
ID ToolID Type Type_count WHEN
ID001 SWP A,B 2 2017-08-15 12:44:11
ID002 ISP D,E,A 3 2017-08-17 11:24:15
I tried using data.table and did it this way
library(data.table)
setDT(df)[, WHEN := as.POSIXct(WHEN)]
df1 <- df[, max(WHEN), by = list(ID,ToolID)]
colnames(df1)[which(names(df1) == "V1")] <- "WHEN"
How do I get the types and types count added to df1 to get my desired output?
Could someone point me in the right direction?
We can create a row index of the rows where WHEN equals its group maximum, pass that index in i, and then group by ID and ToolID to get the summary:
i1 <- setDT(df)[, .I[WHEN == max(WHEN)], .(ID, ToolID)]$V1
df[i1, .(Type = toString(unique(Type)), Type_count = uniqueN(Type),
WHEN = WHEN[1]), .(ID, ToolID)]
# ID ToolID Type Type_count WHEN
#1: ID001 SWP A, B 2 2017-08-15 12:44:11
#2: ID002 ISP D, E, A 3 2017-08-17 11:24:15
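If you prefer dplyr, roughly the same logic can be expressed as below. A sketch, assuming df as created in the question and a recent dplyr:
library(dplyr)

df %>%
  group_by(ID, ToolID) %>%
  filter(WHEN == max(WHEN)) %>%               # keep only the most recent timestamp per group
  summarise(Type = toString(unique(Type)),
            Type_count = n_distinct(Type),
            WHEN = max(WHEN),
            .groups = "drop")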

Merge 2 dataframes with conditions on datetimes and get the counts for passfails

I have 2 dataframes like this
df1
ID <- c("ID001","ID001","ID002","ID003")
Type <- c("A","A","B","A")
Measurement <- c("Length","Breadth","Length","Length")
When <- c("2016-09-09 06:00:13", "2016-09-19 09:13:10", "2016-10-13 11:45:14", "2016-10-29 11:56:00")
df1 <- data.frame(ID,Type,Measurement,When)
df2
ID <- c("ID001","ID001","ID001","ID001","ID001",
        "ID002","ID002","ID002","ID002","ID002")
Type <- c("A","A","A","A","A",
          "B","B","B","B","B")
Measurement <- c("Length","Length","Length","Length","Length",
                 "Length","Length","Length","Length","Length")
Datetime <- c("2016-09-09 01:00:13", "2016-09-09 04:00:13", "2016-09-09 09:00:13", "2016-09-09 21:00:13", "2016-09-09 23:00:13",
              "2016-10-13 10:45:14", "2016-10-13 11:15:14", "2016-10-13 11:48:14", "2016-10-13 11:55:14", "2016-10-13 21:45:14")
PassFail <- c("Pass","Fail","Pass","Fail","Pass",
              "Fail","Fail","Pass","Pass","Pass")
df2 <- data.frame(ID,Type,Measurement,Datetime,PassFail)
I am trying to merge these 2 dataframes to get counts of passes and fails per measurement, but only for rows of df2 whose Datetime is greater than When in df1.
My desired output is
ID Type Measurement When PassCount FailCount
ID001 A Length 2016-09-09 06:00:13 2 1
ID002 B Length 2016-10-13 11:45:14 3 0
I tried using sqldf to get this
library(sqldf)
df3<-sqldf("SELECT L.*, r.Datetime, r.PASSFAIL
FROM df1 as L
LEFT JOIN df2 as r
ON L.ID=r.ID
AND L.Type=r.Type
AND L.Measurement=r.Measurement
WHERE r.Datetime > L.When
ORDER BY L.When")
I am being unsuccessful at obtaining the output. Could someone point me in the right direction? I also would like a fast merge solution since I want to apply it to a larger dataset.
With data.table, a non-equi join seems to work:
library(data.table)
setDT(df1)[, When := as.POSIXct(When)]
setDT(df2)[, Datetime := as.POSIXct(Datetime)]
df2[df1, on=.(ID, Datetime > When), if (.N > 0L) as.list(table(PassFail)), by=.EACHI]
# ID Datetime Fail Pass
# 1: ID001 2016-09-09 06:00:13 1 2
# 2: ID002 2016-10-13 11:45:14 0 3
If you want a row for each row of df1, remove the if clause.
To add the counts as columns to df1:
df1[, levels(df2$PassFail) :=
      df2[df1, on=.(ID, Datetime > When), as.list(table(PassFail)), by=.EACHI][, !c("ID","Datetime")]
]
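If you would rather stay with sqldf, the original query can be turned into an aggregation. A rough, untested sketch (column names as in the question; When is bracket-quoted because it collides with the SQL keyword WHEN):
library(sqldf)

sqldf("SELECT L.ID, L.Type, L.Measurement, L.[When],
              SUM(CASE WHEN r.PassFail = 'Pass' THEN 1 ELSE 0 END) AS PassCount,
              SUM(CASE WHEN r.PassFail = 'Fail' THEN 1 ELSE 0 END) AS FailCount
       FROM df1 AS L
       JOIN df2 AS r
         ON L.ID = r.ID
        AND L.Type = r.Type
        AND L.Measurement = r.Measurement
        AND r.Datetime > L.[When]
       GROUP BY L.ID, L.Type, L.Measurement, L.[When]")
The inner join drops the df1 rows with no later df2 measurement, which matches the two-row desired output.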

Setting a value in one dataframe by looking its value in another dataframe based on dates

I have 2 data frames in R. Data1 has 2 columns (id, date) and Data2 has 3 columns (id, date, level). I want to set the level column in Data1 based on the level and date columns in Data2.
Data1 = data.frame(id = c(1,1,1), dates = c("2014-06","2016-02","2016-05"))
id date
1 2014-06
1 2016-02
1 2016-05
Data2 = data.frame(id = c(1,1,1), dates = c("2015-07","2016-04","2016-07"), level=c(3,4,5))
id date level
1 2015-07 3
1 2016-04 4
1 2016-07 5
So resulting data frame should be:
id date level
1 2014-06 NULL
1 2016-02 3
1 2016-05 4
You can accomplish this with the rolling joins from the data.table package, after converting the dates columns to a date class (see the note at the end of this post):
library(data.table)
setDT(Data1, key = c('id','dates'))
setDT(Data2, key = c('id','dates'))
Data1[Data2, lev := level, roll = -Inf, rollends = c(TRUE,FALSE)][]
which gives:
> Data1
id dates lev
1: 1 2014-06-01 NA
2: 1 2016-02-01 3
3: 1 2016-05-01 4
Explanation:
Convert the dataframes to data.tables with setDT and set the key to the columns that are needed for the join.
Join and create a new variable in Data1 with lev := level. With roll = -Inf you roll backwards and with rollends = c(TRUE,FALSE) you only roll the first value backwards.
Setting the keys beforehand isn't necessary. You could also do:
setDT(Data1)
setDT(Data2)
Data1[Data2, on = c('id','dates'), lev := level, roll = -Inf, rollends = c(TRUE,FALSE)][]
Used data:
Data1 = data.frame(id = c(1,1,1), dates = c("2014-06","2016-02","2016-05"))
Data2 = data.frame(id = c(1,1,1), dates = c("2015-07","2016-04","2016-07"), level=c(3,4,5))
Data1$dates <- as.Date(paste0(Data1$dates,'-01'))
Data2$dates <- as.Date(paste0(Data2$dates,'-01'))
NOTE: I converted the dates columns to a date format by adding the first day of the month to each value. This is necessary in order to do the rolling join properly.
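As a cross-check, for this single-id example the same "latest level at or before each date" lookup can be done in base R with findInterval. A minimal sketch, assuming the date conversion shown under "Used data" has been applied and Data2 is sorted by date:
# For each Data1 date, take the level of the latest Data2 date not after it;
# index 0 means no earlier Data2 date exists and becomes NA.
idx <- findInterval(Data1$dates, Data2$dates)
Data1$lev <- Data2$level[replace(idx, idx == 0, NA)]
Data1
#   id      dates lev
# 1  1 2014-06-01  NA
# 2  1 2016-02-01   3
# 3  1 2016-05-01   4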

Combine datasets by date range and categorical variable

Suppose I have two datasets. One contains a list of promotions with start/end dates, and the other contains monthly sales data for each program.
promotions = data.frame(
  start.date = as.Date(c("2012-01-01", "2012-06-14", "2012-02-01", "2012-03-31", "2012-07-13")),
  end.date = as.Date(c("2014-04-05", "2014-11-13", "2014-02-25", "2014-08-02", "2014-09-30")),
  program = c("a", "a", "a", "b", "b"))
sales = data.frame(
  year.month.day = as.Date(c("2013-02-01", "2014-09-01", "2013-08-01", "2013-04-01", "2012-11-01")),
  program = c("a", "b", "a", "a", "b"),
  monthly.sales = c(200, 200, 200, 400, 200))
Note that sales$year.month.day is used to indicate year/month. Day is included so R can more simply treat the column as a vector of date objects, but it isn't relevant to the actual sales.
I need to determine the number of promotions that occurred per month for each program. Here's an example of a loop that produces the output I want:
sales$count = rep(0, nrow(sales))
sub = list()
for (i in 1:nrow(sales)) {
  sub[[i]] = promotions[which(promotions$program == sales$program[i]), ]
  if (nrow(sub[[i]]) > 1) {
    for (j in 1:nrow(sub[[i]])) {
      if (sales$year.month.day[i] %in% seq(from = as.Date(sub[[i]]$start.date[j]), to = as.Date(sub[[i]]$end.date[j]), by = "day")) {
        sales$count[i] = sales$count[i] + 1
      }
    }
  }
}
Example output:
sales = data.frame(
year.month.day = as.Date(c("2013-02-01", "2014-09-01", "2013-08-01", "2013-04-01", "2012-11-01")),
program = c("a", "b", "a", "a", "b"),
monthly.sales = c(200, 200, 200, 400, 200),
count = c(3, 1, 3, 3, 2)
)
However, since my actual datasets are very large, this loop crashes when I run it in R.
Is there a more efficient way to achieve the same result? Perhaps something with dplyr?
Using the newly implemented non-equi joins from the current development version of data.table:
require(data.table) # v1.9.7+
setDT(promotions) # convert to data.table by reference
setDT(sales)
ans = promotions[sales, .(monthly.sales, .N), by=.EACHI, allow.cartesian=TRUE,
                 on=.(program, start.date<=year.month.day, end.date>=year.month.day), nomatch=0L]
ans[, end.date := NULL]
setnames(ans, "start.date", "year.month.date")
# program year.month.date monthly.sales N
# 1: a 2013-02-01 200 3
# 2: b 2014-09-01 200 1
# 3: a 2013-08-01 200 3
# 4: a 2013-04-01 400 3
# 5: b 2012-11-01 200 2
See installation instructions for development version here.
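On a current CRAN data.table (non-equi joins shipped with 1.9.8), the count can also be attached straight to sales so it keeps its original shape. A sketch; the temporary helper column one is mine, not from the original answer:
# Count matching promotions per sales row via a non-equi join and attach the result.
# (Assumes promotions and sales have already been converted with setDT as above.)
promotions[, one := 1L]                         # helper column used only to detect matches
cnt <- promotions[sales,
                  on = .(program,
                         start.date <= year.month.day,
                         end.date   >= year.month.day),
                  .(count = sum(!is.na(one))),  # 0 when a sales row matches no promotion
                  by = .EACHI]
sales[, count := cnt$count]                     # cnt rows are in sales' row order
promotions[, one := NULL]
sales$count should then equal the desired c(3, 1, 3, 3, 2).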
You can do this with SQL.
library(sqldf)
sqldf("select s.ymd, p.program, s.monthlysales, count(*)
       from promotions p left outer join sales s on p.program = s.program
       where s.ymd between p.startdate and p.enddate and p.program = s.program
       group by s.ymd, s.program")
This first joins the two datasets where ymd in sales is between the start and end dates of the promotion and the program is the same in both, then groups by ymd and counts the instances. I have removed the periods from the variable names.
I am a fan of Hadley's packages:
library(dplyr)
library(lubridate)
Floor dates so they are in the same format as the sales dataframe:
df <- promotions %>%
mutate(start.date = floor_date(start.date, unit = "month"),
end.date = floor_date(end.date, unit = "month"))
Expand the date intervals:
df$output <- mapply(function(x,y) seq(x, y, by = "month"),
df$start.date,
df$end.date)
Expand the data frame based on the date ranges, group and count, and merge to sales on date and program:
df %>%
  tidyr::unnest(output) %>%
  group_by(output, program) %>%
  summarise(prom_num = n()) %>%
  merge(sales, .,
        by.x = c("year.month.day", "program"),
        by.y = c("output", "program"))
Output:
year.month.day program monthly.sales prom_num
1 2012-11-01 b 200 2
2 2013-02-01 a 200 3
3 2013-04-01 a 400 3
4 2013-08-01 a 200 3
5 2014-09-01 b 200 1
You could try ?data.table::foverlaps for that:
library(data.table)
setDT(sales)[, c("start.date", "end.date") := year.month.day] # Add overlap cols
setkey(sales, program, start.date, end.date) # Key for join
res <- foverlaps(setDT(promotions), sales)[, .N, by = year.month.day] # Count joins
sales[res, count := i.N, on = "year.month.day"] # Update `sales` with results
sales
# year.month.day program monthly.sales start.date end.date count
# 1: 2013-02-01 a 200 2013-02-01 2013-02-01 3
# 2: 2013-04-01 a 400 2013-04-01 2013-04-01 3
# 3: 2013-08-01 a 200 2013-08-01 2013-08-01 3
# 4: 2012-11-01 b 200 2012-11-01 2012-11-01 2
# 5: 2014-09-01 b 200 2014-09-01 2014-09-01 1
This basically creates interval columns in sales, joins on them plus program, counts the overlaps, and joins the counts back to sales. You can remove the extra columns with sales[, c("start.date", "end.date") := NULL] if they bother you. Search for foverlaps and data.table for more examples.
