Analyze the evolution of a population over time

Analyze the evolution of a population over time - r

Each day I have a new csv file with ids and some variables. The ids can be differents over the days. I would like to take the IDs of one day and follow how a variable evolves over the time.
My goal is to create area plot like this :
For example I take all the ids the 31 march, each day I make a join with thoses ids, and I make a count group by the var "Code". If there is missing ids (Ids here the 31 march but not day D) their code become "NA" to show how many IDs I "lose" over time. I hope i'm clear enough.
Here is how I calculate this king of plot : (my real datas are like li and not datas)
library(plyr)
library(dplyr)
datas <- data.frame(id1 = c("x", "y", "x", "y", "z", "x", "z"),
id2 = c("x2", "y2", "x2", "y2", "z2", "x2", "z2"),
code = c("code1", "code2", "code1", "code2", "code2", "code1", "code2"),
var = runif(7),
date = do.call(c, mapply(rep, seq(Sys.Date() - 2, Sys.Date(), by = 1), c(2, 3, 2))))
li <- split(datas, datas$date)
dateStart <- Sys.Date() - 2
dateEnd <- Sys.Date()
# A "filter" if I want to start with another date than the date min or end with another date than the max date
li <- li[as.Date(names(li)) >= dateStart & as.Date(names(li)) <= dateEnd]
dfCounts <- ldply(li, function(x)
left_join(li[[1]], x, by = c("id1", "id2")) %>%
group_by(code.y) %>%
count(code = code.y) %>%
mutate(freq = n / sum(n),
code = ifelse(is.na(code), "NA", code))),
.id = "date")
> dfCounts
date code n freq
1 2015-07-04 1 1 0.5
2 2015-07-04 2 1 0.5
3 2015-07-05 1 1 0.5
4 2015-07-05 2 1 0.5
5 2015-07-06 1 1 0.5
6 2015-07-06 NA 1 0.5
dfCounts %>%
ggplot(aes(date, freq)) +
geom_area(aes(fill = code), position = "stack")
# I have no idea why in this example, nothing is shown in the plot, but it works on my real datas
So it works, but if I want to observe a longer period, I have to join over many days (files) and it can be slow. Do you have any ideas to do the same things without joins, using the binded datas (the object datas and not li) with dplyr or data.table ?
In your opinion, which approach is better ?
Thanks !
(Sorry for the title I couldn't find better...)

Related

Filter on date in r dataframe

I am attempting to limit my dataframe to the days of each month between the 20th and the 25th . I got a big dataset with many dates ranging over many years. It looks something like this:
Event Date
Football 20.12.2016
Work 15.10.2019
Holiday 30.11.2018
Running 24.01.2020
I would then like to restrict my results to:
Event Date
Football 20.12.2016
Running 24.01.2020
Any tips on how to do this?

This is a solution using dplyr/lubridate although I have converted your Date column using as.Date
df <-
data.frame(
Event = c("Football", "Work", "Holiday", "Running"),
Date = c("20.12.2016", "15.10.2019", "30.11.2018", "24.01.2020")
)
df$Date <- as.Date(df$Date, format = "%d.%m.%Y")
df %>% filter(day(Date) >= 20 & day(Date) <= 25)
Output
1 Football 2016-12-20
2 Running 2020-01-24

Doing a literal string-match, keeping your Date column as strings (not real dates):
# base R
subset(quux, between(as.integer(sub("\\..*", "", Date)), 20, 25))
# Event Date
# 1 Football 20.12.2016
# 4 Running 24.01.2020
# dplyr
quux %>%
filter(between(as.integer(sub("\\..*", "", Date)), 20, 25))
between can be from dplyr::, data.table::, or we can easily craft our own with:
between <- function(x, y, z) y <= x & x <= z
... though the two package versions are more robust to NAs and other issues.
Data
quux <- structure(list(Event = c("Football", "Work", "Holiday", "Running"), Date = c("20.12.2016", "15.10.2019", "30.11.2018", "24.01.2020")), class = "data.frame", row.names = c(NA, -4L))

Calculating Highest In, First Out on trades

I am trying to use the Highest In, First Out accounting method on trades.
Highest In, First Out means that when you sell, you sell your most expensive shares first.
Here are my buys and sells (example borrowed from R calculate aggregate gains or loss using FIFO method - this is a similar, but different problem):
buy = data.frame(BuyTransactionID = c(1:10),
Ticker=c(rep('MSFT',4),rep('AMZN',3),rep('DOCU',3)),
Date=c(rep('01-01-2018',2),rep('01-14-2020',2),rep('01-01-2018',2),rep('01-14-2020',1),'01-01-2018','03-15-2020','04-06-2020'),
Price=c(100,102,102,107,2000,2010,2011,197,182,167),
Quantity=c(10,10,5,5,1,1,2,12,15,15))
sell = data.frame(SellTransactionID=c(1:7),
Ticker=c('MSFT','MSFT','AMZN','AMZN','DOCU','DOCU','DOCU'),
Date=c('01-07-2020','01-20-2020','01-01-2020','01-30-2020','01-15-2020','04-10-2020','04-20-2020'),
Price=c(97,110,2100,2050,210,205,225),
Quantity=c(15,12,1,3,10,5,3))
Here are the rules:
You sell the most expensive (highest price) shares first.
You cannot sell shares before you purchased them
You cannot sell the same shares multiple times
Example problem:
The first sale (SellTransactionID = 1) is 15 shares of MSFT on 01-07-2020. So, any purchase made before that date can be sold. Based on date, the eligible shares to be sold are those from BuyTransactionID 1 and 2. BuyTransactionID 2 is the highest price. Therefore, all 10 shares of BuyTransactionID 2 are sold and the remaining 5 shares come from BuyTransactionID 1.
Desired output:
'Date Sold' = the date sold (self-explanatory);
'Ticker' = the ticker sold;
'Proceeds' = the total dollar amount sold;
'Cost basis' = a weighted average of the shares sold.
Example solution:
This is the solution for SellTransactionID 1. A properly solution automates this and calculates for all SellTransactionIDs.
result <- data.frame(SellDate = '01-07-2020', Ticker = "MSFT", Proceeds = 1455, CostBasis = 101.33)
Cost Basis Example:
Cost basis is calculated as a weighted average. For the preceding example, Cost Basis is calculated as such: (Quantity1 * Price1 + Quanity2 * Price2 + .....)/sum of all Quantity(s)
So for example above: (10 * 102 + 5 * 100)/15

The answer by #DPH is excellent, but unfortunately not quite accurate enough. I will explain why.
Here is a new dataset where all the purchases precede the sales:
buy = data.frame(BuyTransactionID = c(1:10),
Ticker=c(rep('MSFT',10)),
Date=c(rep('01-01-2020',10)),
Price=c(100,102,102,107,105,111,109,112,115,106),
Quantity=c(10,10,5,5,1,1,2,12,15,15))
sell = data.frame(SellTransactionID=c(1:4),
Ticker=c('MSFT','MSFT','MSFT', 'MSFT'),
Date=c('01-07-2020','01-20-2020','01-21-2020',
'01-22-2020'),
Price=c(120,119,117, 121),
Quantity=c(7,12,1, 5))
If you apply the solution from #DPH, you will get this result:
Notice that the 'Remain_Price' does not change, nor does the 'Sales_Cost' for the last three transactions. This happens because the function determines how many shares remain after the first sale and what the average price of the remaining shares is. The shares purchased preceding the first sale can no longer be sold individually. They are now treated as a single entity with an average price and the remaining number of shares.
For example, a total of 76 shares were bought in this example. The first sale sells 7 shares. Now, 69 shares remain as seen in 'Remain_Qtd'. An average price is calculated for those remaining shares - that price is $106.5652. Now, the process considers all 69 shares to be priced at $106.5652 and the remaining sales reduce the quantity of 'Remain_Qtd', but does not change the 'Remain_Price'. The remaining shares can no longer be considered at the price that they were bought at, they are collectively part of the remaining shares and the average remaining price.
This occurs because of the object dfo and the recycling of dfo in the object sdf. In particular, this line calculates an average remaining price that is then recycled through dfo and sdf.
Price = (sum(ip * iq) - v) / sum(sdf$Quantity)
and Quantity = sum(sdf$Quantity) adds together all the remaining shares.
I think the answer by #DPH is brilliant, but hope that it can be modified to treat each purchase individually rather than aggregating past purchases.

If I understood your problem correctly this is one possible solution. In resume I am combinig the sales and buys data and group it in sales blocks (given by the sales ID). This assumes that the order of sales IDs is according to the date column. I then loop over these sales blocks sequentially and write the intermediate result to a individual dataframe. For each sales block processing this result dataframe is filtered for the last sales block result of the same ticker. This means sales quantity must not be larger than available quantity according to the timeline (since you can not sell what you not have this should not be of concern anyhow I have to point it out as a possible limitation)
The proposed loop solution 1 is not the best way to work data in R since it is a loop, which grows a data.frame. Since you listed the purrr tag I adapted the code for the second part of the answer to work with the map() function.
Before we get to the actual coding lets prepare the data first (need for both parts of the answer the same way):
library(tidyverse) # need dplyr, using lubridate for date parsing and tidyr to fill data.frame
# include identifier variable, manipulate column name and union sells and buy
df <- dplyr::mutate(buy, io = "i") %>%
dplyr::rename(TID = BuyTransactionID) %>%
dplyr::union(dplyr::mutate(sell, io = "o") %>%
dplyr::rename(TID = SellTransactionID)) %>%
# sort the data
dplyr::arrange(Ticker, Date) %>%
# make Qtd negative for sales, remove buy IDs, parce Date and set up helper column
dplyr::mutate(Quantity = ifelse(io == "o", -1 * Quantity, Quantity),
TID = ifelse(io == "i", NA, TID),
Date = lubridate::mdy(Date),
hprice = NA_real_) %>%
# group data to fill backwards per group
dplyr::group_by(Ticker) %>%
tidyr::fill(TID, .direction = "up") %>%
# ungroup to prevent unwanted behaviour downstream
dplyr::ungroup()
1 Standard loop
dfo <- df[0, ] # empty copy of df
for (i in sort(unique(df$TID))) {
# which ticker are we working with at this sale
t <- unique(df[df$TID == i, ]$Ticker)
# bind data from last sale of this ticker to current sale
sdf <- rbind(dplyr::slice_max(dfo[dfo$Ticker == t, ], "Date"),
df[df$TID == i, ])
# current sales quantiy as positive value
o1 <- abs(sdf[sdf$io == "o", ]$Quantity)
# copy to use for greedy algo
o2 <- o1
# vectors of price and qtd of bought shares at this sale, having price in decreasing order
ip <- sdf[sdf$io == "i", ] %>% dplyr::arrange(desc(Price)) %>% dplyr::pull(Price)
iq <- sdf[sdf$io == "i", ] %>% dplyr::arrange(desc(Price)) %>% dplyr::pull(Quantity)
# total value of shares in greedy
v <- 0
# loop to run over bought prices and quantities do reduce from sold qtd per sales block
# check if shares block is larger then remaining sales qtd to break loop
for(l in 1:length(ip)){
if(o2 < iq[l]){
v <- v + ip[l] * o2
break
}else{
o2 <- o2 - iq[l]
v <- v + ip[l] * iq[l]
}
}
# fill sales block frame and bind to output df
dfo <- rbind(dfo,
data.frame(TID = i,
Ticker = t,
Date = max(sdf$Date),
Price = (sum(ip * iq) - v) / sum(sdf$Quantity),
Quantity = sum(sdf$Quantity),
io = "i",
hprice = v / o1))
}
# format, join original data and calculate result per Sales block
dplyr::select(dfo, Ticker, Date, Remain_Price = Price, Remain_Qtd = Quantity, Sales_Cost = hprice) %>%
dplyr::left_join(dplyr::select(df, Ticker, Date, Sales_Price = Price, Sales_Qtd = Quantity), by = c("Ticker", "Date")) %>%
dplyr::mutate(Result = abs(Sales_Qtd) * (Sales_Price - Sales_Cost))
Ticker Date Remain_Price Remain_Qtd Sales_Cost Sales_Price Sales_Qtd Result
1 MSFT 2020-01-07 100.0000 5 101.3333 97 -15 -65
2 MSFT 2020-01-20 100.0000 3 103.7500 110 -12 75
3 AMZN 2020-01-01 2000.0000 1 2010.0000 2100 -1 90
4 AMZN 2020-01-30 NaN 0 2007.3333 2050 -3 128
5 DOCU 2020-01-15 197.0000 2 197.0000 210 -10 130
6 DOCU 2020-04-10 173.6667 27 188.0000 205 -5 85
7 DOCU 2020-04-20 0.0000 -1 131.3333 225 -3 281
2 loop rephrase as purrr solution (be aware of the global assignment operartor (<<- instead of <-) for assignment of dfo at end of function)
# rephrase loop as function
myfun <- function(i){
# which ticker are we working with at this sale
t <- unique(df[df$TID == i, ]$Ticker)
# bind data from last sale of this ticker to current sale
sdf <- rbind(dplyr::slice_max(dfo[dfo$Ticker == t, ], "Date"),
df[df$TID == i, ])
# current sales quantiy as positive value
o1 <- abs(sdf[sdf$io == "o", ]$Quantity)
# copy to use for greedy algo
o2 <- o1
# vectors of price and qtd of bought shares at this sale, having price in decreasing order
ip <- sdf[sdf$io == "i", ] %>% dplyr::arrange(desc(Price)) %>% dplyr::pull(Price)
iq <- sdf[sdf$io == "i", ] %>% dplyr::arrange(desc(Price)) %>% dplyr::pull(Quantity)
# total value of shares in greedy
v <- 0
# loop to run over bought prices and quantities do reduce from sold qtd per sales block
# check if shares block is larger then remaining sales qtd to break loop
for(l in 1:length(ip)){
if(o2 < iq[l]){
v <- v + ip[l] * o2
break
}else{
o2 <- o2 - iq[l]
v <- v + ip[l] * iq[l]
}
}
# fill sales block frame and bind to output df
dfo <<- rbind(dfo,
data.frame(TID = i,
Ticker = t,
Date = max(sdf$Date),
Price = (sum(ip * iq) - v) / sum(sdf$Quantity),
Quantity = sum(sdf$Quantity),
io = "i",
hprice = v / o1))
}
# empty copy of df
dfo <- df[0, ]
purrr::map(sort(unique(df$TID)), ~ myfun(.x))[[max(df$TID)]] %>%
dplyr::select(Ticker, Date, Remain_Price = Price, Remain_Qtd = Quantity, Sales_Cost = hprice) %>%
dplyr::left_join(dplyr::select(df, Ticker, Date, Sales_Price = Price, Sales_Qtd = Quantity), by = c("Ticker", "Date")) %>%
dplyr::mutate(Result = abs(Sales_Qtd) * (Sales_Price - Sales_Cost))
Ticker Date Remain_Price Remain_Qtd Sales_Cost Sales_Price Sales_Qtd Result
1 MSFT 2020-01-07 100.0000 5 101.3333 97 -15 -65
2 MSFT 2020-01-20 100.0000 3 103.7500 110 -12 75
3 AMZN 2020-01-01 2000.0000 1 2010.0000 2100 -1 90
4 AMZN 2020-01-30 NaN 0 2007.3333 2050 -3 128
5 DOCU 2020-01-15 197.0000 2 197.0000 210 -10 130
6 DOCU 2020-04-10 173.6667 27 188.0000 205 -5 85
7 DOCU 2020-04-20 0.0000 -1 131.3333 225 -3 281
EDIT
To keep track of the remaining stocks we need a second df to hold the current portfolio data. I did not optimize the code and editted only the loop, the purrr adaption should be pretty straight foreward though.
library(tidyverse) # need dplyr, using lubridate for date parsing and tidyr to fill data.frame
# include identifier variable, manipulate column name and union sells and buy
df <- dplyr::mutate(buy, io = "i") %>%
dplyr::rename(TID = BuyTransactionID) %>%
dplyr::union(dplyr::mutate(sell, io = "o") %>%
dplyr::rename(TID = SellTransactionID)) %>%
# sort the data
dplyr::arrange(Ticker, Date) %>%
# make Qtd negative for sales, remove buy IDs, parce Date and set up helper column
dplyr::mutate(Quantity = ifelse(io == "o", -1 * Quantity, Quantity),
TID = ifelse(io == "i", NA, TID),
Date = lubridate::mdy(Date),
hprice = NA_real_) %>%
# group data to fill backwards per group
dplyr::group_by(Ticker) %>%
tidyr::fill(TID, .direction = "up") %>%
# ungroup to prevent unwanted behaviour downstream
dplyr::ungroup()
dfo <- df[0, ] # empty copy of df
dfh <- df[0, c("Ticker", "Price", "Quantity", "io")] # to hold current stock aka portfolio
for (i in sort(unique(df$TID))) {
# which ticker are we working with at this sale
t <- unique(df[df$TID == i, ]$Ticker)
# bind data from last sale of this ticker to current sale
sdf <- rbind(dplyr::slice_max(dfo[dfo$Ticker == t, ], "Date"),
df[df$TID == i, ])
# bind data from current portfolio to buys between last and current sale (new port folio before sale)
sdfh <- rbind(dfh[dfh$Ticker == t, ],
df[df$TID == i & df$io == "i", c("Ticker", "Price", "Quantity", "io")])
# current sales quantiy as positive value
o1 <- abs(sdf[sdf$io == "o", ]$Quantity)
# copy to use for greedy algo
o2 <- o1
# vectors of price and qtd of bought shares at this sale, having price in decreasing order
ip <- sdfh %>% dplyr::arrange(desc(Price)) %>% dplyr::pull(Price)
iq <- sdfh %>% dplyr::arrange(desc(Price)) %>% dplyr::pull(Quantity)
ips <- ip
iqs <- iq
# total value of shares in greedy
v <- 0
# loop to run over bought prices and quantities do reduce from sold qtd per sales block
# check if shares block is larger then remaining sales qtd to break loop
for(l in 1:length(ip)){
if(o2 < iq[l]){
v <- v + ip[l] * o2
iqs[1] <- iqs[1] - o2
break
}else{
o2 <- o2 - iq[l]
v <- v + ip[l] * iq[l]
ips <- ips[-1]
iqs <- iqs[-1]
}
}
dfh <- rbind(dfh[dfh$Ticker != t, ],
data.frame(Ticker = t,
Price = ips,
Quantity = iqs,
io = "i"))
# fill sales block frame and bind to output df
dfo <- rbind(dfo,
data.frame(TID = i,
Ticker = t,
Date = max(sdf$Date),
Price = sum(ips * iqs) / sum(iqs),
Quantity = sum(iqs),
io = "i",
hprice = v/o1))
}
dfo
TID Ticker Date Price Quantity io hprice
1 1 MSFT 2020-01-07 106.5652 69 i 115
2 2 MSFT 2020-01-20 105.0000 57 i 114
3 3 MSFT 2020-01-21 104.8750 56 i 112
4 4 MSFT 2020-01-22 104.1765 51 i 112

Here is the final working solution that I have come to with the help of #DPH. I have made a couple of changes to #DPH's edited solution.
The edited solution does not work when all the shares of a stock are sold for multiple reasons including the dfh object. The updates solution does work with the modified dateset that I provided but not the original dateset. I have modified the answer so that it works when all shares are sold.
I have modified the result to include the dates of the purchases. This will be important for determining whether the sale is long term or short term capital gains.
I have removed tickers from stocks that have been purchased but not sold, as those will break the script
I have applied purrr to the updated solution to avoid looping.
I have changed the base subsetting (i.e., df[]) to dplyr subsetting (i.e., df %>% filter(). For some reason the base subsetting was resulting in rows with NA values in my actual dataset even though it did not cause that in the sample dataset. The NA rows caused the solution not to work.
data frame prep:
df <- buy %>% filter(Ticker %in% unique(sell$Ticker)) %>% dplyr::mutate(io = "i") %>%
dplyr::rename(TID = BuyTransactionID) %>%
dplyr::union(dplyr::mutate(sell, io = "o") %>%
dplyr::rename(TID = SellTransactionID)) %>%
# sort the data
dplyr::arrange(Ticker, Date) %>%
# make Qtd negative for sales, remove buy IDs, parce Date and set up helper column
dplyr::mutate(Quantity = ifelse(io == "o", -1 * Quantity, Quantity),
TID = ifelse(io == "i", NA, TID),
Date = lubridate::mdy(Date),
hprice = NA_real_) %>%
dplyr::arrange(Ticker, Date) %>%
# group data to fill backwards per group
dplyr::group_by(Ticker) %>%
tidyr::fill(TID, .direction = "up") %>%
# ungroup to prevent unwanted behaviour downstream
dplyr::ungroup()
df$Dates_bought <- NA
function and purrr:
# rephrase loop as function
myfun <- function(i){
# which ticker are we working with at this sale
t <- unique(df[df$TID == i, ]$Ticker)
# bind data from last sale of this ticker to current sale
sdf <- rbind(dplyr::slice_max(dfo[dfo$Ticker == t, ], "Date"),
df %>% filter(TID == i))
sdfh <- rbind(dfh %>% filter(Ticker == t),
df %>% filter(TID == i & io == "i") %>% select(c("Ticker", "Date", "Price", "Quantity", "io")))
# current sales quantiy as positive value
o1 <- abs(sdf %>% filter(io == "o") %>% pull(Quantity))
# copy to use for greedy algo
o2 <- o1
# vectors of price and qtd of bought shares at this sale, having price in decreasing order
ip <- sdfh %>% dplyr::arrange(desc(Price)) %>% dplyr::pull(Price)
iq <- sdfh %>% dplyr::arrange(desc(Price)) %>% dplyr::pull(Quantity)
date <- sdfh %>% dplyr::arrange(desc(Price)) %>% dplyr::pull(Date)
ips <- ip
iqs <- iq
dates <- date
# total value of shares in greedy
v <- 0
# loop to run over bought prices and quantities do reduce from sold qtd per sales block
# check if shares block is larger then remaining sales qtd to break loop. Modifications
# to make dates work properly.
for(l in 1:length(ip)){
if(o2 < iq[l]){
v <- v + ip[l] * o2
iqs[1] <- iqs[1] - o2
dates2 <- if(o2 == 0) dates else dates[-1]
break
}else{
o2 <- o2 - iq[l]
v <- v + ip[l] * iq[l]
ips <- ips[-1]
iqs <- iqs[-1]
dates <- dates[-1]
dates2 <- dates
}
}
# Needs to have the if else statements because when all shares are sold, the length
# of dates, ips, and iqs is 0, whereas Ticker and io are length 1.
dfh <<- rbind(dfh[dfh$Ticker != t, ],
data.frame(Ticker = if(length(ips) == 0) numeric(length = 0L) else t,
Date = dates,
Price = ips,
Quantity = iqs,
io = if(length(ips) == 0) numeric(length = 0L) else "i"))
# fill sales block frame and bind to output df
dfo <<- rbind(dfo,
data.frame(TID = i,
Ticker = t,
Date = max(sdf$Date),
Dates_bought = paste(date[seq(length(date)-length(dates2))], collapse = ","),
Price = sum(ips * iqs) / sum(iqs),
Quantity = sum(iqs),
io = "i",
hprice = v / o1))
}
# empty copy of df
dfo <- df[0, ] # empty copy of df
dfh <- df[0, c("Ticker", "Date", "Price", "Quantity", "io")] # to hold current stock aka portfolio
hifo <- purrr::map(sort(unique(df$TID)), ~ myfun(.x))[[length(unique(df$TID))]] %>%
dplyr::select(Ticker, Date, Dates_bought, Remain_Price = Price, Remain_Qtd = Quantity, Sales_Cost = hprice) %>%
dplyr::left_join(dplyr::select(df, Ticker, Date, Sales_Price = Price, Sales_Qtd = Quantity), by = c("Ticker", "Date")) %>%
dplyr::mutate(Result = abs(Sales_Qtd) * (Sales_Price - Sales_Cost))
Let me know if anyone has any issues. I'd like to get this into a shinyapp and maybe develop it more. Let me know if you're interested in collaborating.

R Loop between two datatables

Since it is hard for me to solve another puzzle posted on this site due to computation limits, I am trying something new as a substitute, which should work. I have two large datasets, one with monthly firm data, one with monthly bank data :
Data_firm <- data.frame(
Firm = c("A", "A", "B"),
time = c("1", "3", "2"),
postal= c("20", "20", "67")
)
Data_bank <- data.frame(
Bank = c("AB", "AC","BD"),
Postal_bank = c("20", "67","20"),
entry = c("1","1","2"),
exit = c("6","5","7"))
I would need to add a column with the number of banks present in the same department as the firm at each time, accounting for the fact that some banks enter and some do exit (entry, exit variables in "Databank"). In theory, I should have the fllowing column in the example above :
nbbankindepartment = c(1,2,1)
I have tried the following
for (row in 1:nrow(Data_firm){
dep <- Data_firm[row, "postal"]
time <- Data_firm[row, "time"]
count <- sum(Data_bank$Postal_bank == dep & Data_bank$entry <= time & Data_bank$exit > time, na.rm = TRUE)
Data_firm[row,"nbbankindepartment"]<-count
}
But I only get 0s as a result. Does anybody have a solution ? What I am not doing right ?
Thanks in advance,

Here is my best guess for your problem:
library(dplyr)
Data_firm %>%
left_join(Data_bank, by = c("postal" = "Postal_bank")) %>%
filter(entry <= time, exit > time) %>%
group_by(Firm, time, postal) %>%
summarise(nb_bank_in_department = n(), .groups = "drop")
This returns
# A tibble: 3 x 4
Firm time postal nb_bank_in_department
<chr> <chr> <chr> <int>
1 A 1 20 1
2 A 3 20 2
3 B 2 67 1

How can I filter on a past date given another date in a specific row of data.frame?

I have a question related to filtering on dates in R. I found e.g. this link
dplyr filter on Date, which answers the question how to filter with help of dplyr in a specific date range. I would like to select a dynamic range, e.g. calculate the number of critical Jobs in a specific window e.g. the last seven days starting from the current date in the dataset. The code I have in mind would look something like this:
my.data %>%
group_by(category) %>%
filter(date > date - days(7) & date <= date) %>%
mutate(ncrit = sum(critical == 'yes'))
This is not working properly. Is there a way to get this running with dplyr?
Edit:
Apologies for the unclear post. To complete the post first the idea: imagine computers running jobs. If a computer fails to compute jobs the past x days it is more likely that it also fails in calculating the current job. A dummy dataset includes the computer categories (e.g. A/B), the date, and failure (yes/no)
Using the dataset from Rui Barradas, I would like to add with dplyr the following column 'number of critical Jobs in past 3 days" (in this case x = 3):
head(my.data, 7)
category date critical number of critical jobs in past 3 days
1 A 2018-08-14 yes NA
2 A 2018-08-15 no NA
3 A 2018-08-16 yes NA
4 A 2018-08-17 no 2
5 A 2018-08-18 yes 1
6 A 2018-08-19 no 2
7 A 2018-08-20 yes 1
Data (Rui Barradas):
set.seed(3635)
my.data <- data.frame(category = rep(c('A', 'B'), each = 10), #
date = rep(seq(Sys.Date() - 9, Sys.Date(), by = 'days')),
critical = sample(c('no', 'yes'), 20, TRUE))

Without an example dataset it's not very easy to say, but given your description of the problem I believe the following is on the right track.
The code uses function rollapplyr from package zoo, inspired not by the accepted but by the second answer to this question.
library(zoo)
library(dplyr)
sumCrit <- function(DF, crit = "yes", window = 3){
DF %>%
group_by(category) %>%
mutate(ncrit = rollapplyr(critical == crit, list(-seq(3)), sum, fill = NA))
}
result <- sumCrit(my.data)
head(result, 7)
## A tibble: 7 x 4
## Groups: category [1]
# category date critical ncrit
# <fct> <date> <fct> <int>
#1 A 2018-08-14 yes NA
#2 A 2018-08-15 no NA
#3 A 2018-08-16 yes NA
#4 A 2018-08-17 no 2
#5 A 2018-08-18 yes 1
#6 A 2018-08-19 no 2
#7 A 2018-08-20 yes 1
Data.
This is a made up dataset meant to test the code above.
set.seed(3635) # Make the results reproducible
my.data <- data.frame(category = rep(c("A", "B"), each = 10),
date = rep(seq(Sys.Date() - 9, Sys.Date(), by = "days"), 2),
critical = sample(c("no", "yes"), 20, TRUE))

Data generation
DATE1 <- as.Date("2018-08-23")
DATE2 <- as.Date("2018-07-23")
# creating a data range with the start and end date:
dates <- seq(DATE2, DATE1, by="days")
dt<-data.frame(category=sample(1:6,32,replace = T),deadline=dates)
Filter the dates
library("tidyverse")
dt %>%
group_by(category) %>%
filter(deadline %in% seq(Sys.Date()-7,Sys.Date() , by="days") )

Using the dataset that Rui Barradas created, providing a lubridate formulation, using intervals
set.seed(3635) # Make the results reproducible
my.data <- data.frame(category = rep(c("A", "B"), each = 10),
date = rep(seq(Sys.Date() - 9, Sys.Date(), by = "days"), 2),
critical = sample(c("no", "yes"), 20, TRUE))
library(lubridate) #use lubridate to create intervals
INT_check<-interval(Sys.Date()-7,Sys.Date()) # 7 days from today
my.data %>%
filter(date %within% INT_check ) %>%
group_by(category)%>%
summarise(ncrit = sum(critical == 'yes'))
you can also specify INT_Check as
INT_check<-interval("2018-08-16","2018-08-18") # if you want to use absolute dates
INT_check<-interval("2018-08-16",Sys.Date()) # if you want to specify just absolute start date

Add time strata variable and change format in r

I have tried to get an answer to this with no luck. Hopefully someone out there can assist me. I have a data set of patients.
PatientID <- c('1', "1", "1","1", "2","2","2","2","3","3","3","3")
admission.duration.minutes <- c(0,0.5,1.2,2,0,2.5,3.6,8,0,4,22,24)
has.fever <- c(1,1,NA,0,1,NA,1,1,NA,0,1,NA)
on.ventilator<-c(1,0,1,1,0,1,0,1,NA,1,0,NA)
high.bloodpressure<-c(1,0,1,0,1,0,1,1,1,1,NA,1)
df <- data.frame(PatientID, admission.duration.minutes, has.fever,on.ventilator,high.bloodpressure)
I want to change the dataset so I have one line per patient and I want to calculate how many patients had fever in hour 1, on ventilator in hour 1, high blood pressure in hour 1, combinations of fever and ventilator and blood pressure in hour 1. The same for hour 2, 3, etc.
So I believe I first need to add a time strata variable that defines hour 1, 2, 3 etc. So Hour 1 = 0.0 - 1.0 and Hour 2 is >1.0 to 2.0. And then do a conditional count or something like that.
I have tried with the publish package, but cannot get the output right.
The output from the new data frame should look something like this:
PatientID hour1.fev hour1.vent hour1.BP hour1.fev&vent hour1.fev&BP
1 1 1 1 1 1
hour1.vent&BP hour2.fev hour2.vent hour2.BP hour2.fev&vent hour2.fev&BP
1 0 1 0 1 1
hour2.vent&BP
1
Can you help me?
Current data frame
How the new dataframe could look like

As an initial approach I would propose the following way. First of all, group the data by the patients and the time spans
library("dplyr")
# definition of time spans
df$strata <- if_else(df$admission.duration.minutes == 0, 1, ceiling(df$admission.duration.minutes))
# note that NA measurments are silently transformed here to zeros
df_groupped <- df %>% group_by(PatientID, strata) %>% summarise_at(vars(has.fever:high.bloodpressure),
sum, na.rm = TRUE)
If we want to process NA in another way, the solution may be
# the result is NA only if all parameters in the strata are NA
df_groupped <- df %>% group_by(PatientID, strata) %>%
summarise_at(.vars = vars(has.fever:high.bloodpressure),
.funs = funs(if (all(is.na(.))) NA else sum(., na.rm = TRUE)),
na.rm = FALSE)
So, we obtain the grouped data frame in a long format
# transform numbers of measurments to booleans
df_groupped <- df_groupped %>% mutate(
has.fever = as.integer(as.logical(has.fever)),
on.ventilator = as.integer(as.logical(on.ventilator)),
high.bloodpressure = as.integer(as.logical(high.bloodpressure)),
# ".and."" means `*` instead of `+`
fev.and.BP = as.integer(as.logical(has.fever * high.bloodpressure)),
fev.and.vent = as.integer(as.logical(has.fever * high.bloodpressure))
)
Then create a function to generate a data frame of a desired structure:
fill_form <- function(periods, df_Patient, n_param){
# obtain names of the measured parameters & the first column
long_col_names <- names(df_Patient)[-(1:2)]
long_df_names <- sapply(function(i) paste("hour", periods[i], ".", long_col_names, sep =""), X = periods)
# add the names of the first column with the Patient's ID
long_df_names <- c(names(df_Patient)[1], long_df_names)
long_df <- as.data.frame(matrix(NA, nrow = 1, ncol = 1 + length(periods) * n_param))
names(long_df) <- long_df_names
long_df[, 1] <- as.character(df_Patient[1, 1])
for (i in seq(along.with = periods)) {
if (nrow(filter(df_Patient, strata == periods[i])) > 0) {
long_df[ ,(2 + n_param * (i - 1)):(2 + n_param * i)] <- filter(df_Patient, strata == periods[i])[-(1:2)]
}
}
return(long_df)
}
And then finely apply this function to the data of each individual patient
# the ID's of the patients extracted from the initial df
PatientIDs_names <- unique(unlist(lapply(df["PatientID"], as.character)))
n_of_patients <- length(PatientIDs_names)
n_monit_param <- (ncol(df_groupped) - 2)
# outputted periods are restricted for demonstration purposes
hours_to_monitor <- c(1:5)
records <- lapply(function(i) fill_form(periods = hours_to_monitor,
df_Patient = filter(df_groupped, PatientID == PatientIDs_names[i]), n_param = n_monit_param),
X = seq(along.with = PatientIDs_names))
Hope, it'll be helpful. However, I'm not sure about two things:
1) Both hour2.fev and hour2.BP are 0 in your output example, so why hour2.fev&vent is 1?
2) Why high.bloodpressure is 0 for the PatientID == 1 on the second time span? There is a high.bloodpressure == 1 at time 1.2 hours. This time should be included into the second time span (Hour2 between 1 and 2), shouldn't it?

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Analyze the evolution of a population over time - r

Related

Filter on date in r dataframe

Calculating Highest In, First Out on trades

R Loop between two datatables

How can I filter on a past date given another date in a specific row of data.frame?

Add time strata variable and change format in r

Categories

Resources