I am trying to use the Highest In, First Out accounting method on trades.
Highest In, First Out means that when you sell, you sell your most expensive shares first.
Here are my buys and sells (example borrowed from "R calculate aggregate gains or loss using FIFO method", a similar but different problem):
buy <- data.frame(BuyTransactionID = 1:10,
                  Ticker = c(rep('MSFT', 4), rep('AMZN', 3), rep('DOCU', 3)),
                  Date = c(rep('01-01-2018', 2), rep('01-14-2020', 2), rep('01-01-2018', 2),
                           rep('01-14-2020', 1), '01-01-2018', '03-15-2020', '04-06-2020'),
                  Price = c(100, 102, 102, 107, 2000, 2010, 2011, 197, 182, 167),
                  Quantity = c(10, 10, 5, 5, 1, 1, 2, 12, 15, 15))
sell <- data.frame(SellTransactionID = 1:7,
                   Ticker = c('MSFT', 'MSFT', 'AMZN', 'AMZN', 'DOCU', 'DOCU', 'DOCU'),
                   Date = c('01-07-2020', '01-20-2020', '01-01-2020', '01-30-2020',
                            '01-15-2020', '04-10-2020', '04-20-2020'),
                   Price = c(97, 110, 2100, 2050, 210, 205, 225),
                   Quantity = c(15, 12, 1, 3, 10, 5, 3))
Here are the rules:
You sell the most expensive (highest-price) shares first.
You cannot sell shares before you purchase them.
You cannot sell the same shares multiple times.
Example problem:
The first sale (SellTransactionID = 1) is 15 shares of MSFT on 01-07-2020. So, any purchase made before that date can be sold. Based on date, the eligible shares to be sold are those from BuyTransactionID 1 and 2. BuyTransactionID 2 is the highest price. Therefore, all 10 shares of BuyTransactionID 2 are sold and the remaining 5 shares come from BuyTransactionID 1.
Desired output:
'Date Sold' = the date sold (self-explanatory);
'Ticker' = the ticker sold;
'Proceeds' = the total dollar amount sold;
'Cost basis' = a weighted average of the shares sold.
Example solution:
This is the solution for SellTransactionID 1. A proper solution automates this and calculates the result for every SellTransactionID.
result <- data.frame(SellDate = '01-07-2020', Ticker = "MSFT", Proceeds = 1455, CostBasis = 101.33)
Cost Basis Example:
Cost basis is calculated as a weighted average. For the preceding example, cost basis is calculated as: (Quantity1 * Price1 + Quantity2 * Price2 + ...) / sum of all Quantities
So for the example above: (10 * 102 + 5 * 100) / 15 = 101.33
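For reference, the same arithmetic in a couple of lines of R (quantities and prices taken from BuyTransactionIDs 2 and 1 above):
qty <- c(10, 5)      # shares taken from BuyTransactionID 2, then BuyTransactionID 1
price <- c(102, 100) # the prices those shares were bought at
sum(qty * price) / sum(qty) # cost basis: 101.3333
15 * 97                     # proceeds: 15 shares at the sale price of 97 = 1455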
The answer by @DPH is excellent, but unfortunately not quite accurate enough. I will explain why.
Here is a new dataset where all the purchases precede the sales:
buy <- data.frame(BuyTransactionID = 1:10,
                  Ticker = rep('MSFT', 10),
                  Date = rep('01-01-2020', 10),
                  Price = c(100, 102, 102, 107, 105, 111, 109, 112, 115, 106),
                  Quantity = c(10, 10, 5, 5, 1, 1, 2, 12, 15, 15))
sell <- data.frame(SellTransactionID = 1:4,
                   Ticker = c('MSFT', 'MSFT', 'MSFT', 'MSFT'),
                   Date = c('01-07-2020', '01-20-2020', '01-21-2020', '01-22-2020'),
                   Price = c(120, 119, 117, 121),
                   Quantity = c(7, 12, 1, 5))
If you apply the solution from @DPH to this dataset, the 'Remain_Price' does not change, nor does the 'Sales_Cost' for the last three transactions. This happens because the function determines how many shares remain after the first sale and what the average price of the remaining shares is. The shares purchased before the first sale can no longer be sold individually; they are treated as a single entity with an average price and a remaining number of shares.
For example, a total of 76 shares were bought in this example. The first sale sells 7 shares, so 69 shares remain, as seen in 'Remain_Qtd'. An average price is calculated for those remaining shares: $106.5652. From then on, the process considers all 69 shares to be priced at $106.5652; the remaining sales reduce 'Remain_Qtd' but do not change 'Remain_Price'. The remaining shares can no longer be considered at the prices they were bought at; they are collectively part of the remaining pool at the average remaining price.
This occurs because of the object dfo and the recycling of dfo in the object sdf. In particular, this line calculates an average remaining price that is then recycled through dfo and sdf.
Price = (sum(ip * iq) - v) / sum(sdf$Quantity)
and Quantity = sum(sdf$Quantity) adds together all the remaining shares.
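To see the numbers behind the recycling, using the all-MSFT dataset above:
total_value <- sum(buy$Price * buy$Quantity) # $8158 across 76 shares
v <- 7 * 115                                 # first sale: 7 shares from the highest-priced lot ($115)
(total_value - v) / (76 - 7)                 # 106.5652, the 'Remain_Price' that never changes again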
I think the answer by @DPH is brilliant, but I hope it can be modified to treat each purchase individually rather than aggregating past purchases.
If I understood your problem correctly, this is one possible solution. In summary, I combine the sales and buys data and group it into sales blocks (given by the sales ID). This assumes that the order of the sales IDs follows the date column. I then loop over these sales blocks sequentially and write the intermediate result to an individual data frame. For each sales block, this result data frame is filtered for the last sales-block result of the same ticker. This means the sales quantity must not be larger than the quantity available up to that point in the timeline (since you cannot sell what you do not have, this should not be a concern anyhow, but I have to point it out as a possible limitation).
The loop in part 1 is not the best way to work with data in R, since it grows a data.frame inside a loop. Since you listed the purrr tag, I adapted the code in the second part of the answer to work with the map() function.
Before we get to the actual coding, let's prepare the data first (both parts of the answer need it in the same form):
library(tidyverse) # need dplyr, using lubridate for date parsing and tidyr to fill data.frame

# include identifier variable, rename the ID columns and union sell and buy
df <- dplyr::mutate(buy, io = "i") %>%
  dplyr::rename(TID = BuyTransactionID) %>%
  dplyr::union(dplyr::mutate(sell, io = "o") %>%
                 dplyr::rename(TID = SellTransactionID)) %>%
  # sort the data
  dplyr::arrange(Ticker, Date) %>%
  # make Quantity negative for sales, remove buy IDs, parse Date and set up a helper column
  dplyr::mutate(Quantity = ifelse(io == "o", -1 * Quantity, Quantity),
                TID = ifelse(io == "i", NA, TID),
                Date = lubridate::mdy(Date),
                hprice = NA_real_) %>%
  # group data to fill backwards per group
  dplyr::group_by(Ticker) %>%
  tidyr::fill(TID, .direction = "up") %>%
  # ungroup to prevent unwanted behaviour downstream
  dplyr::ungroup()
1 Standard loop
dfo <- df[0, ] # empty copy of df

for (i in sort(unique(df$TID))) {
  # which ticker we are working with at this sale
  t <- unique(df[df$TID == i, ]$Ticker)
  # bind data from the last sale of this ticker to the current sale
  sdf <- rbind(dplyr::slice_max(dfo[dfo$Ticker == t, ], Date),
               df[df$TID == i, ])
  # current sales quantity as a positive value
  o1 <- abs(sdf[sdf$io == "o", ]$Quantity)
  # copy to consume in the greedy pass
  o2 <- o1
  # vectors of price and quantity of the bought shares at this sale, price in decreasing order
  ip <- sdf[sdf$io == "i", ] %>% dplyr::arrange(desc(Price)) %>% dplyr::pull(Price)
  iq <- sdf[sdf$io == "i", ] %>% dplyr::arrange(desc(Price)) %>% dplyr::pull(Quantity)
  # total value of the shares consumed by the greedy pass
  v <- 0
  # run over bought prices and quantities, reducing the sold quantity per sales block;
  # break as soon as a share block is larger than the remaining sales quantity
  for (l in 1:length(ip)) {
    if (o2 < iq[l]) {
      v <- v + ip[l] * o2
      break
    } else {
      o2 <- o2 - iq[l]
      v <- v + ip[l] * iq[l]
    }
  }
  # fill the sales block frame and bind it to the output df
  dfo <- rbind(dfo,
               data.frame(TID = i,
                          Ticker = t,
                          Date = max(sdf$Date),
                          Price = (sum(ip * iq) - v) / sum(sdf$Quantity),
                          Quantity = sum(sdf$Quantity),
                          io = "i",
                          hprice = v / o1))
}
# format, join the original data and calculate the result per sales block
dplyr::select(dfo, Ticker, Date, Remain_Price = Price, Remain_Qtd = Quantity, Sales_Cost = hprice) %>%
  dplyr::left_join(dplyr::select(df, Ticker, Date, Sales_Price = Price, Sales_Qtd = Quantity),
                   by = c("Ticker", "Date")) %>%
  dplyr::mutate(Result = abs(Sales_Qtd) * (Sales_Price - Sales_Cost))
Ticker Date Remain_Price Remain_Qtd Sales_Cost Sales_Price Sales_Qtd Result
1 MSFT 2020-01-07 100.0000 5 101.3333 97 -15 -65
2 MSFT 2020-01-20 100.0000 3 103.7500 110 -12 75
3 AMZN 2020-01-01 2000.0000 1 2010.0000 2100 -1 90
4 AMZN 2020-01-30 NaN 0 2007.3333 2050 -3 128
5 DOCU 2020-01-15 197.0000 2 197.0000 210 -10 130
6 DOCU 2020-04-10 173.6667 27 188.0000 205 -5 85
7 DOCU 2020-04-20 0.0000 -1 131.3333 225 -3 281
2 Loop rephrased as a purrr solution (note the global assignment operator, <<- instead of <-, for the assignment of dfo at the end of the function)
# rephrase the loop as a function
myfun <- function(i) {
  # which ticker we are working with at this sale
  t <- unique(df[df$TID == i, ]$Ticker)
  # bind data from the last sale of this ticker to the current sale
  sdf <- rbind(dplyr::slice_max(dfo[dfo$Ticker == t, ], Date),
               df[df$TID == i, ])
  # current sales quantity as a positive value
  o1 <- abs(sdf[sdf$io == "o", ]$Quantity)
  # copy to consume in the greedy pass
  o2 <- o1
  # vectors of price and quantity of the bought shares at this sale, price in decreasing order
  ip <- sdf[sdf$io == "i", ] %>% dplyr::arrange(desc(Price)) %>% dplyr::pull(Price)
  iq <- sdf[sdf$io == "i", ] %>% dplyr::arrange(desc(Price)) %>% dplyr::pull(Quantity)
  # total value of the shares consumed by the greedy pass
  v <- 0
  # run over bought prices and quantities, reducing the sold quantity per sales block;
  # break as soon as a share block is larger than the remaining sales quantity
  for (l in 1:length(ip)) {
    if (o2 < iq[l]) {
      v <- v + ip[l] * o2
      break
    } else {
      o2 <- o2 - iq[l]
      v <- v + ip[l] * iq[l]
    }
  }
  # fill the sales block frame and bind it to the output df (note the global assignment)
  dfo <<- rbind(dfo,
                data.frame(TID = i,
                           Ticker = t,
                           Date = max(sdf$Date),
                           Price = (sum(ip * iq) - v) / sum(sdf$Quantity),
                           Quantity = sum(sdf$Quantity),
                           io = "i",
                           hprice = v / o1))
}
# empty copy of df
dfo <- df[0, ]

purrr::map(sort(unique(df$TID)), ~ myfun(.x))[[max(df$TID)]] %>%
  dplyr::select(Ticker, Date, Remain_Price = Price, Remain_Qtd = Quantity, Sales_Cost = hprice) %>%
  dplyr::left_join(dplyr::select(df, Ticker, Date, Sales_Price = Price, Sales_Qtd = Quantity),
                   by = c("Ticker", "Date")) %>%
  dplyr::mutate(Result = abs(Sales_Qtd) * (Sales_Price - Sales_Cost))
Ticker Date Remain_Price Remain_Qtd Sales_Cost Sales_Price Sales_Qtd Result
1 MSFT 2020-01-07 100.0000 5 101.3333 97 -15 -65
2 MSFT 2020-01-20 100.0000 3 103.7500 110 -12 75
3 AMZN 2020-01-01 2000.0000 1 2010.0000 2100 -1 90
4 AMZN 2020-01-30 NaN 0 2007.3333 2050 -3 128
5 DOCU 2020-01-15 197.0000 2 197.0000 210 -10 130
6 DOCU 2020-04-10 173.6667 27 188.0000 205 -5 85
7 DOCU 2020-04-20 0.0000 -1 131.3333 225 -3 281
EDIT
To keep track of the remaining stock we need a second data frame holding the current portfolio. I did not optimize the code and edited only the loop; the purrr adaptation should be pretty straightforward, though.
library(tidyverse) # need dplyr, using lubridate for date parsing and tidyr to fill data.frame

# include identifier variable, rename the ID columns and union sell and buy
df <- dplyr::mutate(buy, io = "i") %>%
  dplyr::rename(TID = BuyTransactionID) %>%
  dplyr::union(dplyr::mutate(sell, io = "o") %>%
                 dplyr::rename(TID = SellTransactionID)) %>%
  # sort the data
  dplyr::arrange(Ticker, Date) %>%
  # make Quantity negative for sales, remove buy IDs, parse Date and set up a helper column
  dplyr::mutate(Quantity = ifelse(io == "o", -1 * Quantity, Quantity),
                TID = ifelse(io == "i", NA, TID),
                Date = lubridate::mdy(Date),
                hprice = NA_real_) %>%
  # group data to fill backwards per group
  dplyr::group_by(Ticker) %>%
  tidyr::fill(TID, .direction = "up") %>%
  # ungroup to prevent unwanted behaviour downstream
  dplyr::ungroup()
dfo <- df[0, ] # empty copy of df
dfh <- df[0, c("Ticker", "Price", "Quantity", "io")] # to hold the current stock aka portfolio

for (i in sort(unique(df$TID))) {
  # which ticker we are working with at this sale
  t <- unique(df[df$TID == i, ]$Ticker)
  # bind data from the last sale of this ticker to the current sale
  sdf <- rbind(dplyr::slice_max(dfo[dfo$Ticker == t, ], Date),
               df[df$TID == i, ])
  # bind the current portfolio to the buys between the last and current sale (new portfolio before the sale)
  sdfh <- rbind(dfh[dfh$Ticker == t, ],
                df[df$TID == i & df$io == "i", c("Ticker", "Price", "Quantity", "io")])
  # current sales quantity as a positive value
  o1 <- abs(sdf[sdf$io == "o", ]$Quantity)
  # copy to consume in the greedy pass
  o2 <- o1
  # vectors of price and quantity of the held shares at this sale, price in decreasing order
  ip <- sdfh %>% dplyr::arrange(desc(Price)) %>% dplyr::pull(Price)
  iq <- sdfh %>% dplyr::arrange(desc(Price)) %>% dplyr::pull(Quantity)
  ips <- ip
  iqs <- iq
  # total value of the shares consumed by the greedy pass
  v <- 0
  # run over held prices and quantities, reducing the sold quantity per sales block;
  # break as soon as a share block is larger than the remaining sales quantity
  for (l in 1:length(ip)) {
    if (o2 < iq[l]) {
      v <- v + ip[l] * o2
      iqs[1] <- iqs[1] - o2
      break
    } else {
      o2 <- o2 - iq[l]
      v <- v + ip[l] * iq[l]
      ips <- ips[-1]
      iqs <- iqs[-1]
    }
  }
  # update the portfolio with what is left of this ticker
  dfh <- rbind(dfh[dfh$Ticker != t, ],
               data.frame(Ticker = t,
                          Price = ips,
                          Quantity = iqs,
                          io = "i"))
  # fill the sales block frame and bind it to the output df
  dfo <- rbind(dfo,
               data.frame(TID = i,
                          Ticker = t,
                          Date = max(sdf$Date),
                          Price = sum(ips * iqs) / sum(iqs),
                          Quantity = sum(iqs),
                          io = "i",
                          hprice = v / o1))
}
dfo
TID Ticker Date Price Quantity io hprice
1 1 MSFT 2020-01-07 106.5652 69 i 115
2 2 MSFT 2020-01-20 105.0000 57 i 114
3 3 MSFT 2020-01-21 104.8750 56 i 112
4 4 MSFT 2020-01-22 104.1765 51 i 112
Here is the final working solution that I have arrived at with the help of @DPH. I have made a couple of changes to @DPH's edited solution.
The edited solution does not work when all the shares of a stock are sold, for multiple reasons, including the dfh object. The updated solution works with the modified dataset that I provided, but not with the original dataset. I have modified the answer so that it works when all shares are sold.
I have modified the result to include the dates of the purchases. This will be important for determining whether a sale is a long-term or short-term capital gain.
I have removed tickers for stocks that have been purchased but not sold, as those break the script.
I have applied purrr to the updated solution to avoid looping.
I have changed the base subsetting (i.e., df[]) to dplyr subsetting (i.e., df %>% filter()). For some reason the base subsetting produced rows with NA values in my actual dataset, even though it did not in the sample dataset; the NA rows caused the solution not to work (see the short illustration after this list).
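A minimal illustration of that NA-row behavior (toy data, not from the dataset above):
d <- data.frame(x = c(1, NA, 3))
d[d$x > 1, ]            # base subsetting returns a row of NAs where the condition is NA
dplyr::filter(d, x > 1) # filter() silently drops rows where the condition is NA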
data frame prep:
df <- buy %>%
  dplyr::filter(Ticker %in% unique(sell$Ticker)) %>%
  dplyr::mutate(io = "i") %>%
  dplyr::rename(TID = BuyTransactionID) %>%
  dplyr::union(dplyr::mutate(sell, io = "o") %>%
                 dplyr::rename(TID = SellTransactionID)) %>%
  # sort the data
  dplyr::arrange(Ticker, Date) %>%
  # make Quantity negative for sales, remove buy IDs, parse Date and set up a helper column
  dplyr::mutate(Quantity = ifelse(io == "o", -1 * Quantity, Quantity),
                TID = ifelse(io == "i", NA, TID),
                Date = lubridate::mdy(Date),
                hprice = NA_real_) %>%
  dplyr::arrange(Ticker, Date) %>%
  # group data to fill backwards per group
  dplyr::group_by(Ticker) %>%
  tidyr::fill(TID, .direction = "up") %>%
  # ungroup to prevent unwanted behaviour downstream
  dplyr::ungroup()
df$Dates_bought <- NA
function and purrr:
# rephrase the loop as a function
myfun <- function(i) {
  # which ticker we are working with at this sale
  t <- unique(df[df$TID == i, ]$Ticker)
  # bind data from the last sale of this ticker to the current sale
  sdf <- rbind(dplyr::slice_max(dfo[dfo$Ticker == t, ], Date),
               df %>% filter(TID == i))
  # bind the current portfolio to the buys between the last and current sale
  sdfh <- rbind(dfh %>% filter(Ticker == t),
                df %>% filter(TID == i & io == "i") %>%
                  select(c("Ticker", "Date", "Price", "Quantity", "io")))
  # current sales quantity as a positive value
  o1 <- abs(sdf %>% filter(io == "o") %>% pull(Quantity))
  # copy to consume in the greedy pass
  o2 <- o1
  # vectors of price, quantity and date of the held shares at this sale, price in decreasing order
  ip <- sdfh %>% dplyr::arrange(desc(Price)) %>% dplyr::pull(Price)
  iq <- sdfh %>% dplyr::arrange(desc(Price)) %>% dplyr::pull(Quantity)
  date <- sdfh %>% dplyr::arrange(desc(Price)) %>% dplyr::pull(Date)
  ips <- ip
  iqs <- iq
  dates <- date
  # total value of the shares consumed by the greedy pass
  v <- 0
  # run over held prices and quantities, reducing the sold quantity per sales block;
  # break as soon as a share block is larger than the remaining sales quantity.
  # Modifications to make the dates work properly.
  for (l in 1:length(ip)) {
    if (o2 < iq[l]) {
      v <- v + ip[l] * o2
      iqs[1] <- iqs[1] - o2
      dates2 <- if (o2 == 0) dates else dates[-1]
      break
    } else {
      o2 <- o2 - iq[l]
      v <- v + ip[l] * iq[l]
      ips <- ips[-1]
      iqs <- iqs[-1]
      dates <- dates[-1]
      dates2 <- dates
    }
  }
  # The if-else statements are needed because when all shares are sold, the
  # lengths of dates, ips and iqs are 0, whereas Ticker and io have length 1.
  dfh <<- rbind(dfh[dfh$Ticker != t, ],
                data.frame(Ticker = if (length(ips) == 0) character(0) else t,
                           Date = dates,
                           Price = ips,
                           Quantity = iqs,
                           io = if (length(ips) == 0) character(0) else "i"))
  # fill the sales block frame and bind it to the output df (note the global assignment)
  dfo <<- rbind(dfo,
                data.frame(TID = i,
                           Ticker = t,
                           Date = max(sdf$Date),
                           Dates_bought = paste(date[seq(length(date) - length(dates2))], collapse = ","),
                           Price = sum(ips * iqs) / sum(iqs),
                           Quantity = sum(iqs),
                           io = "i",
                           hprice = v / o1))
}
dfo <- df[0, ] # empty copy of df
dfh <- df[0, c("Ticker", "Date", "Price", "Quantity", "io")] # to hold current stock aka portfolio
hifo <- purrr::map(sort(unique(df$TID)), ~ myfun(.x))[[length(unique(df$TID))]] %>%
  dplyr::select(Ticker, Date, Dates_bought, Remain_Price = Price, Remain_Qtd = Quantity, Sales_Cost = hprice) %>%
  dplyr::left_join(dplyr::select(df, Ticker, Date, Sales_Price = Price, Sales_Qtd = Quantity),
                   by = c("Ticker", "Date")) %>%
  dplyr::mutate(Result = abs(Sales_Qtd) * (Sales_Price - Sales_Cost))
Let me know if anyone has any issues. I'd like to get this into a Shiny app and maybe develop it further. Let me know if you're interested in collaborating.
Related
I have a data frame that has 160M rows and 2 columns (material name and price). I want to determine the frequency at which each price occurs.
For example, the price $10 was given 100 different times. I'd like to sort the values from the largest occurrence count to the smallest (for example, $100 was given 1000 times).
There are 2,484,557 unique prices, so a "table" is not the most useful solution.
My issue is that I'm running into memory problems.
Any suggestions on how I can accomplish this?
Here's a 2 GB data frame with 160M rows and about 3M unique prices:
set.seed(42)
n <- 160E6
fake_data <- data.frame(material = sample(LETTERS, n, replace = TRUE),
                        price = sample(1:3E6, n, replace = TRUE))
I like dplyr syntax, but for large data with many groups, data.table and collapse offer much better performance.
We could use dtplyr to translate dplyr code to data.table. This takes 22 seconds on my machine, with the result showing how many times each price appears in the data.
library(dplyr)
library(dtplyr)
fake_data %>%
lazy_dt() %>%
count(price, sort = TRUE)
Result
Source: local data table [3,000,000 x 2]
Call: `_DT2`[, .(n = .N), keyby = .(price)][order(desc(n))]
price n
<int> <int>
1 2586972 97
2 2843789 95
3 753207 92
4 809482 92
5 1735845 92
6 809659 90
# … with 2,999,994 more rows
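For comparison, the direct data.table form of that translated call would look like this (a sketch, not benchmarked here; note that setDT() converts fake_data by reference):
library(data.table)
setDT(fake_data)[, .N, by = price][order(-N)]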
If you need higher performance and don't mind a heuristic, you could also sample your data to make it 10% or 1% as big; if a value occurs frequently in the full data, it is also likely to be frequent in a random sample.
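A minimal sketch of that sampling idea, reusing fake_data from above (the 1% fraction is an arbitrary choice):
sampled <- fake_data[sample(nrow(fake_data), size = nrow(fake_data) * 0.01), ]
sampled %>%
  lazy_dt() %>%
  count(price, sort = TRUE) # approximate ranking of the most frequent prices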
I'd probably create price intervals, e.g. $0-50, $51-100, $101-150 etc.
EDIT: a more comprehensive solution
library(tidyverse)

# build a fake data frame of two-letter names
df <- letters %>%
  expand_grid(., .) %>%
  rename(v1 = `....1`,
         v2 = `....2`) %>%
  mutate(name = paste0(v1, v2)) %>%
  select(name) %>%
  bind_rows(., ., ., .)

df

n <- nrow(df)

# add a random price column
df <- df %>%
  mutate(price = rnorm(n = n, mean = 1000, sd = 200))

df %>%
  ggplot(aes(x = price)) +
  geom_histogram()

# bucket the prices into intervals (upper bounds inclusive)
df <- df %>%
  mutate(price_grp = case_when(price <= 500 ~ "$0-500",
                               price > 500 & price <= 1000 ~ "$501-1000",
                               price > 1000 & price <= 1500 ~ "$1001-1500",
                               price > 1500 ~ "+ $1500"))

df %>%
  group_by(price_grp) %>%
  summarize(occurrences = n()) %>%
  arrange(desc(occurrences))
I am dealing with a data frame containing transaction-level data. It has two fields, bill_id and product.
The data represents products purchased at a bill level, and a particular bill_id gets repeated as many times as the number of products purchased in that bill. For example, if 5 items have been purchased in bill_id 12345, the data for this bill will be like this:
bill_id product
12345 A
12345 B
12345 C
12345 D
12345 E
My objective is to extract the data of all bills containing a certain product.
Following is an example of how I am performing this task currently:
library(dplyr)
set.seed(1)

# Sample data
dat <- data.frame(bill_id = sample(1:500, size = 1000, replace = TRUE),
                  product = sample(LETTERS, size = 1000, replace = TRUE),
                  stringsAsFactors = FALSE) %>%
  arrange(bill_id, product)
# vector of bill_ids of product A
bills_productA <- dat %>%
  filter(product == "A") %>%
  pull(bill_id) %>%
  unique()

# data for bill_ids in vector bills_productA
dat_subset <- dat %>%
  filter(bill_id %in% bills_productA)
This leads to the creation of an intermediary vector of bill_ids (bills_productA) and a two-step filtering process (first find ids of bills containing the product, and then find all transactions of these bills).
Is there a more efficient way of performing this task?
a data.table approach:
preparation
library(data.table)
setDT(dat)
actual code
# keep all rows whose bill_id occurs among the bill_ids of product "A"
dat[bill_id %in% dat[product == "A", ][[1]], ]
output
# bill_id product
# 1: 14 A
# 2: 14 I
# 3: 19 A
# 4: 19 W
# 5: 22 A
# ---
# 130: 478 A
# 131: 478 V
# 132: 478 Z
# 133: 494 A
# 134: 494 J
You can filter the bill_ids by subsetting directly inside filter():
library(dplyr)
dat_subset1 <- dat %>% filter(bill_id %in% unique(bill_id[product == "A"]))
identical(dat_subset, dat_subset1)
#[1] TRUE
This would also work without unique, but it is better to keep the lookup vector short.
Another variation:
library(dplyr)
dat_subset2 <- semi_join(dat, filter(dat, product == "A") %>% select(bill_id))
> identical(dat_subset, dat_subset2)
[1] TRUE
I have the code below that gives me the time series results of a stock and groups everything into 'buys' and 'sells' buckets (based on closing prices higher or lower than opening prices).
library(dplyr)
library(data.table)
library(quantmod)
library(zoo)
# enter tickers to download time-series data
e <- new.env()
getSymbols("SBUX", env = e)
pframe <- do.call(merge, as.list(e))
#head(pframe)
# get a subset of data
df = pframe$SBUX.Close
colnames(df)[1] <- "Close"
head(df)
# Assign groupings
addGrps <- transform(df,Group = ifelse(Close < lead(Close), "S", "B"))
# create subsets
buys <- addGrps[addGrps$Group == 'B',]
sells <- addGrps[addGrps$Group == 'S',]
Now, I am trying to group the results by daily profits (Diff) and losses and find the cumulative sum of each (profits and losses).
I think it should be something like this, but something is off, and I'm not sure what it is.
# find daily differences
df <- df %>%
mutate(Diff = addGrps$Close - lead(addGrps$Close))
# get up and down price movements
ups <- filter(df, Diff > 0 )
downs <- filter(df, Diff <= 0 )
# cumulative sums of longs and shorts
longs<-cumsum(ups$Diff)
shorts<-cumsum(downs$Diff)
I'm not sure I'm totally following your question/problem, and it seems like there is some unnecessary code. For example, not all of those packages are needed (at least, not yet), and it's not clear why the two subset data frames for the buys and sells are needed. At the very least, the following cleans up some of what you've done so far and gets the data into an easy-to-work-with data frame. With some clarification, maybe this is a start.
library(quantmod)
library(tidyverse) # rather than just dplyr
# pull the SBUX data as a data frame and create the necessary new columns:
df <- data.frame(getSymbols(Symbols = 'SBUX', env = NULL)) %>%      # pull the raw data
  rownames_to_column('date') %>%                                    # convert the row index to a column
  select(date, close = SBUX.Close) %>%                              # keep only the SBUX.Close column, renamed
  mutate(group = ifelse(close < lead(close), 's', 'b')) %>%         # assign the sell or buy group
  mutate(diff = close - lead(close)) %>%                            # create the diff calculation
  mutate(movement = ifelse(diff > 0, 'up', 'down')) %>%             # classify the movement
  as_tibble()                                                       # tbl_df() is deprecated in current dplyr
# just to view the new data frame:
df %>% head(5)
# A tibble: 5 x 5
date close group diff movement
<chr> <dbl> <chr> <dbl> <chr>
1 2007-01-03 17.6 s -0.0200 down
2 2007-01-04 17.6 b 0.0750 up
3 2007-01-05 17.6 b 0.0650 up
4 2007-01-08 17.5 b 0.0750 up
5 2007-01-09 17.4 b 0.0550 up
# calculate the sums of the diff by the movement up or down:
df %>%
  filter(!is.na(movement)) %>% # removes the last date; it cannot have a lead closing price
  group_by(movement) %>%
  summarize(cum_sum = sum(diff))
# A tibble: 2 x 2
movement cum_sum
<chr> <dbl>
1 down -489.
2 up 455.
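Since the question mentions cumulative sums, one hedged variant is to swap summarize() for mutate() with cumsum(), giving a running total within each movement group rather than a single sum:
df %>%
  filter(!is.na(movement)) %>%
  group_by(movement) %>%
  mutate(cum_sum = cumsum(diff)) %>% # running total of diff within each movement group
  ungroup()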
Each day I have a new csv file with ids and some variables. The ids can differ from day to day. I would like to take the IDs of one day and follow how a variable evolves over time.
My goal is to create an area plot like this:
For example, I take all the ids on 31 March; each day I join with those ids and count, grouped by the variable "code". If ids are missing (present on 31 March but not on day D), their code becomes "NA", to show how many IDs I "lose" over time. I hope I'm clear enough.
Here is how I calculate this kind of plot (my real data look like li, not like datas):
library(plyr)
library(dplyr)

datas <- data.frame(id1 = c("x", "y", "x", "y", "z", "x", "z"),
                    id2 = c("x2", "y2", "x2", "y2", "z2", "x2", "z2"),
                    code = c("code1", "code2", "code1", "code2", "code2", "code1", "code2"),
                    var = runif(7),
                    date = do.call(c, mapply(rep, seq(Sys.Date() - 2, Sys.Date(), by = 1), c(2, 3, 2))))

li <- split(datas, datas$date)

dateStart <- Sys.Date() - 2
dateEnd <- Sys.Date()

# A "filter" in case I want to start or end with other dates than the min/max date
li <- li[as.Date(names(li)) >= dateStart & as.Date(names(li)) <= dateEnd]

dfCounts <- ldply(li, function(x)
  left_join(li[[1]], x, by = c("id1", "id2")) %>%
    group_by(code.y) %>%
    count(code = code.y) %>%
    mutate(freq = n / sum(n),
           code = ifelse(is.na(code), "NA", code)),
  .id = "date")
> dfCounts
date code n freq
1 2015-07-04 1 1 0.5
2 2015-07-04 2 1 0.5
3 2015-07-05 1 1 0.5
4 2015-07-05 2 1 0.5
5 2015-07-06 1 1 0.5
6 2015-07-06 NA 1 0.5
dfCounts %>%
  ggplot(aes(date, freq)) +
  geom_area(aes(fill = code), position = "stack")
# I have no idea why nothing shows in the plot for this example, but it works on my real data
So it works, but if I want to observe a longer period I have to join over many days (files), and it can be slow. Do you have any ideas for doing the same thing without joins, using the stacked data (the object datas rather than li), with dplyr or data.table?
In your opinion, which approach is better?
Thanks!
(Sorry for the title, I couldn't find a better one...)
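One possible direction, as a rough sketch rather than a tested answer: work on the stacked datas directly by fixing the day-one cohort of ids once, then counting codes per day within that cohort and adding an "NA" row for cohort ids absent on a given day (the paste() key and the helper names are illustrative assumptions):
library(dplyr)
cohort <- datas %>%
  filter(date == min(date)) %>%
  transmute(key = paste(id1, id2)) %>%
  distinct(key) %>%
  pull(key)
dfCounts2 <- datas %>%
  mutate(key = paste(id1, id2)) %>%
  filter(key %in% cohort) %>%
  distinct(date, key, code) %>% # one row per cohort id and day
  dplyr::count(date, code) %>%
  group_by(date) %>%
  group_modify(~ tibble::add_row(.x, code = "NA", n = length(cohort) - sum(.x$n))) %>%
  filter(n > 0) %>% # drop the "NA" row on days where the whole cohort is present
  mutate(freq = n / sum(n)) %>%
  ungroup()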
Right now, I have the following data.frame which was created by original.df %.% group_by(Category) %.% tally() %.% arrange(desc(n)).
DF <- structure(list(Category = c("E", "K", "M", "L", "I", "A",
"S", "G", "N", "Q"), n = c(163051, 127133, 106680, 64868, 49701,
47387, 47096, 45601, 40056, 36882)), .Names = c("Category",
"n"), row.names = c(NA, 10L), class = c("tbl_df", "tbl", "data.frame"
))
Category n
1 E 163051
2 K 127133
3 M 106680
4 L 64868
5 I 49701
6 A 47387
7 S 47096
8 G 45601
9 N 40056
10 Q 36882
I want to create an "Other" field from the bottom-ranked Categories by n, i.e.
Category n
1 E 163051
2 K 127133
3 M 106680
4 L 64868
5 I 49701
6 Other 217022
Right now, I am doing
rbind(filter(DF, rank(rev(n)) <= 5),
      summarise(filter(DF, rank(rev(n)) > 5), Category = "Other", n = sum(n)))
which collapses all categories not in the top 5 into the Other category.
But I'm curious whether there's a better way in dplyr or some other existing package. By "better" I mean more succinct/readable. I'm also interested in methods with cleverer or more flexible ways to choose Other.
This is another approach, assuming that each category (of the top 5 at least) only occurs once:
df %.%
  arrange(desc(n)) %.% # you could skip this step since you already arranged the input df according to your question
  mutate(Category = ifelse(1:n() > 5, "Other", Category)) %.%
  group_by(Category) %.%
  summarize(n = sum(n))
# Category n
#1 E 163051
#2 I 49701
#3 K 127133
#4 L 64868
#5 M 106680
#6 Other 217022
Edit:
I just noticed that my output is no longer ordered by decreasing n. After running the code again, I found that the order is kept until after the group_by(Category), but once I run summarize, the order is gone (or rather, the result is ordered by Category). Is that supposed to be like that?
Here are three more ways:
m <- 5 # number of top results to show in the final table (excl. "Other")
n <- m + 1

# preserves the order (or rather: re-establishes it by index)
df <- arrange(df, desc(n)) %.% # this could be skipped if the data is already ordered
  mutate(idx = 1:n(), Category = ifelse(idx > m, "Other", Category)) %.%
  group_by(Category) %.%
  summarize(n = sum(n), idx = first(idx)) %.%
  arrange(idx) %.%
  select(-idx)

# doesn't preserve the order (same result as the first dplyr solution, ordered by Category)
df <- df[order(df$n, decreasing = TRUE), ] # this could be skipped if the data is already ordered
df[n:nrow(df), 1] <- "Other"
df <- aggregate(n ~ Category, data = df, FUN = "sum")

# preserves the order (without an extra index)
df <- df[order(df$n, decreasing = TRUE), ] # this could be skipped if the data is already ordered
df[n:nrow(df), 1] <- "Other"
df[n, 2] <- sum(df$n[df$Category == "Other"])
df <- df[1:n, ]
Different package/different syntax version:
library(data.table)
dt <- as.data.table(DF)

dt[order(-n), # your data is already sorted, so this does nothing for it
   if (.BY[[1]]) .SD else list("Other", sum(n)),
   by = 1:nrow(dt) <= 5][, !"nrow", with = FALSE]
# Category n
#1: E 163051
#2: K 127133
#3: M 106680
#4: L 64868
#5: I 49701
#6: Other 217022
This function modifies a column, replacing infrequent entries with Other, either by specifying a minimum frequency or by specifying the intended number of resulting categories.
#' @title Group infrequent entries into 'Other' category
#' @description Useful when you want to constrain the number of unique values in a column.
#' @param .data Data containing variable.
#' @param var Variable containing infrequent entries, to be collapsed into "Other".
#' @param n Threshold for total number of categories above "Other".
#' @param count Threshold for total count of observations before "Other".
#' @param by Extra variables to group by when calculating \code{n} or \code{count}.
#' @param copy Should \code{.data} be copied? Currently only \code{TRUE} is supported.
#' @param other.category Value that infrequent entries are to be collapsed into. Defaults to \code{"Other"}.
#' @return \code{.data} but with \code{var} changed to be grouped into smaller categories.
#' @export
mutate_other <- function(.data, var, n = 5, count, by = NULL, copy = TRUE, other.category = "Other"){
  stopifnot(is.data.table(.data),
            is.character(other.category),
            identical(length(other.category), 1L))
  had.key <- haskey(.data)
  if (!isTRUE(copy)){
    stop("copy must be TRUE")
  }
  out <- copy(.data)
  if (had.key){
    orig_key <- key(out)
  } else {
    orig_key <- "_order"
    out[, "_order" := 1:.N]
    setkeyv(out, "_order")
  }
  if (is.character(.data[[var]])){
    stopifnot(!("nvar" %in% names(.data)),
              var %in% names(.data))
    N <- .rank <- NULL
    n_by_var <-
      out %>%
      .[, .N, keyby = c(var, by)] %>%
      .[, .rank := rank(-N)]
    out <- merge(out, n_by_var, by = c(var, by))
    if (missing(count)){
      out[, (var) := dplyr::if_else(.rank <= n, out[[var]], other.category)]
    } else {
      out[, (var) := dplyr::if_else(N >= count, out[[var]], other.category)]
    }
    out <-
      out %>%
      .[, N := NULL] %>%
      .[, .rank := NULL]
    setkeyv(out, orig_key)
    if (!had.key){
      out[, (orig_key) := NULL]
      setkey(out, NULL)
    }
    out
  } else {
    warning("Attempted to use by = on a non-character vector. Aborting.")
    return(.data)
  }
}
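Hypothetical usage, on raw (un-tallied) data rather than the tallied DF above, since the function counts the occurrences of each value itself; the sample data here is made up for illustration:
library(data.table)
library(magrittr) # the function body uses %>%
raw <- data.table(Category = sample(LETTERS[1:10], 1000, replace = TRUE, prob = 10:1))
collapsed <- mutate_other(raw, "Category", n = 5) # keep the 5 most frequent values, collapse the rest
collapsed[, .N, by = Category][order(-N)]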
https://github.com/HughParsonage/hutils/blob/master/R/mutate_other.R