I am processing a data frame with four columns:
portfolio date stock Value
1 200006 Apple 10
1 200006 Google 20
1 200006 IBM 30
1 200007 Apple 10
Because the amount of data is large, I want a simple way to check which stocks present in June 2000 (200006) are missing in July 2000 (200007) within portfolio 1. Here both Google and IBM are missing, so the return would be c("Google", "IBM"). I will use the list of stocks not present in July 2000 to look up those stocks' values in June 2000 and rebalance the portfolio for July 2000. So in this case, I hope to get c("Google", "IBM") and then get their June values (20, 30) to make further adjustments against Apple's value of 10.
The data types of the four columns (portfolio, date, stock, and Value) are factor, integer, factor, and integer, respectively.
Is there any function or package that can deal with this problem?
You can try this:
library(data.table)
setDT(df)
# Get all possible stocks
stocks <- unique(df$stock)
# Get missing stocks
df[, stocks[!stocks %in% stock], .(portfolio, date)]
# portfolio date V1
# 1: 1 200007 Google
# 2: 1 200007 IBM
# Or vector output (no date or portfolio info)
df[, stocks[!stocks %in% stock], .(portfolio, date)]$V1
# [1] "Google" "IBM"
Firstly: I have seen other posts about translating Excel's AVERAGEIF into R, but I didn't see one that worked for my specific case and I couldn't get one to work myself.
I have a dataset which encompasses the daily pricings of a bunch of listings.
It looks like this:
listing_id date price
1 1000 1/2/2015 $100
2 1200 2/4/2016 $150
Sample of the dataset (and desired outcome): https://send.firefox.com/download/228f31e39d18738d/#rlMmm6UeGxgbkzsSD5OsQw
The dataset I would like to end up with has only the date and the average price of all listings on that date. The goal is to get a (different) data frame that looks something like this, so I can work with it:
Date Average Price
1 4/5/2015 204.5438
2 4/6/2015 182.6439
3 4/7/2015 176.553
4 4/8/2015 182.0448
5 4/9/2015 183.3617
6 4/10/2015 205.0997
7 4/11/2015 197.0118
8 4/12/2015 172.2943
I created this in Excel from the sample provided above using the AVERAGEIF function (and pasting the result by value).
I first tried to format the data in Excel, where I could use AVERAGEIF to take the average of all rows matching a specific date. The problem is that the dataset consists of 30 million rows and Excel only allows about 1 million, so it didn't work.
What I have done so far: I created a data frame in R (where I want the average prices to go) using:
Avg <- data.frame("Date" = 1:2, "Average Price" = 1:2)
Avg[nrow(Avg) + 2036, ] <- list("v1", "v2")  # pad to 2038 rows, one per day
Avg$Date <- seq(from = as.Date("2015-04-05"), to = as.Date("2020-11-01"), by = "day")
I tried to create an AVERAGEIF-like function based on this article and another, but could not get it to work.
I hope this is enough information to go on otherwise I would be more than happy to provide more.
If your question is how to replicate the AVERAGEIF function, you can use logical indexing:
R code:
> df
Dates Prices
1 1 100
2 2 120
3 3 150
4 1 320
5 2 250
6 3 210
7 1 102
8 2 180
9 3 150
idx <- df$Dates == 1 # Positions where condition is true
mean(df$Prices[idx]) # Prints same output as Excel
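To build the whole desired table (one average per date) rather than testing one condition at a time, a grouped aggregation does every date in a single pass and scales fine to 30 million rows. A minimal sketch, assuming the real data frame is df with columns date and price, and that price is stored as text with a dollar sign:
# strip "$" (and any thousands commas), then convert to numeric
df$price <- as.numeric(gsub("[$,]", "", df$price))
# mean price per date, one row per date (base R)
Avg <- aggregate(price ~ date, data = df, FUN = mean)
names(Avg) <- c("Date", "Average Price")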
The BTYD package in R looks very useful for predicting future customer behavior based on past transactions.
However, the walk-through only illustrates predicting how many transactions a customer will make in an upcoming period, for example in the next year or month.
Is there a way to use this package to create a prediction for the date on which a customer will purchase, and the expected amount of the purchase?
For example, using the sample data set available in the BTYD package:
cdnowElog <- system.file("data/cdnowElog.csv", package = "BTYD")
elog <- dc.ReadLines(cdnowElog, cust.idx = 2,
date.idx = 3, sales.idx = 5)
# Change to date format
elog$date <- as.Date(elog$date, "%Y%m%d")
elog[1:3,]
# cust date sales
# 1 1 1997-01-01 29.33
# 2 1 1997-01-18 29.73
# 3 1 1997-08-02 14.96
I would want an output that has the customer number, expected next date of purchase, and expected purchase amount.
# cust exp_date exp_sales
# 1 1998-02-23 19.35
# 2 1997-09-12 39.83
# 3 1998-01-05 24.56
Or can this package only predict the expected number of transactions in a time period, not the purchase date itself or the spend amount? Is there a better approach for what I want to achieve?
I apologize if this question seems very basic, but I couldn't find the answer to this conceptual question in the documentation.
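For what it's worth, one rough workaround can be sketched (this is not a built-in BTYD feature, and it assumes a calibration CBS matrix cal.cbs with columns x, t.x and T.cal built from elog as in the package vignette): take the conditional expected transaction count over a horizon and back out a crude expected wait until the next purchase, with each customer's average historical spend standing in for the amount.
library(BTYD)
# assumes cal.cbs was built from elog via dc.ElogToCbsCbt(), per the vignette
params <- pnbd.EstimateParameters(cal.cbs)
# expected number of transactions per customer over the next 52 weeks
exp.trans <- pnbd.ConditionalExpectedTransactions(params, T.star = 52,
                                                  x     = cal.cbs[, "x"],
                                                  t.x   = cal.cbs[, "t.x"],
                                                  T.cal = cal.cbs[, "T.cal"])
# crude expected weeks until the next purchase
exp.weeks <- 52 / exp.trans
# crude expected amount: each customer's average historical spend
avg.spend <- tapply(elog$sales, elog$cust, mean)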
I have a data.table that looks like this:
dt
id month balance
1: 1 4 100
2: 1 5 50
3: 2 4 200
4: 2 5 135
5: 3 4 100
6: 3 5 100
7: 4 5 300
"id" is the client's ID, "month" indicates what month it is, and "balance" indicates the account balance of a client. In a sense, this is longitudinal data where, say, element (2,3) indicates that Client #1 has an account balance of 50 at the end of month 5.
I want to generate a column giving the difference between a client's balance in month 5 and month 4, to know the transactions carried out from one month to the next.
This new variable should let me know that Client 1 withdrew 50, Client 2 withdrew 65, and Client 3 did nothing in aggregate terms between April and May. Client 4 is a new client who joined in May.
I thought of the following code:
dt$transactions <- dt$balance - shift(dt$balance, 1)
However, it does not work properly because it's telling me that Client 4 made a 200 dollar deposit (but Client 4 is new!). Therefore, I want to be able to introduce the argument "by=id" to this somehow.
I know the solution lies in using the following notation:
dt[, transactions := balance - shift(balance, ??? ), by=id]
I just need to figure out how to make the aforementioned code work properly.
Thanks in advance.
Given that I only have two observations (at most) per id, the following code gives me an elegant solution:
dt[, transaction := balance - first(balance), by = id]
This prevents any NAs from entering the variable transaction.
However, if I had more observations per id, I would do the following:
dt[, transaction := balance - shift(balance, 1), by = id]
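On the sample above, this general version yields NA for each client's first observed month (a quick check; shift()'s fill argument could set those to 0 instead, if a new client's opening balance should count as a deposit):
dt
#    id month balance transaction
# 1:  1     4     100          NA
# 2:  1     5      50         -50
# 3:  2     4     200          NA
# 4:  2     5     135         -65
# 5:  3     4     100          NA
# 6:  3     5     100           0
# 7:  4     5     300          NA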
Big thanks to @Ryan and @Onyambu for helping.
I'm trying to do a zoo merge between stock prices from selected trading days and observations about those same stocks (we call these "Nx observations") made on the same days. Sometimes we do not have Nx observations on trading days, and sometimes we have Nx observations on non-trading days. We want to place an NA where we have no Nx observation on a trading day, but eliminate Nx observations that fall on non-trading days, since without trading data for the same day an Nx observation is useless.
The following SO question is close to mine, but I would characterize that question as REPLACING missing data, whereas my objective is to truly eliminate observations made on non-trading days (if necessary, we can change the process by which Nx observations are taken, but it would be a much less expensive solution to leave it alone).
merge data frames to eliminate missing observations
The script I have prepared to illustrate follows (I'm new to R and SO; all suggestions welcome):
# create Stk_data data.frame for use in the Stack Overflow question
Date_Stk <- c("1/2/13", "1/3/13", "1/4/13", "1/7/13", "1/8/13") # dates for stock prices used in the example
ABC_Stk <- c(65.73, 66.85, 66.92, 66.60, 66.07) # stock prices for tkr ABC for Jan 2 2013 through Jan 8 2013
DEF_Stk <- c(42.98, 42.92, 43.47, 43.16, 43.71) # stock prices for tkr DEF for Jan 2 2013 through Jan 8 2013
GHI_Stk <- c(32.18, 31.73, 32.43, 32.13, 32.18) # stock prices for tkr GHI for Jan 2 2013 through Jan 8 2013
Stk_data <- data.frame(Date_Stk, ABC_Stk, DEF_Stk, GHI_Stk) # create the stock price data.frame
# create Nx_data data.frame for use in the Stack Overflow question
Date_Nx <- c("1/2/13", "1/4/13", "1/5/13", "1/6/13", "1/7/13", "1/8/13") # dates for Nx Observations used in the example
ABC_Nx <- c(51.42857, 51.67565, 57.61905, 57.78349, 58.57143, 58.99564) # Nx scores for stock ABC for Jan 2 2013 through Jan 8 2013
DEF_Nx <- c(35.23809, 36.66667, 28.57142, 28.51778, 27.23150, 26.94331) # Nx scores for stock DEF for Jan 2 2013 through Jan 8 2013
GHI_Nx <- c(7.14256, 8.44573, 6.25344, 6.00423, 5.99239, 6.10034) # Nx scores for stock GHI for Jan 2 2013 through Jan 8 2013
Nx_data <- data.frame(Date_Nx, ABC_Nx, DEF_Nx, GHI_Nx) # create the Nx scores data.frame
# create zoo objects & merge
z.Stk_data <- zoo(Stk_data, as.Date(as.character(Stk_data[, 1]), format = "%m/%d/%Y"))
z.Nx_data <- zoo(Nx_data, as.Date(as.character(Nx_data[, 1]), format = "%m/%d/%Y"))
z.data.outer <- merge(z.Stk_data, z.Nx_data)
The NAs on Jan 3 2013 for the Nx observations are fine (we'll use na.locf), but we need to eliminate the Nx observations that appear on Jan 5 and 6, as well as the associated NAs in the stock-price section of the zoo objects.
I've read the R documentation for merge.zoo regarding the use of all: that its use "allows intersection, union and left and right joins to be expressed". But trying all combinations of the following use of all yielded the same results (as to why would be a secondary question).
z.data.outer <- zoo(merge(x = Stk_data, y = Nx_data, all.x = FALSE)) # try using "all"
While I would appreciate comments on the secondary question, I'm primarily interested in learning how to eliminate the extraneous Nx observations on days when there is no trading of stocks. Thanks. (And thanks in general to the community for all the great explanations of R!)
The all argument of merge.zoo must be (quoting from the help file):
logical vector having the same length as the number of "zoo" objects to be merged (otherwise expanded)
and you want to keep all rows from the first argument but not the second so its value should be c(TRUE, FALSE).
merge(z.Stk_data, z.Nx_data, all = c(TRUE, FALSE))
The reason the all syntax for merge.zoo differs from merge.data.frame is that merge.zoo can merge any number of arguments, whereas merge.data.frame only handles two, so the syntax had to be extended to handle that.
Also note that %Y should have been %y in the question's code.
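Putting the two fixes together (a sketch; it also drops the date column from the data part so the series stay numeric instead of being coerced to character):
z.Stk_data <- zoo(Stk_data[, -1], as.Date(as.character(Stk_data$Date_Stk), format = "%m/%d/%y"))
z.Nx_data <- zoo(Nx_data[, -1], as.Date(as.character(Nx_data$Date_Nx), format = "%m/%d/%y"))
z.merged <- merge(z.Stk_data, z.Nx_data, all = c(TRUE, FALSE))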
I hope I have understood your desired output correctly ("NAs on Jan 3 2013 for the Nx observations are fine"; "eliminate [...] observations that appear on Jan 5 and 6"). I don't quite see the need for zoo in the merging step.
merge(Stk_data, Nx_data, by.x = "Date_Stk", by.y = "Date_Nx", all.x = TRUE)
# Date_Stk ABC_Stk DEF_Stk GHI_Stk ABC_Nx DEF_Nx GHI_Nx
# 1 1/2/13 65.73 42.98 32.18 51.42857 35.23809 7.14256
# 2 1/3/13 66.85 42.92 31.73 NA NA NA
# 3 1/4/13 66.92 43.47 32.43 51.67565 36.66667 8.44573
# 4 1/7/13 66.60 43.16 32.13 58.57143 27.23150 5.99239
# 5 1/8/13 66.07 43.71 32.18 58.99564 26.94331 6.10034
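Since the question mentions na.locf for the remaining Jan 3 gap, that step can follow directly (a sketch; na.locf carries the Jan 2 Nx values forward over Jan 3):
library(zoo)
merged <- merge(Stk_data, Nx_data, by.x = "Date_Stk", by.y = "Date_Nx", all.x = TRUE)
merged[, -1] <- na.locf(merged[, -1])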
I am trying to convert data that show sales as cumulative total sales for the year to date. I want to show sales as they occur by day, not the cumulative figure.
Here is an example of the data:
Product, Geography, Date, SalesThisYear
Prod_1, Area_A, 20130501, 10
Prod_2, Area_B, 20130501, 5
Prod_1, Area_B, 20130501, 3
Prod_1, Area_A, 20130502, 12
Prod_2, Area_B, 20130502, 5
Prod_1, Area_B, 20130502, 4
...
So the transformed data would look like:
Product, Geography, Date, SalesThisYear, DailySales
Prod_1, Area_A, 20130501, 10, 10
Prod_2, Area_B, 20130501, 5, 5
Prod_1, Area_B, 20130501, 3, 3
Prod_1, Area_A, 20130502, 12, 2
Prod_2, Area_B, 20130502, 5, 0
Prod_1, Area_B, 20130502, 4, 1
This can then be used in later analysis.
In case this makes any difference to the approach, I receive a new data file each day with the latest sales information. Therefore I need to append the new data to the existing data, and work out the daily sales figure. This is why I have kept the SalesThisYear field in the transformed data, so this field can be used to calculate the new DailySales figures when the next data file arrives.
I'm new to R, so I am working out the best way to solve this problem. I recognize I have two categorical fields, so I anticipated an approach that factors on these fields. My overall thinking was to write a function and then use an apply command to run it against the entire data set. As an overview, my thinking is:
(First load data file into R. Append second data file into R using rbind.)
Create a function that does the following:
1. Identify products and geographies using factor or similar.
2. Identify the largest and second-largest dates.
3. For each product and geography combination, find the SalesThisYear values for the appended data and the original data, using the date values obtained in step 2 (I'm thinking of using the subset function here). Subtract the two values: this becomes the DailySales value. (There would need to be error-checking logic in case a new geography or product was introduced.)
4. Append this new DailySales value to the results.
Data volume is about 120k rows per day, so the standard route of using a for loop in step 3 may not be advisable.
Is the above approach appropriate? Or is there an unknown unknown I need to learn? :)
# d is the combined sales data, ordered by Date within each group;
# within each Product/Geography group, subtract the previous cumulative
# total from the current one (prepending 0 for each group's first row)
transform(d,
          SalesThisDay = ave(SalesThisYear, Product, Geography,
                             FUN = function(x) x - c(0, head(x, -1))))
# Product Geography Date SalesThisYear SalesThisDay
# 1 prod_1 area_a 20130501 10 10
# 2 prod_2 area_b 20130501 5 5
# 3 prod_1 area_b 20130501 3 3
# 4 prod_1 area_a 20130502 12 2
# 5 prod_2 area_b 20130502 5 0
# 6 prod_1 area_b 20130502 4 1
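An equivalent way to write the group-wise difference, under the same assumption that rows are ordered by Date within each Product/Geography group:
transform(d,
          SalesThisDay = ave(SalesThisYear, Product, Geography,
                             FUN = function(x) c(x[1], diff(x))))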