This follows a question I asked recently on the forum (the first part was brilliantly solved by a forum member, who advised me to post this one as a new question). I now have the following issue and hope you can help: I have a huge database (which I cannot disclose), structured as follows:
5 million observations
4 variables of interest:
Code ID Buy
Code ID Sell
Date
New: if there was no prior transaction between the buyer and the seller, line i is the first occurrence of the pair, so the variable new takes the value 1
Distance: the distance in months between the two last occurrences of new = 1, provided Code_ID_Buy is the same for the two rows
I would like another variable, Distancewithlastr1, which does the same thing as Distance but is also filled in for the lines where new = 0:
A reduced sample would look like this:
library(data.table)
set.seed(1)
Data <- data.frame(
  Month = c(1,1,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,6,3,4,5),
  Amount = rnorm(21, mean = 100, sd = 20),
  Code_ID_Buy = c(rep("100D", 18), rep("102D", 3)),
  Code_ID_Sell = c("98C","99C","98C","99C","98C","99C","96V","98C","99C","96V",
                   "98C","99C","96V","94D","98C","99C","96V","94D","25A","25A","25A")
)
Data$new <- 0
# flag the first occurrence of each buyer/seller pair
setDT(Data)[order(Month, Code_ID_Buy, Code_ID_Sell), new := {
  r <- rowid(Code_ID_Buy, Code_ID_Sell)
  +(r == 1L)
}]
Data[Month == 1L, new := 0L]  # month 1 has no history, so nothing can be "new"
# months between the two last new = 1 rows of the same buyer
Data[new == 1L, distance := .SD[.SD, on = .(Code_ID_Buy, Month < Month), mult = "last",
                                by = .EACHI, i.Month - x.Month]$V1]
Data$Distancewithlastr1 = c(NA,NA,NA,NA,NA,NA,NA,1,1,1,2,2,2,2,1,1,1,1,NA,1,2)
Data$LastRelationshipseller = c("98C","98C","98C","98C","98C","98C","98C","96V","96V","96V","96V","96V","96V","96V","94D","94D","94D","94D",NA,"25A","25A")
View(Data)
Here Distancewithlastr1 should be NA if month = 1 and, more generally, as long as no new = 1 has yet been recorded for the buyer. If there is a new = 1 for a given buyer in month x, then Distancewithlastr1 in a later month i equals i - x.
Thank you in advance,
Here is an option:
# cumsum(new) segments each buyer's rows at every new = 1; rleid(Month) - 1 counts
# months since the segment started (k == 0 rows, before the first new = 1, stay NA);
# each new = 1 row then takes the preceding row's value: its distance to the previous new = 1.
Data[, dlr := if (k > 0L) rleid(Month) - 1L, by = .(Code_ID_Buy, k = cumsum(new))][,
    dlr := fifelse(new == 1L, shift(dlr), dlr), by = Code_ID_Buy]
I'm trying to work out the last date on which an employee moved from one level to another (from level B to level C, for instance) while ignoring changes that are due to a company reorg.
Meaning: if someone moved from band 1 to band 3 and then to level A, I'd like to record only the first move (band 1 to band 3); but if someone moved from level A to B and then to C, I'd like to record the move from B to C. Below is the code I've tried.
df1 <- data.table(snap)
setorderv(df1, c("workday_id", "record_date"))
df1[, career_level_lagged := shift(career_level, n=1, type="lag"), by= "workday_id"]
df1 <- df1[career_level_lagged != career_level]
# identify latest row for each emp
df1[, c("row", "number_rows") := list(frank(record_date), .N), by="workday_id"]
df1 <- df1[row==number_rows]
The code picks up employees both with and without career changes, which is not what I want.
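No mapping between the old bands and the new levels is given in the question, but assuming such a lookup exists, one way is to normalize every row onto the new levels first, so that a reorg relabel is no longer counted as a change. A minimal sketch, with purely hypothetical level_map values (band 3 and level A treated as the same rung, per the example above):
library(data.table)

# hypothetical mapping of pre-reorg bands onto post-reorg levels;
# the real correspondence would come from the reorg documentation
level_map <- c("band 1" = "L1", "band 2" = "L2", "band 3" = "LA",
               "level a" = "LA", "level b" = "LB", "level c" = "LC")

df1 <- data.table(snap)
setorderv(df1, c("workday_id", "record_date"))
df1[, level_norm := level_map[career_level]]
df1[, level_norm_lagged := shift(level_norm, type = "lag"), by = workday_id]

# keep genuine moves only (the NA in each employee's first row drops out here),
# then take the latest move per employee; employees with no real move disappear
moves <- df1[level_norm_lagged != level_norm]
latest_move <- moves[, .SD[which.max(record_date)], by = workday_id]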
I am trying to calculate how many pid within a set of fids have a yob smaller than a given person's yob. The second question is about counting unique pid. Updating the question based on @langtang's efforts and my own reflections:
#Libraries:
library(data.table)
library(tictoc)
#Make it replicable:
set.seed(1)
#Define parameters of the simulation:
pid<-1:1000
fid<-1:5
time_periods<-1:12
yob<-sample(seq(1900,2010),length(pid),replace = TRUE)
#Obtain the number of firms a given pid works in, in a given month:
nr_firms_pid_time<-sample(1:length(fid),length(pid),replace = TRUE)
#This means:
#First pid: works in first firm;
#Second pid: works in first four firms;
#Third pid: works in first firm;
#Fourth pid: works in two firms.
#Aux functions:
function_rep <- function(x) {
  rep(1:12, x)
}
function_seq <- function(x) {
  1:x
}
#Create panel
data_panel<-data.table(pid = rep(pid,nr_firms_pid_time*length(time_periods)))
data_panel[,yearmonth:=do.call(c,sapply(nr_firms_pid_time,function_rep))]
data_panel[,fid:=rep(do.call(c,sapply(nr_firms_pid_time,function_seq)),each = 12)]
#Merge in yob:
data_yob<-data.table(pid = pid,yob = yob)
data_panel<-merge(data_panel,data_yob,by = c("pid"),all.x = TRUE)
#Remove not needed stuff:
rm(pid)
rm(fid)
rm(time_periods)
rm(yob)
rm(data_yob)
#Solution 1 (terribly slow):
# make a small function that counts the number of coworkers with
# earlier dob than this individual
older_coworkers = function(id, yrmonth) {
  #First obtain firms in which a worker works in a given month:
  id_firms <- data_panel[pid == id & yearmonth == yrmonth, fid]
  #Then extract data at a given month:
  data_func <- data_panel[(fid %in% id_firms) & (yearmonth == yrmonth)]
  #Then extract his dob:
  dob_to_use <- unique(data_func[pid == id, yob])
  sum(data_func[pid != id]$yob < dob_to_use)
}
older_coworkers_unique = function(id, yrmonth) {
  #First obtain firms in which a worker works in a given month:
  id_firms <- data_panel[pid == id & yearmonth == yrmonth, fid]
  #Then extract data at a given month:
  data_func <- data_panel[(fid %in% id_firms) & (yearmonth == yrmonth)]
  #Then extract his dob:
  dob_to_use <- unique(data_func[pid == id, yob])
  #Get UNIQUE number of coworkers:
  sum(unique(data_func[pid != id], by = c("pid"))$yob < dob_to_use)
}
#Works but is terribly slow:
tic()
sol_1<-data_panel[, .(older_coworkers(.BY$pid,.BY$yearmonth)),by = c("pid","yearmonth")]
toc()
#Solution 2 (better, but I don't like it; what if I want unique older coworkers?)
function_older <- function(x) {
  noc <- lapply(
    1:length(x),
    function(i) {
      sum(x[-i] < x[i])
    }
  )
  unlist(noc)
}
#This is fast but I cannot get the unique count:
tic()
sol_2<-data_panel[,.(pid,function_older(yob)),by = c("fid","yearmonth")][,sum(V2),by = c("pid","yearmonth")][order(pid,yearmonth)]
toc()
#Everything works:
identical(sol_1,sol_2)
The question is how to implement older_coworkers_unique in a very fast manner. Any suggestions would be greatly appreciated.
Update, based on OP's new reproducible dataset
If you want a one-liner to reproduce sol_2 above, you can do this:
data_panel[data_panel, on = .(yearmonth, fid, yob < yob)][, .N, by = .(i.pid, yearmonth)]
Explanation:
The above uses a non-equi join, which can be a helpful approach with data.table. I am joining data_panel to itself, requiring that yearmonth and fid be equal, but that year of birth on the left side of the join be less than year of birth on the right side. This returns a data.table where firm and yearmonth match, and every older coworker (pid) is matched to each of their younger coworkers (i.pid). We can thus count the rows (.N) by each younger coworker (i.pid) and yearmonth. This produces the same result as sol_1 and sol_2 above. You commented that you would like to count unique coworkers, and the second approach below does that, using length(unique(pid)).
The same non-equi join approach can be used to get unique older coworkers, like this:
data_panel[data_panel, on = .(yearmonth, fid, yob < yob)][
  , .(older_coworkers = length(unique(pid))), by = .(i.pid, yearmonth)]
Previous Response, based on OP's original very small example dataset
I'm not sure exactly what you want the output to look like. In your example data, I first drop the duplicate row (I couldn't understand why it was there; see my comment above), and then apply a function that counts the number of older coworkers for each pid/fid/ym.
# make your example data unique
data = unique(data)
# make a small function that counts the number of coworkers with
# earlier dob than this individual
older_coworkers = function(birth, firm, yrmonth, id) {
  data[dob < birth & fid == firm & ym == yrmonth & pid != id, .N]
}
# apply the function to the data
data[, .(num_older_coworkers = older_coworkers(dob, .BY$fid, .BY$ym, .BY$pid)),
     by = .(pid, fid, ym)]
Output:
pid fid ym num_older_coworkers
1: 1 1 200801 1
2: 1 2 200802 0
3: 2 1 200801 0
4: 3 2 200801 0
Person 1 at Firm 1 has one older coworker in the month of 2008-01 -- that is, Person 2 at Firm 1 in 2008-01.
Person 1 at Firm 2 (born in 1950) would also have an older coworker, namely Person 3 at Firm 2 (born in 1930), but the result shows 0 because Person 1's ym at Firm 2 (2008-01) does not match that potential older coworker's ym (2008-02).
I am learning data.table and got confused at one point, and need help understanding how to achieve the following. From my data, I need to filter out those brands which have sales of 0 in the 1st period OR do not have sales > 0 in at least 14 periods. I have tried, and I think I have achieved the 1st part; however, I cannot work out the second part, filtering those brands which do not have sales > 0 in at least 14 periods.
Below are my sample data and the code that I have written. Please suggest how I can achieve the second part.
library(data.table)
#### set the seed value
set.seed(9901)
#### create the sample variables for creating the data
group <- sample(1:7,1200,replace = T)
brn <- sample(1:10,1200,replace = T)
period <- rep(101:116,75)
sales <- sample(0:50,1200,replace = T)
#### create the data.table
df1 <- data.table(cbind(group,brn,period,sales))
#### taking the minimum value by group x brand x period
df1_min <- df1[,.(min1 = min(sales,na.rm = T)),by = c('group','brn','period')][order(group,brn,period)]
#### creating the filter
df1_min$fil1 <- ifelse(df1_min$period == 101 & df1_min$min1 == 0,1,0)
Thank you !!
This assumes that the first restriction applies to the dataset-wide minimum period (101), implying that brn/group pairs starting with a 0-sales period later than 101 are still included.
# 1. brn/group pairs with sales of 0 in the 1st period.
brngroup_zerosales101 = df1[sales == 0 & period == min(period), .(brn, group)]
# 2a. Identify brn/group pairs with <14 positive sale periods
df1[, posSale := ifelse(sales > 0, 1, 0)] # Was the period sale positive?
# 2b. For each brn/group pair, sum posSale and filter posSale < 14
brngroup_sub14 = df1[, .(GroupBrnPosSales = sum(posSale)), by = .(brn, group)][GroupBrnPosSales < 14, .(brn, group)]
# 3. Join the two restrictions
restr = rbindlist(list(brngroup_zerosales101, brngroup_sub14))
df1[, ID := paste(brn, group)] # Create a brn-group ID
restr[, ID := paste(brn, group)] # See above
filtered = df1[!(ID %in% restr[,ID]),]
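As an aside, the paste-based ID columns can be avoided with data.table's anti-join syntax; a sketch that should be equivalent under the same assumptions:
# anti-join: keep only df1 rows whose (brn, group) pair does not appear in restr
filtered = df1[!restr, on = .(brn, group)]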
I am trying to add a new row named Total to the dataframe below. For the Counts, Cost, and Views columns the total should be the column sum, but for Average I want a mean, and for avg_views a custom formula. How can I do that? I did use the janitor library (adorn_totals("row")), but it just does the sum. Below is the sample dataframe:
data.frame(stringsAsFactors = FALSE,
           Site = c("Channel1", "Channel2", "Channel3", "Channel4"),
           Counts = c(7637587, 19042385, 72019057, 45742745),
           Cost = c(199999.993061, 102196.9726, 102574.79, 196174.712132),
           Views = c(3007915, 5897235, 14245859, 24727451),
           Average = c(2.54, 3.23, 5.05543800482653, 2.21111111111111),
           avg_views = c(7.5197875, 14.7430875, 35.6146475, 48.24))
I'm not sure if this can help you, but here is what I use when I want to add a Total row at the end; I use it together with the data.table package.
Code example:
dt <- rbind(dt, data.table(Site = "Total",
Counts = sum(dt[, Counts]),
Cost = sum(dt[, Cost]),
Views = mean(dt[, Views]),
Average = sum(dt[, Average]),
avg_views = paste("Hi OP")))
Output:
Site Counts Cost Views Average avg_views
1: Channel1 7637587 200000.0 3007915 2.540000 7.519787
2: Channel2 19042385 102197.0 5897235 3.230000 14.743087
3: Channel3 72019057 102574.8 14245859 5.055438 35.614647
4: Channel4 45742745 196174.7 24727451 2.211111 48.240000
5: Total 144441774 600946.5 11969615 13.036549 Hi OP
You can apply whatever functions you want; in this example I used sum() and mean(), but you could use anything.
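For instance, a variant closer to the question's description, with column sums for Counts, Cost, and Views and a mean for Average; since the question doesn't spell out the custom formula for avg_views, the one below (overall views per count) is only a hypothetical stand-in:
library(data.table)
dt <- as.data.table(df)  # df holds the sample data.frame from the question

dt <- rbind(dt, data.table(Site      = "Total",
                           Counts    = sum(dt$Counts),
                           Cost      = sum(dt$Cost),
                           Views     = sum(dt$Views),
                           Average   = mean(dt$Average),
                           # hypothetical stand-in for the custom formula
                           avg_views = sum(dt$Views) / sum(dt$Counts)))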
Here's a base way. It's ok.
DF_summary <- colSums(DF[, -1])   # column sums for all numeric columns
DF_summary[4] <- mean(DF[, 5])    # overwrite the Average slot with a mean instead
rbind(DF, c('Total', DF_summary))
Site Counts Cost Views Average avg_views
1 Channel1 7637587 199999.993061 3007915 2.54 7.5197875
2 Channel2 19042385 102196.9726 5897235 3.23 14.7430875
3 Channel3 72019057 102574.79 14245859 5.05543800482653 35.6146475
4 Channel4 45742745 196174.712132 24727451 2.21111111111111 48.24
5 Total 144441774 600946.467793 47878460 3.25913727898441 106.1175225
I have two data sets, Transaction_long and Transaction_short. Transaction_long has many quotes of policy and price, with the purchase points (Purchased = 1) included in the dataset; Transaction_short has only the entries for the purchase points.
My objective is to add a column called Policy_Change_Frequency to the Transaction_short dataset: for every customer in the short dataset, iterate over that customer's rows in the long dataset and calculate how many times the policy changed.
To find the policy changes I can use sum(diff(Transaction_Long$Policy) != 0), but I am not sure how to iterate over these two data sets and collect the results.
Details:
Customer_Name : name of customer
Customer_ID: Customer Identifier number
Purchased: Boolean variable (Yes = 1, No = 0)
Policy: categorical (takes values 1-5)
Price: price quoted
Data set 1: Transaction_Long
Customer_Name,Customer_ID,Purchased,Policy,Price
Joe,101,0,1,500
Joe,101,0,1,505
Joe,101,0,2,510
Joe,101,0,2,504
Joe,101,0,2,507
Joe,101,0,1,505
Joe,101,1,3,501
Mary,103,0,1,675
Mary,103,0,3,650
Mary,103,0,2,620
Mary,103,0,2,624
Mary,103,0,2,630
Mary,103,1,2,627
Data set 2: Transaction_Short
Customer_Name,Customer_ID,Purchased,Policy,Price
Joe,101,1,3,501
Mary,103,1,2,627
I need to add a Policy_Change_Freq column to the Transaction_Short dataset, so my final Transaction_Short dataset should look like this:
Customer_Name,Customer_ID,Purchased,Policy,Price,Policy_ChangeFreq
Joe,101,1,3,501,3
Mary,103,1,2,627,2
Consider a calculated column for policy change that tags each change from the previous row within a customer with a one, then aggregate the ones for a count. merge is used because two aggregations are needed (the final row for each customer, and the PolicyChanged count):
Transaction_Long$PolicyChangedFreq <- sapply(1:nrow(Transaction_Long),
  function(i)
    if (i > 1) {
      ifelse(Transaction_Long$Policy[i-1] == Transaction_Long$Policy[i], 0,
             ifelse(Transaction_Long$Customer_ID[i-1] !=
                      Transaction_Long$Customer_ID[i], 0, 1))
    } else { 0 }
)
Transaction_Final <- merge(aggregate(. ~ Customer_ID + Customer_Name,
                                     Transaction_Long[, c(1:5)], FUN = tail, n = 1),
                           aggregate(. ~ Customer_ID + Customer_Name,
                                     Transaction_Long[, c(1:2, 6)], FUN = sum),
                           by = c('Customer_ID', 'Customer_Name'))
Transaction_Final
# Customer_ID Customer_Name Purchased Policy Price PolicyChangedFreq
#1 101 Joe 1 3 501 3
#2 103 Mary 1 2 627 2
@Parfait, thank you for the solution. I solved this using the sqldf package in R:
library(sqldf)
for (i in 1:nrow(Transaction_short)) {
  sql <- sprintf("SELECT policy FROM Transaction_long WHERE customer_ID = %s",
                 Transaction_short$customer_ID[i])
  df <- sqldf(sql)
  NF <- sum(df$policy[-1] != df$policy[-length(df$policy)])
  Transaction_short$Policy_Change_Freq[i] <- NF
}
Since I have about 500K rows in the long dataset and about 100K in the short dataset, this is taking a while. Is there any other solution that does not require loops? Thank you.
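For the follow-up question, here is a loop-free sketch with data.table, assuming the long data is ordered by quote time within each customer (so diff over consecutive rows counts the changes) and using the column names from the sample data:
library(data.table)
setDT(Transaction_long)
setDT(Transaction_short)

# one pass over the long data: count consecutive-row policy changes per customer
changes <- Transaction_long[, .(Policy_Change_Freq = sum(diff(Policy) != 0)),
                            by = Customer_ID]
Transaction_short <- merge(Transaction_short, changes,
                           by = "Customer_ID", all.x = TRUE)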