In R, iterating over two datasets and getting back results without looping

I have two data sets, Transaction_long and Transaction_short. Transaction_long has many quotes of policy and price per customer, with the purchase point denoted by a 1 in the Purchased column. Transaction_short has only the purchase-point rows.
My objective is to add a column called Policy_Change_Frequency to the Transaction_short dataset. For every customer in the short dataset, iterate over that customer's rows in the long dataset and calculate how many times the policy changed.
To find the policy changes I can use sum(diff(Transaction_Long$Policy) != 0), but I am not sure how to iterate over these two data sets and collect the results.
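For example, applied to Joe's Policy values from the long dataset below, this idiom counts three changes:
policy <- c(1, 1, 2, 2, 2, 1, 3)  # Joe's Policy column
sum(diff(policy) != 0)            # 3: the changes 1->2, 2->1, 1->3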
Details:
Customer_Name: name of customer
Customer_ID: customer identifier number
Purchased: Boolean variable (Yes = 1, No = 0)
Policy: categorical (takes values 1-5)
Price: price quoted
Data set 1: Transaction_Long
Customer_Name,Customer_ID,Purchased,Policy,Price
Joe,101,0,1,500
Joe,101,0,1,505
Joe,101,0,2,510
Joe,101,0,2,504
Joe,101,0,2,507
Joe,101,0,1,505
Joe,101,1,3,501
Mary,103,0,1,675
Mary,103,0,3,650
Mary,103,0,2,620
Mary,103,0,2,624
Mary,103,0,2,630
Mary,103,1,2,627
Data set 2: Transaction_Short
Customer_Name,Customer_ID,Purchased,Policy,Price
Joe,101,1,3,501
Mary,103,1,2,627
I need to add a Policy_Change_Frequency column to the Transaction_Short dataset, so my final Transaction_Short dataset should look like this:
Customer_Name,Customer_ID,Purchased,Policy,Price,Policy_ChangeFreq
Joe,101,1,3,501,3
Mary,103,1,2,627,2

Consider a calculated column that flags a policy change with a 1 whenever the policy differs from the previous row within the same customer, then aggregate the flags into a count. merge() is needed because two aggregations are involved (the final row for each customer and the PolicyChanged count):
Transaction_Long$PolicyChangedFreq <- sapply(1:nrow(Transaction_Long),
  function(i)
    if (i > 1) {
      ifelse(Transaction_Long$Policy[i-1] ==
               Transaction_Long$Policy[i], 0,
             ifelse(Transaction_Long$Customer_ID[i-1] !=
                      Transaction_Long$Customer_ID[i], 0, 1))
    } else { 0 }
)
Transaction_Final <- merge(aggregate(. ~ Customer_ID + Customer_Name,
                                     Transaction_Long[, c(1:5)], FUN = tail, n = 1),
                           aggregate(. ~ Customer_ID + Customer_Name,
                                     Transaction_Long[, c(1:2, 6)], FUN = sum),
                           by = c('Customer_ID', 'Customer_Name'))
Transaction_Final
# Customer_ID Customer_Name Purchased Policy Price PolicyChangedFreq
#1 101 Joe 1 3 501 3
#2 103 Mary 1 2 627 2

@Parfait, thank you for the solution. I solved this using the sqldf package in R:
library(sqldf)
for (i in 1:nrow(Transaction_short)) {
  sql <- sprintf("SELECT policy FROM Transaction_long WHERE customer_ID = %s",
                 Transaction_short$customer_ID[i])
  df <- sqldf(sql)
  NF <- sum(df$policy[-1] != df$policy[-length(df$policy)])
  Transaction_short$Policy_Change_Freq[i] <- NF
}
Since I have about 500K rows in the long dataset and about 100K in the short one, this is taking a while. Is there any other solution that does not require loops? Thank you.
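One loop-free option (a sketch, assuming the column names shown in the example data above) is to compute the counts once per customer in a single grouped call and merge them in, which avoids issuing one SQL query per row:
library(data.table)
setDT(Transaction_long)
# one grouped pass: count policy changes per customer
changes <- Transaction_long[, .(Policy_Change_Freq = sum(diff(Policy) != 0)),
                            by = Customer_ID]
# attach the counts to the purchase-point table
Transaction_short <- merge(Transaction_short, changes,
                           by = "Customer_ID", all.x = TRUE)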

Related

data.table aggregation based on multiple criteria

I am trying to calculate how many pid within a set of fid's have a yob smaller than the person's yob. The second question is about unique pid. I am updating the question based on @langtang's efforts and my own reflections:
#Libraries:
library(data.table)
library(tictoc)
#Make it replicable:
set.seed(1)
#Define parameters of the simulation:
pid<-1:1000
fid<-1:5
time_periods<-1:12
yob<-sample(seq(1900,2010),length(pid),replace = TRUE)
#Obtain in how many firms a given pid works in a given month:
nr_firms_pid_time<-sample(1:length(fid),length(pid),replace = TRUE)
#This means:
#First pid: works in first firm;
#Second pid: works in first four firms;
#Third pid: works in first firm;
#Fourth pid: works in two firms.
#Aux functions:
function_rep <- function(x) {
  rep(1:12, x)
}
function_seq <- function(x) {
  1:x
}
#Create panel
data_panel<-data.table(pid = rep(pid,nr_firms_pid_time*length(time_periods)))
data_panel[,yearmonth:=do.call(c,sapply(nr_firms_pid_time,function_rep))]
data_panel[,fid:=rep(do.call(c,sapply(nr_firms_pid_time,function_seq)),each = 12)]
#Merge in yob:
data_yob<-data.table(pid = pid,yob = yob)
data_panel<-merge(data_panel,data_yob,by = c("pid"),all.x = TRUE)
#Remove not needed stuff:
rm(pid)
rm(fid)
rm(time_periods)
rm(yob)
rm(data_yob)
#Solution 1 (terribly slow):
# make a small function that counts the number of coworkers with
# earlier dob than this individual
older_coworkers = function(id, yrmonth) {
  # First obtain the firms in which a worker works in a given month:
  id_firms <- data_panel[pid == id & yearmonth == yrmonth, fid]
  # Then extract the data for that month:
  data_func <- data_panel[(fid %in% id_firms) & (yearmonth == yrmonth)]
  # Then extract this worker's yob:
  dob_to_use <- unique(data_func[pid == id, yob])
  sum(data_func[pid != id]$yob < dob_to_use)
}
older_coworkers_unique = function(id, yrmonth) {
  # First obtain the firms in which a worker works in a given month:
  id_firms <- data_panel[pid == id & yearmonth == yrmonth, fid]
  # Then extract the data for that month:
  data_func <- data_panel[(fid %in% id_firms) & (yearmonth == yrmonth)]
  # Then extract this worker's yob:
  dob_to_use <- unique(data_func[pid == id, yob])
  # Get the UNIQUE number of older coworkers:
  sum(unique(data_func[pid != id], by = c("pid"))$yob < dob_to_use)
}
#Works but is terribly slow:
tic()
sol_1<-data_panel[, .(older_coworkers(.BY$pid,.BY$yearmonth)),by = c("pid","yearmonth")]
toc()
#Solution 2 (better, but I don't like it; what if I want unique older coworkers?)
function_older <- function(x) {
  noc <- lapply(
    seq_along(x),
    function(i) {
      sum(x[-i] < x[i])
    }
  )
  unlist(noc)
}
#This is fast, but I cannot get the unique number:
tic()
sol_2 <- data_panel[, .(pid, function_older(yob)), by = c("fid", "yearmonth")][
  , sum(V2), by = c("pid", "yearmonth")][order(pid, yearmonth)]
toc()
#Everything works:
identical(sol_1,sol_2)
The question is how to implement older_coworkers_unique in a very fast manner. Any suggestions would be greatly appreciated.
Update, based on OP's new reproducible dataset
If you want a one-liner to reproduce sol_2 above, you can do this:
data_panel[data_panel, on=.(yearmonth, fid, yob<yob)][, .N, by=.(i.pid, yearmonth)]
Explanation:
The above uses a non-equi join, which can be a helpful approach with data.table. I am joining data_panel on itself, requiring that yearmonth and fid be equal, but that the year of birth on the left side of the join be less than the year of birth on the right side. This returns a data.table in which firm and yearmonth match, but every older coworker (pid) is matched to each of their younger coworkers (i.pid). We can thus count the rows (.N) for each younger coworker (i.pid) and yearmonth. This produces the same result as sol_1 and sol_2 above. You commented that you would like to count unique coworkers, and the second approach below does that by using length(unique(pid)).
The same non-equi join approach can be used to get unique older coworkers, like this:
library(magrittr)  # for the %>% pipe
data_panel[data_panel, on=.(yearmonth, fid, yob<yob)] %>%
  .[, .(older_coworkers = length(unique(pid))), by=.(i.pid, yearmonth)]
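One caveat worth checking (a sketch reusing sol_2 from the question): in X[i] joins an i row with no match is kept with NAs by default, so .N counts it as 1; with nomatch=NULL such rows drop out entirely. The comparison below therefore restricts sol_2 to its nonzero rows:
sol_3 <- data_panel[data_panel, on=.(yearmonth, fid, yob<yob), nomatch=NULL][
  , .(V1 = .N), by=.(pid = i.pid, yearmonth)][order(pid, yearmonth)]
all.equal(sol_2[V1 > 0], sol_3, check.attributes = FALSE)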
Previous Response, based on OP's original very small example dataset
I'm not sure exactly what you want the output to look like. However, in your example data I first drop the duplicate row (because I couldn't understand why it was there; see my comment above), and then I apply a function that counts the number of older coworkers for each pid/fid/ym.
# make your example data unique
data <- unique(data)
# make a small function that counts the number of coworkers with
# an earlier dob than this individual
older_coworkers = function(birth, firm, yrmonth, id) {
  data[dob < birth & fid == firm & ym == yrmonth & pid != id, .N]
}
# apply the function to the data
data[, .(num_older_coworkers = older_coworkers(dob, .BY$fid, .BY$ym, .BY$pid)),
     by = .(pid, fid, ym)]
Output:
pid fid ym num_older_coworkers
1: 1 1 200801 1
2: 1 2 200802 0
3: 2 1 200801 0
4: 3 2 200801 0
Person 1 at Firm 1 has one older coworker in the month 2008-01, namely Person 2 at Firm 1 in 2008-01.
Person 1 at Firm 2 (born in 1950) would also seem to have an older coworker, namely Person 3 at Firm 2 (born in 1930), but the result shows 0 because Person 1's ym at Firm 2 (2008-01) does not match that potential older coworker's ym (2008-02).

Create a list from a complex comparison of two lists

I am working on market transaction data where each observation contains a buyer id and a seller id. For each observation (i.e., each transaction), I would like to create a variable equal to the number of other transactions the associated seller has done with a different buyer than the one involved in this transaction. As a consequence, for the following data
data <- data.frame(Buyer_id = c("001","001","002","001"), Seller_id = c("021","022","022","021"))
I would like to obtain:
Result <- list(0,1,1,0)
I searched existing answers to problems similar to mine, usually involving the function mapply(), and tried to implement them, but without success.
Thank you very much for helping me.
Are you looking for something like this? If yes, then you might want to change your reproducible example to use c instead of list when you construct your data.frame.
data <- data.frame(Buyer_id = c("001","001","002","001"),
Seller_id = c("021","022","022","021"))
data$n <- NA
for (i in seq_len(nrow(data))) {
  seller <- as.character(data[i, "Seller_id"])
  buyer  <- as.character(data[i, "Buyer_id"])
  with.buyers <- as.character(data[data$Seller_id == seller, "Buyer_id"])
  with.buyers <- unique(with.buyers)
  diff.buyers <- with.buyers[!(with.buyers %in% buyer)]
  data[i, "n"] <- length(diff.buyers)
}
Buyer_id Seller_id n
1 001 021 0
2 001 022 1
3 002 022 1
4 001 021 0
Apart from Roman Lustrik's solution, there is also an approach that uses graphs.
library(igraph)
data <- data.frame(Seller_id = c("021","022","022","021"),
Buyer_id = c("001","001","002","001"),
stringsAsFactors = FALSE)
my.graph <- graph_from_data_frame(data)
plot(my.graph)
degree(my.graph, mode = c("out"))
# Transform the graph into a simple graph. A simple graph does not allow
# duplicate edges.
my.graph <- simplify(my.graph)
plot(my.graph)
degree(my.graph, mode = c("out"))
V(my.graph)$out.degree <- degree(my.graph, mode = c("out"))
data$n <- apply(data,
                MARGIN = 1,
                FUN = function(transaction) {
                  node.out.degree <- V(my.graph)$out.degree[V(my.graph)$name == transaction["Seller_id"]]
                  if (node.out.degree <= 1) {
                    # Since the vertex has at most 1 out-degree, we know that the current
                    # transaction is the only appearance of the current seller.
                    return(0)
                  } else {
                    # In this case, we know that the seller participates in at least one more
                    # transaction. We therefore simply subtract one (the current transaction)
                    # from the out-degree.
                    return(node.out.degree - 1)
                  }
                })
data
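Since the question mentions mapply(), a compact loop-free alternative is possible as well (a sketch against the same data; note it counts transactions rather than unique buyers, which coincides for this example):
data$n2 <- mapply(function(b, s) sum(data$Seller_id == s & data$Buyer_id != b),
                  data$Buyer_id, data$Seller_id)
data$n2
# [1] 0 1 1 0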

Calculating distance in months between two occurrences in a data set

This follows a question I asked recently on the forum (the first part was brilliantly solved by a forum member, who advised me to post a new question for this one). I now have the following issue and hope you may help me. I have a huge database (which I cannot disclose), but it is structured as follows:
5 million observations
4 variables of interest:
Code ID Buy
Code ID Sell
Date
New: if there was no prior transaction between the buyer and the seller, line i is their first occurrence, so the variable new takes the value 1
Distance: the distance in months between the two last occurrences of new = 1, provided Code_ID_Buy is the same for the two rows
I would like another variable, Distancewithlastr1, which does the same thing as distance but also fills the lines where new = 0:
A reduced sample would look like this:
library(data.table)
set.seed(1)
Data <- data.frame(
Month = c(1,1,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,6,3,4,5),
Amount = rnorm(21,mean=100,sd=20),
Code_ID_Buy = c("100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","102D","102D","102D"),
Code_ID_Sell = c("98C","99C","98C","99C","98C","99C","96V","98C","99C","96V","98C","99C","96V","94D","98C","99C","96V","94D","25A","25A","25A")
)
Data$new<-0
setDT(Data)[order(Month, Code_ID_Buy, Code_ID_Sell), new := {
r <- rowid(Code_ID_Buy, Code_ID_Sell)
+(r==1L)
}]
Data[Month==1L, new:=0L]
Data[new==1L, distance := .SD[.SD, on=.(Code_ID_Buy, Month<Month), mult="last",
by=.EACHI, i.Month - x.Month]$V1]
Data$Distancewithlastr1 = c(NA,NA,NA,NA,NA,NA,NA,1,1,1,2,2,2,2,1,1,1,1,NA,1,2)
Data$LastRelationshipseller = c("98C","98C","98C","98C","98C","98C","98C","96V","96V","96V","96V","96V","96V","96V","94D","94D","94D","94D",NA,"25A","25A")
View(Data)
Here Distancewithlastr1 should take the value NA if month = 1 and no new = 1 has been recorded since month 1. If there is a new = 1 for a given buyer, say in month x, then Distancewithlastr1 in month i equals month i - month x.
Thank you in advance,
Here is an option:
Data[, dlr := if (k > 0L) rleid(Month) - 1L, by = .(Code_ID_Buy, k = cumsum(new))][
  , dlr := fifelse(new == 1L, shift(dlr), dlr), by = Code_ID_Buy]
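In words: within each buyer, cumsum(new) splits the rows into stretches that begin at each new relationship; rleid(Month) - 1 then counts the distinct months elapsed inside a stretch, and the second step shifts that value onto the new == 1 row itself. A quick check against the expected column constructed in the question (a sketch, run after the code above) should return TRUE:
all.equal(Data$dlr, Data$Distancewithlastr1)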

Logic parsing in R

Building off of this question: Pass a data.frame with column names and fields as filter
Let's say we have the following data set:
filt = data.table(X1 = c("Gender","Male"),
                  X2 = c('jobFamilyGroup','Finance'),
                  X3 = c('jobFamilyGroup','Software Dev'))
df = data.table(Gender = c('Male','F','Male','Male','F'),
                EmployeeStatus = c('Active','na','Active','Active','na'),
                jobFamilyGroup = c('Finance','Software Dev','HR','Finance','Software Dev'))
and I want to use filt as a filter for df. filt comes from grabbing an input from Shiny and transforming it a bit into the data.table above. My goal is to filter df down to all rows that are Male AND (Software Dev OR Finance).
Currently I'm hardcoding the filter to always use AND, which isn't ideal for situations like this. My first thought was to add multiple if conditions to catch such cases, but I suspect there is an easier way to build this logic.
UPDATE
Once I have a table like filt I can pass code like:
if (!is.null(primary)) {
  if (ncol(primary) == 1) {
    d2 = df[get(as.character(primary[1,1])) == as.character(primary[2,1])]
  } else if (ncol(primary) == 2) {
    d2 = df[get(as.character(primary[1,1])) == as.character(primary[2,1]) &
            get(as.character(primary[1,2])) == as.character(primary[2,2])]
  } else {
    d2 = df[get(as.character(primary[1,1])) == as.character(primary[2,1]) &
            get(as.character(primary[1,2])) == as.character(primary[2,2]) &
            get(as.character(primary[1,3])) == as.character(primary[2,3])]
  }
}
But this code doesn't account for the OR logic needed when there are multiple inputs for one grouping. The current code says: give me all rows where Gender == 'Male' & jobFamilyGroup == 'Finance' & jobFamilyGroup == 'Software Dev', when it should really be Gender == 'Male' & (jobFamilyGroup == 'Finance' | jobFamilyGroup == 'Software Dev').
This is a minimal example; there are many other columns, so ideally the solution can detect when a grouping has multiple inputs.
Given your problem, what if you parsed it so your logic looked like:
Gender %in% c("Male") & jobFamilyGroup %in% c('Finance','Software Dev')
By lumping all filter values with the same column name together in an %in% you get your OR and you keep your AND between column names.
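Applied to the example df above, that expression keeps exactly the desired rows (a quick check):
df[Gender %in% c("Male") & jobFamilyGroup %in% c('Finance','Software Dev')]
#    Gender EmployeeStatus jobFamilyGroup
# 1:   Male         Active        Finance
# 2:   Male         Active        Finance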
UPDATE
Consider the case discussed in comments below.
Your reactive input is a data.table specifying:
Gender IS Male
Country IS China OR US
EmployeeStatus IS Active
In the sample data you provided there is no Country column, so I added one. I extract the columns to be filtered and the values to filter on, then split the values by column. I pass this into an lapply which does the logical check for each column using %in% rather than ==, so that options within the same column are treated as | instead of &. Then I rbind the logical results together, apply all over the columns, and filter df by the result.
This approach handles the & between columns and the | within columns, and it supports any number of columns to search, removing the need for your if/else logic.
library(data.table)
df = data.table(Gender = c('Male','F','Male','Male','F'),
                EmployeeStatus = c('Active','na','Active','Active','na'),
                jobFamilyGroup = c('Finance','Software Dev','HR','Finance','Software Dev'),
                Country = c('China','China','US','US','China'))
filt = data.table(x1 = c('Gender','Male'),
                  x2 = c('Country','China'),
                  x3 = c('Country','US'),
                  x4 = c('EmployeeStatus','Active'))
column = unlist(filt[1,])
value = unlist(filt[2,])
tofilter = split(value, column)
tokeep = apply(do.call(rbind, lapply(names(tofilter), function(x) {
  df[[x]] %in% tofilter[[x]]
})), 2, all)
df[tokeep == TRUE]
#> Gender EmployeeStatus jobFamilyGroup Country
#> 1: Male Active Finance China
#> 2: Male Active HR US
#> 3: Male Active Finance US

R: count events in a time range if they match a specific word

I have a difficult question I was hoping you friendly folks could help me with.
I have a dataset (df1) that includes the variables company id, day, day-3, and day+3.
Each id represents a company ticker symbol.
A second dataset (news) includes the variables company id, day, newstitle, and weight.
I want to create two variables, event1 and event2, in dataset df1:
event1 = the count of news items from dataset news whose date falls in the time range [df1$date_bef3, df1$date_aft3] and whose news$newstitle contains "order", for each df1$id
event2 = the count of news items in the same time range whose news$newstitle contains "dividend", for each df1$id
I have rewritten my question for better understanding. I really appreciate your help.
Best.
#this is the desired result with the new vars event1 and event2
out <- read.table(text="
id,date,date_bef3,date_aft3,event1,event2
1605,1992/12/15,1992/12/12,1992/12/18,0,0
1705,1992/12/30,1992/12/27,1993/1/2,1,1
3412,1992/12/31,1992/12/28,1993/1/3,0,0
9921,1993/1/7,1993/1/4,1993/1/10,0,0
2314,1993/1/18,1993/1/15,1993/1/21,1,0",
header=T,sep=",")
#this is index data
df1<-out[,-c(5,6)]
#this is simulated news source data
news<- read.table(text="
id,date,newstitle,weight
2543,1992/12/30,new order,1
1705,1992/12/29,dividend payment,1
1705,1993/1/1,new order,1
9921,1993/1/1,new product,1
2314,1993/1/16,new order,1",
header=T,sep=",")
It is like the following SQL (pseudo-code):
proc sql;
select *,
  sum(case when news.newstitle like '%order%' then 1 else 0 end) as event1,
  sum(case when news.newstitle like '%dividend%' then 1 else 0 end) as event2
from df1, news
where news.date between df1.date_bef3 and df1.date_aft3
  and news.id = df1.id
group by df1.id, df1.date;
My initial procedure:
1: inner merge df1 and news by "id"
2: set dummy event1 = 1 if the news date is between (date_bef3, date_aft3) and
   newstitle contains "order";
   set dummy event2 = 1 if the news date is between (date_bef3, date_aft3) and
   newstitle contains "dividend"
3: collapse (sum) event1 and event2 by (id, date)
setDT(df1, key = "id")
setDT(news, key = "id")
# inner merge
df <- df1[news, ]
# set dummy event1; something is wrong here
df[newstitle == "order", event1 := as.numeric(between(i.date, date_bef3, date_aft3))]
Based on your comment below, I believe that what you want is:
# Make dates actual dates instead of factors
library(lubridate)
df1$date_bef3 <- ymd(df1$date_bef3)
df1$date_aft3 <- ymd(df1$date_aft3)
news$date <- ymd(news$date)
event1 <- sum(news$newstitle[news$id == df1$id &
                             df1$date_bef3 <= news$date &
                             df1$date_aft3 >= news$date] == "new order")
event2 <- sum(news$newstitle[news$id == df1$id &
                             df1$date_bef3 <= news$date &
                             df1$date_aft3 >= news$date] == "dividend payment")
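If instead you want one count per row of df1, matching the desired out table above, a non-equi join is a more general way to express this. A sketch, run after the date conversion above; grepl is used so that titles merely containing the keyword match:
library(data.table)
setDT(df1); setDT(news)
# for each df1 row, count matching news items whose title contains the keyword
event_count <- function(keyword) {
  news[grepl(keyword, newstitle)][
    df1, on = .(id, date >= date_bef3, date <= date_aft3),
    sum(!is.na(x.date)), by = .EACHI]$V1
}
df1[, `:=`(event1 = event_count("order"), event2 = event_count("dividend"))]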
