I have a data frame that contains one column which indicates an event ID. There is another column that indicates the products used in that event. Each product would only be used one time for an event and each event contains at least one product. I would like to know how many times each product is used with every other product. Some sample data is below:
set.seed(1)
events <- paste('Event ', sample(1:4, size = 15, replace = TRUE), sep = '')
events <- events[order(events)]
prods <- paste('Product ', c(1, 2, 3, 4, 1, 5, 6, 2, 4, 6, 7, 1, 2, 3, 5))
test_data <- data.frame(events, prods)
test_data
events prods
1 Event 1 Product 1
2 Event 1 Product 2
3 Event 1 Product 3
4 Event 1 Product 4
5 Event 2 Product 1
6 Event 2 Product 5
7 Event 2 Product 6
8 Event 3 Product 2
9 Event 3 Product 4
10 Event 3 Product 6
11 Event 3 Product 7
12 Event 4 Product 1
13 Event 4 Product 2
14 Event 4 Product 3
15 Event 4 Product 5
Product 1 and Product 2 occur in the same event twice (Event 1 and Event 4). So I would want to return a '2' for that match. Product 1 and Product 7 never occur in the same event, so I'd want to return a 0 for that pair. For 'matches' between the same item, I am comfortable returning the total number of times that product is used.
There are two formats that are possible and I don't have a preference for which I'd like to see returned.
A short and fat data frame with the products running across the top as column headers and down the side as row headers. The body of this data frame would be populated by the number of matches.
A long, narrow data frame where there are two columns that will serve to represent all possible combinations of Product pairings and then a third column representing the number of times they match.
I have been experimenting with expand.grid with nothing to show for it.
Thank you!
Split prods by events, compute all the combn-inations within each event, then aggregate to get the count of each combination.
# split products by event, enumerate every pair within each event,
# and stack the pairs as rows
out <- t(do.call(cbind,
                 lapply(split(as.character(test_data$prods), test_data$events),
                        combn, 2)))
# attach a count of 1 to each pair, then sum the counts per unique pair
aggregate(count ~ ., data = transform(out, count = 1), FUN = sum)
# X1 X2 count
#1 Product 1 Product 2 2
#2 Product 1 Product 3 2
#3 Product 2 Product 3 2
#4 Product 1 Product 4 1
#5 Product 2 Product 4 2
#6 Product 3 Product 4 1
#7 Product 1 Product 5 2
#8 Product 2 Product 5 1
#9 Product 3 Product 5 1
#10 Product 1 Product 6 1
#11 Product 2 Product 6 1
#12 Product 4 Product 6 1
#13 Product 5 Product 6 1
#14 Product 2 Product 7 1
#15 Product 4 Product 7 1
#16 Product 6 Product 7 1
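If you'd rather have the wide, "short and fat" format, a minimal base-R sketch using table() and crossprod() should also work; it even returns 0 for pairs that never co-occur and puts the total usage counts on the diagonal:
tab <- table(test_data$events, test_data$prods)  # event-by-product incidence matrix
# crossprod(tab) is t(tab) %*% tab: entry (i, j) counts the events containing
# both products i and j; the diagonal holds each product's total number of uses
crossprod(tab)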
Maybe this is using a sledgehammer to crack a nut, but you could mine (frequent) item sets, which comes with other fancy stuff. It could work like this:
library(arules)
library(reshape2)
# one row per event, one logical column per product
trans <- as(sapply(dcast(test_data, events ~ prods, fun.aggregate = length,
                         value.var = "prods")[, -1], as.logical), "transactions")
sets <- apriori(trans, parameter = list(supp = 0, conf = 0, minlen = 2,
                                        maxlen = 2, target = "frequent itemsets"))
df <- as(sets, "data.frame")
subset(transform(df, n = support * nrow(trans)), n > 0, -support)
# items n
# 2 {Product 6,Product 7} 1
# 4 {Product 4,Product 7} 1
# 6 {Product 2,Product 7} 1
# 7 {Product 5,Product 6} 1
# 8 {Product 3,Product 5} 1
# 10 {Product 1,Product 5} 2
# 11 {Product 2,Product 5} 1
# 13 {Product 4,Product 6} 1
# 14 {Product 1,Product 6} 1
# 15 {Product 2,Product 6} 1
# 16 {Product 3,Product 4} 1
# 17 {Product 1,Product 3} 2
# 18 {Product 2,Product 3} 2
# 19 {Product 1,Product 4} 1
# 20 {Product 2,Product 4} 2
# 21 {Product 1,Product 2} 2
The support value shows you the proportion of events in which both products were included. I multiplied it by the number of transactions to get your frequency count.
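If you just want to eyeball the itemsets with their supports instead of converting to a data frame, arules' inspect() should also work; a quick sketch:
inspect(sort(sets, by = "support"))  # print itemsets ordered by support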
I'm working on a project that analyzes the ROI of people participating in a bunch of contests. For each contest I have a table with everyone's rank, and another table with the payout for a given rank range. I want to join these two tables to assign everyone a payout based on their ranking, but I'm having trouble working out how to handle ties. If two people are tied, their payouts are averaged. My tables run to hundreds of thousands of rows, so I want to get the process right on this smaller example.
Rank table example:
id rank
1 A 1
2 B 1
3 C 3
4 D 4
5 E 4
6 F 4
7 G 7
8 H 8
9 I 9
10 J 10
Payout table example:
rankMin rankMax payout
1 1 1 100
2 2 3 70
3 4 5 50
4 6 8 20
5 9 10 0
End goal:
id rank payout
1 A 1 85 # Two people tied for first, so take average of 1st and 2nd payouts
2 B 1 85
3 C 3 70
4 D 4 40 # Three people tied for 4th, so take average of 4th/5th/6th payouts.
5 E 4 40
6 F 4 40
7 G 7 20
8 H 8 20
9 I 9 0
10 J 10 0
My code so far:
# Load libraries
library(dplyr)
# Setup the rank table
id <- LETTERS[1:10]
rank <- c(1, 1, 3, 4, 4, 4, 7, 8, 9, 10)
finalStandingsDf <- data.frame(id, rank, stringsAsFactors = FALSE)
# Setup the payout table
rankMin <- c(1, 2, 4, 6, 9)
rankMax <- c(1, 3, 5, 8, 10)
payoutAmt <- c(100, 70, 50, 20, 0)
payoutDf <- data.frame(rankMin, rankMax, payoutAmt)
# "Unzip" the payout table to make it easier to join onto rank table
payoutsFixedAll <- data.frame()
for(i in 1:length(id)){
rank <- i
payoutIndex <- min(which(rank <= rankMax))
payout <- payoutDf[payoutIndex, 3]
payoutsFixed <- data.frame(rank, payout)
payoutsFixedAll <- rbind(payoutsFixedAll, payoutsFixed)
}
### Intermediate step: adjust the payoutsFixedAll table to account for ties ###
# Join onto rank table
rankPayoutDf <- finalStandingsDf %>%
left_join(payoutsFixedAll, by = c('rank'))
Obviously I need to make some sort of adjustment to the payout table so that tied ranks get the averaged payout, but I'm struggling to think of the best way to do so. I think it will involve counting the number of players at each rank (1: 2, 2: 0, 3: 1, 4: 3, etc.) and somehow making the adjustment from there. I get what needs to be done; I'm just struggling to see the path to get there. Any thoughts?
For this particular dataset, we can 1) expand payoutDf into one payout per finishing position and then 2) average those payouts within each rank of finalStandingsDf.
# expand payoutDf to one payout per finishing position: 100, 70, 70, 50, 50, 20, 20, 20, 0, 0
payouts <- with(payoutDf, rep(payoutAmt, rankMax - rankMin + 1))
# ave() replaces each value with its group mean, so tied ranks share the averaged payout
finalStandingsDf$payout <- ave(payouts, finalStandingsDf[["rank"]])
finalStandingsDf
id rank payout
1 A 1 85
2 B 1 85
3 C 3 70
4 D 4 40
5 E 4 40
6 F 4 40
7 G 7 20
8 H 8 20
9 I 9 0
10 J 10 0
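Since you're already using dplyr, here is a hedged sketch of the same idea in that idiom. It assumes finalStandingsDf is ordered by finishing position (as in your example); pos_payout is a helper table introduced here for illustration:
library(dplyr)
# one payout per finishing position (the same "unzip" idea as your loop)
pos_payout <- data.frame(
  position = seq_len(max(payoutDf$rankMax)),
  posPayout = with(payoutDf, rep(payoutAmt, rankMax - rankMin + 1))
)
finalStandingsDf %>%
  mutate(position = row_number()) %>%   # assumes rows are in finishing order
  left_join(pos_payout, by = "position") %>%
  group_by(rank) %>%
  mutate(payout = mean(posPayout)) %>%  # tied ranks share the averaged payout
  ungroup() %>%
  select(id, rank, payout)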
I am stuck with a problem. I am new to R and coding in general, so maybe I'm missing some obvious mistakes. I hope you can help me.
My goal is it to calculate stock price returns between two different dates. I have this (simplified) dataframe:
ID <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
day <- c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4)
price <- c(12, 11, 10, 9, 15, 10, 15, 11, 8, 5, 10, 7)
stock.prices <- data.frame(ID, day, price)
ID day price
1 1 1 12
2 1 2 11
3 1 3 10
4 1 4 9
5 2 1 15
6 2 2 10
7 2 3 15
8 2 4 11
9 3 1 8
10 3 2 5
11 3 3 10
12 3 4 7
I want to calculate the buy-and-hold return between the lowest stock price per company (ID) and the first stock price per company (day =1).
The formula would be:
(Lowest stock price - first day stock price) / first day stock price.
In my first step I tried to find the lowest stock price and the matching day for each ID and create a new dataframe:
library(dplyr)
Lowest.Price <- stock.prices %>% group_by(ID) %>% slice(which.min(price))
ID day price
<dbl> <dbl> <dbl>
1 1 4 9
2 2 2 10
3 3 2 5
This step might not be necessary, but it was the only thing I could think of and it also gave me a good overview. I wouldn't mind a solution that skips this step (i.e. skips this dataframe and jumps to the final result right away).
What I tried next was to calculate the return between the lowest price and the price on day 1 for each ID, using an if statement, and save it as a new value:
Return <- if(Lowest.Price$ID == stock.prices$ID & stock.prices$day == "1"){(Lowest.Price$price - stock.prices$price)/stock.prices$price}
So it should only calculate the return if the IDs in both dataframes are the same and if the day in stock.prices equals 1.
Unfortunately that doesn't work. The desired result should be:
ID day price return
1 1 4 9 -0.25
2 2 2 10 -0.33
3 3 2 5 -0.375
Instead I get:
> show(Return)
[1] -0.25000000 -0.09090909 -0.50000000 0.00000000 -0.33333333 -0.50000000 -0.40000000 -0.09090909 -0.37500000
[10] 0.80000000 0.00000000 -0.28571429
First problem: I get 12 results (= the total number of observations) instead of 3 (= the number of different IDs). To be honest, I don't know why this happens or where these calculated returns come from.
Second problem: the final result should be a new dataframe that also shows the ID and maybe the day along with the returns. I don't know how to do that; I tried to use the mutate() function to add the returns to an existing dataframe (here: Lowest.Price), but that didn't seem to work at all.
I hope you can help me. If you need more information, let me know! Thank you!
The 12 values come from vector recycling: the length-3 Lowest.Price$price is recycled against the length-12 stock.prices$price, and if() only inspects the first element of its condition anyway. A grouped approach avoids this. Getting the percentage change between day 1 and the minimum stock price for each ID:
stock.return <- stock.prices %>%
  group_by(ID) %>%
  # within each ID, price[day == 1] is the first-day price
  summarize(stock.return = min(price) / price[day == 1] - 1)
ID stock.return
<dbl> <dbl>
1 1 -0.25
2 2 -0.333
3 3 -0.375
The return for each day:
return.by.day <- stock.prices %>%
group_by(ID) %>%
mutate(daily.return = price / price[day == 1] - 1)
ID day price daily.return
<dbl> <dbl> <dbl> <dbl>
1 1 1 12 0
2 1 2 11 -0.0833
3 1 3 10 -0.167
4 1 4 9 -0.25
5 2 1 15 0
6 2 2 10 -0.333
7 2 3 15 0
8 2 4 11 -0.267
9 3 1 8 0
10 3 2 5 -0.375
11 3 3 10 0.25
12 3 4 7 -0.125
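Your desired result also kept the day and price of the minimum. A hedged sketch for that, assuming dplyr >= 1.0 for slice_min():
stock.prices %>%
  group_by(ID) %>%
  mutate(return = price / price[day == 1] - 1) %>%  # return relative to day 1
  slice_min(price, n = 1, with_ties = FALSE) %>%    # keep the lowest-price row per ID
  ungroup()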
I have data on several thousand US basketball players over multiple years.
Each basketball player has a unique ID. For each year it is known which team they play for and which position they play, much like the mock data df below:
df <- data.frame(id = c(rep(1:4, times = 2), 1),
                 year = c(1, 1, 2, 2, 3, 4, 4, 4, 5),
                 team = c(1, 2, 3, 4, 2, 2, 4, 4, 2),
                 position = c(1, 2, 3, 4, 1, 1, 4, 4, 4))
> df
id year team position
1 1 1 1 1
2 2 1 2 2
3 3 2 3 3
4 4 2 4 4
5 1 3 2 1
6 2 4 2 1
7 3 4 4 4
8 4 4 4 4
9 1 5 2 4
What is an efficient way to manipulate df into new_df below?
> new_df
id move time position.1 position.2 year.1 year.2
1 1 0 2 1 1 1 3
2 2 1 3 2 1 1 4
3 3 0 2 3 4 2 4
4 4 1 2 4 4 2 4
5 1 0 2 1 4 3 5
In new_df the first occurrence of each basketball player is compared to the second occurrence, recording whether the player switched teams and how long it took the player to make the switch.
Note:
In the real data some basketball players occur more than twice and can play for multiple teams and on multiple positions.
In such a case a new row in new_df is added that compares each additional occurrence of a player with only the previous occurrence.
Edit: I think this is not a simple reshape exercise, for the reasons mentioned in the previous two sentences. To clarify this, I've added an additional occurrence of player ID 1 to the mock data.
Any help is most welcome and appreciated!
# number each player's occurrences in order (robust to unequal counts per id)
df$time <- with(df, ave(seq_along(id), id, FUN = seq_along))
df1 <- reshape(df, idvar = "id", timevar = "time", direction = "wide")
transform(df1, move = +(team.1 == team.2), time = year.2 - year.1)
id year.1 team.1 position.1 year.2 team.2 position.2 move time
1 1 1 1 1 3 2 1 0 2
2 2 1 2 2 4 2 1 1 3
3 3 2 3 3 4 4 4 0 2
4 4 2 4 4 4 4 4 1 2
The code below should help you get to the point where the data is transposed to wide format.
You'll still have to create the move and time variables.
df <- data.frame(id = rep(1:4, times = 2),
                 year = c(1, 1, 2, 2, 3, 4, 4, 4),
                 team = c(1, 2, 3, 4, 2, 2, 4, 4),
                 position = c(1, 2, 3, 4, 1, 1, 4, 4))
library(data.table)
setDT(df)  # convert to data.table
df[, rno := rank(year, ties.method = "min"), by = .(id)]  # number each occurrence
# create the transposed dataset; data.table's dcast accepts multiple value.var
Dcast_DT <- dcast(df, id ~ rno, value.var = c("year", "team", "position"))
This piece of code did the trick, using data.table
# transform to data.table
dt <- as.data.table(df)
# sort on year so shift() compares consecutive occurrences within each id
setorder(dt, year, na.last = TRUE)
# names of the new columns
new_cols <- c("time", "move", "prev_team", "prev_year", "prev_position")
# set up the new variables
dt[, (new_cols) := list(year - shift(year), team != shift(team),
                        shift(team), shift(year), shift(position)), by = id]
# keep only repeat occurrences
dt <- dt[!is.na(time)]
# outcome
dt
id year team position time move prev_team prev_year prev_position
1: 1 3 2 1 2 TRUE 1 1 1
2: 2 4 2 1 3 FALSE 2 1 2
3: 3 4 4 4 2 TRUE 3 2 3
4: 4 4 4 4 2 FALSE 4 2 4
5: 1 5 2 4 2 FALSE 2 3 1
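For comparison, the same previous-occurrence logic can be sketched with dplyr's lag(), starting again from the original df:
library(dplyr)
df %>%
  arrange(id, year) %>%
  group_by(id) %>%
  mutate(time = year - lag(year),       # years since the previous occurrence
         move = team != lag(team)) %>%  # did the player change teams?
  filter(!is.na(time)) %>%
  ungroup()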
I have a dataframe df:
id1 id2 action
1 2 10
1 3 11
1 4 21
2 1 6
...
It means user id1 did something (10) to user id2, and id2 did something (6) to id1.
Now I want to create a new column, called partner_action, which basically records what did the partner do. So it will look like:
id1 id2 action partner_action
1 2 10 6
2 1 6 10
1 3 11 9
3 1 9 11
I tried:
df$partner_action = df[df$id2 == df$id1,]$action
But of course, it does not work.
I thought about make a copy of df, called df_copy then:
df$partner_action = df_copy[df_copy$id1 == df$id2,]$action
But is there a better way to do it?
Basically you want to merge df with itself, matching pairs of (id2, id1) with pairs of (id1, id2). You can do this in R either with merge or match:
# look up, for each row, the action of the row whose (id1, id2) is this row's (id2, id1)
df$partner_action <- with(df, action[match(paste(id2, id1), paste(id1, id2))])
df
# id1 id2 action partner_action
# 1 1 2 10 6
# 2 2 1 6 10
# 3 1 3 11 9
# 4 3 1 9 11
Data:
(df <- data.frame(id1=c(1, 2, 1, 3), id2=c(2, 1, 3, 1), action=c(10, 6, 11, 9)))
# id1 id2 action
# 1 1 2 10
# 2 2 1 6
# 3 1 3 11
# 4 3 1 9
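And the merge() variant mentioned above, sketched by renaming a copy of df so that its (id1, id2) columns hold the original (id2, id1):
partner <- setNames(df[, c("id1", "id2", "action")],
                    c("id2", "id1", "partner_action"))
merge(df, partner, by = c("id1", "id2"), all.x = TRUE)
Note that merge() sorts the result by the join columns, so the row order differs from the match() version.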
# loop alternative: for each row, copy the action of the row with the ids swapped
for (i in 1:nrow(df)) {
  df[i, 4] <- df[which(df$id1 == df[i, 2] & df$id2 == df[i, 1]), 3]
}