Joining tables in R while adjusting for "ties"

I'm working on a project that analyzes the ROI of people participating in a number of contests. For each contest I have a table with everyone's rank, and another table with the payout for a given rank range. I want to join these two tables to assign everyone a payout based on their ranking, but I'm having trouble working out how to handle ties. If two people are tied, the payouts for the tied positions are averaged. My tables have hundreds of thousands of rows, so I want to get the process right on this smaller example.
Rank table example:
id rank
1 A 1
2 B 1
3 C 3
4 D 4
5 E 4
6 F 4
7 G 7
8 H 8
9 I 9
10 J 10
Payout table example:
rankMin rankMax payout
1 1 1 100
2 2 3 70
3 4 5 50
4 6 8 20
5 9 10 0
End goal:
id rank payout
1 A 1 85 # Two people tied for first, so take average of 1st and 2nd payouts
2 B 1 85
3 C 3 70
4 D 4 40 # Three people tied for 4th, so take average of 4th/5th/6th payouts.
5 E 4 40
6 F 4 40
7 G 7 20
8 H 8 20
9 I 9 0
10 J 10 0
My code so far:
# Load libraries
library(dplyr)
# Setup the rank table
id <- LETTERS[1:10]
rank <- c(1, 1, 3, 4, 4, 4, 7, 8, 9, 10)
finalStandingsDf <- data.frame(id, rank, stringsAsFactors = FALSE)
# Setup the payout table
rankMin <- c(1, 2, 4, 6, 9)
rankMax <- c(1, 3, 5, 8, 10)
payoutAmt <- c(100, 70, 50, 20, 0)
payoutDf <- data.frame(rankMin, rankMax, payoutAmt)
# "Unzip" the payout table to make it easier to join onto rank table
payoutsFixedAll <- data.frame()
for (i in 1:length(id)) {
  rank <- i
  payoutIndex <- min(which(rank <= rankMax))
  payout <- payoutDf[payoutIndex, 3]
  payoutsFixed <- data.frame(rank, payout)
  payoutsFixedAll <- rbind(payoutsFixedAll, payoutsFixed)
}
### Intermediate step to adjust the payoutsFixedAll table to account for ties ###
# Join onto rank table
rankPayoutDf <- finalStandingsDf %>%
  left_join(payoutsFixedAll, by = c('rank'))
Obviously I need to make some sort of adjustment to the payout table so that ties are handled properly, but I'm struggling to think of the best way to do so. I think it will involve counting the number of people at each rank (1: 2, 2: 0, 3: 1, 4: 3, etc.) and somehow making the adjustment from there. I understand what needs to be done; I'm just struggling to see the path to get there. Any thoughts?
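To make that counting idea concrete, here is one base R sketch of the intermediate step (tieCounts and payoutsAdjusted are names introduced here only for illustration): count how many people share each observed rank, then average the payouts of the positions that tied group occupies.
# Count ties per observed rank, then average the payouts of the positions
# each tied group occupies (rank, rank + 1, ..., rank + n - 1)
tieCounts <- table(finalStandingsDf$rank)
payoutsAdjusted <- data.frame(
  rank = as.numeric(names(tieCounts)),
  payout = sapply(as.numeric(names(tieCounts)), function(r) {
    n <- tieCounts[[as.character(r)]]
    mean(payoutsFixedAll$payout[payoutsFixedAll$rank %in% r:(r + n - 1)])
  })
)
# Joining payoutsAdjusted (instead of payoutsFixedAll) by rank then gives the end goal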

For this particular dataset, we can 1) expand payoutDf into one payout per rank position and then 2) average those payouts within each rank of finalStandingsDf.
payouts = with(payoutDf, rep(payoutAmt, rankMax - rankMin + 1))
finalStandingsDf$payout <- ave(payouts, finalStandingsDf[["rank"]])
finalStandingsDf
id rank payout
1 A 1 85
2 B 1 85
3 C 3 70
4 D 4 40
5 E 4 40
6 F 4 40
7 G 7 20
8 H 8 20
9 I 9 0
10 J 10 0
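The same two steps translate directly to dplyr, which may be handier if this sits inside a larger pipeline. This is only a sketch and assumes, as above, that the field is fully ranked with "min"-style ties, so the expanded payout vector lines up with the rows once they are sorted by rank:
library(dplyr)
finalStandingsDf %>%
  arrange(rank) %>%                  # one row per finishing position
  mutate(payout = rep(payoutDf$payoutAmt,
                      payoutDf$rankMax - payoutDf$rankMin + 1)) %>%
  group_by(rank) %>%                 # tied players share a rank...
  mutate(payout = mean(payout)) %>%  # ...and split the positions' payouts evenly
  ungroup()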

Related

Expand data frame and add rowsums from another dataframe

I am trying to find a faster way of accomplishing the following code since my actual dataset is very large. I would like to get rid of the for loop altogether. I am trying to duplicate each row of xdf into a new data frame, once for each column in values. Then, next to each entry in the new data frame, show the row sums of values from column 1 up to column j.
library(dplyr)  # provides data_frame()
xdf <- data_frame(
  x = c('a', 'b', 'c'),
  y = c(4, 5, 6)
)
values <- data_frame(
  col_1 = c(5, 9, 1),
  col_2 = c(4, 7, 6),
  col_3 = c(1, 5, 2),
  col_4 = c(7, 8, 5)
)
for (j in seq(ncol(values))) {
  if (j == 1) {
    Temp <- cbind(xdf, z = rowSums(values[1:j]))
  } else {
    Temp <- rbind(Temp, cbind(xdf, z = rowSums(values[1:j])))
  }
}
print(Temp)
The output should be:
x y z
1 a 4 5
2 b 5 9
3 c 6 1
4 a 4 9
5 b 5 16
6 c 6 7
7 a 4 10
8 b 5 21
9 c 6 9
10 a 4 17
11 b 5 29
12 c 6 14
Is there a shorter way to accomplish this?
This is the closest answer that I could get on SO.
How to expand data frame based on values?
I am new to R, so sorry for the long-winded code.
Here's one base R option:
Repeat the rows of xdf as many times as there are columns in values, compute rowSums over an increasing number of columns, and add the result as the new column z in the final data frame.
newdf <- xdf[rep(seq(nrow(xdf)), ncol(values)), ]
newdf$z <- c(sapply(seq(ncol(values)), function(x) rowSums(values[1:x])))
newdf
# A tibble: 12 x 3
# x y z
# <chr> <dbl> <dbl>
# 1 a 4 5
# 2 b 5 9
# 3 c 6 1
# 4 a 4 9
# 5 b 5 16
# 6 c 6 7
# 7 a 4 10
# 8 b 5 21
# 9 c 6 9
#10 a 4 17
#11 b 5 29
#12 c 6 14
A concise one-liner, as suggested by @sindri_baldur, doesn't require repeating the rows explicitly.
cbind(xdf, z = c(sapply(seq(ncol(values)), function(x) rowSums(values[1:x]))))
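If you would rather stay in the tidyverse (the question already uses data_frame), the same result can be built by stacking one mutated copy of xdf per column; a sketch assuming dplyr and purrr are available:
library(dplyr)
library(purrr)
# For each j, take a copy of xdf with z = rowSums over the first j columns,
# then bind the copies row-wise
map_dfr(seq_len(ncol(values)), function(j) {
  mutate(xdf, z = rowSums(values[seq_len(j)]))
})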

Deleting incomplete cases across multiple rows in RStudio

Say I have a longitudinal data set as below
ID <- c(1, 1, 2, 2, 3, 3, 4, 4)
time <- c(1, 2, 1, 2, 1, 2, 1, 2)
value <- c(7, 5, 9, 2, NA, 3, 7, NA)
mydata <- data.frame(ID, time, value)
ID time value
1 1 1 7
2 1 2 5
3 2 1 9
4 2 2 2
5 3 1 NA
6 3 2 3
7 4 1 7
8 4 2 NA
In this data set, we have 4 cases with data at two time points (let's say pre and post treatment).
What I want to do is set criteria to delete any case that is not complete for both time points. In this example, I would want to delete ID 3 (which is missing the value at time point 1) and ID 4 (which is missing the value at time point 2). Like below:
ID time value
1 1 1 7
2 1 2 5
3 2 1 9
4 2 2 2
I am not having much luck. I've tried variants of complete.cases() and which() to no avail.
I'm still new to R, and would be hugely appreciative if anyone could help me out.
Edit: Thank you Ronak for answering my question. On reflection on my real data, I have encountered a second problem. My actual data are better reflected by the example below:
ID <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 6, 7, 8)
time <- c(1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1)
value <- c(7, 5, 9, 2, NA, 3, 7, NA, 8, 9, 7, 6)
mydata <- data.frame(ID, time, value)
ID time value
1 1 1 7
2 1 2 5
3 2 1 9
4 2 2 2
5 3 1 NA
6 3 2 3
7 4 1 7
8 4 2 NA
9 5 1 8
10 6 1 9
11 7 1 7
12 8 1 6
Here I would also want to remove IDs 5, 6, 7 and 8. These IDs have an entry for time 1, but not time 2. Hopefully this makes sense.
Thanks a heap
If you switch your data to wide format (where each time point is represented as its own column), then you can use na.omit. Using dplyr and tidyr functions:
library(dplyr)
mydata <- mydata %>%
  tidyr::spread(key = time, value = value) %>%       # reformat to wide
  na.omit() %>%                                      # delete cases with missingness on any variable (i.e. any time point)
  tidyr::gather(key = "time", value = "value", -ID)  # put it back in long format
> mydata
ID time value
1 1 1 7
2 2 1 9
3 1 2 5
4 2 2 2
Note that this will work (it will keep only cases with complete data for both time 1 and time 2) even when you have a time point missing without an explicit NA present in the data, like this:
> mydata
ID time value
1 1 1 7
2 1 2 5
3 2 1 9
4 2 2 2
5 3 1 NA
6 3 2 3
7 4 1 7
8 4 2 NA
9 5 1 8
10 6 1 9
11 7 1 7
12 8 1 6
You can do this easily with sqldf.
library(sqldf)
sqldf(' select * from (select ID, count(*) as cnt from mydata where value is not null group by id having cnt >1 ) t1 inner join mydata t2 on t1.ID=t2.ID')
You select the IDs with a count greater than 1 and no NA values, and then join back to the original data.
@Ronak already provided
mydata[!mydata$ID %in% mydata$ID[is.na(mydata$value)], ]
For the second part, you can just group over each ID and filter on its frequency:
k2 <- data.frame(table(mydata$ID))
k2$Var1[k2$Freq > 1]
and then do something like
mydata[mydata$ID %in% k2$Var1[k2$Freq > 1],]
See the updated answer
# Eliminates ID cases with NA
mydata = mydata[!mydata$ID %in% mydata[!complete.cases(mydata) ,]$ID, ]
library(plyr)
# counts all the IDs
cnt = count(mydata, "ID")
# Eliminates any ID that doesn't have 2 observations
mydata[mydata$ID %in% cnt[cnt$freq == 2, ]$ID, ]
ID time value
1 1 1 7
2 1 2 5
3 2 1 9
4 2 2 2
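For completeness, the same filtering can be done in a single grouped dplyr step without reshaping; a sketch that assumes every complete case should have exactly two time points:
library(dplyr)
mydata %>%
  group_by(ID) %>%
  filter(n() == 2, !any(is.na(value))) %>%  # keep IDs with both rows and no NA
  ungroup()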

Gather ragged data frame into key-value columns

I recently discovered how to create ragged data frames using the I function, but I am having a hard time integrating them with tidyr, ggplot2 and the rest of the Hadleyverse. More specifically, how do you gather a column containing named vectors into key-value columns?
Suppose I create a data frame like this
make.vector <- function(length.out){
  x <- sample(9, length.out)
  names(x) <- switch(length.out,
                     "Alice",
                     c("Bob", "Charlie"),
                     c("Dave", "Erin", "Frank"),
                     c("Gwen", "Harold", "Inez", "James"))
  x
}
mydf <- data.frame(Game = gl(3, 3, labels = LETTERS[1:3]),
                   Set = rep(1:3, 3),
                   Score = I(lapply(rep(2:4, each = 3), make.vector)))
producing
> print(mydf)
Game Set Score
1 A 1 8, 3
2 A 2 2, 8
3 A 3 3, 8
4 B 1 1, 5, 4
5 B 2 2, 3, 5
6 B 3 2, 8, 5
7 C 1 7, 2, 3, 4
8 C 2 1, 6, 3, 7
9 C 3 6, 9, 3, 7
The data frame can be manipulated with dplyr and tidyr in a straightforward manner as long as the results are of the expected length.
mydf %>%
mutate(nPlayers = sapply(Score, length))
mydf %>%
group_by(Game) %>%
summarize(TotalScore = list(Reduce("+", Score)))
However, I cannot figure out how to create multiple rows of result for each original row. Suppose I want to create the following data frame by manipulating mydf:
Game Set Player Score
1 A 1 Bob 8
2 A 1 Charlie 3
3 A 2 Bob 2
4 A 2 Charlie 8
5 A 3 Bob 3
6 A 3 Charlie 8
7 B 1 Dave 1
8 B 1 Erin 5
9 B 1 Frank 4
10 B 2 Dave 2
...
The only tool I know for doing so would be the gather function of the tidyr package, but it doesn't seem to play very well with non-atomic data.
mydf %>%
mutate(Player = lapply(Score, names)) %>%
gather(P = Player, S = Score)
I guess I could hack together a solution (as done in similar previous questions [1][2]),
cbind(
mydf[rep(1:nrow(mydf), sapply(mydf$Score, length)),
c("Game", "Set")],
data.frame(
Player = unlist(lapply(mydf$Score, names)),
Score = unlist(mydf$Score)
)
)
but I have a feeling I will have a hard time digesting it if I look back at the code next week. Is there an "official" or at least smarter way to do this? Otherwise I'll write a general function for it and add it to my personal library.
Update
In the light of David's answer below I figured out that the same result can be achieved with dplyr too.
mydf %>%
  group_by(Game, Set) %>%
  do(with(., data.frame(Player = names(unlist(Score)),
                        Score = unlist(Score))))
# Game Set Player Score
# 1 A 1 Bob 8
# 2 A 1 Charlie 6
# 3 A 2 Bob 7
# 4 A 2 Charlie 6
# 5 A 3 Bob 5
# 6 A 3 Charlie 8
# 7 B 1 Dave 1
# 8 B 1 Erin 9
# 9 B 1 Frank 3
# 10 B 2 Dave 8
# .. ... ... ... ...
# Warning message:
# In rbind_all(out[[1]]) : Unequal factor levels: coercing to character
I would try unlisting by group using data.table. You can do the unlisting just once per group by storing the result in a temporary variable inside curly braces (as you would within a function) in the j expression:
library(data.table)
setDT(mydf)[, {
  temp <- unlist(Score)
  .(Player = names(temp), Score = temp)
}, by = .(Game, Set)]
# Game Set Player Score
# 1: A 1 Bob 2
# 2: A 1 Charlie 9
# 3: A 2 Bob 6
# 4: A 2 Charlie 3
# 5: A 3 Bob 2
# 6: A 3 Charlie 8
# 7: B 1 Dave 1
# 8: B 1 Erin 6
# 9: B 1 Frank 5
# 10: B 2 Dave 3
#...
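Newer versions of tidyr (1.0 and later) added unnest_longer(), which handles this list-of-named-vectors shape directly; here is a sketch, with unclass() stripping the AsIs wrapper that I() added:
library(dplyr)
library(tidyr)
mydf %>%
  mutate(Score = unclass(Score)) %>%               # drop the AsIs class from I()
  unnest_longer(Score, indices_to = "Player") %>%  # vector names become the Player column
  select(Game, Set, Player, Score)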

Remove two outliers in multiple regression

We've got a problem with removing two outliers from our dataset. The data come from an experiment with two independent variables and one dependent variable. We ran the multiple regression and examined the "Normal Q-Q" plot, which flagged two outliers (cases 10 and 46). Now we would like to remove those two cases before rerunning the multiple regression without the outliers.
We've already tried various commands recommended on several R forums, but unfortunately nothing worked.
We would be glad if anyone had an idea that could help us solve our problem.
Thank you very much for helping.
Since no data was provided, I fabricated some:
> x <- data.frame(a = c(10, 12, 14, 6, 10, 8, 11, 9),
+                 b = c(1, 2, 3, 24, 4, 1, 2, 4),
+                 c = c(2, 1, 3, 6, 3, 4, 2, 48))
> x
a b c
1 10 1 2
2 12 2 1
3 14 3 3
4 6 24 6
5 10 4 3
6 8 1 4
7 11 2 2
8 9 4 48
If the 4th case in column x$b and the 8th case in column x$c are outliers:
> x1 <- x[-c(4, 8), ]
> x1
a b c
1 10 1 2
2 12 2 1
3 14 3 3
5 10 4 3
6 8 1 4
7 11 2 2
Is this what you need?
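If it helps, here is a sketch of how the removal usually looks when the cases were flagged from a fitted regression; mydata, y, x1 and x2 are placeholder names standing in for your actual data and variables:
# Fit the model and inspect the Normal Q-Q plot, which labels the most extreme cases
fit <- lm(y ~ x1 + x2, data = mydata)
plot(fit, which = 2)
# Drop the flagged rows (10 and 46 in the question) and refit
mydata_clean <- mydata[-c(10, 46), ]
fit_clean <- lm(y ~ x1 + x2, data = mydata_clean)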

Create column in dataframe that samples from another column by factor levels

I would like column x3 of my data frame dat to contain a random sample of column x2, but the random sample should only come from within the same factor level given in column x1. I have researched the functions by(), ddply(), and sample(), but can't seem to make it work. I also checked a similar question but it didn't help me. You can see what I tried in the context of (what I hope is) a reproducible example below.
Here is the example dataframe:
dat <- data.frame(x1=c("a","a","a","b","b","b","c","c","c"),x2=1:9);
dat$x1 <- as.factor(dat$x1);
dat;
x1 x2
1 a 1
2 a 2
3 a 3
4 b 4
5 b 5
6 b 6
7 c 7
8 c 8
9 c 9
Then some of my non-working attempts to generate x3 were the following:
set.seed(99);
by(dat,FUN=dat$x1,dat$x3<-sample(dat$x1,1,replace=FALSE)); #this did not work at all
I also tried this
set.seed(99);
a <- by(dat,dat[,"x1"],function(d){sample(d$x2,3,replace=FALSE)},simplify=TRUE);
dat$x3<-a;
a;
dat[, "x1"]: a
[1] 2 1 3
---------------------------------------------------------------------------------------------------
dat[, "x1"]: b
[1] 6 5 4
---------------------------------------------------------------------------------------------------
dat[, "x1"]: c
[1] 9 7 8
dat;
> dat
x1 x2 x3
1 a 1 2, 1, 3
2 a 2 6, 5, 4
3 a 3 9, 7, 8
4 b 4 2, 1, 3
5 b 5 6, 5, 4
6 b 6 9, 7, 8
7 c 7 2, 1, 3
8 c 8 6, 5, 4
9 c 9 9, 7, 8
I kind of got what I needed into a, in that the random resampling by factor level is there, but a is not a simple vector. I feel that if a were a vector I would just about have what I need, as I could assign it to dat$x3. To sum up, I would want dat to turn out something like this:
dat
x1 x2 x3
1 a 1 2
2 a 2 1
3 a 3 3
4 b 4 6
5 b 5 5
6 b 6 4
7 c 7 9
8 c 8 7
9 c 9 8
The solution should be efficient for a data frame with >2 million rows. Thanks, anyone, for your help. I hope to return the help to others as I get better with R.
dat$x3 <- ave(dat$x2, dat$x1, FUN = sample)
Because the output is constructed to have the same number of entries as there were rows in the data frame, you will get permutations of the x2 values within distinct values of x1. (Edited your code to make it run.)
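If the >2 million rows are already being handled with dplyr, the same per-group shuffle can be written as a grouped mutate; in this sketch, indexing with sample.int() sidesteps the base R quirk where sample(x) on a single positive number samples from 1:x instead of returning x:
library(dplyr)
set.seed(99)
dat <- dat %>%
  group_by(x1) %>%
  mutate(x3 = x2[sample.int(n())]) %>%  # permute x2 within each level of x1
  ungroup()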
