I'm dealing with tournament results in R where ties can happen. Say two players tie for 3rd place. They would share (3rd_prize + 4th_prize), each earning (3rd_prize + 4th_prize)/2. If 10 players tie for third place, they would split the sum of the 3rd through 12th place prizes, and each get that sum over 10.
Given this structure, and given a data.table listing all players, their absolute results, and how many people they drew with, how could we generate a column with everyone's winnings? I don't know how to format sample data in the post, so I'm attaching a link to a Google Sheet with sample data and the desired result, if that's okay!
https://docs.google.com/spreadsheets/d/1fLUZ172Sl_yXVQE3VI0Xo4wSr_SRvaL43MCZIMYen2w/edit?usp=sharing
Here are 2 options:
(1)
prizes[results[, rn:=.I], on=.(Position=rn)][,
.(Person, Winnings=sum(Prize) / .N), .(Position=i.Position)]
Explanation:
Create a row index for results using results[, rn:=.I].
Then left join results and the prizes table on that row index: prizes[results[, rn:=.I], on=.(Position=rn)].
Then, using the result from step 2, group by the Position column from results and calculate the average prize for each Person (i.e. [, .(Person, Winnings=sum(Prize) / .N), .(Position=i.Position)]).
Assumption is that results is already sorted by Position.
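If results might not already be sorted by Position, a setorder() call beforehand covers that assumption (a minimal sketch):
# make the row order match the absolute finishing order before rn is computed
setorder(results, Position)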
(2)
Assuming that each row in results receives the prize in the same row of prizes, you can calculate average prizes after extracting by row index (within a grouped j, .I gives the row numbers of the group, so prizes$Prize[.I] pulls the matching prizes):
results[, Winnings := sum(prizes$Prize[.I], na.rm=TRUE) / .N, Position]
output:
Position Person Winnings
1: 1 A 100.0
2: 2 B 50.0
3: 3 C 17.5
4: 3 D 17.5
5: 4 E 5.0
6: 5 F 4.0
7: 6 G 3.0
8: 7 H 1.0
9: 7 I 1.0
10: 7 J 1.0
data:
library(data.table)
results <- data.table(Person=LETTERS[1:10],
Position=c(1,2,3,3,4,5,6,7,7,7),
tied=c(1,1,2,2,1,1,1,3,3,3))
prizes <- data.table(Position=1:10,
Prize=c(100,50,25,10,5,4,3,2,1,0))
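As a quick sanity check against this data (the row index stands in for absolute finishing place): tied C and D occupy rows 3-4, and three-way-tied H, I, J occupy rows 8-10:
sum(prizes$Prize[3:4]) / 2   # (25 + 10) / 2 = 17.5, matching C and D
sum(prizes$Prize[8:10]) / 3  # (2 + 1 + 0) / 3 = 1, matching H, I and J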
I infrequently use Access to update one table with another table using an inner join and some selection conditions and am trying to find a method to do this sort of operation in R.
# Example data to be updated
ID <- c('A','A','A','B','B','B','C','C','C')
Fr <- c(0,1.5,3,0,1.5,4.5,0,3,6)
To <- c(1.5,3,6,1.5,4.5,9,3,6,9)
dfA <- data.frame(ID,Fr,To)
dfA$Vl <- NA
I wish to update dfA$Vl using the Vl field in a second data frame as below
# Example data to do the updating
ID <- c('A','A','B','B','B','C','C','C')
Fr <- c(0,3,0,1,3,0,4,7)
To <- c(3,6,1,3,9,4,7,9)
Vl <- c(1,2,3,4,5,6,7,8)
dfB <- data.frame(ID,Fr,To,Vl)
The following is the Access SQL syntax I would use for this type of update
UPDATE DfA INNER JOIN DfB ON DfA.ID = DfB.ID SET DfA.Vl = [DfB].[Vl]
WHERE (((DfA.Fr)<=[DfB].[To]) AND ((DfA.To)>[DfB].[Fr]));
This reports that 14 rows are being updated (even though there are only 9 in dfA), as some of the rows meet the selection conditions more than once and the updates are applied sequentially. I'm not concerned about this inconsistency, as the result is sufficient for the intended purpose. However, it would be more precise to match the longest overlapping (To-Fr) range from DfB to the (To-Fr) of DfA (bonus points for that solution).
The result I end up with from Access is as follows
# Result
ID <- c('A','A','A','B','B','B','C','C','C')
Fr <- c(0,1.5,3,0,1.5,4.5,0,3,6)
To <- c(1.5,3,6,1.5,4.5,9,3,6,9)
Vl <- c(1,1,2,4,5,5,6,7,8)
dfC <- data.frame(ID,Fr,To,Vl)
So the question is: what is the best R way to address this operation, or alternatively (or additionally), how can the Access SQL be reproduced with the R SQL packages? Also (for extra credit), how can I make sure the value with the largest To-Fr overlap is the one applied, not necessarily the last update operation?
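For the "R sql packages" part, one option is sqldf, which runs the query through SQLite. SQLite does not accept Access's UPDATE ... INNER JOIN syntax, but a correlated subquery comes close; an untested sketch (it applies the first matching dfB row per dfA row, not the longest overlap):
library(sqldf)
sqldf(c("UPDATE dfA
         SET Vl = (SELECT Vl FROM dfB
                   WHERE dfB.ID = dfA.ID
                     AND dfA.Fr <= dfB.To
                     AND dfA.To > dfB.Fr)",
        "SELECT * FROM main.dfA"))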
A possible approach using data.table:
library(data.table)
setDT(dfA); setDT(dfB); setDT(dfC);
dfA[, rn:=.I]
# non-equi join like your Access SQL
dfB[dfA, on=.(ID, To>=Fr, Fr<To), .(rn, i.ID, i.Fr, i.To, x.Vl, x.Fr, x.To)][,
#calculate overlapping range
rng := pmin(x.To, i.To) - pmax(x.Fr, i.Fr)][,
#find the rows with max overlapping range and in case of dupes, choose the first row
first(.SD[rng==max(rng), .(ID=i.ID, Fr=i.Fr, To=i.To, Vl=x.Vl)]), by=.(rn)]
output:
rn ID Fr To Vl
1: 1 A 0.0 1.5 1
2: 2 A 1.5 3.0 1
3: 3 A 3.0 6.0 2
4: 4 B 0.0 1.5 3 #diff from dfC as Vl=3 has a bigger overlap
5: 5 B 1.5 4.5 4 #diff from dfC. both overlaps by 1.5 so either 4/5 works
6: 6 B 4.5 9.0 5
7: 7 C 0.0 3.0 6
8: 8 C 3.0 6.0 7
9: 9 C 6.0 9.0 8
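To mirror the Access UPDATE in place with data.table (where, as in Access's sequential updates, the last matching dfB row wins), a minimal update-join sketch:
# non-equi update join; mult="last" keeps the last match per dfA row
dfA[, Vl := dfB[dfA, on=.(ID, To>=Fr, Fr<To), x.Vl, mult="last"]]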
I am looking to add a column to my data that gives a running count of each observation in the dataset. I have data on NBA teams and each of their games, listed by date, and I want to create a column giving each game's number in the season (1-82) for each team.
My data looks like this:
# gmDate teamAbbr opptAbbr id
# 2012-10-30 WAS CLE 2012-10-30WAS
# 2012-10-30 CLE WAS 2012-10-30CLE
# 2012-10-30 BOS MIA 2012-10-30BOS
Commas separate each column
I've tried to use "add_count" but this has provided me with the total # of games each team has played in the dataset.
Prior attempts:
nba_box %>% add_count()
I expect the added column to display the # game for each team (1-82), but instead it now shows the total number of games in the dataset (82).
Here is a base R example that approaches the problem from a for loop standpoint. Given that a team can appear in either column, we keep track of each team's game count by unlisting the data and using the table function to sum over the previous rows.
# initialize some fake data
test <- as.data.frame(t(replicate(6, sample( LETTERS[1:3],2))),
stringsAsFactors = F)
colnames(test) <- c("team1","team2")
# initialize two new columns
test$team2_gamenum <- test$team1_gamenum <- NA
count <- NULL
for(i in 1:nrow(test)){
out <- c(count, table(unlist(test[i,c("team1","team2")])))
count <- table(rep(names(out), out)) # prob not optimum way of combining two table results
test$team1_gamenum[i] <- count[which(names(count) == test[i,1])]
test$team2_gamenum[i] <- count[which(names(count) == test[i,2])]
}
test
# team1 team2 team1_gamenum team2_gamenum
#1 B A 1 1
#2 A C 2 1
#3 C B 2 2
#4 C B 3 3
#5 A C 3 4
#6 A C 4 5
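For comparison, a sketch of a more compact base R route to the same numbers: reshape to long format so each team appearance is one row, then number appearances within team using ave() (names follow the fake data above):
long <- data.frame(game = rep(seq_len(nrow(test)), 2),
                   team = c(test$team1, test$team2))
long <- long[order(long$game), ]                      # chronological order
long$gamenum <- ave(seq_along(long$team), long$team,  # running count per team
                    FUN = seq_along)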
I am working with a huge data table in R containing monthly measurements of temperature for multiple locations, taken by different sources.
The dataset looks like this:
library(data.table)
# Generate random data:
loc <- 1:10
dates <- seq(as.Date("2000-01-01"), as.Date("2004-12-31"), by="month")
mods <- c("A","B", "C", "D", "E")
temp <- runif(length(loc)*length(dates)*length(mods), min=0, max=30)
df <- data.table(expand.grid(Location=loc, Date=dates, Model=mods, Temperature=temp))
So basically, for location 1, I have measurements from January 2000 to December 2004 taken by model A. Then, I have measurements made by model B. And so on for models C, D and E. And then, so on for locations 2 through 10.
What I need to do is, instead of having five different temperature measurements (from the models), to take the mean temperature for all the models.
As a result, I would have, for each location and each date, not five but ONLY ONE temperature measurement (that would be a multi-model mean).
I tried this:
df2 <- df[, Mean:=mean(Temperature), by=list(Model, Location, Date)]
which didn't work as I expected. I would at least expect the resulting data table to be 1/5th the number of rows of the original table, since I am summarizing five measurements into a single one.
What am I doing wrong?
I don't think you generated your test data correctly. The function expand.grid() takes the Cartesian product of all its arguments. I'm not sure why you included the Temperature=temp argument in the expand.grid() call; that duplicates each temperature value for every single key combination, resulting in a data.table with 9 million rows (this is (10*60*5)^2). I think you intended one temperature value per key, which should result in 10*60*5 rows:
df <- data.table(expand.grid(Location=loc,Date=dates,Model=mods),Temperature=temp);
df;
## Location Date Model Temperature
## 1: 1 2000-01-01 A 2.469751
## 2: 2 2000-01-01 A 16.103135
## 3: 3 2000-01-01 A 7.147051
## 4: 4 2000-01-01 A 10.301937
## 5: 5 2000-01-01 A 16.760238
## ---
## 2996: 6 2004-12-01 E 26.293968
## 2997: 7 2004-12-01 E 8.446528
## 2998: 8 2004-12-01 E 29.003001
## 2999: 9 2004-12-01 E 12.076765
## 3000: 10 2004-12-01 E 28.410980
If this is correct, you can generate the means across models with this:
df[,.(Mean=mean(Temperature)),.(Location,Date)];
## Location Date Mean
## 1: 1 2000-01-01 9.498497
## 2: 2 2000-01-01 11.744622
## 3: 3 2000-01-01 15.691228
## 4: 4 2000-01-01 11.457154
## 5: 5 2000-01-01 8.897931
## ---
## 596: 6 2004-12-01 17.587000
## 597: 7 2004-12-01 19.555963
## 598: 8 2004-12-01 15.710465
## 599: 9 2004-12-01 15.322790
## 600: 10 2004-12-01 20.240392
Note that the := operator does not actually aggregate. It only adds, modifies, or deletes columns in the original data.table. It is possible to add a new column (or overwrite an old column) with duplications of an aggregated calculation (e.g. see http://www.r-bloggers.com/two-of-my-favorite-data-table-features/), but that's not what you want.
In general, when you aggregate a table of data, you are necessarily producing a new table that is reduced to one row per aggregation key. The := operator does not do this.
Instead, we need to run a normal index operation on the data.table, grouping by the required aggregation key (which will automatically be included in the output data.table), and add to that the j argument which will be evaluated once for each group. The result will be a reduced version of the original table, with the results of all j argument evaluations merged with their respective aggregation keys. Since our j argument results in a scalar value for each group, our result will be one row per Location/Date aggregation key.
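To make the contrast concrete with the corrected df (3000 rows):
df[, Mean := mean(Temperature), by = .(Location, Date)]          # still 3000 rows; adds a repeated group mean column
df2 <- df[, .(Mean = mean(Temperature)), by = .(Location, Date)] # 600 rows; a genuinely aggregated table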
If we are using data.table, CJ (cross join) can be used to build the grid directly:
CJ(Location = loc, date = dates, Model = mods)[,
Temperature := temp][, .(Mean = mean(Temperature)), by = .(Location, date)]
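One caveat: CJ() returns a keyed table sorted by its arguments, while expand.grid() varies its first argument fastest without sorting, so the two grids enumerate the key combinations in different orders and the positional Temperature := temp assignment pairs the values with different keys. That is harmless for the random data here, but would matter with real measurements.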
I have a dataframe. For simplicity, I am leaving out many columns and rows:
Distance Type
1 162 A
2 27182 A
3 212 C
4 89 B
5 11 C
I need to find 6 consecutive rows in the dataframe, such that the average distance is 1000, and such that the only types considered are A or B. Just for clarification, one may think to filter out all Type C rows, and then proceed, but then the rows that were not originally consecutive will become consecutive upon filtering, and that's no good.
For example, if I filtered out rows 3 and 5 above, I would be left with 3 rows. And if I had provided more rows, that might produce a faulty result.
Maybe a solution with the data.table library?
For reproducibility, here is a data sample based on what you wrote.
library(data.table)
# data orig (with row numbers...)
DO<-"Distance Type
1 162 A
2 27182 A
3 212 C
4 89 B
5 11 C
6 1234 A"
# data : sep by comma
DS<-gsub('[[:blank:]]+',';',DO)
# data.frame
DF<-read.table(textConnection(DS),header=T,sep=';',stringsAsFactors = F)
#data.table
DT<-as.data.table(DF)
Then, make a function that increments a counter each time a new run of identical values starts:
# function to set sequencial group number
mkGroupRep <- function(x){
  cnt <- 1L
  grp <- 1L
  lx <- length(x)
  ne <- x[-lx] != x[-1L]  # TRUE where the next element differs
  for(i in seq_along(ne)){
    if(ne[i]) cnt <- cnt + 1L
    grp[i+1] <- cnt
  }
  grp
}
And use it with data.table's 'multiple assignment by reference':
# update dat : set group number based on sequential type
DT[,grp:=mkGroupRep(Type)]
# calc sum of distance and number of item in group, by group
DT[,`:=`(
distMean=mean(Distance),
grpLength=.N
),by=grp]
# filter what you want :
DT[(Type != 'C' & distMean > 100 & grpLength == 2) | grpLength == 3]
Output :
Distance Type grp distMean grpLength
1: 162 A 1 13672 2
2: 27182 A 1 13672 2
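As an aside, mkGroupRep is essentially data.table's built-in rleid() (run-length id), so the grouping step can be written directly as:
DT[, grp := rleid(Type)]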
I have a really large table (~7 million records) with the following structure.
temp <- read.table(header = TRUE, stringsAsFactors=FALSE,
text = "Website Datetime Rating
A 2007-12-06T14:53:07Z 1
A 2006-07-28T03:52:26Z 4
B 2006-11-02T11:06:25Z 2
C 2007-06-19T06:56:08Z 5
C 2009-11-28T22:27:58Z 2
C 2009-11-28T22:28:13Z 2")
What I want to retrieve is the unique websites with a max rating per website:
Website Rating
A 4
B 2
C 5
I tried using a for loop, but it was too slow. Is there any other way I can achieve this?
do.call( rbind, lapply( split(temp, temp$Website) ,
function(d) d[ which.max(d$Rating), ] ) )
Website Datetime Rating
A A 2006-07-28T03:52:26Z 4
B B 2006-11-02T11:06:25Z 2
C C 2007-06-19T06:56:08Z 5
Since your 'Datetime' variable does not yet appear to actually be a Date or datetime object, you should probably convert it to a Date object first.
which.max will pick the first item that is a maximum.
> which.max(c(1,1,2,2))
[1] 3
So Ananda may not be correct in his warning in that regard. data.table methods will certainly be more rapid, and may also succeed where machine memory is modest: the method above may make several copies along the way, while data.table functions do not need to do as much copying.
I would probably explore the data.table package, though without more details, the following example solution is most likely not going to be what you need. I mention this because, in particular, there might be more than one "Rating" record per group which matches max; how would you like to deal with those cases?
library(data.table)
temp <- read.table(header = TRUE, stringsAsFactors=FALSE,
text = "Website Datetime Rating
A 2012-10-9 10
A 2012-11-10 12
B 2011-10-9 5")
DT <- data.table(temp, key="Website")
DT
# Website Datetime Rating
# 1: A 2012-10-9 10
# 2: A 2012-11-10 12
# 3: B 2011-10-9 5
DT[, list(Datetime = Datetime[which.max(Rating)],
Rating = max(Rating)), by = key(DT)]
# Website Datetime Rating
# 1: A 2012-11-10 12
# 2: B 2011-10-9 5
I would recommend that to get better answers, you might want to include information like how your datetime variable might factor into your aggregation, or whether it is possible for there to be more than one "max" value per group.
If you want all the rows that match the max, the fix is easy:
DT[, list(Datetime = Datetime[Rating == max(Rating)],
Rating = max(Rating)), by = key(DT)]
If you do just want the Rating column, there are many ways to go about this. Following the same steps as above to convert to a data.table, try:
DT[, list(Rating = max(Rating)), by = key(DT)]
# Website Rating
# 1: A 4
# 2: B 2
# 3: C 5
Or, keeping the original "temp" data.frame, try aggregate():
aggregate(Rating ~ Website, temp, max)
Website Rating
# 1 A 4
# 2 B 2
# 3 C 5
Yet another approach, using ave:
temp[with(temp, Rating == ave(Rating, Website, FUN=max)), ]
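Here ave() returns the per-Website maximum aligned to each row, so the logical comparison keeps every row (including ties) that matches its group max; a quick check against the sample data:
with(temp, ave(Rating, Website, FUN = max))
# [1] 4 4 2 5 5 5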