I have a data set in chronological order which I have imported to R using:
mydata <- read.csv(file="test.csv",stringsAsFactors=FALSE)
Two of the columns in the data set are 'winner' and loser'. Each row in the data is a tennis match.
What I am looking to do is to add two columns which give me a cumulative count of the total matches the player in the 'winner' column has played up to and including the match on that row. And the same count for the 'loser' in that row.
So for example it would look like this:
winner loser winner_matches loser_matches
tom andy 1 1
andy greg 2 1
greg tom 2 2
I hope that makes sense.
I have tried using the following code but can't get it to work across both columns:
ave(mydata$winner_name==mydata$winner_name, mydata$winner_name, FUN=cumsum)
So the data below is the first 10 rows of around 20,000.
1) base Define a function which counts matches up to the ith row for the indicated player and then apply it for the winner and loser matches separately. No packages are used:
count_matches <- function(i, player) {
with(DF[1:i, ], sum(winner == player | loser == player))
}
n <- nrow(DF)
transform(DF, winner_matches = mapply(count_matches, 1:n, winner),
loser_matches = mapply(count_matches, 1:n, loser))
giving:
winner loser winner_matches loser_matches
1 tom andy 1 1
2 andy greg 2 1
3 greg tom 2 2
2) sqldf A different solution can be obtained using sqldf upon realizing that this problem can be solved with a self-join on a complex condition like this:
library(sqldf)
sqldf("select a.winner,
a.loser,
sum(a.winner = b.winner or a.winner = b.loser) winner_matches,
sum(a.loser = b.winner or a.loser = b.loser) loser_matches
from DF a join DF b on a.rowid >= b.rowid
group by a.rowid")
giving:
winner loser winner_matches loser_matches
1 tom andy 1 1
2 andy greg 2 1
3 greg tom 2 2
Note: The input used, in reproducible form, is:
Lines <- "winner loser
tom andy
andy greg
greg tom"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)
We can get number of times that each player won or lost by data.table package:
library(data.table)
setDT(dat)[, winner_matches_won := seq_len(.N), by=(winner)]
setDT(dat)[, loser_matches_lost := seq_len(.N), by=(loser)]
dat
# winner loser winner_matches_won loser_matches_lost
# 1: tom andy 1 1
# 2: andy greg 1 1
# 3: greg tom 1 1
# 4: greg tom 2 2
# 5: tom greg 2 2
Data:
dat <- structure(list(winner = structure(c(3L, 1L, 2L, 2L, 3L), .Label = c("andy",
"greg", "tom"), class = "factor"), loser = structure(c(1L, 2L,
3L, 3L, 2L), .Label = c("andy", "greg", "tom"), class = "factor")), .Names = c("winner",
"loser"), class = "data.frame", row.names = c(NA, -5L))
You're really close to getting ave to work. The cumsum function doesn't know how to handle text so I created a dummy column that's equal to 1 for each row. That gives cumsum something to count.
Here's a sample dataframe.
mydata <-
data.frame(
winner = c("tom", "andy", "greg", "tom", "gary"),
loser = c("andy", "greg", "tom", "gary", "tom"),
stringsAsFactors = FALSE
)
And here's the code to add the two new columns.
library(tidyverse)
mydata <- mutate(mydata, one = 1) # Add dummy column
# Use ave() to calculate both the wins and losses
mydata$winner_matches <- ave(x = mydata$one, mydata$winner, FUN = cumsum)
mydata$loser_matches <- ave(x = mydata$one, mydata$loser, FUN = cumsum)
mydata <- select(mydata, -one) # Remove dummy column
Related
I am doing some operations on a data.table and getting a result. So far so good. Next, I want the result to also show the sums across some columns, but I can't get that to work.
I filter my table by rows where x1=1, and compute a metric by Group1:
dt[x1 == 1, .N, by = c("Group1")][,
"%" := round(N /sum(N) * 100, 0)] [
]
giving
Group1 N %
1: 2 6 40
2: 1 6 40
3: 3 2 13
4: 5 1 7
I would just like to add a row to the above table that gives the sum across all columns.
I can just do
colSums(.Last.value)
and get the answer in a in a separate console, but what if I wanted to just append a new row to the above table itself, something like:
Group1 N %
1: 2 6 40
2: 1 6 40
3: 3 2 13
4: 5 1 7
ColSum: -- 15 100
Since I don't understand your sample dataset, I guess this can help solve your issue.
I would suggest that you use the janitor package to wrap up your column total or row total
See sample below
library(janitor)
set.seed(10)
df_sample<- sample(1:nrow(iris), 10)
df<-iris[df_sample, ]
#This would sum all the rows together and return total
df%>%
select(Species,Sepal.Width, Petal.Length, Petal.Width)%>%
adorn_totals(where = "row")
#This would sum all columns and return total
df%>%
select(Species,Sepal.Width, Petal.Length, Petal.Width)%>%
adorn_totals(where = "col")
I hope that this answered your question.
As a hacked mod to akrun's (since deleted) answer, here's a custom printing function that works around data.table's omission of row names.
prettyDT <- function(x, ...) {
out <- capture.output(data.table:::print.data.table(x, ...))
nms <- rownames(x)
gre <- gregexpr("^([0-9]+)(?=:)", out, perl = TRUE)
newnms <- nms[as.integer(regmatches(out, gre), nms)]
wids <- nchar(newnms)
newnms[!is.na(wids)] <- sprintf(paste0("%", max(wids, na.rm = TRUE), "s"), newnms[!is.na(wids)])
regmatches(out, gre)[!is.na(wids)] <- newnms[!is.na(wids)]
pre <- strrep(" ", diff(range(wids, na.rm = TRUE)))
out[is.na(wids)] <- paste0(pre, out[is.na(wids)])
cat(out, sep = "\n")
}
With this, we can do:
out <- rbindlist(list(
DT,
DT[, c(.(Group1 = "--"), lapply(.SD, sum)), .SDcols = c("N", "%")]
))
rownames(out)[nrow(out)] <- "Colsum"
prettyDT(out)
# Group1 N %
# <char> <int> <int>
# 1: 2 6 40
# 2: 1 6 40
# 3: 3 2 13
# 4: 5 1 7
# Colsum: -- 15 100
Admittedly, this is a bit of a hack, and requires explicit calling of a udf to get the desired output.
Data
DT <- setDT(structure(list(Group1 = c("2", "1", "3", "5"), N = c(6L, 6L, 2L, 1L), "%" = c(40L, 40L, 13L, 7L)), class = c("data.table", "data.frame"), row.names = c(NA, -4L)))
I have a dataset with names, dates, and several categorical columns. Let's say
data <- data.table(name = c('Anne', 'Ben', 'Cal', 'Anne', 'Ben', 'Cal', 'Anne', 'Ben', 'Ben', 'Ben', 'Cal'),
period = c(1,1,1,1,1,1,2,2,2,3,3),
category = c("A","A","A","B","B","B","A","B","A","B","A"))
Which looks like this:
name period category
Anne 1 A
Ben 1 A
Cal 1 A
Anne 1 B
Ben 1 B
Cal 1 B
Anne 2 A
Ben 2 B
Ben 2 A
Ben 3 A
Cal 3 B
I want to compute, for each period, how many names were present in the past period, for every group of my categorical variables. The output should be as follows:
period category recurrence_count
2 A 2 # due to Anne and Ben being on A, period 1
2 B 1 # due to Ben being on B, period 1
3 A 1 # due to Ben being on A, period 2
3 B 0 # no match from B, period 2
I am aware of the .I and .GRP operators in data.table, but I have no idea how to write the notion of 'next group' in the j entry of my statement. I imagine something like this might be a reasonable path, but I can't figure out the correct syntax:
data[, .(recurrence_count = length(intersect(name, name[last(.GRP)]))), by = .(category, period)]
You can first summarize your data by category and period.
previous_period_names <- data[, .(names = list(name)), .(category, period)]
previous_period_names[, next_period := period + 1]
Join your summary with your original data.
data[previous_period_names, names := i.names, on = c('period==next_period')]
Now count how many names you see the name in the summarized names
data[, .(recurrence_count = sum(name %in% unlist(names))), by = .(period, category)]
Another data.table alternative. For rows that can have a previous period (period != 1), create such a variable (prev_period := period - 1).
Join original data with a subset that has values for 'prev_period' (data[data[!is.na(prev_period)]). Join on 'category', 'period = prev_period' and 'name'.
In the resulting data set, for each 'period' and 'category' (by = .(period = i.period, category)), count the number of names from original data (x.name) that had a match with previous period (length(na.omit(x.name))).
data[period != 1, prev_period := period - 1]
data[data[!is.na(prev_period)], on = c("category", period = "prev_period", "name"),
.(category, i.period, x.name)][
, .(n = length(na.omit(x.name))), by = .(period = i.period, category)]
# period category n
# 1: 2 A 2
# 2: 2 B 1
# 3: 3 B 1
# 4: 3 A 0
One option in base R is to split the 'data' by 'category', then loop over the list (lapply), use Reduce with intersect on the splitted 'name' by 'period' with accumulate as TRUE, get the lengths of the list, create a data.frame with the unique elements of 'period' and use Map to create the 'category' from the names of the list output, rbind the list of data.frame into a single dataset
library(data.table)
lst1 <- lapply(split(data, data$category), function(x)
data.frame(period = unique(x$period)[-1],
recurrence_count = lengths(Reduce(intersect,
split(x$name, x$period), accumulate = TRUE)[-1])))
rbindlist(Map(cbind, category = names(lst1), lst1))[
order(period), .(period, category, recurrence_count)]
# period category recurrence_count
#1: 2 A 2
#2: 2 B 1
#3: 3 A 1
#4: 3 B 0
Or using the same logic within data.table, grouped by 'category, do the split of 'name' by 'period' and apply the Reduce with intersect
setDT(data)[, .(period = unique(period),
recurrence_count = lengths(Reduce(intersect,
split(name, period), accumulate = TRUE))), .(category)][duplicated(category)]
# category period recurrence_count
#1: A 2 2
#2: A 3 1
#3: B 2 1
#4: B 3 0
Or similar option in tidyverse
library(dplyr)
library(purrr)
data %>%
group_by(category) %>%
summarise(reccurence_count = lengths(accumulate(split(name, period),
intersect)), period = unique(period), .groups = 'drop' ) %>%
filter(duplicated(category))
# A tibble: 4 x 3
# category reccurence_count period
# <chr> <int> <int>
#1 A 2 2
#2 A 1 3
#3 B 1 2
#4 B 0 3
data
data <- structure(list(name = c("Anne", "Ben", "Cal", "Anne", "Ben",
"Cal", "Anne", "Ben", "Ben", "Ben", "Cal"), period = c(1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), category = c("A", "A", "A",
"B", "B", "B", "A", "B", "A", "A", "B")), class = "data.frame",
row.names = c(NA,
-11L))
A data.table option
setDT(df)[
,
{
u <- split(name, period)
data.table(
period = unique(period)[-1],
recurrence_count = lengths(
Map(
intersect,
head(u, -1),
tail(u, -1)
)
)
)
},
category
]
gives
category period recurrence_count
1: A 2 2
2: A 3 1
3: B 2 1
4: B 3 0
I have a couple of data frames that I am attempting to use the values from one data frame to populate the cells of a column in a separate data frame.
They are as follows:
df1 <- data.frame(A = c("Doug", "Michele", "Steve", "John", "Pete", "David"))
df1$B <- 0
df2 <- data.frame(A = c("Doug", "Steve", "John"), B = c(1,1,0))
And the result that I am looking for is:
df1 <- data.frame(A = c("Doug", "Michele", "Steve", "John", "Pete", "David"), B = c(1,0,1,0,0,0))
I tried the following approach, but only Doug has a 1 value while the others are 0.
df1$B[(df1$A == df2$A & df2$B == 1)] <- 1
When attempting an approach with %in%, Doug has a 1 value but John does as well when Steve should be the one to receive the 1.
df1$B[(df1$A %in% df2$A & df2$B == 1)] <- 1
Am I missing something here that would resolve this issue?
Thanks in advance
An option with data.table would be to join on the 'A' column and assign the 'B' from the second dataset (i.B) to 'B' in first data
library(data.table)
setDT(df1)[df2, B := i.B, on = .(A)]
-output
df1
# A B
#1: Doug 1
#2: Michele 0
#3: Steve 1
#4: John 0
#5: Pete 0
#6: David 0
I want to Transform R Dataframe factor into Indicator Variable using some index in R.
Given following representation
StudentID Subject
1 A
1 B
2 A
2 C
3 A
3 B
I need following representation using StudentID as index
StudentID SubjectA SubjectB SubjectC
1 1 1 0
2 1 0 1
3 1 1 0
We can use table
table(df1)
# Subject
#StudentID A B C
# 1 1 1 0
# 2 1 0 1
# 3 1 1 0
If we need a data.frame
as.data.frame.matrix(table(df1))
Here's how I got it, using dcast from reshape2 as suggested in the comment above
library(reshape2)
ID <- c(1, 1, 2, 2, 3, 3)
Subject <- c('A', 'B', 'A', 'C', 'A', 'B')
data <- data.frame(ID, Subject)
data <- dcast(data, ID ~ Subject)
data[is.na(data)] <- 0
f <- function(x) {
x <- gsub('[A-Z]', 1, x)
}
as.data.frame(apply(data, 2, f))
# ID A B C
#1 1 1 1 0
#2 2 1 0 1
#3 3 1 1 0
Now that I look at this solution it may not be very efficient. But it is much more dynamic than some other solutions. There might also be a way to use data.table directly but I cannot figure it out. This might help though:
library(data.table)
df <- structure(list(StudentID = c(1, 1, 2, 2, 3, 3),
Subject = structure(c(1L,
2L, 1L, 3L, 1L, 2L), .Label = c("A", "B", "C"), class = "factor")), .Names = c("StudentID",
"Subject"), row.names = c(NA, -6L), class = "data.frame")
df <- data.table(df)
### here we pull the unique student id's to use in group by
studentid <- as.character(unique(df$Subject))
### here we group by student ID's and paste which Subjects exist
x <- df[,list("Values"=paste(Subject,collapse="_")),by=StudentID]
### then we go through each one and try to match it to the unique vector
tmp <- strsplit(x$Values,"_")
res <- do.call(rbind,lapply(tmp,function(i) match(studentid,i)))
### change the results to the indicator variable desired
res[!is.na(res)] <- 1
res[is.na(res)] <- 0
res <- data.frame("StudentID"=x$StudentID,res)
colnames(res) <- c("StudentID",studentid)
I have a list of dataframes in R, each loaded from a different file containing ranks. For example the files could contain finishing positions for different athletes in different races.
The same element (athlete) can appear in more than one dataframe (race) but no dataframe will necessarily contain all elements.
I would like to populate a matrix of rankings with athletes as rows and races as columns. Where there is no ranking for an athlete in a particular race it should read 0.
For example, if I have:
[[1]]
name rank
1 Alice 1
2 Bob 2
3 Carla 3
4 Diego 4
[[2]]
name rank
1 Alice 2
2 Carla 1
3 Eric 3
4 Frank 4
5 Gary 5
[[3]]
name rank
1 Bob 5
2 Carla 4
3 Diego 3
4 Eric 1
5 Gary 2
I would like to generate a matrix:
1 2 3
Alice 1 2 0
Bob 2 0 5
Carla 3 1 4
Diego 4 0 3
Eric 0 3 1
Frank 0 4 0
Gary 0 5 2
I am looking for an efficient way to do this: my data is more like 200 dataframes and 10000 ranked elements per dataframe (15000 unique elements in total) so the final matrix will be approx 15000x200
Here's a solution using reshape2 package:
require(reshape2)
dcast(do.call(rbind, lapply(seq_along(ll), function(ix)
transform(ll[[ix]], id = ix))), name ~ id, value.var="rank", fill=0)
name 1 2 3
1 Alice 1 2 0
2 Bob 2 0 5
3 Carla 3 1 4
4 Diego 4 0 3
5 Eric 0 3 1
6 Frank 0 4 0
7 Gary 0 5 2
where ll is your list of data.frames.
or equivalently:
dcast(transform(do.call(rbind, ll), id = rep(seq_along(ll), sapply(ll, nrow))),
name ~ id, value.var = "rank", fill = 0)
A data.table solution:
require(data.table)
pp <- rbindlist(ll)[, id := rep(seq_along(ll), sapply(ll, nrow))]
setkey(pp, "name", "id")
pp[CJ(unique(name), 1:3)][is.na(rank), rank := 0L][, as.list(rank), by = name]
name V1 V2 V3
1: Alice 1 2 0
2: Bob 2 0 5
3: Carla 3 1 4
4: Diego 4 0 3
5: Eric 0 3 1
6: Frank 0 4 0
7: Gary 0 5 2
Some benchmarking (now that we've quite some answers):
names <- tapply(sample(letters, 1e4, replace=TRUE), rep(1:(1e4/5), each=5), paste, collapse="")
names <- unique(names)
dd_create <- function() {
nrow <- sample(c(100:500), 1)
ncol <- 3
data.frame(name = sample(names, nrow, replace=FALSE), rank = sample(nrow))
}
ll <- replicate(1e3, dd_create(), simplify = FALSE)
require(reshape2)
require(data.table)
Arun1_reshape2 <- function(ll) {
# same as #agstudy's
dcast(do.call(rbind, lapply(seq_along(ll), function(ix)
transform(ll[[ix]], id = ix))), name ~ id, value.var="rank", fill=0)
}
Arun2_reshape2 <- function(ll) {
dcast(transform(do.call(rbind, ll), id = rep(seq_along(ll), sapply(ll, nrow))),
name ~ id, value.var = "rank", fill = 0)
}
eddi_reshape2 <- function(ll) {
dcast(melt(ll, id.vars = 'name'), name ~ L1, fill = 0)
}
Arun_data.table <- function(ll) {
pp <- rbindlist(ll)[, id := rep(seq_along(ll), sapply(ll, nrow))]
setkey(pp, "name", "id")
pp[CJ(unique(name), 1:1000)][is.na(rank), rank := 0L][, as.list(rank), by = name]
}
merge.all <- function(x, y) {
merge(x, y, all=TRUE, by="name")
}
Hong_Ooi <- function(ll) {
for(i in seq_along(ll))
names(ll[[i]])[2] <- paste0("rank", i)
out <- Reduce(merge.all, ll)
}
require(microbenchmark)
microbenchmark( arun1 <- Arun1_reshape2(ll),
arun2 <- Arun2_reshape2(ll),
eddi <- eddi_reshape2(ll),
hong <- Hong_Ooi(ll),
arun.dt <- Arun_data.table(ll), times=10)
Unit: seconds
expr min lq median uq max neval
arun1 <- Arun1_reshape2(ll) 9.157160 9.177143 9.366775 9.715767 28.043125 10
arun2 <- Arun2_reshape2(ll) 8.408356 8.437066 8.494233 9.018796 10.075029 10
eddi <- eddi_reshape2(ll) 8.056605 8.314110 8.402396 8.474129 9.124581 10
hong <- Hong_Ooi(ll) 82.457432 82.716930 82.908646 108.413217 321.164598 10
arun.dt <- Arun_data.table(ll) 2.006474 2.123331 2.212783 2.311619 2.738914 10
Here's a simpler reshape2 solution:
library(reshape2)
dcast(melt(ll, id.vars = 'name'), name ~ L1, fill = 0)
# name 1 2 3
#1 Alice 1 2 0
#2 Bob 2 0 5
#3 Carla 3 1 4
#4 Diego 4 0 3
#5 Eric 0 3 1
#6 Frank 0 4 0
#7 Gary 0 5 2
Arun's benchmarks were pretty interesting, and it seems like what data.table does really well is the melting part, and what reshape2 does really well is the dcast, so here's the best of both worlds:
library(reshape2)
library(data.table)
pp = rbindlist(ll)[, id := rep(seq_along(ll), sapply(ll, nrow))]
dcast(pp, name ~ id, fill = 0, value.var = 'rank')
Using Arun's benchmark data:
names <- tapply(sample(letters, 1e4, replace=TRUE), rep(1:(1e4/5), each=5), paste, collapse="")
names <- unique(names)
dd_create <- function() {
nrow <- sample(c(100:500), 1)
ncol <- 3
data.frame(name = sample(names, nrow, replace=FALSE), rank = sample(nrow))
}
ll <- replicate(1e3, dd_create(), simplify = FALSE)
Arun_data.table <- function(ll) {
pp <- rbindlist(ll)[, id := rep(seq_along(ll), sapply(ll, nrow))]
setkey(pp, "name", "id")
pp[CJ(unique(name), 1:1000)][is.na(rank), rank := 0L][, as.list(rank), by = name]
}
mix_of_both = function(ll) {
pp = rbindlist(ll)[, id := rep(seq_along(ll), sapply(ll, nrow))]
dcast(pp, name ~ id, fill = 0, value.var = 'rank')
}
require(microbenchmark)
microbenchmark(Arun_data.table(ll), mix_of_both(ll), times = 10)
# Unit: milliseconds
# expr min lq median uq max neval
# Arun_data.table(ll) 2568.333 2586.0079 2626.7704 2832.8076 2911.1314 10
# mix_of_both(ll) 615.166 739.9383 766.8994 788.5822 821.0478 10
here the data since the OP don't give a reproducible example :
dput(ll)
list(structure(list(name = structure(1:4, .Label = c("Alice",
"Bob", "Carla", "Diego"), class = "factor"), rank = 1:4), .Names = c("name",
"rank"), class = "data.frame", row.names = c("1", "2", "3", "4"
)), structure(list(name = structure(1:5, .Label = c("Alice",
"Carla", "Eric", "Frank", "Gary"), class = "factor"), rank = c(2L,
1L, 3L, 4L, 5L)), .Names = c("name", "rank"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5")), structure(list(name = structure(1:5, .Label = c("Bob",
"Carla", "Diego", "Eric", "Gary"), class = "factor"), rank = c(5L,
4L, 3L, 1L, 2L)), .Names = c("name", "rank"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5")))
Nearly the same solution as #Arun one, but in 2 separtes steps:
## add race column
ll <- lapply(seq_along(ll),function(x){
ll[[x]]$race <- x
ll[[x]]
})
## create a long data.frame
dd <- do.call(rbind,ll)
## transform to the wide format
library(reshape2)
dcast(name~race,data=dd,fill=0,value.var='rank')
name 1 2 3
1 Alice 1 2 0
2 Bob 2 0 5
3 Carla 3 1 4
4 Diego 4 0 3
5 Eric 0 3 1
6 Frank 0 4 0
7 Gary 0 5 2
Another Reduce use case, it seems.
merge.all <- function(x, y)
merge(x, y, all=TRUE, by="name")
# to avoid problems with merged name clashes
for(i in seq_along(ll))
names(ll[[i]])[2] <- paste0("rank", i)
out <- Reduce(merge.all, ll)
You'll have to modify your data frames slightly to avoid merge complaining about name collisions; a for loop works as well as anything for this purpose.
Any missing races will have NA. You can replace them with 0 by out[is.na(out)] <- 0; you should ask yourself whether this is sensible though. For example, if you do this, then simple summary statistics like means, variances etc will give misleading results. The same holds if you want to do any more complicated modelling. By contrast, most R modelling functions will be smart enough to exclude NAs.