I have a data frame in the following format.
match team1 team2 winningTeam
1 A D A
2 B E E
3 C F C
4 D C C
5 E B B
6 F A A
7 A D D
8 D A A
What I want to do is to create variables that calculate the form of both team 1 and team 2 over the last x matches. For example, I would want to create a variable called team1_form_last3_matches, which for match 8 would be 0.33 (as they won 1 of their last 3 matches), and a variable called team2_form_last3_matches, which would be 0.66 in match 8 (as they won 2 of their last 3 matches). Ideally I would like to be able to specify the number of previous matches to consider when calculating the teamx_form_lasty variables, and to have those variables created automatically.
I have tried a bunch of approaches using dplyr, zoo's rolling mean functions, and a load of nested for/if statements. However, I have not quite cracked it, and certainly not in an elegant way. I feel like I am missing a simple solution to this generic problem. Any help would be much appreciated!
Cheers,
Jack
This works for t1l3; you will need to replicate it for t2 (a generalized version is sketched after the code).
dat <- data.frame(match = 1:8,
                  team1 = c("A","B","C","D","E","F","A","D"),
                  team2 = c("D","E","F","C","B","A","D","A"),
                  winningTeam = c("A","E","C","C","B","A","D","A"),
                  stringsAsFactors = FALSE)
dat$t1l3 <- c(NA, sapply(2:nrow(dat), function(i) {
  df <- dat[1:(i - 1), ]  # just previous games, i.e. excludes the current game
  df <- df[df$team1 == dat$team1[i] | df$team2 == dat$team1[i], ]  # just those containing team 1
  df <- tail(df, 3)  # just the last three (or fewer if there aren't three previous games)
  sum(df$winningTeam == dat$team1[i]) / nrow(df)  # total wins / total games (up to three)
}))
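If you would rather not duplicate the code for t2 by hand, here is a hedged sketch of a generalized helper (the names add_form, teamcol and k are my own) that applies the same idea to either team column and any window size, and builds the column name automatically:
add_form <- function(dat, teamcol, k) {
  form <- c(NA, sapply(2:nrow(dat), function(i) {
    team <- dat[[teamcol]][i]
    prev <- dat[1:(i - 1), ]                                # only games before match i
    prev <- prev[prev$team1 == team | prev$team2 == team, ] # only games involving this team
    prev <- tail(prev, k)                                   # at most the last k of them
    if (nrow(prev) == 0) NA else sum(prev$winningTeam == team) / nrow(prev)
  }))
  dat[[paste0(teamcol, "_form_last", k, "_matches")]] <- form
  dat
}
dat <- add_form(dat, "team1", 3)  # creates team1_form_last3_matches
dat <- add_form(dat, "team2", 3)  # creates team2_form_last3_matches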
How about something like:
dat <- data.frame(match = 1:8,
                  team1 = c("A","B","C","D","E","F","A","D"),
                  team2 = c("D","E","F","C","B","A","D","A"),
                  winningTeam = c("A","E","C","C","B","A","D","A"))
match team1 team2 winningTeam
1 1 A D A
2 2 B E E
3 3 C F C
4 4 D C C
5 5 E B B
6 6 F A A
7 7 A D D
8 8 D A A
Allteams <- c("A","B","C","D","E","F")
# A vectorized function for you to use to do as you ask:
teamX_form_lastY <- function(teams, games, dat) {
  sapply(teams, function(x) {
    games_info <- rowSums(dat[, c("team1","team2")] == x) + (dat[, "winningTeam"] == x)
    lookup <- ifelse(rev(games_info[games_info != 0]) == 2, 1, 0)
    games_won <- sum(lookup[1:games])
    if (length(lookup) < games) warning(paste("maximum games for team", x, "should be", length(lookup)))
    games_won / games
  })
}
teamX_form_lastY("A", 4, dat)
A
0.75
# Has a warning for the number of games you should be using
teamX_form_lastY("A", 5, dat)
A
NA
Warning message:
In FUN(X[[i]], ...) : maximum games for team A should be 4
# vectorized input
teamX_form_lastY(teams = c("A","B"), games = 2, dat = dat)
A B
0.5 0.5
# so you can do all teams
teamX_form_lastY(teams = Allteams, 2, dat)
A B C D E F
0.5 0.5 1.0 0.5 0.5 0.0
library(magrittr)
library(dplyr)
V1 <- c("A","A","A","A","A","A","B","B","B","B", "B","B","C","C","C","C","C","C","D","D","D","D","D","D","E","E","E","E","E","E")
V2 <- c("A","B","C","D","E","F","A","B","C","D","E","F","A","B","C","D","E","F","A","B","C","D","E","F","A","B","C","D","E","F")
cor <- c(1,0.8,NA,NA,NA,NA,0.8,1,NA,NA,NA,NA,NA,NA,1,0.8,NA,NA,NA,NA,0.8,1,NA,NA,NA,NA,NA,NA,1,0.9)
df <- data.frame(V1,V2,cor)
# exclude rows where cor=NA
df <- df[complete.cases(df), ]
This is the full data frame; cor = NA represents a correlation smaller than 0.8.
df
V1 V2 cor
1 A A 1.0
2 A B 0.8
7 B A 0.8
8 B B 1.0
15 C C 1.0
16 C D 0.8
21 D C 0.8
22 D D 1.0
29 E E 1.0
30 E F 0.9
In the above df, F is not in V1, meaning that F is not of interest
so here I remove rows where V2 = F (more generally, rows where V2 equals a value that is not in V1):
V1.LIST <- unique(df$V1)
df.gp <- df[which(df$V2 %in% V1.LIST),]
df.gp
V1 V2 cor
1 A A 1.0
2 A B 0.8
7 B A 0.8
8 B B 1.0
15 C C 1.0
16 C D 0.8
21 D C 0.8
22 D D 1.0
29 E E 1.0
So now, df.gp is the dataset I need to work on
I drop the unused level in V2 (which is F in the example)
df.gp$V2 <- droplevels(df.gp$V2)
I do not want to exclude the autocorrelated variables, in case some of the V1 values are not correlated with any others; I would like to put each of those in a separate group.
By looking at the cor values, A and B are correlated, C and D are correlated, and E belongs to a group by itself.
Therefore, the example here should have three groups.
The way I see this, you may have complicated things by working your data straight into a data.frame. I took the liberty of transforming it back to a matrix.
library(reshape2)
cormat_df <- dcast(data = df, formula = V1 ~ V2)
cormat <- as.matrix(cormat_df[, -1])  # drop the id column first so the matrix stays numeric
row.names(cormat) <- cormat_df$V1
cormat
Once I had your correlation matrix, it is easy to see which indices of non-NA values are shared with other variables.
a <- apply(cormat, 1, function(x) which(!is.na(x)))
a <- data.frame(t(a))
a$var <- row.names(a)
row.names(a) <- NULL
a
X1 X2 var
1 1 2 A
2 1 2 B
3 3 4 C
4 3 4 D
5 5 6 E
Now either X1 or X2 determines your unique groupings.
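For example, splitting the variable names on X1 gives the three groups directly:
split(a$var, a$X1)
# $`1`
# [1] "A" "B"
#
# $`3`
# [1] "C" "D"
#
# $`5`
# [1] "E"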
Edited by cyrusjan:
The above script is a possible solution, assuming we have already selected the rows with cor >= a, where a is a threshold taken as 0.8 in the above question.
Contributed by alexis_laz:
By using cutree and hclust, we can set the threshold in the script itself (i.e. h = 0.8), as below.
cor.gp <- data.frame(
  cor.gp = cutree(hclust(1 - as.dist(xtabs(cor ~ V1 + V2, df.gp))), h = 0.8)
)
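To inspect the resulting memberships, you can split the variable names by group; with the example data this should recover the three groups, although the exact group numbers depend on how cutree assigns them:
split(rownames(cor.gp), cor.gp$cor.gp)
# $`1`
# [1] "A" "B"
#
# $`2`
# [1] "C" "D"
#
# $`3`
# [1] "E"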
I want to add a field to a data frame and set it equal to an existing field in the same data frame, based on a condition on a different (existing) field.
I know this works:
is.even <- function(x) x %% 2 == 0
df <- data.frame(a = c(1,2,3,4,5,6),
                 b = c("A","B","C","D","E","F"))
df$test[is.even(df$a)] <- as.character(df[is.even(df$a), "b"])
> df
a b test
1 1 A NA
2 2 B B
3 3 C NA
4 4 D D
5 5 E NA
6 6 F F
But I have this feeling it can be done a lot better than this.
Using data.table, it's quite easy:
library(data.table)
dt = data.table(a = c(1,2,3,4,5,6),
                b = c("A","B","C","D","E","F"))
dt[is.even(a), test := b]
> dt
a b test
1: 1 A NA
2: 2 B B
3: 3 C NA
4: 4 D D
5: 5 E NA
6: 6 F F
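For comparison, a base R one-liner with ifelse() produces the same result in a single assignment (a sketch; as.character() guards against b being a factor):
df$test <- ifelse(is.even(df$a), as.character(df$b), NA)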
I have an issue with a function that calculates the cumulative mean, with a lag of one, over groups defined by a field:
cumroll <- function(x) {
  x <- head(x, -1)
  c(head(x, 1), cumsum(x) / seq_along(x))
}
Everything works fine as long as I am performing this function over groups that are larger than one:
Player <- c('B','B','C','C','C','D','D','D','D','E','E','E','E','E')
Team <- c('B','B','C','C','C','D','D','D','D','E','E','E','E','E')
Score <- c(2,7,3,9,6,3,7,1,7,3,8,3,4,1)
data.frame(Player, Team, Score)
test <- ave(Score, Player, Team, FUN = cumroll)
data.frame(Player, Team, Score, test)
However, when my dataset has a group of size one:
Player <- c('A','B','B','C','C','C','D','D','D','D','E','E','E','E','E')
Team <- c('A','B','B','C','C','C','D','D','D','D','E','E','E','E','E')
Score <- c(5,2,7,3,9,6,3,7,1,7,3,8,3,4,1)
data.frame(Player, Team, Score)
test <- ave(Score, Player, Team, FUN = cumroll)
data.frame(Player, Team, Score, test)
I get the error:
Error in `split<-.default`(`*tmp*`, g, value = lapply(split(x, g), FUN)) :
replacement has length zero
I know there is a way to modify the function to account for this; I want it to return the observed value when the group size is 1. Any help is appreciated!
The simplest way to change the function's behavior conditional on the length of the input is, happily, to condition on the length of the input. E.g., you can use
cumroll <- function(x) {
  if (length(x) <= 1) {
    x
  } else {
    x <- head(x, -1)
    c(head(x, 1), cumsum(x) / seq_along(x))
  }
}
Player <- c('A','B','B','C','C','C','D','D','D','D','E','E','E','E','E')
Team <- c('A','B','B','C','C','C','D','D','D','D','E','E','E','E','E')
Score <- c(5,2,7,3,9,6,3,7,1,7,3,8,3,4,1)
test <- ave(Score, Player, Team, FUN = cumroll)
> data.frame(Player, Team, Score, test)
Player Team Score test
1 A A 5 5.000000
2 B B 2 2.000000
3 B B 7 2.000000
4 C C 3 3.000000
5 C C 9 3.000000
6 C C 6 6.000000
7 D D 3 3.000000
8 D D 7 3.000000
9 D D 1 5.000000
10 D D 7 3.666667
11 E E 3 3.000000
12 E E 8 3.000000
13 E E 3 5.500000
14 E E 4 4.666667
15 E E 1 4.500000
But I'm a little wary about your approach... how is the cumulative mean with a lag of one defined, precisely? You might look at shift in data.table and rollapply in zoo to get better performance and robustness.
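For what it's worth, here is a minimal sketch of the shift()-based alternative, assuming "cumulative mean with a lag of one" means the running mean of all previous scores within a group, with the first row of each group falling back to its own score (mirroring cumroll's behaviour):
library(data.table)
dt <- data.table(Player, Team, Score)
# running mean of the scores so far, lagged one row within each group
dt[, test := shift(cumsum(Score) / seq_len(.N)), by = .(Player, Team)]
dt[is.na(test), test := Score]  # first row of each group keeps its own score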
I have a data frame like this:
n = c(2, 2, 3, 3, 4, 4)
n <- as.factor(n)
s = c("a", "b", "c", "d", "e", "f")
df = data.frame(n, s)
df
n s
1 2 a
2 2 b
3 3 c
4 3 d
5 4 e
6 4 f
and I want to access the first element of each level of my factor (and have in this example a vector containing a, c, e).
It is possible to reach the first element of one level, with
df$s[df$n == 2][1]
but it does not work for all levels:
df$s[df$n == levels(n)]
[1] a f
How would you do that?
And to go further, I’d like to modify my data frame to see which is the first element for each level at every occurrence. In my example, a new column should be:
n s rep firstelement
1 2 a a a
2 2 b c a
3 3 c e c
4 3 d a c
5 4 e c e
6 4 f e e
Edit. The first part of my answer addresses the original question, i.e. before "And to go further" (which was added by OP in an edit).
Another possibility, using duplicated. From ?duplicated: "duplicated() determines which elements of a vector or data frame are duplicates of elements with smaller subscripts."
Here we use !, the logical negation (NOT), to select not duplicated elements of 'n', i.e. first elements of each level of 'n'.
df[!duplicated(df$n), ]
# n s
# 1 2 a
# 3 3 c
# 5 4 e
Update: Didn't see your "And to go further" edit until now. My first suggestion would definitely be to use ave, as already proposed by @thelatemail and @sparrow. But just to dig around in the R toolbox and show you an alternative, here's a dplyr way:
Group the data by n, then use the mutate function to create a new variable 'first' whose value is the first element of s (s[1]):
library(dplyr)
df %>%
  group_by(n) %>%
  mutate(first = s[1])
# n s first
# 1 2 a a
# 2 2 b a
# 3 3 c c
# 4 3 d c
# 5 4 e e
# 6 4 f e
Or go all in with dplyr convenience functions and use first instead of [1]:
df %>%
  group_by(n) %>%
  mutate(first = first(s))
A dplyr solution for your original question would be to use summarise:
df %>%
  group_by(n) %>%
  summarise(first = first(s))
# n first
# 1 2 a
# 2 3 c
# 3 4 e
Here is an approach using match:
df$s[match(levels(n), df$n)]
EDIT: Maybe this looks a bit confusing ...
To get a column which lists the first elements you could use match twice (but with x and table arguments swapped):
df$firstelement <- df$s[match(levels(n), df$n)[match(df$n, levels(n))]]
df$firstelement
# [1] a a c c e e
# Levels: a b c d e f
Let's look at this in detail:
## this returns the first matching elements
match(levels(n), df$n)
# [1] 1 3 5
## when we swap the x and table argument in match we get the level index
## for each df$n (the duplicated indices are important)
match(df$n, levels(n))
# [1] 1 1 2 2 3 3
## results in
c(1, 3, 5)[c(1, 1, 2, 2, 3, 3)]
# [1] 1 1 3 3 5 5
df$s[c(1, 1, 3, 3, 5, 5)]
# [1] a a c c e e
# Levels: a b c d e f
The function ave is useful in these cases:
df$firstelement = ave(df$s, df$n, FUN = function(x) x[1])
df
n s firstelement
1 2 a a
2 2 b a
3 3 c c
4 3 d c
5 4 e e
6 4 f e
In this case I prefer the plyr package; it gives further freedom to manipulate the data.
library(plyr)
ddply(df, .(n), function(subdf) subdf[1, ])
n s
1 2 a
2 3 c
3 4 e
You could also use data.table
library(data.table)
dt = as.data.table(df)
dt[, list(firstelement = s[1]), by=n]
which would get you:
n firstelement
1: 2 a
2: 3 c
3: 4 e
The by=n bit groups everything by each value of n so s[1] is getting the first element of each of those groups.
To get this as an extra column you could do:
dt[, newcol := s[1], by=n]
dt
# n s newcol
#1: 2 a a
#2: 2 b a
#3: 3 c c
#4: 3 d c
#5: 4 e e
#6: 4 f e
So this just takes the value of s from the first row of each group and assigns it to a new column.
df$s[sapply(levels(n), function(particular.level) { which(df$n == particular.level)[1]})]
I believe your problem is that you are comparing two vectors: df$n and levels(n). vector == vector only happens to work for you because the length of df$n is a multiple of the length of levels(n), so the shorter vector is silently recycled.
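A quick illustration of the recycling (levels(n) has length 3 and df$n has length 6, so the levels are repeated to c("2","3","4","2","3","4") before comparing):
df$n == levels(n)
# [1]  TRUE FALSE FALSE FALSE FALSE  TRUE
which is why only the first and last elements (a and f) were returned above.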
Surprised not to see this classic in the answer stream yet.
> do.call(rbind, lapply(split(df, df$n), function(x) x[1,]))
## n s
## 2 2 a
## 3 3 c
## 4 4 e
I have two data frames in R.
dataframe 1
A B C D E F G
1 2 a a a a a
2 3 b b b c c
4 1 e e f f e
dataframe 2
X Y Z
1 2 g
2 1 h
3 4 i
1 4 j
I want to match dataframe1's columns A and B with dataframe2's columns X and Y. It is NOT a pairwise comparison, i.e. row 1 of dataframe 1 (A=1, B=2) is considered the same as both row 1 (X=1, Y=2) and row 2 (X=2, Y=1) of dataframe 2.
Where a match is found, I would like to add columns C, D, E, F, G of dataframe1 to the matched row of dataframe2, as follows, with unmatched rows filled with NA.
Final dataframe
X Y Z C D E F G
1 2 g a a a a a
2 1 h a a a a a
3 4 i na na na na na
1 4 j e e f f e
I only know how to do matching on a single column; matching on two exchangeable columns and merging the two data frames based on the matching results is difficult for me. Any help with a smart way of doing this would be much appreciated.
For ease of discussion (thanks to the comments by Vincent and DWin on my previous question that I should post the code), here is the code for loading data frames 1 and 2 into R.
df1 <- data.frame(A = c(1,2,4), B = c(2,3,1), C = c('a','b','e'),
                  D = c('a','b','e'), E = c('a','b','f'),
                  F = c('a','c','f'), G = c('a','c','e'))
df2 <- data.frame(X = c(1,2,3,1), Y=c(2,1,4,4), Z=letters[7:10])
The following works, but no doubt can be improved.
I first create a little helper function that performs a row-wise sort on A and B (and renames them to V1 and V2).
replace_index <- function(dat){
  x <- as.data.frame(t(sapply(seq_len(nrow(dat)),
                              function(i) sort(unlist(dat[i, 1:2])))))
  names(x) <- paste("V", seq_len(ncol(x)), sep = "")
  data.frame(x, dat[, -(1:2), drop = FALSE])
}
replace_index(df1)
V1 V2 C D E F G
1 1 2 a a a a a
2 2 3 b b b c c
3 1 4 e e f f e
This means you can use a straightforward merge to combine the data.
merge(replace_index(df1), replace_index(df2), all.y=TRUE)
V1 V2 C D E F G Z
1 1 2 a a a a a g
2 1 2 a a a a a h
3 1 4 e e f f e j
4 3 4 <NA> <NA> <NA> <NA> <NA> i
This is slightly clunky, and has some potential collision and order issues, but it works with your example:
df1a <- df1; df1a$A <- df1$B; df1a$B <- df1$A  # reverse A and B
merge(df2, rbind(df1,df1a), by.x=c("X","Y"), by.y=c("A","B"), all.x=TRUE)
to produce
X Y Z C D E F G
1 1 2 g a a a a a
2 1 4 j e e f f e
3 2 1 h a a a a a
4 3 4 i <NA> <NA> <NA> <NA> <NA>
One approach would be to create an id key for matching that is order invariant.
# create id key to match
require(plyr)
df1 = adply(df1, 1, transform, id = paste(min(A, B), "-", max(A, B)))
df2 = adply(df2, 1, transform, id = paste(min(X, Y), "-", max(X, Y)))
# combine data frames using `match`
cbind(df2, df1[match(df2$id, df1$id),3:7])
This produces the output
X Y Z id C D E F G
1 1 2 g 1 - 2 a a a a a
1.1 2 1 h 1 - 2 a a a a a
NA 3 4 i 3 - 4 <NA> <NA> <NA> <NA> <NA>
3 1 4 j 1 - 4 e e f f e
You could also join the tables both ways (X == A and Y == B, then X == B and Y == A) and rbind them. This will produce duplicate pairs where one way yielded a match and the other yielded NA, so you would then reduce duplicates by slicing only a single row for each X-Y combination, the one without NA if one exists.
library(dplyr)
m <- left_join(df2,df1,by = c("X" = "A","Y" = "B"))
n <- left_join(df2,df1,by = c("Y" = "A","X" = "B"))
rbind(m,n) %>%
group_by(X,Y) %>%
arrange(C,D,E,F,G) %>% # sort to put NA rows on bottom of pairs
slice(1) # take top row from combination
Produces:
Source: local data frame [4 x 8]
Groups: X, Y
X Y Z C D E F G
1 1 2 g a a a a a
2 1 4 j e e f f e
3 2 1 h a a a a a
4 3 4 i NA NA NA NA NA
Here's another possible solution in base R. This solution cbind()s new key columns (K1 and K2) to both data.frames using the vectorized pmin() and pmax() functions to derive the canonical order of the key columns, and merges on those:
merge(cbind(df2, K1 = pmin(df2$X, df2$Y), K2 = pmax(df2$X, df2$Y)),
      cbind(df1, K1 = pmin(df1$A, df1$B), K2 = pmax(df1$A, df1$B)),
      all.x = TRUE)[, -c(1:2, 6:7)]
## X Y Z C D E F G
## 1 1 2 g a a a a a
## 2 2 1 h a a a a a
## 3 1 4 j e e f f e
## 4 3 4 i <NA> <NA> <NA> <NA> <NA>
Note that the use of pmin() and pmax() is only possible for this problem because you have exactly two key columns; if you had more, you'd need some kind of apply+sort solution to achieve a canonical key order for merging, similar to what @Andrie does in his helper function. That approach works for any number of key columns, but is less performant. A sketch of such a generalization follows.
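Here is a hedged sketch of that apply+sort generalization (canonical_keys is a hypothetical helper; keycols names the key columns of each data frame):
canonical_keys <- function(dat, keycols) {
  sorted <- t(apply(dat[keycols], 1, sort))  # row-wise canonical order of the keys
  colnames(sorted) <- paste0("K", seq_along(keycols))
  cbind(dat, sorted)
}
merge(canonical_keys(df2, c("X","Y")),
      canonical_keys(df1, c("A","B")),
      all.x = TRUE)
# drop the K1/K2 helper columns from the result afterwards if you don't want them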