Cumulative Mean Function Issue with Grouping Sizes

I have a function that calculates the cumulative mean of a field, with a lag of one, within groups:
cumroll <- function(x) {
  x <- head(x, -1)
  c(head(x, 1), cumsum(x) / seq_along(x))
}
Everything works fine as long as I apply this function to groups larger than one:
Player <- c('B','B','C','C','C','D','D','D','D','E','E','E','E','E')
Team <- c('B','B','C','C','C','D','D','D','D','E','E','E','E','E')
Score <- c(2,7,3,9,6,3,7,1,7,3,8,3,4,1)
data.frame(Player, Team, Score)
test <- ave(Score, Player, Team, FUN = cumroll)
data.frame(Player, Team, Score, test)
However, when my dataset has a group of size one:
Player <- c('A','B','B','C','C','C','D','D','D','D','E','E','E','E','E')
Team <- c('A','B','B','C','C','C','D','D','D','D','E','E','E','E','E')
Score <- c(5,2,7,3,9,6,3,7,1,7,3,8,3,4,1)
data.frame(Player, Team, Score)
test <- ave(Score, Player, Team, FUN = cumroll)
data.frame(Player, Team, Score, test)
I get the error:
Error in `split<-.default`(`*tmp*`, g, value = lapply(split(x, g), FUN)) :
replacement has length zero
I know there is a way to modify the function to account for this. In these cases I want it to return the observed value when the group size is 1. Any help is appreciated!

The simplest way to change the function's behavior conditional on the length of the input is, happily, to condition on the length of the input. E.g., you can use
cumroll <- function(x) {
  if (length(x) <= 1) {
    x
  } else {
    x <- head(x, -1)
    c(head(x, 1), cumsum(x) / seq_along(x))
  }
}
Player <- c('A','B','B','C','C','C','D','D','D','D','E','E','E','E','E')
Team <- c('A','B','B','C','C','C','D','D','D','D','E','E','E','E','E')
Score <- c(5,2,7,3,9,6,3,7,1,7,3,8,3,4,1)
test <- ave(Score, Player, Team, FUN = cumroll)
> data.frame(Player, Team, Score, test)
Player Team Score test
1 A A 5 5.000000
2 B B 2 2.000000
3 B B 7 2.000000
4 C C 3 3.000000
5 C C 9 3.000000
6 C C 6 6.000000
7 D D 3 3.000000
8 D D 7 3.000000
9 D D 1 5.000000
10 D D 7 3.666667
11 E E 3 3.000000
12 E E 8 3.000000
13 E E 3 5.500000
14 E E 4 4.666667
15 E E 1 4.500000
But I'm a little wary of your approach: how exactly is "cumulative mean with a lag of one" defined? You might look at shift in data.table and rollapply in zoo for better performance and robustness.
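For illustration, here is a data.table sketch using shift (my assumption being that "cumulative mean with a lag of one" means the running mean of all earlier scores in the group, with the first score carried into the first row):
library(data.table)
dt <- data.table(Player, Team, Score)
# Running mean of Score within each group, shifted down one row;
# the first row of each group is filled with its own observed Score.
dt[, test := shift(cumsum(Score) / seq_along(Score), n = 1, fill = Score[1]),
   by = .(Player, Team)]
On the example data this reproduces the cumroll output above, including the size-one group for player A.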

Related

How to produce a compact letter display for overlapping confidence intervals?

I'm trying to produce a compact letter display for all pairwise comparisons of overlapping intervals. I've tried searching for a package that already does this, but I can't seem to find one that isn't rooted in statistical pairwise comparisons.
trt <- paste("", letters[1:10], sep = "")
upper <- seq(from = 2, to = 10, length.out = 10)
lower <- seq(from = 0, to = 8, length.out = 10)
new.df <- data.frame(trt, upper, lower)
Thus, I end up with this data frame:
> new.df
trt upper lower
1 a 2.000000 0.0000000
2 b 2.888889 0.8888889
3 c 3.777778 1.7777778
4 d 4.666667 2.6666667
5 e 5.555556 3.5555556
6 f 6.444444 4.4444444
7 g 7.333333 5.3333333
8 h 8.222222 6.2222222
9 i 9.111111 7.1111111
10 j 10.000000 8.0000000
Then to make all the unique combinations this is what I used:
comparisons <- combn(unique(new.df$trt), 2)
I then convert this into a data frame:
comparisons <- as.data.frame(cbind(comparisons[1,], comparisons[2,]))
> head(comparisons)
V1 V2
1 a b
2 a c
3 a d
4 a e
5 a f
6 a g
Once I get to my list of comparisons I'm not entirely sure where to proceed.
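One possible direction, assuming the multcompView package is available: treat a pair as "different" when its intervals do not overlap, then pass that named logical vector to multcompView::multcompLetters, which (as I understand it) accepts hyphen-named TRUE/FALSE comparisons and returns a compact letter display. A rough, untested sketch:
library(multcompView)
# Two intervals overlap iff each lower bound is at or below the other's upper bound.
overlaps <- function(i, j) {
  new.df$lower[i] <= new.df$upper[j] & new.df$lower[j] <= new.df$upper[i]
}
idx <- combn(nrow(new.df), 2)
different <- !mapply(overlaps, idx[1, ], idx[2, ])  # TRUE = intervals do not overlap
names(different) <- apply(combn(as.character(new.df$trt), 2), 2, paste, collapse = "-")
multcompLetters(different)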

R: Grouping data within a certain range

I have a data frame with two columns, let's call them X and Y. Here's an example of it:
df <- data.frame(X = LETTERS[1:8],
                 Y = c(14, 12, 12, 11, 9, 6, 4, 1),
                 stringsAsFactors = FALSE)
which produces this:
X Y
A 14
B 12
C 12
D 11
E 9
F 6
G 4
H 1
Note that the data frame will always be ordered in descending order of Y. I want to group together cases whose Y values lie within a certain range of each other, while updating the X column to reflect the grouping too. For example, if the threshold is 2, I would like the final output to be:
X new_Y
A 14.00000
B C D 11.66667
E 9.00000
F G 5.00000
H 1.00000
Let me explain how I got that. From the starting df data frame, the closest values were B and C. Joining them would result in:
X new_Y
A 14
B C 12
D 11
E 9
F 6
G 4
H 1
The new_Y value for cases B and C is the average of the original values for B and C, i.e. 12. In this second data frame, B C is within 2 of D, so they are the next to be grouped together:
X new_Y
A 14.00000
B C D 11.66667
E 9.00000
F 6.00000
G 4.00000
H 1.00000
Note that the Y value for B C D is 11.67 because the original values of B, C and D were 12, 12 and 11 respectively and their average is 11.667. I wouldn't want the code to return the average Y from the previous iteration (which in this case would be 11.5).
Finally, F and G can also be grouped together, producing the final output stated above.
I'm not sure of the code needed to achieve this. My only thoughts were to calculate the distance from the previous and following element, look for the minimum and check whether it exceeds the threshold value (of 2 in the example above). Based on where that minimum appears, join the X column while averaging the Y values from the original table. Repeat this until the minimum becomes larger than the threshold.
But I'm not sure how to write the necessary code to achieve this or whether there's a more efficient solution to the algorithm I'm suggesting above. Any help will be much appreciated.
P.S. I forgot to mention that if the distances to the previous and to the following Y value are the same, the grouping should be done towards the larger Y value. So
X Y
A 10
B 8
C 6
would be returned as
X new_Y
A B 9
C 6
Thanks in advance for your patience. My apologies if I didn't explain this very well.
This sounds like hierarchical agglomerative clustering.
To get the groups, use dist, hclust and cutree.
Note that centroid clustering with hclust expects the distances as the square of the Euclidean distance.
df <- data.frame(X = LETTERS[1:8],
                 Y = c(14, 12, 12, 11, 9, 6, 4, 1),
                 stringsAsFactors = FALSE)
dCutoff <- 2
d2 <- dist(df$Y)^2
hc <- hclust(d2, method = "centroid")
group_id <- cutree(hc, h = dCutoff^2)
group_id
#> [1] 1 2 2 2 3 4 4 5
To munge the original table, we can use dplyr.
library('dplyr')
df %>%
  group_by(group_id = group_id) %>%
  summarise(
    X = paste(X, collapse = ' '),
    Y = mean(Y))
#> # A tibble: 5 x 3
#> group_id X Y
#> <int> <chr> <dbl>
#> 1 1 A 14.00000
#> 2 2 B C D 11.66667
#> 3 3 E 9.00000
#> 4 4 F G 5.00000
#> 5 5 H 1.00000
This gives the average of the previous iteration, though. In any case, I hope it helps:
library(data.table)
df <- data.table(X = LETTERS[1:8],
                 Y = c(14, 12, 12, 11, 9, 6, 4, 1),
                 stringsAsFactors = FALSE)
differences <- c(diff(df$Y), NA)  # NA for the last element
df$difference <- abs(differences) # differences between consecutive elements (works because Y is sorted)
minimum <- min(df$difference[1:(length(df$difference) - 1)]) # get the minimum
while (minimum < 2) {
  index <- which(df$difference == minimum)[1] # first place where the minimum occurs (handles ties)
  check <- FALSE
  # The last row has no difference because there is no element after it,
  # so check whether it has the minimum difference with its predecessor;
  # if it does not, exclude it here and paste it back later.
  if (df[nrow(df) - 1, difference] != minimum) {
    last_row <- df[nrow(df)]
    df <- df[-nrow(df)]
    check <- TRUE
  }
  tmp <- df[index:(index + 1)]
  df <- df[-(index:(index + 1))]
  to_bind <- data.table(X = paste0(tmp$X, collapse = " "))
  to_bind$Y <- mean(tmp$Y)
  df <- rbind(df[, .(X, Y)], to_bind)
  if (check) {
    df <- rbind(df, last_row[, .(X, Y)])
  }
  setorder(df, -Y)
  differences <- c(diff(df$Y), NA)  # NA for the last element
  df$difference <- abs(differences) # recompute the consecutive differences
  minimum <- min(df$difference[1:(length(df$difference) - 1)]) # get the minimum
}

Aggregated rolling average with a conditional statement in R

I have a data frame in the following format.
match team1 team2 winningTeam
1 A D A
2 B E E
3 C F C
4 D C C
5 E B B
6 F A A
7 A D D
8 D A A
What I want to do is create variables that calculate the form of both team 1 and team 2 over the last x matches. For example, I would want a variable called team1_form_last3_matches, which for match 8 would be 0.33 (as they won 1 of their last 3 matches), and a variable called team2_form_last3_matches, which for match 8 would be 0.66 (as they won 2 of their last 3 matches). Ideally I would like to be able to specify the number of previous matches considered when calculating the teamx_form_lasty variables and have those variables created automatically. I have tried a bunch of approaches using dplyr, zoo's rolling-mean functions, and a load of nested for / if statements, but I have not quite cracked it, and certainly not in an elegant way. I feel like I am missing a simple solution to this generic problem. Any help would be much appreciated!
Cheers,
Jack
This works for t1l3 (team 1's form over the last 3 matches); you will need to replicate it for team 2 (a sketch follows the code below).
dat <- data.frame(match = 1:8,
                  team1 = c("A","B","C","D","E","F","A","D"),
                  team2 = c("D","E","F","C","B","A","D","A"),
                  winningTeam = c("A","E","C","C","B","A","D","A"),
                  stringsAsFactors = FALSE)
dat$t1l3 <- c(NA, sapply(2:nrow(dat), function(i) {
  df <- dat[1:(i - 1), ]                                          # just previous games, i.e. excludes the current game
  df <- df[df$team1 == dat$team1[i] | df$team2 == dat$team1[i], ] # just those containing team 1
  df <- tail(df, 3)                                               # just the last three (or fewer if there aren't three previous games)
  return(sum(df$winningTeam == dat$team1[i]) / nrow(df))          # total wins / total games (up to three)
}))
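For team 2, the same logic should work with dat$team2 in place of dat$team1 (untested sketch):
dat$t2l3 <- c(NA, sapply(2:nrow(dat), function(i) {
  df <- dat[1:(i - 1), ]                                          # previous games only
  df <- df[df$team1 == dat$team2[i] | df$team2 == dat$team2[i], ] # just those containing team 2
  df <- tail(df, 3)                                               # the last three of them (or fewer)
  sum(df$winningTeam == dat$team2[i]) / nrow(df)                  # wins / games
}))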
How about something like:
dat <- data.frame(match = c(1:8), team1 = c("A","B","C","D","E","F","A","D"), team2 = c("D","E","F","C","B","A","D","A"), winningTeam = c("A","E","C","C","B","A","D","A"))
match team1 team2 winningTeam
1 1 A D A
2 2 B E E
3 3 C F C
4 4 D C C
5 5 E B B
6 6 F A A
7 7 A D D
8 8 D A A
Allteams <- c("A","B","C","D","E","F")
# A vectorized function for you to use to do as you ask:
teamX_form_lastY <- function(teams, games, dat) {
  sapply(teams, function(x) {
    games_info <- rowSums(dat[, c("team1", "team2")] == x) + (dat[, "winningTeam"] == x)
    lookup <- ifelse(rev(games_info[games_info != 0]) == 2, 1, 0)
    games_won <- sum(lookup[1:games])
    if (length(lookup) < games) warning(paste("maximum games for team", x, "should be", length(lookup)))
    games_won / games
  })
}
teamX_form_lastY("A", 4, dat)
A
0.75
# Has a warning for the number of games you should be using
teamX_form_lastY("A", 5, dat)
A
NA
Warning message:
In FUN(X[[i]], ...) : maximum games for team A should be 4
# vectorized input
teamX_form_lastY(teams = c("A","B"), games = 2, dat = dat)
A B
0.5 0.5
# so you can do it for all teams
teamX_form_lastY(teams = Allteams, 2, dat)
A B C D E F
0.5 0.5 1.0 0.5 0.5 0.0

In R, how can I access the first element of each level of a factor?

I have a data frame like this:
n = c(2, 2, 3, 3, 4, 4)
n <- as.factor(n)
s = c("a", "b", "c", "d", "e", "f")
df = data.frame(n, s)
df
n s
1 2 a
2 2 b
3 3 c
4 3 d
5 4 e
6 4 f
and I want to access the first element of each level of my factor (and have in this example a vector containing a, c, e).
It is possible to reach the first element of one level, with
df$s[df$n == 2][1]
but it does not work for all levels:
df$s[df$n == levels(n)]
[1] a f
How would you do that?
And to go further, I’d like to modify my data frame to see which is the first element for each level at every occurrence. In my example, a new column should be:
n s rep firstelement
1 2 a a a
2 2 b c a
3 3 c e c
4 3 d a c
5 4 e c e
6 4 f e e
Edit. The first part of my answer addresses the original question, i.e. before "And to go further" (which was added by OP in an edit).
Another possibility, using duplicated. From ?duplicated: "duplicated() determines which elements of a vector or data frame are duplicates of elements with smaller subscripts."
Here we use !, the logical negation (NOT), to select not duplicated elements of 'n', i.e. first elements of each level of 'n'.
df[!duplicated(df$n), ]
# n s
# 1 2 a
# 3 3 c
# 5 4 e
Update Didn't see your "And to go further" edit until now. My first suggestion would definitely be to use ave, as already proposed by @thelatemail and @sparrow. But just to dig around in the R toolbox and show you an alternative, here's a dplyr way:
Group the data by n and use the mutate function to create a new variable, first, containing the first element of s (s[1]):
library(dplyr)
df %>%
  group_by(n) %>%
  mutate(first = s[1])
# n s first
# 1 2 a a
# 2 2 b a
# 3 3 c c
# 4 3 d c
# 5 4 e e
# 6 4 f e
Or go all in with dplyr convenience functions and use first instead of [1]:
df %>%
  group_by(n) %>%
  mutate(first = first(s))
A dplyr solution for your original question would be to use summarise:
df %>%
  group_by(n) %>%
  summarise(first = first(s))
# n first
# 1 2 a
# 2 3 c
# 3 4 e
Here is an approach using match:
df$s[match(levels(n), df$n)]
EDIT: Maybe this looks a bit confusing ...
To get a column which lists the first elements you could use match twice (but with x and table arguments swapped):
df$firstelement <- df$s[match(levels(n), df$n)[match(df$n, levels(n))]]
df$firstelement
# [1] a a c c e e
# Levels: a b c d e f
Let's look at this in detail:
## this returns the first matching elements
match(levels(n), df$n)
# [1] 1 3 5
## when we swap the x and table argument in match we get the level index
## for each df$n (the duplicated indices are important)
match(df$n, levels(n))
# [1] 1 1 2 2 3 3
## results in
c(1, 3, 5)[c(1, 1, 2, 2, 3, 3)]
# [1] 1 1 3 3 5 5
df$s[c(1, 1, 3, 3, 5, 5)]
# [1] a a c c e e
# Levels: a b c d e f
The function ave is useful in these cases:
df$firstelement = ave(df$s, df$n, FUN = function(x) x[1])
df
n s firstelement
1 2 a a
2 2 b a
3 3 c c
4 3 d c
5 4 e e
6 4 f e
In this case I prefer the plyr package; it gives more freedom to manipulate the data.
library(plyr)
ddply(df, .(n), function(subdf) { return(subdf[1, ]) })
n s
1 2 a
2 3 c
3 4 e
You could also use data.table
library(data.table)
dt = as.data.table(df)
dt[, list(firstelement = s[1]), by=n]
which would get you:
n firstelement
1: 2 a
2: 3 c
3: 4 e
The by=n bit groups everything by each value of n so s[1] is getting the first element of each of those groups.
To get this as an extra column you could do:
dt[, newcol := s[1], by=n]
dt
# n s newcol
#1: 2 a a
#2: 2 b a
#3: 3 c c
#4: 3 d c
#5: 4 e e
#6: 4 f e
So this just takes the value of s from the first row of each group and assigns it to a new column.
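A related data.table idiom, in case you want the whole first row of each group rather than a single column:
dt[, .SD[1], by = n]  # .SD is the subset of dt for each group; [1] takes its first row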
df$s[sapply(levels(n), function(particular.level) { which(df$n == particular.level)[1]})]
I believe your problem is that you are comparing two vectors: df$n is a vector and levels(n) is a vector. vector == vector only happens to "work" for you because the length of df$n is a multiple of the length of levels(n), so R silently recycles the shorter vector.
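You can see the recycling with a quick check; it is exactly why only a and f were returned above:
c(2, 2, 3, 3, 4, 4) == c(2, 3, 4)  # the shorter vector is recycled to 2, 3, 4, 2, 3, 4
# [1]  TRUE FALSE FALSE FALSE FALSE  TRUE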
Surprised not to see this classic in the answer stream yet.
> do.call(rbind, lapply(split(df, df$n), function(x) x[1,]))
## n s
## 2 2 a
## 3 3 c
## 4 4 e

Removal of constant columns in R

I was using the prcomp function when I received this error
Error in prcomp.default(x, ...) :
cannot rescale a constant/zero column to unit variance
I know I can scan my data manually, but is there any function or command in R that can help me remove these constant variables?
I know this is a very simple task, but I have never come across any function that does this.
Thanks,
The problem here is that your column variance is equal to zero. You can check which column of a data frame is constant this way, for example:
df <- data.frame(x=1:5, y=rep(1,5))
df
# x y
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 1
# 5 5 1
# Supply names of columns that have 0 variance
names(df[, sapply(df, function(v) var(v, na.rm=TRUE)==0)])
# [1] "y"
So if you want to exclude these columns, you can use :
df[,sapply(df, function(v) var(v, na.rm=TRUE)!=0)]
EDIT: In fact it is simpler to use apply instead. Something like this:
df[,apply(df, 2, var, na.rm=TRUE) != 0]
I guess this Q&A is a popular Google search result, but the accepted answer is a bit slow for a large matrix, and I do not have enough reputation to comment on it, so I am posting a new answer.
For each column of a large matrix, checking whether the maximum is equal to the minimum is sufficient.
df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE))]
Here is the benchmark. Compared to the first answer, the runtime is reduced by more than 90%. It is also faster than the method from the second comment on the question.
ncol = 1000000
nrow = 10
df <- matrix(sample(1:(ncol*nrow),ncol*nrow,replace = FALSE), ncol = ncol)
df[,sample(1:ncol,70,replace = FALSE)] <- rep(1,times = nrow) # make 70 columns constant
time1 <- system.time(df1 <- df[,apply(df, 2, var, na.rm=TRUE) != 0]) # the first method
time2 <- system.time(df2 <- df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE))]) # my method
time3 <- system.time(df3 <- df[,apply(df, 2, function(col) { length(unique(col)) > 1 })]) # Keith's method
time1
# user system elapsed
# 22.267 0.194 22.626
time2
# user system elapsed
# 2.073 0.077 2.155
time3
# user system elapsed
# 6.702 0.060 6.790
all.equal(df1, df2)
# [1] TRUE
all.equal(df3, df2)
# [1] TRUE
Since this Q&A is a popular Google search result, but the accepted answer is a bit slow for a large matrix and @raymkchow's version is slow with NAs, I propose a new version using exponential search and the power of data.table.
This is a function I implemented in the dataPreparation package.
First, build an example data.table with more rows than columns (which is usually the case) and 10% NAs:
library(data.table)
library(dataPreparation)
ncol <- 1000
nrow <- 100000
df <- matrix(sample(1:(ncol*nrow), ncol*nrow, replace = FALSE), ncol = ncol)
df <- apply(df, 2, function(x) {x[sample(1:nrow, floor(nrow/10))] <- NA; x}) # add 10% NAs
df[, sample(1:ncol, 70, replace = FALSE)] <- rep(1, times = nrow)            # make 70 columns constant
df <- as.data.table(df)
Then benchmark all approaches:
time1 <- system.time(df1 <- df[,apply(df, 2, var, na.rm=TRUE) != 0, with = F]) # the first method
time2 <- system.time(df2 <- df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE)), with = F]) # raymkchow
time3 <- system.time(df3 <- df[,apply(df, 2, function(col) { length(unique(col)) > 1 }), with = F]) # Keith's method
time4 <- system.time(df4 <- df[,-which_are_constant(df, verbose=FALSE)]) # My method
The results are the following:
time1 # Variance approch
# user system elapsed
# 2.55 1.45 4.07
time2 # Min = max approach
# user system elapsed
# 2.72 1.5 4.22
time3 # length(unique()) approach
# user system elapsed
# 6.7 2.75 9.53
time4 # Exponential search approach
# user system elapsed
# 0.39 0.07 0.45
all.equal(df1, df2)
# [1] TRUE
all.equal(df3, df2)
# [1] TRUE
all.equal(df4, df2)
# [1] TRUE
dataPreparation::which_are_constant is about 10 times faster than the other approaches.
And the more rows you have, the bigger the gain.
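For intuition, here is a rough sketch of the exponential-search idea on a single column (my own illustration, not the actual internals of which_are_constant): compare the first value against chunks of doubling size and stop at the first mismatch, so a non-constant column is rejected after inspecting only a few values.
is_constant_expo <- function(x) {
  x <- x[!is.na(x)]                            # ignore NAs (the package may treat them differently)
  if (length(x) <= 1) return(TRUE)
  i <- 2
  while (i <= length(x)) {
    end <- min(2 * i - 1, length(x))           # chunk sizes 2, 4, 8, ...
    if (any(x[i:end] != x[1])) return(FALSE)   # early exit on the first mismatch
    i <- end + 1
  }
  TRUE
}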
The janitor package has a function remove_constant that can help delete constant columns.
Let's create some synthetic data for illustration:
library(janitor)
test_dat <- data.frame(A=1, B=1:10, C= LETTERS[1:10])
test_dat
This is test_dat:
> test_dat
A B C
1 1 1 A
2 1 2 B
3 1 3 C
4 1 4 D
5 1 5 E
6 1 6 F
7 1 7 G
8 1 8 H
9 1 9 I
10 1 10 J
Then the function remove_constant can delete the constant column:
remove_constant(test_dat)
remove_constant(test_dat, na.rm= TRUE)
Either of the two calls above gives:
B C
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E
6 6 F
7 7 G
8 8 H
9 9 I
10 10 J
NOTE: use the argument na.rm = TRUE to make sure that a column containing a single value plus NAs is also deleted. For example,
test_dat_with_NA <- data.frame(A=c(1, NA), B=1:10, C= LETTERS[1:10])
test_dat_with_NA
This gives test_dat_with_NA:
A B C
1 1 1 A
2 NA 2 B
3 1 3 C
4 NA 4 D
5 1 5 E
6 NA 6 F
7 1 7 G
8 NA 8 H
9 1 9 I
10 NA 10 J
Then the call
remove_constant(test_dat_with_NA)
does not delete column A:
A B C
1 1 1 A
2 NA 2 B
3 1 3 C
4 NA 4 D
5 1 5 E
6 NA 6 F
7 1 7 G
8 NA 8 H
9 1 9 I
10 NA 10 J
while the call
remove_constant(test_dat_with_NA, na.rm = TRUE)
does delete column A, which contains only the value 1 and NAs:
B C
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E
6 6 F
7 7 G
8 8 H
9 9 I
10 10 J
If you are after a dplyr solution that returns the non-constant variables in a df, I'd recommend the following. Optionally, you can add %>% colnames() if the column names are desired:
library(dplyr)
df <- data.frame(x = 1:5, y = rep(1,5))
# returns dataframe
var_df <- df %>%
select_if(function(v) var(v, na.rm=TRUE) != 0)
var_df %>% colnames() # returns column names
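As an aside, select_if() is superseded in recent dplyr releases; the same filter can be written with where():
df %>% select(where(~ var(.x, na.rm = TRUE) != 0))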
A tidyverse version of Keith's comment:
df %>% purrr::keep(~length(unique(.x)) != 1)
