Calculating cumulative mean using floating conditions - r

My dataset has as features: players IDs, team, weeks and points.
I want to calculate the mean of TEAM points for previous weeks, but not all past weeks, just to the last 5 or less (if the current week is smaller than 5).
Example: For team = A, week = 7, the result will be the average of POINTS for team = A and weeks 2, 3, 4, 5 and 6.
The dataset can be created using the following code:
# set the seed for reproducibility
set.seed(123)
player_id<-c(rep(1,15),rep(2,15),rep(3,15),rep(4,15))
week<-1:15
team<-c(rep("A",30),rep("B",30))
points<-round(runif(60,1,10),0)
mydata<- data.frame(player_id=player_id,team=team,week=rep(week,4),points)
I would like to have a solution without a heavy looping, because the dataset is huge.
I have done related questions here that maybe will help, but I could not adapt to this case.
Question 1
Question 2
Thank you!

We adapt the approach from my answer to one of your other questions if you want a dplyr solution:
library(dplyr)
library(zoo)
# set the seed for reproducibility
set.seed(123)
player_id<-c(rep(1,15),rep(2,15),rep(3,15),rep(4,15))
week<-1:15
team<-c(rep("A",30),rep("B",30))
points<-round(runif(60,1,10),0)
mydata<- data.frame(player_id=player_id,team=team,week=rep(week,4),points)
roll_mean <- function(x, k) {
result <- rollapplyr(x, k, mean, partial=TRUE, na.rm=TRUE)
result[is.nan(result)] <- NA
return( result )
}
It might first be easier to aggregate by team:
team_data <- mydata %>%
select(-player_id) %>%
group_by(team, week) %>%
arrange(week) %>%
summarise(team_points = sum(points)) %>%
mutate(rolling_team_mean = roll_mean(lag(team_points), k=5)) %>%
arrange(team)
team_data
# A tibble: 30 x 4
# Groups: team [2]
team week team_points rolling_team_mean
<fctr> <int> <dbl> <dbl>
1 A 1 13 NA
2 A 2 11 13.00
3 A 3 6 12.00
4 A 4 13 10.00
5 A 5 19 10.75
6 A 6 10 12.40
7 A 7 13 11.80
8 A 8 16 12.20
9 A 9 16 14.20
10 A 10 12 14.80
# ... with 20 more rows
Then, if you like we can put everything back together:
mydata <- inner_join(mydata, team_data) %>%
arrange(week, team, player_id)
mydata[1:12, ]
player_id team week points team_points rolling_team_mean
1 1 A 1 4 13 NA
2 2 A 1 9 13 NA
3 3 B 1 10 12 NA
4 4 B 1 2 12 NA
5 1 A 2 8 11 13
6 2 A 2 3 11 13
7 3 B 2 9 12 12
8 4 B 2 3 12 12
9 1 A 3 5 6 12
10 2 A 3 1 6 12
11 3 B 3 7 12 12
12 4 B 3 5 12 12

Here's one way:
# compute points per team per week
pts <- with(mydata, tapply(points, list(team, week), sum, default = 0))
pts
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
#A 13 11 6 13 19 10 13 16 16 12 17 11 13 10 4
#B 12 12 12 11 10 6 13 11 6 9 5 7 13 13 6
# compute the 5-week averages
sapply(setNames(seq(2, ncol(pts)), seq(2, ncol(pts))),
function(i) {
apply(pts[, seq(max(1, i - 5), i - 1), drop = FALSE], 1, mean)
})
# 2 3 4 5 6 7 8 9 10 11 12 13 14 15
#A 13 12 10 10.75 12.4 11.8 12.2 14.2 14.8 13.4 14.8 14.4 13.8 12.6
#B 12 12 12 11.75 11.4 10.2 10.4 10.2 9.2 9.0 8.8 7.6 8.0 9.4
This will give the wrong result if the week variable has gaps.

Related

How to insert a row which calculates the average of the rows above it?

I was looking to separate rows of data by Cue and adding a row which calculate averages per subject. Here is an example:
Before:
Cue ITI a b c
1 0 16 0.82062 0.52185 0.27679
2 0 24 0.53894 0.49957 0.35767
3 4 22 0.26855 0.17487 0.22461
4 4 20 0.15106 0.48767 0.49072
5 7 18 0.11627 0.12604 0.2832
6 7 24 0.50201 0.14252 0.21454
7 12 16 0.27649 0.96008 0.42114
8 12 18 0.60852 0.21637 0.18799
9 22 20 0.32867 0.65308 0.29388
10 22 24 0.25726 0.37048 0.32379
After:
Cue ITI a b c
1 0 16 0.82062 0.52185 0.27679
2 0 24 0.53894 0.49957 0.35767
3 0.67978 0.51071 0.31723
4 4 22 0.26855 0.17487 0.22461
5 4 20 0.15106 0.48767 0.49072
6 0.209 0.331 0.357
7 7 18 0.11627 0.12604 0.2832
8 7 24 0.50201 0.14252 0.21454
9 0.309 0.134 0.248
10 12 16 0.27649 0.96008 0.42114
11 12 18 0.60852 0.21637 0.18799
12 0.442 0.588 0.304
13 22 20 0.32867 0.65308 0.29388
14 22 24 0.25726 0.37048 0.32379
15 0.292 0.511 0.308
So in the "after" example, line 3 is the average of lines 1 and 2 (line 6 is the average of lines 4 and 5, etc...).
Any help/information would be greatly appreciated!
Thank you!
You can use base r to do something like:
Reduce(rbind,by(data,data[1],function(x)rbind(x,c(NA,NA,colMeans(x[-(1:2)])))))
Cue ITI a b c
1 0 16 0.820620 0.521850 0.276790
2 0 24 0.538940 0.499570 0.357670
3 NA NA 0.679780 0.510710 0.317230
32 4 22 0.268550 0.174870 0.224610
4 4 20 0.151060 0.487670 0.490720
31 NA NA 0.209805 0.331270 0.357665
5 7 18 0.116270 0.126040 0.283200
6 7 24 0.502010 0.142520 0.214540
33 NA NA 0.309140 0.134280 0.248870
7 12 16 0.276490 0.960080 0.421140
8 12 18 0.608520 0.216370 0.187990
34 NA NA 0.442505 0.588225 0.304565
9 22 20 0.328670 0.653080 0.293880
10 22 24 0.257260 0.370480 0.323790
35 NA NA 0.292965 0.511780 0.308835
Here is one idea. Split the data frame, perform the analysis, and then combine them together.
DF_list <- split(DF, f = DF$Cue)
DF_list2 <- lapply(DF_list, function(x){
df_temp <- as.data.frame(t(colMeans(x[, -c(1, 2)])))
df_temp[, c("Cue", "ITI")] <- NA
df <- rbind(x, df_temp)
return(df)
})
DF2 <- do.call(rbind, DF_list2)
rownames(DF2) <- 1:nrow(DF2)
DF2
# Cue ITI a b c
# 1 0 16 0.820620 0.521850 0.276790
# 2 0 24 0.538940 0.499570 0.357670
# 3 NA NA 0.679780 0.510710 0.317230
# 4 4 22 0.268550 0.174870 0.224610
# 5 4 20 0.151060 0.487670 0.490720
# 6 NA NA 0.209805 0.331270 0.357665
# 7 7 18 0.116270 0.126040 0.283200
# 8 7 24 0.502010 0.142520 0.214540
# 9 NA NA 0.309140 0.134280 0.248870
# 10 12 16 0.276490 0.960080 0.421140
# 11 12 18 0.608520 0.216370 0.187990
# 12 NA NA 0.442505 0.588225 0.304565
# 13 22 20 0.328670 0.653080 0.293880
# 14 22 24 0.257260 0.370480 0.323790
# 15 NA NA 0.292965 0.511780 0.308835
DATA
DF <- read.table(text = " Cue ITI a b c
1 0 16 0.82062 0.52185 0.27679
2 0 24 0.53894 0.49957 0.35767
3 4 22 0.26855 0.17487 0.22461
4 4 20 0.15106 0.48767 0.49072
5 7 18 0.11627 0.12604 0.2832
6 7 24 0.50201 0.14252 0.21454
7 12 16 0.27649 0.96008 0.42114
8 12 18 0.60852 0.21637 0.18799
9 22 20 0.32867 0.65308 0.29388
10 22 24 0.25726 0.37048 0.32379", header = TRUE)
A data.table approach, but if someone can offer some improvements I'd be keen to hear.
library(data.table)
dt <- data.table(df)
dt2 <- dt[, lapply(.SD, mean), by = Cue][,ITI := NA][]
data.table(rbind(dt, dt2))[order(Cue)][is.na(ITI), Cue := NA][]
> data.table(rbind(dt, dt2))[order(Cue)][is.na(ITI), Cue := NA][]
Cue ITI a b c
1: 0 16 0.820620 0.521850 0.276790
2: 0 24 0.538940 0.499570 0.357670
3: NA NA 0.679780 0.510710 0.317230
4: 4 22 0.268550 0.174870 0.224610
5: 4 20 0.151060 0.487670 0.490720
6: NA NA 0.209805 0.331270 0.357665
If you want to leave the Cue values as-is to confirm group, just drop the [is.na(ITI), Cue := NA] from the last line.
I would use group_by and summarise from the DPLYR package to get a dataframe with the average values. Then rbind the new data frame with the old one and sort by Cue:
df_averages <- df_orig >%>
group_by(Cue) >%>
summarise(ITI = NA, a = mean(a), b = mean(b), c = mean(c)) >%>
ungroup()
df_all <- rbind(df_orig, df_averages)

Run function in R for mean

Based on my values i need a function to get following results.
enter image description here
The functions has to calculate the mean of current value and the 3 previous values.
The function should be flexible in that way, that the same calculation can be applied for 2, 4, 5 or x previous values, for example: mean of current value and the 2 previous values.
please consider, that my daten has random numbers, and not like in above example ascending numbers
What you need is a rolling mean, in the argument k (4 in my example) you provide an integer width of the rolling window. Check the documentation page for the rollmean function of the zoo package, ?rollmean.
zoo
library(zoo)
library(dplyr)
df <- data.frame(number = 1:20)
df %>% mutate(rolling_avg = rollmean(number, k = 4 , fill = NA, align = "right"))
RcppRoll
library(RcppRoll)
df %>% mutate(rolling_avg = roll_mean(number, n = 4, fill = NA, align = "right"))
Output
number rolling_avg
1 1 NA
2 2 NA
3 3 NA
4 4 2.5
5 5 3.5
6 6 4.5
7 7 5.5
8 8 6.5
9 9 7.5
10 10 8.5
11 11 9.5
12 12 10.5
13 13 11.5
14 14 12.5
15 15 13.5
16 16 14.5
17 17 15.5
18 18 16.5
19 19 17.5
20 20 18.5
Using the other vector you provided in the comments:
df <- data.frame(number = c(1,-3,5,4,3,2,-4,5,6,-4,3,2,3,-4,5,6,6,3,2))
df %>% mutate(rolling_avg = rollmean(number, 4, fill = NA, align = "right"))
Output
number rolling_avg
1 1 NA
2 -3 NA
3 5 NA
4 4 1.75
5 3 2.25
6 2 3.50
7 -4 1.25
8 5 1.50
9 6 2.25
10 -4 0.75
11 3 2.50
12 2 1.75
13 3 1.00
14 -4 1.00
15 5 1.50
16 6 2.50
17 6 3.25
18 3 5.00
19 2 4.25
You can also use the rollify function in the tibbletime package to create a custom rolling function for any function. For mean it would look like this (using data from #mpalanco's answer):
library(dplyr)
library(tibbletime)
rolling_mean <- rollify(mean, window = 4)
df %>% mutate(moving_average = rolling_mean(number))
which gives you:
number moving_average
1 1 NA
2 2 NA
3 3 NA
4 4 2.5
5 5 3.5
6 6 4.5
7 7 5.5
8 8 6.5
9 9 7.5
10 10 8.5
11 11 9.5
12 12 10.5
13 13 11.5
14 14 12.5
15 15 13.5
16 16 14.5
17 17 15.5
18 18 16.5
19 19 17.5
20 20 18.5
The benefit of this approach is that it is easy to extend to things other than rolling average.

How to extract a sample of pairs in grouping variable

My data looks like this:
x y
1 1
2 2
3 2
4 4
5 5
6 6
7 6
8 8
9 9
10 9
11 11
12 12
13 13
14 13
15 14
16 15
17 14
18 16
19 17
20 18
y is a grouping variable. I would like to see how well this grouping went.
Because of this I want to extract a sample of n pairs of cases that are grouped together by variable y
and n pairs of cases that are not grouped together by variable y. In order to calculate the number of
false positives and false negatives (either falsly grouped or not). How do I extract a sample of grouped pairs
and a sample of not-grouped pairs?
I would like the samples to look like this (for n=6) :
Grouped sample:
x y
2 2
3 2
9 9
10 9
15 14
17 14
Not-grouped sample:
x y
1 1
2 2
6 8
6 8
11 11
19 17
How would I go about this in R?
I'm not entirely clear on what you like to do, partly because I feel there is some context missing as to what you're trying to achieve. I also don't quite understand your expected output (for example, the not-grouped sample contains an entry 6 8 that does not exist in your original data...)
That aside, here is a possible approach.
# Maximum number of samples per group
n <- 3;
# Set fixed RNG seed for reproducibility
set.seed(2017);
# Grouped samples
df.grouped <- do.call(rbind.data.frame, lapply(split(df, df$y),
function(x) if (nrow(x) > 1) x[sample(min(n, nrow(x))), ]));
df.grouped;
# x y
#2.3 3 2
#2.2 2 2
#6.6 6 6
#6.7 7 6
#9.10 10 9
#9.9 9 9
#13.13 13 13
#13.14 14 13
#14.15 15 14
#14.17 17 14
# Ungrouped samples
df.ungrouped <- df[sample(nrow(df.grouped)), ];
df.ungrouped;
# x y
#7 7 6
#1 1 1
#9 9 9
#4 4 4
#3 3 2
#2 2 2
#5 5 5
#6 6 6
#10 10 9
#8 8 8
Explanation: Split df based on y, then draw min(n, nrow(x)) samples from subset x containing >1 rows; rbinding gives the grouped df.grouped. We then draw nrow(df.grouped) samples from df to produce the ungrouped df.ungrouped.
Sample data
df <- read.table(text =
"x y
1 1
2 2
3 2
4 4
5 5
6 6
7 6
8 8
9 9
10 9
11 11
12 12
13 13
14 13
15 14
16 15
17 14
18 16
19 17
20 18", header = T)

5 nearest neighbors based on given distance in r

I have the following dataset:
id x y age
1 1745353 930284.1 30
2 1745317 930343.4 23
3 1745201 930433.9 10
4 1745351 930309.4 5
5 1745342 930335.2 2
6 1746619 929969.7 66
7 1746465 929827.1 7
8 1746731 928779.5 55
9 1746629 929902.6 26
10 1745938 928923.2 22
I want to find 5 closest neighbors for each of the id based on the distance calculated from the given (x,y). The final output should look like the following:
id n_id dist age age_n_id
1 2 2 30 23
1 5 1.5 30 2
1 3 5 30 10
1 7 3 30 7
1 8 3 30 55
2 1 6 23 30
2 10 1 23 22
2 6 2 23 66
2 7 6 23 7
2 8 9 23 55
3 2 1 10 23
3 1 2 10 30
3 4 1.2 10 5
3 6 1.6 10 66
3 9 2.3 10 26
................................
................................
10 2 1.9 22 23
10 6 2.3 22 66
10 9 2.1 22 26
10 1 2.5 22 30
10 5 1.6 22 2
where n_id is the id if the neighbors, dist is the straight line distance between id and n_id, age is the age of the id, and age_n_id is the age of the n_id. Also, the maximum distance would be 10km. If there are fewer than 5 neighbors within 10km, say 3 neighbors, the corresponding id would be repeated only three times.
I am relatively newer in r programming and any help would be much appreciated.
data.table solution:
library(data.table)
data<-fread("id x y age
1 1745353 930284.1 30
2 1745317 930343.4 23
3 1745201 930433.9 10
4 1745351 930309.4 5
5 1745342 930335.2 2
6 1746619 929969.7 66
7 1746465 929827.1 7
8 1746731 928779.5 55
9 1746629 929902.6 26
10 1745938 928923.2 22")
data[,all_x:=list(list(x))]
data[,all_y:=list(list(y))]
data[,all_age:=list(list(age))]
data[,seq_nr:=seq_len(.N)]
#Distance formula:
formula_distance<-function(x_1,x_2,y_1,y_2,z){
x_2<-x_2[[1]][-z]
y_2<-y_2[[1]][-z]
sqrt((x_1-x_2)^2+(y_1-y_2)^2)
}
data<-data[,{list(dist = formula_distance(x,all_x,y,all_y,seq_nr),
id =seq(1:nrow(data))[-id],
age_id=all_age[[1]][-id],
age=rep(age,nrow(data)-1))},by=1:nrow(data)]
data<-data[order(nrow,dist)]
#Filter data within threshold:
threshold<-1000
#How many nearest neighbors to take:
k<-5
filtered<-data[dist<=threshold]
filtered<-filtered[,{list(dist=dist[1:k],n_id=id[1:k],n_age=age_id[1:k])},by=c("nrow","age")]
filtered<-filtered[!is.na(dist)]
setnames(filtered,"nrow","id")
filtered
id age dist n_id n_age
1: 1 30 25.37893 4 5
2: 1 30 52.27055 5 2
3: 1 30 69.37211 2 23
4: 1 30 213.41050 3 10
5: 2 23 26.31045 5 2
6: 2 23 48.08326 4 5
7: 2 23 69.37211 1 30
8: 2 23 147.12665 3 10
9: 3 10 147.12665 2 23
10: 3 10 172.11243 5 2
11: 3 10 194.93653 4 5
12: 3 10 213.41050 1 30
13: 4 5 25.37893 1 30
14: 4 5 27.32471 5 2
15: 4 5 48.08326 2 23
16: 4 5 194.93653 3 10
17: 5 2 26.31045 2 23
18: 5 2 27.32471 4 5
19: 5 2 52.27055 1 30
20: 5 2 172.11243 3 10
21: 6 66 67.84106 9 26
22: 6 66 209.88273 7 7
23: 7 7 180.54432 9 26
24: 7 7 209.88273 6 66
25: 8 55 805.91482 10 22
26: 9 26 67.84106 6 66
27: 9 26 180.54432 7 7
28: 10 22 805.91482 8 55
Assuming that the unit of coordinates is in meter.
# Load packages
library(FNN)
library(tidyverse)
library(data.table)
# Create example data frame
dataset <- fread("id x y age
1 1745353 930284.1 30
2 1745317 930343.4 23
3 1745201 930433.9 10
4 1745351 930309.4 5
5 1745342 930335.2 2
6 1746619 929969.7 66
7 1746465 929827.1 7
8 1746731 928779.5 55
9 1746629 929902.6 26
10 1745938 928923.2 22")
# Calculate the nearest ID and distance
near_data <- get.knn(dataset[, 2:3], k = 5)
# Extract the nearest ID
nn_index <- as.data.frame(near_data$nn.index)
# Extract the nearest Distance
nn_dist <- as.data.frame(near_data$nn.dist)
# Re organize the data
nn_index2 <- nn_index %>%
# Add ID column
mutate(ID = 1:10) %>%
# Transform the data frame
gather(Rank, n_id, -ID)
nn_dist2 <- nn_dist %>%
# Add ID column
mutate(ID = 1:10) %>%
# Transform the data frame
gather(Rank, dist, -ID)
# Remove coordinates in dataset
dataset2 <- dataset %>% select(-x, -y)
# Create the final output
nn_final <- nn_index2 %>%
# Merge nn_index2 and nn_dist2
left_join(nn_dist2, by = c("ID", "Rank")) %>%
# Merge with dataset2 by ID and id
left_join(dataset2, by = c("ID" = "id")) %>%
# Merge with dataset2 by n_id and id
left_join(dataset2, by = c("n_id" = "id")) %>%
# Remove Rank
select(-Rank) %>%
# Rename column names
rename(id = ID, age = age.x, age_n_id = age.y) %>%
# Sort the data frame
arrange(id, dist) %>%
# Filter the dist < 10000 meters
filter(dist < 10000)

Merging data sets with unequal observations

I have two data sets, one is the subset of another but the subset has additional column, with lesser observations.
Basically, I have a unique ID assigned to each participants, and then a HHID, the house id from which they were recruited (eg 15 participants recruited from 11 houses).
> Healthdata <- data.frame(ID = gl(15, 1), HHID = c(1,2,2,3,4,5,5,5,6,6,7,8,9,10,11))
> Healthdata
Now, I have a subset of data with only one participant per household, chosen who spent longer hours watching television. In this subset data, I have computed socioeconomic score (SSE) for each house.
> set.seed(1)
> Healthdata.1<- data.frame(ID=sample(1:15,11, replace=F), HHID=gl(11,1), SSE = sample(-6.5:3.5, 11, replace=TRUE))
> Healthdata.1
Now, I want to assign the SSE from the subset (Healthdata.1) to unique participants of bigger data (Healthdata) such that, participants from the same house gets the same score.
I can't merge this simply, because the data sets have different number of observations, 15 in the bigger one but only 11 in the subset.
Is there any way to do this in R? I am very new to it and I am stuck with this.
I want the required output as something like below, ie ID (participants) from same HHID (house) should have same SSE score. The following output is just meant for an example of what I need, the above seed will not give the same output.
ID HHID SSE
1 1 -6.5
2 2 -5.5
3 2 -5.5
4 3 3.3
5 4 3.0
6 5 2.58
7 5 2.58
8 5 2.58
9 6 -3.05
10 6 -3.05
11 7 -1.2
12 8 2.5
13 9 1.89
14 10 1.88
15 11 -3.02
Thanks.
You can use merge , By default it will merge by columns intersections.
merge(Healthdata,Healthdata.1,all.x=TRUE)
ID HHID SSE
1 1 1 NA
2 2 2 NA
3 3 2 NA
4 4 3 NA
5 5 4 NA
6 6 5 NA
7 7 5 NA
8 8 5 NA
9 9 6 0.7
10 10 6 NA
11 11 7 NA
12 12 8 NA
13 13 9 NA
14 14 10 NA
15 15 11 NA
Or you can choose by which column you merge :
merge(Healthdata,Healthdata.1,all.x=TRUE,by='ID')
You need to merge by HHID, not ID. Note this is somewhat confusing because the ids from the supergroup are from a different set than from the subgroup. I.e. ID.x == 4 != ID.y == 4 (in fact, in this case they are in different households). Because of that I left both ID columns here to avoid ambiguity, but you can easily subset the result to show only the ID.x one,
> merge(Healthdata, Healthdata.1, by='HHID')
HHID ID.x ID.y SSE
1 1 1 4 -5.5
2 2 2 6 0.5
3 2 3 6 0.5
4 3 4 8 -2.5
5 4 5 11 1.5
6 5 6 3 -1.5
7 5 7 3 -1.5
8 5 8 3 -1.5
9 6 9 9 0.5
10 6 10 9 0.5
11 7 11 10 3.5
12 8 12 14 -2.5
13 9 13 5 1.5
14 10 14 1 3.5
15 11 15 2 -4.5
library(plyr)
join(Healthdata, Healthdata.1)
# Inner Join
join(Healthdata, Healthdata.1, type = "inner", by = "ID")
# Left Join
# I believe this is what you are after
join(Healthdata, Healthdata.1, type = "left", by = "ID")

Resources