Based on my values, I need a function that produces the following results.
The function has to calculate the mean of the current value and the 3 previous values.
The function should be flexible, so that the same calculation can be applied for 2, 4, 5 or x previous values, for example: the mean of the current value and the 2 previous values.
Please note that my data contains random numbers, not ascending numbers as in the example above.
What you need is a rolling mean; in the argument k (4 in my example) you provide the integer width of the rolling window. Check the documentation page for the rollmean function of the zoo package: ?rollmean.
zoo
library(zoo)
library(dplyr)
df <- data.frame(number = 1:20)
df %>% mutate(rolling_avg = rollmean(number, k = 4, fill = NA, align = "right"))
RcppRoll
library(RcppRoll)
df %>% mutate(rolling_avg = roll_mean(number, n = 4, fill = NA, align = "right"))
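As an aside, if I recall the RcppRoll API correctly, the package also ships right-aligned shorthand variants, so the call above could be shortened (this alternative is my addition, not part of the original answer):
# assumed equivalent: roll_meanr() is the right-aligned variant of roll_mean()
df %>% mutate(rolling_avg = roll_meanr(number, n = 4, fill = NA))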
Output
number rolling_avg
1 1 NA
2 2 NA
3 3 NA
4 4 2.5
5 5 3.5
6 6 4.5
7 7 5.5
8 8 6.5
9 9 7.5
10 10 8.5
11 11 9.5
12 12 10.5
13 13 11.5
14 14 12.5
15 15 13.5
16 16 14.5
17 17 15.5
18 18 16.5
19 19 17.5
20 20 18.5
Using the other vector you provided in the comments:
df <- data.frame(number = c(1,-3,5,4,3,2,-4,5,6,-4,3,2,3,-4,5,6,6,3,2))
df %>% mutate(rolling_avg = rollmean(number, 4, fill = NA, align = "right"))
Output
number rolling_avg
1 1 NA
2 -3 NA
3 5 NA
4 4 1.75
5 3 2.25
6 2 3.50
7 -4 1.25
8 5 1.50
9 6 2.25
10 -4 0.75
11 3 2.50
12 2 1.75
13 3 1.00
14 -4 1.00
15 5 1.50
16 6 2.50
17 6 3.25
18 3 5.00
19 2 4.25
You can also use the rollify function in the tibbletime package to create a rolling version of any function. For mean it would look like this (using the data from mpalanco's answer):
library(dplyr)
library(tibbletime)
rolling_mean <- rollify(mean, window = 4)
df %>% mutate(moving_average = rolling_mean(number))
which gives you:
number moving_average
1 1 NA
2 2 NA
3 3 NA
4 4 2.5
5 5 3.5
6 6 4.5
7 7 5.5
8 8 6.5
9 9 7.5
10 10 8.5
11 11 9.5
12 12 10.5
13 13 11.5
14 14 12.5
15 15 13.5
16 16 14.5
17 17 15.5
18 18 16.5
19 19 17.5
20 20 18.5
The benefit of this approach is that it is easy to extend to things other than a rolling average, as the sketch below shows.
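For instance, a rolling standard deviation only requires swapping in a different function (a minimal sketch; rolling_sd is just an illustrative name, not part of the original answer):
# same pattern with a different function: a 4-wide rolling standard deviation
rolling_sd <- rollify(sd, window = 4)
df %>% mutate(moving_sd = rolling_sd(number))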
I have a vector (called missing) of sample IDs that are required to be in my dataframe (otherwise the function I am applying to it doesn't work) but are absent from it.
For each element of missing, I want to add a row to the end of my dataframe that contains the ID, with NAs in all the other columns.
What I am currently trying, based on some other Stack Overflow posts I saw that talk only about adding empty rows, is as follows:
for (element in missing) {
  df[nrow(df) + 1, ] <- NA   # append an all-NA row
  df[nrow(df), 1] <- element # then fill in the ID column
}
Is there a simpler and faster way to do this? It already takes some time for 1000 missing elements, and I might later have to deal with a lot more.
1) Using the built-in anscombe data frame, this inserts two rows putting -1 and -3 in the x1 column.
library(tibble)
new <- c(-1, -3)
add_row(anscombe, x1 = new)
giving:
x1 x2 x3 x4 y1 y2 y3 y4
1 10 10 10 8 8.04 9.14 7.46 6.58
2 8 8 8 8 6.95 8.14 6.77 5.76
3 13 13 13 8 7.58 8.74 12.74 7.71
4 9 9 9 8 8.81 8.77 7.11 8.84
5 11 11 11 8 8.33 9.26 7.81 8.47
6 14 14 14 8 9.96 8.10 8.84 7.04
7 6 6 6 8 7.24 6.13 6.08 5.25
8 4 4 4 19 4.26 3.10 5.39 12.50
9 12 12 12 8 10.84 9.13 8.15 5.56
10 7 7 7 8 4.82 7.26 6.42 7.91
11 5 5 5 8 5.68 4.74 5.73 6.89
12 -1 NA NA NA NA NA NA NA
13 -3 NA NA NA NA NA NA NA
2) Here is a base R solution; new is from (1).
(If overwriting anscombe is OK, though typically that makes debugging harder, omit the first line and replace anscombe2 with anscombe.)
anscombe2 <- anscombe
anscombe2[nrow(anscombe2) + seq_along(new), "x1"] <- new
3) Using dplyr (rows_insert is a dplyr function; tibble is used here to build the one-column data frame to insert), we can use rows_insert. new is from (1).
library(dplyr)
rows_insert(anscombe, tibble(x1 = new))
Sample data:
samp <- data.frame(id = 1:10, val1 = 11:20, val2 = 21:30)
missing <- c(11, 13, 15)
Merge:
merge(samp, data.frame(id = missing), by = "id", all = TRUE)
# id val1 val2
# 1 1 11 21
# 2 2 12 22
# 3 3 13 23
# 4 4 14 24
# 5 5 15 25
# 6 6 16 26
# 7 7 17 27
# 8 8 18 28
# 9 9 19 29
# 10 10 20 30
# 11 11 NA NA
# 12 13 NA NA
# 13 15 NA NA
Row-bind with an external package:
data.table::rbindlist(list(samp, data.frame(id = missing)), use.names = TRUE, fill = TRUE)
dplyr::bind_rows(samp, data.frame(id = missing))
Row-bind with base R, a little more work:
# build length(missing) all-NA rows with the right columns:
samp0 <- samp[rep(1, length(missing)), , drop = FALSE][NA, ] # NA-indexing blanks every value
samp0$id <- missing          # fill in the missing IDs
rownames(samp0) <- NULL      # tidy up the row names
rbind(samp, samp0)
I am attempting to plot aggregate data from the following data.
Person Time Period Value SMA2 SMA3 SMA4
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 1 1 14 NA NA NA
2 A 2 1 8 11 NA NA
3 A 3 1 13 10.5 11.7 NA
4 A 4 1 12 12.5 11 11.8
5 A 5 1 19 15.5 14.7 13
6 A 6 1 9 14 13.3 13.2
7 A 7 2 14 NA NA NA
8 A 8 2 7 10.5 NA NA
9 A 9 2 11 9 10.7 NA
10 A 10 2 14 12.5 10.7 11.5
# ... with 26 more rows
I have used aggregate(DataSet[,c(4,5,6,7)], by=list(DataSet$Person), na.rm = TRUE, max) to get the following:
Group.1 Value SMA2 SMA3 SMA4
1 A 20 18.0 16.66667 15.25
2 B 20 17.0 16.66667 15.00
3 C 19 18.5 14.33333 14.50
I'd like to plot the maxes for each SMA for Person A, B, and C on the same plot.
I would also like to be able to plot the mean of these maxes for each SMA column.
Any help is appreciated.
Like so? Or are you looking for something different?
df <- data.frame("Group.1"=c("A","B","C"), "Value"=c(20,20,20),
"SMA2"=c(18.0, 17.0, 18.5), "SMA3" =c(16.667, 16.667, 14.333),
"SMA4"=c(15.25, 15.00, 14.50))
library(ggplot2)
library(tidyr)
library(dplyr) # needed for group_by() and summarise() below
df.g <- df %>%
  gather(SMA, Value, -Group.1)
df.g$SMA <- factor(df.g$SMA, levels = c("Value", "SMA2", "SMA3", "SMA4"))
means <- df.g %>%
  group_by(SMA) %>%
  summarise(m = mean(Value))
ggplot(df.g, aes(x = SMA, y = Value, group = Group.1, colour = Group.1)) +
  geom_line() +
  geom_point(data = means, aes(x = SMA, y = m), inherit.aes = FALSE)
My dataset has the following features: player IDs, team, week and points.
I want to calculate the mean of TEAM points for previous weeks, but not all past weeks, just the last 5 (or fewer, if the current week is smaller than 5).
Example: for team = A, week = 7, the result will be the average of POINTS for team = A over weeks 2, 3, 4, 5 and 6.
The dataset can be created using the following code:
# set the seed for reproducibility
set.seed(123)
player_id<-c(rep(1,15),rep(2,15),rep(3,15),rep(4,15))
week<-1:15
team<-c(rep("A",30),rep("B",30))
points<-round(runif(60,1,10),0)
mydata<- data.frame(player_id=player_id,team=team,week=rep(week,4),points)
I would like a solution without heavy looping, because the dataset is huge.
I have asked related questions here that may help, but I could not adapt them to this case.
Question 1
Question 2
Thank you!
If you want a dplyr solution, we can adapt the approach from my answer to one of your other questions:
library(dplyr)
library(zoo)
# set the seed for reproducibility
set.seed(123)
player_id<-c(rep(1,15),rep(2,15),rep(3,15),rep(4,15))
week<-1:15
team<-c(rep("A",30),rep("B",30))
points<-round(runif(60,1,10),0)
mydata<- data.frame(player_id=player_id,team=team,week=rep(week,4),points)
roll_mean <- function(x, k) {
  # partial = TRUE averages over fewer than k values at the start of the series
  result <- rollapplyr(x, k, mean, partial = TRUE, na.rm = TRUE)
  result[is.nan(result)] <- NA # the mean of an all-NA window is NaN; convert it to NA
  return(result)
}
It might first be easier to aggregate by team:
team_data <- mydata %>%
  select(-player_id) %>%
  group_by(team, week) %>%
  arrange(week) %>%
  summarise(team_points = sum(points)) %>%
  mutate(rolling_team_mean = roll_mean(lag(team_points), k = 5)) %>%
  arrange(team)
team_data
# A tibble: 30 x 4
# Groups: team [2]
team week team_points rolling_team_mean
<fctr> <int> <dbl> <dbl>
1 A 1 13 NA
2 A 2 11 13.00
3 A 3 6 12.00
4 A 4 13 10.00
5 A 5 19 10.75
6 A 6 10 12.40
7 A 7 13 11.80
8 A 8 16 12.20
9 A 9 16 14.20
10 A 10 12 14.80
# ... with 20 more rows
Then, if you like we can put everything back together:
mydata <- inner_join(mydata, team_data) %>%
  arrange(week, team, player_id)
mydata[1:12, ]
player_id team week points team_points rolling_team_mean
1 1 A 1 4 13 NA
2 2 A 1 9 13 NA
3 3 B 1 10 12 NA
4 4 B 1 2 12 NA
5 1 A 2 8 11 13
6 2 A 2 3 11 13
7 3 B 2 9 12 12
8 4 B 2 3 12 12
9 1 A 3 5 6 12
10 2 A 3 1 6 12
11 3 B 3 7 12 12
12 4 B 3 5 12 12
Here's one way:
# compute points per team per week
pts <- with(mydata, tapply(points, list(team, week), sum, default = 0))
pts
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
#A 13 11 6 13 19 10 13 16 16 12 17 11 13 10 4
#B 12 12 12 11 10 6 13 11 6 9 5 7 13 13 6
# compute the 5-week averages
sapply(setNames(seq(2, ncol(pts)), seq(2, ncol(pts))),
       function(i) {
         apply(pts[, seq(max(1, i - 5), i - 1), drop = FALSE], 1, mean)
       })
# 2 3 4 5 6 7 8 9 10 11 12 13 14 15
#A 13 12 10 10.75 12.4 11.8 12.2 14.2 14.8 13.4 14.8 14.4 13.8 12.6
#B 12 12 12 11.75 11.4 10.2 10.4 10.2 9.2 9.0 8.8 7.6 8.0 9.4
This will give the wrong result if the week variable has gaps.
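If gaps are possible, one way to guard against them (a sketch, untested against the original data) is to build the week dimension over the full range of weeks, so a week with no rows still gets a column. Filling empty cells with 0 assumes a missing week should count as zero points; use default = NA together with na.rm = TRUE in mean if such weeks should be skipped instead.
# make week a factor over the full range so empty weeks still get a column
all_weeks <- factor(mydata$week, levels = min(mydata$week):max(mydata$week))
pts <- with(mydata, tapply(points, list(team, all_weeks), sum, default = 0))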
I have data with a primary key and ratio values like the following:
2.243164164
1.429242413
2.119270714
3.013427143
1.208634972
1.208634972
1.23657632
2.212136028
2.168583297
2.151961216
1.159886063
1.234106444
1.694206176
1.401425329
5.210125578
1.215267806
1.089189869
I want to add a rank column which groups these ratios into, say, 3 bins, with functionality similar to the SAS code:
PROC RANK DATA = TAB1 GROUPS = &NUM_BINS
I did the following:
Convert your vector to a data frame; a minimal setup is sketched below.
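Assuming the ratio values listed above are the input, a minimal construction (my addition for completeness, not part of the original answer) would be:
test2 <- data.frame(test = c(2.243164164, 1.429242413, 2.119270714, 3.013427143,
                             1.208634972, 1.208634972, 1.23657632, 2.212136028,
                             2.168583297, 2.151961216, 1.159886063, 1.234106444,
                             1.694206176, 1.401425329, 5.210125578, 1.215267806,
                             1.089189869))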
Create the rank variable:
test2$rank <- rank(test2$test)
> test2
test rank
1 2.243164 15.0
2 1.429242 9.0
3 2.119271 11.0
4 3.013427 16.0
5 1.208635 3.5
6 1.208635 3.5
7 1.236576 7.0
8 2.212136 14.0
9 2.168583 13.0
10 2.151961 12.0
11 1.159886 2.0
12 1.234106 6.0
13 1.694206 10.0
14 1.401425 8.0
15 5.210126 17.0
16 1.215268 5.0
17 1.089190 1.0
Define a function that converts ranks to percentile ranks, then define pr as that percentile:
percent.rank <- function(x) trunc(rank(x) / length(x) * 100)
test3 <- within(test2, pr <- percent.rank(rank))
Then I created the bins, based on the fact that you wanted 3 of them:
test3$bins <- cut(test3$pr, breaks=c(0,33,66,100), labels=c("0-33","34-66","66-100"))
test x rank pr bins
1 2.243164 15.0 15.0 88 66-100
2 1.429242 9.0 9.0 52 34-66
3 2.119271 11.0 11.0 64 34-66
4 3.013427 16.0 16.0 94 66-100
5 1.208635 3.5 3.5 20 0-33
6 1.208635 3.5 3.5 20 0-33
7 1.236576 7.0 7.0 41 34-66
8 2.212136 14.0 14.0 82 66-100
9 2.168583 13.0 13.0 76 66-100
10 2.151961 12.0 12.0 70 66-100
11 1.159886 2.0 2.0 11 0-33
12 1.234106 6.0 6.0 35 34-66
13 1.694206 10.0 10.0 58 34-66
14 1.401425 8.0 8.0 47 34-66
15 5.210126 17.0 17.0 100 66-100
16 1.215268 5.0 5.0 29 0-33
17 1.089190 1.0 1.0 5 0-33
Does that work for you?
A bit late, but given your data we can use ntile from the dplyr package to get equal-sized groups:
df <- data.frame(values = c(2.243164164,
1.429242413,
2.119270714,
3.013427143,
1.208634972,
1.208634972,
1.23657632,
2.212136028,
2.168583297,
2.151961216,
1.159886063,
1.234106444,
1.694206176,
1.401425329,
5.210125578,
1.215267806,
1.089189869))
library(dplyr)
df <- df %>%
  arrange(values) %>%
  mutate(rank = ntile(values, 3))
values rank
1 1.089190 1
2 1.159886 1
3 1.208635 1
4 1.208635 1
5 1.215268 1
6 1.234106 1
7 1.236576 2
8 1.401425 2
9 1.429242 2
10 1.694206 2
11 2.119271 2
12 2.151961 2
13 2.168583 3
14 2.212136 3
15 2.243164 3
16 3.013427 3
17 5.210126 3
Or see cut_number from the ggplot2 package:
library(ggplot2)
df$rank2 <- cut_number(df$values, 3, labels = c(1:3))
values rank rank2
1 1.089190 1 1
2 1.159886 1 1
3 1.208635 1 1
4 1.208635 1 1
5 1.215268 1 1
6 1.234106 1 1
7 1.236576 2 2
8 1.401425 2 2
9 1.429242 2 2
10 1.694206 2 2
11 2.119271 2 2
12 2.151961 2 3
13 2.168583 3 3
14 2.212136 3 3
15 2.243164 3 3
16 3.013427 3 3
17 5.210126 3 3
Because your sample consists of 17 numbers, one bin gets 5 numbers while the other two get 6. The two methods differ at row 12: ntile puts 6 numbers in the first and second groups, whereas cut_number puts 6 in the first and third groups.
> table(df$rank)
1 2 3
6 6 5
> table(df$rank2)
1 2 3
6 5 6
See also here: Splitting a continuous variable into equal sized groups
I have two data sets; one is a subset of the other, but the subset has an additional column and fewer observations.
Basically, I have a unique ID assigned to each participant, and then an HHID, the ID of the house from which they were recruited (e.g. 15 participants recruited from 11 houses).
> Healthdata <- data.frame(ID = gl(15, 1), HHID = c(1,2,2,3,4,5,5,5,6,6,7,8,9,10,11))
> Healthdata
Now, I have a subset of the data with only one participant per household, chosen as the one who spent the most hours watching television. In this subset, I have computed a socioeconomic score (SSE) for each house.
> set.seed(1)
> Healthdata.1<- data.frame(ID=sample(1:15,11, replace=F), HHID=gl(11,1), SSE = sample(-6.5:3.5, 11, replace=TRUE))
> Healthdata.1
Now, I want to assign the SSE from the subset (Healthdata.1) to the unique participants of the bigger data (Healthdata) such that participants from the same house get the same score.
I can't simply merge this, because the data sets have different numbers of observations: 15 in the bigger one but only 11 in the subset.
Is there any way to do this in R? I am very new to it and I am stuck with this.
I want the required output to be something like below, i.e. IDs (participants) from the same HHID (house) should have the same SSE score. The following output is just an example of what I need; the seed above will not give these exact values.
ID HHID SSE
1 1 -6.5
2 2 -5.5
3 2 -5.5
4 3 3.3
5 4 3.0
6 5 2.58
7 5 2.58
8 5 2.58
9 6 -3.05
10 6 -3.05
11 7 -1.2
12 8 2.5
13 9 1.89
14 10 1.88
15 11 -3.02
Thanks.
You can use merge. By default it will merge by the intersection of the column names.
merge(Healthdata,Healthdata.1,all.x=TRUE)
ID HHID SSE
1 1 1 NA
2 2 2 NA
3 3 2 NA
4 4 3 NA
5 5 4 NA
6 6 5 NA
7 7 5 NA
8 8 5 NA
9 9 6 0.7
10 10 6 NA
11 11 7 NA
12 12 8 NA
13 13 9 NA
14 14 10 NA
15 15 11 NA
Or you can choose which column to merge by:
merge(Healthdata,Healthdata.1,all.x=TRUE,by='ID')
You need to merge by HHID, not ID. Note this is somewhat confusing because the IDs from the supergroup are from a different set than those from the subgroup, i.e. ID.x == 4 != ID.y == 4 (in fact, in this case they are in different households). Because of that I left both ID columns here to avoid ambiguity, but you can easily subset the result to keep only the ID.x one, as sketched after the output below.
> merge(Healthdata, Healthdata.1, by='HHID')
HHID ID.x ID.y SSE
1 1 1 4 -5.5
2 2 2 6 0.5
3 2 3 6 0.5
4 3 4 8 -2.5
5 4 5 11 1.5
6 5 6 3 -1.5
7 5 7 3 -1.5
8 5 8 3 -1.5
9 6 9 9 0.5
10 6 10 9 0.5
11 7 11 10 3.5
12 8 12 14 -2.5
13 9 13 5 1.5
14 10 14 1 3.5
15 11 15 2 -4.5
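For instance, to keep only the original IDs (a small illustration of the remark above; the merged object name is mine):
merged <- merge(Healthdata, Healthdata.1, by = "HHID")
merged <- merged[, c("ID.x", "HHID", "SSE")] # drop ID.y
names(merged)[1] <- "ID"                     # restore the original column name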
library(plyr)
# With no "by" argument, join() matches on all common column names (here ID and HHID)
join(Healthdata, Healthdata.1)
# Inner Join
join(Healthdata, Healthdata.1, type = "inner", by = "ID")
# Left Join
# I believe this is what you are after
join(Healthdata, Healthdata.1, type = "left", by = "ID")