Roll max in R, from first row to current row

I would like to calculate the max value from the first row to the current row, by id. The expected output is in the result column:
df <- data.frame(id = c(1,1,1,1,2,2,2), value = c(2,5,3,2,4,5,4), result = c(NA,2,5,5,NA,4,5))
I have tried grouping by id with dplyr and using the rollmax function from zoo, but did not succeed.

1) rollmax takes a fixed width, but here the width varies, so we use rollapplyr instead, which seems closest to the approach in the question:
library(dplyr)
library(zoo)
df %>%
group_by(id) %>%
mutate(out = lag(rollapplyr(value, 1:n(), max))) %>%
ungroup
giving:
# A tibble: 7 x 4
# Groups: id [2]
id value result out
<dbl> <dbl> <dbl> <dbl>
1 1 2 NA NA
2 1 5 2 2
3 1 3 5 5
4 1 2 5 5
5 2 4 NA NA
6 2 5 4 4
7 2 4 5 5
2) It is also possible to perform the grouping via the width (second) argument of rollapplyr like this eliminating dplyr. In this case the widths are 1, 2, 3, 4, 1, 2, 3 and Max is like max except it does not use the last element of its argument x. (An alternate expression for the width would be seq_along(id) - match(id, id) + 1).
library(zoo)
Max <- function(x) if (length(x) == 1) NA else max(head(x, -1))
transform(df, out = rollapplyr(value, sequence(rle(id)$lengths), Max))
giving:
id value result out
1 1 2 NA NA
2 1 5 2 2
3 1 3 5 5
4 1 2 5 5
5 2 4 NA NA
6 2 5 4 4
7 2 4 5 5
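To make the width argument in 2) concrete, here is a small base R sketch that builds the widths both ways for the id column from the question and checks that they agree. The names w_rle and w_match are only for illustration, and the comparison is only shown for this data, where equal ids are adjacent.

```r
# id column from the question; equal ids are adjacent
id <- c(1, 1, 1, 1, 2, 2, 2)

# Widths from run lengths: position of each row within its run of equal ids
w_rle <- sequence(rle(id)$lengths)

# Alternate expression: row number minus the first row of that id, plus 1
w_match <- seq_along(id) - match(id, id) + 1

print(w_rle)                  # 1 2 3 4 1 2 3
print(all(w_rle == w_match))  # TRUE
```

Each width tells rollapplyr how many rows, counting back from the current one, belong to the same id.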

A data.table option using shift + cummax
> library(data.table)
> setDT(df)[, result2 := shift(cummax(value)), id][]
id value result result2
1: 1 2 NA NA
2: 1 5 2 2
3: 1 3 5 5
4: 1 2 5 5
5: 2 4 NA NA
6: 2 5 4 4
7: 2 4 5 5

library(dplyr)
df |>
group_by(id) |>
mutate(result = lag(cummax(value)))
# # A tibble: 7 x 3
# # Groups: id [2]
# id value result
# <dbl> <dbl> <dbl>
# 1 1 2 NA
# 2 1 5 2
# 3 1 3 5
# 4 1 2 5
# 5 2 4 NA
# 6 2 5 4
# 7 2 4 5

Here is a base R solution. This would just get you the cumulative maximum:
df$result = ave(df$value, df$id, FUN=cummax)
To get the cumulative maximum with the lag you wanted:
df$result = ave(df$value, df$id, FUN=function(x) c(NA,cummax(x[-(length(x))])))
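As a quick self-contained check of the lagged cumulative maximum, the sketch below rebuilds the data from the question and compares a base R result against the expected result column. The helper name lag_cummax and the column name out are made up for this example.

```r
# Data from the question, including the expected output in `result`
df <- data.frame(id     = c(1, 1, 1, 1, 2, 2, 2),
                 value  = c(2, 5, 3, 2, 4, 5, 4),
                 result = c(NA, 2, 5, 5, NA, 4, 5))

# Hypothetical helper: cumulative max shifted down by one row,
# so each row sees the max of the rows before it
lag_cummax <- function(x) c(NA, cummax(x)[-length(x)])

# Apply the helper within each id
df$out <- ave(df$value, df$id, FUN = lag_cummax)

print(isTRUE(all.equal(df$out, df$result)))  # TRUE
```

Writing to a separate column keeps the expected result column intact for comparison.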

Related

Using R rowMeans to get the mean regardless of any missing values

I would like to get the average for certain columns for each row.
I have this data:
w=c(5,6,7,8)
x=c(1,2,3,4)
y=c(1,2,3)
length(y)=4
z=data.frame(w,x,y)
Which returns:
w x y
1 5 1 1
2 6 2 2
3 7 3 3
4 8 4 NA
I would like to get the mean for certain columns, not all of them. My problem is that there are a lot of NAs in my data. So if I wanted the mean of x and y, this is what I would like to get back:
w x y mean
1 5 1 1 1
2 6 2 2 2
3 7 3 3 3
4 8 4 NA 4
I guess I could do something like z$mean=(z$x+z$y)/2 but the last row for y is NA so obviously I do not want the NA to be calculated and I should not be dividing by two. I tried cumsum but that returns NAs when there is a single NA in that row. I guess I am looking for something that will add the selected columns, ignore the NAs, get the number of selected columns that do not have NAs and divide by that number. I tried ??mean and ??average and am completely stumped.
ETA: Is there also a way I can add a weight to a specific column?
Here are some examples:
> z$mean <- rowMeans(subset(z, select = c(x, y)), na.rm = TRUE)
> z
w x y mean
1 5 1 1 1
2 6 2 2 2
3 7 3 3 3
4 8 4 NA 4
Weighted mean:
> z$y <- rev(z$y)
> z
w x y mean
1 5 1 NA 1
2 6 2 3 2
3 7 3 2 3
4 8 4 1 4
>
> weight <- c(1, 2) # x * 1/3 + y * 2/3
> z$wmean <- apply(subset(z, select = c(x, y)), 1, function(d) weighted.mean(d, weight, na.rm = TRUE))
> z
w x y mean wmean
1 5 1 NA 1 1.000000
2 6 2 3 2 2.666667
3 7 3 2 3 2.333333
4 8 4 1 4 2.000000
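For larger data, the row-wise apply can be replaced by matrix arithmetic. The following is a minimal base R sketch, assuming the same x, y and weight values as above: a weight is set to zero wherever the value is NA, so the remaining weights are renormalised per row. The names vals and wmat are illustrative only.

```r
# Data after reversing y, as in the example above
z <- data.frame(w = c(5, 6, 7, 8), x = c(1, 2, 3, 4), y = c(NA, 3, 2, 1))
weight <- c(1, 2)  # weight 1 for x, weight 2 for y

vals <- as.matrix(z[, c("x", "y")])
# One row of weights per data row
wmat <- matrix(weight, nrow = nrow(vals), ncol = ncol(vals), byrow = TRUE)

# Drop the weight wherever the value is missing, then zero the value itself
wmat[is.na(vals)] <- 0
vals[is.na(vals)] <- 0

# Weighted sum divided by the sum of the weights actually used
z$wmean <- unname(rowSums(vals * wmat) / rowSums(wmat))
print(z$wmean)  # 1.000000 2.666667 2.333333 2.000000
```

For this data the result matches the wmean column produced with weighted.mean.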
Try using rowMeans:
z$mean=rowMeans(z[,c("x", "y")], na.rm=TRUE)
w x y mean
1 5 1 1 1
2 6 2 2 2
3 7 3 3 3
4 8 4 NA 4
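The question describes the calculation as "add the selected columns, ignore the NAs, and divide by the number of non-NA values". A small base R sketch of exactly that, using the data from the question, shows that it gives the same numbers as rowMeans with na.rm = TRUE. The names sel, row_total and row_count are only for illustration.

```r
z <- data.frame(w = c(5, 6, 7, 8), x = c(1, 2, 3, 4), y = c(1, 2, 3, NA))
sel <- z[, c("x", "y")]

row_total <- rowSums(sel, na.rm = TRUE)  # sum of the non-NA values per row
row_count <- rowSums(!is.na(sel))        # number of non-NA values per row

z$mean <- unname(row_total / row_count)
print(z$mean)  # 1 2 3 4

# Same numbers as the one-liner
print(isTRUE(all.equal(z$mean, unname(rowMeans(sel, na.rm = TRUE)))))  # TRUE
```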
Here is a tidyverse solution using c_across which is designed for row-wise aggregations. This makes it easy to refer to columns by name, type or position and to apply any function to the selected columns.
library("tidyverse")
w <- c(5, 6, 7, 8)
x <- c(1, 2, 3, 4)
y <- c(1, 2, 3, NA)
z <- data.frame(w, x, y)
z %>%
rowwise() %>%
mutate(
mean = mean(c_across(c(x, y)), na.rm = TRUE),
max = max(c_across(x:y), na.rm = TRUE)
)
#> # A tibble: 4 × 5
#> # Rowwise:
#> w x y mean max
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 5 1 1 1 1
#> 2 6 2 2 2 2
#> 3 7 3 3 3 3
#> 4 8 4 NA 4 4
Created on 2022-06-25 by the reprex package (v2.0.1)

Is there a way in R to subtract two rows within a group by specifying another grouping var?

Say I have something like this:
ID = c("a","a","a","a","a", "b","b","b","b","b")
Group = c("1","2","3","4","5", "1","2","3","4","5")
Value = c(3, 4,2,4,3, 6, 1, 8, 9, 10)
df<-data.frame(ID,Group,Value)
I want to subtract the group=3 value from the group=5 value within each ID, with an output column which has this difference for each ID like so:
ID Group Value Want
1 a 1 3 1
2 a 2 4 1
3 a 3 2 1
4 a 4 4 1
5 a 5 3 1
6 b 1 6 2
7 b 2 1 2
8 b 3 8 2
9 b 4 9 2
10 b 5 10 2
Also, if that calculation cannot be done (i.e. group 5 is missing), NA values for the 'want' column would be ideal.
As each 'Group' value appears only once per 'ID', we can subset directly:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(want = Value[Group == 5] - Value[Group == 3])
# A tibble: 10 x 4
# Groups: ID [2]
# ID Group Value want
# <fct> <fct> <dbl> <dbl>
# 1 a 1 3 1
# 2 a 2 4 1
# 3 a 3 2 1
# 4 a 4 4 1
# 5 a 5 3 1
# 6 b 1 6 2
# 7 b 2 1 2
# 8 b 3 8 2
# 9 b 4 9 2
#10 b 5 10 2
The above can be made more robust by converting the logical test to a numeric index with which() and taking the first element. When there is no TRUE, which() is empty and [1] returns NA, so the subtraction gives NA:
df %>%
slice(-10) %>%
group_by(ID) %>%
mutate(want = Value[which(Group == 5)[1]] - Value[which(Group == 3)[1]])
Or use match, which returns NA when there is no match. Indexing with NA returns NA, and so does the subtraction (e.g. NA - 3):
df %>%
slice(-10) %>% # removing the last row (ID b, Group 5)
group_by(ID) %>%
mutate(want = Value[match(5, Group)] - Value[match(3, Group)])
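The difference between the three indexing styles is easiest to see on plain vectors. This is a small base R sketch for a single ID whose Group 5 row is missing; the data and the names by_logical, by_which and by_match are made up for illustration.

```r
# One ID with no Group 5 row (hypothetical data)
Group <- c("1", "2", "3", "4")
Value <- c(6, 1, 8, 9)

by_logical <- Value[Group == "5"]            # zero-length vector, not NA
by_which   <- Value[which(Group == "5")[1]]  # which() is empty, so [1] is NA
by_match   <- Value[match("5", Group)]       # match() returns NA for no match

print(length(by_logical))                    # 0
print(is.na(by_which))                       # TRUE
print(by_match - Value[match("3", Group)])   # NA
```

The zero-length result is the problem case: it cannot fill a column for the group's rows, whereas a single NA can.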
Here is a base R solution
dfout <- Reduce(rbind,
lapply(split(df,df$ID),
function(x) within(x, Want <-diff(subset(Value, Group %in% c("3","5"))))))
such that
> dfout
ID Group Value Want
1 a 1 3 1
2 a 2 4 1
3 a 3 2 1
4 a 4 4 1
5 a 5 3 1
6 b 1 6 2
7 b 2 1 2
8 b 3 8 2
9 b 4 9 2
10 b 5 10 2
A data.table method:
library(data.table)
setDT(df)[, want := (Value[Group == 5] - Value[Group == 3]), by = .(ID)]
df
# ID Group Value want
# 1: a 1 3 1
# 2: a 2 4 1
# 3: a 3 2 1
# 4: a 4 4 1
# 5: a 5 3 1
# 6: b 1 6 2
# 7: b 2 1 2
# 8: b 3 8 2
# 9: b 4 9 2
# 10: b 5 10 2
Here is a solution using base R.
unsplit(
lapply(
split(df, df$ID),
function(d) {
x5 = d$Value[d$Group == "5"]
x5 = ifelse(length(x5) == 1, x5, NA)
x3 = d$Value[d$Group == "3"]
x3 = ifelse(length(x3) == 1, x3, NA)
d$Want = x5 - x3
d
}),
df$ID)
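The same idea can be written more compactly with match, which already returns NA when a group is missing. Below is a minimal base R sketch on the question's data with the last row dropped to simulate a missing Group 5; group_diff and per_id are hypothetical names introduced here.

```r
ID    <- c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b")
Group <- c("1", "2", "3", "4", "5", "1", "2", "3", "4", "5")
Value <- c(3, 4, 2, 4, 3, 6, 1, 8, 9, 10)
df <- data.frame(ID, Group, Value)[-10, ]  # drop ID b, Group 5

# Hypothetical helper: Value at group `to` minus Value at group `from`;
# match() yields NA if either group is absent, so the difference is NA
group_diff <- function(d, from = "3", to = "5") {
  d$Value[match(to, d$Group)] - d$Value[match(from, d$Group)]
}

# One difference per ID, then looked up by ID name for every row
per_id  <- sapply(split(df, df$ID), group_diff)
df$Want <- unname(per_id[as.character(df$ID)])
print(df$Want)  # 1 1 1 1 1 NA NA NA NA
```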

dplyr 0.8.0 mutate_at: use of custom function without overwriting original columns

Using the new grammar in dplyr 0.8.0 using list() instead of funs(), I want to be able to create new variables from mutate_at() without overwriting the old. Basically, I need to replace any integers over a value with NA in several columns, without overwriting the columns.
I had this working already using a previous version of dplyr, but I want to accommodate the changes in dplyr so my code doesn't break later.
Say I have a tibble:
x <- tibble(id = 1:10, x = sample(1:10, 10, replace = TRUE),
y = sample(1:10, 10, replace = TRUE))
I want to be able to replace any values above 5 with NA. I used to do it this way, and this result is exactly what I want:
x %>% mutate_at(vars(x, y), funs(RC = replace(., which(. > 5), NA)))
# A tibble: 10 x 5
id x y x_RC y_RC
<int> <int> <int> <int> <int>
1 1 2 3 2 3
2 2 2 1 2 1
3 3 3 4 3 4
4 4 4 4 4 4
5 5 2 9 2 NA
6 6 6 8 NA NA
7 7 10 2 NA 2
8 8 1 3 1 3
9 9 10 1 NA 1
10 10 1 8 1 NA
This is what I've tried, but it doesn't work:
x %>% mutate_at(vars(x, y), list(RC = replace(., which(. > 5), NA)))
Error in [<-.data.frame(*tmp*, list, value = NA) :
new columns would leave holes after existing columns
This works, but replaces the original variables:
x %>% mutate_at(vars(x, y), list(~replace(., which(. > 5), NA)))
# A tibble: 10 x 3
id x y
<int> <int> <int>
1 1 2 3
2 2 2 1
3 3 3 4
4 4 4 4
5 5 2 NA
6 6 NA NA
7 7 NA 2
8 8 1 3
9 9 NA 1
10 10 1 NA
Any help is appreciated!
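Almost there: keep the name in the list and add ~ so the expression becomes a function. The name RC is then used as the suffix of the new columns.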
x %>% mutate_at(vars(x, y), list(RC = ~replace(., which(. > 5), NA)))
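For comparison, here is what that call computes, written as a base R sketch without dplyr. It uses the x and y values from the printed output in the question instead of a random sample so the result is reproducible; the variable cols is only for illustration.

```r
# Values taken from the printed tibble in the question
x <- data.frame(id = 1:10,
                x  = c(2L, 2L, 3L, 4L, 2L, 6L, 10L, 1L, 10L, 1L),
                y  = c(3L, 1L, 4L, 4L, 9L, 8L, 2L, 3L, 1L, 8L))

cols <- c("x", "y")

# Apply the replacement to each selected column and store the results
# under new names with the _RC suffix, leaving the originals untouched
x[paste0(cols, "_RC")] <- lapply(x[cols], function(v) replace(v, which(v > 5), NA))

print(x$x_RC)  # 2 2 3 4 2 NA NA 1 NA 1
print(x$y_RC)  # 3 1 4 4 NA NA 2 3 1 NA
```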

Remove trailing NA by group in a data.frame

I have a data.frame with a grouping variable, and some NAs in the value column.
df = data.frame(group=c(1,1,2,2,2,2,2,3,3), value1=1:9, value2=c(NA,4,9,6,2,NA,NA,1,NA))
I can use zoo::na.trim to remove NA at the end of a column: this will remove the last line of the data.frame:
library(zoo)
library(dplyr)
df %>% na.trim(sides="right")
Now I want to remove the trailing NAs by group; how can I achieve this using dplyr?
Expected output for value2 column: c(NA, 4,9,6,2,1)
You could write a little helper function that checks for trailing NAs of a vector and then use group_by and filter.
f <- function(x) { rev(cumsum(!is.na(rev(x)))) != 0 }
library(dplyr)
df %>%
group_by(group) %>%
filter(f(value2))
# A tibble: 6 x 3
# Groups: group [3]
group value1 value2
<dbl> <int> <dbl>
1 1 1 NA
2 1 2 4
3 2 3 9
4 2 4 6
5 2 5 2
6 3 8 1
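To see why the helper f works, it helps to run it one step at a time on a single vector. The sketch below uses the value2 entries of group 2; the step names are only for illustration.

```r
x <- c(9, 6, 2, NA, NA)   # value2 for group 2

step1 <- rev(x)           # NA NA 2 6 9: trailing NAs are now leading
step2 <- !is.na(step1)    # FALSE FALSE TRUE TRUE TRUE
step3 <- cumsum(step2)    # 0 0 1 2 3: stays 0 until the first non-NA
step4 <- rev(step3)       # 3 2 1 0 0: back in the original order
keep  <- step4 != 0       # TRUE TRUE TRUE FALSE FALSE

print(x[keep])            # 9 6 2
```

The cumulative sum stays at zero only for the run of NAs at the end, so only those positions are dropped.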
edit
If we need to remove both leading and trailing NAs, we need to extend that function a bit.
f1 <- function(x) { cumsum(!is.na(x)) != 0 & rev(cumsum(!is.na(rev(x)))) != 0 }
Given df1
df1 = data.frame(group=c(1,1,2,2,2,2,2,3,3), value1=1:9, value2=c(NA,4,9,NA,2,NA,NA,1,NA))
df1
# group value1 value2
#1 1 1 NA
#2 1 2 4
#3 2 3 9
#4 2 4 NA
#5 2 5 2
#6 2 6 NA
#7 2 7 NA
#8 3 8 1
#9 3 9 NA
We get this result
df1 %>%
group_by(group) %>%
filter(f1(value2))
# A tibble: 5 x 3
# Groups: group [3]
group value1 value2
<dbl> <int> <dbl>
1 1 2 4
2 2 3 9
3 2 4 NA
4 2 5 2
5 3 8 1
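The same filter can be applied per group without dplyr. This is a minimal base R sketch, assuming the df1 and f1 defined above (repeated here so the block runs on its own); the variable keep is only for illustration.

```r
f1 <- function(x) { cumsum(!is.na(x)) != 0 & rev(cumsum(!is.na(rev(x)))) != 0 }

df1 <- data.frame(group  = c(1, 1, 2, 2, 2, 2, 2, 3, 3),
                  value1 = 1:9,
                  value2 = c(NA, 4, 9, NA, 2, NA, NA, 1, NA))

# Evaluate f1 within each group and put the flags back in row order
keep <- unsplit(lapply(split(df1$value2, df1$group), f1), df1$group)

print(df1[keep, ])
print(df1$value1[keep])  # 2 3 4 5 8
```

The NA in the middle of group 2 is kept because it is neither leading nor trailing.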
Using lapply, loop through group:
do.call("rbind", lapply(split(df, df$group), na.trim, sides = "right"))
# group value1 value2
# 1.1 1 1 NA
# 1.2 1 2 4
# 2.3 2 3 9
# 2.4 2 4 6
# 2.5 2 5 2
# 3 3 8 1
Or using by, as mentioned by @Henrik:
do.call("rbind", by(df, df$group, na.trim, sides = "right"))

Merging dataframe every x row

I am trying to merge values in a dataframe by every nth row.
The data structure looks as follows:
id value
1 1
2 2
3 1
4 2
5 3
6 4
7 1
8 2
9 4
10 4
11 2
12 1
I would like to aggregate the values of every 4th row. The dataset contains measurements over repeating 4-day periods:
id"1" = day1,
id"2" = day2,
id"3" = day3,
id"4" = day4,
id"5" = day1,
...
As such, a column counting in a loop from 1 to 4 might be used?
The result should look like (sums):
day sum
1 8
2 10
3 4
4 5
This can be achieved by creating a grouping variable with %% and then summing with aggregate:
n <- 4
aggregate(value ~cbind(day = (seq_along(df1$id)-1) %% n + 1), df1, FUN = sum)
# day value
#1 1 8
#2 2 10
#3 3 4
#4 4 5
This approach can also be used with dplyr/data.table
library(dplyr)
df1 %>%
group_by(day = (seq_along(id)-1) %% 4 +1) %>%
summarise(value = sum(value))
# day value
# <dbl> <int>
#1 1 8
#2 2 10
#3 3 4
#4 4 5
or
setDT(df1)[, .(value = sum(value)), .(day = (seq_along(id) - 1) %% 4 + 1)]
# day value
#1: 1 8
#2: 2 10
#3: 3 4
#4: 4 5
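The grouping variable itself can be inspected in isolation. Here is a small base R sketch, assuming the 12 rows from the question, that prints the day index produced by %% and sums the values per day with tapply; the names day and sums are only for illustration.

```r
df1 <- data.frame(id = 1:12,
                  value = c(1, 2, 1, 2, 3, 4, 1, 2, 4, 4, 2, 1))
n <- 4

# Row positions 1..12 mapped onto a repeating 1..4 cycle
day <- (seq_along(df1$id) - 1) %% n + 1
print(day)   # 1 2 3 4 1 2 3 4 1 2 3 4

# Sum of value for each day of the cycle
sums <- tapply(df1$value, day, sum)
print(sums)  # 8 10 4 5
```

Subtracting 1 before %% and adding 1 afterwards keeps the index in 1..n instead of 0..n-1.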
You need to make a sequence to group by, e.g.
rep(1:4, length = nrow(df))
## [1] 1 2 3 4 1 2 3 4 1 2 3 4
In aggregate:
aggregate(value ~ cbind(day = rep(1:4, length = nrow(df))), df, FUN = sum)
## day value
## 1 1 8
## 2 2 10
## 3 3 4
## 4 4 5
or dplyr:
library(dplyr)
df %>% group_by(day = rep(1:4, length = n())) %>% summarise(sum = sum(value))
## # A tibble: 4 x 2
## day sum
## <int> <int>
## 1 1 8
## 2 2 10
## 3 3 4
## 4 4 5
or data.table:
library(data.table)
setDT(df)[, .(sum = sum(value)), by = .(day = rep(1:4, length = nrow(df)))]
## day sum
## 1: 1 8
## 2: 2 10
## 3: 3 4
## 4: 4 5
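Both ways of building the day index give the same sequence, including when the number of rows is not a multiple of 4. A short base R sketch with a made-up row count of 10 illustrates this; it spells out length.out, which the shorter length = above resolves to by partial argument matching.

```r
n_rows <- 10  # hypothetical row count that is not a multiple of 4

day_rep <- rep(1:4, length.out = n_rows)   # recycle 1:4 to the row count
day_mod <- (seq_len(n_rows) - 1) %% 4 + 1  # modulo version

print(day_rep)                  # 1 2 3 4 1 2 3 4 1 2
print(all(day_rep == day_mod))  # TRUE
```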
