Using R rowMeans to get the mean regardless of any missing values [duplicate]

I would like to get the average for certain columns for each row.
I have this data:
w=c(5,6,7,8)
x=c(1,2,3,4)
y=c(1,2,3)
length(y)=4
z=data.frame(w,x,y)
Which returns:
w x y
1 5 1 1
2 6 2 2
3 7 3 3
4 8 4 NA
I would like to get the mean for certain columns, not all of them. My problem is that there are a lot of NAs in my data. So if I wanted the mean of x and y, this is what I would like to get back:
w x y mean
1 5 1 1 1
2 6 2 2 2
3 7 3 3 3
4 8 4 NA 4
I guess I could do something like z$mean=(z$x+z$y)/2 but the last row for y is NA so obviously I do not want the NA to be calculated and I should not be dividing by two. I tried cumsum but that returns NAs when there is a single NA in that row. I guess I am looking for something that will add the selected columns, ignore the NAs, get the number of selected columns that do not have NAs and divide by that number. I tried ??mean and ??average and am completely stumped.
ETA: Is there also a way I can add a weight to a specific column?

Here are some examples:
> z$mean <- rowMeans(subset(z, select = c(x, y)), na.rm = TRUE)
> z
w x y mean
1 5 1 1 1
2 6 2 2 2
3 7 3 3 3
4 8 4 NA 4
weighted mean
> z$y <- rev(z$y)
> z
w x y mean
1 5 1 NA 1
2 6 2 3 2
3 7 3 2 3
4 8 4 1 4
>
> weight <- c(1, 2) # x * 1/3 + y * 2/3
> z$wmean <- apply(subset(z, select = c(x, y)), 1, function(d) weighted.mean(d, weight, na.rm = TRUE))
> z
w x y mean wmean
1 5 1 NA 1 1.000000
2 6 2 3 2 2.666667
3 7 3 2 3 2.333333
4 8 4 1 4 2.000000
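For larger data, the row-by-row apply() call can be replaced with matrix arithmetic. The following is a minimal sketch, not part of the original answer: the names m, wmat and wmean_vec are made up for illustration, and it assumes the weights are given in the same order as the selected columns.

```r
# Same data as above, with y already reversed as in the weighted example
z <- data.frame(w = c(5, 6, 7, 8), x = c(1, 2, 3, 4), y = c(NA, 3, 2, 1))
weight <- c(1, 2)  # weights for x and y, in column order

m <- as.matrix(z[c("x", "y")])

# One row of weights per data row; set the weight to 0 wherever the value is NA
wmat <- matrix(weight, nrow = nrow(m), ncol = ncol(m), byrow = TRUE)
wmat[is.na(m)] <- 0

# Weighted sum of the observed values divided by the sum of their weights
z$wmean_vec <- rowSums(m * wmat, na.rm = TRUE) / rowSums(wmat)

# Cross-check against the apply() + weighted.mean() version
z$wmean <- apply(m, 1, function(d) weighted.mean(d, weight, na.rm = TRUE))
print(z)
```

A row in which every selected column is NA has a weight sum of zero, so both versions return NaN for it.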

Try using rowMeans:
z$mean=rowMeans(z[,c("x", "y")], na.rm=TRUE)
w x y mean
1 5 1 1 1
2 6 2 2 2
3 7 3 3 3
4 8 4 NA 4
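The question describes the calculation as "add the selected columns, ignore the NAs, and divide by the number of non-NA values". As a rough sketch, that description can be written out by hand and compared with rowMeans(..., na.rm = TRUE); the names cols, manual and builtin are only illustrative.

```r
z <- data.frame(w = c(5, 6, 7, 8), x = c(1, 2, 3, 4), y = c(1, 2, 3, NA))
cols <- c("x", "y")
m <- as.matrix(z[cols])

# Sum of the non-NA values divided by the count of non-NA values, per row
manual <- rowSums(m, na.rm = TRUE) / rowSums(!is.na(m))

# The built-in equivalent
builtin <- rowMeans(m, na.rm = TRUE)

# TRUE when the two approaches agree
print(isTRUE(all.equal(manual, builtin)))
```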

Here is a tidyverse solution using c_across which is designed for row-wise aggregations. This makes it easy to refer to columns by name, type or position and to apply any function to the selected columns.
library("tidyverse")
w <- c(5, 6, 7, 8)
x <- c(1, 2, 3, 4)
y <- c(1, 2, 3, NA)
z <- data.frame(w, x, y)
z %>%
rowwise() %>%
mutate(
mean = mean(c_across(c(x, y)), na.rm = TRUE),
max = max(c_across(x:y), na.rm = TRUE)
)
#> # A tibble: 4 × 5
#> # Rowwise:
#> w x y mean max
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 5 1 1 1 1
#> 2 6 2 2 2 2
#> 3 7 3 3 3 3
#> 4 8 4 NA 4 4
Created on 2022-06-25 by the reprex package (v2.0.1)

Related

Roll max in R. From first row to current row

I would like to calculate max value from first row to current row
df <- data.frame(id = c(1,1,1,1,2,2,2), value = c(2,5,3,2,4,5,4), result = c(NA,2,5,5,NA,4,5))
I have tried grouping by id with dplyr and using the rollmax function from zoo, but did not succeed.
1) rollmax is used with a fixed width but here we have a variable width so using rollapplyr, which seems close to the approach of the question, we have:
library(dplyr)
library(zoo)
df %>%
group_by(id) %>%
mutate(out = lag(rollapplyr(value, 1:n(), max))) %>%
ungroup
giving:
# A tibble: 7 x 4
# Groups: id [2]
id value result out
<dbl> <dbl> <dbl> <dbl>
1 1 2 NA NA
2 1 5 2 2
3 1 3 5 5
4 1 2 5 5
5 2 4 NA NA
6 2 5 4 4
7 2 4 5 5
2) It is also possible to perform the grouping via the width (second) argument of rollapplyr, like this, which eliminates the need for dplyr. In this case the widths are 1, 2, 3, 4, 1, 2, 3 and Max is like max except it does not use the last element of its argument x. (An alternate expression for the width would be seq_along(id) - match(id, id) + 1).
library(zoo)
Max <- function(x) if (length(x) == 1) NA else max(head(x, -1))
transform(df, out = rollapplyr(value, sequence(rle(id)$lengths), Max))
giving:
id value result out
1 1 2 NA NA
2 1 5 2 2
3 1 3 5 5
4 1 2 5 5
5 2 4 NA NA
6 2 5 4 4
7 2 4 5 5
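The two width expressions mentioned in 2) can be checked in base R without zoo. This is a small sketch; the variable names widths_rle and widths_match are made up for illustration, and both expressions assume that the rows of each id are contiguous.

```r
id <- c(1, 1, 1, 1, 2, 2, 2)

# Run lengths of each id, expanded to 1..n within each run
widths_rle <- sequence(rle(id)$lengths)

# Position in the vector minus the position of the first row with the same id, plus 1
widths_match <- seq_along(id) - match(id, id) + 1

print(widths_rle)
print(widths_match)
```

Each width is the number of rows from the start of the group up to and including the current row, which is why the window never reaches into the previous id.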
A data.table option using shift + cummax
> library(data.table)
> setDT(df)[, result2 := shift(cummax(value)), id][]
id value result result2
1: 1 2 NA NA
2: 1 5 2 2
3: 1 3 5 5
4: 1 2 5 5
5: 2 4 NA NA
6: 2 5 4 4
7: 2 4 5 5
library(dplyr)
df |>
group_by(id) |>
mutate(result = lag(cummax(value)))
# # A tibble: 7 x 3
# # Groups: id [2]
# id value result
# <dbl> <dbl> <dbl>
# 1 1 2 NA
# 2 1 5 2
# 3 1 3 5
# 4 1 2 5
# 5 2 4 NA
# 6 2 5 4
# 7 2 4 5
Here is a base R solution. This would just get you the cumulative maximum:
df$result = ave(df$value, df$id, FUN=cummax)
To get the cumulative maximum with the lag you wanted:
df$result = ave(df$value, df$id, FUN=function(x) c(NA,cummax(x[-(length(x))])))
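As a self-contained check, the base R approach can be run against the expected result column from the question. This is only a sketch; the helper name lagged_cummax and the column name out are hypothetical.

```r
df <- data.frame(id = c(1, 1, 1, 1, 2, 2, 2),
                 value = c(2, 5, 3, 2, 4, 5, 4),
                 result = c(NA, 2, 5, 5, NA, 4, 5))

# Cumulative maximum shifted down by one position, with NA for the first row
lagged_cummax <- function(x) c(NA, cummax(x)[-length(x)])

# ave() applies the function within each id and returns a vector aligned with the rows
df$out <- ave(df$value, df$id, FUN = lagged_cummax)
print(df)
```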

R: how to obtain unique pairwise combinations of 2 vectors [duplicate]

This question already has answers here:
How to generate permutations or combinations of object in R?
(3 answers)
Closed 2 years ago.
x = 1:3
y = 1:3
> expand.grid(x = 1:3, y = 1:3)
x y
1 1 1
2 2 1
3 3 1
4 1 2
5 2 2
6 3 2
7 1 3
8 2 3
9 3 3
Using expand.grid gives me all of the combinations. However, I want only pairwise comparisons, that is, I don't want a comparison of 1 vs 1, 2 vs 2, or 3 vs 3. Moreover, I want to keep only the unique pairs, i.e., I want to keep 1 vs 2 (and not 2 vs 1).
In summary, for the above x and y, I want the following 3 pairwise combinations:
x y
1 1 2
2 1 3
3 2 3
Similarly, for x = y = 1:4, I want the following pairwise combinations:
x y
1 1 2
2 1 3
3 1 4
4 2 3
5 2 4
6 3 4
We can use combn
f1 <- function(x) setNames(as.data.frame(t(combn(x, 2))), c("x", "y"))
f1(1:3)
# x y
#1 1 2
#2 1 3
#3 2 3
f1(1:4)
# x y
#1 1 2
#2 1 3
#3 1 4
#4 2 3
#5 2 4
#6 3 4
Using data.table,
library(data.table)
x <- 1:4
y <- 1:4
CJ(x, y)[x < y]
x y
1: 1 2
2: 1 3
3: 1 4
4: 2 3
5: 2 4
6: 3 4
Actually you are already very close to the desired output. You may need subset as well
> subset(expand.grid(x = x, y = y), x < y)
x y
4 1 2
7 1 3
8 2 3
Here is another option but with longer code
v <- letters[1:5] # dummy data vector
mat <- diag(length(v))
inds <- upper.tri(mat)
data.frame(
x = v[row(mat)[inds]],
y = v[col(mat)[inds]]
)
which gives
x y
1 a b
2 a c
3 b c
4 a d
5 b d
6 c d
7 a e
8 b e
9 c e
10 d e
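The answers above return the same pairs but not always in the same row order (the upper.tri version, for instance, lists them column by column). As a rough sketch, the combn and expand.grid results can be compared after sorting; the names from_combn and from_grid are only illustrative.

```r
x <- 1:4

# Pairs from combn(): one pair per column, so transpose into rows
from_combn <- as.data.frame(t(combn(x, 2)))
names(from_combn) <- c("x", "y")

# Pairs from the full grid, keeping only those with x < y
from_grid <- subset(expand.grid(x = x, y = x), x < y)

# Sort by x, then y, and reset the row names so the two line up
from_grid <- from_grid[order(from_grid$x, from_grid$y), ]
rownames(from_grid) <- NULL

print(from_combn)
print(from_grid)
```

Both have choose(4, 2) = 6 rows.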

Add column to each data frame within list with function rowSums and range of columns

SO. The following might serve as a small example of the real list.
a <- data.frame(
x = c("A","A","A","A","A"),
y = c(1,2,3,4,5),
z = c(1,2,3,4,5))
b <- data.frame(
x = c("A","A","A","A","A"),
y = c(1,2,3,4,5),
z = c(1,2,3,4,5))
c <- data.frame(
x = c("A","A","A","A","A"),
y = c(1,2,3,4,5),
z = c(1,2,3,4,5))
l <- list(a,b,c)
For every data frame, I want to add the row sums over the second through the last column as a new column.
I tried:
lapply(l, function(x) rowSums(x[2:ncol(x)]))
which returns the correct sums, but doesn't add them to the data frames.
I also tried:
lapply(l, transform, sum = y + z)
which gives me the correct results but is not flexible enough, because I don't always know how many columns there are in each data frame or what their names are. The only thing I know is that I have to start from the second column and go to the end. I tried to combine these two approaches but I can't figure out how to do it exactly.
Thanks
Try this. You can index the columns by position and exclude the first variable, so it does not matter how many additional variables each data frame has when computing the row sums. Here is the code:
#Compute rowsums
l1 <- lapply(l,function(x) {x$RowSum<-rowSums(x[,-1],na.rm=TRUE);return(x)})
Output:
l1
[[1]]
x y z RowSum
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
[[2]]
x y z RowSum
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
[[3]]
x y z RowSum
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
Here's how to combine your attempts. I used data[-1] instead of data[2:ncol(data)] because it seems simpler, but either should work.
lapply(l, function(data) transform(data, sum = rowSums(data[-1])))
Unfortunately, transform will be confused if the name of the argument to your anonymous function is the same as a column name - data[-1] needs to look at the data frame, not a particular column. (I originally used function(x) instead of function(data), and this caused an error because there is a column named x. From this perspective, Duck's answer is a little safer.)
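The name clash described above can be reproduced in a few lines. This is a sketch using a one-element list shaped like the example data; the names bad and good are hypothetical, and tryCatch is used only so the failing call does not stop the script.

```r
l <- list(data.frame(x = rep("A", 5), y = 1:5, z = 1:5))

# Inside transform(), x resolves to the column x (a character vector),
# so x[-1] is not a data frame and rowSums() fails
bad <- tryCatch(
  lapply(l, function(x) transform(x, sum = rowSums(x[-1]))),
  error = function(e) e
)
print(inherits(bad, "error"))

# With an argument name that is not a column name, data[-1] is the data frame
good <- lapply(l, function(data) transform(data, sum = rowSums(data[-1])))
print(good[[1]])
```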
Does this work:
> add_col <- function(df){
+ df[(ncol(df)+1)] = rowSums(df[2:ncol(df)])
+ df
+ }
> lapply(l, add_col)
[[1]]
x y z V4
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
[[2]]
x y z V4
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
[[3]]
x y z V4
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
>
With sum as column name:
> add_col <- function(df){
+ df['sum'] = rowSums(df[2:ncol(df)])
+ df
+ }
> lapply(l, add_col)
[[1]]
x y z sum
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
[[2]]
x y z sum
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
[[3]]
x y z sum
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
Using the tidyverse:
library(tidyverse)
map(l, ~.x %>% mutate(Sum := apply(.x[-1], 1, sum)))
#> [[1]]
#> x y z Sum
#> 1 A 1 1 2
#> 2 A 2 2 4
#> 3 A 3 3 6
#> 4 A 4 4 8
#> 5 A 5 5 10
#>
#> [[2]]
#> x y z Sum
#> 1 A 1 1 2
#> 2 A 2 2 4
#> 3 A 3 3 6
#> 4 A 4 4 8
#> 5 A 5 5 10
#>
#> [[3]]
#> x y z Sum
#> 1 A 1 1 2
#> 2 A 2 2 4
#> 3 A 3 3 6
#> 4 A 4 4 8
#> 5 A 5 5 10
Created on 2020-09-30 by the reprex package (v0.3.0)
We can use map with mutate
library(purrr)
library(dplyr)
map(l, ~ .x %>%
mutate(sum = rowSums(select(., -1))))
Or with c_across
map(l, ~ .x %>%
rowwise() %>%
mutate(sum = sum(c_across(-1), na.rm = TRUE)) %>%
ungroup)

dplyr 0.8.0 mutate_at: use of custom function without overwriting original columns

With the new grammar in dplyr 0.8.0, which uses list() instead of funs(), I want to be able to create new variables from mutate_at() without overwriting the old ones. Basically, I need to replace any integers over a value with NA in several columns, without overwriting the columns.
I had this working already using a previous version of dplyr, but I want to accommodate the changes in dplyr so my code doesn't break later.
Say I have a tibble:
x <- tibble(id = 1:10, x = sample(1:10, 10, replace = TRUE),
y = sample(1:10, 10, replace = TRUE))
I want to be able to replace any values above 5 with NA. I used to do it this way, and this result is exactly what I want:
x %>% mutate_at(vars(x, y), funs(RC = replace(., which(. > 5), NA)))
# A tibble: 10 x 5
id x y x_RC y_RC
<int> <int> <int> <int> <int>
1 1 2 3 2 3
2 2 2 1 2 1
3 3 3 4 3 4
4 4 4 4 4 4
5 5 2 9 2 NA
6 6 6 8 NA NA
7 7 10 2 NA 2
8 8 1 3 1 3
9 9 10 1 NA 1
10 10 1 8 1 NA
This is what I've tried, but it doesn't work:
x %>% mutate_at(vars(x, y), list(RC = replace(., which(. > 5), NA)))
Error in [<-.data.frame(*tmp*, list, value = NA) :
new columns would leave holes after existing columns
This works, but replaces the original variables:
x %>% mutate_at(vars(x, y), list(~replace(., which(. > 5), NA)))
# A tibble: 10 x 3
id x y
<int> <int> <int>
1 1 2 3
2 2 2 1
3 3 3 4
4 4 4 4
5 5 2 NA
6 6 NA NA
7 7 NA 2
8 8 1 3
9 9 NA 1
10 10 1 NA
Any help is appreciated!
Almost there, just create a named list.
x %>% mutate_at(vars(x, y), list(RC = ~replace(., which(. > 5), NA)))
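For readers on a newer dplyr, the same result can likely be written with across() and its .names argument. This is a sketch that assumes dplyr 1.0.0 or later (the question targets 0.8.0, where across() is not available); the tibble name dat and its fixed values are made up so the output is reproducible.

```r
library(dplyr)

# Fixed values instead of sample() so the result is reproducible
dat <- tibble(id = 1:5,
              x = c(2L, 6L, 10L, 1L, 5L),
              y = c(3L, 8L, 2L, 9L, 4L))

# Values above 5 become NA in new columns x_RC and y_RC;
# the original x and y are left untouched
out <- dat %>%
  mutate(across(c(x, y),
                ~ replace(.x, which(.x > 5), NA),
                .names = "{.col}_RC"))
print(out)
```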

Binding matrices in R with unequal rows

I have two matrices that may have equal numbers of columns but unequal numbers of rows (hopefully a solution will generalize to unequal numbers of both).
I would like the following behavior (demonstrated using data.frames):
x = data.frame(z = c(8, 9), w = c(10, 11))
y = data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
> x
z w
1 8 10
2 9 11
> y
x y
1 1 4
2 2 5
3 3 6
And I would like to do something like
magic_cbind(x, y)
z w x y
1 8 10 1 4
2 9 11 2 5
3 NA NA 3 6
I found a perverse solution using rbind.fill from the plyr package:
> x = data.frame(t(x))
> y = data.frame(t(y))
> x
X1 X2
z 8 9
w 10 11
> y
X1 X2 X3
x 1 2 3
y 4 5 6
> rbind.fill(x, y)
X1 X2 X3
1 8 9 NA
2 10 11 NA
3 1 2 3
4 4 5 6
> rbind.fill(x, y) %>% t %>% as.matrix %>% unname
[,1] [,2] [,3] [,4]
[1,] 8 10 1 4
[2,] 9 11 2 5
[3,] NA NA 3 6
But I was wondering if there is a more elegant solution? I don't know the final size of the matrix in advance, which is a problem, and it grows inside a loop (which is terrible practice, but it won't grow large enough to actually be a concern). That is, given a matrix, I'm trying to bind additional columns obtained through a loop to it in the way described above.
I cobbled my solution up using the following questions:
Bind list with unequal columns
Combining (cbind) vectors of different length
R: column binding with unequal number of rows
We can use cbind.fill from rowr
rowr::cbind.fill(x, y, fill = NA)
# z w x y
#1 8 10 1 4
#2 9 11 2 5
#3 NA NA 3 6
Here's a way in base R:
as.data.frame(lapply(c(x,y),`length<-`,max(nrow(x),nrow(y))))
z w x y
1 8 10 1 4
2 9 11 2 5
3 NA NA 3 6
data.frame( sapply(c(x,y), '[', seq(max(lengths(c(x, y))))))
z w x y
1 8 10 1 4
2 9 11 2 5
3 NA NA 3 6
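The base R one-liners above can be wrapped into the magic_cbind function the question asks for. The following is a minimal sketch under a few assumptions: the inputs are data frames (or lists of vectors), the column names do not clash, and padding with NA at the bottom is the desired behaviour.

```r
# Hypothetical helper: bind any number of data frames column-wise,
# padding the shorter ones with NA rows
magic_cbind <- function(...) {
  # Flatten all inputs into a single list of columns
  cols <- do.call(c, lapply(list(...), as.list))
  # Target length is that of the longest column
  n <- max(lengths(cols))
  # `length<-` extends each column to n, filling with NA
  padded <- lapply(cols, `length<-`, n)
  as.data.frame(padded)
}

x <- data.frame(z = c(8, 9), w = c(10, 11))
y <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
res <- magic_cbind(x, y)
print(res)
```

Because the function takes ..., it can also be called inside a loop to append one new block of columns at a time.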
Or use
library(magrittr)
library(purrr)
map_df(c(x, y), extract, seq(max(lengths(c(x, y)))))
or
map_df(c(x,y), `[`, seq(max(lengths(c(x, y)))))
# A tibble: 3 x 4
z w x y
<dbl> <dbl> <dbl> <dbl>
1 8.00 10.0 1.00 4.00
2 9.00 11.0 2.00 5.00
3 NA NA 3.00 6.00
