Using Mutate and conditional statement to calculate column means [duplicate] - r

This question already has answers here:
Average across Columns in R, excluding NAs
(2 answers)
Closed 11 months ago.
I have a dataset with an identifier variable and some numeric variables. I want to calculate the mean of the columns according to the identifier variable. Here is a simple example:
From this
id v1 v2 v3 v4
d 1 2 NA NA
e NA NA 3 3
e NA NA 2 4
d 3 5 NA NA
I want to get to this:
id v1 v2 v3 v4 mean
d 1 2 NA NA 1.5
e NA NA 3 3 3
e NA NA 2 4 3
d 3 5 NA NA 4
I would like to use an if else statement like:
ifelse(id=d, colMeans(v1:v2), colMeans(v3:v4)
Thank you in advance!

df %>%
rowwise %>%
mutate(mean = mean(c_across(v1:v4), na.rm = T))
# A tibble: 4 x 6
# Rowwise:
id v1 v2 v3 v4 mean
<chr> <int> <int> <int> <int> <dbl>
1 d 1 2 NA NA 1.5
2 e NA NA 3 3 3
3 e NA NA 2 4 3
4 d 3 5 NA NA 4
Or:
df %>%
rowwise %>%
mutate(mean = mean(c_across(where(is.numeric)), na.rm = T))

Related

Subset dataframe in R using function inside select_if to make it conditional on a grouping variable?

I would like to conditionally subset a dataframe in R, using dplyr::select_if(). More specifically, I have a dataframe that is made up of a grouping variable and numerous other variables that contain a bunch of NAs:
data <- tibble(group = sort(rep(letters[1:5],3)),
var_1 = c(1,1,1,1,rep(NA,11)),
var_2 = c(1,1,1,1,1,1,rep(NA,9)),
var_3 = 1,
var_4 = c(1,1,rep(NA,10),1,1,1),
var_5 = c(1,1,1,1,1,1,NA,NA,NA,NA,NA,NA,1,1,1))
# A tibble: 15 x 6
group var_1 var_2 var_3 var_4 var_5
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a 1 1 1 1 1
2 a 1 1 1 1 1
3 a 1 1 1 NA 1
4 b 1 1 1 NA 1
5 b NA 1 1 NA 1
6 b NA 1 1 NA 1
7 c NA NA 1 NA NA
8 c NA NA 1 NA NA
9 c NA NA 1 NA NA
10 d NA NA 1 NA NA
11 d NA NA 1 NA NA
12 d NA NA 1 NA NA
13 e NA NA 1 1 1
14 e NA NA 1 1 1
15 e NA NA 1 1 1
In this dataframe, I need to identify and remove columns like var_4 in this case that only occur in one group (but irrespective of whether or not they show up in the last group: "e"). Importantly, everything else has to remain untouched (i.e. I want to keep variables that look like var_1,var_2,var_3, and var_5). This is what I tried:
library(dplyr)
data %>%
filter(group!="e") %>% # Ignore last group.
select_if(~ function(col)) %>% # Write function to look for cols that only have values for one group of the total four groups remaining (a-d).
names() -> cols_to_drop # Save col names.
data %>% select(-cols_to_drop) -> new_data # Subset by saved col names.
Unfortunately, I can't figure out how to write that function inside select_if() to specify that grouping variable condition.
A second thing that I have been wondering about is whether I can use select_if() to remove cols based on the percentage of NAs it contains. Is there a way?
I am not sure if select_if would be able to do such grouped selection of columns.
Here is one way to do this getting data in long format :
library(dplyr)
cols <- data %>%
filter(group != "e") %>%
tidyr::pivot_longer(cols = starts_with('var')) %>%
group_by(name, group) %>%
summarise(value = any(!is.na(value))) %>%
summarise(value = sum(value)) %>%
filter(value > 1) %>%
pull(name)
#Select the columns
data %>% select(group, cols)
# group var_1 var_2 var_3 var_5
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 1 1 1
# 2 a 1 1 1 1
# 3 a 1 1 1 1
# 4 b 1 1 1 1
# 5 b NA 1 1 1
# 6 b NA 1 1 1
# 7 c NA NA 1 NA
# 8 c NA NA 1 NA
# 9 c NA NA 1 NA
#10 d NA NA 1 NA
#11 d NA NA 1 NA
#12 d NA NA 1 NA
#13 e NA NA 1 1
#14 e NA NA 1 1
#15 e NA NA 1 1

r data formatting with NAs

If I have a dataset with three columns like this below
Id Date Gender
1 NA F
1 NA NA
1 03-11-1977 NA
2 04-17-2005 NA
2 NA M
3 NA NA
3 06-04-1999 NA
3 NA F
How could I clean this data such that I see a dataset like this below ?
Id Date Gender
1 03-11-1977 F
2 04-17-2005 M
3 06-04-1999 F
Thanks.
fill the values by Id and filter NA values.
library(dplyr)
df %>%
group_by(Id) %>%
tidyr::fill(Gender, .direction = "updown") %>%
filter(!is.na(Date))
# Id Date Gender
# <int> <chr> <chr>
#1 1 03-11-1977 F
#2 2 04-17-2005 M
#3 3 06-04-1999 F
You may use na.omit in a by approach.
dat <- do.call(rbind, by(dat, dat$Id, function(x) cbind(x[1,1,drop=F], lapply(x[-1], na.omit))))
dat
# Id Date Gender
# 1 1 03-11-1977 F
# 2 2 04-17-2005 M
# 3 3 06-04-1999 F
Data:
dat <- read.table(header=T,text=' Id Date Gender
1 NA F
1 NA NA
1 03-11-1977 NA
2 04-17-2005 NA
2 NA M
3 NA NA
3 06-04-1999 NA
3 NA F')

Get summaries of repeated consecutive values by row in R

I´m trying to get some statistics (min, max, mean) of repeated values by row in R.
My dataframe looks similar to this:
b <- as.data.frame(matrix(ncol=7, nrow=3,
c(3,NA,NA,4,5,NA,7,6,NA,7,NA,8,9,NA,NA,4,6,NA,NA,7,NA), byrow = TRUE))
For each row, I want to add a column with the min, max and mean of the no. of columns containing consecutive NAs and it should something like this
V1 V2 V3 V4 V5 V6 V7 max min mean
1 3 NA NA 4 5 NA 7 2 1 1.5
2 6 NA 7 NA 8 9 NA 1 1 1.0
3 NA 4 6 NA NA 7 NA 2 1 1.33
This is just a small example of my dataset with 2000 rows and 48 columns.
Does anyone have some code for this?
You can apply over the rows and get the "runs" of non-NA columns. Once you have that, you can simply take the summary stats of those:
b[,c("mean", "max", "min")] <- do.call(rbind, apply(b, 1, function(x){
res <- rle(!is.na(x))
res2 <- res[["lengths"]][res[["values"]]]
data.frame(mean = mean(res2), max = max(res2), min = min(res2))
}
))
b
# V1 V2 V3 V4 V5 V6 V7 mean max min
#1 3 NA NA 4 5 NA 7 1.333333 2 1
#2 6 NA 7 NA 8 9 NA 1.333333 2 1
#3 NA 4 6 NA NA 7 NA 1.500000 2 1
A dplyr solution with rlewhich computes the lengths of runs of equal values in a vector.
library(dplyr)
b %>% cbind( b %>% rowwise() %>% do(rl = rle(is.na(.))$lengths[rle(is.na(.))$values == T]))
%>% rowwise()
%>% mutate(mean = mean(rl),
max = max(rl),
min = min(rl))
%>% select(-rl)
# V1 V2 V3 V4 V5 V6 V7 max min mean
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <dbl>
# 1 3 NA NA 4 5 NA 7 2 1 1.50
# 2 6 NA 7 NA 8 9 NA 1 1 1.00
# 3 NA 4 6 NA NA 7 NA 2 1 1.33

Transpose multiple columns as column names and fill with values in R

The sample data as following:
x <- read.table(header=T, text="
ID CostType1 Cost1 CostType2 Cost2
1 a 10 c 1
2 b 2 c 20
3 a 1 b 50
4 a 40 c 1
5 c 2 b 30
6 a 60 c 3
7 c 10 d 1
8 a 20 d 2")
I want the second and third columns (CostType1 and CostType 2) to be the the names of new columns and fill the corresponding cost to certain cost type. If there's no match, filled with NA. The ideal format will be following:
a b c d
1 10 NA 1 NA
2 NA 2 20 NA
3 1 50 NA NA
4 40 1 NA NA
5 NA 30 2 NA
6 60 NA 3 NA
7 NA NA 10 1
8 20 NA NA 2
A solution using tidyverse. We can first get how many groups are there. In this example, there are two groups. We can convert each group, combine them, and then summarize the data frame with the first non-NA value in the column.
library(tidyverse)
# Get the group numbers
g <- (ncol(x) - 1)/2
x2 <- map_dfr(1:g, function(i){
# Transform the data frame one group at a time
x <- x %>%
select(ID, ends_with(as.character(i))) %>%
spread(paste0("CostType", i), paste0("Cost", i))
return(x)
}) %>%
group_by(ID) %>%
# Select the first non-NA value if there are multiple values
summarise_all(funs(first(.[!is.na(.)])))
x2
# # A tibble: 8 x 5
# ID a b c d
# <int> <int> <int> <int> <int>
# 1 1 10 NA 1 NA
# 2 2 NA 2 20 NA
# 3 3 1 50 NA NA
# 4 4 40 NA 1 NA
# 5 5 NA 30 2 NA
# 6 6 60 NA 3 NA
# 7 7 NA NA 10 1
# 8 8 20 NA NA 2
A base solution using reshape
x1 <- setNames(x[,c("ID", "CostType1", "Cost1")], c("ID", "CostType", "Cost"))
x2 <- setNames(x[,c("ID", "CostType2", "Cost2")], c("ID", "CostType", "Cost"))
reshape(data=rbind(x1, x2), idvar="ID", timevar="CostType", v.names="Cost", direction="wide")

r element frequency and index, ranking

I was looking at this example code below,
r element frequency and column name
and was wondering if there is any way to show the index of each element in each column, in addition to the rank and frequency in r. so for example, the desired input and output would be
df <- read.table(header=T, text='A B C D
a a b c
b c x e
c d y a
d NA NA z
e NA NA NA
f NA NA NA',stringsAsFactors=F)
and output
element frequency columns ranking A B C D
1 a 3 A,B,D 1 1 1 na 2
3 c 3 A,B,D 1 3 2 na 1
2 b 2 A,C 2 2 na 1 na
4 d 2 A,B 2 4 3 na na
5 e 2 A,D 2 5 na na 2
6 f 1 A 3 6 na na na
8 x 1 C 3 na na 2 na
9 y 1 C 3 na na 3 na
10 z 1 D 3 na na na 3
Thank you.
Perhaps there is a way to do this in one step, but it's not coming to mind at the moment. So, continuing with my previous answer:
library(dplyr)
library(tidyr)
step1 <- df %>%
gather(var, val, everything()) %>% ## Make a long dataset
na.omit %>% ## We don't need the NA values
group_by(val) %>% ## All calculations grouped by val
summarise(column = toString(var), ## This collapses
freq = n()) %>% ## This counts
mutate(ranking = dense_rank(desc(freq))) ## This ranks
step2 <- df %>%
mutate(ind = 1:nrow(df)) %>% ## Add an indicator column
gather(var, val, -ind) %>% ## Go long
na.omit %>% ## Remove NA
spread(var, ind) ## Go wide
inner_join(step1, step2)
# Joining by: "val"
# Source: local data frame [9 x 8]
#
# val column freq ranking A B C D
# 1 a A, B, D 3 1 1 1 NA 3
# 2 b A, C 2 2 2 NA 1 NA
# 3 c A, B, D 3 1 3 2 NA 1
# 4 d A, B 2 2 4 3 NA NA
# 5 e A, D 2 2 5 NA NA 2
# 6 f A 1 3 6 NA NA NA
# 7 x C 1 3 NA NA 2 NA
# 8 y C 1 3 NA NA 3 NA
# 9 z D 1 3 NA NA NA 4

Resources