Conditionally mutate columns based on column class in R

My question is based on a previous topic posted here: Mutating multiple columns in a data frame
Suppose I have a tibble as follows:
id char_var_1 char_var_2 num_var_1 num_var_2 ... x_var_n
1 ... ... ... ... ...
2 ... ... ... ... ...
3 ... ... ... ... ...
where id is the key, char_var_x are character variables, and num_var_x are numeric variables. I have 346 columns in total, and I want to write a function that scales all the numeric variables except the id column. I'm looking for an elegant way to mutate these columns using pipes and dplyr functions.
Obviously the following works for all numeric variables:
pre_process_data <- function(dt) {
  # scale numeric variables
  dt %>% mutate_if(is.numeric, scale)
}
But I'm looking for a way to exclude the id column from scaling, keeping its original values while scaling all the other numeric variables. Is there an elegant way to do this?

Try the approach below; the answer is similar to the select_if post:
library(dplyr)
# Using @Psidom's example data: https://stackoverflow.com/a/48408027
# (inside the pipe, `.` refers to the incoming data frame)
df %>%
  mutate_if(function(col) is.numeric(col) & !all(col == .$id), scale)
#   id a  b
# 1  1 a -1
# 2  2 b  0
# 3  3 c  1

Not a canonical way to do this, but with a little bit of a hack you can do this with mutate_at, building the integer indices of the columns to mutate with which() and manually constructed column-selection conditions:
df = data.frame(id = 1:3, a = letters[1:3], b = 2:4)
df %>%
  mutate_at(vars(which(sapply(., is.numeric) & names(.) != 'id')), scale)
#  id a  b
#1  1 a -1
#2  2 b  0
#3  3 c  1
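A related sketch (not canonical either): drop id by name with vars(-id) and test the class inside the function instead. This assumes mutate_at() is available (superseded but still exported in current dplyr); as.numeric() strips the matrix attributes that scale() attaches.
library(dplyr)
df %>%
  mutate_at(vars(-id), ~ if (is.numeric(.)) as.numeric(scale(.)) else .)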

How about the "make the column you're interested in a character, then change it back" approach?
dt %>%
  mutate(id = as.character(id)) %>%
  mutate_if(is.numeric, scale) %>%
  mutate(id = as.numeric(id))

You can use dplyr's across():
df %>% mutate(across(c(where(is.numeric), -id), scale))
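One caveat: scale() returns a one-column matrix, so the scaled columns become matrix columns inside the data frame. If you want plain numeric vectors instead, a sketch that wraps the call (same column selection as above):
df %>% mutate(across(c(where(is.numeric), -id), ~ as.numeric(scale(.x))))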

Related

How to conditionally mutate a variable in R based on the values in multiple columns?

There are no recent answers to this question using the current tidyverse verbs (R 4.1 & tidyverse 1.3.1 in my case). I've tried using mutate with both case_when() and ifelse() combined with select_if() to conditionally fill a new variable with a value calculated from the number of TRUE values in specific other columns by row, but neither seems to filter the correct columns for the calculation. I could probably pivot longer to replace my column groupings and avoid the need to filter which columns are used in the mutate calculation, but I want to keep one response per row for merging later. Here's a reproducible example.
library(tidyverse)
set.seed(195)
# create dataframe
response_id <- rep(1:461)
questions <- c("overall","drought","domestic","livestock","distance")
answers <- c("a","b","c","d","e")
colnames <- apply(expand.grid(questions, answers), 1, paste, collapse="_")
df <- tibble(response_id)
# data is actually an unknown mix of TRUE and FALSE values in all columns but just doing that for one column for now for simplicity
df[,colnames] = FALSE
df$overall_a[sample(nrow(df),100)] <- TRUE
# using ifelse and select_if to filter which columns to sum
df %>%
  mutate(positive = ifelse(select_if(isTRUE), sum(str_detect(colnames(df), "a|b")), NA)) %>%
  mutate(negative = ifelse(select_if(isTRUE), sum(str_detect(colnames(df), "c|d|e")), NA)) %>%
  select(response_id, positive, negative)
# using case_when
df %>%
  mutate(positive = case_when(TRUE ~ sum(str_detect(colnames(df), "a|b"))), NA) %>%
  mutate(negative = case_when(TRUE ~ sum(str_detect(colnames(df), "c|d|e"))), NA) %>%
  select(response_id, positive, negative)
The desired output should be as follows. Thanks for any help on this!
# A tibble: 461 × 3
   response_id positive negative
         <int>    <int>    <int>
 1           1        0        0
 2           2        0        0
 3           3        0        0
 4           4        0        0
 5           5        1        0
 6           6        1        0
 7           7        0        0
 8           8        1        0
 9           9        0        0
10          10        1        0
# … with 451 more rows
Having data in column names is not considered "tidy" and the "tidyverse" works best with tidy data. Rather than hacking against the column names, the pivoting approach would be the most consistent with the tidy philosophy. Plus it will scale better for more categories. For example
df %>%
  pivot_longer(-response_id) %>%
  separate(name, into = c("category", "code")) %>%
  mutate(sentiment = case_when(
    code %in% c("a", "b") ~ "positive",
    code %in% c("c", "d", "e") ~ "negative")) %>%
  group_by(response_id, sentiment) %>%
  summarize(count = sum(value)) %>%
  pivot_wider(id_cols = response_id, names_from = sentiment, values_from = count)
It's not as concise but it more directly says what it's doing.
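With recent dplyr (>= 1.0), summarize(count = sum(value), .groups = "drop") also silences the regrouping message before the pivot_wider() step.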
But if you really want to keep data in the column names, you can perform rowwise summaries using c_across() with recent dplyr:
df %>%
  rowwise() %>%
  mutate(
    positive = sum(c_across(ends_with(c("_a", "_b")))),
    negative = sum(c_across(ends_with(c("_c", "_d", "_e"))))) %>%
  select(response_id, positive, negative)
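Note that rowwise() can be slow on wide data. Assuming the answer columns are all logical (or 0/1), a vectorized sketch of the same idea uses rowSums() over across():
df %>%
  mutate(
    positive = rowSums(across(ends_with(c("_a", "_b")))),
    negative = rowSums(across(ends_with(c("_c", "_d", "_e"))))) %>%
  select(response_id, positive, negative)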

Drop all rows from a data frame that follow a filter threshold using dplyr

This feels like a common enough task that I assume there's an established function/method for accomplishing it. I'm imagining a function like dplyr::filter_after() but there doesn't seem to be one.
Here's the method I'm using as a starting point:
# Setup:
library(dplyr)
threshold <- 3
test.df <- data.frame("num" = c(1:5, 1:5), "let" = letters[1:10])
# Drop every row that follows the first 3, including that row:
out.df <- test.df %>%
  mutate(pastThreshold = cumsum(num >= threshold)) %>%
  filter(pastThreshold == 0) %>%
  dplyr::select(-pastThreshold)
This produces the desired output:
> out.df
  num let
1   1   a
2   2   b
Is there another solution that's less verbose?
dplyr provides the window functions cumany() and cumall(), which make it easy to keep all rows before (or after) a condition first fails; see ?cumall for the documentation.
test.df %>%
  filter(cumall(num < threshold))  # all rows until the condition is violated for the first time
#   num let
# 1   1   a
# 2   2   b
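Equivalently, if you prefer to phrase it as "drop everything from the first violation onward", a sketch with the complementary cumany():
test.df %>%
  filter(!cumany(num >= threshold))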
You can do the following. Note that : binds tighter than -, so the index evaluates as (1:which.max(num == threshold)) - 1, and the resulting 0 index is silently dropped by slice():
test.df %>%
  slice(1:which.max(num == threshold) - 1)
  num let
1   1   a
2   2   b
We can use the same idea in filter() without needing to create an extra column and remove it later:
library(dplyr)
test.df %>%
  filter(cumsum(num >= threshold) == 0)
#  num let
#1   1   a
#2   2   b
Or another option is match() with slice(); this assumes the value threshold - 1 actually occurs in num:
test.df %>%
  slice(seq_len(match(threshold - 1, num)))
Or rleid() from data.table; this keeps the first run of num >= threshold, so it assumes the data starts below the threshold:
library(data.table)
test.df %>%
  filter(rleid(num >= threshold) == 1)

Arrange values in a column based on unique values in another column within a group in R

I'm trying to reorder a column in a dataframe, in ascending or descending order, based on the unique value of another column in the same dataframe within groups.
To demonstrate, below is an example in which a dataframe has three columns. The goal is to group by the gr column and order the a column based on the unique value of the b column within each group: if within gr == 1 the unique value of b is TRUE, I would like a in ascending order, and otherwise in descending order. The example is below.
# sample dataset
df <- data.frame(
  a = c(1, 3, 2, 4),
  b = c(T, T, F, F),
  gr = c(1, 1, 2, 2)
)
# split dataset according to a grouping column
df <- df %>% split(df$gr)
# ordering function
f1 <- function(dt) {
  if (unique(dt$b) == T) {
    arrange(dt, a)
  } else {
    arrange(dt, -a)
  }
}
The desired dataset is what this list-based approach produces:
# order within groups based on variable b
df %>% purrr::map_df(f1)
Can this be done without using lists or tidyr::nest? A solution using a simple dplyr::group_by and dplyr::arrange would be ideal.
Here is one option with arrange alone, without doing any split:
library(dplyr)
df %>%
  arrange(gr, c(1, -1)[gr] * a)
#   a     b gr
# 1 1  TRUE  1
# 2 3  TRUE  1
# 3 4 FALSE  2
# 4 2 FALSE  2
Or, if it needs to be based on 'b':
df %>%
  arrange(gr, c(-1, 1)[(b + 1)] * a)
#   a     b gr
# 1 1  TRUE  1
# 2 3  TRUE  1
# 3 4 FALSE  2
# 4 2 FALSE  2
Here we make use of 'gr' being numeric. If it is not, create the grouping index with match() and use that to flip the sign of 'a':
df %>%
  arrange(gr, c(1, -1)[match(gr, unique(gr))] * a)
Here is a way.
library(dplyr)
# map b to a sign: TRUE -> 1 (ascending), FALSE -> -1 (descending)
f2 <- function(dt) {
  2 * as.integer(dt$b) - 1
}
df %>% arrange(gr, a * f2(df))
If you accept the rearrangement of the column gr, remove it from arrange.
df %>% arrange(a * f2(df))
Edit.
Simpler?
f2 <- function(x) 2*x - 1
df %>% arrange(gr, a*f2(b))

Create a frequency table with dplyr that counts factor levels and missing values and reports them

Some questions are similar to this topic (here or here, for example) and I know one solution that works, but I want a more elegant response.
I work in epidemiology and I have variables coded 1 and 0 (or NA). Example:
Does the patient have cancer?
NA or 0 means no
1 means yes
Let's say I have several variables in my dataset and I want to count only the values coded "1". It's a classic frequency table, but dplyr makes it more complicated than I imagined at first glance.
My code is working:
dataset %>%
  select(VISimpair, HEARimpai, IntDis, PhyDis, EmBehDis, LearnDis,
         ComDis, ASD, HealthImpair, DevDelays) %>% # replace with the columns you need
  summarise_all(funs(sum(1 - is.na(.))))
And you can reproduce this code here:
library(tidyverse)
dataset <- data.frame(var1 = rep(c(NA,1),100), var2=rep(c(NA,1),100))
dataset %>% select(var1, var2) %>% summarise_all(funs(sum(1-is.na(.))))
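(Note that funs() has since been deprecated; assuming dplyr >= 1.0, an equivalent sketch of the line above with across() is:)
dataset %>%
  summarise(across(c(var1, var2), ~ sum(!is.na(.x))))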
But I really want to select all the variables of interest, count how many 0s (or NAs) and how many 1s each one has, and report that as a table.
Thanks.
What about the following frequency table per variable?
First, I edit your sample data to also include 0's and load the necessary libraries.
library(tidyr)
library(dplyr)
dataset <- data.frame(var1 = rep(c(NA,1,0),100), var2=rep(c(NA,1,0),100))
Second, I convert the data using gather to make it easier to group_by later for the frequency table created by count, as mentioned by CPak.
dataset %>%
  select(var1, var2) %>%
  gather(var, val) %>%
  mutate(val = factor(val)) %>%
  group_by(var, val) %>%
  count()
# A tibble: 6 x 3
# Groups:   var, val [6]
  var   val       n
  <chr> <fct> <int>
1 var1  0       100
2 var1  1       100
3 var1  NA      100
4 var2  0       100
5 var2  1       100
6 var2  NA      100
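gather() is superseded in current tidyr; assuming tidyr >= 1.0, the same table (up to column types) can be sketched with pivot_longer() and count():
dataset %>%
  pivot_longer(c(var1, var2), names_to = "var", values_to = "val") %>%
  count(var, val)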
A quick and dirty method to do this is to coerce your input into factors:
dataset$var1 = as.factor(dataset$var1)
dataset$var2 = as.factor(dataset$var2)
summary(dataset$var1)
summary(dataset$var2)
summary() tells you the number of occurrences of each factor level, including NAs.

Unique on a dataframe with only selected columns

I have a dataframe with >100 columns, and I would like to find the unique rows by comparing only two of the columns. I'm hoping this is an easy one, but I can't get it to work with unique or duplicated myself.
In the below, I would like to unique only using id and id2:
data.frame(id=c(1,1,3),id2=c(1,1,4),somevalue=c("x","y","z"))
id id2 somevalue
1 1 x
1 1 y
3 4 z
I would like to obtain either:
id id2 somevalue
1 1 x
3 4 z
or:
id id2 somevalue
1 1 y
3 4 z
(I have no preference which of the unique rows is kept)
Ok, if it doesn't matter which value in the non-duplicated column you select, this should be pretty easy:
dat <- data.frame(id=c(1,1,3),id2=c(1,1,4),somevalue=c("x","y","z"))
> dat[!duplicated(dat[,c('id','id2')]),]
id id2 somevalue
1 1 1 x
3 3 4 z
Inside the duplicated call, I'm simply passing only those columns from dat that I don't want duplicates of. This code will automatically always select the first of any ambiguous values. (In this case, x.)
Here are a couple dplyr options that keep non-duplicate rows based on columns id and id2:
library(dplyr)
df %>% distinct(id, id2, .keep_all = TRUE)
df %>% group_by(id, id2) %>% filter(row_number() == 1)
df %>% group_by(id, id2) %>% slice(1)
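With dplyr >= 1.0, the slice(1) variant is usually written with slice_head(); a sketch:
df %>% group_by(id, id2) %>% slice_head(n = 1) %>% ungroup()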
Using unique():
dat <- data.frame(id=c(1,1,3),id2=c(1,1,4),somevalue=c("x","y","z"))
dat[row.names(unique(dat[,c("id", "id2")])),]
Minor update to @Joran's code.
Using the code below, you can avoid the ambiguity and get only the unique combinations of the two columns:
dat <- data.frame(id=c(1,1,3), id2=c(1,1,4) ,somevalue=c("x","y","z"))
dat[row.names(unique(dat[,c("id", "id2")])), c("id", "id2")]
