Transform multiple columns the tidy way - r

I wonder whether there is a solution for transforming multiple columns within a pipe.
Let's say we have a tibble with three columns. iq_pre and iq_post have to be transformed to the log scale and saved into new columns.
library(tidyverse)
library(magrittr)
df <- tibble(
iq_pre = rnorm(10, 100, 15),
iq_post = rnorm(10, 100, 18),
gender = rep(c("m", "f"), each = 5)
)
I know I could get the result with base R by doing
df[c("iq_pre_lg", "iq_post_lg")] <- log(df[c("iq_pre", "iq_post")])
or looping over the columns with lapply.
The only tidy solution I came up with is to use mutate manually for each column like this
df %<>%
mutate(iq_pre_lg = log(iq_pre),
iq_post_lg = log(iq_post))
Since the names of the columns which should be transformed start with the same letters, I could also use
df %<>%
mutate_at(vars(starts_with("iq")), funs(lg = log(.)))
But what if I want to convert like 20 columns with different names? Is there a way to use purrr::map or maybe even tidyr::nest to solve this in a more elegant way?

We can use
df %>%
mutate_at(vars(matches("iq")), log)
One advantage of matches is that it takes a regular expression, so several patterns can be matched in a single call. For example, to apply the function to columns that start (^) with 'iq' or (|) end ($) with 'oq', both patterns can be passed to a single matches call
df %>%
mutate_at(vars(matches('^iq|oq$')), log)
If the column names have nothing in common, so that n columns would need n patterns, but the columns sit in known positions, then the position numbers can be passed to mutate_at instead, either directly or wrapped in vars. In the current example, the 'iq' columns are the 1st and 2nd columns
df %>%
mutate_at(1:2, log)
Similarly, if the 20 columns occupy the 1st 20 positions
df %>%
mutate_at(1:20, log)
Or if the positions are 1 to 6, 8 to 12, 41:50
df %>%
mutate_at(vars(1:6, 8:12, 41:50), log)
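Note that the mutate_at calls above overwrite the selected columns, while the question asked for new columns. A minimal sketch of one way to keep the originals, assuming dplyr 1.0.0 or later (where across is available): list the column names in a character vector, so they need not share a pattern, and let .names add a suffix. The toy values and the name cols_to_log are made up for illustration.

```r
library(dplyr)

# Toy data mirroring the question's tibble (values are made up)
df <- tibble(
  iq_pre  = c(100, 110, 95),
  iq_post = c(105, 120, 90),
  gender  = c("m", "f", "m")
)

# Columns to transform, listed explicitly; the names need not share a pattern
cols_to_log <- c("iq_pre", "iq_post")

# across() applies log() to each listed column;
# .names appends "_lg", so the originals are kept and new columns are added
df_logged <- df %>%
  mutate(across(all_of(cols_to_log), log, .names = "{.col}_lg"))

names(df_logged)
```

For 20 columns with unrelated names, only the cols_to_log vector would need to change.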

Related

How to ensure the setNames() attributes the correct names to my group_split() datasets when splitting by multiple groups?

As a follow-up to this question, I'm using dplyr's group_split() to make dataframes / tibbles based on the levels of a column. Continuing from that question, I want to split on two columns instead of one. When I try to split and name the resulting datasets, the wrong names get attached to some of them.
Here's a simple example:
library(dplyr)
#Sample dataset to intuitively illustrate issue
example <- tibble(number = c(1:6),
even_or_odd = c("odd", "even", "odd", "even", "odd", "even"),
prime_or_not = c("prime", "prime", "prime", "not", "prime", "not")) %>%
mutate(type = paste0(even_or_odd, "_", prime_or_not)) %>%
mutate(type_factor = factor(type, levels = unique(type)))
#Does group split to make 3 datasets
the_test <- example %>%
group_split(even_or_odd, prime_or_not) %>%
setNames(unique(example$type_factor))
#The data sets with some being correct but others not
even_prime <- the_test["even_prime"]$even_prime #works!
even_not <- the_test["even_not"]$even_not #wrong label :`-(
odd_prime <- the_test["odd_prime"]$odd_prime #wrong label :`-(
odd_not <- the_test["odd_not"]$odd_not #works--correctly throws an error!
My question: how do I ensure that my group names will be attributed to the right dataset and avoid the issues here with even_not and odd_prime being mixed up?
In my actual dataset, I have 50+ combinations, so typing them all out manually is not an option. In addition, my actual dataset will have some combinations that don't consistently exist (like the odd not prime combination here), so relying on the index isn't an option.
Instead of splitting by the two columns, split by the factor column that was already created; group_split then returns the pieces in the order of the levels of type_factor. In addition, naming with unique on type_factor can go wrong if the values appear in a different order than the levels, because unique returns values in order of first occurrence. Using levels is safer. It may also be worth calling droplevels first in case there are unused levels.
the_test <- example %>%
group_split(type_factor) %>%
setNames(levels(example$type_factor))
group_split returns an unnamed list. To avoid the risk of attaching the wrong names, use split from base R, which does return a named list. The elements can then come back in any order, because each name stays paired with its own data
# 1 - return in a different order based on alphabetic order
split(example, example[c("even_or_odd", "prime_or_not")], drop = TRUE)
# 2 - return order based on the levels of the factor column
split(example, example$type_factor)
# 3 - With dplyr pipe
example %>%
split(.$type_factor)
# 4 - or using magrittr exposition operator
library(magrittr)
example %$%
split(x = ., f = type_factor)
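If staying with group_split on the two original columns is preferred, one possible approach is to build the names from group_keys(), which returns one row per group in the same order that group_split() uses. This is only a sketch; the names grouped, keys and split_names are introduced here for illustration, and the data is the question's example without the helper columns.

```r
library(dplyr)

# The question's example data (without the type / type_factor helpers)
example <- tibble(
  number       = 1:6,
  even_or_odd  = c("odd", "even", "odd", "even", "odd", "even"),
  prime_or_not = c("prime", "prime", "prime", "not", "prime", "not")
)

grouped <- example %>% group_by(even_or_odd, prime_or_not)

# One row per existing group, in the order group_split() will return them
keys <- group_keys(grouped)
split_names <- paste(keys$even_or_odd, keys$prime_or_not, sep = "_")

the_test <- grouped %>%
  group_split() %>%
  setNames(split_names)

names(the_test)
```

Because the names are derived from the groups that actually exist, combinations that are missing from the data (such as odd_not here) simply do not appear.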
Oh, of course the moment I post it, I realize that an easy solution existed:
Just change the group split to the new variable and it works!
library(dplyr)
#Does group split to make 3 datasets
the_test <- example %>%
group_split(type_factor) %>%
setNames(unique(example$type_factor))
#The data sets with some being correct but others not
even_prime <- the_test["even_prime"]$even_prime #works!
even_not <- the_test["even_not"]$even_not #works now!
odd_prime <- the_test["odd_prime"]$odd_prime #works now!
odd_not <- the_test["odd_not"]$odd_not #works--correctly throws an error!

How to replace all values in selected columns with dplyr

I was trying to figure out a way to transform all values in selected columns of my dataset using an equation $$x_i = x_{max} - x_i$$ using dplyr. I'm not sure how to correctly do this for one column, let alone multiple columns. My attempt at mutating 1 column:
df1 <- df %>% mutate(column1 = replace(column1, ., x = max(column1) - x)
My x = max(column1) - x part is not literal, I just want to know how I can implement that equation into all row entries in the column. Furthermore, how can I do this for multiple columns in the same line? Any help is appreciated. Thanks!
If the goal is to replace all values across multiple columns, loop across the numeric columns and subtract each value from the maximum of its column
library(dplyr)
df <- df %>%
mutate(across(where(is.numeric), ~ max(., na.rm = TRUE) - .))
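To restrict the transformation to selected columns rather than every numeric one, the column names can be given to across directly. A small sketch with made-up data; column1 comes from the question, while column2 and label are hypothetical.

```r
library(dplyr)

# Made-up data; only column1 and column2 should be transformed
df <- tibble(
  column1 = c(1, 4, 10),
  column2 = c(3, 3, 7),
  label   = c("a", "b", "c")
)

# For each selected column, replace every value x_i with max(x) - x_i
df1 <- df %>%
  mutate(across(c(column1, column2), ~ max(.x, na.rm = TRUE) - .x))

df1
```

A single column works the same way without across: mutate(column1 = max(column1, na.rm = TRUE) - column1).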

Applying pivot_wider to dataframes within a list

I have created a list of dataframes with split like so:
dataframes_list <- split(df, f = df$variable3)
Each dataframe (131 in total) is in long format and they all have the same variables and structure. I want to apply pivot_wider to all of them at once.
I have been struggling with some functions of the apply family, but could not get it done:
First I reduced the number of variables within each dataframe selecting only those that should be used for pivoting
dataframes_list_2 <- lapply(dataframes_list, function (x) select(x, variable1, variable2))
Then I tried pivot_wider
dataframes_list_3 <- lapply(dataframes_list_2, function(x) pivot_wider(x, names_from = variable1, values_from = variable2))
What I obtain in this way is the list with dataframes that contain 1 observation per variable, each of them being a vector of (in this case) 12 values. What I want instead is this:
Because there was a warning telling me that my observations were not uniquely identified, I varied the code above including such variable. But what I got was this:
Can someone give me some answer to this issue?
Thank you
Each dataframe in the list has this aspect:
I had the same problem and I solved it this way:
df_list <- lapply(1:length(my_list),
function(x) (pivot_wider(my_list[[x]], names_from = names, values_from = values)))
bind_rows(df_list)
You will get what you needed! Hope it helps!
You could try:
map(my_list, ~ (pivot_wider(.x, names_from=1,values_from= 2)))
The numbers 1 and 2 are the positions of the columns in my tibbles. You can also use map_dfr. To combine the data sets you can use unnest or bind_rows.
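The list-columns described in the question usually appear when nothing identifies which values belong to the same output row. One possible fix, sketched here with toy data, is to number the observations within each variable1 value before pivoting. The helper widen and the column obs are hypothetical names, and the sketch assumes the n-th value of each variable belongs to the n-th row.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the list produced by split(); values are made up
dataframes_list <- list(
  a = tibble(variable1 = c("x", "y", "x", "y"), variable2 = c(1, 2, 3, 4)),
  b = tibble(variable1 = c("x", "y", "x", "y"), variable2 = c(5, 6, 7, 8))
)

widen <- function(d) {
  d %>%
    group_by(variable1) %>%
    mutate(obs = row_number()) %>%  # n-th observation of each variable1 value
    ungroup() %>%
    pivot_wider(names_from = variable1, values_from = variable2)
}

# Apply the same reshaping to every data frame in the list
wide_list <- lapply(dataframes_list, widen)

wide_list$a
```

If the data already contains a column that identifies rows (a date or a subject id, for example), keeping that column in the select step would serve the same purpose as obs.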

How do I convert a bunch of factors to ordinals?

I have a bunch of factors that are really ordinals but they're coded as numerics.
This is my code
student_performance <-
read_csv("https://raw.githubusercontent.com/UBC-MDS/ellognea-smwatts-student-performance/master/data/student-math-perf.csv") %>%
as_tibble()
convert.to.ordinals <-
c("Medu",
"Fedu",
"traveltime",
"studytime")
student_performance %>%
mutate_at(vars(convert.to.ordinals), as.factor(ordered = T))
I'm trying to organize them as ordinals and get them to be in ascending order, so it would be the same as doing factor(student_performance$Medu, levels = c(1, 2, 3, 4)) except for all of the ones in the list of variable names
In newer versions of dplyr (1.0.0 and later), we can use across to loop over the columns named in the vector convert.to.ordinals, apply factor to each of them, and assign the output back to the original object so that the change is kept
library(dplyr)
student_performance <- student_performance %>%
mutate(across(all_of(convert.to.ordinals), ~
factor(., ordered = TRUE)))
NOTE: across is a general way to loop over a set of columns and it replaces mutate_at, mutate_all and mutate_if. The difference is in how .cols is specified: all_of wraps a character vector of column names, select helpers such as matches, starts_with and ends_with pick columns by pattern, everything() covers all columns (as mutate_all did), and where takes a predicate (as mutate_if did).
Or with mutate_at; the key is the lambda function (~ is shorthand for function(x))
student_performance %>%
mutate_at(vars(convert.to.ordinals), ~ factor(., ordered = TRUE))
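The question also mentions fixing the level order explicitly, as in factor(..., levels = c(1, 2, 3, 4)). A sketch of how that could be combined with across, using a small made-up tibble in place of the CSV; the assumption that every listed column uses the codes 1 to 4 comes from the question's own example and should be checked against the real data.

```r
library(dplyr)

# Made-up stand-in for the real data set
student_performance <- tibble(
  Medu = c(4, 1, 3, 2),
  Fedu = c(2, 2, 4, 1),
  age  = c(15, 16, 17, 18)
)

convert.to.ordinals <- c("Medu", "Fedu")

# Same levels for every listed column (assumed to be coded 1 to 4)
student_performance <- student_performance %>%
  mutate(across(all_of(convert.to.ordinals),
                ~ factor(.x, levels = 1:4, ordered = TRUE)))

levels(student_performance$Fedu)
is.ordered(student_performance$Medu)
```

Passing levels explicitly keeps a level even when no row uses it, which plain factor(., ordered = TRUE) would not do.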

How to make a weighted mean inside a summarise_if

I have a dataframe containing a line per company, with different variables (some numeric, others not):
data <- data.frame(id=1:5,
CA = c(1200,1500,1550,200,0),
EBE = c(800,50,654,8555,0),
VA = c(6984,6588,633,355,84),
FBCF = c(35,358,358,1331,86),
name=c("qsdf","xdwfq","qsdf","sqdf","qsdfaz"),
weight = c(1, 5, 10,1 ,1))
I would like to summarise all numeric variables by a weighted sum. If I wanted a simple sum I would do:
data %>% summarise_if(is.numeric,sum)
but I don't see how to define a weighted sum.
I tried:
w.sum <- function(x) {sum(x*weight) %>% return()}
but without any success.
We can use it inside the funs
data %>%
summarise_if(is.numeric, funs(sum(.*weight)))
Note that summarise_if applies the function to every column of numeric class. In the example the 'id' column is also numeric, and it probably should not be summarised. A better option is summarise_at, which lets us specify the columns of interest
data %>%
summarise_at(names(.)[2:5], funs(sum(.*weight)))
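funs() has been deprecated since dplyr 0.8.0, so on a current installation the same weighted sum might be written with across, assuming dplyr 1.0.0 or later. The sketch below reuses the question's data; the name result is introduced for illustration.

```r
library(dplyr)

# Data from the question
data <- data.frame(id = 1:5,
                   CA = c(1200, 1500, 1550, 200, 0),
                   EBE = c(800, 50, 654, 8555, 0),
                   VA = c(6984, 6588, 633, 355, 84),
                   FBCF = c(35, 358, 358, 1331, 86),
                   name = c("qsdf", "xdwfq", "qsdf", "sqdf", "qsdfaz"),
                   weight = c(1, 5, 10, 1, 1))

# Weighted sum of the columns of interest; `weight` is looked up in the data
result <- data %>%
  summarise(across(c(CA, EBE, VA, FBCF), ~ sum(.x * weight)))

result
```

Naming the columns explicitly leaves out id and weight, which a where(is.numeric) selection would include.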

Resources