Transmute over multiple columns in dplyr

I have a dplyr question: How do I use transmute over each column without writing each column out by hand? I.e. is there something like transmute_each()?
I want to do the following: using dplyr, I want to compute the z-score of each column in the MWE below:
tickers <- c(rep(1,10),rep(2,10))
df <- data.frame(cbind(tickers,rep(1:20),rep(2:21),rep(2:21),rep(4:23),rep(3:22)))
colnames(df) <- c("tickers","col1","col2","col3","col4","col5")
df %>% group_by(tickers)
Is there a simple way to then use transmute to achieve the following:
for (i in 2:ncol(df)) {
  df[, i] <- (df[, i] - mean(df[, i])) / sd(df[, i])
}
Many thanks

Now that there is a transmute_at() function (as of dplyr 0.7), you can do the following:
df %>%
  group_by(tickers) %>%
  transmute_at(.vars = vars(starts_with("col")),
               .funs = funs(scale(.))) %>%
  ungroup()
Note that this uses the scale() function from base R, which by default converts a numeric vector into a z-score.
Also, the use of vars() in the .vars argument allows you to use all the helper functions that are available for dplyr's select(), such as one_of(), ends_with(), etc.
Finally, instead of writing funs(scale(.)) here, since you're using a simple function in the .funs argument, you can just write .funs = scale.
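For completeness, on dplyr 1.0 or later the scoped verbs such as transmute_at() (and funs()) are themselves superseded. A minimal sketch of the same operation with across():
df %>%
  group_by(tickers) %>%
  transmute(across(starts_with("col"), scale)) %>%
  ungroup()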

I solved this using the following:
df %>%
  group_by(tickers) %>%
  mutate_at(.vars = vars(matches("col")),
            .funs = funs((. - mean(.)) / sd(.)))
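For newer dplyr (1.0+), where mutate_at() and funs() are superseded, the same z-scoring can be sketched with across() and a lambda, assuming the columns still match "col":
df %>%
  group_by(tickers) %>%
  mutate(across(matches("col"), ~ (.x - mean(.x)) / sd(.x)))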

Related

Is it possible to use group_by in a function for more than one variable?

I created a function that aggregates the numeric values in a dataset, and I use a group_by() function to group the data first. Below is an example of what the code I wrote looks like. Is there a way I can group_by() more than one variable without having to create another input for the function?
agg <- function(data, group){
  aggdata <- data %>%
    group_by({{group}}) %>%
    select_if(function(col) !is.numeric(col) & !is.integer(col)) %>%
    summarise_if(is.numeric, sum, na.rm = TRUE)
  return(aggdata)
Your code has (at least) a misplaced curly brace, and it's a bit difficult to see what you're trying to accomplish without a reproducible example and desired result.
It is possible to pass a vector of variable names to group_by(). For example, the following produces the same result as mtcars %>% group_by(cyl, gear):
my_groups <- c("cyl", "gear")
mtcars %>% group_by(!!!syms(my_groups))
Maybe you could use this syntax within your function definition.
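If you are on dplyr 1.0 or later, an alternative sketch is to combine across() with all_of() to group by a character vector of column names:
my_groups <- c("cyl", "gear")
mtcars %>% group_by(across(all_of(my_groups)))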

Product of columns selected by starts_with()

I am wondering if there is an efficient or alternative way to compute the row-wise product of a selection of columns with dplyr.
I know one way to do it (see below), but rowwise() seems to take a long time on my large data set, so I am looking for alternatives.
df <- df %>%
  rowwise() %>%
  mutate(myprod = prod(c_across(starts_with('var_xyz'))))
Here are some alternative options.
If you want to stay in the tidyverse, you can try pmap_dbl:
library(dplyr)
library(purrr)
df %>% mutate(myprod = pmap_dbl(select(., starts_with('var_xyz')), prod))
A base R option with Reduce, or rowProds from matrixStats:
cols <- grep('^var_xyz', names(df))
# 2. base R: multiply the selected columns together
df$myprod <- Reduce(`*`, df[cols])
# 3. matrixStats: row products of the selected columns as a matrix
df$myprod <- matrixStats::rowProds(as.matrix(df[cols]))
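As a quick sanity check on a small made-up data frame (the toy columns below are hypothetical, purely for illustration), the two options should agree:
toy <- data.frame(var_xyz1 = 1:3, var_xyz2 = c(2, 5, 10), other = letters[1:3])
cols <- grep('^var_xyz', names(toy))
# both should return the row-wise products 2, 10, 30
Reduce(`*`, toy[cols])
matrixStats::rowProds(as.matrix(toy[cols]))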

dplyr mutate using dynamic variable name while respecting group_by

I'm trying, as per
dplyr mutate using variable columns
and
dplyr - mutate: use dynamic variable names
to use dynamic names in mutate(). What I am trying to do is normalize column data by groups, subject to a minimum standard deviation. Each column has a different minimum standard deviation.
e.g. (I omitted loops & map statements for convenience)
require(dplyr)
require(magrittr)
require(purrr)  # for pluck()
data(iris)
iris <- tbl_df(iris)
minsd <- c('Sepal.Length' = 0.8)
varname <- 'Sepal.Length'
iris %>%
  group_by(Species) %>%
  mutate(!!varname := mean(pluck(iris, varname), na.rm = TRUE) /
           max(sd(pluck(iris, varname)), minsd[varname]))
I got the dynamic assignment and variable selection to work as suggested by the referenced answers, but group_by() is not respected, which, for me at least, is the main benefit of using dplyr here.
The desired answer is given by:
iris %>%
  group_by(Species) %>%
  mutate(!!varname := mean(Sepal.Length, na.rm = TRUE) /
           max(sd(Sepal.Length), minsd[varname]))
Is there a way around this?
I actually did not know much about pluck(), so I can't say exactly what went wrong, but I would go with the following, which works:
iris %>%
  group_by(Species) %>%
  mutate(
    !!varname :=
      mean(!!as.name(varname), na.rm = TRUE) /
      max(sd(!!as.name(varname)),
          minsd[varname])
  )
Let me know if this isn't what you were looking for.
The other answer is obviously the best, and it also solved a similar problem that I have encountered. For example, with !!as.name(), there is no need to use group_by_() (or group_by_at()) or arrange_() (or arrange_at()).
Another way is to replace pluck(iris, varname) in your code with .data[[varname]]. The reason pluck(iris, varname) does not work is, I suppose, that iris inside pluck(iris, varname) refers to the original, ungrouped tibble. By contrast, .data refers to the data that mutate() is currently operating on, so the computation respects the groups.
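For example, a sketch of the same pipeline using .data, under the setup from the question:
iris %>%
  group_by(Species) %>%
  mutate(!!varname := mean(.data[[varname]], na.rm = TRUE) /
           max(sd(.data[[varname]]), minsd[varname]))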
An alternative to as.name() is rlang::sym() from the rlang package.

Convert all columns to characters in a data.frame

Consider a data.frame with a mix of data types.
For a weird purpose, a user needs to convert all columns to characters.
How is it best done? A tidyverse attempt at a solution is this:
map(mtcars,as.character) %>% map_df(as.list) %>% View()
c2<-map(mtcars,as.character) %>% map_df(as.list)
When I call str(c2), it should show a tibble or data.frame with all character columns.
The other option would be some parameter setting for write.csv() or write_csv() that achieves the same thing in the resulting file output.
EDIT: 2021-03-01
Beginning with dplyr 1.0.0, the _all() function variants are superseded. The new way to accomplish this is using the new across() function.
library(dplyr)
mtcars %>%
  mutate(across(everything(), as.character))
With across(), we choose the set of columns we want to modify using tidyselect helpers (here we use everything() to choose all columns), and then specify the function we want to apply to each of the selected columns. In this case, that is as.character().
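The column selection is flexible; for instance, a sketch that targets only numeric columns via the where() helper (in mtcars that happens to be every column):
mtcars %>%
  mutate(across(where(is.numeric), as.character))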
Original answer:
You can also use dplyr::mutate_all.
library(dplyr)
mtcars %>%
  mutate_all(as.character)
In base R:
x[] <- lapply(x, as.character)
This converts the columns to character class in place, retaining the data.frame's attributes. A call to data.frame() would cause them to be lost.
Attribute preservation using dplyr: Attributes seem to be preserved during dplyr::mutate(across(everything(), as.character)). Previously they were destroyed by dplyr::mutate_all.
Example
x <- mtcars
attr(x, "example") <- "1"
In the second case below, the example attribute is retained:
# Destroys attributes
data.frame(lapply(x, as.character)) %>%
  attributes()
# Preserves attributes
x[] <- lapply(x, as.character)
attributes(x)
This might work, but I'm not sure it's the best approach.
df = data.frame(lapply(mtcars, as.character))
str(df)
The most efficient way, using data.table:
data.table::setDT(mtcars)
mtcars[, (colnames(mtcars)) := lapply(.SD, as.character), .SDcols = colnames(mtcars)]
Note: you can use this pattern to convert just a few columns of a data.table to your desired column type.
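For example, a sketch that converts only two columns (mpg and cyl are chosen arbitrarily here for illustration; mtcars is already a data.table after the setDT() call above):
cols_to_convert <- c("mpg", "cyl")
mtcars[, (cols_to_convert) := lapply(.SD, as.character), .SDcols = cols_to_convert]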
If we want to convert all columns to character, then we can also do something like this:
library(data.table)
to_col_type <- function(col_names, type){
  get(paste0("as.", type))(mtcars[[col_names]])
}
mtcars <- rbindlist(list(Map(to_col_type, colnames(mtcars), "character")))
mutate_all() in the accepted answer is superseded.
You can use the mutate() function with across():
library(dplyr)
mtcars %>%
  mutate(across(everything(), as.character))

Dplyr or Magrittr - tolower?

Is it possible to set all column names to upper or lower within a dplyr or magrittr chain?
In the example below I load the data and then, using a magrittr pipe, chain it through to my dplyr mutations. In the fourth line I use the tolower() function, but for a different purpose: to create a new variable with lowercase observations.
mydata <- read.csv('myfile.csv') %>%
  mutate(Year = mdy_hms(DATE),
         Reference = (REFNUM),
         Event = tolower(EVENT))
I'm obviously looking for something like colnames = tolower but know this doesn't work/exist.
I'm aware of the dplyr rename() function, but it isn't really helpful here.
In magrittr the colname options are:
set_colnames instead of base R's colnames<-
set_names instead of base R's names<-
I've tried numerous permutations with these but no dice.
Obviously this is very simple in base R:
names(mydata) <- tolower(names(mydata))
However, it seems incongruous with the dplyr/magrittr philosophy that you would have to do that as a clunky one-liner before moving on to an elegant chain of dplyr/magrittr code.
With {dplyr} we can do:
mydata %>% rename_all(tolower)
or, since dplyr 1.0:
mydata %>% rename_with(tolower)
iris %>% setNames(tolower(names(.))) %>% head
Or equivalently, use the replacement functions in non-replacement form:
iris %>% `names<-`(tolower(names(.))) %>% head
iris %>% `colnames<-`(tolower(names(.))) %>% head # if you really want to use `colnames<-`
Using magrittr's "compound assignment pipe-operator" %<>% might be, if I understand your question correctly, an even more succinct option.
library("magrittr")
names(iris) %<>% tolower
?`%<>%` # for more
mtcars %>%
  set_colnames(value = casefold(colnames(.), upper = FALSE)) %>%
  head
casefold() is available in base R and can convert in both directions, i.e. to either all upper case or all lower case, depending on its upper flag.
Also, using colnames() means the case conversion is applied only to the column headers.
You could also define a function:
upcase <- function(df) {
  names(df) <- toupper(names(df))
  df
}
library(dplyr)
mtcars %>% upcase %>% select(MPG)
