Modify multiple variable names [duplicate] - r

I want to add a suffix or prefix to most variable names in a data.frame, typically after they've all been transformed in some way and before performing a join. I don't have a way to do this without breaking up my piping.
For example, with this data:
library(dplyr)
set.seed(1)
dat14 <- data.frame(ID = 1:10, speed = runif(10), power = rpois(10, 1),
force = rexp(10), class = rep(c("a", "b"),5))
I want to get to this result (note variable names):
class speed_mean_2014 power_mean_2014 force_mean_2014
1 a 0.5572500 0.8 0.5519802
2 b 0.2850798 0.6 1.0888116
My current approach is:
means14 <- dat14 %>%
group_by(class) %>%
select(-ID) %>%
summarise_each(funs(mean(.)))
names(means14)[2:length(names(means14))] <- paste0(names(means14)[2:length(names(means14))], "_mean_2014")
Is there an alternative to that clunky last line that breaks up my pipes? I've looked at select() and rename() but don't want to explicitly specify each variable name, as I usually want to rename all except a single variable and might have a much wider data.frame than in this example.
I'm imagining a final piped command that approximates this made-up function:
appendname(cols = 2:n, str = "_mean_2014", placement = "suffix")
Which doesn't exist as far as I know.

You can pass functions to rename_at, so do
means14 <- dat14 %>%
group_by(class) %>%
select(-ID) %>%
summarise_all(funs(mean(.))) %>%
rename_at(vars(-class),function(x) paste0(x,"_2014"))

After additional experimenting since posting this question, I've found that the setNames function will work with the piping as it returns a data.frame:
dat14 %>%
group_by(class) %>%
select(-ID) %>%
summarise_each(funs(mean(.))) %>%
setNames(c(names(.)[1], paste0(names(.)[-1],"_mean_2014")))
class speed_mean_2014 power_mean_2014 force_mean_2014
1 a 0.5572500 0.8 0.5519802
2 b 0.2850798 0.6 1.0888116

This is a bit quicker, but not totally what you want:
dat14 %>%
group_by(class) %>%
select(-ID) %>%
summarise_each(funs(mean(.))) -> means14
names(means14)[-1] %<>% paste0("_mean_2014")
if you haven't used the %<>%-operator before definitely check this link out, its a super-useful tool.
you can also use it for recomputing or rounding some columns, like this df$meancolumn %<>% round() , and so on, it just comes up very often and just saves you a lot of writing

As of February 2017 you can do this with the dplyr command rename_(...).
In the case of this example you could do.
dat14 %>%
group_by(class) %>%
select(-ID) %>%
summarise_each(funs(mean(.))) %>%
rename_(names(.)[-1], paste0(names(.)[-1],"_mean_2014")))
This is rather similar to the answer with set_names but works with tibbles too!

This is more of a step back, but you might think of reshaping your data in order to apply the function to multiple years at the same time. This will preserve tidyness. If you're going to want to end up comparing different years, it might make sense to have the year be a separate variable in a dataframe, rather than storing the year in the names. You should be able to use summarise_ to get the mean_year behavior. See http://cran.r-project.org/web/packages/dplyr/vignettes/nse.html
library(dplyr)
library(tidyr)
set.seed(1)
dat14 <- data.frame(ID = 1:10, speed = runif(10), power = rpois(10, 1),
force = rexp(10), class = rep(c("a", "b"),5))
dat14 %>%
gather(variable, value, -ID, -class) %>%
mutate(year = 2014) %>%
group_by(class, year, variable)%>%
summarise(mean = mean(value))`

While Sam Firkes solution using setNames() ist certainly the only solution keeping an unbroken pipe, it will not work with the tbl objects from dplyr, since the column names are not accessible by methods from the usual base R naming functions. Here is a function that you can use within a pipe with tbl objects as well, thanks to this solution by hrbrmstr. It adds predefined prefixes and suffixes at the specified column indices. Default is all columns.
tbl.renamer <- function(tbl,prefix="x",suffix=NULL,index=seq_along(tbl_vars(tbl))){
newnames <- tbl_vars(tbl) # Get old variable names
names(newnames) <- newnames
names(newnames)[index] <- paste0(prefix,".",newnames,suffix)[index] # create a named vector for .dots
rename_(tbl,.dots=newnames) # rename the variables
}
Example usage (Assume auth_users beeing an tbl_sql object):
auth_user %>% tbl_vars
tbl.renamer(auth_user) %>% tbl_vars
auth_user %>% tbl.renamer %>% tbl_vars
auth_user %>% tbl.renamer(index = c(1,5)) %>% tbl_vars

Related

Is there a way to access the entire tibble/grouped tibble from the `mutate()` function acting on it?

For example,
df1 = expand.grid(x1=1:2,x2=1:2,x3=1:2,x4=1:2,x5=1:2,x6=1:2) %>%
mutate(
x7 = sample(1:2,64,T),
y1 = rnorm(64)
)
df2 = expand.grid(x1=1:2,x2=1:2,x3=1:2,x4=1:2,x5=1:2,x6=1:2) %>%
mutate(
x7 = sample(1:2,64,T),
y2 = rnorm(64)
)
myfunc <- function(data){
data %>%
mutate(key = paste(x1,x2,x3,x4,x5,x6)) %>%
pull(key)
}
joined_df = df1 %>%
mutate(y3 = runif(64)) %>%
mutate(key=myfunc([some sort of expression referencing df1])) %>%
inner_join(
df2 %>%
mutate(y4 = runif(64)) %>%
mutate(key=myfunc([some sort of expression referencing df2]),
by='key'
)
Essentially, I would like to avoid having to recreate the data frame from a function that looks like
myfunc_v2 <- function(data){
data %>%
mutate(key = paste(x1,x2,x3,x4,x5,x6))
}
Even though myfunc_v2() is arguably cleaner, the main reason for this is that I usually change the names of variables using transformation functions like rename_all() across sources that are formatted differently, but do not want to actually modify them, in the main copy, as I am keeping the column name formatting from one of the tibbles and later discarding the others.
The solution is straightforward.
When using the pipe operator %>%, which is how dplyr is usually used, you can specify where in the function it's acting on the argument should be used.
For a copy of the argument, all you need to do is to put (.) where you want the object, provided it's not inside of some anonymous function (e.g., with mutate_all(data, list(scaled=~scale(.), signed=sign(.)).
The solution would simply look like
joined_df = df1 %>%
mutate(y3 = runif(64)) %>%
mutate(key=myfunc((.)) %>%
inner_join(
df2 %>%
mutate(y4 = runif(64)) %>%
mutate(key=myfunc((.)),
by='key'
)

most elegant way to calculate rowSums of colums that start AND end with certain strings, using dplyr

I am working with a dataset of which I want to calculate rowSums of columns that start with a certain string and end with an other specified string, using dplyr (in my example: starts_with('c_') & ends_with('_f'))
My current code is as follows (and works fine):
df <- df %>% mutate(row.sum = rowSums(select(select(., starts_with('c_')), ends_with('_f'))))
However, as you can see, using the select() function within a select() function seems a bit messy. Is there a way to combine the starts_with and ends_with within just one select() function? Or do you have other ideas to make this line of code more elegant via using dplyr?
EDIT:
To make the example reproducible:
names <- c('c_first_f', 'c_second_o', 't_third_f', 'c_fourth_f')
values <- c(5, 3, 2, 5)
df <- t(values)
colnames(df) <- names
> df
c_first_f c_second_o t_third_f c_fourth_f
[1,] 5 3 2 5
Thus, here I want to sum the first and fourth column, making the summed value 10.
We could use select_at with matches
library(dplyr)
df %>% select_at(vars(matches("^c_.*_f$"))) %>% mutate(row.sum = rowSums(.))
and with base R :
df$row.sum <- rowSums(df[grep("^c_.*_f$", names(df))])
We can use tidyverse approaches
library(dplyr)
library(purrr)
df %>%
select_at(vars(matches("^c_.*_f$"))) %>%
mutate(rowSum = reduce(., `+`))
Or with new versions of tidyverse, select can take matches
df %>%
select(matches("^c_.*_f$")) %>%
mutate(rowSum = reduce(., `+`))

How to write a function that includes pipes using functions from the srvyr package?

I have a survey that I am trying to get to be grouped by years and the calculate totals for certain variables. I need to do this about 20 times with different variables so I am writing a function but I can't seem to get to work properly even though it works fine outside the function.
this works fine:
mepsdsgn %>% group_by(YEAR) %>% summarise(tot_pri = survey_total(TOTPRV)) %>% select(YEAR, tot_pri)
when I try a function:
total_calc <- function(x) {mepsdsgn %>% group_by(YEAR) %>% summarise(total = survey_total(x)) %>% select(YEAR, total)}
total_calc(TOTPRV)
I get this error: Error in stop_for_factor(x) : object 'TOTPRV' not found
Figured it out:
total_fun <- function(x) {
col = x
mepsdsgn %>% group_by(YEAR) %>% summarise(total = survey_total(!!sym(col), na.rm = TRUE)) %>% select(YEAR, total)
}
there are a couple of things I'd suggest doing, see below
# first try to make a working minimal example people can run in a new R session
library(magrittr)
library(dplyr)
dt <- data.frame(y=1:10, x=rep(letters[1:2], each=5))
# simple group and mean using the column names explicitly
dt %>% group_by(x) %>% summarise(mean(y))
# a bit of googling showed me you need to use group_by_at(vars("x")) to replicate
# using a string input
# in this function, add all arguments, so the data you use - dt & the column name - column.x
foo <- function(dt, column.x){
dt %>% group_by_at(vars(column.x)) %>% summarise(mean(y))
}
# when running a function, you need to supply the name of the column as a string, e.g. "x" not x
foo(dt, column.x="x")
I don't use dplyr, so there may be a better way

Can I create a data.frame in R from an existing data.frame by assigning a list of col.names?

I have a data.frame where I assign each column.name a vector of variables:
dat1 <- data.frame(a=1:5,b=1:5,c=1:5)
I want to create a new data.frame but instead of assigning each column individually, I want to assign them all at once. For example, if I wanted to rename them all:
dat.new <- data.frame(paste(names(dat1),'1',sep='') = dat1)
This obviously doens't work. Is there a way to make it work?
I understand I can just rename using names(), but the scenario where this actually seems useful is if combining multiple data sets that share the same col.names (and in which I don't want to simply rbind):
dat1 <- data.frame(a=1:5,b=1:5,c=1:5)
dat2 <- data.frame(a=6:10,b=6:10,c=6:10)
dat.new <- data.frame(paste(names(dat1),'1',sep='') = dat1, paste(names(dat1),'2',sep='') = dat2)
library(dplyr)
library(tidyr)
library(magrittr)
Ok, here's the first part:
dat2 =
dat1 %>%
setNames(names(.) %>%
paste0("1") )
Here's the second part. The reshaping is a bit complex but more flexible, especially if you have row id's already with different amounts of rows:
list(dat1, dat2) %>%
bind_rows(.id = "number") %>%
group_by(number) %>%
mutate(id = 1:n()) %>%
gather(variable, value, -number, -id) %>%
unite(new_variable, variable, number) %>%
spread(new_variable, value)

dplyr change many data types

I have a data.frame:
dat <- data.frame(fac1 = c(1, 2),
fac2 = c(4, 5),
fac3 = c(7, 8),
dbl1 = c('1', '2'),
dbl2 = c('4', '5'),
dbl3 = c('6', '7')
)
To change data types I can use something like
l1 <- c("fac1", "fac2", "fac3")
l2 <- c("dbl1", "dbl2", "dbl3")
dat[, l1] <- lapply(dat[, l1], factor)
dat[, l2] <- lapply(dat[, l2], as.numeric)
with dplyr
dat <- dat %>% mutate(
fac1 = factor(fac1), fac2 = factor(fac2), fac3 = factor(fac3),
dbl1 = as.numeric(dbl1), dbl2 = as.numeric(dbl2), dbl3 = as.numeric(dbl3)
)
is there a more elegant (shorter) way in dplyr?
thx
Christof
Edit (as of 2021-03)
As also pointed out in Eric's answer, mutate_[at|if|all] has been superseded by a combination of mutate() and across(). For reference, I will add the respective pendants to the examples in the original answer (see below):
# convert all factor to character
dat %>% mutate(across(where(is.factor), as.character))
# apply function (change encoding) to all character columns
dat %>% mutate(across(where(is.character),
function(x){iconv(x, to = "ASCII//TRANSLIT")}))
# subsitute all NA in numeric columns
dat %>% mutate(across(where(is.numeric), function(x) tidyr::replace_na(x, 0)))
Original answer
Since Nick's answer is deprecated by now and Rafael's comment is really useful, I want to add this as an Answer. If you want to change all factor columns to character use mutate_if:
dat %>% mutate_if(is.factor, as.character)
Also other functions are allowed. I for instance used iconv to change the encoding of all character columns:
dat %>% mutate_if(is.character, function(x){iconv(x, to = "ASCII//TRANSLIT")})
or to substitute all NA by 0 in numeric columns:
dat %>% mutate_if(is.numeric, function(x){ifelse(is.na(x), 0, x)})
You can use the standard evaluation version of mutate_each (which is mutate_each_) to change the column classes:
dat %>% mutate_each_(funs(factor), l1) %>% mutate_each_(funs(as.numeric), l2)
EDIT - The syntax of this answer has been deprecated, loki's updated answer is more appropriate.
ORIGINAL-
From the bottom of the ?mutate_each (at least in dplyr 0.5) it looks like that function, as in #docendo discimus's answer, will be deprecated and replaced with more flexible alternatives mutate_if, mutate_all, and mutate_at. The one most similar to what #hadley mentions in his comment is probably using mutate_at. Note the order of the arguments is reversed, compared to mutate_each, and vars() uses select() like semantics, which I interpret to mean the ?select_helpers functions.
dat %>% mutate_at(vars(starts_with("fac")),funs(factor)) %>%
mutate_at(vars(starts_with("dbl")),funs(as.numeric))
But mutate_at can take column numbers instead of a vars() argument, and after reading through this page, and looking at the alternatives, I ended up using mutate_at but with grep to capture many different kinds of column names at once (unless you always have such obvious column names!)
dat %>% mutate_at(grep("^(fac|fctr|fckr)",colnames(.)),funs(factor)) %>%
mutate_at(grep("^(dbl|num|qty)",colnames(.)),funs(as.numeric))
I was pretty excited about figuring out mutate_at + grep, because now one line can work on lots of columns.
EDIT - now I see matches() in among the select_helpers, which handles regex, so now I like this.
dat %>% mutate_at(vars(matches("fac|fctr|fckr")),funs(factor)) %>%
mutate_at(vars(matches("dbl|num|qty")),funs(as.numeric))
Another generally-related comment - if you have all your date columns with matchable names, and consistent formats, this is powerful. In my case, this turns all my YYYYMMDD columns, which were read as numbers, into dates.
mutate_at(vars(matches("_DT$")),funs(as.Date(as.character(.),format="%Y%m%d")))
Dplyr across function has superseded _if, _at, and _all. See vignette("colwise").
dat %>%
mutate(across(all_of(l1), as.factor),
across(all_of(l2), as.numeric))
It's a one-liner with mutate_at:
dat %>% mutate_at("l1", factor) %>% mutate_at("l2", as.numeric)
A more general way of achieving column type transformation is as follows:
If you want to transform all your factor columns to character columns, e.g., this can be done using one pipe:
df %>% mutate_each_( funs(as.character(.)), names( .[,sapply(., is.factor)] ))
Or mayby even more simple with convert from hablar:
library(hablar)
dat %>%
convert(fct(fac1, fac2, fac3),
num(dbl1, dbl2, dbl3))
or combines with tidyselect:
dat %>%
convert(fct(contains("fac")),
num(contains("dbl")))
For future readers, if you are ok with dplyr guessing the column types, you can convert the col types of an entire df as if you were originally reading it in with readr and col_guess() with
library(tidyverse)
df %>% type_convert()
Try this
df[,1:11] <- sapply(df[,1:11], as.character)

Resources