For example,
df1 = expand.grid(x1=1:2,x2=1:2,x3=1:2,x4=1:2,x5=1:2,x6=1:2) %>%
mutate(
x7 = sample(1:2,64,T),
y1 = rnorm(64)
)
df2 = expand.grid(x1=1:2,x2=1:2,x3=1:2,x4=1:2,x5=1:2,x6=1:2) %>%
mutate(
x7 = sample(1:2,64,T),
y2 = rnorm(64)
)
myfunc <- function(data){
data %>%
mutate(key = paste(x1,x2,x3,x4,x5,x6)) %>%
pull(key)
}
joined_df = df1 %>%
mutate(y3 = runif(64)) %>%
mutate(key=myfunc([some sort of expression referencing df1])) %>%
inner_join(
df2 %>%
mutate(y4 = runif(64)) %>%
mutate(key=myfunc([some sort of expression referencing df2]),
by='key'
)
Essentially, I would like to avoid having to recreate the data frame from a function that looks like
myfunc_v2 <- function(data){
data %>%
mutate(key = paste(x1,x2,x3,x4,x5,x6))
}
Even though myfunc_v2() is arguably cleaner, the main reason for this is that I usually change the names of variables using transformation functions like rename_all() across sources that are formatted differently, but do not want to actually modify them, in the main copy, as I am keeping the column name formatting from one of the tibbles and later discarding the others.
The solution is straightforward.
When using the pipe operator %>%, which is how dplyr is usually used, you can specify where in the function it's acting on the argument should be used.
For a copy of the argument, all you need to do is to put (.) where you want the object, provided it's not inside of some anonymous function (e.g., with mutate_all(data, list(scaled=~scale(.), signed=sign(.)).
The solution would simply look like
joined_df = df1 %>%
mutate(y3 = runif(64)) %>%
mutate(key=myfunc((.)) %>%
inner_join(
df2 %>%
mutate(y4 = runif(64)) %>%
mutate(key=myfunc((.)),
by='key'
)
I am working with a dataset of which I want to calculate rowSums of columns that start with a certain string and end with an other specified string, using dplyr (in my example: starts_with('c_') & ends_with('_f'))
My current code is as follows (and works fine):
df <- df %>% mutate(row.sum = rowSums(select(select(., starts_with('c_')), ends_with('_f'))))
However, as you can see, using the select() function within a select() function seems a bit messy. Is there a way to combine the starts_with and ends_with within just one select() function? Or do you have other ideas to make this line of code more elegant via using dplyr?
EDIT:
To make the example reproducible:
names <- c('c_first_f', 'c_second_o', 't_third_f', 'c_fourth_f')
values <- c(5, 3, 2, 5)
df <- t(values)
colnames(df) <- names
> df
c_first_f c_second_o t_third_f c_fourth_f
[1,] 5 3 2 5
Thus, here I want to sum the first and fourth column, making the summed value 10.
We could use select_at with matches
library(dplyr)
df %>% select_at(vars(matches("^c_.*_f$"))) %>% mutate(row.sum = rowSums(.))
and with base R :
df$row.sum <- rowSums(df[grep("^c_.*_f$", names(df))])
We can use tidyverse approaches
library(dplyr)
library(purrr)
df %>%
select_at(vars(matches("^c_.*_f$"))) %>%
mutate(rowSum = reduce(., `+`))
Or with new versions of tidyverse, select can take matches
df %>%
select(matches("^c_.*_f$")) %>%
mutate(rowSum = reduce(., `+`))
I have a survey that I am trying to get to be grouped by years and the calculate totals for certain variables. I need to do this about 20 times with different variables so I am writing a function but I can't seem to get to work properly even though it works fine outside the function.
this works fine:
mepsdsgn %>% group_by(YEAR) %>% summarise(tot_pri = survey_total(TOTPRV)) %>% select(YEAR, tot_pri)
when I try a function:
total_calc <- function(x) {mepsdsgn %>% group_by(YEAR) %>% summarise(total = survey_total(x)) %>% select(YEAR, total)}
total_calc(TOTPRV)
I get this error: Error in stop_for_factor(x) : object 'TOTPRV' not found
Figured it out:
total_fun <- function(x) {
col = x
mepsdsgn %>% group_by(YEAR) %>% summarise(total = survey_total(!!sym(col), na.rm = TRUE)) %>% select(YEAR, total)
}
there are a couple of things I'd suggest doing, see below
# first try to make a working minimal example people can run in a new R session
library(magrittr)
library(dplyr)
dt <- data.frame(y=1:10, x=rep(letters[1:2], each=5))
# simple group and mean using the column names explicitly
dt %>% group_by(x) %>% summarise(mean(y))
# a bit of googling showed me you need to use group_by_at(vars("x")) to replicate
# using a string input
# in this function, add all arguments, so the data you use - dt & the column name - column.x
foo <- function(dt, column.x){
dt %>% group_by_at(vars(column.x)) %>% summarise(mean(y))
}
# when running a function, you need to supply the name of the column as a string, e.g. "x" not x
foo(dt, column.x="x")
I don't use dplyr, so there may be a better way
I have a data.frame where I assign each column.name a vector of variables:
dat1 <- data.frame(a=1:5,b=1:5,c=1:5)
I want to create a new data.frame but instead of assigning each column individually, I want to assign them all at once. For example, if I wanted to rename them all:
dat.new <- data.frame(paste(names(dat1),'1',sep='') = dat1)
This obviously doens't work. Is there a way to make it work?
I understand I can just rename using names(), but the scenario where this actually seems useful is if combining multiple data sets that share the same col.names (and in which I don't want to simply rbind):
dat1 <- data.frame(a=1:5,b=1:5,c=1:5)
dat2 <- data.frame(a=6:10,b=6:10,c=6:10)
dat.new <- data.frame(paste(names(dat1),'1',sep='') = dat1, paste(names(dat1),'2',sep='') = dat2)
library(dplyr)
library(tidyr)
library(magrittr)
Ok, here's the first part:
dat2 =
dat1 %>%
setNames(names(.) %>%
paste0("1") )
Here's the second part. The reshaping is a bit complex but more flexible, especially if you have row id's already with different amounts of rows:
list(dat1, dat2) %>%
bind_rows(.id = "number") %>%
group_by(number) %>%
mutate(id = 1:n()) %>%
gather(variable, value, -number, -id) %>%
unite(new_variable, variable, number) %>%
spread(new_variable, value)
I have a data.frame:
dat <- data.frame(fac1 = c(1, 2),
fac2 = c(4, 5),
fac3 = c(7, 8),
dbl1 = c('1', '2'),
dbl2 = c('4', '5'),
dbl3 = c('6', '7')
)
To change data types I can use something like
l1 <- c("fac1", "fac2", "fac3")
l2 <- c("dbl1", "dbl2", "dbl3")
dat[, l1] <- lapply(dat[, l1], factor)
dat[, l2] <- lapply(dat[, l2], as.numeric)
with dplyr
dat <- dat %>% mutate(
fac1 = factor(fac1), fac2 = factor(fac2), fac3 = factor(fac3),
dbl1 = as.numeric(dbl1), dbl2 = as.numeric(dbl2), dbl3 = as.numeric(dbl3)
)
is there a more elegant (shorter) way in dplyr?
thx
Christof
Edit (as of 2021-03)
As also pointed out in Eric's answer, mutate_[at|if|all] has been superseded by a combination of mutate() and across(). For reference, I will add the respective pendants to the examples in the original answer (see below):
# convert all factor to character
dat %>% mutate(across(where(is.factor), as.character))
# apply function (change encoding) to all character columns
dat %>% mutate(across(where(is.character),
function(x){iconv(x, to = "ASCII//TRANSLIT")}))
# subsitute all NA in numeric columns
dat %>% mutate(across(where(is.numeric), function(x) tidyr::replace_na(x, 0)))
Original answer
Since Nick's answer is deprecated by now and Rafael's comment is really useful, I want to add this as an Answer. If you want to change all factor columns to character use mutate_if:
dat %>% mutate_if(is.factor, as.character)
Also other functions are allowed. I for instance used iconv to change the encoding of all character columns:
dat %>% mutate_if(is.character, function(x){iconv(x, to = "ASCII//TRANSLIT")})
or to substitute all NA by 0 in numeric columns:
dat %>% mutate_if(is.numeric, function(x){ifelse(is.na(x), 0, x)})
You can use the standard evaluation version of mutate_each (which is mutate_each_) to change the column classes:
dat %>% mutate_each_(funs(factor), l1) %>% mutate_each_(funs(as.numeric), l2)
EDIT - The syntax of this answer has been deprecated, loki's updated answer is more appropriate.
ORIGINAL-
From the bottom of the ?mutate_each (at least in dplyr 0.5) it looks like that function, as in #docendo discimus's answer, will be deprecated and replaced with more flexible alternatives mutate_if, mutate_all, and mutate_at. The one most similar to what #hadley mentions in his comment is probably using mutate_at. Note the order of the arguments is reversed, compared to mutate_each, and vars() uses select() like semantics, which I interpret to mean the ?select_helpers functions.
dat %>% mutate_at(vars(starts_with("fac")),funs(factor)) %>%
mutate_at(vars(starts_with("dbl")),funs(as.numeric))
But mutate_at can take column numbers instead of a vars() argument, and after reading through this page, and looking at the alternatives, I ended up using mutate_at but with grep to capture many different kinds of column names at once (unless you always have such obvious column names!)
dat %>% mutate_at(grep("^(fac|fctr|fckr)",colnames(.)),funs(factor)) %>%
mutate_at(grep("^(dbl|num|qty)",colnames(.)),funs(as.numeric))
I was pretty excited about figuring out mutate_at + grep, because now one line can work on lots of columns.
EDIT - now I see matches() in among the select_helpers, which handles regex, so now I like this.
dat %>% mutate_at(vars(matches("fac|fctr|fckr")),funs(factor)) %>%
mutate_at(vars(matches("dbl|num|qty")),funs(as.numeric))
Another generally-related comment - if you have all your date columns with matchable names, and consistent formats, this is powerful. In my case, this turns all my YYYYMMDD columns, which were read as numbers, into dates.
mutate_at(vars(matches("_DT$")),funs(as.Date(as.character(.),format="%Y%m%d")))
Dplyr across function has superseded _if, _at, and _all. See vignette("colwise").
dat %>%
mutate(across(all_of(l1), as.factor),
across(all_of(l2), as.numeric))
It's a one-liner with mutate_at:
dat %>% mutate_at("l1", factor) %>% mutate_at("l2", as.numeric)
A more general way of achieving column type transformation is as follows:
If you want to transform all your factor columns to character columns, e.g., this can be done using one pipe:
df %>% mutate_each_( funs(as.character(.)), names( .[,sapply(., is.factor)] ))
Or mayby even more simple with convert from hablar:
library(hablar)
dat %>%
convert(fct(fac1, fac2, fac3),
num(dbl1, dbl2, dbl3))
or combines with tidyselect:
dat %>%
convert(fct(contains("fac")),
num(contains("dbl")))
For future readers, if you are ok with dplyr guessing the column types, you can convert the col types of an entire df as if you were originally reading it in with readr and col_guess() with
library(tidyverse)
df %>% type_convert()
Try this
df[,1:11] <- sapply(df[,1:11], as.character)