dplyr change many data types - r

I have a data.frame:
dat <- data.frame(fac1 = c(1, 2),
fac2 = c(4, 5),
fac3 = c(7, 8),
dbl1 = c('1', '2'),
dbl2 = c('4', '5'),
dbl3 = c('6', '7')
)
To change data types I can use something like
l1 <- c("fac1", "fac2", "fac3")
l2 <- c("dbl1", "dbl2", "dbl3")
dat[, l1] <- lapply(dat[, l1], factor)
dat[, l2] <- lapply(dat[, l2], as.numeric)
with dplyr
dat <- dat %>% mutate(
fac1 = factor(fac1), fac2 = factor(fac2), fac3 = factor(fac3),
dbl1 = as.numeric(dbl1), dbl2 = as.numeric(dbl2), dbl3 = as.numeric(dbl3)
)
is there a more elegant (shorter) way in dplyr?
thx
Christof

Edit (as of 2021-03)
As also pointed out in Eric's answer, mutate_[at|if|all] has been superseded by a combination of mutate() and across(). For reference, I will add the respective counterparts to the examples in the original answer (see below):
# convert all factor to character
dat %>% mutate(across(where(is.factor), as.character))
# apply function (change encoding) to all character columns
dat %>% mutate(across(where(is.character),
function(x){iconv(x, to = "ASCII//TRANSLIT")}))
# substitute all NA in numeric columns
dat %>% mutate(across(where(is.numeric), function(x) tidyr::replace_na(x, 0)))
Original answer
Since Nick's answer is deprecated by now and Rafael's comment is really useful, I want to add this as an Answer. If you want to change all factor columns to character use mutate_if:
dat %>% mutate_if(is.factor, as.character)
Also other functions are allowed. I for instance used iconv to change the encoding of all character columns:
dat %>% mutate_if(is.character, function(x){iconv(x, to = "ASCII//TRANSLIT")})
or to substitute all NA by 0 in numeric columns:
dat %>% mutate_if(is.numeric, function(x){ifelse(is.na(x), 0, x)})

You can use the standard evaluation version of mutate_each (which is mutate_each_) to change the column classes:
dat %>% mutate_each_(funs(factor), l1) %>% mutate_each_(funs(as.numeric), l2)

EDIT - The syntax of this answer has been deprecated, loki's updated answer is more appropriate.
ORIGINAL-
From the bottom of ?mutate_each (at least in dplyr 0.5), it looks like that function, as used in @docendo discimus's answer, will be deprecated and replaced with the more flexible alternatives mutate_if, mutate_all, and mutate_at. The one most similar to what @hadley mentions in his comment is probably mutate_at. Note that the order of the arguments is reversed compared to mutate_each, and vars() uses select()-like semantics, which I interpret to mean the ?select_helpers functions.
dat %>% mutate_at(vars(starts_with("fac")),funs(factor)) %>%
mutate_at(vars(starts_with("dbl")),funs(as.numeric))
But mutate_at can take column numbers instead of a vars() argument, and after reading through this page, and looking at the alternatives, I ended up using mutate_at but with grep to capture many different kinds of column names at once (unless you always have such obvious column names!)
dat %>% mutate_at(grep("^(fac|fctr|fckr)",colnames(.)),funs(factor)) %>%
mutate_at(grep("^(dbl|num|qty)",colnames(.)),funs(as.numeric))
I was pretty excited about figuring out mutate_at + grep, because now one line can work on lots of columns.
EDIT - now I see matches() in among the select_helpers, which handles regex, so now I like this.
dat %>% mutate_at(vars(matches("fac|fctr|fckr")),funs(factor)) %>%
mutate_at(vars(matches("dbl|num|qty")),funs(as.numeric))
Another generally-related comment - if you have all your date columns with matchable names, and consistent formats, this is powerful. In my case, this turns all my YYYYMMDD columns, which were read as numbers, into dates.
mutate_at(vars(matches("_DT$")),funs(as.Date(as.character(.),format="%Y%m%d")))

dplyr's across() function has superseded the _if, _at, and _all variants. See vignette("colwise").
dat %>%
mutate(across(all_of(l1), as.factor),
across(all_of(l2), as.numeric))
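As a self-contained sketch of the across() approach, the snippet below rebuilds the question's data and checks the resulting column classes. The stringsAsFactors = FALSE argument and the name converted are additions for illustration: on R versions before 4.0 the dbl* columns would otherwise be created as factors, and as.numeric() on a factor returns the level codes rather than the values.

```r
library(dplyr)

dat <- data.frame(
  fac1 = c(1, 2), fac2 = c(4, 5), fac3 = c(7, 8),
  dbl1 = c("1", "2"), dbl2 = c("4", "5"), dbl3 = c("6", "7"),
  stringsAsFactors = FALSE  # assumption: keep dbl* as character on R < 4.0
)

l1 <- c("fac1", "fac2", "fac3")
l2 <- c("dbl1", "dbl2", "dbl3")

# all_of() selects the columns named in a character vector
converted <- dat %>%
  mutate(across(all_of(l1), as.factor),
         across(all_of(l2), as.numeric))

# Inspect the resulting column classes
sapply(converted, class)
```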

It's a one-liner with mutate_at:
dat %>% mutate_at(l1, factor) %>% mutate_at(l2, as.numeric)

A more general way of achieving column type transformation is as follows:
If you want to transform all your factor columns to character columns, e.g., this can be done using one pipe:
df %>% mutate_each_( funs(as.character(.)), names( .[,sapply(., is.factor)] ))

Or maybe even simpler with convert from hablar:
library(hablar)
dat %>%
convert(fct(fac1, fac2, fac3),
num(dbl1, dbl2, dbl3))
or combined with tidyselect helpers:
dat %>%
convert(fct(contains("fac")),
num(contains("dbl")))

For future readers: if you are OK with readr guessing the column types, you can convert the character columns of an entire df as if you were originally reading it in with readr and col_guess(), using
library(tidyverse)
df %>% type_convert()
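A minimal sketch of what type_convert() does, assuming a small made-up data frame whose columns are all character (the column names and the name guessed are hypothetical). Only character columns are re-parsed; columns that already have another type are left alone.

```r
library(readr)

# Hypothetical data: everything stored as text
df <- data.frame(id = c("1", "2"), score = c("4.5", "5"), label = c("a", "b"),
                 stringsAsFactors = FALSE)

# Re-parse the character columns with readr's type guessing
guessed <- type_convert(df)

sapply(guessed, class)
```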

Try this
df[,1:11] <- lapply(df[,1:11], as.character)

Related

most elegant way to calculate rowSums of columns that start AND end with certain strings, using dplyr

I am working with a dataset of which I want to calculate rowSums of columns that start with a certain string and end with an other specified string, using dplyr (in my example: starts_with('c_') & ends_with('_f'))
My current code is as follows (and works fine):
df <- df %>% mutate(row.sum = rowSums(select(select(., starts_with('c_')), ends_with('_f'))))
However, as you can see, using the select() function within a select() function seems a bit messy. Is there a way to combine the starts_with and ends_with within just one select() function? Or do you have other ideas to make this line of code more elegant via using dplyr?
EDIT:
To make the example reproducible:
names <- c('c_first_f', 'c_second_o', 't_third_f', 'c_fourth_f')
values <- c(5, 3, 2, 5)
df <- t(values)
colnames(df) <- names
> df
c_first_f c_second_o t_third_f c_fourth_f
[1,] 5 3 2 5
Thus, here I want to sum the first and fourth column, making the summed value 10.
We could use select_at with matches
library(dplyr)
df %>% select_at(vars(matches("^c_.*_f$"))) %>% mutate(row.sum = rowSums(.))
and with base R :
df$row.sum <- rowSums(df[grep("^c_.*_f$", names(df))])
We can use tidyverse approaches
library(dplyr)
library(purrr)
df %>%
select_at(vars(matches("^c_.*_f$"))) %>%
mutate(rowSum = reduce(., `+`))
Or with new versions of tidyverse, select can take matches
df %>%
select(matches("^c_.*_f$")) %>%
mutate(rowSum = reduce(., `+`))
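With recent versions (dplyr 1.0 or later with tidyselect 1.0 or later, as far as I know), selection helpers can be combined with & directly, which answers the original question without a regex. A small sketch, assuming the example data is stored as a data frame rather than the matrix shown in the question:

```r
library(dplyr)

# The question's example values, as a one-row data frame
df <- data.frame(c_first_f = 5, c_second_o = 3, t_third_f = 2, c_fourth_f = 5)

# across() returns the selected columns; & intersects the two helpers
result <- df %>%
  mutate(row.sum = rowSums(across(starts_with("c_") & ends_with("_f"))))

result
```

Unlike the select()-based answers, this keeps all original columns and only appends row.sum.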

Add multiple columns with mutate using column-based conditions, without using explicit column name + POSIX

I have a dataframe of data: 1 column is POSIX, the rest is data.
I need to remove selectively some data from a group of columns and add these "new" columns to the original dataframe.
I can "easily" do it in base R (I am an old-style user). I'd like to do it more compactly with mutate_at or with other function... although I am having several issues.
A solution homemade with base R could be
df <- data.frame("date" = seq.POSIXt(as.POSIXct(format(Sys.time(),"%F %T"),tz="UTC"),length.out=20,by="min"), "a.1" = rnorm(20,0,3), "a.2" = rnorm(20,1,2), "b.1"= rnorm(20,1,4), "b.2"= rnorm(20,3,4))
df1 <- lapply(df[,grep("^a",names(df))], function(x) replace(x, which(x > 0 & x < 0.2), NA))
df1 <- data.frame(matrix(unlist(df1), nrow = nrow(df), byrow = F)) ## convert to data.frame
names(df1) <- grep("^a",names(df),value=T) ## rename columns
df1 <- cbind.data.frame("date"=df$date, df1) ## add date
Can anyone help me in setting up something working with dplyr + transmute?
So far I come up with something like:
df %>%
select(starts_with("a.")) %>%
transmute(
case_when(
.>0.2 ~ NA,
)
) %>%
cbind.data.frame(df)
But I am quite stuck, since I can't combine transmute with case_when: all examples that I found use explicitly the column names in case_when, but I can't, since I won't know the names of the column in advance. I will only know the initial of the columns that I need to transmute.
Thanks,
Alex
We can use transmute_at if the intention is to return only those columns specified in the vars
library(dplyr)
df %>%
transmute_at(vars(starts_with('a')), ~ case_when(. > 0.2~ NA_real_, TRUE~ .)) %>%
bind_cols(df %>% select(date), .)
If we need all the columns to return, but only change the columns of interest in vars, then we need mutate_at instead of transmute_at
df %>%
mutate_at(vars(starts_with('a')), ~ case_when(. > 0.2~ NA_real_, TRUE~ .)) %>%
select(date, starts_with('a')) # only need if we are selecting a subset of columns
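The same idea can be sketched with across(), which supersedes mutate_at. Note that this version uses the condition from the base R code in the question (values strictly between 0 and 0.2 become NA), not the `> 0.2` condition from the dplyr attempt; the small fixed data set and the name cleaned are made up for illustration.

```r
library(dplyr)

# Small deterministic stand-in for the question's random data
df <- data.frame(
  date = as.POSIXct("2020-01-01 00:00:00", tz = "UTC") + 60 * 0:3,
  a.1 = c(-1, 0.1, 0.5, 0.15),
  a.2 = c(0.05, 2, -0.3, 0.19),
  b.1 = c(0.1, 0.1, 0.1, 0.1)
)

# Replace values in (0, 0.2) with NA, but only in columns starting with "a."
cleaned <- df %>%
  mutate(across(starts_with("a."), ~ replace(.x, .x > 0 & .x < 0.2, NA)))

cleaned
```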

Modify multiple variable names [duplicate]

I want to add a suffix or prefix to most variable names in a data.frame, typically after they've all been transformed in some way and before performing a join. I don't have a way to do this without breaking up my piping.
For example, with this data:
library(dplyr)
set.seed(1)
dat14 <- data.frame(ID = 1:10, speed = runif(10), power = rpois(10, 1),
force = rexp(10), class = rep(c("a", "b"),5))
I want to get to this result (note variable names):
class speed_mean_2014 power_mean_2014 force_mean_2014
1 a 0.5572500 0.8 0.5519802
2 b 0.2850798 0.6 1.0888116
My current approach is:
means14 <- dat14 %>%
group_by(class) %>%
select(-ID) %>%
summarise_each(funs(mean(.)))
names(means14)[2:length(names(means14))] <- paste0(names(means14)[2:length(names(means14))], "_mean_2014")
Is there an alternative to that clunky last line that breaks up my pipes? I've looked at select() and rename() but don't want to explicitly specify each variable name, as I usually want to rename all except a single variable and might have a much wider data.frame than in this example.
I'm imagining a final piped command that approximates this made-up function:
appendname(cols = 2:n, str = "_mean_2014", placement = "suffix")
Which doesn't exist as far as I know.
You can pass functions to rename_at, so do
means14 <- dat14 %>%
group_by(class) %>%
select(-ID) %>%
summarise_all(funs(mean(.))) %>%
rename_at(vars(-class), function(x) paste0(x, "_mean_2014"))
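In dplyr 1.0 and later, rename_at and summarise_all are superseded; a rough equivalent, assuming the same dat14 as in the question, uses summarise(across()) and rename_with():

```r
library(dplyr)

set.seed(1)
dat14 <- data.frame(ID = 1:10, speed = runif(10), power = rpois(10, 1),
                    force = rexp(10), class = rep(c("a", "b"), 5))

means14 <- dat14 %>%
  select(-ID) %>%
  group_by(class) %>%
  # everything() skips the grouping column inside across()
  summarise(across(everything(), mean)) %>%
  # add the suffix to every column except class
  rename_with(~ paste0(.x, "_mean_2014"), .cols = -class)

names(means14)
```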
After additional experimenting since posting this question, I've found that the setNames function will work with the piping as it returns a data.frame:
dat14 %>%
group_by(class) %>%
select(-ID) %>%
summarise_each(funs(mean(.))) %>%
setNames(c(names(.)[1], paste0(names(.)[-1],"_mean_2014")))
class speed_mean_2014 power_mean_2014 force_mean_2014
1 a 0.5572500 0.8 0.5519802
2 b 0.2850798 0.6 1.0888116
This is a bit quicker, but not totally what you want:
dat14 %>%
group_by(class) %>%
select(-ID) %>%
summarise_each(funs(mean(.))) -> means14
names(means14)[-1] %<>% paste0("_mean_2014")
If you haven't used the %<>% operator (from magrittr) before, it is well worth a look; it's a very useful tool.
You can also use it for recomputing or rounding columns in place, like df$meancolumn %<>% round(). It comes up very often and saves a lot of typing.
As of February 2017 you can do this with the dplyr command rename_(...).
In the case of this example you could do.
dat14 %>%
group_by(class) %>%
select(-ID) %>%
summarise_each(funs(mean(.))) %>%
rename_(.dots = setNames(names(.)[-1], paste0(names(.)[-1], "_mean_2014")))
This is rather similar to the answer with setNames but works with tibbles too!
This is more of a step back, but you might think of reshaping your data in order to apply the function to multiple years at the same time. This will preserve tidyness. If you're going to want to end up comparing different years, it might make sense to have the year be a separate variable in a dataframe, rather than storing the year in the names. You should be able to use summarise_ to get the mean_year behavior. See http://cran.r-project.org/web/packages/dplyr/vignettes/nse.html
library(dplyr)
library(tidyr)
set.seed(1)
dat14 <- data.frame(ID = 1:10, speed = runif(10), power = rpois(10, 1),
force = rexp(10), class = rep(c("a", "b"),5))
dat14 %>%
gather(variable, value, -ID, -class) %>%
mutate(year = 2014) %>%
group_by(class, year, variable)%>%
summarise(mean = mean(value))
While Sam Firke's solution using setNames() is certainly the only solution keeping an unbroken pipe, it will not work with the tbl objects from dplyr, since the column names are not accessible by the usual base R naming functions. Here is a function that you can use within a pipe with tbl objects as well, thanks to this solution by hrbrmstr. It adds predefined prefixes and suffixes at the specified column indices. The default is all columns.
tbl.renamer <- function(tbl,prefix="x",suffix=NULL,index=seq_along(tbl_vars(tbl))){
newnames <- tbl_vars(tbl) # Get old variable names
names(newnames) <- newnames
names(newnames)[index] <- paste0(prefix,".",newnames,suffix)[index] # create a named vector for .dots
rename_(tbl,.dots=newnames) # rename the variables
}
Example usage (assume auth_user is a tbl_sql object):
auth_user %>% tbl_vars
tbl.renamer(auth_user) %>% tbl_vars
auth_user %>% tbl.renamer %>% tbl_vars
auth_user %>% tbl.renamer(index = c(1,5)) %>% tbl_vars

Correct syntax for mutate_if

I would like to replace NA values with zeros via mutate_if in dplyr. The syntax below:
set.seed(1)
mtcars[sample(1:dim(mtcars)[1], 5),
sample(1:dim(mtcars)[2], 5)] <- NA
require(dplyr)
mtcars %>%
mutate_if(is.na,0)
mtcars %>%
mutate_if(is.na, funs(. = 0))
Returns error:
Error in vapply(tbl, p, logical(1), ...) : values must be length 1,
but FUN(X[[1]]) result is length 32
What's the correct syntax for this operation?
I learned this trick from the purrr tutorial, and it also works in dplyr.
There are two ways to solve this problem:
First, define custom functions outside the pipe, and use it in mutate_if():
any_column_NA <- function(x){
any(is.na(x))
}
replace_NA_0 <- function(x){
if_else(is.na(x),0,x)
}
mtcars %>% mutate_if(any_column_NA,replace_NA_0)
Second, use the combination of ~ with . or .x (.x can be replaced with ., but not with any other character or symbol):
mtcars %>% mutate_if(~ any(is.na(.x)),~ if_else(is.na(.x),0,.x))
#This also works
mtcars %>% mutate_if(~ any(is.na(.)),~ if_else(is.na(.),0,.))
In your case, you can also use mutate_all():
mtcars %>% mutate_all(~ if_else(is.na(.x),0,.x))
Using ~, we can define an anonymous function, while .x or . stands for the variable. In mutate_if() case, . or .x is each column.
The "if" in mutate_if refers to choosing columns, not rows. Eg mutate_if(data, is.numeric, ...) means to carry out a transformation on all numeric columns in your dataset.
If you want to replace all NAs with zeros in numeric columns:
data %>% mutate_if(is.numeric, funs(ifelse(is.na(.), 0, .)))
mtcars %>% mutate_if(is.numeric, tidyr::replace_na, 0)
or, with the more recent syntax
mtcars %>% mutate(across(where(is.numeric),
~ tidyr::replace_na(.x, 0)))
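To make the point about the predicate concrete, here is a small sketch on a made-up data frame (the names df and filled are for illustration only). A predicate such as is.numeric returns a single TRUE/FALSE per column, whereas is.na returns one value per element, which is what triggers the vapply error in the question.

```r
library(dplyr)

df <- data.frame(x = c(1, NA, 3), y = c("a", NA, "c"),
                 stringsAsFactors = FALSE)

# A valid column predicate: one logical value per column
sapply(df, is.numeric)

# is.na() is element-wise, so it cannot be used to choose columns
length(is.na(df$x))

# Replace NA with 0 in the numeric columns only
filled <- df %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.x), 0, .x)))

filled
```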
We can use set from data.table
library(data.table)
setDT(mtcars)
for(j in seq_along(mtcars)){
set(mtcars, i= which(is.na(mtcars[[j]])), j = j, value = 0)
}
I always struggle with the replace_na function (from tidyr).
mtcars %>% replace(is.na(.), 0)
This works for me for what you are trying to do.

Dplyr write a function with column names as inputs

I'm writing a function that I'm going to use on multiple columns in dplyr, but I'm having trouble passing column names as inputs to functions for dplyr.
Here's an example of what I want to do:
df<-tbl_df(data.frame(group=rep(c("A", "B"), each=3), var1=sample(1:100, 6), var2=sample(1:100, 6)))
example<-function(colname){
df %>%
group_by(group)%>%
summarize(output=mean(sqrt(colname)))%>%
select(output)
}
example("var1")
Output should look like
df %>%
group_by(group)%>%
summarize(output=mean(sqrt(var1)))%>%
select(output)
I've found a few similar questions, but nothing that I could directly apply to my problem, so any help is appreciated. I've tried some solutions involving eval, but I honestly don't know what exactly I'm supposed to be passing to eval.
Is this what you expected?
df<-tbl_df(data.frame(group=rep(c("A", "B"), each=3), var1=sample(1:100, 6), var2=sample(1:100, 6)))
example<-function(colname){
df %>%
group_by(group)%>%
summarize(output=mean(sqrt(colname)))%>%
select(output)
}
example( quote(var1) )
#-----
Source: local data frame [2 x 1]
output
1 7.185935
2 8.090866
The accepted answer does not work anymore in R 3.6 / dplyr 0.8.
As suggested in another answer, one can use !!as.name()
This works for me:
df<-tbl_df(data.frame(group=rep(c("A", "B"), each=3), var1=sample(1:100, 6), var2=sample(1:100, 6)))
example<-function(colname){
df %>%
group_by(group)%>%
summarize(output=mean(sqrt(!!as.name(colname))))%>%
select(output)
}
example( quote(var1) )
If one additionally wants to have the column names to assign to in a mutate, then the easiest is to use the assignment :=. For example to replace colname with its square root.
example_mutate<-function(colname){
df %>%
mutate(!!colname := sqrt(!!as.name(colname)))
}
example_mutate( quote(var1) )
quote() can of course be replaced with quotation marks "".
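For newer versions (rlang 0.4 or later with a matching dplyr, as far as I know), two common alternatives are the .data pronoun for column names passed as strings and the embrace operator {{ }} for bare column names. The sketch below uses a small fixed data set, and the function names example_chr and example_sym are made up for illustration; unlike the function in the question, both take the data frame as an argument and keep the group column.

```r
library(dplyr)

df <- data.frame(group = rep(c("A", "B"), each = 3),
                 var1 = c(1, 4, 9, 16, 25, 36),
                 var2 = c(4, 4, 4, 9, 9, 9))

# Column passed as a string: index the .data pronoun
example_chr <- function(data, colname) {
  data %>%
    group_by(group) %>%
    summarize(output = mean(sqrt(.data[[colname]])))
}

# Column passed as a bare name: embrace it with {{ }}
example_sym <- function(data, col) {
  data %>%
    group_by(group) %>%
    summarize(output = mean(sqrt({{ col }})))
}

res_chr <- example_chr(df, "var1")
res_sym <- example_sym(df, var1)
```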

Resources