I'd like to apply a transformation to all columns via dplyr::mutate_each, e.g.
library(dplyr)
mult <- function(x,m) return(x*m)
mtcars %>% mutate_each(funs(mult(.,2))) # Multiply all columns by a factor of two
However, the transformation should have parameters depending on the column name. Therefore, the column name should be passed to the function as an additional argument
named.mult <- function(x,colname) return(x*param.A[[colname]])
Example: multiply every column by a different factor:
param.A <- c()
param.A[names(mtcars)] <- seq(length(names(mtcars)))
param.A
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 2 3 4 5 6 7 8 9 10 11
Since the column name gets lost during mutate_each, I currently work around this by passing a list with lazy evalution to mutate_ (the SE version):
library(lazyeval)
named.mutate <- function(fun, cols) sapply(cols, function(n) interp(~fun(col, n), fun=fun, col=as.name(n)))
mtcars %>% mutate_(.dots=named.mutate(named.mult, names(.)))
Works, but is there some special variable like .name which contains the column name of . for each colwise execution? So I could do something like
mtcars %>% mutate_each(funs(named.mult(.,.name)))
I'd suggest taking a different approach. Instead of using mutate_each a combination of dplyr::mutate with tidyr::gather and tidyr::spread can achieve the same result.
For example:
library(dplyr)
library(tidyr)
data(mtcars)
# Multiple each column by a different interger
mtcars %>%
dplyr::tbl_df() %>%
dplyr::mutate(make_and_model = rownames(mtcars)) %>%
tidyr::gather(key, value, -make_and_model) %>%
dplyr::mutate(m = as.integer(factor(key)), # a multiplication factor dependent on column name
value = value * m) %>%
dplyr::select(-m) %>%
tidyr::spread(key, value)
# compare to the original data
mtcars[order(rownames(mtcars)), order(names(mtcars))]
# the muliplicative values used.
mtcars %>%
tidyr::gather() %>%
dplyr::mutate(m = as.integer(factor(key))) %>%
dplyr::select(-value) %>%
dplyr::distinct()
Related
I'm working with a table for which I need to count the number of rows satisfying some criterion and I ended up with basically multiple repetitions of the same pipe differing only in the variable name.
Say I want to know how many cars are better than Valiant in mtcars on each of the variables there. An example of the code with two variables is below:
library(tidyverse)
reference <- mtcars %>%
slice(6)
mpg <- mtcars %>%
filter(mpg > reference$mpg) %>%
count() %>%
pull()
cyl <- mtcars %>%
filter(cyl > reference$cyl) %>%
count() %>%
pull()
tibble(mpg, cyl)
Except, suppose I need to do it for like 100 variables so there must be a more optimal way to just repeat the process.
What would be the way to rewrite the code above in an optimal way (maybe, using map() or anything else that works with pipes nicely so that the result would be a tibble with the counts for all the variables in mtcars?
I feel the solution should be very easy but I'm stuck.
Thank you!
Or:
library(tidyverse)
map_dfc(mtcars, ~sum(.x[6] < .x))
map2_dfc(mtcars, reference, ~sum(.y < .x))
You could use summarise + across to count observations greater than a certain value in each column.
library(dplyr)
mtcars %>%
summarise(across(everything(), ~ sum(. > .[6])))
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 18 14 15 22 30 11 1 0 13 17 25
base solution:
# (1)
colSums(mtcars > mtcars[rep(6, nrow(mtcars)), ])
# (2)
colSums(sweep(as.matrix(mtcars), 2, mtcars[6, ], ">"))
# mpg cyl disp hp drat wt qsec vs am gear carb
# 18 14 15 22 30 11 1 0 13 17 25
You can do it in a loop for example. Like this:
library(tidyverse)
reference <- mtcars %>%
slice(6)
# Empty list to save outcome
list_outcome <- list()
# Get the columnnames to loop over
loop_var <- colnames(reference)
for(i in loop_var){
nr <- mtcars %>%
filter(mtcars[, i] > reference[, i]) %>%
count() %>%
pull()
# Save every iteration in the loop as the ith element of the list
list_outcome[[i]] <- data.frame(Variable = i, Value = nr)
}
# combine all the data frames in the list to one final data frame
df_result <- do.call(rbind, list_outcome)
I want to create a new column (T/F) based on any value from a list being present in multiple columns. For this example, I'm using mtcars for my example, searching for two values in two columns, but my actual challenge is many values in many columns.
I have a successful filter using filter_at() included below, but I've been unable to apply that logic to a mutate:
# there are 7 cars with 6 cyl
mtcars %>%
filter(cyl == 6)
# there are 2 cars with 19.2 mpg, one with 6 cyl, one with 8
mtcars %>%
filter(mpg == 19.2)
# there are 8 rows with either.
# these are the rows I want as TRUE
mtcars %>%
filter(mpg == 19.2 | cyl == 6)
# set the cols to look at
mtcars_cols <- mtcars %>%
select(matches('^(mp|cy)')) %>% names()
# set the values to look at
mtcars_numbs <- c(19.2, 6)
# result is 8 vars with either value in either col.
# this is a successful filter of the data
out1 <- mtcars %>%
filter_at(vars(mtcars_cols), any_vars(
. %in% mtcars_numbs
)
)
# shows set with all 6 cyl, plus one 8cyl 21.9 mpg
out1 %>%
select(mpg, cyl)
# This attempts to apply the filter list to the cols,
# but I only get 6 rows as True
# I tried to change == to %in& but that results in an error
out2 <- mtcars %>%
mutate(
myset = rowSums(select(., mtcars_cols) == mtcars_numbs) > 0
)
# only 6 rows returned
out2 %>%
filter(myset == T)
I'm not sure why the two rows are skipped. I think it might be the use of rowSums that is aggregating those two rows in some way.
If we want to do the corresponding checks, it may be better to use map2
library(dplyr)
library(purrr)
map2_df(mtcars_cols, mtcars_numbs, ~
mtcars %>%
filter(!! rlang::sym(.x) == .y)) %>%
distinct
NOTE: Doing the comparison (==) with floating point numbers can get into trouble as the precision can vary and result in FALSE
Also, note that == works only when when either the lhs and rhs elements have the same length or the rhs vector is of length 1 (here the recycling happens). If the length is greater than 1 and not equal to length of lhs vector, then the recycling would be comparing in the column order.
We can replicate to make the lengths equal and now it should work
mtcars %>%
mutate(
myset = rowSums(select(., mtcars_cols) == mtcars_numbs[col(select(., mtcars_cols))]) > 0
) %>% pull(myset) %>% sum
#[1] 8
In the above code select is used twice for better understanding. Otherwise, we can also use rep
mtcars %>%
mutate(
myset = rowSums(select(., mtcars_cols) == rep(mtcars_numbs, each = n())) > 0
) %>%
pull(myset) %>%
sum
#[1] 8
I am currently try to compare the column classes and names of various data frames in R prior to undertaking any transformations and calculations.
The code I have is noted below::
library(dplyr)
m1 <- mtcars
m2 <- mtcars %>% mutate(cyl = factor(cyl), xxxx1 = factor(cyl))
m3 <- mtcars %>% mutate(cyl = factor(cyl), xxxx2 = factor(cyl))
out <- cbind(sapply(m1, class), sapply(m2, class), sapply(m3, class))
If someone can solve this for dataframes stored in a list, that would be great. All my dataframes are currently stored in a list, for easier processing.
All.list <- list(m1,m2,m3)
I am expecting that the output is displayed in a matrix form as shown in the dataframe "out". The output in "out" is not desireable as it is incorrect. I am expecting the output to be more along the following::
Try compare_df_cols() from the janitor package:
library(janitor)
compare_df_cols(All.list)
#> column_name All.list_1 All.list_2 All.list_3
#> 1 am numeric numeric numeric
#> 2 carb numeric numeric numeric
#> 3 cyl numeric factor factor
#> 4 disp numeric numeric numeric
#> 5 drat numeric numeric numeric
#> 6 gear numeric numeric numeric
#> 7 hp numeric numeric numeric
#> 8 mpg numeric numeric numeric
#> 9 qsec numeric numeric numeric
#> 10 vs numeric numeric numeric
#> 11 wt numeric numeric numeric
#> 12 xxxx1 <NA> factor <NA>
#> 13 xxxx2 <NA> <NA> factor
It accepts both a list and/or the individual named data.frames, i.e., compare_df_cols(m1, m2, m3).
Disclaimer: I maintain the janitor package to which this function was recently added - posting it here as it addresses exactly this use case.
I think the easiest way would be to define a function, and then use a combination of lapply and dplyr to obtain the result you want. Here is how I did it.
library(dplyr)
m1 <- mtcars
m2 <- mtcars %>% mutate(cyl = factor(cyl), xxxx1 = factor(cyl))
m3 <- mtcars %>% mutate(cyl = factor(cyl), xxxx2 = factor(cyl))
All.list <- list(m1,m2,m3)
##Define a function to get variable names and types
my_function <- function(data_frame){
require(dplyr)
x <- tibble(`var_name` = colnames(data_frame),
`var_type` = sapply(data_frame, class))
return(x)
}
target <- lapply(1:length(All.list),function(i)my_function(All.list[[i]]) %>%
mutate(element =i)) %>%
bind_rows() %>%
spread(element, var_type)
target
I'm writing a function where the user is asked to define one or more grouping variables in the function call. The data is then grouped using dplyr and it works as expected if there is only one grouping variable, but I haven't figured out how to do it with multiple grouping variables.
Example:
x <- c("cyl")
y <- c("cyl", "gear")
dots <- list(~cyl, ~gear)
library(dplyr)
library(lazyeval)
mtcars %>% group_by_(x) # groups by cyl
mtcars %>% group_by_(y) # groups only by cyl (not gear)
mtcars %>% group_by_(.dots = dots) # groups by cyl and gear, this is what I want.
I tried to turn y into the same as dots using:
mtcars %>% group_by_(.dots = interp(~var, var = list(y)))
#Error: is.call(expr) || is.name(expr) || is.atomic(expr) is not TRUE
How to use a user-defined input string of > 1 variable names (like y in the example) to group the data using dplyr?
(This question is somehow related to this one but not answered there.)
No need for interp here, just use as.formula to convert the strings to formulas:
dots = sapply(y, . %>% {as.formula(paste0('~', .))})
mtcars %>% group_by_(.dots = dots)
The reason why your interp approach doesn’t work is that the expression gives you back the following:
~list(c("cyl", "gear"))
– not what you want. You could, of course, sapply interp over y, which would be similar to using as.formula above:
dots1 = sapply(y, . %>% {interp(~var, var = .)})
But, in fact, you can also directly pass y:
mtcars %>% group_by_(.dots = y)
The dplyr vignette on non-standard evaluation goes into more detail and explains the difference between these approaches.
slice_rows() from the purrrlyr package (https://github.com/hadley/purrrlyr) groups a data.frame by taking a vector of column names (strings) or positions (integers):
y <- c("cyl", "gear")
mtcars_grp <- mtcars %>% purrrlyr::slice_rows(y)
class(mtcars_grp)
#> [1] "grouped_df" "tbl_df" "tbl" "data.frame"
group_vars(mtcars_grp)
#> [1] "cyl" "gear"
Particularly useful now that group_by_() has been depreciated.
I'm writing a function where the user is asked to define one or more grouping variables in the function call. The data is then grouped using dplyr and it works as expected if there is only one grouping variable, but I haven't figured out how to do it with multiple grouping variables.
Example:
x <- c("cyl")
y <- c("cyl", "gear")
dots <- list(~cyl, ~gear)
library(dplyr)
library(lazyeval)
mtcars %>% group_by_(x) # groups by cyl
mtcars %>% group_by_(y) # groups only by cyl (not gear)
mtcars %>% group_by_(.dots = dots) # groups by cyl and gear, this is what I want.
I tried to turn y into the same as dots using:
mtcars %>% group_by_(.dots = interp(~var, var = list(y)))
#Error: is.call(expr) || is.name(expr) || is.atomic(expr) is not TRUE
How to use a user-defined input string of > 1 variable names (like y in the example) to group the data using dplyr?
(This question is somehow related to this one but not answered there.)
No need for interp here, just use as.formula to convert the strings to formulas:
dots = sapply(y, . %>% {as.formula(paste0('~', .))})
mtcars %>% group_by_(.dots = dots)
The reason why your interp approach doesn’t work is that the expression gives you back the following:
~list(c("cyl", "gear"))
– not what you want. You could, of course, sapply interp over y, which would be similar to using as.formula above:
dots1 = sapply(y, . %>% {interp(~var, var = .)})
But, in fact, you can also directly pass y:
mtcars %>% group_by_(.dots = y)
The dplyr vignette on non-standard evaluation goes into more detail and explains the difference between these approaches.
slice_rows() from the purrrlyr package (https://github.com/hadley/purrrlyr) groups a data.frame by taking a vector of column names (strings) or positions (integers):
y <- c("cyl", "gear")
mtcars_grp <- mtcars %>% purrrlyr::slice_rows(y)
class(mtcars_grp)
#> [1] "grouped_df" "tbl_df" "tbl" "data.frame"
group_vars(mtcars_grp)
#> [1] "cyl" "gear"
Particularly useful now that group_by_() has been depreciated.