With the new release of dplyr I am refactoring quite a lot of code and removing functions that are now retired or deprecated. I had a function that is as follows:
processingAggregatedLoad <- function (df) {
defined <- ls()
passed <- names(as.list(match.call())[-1])
if (any(!defined %in% passed)) {
stop(paste("Missing values for the following arguments:", paste(setdiff(defined, passed), collapse=", ")))
}
df_isolated_load <- df %>% select(matches("snsr_val")) %>% mutate(global_demand = rowSums(.)) # we get isolated load
df_isolated_load_qlty <- df %>% select(matches("qlty_good_ind")) # we get isolated quality
df_isolated_load_qlty <- df_isolated_load_qlty %>% mutate_all(~ factor(.), colnames(df_isolated_load_qlty)) %>%
mutate_each(funs(as.numeric(.)), colnames(df_isolated_load_qlty)) # we convert the qlty to factors and then to numeric
df_isolated_load_qlty[df_isolated_load_qlty[]==1] <- 1 # 1 is bad
df_isolated_load_qlty[df_isolated_load_qlty[]==2] <- 0 # 0 is good we mask to calculate the global index quality
df_isolated_load_qlty <- df_isolated_load_qlty %>% mutate(global_quality = rowSums(.)) %>% select(global_quality)
df <- bind_cols(df, df_isolated_load, df_isolated_load_qlty)
return(df)
}
Basically the function does as follows:
1. The function selects all of the values of a pivoted dataframe and aggregates them.
2. The function selects the quality indicators (character) of a pivoted dataframe.
3. I convert the quality characters to factors and then to numeric to get the 2 levels (1 or 2).
4. I replace the numeric values in each individual column with 0 or 1 depending on the level.
5. I take the row sum of the individual quality values: I get 0 if all of the values are good; otherwise the global quality is bad.
The problem is that I am getting the following messages:
1: `funs()` is deprecated as of dplyr 0.8.0.
Please use a list of either functions or lambdas:
# Simple named list:
list(mean = mean, median = median)
# Auto named with `tibble::lst()`:
tibble::lst(mean, median)
# Using lambdas
list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.
2: `mutate_each_()` is deprecated as of dplyr 0.7.0.
Please use `across()` instead.
I tried several variations, for instance:
df_isolated_load_qlty %>% mutate(across(.fns = ~ as.factor(), .names = colnames(df_isolated_load_qlty)))
Error: Problem with `mutate()` input `..1`.
x All unnamed arguments must be length 1
ℹ Input `..1` is `across(.fns = ~as.factor(), .names = colnames(df_isolated_load_qlty))`.
But I am still a bit confused about the new dplyr syntax. Would someone be able to guide me a little bit around the right way of doing this?
mutate_each has long been deprecated and was replaced by mutate_all.
mutate_all is in turn superseded by across.
across uses everything() as its default .cols, which means it behaves like mutate_all (as it does here) when .cols is not given explicitly.
You can apply multiple functions in the same mutate call, so here factor and as.numeric can be applied together.
Considering all this, you can change your existing function to:
library(dplyr)
processingAggregatedLoad <- function(df) {
  # Stop early if any formal argument was not supplied in the call
  defined <- ls()
  passed <- names(as.list(match.call())[-1])
  if (any(!defined %in% passed)) {
    stop(paste("Missing values for the following arguments:",
               paste(setdiff(defined, passed), collapse = ", ")))
  }

  # Isolated load: sum the sensor values row by row
  df_isolated_load <- df %>%
    select(matches("snsr_val")) %>%
    mutate(global_demand = rowSums(.))

  # Isolated quality: character -> factor -> numeric level codes (1 or 2)
  df_isolated_load_qlty <- df %>% select(matches("qlty_good_ind"))
  df_isolated_load_qlty <- df_isolated_load_qlty %>%
    mutate(across(everything(), ~ as.numeric(factor(.))))

  df_isolated_load_qlty[df_isolated_load_qlty == 1] <- 1 # 1 is bad
  df_isolated_load_qlty[df_isolated_load_qlty == 2] <- 0 # 0 is good

  # Global quality is 0 only when every individual indicator is good
  df_isolated_load_qlty <- df_isolated_load_qlty %>%
    mutate(global_quality = rowSums(.)) %>%
    select(global_quality)

  df <- bind_cols(df, df_isolated_load, df_isolated_load_qlty)
  return(df)
}
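To see the across() replacement in isolation, here is a minimal sketch on a made-up two-column tibble. The column names and the "N"/"Y" values are assumptions for illustration, not the OP's real data; the point is only how one lambda inside across() stands in for the deprecated mutate_each(funs(...)) call.

```r
library(dplyr)

# Hypothetical toy data: two quality-indicator columns stored as character
qlty <- tibble(
  a_qlty_good_ind = c("N", "Y", "Y"),
  b_qlty_good_ind = c("Y", "Y", "N")
)

# Deprecated style: mutate_each(funs(as.numeric(factor(.))))
# across() style: explicit column selection plus a single lambda
qlty_num <- qlty %>%
  mutate(across(everything(), ~ as.numeric(factor(.x))))

# Levels are sorted alphabetically, so "N" becomes 1 and "Y" becomes 2
qlty_num
```

One caveat with this recoding: the level codes depend on the values actually present in each column. A column holding only one distinct value is coded as 1 throughout, so the 1/2 mapping is only reliable when both levels occur in every column.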
I have sam_data looking like this
I want to use the merge_samples() command on the "genus".
I have tried with code looking like this:
merge_physeq <- merge_samples(physeq, "genus")
What happens now is I get several of the warnings:
In asMethod(object) : NAs introduced by coercion
And my sam_data ends up looking like this:
I think this is due to the nature of the variable: the different genera are factor variables. But all my other data is changed as well. The lower taxonomic levels are not important after the merge, so they can be removed.
This is because you have not specified an appropriate fun argument to the merge_samples function. By default it calculates the mean, which is possible for numerical, boolean and factor types, but not for strings.
Unfortunately, it seems that phyloseq has hardcoded coercion to numeric for all columns in the sample data within the merge_samples function before applying the specified function, so I'm not sure if it is possible at all to deal with character type columns through merge_samples.
As a solution, try the merge_ps_samples function below, which should merge samples and then summarize any column according to its class. For numeric columns it calculates the mean within each group, for character types it pastes together the unique values within each group. You can of course choose any function you like, just alter the code to your desire.
Hope this helps! Good luck.
EDIT: note that this function calculates mean abundances, rather than summed abundances as implemented in the phyloseq::merge_samples function.
require(phyloseq)
require(tidyverse)
# Concatenate unique values in a vector
concat_unique <- function(vec){
uniq <- unique(as.character(vec))
return(paste(uniq, collapse = "/"))
}
# Like psmelt, but only uses the otu_table and sample_data
ps_semi_melt <- function(ps){
otu_table(ps) %>%
data.frame(taxid = row.names(.)) %>%
rename_with(function(x){gsub("^X", "", x)}) %>% # strip only the leading "X" that data.frame() prepends
pivot_longer(!taxid, names_to = "sample_id", values_to = "abundance") %>%
left_join(sample_data(ps) %>%
data.frame(sample_id = row.names(.)),
by = "sample_id")
}
# Function that summarizes a vector based on its class
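summarise_vec <- function(vec){
  # is.numeric() covers both "numeric" and "integer"; the is.*() tests also
  # stay length-1 for multi-class vectors such as ordered factors
  if (is.numeric(vec) || is.logical(vec)) {
    return(mean(vec, na.rm = TRUE))
  } else if (is.factor(vec) || is.character(vec)) {
    return(concat_unique(vec))
  } else {
    stop("Error: unknown column type")
  }
}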
# Converts a summary df to an otu_table
summ_to_otu_tbl <- function(summ){
summ %>%
select(taxid, sample_id, abundance) %>%
pivot_wider(names_from = "sample_id", values_from = "abundance") %>%
column_to_rownames('taxid') %>%
as.matrix() %>%
otu_table(taxa_are_rows = TRUE)
}
# Converts a summary df to sample_data
summ_to_sample_dat <- function(summ){
summ %>%
select(!c(taxid, abundance)) %>%
unique() %>%
column_to_rownames('sample_id') %>%
sample_data()
}
# Function that merges phyloseq samples based on the names of one or more grouping factors
# present in sample_data(ps)
merge_ps_samples <- function(ps, grouping){
# Make sure taxa are rows
if (!phyloseq::taxa_are_rows(ps)) {
otu_table(ps) <- phyloseq::otu_table(t(otu_table(ps)), taxa_are_rows = T)
}
# Convert to long format
ps_long <- ps_semi_melt(ps)
# Summarise all columns
summ <- ps_long %>%
group_by(across(all_of(!!grouping))) %>%
group_by(taxid, .add = T) %>%
summarise(across(everything(), summarise_vec)) %>%
ungroup()
# Convert to otu_table and sample_data
otu_tbl <- summ_to_otu_tbl(summ)
sample_dat <- summ_to_sample_dat(summ)
# Create new physeq object
new_ps <- phyloseq(otu_tbl, sample_dat, tax_table(ps))
return(new_ps)
}
data("GlobalPatterns")
ps <- GlobalPatterns
merged_ps <- merge_ps_samples(ps, grouping = "SampleType")
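The core idea of summarising each column according to its class can be tried without phyloseq. The following is a minimal sketch on a made-up tibble; the names toy and summarise_by_class are hypothetical and only mirror the logic of summarise_vec and concat_unique above.

```r
library(dplyr)

# Hypothetical sample table: one grouping column, one numeric, one character
toy <- tibble(
  genus = c("A", "A", "B"),
  depth = c(1, 3, 10),
  site  = c("x", "y", "x")
)

# Mean for numeric/logical columns, "/"-joined unique values otherwise
summarise_by_class <- function(vec) {
  if (is.numeric(vec) || is.logical(vec)) {
    mean(vec, na.rm = TRUE)
  } else {
    paste(unique(as.character(vec)), collapse = "/")
  }
}

# everything() skips the grouping column, so only depth and site are summarised
merged <- toy %>%
  group_by(genus) %>%
  summarise(across(everything(), summarise_by_class))

merged
```

Genus A ends up with the mean depth of its two samples and the site label "x/y", which is the behaviour the character columns could not get from the numeric coercion in merge_samples.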
I am trying to pass a variable Phyla (which is also the name of a df column of interest) into other functions. However, I get the error: Error: Column `tax_level` is unknown. Which I understand. It would just be more convenient to state the column you want to use once in the function, since this will also be repeated numerous times in the script. I have tried using OTU_melt_grouped[,1], since this will always be the first column to use in the dcast function, but get the error: Error: Must use a vector in `[`, not an object of class matrix. Moreover, it does not solve my problem in the group_by function, since I want to be able to specify Phyla, Class, Order, etc.
I am sure there must be a simple solution, but I don't know where to start!
taxa_specific_columns_func <- function(data, tax_level = Phyla) {
OTU_melt_grouped <- data %>%
group_by(tax_level, variable) %>%
summarise(value = sum(value))
taxa_cols <- dcast(OTU_melt_grouped, variable ~ tax_level)
rownames(taxa_cols) <- meta_data$site
taxa_cols <- taxa_cols[-1]
return(taxa_cols)
}
tax_test <- taxa_specific_columns_func(OTU_melt)
As we are passing an unquoted variable, we could make use of curly-curly ({{..}}) operator in group_by
library(dplyr)
library(tidyr)
library(tibble)
taxa_specific_columns_func <- function(data, tax_level = Phyla) {
data %>%
group_by({{tax_level}}, variable) %>%
summarise(value = sum(value)) %>%
pivot_wider(names_from = {{tax_level}}, values_from = value) %>%
column_to_rownames("variable")
}
taxa_specific_columns_func(OTU_melt)
# A B C D E
#a 0.01859254 0.42141238 -0.196961 -0.1859115 -0.2901680
#b -0.64700080 NA -0.161108 NA NA
#c -0.03297331 0.05871052 -1.963341 NA 0.7608218
data
set.seed(48)
OTU_melt <- data.frame(Phyla = rep(LETTERS[1:5], each = 3),
variable = sample(letters[1:3], 15, replace = TRUE), value = rnorm(15))
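Because the column is embraced with {{ }}, the same function can be pointed at a different taxonomic level just by changing the argument. Here is a small sketch of that, using a hypothetical helper sum_by and a made-up data frame with both a Phyla and a Class column (neither is from the original post).

```r
library(dplyr)

# Hypothetical helper: group by whichever unquoted column the caller passes
sum_by <- function(data, group_col) {
  data %>%
    group_by({{ group_col }}) %>%
    summarise(total = sum(value), .groups = "drop")
}

# Made-up data with two candidate grouping columns
toy <- data.frame(
  Phyla = c("A", "A", "B"),
  Class = c("x", "y", "y"),
  value = c(1, 2, 3)
)

by_phyla <- sum_by(toy, Phyla)  # groups A and B
by_class <- sum_by(toy, Class)  # groups x and y
```

The column name is written once at the call site and never quoted, which is what the question was after.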
I would like to apply 3 functions to the same variables in my data in a single piece of code.
I have a data set with certain columns, and I want to apply these functions to all of them:
1- make them all factors
2- replace blank values in the columns with missing (convert blanks to NA)
3- give missing values an explicit factor level using fct_explicit_na
I have done this in separate lines of code, but I want to merge all of them using dplyr's mutate function. I tried the following, but it didn't work:
cols <- c("id12", "id13", "id14", "id15")
data_new <- data_old %>%
mutate_if(cols=="", NA) %>% # replace space with NA for cols
mutate_at(cols, factor) %>% # then turn them into factors
mutate_at(cols, fct_explicit_na) # give NAs explicit factor level
)
I get the error:
Error in tbl_if_vars(.tbl, .p, .env, ..., .include_group_vars = .include_group_vars) :
length(.p) == length(tibble_vars) is not TRUE
The mutate_if step is not doing what the OP intended. Instead, we can do this in a single step with
library(dplyr)
library(forcats) # provides fct_explicit_na
data_old %>%
  mutate_at(vars(cols), ~ na_if(., "") %>%
    factor %>%
    fct_explicit_na)
Why didn't the OP's code work?
Using a reproducible example, the code below converts the factor columns to character class and sets a few Species values to empty strings:
iris1 <- iris %>%
mutate_if(is.factor, as.character) %>%
mutate(Species = replace(Species, c(1, 3, 5), ""))
Now, if we do
iris1 %>%
mutate_if("Species" == "", NA)
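it compares two strings instead of checking the column values. Also, the predicate passed to mutate_if should return a logical vector of length 1 for each column, which decides whether that column is selected.
Instead, we can use a predicate that checks each column's values: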
iris1 %>%
mutate_if(~ any(. == ""), ~ na_if(., "")) %>%
head
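The same three steps can also be written with across(), which supersedes the mutate_at/mutate_if family. This is a minimal sketch on a made-up tibble (toy and its values are assumptions; only the column names id12 and id13 are borrowed from the question). It keeps fct_explicit_na as used above, although recent forcats releases point to fct_na_value_to_level() as its replacement.

```r
library(dplyr)
library(forcats)

# Hypothetical data: two character columns with blanks, one untouched column
toy <- tibble(
  id12  = c("a", "", "b"),
  id13  = c("", "x", "x"),
  other = 1:3
)
cols <- c("id12", "id13")

# Blank -> NA, then factor, then an explicit "(Missing)" level, in one lambda
toy_new <- toy %>%
  mutate(across(all_of(cols), ~ fct_explicit_na(factor(na_if(.x, "")))))

levels(toy_new$id12)
```

all_of(cols) makes it explicit that cols is an external character vector rather than a column of the data.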
I have some troubles using the pipe operator (%>%) with the unique function.
df = data.frame(
a = c(1,2,3,1),
b = 'a')
unique(df$a) # no problem here
df %>% unique(.$a) # not working here
# I got "Error: argument 'incomparables != FALSE' is not used (yet)"
Any idea?
As other answers mention, df %>% unique(.$a) is equivalent to df %>% unique(., .$a).
To force the dots to be explicit you can do:
df %>% {unique(.$a)}
# [1] 1 2 3
An alternative option from magrittr
df %$% unique(a)
# [1] 1 2 3
Or possibly stating the obvious:
df$a %>% unique()
# [1] 1 2 3
What is happening is that %>% takes the object on the left hand side and feeds it into the first argument of the function by default, and then will feed in other arguments as provided. Here is an example:
df = data.frame(
a = c(1,2,3,1),
b = 'a')
MyFun<-function(x,y=FALSE){
return(match.call())
}
> df %>% MyFun(.$a)
MyFun(x = ., y = .$a)
Here %>% matches df to x and .$a to y.
So for unique your code is being interpreted as:
unique(x=df, incomparables=.$a)
which explains the error. For your case you need to pull out a before you run unique. If you want to keep with %>% you can use df %>% .$a %>% unique() but obviously there are lots of other ways to do that.
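Two of those other ways, sketched here under the assumption that dplyr is acceptable in your pipeline: pull() extracts the column as a vector before unique() sees it, and distinct() keeps the result as a data frame.

```r
library(dplyr)

df <- data.frame(a = c(1, 2, 3, 1), b = "a")

# pull() hands unique() a plain vector, so nothing lands in `incomparables`
vals <- df %>% pull(a) %>% unique()

# distinct() returns a one-column data frame of the unique values instead
rows <- df %>% distinct(a)
```

Which one fits depends on whether the next step expects a vector or a data frame.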
I am trying to use a custom function in a piped mutate statement. I looked at this somewhat similar SO post, but in vain.
Say I have a data frame like this (where blob is some variable not related to the specific task but is part of the entire data) :
df <-
data.frame(exclude=c('B','B','D'),
B=c(1,0,0),
C=c(3,4,9),
D=c(1,1,0),
blob=c('fd', 'fs', 'sa'),
stringsAsFactors = F)
I have a function that uses the variable names to select some columns based on the value in the exclude column and, e.g., calculates a sum over the variables not specified in exclude (which is always a single character).
FUN <- function(df){
sum(df[c('B', 'C', 'D')] [!names(df[c('B', 'C', 'D')]) %in% df['exclude']] )
}
When I give a single row (row 1) to FUN, I get the expected sum of C and D (those not mentioned by exclude), namely 4:
FUN(df[1,])
How do I do the same in a pipe with mutate (adding the result to a variable s)? These two attempts do not work:
df %>% mutate(s=FUN(.))
df %>% group_by(1:n()) %>% mutate(s=FUN(.))
UPDATE
This also does not work as intended:
df %>% rowwise(.) %>% mutate(s=FUN(.))
This works, of course, but it is not within dplyr's mutate (and pipes):
df$s <- sapply(1:nrow(df), function(x) FUN(df[x,]))
If you want to use dplyr you can do so using rowwise and your function FUN.
df %>%
  rowwise %>%
  do({
    result = as_tibble(.) # as_data_frame() is deprecated in favour of as_tibble()
    result$s = FUN(result)
    result
  })
The same can be achieved using group_by instead of rowwise (like you already tried) but with do instead of mutate
df %>%
  group_by(1:n()) %>%
  do({
    result = as_tibble(.) # as_data_frame() is deprecated in favour of as_tibble()
    result$s = FUN(result)
    result
  })
The reason mutate doesn't work in this case, is that you are passing the whole tibble to it, so it's like calling FUN(df).
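If the per-row logic can be written against the row's values instead of against a whole data frame, mutate does work after rowwise(). The following is a sketch assuming dplyr 1.0.0 or later, where c_across() is available; it re-implements the sum inline rather than calling FUN, so it is an alternative formulation and not a drop-in wrapper.

```r
library(dplyr)

df <- data.frame(
  exclude = c("B", "B", "D"),
  B = c(1, 0, 0),
  C = c(3, 4, 9),
  D = c(1, 1, 0),
  blob = c("fd", "fs", "sa"),
  stringsAsFactors = FALSE
)
cols <- c("B", "C", "D")

# Inside rowwise(), exclude is a single value and c_across() returns the
# current row's B, C and D values in the order given by cols
res <- df %>%
  rowwise() %>%
  mutate(s = sum(c_across(all_of(cols))[cols != exclude])) %>%
  ungroup()

res$s
```

For the three rows above this gives 4, 5 and 9, the same values as the sapply version in the question.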
A much more efficient way of doing the same thing though is to just make a matrix of columns to be included and then use rowSums.
cols <- c('B', 'C', 'D')
include_mat <- outer(function(x, y) x != y, X = df$exclude, Y = cols)
# or outer(`!=`, X = df$exclude, Y = cols) if it's more readable to you
df$s <- rowSums(df[cols] * include_mat)
purrr approach
We can use a combination of nest and map_dbl for this:
library(tidyverse)
df %>%
rowwise %>%
nest(-blob) %>%
mutate(s = map_dbl(data, FUN)) %>%
unnest
Let's break that down a little bit. First, rowwise makes each subsequent operation apply row by row, which supports arbitrarily complex per-row operations.
Next, nest will create a new column that is a list of our data to be fed into FUN (the beauty of tibbles vs data.frames!). Since we are applying this rowwise, each row contains a single-row tibble of exclude:D.
Finally, we use map_dbl to map our FUN to each of these tibbles. map_dbl is used over the family of other map_* functions since our intended output is numeric (i.e. double).
unnest returns our tibble into the more standard structure.
purrrlyr approach
While purrrlyr may not be as 'popular' as its parents dplyr and purrr, its by_row function has some utility here.
In your above example, we would use your data frame df and user-defined function FUN in the following way:
df %>%
by_row(..f = FUN, .to = "s", .collate = "cols")
That's it! Giving you:
# tibble [3 x 6]
exclude B C D blob s
<chr> <dbl> <dbl> <dbl> <chr> <dbl>
1 B 1 3 1 fd 4
2 B 0 4 1 fs 5
3 D 0 9 0 sa 9
Admittedly, the syntax is a little strange, but here's how it breaks down:
..f = the function to apply to each row
.to = the name of the output column, in this case s
.collate = the way the results should be collated, by list, row, or column. Since FUN only has a single output, we would be fine to use either "cols" or "rows"
See here for more information on using purrrlyr...
Performance
A word of warning: while I like the functionality of by_row, it's not always the best approach for performance! purrr is more intuitive, but comes with a rather large speed loss. See the following microbenchmark test:
library(microbenchmark)
mbm <- microbenchmark(
purrr.test = df %>% rowwise %>% nest(-blob) %>%
mutate(s = map_dbl(data, FUN)) %>% unnest,
purrrlyr.test = df %>% by_row(..f = FUN, .to = "s", .collate = "cols"),
rowwise.test = df %>%
rowwise %>%
do({
result = as_tibble(.)
result$s = FUN(result)
result
}),
group_by.test = df %>%
group_by(1:n()) %>%
do({
result = as_tibble(.)
result$s = FUN(result)
result
}),
sapply.test = {df$s <- sapply(1:nrow(df), function(x) FUN(df[x,]))},
times = 1000
)
autoplot(mbm)
You can see that the purrrlyr approach is faster than using a combination of do with rowwise or group_by(1:n()) (see @konvas's answer), and roughly on par with the sapply approach. However, the package is admittedly not the most intuitive. The standard purrr approach seems to be the slowest, but is perhaps easier to work with. Different user-defined functions may change the speed order.