I have a tibble in which I want to lag/lead various columns and check their correlations.
Currently, for every column name, I have to write a separate function to do the lead/lag and the correlation.
Is there a way to pass the column name as a variable and then use that variable with lag/lead?
#This is what I have tried unsuccessfully so far
library(janitor)
library(tidyverse)
(x <- mtcars %>%
as_tibble())
var_to_lag <- "carb"
# Tried without success
x %>% mutate(lag_var = lag(!!var_to_lag, 1))
x %>% mutate(lag_var = lag(contains(var_to_lag), 1))
x %>% mutate(lag_var = lag(vars(contains(var_to_lag)), 1))
x %>% mutate(lag_var = lag(vars(!!var_to_lag), 1))
Any ideas?
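We could use mutate_at, which accepts string input (the scoped verbs such as mutate_at have been superseded by across() since dplyr 1.0.0, but they still work)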
library(dplyr)
x %>% mutate_at(vars(var_to_lag), list(lag_var = ~lag(.)))
We can also use get
x %>% mutate(lag_var = lag(get(var_to_lag)))
Or first convert var_to_lag to a symbol (sym) and then unquote it (!!)
x %>% mutate(lag_var = lag(!!sym(var_to_lag)))
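Another option is the .data pronoun, which looks a column up by the string held in a variable. The sketch below also wraps the lag-and-correlate step from the question in a small helper; lag_cor and its argument names are made up for illustration and are not part of dplyr.

```r
library(dplyr)

x <- as_tibble(mtcars)
var_to_lag <- "carb"

# .data[[...]] fetches the column whose name is stored in var_to_lag
lagged <- x %>% mutate(lag_var = dplyr::lag(.data[[var_to_lag]], 1))

# Hypothetical helper: correlation between a lagged copy of one column
# and another column, both passed as strings
lag_cor <- function(df, var, target, n = 1) {
  cor(dplyr::lag(df[[var]], n), df[[target]], use = "complete.obs")
}

lag_cor(x, "carb", "mpg")
```

Because both column names are plain strings, the helper can be mapped over a character vector of column names instead of writing one function per column.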
I have sam_data looking like this
I want to use the merge_samples() command on the "genus".
I have tried with code looking like this:
merge_physeq <- merge_samples(physeq, "genus")
What happens now is I get several of the warnings:
In asMethod(object) : NAs introduced by coercion
And my sam_data ends up looking like this:
I think this is due to the nature of the variable: genus is a factor variable. But all my other data is changed as well. The lower taxonomic levels are not important after the merge, so they can be removed.
This is because you have not specified an appropriate fun argument to the merge_samples function. By default it calculates the mean, which is possible for numerical, boolean and factor types, but not for strings.
Unfortunately, it seems that phyloseq has hardcoded coercion to numeric for all columns in the sample data within the merge_samples function before applying the specified function, so I'm not sure if it is possible at all to deal with character type columns through merge_samples.
As a solution, try the merge_ps_samples function below, which should merge samples and then summarize each column according to its class. For numeric, integer and logical columns it calculates the mean within each group; for character and factor columns it pastes together the unique values within each group. You can of course choose any function you like by altering the code.
Hope this helps! Good luck.
EDIT: note that this function calculates mean abundances, rather than summed abundances as implemented in the phyloseq::merge_samples function.
require(phyloseq)
require(tidyverse)
# Concatenate unique values in a vector
concat_unique <- function(vec){
uniq <- unique(as.character(vec))
return(paste(uniq, collapse = "/"))
}
# Like psmelt, but only uses the otu_table and sample_data
ps_semi_melt <- function(ps){
otu_table(ps) %>%
data.frame(taxid = row.names(.)) %>%
# data.frame() prepends "X" to names that start with a digit; strip only that prefix
rename_with(function(x){gsub("^X", "", x)}) %>%
pivot_longer(!taxid, names_to = "sample_id", values_to = "abundance") %>%
left_join(sample_data(ps) %>%
data.frame(sample_id = row.names(.)),
by = "sample_id")
}
# Function that summarizes a vector based on its class
summarise_vec <- function(vec){
  # is.*() tests stay valid when class(vec) has more than one element,
  # where `class(vec) %in% ...` inside if() would fail
  if (is.numeric(vec) || is.logical(vec)) {
    return(mean(vec, na.rm = TRUE))
  } else if (is.factor(vec) || is.character(vec)) {
    return(concat_unique(vec))
  } else {
    stop("Error: unknown column type")
  }
}
# Converts a summary df to an otu_table
summ_to_otu_tbl <- function(summ){
summ %>%
select(taxid, sample_id, abundance) %>%
pivot_wider(names_from = "sample_id", values_from = "abundance") %>%
column_to_rownames('taxid') %>%
as.matrix() %>%
otu_table(taxa_are_rows = TRUE)
}
# Converts a summary df to sample_data
summ_to_sample_dat <- function(summ){
summ %>%
select(!c(taxid, abundance)) %>%
unique() %>%
column_to_rownames('sample_id') %>%
sample_data()
}
# Function that merges phyloseq samples based on the names of one or more grouping factors
# present in sample_data(ps)
merge_ps_samples <- function(ps, grouping){
# Make sure taxa are rows
if (!phyloseq::taxa_are_rows(ps)) {
otu_table(ps) <- phyloseq::otu_table(t(otu_table(ps)), taxa_are_rows = TRUE)
}
# Convert to long format
ps_long <- ps_semi_melt(ps)
# Summarise all columns
summ <- ps_long %>%
group_by(across(all_of(!!grouping))) %>%
group_by(taxid, .add = TRUE) %>%
summarise(across(everything(), summarise_vec)) %>%
ungroup()
# Convert to otu_table and sample_data
otu_tbl <- summ_to_otu_tbl(summ)
sample_dat <- summ_to_sample_dat(summ)
# Create new physeq object
new_ps <- phyloseq(otu_tbl, sample_dat, tax_table(ps))
return(new_ps)
}
data("GlobalPatterns")
ps <- GlobalPatterns
merged_ps <- merge_ps_samples(ps, grouping = "SampleType")
I'm trying to create a function that gets me the mode, or mode-X (the 2nd to Xth most common value), and the associated counts for each column in a data frame.
I can't figure out what I'm missing and am looking for some assistance. I believe it has to do with passing a variable into a dplyr function.
library(tidyverse)
myfunct_get_mode = function(x, rank=1){
mytable = dplyr::count(rlang::sym(x), sort = TRUE)
names(mytable)= c('variable','counts')
# return just the rank specified...such as mode or mode -1, etc
result = table %>% dplyr::slice(rlang::sym(rank))
return(result)
}
mtcars %>% lapply(. %>% (function(x) myfunct_get_mode(x, rank=2)))
There are some problems with your function:
Your function call is not doing what you think. Check with mtcars %>% lapply(. %>% (function(x) print(x))) that x is actually a whole column of mtcars. To get the names of the columns, apply the function to names(mtcars). But then you also have to specify the data frame you're working on.
To evaluate the symbol that rlang::sym(x) returns, you need to put !! in front of it.
rank is not a variable name, so there is no need for rlang::sym here.
table should be mytable in the line of your function that assigns result.
So how could it work (although there are probably better ways):
myfunct_get_mode = function(df, x, rank=1){
mytable = count(df, !!rlang::sym(x), sort = TRUE)
names(mytable)= c('variable','counts')
# return just the rank specified...such as mode or mode -1, etc
result = mytable %>% slice(rank)
return(result)
}
names(mtcars) %>% lapply(function(x) myfunct_get_mode(mtcars, x, rank=2))
If we need this in a list, we can use imap (note that slice(seq_len(rank)) keeps the top rank values rather than only the rank-th one)
f1 <- function(dat, rank = 1) {
purrr::imap(dat, ~
dat %>%
count(!! rlang::sym(.y)) %>%
rename_all(~ c('variable', 'counts')) %>%
arrange(desc(counts)) %>%
slice(seq_len(rank))) #%>%
#bind_cols - convert to a data.frame
}
f1(mtcars, 2)
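The two answers differ in what they return for rank = 2, which a minimal sketch on a single column makes visible. The variable names counts, second and top_two are only for illustration.

```r
library(dplyr)

# Counts of each distinct cyl value, most common first
counts <- count(mtcars, cyl, sort = TRUE)

# slice(rank): only the row at that rank (the 2nd most common value)
second <- slice(counts, 2)

# slice(seq_len(rank)): every row up to that rank (the top two values)
top_two <- slice(counts, seq_len(2))

second
top_two
```

Which form is wanted depends on whether "mode-X" means the X-th value alone or everything up to it.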
I have several variables (id.type and id.subtype in this example) I would like to check for distinct values in a tibble all.snags using the dplyr package. I would like them sorted and all values printed out in the console (a tibble typically prints only the first 10). The output would be equivalent to the following code:
distinct(all.snags,id.type) %>% arrange(id.type) %>% print(n = Inf)
distinct(all.snags,id.subtype) %>% arrange(id.subtype) %>% print(n = Inf)
I think this is better done by looping over the values in a vector, but I can't get it to work.
distinct.vars <- c("id.type","id.subtype")
for (i in distinct.vars) {
distinct(all.snags,distinct.vars[i]) %>%
arrange(distinct.vars[i]) %>%
print(n = Inf)
}
I think this function is what you want:
library(dplyr)
df = iris
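print_distinct = function(df, columns) {
  for (col in columns) {
    # distinct_()/arrange_() are deprecated; unquote the column name instead
    df %>%
      as_tibble() %>%
      distinct(!!rlang::sym(col)) %>%
      arrange(!!rlang::sym(col)) %>%
      print(n = Inf)
  }
}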
print_distinct(df, c("Sepal.Length", "Sepal.Width"))
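As for why the loop in the question fails: for (i in distinct.vars) already puts each column name (a string) into i, so distinct.vars[i] indexes an unnamed vector by name and yields NA. A minimal sketch of a corrected loop follows; the all.snags tibble here is a made-up stand-in, and sorted_distinct is a hypothetical helper name.

```r
library(dplyr)

# Stand-in for the asker's all.snags tibble (values invented for the example)
all.snags <- tibble(id.type    = c("b", "a", "b", "c"),
                    id.subtype = c("y", "x", "y", "x"))
distinct.vars <- c("id.type", "id.subtype")

# Indexing the unnamed vector by a string gives NA, not the column name
bad <- distinct.vars["id.type"]

# Hypothetical helper: sorted distinct values of one column given as a string
sorted_distinct <- function(df, v) {
  df %>%
    distinct(!!rlang::sym(v)) %>%
    arrange(!!rlang::sym(v))
}

# Use the loop variable itself as the column name
for (v in distinct.vars) {
  print(sorted_distinct(all.snags, v), n = Inf)
}
```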
I have the following (heavily simplified) dplyr example for mutate:
xx <- data.frame(x = 1:10, y = c(rep(1,4),rep(2,6)))
bla_fun <- function(x,y){cat(x," ",y,"\n"); min(x,y)}
xx %>% rowwise() %>%
mutate( z = bla_fun(x,y))
I would like to get it working with mutate_at, which lets me pass the column names as strings.
xx %>% rowwise() %>%
mutate_at( c("x","y"), funs("bla_fun") )
But this does not work. How to get it working?
mutate_at mutates every single column separately.
Your particular example can be solved (assuming the min is a placeholder that cannot be replaced by pmin) like this:
xx %>%
mutate(z = map2(!!sym("x"), !!sym("y"), !!sym("bla_fun")))
Alternatively, build the symbols with rlang::syms and splice them into the call with !!!:
syms <- rlang::syms(c("x", "y"))
xx %>%
rowwise() %>%
mutate( z = bla_fun(!!! syms))
Side note 1: mutate_at is typically for applying n unary functions to n variables, not 1 n-ary function to n variables. mutate does the job.
Side note 2: there is no need to group rowwise. You could more simply mutate(xx, z = purrr::map2_dbl(x, y, bla_fun)) or rewrite/vectorize bla_fun with pmin() to mutate directly.
Combine this with the use of syms for strings: mutate(xx, z = mapply(bla_fun, !!! syms)) for instance, or mutate(xx, z = purrr::pmap_dbl(list(!!! syms), bla_fun)).
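If the real function can be vectorized, the pmin() route mentioned above needs neither rowwise() nor a map. A minimal sketch, assuming the row-wise minimum is what is actually wanted; cols and res are illustrative names.

```r
library(dplyr)

xx <- data.frame(x = 1:10, y = c(rep(1, 4), rep(2, 6)))
cols <- c("x", "y")

# Turn the strings into symbols and splice them into the pmin() call,
# so this is equivalent to mutate(xx, z = pmin(x, y))
res <- xx %>% mutate(z = pmin(!!!rlang::syms(cols)))

res
```

This works for any number of column names in cols, since pmin() accepts an arbitrary number of vectors.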
I am trying to build a summary table of a data frame like DataProfile below.
The idea is to transform each column into a row and add variables for count, nulls, not nulls, unique, and add additional mutations of those variables.
It seems like there should be a better, faster way to do this. Is there a function that does this?
#trying to write the functions within dplyr & magrittr framework
library(tidyverse)
library(reshape2) # provides melt(), which tidyverse does not load
mtcars[2,2] <- NA # Add a null to test completeness
#
total <- mtcars %>% summarise_all(funs(n())) %>% melt
nulls <- mtcars %>% summarise_all(funs(sum(is.na(.)))) %>% melt
filled <- mtcars %>% summarise_all(funs(sum(!is.na(.)))) %>% melt
uniques <- mtcars %>% summarise_all(funs(length(unique(.)))) %>% melt
mtcars %>% summarise_all(funs(n_distinct(.))) %>% melt
#Build a Data Frame from names of mtcars and add variables with mutate
DataProfile <- as.data.frame(names(mtcars))
DataProfile <- DataProfile %>% mutate(Total = total$value,
Nulls = nulls$value,
Filled = filled $value,
Complete = Filled/Total,
Cardinality = uniques$value,
Uniqueness = Cardinality/Total,
Distinctness = Cardinality/Filled)
DataProfile
#These are other attempts with Base R, but they are harder to read and don't play well with summarise_all
sapply(mtcars, function(x) length(unique(x[!is.na(x)]))) %>% melt
rapply(mtcars,function(x)length(unique(x))) %>% melt
The summarise_all() function can process more than one function at a time, so you can consolidate code by doing it in one pass then formatting your data to get to the type of "profile" per variable that you want.
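library(tidyverse)
library(reshape2) # provides melt(), which tidyverse does not load
mtcars[2,2] <- NA # Add a null to test completeness
DataProfile <- mtcars %>%
# funs() is deprecated; a named list of lambdas does the same job
summarise_all(list("Total" = ~length(.),
"Nulls" = ~sum(is.na(.)),
"Filled" = ~sum(!is.na(.)),
"Cardinality" = ~length(unique(.)))) %>%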
melt() %>%
separate(variable, into = c('variable', 'measure'), sep="_") %>%
spread(measure, value) %>%
mutate(Complete = Filled/Total,
Uniqueness = Cardinality/Total,
Distinctness = Cardinality/Filled)
DataProfile
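The same one-pass idea can be written with across() and the tidyr pivot functions, which avoids melt() and spread(). This is a sketch under the assumption that no column name contains an underscore (true for mtcars), since the underscore is used to split "column_measure" names; df and profile are illustrative names.

```r
library(dplyr)
library(tidyr)

df <- mtcars
df[2, 2] <- NA  # add a missing value to test completeness

profile <- df %>%
  # One pass: every measure for every column, named like "mpg_Total"
  summarise(across(everything(),
                   list(Total       = ~ length(.x),
                        Nulls       = ~ sum(is.na(.x)),
                        Filled      = ~ sum(!is.na(.x)),
                        Cardinality = ~ length(unique(.x))))) %>%
  # Split "mpg_Total" into variable = "mpg", measure = "Total"
  pivot_longer(everything(),
               names_to = c("variable", "measure"),
               names_sep = "_") %>%
  # One row per variable, one column per measure
  pivot_wider(names_from = measure, values_from = value) %>%
  mutate(Complete     = Filled / Total,
         Uniqueness   = Cardinality / Total,
         Distinctness = Cardinality / Filled)

profile
```

Note that length(unique(.x)) counts NA as a value of its own, as in the original code; n_distinct(.x, na.rm = TRUE) would exclude it.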