dplyr: Renaming variables with rename_ - r

I am trying to rename several variables in a chain:
df_foo = data_frame(
a.a = 1:10,
"b...b" = 1:10,
"cc..d" = 1:10
)
df_foo %>%
rename_(
.dots = setNames(
names(.),
gsub("[[:punct:]]", "", names(.)))
)
This works fine, but when there is a space in the name of one of the variables:
df_foo = data_frame(
a.a = 1:10,
"b...b" = 1:10,
"c c..d" = 1:10
)
df_foo %>%
rename_(
.dots = setNames(
names(.),
gsub("[[:punct:]]", "", names(.)))
)
I get this error:
Error in parse(text = x) : <text>:1:3: unexpected symbol
1: c c..d
^
I am not sure where this stems from since when I run gsub directly:
setNames(
names(df_foo),
gsub("[[:punct:]]", "", names(df_foo)))
I do not get an error. Not sure what is going on here.
This is now raised as issue #2391 on the dplyr GH issues page.

In general: I strongly suggest you never use variable names with spaces. They are a pain and will often cause more trouble than they are worth.
Here is the cause of this error.
rename_ dispatches to dplyr:::rename_.data.frame. First line of that function is:
dots <- lazyeval::all_dots(.dots, ...)
That lazyeval function then calls lazyeval::as.lazy_dots, which uses lazyeval::as.lazy, which in turn dispatches to lazyeval:::as.lazy.character, which calls lazy_(parse(text = x)[[1]], env). Now, parse() expects a valid R expression as its text argument:
text: character vector. The text to parse. Elements are treated as if they were lines of a file. (from help("parse"))
This is why rename_ doesn't seem to like character vectors with spaces and we get "Error in parse(text = x)":
lazyeval:::as.lazy(names(df_foo)[2])
<lazy>
expr: b...b
env: <environment: base>
lazyeval:::as.lazy(names(df_foo)[3])
Error in parse(text = x) : <text>:1:3: unexpected symbol
1: c c..d
^
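To see which column names parse() rejects, a small base R check along the following lines may help. The helper name parses_ok is made up for this sketch and is not part of dplyr or lazyeval.

```r
# Hypothetical helper: TRUE if parse() accepts the string as R code.
parses_ok <- function(x) {
  !inherits(try(parse(text = x), silent = TRUE), "try-error")
}

col_names <- c("a.a", "b...b", "c c..d")
ok <- vapply(col_names, parses_ok, logical(1), USE.NAMES = FALSE)
print(ok)

# Wrapping the name in backticks makes it parse as a single symbol.
ok_quoted <- parses_ok("`c c..d`")
print(ok_quoted)
```

Only the name containing a space fails, which matches the error shown above.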
I'm not aware of a solution, other than just using base R for this simple renaming.
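A minimal sketch of the base R route, assuming a plain data.frame stands in for the data_frame from the question:

```r
# check.names = FALSE keeps the awkward names exactly as written.
df_foo <- data.frame(
  "a.a" = 1:10,
  "b...b" = 1:10,
  "c c..d" = 1:10,
  check.names = FALSE
)

# Assign the cleaned names directly; nothing is parsed as R code here.
names(df_foo) <- gsub("[[:punct:]]", "", names(df_foo))
print(names(df_foo))
```

Note that the space survives, because [[:punct:]] does not match whitespace. A pattern such as "[[:punct:][:space:]]" would remove it as well.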

Related

R: How to make parse() accept regular expressions with escaped characters?

I am trying to use the validator package to check if certain rows in my data table contain a regular expression or not.
I make a vector (fields) with the columns I want to test and then paste together the commands for the validator rules as a string.
To be able to use the rules in the confront() function I use parse() and eval() to turn the character string into an expression.
The following example is working as expected:
library(validate)
library(purrr)  # provides map_chr()
data <- data.frame("Protocol_Number" = c("123", "A122"), "Numeric_Result" = c("-0.5", "1.44"))
fields <- c("Protocol_Number", "Numeric_Result")
# build validator commands for each field
cmds <- paste("validator(",
paste(
map_chr(
fields, function(x) paste0("grepl('^-?[0-9]', as.character(`", x, "`))")
), collapse = ","),
")")
# convert to rule and do the tests
rule <- eval(parse(text = cmds))
out <- confront(data, rule)
summary(out)
However, I want to use a regex that recognizes any sort of number as opposed to text, like in this working example
grepl('^-?[0-9]\\d*(\\.\\d+)?$', c(1, -1, 0.5, "Not Done"))
When I try to use this regex in the above example, the parse() function will throw an error:
Error: '\d' is an unrecognized escape in character string starting "'^-?[0-9]\d"
This is not working:
# build validator commands for each field
cmds <- paste("validator(",
paste(
map_chr(
fields, function(x) paste0("grepl('^-?[0-9]\\d*(\\.\\d+)?$', as.character(`", x, "`))")
), collapse = ","),
")")
# convert to rule and do the tests
rule <- eval(parse(text = cmds))
out <- confront(data, rule)
summary(out)
How do I make parse() accept the escaped characters? Or is there a better way to do this?
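The backslashes have to be escaped twice: once for the string literal that builds the command, and once more for the string literal inside the code that parse() reads. So each \\ in the original pattern becomes \\\\: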
cmds <- paste("validator(",
paste(
map_chr(
fields, function(x)
paste0("grepl('^-?[0-9]\\\\d*(\\\\.\\\\d+)?$', as.character(`", x, "`))")
), collapse = ","),
")")
Testing:
> rule <- eval(parse(text = cmds))
>
> out <- confront(data, rule)
> out
Object of class 'validation'
Call:
confront(dat = data, x = rule)
Rules confronted: 2
With fails : 1
With missings: 0
Threw warning: 0
Threw error : 0
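The two levels of escaping can be seen without the validate package. The following is only a sketch in base R, with a made-up input vector, showing what each stage receives:

```r
# What we type in the script: four backslashes per regex backslash.
cmd <- "grepl('^-?[0-9]\\\\d*(\\\\.\\\\d+)?$', c('1', '-1', '0.5', 'Not Done'))"

# What parse() receives: two backslashes each, i.e. a valid string literal.
cat(cmd, "\n")

# After parsing, grepl() sees the regex with single backslashes.
res <- eval(parse(text = cmd))
print(res)
```

With only two backslashes in the script, parse() would receive a single backslash before d, which is not a valid escape inside a string literal, hence the error in the question.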

Using an anonymous function in mutate

I want to use the character strings from one column of a dataframe as the search string in a sub search of the character strings in another column of the dataframe on a row-by-row basis. I would like to do this using dplyr::mutate. I have figured out a way to do this using an anonymous function and apply, but I feel like apply shouldn't be necessary and I must be doing something wrong with how I'm implementing mutate. (And yes, I know that tools::file_path_sans_ext can give me the final result without needing to use mutate; I just want to understand how to use mutate.)
Here is the code that I think should work but doesn't:
files.vec <- dir(
dir.target,
full.names = T,
recursive = T,
include.dirs = F,
no.. = T
)
library(tools)
files.paths.df <- as.data.frame(
cbind(
path = files.vec,
directory = dirname(files.vec),
file = basename(files.vec),
extension = file_ext(files.vec)
)
)
library(tidyr)
library(dplyr)
files.split.df <- files.paths.df %>%
mutate(
no.ext = function(x) {
sub(paste0(".", x["extension"], "$"), "", x["file"])
}
)
Error in mutate_impl(.data, dots) :
Column `no.ext` is of unsupported type function
Here is the code that works, using apply:
files.split.df <- files.paths.df %>%
mutate(no.ext = apply(., 1, function(x) {
sub(paste0(".", x["extension"], "$"), "", x["file"])
}))
Can this be done without apply?
Apparently what you need is a whole bunch of parentheses. See https://stackoverflow.com/a/36906989/3277050
In your situation it looks like:
files.split.df <- files.paths.df %>%
mutate(
no.ext = (function(x) {sub(paste0(".", x["extension"], "$"), "", x["file"])})(.)
)
So it seems like if you wrap the whole function definition in brackets you can then treat it like a regular function and supply arguments to it.
New Answer
Really this is not the right way to use mutate at all though. I got focused in on the anonymous function part first without looking at what you are actually doing. What you need is a vectorized version of sub. So I used str_replace from the stringr package. Then you can just refer to columns by name because that is the beauty of dplyr:
library(tidyr)
library(dplyr)
library(stringr)
files.split.df <- files.paths.df %>%
mutate(
no.ext = str_replace(file, paste0(".", extension, "$"), ""))
Edit to Answer Comment
To use a user defined function where there isn't an existing vectorized function you could use Vectorize like this:
string_fun <- Vectorize(function(x, y) {sub(paste0(".", x, "$"), "", y)})
files.split.df <- files.paths.df %>%
mutate(
no.ext = string_fun(extension, file))
Or if you really don't want to name the function, which I do not recommend as it is much harder to read:
files.split.df <- files.paths.df %>%
mutate(
no.ext = (Vectorize(function(x, y) {sub(paste0(".", x, "$"), "", y)}))(extension, file))
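As a further variation, mapply() can do the row-by-row pairing inside mutate() directly. This is a sketch on a made-up three-row data frame; strip_ext is a hypothetical helper, and unlike the snippets above it escapes the dot so that it only matches a literal ".".

```r
library(dplyr)

# Made-up stand-in for files.paths.df.
files <- data.frame(
  file = c("a.txt", "b.tar.gz", "notes.md"),
  extension = c("txt", "gz", "md"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: remove ".<ext>" from the end of one file name.
strip_ext <- function(ext, file) sub(paste0("\\.", ext, "$"), "", file)

# mapply() walks both columns in parallel, one (ext, file) pair at a time.
out <- files %>%
  mutate(no.ext = mapply(strip_ext, extension, file, USE.NAMES = FALSE))
print(out$no.ext)
```

Escaping the dot matters for files without an extension: with an empty extension the unescaped pattern ".$" would delete the last character of the file name.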

Extract and paste together multiple columns of a data frame like object using a vector of column names

I have an object (variable rld) which looks a bit like a "data.frame" (see further down the post for details) in that it has columns that can be accessed using $ or [[]].
I have a vector groups containing names of some of its columns (3 in example below).
I generate strings based on combinations of elements in the columns as follows:
paste(rld[[groups[1]]], rld[[groups[2]]], rld[[groups[3]]], sep="-")
I would like to generalize this so that I don't need to know how many elements are in groups.
The following attempt fails:
> paste(rld[[groups]], collapse="-")
Error in normalizeDoubleBracketSubscript(i, x, exact = exact, error.if.nomatch = FALSE) :
attempt to extract more than one element
Here is how I would do it in functional style with a Python dictionary:
map("-".join, zip(*map(rld.get, groups)))
Is there a similar column-getter operator in R ?
As suggested in the comments, here is the output of dput(rld): http://paste.ubuntu.com/23528168/ (I could not paste it directly, since it is huge.)
This was generated using the DESeq2 bioinformatics package, and more precisely, doing something similar to what is described on page 28 of this document: https://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf.
DESeq2 can be installed from bioconductor as follows:
source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")
Reproducible example
One of the solutions worked when running in interactive mode, but failed when the code was put in a library function, with the following error:
Error in do.call(function(...) paste(..., sep = "-"), colData(rld)[groups]) :
second argument must be a list
After some tests, it appears that the problem doesn't occur if the function is in the main calling script, as follows:
library(DESeq2)
library(test.package)
lib_names <- c(
"WT_1",
"mut_1",
"WT_2",
"mut_2",
"WT_3",
"mut_3"
)
file_names <- paste(
lib_names,
"txt",
sep="."
)
wt <- "WT"
mut <- "mut"
genotypes <- rep(c(wt, mut), times=3)
replicates <- c(rep("1", times=2), rep("2", times=2), rep("3", times=2))
sample_table = data.frame(
lib = lib_names,
file_name = file_names,
genotype = genotypes,
replicate = replicates
)
dds_raw <- DESeqDataSetFromHTSeqCount(
sampleTable = sample_table,
directory = ".",
design = ~ genotype
)
# Remove genes with too few read counts
dds <- dds_raw[ rowSums(counts(dds_raw)) > 1, ]
dds$group <- factor(dds$genotype)
design(dds) <- ~ replicate + group
dds <- DESeq(dds)
test_do_paste <- function(dds) {
require(DESeq2)
groups <- head(colnames(colData(dds)), -2)
rld <- rlog(dds, blind=F)
stopifnot(all(groups %in% names(colData(rld))))
combined_names <- do.call(
function (...) paste(..., sep = "-"),
colData(rld)[groups]
)
print(combined_names)
}
test_do_paste(dds)
# This fails (with the same function put in a package)
#test.package::test_do_paste(dds)
The error occurs when the function is packaged as in https://hilaryparker.com/2014/04/29/writing-an-r-package-from-scratch/
Data used in the example:
WT_1.txt
WT_2.txt
WT_3.txt
mut_1.txt
mut_2.txt
mut_3.txt
I posted this issue as a separate question: do.call error "second argument must be a list" with S4Vectors when the code is in a library
Although I have an answer to my initial question, I'm still interested in alternative solutions for the "column extraction using a vector of column names" issue.
We may use either of the following:
do.call(function (...) paste(..., sep = "-"), rld[groups])
do.call(paste, c(rld[groups], sep = "-"))
We can consider a small, reproducible example:
rld <- mtcars[1:5, ]
groups <- names(mtcars)[c(1,3,5,6,8)]
do.call(paste, c(rld[groups], sep = "-"))
#[1] "21-160-3.9-2.62-0" "21-160-3.9-2.875-0" "22.8-108-3.85-2.32-1"
#[4] "21.4-258-3.08-3.215-1" "18.7-360-3.15-3.44-0"
Note, it is your responsibility to ensure all(groups %in% names(rld)) is TRUE, otherwise you get a "subscript out of bounds" or "undefined columns selected" error.
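The check and the do.call() idiom can be wrapped together. The function name paste_columns is invented for this sketch, and it assumes a plain data.frame; for other data-frame-like objects a conversion such as as.data.frame() may be needed first.

```r
# Hypothetical helper: paste any number of columns together row by row.
paste_columns <- function(df, cols, sep = "-") {
  missing_cols <- setdiff(cols, names(df))
  if (length(missing_cols) > 0) {
    stop("Unknown columns: ", paste(missing_cols, collapse = ", "))
  }
  # unname() so a column called "sep" or "collapse" cannot be mistaken
  # for one of paste()'s own arguments.
  do.call(paste, c(unname(as.list(df[cols])), sep = sep))
}

rld <- mtcars[1:5, ]
groups <- c("cyl", "gear", "carb")
combined <- paste_columns(rld, groups)
print(combined)
```

Because the columns are passed as a list, the same call works whatever the length of groups.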
(I am copying your comment as a follow-up)
It seems the methods you propose don't work directly on my object. However, the package I'm using provides a colData function that makes something more similar to a data.frame:
> class(colData(rld))
[1] "DataFrame"
attr(,"package")
[1] "S4Vectors"
do.call(function (...) paste(..., sep = "-"), colData(rld)[groups]) works, but do.call(paste, c(colData(rld)[groups], sep = "-")) fails with an error message I fail to understand (as too often with R...):
> do.call(paste, c(colData(rld)[groups], sep = "-"))
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘mcols’ for signature ‘"character"’

Avoiding backtick characters with dplyr

How can I write the argument of select without backtick characters? I would like to do this so that I can pass in this argument from a variable as a character string.
df <- dat[["__Table"]] %>% select(`__ID` ) %>% mutate(fk_table = "__Table", val = 1)
Changing the argument of select to "__ID" gives this error:
Error: All select() inputs must resolve to integer column positions.
The following do not:
* "__ID"
Unfortunately, the _ characters in column names cannot be avoided since the data is downloaded from a relational database (FileMaker) via ODBC and needs to be written back to the database while preserving the column names.
Ideally, I would like to be able to do the following:
colName <- "__ID"
df <- dat[["__Table"]] %>% select(colName) %>% mutate(fk_table = "__Table", val = 1)
I've also tried eval(parse()):
df <- dat[["__Table"]] %>% select( eval(parse(text="__ID")) ) %>% mutate(fk_table = "__Table", val = 1)
It throws this error:
Error in parse(text = "__ID") : <text>:1:1: unexpected input
1: _
^
By the way, the following does work, but then I'm back to square one (still with backtick symbol).
eval(parse(text="`__ID`"))
References about backtick characters in R:
Removing backticks in R output
What do backticks do in R?
R encoding ASCII backtick
You can use as.name() with select_():
colName <- "__ID"
df <- data.frame(`__ID` = c(1,2,3), `123` = c(4,5,6), check.names = FALSE)
select_(df, as.name(colName))
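In more recent dplyr releases (1.0.0 and later, as far as I know) select_() is deprecated, and a column name held in a character variable can be passed with all_of(). A sketch on the same toy data, with the base R equivalent for comparison:

```r
library(dplyr)

colName <- "__ID"
df <- data.frame(`__ID` = c(1, 2, 3), `123` = c(4, 5, 6), check.names = FALSE)

# all_of() tells select() to treat the string as a column name.
picked <- select(df, all_of(colName))
print(names(picked))

# Base R equivalent: [[ ]] takes the name as a plain string, no backticks.
pulled <- df[[colName]]
print(pulled)
```

Neither form parses the name as R code, so the leading underscores are not a problem.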

Creating formula using very long strings in R

I'm in a situation where I have a vector full of column names for a really large data frame.
Let's assume: x = c("Name", "address", "Gender", ......, "class" ) [approximately 100 variables]
Now, I would like to create a formula which I'll eventually use to create a HoeffdingTree.
I'm creating formula using:
myformula <- as.formula(paste("class ~ ", paste(x, collapse= "+")))
This throws up the following error:
Error in parse(text = x) : <text>:1:360: unexpected 'else'
1:e+spread+prayforsonni+just+want+amp+argue+blxcknicotine+mood+now+right+actually+herapatra+must+simply+suck+there+always+cookies+ever+everything+getting+nice+nigga+they+times+abu+all+alliepickl
The paste part in the above statement works fine but passing it as an argument to as.formula is throwing all kinds of weird problems.
The problem is that you have R keywords as column names. else is a keyword so you can't use it as a regular name.
A simplified example:
s <- c("x", "else", "z")
f <- paste("y~", paste(s, collapse="+"))
formula(f)
# Error in parse(text = x) : <text>:1:10: unexpected '+'
# 1: y~ x+else+
# ^
The solution is to wrap your words in backticks "`" so that R will treat them as non-syntactic variable names.
f <- paste("y~", paste(sprintf("`%s`", s), collapse="+"))
formula(f)
# y ~ x + `else` + z
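With around 100 column names it can help to find the problematic ones first. One possible sketch compares each name with make.names(), which alters reserved words and non-syntactic names; the example terms here are made up.

```r
# Made-up column names: one fine, one reserved word, one with a space.
terms <- c("Name", "else", "my var")

# make.names() changes anything that is not a valid, non-reserved name.
needs_quotes <- terms != make.names(terms)
print(terms[needs_quotes])

# Backtick only the names that need it, then build the formula.
rhs <- ifelse(needs_quotes, sprintf("`%s`", terms), terms)
f <- as.formula(paste("class ~", paste(rhs, collapse = " + ")))
print(all.vars(f))
```

Quoting every term, as in the answer above, is equally valid; the selective version mainly keeps the printed formula easier to read.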
You can reduce your data-set first
dat_small <- dat[,c("class",x)]
and then use
myformula <- as.formula("class ~ .")
The . stands for all other columns (all except class).
You may try reformulate
reformulate(setdiff(x, 'class'), response='class')
#class ~ Name + address + Gender
where 'x' is
x <- c("Name", "address", "Gender", 'class')
If R keywords are in the 'x', you can do
reformulate('.', response='class')
#class ~ .
