Creating formula using very long strings in R

I'm in a situation where I have a vector full of column names for a really large data frame.
Let's assume: x = c("Name", "address", "Gender", ......, "class") [approximately 100 variables]
Now, I would like to create a formula which I'll eventually use to create a HoeffdingTree.
I'm creating the formula using:
myformula <- as.formula(paste("class ~ ", paste(x, collapse= "+")))
This throws up the following error:
Error in parse(text = x) : :1:360: unexpected 'else'
1:e+spread+prayforsonni+just+want+amp+argue+blxcknicotine+mood+now+right+actually+herapatra+must+simply+suck+there+always+cookies+ever+everything+getting+nice+nigga+they+times+abu+all+alliepickl
The paste part of the above statement works fine, but passing its result to as.formula fails with the error shown.

The problem is that some of your column names are R keywords. else is a reserved word, so the parser cannot read it as a regular name.
A simplified example:
s <- c("x", "else", "z")
f <- paste("y~", paste(s, collapse="+"))
formula(f)
# Error in parse(text = x) : <text>:1:10: unexpected '+'
# 1: y~ x+else+
# ^
The solution is to wrap your words in backticks "`" so that R will treat them as non-syntactic variable names.
f <- paste("y~", paste(sprintf("`%s`", s), collapse="+"))
formula(f)
# y ~ x + `else` + z
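A minimal sketch of the same idea, assuming you would rather backtick only the names that need it. The helper name quote_names is hypothetical (not part of base R); it relies on make.names() changing any name that is not syntactically valid, reserved words included.

```r
# Hypothetical helper: backtick-quote only the names that are not
# syntactically valid R names (reserved words, names with spaces, ...).
quote_names <- function(nms) {
  needs_quote <- nms != make.names(nms)  # make.names() alters invalid names
  ifelse(needs_quote, sprintf("`%s`", nms), nms)
}

s <- c("x", "else", "my var", "z")
f <- as.formula(paste("y ~", paste(quote_names(s), collapse = " + ")))

# The variables the formula refers to, with backticks stripped
all.vars(f)
```

Quoting every name, as in the sprintf() call above, is simpler and equally valid; selective quoting only keeps the printed formula easier to read.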

You can reduce your data set first:
dat_small <- dat[,c("class",x)]
and then use
myformula <- as.formula("class ~ .")
The . stands for all other columns of the data passed to the model (here dat_small), that is, every column except class.

You may try reformulate
reformulate(setdiff(x, 'class'), response='class')
#class ~ Name + address + Gender
where 'x' is
x <- c("Name", "address", "Gender", 'class')
If 'x' contains R keywords, you can use
reformulate('.', response='class')
#class ~ .
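As a rough illustration of why the dot form sidesteps the keyword problem: the column names never pass through parse(), so a column called else is acceptable. The data frame below is made up for the example.

```r
# Toy data with a column whose name is a reserved word.
# check.names = FALSE stops data.frame() from renaming it.
set.seed(1)
dat <- data.frame(
  class  = rnorm(20),
  x      = rnorm(20),
  `else` = rnorm(20),
  check.names = FALSE
)

# The dot is expanded from the column names of `dat`,
# so no formula string containing "else" is ever parsed.
fit <- lm(class ~ ., data = dat)
coef(fit)
```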

Related

Combine names of variables (in a formula) retrieved by grep in R

I'd like to use the output of grep directly in a formula.
In other words, I use grep to retrieve the variables I want to select and store them in a vector.
The cool thing would be to be able to use this vector in a formula.
That is, something like
var.to.retrieve <- grep(pattern="V", x=data)
lm(var.dep~var.to.retrieve)
but this doesn't work...
I've tried the solution paste(var.to.retrieve, collapse="+") but this doesn't work either.
EDIT
The solution could be
formula <- as.formula(paste(var.dep, paste(var.to.retrieve, collapse="+"), sep="~"))
but I cannot imagine there is no more elegant way to do it

reformulate(var.to.retrieve, response = var.dep) is basically this.
var.dep <- "y"
var.to.retrieve <- LETTERS[1:10]
r1 <- reformulate(var.to.retrieve, response = var.dep)
r2 <- as.formula(
paste(var.dep,
paste(var.to.retrieve, collapse = "+"),
sep = "~")
)
identical(r1,r2) ## TRUE

var_to_retrieve <- colnames(data)[grep(pattern = "V", x = colnames(data))]
lm(formula(paste(var.dep, paste(var_to_retrieve, collapse = "+"), sep = "~")),
data = data)
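A short sketch combining the two answers above, assuming the goal is to select columns by pattern and fit a model. The data frame and its column names are invented for illustration; grep(..., value = TRUE) returns the matching names directly rather than their positions.

```r
# Made-up data: a response y, two "V" predictors and one unrelated column
set.seed(42)
dat <- data.frame(y = rnorm(10), V1 = rnorm(10), V2 = rnorm(10), other = rnorm(10))

# value = TRUE returns the matching column names instead of indices
vars <- grep("^V", names(dat), value = TRUE)

# reformulate() builds y ~ V1 + V2 from the character vector
fit <- lm(reformulate(vars, response = "y"), data = dat)
formula(fit)
```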

Function to create new binary variables within existing dataframe?

This question is related to a previous topic:
How to use custom function to create new binary variables within existing dataframe?
I would like to use a similar function but be able to use a vector to specify ICD9 diagnosis variables within the dataframe to search for (e.g., "diag_1", "diag_2","diag_1", etc )
I tried
y<-c("diag_1","diag_2","diag_1")
diagnosis_func(patient_db, y, "2851", "Anemia")
but I get the following error:
Error in `[[<-`(`*tmp*`, i, value = value) :
recursive indexing failed at level 2
Below is the working function by Benjamin from the referenced post. However, it works for only one diagnosis variable at a time. Ultimately I need to create a new binary variable that indicates whether a patient has a specific diagnosis by querying the 25 diagnosis variables of the dataframe.
*target_col, which holds the ICD9 diagnosis variables "diag_1"..."diag_20", is the argument I would like to pass as a vector.
diagnosis_func <- function(data, target_col, icd, new_col){
pattern <- sprintf("^(%s)",
paste0(icd, collapse = "|"))
data[[new_col]] <- grepl(pattern = pattern,
x = data[[target_col]]) + 0L
data
}
diagnosis_func(patient_db, "diag_1", "2851", "Anemia")
This non-function version works for multiple diagnosis variables. However, I have not figured out how to turn it into a function like the one above.
pattern = paste("^(", paste0("2851", collapse = "|"), ")", sep = "")
df$anemia<-ifelse(rowSums(sapply(df[c("diag_1","diag_2","diag_3")], grepl, pattern = pattern)) != 0,"1","0")
Any help or guidance on how to get this function to work would be greatly appreciated.
Best,
Albit

Try this modified version of Benjamin's function:
diagnosis_func <- function(data, target_col, icd, new_col){
pattern <- sprintf("^(%s)",
paste0(icd, collapse = "|"))
new <- apply(data[target_col], 2, function(x) grepl(pattern=pattern, x)) + 0L
data[[new_col]] <- ifelse(rowSums(new)>0, 1,0)
data
}
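A small usage sketch for the modified function, repeated here so the example runs on its own. The patient_db below is toy data invented for illustration; only the column naming follows the question.

```r
# The modified function from the answer above
diagnosis_func <- function(data, target_col, icd, new_col) {
  pattern <- sprintf("^(%s)", paste0(icd, collapse = "|"))
  # One logical column per diagnosis variable, converted to 0/1
  new <- apply(data[target_col], 2, function(x) grepl(pattern = pattern, x)) + 0L
  # 1 if any of the diagnosis variables matched, 0 otherwise
  data[[new_col]] <- ifelse(rowSums(new) > 0, 1, 0)
  data
}

# Toy data (assumed): three patients, three diagnosis columns
patient_db <- data.frame(
  diag_1 = c("2851", "4019", "25000"),
  diag_2 = c("4019", "28510", "4280"),
  diag_3 = c("25000", "4280", "4019"),
  stringsAsFactors = FALSE
)

y <- c("diag_1", "diag_2", "diag_3")
out <- diagnosis_func(patient_db, y, "2851", "Anemia")
out$Anemia
# [1] 1 1 0
```

The second patient is flagged because the pattern is anchored only at the start, so "28510" also matches "2851".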

dplyr: Renaming variables with rename_

I am trying to rename several variables in a chain:
df_foo = data_frame(
a.a = 1:10,
"b...b" = 1:10,
"cc..d" = 1:10
)
df_foo %>%
rename_(
.dots = setNames(
names(.),
gsub("[[:punct:]]", "", names(.)))
)
This works fine, but when there is a space in the name of one of the variables:
df_foo = data_frame(
a.a = 1:10,
"b...b" = 1:10,
"c c..d" = 1:10
)
df_foo %>%
rename_(
.dots = setNames(
names(.),
gsub("[[:punct:]]", "", names(.)))
)
I get this error:
Error in parse(text = x) : <text>:1:3: unexpected symbol
1: c c..d
^
I am not sure where this stems from since when I run gsub directly:
setNames(
names(df_foo),
gsub("[[:punct:]]", "", names(df_foo)))
I do not get an error. Not sure what is going on here.
This is now raised as issue #2391 on the dplyr GH issues page.

In general: I strongly suggest you never use variable names with spaces. They are a pain and will often cause more trouble than they are worth.
Here is the cause of this error.
rename_ dispatches to dplyr:::rename_.data.frame. The first line of that function is:
dots <- lazyeval::all_dots(.dots, ...)
That lazyeval function then calls lazyeval::as.lazy_dots, which uses lazyeval::as.lazy, which in turn dispatches to lazyeval:::as.lazy.character, which calls lazy_(parse(text = x)[[1]], env). parse() expects a valid R expression as its text argument:
text: character vector. The text to parse. Elements are treated as if they were lines of a file. (from help("parse"))
This is why rename_ doesn't seem to like character vectors with spaces and we get "Error in parse(text = x)":
lazyeval:::as.lazy(names(df_foo)[2])
<lazy>
expr: b...b
env: <environment: base>
lazyeval:::as.lazy(names(df_foo)[3])
Error in parse(text = x) : <text>:1:3: unexpected symbol
1: c c..d
^
I'm not aware of a solution, other than just using base R for this simple renaming.
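A minimal base R sketch of that renaming, assuming a plain data.frame stands in for the tibble from the question. Adding [:space:] to the character class is my own addition, so that the cleaned names contain no spaces either.

```r
# Same column names as in the question; check.names = FALSE keeps them as-is
df_foo <- data.frame(
  a.a      = 1:10,
  "b...b"  = 1:10,
  "c c..d" = 1:10,
  check.names = FALSE
)

# Assigning to names() never parses the names as R code,
# so spaces and punctuation cause no error here.
names(df_foo) <- gsub("[[:punct:][:space:]]", "", names(df_foo))
names(df_foo)
# [1] "aa"  "bb"  "ccd"
```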

Extract and paste together multiple columns of a data frame like object using a vector of column names

I have an object (variable rld) which looks a bit like a "data.frame" (see further down the post for details) in that it has columns that can be accessed using $ or [[]].
I have a vector groups containing names of some of its columns (3 in example below).
I generate strings based on combinations of elements in the columns as follows:
paste(rld[[groups[1]]], rld[[groups[2]]], rld[[groups[3]]], sep="-")
I would like to generalize this so that I don't need to know how many elements are in groups.
The following attempt fails:
> paste(rld[[groups]], collapse="-")
Error in normalizeDoubleBracketSubscript(i, x, exact = exact, error.if.nomatch = FALSE) :
attempt to extract more than one element
Here is how I would do it in a functional style with a Python dictionary:
map("-".join, zip(*map(rld.get, groups)))
Is there a similar column-getter operator in R ?
As suggested in the comments, here is the output of dput(rld): http://paste.ubuntu.com/23528168/ (I could not paste it directly, since it is huge.)
This was generated using the DESeq2 bioinformatics package, more precisely by doing something similar to what is described on page 28 of this document: https://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf.
DESeq2 can be installed from bioconductor as follows:
source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")
Reproducible example
One of the solutions worked when running in interactive mode, but failed when the code was put in a library function, with the following error:
Error in do.call(function(...) paste(..., sep = "-"), colData(rld)[groups]) :
second argument must be a list
After some tests, it appears that the problem doesn't occur if the function is in the main calling script, as follows:
library(DESeq2)
library(test.package)
lib_names <- c(
"WT_1",
"mut_1",
"WT_2",
"mut_2",
"WT_3",
"mut_3"
)
file_names <- paste(
lib_names,
"txt",
sep="."
)
wt <- "WT"
mut <- "mut"
genotypes <- rep(c(wt, mut), times=3)
replicates <- c(rep("1", times=2), rep("2", times=2), rep("3", times=2))
sample_table = data.frame(
lib = lib_names,
file_name = file_names,
genotype = genotypes,
replicate = replicates
)
dds_raw <- DESeqDataSetFromHTSeqCount(
sampleTable = sample_table,
directory = ".",
design = ~ genotype
)
# Remove genes with too few read counts
dds <- dds_raw[ rowSums(counts(dds_raw)) > 1, ]
dds$group <- factor(dds$genotype)
design(dds) <- ~ replicate + group
dds <- DESeq(dds)
test_do_paste <- function(dds) {
require(DESeq2)
groups <- head(colnames(colData(dds)), -2)
rld <- rlog(dds, blind=F)
stopifnot(all(groups %in% names(colData(rld))))
combined_names <- do.call(
function (...) paste(..., sep = "-"),
colData(rld)[groups]
)
print(combined_names)
}
test_do_paste(dds)
# This fails (with the same function put in a package)
#test.package::test_do_paste(dds)
The error occurs when the function is packaged as in https://hilaryparker.com/2014/04/29/writing-an-r-package-from-scratch/
Data used in the example:
WT_1.txt
WT_2.txt
WT_3.txt
mut_1.txt
mut_2.txt
mut_3.txt
I posted this issue as a separate question: do.call error "second argument must be a list" with S4Vectors when the code is in a library
Although I have an answer to my initial question, I'm still interested in alternative solutions for the "column extraction using a vector of column names" issue.

We may use either of the following:
do.call(function (...) paste(..., sep = "-"), rld[groups])
do.call(paste, c(rld[groups], sep = "-"))
We can consider a small, reproducible example:
rld <- mtcars[1:5, ]
groups <- names(mtcars)[c(1,3,5,6,8)]
do.call(paste, c(rld[groups], sep = "-"))
#[1] "21-160-3.9-2.62-0" "21-160-3.9-2.875-0" "22.8-108-3.85-2.32-1"
#[4] "21.4-258-3.08-3.215-1" "18.7-360-3.15-3.44-0"
Note that it is your responsibility to ensure all(groups %in% names(rld)) is TRUE; otherwise you get a "subscript out of bounds" or "undefined columns selected" error.
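The check and the do.call(paste, ...) idiom can be wrapped together. This is a sketch for ordinary data frames and lists only; the function name paste_columns is hypothetical, and it is not claimed to work on the S4 objects discussed in the question.

```r
# Hypothetical wrapper: validate the column names, then paste row-wise
paste_columns <- function(x, cols, sep = "-") {
  missing_cols <- setdiff(cols, names(x))
  if (length(missing_cols) > 0) {
    stop("Unknown column(s): ", paste(missing_cols, collapse = ", "))
  }
  # unname() so that a column called "sep" or "collapse" is not
  # mistaken for one of paste()'s own arguments
  do.call(paste, c(unname(as.list(x[cols])), sep = sep))
}

rld <- mtcars[1:2, ]
paste_columns(rld, c("mpg", "cyl"))
# [1] "21-6" "21-6"
```

Failing early with the list of missing names tends to be easier to debug than the generic subscript errors mentioned above.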
(I am copying your comment as a follow-up)
It seems the methods you propose don't work directly on my object. However, the package I'm using provides a colData function that makes something more similar to a data.frame:
> class(colData(rld))
[1] "DataFrame"
attr(,"package")
[1] "S4Vectors"
do.call(function (...) paste(..., sep = "-"), colData(rld)[groups]) works, but do.call(paste, c(colData(rld)[groups], sep = "-")) fails with an error message I fail to understand (as too often with R...):
> do.call(paste, c(colData(rld)[groups], sep = "-"))
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘mcols’ for signature ‘"character"’

Pass a vector of variables into lm() formula

I was trying to automate a piece of my code so that programming becomes less tedious.
Basically I was trying to do a stepwise selection of variables using fastbw() in the rms package. I would like to pass the list of variables selected by fastbw() into a formula as y ~ x1+x2+x3, with "x1" "x2" "x3" being the variables selected by fastbw().
Here is the code I tried, which did not work:
olsOAW0.r060 <- ols(roll_pct~byoy+trans_YoY+change18m,
subset= helper=="POPNOAW0_r060",
na.action = na.exclude,
data = modelready)
OAW0 <- fastbw(olsOAW0.r060, rule="p", type="residual", sls= 0.05)
vec <- as.vector(OAW0$names.kept, mode="any")
b <- paste(vec, sep ="+") ##I even tried b <- paste(OAW0$names.kept, sep="+")
bestp.OAW0.r060 <- lm(roll_pct ~ b ,
data = modelready,
subset = helper =="POPNOAW0_r060",
na.action = na.exclude)
I am new to R and still climbing the steep learning curve, so I apologize for any obvious programming blunders.

You're almost there. You just have to paste the entire formula together, something like this:
paste("roll_pct ~ ",b,sep = "")
Then coerce it to an actual formula using as.formula and pass that to lm. Technically, I think lm may coerce a character string itself, but coercing it yourself is generally safer. (Some functions that expect formulas won't do the coercion for you, others will.)

You would actually need to use collapse instead of sep when defining b.
b <- paste(OAW0$names.kept, collapse="+")
Then you can put it in joran's answer
paste("roll_pct ~ ",b,sep = "")
or just use:
paste("roll_pct ~ ",paste(OAW0$names.kept, collapse="+"),sep = "")
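To make the difference between sep and collapse concrete, here is a small sketch using mtcars. The vector vars is a stand-in I chose for OAW0$names.kept, since the original data is not available.

```r
vars <- c("wt", "hp")  # stand-in for OAW0$names.kept (assumption)

# sep only separates the arguments of a single paste() call;
# with one vector argument the result is unchanged
paste(vars, sep = "+")
# [1] "wt" "hp"

# collapse joins the elements of the vector into one string
b <- paste(vars, collapse = "+")
b
# [1] "wt+hp"

# Build the full formula string, coerce it, and pass it to lm()
fml <- as.formula(paste("mpg ~ ", b, sep = ""))
fit <- lm(fml, data = mtcars)
```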

I ran into a similar issue today. If you want to make it even more generic, so that you don't need a fixed class name, you can use
frmla <- as.formula(paste(colnames(modelready)[1], paste(colnames(modelready)[2:ncol(modelready)], sep = "",
collapse = " + "), sep = " ~ "))
This assumes that you have class variable or the dependent variable in the first column but indexing can be easily switched to last column as:
frmla <- as.formula(paste(colnames(modelready)[ncol(modelready)], paste(colnames(modelready)[1:(ncol(modelready)-1)], sep = "",
collapse = " + "), sep = " ~ "))
Then continue with lm using:
bestp.OAW0.r060 <- lm(frmla , data = modelready, ... )

If you're looking for something less verbose:
fm <- as.formula( paste( colnames(df)[i], ".", sep=" ~ "))
# i is the index of the outcome column
Here it is in a function:
getFormula<-function(target, df) {
i <- match(target, colnames(df)) # exact match; grep() would also return partial matches
as.formula(paste(colnames(df)[i],
".",
sep = " ~ "))
}
fm <- getFormula("myOutcomeColumnName", myDataFrame)
rp <- rpart(fm, data = myDataFrame) # Use the formula to build a model

One trick that I use in similar situations is to subset your data and simply use e.g. lm(dep_var ~ ., data = your_data).
Example
data(mtcars)
ind_vars <- c("mpg", "cyl")
dep_var <- "hp"
temp_subset <- dplyr::select(mtcars, dplyr::all_of(c(dep_var, ind_vars)))
lm(hp ~., data = temp_subset)

To simplify and collect the answers above in a single function:
my_formula<- function(colPosition, trainSet){
dep_part<- paste(colnames(trainSet)[colPosition],"~",sep=" ")
ind_part<- paste(colnames(trainSet)[-colPosition],collapse=" + ")
dt_formula<- as.formula(paste(dep_part,ind_part,sep=" "))
return(dt_formula)
}
To use it:
my_formula( dependent_var_position, myTrainSet)
