How to convert R formula to text? - r
I have trouble treating a formula as text. I am trying to concatenate the formula into the title of a graph, but when I handle the formula like a string, I fail:
model <- lm(celkem ~ rok + mesic)
formula(model)
# celkem ~ rok + mesic
This is fine. Now I want to build a string like "my text celkem ~ rok + mesic", and this is where the problem comes in:
paste("my text", formula(model))
# [1] "my text ~" "my text celkem" "my text rok + mesic"
paste("my text", as.character(formula(model)))
# [1] "my text ~" "my text celkem" "my text rok + mesic"
paste("my text", toString(formula(model)))
# [1] "my text ~, celkem, rok + mesic"
Now I see there is a sprint function in package gtools, but I think this is such a basic thing that it deserves a solution within the default environment!!
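For context, a minimal sketch of why paste returns three strings here. It assumes nothing beyond base R; the names parts and whole are only illustrative. A formula is stored as a call to ~ with two arguments, so as.character converts its three components one by one, while deparse converts the call as a whole:

```r
frm <- celkem ~ rok + mesic   # a formula can be created without any data

# A formula is a call to `~` with two arguments, so it has length 3.
length(frm)
class(frm)

# as.character() converts each of the three components separately,
# which is why paste() recycles "my text" three times.
parts <- as.character(frm)
parts

# deparse() turns the whole call back into source text instead.
whole <- paste(deparse(frm), collapse = " ")
paste("my text", whole)
# [1] "my text celkem ~ rok + mesic"
```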
A short solution, taken from the function as.character.formula in the package formula.tools:
frm <- celkem ~ rok + mesic
Reduce(paste, deparse(frm))
# [1] "celkem ~ rok + mesic"
library(formula.tools)
as.character(frm)
# [1] "celkem ~ rok + mesic"
Reduce is useful for long formulas, which deparse splits into several strings:
frm <- formula(paste("y ~ ", paste0("x", 1:12, collapse = " + ")))
deparse(frm)
# [1] "y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + "
# [2] " x12"
Reduce(paste, deparse(frm))
# [1] "y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + x12"
The split happens because deparse tries to break lines once they exceed width.cutoff, which defaults to 60L (see ?deparse).
Try format:
paste("my text", format(frm))
## [1] "my text celkem ~ rok + mesic"
Simplest solution covering everything:
f <- formula(model)
paste(deparse(f, width.cutoff = 500), collapse="")
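As a sketch of how this fits the original goal (a graph title), the one-liner can be wrapped in a small helper. The name formula_to_text is hypothetical, not a base R function, and the model on the built-in mtcars data is only a self-contained stand-in for the asker's celkem ~ rok + mesic:

```r
# Hypothetical helper, not part of base R: formula -> one string
formula_to_text <- function(f) {
  paste(deparse(f, width.cutoff = 500L), collapse = "")
}

# Self-contained model on the built-in mtcars data
# (stand-in for lm(celkem ~ rok + mesic) from the question)
model <- lm(mpg ~ wt + hp, data = mtcars)
title_text <- paste("my text", formula_to_text(formula(model)))
title_text
# [1] "my text mpg ~ wt + hp"

# The string can then go straight into a plot title, e.g.:
# plot(mtcars$wt, mtcars$mpg, main = title_text)
```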
R 4.0.0 (released 2020-04-24) introduced deparse1 which never splits the result into multiple strings:
f <- y ~ a + b + c + d + e + f + g + h + i + j + k + l + m + n + o +
p + q + r + s + t + u + v + w + x + y + z
deparse(f)
# [1] "y ~ a + b + c + d + e + f + g + h + i + j + k + l + m + n + o + "
# [2] " p + q + r + s + t + u + v + w + x + y + z"
deparse1(f)
# [1] "y ~ a + b + c + d + e + f + g + h + i + j + k + l + m + n + o + p + q + r + s + t + u + v + w + x + y + z"
However, it still has a width.cutoff argument (default and maximum: 500). Beyond that width, line breaks are still introduced, but the pieces are joined by collapse (default: " ") rather than \n, which leaves extra whitespace even with collapse = "" (use gsub to remove it if needed, see Ross D's answer):
> f <- rlang::parse_expr( paste0("y~", paste0(rep(letters, 20), collapse="+")))
> deparse1(f, collapse = "")
[1] "y ~ a + b + c + d + e + f + g + h + i + j + k + l + m + n + o + p + q + r + s + t + u + v + w + x + y + z + a + b + c + d + e + f + g + h + i + j + k + l + m + n + o + p + q + r + s + t + u + v + w + x + y + z + a + b + c + d + e + f + g + h + i + j + k + l + m + n + o + p + q + r + s + t + u + v + w + x + y + z + a + b + c + d + e + f + g + h + i + j + k + l + m + n + o + p + q + r + s + t + u + v + w + x + y + z + a + b + c + d + e + f + g + h + i + j + k + l + m + n + o + p + q + r + s + t + u + v + w + x + y + z + a + b + c + d + e + f + g + h + i + j + k + l + m + n + o + p + q + r + s + t + u + v + w + x + y + z + a + b + c + d + e + f + g + h + i + j + k + l + m + n + o + p + q + r + s + t + u + v + w + x + y + z + a + b + c + d + e + f + g + h + i + j + k + l + m + n + o + p + q + r + s + t + u + v + w + x + y + z + a + b + c + d + e + f + g + h + i + j + k + l + m + n + o + p + q + r + s + t + u + v + w + x + y + z + a + b + c + d + e + f + g + h + i + j + k + l + m + n + o + p + q + r + s + t + u + v + w + x + y + z + a + b + c + d + e + f + g + h + i + j + k + l + m + n + o + p + q + r + s + t + u + v + w + x + y + z + a + b + c + d + e + f + g + h + i + j + k + l + m + n + o + p + q + r + s + t + u + v + w + x + y + z + a + b + c + d + e + f + g + h + i + j + k + l + m + n + o + p + q + r + s + t + u + v + w + x + y + z + a + b + c + d + e + f + g + h + i + j + k + l + m + n + o + p + q + r + s + t + u + v + w + x + y + z + a + b + c + d + e + f + g + h + i + j + k + l + m + n + o + p + q + r + s + t + u + v + w + x + y + z + a + b + c + d + e + f + g + h + i + j + k + l + m + n + o + p + q + r + s + t + u + v + w + x + y + z + a + b + c + d + e + f + g + h + i + j + k + l + m + n + o + p + q + r + s + t + u + v + w + x + y + z + a + b + c + d + e + f + g + h + i + j + k + l + m + n + o + p + q + r + s + t + u + v + w + x + y + z + a + b + c + d + e + f + g + h + i + j + k + l + m + n + o + p + q + r + s + t + u + v + w + x + y + z + a + b + c + d 
+ e + f + g + h + i + j + k + l + m + n + o + p + q + r + s + t + u + v + w + x + y + z"
To use it in R < 4.0.0, use the backports package (recommended) or copy its implementation:
# Part of the R package, https://www.R-project.org
#
# Copyright (C) 1995-2019 The R Core Team
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# A copy of the GNU General Public License is available at
# https://www.R-project.org/Licenses/
deparse1 <- function (expr, collapse = " ", width.cutoff = 500L, ...)
paste(deparse(expr, width.cutoff, ...), collapse = collapse)
Or, as an alternative to Julius's version (note: your code was not self-contained):
celkem = 1
rok = 1
mesic = 1
model <- lm(celkem ~ rok + mesic)
paste("my model ", deparse(formula(model)))
The easiest way is this:
f = formula(model)
paste(f[2],f[3],sep='~')
done!
Here is a solution that uses print.formula. It may look like a trick, but it does the job in one line, avoids deparse, and needs no extra package. I just capture the output of printing the formula with capture.output:
paste("my text",capture.output(print(formula(celkem ~ rok + mesic))))
[1] "my text celkem ~ rok + mesic"
For a long formula:
ff <- formula(paste("y ~ ", paste0("x", 1:12, collapse = " + ")))
paste("my text",paste(capture.output(print(ff)), collapse= ' '))
"my text y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + x12"
Another deparse-based solution is rlang::expr_text() (and rlang::quo_text()):
f <- Y ~ 1 + a + b + c + d + e + f + g + h + i +j + k + l + m + n + o + p + q + r + s + t + u
rlang::quo_text(f)
#> [1] "Y ~ 1 + a + b + c + d + e + f + g + h + i + j + k + l + m + n + \n o + p + q + r + s + t + u"
They do have a width argument to avoid line breaks, but that is limited to 500 characters too. At least it's a single function that is most likely loaded already...
Then add gsub to remove whitespace (note that this pattern strips every space, including those around ~ and +):
gsub(" ", "", paste(format(frm), collapse = ""))
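If only the extra run of spaces at the line break should go, a variant is to squeeze whitespace instead of deleting it. This is a sketch in base R; the names raw, no_spaces and squeezed are illustrative, and deparse is used in place of format so that the example does not depend on how format splits long formulas:

```r
frm <- formula(paste("y ~", paste0("x", 1:12, collapse = " + ")))

# deparse() breaks the long formula; collapsing leaves a run of spaces
raw <- paste(deparse(frm), collapse = "")

no_spaces <- gsub(" ", "", raw, fixed = TRUE)  # removes every space
squeezed  <- gsub("\\s+", " ", raw)            # collapses each run to one space

no_spaces
# [1] "y~x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12"
squeezed
# [1] "y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + x12"
```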
Was optimizing some functions today. A few approaches that have not been mentioned so far.
f <- Y ~ 1 + a + b + c + d + e + f + g + h + i + j + k + l + m + n + o + p + q + r + s + t + u
bench::mark(
expression = as.character(as.expression(f)),
deparse = deparse(f, width.cutoff = 500L),
deparse1 = deparse1(f),
tools = formula.tools:::as.character.formula(f),
stringi = stringi::stri_c(f),
I = as.character(I(f)),
as = as(f, "character"),
txt = gettext(f),
txtf = gettextf(f),
sub = sub("", "", f),
chr = as.character(f),
str = substring(f, 1L),
paste = paste0(f),
)[c(1, 3, 5, 7)]
#> # A tibble: 13 x 3
#> expression median mem_alloc
#> <bch:expr> <bch:tm> <bch:byt>
#> 1 expression 15.4us 0B
#> 2 deparse 31us 0B
#> 3 deparse1 34us 0B
#> 4 tools 58.7us 1.74MB
#> 5 stringi 67us 3.09KB
#> 6 I 64.1us 0B
#> 7 as 100.5us 521.61KB
#> 8 txt 83.4us 0B
#> 9 txtf 85.8us 3.12KB
#> 10 sub 64.6us 0B
#> 11 chr 60us 0B
#> 12 str 62.8us 0B
#> 13 paste 63.5us 0B
Related
How to remove the quotation mark in formula object?
Here is an example:
formula <- Y ~ A + B + C + D + E + F + G
pryr::substitute_q(formula, list(Y = as.name('Ya + Yb')))
# `Ya + Yb` ~ A + B + C + D + E + F + G
What I am hoping for is:
Ya + Yb ~ A + B + C + D + E + F + G
I have tried noquote(), as.symbol(), as.name() and so on, but none of them work.
Why not use update from base?
update(formula, Ya + Yb ~ .)
# Ya + Yb ~ A + B + C + D + E + F + G
or
x <- "Ya + Yb"
update(formula, paste(x, "~ ."))
# Ya + Yb ~ A + B + C + D + E + F + G
pryr::substitute_q(formula, list(Y = quote(Ya + Yb)))
# Ya + Yb ~ A + B + C + D + E + F + G
Solve the recursion
I am trying to solve the recursion T(n) = T(n/5) + n^2 and I can't figure out how to continue after the following steps:
T(k) = T(k/5) + k^2
= (T(k/25) + k^2/25) + k^2
= (T(k/125) + k^2/625) + k^2/25 + k^2
= T(1) + … + k^2/625 + k^2/25 + k^2
= k^2 + k^2/25 + k^2/625 + … + T(1)
= k^2 (1 + 1/25 + 1/625 + …) + T(1)
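One way to finish the derivation, as a sketch. It assumes that n is a power of 5 and that T(1) is a constant; the question states neither, so both are assumptions. The bracketed sum is a geometric series with ratio 1/25:

```latex
\begin{align*}
T(n) &= T\!\left(\frac{n}{5}\right) + n^2 \\
     &= T\!\left(\frac{n}{5^2}\right) + \frac{n^2}{25} + n^2 \\
     &= T\!\left(\frac{n}{5^j}\right) + n^2 \sum_{i=0}^{j-1} \frac{1}{25^i}
        \qquad \text{(after } j \text{ substitutions)} \\
     &= T(1) + n^2 \sum_{i=0}^{\log_5 n - 1} \frac{1}{25^i}
        \qquad (j = \log_5 n) \\
     &\le T(1) + n^2 \cdot \frac{1}{1 - 1/25}
      = T(1) + \frac{25}{24}\, n^2 .
\end{align*}
```

Since the first term of the sum alone already gives T(n) >= n^2, the two bounds together give T(n) = Θ(n^2).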
Big O of shrinking list?
Want to make sure I have this right.
int n = 20;
while (n > 0)
    int index = 0
    while (index < n)
        index++
    n--
The Big O of this is: n + (n-1) + (n-2) + (n-3) + … + (n-n). Is that still technically O(N)?
Prove by induction:
1 + 2 + 3 + ... + n = n(n + 1) / 2
1 + 2 + 3 + ... + n = O(n^2)
Base case: n = 1
1 = 1(1 + 1) / 2
1 = 2 / 2
1 = 1
Assume true for n = k:
1 + 2 + 3 + ... + k = k(k + 1) / 2
Prove true for n = k + 1:
1 + 2 + 3 + ... + k + (k + 1) = (k + 1)(k + 1 + 1) / 2
k(k + 1)/2 + (k + 1) = (k + 1)(k + 1 + 1) / 2
k(k + 1)/2 + 2(k + 1) / 2 = (k + 1)(k + 1 + 1) / 2
(k^2 + k)/2 + (2k + 2) / 2 = (k + 1)(k + 1 + 1) / 2
(k^2 + k + 2k + 2) / 2 = (k + 1)(k + 1 + 1) / 2
(k^2 + 3k + 2) / 2 = (k + 1)(k + 2) / 2
(k^2 + 3k + 2) / 2 = (k^2 + 2k + k + 2) / 2
(k^2 + 3k + 2) / 2 = (k^2 + 3k + 2) / 2
Therefore:
1 + 2 + 3 + ... + n = n(n + 1) / 2
1 + 2 + 3 + ... + n = (n^2 + n) / 2
1 + 2 + 3 + ... + n = O(n^2)
If you work it out, it's the Nth triangular number, and therefore O(N(N + 1) / 2), which is O(N^2), not O(N).
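The count can also be checked empirically. Below is a sketch that transcribes the loop from the question into R and counts how often the inner body runs; the function name count_inner_steps is made up for this illustration:

```r
# Count how many times the inner loop body executes for a given n
count_inner_steps <- function(n) {
  steps <- 0L
  while (n > 0L) {
    index <- 0L
    while (index < n) {
      index <- index + 1L
      steps <- steps + 1L
    }
    n <- n - 1L
  }
  steps
}

count_inner_steps(20L)
# [1] 210   (= 20 * 21 / 2)

# Compare with the closed form n(n + 1) / 2 for several n
ns <- 1:30
all(sapply(ns, count_inner_steps) == ns * (ns + 1) / 2)
# [1] TRUE
```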
Why glm make an input error on this function
I'm trying to run a glm in R, but it gives me an error that I can't figure out how to solve:
> GLM.3 <- glm(log(Total_Pass + 1) ~ Total_Pass + Total_Buzz + dm_plant + dm_cdeagua + dm_cultivo + dm_humed + dm_bnativ + dm_snaspe + Cultivos + BosqNat + Plantac + Pastizal + Matorral + Humedal + C_agua + Sup_imper + Tie_desnud + hielo + alt_media + pend_media + Temp_media + PP_media + CA _100 + PLAND _100 + PD _100 + ED _100 + AREA_MN _100 + ENN_MN_100 + CA _210 + PLAND _210 + PD _210 + ED _210 + AREA_MN _210 + ENN_MN_210 + CA _600 + PLAND _600 + PD _600 + ED _600 + AREA_MN _600 + ENN_MN_600 + SHDI + SIDI + MSIDI + SHEI + SIEI + MSIEI, family=gaussian(identity), data=bats_araucania_500)
Error: unexpected input in "Total_Pass + Total_Buzz + dm_plant + dm_cdeagua + dm_cultivo + dm_humed + dm_bnativ + dm_snaspe + Cultivos + BosqNat + Plantac + Pastizal + Matorral + Humedal + C_agua + Sup_imper + Tie_desnud"
Any help is useful.
R cannot handle column names that contain a space, such as CA _210, when they are written bare in a formula. Wrap these names in backticks (`) or rename the columns so that they contain no spaces.
FYI: if you are using all columns as predictors, you can write your code this way:
glm(log(y+1) ~ . , nextargs...)
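A small sketch of both options on toy data. The data frame d and its values are invented for illustration; only the column name CA _100 comes from the question. check.names = FALSE is needed so that data.frame keeps the space in the name:

```r
# Toy data; check.names = FALSE keeps the space in the column name
d <- data.frame(y = c(1.2, 2.3, 2.9, 4.1, 5.2),
                `CA _100` = c(1, 2, 3, 4, 5),
                SHDI = c(0.5, 0.1, 0.4, 0.2, 0.3),
                check.names = FALSE)

# Option 1: quote the non-syntactic name with backticks in the formula
fit1 <- glm(log(y + 1) ~ `CA _100` + SHDI,
            family = gaussian(identity), data = d)

# Option 2: make the names syntactic first
d2 <- d
names(d2) <- make.names(names(d2))   # "CA _100" becomes "CA._100"
fit2 <- glm(log(y + 1) ~ CA._100 + SHDI,
            family = gaussian(identity), data = d2)

# Both fits estimate the same coefficients
all.equal(unname(coef(fit1)), unname(coef(fit2)))
```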
Random Forest in R (multi-label-classification)
I'm fairly new to R, trying to implement Random Forest algorithm. My training and test set have 60 features in the format: Train: feature1,feature2 .. feature60,Label Test: FileName,feature1,feature2 ... feature60 Train-sample mov-mov,or-or,push-push,or-mov,sub-sub,mov-or,sub-mov,xor-or,call-sub,mul-imul,mov-push,push-mov,push-call,or-jz,mov-mul,cmp-or,mov-sub,sub-or,or-sub,or-push,jnz-or,jmp-sub,or-in,mov-call,retn-sub,mul-mul,or-jmp,imul-mul,pop-pop,nop-nop,nop-mul,sub-push,imul-mov,test-or,mul-mov,lea-push,std-mov,in-call,or-call,mov-std,mov-cmp,std-mul,call-or,jz-mov,push-or,pop-retn,add-mov,mov-add,mov-xor,in-inc,mov-pop,in-or,in-push,push-lea,lea-mov,mov-lea,sub-add,std-std,sub-cmp,or-cmp,Label 687,1346,1390,1337,750,2770,1518,418,1523,0,441,532,612,512,0,411,354,310,412,495,134,236,318,237,226,0,0,0,200,0,0,386,39,365,0,0,0,125,528,0,125,0,41,260,169,143,149,61,89,0,127,126,107,44,45,40,79,0,273,157,9 812,873,83,533,88,484,264,106,199,0,188,137,128,51,38,92,131,102,52,58,37,26,428,95,107,0,34,0,58,0,0,39,0,26,0,27,0,152,152,0,45,0,124,0,0,73,84,88,22,23,59,319,105,56,86,47,0,0,43,41,2 Test-sample FileName,mov-mov,or-or,push-push,or-mov,sub-sub,mov-or,xor-or,sub-mov,call-sub,mul-imul,push-mov,mov-push,push-call,mov-mul,or-jz,cmp-or,mov-sub,sub-or,or-sub,or-push,jmp-sub,jnz-or,or-in,mul-mul,or-jmp,mov-call,retn-sub,imul-mul,nop-mul,pop-pop,nop-nop,imul-mov,sub-push,mul-mov,test-or,lea-push,std-mov,or-call,mov-std,in-call,std-mul,mov-cmp,call-or,push-or,jz-mov,pop-retn,in-or,add-mov,mov-add,in-inc,mov-xor,in-push,push-lea,mov-pop,lea-mov,mov-lea,mov-nop,or-cmp,sub-add,sub-cmp Ig2DB5tSiEy1cJvV0zdw,166,360,291,194,41,201,62,61,41,18,85,56,121,18,15,0,57,131,113,123,0,9,54,0,0,18,15,0,0,15,0,8,25,0,0,11,0,70,0,43,0,0,63,37,0,14,51,43,56,36,26,0,20,14,17,14,0,9,18,0 k4HCwy5WRFXczJU6eQdT,3,88,106,23,104,0,12,43,59,0,65,87,99,0,2,2,47,22,4,53,1,5,0,0,0,0,46,0,0,0,0,0,4,0,0,6,0,44,0,21,0,0,0,0,0,0,0,2,1,1,3,0,1,2,9,2,0,0,44,2 So what I have so far in R is this, 
library(randomForest); dat <- read.csv("train-sample.csv", sep=",", h=T); test <- read.csv("test-sample.csv", sep=",", h=T); attach(dat); #If I do this, I get Error: unexpected 'in' ... rfmodel = randomForest (Label ~ mov-mov + or-or + push-push + or-mov + sub-sub + mov-or + sub-mov + xor-or + call-sub + mul-imul + mov-push + push-mov + push-call + or-jz + mov-mul + cmp-or + mov-sub + sub-or + or-sub + or-push + jnz-or + jmp-sub + or-in + mov-call + retn-sub + mul-mul + or-jmp + imul-mul + pop-pop + nop-nop + nop-mul + sub-push + imul-mov + test-or + mul-mov + lea-push + std-mov + in-call + or-call + mov-std + mov-cmp + std-mul + call-or + jz-mov + push-or + pop-retn + add-mov + mov-add + mov-xor + in-inc + mov-pop + in-or + in-push + push-lea + lea-mov + mov-lea + sub-add + std-std + sub-cmp + or-cmp, data=dat); #If I do this, I get Error in terms.formula(formula, data = data) : invalid model formula in ExtractVars rfmodel = randomForest (Label ~ 'mov-mov' + 'or-or' + 'push-push' + or-mov + sub-sub + mov-or + sub-mov + xor-or + call-sub + mul-imul + mov-push + push-mov + push-call + or-jz + mov-mul + cmp-or + mov-sub + sub-or + or-sub + or-push + jnz-or + jmp-sub + 'or-in' + mov-call + retn-sub + mul-mul + or-jmp + imul-mul + pop-pop + nop-nop + nop-mul + sub-push + imul-mov + test-or + mul-mov + lea-push + 'std-mov' + 'in-call' + 'or-call' + 'mov-std' + 'mov-cmp' + 'std-mul' + 'call-or' + 'jz-mov' + 'push-or' + 'pop-retn' + 'add-mov' + 'mov-add' + 'mov-xor' + 'in-inc' + 'mov-pop' + 'in-or' + 'in-push' + 'push-lea' + 'lea-mov' + 'mov-lea' + 'sub-add' + 'std-std' + 'sub-cmp' + 'or-cmp', data=dat); #I even tried this and got Error in na.fail.default(list(Label = c(9L, 2L, 9L, 1L, 8L, 6L, 2L, 2L, : missing values in object rfmodel <- randomForest(Label~., dat); So I'm kinda stuck. 
I want to end up using something like:
predicted <- predict(rfmodel, test, type="response");
prop.table(table(test$FileName, predicted),1);
to get output in the form:
FileName, Label1, Label2, Label3 .. Label9
name1, 0.98, 0, 0.02, 0, 0 .. 0
(basically the file name with the probabilities of each label). Any help is appreciated. Thank you.