R: Define ranges from text using regex

I need a way to look up already-defined variables based on a string within text.
Let's say I have five variables (r010, r020, r030, r040, r050).
If the text contains a range of the form "r010-050", I want the sum of the values of all five variables.
The whole text would look like "{r010-050} == {r060}"
The first part of that equation should be replaced by the sum of the five variables and, since r060 is also a variable, parsing and evaluating the text should yield a logical value.
I think regex will help here again.
Can anyone help?
Thanks.

Define the inputs: the variables r010 etc. which we assume are scalars and the string s.
Then define a pattern pat which matches the {...} part and a function Sum which accepts the 3 capture groups in pat (i.e. the strings matched to the parts of pat within parentheses) and performs the desired sum.
Use gsubfn to match the pattern, passing the capture groups to Sum and replacing the match with the output of Sum. Then evaluate it.
In the example the only variables in the global environment whose names are between r010 and r050 inclusive are r010 and r020 (it would have used more had they existed) and since they sum to r060 it returned TRUE.
library(gsubfn)
# inputs
r010 <- 1; r020 <- 2; r060 <- 3
s <- "{r010-050} == {r060}"
pat <- "[{](\\w+)(-(\\w+))?[}]"
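Sum <- function(x1, x2, x3, env = .GlobalEnv) {
  # x1 = first name, x2 = "-" plus the end digits (unused), x3 = end digits or ""
  x3 <- if (x3 == "") x1 else paste0(gsub("\\d", "", x1), x3)
  lst <- ls(env)
  # sum every variable whose name sorts between x1 and x3 inclusive
  sum(unlist(mget(lst[lst >= x1 & lst <= x3], envir = env)))
}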
eval(parse(text = gsubfn(pat, Sum, s)))
## [1] TRUE
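For comparison, here is a minimal base R sketch of the same idea without gsubfn. It is not the answer's method: the helper sum_range, the environment vars and the extra variable r03_note are made up for illustration. It assumes every name has the form letters followed by digits, and it compares the numeric parts rather than the strings, so an unrelated object such as r03_note that merely sorts inside the range is not summed.

```r
# Assumed inputs, kept in their own environment instead of the global one.
vars <- list2env(list(r010 = 1, r020 = 2, r060 = 3, r03_note = 100))
s <- "{r010-050} == {r060}"

# Hypothetical helper: sum the variables in `env` named <prefix><digits>
# whose numeric part lies inside the requested range.
sum_range <- function(token, env) {
  parts <- regmatches(token, regexec("^([A-Za-z]+)([0-9]+)(-([0-9]+))?$", token))[[1]]
  if (length(parts) == 0) stop("unrecognised token: ", token)
  prefix <- parts[2]
  lo <- as.integer(parts[3])
  hi <- if (nzchar(parts[5])) as.integer(parts[5]) else lo
  nms <- grep(paste0("^", prefix, "[0-9]+$"), ls(env), value = TRUE)
  num <- as.integer(sub(prefix, "", nms, fixed = TRUE))
  sum(unlist(mget(nms[num >= lo & num <= hi], envir = env)))
}

# Locate each {...} token, compute its value, and substitute it back in place.
m <- gregexpr("[{][^}]+[}]", s)
tokens <- gsub("[{}]", "", regmatches(s, m)[[1]])
values <- vapply(tokens, sum_range, numeric(1), env = vars)

expr <- s
regmatches(expr, m) <- list(as.character(values))
result <- eval(parse(text = expr))
```

Here expr ends up as "3 == 3" and result is TRUE. As with the gsubfn version, eval(parse(...)) should only be used on text you trust.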

Related

Compare two strings in R and see additions, deletions

I want to compare two character values in R and see which characters were added and deleted, so that I can display them later in a way similar to git diff --color-words=. (see screenshot below)
For example:
a <- "hello world"
b <- "helo world!"
diff <- FUN(a, b)
where diff would somehow show that an l was dropped and a ! was added.
The ultimate goal is to construct an html string like this hel<span class="deleted">l</span>o world<span class="added">!</span>.
I am aware of diffobj but so far I cannot get it to return the character differences, only the differences between elements.
[Screenshot: output of git diff --color-words=.]
Base R has a function adist that computes the generalized Levenshtein distance. With counts = TRUE and partial = FALSE (the default), the attribute "trafos" is set to the sequence of matches, insertions, deletions and substitutions needed to go from one string to the other. From the documentation, section Value, my emphasis:
If counts is TRUE, the transformation counts are returned as the "counts" attribute of this matrix, as a 3-dimensional array with dimensions corresponding to the elements of x, the elements of y, and the type of transformation (insertions, deletions and substitutions), respectively. Additionally, if partial = FALSE, the transformation sequences are returned as the "trafos" attribute of the return value, as character strings with elements "M", "I", "D" and "S" indicating a match, insertion, deletion and substitution, respectively. If partial = TRUE, the offsets (positions of the first and last element) of the matched substrings are returned as the "offsets" attribute of the return value (with both offsets -1 in case of no match).
a <- "hello world"
b <- "helo world!"
attr(adist(a, b, counts = TRUE), "trafos")
#> [,1]
#> [1,] "MMDMMMMMMMMI"
Created on 2022-05-31 by the reprex package (v2.0.1)
There is a deletion in the 3rd character and an insertion at the end of string a.
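To get from the "trafos" string to the HTML asked for in the question, one option is to walk through the operations while keeping a cursor in each string. The following is only a sketch: the helper wrap and the class names "deleted" and "added" (taken from the question) are assumptions, every edited character gets its own span, and no HTML escaping is done.

```r
a <- "hello world"
b <- "helo world!"

# One letter per edit step: M(atch), I(nsert), D(elete), S(ubstitute)
ops <- strsplit(attr(adist(a, b, counts = TRUE), "trafos")[1, 1], "")[[1]]
aa <- strsplit(a, "")[[1]]
bb <- strsplit(b, "")[[1]]

# Hypothetical helper that wraps one character in a span
wrap <- function(cls, ch) sprintf('<span class="%s">%s</span>', cls, ch)

i <- 0  # characters of a consumed so far
j <- 0  # characters of b consumed so far
out <- character(length(ops))
for (k in seq_along(ops)) {
  if (ops[k] == "M") {
    i <- i + 1; j <- j + 1
    out[k] <- aa[i]
  } else if (ops[k] == "D") {
    i <- i + 1
    out[k] <- wrap("deleted", aa[i])
  } else if (ops[k] == "I") {
    j <- j + 1
    out[k] <- wrap("added", bb[j])
  } else {  # "S": show the old character as deleted and the new one as added
    i <- i + 1; j <- j + 1
    out[k] <- paste0(wrap("deleted", aa[i]), wrap("added", bb[j]))
  }
}
html <- paste(out, collapse = "")
html
```

Dropping the "deleted" spans and unwrapping the "added" ones gives back b, which is a quick way to check the result.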
Found a solution using diffobj::ses_dat() and splitting the data into its characters before.
get_html_diff <- function(a, b) {
  aa <- strsplit(a, "")[[1]]
  bb <- strsplit(b, "")[[1]]
  s <- diffobj::ses_dat(aa, bb)
  m <- cumsum(as.integer(s$op) != c(Inf, s$op[1:(length(s$op) - 1)]))
  res <- paste(
    sapply(split(seq_along(s$op), m), function(i) {
      val <- paste(s$val[i], collapse = "")
      if (s$op[i[[1]]] == "Insert")
        val <- paste0("<span class=\"add\">", val, "</span>")
      if (s$op[i[[1]]] == "Delete")
        val <- paste0("<span class=\"del\">", val, "</span>")
      val
    }),
    collapse = "")
  res
}
get_html_diff("hello world", "helo World!")
#> [1] "hel<span class=\"del\">l</span>o <span class=\"del\">w</span><span class=\"add\">W</span>orld<span class=\"add\">!</span>"
Created on 2022-05-31 by the reprex package (v2.0.1)
We use diffobj to compare configuration files (in more or less production environment), and it works just right. In your case, wouldn't diffobj::diffChr be what you want?
diffobj::diffChr("hello world", "helo world!", color.mode = 'rgb')

R: create a named vector containing the means of multiple numeric vectors

I have over 20 numeric vectors, each consisting of a series of values. Each vector is distinguished by a letter, e.g. val_a, val_b, val_c etc...
I would like to put the means from each of these vectors into a single named vector. I could of course do this in a laborious manner like so:
obs <- c("val_a" = round(mean(val_a),3),
"val_b" = round(mean(val_b),3),
"val_c" = round(mean(val_c),3))
But with 20 vectors this becomes tedious to write out, not to mention inelegant. How can I create the named vector in a more succinct way? I have made an attempt using a for loop, like so:
obs <- c(for (j in 1:20) {
assign(paste("val",letters[j], sep = "_"),
mean(as.name(paste('val',letters[j], sep = '_'))),)
})
In the right-hand argument passed to assign, "as.name" is used in order to remove the quotation marks from the output of "paste". So the second argument passed to assign returns a name that matches the numeric vector I want to get the mean of, e.g. val_a. But I get this message:
Warning messages:
1: In mean.default(as.name(paste("val", letters[j], sep = "_"))) :
argument is not numeric or logical: returning NA
Does anyone know how to accomplish this?
Solution
To build on bouncyball's comment so you have a full answer, you can do this:
sapply(paste('val', letters[1:20], sep='_'), function(x) round(mean(get(x)), 3))
Explanation
For an object in your environment called x, get("x") will return x. See help("get"). Then we can do this for every element of paste('val', letters[1:20], sep='_') using sapply(), or if you like, a loop.
Example
val_a <- rnorm(100)
val_b <- rnorm(100)
val_c <- rnorm(100)
sapply(paste('val', letters[1:3], sep='_'), function(x) round(mean(get(x)), 3))
val_a val_b val_c
-0.09328504 -0.15632654 -0.09759111
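A variant of the same idea, sketched here as an assumption rather than as part of the answer: mget() fetches several objects by name in one call and returns a named list, so the names carry over to the result automatically. The seed is arbitrary and only there to make the sketch reproducible.

```r
set.seed(1)  # arbitrary seed, for reproducibility only
val_a <- rnorm(100)
val_b <- rnorm(100)
val_c <- rnorm(100)

nms  <- paste0("val_", letters[1:3])
vals <- mget(nms)                     # named list: val_a, val_b, val_c
obs  <- round(sapply(vals, mean), 3)  # named numeric vector of rounded means
obs
```

If the vectors can be created inside a list in the first place, sapply(that_list, mean) works directly and neither get() nor mget() is needed.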

Replace variable name in string with variable value [R]

I have a character in R, say "\\frac{A}{B}". And I have values for A and B, say 5 and 10. Is there a way that I can replace the A and B with the 5 and 10?
I tried the following.
words <- "\\frac{A}{B}"
numbers <- list(A=5, B=10)
output <- do.call("substitute", list(parse(text=words)[[1]], numbers))
But I get an error on the \. Is there a way that I can do this? I am trying to create equations with the actual variable values.
You could use the stringi function stri_replace_all_fixed()
stringi::stri_replace_all_fixed(
words, names(numbers), numbers, vectorize_all = FALSE
)
# [1] "\\frac{5}{10}"
Try this:
sprintf(gsub('\\{\\w\\}','\\{%d}',words),5,10)
I'm more familiar with gsub than substitute. The following works:
words <- "\\frac{A}{B}"
numbers <- list(A=5, B=10)
arglist <- mapply(list, as.list(names(numbers)), numbers, SIMPLIFY = FALSE)
for (i in seq_along(arglist)) {
  arglist[[i]]["x"] <- words
  words <- do.call("gsub", arglist[[i]])
}
But of course this is unsafe because you're iterating over the substitutions. If, say, the first variable has value "B" and the second variable has name "B", you'll have problems. There's probably a cleaner way.
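One way around the iteration problem is to locate all placeholders first and replace them in a single pass, so a substituted value is never scanned again. This is a sketch under the assumption that every placeholder is a word wrapped in braces, as in "\\frac{A}{B}"; the tricky value "B" for A is chosen on purpose to show the difference.

```r
words <- "\\frac{A}{B}"
numbers <- list(A = "B", B = 10)  # deliberately tricky: the value of A is "B"

# Find every word that sits directly between { and }
m <- gregexpr("(?<=[{])\\w+(?=[}])", words, perl = TRUE)
keys <- regmatches(words, m)[[1]]

# Look each key up once; unknown keys are left unchanged
vals <- vapply(keys, function(k) {
  if (k %in% names(numbers)) as.character(numbers[[k]]) else k
}, character(1))

out <- words
regmatches(out, m) <- list(unname(vals))
out
```

With the loop above, the same inputs would first turn {A} into {B} and then replace both occurrences of B with 10; here the result keeps the B that came from A.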

String split and expand the (vector) at the delimiter: R

I have this vector (it's big in size) myvec. I need to split them matching at / and create another result vector resvector. How can I get this done in R?
myvec<-c("IID:WE:G12D/V/A","GH:SQ:p.R172W/G", "HH:WG:p.S122F/H")
resvector
IID:WE:G12D, IID:WE:G12V,IID:WE:G12A,GH:SQ:p.R172W,GH:SQ:p.R172G,HH:WG:p.S122F,HH:WG:p.S122H
You can try this, using strsplit as mentioned by #Tensibai:
sp_vec <- strsplit(myvec, "/") # split the element of the vector by "/" : you will get a list where each element is the decomposition (vector) of one element of your vector, according to "/"
ts_vec <- lapply(sp_vec, # for each element of the previous list, do
function(x){
base <- sub("\\w$", "", x[1]) # get the common beginning of the column names (so first item of vector without the last letter)
x[-1] <- paste0(base, x[-1]) # paste this common beginning to the rest of the vector items (so the other letters)
x}) # return the vector
resvector <- unlist(ts_vec) # finally, unlist to get the needed vector
resvector
# [1] "IID:WE:G12D" "IID:WE:G12V" "IID:WE:G12A" "GH:SQ:p.R172W" "GH:SQ:p.R172G" "HH:WG:p.S122F" "HH:WG:p.S122H"
Here is a concise answer with regex and some functional programming:
x = gsub('[A-Z]/.+','',myvec)
y = strsplit(gsub('[^/]+(?=[A-Z]/.+)','',myvec, perl=TRUE),'/')
unlist(Map(paste0, x, y))
# "IID:WE:G12D" "IID:WE:G12V" "IID:WE:G12A" "GH:SQ:p.R172W" "GH:SQ:p.R172G" "HH:WG:p.S122F" "HH:WG:p.S122H"
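To make the two regex steps easier to follow, here is the same approach with the intermediate values spelled out. The only changes are assumptions of this sketch: perl = TRUE written in full and use.names = FALSE so the result carries no names.

```r
myvec <- c("IID:WE:G12D/V/A", "GH:SQ:p.R172W/G", "HH:WG:p.S122F/H")

# Step 1: drop everything from the first "letter followed by /" onwards,
# which leaves the common stem of each element.
x <- gsub("[A-Z]/.+", "", myvec)

# Step 2: drop the stem instead, then split what remains on "/".
y <- strsplit(gsub("[^/]+(?=[A-Z]/.+)", "", myvec, perl = TRUE), "/")

# Step 3: paste each stem onto its own alternatives and flatten.
resvector <- unlist(Map(paste0, x, y), use.names = FALSE)
resvector
```

After step 1, x holds "IID:WE:G12", "GH:SQ:p.R172" and "HH:WG:p.S122"; after step 2, y is a list holding c("D", "V", "A"), c("W", "G") and c("F", "H").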
myvec <- c("IID:WE:G12D/V/A","GH:SQ:p.R172W/G", "HH:WG:p.S122F/H")
custmSplit <- function(str){
  splitbysep <- strsplit(str, '/')[[1]]
  # drop the last character of the first piece to get the common stem
  stem <- substr(splitbysep[1], 1, nchar(splitbysep[1]) - 1)
  splitbysep[-1] <- paste0(stem, splitbysep[-1])
  return(splitbysep)
}
do.call('c', lapply(myvec, custmSplit))
# [1] "IID:WE:G12D" "IID:WE:G12V" "IID:WE:G12A" "GH:SQ:p.R172W" "GH:SQ:p.R172G" "HH:WG:p.S122F" "HH:WG:p.S122H"

paste not returning values concatenated

I am trying to get the column names of a dataframe to use them in another call, but this apply call returns the values separated, instead of concatenated correctly. What did I do wrong here?
df<-data.frame(c(1,2,3),c(4,5,6))
colnames(df)<-c("hi","bye")
apply(df,2,function(x){
paste("subscale_scores$",colnames(x),sep="")
#this is the command I am eventually trying to run
#lm(paste("subscale_scores",colnames(x))~surveys$npitotal+ipip$extraversion+ipip$agreeableness+ipip$conscientiousness+ipip$emotionalStability+ipip$intelImagination)
})
Goal output:
subscale_scores$hi
subscale_scores$bye
Is there any need for the apply?
Is this what you mean?
paste0('subscale_scores$', names(df))
# [1] "subscale_scores$hi" "subscale_scores$bye"
If you need them joined into a single string, say separated by newlines, add collapse='\n' (paste0 has no sep argument).
The paste0 is shorthand for paste(..., sep="").
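A small sketch of the difference, using the data frame from the question: without collapse you get one string per column name, with collapse you get a single string.

```r
df <- data.frame(hi = c(1, 2, 3), bye = c(4, 5, 6))

# One element per column name
per_column <- paste0("subscale_scores$", names(df))

# A single string, elements joined by a newline
one_string <- paste0("subscale_scores$", names(df), collapse = "\n")

length(per_column)  # 2
length(one_string)  # 1
cat(one_string, "\n")
```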
A note on your lm call later - if you want to do lm(Y ~ ...) where Y is each of your columns separately, try:
lms <- lapply(colnames(df),
function (y) {
# construct your formula
frm <- paste0('subscale_scores$', y, ' ~ surveys$npitotal+ipip$extraversion+ipip$agreeableness+ipip$conscientiousness+ipip$emotionalStability+ipip$intelImagination')
lm(frm)
})
names(lms) <- colnames(df)
Then lms$hi will contain the output of lm(subscale_scores$hi ~ ...) and so on.
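If all the columns can live in one data frame, an alternative sketch is to build each formula with reformulate() and pass data = ..., which avoids pasting `$` expressions into strings. The data frame dat, its random values and the shortened predictor list are made up purely for illustration.

```r
set.seed(42)  # toy data, made up for illustration only
dat <- data.frame(
  hi = rnorm(20), bye = rnorm(20),
  npitotal = rnorm(20), extraversion = rnorm(20)
)
responses  <- c(hi = "hi", bye = "bye")  # named, so the result is named too
predictors <- c("npitotal", "extraversion")

# One model per response column; reformulate() builds e.g. hi ~ npitotal + extraversion
lms <- lapply(responses, function(y) {
  lm(reformulate(predictors, response = y), data = dat)
})
formula(lms$hi)
```

Because responses is a named vector, lapply() names the list elements, so no separate names(lms) <- step is needed.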
Or, if the aim was to combine all the columns together (Y1 + Y2 ~ ...), then paste0('subscale_scores$', names(df), collapse='+') will give you subscale_scores$hi+subscale_scores$bye
How about this?
unlist(lapply(colnames(df),function(x){
paste("subscale_scores$",x,sep="")
}))
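As for why the original apply call did not work: apply(df, 2, ...) hands each column to the function as a plain vector, so colnames(x) is NULL inside the function and paste() only sees the prefix. A short sketch that makes this visible:

```r
df <- data.frame(hi = c(1, 2, 3), bye = c(4, 5, 6))

# Each x is a bare numeric vector, so it has no column names
no_names <- apply(df, 2, function(x) is.null(colnames(x)))

# What the original call effectively pasted for every column
pasted <- apply(df, 2, function(x) paste("subscale_scores$", colnames(x), sep = ""))

# Iterating over the names themselves gives the intended strings
wanted <- vapply(names(df), function(nm) paste0("subscale_scores$", nm), character(1))
```

Here no_names is TRUE for both columns and pasted contains just "subscale_scores$" twice, while wanted holds the two full strings.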
