Compare strings in R

I'm currently working on a programming project in R (for school), using a data set that covers a large number of LastFm users (an application that collects data when you're using a media player).
I want to look for a possible link between two variables in the data set: the "nickname" and the "real name". To do so, I would like to compute a variable that represents how similar the characters of the two strings are.
As an example take one individual (regardless of the other variables):
name = 'chris meller'
nickname = 'mellertime'
So far, I have tried to sort the strings in order to check for identical characters one by one, but I'm stuck here. All I have found is a way to check whether "name" is present inside "nickname", using various functions.
> paste(sort(unlist(strsplit(name, ""))), collapse = "")
[1] " ceehillmrrs"
> paste(sort(unlist(strsplit(nickname, ""))), collapse = "")
[1] "eeeillmmrt"
What I would like to know is if there is a way to count the number of identical letters between 2 character strings, regardless of the order?
I would like to end with something like this:
function(a,b)
[1] 0.63
# a,b are 2 character strings
where the result is the number of identical characters shared by the two strings divided by the number of characters in the real name.

Try this:
SimilarityRatio <- function(wholeName, nickname, matchCase) {
n1 <- sort(strsplit(paste(strsplit(wholeName, " ")[[1]], collapse = ""), "")[[1]])
n2 <- sort(strsplit(paste(strsplit(nickname, " ")[[1]], collapse = ""), "")[[1]])
if (!matchCase) {
n1 <- tolower(n1)
n2 <- tolower(n2)
}
MyLen <- tempLen <- length(n1)
j <- 1L
numMatch <- 0L
while (j <= tempLen) {
test1 <- n1[j] %in% n2
if (test1) {
myRemove <- min(which(n2 %in% n1[j]))
n1 <- n1[-j]
n2 <- n2[-myRemove]
numMatch <- numMatch + 1L
tempLen <- tempLen - 1L
} else {
j <- j+1L
}
}
numMatch/MyLen
}
Below are some test cases:
> SimilarityRatio("chris meller", "mellertime", FALSE)
[1] 0.6363636
> SimilarityRatio("SuperMan3000", "The3Musketeers", FALSE)
[1] 0.5
> SimilarityRatio("SuperMan3000", "The3Musketeers", TRUE)
[1] 0.4166667
> SimilarityRatio("should a garbage collection be performed immediately", "same expression can vary considerably depending on whether", FALSE)
[1] 0.7608696
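The same ratio can also be computed without an explicit loop by comparing per-character counts. This is only a sketch, not the answer's method: similarity_ratio and match_case are names made up here, and it assumes (as the function above does) that spaces are ignored and that the denominator is the number of non-space characters in the real name.

```r
# Hypothetical helper: multiset overlap of characters, divided by the
# number of (non-space) characters in the real name.
similarity_ratio <- function(name, nickname, match_case = FALSE) {
  to_chars <- function(s) {
    s <- gsub(" ", "", s, fixed = TRUE)   # drop spaces
    if (!match_case) s <- tolower(s)
    strsplit(s, "")[[1]]                  # one element per character
  }
  a <- table(to_chars(name))
  b <- table(to_chars(nickname))
  shared <- intersect(names(a), names(b))
  # each shared character counts as often as it occurs in BOTH strings
  overlap <- sum(pmin(as.vector(a[shared]), as.vector(b[shared])))
  overlap / sum(a)
}

similarity_ratio("chris meller", "mellertime")   # 7/11, about 0.636
```

Counting with table() gives the same numbers as the loop for the test cases above, because removing one matched character at a time is equivalent to taking the smaller of the two counts for each character.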

Related

Regex to convert time equations to R date-time (POSIXct)

I'm reading in data from another platform where a combination of the strings listed below is used for expressing timestamps:
* = current time
t = current day (00:00)
mo = month
d = days
h = hours
m = minutes
For example, *-3d is the current time minus 3 days, and t-3h is three hours before the start of today (00:00), i.e. 21:00 yesterday.
I'd like to be able to ingest these equations into R and get the corresponding POSIXct value. I'm trying to do this with regex in the function below, but I lose the numeric multiplier in front of each unit:
strTimeConverter <- function(z){
ret <- stringi::stri_replace_all_regex(
str = z,
pattern = c('^\\*',
'^t',
'([[:digit:]]{1,})mo',
'([[:digit:]]{1,})d',
'([[:digit:]]{1,})h',
'([[:digit:]]{1,})m'),
replacement = c('Sys.time()',
'Sys.Date()',
'*lubridate::months(1)',
'*lubridate::days(1)',
'*lubridate::hours(1)',
'*lubridate::minutes(1)'),
vectorize_all = F
)
return(ret)
# return(eval(expr = parse(text = ret)))
}
> strTimeConverter('*-5mo+3d+4h+2m')
[1] "Sys.time()-*lubridate::months(1)+*lubridate::days(1)+*lubridate::hours(1)+*lubridate::minutes(1)"
> strTimeConverter('t-5mo+3d+4h+2m')
[1] "Sys.Date()-*lubridate::months(1)+*lubridate::days(1)+*lubridate::hours(1)+*lubridate::minutes(1)"
Expected output:
# *-5mo+3d+4h+2m
"Sys.time()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+2*lubridate::minutes(1)"
# t-5mo+3d+4h+2m
"Sys.Date()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+2*lubridate::minutes(1)"
I assumed that wrapping the [[:digit:]]{1,} in parentheses () would preserve the digits, but clearly that's not working. I defined the patterns this way because otherwise the code replaces text it has just inserted: e.g. * gets converted to Sys.time(), but then the m in Sys.time() gets replaced with *lubridate::minutes(1).
I plan on converting the (expected) output to R date-time using eval(parse(text = ...)) - currently commented out in the function.
I'm open to using other packages or approaches.
Update
After tinkering around for a bit, I found the below version works - I'm replacing strings in the order such that newly replaced characters are not replaced again:
strTimeConverter <- function(z){
ret <- stringi::stri_replace_all_regex(
str = z,
pattern = c('y', 'd', 'h', 'mo', 'm', '^t', '^\\*'),
replacement = c('*years(1)',
'*days(1)',
'*hours(1)',
'*days(30)',
'*minutes(1)',
'Sys.Date()',
'Sys.time()'),
vectorize_all = F
)
ret <- gsub(pattern = '\\*', replacement = '*lubridate::', x = ret)
rdate <- (eval(expr = parse(text = ret)))
attr(rdate, 'tzone') <- 'UTC'
return(rdate)
}
sample_string <- '*-5mo+3d+4h+2m'
strTimeConverter(sample_string)
This works but is not very elegant and will likely fail as I'm forced to incorporate other expressions (e.g. yd for day of the year e.g. 124).
You can use backreferences in the replacements like this:
library(stringr)
x <- c("*-5mo+3d+4h+2m", "t-5mo+3d+4h+2m")
repl <- c('^\\*' = 'Sys.time()', '^t' = 'Sys.Date()', '(\\d+)mo' = '\\1*lubridate::months(1)', '(\\d+)d' = '\\1*lubridate::days(1)', '(\\d+)h' = '\\1*lubridate::hours(1)', '(\\d+)m' = '\\1*lubridate::minutes(1)')
stringr::str_replace_all(x, repl)
## => [1] "Sys.time()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+2*lubridate::minutes(1)"
## [2] "Sys.Date()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+2*lubridate::minutes(1)"
See, for example, '(\\d+)mo' = '\\1*lubridate::months(1)'. Here, (\d+)mo matches and captures into Group 1 one or more digits, and mo is just matched. Then, when the match is found, \1 in \1*lubridate::months(1) inserts the contents of Group 1 into the resulting string.
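A minimal base R illustration of the same capture-and-reuse idea follows. It uses sub() rather than stringr (sub() also takes the \\1 syntax), and the variable names are only for illustration.

```r
# The group (\\d+) captures the digits; \\1 puts them back in front of the unit
with_ref <- sub("(\\d+)mo", "\\1*lubridate::months(1)", "-5mo")
with_ref
# [1] "-5*lubridate::months(1)"

# Same pattern, but the replacement never refers to the group,
# so the captured digits are dropped
without_ref <- sub("(\\d+)mo", "*lubridate::months(1)", "-5mo")
without_ref
# [1] "-*lubridate::months(1)"
```

The second call reproduces what happens in the question: the parentheses capture the number, but it only reappears if the replacement string refers back to it.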
Note that it might make the replacements safer if you cap the time period match with a word boundary (\b) on the right:
repl <- c('^\\*' = 'Sys.time()', '^t' = 'Sys.Date()', '(\\d+)mo\\b' = '\\1*lubridate::months(1)', '(\\d+)d\\b' = '\\1*lubridate::days(1)', '(\\d+)h\\b' = '\\1*lubridate::hours(1)', '(\\d+)m\\b' = '\\1*lubridate::minutes(1)')
It won't work if the time spans are glued one to another without any non-word delimiters, but you have + in your example strings, so it is safe here.
Actually, you can make it work with the function you used, too. Just make sure the backreferences have the $n syntax:
x <- c("*-5mo+3d+4h+2m", "t-5mo+3d+4h+2m")
pattern = c('^\\*', '^t', '(\\d+)mo', '(\\d+)d', '(\\d+)h', '(\\d+)m')
replacement = c('Sys.time()', 'Sys.Date()', '$1*lubridate::months(1)', '$1*lubridate::days(1)', '$1*lubridate::hours(1)', '$1*lubridate::minutes(1)')
stringi::stri_replace_all_regex(x, pattern, replacement, vectorize_all=FALSE)
Output:
[1] "Sys.time()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+2*lubridate::minutes(1)"
[2] "Sys.Date()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+2*lubridate::minutes(1)"
Another option, which produces the time directly, would be the following:
library(lubridate)
strTimeConvert <- function(base = Sys.time(), delta = "-5mo+3d+4h+2m") {
  # Pull out the signed number in front of each unit
  # (assumes delta contains all four units: mo, d, h and m)
  mo <- gsub(".*([+-]\\d+)mo.*", "\\1", delta)
  ds <- gsub(".*([+-]\\d+)d.*", "\\1", delta)
  hs <- gsub(".*([+-]\\d+)h.*", "\\1", delta)
  ms <- gsub(".*([+-]\\d+)m.*", "\\1", delta)
  out <- base + months(as.numeric(mo)) + days(as.numeric(ds)) +
    hours(as.numeric(hs)) + minutes(as.numeric(ms))
  out
}
strTimeConvert()
# e.g. [1] "2020-07-21 20:32:19 EDT" (depends on the current time)
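If eval(parse()) and the lubridate dependency are both unwanted, the offsets can also be applied through the fields of a POSIXlt object. This is a rough sketch under several assumptions: resolve_time is a made-up name, only the units mo, d, h and m are handled, results are reported in UTC, and month arithmetic rolls over into the next month instead of returning NA as a lubridate period would for a non-existent date.

```r
# Hypothetical helper: resolve "*-5mo+3d+4h+2m" or "t-3h" against `now`
resolve_time <- function(expr, now = Sys.time()) {
  lt <- as.POSIXlt(now, tz = "UTC")
  # "t" means the start of the current day; "*" keeps the current time
  if (startsWith(expr, "t")) lt <- trunc(lt, units = "days")

  # Signed terms such as "-5mo" or "+3d"
  terms <- regmatches(expr, gregexpr("[+-][0-9]+(mo|d|h|m)", expr))[[1]]
  n <- as.integer(sub("[a-z]+$", "", terms))   # signed multiplier
  unit <- sub("^[+-][0-9]+", "", terms)        # "mo", "d", "h" or "m"
  total <- function(u) sum(n[unit == u])

  # Out-of-range fields are normalised when converting back to POSIXct
  lt$mon  <- lt$mon  + total("mo")
  lt$mday <- lt$mday + total("d")
  lt$hour <- lt$hour + total("h")
  lt$min  <- lt$min  + total("m")
  as.POSIXct(lt)
}

now <- as.POSIXct("2020-12-15 10:00:00", tz = "UTC")
resolve_time("*-5mo+3d+4h+2m", now)
# [1] "2020-07-18 14:02:00 UTC"
resolve_time("t-3h", now)
# [1] "2020-12-14 21:00:00 UTC"
```

Passing the reference time in as an argument keeps the function testable; new units such as y would need one more alternative in the pattern and one more field update.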

Find sequences of elements in vectors

I need some pointers on this. Actually, I don't necessarily need a fully-fledged solution here - some pointers to functions and/or packages would be great.
The problem: I want to find specific sequences in a character vector. The sequences can be somewhat "underspecified". That means that some of the elements should be fixed, but for some elements it does not matter how long they are or what they are exactly.
An example: Suppose I want to find the following pattern in a character vector:
1. The sequence should begin with "Out of" or "out of".
2. The sequence should end with "reasons".
3. In between, there should be other elements. It does not matter how many elements there are (zero would also be OK) or what exactly they are.
4. In between 1. and 2., there shouldn't be a ".", "!" or "?".
5. There should be a parameter that controls how long the sequence in 3. can be at most and still produce a result.
6. The return value of the function should be the intervening elements and/or their indices in the vector.
So, the function should "behave" like this:
c("Out", "of", "specific", "reasons", ".") Return "specific"
c("Out", "of", "very", "specific", "reasons", ".") Return c("very", "specific")
c("out", "of", "curiosity", ".", "He", "had", "his", "reasons") Return "" or NA or NULL, which one doesn't matter - just a signal that there is no result.
As I said: I don't need a full solution. Any pointers to packages that already implement such functionality are appreciated!
Optimally, I don't want to rely on a solution that first pastes the text and then uses regex for matching.
Thanks a lot!
I would be really curious to learn of a package that serves your needs. My inclination would be to collapse the strings and use regular expressions or find a programmer or use perl. But here's one extensible solution in R with a few more cases to experiment on. Not very elegant, but see if this has some utility.
# Recreate data as a list with a few more edge cases
txt1 <- c(
"Out of specific reasons.",
"Out of very specific reasons.",
"Out of curiosity. He had his reasons.",
"Out of reasons.",
"Out of one's mind.",
"For no particular reason.",
"Reasons are out of the ordinary.",
"Out of time and money and for many good reasons, it seems.",
"Out of a box, a car, and for random reasons.",
"Floop foo bar.")
txt2 <- strsplit(txt1, "[[:space:]]+") # remove space
txt3 <- lapply(txt2, strsplit, "(?=[[:punct:]])", perl = TRUE) # split punctuation off each token
txt <- lapply(txt3, unlist) # create list of tokens from each line
# Define characters to exclude: [. ! and ?] but not [,]
exclude <- "[.!?]"
# Assign acceptable limit to separation
lim <- 5 # try 7 and 12 to experiment
# Create indices identifying each of the enumerated conditions
fun1 <- function(x, pat) grep(pat, x, ignore.case = TRUE)
index1 <- lapply(txt, fun1, "out")
index2 <- lapply(txt, fun1, "of")
index3 <- lapply(txt, fun1, "reasons")
index4 <- lapply(txt, fun1, exclude)
# Create logical vectors from indices satisfying the conditions
fun2 <- function(set, val) val[1] %in% set
cond1 <- sapply(index1, fun2, val = 1) & sapply(index2, fun2, val = 2)
cond2 <- sapply(index3, "[", 1) < lim + 2 + 2 # position of 'of' + 2
cond3 <- sapply(index3, max, -Inf) < sapply(index4, min, Inf)
# Combine logical vectors to a single logical vector
valid <- cond1 & cond2 & cond3
valid <- ifelse(is.na(valid), FALSE, valid)
# Examine selected original lines
print(txt1[valid])
# Helper function to extract the starting and the ending element
fun3 <- function(index2, index3, valid) {
found <- rep(list(NULL), length(index2))
found[valid] <- Map(seq, index2[valid], index3[valid])
found <- lapply(found, tail, -1)
found <- lapply(found, head, -1)
}
# Extract starting and ending element from valid list members
idx <- fun3(index2, index3, valid)
# Return the results or "" for no intervening text or NULL for no match
ans <- Map(function(x, i) {
if (is.null(i)) NULL # no match found
else if (length(i) == 0) "" # no intervening elements
else x[i]}, # all intervening elements <= lim
txt, idx)
# Show found (non-NULL) values
ans[!sapply(ans, is.null)]
So let's assume your example
x <- c("Out", "of", "very", "specific", "reasons", ".")
We first need to get the beginning of the indicator
i_Beginning <- as.numeric(grep("Out|out", x))
and the ending
i_end <- as.numeric(grep("reasons", x))
Need to also check that Out is followed by of
Is_Of <- grepl("Of|of", x[i_Beginning +1])
And if this is true we extract the other elements
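extraction <- NULL  # stays NULL when "Out" is not followed by "of"
if (Is_Of) {
  # every element between "of" and "reasons"; length.out also copes with zero elements
  extraction <- x[seq(i_Beginning + 2, length.out = i_end - i_Beginning - 2)]
}
print(extraction)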
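The conditions listed in the question can also be checked directly on the token vector, without pasting the text back together. The function below is only a sketch: find_between and max_len are made-up names, it stops at the first "reasons" after each "Out of", and it returns NULL as the "no result" signal.

```r
# Hypothetical helper: elements between "Out of"/"out of" and "reasons"
find_between <- function(tokens, max_len = 5) {
  n <- length(tokens)
  if (n < 3) return(NULL)
  # positions s where tokens[s] is "Out"/"out" and tokens[s + 1] is "of"
  starts <- which(tokens[-n] %in% c("Out", "out") & tokens[-1] == "of")
  for (s in starts) {
    first <- s + 2                        # first candidate intervening token
    if (first > n) next
    rest <- tokens[first:n]
    end <- match("reasons", rest)         # first "reasons" after "of"
    stopper <- match(TRUE, rest %in% c(".", "!", "?"))
    if (is.na(end)) next                          # no closing "reasons"
    if (!is.na(stopper) && stopper < end) next    # sentence ended first
    if (end - 1 > max_len) next                   # too many elements
    idx <- seq(first, length.out = end - 1)
    return(list(tokens = tokens[idx], indices = idx))
  }
  NULL                                    # signal: nothing found
}

find_between(c("Out", "of", "very", "specific", "reasons", "."))$tokens
# [1] "very"     "specific"
find_between(c("out", "of", "curiosity", ".", "He", "had", "his", "reasons"))
# NULL
```

For "Out of reasons ." the result has zero-length tokens and indices, which distinguishes "matched with nothing in between" from "no match".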

Replacing nth instance of a character string using sub/gsub in R

I am attempting to re-name some character strings given to me in a large list. The issue is that I only need to replace some of the characters not all of them.
exdata <- c("i_am_having_trouble_with_this_string",
"i_am_wishing_files_were_cleaner_for_me",
"any_help_would_be_greatly_appreciated")
From this list, for example, I would like to replace the third through the fifth instance of "_" with "-". I am having trouble understanding the regex coding for this, as most examples split strings up instead of keeping them intact.
Here are some alternative approaches. All of them can be generalized to arbitrary bounds by replacing 3 and 5 with other numbers.
1) strsplit Split the strings at underscore and use paste to collapse it back using the appropriate separators. No packages are used.
i <- 3
j <- 5
sapply(strsplit(exdata, "_"), function(x) {
g <- seq_along(x)
g[g < i] <- i
g[g > j + 1] <- j+1
paste(tapply(x, g, paste, collapse = "_"), collapse = "-")
})
giving:
[1] "i_am_having-trouble-with-this_string"
[2] "i_am_wishing-files-were-cleaner_for_me"
[3] "any_help_would-be-greatly-appreciated"
2) for loop This translates the first j occurrences of old to new in x and then translates the first i-1 occurrences of new back to old. No packages are used.
translate <- function(old, new, x, i = 1, j) {
if (i <= 1) {
if (j > 0) for(k in seq_len(j)) x <- sub(old, new, x, fixed = TRUE)
x
} else Recall(new, old, Recall(old, new, x, 1, j), 1, i-1)
}
translate("_", "-", exdata, 3, 5)
giving:
[1] "i_am_having-trouble-with-this_string"
[2] "i_am_wishing-files-were-cleaner_for_me"
[3] "any_help_would-be-greatly-appreciated"
3) gsubfn This uses a package but in return is substantially shorter than the others. gsubfn is like gsub except that the replacement can be a string, list, function or proto object. In the case of a proto object, the fun method of the proto object is invoked each time there is a match to the regular expression. Below, the matching string is passed to fun as x, while the output of fun replaces the match in the data. The proto object is automatically populated with a number of variables set by gsubfn and accessible by fun, including count, which is 1 for the first match, 2 for the second and so on. For more information see the gsubfn vignette -- section 4 discusses the use of proto objects.
library(gsubfn)
p <- proto(i = 3, j = 5,
fun = function(this, x) if (count >= i && count <= j) "-" else x)
gsubfn("_", p, exdata)
giving:
[1] "i_am_having-trouble-with-this_string"
[2] "i_am_wishing-files-were-cleaner_for_me"
[3] "any_help_would-be-greatly-appreciated"
> gsub('(.*_.*_.*?)_(.*?)_(.*?)_(.*)','\\1-\\2-\\3-\\4', exdata)
[1] "i_am_having-trouble-with-this_string" "i_am_wishing-files-were-cleaner_for_me" "any_help_would-be-greatly-appreciated"
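Another base R route is to locate every occurrence with gregexpr and overwrite only the ones whose position in the count falls within the wanted range. This is a sketch rather than any of the approaches above; replace_occurrences is a made-up name, and old is treated as a fixed string, not a regular expression.

```r
# Hypothetical helper: replace the i-th through j-th occurrence of `old`
replace_occurrences <- function(x, old, new, i, j) {
  m <- gregexpr(old, x, fixed = TRUE)       # all match positions per string
  regmatches(x, m) <- lapply(regmatches(x, m), function(hits) {
    k <- seq_along(hits)                    # 1 for the first match, 2 for the second, ...
    hits[k >= i & k <= j] <- new
    hits
  })
  x
}

exdata <- c("i_am_having_trouble_with_this_string",
            "i_am_wishing_files_were_cleaner_for_me",
            "any_help_would_be_greatly_appreciated")
replace_occurrences(exdata, "_", "-", 3, 5)
# [1] "i_am_having-trouble-with-this_string"
# [2] "i_am_wishing-files-were-cleaner_for_me"
# [3] "any_help_would-be-greatly-appreciated"
```

Because each match is addressed by its running count, changing the bounds needs no change to the pattern.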

String splitting in R Programming

Currently the script below splits a combined item code into specific item codes.
rule2 <- c("MR")
df_1 <- test[grep(paste("^",rule2,sep="",collapse = "|"),test$Name.y),]
SpaceName_1 <- function(s){
num <- str_extract(s,"[0-9]+")
if(nchar(num) >3){
former <- substring(s, 1, 4)
latter <- strsplit(substring(s,5,nchar(s)),"")
latter <- unlist(latter)
return(paste(former,latter,sep = "",collapse = ","))
}
else{
return (s)
}
}
df_1$Name.y <- sapply(df_1$Name.y, SpaceName_1)
Example,
Combined item code: Room 324-326 is splitting into MR324 MR325 MR326.
However for this particular Combined item code: Room 309-311 is splitting into MR309 MR300 MR301.
How should I amend the script to give me MR309 MR310 MR311?
You can try something along these lines:
range <- "324-326"
x <- as.numeric(unlist(strsplit(range, split="-")))
paste0("MR", seq(x[1], x[2]))
[1] "MR324" "MR325" "MR326"
I assume that you can obtain the numerical room sequence by some means, and then use the snippet I gave you above.
If your combined item codes always have the form Room xxx-yyy, then you can extract the range using gsub:
range <- gsub("Room ", "", "Room 324-326")
If your item codes were in a vector called codes, then you could obtain a vector of ranges using:
ranges <- sapply(codes, function(x) gsub("Room ", "", x))
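Putting the two steps together, a small helper could pull both numbers out of the code and expand them in one go. This is a sketch under the assumption that each code contains exactly two numbers, lower bound first; expand_rooms and its prefix argument are made-up names.

```r
# Hypothetical helper: "Room 309-311" -> "MR309" "MR310" "MR311"
expand_rooms <- function(code, prefix = "MR") {
  # extract every run of digits, e.g. "309" and "311"
  bounds <- as.integer(regmatches(code, gregexpr("[0-9]+", code))[[1]])
  paste0(prefix, seq(bounds[1], bounds[2]))
}

expand_rooms("Room 309-311")
# [1] "MR309" "MR310" "MR311"
```

Treating the bounds as numbers is what carries the 309 to 310 step correctly, which the digit-by-digit approach in the question cannot do.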
We can also evaluate the string after replacing the - with : and then paste the prefix "MR".
paste0("MR", eval(parse(text=sub("\\S+\\s+(\\d+)-(\\d+)", "\\1:\\2", range))))
#[1] "MR324" "MR325" "MR326"
Wrap it as a function for convenience
fChange <- function(prefixStr, RangeStr){
paste0(prefixStr, eval(parse(text=sub("\\S+\\s+(\\d+)-(\\d+)",
"\\1:\\2", RangeStr))))
}
fChange("MR", range)
fChange("MR", range1)
#[1] "MR309" "MR310" "MR311"
For multiple elements, just loop over and apply the function
sapply(c(range, range1), fChange, prefixStr = "MR")
data
range <- "Room 324-326"
range1 <- "Room 309-311"

Am I using sapply incorrectly?

This code is supposed to take in a word and compute values for letters of the word, based on the position of each letter in the word. So for a word like "broke", it's supposed to compute the values for the letters "r" and "k".
strg <- 'broke'
#this part stores everything except the first,
#last, and middle position of the word
strg.leng <- nchar(strg)
other.letts <- sequence(strg.leng)
if (length(other.letts) %% 2 != 0) {
oth_let1 <- other.letts[-c(1, ceiling(length(other.letts)/2), length(other.letts))]
} else {
oth_let0 <- other.letts[-c(1, c(1,0) + floor(length(other.letts)/2), length(other.letts))]
}
print(paste("Values of the other letters of: ", strg))
#here is where the computation starts, taking in the objects created above
if ((nchar(strg) %% 2) != 0) {
sapply(oth_let1, function(i) print(paste(oth_let1[i], "L", (.66666*1.00001) - (oth_let1[i] - 1) *.05 )))
} else {
sapply(oth_let0, function(i) print(paste(oth_let0[i], "L", (.66666*1.00001) - (oth_let0[i] - 1) *.05 )))
}
However for "broke" I get this which is only computing the value of "k" and some other stuff:
[1] "4 L 0.5166666666"
[1] "NA L NA"
[1] "4 L 0.5166666666" "NA L NA"
While the desired output should be a value for both "r" and "k", so something like:
[1] "2 L 0.61666666"
[1] "4 L 0.51666666"
What am I doing wrong? Am I using sapply incorrectly?
sapply iterates through the supplied vector or list and supplies each member in turn to the function. In your case, you're getting the values 2 and 4 and then trying to index your vector again using its own values. Since the oth_let1 vector has only two members, you get NA. You could fix your current code by replacing the oth_let1[i] with just i. However, your code could be greatly simplified to:
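The difference between receiving a value and receiving an index can be seen in isolation. A minimal sketch using the two positions kept for "broke" (the names by_own_values, by_value and by_position are only for illustration):

```r
oth_let1 <- c(2, 4)   # the positions kept for "broke"

# Indexing the vector by its own values: oth_let1[2] is 4, oth_let1[4] is NA
by_own_values <- sapply(oth_let1, function(i) oth_let1[i])
by_own_values
# [1]  4 NA

# sapply already hands over the values themselves
by_value <- sapply(oth_let1, function(i) i)
by_value
# [1] 2 4

# If positions within the vector are wanted, iterate over seq_along()
by_position <- sapply(seq_along(oth_let1), function(k) oth_let1[k])
by_position
# [1] 2 4
```

The extra third line in the question's output is the return value of sapply being printed on top of the two print() calls made inside the function.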
strg <- 'broke'
lets <- 2:(nchar(strg) - 1)
lets <- lets[-(1:2 + length(lets)) / 2] # removes middle item for odd and middle two for even
cat("Values of the other letters of:", strg, "\n")
#here is where the computation starts, taking in the objects created above
writeLines(paste(lets, "L", 0.66666*1.00001 - (lets - 1) * 0.05, sep = " "))
I'm assuming you want to output the results to the console.
You're using sapply correctly; what's going wrong is the function inside it. What you want is element i of other.letts, not of oth_let1, because oth_let1 already holds indexes into other.letts.
The code below should work. I also changed the variable name to oth_let in both branches, so you don't need the second if. To make the output exactly what you asked for, I wrapped the call in invisible().
strg <- 'broke'
strg.leng <- nchar(strg)
other.letts <- sequence(strg.leng)
if(length(other.letts) %% 2 != 0) {
oth_let <- other.letts[-c(1, ceiling(length(other.letts)/2),
length(other.letts))]
}else{
oth_let <- other.letts[-c(1, c(1,0) + floor(length(other.letts)/2),
length(other.letts))]
}
print(paste("Values of the other letters of: ", strg))
invisible(sapply(oth_let,
function(i)
print(paste(other.letts[i], "L", (.66666*1.00001) - (other.letts[i] - 1) *.05 ))))
