R change vector to pythonic tuple

Hi, I want to convert a vector of strings into a single string in Python tuple format.
Input:
a <- c('stack', 'overflow', 'kicks', 'ass')
Expected Output:
"('stack', 'overflow', 'kicks', 'ass')"
What would be an easy solution to implement?
This is what I have done and I expect there should be an easier solution:
> b <- a[1]
> for(word in a[-1]){ b <- paste(b, word, sep="','") }
> b
[1] "stack','overflow','kicks','ass"
> b <- paste("('", b, "')",sep="")
> b
[1] "('stack','overflow','kicks','ass')"

> paste0("(", paste(sQuote(a), collapse = ","), ")")
[1] "(‘stack’,‘overflow’,‘kicks’,‘ass’)"
> options(useFancyQuotes = FALSE)
> paste0("(", paste(sQuote(a), collapse = ","), ")")
[1] "('stack','overflow','kicks','ass')"
> substring(capture.output(dput(a)), 2)
[1] "(\"stack\", \"overflow\", \"kicks\", \"ass\")"
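If the spaces after the commas in the expected output matter, one option is to add the quotes by hand and collapse with ", ". This is only a minimal sketch: the helper name to_py_tuple is hypothetical, and it assumes the strings contain no single quotes or backslashes that would need escaping. A length-one vector would also need a trailing comma to be a valid Python tuple, which this sketch does not handle.

```r
# Hypothetical helper: wrap each element in single quotes,
# join with ", " and surround the result with parentheses.
to_py_tuple <- function(x) {
  paste0("(", paste0("'", x, "'", collapse = ", "), ")")
}

a <- c("stack", "overflow", "kicks", "ass")
to_py_tuple(a)
# [1] "('stack', 'overflow', 'kicks', 'ass')"
```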

Related

Extract matching words from strings in order

If I have two strings that look like this:
x <- "Here is a test of words and stuff."
y <- "Here is a better test of words and stuff."
Is there an easy way to check the words from left to right and create a new string of matching words and then stop when the words no longer match so the output would look like:
> "Here is a"
I don't want to find all matching words between the two strings but rather just the words that match in order. So "words and stuff." is in both strings but I don't want that to be selected.
Split the strings, compute the minimum of the length of the two splits, take that number of words from the head of each and append a FALSE to ensure a non-match can occur when matching the corresponding words. Then use which.min to find the first non-match and take that number minus 1 of the words and paste back together.
L <- strsplit(c(x, y), " +")
wx <- which.min(c(do.call(`==`, lapply(L, head, min(lengths(L)))), FALSE))
paste(head(L[[1]], wx - 1), collapse = " ")
## [1] "Here is a"
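This shows you the first n words that match. The first attempt below uses which.max on the running mismatch count, which selects the last position rather than the first mismatch; comparing the count with > 0 fixes that: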
xvec <- strsplit(x, " +")[[1]]
yvec <- strsplit(y, " +")[[1]]
(len <- min(c(length(xvec), length(yvec))))
# [1] 8
i <- which.max(cumsum(head(xvec, len) != head(yvec, len)))
list(xvec[1:i], yvec[1:i])
# [[1]]
# [1] "Here" "is" "a" "test" "of" "words" "and" "stuff."
# [[2]]
# [1] "Here" "is" "a" "better" "test" "of" "words" "and"
cumsum(head(xvec, len) != head(yvec, len))
# [1] 0 0 0 1 2 3 4 5
i <- which.max(cumsum(head(xvec, len) != head(yvec, len)) > 0)
list(xvec[1:(i-1)], yvec[1:(i-1)])
# [[1]]
# [1] "Here" "is" "a"
# [[2]]
# [1] "Here" "is" "a"
From here, we can easily derive the leading string:
paste(xvec[1:(i-1)], collapse = " ")
# [1] "Here is a"
and the remaining strings with
paste(xvec[-(1:(i-1))], collapse = " ")
# [1] "test of words and stuff."
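The indexing with 1:(i-1) above misbehaves when the very first words differ, because 1:0 counts down instead of being empty. A sketch that wraps the same idea in a function and uses seq_len to cover that case could look like this (the function name common_leading_words is hypothetical):

```r
# Hypothetical helper: return the leading words shared by x and y.
common_leading_words <- function(x, y) {
  xvec <- strsplit(x, " +")[[1]]
  yvec <- strsplit(y, " +")[[1]]
  len <- min(length(xvec), length(yvec))
  same <- head(xvec, len) == head(yvec, len)
  # Appending FALSE guarantees a mismatch exists, so which.min returns
  # the position of the first FALSE; subtract 1 to count leading matches.
  n <- which.min(c(same, FALSE)) - 1L
  # seq_len(0) is empty, so no matching words gives "".
  paste(xvec[seq_len(n)], collapse = " ")
}

x <- "Here is a test of words and stuff."
y <- "Here is a better test of words and stuff."
common_leading_words(x, y)
# [1] "Here is a"
common_leading_words(x, "This string doesn't match")
# [1] ""
```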
I wrote a function which will check the string and return the desired output:
x <- "Here is a test of words and stuff."
y <- "Here is a better test of words and stuff."
z <- "This string doesn't match"
library(purrr)
check_str <- function(inp, pat, delimiter = "\\s") {
  inp <- unlist(strsplit(inp, delimiter))
  pat <- unlist(strsplit(pat, delimiter))
  ln_diff <- length(inp) - length(pat)
  if (ln_diff < 0) {
    inp <- append(inp, rep("", abs(ln_diff)))
  }
  if (ln_diff > 0) {
    pat <- append(pat, rep("", abs(ln_diff)))
  }
  idx <- map2_lgl(inp, pat, ~ identical(.x, .y))
  rle_idx <- rle(idx)
  if (rle_idx$values[1]) {
    idx2 <- seq_len(rle_idx$lengths[1])
  } else {
    idx2 <- 0
  }
  paste0(inp[idx2], collapse = delimiter)
}
check_str(x, y, " ")
#> [1] "Here is a"
check_str(x, z, " ")
#> [1] ""
Created on 2023-02-13 with reprex v2.0.2
You could write a helper function to do the check for you
common_start <- function(x, y) {
  i <- 1
  last <- NA
  # Walk both strings in step; stop at the end of the shorter one.
  while (i <= nchar(x) && i <= nchar(y)) {
    if (substr(x, i, i) == substr(y, i, i)) {
      # Remember the last matching space or punctuation character.
      if (grepl("[[:space:][:punct:]]", substr(x, i, i), perl = TRUE)) {
        last <- i
      }
    } else {
      break
    }
    i <- i + 1
  }
  if (!is.na(last)) {
    substr(x, 1, last - 1)
  } else {
    NA
  }
}
and use that with your sample strings
common_start(x,y)
# [1] "Here is a"
The idea is to check every character, keeping track of the last non-word character that still matches. Using a while loop may not be fancy but it does mean you get to break early without processing the whole string as soon as a mismatch is found.

Substitute captured non-ascii letter with upper case

Is it possible to replicate, using only regex and only base R (only using the g*sub() functions), the following...
sub("(i)", "\\U\\1", "string", perl = TRUE)
# [1] "strIng"
For non-ascii letters?
# Hoped for output
sub("(í)", "?", "stríng", perl = TRUE)
# [1] "strÍng"
PS. R regex flavours are TRE and PCRE.
PS2. I'm using R 4.2.1 with Sys.getlocale() giving:
[1] "LC_COLLATE=Icelandic_Iceland.utf8;LC_CTYPE=Icelandic_Iceland.utf8;LC_MONETARY=Icelandic_Iceland.utf8;LC_NUMERIC=C;LC_TIME=Icelandic_Iceland.utf8"
You can use
x="stríng"
gr <- gregexpr("í", x)
mat <- regmatches(x, gr)
regmatches(x, gr) <- lapply(mat, toupper)
# > x
# > [1] "strÍng"
For a slightly more involved/explicit solution that only uses base R:
sub_nascii <- function(pattern, string) {
  matches <- gregexpr(pattern, string)[[1]]
  for (i in matches) {
    substr(string, i, i) <- toupper(substr(string, i, i))
  }
  string
}
sub_nascii(pattern = "í", "stríng")
This works in my locale where sub on its own doesn't.
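Both answers can be folded into one small base R helper built on the regmatches replacement form. Unlike the loop above, which upper-cases only the first character of each match, this sketch upper-cases the whole match. The name upper_matches is hypothetical, and whether non-ASCII letters are converted still depends on how toupper behaves in the current locale, so the example below sticks to ASCII input.

```r
# Hypothetical helper: upper-case every match of `pattern` in `x`.
upper_matches <- function(pattern, x) {
  m <- gregexpr(pattern, x)
  # regmatches(x, m) is a list with one character vector per element of x;
  # assigning back replaces each match in place.
  regmatches(x, m) <- lapply(regmatches(x, m), toupper)
  x
}

upper_matches("i", "string")
# [1] "strIng"
upper_matches("[aeiou]+", "queue of strings")
# [1] "qUEUE Of strIngs"
```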

Deleting nth delimiter in R

I am trying to delete the 5th delimiter in this string:
"Bacteria_Firmicutes_Clostridia_Clostridiales_Rumino_coccaceae_Ruminococcus_Ruminococcus_albus"
so it becomes:
"Bacteria_Firmicutes_Clostridia_Clostridiales_Ruminococcaceae_Ruminococcus_Ruminococcus_albus"
This seems to work, but I feel like there should be a more elegant solution possibly with regex and str_replace
library(stringr)
name <- "Bacteria_Firmicutes_Clostridia_Clostridiales_Rumino_coccaceae_Ruminococcus_Ruminococcus_albus"
index <- str_locate_all(name, "_")[[1]]
str_sub(name, index[5, "start"], index[5, "end"]) <- ""
name
Try gsub:
> gsub("((?:[^_]+_){4}[^_]+)_", "\\1", name)
[1] "Bacteria_Firmicutes_Clostridia_Clostridiales_Ruminococcaceae_Ruminococcus_Ruminococcus_albus"
>
Or a less "pretty" way:
> gsub("([^_]*_[^_]*_[^_]*_[^_]*_[^_]*)_", "\\1", name)
[1] "Bacteria_Firmicutes_Clostridia_Clostridiales_Ruminococcaceae_Ruminococcus_Ruminococcus_albus"
>
Or with the strex library:
> library(strex)
> paste(str_before_nth(name, "_", 5), str_after_nth(name, "_", 5), sep="")
[1] "Bacteria_Firmicutes_Clostridia_Clostridiales_Ruminococcaceae_Ruminococcus_Ruminococcus_albus"
>
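Since only one delimiter should go, sub with an anchored pattern may be a safer fit than gsub, which would keep removing every further fifth delimiter in a longer string. The following is a sketch only: the function name remove_nth_delim is hypothetical, and it assumes the delimiter is a single character that needs no escaping inside a regular expression.

```r
# Hypothetical helper: delete the n-th occurrence of a one-character delimiter.
remove_nth_delim <- function(x, n, delim = "_") {
  # Capture everything up to (but not including) the n-th delimiter,
  # then match that delimiter and replace the lot with the capture.
  pattern <- sprintf("^((?:[^%s]*%s){%d}[^%s]*)%s",
                     delim, delim, as.integer(n) - 1L, delim, delim)
  sub(pattern, "\\1", x, perl = TRUE)
}

name <- "Bacteria_Firmicutes_Clostridia_Clostridiales_Rumino_coccaceae_Ruminococcus_Ruminococcus_albus"
remove_nth_delim(name, 5)
# [1] "Bacteria_Firmicutes_Clostridia_Clostridiales_Ruminococcaceae_Ruminococcus_Ruminococcus_albus"
```

If the string has fewer than n delimiters, the pattern does not match and the input comes back unchanged.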

Collapsing mixed types into a neat comma separated string

I have a list of mixed types which I would like to collapse into a neat comma separated string to be read somewhere else. The following is a MWE:
a <- "name"
b <- as.vector(c(10))
names(b) <- c('s')
c <- as.vector(c(1, 2))
names(c) <- c('p1', 'p2')
d <- 20
r <- list(a, b, c, d)
r
# [[1]]
# [1] "name"
#
# [[2]]
# s
# 10
#
# [[3]]
# p1 p2
# 1 2
#
# [[4]]
# [1] 20
I want this:
# [1] '"name","10","1,2","20"'
But this is as far as I got:
# Collapse individual elements into individual strings.
# `sapply` with `paste` works perfectly:
> sapply(r, paste, collapse = ",")
# [1] "name" "10" "1,2" "20"
# Try paste again (doesn't work):
> paste(sapply(r, paste, collapse = ","), collapse = ',')
# [1] "name,10,1,2,20"
I tried paste0 and cat to no avail. The only way I could do it was by using write.table and passing it an in-memory buffer. That approach is too complicated and quite error-prone. I need my code to work on a cluster with MPI.
You need to add in the quotes - the ones printed after your sapply are just markers to show they are strings. This seems to work...
cat(paste0('"',sapply(r, paste, collapse = ','),'"',collapse=','))
"name","10","1,2","20"
You might need to try with and without the cat if you are writing to a file. Without it, at the terminal, you get backslashes before the 'real' quotes.
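To see that the backslashes are only how print displays the string, it can help to store the result and show it both ways. This sketch rebuilds the list from the question and uses vapply in place of sapply; the variable names fields and out are just for illustration.

```r
# Rebuild the example list from the question.
a <- "name"
b <- c(s = 10)
c <- c(p1 = 1, p2 = 2)
d <- 20
r <- list(a, b, c, d)

# One string per list element, then quote each and join with commas.
fields <- vapply(r, paste, character(1), collapse = ",")
out <- paste0('"', fields, '"', collapse = ",")

print(out)       # the quotes are shown escaped, as \"
writeLines(out)  # the actual characters: "name","10","1,2","20"
```

The string held in out contains plain double quotes, so writing it to a file with writeLines or cat produces the desired text.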

R: f(x) != sapply(x,f) -- bug or feature?

> f = function(x) as.Date(as.character(x), format='%Y%m%d')
> f(20110606)
[1] "2011-06-06"
> sapply(20110606, f)
[1] 15131
Why are the two returned values not the same? I need to apply this function to a long vector of dates, but I'm not getting dates with sapply()!
The functions you use to create f are already vectorized. There's no need to use sapply, unless you work for the Department of Redundancy Department.
> f <- function(x) as.Date(as.character(x), format='%Y%m%d')
> d <- 20110606 + 0:10
> f(d)
[1] "2011-06-06" "2011-06-07" "2011-06-08" "2011-06-09"
[5] "2011-06-10" "2011-06-11" "2011-06-12" "2011-06-13"
[9] "2011-06-14" "2011-06-15" "2011-06-16"
> lapply(20110606, f)
[[1]]
[1] "2011-06-06"
> unlist(lapply(20110606, f))
[1] 15131
sapply unlists the result of lapply and in doing so drops the Date class:
> unclass(lapply(20110606, f)[[1]])
[1] 15131
> class(lapply(20110606, f)[[1]])
[1] "Date"
As @Joshua Ulrich noted, there is no need to use apply-type functions; however, for interest,
d <- 20110606 + 0:10
do.call("c",lapply(d, f))
would be one possible way to "unlist" the dates
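As a short sketch of the options discussed here, the list of dates can be combined with c, or the plain numbers returned by sapply can be turned back into dates by supplying the origin. The variable names via_c and via_origin are only for illustration.

```r
f <- function(x) as.Date(as.character(x), format = "%Y%m%d")
d <- 20110606 + 0:10

# Combine the list of Date objects with c(), which keeps the class.
via_c <- do.call(c, lapply(d, f))

# Or restore the class on the day counts that sapply returns.
via_origin <- as.Date(sapply(d, f), origin = "1970-01-01")

class(via_c)
# [1] "Date"
all(via_c == f(d))
# [1] TRUE
all(via_origin == f(d))
# [1] TRUE
```

Calling f(d) directly remains the simplest route, since as.Date and as.character are already vectorized.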
