Parse string by | - r

I have a list of strings which look like this:
categories <- "|Music|Consumer Electronics|Mac|Software|"
However, I only want to get the first string, in this case Music (without the |). I tried:
sub(categories, pattern = " |", replacement = "")
However, that does not give me the desired result. Any recommendation how to correctly parse my string?
I appreciate your answer!
UPDATE
> dput(head(df))
structure(list(data.founded_at = c("01.06.2012", "26.10.2012",
"01.04.2011", "01.01.2012", "10.10.2011", "01.01.2007"), data.category_list = c("|Entertainment|Politics|Social Media|News|",
"|Publishing|Education|", "|Electronics|Guides|Coffee|Restaurants|Music|iPhone|Apps|Mobile|iOS|E-Commerce|",
"|Software|", "|Software|", "|Curated Web|")), .Names = c("data.founded_at",
"data.category_list"), row.names = c(NA, 6L), class = "data.frame")

An alternative for this could be scan:
na.omit(scan(text = categories, sep = "|", what = "", na.strings = ""))[1]
# Read 6 items
# [1] "Music"

Find a function that will tokenize a string at a particular character: strsplit would be my guess.
http://stat.ethz.ch/R-manual/R-devel/library/base/html/strsplit.html

Note that the split parameter is a regular expression, so using split="|" will not work unless you specify fixed=TRUE (as joran suggested in the comments). Alternatively, put the character in brackets:
strsplit(categories,split="[|]")[[1]][2]
To apply this to the data frame you could do this:
sapply(df$data.category_list, function(x) strsplit(x,split="[|]")[[1]][2])
But this is faster (see the comments):
vapply(strsplit(df$data.category_list, "|", fixed = TRUE), `[`, character(1L), 2)
(thanks to Ananda Mahto)
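To see why the regex form needs care, here is a minimal sketch on a shortened sample vector (the names cats, chars and first_cat are made up for illustration). An unescaped "|" is an alternation between two empty patterns, so it matches between every character, while fixed = TRUE treats it as a literal.

```r
# Hypothetical sample data, shortened from the question
cats <- c("|Music|Consumer Electronics|Mac|Software|", "|Publishing|Education|")

# Unescaped "|" is a regex alternation of two empty patterns,
# so the string is split between every character
chars <- strsplit(cats[1], "|")[[1]]

# Literal split: element 1 is the empty string before the leading "|",
# so the first category is element 2
first_cat <- vapply(strsplit(cats, "|", fixed = TRUE), `[`, character(1L), 2)
print(first_cat)
```

vapply is used instead of sapply because it checks that each result is a single character value.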

Related

R Using a list of text value output in binary when character appears more than once in a string

Using R in Databricks.
I have the following sample list of possible text entries.
extract <- c("codeine", "tramadol", "fentanyl", "morphine")
I want to check whether more than one of these appears in a string (example below) and return a binary output in a new column.
Example = ("codeine with fentanyl oral")
The output for this example would be 1.
I have tried the following with only partial success:
df$testvar1 <- +(str_count(df$medname, fixed(extract))> 1)
also tried
df$testvar2 <- cSplit_e(df$medname, split.col = "String", sep = " ", type = "factor", mode = "binary", fixed = TRUE, fill = 0)
and also
df$testvar3 <- str_extract_all(df$medname, paste(extract, collapse = " "))
Combine your extract with |.
+(stringr::str_count(Example, paste(extract, collapse = "|"))> 1)
# [1] 1
I tried the following and it worked for my code
df$testvar <- sapply(df$medname, function(x) str_extract(x, paste(extract, collapse="|")))
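For a whole column, a base R sketch along the same lines could count how many distinct terms occur in each string. The medname values below are made up for illustration, and n_terms and flag are hypothetical names. Unlike str_count, this counts each term at most once per string, so a repeated "codeine codeine" would not be flagged.

```r
extract <- c("codeine", "tramadol", "fentanyl", "morphine")

# Made-up example column, standing in for df$medname
medname <- c("codeine with fentanyl oral", "morphine sulfate", "paracetamol")

# One logical vector per term: does the term occur in each string?
hits <- lapply(extract, function(term) grepl(term, medname, fixed = TRUE))

# Add the logical vectors element-wise to get the number of distinct terms found
n_terms <- Reduce(`+`, hits)

# Unary plus turns the logical comparison into 0/1
flag <- +(n_terms > 1)
print(flag)
```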

R - Construct a string with double quotations

I need the resulting string to contain double quotation marks, which requires escape characters. Preferably using base R, without extra packages.
I have tried sQuote, shQuote and noquote. They only manipulate the quotations, not the escape character.
My list:
power <- "test"
myList <- list (
"power" = power)
I subset the content using:
myList
myList$power
Expected outcome (a string with following content):
" \"power\": \"test\" "
Using package glue:
library(glue)
glue(' "{names(myList)}": "{myList}" ')
"power": "test"
Another option using shQuote
paste(shQuote(names(myList), type = "cmd"),
shQuote(unlist(myList), type = "cmd"),
sep = ": ")
# [1] "\"power\": \"test\""
I'm not sure I understand what you expect. Is this what you want?
myList <- list (
"power" = "test"
)
stringr::str_remove_all(
as.character(jsonlite::toJSON(myList, auto_unbox = TRUE)),
"[\\{|\\}]")
# [1] "\"power\":\"test\""
If you want some spaces:
x <- stringr::str_remove_all(
as.character(jsonlite::toJSON(myList, auto_unbox = TRUE)),
"[\\{|\\}]")
paste0(" ", x, " ")
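A base R sketch without any packages, assuming the goal is just a string that contains the double quotes. The backslashes shown by print are only how R displays embedded quotes; they are not part of the string, which cat makes visible.

```r
myList <- list(power = "test")

# A single-quoted R literal may contain bare double quotes,
# so no manual escaping is needed in the format string
s <- sprintf(' "%s": "%s" ', names(myList), unlist(myList, use.names = FALSE))

print(s)      # displays the escaped form, with \" for each embedded quote
cat(s, "\n")  # displays the raw characters, without backslashes
```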

R: functional approach to multiple consecutive gsub

I've been wrapping my head around this for a while, trying plenty of varieties of map, Reduce and such but without success so far.
I am looking for a functional, elegant approach to substitute a sequence of gsub as in
text_example <- c(
"I'm sure dogs are the best",
"I won't, I can't think otherwise",
"We'll be happy to discuss about dogs",
"cant do it today tho"
)
text_example %>%
gsub(pattern = "'ll", replacement = " will") %>%
gsub(pattern = "can'?t", replacement = "can not") %>%
gsub(pattern = "won'?t", replacement = "will not") %>%
gsub(pattern = "n't", replacement = " not") %>%
gsub(pattern = "'m", replacement = " am") %>%
gsub(pattern = "'s", replacement = " is") %>%
gsub(pattern = "dog", replacement = "cat")
Into something of the form
text_example %>%
???(dict$pattern, dict$replacement, gsub())
Where, for sake of a reproducible example, dict can be a data.frame such as
dict <- structure(
list(
pattern = c("'ll", "can'?t", "won'?t", "n't", "'m", "'s", "dog"),
replacement = c(" will", "can not", "will not", " not", " am", " is", "cat")
),
row.names = c(NA, -7L),
class = "data.frame"
)
(and I am aware that the substitutions performed might not be correct linguistically, but that's not the problem now)
Of course, a brutal
for(i in seq(nrow(dict))) {
text_example <- gsub(dict$pattern[i], dict$replacement[i], text_example)
}
would work, and I know that there are dozens of libraries that solve this issue with some specific function. But I want to understand how to deal with recursions and problems like this in a simple, functional way, keeping as close as possible to base R. I love my lambdas!
Thank you in advance for the help.
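You can use mapply for a parallel apply-effect. Note that the function has to be the first argument:
mapply(function(pttrn, rep) gsub(pttrn, rep, text_example), dict$pattern, dict$replacement)
(You might want to use SIMPLIFY=FALSE.) Each pattern is applied to the original text_example independently, so the replacements are not chained.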
Maybe the following does what you want.
It is inspired by Functional Programming, the link in your comment.
I don't like the output though: it is a list with as many elements as there are rows in the data frame dict, and only the last one is of interest.
new_text <- function(text) {
# txt lives in the enclosing environment and is updated on every call
txt <- text
function(pattern, replacement) {
txt <<- gsub(pattern, replacement, txt)
txt
}
}
Replace <- new_text(text_example)
Map(Replace, as.list(dict[[1]]), as.list(dict[[2]]))
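A fold with base R's Reduce may be closer to what the question asks for. This is a minimal sketch, not taken from the answers above: the accumulator is the text, and each step applies one row of dict. The name result is made up for illustration.

```r
text_example <- c(
  "I'm sure dogs are the best",
  "I won't, I can't think otherwise",
  "We'll be happy to discuss about dogs",
  "cant do it today tho"
)
dict <- data.frame(
  pattern = c("'ll", "can'?t", "won'?t", "n't", "'m", "'s", "dog"),
  replacement = c(" will", "can not", "will not", " not", " am", " is", "cat"),
  stringsAsFactors = FALSE
)

# Fold over the row indices; text_example is the initial accumulator,
# and each step feeds the previous result into the next gsub
result <- Reduce(
  function(txt, i) gsub(dict$pattern[i], dict$replacement[i], txt),
  seq_len(nrow(dict)),
  text_example
)
print(result)
```

Unlike the Map version, only the final text is returned. Passing accumulate = TRUE to Reduce would keep the intermediate steps as well.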

Move "*" to new column in R

Hello, I have a column in a data.frame with many rows, e.g.,
df = data.frame("Species" = c("*Briza minor", "*Briza minor", "Wattle"))
I want to make a new column "Species_new" where the "*" is moved to the end of the character string, e.g.,
df = data.frame("Species" = c("*Briza minor", "*Briza minor", "Wattle"),
"Species_new" = c("Briza minor*", "Briza minor*", "Wattle"))
Is there a way to do this using gsub? The manual example would take far too long as I have approximately 50,000 rows.
Thanks in advance
One option is to capture the * as a group and in the replacement reverse the backreferences
df$Species_new <- sub("^([*])(.*)$", "\\2\\1", df$Species)
df$Species_new
#[1] "Briza minor*" "Briza minor*" "Wattle"
NOTE: * is a metacharacter meaning 0 or more, so we can either escape (\\*) or place it in brackets ([]) to evaluate the raw character i.e. literal evaluation
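As a quick check of the note above, the two ways of making * literal can be compared directly. This is a small sketch on the question's sample values; the names bracketed and escaped are made up.

```r
species <- c("*Briza minor", "*Briza minor", "Wattle")

# Literal * via a bracket expression
bracketed <- sub("^([*])(.*)$", "\\2\\1", species)

# Literal * via a backslash escape (doubled inside an R string)
escaped <- sub("^(\\*)(.*)$", "\\2\\1", species)

# Strings without a leading * do not match and are returned unchanged
print(identical(bracketed, escaped))
```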
Thanks so much for the quick response. I also found a workaround:
df$Species_new = sub("[*]","",df$Species, perl=TRUE)
differences = setdiff(df$Species,df$Species_new)
# %in% also works when more than one distinct species has a leading *
tochange = subset(df, Species %in% differences)
toleave = subset(df, !Species %in% differences)
tochange$Species_new = paste(tochange$Species_new, "*", sep = "")
df = rbind(tochange,toleave)

Replace multiple strings comprising of a different number of characters with one gsubfn()

The question "Replace multiple strings in one gsub() or chartr() statement in R?" explains how to replace multiple single-character strings in one statement with gsubfn(). E.g.:
x <- "doremi g-k"
gsubfn(".", list("-" = "_", " " = ""), x)
# "doremig_k"
I would however like to replace the string 'doremi' in the example with ''. This does not work:
x <- "doremi g-k"
gsubfn(".", list("-" = "_", "doremi" = ""), x)
# "doremi g_k"
I guess this is because the string 'doremi' contains multiple characters while I am using the metacharacter . in gsubfn, which matches one character at a time. I have no idea what to replace it with, and I must confess I sometimes find metacharacters a bit difficult to understand. Is there a way to replace '-' and 'doremi' at once?
You might be able to just use base R sub here:
x <- "doremi g-k"
result <- sub("doremi\\s+([^-]+)-([^-]+)", "\\1_\\2", x)
result
[1] "g_k"
Does this work for you?
gsubfn::gsubfn(pattern = "doremi|-", list("-" = "_", "doremi" = ""), x)
[1] " g_k"
The key is the pattern "doremi|-", which tells gsubfn to search for either "doremi" or "-". Use "|" as the or operator.
Just a more generic version of @RLave's solution:
toreplace <- list("-" = "_", "doremi" = "")
gsubfn(paste(names(toreplace),collapse="|"), toreplace, x)
[1] " g_k"
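If gsubfn is not available, a similar single-pass lookup can be sketched in base R with gregexpr and regmatches. This is only an illustration: it assumes the names in toreplace contain no regex metacharacters, and the names pattern, m and result are made up.

```r
x <- "doremi g-k"
toreplace <- c("-" = "_", "doremi" = "")

# Build one alternation pattern from the names, as in the gsubfn version
pattern <- paste(names(toreplace), collapse = "|")

# Locate every match, then swap each matched string for its lookup value
m <- gregexpr(pattern, x)
result <- x
regmatches(result, m) <- lapply(
  regmatches(x, m),
  function(hits) unname(toreplace[hits])
)
print(result)
```

The leading space is left over from the original string; trimws(result) would remove it.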

Resources