I have a list of strings which look like that:
categories <- "|Music|Consumer Electronics|Mac|Software|"
However, I only want get the first string. In this case Music(without |). I tried:
sub(categories, pattern = " |", replacement = "")
However, that does not give me the desired result. Any recommendation how to correctly parse my string?
I appreciate your answer!
UPDATE
> dput(head(df))
structure(list(data.founded_at = c("01.06.2012", "26.10.2012",
"01.04.2011", "01.01.2012", "10.10.2011", "01.01.2007"), data.category_list = c("|Entertainment|Politics|Social Media|News|",
"|Publishing|Education|", "|Electronics|Guides|Coffee|Restaurants|Music|iPhone|Apps|Mobile|iOS|E-Commerce|",
"|Software|", "|Software|", "|Curated Web|")), .Names = c("data.founded_at",
"data.category_list"), row.names = c(NA, 6L), class = "data.frame")
An alternative for this could be scan:
na.omit(scan(text = categories, sep = "|", what = "", na.strings = ""))[1]
# Read 6 items
# [1] "Music"
Find a function that will tokenize a string at a particular character: strsplit would be my guess.
http://stat.ethz.ch/R-manual/R-devel/library/base/html/strsplit.html
Note that the parameter in split is a regexp, so using split="|" will not work (unless you specify fixed=TRUE, as suggested from joran -thanks- in the comments)
strsplit(categories,split="[|]")[[1]][2]
To apply this to the data frame you could do this:
sapply(df$data.category_list, function(x) strsplit(x,split="[|]")[[1]][2])
But this is faster (see the comments):
vapply(strsplit(df$data.category_list, "|", fixed = TRUE), `[`, character(1L), 2)
(thanks to Ananda Mahto)
Related
Using R in Databricks.
I have the following sample list of possible text entries.
extract <- c("codeine", "tramadol", "fentanyl", "morphine")
I want check if any of these appear more than once in a string (example below) and return a binary output in a new column.
Example = ("codeine with fentanyl oral")
The output for this example would be 1.
I have tried the following with only partial success:
df$testvar1 <- +(str_count(df$medname, fixed(extract))> 1)
also tried
df$testvar2 <- cSplit_e(df$medname, split.col = "String", sep = " ", type = "factor", mode = "binary", fixed = TRUE, fill = 0)
and also
df$testvar3 <- str_extract_all(df$medname, paste(extract, collapse = " "))
Combine your extract with |.
+(stringr::str_count(Example, paste(extract, collapse = "|"))> 1)
# [1] 1
I tried the following and it worked for my code
df$testvar <- sapply(df$medname, function(x) str_extract(x, paste(extract, collapse="|")))
I basically need the outcome (string) to have double quotations, thus need of escape character. Preferabily solving with R base, without extra R packages.
I have tried with squote, shQuote and noquote. They just manipulate the quotations, not the escape character.
My list:
power <- "test"
myList <- list (
"power" = power)
I subset the content using:
myList
myList$power
Expected outcome (a string with following content):
" \"power\": \"test\" "
Using package glue:
library(glue)
glue(' "{names(myList)}": "{myList}" ')
"power": "test"
Another option using shQuote
paste(shQuote(names(myList), type = "cmd"),
shQuote(unlist(myList), type = "cmd"),
sep = ": ")
# [1] "\"power\": \"test\""
Not sure to get your expectation. Is it what you want?
myList <- list (
"power" = "test"
)
stringr::str_remove_all(
as.character(jsonlite::toJSON(myList, auto_unbox = TRUE)),
"[\\{|\\}]")
# [1] "\"power\":\"test\""
If you want some spaces:
x <- stringr::str_remove_all(
as.character(jsonlite::toJSON(myList, auto_unbox = TRUE)),
"[\\{|\\}]")
paste0(" ", x, " ")
I've been wrapping my head around this for a while, trying plenty of varieties of map, Reduce and such but without success so far.
I am looking for a functional, elegant approach to substitute a sequence of gsub as in
text_example <- c(
"I'm sure dogs are the best",
"I won't, I can't think otherwise",
"We'll be happy to discuss about dogs",
"cant do it today tho"
)
text_example %>%
gsub(pattern = "'ll", replacement = " will") %>%
gsub(pattern = "can'?t", replacement = "can not") %>%
gsub(pattern = "won'?t", replacement = "will not") %>%
gsub(pattern = "n't", replacement = " not") %>%
gsub(pattern = "'m", replacement = " am") %>%
gsub(pattern = "'s", replacement = " is") %>%
gsub(pattern = "dog", replacement = "cat") %>%
Into something of the form
text_example %>%
???(dict$pattern, dict$replacement, gsub())
Where, for sake of a reproducible example, dict can be a data.frame such as
dict <- structure(
list(
pattern = c("'ll", "can'?t", "won'?t", "n't", "'m", "'s", "dog"),
replacement = c(" will", "can not", "will not", " not", " am", " is", "cat")
),
row.names = c(NA, -7L),
class = "data.frame"
)
(and I am aware that the substitutions performed might not be correct linguistically, but that's not the problem now)
Of course, a brutal
for(i in seq(nrow(dict))) {
text_example <- gsub(dict$pattern[i], dict$replacement[i], text_example)
}
would work, and I know that there are dozens of libraries that solve this issue with some specific function. But I want to understand how to deal with recursions and problems like this in a simple, functional way, keeping as close as possible to base R. I love my lambdas!
Thank you in advance for the help.
You can use mapply for a parallel apply-effect:
mapply(dict$pattern, dict$replacement, function(pttrn, rep) gsub(pttrn, rep, text_example))
(You might want to use SIMPLIFY=FALSE)
Maybe the following does what you want.
It is inspired in Functional Programming, the link in your comment.
I don't like the output though, it is a list with as many elements as rows of dataframe dict and only the last one is the one of interess.
new_text <- function(pattern, replacement, text) {
txt <- text
function(pattern, replacement) {
txt <<- gsub(pattern, replacement, txt)
txt
}
}
Replace <- new_text(p, r, text = text_example)
Map(Replace, as.list(dict[[1]]), as.list(dict[[2]]))
Hello I have a column in a data.frame, it has many rows, e.g.,
df = data.frame("Species" = c("*Briza minor", "*Briza minor", "Wattle"))
I want to make a new column "Species_new" where the "*" is moved to the end of the character string, e.g.,
df = data.frame("Species" = c("*Briza minor", "*Briza minor", "Wattle"),
"Species_new" = c("Briza minor*", "Briza minor*", "Wattle"))
Is there a way to do this using gsub? The manual example would take far too long as I have approximately 50,000 rows.
Thanks in advance
One option is to capture the * as a group and in the replacement reverse the backreferences
df$Species_new <- sub("^([*])(.*)$", "\\2\\1", df$Species)
df$Species_new
#[1] "Briza minor*" "Briza minor*" "Wattle"
NOTE: * is a metacharacter meaning 0 or more, so we can either escape (\\*) or place it in brackets ([]) to evaluate the raw character i.e. literal evaluation
Thanks so much for the quick response, I also found a workaround;
df$Species_new = sub("[*]","",df$Species, perl=TRUE)
differences = setdiff(df$Species,df$Species_new)
tochange = subset(df,df$Species == differences)
toleave = subset(df,!df$Species == differences)
tochange$Species_new = paste(tochange$Species_new, "*", sep = "")
df = rbind(tochange,toleave)
Here Replace multiple strings in one gsub() or chartr() statement in R? it is explained to replace multiple strings of one character at in one statement with gsubfn(). E.g.:
x <- "doremi g-k"
gsubfn(".", list("-" = "_", " " = ""), x)
# "doremig_k"
I would however like to replace the string 'doremi' in the example with ''. This does not work:
x <- "doremi g-k"
gsubfn(".", list("-" = "_", "doremi" = ""), x)
# "doremi g_k"
I guess it is because of the fact that the string 'doremi' contains multiple characters and me using the metacharacter . in gsubfn. I have no idea what to replace it with - I must confess I find the use of metacharacters sometimes a bit difficult to udnerstand. Thus, is there a way for me to replace '-' and 'doremi' at once?
You might be able to just use base R sub here:
x <- "doremi g-k"
result <- sub("doremi\\s+([^-]+)-([^-]+)", "\\1_\\2", x)
result
[1] "g_k"
Does this work for you?
gsubfn::gsubfn(pattern = "doremi|-", list("-" = "_", "doremi" = ""), x)
[1] " g_k"
The key is this search: "doremi|-" which tells to search for either "doremi" or "-". Use "|" as the or operator.
Just a more generic solution to #RLave's solution -
toreplace <- list("-" = "_", "doremi" = "")
gsubfn(paste(names(toreplace),collapse="|"), toreplace, x)
[1] " g_k"