I'm setting up a new project in R and want to extract specific symbols from text:
X <- c("amazing tiny phone ^_^","so cute!!! <3")
I would like to extract ^_^ and <3 from X in R
Thank you!
More straightforward
X = c("amazing tiny phone ^_^","so cute!!! <3","^_^ and :) are my fav symbols")
patt=c("=d" ,"<3" , ":o" , ":(" ,
":)" , "(y)" , ":*" , "^_^", ":d" ,";)" , ":'(")
variable = sapply(X,function(x){
i = which(patt%in%strsplit(x," ")[[1]])
if (length(i)>0){
paste(patt[i],collapse=" ")
} else{NA}
})
names(variable)=NULL
> variable
[1] "^_^" "<3" ":) ^_^"
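The lookup above only finds emoticons that are separated from other text by single spaces. A minimal base-R sketch of an alternative follows. It assumes a shorter, hypothetical list called emoticons and relies on PCRE's \Q...\E quoting so that characters such as ^ and ) are taken literally. Matches come back in the order they occur in the text, not in the order of the list.

```r
X <- c("amazing tiny phone ^_^", "so cute!!! <3", "^_^ and :) are my fav symbols")
emoticons <- c("<3", ":)", "^_^", ";)")   # hypothetical subset of patt

# \Q...\E quotes each emoticon literally; "|" joins them into one alternation
pattern <- paste0("\\Q", emoticons, "\\E", collapse = "|")

# gregexpr() finds every match per string, regmatches() extracts them
hits <- regmatches(X, gregexpr(pattern, X, perl = TRUE))

# collapse the matches per string, NA when nothing was found
found <- vapply(hits, function(h) {
  if (length(h) > 0) paste(h, collapse = " ") else NA_character_
}, character(1))
found
# [1] "^_^" "<3" "^_^ :)"
```

Because the alternation is searched anywhere in the string, this variant would also pick up an emoticon glued to a word, which may or may not be what you want.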
@GraemeForst A generalization could be achieved using groupings and lookaheads:
group <- "[\\^\\_\\<\\>3\\:\\(\\)\\;]"
pat <- sprintf(".*[\\s\\b](%s+)(?!\\1)", group)
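group defines the character class: basically all the characters that may appear in the symbols we want to extract.
pat defines our matching pattern. The [\\s\\b] is meant to say that a possible match must be preceded by either a blank or a boundary; note that inside a character class PCRE reads \\b as a backspace character, so in practice only the whitespace part applies. The (?!\\1) is a negative lookahead on the backreference: it says the captured text must not be immediately repeated after the match.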
Here is a demo:
X <- c("amazing tiny phone ^_^","so cute!!! <3", "I like pizza :)", "hello beautiful ;)")
gsub(pat, "\\1", grep(pat, X, value = TRUE, perl = TRUE), perl = TRUE)
# [1] "^_^" "<3" ":)" ";)"
This can be further refined and generalized. A very simple step one can add is to extend the grouping.
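One possible refinement, sketched here under the assumption that a symbol always forms a whole whitespace-delimited token: replace the leading character class with lookarounds on \S, so that a match may also sit at the very start or end of the string. The names sym and pat2 are made up for this illustration.

```r
X <- c("amazing tiny phone ^_^", "so cute!!! <3", "I like pizza :)", ";) hello")

# the same characters as in `group`, written as one repeated character class
sym <- "[\\^_<>3:();]+"

# (?<!\S): no non-space character directly before the match
# (?!\S):  no non-space character directly after the match
pat2 <- paste0("(?<!\\S)", sym, "(?!\\S)")

# regexpr() returns the first match per string; regmatches() extracts it
symbols <- regmatches(X, regexpr(pat2, X, perl = TRUE))
symbols
# [1] "^_^" "<3" ":)" ";)"
```

Since the class contains the digit 3, a lone "3" token would be extracted as well; a stricter class or an explicit list of symbols avoids that.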
Old Answer
You can use regex for this:
# create the pattern to be extracted
pat = ".*(\\^\\_\\^).*|.*(\\<3).*" # escape special characters with "\\" and ".*" to specify there may be text before/after
# extract
gsub(pat, "\\1\\2", grep(pat, X, value = TRUE, perl = TRUE), perl = TRUE)
# [1] "^_^" "<3"
I want the function to return the string that follows the below condition.
after "def"
in the parentheses right before the first %ile after "def"
So the desired output is "4", not "5". So far, I was able to extract "2)(3)(4". If I change the function to str_extract_all, the output becomes "2)(3)(4" and "5". I couldn't figure out how to fix this problem. Thanks!
x <- "abc(0)(1)%ile, def(2)(3)(4)%ile(5)%ile"
string.after.match <- str_match(string = x,
pattern = "(?<=def)(.*)")[1, 1]
parentheses.value <- str_extract(string.after.match, # get value in ()
"(?<=\\()(.*?)(?=\\)\\%ile)")
parentheses.value
Here is a one liner that will do the trick using gsub()
gsub(".*def.*(\\d+)\\)%ile.*%ile", "\\1", x, perl = TRUE)
Here's an approach that will work with any number of "%ile"s. Based on str_split()
x <- "abc(0)(1)%ile, def(2)(3)(4)%ile(5)%ile(9)%ile"
x %>%
str_split("def", simplify = TRUE) %>%
subset(TRUE, 2) %>%
str_split("%ile", simplify = TRUE) %>%
subset(TRUE, 1) %>%
str_replace(".*(\\d+)\\)$", "\\1")
sub(".*?def.*?(\\d)\\)%ile.*", "\\1", x)
[1] "4"
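To see why the lazy quantifiers matter, here is a small comparison on the longer test string with three "%ile"s. It is only a sketch; the variable names x3, greedy and lazy are made up for the illustration.

```r
x3 <- "abc(0)(1)%ile, def(2)(3)(4)%ile(5)%ile(9)%ile"

# greedy: the ".*" after "def" runs as far right as it can, so the capture
# lands on the last digit that is still followed by another "%ile"
greedy <- gsub(".*def.*(\\d+)\\)%ile.*%ile", "\\1", x3, perl = TRUE)

# lazy: ".*?" stops at the first digit directly followed by ")%ile"
lazy <- sub(".*?def.*?(\\d)\\)%ile.*", "\\1", x3, perl = TRUE)

greedy
# [1] "5"
lazy
# [1] "4"
```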
You can use
x <- "abc(0)(1)%ile, def(2)(3)(4)%ile(5)%ile"
library(stringr)
result <- str_match(x, "\\bdef(?:\\((\\d+)\\))+%ile")
result[,2]
See the R demo online and the regex demo.
Details:
\b - word boundary
def - a def string
(?:\((\d+)\))+ - one or more occurrences of ( + one or more digits (captured into Group 1) + ); only the last occurrence captured is kept in Group 1
%ile - an %ile string.
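The same pattern can also be tried in base R, which makes the "last repetition wins" behaviour of Group 1 easy to inspect. This is a sketch that assumes regexec() with perl = TRUE is available (R 3.3.0 or later).

```r
x <- "abc(0)(1)%ile, def(2)(3)(4)%ile(5)%ile"

# regexec() reports the full match plus every capture group
m <- regmatches(x, regexec("\\bdef(?:\\((\\d+)\\))+%ile", x, perl = TRUE))

m[[1]]
# [1] "def(2)(3)(4)%ile" "4"

# element 1 is the whole match, element 2 is Group 1 (its last repetition)
last_value <- m[[1]][2]
```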
The question "Replace multiple strings in one gsub() or chartr() statement in R?" explains how to replace multiple single-character strings in one statement with gsubfn(). E.g.:
x <- "doremi g-k"
gsubfn(".", list("-" = "_", " " = ""), x)
# "doremig_k"
I would however like to replace the string 'doremi' in the example with ''. This does not work:
x <- "doremi g-k"
gsubfn(".", list("-" = "_", "doremi" = ""), x)
# "doremi g_k"
I guess this is because the string 'doremi' contains multiple characters while I am using the metacharacter . in gsubfn, which matches one character at a time. I have no idea what to replace it with - I must confess I sometimes find metacharacters a bit difficult to understand. So, is there a way for me to replace '-' and 'doremi' at once?
You might be able to just use base R sub here:
x <- "doremi g-k"
result <- sub("doremi\\s+([^-]+)-([^-]+)", "\\1_\\2", x)
result
[1] "g_k"
Does this work for you?
gsubfn::gsubfn(pattern = "doremi|-", list("-" = "_", "doremi" = ""), x)
[1] " g_k"
The key is the search pattern "doremi|-", which tells gsubfn to search for either "doremi" or "-". "|" is the regex OR operator.
Just a more generic version of @RLave's solution -
toreplace <- list("-" = "_", "doremi" = "")
gsubfn(paste(names(toreplace),collapse="|"), toreplace, x)
[1] " g_k"
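If gsubfn is not available, a rough base-R equivalent is to apply the replacements one after another with fixed = TRUE. This sketch assumes that no replacement text contains another search string (otherwise later steps would rewrite earlier results); the name repl is made up here.

```r
x <- "doremi g-k"

# names are the strings to find, values are their replacements
repl <- c("doremi" = "", "-" = "_")

out <- x
for (p in names(repl)) {
  # fixed = TRUE treats p as a literal string, not as a regex
  out <- gsub(p, repl[[p]], out, fixed = TRUE)
}

# removing "doremi" leaves a leading blank behind; trimws() drops it
out <- trimws(out)
out
# [1] "g_k"
```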
I need to replace subset of a string with some matches that are stored within a dataframe.
For example -
input_string = "Whats your name and Where're you from"
I need to replace part of this string from a data frame. Say the data frame is
matching <- data.frame(from_word=c("Whats your name", "name", "fro"),
to_word=c("what is your name","names","froth"))
Output expected is what is your name and Where're you from
Note -
The longest possible match should win. In this example, name is not replaced with names, because name is part of a bigger match
It has to match whole words and not partial strings. The fro in "from" should not be replaced with "froth"
I referred to the link below but somehow could not get it to work as described above
Match and replace multiple strings in a vector of text without looping in R
This is my first post here. If I haven't given enough details, kindly let me know
Edit
Based on the input from Sri's comment I would suggest using:
library(gsubfn)
# words to be replaced
a <-c("Whats your","Whats your name", "name", "fro")
# their replacements
b <- c("What is yours","what is your name","names","froth")
# named list as an input for gsubfn
replacements <- setNames(as.list(b), a)
# the test string
input_string = "fro Whats your name and Where're name you from to and fro I Whats your"
# match entire words
gsubfn(paste(paste0("\\w*", names(replacements), "\\w*"), collapse = "|"), replacements, input_string)
Original
I would not say this is easier to read than your simple loop, but it might take better care of the overlapping replacements:
# define the sample dataset
input_string = "Whats your name and Where're you from"
matching <- data.frame(from_word=c("Whats your name", "name", "fro", "Where're", "Whats"),
to_word=c("what is your name","names","froth", "where are", "Whatsup"))
# load the libraries used
library(gsubfn)
library(stringr) # provides str_split()
# make sure data is of class character
matching$from_word <- as.character(matching$from_word)
matching$to_word <- as.character(matching$to_word)
# extract the words in the sentence
test <- unlist(str_split(input_string, " "))
# find where individual words from the sentence match the list of replaceable words
test2 <- sapply(paste0("\\b", test, "\\b"), grepl, matching$from_word)
# change rownames to see what is the format of output from the above sapply
rownames(test2) <- matching$from_word
# reorder the data so that largest replacement blocks are at the top
test3 <- test2[order(rowSums(test2), decreasing = TRUE),]
# where the word is already being replaced by larger chunk, do not replace again
test3[apply(test3, 2, cumsum) > 1] <- FALSE
# define the actual pairs of replacement
replacements <- setNames(as.list(as.character(matching[,2])[order(rowSums(test2), decreasing = TRUE)][rowSums(test3) >= 1]),
as.character(matching[,1])[order(rowSums(test2), decreasing = TRUE)][rowSums(test3) >= 1])
# perform the replacement
gsubfn(paste(as.character(matching[,1])[order(rowSums(test2), decreasing = TRUE)][rowSums(test3) >= 1], collapse = "|"),
replacements,input_string)
toreplace =list("x1" = "y1","x2" = "y2", ..., "xn" = "yn")
Each list entry has two parts, xi and yi:
xi is the pattern (find what), yi is the replacement (replace with).
input_string = "Whats your name and Where're you from"
toreplace<-list("Whats your name" = "what is your name", "names" = "name", "fro" = "froth")
gsubfn(paste(names(toreplace),collapse="|"),toreplace,input_string)
Was trying out different things and the below code seems to work.
a <-c("Whats your name", "name", "fro")
b <- c("what is your name","names","froth")
c <- c("Whats your name and Where're you from")
for(i in seq_along(a)) c <- gsub(paste0('\\<',a[i],'\\>'), gsub(" ","_",b[i]), c)
c <- gsub("_"," ",c)
c
Took help from the below link Making gsub only replace entire words?
However, I would like to avoid the loop if possible. Can someone please improve this answer so that it works without the loop?
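One loop-free possibility in base R, given as a sketch only: sort the search strings by length so the longest alternative is tried first, wrap the alternation in \b word boundaries, and replace all matches in one pass with regmatches<-. It assumes the search strings contain no regex metacharacters; the variable names are chosen for this illustration.

```r
input_string <- "Whats your name and Where're you from"
from_word <- c("Whats your name", "name", "fro")
to_word   <- c("what is your name", "names", "froth")

# longest search string first, so the biggest match wins at each position
ord <- order(nchar(from_word), decreasing = TRUE)
pattern <- paste0("\\b(", paste(from_word[ord], collapse = "|"), ")\\b")

# locate all whole-word matches, then look up each one's replacement
m <- gregexpr(pattern, input_string, perl = TRUE)
hits <- regmatches(input_string, m)

result <- input_string
regmatches(result, m) <- lapply(hits, function(h) to_word[match(h, from_word)])
result
# [1] "what is your name and Where're you from"
```

The "name" inside the longer match is left alone because it has already been consumed, and "fro" does not match inside "from" because of the trailing \b.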
Working in R, I'm trying to find an efficient way to search through a file of texts and remove or replace all instances of proper names (e.g., Thomas). I assume there is something available to do this, but I have been unable to locate it.
So, in this example the words "Susan" and "Bob" would be removed. This is a simplified example; in reality I would want this to apply to hundreds of documents and therefore a fairly large list of names.
texts <- as.data.frame (rbind (
'This text stuff if quite interesting',
'Where are all the names said Susan',
'Bob wondered what happened to all the proper nouns'
))
names(texts) [1] <- "text"
Here's one approach based upon a data set of firstnames:
install.packages("gender")
library(gender)
install_genderdata_package()
sets <- data(package = "genderdata")$results[,"Item"]
data(list = sets, package = "genderdata")
stopwords <- unique(kantrowitz$name)
texts <- as.data.frame (rbind (
'This text stuff if quite interesting',
'Where are all the names said Susan',
'Bob wondered what happened to all the proper nouns'
))
removeWords <- function(txt, words, n = 30000L) {
l <- cumsum(nchar(words)+c(0, rep(1, length(words)-1)))
groups <- cut(l, breaks = seq(1,ceiling(tail(l, 1)/n)*n+1, by = n))
regexes <- sapply(split(words, groups), function(words) sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), collapse = "|")))
for (regex in regexes) txt <- gsub(regex, "", txt, perl = TRUE, ignore.case = TRUE)
return(txt)
}
removeWords(texts[,1], stopwords)
# [1] "This text stuff if quite interesting"
# [2] "Where are all the names said "
# [3] " wondered what happened to all the proper nouns"
It may need some tuning for your specific data set.
Another approach could be based upon part-of-speech tagging.
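For a small, hand-maintained list the same idea reduces to a single alternation between word boundaries, without the chunking that removeWords() needs for very long lists. The sketch below uses a hypothetical vector first_names in place of the genderdata names.

```r
texts <- c("Where are all the names said Susan",
           "Bob wondered what happened")

# hypothetical name list; in practice this would come from a names data set
first_names <- c("Susan", "Bob", "Thomas")

# \b...\b keeps the match to whole words, so "Bobsled" would not be touched
regex <- sprintf("\\b(%s)\\b", paste(first_names, collapse = "|"))

# drop the names, then trim the blanks they leave at the string edges
cleaned <- trimws(gsub(regex, "", texts, ignore.case = TRUE, perl = TRUE))
cleaned
# [1] "Where are all the names said" "wondered what happened"
```

With ignore.case = TRUE, ordinary words that happen to equal a name would be removed too, so dropping that argument may be safer for some data.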
In R, I'd like to convert
c("ThisText", "NextText")
to
c("this.text", "next.text")
This is the reverse of this SO question, and the same as this one but with dots in R rather than underscores in PHP.
Not clear what the entire set of rules is here but we have assumed that
we should lower case any upper case character after a lower case one and insert a dot between them and also
lower case the first character of the string if succeeded by a lower case character.
To do this we can use perl regular expressions with sub and gsub:
# test data
camelCase <- c("ThisText", "NextText", "DON'T_CHANGE")
s <- gsub("([a-z])([A-Z])", "\\1.\\L\\2", camelCase, perl = TRUE)
sub("^(.[a-z])", "\\L\\1", s, perl = TRUE) # make 1st char lower case
giving:
[1] "this.text" "next.text" "DON'T_CHANGE"
You could do this also via the snakecase package:
install.packages("snakecase")
library(snakecase)
to_snake_case(c("ThisText", "NextText"), sep_out = ".")
# [1] "this.text" "next.text"
Github link to package: https://github.com/Tazinho/snakecase
You can replace all capitals with themselves and a preceding dot with gsub, change everything with tolower, and then substr out the initial dot:
x <- c("ThisText", "NextText", "LongerCamelCaseText")
substr(tolower(gsub("([A-Z])","\\.\\1",x)),2,.Machine$integer.max)
[1] "this.text" "next.text" "longer.camel.case.text"
Using stringr
x <- c("ThisText", "NextText")
str_replace_all(string = x,
pattern = "(?<=[a-z0-9])(?=[A-Z])",
replacement = ".") %>%
str_to_lower()
OR
x <- c("ThisText", "NextText")
str_to_lower(
str_replace_all(string = x,
pattern = "(?<=[a-z0-9])(?=[A-Z])",
replacement = ".")
)
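The same lookaround pattern should also work in base R with perl = TRUE, so stringr is not strictly required. A minimal sketch, with camel_to_dots as a made-up helper name:

```r
# insert a dot at every lower-case/digit -> upper-case transition,
# then lower-case the whole string
camel_to_dots <- function(x) {
  tolower(gsub("(?<=[a-z0-9])(?=[A-Z])", ".", x, perl = TRUE))
}

converted <- camel_to_dots(c("ThisText", "NextText", "LongerCamelCaseText"))
converted
# [1] "this.text" "next.text" "longer.camel.case.text"
```

Unlike the sub/gsub answer with \\L, this lower-cases everything, so an input such as "DON'T_CHANGE" would not be preserved.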