I have sentences in R (they represent column names of an SQL query), like the following:
sample_sentence <- "CITY, AGE,NAME,LAST_NAME, COUNTRY"
I would need to add a character(s) like "k." in front of every word of the sentence. Notice how sometimes words within the sentence may be separated by a comma and a space, but sometimes just by a comma.
The desired output would be:
new_sentence <- "k.CITY, k.AGE,k.NAME,k.LAST_NAME, k.COUNTRY"
I would prefer to achieve this without using a for loop. I saw the post "Add a character to the start of every word", but there they work with a vector and I can't figure out how to apply that code to my example.
Thanks
sample_sentence <- "CITY, AGE,NAME,LAST_NAME, COUNTRY"
gsub(pattern = "(\\w+)", replacement = "k.\\1", sample_sentence)
# [1] "k.CITY, k.AGE,k.NAME,k.LAST_NAME, k.COUNTRY"
Explanation: in regex, \\w+ matches one or more "word" characters (letters, digits and underscore), and putting it in () makes each match a "capturing group". We replace each match with k.\\1, where \\1 refers to the text captured by the first set of ().
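To reuse the idea with other prefixes, the call can be wrapped in a small helper. This is only a sketch: add_prefix and its arguments are hypothetical names, and it assumes the prefix contains no backslashes (which gsub would interpret in the replacement).

```r
# Hypothetical helper: prepend `prefix` to every run of word characters.
# Assumes `prefix` contains no backslashes.
add_prefix <- function(x, prefix = "k.") {
  gsub("(\\w+)", paste0(prefix, "\\1"), x)
}

# gsub is vectorised, so several column lists can be handled at once
queries <- c("CITY, AGE,NAME,LAST_NAME, COUNTRY", "ID,  PRICE")
prefixed <- add_prefix(queries)
print(prefixed)
```

Because only the word characters are matched, the original separators (comma, or comma plus spaces) are left exactly as they were.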
A possible solution:
sample_sentence <- "CITY, AGE,NAME,LAST_NAME, COUNTRY"
paste0("k.", gsub("(,\\s*)", "\\1k\\.", sample_sentence))
#> [1] "k.CITY, k.AGE,k.NAME,k.LAST_NAME, k.COUNTRY"
How to select the number in a text?
I want to convert the Latin and English numbers in the text. For example, in the text "……one telephone……", I want to change the English number "one" into "1", but I do not want to change "telephone" into "teleph1".
Is it right to select the number only when it has a space before and a space after it? How can I do that?
To avoid replacing parts of other words into numbers you can include word boundaries in the search pattern. Some functions have a dedicated option for this but generally you can just use the special character \\b to indicate a word boundary as long as the function supports regular expressions.
For example, \\bone\\b will only match "one" if it is not part of another word. That way you can apply it to your character string "……one telephone……" without having to rely on spaces as delimiter between words.
With the stringr package (part of the Tidyverse), the replacement might look like this:
# define test string
x <- "……one telephone……"
# define dictionary for replacements with \\b indicating word boundaries
dict <- c(
"\\bone\\b" = "1",
"\\btwo\\b" = "2",
"\\bthree\\b" = "3"
)
# replace matches in x
stringr::str_replace_all(x, dict)
#> [1] "……1 telephone……"
Created on 2022-11-11 with reprex v2.0.2
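The same word-boundary idea can be sketched in base R without stringr. The dictionary and the test string below are made up for illustration, and perl = TRUE is used so that \\b is handled by the PCRE engine.

```r
# Illustrative dictionary: names are the words to find, values the digits
dict <- c(one = "1", two = "2", three = "3")

x <- "one telephone, two phones, someone"

# Apply each replacement in turn, wrapping the word in \\b ... \\b
out <- x
for (word in names(dict)) {
  out <- gsub(paste0("\\b", word, "\\b"), dict[[word]], out, perl = TRUE)
}
print(out)
```

"telephone" and "someone" are left alone because there is no word boundary immediately before the "one" inside them.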
Here is one way:
gsub(" one ", " 1 ", ".. one telephone ..")
You may need more rules than "space before and space after" (e.g. punctuation). Here is an example that handles a blank or a punctuation character before "one":
gsub("([[:punct:]]|[[:blank:]])one ", "\\11 ", "..one telephone ..")
You can do something similar after "one".
The \\1 in the second argument refers to whatever is matched inside ( ... ) in the first argument; the 1 that follows it is the literal replacement digit.
Check the documentation of gsub to learn more about regular expressions.
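As a sketch of "something similar after one", the pattern below captures a delimiter (or the start or end of the string) on both sides and puts both back. The test strings are made up for illustration.

```r
x <- c("..one telephone ..", "one, two", "telephone")

# Group 1: start of string, or a punctuation/blank character before "one"
# Group 2: end of string, or a punctuation/blank character after "one"
# Replacement: group 1, the literal digit 1, then group 2
res <- gsub("(^|[[:punct:][:blank:]])one($|[[:punct:][:blank:]])",
            "\\11\\2", x, perl = TRUE)
print(res)
```

The "one" inside "telephone" is not replaced because it is preceded by a letter rather than by a delimiter.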
I have a field in a data frame that is formatted as last name, comma, space, first name, space, middle name, and sometimes without a middle name. I need to remove the middle names from the full names when they have one, as well as all spaces. I couldn't figure out how. My guess is that it will involve regular expressions. It would be nice if you could provide explanations with the answer. Below is an example,
names <- c("Casillas, Kyron Jamar", "Knoll, Keyana","McDonnell, Messiah Abdul")
names
Expected output will be,
names_n <- c("Casillas,Kyron", "Knoll,Keyana","McDonnell,Messiah")
names_n
Thanks!
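You can use this (note that, as the output shows, it keeps the last word of each name, i.e. the middle name when there is one, rather than the first name):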
gsub("([^,]+,).*?(\\w+)$","\\1\\2",names)
[1] "Casillas,Jamar" "Knoll,Keyana" "McDonnell,Abdul"
Here we divide the string into two capturing groups and use backreference to recollect their content:
([^,]+,): the 1st capture group, which captures any sequence of characters that are not a comma, followed by a comma
.*?: this lazily matches what follows until ...
(\\w+)$: ... the 2nd capture group, which captures the alphanumeric string at the end
\\1\\2 in the replacement argument recollects the contents of the two capture groups only, thereby removing whatever is not captured. If you wish to separate the surname from the first name not only by a comma but also by a whitespace, just put one whitespace between the two backreferences, thus: \\1 \\2
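If the word right after the comma (the first name) should be kept instead of the last word, the same two-group approach can be rearranged. This is a sketch that assumes names consist of plain word characters (a hyphenated first name would be cut at the hyphen); full_names is used here instead of names only to avoid masking the base function.

```r
full_names <- c("Casillas, Kyron Jamar", "Knoll, Keyana", "McDonnell, Messiah Abdul")

# Group 1: everything up to and including the comma
# \\s*   : the whitespace after the comma (dropped)
# Group 2: the first word after the comma
# .*$    : whatever follows, e.g. the middle name (dropped)
names_n <- gsub("^([^,]+,)\\s*(\\w+).*$", "\\1\\2", full_names)
print(names_n)
```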
We may capture the second word (\\w+) and replace with the backreference (\\1) of the captured word
sub("\\s+", "", sub("\\s+(\\w+)\\s+\\w+$", "\\1", names))
-output
[1] "Casillas,Kyron" "Knoll,Keyana" "McDonnell,Messiah"
I have the following dataframe:
df<-c("red apples,(golden,red delicious),bananas,(cavendish,lady finger),golden pears","yellow pineapples,red tomatoes,(roma,vine),orange carrots")
I want to remove the word preceding a comma and parentheses so my output would yield:
[1] "golden,red delicious),cavendish,lady finger),golden pears" "yellow pineapples,roma,vine),orange carrots"
Ideally, the right parenthesis would be removed as well. But I can manage that delete with gsub.
I feel like a lookbehind might work but can't seem to code it correctly.
Thanks!
edit: I amended the dataframe so that the word I want deleted is a string of two words.
We can use base R with gsub to remove the characters. We match one or more words separated by whitespace ((\\w+\\s+)*\\w+) followed by a comma (,) and an opening parenthesis (\\(), and replace the match with an empty string ("")
gsub("(\\w+\\s+)*\\w+,\\(", "", df)
#[1] "golden,red delicious),cavendish,lady finger),golden pears"
#[2] "yellow pineapples,roma,vine),orange carrots"
Or, if the items are always delimited by commas, we can build the pattern from one or more characters that are not a comma ([^,]+)
gsub("[^,]+,\\(", "", df)
#[1] "golden,red delicious),cavendish,lady finger),golden pears"
#[2] "yellow pineapples,roma,vine),orange carrots"
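Since the question mentions that the right parenthesis should ideally go as well, the two steps can be chained. A minimal sketch using the data from the question; fixed = TRUE makes the second call treat ")" as a literal character rather than as regex syntax.

```r
df <- c("red apples,(golden,red delicious),bananas,(cavendish,lady finger),golden pears",
        "yellow pineapples,red tomatoes,(roma,vine),orange carrots")

# Step 1: drop the item that precedes ",("
# Step 2: drop every remaining ")"
cleaned <- gsub(")", "", gsub("[^,]+,\\(", "", df), fixed = TRUE)
print(cleaned)
```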
Using the tidyverse package stringr, I was able to make your data appear the way you'd want it with two function calls separated by a pipe. The pipe comes from the package magrittr which loads with dplyr and/or tidyverse.
I used stringr::str_replace_all to perform two substitutions which remove the words you wanted to take out. Note the syntax for multiple substitutions within this function.
str_replace_all(string, c("first string to get rid of" = "string to replace it with", "second string to get rid of" = "second replacement string"))
You might find it more intuitive to combine all the "get rid of strings" first followed by combining the replacement strings, but each element within the c() is the string to be replaced (in quotes) connected to its replacement (also in quotes) with "=". Each of those replaced=replacement pairs is separated by a comma.
Using str_replace_all, I first took out all text which starts with "," and ends with ",(" using the regular expression ",[a-z ]+,\\(", which refers to a comma, followed by any number of lowercase letters and spaces (allowing chunks with multiple words to be detected), followed by ",(". Note the escape for the "(". If you thought there might be capital letters you would use [a-zA-Z ] instead. In either case, note the space before the "]".
Because you wanted to take out the word, but not the comma preceding it, I replaced the removed text with ",".
This doesn't remove "red apples" in the first string because it doesn't follow a comma. The expression "^[a-z ]+,\\(" refers to any number of lowercase letters and spaces coming before ",(" at the beginning of the string (the ^ "anchors" your pattern to the beginning of the string). Therefore it removes "red apples" or any other example where the text you want to remove starts the string. For these cases, it makes sense to replace it with nothing ("") because you want the first character of the remaining string to appear at the beginning.
Together, the two substitutions remove the offending text whether it starts the string or is in the middle of it or ends it so in that sense it's more or less generalized.
str_remove_all("\\)") removes the right parentheses throughout
library(stringr)
library(magrittr)
df<-c("red apples,(golden,red delicious),bananas,(cavendish,lady finger),golden pears","yellow pineapples,red tomatoes,(roma,vine),orange carrots")
str_replace_all(df, c(",[a-z ]+,\\(" = ",",
"^[a-z ]+,\\(" = "")) %>%
str_remove_all("\\)")
[1] "golden,red delicious,cavendish,lady finger,golden pears"
[2] "yellow pineapples,roma,vine,orange carrots"
I had a data.frame with some categorical variables. Let's suppose sentences is one of these variables:
sentences <- c("Direito à participação e ao controle social",
"Direito a ser ouvido pelo governo e representantes",
"Direito aos serviços públicos",
"Direito de acesso à informação")
For each value, I would like to extract just the first letter of each word, ignoring words that have 4 letters or fewer (e, de, à, a, aos, ser, pelo). My goal is to create acronym variables. I expect the following result:
[1] "DPCS" "DOGR" "DSP" "DAI"
I tried to subset with stringr using a regex pattern found here:
library(stringr)
pattern <- "^(\b[A-Z]\w*\s*)+$"
str_subset(str_to_upper(sentences), pattern)
But I got an error when creating the pattern object:
Error: '\w' is an escape sequence not recognized in the string beginning with ""^(\b[A-Z]\w"
What am I doing wrong?
Thanks in advance for any help.
You can use gsub to delete all the unwanted characters and keep only the ones you want. From the expected output, it seems you are still using characters from words that are 3 characters long:
gsub('\\b(\\pL)\\pL{2,}|.','\\U\\1',sentences,perl = TRUE)
[1] "DPCS" "DSOPGR" "DASP" "DAI"
But if we were to ignore the words you indicated then it would be:
gsub('\\b(\\pL)\\pL{4,}|.','\\U\\1',sentences,perl = TRUE)
[1] "DPCS" "DOGR" "DSP" "DAI"
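As for the error in the question: inside an R string literal a backslash starts an escape sequence, so a regex backslash has to be written as \\ (which is why the patterns above use \\b and \\pL). A small sketch of the two ways to write the pattern from the question; the raw-string form assumes R 4.0.0 or later, and the variable names are made up for illustration.

```r
# Doubled backslashes: works in any R version
pattern_escaped <- "^(\\b[A-Z]\\w*\\s*)+$"

# Raw string (R >= 4.0.0): backslashes are taken literally
pattern_raw <- r"(^(\b[A-Z]\w*\s*)+$)"

# Both literals produce exactly the same regex text
same <- identical(pattern_escaped, pattern_raw)
print(same)
```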
@Onyambu's answer is great, though as a regular expression beginner it took me a long time to understand it well enough to make modifications to suit my own needs.
Here is my understanding of gsub('\\b(\\pL)\\pL{4,}|.','\\U\\1',sentences,perl = TRUE).
I post it in the hope of being helpful to others.
Background information:
\\b: boundary of word
\\pL matches any kind of letter from any language
{4,} is an occurrence indicator
{m}: The preceding item is matched exactly m times.
{m,}: The preceding item is matched m or more times, i.e., m+
{m,n}: The preceding item is matched at least m times, but not more than n times.
| is OR logic operator
. represents any one character except newline.
\\U\\1 in the replacement text reinserts the text captured by group 1 and converts it to upper case (\\U is only available with perl = TRUE). Note that parentheses () create a numbered capturing group in the pattern.
With all the background knowledge, the interpretation of the command is
replace words matching \\b(\\pL)\\pL{4,} with the first letter
replace any character not matching the above pattern with "" as nothing is captured for this group
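The interpretation above can be checked on a small input and turned into a reusable function. This is a sketch: acronym and min_len are hypothetical names, and the test sentences are unaccented variants made up for illustration.

```r
# Hypothetical helper: keep the upper-cased first letter of every word
# that has at least `min_len` letters, and drop everything else.
acronym <- function(x, min_len = 5) {
  # first letter (captured) followed by at least min_len - 1 more letters,
  # OR any single character (nothing captured, so it is replaced by "")
  pattern <- sprintf("\\b(\\pL)\\pL{%d,}|.", min_len - 1)
  gsub(pattern, "\\U\\1", x, perl = TRUE)
}

a5 <- acronym("Direito de acesso ao governo")               # words of 5+ letters
a3 <- acronym("Direito aos servicos publicos", min_len = 3)  # words of 3+ letters
print(c(a5, a3))
```

With min_len = 5 the pattern is the same as the \\pL{4,} version discussed above.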
Here are two great places where I learned all this background.
https://www.regular-expressions.info/rlanguage.html
https://www3.ntu.edu.sg/home/ehchua/programming/howto/Regexe.html
You can use this pattern: (?<=^| )\S(?=\pL{4,})
I used a positive lookbehind to make sure the matches are preceded by either a space or the beginning of the line. Then I match one character, only if it is followed by 4 or more letters, hence the positive lookahead.
I suggest you don't use \w for non-English languages, because depending on the regex engine and its settings it may not match characters with accents. Instead, \pL matches any letter from any language.
Once you have your matches, you can just concatenate them to create your strings (dpcs, dogr, etc...)
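In R, the matching and concatenation steps might look like the sketch below. It uses base R's gregexpr and regmatches with perl = TRUE (needed for the lookarounds and \\pL); the test sentences are unaccented variants made up for illustration.

```r
x <- c("Direito de acesso ao governo", "Direito aos servicos publicos")

# One character preceded by the start of the string or a space,
# and followed by at least 4 letters
pattern <- "(?<=^| )\\S(?=\\pL{4,})"

# gregexpr finds all matches per string; regmatches extracts them as a list
matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))

# Concatenate the letters of each string and upper-case the result
initials <- vapply(matches, function(m) toupper(paste(m, collapse = "")),
                   character(1))
print(initials)
```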
I am using an R script in the PowerBI Query Editor to find a six-digit numeric string in a description column and add it as a new column to the table. It works EXCEPT where the number string is preceded by a "_" (underscore character).
# 'dataset' holds the input data for this script ##
library(stringr)
# assign regex to variable #
pattern <- "(?:^|\\D)(\\d{6})(?!\\d)"
# define function to use pattern ##
isNewSiteNum = function(x) substr(str_extract(x,pattern),1,6)
# output statement - within adds new column to dataset ##
output <- within(dataset,{NewSiteNum=isNewSiteNum(dataset$LineItemComment)})
The number string can be at the start, at the end or in the middle of the description text. When the number string is preceded by an underscore (_123456 for example), the regex returns _12345 instead of 123456. I am not sure how to tell it to skip the underscore but still grab the six digits (without breaking the cases with no leading underscore, which currently work).
regex101.com shows the full match as '_123456' and group 1 as '123456', but my result column has '_12345'. For the case with a leading space the full match is ' 123456', yet my result column is correct. I seem to be missing something, since the full match has 7 characters and the desired group 1 has 6.
The problem was with the str_extract which I could not get to work. However, by using the str_match and selecting the group I get what I am looking for.
# 'dataset' holds input data
library(stringr)
pattern<-"(?:^|\\D)(\\d{6})(?!\\d)"
SiteNum = function(x) str_match(x, pattern)[,2]
output<-within(dataset,{R_SiteNum2=SiteNum(dataset$ReqComments)})
This does not pick up non-numeric initial characters.
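The original attempt returned _12345 because the full match includes the leading non-digit, and substr(..., 1, 6) then cut off the last digit. An alternative sketch, not the approach used above, replaces the non-capturing group with a negative lookbehind so that the full match is exactly the six digits. It uses base R only; the sample comments are made up for illustration.

```r
comments <- c("_123456 install", "site 654321", "no number here",
              "1234567 too long", "111222")

# Six digits that are neither preceded nor followed by another digit
pattern <- "(?<!\\d)\\d{6}(?!\\d)"

m <- regexpr(pattern, comments, perl = TRUE)

# regmatches() returns only the successful matches, so fill by position
# and leave NA where nothing was found
site_num <- rep(NA_character_, length(comments))
site_num[m > 0] <- regmatches(comments, m)
print(site_num)
```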