I have a set of strings and I need to search for words that have a period in the middle. Some of the strings are concatenated, so I need to break them apart into words so that I can then filter for words with dots.
Below is a sample of what I have and what I get so far:
punctToRemove <- c("[^[:alnum:][:space:]._]")
s <- c("get_degree('TITLE',PERS.ID)",
"CLIENT_NEED.TYPE_CODe=21",
"2.1.1Report Field Level Definition",
"The user defined field. The user will validate")
This is what I currently get
gsub(punctToRemove, " ", s)
[1] "get_degree  TITLE  PERS.ID "
[2] "CLIENT_NEED.TYPE_CODe 21"
[3] "2.1.1Report Field Level Definition"
[4] "The user defined field. The user will validate"
Sample of what I want is below
[1] "get_degree ( ' TITLE ' , PERS.ID ) " # spaces before and after "(", "'", ",", and ")"
[2] "CLIENT_NEED.TYPE_CODe = 21" # spaces before and after the "=" sign. Dot and underscore remain untouched.
[3] "2.1.1Report Field Level Definition" # no changes
[4] "The user defined field. The user will validate" # no changes
We can use regex lookarounds to insert a space at every position that follows or precedes one of the delimiter characters:
s1 <- gsub("(?<=['=(),])|(?=['(),=])", " ", s, perl = TRUE)
s1
#[1] "get_degree ( ' TITLE ' , PERS.ID ) "
#[2] "CLIENT_NEED.TYPE_CODe = 21"
#[3] "2.1.1Report Field Level Definition"
#[4] "The user defined field. The user will validate"
nchar(s1)
#[1] 35 26 34 46
which is equal to the number of characters shown in the OP's expected output.
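For this example, you can also pad each character in turn with stringr (adjacent delimiters then end up separated by two spaces rather than one):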
library(stringr)
s <- str_replace_all(s, "\\)", " \\) ")
s <- str_replace_all(s, "\\(", " \\( ")
s <- str_replace_all(s, "=", " = ")
s <- str_replace_all(s, "'", " ' ")
s <- str_replace_all(s, ",", " , ")
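Once the delimiters are padded, the remaining step from the question (keeping only words with a period in the middle) could look like the sketch below. It is only an illustration under assumptions: it pads with a capture group instead of lookarounds, and it treats "a period in the middle" as a period with a letter, digit, or underscore on both sides, which the question does not define precisely.

```r
s <- c("get_degree('TITLE',PERS.ID)",
       "CLIENT_NEED.TYPE_CODe=21",
       "2.1.1Report Field Level Definition",
       "The user defined field. The user will validate")

# Pad each delimiter with spaces using a capture group (base R, no lookarounds).
spaced <- gsub("(['=(),])", " \\1 ", s)

# Split on runs of whitespace; trimws() avoids an empty leading token.
tokens <- unlist(strsplit(trimws(spaced), "\\s+"))

# Assumption: "dot in the middle" means a period with a word character
# (letter, digit, or underscore) directly on both sides.
dotted <- grep("[[:alnum:]_][.][[:alnum:]_]", tokens, value = TRUE)
dotted
```

With the sample data this keeps "PERS.ID", "CLIENT_NEED.TYPE_CODe" and "2.1.1Report", and drops "field." because its period ends the word.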
Hello everyone I hope you guys are having a good one,
I have multiple long strings of text in a dataset. I am trying to capture all text between, after, and before a set of words, which I will refer to as keywords:
keywords= UHJ, uhj, AXY, axy, YUI, yui, OPL, opl, UJI, uji
if I have the following string:
UHJ This is only a test to AXY check regex in a YUI educational context so OPL please be kind UJI
The following regex will easily match my keywords:
UHJ|uhj|AXY|axy|YUI|yui|OPL|opl|UJI|uji
but since I am interested in capturing everything between, after, and before those words, I essentially want the inverse of my regex, so that I get only the text outside the keywords.
I have tried the following:
[^UHJ|uhj|AXY|axy|YUI|yui|OPL|opl|UJI|uji]
with no luck. The keywords may change in the future, so I would appreciate a regex that works in R and achieves my desired output.
The simplest solution is probably just to split by your pattern. (Note this includes an empty string if the text starts with a keyword.)
x <- "UHJ This is only a test to AXY check regex in a YUI educational context so OPL please be kind UJI"
keywords <- "UHJ|uhj|AXY|axy|YUI|yui|OPL|opl|UJI|uji"
strsplit(x, keywords)
# [[1]]
# [1] "" " This is only a test to "
# [3] " check regex in a " " educational context so "
# [5] " please be kind "
Another option is regmatches() with invert = TRUE. (This includes empty strings if the text starts or ends with a keyword.)
regmatches(
x,
gregexpr(keywords, x, perl = TRUE),
invert = TRUE
)
# [[1]]
# [1] ""                         " This is only a test to "
# [3] " check regex in a "       " educational context so "
# [5] " please be kind "         ""
Or stringr::str_extract_all() with your pattern in both a lookbehind and a lookahead. (This doesn't include empty strings.)
library(stringr)
str_extract_all(
  x,
  str_glue("(?<={keywords}).+?(?={keywords})")
)
# [[1]]
# [1] " This is only a test to " " check regex in a "
# [3] " educational context so " " please be kind "
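Since the keywords may change, it can help to build the pattern from a character vector rather than typing the alternation by hand. The sketch below is one way to do that in base R. The word boundaries (\\b) and the case-insensitive matching are my assumptions, not part of the answers above. It also shows why the bracket attempt fails: [^UHJ|uhj|...] is a character class that negates single characters, not whole words.

```r
# Keywords kept as a plain vector so they are easy to change later.
keywords <- c("UHJ", "AXY", "YUI", "OPL", "UJI")

# Assumption: keywords should only match as whole words, in any case.
pattern <- paste0("\\b(?:", paste(keywords, collapse = "|"), ")\\b")

x <- "UHJ This is only a test to AXY check regex in a YUI educational context so OPL please be kind UJI"

# invert = TRUE returns the text between the matches instead of the matches.
pieces <- regmatches(
  x,
  gregexpr(pattern, x, ignore.case = TRUE, perl = TRUE),
  invert = TRUE
)[[1]]

# Tidy up: strip surrounding spaces and drop empty leading/trailing pieces.
pieces <- trimws(pieces)
pieces <- pieces[nzchar(pieces)]
pieces
```

For the sample string this leaves four pieces, from "This is only a test to" through "please be kind".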
I would need to remove all words (or replace them with spaces) in strings that have non-alphabetic characters (except hyphens and apostrophes) in the middle in R. Could anyone kindly help? Thanks.
e.g.
str = "he#llo wor*ld i'm using state-of-the-art technologies it's i4u"
expected output
" i'm using state-of-the-art technologies it's "
I have tried the following regex.
lines <- c("i'm",
'gas-lighting',
"i'm gas-lighting",
"i-love-you",
"i#u",
"b2b",
"i'm gas-lighting u i#u b2b")
gsub("\\w+[^a-z'-]+\\w+", " ", lines)
[1] "i'm" "gas-lighting" "i' -lighting" "i-love-you" " "
" " "i' - "
The problem is the space between words? Tried to skip space.
gsub("\\w+[^a-z\\s'-]+\\w+", " ", lines)
[1] "i'm" "gas-lighting" "i' -lighting" "i-love-you" " "
" " "i' - "
It wouldn't skip the spaces? Expected the following strings.
[1] "i'm" "gas-lighting" "i'm gas-lighting" "i-love-you" " "
" " "i'm gas-lighting u "
Update 2: OK, this works fine so far.
> lines <- c("i'm",
+ 'gas-lighting',
+ "i'm gas-lighting",
+ "i-love-you",
+ "i#u",
+ "b2b",
+ "i'm gas-lighting u and you and you i#u b2b",
+ " he#llo wor$ld how*are&you ")
>
> # split a string at spaces then remove the words
> # that contain any non-alphabetic characters (except "-", "'")
> # then paste them together (separate them with spaces)
> unlist(lapply(lines, function(line){
+ words <- unlist(strsplit(line, "\\s+"))
+ words <- words[!grepl("[^a-z'-]", words, perl=TRUE)]
+ paste(words, collapse=" ")}))
[1] "i'm" "gas-lighting"
[3] "i'm gas-lighting" "i-love-you"
[5] "" ""
[7] "i'm gas-lighting u and you and you" ""
Update 1: So far I am using the following regex.
> # replace word at the beginning of a string
> lines <- gsub("^\\s*\\w*[^a-z'-]+\\w*", " ", lines); lines
[1] "i'm" "gas-lighting" "i'm gas-lighting" "i-love-you"
[5] " " " " "i'm gas-lighting u i#u "
> # replace word at the end of a string
> lines <- gsub("\\s[a-z]+[^a-z'-]+\\w*$", " ", lines); lines
[1] "i'm" "gas-lighting" "i'm gas-lighting" "i-love-you"
[5] " " " " "i'm gas-lighting u i#u "
> # replace words between spaces
> gsub("\\s\\w*[^a-z'-]+\\w*\\s", " ", lines)
[1] "i'm" "gas-lighting" "i'm gas-lighting" "i-love-you" " "
[6] " " "i'm gas-lighting u "
I came up with an indirect way, but it worked.
library(tidyverse)
str = "he#llo wor*ld i'm using state-of-the-art technologies it's i4u"
##Break the string based on spaces
break_1 <- (str_split(str, pattern = "\\s"))
##Find the good words and put them in a vector
## (a word is "bad" if it contains anything other than letters, "-" or "'")
good_words <- unlist(break_1)[!sapply(break_1,
function(i) str_detect(i, pattern = "[^A-Za-z'-]"))]
##Merge the vector
merged_vector <- paste0(good_words, collapse = " ")
merged_vector
As a variation of Harro Cyranka's answer, using grepl:
paste0(sapply(break_1, function(x) x[!grepl("[^A-Za-z'-]", x)]), collapse = " ")
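The split, filter, and paste idea used in the answers above can be wrapped in a small vectorised helper. This is only a sketch: keep_clean_words is a hypothetical name, and it assumes a word is acceptable when it consists solely of ASCII letters, hyphens, and apostrophes.

```r
# Hypothetical helper: drop every whitespace-separated word that contains
# anything other than ASCII letters, "-" or "'".
keep_clean_words <- function(lines) {
  vapply(strsplit(lines, "\\s+"), function(words) {
    ok <- grepl("^[A-Za-z'-]+$", words)   # empty tokens fail this test too
    paste(words[ok], collapse = " ")
  }, character(1))
}

str <- "he#llo wor*ld i'm using state-of-the-art technologies it's i4u"
keep_clean_words(c(str, "i#u", "i'm gas-lighting u i#u b2b"))
```

Unlike the expected output in the question, this removes the offending words outright instead of leaving spaces in their place, so the results carry no leading or trailing blanks.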
I have a vector like this:
> myarray
[1] "AA\tThis is "
[2] "\tthe "
[3] "\tbeginning."
[4] "BB\tA string of "
[5] "\tcharacters."
[6] "CC\tA short line."
[7] "DD\tThe "
[8] "\tend."
I am trying to write a function that processes the above to generate this:
> myoutput
[1] "AA\tThis is the beginning."
[2] "BB\tA string of characters."
[3] "CC\tA short line."
[4] "DD\tThe end."
This is doable by looping through the rows and using an if statement to concatenate the current row with the last one if it starts with a \t. I was wondering if there is a more efficient way of achieving the same result.
# Create your example data
myarray <- c("AA\tThis is ", "\tthe ", "\tbeginning.", "BB\tA string of ",
             "\tcharacters.", "CC\tA short line.", "DD\tThe ", "\tend.")
# Find where each "sentence" starts based on detecting
# that the first character isn't \t
starts <- grepl("^[^\t]", myarray)
# Create a grouping variable
id <- cumsum(starts)
# Remove the leading \t as that seems like what your example output wants
tmp <- sub("^\t", "", myarray)
# Split into groups and paste the groups together
sapply(split(tmp, id), paste, collapse = "")
And running it we get
> sapply(split(tmp, id), paste, collapse = "")
                            1                             2
 "AA\tThis is the beginning." "BB\tA string of characters."
                            3                             4
          "CC\tA short line."                "DD\tThe end."
Another option is to paste everything into one string, mark each two-letter prefix (AA, BB, etc.) with an extra delimiter such as ##, and then strsplit on that delimiter:
# Data
myarray <- c("AA\tThis is ", "\tthe ", "\tbeginning.", "BB\tA string of ",
             "\tcharacters.", "CC\tA short line.", "DD\tThe ", "\tend.")
strsplit(gsub("([A-Z]{2})", "##\\1",
              paste(sub("^\t", "", myarray), collapse = "")), "##")[[1]][-1]
# [1] "AA\tThis is the beginning."
# [2] "BB\tA string of characters."
# [3] "CC\tA short line."
# [4] "DD\tThe end."
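To make the cumsum grouping from the first answer easier to follow, here is a small sketch that exposes the intermediate grouping vector and returns a plain, unnamed character vector. It assumes the data has the shape given in the question, where continuation lines start with a tab. startsWith() and vapply() are substitutions of mine for grepl() and sapply().

```r
myarray <- c("AA\tThis is ", "\tthe ", "\tbeginning.", "BB\tA string of ",
             "\tcharacters.", "CC\tA short line.", "DD\tThe ", "\tend.")

# TRUE for lines that begin a new record (they do not start with a tab).
starts <- !startsWith(myarray, "\t")

# Running count of record starts: every line gets the number of its record.
id <- cumsum(starts)   # 1 1 1 2 2 3 4 4

# Strip the leading tab from continuation lines, then glue each group.
pieces <- split(sub("^\t", "", myarray), id)
merged <- unname(vapply(pieces, paste, character(1), collapse = ""))
merged
```

Because the grouping depends only on the leading tab, this does not rely on the record prefixes being two capital letters.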
I just want to replace some word separators with a space. Any hints on this? Doesn't work after converting to character either.
df <- data.frame(m = 1:3, n = c("one.one", "one.two", "one.three"))
> gsub(".", "\\1 \\2", df$n)
[1] "       " "       " "         "
> gsub(".", " ", df$n)
[1] "       " "       " "         "
You don't need to use regex for one-to-one character translation. You can use chartr().
df$n <- chartr(".", " ", df$n)
df
#   m         n
# 1 1   one one
# 2 2   one two
# 3 3 one three
You can try
gsub("[.]", " ", df$n)
#[1] "one one" "one two" "one three"
Set fixed = TRUE if you are looking for an exact match and don't need a regular expression.
gsub(".", " ", df$n, fixed = TRUE)
#[1] "one one" "one two" "one three"
That's also faster than using an appropriate regex for such a case.
I suggest doing it like this:
gsub("\\.", " ", df$n)
OR
gsub("\\W", " ", df$n)
\\W matches any non-word character, so it also replaces punctuation other than the period. \\W+ matches one or more non-word characters; use \\W+ if a run of them should become a single space.
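The question's attempt fails because an unescaped "." in a regex matches any character, so every character is replaced. The sketch below puts the alternatives from the answers side by side on a plain character vector. The vector n and the extra test string are made up for illustration.

```r
n <- c("one.one", "one.two", "one.three")

# Unescaped "." matches every character, so each one becomes a space.
all_blank <- gsub(".", " ", n)

# Four ways to target the literal period only.
results <- list(
  chartr    = chartr(".", " ", n),
  bracket   = gsub("[.]", " ", n),
  fixed     = gsub(".", " ", n, fixed = TRUE),
  backslash = gsub("\\.", " ", n)
)
results

# "\\W" is broader: it replaces any non-word character, not just ".".
broad <- gsub("\\W", " ", "one.two-three")
broad
```

All four targeted variants agree on this input. The \\W version only differs when other punctuation is present, as in the last call.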
My problem is the following: I need to write my own print function, and the output should be saved to a text file and look very similar to a table.
Basically my structure is this:
Description Symbol Rank
I did this with:
paste("Description Symbol Rank", "\n",sep="")
Now you can guess my problem. Some Symbol descriptions are 10 letters long, some are 20, and so on, so my paste call for these rows cannot be that simple. How do I pad each string to a fixed width, so that, say, a 20-letter string gets the remaining 10 positions filled with spaces, whereas a 10-letter string gets the remaining 20 filled with spaces?
paste0(yourstring,paste0(rep(" ",20-nchar(yourstring)),collapse = ""))
This should help, I think. Note that it assumes yourstring is a single string of at most 20 characters.
You could play around with str_pad
> x <- c("Description", "Symbol", "Rank")
> library(stringr)
> str_pad(x, 20)
# [1] "         Description" "              Symbol" "                Rank"
> str_pad(x, 20, side = "right")
# [1] "Description         " "Symbol              " "Rank                "
> c(str_pad(x[1], 20, "right"), str_pad(x[2], 20), x[3])
# [1] "Description         " "              Symbol" "Rank"
A third solution is the classic sprintf:
> x <- c("Description", "Symbol", "Rank")
> sprintf("%20s", x)
[1] "         Description" "              Symbol" "                Rank"
You may also use formatC; a negative width left-justifies the strings:
formatC(x, width = -20)
#[1] "Description         " "Symbol              " "Rank                "
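Putting the padding idea together for the original goal (a table-like text file), a minimal sketch with sprintf could look like this. The column widths (20, 10, 5), the sample rows, and the temporary output file are all made up for illustration.

```r
# Made-up sample rows; in practice these would come from your own data.
description <- c("Alpha example", "Beta")
symbol      <- c("ALP", "B")
rank        <- c(1, 12)

# "%-20s" left-justifies in 20 columns, "%5s" right-justifies in 5.
fmt <- "%-20s%-10s%5s"

table_lines <- c(
  sprintf(fmt, "Description", "Symbol", "Rank"),              # header row
  sprintf(fmt, description, symbol, as.character(rank))       # data rows
)

# Write to a temporary file (an assumption; use your own path instead).
out_file <- tempfile(fileext = ".txt")
writeLines(table_lines, out_file)
cat(readLines(out_file), sep = "\n")
```

Because sprintf is vectorised, one call formats all data rows, and every line comes out the same width as long as no value is longer than its column.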