substitute word separators with space

I just want to replace some word separators with a space. Any hints on this? Doesn't work after converting to character either.
df <- data.frame(m = 1:3, n = c("one.one", "one.two", "one.three"))
> gsub(".", "\\1 \\2", df$n)
[1] " " " " " "
> gsub(".", " ", df$n)
[1] " " " " " "

You don't need to use regex for one-to-one character translation. You can use chartr().
df$n <- chartr(".", " ", df$n)
df
#   m         n
# 1 1   one one
# 2 2   one two
# 3 3 one three

You can try
gsub("[.]", " ", df$n)
#[1] "one one" "one two" "one three"

Set fixed = TRUE if you are looking for an exact match and don't need a regular expression.
gsub(".", " ", df$n, fixed = TRUE)
#[1] "one one" "one two" "one three"
That's also faster than using an appropriate regex for such a case.

I suggest doing it like this:
gsub("\\.", " ", df$n)
OR
gsub("\\W", " ", df$n)
\\W matches any non-word character. \\W+ matches one or more non-word characters. Use \\W+ if necessary.
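For comparison, the suggestions above can be run side by side. This is only a minimal sketch in base R, assuming the same df as in the question; the result names (by_class, by_escape, and so on) are made up for illustration.

```r
# Same data as in the question
df <- data.frame(m = 1:3, n = c("one.one", "one.two", "one.three"),
                 stringsAsFactors = FALSE)

# Four ways to replace a literal dot with a space
by_class  <- gsub("[.]", " ", df$n)               # dot inside a character class
by_escape <- gsub("\\.", " ", df$n)               # escaped dot
by_fixed  <- gsub(".", " ", df$n, fixed = TRUE)   # no regex at all
by_chartr <- chartr(".", " ", df$n)               # character translation

# All four give the same result
identical(by_class, by_escape) &&
  identical(by_escape, by_fixed) &&
  identical(by_fixed, by_chartr)

# \\W+ behaves differently: a run of separators becomes a single space
gsub("\\W+", " ", "one..two")
```

Which one to pick is mostly a matter of taste; fixed = TRUE and chartr() avoid the regex engine entirely, while \\W also replaces separators other than the dot.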

Related

Why does gsub/sub not work to replace ".."?

When I call rownames on my df I get something like this:
"Saint.Petersburg..Russia" "Istanbul..Turkey"
This is what I coded
gsub("..", " ", rownames(df))
This is what was returned
[1] " " " " "
What I expected was
"Saint.Petersburg Russia" "Istanbul Turkey"
Does anyone know what is going wrong here?

We can use fixed = TRUE. In the default regex mode, . matches any character unless it is escaped (\\.) or placed inside square brackets ([.]). Using fixed = TRUE is also the faster option.
gsub("..", " ", rownames(df), fixed = TRUE)
#[1] "Saint.Petersburg Russia" "Istanbul Turkey"
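To see why the original call returned only spaces, here is a small sketch. The vector rn is a hypothetical stand-in for rownames(df), built from the two names shown in the question.

```r
# Hypothetical stand-in for rownames(df)
rn <- c("Saint.Petersburg..Russia", "Istanbul..Turkey")

# As a regex, ".." means "any two characters", so every pair is replaced
as_regex <- gsub("..", " ", rn)
nchar(as_regex)   # half the original string lengths

# Literal match of two consecutive dots
literal <- gsub("..", " ", rn, fixed = TRUE)
literal

# Equivalent regex forms, if a regex is needed anyway
gsub("\\.\\.", " ", rn)
gsub("[.]{2}", " ", rn)
```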

How to remove words that contain any non-alphabetic characters (except hyphen and apostrophe) in R

I would need to remove all words (or replace them with spaces) in strings that have non-alphabetic characters (except hyphens and apostrophes) in the middle in R. Could anyone kindly help? Thanks.
e.g.
str = "he#llo wor*ld i'm using state-of-the-art technologies it's i4u"
expected output
" i'm using state-of-the-art technologies it's "
I have tried the following regex.
lines <- c("i'm",
'gas-lighting',
"i'm gas-lighting",
"i-love-you",
"i#u",
"b2b",
"i'm gas-lighting u i#u b2b")
gsub("\\w+[^a-z'-]+\\w+", " ", lines)
[1] "i'm" "gas-lighting" "i' -lighting" "i-love-you" " "
" " "i' - "
The problem is the space between words? Tried to skip space.
gsub("\\w+[^a-z\\s'-]+\\w+", " ", lines)
[1] "i'm" "gas-lighting" "i' -lighting" "i-love-you" " "
" " "i' - "
It wouldn't skip the spaces? Expected the following strings.
[1] "i'm" "gas-lighting" "i'm gas-lighting" "i-love-you" " "
" " "i'm gas-lighting u "
Update 2: OK, this works fine so far.
> lines <- c("i'm",
+ 'gas-lighting',
+ "i'm gas-lighting",
+ "i-love-you",
+ "i#u",
+ "b2b",
+ "i'm gas-lighting u and you and you i#u b2b",
+ " he#llo wor$ld how*are&you ")
>
> # split a string at spaces then remove the words
> # that contain any non-alphabetic characters (except "-", "'")
> # then paste them together (separate them with spaces)
> unlist(lapply(lines, function(line){
+ words <- unlist(strsplit(line, "\\s+"))
+ words <- words[!grepl("[^a-z'-]", words, perl=TRUE)]
+ paste(words, collapse=" ")}))
[1] "i'm" "gas-lighting"
[3] "i'm gas-lighting" "i-love-you"
[5] "" ""
[7] "i'm gas-lighting u and you and you" ""
Update 1: So far I am using the following regex.
> # replace word at the beginning of a string
> lines <- gsub("^\\s*\\w*[^a-z'-]+\\w*", " ", lines); lines
[1] "i'm" "gas-lighting" "i'm gas-lighting" "i-love-you"
[5] " " " " "i'm gas-lighting u i#u "
> # replace word at the end of a string
> lines <- gsub("\\s[a-z]+[^a-z'-]+\\w*$", " ", lines); lines
[1] "i'm" "gas-lighting" "i'm gas-lighting" "i-love-you"
[5] " " " " "i'm gas-lighting u i#u "
> # replace words between spaces
> gsub("\\s\\w*[^a-z'-]+\\w*\\s", " ", lines)
[1] "i'm" "gas-lighting" "i'm gas-lighting" "i-love-you" " "
[6] " " "i'm gas-lighting u "

I came up with an indirect way, but it worked.
library(tidyverse)
str = "he#llo wor*ld i'm using state-of-the-art technologies it's i4u"
## Break the string on whitespace
break_1 <- str_split(str, pattern = "\\s")
## Find the good words (letters, hyphens and apostrophes only) and put them in a vector
good_words <- unlist(break_1)[!sapply(break_1,
function(i) str_detect(i, pattern = "[^A-Za-z'-]"))]
## Merge the vector
merged_vector <- paste0(good_words, collapse = " ")
merged_vector

As a variation on Harro Cyranka's answer, using grepl:
paste0(sapply(break_1, function(x) x[!grepl("[^A-Za-z'-]", x)]), collapse = " ")
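The split-filter-paste idea can also be wrapped in a function that works on a whole vector of lines. This is only a sketch in base R; keep_clean_words is a made-up name, and it returns the kept words joined by single spaces rather than preserving the original spacing shown in the question's expected output.

```r
# Hypothetical helper: drop every whitespace-separated word that contains
# anything other than letters, hyphens and apostrophes
keep_clean_words <- function(lines) {
  vapply(strsplit(lines, "\\s+"), function(words) {
    words <- words[nzchar(words)]   # ignore the empty piece a leading space produces
    paste(words[!grepl("[^A-Za-z'-]", words)], collapse = " ")
  }, character(1))
}

keep_clean_words(c(
  "i'm gas-lighting u i#u b2b",
  "he#llo wor*ld i'm using state-of-the-art technologies it's i4u"
))
```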

regex misunderstanding in r

I don't seem to understand gsub or stringr.
Example:
> a<- "a book"
> gsub(" ", ".", a)
[1] "a.book"
Okay. BUT:
> a<-"a.book"
> gsub(".", " ", a)
[1] " "
I would have expected
"a book"
I'm replacing the full stop with a space.
Also, with stringr: str_replace(a, ".", " ") returns:
" .book"
and str_replace_all(a, ".", " ") returns
" "
I can use stringi: stri_replace(a, " ", fixed="."):
"a book"
I'm just wondering why gsub (and str_replace) don't act as I'd have expected. They work when replacing a space with another character, but not the other way around.

That's because the first argument to gsub, namely pattern is actually a regex. In regex the period . is a metacharacter and it matches any single character, see ?base::regex. In your case you need to escape the period in the following way:
gsub("\\.", " ", a)
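A quick way to check this is to list what each pattern actually matches. The sketch below uses only base R (regmatches and gregexpr); the variable names matches and literal_dots are illustrative.

```r
a <- "a.book"

# Where does the regex "." match? At every position in the string.
matches <- regmatches(a, gregexpr(".", a))[[1]]
matches          # one match per character

# Where does the escaped "\\." match? Only at the literal dot.
literal_dots <- regmatches(a, gregexpr("\\.", a))[[1]]
literal_dots

# sub() replaces only the first match, gsub() replaces all of them
sub(".", " ", a)      # only the first character is replaced
gsub("\\.", " ", a)   # only the dot is replaced
```

This also explains the str_replace result in the question: like sub(), it replaces only the first match, which is the letter "a".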

R: Print a string of different length below each other and fill rest with spaces

My problem is the following: I need to write my own print function, and the output should be saved to a text file and look very similar to a table.
Basically my structure is this:
Description Symbol Rank
I did this with:
paste("Description Symbol Rank", "\n",sep="")
Now you can guess my problem. Some Symbol descriptions are 10 letters long, some are 20, etc. That's why my paste function for these rows cannot be that simple. How do I need to program this so that, let's say, a 20-letter string gets the remaining 10 positions filled with spaces, whereas a 10-letter string gets the remaining 20 filled with spaces?

paste0(yourstring,paste0(rep(" ",20-nchar(yourstring)),collapse = ""))
this should help... I think

You could play around with str_pad
> x <- c("Description", "Symbol", "Rank")
> library(stringr)
> str_pad(x, 20)
# [1] "         Description" "              Symbol" "                Rank"
> str_pad(x, 20, side = "right")
# [1] "Description         " "Symbol              " "Rank                "
> c(str_pad(x[1], 20, "right"), str_pad(x[2], 20), x[3])
# [1] "Description         " "              Symbol" "Rank"

A third solution is the classic sprintf:
> x <- c("Description", "Symbol", "Rank")
> sprintf("%20s",x)
[1] "         Description" "              Symbol" "                Rank"

You may also use formatC
formatC(x, width=-20)
#[1] "Description         " "Symbol              " "Rank                "
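Putting the padding together with the file output asked for in the question might look like the sketch below. The data frame tab, the column widths and the temporary file path are all assumptions made for illustration, not part of the original question.

```r
# Hypothetical rows for the table
tab <- data.frame(
  Description = c("Alpha component", "Beta"),
  Symbol      = c("ALP", "B"),
  Rank        = c(1L, 12L),
  stringsAsFactors = FALSE
)

# "%-30s" left-justifies and pads to 30 characters; "%4s" right-justifies
fmt    <- "%-30s%-10s%4s"
header <- sprintf(fmt, "Description", "Symbol", "Rank")
rows   <- sprintf(fmt, tab$Description, tab$Symbol, tab$Rank)

# Write header and rows to a text file (placeholder path)
out_file <- tempfile(fileext = ".txt")
writeLines(c(header, rows), out_file)
cat(readLines(out_file), sep = "\n")
```

Note that sprintf only pads; a description longer than the column width is not truncated and would push the remaining columns to the right.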

Finding the best string match with R

Starting with this: L Hernandez
From a vector containing the following:
[1] "HernandezOlaf " "HernandezLuciano " "HernandezAdrian "
I tried this:
'subset(ABC, str_detect(ABC, "L Hernandez") == TRUE)'
The desired output is the Hernandez entry that contains a capital L anywhere, i.e. HernandezLuciano

Maybe this helps:
vec1 <- c("L Hernandez", "HernandezOlaf ","HernandezLuciano ", "HernandezAdrian ")
grep("L ?Hernandez|Hernandez ?L", vec1, value = TRUE)
#[1] "L Hernandez" "HernandezLuciano "
Update
variable <- "L Hernandez"
v1 <- gsub(" ", " ?", variable) # make the space optional
v2 <- gsub("([[:alpha:]]+) ([[:alpha:]]+)", "\\2 ?\\1", variable) # reverse the order of the words and make the space optional
You can also use strsplit to split variable, as @rawr commented.
grep(paste(v1, v2, sep = "|"), vec1, value = TRUE)
#[1] "L Hernandez" "HernandezLuciano "
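The strsplit suggestion could be sketched as a small helper that builds the pattern from any space-separated query. The name name_pattern is made up, and the sketch assumes the query contains no regex metacharacters.

```r
# Hypothetical helper: turn "L Hernandez" into a pattern that matches the
# words in either order, with an optional space between them
name_pattern <- function(query) {
  parts    <- strsplit(query, " ", fixed = TRUE)[[1]]
  forward  <- paste(parts, collapse = " ?")        # e.g. "L ?Hernandez"
  backward <- paste(rev(parts), collapse = " ?")   # e.g. "Hernandez ?L"
  paste(forward, backward, sep = "|")
}

vec1 <- c("HernandezOlaf ", "HernandezLuciano ", "HernandezAdrian ")
name_pattern("L Hernandez")
grep(name_pattern("L Hernandez"), vec1, value = TRUE)
```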

You could use agrep function for approximate string matching.
If you simply run this function it matches every string...
agrep("L Hernandez", c("HernandezOlaf ", "HernandezLuciano ", "HernandezAdrian "))
[1] 1 2 3
but if you modify this a little "L Hernandez" -> "Hernandez L"
agrep("Hernandez L", c("HernandezOlaf ", "HernandezLuciano ", "HernandezAdrian "))
[1] 1 2 3
and change the max distance
agrep("Hernandez L", c("HernandezOlaf ", "HernandezLuciano ", "HernandezAdrian "), max.distance = 0.01)
[1] 2
you get the right answer. This is only an idea, it might work for you :)

You could modify the following if you only want full names after a capital L:
vec1[grepl("Hernandez", vec1) & grepl("L\\.*", vec1)]
[1] "L Hernandez"       "HernandezLuciano "
or
vec1[grepl("Hernandez", vec1) & grepl("L[[:alpha:]]", vec1)]
[1] "HernandezLuciano "
The expression looks for a match on "Hernandez" and then checks whether there is a capital "L" anywhere in the string (the trailing \\.* only allows optional literal dots after it). The second version requires a letter after the capital "L".
BTW, adding a backslash before the bracket breaks the second pattern: \\[ matches a literal "[", so nothing is returned.
vec1[grepl("Hernandez", vec1) & grepl("L\\[[:alpha:]]", vec1)]
character(0)
