Remove all special characters from string except Turkish ones

There are tons of similar questions, but I couldn't find the exact answer.
I have a text like this:
str <- "NG*-#+ÜÇ12 NET GROUPنت ياترم "
I want to remove all special characters and all non-Turkish letters and keep everything else. The desired output is:
"NGÜÇ12 NET GROUP"
I really appreciate your help.

Please try
library(stringr)
str <- "NG*-#+ÜÇ12 NET GROUPنت ياترم "
str_replace_all(str, '[^[\\da-zA-Z ÜüİıÖöÇçŞşĞğ]]', '')

Using base gsub:
gsub("[^0-9a-zA-Z ÜüİıÖöÇçŞşĞğ]", "", str)
# [1] "NGÜÇ12 NET GROUP "
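The calls above keep the spaces that surrounded the removed Arabic-script text, so the result ends in whitespace. A minimal sketch of one way to reach the exact desired output, assuming a UTF-8 session; the whitespace clean-up step and the variable names kept and clean are additions, not part of the answers above. The character class also lists Ö and ö, which are Turkish letters too.

```r
# Same input as in the question; the Arabic-script tail is written with \u escapes.
str <- "NG*-#+ÜÇ12 NET GROUP\u0646\u062a \u064a\u0627\u062a\u0631\u0645 "

# Keep digits, ASCII letters, spaces and the Turkish-specific letters.
kept <- gsub("[^0-9a-zA-Z ÜüİıÖöÇçŞşĞğ]", "", str)

# Collapse runs of spaces and trim both ends.
clean <- trimws(gsub(" +", " ", kept))
clean
#> [1] "NGÜÇ12 NET GROUP"
```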

Related

gsub to remove escape string

> pname <- "Ratchanon \"TK\" Chantananuwat (Am)"
> gsub(\"TK\", "", pname)
Error: unexpected string constant in "gsub(\"TK\", ""
Is it possible to remove the \"TK\" in this person's name?

I suggest doing it in two steps: first remove the punctuation characters from the string, then apply gsub() to remove the letter or word you want to get rid of. Note that [[:punct:]] also strips the parentheses around "Am".
pname <- "Ratchanon \"TK\" Chantananuwat (Am)"
library(stringr)
pname <- str_replace_all(pname, "[[:punct:]]", "") # removes all punctuation, including the quotes and the parentheses
gsub("TK", "", pname)
Hope this helps!

In base R:
gsub('\\"TK\\"', "", pname)
#> [1] "Ratchanon Chantananuwat (Am)"
Another possible solution, based on stringr::str_remove:
library(stringr)
str_remove(pname, '\\"TK\\"')
#> [1] "Ratchanon Chantananuwat (Am)"
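The two calls above remove only the quoted nickname, so the spaces on both sides of it stay in the string. A small sketch that removes the preceding space as well, using fixed = TRUE so that no regex escaping is needed; the variable name cleaned is just for illustration.

```r
pname <- "Ratchanon \"TK\" Chantananuwat (Am)"

# Match the literal text ' "TK"' (leading space included) instead of a regex.
cleaned <- sub(' "TK"', "", pname, fixed = TRUE)
cleaned
#> [1] "Ratchanon Chantananuwat (Am)"
```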

substitute string when there is a dot + number + ':'

I have strings that look like these:
> ABCD.1:f_HJK
> ABFD.1:f_HTK
> CJD:f_HRK
> QQYP.2:f_HDP
So the first part is always a string, optionally followed by a '.' and a number, and after that there is always a ':' and another string.
I would like to remove the '.' + number part when it is present, using R.
I know that regular expressions could be useful, but I have no idea how to apply them in this context. I know that I can replace the '.' with gsub, but I don't know how to include the number and the ':' in the pattern.
Thank you for your help.

Does this work:
v <- c('ABCD.1:f_HJK','ABFD.1:f_HTK','CJD:f_HRK','QQYP.2:f_HDP')
v
[1] "ABCD.1:f_HJK" "ABFD.1:f_HTK" "CJD:f_HRK" "QQYP.2:f_HDP"
gsub('([A-Z]{,4})(\\.\\d)?(:.*)','\\1\\3',v)
[1] "ABCD:f_HJK" "ABFD:f_HTK" "CJD:f_HRK" "QQYP:f_HDP"

You could also use any of the following, depending on the structure of your strings.
If there is no other period followed by digits in the string:
sub("\\.\\d+", "", v)
[1] "ABCD:f_HJK" "ABFD:f_HTK" "CJD:f_HRK" "QQYP:f_HDP"
If you only want to match at the start of the string, directly before the ':':
sub("^([A-Z]+)\\.\\d+:", "\\1:", v)
[1] "ABCD:f_HJK" "ABFD:f_HTK" "CJD:f_HRK" "QQYP:f_HDP"
Same as above, but with perl = TRUE and \\K, i.e. without capture groups:
sub("^[A-Z]+\\K\\.\\d+", "", v, perl = TRUE)
[1] "ABCD:f_HJK" "ABFD:f_HTK" "CJD:f_HRK" "QQYP:f_HDP"

If I understood your explanation correctly, this should do the trick:
gsub("(\\.\\d+)", "", string)
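If the strings may contain other dot-plus-digit sequences that should stay, a lookahead can tie the match to the ':' that follows. This is only a sketch, assuming perl = TRUE is acceptable; the last element of v is a made-up test case and res is an illustrative name.

```r
v <- c("ABCD.1:f_HJK", "CJD:f_HRK", "QQYP.2:f_H3.5")

# Remove '.digits' only when it is immediately followed by ':'.
res <- sub("\\.\\d+(?=:)", "", v, perl = TRUE)
res
# elements: ABCD:f_HJK, CJD:f_HRK, QQYP:f_H3.5
```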

Remove first 4 words after a certain string pattern in R?

I am working with really long strings. How can I remove the first 4 words after a certain string pattern occurs? For example:
string <- "hello I am a user of stackoverflow and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
# remove "stackoverflow" and the first 4 words after it
result
"hello I am a user of happy with all the help the community usually offers when I'm in need of some coding expertise."

Solution with base R
A one line solution:
pattern <- "stackoverflow"
string <- "hello I am a user of stackoverflow and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
gsub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", string)
#> [1] "hello I am a user of happy with all the help the community usually offers when I'm in need of some coding expertise."
How it works
Build the pattern as a regex:
"stackoverflow" followed by up to 4 words.
See ?regex for more details.
Words are matched by \\w+ and separators by \\W+ (capital W; it covers spaces and special characters such as the apostrophe in the sentence).
(...){0,4} means that the separator-plus-word group may repeat up to 4 times.
\\W* matches a possible trailing separator, so that the two remaining pieces of the sentence are not joined by two separators. Try it without and you'll see what I mean.
gsub locates the pattern and replaces it with "" (thus deleting it).
Handling edge cases
Note that it also works in edge cases:
# end of a sentence with fewer than 4 words after
string <- "hello I am a user of stackoverflow and I am"
gsub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", string)
#> [1] "hello I am a user of "
# beginning of a sentence
string <- "stackoverflow and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
gsub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", string)
#> [1] "happy with all the help the community usually offers when I'm in need of some coding expertise."
# pattern == string
string <- "stackoverflow and I am really"
gsub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", string)
#> [1] ""
A tidyverse solution
library(stringr)
# locate start and end position of pattern
tmp <- str_locate(string, paste0(pattern,"(\\W+\\w+){0,4}\\W*"))
# get positions: start_sentence-start_pattern and end_pattern-end_sentence
tmp <- invert_match(tmp)
# get the substrings
tmp <- str_sub(string, tmp[,1], tmp[,2])
# collapse substrings together
str_c(tmp, collapse = "")
#> [1] "hello I am a user of happy with all the help the community usually offers when I'm in need of some coding expertise."
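The regex approach above can be wrapped in a small helper so that the number of words becomes a parameter. This is a sketch only: remove_after() and its arguments are hypothetical names, it assumes pattern contains no regex metacharacters, and it uses sub, so only the first occurrence is removed. Keep in mind that \\w+ counts "I'm" as two words.

```r
# Hypothetical helper: drop `pattern` plus up to `n` following words.
remove_after <- function(string, pattern, n = 4) {
  rx <- sprintf("%s(\\W+\\w+){0,%d}\\W*", pattern, n)
  sub(rx, "", string)
}

remove_after("a user of stackoverflow and I am really happy", "stackoverflow")
#> [1] "a user of happy"
remove_after("a user of stackoverflow and I am really happy", "stackoverflow", n = 2)
#> [1] "a user of am really happy"
```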

Search for your pattern followed by four words, each optionally preceded by whitespace. Find the start and end positions of the match, cut that span out of the string and paste the remaining pieces back together. Finally, use gsub to collapse any run of two or more spaces.
string <- "hello I am a user of stackoverflow and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
pat <- "stackoverflow"
library(stringr)
tmp <- str_locate(
  string,
  paste0(pat, paste0(rep("\\s?[a-zA-Z]+", 4), collapse = ""))
)
gsub(
  "\\s{2,}", " ",
  paste0(substring(string, 1, tmp[1] - 1), substring(string, tmp[2] + 1))
)
[1] "hello I am a user of happy with all the help the community usually offers when I'm in need of some coding expertise."

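Quick answer; I am sure there is better code than this:
string <- "hello I am a user of stackoverflow and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
# split on spaces (read.table would treat the apostrophe in "I'm" as a quote character)
words <- strsplit(string, " ", fixed = TRUE)[[1]]
string2 <- ""
j <- 0 # position of "stackoverflow"; 0 while it has not been found yet
for (i in seq_along(words)) {
  if (words[i] == "stackoverflow") {
    j <- i
  } else if (j > 0) {
    # after the pattern: skip the next 4 words, keep the rest
    if (i - j > 4) {
      string2 <- paste0(string2, " ", words[i])
    }
  } else if (i > 1) {
    string2 <- paste0(string2, " ", words[i])
  } else {
    string2 <- words[i]
  }
}
print(string2)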

negation handling in R add the prefix "neg_" to the word before a "not"

I am doing sentiment analysis on German customer reviews and want to implement negation handling.
I decided to add the prefix "neg_" both to the word following "not" and to the word before "not" (this may not make sense for English, but for German it does).
I already found out how to add the prefix "neg_" to the word following a "not" with this function:
addprefix <- function(text){
  words <- unlist(strsplit(text, " "))
  negative <- grepl("\\<not\\>", words, ignore.case=T)
  negate <- append(FALSE, negative)[1:length(words)]
  words[negate==T] <- paste0("neg_", words[negate==T])
  words <- paste(words, collapse=" ")
}
Is it possible to also add the prefix "neg_" to the word before "not"?
So that a review goes from originally this:
> str_negate("I did not like the product")
[1] "I did not like the product"
and currently this:
> str_negate("I did not like the product")
[1] "I did not neg_like the product"
to finally this:
> str_negate("I did not like the product")
[1] "I neg_did not neg_like the product"
Any help would be appreciated. Thank you!

A solution using the index of "not", obtained with the which function:
addprefix <- function(text){
  words <- unlist(strsplit(text, " "))
  negative <- which(grepl("\\<not\\>", words, ignore.case = TRUE))
  to.change <- c(negative - 1, negative + 1)
  # keep only positions that exist, so a leading or trailing "not" is safe
  to.change <- to.change[to.change > 0 & to.change <= length(words)]
  words[to.change] <- paste0("neg_", words[to.change])
  paste(words, collapse = " ")
}
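Since the reviews are German, the negation word itself may need to be configurable. The following is a sketch of such a variant, not the original answer's code: add_neg_prefix() and its neg_word argument are hypothetical names, and it compares whole space-separated words case-insensitively instead of using a regex, so a "not" with attached punctuation would not be found.

```r
# Hypothetical variant: `neg_word` is the negation word to look for.
add_neg_prefix <- function(text, neg_word = "not") {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  hits <- which(tolower(words) == tolower(neg_word))
  # Neighbours of every hit, restricted to positions that exist.
  idx <- unique(c(hits - 1, hits + 1))
  idx <- idx[idx >= 1 & idx <= length(words)]
  # Never prefix the negation word itself (e.g. in "not not").
  idx <- setdiff(idx, hits)
  words[idx] <- paste0("neg_", words[idx])
  paste(words, collapse = " ")
}

add_neg_prefix("I did not like the product")
#> [1] "I neg_did not neg_like the product"
add_neg_prefix("Ich mag das Produkt nicht", neg_word = "nicht")
#> [1] "Ich mag das neg_Produkt nicht"
```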

R how to remove VERY special characters in strings?

I'm trying to remove some VERY special characters from my strings.
I've read other posts such as:
Remove all special characters from a string in R?
How to remove special characters from a string?
but these are not what I'm looking for.
Let's say my string is as follows:
s = "who are í ½í¸€ bringing?"
I've tried the following:
test = tm_map(s, function(x) iconv(enc2utf8(x), sub = "byte"))
test = iconv(s, 'UTF-8', 'ASCII')
Neither of the above worked.
Edit:
I am looking for a GENERAL solution!
I cannot (and would prefer not to) identify all the special characters manually.
Also, these VERY special characters MAY (I'm not 100% sure) be the result of emoticons.
Please help or point me to the right posts.
Thank you!

So, I'm going to go ahead and make an answer, because I believe this is what you're looking for:
> s = "who are í ½í¸€ bringing?"
> rmSpec <- "í|½|€" # The "|" designates a logical OR in regular expressions.
> s.rem <- gsub(rmSpec, "", s) # gsub replace any matches in remSpec and replace them with "".
> s.rem
[1] "who are ¸ bringing?"
Now, this does have the caveat that you have to define the special characters manually in the rmSpec variable. I'm not sure whether you know which special characters to remove or whether you're looking for a more general solution.
EDIT:
So it appears you almost had it with iconv, you were just missing the sub argument. See below:
> s
[1] "who are í ½í¸€ bringing?"
> s2 <- iconv(s, "UTF-8", "ASCII", sub = "")
> s2
[1] "who are bringing?"
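The iconv call leaves the spaces that surrounded the removed characters in place. A minimal sketch that wraps the same call and tidies the whitespace afterwards; strip_non_ascii() is a hypothetical helper name, and the test string is rebuilt with \u escapes under the assumption that the garbled part consists of those Latin-1 style characters. Be aware that this drops every non-ASCII character, including legitimate accented letters.

```r
# Hypothetical helper: drop everything that is not ASCII, then tidy the whitespace.
strip_non_ascii <- function(x) {
  ascii <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
  trimws(gsub("\\s+", " ", ascii))
}

s <- "who are \u00ed \u00bd\u00ed\u00b8\u20ac bringing?"
strip_non_ascii(s)
#> [1] "who are bringing?"
```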
