Removing punctuations from text using R - r

I need to remove punctuation from the text. I am using tm package but the catch is :
eg: the text is something like this:
data <- "I am a, new comer","to r,"please help","me:out","here"
now when I run
library(tm)
data<-removePunctuation(data)
in my code, the result is :
I am a new comerto rplease helpmeouthere
but what I expect is:
I am a new comer to r please help me out here

Here's how I take your question, and an answer that is very close to #David Arenburg's in the comment above.
data <- '"I am a, new comer","to r,"please help","me:out","here"'
gsub('[[:punct:] ]+',' ',data)
[1] " I am a new comer to r please help me out here "
The extra space after [:punct:] is to add spaces to the string and the + matches one or more sequential items in the regular expression. This has the side effect, desirable in some cases, of shortening any sequence of spaces to a single space.

If you had something like
string <- "hello,you"
> string
[1] "hello,you"
You could do this:
> gsub(",", "", string)
[1] "helloyou"
It replaces the "," with "" in the variable called string

Related

Concatenate string with escape characters to bookend string with "\"... \"" in R

I have a character vector of file paths that look like this:
xx <- c("data/lsa_two_isl_prosp_u.csv",
"data/lsa_two_isl_prosp_d.csv" ,
"data/lsa_two_isl_propsuit_u.csv")
However, I need these file paths to have "" concatenated on to the beginning of the string and "" concatenated onto the end, so that my string looks like this:
xx <- c("\"data/lsa_two_isl_prosp_u.csv\"",
"\"data/lsa_two_isl_prosp_d.csv\"" ,
"\"data/lsa_two_isl_propsuit_u.csv\"")
Normally I would use paste but the "\"... \"" are escape characters that need each other to 'bookend' a string.
In hindsight, an obviously doomed idea, but sharing to avoid anyone else who might try: If I try yo use paste('"\"', xx, '\""') , I get "\"\" data/lsa_two_isl_prosp_d.csv \"\"" , which is obviously wrong, and I cannot remove the excess portions of the string without throwing out all of it, incase you may have the same idea...
Any suggestions?
Found the answer after a lot of trial and error:
xx <- paste("\"", xx, "\"")

Removing specific tags from a string - R

I read similar questions and applied them to my case but they did not work. Anyway, How can I remove the <\/p>\r\n<p> from an string?
My failed attempts:
string <- "Hello!<\/p>\r\n<p>For this skill, I had no idea of what to do at first."
gsub("<\/p>\r\n<p>","",string)
gsub("<\\/p>\\r\\n<p>","",string)
Thanks
I solved it using the following codes
string <- gsub("<[^>]+>", "", string)
string <- gsub("\\r\\n", "", string,fixed=TRUE)

combining strings to one string in r

I'm trying to combine some stings to one. In the end this string should be generated:
//*[#id="coll276"]
So my inner part of the string is an vector: tag <- 'coll276'
I already used the paste() method like this:
paste('//*[#id="',tag,'"]', sep = "")
But my result looks like following: //*[#id=\"coll276\"]
I don't why R is putting some \ into my string, but how can I fix this problem?
Thanks a lot!
tldr: Don't worry about them, they're not really there. It's just something added by print
Those \ are escape characters that tell R to ignore the special properties of the characters that follow them. Look at the output of your paste function:
paste('//*[#id="',tag,'"]', sep = "")
[1] "//*[#id=\"coll276\"]"
You'll see that the output, since it is a string, is enclosed in double quotes "". Normally, the double quotes inside your string would break the string up into two strings with bare code in the middle:
"//*[#id\" coll276 "]"
To prevent this, R "escapes" the quotes in your string so they don't do this. This is just a visual effect. If you write your string to a file, you'll see that those escaping \ aren't actually there:
write(paste('//*[#id="',tag,'"]', sep = ""), 'out.txt')
This is what is in the file:
//*[#id="coll276"]
You can use cat to print the exact value of the string to the console (Thanks #LukeC):
cat(paste('//*[#id="',tag,'"]', sep = ""))
//*[#id="coll276"]
Or use single quotes (if possible):
paste('//*[#id=\'',tag,'\']', sep = "")
[1] "//*[#id='coll276']"

How can I delete every single letter of a row after a certain character in R?

I am having a problem doing a cleaning of transactions. I have an excel with every single transaction that clients do, with the number, the gloss and the code of the industry. I convert this excel in text separated by ";" then I only need to clean the gloss and convert it back again into an excel.
tolower(tabla1)
lapply(tabla1, tolower)
tabla1[] <- lapply(tabla1, tolower)
str(tabla1)
tabla1
tabla1_texto <- gsub("[.]", "", tabla1)
table1_texto <- gsub("[(]", " ", tabla1_texto)
I know that I need to use gsub() but I'm not sure how to use it, in other hand, someone know how to do a correct dictionary and only keep certain words and delete every other word?
If you have a string like this one:
string <- "Some text here; and some text here; and some more text here"
Then you can delete everything after the first ; with:
gsub(";.*$", "", string)
[1] "Some text here"
Explanation of ;,*$ which you will be substituting for "" (empty string):
starting with ;
any character . zero or more times *
up until the end of the line $
If you have a table - you will have to do this for every row separately.

Find word (not containing substrings) in comma separated string

I'm using a linq query where i do something liike this:
viewModel.REGISTRATIONGRPS = (From a In db.TABLEA
Select New SubViewModel With {
.SOMEVALUE1 = a.SOMEVALUE1,
...
...
.SOMEVALUE2 = If(commaseparatedstring.Contains(a.SOMEVALUE1), True, False)
}).ToList()
Now my Problem is that this does'n search for words but for substrings so for example:
commaseparatedstring = "EWM,KI,KP"
SOMEVALUE1 = "EW"
It returns true because it's contained in EWM?
What i would need is to find words (not containing substrings) in the comma separated string!
Option 1: Regular Expressions
Regex.IsMatch(commaseparatedstring, #"\b" + Regex.Escape(a.SOMEVALUE1) + #"\b")
The \b parts are called "word boundaries" and tell the regex engine that you are looking for a "full word". The Regex.Escape(...) ensures that the regex engine will not try to interpret "special characters" in the text you are trying to match. For example, if you are trying to match "one+two", the Regex.Escape method will return "one\+two".
Also, be sure to include the System.Text.RegularExpressions at the top of your code file.
See Regex.IsMatch Method (String, String) on MSDN for more information.
Option 2: Split the String
You could also try splitting the string which would be a bit simpler, though probably less efficient.
commaseparatedstring.Split(new Char[] { ',' }).Contains( a.SOMEVALUE1 )
what about:
- separating the commaseparatedstring by comma
- calling equals() on each substring instead of contains() on whole thing?
.SOMEVALUE2 = If(commaseparatedstring.Split(',').Contains(a.SOMEVALUE1), True, False)

Resources