Remove special/non-English characters from string in R

I want to do some text mining analysis with my data collected from Facebook, but have some problems with the special/non-English characters in the text. The data looks like:
doc_id  text
001     '𝘁𝗶𝘀 𝘁𝗵𝗲 𝘀𝗲𝗮𝘀
002     I expect a return to normalcy...That is Biden’s great
003     'I’m facing a prison sentence
What I want is to remove the words containing these "strange" characters. I tried to do this by using
str_replace_all(text, "[^[:alnum:]]", " ")
But this doesn't work in my case. Any ideas?

A general answer to this kind of task is to specify the characters you want to keep. It appears that [:alnum:] includes Greek letters and accented letters.
Maybe this regex is more appropriate :
str_remove_all(x, "[^[\\da-zA-Z ]]")
[1] ""
[1] "I expect a return to normalcyThat is Bidens great"
[1] "Im facing a prison sentence"
I just replaced the [:alpha:] shortcut with a-zA-Z, added a whitespace, and used the str_remove_all function instead. Add any other character you want to keep to the class.
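To make this reproducible, here is a minimal sketch with the three sample posts recreated from the question (the \U escapes stand for the "mathematical bold" Unicode letters in doc 001; the exact code points are an assumption about the original data):

library(stringr)
# Recreate the three posts; doc 001 consists of "math bold" Unicode letters
x <- c("'\U0001D601\U0001D5F6\U0001D600 \U0001D601\U0001D5F5\U0001D5F2 \U0001D600\U0001D5F2\U0001D5EE\U0001D600",
       "I expect a return to normalcy...That is Biden's great",
       "'I'm facing a prison sentence")
# Keep only digits, ASCII letters and spaces; everything else is removed
str_remove_all(x, "[^[\\da-zA-Z ]]")
# Result: ""  "I expect a return to normalcyThat is Bidens great"  "Im facing a prison sentence"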

Related

Remove Everything Except Specific Words From Text

I'm working with Twitter data using R. I have a large data frame where I need to remove everything from the text except specific information; specifically, I want to keep only statistical information. So basically, I want to keep numbers as well as words such as "half", "quarter", "third". Is there also a way to keep symbols such as "£", "%", "$"?
I have been using "gsub" to try and do this:
df$text <- as.numeric(gsub(".*?([0-9]+).*", "\\1", df$text))
This code removes everything except the numbers, but any information carried by words is gone. I'm struggling to figure out how to keep specific words within the text as well as the numbers.
Here's a mock data frame:
text <- c("here is some text with stuff inside that i dont need but also some that i do, here is a word half and quarter also 99 is too old for lego", "heres another one with numbers 132 1244 5950 303 2022 and one and a half", "plz help me with code i am struggling")
df <- data.frame(text)
I would like to be able to end up with a data frame like the one shown in the attached picture.
I've also included an N/A row in the picture because some of my observations will have neither a number nor the specific words. The goal of this code is really just to be able to say that these observations contain some form of statistical language and those other observations do not.
Any help would be massively appreciated, and I'll do my best to answer any questions!
I am sure there is a more elegant solution, but I believe this will accomplish what you want!
df$newstrings <- unlist(lapply(
  regmatches(df$text, gregexpr("half|quarter|third|[[:digit:]]+", df$text)),
  function(x) paste(x, collapse = "")
))
df$newstrings[df$newstrings == ""] <- NA
> df$newstrings
# [1] "halfquarter99" "132124459503032022half" NA
You can capture what you need to keep, then match and consume any other character, replacing the match with a backreference to the group value:
text <- c("here is some text with stuff inside that i dont need but also some that i do, here is a word half and quarter also 99 is too old for lego", "heres another one with numbers 132 1244 5950 303 2022 and one and a half", "plz help me with code i am struggling")
gsub("(half|quarter|third|\\d+)|.", "\\1", text)
Details:
(half|quarter|third|\d+) - a half, quarter or third word, or one or more digits
| - or
. - any single char.
The \1 in the replacement pattern puts the captured value back into the resulting string; when the . branch matched, Group 1 is empty, so that character is simply removed.
Output:
[1] "halfquarter99" "132124459503032022half" ""

How to remove dash (-) and \n at the same time or some solution to this text cleaning

I have been trying to remove the dash (-) and \n for quite some time and it is not working.
I have tried using this code to remove -
gsub(" - ", " ", df1$text)
I have also tried using this code to remove \n
gsub("[\n]", " ", df1$text)
However, when I remove \n the dash is still there ("abc-"), and when I remove the dash (-) the \n is still there ("abc\n"). It just goes in circles. All of these results are in the console.
When I try removing \n, the console shows:
Df1
Id text
1 I have learnt abc-d in school.
2 I want app-le.
3 Going to sc-hool is fun.
When I try removing the dash (-), the console shows:
Df1
Id text
1 I have learnt abc\nd in school.
2 I want app\nle.
3 Going to sc\nhool is fun.
This just loops and loops. I tried removing \n, then removing the dash (-), and so on all over again.
This is the data in the data frame (it always stays the same):
Id text
1 I have learnt abc- d in school.
2 I want app- le.
3 Going to sc- hool is fun
In the data frame there is a space right after the dash (-).
The data I am using is a news article; I copied and pasted it into an Excel file, but I am trying to use R to clean it.
Could someone help me out with this? Thanks!
I don't mind sharing the data with you privately, but please do not disclose it, because it is my school project.
gsub(" - ", " ", df2$text) is looking for a space, then a dash, then a space. The examples you give like app- le don't have a space before the dash, so they won't match. If you want to match a space next to a dash only if it's there, us ? to quantify the space:
df2 = read.table(text = 'Id|text
1|I have learnt abc- d in school.
2|I want app- le.
3|Going to sc- hool is fun', sep = "|", header = TRUE)
gsub(" ?- ?", " ", df2$text)
# [1] "I have learnt abc d in school." "I want app le." "Going to sc hool is fun"
## Maybe you want to replace it with nothing, not with a space?
gsub(" ?- ?", "", df2$text)
# [1] "I have learnt abcd in school." "I want apple." "Going to school is fun"
Since your example doesn't include any line breaks, I can't really tell what the issue is.
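If line breaks do appear next to the dash in the real data, a single call can consume the dash together with an optional following space or newline. A sketch, assuming the dash always marks a split word (note that it would also delete legitimate hyphens such as the one in "cross-sectional"):

gsub("-[ \n]?", "", df1$text)
# "abc- d"  -> "abcd";  "app-\nle" -> "apple"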

R Regex for matching comma separated sections in a column/vector

The original title for this question was: "R Regex for word boundary excluding space". It reflected the manner in which I was approaching the problem. However, this is a better solution to my particular problem; it should work as long as a particular delimiter is used to separate items within a 'cell'.
This must be very simple, but I've hit a brick wall on it.
I have a dataframe column where each cell(row) is a comma separated list of items. I want to find the rows that have a specific item.
df <- data.frame(nms = c("XXXCAP,XXX CAPITAL LIMITED",
                         "XXX,XXX POLYMERS LIMITED, 3455",
                         "YYY,XXX REP LIMITED,999,XXX"),
                 b = c("A", "X", "T"))
nms b
1 XXXCAP,XXX CAPITAL LIMITED A
2 XXX,XXX POLYMERS LIMITED, 3455 X
3 YYY,XXX REP LIMITED,999,XXX T
I want to search for rows that have item XXX. Rows 2 and 3 should match. Row 1 has the string XXX as part of a larger string and obviously should not match.
However, because the XXX in row 1 is delimited by word boundaries (a comma and a space), I am having trouble filtering it out with \\b or [[:<:]]
grep("\\bXXX\\b",df$nms, value = F) #matches 1,2,3
The easiest way to do this, of course, is strsplit(), but I'd like to avoid it. Any suggestions on performance are welcome.
When \b does not "work", the problem usually lies in the definition of the "whole word".
A word boundary can occur in one of three positions:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
It seems you only want to match a word between commas or at the start/end of the string.
You may use a PCRE regex (note the perl=TRUE argument) like
(?<![^,])XXX(?![^,])
Details
(?<![^,]) (equal to (?<=^|,)) - either start of the string or a comma
XXX - an XXX word
(?![^,]) (equal to (?=$|,)) - either end of the string or a comma
R demo:
> grep("(?<![^,])XXX(?![^,])",df$nms, value = FALSE, perl=TRUE)
## => [1] 2 3
The equivalent TRE regex will look like
> grep("(?:^|,)XXX(?:$|,)",df$nms, value = FALSE)
Note that here, non-capturing groups are used to match either start of string or , (see (?:^|,)) and either end of string or , (see (?:$|,)).
This is perhaps a somewhat simplistic solution, but it works for the examples which you've provided:
library(stringr)
df$nms %>%
str_replace_all('\\s', '') %>% # Removes all spaces, tabs, newlines, etc
str_detect('(^|,)XXX(,|$)') # Detects string XXX surrounded by comma or beginning/end
[1] FALSE TRUE TRUE
Also, have a look at this cheatsheet made by RStudio on Regular Expressions - it is very nicely made and very useful (I keep going back to it when I'm in doubt).
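Since stringr uses ICU, which also supports lookbehinds, the PCRE pattern from the accepted answer works there directly as well; a quick sketch on the same data frame:

library(stringr)
str_detect(df$nms, "(?<![^,])XXX(?![^,])")
# [1] FALSE  TRUE  TRUE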

Finding and extracting words that include a punctuation expressions in R

I'm stuck trying to extract, from a big text (around 17,000 documents), words that contain punctuation expressions. For example:
"...urine bag tubing and the vent jutting above the summit also strapped with the
white plaster tapeFigure 2), \n\nc(A<sc>IMS AND</sc> O<sc>BJECTIVES</sc>, The
aim of this study is to ... c(M<sc>ATERIALS AND</sc> M<sc>ETHODS</sc>, A
cross-sectional study with a ... surgeries.n), \n\nc(PATIENTS & METHODS, This
prospective double blind,...[95] c(c(Introduction, Silicosis is a fibrotic"
I would like to extract words like the following:
[1] c(A<sc>IMS AND</sc> O<sc>BJECTIVES</sc>
[2] c(M<sc>ATERIALS AND</sc> M<sc>ETHODS</sc>
[3] c(PATIENTS & METHODS,
[4] c(c(Introduction
but not, for example, words like "cross-sectional", "2013.", "2)", or "(inability". This is the first step; my idea is to be able to get to this:
"...urine bag tubing and the vent jutting above the summit also strapped with the
white plaster tapeFigure 2), \n\n AIMS AND OBJECTIVES, The aim of this
study is to ... MATERIALS AND METHODS, A cross-sectional study with a ...
surgeries.n), \n\n PATIENTS AND METHODS, This prospective double blind,...
[95] Introduction Silicosis is a fibrotic"
As a way to extract these words without grabbing other words that include punctuation (like "surgeries.n)"), I have noticed that they always start with or include the "c(" expression. But I had some trouble with the regex:
grep("c(", test)
Error in grep("c(", test) :
invalid regular expression 'c(', reason 'Missing ')''
also tried with:
grep("c\\(", test, value = T)
But that returns the whole text file. I have also used str_match from the stringr package, but I don't seem to get the pattern (regex) right. Any recommendations?
If I understood your problem (I'm unsure whether your second text is the expected output or just an intermediate step), I would go with gsub like this:
gsub("(c\\(|<\\/?sc>)","",text)
The regex (first parameter) will match c( or <sc> or </sc> and replace them with nothing, thus cleaning the text as you expect (again, if I understood your expectation correctly).
More on the regex involved:
(x|y) is the structure for an OR condition (alternation)
c\\( will match c( literally anywhere in the text
<\\/?sc> will match <sc> or </sc>, as the ? after the / means it can occur 0 or 1 times, so it's optional
The double \\ is there so that after the R interpreter has removed the first backslash, there is still a backslash to tell the regex interpreter we want to match a literal ( and a literal /
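To make this concrete, a minimal check on a shortened version of the example string (txt is just an illustrative snippet, not the full document):

txt <- "c(A<sc>IMS AND</sc> O<sc>BJECTIVES</sc>, The aim of this study is to ..."
# Remove every "c(" and every <sc>/</sc> tag
gsub("(c\\(|<\\/?sc>)", "", txt)
# [1] "AIMS AND OBJECTIVES, The aim of this study is to ..."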
Try this,
text <- "...urine bag tubing and the vent jutting above the summit also strapped with the white plaster tapeFigure 2), \n\nc(A<sc>IMS AND</sc> O<sc>BJECTIVES</sc>, The aim of this study is to ... c(M<sc>ATERIALS AND</sc> M<sc>ETHODS</sc>, A cross-sectional study with a ... surgeries.n), \n\nc(PATIENTS & METHODS, This prospective double blind,...[95] c(c(Introduction, Silicosis is a fibroticf"
require(stringr)
words <- str_split(text, " ")
words[[1]][grepl("c\\(", words[[1]])]
## [1] "\n\nc(A<sc>IMS" "c(M<sc>ATERIALS" "\n\nc(PATIENTS" "c(c(Introduction,"

How to Count Text Lines in R?

I would like to calculate the number of lines spoken by different speakers from a text using R (it is a transcript of parliamentary speaking records). The basic text looks like:
MR. JOHN: This activity has been going on in Tororo and I took it up with the office of the DPC. He told me that he was not aware of it.
MS. SMITH: Yes, I am aware of that.
MR. LEHMAN: Therefore, I am seeking your guidance, Madam Speaker, and requesting that you re-assign the duty.
MR. JOHN: Thank you
In the documents, each speaker has an identifier that begins with MR/MS and is always capitalized. I would like to create a dataset that counts the number of lines spoken by each speaker each time they spoke in a document, such that the above text would result in:
MR. JOHN: 2
MS. SMITH: 1
MR. LEHMAN: 2
MR. JOHN: 1
Thanks for pointers using R!
You can use the pattern : to split the strings and then use table:
table(sapply(strsplit(x, ":"), "[[", 1))
#   MR. JOHN MR. LEHMAN  MS. SMITH
#          2          1          1
strsplit - splits the strings at : and results in a list
sapply with [[ - selects the first element of each split (the speaker label)
table - gets the frequency of each speaker (a quick check follows below)
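A quick check with the four transcript lines from the question, assuming each utterance is one element of x:

x <- c("MR. JOHN: This activity has been going on in Tororo and I took it up with the office of the DPC. He told me that he was not aware of it.",
       "MS. SMITH: Yes, I am aware of that.",
       "MR. LEHMAN: Therefore, I am seeking your guidance, Madam Speaker, and requesting that you re-assign the duty.",
       "MR. JOHN: Thank you")
table(sapply(strsplit(x, ":"), "[[", 1))
#   MR. JOHN MR. LEHMAN  MS. SMITH
#          2          1          1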
Edit (following OP's comment): you can save the transcripts in a text file and use readLines to read the text into R.
tt <- readLines("./tmp.txt")
Now, we'll have to find a pattern by which to filter this text for just those lines with the names of those who're speaking. I can think of two approaches based on what I saw in the transcript you linked.
Check for a : and then look behind the : to see if the preceding character is any of A-Z or [:punct:] (that is, whether the character occurring before the : is a capital letter or a punctuation mark; this is because some of them have a ) before the :).
You can use strsplit followed by sapply (as shown below)
Using strsplit:
# filter tt by pattern
tt.f <- tt[grepl("(?<=[A-Z[:punct:]]):", tt, perl = TRUE)]
# Now you should only have the required lines, use the command above:
out <- table(sapply(strsplit(tt.f, ":"), "[[", 1))
There are other possible approaches (using gsub, for example) or alternative patterns. But this should give you an idea of the approach. If the pattern should differ, then you should just change it to capture all the required lines.
Of course, this assumes that there is no other line, for example, like this:
"Mr. Chariman, whatever (bla bla): It is not a problem"
Because our pattern will give TRUE for ):. If this happens in the text, you'll have to find a better pattern.
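In that case, one option is a stricter filter anchored at the start of the line. Since the question states that every identifier begins with MR/MS and is capitalized, here is a sketch of such a pattern (this pattern is my assumption about the transcript format, not part of the original answer, and it assumes single-word surnames):

# Keep only lines that start with "MR." or "MS." followed by a capitalized name
tt.f <- tt[grepl("^M[RS]\\. [A-Z]+:", tt)]
out <- table(sapply(strsplit(tt.f, ":"), "[[", 1))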
