Regular expressions error message - "Out of memory" - r

I've been playing around with R's sentiment analysis capabilities and keep running into an error that is raised when running a gsub function. The positive and negative word lists were taken from here.
After some Google searches, I found one mention of this error on the R help list but nothing else. Has anyone run into this problem? What is going on? Is there a workaround?
I've run similar code (using gsub and the stringr package) when working with strings in the past, and this is the first time I've had this type of error come up. Furthermore, I tried to reproduce the error by writing a similar script for a different set of strings, and that worked fine.
Here is the error message:
> pos_match <- str_c(vpos, collapse = "|")
> neg_match <- str_c(vneg, collapse = "|")
> dat$positive <- as.numeric(str_detect(dat$Comment, pos_match))
> dat$negative <- as.numeric(str_detect(dat$Comment, neg_match))
Error: invalid regular expression, reason 'Out of memory'
Here's the whole 'process.'
require(tm); require(stringr); require(lubridate); library(RTextTools)
d1 <- read.csv("Video_Comments.csv", stringsAsFactors=FALSE, sep=",", fileEncoding="ISO_8859-2")
pos <- read.csv("positive-words.csv", stringsAsFactors=FALSE, header=TRUE, fileEncoding="ISO_8859-2")
neg <- read.csv("negative-words.csv", stringsAsFactors=FALSE, header=TRUE, fileEncoding="ISO_8859-2")
vpos = as.vector(pos[,1]); vneg = as.vector(neg[,1])
head(vpos); head(vneg)
colnames(d1); nrow(d1); ncol(d1)
str(d1); head(d1)
table(d1$Likes); table(d1$Replies)
nrow(vpos); nrow(vneg)
length(vpos); length(vneg)
is.atomic(vpos); is.atomic(vneg)
dat = data.frame(Comment=c(d1$Comment))
dat$Comment = gsub('[[:punct:]]', '', dat$Comment)
dat$Comment = gsub('[[:cntrl:]]', '', dat$Comment)
dat$Comment = gsub('\\d+', '', dat$Comment)
dat$Comment = tolower(dat$Comment)
vpos = gsub('[[:punct:]]', '', vpos); vneg = gsub('[[:punct:]]', '', vneg)
vpos = gsub('[[:cntrl:]]', '', vpos); vneg = gsub('[[:cntrl:]]', '', vneg)
vpos = gsub('\\d+', '', vpos); vneg = gsub('\\d+', '', vneg)
vpos = tolower(vpos); vneg = tolower(vneg)
head(vpos); head(vneg)
pos_match <- str_c(vpos, collapse = "|")
neg_match <- str_c(vneg, collapse = "|")
dat$positive <- as.numeric(str_detect(dat$Comment, pos_match))
dat$negative <- as.numeric(str_detect(dat$Comment, neg_match))
Another error message I've received is the following:
> dat$negative <- as.numeric(str_detect(dat$Comment, neg_match))
Error: invalid regular expression 'faced|faces|abnormal|abolish|abominable|abominably|abominate|abomination|abort|aborted|
Data for reproducing error:
dat = c("Hey guys I am Aliza Lomez...18 y.o. I need your likes please like my page and find love quotes, beauty tips and much more.Please like my page you will never regret thank u all\u0083 <3 <3 <3...",
"Alexandra Saturn", "And that's what makes a Subaru a Subaru", "Missouri in a battleground....; meanwhile in southern California....", "What the Frisbee", "very cool !!!!", "Get a life",
"Try that with my GT!!!", "Did he make any money?", "Wo! WO! BSMITH THROWING DISCS WITH SUBARUS?!?! THIS IS SO AWESOME! SHOULD OF USED AN STI THO")

I don't know the entire solution but I can get you started. I made this community wiki so, hopefully, someone can fill in the blanks...
For the invalid regex, to create an OR you need to enclose everything in parentheses. For example, if you wanted to match the words "a", "an", or "the", you would use the regex string (a|an|the). If I have a list of words I'd like to match with an OR in regex, here's what I usually use:
mywords <- c("a", "an", "the")
mystring <- paste0("(", paste(mywords, collapse="|"), ")")
> mystring
[1] "(a|an|the)"
That should rid you of the invalid regex error, as your string doesn't begin with an open parenthesis and ends with a pipe instead of a close parenthesis.
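If the combined pattern is still too large for the regex engine, one way to sidestep the problem is to skip the giant alternation entirely and test each word as a fixed string. The following is a minimal base R sketch under that assumption; has_any_word and the toy vectors are hypothetical names, not part of stringr or tm. Empty entries are dropped first, because the cleanup steps can reduce a list entry to "" and an empty alternative may be one reason a pattern is rejected.

```r
# Hypothetical helper: TRUE for each text that contains at least one of the words.
# Each word is matched as a literal substring, so no large regex is ever compiled.
has_any_word <- function(text, words) {
  words <- unique(words[!is.na(words) & nzchar(words)])
  hit <- logical(length(text))
  for (w in words) {
    hit <- hit | grepl(w, text, fixed = TRUE)
  }
  hit
}

comments <- c("very cool", "get a life", "this is so awesome")
vpos_demo <- c("cool", "awesome", "")   # toy word list, assumed for illustration
positive <- as.numeric(has_any_word(comments, vpos_demo))
positive
```

Like the unanchored alternation in the question, this matches substrings, so a word such as "art" would also hit "party".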


Within a column, I'd like to gsub each row of string values and remove any value that matches a list of values I created

I am working with a messy data file right now. I have a list of comments that I'd like to sort out, and I want to grab the most common combinations of phrases. An example would be "Did not qualify because of X and Y" and "Did not qualify because of Y and X". I am trying to remove stop words so I can match X and Y as a common phrase. I was able to do this easily for common single words, but phrases are a little more difficult. Below is my code for context.
Create Datafile
dat1 <- dat %>% filter(Action != Exclude)
Remove problem characters
dat1$Comments <- stri_trans_general(dat1$Comments, "latin-ascii")
dat1$Comments <- gsub(pattern='<[^<>]*>', replacement=" ", x=dat1$Comments)
dat1$Comments <- gsub(pattern='\n', replacement=" ", x=dat1$Comments)
dat1$Comments <- gsub(pattern="[[:punct:]]", replacement=" ", x=dat1$Comments)
Remove stop words (Where my problem is)
sw <- paste0("\\b(", paste0(stop_words$word, collapse="|"), ")\\b")
dat1$Comments <- lapply(dat1$Comments, function(x) (gsub(pattern=sw, replacement=" ", x)))
Remove extra spaces between words
dat1$Comments <- trimws(gsub("\\s+", " ", dat1$Comments))
dat1$Comments <- gsub("(^[[:space:]]*)|([[:space:]]*$)", "", dat1$Comments)
Sweet Data
top_phrases <- data.frame(text = dat1$Comments) %>%
unnest_tokens(bigram, text, 'ngrams', n = Length, to_lower = TRUE) %>%
count(bigram, sort = TRUE)
This is what pops up and is traced back to the gsub code
Error in gsub(pattern = sw, replacement = " ", x) : assertion 'tree->num_tags == num_tags' failed in executing regexp: file 'tre-compile.c', line 634
If anyone is curious, here is what is stored in "sw"
Both TRE (the default regex engine used in base R regex functions) and PCRE (the regex engine used in base R regex functions with perl=TRUE) have quite hard limits for the pattern length.
In your case, stringr regex functions will work better, as they use the ICU regex engine, which supports much longer regex patterns.
So, you may replace
gsub(pattern=sw, replacement=" ", x)
with
stringr::str_replace_all(x, sw, " ")
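If adding stringr is not an option, another way around the pattern-length limit is to split the word list into smaller groups and run one gsub per group. This is only a sketch in base R: remove_words_chunked and chunk_size are hypothetical names, the default chunk size of 200 is an arbitrary assumption, and the words are assumed to contain no regex metacharacters.

```r
# Hypothetical helper: remove whole-word matches, one small pattern at a time.
remove_words_chunked <- function(x, words, chunk_size = 200L) {
  words <- unique(words[!is.na(words) & nzchar(words)])
  # Assign consecutive words to groups of at most chunk_size elements
  chunks <- split(words, ceiling(seq_along(words) / chunk_size))
  for (chunk in chunks) {
    pattern <- paste0("\\b(", paste(chunk, collapse = "|"), ")\\b")
    x <- gsub(pattern, " ", x, perl = TRUE)
  }
  x
}

x <- "did not qualify because of x and y"
sw_demo <- c("did", "not", "because", "of", "and")   # toy stop-word list
cleaned <- remove_words_chunked(x, sw_demo, chunk_size = 2L)
cleaned <- trimws(gsub("\\s+", " ", cleaned))         # collapse leftover spaces
cleaned
```

Each pattern stays short no matter how long the word list gets, at the cost of one pass over the text per chunk.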

Text argument from a dataframe

I am trying to test this nice solution using a data frame as the input for your_sentences.
remove_words <- function(sentence, badword="blame"){
  tagged.text <- treetag(file=sentence, format="obj", treetagger="manual", lang="en",
                         TT.options=list(path=":C\\Treetagger", preset="en"))
  # Check for bad words AND verb:
  cond1 <- (tagged.text@TT.res$token == badword)
  cond2 <- (substring(tagged.text@TT.res$tag, 0, 1) == "V")
  redflag <- which(cond1 & cond2)
  # If no such case, return sentence as is. If so, then remove that word:
  if(length(redflag) == 0) return(sentence)
  splitsent <- strsplit(sentence, " ")[[1]]
  splitsent <- splitsent[-redflag]
  return(paste0(splitsent, collapse=" "))
}
lapply(your_sentences, remove_words)
The data frame has 1 column and 351 rows. In the lapply call I pass the text column of my data frame in place of your_sentences, and I receive this error (I get the same error when I pass the data frame itself without selecting the column):
> dfnew <- lapply(df$text, remove_words)
Error in writeLines(text, con = conn.tempfile) : invalid 'text' argument
What can I do to solve the error?
Example data:
df = data.frame(text = c('I blame myself for what happened', 'For what happened the blame is yours', 'I will blame you if my friend removes'))
What a bummer, I had hoped that it was only a typo :-). But I have a second guess. You probably ran into the difficulties caused by stringsAsFactors = TRUE. This might have caused the type factor, instead of character, to be passed to your function. Try the following:
df = data.frame(text = c('I blame myself for what happened'
, 'For what happened the blame is yours'
, 'I will blame you if my friend removes')
, stringsAsFactors = FALSE)
Your strings seem to be saved as factors, so remove_words is supplied with factor values instead of strings. Passing stringsAsFactors = FALSE as an argument will solve the issue:
df <- data.frame(text = c('I blame myself for what happened',
                          'For what happened the blame is yours',
                          'I will blame you if my friend removes'),
                 stringsAsFactors = FALSE)
Or, if you have already defined your df with factors, you can change that using df[] <- lapply(df, as.character), which converts every column while keeping df a data frame.
lapply(df$text, remove_words)
[1] "I myself for what happened"
[1] "For what happened the blame is yours"
[1] "I will you if my friend removes"
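To see why a factor column trips up writeLines, it can help to check the column's class before calling remove_words. The following is a small base R sketch, assuming a data frame built with stringsAsFactors = TRUE (the default before R 4.0.0); df_demo is a made-up name.

```r
df_demo <- data.frame(text = c("I blame myself for what happened",
                               "For what happened the blame is yours"),
                      stringsAsFactors = TRUE)
class(df_demo$text)          # "factor"

# writeLines() only accepts character vectors, so a factor element fails
res <- tryCatch({
  writeLines(df_demo$text[1], con = tempfile())
  "ok"
}, error = function(e) "error")
res

# Convert the column in place; the data frame structure is kept
df_demo$text <- as.character(df_demo$text)
class(df_demo$text)          # "character"
```

After the conversion, each element handed to the function by lapply is an ordinary string.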

How do I remove line breaks from a string?

I am extracting tweets using the Twitter API in R.
I have been saving my results to csv in R using a write.csv2 command, which is fine, but there is an issue where line breaks in the tweet text spread a single tweet over multiple rows in the spreadsheet.
I've tried using str_replace_all but it doesn't seem to work for me, and I can't find anything as to why.
Here is my code
searchTags = c("Galwaybikeshare", "Corkbikeshare", "dublinbikes", "BelfastBikes", "SantanderCycles", "CitiBikeNYC", "obike", "Hubway", "bicing")
additionalParams = c("-rt -http")
searchString <- paste((paste(searchTags[1:9], collapse = " OR ")), additionalParams, collapse = "")
tweets_list <- searchTwitter(searchString, n=20, lang = "en", resultType = 'recent')
str_replace_all(tweets_list, "[\r\n]" , "")
tweets.df <- twListToDF(tweets_list)
todayDate <- Sys.Date()
tweetArchive <- paste("BikeShareTweets ", todayDate, ".csv", sep ="")
write.csv2(tweets.df, file = tweetArchive)
The text below is an example of a tweet which is causing the issue.
"TransitNinja205: 0.01% of the budget for 5-borough #CitiBikeNYC,\nand 0.2% for #FairFares. #NYCmayor #NYCmayorsOffice #progressive"
Why isn't my str_replace_all removing the \n from the text?
stringr::str_replace_all works, you’re just ignoring the result. To fix it:
tweets_list = str_replace_all(tweets_list, "[\r\n]" , "")
stringr::str_remove_all will also do this for you.
tweets_list = str_remove_all(tweets_list, "[\r\n]")
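Since str_replace_all returns a plain character vector, it may be more convenient to clean the text column after the conversion to a data frame and just before writing the file. Here is a minimal sketch in base R, assuming a data frame with a text column like the one twListToDF produces; tweets_demo is made-up data.

```r
tweets_demo <- data.frame(
  text = c("0.01% of the budget for 5-borough #CitiBikeNYC,\nand 0.2% for #FairFares.",
           "no line break here"),
  stringsAsFactors = FALSE
)

# Replace each run of CR/LF characters with one space so words are not glued together
tweets_demo$text <- gsub("[\r\n]+", " ", tweets_demo$text)

out_file <- tempfile(fileext = ".csv")
write.csv2(tweets_demo, file = out_file)
length(readLines(out_file))   # header plus one line per tweet
```

Replacing with a space instead of "" is a design choice: it keeps "CitiBikeNYC," and "and" apart once the line break is gone.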

Efficient way to remove all proper names from corpus

Working in R, I'm trying to find an efficient way to search through a file of texts and remove or replace all instances of proper names (e.g., Thomas). I assume there is something available to do this, but I have been unable to locate it.
So, in this example the words "Susan" and "Bob" would be removed. This is a simplified example; in reality I would want this to apply to hundreds of documents and therefore to a fairly large list of names.
texts <- (rbind (
'This text stuff if quite interesting',
'Where are all the names said Susan',
'Bob wondered what happened to all the proper nouns'
))
names(texts) [1] <- "text"
Here's one approach based upon a data set of first names:
sets <- data(package = "genderdata")$results[,"Item"]
data(list = sets, package = "genderdata")
stopwords <- unique(kantrowitz$name)
texts <- (rbind (
'This text stuff if quite interesting',
'Where are all the names said Susan',
'Bob wondered what happened to all the proper nouns'
))
removeWords <- function(txt, words, n = 30000L) {
  # Cumulative pattern length, counting one separator per additional word
  l <- cumsum(nchar(words)+c(0, rep(1, length(words)-1)))
  # Split the words into groups of roughly n characters each
  groups <- cut(l, breaks = seq(1,ceiling(tail(l, 1)/n)*n+1, by = n))
  regexes <- sapply(split(words, groups), function(words) sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), collapse = "|")))
  # Case-sensitive match; add ignore.case = TRUE to also catch lowercased names
  for (regex in regexes) txt <- gsub(regex, "", txt, perl = TRUE)
  return(txt)
}
removeWords(texts[,1], stopwords)
# [1] "This text stuff if quite interesting"
# [2] "Where are all the names said "
# [3] " wondered what happened to all the proper nouns"
It may need some tuning for your specific data set.
Another approach could be based upon part-of-speech tagging.

Generalizing a function to return a list of data.frame columns with invalid UTF-8 bytes/code points

I recently wrote a function that uses grep and regex to find invalid UTF-8 code points (since I work on a Mac, my locale is also UTF-8). The input doesn't have to be UTF-8, as the function looks for invalid UTF-8 bytes. I wrote the function for work, and was wondering if anyone could provide tips for generalizing it or catch any red flags in the code that I didn't notice (e.g. using base code instead of dplyr). Feel free to use any of the code if it's useful to you.
enc_check <- function(data) {
# Create vector of all possible 2-digit hexadecimal numbers (2 digits is the length of a byte)
allBytes <- list(x_esc = '\\x',
hex1 = as.character(c(seq(0,9), c('a','b','c','d','e','f'))),
hex2 = as.character(c(seq(0,9), c('a','b','c','d','e','f')))
) %$%
expand.grid(x_esc, hex1, hex2) %>%
apply(1, paste, collapse = '')
# Valid mixed alphanumeric bytes
validBytes1 <- list(x_esc = '\\x',
hexNum = as.character(c(seq(2,7))),
hexAlpha = c('a','b','c','d','e','f')
) %$%
expand.grid(x_esc, hexNum, hexAlpha) %>%
apply(1, paste, collapse = '') %>%
extract(. != '\\x7f')
# Valid purely numeric bytes
validBytes2 <- list(x_esc = '\\x',
hexNum2 = as.character(seq(20,79))
) %$%
expand.grid(x_esc, hexNum2) %>%
apply(1, paste, collapse = '')
# New-line byte
validBytes3 <- '\\x0a'
# charToRaw('\n')
# [1] 0a
# Filter all possible combinations down to only invalid bytes
validBytes <- c(validBytes1, validBytes2, validBytes3)
invalidBytes <- allBytes %>%
extract(not(is_in(., validBytes)))
# Create list of data.frame columns with invalid bytes
a_vector <- vector()
matches <- list()
for (i in 1:dim(data)[2]) {
  a_vector <- data[,i]
  matches[[i]] <- unlist(sapply(invalidBytes, grep, a_vector, useBytes = TRUE))
}
# Get rid of empty list elements
matches %<>%
  lapply(length) %$%
  extract(matches, . > 0)
# matches <- matches[lapply(matches,length) > 0]
return(matches)
}
Edit: Here's the updated code with the suggestions implemented.
enc_check <- function(dataset) {
  # allBytes is assumed to hold the "\\xNN" patterns for all 256 byte values, built as in the first version
  rASCII <- c( '\n', '\r', '\t','\b',
               '\a', '\f', '\v', '\\', '\'', '\"', '\`')
  validBytes <- paste0("\\x",
                       sapply(rASCII, charToRaw))
  invalidBytes <- allBytes %>%
    extract(not(is_in(., validBytes)))
  a_vector <- vector()
  matches <- list()
  for (i in 1:dim(dataset)[2]) {
    a_vector <- dataset[,i]
    matches[[i]] <- unlist(sapply(invalidBytes, grep, a_vector, useBytes = TRUE))
  } # sapply() is preferable to lapply due to USE.NAMES = TRUE
  names(matches) <- names(dataset)
  matches %<>%
    lapply(length) %$%
    extract(matches, . > 0)
  return(matches)
}
2nd Edit: A better strategy was to use iconv. Let's say you have a file or object that is generally UTF-8 but contains some invalid bytes. This is often the case on Mac computers, whose default locale setting seems to be UTF-8. Moreover, Mac-based RStudio seems to use UTF-8 internally, and this can't be changed even if you set your computer's locale to a different encoding.
Anyway, you can use iconv to substitute the Unicode replacement symbol for all invalid bytes, which are normally displayed as hexadecimal bytes (e.g. "\x8f"). Then you can search for that symbol and return a list of the unique observations within a data.frame column that contain it. Based on that, you can use sub() to replace those characters with the desired ones.
One thing to note is that converting the file to another encoding, say latin-1, can have unexpected results if invalid bytes are present. When I did this, I noticed that some invalid bytes were converted to Unicode control characters, while other invalid bytes apparently matched valid latin-1 bytes and were displayed as nonsensical characters.
In any case, I wrote a package to search data.frames for these characters and return a list, then do some replacement. The package isn't nearly as official as something off of CRAN, but if anyone's interested then here's a link to the repository: It's important to note that the "stable" version of the package isn't on the "master" branch; it's actually on branch "iconv". The documentation can be searched in R via "?FixEncoding" after installation of the correct branch, then finding the functions listed there and searching help for those.
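A rough sketch of the iconv idea in base R follows. It assumes the input is meant to be UTF-8; validUTF8 and iconv are base R functions, while the sample strings and the find_invalid helper are invented for illustration and are not taken from the package mentioned above. Here sub = "byte" is used so that each offending byte shows up as a visible hex marker instead of the replacement symbol.

```r
# Two well-formed strings and one containing a stray latin-1 byte (\xe9)
x <- c("plain ascii", "caf\xe9 au lait", "all good")

# Flag the elements that are not valid UTF-8
bad <- !validUTF8(x)
which(bad)

# Make the invalid bytes visible: each one is replaced by its hex code in angle brackets
marked <- iconv(x, from = "UTF-8", to = "UTF-8", sub = "byte")

# Hypothetical helper: names of the character columns that contain invalid UTF-8
find_invalid <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  has_bad <- vapply(df[is_chr], function(col) any(!validUTF8(col)), logical(1))
  names(has_bad)[has_bad]
}

demo_df <- data.frame(a = x, b = c("x", "y", "z"), stringsAsFactors = FALSE)
bad_cols <- find_invalid(demo_df)
bad_cols
```

Checking validity first keeps the original data untouched, and iconv is only needed once you decide how the bad bytes should be displayed or replaced.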
This will construct all the alpha-versions of the hex numbers to "ff":
allBytes <- as.character( as.hexmode(0:255) )
Or as greppish patterns as you seem to prefer:
allBytes <- paste0("\\x", as.character( as.hexmode(0:255) ) )
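One caveat, stated as an assumption about newer R releases: as.character on a hexmode vector may not zero-pad single-digit values there, which would yield a pattern ending in "x0" where "x00" is wanted. A sketch that builds the same 256 patterns with sprintf and fixes the width explicitly, independent of that behaviour:

```r
# "%02x" always yields two lowercase hex digits, padded with a leading zero
allBytes <- sprintf("\\x%02x", 0:255)

head(allBytes, 3)
length(allBytes)

# Every pattern is 4 characters long: backslash, "x", and two hex digits
all(nchar(allBytes) == 4)
```

If the hexmode route is preferred, format(as.hexmode(0:255), width = 2) should give the same padded digits.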
The "special" characters that R recognizes do include the "\n" that you listed, but also a few more listed on the ?Quotes help page:
rASCII <- c( '\n', '\r', '\t','\b',
'\a', '\f', '\v', '\\', '\'', '\"', '\`')
You could create a vector of valid grep patterns for the characters from space up to tilde ("~") just with this:
validBytes1 <- c(rASCII, paste0("\\x", as.hexmode(32:126)))
I have concerns about using this strategy, since my R throws errors when it tries to do greppish matches with what it considers an invalid input string.
> txt <- "ttt\nuuu\tiii\xff"
> dfrm <- data.frame(a = txt)
> lapply(dfrm, grep, patt = "\\xff")
Warning message:
In FUN(X[[i]], ...) : input string 1 is invalid in this locale
> lapply(dfrm, grep, patt = "\\\xff")
Error in FUN(X[[i]], ...) : regular expression is invalid in this locale
> lapply(dfrm, grep, patt = "\xff")
Error in FUN(X[[i]], ...) : regular expression is invalid in this locale
You may want to switch over to grepRaw since it doesn't throw the same errors:
> grepRaw("\xff", txt)
[1] 12
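Along the same lines, the raw bytes can be inspected directly without any regex engine. This is a small sketch, assuming that any byte above 0x7f counts as suspicious (which also flags legitimate multi-byte UTF-8 characters); high_byte_positions is a hypothetical helper name.

```r
# Hypothetical helper: byte offsets in a string whose value is above 0x7f (non-ASCII)
high_byte_positions <- function(s) {
  which(as.integer(charToRaw(s)) > 127L)
}

txt <- "ttt\nuuu\tiii\xff"
high_byte_positions(txt)                  # 12, the same offset grepRaw reports
high_byte_positions("plain ascii text")   # integer(0)
```

Because charToRaw never interprets the string in the current locale, this avoids the "invalid in this locale" warnings and errors shown above.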
Or you may use ?tools::showNonASCII, as suggested by Duncan Murdoch when this came up on Rhelp 4 years ago:
# and the help page has a reproducible example of its use:
out <- c(
"fa\xE7ile test of showNonASCII():",
"\\details{",
" This is a good line",
" This has an \xfcmlaut in it.",
" OK again.",
"}")
f <- tempfile()
cat(out, file = f, sep = "\n")
tools::showNonASCIIfile(f)
unlink(f)
#-------output appears in red----
1: fa<e7>ile test of showNonASCII():
4: This has an <fc>mlaut in it.
