String match error "invalid regular expression, reason 'Out of memory'" - r

I have a table that is shaped like this called df (the actual table is 16,263 rows):
title           date        brand
big farm house  2022-01-01  A
ranch modern    2022-01-01  A
town house      2022-01-01  C
Then I have a table like this called match_list (the actual list is 94,000 rows):
words_for_match
farm
town
clown
beach
city
pink
And I'm trying to filter the first table to just be rows where the title contains a word in the words_for_match list. So I do this:
match_list <- match_list$words_for_match
match_list <- paste(match_list, collapse = "|")
match_list <- sprintf("\\b(%s)\\b", match_list)
df %>%
filter(grepl(match_list, title))
But then I get the following error:
Problem while computing `..1 = grepl(match_list, subject)`.
Caused by error in `grepl()`:
! invalid regular expression, reason 'Out of memory'
If I filter the table with 94,000 rows to just 1,000 then it runs, so it appears to just be a memory issue. So I'm wondering if there's a less memory-intensive way to do this or if this is an example of needing to look beyond my computer for computation. Advice on either pathway (or other options) is welcome. Thanks!

You could filter the titles sequentially: if, say, 10 titles match 'farm', you do not need to test those titles against the remaining words.
Here is a simple implementation:
titles <- c("big farm house", "ranch modern", "town house")
words_for_match <- c("farm", "town", "clown", "beach", "city", "pink")
titles.to.keep <- c()
for(w in words_for_match)
{
w <- sprintf("\\b(%s)\\b", w)
is.match <- grepl(w, titles)
titles.to.keep <- c(titles.to.keep, titles[is.match])
titles <- titles[!is.match]
print(paste(length(titles), "remaining titles"))
}
titles.to.keep
If you know roughly how frequent the words in match_list are, start with the most frequent ones, since they remove the most titles early on.
UPDATE
You can also combine this with your original strategy to make it faster:
gr.size <- 20
gr.words <- split(words_for_match, ceiling(seq_along(words_for_match) / gr.size))
gr.words <- sapply(gr.words, function(words)
{
words <- paste(words, collapse = "|")
sprintf("\\b(%s)\\b", words)
})
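Then iterate over gr.words instead of words_for_match in the first code chunk (the sprintf() call inside the loop becomes redundant, since each group is already wrapped in \\b(...)\\b).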
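Putting the two chunks together might look like the sketch below. It assumes the words contain no regex metacharacters (otherwise they would need escaping first), uses a group size of 2 only so that the toy data spans several groups, and tracks matches in a logical vector (`keep`, a name introduced here) so that the original order of the titles is preserved.

```r
titles <- c("big farm house", "ranch modern", "town house")
words_for_match <- c("farm", "town", "clown", "beach", "city", "pink")

# Build one small alternation pattern per group of words
gr.size <- 2  # toy value; something like 20 or more is more realistic
gr.words <- split(words_for_match, ceiling(seq_along(words_for_match) / gr.size))
gr.patterns <- vapply(gr.words, function(words) {
  sprintf("\\b(%s)\\b", paste(words, collapse = "|"))
}, character(1))

# keep[i] becomes TRUE once title i has matched any group
keep <- rep(FALSE, length(titles))
for (p in gr.patterns) {
  todo <- which(!keep)            # only titles that have not matched yet
  if (length(todo) == 0) break    # nothing left to test
  keep[todo] <- grepl(p, titles[todo])
}

titles.to.keep <- titles[keep]
titles.to.keep
```

For the data frame in the question, the same loop can be run on df$title and the result used as df[keep, ].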

Related

Finding and Replacing parts of person names

I have a dataframe with a column of politician names that were extracted from thousands of news articles. Each row is a specific article. I want to count which politicians are mentioned the most, but count each name only once per article (row).
An entity recognition algorithm returned these results. Now I have to convert the names into a standard form to be able to summarise and compare them.
Because I am interested in at most 20 people and I know their names, I thought that, despite the effort, manually coding the patterns for each name might be the fastest way (I am happy to hear other ideas).
library(dplyr)
library(stringr)
# example data
persons <- c("Merkel,Angela Merkel,Trump,Ursula,Merkels", "Ursula von,Trumps,Donald Trump,Leyen")
df <- data.frame(persons)
# pad the separators so that every name is preceded by a space
df <- df %>%
  mutate(
    persons = paste(" ", str_replace_all(persons, ",", " , "), sep = "")
  )
# example of extracting the names... and so on, you get the idea
str_replace_all(df$persons, c(" Trump(s)?" = "Donald Trump", ", Trump(s)?" = ", Donald Trump", "Donald Trumps" = "Donald Trump",
                              " Merkel(s)?" = "Angela Merkel"))
My desired output has just the full names in each row. In the end I would remove the duplicated names per row, and then I could count over the dataset as desired.
The data should look like this in the end:
persons <- c("Angela Merkel,Angela Merkel,Donald Trump,Ursula von der Leyen,Angela Merkel", "Ursula von der Leyen,Donald Trump,Donald Trump,Ursula von der Leyen")
I have an especially hard time with patterns for names that consist of more than two parts, like Ursula von der Leyen. What would be the best way to convert the names, and what would the replacement pattern look like?
Edit
I have now written a function which makes sure that there is only one instance of a name in each row of my dataframe. It is not really elegant or nice code, but it works.
clean_name <- function(x) {
b <- unlist(strsplit(x, '[,]')) %>%
str_squish(.)
c <- b[!duplicated(b)]
# vectors for full names (forename and surname) and for bare surnames
ganzer_name <- vector()
nachname<-vector()
for (person in c){
if(any(str_count(person," ") == 0)){
nachname <- append(nachname,person)
} else{
ganzer_name <- append(ganzer_name,person)
}
}
# check for constructions with a trailing s, such as "Angela Merkels"
ganzer_name <- ganzer_name%>%
str_sort() %>% str_replace(.,"\\+","")
i <-0
aussortieren <- c("","")
while(i < (length(ganzer_name)-1) ){
i <- i+1
if(str_detect(ganzer_name[i+1],
paste0(ganzer_name[i],"*")
)){
aussortieren <- append(aussortieren, ganzer_name[i+1] )
}else{ }
}
ganzer_name <- ganzer_name[!ganzer_name %in% aussortieren]
#check if surname is already in a full name
for( person in nachname ){
#check construction with s like Merkels
if(any(str_detect(ganzer_name, paste0(
"\\Q",
str_sub(person,end = nchar(person)-1),
"\\E" )
)
)
) {
} else {
ganzer_name <- append(ganzer_name,person)
}
}
return(paste(ganzer_name, collapse=","))
}
Instead of turning the various combinations of names into a standard format, removing duplicates and then counting, here is a different approach.
We can use grepl for pattern matching and count how many times a politician occurs in different news articles.
name <- c('Trump', 'Merkel')
sapply(name, function(x) sum(grepl(x, df$persons)))
#  Trump Merkel
#      2      1
Use ignore.case = TRUE in grepl if you want to make the comparison case insensitive.
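For names with several parts, such as Ursula von der Leyen, the same idea can be extended with one pattern per politician. The sketch below is only an illustration: the `patterns` vector and its alternations are assumptions introduced here, and a loose pattern like "Ursula" would also match any other Ursula in the data.

```r
persons <- c("Merkel,Angela Merkel,Trump,Ursula,Merkels",
             "Ursula von,Trumps,Donald Trump,Leyen")

# names = standard form, values = regex covering the fragments seen in the data
patterns <- c("Angela Merkel"        = "Merkel",
              "Donald Trump"         = "Trump",
              "Ursula von der Leyen" = "Ursula|Leyen")

# number of articles (rows) mentioning each politician at least once
counts <- sapply(patterns, function(p) sum(grepl(p, persons, ignore.case = TRUE)))
counts
```

Because grepl() returns one TRUE or FALSE per row, each politician is counted at most once per article, however many variants of the name appear in that row.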

Remove words per year in a corpus

I am working with a corpus with speeches spanning several years (aggregated to person-year level). I want to remove words that occur less than 4 times in a year (not remove it for the whole corpus, but only for the year in which it does not meet the threshold).
I have tried the following:
DT$text <- ifelse(grepl("1998", DT$session), mgsub(DT$text, words_remove_1998, ""), DT$text)
and
DT$text <- ifelse(grepl("1998", DT$session), str_remove_all(DT$text, words_remove_1998), DT$text)
and
DT$text <- ifelse(grepl("1998", DT$session), removeWords(DT$text, words_remove_1998), DT$text)
and
DT$text <- ifelse(grepl("1998", DT$session), drop_element(DT$text, words_remove_1998), DT$text)
However, none seem to work. Mgsub just substitutes the whole speech with "" for 1998, whilst the other options give error messages. The reason that removeWords does not work is that my words_remove_1998 vector is too large. I have tried to split the word vector and loop over the words (see code below), but R does not appear to like this (running forever).
group <- 100
n <- length(words_remove_1998)
r <- rep(1:ceiling(n/group),each=group)[1:n]
d <- split(words_remove_1998,r)
for (i in 1:length(d)) {
DT$text <- ifelse(grepl("1998", DT$session), removeWords(DT$text, c(paste(d[[i]]))), DT$text)
}
Any suggestions for how to solve this?
Thank you for your help!
Reproducible example:
text <- rbind(c("i like ice cream"), c("banana ice cream is my favourite"), c("ice cream is not my thing"))
name <- rbind(c("Arnold Ford"), c("Arnold Ford"), c("Leslie King"))
session <- rbind("1998", "1999", "1998")
DT <- cbind(name, session, text)
words_remove_1998 <- c("like", "ice", "cream")
newtext <- rbind(c("i"), c("banana ice cream is my favourite"), c("is not my thing"))
DT <- cbind(DT, newtext)
My real word vector that I want removed contains 30k elements.
I ended up not using any wrappers, as none of them could handle the size of the data. Instead I did it the old-fashioned and simple way: separate the text into one row per word, count the occurrences of each word per session (year) and person, then remove the rows that fall below a threshold (the same limit I used to identify the vector of words I wanted to remove). Lastly, I aggregate the data back to its initial level (person-year).
This only works because I am removing words according to a threshold. If I had a list of words to remove that I could not derive in this way, I would have been in more trouble.
DT_separate <- separate_rows(DT, text)
df <- DT_separate %>%
dplyr::group_by(session, text) %>%
dplyr::mutate(count = dplyr::n())
df <- df[df$count >5, ]
df <- aggregate(
text ~ x, #where x is a person-year id
data=df,
FUN=paste, collapse=' '
)
names(df)[names(df) == 'text'] <- 'text2'
DT <- left_join(DT, df, by="x")
DT$text <- DT$text2
DT <- DT[, !(colnames(DT) %in% c("text2"))]
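The same split, count, filter and re-aggregate idea can be tried on the reproducible example from the question. The sketch below uses base R only; the threshold `min_count` and the column name `text_filtered` are assumptions chosen for the toy data, and words are counted per session as in the code above.

```r
DT <- data.frame(
  name    = c("Arnold Ford", "Arnold Ford", "Leslie King"),
  session = c("1998", "1999", "1998"),
  text    = c("i like ice cream", "banana ice cream is my favourite",
              "ice cream is not my thing"),
  stringsAsFactors = FALSE
)
min_count <- 2  # hypothetical threshold: keep words seen at least twice per session

# one row per (document, word)
tokens <- strsplit(DT$text, "\\s+")
long <- data.frame(
  id      = rep(seq_len(nrow(DT)), lengths(tokens)),
  session = rep(DT$session, lengths(tokens)),
  word    = unlist(tokens),
  stringsAsFactors = FALSE
)

# how often each word occurs within its session
long$count <- ave(seq_along(long$word), long$session, long$word, FUN = length)

# drop rare words, then paste the remaining words back together per document
kept <- long[long$count >= min_count, ]
DT$text_filtered <- vapply(seq_len(nrow(DT)), function(i) {
  paste(kept$word[kept$id == i], collapse = " ")
}, character(1))
DT
```

No regular expression is built from the word list here, so the size of the vocabulary does not matter.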

JSON applied over a dataframe in R

I used the code below for a single search and it returned a perfect result.
The keyword I am looking for, Emaar, is pasted at the end of the query:
library(httr)
library(jsonlite)
query<-"https://www.googleapis.com/customsearch/v1?key=AIzaSyA0KdZHRkAjmoxKL14eEXp2vnI4Yg_po38&cx=006431301429107149113:as7yqcm2qc8&q=Emaar"
result11 <- content(GET(query))
print(result11)
result11_JSON <- toJSON(result11)
result11_JSON <- fromJSON(result11_JSON)
result11_df <- as.data.frame(result11_JSON)
Now I want to apply the same function over a data.frame containing keywords.
So I created the test .csv file below:
Company Name
[1] ADES International Holding Ltd
[2] Emirates REIT (CEIC) Limited
[3] POLARCUS LIMITED
I called it Testing Website Extraction.csv.
Code used:
test_companies <- read.csv("... \\Testing Website Extraction.csv")
# replace spaces with a "+" sign, then paste the query in front (the query already has my unique Google key and search engine ID)
test_companies$plus <- gsub(" ", "+", test_companies$Company.Name)
query <- "https://www.googleapis.com/customsearch/v1?key=AIzaSyCmD6FRaonSmZWrjwX6JJgYMfDSwlR1z0Y&cx=006431301429107149113:as7yqcm2qc8&q="
test_companies$plus <- paste0(query, test_companies$plus)
a <- test_companies$plus
length(a)
function_webs_search <- function(web_search) {content(GET(web_search))}
result <- lapply(as.character(a), function_webs_search)
The result is a list of length 3 (one element per search term). Each element contains the same sublists, with the same lengths: url (list[2]), queries (list[2]), ... items (list[10]). My issue is applying the remainder of the code to it.
#when i run:
result_JSON <- toJSON(result)
result_JSON <- as.list(fromJSON(result_JSON))
I get a list of 6 lists, each with sublists.
Putting this into a tidy dataframe, with the results for all search terms listed under each other rather than separately, is proving difficult.
Note that I also tried handling each of the 3 lists inside "result" on its own, but that is a lot of manual labor if I have a longer list of keywords.
The expected end result should have 30 observations of 37 variables (10 observations of 37 variables for each search term, all underneath each other).
Things I have tried without success (these do flatten the list, but do not get me further):
#do.call(c , result)
#all.equal(listofvectors, res, check.attributes = FALSE)
#unlist(result, recursive = FALSE)
# for (i in 1:length(result)) {listofvectors <- c(listofvectors, result[[i]])}
#rbind()
#rbind.fill()
Even after flattening, I don't know how to organize the results into a tidy final output that a non-R user can interact with.
Any help here would be greatly appreciated,
I am here in case anything is not clear about my question,
Always happy to learn more about R so please bear with me as I am just starting to catch up.
All the best and thanks in advance!
Basically, what I did was extract only the columns I need from each data frame in the list. Below is the final code:
library(httr)
library(jsonlite)
library(tidyr)
library(stringr)
library(purrr)
library(plyr)
test_companies <- read.csv("c:\\users\\... Companies Without Websites List.csv")
test_companies$plus <- gsub(" ", "+", test_companies$Company.Name)
query <- "https://www.googleapis.com/customsearch/v1?key=AIzaSyCmD6FRaonSmZWrjwX6JJgYMfDSwlR1z0Y&cx=006431301429107149113:as7yqcm2qc8&q="
test_companies$plus <- paste0(query, test_companies$plus)
a <- test_companies$plus
length(a)
function_webs_search <- function(web_search) {content(GET(web_search))}
result <- lapply(as.character(a), function_webs_search)
function_toJSONall <- function(all) {toJSON(all)}
a <- lapply(result, function_toJSONall)
function_fromJSONall <- function(all) {fromJSON(all)}
b <- lapply(a, function_fromJSONall)
function_dataframe <- function(all) {as.data.frame(all)}
c <- lapply(b, function_dataframe)
function_column <- function(all) {all[ ,15:30]}
result_final <- lapply(c, function_column)
# bind the column-reduced data frames (result_final), not the full ones in c
results_df <- rbind.fill(result_final)
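An alternative that avoids selecting columns by position is to pull out the `items` sublist of each response and build one row per item. The sketch below is an assumption-heavy illustration: `result` is a small hand-made stand-in for the real API responses, the `title` and `link` fields are assumed to be present in every item, and `items_to_df` and `search_terms` are names introduced here.

```r
# Mock of the structure returned by content(GET(...)): one element per search term,
# each with an "items" sublist (hypothetical values, not real API output)
result <- list(
  list(items = list(list(title = "A1", link = "https://example.com/a1"),
                    list(title = "A2", link = "https://example.com/a2"))),
  list(items = list(list(title = "B1", link = "https://example.com/b1")))
)
search_terms <- c("term A", "term B")

# turn the items of one response into a data frame, tagged with its search term
items_to_df <- function(res, term) {
  rows <- lapply(res$items, function(item) {
    data.frame(search_term = term, title = item$title, link = item$link,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# stack the per-term data frames underneath each other
results_df <- do.call(rbind, Map(items_to_df, result, search_terms))
results_df
```

With the real responses, each search term would contribute up to 10 rows, and more fields can be added inside data.frame() as needed.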

How to find the position of a word in one column with another column in same dataframe in R?

I have two columns in a dataframe train:
Subject |Keyword
the box is beautiful |box
delivery reached in time |delivery
they serve well serve |serve
How to find the position of keyword in the subject ?
Currently, I am using the for loop:
for(k in 1:nrow(train)){
l <- unlist(gregexpr(train$keyword[k],train$subject[k],ignore.case = T))
train$position[k] <- l}
Is there any other way ?
No need for a loop, just use the locate functions in the stringr or stringi package.
train <- data.frame(subject = c("the box is beauty", "delivery reached on time", "they serve well"),
keyword = c("box", "delivery", "serve"),
stringsAsFactors = FALSE)
library(stringr)
train$position_stringr <- str_locate(train$subject, train$keyword)[,1]
#locate returns a matrix and we are just interested in the start of keyword.
library(stringi)
train$position_stringi <- stri_locate_first(train$subject, regex = train$keyword)[,1]
#locate returns a matrix and we are just interested in the start of keyword.
train
                   subject  keyword position_stringr position_stringi
1        the box is beauty      box                5                5
2 delivery reached on time delivery                1                1
3          they serve well    serve                6                6
You could use the below.
#data.frame created using the below statements
Subject <- c("the box is beauty","delivery reached on time","they serve well")
Keyword <- c("box","delivery","serve")
train <- data.frame(Subject,Keyword)
#Solution
library(stringr)
for(k in 1:nrow(train))
{
t1 <- as.character(train$Subject[k])
t2 <- as.character(train$Keyword[k])
locate_vector <- str_locate(t1, regex(t2, ignore_case = TRUE))[[1]]
train$start_position[k] <- locate_vector
#If end position is also required, the second column from str_locate
#function could be used.
}
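If you would rather stay in base R, a row-wise regexpr() call via mapply() gives the same start positions without an explicit loop. This is a minimal sketch using the data from the question; it treats each keyword as a regular expression and returns only the first match per row.

```r
train <- data.frame(
  subject = c("the box is beautiful", "delivery reached in time", "they serve well serve"),
  keyword = c("box", "delivery", "serve"),
  stringsAsFactors = FALSE
)

# first match position of each keyword in its own subject (-1 if absent)
train$position <- mapply(
  function(keyword, subject) as.integer(regexpr(keyword, subject, ignore.case = TRUE)),
  train$keyword, train$subject,
  USE.NAMES = FALSE
)
train
```

Unlike gregexpr(), regexpr() reports a single position per row, so a keyword that occurs twice (as in the third row) does not break the assignment.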

R: Read text files with blanks and unequal number of columns

I am trying to read many text files into R using read.table. Most of the time we have clean text files which have defined columns.
The data that I am trying to read comes from ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/102317_livecattle.txt
You can see that the blanks and length of text files varies by report.
ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/102317_livecattle.txt
ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/100917_livecattle.txt
My objective is to read many of these text files and combine them into a dataset.
If I can read one of them then compiling should not be an issue. However, I am running into several issues because of the format of the text file:
1) The number of FIRMS varies from report to report. For example, sometimes there will be 3 rows of data to import (i.e. 3 firms that did business on that date) and sometimes there may be 10.
2) Blanks are being recognized. For example, under the FIRM section there should be a column for Deliveries (DEL) and Receipts (REC). The data when it is read in THIS section should look like:
df <- data.frame("FIRM_#" = c(407, 685, 800, 905),
"FIRM_NAME" = c("STRAITS FIN LLC", "R.J.O'BRIEN ASSOC", "ROSENTHAL COLLINS LL", "ADM INVESTOR SERVICE"),
"DEL" = c(1,1,15,1), "REC"= c(NA,18,NA,NA))
However, when I read this in, the formatting is all messed up and no NA is put in for the blank values.
3) The above issues apply for "YARDS" and "FUTURE DELIVERIES SCHEDULED" section of the text file.
I have tried to read in sections of the text file and then format them accordingly, but since the number of firms changes from day to day the code does not generalize.
Any help would greatly be appreciated.
Here is an answer which starts from scratch, using rvest to download the data, and includes a lot of formatting. The general idea is to identify fixed widths that can be used to separate the columns. I used a little help from SO for this purpose (link).
You could then use read.fwf() in combination with cat() and tempfile(). In my first attempt this did not work due to some formatting issues, so I added some additional lines to get the final table format.
Maybe there are more elegant options and shortcuts that I have overlooked, but my answer should at least get you started. Of course, you will have to adapt the selection of lines and the identification of widths for splitting the tables, depending on which parts of the data you need. Once this is settled, you can loop through all the reports to gather the data. I hope this helps...
library(rvest)
library(dplyr)
page <- read_html("ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/102317_livecattle.txt")
table <- page %>%
html_text("pre") %>%
#reformat by splitting on line breaks
{ unlist(strsplit(., "\n")) } %>%
#select range based on strings in specific lines
"["(.,(grep("FIRM #", .):(grep(" DELIVERIES SCHEDULED", .)-1))) %>%
#exclude empty rows
"["(., !grepl("^\\s+$", .)) %>%
#fix width of table to the right
{ substring(., 1, nchar(gsub("\\s+$", "" , .[1]))) } %>%
#strip white space on the left
{ gsub("^\\s+", "", .) }
headline <- unlist(strsplit(table[1], "\\s{2,}"))
get_split_position <- function(substring, string) {
nchar(string)-nchar(gsub(paste0("(^.*)(?=", substring, ")"), "", string , perl=T))
}
#exclude first element, no split before this element
split_positions <- sapply(headline[-1], function(x) {
get_split_position(x, table[1])
})
#exclude headline from split
table <- lapply(table[-1], function(x) {
substring(x, c(1, split_positions + 1), c(split_positions, nchar(x)))
})
table <- do.call(rbind, table)
colnames(table) <- headline
#strip whitespace
table <- gsub("\\s+", "", table)
table <- as.data.frame(table, stringsAsFactors = FALSE)
#assign NA values
table[ table == "" ] <- NA
#change column type
table[ , c("FIRM #", "DEL", "REC")] <- apply(table[ , c("FIRM #", "DEL", "REC")], 2, as.numeric)
table
#   FIRM #          FIRM NAME DEL REC
# 1    407      STRAITSFINLLC   1  NA
# 2    685   R.J.O'BRIENASSOC   1  18
# 3    800 ROSENTHALCOLLINSLL  15  NA
# 4    905 ADMINVESTORSERVICE   1  NA
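Once the column widths are known, the read.fwf() and tempfile() route mentioned above can be sketched as follows. The lines are built with sprintf() from the values in the question purely so that the example is self-contained; the widths c(4, 21, 4, 4) and the column names are assumptions for this toy layout, not the widths of the real report.

```r
# Toy fixed-width lines with the same fields as the FIRM section (assumed layout)
report_lines <- sprintf(
  "%-4s%-21s%4s%4s",
  c("407", "685", "800", "905"),
  c("STRAITS FIN LLC", "R.J.O'BRIEN ASSOC", "ROSENTHAL COLLINS LL", "ADM INVESTOR SERVICE"),
  c("1", "1", "15", "1"),
  c("", "18", "", "")            # blank REC entries
)

# read.fwf() reads from a file, so write the lines to a temporary file first
tmp <- tempfile(fileext = ".txt")
writeLines(report_lines, tmp)

firms <- read.fwf(
  tmp,
  widths = c(4, 21, 4, 4),
  col.names = c("FIRM_NO", "FIRM_NAME", "DEL", "REC"),
  strip.white = TRUE,            # trim the padding inside each field
  stringsAsFactors = FALSE
)
unlink(tmp)
firms
```

Because the fields are cut by position, blank REC entries come through as NA and the spaces inside the firm names are preserved.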
