Efficient way to remove all proper names from corpus - r

Working in R, I'm trying to find an efficient way to search through a file of texts and remove or replace all instances of proper names (e.g., Thomas). I assume something is available to do this but have been unable to locate it.
So, in this example the words "Susan" and "Bob" would be removed. This is a simplified example; in reality I would want this to apply to hundreds of documents, and therefore a fairly large list of names.
texts <- as.data.frame(rbind(
  'This text stuff is quite interesting',
  'Where are all the names said Susan',
  'Bob wondered what happened to all the proper nouns'
))
names(texts)[1] <- "text"

Here's one approach based upon a data set of first names:
install.packages("gender")
library(gender)
install_genderdata_package()
sets <- data(package = "genderdata")$results[,"Item"]
data(list = sets, package = "genderdata")
stopwords <- unique(kantrowitz$name)
texts <- as.data.frame(rbind(
  'This text stuff is quite interesting',
  'Where are all the names said Susan',
  'Bob wondered what happened to all the proper nouns'
))
removeWords <- function(txt, words, n = 30000L) {
  # cumulative length of the alternation "w1|w2|...": word lengths plus one "|" per word after the first
  l <- cumsum(nchar(words) + c(0, rep(1, length(words) - 1)))
  # chunk the word list so that no single regex exceeds ~n characters
  groups <- cut(l, breaks = seq(1, ceiling(tail(l, 1)/n)*n + 1, by = n))
  # one regex per chunk; the decreasing sort puts longer words (e.g. "names") before
  # their prefixes (e.g. "name") in the alternation; (*UCP) makes \b Unicode-aware
  regexes <- sapply(split(words, groups), function(words) sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), collapse = "|")))
  for (regex in regexes) txt <- gsub(regex, "", txt, perl = TRUE, ignore.case = TRUE)
  return(txt)
}
removeWords(texts[,1], stopwords)
# [1] "This text stuff if quite interesting"
# [2] "Where are all the names said "
# [3] " wondered what happened to all the proper nouns"
It may need some tuning for your specific data set.
Another approach could be based upon part-of-speech tagging.
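For completeness, here is a minimal sketch of that idea using the udpipe package (my own suggestion, not part of the original answer): it tags every token with its universal part of speech and drops anything labelled PROPN.
library(udpipe)
# one-time download of a small English model, then load it
ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(file = ud_model$file_model)
# annotate the texts and keep everything that is not a proper noun
anno <- as.data.frame(udpipe_annotate(ud_model, x = texts$text))
kept <- anno[anno$upos != "PROPN", ]
# crude reassembly; punctuation spacing will need polishing
sapply(split(kept$token, kept$doc_id), paste, collapse = " ")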

Related

How can I replace emojis with text and treat them as single words?

I have to do topic modeling in R on pieces of text containing emojis. Using the replace_emoji() and replace_emoticon() functions lets me analyze them, but there is a problem with the results.
A red heart emoji is translated as "red heart ufef". These words are then treated separately during the analysis and compromise the results.
Terms like "heart" can have very different meanings, as can be seen with "red heart ufef" versus "broken heart".
The function replace_emoji_identifier() doesn't help either, as the identifiers make an analysis hard.
Dummy data set, reproducible by using dput() (including the step to force lowercase):
Emoji_struct <- c(
  list(content = "🔥🔥 wow", "😮 look at that", "😤this makes me angry😤", "😍❤\ufe0f, i love it!"),
  list(content = "😍😍", "😊 thanks for helping", "😢 oh no, why? 😢", "careful, challenging ❌❌❌")
)
Current coding (data_orig is a list of several files):
library(textclean)
library(tm) # stripWhitespace() below comes from tm; the rest are standard R pre-processing tools
#pre-processing:
data <- gsub("'", "", data)
data <- replace_contraction(data)
data <- replace_emoji(data) # replace emoji with words
data <- replace_emoticon(data) # replace emoticon with words
data <- replace_hash(data, replacement = "")
data <- replace_word_elongation(data)
data <- gsub("[[:punct:]]", " ", data) #replace punctuation with space
data <- gsub("[[:cntrl:]]", " ", data)
data <- gsub("[[:digit:]]", "", data) #remove digits
data <- gsub("^[[:space:]]+", "", data) #remove whitespace at beginning of documents
data <- gsub("[[:space:]]+$", "", data) #remove whitespace at end of documents
data <- stripWhitespace(data)
Desired output:
[1] list(content = c("fire fire wow",
"facewithopenmouth look at that",
"facewithsteamfromnose this makes me angry facewithsteamfromnose",
"smilingfacewithhearteyes redheart \ufe0f, i love it!"),
content = c("smilingfacewithhearteyes smilingfacewithhearteyes",
"smilingfacewithsmilingeyes thanks for helping",
"cryingface oh no, why? cryingface",
"careful, challenging crossmark crossmark crossmark"))
Any ideas? Lower cases would work, too.
Best regards. Stay safe. Stay healthy.
Answer
Replace the default conversion table in replace_emoji with a version where the spaces/punctuation is removed:
hash2 <- lexicon::hash_emojis
hash2$y <- gsub("[[:space:]]|[[:punct:]]", "", hash2$y)
lapply(Emoji_struct, replace_emoji, emoji_dt = hash2)
Example
Single character string:
replace_emoji("wow!๐Ÿ˜ฎ that is cool!", emoji_dt = hash2)
#[1] "wow! facewithopenmouth that is cool!"
Character vector:
replace_emoji(c("1: ๐Ÿ˜Š", "2: ๐Ÿ˜"), emoji_dt = hash2)
#[1] "1: smilingfacewithsmilingeyes "
#[2] "2: smilingfacewithhearteyes "
List:
list("list_element_1: ๐Ÿ”ฅ", "list_element_2: โŒ") %>%
lapply(replace_emoji, emoji_dt = hash2)
#[[1]]
#[1] "list_element_1: fire "
#
#[[2]]
#[1] "list_element_2: crossmark "
Rationale
To convert emojis to text, replace_emoji uses lexicon::hash_emojis as a conversion table (a hash table):
head(lexicon::hash_emojis)
# x y
#1: <e2><86><95> up-down arrow
#2: <e2><86><99> down-left arrow
#3: <e2><86><a9> right arrow curving left
#4: <e2><86><aa> left arrow curving right
#5: <e2><8c><9a> watch
#6: <e2><8c><9b> hourglass done
This is an object of class data.table. We can simply modify the y column of this hash table so that we remove all the spaces and punctuation. Note that this also allows you to add new ASCII byte representations and an accompanying string.
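For instance, appending a row for an emoji the table does not cover (a hypothetical addition; the x value is the escaped UTF-8 byte sequence, in the same format as the head() output above):
library(data.table)
# U+1F600 "grinning face" encodes to the UTF-8 bytes f0 9f 98 80
hash2 <- rbind(hash2, data.table(x = "<f0><9f><98><80>", y = "grinningface"))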

How to access a single item in an R data frame?

So I'm diving into yet another language (R), and need to be able to look at individual items in a dataframe(?). I've tried a number of ways to access this, but so far am confused by what R wants me to do to get this out. Current code:
empStatistics <- read.csv("C:/temp/empstats.csv", header = TRUE, row.names = NULL, encoding = "UTF-8", sep = ",", dec = ".", quote = "\"", comment.char = "")
attach(empStatistics)
library(svDialogs)
Search_Item <- dlgInput("Enter a Category", "")$res
if (!length(Search_Item)) {
  cat("You didn't pick anything!?")
} else {
  Category <- empStatistics[Search_Item]
}
Employee_Name <- dlgInput("Enter a Player", "")$res
if (!length(Employee_Name)) {
  cat("No Person Selected!\n")
} else {
  cat(empStatistics[Employee_Name, Search_Item])
}
and the sample of my csv file:
Name,Age,Salary,Department
Frank,25,40000,IT
Joe,24,40000,Sales
Mary,34,56000,HR
June,39,70000,CEO
Charles,60,120000,Janitor
From the languages I'm used to, I would have expected the brackets to work, but that obviously isn't the case here. I tried a number of other approaches, including separating each variable into its own brackets, trying to figure out how to use subset() (failed there; not sure it is applicable), trying to find out how to get the column and row indexes, and a few other things I'm not sure I can describe.
How can I enter values into variables, and then use them to get the individual pieces of data (e.g., enter "Frank" for the name and "Age" for the search item and get back 25, or "June" and "Department" to get back "CEO")?
If you would like to access it like that, you can do:
Search_Item <- "Salary"
Employee_Name <- "Frank"
empStatistics <- read.csv("empstats.csv",header = TRUE, row.names = 1)
empStatistics[Employee_Name,Search_Item]
[1] 40000
R data.frames don't carry a separate index; in the code above, row.names = 1 turned the Name column into row names, which is what makes the ["Frank", "Salary"] lookup work. If you keep Name as a regular column instead, the other thing you can try is:
empStatistics <- read.csv("empstats.csv",header = TRUE)
empStatistics[match(Employee_Name,empStatistics$Name),Search_Item]
[1] 40000
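To wire this back into the dialog-driven flow from the question, here is a sketch without attach(), which only obscures this kind of lookup:
library(svDialogs)
empStatistics <- read.csv("C:/temp/empstats.csv", header = TRUE, row.names = 1)
Search_Item   <- dlgInput("Enter a Category", "")$res
Employee_Name <- dlgInput("Enter a Player", "")$res
if (length(Search_Item) && length(Employee_Name)) {
  cat(empStatistics[Employee_Name, Search_Item], "\n")
}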

R - How to replace a string from multiple matches (in a data frame)

I need to replace substrings within a string using matches that are stored in a data frame.
For example -
input_string = "Whats your name and Where're you from"
I need to replace part of this string from a data frame. Say the data frame is
matching <- data.frame(from_word=c("Whats your name", "name", "fro"),
to_word=c("what is your name","names","froth"))
Output expected is "what is your name and Where're you from".
Note -
It should match the longest string: in this example, name is not matched and replaced with names, because name was part of a bigger match.
It has to match whole words and not partial strings: the fro in "from" should not be matched and replaced with "froth".
I referred to the below link but somehow could not get this work as intended/described above
Match and replace multiple strings in a vector of text without looping in R
This is my first post here. If I haven't given enough details, kindly let me know
Edit
Based on the input from Sri's comment I would suggest using:
library(gsubfn)
# words to be replaced
a <-c("Whats your","Whats your name", "name", "fro")
# their replacements
b <- c("What is yours","what is your name","names","froth")
# named list as an input for gsubfn
replacements <- setNames(as.list(b), a)
# the test string
input_string = "fro Whats your name and Where're name you from to and fro I Whats your"
# match entire words
gsubfn(paste(paste0("\\w*", names(replacements), "\\w*"), collapse = "|"), replacements, input_string)
Original
I would not say this is easier to read than your simple loop, but it might take better care of the overlapping replacements:
# define the sample dataset
input_string = "Whats your name and Where're you from"
matching <- data.frame(from_word=c("Whats your name", "name", "fro", "Where're", "Whats"),
to_word=c("what is your name","names","froth", "where are", "Whatsup"))
# load used libraries
library(gsubfn)
library(stringr) # for str_split() below
# make sure data is of class character
matching$from_word <- as.character(matching$from_word)
matching$to_word <- as.character(matching$to_word)
# extract the words in the sentence
test <- unlist(str_split(input_string, " "))
# find where individual words from sentence match with the list of replaceble words
test2 <- sapply(paste0("\\b", test, "\\b"), grepl, matching$from_word)
# change rownames to see what is the format of output from the above sapply
rownames(test2) <- matching$from_word
# reorder the data so that largest replacement blocks are at the top
test3 <- test2[order(rowSums(test2), decreasing = TRUE),]
# where the word is already being replaced by larger chunk, do not replace again
test3[apply(test3, 2, cumsum) > 1] <- FALSE
# define the actual pairs of replacement
replacements <- setNames(as.list(as.character(matching[,2])[order(rowSums(test2), decreasing = TRUE)][rowSums(test3) >= 1]),
as.character(matching[,1])[order(rowSums(test2), decreasing = TRUE)][rowSums(test3) >= 1])
# perform the replacement
gsubfn(paste(as.character(matching[,1])[order(rowSums(test2), decreasing = TRUE)][rowSums(test3) >= 1], collapse = "|"),
replacements,input_string)
gsubfn can also take a named list of replacements directly:
toreplace <- list("x1" = "y1", "x2" = "y2", ..., "xn" = "yn")
where each xi is a pattern (find what) and each yi is its replacement (replace with).
input_string <- "Whats your name and Where're you from"
toreplace <- list("Whats your name" = "what is your name", "name" = "names", "fro" = "froth")
gsubfn(paste(names(toreplace), collapse = "|"), toreplace, input_string)
Was trying out different things and the below code seems to work.
a <- c("Whats your name", "name", "fro")
b <- c("what is your name", "names", "froth")
c <- c("Whats your name and Where're you from")
# \< and \> are word-boundary anchors; spaces in each replacement are temporarily
# turned into "_" so later patterns cannot match inside text that has already been
# replaced, then the underscores are restored
for (i in seq_along(a)) c <- gsub(paste0('\\<', a[i], '\\>'), gsub(" ", "_", b[i]), c)
c <- gsub("_", " ", c)
c
Took help from the link below: Making gsub only replace entire words?
However, I would like to avoid the loop if possible. Can someone please improve this answer without the loop?
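One loop-free variant, sketched on top of the gsubfn answers above (my own suggestion, not thoroughly tested): sort the patterns longest-first so the alternation prefers the maximal match, and wrap each one in word boundaries so "fro" cannot match inside "from".
library(gsubfn)
a <- c("Whats your name", "name", "fro")
b <- c("what is your name", "names", "froth")
ord <- order(-nchar(a))  # longest patterns first
pat <- paste(paste0("\\b", a[ord], "\\b"), collapse = "|")
gsubfn(pat, setNames(as.list(b[ord]), a[ord]),
       "Whats your name and Where're you from")
# expected: "what is your name and Where're you from"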

R Optimizing double loop that uses stri_extract

I have been working on some text scraping/analysis. One thing I did was pull out the top words from documents to compare and learn about different metrics. This was fast and easy. Defining which separators to use became an issue, though, and pulling out individual words rather than phrases removed information from the analysis. For example, .Net Developer becomes net and developer after the transformation. I already had a list of set phrases/words from an old project someone else gave up on. The next step was pulling out specific keywords from multiple rows for multiple documents.
I have been looking into several techniques, including vectorization, parallel processing, and using C++ code within R. Moving forward I will experiment with all of these techniques and try to speed up my process, as well as gain these tools for future projects. In the meantime (without experimentation) I'm wondering which adjustments are obvious and will significantly decrease the time taken, e.g. moving parts of the code outside the loop, using better packages, etc.
I also have a progress bar, but I can remove it if it's slowing down my loop significantly.
Here is my code:
words <- read.csv("keyphrases.csv")
df <- data.frame(x=(list.files("sec/new/")))
total = length(df$x)
pb <- txtProgressBar(title = "Progress Bar", min = 0, max =total , width = 300, style=3)
for (i in df$x){
s <- read.csv(paste0("sec/new/",i))
u <- do.call(rbind, pblapply(words$words, function(x){
t <- data.frame(ref= s[,2], words = stri_extract(s[,3], coll=x))
t<-na.omit(t)
}))
write.csv(u,paste0("sec/new_results/new/",i), row.names = F)
setTxtProgressBar(pb, i, title=paste( round(which(df$x== i)/total*100, 2),"% done"))
}
So words has 60,000 rows of words/short phrases, no more than 30 characters each. df$x lists around 4000 files, where each file has between 100 and 5000 rows, with each row having between 1 and 5000 characters. Any random characters/strings can be used if my question needs to be reproducible.
I only used lapply because combining it with rbind and do.call worked really well; having a loop within a loop may be slowing down the process significantly too.
So off the bat, are there some things I can do right away? Swapping data.frame to data.table, or using vectors instead. Can the reading and writing be done outside the loop somehow? Perhaps write it such that one of the loops isn't nested?
Thanks in advance
EDIT
The key element that needs speeding up is the extract. Whether I use lapply above or cut it down to:
for (x in words$words) { t <- data.table(words = stri_extract(s[,3], coll = x)) }
this still takes the most time by a long way. The tables involved (s and t here) are data.tables in this case.
EDIT2
Attempting to create reproducible data:
set.seed(42)
words <- data.frame(words = rnorm(1:60000))
words$words <- as.character(words$words) # coerce to character so stri_extract() gets strings
set.seed(42)
file1 <- data.frame(x = rnorm(1:5000))
file1$x <- as.character(file1$x)
pblapply(words$words, function(x){
  t <- data.frame(words = stri_extract(file1$x, coll = x))
})
First things first. Yes, I would definitely switch from data.frame to data.table. Not only is it faster and easier to use, when you start merging data sets data.table will do reasonable things when data.frame will give you unexpected and unintended results.
Secondly, is there an advantage to using R to take care of your separators? You mentioned a number of different techniques you are considering using. If separators are just noise for the purposes of your analysis, why not split the work into two tools and use a tool that is much better at handling separators and continuation lines and so on? For me, Python is a natural choice to do things like parsing a bunch of text into keywords--including stripping off separators and other "noise" words you do not care about in your analysis. Feed the results of the Python parsing into R, and use R for its strengths.
There are a few different ways to get the output of Python into R. I would suggest starting off with something simple: CSV files. They are what you are starting with, they are easy to read and write in Python and easy to read in R. Later you can deal with a direct pipe between Python and R, but it does not give you much advantage until you have a working prototype and it is a lot more work at first. Make Python read in your raw data and turn out a CSV file that R can drop straight into a data.table without further processing.
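As a sketch of that handoff (file name hypothetical), the CSV that Python writes can be read straight into a data.table:
library(data.table)
# fread() is fast and returns a data.table directly
keywords <- fread("parsed_keywords.csv", encoding = "UTF-8")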
As for stri_extract, it is really not the tool you need this time. You certainly can match on a bunch of different words, but it is not really what it is optimized for. I agree with @Chris that using merge() on data.tables is a much more efficient, and faster, way to search for a number of key words.
Single Word Version
When you have single words in each lookup, this is easily accomplished with merging:
library(data.table)
#Word List
set.seed(42)
WordList <- data.table(ID = 1:60000, words = sapply(1:60000, function(x) paste(sample(letters, 5), collapse = '')))
#A list of dictionaries
set.seed(42)
Dicts <- list(
  Dict1 = sapply(1:15000, function(x) paste(sample(letters, 5), collapse = '')),
  Dict2 = sapply(1:15000, function(x) paste(sample(letters, 5), collapse = '')),
  Dict3 = sapply(1:15000, function(x) paste(sample(letters, 5), collapse = ''))
)
#Create Dictionary Data.table and add Identifier
Dicts <- rbindlist(lapply(Dicts, function(x){data.table(ref = x)}), use.names = T, idcol = T)
# set key for joining
setkey(WordList, "words")
setkey(Dicts, "ref")
Now we have a data.table with all dictionary words, and a data.table with all words in our word list. Now we can just merge:
merge(WordList, Dicts, by.x = "words", by.y = "ref", all.x = T, allow.cartesian = T)
words ID .id
1: abcli 30174 Dict3
2: abcrg 26210 Dict2
3: abcsj 8487 Dict1
4: abczg 24311 Dict2
5: abdgl 1326 Dict1
---
60260: zyxeb 52194 NA
60261: zyxfg 57359 NA
60262: zyxjw 19337 Dict2
60263: zyxoq 5771 Dict1
60264: zyxqa 24544 Dict2
So we can see abcli appears in Dict3, while zyxeb does not appear in any of the dictionaries. There look to be 264 duplicates (words that appear in >1 dictionary), as the resultant data.table is larger than our word list (60264 > 60000). This is shown as follows:
merge(WordList, Dicts, by.x = "words", by.y = "ref", all.x = T, allow.cartesian = T)[words == "ahlpk"]
words ID .id
1: ahlpk 7344 Dict1
2: ahlpk 7344 Dict2
3: ahlpk 28487 Dict1
4: ahlpk 28487 Dict2
We also see here that duplicated words in our word list are going to create multiple resultant rows.
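If you would rather have one row per word, with every dictionary it appears in collapsed into a single field, a hedged one-liner on top of the same merge (column names as in the output above):
result <- merge(WordList, Dicts, by.x = "words", by.y = "ref", all.x = T, allow.cartesian = T)
# words with no dictionary hit end up with an empty string
result[, .(dicts = paste(na.omit(.id), collapse = ",")), by = .(words, ID)]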
The merge itself is very quick to run.
Phrases + Sentences
In the case where you are searching for phrases within sentences, you will need to perform a string match instead. However, you will still need to make n(Phrases) * n(Sentences) comparisons, which will quickly hit memory limits in most R data structures. Fortunately, this is an embarrassingly parallel operation:
Same setup:
library(data.table)
library(foreach)
library(doParallel)
# Sentence List
set.seed(42)
Sentences <- data.table(ID = 1:60000, Sentence = sapply(1:60000, function(x) paste(sample(letters, 10), collapse = '')))
# A list of phrases
set.seed(42)
Phrases <- list(
  Phrases1 = sapply(1:15000, function(x) paste(sample(letters, 5), collapse = '')),
  Phrases2 = sapply(1:15000, function(x) paste(sample(letters, 5), collapse = '')),
  Phrases3 = sapply(1:15000, function(x) paste(sample(letters, 5), collapse = ''))
)
# Create Dictionary Data.table and add Identifier
Phrases <- rbindlist(lapply(Phrases, function(x){data.table(Phrase = x)}), use.names = T, idcol = T)
# Full Outer Join
Sentences[, JA := 1]
Phrases[, JA := 1]
# set key for joining
setkey(Sentences, "JA")
setkey(Phrases, "JA")
We now want to break up our Phrases table into manageable batches
cl<-makeCluster(4)
registerDoParallel(cl)
nPhrases <- as.numeric(nrow(Phrases))
nSentences <- as.numeric(nrow(Sentences))
batch_size <- ceiling(nPhrases*nSentences / 2^30) #Max data.table allocation is 2^31. Lower this if you are hitting memory allocation limits
seq_s <- seq(1,nrow(Phrases), by = floor(nrow(Phrases)/batch_size))
ln_s <- length(seq_s)
if (ln_s > 1) {
  str_seq <- paste0(seq_s, ":", c(seq_s[2:ln_s], nrow(Phrases) + 1) - 1)
} else {
  str_seq <- paste0(seq_s, ":", nrow(Phrases))
}
We are now ready to send our job out. The grepl line below is doing the work - testing which phrases match each sentence. We then filter out any non-matches.
ls <- foreach(i = 1:ln_s) %dopar% {
  library(data.table)
  # cartesian join of all sentences with this batch of phrases
  TEMP_DT <- merge(Sentences, Phrases[eval(parse(text = str_seq[i]))], by = "JA", allow.cartesian = T)
  # keep only the rows where the phrase actually occurs in the sentence
  TEMP_DT <- TEMP_DT[, match_test := grepl(Phrase, Sentence), by = .(Phrase, Sentence)][match_test == 1]
  return(TEMP_DT)
}
stopCluster(cl)
DT_OUT <- unique(do.call(rbind,ls))
DT_OUT now summarizes the sentences that match, along with the Phrase + Phrase list that it is found in.
This will still take some time (there is a lot of processing that is necessary), but nowhere near a year.

Avoid that space in column name is replaced with period (".") when using read.csv()

I am using R to do some data pre-processing, and here is the problem I am faced with: I input the data using read.csv(filename, header = TRUE), and the spaces in variable names became "."; for example, a variable named Full Code became Full.Code in the generated data frame. After the processing, I use write.xlsx(filename) to export the results, but the variable names have been changed. How can I address this problem?
Besides, in the output .xlsx file, the first column becomes indices (i.e., 1 to N), which is not what I am expecting.
If you set check.names = FALSE in read.csv when you read the data in, then the names will not be changed and you will not need to edit them before writing the data back out. This of course means that you would need to quote the column names (backticks in some cases) or refer to the columns by location rather than name while editing.
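A minimal sketch (the file name is illustrative; Full Code is the example column from the question):
empData <- read.csv("yourfile.csv", header = TRUE, check.names = FALSE)
names(empData)           # "Full Code" keeps its space
empData$`Full Code`      # backticks are needed to refer to it by name
empData[["Full Code"]]   # or double brackets with the quoted name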
To get spaces back in the names, do this (right before you export - R does let you have spaces in variable names, but it's a pain):
# A simple regular expression to replace dots with spaces
# This might have unintended consequences, so be sure to check the results
names(yourdata) <- gsub(x = names(yourdata),
pattern = "\\.",
replacement = " ")
To drop the first-column index, just add row.names = FALSE to your write.xlsx(). That's a common argument for functions that write out data in tabular format (write.csv() has it, too).
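For example (assuming the xlsx package's write.xlsx(); openxlsx's version spells the argument rowNames instead):
library(xlsx)
write.xlsx(yourdata, "output.xlsx", row.names = FALSE)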
Here's a function (sorry, I know it could be refactored) that makes nice column names even if there are multiple consecutive dots and trailing dots:
makeColNamesUserFriendly <- function(ds) {
  # FIXME: Repetitive.
  # Convert any number of consecutive dots to a single space.
  names(ds) <- gsub(x = names(ds), pattern = "(\\.)+", replacement = " ")
  # Drop the trailing spaces.
  names(ds) <- gsub(x = names(ds), pattern = "( )+$", replacement = "")
  ds
}
Example usage:
ds <- makeColNamesUserFriendly(ds)
Just to add to the answers already provided, here is another way of replacing the "." (or any other punctuation) in column names, using a regex with the stringr package:
require("stringr")
colnames(data) <- str_replace_all(colnames(data), "[:punct:]", " ")
For example try:
data <- data.frame(variable.x = 1:10, variable.y = 21:30, variable.z = "const")
colnames(data) <- str_replace_all(colnames(data), "[:punct:]", " ")
and
colnames(data)
will give you
[1] "variable x" "variable y" "variable z"
