NLP - identifying and replacing words (synonyms) in R

I have a problem with my R code.
I have a data set (questions) with 4 columns and over 600k observations; one of the columns is named 'V3'.
This column holds questions such as 'what is the day?'.
I have a second data set (voc) with 2 columns, one named 'word' and the other named 'synonyms'. If a word from the 'synonyms' column of voc appears in a question in the first data set, I want to replace it with the corresponding entry from the 'word' column.
questions = cbind(V3=c("What is the day today?","Tom has brown eyes"))
questions <- data.frame(questions)
V3
1 what is the day today?
2 Tom has brown eyes
voc = cbind(word=c("weather", "a","blue"),synonyms=c("day", "the", "brown"))
voc <- data.frame(voc)
word synonyms
1 weather day
2 a the
3 blue brown
Desired output
V3 V5
1 what is the day today? what is a weather today?
2 Tom has brown eyes Tom has blue eyes
I wrote some simple code, but it doesn't work.
for (k in 1:nrow(question))
{
for (i in 1:nrow(voc))
{
question$V5<- gsub(do.call(rbind,strsplit(question$V3[k]," "))[which (do.call(rbind,strsplit(question$V3[k]," "))== voc[i,2])], voc[i,1], question$V3)
}
}
Maybe someone will try to help me? :)
I wrote a second version, but it doesn't work either.
for( i in 1:nrow(questions))
{
for( j in 1:nrow(voc))
{
if (grepl(voc[j,k],do.call(rbind,strsplit(questions[i,]," "))) == TRUE)
{
new=matrix(gsub(do.call(rbind,strsplit(questions[i,]," "))[which(do.call(rbind,strsplit(questions[i,]," "))== voc[j,2])], voc[j,1], questions[i,]))
questions[i,]=new
}
}
questions = cbind(questions,c(new))
}

First, it is important that you use the stringsAsFactors = FALSE option, either at the program level or during your data import. Before R 4.0.0, data.frame() and read.csv() converted strings to factors unless you specified otherwise (from 4.0.0 on, the default is FALSE). Factors are useful in modeling, but you want to analyze the text itself, so you should make sure that your text is not coerced to factors.
The way I approached this was to write a function that "explodes" each string into a vector of words and then uses match to replace the words. The vector is then reassembled into a string.
I'm not sure how performant this will be given your 600K records. You might look into some of the R packages that handle strings, like stringr or stringi, since they will probably have functions that do some of this. match tends to be okay on speed; %in% is built on top of match, so the cost of either mostly depends on the number of words being looked up and the size of the synonym table.
# Start with options to make sure strings are represented correctly
# The rest is your original code (mildly tidied to my own standard)
options(stringsAsFactors = FALSE)
questions <- cbind(V3 = c("What is the day today?","Tom has brown eyes"))
questions <- data.frame(questions)
voc <- cbind(word = c("weather","a","blue"),
synonyms = c("day","the","brown"))
voc <- data.frame(voc)
# This function takes:
# - an input string
# - a vector of words to replace
# - a vector of the words to use as replacements
# It returns a list of the original input and the changed version
uFunc_FindAndReplace <- function(input_string,words_to_repl,repl_words) {
# Start by breaking the input string into a vector
# Note that we use [[1]] to get first list element of strsplit output
# Obviously this relies on breaking sentences by spacing
orig_words <- strsplit(x = input_string,split = " ")[[1]]
# If we find at least one of the words to replace in the original words, proceed
if(sum(orig_words %in% words_to_repl) > 0) {
# The left side selects the elements of orig_words that match words to be replaced
# The right side uses match to find the numeric index of each of those words within the words_to_repl vector
# (nomatch = 0 drops the words that have no replacement)
# This numeric vector is used to select the values from repl_words
# These then replace the values in orig_words
orig_words[orig_words %in% words_to_repl] <- repl_words[match(x = orig_words,table = words_to_repl,nomatch = 0)]
# We rebuild the sentence again, and return a list with original and new version
new_sent <- paste(orig_words,collapse = " ")
return(list(original = input_string,new = new_sent))
} else {
# Otherwise we return the original version since no changes are needed
return(list(original = input_string,new = input_string))
}
}
# Using do.call and rbind.data.frame, we can collapse the output of a lapply()
do.call(what = rbind.data.frame,
args = lapply(X = questions$V3,
FUN = uFunc_FindAndReplace,
words_to_repl = voc$synonyms,
repl_words = voc$word))
>
original new
1 What is the day today? What is a weather today?
2 Tom has brown eyes Tom has blue eyes
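On the performance concern above: one option that stays in base R is to split all questions once, run a single match() over every token, and then regroup the tokens by question. This is only a sketch under assumptions. The helper name replace_synonyms is made up for illustration, it splits on single spaces just like the function above (so "today?" is not treated as "day"), and I have not timed it on 600K rows.

```r
questions <- data.frame(V3 = c("What is the day today?", "Tom has brown eyes"),
                        stringsAsFactors = FALSE)
voc <- data.frame(word = c("weather", "a", "blue"),
                  synonyms = c("day", "the", "brown"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: replaces whole space-separated tokens in one pass
replace_synonyms <- function(sentences, synonyms, words) {
  tokens <- strsplit(sentences, " ", fixed = TRUE)
  flat <- unlist(tokens)
  # One lookup for every token of every sentence; NA means "no synonym entry"
  idx <- match(flat, synonyms)
  hit <- !is.na(idx)
  flat[hit] <- words[idx[hit]]
  # Regroup tokens by sentence; the factor levels keep empty sentences in place
  grp <- factor(rep(seq_along(tokens), lengths(tokens)),
                levels = seq_along(tokens))
  unname(vapply(split(flat, grp), paste, character(1), collapse = " "))
}

questions$V5 <- replace_synonyms(questions$V3, voc$synonyms, voc$word)
questions$V5
```

The design choice here is to pay for the lookup once over the flattened token vector instead of once per sentence, which avoids calling an R function 600K times for the matching step.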

Related

How do I grab the word in a character column after two consecutive word matches in R?

I have a data frame 'key_words' with vectors of pairs of words
key_words <- data.frame(c1 = c('word1','word2'), c2 = c('word3','word4'), c3 = c('word5','word6'))
I would like to search for these pairs of key words in a character column 'text' in another data frame 'x' where each row can be a few sentences long. I want to grab the word following two consecutive matches of a column in the key_words data frame and insert that value into a table at the same index of where the match was found. For example, if 'word1' and 'word2' are found one after the other in text[1] then I want to grab the word that comes after in text[1] and insert it into table[1].
I have tried splitting each row in 'text' into a list, separating by a single space so that each word has its own index in each row. I have the following idea which seems very inefficient and I'm running into problems where the character value temp_list[k] is of length 0.
x <- x %>% mutate(text = strsplit(text, " "))
for (i in 1:ncol(key_words)) {
word1 <- key_words[i, 1]
word2 <- key_words[i, 2]
for (j in 1:length(x$text)) {
temp_list <- as.list(unlist(x$text[[j]]))
for (k in 1:length(temp_list))
if (word1 == temp_list[k]) {
if (word2 == temp_list[k + 1]) {
table$word_found[j] <- temp_list[k + 2]
}
}
}
Is there a better way to do this or can I search through the text column for 'word1 word2' and grab the next word which can be any length? I'm new to R and coding in general, but I know I should be avoiding nested loops like this. Any help would be appreciated, thanks!!

I would suggest that you create a small function like this one, that returns the word following the occurrence of the pair 'w1 w2'
get_word_after_pair <- function(text,w1,w2) {
stringr::str_extract(text, paste0("(?<=\\b", w1, "\\s", w2, "\\b\\s)\\w*(?=\\b)"))
}
and then you can do this
data.frame(
lapply(key_words, function(x) get_word_after_pair(texttable$text,x[1],x[2]))
)
Input:
(key_words is a list of word pairs, texttable is a data frame with a column text)
key_words <- list( pair1 = c('has','important'), pair2 = c('sentence','has'), pair3 = c('third','sentence'))
texttable = data.frame(text=c("this sentence has important words that we must find",
"this second sentence has important words to find",
"this is the third sentence and it also has important words within")
)
Output:
pair1 pair2 pair3
1 words important <NA>
2 words important <NA>
3 words <NA> and
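If you would rather not depend on stringr, the same idea can probably be written with base R's regexec() and regmatches(), using a capture group instead of a lookbehind. This is a sketch, not a drop-in replacement: the name get_word_after_pair_base is invented here, it allows any run of whitespace between the words, and like the function above it pastes the words into the pattern unescaped, so it assumes they contain no regex metacharacters.

```r
# Hypothetical base R counterpart of get_word_after_pair
get_word_after_pair_base <- function(text, w1, w2) {
  # The parenthesised group captures the word that follows "w1 w2"
  pattern <- paste0("\\b", w1, "\\s+", w2, "\\s+(\\w+)")
  hits <- regmatches(text, regexec(pattern, text, perl = TRUE))
  # Each element holds the full match plus the captured word, or is empty
  vapply(hits, function(h) if (length(h) >= 2) h[2] else NA_character_,
         character(1))
}

key_words <- list(pair1 = c("has", "important"),
                  pair2 = c("sentence", "has"),
                  pair3 = c("third", "sentence"))
texttable <- data.frame(
  text = c("this sentence has important words that we must find",
           "this second sentence has important words to find",
           "this is the third sentence and it also has important words within"),
  stringsAsFactors = FALSE
)

result <- data.frame(
  lapply(key_words, function(p) get_word_after_pair_base(texttable$text, p[1], p[2])),
  stringsAsFactors = FALSE
)
result
```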

Finding Matches Across Char Vectors in R

Given the two vectors below, is there a way to produce the desired data frame? This represents a real-world situation in which I have two data frames: the first contains a column with database values (keys) and the second contains a column of 1000+ rows, each a file name (potentials), which I need to match. The problem is that there can be multiple files (potentials) matched to any given key. I have worked with grep, merge, inner join etc. but was unable to incorporate them into one solution. Any advice is appreciated!
potentials <- c("tigerINTHENIGHT",
"tigerWALKINGALONE",
"bearOHMY",
"bearWITHME",
"rat",
"imatchnothing")
keys <- c("tiger",
"bear",
"rat")
desired <- data.frame(keys, c("tigerINTHENIGHT, tigerWALKINGALONE", "bearOHMY, bearWITHME", "rat"))
names(desired) <- c("key", "matches")
Pseudocode for what I think of as the solution:
#new column which is comma separated potentials
# x being the substring length i.e. x = 4 means true if first 4 letters match
function createNewColumn(keys, potentials, x){
str result = na
foreach(key in keys){
if(substring(key, 0, x) == any(substring(potentials, 0 ,x))){ //search entire potential vector
result += potential that matched + ', '
}
}
return new column with result as the value on the current row
}

We can write a small function to extract matches and then loop over the keys:
return_matches <- function(keys, potentials, fixed = TRUE) {
vapply(keys, function(k) {
paste(grep(k, potentials, value = TRUE, fixed = fixed), collapse = ", ")
}, FUN.VALUE = character(1))
}
vapply is a type-safe version of sapply: with FUN.VALUE = character(1), it either returns a character vector or stops with an error. When you set fixed = TRUE, the function runs a lot faster but no longer interprets the keys as regular expressions. Then we can easily make the desired data.frame:
df <- data.frame(
key = keys,
matches = return_matches(keys, potentials),
stringsAsFactors = FALSE
)
df
#> key matches
#> tiger tiger tigerINTHENIGHT, tigerWALKINGALONE
#> bear bear bearOHMY, bearWITHME
#> rat rat rat
The reason for putting the loop in a function instead of running it directly is just to make the code look cleaner.
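One caveat: grep finds the key anywhere in the file name, so a key like "rat" would also match a name such as "pirate". If only the beginning of the file name should count, as the pseudocode in the question suggests, a variant with base R's startsWith() might look like the sketch below. The name return_prefix_matches is made up for this example, and it assumes the whole key (rather than its first x letters) must be a prefix.

```r
potentials <- c("tigerINTHENIGHT", "tigerWALKINGALONE", "bearOHMY",
                "bearWITHME", "rat", "imatchnothing")
keys <- c("tiger", "bear", "rat")

# Hypothetical variant: keep only potentials that begin with the key
return_prefix_matches <- function(keys, potentials) {
  vapply(keys, function(k) {
    paste(potentials[startsWith(potentials, k)], collapse = ", ")
  }, FUN.VALUE = character(1))
}

prefix_df <- data.frame(
  key = keys,
  matches = return_prefix_matches(keys, potentials),
  row.names = NULL,
  stringsAsFactors = FALSE
)
prefix_df
```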

You can iterate over the keys using grep:
> Match <- sapply(keys, function(item) {
paste0(grep(item, potentials, value = TRUE), collapse = ", ")
} )
> data.frame(keys, Match, row.names = NULL)
keys Match
1 tiger tigerINTHENIGHT, tigerWALKINGALONE
2 bear bearOHMY, bearWITHME
3 rat rat

How to transform long names into shorter (two-part) names

I have a character vector of long names, each consisting of several words joined by dots.
x <- c("Duschekia.fruticosa..Rupr...Pouzar",
"Betula.nana.L.",
"Salix.glauca.L.",
"Salix.jenisseensis..F..Schmidt..Flod.",
"Vaccinium.minus..Lodd...Worosch")
The names differ in length, but only the first two words of each name are important.
My goal is to get names of up to 7 characters: the first 3 characters of each of the first two words, with a dot as the separator between them.
These examples are very close to my request, but I do not know how to apply these code variations to my case:
R How to remove characters from long column names in a data frame and
how to append names to " column names" of the output data frame in R?
What should I do to get output names that look like this?
x <- c("Dus.fru",
"Bet.nan",
"Sal.gla",
"Sal.jen",
"Vac.min")
Any help would be appreciated.

You can do the following:
gsub("(\\w{1,3})[^\\.]*\\.(\\w{1,3}).*", "\\1.\\2", x)
# [1] "Dus.fru" "Bet.nan" "Sal.gla" "Sal.jen" "Vac.min"
First we match up to 3 characters (\\w{1,3}), then ignore anything which is not a dot [^\\.]*, match a dot \\. and then again up to 3 characters (\\w{1,3}). Finally anything, that comes after that .*. We then only use the things in the brackets and separate them with a dot \\1.\\2.
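One thing to keep in mind with this pattern: when a name has no second word (no dot followed by a word character), gsub finds nothing to replace and silently returns the input unchanged. If you would rather see such cases flagged, a small wrapper along these lines could help. It is a sketch only; shorten_names is a made-up name, and returning NA for non-matching input is my assumption about what is useful.

```r
# Hypothetical wrapper: NA instead of the unchanged input when nothing matches
shorten_names <- function(x) {
  pattern <- "^(\\w{1,3})[^.]*\\.(\\w{1,3}).*$"
  ifelse(grepl(pattern, x), sub(pattern, "\\1.\\2", x), NA_character_)
}

x <- c("Duschekia.fruticosa..Rupr...Pouzar",
       "Betula.nana.L.",
       "Salix.glauca.L.",
       "Salix.jenisseensis..F..Schmidt..Flod.",
       "Vaccinium.minus..Lodd...Worosch")
shorten_names(x)
shorten_names("Betula")  # a single word has no second part to abbreviate
```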

Split on dot, substring 3 characters, then paste back together:
sapply(strsplit(x, ".", fixed = TRUE), function(i){
paste(substr(i[ 1 ], 1, 3), substr(i[ 2], 1, 3), sep = ".")
})
# [1] "Dus.fru" "Bet.nan" "Sal.gla" "Sal.jen" "Vac.min"

Here is a less elegant solution than kath's, but one that is a bit easier to read if you are not an expert in regex.
# Your data
x <- c("Duschekia.fruticosa..Rupr...Pouzar",
"Betula.nana.L.",
"Salix.glauca.L.",
"Salix.jenisseensis..F..Schmidt..Flod.",
"Vaccinium.minus..Lodd...Worosch")
# A function that takes three characters from first two words and merges them
cleaner_fun <- function(ugly_string) {
words <- strsplit(ugly_string, "\\.")[[1]]
short_words <- substr(words, 1, 3)
new_name <- paste(short_words[1:2], collapse = ".")
return(new_name)
}
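# Testing function (USE.NAMES = FALSE stops sapply from using the input strings as element names)
sapply(x, cleaner_fun, USE.NAMES = FALSE)
# [1] "Dus.fru" "Bet.nan" "Sal.gla" "Sal.jen" "Vac.min"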

How to read csv with values containing commas in R?

I have a tool (exe provided to me), which outputs poorly formatted csv's. They are bad in that the last value can have commas, with no quotes, e.g.:
184500,OBJECT_CALENDAR,,,UNITS_NO_UNITS,NULL,,,,Sched N&S B1,1st,3rd,4S,5th&6th
Where the last string actually begins at 'Sched', so I would expect to see something like this:
184500,OBJECT_CALENDAR,,,UNITS_NO_UNITS,NULL,,,,"Sched N&S B1,1st,3rd,4S,5th&6th"
This is screwing up everything I am trying to do, and I am curious how to address it. Is there a way to define the number of columns in read.csv?
I have tried to read it line by line, but it is slow, and less than elegant:
processFile = function(filepath) {
i = 1
vector = character(0)
theFile = file(filepath, "r")
while ( TRUE ) {
line = readLines(theFile, n = 1)
if ( length(line) == 0 ) {
break
} else {
vector[i] <- line
i = i+1
}
}
close(theFile)
formatted <- lapply(strsplit(vector[-1],','), function(x) {c(x[1:9], paste(x[10:length(x)], collapse = ','))})
finalFrame <- as.data.frame(matrix(unlist(formatted),ncol = 10, byrow = TRUE))
return(finalFrame)
}
Any better ways to do this? Any base functions that can do this, and if not, any libraries?

Specifying the classes for each column seems to work in my case. So if you have 4 columns and the 4th one might have varying number of commas, try this:
theData <- read.table(filepath, colClasses=rep("character" ,4))
Of course adjust the number of columns and their classes to your situation. Here is what I get on toy csv file:
> read.table("tmp.csv", colClasses=rep("character" ,4), header=FALSE)
V1 V2 V3 V4
1 A, B, C, 1&2
2 A, C, C, 1,2,3
3 A, V, X, 12
4 A, V, D, 1,0
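Another option would be to use read.csv with the fill=TRUE argument
theData <- read.csv(filepath, fill=TRUE)
This will produce a data.frame in which the loose commas spread the last value over several extra columns. Note that read.csv works out the number of columns from the first five lines of the file only, so you may need to pass col.names with enough entries for the widest line. Then you would have to manually combine those split pieces back into one value.
NOTE: this will work only in the case where the last column alone can have loose commas.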

This isn't ideal since you still have to read the file in line by line, but
stringr::str_split has a parameter n that specifies the maximum number of pieces to return. If you set pattern = "," and n = 10, then it will split your string into only 10 pieces, leaving the last chunk as a single string.
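A base R take on the same idea, in case it helps: read all lines at once with readLines(), wrap everything after the ninth comma in quotes, and hand the repaired text to read.csv(). Treat this as a sketch under assumptions. The function name read_loose_csv is invented, it assumes every line has at least n_cols - 1 commas, and it assumes that only the last field can contain loose commas.

```r
# Hypothetical helper: quote the trailing field, then parse as ordinary CSV
read_loose_csv <- function(lines, n_cols) {
  # Group 1: the first n_cols - 1 comma-terminated fields; group 2: the rest
  pattern <- sprintf("^((?:[^,]*,){%d})(.*)$", n_cols - 1L)
  head_part <- sub(pattern, "\\1", lines, perl = TRUE)
  tail_part <- sub(pattern, "\\2", lines, perl = TRUE)
  # Double any embedded quotes so the wrapped field stays valid CSV
  tail_part <- gsub('"', '""', tail_part, fixed = TRUE)
  quoted <- paste0(head_part, '"', tail_part, '"')
  read.csv(text = quoted, header = FALSE, colClasses = "character")
}

lines <- c(
  "184500,OBJECT_CALENDAR,,,UNITS_NO_UNITS,NULL,,,,Sched N&S B1,1st,3rd,4S,5th&6th",
  "184501,OBJECT_CALENDAR,,,UNITS_NO_UNITS,NULL,,,,Plain"
)
parsed <- read_loose_csv(lines, n_cols = 10)
parsed$V10
```

For a real file you would pass something like readLines(filepath)[-1] as lines, dropping the header row as in the question's code.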

Using ifelse on factor in R

I am restructuring a dataset of species names. It has a column with Latin names and a column with trivial names where those are available. I would like to make a 3rd column which gives the trivial name when available, otherwise the Latin name. Both the trivial names and the Latin names are of class factor.
I have tried with an if statement:
if(art2$trivname==""){
art2$artname=trivname
}else{
art2$artname=latname
}
It gives me the correct trivnames, but only gives NA when supplying latin names.
And when I use ifelse I only get numbers.
As always, all help appreciated :)

Example:
art <- data.frame(trivname = c("cat", "", "deer"), latname = c("cattus", "canis", "cervus"))
art$artname <- with(art, ifelse(trivname == "", as.character(latname), as.character(trivname)))
print(art)
# trivname latname artname
# 1 cat cattus cat
# 2 canis canis
# 3 deer cervus deer
(I think options(stringsAsFactors = FALSE) as default would be easier for most people, but there you go...)
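To see where the numbers in the question come from, it may help to run ifelse on the factor columns directly. The sketch below builds the columns explicitly as factors (an assumption, to mimic the asker's data regardless of R version) and compares the result with and without as.character.

```r
art <- data.frame(trivname = factor(c("cat", "", "deer")),
                  latname = factor(c("cattus", "canis", "cervus")))

# On factors, ifelse() hands back the internal integer codes, not the labels
codes <- ifelse(art$trivname == "", art$latname, art$trivname)
is.numeric(codes)

# Converting to character first keeps the labels
art$artname <- ifelse(art$trivname == "",
                      as.character(art$latname),
                      as.character(art$trivname))
art$artname
```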

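Getting only numbers suggests that you just need to add as.character to your assignments: ifelse returns the factors' integer codes rather than their labels. You also seem not to be referring to the data frame in the assignment, and the two branches are swapped (the Latin name should be used when the trivial name is empty). Note that if only tests a single condition, so it has to be applied row by row:
art2$artname=NA_character_
for(i in seq_len(nrow(art2))){
if(as.character(art2$trivname[i])==""){
art2$artname[i]=as.character(art2$latname[i])
}else{
art2$artname[i]=as.character(art2$trivname[i])
}
}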
Option 2: Using ifelse:
art2$artname= ifelse(as.character(art2$trivname) == "", as.character(art2$latname),as.character(art2$trivname))
It is probably easier (and more "R-thonic" because it avoids the loop) just to assign artname to trivial across the board, then overwrite the blank ones with latname...
art2 = art
art2$artname = as.character(art$trivname)
changeme = which(art2$artname=="")
art2$artname[changeme] = as.character(art$latname[changeme])

If art2 is the dataframe, and artname the new column, another possible solution:
art2$artname <- as.character(art2$trivname)
art2[art2$artname == "",'artname'] <- as.character(art2[art2$artname == "", 'latname'])
And if you want factors in the new column:
art2$artname <- as.factor(art2$artname)
