Multiple Pattern matching in R over multiple files , multiple columns & rows - r

I've a list of CSV files i need to read from , in which multiple files with columns such as Title, description .... . From these columns over multiple files , a retrieval operation has to be written and matched against another CSV generated from popular keywords(~10k) generated from a tool similar to WordStream SEO.
What i was able to do
#Not sure if this is correct approach
Source1<- read.csv(path to csv file)
Keywords_tomatch<- read.csv(path to csv file)
#cant really take both the columns into single vector and iterate over them
subColdesc <- Source1[,c(3)]
subcolTitle <-Source1[,c(2)]
keywordget<- subset(Keywords_tomatch,grepl("*",Keywords_tomatch$col1))
#Two individual vectors since i'm not sure whether sapply() can be applied over multiple lists Definition: sapply(list,function)
descBoolean <- sapply(keywordget,
function(y)
sapply(subColdesc ,
function(x)
any(grepl(y,x)))
)
TitleBoolean = sapply(keywordget,
function(y)
sapply(subcolTitle ,
function(x)
any(grepl(y,x)))
)
#matches just the first element in the column of keywordget against (~4k) elements in description,title column. i.e returns a warning/error
In grepl(y, x) :
argument 'pattern' has length > 1 and only the first element will be used
I've tried at Akrun's version of grep and it hadn't worked for me
Question :
How to match all the elements in the keywordget vector and retrieve what columns matched on each row of Description,Title and what rows of Description and Title have matched.
In short how to retrieve all the game related products in the Source1 using Keywords_tomatch?
As a sample i'm posting the two files i've gathered. Source1 only contains few rows of 4k rows
Source1 =1.csv,
Keywords_tomatch = Gaming.csv

First, let me point out some possible ways why your code is not working (and correct me if I am wrong):
Your files are read with stringAsFactors = TRUE, and grepl does recognize factor variables for the pattern = argument. But since you did not get an error about grepl not recognizing factors, I assume you converted them to characters before matching.
You need the fixed = TRUE argument for grepl or else it will treat the elements of keys as regular expressions.
Your keywordget is a dataframe, and R treats dataframes as lists when being called as one. So since the first argument of sapply takes a list, it treats keywordget as a list with 1 element. So when this element (which is essentially the entire vector of keywordget) is supplied to the pattern argument of the grepl function, you get the error:
In grepl(y, x) : argument 'pattern' has length > 1 and only the first element will be used
For example, this should work:
sapply(keywordget$GAMING, function(y) {
sapply(source1$title, function(x) {
any(grepl(y,x, fixed = TRUE))
})
})
Below is my solution:
# Read files
source1 = read.csv("source1.csv", stringsAsFactors = FALSE)
keys = read.csv("gaming.csv", stringsAsFactors = FALSE)
# Finds the index of elements in source1 that matches
# with any of the keys
matchIndex = lapply(source1, function(x){
which(Reduce(`|`, lapply(keys$GAMING, grepl, x, fixed = TRUE)))
})
> matchIndex
$title
integer(0)
$description
[1] 189 293 382 402 456
title has zero matches and description has 5
# Returns the descriptions that match
source1$description[matchIndex$description]
# Returns the title corresponding to the descriptions that match
source1$title[matchIndex$description]
> source1$title[matchIndex$description]
[1] "tomb raider: legend"
[2] "namco museum 50th anniversary collection"
[3] "restricted area"
[4] "south park chef's luv shack"
[5] "brainfood games cranium collection 2006"

Related

How to transform long names into shorter (two-part) names

I have a character vector in which long names are used, which will consist of several words connected by delimiters in the form of a dot.
x <- c("Duschekia.fruticosa..Rupr...Pouzar",
"Betula.nana.L.",
"Salix.glauca.L.",
"Salix.jenisseensis..F..Schmidt..Flod.",
"Vaccinium.minus..Lodd...Worosch")
The length of the names is different. But only the first two words of the entire name are important.
My goal is to get names up to 7 symbols: 3 initial symbols from the first two words and a separator in the form of a "dot" between them.
Very close to my request are these examples, but I do not know how to apply these code variations to my case.
R How to remove characters from long column names in a data frame and
how to append names to " column names" of the output data frame in R?
What should I do to get exit names to look like this?
x <- c("Dus.fru",
"Bet.nan",
"Sal.gla",
"Sal.jen",
"Vac.min")
Any help would be appreciated.
You can do the following:
gsub("(\\w{1,3})[^\\.]*\\.(\\w{1,3}).*", "\\1.\\2", x)
# [1] "Dus.fru" "Bet.nan" "Sal.gla" "Sal.jen" "Vac.min"
First we match up to 3 characters (\\w{1,3}), then ignore anything which is not a dot [^\\.]*, match a dot \\. and then again up to 3 characters (\\w{1,3}). Finally anything, that comes after that .*. We then only use the things in the brackets and separate them with a dot \\1.\\2.
Split on dot, substring 3 characters, then paste back together:
sapply(strsplit(x, ".", fixed = TRUE), function(i){
paste(substr(i[ 1 ], 1, 3), substr(i[ 2], 1, 3), sep = ".")
})
# [1] "Dus.fru" "Bet.nan" "Sal.gla" "Sal.jen" "Vac.min"
Here a less elegant solution than kath's, but a bit more easy to read, if you are not an expert in regex.
# Your data
x <- c("Duschekia.fruticosa..Rupr...Pouzar",
"Betula.nana.L.",
"Salix.glauca.L.",
"Salix.jenisseensis..F..Schmidt..Flod.",
"Vaccinium.minus..Lodd...Worosch")
# A function that takes three characters from first two words and merges them
cleaner_fun <- function(ugly_string) {
words <- strsplit(ugly_string, "\\.")[[1]]
short_words <- substr(words, 1, 3)
new_name <- paste(short_words[1:2], collapse = ".")
return(new_name)
}
# Testing function
sapply(x, cleaner_fun)
[1]"Dus.fru" "Bet.nan" "Sal.gla" "Sal.jen" "Vac.min"

R: find a specific string next to another string with for loop

I have the text of a novel in a single vector, it has been split by words novel.vector.words I am looking for all instances of the string "blood of". However since the vector is split by words, each word is its own string and I don't know to search for adjacent strings in a vector.
I have a basic understanding of what for loops do, and following some instructions from a text book, I can use this for loop to target all positions of "blood" and the context around it to create a tab-delineated KWIC display (key words in context).
node.positions <- grep("blood", novel.vector.words)
output.conc <- "D:/School/U Alberta/Classes/Winter 2019/LING 603/dracula_conc.txt"
cat("LEFT CONTEXT\tNODE\tRIGHT CONTEXT\n", file=output.conc) # tab-delimited header
#This establishes the range of how many words we can see in our KWIC display
context <- 10 # specify a window of ten words before and after the match
for (i in 1:length(node.positions)){ # access each match...
# access the current match
node <- novel.vector.words[node.positions[i]]
# access the left context of the current match
left.context <- novel.vector.words[(node.positions[i]-context):(node.positions[i]-1)]
# access the right context of the current match
right.context <- novel.vector.words[(node.positions[i]+1):(node.positions[i]+context)]
# concatenate and print the results
cat(left.context,"\t", node, "\t", right.context, "\n", file=output.conc, append=TRUE)}
What I am not sure how to do however, is use something like an if statement or something to only capture instances of "blood" followed by "of". Do I need another variable in the for loop? What I want it to do basically is for every instance of "blood" that it finds, I want to see if the word that immediately follows it is "of". I want the loop to find all of those instances and tell me how many there are in my vector.
You can create an index using dplyr::lead to match 'of' following 'blood':
library(dplyr)
novel.vector.words <- c("blood", "of", "blood", "red", "blood", "of", "blue", "blood")
which(grepl("blood", novel.vector.words) & grepl("of", lead(novel.vector.words)))
[1] 1 5
In response to the question in the comments:
This certainly could be done with a loop based approach but there is little point in re-inventing the wheel when there are already packages better designed and optimized to do the heavy lifting in text mining tasks.
Here is an example of how to find how frequently the words 'blood' and 'of' appear within five words of each other in Bram Stoker's Dracula using the tidytext package.
library(tidytext)
library(dplyr)
library(stringr)
## Read Dracula into dataframe and add explicit line numbers
fulltext <- data.frame(text=readLines("https://www.gutenberg.org/ebooks/345.txt.utf-8", encoding = "UTF-8"), stringsAsFactors = FALSE) %>%
mutate(line = row_number())
## Pair of words to search for and word distance
word1 <- "blood"
word2 <- "of"
word_distance <- 5
## Create ngrams using skip_ngrams token
blood_of <- fulltext %>%
unnest_tokens(output = ngram, input = text, token = "skip_ngrams", n = 2, k = word_distance - 1) %>%
filter(str_detect(ngram, paste0("\\b", word1, "\\b")) & str_detect(ngram, paste0("\\b", word2, "\\b")))
## Return count
blood_of %>%
nrow
[1] 54
## Inspect first six line number indices
head(blood_of$line)
[1] 999 1279 1309 2192 3844 4135

Vectorized use of the substring function for a row selection of a dataframe with different length

My dataframe has a column named Code of the type char which goes like b,b1,b110-b139,b110,b1100,b1101,... (1602 entries)
I am trying to select all the entries that match the strings in a vector and all the ones that start with the same string.
So lets say I have the vector
Selection=c("b114","d2")
then i want all codes like b114, b1140, b1141, b1142, ... as well as d2, d200, d2000, d2001, d2002, d2003 etc...
what does work in principle is to create a new dataframe like this:
bTable <- TreeMapTable[substr(TreeMapTable$Code,1,4)=="b114"|substr(TreeMapTable$Code,1,2)=="d2",]
which gives me all the data i want, but requires me to manually type the condition for each entry and i just want to give the script a vector with the strings.
I tried to do it like this:
SelectionL=nchar(Selection)
Beispieltable <- TreeMapTable[substr(TreeMapTable$Code,1,AuswahlL)==Auswahl1,]
but this gives me somehow only half of the required entries and i confess i don't really know what it is doing. I know i could use a for loop but from everything i read so far, loops should be avoided and the problem should be solveable by use of vectors.
sample data
df <- data.frame( Code = c("b114", "b115", "b11456", "d2", "d12", "d200", "db114"),
stringsAsFactors = FALSE)
Selection=c("b114","d2")
answer
library( dplyr )
#create a regex pattern to filter on
pattern <- paste0( "^", Selection, collapse = "|" )
#filter out all rows wher 'Code' dows not start with the entries from 'Selection'
df %>% filter( grepl( pattern, Code, perl = TRUE ) )
# Code
# 1 b114
# 2 b11456
# 3 d2
# 4 d200

NLP - identifying and replacing words (synonyms) in R

I have problem with code in R.
I have a data-set(questions) with 4 columns and over 600k observation, of which one column is named 'V3'.
This column has questions like 'what is the day?'.
I have second data-set(voc) with 2 columns, of which one column name 'word' and other column name 'synonyms'. If In my first data-set (questions )exists word from second data-set(voc) from column 'synonyms' then I want to replace it word from 'word' column.
questions = cbind(V3=c("What is the day today?","Tom has brown eyes"))
questions <- data.frame(questions)
V3
1 what is the day today?
2 Tom has brown eyes
voc = cbind(word=c("weather", "a","blue"),synonyms=c("day", "the", "brown"))
voc <- data.frame(voc)
word synonyms
1 weather day
2 a the
3 blue brown
Desired output
V3 V5
1 what is the day today? what is a weather today?
2 Tom has brown eyes Tom has blue eyes
I wrote simple code but it doesn't work.
for (k in 1:nrow(question))
{
for (i in 1:nrow(voc))
{
question$V5<- gsub(do.call(rbind,strsplit(question$V3[k]," "))[which (do.call(rbind,strsplit(question$V3[k]," "))== voc[i,2])], voc[i,1], question$V3)
}
}
Maybe someone will try to help me? :)
I wrote second code, but it doesn't work too..
for( i in 1:nrow(questions))
{
for( j in 1:nrow(voc))
{
if (grepl(voc[j,k],do.call(rbind,strsplit(questions[i,]," "))) == TRUE)
{
new=matrix(gsub(do.call(rbind,strsplit(questions[i,]," "))[which(do.call(rbind,strsplit(questions[i,]," "))== voc[j,2])], voc[j,1], questions[i,]))
questions[i,]=new
}
}
questions = cbind(questions,c(new))
}
First, it is important that you use the stringsAsFactors = FALSE option, either at the program level, or during your data import. This is because R defaults to making strings into factors unless you otherwise specify. Factors are useful in modeling, but you want to do analysis of the text itself, and so you should be sure that your text is not coerced to factors.
The way I approached this was to write a function that would "explode" each string into a vector, and then uses match to replace the words. The vector gets reassembled into a string again.
I'm not sure how performant this will be given your 600K records. You might look into some of the R packages that handle strings, like stringr or stringi, since they will probably have functions that do some of this. match tends to be okay on speed, but %in% can be a real beast depending on the length of the string and other factors.
# Start with options to make sure strings are represented correctly
# The rest is your original code (mildly tidied to my own standard)
options(stringsAsFactors = FALSE)
questions <- cbind(V3 = c("What is the day today?","Tom has brown eyes"))
questions <- data.frame(questions)
voc <- cbind(word = c("weather","a","blue"),
synonyms = c("day","the","brown"))
voc <- data.frame(voc)
# This function takes:
# - an input string
# - a vector of words to replace
# - a vector of the words to use as replacements
# It returns a list of the original input and the changed version
uFunc_FindAndReplace <- function(input_string,words_to_repl,repl_words) {
# Start by breaking the input string into a vector
# Note that we use [[1]] to get first list element of strsplit output
# Obviously this relies on breaking sentences by spacing
orig_words <- strsplit(x = input_string,split = " ")[[1]]
# If we find at least one of the words to replace in the original words, proceed
if(sum(orig_words %in% words_to_repl) > 0) {
# The right side selects the elements of orig_words that match words to be replaced
# The left side uses match to find the numeric index of those replacements within the words_to_repl vector
# This numeric vector is used to select the values from repl_words
# These then replace the values in orig_words
orig_words[orig_words %in% words_to_repl] <- repl_words[match(x = orig_words,table = words_to_repl,nomatch = 0)]
# We rebuild the sentence again, and return a list with original and new version
new_sent <- paste(orig_words,collapse = " ")
return(list(original = input_string,new = new_sent))
} else {
# Otherwise we return the original version since no changes are needed
return(list(original = input_string,new = input_string))
}
}
# Using do.call and rbind.data.frame, we can collapse the output of a lapply()
do.call(what = rbind.data.frame,
args = lapply(X = questions$V3,
FUN = uFunc_FindAndReplace,
words_to_repl = voc$synonyms,
repl_words = voc$word))
>
original new
1 What is the day today? What is a weather today?
2 Tom has brown eyes Tom has blue eyes

Replace misspelled values with agrep

I have a dataset of restaurants and the variable "CONAME" contains the name of each establishment. Unfortunately, there are quite a few misspellings, and I'd like to correct them. I've tried agrep for fuzzy set matching using the following code (which I'll repeat for all major chains):
rest2012$CONAME <- agrep("MC DONALD'S", rest2012$CONAME, ignore.case = FALSE, value = FALSE, max.distance = 3)
I'm getting the following error message:
Error in $<-.data.frame(*tmp*, "CONAME", value = c(35L, 40L, 48L, :
replacement has 3074 rows, data has 67424
Is there another way I can replace the misspelled names or am I simply using the agrep function wrong?
When you use agrep with value = FALSE the result is "a vector giving the indices of the elements that yielded a match". That is, the position of matches in the vector of names that you fed agrep with. You then try to replace the entire name variable in your data frame (67424 rows) with a shorter vector of indices (3074 of them). Not what you want. Here is a small example which perhaps can guide you in the right direction. You may also read ?Extract and this. The details of agrep itself (e.g. max.distance), I leave to you.
# create a data frame with some MC DONALD's-ish names, and some other names.
rest2012 <- data.frame(CONAME = c("MC DONALD'S", "MCC DONALD'S", "SPSS Café", "GLM RONALDO'S", "MCMCglmm"))
rest2012
# do some fuzzy matching with 'agrep'
# store the indices in an object named 'idx'
idx <- agrep(pattern = "MC DONALD'S", x = rest2012$CONAME, ignore.case = FALSE, value = FALSE, max.distance = 3)
idx
# just look at the rows in the data frame that matched
# indexing with a numeric vector
rest2012[idx, ]
# replace the elements that matches
rest2012[idx, ] <- "MC DONALD'S"
rest2012

Resources