Replace misspelled values with agrep - r

I have a dataset of restaurants and the variable "CONAME" contains the name of each establishment. Unfortunately, there are quite a few misspellings, and I'd like to correct them. I've tried agrep for fuzzy set matching using the following code (which I'll repeat for all major chains):
rest2012$CONAME <- agrep("MC DONALD'S", rest2012$CONAME, ignore.case = FALSE, value = FALSE, max.distance = 3)
I'm getting the following error message:
Error in $<-.data.frame(*tmp*, "CONAME", value = c(35L, 40L, 48L, :
replacement has 3074 rows, data has 67424
Is there another way I can replace the misspelled names or am I simply using the agrep function wrong?

When you use agrep with value = FALSE, the result is "a vector giving the indices of the elements that yielded a match", that is, the positions of the matches in the vector of names that you passed to agrep. You then try to replace the entire name variable in your data frame (67424 rows) with a shorter vector of indices (3074 of them), which is not what you want. Here is a small example that may point you in the right direction. You may also want to read ?Extract. The details of agrep itself (e.g. max.distance) I leave to you.
# create a data frame with some MC DONALD's-ish names, and some other names.
rest2012 <- data.frame(CONAME = c("MC DONALD'S", "MCC DONALD'S", "SPSS Café", "GLM RONALDO'S", "MCMCglmm"))
rest2012
# do some fuzzy matching with 'agrep'
# store the indices in an object named 'idx'
idx <- agrep(pattern = "MC DONALD'S", x = rest2012$CONAME, ignore.case = FALSE, value = FALSE, max.distance = 3)
idx
# just look at the rows in the data frame that matched
# indexing with a numeric vector
rest2012[idx, ]
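# replace the elements that match
# (index the CONAME column explicitly, so other columns in a wider data frame are left untouched)
rest2012$CONAME[idx] <- "MC DONALD'S"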
rest2012
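Since the question mentions repeating this for all major chains, here is a minimal sketch of how the indexing approach above could be looped over several canonical names. The chains vector, the toy data and max.distance = 2 are assumptions for illustration only; the distance would need tuning on the real data, since a loose threshold can pull in unrelated names.

```r
# Hypothetical canonical chain names (an assumption, not from the original data)
chains <- c("MC DONALD'S", "BURGER KING")

# Toy data standing in for the real rest2012 data frame
rest <- data.frame(
  CONAME = c("MC DONALD'S", "MCC DONALD'S", "BURGER KNG", "SPSS CAFE"),
  stringsAsFactors = FALSE
)

for (chain in chains) {
  # positions of the names that are within 2 edits of the canonical name
  idx <- agrep(chain, rest$CONAME, max.distance = 2)
  # overwrite only those positions in the name column
  rest$CONAME[idx] <- chain
}
rest$CONAME
```

It is probably worth inspecting rest$CONAME[idx] for each chain before overwriting anything.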

Related

Finding Matches Across Char Vectors in R

Given the two vectors below, is there a way to produce the desired data frame? This represents a real-world situation in which I have two data frames: the first contains a column with database values (keys) and the second contains a column of 1000+ rows, each a file name (potentials), which I need to match. The problem is that there can be multiple files (potentials) matched to any given key. I have worked with grep, merge, inner join etc. but was unable to incorporate them into one solution. Any advice is appreciated!
potentials <- c("tigerINTHENIGHT",
"tigerWALKINGALONE",
"bearOHMY",
"bearWITHME",
"rat",
"imatchnothing")
keys <- c("tiger",
"bear",
"rat")
desired <- data.frame(keys, c("tigerINTHENIGHT, tigerWALKINGALONE", "bearOHMY, bearWITHME", "rat"))
names(desired) <- c("key", "matches")
Pseudocode for what I think of as the solution:
#new column which is comma separated potentials
# x being the substring length i.e. x = 4 means true if first 4 letters match
function createNewColumn(keys, potentials, x){
str result = na
foreach(key in keys){
if(substring(key, 0, x) == any(substring(potentials, 0 ,x))){ //search entire potential vector
result += potential that matched + ', '
}
}
return new column with result as the value on the current row
}
We can write a small function to extract matches and then loop over the keys:
return_matches <- function(keys, potentials, fixed = TRUE) {
vapply(keys, function(k) {
paste(grep(k, potentials, value = TRUE, fixed = fixed), collapse = ", ")
}, FUN.VALUE = character(1))
}
vapply is just a type-safe version of sapply: with FUN.VALUE = character(1) it will never return anything but a character vector. When you set fixed = TRUE the function will run a lot faster, but the keys are then matched literally and no longer treated as regular expressions. Then we can easily make the desired data.frame:
df <- data.frame(
key = keys,
matches = return_matches(keys, potentials),
stringsAsFactors = FALSE
)
df
#>         key                            matches
#> tiger tiger tigerINTHENIGHT, tigerWALKINGALONE
#> bear   bear               bearOHMY, bearWITHME
#> rat     rat                                rat
The reason for putting the loop in a function instead of running it directly is just to make the code look cleaner.
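To illustrate the point about type safety, the sketch below compares sapply and vapply on an empty input, a case where sapply falls back to a list while vapply still returns a character vector. The name empty_keys is only for illustration.

```r
# An empty vector of keys, e.g. after filtering removed everything
empty_keys <- character(0)

# sapply cannot simplify a zero-length result, so it hands back a list
res_sapply <- sapply(empty_keys, function(k) paste(k, "x"))

# vapply is told the result type up front and keeps it
res_vapply <- vapply(empty_keys, function(k) paste(k, "x"), FUN.VALUE = character(1))

is.list(res_sapply)       # TRUE
is.character(res_vapply)  # TRUE
```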
You can iterate over the keys using grep:
> Match <- sapply(keys, function(item) {
paste0(grep(item, potentials, value = TRUE), collapse = ", ")
} )
> data.frame(keys, Match, row.names = NULL)
keys Match
1 tiger tigerINTHENIGHT, tigerWALKINGALONE
2 bear bearOHMY, bearWITHME
3 rat rat
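The pseudocode in the question compares only the first x characters, which neither answer does literally. A minimal sketch of that variant, assuming a hypothetical helper name match_by_prefix and 1-based substr indexing, could look like this:

```r
potentials <- c("tigerINTHENIGHT", "tigerWALKINGALONE",
                "bearOHMY", "bearWITHME", "rat", "imatchnothing")
keys <- c("tiger", "bear", "rat")

# Hypothetical helper: a potential matches a key when their first x characters agree
match_by_prefix <- function(keys, potentials, x) {
  vapply(keys, function(k) {
    hits <- potentials[substr(potentials, 1, x) == substr(k, 1, x)]
    paste(hits, collapse = ", ")
  }, FUN.VALUE = character(1), USE.NAMES = FALSE)
}

prefix_matches <- match_by_prefix(keys, potentials, x = 3)
data.frame(key = keys, matches = prefix_matches, stringsAsFactors = FALSE)
```

Keys with no match get an empty string rather than NA, so the column stays the same length as keys.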

Converting list of Characters to Named num in R

I want to create a dataframe with 3 columns.
#First column
name_list = c("ABC_D1", "ABC_D2", "ABC_D3",
"ABC_E1", "ABC_E2", "ABC_E3",
"ABC_F1", "ABC_F2", "ABC_F3")
df1 = data.frame(C1 = name_list)
These names in column 1 are a bunch of named results of the cor.test function. The second column should consist of the correlation coefficients I get by writing ABC_D1$estimate, ABC_D2$estimate.
My problem is that I don't want to add the $estimate manually to every single name in the first column. I tried this:
df1$C2 = paste0(df1$C1, '$estimate')
But this doesn't work; it only gives me this back:
"ABC_D1$estimate", "ABC_D2$estimate", "ABC_D3$estimate",
"ABC_E1$estimate", "ABC_E2$estimate", "ABC_E3$estimate",
"ABC_F1$estimate", "ABC_F2$estimate", "ABC_F3$estimate")
class(df1$C2)
[1] "character"
How can I get the numeric result for ABC_D1$estimate into my data frame? How can I convert these characters into Named num? The third column should consist of the results of $p.value.
As pointed out by @DSGym, there are several problems, including that it is not very convenient to have a vector of character names; it would be better to have a list of objects instead.
Anyway, I think you can get where you want using:
estimates <- lapply(name_list, function(dat) {
dat_l <- get(dat)
dat_l[["estimate"]]
}
)
cbind(name_list, estimates)
This is not really advisable but given those premises...
OK, I think I now know what you need.
eval(parse(text = paste0("ABC_D1", '$estimate')))
You concatenate the two strings and use the functions parse and eval to get your results.
This is how to do it for your whole data.frame:
name_list = c("ABC_D1", "ABC_D2", "ABC_D3",
"ABC_E1", "ABC_E2", "ABC_E3",
"ABC_F1", "ABC_F2", "ABC_F3")
df1 = data.frame(C1 = name_list)
library(purrr) # provides map_dbl
df1$C2 <- map_dbl(paste0(df1$C1, '$estimate'), function(x) eval(parse(text = x)))
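As a possible alternative to eval(parse(...)), the sketch below collects the objects into a named list with mget and then pulls out both the estimate and the p-value. The two cor.test results are simulated stand-ins (an assumption, since the original objects are not shown), and only two names are used to keep it short.

```r
# Simulated stand-ins for the real cor.test results (assumption for illustration)
set.seed(1)
ABC_D1 <- cor.test(rnorm(20), rnorm(20))
ABC_D2 <- cor.test(rnorm(20), rnorm(20))

name_list <- c("ABC_D1", "ABC_D2")

# Look the objects up by name and keep them together in one named list
tests <- mget(name_list, inherits = TRUE)

df1 <- data.frame(
  C1 = name_list,
  # unname() drops the "cor" label so a plain numeric column is stored
  C2 = vapply(tests, function(ct) unname(ct$estimate), numeric(1)),
  C3 = vapply(tests, function(ct) ct$p.value, numeric(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)
df1
```

Building the list directly when the tests are run (instead of creating separately named objects) would avoid the lookup by name altogether.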

Returning the matched string from a grepl match of multiple strings, rather than the logical

Currently I'm using nested ifelse functions with grepl to check for matches to a vector of strings in a data frame, for example:
# vector of possible words to match
x <- c("Action", "Adventure", "Animation")
# data
my_text <- c("This one has Animation.", "This has none.", "Here is Adventure.")
my_text <- as.data.frame(my_text)
my_text$new_column <- ifelse (
grepl("Action", my_text$my_text) == TRUE,
"Action",
ifelse (
grepl("Adventure", my_text$my_text) == TRUE,
"Adventure",
ifelse (
grepl("Animation", my_text$my_text) == TRUE,
"Animation", NA)))
> my_text$new_column
[1] "Animation" NA "Adventure"
This is fine for just a few elements (e.g., the three here), but how do I do this when the number of possible matches is much larger (e.g., 150)? Nested ifelse seems crazy. I know I can grepl multiple things at once as in the code below, but this returns a logical telling me only if the string was matched, not which one was matched. I'd like to know what was matched (in the case of multiple matches, any of them is fine).
x <- c("Action", "Adventure", "Animation")
my_text <- c("This one has Animation.", "This has none.", "Here is Adventure.")
grepl(paste(x, collapse = "|"), my_text)
This returns: [1] TRUE FALSE TRUE
What I'd like it to return: "Animation" "" (or FALSE) "Adventure"
Following the pattern here, a base solution.
x <- c("ActionABC", "AdventureDEF", "AnimationGHI")
regmatches(x, regexpr("(Action|Adventure|Animation)", x))
stringr has an easier way to do this
library(stringr)
str_extract(x, "(Action|Adventure|Animation)")
Building on Benjamin's base solution, use lapply so that you get a character(0) value when there is no match.
Using regmatches directly on your sample code will give you the following error.
my_text$new_column <-regmatches(x = my_text$my_text, m = regexpr(pattern = paste(x, collapse = "|"), text = my_text$my_text))
Error in `$<-.data.frame`(`*tmp*`, new_column, value = c("Animation", :
replacement has 2 rows, data has 3
This is because there are only 2 matches and it will try to fit the matches values in the data frame column which has 3 rows.
To fill non-matches with a special value so that this operation can be done directly we can use lapply.
my_text$new_column <-
lapply(X = my_text$my_text, FUN = function(X){
regmatches(x = X, m = regexpr(pattern = paste(x, collapse = "|"), text = X))
})
This will put character(0) where there is no match.
Hope this helps.
This will do it...
my_text$new_column <- unlist(
apply(
sapply(x, grepl, my_text$my_text),
1,
function(y) paste(x[y], collapse = " ")))
The sapply produces a logical matrix showing which of the x terms appears in each element of your column. The apply then runs through this row by row and pastes together all of the values of x corresponding to TRUE values. (Using collapse keeps the output the same length as the original data: a row with no match yields "" rather than NA or a dropped element.) If two terms in x are matched for a row, they will be pasted together in the output, separated by a space.
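If a plain character column with NA for non-matches is preferred, one base R sketch is to pre-fill a vector with NA and write the regmatches results only into the positions where regexpr found something. The name first_match is just for illustration.

```r
x <- c("Action", "Adventure", "Animation")
my_text <- c("This one has Animation.", "This has none.", "Here is Adventure.")

# One alternation pattern covering all candidate words
pattern <- paste(x, collapse = "|")

# regexpr returns -1 for elements without a match
m <- regexpr(pattern, my_text)

# Start with NA everywhere, then fill in the positions that matched
first_match <- rep(NA_character_, length(my_text))
first_match[m != -1] <- regmatches(my_text, m)
first_match
# [1] "Animation" NA          "Adventure"
```

This returns the first match in each string, which fits the "any of the matches is fine" requirement.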

R - How to replace a string from multiple matches (in a data frame)

I need to replace subset of a string with some matches that are stored within a dataframe.
For example -
input_string = "Whats your name and Where're you from"
I need to replace part of this string from a data frame. Say the data frame is
matching <- data.frame(from_word=c("Whats your name", "name", "fro"),
to_word=c("what is your name","names","froth"))
Output expected is what is your name and Where're you from
Note -
It is to match the maximum string. In this example, name is not matched to names, because name was a part of a bigger match
It has to match whole string and not partial strings. fro of "from" should not match as "froth"
I referred to the linked question below but somehow could not get it to work as intended/described above
Match and replace multiple strings in a vector of text without looping in R
This is my first post here. If I haven't given enough details, kindly let me know
Edit
Based on the input from Sri's comment I would suggest using:
library(gsubfn)
# words to be replaced
a <-c("Whats your","Whats your name", "name", "fro")
# their replacements
b <- c("What is yours","what is your name","names","froth")
# named list as an input for gsubfn
replacements <- setNames(as.list(b), a)
# the test string
input_string = "fro Whats your name and Where're name you from to and fro I Whats your"
# match entire words
gsubfn(paste(paste0("\\w*", names(replacements), "\\w*"), collapse = "|"), replacements, input_string)
Original
I would not say this is easier to read than your simple loop, but it might take better care of the overlapping replacements:
# define the sample dataset
input_string = "Whats your name and Where're you from"
matching <- data.frame(from_word=c("Whats your name", "name", "fro", "Where're", "Whats"),
to_word=c("what is your name","names","froth", "where are", "Whatsup"))
# load used libraries
library(gsubfn)
library(stringr) # provides str_split, used below
# make sure data is of class character
matching$from_word <- as.character(matching$from_word)
matching$to_word <- as.character(matching$to_word)
# extract the words in the sentence
test <- unlist(str_split(input_string, " "))
# find where individual words from sentence match with the list of replaceable words
test2 <- sapply(paste0("\\b", test, "\\b"), grepl, matching$from_word)
# change rownames to see what is the format of output from the above sapply
rownames(test2) <- matching$from_word
# reorder the data so that largest replacement blocks are at the top
test3 <- test2[order(rowSums(test2), decreasing = TRUE),]
# where the word is already being replaced by larger chunk, do not replace again
test3[apply(test3, 2, cumsum) > 1] <- FALSE
# define the actual pairs of replacement
replacements <- setNames(as.list(as.character(matching[,2])[order(rowSums(test2), decreasing = TRUE)][rowSums(test3) >= 1]),
as.character(matching[,1])[order(rowSums(test2), decreasing = TRUE)][rowSums(test3) >= 1])
# perform the replacement
gsubfn(paste(as.character(matching[,1])[order(rowSums(test2), decreasing = TRUE)][rowSums(test3) >= 1], collapse = "|"),
replacements,input_string)
toreplace =list("x1" = "y1","x2" = "y2", ..., "xn" = "yn")
Each entry of the list pairs two values, xi and yi.
xi is the pattern (find what), yi is the replacement (replace with).
input_string = "Whats your name and Where're you from"
toreplace<-list("Whats your name" = "what is your name", "names" = "name", "fro" = "froth")
gsubfn(paste(names(toreplace),collapse="|"),toreplace,input_string)
Was trying out different things and the below code seems to work.
a <-c("Whats your name", "name", "fro")
b <- c("what is your name","names","froth")
c <- c("Whats your name and Where're you from")
for(i in seq_along(a)) c <- gsub(paste0('\\<',a[i],'\\>'), gsub(" ","_",b[i]), c)
c <- gsub("_"," ",c)
c
I took help from the linked question "Making gsub only replace entire words?"
However, I would like to avoid the loop if possible. Can someone please improve this answer so that it works without the loop?
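One way to avoid the explicit for loop in base R is to build a single whole-word pattern, with the longest phrases listed first, and replace each hit through a lookup vector. This is only a sketch: it assumes the from_word entries contain no regex metacharacters, and the names ord, pattern and lookup are introduced here for illustration.

```r
input_string <- "Whats your name and Where're you from"
matching <- data.frame(
  from_word = c("Whats your name", "name", "fro"),
  to_word   = c("what is your name", "names", "froth"),
  stringsAsFactors = FALSE
)

# Longest phrases first, so the alternation prefers the largest match
ord <- order(nchar(matching$from_word), decreasing = TRUE)

# \\b on both sides restricts matches to whole words ("fro" will not hit "from")
pattern <- paste0("\\b(", paste(matching$from_word[ord], collapse = "|"), ")\\b")

# Named vector: names are the phrases to find, values are their replacements
lookup <- setNames(matching$to_word, matching$from_word)

# Find every hit, then swap each one for its replacement
m <- gregexpr(pattern, input_string, perl = TRUE)
regmatches(input_string, m) <- lapply(
  regmatches(input_string, m),
  function(hit) unname(lookup[hit])
)
input_string
# [1] "what is your name and Where're you from"
```

Because "Whats your name" is consumed as one match, the shorter "name" entry never gets a chance to fire inside it.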

Multiple Pattern matching in R over multiple files , multiple columns & rows

I have a list of CSV files to read, several of which have columns such as Title and Description. I need to retrieve the values of these columns across the files and match them against another CSV of popular keywords (~10k) generated by a tool similar to WordStream SEO.
What i was able to do
#Not sure if this is correct approach
Source1<- read.csv(path to csv file)
Keywords_tomatch<- read.csv(path to csv file)
#cant really take both the columns into single vector and iterate over them
subColdesc <- Source1[,c(3)]
subcolTitle <-Source1[,c(2)]
keywordget<- subset(Keywords_tomatch,grepl("*",Keywords_tomatch$col1))
#Two individual vectors since i'm not sure whether sapply() can be applied over multiple lists Definition: sapply(list,function)
descBoolean <- sapply(keywordget,
function(y)
sapply(subColdesc ,
function(x)
any(grepl(y,x)))
)
TitleBoolean = sapply(keywordget,
function(y)
sapply(subcolTitle ,
function(x)
any(grepl(y,x)))
)
#matches just the first element in the column of keywordget against (~4k) elements in description,title column. i.e returns a warning/error
In grepl(y, x) :
argument 'pattern' has length > 1 and only the first element will be used
I've tried at Akrun's version of grep and it hadn't worked for me
Question :
How to match all the elements in the keywordget vector and retrieve what columns matched on each row of Description,Title and what rows of Description and Title have matched.
In short how to retrieve all the game related products in the Source1 using Keywords_tomatch?
As a sample i'm posting the two files i've gathered. Source1 only contains few rows of 4k rows
Source1 =1.csv,
Keywords_tomatch = Gaming.csv
First, let me point out some possible reasons why your code is not working (and correct me if I am wrong):
Your files are read with stringsAsFactors = TRUE, and grepl does not recognize factor variables for the pattern = argument. But since you did not get an error about grepl not recognizing factors, I assume you converted them to characters before matching.
You need the fixed = TRUE argument for grepl or else it will treat the elements of keys as regular expressions.
Your keywordget is a data frame, and R treats data frames as lists when they are used as one. So since the first argument of sapply takes a list, it treats keywordget as a list with 1 element. When this element (which is essentially the entire keyword column of keywordget) is supplied to the pattern argument of the grepl function, you get the warning:
In grepl(y, x) : argument 'pattern' has length > 1 and only the first element will be used
For example, this should work:
sapply(keywordget$GAMING, function(y) {
sapply(source1$title, function(x) {
any(grepl(y,x, fixed = TRUE))
})
})
Below is my solution:
# Read files
source1 = read.csv("source1.csv", stringsAsFactors = FALSE)
keys = read.csv("gaming.csv", stringsAsFactors = FALSE)
# Finds the index of elements in source1 that matches
# with any of the keys
matchIndex = lapply(source1, function(x){
which(Reduce(`|`, lapply(keys$GAMING, grepl, x, fixed = TRUE)))
})
> matchIndex
$title
integer(0)
$description
[1] 189 293 382 402 456
title has zero matches and description has 5
# Returns the descriptions that match
source1$description[matchIndex$description]
# Returns the title corresponding to the descriptions that match
source1$title[matchIndex$description]
> source1$title[matchIndex$description]
[1] "tomb raider: legend"
[2] "namco museum 50th anniversary collection"
[3] "restricted area"
[4] "south park chef's luv shack"
[5] "brainfood games cranium collection 2006"
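The question also asks which keywords matched on each row. A minimal sketch of that, using made-up toy data in place of the two CSV files (the rows and keywords below are assumptions for illustration), builds a rows-by-keywords logical matrix and collapses it per row:

```r
# Toy stand-ins for source1.csv and the keyword column of gaming.csv
source1 <- data.frame(
  title = c("tomb raider: legend", "garden hose", "chess set"),
  description = c("action adventure video game", "50 ft rubber hose", "classic board game"),
  stringsAsFactors = FALSE
)
keys <- c("video game", "board game", "console")

# Logical matrix: one row per row of source1, one column per keyword
hits <- vapply(
  keys,
  function(k) grepl(k, source1$description, fixed = TRUE),
  FUN.VALUE = logical(nrow(source1))
)

# For each row, list the keywords that were found ("" when none matched)
matched <- apply(hits, 1, function(r) paste(keys[r], collapse = ", "))
source1$matched_keywords <- matched

# Keep only the rows where at least one keyword matched
source1[rowSums(hits) > 0, ]
```

The same matrix could be computed for the title column and combined with `|` to cover both columns.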
