I know a similar question might have been asked in this forum, but I feel my requirement is peculiar.
I have a data frame with a column containing the following values.
Below is just a sample; the full data contains more than 1000 observations.
Reported Terms
"2 Left Axillary Lymph Nodes Resection"
"cardyoohyper"
"Ablation Breast"
"Hypercarido"
"chordiohyper"
"Adenocarcinoma Of Colon (Radical Resection And Cr)"
"myocasta"
"hypermyopa"
I have another data frame with the below rules:
[image: data frame of replacement rules]
I am expecting the below output:
"2 Left Axillary Lymph Nodes Resection"
"carddiohiper"
"Ablation Breast"
"hipercardio"
"cardiohyper"
"Adenocarcinoma Of Colon (Radical Resection And Cr)"
"miocasta"
"hipermiopa"
I am trying to hard-code this with the gsub function, but I understand that it will take a lot of time.
pattern <- c("kardio", "carido", "cardyo", "cordio", "chordio")
replacement <- "cardio"
gsub(pattern, replacement, df$reportedterms)
With the above approach I need to encode every rule separately, creating new pattern and replacement variables for each call to gsub.
Is there a simple approach to solve this problem?
First, let's set this up as you described:
library(tibble)
df <- tibble(text = c("2 Left Axillary Lymph Nodes Resection",
"cardyoohyper",
"Ablation Breast",
"Hypercarido",
"chordiohyper",
"Adenocarcinoma Of Colon (Radical Resection And Cr)",
"myocasta",
"hypermyopa"))
replace_dict <- tibble(pattern = list(c("kardio", "carido", "cardyo", "cordio", "chordio"),
"myoca",
"myopa",
"hyper"),
replacement = c("cardio",
"mioca",
"miopa",
"hiper"))
I would simply use stringi for this task, as it has an extremely efficient version of gsub called stri_replace_all_fixed (note that you could also use the regex version, which is a bit slower but works the same way). It can handle several patterns and replacements at the same time, so all we need to do is unnest the pattern column first and then run stringi:
batch_replace <- function(text, replace_dict) {
replace_dict <- tidyr::unnest(replace_dict, pattern)
stringi::stri_replace_all_fixed(str = text,
pattern = replace_dict$pattern,
replacement = replace_dict$replacement,
vectorize_all = FALSE)
}
Let's put this function to a test:
df$text_new <- batch_replace(df$text, replace_dict)
df
#> # A tibble: 8 x 2
#> text text_new
#> <chr> <chr>
#> 1 2 Left Axillary Lymph Nodes Resecti~ 2 Left Axillary Lymph Nodes Resecti~
#> 2 cardyoohyper cardioohiper
#> 3 Ablation Breast Ablation Breast
#> 4 Hypercarido Hypercardio
#> 5 chordiohyper cardiohiper
#> 6 Adenocarcinoma Of Colon (Radical Re~ Adenocarcinoma Of Colon (Radical Re~
#> 7 myocasta miocasta
#> 8 hypermyopa hipermiopa
I think that is what you wanted. Note that the function isn't very flexible: you have to provide replace_dict exactly in the form shown. Since you haven't shared the file, I can't help you with wrangling it into that form, so you will have to figure that out or ask a new question.
Update
If you want replacement to be case insensitive and also want to lowercase the text, the function could look like this:
batch_replace <- function(text, replace_dict, to_lower = TRUE, case_insensitive = TRUE) {
replace_dict <- tidyr::unnest(replace_dict, pattern)
if (to_lower) {
text <- tolower(text)
}
stringi::stri_replace_all_fixed(str = text,
pattern = replace_dict$pattern,
replacement = replace_dict$replacement,
vectorize_all = FALSE,
opts_fixed = stringi::stri_opts_fixed(case_insensitive = case_insensitive))
}
You can turn on/off lower casing and case-insensitive replacement as you need it.
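For instance, applied to the sample data above with the defaults (a quick check; same df and replace_dict as before), lower-casing plus case-insensitive matching now also converts "Hypercarido":
df$text_new <- batch_replace(df$text, replace_dict)
df$text_new[4]
#> [1] "hipercardio"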
I have a dataset of 4000+ images. For the purpose of figuring out the code, I moved a small subset of them to another folder.
The files look like this:
folder
[1] "r01c01f01p01-ch3.tiff" "r01c01f01p01-ch4.tiff" "r01c01f02p01-ch1.tiff"
[4] "r01c01f03p01-ch2.tiff" "r01c01f03p01-ch3.tiff" "r01c01f04p01-ch2.tiff"
[7] "r01c01f04p01-ch4.tiff" "r01c01f05p01-ch1.tiff" "r01c01f05p01-ch2.tiff"
[10] "r01c01f06p01-ch2.tiff" "r01c01f06p01-ch4.tiff" "r01c01f09p01-ch3.tiff"
[13] "r01c01f09p01-ch4.tiff" "r01c01f10p01-ch1.tiff" "r01c01f10p01-ch4.tiff"
[16] "r01c01f11p01-ch1.tiff" "r01c01f11p01-ch2.tiff" "r01c01f11p01-ch3.tiff"
[19] "r01c01f11p01-ch4.tiff" "r01c02f10p01-ch1.tiff" "r01c02f10p01-ch2.tiff"
[22] "r01c02f10p01-ch3.tiff" "r01c02f10p01-ch4.tiff"
I cannot remove the name prior to the -ch# as that information is important. What I want to do, however, is to filter this list of images and return only sets (i.e. r01c02f10p01) which have all four ch values (ch1-ch4).
I was originally thinking that we could approach the issue along the lines of this:
ch1 <- dir(path="/Desktop/cp/complete//", pattern="ch1")
ch2 <- dir(path="/Desktop/cp/complete//", pattern="ch2")
ch3 <- dir(path="/Desktop/cp/complete//", pattern="ch3")
ch4 <- dir(path="/Desktop/cp/complete//", pattern="ch4")
I would then apply those lists with the file.remove function, similar to this:
final2 <- dir(path="/Desktop/cp1/Images//", pattern="ch5")
file.remove(folder,final2)
However, creating new variables for each ch value fragments out each file; I am unsure how to use these variables to determine whether an individual image has all four ch values and thus meaningfully filter my images. I'm kind of at a loss, as the other sources I've seen have issues that don't quite match this problem.
Earlier, I was able to remove all the images with ch5 from my image set like this. I was thinking this might be helpful for filtering only the images which have ch1-ch4, but I'm not sure how to proceed.
##Create folder variable which has all image files
folder <- list.files(getwd())
##Create final2 variable which has all image files ending in ch5
final2 <- dir(path="/Desktop/cp1/Images//", pattern="ch5")
##Remove final2 from folder
file.remove(folder,final2)
To summarize: I expect to go from a random assortment of files where sets may be incomplete (i.e. a set may have only ch1 and ch2, or ch3 and ch4, or all of ch1-ch4) to an assortment which only contains files belonging to a complete set (four files with ch1, ch2, ch3, and ch4).
Starting with a vector of filenames like you would get from list.files or something similar, you can create a data frame of filenames and use regex to extract the alphanumeric part at the beginning and the number that follows "-ch". Then check that all elements of an expected set (I put this in ch_set, though you may need to construct it differently) occur in each group's set of ch values.
# assume this is the vector of file names that comes from list.files
# or something comparable
files <- c("r01c01f01p01-ch3.tiff", "r01c01f01p01-ch4.tiff", "r01c01f02p01-ch1.tiff", "r01c01f03p01-ch2.tiff", "r01c01f03p01-ch3.tiff", "r01c01f04p01-ch2.tiff", "r01c01f04p01-ch4.tiff", "r01c01f05p01-ch1.tiff", "r01c01f05p01-ch2.tiff", "r01c01f06p01-ch2.tiff", "r01c01f06p01-ch4.tiff", "r01c01f09p01-ch3.tiff", "r01c01f09p01-ch4.tiff", "r01c01f10p01-ch1.tiff", "r01c01f10p01-ch4.tiff", "r01c01f11p01-ch1.tiff", "r01c01f11p01-ch2.tiff", "r01c01f11p01-ch3.tiff", "r01c01f11p01-ch4.tiff", "r01c02f10p01-ch1.tiff", "r01c02f10p01-ch2.tiff", "r01c02f10p01-ch3.tiff", "r01c02f10p01-ch4.tiff")
library(dplyr)
ch_set <- 1:4
files_to_keep <- data.frame(filename = files, stringsAsFactors = FALSE) %>%
tidyr::extract(filename, into = c("group", "ch"), regex = "(^[\\w\\d]+)\\-ch(\\d)", remove = FALSE) %>%
mutate(ch = as.numeric(ch)) %>%
group_by(group) %>%
filter(all(ch_set %in% ch))
files_to_keep
#> # A tibble: 8 x 3
#> # Groups: group [2]
#> filename group ch
#> <chr> <chr> <dbl>
#> 1 r01c01f11p01-ch1.tiff r01c01f11p01 1
#> 2 r01c01f11p01-ch2.tiff r01c01f11p01 2
#> 3 r01c01f11p01-ch3.tiff r01c01f11p01 3
#> 4 r01c01f11p01-ch4.tiff r01c01f11p01 4
#> 5 r01c02f10p01-ch1.tiff r01c02f10p01 1
#> 6 r01c02f10p01-ch2.tiff r01c02f10p01 2
#> 7 r01c02f10p01-ch3.tiff r01c02f10p01 3
#> 8 r01c02f10p01-ch4.tiff r01c02f10p01 4
Now that you have a dataframe of the complete groups, just pull the matching filenames back out:
files_to_keep$filename
#> [1] "r01c01f11p01-ch1.tiff" "r01c01f11p01-ch2.tiff" "r01c01f11p01-ch3.tiff"
#> [4] "r01c01f11p01-ch4.tiff" "r01c02f10p01-ch1.tiff" "r01c02f10p01-ch2.tiff"
#> [7] "r01c02f10p01-ch3.tiff" "r01c02f10p01-ch4.tiff"
One thing to note is that this worked even without the mutate line where I converted ch to numeric, i.e. comparing character versions of those numbers to regular numeric versions, because under the hood %in% converts to matching types. That didn't seem totally safe if you need to scale this, so I converted them explicitly to matching types.
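If you'd rather stay in base R, a minimal sketch of the same grouping idea (assuming the same files vector as above) could look like this:
# strip the channel suffix to get the group id, then pull out the channel number
group <- sub("-ch\\d\\.tiff$", "", files)
ch <- as.numeric(sub(".*-ch(\\d)\\.tiff$", "\\1", files))
# keep only files whose group contains all of ch1-ch4
keep <- ave(ch, group, FUN = function(x) all(1:4 %in% x)) == 1
files[keep]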
Let me explain what I want to do. I have corpus data (15 M words) about a political debate, and I want to find the co-occurrence of two terms within, say, 10k words.
I create two vectors of positions for two terms, "false" and "law":
false.v <- c(133844, 133880, 145106, 150995, 152516, 152557, 153697, 155507)
law.v <- c(48064, 155644, 251315, 297303, 323417, 349576, 368052, 543487)
Then I want to gather them in a matrix to see the co-occurrence, using the outer function. Since the positions are taken from the same corpus, I create a matrix of differences:
distances <- outer(false.v, law.v, "-")
To make this easier to read, let's name them:
rownames(distances) <- paste0("False", false.v)
colnames(distances) <- paste0("Law", law.v)
Okay, so we have the matrix ready. To find which pairs of positions were within 10000 words of each other I just run:
abs(distances) <= 10000
So I have to identify those moments in the political debate where there is a greater frequency of those co-occurrences. Here comes the problem: I have to do it with more than one pair of words (in fact, with five or so pairs of words), so it would be great if I could just search multiple words instead of two at a time. So instead of searching "false" and "law", I'd search "false OR lie OR whatever" and "law OR money OR whatever". I guess I have to use regex for this task, right? I have tried everything and nothing has worked.
The example I just gave is a simplification. The command I use to search for a word's positions creates a vector out of the corpus:
positions.law.v <- which(C1.corpus.v == "law")
So it would be great if I could just use something like:
which(C1.corpus.v == "law OR money OR prison OR ...")
which(C1.corpus.v == "false OR lie OR country OR ...")
It's like telling R: "hey, give me the co-occurrence positions of any possible combination between the first group of words (law or money or prison...) and the second one (false or lie or country...)". I hope I'm explaining this clearly. I'm sorry for the language mistakes. Thank you!
I have an extended answer here as well, but it could be as simple as:
mywords = c("law", "money", "prison", "false", "lie", "country")
which(C1.corpus.v %in% mywords)
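From there, one way to plug the grouped positions into the outer() distance approach from your question (a sketch, assuming C1.corpus.v and the 10k window from above):
group1 <- c("law", "money", "prison")
group2 <- c("false", "lie", "country")
pos1 <- which(C1.corpus.v %in% group1)
pos2 <- which(C1.corpus.v %in% group2)
distances <- outer(pos1, pos2, "-")
# row/column indices of position pairs within 10,000 words of each other
hits <- which(abs(distances) <= 10000, arr.ind = TRUE)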
Try:
library(quanteda)
library(dplyr) # for the %>% pipe used below
I'll use the election manifestos of 9 UK political parties from 2010:
data_char_ukimmig2010
Create a tokens object (there are lots of settings - check out https://quanteda.io/)
mytoks <- data_char_ukimmig2010 %>%
char_tolower() %>%
tokens()
mywords = c("law", "money", "prison", "false", "lie", "country")
kwic "return[s] a list of a keyword supplied by the user in its immediate context, identifying the source text and the word index number within the source text" source
mykwic <- kwic(mytoks, mywords)
A kwic call builds a data frame with various features, one of which is the integer starting position of your keywords (a starting position, because you can also use it to look for phrases):
mykwic$from
Gives us:
> mykwic$from
[1] 130 438 943 1259 1281 1305 1339 1356 1743 1836 1859 2126 2187 2443 2546 2640 2763 2952 3186 3270 179 8 201
[24] 343 354 391 498 16 131 552 14 29 388 80 306 487 507
I think your problem is slightly more sophisticated than a job for regex. For instance, you may be willing to include law, legal and legislation in one group but not lawless. A regex like \blaw.*\b wouldn't help you much there. In effect, you are interested in:
Creating feature co-occurrence matrix
Incorporating the semantic proximity of the words
Feature co-occurrence matrix
This is a well-established task, and I would encourage you to use a tested solution like the fcm function. To introduce an example from the documentation:
txt <- "A D A C E A D F E B A C E D"
fcm(txt, context = "window", window = 2)
fcm(txt, context = "window", count = "weighted", window = 3)
fcm(txt, context = "window", count = "weighted", window = 3,
weights = c(3, 2, 1), ordered = TRUE, tri = FALSE)
Your regex
To suggest a solution to your particular problem, take this:
which(C1.corpus.v == "law OR money OR prison OR ...")
where
C1.corpus.v <- c("law", "word", "something","legal", "stuff")
you could do:
grep(
pattern = paste("legal", "law", "som.*", sep = "|"),
x = C1.corpus.v,
perl = TRUE,
value = FALSE
)
where sep = "|" serves as your ...OR.... IMHO, this is not what you want as it does not address semantic similarity. I would suggest you have a look at some of the good tutorials that are available on the net 1,2.
1 Taylor Arnold and Lauren Tilton Basic Text Processing in R
2 Islam, Aminul & Inkpen, Diana. (2008). Semantic Text Similarity Using Corpus-Based Word Similarity and String Similarity. TKDD. 2. 10.1145/1376815.1376819.
I am new to R. In my dataset, I have a variable called Reason. I want to create a new column called Price. If any of the following conditions is met:
word "Price" and word "High" are both mentioned in Reason and the distance between them is less than 6 words
word "Price" and word "expensive" are both mentioned in Reason and the distance between them is less than 6 words
-word "Price" and word "increase" are both mentioned in Reason and the distance between them is less than 6 words
than Price=1. Otherwise, price=0.
I found the following user-defined function to get the distance between two words:
distance <- function(string, term1, term2) {
words <- strsplit(string, "\\s")[[1]]
indices <- 1:length(words)
names(indices) <- words
abs(indices[term1] - indices[term2])
}
but I don't know how to apply it to the whole column to get the expected results. I tried the following code, but it only gives me "logical(0)" as the result.
for (j in seq(Survey$Reason))
{
Survey$Price[[j]]<- distance(Survey$Reason[[j]], " price ", " high ") <=6
}
Any help is highly appreciated.
Thanks
Starting from your sample data:
survey <- structure(list(Reason = c("Their price are extremely high.", "Because my price was increased so much, I wouldn't want anyone else to have to deal with that.", "Just because the intial workings were fine, but after we realised it would affect our contract, it left a sour taste in our mouth.", "Problems with the repair", "They did not handle my complaint as well I would have liked.", "Bad service overall.")), .Names = "Reason", row.names = c(NA, 6L), class = "data.frame")
First, I updated your function to remove punctuation and directly return the result of the position test:
distanceOK <- function(string, term1, term2,n=6) {
words <- strsplit(gsub("[[:punct:]]", "", string), "\\s")[[1]]
indices <- 1:length(words)
names(indices) <- words
dist <- abs(indices[term1] - indices[term2])
ifelse(is.na(dist)|dist>n,0,1)
}
Then we apply:
survey$Price <- sapply(survey$Reason, FUN=function(str) distanceOK(str, "price","high"))
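To cover all three conditions from the question, one possibility (a sketch; note that exact word matching, as in distanceOK, will not catch inflected forms such as "increased") is to test "price" against each term and take the maximum:
terms <- c("high", "expensive", "increase")
survey$Price <- sapply(survey$Reason, FUN = function(str) {
  # 1 if any term co-occurs with "price" within n words, else 0
  max(sapply(terms, function(t) distanceOK(str, "price", t)))
})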
I am trying to efficiently map exact peptides (short sequences of amino acids in the 26-character alphabet A-Z [1]) to proteins (longer sequences of the same alphabet). The most efficient way to do this that I'm aware of is an Aho-Corasick trie (where peptides are the keywords). Unfortunately I can't find a version of AC in R that will work with a non-nucleotide alphabet (Biostrings' PDict and Starr's match_ac are both hard-coded for DNA).
As a crutch I've been trying to parallelize a basic grep approach. But I'm having trouble figuring out a way to do so without incurring significant IO overhead. Here is a brief example:
peptides = c("FSSSGGGGGGGR","GAHLQGGAK","GGSGGSYGGGGSGGGYGGGSGSR","IISNASCTTNCLAPLAK")
if (!exists("proteins"))
{
biocLite("biomaRt", ask=F, suppressUpdates=T, suppressAutoUpdate=T)
library(biomaRt)
ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
proteins = getBM(attributes=c('peptide', 'refseq_peptide'), filters='refseq_peptide', values=c("NP_000217", "NP_001276675"), mart=ensembl)
row.names(proteins) = proteins$refseq_peptide
}
library(snowfall)
library(Biostrings)
library(plyr)
sfInit(parallel=T, cpus=detectCores()-1)
allPeptideInstances = NULL
i=1
increment=100
count=nrow(proteins)
while(T)
{
print(paste(i, min(count, i+increment), sep=":"))
text_source = proteins[i:min(count, i+increment),]
text = text_source$peptide
#peptideInstances = sapply(peptides, regexpr, text, fixed=T, useBytes=T)
peptideInstances = sfSapply(peptides, regexpr, text, fixed=T, useBytes=T)
dimnames(peptideInstances) = list(text_source$refseq_peptide, colnames(peptideInstances))
sparsePeptideInstances = alply(peptideInstances, 2, .fun = function(x) {x[x > 0]}, .dims = T)
allPeptideInstances = c(allPeptideInstances, sparsePeptideInstances, recursive=T)
if (i==count | nrow(text_source) < increment)
break
i = i+increment
}
sfStop()
There are a few issues here:
1. peptideInstances here is a dense matrix, so returning it from each worker is very verbose. I have broken the work up into blocks so that I'm not dealing with a 40,000 (proteins) x 60,000 (peptides) matrix.
2. I am parallelizing over the peptides, when it would make more sense to parallelize over the proteins because they're bigger. But I got frustrated with trying to do it by protein because this code breaks if there is only one protein in text_source.
Alternatively, if anyone is aware of a better solution in R, I'm happy to use that. I've spent enough time on this I probably would have been better served implementing Aho-Corasick.
[1] Some of those are ambiguity codes, but for simplicity, ignore that.
I learned Rcpp and implemented an Aho-Corasick trie myself. Now CRAN has a good general-purpose multiple-keyword search package.
Here are some usage examples:
library(AhoCorasickTrie) # CRAN package providing AhoCorasickSearch and AhoCorasickSearchList

# helper to compare nested list results in the checks below
listEquals = function(a, b) { is.null(unlist(a)) && is.null(unlist(b)) || !is.null(a) && !is.null(b) && all(unlist(a) == unlist(b)) }
# simple search of multiple keywords in a single text
keywords = c("Abra", "cadabra", "is", "the", "Magic", "Word")
oneSearch = AhoCorasickSearch(keywords, "Is Abracadabra the Magic Word?")
stopifnot(listEquals(oneSearch[[1]][[1]], list(keyword="Abra", offset=4)))
stopifnot(listEquals(oneSearch[[1]][[2]], list(keyword="cadabra", offset=8)))
stopifnot(listEquals(oneSearch[[1]][[3]], list(keyword="the", offset=16)))
stopifnot(listEquals(oneSearch[[1]][[4]], list(keyword="Magic", offset=20)))
stopifnot(listEquals(oneSearch[[1]][[5]], list(keyword="Word", offset=26)))
# search a list of lists
# * sublists are accessed by index
# * texts are accessed by index
# * non-matched texts are kept (to preserve index order)
listSearch = AhoCorasickSearchList(keywords, list(c("What in", "the world"), c("is"), "secret about", "the Magic Word?"))
stopifnot(listEquals(listSearch[[1]][[1]], list()))
stopifnot(listEquals(listSearch[[1]][[2]][[1]], list(keyword="the", offset=1)))
stopifnot(listEquals(listSearch[[2]][[1]][[1]], list(keyword="is", offset=1)))
stopifnot(listEquals(listSearch[[3]], list()))
stopifnot(listEquals(listSearch[[4]][[1]][[1]], list(keyword="the", offset=1)))
stopifnot(listEquals(listSearch[[4]][[1]][[2]], list(keyword="Magic", offset=5)))
stopifnot(listEquals(listSearch[[4]][[1]][[3]], list(keyword="Word", offset=11)))
# named search of a list of lists
# * sublists are accessed by name
# * matched texts are accessed by name
# * non-matched texts are dropped
namedSearch = AhoCorasickSearchList(keywords, list(subject=c(phrase1="What in", phrase2="the world"),
verb=c(phrase1="is"),
predicate1=c(phrase1="secret about"),
predicate2=c(phrase1="the Magic Word?")))
stopifnot(listEquals(namedSearch$subject$phrase2[[1]], list(keyword="the", offset=1)))
stopifnot(listEquals(namedSearch$verb$phrase1[[1]], list(keyword="is", offset=1)))
stopifnot(listEquals(namedSearch$predicate1, list()))
stopifnot(listEquals(namedSearch$predicate2$phrase1[[1]], list(keyword="the", offset=1)))
stopifnot(listEquals(namedSearch$predicate2$phrase1[[2]], list(keyword="Magic", offset=5)))
stopifnot(listEquals(namedSearch$predicate2$phrase1[[3]], list(keyword="Word", offset=11)))
# named search of multiple texts in a single list with keyword grouping and aminoacid alphabet
# * all matches to a keyword are accessed by name
# * non-matched keywords are dropped
proteins = c(protein1="PEPTIDEPEPTIDEDADADARARARARAKEKEKEKEPEPTIDE",
protein2="DERPADERPAPEWPEWPEEPEERAWRAWWARRAGTAGPEPTIDEKESEQUENCE")
peptides = c("PEPTIDE", "DERPA", "SEQUENCE", "KEKE", "PEPPIE")
peptideSearch = AhoCorasickSearch(peptides, proteins, alphabet="aminoacid", groupByKeyword=T)
stopifnot(listEquals(peptideSearch$PEPTIDE, list(list(keyword="protein1", offset=1),
list(keyword="protein1", offset=8),
list(keyword="protein1", offset=37),
list(keyword="protein2", offset=38))))
stopifnot(listEquals(peptideSearch$DERPA, list(list(keyword="protein2", offset=1),
list(keyword="protein2", offset=6))))
stopifnot(listEquals(peptideSearch$SEQUENCE, list(list(keyword="protein2", offset=47))))
stopifnot(listEquals(peptideSearch$KEKE, list(list(keyword="protein1", offset=29),
list(keyword="protein1", offset=31),
list(keyword="protein1", offset=33))))
stopifnot(listEquals(peptideSearch$PEPPIE, NULL))
# grouping by keyword without text names: offsets are given without reference to the text
names(proteins) = NULL
peptideSearch = AhoCorasickSearch(peptides, proteins, groupByKeyword=T)
stopifnot(listEquals(peptideSearch$PEPTIDE, list(1, 8, 37, 38)))
stopifnot(listEquals(peptideSearch$DERPA, list(1, 6)))
stopifnot(listEquals(peptideSearch$SEQUENCE, list(47)))
stopifnot(listEquals(peptideSearch$KEKE, list(29, 31, 33)))
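Offsets are 1-based character positions into the searched text, so a reported match can be double-checked with substr(); e.g. for the first match in the very first example above:
substr("Is Abracadabra the Magic Word?", 4, 4 + nchar("Abra") - 1)
#> [1] "Abra"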
I have a dataframe like this
> myDataFrame
company
1 Investment LLC
2 Hyperloop LLC
3 Invezzstment LLC
4 Investment_LLC
5 Haiperloop LLC
6 Inwestment LLC
I need to match all these fuzzy strings, so the end result should look like this:
> myDataFrame
company
1 Investment LLC
2 Hyperloop LLC
3 Investment LLC
4 Investment LLC
5 Hyperloop LLC
6 Investment LLC
So, actually, I must solve a partial match-and-replace task for a categorical variable. There are a lot of great functions in base R and in packages for string matching, but I'm stuck trying to find a single solution for this kind of match-and-replace.
I don't care which spelling replaces the others; for example, "Investment LLC" and "Invezzstment LLC" are both equally fine. They just need to be consistent.
Is there any single all-in-one function or a loop for this?
If you have a vector of correct spellings, agrep makes this reasonably easy:
myDataFrame$company <- sapply(myDataFrame$company,
function(val){agrep(val,
c('Investment LLC', 'Hyperloop LLC'),
value = TRUE)})
myDataFrame
# company
# 1 Investment LLC
# 2 Hyperloop LLC
# 3 Investment LLC
# 4 Investment LLC
# 5 Hyperloop LLC
# 6 Investment LLC
If you don't have such a vector, you can likely make one with clever application of adist or even just table if the correct spelling is repeated more than the others, which it likely will be (though isn't here).
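A minimal sketch of that adist idea (with the cut height h = 3 as a tuning assumption, not a given): cluster the names by edit distance and keep the most frequent spelling per cluster.
# pairwise edit distances between company names
d <- adist(myDataFrame$company)
# one cluster per company: cut the hierarchical tree at edit distance 3
groups <- cutree(hclust(as.dist(d)), h = 3)
# pick the most common spelling in each cluster as the canonical one
canonical <- tapply(myDataFrame$company, groups,
                    function(x) names(sort(table(x), decreasing = TRUE))[1])
myDataFrame$company <- canonical[groups]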
So, after some time I ended up with this code. Attention: it does not fully automate the replacement process, because a human has to verify the proposed matches every time, and the agrep max.distance argument needs fine-tuning every time. I am sure there are ways to make it better and quicker, but this can help get the job done.
##########
# Manual renaming with partial matches
##########
# a) Take a look at the desired column of factor variables
sort(unique(MYDATA$names)) # take a look
# ****
Sensthreshold <- 0.2 # sensitivity of agrep, usually 0.1-0.2 get it right
Searchstring <- "Invesstment LLC" # what should I search?
# ****
# User-defined function: returns similar string on query in column
Searcher <- function(input, similarity = 0.1) {
unique(agrep(input,
MYDATA$names, # <-- define your column here
ignore.case = TRUE, value = TRUE,
max.distance = similarity))
}
# b) Make a search of desired string
Searcher(Searchstring, Sensthreshold) # using user-def function
### PLEASE INSPECT THE OUTPUT OF THE SEARCH
### Did it get it right?
#=============================================================================#
## ACTION! This changes your dataframe!
## Please make backup before proceeding
## Please execute this code as a whole to avoid errors
# c) Make a vector of cells indexes after checking output
vector_of_cells <- agrep(Searchstring,
MYDATA$names, ignore.case = TRUE,
max.distance = Sensthreshold)
# d) Apply the changes
MYDATA$names[vector_of_cells] <- Searchstring # <--- CHANGING STRING
# e) Check result
unique(agrep(Searchstring, MYDATA$names,
ignore.case = TRUE, value = TRUE, max.distance = Sensthreshold))
#=============================================================================#