I have two data frames with numerous variables. Of primary concern are the following two: df1.organization_name and df2.LEGAL.NAME (I'm just using fully qualified, SQL-esque names here). df1 has dimensions of 15 x 2,700 whereas df2 has dimensions of 10 x 40,000, and essentially the 'common' or 'matching' columns are the name fields.
I reviewed the post "Merging through fuzzy matching of variables in R" and it was very helpful, but I can't figure out how to wrangle that script to work with my data frames.
I keep getting an error:
Error in which(organization_name[i] == LEGAL.NAME) : object 'LEGAL.NAME' not found
Desired Matching and Outcome
What I am trying to do is compare each and every one of my df1.organization_name values to every one of the df2.LEGAL.NAME values and flag the pair if they are a very close match (like >= 85% similar). Then, as in the script from that post, take the matched customer name and the matched comparison name and put them into a data.frame for later analysis.
So, if one of my customer names is 'Johns Hopkins Auto Repair' and one of my public-list names is 'John Hopkins Microphone Repair', I would call that a good match, and I want an indicator appended to my customer list (in another column) that says 'Partial Match', plus the matched name from the public list.
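A minimal sketch of that comparison, assuming the stringdist package; the two toy vectors, the Jaro-Winkler metric, and the 0.85 cutoff are illustrative stand-ins, not the linked post's method:

library(stringdist)

# Toy stand-ins for df1$organization_name and df2$LEGAL.NAME
org   <- c("Johns Hopkins Auto Repair", "My Company LLC")
legal <- c("JOHN HOPKINS MICROPHONE REPAIR", "1 STOP LLC")

# Jaro-Winkler distance lies in [0, 1], so 1 - distance is a similarity score;
# lower-case both sides so case differences don't hurt the score
sim  <- 1 - stringdistmatrix(tolower(org), tolower(legal), method = "jw")
best <- apply(sim, 1, which.max)

result <- data.frame(
  organization_name = org,
  closest_name      = legal[best],
  similarity        = sim[cbind(seq_along(org), best)]
)
result$flag <- ifelse(result$similarity >= 0.85, "Partial Match", "No Match")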
Example(s) of the dfs for text wrangling:
df1.organization_name (these are fake names b/c I can't post customer names)
- My Company LLC
- John Johns DBA John's Repair
- Some Company Inc
- Ninja Turtles LLP
- Shredder Partners
df2.LEGAL.NAME (these are real names from the open source file)
- $1 & UP STORE CORP.
- $1 store 0713
- LLC 0baid/munir/gazem
- 1 2 3 MONEY EXCHANGE LLC
- 1 BOY & 3 GIRLS, LLC
- 1 STAR BEVERAGE INC
- 1 STOP LLC
- 1 STOP LLC
- 1 STOP LLC DBA TIENDA MEXICANA LA SAN JOSE
- 1 Stop Money Centers, LLC/Richard
Edit: Fixed data example issue
Background/Data: I'm working on a merge between two datasets: one is a list of the legal names of various publicly traded companies and the second is a fairly dirty field with company names, individual titles, and all sorts of other difficult to predict words. The company name list is about 14,000 rows and the dirty data is about 1.3M rows. Not every publicly traded company will appear in the dirty data and some may appear multiple times with different presentations (Exxon Mobil, Exxon, ExxonMobil, etc.).
Accordingly, my current approach is to dismantle the publicly traded company name list into the individual words used in each title (after cleaning out some common words like company, corporation, inc, etc.), resulting in the data shown below as Have1. An example of some of the dirty data is shown below as Have2. I have also cleaned these strings to eliminate words like Inc and Company in my ongoing work, but in case anyone has a better idea than my current approach, I'm leaving the data as-is. Additionally, we can assume there are very few, if any, exact matches in the data and that the Have2 data is too noisy to successfully use a fuzzy match without additional work.
Question: What is the best way to go about determining which of the items in Have2 contain the words from Have1? Specifically, I think I need the final data to look like Want, so that I can then link the public company name to the dirty-data name. The plan is to hand-verify the matches given the difficulty of the Have2 data, but if anyone has a suggestion for another way to go about this, I am definitely open to it (please, someone, have a suggestion haha).
Tried so far: I have code that sort of works, but takes ages to run and seems inefficient. That is:
library(data.table)
library(stringr)
company_name_data <- c("amazon inc", "apple inc", "radiation inc", "xerox inc", "notgoingtomatch inc")
have1 <- data.table(table(str_split(company_name_data, "\\W+", simplify = TRUE)))[!V1 == "inc"]
have2 <- c("ceo and director, apple inc",
"current title - senior manager amazon, inc., division of radiation exposure, subdivision of corporate anarchy",
"xerox inc., president and ceo",
"president and ceo of the amazon apple assn., division 4")
#Uses Have2 and creates a matrix where each column is a word and each row reflects one of the items from Have2
have3 <- data.table(str_split(have2, "\\W+", simplify = TRUE))
#Creates container
store <- data.table()
#Loops through each of the Have1 company names and sees whether that word appears in the have3 matrix
for (i in 1:nrow(have1)){
  matches <- data.table(have2[sapply(1:nrow(have3), function(x) any(grepl(paste0("\\b", have1$V1[i], "\\b"), have3[x, ])))])
  if (nrow(matches) == 0){
    next
  }
  #Create combo data
  matches[, have1_word := have1$V1[i]]
  #Storage
  store <- rbind(store, matches)
}
Want

| Name (from Have2) | Word (from Have1) |
| --- | --- |
| current title - senior manager amazon, inc., division of microwaves and radiation exposure, subdivision of corporate anarchy | amazon |
| current title - senior manager amazon, inc., division of microwaves and radiation exposure, subdivision of corporate anarchy | radiation |
| vp and general bird aficionado of the amazon apple assn. branch F | amazon |
| vp and general bird aficionado of the amazon apple assn. branch F | apple |
| ceo and director, apple inc | apple |
| xerox inc., president and ceo | xerox |
Have1

| Word | N |
| --- | --- |
| amazon | 1 |
| apple | 3 |
| xerox | 1 |
| notgoingtomatch | 2 |
| radiation | 1 |
Have2

| Name |
| --- |
| ceo and director, apple inc |
| current title - senior manager amazon, inc., division of microwaves and radiation exposure, subdivision of corporate anarchy |
| xerox inc., president and ceo |
| vp and general bird aficionado of the amazon apple assn. branch F |
Using what you have documented, in terms of data from company_name_data and have2 only:
library(tidytext)
library(tidyverse)
#------------ remove stop words before tokenization ---------------
# now split each phrase, remove the stop words, rejoin the phrases
# this works through one element at a time (lapply-style iteration)
comp2 <- unlist(lapply(company_name_data, function(x) {
  # split the phrase into words, drop the stop words, reassemble the phrase
  words <- unlist(strsplit(x, " "))
  paste(words[!words %in% stop_words$word], collapse = " ")
}))

haveItAll <- data.frame(have2)

haveItAll$comp <- unlist(lapply(have2, function(x) {
  # keep only the words that also appear in the cleaned company names
  words <- unlist(strsplit(x, " "))
  paste(words[words %in% comp2], collapse = " ")
}))
The results in the second column, based on the text analysis are "apple," "radiation," "xerox," and "amazon apple."
I'm certain this code isn't mine originally. I'm sure I got these ideas from somewhere on StackOverflow...
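For output shaped exactly like Want, a join-based data.table alternative avoids the row-by-row loop entirely. This is a sketch reusing have1/have2 from the question; tokens and want are names I've made up:

library(data.table)
library(stringr)

# one row per (dirty name, word) pair, then an equi-join against the word list
tokens <- data.table(
  Name = rep(have2, lengths(str_split(have2, "\\W+"))),
  Word = unlist(str_split(have2, "\\W+"))
)
want <- unique(tokens[have1, on = .(Word = V1), nomatch = NULL][, .(Name, Word)])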
My dataset looks something like this (note: it is a hypothetical dataset).
Objective: a sales employee has to go to a particular location to verify houses/stores/buildings, and a device captures the information below.
| Sr.No. | Store_Name | Phone-No. | Agent_id | Area | Lat-Long |
| --- | --- | --- | --- | --- | --- |
| 1 | ABC Stores | 89099090 | 121 | Bay Area | 23.909090,89.878798 |
| 2 | Wuhan Masks | 45453434 | 122 | Santa Fe | 24.452134,78.123243 |
| 3 | Twitter Cafe | 67556090 | 123 | Middle East | 11.889766,23.334483 |
| 4 | abc | 33445569 | 121 | Santa Cruz | 23.345678,89.234213 |
| 5 | Silver Gym | 11004110 | 234 | Worli Sea Link | 56.564311, 78.909087 |
| 6 | CK Clothings | 00908876 | 223 | 90 th Street | 34.445887, 12.887654 |
Facts:
#1 Unique identifier for finding duplicates – check Sr.No. 1 & 4, which are basically the same store.
In this dummy dataset all the columns can be manipulated, i.e. for the same store/house/building/outlet:
a) Since the name is entered manually, the name for the same house/store can be changed and re-entered in the system - multiple visits can happen
b) The mobile number can also be manipulated; a different number can be associated with the same outlet
c) The device the agent uses to capture the lat-long can also be fudged - by standing closer to or farther from the building
Problem:
How can I make the lat-long data the unique identifier for finding duplicates in this huge dataset, keeping point c) above in mind?
Deploying QR codes is also not very helpful, as these can be tweaked too.
The aim is to stop the fraudulent practice by an employee (the same employee can visit the same store/outlet, or a different employee can visit the same store/outlet again, to inflate the visit count).
Right now I can only think of the Lat-Long column for making a UID; please feel free to suggest anything else.
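One way to operationalize a lat-long UID (a rough sketch, not a complete answer to the fraud problem): cluster captures that fall within a small radius of each other and use the cluster id as the location UID. Here visits, lon, and lat are hypothetical names for the table above with the Lat-Long field already split into numeric columns, and the 50 m threshold is an assumption to tune:

library(geosphere)  # distHaversine()

# pairwise great-circle distances in metres (geosphere expects lon, lat order)
d <- distm(visits[, c("lon", "lat")], fun = distHaversine)

# single-linkage clustering cut at 50 m: captures within ~50 m of each other
# collapse into one location id, which then serves as the UID
cl <- hclust(as.dist(d), method = "single")
visits$location_uid <- cutree(cl, h = 50)

For a truly huge dataset the all-pairs distance matrix won't fit in memory, so you'd swap in a spatial index or geohash-style rounding, but the grouping idea is the same.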
I'm trying to create a dplyr pipeline to filter on a text column. Imagine a data frame jobs, where I want to filter out the most-senior positions from the titles column:
titles
Chief Executive Officer
Chief Financial Officer
Chief Technical Officer
Manager
Product Manager
Programmer
Scientist
Marketer
Lawyer
Secretary
R code for filtering them out (up to 'Manager') would be...
jobs %>%
  filter(!str_detect(title, 'Chief')) %>%
  filter(!str_detect(title, 'Manager')) ...
but I want to still keep "Product Manager" in the final filtering, producing a new data frame with all of the "lower-level jobs" like
Product Manager
Programmer
Scientist
Marketer
Lawyer
Secretary
Is there a way to specify the str_detect() filter on a given value EXCEPT for one particular string?
Assume that the data frame's column has 1000s of roles with various string combinations including "Manager," but that there will always be one specific exception to filter around.
Or you could have a separate filter for "Product Manager"
library(tidyverse)
jobs %>%
filter((!str_detect(title, "Chief|Manager")) | str_detect(title, "Product Manager"))
# title
#1 Product Manager
#2 Programmer
#3 Scientist
#4 Marketer
#5 Lawyer
#6 Secretary
which can also be done in base R using grep (note this version reorders the rows, putting the "Product Manager" matches first)
jobs[c(grep("Product Manager", jobs$title),
       grep("Chief|Manager", jobs$title, invert = TRUE)), , drop = FALSE]
I have a text file (0001.txt) which contains the data as below:
<DOC>
<DOCNO>1100101_business_story_11931012.utf8</DOCNO>
<TEXT>
The Telegraph - Calcutta (Kolkata) | Business | Local firms go global
6 Local firms go global
JAYANTA ROY CHOWDHURY
New Delhi, Dec. 31: Indian companies are stepping out of their homes to try their luck on foreign shores.
Corporate India invested $2.7 billion abroad in the first quarter of 2009-2010 on top of $15.9 billion in 2008-09.
Though the first-quarter investment was 15 per cent lower than what was invested in the same period last year, merchant banker Sudipto Bose said, It marks a confidence in a new world order where Indian businesses see themselves as equal to global players.
According to analysts, confidence in global recovery, cheap corporate buys abroad and easier rules governing investment overseas had spurred flow of capital and could see total investment abroad top $12 billion this year and rise to $18-20 billion next fiscal.
For example, Titagarh Wagons plans to expand abroad on the back of the proposed Asian railroad project.
We plan to travel all around the world with the growth of the railroads, said Umesh Chowdhury of Titagarh Wagons.
India is full of opportunities, but we are all also looking at picks abroad, said Gautam Mitra, managing director of Indian Structurals Engineering Company.
Mitra plans to open a holding company in Switzerland to take his business in structurals to other Asian and African countries.
Indian companies created 3 lakh jobs in the US, while contributing $105 billion to the US economy between 2004 and 2007, according to commerce ministry statistics. During 2008-09, Singapore, the Netherlands, Cyprus, the UK, the US and Mauritius together accounted for 81 per cent of the total outward investment.
Bose said, And not all of it is organic growth. Much of our investment abroad reflects takeovers and acquisitions.
In the last two years, Suzlon acquired Portugals Martifers stake in German REpower Systems for $122 million. McNally Bharat Engineering has bought the coal and minerals processing business of KHD Humboldt Wedag. ONGC bought out Imperial Energy for $2 billion.
Indias foreign assets and liabilities today add up to more than 60 per cent of its gross domestic product. By the end of 2008-09, total foreign investment was $67 billion, more than double of that at the end of March 2007.
</TEXT>
</DOC>
Above, all the text data sits inside the <TEXT> and </TEXT> tags.
I want to read it into an R data frame with four columns, parsed as:

| Title | Author | Date | Text |
| --- | --- | --- | --- |
| The Telegraph - Calcutta (Kolkata) | JAYANTA ROY CHOWDHURY | Dec. 31 | Indian companies are stepping out of their homes to try their luck on foreign shores. Corporate India invested $2.7 billion abroad in the first quarter of 2009-2010 on top of $15.9 billion in 2008-09. Though the first-quarter investment was 15 percent lower than what was invested in the same period last year, merchant banker Sudipto Bose said, It marks a confidence in a new world order where Indian businesses see themselves as equal to global players. |
What I was trying, reading it with dplyr and readr, is shown below:
# read text file
library(dplyr)
library(readr)
dat <- read_csv("0001.txt") %>% slice(-8)
# print part of data frame
head(dat, n=2)
In the above code, I tried to skip the first few lines of the text file (which are not important) and then read the rest into a data frame.
But I could not get what I was looking for, and I'm confused about what I'm doing wrong.
Could someone please help?
To be able to read data into R as a data frame or table, the data needs to have a consistent structure maintained by separators. One of the most common formats is a file with comma separated values (CSV).
The data you're working with doesn't have separators though. It's essentially a string with minimally enforced structure. Because of this, it sounds like the question is more related to regular expressions (regex) and data mining than it is to reading text files into R. So I'd recommend looking into those two things if you do this task often.
That aside, to do what you're wanting in this example, I'd recommend reading the text file into R as a single string of text first. Then you can parse the data you want using regex. Here's a basic, rough draft of how to do that:
fileName <- "Path/to/your/data/0001.txt"
# read the whole file in as a single string
string <- readChar(fileName, file.info(fileName)$size)

df <- data.frame(
  # Title: everything before the first " | " separator
  Title = sub("\\s+[|]+(.*)", "", string),
  # Author: the all-caps byline
  Author = gsub("(.*)+?([A-Z]{2,}.*[A-Z]{2,})+(.*)", "\\2", string),
  # Date: an abbreviated month plus day, e.g. "Dec. 31"
  Date = gsub("(.*)+([A-Z]{1}[a-z]{2}\\.\\s[0-9]{1,2})+(.*)", "\\2", string),
  # Text: everything after the date and the ": " that follows it
  Text = gsub("(.*)+([A-Z]{1}[a-z]{2}\\.\\s[0-9]{1,2})+[: ]+(.*)", "\\3", string))
Output:
str(df)
'data.frame': 1 obs. of 4 variables:
$ Title : chr "The Telegraph - Calcutta (Kolkata)"
$ Author: chr "JAYANTA ROY CHOWDHURY"
$ Date : chr "Dec. 31"
$ Text : chr "Indian companies are stepping out of their homes to"| __truncated__
The reason why regex can be useful is that it allows for very specific patterns in strings. The downside is when you're working with strings that keep changing formats. That will likely mean some slight adjustments to the regex used.
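Since the file is tag-structured, another sketch worth trying is an XML parser. This assumes the file is well-formed enough for xml2 to swallow; the regex route above is more forgiving of messy input:

library(xml2)

doc  <- read_xml("Path/to/your/data/0001.txt")
text <- xml_text(xml_find_first(doc, ".//TEXT"))  # body between <TEXT> tags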
read.table(file = ..., sep = "|") will solve your issue.
My data looks like this.
AK ALASKA DEPT OF PUBLIC SAFETY 1005-00-073-9421 RIFLE,5.56 MILLIMETER
AK ALASKA DEPT OF PUBLIC SAFETY 1005-00-073-9421 RIFLE,5.56 MILLIMETER
I am looking to filter the data in multiple different ways. For example, I filter by the type of equipment, such as column 4, with the code
rifle.off <- city.data[[i]][city.data[[i]][,4]=="RIFLE,5.56 MILLIMETER",]
Where city.data is a list of matrices with data from 31 cities (so I iterate through a for loop to isolate the rifle data for each city). I would like to also filter by the number in the third column. Specifically, I only need to filter by the first two digits, i.e. I would like to isolate all line items where the number in column 3 begins with '10'. How would I modify my above code to isolate only the first two digits but let all the other digits be anything?
Edit: Providing an example of the city.data matrix as requested. First off city.data is a list made with:
city.data <- list(albuq, austin, baltimore, charlotte, columbus, dallas, dc, denver, detroit)
where each city name is a matrix. Each individual matrix is isolated by police department using:
phoenix <- vector()
for (i in 1:nrow(gun.mat)){
  if (gun.mat[i,2] == "PHOENIX DEPT OF PUBLIC SAFETY"){
    phoenix <- rbind(gun.mat[i,], phoenix)
  }
}
where gun.mat is just the original matrix containing all observations. phoenix looks like
state police.dept nsn type quantity price date.shipped name
AZ PHOENIX DEPT OF PUBLIC SAFETY 1240-01-411-1265 SIGHT,REFLEX 1 331 1 3/29/13 OPTICAL SIGHTING AND RANGING EQUIPMENT
AZ PHOENIX DEPT OF PUBLIC SAFETY 1240-01-411-1265 SIGHT,REFLEX 1 331 1 3/29/13 OPTICAL SIGHTING AND RANGING EQUIPMENT
AZ PHOENIX DEPT OF PUBLIC SAFETY 1240-01-411-1265 SIGHT,REFLEX 1 331 1 3/29/13 OPTICAL SIGHTING AND RANGING EQUIPMENT
Try this:
Take the original data you have in the first block in the question and subset it:
Rifle556 <- subset(data, data[, 4] == "RIFLE,5.56 MILLIMETER")
After that, subset again, keeping only the rows whose third column starts with "10" (the ^ anchors the pattern to the beginning of the string):
Rifle55610 <- subset(Rifle556, grepl("^10", Rifle556[, 3]))
This way you have the data subset according to your condition.
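Applied to the list-of-matrices setup from the question, the same condition can go straight into the existing row filter. A sketch; startsWith() needs R >= 3.3:

rifle.off <- city.data[[i]][city.data[[i]][, 4] == "RIFLE,5.56 MILLIMETER" &
                              startsWith(city.data[[i]][, 3], "10"), , drop = FALSE]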