Fix spelling errors in unique identifiers of data - bigdata

I have 6,000 items (just a sample of some 200,000 entries). The unique identifier is a company name (not my choice). There are spelling mistakes in the company names. I'm using the Levenshtein distance algorithm to decide whether one company name is, say, 90% similar to another; if so, I combine the entries. If I compare every company name against every other company name, I have 6,000^2 iterations, which takes over ten minutes. The entries are stored in a C++ std::map, where the company names are the keys and the associated data is the value. Any ideas on how I can accurately decide whether two company names might be the same, allowing for small spelling errors or abbreviations, without a nested for loop?
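One common way to avoid the full nested loop is blocking or a sorted-neighborhood pass: sort the keys once, then only compare each name against a small window of its sorted neighbors, since small spelling errors usually leave names adjacent in sort order. A minimal Python sketch (the company names and window size are invented for illustration, and difflib's ratio stands in for a Levenshtein percentage):

```python
# Sorted-neighborhood dedup: O(n log n + n * window) instead of O(n^2).
from difflib import SequenceMatcher

def similar(a, b, threshold=0.9):
    """Return True if the two strings are at least `threshold` similar."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def dedup_sorted_neighborhood(names, window=5):
    """Sort the names, then compare each one only to its `window`
    predecessors in sort order; near-duplicate spellings usually
    end up adjacent, so most true pairs fall inside the window."""
    ordered = sorted(names)
    canonical = {}  # maps each name to its chosen representative
    for i, name in enumerate(ordered):
        match = None
        for j in range(max(0, i - window), i):
            if similar(name, ordered[j]):
                match = canonical[ordered[j]]
                break
        canonical[name] = match if match is not None else name
    return canonical

groups = dedup_sorted_neighborhood(
    ["Acme Corporation", "Acme Corporatoin", "Globex", "Initech"])
```

Since a std::map already keeps its keys in sorted order, the same windowed comparison can be done in C++ with a single pass over the map, no separate sort required. The trade-off of any blocking scheme is that a pair whose misspelling changes the sort position drastically (e.g. a typo in the first letter) can be missed; a second pass over a different key, such as the reversed string, reduces that risk.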

Related

How can I get the correlation between features and target variables using a user-defined function in Python?

I have two different datasets. One dataset has all the feature columns, like consumer price and GDP.
The other dataset has information on different customers' orders. I want to find the correlation of each customer's orders with the features. At the end I want to store the information in a dataframe where one column contains the customer name, the second column the feature name, and the third the correlation value.
I would be grateful if anyone could help me with this.
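Once the two datasets are aligned on the same time periods, the long-format result is a simple double loop over customers and features. A plain-Python sketch (the customer names, feature names, and numbers below are all invented; with pandas, the equivalent is merging the frames and using DataFrame.corr):

```python
# Build a (customer, feature, correlation) table from two aligned datasets.
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# One series per feature, aligned by time period (illustrative values)
features = {"gdp": [1.0, 2.0, 3.0, 4.0],
            "consumer_price": [4.0, 3.0, 2.0, 1.0]}
# One order series per customer, aligned to the same periods
orders = {"alice": [10.0, 20.0, 30.0, 40.0],
          "bob": [5.0, 5.0, 6.0, 8.0]}

# Long-format rows: customer name, feature name, correlation value
rows = [(cust, feat, pearson(series, fvals))
        for cust, series in orders.items()
        for feat, fvals in features.items()]
```

Each tuple in `rows` corresponds to one row of the desired three-column dataframe.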

Trying to create a list of locations without any duplicates

So I've got a table that is used largely for inventory purposes. There's a location, part number, length (a single part can have multiple lengths), user, etc..
How the system is supposed to work is that one person scans the parts and lengths; once that's done, a second and a third person come and scan the parts in succession.
I'm trying to create a list of locations in which no part/length combination has multiple scans. So if any part/length combination has been scanned more than once, then that entire location is thrown out and excluded from the final list.
I've been racking my brain and this seems like a simple query but I can't seem to find something that works.
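One way to express "throw out any location where some part/length combination was scanned more than once" is a GROUP BY ... HAVING COUNT(*) > 1 subquery that collects the bad locations, and an outer query that excludes them. A runnable SQLite sketch (the table and column names are guesses, since the real schema isn't shown):

```python
# Find locations with no duplicate part/length scans, in an in-memory DB.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scans (location TEXT, part TEXT, length REAL)")
conn.executemany(
    "INSERT INTO scans VALUES (?, ?, ?)",
    [("A", "p1", 10.0),   # location A: every combination scanned once
     ("A", "p2", 10.0),
     ("B", "p1", 10.0),   # location B: p1/10.0 scanned twice -> excluded
     ("B", "p1", 10.0)])

# A location qualifies only if none of its part/length combinations
# appears more than once.
good_locations = [row[0] for row in conn.execute("""
    SELECT DISTINCT location FROM scans
    WHERE location NOT IN (
        SELECT location
        FROM scans
        GROUP BY location, part, length
        HAVING COUNT(*) > 1
    )
    ORDER BY location
""")]
```

If the intended rule is instead "more scans than the expected three people," the HAVING clause becomes `HAVING COUNT(*) > 3` with the rest unchanged.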

R RecordLinkage Identity

I am working with RecordLinkage Library in R.
I have a data frame with id, name, phone, mail
My code looks like this:
ids = data$id
pairs = compare.dedup(data, identity = ids, blockfld = list(2, 3, 4))
The problem is that my ids are not the same in my result output
so if I had this data:
id Name Phone Mail
233 Nathali 2222 nathali#dd.com
435 Nathali 2222
553 Jean 3444 jean#dd.com
In my result output I will have something like
id1 id2
1 2
Instead of
id1 id2
233 435
I want to know if there is a way to keep the ids instead of the index, or if someone could explain the identity parameter to me.
Thanks
The identity vector tells the getPairs method which of the input records belong to the same entity. It actually holds the information that you usually want to gain from record linkage, i.e. you have a couple of records and do not know in advance which of them belong together. However, when you use a training set to calibrate a method, or when you want to evaluate the accuracy of record linkage methods (the package was mainly written for this purpose), you start with an already deduplicated or linked data set.
In your example, the first two rows (ids 233, 435) obviously refer to the same person and the third row to a different one. A meaningful identity vector would therefore be:
c(1,1,2)
But it could also be:
c(42,42,128)
Just make sure that the identity vector has identical values exactly at those positions where the corresponding table rows hold matching records (vector index = row index).
About your question on how to display the ids in the result: you can get the full record pairs, including all data fields, with the following (see the documentation for more details):
getPairs(pairs)
There might be better ways to get hold of the original ids, depending on how you further process the record pairs (e.g. running a classification algorithm). Extend your example if you need more advice on this.
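Whatever the downstream processing, translating result row indices back to original ids is just a lookup into the id column. Sketched in Python with the three-row example from the question (R's row indices are 1-based, hence the offset):

```python
# Map matched index pairs back to the original ids from the id column.
ids = [233, 435, 553]    # the data frame's id column, in row order
index_pairs = [(1, 2)]   # matched pairs as 1-based row indices

id_pairs = [(ids[i - 1], ids[j - 1]) for i, j in index_pairs]
```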
p.s.: I am one of the package authors. I have only very recently become aware that people ask questions about the package on Stack Overflow, so please excuse that a couple of questions have been around unanswered for a long time. I will look for a way to get notified on new questions posted here, but I would also like to mention that people can contact us directly via one of the email addresses listed in the package information.
You have to replace the index column with your identity column.

Creating a Roster Database in SQLite

I am dealing with a roster with 15,000 unique employees. Depending on their 'Designation' they either impact performance or do not. The issue is, these employees could change their designation any day. The roster is as simple as this:
AgentID
AgentDesignation
Date
I feel like I would be violating some Normalization rules if I just have duplicate values (the agent has the same designation from the previous day, for example). Would I really want to create a new row for each date even if the Designation is the same? I want to always be able to get the agent's correct designation on a particular date.
All calculations are done with Excel, probably with Vlookup. Anyone have some tips?
The table structure you propose would not be a violation of normalization -- it contains a PRIMARY KEY (AgentID, Date) and a single attribute that is dependent on all elements of the key (AgentDesignation). Furthermore, it's easy (using the PRIMARY KEY constraint) to ensure that there is one-and-only-one designation per agent per day. The fact that many PRIMARY KEY values will yield the same dependent value does not mean the database is not correctly normalized.
An alternative approach using date ranges would likely result in fewer rows but guaranteeing integrity would be harder and searches for a particular value would be costlier.
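As a concrete illustration of the composite key at work, here is the proposed table in SQLite (the agent ids, dates, and designations are invented):

```python
# One row per agent per day; the composite primary key enforces
# one-and-only-one designation per agent per date.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Roster (
        AgentID          INTEGER,
        Date             TEXT,
        AgentDesignation TEXT,
        PRIMARY KEY (AgentID, Date)
    )
""")
conn.executemany(
    "INSERT INTO Roster VALUES (?, ?, ?)",
    [(1, "2024-01-01", "Support"),
     (1, "2024-01-02", "Support"),   # same designation, new day: a new row
     (1, "2024-01-03", "Sales")])

# A second row for the same agent and day is rejected by the key
try:
    conn.execute("INSERT INTO Roster VALUES (1, '2024-01-02', 'Sales')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Look up an agent's designation on a particular date
row = conn.execute(
    "SELECT AgentDesignation FROM Roster WHERE AgentID = ? AND Date = ?",
    (1, "2024-01-03")).fetchone()
```

The repeated "Support" value across consecutive days is expected and harmless; it is attribute data, not a key, so the duplication violates no normal form.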

Matching unique ids in two different databases

I have two different databases that are not connected in any way. In fact, one is a public school database and one is a HUD (housing) database. By law they are not allowed to share names and other specific identifying information; birthdates and addresses are okay, along with zip codes and other more general identifiers. The users need to be able to query the other database to get non-specific information, so it would appear that they need to share the same unique id. I was considering such things as using birthdates and perhaps the initials of the name, or perhaps the last four digits of the SSN along with the birthdate. The client was thinking of global positioning data, but I'm concerned about apartments next to one another or families moving. Any ideas?
First you need to determine what will be your measure of uniqueness. If either database contains two people who share the same value for that measure, you need to change your strategy. After that, put a constraint on both databases enforcing that these properties (birthdate, SSN) are what make a person record unique.
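One way to realize a shared id from such fields is for both systems to derive it deterministically, for example by hashing the agreed attributes with a shared salt. A Python sketch (the choice of birthdate plus last four SSN digits follows the question; the salt and sample values are made up, and note that low-entropy inputs like these remain vulnerable to brute-force guessing even when hashed):

```python
# Derive the same opaque shared id in both databases from agreed fields.
import hashlib

def shared_id(birthdate: str, ssn_last4: str,
              salt: str = "agreed-secret") -> str:
    """Hash the agreed identifying fields with a shared salt so the
    raw values never appear in the exchanged key."""
    material = f"{salt}|{birthdate}|{ssn_last4}"
    return hashlib.sha256(material.encode("utf-8")).hexdigest()

# Both databases compute the id independently and get the same value
school_id = shared_id("1990-04-12", "6789")
housing_id = shared_id("1990-04-12", "6789")
```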
