SQL Query Help - Duplicate Removal - asp.net

Wasn't sure whether to put this in Software or here, so I figured I'd start here. I know this will be a straightforward answer for you SQL geniuses...
I have a table containing contacts that I import on a daily basis, and there will be an ASP.NET front end for user interaction. From this table, my intention is to send mailers - but only one to each address. So the end result is: a user enters a date (which corresponds to the date imported) and they get back a grid of all the unique addresses associated with that date. I only want to send a mailer to each address once - my original imported list often contains multiple businesses at the same address.
Table: ContactTable
Fields:
ID, CompanyName, Address, City, State, Zip, Phone
I can use SELECT DISTINCT, but I need all the data associated with it (company name, etc.).
I have over 262,000 records in this table.
If I select a sample date of 1/10/2011, I get 2401 records. SELECT DISTINCT Address for the same date gives me 2092 records. That is workable - I would send those 2092 addresses a mailer.
Secondly, I'd also have to be able to check historically whether a mailer was already sent to that address - I would not want to send another mailer to the same business tomorrow either.
What's my best way?

I would start by creating a table to look up sent mailers.
ID | DateSent
-------------
Every time you send a mailer, insert the ID and the DateTime into it; that way, when you go to pull the mailers, you can check against this table to see whether a mailer has already been sent within your specified mailing time frame. You can extend this with a mailer type column if you have multiple types of mailers.
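A minimal sketch of what that lookup table might look like (the column names come from the layout above; the data types are assumptions):
CREATE TABLE SentMailer (
    ID INT NOT NULL,            -- ContactTable.ID the mailer went to
    DateSent DATETIME NOT NULL  -- when the mailer was sent
    -- optionally add a MailerType column if you send more than one kind of mailer
)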
Plain Old SQL
SELECT a.ID, a.CompanyName, b.Address, b.City, b.State, b.Zip, a.Phone
FROM ContactTable a
INNER JOIN (SELECT MIN(ID) AS ID, Address, City, State, Zip
            FROM ContactTable
            GROUP BY Address, City, State, Zip) b
    ON a.ID = b.ID
The sub-query is like creating a temp table holding one ID per distinct address; joining it back to ContactTable pulls in the rest of the info for that single row.
To add the lookup against your new table, add the following:
SELECT a.ID, a.CompanyName, b.Address, b.City, b.State, b.Zip, a.Phone
FROM ContactTable a
INNER JOIN (SELECT MIN(ID) AS ID, Address, City, State, Zip
            FROM ContactTable
            GROUP BY Address, City, State, Zip) b
    ON a.ID = b.ID
LEFT JOIN SentMailer c
    ON a.ID = c.ID
WHERE c.ID IS NULL
   OR DATEDIFF(mm, c.DateSent, GETDATE()) > 12 -- gives you everything that hasn't been sent a mailer within the last year
Edit
Without the data being standardized it's hard to get quality results. I've found in the past that the more creative I have to get with my queries, the more likely it is a sign of bad table structure or data collection. I still think you should create a lookup table for ID/DateSent to manage the time frames for sending.
Edit
Yes, I'm basically looking for the unique address, city, state, zip. I would only require one instance of each address so we would be able to send a mailer to that address. At this point, company name would not be required.
If this is the case you can simply do the following:
SELECT DISTINCT Address, City, State, Zip, Phone
FROM ContactTable
Keep in mind this won't scrub entries like Main Street vs Main St.

RogueSpear, I work in the address verification (and thus de-duplication) field for SmartyStreets, where we deal with this scenario a lot.
If you're getting daily lists from a company and have hundreds of thousands of records, then removing duplicate addresses with stored procedures or plain queries won't be enough to match the varying ways each address can be written. There are services that do this, and I'd point you to CASS-Certified vendors that provide them.
You can flag duplicates in a table using something like CASS-Certified Scrubbing, or you can prevent duplicates at the point of entry with an API like LiveAddress. Either way, I'd be happy to personally help you with any other address questions.

I would select, then remove, the duplicates like this:
SELECT a.ID, a.PurgedID, a.CAMPAIGNTYPE, a.COMPANY, a.DBANAME, a.COADDRESS, a.COCITY, a.COSTATE, a.COZIP, a.FIRSTNAME1, a.DIALERPHONENUM, a.Purged
FROM PurgeReportDetail a
WHERE EXISTS (
    SELECT * FROM PurgeReportDetail b
    WHERE b.COADDRESS = a.COADDRESS
    AND b.COCITY = a.COCITY
    AND b.COSTATE = a.COSTATE
    AND b.COZIP = a.COZIP
    AND b.ID <> a.ID
) -- This clause only includes rows that have a duplicate on the noted columns
AND a.ID NOT IN (
    SELECT TOP 1 c.ID FROM PurgeReportDetail c
    WHERE c.COADDRESS = a.COADDRESS
    AND c.COCITY = a.COCITY
    AND c.COSTATE = a.COSTATE
    AND c.COZIP = a.COZIP
    ORDER BY c.ID -- If you want the *newest* entry to be saved, add "DESC" here
) -- This clause excludes the top 1 ID of each matching set, so only the extra copies are returned
or something like this.
This will keep the first ID of each redundant address; just replace the SELECT list with DELETE a when you're ready to remove the duplicates.
EDIT: Of course this will only work on exact matches.
EDIT2: If you want to check only addresses you haven't already mailed, join both to a table of sent mailers restricted to the date range you care about.
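Roughly, that check could look like the sketch below - it assumes the SentMailer (ID, DateSent) lookup table suggested in the first answer, which is not part of this table's actual schema:
SELECT a.ID, a.COMPANY, a.COADDRESS, a.COCITY, a.COSTATE, a.COZIP
FROM PurgeReportDetail a
LEFT JOIN SentMailer s
    ON s.ID = a.ID
    AND s.DateSent >= DATEADD(mm, -12, GETDATE()) -- only consider mailers from the last year
WHERE s.ID IS NULL -- keep rows with no mailer sent in that range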

Related

SQLITE3 want to find LIKE data within table using long list

I have a SQLITE3 table with over 1mil rows and multiple columns, one of which is 'email_address'.
I also have a separate list of about 200k web domains.
I want to find all rows from my table that have email addresses with these domains.
I can figure out how to do it individually with "select * from table where email_address like '%domain';"
But does anyone know how I can do this on a bulk level please?
Just guessing the names of the tables and columns.
Use a join between the 2 tables:
SELECT d.domain, e.email_address
FROM domains as d
INNER JOIN emails as e
ON e.email_address LIKE '%' || d.domain
You can also do it without LIKE. Something like:
select emailAddr, domainName
from email
JOIN domains on substr(emailAddr, instr(emailAddr, '@') + 1) = domainName
And you can create an index on the domain part of the email address, something like:
CREATE INDEX emailAddr_idx ON email ( substr(emailAddr, instr(emailAddr, '@') + 1) )
I don't have a big enough data set to test its efficacy/impact.
Was able to get this to work for me:
create table final as select * from alladdresses join alldomains on substr(email_address, instr(email_address, '@') + 1) = domain;
This created a new table with all the alladdresses info that matched the domains I had, plus an extra column with just the domain.
Thanks to those who got me on the right path!

Last Function in Query

So I currently have a database that keeps track of projects, project updates, and the update dates. I have a form with a subform that displays the project name and the most recent update made to that project. It was brought to my attention, however, that the most recent update to a project does not display correctly. Ex: it shows the update date of 4/6/2017 but the actual update text is from 3/16/2017.
Doing some spot research, I then learned that Access does not store records in any particular order, and that the Last function does not actually give you the last record.
I am currently scouring Google for a solution but to no avail so far, and have turned here in hopes of a solution or idea. Thank you in advance for any insight you can provide!
Other details:
tblProjects has fields
ID
Owner
Category_ID
Project_Name
Description
Resolution_Date
Priority
Resolution_Category_ID
tblUpdates has these fields:
ID
Project_ID
Update_Date
Update
There is no built-in Last function that I am aware of in Access or VBA - where exactly are you seeing that used?
If your sub-form is bound directly to tblUpdates, then you ought to be able to just sort the sub-form in descending order based on either ID or Update_Date.
If you have a query joining the two tables and are only trying to get a single row back from tblUpdates, then the following would do that, assuming the ID column in tblUpdates is an autonumber. If not, just replace ORDER BY ID with ORDER BY Update_Date DESC.
SELECT a.*,
(SELECT TOP 1 b.[Update] FROM tblUpdates b WHERE a.ID = b.Project_ID ORDER BY b.ID DESC) AS last_update
FROM tblProjects AS a;

SQLite: SELECT from grouped and ordered result

I'm new to SQL(ite), so I'm sorry if there is a simple answer and I was just too stupid to find the right search terms for it.
I've got 2 tables: one for user information and another holding the points a user achieved. It's a simple one-to-many relation (a user can achieve points multiple times).
table1 contains "userID" and "Username" ...
table2 contains "userID" and "Amount" ...
Now I want to get the high-score rank for a given username.
To get the high scores I did:
SELECT Username, SUM(Amount) AS total FROM table2 JOIN table1 USING (userID) GROUP BY Username ORDER BY total DESC
How could I select a single Username and get its position from the grouped and ordered result? I have no idea what a subselect for this would have to look like. Is it even possible in a single query?
You cannot calculate the position of the user without referencing the other data. SQLite does not have a ranking function, which would be ideal for your use case, nor does it have a row number feature that would serve as an acceptable substitute.
I suppose the closest you could get would be to drop this data into a temp table with an incrementing ID, but I think that would get very messy.
It's best to handle this within the application. Get all the users and calculate rank. Cache individual user results as necessary.
Without knowing anything more about the operating context of the app/DB it's hard to provide a more specific recommendation.
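(One caveat to the above: SQLite 3.25 and later do ship window functions, so if you can rely on a new enough version, a sketch like the following would return the rank directly. The table and column names are taken from the question; everything else is an assumption about your setup.)
SELECT Username, total, rnk
FROM (
    SELECT Username,
           SUM(Amount) AS total,
           RANK() OVER (ORDER BY SUM(Amount) DESC) AS rnk
    FROM table2
    JOIN table1 USING (userID)
    GROUP BY Username
) AS ranked
WHERE Username = ?;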
For a specific user, this query gets the total amount:
SELECT SUM(Amount)
FROM Table2
WHERE userID = ?
Then you count how many users have a total at least as high as that single user's total; that count is the user's rank:
SELECT COUNT(*)
FROM table1
WHERE (SELECT SUM(Amount)
FROM Table2
WHERE userID = table1.userID)
>=
(SELECT SUM(Amount)
FROM Table2
WHERE userID = ?);

Retrieving data from two tables

I have two tables, vendors and customers. My problem is that I want to search by name and, as a result, get the matching data from both tables (if any exists).
Example: if I search for the name John and it exists in both tables, then the result must show gridviews for both tables.
Your question is not quite clear to me. Are you looking for a SQL statement?
Try with:
SELECT * FROM Vendors LEFT JOIN Customers ON Vendors.Name = Customers.Name
or
SELECT * FROM Vendors WHERE Vendors.Name IN (SELECT Customers.Name FROM Customers)
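If what you need is literally two gridviews, one per table, another option is simply one query per grid with the same search value - a sketch only, where @searchName is a made-up parameter name and Name is the column used in the queries above:
SELECT * FROM Vendors WHERE Name = @searchName
SELECT * FROM Customers WHERE Name = @searchName
Each result set can then be bound to its own gridview.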

Efficiently finding unique values in a database table

I've got a database table with a very large number of rows. This table represents messages that are logged by a system. Each message has a message type, and this is stored in its own field in the table. I'm writing a website for querying this message log. If I want to search by message type, then ideally I would want a drop-down box listing the message types that have come up in the database. Message types may change over time, so I can't hard-code the types into the drop-down; I'll have to do some sort of lookup. Iterating over the entire table contents to find unique message values is obviously very stupid; however, being stupid in the database field, I'm here asking for a better way. Perhaps a separate lookup table, which the database occasionally updates, listing just the unique message types I could populate my drop-down from, would be a better idea.
Any suggestions would be much appreciated.
The platform I'm using is ASP.NET MVC and SQL Server 2005
Use a separate lookup table, with the ID of the message type stored in your log. This will reduce the size and increase the efficiency of the log. It would also normalize your data.
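A minimal sketch of that idea - the table and column names here are assumptions, not something given in the question:
-- Lookup table holding each message type exactly once
CREATE TABLE MessageTypes (
    MessageTypeID INT IDENTITY(1,1) PRIMARY KEY,
    MessageType   VARCHAR(100) NOT NULL UNIQUE
)
-- The log table then stores MessageTypeID instead of repeating the type text on every row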
Yep, I would definitely go with the separate lookup table. You can then populate it using something like:
INSERT TypeLookup (Type)
SELECT DISTINCT Type
FROM BigMassiveTable
You could then run a top-up job periodically to pull in new types from your main table that don't already exist in the lookup table.
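That top-up job could be as simple as the following sketch, reusing the made-up names from the snippet above:
INSERT TypeLookup (Type)
SELECT DISTINCT Type
FROM BigMassiveTable
WHERE Type NOT IN (SELECT Type FROM TypeLookup)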
SELECT DISTINCT message_type
FROM message_log
is the most straightforward but not very efficient way.
If you have a list of types that can possibly appear in the log, use this:
SELECT message_type
FROM message_types mt
WHERE message_type IN
(
SELECT message_type
FROM message_log
)
This will be more efficient if message_log.message_type is indexed.
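If that index doesn't already exist, it is a one-liner (the index name here is arbitrary):
CREATE INDEX IX_message_log_message_type ON message_log (message_type)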
If you don't have this table but want to create one, and message_log.message_type is indexed, use a recursive CTE to emulate loose index scan:
WITH rows (message_type) AS
(
    SELECT MIN(message_type)
    FROM message_log
    UNION ALL
    SELECT (
        SELECT MIN(message_type)
        FROM message_log
        WHERE message_type > r.message_type
    )
    FROM rows r
    WHERE r.message_type IS NOT NULL
)
SELECT message_type
FROM rows r
WHERE message_type IS NOT NULL
OPTION (MAXRECURSION 0)
I just wanted to state the obvious: normalize the data.
message_types
message_type | message_type_name
messages
message_id | message_type
Then you can just do the following, without any cached DISTINCT:
For your dropdown
SELECT * FROM message_types
For your retrieval
SELECT * FROM messages WHERE message_type = ?
Or, joining to get the type name as well:
SELECT m.*, mt.message_type_name
FROM messages AS m
JOIN message_types AS mt
    ON (m.message_type = mt.message_type)
I'm not sure why you would want a cached DISTINCT that you'd have to keep updated, when you can slightly tweak the schema and have the same thing with RI (referential integrity).
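The RI in question is just a foreign key from messages to message_types; a minimal sketch using the names above, assuming message_type is the primary key of message_types:
ALTER TABLE messages
    ADD CONSTRAINT FK_messages_message_types
    FOREIGN KEY (message_type) REFERENCES message_types (message_type)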
Create an index on the message type:
CREATE INDEX IX_Messages_MessageType ON Messages (MessageType)
Then to get a list of unique Message Types, you run:
SELECT DISTINCT MessageType
FROM Messages
ORDER BY MessageType
Because the index is physically sorted in order of MessageType, SQL Server can very quickly and efficiently scan through the index, picking up a list of unique message types.
It doesn't perform badly - this is what SQL Server is good at.
Admittedly, you can save some space by having a "message types" table. And if you only display a few messages at a time, then the bookmark lookup, as it joins back to the MessageTypes table, won't be a problem. But if you start displaying hundreds or thousands of messages at a time, then the join back to MessageTypes can get pretty expensive and needless, and it will be faster to have the MessageType stored with the message.
But I would have no problem with creating an index on the MessageType column and selecting distinct - SQL Server loves that sort of thing. If you're finding it to be a real load on your server once you're getting dozens of hits a second, then follow the other suggestion and cache the list in memory.
My personal solution would be:
create the index
select distinct
and if I still had problems
cache in memory that expires after 30 seconds
As for the normalized/denormalized issue: normalizing saves space, at the cost of CPU when joins are constantly performed. But the logical point of normalization is to avoid duplicate data, which can lead to inconsistent data.
Are you planning on changing the text of a message type? If you stored it with the messages, you would have to update all rows.
Or is there something to be said for the fact that, at the time of the message, the message type was "Client response requested"?
Have you considered an indexed view? Its result set is materialized and persisted in storage, so the overhead of the lookup is separated from the rest of whatever you're trying to do.
SQL Server takes care of automagically updating the view when there is a data change which, in its opinion, would change the contents of the view, so in this respect it's less flexible than Oracle's materialized views.
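A rough sketch of such an indexed view for the distinct types - the table and column names are assumptions matching the earlier examples, and SQL Server requires SCHEMABINDING and COUNT_BIG(*) for an indexed view with GROUP BY:
CREATE VIEW dbo.vMessageTypes WITH SCHEMABINDING AS
    SELECT MessageType, COUNT_BIG(*) AS MessageCount
    FROM dbo.Messages
    GROUP BY MessageType
GO
CREATE UNIQUE CLUSTERED INDEX IX_vMessageTypes ON dbo.vMessageTypes (MessageType)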
The MessageType should be a Foreign Key in the main table to a definition table containing the message type codes and descriptions. This will greatly increase your lookup performance.
Something like
DECLARE @MessageTypes TABLE(
    MessageTypeCode VARCHAR(10),
    MessageTypeDescription VARCHAR(100)
)
DECLARE @Messages TABLE(
    MessageTypeCode VARCHAR(10),
    MessageValue VARCHAR(MAX),
    MessageLogDate DATETIME,
    AdditionalNotes VARCHAR(MAX)
)
From this design, your lookup should only query MessageTypes
As others have said, create a separate table of message types. When you add a record to the message table, check whether the message type already exists in the type table; if not, add it. In either case, then store the identifier from the message type table in the message table. This should give you normalized data. Yes, it's a little extra time when you add a record, but it should be more efficient on retrieval.
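That add-time check might look roughly like this - a sketch only, with made-up table, column, and parameter names, assuming the type table has an identity key:
DECLARE @TypeID INT
SELECT @TypeID = MessageTypeID FROM MessageTypes WHERE MessageType = @MessageType
IF @TypeID IS NULL
BEGIN
    INSERT INTO MessageTypes (MessageType) VALUES (@MessageType)
    SET @TypeID = SCOPE_IDENTITY()
END
INSERT INTO Messages (MessageTypeID, MessageText, LoggedAt)
VALUES (@TypeID, @MessageText, GETDATE())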
If there are a lot more adds than reads, and if the "message type" is short, an entirely different approach would be to still create the separate message type table, but not reference it when doing adds, and only update it lazily, on demand.
Namely: (a) include a time-stamp in each message record; (b) keep a list of the message types found as of the last time you checked; (c) each time you check, search for any new message types added since the last time, as in:
SELECT DISTINCT message_type
INTO #temp_new_types
FROM message
WHERE timestamp > @last_type_check;

INSERT INTO message_type_list (message_type)
SELECT message_type
FROM #temp_new_types
WHERE message_type NOT IN (SELECT message_type FROM message_type_list);

DROP TABLE #temp_new_types;
Then store the timestamp of this check somewhere so you can use it the next time around.
The answer is to use DISTINCT, but the best solution differs with the size of the table. Thousands of rows, millions, billions, or more? Those call for very different approaches.
