I have an old application I am modernizing and bringing to AWS. I will be using DynamoDB for the database and am looking to go with a single table design. This is a multitenant application.
The application will consist of Organisations, Outlets, Customers and Transactions.
Everything stems from an organisation: an organisation can have multiple outlets, an outlet can have multiple customers, and a customer can have multiple transactions.
Access patterns are expected to be as follows:
Fetch a customer by its ID
Search for a customer by name or email
Get all customers for a given outlet
Get all transactions for a customer
Get all transactions for an outlet
Get all transactions for an outlet during a given time period (timestamps will be stored with each transaction)
Get all outlets for a given organisation
Get an outlet by its ID
I've been reading into single table designs and utilizing the primary key and sort keys to enable this sort of access but right now I can't quite figure out the table/schema design.
The customer will have the OutletID and OrganisationID attached, so I should always know those IDs.
Data Structure (can be modified)
Organisations:
id
Name
Owner
List of Outlets
createdAt (timestamp)
Outlets:
OrganisationId
Outlet Name
Number of customers
Number of transactions
createdAt (timestamp)
Customers:
id
OrganisationID
OutletID
firstName
lastName
email
total transactions
total spent
createdAt (timestamp)
Transactions:
id
customerID
OrganisationID
OutletID
createdAt (timestamp)
type
value
You're off to a great start by having a thorough understanding of your entities and access patterns! I've taken a stab at modeling for these access patterns, but keep in mind this is not the only way to model a solution. Data modeling in DynamoDB is iterative, so it is quite likely that this specific design will not fit 100% of your use cases.
With that disclaimer out of the way, let's get into it!
I've modeled your access patterns using a single table named data with global secondary indexes (GSIs) named GSI1 and GSI2. Each GSI has partition and sort keys named GSI#PK and GSI#SK respectively, where # is the index number (for example, GSI1PK and GSI1SK).
The base table models the following access patterns:
Fetch customer by ID: getItem where PK=CUST#<id> and SK = A
Fetch all transactions for a customer: query where PK=CUST#<id> and SK begins_with TX
Fetch an outlet by ID: getItem where PK=OUT#<id> and SK = A
Fetch all customers for an outlet: query where PK=OUT#<id>#CUST
That last access pattern may require a bit more explanation. I've chosen to model the relationship between outlets and customers using a unique PK/SK pattern where PK is OUT#<id>#CUST and SK is CUST#<id>. When your application records a transaction for a particular customer, it can insert two records in DynamoDB using a single batch write, which would perform two operations:
Write a new Transaction into the Customer partition (e.g. PK = CUST#1 and SK = TX#<id>)
Write a new record to the CUSTOMERLIST partition (e.g. PK = OUT#<id>#CUST and SK = CUST#<id>). If this record already exists, DynamoDB will just overwrite the existing record, which is fine for your use case.
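As a minimal sketch of that double write, the two items could be built as shown below. The helper name and every attribute other than PK and SK are assumptions made for illustration; only the key patterns come from the design above. The returned list is what you would hand to a batch write.

```python
def transaction_write_items(customer_id, outlet_id, tx_id, created_at, value):
    """Build the two items written when a transaction is recorded (hypothetical helper)."""
    # The transaction itself lives in the customer's partition.
    transaction = {
        "PK": f"CUST#{customer_id}",
        "SK": f"TX#{tx_id}",
        "createdAt": created_at,  # assumed attribute name
        "value": value,           # assumed attribute name
    }
    # Marker item in the outlet's customer list; overwriting it is harmless.
    customer_marker = {
        "PK": f"OUT#{outlet_id}#CUST",
        "SK": f"CUST#{customer_id}",
    }
    return [transaction, customer_marker]


items = transaction_write_items("1", "7", "1001", "2021-11-15T10:30:00Z", 25)
```

Because the marker item has the same keys every time, repeating the write for a returning customer leaves exactly one marker per customer in the outlet's list.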
Moving onto GSI1:
GSI1 supports the following operations:
Fetch outlets by organization: query GSI1 where GSI1PK = ORG#<id>
Fetch transactions by outlet: query GSI1 where GSI1PK = OUT#<id>
Fetch transactions by outlet for a given time period: query GSI1 where GSI1PK=OUT#<id> and GSI1SK between <period1> and <period2>
And finally, there's GSI2:
GSI2 supports the following operations:
Fetch transactions by organization: query GSI2 where GSI2PK = ORG#<id>
Fetch transactions by organization for a given time period: query GSI2 where GSI2PK=ORG#<id> and GSI2SK between <period1> and <period2>
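The time-period queries rely on the sort key ordering. The sketch below mimics that key condition in memory rather than calling DynamoDB. It assumes GSI1SK holds the transaction's createdAt as an ISO-8601 UTC string (the answer does not say what the sort key contains), since such strings sort lexicographically in chronological order. The function name and sample data are hypothetical.

```python
def query_between(items, pk_attr, pk, sk_attr, start, end):
    """Mimic partition-key equality plus an inclusive BETWEEN on the sort key."""
    matches = [
        item for item in items
        if item.get(pk_attr) == pk and start <= item.get(sk_attr, "") <= end
    ]
    # A real query returns items ordered by sort key; do the same here.
    return sorted(matches, key=lambda item: item[sk_attr])


transactions = [
    {"PK": "CUST#1", "SK": "TX#1001", "GSI1PK": "OUT#7", "GSI1SK": "2021-11-14T09:00:00Z"},
    {"PK": "CUST#2", "SK": "TX#1002", "GSI1PK": "OUT#7", "GSI1SK": "2021-11-15T12:00:00Z"},
    {"PK": "CUST#3", "SK": "TX#1003", "GSI1PK": "OUT#8", "GSI1SK": "2021-11-15T13:00:00Z"},
]

# All transactions for outlet 7 on 15 November 2021.
november_15 = query_between(
    transactions, "GSI1PK", "OUT#7", "GSI1SK",
    "2021-11-15T00:00:00Z", "2021-11-15T23:59:59Z",
)
```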
For your final access pattern, you've asked to support searching for customers by email or name. DynamoDB is really good at finding items by their primary key. DynamoDB is not good for search, where fuzzy or partial matches are expected. If you need an exact match on email or name, you could do that in DynamoDB by incorporating the email/name in the primary key of the Customer item.
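For the exact-match case, one possible key builder is sketched here. The EMAIL# prefix and the normalisation step are assumptions, not part of the design above; the point is only that the stored key and the lookup key must be built the same way.

```python
def customer_email_key(email):
    """Build a key value for exact-match lookup by email (hypothetical EMAIL# prefix)."""
    # Normalise so " Jo@Example.com " and "jo@example.com" map to the same key.
    return f"EMAIL#{email.strip().lower()}"


key = customer_email_key(" Jo@Example.com ")
```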
I hope this gives you some ideas on how to model your access patterns!
I'm a little lost with DynamoDB table definition and KeySchema. Here's what I want to achieve:
I'm creating a table to store reporting information. This reporting will be in the following format:
itemId, accountId, date, typeOfMetric, metric1, metric2, metric3
At the moment I expect typeOfMetric to be monthlyReport or dailyData, for example. accountId is for users who are grouped into accounts, so each account can access its own data.
Typically, I'm thinking of querying the table this way:
get all items with accountId=123 and typeOfMetrics=daily
get one item with accountId=123 and typeOfMetrics=daily and date=2021-11-15
I'm a little lost with the key schema and the indexes I should create, so any help is very welcome!
We can choose accountId as the partition key and date as the sort key. This will let us query over a date range.
I guess this will cater for your requests for typeOfMetric: daily.
And if we are looking for monthly data, we can query over the dates in that month.
If this doesn't fit your use case, let us know.
I ended up doing the following:
accountId -> partition key
SK -> sort key
This allows items like:
accountId REPORTDAY#2021-01-21 stats1, stats2
accountId REPORTMONTH#2021-01 stats1, stats2
These can then be queried with begins_with.
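To illustrate how the prefixed sort key serves both query shapes, here is a small in-memory sketch of a begins_with condition. It does not call DynamoDB; the function names and sample rows are hypothetical, and only the REPORTDAY# and REPORTMONTH# prefixes come from the post above.

```python
def report_sort_key(kind, date):
    """kind is 'day' or 'month'; date is 'YYYY-MM-DD' or 'YYYY-MM'."""
    prefix = {"day": "REPORTDAY#", "month": "REPORTMONTH#"}[kind]
    return prefix + date


def query_begins_with(items, account_id, sk_prefix):
    """Mimic a query on accountId with a begins_with condition on the sort key."""
    return [
        item for item in items
        if item["accountId"] == account_id and item["SK"].startswith(sk_prefix)
    ]


rows = [
    {"accountId": "123", "SK": report_sort_key("day", "2021-01-21"), "stats1": 10},
    {"accountId": "123", "SK": report_sort_key("day", "2021-02-03"), "stats1": 12},
    {"accountId": "123", "SK": report_sort_key("month", "2021-01"), "stats1": 300},
]

all_daily = query_begins_with(rows, "123", "REPORTDAY#")
january_daily = query_begins_with(rows, "123", "REPORTDAY#2021-01")
```

Fetching a single day needs no prefix match at all: the full sort key, such as REPORTDAY#2021-01-21, identifies exactly one item.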
In BigQuery, I have created the following query against a partitioned table whose original source is Google Analytics data. The goal is to get the number of sessions, the product revenue and the shipping costs. Note that in the current setup I can't use the 'aggregated' fields like totals.visits.
SELECT c_country AS country, date As Date, COUNT(DISTINCT CONCAT(CAST(fullVisitorId AS STRING),CAST(Visitid AS STRING), CAST(visitStartTime AS STRING))) AS Sessions,
(SELECT SUM(product.productRevenue)/1000000 FROM t.hits as hits, hits.product AS product) AS Product_Revenue, (SELECT SUM(hits.transaction.transactionShipping)/1000000 FROM t.hits AS hits) AS Shipping_Costs
FROM `xx.yy.zz` as t
WHERE c_date BETWEEN "2019-11-06" AND "2019-11-06"
GROUP BY c_country, date
Now, the following error message appears:
"Correlated aliases referenced in the from clause must refer to arrays
that are valid to access from the outer query, but t refers to an
array that is not valid to access after GROUP BY or DISTINCT in the
outer query at [2:50]"
Does anyone know how to adjust the query so that the query executes without issues?
I'm new to SQL(ite), so I'm sorry if there is a simple answer that I just couldn't find the right search terms for.
I have 2 tables: one for user information and another holding the points a user has achieved. It's a simple one-to-many relation (a user can achieve points multiple times).
table1 contains "userID" and "Username" ...
table2 contains "userID" and "Amount" ...
Now I want to get a highscore rank for a given username.
To get the highscore I did:
SELECT Username, SUM(Amount) AS total FROM table2 JOIN table1 USING (userID) GROUP BY Username ORDER BY total DESC
How could I select a single Username and get its position in the grouped and ordered result? I have no idea what a subselect would have to look like for my goal. Is it even possible in a single query?
You cannot calculate the position of the user without referencing the other data. SQLite versions before 3.25.0 have no ranking function, which would be ideal for your use case, nor a row-number feature that would serve as an acceptable substitute.
I suppose the closest you could get would be to drop this data into a temp table that has an incrementing ID, but I think you'd get very messy there.
It's best to handle this within the application. Get all the users and calculate rank. Cache individual user results as necessary.
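A minimal sketch of the application-side approach, using Python's built-in sqlite3 module with the highscore query from the question. The sample rows and the rank_of helper are made up for illustration, and caching is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (userID INTEGER PRIMARY KEY, Username TEXT);
    CREATE TABLE table2 (userID INTEGER, Amount INTEGER);
    INSERT INTO table1 VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO table2 VALUES (1, 10), (1, 5), (2, 40), (3, 20);
""")


def rank_of(conn, username):
    """Return the 1-based highscore position of username, or None if absent."""
    rows = conn.execute(
        "SELECT Username, SUM(Amount) AS total FROM table2 "
        "JOIN table1 USING (userID) GROUP BY Username ORDER BY total DESC"
    )
    # Walk the ordered result and stop at the requested user.
    for position, (name, _total) in enumerate(rows, start=1):
        if name == username:
            return position
    return None


alice_rank = rank_of(conn, "alice")
```

Note that users with equal totals come back in an unspecified order here, so ties would need an explicit tie-breaking rule.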
Without knowing anything more about the operating context of the app/DB it's hard to provide a more specific recommendation.
For a specific user, this query gets the total amount:
SELECT SUM(Amount)
FROM Table2
WHERE userID = ?
To get the rank, count how many users have a total at least as high as that user's. The user is included in the count, so the result is the 1-based position:
SELECT COUNT(*)
FROM table1
WHERE (SELECT SUM(Amount)
FROM Table2
WHERE userID = table1.userID)
>=
(SELECT SUM(Amount)
FROM Table2
WHERE userID = ?);
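The counting query can be tried out with Python's built-in sqlite3 module, as in the sketch below. The sample data and the rank_of_user helper are hypothetical; the SQL is the query above with only its whitespace changed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (userID INTEGER PRIMARY KEY, Username TEXT);
    CREATE TABLE table2 (userID INTEGER, Amount INTEGER);
    INSERT INTO table1 VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO table2 VALUES (1, 10), (1, 5), (2, 40), (3, 20);
""")

RANK_SQL = """
    SELECT COUNT(*)
    FROM table1
    WHERE (SELECT SUM(Amount) FROM table2 WHERE userID = table1.userID)
          >=
          (SELECT SUM(Amount) FROM table2 WHERE userID = ?)
"""


def rank_of_user(conn, user_id):
    """Count users whose total is at least as high as the given user's total."""
    return conn.execute(RANK_SQL, (user_id,)).fetchone()[0]


bob_rank = rank_of_user(conn, 2)
```

Because the comparison is >=, users tied on the same total all receive the position of the last user in the tie.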
I wasn't sure whether to put this in Software or here, so I figured I'd start here. I know this will be a straightforward answer from you SQL geniuses...
I have a table that contains contacts I import on a daily basis. I will have an ASP.NET front end for user interaction. From this table, my intention is to send them all mailers, but only one to each address. So my end result is that a user enters a date (which corresponds to the date imported) and is given a resultant grid that has all the unique addresses associated with that date. I only want to send a mailer to each address once; many times my original imported list will contain multiple businesses at the same address.
Table: ContactTable
Fields:
ID, CompanyName, Address, City, State, Zip, Phone
I can use the SELECT DISTINCT clause, but I need all the data associated to it (company name, etc.)
I have over 262000 Records in this table.
If I select a sample date of 1/10/2011, I get 2401 records. SELECT DISTINCT Address from the same date gives me 2092 records. This is workable, I would send those 2092 people a mailer.
Secondly, I'd have to be able to historically check if a mailer was already sent to that address as well. I would not want to send another mailer to the same business tomorrow either.
What's my best way?
I would start with creating a table to lookup sent mailers.
ID | DateSent
-------------
Every time you send a mailer, insert the ID and the DateTime into it. That way, when you go to pull the mailers, you can check this table to see whether a mailer has been sent within your specified mailing time frame. If you have multiple types of mailers, you can extend the table to include the mailer type.
Plain Old SQL
SELECT a.ID, a.CompanyName, b.Address, b.City, b.State, b.Zip, a.Phone
FROM ContactTable a
INNER JOIN (SELECT MIN(ID) AS ID, Address, City, State, Zip
FROM ContactTable
GROUP BY Address, City, State, Zip) b
ON a.ID = b.ID
This sub-query is like creating a temp table holding one row (the lowest ID) for each distinct address, then joining it to the rest of the info. The sub-query has to return an ID, otherwise there is nothing to join on.
To add the lookup against your new table add the following
SELECT a.ID, a.CompanyName, b.Address, b.City, b.State, b.Zip, a.Phone
FROM ContactTable a
INNER JOIN (SELECT MIN(ID) AS ID, Address, City, State, Zip
FROM ContactTable
GROUP BY Address, City, State, Zip) b
ON a.ID = b.ID
WHERE NOT EXISTS (SELECT 1 FROM SentMailer c
WHERE c.ID = a.ID
AND DATEDIFF(mm, c.DateSent, GETDATE()) <= 12) --gives you everything that hasn't been sent a mailer within the last year, including contacts never mailed
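The idea of one row per address combined with a sent-mailer lookup can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: SQLite has no DATEDIFF or GETDATE, so the date check is rewritten with date('now', '-12 months'), DateSent is assumed to be stored as YYYY-MM-DD text, the lowest ID is picked for each address, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ContactTable (ID INTEGER PRIMARY KEY, CompanyName TEXT,
        Address TEXT, City TEXT, State TEXT, Zip TEXT, Phone TEXT);
    CREATE TABLE SentMailer (ID INTEGER, DateSent TEXT);
    INSERT INTO ContactTable VALUES
        (1, 'Acme', '1 Main St', 'Springfield', 'IL', '62701', '555-0001'),
        (2, 'Bolt', '1 Main St', 'Springfield', 'IL', '62701', '555-0002'),
        (3, 'Cogs', '9 Oak Ave', 'Springfield', 'IL', '62701', '555-0003');
    INSERT INTO SentMailer VALUES (3, date('now'));
""")

# One contact per exact-match address, skipping any mailed in the last 12 months.
rows = conn.execute("""
    SELECT a.ID, a.CompanyName, a.Address
    FROM ContactTable a
    JOIN (SELECT MIN(ID) AS ID FROM ContactTable
          GROUP BY Address, City, State, Zip) b ON a.ID = b.ID
    WHERE NOT EXISTS (
        SELECT 1 FROM SentMailer c
        WHERE c.ID = a.ID AND c.DateSent > date('now', '-12 months')
    )
    ORDER BY a.ID
""").fetchall()
```

With this sample data, contacts 1 and 2 share an address so only contact 1 survives, and contact 3 is skipped because it was mailed today.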
Edit
Without the data being standardized it's hard to get quality results. I've found in the past that the more creative I have to get with my queries, the more likely it is that the table structure or data collection is the real problem. I think you should still create a lookup table for ID/DateSent to manage the time frames for sending.
Edit
Yes, I'm basically looking for the unique address, city, state, zip. I would only require one instance for each address so we would be able to send a mailer to that address. At this point, Company name would not be required.
If this is the case you can simply do the following:
SELECT DISTINCT Address, City, State, Zip, Phone
FROM ContactTable
Keep in mind this won't scrub entries like Main Street vs Main St.
RogueSpear, I work in the address verification (and thus de-duplication) field for SmartyStreets, where we deal with this scenario a lot and tackle the challenge.
If you're getting daily lists from a company and have hundreds of thousands of records, then removing duplicate addresses using stored procedures or mere queries won't be enough to match the varying possibilities of each address. There are services which do this, and I'd point you to CASS-Certified vendors which provide that.
You can flag duplicates in a table using something like CASS-Certified Scrubbing, or you can prevent duplicates at point-of-entry with an API like LiveAddress. Anyway, I'd be happy to personally help you with any other address questions.
I would select, then remove, the duplicates like this:
SELECT a.ID, a.PurgedID, a.CAMPAIGNTYPE, a.COMPANY, a.DBANAME, a.COADDRESS, a.COCITY, a.COSTATE, a.COZIP, a.FIRSTNAME1, a.DIALERPHONENUM, a.Purged FROM PurgeReportDetail a
WHERE EXISTS (
SELECT * FROM PurgeReportDetail b WHERE
b.COADDRESS = a.COADDRESS
AND b.COCITY = a.COCITY
AND b.COSTATE = a.COSTATE
AND b.COZIP = a.COZIP
AND b.id <> a.id
) -- This clause will only include rows with duplicate columns noted
AND a.ID IN (
SELECT TOP 1 c.ID from PurgeReportDetail c
WHERE c.COADDRESS = a.COADDRESS
AND c.COCITY = a.COCITY
AND c.COSTATE = a.COSTATE
AND c.COZIP = a.COZIP
ORDER BY c.ID -- If you want the *newest* entry to be saved, add "DESC" here
) -- This clause gets the top 1 ID value for each matching set
or something like this.
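As written, the query returns the first ID of each redundant address, i.e. the row to keep. To delete the other duplicates instead, change IN to NOT IN and then replace the SELECT with DELETE when ready.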
EDIT: Of course this will only work on exact matches.
EDIT2: If you wanted to only check where you hadn't sent mailers, you should join both to a table of sent mailers from a specified date range
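For a quick experiment with deleting exact-match duplicates while keeping the lowest ID, here is a sketch using Python's built-in sqlite3 module. It swaps in a different technique from the query above: SQLite has no TOP, so the row to keep is found with GROUP BY and MIN(ID). The table is cut down to a few of the columns and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PurgeReportDetail (ID INTEGER PRIMARY KEY, COMPANY TEXT,
        COADDRESS TEXT, COCITY TEXT, COSTATE TEXT, COZIP TEXT);
    INSERT INTO PurgeReportDetail VALUES
        (1, 'Acme', '1 Main St', 'Springfield', 'IL', '62701'),
        (2, 'Bolt', '1 Main St', 'Springfield', 'IL', '62701'),
        (3, 'Cogs', '9 Oak Ave', 'Springfield', 'IL', '62701');
""")

# Keep the lowest ID in each exact-match address group and delete the rest.
conn.execute("""
    DELETE FROM PurgeReportDetail
    WHERE ID NOT IN (
        SELECT MIN(ID) FROM PurgeReportDetail
        GROUP BY COADDRESS, COCITY, COSTATE, COZIP
    )
""")

remaining = [row[0] for row in conn.execute(
    "SELECT ID FROM PurgeReportDetail ORDER BY ID")]
```

Running the inner SELECT on its own first is a cheap way to review which rows would be kept before deleting anything.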