I'm new to this data lake concept.
I want to move 4 different MySQL databases into an S3 data lake so I can use Redshift Spectrum to query it. Many of these databases have tables that receive updates.
What are the best practices for handling that in S3? Or is an S3 data lake not the right solution for this?
I've tried writing a Spark job that pulls incremental data based on the created_at and updated_at columns and puts it in S3. The issue is that I end up with duplicate rows whenever a row is updated.
The other way I've done it is to copy the whole table over each time.
I've also tried partitioning the S3 buckets by hours, so if the update is within an hour, I'll just delete that bucket and reprocess that hour.
It just seems very hacky to me.
Is this not a common use case?
What are the best practices around this?
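To make the duplicate-row problem concrete, here is a minimal sketch in plain Python (not Spark) of one way the appended incremental pulls could be collapsed: keep only the newest version of each row, judged by updated_at. The id and status columns and the sample rows are assumptions made up for illustration; only created_at/updated_at come from the description above.

```python
from datetime import datetime

def latest_per_key(rows, key="id", version="updated_at"):
    """Keep only the most recent version of each row, judged by `version`."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[version] > current[version]:
            latest[row[key]] = row
    return list(latest.values())

# Two incremental pulls (hypothetical data): the second one contains
# an update to the row with id 1, so a plain append would duplicate it.
pull_1 = [
    {"id": 1, "status": "new", "updated_at": datetime(2020, 1, 1, 10)},
    {"id": 2, "status": "new", "updated_at": datetime(2020, 1, 1, 11)},
]
pull_2 = [
    {"id": 1, "status": "paid", "updated_at": datetime(2020, 1, 1, 12)},
]

deduplicated = latest_per_key(pull_1 + pull_2)
```

The same idea would apply whether the collapse happens at write time or at query time; this sketch does not cover deleted rows.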
We are moving to AWS EMR/S3 and using R for analysis (the sparklyr library). We have 500 GB of sales data in S3 containing records for multiple products. We want to analyze the data for a couple of products and read only that subset of the files into EMR.
So far my understanding is that spark_read_csv will pull in all the data. Is there a way in R/Python/Hive to read data only for products we are interested in?
In short, the chosen format (CSV) sits at the opposite end of the efficiency spectrum.
Using data that is partitioned (the partitionBy option of the DataFrameWriter, or a correct directory structure) or clustered (the bucketBy option of the DataFrameWriter plus a persistent metastore) on the column of interest can help to narrow down the search to particular partitions in some cases, but if filter(product == p1) is highly selective, then you're likely looking at the wrong tool.
Depending on the requirements, a proper database or a data warehouse on Hadoop might be a better choice.
You should also consider choosing a better storage format (like Parquet).
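As a rough illustration of the "correct directory structure" idea, the sketch below uses only the Python standard library (not Spark) to lay rows out in column=value directories, the naming that partitionBy produces, and then reads a single partition without opening the others. The function names and sample rows are hypothetical, and unlike Spark this sketch keeps the partition column inside the files.

```python
import csv
import tempfile
from pathlib import Path

def write_partitioned(rows, root, column):
    """Write rows into <root>/<column>=<value>/part.csv, one directory per value."""
    for row in rows:
        part_dir = Path(root) / f"{column}={row[column]}"
        part_dir.mkdir(parents=True, exist_ok=True)
        path = part_dir / "part.csv"
        is_new = not path.exists()
        with path.open("a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(row))
            if is_new:
                writer.writeheader()
            writer.writerow(row)

def read_partition(root, column, value):
    """Read only the directory for one value; other partitions are never opened."""
    path = Path(root) / f"{column}={value}" / "part.csv"
    with path.open(newline="") as f:
        return list(csv.DictReader(f))

# Hypothetical sales rows for two products.
sales = [
    {"product": "p1", "amount": "10"},
    {"product": "p2", "amount": "7"},
    {"product": "p1", "amount": "3"},
]
root = tempfile.mkdtemp()
write_partitioned(sales, root, "product")
p1_rows = read_partition(root, "product", "p1")
```

With this layout a reader that filters on product only has to touch the product=p1 directory, which is the effect partition pruning aims for.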
I am planning to create a merchant table, which will have store locations of the merchant. Most merchants are small businesses and they only have a few stores. However, there is the odd multi-chain/franchise who may have hundreds of locations.
What would be my solution if I want to include location attributes within the merchant table? If I have to split it into multiple tables, how do I achieve that?
Thank you!
EDIT: How about splitting the table. To cater for the majority, say up to 5 locations I can place them inside the same table. But beyond 5, it will spill over to a normalised table with an indicator on the main table to say there are more than 5 locations. Any thoughts on how to achieve that?
You have a couple of options depending on your access patterns:
Compress the data and store the binary object in DynamoDB.
Store basic details in DynamoDB along with a link to S3 for the larger things. There's no transactional support across DynamoDB and S3 so there's a chance your data could become inconsistent.
Rather than embed location attributes, you could normalise your tables and put that data in a separate table with the equivalent of a foreign key to your merchant table. But, you may then need two queries to retrieve data for each merchant, which would count towards your throughput costs.
Catering for a spill-over table would have to be handled in the application code rather than at the database level: if (store_count > 5), execute another query to retrieve the remaining data.
If you don't need the performance and scalability of DynamoDB, perhaps RDS is a better solution.
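The spill-over check mentioned above might look roughly like this in application code. This is only a sketch: the two dictionaries are hypothetical in-memory stand-ins for the merchant table and the overflow table, and the merchant names and locations are invented.

```python
SPILL_THRESHOLD = 5  # from the question: up to 5 locations live on the merchant item

# Hypothetical in-memory stand-ins for the two tables.
merchants = {
    "m1": {"name": "Corner Cafe", "store_count": 2,
           "locations": ["A St", "B St"]},
    "m2": {"name": "MegaChain", "store_count": 7,
           "locations": ["S1", "S2", "S3", "S4", "S5"]},
}
overflow_locations = {"m2": ["S6", "S7"]}

def get_locations(merchant_id):
    """Return all locations, doing a second lookup only when the merchant spills over."""
    merchant = merchants[merchant_id]  # first query: the merchant item
    locations = list(merchant["locations"])
    if merchant["store_count"] > SPILL_THRESHOLD:
        # second query: only needed for the few large merchants
        locations += overflow_locations.get(merchant_id, [])
    return locations
```

Small merchants cost one read; only the large chains pay for the extra query.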
A bit late to the party, but I believe the right schema would be to use merchantId as the partition key and storeId as the sort key. This creates a separate record for each store, and you can store the geo location on it. This way:
You would not cross the 400 KB item size limit.
Queries become efficient if you want to fetch the location for just one of the merchant's stores. If you want to fetch all the stores, there is no impact with this schema.
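To show the access pattern this schema gives you, here is a small in-memory model in plain Python. It is not the DynamoDB API; the table dictionary, the helper names, and the lat/lon attributes are all assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical stand-in for a table keyed by (merchantId, storeId):
# outer key = partition key, inner key = sort key.
table = defaultdict(dict)

def put_store(merchant_id, store_id, attributes):
    """One small record per store instead of one large merchant record."""
    table[merchant_id][store_id] = attributes

def get_store(merchant_id, store_id):
    """Fetch a single store by its full key."""
    return table[merchant_id][store_id]

def query_stores(merchant_id):
    """Fetch every store of one merchant, ordered by sort key."""
    stores = table[merchant_id]
    return [stores[store_id] for store_id in sorted(stores)]

put_store("m1", "s2", {"lat": 47.61, "lon": -122.33})
put_store("m1", "s1", {"lat": 40.71, "lon": -74.01})
put_store("m2", "s1", {"lat": 51.51, "lon": -0.13})
```

A single store is one keyed lookup, and all stores of a merchant are one query on the partition key, regardless of how many stores other merchants have.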
PS: I am a Software Engineer working on Amazon DynamoDB.
I am a newbie in the Amazon DynamoDB world with a strong background in relational databases :-p
I am writing a service using AWS Lambda that migrates data from DynamoDB to Redshift for analytics purposes. My aim is to keep only active data, say the last month, in DynamoDB and purge the rest periodically.
I researched a lot but could not find a precise purging technique for Amazon DynamoDB that avoids a full table scan.
Also, I want to perform the delete based on the range key attribute, which is a timestamp.
Can somebody help me out here?
Thanks
From my experience the easiest and most cost-effective way to handle this job is to create a new table each month, and remove complete old tables when time passes and you are done crunching them.
If you can make your use case fit a TABLE-MMYYYY naming scheme, it would help you a lot.
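A minimal sketch of the bookkeeping behind a TABLE-MMYYYY scheme, in plain Python with no AWS calls. The helper names, the "TABLE" prefix default, and the retention of one previous month are assumptions; actually creating and deleting the tables is left out.

```python
from datetime import date

def table_name(day, prefix="TABLE"):
    """Name of the monthly table that holds data for `day`, in PREFIX-MMYYYY form."""
    return f"{prefix}-{day.month:02d}{day.year}"

def months_back(day, n):
    """First day of the month that lies `n` months before `day`."""
    index = day.year * 12 + (day.month - 1) - n
    return date(index // 12, index % 12 + 1, 1)

def expired_tables(existing, today, keep_months=1, prefix="TABLE"):
    """Tables outside the retention window, which can be dropped whole.

    Keeps the current month plus `keep_months` previous months; every other
    name in `existing` is reported as expired.
    """
    keep = {table_name(months_back(today, i), prefix)
            for i in range(keep_months + 1)}
    return sorted(t for t in existing if t not in keep)

existing = ["TABLE-112020", "TABLE-122020", "TABLE-012021"]
to_drop = expired_tables(existing, today=date(2021, 1, 15))
```

Writers pick the table from the record's date, and the purge job deletes whole tables instead of scanning for old items.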
I need to date/timestamp various transactions, and can add that explicitly into the data structure.
Firebase creates an ID like IuId2Du7p9rJoT-BARu using some algorithm.
Is there a way I can decode the date/time from the firebase-created ID and avoid storing a separate date/timestamp?
Short answer: no.
I've asked the same question previously, because my engineer instincts tell me I should never duplicate data. The conclusion I came to, after thinking this through to its logical end, is that even in a SQL database there exists tons of duplication. It's simply hidden under the covers (as indices, temporary tables, and memory caches). This is just part of working with large and active data.
So drop the timestamp in the data and go have lunch; save yourself some energy :)
Alternately, skip the timestamp entirely. You know that the records are stored by timestamp already, assuming you haven't provided your own priority, so you should be good to go.
I have over 1,500,000 data entries, and the number is going to increase gradually over time. The data will come from 150 regions.
Should I create 150 tables to manage this growing data? Will that be efficient? I need fast operations. ASP.NET and Oracle will be used.
If all the data is the same, don't split it into different tables. Take a look at Oracle's table partitions. One hundred fifty partitions (or more) split out by region (or more) is probably more in line with what you're going to be looking for.
I would also recommend you look at the Oracle Database Performance Tuning Tips & Techniques book and browse Ask Tom on Oracle's website.
Only 1.5 M rows? Not a lot really...
Use one table; working out how to write a 150-way union across 150 tables will be murder.
1.5 million rows doesn't really seem like that much. How many people are accessing the table(s) at any given point? Do you have any indexes set up? If you expect it to grow much larger, you may want to look into partitioning in databases.
FWIW, I work with databases on a regular basis with 100M+ rows. It shouldn't be this bad unless you have thousands of people using it at a time.
1 table per region is way not normalized; you're probably going to lose a bunch of efficiency there. 1 table per data entry site is pretty unusual too. Normalization is huge, it will save you a ton of time down the road, so I'd make sure you're not storing any duplicate data.
If you're using Oracle, you shouldn't need to have multiple tables. It'll support a lot more than 1.5 million rows. If you need to speed up data access, you can try a snowflake schema to pull in commonly accessed data.
If you mean 1,500,000 rows in a table then you do not have much to worry about. Oracle can handle much larger loads than that with ease.
If you need to identify the regions that the data came from, you can create a Region table and tie the ID from that to the big data table.
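A small sketch of that layout, using Python's built-in sqlite3 module purely as a stand-in for Oracle. The table names (region, entry), the payload column, and the sample rows are assumptions for illustration; the point is one big table with an indexed region ID rather than one table per region.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE region (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE entry (
        id        INTEGER PRIMARY KEY,
        region_id INTEGER NOT NULL REFERENCES region(id),
        payload   TEXT
    );
    -- Index the foreign key so per-region lookups do not scan the whole table.
    CREATE INDEX entry_region_idx ON entry(region_id);
""")
conn.executemany("INSERT INTO region (id, name) VALUES (?, ?)",
                 [(1, "North"), (2, "South")])
conn.executemany("INSERT INTO entry (region_id, payload) VALUES (?, ?)",
                 [(1, "a"), (1, "b"), (2, "c")])

def entries_for(region_name):
    """All payloads for one region, via a join on the region ID."""
    rows = conn.execute(
        "SELECT e.payload FROM entry e "
        "JOIN region r ON r.id = e.region_id "
        "WHERE r.name = ? ORDER BY e.id", (region_name,))
    return [payload for (payload,) in rows]

# A cross-region report is a single GROUP BY, not a 150-way union.
counts = dict(conn.execute(
    "SELECT r.name, COUNT(*) FROM entry e "
    "JOIN region r ON r.id = e.region_id GROUP BY r.name"))
```

Adding a new region is then an INSERT into region rather than a schema change.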
IMHO, you should post more details and we can help you better.
A database with 2,000 rows can be slow. It all depends on your database design, indexes, keys and, most importantly, the hardware configuration your database server is running on. The way your application uses this data is also important. Is it a read-intensive database or a transaction-intensive one? There is no right answer to what you are asking right now.
You first need to consider what operations are going to access the table. How will inserts be performed? Will the existing rows be updated, and if so how? By how much will the rows grow, and what percentage of them will grow? Will rows get deleted? By what criteria? How will you be selecting data? By what criteria and how many per query?
Data partitioning is meant for data volumes much larger than 1.5M rows. Look into optimizing the SQL queries, batch processing, and storage of the data.