I'm currently bulk indexing a massive amount of data into Elasticsearch. (Actually, there was already a lot of data in the index; I updated the mapping and started loading again.)
The URL I used is an ELB provided by Amazon.
However, when I count the number of documents with preference=primary_first, the primary shards return nothing, while a count without preference=primary_first shows over 10 million documents in the index.
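Roughly how the two counts can be taken (a sketch only, not the exact calls from my setup: the ELB host and index name are placeholders, and the search API with size=0 stands in for whatever count request is used; _primary_first is the older preference syntax):

import requests

elb = "http://my-es-elb.example.com:9200"   # placeholder for the Amazon ELB endpoint
index = "my_index"                          # placeholder index name

def doc_count(preference=None):
    params = {"size": 0}
    if preference is not None:
        params["preference"] = preference
    resp = requests.get(f"{elb}/{index}/_search", params=params).json()
    return resp["hits"]["total"]            # a plain number on the ES versions that still support _primary_first

print(doc_count("_primary_first"))  # returns 0
print(doc_count())                  # returns over 10 million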
I'm wondering why this happened and how to fix it.
Thanks!
I know that the Variables feature in Airflow uses the internal metadata DB.
Does Airflow have a maximum-record limitation for Variables?
Is there no limit to the number of variables as long as disk space is available?
I could not find specific description in official documents.
There is no limit to the number of variables you can store.
Variables are stored in a table in the MySQL/Postgres/MsSQL database that you use as the backend, and tables don't have a maximum number of records.
Some tables may cause performance issues as they grow larger, but the variables table probably won't be the one to cause such problems.
Should this be a concern for you, you can always use an alternative secrets backend; with one configured, connections/variables are not stored in the Airflow database but in other storage of your choice (for example Google Secret Manager, Vault, etc.). You can see the full list in this doc. You can also roll your own secrets backend if you wish.
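For reference, a minimal sketch of writing and reading Variables (the key names are made up for illustration); each call below is ultimately just a row in the variable table of the metadata DB:

from airflow.models import Variable

# Plain string value
Variable.set("my_plain_key", "some value")
print(Variable.get("my_plain_key"))

# JSON value, stored serialized and deserialized on read
Variable.set("my_json_key", {"retries": 3, "bucket": "example"}, serialize_json=True)
print(Variable.get("my_json_key", deserialize_json=True))

# With a secrets backend configured (Vault, Google Secret Manager, ...),
# the same Variable.get() calls resolve from that backend first, so DAG
# code does not need to change.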
In DynamoDB I have a table like the example data below:
pk       sk                         name        price
======================================================
product  cat#phone#name#iPhone11    iPhone 11   500
product  cat#phone#name#Nokia1100   Nokia 1100  100
product  cat#phone#name#iPhone11    iPhone 11   500
In one case I have to search by name. So first I created a global secondary index for name, where the index pk = pk and sk = name. Then I ran a search, which works fine.
Now I have changed my mind and created a local secondary index for name, where name is the sk. It also works fine. My questions are: if I use a local index here, is there any benefit? When should I not use a local index? And if a global index is not required here but I have used one, are there any performance issues?
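For concreteness, here is how the two index shapes can be expressed at table creation time with boto3 (table name, index names, and throughput values are placeholders, not taken from the real table). Both are shown in one create_table call for brevity; in practice a GSI can be added to an existing table with update_table, while an LSI must be defined when the table is created.

import boto3

dynamodb = boto3.client("dynamodb")
dynamodb.create_table(
    TableName="products",
    AttributeDefinitions=[
        {"AttributeName": "pk", "AttributeType": "S"},
        {"AttributeName": "sk", "AttributeType": "S"},
        {"AttributeName": "name", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "pk", "KeyType": "HASH"},
        {"AttributeName": "sk", "KeyType": "RANGE"},
    ],
    # GSI: pk = pk, sk = name (has its own provisioned throughput)
    GlobalSecondaryIndexes=[{
        "IndexName": "name-gsi",
        "KeySchema": [
            {"AttributeName": "pk", "KeyType": "HASH"},
            {"AttributeName": "name", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "ALL"},
        "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
    }],
    # LSI: same pk, name as the sort key (shares the base table's throughput)
    LocalSecondaryIndexes=[{
        "IndexName": "name-lsi",
        "KeySchema": [
            {"AttributeName": "pk", "KeyType": "HASH"},
            {"AttributeName": "name", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "ALL"},
    }],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)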
#niloy-rony,
This AWS doc explains LSI and GSI in detail.
Now to answer your questions
- LSI comes at no extra throughput cost: you don't pay for separate RCUs/WCUs as you do for a GSI; however, you do pay for the additional storage, as depicted here in another AWS doc.
- You should not use an LSI if you expect a single partition (i.e. one pk value) of your main table to grow beyond 10 GB, since the pk remains the same in the LSI. This is also discussed in the link shared above.
- There is no performance difference between LSI and GSI in terms of query latency. However, reads from a GSI are always eventually consistent, whereas an LSI also supports strongly consistent reads (see the sketch below).
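To make the consistency point concrete, a minimal boto3 sketch (table and index names are placeholders): a query against the LSI can request a strongly consistent read, while the same flag on a GSI query would be rejected.

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("products")   # placeholder table name

# Strongly consistent read via the LSI -- allowed because the LSI lives in
# the same partitions as the base table.
resp = table.query(
    IndexName="name-lsi",                               # placeholder LSI name
    KeyConditionExpression=Key("pk").eq("product") & Key("name").eq("iPhone 11"),
    ConsistentRead=True,
)
print(resp["Items"])

# The equivalent query against the GSI must omit ConsistentRead=True
# (or set it to False); DynamoDB rejects strongly consistent reads on a GSI.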
Edit: adding an excerpt from the AWS doc to explain strongly and eventually consistent reads.
Strongly Consistent Reads - When you request a strongly consistent read, DynamoDB returns a response with the most up-to-date data, reflecting the updates from all prior write operations that were successful.
Eventually Consistent Reads - When you read data from a DynamoDB table, the response might not reflect the results of a recently completed write operation. The response might include some stale data. If you repeat your read request after a short time, the response should return the latest data.
Refer to this AWS doc for tips on minimising the propagation delay of data from the main table to GSIs.
I'm trying to use Cosmos DB and I'm having some trouble making a simple count in a collection.
My collection schema is below, and I have 80,000 documents in this collection.
{
    "_id" : ObjectId("5aca8ea670ed86102488d39d"),
    "UserID" : "5ac161d742092040783a4ee1",
    "ReferenceID" : 87396,
    "ReferenceDate" : ISODate("2018-04-08T21:50:30.167Z"),
    "ElapsedTime" : 1694,
    "CreatedDate" : ISODate("2018-04-08T21:50:30.168Z")
}
If I run the command below to count all documents in the collection, I get the result very quickly:
db.Tests.count()
But when I run the same command for a specific user, I get the message "Request rate is large":
db.Tests.find({UserID:"5ac161d742092040783a4ee1"}).count()
In the Cosmos DB documentation I found this scenario, and the suggestion is to increase RUs. Currently I have 400 RU/s; when I increase it to 10,000 RU/s I can run the command with no errors, but it takes 5 seconds.
I already tried to create the index explicitly, but it seems Cosmos DB doesn't use the index for the count.
I do not think it is reasonable to have to pay for 10,000 RU/s for a simple count in a collection with approximately 100,000 documents, especially when it still takes about 5 seconds.
Count-by-filter queries ARE using indexes if they are available.
If you try a count by filter on a column that is not indexed, the query would not time out but fail. Try it. You should get an error along the lines of:
{"Errors":["An invalid query has been specified with filters against path(s) excluded from indexing. Consider adding allow scan header in the request."]}
So definitely add a suitable index on UserID.
If you don't have index coverage and don't get the above error, then you have probably set the enableScanInQuery flag. This is almost always a bad idea, and a full scan would not scale, meaning it would consume increasingly large amounts of RU as your dataset grows. So make sure it is off, and index instead.
When you DO have an index on the selected column, your query should run. You can verify that the index is actually being used by sending the x-ms-documentdb-populatequerymetrics header, which should return confirmation with the indexLookupTimeInMs and indexUtilizationRatio fields. Example output:
"totalExecutionTimeInMs=8.44;queryCompileTimeInMs=8.01;queryLogicalPlanBuildTimeInMs=0.04;queryPhysicalPlanBuildTimeInMs=0.06;queryOptimizationTimeInMs=0.00;VMExecutionTimeInMs=0.14;indexLookupTimeInMs=0.11;documentLoadTimeInMs=0.00;systemFunctionExecuteTimeInMs=0.00;userFunctionExecuteTimeInMs=0.00;retrievedDocumentCount=0;retrievedDocumentSize=0;outputDocumentCount=1;outputDocumentSize=0;writeOutputTimeInMs=0.01;indexUtilizationRatio=0.00"
It also gives you some insight into where the effort has gone if you feel the RU charge is too large.
If the index lookup time itself is too high, consider whether your index is selective enough and whether the index settings are suitable. Look at your UserID values and their distribution, and adjust the index accordingly.
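Since the collection is accessed through the Mongo API, a minimal pymongo sketch of adding the index and counting with the filter (the connection string and database name are placeholders):

from pymongo import MongoClient, ASCENDING

client = MongoClient("<cosmosdb-mongo-api-connection-string>")  # placeholder
db = client["TestsDb"]                                          # placeholder database name

# Make sure UserID is covered by an index so the filtered count can be
# served from the index instead of a full scan.
db.Tests.create_index([("UserID", ASCENDING)])

# Filtered count against the indexed field.
print(db.Tests.count_documents({"UserID": "5ac161d742092040783a4ee1"}))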
Another wild guess to consider is to check whether the API you are using defers executing find(..) until it knows that count() is really what you are after. It is unclear which API you are using. If it turns out it is fetching all matching documents to the client side before doing the counting, then that would explain the unexpectedly high RU cost, especially if there is a large number of matching documents or large documents involved. Check the API documentation.
I also suggest executing the same query directly in the Azure Portal to compare the RU cost and verify whether the issue is client-related or not.
I think it just doesn't work.
The index seems to be used when selecting the documents to be counted, but the count itself is then done by reading each document, which effectively consumes a lot of RUs.
This query is cheap and fast:
db.Tests.count({ UserID: { '$eq': '5ac161d742092040783a4ee1' }})
but this one is slow and expensive:
db.Tests.count({ ReferenceID: { '$gt': 10 }})
even though this query is fast:
db.Tests.find({ ReferenceID: { '$gt': 10 }}).sort({ ReferenceID: 1 })
I also found this: https://feedback.azure.com/forums/263030-azure-cosmos-db/suggestions/36142468-make-count-aware-of-indexes. Note the status: "We have started work on this feature. Will update here when this becomes generally available."
Pretty disappointing to be honest, especially since this limitation hasn't been addressed for almost 2 years. Note - I am not an expert in this matter and I'd love to be proven wrong, since I also need this feature.
BTW: I noticed that simple indexes seem to be created automatically for each individual field, so no need to create them manually.
I'm writing a simple Wordpress plugin for work and am wondering if using the Transients API is practical in this case, or if I should seek out another way.
The plugin's purpose is simple. I'm making a call to USZip Web Service (http://www.webservicex.net/uszip.asmx?op=GetInfoByZIP) to retrieve data. Our sales team is using a Lead Intake sheet that the plugin will run on.
I wanted to reduce the number of API calls, so I thought of setting a transient for each zip code as the key and storing the incoming data (city and zip). If the corresponding data for a given zip code already exists, there is no need to make an API call.
Here are my concerns:
1. After a quick search, I realized that transient data is stored in the wp_options table, and storing this data would balloon that table in no time. Would this cause a significant performance issue if the DB becomes huge?
2. Is it horrible practice to create this many transient keys? It could easily become thousands in a few months' time.
If using Transient is not the best way, could you please help point me in the right direction? Thanks!
P.S. I opted for the Transients API over the Options API. I know zip codes don't change often, but they sometimes do. I set an expiration time of 3 months.
A less-inflated solution would be:
- Store a single option called uszip with a serialized array inside the option
- Grab the entire array each time and simply check if the zip code exists
- If it doesn't exist, grab the data and save the whole transient again
You should make sure you don't hit the upper bound of a serialized array in this table (9,000 elements), considering 43,000 zip codes exist in the US. However, you will most likely have a very localized subset of zip codes.
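The lookup-or-fetch pattern is simple enough to sketch; the snippet below is Python purely for illustration (a real WordPress plugin would implement it in PHP with get_option()/update_option() or get_transient()/set_transient(), and the web service parameter name is an assumption):

import requests

zip_cache = {}  # stands in for the single serialized "uszip" option

def lookup_zip(zip_code):
    # Serve from the cached array if we've already seen this zip code.
    if zip_code in zip_cache:
        return zip_cache[zip_code]
    # Otherwise call the web service once and save the whole array again.
    resp = requests.get(
        "http://www.webservicex.net/uszip.asmx/GetInfoByZIP",
        params={"USZip": zip_code},       # assumed parameter name
    )
    zip_cache[zip_code] = resp.text       # the real plugin would parse out city/state here
    return zip_cache[zip_code]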
I recently got a database for my website and was surprised to see how expensive it was for the amount of space they gave me (only 500 MB for SQL Server 2005). The database is used to store data such as posts or articles submitted by members.
The problem: I would like to be able to save not only the final version of a user's post, but also every modification made since it was posted. How can I save these modifications in the database without having to store the whole post several times with different version numbers? My concern is that the size of the database might grow too fast and reach the 500 MB limit.
Any ideas?
How often do you need to retrieve the previous posts? How large are they?
You can compress the data before inserting into the database but this will increase retrieval time.
You could store only the diffs between versions, but this requires iterating through the diffs and applying them to recreate the history (see the sketch below).
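A minimal standard-library sketch of that idea, with compression folded in to address the space concern as well (the strings are made-up examples):

import difflib
import zlib

original = "First draft of the article.\nSecond line.\n"
edited   = "Final draft of the article.\nSecond line, revised.\n"

# Keep the full text only for the latest version; for the previous version
# store a compressed ndiff delta instead of the whole post.
delta = difflib.ndiff(edited.splitlines(keepends=True),
                      original.splitlines(keepends=True))
stored_blob = zlib.compress("".join(delta).encode("utf-8"))

# To show an old revision, decompress the delta and restore side 2 of it.
restored_delta = zlib.decompress(stored_blob).decode("utf-8").splitlines(keepends=True)
previous_version = "".join(difflib.restore(restored_delta, 2))
assert previous_version == original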
You should estimate the number of articles/edits that you expect, so you can work out how many versions you can store within your limit.
If articles are large and previous versions are not commonly accessed, then just store the articles as files and keep only the header information in the database.