I have created a new Solr schema and a new index from that schema. I have an existing bucket type that already contains data. How can the existing data be indexed with the newly created index?
If you have Active Anti-Entropy (AAE) enabled on your Riak cluster, existing entries will eventually be picked up by the index.
Keep in mind that it may take a long time (depending on the configuration) until your index has been fully repaired.
For details on how to enable and tune AAE, see the Riak documentation (make sure to select the right Riak version).
Alternatively, you can force a repair, or re-write the keys (read each key and write the data back under the same key). Both options depend very much on how many keys you have and on whether you can tolerate the load this is going to create on your cluster.
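For illustration, here is a minimal sketch of the "re-write the keys" option using the Riak Python client; the bucket-type and bucket names are placeholders, and listing keys is itself expensive, so this should only run when the cluster can take the load:

    # Sketch only: touch every object so the new Solr index picks it up.
    from riak import RiakClient

    client = RiakClient(host='127.0.0.1', pb_port=8087)
    bucket = client.bucket_type('my_bucket_type').bucket('my_bucket')

    # WARNING: listing keys walks the whole keyspace and is costly.
    for key in bucket.get_keys():
        obj = bucket.get(key)
        obj.store()  # writing the value back under the same key re-indexes it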
UPDATE
There is a work in progress page in the Riak documentation repo that explains just that: Reindex Existing Data
I'm trying to do the following periodically (let's say once a week):
download a couple of public datasets
merge them together, resulting in a dictionary (I'm using Python) of ~2.5m entries
upload/synchronize the result to Cloud Datastore so that I have it as "reference data" for other things running in the project
Synchronization can mean that some entries are updated, others are deleted (if they were removed from the public datasets) or new entries are created.
I've put together a Python script using google-cloud-datastore; however, the performance is abysmal: it takes around 10 hours (!) to do this. What I'm doing:
iterate over the entries from the datastore
look them up in my dictionary and decide if they need to be updated / deleted (if no longer present in the dictionary)
write them back / delete them as needed
insert any new elements from the dictionary
I already batch the requests (using .put_multi, .delete_multi, etc).
Some things I considered:
Use Dataflow. The problem is that each task would have to load the dataset (my "dictionary") into memory, which is time and memory consuming.
Use the managed import / export. The problem is that it produces / consumes an undocumented binary format (I would guess entities serialized as protocol buffers?).
Use multiple threads locally to mitigate the latency. The problem is that the google-cloud-datastore library has limited support for cursors (it doesn't have an "advance cursor by X" method, for example), so I don't have a way to efficiently divide the entities from Datastore into chunks that could be processed by different threads.
How could I improve the performance?
Assuming that your datastore entities are only updated during the sync, then you should be able to eliminate the "iterate over the entries from the datastore" step and instead store the entity keys directly in your dictionary. Then if there are any updates or deletes necessary, just reference the appropriate entity key stored in the dictionary.
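As a rough sketch of what that could look like with google-cloud-datastore (the kind name 'Entry', and using the dictionary id as the key name, are assumptions, not something from the question):

    from google.cloud import datastore

    client = datastore.Client()

    def sync(merged, known_ids):
        """merged: {entry_id: properties} built from the public datasets.
        known_ids: set of ids currently in Datastore, kept from the last run."""
        to_put, to_delete = [], []
        for entry_id, props in merged.items():
            entity = datastore.Entity(key=client.key('Entry', entry_id))
            entity.update(props)
            to_put.append(entity)
        for entry_id in known_ids - merged.keys():
            to_delete.append(client.key('Entry', entry_id))
        # Datastore accepts at most 500 mutations per request.
        for i in range(0, len(to_put), 500):
            client.put_multi(to_put[i:i + 500])
        for i in range(0, len(to_delete), 500):
            client.delete_multi(to_delete[i:i + 500])

No read pass over Datastore is needed here; the ids kept from the previous run stand in for it.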
You might be able to leverage multiple threads if you pre-generate empty entities (or keys) in advance and store cursors at a given interval (say every 100,000 entities). There's probably some overhead involved as you'll have to build a custom system to manage and track those cursors.
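A hypothetical sketch of that custom cursor-tracking step, recording a cursor every 100,000 entities with a keys-only query (interval and kind name are made up); each worker thread could then fetch its own slice via start_cursor/end_cursor:

    from google.cloud import datastore

    client = datastore.Client()
    query = client.query(kind='Entry')
    query.keys_only()  # much cheaper than fetching full entities

    interval = 100000
    cursors = [None]   # None means "start from the beginning"
    cursor = None
    while True:
        it = query.fetch(start_cursor=cursor, limit=interval)
        keys = list(it)                 # consume up to `interval` keys
        cursor = it.next_page_token
        if len(keys) < interval or cursor is None:
            break
        cursors.append(cursor)

    # A worker for slice i could then run:
    #   query.fetch(start_cursor=cursors[i], end_cursor=cursors[i + 1])

This still pays for one keys-only scan up front, which is the overhead mentioned above.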
If you use Dataflow, instead of loading your entire dictionary you could first import your dictionary into a new project (a clean Datastore database); then, in your Dataflow function, you could look up the key handed to you by Dataflow in the clean project. If the lookup returns a value, upsert it into your production project; if it doesn't exist, delete the value from your production project.
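If you go that route, the per-key logic inside the Dataflow job could look roughly like this Apache Beam DoFn; the project ids, kind, and key scheme are all placeholders, and this is a sketch rather than a tuned pipeline:

    import apache_beam as beam
    from google.cloud import datastore

    class SyncFromCleanProject(beam.DoFn):
        """Looks each key up in the 'clean' project and mirrors the result."""

        def setup(self):
            # One client per worker; project ids are assumptions.
            self.clean = datastore.Client(project='my-clean-project')
            self.prod = datastore.Client(project='my-prod-project')

        def process(self, key_name):
            fresh = self.clean.get(self.clean.key('Entry', key_name))
            prod_key = self.prod.key('Entry', key_name)
            if fresh is not None:
                entity = datastore.Entity(key=prod_key)
                entity.update(dict(fresh))
                self.prod.put(entity)       # upsert into production
            else:
                self.prod.delete(prod_key)  # removed from the public datasets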
I thought Datastore's keys were ordered by insertion date, but apparently I was wrong. I need to periodically look for new entities in the Datastore, fetch them and process them.
Until now, I would simply store the last fetched key and wrongly query for anything greater than it.
Is there a way of doing so?
Thanks in advance.
Datastore's automatically generated keys are generated with a uniform distribution in order to make access more performant. You will not be able to tell which entities were added last by looking at their keys.
Instead, you can try a couple of different approaches.
Use Pub/Sub and architect your app so that another background task consumes the most recently added entities. Whenever an entity is added to the DB, you just publish a new event to Pub/Sub with its key id. Your event listener (a separate routine) will receive it.
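A hedged illustration of that flow with the google-cloud-pubsub and google-cloud-datastore clients (project, topic, and kind names are invented):

    from google.cloud import datastore, pubsub_v1

    ds = datastore.Client()
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path('my-project', 'new-entities')

    entity = datastore.Entity(key=ds.key('Thing'))  # partial key, id assigned on put
    entity.update({'payload': 'example'})
    ds.put(entity)

    # Publish the freshly assigned key id; a background subscriber
    # (the separate routine) consumes these instead of scanning by key.
    publisher.publish(topic_path, data=str(entity.key.id).encode('utf-8'))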
Use key names and generate your own custom names. But, since you would want sequentially increasing names, this will cause a performance hit even on ranges of data that are not big. You can find more about this in the Google Datastore best practices:
https://cloud.google.com/datastore/docs/best-practices#keys
You can add an additional creation-time property and still use automatic key generation.
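For example (property and kind names are assumptions), the periodic job could then query on that timestamp instead of on the key:

    import datetime
    from google.cloud import datastore

    client = datastore.Client()

    # When writing, stamp each entity with its creation time.
    entity = datastore.Entity(key=client.key('Thing'))
    entity.update({'created': datetime.datetime.utcnow(), 'payload': 'example'})
    client.put(entity)

    # When polling, fetch only what was added since the last run.
    last_seen = datetime.datetime(2020, 1, 1)  # loaded from wherever you persist it
    query = client.query(kind='Thing')
    query.add_filter('created', '>', last_seen)
    query.order = ['created']
    new_entities = list(query.fetch())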
Let's say there is a database owned by someone else called theirdb with a very slow view named slowview. I have an app that queries this view regularly, but, because it takes too long, I want to materialize it to a table within a database that I own (mydb.materializedview).
Is there a way in Teradata to create an alias database object so that I can go like select * from theirdb.slowview, but actually be selecting from mydb.materializedview?
I need to do some rigorous testing against their view, but it's so slow that I hardly have time to test anything. The other option is to edit the code so that it reads from mydb.materializedview, but that is, unfortunately, not an option in this particular case.
Teradata does not allow you to create aliases or symbolic links between objects.
If the object is fully qualified by database name and view name in the application, your options are a little more restricted. You would have to create a backup of their view definition and then place your materialized table in the same database. This would obviously be best done during a planned application outage.
If the object is not fully qualified by database name and view name in the application and relies on a default database setting or an application variable, you have a little more flexibility. If all the work is done at the view level, you can duplicate the environment in another database where you plan to keep a materialized version of their slowview. Then, by changing the user's default database or the application variable, you can point the application at the duplicate environment to complete your testing.
Additionally, you can try to cover (partially or fully) the query that makes up the slowview by using a join index. This allows you to leave the application's codebase as it is, but for queries that can be satisfied by the join index, the optimizer will use it. Keep in mind that a join index does incur a cost, as it is in essence a materialized version of the SQL that was used to construct it. This means additional I/O and change-management issues have to be taken into account.
Lastly, you could try to create additional secondary or hash indexes on the objects within the slowview to improve its performance.
I am working with a large web application where there can be up to 100,000 objects, populated from a DB, in cache.
There is a table in the database which, given the object ID, will give you a last_updated value which is updated whenever any aspect of that object changes in the DB.
I have read that creating one SqlCacheDependency per object (one row in a table per object) is a no-go with such a high number of objects.
I am looking for alternative solutions. One possible solution I thought of is to cache the "last_updated" table as a data structure and create a cache dependency on the table it is based on. Then, whenever one of the 100,000 objects is requested, I check the cached "last_updated" table: if the object is out of date, I fetch it again from the database and re-cache it; if it is not out of date, I serve the cached version. Does this seem like a reasonable solution?
But how can you do it for a single row of the table? In ASP.NET you can create a SQL Server dependency, which uses the Service Broker: the data is put into the cache, and whenever the table is updated the cached entry is invalidated and fresh data is read from the database and put back into the cache.
I hope this gives you some ideas!
I have a data entry form like this...
Data Entry Form http://img192.imageshack.us/img192/2478/inputform.jpg
There are some empty rows, and some of them have values. Users can update existing values and can also fill in values in the empty rows.
I need to map these values to my DB table; some of them will be inserted as new rows into the database, and existing records will be updated.
I need your suggestions: what is the best approach to accomplish this scenario?
Thanks
For each row, I would have a primary key (hidden), a dirty flag, and a new flag. In the grid, you would set the "dirty" flag to true when changes are made. When adding new rows in the UI, you would set the new flag as well as generate a primary key (this would be easiest if you used GUIDs for the key). Then, when you post this all back to the server, you would do inserts when the new flag is set and updates for those with the dirty flag.
Once the commit of the data has completed, you would simply clear the dirty and new flags.
Of course, if the data is shared by multiple contributors and can be edited concurrently, there's a bit more involved if you don't want someone overwriting another's edits.
I would look into using ADO.NET DataSets and DataTables as a backing store in memory for your custom data grid. ADO.NET allows you to bulk-load a DataSet out of the database and track inserts, updates, and deletes against that data in memory. Once you are done, you can then bulk-process the stored changes back into the database.
The big benefit of using ADO.NET is that all the prickly change-tracking code is written for you already, and the library is deployed to every .NET-capable machine.
While it isn't in vogue right now, you can also send an ADO.NET DataSet across the wire using XML serialization so it can be altered elsewhere, and then send it back to be processed into the database.
Google around. There are literally thousands of books, tutorials, and blog posts on how to use ADO.NET.