I am confused because statistics are not being gathered for some tables in many schemas.
These tables were last analyzed during the night; I assume this was done by the automatic optimizer statistics job, which is enabled.
I realized this when I tried to gather statistics manually and received:
ORA-20005: object statistics are locked
after the Tuning Advisor recommended gathering statistics for a long-running query.
What could have locked these statistics? Could this be some default behaviour? I assume no one did it deliberately, because there is no long-term benefit to such behaviour.
After some research I found a partial answer:
https://blogs.oracle.com/optimizer/entry/maintaining_statistics_on_large_partitioned_tables
I also discovered that statistics are locked for the partitioned tables by a partitioning procedure which runs every night; it contains the line:
dbms_stats.lock_table_stats(...)
I wonder whether this is good or bad practice. I suppose it was good some time ago, but since Oracle 11g it makes no sense at all.
I will try to introduce Incremental Statistics Maintenance (docs) instead of disabling global statistics gathering, which I think is a DEPRECATED idea...
Why do you believe that it "makes no sense at all"?
Locking statistics is neither good nor bad practice. It all depends on why you're locking them. Presumably, someone in the past identified some sort of problem that locking the statistics solved. You'd need to find out what problem that was and whether it is still an issue. If you have tables with large amounts of transient data, for example, you may want to gather statistics when the tables are relatively full and lock the statistics so that the automatic statistics gathering job doesn't accidentally run when the tables are mostly empty and cause very expensive table scans later when the tables are full.
If the problem that was being solved was that gathering global statistics on the partitioned table was slow, then potentially using incremental statistics maintenance would be a better solution. Given that incremental statistics maintenance is not the default behavior, however, it would be incorrect to consider alternative approaches "deprecated". Particularly when you have an existing solution that meets your needs.
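If you do go down the incremental statistics route, switching a table over is just a couple of DBMS_STATS calls. Below is a minimal sketch, assuming python-oracledb and made-up connection details, schema and table names; the UNLOCK_TABLE_STATS, SET_TABLE_PREFS ('INCREMENTAL') and GATHER_TABLE_STATS procedures themselves are the documented ones.

```python
# Minimal sketch, assuming python-oracledb and placeholder credentials/names.
# It unlocks the table's statistics, enables incremental maintenance and
# gathers stats; Oracle then derives global statistics from partition synopses.
import oracledb

conn = oracledb.connect(user="dba_user", password="secret", dsn="dbhost/orclpdb")  # placeholders
with conn.cursor() as cur:
    cur.execute("""
        begin
          dbms_stats.unlock_table_stats(ownname => :own, tabname => :tab);
          dbms_stats.set_table_prefs(ownname => :own, tabname => :tab,
                                     pname => 'INCREMENTAL', pvalue => 'TRUE');
          dbms_stats.gather_table_stats(ownname => :own, tabname => :tab);
        end;""",
        own="APP_SCHEMA", tab="BIG_PART_TABLE")  # hypothetical schema and table
```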
In my case, the raw data is stored in a NoSQL database. Before training an ML model, I have to preprocess that raw data. Once I have preprocessed it, what is the best way to keep the preprocessed data?
1. Keep it in memory
2. Keep it in another table in NoSQL
3. Can you recommend other options?
It depends on your use case, the size of the data, your tech stack and your machine learning framework/library. Truth be told, without knowledge of your data and requirements, no-one on SO will be able to give you a complete answer.
In terms of passing data to the model and running the model, load it into memory. Look at batching your data into the model if you hit memory limits. Or use an AWS EMR cluster!
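To make the batching suggestion concrete, here is a minimal, framework-agnostic sketch of a generator that yields fixed-size batches so the whole preprocessed dataset never has to sit in memory at once; the file name, record format and batch size are only placeholders.

```python
# Minimal sketch: stream preprocessed records in fixed-size batches so the
# whole dataset never sits in memory at once. The file name, newline-delimited
# JSON format and batch size are placeholders.
import json

def iter_batches(path, batch_size=1024):
    batch = []
    with open(path) as f:
        for line in f:
            batch.append(json.loads(line))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:                      # flush the last, possibly smaller batch
        yield batch

for batch in iter_batches("preprocessed.jsonl"):
    pass  # feed each batch to your model's training/prediction step here
```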
For the question on storing the data, I’ll use the previous answer’s example of Spark and try to give some general rules.
If the processed data is “Big” and regularly accessed (e.g. once a month/week/day), then store it in a distributed manner and load it into memory when running the model.
For Spark, the best bet is to write it as partitioned Parquet files or to a Hive Data Warehouse.
The key thing about those two is that they are distributed. Spark will create N parquet files containing all your data. When it comes to reading the dataset into memory (before running your model), it can read from many files at once - saving a lot of time. Tensorflow does a similar thing with the TFRecords format.
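As a rough illustration of the partitioned-Parquet route, here is a small PySpark sketch; the paths, the preprocessing step and the event_date partition column are assumptions, not part of your setup.

```python
# Small PySpark sketch: write preprocessed data as partitioned Parquet, then
# read it back before training. Paths, the dropna() stand-in for real
# preprocessing, and the "event_date" partition column are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("preprocess").getOrCreate()

raw = spark.read.json("s3://my-bucket/raw/")          # hypothetical raw input
processed = raw.dropna()                              # stand-in for real preprocessing

# Spark writes many Parquet files per partition value, all readable in parallel later.
(processed.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://my-bucket/processed/"))

# At model-training time, read the (distributed) dataset back into memory.
train_df = spark.read.parquet("s3://my-bucket/processed/")
```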
If your NoSQL database is distributed, then you can potentially use that.
If it won’t be regularly accessed and is “small”, then just run the code from scratch & load into memory.
If the processing takes no time at all and it’s not used for other work, then there’s no point storing it. It’s a waste of time. Don’t even think about it. Just focus on your model, get the data in memory and get running.
If the data won’t be regularly accessed but is “Big”, then time to think hard!
You need to carefully think about the trade off of processing time vs. data storage capability.
How much will it cost to store this data?
How often is it needed?
Is it business critical?
When someone asks for this, is it always a “needed to be done yesterday” request?
Etc.
---
The Spark framework is a good solution for what you want to do; learn more about it here: spark. Spark for machine learning: here.
I am investigating solutions for identifying people utilising facial recognition and I am interested in using Microsoft's Face API.
I have noted that when adding new people the model needs to be trained again before those people will be recognised.
For our application it is crucial that, whilst training is happening, the model continues to resolve identify requests so that the service runs uninterrupted.
It seems to make sense that the old model would continue to respond to identify requests whilst the new model is being trained up but I am not sure if this assumption is correct.
I would be grateful if someone with knowledge of the API could advise whether this is the case or, if not, whether there is another way around it to ensure continuous resolution of identify requests. I have thought about creating a whole new person group with all the new images, but this involves copying a lot of data and seems an inefficient way to go.
From the same documentation link in the previous answer:
During the training period, it is still possible to perform Identification and FindSimilar if a successful training is done before. However, the drawback is that the new added persons/faces will not appear in the result until a new post migration to large-scale training is completed.
My understanding is that this would work with LargePersonGroups (hence "post migration to large-scale training") but it is unclear whether it would work for the legacy PersonGroups.
I tried a few experiments on my projects using the Face API, but my collection of faces is too small and the training finishes too quickly to check. I think training does not block the previous version, but I cannot guarantee it.
Anyway, you will be interested in the following part of the documentation addressing the problems of training latency: https://learn.microsoft.com/en-us/azure/cognitive-services/face/face-api-how-to-topics/how-to-use-large-scale#buffer
It shows how you could avoid the problem you describe by using a "buffer" group.
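To give a feel for how this plays out in code, here is a rough sketch against the v1.0-style Face REST endpoints: start training on a LargePersonGroup, keep serving Identify requests from the last successful training, and poll the training status until the new model (including the newly added people) is ready. The endpoint, key and group ID are placeholders, and you should check the current API reference before relying on this.

```python
# Rough sketch against the v1.0-style Face REST endpoints. Endpoint, key and
# group ID are placeholders; check the current API reference before relying on it.
import time
import requests

ENDPOINT = "https://YOUR-RESOURCE.cognitiveservices.azure.com"   # placeholder
HEADERS = {"Ocp-Apim-Subscription-Key": "YOUR-KEY"}              # placeholder
GROUP_ID = "employees"                                           # hypothetical LargePersonGroup

# 1. Kick off (re)training after adding new persons/faces.
requests.post(f"{ENDPOINT}/face/v1.0/largepersongroups/{GROUP_ID}/train",
              headers=HEADERS).raise_for_status()

# 2. Identify can keep running meanwhile, answered from the last successful training.
def identify(face_ids):
    body = {"largePersonGroupId": GROUP_ID, "faceIds": face_ids}
    resp = requests.post(f"{ENDPOINT}/face/v1.0/identify", headers=HEADERS, json=body)
    resp.raise_for_status()
    return resp.json()

# 3. Poll until the new training (which includes the newly added people) is done.
while True:
    status = requests.get(f"{ENDPOINT}/face/v1.0/largepersongroups/{GROUP_ID}/training",
                          headers=HEADERS).json().get("status")
    if status in ("succeeded", "failed"):
        break
    time.sleep(5)
```

As I read the linked page, the buffer idea then sits on top of this: new faces go into a small buffer group that retrains quickly, Identify is run against both groups, and the buffer is merged back during off-peak hours.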
Following this question, I am able to store a large number (>50k) of entities in the datastore. Now I want to access all of them in my application and perform mathematical operations on them, but it always times out. One option is to use the TaskQueue again, but that would be an asynchronous job. I need a way to access these 50k+ entities in my application and process them without hitting the timeout.
Part of the accepted answer to your original question may still apply, for example a manually scaled instance with 24h deadline. Or a VM instance. For a price, of course.
Some speedup may be achieved by using memcache.
Side note: depending on the size of your entities you may need to keep an eye on the instance memory usage as well.
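If memcache looks promising, a minimal read-through caching sketch on the legacy Python runtime might look like the following; the entity kind, the cache key and the one-hour TTL are all placeholders.

```python
# Minimal read-through memcache sketch for the legacy Python runtime.
# The entity kind, cache key and one-hour TTL are placeholders.
from google.appengine.api import memcache
from google.appengine.ext import ndb

class Record(ndb.Model):              # hypothetical entity kind
    value = ndb.FloatProperty()

def get_all_values():
    values = memcache.get("record_values")
    if values is None:
        values = [r.value for r in Record.query()]        # the expensive datastore read
        memcache.set("record_values", values, time=3600)  # cache for an hour
    return values
```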
Another possibility would be to switch to a faster instance class (and with more memory as well, but also with extra costs).
But all such improvements might still not be enough. The best approach would still be to give your entity data processing algorithm a deeper thought - to make it scalable.
I'm having a hard time imagining a computation so monolithic that it can't be broken into smaller pieces which don't need all the data at once. I'm almost certain there has to be some way of using partial computations, maybe storing partial results, so that you can split the problem and handle it in smaller pieces across multiple requests.
As an extreme (academic) example think about CPUs doing pretty much any super-complex computation fundamentally with just sequences of simple, short operations on a small set of registers - it's all about how to orchestrate them.
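To make the splitting idea a bit more concrete, here is a minimal sketch (legacy Python NDB; the entity kind, page size and the sum() aggregation are hypothetical) that walks the entities in pages via a query cursor and carries a partial result forward, so no single request has to load everything:

```python
# Minimal sketch: walk the 50k+ entities page by page with a query cursor,
# carrying a partial result forward, so no single request loads everything.
# The entity kind, page size and the sum() aggregation are hypothetical.
from google.appengine.ext import ndb

class Record(ndb.Model):              # hypothetical entity kind
    value = ndb.FloatProperty()

def process_page(cursor=None, running_total=0.0):
    records, next_cursor, more = Record.query().fetch_page(500, start_cursor=cursor)
    running_total += sum(r.value for r in records)
    if more and next_cursor:
        # Hand the cursor and the partial result to the next request or task
        # (e.g. via a task queue payload or deferred.defer); not done yet.
        return next_cursor, running_total, False
    return None, running_total, True   # True = finished
```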
Here's a nice article describing a drastic reduction of the overall duration of a computation (no clue if it's anything like yours) by using a nice approach (also interesting because it's using the GAE Pipeline API).
If you post your code you might get some more specific advice.
I have a collection of financial time series of various sorts. Most of my analysis is either column- or row-oriented; very rarely do I have to run any sort of complex query. Also, I am (by now) doing almost all of my analysis in R.
Because of this, I am seriously considering not deploying any sort of RDBMS and instead managing the data in R directly (saving RDS files). This would save me the pain of installing and administering a DB, and would probably also improve data loading speeds.
Is there any reason I should consider otherwise? Do you know anyone who manages their data this way? I know this is vague, but I am looking for opinions, not answers.
If working in R is your comfort zone, I'd keep your data management there as well, even if your analyses or runs take longer.
I've had a similar decision lately:
Should I go in the direction of learning and applying a new language/dialect/system to shave some milliseconds off execution time?
or...
Should I go forth with the same stodgy old tools I have used, even if they will run slower at execution time?
Is the product of your runs for you only? If so, I'd stick with data management in R only, even if production runs are slower.
If you were designing something for a Bank, Cell Phone Service, or a similar transactional environment, I'd recommend finding the super solution.
But if your R production is for you, I'd stay in R.
Consider the opportunity cost. Learning a new language/ecosystem - and something like PostgreSQL surely qualifies - will soak up far more time than you likely think. Those skills may be valuable, but will they generate a return on time invested that is as high as the return you would get from additional time spent on your existing analysis?
If it's for personal use and there is no pressing performance issue, stick with R. Given that it's generally easier to do foolish things with text and RDS files than it is with a fully-fledged DB, just make sure you back up everything. From being a huge skeptic about cloud-based storage I have over the past half-year become a huge convert and all but my most sensitive information is now stored there. I use Dropbox, which maintains previous versions of data if you do mess up badly.
Being able to check a document or script from the cafe on the corner on your smartphone is nice.
There is a column-by-column data management package, colbycol, on CRAN, designed to provide DB-like functionality for large datasets. I assume the author must have needed to do the same sort of analysis.
I am currently in the analysis phase of developing some sort of Locale-based Stock Screener (please see Google's for similar work) and I would appreciate advice from the SO experts.
Firstly, the Stock Screener would obviously need to store the formulae required to perform calculations. My initial conclusion is that the formulae would need to be stored in the database layer. What are your ideas on this? Could I improve speed (very important) by storing the formulae in a flat file (XML/TXT)?
Secondly, I would also like to ask for advice on the internal execution of formulae by the application. Currently I am leaning towards executing formulae on parameters AT RUN TIME, as opposed to running the formulae whenever the parameters are provided to the system and storing the results in the DB for simple retrieval later (my local stock exchange currently does NOT support real-time stock price updates). While I am quite certain that the initial plan (executing at run time) is better to start with, the application could potentially handle a wide variety of formulae as well as a wide variety of input parameters. What are your thoughts on this?
I have also gone through SO for information on how to store formulae in a DB, but I wanted to ask about possible ways to resolve recursive formulae, i.e. formulae which require the results of other formulae to perform their calculations. I wouldn't mind pointers to other questions or forums at all.
[EDIT]
[This page][2] provides a lot of information about what I am trying to achieve, but the difference is that I need to design some formulae with SPECIAL tokens, such as SP, which would represent the Stock Price for the current day, while SP(-1) would represent the price for the previous day. These special tokens would require the application to perform some sort of DB access to retrieve the values they are replaced with.
An example formula would be:
(SP/SP(-1)) / 100
which calculates the Price Change for securities. My idea is to replace the SP tokens with the values for the securities when requested by the user and THEN perform the calculation and send the result to the user.
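To illustrate the token-substitution idea, here is a small sketch; the in-memory price table (standing in for the DB access the tokens would trigger), the token syntax and the use of eval() on the substituted expression are all assumptions, not a recommendation for production.

```python
# Small sketch of the token-substitution idea: replace SP / SP(-n) tokens with
# prices looked up per security, then evaluate the remaining arithmetic.
# The price table, token syntax and use of eval() are assumptions only.
import re

# Stand-in for the DB access the tokens would trigger:
# offset 0 = today's price, -1 = previous day's price, and so on.
PRICES = {"ACME": {0: 102.5, -1: 100.0}}

TOKEN = re.compile(r"SP\((-?\d+)\)|SP")

def evaluate(formula, security):
    def substitute(match):
        offset = int(match.group(1)) if match.group(1) is not None else 0
        return repr(PRICES[security][offset])
    expression = TOKEN.sub(substitute, formula)   # e.g. "(102.5/100.0) / 100"
    # The expression is now purely numeric; a production system should use a
    # real expression parser rather than eval().
    return eval(expression)

print(evaluate("(SP/SP(-1)) / 100", "ACME"))      # -> 0.01025
```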
Thanks a lot for all your assistance.
Kris, I don't mean to presume that I have a better understanding of your requirements than you, but by coincidence I read this article this afternoon, after I posted my earlier comment:
http://thedailywtf.com/Articles/Soft_Coding.aspx
Are you absolutely sure that the "convenience" of updating formulae without recompiling code is worth the maintenance headache that such a solution may become down the line?
I would strongly recommend that you hard code your logic unless you want someone without access to the source to be updating formulae on a fairly regular basis.
And I can't see this happening too often anyway, given that the particular domain here, stock prices, has a well established set of formulae for calculating the various relevant metrics.
I think your effort will be much better spent in making a solid and easily extensible "stock price" framework, or even searching for some openly available one with a proven track record.
Anyway, your project sounds very interesting, I hope it works out well whatever approach you decide to take. Good luck!