Getting large number of entities from datastore - google-cloud-datastore

By this question, I am able to store large number (>50k) of entities in datastore. Now I want to access all of it in my application. I have to perform mathematical operations on it. It always time out. One way is to use TaskQueue again but it will be asynchronous job. I need a way to access these 50k+ entities in my application and process them without getting time out.

Part of the accepted answer to your original question may still apply, for example a manually scaled instance with 24h deadline. Or a VM instance. For a price, of course.
Some speedup may be achieved by using memcache.
Side note: depending on the size of your entities you may need to keep an eye on the instance memory usage as well.
Another possibility would be to switch to a faster instance class (and with more memory as well, but also with extra costs).
But all such improvements might still not be enough. The best approach would still be to give your entity data processing algorithm a deeper thought - to make it scalable.
I'm having a hard time imagining a computation so monolithic that can't be broken into smaller pieces which wouldn't need all the data at once. I'm almost certain there has to be some way of using some partial computations, maybe with storing some partial results so that you can split the problem and allow it to be handled in smaller pieces in multiple requests.
As an extreme (academic) example think about CPUs doing pretty much any super-complex computation fundamentally with just sequences of simple, short operations on a small set of registers - it's all about how to orchestrate them.
Here's a nice article describing a drastic reduction of the overall duration of a computation (no clue if it's anything like yours) by using a nice approach (also interesting because it's using the GAE Pipeline API).
If you post your code you might get some more specific advice.

Related

What cache strategy do I need in this case ?

I have what I consider to be a fairly simple application. A service returns some data based on another piece of data. A simple example, given a state name, the service returns the capital city.
All the data resides in a SQL Server 2008 database. The majority of this "static" data will rarely change. It will occassionally need to be updated and, when it does, I have no problem restarting the application to refresh the cache, if implemented.
Some data, which is more "dynamic", will be kept in the same database. This data includes contacts, statistics, etc. and will change more frequently (anywhere from hourly to daily to weekly). This data will be linked to the static data above via foreign keys (just like a SQL JOIN).
My question is, what exactly am I trying to implement here ? and how do I get started doing it ? I know the static data will be cached but I don't know where to start with that. I tried searching but came up with so much stuff and I'm not sure where to start. Recommendations for tutorials would also be appreciated.
You don't need to cache anything until you have a performance problem. Until you have a noticeable problem and have measured your application tiers to determine your database is in fact a bottleneck, which it rarely is, then start looking into caching data. It is always a tradeoff, memory vs CPU vs real time data availability. There is no reason to make your application more complicated than it needs to be just because.
An extremely simple 'win' here (I assume you're using WCF here) would be to use the declarative attribute-based caching mechanism built into the framework. It's easy to set up and manage, but you need to analyze your usage scenarios to make sure it's applied at the right locations to really benefit from it. This article is a good starting point.
Beyond that, I'd recommend looking into one of the many WCF books that deal with higher-level concepts like caching and try to figure out if their implementation patterns are applicable to your design.

MSMQ - Message Queue Abstraction and Pattern

Let me define the problem first and why a messagequeue has been chosen. I have a datalayer that will be transactional and EXTREMELY insert heavy and rather then attempt to deal with these issues when they occur I am hoping to implement my application from the ground up with this in mind.
I have decided to tackle this problem by using the Microsoft Message Queue and perform inserts as time permits asynchronously. However I quickly ran into a problem. Certain inserts that I perform may need to be recalled (ie: retrieved) immediately (imagine this is for POS system and what happens if you need to recall the last transaction - one that still hasn’t been inserted).
The way I decided to tackle this problem is by abstracting the MessageQueue and combining it in my data access layer thereby creating the illusion of a single set of data being returned to the user of the datalayer (I have considered the other issues that occur in such a scenario (ie: essentially dirty reads and such) and have concluded for my purposes I can control these issues).
However this is where things get a little nasty... I’ve worked out how to get the messages back and such (trivial enough problem) but where I am stuck is; how do I create a generic (or at least somewhat generic) way of querying my message queue? One where I can minimize the duplication between the SQL queries and MessageQueue queries. I have considered using LINQ (but have very limited understanding of the technology) and have also attempted an implementation with Predicates which so far is pretty smelly.
Are there any patterns for such a problem that I can utilize? Am I going about this the wrong way? Does anyone have any of their own ideas about how I can tackle this problem? Does anyone even understand what I am talking about? :-)
Any and ALL input would be highly appreciated and seriously considered…
Thanks again.
For anyone interested. I decided in
the end to simply cache the
transaction in another location and
use the MSMQ as intended and described
below.
If the queue has a large-ish number of messages on it, then enumerating those messages will become a serious bottleneck. MSMQ was designed for first-in-first-out kind of access and anything that doesn't follow that pattern can cause you lots of grief in terms of performance.
The answer depends greatly on the sort of queries you're going to be executing, but the answer may be some kind of no-sql database (CouchDB or BerkeleyDB, etc)

How to increase my Web Application's Performance?

I have a ASP.NET web application (.NET 2008) using MS SQL server 2005. I want to increase the performance of the web site. Does anyone know of an article containing steps to do that, step by step, in SQL (indexes, etc.), and in the code?
Performance tuning is a very specific process. I don't know of any articles that discuss directly how to achieve this, but I can give you a brief overview of the steps I follow when I need to improve performance of an application/website.
Profile.
Start by gathering performance data. At the end of the tuning process you will need some numbers to compare to actually prove you have made a difference. This means you need to choose some specific processes that you monitor and record their performance and throughput.
For example, on your site you might record how long a login takes. You need to keep this very narrow. Pick a specific action that you want to record and time it. (Use a tool to do the timing, or put some Stopwatch code in you app to report times. Also, don't just run it once. Run it multiple times. Try to ensure you know all the environment set up so you can duplicate this again at the end.
Try to make this as close to your production environment as possible. Make sure your code is compiled in release mode, and running on real separate servers, not just all on one box etc.
Instrument.
Now you know what action you want to improve, and you have a target time to beat, you can instrument your code. This means injecting (manually or automatically) extra code that times each method call, or each line and records times and or memory usage right down the call stack.
There are lots of tools out their that can help you with this and automate some of it. (Microsoft's CLR profiler (free), Redgate - Ants (commercial), the higher editions of visual studio have stuff built in, and loads more) But you don't have to use automatic tools, it's perfectly acceptable to just use the Stopwatch class to time each block of your code. What you are looking for is a bottle neck. The likely hood is that you will find a high proportion of the overall time is spent in a very small bit of code.
Tune.
Now you have some timing data, you can start tuning.
There are two approaches to consider here. Firstly, take an overall perspective. Consider if you need to re design the whole call stack. Are you repeating something unnecessarily? Or are you just doing something you don't need to?
Secondly, now you have an idea of where your bottle neck is you can try and figure out ways to improve this bit of code. I can't offer much advice here, because it depends on what your bottle neck is, but just look to optimise it. Perhaps you need to cache data so you don't have to loop over it twice. Or batch up SQL calls so you can do just one. Or tighten your query filters so you return less data.
Re-profile.
This is the most important step that people often miss out. Once you have tuned your code, you absolutely must re-profile it in the same environment that you ran your initial profiling in. It is very common to make minor tweaks that you think might improve performance and actually end up degrading it because of some unknown way that the CLR handles something. This is much more common in managed languages because you often don't know exactly what is going on under the covers.
Now just repeat as necessary.
If you are likely to be performance tuning often I find it good to have a whole batch of automated performance tests that I can run that check the performance and throughput of various different activities. This way I can run these with every release and record performance changes each release. It also means that I can check that after a performance tuning session I know I haven't made the performance of some other area any worse.
When you are profiling, don't always just think about the time to run a single action. Also consider profiling under load, with lots of users logged in. Sometimes apps perform great when there's just one user connected, but when they hit a certain number of users suddenly the whole thing grinds to a halt. Perhaps because suddnely they are spending more time context switching or swapping memory in and out to disk. If it's throughput you want to improve you need to be figuring out what is causing the limit on throughput.
Finally. Check out this huge MSDN article on Improving .NET Application Performance and Scalability. Specifically, you might want to look at chapter 6 and chapter 17.
I think the best we can do from here is give you some pointers:
query less data from the sql server (caching, appropriate query filters)
write better queries (indexing, joins, paging, etc)
minimise any inappropriate blockages such as locks between different requests
make sure session-state hasn't exploded in size
use bigger metal / more metal
use appropriate looping code etc
But to stress; from here anything is guesswork. You need to profile to find the general area for the suckage, and then profile more to isolate the specific area(s); but start by looking at:
sql trace between web-server and sql-server
network trace between web-server and client (both directions)
cache / state servers if appropriate
CPU / memory utilisation on the web-server
I think First of all you have to find your Bottlenecks and then try to improve those.
This helps you to perform exactly where you have serios problem.
An in addition you needto improve your Connection to DB. For exampleusing a Lazy , Singletone Pattern and also create Batch request instead of single requests.
It help you to decrease DB connection.
Check your cache and suitable loop structures.
another thing is to use appropriate types, forexample if you need int donot create a long and etc
at the end ypu can use some Profiler (specially in SQL) andcheckif your queries implemented as well as possible.

How do you deal with design changes?

I just finished working on a project for the last couple of months. It's online and ready to go. The client is now back with what is more or less a complete rewrite of most parts of the application. A new contract has been drafted and payment made for the additional work involved.
I'm wondering what would be the best way to start reworking this whole thing. What are the first few things you would do? How would you rework the design in a way that you stay confident that the stuff you're changing does not break other stuff?
In short, how would you tackle drastic application design changes efficiently (both DB and code)?
Presuming that you have unit tests in place, this is just refactoring.
If you don't have unit tests in place, then
Write unit tests for the parts you're likely to keep.
Write unit tests for the parts you're going to change.
Run the tests. The "keep" should pass. The "change" should fail.
Start refactoring until the tests pass.
This is NOT-A-NEW thing in software and people have done this and written a lot about this.
Try reading
Working Effectively with Legacy
Code
Refactoring Databases:
Evolutionary Database Design
The techniques explained here are invaluable to sustain any kind of long running IT projects.
Database design is different from application design in this regard.
Very often, client rethinking changes the application completely, but changes little, if anything, in the fundamental underlying data model of the enterprise. The reason for this is that clients tend to think in terms of business processes, but not in terms of fundamental data. Business processing and data processing are tightly coupled. Data storage is less tightly coupled.
In the days of classical database design, designers learned how to exploit this pattern, by dividing their database design into (at least) two layers: logical design and physical design. There are any number of times that a change of business process requires a complete rewrite of the application, and a major rework of the database physical design, but requires few, if any, changes to the logical design.
If your database design didn't separate out the layers like this, it's hard to tell what gets affected and what doesn't. Start with your tables and columns. Ask yourself if any of the changes require removing any column from the table it's in, or require inventing new columns. If the answer is no, you're in luck. Next, look at the constraints placed on the database (things like PRIMARY KEY, FOREIGN KEY, UNIQUE and NOT NULL). These constraints might be tightened or loosened by the client's changes. If not, you're in luck. If you didn't declare any constraints in the database, and chose to do all your integrity protection in application code, you're probably out of luck.
You still have a fair amount of work to do in terms of changing the indexes on the tables, and the way the application works with the data. But you've salvaged part of the investment in the old system.
The application itself is much more vulnerable to client changes in process than the database. If your database design was completely driven by your application design, you may be out of luck.
If it's THAT drastic of a change it might be best to just start over. I've worked on a number of projects that have gone through some drastic changes.
Starting over gives you a chance to use experience learned since the last project and provide a more efficent product.
I would recommend against trying to re-work the old site into the new site, you'll probably spend more time fiddling around changing things than you would have if you had just re-written it.
Best of luck to you !
How would you rework the design in a way that you stay confident that the stuff you're changing does not break other stuff? In short, how would you tackle drastic application design changes efficiently (both DB and code)?
Tests, code complexity/coverage metrics, and a continuous integration system. Run them early and often, so you know which parts are the riskiest and where to start writing.
These will become your safety nets when you have to make potentially problematic changes. If something does break, your CI system will tell you, and you won't have spent weeks down some rabbit hole before you realize there's a problem.
Sometimes you do things better the second time around so just try and stay positive. Plus you will have more domain knowledge this time around.

When is it best to change code to match standards?

I have recently been put in charge of debugging two different programs which will eventually need to share an XML parsing script, at the minimum. One was written with PureMVC, and another was built from scratch. While it made sence, originally, to write the one from scratch (it saved a good deal of memory, but the memory problems have since been resolved).
Porting the non-PureMVC application will take a good deal of time and effort which does not need to be used, but it will make documentation and code-sharing easier. It will also lower the overall learning curve. With that in mind:
1. What should be taken into account when considering whether it is best to move things to one standard?
(On a related note)
Some of the code is a little odd. Because the interpreting App had to convert commands from one syntax to another, it made sense to have an interpreter Object. Because there needed to be communication with the external environment, it made more sense to have one object interact with the environment, and for that to deal with the interpreter exclusively.
Effectively, an anti-Singleton was created. The object would only interface with the interpreter, and that's it. If a member of another class were to try to call one of its public methods, the object would raise an Exception.
There are better ways to accomplish this, but it is definitely a bit odd. There are more standard means of accomplishing the same thing, though they often involve the creation of classes or class files which are extraordinarily large. The only solution which I could find that was standards compliant would involve as much commenting and explanation as is currently required, if not more. Considering this:
2. If some code is quirky, but effective, is it better to change it to make it less quirky, even if it is made a more unwieldy?
In my opinion this type of refactoring is often not considered in schedules and can only be done when there is extra time.
More often than not, the criterion for shipping code is if it works, not necessarily if it's the best possible code solution.
So in answer to your question, I try and refactor when I have time to do so. Priority One still remains to produce a functional piece of code.
Things to take into account:
Does it work as-is?
As Galwegian notes, this is the only criterion in many shops. However, IMO just as important is:
How skilled are the programmers who are going to maintain it? Have they ever encountered nonstandard code? Compare the cost of their time to learn it (including the cost of delayed dot releases) to the cost of your time to refactor it.
If you're maintaining it, then instead consider:
How much time will dealing with the nonstandard code cost you over the intended lifecycle of the codebase (e.g., the time between now and when the whole thing is rewritten)?
That's hard to guess, but consider that many codebases FAR outlive the usefulness envisioned by their original authors. (Y2K anyone?) I've gradually developed a sense of when a refactoring is worthwhile and when it's not, mostly by erring on the side of "not" too often and regretting it later.
Only change it if you need to be making changes anyway. But less quirky is always a good goal. Most of the time spent on a particular piece of software is in maintenance, so if you can do something to make that easier, you'll be reducing the overall time spent on that piece of code. Nonetheless, don't change something if it's working and doesn't need any modifications.
If you have time, now. If you don't have time and it can be avoided, later.

Resources