I'm currently in the process of re-launching my site, which is essentially one giant Google Map with LOTS of markers (currently around 34,000, and estimated to reach 50,000+ by the end of 2013, based on average additions per day).
Since launch I have been improving my skills and optimising, using methods such as Gzip compression, removing unneeded data from the data file the site uses, and clustering the data.
The development site can be viewed at http://dev.rastrack.co.uk
What could I do to speed up the page load even more? From what I can tell, the limitation is now that the client has to cluster all the markers.
I can host and write PHP code to do the job server-side if that is possible. The data is stored in MongoDB as a JSON string for each marker.
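To make it concrete, here is the kind of server-side clustering I'm imagining (just a Python/pymongo sketch with assumed database, collection, and lat/lng field names - I'd port it to PHP): snap every marker into a grid cell sized for the requested zoom level, so the client only receives one pre-aggregated cluster per cell instead of tens of thousands of individual markers.

    from collections import defaultdict
    from pymongo import MongoClient

    def cluster_markers(zoom, db_name="rastrack", coll_name="markers"):
        """Grid-cluster all markers server-side for one zoom level.

        Assumed schema: each document has numeric 'lat' and 'lng' fields.
        The cell size shrinks as zoom increases, so clusters break apart
        naturally when the user zooms in.
        """
        cell = 360.0 / (2 ** zoom * 4)  # degrees per grid cell (tunable)
        grid = defaultdict(lambda: {"count": 0, "lat": 0.0, "lng": 0.0})

        client = MongoClient()
        for doc in client[db_name][coll_name].find({}, {"lat": 1, "lng": 1}):
            key = (int(doc["lat"] // cell), int(doc["lng"] // cell))
            g = grid[key]
            g["count"] += 1
            g["lat"] += doc["lat"]  # accumulate for the centroid
            g["lng"] += doc["lng"]

        return [
            {"lat": g["lat"] / g["count"],
             "lng": g["lng"] / g["count"],
             "count": g["count"]}
            for g in grid.values()
        ]

The output could be cached per zoom level (and ideally per viewport) so the collection is not scanned on every request.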
My company uses SilverStripe v3.1.21, along with the Subsites module, to display and administer a number of clients' websites that sell products. This results in close to 200 subsites and a page count in the tens of thousands. The websites are very slow to load, and tools such as Google's PageSpeed tell us page speeds are poor. We've already taken steps like combining and minifying the JS and compressing resources such as images, which gave some improvements, but the pages remain slow. The system was handed to us in this state, and further hardware upgrades are not on the table as an option, nor is gaining additional resources for redevelopment.
We've taken a look at the static publisher module (https://github.com/silverstripe/silverstripe-staticpublisher) and found that when we generate static pages, the pages become fast and score well on the various tools; however, the process to regenerate all of these pages takes over 14 hours, which is unacceptable given these products are updated from an external source daily. We also found that the regeneration process is a memory hog, as the module builds all of the pages in memory before dumping them to file, causing the process to crash. We've had to alter the process to go subsite-by-subsite just to make it run.
We then took a look at the static publishing queue module (https://github.com/silverstripe/silverstripe-staticpublishqueue), which seemed to address our issues by having it queue pages as needed for regeneration, making it much more responsive to changes. However, the module seems to be very buggy and often crashes when generating pages.
Has anyone had experience using these modules (or similar) with larger sites and may be able to provide any pointers or ideas on how to implement static publishing successfully?
We are currently using staticpublishqueue on several sites. The only problem we've had with it is crashing due to long builds and poor locking. Or, to be precise, it doesn't actually crash but keeps spawning more and more instances until the server becomes unresponsive.
I think we have a fix for this in our fork. At least we haven't had any problems after using the modified locking. You could try installing the fork instead of the official version. If this fixes things for you maybe we should make a pull request :)
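To illustrate the general pattern (not necessarily exactly what the fork does): a worker takes an exclusive, non-blocking lock before building, and a newly spawned instance exits immediately if the lock is already held. A rough sketch, in Python rather than the module's actual PHP, with a made-up lock-file path:

    import fcntl
    import sys

    LOCK_PATH = "/tmp/staticpublishqueue.lock"  # hypothetical location

    def run_queue_worker(build):
        """Run the static publish build only if no other instance holds the lock."""
        lock_file = open(LOCK_PATH, "w")
        try:
            # Non-blocking exclusive lock: fails at once if another worker has it.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            print("Another publish worker is already running; exiting.")
            sys.exit(0)
        try:
            build()  # the long-running page regeneration
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
            lock_file.close()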
First off: we only use staticpublishqueue, and I don't have any experience with the Subsites module, so I can't speak for your exact combination.
We are using staticpublishqueue on a huge site. Setup: We have multiple servers running the SilverStripe Website. They share a MySQL Database and use Redis as a session store.
One great thing about staticpublishqueue: you can run it in parallel. So the servers all run an instance of staticpublishqueue and publish into a shared folder, which is then synced to an nginx load balancer in front of the actual webservers. This works quite nicely, but it does not scale indefinitely. At some point the staticpublishqueue instances start to pick the same record to render and waste resources. I think about 6 instances is the max for us.
Couple of things we learned regarding staticpublishqueue:
do not run too many instances at the same time (see above)
make sure it has enough RAM
make sure it runs as the same user as the website
the record lock it uses is not compatible with a MariaDB Galera Cluster
if possible, switch to SilverStripe 3.6.x and PHP 7; the performance gain is huge
We are migrating away from staticpublishqueue to Cloudflare (or maybe another CDN). Why? Because if a requested page has not been rendered yet, the server will render it for each request individually and then throw it away, until the queue eventually does a separate render for the cache. That's a total waste of resources, especially if you purge your cache after a site-wide layout change or something.
I am quite new to running websites in general. I am familiar with statistical profilers for desktop applications, but unsure how to even begin profiling a website as there are a lot of additional potential bottlenecks and I'm not sure what profilers are available for websites.
I have looked around and seen useful suggestions in other questions, but I am not sure they form a very complete solution. The main suggestions are Azure performance counters and the suggestions from this answer.
Summarizing, they are:
Use Firebug to determine rendering time and loading time separately, so one can tell whether one has a rendering issue or a server issue.
If server side:
Test a small static page, like a page with a single GIF. If that is slow, one has a CPU issue. Otherwise, one is probably IO-bound or has problems with database performance.
One can use performance counters to check server aspects such as:
memory
garbage collection
tcp/ip issues
bytes sent / received
requests requested, queued, rejected
request wait time, processing time
From my naive standpoint, some things that seem to be missing from this list are the sort of profiling one has for a traditional desktop application, i.e. what the stack looked like for what percentage of the time (in other words, which functions we were spending time in, and in what context). Another missing item is profiling database performance, which may behave differently on Azure than in a local environment, especially once one starts dealing with scaling. Another is time spent on requests to third-party services, though maybe that can be done with Azure performance counters(?).
I apologize for the naive nature of this question. What tools and aspects am I missing here to profile an Azure ASP.NET MVC website, and what changes would you make to the above list?
There are a lot of aspects to profiling a site: database calls, business logic, rendering a view, and even client-side performance (any jQuery that might run, for example).
Stack Overflow's MiniProfiler is one of the easiest things to get going: just install a NuGet package, add some JavaScript includes, and wrap whatever you want to test inside a using() block, and you'll see execution times (including LINQ-to-SQL and EF). You can even create steps if you want finer-grained timings of individual calls.
The nice thing about MiniProfiler is you can enable/disable based on the environment, which makes it suitable for running inside Azure (as opposed to say, the Visual Studio Profiler).
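MiniProfiler itself is a .NET library, so the snippet below is not its API; it is just a rough Python analogue of the step-timing pattern described above, to show the kind of nested, per-block timings you get:

    import time
    from contextlib import contextmanager

    @contextmanager
    def step(name):
        """Time a named block of work, roughly like a MiniProfiler step."""
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            print(f"{name}: {elapsed_ms:.1f} ms")

    # Nest steps around the pieces you suspect are slow.
    with step("render homepage"):
        with step("load products"):
            time.sleep(0.05)  # stand-in for a database query
        with step("render view"):
            time.sleep(0.02)  # stand-in for template rendering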
You can also look at Azure Performance Counters, which will give you an idea of system resources, but isn't profiling in the sense that MiniProfiler is. It will however give you an idea of network latency and CPU and memory utilization.
Once you're satisfied there, you can use Chrome's Developer Tools to profile your application on the client side. It'll give you an idea of how well your JavaScript is doing, including CSS selectors and rendering.
Also worth noting, Visual Studio has a really good Profiler in some higher editions that can give you deep insights into your code. Time spent in methods, call counts, etc.
Between these four methods, you should be able to find most bottlenecks, especially for a first pass.
We have an application deployed on Windows Azure as a Web Role and we are using Pingdom for testing page load times: http://tools.pingdom.com/fpt/
The URL for the application on Windows Azure is http://www.doctorspring.com.
The load time of the app is usually around 7s.
The database is an SQL Azure database and the role and the database are in the same zone.
Sample Pingdom result: http://tools.pingdom.com/fpt/#!/CllGggrMz/http://www.doctorspring.com/
Sample Pingdom result (with gzip): http://tools.pingdom.com/fpt/#!/f2TUbR6OX/www.doctorspring.com
Suspecting that Azure could be the problem, we tried free hosting from Somee:
http://www.doctorspring.somee.com
The load time of the app on Somee is around 3.5s.
Sample Pingdom result: http://tools.pingdom.com/fpt/#!/o3gZOjTwH/http://www.doctorspring.somee.com/
That is a huge performance issue for us.
Can you please help us understand the problem with Azure, or suggest a method for how we can overcome it?
Thanks,
Manish
In both cases, loading the homepage is unacceptably slow - 3.5 seconds to generate a page is around 10 times slower than you need to be when there's no load on the site. I'd expect the site to crumble under even moderate load with this kind of performance.
Without knowing how the site is constructed, it's hard to explain the reason one environment is faster than the other - but my guess is that whatever is generating the page (some kind of CMS?) is the cause. Azure is known to be a touch slow when doing database queries - though normally this only manifests itself under extreme conditions.
I'd recommend tuning the CMS - especially with caching. We found that Azure is normally pretty fast, but when doing database lookups (e.g. retrieving content for the CMS), it can be variable; if your CMS is doing a LOT of database queries to get the homepage content, it's going to be slow.
It's also worth running YSlow - there's some low-hanging fruit there for getting performance up.
What services are you running in Azure? Web role, VM, Website? Are you connecting to an Azure Database instance from the homepage (if so, how many distinct calls are you making)? I'm getting around a 7.5 second load time from London, but to be honest even 3 seconds is too slow for the homepage. It's hard to know what's causing the prolonged page load, but if you are connecting to a DB instance there's a great deal you can do, e.g.:
Render the page and make some asynchronous calls to spool in additional data.
Make sure your Azure services are running close together
Consider caching database content to a blob. E.g. for the data in "Medical Questions Answered in Last 24 Hours": if you are pulling this from a DB on every load, you could considerably speed up access by routinely caching it to an HTML file stored in a blob container and injecting it into the page (see the sketch after this list).
If you must make DB calls from the homepage, try to make as few round trips as possible by batching up your queries into a stored procedure.
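For example (purely a sketch: Python with the azure-storage-blob package rather than your actual ASP.NET stack, and the container and blob names are made up), a scheduled job could render that fragment and push it to blob storage so the homepage just embeds a static file:

    from azure.storage.blob import BlobServiceClient, ContentSettings

    def publish_recent_questions_fragment(conn_str, html):
        """Upload a pre-rendered HTML fragment so the homepage can embed it
        without querying the database on every request."""
        service = BlobServiceClient.from_connection_string(conn_str)
        blob = service.get_blob_client(container="homepage-cache",
                                       blob="recent-questions.html")
        blob.upload_blob(
            html,
            overwrite=True,
            content_settings=ContentSettings(content_type="text/html"),
        )

    # Run this from a scheduled job every few minutes; 'html' would be the
    # fragment rendered from the "answered in the last 24 hours" query.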
I've made a lot of assumptions here, but there are certainly things you could do to drastically improve performance on this page.
I am well aware that this might not be the typical SO question, but since this is the strongest R programming community I know and the author of OpenCPU explicitly encourages posting here, I'll give it a try:
What role does data play in the OpenCPU approach? I mean, cloud computing is nice, but you need some data to calculate on. Uploading an example .csv or .xls table might be straightforward, but what does OpenCPU have in mind for real-world data?
What about several hundred MBs (or even GBs) of data? How would you a) transfer it to your user folder, b) share it among a group of authenticated users, and c) hide it from the public?
I read the license part, and from what I understand, for safety it should be possible to run the calculations behind the scenes as long as the source code is publicly available. But still, the brief documentation leaves open questions and a lot of guessing.
Thanks for trying OpenCPU. OpenCPU is still an evolving project at this point, so we are open to interesting suggestions or use cases.
About the data... you are asking many things at once. Some thoughts:
At this point, OpenCPU does not solve the 'big data' problem. It does not scale beyond what R itself scales to. It is mostly meant as an infrastructure for small to medium sized data; e.g. a typical research paper, project, etc.
OpenCPU is an API. It is not limited to browser clients. It is designed to be called from other clients as well.
OpenCPU has a store that you can use to keep R objects on the server. E.g. you upload a CSV (or whatever) once, and then you store the actual data frame. In any subsequent calls you can then include this object as an argument to function calls (see the sketch after this list).
Another approach would be to combine it with an external database (e.g. MySQL) and dynamically pull the data in your R code (e.g. using RMySQL).
AFAIK, the legal aspects of open data are not completely clear at this point. I don't think there is consensus on how copyright applies to data, or what a good license would be. However, a key feature in the design of OpenCPU is making sure things are easily reproducible. This can of course only be done when the data is actually public.
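As a rough illustration of that upload-once / reuse-later flow (a sketch only: it assumes the standard /ocpu/library/{package}/R/{function} endpoints and the X-ocpu-session response header described in the OpenCPU API docs, and the file name is made up):

    import requests

    BASE = "https://cloud.opencpu.org/ocpu"  # or your own OpenCPU server

    # 1. Upload the CSV once; read.csv runs on the server and its result
    #    is kept in a temporary server-side session.
    with open("mydata.csv", "rb") as f:
        r = requests.post(f"{BASE}/library/utils/R/read.csv",
                          files={"file": f})
    r.raise_for_status()
    session_key = r.headers["X-ocpu-session"]  # e.g. "x0a1b2c3..."

    # 2. Reuse the stored data frame in a later call by passing the
    #    session key as the argument value (here: count rows with nrow).
    r2 = requests.post(f"{BASE}/library/base/R/nrow",
                       data={"x": session_key})
    print(r2.text)  # paths to the outputs of this second call

Note that such sessions are temporary and eventually get cleaned up, so for genuinely persistent data the package- or database-based approaches mentioned elsewhere in this thread are a better fit.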
Matt,
I'm dealing with a real-life use case that involves transforming and processing data from a 3GB (but growing) dataset. Here is the approach I am using (mostly based on suggestions from Gergely Daróczi):
as long as the source data can fit into the server memory, I'd choose to load the data with my R package and persist that data across user sessions (e.g. preloading data packages with OpenCPU)
if that's not an option on your server, an alternative is to copy your data to a RAM disk (Linux tmpfs) as .rds (or .rda, .RData, etc.) files, read those paths back with getOption("path_to_my_persistent_data_files") in your R package, and then load/unload these file(s) as needed in your package functions
when your data no longer fits into memory, I'd look into using a MongoDB backend together with the rmongodb R interface, as that would likely be faster and easier to maintain than an RDBMS.
Currently OpenCPU does not provide any support for large persistent datasets; it's up to you to find an approach that best suits your needs and resources.
You can install a local instance of OpenCPU; you don't have to use the existing one on the Internet. Instructions are on the site.