ManifoldCF and PostgreSQL to crawl 1.5 million documents

We use ManifoldCF with PostgreSQL (9.6) to crawl our websites.
Crawling speed is good (approximately 20,000 docs/hour) up to about 500,000 documents.
After that, performance decreases and we see very long freezes of the crawl.
We suspect that PostgreSQL is rebuilding the indexes of the intrinsiclink table.
Is it possible to prevent this, perhaps through PostgreSQL settings?
Thank you,
Dan

What MCF version are you using? Try the latest version, 2.13.
Most of the time the database is what drags performance down; tuning PostgreSQL properly will get better results.
See the MCF performance-tuning guide: https://manifoldcf.apache.org/release/release-2.13/en_US/performance-tuning.html
You should turn off PostgreSQL autovacuuming and see if that helps.
There are many other factors in the tuning to try.
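As a minimal sketch of the autovacuum change: it can be switched off globally in postgresql.conf, or only for the table you suspect (the intrinsiclink name comes from the question; adjust to your schema). If you disable it, you become responsible for running VACUUM/ANALYZE manually while the crawler is paused:

    # postgresql.conf: disable autovacuum cluster-wide
    autovacuum = off

    -- or, in psql, disable it for the suspect table only:
    ALTER TABLE intrinsiclink SET (autovacuum_enabled = false);

    -- then do the maintenance yourself during quiet periods:
    VACUUM ANALYZE intrinsiclink;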

Related

Why does MariaDB 10.2 use InnoDB again instead of Percona XtraDB?

The MariaDB homepage says that they used Percona XtraDB up to 10.1 and that from 10.2 on they are going to use plain InnoDB again (https://mariadb.com/kb/en/mariadb/xtradb-and-innodb/).
This does not seem reasonable to me, because XtraDB appears to be the better, improved version of InnoDB (https://www.percona.com/software/mysql-database/percona-server/feature-comparison). So is this a typo, are there legal issues, or is the new version of InnoDB simply better than XtraDB?
There is even a question about this on the MariaDB page, but it has not been answered for weeks now.
Sorry, I could not include all the related links because of Stack Overflow rules.
Keeping InnoDB (or XtraDB) up to date with MySQL (Percona) is a complex task. It took us more than half a year to migrate from InnoDB-5.6 to InnoDB-5.7 in 10.2. Doing it again for XtraDB would probably have required only slightly less than that. For us to embark on such a project, it must bring significant benefits to our users.
XtraDB had many great improvements over InnoDB in 5.1 and 5.5. But over time, MySQL has implemented almost all of them. InnoDB has caught up and XtraDB is only marginally better. Not enough to justify a multi-month merge that would delay 10.2-GA for everyone.
In particular, the only real improvement that XtraDB 5.7 seems to have is for a write-intensive I/O-bound workload, where innodb_thread_concurrency control is disabled.
With a proper innodb_thread_concurrency, XtraDB is only marginally better. We didn't want to delay 10.2-GA by up to half a year for the sake of those few users who have write-intensive I/O-bound InnoDB workload and don't know how to configure innodb_thread_concurrency.
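For reference, the setting being discussed is the innodb_thread_concurrency server variable (0 means unlimited); the value 16 below is purely illustrative, not a recommendation:

    -- at runtime, in the MySQL/MariaDB client:
    SET GLOBAL innodb_thread_concurrency = 16;

    # or persistently, in my.cnf:
    [mysqld]
    innodb_thread_concurrency = 16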
Note that we still consider incorporating XtraDB optimizations, but as individual patches rather than XtraDB as a whole, since it no longer has numerous all-over-the-code improvements.
https://mariadb.com/kb/en/library/why-does-mariadb-102-use-innodb-instead-of-xtradb/
As far as I have seen, they did it for better compatibility with MySQL. During my training at M17 they did not say anything about it; I discovered this during the last 10 minutes of the social hour as I was providing feedback.
I'm sure it's because it's not GA yet.

How much data do I need to have to make use of Presto?

How much data do I need to have to make use of Presto? The web site states that it can query data sizes from gigabytes to petabytes. I understand how it is used to query very large datasets, but is anyone using it for hundreds of gigabytes?
Currently, Presto is most useful if you already have an existing Hive installation. If you are using Hive, you should definitely try Presto. If all your data fits in a relational database like PostgreSQL or MySQL on a single machine, and you are happy with the performance, then keep using that.
However, Presto should be much faster than either of those databases on a single machine for analytic queries, because it executes a query in parallel. Neither of those databases parallelizes the execution of individual queries. At the moment, using Presto requires setting up HDFS and Hive (even on a single machine), so getting started will be more work than if you already have an existing Hive installation.
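For a sense of the workflow: once a Presto coordinator is running against your Hive metastore, you query it with ordinary SQL through the Presto CLI. The server address and table name below are placeholders:

    presto --server localhost:8080 --catalog hive --schema default

    SELECT COUNT(*) FROM page_views WHERE ds = '2013-11-01';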
Or, you can take a look at Impala - which has been available as production-ready software for six months. Like Presto, Impala is a distributed SQL query engine for data in HDFS that circumvents MapReduce. Unlike Presto, there is a commercial vendor providing support (Cloudera).
That said, David's comments about data size still apply. Use the right tool for the job.

IIS Performance

We have the following setup:
Virtual server, Intel Xeon X5650 @ 2.67GHz (4 processors)
8GB RAM
Windows Server 2008 Standard 64-bit
SQL Server Express
IIS 7.5
Our database is only 200MB. We are running an ASP.NET app. We recently ran into some performance issues: ~200 concurrent connections were causing 100% CPU usage (mostly consumed by IIS) and pushing the response time to around 20 seconds! After some tweaks to our code we were able to run a load test from loader.io with 1500 concurrent users over 1 minute; our response time at the end was around 5 seconds, CPU was around 95% (again consumed mainly by IIS), and memory usage was sitting at around 4GB. However, we are expecting bigger spikes than 1500 users, up to around 4000 in a short amount of time.
My questions are the following:
1) Is this normal performance for our current setup? Our site is quite intensive on the database and we are using Entity Framework.
2) Would upgrading to Sql Web edition have any benefit seeing as though our Database is so small?
3) Do you think that this type of setup could handle 4000 users?
4) Any suggestions on what we could do to handle this load?
I know this is somewhat subjective, but any answers are much appreciated.
Is this normal performance for our current setup?
Depends on your code. Did you profile the code to make sure you don't have anything stupid in there?
Our site is quite intensive on the database and we are using Entity Framework.
Again, did you profile to figure out whether you spend a lot of time in Entity Framework? It is slow, but the question is what "intensive" means. This is what profilers are for.
Would upgrading to Sql Web edition have any benefit seeing as though our Database is so small?
Help, my pizza arrives too late. Would upgrading to a larger car help? You say yourself that you spend the time in IIS, not SQL Server.
Do you think that this type of setup could handle 4000 users?
You think my car is big enough? Note I don't tell you what I need it for. Without looking at usage patterns and your code - no idea. THAT SAID: the server is pathetic compared to what you can buy today. As such, this is an irrelevant question - just upgrade if you have to.
Any suggestions on what we could do to handle this load?
Load test + profiler, optimize code. Get a bigger server. Realize that we don't have crystal balls to figure out how good / bad / stupid your code is.
The number one question arising here is: did you deploy RELEASE or DEBUG compiled binaries of your project?
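If it was a DEBUG deployment, also make sure debug compilation is switched off in web.config; this is the standard ASP.NET setting that controls it:

    <configuration>
      <system.web>
        <!-- debug="true" disables batch compilation and slows requests -->
        <compilation debug="false" />
      </system.web>
    </configuration>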
An upgrade to Web Edition will not solve any problem here, since the difference between the editions is very simple: Web Edition is just throttled in the internal scheduler etc. - so you will be just fine with the Standard edition.
My experience is that the most crucial aspect of handling concurrent requests is the amount of server memory and how much of that memory your code consumes.
As physical memory is exhausted, the server starts to swap from physical to virtual memory, which slows down processing dramatically and leads to the symptoms you describe.
I would start by putting another 8GB of RAM into the server. In the meantime, try to optimize your code so that less data is processed during requests or less memory is used. Also, move SQL Server to a separate machine so that there is no competition between IIS and SQL Server for memory.
With your current machine, I doubt the problem is IIS itself; it is more likely related to the way your app is designed and/or utilizes frameworks. I personally learned just recently that IIS requests including multiple round trips to the database can be measured in hundreds of microseconds, not hundreds of milliseconds. A single locking bug or unbalanced queuing can limit your application's scalability regardless of your hardware specs [https://twitter.com/michaelzino/status/454512110165184512].
Entity Framework is known for validating your models against the database schema on the first calls. I would suggest profiling your app layers, starting from the data access layer, or the intrinsic database calls, and going up.
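One cheap experiment along those lines, if you use EF code-first: suppress the database initializer so EF skips its model-compatibility check at startup. A minimal sketch; MyDbContext stands in for your own context type:

    // C#, in application startup (requires using System.Data.Entity;)
    // Passing null skips EF's model/schema validation on first use.
    Database.SetInitializer<MyDbContext>(null);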

Asp.net NHibernate CPU performance after upgrade

Has anyone else had CPU spikes after switching over to NHibernate?
We switched to using NHibernate about 2 years ago. Since then we've had issues with the server's CPU running near 60-80%. We also had issues with the server running out of memory.
We've consistently been told to optimize our queries, which we did with only limited success. It wasn't until I recently upgraded from NHibernate 2.1 to 3.2 that we finally saw an improvement in CPU usage. It dropped from a 60 percent average to about 30 percent. I was amazed; I had been told by many who consider themselves experts that upgrading NHibernate would produce only limited improvements, if any at all.
My question is: has anyone else noticed CPU spikes with NHibernate, and have they seen any improvement after doing a major version upgrade? And last, why exactly is the new version performing so much better? I know NHibernate 3 has much better support for LINQ, and about 70 percent of my queries use LINQ, so my guess is that may be part of the reason I'm seeing better performance.
Also, does anyone have any ideas how I can optimize NHibernate to get even better CPU performance, other than upgrading the DLLs, which I have already done?
I'm currently running NHibernate 3.2 and Fluent NHibernate 1.2, upgraded from 2.1 and 1.0 respectively.
I suspect you have been told the same as I am about to recommend, but I urge you to look at all possibilities and discount them.
"We've consistently been told to optimize our query" - suspicion always lies with either the SQL generated by the ORM or the amount of time the DB takes to execute the query. This is sound advice, and you must disprove it using the following methods.
First, I would set up a trace on the live database server that runs for a week. Once this is done you may find that you get suggestions on indexes or SQL-related issues.
Secondly, I would fire up NHProf on my development box and run some stress tests against heavily used pages or pages that make a lot of database trips, to see what is going on behind the scenes with NHibernate. NHProf will give you advice about various problems including: select n+1, unbounded results, large numbers of rows returned, queries with too many joins, etc. Again, this tool is invaluable for bridging the gap between SQL Server and your code.
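To illustrate one of those findings: a select n+1 flagged by NHProf is usually fixed by fetching the association eagerly in the query. With the LINQ provider in NHibernate 3.x that looks roughly like the sketch below (Order and Customer are placeholder entities):

    // C#, NHibernate LINQ (requires using NHibernate.Linq; and System.Linq)
    // Eager fetch: one SQL join instead of one extra query per order.
    var orders = session.Query<Order>()
                        .Fetch(o => o.Customer)
                        .ToList();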
Hopefully, after this exercise you will have ideas on how to fix certain issues or introduce caching; OR, if you find you don't have any items to address, you will have valuable feedback that you can then post to the NHUsers group.
After all, if you think about it, tens of thousands of users use NHibernate. I have used NHibernate myself for several years, subscribe to the NHusers group, and I have not seen this CPU spike issue before. It always turns out to be either the SQL generated, a database under pressure, or large recordsets being hydrated.

Cassandra and asp.net (C#)

I am interested in building a portal on Cassandra services, since I have faced some performance and scaling issues starting from 1 million records.
They could certainly be solved, but I am interested in other options.
My main issue is the cost of updating all the necessary indexes to keep reads fast.
First, is Cassandra a good way to go for ASP.NET programmers? I mean, maybe there are other projects worth taking a look at.
And second, can you provide any documentation or samples on how to start programming against Cassandra from C#?
since I have faced some performance and scaling issues starting from 1 million records.
Maybe your design was not that good; NoSQL is not a magic bullet for bad design. I have multi-billion-row tables and 95% of responses are sub-second. Also, what do you mean by updating indexes - do you mean updating statistics or rebuilding indexes?
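For reference, those are two different operations. In SQL Server, for example, they look like this (the table and index names are placeholders):

    -- refresh the optimizer's statistics for a table:
    UPDATE STATISTICS dbo.Documents;

    -- physically rebuild an index:
    ALTER INDEX IX_Documents_Url ON dbo.Documents REBUILD;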
since I have faced some performance and scaling issues starting from 1 million records.
You know, one million rows is the mark where a modern database stops being "totally ridiculously small", where you can ignore actually knowing what you are doing. Below one million is tiny. I have an 800-million-row table and run a LOT of SQL through it - no problem at all.
First, is Cassandra a good way to go for ASP.NET programmers?
I would rather suggest a basic book about SQL, reading the documentation, and POSSIBLY throwing some hardware at the problem. As in: having totally bad hardware will kill any data management system.
If you are using Cassandra for your .NET application, take a look at Aquiles. I developed it based on my company's needs. If you find it useful or need any help, let me know.
You can't really speak of Cassandra documentation; there is a myriad of partial tutorials on the web.
You may want to set up Linux in a virtual machine, because the Windows build process is quite challenging, to say the least. (http://www.virtualbox.org, http://www.ubuntu.com)
Here's the howto:
http://www.ridgway.co.za/archive/2009/11/06/net-developers-guide-to-getting-started-with-cassandra.aspx
Note that the Cassandra SVN URL and the code sample have changed since this tutorial was written.
Here's another C# client:
http://github.com/mattvv/hectorsharp
And here some sample code:
http://www.copypastecode.com/26752/
Note that you need to download the latest Java Development Kit (JDK) from Sun for Linux; it is not in the repositories of Ubuntu 10.04.
Then you need to type
export JAVA_HOME="/path/to/jdk"
in order for Cassandra to find your Java installation.
You might also want to take a look at:
http://en.wikipedia.org/wiki/NoSQL
The taxonomy section in particular is interesting.
Make sure Cassandra is the right type of NoSQL solution for your problem, e.g. use Neo4J if your problem actually is a graph problem.
Also, you need to make sure your NoSQL solution is ACID-compliant.
For example, Neo4J is the only ACID-compliant NoSQL graph engine.
Edit: Here's a jumpstart guide for Windows, without compiling:
http://coderjournal.com/2010/03/cassandra-jump-start-for-the-windows-developer/
http://www.ronaldwidha.net/2010/06/23/running-cassandra-on-windows-first-attempt/
http://www.yafla.com/dforbes/Getting_Started_with_Apache_Cassandra_a_NoSQL_frontrunner_on_Windows/
Instead of Cassandra you might take a look at RavenDB. Supposedly it is a document store made with and created for .NET. It has LINQ integration and is (again supposedly) very fast.
As with any new technology, read up on whether it helps with your specific case, and check whether it is proven technology (do they have mainstream clients using it?).
Before you go down this route, see if you can't optimize your current solution first. Check that your queries are fast, that the indexes are done correctly, and whether you can remove load by adding caching (see the sketch after this answer).
Last but not least, if adding some processors to your SQL machine would fix the issue, that is typically a much cheaper solution.
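As a sketch of the caching idea in ASP.NET (the cache key, the Document type, and the LoadTopDocuments helper are all hypothetical):

    // C#, System.Web.Caching: serve repeated reads from memory for a minute
    var docs = HttpRuntime.Cache["top-docs"] as List<Document>;
    if (docs == null)
    {
        docs = LoadTopDocuments(); // hypothetical database call
        HttpRuntime.Cache.Insert("top-docs", docs, null,
            DateTime.UtcNow.AddMinutes(1), Cache.NoSlidingExpiration);
    }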
If you want to do something new, then instead of going for NoSQL you might want to consider trying a database cluster.
The idea is that when two machines each search half of the original database at the same time, you get half the search time without totally redesigning your existing database.
