Impala - Let Impala pick its own memory limit - cloudera

I have Cloudera Express 5.3.2 installed on a cluster. I would like to use it for Impala querying.
I want to let Impala set the limit depending on the cluster's capacity. In the Impala configuration in Cloudera Manager, it says to "leave it blank to let Impala pick its own limit". However, I can't leave the field blank because the web interface tells me that "this field is required".
http://i.imgur.com/c9RA8mV.png

Unfortunately Impala cannot set its own memory limit. You don't have to set a memory limit (use -1), but your queries will perform poorly if you run out of physical memory and the OS is forced to swap. If you're only using Impala on this cluster (i.e. not Hive, MapReduce, Spark, etc.), you can set this to most of the physical memory; we typically recommend 80%. If you do need to share resources with other systems, you should look at the resource management options available in CDH.
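For what it's worth, outside of Cloudera Manager this setting corresponds to the impalad -mem_limit startup flag, which (as far as I recall) accepts either an absolute size or a percentage of physical memory. A minimal sketch of the 80% recommendation, assuming a package-based install where /etc/default/impala holds the daemon flags; Cloudera Manager writes the equivalent for you:
# Hedged sketch: the "use 80% of physical memory" advice as an impalad startup flag
# (flag and file names assumed from package-based installs)
IMPALA_SERVER_ARGS="${IMPALA_SERVER_ARGS} -mem_limit=80%"   # or an absolute value, e.g. -mem_limit=38g
Individual queries can still be capped below that with the MEM_LIMIT query option in impala-shell.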

Related

MariaDB max connections

We have a big application that uses 40 microservices (Spring Boot), each holding about 5 database connections to a MariaDB server. That causes "too many connections" errors on our MariaDB server. The default is 151, but I was thinking of just setting max_connections to 1000 to be on the safe side. I can't find anywhere on the internet whether this is possible or even wise. Our MariaDB runs standalone on a VPS with 8 GB of memory; it is not in a Docker container or anything like that, but directly on the VPS.
What maximum number of connections is advisable, taking into account that we might scale up our microservices?
You can scale up your max_connections just fine. Put a line like
max_connections=250
in your MariaDB my.cnf file. But don't just set it to a very high number; each potential connection consumes RAM, and with only 8GiB you need to be a bit careful about that.
If you give this command you'll get a bunch of data about your connections.
SHOW STATUS LIKE '%connect%';
The important ones to watch:
Connection_errors_max_connections: the number of connection attempts that failed because you ran out of connection slots.
Connections: the total number of connections ever handled.
Max_used_connections: the largest number of simultaneous connections used.
Max_used_connections_time: the date and time when the server had its largest number of connections.
The numbers shown are cumulative since the last server boot or the most recent FLUSH STATUS statement.
Keep an eye on these. If you run short you can always add more. If you have to add many more connections as you scale up, you probably will need to provision your VPS with more RAM. The last two are cool because you can figure out whether you're getting hammered at a particular time of day.
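If you'd rather not wait for a restart, you can (max_connections is a dynamic variable) apply the new value at runtime and then check it against the counters above; a quick sketch, with credentials as placeholders:
# Raise the limit immediately (keep the my.cnf line so it survives a restart)
mysql -u root -p -e "SET GLOBAL max_connections = 250;"
# Verify the setting and watch the high-water mark discussed above
mysql -u root -p -e "SHOW VARIABLES LIKE 'max_connections';"
mysql -u root -p -e "SHOW STATUS LIKE 'Max_used_connections%';"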
And in your various microservices, be very careful to use connection pools with a reasonable maximum size. Don't let a microservice grab more than ten connections unless you run into throughput trouble. You didn't say what client tech you use (Node.js? .NET? PHP? Java?), so it's hard to give you specific advice on how to do that.

Impala memory configuration in Cloudera

I'm trying to understand the Impala memory settings on my cluster.
The hosts in the cluster have 48 GB of memory each. When I look at the memory resources for each service on a host in Cloudera Manager, I see that 38 GB of memory is allocated to the Impala Daemon.
But the Impala Daemon Memory Limit is set to 0 in the Impala configuration.
So where is the Impala Daemon getting the value of 38 GB from?
And I believe Impala Daemon memory limit is a node level limit, not a cluster level. Is that right?
Please note that static and dynamic pools are also not configured.
If you don't set a process memory limit, Impala defaults to using 80% of the memory on the system as its process memory limit. (Yes, the process memory limit is a per-node value, not a cluster-wide value.)
Note that this does not mean 80% of the system memory is actually available; it just means Impala will limit itself to 80% of it. If other processes are using that memory, you'll see swapping.
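If you want to confirm the limit a particular daemon actually picked up, the impalad debug web UI reports it; a quick check, assuming the default web UI port of 25000 and a placeholder hostname:
# The /memz page of the impalad debug web UI shows the process memory limit and current usage
curl http://impalad-host.example.com:25000/memz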

connection pooling with MonetDB, R and Shiny Pro

With R and Shiny Pro it is possible to implement multi-user analytical applications.
When a database is used to store intermediate data, how to give multiple users access to the DB becomes very relevant.
Currently I'm using MonetDB / MonetDB.R configured (as is usual for R) for single-user access, which means that user operations are executed in sequence.
I would like to implement some type of connection pooling with the DB.
From past SO responses, it appears the driver does not include connection pooling.
Are there alternatives within these toolsets?
I am not aware of any connection pool implemented for R DBI connections. The setup you describe seems rather special. You could just create a connection for every client session. MonetDB limits the number of concurrent connections, however; to increase this limit, you could set max_clients to a higher value, for example by starting mserver5 with --set max_clients=1000 or (if you use monetdbd) by running monetdb set nclients=1000 somedb. Of course, DBI connection pooling would also be a feature request for Shiny Pro and not for MonetDB.R.
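Spelled out as commands (somedb is the database name from above; depending on your MonetDB version the property may only take effect after the database is restarted, so that step is included here as an assumption):
# monetdbd-managed database: raise the concurrent-connection cap
monetdb stop somedb
monetdb set nclients=1000 somedb
monetdb start somedb
# Or, when running mserver5 directly (dbpath is a placeholder):
mserver5 --dbpath=/path/to/somedb --set max_clients=1000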

percona xtrabackup incremental backup vs replication

I was playing with Percona XtraBackup's innobackupex for incremental backups. It is a cool tool, very efficient and effective for incremental backups. However, I could not help but wonder why doing incremental backups would be any better than just running regular MySQL master-slave replication and, whenever point-in-time data needs to be retrieved, using the binary log.
What advantages do incremental backups have over master-slave replication? When should you choose one over the other?
One disadvantage of using master-slave replication as a backup is that accidentally running a data-damaging command like
DROP TABLE users;
would replicate to the slave.
They are solutions to two different problems: master-slave replication provides redundancy, while backups provide resilience.
The MySQL JDBC driver has the ability to connect to many servers. If you look at the driver options (https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-url-format.html) you will notice that the host part of the URL accepts not just a single host but a list of hosts. If you specify both the master and the slave in the URL and something happens to the master, the driver will automatically connect to the slave instead.
A backup, on the other hand, is, as mentioned earlier, a way to recover either from a catastrophic crash (storing your backups off-site is a must) or from a catastrophic mistake; neither of these is served by a master-slave setup. (Well, technically you could have the slave at a different site, but that still does not cover the mistake scenario.)
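To make the backup side concrete, this is roughly what an incremental cycle looks like with innobackupex; the paths and the TIMESTAMP directory names are placeholders, and the options are the ones documented for Percona XtraBackup 2.x:
# 1. Full base backup (written to a timestamped subdirectory)
innobackupex /backups/full
# 2. Incremental backup containing only the pages changed since the base
innobackupex --incremental /backups/inc1 --incremental-basedir=/backups/full/TIMESTAMP
# 3. Restore: prepare the base with --redo-only, merge the incremental, then run the final prepare
innobackupex --apply-log --redo-only /backups/full/TIMESTAMP
innobackupex --apply-log /backups/full/TIMESTAMP --incremental-dir=/backups/inc1/TIMESTAMP
innobackupex --apply-log /backups/full/TIMESTAMP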

Why is PCD bit set when I don't use ioremap_cache?

I am using Ubuntu 12.10 32-bit on an x86 system. I have physical memory (about 32 MB, sometimes more) which is enumerated and reserved through the ACPI tables as a device so that Linux/the OS cannot use it. I have a Linux driver for this memory device. The driver implements mmap() so that when a process calls mmap(), the driver can map this reserved physical memory to user space. Sometimes I also do nothing in mmap() except set up the VMA and point vma->vm_ops to a vm_operations_struct with the open, close, and fault functions implemented. When the application accesses the mmapped memory, I get a page fault and my .fault function is called. Here I use vm_insert_pfn to map the virtual address to any physical address in the 32 MB that I want.
Here is the problem I have: in the driver, if I call ioremap_cache() during init, I get good cache performance from the application when I access data in this memory. However, if I don't call ioremap_cache(), any access to these physical pages results in a cache miss and gives horrible performance. I looked at the PTEs and see that the PCD bit is set for these virtual-to-physical translations, which means caching of these physical pages is disabled. We tried setting _PAGE_CACHE_WB in the vm_page_prot field and also used remap_pfn_range with the new vm_page_prot, but the PCD bit was still set in the PTEs.
Does anybody have any idea how we can ensure caching is enabled for this memory? The reason I don't want to use ioremap_cache() for the 32 MB is that kernel virtual address space is limited on 32-bit systems and I don't want to hold onto it.
Suggestions:
Read linux/Documentation/x86/pat.txt
Boot Linux with debugpat
After trying the set_memory_wb() APIs, check /sys/kernel/debug/x86/pat_memtype_list
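As a concrete version of those suggestions (the debugfs path is the one named above and usually requires debugfs to be mounted; the GRUB file is just one common place to add the boot parameter):
# See how the kernel classified the region (write-back vs. uncached-minus) after your set_memory_wb() attempt
dmesg | grep -i pat
cat /sys/kernel/debug/x86/pat_memtype_list
# Enable extra PAT debug output by adding the parameter to the kernel command line,
# e.g. in /etc/default/grub: GRUB_CMDLINE_LINUX="... debugpat", then update-grub and reboot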
