We are using Apache Pinot as the source system and have loaded 10 GB of TPC-H data into it. Presto is the query execution engine, connected through the Pinot connector.
We are starting with a simple configuration. Presto is installed on a CentOS machine with 8 CPUs and 64 GB RAM, running a single worker with an embedded coordinator. Pinot is installed on a CentOS machine with 4 CPUs and 64 GB RAM, running one controller, one broker, one server, and one ZooKeeper.
A query on the lineitem table involving a GROUP BY ROLLUP takes 23 seconds; around 20 seconds of that is spent transferring 2.3 GB of data from Pinot to Presto.
Another query, joining lineitem, nation, partsupp, and region with a GROUP BY CUBE, takes around 2 minutes. Data transfer accounts for around 25 seconds; most of the remaining time is spent on join and aggregation computation.
Is this normal performance for Presto with Pinot?
If not, what am I missing?
Do I need to add hardware, or increase the number of Presto/Pinot processes?
Are there any specific Presto properties I should consider modifying?
Thanks for your help in advance
Please list the queries so that we can provide a better answer. At a high level, the Presto Pinot connector tries to push down most of the computation (filters, aggregations, group by) to Pinot and minimize the amount of data that has to be pulled from Pinot.
There are always queries that require a full table scan, where the computation cannot be pushed down to Pinot; query latency can be higher in such cases. Pinot recently added a streaming API that can improve latency further.
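For illustration only (the catalog, schema, and column names below are assumptions based on the TPC-H schema mentioned in the question, not taken from your actual setup): a plain filter plus aggregation like the first query can normally be pushed down to Pinot, while a ROLLUP, CUBE, or multi-table join like the second typically cannot, so Presto has to pull the underlying rows and do the work itself.

-- Typically pushed down: filter + aggregation + simple GROUP BY
SELECT l_returnflag, l_linestatus, sum(l_extendedprice) AS revenue
FROM pinot.default.lineitem
WHERE l_shipdate <= DATE '1998-09-02'
GROUP BY l_returnflag, l_linestatus;

-- Typically NOT pushed down: ROLLUP/CUBE forces Presto to fetch the raw rows
SELECT l_returnflag, l_linestatus, sum(l_extendedprice) AS revenue
FROM pinot.default.lineitem
GROUP BY ROLLUP (l_returnflag, l_linestatus);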
Related
I am trying to build out some data repositories in an Azure Managed Instance / SQL Server DB. I have been shocked by how slow the write process is from R/RStudio. As an example, it took 65 minutes to write a table to Azure and less than one minute to write it to SQL Server on my local machine.
It appears to write about 20 rows per second, regardless of the number of columns (if I refresh a query within SSMS, it adds about 20 rows each time I run it).
I have read in other threads that this could be due to the performance tier. I saw mentions of the B, A, and P tiers, but the only information I see on tiers in our account is "General Purpose" and "Business Critical". We have General Purpose (Gen5) with 8 cores and 512 GB of storage, of which we are utilizing less than 10%. While performing one of these write operations from R, the overall CPU utilization is less than 1%.
I am able to read tables from Azure back to R/RStudio quickly. Only writing is significantly hampered.
All of this makes it feel like it's going much slower than it should, as if there were a throttling effect or something. It is so slow that I cannot effectively get historical data there; I allowed several things to run last night and they all timed out.
I am appending a large data frame (20 million rows) from R to PostgreSQL (9.5) using caroline::dbWriteTable2. I can see that this operation has created an active query, with the waiting flag set to f, using:
select *
from pg_stat_activity
where datname = 'dbname'
The query has been running for a long time (more than an hour) and I am wondering whether it is stalled. In my Windows 7 Resource Monitor I can see that the PostgreSQL server process is using CPU, but it is not listed under Disk Activity.
What other things can I do to check that the query has not been stalled for whatever reason?
Basically, if the backend is using CPU time, it is not stalled. SQL queries can run for a very long time.
There is no comfortable way to determine what a working PostgreSQL backend is currently doing; you can use something like strace on Linux to monitor the system calls issued or gdb to get a stack trace. If you know your way around the PostgreSQL source, and you know the plan of the active query, you can then guess what it is doing.
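That said, a quick first check from SQL itself is to look at the state and timing columns that pg_stat_activity already exposes. This is just a sketch; 'dbname' is the placeholder from the question, and the column names are as of PostgreSQL 9.5 (in 9.6 and later the boolean waiting flag is replaced by wait_event):

SELECT pid, state, waiting, xact_start, query_start,
       now() - query_start AS running_for,
       left(query, 60)     AS current_query
FROM pg_stat_activity
WHERE datname = 'dbname'
  AND state <> 'idle';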
My advice is to take a look at the query plan (EXPLAIN) and check whether there are expensive operations (high cost) that could explain the long execution time.
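As a purely hypothetical example (the table and column names below are made up, not from the question), the cost estimates appear in parentheses next to each plan node, and a surprisingly expensive node such as a sequential scan over a huge table is a good place to start looking:

EXPLAIN
SELECT o.*
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.created_at > now() - interval '1 day';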
In the DynamoDB documentation, and in many places around the internet, I've seen that single-digit millisecond response times are typical, but I cannot seem to achieve that even with the simplest setup. I have configured a t2.micro EC2 instance and a DynamoDB table, both in us-west-2, and when running the command below from the AWS CLI on the EC2 instance I get responses averaging about 250 ms. The same command run from my local machine (Denver) averages about 700 ms.
aws dynamodb get-item --table-name my-table --key file://key.json
When looking at the CloudWatch metrics in the AWS console, however, the average get latency is reported as 12 ms. If anyone could tell me what I'm doing wrong, or point me toward information so I can solve this on my own, I would really appreciate it. Thanks in advance.
The response times you are seeing are largely due to the cold start time of the AWS CLI. When running your get-item command, the CLI has to get loaded into memory, fetch temporary credentials (if using an EC2 IAM role on your t2.micro instance), and establish a secure connection to the DynamoDB service. Only after all that is completed does it execute the get-item request and finally print the results to stdout. Your command also has to read the key.json file off the filesystem, which adds further overhead.
My experience running on a t2.micro instance is that the AWS CLI has around 200 ms of startup overhead, which seems in line with what you are seeing.
This will not be an issue with long-running programs, as they pay a similar overhead only once, at start time. I run a number of web services on t2.micro instances that work with DynamoDB, and the DynamoDB response times are consistently under 20 ms.
There are a lot of factors that go into the latency you will see when making a REST API call. DynamoDB can provide latencies in the single digit milliseconds but there are some caveats and things you can do to minimize the latency.
The first things to consider are distance and the speed of light. Expect the best latency when accessing DynamoDB from an EC2 instance located in the same region. It is normal to see higher latencies when accessing DynamoDB from your laptop or another data center. Note that each region also has multiple data centers.
There are also performance costs from the client side based on the hardware, network connection, and programming language that you are using. When you are talking millisecond latencies the processing time on your machine can make a difference.
Another likely source of the latency is the TLS handshake. Establishing an encrypted connection requires multiple round trips and computation on both sides to get the encrypted channel established. However, as long as you are using keep-alive for the connection, you only pay this overhead for the first query; successive queries will be substantially faster since they do not incur this initial penalty. Unfortunately the AWS CLI isn't going to keep the connection alive between requests, but the AWS SDKs for most languages will manage this for you automatically.
Another important consideration is that the latency DynamoDB reports in the web console is the average. While DynamoDB does provide a reliably low average latency in the low double-digit milliseconds, the maximum latency will regularly be in the hundreds of milliseconds or even higher. This is visible by viewing the maximum latency in CloudWatch.
They recently announced DAX (Preview).
Amazon DynamoDB Accelerator (DAX) is a fully managed, highly available, in-memory cache for DynamoDB that delivers up to a 10x performance improvement – from milliseconds to microseconds – even at millions of requests per second. For more information, see In-Memory Acceleration with DAX (Preview).
I am running the termstrc yield curve analysis package in R across 10 years of daily bond price data for 5 different countries. This is highly compute-intensive: it takes 3200 seconds per country with a standard lapply, and if I use foreach with %dopar% (via doSNOW) on my 2009 i7 Mac, using all 4 cores (8 with hyperthreading), I get this down to 850 seconds. I need to re-run this analysis every time I add a country (to compute inter-country spreads), and I have 19 countries to go, with many more credit yield curves to come in the future. The time taken is starting to look like a major issue. By the way, the termstrc analysis function in question is accessed from R but is written in C.
Now, we're a small company of 12 people (read: limited budget), all equipped with 8 GB RAM, i7 PCs, at least half of which are used for mundane word processing, email, and browsing tasks, that is, at most 5% of their capacity. They are all networked over gigabit (but not 10-gigabit) Ethernet.
Could I cluster some of these underused PCs using MPI and run my R analysis across them? Would the network be affected? Each iteration of the yield curve analysis function takes about 1.2 seconds, so I'm assuming that if the granularity of the parallelism is to hand a whole function iteration to each cluster node, 1.2 seconds should be quite large compared with gigabit Ethernet latency.
Can this be done? How? And what would the impact be on my co-workers? Can they continue to read their email while I'm taxing their machines?
I note that Open MPI seems not to support Windows anymore, while MPICH seems to. Which would you use, if any?
Perhaps run an Ubuntu virtual machine on each PC?
Yes you can. There are a number of ways. One of the easiest is to use Redis as a backend (as easy as calling sudo apt-get install redis-server on an Ubuntu machine; rumor has it that you could have a Redis backend on a Windows machine too).
By using the doRedis package, you can very easily enqueue jobs on a task queue in Redis and then use one, two, ... idle workers to query the queue. Best of all, you can easily mix operating systems, so yes, your co-workers' Windows machines qualify. Moreover, you can use one, two, three, ... clients as you see fit and scale up or down; the queue does not know or care, it simply supplies jobs.
Best of all, the vignette in the doRedis package has working examples of a mix of Linux and Windows clients to make a bootstrapping example go faster.
Perhaps not the answer you were looking for, but this is one of those situations where an alternative is so much better that it's hard to ignore.
The cost of AWS clusters is ridiculously low (my emphasis) for exactly these types of computing problems. You pay only for what you use. I can guarantee you that you will save money (at the very least in opportunity costs) by not spending the time trying to convert 12 Windows machines into a cluster. For your purposes, you could probably even do this for free. (IIRC, they still offer free computing time on clusters.)
References:
Using AWS for parallel processing with R
http://blog.revolutionanalytics.com/2011/01/run-r-in-parallel-on-a-hadoop-cluster-with-aws-in-15-minutes.html
http://code.google.com/p/segue/
http://www.vcasmo.com/video/drewconway/8468
http://aws.amazon.com/ec2/instance-types/
http://aws.amazon.com/ec2/pricing/
Some of these instances are so powerful you probably wouldn't even need to figure out how to set up your work on a cluster (given your current description). As you can see from the references, costs are ridiculously low, ranging from $1 to $4 per hour of compute time.
What about OpenCL?
This would require rewriting the C code, but would allow potentially large speedups. The GPU has immense computing power.
I have developed a program which uses an SQLite 3.7 ... database; in it there is a rather extensive write/read module that imports, checks, and updates data. This process takes 14 seconds on my PC and I'm pleased as punch with the performance.
I use transactions for everything, with parameters. My PC is an Intel i7 with 18 GB of RAM. I have not set anything in the database. I used SQLite Expert to create the database and the data structures, including tables and columns, and checked that all indexes are created. In other words, it's all OK.
I have since deployed the program/database to 2 other machines. That 14-second process takes over 5 minutes on the other machines. Same program, identical data, identical database. The machines are up to date; one is a 3rd-generation Intel i7 bought last week, and the other is quite fast as well, so hardware should not be an issue.
I'm just not understanding what the problem could be. Is it the database itself? I have not set anything on it other than encryption. Remember that I run the same thing and it takes 14 seconds. Could it be that the database is 'optimised' for my PC, so when I give it to others it's not optimised?
I know I could turn off journaling to get better performance, but that would only speed up the process and would still leave the problem.
Any ideas would be welcome.
EDIT:
I have tested the program on my 7-year-old dual Athlon with 3 GB of RAM running XP on an HDD, and the procedure took 35 seconds, which is well within tolerable limits considering. I just don't get what could be making 2 modern machines take 5 minutes.
I have an idea that it's a write issue, as when only reading they are slower but quite acceptable.
SQLite speed is affected most by how well the disk does random reads and writes; any SSD is much better at this than any rotating disk.
Whenever changes overflow the internal cache, they must be written to disk. You should use PRAGMA cache_size to increase the cache to more than the default 2 MB.
Changed data must be written to disk at the end of every transaction. Make sure that there are as many changes as possible in one transaction.
If much of your processing involves temporary tables or indexes, the speed is affected by the speed of the main disk. If your machines have enough RAM, you can force temporary data to RAM with PRAGMA temp_store.
You should enable Write-Ahead Logging.
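As a sketch, the settings above could look like this, issued once per connection right after opening the database (the values are illustrative, not tuned for your workload; cache_size and temp_store are per-connection, while journal_mode = WAL is persisted in the database file):

PRAGMA cache_size = -65536;   -- negative value is in KiB, i.e. roughly 64 MB instead of the 2 MB default
PRAGMA temp_store = MEMORY;   -- keep temporary tables and indexes in RAM
PRAGMA journal_mode = WAL;    -- enable write-ahead logging

BEGIN;
-- ... perform the bulk of the import's INSERTs/UPDATEs inside one transaction ...
COMMIT;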
Note: the default SQLite distribution does not have encryption.