How to restrict the concurrent running map tasks?

My Hadoop version is 1.0.2. I want at most 10 map tasks running at the same time. I have found 2 variables related to this question.
a) mapred.job.map.capacity
But in my Hadoop version, this parameter seems to have been abandoned.
b) mapred.jobtracker.taskScheduler.maxRunningTasksPerJob (http://grepcode.com/file/repo1.maven.org/maven2/com.ning/metrics.collector/1.0.2/mapred-default.xml)
I set this variable as shown below:
Configuration conf = new Configuration();
conf.set("date", date);
conf.set("mapred.job.queue.name", "hadoop");
conf.set("mapred.jobtracker.taskScheduler.maxRunningTasksPerJob", "10");
DistributedCache.createSymlink(conf);
Job job = new Job(conf, "ConstructApkDownload_" + date);
...
The problem is that it doesn't work: there are still more than 50 map tasks running when the job starts.
After looking through the Hadoop documentation, I can't find any other way to limit the number of concurrently running map tasks.
Hope someone can help me. Thanks.
=====================
I have found the answer to this question; sharing it here for others who may be interested.
Using the Fair Scheduler, the configuration parameter maxMaps sets a pool's maximum number of concurrent map task slots, in the allocation file (fair-scheduler.xml).
Then, when you submit jobs, just set the job's queue/pool to the corresponding pool.
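For illustration, a minimal sketch of the job-submission side under these assumptions: a pool named "limited10" has been declared in fair-scheduler.xml with maxMaps set to 10, and the cluster is running the Fair Scheduler (the pool name and driver class below are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class PoolLimitedDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assign the job to the pool whose maxMaps caps its concurrent map slots at 10.
        // (Pool "limited10" is assumed to exist in fair-scheduler.xml.)
        conf.set("mapred.fairscheduler.pool", "limited10");
        Job job = new Job(conf, "PoolLimitedJob");
        // ... set mapper, reducer, input and output paths as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}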

You can set the value of mapred.jobtracker.maxtasks.per.job to something other than -1 (the default). This limits the number of simultaneous map or reduce tasks a job can employ.
This variable is described as:
The maximum number of tasks for a single job. A value of -1 indicates that there is no maximum.
I think there were plans to add mapred.max.maps.per.node and mapred.max.reduces.per.node to job configs, but they never made it to release.

If you are using Hadoop 2.7 or newer, you can use mapreduce.job.running.map.limit and mapreduce.job.running.reduce.limit to restrict map and reduce tasks at each job level.
See the corresponding fix's JIRA ticket for details.
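A minimal sketch of setting these per-job limits in the driver (the limit values and class name are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RunningLimitDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cap how many map/reduce tasks of this job may run at the same time
        // (available in Hadoop 2.7+; the values here are illustrative).
        conf.setInt("mapreduce.job.running.map.limit", 10);
        conf.setInt("mapreduce.job.running.reduce.limit", 5);
        Job job = Job.getInstance(conf, "RunningLimitDemo");
        // ... configure mapper, reducer, input and output paths as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}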

mapred.tasktracker.map.tasks.maximum is the property that restricts the number of map tasks that can run at a time on a single TaskTracker. Have it configured in your mapred-site.xml (it is a per-node daemon setting, not a per-job one).
Refer to question 2.7 in http://wiki.apache.org/hadoop/FAQ

The number of mappers spawned is decided by the input split size. The split size is the size of the chunks into which the data is divided and handed to the different mappers as it is read from HDFS (by default it follows the HDFS block size). So in order to control the number of mappers, we have to control the split size.
It can be controlled by setting the parameters mapred.min.split.size and mapred.max.split.size while configuring the job in MapReduce. The value is set in bytes. So if we have a 20 GB file and we want 40 mappers, then each split needs to be 20480 MB / 40 = 512 MB, which is 536,870,912 bytes. The code for that would be:
conf.set("mapred.min.split.size", "536870912");
conf.set("mapred.max.split.size", "536870912");
where conf is an object of the org.apache.hadoop.conf.Configuration class.
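If you are using the newer mapreduce API, the same idea can be expressed through FileInputFormat's helper methods rather than the raw property names. A minimal sketch (the class and job names are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "SplitSizeExample");
        long splitSize = 512L * 1024 * 1024; // 512 MB in bytes
        // Pin both bounds to the same value so each mapper receives ~512 MB.
        FileInputFormat.setMinInputSplitSize(job, splitSize);
        FileInputFormat.setMaxInputSplitSize(job, splitSize);
        // ... configure mapper, input and output paths as usual ...
    }
}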

Read about scheduling jobs in Hadoop (for example the Fair Scheduler). You can create a custom queue with many configuration options and then assign your job to that queue. If you limit your custom queue's maximum map tasks to 10, then each job assigned to the queue will have at most 10 concurrent map tasks.

Related

AWS Neptune query (Gremlin) slowness on cold call

I'm currently running some queries with a big performance gap between the first call (up to 2 minutes) and the following ones (around 5 seconds).
This duration difference can be seen through the Gremlin REST API in both execution and profile mode.
As the query loads a big amount of data, I expect the issue is coming from the caching functionality of Neptune in its default configuration. I was not able to find any way to improve this behavior through configuration and would be glad to have some advice in order to reduce the length of the first call.
Context:
The Neptune database is running on a db.r5.8xlarge instance, and during execution the CPU always stays below 20%. I'm also the only user on this instance during the tests.
As we don't have differential inputs, the database is recreated on a weekly basis and switched to production once the loader has loaded everything. Our database therefore has a short lifetime.
The database contains slightly above 1,000,000,000 nodes and far more edges (probably around 10,000,000,000). Those edges are split across 10 edge labels, and most of them are not used in the current query.
Query :
// recordIds is a list of 50 ids.
g.V(recordIds).hasLabel('record')
// Convert local id to neptune id.
.out('local_id')
// Go to the tree parent link (either myself if the edge comes back, or the real parent).
.bothE('tree_top_parent').inV()
// Clean duplicates.
.dedup()
// Follow the tree parent link backward to get all children; this step loads a big number of nodes belonging to the same tree.
.in('tree_top_parent')
.not(values('some flag').is('Q'))
// Limitation not reached, result is between 80k and 100K nodes.
.limit(200000)
// Convert back to local id for the 80k to 100k selected nodes.
.in('local_id')
.id()
Neptune's architecture consists of a shared cluster "volume" (where all data is persisted and replicated 6 times across 3 availability zones) and a series of decoupled compute instances (one writer and up to 15 read replicas in a single cluster). No data is persisted on the instances; however, approximately 65% of the memory capacity on an instance is reserved for a buffer pool cache. As data is read from the underlying cluster volume, it is stored in the buffer pool cache until the cache fills. Once the cache fills, a least-recently-used (LRU) eviction policy will clear buffer pool cache space for any newer reads.
It is common to see first reads be slower due to the need to fetch objects from the underlying storage. One can improve this by writing and issuing "prefetch" queries that pull in the objects you think you might need in the near future.
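As an illustration of the "prefetch" idea, here is a minimal sketch using the TinkerPop Java driver; the endpoint, the ids and the exact warm-up traversal are placeholders, and the real prefetch should touch whatever data the production query needs:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.tinkerpop.gremlin.driver.Client;
import org.apache.tinkerpop.gremlin.driver.Cluster;

public class NeptuneWarmup {
    public static void main(String[] args) throws Exception {
        // Endpoint is a placeholder; adjust SSL/auth settings to your cluster.
        Cluster cluster = Cluster.build("my-neptune-endpoint")
                .port(8182)
                .enableSsl(true)
                .create();
        Client client = cluster.connect();

        Map<String, Object> params = new HashMap<>();
        params.put("recordIds", Arrays.asList("id-1", "id-2"));

        // Touch the same vertices/edges the production query will need so they
        // land in the buffer pool cache before the real call arrives.
        client.submit(
                "g.V(recordIds).hasLabel('record').out('local_id')"
              + ".bothE('tree_top_parent').inV().dedup()"
              + ".in('tree_top_parent').id()",
                params)
              .all().get();

        client.close();
        cluster.close();
    }
}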
If you have a use case that is filling buffer pool cache and constantly seeing buffer pool cache misses (a metric one can see in the CloudWatch metrics for Neptune), then you may also want to consider using one of the "d" instance types (ex: r5d.8xlarge) and enabling the Lookup Cache feature [1]. This feature specifically focuses on improving access to property values/literals at query time by keeping them in a directly attached NVMe store on the instance.
[1] https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-lookup-cache.html

Why wouldn't a small Firebase Functions app just use a single Function to handle logic?

...aside from the benefit in separate performance monitoring and logging.
For logging, I am confident I can get granularity through manually adding the name of the "routine" to each call. This is how it is now with several discrete Functions for different parts of the system:
There are multiple automatic logs: start and finish of the routine, for example. It would be more challenging to find out how expensive certain routines are, but it would not be impossible.
The reason I want the entire logic of the application handled by a single handler function is to reduce cold starts: one function means only one container that can be persistently kept alive when there are very few users of the app.
If a month is ~2.6m seconds and we assume the system uses 1 GB RAM and 1 GHz CPU frequency at all times, that's:
2600000 * 0.0000025 + 2600000 * 0.000001042 = USD$9.21 a month
...for one minimum instance.
I should also state that all of my functions have the bare minimum amount of global scope code; it just sets up Firebase assets (RTDB and Firestore).
From a billing, performance (based on user wait time), and user/developer experience perspective, is there any reason why it would be smart to keep all my functions discrete?
I'd also accept an answer saying "one single function for all logic is reasonable" as long as there's a reason for it.
Thanks!
If you have a very small app with ~5 endpoints and very low traffic, sure, you could do something like this. But here is why I wouldn't:
Billing and performance
The important thing to realize is that each function instance handles one request at a time, so new instances are created as concurrent requests come in, which means there could be tens of them running at the same time.
If you would like to have just one instance handling all the traffic, you should explore GCP Cloud Run, where one container handles multiple requests and scales only when that's not sufficient.
Imagine you have several endpoints and every one of them has different performance requirements:
one may need only 128MB of RAM,
another may need 1GB of RAM.
(FYI: you can control the CPU MHz of the function via the RAM setting too, which can speed up execution in some cases.)
If you had only one function with 1GB of RAM, every request would allocate that much, and in some cases most of the memory would go to waste.
But if you split it into multiple functions, some requests will require much less resources, which can save you money once we talk about bigger amounts of executions per month (tens of thousands and up).
Let's imagine a function with a 3 second execution time and 10k executions/month:
at 128MB it would cost you $0.0693,
at 1024MB it would cost you $0.495.
As you can see, with a small app the difference could be nothing. But if you scale, it matters. (The cost can vary based on the datacenter.)
As for the logging, I don't think it matters. Usually in bigger systems messages travel through several functions anyway, so you have to deal with that regardless.
As for the cold start, you just need good UI to accommodate it. At first I was worried about it in our apps, but later on you get used to the fact that some actions can take ~2s to execute (cold start). And you should have a loading state in the UI regardless, because you don't know whether the function will take ~100ms or 3s due to a bad connection.

Mule 4: Batch Processing: Can we set the Aggregator size to more than the Batch block size?

Scenario:
processing data of huge volume, say 1 million records
each record is 1MB in size
using batch processing to process these records, where the block size is set to 100
I have to aggregate the elements in an Aggregator and then call an external API to write these data.
I can only make n calls to the API in 1 hour.
So I am calculating the aggregator size as follows:
aggregator_size = total_number_of_records / n
Thus, aggregator_size >> batch block size.
Is this the right approach?
If not, what alternative can be used for this?
Thanks in advance.
Block size is the number of records that are going to be read by each batch thread to process.
Aggregator size is the number of records that are going to be aggregated in an aggregator inside a step, if not using streaming. Note that using streaming you can only access the aggregated records sequentially, but memory usage is lower.
For the scenario described, also take into account that if you are going to write to the same file, there might be issues: several batch threads may try to write to the same file at the same time, corrupting it, or the order of writing can be unpredictable.

Setting a new label on all nodes takes too long in a huge graph

I'm working on a graph containing about 50 million nodes and 40 million relationships.
I need to update every node.
I'm trying to set a new label to these nodes, but it's taking too long.
The label applies to all 50 million nodes, so the operation never ends.
After some research, I found out that Neo4j treats this operation as a single transaction (I don't know if optimistic or not), keeping the changes uncommitted until the end (which will never happen in this fashion).
I'm currently using Neo4j 2.1.4, which has a feature called "USING PERIODIC COMMIT" (already present in earlier versions). Unfortunately, this feature is coupled to the "LOAD CSV" feature, and not available to every cypher command.
The cypher is quite simple:
match n set n:Person;
I decided to use a workaround, and make some sort of block update, as follows:
match n
where not n:Person
with n
limit 500000
set n:Person;
It's ugly, but I haven't come up with a better solution yet.
Here are some of my confs:
== neo4j.properties =========
neostore.nodestore.db.mapped_memory=250M
neostore.relationshipstore.db.mapped_memory=500M
neostore.propertystore.db.mapped_memory=900M
neostore.propertystore.db.strings.mapped_memory=1300M
neostore.propertystore.db.arrays.mapped_memory=1300M
keep_logical_logs=false
node_auto_indexing=true
node_keys_indexable=name_autocomplete,document
relationship_auto_indexing=true
relationship_keys_indexable=role
execution_guard_enabled=true
cache_type=weak
=============================
== neo4j-server.properties ==
org.neo4j.server.webserver.limit.executiontime=20000
org.neo4j.server.webserver.maxthreads=200
=============================
The hardware spec is:
RAM: 24GB
PROC: Intel(R) Xeon(R) X5650 @ 2.67GHz, 32 cores
HDD1: 1.2TB
In this environment, each block update of 500000 nodes took from 200 to 400 seconds. I think this is because every node satisfies the query at the start, but as the updates take place, more nodes need to be scanned to find the unlabeled ones (but again, it's a hunch).
So what's the best course of action whenever an operation needs to touch every node in the graph?
Any help towards a better solution to this will be appreciated!
Thanks in advance.
The most performant way to achieve this is using the batch inserter API. You might use the following recipe:
Take a look at http://localhost:7474/webadmin and note the "node count". In fact it's not the number of nodes; it's more the highest node id in use. We'll need that later on.
Make sure to cleanly shut down your graph database.
Take a backup copy of your graph.db directory.
Write a short piece of Java/Groovy/(whatever JVM language you prefer) program that performs the following tasks (a sketch is given after this list):
open your graph.db folder using the batch inserter API
in a loop from 0 to <node count> (from the step above), check if the node with the given id exists; if so, grab its current labels, amend the list with the new label and use setNodeLabels to write it back
make sure you run shutdown on the batch inserter
Start up your Neo4j instance again.
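A minimal sketch of that program against the Neo4j 2.1 batch inserter API; the store path, the highest node id and the label name are placeholders taken from the question:

import java.util.ArrayList;
import java.util.List;

import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.Label;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserters;

public class AddLabelToAllNodes {
    public static void main(String[] args) throws Exception {
        long highestNodeId = 50_000_000L;      // the "node count" noted in webadmin
        Label person = DynamicLabel.label("Person");

        // Open the store directly; the Neo4j server must be shut down first.
        BatchInserter inserter = BatchInserters.inserter("/path/to/graph.db");
        try {
            for (long id = 0; id <= highestNodeId; id++) {
                if (!inserter.nodeExists(id)) {
                    continue;                  // node ids can have gaps
                }
                // Keep the existing labels and append the new one if missing.
                List<Label> labels = new ArrayList<>();
                boolean alreadyLabeled = false;
                for (Label l : inserter.getNodeLabels(id)) {
                    labels.add(l);
                    if (l.name().equals(person.name())) {
                        alreadyLabeled = true;
                    }
                }
                if (!alreadyLabeled) {
                    labels.add(person);
                    inserter.setNodeLabels(id, labels.toArray(new Label[labels.size()]));
                }
            }
        } finally {
            inserter.shutdown();               // flushes everything to disk
        }
    }
}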

How many mappers/reducers should be set when configuring Hadoop cluster?

When configuring a Hadoop cluster, what's the scientific method to set the number of mappers/reducers for the cluster?
There is no formula. It depends on how many cores and how much memory you have. The number of mappers plus the number of reducers should not, in general, exceed the number of cores. Keep in mind that the machine is also running the TaskTracker and DataNode daemons. One general suggestion is to have more mappers than reducers. If I were you, I would run one of my typical jobs with a reasonable amount of data to try it out.
Quoting from "Hadoop The Definite Guide, 3rd edition", page 306
Because MapReduce jobs are normally
I/O-bound, it makes sense to have more tasks than processors to get better
utilization.
The amount of oversubscription depends on the CPU utilization of jobs
you run, but a good rule of thumb is to have a factor of between one and two more
tasks (counting both map and reduce tasks) than processors.
A processor in the quote above is equivalent to one logical core.
But this is just theory, and most likely each use case is different from another; some tests need to be performed. Still, this number can be a good starting point to test with.
You should probably also look at reducer lazy loading, which allows reducers to start later when required, so the number of map slots can effectively be increased. I don't have much insight into this, but it seems useful.
Taken from Hadoop Gyan-My blog:
The number of mappers is decided in accordance with the data locality principle, as described earlier. Data locality principle: Hadoop tries its best to run map tasks on nodes where the data is present locally, to optimize on the network and inter-node communication latency. As the input data is split into pieces and fed to different map tasks, it is desirable to have all the data fed to a map task available on a single node. Since HDFS only guarantees data having a size equal to its block size (64M) to be present on one node, it is advised/advocated to have the split size equal to the HDFS block size so that the map task can take advantage of this data localization. Therefore, 64M of data per mapper. If we see some mappers running for a very short period of time, we should try to bring down the number of mappers and make them run longer, for a minute or so.
No. of reducers should be slightly less than the number of reduce slots in the cluster (the concept of slots comes in with a pre-configuration in the job/task tracker properties while configuring the cluster) so that all the reducers finish in one wave and make full utilisation of the cluster resources.
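For the job side of this, the reducer count is set explicitly when configuring the job. A minimal sketch (the value 14 is purely illustrative, e.g. for a cluster with 16 reduce slots; the class name is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "ReducerCountExample");
        // Slightly fewer reducers than available reduce slots, so all
        // reducers can finish in a single wave (the value is illustrative).
        job.setNumReduceTasks(14);
        // ... configure mapper, reducer, input and output paths as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}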
