How to get Laravel queue size without delayed jobs? - laravel-queue

I know how to get the size:
Queue::size()
will return the size of the queue, but it also includes the delayed jobs. How can I ignore these?
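For what it's worth, here is a hedged sketch that assumes the Redis queue driver, where pending jobs sit in a plain list and delayed jobs in a separate sorted set, so counting only the list skips the delayed ones. The queue name "default" and the connection are assumptions; adjust them to your setup.
<?php
// Sketch only: counts jobs for the Redis queue driver, skipping delayed jobs.
// "queues:default" and "queues:default:delayed" follow Laravel's Redis key
// conventions for a queue named "default" (an assumption here).
use Illuminate\Support\Facades\Redis;

$pending = Redis::connection()->llen('queues:default');          // ready-to-run jobs only
$delayed = Redis::connection()->zcard('queues:default:delayed'); // jobs scheduled for later

echo "Pending (without delayed): {$pending}\n";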

Related

Mule 4 : Batch Processing : Can we set Aggregator size larger than Batch block size?

Scenario:
processing data of huge volume, say 1 million records
each record is 1 MB in size
using batch processing to process these records, where the block size is set to 100
I have to aggregate the elements in an Aggregator and then call an external API to write this data.
I can only make n calls to the API in 1 hour.
So I am calculating the aggregator size as follows:
aggregator_size = total_number_of_records / n
Thus, aggregator_size >> batch block size.
So is this the right approach?
What alternatives are there?
Thanks in advance.
Block size is the number of records that each batch thread will read and process at a time.
Aggregator size is the number of records that are going to be aggregated in an aggregator inside a step, if not using streaming. Note that using streaming you can only access the aggregated records sequentially, but memory usage is lower.
For the scenario described, also take into account that if you are going to write to the same file, several batch threads may try to write to it at the same time, which could corrupt it or make the order of writing unpredictable.

Status of accessing the current offset of a consumer?

I see that there was some discussion on this subject before[*], but I cannot see any way to access this. Is it possible yet?
Reasoning: we have, for example, ContainerStoppingErrorHandler. If the container was automatically stopped, it would be vital to know where it stopped. I guess ContainerStoppingErrorHandler could be improved to log the partition/offset, as it has access to the Consumer (I hope I'm not mistaken). But if I'm stopping/pausing it manually via MessageListenerContainer, I cannot see a way to get info about the last processed offset. Is there a way?
[*] https://github.com/spring-projects/spring-kafka/issues/431

Bulk Insert in Symfony and Doctrine: How to select batch size?

I am working on a web app using Symfony 2.7 and Doctrine. A Symfony command is used to perform an update of a large number of entities.
I followed the Doctrine guidelines and do not call $entityManager->flush() for every single entity.
This is the Doctrine example code:
<?php
$batchSize = 20;
for ($i = 1; $i <= 10000; ++$i) {
    $user = new CmsUser;
    $user->setStatus('user');
    $user->setUsername('user' . $i);
    $user->setName('Mr.Smith-' . $i);
    $em->persist($user);
    if (($i % $batchSize) === 0) {
        $em->flush();
    }
}
$em->flush(); // Persist objects that did not make up an entire batch
The guidelines say:
You may need to experiment with the batch size to find the size that
works best for you. Larger batch sizes mean more prepared statement
reuse internally but also mean more work during flush.
So I tried different batch sizes. The larger the batch size, the faster the command completes its task.
Thus the question is: what are the downsides of large batch sizes? Why not use $entityManager->flush() only once, after all entities have been updated?
The docs just say that larger batch sizes "mean more work during flush". But why/when could this be a problem?
The only downside I can see is exceptions during the update: if the script stops before the changes were flushed, the changes are lost. Is this the only limitation?
What are the downsides of large batch sizes?
Large batch sizes may use a lot of memory if you create, for example, 10,000 entities. If you don't save the entities in batches, they will accumulate in memory, and if the program reaches the memory limit it may crash the whole script.
Why not use $entityManager->flush() only once, after all entities have been updated?
It's possible, but storing 10,000 entities in memory before calling flush() once will use more memory than saving entities 100 at a time. It may also take more time.
The docs just say that larger batch sizes "mean more work during flush". But why/when could this be a problem?
If you don't have any performance issues with bigger batch sizes, it's probably because your data is not big enough to fill the memory or disrupt PHP's memory management.
So the size of the batch depends on multiple factors, mostly memory usage vs. time. If the script consumes too much RAM, the batch size has to be lowered. But using really small batches may take more time than bigger batches. So you have to run multiple tests to adjust the size so that it uses most of the available memory but not more.
I don't have any proof, but I remember having worked with thousands of entities: when I used only one flush(), the progress bar kept getting slower, as if my program was slowing down as I added more and more entities in memory.
If the flush takes too much time, you might exceed the maximum execution time of the server, and lose the connection.
From my experience, 100 entities per batch worked great. Depending on the entity, 200 was too much; with another entity, I could do 1000.
To properly insert in batches, you will need to call:
$em->clear();
after each of your flushes. The reason is that Doctrine does not free the objects it flushes into the DB. This means that if you don't "clear" them, memory consumption will keep increasing until you hit your PHP memory limit and crash your operation.
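A minimal sketch of that flush-and-clear pattern (the CmsUser entity is taken from the question's example; $rowsToImport is a hypothetical data source used here only for illustration):
<?php
$batchSize = 100;
$i = 0;
foreach ($rowsToImport as $row) {   // $rowsToImport: hypothetical source of raw data
    $user = new CmsUser;
    $user->setUsername($row['username']);
    $em->persist($user);

    if ((++$i % $batchSize) === 0) {
        $em->flush();  // run the queued INSERTs/UPDATEs
        $em->clear();  // detach managed entities so PHP can free the memory
    }
}
$em->flush();  // flush the last, possibly partial, batch
$em->clear();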
I would also recommend against raising the PHP memory limit to higher values. If you do, you risk creating huge lag on your server, which could increase the number of connections to your server and then crash it.
It is also recommended to process batch operations outside of the web server's upload form page. So save the data in a blob and then process it later with a cron job that runs your batch processing at the desired time (outside the web server's peak usage hours).
As suggested in the Doctrine documentation, the ORM is not the best tool for batch processing.
Unless your entity needs some specific logic (like listeners), avoid the ORM and use DBAL directly.
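For illustration, a rough sketch of the DBAL route (the table and column names are invented, and $rowsToImport is again a hypothetical data source):
<?php
/** @var \Doctrine\DBAL\Connection $conn */
$conn = $em->getConnection();

$conn->beginTransaction();
foreach ($rowsToImport as $row) {
    // Connection::insert() builds and executes a parameterized INSERT, with no ORM overhead
    $conn->insert('cms_users', [
        'status'   => 'user',
        'username' => $row['username'],
    ]);
}
$conn->commit();
Wrapping the inserts in a single transaction keeps the writes fast; for very large imports you can also commit every few thousand rows instead.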

How can I get the size of a resource without actually downloading it?

So I'm on very constrained bandwidth right now, and I clicked a link to a PDF tutorial for something. Chrome began to download it, and as I watched the size spiral upward from 20 KB past 5 MB I decided to stop it. How do I know it's not a 4 GB PDF?? Ridiculous, I know.
But I started thinking: surely there must be a way I can simply request the size of the resource to check before downloading. Perhaps some sort of cURL request?
Does anyone know a way?
You could try using the HTTP HEAD method. This should get you the headers of the document without the body, and the Content-Length header, if present, tells you the size.
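As a sketch of the HEAD approach with PHP's cURL extension (the URL is a placeholder, and not every server reports a Content-Length):
<?php
$ch = curl_init('https://example.com/tutorial.pdf'); // placeholder URL
curl_setopt($ch, CURLOPT_NOBODY, true);              // HEAD request: headers only, no body
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);      // don't echo the (empty) response
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);      // follow redirects to the real file
curl_exec($ch);

$size = curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD); // -1 if not reported
curl_close($ch);

echo $size >= 0 ? "About {$size} bytes\n" : "Server did not report a size\n";
From the command line, curl -sI <url> prints the same headers.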
Or you could send an HTTP Range request header with a GET request (see section 14.35.2 of RFC 2616). Range headers look like:
Range: bytes=0-19999
which would request the first 20,000 bytes (octets) of a document. If the document is smaller than 20,000 bytes, you would get the whole document.
The only problem is that the server might not support the Range header, in which case it will send a 200 status instead of 206. In that case you can just reset the connection if you don't want to risk burning bandwidth on a 5 GB document.
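And a sketch of the Range variant, again with cURL and the same placeholder URL. Note the caveat above: if the server ignores the Range header it will start sending the full body, so on a really slow connection the HEAD check is the safer first step.
<?php
$ch = curl_init('https://example.com/tutorial.pdf'); // placeholder URL
curl_setopt($ch, CURLOPT_RANGE, '0-19999');          // sends "Range: bytes=0-19999"
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$body = curl_exec($ch);

$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);     // 206 = server honoured the range
curl_close($ch);

if ($status === 206) {
    echo 'Got only the first ' . strlen($body) . " bytes\n";
} else {
    echo "Range not supported (HTTP {$status}); the full body may have been sent\n";
}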

How to restrict the number of concurrently running map tasks?

My Hadoop version is 1.0.2. Now I want at most 10 map tasks running at the same time. I have found 2 variables related to this question.
a) mapred.job.map.capacity
but in my Hadoop version, this parameter seems to have been abandoned.
b) mapred.jobtracker.taskScheduler.maxRunningTasksPerJob (http://grepcode.com/file/repo1.maven.org/maven2/com.ning/metrics.collector/1.0.2/mapred-default.xml)
I set this variable like below:
Configuration conf = new Configuration();
conf.set("date", date);
conf.set("mapred.job.queue.name", "hadoop");
conf.set("mapred.jobtracker.taskScheduler.maxRunningTasksPerJob", "10");
DistributedCache.createSymlink(conf);
Job job = new Job(conf, "ConstructApkDownload_" + date);
...
The problem is that it doesn't work: there are still more than 50 map tasks running when the job starts.
After looking through the Hadoop documentation, I can't find another way to limit the concurrent running map tasks.
Hope someone can help me. Thanks.
=====================
I have found the answer to this question; sharing it here for others who may be interested.
Use the fair scheduler, with the configuration parameter maxMaps to set a pool's maximum concurrent map tasks, in the allocation file (fair-scheduler.xml).
Then when you submit jobs, just set the job's queue to the corresponding pool.
You can set the value of mapred.jobtracker.maxtasks.per.job to something other than -1 (the default). This limits the number of simultaneous map or reduce tasks a job can employ.
This variable is described as:
The maximum number of tasks for a single job. A value of -1 indicates that there is no maximum.
I think there were plans to add mapred.max.maps.per.node and mapred.max.reduces.per.node to job configs, but they never made it to release.
If you are using Hadoop 2.7 or newer, you can use mapreduce.job.running.map.limit and mapreduce.job.running.reduce.limit to restrict the number of running map and reduce tasks at the job level.
See the JIRA ticket for that fix.
mapred.tasktracker.map.tasks.maximum is the property that restricts the number of map tasks that can run at a time on a single TaskTracker. Configure it in your mapred-site.xml.
Refer to question 2.7 in http://wiki.apache.org/hadoop/FAQ
The number of mappers fired is decided by the input split size. The input split size is the size of the chunks into which the data is divided and sent to different mappers as it is read from HDFS. So in order to control the number of mappers we have to control the split size.
It can be controlled by setting the parameters mapred.min.split.size and mapred.max.split.size while configuring the job in MapReduce. The values are set in bytes. So if we have a 20 GB file and we want to fire 40 mappers, then we need to set the split size to 20480 MB / 40 = 512 MB each. The code for that would be:
conf.set("mapred.min.split.size", "536870912");
conf.set("mapred.max.split.size", "536870912");
where conf is an object of the org.apache.hadoop.conf.Configuration class.
Read about scheduling jobs in Hadoop (for example the fair scheduler). You can create a custom queue with its own configuration and then assign your job to that queue. If you limit your custom queue's maximum map tasks to 10, then each job assigned to the queue will have at most 10 concurrent map tasks.
