I am trying to pull information from two different dictionaries. (Excuse me, I am literally hacking my way through this to understand it.)
I have one for loop that gives me the vmname. I have another for loop that gives me the other information, like 'replicationid'.
I may be making a huge assumption here, but hey, I'll start there. What I want to do is interleave for loop 1 and for loop 2 so the results look like this. Is it even possible?
initial output of for loop1 which I can get:
vma
vmb
vmc
initial output of for loop2 which I can get:
replication job 1
replication job 2
replication job 3
desired results is:
vma
replication job 1
vmb
replication job 2
vmc
replication job 3
import boto3

def get_replication_job_status():
    sms = boto3.client('sms')
    resp = sms.get_replication_jobs()
    # print(resp)
    things = [(cl['replicationJobId'], cl['serverId'])
              for cl in resp['replicationJobList']]
    thangs = [cl['vmServer'] for cl in resp['replicationJobList']]
    for i in thangs:
        print()
        print("this is vm " + i['vmName'])
        print("this is the vm location " + i['vmPath'])
        print("this is the vm address, " + str(i['vmServerAddress']))
        for j in things:
            print("The Replication ID is : " + str(j[0]))
Again, I want:
vma
replication job 1
vmb
replication job 2
vmc
replication job 3
I am getting:
vma
replication job 1
replication job 2
replication job 3
vmb
replication job 1
replication job 2
replication job 3
..
..
..
If you are sure that both your lists have the same length, then what you need is Python's built-in zip function:
for thing, thang in zip(things, thangs):
    print()
    print(thing)
    print(thang)
But if one of the lists is longer than the other, then zip will crop both lists to the length of the shortest, for example:
>>> for i, j in zip(range(3), range(5)):
...     print(i, j)
...
0 0
1 1
2 2
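If instead you need to keep the extra items from the longer list, itertools.zip_longest pads the shorter one with a fill value; a minimal sketch:

from itertools import zip_longest

# zip_longest pads the shorter iterable with fillvalue instead of cropping
for i, j in zip_longest(range(3), range(5), fillvalue=None):
    print(i, j)
# prints: 0 0, 1 1, 2 2, None 3, None 4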
UPD:
You can also unpack your tuples right in the for loop definition, so each item in the things list (they are 2-tuples) gets saved into two variables:
for (replicationJobId, serverId), thang in zip(things, thangs):
    print()
    print(replicationJobId)
    print(serverId)
    print(thang)
UPD 2:
Why do you split resp into two lists at all?
def get_replication_job_status():
    sms = boto3.client('sms')
    resp = sms.get_replication_jobs()
    # print(resp)
    for replication_job in resp['replicationJobList']:
        vm_server = replication_job['vmServer']
        print()
        print("this is vm:", vm_server['vmName'])
        print("this is the vm location:", vm_server['vmPath'])
        print("this is the vm address:", vm_server['vmServerAddress'])
        print("The Replication ID is :", replication_job['replicationJobId'])
While working on exercise 2.2 of "Programming in Lua 4", I have to create a function that builds all permutations of the numbers 1-8. I decided to use Heap's algorithm and made the following script. I'm testing with the numbers 1-3.
In the function I store the permutations as tables {1,2,3}, {2,1,3} and so on into the local "a" and add them to the global "perm". But something goes wrong, and at the end of the recursion I get the same permutation in all slots. I can't figure it out. Please help.
function generateperm (k, a)
  if k == 1 then
    perm[#perm + 1] = a        -- adds recent permutation to table
    io.write(table.unpack(a))  -- debug print. it shows the last added one
    io.write("\n")             -- so I can see the algorithm works fine
  else
    for i = 1, k do
      generateperm(k - 1, a)
      if k % 2 == 0 then       -- builds a permutation
        a[i], a[k] = a[k], a[i]
      else
        a[1], a[k] = a[k], a[1]
      end
    end
  end
end
--
perm = {}
generateperm(3, {1, 2, 3}) -- start
--
for k in ipairs(perm) do         -- prints all stored permutations
  for _, v in ipairs(perm[k]) do -- but it's 6 times {1,2,3}
    io.write(v)
  end
  io.write("\n")
end
debug print:
123
213
312
132
231
321
123
123
123
123
123
123
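The six identical rows at the end are consistent with a reference-versus-copy pitfall: perm[#perm + 1] = a stores a reference to the one table that keeps being mutated afterwards, not a snapshot of its current contents. A minimal Python sketch of the same aliasing effect, for illustration only:

# Appending the same mutable object stores one object many times, not snapshots.
a = [1, 2, 3]
perm = []
for _ in range(3):
    perm.append(a)           # stores a reference: every slot is the same list
    a[0], a[2] = a[2], a[0]  # mutate in place
print(perm)                  # every entry shows the final state of a

# Appending a copy freezes the contents at the time of saving:
a = [1, 2, 3]
perm = []
for _ in range(3):
    perm.append(a[:])        # a[:] copies the list before the next mutation
    a[0], a[2] = a[2], a[0]
print(perm)                  # each entry keeps what a looked like when saved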
I have a task whose output is a dictionary with a list value under each key:
@task(task_id="gen_dict")
def generate_dict():
    ...
    return output_dict  # output looks like this: {"A": ["aa", "bb", "cc"], "B": ["dd", "ee", "ff"]}

# my dag (not mentioning the part that generates the DAG and its properties)
start = DummyOperator(task_id="st")
end = DummyOperator(task_id="ed")

output = generate_dict()

for keys, values in output.items():
    for v in values:
        dm = DummyOperator(task_id=f"dm_{keys}_{v}")
        dm >> end

start >> output
For the sample output above, it should create 6 dummy tasks: dm_A_aa, dm_A_bb, dm_A_cc, dm_B_dd, dm_B_ee, dm_B_ff.
But right now I'm getting this error:
AttributeError: 'XComArg' object has no attribute 'items'
Is it possible to do what I aim to do? If not, is it possible to do it using a list like ["aa", "bb", "cc", "dd", "ee", "ff"] instead?
The code in the question won't work as-is, because the loop shown would run when the DAG is parsed (which happens when the scheduler starts up and periodically thereafter), but the data it would loop over isn't known until the task that generates it actually runs.
There are ways to do something similar though.
AIP-42 added the ability to map list data into task kwargs in Airflow 2.3:
@task
def generate_lists():
    # presumably the data below would come from a query executed at runtime
    return [["aa", "bb", "cc"], ["dd", "ee", "ff"]]

@task
def use_list(the_list):
    for item in the_list:
        print(item)

with DAG(...) as dag:
    use_list.expand(the_list=generate_lists())
The code above will create two tasks with output:
aa
bb
cc
dd
ee
ff
In 2.4 the expand_kwargs function was added. It's an alternative to expand (shown above) that operates on dicts instead.
It takes an XComArg referencing a list of dicts whose keys are the names of the arguments you're mapping the data into. So the following code...
@task
def generate_dicts():
    # presumably the data below would come from a query made at runtime
    return [{"foo": 6, "bar": 7}, {"foo": 8, "bar": 9}]

@task
def two_things(foo, bar):
    print(foo, bar)

with DAG(...) as dag:
    two_things.expand_kwargs(generate_dicts())
... gives two tasks with output:
6 7
...and...
8 9
expand only lets you create tasks from the Cartesian product of the input lists; expand_kwargs lets you associate data with kwargs at runtime.
This is actually a continuation of this question:
I have a file
1
2
PAT1
3 - first block
4
PAT2
5
6
PAT1
7 - second block
PAT2
8
9
PAT1
10 - third block
and I use awk '/PAT1/{flag=1; next} /PAT2/{flag=0} flag'
to extract the blocks of lines.
Extracting them works OK, but I'm trying to iterate over these blocks in a block-by-block fashion and do some processing on each block (e.g. save it to a file, process it with other scripts, etc.).
How can I construct such a loop?
The problem is not very clear, but you may do something like this:
awk '/PAT1/ {
    flag = 1
    ++n
    s = ""
    next
}
/PAT2/ {
    flag = 0
    printf "Processing record # %d =>\n%s", n, s
}
flag {
    s = s $0 ORS
}' file
Processing record # 1 =>
3 - first block
4
Processing record # 2 =>
7 - second block
This might work for you (GNU sed):
sed -ne '/PAT1/!b;:a;N;/PAT2/!ba;e echo process:' -e 's/.*/echo "&"|wc/pe;p' file
Gather up the lines between PAT1 and PAT2 and process the collection.
In the example above, the literal process: is printed.
Then the command that prints the result of running wc over the collection is built and printed.
Finally, the result of evaluating that command is printed.
N.B. The position of the p flag in the substitution command is critical: if the p comes before the e flag, the pattern space is printed before the evaluation; if the p comes after the e flag, the pattern space is printed after the evaluation.
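If you would rather drive the block-by-block loop from a general-purpose language, here is a minimal Python sketch of the same PAT1/PAT2 logic. Note one deliberate difference from the awk version above: it also yields a trailing block that is never closed by PAT2, like the third block in the sample.

def iter_blocks(path, start_pat="PAT1", end_pat="PAT2"):
    """Yield each block of lines found between start_pat and end_pat."""
    block = None
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if start_pat in line:
                block = []                 # PAT1 seen: start a new block
            elif end_pat in line and block is not None:
                yield block                # PAT2 seen: hand the block back
                block = None
            elif block is not None:
                block.append(line)         # inside a block: collect the line
    if block:                              # file ended inside a block
        yield block

for n, block in enumerate(iter_blocks("file"), start=1):
    print(f"Processing record # {n} =>")
    print("\n".join(block))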
I am running the following set of commands in Pig. My data set has one row for each student in a class, and each student has a number of grades. The student name is tab-separated from the grades for that student, and the grades themselves are comma-separated. I need to find the average grade for each student.
After grouping, I can successfully get the count of grades for each student, but I cannot get the average score. Pig complains it cannot open the iterator when averaging. I am confused, since the aggregate functions COUNT and AVG iterate over the same bag. I am not sure what I am missing. Any help is appreciated.
Scripts:
grunt> A = LOAD 'grades.txt' USING PigStorage('\t') AS (f1:chararray, f2:chararray);
grunt> dump A;
(s14,59,94,81)
(s15,60,77)
(s16,77,77)
(s17,76,76)
(s18,19,61,72)
(s20,34,35)
grunt> B = foreach A generate f1 as stu, Flatten(TOKENIZE(f2)) as (grade:int);
grunt> describe B;
B: {stu: chararray,grade: int}
grunt> dump B;
(s14,59)
(s14,94)
(s14,81)
(s15,60)
(s15,77)
(s16,77)
(s16,77)
(s17,76)
(s17,76)
(s18,19)
(s18,61)
(s18,72)
(s20,34)
(s20,35)
grunt> grp = group B by stu;
grunt> cnt = foreach grp generate group, COUNT(B.grade);
grunt> dump cnt;
(s14,3)
(s15,2)
(s16,2)
(s17,2)
(s18,3)
(s20,2)
grunt> avg = foreach grp generate group, AVG(B.grade);
grunt> dump avg;
2015-03-20 21:56:30,900 ERROR org.apache.pig.tools.pigstats.PigStatsUtil:
1 map reduce job(s) failed!
2015-03-20 21:56:30,907 ERROR org.apache.pig.tools.grunt.Grunt: ERROR 1066:
Unable to open iterator for alias avg
Details at logfile: /home/training/pig/pig_1426902869706.log
grunt>
As mentioned in the comments, a workaround was found: change
B = foreach A generate f1 as stu, Flatten(TOKENIZE(f2)) as (grade:int);
to
B = foreach A generate f1 as stu, Flatten(TOKENIZE(f2)) as grade;
and then cast the field explicitly into a new relation:
C = foreach B generate stu as stu, (int)grade as grade;
This likely works because TOKENIZE produces chararray tokens: the as (grade:int) clause merely declares the schema without converting the data, so AVG fails at runtime on the string values, while COUNT succeeds because it never reads them.
Let's say you want 5 threads to process data simultaneously. Also assume you have 89 tasks to process.
Off the bat you know 89 / 5 = 17 with a remainder of 4. The best way to split up the tasks is to have 4 threads (the remainder) process 18 (17 + 1) tasks each, and then have 1 thread (# threads - remainder) process 17.
This will eliminate the remainder. Just to verify:
Thread 1: Tasks 1-18 (18 tasks)
Thread 2: Tasks 19-36 (18 tasks)
Thread 3: Tasks 37-54 (18 tasks)
Thread 4: Tasks 55-72 (18 tasks)
Thread 5: Tasks 73-89 (17 tasks)
Giving you a total of 89 tasks completed.
I need a way of getting the start and end of each thread's range mathematically/programmatically, where the following should print exactly what I have listed above:
$NumTasks = 89
$NumThreads = 5
$Remainder = $NumTasks % $NumThreads
$DefaultNumTasksAssigned = floor($NumTasks / $NumThreads)
For $i = 1 To $NumThreads
    if $i <= $Remainder Then
        $NumTasksAssigned = $DefaultNumTasksAssigned + 1
    else
        $NumTasksAssigned = $DefaultNumTasksAssigned
    endif
    $Start = ??????????
    $End = ??????????
    print Thread $i: Tasks $Start-$End ($NumTasksAssigned tasks)
Next
This should also work for any number of $NumTasks.
Note: Please stick to answering the math at hand and avoid suggesting alternatives or making assumptions about the situation.
Why? Rather than predetermining the scheduling order, stick all of the tasks on a queue and have each thread pull them off one by one as it becomes free. That way your tasks will basically run "as fast as possible".
If you pre-allocate, then one thread may be doing a particularly long bit of processing, blocking the running of all the tasks stuck behind it. With the queue, as each task finishes and a thread frees up, it grabs the next task and keeps going.
Think of it like a bank with one line per teller versus one line and many tellers. In the former, you might get stuck behind the person depositing coins and counting them out one by one; in the latter, you get to the next available teller while Mr. PocketChange counts away.
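A minimal Python sketch of the queue approach (the task count and the process body are stand-ins for the real work):

import queue
import threading

NUM_THREADS = 5
NUM_TASKS = 89

def process(task_id):
    pass  # stand-in for the real per-task work

def worker(q):
    while True:
        try:
            task_id = q.get_nowait()  # grab the next available task
        except queue.Empty:
            return                    # queue drained: this thread is done
        process(task_id)
        q.task_done()

q = queue.Queue()
for task_id in range(1, NUM_TASKS + 1):
    q.put(task_id)

threads = [threading.Thread(target=worker, args=(q,)) for _ in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()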
I second Will Hartung's remark. You may just feed them one task at a time (or a few tasks at a time, depending on whether there's much overhead, i.e. whether individual tasks typically complete very fast relative to the cost of starting/recycling threads). Your subsequent comments effectively explain that your "threads" carry a heavy creation cost, hence your desire to feed each of them once with as much work as possible, rather than wasting time creating new "threads", each fed a small amount of work.
Anyway... on to the math question...
If you'd like to assign tasks just once, the following formulas, plugged in lieu of the ?????????? in your logic, should do the trick:
$Start = 1 + (($i - 1) * $DefaultNumTasksAssigned) + min($i - 1, $Remainder)
$End = $Start + $NumTasksAssigned - 1
The formula is explained as follows:
The leading 1 is for the fact that your display/logic is one-based, not zero-based.
The second term adds $DefaultNumTasksAssigned for every thread that comes before thread $i.
The third term adds one extra task for each preceding thread that received $DefaultNumTasksAssigned + 1 tasks; only the first $Remainder threads get the extra task, hence the min.
The formula for $End is easier; the only trick is the minus 1, which is there because the Start and End values are inclusive (so, for example, between 1 and 18 there are 18 tasks, not 17).
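A quick sanity check of these formulas in Python (thread_ranges is just an illustrative helper name):

def thread_ranges(num_tasks, num_threads):
    remainder = num_tasks % num_threads
    default = num_tasks // num_threads
    for i in range(1, num_threads + 1):
        assigned = default + 1 if i <= remainder else default
        start = 1 + (i - 1) * default + min(i - 1, remainder)
        end = start + assigned - 1
        yield i, start, end, assigned

for i, start, end, n in thread_ranges(89, 5):
    print(f"Thread {i}: Tasks {start}-{end} ({n} tasks)")

This prints exactly the 1-18 / 19-36 / 37-54 / 55-72 / 73-89 split listed in the question.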
The following slightly modified piece of logic should also work; it avoids the "fancy" formula by keeping a running tab of the $Start variable rather than recomputing it each time:
$NumTasks = 89
$NumThreads = 5
$Remainder = $NumTasks % $NumThreads
$DefaultNumTasksAssigned = floor($NumTasks / $NumThreads)
$Start = 1
For $i = 1 To $NumThreads
    if $i <= $Remainder Then // fixed here! need <= because $i is one-based
        $NumTasksAssigned = $DefaultNumTasksAssigned + 1
    else
        $NumTasksAssigned = $DefaultNumTasksAssigned
    endif
    $End = $Start + $NumTasksAssigned - 1
    print Thread $i: Tasks $Start-$End ($NumTasksAssigned tasks)
    $Start = $Start + $NumTasksAssigned
Next
Here's a Python transcription of the above:
>>> import math
>>> def ShowWorkAllocation(NumTasks, NumThreads):
...     Remainder = NumTasks % NumThreads
...     DefaultNumTasksAssigned = math.floor(NumTasks / NumThreads)
...     Start = 1
...     for i in range(1, NumThreads + 1):
...         if i <= Remainder:
...             NumTasksAssigned = DefaultNumTasksAssigned + 1
...         else:
...             NumTasksAssigned = DefaultNumTasksAssigned
...         End = Start + NumTasksAssigned - 1
...         print("Thread ", i, ": Tasks ", Start, "-", End, "(", NumTasksAssigned, ")")
...         Start = Start + NumTasksAssigned
...
>>>
>>> ShowWorkAllocation(89, 5)
Thread 1 : Tasks 1 - 18 ( 18 )
Thread 2 : Tasks 19 - 36 ( 18 )
Thread 3 : Tasks 37 - 54 ( 18 )
Thread 4 : Tasks 55 - 72 ( 18 )
Thread 5 : Tasks 73 - 89 ( 17 )
>>> ShowWorkAllocation(11, 5)
Thread 1 : Tasks 1 - 3 ( 3 )
Thread 2 : Tasks 4 - 5 ( 2 )
Thread 3 : Tasks 6 - 7 ( 2 )
Thread 4 : Tasks 8 - 9 ( 2 )
Thread 5 : Tasks 10 - 11 ( 2 )
>>>
>>> ShowWorkAllocation(89, 11)
Thread 1 : Tasks 1 - 9 ( 9 )
Thread 2 : Tasks 10 - 17 ( 8 )
Thread 3 : Tasks 18 - 25 ( 8 )
Thread 4 : Tasks 26 - 33 ( 8 )
Thread 5 : Tasks 34 - 41 ( 8 )
Thread 6 : Tasks 42 - 49 ( 8 )
Thread 7 : Tasks 50 - 57 ( 8 )
Thread 8 : Tasks 58 - 65 ( 8 )
Thread 9 : Tasks 66 - 73 ( 8 )
Thread 10 : Tasks 74 - 81 ( 8 )
Thread 11 : Tasks 82 - 89 ( 8 )
>>>
I think you've solved the wrong half of your problem.
It's going to be virtually impossible to precisely determine the time it will take to complete all your tasks, unless all of the following are true:
your tasks are 100% CPU-bound: that is, they use 100% CPU while running and don't need to do any I/O
none of your tasks have to synchronize with any of your other tasks in any way
you have exactly as many threads as you have CPUs
the computer that is running these tasks is not performing any other interesting tasks at the same time
In practice, most of the time, your tasks are I/O-bound rather than CPU-bound: that is, you are waiting for some external resource such as reading from a file, fetching from a database, or communicating with a remote computer. In that case, you only make things worse by adding more threads, because they're all contending for the same scarce resource.
Finally, unless you have some really weird hardware, it's unlikely you can actually have exactly five threads running simultaneously. (Processor configurations usually come in multiples of at least two.) The sweet spot is usually about 1 thread per CPU if your tasks are very CPU-bound, about 2 threads per CPU if the tasks spend half their time being CPU-bound and half their time doing I/O, and so on.
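As a rough illustration of that heuristic in Python (the 50% I/O share is an assumed input, not something you can detect automatically):

import os

cpus = os.cpu_count() or 1   # fall back to 1 if the count is unknown
io_fraction = 0.5            # assumed share of time each task spends waiting on I/O
suggested = max(1, round(cpus / (1 - io_fraction)))  # 0.5 -> about 2 threads per CPU
print(f"{cpus} CPUs -> try about {suggested} threads")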
tl;dr: We need to know a lot more about what your tasks and hardware look like before we can advise you on this question.