>>> (1).__str__!=(2).__str__
True
Is there any technical reason why they've been made separate objects instead of referring to a single object? It seems like it would be more efficient.
That is because you don't get the __str__ function, but a method wrapper for it.
>>> x = (1).__str__
>>> type(x)
<class 'method-wrapper'>
>>> x()
'1'
>>> x.__self__
1
>>> x = (2).__str__
>>> x()
'2'
>>> x.__self__
2
A method wrapper is an object that keeps a reference to self. So it has to be different for every instance.
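You can see this in the interpreter: the underlying function lives once on the class, and every attribute access through an instance builds a fresh wrapper around it. A minimal sketch:
>>> int.__str__ is int.__str__        # one shared slot wrapper on the class
True
>>> (1).__str__ is (1).__str__        # a new method wrapper per attribute access
False
>>> int.__str__(1)                    # calling the shared function directly
'1'
So nothing is duplicated per instance; only the thin wrapper object is created on demand.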
I have an upstream extract task that extracts files into two different S3 paths. This operator returns a tuple of the two S3 paths as an XCom. How do I pass the appropriate XCom value to the appropriate task?
extract_task >> [load_task_0, load_task_1]
Probably a little late to the party, but will answer anyways.
With the TaskFlow API in Airflow 2.0, you can do something like this using decorators:
@task(multiple_outputs=True)
def extract_task():
    return {
        "path_0": "s3://path0",
        "path_1": "s3://path1",
    }
Then in your DAG:
@dag()
def my_dag():
    output = extract_task()
    load_task_0(output["path_0"])
    load_task_1(output["path_1"])
This works with a dictionary; it probably won't work with a tuple, but you can try.
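If you are on the classic (pre-TaskFlow) API and the extract task pushes the tuple as a single XCom, you can index into it from a template instead. A minimal sketch, assuming the upstream task_id is extract_task and BashOperator downstream tasks:
from airflow.operators.bash import BashOperator

load_task_0 = BashOperator(
    task_id="load_task_0",
    bash_command="echo {{ ti.xcom_pull(task_ids='extract_task')[0] }}",
)
load_task_1 = BashOperator(
    task_id="load_task_1",
    bash_command="echo {{ ti.xcom_pull(task_ids='extract_task')[1] }}",
)
extract_task >> [load_task_0, load_task_1]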
I encountered a problem while translating the following line from Python 2 to Python 3:
fmap = defaultdict(count(1).next)
I changed count(1).next to next(count(1)), but got this error:
fmap = defaultdict(next(count(1)))
TypeError: first argument must be callable or None
I guess this line is intended to assign a new default value each time. Do you have any suggestions?
Thanks
The error is clear: the first argument to a defaultdict must be a callable (a function, for example, or a class name) or None. This callable is invoked whenever a key does not exist, to construct the default value. On the other hand:
next(count(3))
will return an integer, which is not callable, and so makes no sense there. If you want the defaultdict to default to an increasing number whenever a missing key is used, then something close to what you have is:
>>> x=defaultdict(lambda x=count(30): next(x))
>>> x[1]
30
>>> x[2]
31
>>> x[3]
32
>>> x[4]
33
The .next() method on iterators has been renamed in Python 3. Use .__next__() instead.
Code
fmap = defaultdict(count(1).__next__)
Demo
fmap["a"]
# 1
fmap["b"]
# 2
Note that defaultdict needs a callable argument, something that will act as a function; hence the parentheses are removed, i.e. you pass __next__ itself rather than the result of calling it.
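Conversely, passing the result of the call (an integer) reproduces the exact error from the question:
>>> from collections import defaultdict
>>> from itertools import count
>>> defaultdict(next(count(1)))
Traceback (most recent call last):
  ...
TypeError: first argument must be callable or None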
I have a dictionary that maps a key to a function object. Then, using Spark 1.4.1 (Spark may not even be relevant for this question), I try to map each object in the RDD using a function object retrieved from the dictionary (which acts as a look-up table). E.g., a small snippet of my code:
fnCall = groupFnList[0].fn
pagesRDD = pagesRDD.map(lambda x: [x, fnCall(x[0])]).map(shapeToTuple)
Now, it has fetched the function object from a namedtuple, which I temporarily 'store' (i.e., keep a reference to the fn obj) in fnCall. Then, using the map operations, I want the x[0] element of each tuple to be processed using that function.
All is fine and good in that there indeed IS a fn object, but it behaves in a weird way.
Each time I call an action method on the RDD, even without having used a fn obj in between, the RDD values have changed! To visualize this I have created dummy functions for the fn objects that just output a random integer. After calling the fn obj on the RDD, I can inspect it with .take() or .first() and get the following:
pagesRDD.first()
>>> [(u'myPDF1.pdf', u'34', u'930', u'30')]
pagesRDD.first()
>>> [(u'myPDF1.pdf', u'23', u'472', u'11')]
pagesRDD.first()
>>> [(u'myPDF1.pdf', u'4', u'69', u'25')]
So it seems to me that the RDD's elements have the functions bound to them in some way, and each time I do an action operation (like .first(), very simple) it 'updates' the RDD's contents.
I don't want this to happen! I just want the function to process the RDD ONLY when I call it with a map operation. How can I 'unbind' this function after the map operation?
Any ideas?
Thanks!
UPDATE:
So apparently rewriting my code to call it like pagesRDD.map(fnCall) should do the trick, but why should this even matter? If I call
rdd = rdd.map(lambda x: (x,1))
rdd.first()
>>> # some output
rdd.first()
>>> # same output as before!
So in this case, using a lambda function, it does not get bound to the RDD and is not called again each time I do a .take()-like action. So why is that the case when I use a fn object INSIDE the lambda? Logically it just does not make sense to me. Any explanation for this?
If you redefine your functions so that their parameter is an iterable, your code should look like this:
pagesRDD = pagesRDD.map(fnCall).map(shapeToTuple)
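The underlying cause is Spark's lazy evaluation: an RDD stores the chain of transformations, not their results, so every action (first(), take(), count(), ...) re-runs the whole lineage. Your dummy functions return random integers, so each re-run produces new values; the (x, 1) lambda only looked stable because it is deterministic. If you need the mapped result computed once and reused, persist it. A minimal sketch, assuming a SparkContext sc:
import random

rdd = sc.parallelize(range(3)).map(lambda x: (x, random.randint(0, 99)))
rdd.first()           # re-executes the map, so the value may change per action

cached = rdd.cache()  # mark for persistence (materialized on the first action)
cached.count()        # forces the computation once
cached.first()        # stable on subsequent actions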
I have a list comprehension:
thingie=[f(a,x,c) for x in some_list]
which I am parallelising as follows:
from multiprocessing import Pool
pool=Pool(processes=4)
thingie=pool.map(lambda x: f(a,x,c), some_list)
but I get the following error:
_pickle.PicklingError: Can't pickle <function <lambda> at 0x7f60b3b0e9d8>:
attribute lookup <lambda> on __main__ failed
I have tried to install the pathos package, which apparently addresses this issue, but when I try to import it I get the error:
ImportError: No module named 'pathos'
OK, so this answer is just for the record; I figured it out with the author of the question during a comment conversation.
multiprocessing needs to transport every object between processes, so it uses pickle to serialize it in one process and deserialize it in another. It all works well, but pickle cannot serialize a lambda: pickle stores ordinary functions by reference to their module-level name and re-imports them on the other side, and a lambda has no importable name.
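You can reproduce the failure without multiprocessing at all; pickling a lambda directly fails the same way:
>>> import pickle
>>> pickle.dumps(lambda x: x)
Traceback (most recent call last):
  ...
_pickle.PicklingError: Can't pickle <function <lambda> at 0x...>: attribute lookup <lambda> on __main__ failed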
It won't be any problem if you use map() with a one-argument function: you can pass that function instead of a lambda. If you have more arguments, like in your example, you need to define a wrapper at module level with the def keyword:
from multiprocessing import Pool

def f(x, y, z):
    print(x, y, z)

def f_wrapper(y):
    return f(1, y, "a")

pool = Pool(processes=4)
result = pool.map(f_wrapper, [7, 9, 11])
Just before I close this, I found another way to do this in Python 3, using functools.
Say I have a function f with three arguments, f(a, x, c), one of which I want to map over, say x. I can use the following code to do basically what @FilipMalczak suggests:
import functools
from multiprocessing import Pool

# bind a positionally and c by keyword, leaving x free for pool.map
f2 = functools.partial(f, 10, c=10)

pool = Pool(processes=4)
final_answer = pool.map(f2, some_list)
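The reason this works where the lambda failed is that a functools.partial object is picklable as long as the function it wraps is reachable by name at module level. A quick check:
>>> import pickle, functools
>>> p = functools.partial(max, 10)
>>> pickle.loads(pickle.dumps(p))(3)   # round-trips, then computes max(10, 3)
10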
Is it possible to return multiple values from a function?
I want to pass the return values into another function, and I wonder if I can avoid having to explode the array into multiple values.
My problem?
I am upgrading Capybara for my project, and I realized, because of the non-standard CSS 'contains' selector and the Capybara upgrade, that the statement below will no longer work:
has_selector?(:css, "#rightCol:contains(\"#{page_name}\")")
I want to get it working with minimum effort (there are a lot of such cases), so I came up with the idea of using Nokogiri to convert the CSS to XPath. I wanted to write it so that the above call can become
has_selector? xpath(:css, "#rightCol:contains(\"#{page_name}\")")
But since xpath has to return an array, I actually need to write this:
has_selector?(*xpath(:css, "#rightCol:contains(\"#{page_name}\")"))
Is there a way to get the former behavior?
For brevity, assume that the xpath function currently looks like this:
def xpath(*a)
  [1, 2]
end
You cannot make a method return multiple values. In order to do what you want, you have to change has_selector?, maybe something like this:
alias old_has_selector? :has_selector?

def has_selector?(arg)
  case arg
  when Array then old_has_selector?(*arg)
  else old_has_selector?(arg)
  end
end
Ruby has limited support for returning multiple values from a function. In particular, a returned Array will get "destructured" when assigned to multiple variables:
def foo
  [1, 2]
end
a, b = foo
a #=> 1
b #=> 2
However, in your case you need the splat (*) to make it clear you're not just passing the array as the first argument.
If you want a cleaner syntax, why not just write your own wrapper:
def has_xpath?(xp)
  has_selector?(*xpath(:css, xp))
end