pyparsing infixNotation ParseException

In Python 3.7 I have the logic below. Earlier versions of pyparsing used operatorPrecedence; newer versions use infix_notation.
expr << pp.infix_notation(semi_expression, [
("not", 1, pp.opAssoc.RIGHT, self.not_operator),
("and", 2, pp.opAssoc.LEFT, self.and_operator),
("or", 2, pp.opAssoc.LEFT, self.or_operator)
])
result = expr.parseString(filter_str)
return result
However, I see this traceback:
File "/root/.local/share/virtualenvs/lib/python3.7/site-packages/testfilter.py", line 168, in parse_filter_str
result = expr.parseString(filter_str)
pyparsing.exceptions.ParseException: Expected 'or' term, found 'serial' (at char 1), (line:1, col:2)
Kindly help me solve this?
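As a point of reference (my sketch, not from the original post): an error like "Expected 'or' term, found 'serial'" usually means the base operand expression, semi_expression here, failed to match the input before any operators were tried, so that is the place to look. A minimal self-contained infix_notation grammar, with a hypothetical Word operand standing in for semi_expression and pyparsing 3.x names, parses such input fine:

```python
import pyparsing as pp

# Hypothetical stand-in for semi_expression: a bare identifier-like term.
operand = pp.Word(pp.alphanums + "_.")

expr = pp.infix_notation(operand, [
    ("not", 1, pp.opAssoc.RIGHT),
    ("and", 2, pp.opAssoc.LEFT),
    ("or", 2, pp.opAssoc.LEFT),
])

result = expr.parse_string("serial and not debug", parse_all=True)
print(result.as_list())  # [['serial', 'and', ['not', 'debug']]]
```

If the real semi_expression cannot match a bare word like serial, the failure tends to be reported as an "Expected 'or' term" error, since infix_notation names the whole expression after its lowest-precedence operator.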

Related

Rserve: pyServe not able to call basic R functions

I'm calling Rserve from Python and it works for basic operations, but not if I call basic functions such as min:
import pyRserve
conn = pyRserve.connect()
cars = [1, 2, 3]
conn.r.x = cars
print(conn.eval('x'))
print(conn.eval('min(x)'))
The result is:
[1, 2, 3]
Traceback (most recent call last):
File "test3.py", line 9, in <module>
print(conn.eval('min(x)'))
File "C:\Users\acastro\.windows-build-tools\python27\lib\site-packages\pyRserve\rconn.py", line 78, in decoCheckIfClosed
return func(self, *args, **kw)
File "C:\Users\acastro\.windows-build-tools\python27\lib\site-packages\pyRserve\rconn.py", line 191, in eval
raise REvalError(errorMsg)
pyRserve.rexceptions.REvalError: Error in min(x) : invalid 'type' (list) of argument
Do you know where the problem is?
Thanks
You should try min(unlist(x)).
If the list is simple, you may just try as.data.frame(x).
For more complicated lists, Stack Overflow has many other answers.

How to convert dict to RDD in PySpark

I am learning the Word2Vec model to process my data.
I am using Spark 1.6.0.
Using the example from the official documentation to explain my problem:
from pyspark.mllib.feature import Word2Vec
sentence = "a b " * 100 + "a c " * 10
localDoc = [sentence, sentence]
doc = sc.parallelize(localDoc).map(lambda line: line.split(" "))
model = Word2Vec().setVectorSize(10).setSeed(42).fit(doc)
The vectors are as follows:
>>> model.getVectors()
{'a': [0.26699373, -0.26908076, 0.0579859, -0.080141746, 0.18208595, 0.4162335, 0.0258975, -0.2162928, 0.17868409, 0.07642203], 'b': [-0.29602322, -0.67824656, -0.9063686, -0.49016926, 0.14347662, -0.23329848, -0.44695938, -0.69160634, 0.7037, 0.28236762], 'c': [-0.08954003, 0.24668643, 0.16183868, 0.10982372, -0.099240996, -0.1358507, 0.09996107, 0.30981666, -0.2477713, -0.063234895]}
When I use getVectors() I get the map of representations of the words. How do I convert it into an RDD so I can pass it to the KMeans model?
EDIT:
I did what @user9590153 said.
>>> v = sc.parallelize(model.getVectors()).values()
# the above code is successful.
>>> v.collect()
The PySpark shell shows another problem:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "D:\spark-1.6.3-bin-hadoop2.6\python\pyspark\rdd.py", line 771, in collect
port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
File "D:\spark-1.6.3-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\java_gateway.py", line 813, in __call__
File "D:\spark-1.6.3-bin-hadoop2.6\python\pyspark\sql\utils.py", line 45, in deco
return f(*a, **kw)
File "D:\spark-1.6.3-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 8.0 failed 1 times, most recent failure: Lost task 3.0 in stage 8.0 (TID 29, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "D:\spark-1.6.3-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\worker.py", line 111, in main
File "D:\spark-1.6.3-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\worker.py", line 106, in process
File "D:\spark-1.6.3-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\serializers.py", line 263, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "D:\spark-1.6.3-bin-hadoop2.6\python\pyspark\rdd.py", line 1540, in <lambda>
return self.map(lambda x: x[1])
IndexError: string index out of range
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Just parallelize:
sc.parallelize(model.getVectors()).values()
Parallelized Collections will help you over here.
data = [1, 2, 3, 4, 5]           # data here is the collection
distData = sc.parallelize(data)  # converted into an RDD
For your case:
sc.parallelize(model.getVectors()).values()
For your doubt:
The action collect() is the most common and simplest operation; it returns the entire contents of the RDD to the driver program.
The typical application of collect() is unit testing, where the entire RDD is expected to fit in memory; that makes it easy to compare the result of the RDD with the expected result.
The action collect() has the constraint that all the data must fit on one machine, since it is copied to the driver.
So you cannot perform collect() on an RDD that is larger than the driver's memory.
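An aside (my illustration, not from the original answers): the IndexError in the traceback has a plain-Python explanation. Iterating a dict yields only its keys, so parallelizing getVectors() builds an RDD of the word strings; RDD.values() then maps each element x to x[1], which fails on a one-character word. A sketch with a plain dict, no Spark needed:

```python
vectors = {'a': [0.1, 0.2], 'b': [0.3, 0.4]}

# Iterating a dict yields its keys, not its values:
print(list(vectors))          # ['a', 'b']

# So an RDD built from the dict contains only the keys; .values()
# (i.e. x[1]) then indexes into one-character strings and raises
# IndexError. Parallelize the (key, vector) pairs instead:
pairs = list(vectors.items())
only_vectors = [v for _, v in pairs]
print(only_vectors)           # [[0.1, 0.2], [0.3, 0.4]]
```

In Spark terms that would be sc.parallelize(list(model.getVectors().items())).values(), which keeps the vectors (untested here against Spark 1.6).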

Python: Type Error unsupported operand type(s) for +: 'int' and 'str'

I'm really stuck. I am currently reading Automate the Boring Stuff with Python and I am doing one of the practice projects.
Why is it flagging an error? I know it's to do with item_total.
import sys

stuff = {'Arrows': '12',
         'Gold Coins': '42',
         'Rope': '1',
         'Torches': '6',
         'Dagger': '1'}

def displayInventory(inventory):
    print("Inventory:")
    item_total = sum(stuff.values())
    for k, v in inventory.items():
        print(v + ' ' + k)
    a = sum(stuff.values())
    print("Total number of items: " + item_total)

displayInventory(stuff)
The error I get:
Traceback (most recent call last):
File "C:/Users/Lewis/Dropbox/Python/Function displayInventory p120 v2.py", line 17, in <module>
displayInventory(stuff)
File "C:/Users/Lewis/Dropbox/Python/Function displayInventory p120 v2.py", line 11, in displayInventory
item_total = int(sum(stuff.values()))
TypeError: unsupported operand type(s) for +: 'int' and 'str'
Your dictionary values are all strings:
stuff = {'Arrows': '12',
         'Gold Coins': '42',
         'Rope': '1',
         'Torches': '6',
         'Dagger': '1'}
yet you try to sum these strings:
item_total = sum(stuff.values())
sum() uses a starting value, an integer, of 0, so it is trying to use 0 + '12', and that's not a valid operation in Python:
>>> 0 + '12'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'int' and 'str'
You'll have to convert all your values to integers; either to begin with, or when summing:
item_total = sum(map(int, stuff.values()))
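For instance, with the dictionary above (a quick check, my addition):

```python
stuff = {'Arrows': '12', 'Gold Coins': '42', 'Rope': '1',
         'Torches': '6', 'Dagger': '1'}

# map(int, ...) converts each string value before summing.
item_total = sum(map(int, stuff.values()))
print(item_total)  # 62
```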
You don't really need those values to be strings, so the better solution is to make the values integers:
stuff = {
    'Arrows': 12,
    'Gold Coins': 42,
    'Rope': 1,
    'Torches': 6,
    'Dagger': 1,
}
and then adjust your inventory loop to convert the integer counts to strings when printing:
for k, v in inventory.items():
    print(str(v) + ' ' + k)
or better still:
for item, value in inventory.items():
    print('{:12s} {:2d}'.format(item, value))
to produce aligned numbers with string formatting.
You are trying to sum a bunch of strings, which won't work. You need to convert the strings to numbers before you try to sum them:
item_total = sum(map(int, stuff.values()))
Alternatively, declare your values as integers to begin with instead of strings.
This error is produced because you were trying sum('12', '42', ...), so you need to convert every element to int. Replace
item_total = sum(stuff.values())
with
item_total = sum([int(k) for k in stuff.values()])
The error lies in the lines
a = sum(stuff.values())
and
item_total = sum(stuff.values())
The value type in your dictionary is str, not int; remember that + is concatenation between strings and the addition operator between integers. The code below should solve the error; converting the str values into ints is done by mapping:
import sys

stuff = {'Arrows': '12',
         'Gold Coins': '42',
         'Rope': '1',
         'Torches': '6',
         'Dagger': '1'}

def displayInventory(inventory):
    print("Inventory:")
    item_total = sum(map(int, stuff.values()))
    for k, v in inventory.items():
        print(v + ' ' + k)
    print("Total number of items: " + str(item_total))

displayInventory(stuff)

Porting to Python3: PyPDF2 mergePage() gives TypeError

I'm using Python 3.4.2 and PyPDF2 1.24 (also using reportlab 3.1.44, in case that helps) on Windows 7.
I recently upgraded from Python 2.7 to 3.4, and am in the process of porting my code. This code is used to create a blank pdf page with links embedded in it (using reportlab) and merge it (using PyPDF2) with an existing pdf page. I had an issue with reportlab in that saving the canvas used StringIO which needed to be changed to BytesIO, but after doing that I ran into this error:
Traceback (most recent call last):
File "C:\cms_software\pdf_replica\builder.py", line 401, in merge_pdf_files
input_page.mergePage(link_page)
File "C:\Python34\lib\site-packages\PyPDF2\pdf.py", line 2013, in mergePage
self.mergePage(page2)
File "C:\Python34\lib\site-packages\PyPDF2\pdf.py", line 2059, in mergePage
page2Content = PageObject._pushPopGS(page2Content, self.pdf)
File "C:\Python34\lib\site-packages\PyPDF2\pdf.py", line 1973, in _pushPopGS
stream = ContentStream(contents, pdf)
File "C:\Python34\lib\site-packages\PyPDF2\pdf.py", line 2446, in __init__
stream = BytesIO(b_(stream.getData()))
File "C:\Python34\lib\site-packages\PyPDF2\generic.py", line 826, in getData
decoded._data = filters.decodeStreamData(self)
File "C:\Python34\lib\site-packages\PyPDF2\filters.py", line 326, in decodeStreamData
data = ASCII85Decode.decode(data)
File "C:\Python34\lib\site-packages\PyPDF2\filters.py", line 264, in decode
data = [y for y in data if not (y in ' \n\r\t')]
File "C:\Python34\lib\site-packages\PyPDF2\filters.py", line 264, in <listcomp>
data = [y for y in data if not (y in ' \n\r\t')]
TypeError: 'in <string>' requires string as left operand, not int
Here is the line and the line above where the traceback mentions:
link_page = self.make_pdf_link_page(pdf, size, margin, scale_factor, debug_article_links)
if link_page != None:
input_page.mergePage(link_page)
Here are the relevant parts of that make_pdf_link_page function:
packet = io.BytesIO()
can = canvas.Canvas(packet, pagesize=(size['width'], size['height']))
....# left out code here is just reportlab specifics for size and url stuff
can.linkURL(url, r1, thickness=1, color=colors.green)
can.rect(x1, y1, width, height, stroke=1, fill=0)
# create a new PDF with Reportlab that has the url link embedded
can.save()
packet.seek(0)
try:
new_pdf = PdfFileReader(packet)
except Exception as e:
logger.exception('e')
return None
return new_pdf.getPage(0)
I'm assuming it's a problem with using BytesIO, but I can't create the page with reportlab with StringIO. This is a critical feature that used to work perfectly with Python 2.7, so I'd appreciate any kind of feedback on this. Thanks!
UPDATE:
I've also tried changing from using BytesIO to just writing to a temp file, then merging. Unfortunately I got the same error.
Here is tempfile version:
import tempfile
temp_dir = tempfile.gettempdir()
temp_path = os.path.join(temp_dir, "tmp.pdf")
can = canvas.Canvas(temp_path, pagesize=(size['width'], size['height']))
....
can.showPage()
can.save()
try:
new_pdf = PdfFileReader(temp_path)
except Exception as e:
logger.exception('e')
return None
return new_pdf.getPage(0)
UPDATE:
I found an interesting bit of information on this. It seems if I comment out the can.rect and can.linkURL calls it will merge. So drawing anything on a page, then trying to merge it with my existing pdf is causing the error.
After digging into the PyPDF2 library code, I was able to find my own answer. For Python 3 users, old libraries can be tricky. Even if they say they support Python 3, they don't necessarily test everything. In this case, the problem was with the ASCII85Decode class in filters.py in PyPDF2. For Python 3, this class needs to return bytes. I borrowed the code for this same type of function from pdfminer3k, which is a Python 3 port of pdfminer. If you exchange the ASCII85Decode class for this code, it will work:
import struct

class ASCII85Decode(object):
    def decode(data, decodeParms=None):
        if isinstance(data, str):
            data = data.encode('ascii')
        n = b = 0
        out = bytearray()
        for c in data:
            if ord('!') <= c and c <= ord('u'):
                n += 1
                b = b*85 + (c - 33)
                if n == 5:
                    out += struct.pack(b'>L', b)
                    n = b = 0
            elif c == ord('z'):
                assert n == 0
                out += b'\0\0\0\0'
            elif c == ord('~'):
                if n:
                    for _ in range(5 - n):
                        b = b*85 + 84
                    out += struct.pack(b'>L', b)[:n-1]
                break
        return bytes(out)
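As a sanity check (my aside, not part of the original answer): since Python 3.4 the standard library's base64.a85decode performs the same Ascii85 decoding, so the replacement class can be compared against it on known inputs:

```python
import base64

# The four bytes b'Hell' pack to 0x48656C6C, whose five base-85 digits
# map to the characters '87cUR' in Ascii85.
assert base64.a85decode('87cUR') == b'Hell'

# 'z' is the Ascii85 shorthand for four zero bytes; the ASCII85Decode
# replacement above handles it in the elif c == ord('z') branch.
assert base64.a85decode('z') == b'\x00\x00\x00\x00'
print('ok')
```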

Which of these is pythonic? and Pythonic vs. Speed

I'm new to Python and just wrote this module-level function:
def _interval(patt):
    """Converts a string pattern of the form '1y 42d 14h56m'
    to a timedelta object.
    y - years (365 days), M - months (30 days), w - weeks, d - days,
    h - hours, m - minutes, s - seconds"""
    m = _re.findall(r'([+-]?\d*(?:\.\d+)?)([yMwdhms])', patt)
    args = {'weeks': 0.0,
            'days': 0.0,
            'hours': 0.0,
            'minutes': 0.0,
            'seconds': 0.0}
    for (n, q) in m:
        if q == 'y':
            args['days'] += float(n)*365
        elif q == 'M':
            args['days'] += float(n)*30
        elif q == 'w':
            args['weeks'] += float(n)
        elif q == 'd':
            args['days'] += float(n)
        elif q == 'h':
            args['hours'] += float(n)
        elif q == 'm':
            args['minutes'] += float(n)
        elif q == 's':
            args['seconds'] += float(n)
    return _dt.timedelta(**args)
My issue is with the for loop here, i.e. the long if/elif block, and I was wondering if there is a more Pythonic way of doing it.
So I re-wrote the function as:
def _interval2(patt):
    m = _re.findall(r'([+-]?\d*(?:\.\d+)?)([yMwdhms])', patt)
    args = {'weeks': 0.0,
            'days': 0.0,
            'hours': 0.0,
            'minutes': 0.0,
            'seconds': 0.0}
    argsmap = {'y': ('days', lambda x: float(x)*365),
               'M': ('days', lambda x: float(x)*30),
               'w': ('weeks', lambda x: float(x)),
               'd': ('days', lambda x: float(x)),
               'h': ('hours', lambda x: float(x)),
               'm': ('minutes', lambda x: float(x)),
               's': ('seconds', lambda x: float(x))}
    for (n, q) in m:
        args[argsmap[q][0]] += argsmap[q][1](n)
    return _dt.timedelta(**args)
I tested the execution times of both versions using the timeit module and found that the second one took about 5-6 seconds longer (for the default number of repeats).
So my question is:
1. Which code is considered more Pythonic?
2. Is there a still more Pythonic way of writing this function?
3. What about the trade-offs between Pythonicity and other aspects (like speed, in this case) of programming?
p.s. I kinda have an OCD for elegant code.
EDITED _interval2 after seeing this answer:
argsmap = {'y': ('days', 365),
           'M': ('days', 30),
           'w': ('weeks', 1),
           'd': ('days', 1),
           'h': ('hours', 1),
           'm': ('minutes', 1),
           's': ('seconds', 1)}

for (n, q) in m:
    args[argsmap[q][0]] += float(n)*argsmap[q][1]
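Pulled together as a self-contained sketch (using plain re/datetime names in place of the module's _re/_dt aliases), the edited table-driven version behaves like this:

```python
import re
import datetime as dt

def interval(patt):
    """'1y 42d 14h56m' -> timedelta (y = 365 days, M = 30 days)."""
    m = re.findall(r'([+-]?\d*(?:\.\d+)?)([yMwdhms])', patt)
    args = {'weeks': 0.0, 'days': 0.0, 'hours': 0.0,
            'minutes': 0.0, 'seconds': 0.0}
    # (timedelta keyword, multiplier) for each unit letter
    argsmap = {'y': ('days', 365), 'M': ('days', 30), 'w': ('weeks', 1),
               'd': ('days', 1), 'h': ('hours', 1), 'm': ('minutes', 1),
               's': ('seconds', 1)}
    for n, q in m:
        key, mult = argsmap[q]
        args[key] += float(n) * mult
    return dt.timedelta(**args)

print(interval('1d 2h'))   # 1 day, 2:00:00
print(interval('1y 42d'))  # 407 days, 0:00:00
```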
You seem to create a lot of lambdas every time you parse. You really don't need a lambda, just a multiplier. Try this:
def _factor_for(what):
    if what == 'y': return 365
    elif what == 'M': return 30
    elif what in ('w', 'd', 'h', 's', 'm'): return 1
    else: raise ValueError("Invalid specifier %r" % what)

for (n, q) in m:
    args[argsmap[q][0]] += _factor_for(q) * float(n)
Don't make _factor_for a method's local function or a method, though, to speed things up.
(I have not timed this, but) if you're going to use this function often it might be worth pre-compiling the regex expression.
Here's my take on your function:
re_timestr = re.compile("""
    ((?P<years>\d+)y)?\s*
    ((?P<months>\d+)M)?\s*
    ((?P<weeks>\d+)w)?\s*
    ((?P<days>\d+)d)?\s*
    ((?P<hours>\d+)h)?\s*
    ((?P<minutes>\d+)m)?\s*
    ((?P<seconds>\d+)s)?
""", re.VERBOSE)
def interval3(patt):
    p = {}
    match = re_timestr.match(patt)
    if not match:
        raise ValueError("invalid pattern : %s" % (patt))
    for k, v in match.groupdict("0").iteritems():
        p[k] = int(v)  # cast string to int
    p["days"] += p.pop("years") * 365   # convert years to days
    p["days"] += p.pop("months") * 30   # convert months to days
    return datetime.timedelta(**p)
update
From this question, it looks like precompiling regex patterns does not bring about noticeable performance improvement since Python caches and reuses them anyway. You only save the time it takes to check the cache which, unless you are repeating it numerous times, is negligible.
update2
As you quite rightly pointed out, this solution does not support interval3("1h 30s" + "2h 10m"). However, timedelta supports arithmetic operations, which means you can still express it as interval3("1h 30s") + interval3("2h 10m").
Also, as mentioned by some of the comments on the question, you may want to avoid supporting "years" and "months" in the inputs. There's a reason why timedelta does not support those arguments: they cannot be handled correctly (and incorrect code is almost never elegant).
Here's another version, this time with support for float, negative values, and some error checking.
re_timestr = re.compile("""
    ^\s*
    ((?P<weeks>[+-]?\d+(\.\d*)?)w)?\s*
    ((?P<days>[+-]?\d+(\.\d*)?)d)?\s*
    ((?P<hours>[+-]?\d+(\.\d*)?)h)?\s*
    ((?P<minutes>[+-]?\d+(\.\d*)?)m)?\s*
    ((?P<seconds>[+-]?\d+(\.\d*)?)s)?\s*
    $
""", re.VERBOSE)

def interval4(patt):
    p = {}
    match = re_timestr.match(patt)
    if not match:
        raise ValueError("invalid pattern : %s" % (patt))
    for k, v in match.groupdict("0").iteritems():
        p[k] = float(v)  # cast string to float
    return datetime.timedelta(**p)
Example use cases:
>>> print interval4("1w 2d 3h4m") # basic use
9 days, 3:04:00
>>> print interval4("1w") - interval4("2d 3h 4m") # timedelta arithmetic
4 days, 20:56:00
>>> print interval4("0.3w -2.d +1.01h") # +ve and -ve floats
3:24:36
>>> print interval4("0.3x") # reject invalid input
Traceback (most recent call last):
File "date.py", line 19, in interval4
raise ValueError("invalid pattern : %s" % (patt))
ValueError: invalid pattern : 0.3x
>>> print interval4("1h 2w") # order matters
Traceback (most recent call last):
File "date.py", line 19, in interval4
raise ValueError("invalid pattern : %s" % (patt))
ValueError: invalid pattern : 1h 2w
Yes, there is. Use time.strptime instead:

Parse a string representing a time according to a format. The return value is a struct_time as returned by gmtime() or localtime().
