PySpark map datetime to DoW

I'm trying to map a column 'eventtimestamp' to its day of week with the following function:
from datetime import datetime
import calendar
from pyspark.sql.functions import UserDefinedFunction as udf

def toWeekDay(x):
    v = int(datetime.strptime(str(x), '%Y-%m-%d %H:%M:%S').strftime('%w'))
    if v == 0:
        v = 6
    else:
        v = v - 1
    return calendar.day_name[v]
and for my df I'm trying to create a new column dow with a UDF:
udf_toWeekDay = udf(lambda x: toWeekDay(x), StringType())
df = df.withColumn("dow",udf_toWeekDay('eventtimestamp'))
Yet I'm getting an error I do not understand at all. At first it complained about passing a datetime.datetime into strptime instead of a string, so I cast to str, and now I have no clue what's wrong.
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-9040214714346906648.py", line 267, in <module>
raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-9040214714346906648.py", line 260, in <module>
exec(code)
File "<stdin>", line 10, in <module>
File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 429, in take
return self.limit(num).collect()
File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 391, in collect
port = self._jdf.collectToPython()
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o6250.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1107.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1107.0 (TID 63757, ip-172-31-27-113.eu-west-1.compute.internal, executor 819): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
Thanks a lot for any clues!

You can use date_format to get the day of week:
from pyspark.sql.functions import date_format
df = df.withColumn("dow", date_format(df['eventtimestamp'], 'EEEE'))

Related

Gremlin/Python: run query as string

I have the following code, and it runs as expected. But it relies on the "g" traversal object to manipulate the graph:
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
g = traversal().withRemote(DriverRemoteConnection('ws://localhost:8182/gremlin','g'))
g.V().drop().iterate()
g.addV('my-label').property('k', 'v').next()
print(g.V().toList())
Instead of the "g" object, I want to run string query to modify the graph, and the following doesn't work.
from gremlin_python.driver import client
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
ws_conn = DriverRemoteConnection('ws://localhost:8182/gremlin','g')
gremlin_conn = client.Client(ws_conn, "g")
query = "g.V().groupCount().by(label).unfold().project('label','count').by(keys).by(values)"
response = gremlin_conn.submit(query)
print(response)
Gives the following error:
(venv) sh-3.2$ python /Users/demo-prj/tests/tools/neptune/local.py
[v[4280]]
Traceback (most recent call last):
File "/Users/demo-prj/tests/tools/neptune/local.py", line 24, in <module>
response = gremlin_conn.submit(query)
File "/Users/demo-prj/venv/lib/python3.8/site-packages/gremlin_python/driver/client.py", line 127, in submit
return self.submitAsync(message, bindings=bindings, request_options=request_options).result()
File "/Users/demo-prj/venv/lib/python3.8/site-packages/gremlin_python/driver/client.py", line 148, in submitAsync
return conn.write(message)
File "/Users/demo-prj/venv/lib/python3.8/site-packages/gremlin_python/driver/connection.py", line 55, in write
self.connect()
File "/Users/demo-prj/venv/lib/python3.8/site-packages/gremlin_python/driver/connection.py", line 45, in connect
self._transport.connect(self._url, self._headers)
File "/Users/demo-prj/venv/lib/python3.8/site-packages/gremlin_python/driver/tornado/transport.py", line 40, in connect
self._ws = self._loop.run_sync(
File "/Users/demo-prj/venv/lib/python3.8/site-packages/tornado/ioloop.py", line 576, in run_sync
return future_cell[0].result()
File "/Users/demo-prj/venv/lib/python3.8/site-packages/tornado/ioloop.py", line 547, in run
result = func()
File "/Users/demo-prj/venv/lib/python3.8/site-packages/gremlin_python/driver/tornado/transport.py", line 41, in <lambda>
lambda: websocket.websocket_connect(url, compression_options=self._compression_options))
File "/Users/demo-prj/venv/lib/python3.8/site-packages/tornado/websocket.py", line 1333, in websocket_connect
conn = WebSocketClientConnection(request,
File "/Users/demo-prj/venv/lib/python3.8/site-packages/tornado/websocket.py", line 1122, in __init__
scheme, sep, rest = request.url.partition(':')
AttributeError: 'DriverRemoteConnection' object has no attribute 'partition'
This works.
from gremlin_python.driver import client
from tornado import httpclient
ws_url = 'ws://localhost:8182/gremlin'
ws_conn = httpclient.HTTPRequest(ws_url)
gremlin_conn = client.Client(ws_conn, "g")
query = "g.V().groupCount().by(label).unfold().project('label','count').by(keys).by(values)"
response = gremlin_conn.submit(query)
print(response)
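As a small follow-up (standard gremlinpython usage rather than anything shown in the original post): client.submit() returns a ResultSet, so to get the actual rows you normally resolve it with .all().result() and close the client when done.
from gremlin_python.driver import client
from tornado import httpclient

ws_conn = httpclient.HTTPRequest('ws://localhost:8182/gremlin')
gremlin_conn = client.Client(ws_conn, "g")
query = "g.V().groupCount().by(label).unfold().project('label','count').by(keys).by(values)"
result_set = gremlin_conn.submit(query)   # a ResultSet, not the data itself
results = result_set.all().result()       # block until the server has streamed all results
print(results)
gremlin_conn.close()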

Jupyter R Kernel Failing to Start

After installing the R kernel via conda, I get the following error when trying to start up a kernel. The error says there's a file missing, but I can't figure out which file it's referring to. Any idea what's actually throwing the error?
Traceback (most recent call last):
File "/home/Jupyter/anaconda2/lib/python2.7/site-packages/tornado/web.py", line 1445, in _execute
result = yield result
File "/home/Jupyter/anaconda2/lib/python2.7/site-packages/tornado/gen.py", line 1008, in run
value = future.result()
File "/home/Jupyter/anaconda2/lib/python2.7/site-packages/tornado/concurrent.py", line 232, in result
raise_exc_info(self._exc_info)
File "/home/Jupyter/anaconda2/lib/python2.7/site-packages/tornado/gen.py", line 1014, in run
yielded = self.gen.throw(*exc_info)
File "/home/Jupyter/anaconda2/lib/python2.7/site-packages/notebook/services/sessions/handlers.py", line 73, in post
type=mtype))
File "/home/Jupyter/anaconda2/lib/python2.7/site-packages/tornado/gen.py", line 1008, in run
value = future.result()
File "/home/Jupyter/anaconda2/lib/python2.7/site-packages/tornado/concurrent.py", line 232, in result
raise_exc_info(self._exc_info)
File "/home/Jupyter/anaconda2/lib/python2.7/site-packages/tornado/gen.py", line 1014, in run
yielded = self.gen.throw(*exc_info)
File "/home/Jupyter/anaconda2/lib/python2.7/site-packages/notebook/services/sessions/sessionmanager.py", line 79, in create_session
kernel_id = yield self.start_kernel_for_session(session_id, path, name, type, kernel_name)
File "/home/Jupyter/anaconda2/lib/python2.7/site-packages/tornado/gen.py", line 1008, in run
value = future.result()
File "/home/Jupyter/anaconda2/lib/python2.7/site-packages/tornado/concurrent.py", line 232, in result
raise_exc_info(self._exc_info)
File "/home/Jupyter/anaconda2/lib/python2.7/site-packages/tornado/gen.py", line 1014, in run
yielded = self.gen.throw(*exc_info)
File "/home/Jupyter/anaconda2/lib/python2.7/site-packages/notebook/services/sessions/sessionmanager.py", line 92, in start_kernel_for_session
self.kernel_manager.start_kernel(path=kernel_path, kernel_name=kernel_name)
File "/home/Jupyter/anaconda2/lib/python2.7/site-packages/tornado/gen.py", line 1008, in run
value = future.result()
File "/home/Jupyter/anaconda2/lib/python2.7/site-packages/tornado/concurrent.py", line 232, in result
raise_exc_info(self._exc_info)
File "/home/Jupyter/anaconda2/lib/python2.7/site-packages/tornado/gen.py", line 282, in wrapper
yielded = next(result)
File "/home/Jupyter/anaconda2/lib/python2.7/site-packages/notebook/services/kernels/kernelmanager.py", line 141, in start_kernel
super(MappingKernelManager, self).start_kernel(**kwargs)
File "/home/Jupyter/anaconda2/lib/python2.7/site-packages/jupyter_client/multikernelmanager.py", line 109, in start_kernel
km.start_kernel(**kwargs)
File "/home/Jupyter/anaconda2/lib/python2.7/site-packages/jupyter_client/manager.py", line 244, in start_kernel
**kw)
File "/home/Jupyter/anaconda2/lib/python2.7/site-packages/jupyter_client/manager.py", line 190, in _launch_kernel
return launch_kernel(kernel_cmd, **kw)
File "/home/Jupyter/anaconda2/lib/python2.7/site-packages/jupyter_client/launcher.py", line 123, in launch_kernel
proc = Popen(cmd, **kwargs)
File "/home/Jupyter/anaconda2/lib/python2.7/subprocess.py", line 390, in __init__
errread, errwrite)
File "/home/Jupyter/anaconda2/lib/python2.7/subprocess.py", line 1025, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
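The traceback bottoms out in Popen, which means the executable listed in the kernel's argv could not be found, rather than a missing notebook file. A quick way to see what command Jupyter is trying to launch (a diagnostic sketch; the kernel name 'ir' is an assumption, it is just the name the conda R kernel usually registers under):
from jupyter_client.kernelspec import KernelSpecManager

ksm = KernelSpecManager()
print(ksm.find_kernel_specs())      # kernel names mapped to their kernelspec directories
spec = ksm.get_kernel_spec('ir')    # 'ir' is assumed; use a name printed above
print(spec.argv)                    # the command Jupyter runs; argv[0] must exist on this machine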

PipelinedRDD can't convert to dataframe using toDF

I have a pyspark dataframe that contains rows of comma-separated data. I want to split each row, apply the LabeledPoint constructor to it, and then convert the result to a dataframe.
Here is my code:
import os.path
from pyspark.mllib.regression import LabeledPoint
import numpy as np
file_name = os.path.join('databricks-datasets', 'cs190', 'data-001', 'millionsong.txt')
raw_data_df = sqlContext.read.load(file_name, 'text')
rdd = raw_data_df.rdd.map(lambda line: line.split(',')).map(lambda seq:LabeledPoints(seq[0],seq[1:])).toDF()
It gives the following error message after applying .toDF().
---------------------------------------------------------------------------
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 38.0 failed 1 times, most recent failure: Lost task 0.0 in stage 38.0 (TID 44, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
Py4JJavaError Traceback (most recent call last)
<ipython-input-65-dc4d86a8ee45> in <module>()
----> 1 rdd = raw_data_df.rdd.map(lambda line: line.split(',')).map(lambda seq:LabeledPoints(seq[0],seq[1:])).toDF()
2 print(type(rdd))
3 #print(rdd.take(5))
/databricks/spark/python/pyspark/sql/context.py in toDF(self, schema, sampleRatio)
62 [Row(name=u'Alice', age=1)]
63 """
---> 64 return sqlContext.createDataFrame(self, schema, sampleRatio)
65
66 RDD.toDF = toDF
/databricks/spark/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
421
422 if isinstance(data, RDD):
--> 423 rdd, schema = self._createFromRDD(data, schema, samplingRatio)
424 else:
425 rdd, schema = self._createFromLocal(data, schema)
/databricks/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, schema, samplingRatio)
Answer found:
rdd = raw_data_df.map(lambda row: row['value'].split(',')).map(lambda seq:LabeledPoint(float(seq[0]),seq[1:])).toDF()
Here, I need to specifically reference each line of text using row['value'], even though there is only one feature in the row.
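For reference, a self-contained version of the fix (a sketch that mirrors the answer line above; it reuses the sqlContext and file path from the question and corrects LabeledPoints to LabeledPoint):
import os.path
from pyspark.mllib.regression import LabeledPoint

file_name = os.path.join('databricks-datasets', 'cs190', 'data-001', 'millionsong.txt')
raw_data_df = sqlContext.read.load(file_name, 'text')

# each RDD element is a Row with a single 'value' field, so pull the string
# out of the Row before splitting the comma-separated values
labeled_df = (raw_data_df.rdd
              .map(lambda row: row['value'].split(','))
              .map(lambda seq: LabeledPoint(float(seq[0]), seq[1:]))
              .toDF())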

Error in Sage tutorial on coding theory

I have just installed Sage 6.3 on Ubuntu 14.04 and tried the coding theory tutorial as follows:
MS = MatrixSpace(GF(2),4,7)
G = MS([[1,1,1,0,0,0,0], [1,0,0,1,1,0,0], [0,1,0,1,0,1,0], [1,1,0,1,0,0,1]])
C = LinearCode(G)
On the third evaluation, Sage produced the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "_sage_input_4.py", line 10, in <module>
exec compile(u'open("___code___.py","w").write("# -*- coding: utf-8 -*-\\n" + _support_.preparse_worksheet_cell(base64.b64decode("QyA9IExpbmVhckNvZGUoRyk="),globals())+"\\n"); execfile(os.path.abspath("___code___.py"))
File "", line 1, in <module>
File "/tmp/tmpXyxNvC/___code___.py", line 2, in <module>
exec compile(u'C = LinearCode(G)
File "", line 1, in <module>
File "/usr/local/sage/local/lib/python2.7/site-packages/sage/coding/linear_code.py", line 785, in __init__
facade_for = gen_mat.row(0).parent()
File "matrix_mod2_dense.pyx", line 576, in sage.matrix.matrix_mod2_dense.Matrix_mod2_dense.row (build/cythonized/sage/matrix/matrix_mod2_dense.c:5387)
File "/usr/local/sage/local/lib/python2.7/site-packages/sage/modules/free_module.py", line 432, in VectorSpace
return FreeModule(K, rank=dimension, sparse=sparse, inner_product_matrix=inner_product_matrix)
File "factory.pyx", line 366, in sage.structure.factory.UniqueFactory.__call__ (build/cythonized/sage/structure/factory.c:1327)
File "factory.pyx", line 410, in sage.structure.factory.UniqueFactory.get_object (build/cythonized/sage/structure/factory.c:1679)
File "/usr/local/sage/local/lib/python2.7/site-packages/sage/modules/free_module.py", line 380, in create_object
return FreeModule_ambient_field(base_ring, rank, sparse=sparse)
File "/usr/local/sage/local/lib/python2.7/site-packages/sage/modules/free_module.py", line 4972, in __init__
FreeModule_ambient_pid.__init__(self, base_field, dimension, sparse=sparse)
File "/usr/local/sage/local/lib/python2.7/site-packages/sage/modules/free_module.py", line 4893, in __init__
FreeModule_ambient_domain.__init__(self, base_ring=base_ring, rank=rank, sparse=sparse)
File "/usr/local/sage/local/lib/python2.7/site-packages/sage/modules/free_module.py", line 4709, in __init__
FreeModule_ambient.__init__(self, base_ring, rank, sparse)
File "/usr/local/sage/local/lib/python2.7/site-packages/sage/modules/free_module.py", line 4184, in __init__
FreeModule_generic.__init__(self, base_ring, rank=rank, degree=rank, sparse=sparse)
File "/usr/local/sage/local/lib/python2.7/site-packages/sage/modules/free_module.py", line 714, in __init__
self.element_class()
File "/usr/local/sage/local/lib/python2.7/site-packages/sage/modules/free_module.py", line 896, in element_class
C = element_class(self.base_ring(), self.is_sparse())
File "/usr/local/sage/local/lib/python2.7/site-packages/sage/modules/free_module.py", line 6721, in element_class
import sage.modules.vector_real_double_dense
File "vector_real_double_dense.pyx", line 1, in init sage.modules.vector_real_double_dense (build/cythonized/sage/modules/vector_real_double_dense.c:5611)
File "__init__.pxd", line 155, in init sage.modules.vector_double_dense (build/cythonized/sage/modules/vector_double_dense.c:11813)
File "/usr/local/sage/local/lib/python2.7/site-packages/numpy/__init__.py", line 153, in <module>
from . import add_newdocs
File "/usr/local/sage/local/lib/python2.7/site-packages/numpy/add_newdocs.py", line 13, in <module>
from numpy.lib import add_newdoc
File "/usr/local/sage/local/lib/python2.7/site-packages/numpy/lib/__init__.py", line 18, in <module>
from .polynomial import *
File "/usr/local/sage/local/lib/python2.7/site-packages/numpy/lib/polynomial.py", line 19, in <module>
from numpy.linalg import eigvals, lstsq, inv
File "/usr/local/sage/local/lib/python2.7/site-packages/numpy/linalg/__init__.py", line 50, in <module>
from .linalg import *
File "/usr/local/sage/local/lib/python2.7/site-packages/numpy/linalg/linalg.py", line 29, in <module>
from numpy.linalg import lapack_lite, _umath_linalg
ImportError: libgfortran.so.3: cannot open shared object file: No such file or directory
How can I solve this problem? I am new to Sage!
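For what it's worth, the final ImportError points at a missing system library (libgfortran.so.3) rather than anything in the tutorial code itself. A minimal check from a Python prompt, to confirm the dynamic loader really cannot find it:
import ctypes
# raises OSError if libgfortran.so.3 is not on the loader's search path
ctypes.CDLL("libgfortran.so.3")
On Ubuntu that library normally comes from the libgfortran3 package, so installing it is the usual first thing to try; treat this as a hint rather than a confirmed fix.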

ValueError when trying to start carbon-cache

I have an issue with Graphite, specifically with carbon-cache. At some point I had it running, but coming back after a few weeks I tried to start Graphite again. The Django webapp runs fine, but it seems I have an issue with the carbon-cache backend. Graphite is installed in /opt/graphite and I run /opt/graphite/bin/carbon-cache.py start. This is the error I get:
root@stfutm01:/opt/graphite/bin# ./carbon-cache.py start
Starting carbon-cache (instance a)
Traceback (most recent call last):
File "./carbon-cache.py", line 30, in <module>
run_twistd_plugin(__file__)
File "/opt/graphite/lib/carbon/util.py", line 92, in run_twistd_plugin
runApp(config)
File "/usr/local/lib/python2.7/dist-packages/twisted/scripts/twistd.py", line 23, in runApp
_SomeApplicationRunner(config).run()
File "/usr/local/lib/python2.7/dist-packages/twisted/application/app.py", line 386, in run
self.application = self.createOrGetApplication()
File "/usr/local/lib/python2.7/dist-packages/twisted/application/app.py", line 446, in createOrGetApplication
ser = plg.makeService(self.config.subOptions)
File "/opt/graphite/lib/twisted/plugins/carbon_cache_plugin.py", line 21, in makeService
return service.createCacheService(options)
File "/opt/graphite/lib/carbon/service.py", line 127, in createCacheService
from carbon.writer import WriterService
File "/opt/graphite/lib/carbon/writer.py", line 34, in <module>
schemas = loadStorageSchemas()
File "/opt/graphite/lib/carbon/storage.py", line 123, in loadStorageSchemas
archives = [ Archive.fromString(s) for s in retentions ]
File "/opt/graphite/lib/carbon/storage.py", line 107, in fromString
(secondsPerPoint, points) = whisper.parseRetentionDef(retentionDef)
File "/usr/local/lib/python2.7/dist-packages/whisper.py", line 76, in parseRetentionDef
(precision, points) = retentionDef.strip().split(':')
ValueError: need more than 1 value to unpack
I see that it is an issue with the split retentionDef.strip().split(':'). My storage schema config file (/opt/graphite/conf/storage-schemas.conf) looks like this:
[stats]
priority = 110
pattern = ^stats\..*
retentions = 10s:6h,1m:7d,10m:1y
[ts3]
priority = 100
pattern = ^skarp\.ts3\..*
retentions = 60s:1y,1h,:5y
Any hints on where I should be looking? Or does anybody know what I'm missing here?
I think the problem is the [ts3] retentions. "The retentions line can specify multiple retentions. Each retention of frequency:history is separated by a comma."
In ts3 there appear to be 3 retentions (comma-delimited), with the second not specifying a history and the last not specifying a frequency.
retentions = 60s:1y,1h,:5y
I think you may have meant:
retentions = 60s:1y,1h:5y
That would be 60-second data for 1 year, followed by 1-hour data for 5 years.
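A quick way to check which retention definition carbon will reject is to feed each one through the same whisper helper that appears in the traceback (a sketch using the retention strings from the question):
import whisper

# each entry must be frequency:history; the middle entry below has no history at all
for retention in "60s:1y,1h,:5y".split(','):
    try:
        seconds_per_point, points = whisper.parseRetentionDef(retention)
        print('%s -> %d seconds per point, %d points' % (retention, seconds_per_point, points))
    except ValueError as err:
        print('%s -> invalid: %s' % (retention, err))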