Using writeOGR with rpy2

I am trying to call writeOGR in R from Python 3.8 via rpy2.
import rpy2.robjects as ro
.....
ro.r('ttops <- .....')
ro.r('writeOGR(obj=ttops, dsn="T:/Internal/segmentation", layer="test", driver="ESRI Shapefile")')
errors with:
R[write to console]: Error in writeOGR(obj = ttops, dsn = "T:/Internal/LiDAR/crown_segmentation", :
could not find function "writeOGR"
Traceback (most recent call last):
File "C:/Users/david/PycharmProjects/main.py", line 7, in <module>
main()
File "C:/Users/david/PycharmProjects/main.py", line 4, in main
R_Packages().process()
File "C:\Users\david\PycharmProjects\model_testing\r_methods.py", line 17, in process
ro.r('writeOGR(obj=ttops, dsn="T:/Internal/segmentation", layer="test", driver="ESRI Shapefile")')
File "C:\Users\david\AppData\Local\Programs\Python\Python38\lib\site-packages\rpy2\robjects\__init__.py", line 416, in __call__
res = self.eval(p)
File "C:\Users\david\AppData\Local\Programs\Python\Python38\lib\site-packages\rpy2\robjects\functions.py", line 197, in __call__
return (super(SignatureTranslatedFunction, self)
File "C:\Users\david\AppData\Local\Programs\Python\Python38\lib\site-packages\rpy2\robjects\functions.py", line 125, in __call__
res = super(Function, self).__call__(*new_args, **new_kwargs)
File "C:\Users\david\AppData\Local\Programs\Python\Python38\lib\site-packages\rpy2\rinterface_lib\conversion.py", line 44, in _
cdata = function(*args, **kwargs)
File "C:\Users\david\AppData\Local\Programs\Python\Python38\lib\site-packages\rpy2\rinterface.py", line 624, in __call__
raise embedded.RRuntimeError(_rinterface._geterrmessage())
rpy2.rinterface_lib.embedded.RRuntimeError: Error in writeOGR(obj = ttops, dsn = "T:/Internal/segmentation", :
could not find function "writeOGR"
Am I missing something, or is this a limitation of rpy2? If it is a limitation, what is an alternative way to write shapefiles of R data from Python?

The problem turned out to be a library that I never had to load explicitly when working in R directly, but that does need to be loaded when R is driven from Python:
ro.r('library(rgdal)')
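For completeness, here is a minimal sketch of the full call with the package loaded through rpy2's importr instead (assuming rgdal is installed in the R library that rpy2 uses; the dsn path and layer name are just the placeholders from the question):

import rpy2.robjects as ro
from rpy2.robjects.packages import importr

# Load rgdal explicitly in the embedded R session so writeOGR is available.
rgdal = importr('rgdal')

# ttops is assumed to have been created earlier in the same R session,
# e.g. with ro.r('ttops <- ...').
ro.r('writeOGR(obj=ttops, dsn="T:/Internal/segmentation", '
     'layer="test", driver="ESRI Shapefile")')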

Related

In PyCaret, getting error: "ValueError: Cannot cast object dtype to float64"

While using PyCaret's ML capabilities I am facing the following error.
The ML Code I am using:
if choice == "ML":
    st.title("Your Machine Learning Process Starts Here!")
    target = st.selectbox("Select Your Target", df.columns)
    setup(df, target=target, silent=True)
    setup_df = pull()
    st.info("This is the ML Experiment Settings")
    st.dataframe(setup_df)
    best_model = compare_models()
    compare_df = pull()
    st.info("This is the ML Model")
    st.dataframe(compare_df)
    best_model
Now while running the code I am getting the error:
TypeError: setup() got an unexpected keyword argument 'silent'
Traceback:
File "D:\Python Projects\Machine-Learning-App\mlapp\lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 563, in _run_script
exec(code, module.__dict__)
File "D:\Python Projects\Machine-Learning-App\app.py", line 37, in <module>
setup(df, target=target, silent=True)
Hence, I removed silent=True, and now I am getting the following error:
ValueError: Cannot cast object dtype to float64
Traceback:
File "D:\Python Projects\Machine-Learning-App\mlapp\lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 563, in _run_script
exec(code, module.__dict__)
File "D:\Python Projects\Machine-Learning-App\app.py", line 37, in <module>
setup(df, target=target)
File "C:\Users\aviparna.biswas\AppData\Roaming\Python\Python39\site-packages\pycaret\classification\functional.py", line 596, in setup
return exp.setup(
File "C:\Users\aviparna.biswas\AppData\Roaming\Python\Python39\site-packages\pycaret\classification\oop.py", line 885, in setup
self.pipeline.fit(self.X_train, self.y_train)
File "C:\Users\aviparna.biswas\AppData\Roaming\Python\Python39\site-packages\pycaret\internal\pipeline.py", line 211, in fit
X, y, _ = self._fit(X, y, **fit_params_steps)
File "C:\Users\aviparna.biswas\AppData\Roaming\Python\Python39\site-packages\pycaret\internal\pipeline.py", line 192, in _fit
X, y, fitted_transformer = self._memory_fit(
File "D:\Python Projects\Machine-Learning-App\mlapp\lib\site-packages\joblib\memory.py", line 594, in __call__
return self._cached_call(args, kwargs)[0]
File "D:\Python Projects\Machine-Learning-App\mlapp\lib\site-packages\joblib\memory.py", line 537, in _cached_call
out, metadata = self.call(*args, **kwargs)
File "D:\Python Projects\Machine-Learning-App\mlapp\lib\site-packages\joblib\memory.py", line 779, in call
output = self.func(*args, **kwargs)
File "C:\Users\aviparna.biswas\AppData\Roaming\Python\Python39\site-packages\pycaret\internal\pipeline.py", line 87, in _fit_transform_one
_fit_one(transformer, X, y, message, **fit_params)
File "C:\Users\aviparna.biswas\AppData\Roaming\Python\Python39\site-packages\pycaret\internal\pipeline.py", line 54, in _fit_one
transformer.fit(*args, **fit_params)
File "C:\Users\aviparna.biswas\AppData\Roaming\Python\Python39\site-packages\pycaret\internal\preprocess\transformers.py", line 216, in fit
self.transformer.fit(*args, **fit_params)
File "D:\Python Projects\Machine-Learning-App\mlapp\lib\site-packages\sklearn\impute\_base.py", line 364, in fit
X = self._validate_input(X, in_fit=True)
File "D:\Python Projects\Machine-Learning-App\mlapp\lib\site-packages\sklearn\impute\_base.py", line 319, in _validate_input
raise ve
File "D:\Python Projects\Machine-Learning-App\mlapp\lib\site-packages\sklearn\impute\_base.py", line 302, in _validate_input
X = self._validate_data(
File "D:\Python Projects\Machine-Learning-App\mlapp\lib\site-packages\sklearn\base.py", line 577, in _validate_data
X = check_array(X, input_name="X", **check_params)
File "D:\Python Projects\Machine-Learning-App\mlapp\lib\site-packages\sklearn\utils\validation.py", line 791, in check_array
array = array.astype(new_dtype)
File "D:\Python Projects\Machine-Learning-App\mlapp\lib\site-packages\pandas\core\generic.py", line 5912, in astype
new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
File "D:\Python Projects\Machine-Learning-App\mlapp\lib\site-packages\pandas\core\internals\managers.py", line 419, in astype
return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
File "D:\Python Projects\Machine-Learning-App\mlapp\lib\site-packages\pandas\core\internals\managers.py", line 304, in apply
applied = getattr(b, f)(**kwargs)
File "D:\Python Projects\Machine-Learning-App\mlapp\lib\site-packages\pandas\core\internals\blocks.py", line 580, in astype
new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
File "D:\Python Projects\Machine-Learning-App\mlapp\lib\site-packages\pandas\core\dtypes\cast.py", line 1292, in astype_array_safe
new_values = astype_array(values, dtype, copy=copy)
File "D:\Python Projects\Machine-Learning-App\mlapp\lib\site-packages\pandas\core\dtypes\cast.py", line 1234, in astype_array
values = values.astype(dtype, copy=copy)
File "D:\Python Projects\Machine-Learning-App\mlapp\lib\site-packages\pandas\core\arrays\categorical.py", line 556, in astype
raise ValueError(msg)
Is there any workaround for this error?
To install PyCaret I used the command pip install -U --pre pycaret, which installs the latest pre-release (3.0.0rc2) and works with Python 3.9.
I had the same issue. It works for me when I use Python 3.7.9, but not 3.8.x or 3.9.x (as of now).
The 3.0.0rc2 version no longer supports the silent parameter in setup(); hence the first error.
About this line:
setup(df, target=target)
The error would indicate that your labels are not of a numerical type but of object dtype. I'd look into that; see the sketch below.
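As a rough sketch of what to check (hypothetical column names and values; this is not part of the original answer), inspect the dtypes and convert or encode the offending columns before calling setup():

import pandas as pd

# Show which columns pandas treats as object (strings or mixed types).
print(df.dtypes)

# If the target itself is stored as text, map it to numeric labels,
# e.g. for a binary target with the hypothetical values "yes"/"no":
# df[target] = df[target].map({"no": 0, "yes": 1})

# Numeric columns that were read in as strings can be coerced explicitly:
# df["some_numeric_col"] = pd.to_numeric(df["some_numeric_col"], errors="coerce")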
This is due to PyCaret not being compatible with your scikit-learn version.
As the documentation says (https://pycaret.gitbook.io/docs/get-started/installation):
"PyCaret is not yet compatible with sklearn>=0.23.2"
Try downgrading scikit-learn:
pip install scikit-learn==0.23.2

How to calculate the runtime of MPI program in mpi4py

I wrote this code and tried some functions to measure the time, but it shows an error while running. I tried to reproduce the C approach of measuring the runtime of each process with Wtime() and combining the results with MPI_Reduce. Wtime() works, I guess, but the reduce call does not work the way it should.
The code is
from mpi4py import MPI
import mpi4py
import numpy as np

comm = MPI.COMM_WORLD
size = comm.size
rank = comm.Get_rank()

time1 = mpi4py.MPI.Wtime()
a = np.random.randint(10, size=(10, 10))
if rank == 0:
    b = np.random.randint(10, size=(10, 10))
    print(b)
else:
    b = None
b = comm.bcast(b, root=0)
c = np.dot(a, b)
if size == 1:
    result = np.dot(a, b)
else:
    if rank == 0:
        a_row = a.shape[0]
        if a_row >= size:
            split = np.array_split(a, size, axis=0)
    else:
        split = None
    split = comm.scatter(split, root=0)
    split = np.dot(split, b)
    data = comm.gather(split, root=0)
    time2 = mpi4py.MPI.Wtime()
    duration = time2 - time1
    totaltime = comm.reduce(duration, op=sum, root=0)
    print("Runtime at %d is %f" % (rank, duration))
    if rank == 0:
        result = np.vstack(data)
        print(result)
        print(totaltime)
The error it shows is
Runtime at 3 is 0.000574
Traceback (most recent call last):
File "matrixmultMPI.py", line 48, in <module>
totaltime = comm.reduce(duration,op = sum, root = 0)
File "mpi4py/MPI/Comm.pyx", line 1613, in mpi4py.MPI.Comm.reduce
File "mpi4py/MPI/msgpickle.pxi", line 1322, in mpi4py.MPI.PyMPI_reduce
File "mpi4py/MPI/msgpickle.pxi", line 1254, in mpi4py.MPI.PyMPI_reduce_intra
File "mpi4py/MPI/msgpickle.pxi", line 1126, in mpi4py.MPI.PyMPI_reduce_p2p
TypeError: 'float' object is not iterable
Traceback (most recent call last):
File "matrixmultMPI.py", line 48, in <module>
totaltime = comm.reduce(duration,op = sum, root = 0)
File "mpi4py/MPI/Comm.pyx", line 1613, in mpi4py.MPI.Comm.reduce
File "mpi4py/MPI/msgpickle.pxi", line 1322, in mpi4py.MPI.PyMPI_reduce
File "mpi4py/MPI/msgpickle.pxi", line 1254, in mpi4py.MPI.PyMPI_reduce_intra
File "mpi4py/MPI/msgpickle.pxi", line 1126, in mpi4py.MPI.PyMPI_reduce_p2p
TypeError: 'float' object is not iterable
Traceback (most recent call last):
File "matrixmultMPI.py", line 48, in <module>
totaltime = comm.reduce(duration,op = sum, root = 0)
File "mpi4py/MPI/Comm.pyx", line 1613, in mpi4py.MPI.Comm.reduce
File "mpi4py/MPI/msgpickle.pxi", line 1322, in mpi4py.MPI.PyMPI_reduce
File "mpi4py/MPI/msgpickle.pxi", line 1254, in mpi4py.MPI.PyMPI_reduce_intra
File "mpi4py/MPI/msgpickle.pxi", line 1126, in mpi4py.MPI.PyMPI_reduce_p2p
TypeError: 'float' object is not iterable
What's wrong with it, and how do I fix it?
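For reference, comm.reduce expects either an MPI reduction operation or a function of two arguments; Python's built-in sum takes an iterable, which is why every rank fails with "'float' object is not iterable". Below is a minimal timing sketch using MPI.SUM (the work being timed is left as a placeholder):

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

time1 = MPI.Wtime()
# ... the work being timed goes here ...
time2 = MPI.Wtime()
duration = time2 - time1

# MPI.SUM adds the per-rank durations; the result is only defined
# (non-None) on the root rank.
totaltime = comm.reduce(duration, op=MPI.SUM, root=0)

print("Runtime at %d is %f" % (rank, duration))
if rank == 0:
    print("Sum of per-rank runtimes: %f" % totaltime)

A two-argument Python function such as op=lambda x, y: x + y would also work with the pickle-based reduce.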

How do I use Gather and Scatter functions in MPI using python mpi4py library for matrix multiplication

I was writing code for matrix multiplication in Python using the mpi4py library, and I am stuck on the part where the Gather and Scatter functions are used; there is not much documentation on how to use them. The only websites I was able to find were this one, which has the official documentation of the library and is very new, and this page.
The code I have written so far is
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.rank
size = comm.size

a = [[12, 7, 3],
     [4,  5, 6],
     [7,  8, 9]]
b = [[5, 8, 1],
     [6, 7, 3],
     [4, 5, 9]]
res = [[0, 0, 0],
       [0, 0, 0],
       [0, 0, 0]]
aa = [0, 0, 0]
cc = [0, 0, 0]
sum = 0

comm.Scatter(a, aa, root=0)
for r in range(size):
    if rank == r:
        print("[%d] %s" % (rank, aa))  # this was just to check whether the scattered values are being sent or not

comm.Bcast(b, root=0)
for i in range(len(a)):
    for j in range(len(b[0])):
        for k in range(len(b)):
            sum = sum + aa[j] * b[j][i]
    cc[i] = sum  # main calculation

comm.Gather(cc, res, root=0)  # here it should gather all values from cc to res matrix
if rank == 0:
    for r in res:
        print(r)  # final printing of the res matrix
The error it's showing is
Traceback (most recent call last):
File "matrixmultMPI.py", line 22, in <module>
comm.Scatter(a, aa, root = 0)
File "mpi4py/MPI/Comm.pyx", line 740, in mpi4py.MPI.Comm.Scatter
File "mpi4py/MPI/msgbuffer.pxi", line 606, in mpi4py.MPI._p_msg_cco.for_scatter
File "mpi4py/MPI/msgbuffer.pxi", line 511, in mpi4py.MPI._p_msg_cco.for_cco_recv
File "mpi4py/MPI/msgbuffer.pxi", line 203, in mpi4py.MPI.message_simple
File "mpi4py/MPI/msgbuffer.pxi", line 138, in mpi4py.MPI.message_basic
File "mpi4py/MPI/asbuffer.pxi", line 365, in mpi4py.MPI.getbuffer
File "mpi4py/MPI/asbuffer.pxi", line 148, in mpi4py.MPI.PyMPI_GetBuffer
File "mpi4py/MPI/asbuffer.pxi", line 140, in mpi4py.MPI.PyMPI_GetBuffer
TypeError: a bytes-like object is required, not 'int'
Traceback (most recent call last):
File "matrixmultMPI.py", line 22, in <module>
comm.Scatter(a, aa, root = 0)
File "mpi4py/MPI/Comm.pyx", line 740, in mpi4py.MPI.Comm.Scatter
File "mpi4py/MPI/msgbuffer.pxi", line 606, in mpi4py.MPI._p_msg_cco.for_scatter
File "mpi4py/MPI/msgbuffer.pxi", line 511, in mpi4py.MPI._p_msg_cco.for_cco_recv
File "mpi4py/MPI/msgbuffer.pxi", line 203, in mpi4py.MPI.message_simple
File "mpi4py/MPI/msgbuffer.pxi", line 138, in mpi4py.MPI.message_basic
File "mpi4py/MPI/asbuffer.pxi", line 365, in mpi4py.MPI.getbuffer
File "mpi4py/MPI/asbuffer.pxi", line 148, in mpi4py.MPI.PyMPI_GetBuffer
File "mpi4py/MPI/asbuffer.pxi", line 140, in mpi4py.MPI.PyMPI_GetBuffer
TypeError: a bytes-like object is required, not 'int'
Traceback (most recent call last):
File "matrixmultMPI.py", line 22, in <module>
comm.Scatter(a, aa, root = 0)
File "mpi4py/MPI/Comm.pyx", line 740, in mpi4py.MPI.Comm.Scatter
File "mpi4py/MPI/msgbuffer.pxi", line 606, in mpi4py.MPI._p_msg_cco.for_scatter
File "mpi4py/MPI/msgbuffer.pxi", line 511, in mpi4py.MPI._p_msg_cco.for_cco_recv
File "mpi4py/MPI/msgbuffer.pxi", line 203, in mpi4py.MPI.message_simple
File "mpi4py/MPI/msgbuffer.pxi", line 138, in mpi4py.MPI.message_basic
File "mpi4py/MPI/asbuffer.pxi", line 365, in mpi4py.MPI.getbuffer
File "mpi4py/MPI/asbuffer.pxi", line 148, in mpi4py.MPI.PyMPI_GetBuffer
File "mpi4py/MPI/asbuffer.pxi", line 140, in mpi4py.MPI.PyMPI_GetBuffer
TypeError: a bytes-like object is required, not 'int'
Traceback (most recent call last):
File "matrixmultMPI.py", line 22, in <module>
comm.Scatter(a, aa, root = 0)
File "mpi4py/MPI/Comm.pyx", line 740, in mpi4py.MPI.Comm.Scatter
File "mpi4py/MPI/msgbuffer.pxi", line 597, in mpi4py.MPI._p_msg_cco.for_scatter
File "mpi4py/MPI/msgbuffer.pxi", line 495, in mpi4py.MPI._p_msg_cco.for_cco_send
File "mpi4py/MPI/msgbuffer.pxi", line 186, in mpi4py.MPI.message_simple
ValueError: too many values to unpack (expected 2)
Any help would be appreciated.
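The uppercase Scatter, Gather, and Bcast methods operate on buffer-like objects such as contiguous NumPy arrays, which is why passing nested Python lists raises "a bytes-like object is required". Below is a minimal sketch (not the poster's exact code) using the lowercase, pickle-based scatter/bcast/gather, which accept plain Python or NumPy objects:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.rank
size = comm.size

if rank == 0:
    a = np.arange(12, dtype=np.float64).reshape(4, 3)  # example left-hand matrix
    b = np.ones((3, 3), dtype=np.float64)              # example right-hand matrix
    chunks = np.array_split(a, size, axis=0)           # one block of rows per rank
else:
    b = None
    chunks = None

b = comm.bcast(b, root=0)                # every rank needs the full right-hand matrix
my_rows = comm.scatter(chunks, root=0)   # each rank receives its block of rows
partial = np.dot(my_rows, b)             # local partial product
gathered = comm.gather(partial, root=0)  # collect the blocks back on the root

if rank == 0:
    result = np.vstack(gathered)
    print(result)

With the uppercase variants you would instead allocate NumPy send and receive buffers of matching dtype and shape on every rank.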

Failed to start the ocaml-jupyter kernel in Jupyter Notebook on Windows 10

I have installed 64-bit OCaml and the Cygwin64 terminal on Windows 10. I completely set up the kernel by following these steps: https://github.com/akabe/ocaml-jupyter. But I get a kernel error in the top right corner of the Jupyter notebook.
I am using OCaml kernel version 4.12.0+mingw64. Can anyone help me fix this error?
Traceback (most recent call last):
File "c:\users\honey\appdata\local\programs\python\python39\lib\site-packages\tornado\web.py", line 1704, in _execute
result = await result
File "c:\users\honey\appdata\local\programs\python\python39\lib\site-packages\tornado\gen.py", line 769, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "c:\users\honey\appdata\local\programs\python\python39\lib\site-packages\notebook\services\sessions\handlers.py", line 69, in post
model = yield maybe_future(
File "c:\users\honey\appdata\local\programs\python\python39\lib\site-packages\tornado\gen.py", line 762, in run
value = future.result()
File "c:\users\honey\appdata\local\programs\python\python39\lib\site-packages\tornado\gen.py", line 769, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "c:\users\honey\appdata\local\programs\python\python39\lib\site-packages\notebook\services\sessions\sessionmanager.py", line 98, in create_session
kernel_id = yield self.start_kernel_for_session(session_id, path, name, type, kernel_name)
File "c:\users\honey\appdata\local\programs\python\python39\lib\site-packages\tornado\gen.py", line 762, in run
value = future.result()
File "c:\users\honey\appdata\local\programs\python\python39\lib\site-packages\tornado\gen.py", line 769, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "c:\users\honey\appdata\local\programs\python\python39\lib\site-packages\notebook\services\sessions\sessionmanager.py", line 110, in start_kernel_for_session
kernel_id = yield maybe_future(
File "c:\users\honey\appdata\local\programs\python\python39\lib\site-packages\tornado\gen.py", line 762, in run
value = future.result()
File "c:\users\honey\appdata\local\programs\python\python39\lib\site-packages\notebook\services\kernels\kernelmanager.py", line 176, in start_kernel
kernel_id = await maybe_future(self.pinned_superclass.start_kernel(self, **kwargs))
File "c:\users\honey\appdata\local\programs\python\python39\lib\site-packages\jupyter_client\utils.py", line 25, in wrapped
return loop.run_until_complete(coro(*args, **kwargs))
File "c:\users\honey\appdata\local\programs\python\python39\lib\site-packages\nest_asyncio.py", line 70, in run_until_complete
return f.result()
File "c:\users\honey\appdata\local\programs\python\python39\lib\asyncio\futures.py", line 201, in result
raise self._exception
File "c:\users\honey\appdata\local\programs\python\python39\lib\asyncio\tasks.py", line 256, in __step
result = coro.send(None)
File "c:\users\honey\appdata\local\programs\python\python39\lib\site-packages\jupyter_client\multikernelmanager.py", line 186, in _async_start_kernel
self._add_kernel_when_ready(kernel_id, km, ensure_async(km.start_kernel(**kwargs)))
File "c:\users\honey\appdata\local\programs\python\python39\lib\site-packages\jupyter_client\utils.py", line 25, in wrapped
return loop.run_until_complete(coro(*args, **kwargs))
File "c:\users\honey\appdata\local\programs\python\python39\lib\site-packages\nest_asyncio.py", line 70, in run_until_complete
return f.result()
File "c:\users\honey\appdata\local\programs\python\python39\lib\asyncio\futures.py", line 201, in result
raise self._exception
File "c:\users\honey\appdata\local\programs\python\python39\lib\asyncio\tasks.py", line 256, in __step
result = coro.send(None)
File "c:\users\honey\appdata\local\programs\python\python39\lib\site-packages\jupyter_client\manager.py", line 335, in _async_start_kernel
await ensure_async(self._launch_kernel(kernel_cmd, **kw))
File "c:\users\honey\appdata\local\programs\python\python39\lib\site-packages\jupyter_client\utils.py", line 25, in wrapped
return loop.run_until_complete(coro(*args, **kwargs))
File "c:\users\honey\appdata\local\programs\python\python39\lib\site-packages\nest_asyncio.py", line 70, in run_until_complete
return f.result()
File "c:\users\honey\appdata\local\programs\python\python39\lib\asyncio\futures.py", line 201, in result
raise self._exception
File "c:\users\honey\appdata\local\programs\python\python39\lib\asyncio\tasks.py", line 256, in __step
result = coro.send(None)
File "c:\users\honey\appdata\local\programs\python\python39\lib\site-packages\jupyter_client\manager.py", line 257, in _async_launch_kernel
connection_info = await self.provisioner.launch_kernel(kernel_cmd, **kw)
File "c:\users\honey\appdata\local\programs\python\python39\lib\site-packages\jupyter_client\provisioning\local_provisioner.py", line 179, in launch_kernel
self.process = launch_kernel(cmd, **scrubbed_kwargs)
File "c:\users\honey\appdata\local\programs\python\python39\lib\site-packages\jupyter_client\launcher.py", line 169, in launch_kernel
raise ex
File "c:\users\honey\appdata\local\programs\python\python39\lib\site-packages\jupyter_client\launcher.py", line 157, in launch_kernel
proc = Popen(cmd, **kwargs)
File "c:\users\honey\appdata\local\programs\python\python39\lib\subprocess.py", line 947, in init
self._execute_child(args, executable, preexec_fn, close_fds,
File "c:\users\honey\appdata\local\programs\python\python39\lib\subprocess.py", line 1416, in _execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified
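The final FileNotFoundError means Windows could not start the command recorded in the kernel's kernel.json. As a small diagnostic sketch (jupyter_client is installed, as the traceback shows; the kernel name is an assumption), you can print the launch command of every registered kernelspec and check that its first element exists and is on PATH:

from jupyter_client.kernelspec import KernelSpecManager

# List every registered kernel and the command Jupyter will try to launch.
ksm = KernelSpecManager()
for name, info in ksm.get_all_specs().items():
    print(name, "->", info["spec"]["argv"])

# If the ocaml-jupyter entry points at a Cygwin-style path (e.g. /usr/bin/...)
# or at a binary that is not on the Windows PATH, CreateProcess fails with
# FileNotFoundError [WinError 2].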

How to convert dict to RDD in PySpark

I am learning the Word2Vec Model to process my data.
I am using Spark 1.6.0.
I will use the example from the official documentation to explain my problem:
from pyspark.mllib.feature import Word2Vec
sentence = "a b " * 100 + "a c " * 10
localDoc = [sentence, sentence]
doc = sc.parallelize(localDoc).map(lambda line: line.split(" "))
model = Word2Vec().setVectorSize(10).setSeed(42).fit(doc)
The vectors are as follows:
>>> model.getVectors()
{'a': [0.26699373, -0.26908076, 0.0579859, -0.080141746, 0.18208595, 0.4162335, 0.0258975, -0.2162928, 0.17868409, 0.07642203], 'b': [-0.29602322, -0.67824656, -0.9063686, -0.49016926, 0.14347662, -0.23329848, -0.44695938, -0.69160634, 0.7037, 0.28236762], 'c': [-0.08954003, 0.24668643, 0.16183868, 0.10982372, -0.099240996, -0.1358507, 0.09996107, 0.30981666, -0.2477713, -0.063234895]}
I use getVectors() to get the map of word representations. How do I convert it into an RDD, so I can pass it to a KMeans model?
EDIT:
I did what @user9590153 said.
>>> v = sc.parallelize(model.getVectors()).values()
# the above code is successful.
>>> v.collect()
The PySpark shell shows another problem:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "D:\spark-1.6.3-bin-hadoop2.6\python\pyspark\rdd.py", line 771, in collect
port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
File "D:\spark-1.6.3-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\java_gateway.py", line 813, in __call__
File "D:\spark-1.6.3-bin-hadoop2.6\python\pyspark\sql\utils.py", line 45, in deco
return f(*a, **kw)
File "D:\spark-1.6.3-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 8.0 failed 1 times, most recent failure: Lost task 3.0 in stage 8.0 (TID 29, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "D:\spark-1.6.3-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\worker.py", line 111, in main
File "D:\spark-1.6.3-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\worker.py", line 106, in process
File "D:\spark-1.6.3-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\serializers.py", line 263, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "D:\spark-1.6.3-bin-hadoop2.6\python\pyspark\rdd.py", line 1540, in <lambda>
return self.map(lambda x: x[1])
IndexError: string index out of range
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Just parallelize:
sc.parallelize(model.getVectors()).values()
Parallelized collections will help you here.
data = [1, 2, 3, 4, 5]  # data here is the collection
distData = sc.parallelize(data)  # converted into an RDD
For your case:
sc.parallelize(model.getVectors()).values()
For your doubt:
The action collect() is the simplest and most common operation; it returns the entire content of an RDD to the driver program.
A typical application of collect() is unit testing, where the entire RDD is expected to fit in memory; that makes it easy to compare the result of the RDD with the expected result.
collect() has the constraint that all the data must fit on one machine, since it is copied to the driver.
So you cannot always perform collect() on an RDD.
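As an additional sketch (not part of the answer above): sc.parallelize(dict) iterates over the dictionary's keys, so calling .values() on the resulting RDD of one-character strings ends up taking x[1] of each key, which is what raises the IndexError. Parallelizing the (word, vector) pairs explicitly avoids that; wrapping the vectors for MLlib's KMeans afterwards is shown as an assumption of what comes next:

from pyspark.mllib.clustering import KMeans
from pyspark.mllib.linalg import Vectors

vectors_dict = model.getVectors()

# Build an RDD of (word, vector) pairs instead of an RDD of bare keys.
pairs = sc.parallelize(list(vectors_dict.items()))

# Keep only the vectors and wrap them so MLlib's KMeans accepts them.
vecs = pairs.map(lambda kv: Vectors.dense(kv[1]))

clusters = KMeans.train(vecs, k=2, maxIterations=10)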
