How to convert dict to RDD in PySpark

I am learning the Word2Vec model to process my data.
I am using Spark 1.6.0.
I'll use the example from the official documentation to explain my problem:
from pyspark.mllib.feature import Word2Vec
sentence = "a b " * 100 + "a c " * 10
localDoc = [sentence, sentence]
doc = sc.parallelize(localDoc).map(lambda line: line.split(" "))
model = Word2Vec().setVectorSize(10).setSeed(42).fit(doc)
The vectors are as follows:
>>> model.getVectors()
{'a': [0.26699373, -0.26908076, 0.0579859, -0.080141746, 0.18208595, 0.4162335, 0.0258975, -0.2162928, 0.17868409, 0.07642203], 'b': [-0.29602322, -0.67824656, -0.9063686, -0.49016926, 0.14347662, -0.23329848, -0.44695938, -0.69160634, 0.7037, 0.28236762], 'c': [-0.08954003, 0.24668643, 0.16183868, 0.10982372, -0.099240996, -0.1358507, 0.09996107, 0.30981666, -0.2477713, -0.063234895]}
I use getVectors() to get a map of the word representations. How can I convert it into an RDD, so I can pass it to the KMeans model?
EDIT:
I did what @user9590153 said.
>>> v = sc.parallelize(model.getVectors()).values()
# the above code is successful.
>>> v.collect()
The PySpark shell shows another problem:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "D:\spark-1.6.3-bin-hadoop2.6\python\pyspark\rdd.py", line 771, in collect
port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
File "D:\spark-1.6.3-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\java_gateway.py", line 813, in __call__
File "D:\spark-1.6.3-bin-hadoop2.6\python\pyspark\sql\utils.py", line 45, in deco
return f(*a, **kw)
File "D:\spark-1.6.3-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 8.0 failed 1 times, most recent failure: Lost task 3.0 in stage 8.0 (TID 29, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "D:\spark-1.6.3-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\worker.py", line 111, in main
File "D:\spark-1.6.3-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\worker.py", line 106, in process
File "D:\spark-1.6.3-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\serializers.py", line 263, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "D:\spark-1.6.3-bin-hadoop2.6\python\pyspark\rdd.py", line 1540, in <lambda>
return self.map(lambda x: x[1])
IndexError: string index out of range
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Just parallelize the dict's items:
sc.parallelize(model.getVectors().items()).values()

Parallelized collections will help you here.
data = [1, 2, 3, 4, 5]  # data here is the collection
distData = sc.parallelize(data)  # converted into an RDD
For your case:
sc.parallelize(model.getVectors().items()).values()
As for your doubt about collect():
collect() is the most common and simplest action; it returns the entire contents of the RDD to the driver program.
A typical application of collect() is unit testing, where the entire RDD is expected to fit in memory; that makes it easy to compare the RDD's result with the expected result.
collect() has the constraint that all the data must fit on a single machine, since it is copied to the driver.
So you cannot perform collect() on an RDD that is larger than the driver's memory.
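For reference, a minimal end-to-end sketch under the Spark 1.6 MLlib API used above (k=2 and maxIterations=10 are arbitrary illustration values). The IndexError in the edit comes from parallelizing the dict itself: iterating a Python dict yields only its keys, and values() maps each element x to x[1], which fails on the one-character word strings. Parallelizing the (word, vector) pairs avoids this:
from numpy import array
from pyspark.mllib.clustering import KMeans

# parallelize (word, vector) pairs, then keep only the vectors;
# parallelizing the dict directly would yield only its keys
vectors = sc.parallelize(model.getVectors().items()).values()
vectors = vectors.map(lambda v: array(v))  # plain numeric arrays for KMeans

# cluster the word vectors
clusters = KMeans.train(vectors, k=2, maxIterations=10)
print(clusters.clusterCenters)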

Related

Using writeOGR with rpy2

I am trying to call writeOGR in R from Python 3.8 via rpy2.
import rpy2.robjects as ro
.....
ro.r('ttops <- .....')
ro.r('writeOGR(obj=ttops, dsn="T:/Internal/segmentation", layer="test", driver="ESRI Shapefile")')
errors with:
R[write to console]: Error in writeOGR(obj = ttops, dsn = "T:/Internal/LiDAR/crown_segmentation", :
could not find function "writeOGR"
Traceback (most recent call last):
File "C:/Users/david/PycharmProjects/main.py", line 7, in <module>
main()
File "C:/Users/david/PycharmProjects/main.py", line 4, in main
R_Packages().process()
File "C:\Users\david\PycharmProjects\model_testing\r_methods.py", line 17, in process
ro.r('writeOGR(obj=ttops, dsn="T:/Internal/segmentation", layer="test", driver="ESRI Shapefile")')
File "C:\Users\david\AppData\Local\Programs\Python\Python38\lib\site-packages\rpy2\robjects\__init__.py", line 416, in __call__
res = self.eval(p)
File "C:\Users\david\AppData\Local\Programs\Python\Python38\lib\site-packages\rpy2\robjects\functions.py", line 197, in __call__
return (super(SignatureTranslatedFunction, self)
File "C:\Users\david\AppData\Local\Programs\Python\Python38\lib\site-packages\rpy2\robjects\functions.py", line 125, in __call__
res = super(Function, self).__call__(*new_args, **new_kwargs)
File "C:\Users\david\AppData\Local\Programs\Python\Python38\lib\site-packages\rpy2\rinterface_lib\conversion.py", line 44, in _
cdata = function(*args, **kwargs)
File "C:\Users\david\AppData\Local\Programs\Python\Python38\lib\site-packages\rpy2\rinterface.py", line 624, in __call__
raise embedded.RRuntimeError(_rinterface._geterrmessage())
rpy2.rinterface_lib.embedded.RRuntimeError: Error in writeOGR(obj = ttops, dsn = "T:/Internal/segmentation", :
could not find function "writeOGR"
Am I missing something or is this a limit of rpy2? If it is a limit, what is an alternative to write shapefiles of R data using Python?
There was a library that I did not need to load explicitly in R, but that was needed when calling from Python:
ro.r('library(rgdal)')
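A minimal sketch of the fix in context (the ttops assignment stays elided as in the question):
import rpy2.robjects as ro

# load rgdal explicitly; the embedded R session started by rpy2 does not
# have packages attached that an interactive R session might already have
ro.r('library(rgdal)')
ro.r('ttops <- .....')
ro.r('writeOGR(obj=ttops, dsn="T:/Internal/segmentation", layer="test", driver="ESRI Shapefile")')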

Rserve: pyRserve not able to call basic R functions

I'm calling Rserve from Python and it works for basic operations, but not if I call basic functions such as min:
import pyRserve
conn = pyRserve.connect()
cars = [1, 2, 3]
conn.r.x = cars
print(conn.eval('x'))
print(conn.eval('min(x)'))
The result is:
[1, 2, 3]
Traceback (most recent call last):
File "test3.py", line 9, in <module>
print(conn.eval('min(x)'))
File "C:\Users\acastro\.windows-build-tools\python27\lib\site-packages\pyRserve\rconn.py", line 78, in decoCheckIfClosed
return func(self, *args, **kw)
File "C:\Users\acastro\.windows-build-tools\python27\lib\site-packages\pyRserve\rconn.py", line 191, in eval
raise REvalError(errorMsg)
pyRserve.rexceptions.REvalError: Error in min(x) : invalid 'type' (list) of argument
Do you know where the problem is?
Thanks
You should try min(unlist(x)).
If the list is simple, you may just try as.data.frame(x).
For more complicated lists, Stack Overflow has many other answers.
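A minimal sketch of the first fix, assuming a local Rserve instance is running as in the question:
import pyRserve

conn = pyRserve.connect()
conn.r.x = [1, 2, 3]  # a Python list arrives in R as a list, not an atomic vector

# unlist() flattens the list into an atomic numeric vector that min() accepts
print(conn.eval('min(unlist(x))'))  # -> 1.0
conn.close()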

Setting an environment variable to a numeric value leads to an error in Python

Trying to set up environment variables in the following code:
import os
dicta = {}
def setv(evar, evalue):
    os.environ[evar] = evalue
    dicta.setdefault('UENV', {}).update({evar: evalue})
# Set environment variables
setv('API_USER', 'username')
setv('API_PASSWORD', 'secret')
setv('NUMBER', 1)
On the last statement, where the NUMBER variable is set to the numeric value 1, I get the following error:
Traceback (most recent call last):
File "./pyenv.py", line 19, in <module>
setv('NUMBER', 1)
File "./pyenv.py", line 13, in setv
os.environ[evar] = evalue
File "/home/python/3.6.3/1/el-6-x86_64/lib/python3.6/os.py", line 674, in __setitem__
value = self.encodevalue(value)
File "/home/python/3.6.3/1/el-6-x86_64/lib/python3.6/os.py", line 744, in encode
raise TypeError("str expected, not %s" % type(value).__name__)
TypeError: str expected, not int
I don't want to convert the value to str; I want to keep it as an int. Any thoughts on how to keep NUMBER as the numeric value 1 without seeing this error message?
Environment variables are string values. Casting them back into integers after you read them from the environment is the way to go.
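A minimal sketch of that pattern (the NUMBER round-trip is an illustration, not the question's original code):
import os

# environment variables can only store strings, so store the string form...
os.environ['NUMBER'] = str(1)

# ...and cast back to int at the point of use
number = int(os.environ['NUMBER'])
print(number + 1)  # -> 2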

Error when running Python code from R with the rPithon package

I would like to run this Python code from R:
>>> import nlmpy
>>> nlm = nlmpy.mpd(nRow=50, nCol=50, h=0.75)
>>> nlmpy.exportASCIIGrid("raster.asc", nlm)
nlmpy is a Python package for building neutral landscape models. The example comes from the website.
To run this Python code from R, I'm trying to use the rPithon package with the code below. However, I obtain this error message:
if (pithon.available())
{
nRow <- 50
nCol <- 50
h <- 0.75
# this file contains the definition of function concat
pithon.load("C:/Users/Anaconda2/Lib/site-packages/nlmpy/nlmpy.py")
pithon.call( "mpd", nRow, nCol, h)
} else {
print("Unable to execute python")
}
Error in pithon.get("_r_call_return", instance.name = instname) :
Couldn't retrieve variable: Traceback (most recent call last):
File "C:/Users/Documents/R/win-library/3.3/rPithon/pythonwrapperscript.py", line 110, in <module>
reallyReallyLongAndUnnecessaryPrefix.data = json.dumps([eval(reallyReallyLongAndUnnecessaryPrefix.argData)])
File "C:\Users\ANACON~1\lib\json\__init__.py", line 244, in dumps
return _default_encoder.encode(obj)
File "C:\Users\ANACON~1\lib\json\encoder.py", line 207, in encode
chunks = self.iterencode(o, _one_shot=True)
File "C:\Users\ANACON~1\lib\json\encoder.py", line 270, in iterencode
return _iterencode(o, 0)
File "C:\Users\ANACON~1\lib\json\encoder.py", line 184, in default
raise TypeError(repr(o) + " is not JSON serializable")
TypeError: array([[ 0.36534654, 0.31962481, 0.44229946, ..., 0.11513079,
0.07156331, 0.00286971], [ 0.41534291, 0.41333479, 0.48118995, ..., 0.19203674,
0.04192771, 0.03679473], [ 0.5188
Is this error caused by a syntax issue in my code? I work with the Anaconda 4.2.0 platform for Windows, which uses Python 2.7.
I haven't used the nlmpy package, hence I am not sure what your expected output would be. However, this code successfully communicates between R and Python.
There are two files,
nlmpyInR.R
command = "python"
path2script = "path_to_your_pythoncode/nlmpyInPython.py"
nRow <- 50
nCol <- 50
h <- 0.75
# Build up args in a vector
args = c(nRow, nCol, h)
# Add path to script as first arg
allArgs = c(path2script, args)
Routput = system2(command, args=allArgs, stdout=TRUE)
# The command would be: python nlmpyInPython.py 50 50 0.75
print(paste("The Output is:\n", Routput))
nlmpyInPython.py
import sys
import nlmpy

# Get the arguments from the command-line call; sys.argv values are
# strings, so cast them to the numeric types mpd() expects
nRow = int(sys.argv[1])
nCol = int(sys.argv[2])
h = float(sys.argv[3])
nlm = nlmpy.mpd(nRow, nCol, h)
pythonOutput = nlmpy.exportASCIIGrid("raster.asc", nlm)
# Whatever you print will get stored in R's output variable
print pythonOutput
The cause of the error that you're getting is hinted at by the "is not JSON serializable" line. Your R code calls the mpd function with certain arguments, and that function itself will execute correctly. The rPithon library will then try to send the return value of the function back to R, and to do this it will try to create a JSON object that describes the return value. This works well for integers, floating point values, arrays, etc., but not every kind of Python object can be converted to such a JSON representation. And because rPithon can't convert the return value of mpd this way, an error is generated.
You can still use rPithon to call the mpd function, though. The following code creates a new Python function that performs two steps: first it calls the mpd function with the specified parameters, and then it exports the result to a file, the filename of which is also an argument. Using rPithon, the new function is then called from R. Because myFunction doesn't return anything, representing the return value in JSON format will not be a problem.
library("rPithon")
pythonCode = paste("import nlmpy.nlmpy as nlmpy",
"",
"def myFunction(nRow, nCol, h, fileName):",
" nlm = nlmpy.mpd(nRow, nCol, h)",
" nlmpy.exportASCIIGrid(fileName, nlm)",
sep = "\n")
pithon.exec(pythonCode)
nRow <- 50
nCol <- 50
h <- 0.75
pithon.call("myFunction", nRow, nCol, h, "outputraster.asc")
Here, the Python code is defined as an R string and executed using pithon.exec. You could also put the Python code in a separate file and use pithon.load to process it, so that the myFunction function is known.

Tuple index out of range Tkinter

So I've got a program that should take a function as input and graph it on a Tkinter canvas.
def draw(self):
    self.canvas.delete(ALL)
    for n, i in enumerate(self.sav):
        self.function, colour = self.sav_func[n]
        i = self.p1(i)
        i = self.p2(i, self.function, colour)
        if i != [0]:
            try:
                self.canvas.create_line(i, fill = colour)
            except TclError as err:
                tkMessageBox.showerror(TclError, err)
                self.sav.remove(self.sav[len(self.sav)-1])
                self.sav_func.remove(self.sav_func[len(self.sav_func)-1])
This section is giving me the following error:
Exception in Tkinter callback
Traceback (most recent call last):
File "C:\Python27\lib\lib-tk\Tkinter.py", line 1410, in __call__
return self.func(*args)
File "D:/Google Drive/assign2_2-1.py", line 113, in add_func
self.redraw_all()
File "D:/Google Drive/assign2_2-1.py", line 132, in redraw_all
self.draw()
File "D:/Google Drive/assign2_2-1.py", line 145, in draw
self.canvas.create_line(i, fill = colour)
File "C:\Python27\lib\lib-tk\Tkinter.py", line 2201, in create_line
return self._create('line', args, kw)
File "C:\Python27\lib\lib-tk\Tkinter.py", line 2182, in _create
cnf = args[-1]
IndexError: tuple index out of range
From what I can gather it's something to do with the number of inputs not matching the number of outputs, but I'm still a little lost. Help would be great!
It looks like i doesn't have enough values. To create a line, it needs at least four values: x1, y1, x2, y2.
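A minimal sketch of that constraint, using Python 2's Tkinter to match the traceback (the coordinates are arbitrary):
import Tkinter as tk

root = tk.Tk()
canvas = tk.Canvas(root)
canvas.pack()

# create_line needs at least two points, i.e. four coordinates
canvas.create_line([0, 0, 50, 80], fill="red")  # works
# canvas.create_line([], fill="red")  # no coordinates -> IndexError, as in the traceback

root.mainloop()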
