Sequencing in Dataframe pyspark - rdd

I have a data source and need to add an additional sequence column while loading it into the target.
I'm converting it into an RDD and using the zipWithIndex() function.
However, I'm getting:
An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.collectAndServe
Is there a specific package that I'm missing on Glue?
Note: I cannot sort the data.
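For context, zipWithIndex() numbers rows by partition order and then by position inside each partition, so it does not need a sort. When the RDD has more than one partition, it first has to count the rows of each partition, which runs a Spark job and may be where the error above surfaces. The following is a plain-Python sketch of that numbering scheme only, not Spark's implementation; the function name zip_with_index and the sample partitions are hypothetical.

```python
from itertools import accumulate


def zip_with_index(partitions):
    """Pair every element with a global index, mimicking RDD.zipWithIndex() ordering.

    `partitions` is a hypothetical stand-in for an RDD: a list of lists.
    """
    # Step 1: count the rows in each partition (in Spark this step is a job).
    counts = [len(part) for part in partitions]
    # Step 2: each partition starts at the running total of the earlier counts.
    offsets = [0] + list(accumulate(counts))[:-1]
    # Step 3: global index = partition offset + position inside the partition.
    return [
        [(row, offset + i) for i, row in enumerate(part)]
        for part, offset in zip(partitions, offsets)
    ]


# Hypothetical sample data: three partitions of unequal size.
partitions = [["a", "b"], ["c"], ["d", "e", "f"]]
indexed = zip_with_index(partitions)
```

The indices start at 0, so adding 1 to each gives a 1-based sequence column.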

Related

How do you use sdf_checkpoint to break spark table lineage in sparklyr?

I'm attempting to manipulate a Spark RDD via sparklyr with a dplyr mutate command to construct a large number of variables, and each time this seems to fail with an error message regarding Java memory exceeding 64 bits.
The mutate command is coded correctly and runs properly when executed separately to construct different dataframes for small subsets of the variables, but not when all of the variables are included in the same mutate command (or several mutate statements piped together in the same chain to construct a single dataset).
I suspect that this may be an issue with needing to 'break the computation graph' using sparklyr::sdf_checkpoint(eager = TRUE), but I am unclear precisely how to do this as the documentation for this function is lacking:
When I run this command, is the appropriate sequence df <- df %>% compute('df') %>% sdf_checkpoint(eager = TRUE)?
When I set sparklyr::spark_set_checkpoint_dir() to an HDFS location, I can see that 'stuff' appears there after running the command. However, when I then reference df after checkpointing, the mutate command still breaks with the same error message. When I reference df, am I actually referencing the version that has had its lineage broken, or do I need to do something to 'load' the checkpoint back in? If so, could somebody provide the steps to load this dataset back into memory without the lineage?

Storing geometry data in r packages

I read somewhere that creating custom functions and building them into packages is the best way to avoid repeating code and maintaining several scripts, so I'm giving it a try.
devtools::install_github("https://github.com/albersonmiranda/desigualdade")
Here I'm trying to set up functions to shortcut some map plots.
REPRODUCING
In the data-raw folder of the GitHub repo, which I removed from .gitignore only for this thread, there's a script named data.R. Running it will give you an object named censo_des. It's a tibble with Brazil's 2010 census and municipalities' geometry data.
I can then load the functions in the /R folder and run RM("RJ", n.nomes = 2) to plot Rio de Janeiro's map, with geom_label() showing the names of the top 2 municipalities with the highest and lowest per capita income for black and white people.
QUESTION
Running usethis::use_data(censo_des, overwrite = TRUE) to create the .rda file and store it in the /data folder seems to change the structure of censo_des, and it's no longer a tibble. When I install the package and try to run censo_des, I get the error message Error: Input must be a vector, not a sfc_GEOMETRY/sfc object. and if I try RM("RJ") I get Error: Can't slice a scalar.
I guess my issue is how to properly store sfc_GEOMETRY data in packages. Any thoughts?
It turns out you need to import the sf package to handle geometry data; the object isn't a regular tibble or data frame. Once I imported it, everything worked fine.

udpipe_accuracy() always gives the same error "The CoNLL-U line '....' does not contain 10 columns!"

This is regarding the R package udpipe for NLP. I am using it to tokenize, tag, lemmatize and perform dependency parsing on text files.
I am not sure what format the CoNLL-U file needs to have for the function udpipe_accuracy.
I loaded a CSV file with 10 columns, but the error persists.
I could not find any questions on SO about this package, and there is no udpipe tag either.
udpipe_accuracy is used in combination with udpipe_train.
If you trained a custom udpipe model with udpipe_train based on data in conllu format, you can see how good it is by using udpipe_accuracy on hold-out conllu data which was not used to build the model.
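The "10 columns" in the error message refers to the CoNLL-U format itself, in which each token line has ten tab-separated fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), so a comma-separated file with ten columns does not qualify. As a rough illustration in Python (this is not part of udpipe; the function name and the sample lines are hypothetical), a check along these lines shows the difference:

```python
# The ten fields of a CoNLL-U token line, in order.
CONLLU_FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
                 "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def is_conllu_token_line(line):
    """Return True if `line` has exactly ten tab-separated, non-empty fields."""
    # Blank lines separate sentences and '#' lines are comments: not token lines.
    if not line.strip() or line.startswith("#"):
        return False
    fields = line.rstrip("\n").split("\t")
    return len(fields) == len(CONLLU_FIELDS) and all(fields)


# Hypothetical sample lines: one tab-separated, one comma-separated.
good = "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_"
bad = "1,Dogs,dog,NOUN,NNS,Number=Plur,2,nsubj,_,_"
```

Here is_conllu_token_line(good) is True, while is_conllu_token_line(bad) is False because the comma-separated line contains no tabs at all.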

Using functions from a package in R

I am using the R package cat.
In the help file, an example for the function data uses the dataset head:
data(head)
I have my own dataset A, and I want to try this function data within the package cat on it.
When I try
library(cat)
A = read.table("C:/A.csv",header=TRUE,sep=",")
data(A)
I get a warning:
Warning message: In data(A) : data set ‘A’ not found
How do I use specific functions that are part of R packages on my own dataset, and not just on the examples in those packages?
As @cenka pointed out, when you are loading your own data, you do not need to use the data() function; that is only for loading datasets that ship with packages. If you want to use functions from a specific package after you install it, be sure to call library(cat) to load the package into your current R session.

Getting associated GO:IDs for a given gene name using R bioconductor annotation package

I am trying out the hgu95av2.db and GO.db packages from Bioconductor.
I have a list of gene names:
Genename1
Genename2
Genename3
These are standard gene names from genedb.
I now want to get the GO IDs associated with them, and I would like to use this information to bin the data in the future.
I have gone through the annotationapi help file, which says one way of getting the IDs is to use this API as follows:
select(hgu95av2.db, keys=keys, cols=c("GO"),
keytype="GENENAME")
I am not sure how to set up the keys.
Do I just set up a column object with the list of keys?
When I try that and run the command above, I get the following error:
Error in .testIfKeysAreOfProposedKeytype(x, keys, keytype) :
None of the keys entered are valid keys for the keytype specified.
I have played around with the keytype and always get the same error, which makes me think I don't really understand how to use this database query tool.
I have searched the Bioconductor documentation, but it assumes that I have expression data in affy matrix format, whereas I just have a list of gene names.
I would appreciate your help, and apologies, as I am a newbie and not really clear on the R Bioconductor interface.
Many thanks
