Writing a partitioned parquet file with SparkR

I've got two scripts, one in R and a short second one in PySpark that uses the output. I'm trying to copy that functionality into the first script for simplicity.
The second script is very simple: read a bunch of CSV files and emit them as partitioned Parquet:
spark.read.csv(path_to_csv, header = True) \
    .repartition(partition_column).write \
    .partitionBy(partition_column).mode('overwrite') \
    .parquet(path_to_parquet)
This should be equally simple in R but I can't figure out how to match the partitionBy functionality in SparkR. I've got this so far:
library(SparkR); library(magrittr)
read.df(path_to_csv, 'csv', header = TRUE) %>%
    repartition(col = .$partition_column) %>%
    write.df(path_to_parquet, 'parquet', mode = 'overwrite')
This successfully writes one parquet file for each value of partition_column. The issue is the emitted files have the wrong directory structure; whereas Python produces something like
/path/to/parquet/
  partition_column=key1/
    file.parquet.gz
  partition_column=key2/
    file.parquet.gz
  ...
R produces only
/path/to/parquet/
  file_for_key1.parquet.gz
  file_for_key2.parquet.gz
  ...
Am I missing something? The partitionBy function in SparkR appears to refer only to the context of window functions, and I don't see anything else in the manual that could be related. Perhaps there's a way to pass something through ..., but I don't see any examples in the documentation or from a search online.

Partitioning of the output is not supported in Spark <= 2.x.
However, it will be supported in SparkR >= 3.0.0 (SPARK-21291 - R partitionBy API), with the following syntax:
write.df(
    df, path_to_parquet, "parquet", mode = "overwrite",
    partitionBy = "partition_column"
)
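Applied to the pipeline from the question, the whole job then collapses into a single chain; a sketch assuming SparkR >= 3.0.0 and the same path variables as above:
library(SparkR); library(magrittr)

# Read the CSVs, repartition on the key, and write partitioned Parquet
read.df(path_to_csv, 'csv', header = TRUE) %>%
    repartition(col = .$partition_column) %>%
    write.df(path_to_parquet, 'parquet', mode = 'overwrite',
             partitionBy = 'partition_column')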
Since the corresponding PR modifies only R files, you should be able to patch any SparkR 2.x distribution if upgrading to the development version is not an option:
git clone https://github.com/apache/spark.git
git checkout v2.4.3 # Or whatever branch you use
# https://github.com/apache/spark/commit/cb77a6689137916e64bc5692b0c942e86ca1a0ea
git cherry-pick cb77a6689137916e64bc5692b0c942e86ca1a0ea
R -e "devtools::install('R/pkg')"
In client mode this should be required only on the driver node. The cherry-pick and rebuild may emit some warnings, but these are not fatal and shouldn't cause any serious issues.


How to export sf object to GDB using RPyGeo in R (Windows)?

I have a bunch of sf objects I'd like to export to GDB from R. I'm running R 4.0.2 on Windows 10. In this case the sf objects are all vector point data. The main reasons to export to GDB are to keep longer field names (the shapefile truncation is very annoying), and because GDBs are more desirable storage locations for our workflows.
Yes, I know about the ArcGisBinding package. I've got it to work in a test script, but it's pretty unstable, often crashing and requiring a restart of R. This is a problem because the sf objects I'd like to export come after an already long Rmd that reads in, formats, and cleans the data. So it's not a simple matter of re-running the script until arc.write doesn't break. I could break up the script, but then I'd still have to read in a bunch of shapefiles. One option I haven't yet explored is using reticulate to call a Python script instead of trying to do everything in R, but we're trying to do our analysis all in one place, if possible.
I'm pretty sure I've managed to set up RPyGeo appropriately, first setting my Python path using the reticulate package. I'm doing it this way because IT restrictions mean I can't edit PATH variables on my machine.
#package calls
library(sf)
library(spData)
library(reticulate)
#set python version in reticulate
py_path <- "C:/Program Files/ArcGIS/Pro/bin/Python/envs/arcgispro-py3/python.exe"
reticulate::use_python(python = py_path, required = TRUE)
#call RPyGeo
library(RPyGeo) # for potential point export
#output gdb
out.gdb <- "C:/LOCAL_PROJECTS/Output/Output.gdb"
#RPyGeo Parameters
# Note that, in order to use RPyGeo you need a working ArcMap or ArcGIS Pro installation on your computer.
# python path - note that this will change depending on which version of Arc one is using
# py_path <- "C:/Program Files/ArcGIS/Pro/bin/Python/envs/arcgispro-py3/python.exe"
arcpy <- rpygeo_build_env(workspace = out.gdb,
                          overwrite = TRUE,
                          extensions = c("Spatial", "DataInteroperability"),
                          path = py_path)
I've tried a bunch of different tools to export an sf object, here using dummy data also used in the RPyGeo vignette:
data(nz, package = "spData")
arcpy$Copy_management(in_data = nz, out_data = "nz_test")
arcpy$Copy_management(in_data = nz, out_data = file.path(out.gdb, "nz"))
arcpy$FeatureClassToGeodatabase_conversion(Input_Features = nz, Output_Geodatabase = out.gdb)
arcpy$FeatureClassToFeatureClass_conversion(in_features = nz, out_path = out.gdb, out_name = "nz")
arcpy$QuickExport_interop(Input = nz, Output = file.path(out.gdb, "nz"))
arcpy$CopyFeatures_management(in_features = nz, out_feature_class = file.path(out.gdb, "nz"))
arcpy$CopyFeatures_management(in_features = nz, out_feature_class = "nz")
Each time I get an error, for example:
Error in py_call_impl(callable, dots$args, dots$keywords) :
  RuntimeError: Object: Error in executing tool
Detailed traceback:
  File "C:\Program Files\ArcGIS\Pro\Resources\ArcPy\arcpy\management.py", line 3232, in CopyFeatures
    raise e
  File "C:\Program Files\ArcGIS\Pro\Resources\ArcPy\arcpy\management.py", line 3229, in CopyFeatures
    retval = convertArcObjectToPythonObject(gp.CopyFeatures_management(*gp_fixargs((in_features, out_feature_class, config_keyword, spatial_grid_1, spatial_grid_2, spatial_grid_3), True)))
  File "C:\Program Files\ArcGIS\Pro\Resources\ArcPy\arcpy\geoprocessing\_base.py", line 511, in <lambda>
    return lambda *args: val(*gp_fixargs(args, True))
I'm not an expert in ArcPy by any means. Nor am I an expert in tracing errors inside packages. Am I making a simple syntax mistake? Is there something else that I'm missing? Any help would be much appreciated!
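One thing worth checking (an assumption on my part, not verified against RPyGeo's internals): the arcpy tools expect a path to a dataset on disk, while nz is an in-memory sf object that reticulate hands to Python as a plain data structure arcpy can't interpret. A minimal sketch of a workaround, writing the sf object to a temporary shapefile first and then copying that on-disk file into the GDB (the temp path and output name are illustrative, and the shapefile step will still truncate long field names):
# Write the sf object to disk so arcpy receives a real dataset path
tmp_shp <- file.path(tempdir(), "nz.shp")
sf::st_write(nz, tmp_shp)

# Copy the on-disk feature class into the file geodatabase
arcpy$CopyFeatures_management(in_features = tmp_shp,
                              out_feature_class = file.path(out.gdb, "nz"))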

How to convert Tensorflow Object Detection API model to TFLite?

I am trying to convert a TensorFlow Object Detection model (ssd-mobilenet-v2-fpnlite, from the TensorFlow 2 Detection Model Zoo) to TFLite. First I train the model using model_main_tf2.py, and then I use export_tflite_graph_tf2.py to export a SavedModel (.pb). However, when it comes to converting the .pb file to .tflite it throws this error:
OSError: SavedModel file does not exist at: /content/gdrive/My Drive/models/research/object_detection/fine_tuned_model/saved_model/saved_model.pb/{saved_model.pbtxt|saved_model.pb}
To convert the .pb file I used:
import os
import tensorflow as tf
SAVED_MODEL_PATH = os.path.join(os.getcwd(),'object_detection', 'fine_tuned_model', 'saved_model', 'saved_model.pb')
# SAVED_MODEL_PATH: '/content/gdrive/My Drive/models/research/object_detection/exported_model/saved_model/saved_model.pb'
converter = tf.lite.TFLiteConverter.from_saved_model(SAVED_MODEL_PATH)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.experimental_new_converter = True
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS, tf.lite.OpsSet.SELECT_TF_OPS]
tflite_model = converter.convert()
open("detect.tflite", "wb").write(tflite_model)
or "tflite_convert" from command line, but with the same error. I also tried to run it with the latest tf-nightly version as it suggests here, but the outcome is the same. I tried to pass the path with various ways, it seems like the .pd is not well written (not the right file). Is there a way to manage to convert the model to tflite so as to implement it to android? Thank you!
Your saved_model path should be "/content/gdrive/My Drive/models/research/object_detection/fine_tuned_model/saved_model/". It should point to the folder, not to a file inside that folder.
For a quick test, try typing in a terminal:
tflite_convert \
  --saved_model_dir="path to saved_model folder" \
  --output_file="path to the tflite file you want to save"
I don't have enough reputation to just comment, but the problem here seems to be your SAVED_MODEL_PATH.
You could try hardcoding the path and removing the trailing .pb file name. I don't remember exactly what the trick is here, but it's definitely due to the path.

How to get the output data from R script in node using r-script

I am trying to execute an R script from Node.js using r-script, because it looks pretty simple.
With the documentation example:
example.js
var out = R("ex-sync.R")
  .data("hello world", 20)
  .callSync();
console.log(out);
ex-sync.R
needs(magrittr)
set.seed(512)
do.call(rep, input) %>%
  strsplit(NULL) %>%
  sapply(sample) %>%
  apply(2, paste, collapse = "")
My out variable, which is supposed to be the value of the last expression of the R script, is always null and I have no idea why this happens.
For Windows users:
You need to add R's bin folder to Windows's %PATH% variable. The r-script package needs to call the "R" command from CMD; if R.exe is not reachable through an environment variable, it will never be able to call the "R" command from anywhere.
Look up how to add environment variables in Windows, and remember: if the path to the folder containing the executables contains a space, it must be wrapped in double quotes, e.g. "C:\Program Files\R\R-version\bin\x64" (replace R-version with your installed version).
If you have already done this but the problem persists, I can only think of two reasons:
There's something wrong with your R script and it's throwing an exception inside the R session.
The system can't find the file. Maybe check the file path.
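One way to rule out the first reason is to run the R side on its own, manually defining the input list that r-script normally injects (a sketch; the values mirror the .data() call above, and the needs package must be installed):
# r-script exposes the arguments of .data() as a list named `input`;
# define it by hand so the script can be tested outside of Node.
input <- list("hello world", 20)

needs(magrittr)
set.seed(512)
do.call(rep, input) %>%
  strsplit(NULL) %>%
  sapply(sample) %>%
  apply(2, paste, collapse = "")
If this prints the expected character vector in a plain R session, the problem is on the Node/PATH side rather than in the script itself.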

TreeTagger in R

I have downloaded TreeTagger v3.2 for Windows and have configured it per the install.txt. I am trying to use it in R with the koRpus package. I have set the kRp.env as:
set.kRp.env(TT.cmd="C:\\TreeTagger\\bin\\tag-english.bat", lang="en",
            preset="en", treetagger="manual", format="file",
            TT.tknz=TRUE, encoding="UTF-8")
My data to be tagged is in a file, and I am trying to tag it with treetag("myfile.txt"), but it throws the error:
Error in matrix(unlist(strsplit(tagged.text, "\t")), ncol = 3, byrow = TRUE, :
'data' must be of a vector type, was 'NULL'
In addition: Warning message:
running command 'C:\windows\system32\cmd.exe /c C:\TreeTagger\bin\tag-english.bat
C:\Users\vivsingh\Desktop\NLP\tree_tag_ex.txt' had status 255
The standalone TreeTagger is working fine on my Windows machine. Any idea how to make this work?
I had the exact same error and warning while trying lemmatization on an R word vector, following the Bernhard Learns blog, using Windows 7 and R 3.4.1 (x64). The issue also appeared with the textstem package, but TreeTagger ran properly in a cmd window.
I mixed several answers I found on this post; here are my steps, with the code now running properly:
Go into the R win-library (~\Documents\R\win-library\3.4\rJava\jri\x64\jri.dll) and copy jri.dll (thanks kravi!) to replace the one in the parent folder.
Close and restart R, then run:
library(koRpus)
set.kRp.env(TT.cmd="C:\\TreeTagger\\bin\\tag-english.bat", lang="en", preset="en", treetagger="manual", format="file", TT.tknz=TRUE, encoding="UTF-8")
lemma_tagged <- treetag(lemma_unique$word_clean, treetagger="manual", format="obj", TT.tknz=FALSE, lang="en", TT.options=list(path="c:/TreeTagger", preset="en"))
lemma_tagged_tbl <- tbl_df(lemma_tagged@TT.res)
Hope it helps.
I am posting this answer to keep a record. I also faced the same issue, due to an incorrect specification of the location of jri.dll, on a 64-bit processor with Windows 8.1. If we call
set.kRp.env(TT.cmd="manual", lang="en", TT.options=list(path="/path/to/tree-tagger-windows-x.x/TreeTagger", preset="en"))
and then follow either of the following two steps, we can resolve the error:
While installing R, install only the 64-bit version of R, and specify the proper paths for these variables:
LD_LIBRARY_PATH = /path/to/rJava/jri
JAVA_HOME = /path/to/jdk1.x.x
java.library.path = /path/to/rJava/jri/jri.dll
CLASSPATH = /path/to/rJava/jri
If both the 32-bit and 64-bit versions of R are already installed on the computer, just copy jri.dll from /path/to/rJava/jri/x64/jri.dll and replace the one at /path/to/rJava/jri/jri.dll (see the sketch below). Further, we need to set the paths of the four variables mentioned above.
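If you'd rather do that copy from within R itself, here is a minimal sketch (my own illustration, not part of the original answer; R may need to run with permissions to write into the library folder):
# Locate rJava's jri folder and copy the 64-bit jri.dll over the copy
# in the parent folder
jri_dir <- system.file("jri", package = "rJava")
file.copy(from = file.path(jri_dir, "x64", "jri.dll"),
          to = file.path(jri_dir, "jri.dll"),
          overwrite = TRUE)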
I've got this issue (very similar, I guess) and posted a query on GitHub:
https://github.com/unDocUMeantIt/koRpus/issues/7
The current working solution for me in this case was easier than I expected: just downgrading the koRpus package. This may change with time, but this version should remain appropriate.
library("devtools")
install_github("unDocUMeantIt/koRpus", ref="0.06-5")
They said this package is not Java related.
You can face the same error while setting up the koRpus environment and getting the result from TreeTagger. For example, when you use:
tagged.text <- treetag(
  "C:/temp/sample_text.txt",
  treetagger = "manual",
  lang = "en",
  TT.options = list(
    path = "c:/Treetagger",
    preset = "en"
  ),
  doc_id = "sample"
)
You would receive a similar error:
Error: Awww, this should not happen: TreeTagger didn't return any useful data.
This can happen if the local TreeTagger setup is incomplete or different from what presets expected.
You should re-run your command with the option 'debug=TRUE'. That will print all relevant configuration.
Look for a line starting with 'sys.tt.call:' and try to execute the full command following it in a command line terminal. Do not close this R session in the meantime, as 'debug=TRUE' will keep temporary files that might be needed.
If running the command after 'sys.tt.call:' does fail, you'll need to fix the TreeTagger setup.
If it does not fail but produce a table with proper results, please contact the author!
Here you need to change the value of treetagger, from
treetagger = "manual"
to
treetagger = "kRp.env"
However, before that, remember to set the kRp.env as @Xochitl C. suggested in their answer:
set.kRp.env(TT.cmd="C:\\TreeTagger\\bin\\tag-english.bat", lang="en", preset="en", treetagger="manual", format="file", TT.tknz=TRUE, encoding="UTF-8")
Once you do this, you'll get the desired result.
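Putting the pieces together, the corrected calls would look something like this (a sketch assembled from the snippets above; the input file path is the same illustrative one):
# Register the TreeTagger configuration once per session ...
set.kRp.env(TT.cmd="C:\\TreeTagger\\bin\\tag-english.bat", lang="en", preset="en",
            treetagger="manual", format="file", TT.tknz=TRUE, encoding="UTF-8")

# ... then point treetag() at the stored environment instead of "manual"
tagged.text <- treetag(
  "C:/temp/sample_text.txt",
  treetagger = "kRp.env",
  lang = "en",
  doc_id = "sample"
)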

Different results from Rscript and R CMD BATCH

I have an inconsistency issue which I cannot explain when running an R script. I am not able to produce a reproducible example because there is a whole set of files/functions called by the entry script.
Using Rscript or RStudio with R v3.1.2 I obtain the results I'm expecting; however, when calling R CMD BATCH from bash, my script does not produce identical output. From bash, R seems to read the command-line arguments correctly and reports them from the script, but only the Rscript and RStudio source methods seem to use the parameter correctly in my code.
The two command-line calls are as follows:
Rscript ./script/forecast_category_script.R "category='razors'" "cores=4L"
R CMD BATCH --no-save "--args category='razors' cores=4L" ./script/forecast_category_script.R ~/data/output/out.out
Is there any obvious reason why these inconsistencies might be occurring? I'd prefer to use R CMD BATCH as it redirects output to a file, and when I migrate my code to the university cluster as a batch job through the scheduler, I'd like to be able to follow what it has done.
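For reference, both invocations deliver the same trailing arguments to the script; a common pattern for turning strings like category='razors' into R variables (a sketch, since the question's own parsing code isn't shown) is:
# Evaluate each "name=value" argument, creating `category` and `cores`
args <- commandArgs(trailingOnly = TRUE)
for (a in args) eval(parse(text = a))
print(category)  # "razors"
print(cores)     # 4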
UPDATE: changing this line resolves it, but why?
Previously I had the following line in there, basically so when I was testing I didn't keep reloading the huge dataset if it was already loaded in my RStudio environment:
if(!exists("spi")) spi = f_load.spi(category = category)
Replaced it with this:
spi = f_load.spi(category = category)
The underlying function f_load.spi remained the same, however:
f_load.spi = function(spi = NULL, category = "razors", n = NULL) {
  # check if the data is pre-loaded
  if (is.null(spi)) {
    fil = paste0(pth.data.storage, "categories/", category, "/", category, ".sp_ss.interp.rds")
    print(fil)
    spi = readRDS(fil)
  }
  # subset to a specific set of items
  if (!is.null(n)) {
    fc.items = unique(spi$fc.item)
    rnd = sample(1:length(fc.items), n)
    spi = spi[fc.item %in% fc.items[rnd]]
  }
  spi
}
For some reason the category variable was not being passed through properly into the function and it was loading a different category (beer rather than razors) which was an enormous file and not suitable for testing.
This still doesn't explain why Rscript and R CMD BATCH behaved differently.
It is possible that one of them is loading a previously saved workspace and using global variables. In fact, R CMD BATCH runs with --restore by default, so any .RData in the working directory is loaded before the script executes, whereas Rscript implies --no-restore; combined with the if(!exists("spi")) guard, a stale spi restored from .RData would silently be reused. Have you checked whether it matters which directory you are in, or whether any .RData or .Rhistory files are present? One way to ensure that you don't have any hidden variables is to clear the workspace at the beginning of each script, for example rm(list=ls()) as the first line of your Rscript.
Also, you can pipe output to a file with an Rscript using sink().
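A minimal sketch of that sink() approach (the output path is illustrative):
# Divert regular output and messages/warnings to a log file
log_con <- file("~/data/output/out.out", open = "wt")
sink(log_con)                    # capture standard output
sink(log_con, type = "message")  # capture messages and warnings too

# ... forecasting code goes here ...

sink(type = "message")           # restore the message stream
sink()                           # restore standard output
close(log_con)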
