I'm working with RJDBC on a server whose maintainers frequently update jar versions. Since RJDBC requires classpaths, this poses a problem when paths break. Fortunately, the most current jars will always be in the same directory; only the version numbers will have changed.
I'm trying to use a simple grep call in R to isolate which jar I need, based on a regex with some AND logic; however, R makes this surprisingly difficult.
This question demonstrates how grep in R can use the | operator for OR logic, but I can't seem to find a similar AND operator.
Here's an example:
## Let's say I have three jars in a directory
jars <- list.files('/the/dir')
> jars
[1] "hive-jdbc-1.1.0-cdh5.4.3-standalone.jar" "hive-jdbc-1.1.0-cdh5.4.3.jar" "jython-standalone-2.5.3.jar"
The jar I want is "hive-jdbc-1.1.0-cdh5.4.3-standalone.jar"—how can I use AND logic in grep to extract it?
## I know that OR logic is supported:
j <- jars[grep('hive-jdbc|standalone', jars)]
> j
[1] "hive-jdbc-1.1.0-cdh5.4.3-standalone.jar" "hive-jdbc-1.1.0-cdh5.4.3.jar" "jython-standalone-2.5.3.jar"
## Would AND logic look like the same format?
> jars[grep('hive-jdbc&standalone', jars)]
character(0)
Not all that surprisingly, that last piece doesn't work. I found a useful, yet non-comprehensive, link for grep in R, but it doesn't show an AND operator. Any thoughts?
You could try
grep('hive-jdbc.*standalone', jars) # 'hive-jdbc' followed by 'standalone'
or
grepl('hive-jdbc', jars) & grepl('standalone', jars) # 'hive-jdbc' AND 'standalone'
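For example, the grepl form applied to the jars above keeps only the element that matches both patterns:
> jars[grepl('hive-jdbc', jars) & grepl('standalone', jars)]
[1] "hive-jdbc-1.1.0-cdh5.4.3-standalone.jar"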
When I look at the documentation, there is no "correlation", but there are "ppearson" and "spearson". They are mentioned exactly once, as a "group-by statistical operation". But how exactly are they defined?
Also, when I try to use one, I get an error message that I don't understand how to fix. How do you use ppearson or spearson?
$ cat > foo.tsv
1^I2
2^I3
$ cat foo.tsv | datamash ppearson 1,2
datamash: operation ‘ppearson’ requires field pairs
EDIT: This documentation section says
GNU Datamash is designed to closely follow R project’s (https://www.r-project.org/) statistical functions. See the files/operators.R file for the R equivalent code for each of datamash’s operators. When building datamash from source code on your local computer, operators are compared to known results of the equivalent R functions.
Looking in R, I don't see an spearson:
> ?spearson
No documentation for ‘spearson’ in specified packages and libraries:
you could try ‘??spearson’
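For what it's worth, the error message hints at the usage: judging by datamash's other pair operations (pcov/scov), my guess is that ppearson and spearson take colon-separated field pairs rather than a comma-separated field list, with the p/s prefixes presumably meaning population/sample as they do for pcov/scov. Treat both the syntax and that reading as assumptions to check against your datamash version:
$ datamash ppearson 1:2 < foo.tsv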
I'm trying to run the code below from the BioSeqClass package; however, I get an error message:
selectWeka(data, evaluator="CfsSubsetEval", search="BestFirst", n)
Error in system(command, intern = TRUE) : '“C:\Program' not found
This is a problem with how BioSeqClass is calling java: it leaves the file names unprotected/unquoted, and R's system and system2 commands are horrible in that they do not force quoting. (If you ever think of using these commands directly yourself, I strongly recommend something like processx.)
One should create an issue or bug-report, but I don't know how to do that with Bioconductor, and their mirror on github (https://github.com/Bioconductor-mirror) is defunct, so I'm at a loss there. Hopefully somebody with more info can weigh in on this.
Workarounds
It is not obvious if the problem is due to where weka.jar is located or perhaps one of the other arguments. You can find out where the problem is by debugging the selectWeka function and inspecting the value of command before the system call. Look for the Program Files component of a path.
If the problem is with weka.jar, then this suggests that you are installing packages somewhere under C:\Program Files\, which is in my experience bad practice on two counts:
1. For many reasons I cannot recall (though this one re-ignites the discussion), I never install R in the default location under C:\Program Files\...; instead, I install it under a new directory, C:\R\R-3.5.3 (version-based), and go from there. You may not have control over this if you are on a university/company system.
2. Since this is not in a base-R package, this suggests that either you are not using a personal library location (collection of packages), or you have placed your personal library for some reason under C:\Program Files. If the former, I strongly suggest you never install new packages inside the base-R installation directory, instead using your own. See ?.libPaths and many other tutorials/discussions on the topic online. Using packrat or checkpoint might also mitigate this problem.
If the problem is with trainFile (if I'm reading the source correctly), then your "permanent" fix is to change where Windows puts temporary files, as this trainFile is a temporary file created specifically for this function-run. If this is your problem, I'll leave it up to you to fix.
Regardless, you may not have the time or the need for a more permanent solution; perhaps you just want to run this once or twice and then move on. For that, a fix:
Again, debug(selectWeka), and once command is defined (the next command to be executed is tmp <- system(command,intern = TRUE)), run this code to fix the value of command:
if(search=="Ranker"){
  command = paste("java -cp ", shQuote(file.path(.path.package("BioSeqClass"), "scripts", "weka.jar")),
                  " weka.attributeSelection.", evaluator, " -i ", shQuote(trainFile),
                  " -s \"weka.attributeSelection.", search, " -N ", n, "\"", sep="" )
}else{
  command = paste("java -cp ", shQuote(file.path(.path.package("BioSeqClass"), "scripts", "weka.jar")),
                  " weka.attributeSelection.", evaluator, " -i ", shQuote(trainFile),
                  " -s weka.attributeSelection.", search, sep="" )
}
(For the record, all I changed was adding shQuote twice to each paste.) Confirm that command now has quotes around things, something like
Browse[2]> command
[1] "java -cp \"C:\\Program Files\\...\\weka.jar\" weka.attributeSel... -i \"c:\\path\\to\\some\\tempfile\" ...
Then you can continue-out of the debugger and let it run its course.
(I hope you don't have hundreds of calls to selectWeka.)
Caveat: I am not a user of BioSeqClass, so I'm saying all of this from speculation and inference. I might have mis-located the source of the error. And since I don't know what I'm doing with it, I have not tested the modified command assignment within selectWeka. I believe shQuote(...) is the right way to go, but you might need to use sQuote or dQuote instead; I'm not sure how your system is set up.
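As a quick illustration of the quoting options (plain R, nothing package-specific): shQuote defaults to POSIX-shell single quoting, while type = "cmd" produces the double quoting that Windows' cmd.exe generally expects.
> shQuote("C:\\Program Files\\BioSeqClass\\scripts\\weka.jar")
[1] "'C:\\Program Files\\BioSeqClass\\scripts\\weka.jar'"
> shQuote("C:\\Program Files\\BioSeqClass\\scripts\\weka.jar", type = "cmd")
[1] "\"C:\\Program Files\\BioSeqClass\\scripts\\weka.jar\""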
In TensorFlow, training from scratch produced the following six files:
events.out.tfevents.1503494436.06L7-BRM738
model.ckpt-22480.meta
checkpoint
model.ckpt-22480.data-00000-of-00001
model.ckpt-22480.index
graph.pbtxt
I would like to convert them (or only the needed ones) into one file, graph.pb, to be able to transfer it to my Android application.
I tried the script freeze_graph.py, but it requires as input a .pb file, which I do not have (I have only the six files mentioned above). How do I proceed to get this one frozen_graph.pb file? I saw several threads, but none worked for me.
You can use this simple script to do that. But you must specify the names of the output nodes.
import tensorflow as tf

meta_path = 'model.ckpt-22480.meta' # Your .meta file
output_node_names = ['output']      # Output node names, without the ':0' tensor suffix

with tf.Session() as sess:
    # Restore the graph
    saver = tf.train.import_meta_graph(meta_path)

    # Load weights; latest_checkpoint() expects the checkpoint directory
    saver.restore(sess, tf.train.latest_checkpoint('path/of/your/checkpoint/dir'))

    # Freeze the graph
    frozen_graph_def = tf.graph_util.convert_variables_to_constants(
        sess,
        sess.graph_def,
        output_node_names)

    # Save the frozen graph
    with open('output_graph.pb', 'wb') as f:
        f.write(frozen_graph_def.SerializeToString())
If you don't know the name of the output node or nodes, there are two ways:
You can explore the graph and find the name with Netron or with the command-line summarize_graph utility.
You can use all the nodes as output ones as shown below.
output_node_names = [n.name for n in tf.get_default_graph().as_graph_def().node]
(Note that you have to put this line just before the convert_variables_to_constants call.)
But I think that's an unusual situation, because if you don't know the output node, you can't actually use the graph.
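If you'd rather eyeball candidate output nodes, a small sketch using the same API as above (run it after import_meta_graph, inside the same session; picking the right node is still up to you):
for n in tf.get_default_graph().as_graph_def().node:
    print(n.name, n.op)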
Since it may be helpful for others, I'm also posting here the answer I gave on GitHub ;-).
I think you can try something like this (with the freeze_graph script in tensorflow/python/tools):
python freeze_graph.py \
  --input_graph=/path/to/graph.pbtxt \
  --input_checkpoint=/path/to/model.ckpt-22480 \
  --input_binary=false \
  --output_graph=/path/to/frozen_graph.pb \
  --output_node_names="the nodes that you want to output e.g. InceptionV3/Predictions/Reshape_1 for Inception V3"
The important flag here is --input_binary=false, as the file graph.pbtxt is in text format. I think it corresponds to the required graph.pb, which is the equivalent in binary format.
Concerning the output_node_names, that's really confusing for me, as I still have some problems with this part, but you can use the summarize_graph script in tensorflow, which can take the pb or the pbtxt as input.
Regards,
Steph
I tried the freeze_graph.py script, but its output_node_names parameter was totally confusing to me, and the job failed.
So I tried the other one: export_inference_graph.py.
And it worked as expected!
python -u /tfPath/models/object_detection/export_inference_graph.py \
--input_type=image_tensor \
--pipeline_config_path=/your/config/path/ssd_mobilenet_v1_pets.config \
--trained_checkpoint_prefix=/your/checkpoint/path/model.ckpt-50000 \
--output_directory=/output/path
The tensorflow installation package I used is from here:
https://github.com/tensorflow/models
First, use the following code to generate the graph.pbtxt file.
with tf.Session() as sess:
    # Restore the graph
    _ = tf.train.import_meta_graph(args.input)

    # Save the graph definition in text format
    g = sess.graph
    gdef = g.as_graph_def()
    tf.train.write_graph(gdef, ".", args.output, True)
Then, use summarize_graph to get the output node name.
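If you have the TensorFlow source tree, one way to run summarize_graph is via bazel; the target path below is from memory of the repo layout, so double-check it against your checkout:
bazel build tensorflow/tools/graph_transforms:summarize_graph
bazel-bin/tensorflow/tools/graph_transforms/summarize_graph --in_graph=/path/to/graph.pbtxt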
Finally, use
python freeze_graph.py \
  --input_graph=/path/to/graph.pbtxt \
  --input_checkpoint=/path/to/model.ckpt-22480 \
  --input_binary=false \
  --output_graph=/path/to/frozen_graph.pb \
  --output_node_names="the nodes that you want to output e.g. InceptionV3/Predictions/Reshape_1 for Inception V3"
to generate the frozen graph.
I have lots (millions) of small log files in S3 whose names (server plus date/time) identify them, i.e. servername-yyyy-mm-dd-HH-MM, e.g.
s3://my_bucket/uk4039-2015-05-07-18-15.csv
s3://my_bucket/uk4039-2015-05-07-18-16.csv
s3://my_bucket/uk4039-2015-05-07-18-17.csv
s3://my_bucket/uk4039-2015-05-07-18-18.csv
...
s3://my_bucket/uk4339-2015-05-07-19-23.csv
s3://my_bucket/uk4339-2015-05-07-19-24.csv
...
etc
From EC2, using the AWS CLI, I would like to simultaneously download all files for 2015 whose minute equals 16, for servers uk4339 and uk4338 only.
Is there a clever way to do this?
Also, if this is a terrible file structure in S3 for querying data, I would be extremely grateful for any advice on how to set it up better.
I can put a relevant aws s3 cp ... command into a loop in a shell/bash script to sequentially download the relevant files, but I was wondering whether there is something more efficient.
As an added bonus, I would like to row-bind the results together into one CSV.
A mock csv file can be generated with this line of R code:
R> write.csv(data.frame(cbind(a1=rnorm(100),b1=rnorm(100),c1=rnorm(100))),file='uk4339-2015-05-07-19-24.csv',row.names=FALSE)
The csv that is created is uk4339-2015-05-07-19-24.csv. FYI, I will be importing the combined data into R at the end.
As you didn't answer my questions, nor indicate what OS you use, it is somewhat hard to make any concrete suggestions, so I will briefly suggest you use GNU Parallel to parallelise your S3 fetch requests to get around the latency.
Suppose you somehow generate a list of all the S3 files you want (one possible way is sketched after the listing) and put the resulting list in a file called GrabMe.txt, like this
s3://my_bucket/uk4039-2015-05-07-18-15.csv
s3://my_bucket/uk4039-2015-05-07-18-16.csv
s3://my_bucket/uk4039-2015-05-07-18-17.csv
s3://my_bucket/uk4039-2015-05-07-18-18.csv
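One untested sketch for generating that list, assuming the default aws s3 ls output format (date, time, size, then the key in the fourth column) and selecting the 2015 files from servers uk4338/uk4339 whose minute equals 16:
aws s3 ls s3://my_bucket/ | awk '{print $4}' \
  | grep -E '^uk433[89]-2015-[0-9]{2}-[0-9]{2}-[0-9]{2}-16\.csv$' \
  | sed 's|^|s3://my_bucket/|' > GrabMe.txt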
Then you can get them in parallel, say 32 at a time, like this:
parallel -j 32 echo aws s3 cp {} . < GrabMe.txt
or if you prefer reading left-to-right
cat GrabMe.txt | parallel -j 32 echo aws s3 cp {} .
You can obviously alter the number of parallel requests from 32 to any other number. At the moment, it just echoes the command it would run, but you can remove the word echo when you see how it works.
There is a good tutorial here, and Ole Tange (the author of GNU Parallel) is on SO, so we are in good company.
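As for your row-binding bonus, a minimal R sketch once the files are downloaded, assuming all the CSVs share the same columns:
files <- list.files(pattern = "\\.csv$", full.names = TRUE)
combined <- do.call(rbind, lapply(files, read.csv))
write.csv(combined, "combined.csv", row.names = FALSE)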
I wish to get the fully qualified name of a file in R, given any of the standard notations. For example:
file.ext
~/file.ext (this case can be handled by path.expand)
../current_dir/file.ext
etc.
By fully qualified file name I mean, for example, (on a Unix-like system):
/home/user/some/path/file.ext
(Edited - use file.path and attempt Windows support) A crude implementation might be:
path.qualify <- function(path) {
  path <- path.expand(path)
  if (!grepl("^(/|[A-Za-z]:)", path)) path <- file.path(getwd(), path)
  path
}
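A quick usage sketch, assuming the working directory from the question, /home/user/some/path (note this crude version does not resolve ../ components):
> path.qualify("file.ext")
[1] "/home/user/some/path/file.ext"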
However, I'd ideally like something cross-platform that can handle relative paths with ../, symlinks etc. An R-only solution would be preferred (rather than shell scripting or similar), but I can't find any straightforward way of doing this, short of coding it "from scratch".
Any ideas?
I think you want normalizePath():
> setwd("~/tmp/bar")
> normalizePath("../tmp.R")
[1] "/home/gavin/tmp/tmp.R"
> normalizePath("~/tmp/tmp.R")
[1] "/home/gavin/tmp/tmp.R"
> normalizePath("./foo.R")
[1] "/home/gavin/tmp/bar/foo.R"
For Windows, there is the winslash argument, which you might as well set all the time, as it is ignored on anything other than Windows and so won't affect other OSes:
> normalizePath("./foo.R", winslash="\\")
[1] "/home/gavin/tmp/bar/foo.R"
(you need to escape the \, hence the \\), or
> normalizePath("./foo.R", winslash="/")
[1] "/home/gavin/tmp/bar/foo.R"
depending on how you want the path presented/used. The former is the default ("\\") so you could stick with that if it suffices, without needing to set anything explicitly.
As of R 2.13.0, the "~/file.ext" case also works (see comments):
> normalizePath("~/foo.R")
[1] "/home/gavin/foo.R"
I think I kind of miss the point of your question, but hopefully my answer can point you in the direction you want (it integrates your idea of using paste and getwd with list.files):
paste(getwd(), substr(list.files(full.names = TRUE), 2, 1000), sep = "")
Edit: works on Windows in some tested folders.