pyarrow compute expression in reticulate

pyarrow compute expression in reticulate - r

How do I use a filter in a reticulate pyarrow compute expression.
At base I have a pyarrow dataset (in this case called woodcan) that I want to turn into a table with a filter.
tab <- woodcan$to_table(ds$field('Region')=='Canada')
The above gets Error in py_compare_impl(a, b, op) : ValueError: An Expression cannot be evaluated to python True or False. If you are using the 'and', 'or' or 'not' operators, use '&', '|' or '~' instead.
How is that syntax supposed to look?

You could generate the expression running python code with py_run_string or py_run_file and pass it to filter argument of to_table:
library(reticulate)
run.py <- py_run_string('
import pyarrow.dataset as ds
expr = ds.field("Region") == "Canada"
')
woodcan$to_table(filter=run.py$expr)
Above code needs previous installation of py_arrow in conjuction with reticulate:
virtualenv_create("arrow-env")
arrow::install_pyarrow("arrow-env")
use_virtualenv("arrow-env")

With #Waldi's answer I came up with another way that doesn't rely on python strings.
From that answer we generate run.py$expr which is a <pyarrow.compute.Expression (Region == "Canada")> object.
From there it begs the question, how else can we generate that? Since it's a pyarrow.compute.Expression that tells us we need, of course, pyarrow.compute
so then it's just replacing "==" with equal
pc <- import('pyarrow.compute')
filter=pc$equal(pc$field("Region"), pc$scalar("Canada"))
and finally
woodcan$to_table(filter=filter)

Related

What is the option in Julia REPL that allows only a limited output?

The result variable is a json-type string, which is very long. What is the option in Julia REPL that allows only a limited output when the variable is this long? DataFrame is originally only partially output. I hope that the general variables will also be output like that.

You can overwrite the display method for AbstractStrings:
import Main.display
display(x::AbstractString) =
show(length(x)<=50 ? x : SubString(x,1,50)*"…")
Let us test it:
julia> str = join(rand('a':'z', 200))
"wcbifwzglgqyenrcdgdxagohlwdoxrrumoaltklkjauptwzrmi…"

return value of if statement in r

So, I'm brushing up on how to work with data frames in R and I came across this little bit of code from https://cloud.r-project.org/web/packages/data.table/vignettes/datatable-intro.html:
input <- if (file.exists("flights14.csv")) {
"flights14.csv"
} else {
"https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv"
}
Apparently, this assigns the strings (character vectors?) in the if and else statements to input based on the conditional. How is this working? It seems like magic. I am hoping to find somewhere in the official R documentation that explains this.
From other languages I would have just done:
if (file.exists("flights14.csv")) {
input <- "flights14.csv"
} else {
input <- "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv"
}
or in R there is ifelse which also seems designed to do exactly this, but somehow that first example also works. I can memorize that this works but I'm wondering if I'm missing the opportunity to understand the bigger picture about how R works.

From the documentation on the ?Control help page under "Value"
if returns the value of the expression evaluated, or NULL invisibly if none was (which may happen if there is no else).
So the if statement is kind of like a function that returns a value. The value that's returned is the result of either evaulating the if or the then block. When you have a block in R (code between {}), the brackets are also like a function that just return the value of the last expression evaluated in the block. And a string literal is a valid expression that returns itself
So these are the same
x <- "hello"
x <- {"hello"}
x <- {"dropped"; "hello"}
x <- if(TRUE) {"hello"}
x <- if(TRUE) {"dropped"; "hello"}
x <- if(TRUE) {"hello"} else {"dropped"}
And you only really need blocks {} with if/else statements when you have more than one expression to run or when spanning multiple lines. So you could also do
x <- if(TRUE) "hello" else "dropped"
x <- if(FALSE) "dropped" else "hello"
These all store "hello" in x

You are not really missing anything about the "big picture" in R. The R if function is atypical compared both to other languages as well as to R's typical behavior. Unlike most functions in R which do require assignment of their output to a "symbol", i.e a proper R name, if allows assignments that occur within its consequent or alternative code blocks to occur within the global environment. Most functions would return only the final evaluation, while anything else that occurred inside the function body would be garbage collected.
The other common atypical function is for. R for-loops only
retain these interior assignments and always return NULL. The R Language Definition calls these atypical R functions "control structures". See section 3.3. On my machine (and I suspect most Linux boxes) that document is installed at: http://127.0.0.1:10731/help/doc/manual/R-lang.html#Control-structures. If you are on another OS then there is probably a pulldown Help menu in your IDE that will have a pointer to it. Thew help document calls them "control flow constructs" and the help page is at ?Control. Note that it is necessary to quote these terms when you wnat to access that help page using one of those names since they are "reserved words". So you would need ?'if' rather than typing ?if. The other reserved words are described in the ?Reserved page.
?Control
?'if' ; ?'for'
?Reserved
# When you just type:
?if # and hit <return>
# you will see a "+"-sign which indicateds an incomplete expression.
# you nthen need to hit <escape> to get back to a regular R interaction.

In R, functions don't need explicit return. If not specified the last line of the function is automatically returned. Consider this example :
a <- 5
b <- 1
result <- if(a == 5) {
a <- a + 1
b <- b + 1
a
} else {b}
result
#[1] 6
The last line in if block was saved in result. Similarly, in your case the string values are "returned" implicitly.

Parameter file in ArgParse.jl?

Python's argparse has a simple way to read parameters from a file:
https://docs.python.org/2/library/argparse.html#fromfile-prefix-chars
Instead of passing your arguments one by one:
python script.py --arg1 val1 --arg2 val2 ...
You can say:
python script.py #args.txt
and then the arguments are read from args.txt.
Is there a way to do this in ArgParse.jl?
P.S.: If there is no "default" way of doing this, maybe I can do it by hand, by calling parse_args on a list of arguments read from a file. I know how to do this in a dirty way, but it gets messy if I want to replicate the behavior of argparse in Python, where I can pass multiple files with #, as well as arguments in the command line, and then the value of a parameter is simply the last value passed to this parameter. What's the best way of doing this?

This feature is not currently present in ArgParse.jl, although it would not be difficult to add. I have prepared a pull request.
In the interim, the following code suffices for what you need:
# faithful reproduction of Python 3.5.1 argparse.py
# partial copyright Python Software Foundation
function read_args_from_files(arg_strings, prefixes)
new_arg_strings = AbstractString[]
for arg_string in arg_strings
if isempty(arg_string) || arg_string[1] ∉ prefixes
# for regular arguments, just add them back into the list
push!(new_arg_strings, arg_string)
else
# replace arguments referencing files with the file content
open(arg_string[2:end]) do args_file
arg_strings = AbstractString[]
for arg_line in readlines(args_file)
push!(arg_strings, rstrip(arg_line, '\n'))
end
arg_strings = read_args_from_files(arg_strings, prefixes)
append!(new_arg_strings, arg_strings)
end
end
end
# return the modified argument list
return new_arg_strings
end
# preprocess args, then parse as usual
ARGS = read_args_from_files(ARGS, ['#'])
args = parse_args(ARGS, s)

the difference between = and <- operator in the function system.time()

I am using the function system.time() and I have discovered something which surprises me. I often use the allocation symbol “=” instead of “<-”. I am aware most R users use “<-” but I consider “=” clearer in my codes. Thus, I used “=” to allocate a value in a function system.line() and the following error message appeared : Error: unexpected '=' in "system.time(a[,1] ="
Here is the code :
a = matrix(1, nrow = 10000)
require(stats)
system.time(a[,1] = a[,1]*2) #this line doesn't work
#Error: unexpected '=' in "system.time(a[,1] ="
system.time(a[,1] = a[,1]*2) #this line works
system.time(for(i in 1:100){a[,1] = a[,1]*i}) #this line works!!!!
I found : Is there a technical difference between "=" and "<-" which explains that I can’t use “=” in a function to allocate since it is the symbol to assign argument in a function. But I have been surprised to see that it can work sometimes (see following code).
Does anyone know why it works here? (also why it doesn't work in the first case since I guess, a[,1] is not a parameter of the function system.time()...)
Thank you very much.
Edwin.

Wrap your code in { ... } braces and it will work:
system.time({a[,1] = a[,1]*2})
user system elapsed
0 0 0
From ?"<-"
The operators <- and = assign into the environment in which they are
evaluated. The operator <- can be used anywhere, whereas the operator
= is only allowed at the top level (e.g., in the complete expression typed at the command prompt) or as one of the subexpressions in a
braced list of expressions.

In system.time(a[,1] = a[,1]*2) the equals sign does not mean assignment, it is interpreted as an attempt to bind a "named argument"; but system.time does not have an argument of that name.
In system.time(for(i in 1:100){a[,1] = a[,1]*i}) the equals sign really is doing an assignment; and that works fine.
If you wrote system.time(a[,1] <- a[,1]*2) the <- can only mean assignment, not argument binding, and it works!
But beware! If you wrote system.time(a[,1] < - a[,1]*2), it also "works" but probably doesn't do what you meant!

Why the "=" R operator should not be used in functions?

The manual states:
The operator ‘<-’ can be used anywhere,
whereas the operator ‘=’ is only allowed at the top level (e.g.,
in the complete expression typed at the command prompt) or as one
of the subexpressions in a braced list of expressions.
The question here mention the difference when used in the function call. But in the function definition, it seems to work normally:
a = function ()
{
b = 2
x <- 3
y <<- 4
}
a()
# (b and x are undefined here)
So why the manual mentions that the operator ‘=’ is only allowed at the top level??
There is nothing about it in the language definition (there is no = operator listed, what a shame!)

The text you quote says at the top level OR in a braced list of subexpressions. You are using it in a braced list of subexpressions. Which is allowed.
You have to go to great lengths to find an expression which is neither toplevel nor within braces. Here is one. You sometimes want to wrap an assignment inside a try block: try( x <- f() ) is fine, but try( x = f(x) ) is not -- you need to either change the assignment operator or add braces.

Expressions not at the top level include usage in control structures like if. For example, the following programming error is illegal.
> if(x = 0) 1 else x
Error: syntax error
As mentioned here: https://stackoverflow.com/a/4831793/210673
Also see http://developer.r-project.org/equalAssign.html

Other than some examples such as system.time as others have shown where <- and = have different results, the main difference is more philisophical. Larry Wall, the creater of Perl, said something along the lines of "similar things should look similar, different things should look different", I have found it interesting in different languages to see what things are considered "similar" and which are considered "different". Now for R assignment let's compare 2 commands:
myfun( a <- 1:10 )
myfun( a = 1:10 )
Some would argue that in both cases we are assigning 1:10 to a so what we are doing is similar.
The other argument is that in the first call we are assigning to a variable a that is in the same environment from which myfun is being called and in the second call we are assigning to a variable a that is in the environment created when the function is called and is local to the function and those two a variables are different.
So which to use depends on whether you consider the assignments "similar" or "different".
Personally, I prefer <-, but I don't think it is worth fighting a holy war over.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

pyarrow compute expression in reticulate - r

Related

What is the option in Julia REPL that allows only a limited output?

return value of if statement in r

Parameter file in ArgParse.jl?

the difference between = and <- operator in the function system.time()

Why the "=" R operator should not be used in functions?

Categories

Resources