Conditional statements with DataFrames [Julia v1.0]

I am porting custom functions over from R and would like to use Julia DataFrames to store my data. I prefer to reference data by column name rather than by, say, array index, hence the DataFrames package.
I simplified the following to illustrate:
if( DataFrame(x=1).x .>1) end
The error is:
ERROR: TypeError: non-boolean (BitArray{1}) used in boolean context
Is there a simple workaround that would allow me to continue using DataFrames?

The expression:
DataFrame(x=1).x .> 1
does the following things:
Creates a DataFrame.
Extracts column x from it.
Compares every element of this column to 1 using the vectorized operation .> (broadcasting, in Julia parlance).
As a result you get the following one-element array:
julia> DataFrame(x=1).x .> 1
1-element BitArray{1}:
false
Unlike R, Julia distinguishes between vectors and scalars, so this is not the same as simply writing false. Moreover, an if statement expects a scalar Bool, not a vector, so something like this works:
if 2 > 1
println("2 is greater than 1")
end
but not something like this:
if DataFrame(x=2).x .> 1
println("success!")
end
However, for instance this would work:
if (DataFrame(x=2).x .> 1)[1]
println("success!")
end
as you extract the first (and only in this case) element from the array.
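If the intent is to test the whole column rather than only its first element, the element-wise result can be reduced to a single Bool with all or any. A minimal sketch in plain Julia, where the vector x is a made-up stand-in for a DataFrame column:

```julia
# Hypothetical stand-in for a DataFrame column such as df.x
x = [2, 5, 7]

mask = x .> 1                  # element-wise comparison, one Bool per element

# Reduce the Boolean array to a single Bool before using it in `if`
if all(mask)
    println("every element is greater than 1")
end

if any(x .> 6)
    println("at least one element is greater than 6")
end

every_above_one = all(mask)    # true
some_above_six = any(x .> 6)   # true
some_above_ten = any(x .> 10)  # false
```

Which reduction is appropriate depends on what the original R code meant by the condition, so this is a choice to make explicitly when porting.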
Note that in R, if you pass a vector with more than one element to a conditional expression, you get a warning like this:
> if (c(T,F)) {
+ print("aaa") } else {print("bbb")}
[1] "aaa"
Warning message:
In if (c(T, F)) { :
  the condition has length > 1 and only the first element will be used
Julia is simply stricter than R about checking types in this case. R makes no distinction between scalars and vectors, whereas Julia does.
EDIT:
length(df) returns the number of columns of a DataFrame (not the number of rows). If you are coming from R, the nrow and ncol functions are easier to remember.
Now, regarding your question, you can write either:
for i in 1:nrow(df)
if df.x[i] > 3
df.y[i] = df.x[i] + 1
end
end
or
bigx = df.x .> 3
df.y[bigx] = df.x[bigx] .+ 1
or
df.y .= ifelse.(df.x .> 3, df.x .+ 1, df.y)
or using DataFramesMeta to shorten the notation:
using DataFramesMeta
@with df begin
df.y .= ifelse.(:x .> 3, :x .+ 1, :y)
end
or
using DataFramesMeta
@byrow! df begin
if :x > 3
:y = :x + 1
end
end
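The loop, the Boolean-mask assignment, and the broadcast ifelse are meant to produce the same result. A small sketch with plain vectors standing in for the columns df.x and df.y (the data values are invented for illustration) lets you check that without any packages:

```julia
# Hypothetical data standing in for the column df.x
x = [1, 4, 2, 6]

# 1) Explicit loop with a scalar condition
y_loop = zeros(Int, length(x))
for i in eachindex(x)
    if x[i] > 3
        y_loop[i] = x[i] + 1
    end
end

# 2) Boolean mask used for indexing on both sides
y_mask = zeros(Int, length(x))
bigx = x .> 3
y_mask[bigx] = x[bigx] .+ 1

# 3) Broadcast ifelse, writing in place with .=
y_ifelse = zeros(Int, length(x))
y_ifelse .= ifelse.(x .> 3, x .+ 1, y_ifelse)

println(y_loop)
println(y_mask)
println(y_ifelse)
```

All three vectors end up as [0, 5, 0, 7]: only the elements where x exceeds 3 are overwritten.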

Related

Dynamically accessing globals through string interpolation

You can call variables through a loop in Python like this:
var1 = 1
var2 = 2
var3 = 3
for i in range(1,4):
print(globals()[f"var{i}"])
This results in:
1
2
3
Now I'm looking for the equivalent way to call variables using interpolated strings in Julia! Is there any way?
PS: I didn't ask "How to get a list of all the global variables in Julia's active session". I asked for a way to call a local variable using an interpolation in a string.
PS: this is dangerous.
Code:
var1 = 1
var2 = 2
var3 = 3
for i in 1:3
varname = Symbol("var$i")
println(getfield(Main, varname))
end
List of globals:
vars = filter!(x -> !( x in [:Base, :Core, :InteractiveUtils, :Main, :ans, :err]), names(Main))
Values of globals:
getfield.(Ref(Main), vars)
To display names and values you can either just use varinfo() or, e.g., do DataFrame(name=vars, value=getfield.(Ref(Main), vars)).
You don't. Use a collection like an array instead:
julia> values = [1, 2, 3]
3-element Vector{Int64}:
1
2
3
julia> for v in values
println(v)
end
1
2
3
As suggested in earlier comments, dynamically finding the variable names really isn't the approach you want to go for.
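If the values really do need to be looked up by an interpolated name, a Dict keyed by strings keeps that lookup explicit without touching globals. A minimal sketch, assuming the names var1 to var3 from the question (the Dict name vars is hypothetical):

```julia
# Store the values under string keys instead of in separate global variables
vars = Dict("var1" => 1, "var2" => 2, "var3" => 3)

for i in 1:3
    println(vars["var$i"])   # string interpolation builds the key
end

looked_up = [vars["var$i"] for i in 1:3]
```

Here looked_up is [1, 2, 3], matching the output of the Python loop in the question.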

Compare if the elements of two vectors are equal in Julia

I am trying to get the same behavior as R's == when applied to two vectors that get the comparison for each element in the vector.
a <- c(1,2 ,3 )
b <- c(1, 2 ,5 )
a==b
#[1] TRUE TRUE FALSE
In Julia, I came up with a very clumsy way of doing it, but now I wonder if there are easier ways out there.
a = [1 2 3 ]
b = [1 2 5 ]
a == b #this does not return what I want.
#false
rows_a =size(a)[2]
equal_terms =ones(rows_a)
for i in 1:rows_a
equal_terms[i] =(a[i] == b[i])
end
equal_terms
#1.0
#1.0
#0.0
Thank you in advance.
In Julia you need to vectorize your operation:
julia> a .== b
1×3 BitMatrix:
1 1 0
Unlike R (and NumPy in Python), Julia requires explicit vectorization each time you need it. Any operator or function call can be vectorized just by adding a dot (.).
Please note that a and b as written are row vectors, which Julia represents as 1×n matrices. A Vector in Julia is always a column; write a = [1, 2, 3], with commas, to get one.
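For comparison, here is a short sketch of the same element-wise test on true (column) vectors, written with commas, together with two common follow-ups: counting the matches and finding their positions.

```julia
a = [1, 2, 3]    # commas give a Vector, not a 1×3 matrix
b = [1, 2, 5]

whole = (a == b)              # a single Bool: are the two vectors equal as a whole?
equal_terms = a .== b         # element-wise: true, true, false

n_equal = count(equal_terms)      # number of matching positions
positions = findall(equal_terms)  # indices where the elements match

println(equal_terms)
```

The result of a .== b is a Boolean array, so there is no need to preallocate a vector of ones and fill it in a loop.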

return indices of duplicated elements corresponding to the unique elements in R

Does anyone know if there's a built-in function in R that can return indices of duplicated elements corresponding to the unique elements?
For instance I have a vector
a <- ["A","B","B","C","C"]
unique(a) will give ["A","B","C"]
duplicated(a) will give [F,F,T,F,T]
Is there a built-in function to get a vector of indices, of the same length as the original vector a, that shows the location of a's elements in the unique vector (which is [1,2,2,3,3] in this example)?
i.e., something like the output variable "ic" in the MATLAB function "unique" (if we let c = unique(a), then a = c(ic,:)).
http://www.mathworks.com/help/matlab/ref/unique.html
Thank you!
We can use match
match(a, unique(a))
#[1] 1 2 2 3 3
Or convert to factor and coerce to integer
as.integer(factor(a, levels = unique(a)))
#[1] 1 2 2 3 3
data
a <- c("A","B","B","C","C")
This should work:
cumsum( !duplicated( sort( a)) ) # once you replace MATLAB syntax with R syntax.
Or just:
as.numeric(factor(a) )
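For readers porting this to Julia, Base provides indexin, which plays a role similar to R's match here. A small sketch (the variable names u and ic are illustrative, with ic borrowed from the MATLAB naming in the question):

```julia
a = ["A", "B", "B", "C", "C"]

u = unique(a)               # ["A", "B", "C"], in order of first appearance

# indexin returns, for each element of a, its position in u (or nothing if absent).
# Every element of a is present in u, so converting to Int is safe here.
ic = Int.(indexin(a, u))

println(ic)
```

With this input ic is [1, 2, 2, 3, 3], and u[ic] reconstructs a.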

Accessing the object referenced in a logical condition

I am writing an xor function for a class, so although any recommendations on currently existing xor functions would be nice, I have to write my own. I have searched online, but have not been able to find any solution so far. I also realize my coding style may be sub-optimal. All criticisms will be welcomed.
I am writing a function that will return an element-wise TRUE iff exactly one condition is true. Conditions are given as strings, else they will throw an error due to unexpected symbols (e.g. >). I would like to output a list of the pairwise elements of a and b for which my xor function is true.
The problem is that, while I can create a logical vector of xor T/F based on the conditions, I cannot access the objects directly to subset them. It is the conditions that are function arguments, not the objects themselves.
'%xor%' <- function(condition_a, condition_b) {
# Perform an element-wise "exclusive or" on the conditions being true.
if (length(eval(parse(text= condition_a))) != length(eval(parse(text= condition_b))))
stop("Objects are not of equal length.") # Objects must be equal length to proceed
logical_a <- eval(parse(text= condition_a)) # Evaluate and store each logical condition
logical_b <- eval(parse(text= condition_b))
xor_vector <- logical_a + logical_b == 1 # Only one condition may be true.
xor_indices <- which(xor_vector == TRUE) # Store a vector which gives the indices of the elements which satisfy the xor condition.
# Somehow access the objects in the condition strings
list(a = a[xor_indices], b = b[xor_indices]) # Desired output
}
# Example:
a <- 1:10
b <- 4:13
"a < 5" %xor% "b > 4"
Desired output:
$a
[1] 1 5 6 7 8 9 10
$b
[1] 4 8 9 10 11 12 13
I have thought about doing a combination of ls() and grep() to find existing object names in the conditions, but this would run into problems if the objects in the conditions were not initialized. For example, if someone tried to run "c(1:10) < 5" %xor% "c(4:13) > 4".
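The same selection can be written without string evaluation when the conditions are passed as already-evaluated logical vectors alongside the objects. A sketch in Julia, with a hypothetical helper named xor_select; it only illustrates the idea and is not a drop-in replacement for the R operator above:

```julia
# Hypothetical helper: keep the elements of a and b where exactly one condition holds
function xor_select(a, b, cond_a, cond_b)
    length(cond_a) == length(cond_b) || error("Conditions are not of equal length.")
    keep = xor.(cond_a, cond_b)      # element-wise exclusive or
    return (a = a[keep], b = b[keep])
end

a = collect(1:10)
b = collect(4:13)

result = xor_select(a, b, a .< 5, b .> 4)

println(result.a)
println(result.b)
```

For the example data this gives a = [1, 5, 6, 7, 8, 9, 10] and b = [4, 8, 9, 10, 11, 12, 13], the desired output listed above. Because the objects are passed explicitly, the function never has to recover them from the text of a condition.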

Julia DataFrames.jl - filter data with NA's (NAException)

I am not sure how to handle NA within Julia DataFrames.
For example with the following DataFrame:
> import DataFrames
> a = DataFrames.@data([1, 2, 3, 4, 5]);
> b = DataFrames.@data([3, 4, 5, 6, NA]);
> ndf = DataFrames.DataFrame(a=a, b=b)
I can successfully execute the following operation on column :a
> ndf[ndf[:a] .== 4, :]
but if I try the same operation on :b I get an error NAException("cannot index an array with a DataArray containing NA values").
> ndf[ndf[:b] .== 4, :]
NAException("cannot index an array with a DataArray containing NA values")
while loading In[108], in expression starting on line 1
in to_index at /Users/abisen/.julia/v0.3/DataArrays/src/indexing.jl:85
in getindex at /Users/abisen/.julia/v0.3/DataArrays/src/indexing.jl:210
in getindex at /Users/abisen/.julia/v0.3/DataFrames/src/dataframe/dataframe.jl:268
Which is because of the presence of NA value.
My question is: how should DataFrames with NA typically be handled? I can understand that a > or < operation against NA would be undefined, but == should work (no?).
What's your desired behavior here? If you want to do selections like this, you can make the condition (not NA) AND (equal to 4). If the first test fails then the second one never happens.
using DataFrames
a = @data([1, 2, 3, 4, 5]);
b = @data([3, 4, 5, 6, NA]);
ndf = DataFrame(a=a, b=b)
ndf[(!isna(ndf[:b]))&(ndf[:b].==4),:]
In some cases you might just want to drop all rows with NAs in certain columns
ndf = ndf[!isna(ndf[:b]),:]
Related to a question I asked before: you can change this NA behavior directly in the module's source code if you want. In the file indexing.jl there is a function named Base.to_index(A::DataArray), beginning at line 75, where you can alter the code to set NAs in the boolean array to false. For example you can do the following:
# Indexing with NA throws an error
function Base.to_index(A::DataArray)
A[A.na] = false
any(A.na) && throw(NAException("cannot index an array with a DataArray containing NA values"))
Base.to_index(A.data)
end
Filtering out NAs with isna() makes the source code less readable and, in big formulas, costs performance:
@timeit ndf[(!isna(ndf[:b])) & (ndf[:b] .== 4),:] #3.68 µs per loop
@timeit ndf[ndf[:b] .== 4, :] #2.32 µs per loop
## 71x179 2D Array
@timeit dm[(!isna(dm)) & (dm .< 3)] = 1 #14.55 µs per loop
@timeit dm[dm .< 3] = 1 #754.79 ns per loop
In many cases you want to treat NA as a distinct value, i.e. assume that everything that is NA is "equal" and everything else is different.
If this is the behaviour you want, the current DataFrames API doesn't help you much, as both (NA == NA) and (NA == 1) return NA instead of the expected boolean results.
This makes DataFrame filters written as loops extremely tedious:
function filter(df,c)
for r in eachrow(df)
if (isna(c) && isna(r[:c])) || ( !isna(r[:c]) && r[:c] == c )
...
and breaks select-like functionality in DataFramesMeta.jl and Query.jl when NA values are present or requested.
One workaround is to use isequal(a,b) in place of a==b:
test = @where(df, isequal.(:a,"cc"), isequal.(:b,NA) ) #from DataFramesMeta.jl
I think the new syntax in Julia is to use ismissing:
# drop NAs
df = DataFrame(col=[0,1,1,missing,0,1])
df = df[.!ismissing.(df.col), :]
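The behaviour of missing in comparisons can be checked on a plain vector without any packages. The sketch below uses a made-up vector col in place of a DataFrame column and shows three Base tools: ismissing for dropping missing values, and isequal or coalesce for getting a pure Bool mask that is safe to index with.

```julia
# Hypothetical stand-in for a DataFrame column containing a missing value
col = [0, 1, 1, missing, 0, 1]

raw = col .== 1                      # three-valued: the 4th entry is missing, not false

# Two ways to get a mask with no missing in it
mask_isequal = isequal.(col, 1)      # isequal never returns missing
mask_coalesce = coalesce.(raw, false) # replace missing by false

# Dropping the missing entries
kept = col[.!ismissing.(col)]
also_kept = collect(skipmissing(col))

println(mask_isequal)
println(kept)
```

Both masks are false at the position of the missing value, so they can be used for indexing where the raw comparison result cannot.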
