Selecting from a Julia Dataframe using the `contains` function - julia

I have a DataFrame df with a column named "cond". One of the values in this column is "aer". To select all the rows with cond == "aer", this code works:
select(:(cond .== "aer"), df)
But this doesn't
select(:(contains(["aer"],cond)), df)
It fails with the error:
ERROR: all SubDataFrame indices must be > 0
in SubDataFrame at /Users/seanmackesey/.julia/DataFrames/src/dataframe.jl:1007
in sub at /Users/seanmackesey/.julia/DataFrames/src/dataframe.jl:1020
in select at /Users/seanmackesey/.julia/DataFrames/src/dataframe.jl:1031
I looked at the source but fail to understand what's going on here. What are the general limitations on what I can put in expression predicates like this?

I think the problem is that contains isn't a vectorized operation:
julia> contains(["aer"], ["aer", "aer", "abr"])
false
This probably means that it's not generating valid indices.
In general, the family of expressions that should work in select are those that generate a vector of indices. There are a few broken cases, but I believe the problem in this case is just that the predicate isn't producing useful indices.


Question regarding using function c() in R coding

I am studying data analytics, and my teacher gave the class this question: "use one-sigma to find any outliers in vector D". He gave his answer below, but I do not understand why he called Out=c() before the for loop and then used Out again in c(Out,o). Could you help me answer this question? Thank you!
D=c(4,6,1,2,8,11)
xbar=mean(D)
std=sd(D)
L=xbar-std
U=xbar+std
Out=c()
for(j in 1:length(D)){
if(D[j]<L | D[j]>U) {o=D[j]} else{o=NULL}
Out=c(Out,o)}
Out=c() initializes your output. It starts out empty: c() with no arguments returns NULL, which behaves like a zero-length vector (it is not a data frame). The for loop iterates j over the positions 1 to length(D). For each element D[j], it evaluates the conditional statement if(D[j]<L | D[j]>U) {o=D[j]} else{o=NULL} and then appends the result to Out with Out=c(Out,o); appending NULL leaves Out unchanged. Hope this helps.
The need for an object named Out to exist before entering the for-loop comes from how the c and [<- functions are designed. They need the name to exist in the table of objects that the R interpreter maintains. You used "=", but in that context it is really the <- function, the assignment operator, that is being used. The code in the question doesn't appear to use that operator, but it is actually being called when the "=" sign in Out=c(Out,o) is used. You cannot assign a value to the Out on the LHS of the assignment by appending to it if the Out on the RHS doesn't already have a value (not even a value of length 0) within the R data objects list when the c function tries to access its value.
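As a minimal sketch of the loop described above (same data as the question; the name `vectorized` and the use of `seq_along` are my own choices, not from the original), this shows that c() starts as NULL, that appending NULL changes nothing, and that logical indexing gives the same result without a loop:

```r
D <- c(4, 6, 1, 2, 8, 11)
L <- mean(D) - sd(D)   # lower one-sigma bound
U <- mean(D) + sd(D)   # upper one-sigma bound

Out <- c()             # c() with no arguments is NULL
is.null(Out)

for (j in seq_along(D)) {
  # o is the value itself for an outlier, NULL otherwise
  o <- if (D[j] < L | D[j] > U) D[j] else NULL
  Out <- c(Out, o)     # c(Out, NULL) returns Out unchanged
}

# Vectorized equivalent: keep the elements outside [L, U]
vectorized <- D[D < L | D > U]
```

Both Out and vectorized should end up holding the same outliers; the vectorized form is usually preferred in R because it needs no accumulator at all.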
The <- operator is really a function disguised as an infix operator. You can demonstrate this with:
`<-`(my.out , 4)
> my.out
[1] 4
It also has an indexed assignment version [<- which requires that the named object on the LHS exist. This is another source of error for for-loop users. If the named LHS object given to [<- doesn't exist at the time the loop is run, then the first time through the loop you will get an error:
rm(my.out2) #make sure it doesn't exist
for (i in 1:10) { my.out2[i] <- 4 } # LHS doesn't exist, but RHS value exists
#Error: object 'my.out2' not found
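A hedged sketch of the usual fix: create the object, even at length zero, before the loop, and the indexed assignment works. The use of `numeric(0)`, `exists`, and `tryCatch` here is my own illustration and not part of the original answer.

```r
if (exists("my.out2")) rm(my.out2)   # make sure it doesn't exist

# Without an existing object, the indexed assignment fails
msg <- tryCatch({
  for (i in 1:10) { my.out2[i] <- 4 }
  "no error"
}, error = function(e) conditionMessage(e))

# With a zero-length object created first, the same loop succeeds
my.out2 <- numeric(0)
for (i in 1:10) { my.out2[i] <- 4 }
length(my.out2)
```

Preallocating the full length, for example with numeric(10), would also work and avoids growing the vector on every iteration.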

Explanation of subsetting

Can anyone explain what this line t[exists,][1:6,] is doing in the code below and how that subsetting works?
t<-trees
t[1,1]= NA
t[5,3]= NA
t[1:6,]
exists<-complete.cases(t)
exists
t[exists,][1:6,]
The complete.cases function will check the data frame and will return a vector of TRUE and FALSE where a TRUE indicates a row with no missing data. The vector will be as long as there are rows in t.
The t[exists,] part will subset the data so that only rows where exists is TRUE are kept; the rows that have missing data are FALSE in exists and are removed. The [1:6,] then takes the first 6 of the remaining rows, i.e. the first 6 rows with no missing data.
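To see which rows survive, here is a small sketch using the same built-in trees data. I use the names `tr`, `ok`, and `first_six` (my own, hypothetical names) to avoid masking the base functions t() and exists():

```r
tr <- trees                # built-in dataset with 31 rows
tr[1, 1] <- NA
tr[5, 3] <- NA

ok <- complete.cases(tr)   # FALSE for rows 1 and 5, TRUE elsewhere
which(!ok)                 # positions of the incomplete rows

first_six <- tr[ok, ][1:6, ]
rownames(first_six)        # original row labels of the rows that were kept
```

The row labels make the two steps visible: rows 1 and 5 are dropped first, so the "first six" rows of the result carry the original labels of the next complete rows.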
Some background
In R, [ is a function like any other. R parses t[exists, ] as
`[`(t, exists, ) # don't forget the backticks, and note the blank column argument!
Indeed you can always call [ with the backtick-and-parentheses syntax, or even crazier use it in constructions like
as.data.frame(lapply(t[exists, ], `[`, 1:6, ))
which, believe it or not, is (almost) equivalent to t[exists,][1:6,].
The same is true for functions like [[, $, and more exotic stuff like names<-, which is a special function to assign argument value to the names attribute of an object. We use functions like this all the time with syntax like
names(iris) <- tolower(names(iris))
without realizing that what we're really doing is
iris <- `names<-`(iris, tolower(names(iris)))
And finally, you can type
?`[`
for documentation, or type
`[`
to return the definition, just like any other function.
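A brief sketch of calling these functions by their backticked names; the data frame `df` and the result names are hypothetical, chosen only for illustration:

```r
df <- data.frame(A = 1:2, B = 3:4)

# Replacement function called directly; the result must be assigned somewhere
df2 <- `names<-`(df, tolower(names(df)))
names(df2)

# `[[` and `$` are ordinary functions too
col_a <- `[[`(df2, "a")
col_b <- `$`(df2, b)
```

Calling the functions this way is rarely useful in everyday code, but it helps when passing them to lapply and friends.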
What t[exists,][1:6,] does
The simple answer is that R parses t[exists,][1:6,] as something like:
1. Get the value of t.
2. From the result of step 1, get the rows that correspond to TRUE elements of exists.
3. From the result of step 2, get rows with row numbers in the vector 1:6, i.e. rows 1 through 6.
The more complicated answer is that this is handled by the parser as:
`[`(`[`(t, exists, ), 1:6, ) # yes, this has blank arguments
which a human can interpret as
temporary_variable_1 <- `[`(t, exists, )
temporary_variable_2 <- `[`(temporary_variable_1, 1:6, )
print(temporary_variable_2) # implicitly, sending an object by itself to the console will `print` that object
Interestingly, because you typically can't pass blank arguments in R, certain constructions are impossible with the bracket function, like eval(call("[", t, exists, )) which will throw an undefined columns selected error.
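The equivalence between the chained brackets and the nested function calls can be checked directly. This is a sketch under the same setup as the question, with `tr` and `ok` as my own stand-in names for t and exists:

```r
tr <- trees
tr[1, 1] <- NA
tr[5, 3] <- NA
ok <- complete.cases(tr)

chained <- tr[ok, ][1:6, ]

# Same operation written as explicit calls, step by step
step1 <- `[`(tr, ok, )        # blank third argument = "all columns"
step2 <- `[`(step1, 1:6, )

# And as a single nested call
nested <- `[`(`[`(tr, ok, ), 1:6, )

identical(chained, step2)
identical(chained, nested)
```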

Subsetting of Lists in R

I had a few questions about subsetting a named list in R using the [] operator:
For example, consider the list formals <- list(x = DOUBLE, y = DOUBLE, z = NULL). In this example, DOUBLE is treated as a symbol in R.
1) How should I retrieve all elements that are not equal to NULL? I tried formals[formals != NULL] but this only returns an object of type list with no members.
2) How should I retrieve elements whose names satisfy a condition? For example, how would I get all elements whose names are not z? I could use names(formals) but this is cumbersome and I was hoping for a quick solution using [].
Another option for the first question:
Filter(Negate(is.null), formals)
For the second case, you'll have to use names. Here's one way:
formals[names(formals) != 'z']
formals is actually a function in R. It's best to avoid names of functions when naming your variables.
This will work for your first question:
formals[!unlist(lapply(formals, is.null))]
I don't think you can avoid using names for the second question.
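Putting both answers together as a runnable sketch. I assume the symbol is created with quote(DOUBLE), and I call the list `fml` (a hypothetical name) so that it does not mask the base function formals():

```r
fml <- list(x = quote(DOUBLE), y = quote(DOUBLE), z = NULL)

# 1) Drop the NULL elements, two equivalent ways
non_null <- Filter(Negate(is.null), fml)
non_null2 <- fml[!vapply(fml, is.null, logical(1))]

# 2) Keep the elements whose name is not "z"
not_z <- fml[names(fml) != "z"]

names(non_null)
names(not_z)
```

vapply is used here instead of unlist(lapply(...)) only because it guarantees a logical vector; either form works for this list.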

lapply fails, but the function works fine for each individual input argument

Many thanks in advance for any advice or hints.
I'm working with data frames. The simplified coding is as follows:
`
f<-function(name){
x<-tapply(name$a,list(name$b,name$c),sum)
1) y<-dataset[[deparse(substitute(name))]]
#where dataset is an already existed list object with names the same as the
#function argument. I would like to avoid inputting two arguments.
z<-vector("list",n) #where n is also defined already
2) for (i in 1:n){z[[i]]<-x[y[[i]],i]}
...
}
lapply(list_names,f)
`
The warning message is:
In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
and the output is incorrect. I tried debugging and found the conflict may lie in line 1) and 2). However, when I try f(name) it is perfectly fine and the output is correct. I guess the problem is in lapply and I searched for a while but could not get to the point. Any ideas? Many thanks!
The structure of the data
Thanks Joran. Checking again, I found the problem might not lie in what I had described. I reproduce the full code below; you can copy-paste it to see the error.
n<-4
name1<-data.frame(a=rep(0.1,20),b=rep(1:10,each=2),c=rep(1:n,each=5),
d=rep(c("a1","a2","a3","a4","a5","a6","a7","a8","a9","a91"),each=2))
name2<-data.frame(a=rep(0.2,20),b=rep(1:10,each=2),c=rep(1:n,each=5),
d=rep(c("a1","a2","a3","a4","a5","a6","a7","a8","a9","a91"),each=2))
name3<-data.frame(a=rep(0.3,20),b=rep(1:10,each=2),c=rep(1:n,each=5),
d=rep(c("a1","a2","a3","a4","a5","a6","a7","a8","a9","a91"),each=2))
#d is the name for the observations. d corresponds to b.
dataset<-vector("list",3)
names(dataset)<-c("name1","name2","name3")
dataset[[1]]<-list(c(1,2),c(1,2,3,4),c(1,2,3,4,5,10),c(4,5,8))
dataset[[2]]<-list(c(1,2,3,5),c(1,2),c(1,2,10),c(2,3,4,5,8,10))
dataset[[3]]<-list(c(3,5,8,10),c(1,2,5,7),c(1,2,3,4,5),c(2,3,4,6,9))
f<-function(name){
x<-tapply(name$a,list(name$b,name$c),sum)
rownames(x)<-sort(unique(name$d)) #the row names for
y<-dataset[[deparse(substitute(name))]]
z<-vector("list",n)
for (i in 1:n){
z[[i]]<-x[y[[i]],i]}
nn<-length(unique(unlist(sapply(z,names)))) # the number of names appeared
names_<-sort(unique(unlist(sapply(z,names)))) # the names appeared add to the matrix
# below
m<-matrix(,nrow=nn,ncol=n);rownames(m)<-names_
index<-vector("list",n)
for (i in 1:n){
index[[i]]<-match(names(z[[i]]),names_)
m[index[[i]],i]<-z[[i]]
}
return(m)
}
list_names<-vector("list",3)
list_names[[1]]<-name1;list_names[[2]]<-name2;list_names[[3]]<-name3
names(list_names)<-c("name1","name2","name3")
lapply(list_names,f)
f(name1)
The lapply(list_names,f) call fails, but f(name1) produces exactly the matrix I want. Thanks again.
Why it doesn't work
The issue is that the call stack doesn't look the same in both cases. In lapply, it looks like
[[1]]
lapply(list_names, f) # lapply(X = list_names, FUN = f)
[[2]]
FUN(X[[1L]], ...)
In the expression being evaluated, f is called FUN and its argument name is supplied as the expression X[[1L]].
When you call f directly, the stack is simply
[[1]]
f(name1) # f(name = name1)
Usually this doesn't matter, but with substitute it does, because substitute returns the expression that was supplied for the function argument, not its value. When you get to
y<-dataset[[deparse(substitute(name))]]
inside lapply it's looking for the element in dataset named X[[1L]], and there isn't one, so y is bound to NULL.
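The difference is easy to reproduce in isolation. In this sketch, `show_arg` is a hypothetical helper of mine that does nothing but report what substitute sees; the exact text seen inside lapply (X[[1L]] above, X[[i]] in other R versions) depends on the R version, so the sketch only checks that it is not "name1":

```r
show_arg <- function(name) deparse(substitute(name))

name1 <- data.frame(a = 1)

# Called directly: substitute sees the symbol that was typed
direct <- show_arg(name1)

# Called through lapply: substitute sees lapply's internal expression
via_lapply <- lapply(list(name1 = name1), show_arg)[[1]]

direct
via_lapply
```

Because via_lapply is not a name that exists in dataset, dataset[[via_lapply]] returns NULL, which is where the is.na() warning originates.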
A way to get it to work
The simplest way to deal with this is probably to just have f operate on character strings and pass names(list_names) to lapply. This can be accomplished fairly easily by changing the beginning of f to
f<-function(name){
passed.name <- name
name <- list_names[[name]]
x<-tapply(name$a,list(name$b,name$c),sum)
rownames(x)<-sort(unique(name$d)) #the row names for
y<-dataset[[passed.name]]
# the rest of f...
and changing lapply(list_names, f) to lapply(names(list_names),f). This should give you what you want with nearly minimal modification, but you might also consider renaming some of your variables so the word name isn't used for so many different things--the function names, the argument of f, and all the various variables containing name.
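The pattern of iterating over names rather than values can be sketched on a toy example. Everything here (`frames`, `lookup`, `g`) is hypothetical and much simpler than the f in the question; it only illustrates looking up two parallel lists by the same key:

```r
frames <- list(name1 = data.frame(a = c(1, 2)),
               name2 = data.frame(a = c(3, 4)))
lookup <- list(name1 = 1:3, name2 = 4:6)

g <- function(nm) {
  df <- frames[[nm]]       # the data frame for this key
  extra <- lookup[[nm]]    # the matching entry in the second list
  sum(df$a) + length(extra)
}

# lapply over the names, then restore them on the result
res <- setNames(lapply(names(frames), g), names(frames))
res
```

The setNames step is optional; without it the result of lapply over a character vector is an unnamed list.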

if-else vs ifelse with lists

Why do the if-else construct and the function ifelse() behave differently?
mylist <- list(list(a=1, b=2), list(x=10, y=20))
l1 <- ifelse(sum(sapply(mylist, class) != "list")==0, mylist, list(mylist))
l2 <-
if(sum(sapply(mylist, class) != "list") == 0){ # T: all list elements are lists
mylist
} else {
list(mylist)
}
all.equal(l1,l2)
# [1] "Length mismatch: comparison on first 1 components"
From the ifelse documentation:
‘ifelse’ returns a value with the same shape as ‘test’ which is
filled with elements selected from either ‘yes’ or ‘no’ depending
on whether the element of ‘test’ is ‘TRUE’ or ‘FALSE’.
Your test has length one, so the output is truncated to length 1.
You can also see this illustrated with a simpler example:
ifelse(TRUE, c(1, 3), 7)
# [1] 1
if ( cond) { yes } else { no } is a control structure. It was designed to effect programming forks rather than to process a sequence. I think many people come from SPSS or SAS, whose authors chose "IF" to implement conditional assignment within their DATA or TRANSFORM functions, and so they expect R to behave the same. SAS and SPSS both have implicit FOR-loops in their DATA steps, whereas R came from a programming tradition. R's implicit for-loops are built into the many vectorized functions (including ifelse). The lapply/sapply functions are the more R-savvy way to implement most sequential processing, although they don't succeed at doing lagged variable access, especially if there are any randomizing features whose "effects" get cumulatively handled.
ifelse takes an expression that builds a vector of logical values as its first argument. The second and third arguments need to be vectors of equal length and either the first of them or the second gets chosen. This is similar to the SPSS/SAS IF commands which have an implicit by-row mode of operation.
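A short sketch contrasting the two constructs on the simple example from above (the names r1, r2, r3 are mine):

```r
# ifelse: the result has the shape of the test, here length 1
r1 <- ifelse(TRUE, c(1, 3), 7)

# if/else: returns the whole chosen object, whatever its length
r2 <- if (TRUE) c(1, 3) else 7

# ifelse with a length-2 test: element-wise choice between yes and no
r3 <- ifelse(c(TRUE, FALSE), c(1, 3), 7)

r1
r2
r3
```

In r3 the first element comes from yes and the second from no (recycled), which is the by-row behavior described above.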
For some reason this is marked as a duplicate of
Why does ifelse() return single-value output?
So a workaround for that question is:
a=3
yo <- ifelse(a==1, 1, list(c(1,2)))
yo[[1]]
