Assign list of attributes() to sublist in R - r

I have a dataframe called 'situations' containing list of attributes.
> str(situations)
'data.frame': 24 obs. of 8 variables:
$ ID.SITUATION : Factor w/ 24 levels "cnf_01_be","cnf_02_ch",..: 1 2 3 4 5 6 7 8 9 10 ...
$ ELICITATION.D : Factor w/ 2 levels "NATUREL","SEMI.DIRIGE": 1 1 1 1 1 1 1 1 2 2 ...
$ INTERLOCUTEUR.C : Factor w/ 3 levels "DIALOGUE","MONOLOGUE",..: 2 2 2 2 3 3 3 3 1 1 ...
$ PREPARATION.D : Factor w/ 3 levels "PREPARE","SEMI.PREPARE",..: 2 2 2 2 3 3 3 3 3 3 ...
$ INTERACTIVITE.D : Factor w/ 3 levels "INTERACTIF","NON. INTERACTIF",..: 2 2 2 2 1 1 1 1 3 3 ...
$ MEDIATISATION.D : Factor w/ 3 levels "MEDIATIQUE","NON.MEDIATIQUE",..: 2 2 2 2 2 2 2 2 2 2 ...
$ PROFESSIONNALISATION.C: Factor w/ 1 level "PRO": 1 1 1 1 1 1 1 1 1 1 ...
$ ID.TASK : Factor w/ 5 levels "conference scientifique",..: 1 1 1 1 2 2 2 2 3 3 ...
I have as many observation in this dataframes (24) than i have sublist in a given corpus.
ID situation names (cnf_01_be) correspond to the name of the sublist (cnf_01_be).
I know how to assign individual attributes :
attributes(corpus$cnf_01_be) = situations[1,]
attributes(corpus$cnf_02_ch) = situations[2,]
And retrieve them for a specific purpose :
attr(corpus$cnf_01_be, "ELICITATION.D")
attr(corpus$cnf_02_ch, "ELICITATION.D")
attr(corpus$cnf_02_ch, "PREPARATION.D")
But how can I use for example lapply to assign automatically attributes to all the sublist in my corpus ?
I feel like all my trial are going in the wrong direction :
setattr <- function(x,y) {
attributes(x) <- situations[y,]
return(attributes)
}
...or...
lapply(corpus,setattr)
lapply(corpus, attributes(corpus) <- situations[c(1:length(situations[,1])),])
Thanks in advance!

The main problem with using lapply (and similar approaches) is that they cannot normally change the original object of interest, but rather return a new structure. so if you already have a list "corpus" and just want to change its members' attributes you can't usually do that inside a function.
One way to overcome this limitation is to use eval.parent() call instead of the usual assignment. This function evaluates the assignment expression in the parent environment (the environment that called the function), rather than to the local instances (copies) of the objects you assign. if you use this you don't have to return any value.
Another option would be to create a local copy of your corpus list within the function, add to it all the attributes, then return the whole structure from the function and use it to substitute the old list. if your list is big/complex this is probably not a wise choice
Here is a code that does it. note - this is an ugly code. I'm still looking to see if I can make it simpler, but because of the issues above, i'm not sure there is a much simpler option. Anyway, I hope the following will do the trick for you:
f = function(lname,data) {
snames = eval.parent(parse(text=paste("names(",lname,")")))
for (xn in snames) {
rd = data[match(xn,as.character(data$id)),]
if (nrow(rd)>0) {
tmp___ <<-rd[1,]
cmm = paste("attributes(",lname,"[[",xn,"]]) = tmp___")
eval.parent(parse(text=cmm))
}
}
}
Note that in order to use it you need to supply your list name (as a character string, and not as a variable), and your data frame. In your case the call would be:
f("corpus",situations)
I hope this helps.

Related

R anesrake issue with list names non-binary argument

I am using anesrake to weight some survey data, but am getting a non-binary argument error. The error only occurs after I have added the names to the list to use as targets:
gender1<-c(0.516166000986901,0.483833999013099)
age<-c(0.15828262425613,0.364861110549873,0.429947760183493,0.0469085050104993)
mylist<-list(gender1,age)
names(mylist)<-c("gender1","age")
result<-anesrake(mylist,france,caseid=france$caseid, iterate=TRUE)
Error in x + weights : non-numeric argument to binary operator
In addition: Warning message:
In anesrake(targets, france, caseid = france$caseid, iterate = TRUE) :
Targets for age do not sum to 100%. Adjusting values to total 100%
This also says that the targets for age don't add to 100%, which they do, so also not sure what that's about. If I leave out the names(mylist) bit, I get the following error, presumably because R doesn't know which variables to use, but not a non-binary error:
Error in selecthighestpcts(discrep1, inputter, pctlim) :
No variables are off by more than 5 percent using the method you have chosen, either weighting is unnecessary or a smaller pre-raking limit should be chosen.
The variables in the data frame are called the same as the targets in the list, and are numeric:
> str(france)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 993 obs. of 5 variables:
$ Gender :Classes 'labelled', 'numeric' atomic [1:993] 2 2 2 2 2 2 2 2 2 2 ...
.. ..- attr(*, "label")= chr "Gender"
$ Age2 : num 2 3 2 2 2 2 2 1 2 3 ...
$ gender1: num 2 2 2 2 2 2 2 2 2 2 ...
$ caseid : int 1 2 3 4 5 6 7 8 9 10 ...
$ age : num 2 3 2 2 2 2 2 1 2 3 ...
I have also tried converting gender1 and age to factor variables (as the numbers represent levels of each variable - gender has 2, age has 4), but with the same result. I have used anesrake before successfully, so there must be something I am missing, but cannot see it! Any help greatly appreciated....

Grouping error with lmer

I have a data frame with the following structure:
> t <- read.csv("combinedData.csv")[,1:7]
> str(t)
'data.frame': 699 obs. of 7 variables:
$ Awns : int 0 0 0 0 0 0 0 0 1 0 ...
$ Funnel : Factor w/ 213 levels "MEL001","MEL002",..: 1 1 2 2 2 3 4 4 4 4 ...
$ Plant : int 1 2 1 3 8 1 1 2 3 5 ...
$ Line : Factor w/ 8 levels "a","b","c","cA",..: 2 2 1 1 1 3 1 1 1 1 ...
$ X : int 1 2 3 4 7 8 9 10 11 12 ...
$ ID : Factor w/ 699 levels "MEL_001-1b","MEL_001-2b",..: 1 2 3 4 5 6 7 8 9 10 ...
$ BobWhite_c10082_241: int 2 2 2 2 2 2 0 2 2 0 ...
I want to construct a mixed effect model. I know in my data frame that the random effect I want to include (Funnel) is a factor, but it does not work:
> lmer(t$Awns ~ (1|t$Funnel) + t$BobWhite_c10082_241)
Error: couldn't evaluate grouping factor t$Funnel within model frame: try adding grouping factor to data frame explicitly if possible
In fact this happens whatever I want to include as a random effect e.g. Plant:
> lmer(t$Awns ~ (1|t$Plant) + t$BobWhite_c10082_241)
Error: couldn't evaluate grouping factor t$Plant within model frame: try adding grouping factor to data frame explicitly if possible
Why is R giving me this error? The only other answer I could google fu is that the random effect fed in wasn't a factor in the DF. But as str shows, df$Funnel certainly is.
It is actually not so easy to provide a convenient syntax for modeling functions and at the same time have a robust implementation. Most package authors assume that you use the data parameter and even then scoping issues can occur. Thus, strange things can happen if you specify variables with DF$col syntax since package authors rarely spend a lot of effort to make this work correctly and don't include a lot of unit tests for this.
It is therefore strongly recommended to use the data parameter if the model function offers a formula method. Strange things can happen if you don't follow that praxis (also with other model functions like lm).
In your example:
lmer(Awns ~ (1|Funnel) + BobWhite_c10082_241, data = t)
This not only works, but is also more convenient to write.

R multiway split trees using ctree {partykit}

I want to analyze my data with a conditional inference trees using the ctree function from partykit. I specifically went for this function because - if I understood correctly - it's one of the only ones allowing multiway splits. I need this option because all of my variables are multilevel (unordered) categorical variables.
However, trying to enable multiway split using ctree_control gives the following error:
aufprallentree <- ctree(case ~., data = aufprallen,
control = ctree_control(minsplit = 10, minbucket = 5, multiway = TRUE))
## Error in 1:levels(x) : NA/NaN argument
## In addition: Warning messages:
## 1: In 1:levels(x) :
## numerical expression has 4 elements: only the first used
## 2: In partysplit(as.integer(isel), index = 1:levels(x)) :
## NAs introduced by coercion
Anyone knows how to solve this? Or if I'm mistaken and ctree does not allow multiway splits?
For clarity, an overview of my data: (no NAs)
str(aufprallen)
## 'data.frame': 299 obs. of 10 variables:
## $ prep : Factor w/ 6 levels "an","auf","hinter",..: 2 2 2 2 2 2 1 2 2 2 ...
## $ prep_main : Factor w/ 2 levels "auf","other": 1 1 1 1 1 1 2 1 1 1 ...
## $ case : Factor w/ 2 levels "acc","dat": 1 1 2 1 1 1 2 1 1 1 ...
## $ sense : Factor w/ 3 levels "crashdown","crashinto",..: 2 2 1 3 2 2 1 2 1 2 ...
## $ PO_type : Factor w/ 4 levels "object","region",..: 4 4 3 1 4 4 3 4 3 4 ...
## $ PO_type2 : Factor w/ 3 levels "object","region",..: 1 1 3 1 1 1 3 1 3 1 ...
## $ perfectivity : Factor w/ 2 levels "imperfective",..: 1 1 2 2 1 1 1 1 1 1 ...
## $ mit_Körperteil: Factor w/ 2 levels "n","y": 1 1 1 1 1 1 1 1 1 1 ...
## $ PP_place : Factor w/ 4 levels "back","front",..: 4 1 1 1 1 1 1 1 1 1 ...
## $ PP_place_main : Factor w/ 3 levels "marked","rel",..: 2 3 3 3 3 3 3 3 3 3 ...
Thanks in advance!
A couple of remarks:
The error with 1:levels(x) was a bug in ctree. The code should have been 1:nlevels(x). I just fixed this on R-Forge - so you can check out the SVN from there and manually install the package if you want to use the option now. (Contact me off-list if you need more details on this.) Torsten will probably also make a new CRAN release in the next weeks.
Another function that can learn binary classification trees with multiway splits is glmtree in the partykit package. The code would be glmtree(case ~ ., data = aufprallen, family = binomial, catsplit = "multiway", minsize = 5). It uses parameter instability tests instead of conditional inference for association to determine the splitting variables and adopts the formal likelihood. But in many cases the results are fairly similar to ctree.
In both algorithms, the multiway splits are very basic: If a categorical variable is selected for splitting, then no split selection is done at all. Instead all categories get their own daughter node. There are algorithms that try to determine optimal groupings of categories with a data-driven number of daughter nodes (between 2 and the number of categories).
Even though you have categorical predictor variables with more than two levels you don't need multiway splits. Many algorithms just use binary splits because any multiway split can be represented by a sequence of binary splits. In many datasets, however, it turns out that it is beneficial to not separate all but just a few of the categories in a splitting factor.
Overall my recommendation would be to start out with standard conditional inference trees with binary splits only. And only if it turns out that this leads to many binary splits in the same factor, then I would go on to explore multiway splits.

What is the difference between dataset[,'column'] and dataset$column in R?

If I want to list all rows of a column in a dataset in R, I am able to do it in these two ways:
> dataset[,'column']
> dataset$column
It appears that both give me the same result. What is the difference?
In practice, not much, as long as dataset is a data frame. The main difference is that the dataset[, "column"] formulation accepts variable arguments, like j <- "column"; dataset[, j] while dataset$j would instead return the column named j, which is not what you want.
dataset$column is list syntax and dataset[ , "column"] is matrix syntax. Data frames are really lists, where each list element is a column and every element has the same length. This is why length(dataset) returns the number of columns. Because they are "rectangular," we are able to treat them like matrices, and R kindly allows us to use matrix syntax on data frames.
Note that, for lists, list$item and list[["item"]] are almost synonymous. Again, the biggest difference is that the latter form evaluates its argument, whereas the former does not. This is true even in the form `$`(list, item), which is exactly equivalent to list$item. In Hadley Wickham's terminology, $ uses "non-standard evaluation."
Also, as mentioned in the comments, $ always uses partial name matching, [[ does not by default (but has the option to use partial matching), and [ does not allow it at all.
I recently answered a similar question with some additional details that might interest you.
Use 'str' command to see the difference:
> mydf
user_id Gender Age
1 1 F 13
2 2 M 17
3 3 F 13
4 4 F 12
5 5 F 14
6 6 M 16
>
> str(mydf)
'data.frame': 6 obs. of 3 variables:
$ user_id: int 1 2 3 4 5 6
$ Gender : Factor w/ 2 levels "F","M": 1 2 1 1 1 2
$ Age : int 13 17 13 12 14 16
>
> str(mydf[1])
'data.frame': 6 obs. of 1 variable:
$ user_id: int 1 2 3 4 5 6
>
> str(mydf[,1])
int [1:6] 1 2 3 4 5 6
>
> str(mydf[,'user_id'])
int [1:6] 1 2 3 4 5 6
> str(mydf$user_id)
int [1:6] 1 2 3 4 5 6
>
> str(mydf[[1]])
int [1:6] 1 2 3 4 5 6
>
> str(mydf[['user_id']])
int [1:6] 1 2 3 4 5 6
mydf[1] is a data frame while mydf[,1] , mydf[,'user_id'], mydf$user_id, mydf[[1]], mydf[['user_id']] are vectors.

R: Can't select a specific column in a data frame

I have a problem with a function to select a given column. I have a data frame called Volume from which I want to make a subset DateSearch:
DateSearch = subset(Volume,select=c("TRI",name))
For some reason it does not work. I have used browser(). I can select TRI or name but I can't select both (either with their name or indice). I have tried with and without "".
Does anyone know why is that?
Many thanks,
Vincent
I just did what (I think) you describe:
str(dfrm)
#'data.frame': 20 obs. of 8 variables:
# $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
# $ factor1: Factor w/ 4 levels "Not at all","To a small extent",..: 3 2 3 NA 3 NA 3 NA 4 1 ...
## <snip>
name = "factor1"
subset(dfrm, select=c("ID", name))
No error, .... results as expected.
Examine the spelling carefully. My guess is that you have a space at the beginning or end of the result of the as.character result. Perhaps even a non-printing character? You can use nchar(name) to check.

Resources