Why does the matchit subclass look so complicated? - r

I am trying to match test and control units using a string that concatenates different values (for example: 1_4_5). I ran matchit using exact matching on that variable and Mahalanobis distance matching on 2 or 3 other variables. The matching result comes back with a few new columns, such as weight and subclass. I understand that subclass is used to tell which test units are matched with which control units. However, my subclass column looks very complicated compared with the usual subclass column (just numbers). Mine looks like this: 1_1_1_1_1_1_1_1_1_1_4, etc. Did I do something incorrectly for the subclass to be produced this way?
Thanks

Related

Extracting Nested Elements of an R List Generated by Loops

For lists within lists produced by a loop in R (in this example a list of caret models) I get an object with an unpredictable length and names for inner elements, such as list[[1]][[n repeats of 1]][[2]] where the internal [[1]] is repeated multiple times according to the function's input. In some cases, the length of n is not known, when accessing some older stored lists where input was not saved. While there are ways to work within a list index, like with list[length(list)], there appears to be no way to do this with repeated nested elements. This has made accessing them and passing them to various jobs awkward. I assume there is an efficient way to access them that I have missed, so I'm asking for help to do so, with an example case given below.
The function I'm writing returns a list containing several outputs. Because the output structure is complicated, the final list is returned with something like:
return(list(listOfModels, trainingData, testingData))
listOfModels has variable length, depending on the models given as input and potentially on other conditions evaluated inside the function. It is built by:
listOfModels <- list(c(listOfModels, list(trainedModel)))
Here "trainedModel" refers to the most recently trained model generated in the loop. The models used, and how many of them, may vary each time depending on choice. The unfortunate result is a complicated set of lists nested within a list.
That is, output[[1]] contains the models I want to access more efficiently, which are themselves list objects, while output[[2]] and output[[3]] are the dataframes used to train and evaluate the models. While accessing the dataframes is simple and has a defined, reproducible structure each time (simply being output[[2]], output[[3]] every time), output[[1]] becomes a mess. E.g., something like the following follows the "output[[1]]":
The only way I have found to access these is to exploit the fact that [[1]] is appended to output[[1]] some number of times before a final [[2]]. All of the nested elements except one have a [[2]] at the end. Given this pattern, there is an ugly solution that works, but it is not a desirable format to work with. E.g., after evaluating n models given by a vector of strings called inputList, and with the list returned by the function called "output", I can have [[1]] repeated tens to hundreds of times.
for (i in 0:(length(inputList) - 1)) {
  eval(rlang::parse_expr(paste0(c("output", rep("[[1]]", 1 + i), "[[2]]"), collapse = "")))
}
This could be used to pass all models to some downstream task, such as making predictions on new data. Where the length of inputList is not known, it could be found by repeating this until an error occurs, or something similar. The approach can be modified to call on a specific part of the list, for example a certain model within inputList, if I know the original input and can find the number for that model. This code is bulky compared to simply calling output[[1]][[n]] in some predictable format for any n. A bigger problem arises when accessing older saved runs where the input list of models was not saved, leaving n unknown. I don't know of any way to use something like length() or lengths() to count how many nested elements exist within a list. (In my example, output[[1]] has length 1, no matter how many repeated [[1]] elements there are.)
I believe the simplest solution is to change the way the list is saved by the function, so that I can access it by a systematic reference. However, I have a bunch of old lists which I still want to access and do some work with, and I'd also like to have better control when working with lists in any case. So any help would be greatly appreciated.
I expected there would be some way to query the structure of nested R lists, which could be used to pass nested elements to separate functions, without having to use very long repetition of brackets.
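One possible way to query such a structure, sketched under assumptions: recurse through the list and collect every element that a predicate recognises as a model. The names flatten_models and is_model are hypothetical, and the character strings below merely stand in for trained models. For caret objects the predicate would presumably be something like function(x) inherits(x, "train").

```r
# Rebuild the nesting from the question, with strings standing in for models
listOfModels <- list()
for (trainedModel in c("m1", "m2", "m3")) {
  listOfModels <- list(c(listOfModels, list(trainedModel)))
}
output <- list(listOfModels, data.frame(x = 1), data.frame(x = 2))

# Hypothetical helper: collect every element that is_model() accepts,
# however deeply it is nested, in the order the models were added
flatten_models <- function(x, is_model) {
  if (is_model(x)) return(list(x))      # a model: stop descending here
  if (!is.list(x)) return(list())       # neither model nor container: skip
  do.call(c, lapply(x, flatten_models, is_model = is_model))
}

models <- flatten_models(output[[1]], is_model = is.character)
length(models)  # number of stored models, without knowing inputList
models[[2]]     # the second model that was trained
```

Because the predicate is checked before descending, a model that is itself a list is returned whole rather than being taken apart. The same call should work on old saved runs, since it does not need the length of inputList.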

Subsetting list containing multiple classes by same index/vector

I need to subset a list which contains an array as well as a factor variable. Essentially, imagine that each component of the array belongs to a single individual, who is in turn associated with a two-level factor variable (treatment).
list <- list(array = array(rnorm(250, 4, 1), c(5, 5, 10)), treatment = rep(c(1, 2), 5))
Typically when sub-setting multiple components of the array from the first component of the list I would use something like
list$array[,,c(2,4,6)]
This would return the array components in locations 2, 4 and 6. However, for the factor component of the list this wouldn't work, as the subsetting is different. What you would need is this:
list$treatment[c(2,4,6)]
I need to subset a list containing different classes (array and vector) by the same relative index.
You're treating your list of matrices as some kind of 3-dimensional object, but it's not.
Your list$matrices is itself a list as well, which means you can index it as a list too. It doesn't matter whether it is a list of matrices, numerics, plot objects, or whatever.
The data you provided as an example can just be indexed at one level, so list$matrices[c(2,4,6)] works fine.
And I don't really get your question about saving the indices in a numeric vector, what's to stop you from this code?
indices <- c(2,4,6)
mysubset <- list(list$matrices[indices], list$treatment[indices])
EDIT, adding new info for edited question:
I see you actually have a 3-D array now. That is kind of weird, as there is no clear convention for what counts as a "component". I mean, from your question I understand that list$array[,,n] refers to the n-th individual, but from a pure code point of view there is no reason why something like list$array[n,,] couldn't refer to that.
Maybe you got the idea from other languages, but this is not really R-ish; your earlier example with a list of matrices made more sense to me. I think the most logical choice would have been a data.frame with columns matrix and treatment (which is conceptually close to a list with a vector and a list of matrices, but makes it clearer to others what you have).
But anyway, what is your desired output?
If it's just subsetting: with this structure, as there are no constraints on what the content could be, you just have to tell R exactly what you want. There is no single operator that takes a subset of a vector and of the 3rd dimension of an array at the same time. You have to tell R that you want to subset along the 3rd dimension, and that you want to use the same index to subset the vector. Which is basically just the code you already have:
idx <- c(2,4,6)
output <- list(list$array[,,idx], list$treatment[idx])
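If this subsetting is needed in several places, it can be wrapped in a small helper. The following is only a sketch: subset_individuals and dat are hypothetical names (dat is used instead of list to avoid shadowing the base function), and it assumes the individuals sit along the 3rd dimension of the array.

```r
set.seed(1)
dat <- list(array = array(rnorm(250, 4, 1), c(5, 5, 10)),
            treatment = rep(c(1, 2), 5))

# Hypothetical helper: apply the same index to the 3rd array dimension
# and to the treatment vector
subset_individuals <- function(x, idx) {
  list(array = x$array[, , idx, drop = FALSE],  # drop = FALSE keeps 3 dimensions
       treatment = x$treatment[idx])
}

sub <- subset_individuals(dat, c(2, 4, 6))
dim(sub$array)   # 5 5 3
sub$treatment    # 2 2 2
```

With drop = FALSE the result stays a 3-D array even when idx selects a single individual, so downstream code does not have to special-case that situation.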
The way you subset multiple matrices actually gives an error, since you supply an extra dimension although you have already specified which sublist you are in. Hence, to subset the matrices for the given indices you can use my_list[[1]][indices] or, directly, my_list$matrices[indices]. The same goes for treatment: my_list[[2]][indices] or my_list$treatment[indices].

Properly define the members and invariants of a class in R

We have an R package for a certain purpose. The basic data structure is a correlation function, which is a real- or complex-valued function on a smallish number T (about 100) of time slices. We have multiple measurements (N) of it, so at its core it is an N×T matrix. But then there are more things that it can become:
One can bootstrap it with R samples such that it becomes an R×T matrix. However we want to keep the original data, so there is a field for the R×T matrix and another for the N×T matrix.
It can be symmetrized which will cut T in half and also alter various other functions that work with those objects.
Also it can be shifted which takes the difference between consecutive elements and therefore drops one time slice. The first column in the matrix then corresponds to t = 1 and not t = 0 any more, which becomes important in fits to the data.
Correlation functions may have an imaginary part, this is stored as a second real matrix. But they might not.
When doing non-linear operations with the data, we do that once with the average of the original data and each bootstrap sample. If the result is another correlation function, that object will not have “original data” but only the average.
So basically we have a class that can have various fields and only the average of the original data is really common.
To make things worse, there is no formal documentation of the possible members and the invariants associated with them. Coming from C++, where a concise class definition allows me to do encapsulation, the S3 class system in R seems like an invitation for inconsistencies.
This surfaced a few times when some function took such a correlation function as an argument and expected some field to be present when it was not. The code is riddled with lines that just add another field to the class when performing an operation.
Long story short: is there some automatically enforceable way in the S3 class system to have an exhaustive list of all the fields that a class can have? Right now I only see the possibility of documenting them (in English) in the constructor function and just hoping nobody missed a line where fields were added.
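One way this could be approached within S3, as a rough sketch rather than a complete answer: keep the allowed fields in one place, check them in a validator, and re-run the validator whenever a field is set with $<-. Every name below (the class "cf", the field names, validate_cf, new_cf) is hypothetical and not taken from the package in question.

```r
# Hypothetical field names; the real package will differ
CF_ALLOWED  <- c("avg", "orig", "boot", "imag", "symmetrised", "shifted")
CF_REQUIRED <- "avg"

validate_cf <- function(x) {
  fields <- names(unclass(x))
  unknown <- setdiff(fields, CF_ALLOWED)
  if (length(unknown) > 0) {
    stop("unknown field(s): ", paste(unknown, collapse = ", "))
  }
  absent <- setdiff(CF_REQUIRED, fields)
  if (length(absent) > 0) {
    stop("missing field(s): ", paste(absent, collapse = ", "))
  }
  # Example invariant: original (N x T) and bootstrap (R x T) data share T
  if (!is.null(x[["orig"]]) && !is.null(x[["boot"]]) &&
      ncol(x[["orig"]]) != ncol(x[["boot"]])) {
    stop("'orig' and 'boot' must have the same number of time slices")
  }
  invisible(x)
}

# Constructor: the only intended way to create an object of class "cf"
new_cf <- function(avg, ...) {
  validate_cf(structure(list(avg = avg, ...), class = "cf"))
}

# Re-validate whenever a field is set with `$<-`
`$<-.cf` <- function(x, name, value) {
  fields <- unclass(x)
  fields[[name]] <- value
  validate_cf(structure(fields, class = "cf"))
}

corr <- new_cf(avg = c(1.0, 0.5, 0.25))
corr$shifted <- TRUE      # allowed field
try(corr$bootstrap <- 1)  # error: unknown field(s): bootstrap
```

This only guards assignments made through $<-. Code that uses [[<- or manipulates attributes directly would bypass the check unless those methods are overridden as well, so it narrows the problem rather than eliminating it.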

How can I add number formats in Qt's model/view framework?

Let's say I have a QTableView with a corresponding model. This view shows numbers only, and I want different number formats for different columns.
Some examples:
In the price column I want two decimals and a comma as separator.
In the length column I don't want any decimals but I still want the comma separator.
In the duration column I want the same format as the price column but red color if the value is negative.
Now I could do this in the model's data function. Qt.DisplayRole handles the number formatting, and Qt.ForegroundRole can handle the coloring. While the coloring works as intended, doing the number formatting this way breaks sorting.
Is there a more straightforward way to achieve this? I tried Googling around to see what I could find, but good examples of what I'm after elude me.
If you are using QStandardItemModel as your model, you can set the sortRole to use, say, Qt::EditRole instead of Qt::DisplayRole.
If you are using something like QAbstractItemModel, you can override the sort method and do something similar, i.e. use the Qt::EditRole value. You could also wrap a QSortFilterProxyModel around your model and override the lessThan method.

R + Bioconductor : combining probesets in an ExpressionSet

First off, this may be the wrong Forum for this question, as it's pretty darn R+Bioconductor specific. Here's what I have:
library('GEOquery')
GDS = getGEO('GDS785')
cd4T = GDS2eSet(GDS)
cd4T <- cd4T[!fData(cd4T)$symbol == "",]
Now cd4T is an ExpressionSet object which wraps a big matrix with 19794 rows (probesets) and 15 columns (samples). The final line gets rid of all probesets that do not have corresponding gene symbols. Now the trouble is that most genes in this set are assigned to more than one probeset. You can see this by doing
gene_symbols = factor(fData(cd4T)$Gene.symbol)
length(gene_symbols)-length(levels(gene_symbols))
[1] 6897
So there are 6897 more probesets than distinct gene symbols, i.e. many genes map to more than one probeset. I'd like to somehow combine the expression levels of each probeset associated with each gene. I don't care much about the actual probe id for each probe. I'd like very much to end up with an ExpressionSet containing the merged information, as all of my downstream analysis is designed to work with this class.
I think I can write some code that will do this by hand, and make a new expression set from scratch. However, I'm assuming this can't be a new problem and that code exists to do it, using a statistically sound method to combine the gene expression levels. I'm guessing there's a proper name for this also but my googles aren't showing up much of use. Can anyone help?
I'm not an expert, but from what I've seen over the years everyone has their own favorite way of combining probesets. The two methods that I've seen used the most on a large scale are keeping only the probeset which has the largest variance across the expression matrix, and taking the mean of the probesets to create a meta-probeset. For smaller blocks of probesets I've seen people use more intensive methods involving looking at per-probeset plots to get a feel for what's going on ... generally what happens is that one probeset turns out to be the 'good' one and the rest aren't very good.
I haven't seen generalized code to do this - as an example we recently realized in my lab that a few of us have our own private functions to do this same thing.
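For illustration, the two approaches mentioned above might look roughly like this in base R. This is a sketch on a made-up matrix, not Bioconductor code: expr stands in for the expression matrix (what exprs() would return) and genes for the per-probeset gene symbols, and both names are hypothetical.

```r
# Toy expression matrix: 4 probesets (rows) x 3 samples (columns)
expr <- matrix(c(1, 2, 3,
                 2, 4, 6,
                 5, 5, 5,
                 1, 1, 4),
               nrow = 4, byrow = TRUE,
               dimnames = list(c("p1", "p2", "p3", "p4"), c("s1", "s2", "s3")))
genes <- c("A", "A", "B", "A")  # gene symbol for each probeset

# Method 1: keep, per gene, the probeset with the largest variance
v <- apply(expr, 1, var)
keep <- vapply(split(seq_along(genes), genes),
               function(i) i[which.max(v[i])], integer(1))
by_variance <- expr[keep, , drop = FALSE]
rownames(by_variance) <- names(keep)

# Method 2: average all probesets of a gene into one meta-probeset
sums <- rowsum(expr, genes)
by_mean <- sums / as.vector(table(genes)[rownames(sums)])
```

Either result is a plain matrix with one row per gene. Wrapping it back into an ExpressionSet would additionally need Biobase and matching feature and sample annotation, which is not shown here.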
The word you are looking for is 'nsFilter' in the R genefilter package. This function does two major things. It keeps only probesets that have an Entrez Gene ID; the rest are filtered out. And when an Entrez ID has multiple probesets, the one with the largest value of the variance statistic (var.func, IQR by default) is retained and the others are removed. You then have an expression matrix with unique Entrez Gene IDs. Hope this helps.
