I have a simple problem constrained to R. I have what is effectively a sort of binary tree, where only the terminal leaves have values associated with them. A toy example is given below.
Essentially, I perform an operation between the leaves with the greatest depth (in the case of a tie in depth, order doesn't matter). I have made it addition here, but in reality the values get plugged into a more complicated formula.
I am limited to R for my code. This structure can be represented with this command, though I obtain it via other means:
testBranch<-list(list(list(list(20,15),40),list(10,30)),5) #Depth of 4
I have a working function to determine how deep the deepest level is, but nested lists in R are boggling. Any clue how to efficiently find the set of indexes to access the deepest values? For instance, in the toy example above
testBranch[[1]][[1]][[1]]
would give me what I'd like, a list containing 2 elements. Using my addition example, I could then do this:
indexesOI <- getIndexes(testBranch)  # e.g. c(1, 1, 1)
testBranch[[indexesOI]] <- testBranch[[indexesOI]][[1]] + testBranch[[indexesOI]][[2]]
#testBranch now has depth of 3
Resulting in the tree corresponding to step 1 in the toy example, which can be represented in R by:
testBranchStep1<-list(list(list(35,40),list(10,30)),5)
I am open to using packages, if need be. Just not looking to rewrite a whole node class/dfs in R, as I don't have much experience with the class system. I have looked into data.tree, but have had no luck coercing my nested lists into their data structure.
Any help you can provide would be great! Pardon the hastily made ASCII trees. I am largely self-taught and haven't asked many questions here, so please let me know, too, if I need to adjust my formatting! Thanks!
You can do this with data.tree.
library(data.tree)
testBranch <- list(list(list(list(20,15),40),list(10,30)),5)
tree <- FromListSimple(testBranch)
tree
This will print the tree:
       levelName
1 Root
2  °--1
3      ¦--1
4      ¦   °--1
5      °--2
data.tree provides many utility functions and properties (make sure you read the vignettes). To know the depth, in particular, use this:
height <- tree$height
Which yields:
[1] 4
You can then traverse the tree and find the nodes at the deepest level:
maxDepthLeaves <- Traverse(tree, filterFun = function(node) node$level == height)
This traversal is the list of nodes at max level (only one Node in this case). You can then use Get to retrieve from the traversal any value, e.g. the name, the position, or the pathString:
Get(maxDepthLeaves, 'pathString')
Displaying as:
           1
"Root/1/1/1"
Sounds like you are halfway there. Whenever you find the deepest node(s), you can record their index paths in a list. Here's a recursive function in pseudo-code since I don't know R.
If tree is a leaf node
    If current depth is greater than max-depth
        Set max-depth to current depth
        Delete list of indices
        Append current index into list of indices
    Else if current depth is equal to max-depth
        Append current index into list of indices
Else
    For each element in the tree
        Get current index
        Recursively call this function, passing in the current index
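Untested, but translated into R the recursion might look something like this (the function and variable names are just placeholders):
# Sketch: return the index path(s) of the deepest leaves in a nested list.
getDeepestIndexes <- function(tree) {
  paths <- list()       # index paths of the deepest leaves found so far
  maxDepth <- 0         # depth at which they were found
  recurse <- function(node, path, depth) {
    if (!is.list(node)) {                      # leaf node
      if (depth > maxDepth) {                  # deeper than anything seen so far
        maxDepth <<- depth                     # update max-depth
        paths <<- list(path)                   # delete the old indices, keep this one
      } else if (depth == maxDepth) {          # a tie in depth: keep this index too
        paths[[length(paths) + 1]] <<- path
      }
    } else {
      for (i in seq_along(node)) {             # recurse into every child element
        recurse(node[[i]], c(path, i), depth + 1)
      }
    }
  }
  recurse(tree, integer(0), 0)
  paths
}
getDeepestIndexes(testBranch)
# returns list(c(1, 1, 1, 1), c(1, 1, 1, 2)), the index paths to 20 and 15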
Related
By using the IndependentSets module in SageMath, we can list all the independent sets of a graph. Suppose I have a bipartite graph on the symmetric group whose partite sets consist of the even and odd permutations.
How do I enumerate and list all those independent sets which consist of an equal number of elements from the even and the odd permutations? Which methods and functions do I need to use? Is there some built-in function for listing the type of a symmetric group element as even or odd?
My pseudo-code idea would be:
G=BipartiteGraph()
I=IndependentSets(G)
for list in I:
    for i in list:
        if enumerate(type(list[i])=='even')==enumerate(type(list[i])=='odd'):
            add list in list1
print(list1)
However, I encountered the error that list indices must be integers or slices and not permutation group elements. How do I rectify this? Any hints?
For lists within lists produced by a loop in R (in this example, a list of caret models), I get an object with an unpredictable length and unpredictable names for the inner elements, such as list[[1]][[1]]...[[1]][[2]] (with n repeats of [[1]]), where the number of repeats depends on the function's input. In some cases the number of repeats is not known, for example when accessing older stored lists where the input was not saved. While there are ways to work within a list index, like list[length(list)], there appears to be no way to do this with repeated nested elements. This has made accessing them and passing them to various jobs awkward. I assume there is an efficient way to access them that I have missed, so I'm asking for help, with an example case given below.
The function I'm writing creates several outputs and returns them as a list. The final list, which has a complicated structure, is produced by returning something like:
return(list(listOfModels, trainingData, testingData))
listOfModels has variable length, depending on the models given as input and potentially on other conditions evaluated inside the function. It is built by:
listOfModels <- list(c(listOfModels, list(trainedModel)))
where "trainedModel" is the most recently trained model generated in the loop. The models used and the number of them may vary each time depending on choice. An unfortunate result is a complicated set of nested lists within a list.
That is, output[[1]] contains the models I want to access more efficiently (themselves list objects), while output[[2]] and output[[3]] are the data frames used to train and evaluate the models. Accessing the data frames is simple and has a defined, reproducible structure each time (simply output[[2]] and output[[3]]), but output[[1]] becomes a mess; the structure that follows output[[1]] looks roughly like the sketch below.
The only thing I have been able to attempt in order to access this is to use the fact that [[1]] is attached onto output[[1]] before [[2]]; all of the nested elements except one have a [[2]] at the end. Given that pattern, there is an ugly solution that works, but it is not a desirable format to work with. E.g., after evaluating n models given by a vector of strings called inputList, with the function's output stored as "output", I can have [[1]] repeated tens to hundreds of times:
for (i in (1:length(inputList) - 1)) {
  # builds and evaluates e.g. output[[1]][[2]], output[[1]][[1]][[2]], ... ((1 + i) repeats of [[1]])
  eval(rlang::parse_expr(paste0(c("output", rep("[[1]]", 1 + i), "[[2]]"), collapse = "")))
}
This can be used to apply all the models to some downstream task, like making predictions on new data. In cases where the length of inputList is not known, it could be found by repeating this until hitting an error, or something similar. The approach can be modified to call on a specific part of the list, for example a certain model within inputList, if I know the original input and can find the number for that model. But working this way is bulky compared to some way where I could just call output[[1]][[n]] in a predictable format for various n. One of the big problems is accessing older runs that were saved without the input list of models, leaving n unknown. I don't know of any way of using something like length() or lengths() to count how many nested elements exist within a list. (For my example, output[[1]] is of length 1, no matter how many [[1]] repeat elements there are.)
I believe the simplest solution is to change the way the list is saved by the function, so that I can access it by a systematic reference. However, I have a bunch of old lists which I still want to access and work with, and I'd also like to have better control when working with lists in any case. So any help would be greatly appreciated.
I expected there would be some way to query the structure of nested R lists, which could be used to pass nested elements to separate functions, without having to use very long repetition of brackets.
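For what it's worth, the kind of helper I could fall back on might look like the sketch below (untested; collectModels is a made-up name, and it assumes the nesting produced by list(c(listOfModels, list(trainedModel))) starting from an empty list), though it is still specific to this one structure rather than a general way to query nested lists:
collectModels <- function(nested) {
  models <- list()
  x <- nested[[1]]                          # strip the outer list() wrapper
  while (is.list(x) && length(x) == 2) {
    models[[length(models) + 1]] <- x[[2]]  # the model stored at this level
    x <- x[[1]]                             # descend one level of nesting
  }
  models[[length(models) + 1]] <- x[[1]]    # the deepest model has no [[2]]
  rev(models)                               # restore original training order
}
models <- collectModels(output[[1]])
length(models)                              # works even when the number of models is unknown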
I'm trying to create an array from an xyz data file. The data file is arranged so that the x, y, z coordinates of each atom are on a new line, and I want the array to reflect this.
I then want to use this array to find the distance from each atom in the list to all the others.
To do this the array has been copied such that atom1 & atom2 should be identical to the input file.
length is simply the number of atoms in the list.
The write statement: WRITE(20,'(3F12.9)') atom1 actually gives the matrix wanted but when I try to find individual elements they're all wrong!
Any help would be really appreciated!
Thanks guys.
DOUBLE PRECISION, DIMENSION(:,:), ALLOCATABLE :: atom1, atom2
ALLOCATE(atom1(length,3), atom2(length,3))
READ(10,*) ((atom1(i,j), i=1,length), j=1,3)
atom2 = atom1
distn = 0
distc = 0
DO n = 1, length
   x1 = atom1(n,1)
   y1 = atom1(n,2)   !1st atom
   z1 = atom1(n,3)
   DO m = 1, length
      x2 = atom2(m,1)
      y2 = atom2(m,2)   !2nd atom
      z2 = atom2(m,3)
Your READ statement reads all the x coordinates for all atoms from however many records, then all the y coordinates, then all the z coordinates. That's inconsistent with your description of the input file. You have the nesting of the io-implied-do's in the READ statement around the wrong way - it should be ((atom1(i,j),j=1,3),i=1,length).
Similarly, as per the comment, your diagnostic write misled you - you were outputting all the x ordinates, followed by all the y ordinates, etc. Array element order of a whole array reference varies the first (leftmost) dimension fastest (colloquially known as column major order).
(There are various pitfalls associated with list-directed formatting that mean I wouldn't recommend it for production code, except perhaps for input specifically written with knowledge of, and defence against, those pitfalls. One of those pitfalls is that a READ under list-directed formatting will pull in as many records as it requires to satisfy the input list. You might have detected the problem earlier if you had used an explicit format that nominated the number of fields per record.)
I'm working with a graphNEL object and need to extract the adjacent nodes of a specified node. This is solvable with adj(graph, "node123"); however, the nodes are returned as a vector of size 1, so I can't directly access certain nodes from it.
Let's say:
> adjacent <- adj(subgraph,"hsa:991")
> adjacent
$`hsa:991`
[1] "hsa:10744" "hsa:29945" "hsa:51433" "hsa:8881"
For an algorithm I just need, let's say, "hsa:29945", but since this result is of size one, I have a problem. Is this possible? The best thing would be for every node to be recognized as an element.
Btw.: maybe somebody can explain to me why they are only one element. I mean, [1] "hsa:10744 hsa:29945 hsa:51433 hsa:8881" I could understand, but why are there quotes around every node?
After all, I just need to implement a random walk on a graph, but I haven't found any packages, so I will try to implement it myself.
Hope you can help me.
Thanks in advance.
Cheers
Rich
adj(g, index=XXX) is returning a list containing the neighbours for each entry of XXX.
So, in order to extract the results for an entry of XXX you need to access the corresponding entry in the list. This then gives you the desired results:
##a simple mock-up graph
g <- new("graphNEL", nodes=c("V1","V2","V3"), edgemode="undirected")
g <- addEdge("V1","V2",g)
g <- addEdge("V1","V3",g)
adj.res <- adj(g,"V1") #returns a list
adj.res[["V1"]] #returns a vector
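Since you mention wanting to implement a random walk: once the neighbours are accessible like this, a rough sketch using only adj() and base R could be the following (randomWalk, start and steps are just illustrative names, and each step is chosen uniformly at random):
randomWalk <- function(g, start, steps = 10) {
  walk <- character(steps + 1)
  walk[1] <- start
  for (k in seq_len(steps)) {
    nbrs <- adj(g, walk[k])[[1]]                      # neighbours of the current node
    if (length(nbrs) == 0) return(walk[seq_len(k)])   # dead end: stop early
    walk[k + 1] <- sample(nbrs, 1)                    # move to a random neighbour
  }
  walk
}
randomWalk(g, "V1", steps = 5)
# e.g. "V1" "V2" "V1" "V3" "V1" "V2"  (one possible walk on the mock-up graph)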
First off, this may be the wrong Forum for this question, as it's pretty darn R+Bioconductor specific. Here's what I have:
library('GEOquery')
GDS = getGEO('GDS785')
cd4T = GDS2eSet(GDS)
cd4T <- cd4T[!fData(cd4T)$symbol == "",]
Now cd4T is an ExpressionSet object which wraps a big matrix with 19794 rows (probesets) and 15 columns (samples). The final line gets rid of all probesets that do not have corresponding gene symbols. Now the trouble is that most genes in this set are assigned to more than one probeset. You can see this by doing
gene_symbols = factor(fData(cd4T)$Gene.symbol)
length(gene_symbols)-length(levels(gene_symbols))
[1] 6897
So my 19794 probesets map to only 12897 distinct gene symbols; many genes are covered by more than one probeset. I'd like to somehow combine the expression levels of the probesets associated with each gene. I don't care much about the actual probe id for each probe. I'd like very much to end up with an ExpressionSet containing the merged information, as all of my downstream analysis is designed to work with this class.
I think I can write some code that will do this by hand, and make a new expression set from scratch. However, I'm assuming this can't be a new problem and that code exists to do it, using a statistically sound method to combine the gene expression levels. I'm guessing there's a proper name for this too, but my searches aren't turning up much of use. Can anyone help?
I'm not an expert, but from what I've seen over the years everyone has their own favorite way of combining probesets. The two methods that I've seen used most on a large scale are to use only the probeset with the largest variance across the expression matrix, and to take the mean of the probesets, creating a meta-probeset out of it. For smaller blocks of probesets I've seen people use more intensive methods, such as looking at per-probeset plots to get a feel for what's going on; generally what happens is that one probeset turns out to be the 'good' one and the rest aren't very good.
I haven't seen generalized code to do this - as an example we recently realized in my lab that a few of us have our own private functions to do this same thing.
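To make the two approaches concrete, here is a rough base-R sketch (untested; it assumes the cd4T object and the Gene.symbol column from the question):
expr <- exprs(cd4T)
sym  <- as.character(fData(cd4T)$Gene.symbol)
# 1) keep only the most variable probeset for each gene symbol
v    <- apply(expr, 1, var)
keep <- unlist(lapply(split(seq_len(nrow(expr)), sym),
                      function(i) i[which.max(v[i])]))
cd4T.maxvar <- cd4T[keep, ]
# 2) or average the probesets for each gene into a "meta-probeset" matrix
sums     <- rowsum(expr, group = sym)
meanExpr <- sums / as.integer(table(sym)[rownames(sums)])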
The function you are looking for is nsFilter in the R genefilter package. It does two major things: it keeps only probesets with Entrez gene IDs (the rest of the probesets are filtered out), and when an Entrez ID has multiple probesets, the one with the largest value of the variance measure is retained and the others are removed. You then have a matrix with a unique mapping to Entrez gene IDs. Hope this helps.
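A rough usage sketch (untested; it assumes cd4T carries an annotation package that nsFilter can use, and that these argument names match your version of genefilter):
library(genefilter)
filtered <- nsFilter(cd4T,
                     require.entrez   = TRUE,   # drop probesets without an Entrez ID
                     remove.dupEntrez = TRUE,   # keep one probeset per Entrez ID
                     var.func   = IQR,          # ...the one with the largest IQR
                     var.filter = FALSE)        # don't also drop low-variance genes
cd4T.unique <- filtered$eset                    # an ExpressionSet with one row per gene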