How is this recursive CodeQL predicate evaluated? - recursion

I'm in the process of learning CodeQL and I'm a little confused about how certain CodeQL code is evaluated. I'm hoping someone can help me with a simpler explanation.
Take the following CodeQL code:
string getANeighbor(string country) {
  country = "France" and result = "Belgium"
  or
  country = "France" and result = "Germany"
  or
  country = "Germany" and result = "Austria"
  or
  country = "Germany" and result = "Belgium"
  or
  country = getANeighbor(result)
}
select getANeighbor("Germany")
The result I get back is:
France
Austria
Belgium
I understand why Belgium and Austria are returned. However, I'm confused as to how CodeQL determines that France should be returned. My imperative programming intuition tells me that for France to be returned, I would need an additional line like country = "Germany" and result = "France", but I'm really confused about how France is being returned without that line of code.
Also, how does this line work exactly?:
country = getANeighbor(result)
With some of the simple examples that are given in the CodeQL handbook, they make it seem like the 'result' keyword almost acts like 'return' in other languages. I feel that I have a fundamental misunderstanding of what 'result' does and how it works. I can't seem to find a good explanation after googling.
Thanks in advance!

A user was kind enough to answer my question on a different site.
Source (https://github.com/github/securitylab/discussions/85):
Answer (Provided by: Pavel Avgustinov from Semmle):
This is a great question. I'll give the full details on recursive
predicates below, but let's first deal with result.
Also, how does this line work exactly?:
country = getANeighbor(result)
With some of the simple examples that are given in the CodeQL handbook, they make it seem like the 'result' keyword almost acts
like 'return' in other languages. I feel that I have a fundamental
misunderstanding of what 'result' does and how it works. I can't seem
to find a good explanation after googling.
You're right to observe that in many examples result = f() in CodeQL
seems to act like return f() in other languages. However, a return
statement (with its usual meaning of actually interrupting the
evaluation of a function and returning to the caller) does not make
sense in a logic language, where it's up to the CodeQL evaluator how
and in what order to evaluate the different parts of a predicate you
define. Instead, when a predicate declares a return type (string in
your example), this implicitly declares the special variable result
in its body, whose type is the declared return type of the predicate.
You can use this variable just like any other, in particular to
constrain it as part of the predicate.
Note that this is quite similar to how you implicitly get a special
variable called this in a member predicate (or method, in traditional
languages).
Other than its implicit declaration, there is nothing special about
the result variable. In particular, logically you could write the
getANeighbor predicate like this, without its use:
predicate getANeighbor(string country, string neighbor) {
  country = "France" and neighbor = "Belgium"
  or
  country = "France" and neighbor = "Germany"
  or
  country = "Germany" and neighbor = "Austria"
  or
  country = "Germany" and neighbor = "Belgium"
  or
  getANeighbor(neighbor, country)
}
You do have to invoke the predicate differently if you declare it
without a result type: getANeighbor(neighbor, country) rather than
country = getANeighbor(result). Other than this syntactic detail,
the two are entirely equivalent.
If you get used to the idea that result is just an implicitly
declared variable, not some special way of signaling the returned
value of a predicate, then code like country = getANeighbor(result)
(or more complex cases, where result is used multiple times to place
multiple conditions on it) will make more sense. Contrast it to the
result-free version I gave above, which explicitly flips the order
of arguments in the recursive call to getANeighbor: it's exactly the
same thing, we're invoking getANeighbor with "flipped" arguments,
namely result as the first parameter, and country equated to the
"returned" value.
As a final observation, you can of course get something more similar
to the return-like style of many examples by introducing an
intermediate variable: Instead of country = getANeighbor(result), we
can write exists(string neighbor | country = getANeighbor(neighbor) and result = neighbor), but you can see how this ends up being more
verbose.
The general intuition behind recursion in CodeQL is given here,
but at a high level you can think of each recursive call as
representing the "current" set of value tuples in the called
predicate, and all predicate definitions keep being evaluated until
there is no change. Let's see how this works in your example
predicate:
string getANeighbor(string country) {
  country = "France" and result = "Belgium"
  or
  country = "France" and result = "Germany"
  or
  country = "Germany" and result = "Austria"
  or
  country = "Germany" and result = "Belgium"
  or
  country = getANeighbor(result)
}
The first time we try to evaluate getANeighbor, we do not know
anything about it, and so we treat a call to it as a call to the empty
set of values. This means we deduce the following values for it:
+-----------+----------+---------+
| iteration | country | result |
+-----------+----------+---------+
| 1 | France | Belgium |
| 1 | France | Germany |
| 1 | Germany | Austria |
| 1 | Germany | Belgium |
+-----------+----------+---------+
However, after deducing these values, we know that there was a
recursive call to getANeighbor, and that the values we have for it
changed. Thus, we need to re-evaluate the definition of
getANeighbor, now using our expanded set of values for the recursive
call. This means that we deduce the following:
+-----------+----------+---------+
| iteration | country | result |
+-----------+----------+---------+
| 2 | France | Belgium |
| 2 | France | Germany |
| 2 | Germany | Austria |
| 2 | Germany | Belgium |
| | | |
| 2 | Belgium | France |
| 2 | Germany | France |
| 2 | Austria | Germany |
| 2 | Belgium | Germany |
+-----------+----------+---------+
Note that we re-deduced the first four lines, which do not depend on a
recursive call, just like we did in iteration 1. This is fine -- we
already knew that "Belgium" = getANeighbor("France") was true, and
thus we can simply ignore the fact that we deduced it in another way.
However, the last four lines are new; they arise from the recursive
call.
As before, we see that iteration 2 of evaluating getANeighbor
depended on a recursive predicate for which we have deduced new
values, and so we need to go around again. In the third iteration, we
deduce:
(a) the first four lines of iteration 2 (and iteration 1), just like before, from the non-recursive parts of the predicate definition;
(b) the last four lines of iteration 2, in the same way as for iteration 2 itself; and
(c) all 8 lines from iteration 2, just by applying the recursive case to the results of iteration 2.
However, we didn't deduce any new values, and so we know that at this point we can stop evaluating getANeighbor. We say that it has "converged" (meaning "stopped changing"), and for any other references to it, we are going to use the final set of values which we computed.
This is why in your query, France is reported as a neighbor of
Germany.
[I've omitted some details of how to make such a recursive evaluation efficient - in practice there are some ways to avoid re-deducing the same values again and again on each recursive iteration - because they are not necessary to understand how recursion works in the first place.]
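To see the convergence mechanically, here is a small simulation in R (not CodeQL; purely illustrative, mirroring the iteration tables above):
# base facts: the four non-recursive disjuncts
base <- data.frame(country = c("France", "France", "Germany", "Germany"),
                   result  = c("Belgium", "Germany", "Austria", "Belgium"))
tuples <- base[0, ]                            # start from the empty relation
repeat {
  # the recursive disjunct, country = getANeighbor(result), flips each tuple
  flipped <- data.frame(country = tuples$result, result = tuples$country)
  new_tuples <- unique(rbind(base, flipped))
  if (nrow(new_tuples) == nrow(tuples)) break  # converged: nothing new deduced
  tuples <- new_tuples
}
tuples$result[tuples$country == "Germany"]     # "Austria" "Belgium" "France"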
Thanks again to Pavel Avgustinov from Semmle for this detailed response!

Related

Surv function input - right, left, or interval censored? In R

I am at the beginning of setting up a survival analysis in R.
I took a look at this book: https://www.powells.com/book/modeling-survival-data-9780387987842/ but struggle to properly set up the data in the first place. So this is a very basic survival analysis question, as I cannot find a good example online.
I'd like to understand how to incorporate the censored data into my Surv() function. I understand the inputs to Surv are:
0 = right censored
1 = event
2 = left censored
3 = interval censored
Right Censored: The study ends before an event takes place (ob1)
Left Censored: The event has already happened before the study starts
Event: Typically death or some other form of expected outcome (marked by x)
Interval Censored: The observation starts at some point in the study and has an event / drops out before the end of the study (ob5)
Left truncated: ob3, ob4, ob5 are left truncated
To better understand what I am talking about, I sketched the described types of censored data below:
"o" marks the beginning of data / first occurrence in the data set
"x" marks an event
Start of study End of observation
ob1 o-|-----------------------------------------------------------|--------
| |
ob2 o-|-------------------------------xo |
| |
ob3 | o-----------------------------------xo |
| |
ob4 | o------------------x-|----------o
| |
ob5 | o----------------------------o |
|--------------------------------------------------------------
1999 2010
Finally, what I would like to know:
Did I classify ob1-ob5 correctly?
How about the other types of observations?
How do I represent these as input for the Surv function? If, for example, right censored is true, i.e. the study ends, how does a "0" indicate so? What is the input for the time series when neither an event (1) nor the end of observation (0) occurs? What happens at a time when "nothing" happens?
When and how is interval censored data marked? 3 for beginning and end?
I can provide some sample code if needed.
Again, thank you for your help on this and valuable questions!
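As a partial illustration, here is a minimal sketch of how the status codes listed above map onto survival::Surv; the times are made up for the example, not read off the sketch:
library(survival)
# status: 0 = right censored, 1 = event, 2 = left censored, 3 = interval censored
time   <- c(11, 6, 2, 4)   # observation (or interval start) times, hypothetical
time2  <- c(11, 6, 2, 8)   # only consulted where status == 3 (interval end)
status <- c(0, 1, 2, 3)
Surv(time, time2, status, type = "interval")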

groupby an element with jq

I have the following json:
{"us":{"$event":"5bbf4a4f43d8950b5b0cc6d2"},"org":"TΙ UIH","rc":{"$event":"13"}}
{"us":{"$event":"5bbf4a4f43d8950b5b0cc6d3"},"org":"TΙ UIH","rc":{"$event":"13"}}
{"us":{"$event":"5bbf4a4f43d8950b5b0cc6d4"},"org":"AB KIO","rc":{"$event":"13"}}
{"us":{"$event":"5bbf4a4f43d8950b5b0cc6d5"},"org":"GH SVS","rc":{"$event":"17"}}
How could I achieve the following output result? (TSV)
13 TΙ UIH 2
13 AB KIO 1
17 GH SVS 1
So far, from what I have searched:
jq -sr 'group_by(.org)|.[]|[.[0].org, length]|@tsv'
How could I add one more group_by to achieve the desired result?
I was able to obtain the expected result from your sample JSON using the following :
group_by(.org, .rc."$event")[] | [.[0].rc."$event", .[0].org, length] | @tsv
You can try it on jqplay.org.
The modification of the group_by clause ensures we will have one entry by pair of .org/.rc.$event (without it we would only have one entry by .org, which might hide some .rc.$event).
Then we add the .rc.$event to the array you create just as you did with the .org, accessing the value of the first item of the array since we know they're all the same anyway.
To sort the result, you can put it in an array and use sort_by(.[0]), which will sort by the first element of the rows:
[group_by(.org, .rc."$event")[] | [.[0].rc."$event", .[0].org, length]] | sort_by(.[0])[] | @tsv

Efficient string similarity grouping

Setting:
I have data on people, and their parent's names, and I want to find siblings (people with identical parent names).
pdata<-data.frame(parents_name=c("peter pan + marta steward",
"pieter pan + marta steward",
"armin dolgner + jane johanna dough",
"jack jackson + sombody else"))
The expected output here would be a column indicating that the first two observations belong to family X, while the third and fourth observations are each in a separate family. E.g.:
person_id parents_name                         family_id
1         "peter pan + marta steward"          1
2         "pieter pan + marta steward"         1
3         "armin dolgner + jane johanna dough" 2
4         "jack jackson + sombody else"        3
Current approach:
I am flexible regarding the distance metric. Currently, I use Levenshtein edit distance to match observations, allowing for two-character differences. Other variants such as "longest common substring" would be fine if they run faster.
For smaller subsamples I use stringdist::stringdist in a loop or stringdist::stringdistmatrix, but this is getting increasingly inefficient as sample size increases.
The matrix version explodes once a certain sample size is used. My terribly inefficient attempt at looping is here:
#create data of the same complexity using random last-names
#(4mio obs and ~1-3 kids per parents)
pdata<-data.frame(parents_name=paste0(rep(c("peter pan + marta ",
"pieter pan + marta ",
"armin dolgner + jane johanna ",
"jack jackson + sombody "),1e6),stringi::stri_rand_strings(4e6, 5)))
for (i in 1:nrow(pdata)) {
similar_fatersname0<-stringdist::stringdist(pdata$parents_name[i],pdata$parents_name[i:nrow(pdata)],nthread=4)<2
#[create grouping indicator]
}
My question: there should be substantial efficiency gains, e.g. because I could stop comparing strings once I find them to be sufficiently different in something that is easier to assess, e.g. string length or the first word. The string length variant already works and reduces complexity by a factor of ~3. But that's by far too little. Any suggestions to reduce computation time are appreciated.
Remarks:
The strings are actually in Unicode, not the Latin alphabet (Devanagari)
Pre-processing to drop unused characters etc is done
There are two challenges:
A. The parallel execution of Levenshtein distance measurement - instead of a sequential loop
B. The number of comparisons: if our source list has 4 million entries, theoretically we should run 16 trillion Levenshtein distance measures, which is unrealistic, even if we resolve the first challenge.
To make my use of language clear, here are our definitions
we want to measure the Levenshtein distance between expressions.
every expression has two sections, the parent A full name and the parent B full name, which are separated by a plus sign
the order of the sections matters (i.e. two expressions (1, 2) are identical if Parent A of expression 1 = Parent A of expression 2 and Parent B of expression 1 = Parent B of expression 2. Expressions will not be considered identical if Parent A of expression 1 = Parent B of expression 2 and Parent B of expression 1 = Parent A of expression 2)
a section (or a full name) is a series of words, which are separated by spaces or dashes and correspond to the first name and last name of a person
we assume the maximum number of words in a section is 6 (your example has sections of 2 or 3 words, I assume we can have up to 6)
the sequence of words in a section matters (the section is always a first name followed by a last name and never the last name first, e.g. Jack John and John Jack are two different persons).
there are 4 million expressions
expressions are assumed to contain only English characters. Numbers, spaces, punctuation, dashes, and any non-English character can be ignored
we assume the easy matches are already done (like the exact expression matches) and we do not have to search for exact matches
Technically, the goal is to find series of matching expressions in the 4-million-expression list. Two expressions are considered matching if their Levenshtein distance is less than 2.
Practically, we create two lists, which are exact copies of the initial 4-million-expression list. We call them the Left list and the Right list. Each expression is assigned an expression id before duplicating the list.
Our goal is to find entries in the Right list which have a Levenshtein distance of less than 2 to entries in the Left list, excluding the same entry (same expression id).
I suggest a two-step approach to resolve the two challenges separately. The first step will reduce the list of possible matching expressions; the second will simplify the Levenshtein distance measurement, since we only look at very close expressions. The technology used is any traditional database server, because we need to index the data sets for performance.
CHALLENGE A
Challenge A consists of reducing the number of distance measurements. We start from a maximum of approx. 16 trillion (4 million squared) and we should not exceed a few tens or hundreds of millions.
The technique to use here consists of searching for at least one similar word in the complete expression. Depending on how the data is distributed, this will dramatically reduce the number of possible matching pairs. Alternatively, depending on the required accuracy of the result, we can also search for pairs with at least two similar words, or with at least half of similar words.
Technically I suggest putting the expression list in a table. Add an identity column to create a unique id per expression, and create 12 character columns. Then parse the expressions and put each word of each section in a separate column. This will look like this (I have not represented all 12 columns, but the idea is below):
|id | expression                | sect_a_w_1 | sect_a_w_2 | sect_b_w_1 | sect_b_w_2 |
|1  | peter pan + marta steward | peter      | pan        | marta      | steward    |
There are empty columns (since there are very few expressions with 12 words) but it does not matter.
Then we replicate the table and create an index on every sect... column.
We run 12 joins which try to find similar words, something like
SELECT L.id, R.id
FROM left_table L JOIN right_table R
ON L.sect_a_w_1 = R.sect_a_w_1
AND L.id <> R.id
We collect the output in 12 temp tables and run a union query over the 12 tables to get a short list of all expressions which have a potential matching expression with at least one identical word. This is the solution to our challenge A. We now have a short list of the most likely matching pairs. This list will contain millions of records (pairs of Left and Right entries), but not billions.
CHALLENGE B
The goal of challenge B is to compute a simplified Levenshtein distance in batch (instead of running it in a loop).
First we should agree on what a simplified Levenshtein distance is.
We agree that the Levenshtein distance of two expressions is the sum of the Levenshtein distances of all the words of the two expressions which have the same index; that is, the Levenshtein distance of two expressions is the distance of their two first words, plus the distance of their two second words, etc.
Secondly, we need to invent a simplified Levenshtein distance. I suggest using the n-gram approach with only grams of 2 characters which have an index absolute difference of less than 2.
e.g. the distance between peter and pieter is calculated as below
Peter
1 = pe
2 = et
3 = te
4 = er
5 = r_
Pieter
1 = pi
2 = ie
3 = et
4 = te
5 = er
6 = r_
Peter and Pieter have 4 common 2-grams with an index absolute difference of less than 2: 'et', 'te', 'er', 'r_'. There are 6 possible 2-grams in the larger of the two words, so the distance is 6 - 4 = 2. (For comparison, the true Levenshtein distance is 1, a single insertion of 'i'; the approximation overestimates slightly here.)
This is an approximation which will not work in all cases, but I think in our situation it will work very well. If we're not satisfied with the quality of the results we can try with 3-grams or 4-grams or allow a larger than 2 gram sequence difference. But the idea is to execute much fewer calculations per pair than in the traditional Levenstein algorithm.
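For concreteness, here is a minimal R sketch of this 2-gram approximation (a hypothetical helper for illustration, separate from the SQL pipeline described next):
# approximate distance: grams of the longer word minus grams matched in the
# other word at an index differing by less than 2
gram2_dist <- function(a, b) {
  grams <- function(w) {
    w <- paste0(w, "_")                 # pad so the last letter forms a gram
    substring(w, 1:(nchar(w) - 1), 2:nchar(w))
  }
  ga <- grams(a)
  gb <- grams(b)
  common <- sum(sapply(seq_along(ga), function(i) {
    any(gb[max(1, i - 1):min(length(gb), i + 1)] == ga[i])
  }))
  max(length(ga), length(gb)) - common
}
gram2_dist("peter", "pieter")  # 2, matching the worked example above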
Then we need to convert this into a technical solution. What I have done before is the following:
First isolate the words: since we only need to measure the distance between words, and then sum these distances per expression, we can further reduce the number of calculations by running a distinct select on the list of words (we have already prepared the list of words in the previous section).
This approach requires a mapping table which keeps track of the expression id, the section id, the word id, and the word sequence number for each word, so that the original expression distance can be calculated at the end of the process.
We then have a new list which is much shorter, and contains a cross join of all words for which the 2-gram distance measure is relevant.
Then we want to batch process this 2-gram distance measurement, and I suggest doing it in a SQL join. This requires a pre-processing step which consists of creating a new temporary table which stores every 2-gram in a separate row and keeps track of the word id, the word sequence, and the section type.
Technically this is done by slicing the list of words using a series (or a loop) of substring selects, like this (assuming the word list tables - there are two copies, one Left and one Right - contain two columns, word_id and word):
INSERT INTO left_gram_table (word_id, gram_seq, gram)
SELECT word_id, 1 AS gram_seq, SUBSTRING(word,1,2) AS gram
FROM left_word_table
And then
INSERT INTO left_gram_table (word_id, gram_seq, gram)
SELECT word_id, 2 AS gram_seq, SUBSTRING(word,2,2) AS gram
FROM left_word_table
Etc.
Something which will make "steward" look like this (assuming the word id is 152):
| pk | word_id | gram_seq | gram |
| 1 | 152 | 1 | st |
| 2 | 152 | 2 | te |
| 3 | 152 | 3 | ew |
| 4 | 152 | 4 | wa |
| 5 | 152 | 5 | ar |
| 6 | 152 | 6 | rd |
| 7 | 152 | 7 | d_ |
Don't forget to create an index on the word_id, gram, and gram_seq columns; the distance can then be calculated with a join of the left and the right gram lists, where the ON looks like
ON L.gram = R.gram
AND ABS(L.gram_seq - R.gram_seq) < 2
AND L.word_id <> R.word_id
The distance is the length of the longer of the two words minus the number of matching grams. SQL is extremely fast at such a query, and I think a simple computer with 8 GB of RAM would easily handle several hundred million rows in a reasonable time frame.
And then it's only a matter of joining the mapping table to calculate the sum of word-to-word distances in every expression, to get the total expression-to-expression distance.
You are using the stringdist package anyway; does stringdist::phonetic() suit your needs? It computes the soundex code for each string, e.g.:
phonetic(pdata$parents_name)
[1] "P361" "P361" "A655" "J225"
Soundex is a tried-and-true method (almost 100 years old) for hashing names, and it means you don't need to compare every single pair of observations.
You might want to go further and do soundex on the first name and last name separately for father and mother, as sketched below.
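A rough sketch of that refinement, assuming the two parents split cleanly on " + " (the grouping key here is illustrative; split again on spaces to go per name):
library(stringdist)
parts <- strsplit(as.character(pdata$parents_name), " + ", fixed = TRUE)
# one soundex code per parent, pasted into a single grouping key
keys <- sapply(parts, function(p) paste(phonetic(p), collapse = "|"))
# candidate groups: only names sharing a key need a stringdist comparison
split(as.character(pdata$parents_name), keys)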
My suggestion is to use a data science approach to identify only similar (same-cluster) names to compare using stringdist.
I have modified the code generating "parents_name" a little, adding more variability in the first and second names in a scenario closer to reality.
num<-4e6
#Random length
random_l<-round(runif(num,min = 5, max=15),0)
#Random strings in the first and second name
parent_rand_first<-stringi::stri_rand_strings(num, random_l)
order<-sample(1:num, num, replace=F)
parent_rand_second<-parent_rand_first[order]
#Paste first and second name
parents_name<-paste(parent_rand_first," + ",parent_rand_second)
parents_name[1:10]
Here the real analysis starts: first extract features from the names, such as the global length, the length of the first name, the length of the second name, and the number of vowels and consonants in both names (and any other feature of interest).
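One possible way to compute those features (a sketch; treating every non-vowel character as a consonant, which is rough for these random alphanumeric strings):
# hypothetical feature extraction, splitting on the " + " separator
split_names <- strsplit(as.character(parents_name), " \\+ ")
first  <- trimws(sapply(split_names, `[`, 1))
second <- trimws(sapply(split_names, `[`, 2))
nchars        <- nchar(as.character(parents_name))
nchars_first  <- nchar(first)
nchars_second <- nchar(second)
count_vowels  <- function(x) nchar(gsub("[^aeiouAEIOU]", "", x))
nvowels_first      <- count_vowels(first)
nvowels_second     <- count_vowels(second)
nconsonants_first  <- nchars_first - nvowels_first
nconsonants_second <- nchars_second - nvowels_second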
After that, bind all these features and cluster the data.frame into a high number of clusters (e.g. 1000):
features<-cbind(nchars,nchars_first,nchars_second,nvowels_first,nvowels_second,nconsonants_first,nconsonants_second)
n_clusters<-1000
clusters<-kmeans(features,centers = n_clusters)
Apply stringdistmatrix only inside each cluster (containing similar pairs of names):
dist_matrix<-list()
for(i in 1:n_clusters)
{
  cluster_i<-clusters$cluster==i
  # use a local variable so the full parents_name vector is not overwritten
  names_i<-as.character(parents_name[cluster_i])
  dist_matrix[[i]]<-stringdistmatrix(names_i,names_i,"lv")
}
In dist_matrix you have the distance between each pair of elements in the cluster, and you are able to assign the family_id using this distance.
To compute the distance in each cluster (in this example) the code takes approximately 1 sec (depending on the dimension of the cluster); in 15 minutes all the distances are computed.
WARNING: dist_matrix grows very fast; in your code it is better to analyze it inside the for loop, extracting family_id and then discarding it.
You may improve by not comparing all pairs of lines.
Instead, create a new variable that will be helpful to decide whether a comparison is worth making.
For example, create a new variable "score" containing the ordered list of letters used in parents_name (for example, if "peter pan + marta steward", then the score will be "ademnprstw"), and calculate the distance only between lines whose scores match.
Of course, you can find a score that better fits your need, and improve it a little to enable comparisons when not all the letters used are common.
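A minimal sketch of that score (hypothetical helper name):
# ordered set of letters used in a name
letter_score <- function(x) {
  chars <- strsplit(gsub("[^a-z]", "", tolower(x)), "")[[1]]
  paste(sort(unique(chars)), collapse = "")
}
letter_score("peter pan + marta steward")  # "ademnprstw"
# compare only within groups sharing the same score, e.g. via split()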
I faced the same performance issue a couple of years ago. I had to match people's duplicates based on their typed names. My dataset had 200k names and the matrix approach exploded. After searching for about a day for a better method, the method I'm proposing here did the job for me in a few minutes:
library(stringdist)
parents_name <- c("peter pan + marta steward",
"pieter pan + marta steward",
"armin dolgner + jane johanna dough",
"jack jackson + sombody else")
person_id <- 1:length(parents_name)
family_id <- vector("integer", length(parents_name))
#Looping through unassigned family ids
while(sum(family_id == 0) > 0){
ids <- person_id[family_id == 0]
dists <- stringdist(parents_name[family_id == 0][1],
parents_name[family_id == 0],
method = "lv")
matches <- ids[dists <= 3]
family_id[matches] <- max(family_id) + 1
}
result <- data.frame(person_id, parents_name, family_id)
That way the while loop compares fewer names on every iteration. From there, you might implement different performance boosters, like filtering the names by first letter before comparing, etc.
Making equivalence groups on a non-transitive relation does not make sense. If A is like B and B is like C, but A is not like C, how would you make families from that? Using something like soundex (that was Neal Fultz's idea, not mine) seems the only meaningful option, and it solves your problem with performance too.
What I have used to reduce the permutations involved in this sort of name matching is to create a function that counts the syllables in the name (surname) involved, then store this in the database as a pre-processed value. This becomes a Syllable Hash function.
Then you can choose to group together words with the same number of syllables. (Although I use algorithms that allow 1 or 2 syllables of difference, which may be presented as legitimate spelling / typo errors... But my research has found that 95% of misspellings share the same number of syllables.)
In this case Peter and Pieter would have the same syllable count (2), but Jones and Smith do not (they have 1). (For example)
If your function does not get 1 syllable for Jones, then you may need to increase your tolerance to allow for at least 1 syllable difference in the Syllable Hash function grouping that you use. (To account for incorrect syllable function results, and to catch the matching surname correctly in the grouping)
My syllable counting function may not apply completely, as you might need to cope with non-English letter sets... (so I have not pasted the code; it's in C anyway). Mind you, the syllable count function does not have to be accurate in terms of TRUE syllable count; it simply needs to act as a reliable hashing function, which it does. Far superior to SoundEx, which relies on the first letter being accurate.
Give it a go; you might be surprised how much improvement you get by implementing a Syllable Hash function. You may have to ask SO for help getting the function into your language.
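Since the original C code is not shown, here is a rough stand-in in R; counting runs of vowels is only an approximation of syllable count, but per the answer it only has to behave as a stable hash:
# approximate syllables as maximal runs of vowels (including y)
syllable_hash <- function(x) {
  x <- tolower(x)
  lengths(regmatches(x, gregexpr("[aeiouy]+", x)))
}
syllable_hash(c("peter", "pieter", "jones", "smith"))
# 2 2 2 1 - note "jones" comes out as 2, the kind of miscount the
# answer's tolerance of 1 syllable difference is meant to absorb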
If I get it right, you want to compare every parent pair (every row in the parents_name data frame) with all other pairs (rows), and keep rows that have a Levenshtein distance smaller than or equal to 2.
I have written the following code for the beginning:
pdata<-data.frame(parents_name=c("peter pan + marta steward",
"pieter pan + marta steward",
"armin dolgner + jane johanna dough",
"jack jackson + sombody else"))
library(stringdist)  # for stringdist()
fuzzy_match <- list()
system.time(for (i in 1:nrow(pdata)){
fuzzy_match[[i]] <- cbind(pdata, parents_name_2 = pdata[i,"parents_name"],
dist = as.integer(stringdist(pdata[i,"parents_name"], pdata$parents_name)))
fuzzy_match[[i]] <- fuzzy_match[[i]][fuzzy_match[[i]]$dist <= 2,]
})
fuzzy_final <- do.call(rbind, fuzzy_match)
Does it return what you wanted?
It reproduces your output; I guess you will have to decide on partial matching criteria, I kept the default agrep ones:
pdata$parents_name<-as.character(pdata$parents_name)
x00<-unique(lapply(pdata$parents_name,function(x) agrep(x,pdata$parents_name)))
x=c()
for (i in 1:length(x00)){
x=c(x,rep(i,length(x00[[i]])))
}
pdata$person_id=seq_len(nrow(pdata))
pdata$family_id=x

How to get average of last N numbers in a stream with static memory

I have a stream of numbers, and in every cycle I need to compute the average of the last N of them. This can, of course, be solved using an array where I store the last N numbers; in every cycle I shift it, add the new one, and compute the average.
N = 3
+---+-----+
| a | avg |
+---+-----+
| 1 | |
| 2 | |
| 3 | 2.0 |
| 4 | 3.0 |
| 3 | 3.3 |
| 3 | 3.3 |
| 5 | 3.7 |
| 4 | 4.0 |
| 5 | 4.7 |
+---+-----+
The first N numbers (where there "isn't enough data for computing the average") don't interest me much, so the results there may be anything/undefined.
My question is, can this be done without using an array, that is, with a static amount of memory? If so, then how?
I'll do the coding myself - I just need to know the theory.
Thanks
Think of this as a black box containing some state. If you control the input stream, you can draw conclusions about the state. In your sliding-window, array-based approach, it is kind of obvious that if you feed a bunch of zeros into the algorithm after the original input, you get a bunch of averages with a decreasing number of non-zero values taken into account. The last one has just one original non-zero value, so if you multiply it by N you get the last input back. Using that and the second-to-last output, which accounts for two non-zero inputs, you can reconstruct the second-to-last input, and so on.
So essentially your algorithm needs to maintain sufficient state to reconstruct the last N elements of input, at least if you formulate it as an on-line algorithm. I don't think an off-line algorithm can do any better, except if you consider it reading the input multiple times, but I don't have as strong an argument for that.
Of course, in some theoretical models you can avoid the array and e.g. encode all the state into a single arbitrary-length integer, but that's just cheating the theory, and doesn't make any difference in practice.
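To make the reconstruction argument concrete, here is a small R sketch (names and values hypothetical): starting only from the averages the box emits while we feed it zeros, we recover the last N inputs exactly:
N <- 3
hidden <- c(3, 5, 4)            # the last N inputs, which we will reconstruct
outs <- mean(hidden)            # outs[1]: the average before any zeros are fed
for (k in 1:(N - 1)) {          # feed N-1 zeros, recording each average
  hidden <- c(hidden[-1], 0)
  outs <- c(outs, mean(hidden))
}
recovered <- numeric(N)
for (j in N:1) {                # unwind: the padded zeros contribute nothing
  tail_sum <- if (j < N) sum(recovered[(j + 1):N]) else 0
  recovered[j] <- outs[j] * N - tail_sum
}
recovered                       # 3 5 4: the inputs, recovered from outputs alone
Since N averages suffice to pin down the last N inputs, any correct implementation must carry at least that much information in its state.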

Constructing an object using the genoset package in R

The genoset R package has a function for building a GenoSet by putting together several matrices and a RangedData object that specifies coordinates.
I have the following objects: three matrices, all with the same row names, and a RangedData object of the following format (called locData).
space ranges |
<factor> <IRanges> |
cg00000957 1 [ 5937253, 5937253] |
cg00001349 1 [166958439, 166958439] |
cg00001583 1 [200011786, 200011786] |
cg00002028 1 [ 20960010, 20960010] |
cg00002719 1 [169396706, 169396706] |
cg00002837 1 [ 44513358, 44513358] |
When I try to create a GenoSet, though, I get the following error.
DMRSet=GenoSet(locData,Exprs,meth,unmeth,universe=NULL)
Error in .Call2("IRanges_from_integer", from, PACKAGE = "IRanges") :
cannot create an IRanges object from an integer vector with missing values.
What am I doing wrong? All the objects I'm putting together have the same rownames, except for the IRanges object itself, which I don't think has rownames since it isn't a matrix.
Additionally, the "space" column of locData has non-integer characters.
Thank you!
It sounds like your "locData" may not be a RangedData. It can alternatively be a GRanges. Either way, you will want to name all of your arguments.
The underlying eSet class will also be upset about unnamed arguments once you get past the locData trouble.
DMRSet=GenoSet(locData=locData,exprs=Exprs,meth=meth,unmeth=unmeth,universe=NULL)
Pete
