How do you sort a dictionary by its values and then sum these values up to a certain point? - julia

I was wondering what the best method would be to sort a dictionary of type Dict{String, Int} based on the value. I loop over a FASTQ file containing multiple sequence records; each record has a String identifier that serves as the key, and another string whose length I take as the value for that key.
For example:
testdict["ee0a"]=length("aatcg")
testdict["002e4"]=length("aatcgtga")
testdict["12-f9"]=length(aatcgtgacgtga")
In this case the key value pairs would be "ee0a" => 5, "002e4" => 8, and "12-f9" => 13.
What I want to do is sort these pairs from the highest value to the lowest, after which I sum these values into a separate variable until that variable passes a certain threshold. I then need to save the keys I used so I can use them later on.
Is it possible to use the sort() function or a SortedDict to achieve this? I would imagine that if the sorting succeeded, I could use a while loop to add my keys to a list and add my values to a separate variable until it's greater than my threshold, and then use the list of keys to create a new dictionary with my selected key-value pairs.
However, what would be the fastest way to do this? The FASTQ files I read in can contain multiple GBs of data, so I'd love to create a sorted dictionary while reading in the file and select the records I want before doing anything else with the data.

If your file is multiple GBs of data, I would avoid storing it all in a Dict in the first place. I think it is better to process the file sequentially and store the keys that meet your condition in a PriorityQueue from the DataStructures.jl package. Of course, you can repeat the same procedure if you read the data from a dictionary in memory (the source simply changes from a disk file to the dictionary).
Here is pseudocode for what you could consider (a full solution would depend on how you read your data, which you did not specify).
Assume that you want to store elements until they exceed a threshold kept in the THRESH constant.
pq = PriorityQueue{String, Int}()   # from DataStructures.jl; the smallest value sits at the front
s = 0
while (there are more key-value pairs in the source file)
    key, value = read(source file)
    # this check avoids adding a key-value pair for which we are sure
    # that it is not interesting
    if s <= THRESH || value > peek(pq)[2]
        enqueue!(pq, key, value)
        s += value
        # if we added something to the queue we have to check
        # whether we should drop the smallest elements from it
        while s - peek(pq)[2] > THRESH
            s -= dequeue_pair!(pq)[2]   # dequeue_pair! removes and returns the smallest key => value pair
        end
    end
end
After this process, pq will hold only the key-value pairs you are interested in. The key benefit of this approach is that you never need to store the whole dataset in RAM. At any point in time you only store the key-value pairs that would be selected at that stage of processing the data.
Observe that this process does not give you an easily predictable result, because several keys might have the same value. If such a value lands on the cutoff border, you do not know which of those keys will be retained (you did not specify what you want to happen in this special case; if you specify the requirement, the algorithm would need to be updated a bit).

If you have enough memory to hold at least one or two full Dicts of the required size, you can use an inverted Dict with the length as the key and an array of the old keys as the value, so that records sharing the same length do not overwrite each other under a single key.
I think the code below is then what your question was leading toward:
d1 = Dict("a" => 1, "b" => 2, "c" => 3, "d" => 2, "e" => 1, "f" =>5)
d2 = Dict()
for (k, v) in d1
    d2[v] = haskey(d2, v) ? push!(d2[v], k) : [k]
end
println(d1)
println(d2)
for k in sort(collect(keys(d2)))
    print("$k, $(d2[k]); ")
    # here can delete keys under a threshold to speed further processing
end
If you don't have enough memory to hold an entire Dict, you may benefit
from first putting the data into a SQL database like SQLite and then doing
queries instead of modifying a Dict in memory. In that case, one column
of the table will be the data, and you would add a column for the data length
to the SQLite table. Or you can use a PriorityQueue as in the answer above.
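As a rough sketch of that route (illustrative only, using Python's built-in sqlite3 module just to show the table shape and query; the same idea applies from Julia, e.g. via the SQLite.jl package, and all names below are made up):
import sqlite3

conn = sqlite3.connect("reads.db")                     # placeholder database file
conn.execute("CREATE TABLE reads (id TEXT, seq TEXT, seq_length INTEGER)")
conn.executemany("INSERT INTO reads VALUES (?, ?, ?)",
                 [("ee0a", "aatcg", 5),
                  ("002e4", "aatcgtga", 8),
                  ("12-f9", "aatcgtgacgtga", 13)])

# Query records longest-first; the threshold accumulation can then be done
# while iterating over this cursor instead of over an in-memory Dict.
for record_id, seq_length in conn.execute(
        "SELECT id, seq_length FROM reads ORDER BY seq_length DESC"):
    print(record_id, seq_length)
conn.close()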

Related

Python Dictionary search keys larger than

I have a Python dictionary with numerical keys. I would like to find all the keys larger than an input key, similar to slicing a list.
Example of List:
for i in Mylist[10:]:
Here is an example of my dictionary question:
dict = {133: "Value_1",
145: "Value_1",
185: "Value_1",
210: "Value_1",
240: "Value_1",}
for i in dict[185:]:
Is something like this possible or can we search to only get the keys larger than 185?
That is not how dictionaries or lists work in Python.
List values are accessed by their index number, not by the value itself.
Dictionary values are accessed by whatever the key for that value is.
So in your example, to find all keys greater than the input value, the code would look like this:
Get the input and turn it (which we assume will be a number) into an integer:
min_value = int(input('Enter the minimum key value to search for: '))
Create a list to which we will append the matching keys:
valid_values = []
Iterate over the keys in the dictionary, checking whether each is greater than the given minimum value:
for key in dictionary:
    if key > min_value:
        valid_values.append(key)
And that's it. You have a list with all the keys greater than the input number.
A more efficient way to do it would be to use a list comprehension, like so:
valid_values = [key for key in dictionary if key > min_value]
In the list comprehension, you are asking for the key, for every key in the dictionary, where that key is greater than the minimum value, and the results are placed in the list that valid_values refers to.
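If you need to run this kind of query many times against the same dictionary, one option (not from the original answer, just a sketch) is to keep a sorted copy of the keys and use the standard bisect module, so each lookup is a binary search instead of a full scan:
import bisect

data = {133: "Value_1", 145: "Value_1", 185: "Value_1",
        210: "Value_1", 240: "Value_1"}

sorted_keys = sorted(data)            # build once

def keys_greater_than(min_value):
    # first position whose key is strictly greater than min_value
    pos = bisect.bisect_right(sorted_keys, min_value)
    return sorted_keys[pos:]

print(keys_greater_than(185))         # [210, 240]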
Hope that helps!

Vertica, ResultBufferSize has no effect

I'm trying to test the field ResultBufferSize when working with Vertica 7.2.3 using ODBC.
From my understanding, this field should affect the result set.
ResultBufferSize
But even with a value of 1, I get 20K results.
Anyway to make it work?
ResultBufferSize is the size of the result buffer configured at the ODBC data source. Not at runtime.
You get the actual size of a fetched buffer by preparing the SQL statement with SQLPrepare(), counting the result columns with SQLNumResultCols(), and then, for each column found, calling SQLDescribeCol().
Good luck -
Marco
I need to add a whole other answer to your comment, Tsahi.
I'm not completely sure if I still misunderstand you, though.
Maybe clarifying how I do it in an ODBC based SQL interpreter sheds some light on the matter.
SQLPrepare() on a string containing, say, "SELECT * FROM foo", returns SQL_SUCCESS, and the passed statement handle becomes valid.
SQLNumResultCols(&stmt,&colcount) on that statement handle returns the number of columns in its second parameter.
In a for loop from 0 to (colcount-1), I call SQLDescribeCol(), to get, among other things, the size of the column - that's how many bytes I'd have to allocate to fetch the biggest possible occurrence for that column.
I allocate enough memory to be able to fetch a block of rows instead of just one row in a subsequent SQLFetchScroll() call. For example, a block of 10,000 rows. For this, I need to allocate, for each column in colcount, 10,000 times the maximum possible fetchable size, plus a two-byte integer for the null indicator of each column. These two areas together (the data area and the null-indicator area), allocated for 10,000 rows in my example, make up the fetch buffer size, in other words, the result buffer size.
For the prepared statement handle, I call a SQLSetStmtAttr() to set SQL_ATTR_ROW_ARRAY_SIZE to 10,000 rows.
SQLFetchScroll() will return either 10,000 rows in one call, or, if the table foo contains fewer rows, all rows in foo.
This is how I understand it to work.
You can do the maths the other way round:
You set the max fetch buffer.
You prepare and describe the statement and columns as explained above.
For each column, you count two bytes for the null indicator plus the maximum possible fetch size as reported by SQLDescribeCol(), to get the total number of bytes that need to be allocated for one row.
You integer divide the max fetch buffer by the sum of bytes for one row.
And you use that integer divide result for the call of SQLSetStmtAttr() to set SQL_ATTR_ROW_ARRAY_SIZE.
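As a back-of-the-envelope illustration of that arithmetic, sketched in Python (the buffer size and column widths below are made up; real widths come from SQLDescribeCol()):
max_fetch_buffer = 4 * 1024 * 1024            # e.g. a 4 MiB result buffer
null_indicator_bytes = 2                      # per column, as described above

# maximum byte widths reported for each result column (illustrative values)
column_widths = [4, 8, 40, 255]

bytes_per_row = sum(w + null_indicator_bytes for w in column_widths)
row_array_size = max_fetch_buffer // bytes_per_row    # value for SQL_ATTR_ROW_ARRAY_SIZE

print(bytes_per_row, row_array_size)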
Hope it makes some sense ...

Smart way to generate edges in Neo4J for big graphs

I want to generate a graph from a CSV file. The rows are the vertices and the columns the attributes. I want to generate the edges by similarity between the vertices (not necessarily with weights), in such a way that when two vertices have the same value for some attribute, an edge between those two carries that attribute with value 1 or true.
The simplest cypher query that occurs to me looks somewhat like this:
Match (a:LABEL), (b:LABEL)
WHERE a.attr = b.attr
CREATE (a)-[r:SIMILAR {attr : 1}]->(b)
The graph has about 148000 vertices, and the Java Heap Size option is: dynamically calculated based on available system resources.
The query I posted gives a Neo.DatabaseError.General.UnknownFailure with a hint about the Java Heap Space setting above.
A problem I can think of is that a huge Cartesian product is built first, only to then look for matches to create edges. Is there a smarter, perhaps incremental, way to do that?
I think you need to change the model a little: there is no need to connect every node to every other node by the value of a particular attribute. It is better to have an intermediate node to which you bind the nodes with the same attribute value.
This can be done at the export time or later.
For example:
Match (A:LABEL) Where A.attr Is Not Null
Merge (S:Similar {propName: 'attr', propValue: A.attr})
Merge (A)-[r:Similar]->(S)
Later, with a separate query, you can remove Similar nodes that have only one connection (no other nodes with an equal value of this attribute):
Match (S:Similar)<-[r]-()
With S, count(r) As r Where r=1
Detach Delete S
If you need to connect by all properties, you can use the next query:
Match (A:LABEL) Where A.attr Is Not Null
With A, Keys(A) As keys
Unwind keys as key
Merge (S:Similar {propName: key, propValue: A[key]})
Merge (A)-[:Similar]->(S)
You're right that a huge Cartesian product will be produced.
You can iterate over the a nodes in batches of 1000, for example, and run the query, incrementing the SKIP value on every iteration until it returns 0 (a driver-side sketch of such a loop is shown after the query).
MATCH (a:Label)
WITH a SKIP 0 LIMIT 1000
MATCH (b:Label)
WHERE b.attr = a.attr AND id(b) > id(a)
CREATE (a)-[:SIMILAR_TO {attr: 1}]->(b)
RETURN count(*) as c
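For completeness, here is a rough client-side sketch of that batching loop in Python with the official neo4j driver (URI and credentials are placeholders; it mirrors the termination condition described above):
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (a:Label)
WITH a SKIP $skip LIMIT 1000
MATCH (b:Label)
WHERE b.attr = a.attr AND id(b) > id(a)
CREATE (a)-[:SIMILAR_TO {attr: 1}]->(b)
RETURN count(*) AS c
"""

skip = 0
with driver.session() as session:
    while True:
        c = session.run(query, skip=skip).single()["c"]
        if c == 0:       # stop once a batch creates no relationships
            break
        skip += 1000
driver.close()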

gawk - sorting values from array of arrays

I'm using gawk 4 to build arrays of arrays and need to figure out percentile data from them. I need to sort values in ascending order, which doesn't appear possible using asort when working with multidimensional arrays. Some of my values will be duplicate integers, but I need to keep all duplicates.
Here is what my data looks like. Element names for [a] and [b] end up being unique strings. Array [b] then has elements that are named 1, 2, 3, etc. and contain as values the data I need to sort on.
mArray[a][b][1]=3456
mArray[a][b][2]=1456
mArray[a][b][3]=1456
...
mArray[a][b][1]=9233
mArray[a][b][2]=9233
mArray[a][b][3]=1234
...
mArray[a][b][1]=4567
mArray[a][b][2]=4567
mArray[a][b][3]=3097
I figure I can create regular arrays from each unique [a] element and insert values from its corresponding [b][x] and then asort on that, but then I lose whatever duplicate values exist. Right now I am hacking it by walking mArray and writing to different files based on the name of [a], printing out all values under [b][x], then running sort. Curious if there is a more elegant way of doing it.
Here is what I tried using asort against my mArray to test getting proper output. After 30 minutes I get no output or errors.
for ( a in mArray ) {
for ( b in mArray[a] ) {
n=asort(mArray[a][b][c])
print n
}
}
Background: parsing CSV reports from a network monitoring system, grabbing throughput sample data then aggregating those values across all interfaces to determine 95th percentile for total throughput of a device.
Edit
Desired output format after sorting would be:
mArray[a][b][1]=1456
mArray[a][b][2]=1456
mArray[a][b][3]=3456
.
mArray[a][b][1]=1234
mArray[a][b][2]=9233
mArray[a][b][3]=9233
...
mArray[a][b][1]=3097
mArray[a][b][2]=4567
mArray[a][b][3]=4567
Well, you have to sort myArray[a][b], not myArray[a][b][c], because c doesn't even exist ;)
If you don't want to sort in place, you have to add the destination as a second parameter to asort. At least this works in gawk, though I don't know since which version. In gawk 4 it does.
And then you have to print an array one by one...
for ( a in myArray ) {
    for ( b in myArray[a] ) {
        m = asort(myArray[a][b], n)    # asort returns the number of elements
        for ( i = 1; i <= m; i++ )     # loop numerically so output comes out in sorted order
            print "m["a"]["b"]["i"]="n[i]
    }
}

SQLite - Update with random unique value

I am trying to populate every row in a column with a random value ranging from 0 to the row count.
So far I have this
UPDATE table
SET column = ABS (RANDOM() % (SELECT COUNT(id) FROM table))
This does the job but produces duplicate values, which turned out to be bad. I added a Unique constraint but that just causes it to crash.
Is there a way to update a column with random unique values from certain range?
Thanks!
If you want to later read the records in a random order, you can just do the ordering at that time:
SELECT * FROM MyTable ORDER BY random()
(This will not work if you need the same order in multiple queries.)
Otherwise, you can use a temporary table to store the random mapping between the rowids of your table and the numbers 1..N.
(Those numbers are automatically generated by the rowids of the temporary table.)
CREATE TEMP TABLE MyOrder AS
SELECT rowid AS original_rowid
FROM MyTable
ORDER BY random();
UPDATE MyTable
SET MyColumn = (SELECT rowid
                FROM MyOrder
                WHERE original_rowid = MyTable.rowid) - 1;
DROP TABLE MyOrder;
What you seem to be seeking is not simply a set of random numbers, but rather a random permutation of the numbers 1..N. This is harder to do. If you look in Knuth (The Art of Computer Programming), or in Bentley (Programming Pearls or More Programming Pearls), one suggested way is to create an array with the values 1..N, and then for each position, swap the current value with a randomly selected other value from the array. (I'd need to dig out the books to check whether it is any arbitrary position in the array, or only with a value following it in the array.) In your context, then you apply this permutation to the rows in the table under some ordering, so row 1 under the ordering gets the value in the array at position 1 (using 1-based indexing), etc.
In the 1st Edition of Programming Pearls, Column 11 Searching, Bentley says:
Knuth's Algorithm P in Section 3.4.2 shuffles the array X[1..N].
for I := 1 to N do
Swap(X[I], X[RandInt(I,N)])
where the RandInt(n,m) function returns a random integer in the range [n..m] (inclusive). That's nothing if not succinct.
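For illustration, here is that shuffle written out in Python (just a sketch; random.randint is inclusive on both ends, matching RandInt above):
import random

def algorithm_p_shuffle(x):
    # Knuth's Algorithm P: swap each position with a randomly chosen
    # position from itself through the end of the array.
    n = len(x)
    for i in range(n):
        j = random.randint(i, n - 1)   # RandInt(I, N), adjusted for 0-based indexing
        x[i], x[j] = x[j], x[i]
    return x

print(algorithm_p_shuffle(list(range(1, 11))))   # a random permutation of 1..10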
The alternative is to have your code thrashing around when there is one value left to update, waiting until the random number generator picks the one value that hasn't been used yet. As a hit and miss process, that can take a while, especially if the number of rows in total is large.
Actually translating that into SQLite is a separate exercise. How big is your table? Is there a convenient unique key on it (other than the one you're randomizing)?
Given that you have a primary key, you can easily generate an array of structures such that each primary key is allocated a number in the range 1..N. You then use Algorithm P to permute the numbers. Then you can update the table from the primary keys with the appropriate randomized number. You might be able to do it all with a second (temporary) table in SQL, especially if SQLite supports UPDATE statements with a join between two tables. But it is probably nearly as simple to use the array to drive singleton updates. You'd probably not want a unique constraint on the random number column while this update is in progress.
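Putting those pieces together, a sketch of the singleton-update approach in Python with the built-in sqlite3 module (table, column, and file names are placeholders, and random.shuffle performs exactly this kind of Fisher-Yates shuffle):
import random
import sqlite3

conn = sqlite3.connect("mydata.db")                 # placeholder database file
ids = [row[0] for row in conn.execute("SELECT id FROM MyTable")]

# A random permutation of 1..N, one number per primary key.
values = list(range(1, len(ids) + 1))
random.shuffle(values)

# Singleton updates: assign each primary key its number from the permutation.
conn.executemany("UPDATE MyTable SET MyColumn = ? WHERE id = ?", zip(values, ids))
conn.commit()
conn.close()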
