How to cut down/delete a zarr array

I have a simple array (say length 1000) of objects in zarr. I want to replace it with a slimmed down version, picking only a subset of the items, as specified using a boolean array of size 1000. I want to keep everything else the same (e.g. if this array is a persistent one, I want to change the array on disk as well as in memory). I can't simply reassign the array:
my_zarr_data = my_zarr_data[:][selected_items]
because then I get the error ValueError: missing object_codec for object array. Another option would be to make a copy, delete all the data, then add it back from the original using append(), but I can't see how to clear a zarr array while keeping the object_codec and other parameters the same (perhaps I could just do resize(0)?). At the moment I'm resizing to sum(selected_items) and then using my_zarr_data.set_basic_selection(..., my_zarr_data[:][selected_items]). Is that right? Is there a more efficient way to permanently reassign an array to (say) the return value of get_mask_selection()?
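Roughly, what I'm doing at the moment looks like this (the store path, chunking and the VLenUTF8 object codec below are placeholders for illustration, assuming zarr v2-style object arrays):

import numpy as np
import zarr
from numcodecs import VLenUTF8

# hypothetical persistent object array of length 1000
z = zarr.open("example.zarr", mode="a", shape=(1000,), chunks=(100,),
              dtype=object, object_codec=VLenUTF8())
z[:] = np.array([f"item{i}" for i in range(1000)], dtype=object)

selected_items = np.zeros(1000, dtype=bool)
selected_items[::3] = True                    # keep every third item

kept = z.get_mask_selection(selected_items)   # materialise the subset in memory
z.resize(int(selected_items.sum()))           # shrink the array on disk
z.set_basic_selection(slice(None), kept)      # overwrite it with the subset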

Related

In Julia, how do I find out why Dict{String, Any} is Any?

I am very new to Julia and mostly code in Python these days. I am using Julia to work with and manipulate HDF5 files.
So when I get to writing out (h5write), I get an error because the data argument is of mixed type and I need to find out why.
The error message says I am trying to pass in an Array{Dict{String,Any},4}, but when I look at the values (and it is a huge structure), I see a lot of things like 0xff. How do I quickly find out why the element type is Any and not a single concrete type?
Just to make this an answer:
If my_dicts is an Array{Dict{String, Any}, 4}, then one way of working out what types are hiding in the Any part of the dict is:
unique(typeof.(values(my_dicts[1])))
To explain:
my_dicts[1] picks out the first element of your Array, i.e. one of your Dict{String, Any}
values then extracts the values, which is the Any part of the dictionary,
typeof. (notice the dot) broadcasts the typeof function over all elements returned by values, returning the types of all of these elements; and
unique takes the list of all these types and reduces it to its unique elements, so you'll end up with a list of each separate type contained in the Any part of your dictionary.

How do you sort a dictionary by its values and then sum these values up to a certain point?

I was wondering what the best method would be to sort a dictionary of type Dict{String, Int} based on the value. I loop over a FASTQ file containing multiple sequence records; each record has a String identifier which serves as the key, and another string whose length I take as the value for that key.
for example:
testdict["ee0a"]=length("aatcg")
testdict["002e4"]=length("aatcgtga")
testdict["12-f9"]=length(aatcgtgacgtga")
In this case the key value pairs would be "ee0a" => 5, "002e4" => 8, and "12-f9" => 13.
What I want to do is sort these pairs from the highest value to the lowest, after which I sum the values into a separate variable until that variable passes a certain threshold. I then need to save the keys I used so I can use them later on.
Is it possible to use the sort() function or a SortedDict to achieve this? I imagine that if the sorting succeeded I could use a while loop to add my keys to a list and accumulate my values into a separate variable until it is greater than my threshold, and then use the list of keys to create a new dictionary with my selected key-value pairs.
However, what would be the fastest way to do this? The FASTQ files I read in can contain multiple GBs worth of data, so I'd love to create a sorted dictionary while reading in the file and select the records I want before doing anything else with the data.
If your file is multiple GBs worth of data, I would avoid storing it all in a Dict in the first place. I think it is better to process the file sequentially and store the keys that meet your condition in a PriorityQueue from the DataStructures.jl package. Of course you can apply the same procedure if you read the data from a dictionary in memory (the source simply changes from a disk file to the dictionary).
Here is pseudocode for what you could consider (a full solution would depend on how you read your data, which you did not specify).
Assume that you want to store elements until their summed values exceed the threshold kept in the THRESH constant.
pq = PriorityQueue{String, Int}()
s = 0
while (there are more key-value pairs in the source file)
    key, value = read(source file)
    # this check avoids adding a key-value pair that we already know
    # is not interesting
    if s <= THRESH || value > peek(pq)[2]
        enqueue!(pq, key, value)
        s += value
        # if we added something to the queue we have to check
        # whether we should drop the smallest elements from it
        while s - peek(pq)[2] > THRESH
            s -= dequeue_pair!(pq)[2]   # dequeue_pair! returns key => value of the smallest element
        end
    end
end
After this process pq will hold only the key-value pairs you are interested in. The key benefit of this approach is that you never need to hold the whole data set in RAM; at any point in time you only store the key-value pairs that would be selected at that stage of processing the data.
Observe that this process does not give you an easily predictable result, because several keys might have the same value, and if that value falls on the cutoff border you do not know which of them will be retained (however, you did not specify what you want to do in this special case; if you specify the requirement for it, the algorithm can be updated a bit).
If you have enough memory to hold at least one or two full Dicts of the required size, you can use an inverted Dict, with the length as the key and an array of the old keys as the value, so that entries with duplicate lengths are not lost under the same key.
I think that the code below is then what your question was leading toward:
d1 = Dict("a" => 1, "b" => 2, "c" => 3, "d" => 2, "e" => 1, "f" => 5)
d2 = Dict()
for (k, v) in d1
    d2[v] = haskey(d2, v) ? push!(d2[v], k) : [k]
end
println(d1)
println(d2)
for k in sort(collect(keys(d2)))
    print("$k, $(d2[k]); ")
    # here can delete keys under a threshold to speed further processing
end
If you don't have enough memory to hold an entire Dict, you may benefit from first putting the data into a SQL database like SQLite and then doing queries instead of modifying a Dict in memory. In that case, one column of the table will be the data, and you would add a column for the data length to the SQLite table. Or you can use a PriorityQueue as in the answer above.
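To illustrate the idea, here is a minimal sketch using Python's built-in sqlite3 module purely for illustration (in Julia the same approach works with SQLite.jl); the table and column names are invented:

import sqlite3

conn = sqlite3.connect("records.db")
conn.execute("CREATE TABLE IF NOT EXISTS records (id TEXT, seq TEXT, seq_len INTEGER)")

# store each record together with its length as an extra column
for rec_id, seq in [("ee0a", "aatcg"), ("002e4", "aatcgtga"), ("12-f9", "aatcgtgacgtga")]:
    conn.execute("INSERT INTO records VALUES (?, ?, ?)", (rec_id, seq, len(seq)))
conn.commit()

# walk the records from longest to shortest and accumulate until the threshold is passed
THRESH = 15
total, selected = 0, []
for rec_id, seq_len in conn.execute("SELECT id, seq_len FROM records ORDER BY seq_len DESC"):
    selected.append(rec_id)
    total += seq_len
    if total > THRESH:
        break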

Vertica, ResultBufferSize has no effect

I'm trying to test the field: ResultBufferSize when working with Vertica 7.2.3 using ODBC.
From my understanding this field should affect the result set.
ResultBufferSize
but even with value 1 I get 20K results.
Any way to make it work?
ResultBufferSize is the size of the result buffer configured at the ODBC data source. Not at runtime.
You get the actual size of a fetched buffer by preparing the SQL statement - SQLPrepare(), counting the result columns - SQLNumResultCols(), and then, for each column found, running SQLDescribeCol().
Good luck -
Marco
I need to add a whole other answer to your comment, Tsahi.
I'm not completely sure if I still misunderstand you, though.
Maybe clarifying how I do it in an ODBC based SQL interpreter sheds some light on the matter.
SQLPrepare() on a string containing, say, "SELECT * FROM foo", returns SQL_SUCCESS, and the passed statement handle becomes valid.
SQLNumResultCols(&stmt,&colcount) on that statement handle returns the number of columns in its second parameter.
In a for loop from 0 to (colcount-1), I call SQLDescribeCol(), to get, among other things, the size of the column - that's how many bytes I'd have to allocate to fetch the biggest possible occurrence for that column.
I allocate enough memory to be able to fetch a block of rows instead of just one row in a subsequent SQLFetchScroll() call. For example, a block of 10,000 rows. For this, I need to allocate, for each column in colcount, 10,000 times the maximum possible fetchable size, plus a two-byte integer for the null indicator for each column. These two allocations, the data area and the null indicator area for 10,000 rows in my example, make up the fetch buffer size, in other words, the result buffer size.
For the prepared statement handle, I call a SQLSetStmtAttr() to set SQL_ATTR_ROW_ARRAY_SIZE to 10,000 rows.
SQLFetchScroll() will return either 10,000 rows in one call, or, if the table foo contains fewer rows, all rows in foo.
This is how I understand it to work.
You can do the maths the other way round:
You set the max fetch buffer.
You prepare and describe the statement and columns as explained above.
For each column, you count two bytes for the null indicator, plus the maximum possible fetch size as reported by SQLDescribeCol(), to get the number of bytes that need to be allocated for one row.
You integer divide the max fetch buffer by the sum of bytes for one row.
And you use that integer divide result for the call of SQLSetStmtAttr() to set SQL_ATTR_ROW_ARRAY_SIZE.
Hope it makes some sense ...
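To make the arithmetic concrete, here is a tiny worked example (the column sizes and buffer size are invented; this is just the calculation, not ODBC code):

# hypothetical maximum column sizes, in bytes, as SQLDescribeCol() might report them
col_sizes = [4, 8, 50]
NULL_INDICATOR = 2            # two-byte null indicator per column

bytes_per_row = sum(size + NULL_INDICATOR for size in col_sizes)   # 68 bytes

max_fetch_buffer = 1_048_576  # say we allow a 1 MiB result buffer
row_array_size = max_fetch_buffer // bytes_per_row                 # 15420 rows

# row_array_size is the value you would pass to SQLSetStmtAttr() as SQL_ATTR_ROW_ARRAY_SIZE
print(bytes_per_row, row_array_size)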

Fortran90 created allocatable arrays but elements incorrect

Trying to create an array from an xyz data file. The data file is arranged so that x,y,z of each atom is on a new line and I want the array to reflect this.
Then I want to use this array to find the distance from each atom in the list to all the others.
To do this the array has been copied such that atom1 & atom2 should be identical to the input file.
length is simply the number of atoms in the list.
The write statement: WRITE(20,'(3F12.9)') atom1 actually gives the matrix wanted but when I try to find individual elements they're all wrong!
Any help would be really appreciated!
Thanks guys.
DOUBLE PRECISION, DIMENSION(:,:), ALLOCATABLE :: atom1, atom2
ALLOCATE(atom1(length,3), atom2(length,3))
READ(10,*) ((atom1(i,j), i=1,length), j=1,3)
atom2 = atom1
distn = 0
distc = 0
DO n=1,length
   x1 = atom1(n,1)
   y1 = atom1(n,2) !1st atom
   z1 = atom1(n,3)
   DO m=1,length
      x2 = atom2(m,1)
      y2 = atom2(m,2) !2nd atom
      z2 = atom2(m,3)
Your READ statement reads all the x coordinates for all atoms from however many records, then all the y coordinates, then all the z coordinates. That's inconsistent with your description of the input file. You have the nesting of the io-implied-do's in the READ statement the wrong way around - it should be ((atom1(i,j),j=1,3),i=1,length).
Similarly, as per the comment, your diagnostic write misled you - you were outputting all x ordinates, followed by all y ordinates, etc. Array element order of a whole array reference varies the first (leftmost) dimension fastest (colloquially known as column major order).
(There are various pitfalls associated with list directed formatting that mean I wouldn't recommend it for production code, except perhaps for input specifically written with knowledge of and defence against those pitfalls. One of those pitfalls is that a READ under list directed formatting will pull in as many records as it requires to satisfy the input list. You might have detected the problem earlier if you were using an explicit format that nominated the number of fields per record.)

Dimension Does Not Match When Populating Matrix

I am currently working in R and I am trying to populate a matrix with some for loops. However, I keep getting the "number of items to replace is not a multiple of replacement length" error. The way I set up my matrix() is that I specified nrow (because I am sure of the size) and I leave ncol blank.
How can I create a matrix that dynamically allocates its dimensions?
Any recommendations?
Thank you.
A couple of options spring to mind:
Make an informed guess as to the size of the matrix and allocate accordingly. Then have your code check whether you would exceed the chosen limits and expand the object if so. If you expand by a reasonable chunk size (e.g. don't add just 1 column; add 10 or 20 or n depending on the size of your problem, whatever is reasonable) then you won't incur the copy/expand overhead that often, which is what bogs loops down when written badly (see the sketch after these two options).
Store the data/result in a list, each component of which would be one row of your matrix. That way you fill in the object as you go along, and then can either process the resulting list into a matrix with padding, or just work directly with the list. If each row can be of a different length (number of columns) then it doesn't make sense to store as a matrix in the first place and the list is the better option.
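Purely as an illustration of the two patterns (sketched in Python/NumPy; in R the same ideas apply with cbind() for chunked growth and do.call(rbind, your_list) for assembling a list of rows):

import numpy as np

n_rows, chunk = 100, 20

# Option 1: informed guess plus chunked growth
mat = np.empty((n_rows, chunk))          # guess: room for 20 columns
used = 0
for i in range(55):                      # pretend 55 columns actually arrive
    if used == mat.shape[1]:             # guess exceeded: add a whole chunk, not 1 column
        mat = np.hstack([mat, np.empty((n_rows, chunk))])
    mat[:, used] = np.random.rand(n_rows)
    used += 1
mat = mat[:, :used]                      # drop the unused slack at the end

# Option 2: accumulate rows in a list and assemble once at the end
rows = [np.random.rand(7) for _ in range(n_rows)]
mat2 = np.vstack(rows)                   # only works if every row has the same length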
