count vs length vs size in a collection

From using a number of programming languages and libraries I have noticed various terms used for the total number of elements in a collection.
The most common seem to be length, count, and size.
e.g.
array.length
vector.size()
collection.count
Is there any preferred term to be used?
Does it depend on what type of collection it is, i.e. mutable vs. immutable?
Is there a preference for it being a property instead of a method?

Length() tends to refer to contiguous elements - a string has a length, for example.
Count() tends to refer to the number of elements in a looser collection.
Size() tends to refer to the size of the collection. Often this differs from the length in cases like vectors (or strings): there may be 10 characters in a string while storage is reserved for 20. It may also refer to the number of elements - check the source/documentation.
Capacity() is used specifically for the allocated space in a collection, not the number of valid elements in it. If a type has both "capacity" and "size" defined, then "size" usually refers to the number of actual elements.
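For example, on the JVM a StringBuilder exposes both notions at once; here is a minimal Kotlin sketch (the capacity value is fixed by the constructor here, but in general it is an implementation detail):

fun main() {
    val sb = StringBuilder(20)  // reserve storage for 20 characters
    sb.append("hello")
    println(sb.length)          // 5  - the number of valid characters
    println(sb.capacity())      // 20 - the allocated storage
}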
I think the main point comes down to human language and idioms: the size of a string doesn't seem very obvious, whilst the length of a set is equally confusing, even though they might refer to the same thing (the number of elements) in a collection of data.

FWIW (and that's vanishingly close to nothing), I prefer 'Count' because it seems to indicate pretty unambiguously that it's going to return the number of elements/items in the collection.
When faced with the terms 'Length' or 'Size' I'm often left wondering for a moment (or even being forced to re-read documentation) whether the damn thing is going to tell me how many elements are in the collection or how many bytes the collection is consuming. This is particularly true for collections that are intended to be contiguous, like arrays or strings.
But no one who was responsible for the naming conventions used by the Java, BCL/.Net, or C/C++ standard frameworks/libraries bothered to ask me, so you're all stuck with whatever they came up with.
If only I were much smarter than I am and was named Bjarne, all of you might be spared the misery...
Of course, back in the real world, you should try to stick with whatever naming convention is used by the language/platform you're using (e.g., size() in C++). Not that this seems to help you with your Array.Length dilemma.

The terms are somewhat interchangeable, though in some situations I would prefer one over another. Usually you get the best usage if you think about how you would describe the length/size/count of this element verbally to another person.
length() implies that the element has a length. A string has a length. You say "a string is 20 characters long", right? So it has a length.
size() implies that the element has a size. E.g. a file has a size. You say "this file has a size of 2 MB", right? So it has a size.
That said, a string can also have a size, but I'd expect something else here. E.g. a UTF-16 string may have a length of 100 characters, but as every character is composed of two bytes, I'd expect the size to be 200.
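That difference is easy to check on the JVM; a small Kotlin sketch (UTF-16BE is used to avoid a byte-order mark inflating the count):

fun main() {
    val s = "hello"
    println(s.length)                               // 5 UTF-16 code units
    println(s.toByteArray(Charsets.UTF_16BE).size)  // 10 bytes, 2 per unit
}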
count() is very unusual. Objective-C uses count for the number of elements in an array. One might argue whether an array has a length (as in Java), has a size (as in most other languages), or has a count. However, size might again be the size in bytes (if the array items are 32-bit ints, each item is 4 bytes), and length... I wouldn't say "an array is 20 elements long"; that sounds rather odd to me. I'd say "an array has 20 elements". I'm not sure count expresses that very well, but I think count is here a short form for elementCount(), and that again makes much more sense for an array than length() or size().
If you create your own objects/elements in a programming language, it's best to use whatever similar elements use, since programmers are used to accessing the desired property by that term.

Count I think is the most obvious term to use if you're looking for the number of items in a collection. That should even be obvious to new programmers who haven't become particularly attached to a given language yet.
And it should be a property, as that's what it is: a description (a.k.a. property) of the collection. A method would imply that it has to do something to the collection to get the number of items, and that just seems unintuitive.

Hmm... I would not use size, because it might be confused with size in bytes.
Length could make some sense for arrays, as long as they are supposed to occupy consecutive bytes of memory.
Though... length in what?
Count is clear: how many elements. I would use count.
About property vs. method: I would use a property to signal that it's fast, and a method to signal that it's slow.
And, most important, I would stick to the standards of the languages/libraries you are using.

Adding to @gbjbaanb's answer...
If "property" implies public access to the value, I would say that "method" is preferred simply to provide encapsulation and to hide the implementation.
You might change your mind about how to count elements or how you maintain that count. If it is a property, you're stuck; if it is accessed via a method, you can change the underlying implementation without impacting users of the collection.
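Some languages soften this trade-off. In Kotlin, for instance, a property is compiled to a getter method, so callers get property syntax while the implementation stays free to change; a sketch using a hypothetical Bag class:

class Bag<T> {
    private val items = mutableListOf<T>()
    // Reads like a property at the call site, but it is backed by a
    // getter: the body can later cache or recount without breaking callers.
    val count: Int
        get() = items.size
    fun add(item: T) { items.add(item) }
}

fun main() {
    val bag = Bag<String>()
    bag.add("x")
    println(bag.count) // 1
}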

Kotlin answer
from _Collections.kt
/**
 * Returns the number of elements in this collection.
 */
@kotlin.internal.InlineOnly
public inline fun <T> Collection<T>.count(): Int {
    return size
}

In Elixir there is actually a clear naming scheme for this across types in the language.
When “counting” the number of elements in a data structure, Elixir
also abides by a simple rule: the function is named size if the
operation is in constant time (i.e. the value is pre-calculated) or
length if the operation is linear (i.e. calculating the length gets
slower as the input grows).

To me, this is a little like asking whether "foreach" is better than "for each". It just depends on the language/framework.

I would say that it depends on the particular language and classes you are using. For example, in C#, if you are using an Array you have the property Length; if you have something that implements IEnumerable you have the extension method Count(), but it is not guaranteed to be fast; and if you implement ICollection you have the property Count.
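Kotlin's standard library makes a similar split, for what it's worth; a quick sketch (size is a constant-time property on Collection, while count() on a Sequence has to iterate):

fun main() {
    val list = listOf(1, 2, 3)
    println(list.size)    // O(1): a property of the Collection
    val seq = generateSequence(1) { it + 1 }.take(1_000)
    println(seq.count())  // O(n): must walk the sequence
}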

Related

What persistent data structures does Raku/Rakudo include?

Raku provides many types that are immutable and thus cannot be modified after they are created. Until I started looking into this area recently, my understanding was that these Types were not persistent data structures – that is, unlike the core types in Clojure or Haskell, my belief was that Raku's immutable types did not take advantage of structural sharing to allow for inexpensive copies. I thought that the statement my List $new = (|$old-list, 42); literally copied the values in $old-list, without the data-sharing features of persistent data structures.
That description of my understanding is in the past tense, however, due to the following code:
my Array $a = do {
    $_ = [rand xx 10_000_000];
    say "Initialized an Array in $((now - ENTER now).round: .001) seconds"; $_ }
my List $l = do {
    $_ = |(rand xx 10_000_000);
    say "Initialized the List in $((now - ENTER now).round: .001) seconds"; $_ }
do { $a.push: rand;
     say "Pushed the element to the Array in $((now - ENTER now).round: .000001) seconds" }
do { my $nl = (|$l, rand);
     say "Appended an element to the List in $((now - ENTER now).round: .000001) seconds" }
do { my @na = |$l;
     say "Copied List \$l into a new Array in $((now - ENTER now).round: .001) seconds" }
which produced this output in one run:
Initialized an Array in 5.938 seconds
Initialized the List in 5.639 seconds
Pushed the element to the Array in 0.000109 seconds
Appended an element to the List in 0.000109 seconds
Copied List $l into a new Array in 11.495 seconds
That is, creating a new List with the old values + one more is just as fast as pushing to a mutable Array, and dramatically faster than copying the List into a new Array – exactly the performance characteristics that you'd expect to see from a persistent List (copying to an Array is still slow because it can't take advantage of structural sharing without breaking the immutability of the List). The fast copying of $l into $nl is not due to either List being lazy; neither is.
All of the above leads me to believe that Lists in Rakudo actually are persistent data structures, with all the performance benefits that implies. That leaves me with several questions:
Am I right about Lists being persistent data structures?
Are all other immutable Types also persistent data structures? Or are any?
Is any of this part of Raku, or just an implementation choice Rakudo has made?
Are any of these performance characteristics documented/guaranteed anywhere?
I have to say, I am both extremely impressed and more than a bit baffled to discover evidence that at least some of Raku(do)'s types are persistent. It's the sort of feature that other languages list as a key selling point or that leads to the creation of libraries with 30k+ stars on GitHub. Have we really had it in Raku without even mentioning it?
I remember implementing these semantics, and I certainly don't recall thinking about them giving rise to a persistent data structure at the time - although it does seem fair to attach that label to the result!
I don't think you'll find anywhere that explicitly spells out this exact behavior; however, the most natural implementation of what the language requires leads quite naturally to it. Taking the ingredients:
The infix:<,> operator is the List constructor in Raku
When a List is created, it is non-committal with regards to laziness and flattening (these arise from how we use the List, which we don't - in general - know at the point of its construction)
When we write (|$x, 1), the prefix:<|> operator constructs a Slip, which is a kind of List that should melt into its surrounding List. Thus what infix:<,> sees is a Slip and an Int.
Making the Slip melt into the result List immediately would mean making a commitment about eagerness, which List construction alone should not do. Thus the Slip and everything after it is placed into the lazily evaluated ("non-reified") portion of the List.
This last of these is what gives rise to the observed persistent data structure style behavior.
I expect it would be possible to have an implementation that inspects the Slip and chooses to eagerly copy things that are known not to be lazy, and still be in compliance with the specification test suite. That would change the time complexity of your example. If you want to be defensive against that, then:
do { my $nl = (|$l.lazy, rand);
     say "Appended an element to the List in $((now - ENTER now).round: .000001) seconds" }
should be sufficient to force the issue even if the implementation changed.
Of other cases that immediately come to mind that are related to persistent data structures or at least tail sharing:
The MoarVM implementation of strings, which is behind str and thus Str, implements string concatenation by creating a new string that refers to the two that are being concatenated instead of copying the data in the two strings (and does similar tricks for substr and repetition). This is strictly an optimization, not a language requirement, and in some delicate cases (the last grapheme of one string and the first grapheme of the next will form a single grapheme in the resulting string), it gives up and takes the copying path.
Outside of the core, modules like Concurrent::Stack, Concurrent::Queue, and Concurrent::Trie use tail sharing as a technique to implement relatively efficient lock-free data structures.

Ada pragma Pack or Alignment attribute for Records?

Having just discovered alignment issues for the first time, I am unsure which method is the best/safest way to deal with them. I have a record which I am serialising to send over a Stream and vice versa, so it must meet the interface spec and contain no padding.
Given the example record:
type MyRecord is record
   a : Unsigned_8;
   b : Unsigned_32;
end record;
By default this would require 8 bytes, but I am able to remove the padding using two methods:
for MyRecord'Alignment use 1;
or
pragma Pack (MyRecord);
I have found a few questions relating to C examples, but haven't been able to find a clear answer on which method is most appropriate, how to determine which method to use, or whether the two are equivalent.
UPDATE
When I tried both on my 'real' code rather than a basic example, I found that the Alignment attribute achieved what I was looking for. pragma Pack reduced the size significantly more; I have not confirmed it, but I assume it packed the many enumerated types I'm using, overriding the 'Size use 8 attribute applied to each type.
For Streams you could leave MyRecord without any representation clauses and use the default MyRecord’Write and MyRecord’Read; ARM 13.13.2(9) says
For elementary types, Read reads (and Write writes) the number of stream elements implied by the Stream_Size for the type T; the representation of those stream elements is implementation defined. For composite types, the Write or Read attribute for each component is called in canonical order, which is last dimension varying fastest for an array (unless the convention of the array is Fortran, in which case it is first dimension varying fastest), and positional aggregate order for a record.
One possible disadvantage of the GNAT implementation (and maybe of others) is that the ’Write and ’Read calls each end in a call to the underlying network software. Not a problem (aside from possible inefficiency) normally, but if you’re using TCP_NODELAY (or worse, UDP) this is not the behaviour you’re looking for.
Overloading ’Write leads back to your original problem (but at least it’s confined to the overloading procedure, so the rest of your program can deal with properly aligned data).
I’ve used an in-memory stream for this (especially the UDP case); ’Write to the in-memory stream, then send the Stream_Element_Array to the socket. One example is ColdFrame.Memory_Streams (.ads, .adb).
I think you want record representation clauses if you need full control:
for MyRecord'Size use 40;
for MyRecord use record
   a at 0 range 0 .. 7;
   b at 1 range 0 .. 31;
end record;
(or some such, I might have messed up some of the indices here).
NB: edited as per comment by Simon

In XQuery, how do you obfuscate a text string and maintain its character length on output?

I need to obfuscate the text content of an element. Let’s say, for example,
a plan ID. The plan ID may appear several times in one document or across
different documents. I need the obfuscated plan ID to be unique and consistent
(always map 12345 to abc72) and limited to only 5 characters. I would prefer not to maintain a separate document used as a mapping file or containing keys.
A simple hash function would not work because of the character
length limitation. Any other ideas? I’d like to stick with doing this in pure
XQuery.
You could use fn:translate (similar to the Unix tr command) to reliably convert one character to another. This is similar to good old rot13, but more flexible and powerful.
You could also build on this, by using a different fixed translation for each position in your text strings, as well.
You could still use hashing. Just truncate to the number of digits you need, something like this:
substring(
  xdmp:integer-to-hex(xdmp:hash64($input)),
  1, string-length($input))
As long as the hashing function is good, that should work just fine. If you need to handle long strings, repeat the hash to pad it out and then truncate. If you need any kind of security, you should throw a private key into the mix and swap out xdmp:hash64 for xdmp:hmac-sha512. That might be a good idea anyway, since SHA-2 512 has well-known characteristics.
substring(
  xdmp:hmac-sha512($key, $input, 'base64'),
  1, string-length($input))
Hash collisions are possible, but unlikely.
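For comparison outside XQuery, the same truncate-a-hash idea can be sketched in Kotlin using the JDK's MessageDigest (the xdmp: functions above are MarkLogic built-ins; this is just an illustration of the technique):

import java.security.MessageDigest

// Deterministic obfuscation: hash, hex-encode, truncate to the
// input's length. Collisions are possible but unlikely for short IDs.
fun obfuscate(input: String): String {
    val digest = MessageDigest.getInstance("SHA-256").digest(input.toByteArray())
    val hex = digest.joinToString("") { "%02x".format(it) }
    return hex.take(input.length)
}

fun main() {
    println(obfuscate("12345")) // always the same 5 hex characters for "12345"
}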

Count the frequency of bytes in a purely functional language

If we had an assignment:
Given a block of binary data, count the frequency of the bytes within it.
If you were supposed to do this in C, the answer would be trivial and reasonably fast even for larger binary blocks. How would one go about implementing this in a purely functional language, without side effects?
For example, if you wrote a function that accepted frequency counts for each byte and the rest of the list of bytes, and returned modified frequency counts, it would have to do an awful lot of work for a data set of 100M bytes.
Also, if you sorted the data and then counted the runs of consecutive same-valued bytes, the sort itself would take a lot of time.
Is there a reasonable way to implement this?
The straightforward way to do it is indeed to pass in and return data structures mapping bytes to counts. This would probably be implemented as some kind of tree (since that's what you get out of the standard library containers, as far as I know). In pure functional programming when you're passed in a tree and you need to return a new tree with a difference in only one node, the returned tree ends up sharing almost all of its structure and data with the original tree.
There is some overhead in traversing the tree to get to the count, but since you're counting bytes the tree never has more than 256 elements, so the overhead is log(256), which is a constant. It doesn't get larger for large data sets; it doesn't change the big-O complexity of the algorithm. That's actually true even if you use the greatest possible overhead of copying around a full 256-entry array of counts with no sharing.
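As a concrete sketch of that straightforward approach (Kotlin here; note that Kotlin's built-in Map copies rather than structurally shares, which is tolerable precisely because the map never exceeds 256 entries):

// Pure fold: each step yields a new map instead of mutating one in place.
fun byteFrequencies(data: ByteArray): Map<Byte, Int> =
    data.fold(emptyMap<Byte, Int>()) { counts, b ->
        counts + (b to ((counts[b] ?: 0) + 1))
    }

fun main() {
    println(byteFrequencies(byteArrayOf(1, 2, 2, 3, 3, 3))) // {1=1, 2=2, 3=3}
}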
If you want to optimise this, you can take advantage of the fact that the "intermediate" frequency counts are never needed except as part of the computation of the next set of counts. That means you can use various techniques for getting the implementation to use destructive updates even while you're still semantically writing functional code. An STRef in Haskell basically lets you do this manually.
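The same idea can be mimicked in any language with mutation by confining the destructive updates inside a single function, so the observable behaviour stays pure; a hedged Kotlin analogy to the STRef pattern:

// The 256-slot array is mutated only inside this function; callers never
// observe an intermediate state, so the function is pure from the outside.
fun byteFrequenciesInPlace(data: ByteArray): IntArray {
    val counts = IntArray(256)
    for (b in data) counts[b.toInt() and 0xFF]++
    return counts
}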
Theoretically the compiler could notice that you're replacing a never-needed-again value with a new one, so it could do the update in place for you. I don't know whether any production-ready compilers currently make this optimisation.

Flex AS3: Are smaller variable names faster than longer names?

We are in the process of optimizing a Flex AS3 application.
One of my team members suggested we make the variable name lengths smaller to optimize the application performance.
I.e.:
var IsRegionSelected:Boolean = false; //Slower
var IsRS:Boolean = false; //faster
Is this true?
No, the only gain you will obtain is in the size of the SWF.
Strings are put into a constant pool, and instructions referring to a string use an index into that pool.
It can be seen as (very schematically):
constant pool:
[0] IsRegionSelected
[1] IsRS
usage:
value at 0 = false
value at 1 = false
Your code will probably be translated as (for local variables):
push false
setlocal x
push false
setlocal y
where x and y are register indices assigned by the compiler, so there is no difference whether it's register 2 or register 4.
For more detail, read the AVM specification.
Yep, I second that. Changing the name length is not going to help you. Concentrate on item renderers, effects, states, and transitions; those may be killing your resources. Also check for embedded images, embedded fonts, etc., since those will increase your final SWF file size and initial loading time.
Cheers, PK
I don't think so; the way you use your variable names matters more than their length.
Good code should be consistent. Whether that means setting rules for the names of variables and functions, adopting standard approaches, or simply making sure all of your code is indented the same way, consistency makes your code easier for others to read.
Someone reading your code later should be able to construe what a variable holds from its name.
var g:String;
var gang:String;
Both perform the same operation, but one is more readable; someone going through your code can construe its purpose at a glance.
There's a very small performance gain, but if you plan to use this application again later, it's not worth your sanity. Do absolutely any other optimization you can before this one - and if it's really slow enough to need optimizing, then there are definitely other factors that you'll need to take care of first before variable names.
Cut anything else you can before resorting to 1-2 millisecond boosts.
As Matchu says, there is a difference but a small one.
You should consider assigning meaningful names to your variables instead of just using short identifiers that carry no meaning.
