Performance of string operations on vectors in Dyalog APL

I have 2 questions related to comparing character vectors in Dyalog APL.
The following code will compare character vectors one-by-one:
a←'ATCG'
b←'GTCA'
a=b
In order to speed things up (both when comparing two vectors and when comparing many vectors against a single vector), should I convert the character vectors to numeric vectors, or does that not matter in APL (similar to comparing chars in C)?
I am comparing DNA sequences, which consist only of letters from the ATCG alphabet. Is there anything I can do to speed up various operations on such vectors?

Interestingly, on my (old) version of Dyalog APL, converting characters to small integers actually runs some 25% faster. This may have been sped up in more recent versions.
Try
a←⎕AV⍳'ATCG'
b←⎕AV⍳'GTCA'
a=b
Be sure that the largest value is less than 128.
To check that you have the smallest possible representation of integers, use the ⎕DR system function. ⎕DR a should return 82 for an integer -128 <= x <= 127.
Dyalog APL will automagically convert to the lowest possible integer width.

Representing decimal numbers in binary

How do I represent an integer, for example 23647, in two bytes, where one byte contains the last two digits (47) and the other contains the rest of the digits (236)?
There are several ways to do this.
One way is to use Binary Coded Decimal (BCD). This codes the decimal digits, rather than the number as a whole, into binary. The packed form puts two decimal digits into each byte. However, your example value 23647 has five decimal digits and will not fit into two bytes in BCD; this method only fits values up to 9999.
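For concreteness, here is a minimal packed-BCD sketch in Python (the helper name and the 0-9999 range check are my own, not from the original answer):
# Hypothetical helper: pack a 0..9999 value into two bytes of packed BCD,
# two decimal digits per byte (tens in the high nibble, ones in the low nibble).
def to_packed_bcd(value):
    if not 0 <= value <= 9999:
        raise ValueError("two bytes of packed BCD hold at most 9999")
    hi, lo = divmod(value, 100)          # e.g. 9847 -> 98 and 47
    return bytes([(hi // 10) << 4 | hi % 10,
                  (lo // 10) << 4 | lo % 10])

print(to_packed_bcd(9847).hex())         # '9847' -- the hex digits mirror the decimal digits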
Another way is to put each of your two parts in binary and place each part into a byte. You can do integer division by 100 to get the upper part, so in Python you could use
upperbyte = 23647 // 100
Then the lower part can be gotten by the modulus operation:
lowerbyte = 23647 % 100
Python will directly convert the results into binary and store them that way. You can do all this in one step in Python and many other languages:
upperbyte, lowerbyte = divmod(23647, 100)
You are guaranteed that the lowerbyte value fits, but if the given value is too large the upperbyte value may not actually fit into a byte. All this assumes that the value is positive, since negative values would complicate things.
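To make the round trip concrete, a small sketch (assuming the value fits, i.e. the upper part is at most 255):
upperbyte, lowerbyte = divmod(23647, 100)   # 236 and 47
packed = bytes([upperbyte, lowerbyte])      # two raw bytes: 0xEC 0x2F
restored = packed[0] * 100 + packed[1]      # 23647 again
print(packed.hex(), restored)               # ec2f 23647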
(The following answer was for a previous version of the question, which was to fit a floating-point number like 36.47 into two bytes, one byte for the integer part and another byte for the fractional part.)
One way to do that is to "shift" the number so you consider those two bytes to be a single integer.
Take your value (36.47), multiply it by 256 (the number of values that fit into one byte), round it to the nearest integer, convert that to binary. The bottom 8 bits of that value are the "decimal numbers" and the next 8 bits are the "integer value." If there are any other bits still remaining, your number was too large and there is an overflow condition.
This assumes you want to handle only non-negative values. Handling negatives complicates things somewhat. The final result is only an approximation to your starting value, but that is the best you can do.
Doing those calculations on 36.47 gives the binary integer
10010001111000
So the "decimal byte" is 01111000 and the "integer byte" is 100100 or 00100100 when filled out to 8 bits. This represents the float number 36.46875 exactly and your desired value 36.47 approximately.

R numeric value: exact or rounded value

As the R FAQ states, the only numbers that can be represented exactly in R's numeric type are
integers
fractions whose denominator is a power of 2
All other numbers are internally rounded to (typically) 53 binary digits accuracy.
Given that, how can one tell whether a given number can be stored exactly in R's numeric type or whether it will be rounded? Stated differently, how can one tell whether a number is a fraction whose denominator is a power of 2?
For example, 0.03125 can be stored exactly since it is 1/32 (i.e., the denominator is a power of 2):
sprintf("%.60f", 1/32)
#[1] "0.031250000000000000000000000000000000000000000000000000000000"

When is it advantageous to use the L suffix to specify integer quantities in R? [duplicate]

I have often seen the symbol 1L (or 2L, 3L, etc.) appear in R code. What's the difference between 1L and 1? 1 == 1L evaluates to TRUE. Why is 1L used in R code?
So, @James and @Brian explained what 3L means. But why would you use it?
Most of the time it makes no difference - but sometimes you can use it to get your code to run faster and consume less memory. A double ("numeric") vector uses 8 bytes per element. An integer vector uses only 4 bytes per element. For large vectors, that's less wasted memory and less to wade through for the CPU (so it's typically faster).
Mostly this applies when working with indices.
Here's an example where adding 1 to an integer vector turns it into a double vector:
x <- 1:100
typeof(x) # integer
y <- x+1
typeof(y) # double, twice the memory size
object.size(y) # 840 bytes (on win64)
z <- x+1L
typeof(z) # still integer
object.size(z) # 440 bytes (on win64)
...but also note that working excessively with integers can be dangerous:
1e9L * 2L # Works fine; fast lean and mean!
1e9L * 4L # Ooops, overflow!
...and as @Gavin pointed out, the range for integers is roughly -2e9 to 2e9.
A caveat though is that this applies to the current R version (2.13). R might change this at some point (64-bit integers would be sweet, which could enable vectors of length > 2e9). To be safe, you should use .Machine$integer.max whenever you need the maximum integer value (and negate that for the minimum).
From the Constants Section of the R Language Definition:
We can use the ‘L’ suffix to qualify any number with the intent of making it an explicit integer.
So ‘0x10L’ creates the integer value 16 from the hexadecimal representation. The constant 1e3L
gives 1000 as an integer rather than a numeric value and is equivalent to 1000L. (Note that the
‘L’ is treated as qualifying the term 1e3 and not the 3.) If we qualify a value with ‘L’ that is
not an integer value, e.g. 1e-3L, we get a warning and the numeric value is created. A warning
is also created if there is an unnecessary decimal point in the number, e.g. 1.L.
L specifies an integer type, rather than the double type that the standard numeric class uses.
> str(1)
num 1
> str(1L)
int 1
To explicitly create an integer value for a constant you can call the function as.integer or, more simply, use the "L" suffix.

What's the difference between integer class and numeric class in R

I want to preface this by saying I'm an absolute programming beginner, so please excuse how basic this question is.
I'm trying to get a better understanding of "atomic" classes in R and maybe this goes for classes in programming in general. I understand the difference between a character, logical, and complex data classes, but I'm struggling to find the fundamental difference between a numeric class and an integer class.
Let's say I have a simple vector of integers, x <- c(4, 5, 6, 6); it would make sense for this to be an integer class. But when I do class(x) I get [1] "numeric". If I then convert this vector to an integer class with x <- as.integer(x), it returns the exact same list of numbers, except the class is different.
My question is: why is this the case? Why is the default class for a set of integers numeric, and what are the advantages and/or disadvantages of storing a set of integers as numeric instead of integer?
There are multiple classes that are grouped together as "numeric" classes, the 2 most common of which are double (for double precision floating point numbers) and integer. R will automatically convert between the numeric classes when needed, so for the most part it does not matter to the casual user whether the number 3 is currently stored as an integer or as a double. Most math is done using double precision, so that is often the default storage.
Sometimes you may want to specifically store a vector as integers if you know that they will never be converted to doubles (used as ID values or indexing) since integers require less storage space. But if they are going to be used in any math that will convert them to double, then it will probably be quickest to just store them as doubles to begin with.
Patrick Burns on Quora says:
First off, it is perfectly feasible to use R successfully for years
and not need to know the answer to this question. R handles the
differences between the (usual) numerics and integers for you in the
background.
> is.numeric(1)
[1] TRUE
> is.integer(1)
[1] FALSE
> is.numeric(1L)
[1] TRUE
> is.integer(1L)
[1] TRUE
(Putting capital 'L' after an integer forces it to be stored as an
integer.)
As you can see "integer" is a subset of "numeric".
> .Machine$integer.max
[1] 2147483647
> .Machine$double.xmax
[1] 1.797693e+308
Integers only go to a little more than 2 billion, while the other
numerics can be much bigger. They can be bigger because they are
stored as double precision floating point numbers. This means that
the number is stored in two pieces: the exponent (like 308 above,
except in base 2 rather than base 10), and the "significand" (like
1.797693 above).
Note that 'is.integer' is not a test of whether you have a whole
number, but a test of how the data are stored.
One thing to watch out for is that the colon operator, :, will return integers if the start and end points are whole numbers. For example, 1:5 creates an integer vector of numbers from 1 to 5. You don't need to append the letter L.
> class(1:5)
[1] "integer"
Reference: https://www.quora.com/What-is-the-difference-between-numeric-and-integer-in-R
To quote the help page (try ?integer), bolded portion mine:
Integer vectors exist so that data can be passed to C or Fortran code which expects them, and so that (small) integer data can be represented exactly and compactly.
Note that current implementations of R use 32-bit integers for integer vectors, so the range of representable integers is restricted to about +/-2*10^9: doubles can hold much larger integers exactly.
Like the help page says, R's integers are signed 32-bit numbers, so they can hold values between -2147483647 and +2147483647 (the remaining bit pattern, -2147483648, is reserved for NA) and take up 4 bytes.
R's numeric is identical to a 64-bit double conforming to the IEEE 754 standard. R has no single-precision data type (source: help pages for numeric and double). A double can store all integers between -2^53 and 2^53 exactly without losing precision.
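A quick illustration of that 2^53 boundary (sketched in Python, whose floats are the same IEEE 754 doubles):
# Doubles carry a 53-bit significand, so consecutive integers stay distinct
# only up to 2**53.
print(2.0**53 == 2.0**53 + 1)        # True  -- 2**53 + 1 is not representable
print(2.0**53 - 1 == 2.0**53 - 2)    # False -- below 2**53, integers are exact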
We can see the data type sizes, including the overhead of a vector (source):
> object.size(1:1000)
4040 bytes
> object.size(as.numeric(1:1000))
8040 bytes
To my understanding, we do not declare a variable with a data type in R, so by default any number written without an L is numeric.
If you wrote:
> x <- c(4L, 5L, 6L, 6L)
> class(x)
[1] "integer"
Example of an integer:
> x <- 2L
> print(x)
[1] 2
Example of a numeric (kind of like double/float in other programming languages):
> x <- 3.4
> print(x)
[1] 3.4
Numeric is an umbrella term for several classes (e.g. double and integer). Integers are numbers without a decimal point and are stored compactly in memory. Use the integer class only when doing computations with such numbers; otherwise revert to numeric.

How do computers evaluate huge numbers?

If I enter a value, for example
1234567 ^ 98787878
into Wolfram Alpha it can provide me with a number of details. This includes decimal approximation, total length, last digits etc. How do you evaluate such large numbers? As I understand it a programming language would have to have a special data type in order to store the number, let alone add it to something else. While I can see how one might approach the addition of two very large numbers, I can't see how huge numbers are evaluated.
10^2 could be calculated through repeated addition. However, a number such as the example above would require a gigantic loop. Could someone explain how such large numbers are evaluated? Also, how could someone create a custom large datatype to support large numbers in C#, for example?
Well, it's quite easy, and you could have done it yourself.
Number of digits can be obtained via logarithm:
since `A^B = 10 ^ (B * log(A, 10))`
we can compute (A = 1234567; B = 98787878) in our case that
`B * log(A, 10) = 98787878 * log(1234567, 10) = 601767807.4709646...`
integer part + 1 (601767807 + 1 = 601767808) is the number of digits
The first, say, five digits can be obtained via the logarithm as well;
now we should analyze fractional part of the
B * log(A, 10) = 98787878 * log(1234567, 10) = 601767807.4709646...
f = 0.4709646...
first digits are 10^f (decimal point removed) = 29577...
The last, say, five digits can be obtained as the corresponding remainder:
last five digits = A^B rem 10^5
A rem 10^5 = 1234567 rem 10^5 = 34567
A^B rem 10^5 = ((A rem 10^5)^B) rem 10^5 = (34567^98787878) rem 10^5 = 45009
last five digits are 45009
You may find BigInteger.ModPow (C#) very useful here
Finally
1234567^98787878 = 29577...45009 (601767808 digits)
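For reference, the same computation sketched in Python; the built-in pow accepts a modulus argument and performs exactly the modular exponentiation used above (BigInteger.ModPow plays that role in C#):
import math

A, B = 1234567, 98787878

t = B * math.log10(A)                      # 601767807.4709...
digits = int(t) + 1                        # 601767808
first5 = int(10 ** (t - int(t)) * 10**4)   # 29577, from the fractional part of t
last5 = pow(A, B, 10**5)                   # 45009, modular exponentiation

print(digits, first5, last5)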
There are usually libraries providing a bignum datatype for arbitrarily large integers (e.g. mapping bits k*n .. (k+1)*n-1, for k = 0 .. some m depending on n and the number's magnitude, to machine words of size n, and redefining the arithmetic operations on top). For C#, you might be interested in BigInteger.
Exponentiation can be recursively broken down:
pow(a,2*b) = pow(a,b) * pow(a,b);
pow(a,2*b+1) = pow(a,b) * pow(a,b) * a;
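A minimal sketch of that recursion in Python (square-and-multiply; O(log b) multiplications instead of b - 1):
def fast_pow(a, b):
    # Exponentiation by squaring, following the recurrence above.
    if b == 0:
        return 1
    half = fast_pow(a, b // 2)
    return half * half if b % 2 == 0 else half * half * a

assert fast_pow(3, 10) == 3**10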
There are also number-theoretic results that have engendered special algorithms to determine properties of large numbers without actually computing them (to be precise: without computing their full decimal expansion).
To compute how many digits there are, one uses the following expression:
decimal_digits(n) = 1 + floor(log_10(n))
This gives:
decimal_digits(1234567^98787878) = 1 + floor(log_10(1234567^98787878))
= 1 + floor(98787878 * log_10(1234567))
= 1 + floor(98787878 * 6.0915146640862625)
= 1 + floor(601767807.4709647)
= 601767808
The trailing k digits are computed by doing exponentiation mod 10^k, which keeps the intermediate results from ever getting too large.
The approximation will be computed using a (software) floating-point implementation that effectively evaluates a^(98787878 log_a(1234567)) to some fixed precision for some number a that makes the arithmetic work out nicely (typically 2 or e or 10). This also avoids the need to actually work with millions of digits at any point.
There are many libraries for this, and the capability is built in in the case of Python. You seem primarily concerned with the size of such numbers and the time it may take to do computations like the exponentiation in your example, so I'll explain a bit.
Representation
You might use an array to hold all the digits of large numbers. A more efficient way would be to use an array of 32 bit unsigned integers and store "32 bit chunks" of the large number. You can think of these chunks as individual digits in a number system with 2^32 distinct digits or characters. I used an array of bytes to do this on an 8-bit Atari800 back in the day.
Doing math
You can obviously add two such numbers by looping over all the digits and adding elements of one array to the other and keeping track of carries. Once you know how to add, you can write code to do "manual" multiplication by multiplying digits and putting the results in the right place and a lot of addition - but software will do all this fairly quickly. There are faster multiplication algorithms than the one you would use manually on paper as well. Paper multiplication is O(n^2) where other methods are O(n*log(n)). As for the exponent, you can of course multiply by the same number millions of times but each of those multiplications would be using the previously mentioned function for doing multiplication. There are faster ways to do exponentiation that require far fewer multiplies. For example you can compute x^16 by computing (((x^2)^2)^2)^2 which involves only 4 actual (large integer) multiplications.
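Here is a hedged sketch of the limb-based addition described above (my own toy code, assuming numbers are stored little-endian as lists of 32-bit chunks):
MASK = 2**32 - 1   # each list element is one 32-bit "digit" (limb)

def big_add(x, y):
    # Add two little-endian lists of 32-bit limbs, propagating the carry.
    result, carry = [], 0
    for i in range(max(len(x), len(y))):
        total = (x[i] if i < len(x) else 0) + (y[i] if i < len(y) else 0) + carry
        result.append(total & MASK)
        carry = total >> 32
    if carry:
        result.append(carry)
    return result

print(big_add([MASK], [1]))   # [0, 1], i.e. (2**32 - 1) + 1 == 2**32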
In practice
It's fun and educational to try writing these functions yourself, but in practice you will want to use an existing library that has been optimized and verified.
I think a part of the answer is in the question itself :) To store these expressions, you can store the base (or mantissa) and the exponent separately, as in scientific notation. Beyond that, you cannot feasibly evaluate the expression completely and store such a large number, although you can theoretically predict certain properties of the result. I will take you through each of the properties you talked about:
Decimal approximation: Can be calculated by evaluating simple log values.
Total number of digits for the expression a^b can be calculated by the formula
Digits = floor(1 + log10(a^b)), where floor gives the closest integer not larger than its argument. E.g., the number of digits in 10^5 is 6.
Last digits: these can be calculated by exploiting the fact that the last digits of successive powers repeat in a cycle. E.g., at the units place, the digits 7, 9, 3, 1 repeat for powers 7^x; so if x % 4 is 0, the last digit is 1.
Whether someone could create a custom datatype for large numbers I can't say, but I am sure the number won't be evaluated and stored in full.
