Read a null-terminated string from a byte vector in Julia

I have a vector of type UInt8 with fixed length 10. I think it contains a null-terminated string, but when I do String(v) it shows the string plus all of the zeros in the rest of the vector.
v = zeros(UInt8, 10)
v[1:5] = Vector{UInt8}("hello")
String(v)
the output is "hello\0\0\0\0\0".
Either I'm packing it wrong or reading it wrong. Any thoughts?

I use this snippet:
"""
nullstring(Vector{UInt8})
Interpret a vector as null terminated string.
"""
nullstring(x::Vector{UInt8}) = String(x[1:findfirst(==(0), x) - 1])
Although I bet there are faster ways to do this.
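For example, with the v built in the question (a quick check at the REPL, assuming the definitions above):
julia> nullstring(v)
"hello"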

You can use unsafe_string: unsafe_string(pointer(v)). This avoids allocating the intermediate slice, so it is very fast. But @laborg's solution is better in almost all cases, because it's safe.
If you want both safety and maximal performance, you have to write a manual function yourself:
function get_string(v::Vector{UInt8})
    # Find the first zero byte
    zeropos = 0
    @inbounds for i in eachindex(v)
        iszero(v[i]) && (zeropos = i; break)
    end
    iszero(zeropos) && error("Not null-terminated")
    GC.@preserve v unsafe_string(pointer(v), zeropos - 1)
end
But eh, what are the odds you REALLY need it to be that fast.
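A quick sanity check of both branches (assuming the v from the question):
julia> get_string(v)
"hello"
julia> get_string(UInt8[0x61, 0x62])
ERROR: Not null-terminated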

You can avoid copying bytes and preserve safety with the following code:
function nullstring!(x::Vector{UInt8})
    i = findfirst(iszero, x)
    SubString(String(x), 1, i - 1)
end
Note that after calling it, x will be empty, and the returned value is a SubString rather than a String, but in many scenarios that does not matter. This code makes half the allocations of the code by @laborg and is slightly faster (around 10-20%). The code by Jacob is still unbeatable though.
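A small sketch of that behavior (assuming the v from the question; the copy is only there so the original survives):
julia> w = copy(v); nullstring!(w)
"hello"
julia> isempty(w)  # String(w) took ownership of the bytes
true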

Related

Using caching instead of memoization to speed up a function

While memoizing a function is a good idea, it could cause a program to crash because the program could potentially run out of memory.
Therefore it is NOT A SAFE OPTION to use in a production program.
Instead, I have developed the caching scheme below with a fixed number of memory slots, a soft limit, and a hard limit. When the number of cache slots exceeds the hard limit, the least used slots are deleted until the number of slots is reduced to the soft limit.
struct cacheType
    softlimit::Int
    hardlimit::Int
    memory::Dict{Any,Any}
    freq::Dict{Any,Int}
    cacheType(soft::Int, hard::Int) = new(soft, hard, Dict(), Dict())
end
function tidycache!(c::cacheType)
    memory_slots = length(c.memory)
    if memory_slots > c.hardlimit
        num_to_delete = memory_slots - c.softlimit
        # Now sort the freq dictionary into an array of key => AccessFrequency
        # where the first few items have the lowest AccessFrequency
        for item in sort(collect(c.freq), by = x -> x[2])[1:num_to_delete]
            delete!(c.freq, item[1])
            delete!(c.memory, item[1])
        end
    end
end
# Fibonacci function
function cachefib!(cache::cacheType, x)
    if haskey(cache.memory, x)
        # Increment the number of times this key has been accessed
        cache.freq[x] += 1
        return cache.memory[x]
    else
        # Perform housekeeping and remove cache entries if over the hard limit
        tidycache!(cache)
        if x < 3
            cache.freq[x] = 1
            return cache.memory[x] = 1
        else
            result = cachefib!(cache, x - 2) + cachefib!(cache, x - 1)
            cache.freq[x] = 1
            cache.memory[x] = result
            return result
        end
    end
end
c = cacheType(3,4)
cachefib!(c,3)
cachefib!(c,4)
cachefib!(c,5)
cachefib!(c,6)
cachefib!(c,4)
println("c.memory is ",c.memory)
println("c.freq is ",c.freq)
I think this would be more useful in a production environment than just using memoization with no limit on memory consumption, which could result in a program crashing.
In the Python language, they have
@functools.lru_cache(maxsize=128, typed=False)
Decorator to wrap a function with a memoizing callable that saves up to the maxsize most recent calls. It can save time when an expensive or I/O bound function is periodically called with the same arguments.
Since a dictionary is used to cache results, the positional and keyword arguments to the function must be hashable.
Is there an equivalent in Julia language?
There is LRUCache.jl, which provides an LRU type that basically acts like a Dict. Unfortunately, this doesn't seem to work with the Memoize.jl package, but you can use my answer to your other question:
using LRUCache
const fibmem = LRU{Int,Int}(3) # store only 3 values
function fib(n)
    get!(fibmem, n) do
        n < 3 ? 1 : fib(n-1) + fib(n-2)
    end
end
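For example (a quick check, assuming an LRUCache.jl version that accepts this constructor; only the three most recently used entries stay cached):
julia> fib(30)
832040
julia> length(fibmem)
3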

How do I represent sparse arrays in Pari/GP?

I have a function that returns integer values for integer inputs. The output values are relatively sparse; the function only returns around 2^14 unique outputs for input values 1...2^16. I want to create a dataset that lets me quickly find the inputs that produce any given output.
At present, I'm storing my dataset in a Map of Lists, with each output value serving as the key for a List of input values. This seems slow and appears to use a whole lot of stack space. Is there a more efficient way to create/store/access my dataset?
Added:
It turns out the time taken by my sparsearray() function varies hugely with the ratio of output values (i.e., keys) to input values (values stored in the lists). Here's the time taken for a function that requires many lists, each with only a few values:
? sparsearray(2^16,x->x\7);
time = 126 ms.
Here's the time taken for a function that requires only a few lists, each with many values:
? sparsearray(2^12,x->x%7);
time = 218 ms.
? sparsearray(2^13,x->x%7);
time = 892 ms.
? sparsearray(2^14,x->x%7);
time = 3,609 ms.
As you can see, the time roughly quadruples each time n doubles!
Here's my code:
\\ sparsearray takes two arguments, an integer "n" and a closure "myfun",
\\ and returns a Map() in which each key is a number, and each key is associated
\\ with a List() of the input numbers for which the closure produces that output.
\\ E.g.:
\\ ? sparsearray(10,x->x%3)
\\ %1 = Map([0, List([3, 6, 9]); 1, List([1, 4, 7, 10]); 2, List([2, 5, 8])])
sparsearray(n,myfun=(x)->x)=
{
    my(m=Map(),output,oldvalue=List());
    for(loop=1,n,
        output=myfun(loop);
        if(!mapisdefined(m,output),
            /* then */
            oldvalue=List(),
            /* else */
            oldvalue=mapget(m,output));
        listput(oldvalue,loop);
        mapput(m,output,oldvalue));
    m
}
To some extent, the behavior you are seeing is to be expected. PARI appears to pass lists and maps by value rather than by reference, except to the special inbuilt functions for manipulating them. This can be seen by creating a wrapper function like mylistput(list,item)=listput(list,item);. When you try to use this function you will discover that it doesn't work, because it is operating on a copy of the list. Arguably, this is a bug in PARI, but perhaps they have their reasons. The upshot of this behavior is that each time you add an element to one of the lists stored in the map, the entire list is being copied, possibly twice.
The following is a solution that avoids this issue.
sparsearray(n,myfun=(x)->x)=
{
    my(vi=vector(n, i, i)); \\ input values
    my(vo=vector(n, i, myfun(vi[i]))); \\ output values
    my(perm=vecsort(vo,,1)); \\ obtain order of output values as a permutation
    my(list=List(), bucket=List(), key);
    for(loop=1, #perm,
        if(loop==1||vo[perm[loop]]<>key,
            if(#bucket, listput(list,[key,Vec(bucket)]); bucket=List()); key=vo[perm[loop]]);
        listput(bucket,vi[perm[loop]])
    );
    if(#bucket, listput(list,[key,Vec(bucket)]));
    Mat(Col(list))
}
The output is a matrix in the same format as a map - if you would rather have a map, it can be converted with Map(...), but you probably want a matrix for processing, since there is no built-in function on a map to get the list of keys.
I did a little bit of reworking of the above to try to make something more akin to GroupBy in C# (a function that could be useful for many things).
VecGroupBy(v, f)={
    my(g=vector(#v, i, f(v[i]))); \\ groups
    my(perm=vecsort(g,,1));
    my(list=List(), bucket=List(), key);
    for(loop=1, #perm,
        if(loop==1||g[perm[loop]]<>key,
            if(#bucket, listput(list,[key,Vec(bucket)]); bucket=List()); key=g[perm[loop]]);
        listput(bucket, v[perm[loop]])
    );
    if(#bucket, listput(list,[key,Vec(bucket)]));
    Mat(Col(list))
}
You would use this like VecGroupBy([1..300],i->i%7).
There is no good native GP solution: because of the way garbage collection occurs, passing arguments by reference has to be restricted in GP's memory model (from version 2.13 on, it is supported for function arguments using the ~ modifier, but not for map components).
Here is a solution using the libpari function vec_equiv(), which returns the equivalence classes of identical objects in a vector.
install(vec_equiv,G);
sparsearray(n, f=x->x)=
{
    my(v = vector(n, x, f(x)), e = vec_equiv(v));
    [vector(#e, i, v[e[i][1]]), e];
}
? sparsearray(10, x->x%3)
%1 = [[0, 1, 2], [Vecsmall([3, 6, 9]), Vecsmall([1, 4, 7, 10]), Vecsmall([2, 5, 8])]]
(you have 3 values corresponding to the 3 given sets of indices)
The behaviour is linear, as expected:
? sparsearray(2^20,x->x%7);
time = 307 ms.
? sparsearray(2^21,x->x%7);
time = 670 ms.
? sparsearray(2^22,x->x%7);
time = 1,353 ms.
Use mapput, mapget and mapisdefined methods on a map created with Map(). If multiple dimensions are required, then use a polynomial or vector key.
I guess that is what you are already doing, and I'm not sure there is a better way. Do you have some code? From personal experience, 2^16 values with 2^14 keys should not be an issue with regards to speed or memory - there may be some unnecessary copying going on in your implementation.

What does this extra '+' represent in this code? Recursive function

Problem:
A digital root is the recursive sum of all the digits in a number. Given n, take the sum of the digits of n. If that value has two digits, continue reducing in this way until a single-digit number is produced. This is only applicable to the natural numbers.
example:
digital_root(16)
=> 1 + 6
=> 7
This is a function that was coded:
function digital_root(n) {
    if (n < 10) {
        return n;
    }
    return digital_root(n.toString().split('').reduce(function (a, b) {
        return a + +b;
    }, 0));
}
Can someone clarify what the extra + is doing in this line of code? return a + +b;
It's probably a sneaky way of converting a string to an integer. You don't say what language this is, but many dynamic languages allow variables to be of any type without declaration and use + for both addition and string concatenation, with implicit conversions between strings and numbers. Such languages make it easy to accidentally get the wrong thing (concatenating when you intend to add, or vice versa).
However, the unary + is (usually) a numeric identity: it converts its argument to a number if it happens to be a string, and does nothing if the argument is already a number. So the binary + will then add rather than concatenate.

Array type promotion in Julia

In Julia, I can use promote to make various types of objects compatible. For example:
>promote(1, 1.0)
(1.0,1.0)
>typeof(promote(1, 1.0))
(Float64, Float64)
However, if I use promote on arrays, it doesn't give me what I want:
>promote([1], [1.0])
([1],[1.0])
>typeof(promote([1], [1.0]))
(Array{Int64,1},Array{Float64,1})
What I want is for the Int64 array to be converted to a Float64 array, so I get something like:
>promote_array([1], [1.0])
([1.0],[1.0])
>typeof(promote_array([1], [1.0]))
(Array{Float64,1},Array{Float64,1})
Here promote_array is a hypothetical function I made up. I'm looking for a real function that does the same. Is there a function in Julia that does what promote_array does above?
I found the function Base.promote_eltype, which I can use to get what I want:
function promote_array(arrays...)
    eltype = Base.promote_eltype(arrays...)
    tuple([convert(Array{eltype}, array) for array in arrays]...)
end
This promote_array function then gives me the output I'm looking for:
>promote_array([1], [1.0])
([1.0],[1.0])
>typeof(promote_array([1], [1.0]))
(Array{Float64,1},Array{Float64,1})
The above solves my problem, although the existence of Base.promote_eltype suggests there may be an already built solution that I don't know about yet.
Here is what I would do:
function promote_array{S,T}(x::Vector{S},y::Vector{T})
    U = promote_type(S,T)
    convert(Vector{U},x), convert(Vector{U},y)
end
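That answer uses the pre-1.0 parametric method syntax. On current Julia, a minimal sketch of the same idea would use a where clause (same names, otherwise unchanged):
function promote_array(x::Vector{S}, y::Vector{T}) where {S,T}
    U = promote_type(S, T)
    # Convert both vectors to the common element type
    convert(Vector{U}, x), convert(Vector{U}, y)
end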
I'm not sure what your use case is exactly, but the following pattern is something I see as being fairly commonly required for code that has the tightest typing possible while being general:
function foo{S<:Real, T<:Real}(x::Vector{S}, y::Vector{T})
    length(x) != length(y) && error("Length mismatch")
    result = zeros(promote_type(S,T), length(x))
    for i in 1:length(x)
        # Do some fancy foo-work here
        result[i] = x[i] + y[i]
    end
    return result
end
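For instance, calling this with an Int vector and a Float64 vector yields a Float64 result (a quick sketch, assuming foo is defined with the equivalent where-clause signature on current Julia, as in the promote_array sketch above):
julia> foo([1, 2], [1.5, 2.5])
2-element Vector{Float64}:
 2.5
 4.5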

Pass a string value in a recursive bison rule

I'm having some issues with bison (again).
I'm trying to pass a string value through a "recursive rule" in my grammar file using $$,
but when I print the value I passed, the output looks like a bad reference ( AU�� ) instead of the value I wrote in my input file.
line: tok1 tok2
    | tok1 tok2 tok3
    {
        int len = 0;
        len = strlen($1) + strlen($3) + 3;
        char out[len];
        strcpy(out, $1);
        strcat(out, " = ");
        strcat(out, $3);
        printf("out -> %s;\n", out);
        $$ = out;
    }
    | line tok4
    {
        printf("line -> %s\n", $1);
    }
Here I've reported a simplified part of the code.
Given the input tokens tok1 tok2 tok3, it should assign the out variable to $$ (with the printf I can see that in the first part of the rule the out variable has the correct value).
Matching tok4 next, I'm in the recursive part of the rule. But when I print the $1 value (which should be equal to out, since I passed it through $$), I don't get the right output.
You cannot set:
$$ = out;
because the string that out refers to is just about to vanish into thin air, as soon as the block in which it was declared ends.
In order to get away with this, you need to malloc the storage for the new string.
Also, you need strlen($1) + strlen($3) + 4; because you need to leave room for the NUL terminator.
It's important to understand that C does not really have strings. It has pointers to char (char*), but those are really pointers. It has arrays (char []), but you cannot use an array as an aggregate. For example, in your code, out = $1 would be illegal, because you cannot assign to an array. (Also because $1 is a pointer, not an array, but that doesn't matter because any reference to an array, except in sizeof, is effectively reduced to a pointer.)
So when you say $$ = out, you are making $$ point to the storage represented by out, and that storage is just about to vanish. So that doesn't work. You can say $$ = $1, because $1 is also a pointer to char; that makes $$ and $1 point to the same character. (That's legal but it makes memory management more complicated. Also, you need to be careful with modifications.) Finally, you can say strcpy($$, out), but that relies on $$ already pointing to a string which is long enough to hold out, something which is highly unlikely, because what it means is to copy the storage pointed to by out into the location pointed to by $$.
Also, as I noted above, when you are using "string" functions in C, they all insist that the sequence of characters pointed to by their "string" arguments (i.e. the pointer-to-character arguments) must be terminated with a 0 character (that is, the character whose code is 0, not the character 0).
If you're used to programming in languages which actually have a string datatype, all this might seem a bit weird. Practice makes perfect.
The bottom line is that what you need to do is to create a new region of storage large enough to contain your string, like this (I removed out because it's not necessary):
$$ = malloc(len + 1); // room for NUL
strcpy($$, $1);
strcat($$, " = ");
strcat($$, $3);
// You could replace the strcpy/strcat/strcat with:
// sprintf($$, "%s = %s", $1, $3)
Note that storing mallocd data (including the result of strdup and asprintf) on the parser stack (that is, as $$) also implies the necessity to free it when you're done with it; otherwise, you have a memory leak.
I've solved it by changing the $$ = out; line into strcpy($$, out); and now it works properly.
