Frequency cipher in F# - encryption

I'm currently working on a frequency-substitution cipher in F#: I count the occurrences of each letter in a text, and once that is done I want to replace the letters based on the letter frequencies of the English alphabet.
So far I've created a (char * float * char) list that contains (letter, frequency percentage, recommended letter). Say the letter P is the most frequent letter in my ciphered text (13.5 percent of the letters are P) and E is the most used letter in English text; the corresponding list element will then look like this: ('P', 13.5, 'E'). This is done for all letters in the text, so we end up with a list of all letters and their recommended replacements.
The problem I have is that I don't really know how to replace the letters in the cipher text with their recommended replacements.
Letter frequency in the English alphabet.
[(' ', 20.0); ('E', 12.02); ('T', 9.1); ('A', 8.12); ('O', 7.68); ('I', 7.31);
('N', 6.95); ('S', 6.28); ('R', 6.02); ('H', 5.92); ('D', 4.32); ('L', 3.98);
('U', 2.88); ('C', 2.71); ('M', 2.61); ('F', 2.3); ('Y', 2.11); ('W', 2.09);
('G', 2.03); ('P', 1.82); ('B', 1.49); ('V', 1.11); ('K', 0.69); ('X', 0.17);
('Q', 0.11); ('J', 0.1); ('Z', 0.07)]
Letter frequency in cipher.
[('W', 21.18); ('Z', 8.31); ('I', 7.7); ('P', 6.96); ('Y', 5.5); ('H', 5.48);
('G', 5.35); ('K', 5.3); ('N', 4.31); ('O', 4.31); ('M', 3.66); (' ', 2.83);
('A', 2.58); ('T', 2.38); ('Q', 2.22); ('B', 2.11); ('F', 2.11); ('.', 2.04);
('R', 1.62); ('S', 1.37); ('E', 1.06); ('X', 0.97); ('U', 0.25); ('L', 0.16);
('V', 0.11); ('J', 0.07); ('C', 0.02); ('D', 0.02)]
Recommended letter changes.
[('W', 21.18, ' '); ('Z', 8.31, 'E'); ('I', 7.7, 'T'); ('P', 6.96, 'A');
('Y', 5.5, 'O'); ('H', 5.48, 'I'); ('G', 5.35, 'N'); ('K', 5.3, 'S');
('N', 4.31, 'R'); ('O', 4.31, 'H'); ('M', 3.66, 'D'); (' ', 2.83, ' ');
('A', 2.58, 'L'); ('T', 2.38, 'U'); ('Q', 2.22, 'C'); ('B', 2.11, 'M');
('F', 2.11, 'F'); ('.', 2.04, 'Y'); ('R', 1.62, 'W'); ('S', 1.37, 'G');
('E', 1.06, 'P'); ('X', 0.97, 'B'); ('U', 0.25, 'V'); ('L', 0.16, 'K');
('V', 0.11, 'X'); ('J', 0.07, 'Q'); ('C', 0.02, 'J'); ('D', 0.02, 'Z')]
If anyone has any ideas that would point me in the right direction on how to tackle this problem, I'd be very appreciative, since I've been stuck on it for a while now.

I believe you are missing the '.' frequency in the English-alphabet list (it should sit between D and L). Once you add the missing value to the alphaFreq list, both lists will be the same length and you will be able to produce the recommended-changes map by zipping the two ordered lists:
let changes =
    alphaFreq                 // list with letter frequency in the English alphabet
    |> List.zip cipherFreq    // zipping with cipher frequency list
    |> List.map (fun ((cipherLetter, _), (alphaLetter, _)) -> (alphaLetter, cipherLetter))
    |> Map.ofList
Encoding test:
"HELLO WORLD" |> String.map (fun ch -> changes.[ch]) |> printfn "%s"
// OZAAYWRYNAM
To get a decoder map, just swap the letter order: (cipherLetter, alphaLetter).
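For instance, a minimal sketch of that decoder map, assuming the same alphaFreq (with the '.' entry added) and cipherFreq lists as above:
// hypothetical decoder map: keyed by the cipher letter this time
let decode =
    alphaFreq
    |> List.zip cipherFreq
    |> List.map (fun ((cipherLetter, _), (alphaLetter, _)) -> (cipherLetter, alphaLetter))
    |> Map.ofList

"OZAAYWRYNAM" |> String.map (fun ch -> decode.[ch]) |> printfn "%s"
// HELLO WORLD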

Most likely you want a mapping operation, something along the lines of myString |> String.map mappingFunction. Note that the mapping function can also be a functor or a curried higher-order function.
The functor approach would allow you to put the frequencies into the object state.
The curried function would allow you to pass the frequencies as the parameter.
It's up to you to choose which approach makes more sense in your application and/or looks more natural.
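For example, here is a minimal sketch of the curried-function variant, assuming a Map<char, char> substitution table like the changes map built in the answer above (the names substitute and encode are just placeholders):
// Hypothetical sketch: the substitution table is the first (curried) argument,
// so partially applying it yields the char -> char function String.map expects.
let substitute (table: Map<char, char>) (ch: char) =
    match Map.tryFind ch table with
    | Some replacement -> replacement
    | None -> ch    // leave characters without a rule untouched

let encode = String.map (substitute changes)
"HELLO WORLD" |> encode |> printfn "%s"
// OZAAYWRYNAM, same as above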

Erlang lists:filter returns "\n\f"

I'm working on a school project in a functional programming class. The project is about determining whether a set of dominoes (represented as a list of tuples of two numbers from 1-6) can be put end to end. I'm OK on the problem itself, but I'm running into an issue where lists:filter returns the string "\n\f" instead of a list, as the documentation says it should.
I couldn't find anything online, and was wondering if any of you had any ideas.
Thanks!
Here is my code. The issue is in the check_dominos() function.
-module(challenge).
-export([test/0, check_dominos/1]).
% If there is an even number of each number, true
% else, false
extract_numbers([]) -> [];
extract_numbers([{First, Second} | T]) -> [First] ++ [Second] ++ extract_numbers(T).
add_matching_numbers(_Previous, []) -> [];
add_matching_numbers(Previous, [First | T]) when Previous =:= First -> [Previous + First | add_matching_numbers(First, T)];
add_matching_numbers(_Previous, [First | T]) -> add_matching_numbers(First, T).
check_dominos(Dominos) ->
    All_Numbers = extract_numbers(Dominos),
    Sorted_Numbers = lists:sort(All_Numbers),
    Accumulated_Numbers = add_matching_numbers(0, Sorted_Numbers),
    Filter_Lambda = fun(Num) -> Num rem 2 == 0 end,
    Result = lists:filter(Filter_Lambda, Accumulated_Numbers),
    Result.
    % Still working on the logic of this part
    %case length(Accumulated_Numbers) =:= length(Result) of
    %    true -> true;
    %    _ -> false
    %end.
test() ->
    Test_1 = [{1, 3}, {3, 2}, {2, 1}],                         % Already in order
    Test_2 = [{5, 2}, {5, 6}, {6, 3}, {1, 4}],                 % Shouldn't work
    Test_3 = [{2, 6}, {3, 5}, {1, 4}, {3, 4}, {6, 1}, {2, 5}], % Should work
    true = check_dominos(Test_1),
    false = check_dominos(Test_2),
    true = check_dominos(Test_3).
Erlang strings are lists of character codes, and by default the Erlang shell tries to display lists of integers as strings. To change this behaviour, call shell:strings(false). before running your program.
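For example, a hypothetical shell session (prompt numbers and the exact return value may differ):
1> [10, 12].
"\n\f"
2> shell:strings(false).
true
3> [10, 12].
[10,12]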
The previous answerer is correct. Any list containing only numbers that correspond to printable characters will display as a string. The result of your
Accumulated_Numbers = add_matching_numbers(0, Sorted_Numbers)
on Test_2 is "\n\f", but displayed as a list it is [10, 12]. Try typing [10, 12]. in an interactive Erlang shell and you will indeed see it displays "\n\f".
Try:
[7].
in an interactive Erlang shell. For me it displays:
[7]
Try:
[8].
Displays:
"\t"
N.B. The numbers 8 through 13 are printable characters, as are (some of?) the numbers 32 to 255. Might be some gaps in there. If you want to see the numeric value of a printable character, use a dollar sign, e.g. $\n. prints 10.
That said, with your current way of going about things, you won't be able to get an answer with add_matching_numbers as it stands. It drops a value whenever it doesn't match the next item in the sorted list, which doesn't tell you whether you had any unmatched items. The result of [10, 12] on Test_2 shows this: it is a list of even numbers, just like the other results.
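As an illustration only (a hypothetical helper, not part of the original module), a pairing check over the sorted list would keep that information, failing as soon as some element has no equal partner:
%% Hypothetical sketch: true only if every element of the sorted list
%% can be paired with an equal neighbour.
all_paired([]) -> true;
all_paired([X, X | T]) -> all_paired(T);
all_paired([_ | _]) -> false.

%% e.g. all_paired(lists:sort(extract_numbers(Dominos)))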
Good luck on finding your solution.

"You have to specify either input_ids or inputs_embeds", but I did specify the input_ids

I trained a BERT based encoder decoder model (EncoderDecoderModel) named ed_model with HuggingFace's transformers module.
I used a BertTokenizer, named input_tokenizer.
I tokenized the input with:
txt = "Some wonderful sentence to encode"
inputs = input_tokenizer(txt, return_tensors="pt").to(device)
print(inputs)
The output clearly shows that input_ids is in the returned dict:
{'input_ids': tensor([[ 101, 5660, 7975, 2127, 2053, 2936, 5061, 102]], device='cuda:0'), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
But when I try to predict, I get this error:
ed_model.forward(**inputs)
ValueError: You have to specify either input_ids or inputs_embeds
Any ideas?
Well, apparently this is a known issue, for example: This issue of T5
The problem is that there's probably a renaming procedure in the code: since we use an encoder-decoder architecture, we have two types of input ids.
The solution is to explicitly specify which input ids are the decoder's:
ed_model.forward(decoder_input_ids=inputs['input_ids'],**inputs)
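Spelled out, a minimal sketch of the same call with the encoder and decoder ids named explicitly (assuming the ed_model and inputs from the question):
# Hypothetical sketch: name the encoder and decoder ids explicitly instead of
# relying on **inputs; assumes ed_model and inputs from above.
outputs = ed_model(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    decoder_input_ids=inputs["input_ids"],
)
print(outputs.logits.shape)  # (batch_size, sequence_length, vocab_size)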
I wish it was documented somewhere, but now you know :-)

Is there an NCO command to change the time stamp of a variable in a netcdf?

I have a netCDF file containing maximum daily air temperature, time, lat and lon. I successfully got the maximum temperature from a netCDF of 6-hourly temperatures using the NCO command:
ncra -y max --mro -d time,,,4,4 6hourly.nc max.nc
The only problem is, my time steps are still split into quarter days:
variables:
    double lon(lon) ;
        lon:units = "degrees_east" ;
        lon:long_name = "lon" ;
        lon:standard_name = "longitude" ;
    double lat(lat) ;
        lat:units = "degrees_north" ;
        lat:long_name = "lat" ;
        lat:standard_name = "latitude" ;
    double time(time) ;
        time:units = "days since 0850-01-01 00:00:00" ;
        time:long_name = "time" ;
        time:calendar = "proleptic_gregorian" ;
    double tair(time, lat, lon) ;
        tair:units = "K" ;
        tair:_FillValue = 1.e+30 ;
        tair:long_name = "2 meter air temperature" ;
        tair:days\ since\ 850 = 0., 0.25, 0.5, 0.75, 1., 1.25, 1.5, 1.75, 2., 2.25, 2.5, 2.75, 3., 3.25, 3.5, 3.75, 4., 4.25, 4.5, 4.75, 5., 5.25, 5.5, 5.75, 6., 6.25, 6.5, 6.75, 7., 7.25, 7.5, 7.75, 8., 8.25, 8.5, 8.75, 9., 9.25, 9.5, 9.75, 10., 10.25, 10.5, 10.75, 11., 11.25, 11.5...
My question is: how do I change the time steps in the 'days\ since\ 850' attribute of the variable tair to whole numbers?
Thanks!
Charlotte
ncap2 can work with attributes. However, you have a particularly thorny issue because your attribute name contains whitespace and the attribute value is an array. I think in this case you need to first rename the attribute, then manipulate it (then you can rename it back if desired):
ncrename -O -a "tair#days since 850",tair#days_since_850 in.nc foo.nc
ncap2 -O -s 'tair#days_since_850=int(tair#days_since_850)' foo.nc out.nc
Edit 20210209 in answer to comment below:
To copy an attribute from one variable to another, try
ncap2 -s 'var1#att1=var2#att2' in.nc out.nc
In case anyone has a similar problem, this ended up working for me:
ncap2 -s 'time=array(0,1,$time)' outmax.nc outmax2.nc
ncap2 -s 'time=array(0,1,$time)' outmin.nc outmin2.nc

Is there a way to normalize data with high kurtosis?

I have a vector that has a kurtosis of 2.95 (which is pretty high; leptokurtic). The following is a sample of that data:
x = c(6.819, 8.948, 0, 67.556, -40.785, -18.951, -29.151, 1.008,
0, 18.034, -6.631, 6.294, 0.643, -28.921, 0, -2.133, -44.348,
-87.488, 7.063, 0, -74.428, -16.361, 50.963, -32.431, -82.233,
-26.953, -48.475, 64.043, 0, 1.576, -2.728, -5.9, -63.059, -1.061,
-15.018, -58.119, -32.092, 5.329, -19.968, 38.822, 66.897, 0,
-2.579, 82.696, 42.745, 79.677, 2.522, -11.475, 1.019, 2.719,
-3.634, -7.975, 0, 1.873, 21.732, -10.217, -24.002, -76.049,
35.045, 27.22, -71.366, 16.293, -48.762, 65.481, 66.615, -19.616,
6.016, 59.722, 88.235, 10.1, 0, -4.598, 5.446, 56.909, 0, -24.827,
0, 6.487, 0, 63.315, 28.397, 9.433, 19.085, 0, 6.591, -22.643,
32.235, -12.535, -1.787, 56.157, 68.819, 0, -21.936, 38.695,
-79.006, 24.888, -5.187, 10.368, -68.191, 0, -22.171, -78.783,
-14.119, 54.084, -13.597, 26.669, 0, -18.402, 80.309, -12.652,
1.801, -69.946, -87.67, -19.586, 38.085, -21.031, -36.957, 1.357,
0.17, 47.407, -59.598, 66.125, 10.97, 6.33, -38.837, 1.868, 38.169,
-46.662, -32.255, 25.816, 14.432, -18.57, -0.456, -0.638, 31.07,
72.794, 52.957, 13.858, -18.885, 0, -13.488, 11.689, 1.618, 19.373,
-57.526, 0, -0.655, 36.308, 50.231, 0.048, -80.157, 0, -64.805,
-70.864, 0.813, 52.143, -4.989, 42.166, 7.397, 87.437, -17.897,
-0.877, 68.363, 47.315, -2.181, 2.699, 36.278, 0, -2.924, 71.56,
74.406, -46.071, 56.158, 1.44, 0, 0, 0, -3.233, 37.084, -85.189,
0, -16.137, -84.499, -12.67, -14.117, 0, 23.757, -58.299, -34.956,
0.402, 0, -67.585, -14.314, -73.426, 23.158, 1.782, 0, 4.399,
18.871, -6.929)
Is there a way to normalize this data?
Since this data ranges between -90 and 90, the normalized data should be in a similar range and should not change vastly, i.e. the range should not become -1 to 1 or -20 to 20, etc.
I have tried using atan(x), 1/x, log(x), and many other transformation techniques, but they all tend to increase the skewness. Is there a way to normalize this data without skewing it?
I am sure there must be an easy solution to this.
It may not be what you want, but you can almost always perfectly normalize a distribution (if there are no ties) using a normal scores transformation:
xq <- qnorm(rank(x)/(length(x)+1), mean=mean(x), sd=sd(x))
plot(sort(x),sort(xq))
hist(xq)
qqnorm(xq)
The new range is (-99.2, 99.6) (the old range was +/- 88).
If you need to change the range you could do it as follows:
newmin + (newmax-newmin)*scale(xq, center=min(xq), scale=diff(range(xq)))
but as suggested in the comments this may not actually be the right approach to solve your broader problem.
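For concreteness, a hypothetical use of that rescaling line, pinning the result back to roughly the original +/- 88 range (the names below are placeholders):
## Hypothetical usage: map xq linearly onto [-88, 88]
x_rescaled <- -88 + (88 - (-88)) * scale(xq, center = min(xq), scale = diff(range(xq)))
range(x_rescaled)  # exactly -88 to 88 by construction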

How can I get the number of significant digits of the data stored in a NetCDF file?

I need to know the precision of the data stored in a NetCDF file.
I think that it is possible to know this precision because, when I dump a NetCDF file using ncdump, the number of significant digits displayed depends on the particular NetCDF file that I am using.
So, for one file I get:
Ts = -0.2121478, -0.08816089, -0.4285178, -0.3446428, -0.4800949,
-0.4332879, -0.2057121, -0.06589077, -0.001647412, 0.007711744,
And for another one:
Ts = -2.01, -3.6, -1, -0.53, -1.07, -0.7, -0.56, -1.3, -0.93, -1.41, -0.83,
-0.8, -2.13, -2.91, -1.13, -1.2, -2.23, -1.77, -2.93, -0.7, -2.14, -1.36,
I also have to say that there is no information about precision in any attribute, neither global nor local to the variable. You can see this in the dump of the header of the NetCDF file:
netcdf pdo {
dimensions:
    time = UNLIMITED ; // (809 currently)
variables:
    double time(time) ;
        time:units = "months since 1900-01-01" ;
        time:calendar = "gregorian" ;
        time:axis = "T" ;
    double Ts(time) ;
        Ts:missing_value = NaN ;
        Ts:name = "Ts" ;
// global attributes:
        :Conventions = "CF-1.0" ;
}
Does anybody know how I can get the number of significant digits of the data stored in a NetCDF file?
This is a tricky question: what ncdump (and many other pretty-number generators) does is simply strip the trailing zeros from the fractional part, but does that say anything about the real (observed/calculated/...) precision of the values? Something measured to three decimals of accuracy might be 1.100, yet ncdump will still print it as 1.1. If you want to know the true (physical?) significance, it would indeed have to be included as an attribute, or documented elsewhere.
For a large set of numbers, counting the maximum number of significant digits in the fractional part of the numbers could be a first indication of the precision. If that is what you are looking for, something like this might work in Python:
import numpy as np

a = np.array([1.01, 2.0])
b = np.array([1.10, 1])
c = np.array([10., 200.0001])
d = np.array([1, 2])

def count_max_significant_fraction(array):
    # Return zero for any integer type array:
    if issubclass(array.dtype.type, np.integer):
        return 0
    decimals = [s.rstrip('0').split('.')[1] for s in array.astype('str')]
    return len(max(decimals, key=len))

print(count_max_significant_fraction(a))  # prints "2"
print(count_max_significant_fraction(b))  # prints "1"
print(count_max_significant_fraction(c))  # prints "4"
print(count_max_significant_fraction(d))  # prints "0"
I suggest you adopt the convention NCO uses and name the precision attribute "number_of_significant_digits" and/or "least_significant_digit". Terms are defined in the lengthy precision discussion that starts here.
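If you decide to record the precision yourself, here is a hedged sketch of doing so with NCO's ncatted, using that attribute name (the value 2 is just a placeholder):
# hypothetical example: record 2 as the least significant digit of Ts
ncatted -O -a least_significant_digit,Ts,c,s,2 in.nc out.nc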
