Scanning a file to compare strings: average number of comparisons (math)

I have two text files, each containing a set of strings: First_file.txt (X strings) and Second_file.txt (N strings).
First_file.txt
string1
string2
string3
...
stringX
Second_file.txt
string1
string2
string3
...
stringN
I compared the two files as follows: I take string1 from First_file and scan Second_file line by line. If I find the same string, I break and restart with string2 from First_file.
So the best case is a match on the first line; the worst case is no match, in which case I have to scan the entire file.
I'm interested in the average number of comparisons: is N/2 right?

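The total number of comparisons depends on the lengths of both files. If every string in file 1 has a match that is equally likely to sit on any line of file 2, each string is compared with (N + 1)/2 strings of file 2 on average, which is roughly N/2 for large N. The total average number of comparisons is then about X * N/2 (X being the number of lines in file 1 and N the number of lines in file 2). A string with no match always costs N comparisons, so the average rises toward N as the share of unmatched strings grows.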
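The per-string cost can be checked with a small counting sketch. R is assumed here only because the related questions use it, and `count_comparisons` is an illustrative name, not something from the question. It charges the position of the first match, or N when there is no match, which mirrors the scan-and-break loop described above.

```r
# Count the comparisons made by the scan-and-break loop from the question.
count_comparisons <- function(first, second) {
  total <- 0
  for (s in first) {
    pos <- match(s, second)                      # line of the first match, NA if absent
    cost <- if (is.na(pos)) length(second) else pos
    total <- total + cost
  }
  total
}

second <- paste0("string", 1:10)                 # N = 10

# Every string matches exactly one line: average is (N + 1) / 2 = 5.5
count_comparisons(second, second) / length(second)

# A string that is absent costs the full N = 10 comparisons
count_comparisons("missing", second)
```

With matches spread evenly over the file the average sits just above N/2; unmatched strings pull it up toward N.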

Related

Generating random 1 to 12 digit numbers in R following some condition

I wish to write a program in R to generate one random number (a positive integer) for each length from 3 digits to 12 digits, following these conditions:
There is no order in the consecutive number digits.
Strictly no repetition of digits in a number until the 9th digit number.
0 can be used after the 9 digit number only.
After 10 digits, a digit can be used twice but with no order.
And most importantly:
**The first number will not be the last number of next line and vice versa.**
All I know how to use is the sample command in R:
sample(1:9, size=n, replace=FALSE)
where n is the number of digits I wish to generate. However, I need to write a more generalized function or program which strictly obeys these conditions.

How to remove ending zeros in binary bit sequence in R?

I need to remove ending zeros from binary bit sequences.
The length of the bit sequence is fixed, say 52. i.e.,
0101111.....01100000 (52-bit),
10111010..1010110011 (52-bit),
10111010..1010110100 (52-bit).
When a decimal number is converted to normalized double precision, the significand is 52 bits, so zeros are padded on the right-hand side whenever the significand is initially shorter than 52 bits. I am reversing the process: I am trying to convert a normalized double-precision value in memory to a decimal number, so I have to remove the trailing zeros that were used to pad the significand to 52 bits.
The sequence at hand is not guaranteed to end in 0s (see the 2nd example above). If it does, all trailing zeros must be truncated:
f(0101111.....01100000) # 0101111.....011; leading 0 must be kept
f(10111010..1010110011) # 10111010..1010110011; no truncation
f(10111010..1010110100) # 10111010..10101101
Unfortunately, the number of trailing 0s to truncate varies (5 in the 1st example; 2 in the 3rd example).
It is OK for me if the input and output are both strings:
f("0101111.....01100000") # "0101111.....011"; leading 0 must be kept
f("10111010..1010110011") # "10111010..1010110011"; no truncation
f("10111010..1010110100") # "10111010..10101101"
Any help is greatly appreciated.
This is a simple regular expression.
f <- function(x) sub('0+$', '', x)
Explanation:
0 - matches the character 0.
0+ - the character zero repeated at least one time, meaning, one or more times.
$ matches the end of the string.
0+$ the character 0 repeated one or more times and nothing else until the end of the string.
Replace the sub-string matched by the pattern with the empty string, ''.
Now test the function.
f("010111101100000")
#[1] "0101111011"
f("0100000001010101100010000000000000000000000000000000000000000000")
#[1] "010000000101010110001"
f("010000000101010110001000000")
#[1] "010000000101010110001"
f("00010000000101010110001000000")
#[1] "00010000000101010110001"
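A short sketch of two edge cases that the tests above do not cover. The name `strip_trailing_zeros` is illustrative; it restates the same pattern so the snippet runs on its own. Because `sub` is vectorised, a whole vector of bit strings can be processed in one call, and an input made only of zeros comes back as the empty string, which callers may want to handle explicitly.

```r
# Same pattern as above, restated so this snippet is self-contained.
strip_trailing_zeros <- function(x) sub("0+$", "", x)

bits <- c("0110100", "1011", "0000")

# One call handles the whole vector:
# trailing zeros removed, no-zero input unchanged, all-zero input becomes ""
strip_trailing_zeros(bits)
```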

How to generate word sequence

1. I want to generate combinations of characters from a given word, with each letter repeated consecutively at most 2 times and at least once. The resulting words are of unequal lengths. For example, from
"cat"
to
"cat", "catt", "caat", "caatt", "ccat", "ccatt", "ccaat", "ccaatt"
The required function takes a word of length n and generates 2^n words of unequal length. It is similar to how binary numbers with n digits give 2^n combinations. For example, a 3-digit binary number gives
000 001 010 011 100 101 110 111
combinations, where 0=t and 1=tt.
2. The same function should also restrict the result to at most 2 consecutive repetitions of a character, even if the given word already contains repeated letters. For example
"catt"
to
"catt" "ccatt" "caatt" "ccaatt"
I tried something like this:
pos=expand.grid(l1=c(1,11),l2=c(2,22),l3=c(3,33))
result=chartr('123','cat',paste0(pos[,1],pos[,2],pos[,3]))
#[1] "cat" "ccat" "caat" "ccaat" "catt" "ccatt" "caatt" "ccaatt"
It gives the correct sequence, but I am stuck on generalizing it to words of any length.
Thank you.
x="cat"
l=seq(nchar(x))
n=paste(l,collapse="")
m=split(c(l,paste0(l,l)),rep(l,2))
chartr(n,x,do.call(paste0,expand.grid(m)))
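The digit-placeholder trick above relies on single-digit positions, so it stops working once the word has more than 9 letters (position 10 becomes the two characters "10"). A minimal alternative sketch, assuming base R 3.3.0 or later for `strrep`, builds the single/doubled choices directly from the letters. The name `stretch_word` is illustrative and not part of the original answer.

```r
# Build every word in which each letter appears once or twice in a row.
stretch_word <- function(word) {
  letters_in_word <- strsplit(word, "")[[1]]
  # For each position, the two choices: the letter itself or the letter doubled.
  choices <- lapply(letters_in_word, function(ch) c(ch, strrep(ch, 2)))
  # expand.grid enumerates all 2^n combinations; keep them as plain strings.
  grid <- expand.grid(choices, stringsAsFactors = FALSE)
  do.call(paste0, grid)
}

stretch_word("cat")
# "cat" "ccat" "caat" "ccaat" "catt" "ccatt" "caatt" "ccaatt"

length(stretch_word("abcdefghijk"))   # 11 letters, 2^11 = 2048 words
```

The output order matches the `expand.grid` attempt in the question, because the first letter still varies fastest.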
1. This is an addition to the answer given by Onyambu, to solve the second part of the question, i.e. to restrict the output to at most 2 consecutive repetitions of a character, whatever the number of consecutive repetitions in the input word.
x="catt"
l=seq(nchar(x))
n=paste(l,collapse="")
m=split(c(l,paste0(l,l)),rep(l,2))
o <- chartr(n,x,do.call(paste0,expand.grid(m)))
The line below collapses every run of more than 2 identical consecutive characters to 2 and then drops the resulting duplicates:
unique(gsub('([[:alpha:]])\\1{2,}', '\\1\\1', o))
#[1] "catt" "ccatt" "caatt" "ccaatt"
2. If you want all the combinations from "cat" to "ccaatt", whatever the number of consecutive repetitions of characters in the input word, the code is
x1="catt"
The line below collapses each run of repeated characters to a single character.
x2= gsub('([[:alpha:]])\\1+', '\\1', x1)
l=seq(nchar(x2))
n=paste(l,collapse="")
m=split(c(l,paste0(l,l)),rep(l,2))
o <- chartr(n,x2,do.call(paste0,expand.grid(m)))
unique(gsub('([[:alpha:]])\\1{2,}', '\\1\\1', o))
#[1] "cat" "ccat" "caat" "ccaat" "catt" "ccatt" "caatt" "ccaatt"

find count of substrings of all anagrams of 1st string that are anagram of 2nd

Need an approach to solve this problem!
Problem: Given two strings of lowercase letters, count the number of matches, modulo 10^9+7, of non-intersecting substrings across all distinct anagrams of the 1st string that are equal to some anagram of the 2nd string.
Example :
1) String 1: "ABC", String 2: "AB"
Answer = 4
Explanation : 'ABC','BAC','CAB','CBA' all contribute 1 such match each.
2) String 1: "ABCAB", String 2: "AB"
Answer = 40
Explanation : One possible anagram of string 1 is 'ABABC', for which the match count is 2 ('AB' and 'AB'), while 'BABCA' contributes only one match ('BA' or 'AB').
Constraints :
n,m are lengths of first and second strings
0 < n < 200
0 < m < 100
The approach I tried involved pre-computing the first 200 factorials modulo 10^9+7, then working out the maximum number of non-intersecting patterns (mx) the given string could contain, and looping from p=1 to mx to calculate the number of rearrangements of the first string that contain exactly p non-intersecting patterns (i.e. anagrams of string 2).
Is there a different approach that I am missing here?
Here is another approach you can use -
1) Calculate the number of distinct anagrams of string 2 (call it x). The standard permutations-and-combinations formula for arrangements with repeated letters gives this directly once factorials are precomputed.
2) Calculate the maximum number of copies of string 2 that string 1 can supply, by comparing how often each character of string 2 occurs in string 1. In your second example 'A' and 'B' each occur twice in string 1, so you can get at most 2 anagrams at one time. Note: if the character frequencies differ, take the lowest one. For example, if 'A' occurs three times and 'B' twice, you can get at most 2 anagrams.
3) Calculate the answer using the formulae of permutations and combinations:
Number of string 1 anagrams with only one anagram of string 2: x*(n1-n2+1), where n1 and n2 are the lengths of strings 1 and 2 respectively.
With two anagrams: (n1-2*n2+2)*x*x
And so on.
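Any closed-form count is worth checking against a brute-force reference on tiny inputs. The sketch below is one such reference in base R, under the assumption (consistent with both examples in the question) that each distinct anagram of string 1 contributes its maximum number of non-overlapping windows that are anagrams of string 2. The function names are illustrative, and the enumeration is exponential, so it is only usable for very short strings.

```r
# All distinct permutations of a character vector (a multiset of letters).
distinct_perms <- function(chars) {
  if (length(chars) <= 1) return(paste(chars, collapse = ""))
  out <- character(0)
  for (ch in unique(chars)) {
    rest <- chars[-match(ch, chars)]        # drop one occurrence of ch
    out <- c(out, paste0(ch, distinct_perms(rest)))
  }
  out
}

# Sorted letters of a string, used as an anagram signature.
signature <- function(s) paste(sort(strsplit(s, "")[[1]]), collapse = "")

# Greedy left-to-right count of non-overlapping windows that are anagrams of pat.
# All windows have the same length, so taking the earliest match is optimal.
count_matches <- function(s, pat) {
  key <- signature(pat)
  m <- nchar(pat)
  i <- 1
  hits <- 0
  while (i + m - 1 <= nchar(s)) {
    if (signature(substr(s, i, i + m - 1)) == key) {
      hits <- hits + 1
      i <- i + m                            # skip past the matched window
    } else {
      i <- i + 1
    }
  }
  hits
}

brute_force <- function(s1, s2) {
  perms <- distinct_perms(strsplit(s1, "")[[1]])
  sum(vapply(perms, count_matches, numeric(1), pat = s2))
}

brute_force("ABC", "AB")     # 4, as in example 1
brute_force("ABCAB", "AB")   # 40, as in example 2
```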

converting numerical values into decimal numbers

I have a text file with values that contain one, two, or sometimes three decimal points. These values are generated by software based on the signal intensity of genes. When I tried to compute a distance matrix from it, I got the warning message:
Warning message:
In dist(sam) : NAs introduced by coercion
A sample text file is given below:
sample1
a 23.45.12
b 123.345.234
c 45.2311.34
I need to convert these values to numbers with a single decimal point (real numbers) so that I can compute a distance matrix from them and use it for clustering. My expected result is as follows:
sample1
a 23.45
b 123.345
c 45.2311
Please help me.
You can do this in one line of code with as.numeric and gsub with a suitable regular expression:
sample1 <- c(
a = "23.45.12",
b = "123.345.234",
c = "45.2311.34"
)
as.numeric(
gsub("(\\d+\\.\\d+)\\..*", "\\1", sample1)
)
[1] 23.4500 123.3450 45.2311
The regular expression:
\\d+ finds one or more digits
\\. finds a period
Thus (\\d+\\.\\d+) finds two sets of digits with a period in between, and then groups them (with the brackets)
Finally, \\..* finds a period followed by a complete wildcard
Then gsub replaces the entire string with only what was found inside the brackets. This is called a regular expression back reference, indicated by \\1.
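To connect this back to the original warning, here is a minimal sketch that applies the same idea to a data frame before calling `dist`. The data frame `sam` is a hypothetical stand-in for the file contents shown in the question (in practice it would come from something like `read.table`), and `sub` is used instead of `gsub` because each value needs only one replacement.

```r
# Hypothetical data mirroring the sample file from the question.
sam <- data.frame(
  sample1 = c("23.45.12", "123.345.234", "45.2311.34"),
  row.names = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Keep "digits.digits" and drop everything from the second period onward,
# then convert every column to numeric. Values with a single period are left as-is.
sam[] <- lapply(sam, function(col) {
  as.numeric(sub("^(\\d+\\.\\d+)\\..*$", "\\1", col))
})

d <- dist(sam)   # no coercion warning now that the columns are numeric
```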
