Find near-duplicate strings
Hi, I know there are match, unique, and duplicated functions in R, but none of them does what I really need. I have a single column in my dataset that I need to go through to check whether the numbers are nearly the same. For instance, the first element compared with the second has a nearly identical pattern, except for the leading '9'. The second compared with the third is nearly equal, except for the last digit of the sequence: one ends with 6 while the other ends with 5. Lastly, the last two numbers are 100% equal. If I used the unique() function, only the last case would be correctly excluded.
I'm wondering if there is a function that can flag nearly equal values, perhaps by calculating the percentage of equality, so I can focus my attention on the cases with a high equality rate.
dat <- data.frame(text = c("87775956",
"987775956",
"987775955",
"987481732",
"987481732"))
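A "percentage of equality" can be approximated with a string-similarity ratio. The sketch below is in Python rather than R and uses the standard library's difflib.SequenceMatcher. The function name near_duplicates and the 0.85 threshold are assumptions chosen for illustration, not part of the original question. In R, the base function adist() (edit distance) could serve a similar purpose.

```python
from difflib import SequenceMatcher
from itertools import combinations

def near_duplicates(values, threshold=0.85):
    """Return (i, j, ratio) for every pair whose similarity ratio meets the threshold.

    The threshold is an arbitrary illustrative cutoff; tune it for real data.
    """
    flagged = []
    for (i, a), (j, b) in combinations(enumerate(values), 2):
        # ratio() is 2 * matches / total length, so 1.0 means identical strings
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            flagged.append((i, j, ratio))
    return flagged

text = ["87775956", "987775956", "987775955", "987481732", "987481732"]
flagged = near_duplicates(text)
for i, j, ratio in flagged:
    print(f"{text[i]} ~ {text[j]}: {ratio:.0%}")
```

Comparing every pair is quadratic in the number of rows, so for a long column it may be worth sorting first and comparing only neighbours.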
Related
I am trying to log values for a meter at one-minute intervals. Instead of entering the full value each time, I want to be able to enter just the digits that are different, as opposed to entering the difference between the values or the entire number.
I have a rather cumbersome formula for doing this. It works for new digits greater than zero but less than ten; beyond that, it just adds the number to the total.
If I could just enter the new digits, that would be ideal, whether they be .65, 1.25, 35.95, or 501.69 etc.
Thank you!
I attacked it from the other direction and it now does what I want with a few fewer backflips.
It still removes the decimal so a trailing zero doesn't get lost in LEN().
Whatever the number of digits that are in the new number, it replaces that number of digits out of the original number with zeroes.
It adds the new number to the modified number.
It divides by 100 to get the decimal back.
I have no doubt there is a much more elegant way of doing this.
Thank you!
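The steps described in the follow-up above can be sketched in Python. This is only an illustration of the described procedure, not the poster's spreadsheet formula: the function name apply_new_digits is hypothetical, and it assumes readings have two decimal places and that the new digits are entered as text so a trailing zero is not lost.

```python
from decimal import Decimal

def apply_new_digits(previous: str, entry: str) -> Decimal:
    """Overwrite the trailing digits of `previous` with the digits in `entry`.

    Assumes both values carry two decimal places (e.g. "1234.56", ".65").
    """
    # Remove the decimal by working in hundredths
    prev_hundredths = int(Decimal(previous) * 100)
    entry_hundredths = int(Decimal(entry) * 100)
    # Count the digits typed, ignoring the decimal point
    n_digits = len(entry.replace(".", ""))
    # Replace that many trailing digits of the original number with zeroes
    scale = 10 ** n_digits
    kept = prev_hundredths - prev_hundredths % scale
    # Add the new digits, then divide by 100 to get the decimal back
    return Decimal(kept + entry_hundredths) / 100

print(apply_new_digits("1234.56", ".65"))    # last two digits replaced
print(apply_new_digits("1234.56", "35.95"))  # last four digits replaced
```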
In my program, I generate unique pairs of WidthxHeight, e.g. 20x25.
Let's say I have a height range from 20 to 40 and a width range from 10 to 50.
In my scenario, there shouldn't be duplicated pairs: e.g. if there is 20x25, there should be no 25x20, as it would be effectively the same.
I already have a working implementation where these are generated properly in a loop.
Now what I'm looking for is the math to calculate the number of combinations beforehand so I can display it, without going through the whole loop.
I believe there must be a quick calculation I could use for this?
(This question is borderline at being a programming problem. I will give a brief answer here, but the question could also be asked at the Mathematics Stack Exchange site after you add some of your own work. I'll keep my answer a little vague and leave the coding to you.)
First decide how many pairs overall you would get. In your case there are 40-20+1 = 21 heights and 50-10+1 = 41 widths, so there are 21*41 = 861 pairs total.
Now you need to subtract the number of duplicates. First find the range of numbers that could be both heights and widths. In your case that is 20 to 40, which is 40-20+1 = 21 numbers. A "duplicate" pair here would have two different numbers in the pair, with the larger one first. (We'll consider the pair with the larger one last to be the "original".) The number of pairs of distinct numbers taken from 21 choices where the first is larger than the second is a famous combinatorial problem, with multiple ways to find the formula. Here I'll just say that the answer is the number of combinations of 2 from 21, or 21*20/2 = 210 pairs.
So your total of non-duplicate pairs is 861-210 = 651.
I checked this particular case with the Python expression
len(set([tuple(sorted((h,w))) for h in range(20,41) for w in range(10,51)]))
and got the expected result, 651.
Can you turn that into a short calculation for the general case?
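One possible way to generalize the calculation above is sketched here, assuming inclusive integer ranges. The function names count_unique_pairs and brute_force are illustrative, not from the original answer. The overlap of the two ranges plays the role of the 21 shared numbers in the worked example, and the brute-force version mirrors the set-based check.

```python
def count_unique_pairs(h_min, h_max, w_min, w_max):
    """Count height/width pairs, treating (a, b) and (b, a) as the same pair."""
    n_heights = h_max - h_min + 1
    n_widths = w_max - w_min + 1
    # Numbers that can be both a height and a width (0 if the ranges are disjoint)
    overlap = max(0, min(h_max, w_max) - max(h_min, w_min) + 1)
    # Each unordered pair of distinct overlapping numbers was counted twice
    duplicates = overlap * (overlap - 1) // 2
    return n_heights * n_widths - duplicates

def brute_force(h_min, h_max, w_min, w_max):
    """Reference count using the set-of-sorted-tuples approach."""
    return len({tuple(sorted((h, w)))
                for h in range(h_min, h_max + 1)
                for w in range(w_min, w_max + 1)})

print(count_unique_pairs(20, 40, 10, 50))  # 21*41 - 210 = 651
```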
I would like to be able to control the hierarchy of elements I extract from a search string.
Specifically, in the string "425 million won", I would like to extract "won" first, but then "n" if "won" doesn't appear.
I want the result to be "won" for the following:
stringr::str_extract("425 million won", "won|n")
Note that specifying a space before won in my regex is inadequate because of other limitations in my data (there may not necessarily be a space between "million" and "won"). Ideally, I would like to do this using regex, as opposed to if-else clauses because of performance considerations.
See code in use here
pattern <- "^(?:(?!won).)*\\K(?:won|n)"
s <- "425 million won"
m <- gregexpr(pattern,s,perl=TRUE)
regmatches(s,m)[[1]]
Explanation
^ Assert position at the start of the line
(?:(?!won).)* Tempered greedy token matching any character, as long as that character does not begin an occurrence of won
\K Resets the starting point of the match. Any previously consumed characters are no longer included in the final match
(?:won|n) Match either won or n
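The same prioritisation can be sketched in Python for comparison. Python's built-in re module does not support \K, so this version uses a capturing group instead, which is an adaptation rather than the answer's exact pattern. The helper name extract_unit is hypothetical.

```python
import re

# Tempered greedy token followed by a capturing group; group 1 plays the role
# of the text after \K in the PCRE pattern.
PATTERN = re.compile(r"^(?:(?!won).)*(won|n)")

def extract_unit(s):
    """Return 'won' if it occurs anywhere in s, otherwise 'n' if present, else None."""
    m = PATTERN.search(s)
    return m.group(1) if m else None

print(extract_unit("425 million won"))  # "won" takes priority over the "n" in "million"
print(extract_unit("425 millionwon"))   # no space needed before "won"
print(extract_unit("425 million"))      # falls back to "n"
```

Because the prefix can never step past the start of "won", the alternation only reaches the n branch when "won" does not occur in the string.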
If you just want to extend on the code you already have:
na.omit(str_extract("420 million won", c("won", "n")))[1]
n <- length(rle(sign(z)))
z contains 1 and -1. n should be the number of times the sign of z changes.
The code above does not lead to the desired outcome. If I expand the command to
length(rle(sign(z))[[1]])
it works. I don't understand the underlying mechanism: how does [[1]] solve the problem?
rle returns a list consisting of two components: lengths, and values. As such, its own length is always 2. By contrast, you want to know the length of either of those components (they obviously have the same length). So either length(rle(…)[[1]]) or length(rle(…)[[2]]) would work. Better to use the names instead of an index though, e.g.
length(rle(z)$lengths)
However, this won’t be the number of times the sign changes; rather, it will be the number of sign changes plus 1, because it counts the runs, not the boundaries between them.
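To make the off-by-one concrete, here is a small Python sketch of the same idea, with itertools.groupby standing in for rle. The function name count_sign_changes is illustrative only.

```python
from itertools import groupby

def count_sign_changes(z):
    """Count how often consecutive elements of z differ in sign."""
    # Equivalent of sign(z): 1, 0 or -1 for each element
    signs = [(x > 0) - (x < 0) for x in z]
    # groupby yields one group per run of equal values, like rle(...)$lengths
    n_runs = sum(1 for _ in groupby(signs))
    # Runs are separated by changes, so changes = runs - 1 (0 for empty input)
    return max(n_runs - 1, 0)

z = [1, 1, -1, -1, -1, 1, -1]
print(count_sign_changes(z))  # 4 runs, so 3 sign changes
```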
I am doing image processing, and I came across a situation where I have to compare two vectors and find an instance of the smaller vector in the larger vector.
Say the two vectors are A, with 100 elements (or entries),
and B, with 10 elements. B is a model, and it may not be present in vector A exactly as it is. I can compare 10 elements at a time and find the difference. The ideal case is that B is present somewhere and the difference is zero. Otherwise a minimum will result at some arbitrary location, and I am missing that location.
Please help me with an algorithm so that I can find B's closest instance in A.
What you are looking for is the cross-correlation function. The peak of the cross-correlation of the two vectors will be the point where vector B is most similar to vector A.
You may want to read the explanation of how it is implemented in MATLAB HERE, as it gives an easier explanation of how this operation can be implemented in software.
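As a rough illustration, the sliding comparison can be written so that it returns the location as well as the score. This pure-Python sketch shows two variants: the sum of squared differences the question already describes, and a normalized cross-correlation. The normalized form is a substitution for the plain cross-correlation named above, chosen because a raw cross-correlation peak can be pulled toward high-amplitude regions of A. The function names are hypothetical.

```python
from math import sqrt

def best_offset_ssd(a, b):
    """Offset in a where the sum of squared differences to b is smallest."""
    n = len(b)
    scores = [sum((x - y) ** 2 for x, y in zip(a[i:i + n], b))
              for i in range(len(a) - n + 1)]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

def best_offset_ncc(a, b):
    """Offset in a where the normalized cross-correlation with b is largest."""
    n = len(b)
    mean_b = sum(b) / n
    b0 = [y - mean_b for y in b]
    norm_b = sqrt(sum(y * y for y in b0))
    best, best_score = None, -2.0  # correlation scores lie in [-1, 1]
    for i in range(len(a) - n + 1):
        window = a[i:i + n]
        mean_w = sum(window) / n
        w0 = [x - mean_w for x in window]
        norm_w = sqrt(sum(x * x for x in w0))
        if norm_w == 0 or norm_b == 0:
            continue  # flat window or flat model: correlation is undefined
        score = sum(x * y for x, y in zip(w0, b0)) / (norm_w * norm_b)
        if score > best_score:
            best, best_score = i, score
    return best, best_score

# A noisy copy of b is embedded in a at offset 20
a = [0.0] * 20 + [1.1, 2.9, 2.0, 5.2, 3.9] + [0.0] * 20
b = [1, 3, 2, 5, 4]
print(best_offset_ssd(a, b)[0], best_offset_ncc(a, b)[0])
```

For long vectors an FFT-based correlation would be faster, but the loop keeps the bookkeeping of the location explicit.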