Gathering the correct amount of digits for numbers when text mining


I need to search for specific information within a set of documents that follows the same standard layout.
After using grep to find the keywords in every document, I went on to collect the numbers or characters of interest.
One piece of data I have to collect is the Total Power, which appears as follows:
TotalPower: 986559. (UoPow)
Since I had already selected this excerpt correctly, I wrote the following function, which returns the characters between positions n and m, where n and m are counted from the right end of the string.
substrRight <- function(x, n,m){
substr(x, nchar(x)-n+1, nchar(x)-m)
}
It's important to say that from the ":" to the number 986559, there are 2 spaces; and from the "." to the "(", there's one space.
So I wrote:
TotalP = substrRight(myDf[i],17,9) [1]
where myDf is a character vector with all the relevant observations.
After I loop over all my observations, line [1] gives me the numbers I want, but I noticed that when the number was 986559, the result was 98655. It simply doesn't "see" the final 9.
The code seems to work fine for the rest of the data. This number (986559) is the highest in the data and the only one of order 10^5.
How can I make sure that I will gather all digits in every number?
Thank you for the help.

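We can extract the digits before a . by using a regex lookahead:
library(stringr)
str1 <- "TotalPower:  986559. (UoPow)"
str_extract(str1, "\\d+(?=\\.)")
#[1] "986559"
The \\d+ matches one or more digits, and the lookahead (?=\\.) requires that they be followed by a literal dot, which is not included in the match.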
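A base R equivalent, as a minimal sketch: regexpr() with perl = TRUE supports the same lookahead, and regmatches() pulls out the matched text. The second input line below is made up for illustration, to show that the number of digits does not matter.

```r
# Sample lines in the layout from the question; the second value is invented.
lines <- c("TotalPower:  986559. (UoPow)",
           "TotalPower:  4521. (UoPow)")

# Find one or more digits that are immediately followed by a literal dot.
m <- regexpr("\\d+(?=\\.)", lines, perl = TRUE)

# regmatches() returns the matched substrings; convert them to numbers.
total_power <- as.numeric(regmatches(lines, m))
total_power
```

Because the pattern anchors on the dot rather than on character positions, it does not depend on how many digits the number has.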

Related

regex to find the position of the first four concurrent unique values

I've solved 2022 Advent of Code day 6, but was wondering if there was a regex way to find the first occurrence of 4 non-repeating characters:
From the question:
bvwbjplbgvbhsrlpgdmjqwftvncz
bvwbjplbgvbhsrlpgdmjqwftvncz
# discard as repeating letter b
bvwbjplbgvbhsrlpgdmjqwftvncz
# match the 5th character, which signifies the end of the first four character block with no repeating characters
In R I've tried:
txt <- "bvwbjplbgvbhsrlpgdmjqwftvncz"
str_match("(.*)\1", txt)
But I'm having no luck

You can use
stringr::str_extract(txt, "(.)(?!\\1)(.)(?!\\1|\\2)(.)(?!\\1|\\2|\\3)(.)")
See the regex demo. Here, each (.) captures any character into a consecutively numbered group, and the (?!...) negative lookaheads make sure each subsequent . does not match any of the characters already captured.
See the R demo:
library(stringr)
txt <- "bvwbjplbgvbhsrlpgdmjqwftvncz"
str_extract(txt, "(.)(?!\\1)(.)(?!\\1|\\2)(.)(?!\\1|\\2|\\3)(.)")
## => [1] "vwbj"
Note that stringr::str_match (like stringr::str_extract) takes the input string as the first argument and the regex as the second.
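Since the question asks for a position rather than the matched text, here is a hedged base R sketch that uses the same pattern with regexpr(), which reports where the match starts and how long it is. The variable names are illustrative.

```r
txt <- "bvwbjplbgvbhsrlpgdmjqwftvncz"
pattern <- "(.)(?!\\1)(.)(?!\\1|\\2)(.)(?!\\1|\\2|\\3)(.)"

# regexpr() returns the 1-based start of the first match, with its length
# stored in the "match.length" attribute; perl = TRUE enables lookaheads.
m <- regexpr(pattern, txt, perl = TRUE)
start_pos <- as.integer(m)
end_pos <- start_pos + attr(m, "match.length") - 1L

c(start = start_pos, end = end_pos)
```

For this input the match "vwbj" starts at character 2, so the block of four distinct characters ends at character 5.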

Splitting a column in a dataframe in R into two based on content

I have a column in an R data frame that holds a product weight, e.g. 20 kg, but it mixes measuring systems, e.g. 1 lbs and 2 kg. I want to separate the value from the unit, put them in separate columns, and then convert them to a standard weight in a new column. Any thoughts on how I might achieve that? Thanks in advance.

Assume you have the column given as
x <- c("20 kg","50 lbs","1.5 kg","0.02 lbs")
and you know that there is always a space between the number and the measurement. Then you can split this up at the space-character, e.g. via
splitted <- strsplit(x," ")
This results in a list of vectors of length two, where the first is the number and the second is the measurement.
Now grab the numbers and convert them via
numbers <- as.numeric(sapply(splitted,"[[",1))
and grab the units via
units <- sapply(splitted,"[[",2)
Now you can put everything together in a data.frame.
Note: When using as.numeric, the decimal point has to be a dot. If you have commas instead, you need to replace them by a dot, for example via gsub(",","\\.",...).
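Putting the steps above together, including the conversion to a standard weight that the question asks for, might look like the sketch below. The column names and the choice of kilograms as the standard unit are assumptions for illustration; the factor 0.45359237 is the kilogram value of one pound.

```r
x <- c("20 kg", "50 lbs", "1.5 kg", "0.02 lbs")

# Split each entry at the space into number and unit.
parts <- strsplit(x, " ")
value <- as.numeric(sapply(parts, "[[", 1))
unit <- sapply(parts, "[[", 2)

# Lookup table of conversion factors to kilograms (assumed target unit).
to_kg <- c(kg = 1, lbs = 0.45359237)

weights <- data.frame(
  value = value,
  unit = unit,
  weight_kg = value * unname(to_kg[unit])  # pick the factor by unit name
)
weights
```

A unit that is missing from the lookup table yields NA in weight_kg, which makes unexpected units easy to spot.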

library(tidyr)
separate(DataFrame, VariableName, into = c("Value", "Metric"), sep = " ")
My case was simple enough that I could get away with a single space as the separator, but I learned that sep also accepts a regular expression for more complex separators.

Stringr function or gsub() to find an x digit string and extract first x digits?

Regex and stringr newbie here. I have a data frame with a column from which I want to find 10-digit numbers and keep only the first three digits. Otherwise, I want to just keep whatever is there.
So to make it easy let's just pretend it's a simple vector like this:
new<-c("111", "1234567891", "12", "12345")
I want to write code that will return a vector with elements: 111, 123, 12, and 12345. I also need to write code (I'm assuming I'll do this iteratively) where I extract the first two digits of a 5-digit string, like the last element above.
I've tried:
gsub("\\d{10}", "", new)
but I don't know what I could put for the replacement argument to get what I'm looking for. Also tried:
str_replace(new, "\\d{10}", "")
But again I don't know what to put in for the replacement argument to get just the first x digits.
Edit: I disagree that this is a duplicate question because it's not just that I want to extract the first X digits from a string but that I need to do that with specific strings that match a pattern (e.g., 10 digit strings.)

If you are willing to use stringr, the package that provides the str_replace you are already using, just use str_extract:
library(stringr)
vec <- c(111, 1234567891, 12)
str_extract(vec, "^\\d{1,3}")
The regex ^\\d{1,3} matches between one and three digits at the very start of the string. str_extract, as the name implies, extracts and returns these matches.

You may use
new<-c("111", "1234567891", "12")
sub("^(\\d{3})\\d{7}$", "\\1", new)
## => [1] "111" "123" "12"
See the R online demo and the regex demo.
Details
^ - start of string anchor
(\d{3}) - Capturing group 1 (this value is accessed using \1 in the replacement pattern): three digit chars
\d{7} - seven digit chars
$ - end of string anchor.
So, the sub command only matches strings that are composed only of 10 digits, captures the first three into a separate group, and then replaces the whole string (as it is the whole match) with the three digits captured in Group 1.
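The question also mentions a second pass that keeps the first two digits of 5-digit strings. A minimal sketch of chaining the same technique, assuming the two passes are applied one after the other (step1 and step2 are illustrative names):

```r
new <- c("111", "1234567891", "12", "12345")

# Pass 1: strings of exactly 10 digits are reduced to their first three.
step1 <- sub("^(\\d{3})\\d{7}$", "\\1", new)

# Pass 2: strings of exactly 5 digits are reduced to their first two.
step2 <- sub("^(\\d{2})\\d{3}$", "\\1", step1)

step1
step2
```

Because both patterns are anchored with ^ and $, strings of any other length pass through unchanged.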

You can use:
as.numeric(substring(my_vec,1,3))
#[1] 111 123 12

zero padding regex dependent on length of digits

I have a field which contains two characters, some digits and potentially a single letter. For example
QU1Y
ZL002
FX16
TD8
BF007P
VV1395
HM18743
JK0001
I would like to consistently return all letters in their original position, but digits as follows.
for 1 to 3 digits :
return all digits OR the digits left padded with zeros
For 4 or more digits :
it must not begin with a zero and return the 4 first digits OR if the first is a zero then truncate to three digits
example from the data above
QU001Y
ZL002
FX016
TD008
BF007P
VV1395
HM1874
JK001
The implementation will be in R but I'm interested in a straight regex solution, I'll work out the R side of things. It may not be possible in straight regex which is why I can't get my head round it.
This identifies the correct ones, but I'm hoping to correct those which are not right.
"[A-Z]{2}[1-9]{0,1}[0-9]{1,3}[F,Y,P]{0,1}"
For the curious, they are flight numbers but entered by a human. Hence the variety...

You may use
> library(gsubfn)
> l <- c("QU1Y", "ZL002", "FX16", "TD8", "BF007P", "VV1395", "HM18743", "JK0001")
> gsubfn('^[A-Z]{2}\\K0*(\\d{1,4})\\d*', ~ sprintf("%03d",as.numeric(x)), l, perl=TRUE)
[1] "QU001Y" "ZL002" "FX016" "TD008" "BF007P" "VV1395" "HM1874" "JK001"
The pattern matches
^ - start of string
[A-Z]{2} - two uppercase letters
\\K - the text matched so far is removed from the match
0* - 0 or more zeros
(\\d{1,4}) - Capturing group 1: one to four digits
\\d* - 0+ digits.
Group 1 is passed to the callback function, where sprintf("%03d",as.numeric(x)) left-pads the value with zeros to a width of at least three digits.
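For readers without gsubfn, the same steps can be sketched in base R by splitting each code into prefix, digits and suffix. This is an alternative written for illustration, not the approach above, and it assumes every element starts with two uppercase letters followed by at least one digit.

```r
l <- c("QU1Y", "ZL002", "FX16", "TD8", "BF007P", "VV1395", "HM18743", "JK0001")

prefix <- substr(l, 1, 2)                     # the two leading letters
digits <- regmatches(l, regexpr("\\d+", l))   # the run of digits
suffix <- sub("^[A-Z]{2}\\d+", "", l)         # optional trailing letter

# Strip leading zeros (but keep at least one digit), then keep at most four.
digits <- sub("^0+(?=\\d)", "", digits, perl = TRUE)
digits <- substr(digits, 1, 4)

# Left-pad with zeros to a minimum width of three and reassemble.
result <- paste0(prefix, sprintf("%03d", as.integer(digits)), suffix)
result
```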

Probability of 3-character string appearing in a randomly generated password

If you have a randomly generated password, consisting of only alphanumeric characters, of length 12, and the comparison is case insensitive (i.e. 'A' == 'a'), what is the probability that one specific string of length 3 (e.g. 'ABC') will appear in that password?
I know the number of total possible combinations is (26+10)^12, but beyond that, I'm a little lost. An explanation of the math would also be most helpful.

The string "abc" can appear in the first position, making the string look like this:
abcXXXXXXXXX
...where the X's can be any letter or number. There are (26 + 10)^9 such strings.
It can appear in the second position, making the string look like:
XabcXXXXXXXX
And there are (26 + 10)^9 such strings also.
Since "abc" can appear anywhere from the first through the 10th position, there are 10*36^9 such strings.
But this overcounts, because it counts (for instance) strings like this twice:
abcXXXabcXXX
So we need to count all of the strings like this and subtract them off of our total.
Since there are 6 X's in this pattern, there are 36^6 strings that match this pattern.
I get 7+6+5+4+3+2+1 = 28 patterns like this. (If the first "abc" is at the beginning, the second can be in any of 7 places. If the first "abc" is in the second place, the second can be in any of 6 places. And so on.)
So subtract off 28*36^6.
...but that subtracts off too much: a string like this was counted three times in the first step and then subtracted three times (once for each pair of "abc" positions), so it is no longer counted at all:
abcXabcXabcX
So we have to add the strings like this back in once. I get 4+3+2+1 + 3+2+1 + 2+1 + 1 = 20 of these patterns, meaning we have to add back in 20*(36^3).
But after that, this string has been counted 4 - 6 + 4 = 2 times:
abcabcabcabc
...so we have to subtract off 1.
Final answer:
10*36^9 - 28*36^6 + 20*(36^3) - 1
Divide that by 36^12 to get your probability.
See also the Inclusion-Exclusion Principle. And let me know if I made an error in my counting.

If A is not equal to C (so that ABC cannot overlap with itself), the probability P(n) of ABC occurring in a string of length n (assuming every alphanumeric symbol is equally likely) is
P(n)=P(n-1)+P(3)[1-P(n-3)]
where
P(0)=P(1)=P(2)=0 and P(3)=1/(36)^3
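The recurrence can be evaluated directly. The function below is a small sketch of it; the name prob_contains and its parameters are made up for illustration, and it assumes a target string that cannot overlap with itself, as stated above.

```r
# P(n) = P(n-1) + P(k) * (1 - P(n-k)), with P(i) = 0 for i < k and
# P(k) = 1 / alphabet^k, for a target of length k that cannot self-overlap.
prob_contains <- function(n, alphabet = 36, k = 3) {
  p <- numeric(n + 1)       # p[i + 1] stores P(i); all start at 0
  p_k <- 1 / alphabet^k     # probability of the target at one fixed position
  for (i in seq_len(n)) {
    if (i < k) next         # strings shorter than the target cannot contain it
    p[i + 1] <- p[i] + p_k * (1 - p[i - k + 1])
  }
  p[n + 1]
}

p12 <- prob_contains(12)
p12
```

For n = 12 this gives roughly 0.000214, slightly below the simple upper bound 10/36^3.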

To expand on Paul R's answer: probability (for equally likely outcomes) is the number of possible outcomes of your event divided by the total number of possible outcomes.
There are 10 possible places where a string of length 3 can be found in a string of length 12. And there are 9 more spots that can be filled with any other alphanumeric characters, which leads to 36^9 possibilities. So the number of possible outcomes of your event is 10 * 36^9.
Divide that by your total number of outcomes 36^12. And your answer is 10 * 36^-3 = 0.000214
EDIT: This is not completely correct. In this solution, some cases are double counted. However, they contribute only about 1.3 * 10^-8 to the probability, so this answer is still correct to about seven decimal places. If you want the full answer, see Nemo's answer.
