converting numerical values into decimal numbers - r

I have a text file of values that contain one, two, or in some cases three decimal points. The values are generated by software from the signal intensity of genes. When I tried to compute a distance matrix from them, I got this warning:
Warning message:
In dist(sam) : NAs introduced by coercion
A sample text file is given below:
sample1
a 23.45.12
b 123.345.234
c 45.2311.34
I need to convert these values to real numbers with a single decimal point so that I can compute a distance matrix from them and use it for clustering. My expected result is:
sample1
a 23.45
b 123.345
c 45.2311
Please help.

You can do this in one line of code with as.numeric and gsub with a suitable regular expression:
sample1 <- c(
a = "23.45.12",
b = "123.345.234",
c = "45.2311.34"
)
as.numeric(
gsub("(\\d+\\.\\d+)\\..*", "\\1", sample1)
)
[1] 23.4500 123.3450 45.2311
The regular expression:
\\d+ finds one or more digits
\\. finds a literal period
Thus (\\d+\\.\\d+) finds two runs of digits with a period in between, and groups them (with the parentheses)
Finally, \\..* finds a second period followed by anything up to the end of the string
gsub then replaces the entire match with only what was found inside the parentheses. This is called a regular expression back reference, indicated by \\1.
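As a minimal sketch of how the same substitution might be applied before calling dist: this assumes the file has already been read into a data frame named sam (the name from the warning message) with the values held as character strings in a column sample1. The data frame below is built by hand as a stand-in for the real file.

```r
# Hypothetical stand-in for the data read from the text file
sam <- data.frame(
  sample1 = c("23.45.12", "123.345.234", "45.2311.34"),
  row.names = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Keep the digits, the first period and the digits after it; drop the rest
sam$sample1 <- as.numeric(sub("^(\\d+\\.\\d+)\\..*$", "\\1", sam$sample1))

# The column is now numeric, so dist() no longer has to coerce strings
d <- dist(sam)
```

Because the column is converted before dist is called, the "NAs introduced by coercion" warning should no longer be triggered by these values.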

Related

How to remove ending zeros in binary bit sequence in R?

I need to remove ending zeros from binary bit sequences.
The length of the bit sequence is fixed, say 52. i.e.,
0101111.....01100000 (52-bit),
10111010..1010110011 (52-bit),
10111010..1010110100 (52-bit).
When a decimal number is converted to normalized double precision, the significand occupies 52 bits, so zeros are appended on the right if the significand is shorter than 52 bits. I am reversing the process: I am trying to convert a normalized double-precision value in memory back to a decimal number, so I have to remove the trailing zeros that were used to pad the significand to 52 bits.
A sequence is not guaranteed to end in 0s (see the 2nd example above). If it does, all trailing zeros must be removed:
f(0101111.....01100000) # 0101111.....011; leading 0 must be kept
f(10111010..1010110011) # 10111010..1010110011; no truncation
f(10111010..1010110100) # 10111010..10101101
Unfortunately, the number of truncated 0s at the end differs. (5 in the 1st example; 2 in the 3rd example).
It is OK for me if input and output class are string:
f("0101111.....01100000") # "0101111.....011"; leading 0 must be kept
f("10111010..1010110011") # "10111010..1010110011"; no truncation
f("10111010..1010110100") # "10111010..10101101"
Any help is greatly appreciated.
This is a simple regular expression.
f <- function(x) sub('0+$', '', x)
Explanation:
0 - matches the character 0.
0+ - the character zero repeated at least one time, meaning, one or more times.
$ matches the end of the string.
0+$ the character 0 repeated one or more times and nothing else until the end of the string.
Replace the sub-string matched by the pattern with the empty string, ''.
Now test the function.
f("010111101100000")
#[1] "0101111011"
f("0100000001010101100010000000000000000000000000000000000000000000")
#[1] "010000000101010110001"
f("010000000101010110001000000")
#[1] "010000000101010110001"
f("00010000000101010110001000000")
#[1] "00010000000101010110001"
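Since sub is vectorized, the same function should also work on a whole character vector at once. The short sketch below uses made-up bit strings (not taken from the question) and shows one edge case worth knowing about: a string consisting only of zeros is reduced to the empty string.

```r
# Same definition as above
f <- function(x) sub("0+$", "", x)

# Made-up inputs: trailing zero, no trailing zero, all zeros
bits <- c("0110", "1011", "0000")
trimmed <- f(bits)
trimmed
# "011" "1011" ""
```

If an all-zero significand needs to be kept as a single "0", that case would have to be handled separately.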

zero padding regex dependent on length of digits

I have a field which contains two characters, some digits and potentially a single letter. For example
QU1Y
ZL002
FX16
TD8
BF007P
VV1395
HM18743
JK0001
I would like to keep all letters in their original positions, but return the digits as follows.
For 1 to 3 digits:
return the digits left-padded with zeros to three digits
For 4 or more digits:
if the first digit is not a zero, return the first 4 digits; if the first digit is a zero, truncate to three digits
example from the data above
QU001Y
ZL002
FX016
TD008
BF007P
VV1395
HM1874
JK001
The implementation will be in R but I'm interested in a straight regex solution, I'll work out the R side of things. It may not be possible in straight regex which is why I can't get my head round it.
This pattern identifies the correct ones, but I'm hoping to correct those that are not right.
"[A-Z]{2}[1-9]{0,1}[0-9]{1,3}[F,Y,P]{0,1}"
For the curious, they are flight numbers but entered by a human. Hence the variety...
You may use
> library(gsubfn)
> l <- c("QU1Y", "ZL002", "FX16", "TD8", "BF007P", "VV1395", "HM18743", "JK0001")
> gsubfn('^[A-Z]{2}\\K0*(\\d{1,4})\\d*', ~ sprintf("%03d",as.numeric(x)), l, perl=TRUE)
[1] "QU001Y" "ZL002" "FX016" "TD008" "BF007P" "VV1395" "HM1874" "JK001"
The pattern matches
^ - start of string
[A-Z]{2} - two uppercase letters
\\K - the text matched so far is removed from the match
0* - 0 or more zeros
(\\d{1,4}) - Capturing group 1: one to four digits
\\d* - zero or more digits.
Group 1 is passed to the callback function, where sprintf("%03d",as.numeric(x)) left-pads the value with zeros to a width of at least three digits.
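For comparison, here is a sketch of the same idea in base R without gsubfn. It splits each value into prefix, digits and suffix with three sub calls and reassembles them. The function name pad_flight is hypothetical, and the sketch assumes every input starts with two uppercase letters followed by at least one digit, as in the sample data.

```r
# Hypothetical helper; assumes inputs look like the samples in the question
pad_flight <- function(x) {
  prefix <- sub("^([A-Z]{2}).*$", "\\1", x)                   # two leading letters
  digits <- sub("^[A-Z]{2}0*(\\d{1,4})\\d*.*$", "\\1", x)     # up to 4 digits after leading zeros
  suffix <- sub("^[A-Z]{2}\\d*", "", x)                       # whatever follows the digits
  paste0(prefix, sprintf("%03d", as.integer(digits)), suffix)
}

l <- c("QU1Y", "ZL002", "FX16", "TD8", "BF007P", "VV1395", "HM18743", "JK0001")
padded <- pad_flight(l)
padded
# "QU001Y" "ZL002" "FX016" "TD008" "BF007P" "VV1395" "HM1874" "JK001"
```

This trades the single callback-based substitution for three simpler patterns, which may be easier to debug but scans each string three times.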

Extracting a character that contains a certain type of element in R

For example, let's say I have the following string:
vec <- " #_Jim98 Did you turn off the stove #9am?"
I would like to count the number of words starting with # that contain only numbers, letters, # and the underscore symbol. In the case above the count is 1: #9am? contains the ? symbol, so it is not counted.
Also, a word must not be longer than 10 characters.
1) Search for # followed by one or more of the allowed characters "\\w" (letters, digits and underscore), followed by a whitespace character "\\s" or (|) the end of the string $. If zero word characters are allowable, change the + to *. The expression is vectorized, i.e. x can be a character vector. No packages are used.
x <- " #_Jim98 Did you turn off the stove #9am?" # test input
pat <- "#\\w+(\\s|$)"
lengths(regmatches(x, gregexpr(pat, x)))
## [1] 1
Note that the reason for regmatches is that gregexpr produces a -1 rather than a zero length vector for no matches whereas regmatches will produce a zero length vector. Thus it works for the edge case of no matches.
2) A slightly more compact solution would be this where pat is from above:
library(gsubfn)
lengths(strapplyc(x, pat))
## [1] 1
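Neither pattern above enforces the 10-character limit mentioned in the question. One way it might be added is with a bounded quantifier. The sketch below assumes the limit counts the # itself, so at most 9 word characters may follow it; the name pat10 and the extra test strings are made up for illustration.

```r
# Test input: the original string plus two made-up edge cases
x <- c(" #_Jim98 Did you turn off the stove #9am?",
       "#averyveryverylongtag is too long",
       "no tags here")

# '#' followed by 1 to 9 word characters, then whitespace or end of string
pat10 <- "#\\w{1,9}(\\s|$)"
counts <- lengths(regmatches(x, gregexpr(pat10, x)))
counts
# 1 0 0
```

If the limit is meant to exclude the #, the upper bound would be 10 instead of 9.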
We can do this with a regular expression. I'm interpreting the question as counting words that are separated by space characters or occur at the beginning or end of the string. This assumes the # is at the start of the word; I match a # followed by any number of word characters \\w (letters, digits and underscore, so the extra |_ in the pattern is redundant but harmless). You can remove the first (^|\\s) if you don't need the # to be at the beginning of the word and would like to count 3 words in, for example, " #_Jim98 Did the Latin#s or tom#domain turn off the stove #9am?"
stringr::str_count(" #_Jim98 Did you turn off the stove #9am?", "(^|\\s)#(\\w|_)*?($|\\s)")
#> [1] 1
Created on 2018-04-12 by the reprex package (v0.2.0).

Gathering the correct amount of digits for numbers when text mining

I need to search for specific information within a set of documents that follows the same standard layout.
After I used grep to find the keywords in every document, I went on collecting the numbers or characters of interest.
One piece of data I have to collect is the Total Power that appears as following:
TotalPower: 986559. (UoPow)
Since I had already correctly selected this excerpt, I created the following function that takes the characters between positions n and m, where n and m are counted from the right end of the string.
substrRight <- function(x, n,m){
substr(x, nchar(x)-n+1, nchar(x)-m)
}
It's important to say that from the ":" to the number 986559, there are 2 spaces; and from the "." to the "(", there's one space.
So I wrote:
TotalP = substrRight(myDf[i],17,9) [1]
where myDf is a character vector with all the relevant observations.
After I loop over all my observations, line [1] gives me the numbers I want, but I noticed that when the number was 986559, the result was 98655. It simply doesn't "see" the final 9.
The code seems to work fine for the rest of the data. This number (986559) is the largest in the data and the only one of order 10^5.
How can I make sure that I will gather all digits in every number?
Thank you for the help.
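We can extract the digits before a . by using a regex lookahead
library(stringr)
str1 <- "TotalPower:  986559. (UoPow)"  # assumed input: the excerpt shown in the question
str_extract(str1, "\\d+(?=\\.)")
#[1] "986559"
\\d+ matches one or more digits, and the lookahead (?=\\.) requires them to be followed by a literal period without including that period in the match.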
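The same lookahead can also be used without stringr. The sketch below uses base R's regexpr with perl = TRUE and regmatches; the second input string is a made-up five-digit case added to show that the result does not depend on the number's width, which is what broke the fixed-position substr approach.

```r
# First string is the excerpt from the question; the second is a made-up shorter value
str1 <- c("TotalPower:  986559. (UoPow)",
          "TotalPower:  98655. (UoPow)")

# Digits that are immediately followed by a period (the period is not captured)
total_chr <- regmatches(str1, regexpr("[0-9]+(?=\\.)", str1, perl = TRUE))
total_num <- as.numeric(total_chr)
total_num
# 986559 98655
```

Note that regmatches drops elements with no match, so the result can be shorter than the input if some strings lack the pattern.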

regex to replace number between alphabets in R

I am looking for a regular expression in R to replace a digit that sits between two alphabetical characters. For example, replace 3 with m, like this:
Sa3ple becomes Sample
Sample1.3 stays Sample1.3
This word stays the same because 3 is not between alphabetical characters
I tried with below R code to replace 3 with m, but it's only working partially.
One issue is that when the regex matches, the matching row is not replaced with its own corrected value; instead it receives the value from the first row of col3. I'm not sure what exactly is missing.
df$col3[grep('[a-zA-Z][3][a-zA-Z]|[3][a-zA-Z]',df$col3)] <- gsub('[3]+', 'm', df$col3)
regex is hard
pos <- "Sa3ple"
neg <- "Sample1.3"
# [a-zA-Z] rather than [a-zA-z]: the range A-z also covers [ \ ] ^ _ and `
gsub("([a-zA-Z])\\d([a-zA-Z])", "\\1m\\2", pos)
#[1] "Sample"
gsub("([a-zA-Z])\\d([a-zA-Z])", "\\1m\\2", neg)
#[1] "Sample1.3"
Explanation
(...) is a capture group, which is referenced with \\1, \\2, etc.
[a-zA-Z] is a single lowercase or uppercase letter
\\d is any single digit (add + or {2} to match more than one digit)
I use this site to learn
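The code in the question misbehaves because the right-hand gsub call returns a full-length vector that is then assigned into a shorter subset, so the values are taken from the start of that vector rather than from the matching rows. A sketch of a whole-column replacement that avoids the subsetting is shown below. The data frame is made up, and the pattern is a lookaround variant (requiring perl = TRUE) that targets the digit 3 specifically, as the question asks, and also handles adjacent matches such as a3b3c.

```r
# Made-up data frame with the column name from the question
df <- data.frame(col3 = c("Sa3ple", "Sample1.3", "a3b3c"),
                 stringsAsFactors = FALSE)

# Replace a 3 only when it has a letter on both sides; the letters are not consumed
df$col3 <- gsub("(?<=[a-zA-Z])3(?=[a-zA-Z])", "m", df$col3, perl = TRUE)
df$col3
# "Sample" "Sample1.3" "ambmc"
```

Because gsub leaves non-matching strings unchanged, there is no need to select the matching rows with grep first.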
