R gurus, how can I keep the last 9 digits of an alphanumeric string?
For example:
LA XAN 000262999444
RA XAN 000263000507
WA XAN 000263268038
SA XAN 000263000464
000263000463
000263000476
I only want to get
262999444
263000507
263268038
263000464
263000463
263000476
Thanks a lot
It's pretty easy in stringr because str_sub interprets negative indices as offsets from the end of the string.
library(stringr)
str_sub(xx, -9, -1)
If you just want the last 9 positions, you could just use substr:
substr(xx,nchar(xx) - 8,nchar(xx))
assuming that your character vector is stored in xx. Also, as Hadley notes below, nchar will return unexpected things if xx is a factor, not a character vector. His solution using stringr is definitely preferable.
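A minimal base R sketch of the substr approach that also copes with factor input. The helper name last_n is hypothetical, and the as.character call is there because nchar is not meant to be used on factors:

```r
# Hypothetical helper: keep the last n characters of each element.
last_n <- function(x, n = 9) {
  x <- as.character(x)        # convert factors to their labels first
  len <- nchar(x)
  substr(x, len - n + 1, len) # a start below 1 is treated as 1
}

xx <- c("LA XAN 000262999444", "000263000463")
last_n(xx)          # "262999444" "263000463"
last_n(factor(xx))  # same result for a factor
```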
Assuming the input is a vector named "strgs":
sub(".*(.........)$", "\\1", strgs)
#[1] "262999444" "263000507" "263268038" "263000464"
?sub
?regex
Not really sure which language you are looking for but here would be a c# implementation.
The logic would be something like :
string s = "WA XAN 000263268038";
s = s.Substring(s.Length - 9); // last 9 characters: "263268038"
Hope this helps!
I have a string that contains a repeating pattern, like this:
my_string = "`d#k`0.55`0.55`0.55`0.55`0.55`0.55`0.55`0.55`0.55`n$l`0.4`0.1`0.25`0.28`0.18`0.3`0.17`0.2`0.03`!lk`0.04`0.04`0.04`0.04`0.04`0.04`0.04`0.04`0.04`vnabgjd`0.02`0.02`0.02`0.02`0.02`0.02`0.02`0.02`0.02`pogk(`1.01`0.71`0.86`0.89`0.79`0.91`0.78`0.81`0.64`r!#^##niw`0.0014`0.0020`9.9999`9.9999`0.0020`0.0022`0.0032`9.9999`0.0000`"
As you can see, the pattern is a [`nonnumber] token followed by repeated [`number.num~] tokens.
I want to count how many [`number.num~] tokens appear between consecutive [`nonnumber] tokens.
I tried a regex:
index <- gregexpr("`(\\w{2,20})`\\d\\.\\d(.*?)`\\D",cle)
regmatches(cle,index)
But with this code the matches overlap on the trailing [`\D], so I can't count how many times the pattern occurs.
If you know a method for this, please leave a reply.
Using strsplit: we split at the backtick, coerce the pieces to numeric, and take the differences between the positions where the coercion yields NA (these are the non-numeric labels). Note that we need to drop the first (empty) element after strsplit and append an NA to the numeric vector so that the last group is counted too. The result is a vector of counts named with the non-numeric elements via setNames (not very good names, but they show what is going on).
s <- el(strsplit(my_string, "\\`"))[-1]
s.num <- suppressWarnings(as.numeric(s))
setNames(diff(which(is.na(c(s.num, NA)))) - 1,
s[is.na(s.num)])
# d#k n$l !lk vnabgjd pogk( r!#^##niw
# 9 9 9 9 9 9
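The same count can also be sketched without relying on coercion warnings, by grouping tokens under the most recent label. This is only an illustration: the input below is a shortened stand-in for the question's string, and the "looks numeric" test is an assumed regex, not part of the answer above.

```r
# Shortened stand-in for the question's string (assumption for illustration).
my_string <- "`d#k`0.55`0.55`n$l`0.4`0.1`0.25`!lk`0.04`"

# Split on the literal backtick and drop the empty leading piece.
tokens <- strsplit(my_string, "`", fixed = TRUE)[[1]][-1]

# TRUE for tokens that look like plain decimal numbers.
is_num <- grepl("^[0-9]+(\\.[0-9]+)?$", tokens)

# Each label starts a new group; sum the numeric tokens within each group.
group <- cumsum(!is_num)
counts <- as.vector(tapply(is_num, group, sum))
names(counts) <- tokens[!is_num]
counts
```

Here counts holds 2, 3 and 1 for the labels d#k, n$l and !lk.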
Code
gsub('101', '111', '110101101')
#[1] "111101111"
Would anyone know why the second 0 in the input isn't being substituted into a 1 in the output?
I want to find the pattern 101 in a string and replace it with 111. Later on I wish to turn longer sub-sequences into sequences of 1's, such as 10001 to 11111.
You could use a lookahead ?=
The way this works is q(?=u) matches a q that is followed by a u, without making the u part of the match.
Example:
gsub('10(?=1)', '11', '110101101', perl=TRUE)
#[1] "111111111"
Edit: you need to call gsub with perl=TRUE to use lookaheads.
It's because gsub doesn't work recursively.
gsub('101', '111', '110101101') scans its third argument (the input string) from left to right and consumes each match as it finds it. So it finds the first 101 and is left with 01101 to scan. Think about it: if it replaced "recursively", something like gsub('11', '111', '11') would produce an infinite string of '1' and break. It doesn't re-check the text it has already replaced.
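The non-overlapping scan can be made visible with gregexpr, which reports where each match starts. This is a small sketch rather than part of the answer above; the zero-width lookahead pattern is used only to list every position where 101 begins.

```r
x <- "110101101"

# Ordinary scan: matching resumes after each consumed match.
plain <- as.integer(gregexpr("101", x)[[1]])

# A zero-width lookahead consumes nothing, so every start position is reported.
overlapping <- as.integer(gregexpr("(?=101)", x, perl = TRUE)[[1]])

plain        # 2 7
overlapping  # 2 4 7
```

The occurrence starting at position 4 is the one gsub skips, because its first character was already consumed by the match at position 2.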
It is because once R has matched the first 101 (characters 2 to 4 of 110101101), the scan resumes at the next 0, which R then sees as the start of 011 rather than as part of another 101.
It seems that you only want to replace '0' by '1'. Then you can just use gsub('0', '1', '110101101')
Later on I wish to turn longer sub-sequences into sequences of 1's, such as 10001 to 11111.
Hopefully, R provides a means to generate the replacement string based on the matched substring. (This is a common feature.)
If so, search for 10+, and have the replacement string generator create a string consisting of a number of 1 characters equal to the length of the match. (e.g. If 100 is matched, replace with 111. If 1000 is matched, replace with 1111. etc.)
I don't know R in the least. Here's how it's done in some other languages in case that helps:
Perl:
$s =~ s{10+}{ "1" x length($&) }ger
Python:
re.sub(r'10+', lambda match: '1' * len(match.group()), s)
JavaScript:
s.replace(/10+/g, function(match) { return '1'.repeat(match.length) })
JavaScript (ES6):
s.replace(/10+/g, match => '1'.repeat(match.length))
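For what it's worth, base R can compute a replacement from each match through the replacement form of regmatches. The following is a sketch of the 10+ idea described above, assuming R 3.3.0 or later for strrep; like the snippets in the other languages, it also converts a run of zeros that is not followed by a 1.

```r
s <- c("11010110001", "100")

# Locate every "1 followed by one or more 0" match in each string.
m <- gregexpr("10+", s)

# Replace each match with a run of "1" of the same length.
regmatches(s, m) <- lapply(regmatches(s, m),
                           function(hit) strrep("1", nchar(hit)))
s  # "11111111111" "111"
```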
According to the OP
Later on I wish to turn longer sub-sequences into sequences of 1's,
such as 10001 to 11111.
If I understand correctly, the final goal is to replace any sub-sequence of consecutive 0 into the same number of 1 if they are surrounded by a 1 on both sides.
In R, this can be achieved by the str_replace_all() function from the stringr package. For demonstration and testing, the input vector contains some edge cases where substrings of 0 are not surrounded by 1.
input <- c("110101101",
"11010110001",
"110-01101",
"11010110000",
"00010110001")
library(stringr)
str_replace_all(input, "(?<=1)0+(?=1)", function(x) str_dup("1", str_length(x)))
[1] "111111111" "11111111111" "110-01111" "11111110000" "00011111111"
The regex "(?<=1)0+(?=1)" uses look behind (?<=1) as well as look ahead (?=1) to ensure that the subsequence 0+ to replace is surrounded by 1. Thus, leading and trailing subsequences of 0 are not replaced.
The replacement is computed by a function which returns a run of 1s of the same length as the run of 0s being replaced.
Regex and stringr newbie here. I have a data frame with a column from which I want to find 10-digit numbers and keep only the first three digits. Otherwise, I want to just keep whatever is there.
So to make it easy let's just pretend it's a simple vector like this:
new<-c("111", "1234567891", "12", "12345")
I want to write code that will return a vector with elements: 111, 123, 12, and 12345. I also need to write code (I'm assuming I'll do this iteratively) where I extract the first two digits of a 5-digit string, like the last element above.
I've tried:
gsub("\\d{10}", "", new)
but I don't know what I could put for the replacement argument to get what I'm looking for. Also tried:
str_replace(new, "\\d{10}", "")
But again I don't know what to put in for the replacement argument to get just the first x digits.
Edit: I disagree that this is a duplicate question because it's not just that I want to extract the first X digits from a string but that I need to do that with specific strings that match a pattern (e.g., 10 digit strings.)
If you are willing to use stringr, the package that provides the str_replace you are already using, just use str_extract:
vec <- c(111, 1234567891, 12)
str_extract(vec, "^\\d{1,3}")
The regex ^\\d{1,3} matches between 1 and 3 digits at the very beginning of the string. str_extract, as the name implies, extracts and returns these matches. Note that this truncates every element to at most three digits, so a value such as "12345" also becomes "123".
You may use
new<-c("111", "1234567891", "12")
sub("^(\\d{3})\\d{7}$", "\\1", new)
## => [1] "111" "123" "12"
Details
^ - start of string anchor
(\d{3}) - Capturing group 1 (this value is accessed using \1 in the replacement pattern): three digit chars
\d{7} - seven digit chars
$ - end of string anchor.
So, the sub command only matches strings that are composed only of 10 digits, captures the first three into a separate group, and then replaces the whole string (as it is the whole match) with the three digits captured in Group 1.
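The same idea can be parameterised to cover the follow-up in the question (keeping the first two digits of 5-digit strings). This is only a sketch; the helper name truncate_exact and its arguments are hypothetical.

```r
# Hypothetical helper: for strings made of exactly `total` digits,
# keep the first `keep` digits; leave everything else unchanged.
truncate_exact <- function(x, total, keep) {
  pattern <- sprintf("^(\\d{%d})\\d{%d}$", keep, total - keep)
  sub(pattern, "\\1", x)
}

new <- c("111", "1234567891", "12", "12345")
step1 <- truncate_exact(new, total = 10, keep = 3)
step1  # "111" "123" "12" "12345"
step2 <- truncate_exact(step1, total = 5, keep = 2)
step2  # "111" "123" "12" "12"
```

Applying the two steps in sequence is safe here because the first step never produces a 5-digit string.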
You can use:
as.numeric(substring(my_vec,1,3))
#[1] 111 123 12
Here is a string that I have
"7MA_S_VE_MS_FB_MEASURE_P1_2013-08-21_17-42-19.BMP"
I am trying to extract dates this way:
library(stringr)
as.Date(str_extract(test,"[0-9]{4}/[0-9]{2}/[0-9]{2}"),"%Y-%m-%d")
I am getting NA for this.
Desired output is
2013-08-21
Can someone point me in the right direction?
You have replaced your dash - with a slash / in your regular expression.
as.Date(str_extract(string, "[0-9]{4}-[0-9]{2}-[0-9]{2}"), format="%Y-%m-%d")
# [1] "2013-08-21"
But you can also replace the [0-9] bits with \d, which matches the same digits here (in stringr's ICU engine, \d also matches decimal digits from other scripts). I'm not sure why, but regex pros seem to always use the \d version (note that you'll have to escape the backslash with another backslash):
as.Date(str_extract(string, "\\d{4}-\\d{2}-\\d{2}"), format="%Y-%m-%d")
# [1] "2013-08-21"
If the date is always at a fixed position:
as.Date(strsplit(str1, "_")[[1]][8])
#[1] "2013-08-21"
I'm looking for an R function like sprintf that can easily format an output on both sides of the decimal point. I know that %04d would work for integers, %04d formats decimal numbers before the decimal, and %04f formats decimal numbers after the decimal, but I can't seem to figure an elegant way to format on each side (I can, fortunately, think of a number of hideous ways). Is there an easy way? My desired output:
1 becomes 01.00
4.2 becomes 04.20
12.3 becomes 12.30
42.42 remains as is
Thanks!!
As pointed out by @nrussell, R's sprintf() function is sufficient:
sprintf("%05.2f",x)
The sequence 05.2 in the format specifier indicates that the output of the number represented as a floating point (f) should be padded to the left with 0 so that the resulting string contains at least 5 characters, and that two digits should be displayed after the decimal point.
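To see what each part of the specifier contributes, here is a short sketch comparing the zero flag against plain width padding, plus one value wider than the minimum width (the extra value 123.456 is just an illustration):

```r
x <- c(1, 4.2, 12.3, 42.42)

zero_padded  <- sprintf("%05.2f", x)   # "01.00" "04.20" "12.30" "42.42"
space_padded <- sprintf("%5.2f", x)    # " 1.00" " 4.20" "12.30" "42.42"

# The width is a minimum, so longer numbers are not truncated.
wide <- sprintf("%05.2f", 123.456)     # "123.46"
```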
A more convoluted version would be to combine it with the str_pad() function from the stringr package. It has no advantage in this case compared to the solution using only sprintf(). I am adding it just for completeness, and because it was the solution I came up with.
x <- c(1, 4.2, 12.3, 42.42)
stringr::str_pad(sprintf("%.2f",x), 5, pad="0")
#[1] "01.00" "04.20" "12.30" "42.42"
The argument "5" passed to the str_pad() function indicates the length (width) of the string.