Strip out numbers from text: R

Hello, I have a data set consisting of text, whole numbers and decimal numbers. Each entry is a paragraph containing a mix of all of these, and I am trying to extract only the whole numbers and decimal numbers from the text. There are about 30k row entries.
Input format of data:
This. Is a good 13 part. of 135.67 code
how to strip 66.8 in the content 6879
get the numbers 3475.5 from. The data. 879 in this 369426
Output:
13 135.67
66.8 6879
3475.5 879 369426
I tried replacing all the letters one by one, but 26 + 26 replace-all calls make the code lengthy, and replacing "." also removes the "." from the numbers.
Thanks,
Praveen

Don't forget that R already has built-in regex functions:
input <- c('This. Is a good 13 part. of 135.67 code', 'how to strip 66.8 in the content 6879',
'get the numbers 3475.5 from. The data. 879 in this 369426')
m <- gregexpr('\\b\\d+(?:\\.\\d+)?\\b', input)
(output <- lapply(regmatches(input, m), as.numeric))
This yields
[[1]]
[1] 13.00 135.67
[[2]]
[1] 66.8 6879.0
[[3]]
[1] 3475.5 879.0 369426.0
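If the space-separated strings from the question are wanted rather than numeric vectors, the matches can be pasted back together instead of converted. A minimal sketch, assuming the same input vector as above; perl = TRUE is set here only so that the pattern is handled by PCRE, and the name output_strings is made up for illustration:

```r
input <- c('This. Is a good 13 part. of 135.67 code',
           'how to strip 66.8 in the content 6879',
           'get the numbers 3475.5 from. The data. 879 in this 369426')
# Locate every whole or decimal number in each string
m <- gregexpr('\\b\\d+(?:\\.\\d+)?\\b', input, perl = TRUE)
# Keep the matches as text and join them with single spaces
output_strings <- sapply(regmatches(input, m), paste, collapse = ' ')
output_strings
```

Keeping the matches as text also avoids the 13.00 / 6879.0 style padding that appears when the numeric vectors are printed.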

An option using strsplit to split the text into separate lines, and then gsub to remove any run of [:alpha:] characters followed by a ., or just a run of [:alpha:] characters (with any trailing whitespace).
text <- "1. This. Is a good 13 part. of 135.67 code
2. how to strip 66.8 in the content 6879
3. get the numbers 3475.5 from. The data. 879 in this 369426"
lines <- strsplit(text, split = "\n")[[1]]
gsub("[[:alpha:]]+\\.|[[:alpha:]]+\\s*","",lines)
#[1] "1. 13 135.67 "
#[2] "2. 66.8 6879"
#[3] "3. 3475.5 879 369426"

you can try
library(stringr)
lapply(str_extract_all(a, "[0-9.]+"), function(x) as.numeric(x)[!is.na(as.numeric(x))])
[[1]]
[1] 13.00 135.67
[[2]]
[1] 66.8 6879.0
[[3]]
[1] 3475.5 879.0 369426.0
The basic idea is from here, but we also include the . in the character class. The lapply converts the matches to numeric and drops the NAs (for example, a lone "." picked up from "This.").
The data:
a <- c("This. Is a good 13 part. of 135.67 code",
"how to strip 66.8 in the content 6879",
"get the numbers 3475.5 from. The data. 879 in this 369426")

Another method with gsub:
string = c('This. Is a good 13 part. of 135.67 code',
'how to strip 66.8 in the content 6879',
'get the numbers 3475.5 from. The data. 879 in this 369426')
trimws(gsub('[\\p{L}\\.\\s](?!\\d)+', '', string, perl = TRUE))
# [1] "13 135.67" "66.8 6879" "3475.5 879 369426"

A solution free of regex and external packages:
sapply(
strsplit(input, " "),
function(x) {
x <- suppressWarnings(as.numeric(x))
paste(x[!is.na(x)], collapse = " ")
}
)
[1] "13 135.67" "66.8 6879" "3475.5 879 369426"

Related

How to manipulate digits in a character string in R?

I feel like I have a super easy question but for the life of me I can't find it when googling or searching here (or I don't know the correct terms to find a solution) so here goes.
I have a large amount of text in R in which I want to identify all numbers/digits, and add a specific number to them, for example 5.
So just as a small example, if this were my text:
text <- c("Hi. It is 6am. I want to leave at 7am")
I want the output to be:
> text
[1] "Hi. It is 11am. I want to leave at 12am"
But also I need the addition for each individual digit, so if this is the text:
text <- c("Hi. It is 2017. I am 35 years old.")
...I want the output to be:
> text
[1] "Hi. It is 75612. I am 810 years old."
I have tried 'grabbing' the numbers from the string and adding 5, but I don't know how to then get them back into the original string so I can get the full text back.
How should I go about this? Thanks in advance!
Here is how I would do the time. I would search for a number that is followed by am or pm and then sub in a math expression to be evaluated by gsubfn. This is pretty flexible, but would require whole hours in its current implementation. I added an am and pm if you wanted to swap those, but I didn't try to code in detecting if the number changes from am to pm. Also note that I didn't code in rolling from 12 to 1. If you add numbers over 12, you will get a number bigger than 12.
text1 <- c("Hi. It is 6am. I want to leave at 7am")
text2 <- c("It is 9am. I want to leave at 10am, but the cab comes at 11am. Can I push my flight to 12am?")
change_time <- function(text, hours, sign, am_pm){
string_change <- glue::glue("`(\\1{sign}{hours})`{am_pm}")
gsub("(\\d+)(?=am|pm)(am|pm)", string_change, text, perl = TRUE)|>
gsubfn::fn$c()
}
change_time(text = text1, hours = 5, sign = "+", am_pm = "am")
#> [1] "Hi. It is 11am. I want to leave at 12am"
change_time(text = text2, hours = 3, sign = "-", am_pm = "pm")
#> [1] "It is 6pm. I want to leave at 7pm, but the cab comes at 8pm. Can I push my flight to 9pm?"
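The answer above notes that it does not roll over from 12 to 1. One way to add that, sketched in base R rather than with gsubfn, is to write modified matches back with regmatches<- and wrap the hour with modular arithmetic. The helper name shift_hours is hypothetical, and this sketch still does not flip am to pm:

```r
# Hypothetical helper: shift whole hours that precede "am"/"pm", wrapping 12 -> 1
shift_hours <- function(x, n) {
  # Match runs of digits that are directly followed by am or pm
  m <- gregexpr("\\d+(?=am|pm)", x, perl = TRUE)
  regmatches(x, m) <- lapply(regmatches(x, m), function(h) {
    # Map the hour onto 1..12 after adding n
    as.character(((as.integer(h) + n - 1L) %% 12L) + 1L)
  })
  x
}
shift_hours("It is 9am. I want to leave at 10am", 5)
```

With n = 5 the hours 9 and 10 become 2 and 3, while 6 and 7 still become 11 and 12 as in the original example.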
For the per-digit addition, match single digits instead:
text1 <- c("Hi. It is 2017. I am 35 years old.")
text2 <- c("Hi. It is 6am. I want to leave at 7am")
change_number <- function(text, change, sign){
string_change <- glue::glue("`(\\1{sign}{change})`")
gsub("(\\d)", string_change, text, perl = TRUE) |>
gsubfn::fn$c()
}
change_number(text = text1, change = 5, sign = "+")
#>[1] "Hi. It is 75612. I am 810 years old."
change_number(text = text2, change = 5, sign = "+")
#>[1] "Hi. It is 11am. I want to leave at 12am"
This works perfectly. Many thanks to @AndS. I tweaked (or rather, simplified) your code to fit my needs better. I was determined to figure out the other text myself haha, so thanks for showing me how!
Something quick and dirty with base R:
add_n = \(x, n, by_digit = FALSE) {
if (by_digit) ptrn = "[0-9]" else ptrn = "[0-9]+"
tmp = gregexpr(ptrn, x)
raw = regmatches(x, tmp)
raw_plusn = lapply(raw, \(x) as.integer(x) + n)
for (i in seq_along(x)) regmatches(x[i], tmp[i]) = raw_plusn[i]
x
}
text = c(
"Hi. It is 6am. I want to leave at 7am",
"wow it's 505 dollars and 19 cents",
"Hi. It is 2017. I am 35 years old."
)
> add_n(text, 5)
# [1] "Hi. It is 11am. I want to leave at 12am"
# [2] "wow it's 510 dollars and 24 cents"
# [3] "Hi. It is 2022. I am 40 years old."
> add_n(text, -2)
# [1] "Hi. It is 4am. I want to leave at 5am" "wow it's 503 dollars and 17 cents"
# [3] "Hi. It is 2015. I am 33 years old."
> add_n(text, 5, by_digit = TRUE)
# [1] "Hi. It is 11am. I want to leave at 12am"
# [2] "wow it's 10510 dollars and 614 cents"
# [3] "Hi. It is 75612. I am 810 years old."
Here's a tidyverse solution:
library(dplyr)
library(tidyr)
library(stringr)
data.frame(text) %>%
# separate `text` into individual characters:
separate_rows(text, sep = "(?<!^)(?!$)") %>%
# add `5` to any digit:
mutate(
# if you detect a digit...
text = ifelse(str_detect(text, "\\d"),
# ... extract it, convert it to numeric, add `5`:
as.numeric(str_extract(text, "\\d")) + 5,
# ... else leave `text` as is:
text)
) %>%
# string the characters back together:
summarise(text = str_c(text, collapse = ""))
# A tibble: 1 × 1
text
<chr>
1 Hi. It is 11am. I want to leave at 12am
Data 1:
text <- c("Hi. It is 6am. I want to leave at 7am")
Note that the same code works for the second text as well without any change:
# A tibble: 1 × 1
text
<chr>
1 Hi. It is 75612. I am 810 years old.
Data 2:
text <- c("Hi. It is 2017. I am 35 years old.")

need to improve R regex phone number extraction from 26 to 28 different formatting

I am trying to extract from a random text phone numbers in 28 different formats in R. I have read previous posts here on R regex, such as \ being replaced with \\, and running the regex operator with perl=TRUE, so I have solved most of my issues. I need help with some debugging.
I use the following regular expression in R:
medium_regex2 = "(?:\\+?(\\d{1})?-?\\(?(\\d{3})\\)?[\\s-\\.]?)?(\\d{3})[\\s-\\.]?(\\d{4})[\\s-\\.]?"
and run the following code:
medium_phone_extract2 <- function(string){
unlist(regmatches(string,gregexpr(medium_regex2,string, perl=TRUE)))
}
medium_phone_extract2(phonenumbers)
The expression spots 26 out of 28 numbers correctly. The 2 missing number formats are:
"+90-555-4443322"
"+1.517.3002010"
How would you improve the regex so that these 2 formats are also correctly extracted?
edit: the full 28 formats I am trying to extract are:
phonenumbers <- c("05554443322",
"0555 444 3322",
"0555 444 33 22",
"5554443322",
"555 444 3322",
"555 444 33 22",
"905554443322",
"+905554443322",
"+90-555-4443322",
"+1-517-3002010",
"+1-(800)-3002010",
"+1-517-3002010",
"+1.517.3002010",
"000-000-0000",
"000 000 0000",
"000.000.0000",
"(000)000-0000",
"(000)000 0000",
"(000)000.0000",
"(000) 000-0000",
"(000) 000 0000",
"(000) 000.0000",
"000-0000",
"000 0000",
"000.0000",
"0000000",
"0000000000",
"(000)0000000")
howmany_numbers <- length(phonenumbers)
#28
And the 26 I am able to extract with the regex are:
[1] "05554443322" "0555 444 3322" "5554443322" "555 444 3322" "90555444332"
[6] "+90555444332" "0-555-4443322" "+1-517-3002010" "+1-(800)-3002010" "+1-517-3002010"
[11] "517.3002010" "000-000-0000" "000 000 0000" "000.000.0000" "(000)000-0000"
[16] "(000)000 0000" "(000)000.0000" "(000) 000-0000" "(000) 000 0000" "(000) 000.0000"
[21] "000-0000" "000 0000" "000.0000" "0000000" "0000000000"
[26] "(000)0000000"
You may use the following regex:
(?:\+?\d{0,3}-?\(?[\s.-]?\d{3}\)?[\s.-]?)?\d{3}[\s.-]?\d{2}\s?\d{2}
In case you want to only match it when not inside other digits, you may add (?<!\d) / (?!\d) lookarounds that prevent a match if there is a digit on the left or right:
(?<!\d)(?:\+?\d{0,3}-?\(?[\s.-]?\d{3}\)?[\s.-]?)?\d{3}[\s.-]?\d{2}\s?\d{2}(?!\d)
To ensure the usual word boundary on both sides use
(?<!\w)(?:\+?\d{0,3}-?\(?[\s.-]?\d{3}\)?[\s.-]?)?\d{3}[\s.-]?\d{2}\s?\d{2}\b
In R, do not forget to double all backslashes in the string literal:
regex <- "(?<!\\w)(?:\\+?\\d{0,3}-?\\(?[\\s.-]?\\d{3}\\)?[\\s.-]?)?\\d{3}[\\s.-]?\\d{2}\\s?\\d{2}\\b"
Main points:
((\\d{1})?|(\\d{2})?|(\\d{3}))? is better written as \d{0,3}, zero to three digits pattern (alternation makes matching process more resource consuming compared to a more linear, straight-forward pattern)
[\\s.-] is preferred to [\\s\\-\\.] since a hyphen is better placed at the end of the character class (no need to escape it there) and note that . always matches a literal . inside a character class
(\\d{4}|\\d{2}\\s\\d{2}) can and should be re-written as \\d{2}\\s?\\d{2} matching 2 digits followed with an optional whitespace and then 2 digits.
Not sure you really want to match a whitespace, hyphen or dot at the end of the pattern, so I suggest removing [\\s-\\.]? at the end.
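A quick way to check the suggested pattern is to run it over some of the formats from the question and compare each extraction with its input. A minimal sketch, assuming the word-boundary variant of the regex given above; the names samples and extracted are made up for illustration:

```r
regex <- "(?<!\\w)(?:\\+?\\d{0,3}-?\\(?[\\s.-]?\\d{3}\\)?[\\s.-]?)?\\d{3}[\\s.-]?\\d{2}\\s?\\d{2}\\b"
# A handful of the formats from the question, including the two that were failing
samples <- c("+90-555-4443322", "+1.517.3002010", "0555 444 33 22",
             "+1-(800)-3002010", "(000) 000-0000", "000-0000", "905554443322")
# regexpr() finds the first match per element; regmatches() extracts it
extracted <- regmatches(samples, regexpr(regex, samples, perl = TRUE))
# TRUE where the whole input string was recovered
extracted == samples
```

The same comparison can be run on the full phonenumbers vector to see which formats, if any, are still truncated.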

Regex: Capturing Numbers at Beginning and Negating Numbers After Characters

I need to capture the 3.93, 4.63999..., and -5.35. I've tried all kinds of variations, but have been unable to grab the correct set of numbers.
Copay: 20.30
3.93
TAB 8.6MG Qty:60
4.6399999999999997
-5.35
2,000UNIT TAB Qty:30
AMOUNT
Qty:180
CAP 4MG
x = c("Copay: 20.30", "3.93", "TAB 8.6MG Qty:60", "4.6399999999999997", "-5.35", "2,000UNIT TAB Qty:30", "AMOUNT", "Qty:180", "CAP 4MG");
grep("^[\\-]?\\d+[\\.]?\\d+$", x);
Output (see ?grep):
[1] 2 4 5
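grep returns positions by default. To get the values themselves, the same call can be made with value = TRUE and the result converted to numeric. A small sketch using the pattern from this answer; the names hits and nums are made up for illustration:

```r
x <- c("Copay: 20.30", "3.93", "TAB 8.6MG Qty:60", "4.6399999999999997",
       "-5.35", "2,000UNIT TAB Qty:30", "AMOUNT", "Qty:180", "CAP 4MG")
# value = TRUE returns the matching elements instead of their positions
hits <- grep("^[\\-]?\\d+[\\.]?\\d+$", x, value = TRUE)
# Convert the matched strings to numbers
nums <- as.numeric(hits)
nums
```

Note that the pattern requires at least two digits in total, so a lone single-digit value such as "5" would not be matched.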
If leading/trailing spaces are allowed change the regex with
"^\\s*[\\-]?\\d+[\\.]?\\d+\\s*$"
Try this
S <- c("Copay: 20.30", "3.93", "TAB 8.6MG Qty:60", "4.6399999999999997", "-5.35", "2,000UNIT TAB Qty:30", "AMOUNT", "Qty:180", "CAP 4MG")
library(stringr)
ans <- str_extract_all(S, "-?[[:digit:]]*(\\.|,)?[[:digit:]]+", simplify=TRUE)
clean <- ans[ans!=""]
Output
[1] "20.30" "3.93" "8.6"
[4] "4.6399999999999997" "-5.35" "2,000"
[7] "180" "4" "60"
[10] "30"

Issue with strsplit not storing searched field

I am running a regex query using R
df<- c("955 - 959 Fake Street","95-99 Fake Street","4-9 M4 Ln","95 - 99 Fake Street","99 Fake Street")
955 - 959 Fake Street
95-99 Fake Street
4-9 M4 Ln
95 - 99 Fake Street
99 Fake Street
I am attempting to sort these addresses into two columns
I expected:
strsplit(df, "\\d+(\\s*-\\s*\\d+)?", perl=T)
would split up the numbers on the left and the rest of the address on the right.
The result I am getting is:
[1] "" " Fake Street"
[1] "" " Fake Street"
[1] "" " M" " Ln"
[1] "" " Fake Street"
[1] "" " Fake Street"
The strsplit function appears to delete the field used to split the string. Is there any way I can preserve it?
Thanks
You are almost there, just append \\K\\s* to your regex and prepend with the ^, start of string anchor:
df<- c("955 - 959 Fake Street","95-99 Fake Street","4-9 M4 Ln","95 - 99 Fake Street","99 Fake Street")
strsplit(df, "^\\d+(\\s*-\\s*\\d+)?\\K\\s*", perl=T)
The \K is a match reset operator that discards the text matched so far. So after matching 1+ digits, optionally followed by a - enclosed in 0+ whitespaces and 1+ digits, at the start of the string, this whole text is dropped from the match. Only the 0+ whitespaces that follow end up in the match value, and they are what the string is split on.
See the R demo outputting:
[[1]]
[1] "955 - 959" "Fake Street"
[[2]]
[1] "95-99" "Fake Street"
[[3]]
[1] "4-9" "M4 Ln"
[[4]]
[1] "95 - 99" "Fake Street"
[[5]]
[1] "99" "Fake Street"
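To see what \K leaves in the match, the same pattern can be passed to sub with a visible replacement character. A small illustration, assuming a shortened version of the df vector; the name marked is made up here:

```r
df <- c("955 - 959 Fake Street", "4-9 M4 Ln", "99 Fake Street")
# Everything before \K is required for a match but excluded from it,
# so only the whitespace after the leading number (or range) is replaced
marked <- sub("^\\d+(\\s*-\\s*\\d+)?\\K\\s*", "|", df, perl = TRUE)
marked
```

The numbers survive and only the separating whitespace turns into "|", which is why strsplit keeps them in the first field.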
You could use lookbehinds and lookaheads to split at the space between a number and the character:
strsplit(df, "(?<=\\d)\\s(?=[[:alpha:]])", perl = TRUE)
# [[1]]
# [1] "955 - 959" "Fake Street"
#
# [[2]]
# [1] "95-99" "Fake Street"
#
# [[3]]
# [1] "4-9" "M4" "Ln"
#
# [[4]]
# [1] "95 - 99" "Fake Street"
#
# [[5]]
# [1] "99" "Fake Street"
This, however, also splits at the space between "M4" and "Ln". If your addresses are always of the format "number (possible range) followed by rest of the address" you could extract the two parts separately (as @d.b suggested):
splitDf <- data.frame(
numberPart = sub("(\\d+(\\s*-\\s*\\d+)?)(.*)", "\\1", df),
rest = trimws(sub("(\\d+(\\s*-\\s*\\d+)?)(.*)", "\\3", df)))
splitDf
# numberPart rest
# 1 955 - 959 Fake Street
# 2 95-99 Fake Street
# 3 4-9 M4 Ln
# 4 95 - 99 Fake Street
# 5 99 Fake Street

4 lines records into one line? How to combine in a single line?

I would like to ask you for help. I have data looking like this (one record is spread over four lines):
9540 16
0.1586E-03-0.3713E-04 0.1559E-03-0.4054E-04 0.2610E-02 0.2589E-03 0.4509E-03
0.7271E-03 0.2286E-03 0.8627E-03 0.1511E-02 0.1208E-03 0.1169 0.5486E-01
0.1419E-01 0.1715
9546 16
0.1546E-03-0.2273E-04 0.1504E-03-0.1516E-04 0.2517E-02 0.1968E-03 0.5512E-03
0.7556E-03 0.2998E-03 0.1024E-02 0.1495E-02 0.6889E-03 0.1134 0.5461E-01
0.1418E-01 0.1708
I would like to read this into R and look like this (in one line):
9540 16 0.1586E-03 -0.3713E-04 0.1559E-03 -0.4054E-04 0.2610E-02 0.2589E-03 0.4509E-03 0.7271E-03 0.2286E-03 0.8627E-03 0.1511E-02 0.1208E-03 0.1169 0.5486E-01 0.1419E-01 0.1715
9546 16 0.1546E-03 -0.2273E-04 0.1504E-03 -0.1516E-04 0.2517E-02 0.1968E-03 0.5512E-03 0.7556E-03 0.2998E-03 0.1024E-02 0.1495E-02 0.6889E-03 0.1134 0.5461E-01 0.1418E-01 0.1708
We could read the file using readLines, create a grouping variable with gl so that every four lines share a group, and paste the 'lines' of each group together with tapply. If needed, we can remove the leading and trailing spaces with str_trim from library(stringr).
lines <- readLines('fourlines.txt')
lines2 <- tapply(lines, as.numeric(gl(length(lines), 4, length(lines))),
FUN= paste, collapse=' ')
library(stringr)
lines2 <- str_trim(unname(lines2))
Output of 'lines2'
lines2
#[1] "9540 16 0.1586E-03-0.3713E-04 0.1559E-03-0.4054E-04 0.2610E-02 0.2589E-03 0.4509E-03 0.7271E-03 0.2286E-03 0.8627E-03 0.1511E-02 0.1208E-03 0.1169 0.5486E-01 0.1419E-01 0.1715"
#[2] "9546 16 0.1546E-03-0.2273E-04 0.1504E-03-0.1516E-04 0.2517E-02 0.1968E-03 0.5512E-03 0.7556E-03 0.2998E-03 0.1024E-02 0.1495E-02 0.6889E-03 0.1134 0.5461E-01 0.1418E-01 0.1708"
If we want to remove the extra spaces
lines2 <- gsub('\\s+', ' ', lines2)
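The desired output in the question also separates values that are run together, such as 0.1586E-03-0.3713E-04, which pasting alone leaves glued. A possible extra step, sketched under the assumption that a minus sign directly after a digit always starts a new value (an exponent's minus follows "E" instead); the inline lines vector is a stand-in for readLines('fourlines.txt'), and the names record and values are made up for illustration:

```r
# Stand-in for one four-line record read with readLines()
lines <- c(
  "9540 16",
  "0.1586E-03-0.3713E-04 0.1559E-03-0.4054E-04 0.2610E-02 0.2589E-03 0.4509E-03",
  "0.7271E-03 0.2286E-03 0.8627E-03 0.1511E-02 0.1208E-03 0.1169 0.5486E-01",
  "0.1419E-01 0.1715"
)
record <- paste(lines, collapse = " ")
# Insert a space before any minus sign that directly follows a digit
record <- gsub("(\\d)-", "\\1 -", record)
# Parse the whitespace-separated fields as numbers
values <- scan(text = record, quiet = TRUE)
length(values)
```

Once the values are separated, scan (or read.table on the pasted lines) can parse each record as plain numbers.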

Resources