Regex: Capturing Numbers at Beginning and Negating Numbers After Characters - r

I need to capture the 3.93, 4.63999..., and -5.35. I've tried all kinds of variations, but have been unable to grab the correct set of numbers.
Copay: 20.30
3.93
TAB 8.6MG Qty:60
4.6399999999999997
-5.35
2,000UNIT TAB Qty:30
AMOUNT
Qty:180
CAP 4MG

x = c("Copay: 20.30", "3.93", "TAB 8.6MG Qty:60", "4.6399999999999997", "-5.35", "2,000UNIT TAB Qty:30", "AMOUNT", "Qty:180", "CAP 4MG");
grep("^[\\-]?\\d+[\\.]?\\d+$", x);
Output (see ?grep):
[1] 2 4 5
If leading/trailing spaces are allowed, change the regex to
"^\\s*[\\-]?\\d+[\\.]?\\d+\\s*$"

Try this:
S <- c("Copay: 20.30", "3.93", "TAB 8.6MG Qty:60", "4.6399999999999997", "-5.35", "2,000UNIT TAB Qty:30", "AMOUNT", "Qty:180", "CAP 4MG")
library(stringr)
ans <- str_extract_all(S, "-?[[:digit:]]*(\\.|,)?[[:digit:]]+", simplify=TRUE)
clean <- ans[ans!=""]
Output
[1] "20.30" "3.93" "8.6"
[4] "4.6399999999999997" "-5.35" "2,000"
[7] "180" "4" "60"
[10] "30"

Related

grep and regex for R phone numbers

I would like to get the phone numbers from a file. I know the numbers come in different forms, and I don't know how to code for each form using grep and regexpr in R. The numbers are written in these forms:
xxx-xxx-xxxx
(xxx)xxx-xxxx
xxx xxx xxxx
xxx.xxx.xxxx
Try this:
phones <- c("foo 111-111-1111 bar" , "(111)111-1111 quux", "who knows 111 111 1111", "111.111.1111 I do", "111)111-1111 should not work", "1111111111 ditto", "a 111-111-1111 b (222)222-2222 c")
re <- gregexpr("(\\(\\d{3}\\)|\\d{3}[-. ])\\d{3}[-. ]\\d{4}", phones)
regmatches(phones, re)
# [[1]]
# [1] "111-111-1111"
# [[2]]
# [1] "(111)111-1111"
# [[3]]
# [1] "111 111 1111"
# [[4]]
# [1] "111.111.1111"
# [[5]]
# character(0)
# [[6]]
# character(0)
# [[7]]
# [1] "111-111-1111" "(222)222-2222"
In the data, I provide a few examples with other text on both, either, and neither side, as well as two examples that should not match. (That is: a starter "test set", since you want to make sure you both match the good examples and don't match the bad ones.) The last one is meant to match multiple numbers in one string/sentence.
gregexpr and regmatches are useful for finding and extracting or replacing regex-substrings within 1+ strings. For a "replace" example, one could do:
regmatches(phones, re) <- "GONE!"
phones
# [1] "foo GONE! bar" "GONE! quux"
# [3] "who knows GONE!" "GONE! I do"
# [5] "111)111-1111 should not work" "1111111111 ditto"
# [7] "a GONE! b GONE! c"
Obviously a contrived replacement, but certainly usable. Note though that the replacement form of regmatches operates by side effect, meaning it modifies the phones vector in place instead of returning the value. It's possible to use it without the side effect, but it is a little less intuitive:
phones # I reset it to the original value
# [1] "foo 111-111-1111 bar" "(111)111-1111 quux"
# [3] "who knows 111 111 1111" "111.111.1111 I do"
# [5] "111)111-1111 should not work" "1111111111 ditto"
# [7] "a 111-111-1111 b (222)222-2222 c"
`regmatches<-`(phones, re, value = "GONE!")
# [1] "foo GONE! bar" "GONE! quux"
# [3] "who knows GONE!" "GONE! I do"
# [5] "111)111-1111 should not work" "1111111111 ditto"
# [7] "a GONE! b GONE! c"
phones
# [1] "foo 111-111-1111 bar" "(111)111-1111 quux"
# [3] "who knows 111 111 1111" "111.111.1111 I do"
# [5] "111)111-1111 should not work" "1111111111 ditto"
# [7] "a 111-111-1111 b (222)222-2222 c"
Edit: scope-creep.
out <- unlist(Filter(length, regmatches(phones, re)))
out
# [1] "111-111-1111" "(111)111-1111" "111 111 1111" "111.111.1111" "111-111-1111"
# [6] "(222)222-2222"
gsub("[^0-9]", "", out)
# [1] "1111111111" "1111111111" "1111111111" "1111111111" "1111111111" "2222222222"
out <- gsub("[^0-9]", "", out)
sprintf("(%s)%s-%s", substr(out, 1, 3), substr(out, 4, 6), substr(out, 7, 10))
# [1] "(111)111-1111" "(111)111-1111" "(111)111-1111" "(111)111-1111" "(111)111-1111"
# [6] "(222)222-2222"

Issue with strsplit not storing searched field

I am running a regex query using R
df<- c("955 - 959 Fake Street","95-99 Fake Street","4-9 M4 Ln","95 - 99 Fake Street","99 Fake Street")
955 - 959 Fake Street
95-99 Fake Street
4-9 M4 Ln
95 - 99 Fake Street
99 Fake Street
I am attempting to sort these addresses into two columns
I expected:
strsplit(df, "\\d+(\\s*-\\s*\\d+)?", perl=T)
would split up the numbers on the left and the rest of the address on the right.
The result I am getting is:
[1] "" " Fake Street"
[1] "" " Fake Street"
[1] "" " M" " Ln"
[1] "" " Fake Street"
[1] "" " Fake Street"
The strsplit function appears to delete the field used to split the string. Is there any way I can preserve it?
Thanks
You are almost there: just append \\K\\s* to your regex and prepend the ^ start-of-string anchor:
df<- c("955 - 959 Fake Street","95-99 Fake Street","4-9 M4 Ln","95 - 99 Fake Street","99 Fake Street")
strsplit(df, "^\\d+(\\s*-\\s*\\d+)?\\K\\s*", perl=T)
The \K is a match reset operator that discards the text matched so far: after matching 1+ digits at the start of the string, optionally followed by a - enclosed in 0+ whitespace characters and 1+ more digits, this whole text is dropped from the match. Only the trailing 0+ whitespace characters make it into the match value, and they are what gets split on.
See the R demo outputting:
[[1]]
[1] "955 - 959" "Fake Street"
[[2]]
[1] "95-99" "Fake Street"
[[3]]
[1] "4-9" "M4 Ln"
[[4]]
[1] "95 - 99" "Fake Street"
[[5]]
[1] "99" "Fake Street"
You could use lookbehinds and lookaheads to split at the space between a number and the character:
strsplit(df, "(?<=\\d)\\s(?=[[:alpha:]])", perl = TRUE)
# [[1]]
# [1] "955 - 959" "Fake Street"
#
# [[2]]
# [1] "95-99" "Fake Street"
#
# [[3]]
# [1] "4-9" "M4" "Ln"
#
# [[4]]
# [1] "95 - 99" "Fake Street"
#
# [[5]]
# [1] "99" "Fake Street"
This, however, also splits at the space between "M4" and "Ln". If your addresses are always of the format "number (possible range) followed by rest of the address", you could extract the two parts separately (as @d.b suggested):
splitDf <- data.frame(
  numberPart = sub("(\\d+(\\s*-\\s*\\d+)?)(.*)", "\\1", df),
  rest = trimws(sub("(\\d+(\\s*-\\s*\\d+)?)(.*)", "\\3", df)))
splitDf
# numberPart rest
# 1 955 - 959 Fake Street
# 2 95-99 Fake Street
# 3 4-9 M4 Ln
# 4 95 - 99 Fake Street
# 5 99 Fake Street
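Another base-R way to get the same two pieces (a sketch, not part of either answer, reusing df from above) is to match the leading number range once with regexpr() and take the remainder of the string:
m          <- regexpr("^\\d+(\\s*-\\s*\\d+)?", df)
numberPart <- regmatches(df, m)
rest       <- trimws(substring(df, attr(m, "match.length") + 1))
data.frame(numberPart, rest)   # same result as splitDf above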

Count number of times a word-wildcard appears in text (in R)

I have a vector of either regular words ("activated") or wildcard words ("activat*"). I want to:
1) Count the number of times each word appears in a given text (i.e., if "activated" appears in text, "activated" frequency would be 1).
2) Count the number of times each word wildcard appears in a text (i.e., if "activated" and "activation" appear in text, "activat*" frequency would be 2).
I'm able to achieve (1), but not (2). Can anyone please help? thanks.
library(tm)
library(qdap)
text <- "activation has begun. system activated"
text <- Corpus(VectorSource(text))
words <- c("activation", "activated", "activat*")
# Using termco to search for the words in the text
apply_as_df(text, termco, match.list=words)
# Result:
#    docs word.count activation activated activat*
# 1 doc 1          5  1(20.00%) 1(20.00%)        0
Is it possible that this has something to do with the versions? I ran nearly the same code (see below) and got what you expected:
> text <- "activation has begunm system activated"
> text <- Corpus(VectorSource(text))
> words <- c("activation", "activated", "activat")
> apply_as_df(text, termco, match.list=words)
   docs word.count activation activated   activat
1 doc 1          5  1(20.00%) 1(20.00%) 2(40.00%)
Below is the output when I run R.Version(). I am running this in RStudio Version 0.99.491 on Windows 10.
> R.Version()
$platform
[1] "x86_64-w64-mingw32"
$arch
[1] "x86_64"
$os
[1] "mingw32"
$system
[1] "x86_64, mingw32"
$status
[1] ""
$major
[1] "3"
$minor
[1] "2.3"
$year
[1] "2015"
$month
[1] "12"
$day
[1] "10"
$`svn rev`
[1] "69752"
$language
[1] "R"
$version.string
[1] "R version 3.2.3 (2015-12-10)"
$nickname
[1] "Wooden Christmas-Tree"
Hope this helps
Maybe consider a different approach using the stringi library?
text <- "activation has begun. system activated"
words <- c("activation", "activated", "activat*")
library(stringi)
counts <- unlist(lapply(words, function(word) {
  newWord <- stri_replace_all_fixed(word, "*", "\\p{L}")
  stri_count_regex(text, newWord)
}))
ratios <- counts/stri_count_words(text)
names(ratios) <- words
ratios
Result is:
activation activated activat*
       0.2       0.2      0.4
In the code I convert * into \p{L}, which matches any letter in a regex pattern. After that I count the regex occurrences that were found.
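If a wildcard should also be allowed to match several trailing letters (e.g. "activat*" matching "activations"), one variation on the same idea is to replace * with \p{L}* and add word boundaries (a sketch with my own choice of pattern, not from the original answer):
library(stringi)
text  <- "activation has begun. system activated"
words <- c("activation", "activated", "activat*")
patterns <- stri_c("\\b", stri_replace_all_fixed(words, "*", "\\p{L}*"), "\\b")
counts   <- stri_count_regex(text, patterns)   # one count per word/pattern
names(counts) <- words
counts
# activation activated activat*
#          1         1        2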

Count misspelled words in R

Row<-c(1,2,3,4,5)
Content<-c("I love cheese", "whre is the fish", "Final Countdow", "show me your s", "where is what")
Data<-cbind(Row, Content)
View(Data)
I wanted to create a function which tells me how many words are wrong per Row.
An intermediate step would be to have it look like this:
Row<-c(1,2,3,4,5)
Content<-c("I love cheese", "whre is the fs", "Final Countdow", "show me your s", "where is what")
MisspelledWords<-c(NA, "whre, fs", "Countdow","s",NA)
Data<-cbind(Row, Content,MisspelledWords)
I know that I have to use aspell, but I'm having problems running aspell on individual rows rather than directly on the whole file. Finally, I want to count how many words are wrong in every row. For this I would adapt the code from: Count the number of words in a string in R?
Inspired by this article, here's a try with which_misspelled and check_spelling in library(qdap).
library(qdap)
# which_misspelled
n_misspelled <- sapply(Content, function(x){
  length(which_misspelled(x, suggest = FALSE))
})
data.frame(Content, n_misspelled, row.names = NULL)
# Content n_misspelled
# 1 I love cheese 0
# 2 whre is the fs 2
# 3 Final Countdow 1
# 4 show me your s 0
# 5 where is what 0
# check_spelling
df <- check_spelling(Content, n.suggest = 0)
n_misspelled <- as.vector(table(factor(df$row, levels = Row)))
data.frame(Content, n_misspelled)
# Content n_misspelled
# 1 I love cheese 0
# 2 whre is the fs 2
# 3 Final Countdow 1
# 4 show me your s 0
# 5 where is what 0
To use aspell you have to use a file. It's pretty straightforward to use a function to dump a column to a file, run aspell and get the counts (but it will not be all that efficient if you have a large matrix/dataframe).
countMispelled <- function(words) {
  # do a bit of cleanup (if necessary): drop punctuation, collapse repeated spaces
  words <- gsub(" +", " ", gsub("[[:punct:]]", "", words))
  temp_file <- tempfile()
  writeLines(words, temp_file)
  res <- aspell(temp_file)
  unlink(temp_file)
  # return # of misspelled words
  length(res$Original)
}
Data <- cbind(Data, Errors=unlist(lapply(Data[,2], countMispelled)))
Data
## Row Content Errors
## [1,] "1" "I love cheese" "0"
## [2,] "2" "whre is thed fish" "2"
## [3,] "3" "Final Countdow" "1"
## [4,] "4" "show me your s" "0"
## [5,] "5" "where is what" "0"
You might be better off using a data frame vs a matrix (I just worked with what you provided) since you can keep Row and Errors numeric that way.
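As an alternative sketch (not part of the original answers, and assuming the hunspell package is installed), the same per-row counts can be computed without writing temporary files:
library(hunspell)
bad <- hunspell(Content)   # list of misspelled words, one element per row of Content
data.frame(Content, n_misspelled = lengths(bad))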

Split string in R

I am trying to split the output of the "ls -lrt" command from Linux, but it's taking only one space as the delimiter. If there are two spaces, the second space ends up as a value. So I think I need to collapse multiple spaces into one. Does anybody have any idea on this?
> a <- try(system("ls -lrt | grep -i .rds", intern = TRUE))
> a
[1] "-rw-r--r-- 1 u7x9573 sashare 2297 Jun 9 16:10 abcde.RDS"
[2] "-rw-r--r-- 1 u7x9573 sashare 86704 Jun 9 16:10 InputSource2.rds"
> str(a)
chr [1:6] "-rw-r--r-- 1 u7x9573 sashare 2297 Jun 9 16:10 abcde.RDS" ...
>
> c = strsplit(a, " ")
> c
[[1]]
[1] "-rw-r--r--" "1" "u7x9573" "sashare" ""
[6] "2297" "Jun" "" "9" "16:10"
[11] "abcde.RDS"
[[2]]
[1] "-rw-r--r--" "1" "u7x9573" "sashare"
[5] "86704" "Jun" "" "9"
[9] "16:10" "InputSource2.rds"
In the next step I needed just the file name, and I used the following code, which worked fine:
mtrl_name <- try(system("ls | grep -i .rds", intern = TRUE))
This returns that info in a data frame for the indicated files:
file.info(list.files(pattern = "[.]rds$", ignore.case = TRUE))
or if we knew the extensions were lower case:
file.info(Sys.glob("*.rds"))
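If only the file names are needed (as in the question's last step), list.files() alone already returns them without shelling out (a minor aside, not part of the original answer):
list.files(pattern = "[.]rds$", ignore.case = TRUE)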
strsplit takes a regular expression, so we can use one to help out. For more info read ?regex.
> x <- "Spaces everywhere right? "
> # Not what we want
> strsplit(x, " ")
[[1]]
[1] "Spaces" "" "" "everywhere" "right?"
[6] ""
> # Use " +" to tell it to split on 1 or more space
> strsplit(x, " +")
[[1]]
[1] "Spaces" "everywhere" "right?"
> # If we want to be more explicit and catch the possibility of tabs, new lines, ...
> strsplit(x, "[[:space:]]+")
[[1]]
[1] "Spaces" "everywhere" "right?"
