I would like to extract the phone numbers from a file. I know the numbers come in different forms, but I don't know how to write a pattern that covers each form. I am using grep and gregexpr in R. The numbers are written in these forms:
xxx-xxx-xxxx ,
(xxx)xxx-xxxx,
xxx xxx xxxx,
xxx.xxx.xxxx
Try this:
phones <- c("foo 111-111-1111 bar" , "(111)111-1111 quux", "who knows 111 111 1111", "111.111.1111 I do", "111)111-1111 should not work", "1111111111 ditto", "a 111-111-1111 b (222)222-2222 c")
re <- gregexpr("(\\(\\d{3}\\)|\\d{3}[-. ])\\d{3}[-. ]\\d{4}", phones)
regmatches(phones, re)
# [[1]]
# [1] "111-111-1111"
# [[2]]
# [1] "(111)111-1111"
# [[3]]
# [1] "111 111 1111"
# [[4]]
# [1] "111.111.1111"
# [[5]]
# character(0)
# [[6]]
# character(0)
# [[7]]
# [1] "111-111-1111" "(222)222-2222"
In the data, I provide a few examples with other text on both sides, on either side, and on neither side of the number, as well as two examples that should not match. (That is, a starter "test set": you want to make sure that you both match the good examples and do not match the bad ones.) The last example checks that multiple numbers in one string/sentence are all matched.
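The test-set idea can be made explicit by pairing each input with the number of matches it should produce. This is only a sketch: the expected counts and the names pattern, expected, and found are my own additions, not part of the original answer.

```r
phones <- c("foo 111-111-1111 bar", "(111)111-1111 quux",
            "who knows 111 111 1111", "111.111.1111 I do",
            "111)111-1111 should not work", "1111111111 ditto",
            "a 111-111-1111 b (222)222-2222 c")
pattern <- "(\\(\\d{3}\\)|\\d{3}[-. ])\\d{3}[-. ]\\d{4}"

# Assumed expectation: how many numbers each string should yield
expected <- c(1L, 1L, 1L, 1L, 0L, 0L, 2L)

# Count the matches actually found in each string
found <- lengths(regmatches(phones, gregexpr(pattern, phones)))

# Stops with an error if the pattern starts matching the bad examples
# or stops matching the good ones
stopifnot(identical(found, expected))
```

Rerunning a check like this after every change to the pattern makes regressions visible immediately.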
gregexpr and regmatches are useful for finding and then extracting or replacing regex matches within one or more strings. For a "replace" example, one could do:
regmatches(phones, re) <- "GONE!"
phones
# [1] "foo GONE! bar" "GONE! quux"
# [3] "who knows GONE!" "GONE! I do"
# [5] "111)111-1111 should not work" "1111111111 ditto"
# [7] "a GONE! b GONE! c"
Obviously a contrived replacement, but certainly usable. Note, though, that the assignment form regmatches(phones, re) <- value operates by side effect, meaning that it modifies the phones vector in place instead of returning the new value. It's possible to call the replacement function directly so that phones is left untouched and the result is returned, but it is a little less intuitive:
phones # I reset it to the original value
# [1] "foo 111-111-1111 bar" "(111)111-1111 quux"
# [3] "who knows 111 111 1111" "111.111.1111 I do"
# [5] "111)111-1111 should not work" "1111111111 ditto"
# [7] "a 111-111-1111 b (222)222-2222 c"
`regmatches<-`(phones, re, value = "GONE!")
# [1] "foo GONE! bar" "GONE! quux"
# [3] "who knows GONE!" "GONE! I do"
# [5] "111)111-1111 should not work" "1111111111 ditto"
# [7] "a GONE! b GONE! c"
phones
# [1] "foo 111-111-1111 bar" "(111)111-1111 quux"
# [3] "who knows 111 111 1111" "111.111.1111 I do"
# [5] "111)111-1111 should not work" "1111111111 ditto"
# [7] "a 111-111-1111 b (222)222-2222 c"
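Another way to avoid the side effect, sketched here as an assumption rather than as part of the original answer, is to do the assignment inside a small function. The function name redact_phones and its replacement argument are hypothetical; the assignment then only touches the function's local copy.

```r
# Hypothetical helper: returns a redacted copy and leaves its input alone
redact_phones <- function(x, replacement = "GONE!") {
  pattern <- "(\\(\\d{3}\\)|\\d{3}[-. ])\\d{3}[-. ]\\d{4}"
  m <- gregexpr(pattern, x)
  # This assignment changes only the local copy of x
  regmatches(x, m) <- replacement
  x
}

phones <- c("foo 111-111-1111 bar", "1111111111 ditto")
redacted <- redact_phones(phones)
redacted  # first element redacted, second left as-is (no match)
phones    # still the original strings
```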
Edit: scope-creep.
out <- unlist(Filter(length, regmatches(phones, re)))
out
# [1] "111-111-1111" "(111)111-1111" "111 111 1111" "111.111.1111" "111-111-1111"
# [6] "(222)222-2222"
gsub("[^0-9]", "", out)
# [1] "1111111111" "1111111111" "1111111111" "1111111111" "1111111111" "2222222222"
out <- gsub("[^0-9]", "", out)
sprintf("(%s)%s-%s", substr(out, 1, 3), substr(out, 4, 6), substr(out, 7, 10))
# [1] "(111)111-1111" "(111)111-1111" "(111)111-1111" "(111)111-1111" "(111)111-1111"
# [6] "(222)222-2222"
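If you need to know which input string each number came from, the same steps can be applied per element instead of after unlist. A minimal sketch, assuming a hypothetical helper name normalize_phones:

```r
# Hypothetical helper: one character vector of normalized numbers per input
normalize_phones <- function(x) {
  pattern <- "(\\(\\d{3}\\)|\\d{3}[-. ])\\d{3}[-. ]\\d{4}"
  lapply(regmatches(x, gregexpr(pattern, x)), function(found) {
    digits <- gsub("[^0-9]", "", found)  # keep only the ten digits
    sprintf("(%s)%s-%s",
            substr(digits, 1, 3), substr(digits, 4, 6), substr(digits, 7, 10))
  })
}

res <- normalize_phones(c("a 111-111-1111 b (222)222-2222 c",
                          "1111111111 ditto"))
res[[1]]  # both numbers from the first string, in (xxx)xxx-xxxx form
res[[2]]  # character(0): nothing matched in the second string
```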
Background:
I am scraping this website to obtain a list of all people named under a respective section of the editorial board.
In total, there are 6 sections, each one beginning with a <b>...</b> part. (It actually should be 5, but the code is a bit messy.)
My goal:
I want to get a list of all people per section (a list of 6 elements called people).
My approach:
I try to fetch all the text, or text(), after each respective <b>...</b>-tag.
However, with the following R-code and XPath, I fail to get the correct list:
journal_url <- "https://aepi.biomedcentral.com/about/editorial-board"
webpage <- xml2::read_html(url(journal_url))
# get a list of 6 sections
all_sections <- rvest::html_nodes(webpage, css = '#editorialboard p')
# the following does not work properly
people <- lapply(all_sections, function(x) rvest::html_nodes(x, xpath = '//b/following-sibling::text()'))
The mistaken outcome:
Instead of giving me a list of 6 elements comprising the people per section, it gives me a list of 6 elements comprising all people in every element.
The expected outcome:
The expected output would start with:
people
[[1]]
[1] Shichuo Li
[[2]]
[1] Zhen Hong
[2] Hermann Stefan
[3] Dong Zhou
[[3]]
[1] Jie Mu
# etc etc
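An XPath that starts with a double forward slash (//) selects matching nodes from the whole document, even when the object you pass in is a single node. Start the path with the current-node selector (.) so that it is evaluated relative to each section: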
people <- lapply(all_sections, function(x) {
rvest::html_nodes(x, xpath = './b/following-sibling::text()')
})
Output:
[[1]]
{xml_nodeset (1)}
[1] Shichuo Li,
[[2]]
{xml_nodeset (3)}
[1] Zhen Hong,
[2] Hermann Stefan,
[3] Dong Zhou,
[[3]]
{xml_nodeset (0)}
[[4]]
{xml_nodeset (1)}
[1] Jie Mu,
[[5]]
{xml_nodeset (2)}
[1] Bing Liang,
[2] Weijia Jiang,
[[6]]
{xml_nodeset (35)}
[1] Aye Mye Min Aye,
[2] Sándor Beniczky,
[3] Ingmar Blümcke,
[4] Martin J. Brodie,
[5] Eric Chan,
[6] Yanchun Deng,
[7] Ding Ding,
[8] Yuwu Jiang,
[9] Hennric Jokeit,
[10] Heung Dong Kim,
[11] Patrick Kwan,
[12] Byung In Lee,
[13] Weiping Liao,
[14] Xiaoyan Liu,
[15] Guoming Luan,
[16] Imad M. Najm,
[17] Terence O'Brien,
[18] Jiong Qin,
[19] Markus Reuber,
[20] Ley J.W. Sander,
...
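The difference between the absolute and the relative path can be reproduced without the live website. The following is a sketch on a made-up HTML fragment: the section titles and the "Person A/B/C" entries are placeholders of mine, not data from the journal page, and it assumes the xml2 package is installed.

```r
library(xml2)

# Placeholder markup that mimics the "bold title followed by names" layout
html <- paste0(
  "<div id='editorialboard'>",
  "<p><b>Section one</b><br/>Person A, Somewhere</p>",
  "<p><b>Section two</b><br/>Person B, Somewhere<br/>Person C, Elsewhere</p>",
  "</div>")
doc <- read_html(html)
sections <- xml_find_all(doc, "//div[@id='editorialboard']/p")

# Absolute path: searched from the document root for every section
absolute <- lapply(sections, function(p)
  xml_find_all(p, "//b/following-sibling::text()"))
# Relative path: searched only inside the current <p>
relative <- lapply(sections, function(p)
  xml_find_all(p, "./b/following-sibling::text()"))

n_absolute <- vapply(absolute, length, integer(1))
n_relative <- vapply(relative, length, integer(1))
n_absolute  # every section reports all three text nodes
n_relative  # one text node for the first section, two for the second
```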
I have a dataset as follows,
[1] "21/12/16, 14:25:10: abcd
[2] "21/12/16, 14:25:14: 1234
[3] "21/12/16, 14:25:22: XXX
[4] "21/12/16, 14:25:30: YYY
[5] "21/12/16, 14:25:47: ZZZ
Date variable has all the dates in the above dataset as,
> head(date)
[1] "21/12/16" "21/12/16" "21/12/16" "21/12/16" "21/12/16"
Time variable has all times from the dataset as,
> head(time)
[1] "14:25" "14:25" "14:25" "14:25" "14:25"
Now I want the dataset to be modified as,
[1] abcd
[2] 1234
[3] XXX
[4] YYY
[5] ZZZ
How can we do this? I tried gsub, but without success. Can someone help me out here?
You aren't completely precise as to the expected behavior, but for the dataset that you've supplied, splitting on ":" and taking the fourth element of the resulting vector will get the desired result. You should think about the use case and whether you can rely on that working in general, however. e.g. Will there always be exactly three colons before the string you want? Will the string you want never contain a colon? etc.
Also, I think you're missing a closing quote mark in your rows.
readLines(con = textConnection("21/12/16, 14:25:10: abcd
21/12/16, 14:25:14: 1234
21/12/16, 14:25:22: XXX
21/12/16, 14:25:30: YYY
21/12/16, 14:25:47: ZZZ")) -> text_file_lines
text_file_lines
## [1] "21/12/16, 14:25:10: abcd" "21/12/16, 14:25:14: 1234"
## [3] "21/12/16, 14:25:22: XXX" "21/12/16, 14:25:30: YYY"
## [5] "21/12/16, 14:25:47: ZZZ"
# built-in
# somewhat forgiving regex replace
sub("^[[:digit:]]+/[[:digit:]]+/[[:digit:]]+,[[:space:]]+[[:digit:]]+:[[:digit:]]+:[[:digit:]]+:[[:space:]]", "", text_file_lines)
## [1] "abcd" "1234" "XXX" "YYY" "ZZZ"
# external pkg
# this matches from last : onward and extracts the bits you want
stringi::stri_match_last_regex(text_file_lines, ": ([[:print:]]+)$")[,2]
## [1] "abcd" "1234" "XXX" "YYY" "ZZZ"
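The split-on-colon idea mentioned at the start of this answer could look like the sketch below. The variable names are mine, and the second variant is an assumption about what you might want if the text itself can contain a colon.

```r
lines <- c("21/12/16, 14:25:10: abcd", "21/12/16, 14:25:14: 1234",
           "21/12/16, 14:25:22: XXX")

# Split on ":" and keep the fourth piece; only safe when the text
# itself never contains a colon
parts <- strsplit(lines, ":", fixed = TRUE)
msgs <- trimws(vapply(parts, `[`, character(1), 4))
msgs

# Variant: strip everything up to and including the third colon,
# so colons inside the text survive
with_colon <- "21/12/16, 14:25:10: a:b"
kept <- sub("^[^:]*:[^:]*:[^:]*:[[:space:]]*", "", with_colon)
kept
```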
I am running a regex query using R
df<- c("955 - 959 Fake Street","95-99 Fake Street","4-9 M4 Ln","95 - 99 Fake Street","99 Fake Street")
955 - 959 Fake Street
95-99 Fake Street
4-9 M4 Ln
95 - 99 Fake Street
99 Fake Street
I am attempting to sort these addresses into two columns
I expected:
strsplit(df, "\\d+(\\s*-\\s*\\d+)?", perl=T)
would split up the numbers on the left and the rest of the address on the right.
The result I am getting is:
[1] "" " Fake Street"
[1] "" " Fake Street"
[1] "" " M" " Ln"
[1] "" " Fake Street"
[1] "" " Fake Street"
The strsplit function appears to delete the text that was used to split the string. Is there any way I can preserve it?
Thanks
You are almost there, just append \\K\\s* to your regex and prepend with the ^, start of string anchor:
df<- c("955 - 959 Fake Street","95-99 Fake Street","4-9 M4 Ln","95 - 99 Fake Street","99 Fake Street")
strsplit(df, "^\\d+(\\s*-\\s*\\d+)?\\K\\s*", perl=T)
The \K is a match reset operator that discards the text matched so far. So, after the pattern matches 1+ digits at the start of the string, optionally followed by a - surrounded by 0+ whitespace characters and another 1+ digits, this whole text is dropped from the match. Only the 0+ whitespace characters that follow end up in the match value, and they are what the string is split on.
See the R demo outputting:
[[1]]
[1] "955 - 959" "Fake Street"
[[2]]
[1] "95-99" "Fake Street"
[[3]]
[1] "4-9" "M4 Ln"
[[4]]
[1] "95 - 99" "Fake Street"
[[5]]
[1] "99" "Fake Street"
You could use lookbehinds and lookaheads to split at the space between a number and the character:
strsplit(df, "(?<=\\d)\\s(?=[[:alpha:]])", perl = TRUE)
# [[1]]
# [1] "955 - 959" "Fake Street"
#
# [[2]]
# [1] "95-99" "Fake Street"
#
# [[3]]
# [1] "4-9" "M4" "Ln"
#
# [[4]]
# [1] "95 - 99" "Fake Street"
#
# [[5]]
# [1] "99" "Fake Street"
This, however, also splits at the space between "M4" and "Ln". If your addresses are always of the format "number (possible range) followed by rest of the address", you could extract the two parts separately (as @d.b suggested):
splitDf <- data.frame(
numberPart = sub("(\\d+(\\s*-\\s*\\d+)?)(.*)", "\\1", df),
rest = trimws(sub("(\\d+(\\s*-\\s*\\d+)?)(.*)", "\\3", df)))
splitDf
#   numberPart        rest
# 1  955 - 959 Fake Street
# 2      95-99 Fake Street
# 3        4-9       M4 Ln
# 4    95 - 99 Fake Street
# 5         99 Fake Street
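The two sub() calls above run the same pattern twice. As an alternative sketch (not what the answers above used), regexec with regmatches can capture both parts in one pass. The pattern is my own variation with a non-capturing group and a start-of-string anchor, and it assumes every address matches; a non-matching string would yield an empty element and be dropped by rbind.

```r
df <- c("955 - 959 Fake Street", "95-99 Fake Street", "4-9 M4 Ln",
        "95 - 99 Fake Street", "99 Fake Street")

# Group 1: number or number range; group 2: rest of the address
pattern <- "^(\\d+(?:\\s*-\\s*\\d+)?)\\s*(.*)$"
m <- regmatches(df, regexec(pattern, df, perl = TRUE))

# Each element of m is c(full match, group 1, group 2)
parts <- do.call(rbind, m)
splitDf <- data.frame(numberPart = parts[, 2], rest = parts[, 3],
                      stringsAsFactors = FALSE)
splitDf
```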
The code is
library(rjson)
url <- 'file.json'
j <- fromJSON(file=url, method='C')
There are more than 1000 lines in file.json; however, the returned result is a list of 9.
The file.json is:
{"reviewerID": "A30TL5EWN6DFXT", "asin": "120401325X", "reviewerName": "christina", "helpful": [0, 0], "reviewText": "They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again", "overall": 4.0, "summary": "Looks Good", "unixReviewTime": 1400630400, "reviewTime": "05 21, 2014"}
{"reviewerID": "ASY55RVNIL0UD", "asin": "120401325X", "reviewerName": "emily l.", "helpful": [0, 0], "reviewText": "These stickers work like the review says they do. They stick on great and they stay on the phone. They are super stylish and I can share them with my sister. :)", "overall": 5.0, "summary": "Really great product.", "unixReviewTime": 1389657600, "reviewTime": "01 14, 2014"}
{"reviewerID": "A2TMXE2AFO7ONB", "asin": "120401325X", "reviewerName": "Erica", "helpful": [0, 0], "reviewText": "These are awesome and make my phone look so stylish! I have only used one so far and have had it on for almost a year! CAN YOU BELIEVE THAT! ONE YEAR!! Great quality!", "overall": 5.0, "summary": "LOVE LOVE LOVE", "unixReviewTime": 1403740800, "reviewTime": "06 26, 2014"}
what is the problem? thanks!
Your file does not contain valid JSON. You basically have three JSON hashes sitting right next to each other. The exact choice of whitespace that separates the values doesn't matter. It's equivalent to this:
{} {} {}
That's just as invalid as if it was three primitives sitting right next to each other:
3 "a" true
Speaking generally, when the input to a function is invalid, all bets are off. It is desirable to write functions to fail gracefully and emit clear error messages that describe the nature of the invalidity, and very often that is the case, but that doesn't always happen. In this case, what rjson::fromJSON() seems to be doing when it encounters this kind of invalid JSON is to parse and return the first value, and silently ignore everything else. That's unfortunate, but what can we do.
You should probably investigate how the file was generated, and seek to correct the problem at that end. But if you want to hack a solution, we can read in the lines of JSON into a character vector, paste-collapse them on comma, paste bracket delimiters around the resulting string, and then parse that string to get an array of hashes. This will only work if each adjacent hash occupies exactly one line in the file.
fromJSON(paste0('[',paste(collapse=',',readLines(url)),']'));
## [[1]]
## [[1]]$reviewerID
## [1] "A30TL5EWN6DFXT"
##
## [[1]]$asin
## [1] "120401325X"
##
## [[1]]$reviewerName
## [1] "christina"
##
## [[1]]$helpful
## [1] 0 0
##
## [[1]]$reviewText
## [1] "They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again"
##
## [[1]]$overall
## [1] 4
##
## [[1]]$summary
## [1] "Looks Good"
##
## [[1]]$unixReviewTime
## [1] 1400630400
##
## [[1]]$reviewTime
## [1] "05 21, 2014"
##
##
## [[2]]
## [[2]]$reviewerID
## [1] "ASY55RVNIL0UD"
##
## [[2]]$asin
## [1] "120401325X"
##
## [[2]]$reviewerName
## [1] "emily l."
##
## [[2]]$helpful
## [1] 0 0
##
## [[2]]$reviewText
## [1] "These stickers work like the review says they do. They stick on great and they stay on the phone. They are super stylish and I can share them with my sister. :)"
##
## [[2]]$overall
## [1] 5
##
## [[2]]$summary
## [1] "Really great product."
##
## [[2]]$unixReviewTime
## [1] 1389657600
##
## [[2]]$reviewTime
## [1] "01 14, 2014"
##
##
## [[3]]
## [[3]]$reviewerID
## [1] "A2TMXE2AFO7ONB"
##
## [[3]]$asin
## [1] "120401325X"
##
## [[3]]$reviewerName
## [1] "Erica"
##
## [[3]]$helpful
## [1] 0 0
##
## [[3]]$reviewText
## [1] "These are awesome and make my phone look so stylish! I have only used one so far and have had it on for almost a year! CAN YOU BELIEVE THAT! ONE YEAR!! Great quality!"
##
## [[3]]$overall
## [1] 5
##
## [[3]]$summary
## [1] "LOVE LOVE LOVE"
##
## [[3]]$unixReviewTime
## [1] 1403740800
##
## [[3]]$reviewTime
## [1] "06 26, 2014"
##
##
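Instead of pasting the lines into one big array string, each line can also be parsed on its own. This is a sketch under the same assumption as above (exactly one JSON object per line); the two short records are made-up stand-ins for the real file, and it assumes the rjson package is installed.

```r
library(rjson)

# Stand-in for readLines("file.json"): one JSON object per line
json_lines <- c('{"reviewerName": "reviewer one", "overall": 4.0}',
                '{"reviewerName": "reviewer two", "overall": 5.0}',
                '')

# Drop blank lines, then parse every remaining line separately
json_lines <- json_lines[nzchar(trimws(json_lines))]
records <- lapply(json_lines, function(line) fromJSON(json_str = line))

length(records)  # one list element per line of input
overall <- vapply(records, function(r) r$overall, numeric(1))
overall
```

A side benefit of parsing line by line is that a single malformed line raises an error for that line only, which makes it easier to locate.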
I am trying to split the output of the "ls -lrt" command from Linux, but strsplit treats only a single space as the delimiter. If there are two spaces in a row, the second one produces an empty value. So I think I need to collapse multiple spaces into one. Does anybody have any idea how to do this?
> a <- try(system("ls -lrt | grep -i .rds", intern = TRUE))
> a
[1] "-rw-r--r-- 1 u7x9573 sashare  2297 Jun  9 16:10 abcde.RDS"
[2] "-rw-r--r-- 1 u7x9573 sashare 86704 Jun  9 16:10 InputSource2.rds"
> str(a)
chr [1:6] "-rw-r--r-- 1 u7x9573 sashare  2297 Jun  9 16:10 abcde.RDS" ...
>
>c = strsplit(a," ")
>c
[[1]]
[1] "-rw-r--r--" "1" "u7x9573" "sashare" ""
[6] "2297" "Jun" "" "9" "16:10"
[11] "abcde.RDS"
[[2]]
[1] "-rw-r--r--" "1" "u7x9573" "sashare"
[5] "86704" "Jun" "" "9"
[9] "16:10" "InputSource2.rds"
In next step I needed just file name and I used following code which worked fine:
mtrl_name <- try(system("ls | grep -i .rds", intern = TRUE))
This returns that info in a data frame for the indicated files:
file.info(list.files(pattern = "[.]rds$", ignore.case = TRUE))
or if we knew the extensions were lower case:
file.info(Sys.glob("*.rds"))
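To get something close to the "ls -lrt" ordering without calling the shell, the file.info result can be sorted by modification time. A sketch only: the temporary directory and the files created in it are there purely to make the example runnable.

```r
# Set up a throwaway directory with two .rds files and one other file
tmp <- file.path(tempdir(), "rds_demo")
dir.create(tmp, showWarnings = FALSE)
saveRDS(1:3, file.path(tmp, "abcde.RDS"))
saveRDS(letters, file.path(tmp, "InputSource2.rds"))
writeLines("not an rds file", file.path(tmp, "notes.txt"))

# Case-insensitive match on the extension, full paths so file.info finds them
files <- list.files(tmp, pattern = "[.]rds$", ignore.case = TRUE,
                    full.names = TRUE)
info <- file.info(files)

# Oldest first, like ls -lrt; keep only size and modification time
info <- info[order(info$mtime), c("size", "mtime")]
file_names <- basename(rownames(info))
file_names
```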
strsplit takes a regular expression so we can use those to help out. For more info read ?regex
> x <- "Spaces   everywhere right?  "
> # Not what we want
> strsplit(x, " ")
[[1]]
[1] "Spaces" "" "" "everywhere" "right?"
[6] ""
> # Use " +" to tell it to split on 1 or more space
> strsplit(x, " +")
[[1]]
[1] "Spaces" "everywhere" "right?"
> # If we want to be more explicit and catch the possibility of tabs, new lines, ...
> strsplit(x, "[[:space:]]+")
[[1]]
[1] "Spaces" "everywhere" "right?"
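Applied to listing lines like the ones in the question, the same split might look like this. It is a sketch that assumes file names contain no whitespace; a name with a space in it would be split into several fields.

```r
a <- c("-rw-r--r-- 1 u7x9573 sashare  2297 Jun  9 16:10 abcde.RDS",
       "-rw-r--r-- 1 u7x9573 sashare 86704 Jun  9 16:10 InputSource2.rds")

# One or more whitespace characters count as a single delimiter
fields <- strsplit(a, "[[:space:]]+")

# The file name is the last field, the size is the fifth
file_names <- vapply(fields, function(f) f[length(f)], character(1))
sizes <- as.numeric(vapply(fields, `[`, character(1), 5))
file_names
sizes
```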