Multiple pattern matching in R

For multiple pattern matches (present in a character vector), I tried to apply grep(paste(States, collapse = "|"), Description). It works fine, but there is a problem.
Consider:
Description <- c("helloWorld Washington DC", "Hello Stackoverflow////Newyork RBC")
States <- c("DC", "RBC", "WA")
The pattern "WA" matches in the Description vector: my call matches "helloWorld **Wa**shington DC" because "Wa" is present inside "Washington". I need a suggestion for searching for the patterns not anywhere in the string, but only at the end of the string, as with "DC" and "RBC" here.
Thanks in advance.

I guess you want something like the following. I've taken the liberty to clean up your example a bit.
Description <- c("helloWorld Washington DC", "Hello Stackoverflow", "Newyork RBC")
States <- c("DC","RBC","WA")
search.string <- paste0(States, "$", collapse = "|") # Construct the reg. exprs.
grep(search.string, Description, value = TRUE)
#[1] "helloWorld Washington DC" "Newyork RBC"
Note, we use $ to signify end-of-string match.
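As a variation on the same idea (my own sketch, not part of the original answer), a word boundary \b in front of the alternation additionally guards against a state abbreviation ever matching inside a longer word:

```r
# Sketch: combine a word boundary with the end-of-string anchor so "WA"
# can never match the "Wa" inside "Washington".
Description <- c("helloWorld Washington DC", "Hello Stackoverflow", "Newyork RBC")
States <- c("DC", "RBC", "WA")

search.string <- paste0("\\b(", paste(States, collapse = "|"), ")$")
grep(search.string, Description, value = TRUE)
# [1] "helloWorld Washington DC" "Newyork RBC"
```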

Related

Using R - gsub using code in replacement - Replace full stop with comma after pattern

I would like to manually correct a record by using R. Last name and first name should always be separated by a comma.
names <- c("ADAM, Smith J.", "JOHNSON. Richard", "BROWN, Wilhelm K.", "DAVIS, Daniel")
Sometimes, however, a full stop has crept in as a separator, as in the case of "JOHNSON. Richard". I would like to do this automatically. Since the last name is always at the beginning of the line, I can simply access it via sub:
sub("^[[:upper:]]+\\.","^[[:upper:]]+\\,",names)
However, I cannot use a function for the replacement that specifically replaces the full stop with a comma.
Is there a way to insert a function into the replacement that does this for me?
Your sub is mostly correct, but you'll need a capture group (the parentheses) and a backreference (\\1) in the replacement.
Because we are "capturing" the upper-case letters, \\1 represents those original upper-case letters from your strings; the only change is from \\. to ,. In other words, we replace the upper-case letters ^([[:upper:]]+) AND the full stop \\. with their original content \\1 AND a comma.
test_names <- c("ADAM, Smith J.", "JOHNSON. Richard", "BROWN, Wilhelm K.", "DAVIS, Daniel")
sub("^([[:upper:]]+)\\.","\\1\\,",test_names)
[1] "ADAM, Smith J." "JOHNSON, Richard" "BROWN, Wilhelm K."
[4] "DAVIS, Daniel"
Can be done by a function like so:
names <- c("ADAM, Smith", "JOHNSON. Richard", "BROWN, Wilhelm", "DAVIS, Daniel")
replacedots <- function(mystring) {
  gsub("\\.", ",", mystring)
}
replacedots(names)
[1] "ADAM, Smith" "JOHNSON, Richard" "BROWN, Wilhelm" "DAVIS, Daniel"
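If the goal is still to touch only the separator (so that full stops elsewhere, as in "Smith J.", survive), the anchored capture-group version can be wrapped in a function as well. This sketch combines the two answers and is not from the original post:

```r
# Sketch: anchor the pattern so only the dot directly after the leading
# upper-case surname is replaced; other full stops are left alone.
fix_separator <- function(x) {
  sub("^([[:upper:]]+)\\.", "\\1,", x)
}

fix_separator(c("ADAM, Smith J.", "JOHNSON. Richard"))
# [1] "ADAM, Smith J."   "JOHNSON, Richard"
```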

Extracting parts of text string between two characters

I am new to R and still learning so I would appreciate so much any help or suggestion.
I have different character strings similar to those:
"Department of Biophysical Chemistry, University of Braunschweig, Braunschweig, Germany; Consejo Superior de Investigaciones Científicas, CCHS, Madrid, Spain;"
Then I would like to extract only the name of the countries in those strings, including semicolon, that is:
"Germany; Spain;"
The problem for me is finding out how to extract just from the last comma to the semicolon and do that repeatedly. I tried with the gsub function but I was not able to find the right approach.
For test input make a 3 component vector s as shown in the Note at the end so that we can see that it works for multiple lines -- here just three lines.
Now, we can get a one-line solution using strapply in the gsubfn package. We match the indicated pattern returning only the match to the capture group, i.e. the portion within parentheses. Then for each line we use sapply to paste the matches together.
library(gsubfn)
sapply(strapply(s, ", ([^,;]+;)"), paste, collapse = " ")
giving:
[1] "Germany; Spain;" "Germany; Spain;" "Germany; Spain;"
Note
s1 <- "Department of Biophysical Chemistry, University of Braunschweig, Braunschweig, Germany; Consejo Superior de Investigaciones Científicas, CCHS, Madrid, Spain;"
s <- c(s1, s1, s1)
We can try using strsplit along with sub here for a base R option:
x <- "Department of Biophysical Chemistry, University of Braunschweig, Braunschweig, Germany; Consejo Superior de Investigaciones Científicas, CCHS, Madrid, Spain;"
terms <- sapply(strsplit(x, ";\\s*")[[1]], function(x) {
  sub("^.*\\s+", "", x)
})
output <- paste0(terms, ";", collapse=" ")
output
[1] "Germany; Spain;"
The logic here is to first split your semicolon-separated string on the pattern ;\s*, which results in a list containing each department. Then we use sapply to remove everything up to, and including, the last appearance of whitespace. Finally we use paste with collapse to generate another semicolon-separated string.
Note: I changed the names of the output vector only for demo purposes, because R was using the full department description as the name by default, making it hard to display.
I would simply find the last comma before the ; and capture everything in between using a simple gsub call. This will also work for a vector.
gsub(".*?([^,]*;)", "\\1", x, perl = TRUE)
# [1] " Germany; Spain;"
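For completeness, here is a base-R sketch of my own with regmatches() (using a shortened ASCII version of the example string) that extracts the matching chunks instead of deleting everything around them:

```r
# Sketch: grab every ", <country>;" chunk with gregexpr(), then strip
# the leading ", " from each match before pasting them back together.
x <- "Dept of Chemistry, University of Braunschweig, Braunschweig, Germany; CSIC, CCHS, Madrid, Spain;"
m <- regmatches(x, gregexpr(", ([^,;]+;)", x))[[1]]
paste(sub("^, ", "", m), collapse = " ")
# [1] "Germany; Spain;"
```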

How to count the number of segments in a string in R?

I have a string printed out like this:
"\"Jenna and Alex were making cupcakes.\", \"Jenna asked Alex whether all were ready to be frosted.\", \"Alex said that\", \" some of them \", \"were.\", \"He added\", \"that\", \"the rest\", \"would be\", \"ready\", \"soon.\", \"\""
(The "\" wasn't there. R just automatically prints it out.)
I would like to calculate how many non-empty segments there are in this string. In this case the answer should be 11.
I tried to convert it to a vector, but R ignores the quotation marks so I still ended up with a vector with length 1.
I don't know whether I need to extract those segments first and then count, or whether there are easier ways to do that.
If it's the former case, which regular expression function best suits my need?
Thank you very much.
You can use scan to convert your large string into a vector of individual ones, then use nchar to count the lengths. Assuming your large string is x:
y <- scan(text=x, what="character", sep=",", strip.white=TRUE)
Read 12 items
sum(nchar(y)>0)
[1] 11
I assume a segment is defined as anything between . or ,. An option using strsplit can be found as:
length(grep("\\w+", trimws(strsplit(str, split=",|\\.")[[1]])))
#[1] 11
Note: trimws is not mandatory in the above statement. I have included it so that one can get the value of each segment just by adding the value = TRUE argument in grep.
Data:
str <- "\"Jenna and Alex were making cupcakes.\", \"Jenna asked Alex whether all were ready to be frosted.\", \"Alex said that\", \" some of them \", \"were.\", \"He added\", \"that\", \"the rest\", \"would be\", \"ready\", \"soon.\", \"\""
strsplit might be one possibility?
txt <- "Jenna and Alex were making cupcakes., Jenna asked Alex whether all were ready to be frosted.,
Alex said that, some of them , were., He added, that, the rest, would be, ready, soon.,"
a <- strsplit(txt, split=",")
length(a[[1]])
[1] 11
If the backslashes are part of the text it doesn't really change a lot, except for the last element, which would have "\"" in it. By filtering that out, the result is the same:
txt <- "\"Jenna and Alex were making cupcakes.\", \"Jenna asked Alex whether all
were ready to be frosted.\", \"Alex said that\", \" some of them \",
\"were.\", \"He added\", \"that\", \"the rest\", \"would be\", \"ready\", \"soon.\", \"\""
a <- strsplit(txt, split=", \"")
length(a[[1]][a[[1]] != "\""])
[1] 11
This is an absurd idea, but it does work:
txt <- "\"Jenna and Alex were making cupcakes.\", \"Jenna asked Alex whether all were ready to be frosted.\", \"Alex said that\", \" some of them \", \"were.\", \"He added\", \"that\", \"the rest\", \"would be\", \"ready\", \"soon.\", \"\""
Txt <- read.csv(text = txt,
                header = FALSE,
                colClasses = "character",
                na.strings = c("", " "))
sum(!vapply(Txt, is.na, logical(1)))
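If the quotation marks really are characters inside the string, another base-R sketch (mine, not from the answers above) is to pull out each quoted segment with regmatches() and count the ones with visible content:

```r
# Sketch: match every "..." segment, then count those that still contain
# a non-whitespace character once the surrounding quotes are stripped.
txt <- "\"Jenna and Alex were making cupcakes.\", \"Jenna asked Alex whether all were ready to be frosted.\", \"Alex said that\", \" some of them \", \"were.\", \"He added\", \"that\", \"the rest\", \"would be\", \"ready\", \"soon.\", \"\""
segs <- regmatches(txt, gregexpr('"[^"]*"', txt))[[1]]
sum(grepl("\\S", gsub('"', "", segs)))
# [1] 11
```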

gsubfn function not giving desired output when ignore.case = TRUE

I am trying to substitute multiple patterns within a character vector with their corresponding replacement strings. After doing some research I found the package gsubfn which I think is able to do what I want it to, however when I run the code below I don't get my expected output (see end of question for results versus what I expected to see).
library(gsubfn)
# Our test data that we want to search through (while ignoring case)
test.data<- c("1700 Happy Pl","155 Sad BLVD","82 Lolly ln", "4132 Avent aVe")
# A list data frame which contains the patterns we want to search for
# (again ignoring case) and the associated replacement strings we want to
# exchange any matches we come across with.
frame<- data.frame(pattern= c(" Pl"," blvd"," LN"," ave"), replace= c(" Place", " Boulevard", " Lane", " Avenue"),stringsAsFactors = F)
# NOTE: I added spaces in front of each of our replacement terms to make
# sure we only grab matches that are their own word (for instance, if an
# address was 45 Splash Way we would not want to replace the "pl" inside
# "Splash" with "Place").
# The following paste lines are supposed to keep the substitution from
# grabbing instances like the first " Ave" found directly after "4132"
# inside "4132 Avent aVe", which we don't want converted to " Avenue".
pat <- paste(paste(frame$pattern,collapse = "($|[^a-zA-Z])|"),"($|[^a-zA-Z])", sep = "")
# Here is the gsubfn function I am calling
gsubfn(x = test.data, pattern = pat, replacement = setNames(as.list(frame$replace),frame$pattern), ignore.case = T)
Output being received:
[1] "1700 Happy" "155 Sad" "82 Lolly" "4132 Avent"
Output expected:
[1] "1700 Happy Place" "155 Sad Boulevard" "82 Lolly Lane" "4132 Avent Avenue"
My working theory on why this isn't working is that the matches don't match the names associated with the list I am passing into the gsubfn's replacement argument because of some case discrepancies (eg: the match being found on "155 Sad BLVD" doesn't == " blvd" even though it was able to be seen as a match due to the ignore.case argument). Can someone confirm that this is the issue/point me to what else might be going wrong, and perhaps a way of fixing this that doesn't require me expanding my pattern vector to include all case permutations if possible?
Seems like stringr has a simple solution for you:
library(stringr)
str_replace_all(test.data,
                regex(paste0('\\b', frame$pattern, '$'), ignore_case = T),
                frame$replace)
#[1] "1700 Happy Place" "155 Sad Boulevard" "82 Lolly Lane" "4132 Avent Avenue"
Note that I had to alter the regex to look for only words at the end of the string because of the tricky 'Avent aVe'. But of course there's other ways to handle that too.
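A dependency-free alternative (my own sketch, not from the answers above) is to loop over the pattern/replacement pairs with base gsub and its own ignore.case argument, sidestepping the named-list case mismatch entirely:

```r
# Sketch: apply each pattern/replacement pair in turn with
# gsub(..., ignore.case = TRUE), anchoring the suffix to the end of the
# string so the "aVe" inside "Avent" is never touched mid-word.
test.data <- c("1700 Happy Pl", "155 Sad BLVD", "82 Lolly ln", "4132 Avent aVe")
frame <- data.frame(pattern = c("Pl", "blvd", "LN", "ave"),
                    replace = c("Place", "Boulevard", "Lane", "Avenue"),
                    stringsAsFactors = FALSE)

out <- test.data
for (i in seq_len(nrow(frame))) {
  out <- gsub(paste0("\\b", frame$pattern[i], "$"), frame$replace[i],
              out, ignore.case = TRUE)
}
out
# [1] "1700 Happy Place"   "155 Sad Boulevard"  "82 Lolly Lane"      "4132 Avent Avenue"
```

Like the stringr answer, this only handles suffixes at the end of the string; mid-string street types would need a different anchor.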

Extract a number in the middle or end of a string in R

I have a string vector. I would like to extract a number after "# of Stalls: " The numbers are located either in the middle or in the end of the string.
x <- c("1345 W. Pacific Coast Highway<br/>Wilmington 90710<br/><br/>County: Los Angeles<br/>Date Updated: 6/25/2013<br/>Latitude:-118.28079400<br/>Longitude:33.79077900<br/># of Stalls: 244<br/>Cost: Free", "20601 La Puente Ave<br/>Walnut 91789<br/>County: Los Angeles<br/>Date Updated: 6/18/2007<br/>Latitude: -117.859972<br/>Longitude: 34.017513<br/>Owner: Church<br/>Operator: Caltrans<br/># of Stalls: 40")
Here is my attempt, but it is not sufficient. I appreciate your help.
gsub(".*\\# of Stalls: ", "", x)
Since it's HTML, you can use rvest or another HTML parser to extract the nodes you want first, which makes extracting the numbers trivial. XPath selectors and functions afford a little more flexibility than CSS ones for this sort of work.
library(rvest)
x %>% paste(collapse = '<br/>') %>%
read_html() %>%
html_nodes(xpath = '//text()[contains(., "# of Stalls:")]') %>%
html_text() %>%
readr::parse_number()
#> [1] 244 40
We match one or more characters that are not a # ([^#]+) from the start (^) of the string, followed by a #, zero or more non-digit characters ([^0-9]*), and one or more digits ([0-9]+) captured as a group ((...)), followed by the rest of the string (.*). We then replace the whole match with the backreference \\1 of the captured group.
as.integer(sub("^[^#]+#[^0-9]*([0-9]+).*", "\\1", x))
#[1] 244 40
If the string is more specific, then we can specify it
as.integer(sub("^[^#]+# of Stalls:\\s+([0-9]+).*", "\\1", x))
#[1] 244 40
There are many ways to solve this problem; I am going to use the stringr package. The first str_extract fetches the values
[1] "# of Stalls: 244" "# of Stalls: 40"
and then the second str_extract extracts only the digit part of each string.
I am, however, not clear whether you want to extract the number or replace the string. In case you want to extract it, the code below will work for you; in case you want to replace the string, you need to use str_replace.
library(stringr)
as.integer(str_extract(str_extract(x,"#\\D*\\d{1,}"),"\\d{1,}"))
In case you want to replace the string then you should do :
str_replace(x,"#\\D*(\\d{1,})","\\1")
Output:
Output for extract:
> as.integer(str_extract(str_extract(x,"#\\D*\\d{1,}"),"\\d{1,}"))
[1] 244 40
Output for replace:
> str_replace(x,"#\\D*(\\d{1,})","\\1")
[1] "1345 W. Pacific Coast Highway<br/>Wilmington 90710<br/><br/>County: Los Angeles<br/>Date Updated: 6/25/2013<br/>Latitude:-118.28079400<br/>Longitude:33.79077900<br/>244<br/>Cost: Free"
[2] "20601 La Puente Ave<br/>Walnut 91789<br/>County: Los Angeles<br/>Date Updated: 6/18/2007<br/>Latitude: -117.859972<br/>Longitude: 34.017513<br/>Owner: Church<br/>Operator: Caltrans<br/>40"
Here are some solutions. (1) and (1a) are variations of the code in the question. (2) and (2a) take the opposite approach where, instead of removing what we don't want they match what we do want.
1) gsub The code in the question removes the portion before the number but does not remove the portion after it. We can modify it to do both at once, below. The |\\D.*$ part that we added does that. Note that "\\D" matches any non-digit.
as.integer(gsub(".*# of Stalls: |\\D.*$", "", xx))
## [1] 244 40
1a) sub Alternately, do these in two separate sub calls. The inner sub is from the question and the outer sub removes everything from the first non-digit after the number onwards.
as.integer(sub("\\D.*$", "", sub(".*# of Stalls: ", "", xx)))
## [1] 244 40
2) strcapture With this approach, available in the development version of R, we can simplify the regular expression substantially. We specify a match with a capture group (portion in parentheses). strcapture will return the portion corresponding to the capture group and create a data.frame from it. The third argument is a prototype structure that it uses to know that it is supposed to return integers. Note that "\\d" matches any digit.
strcapture("# of Stalls: (\\d+)", xx, list(stalls = integer()))
## stalls
## 1 244
## 2 40
2a) strapply The strapply function in the gsubfn package is similar to strcapture but uses an apply paradigm where the first argument is the input string, the second is the pattern and the third is the function to apply to the capture group.
library(gsubfn)
strapply(xx, "# of Stalls: (\\d+)", as.integer, simplify = TRUE)
## [1] 244  40
Note: The input xx used is the same as x in the question:
xx <- c("1345 W. Pacific Coast Highway<br/>Wilmington 90710<br/><br/>County: Los Angeles<br/>Date Updated: 6/25/2013<br/>Latitude:-118.28079400<br/>Longitude:33.79077900<br/># of Stalls: 244<br/>Cost: Free",
"20601 La Puente Ave<br/>Walnut 91789<br/>County: Los Angeles<br/>Date Updated: 6/18/2007<br/>Latitude: -117.859972<br/>Longitude: 34.017513<br/>Owner: Church<br/>Operator: Caltrans<br/># of Stalls: 40"
)
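One more base-R variation (my own, not from the answers above): since the label is a fixed-width string, a lookbehind assertion can do the extraction in a single call:

```r
# Sketch: a fixed-width lookbehind keeps only the digits that directly
# follow the "# of Stalls: " label (hypothetical shortened input).
xx <- c("...<br/># of Stalls: 244<br/>Cost: Free",
        "...<br/># of Stalls: 40")
as.integer(regmatches(xx, regexpr("(?<=# of Stalls: )\\d+", xx, perl = TRUE)))
# [1] 244  40
```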
