Regex to pull sub-path - r

I have a dataframe df with a place field containing strings that look like this:
countryName0 / provinceName0 / countyName0 / cityName0
countryName1 / provinceName1
Using this code I can pull out the finest resolution place identifier:
df$shortplace <- trimws(basename(df$place))
or:
df$shortplace <- gsub(".*/ ", "", df$place)
e.g.
cityName0
provinceName1
I can then use the ggmap library to extract geocodes for cityName0 and provinceName1:
df$geo <- geocode(df$shortplace)
Result looks like this:
geo.lat geo.long
-33.789 147.909
-29.333 133.819
Unfortunately, some city names are not unique: Perth, for example, is the capital of Western Australia, a town in Tasmania, and a city in Scotland. What I need is not just the place identifier after the last "/" but everything after the second-to-last "/" (with the "/" replaced by a " "), to give the geocode() function more information. How do I scan to the second-to-last "/" and extract the last two place names? E.g.
shortplace
countyName0 cityName0
countryName1 provinceName1

There are other ways, but strsplit() seems the most straightforward to me here. Give this a try:
x = "countryName0 / provinceName0 / countyName0 / cityName0"
x_split = strsplit(x, " / ")[[1]] # Somewhat confusingly, result of strsplit() is a list; [[1]] pulls out the one and only entry here
n_terms = length(x_split)
result = paste(x_split[n_terms - 1], x_split[n_terms], sep = ", ")
result
# [1] "countyName0, cityName0"
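The snippet above handles a single string. A minimal sketch of the same idea applied to a whole column follows; the helper name last_two() and the sample vector are made up for illustration, and the pieces are joined with a space as in the question's desired output.

```r
# Sample values standing in for df$place (illustrative data)
place <- c("countryName0 / provinceName0 / countyName0 / cityName0",
           "countryName1 / provinceName1")

# Hypothetical helper: keep the last n slash-separated parts of each string
last_two <- function(x, n = 2) {
  parts <- strsplit(x, " / ", fixed = TRUE)          # one character vector per string
  vapply(parts,
         function(p) paste(tail(p, n), collapse = " "),  # tail() copes with short paths
         character(1))
}

shortplace <- last_two(place)
shortplace
```

Because tail() returns whatever is available, a path with only one part is returned unchanged.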

One option is sub(). The pattern matches a run of alphanumeric characters ([[:alnum:]]+), one or more spaces (\\s+), a /, more spaces, and then another run of alphanumeric characters up to the end of the string ($). Both runs are captured as groups, and the whole match is replaced with their backreferences (\\1 \\2):
df$shortplace <- sub(".*\\b([[:alnum:]]+)\\s+\\/\\s+([[:alnum:]]+)$", "\\1 \\2", df$place)
df$shortplace
#[1] "countyName0 cityName0" "countryName1 provinceName1"
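The [[:alnum:]]+ groups above stop at spaces, so a multi-word name such as "Western Australia" would be cut down to its last word. Below is a sketch of a variant that captures everything between the slashes instead. The sample strings and the switch to perl = TRUE are assumptions for illustration, not part of the original answer.

```r
# Illustrative data with a multi-word place name
place <- c("Australia / Western Australia / Perth",
           "countryName1 / provinceName1")

# Optional leading path, then the last two slash-free segments,
# captured lazily so surrounding spaces are left out of the groups
pattern <- "^(?:.*/)?\\s*([^/]+?)\\s*/\\s*([^/]+?)\\s*$"

shortplace <- sub(pattern, "\\1 \\2", place, perl = TRUE)
shortplace
```

Strings without any "/" do not match the pattern and are returned unchanged.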

This worked for me in the end:
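df$shortplace <- gsub("((?:/[^/\r\n]*){2})$", "\\1", df$place, perl = TRUE)  # \\1 is the backreference to the captured group
df$shortplace <- gsub(" / ", ", ", df$place)  # reads df$place, so the result is the full path with ", " separators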
Not super elegant but it does the job.

Related

How to extract first 2 words from a string in R?

I need to extract the first 2 words from a string. If the string contains more than 2 words, it should return the first 2 words; if it contains fewer than 2 words, it should return the string as it is.
I've tried the word() function from the stringr package, but it doesn't give the desired output when the string has fewer than 2 words.
word(dt$var_containing_strings, 1,2, sep=" ")
Example:
Input String: Auto Loan (Personal)
Output: Auto Loan
Input String: Others
Output: Others
If you want to use stringr::word(), you can do:
ifelse(is.na(word(x, 1, 2)), x, word(x, 1, 2))
[1] "Auto Loan" "Others"
Sample data:
x <- c("Auto Loan (Personal)", "Others")
Something like this?
a <- "this is a character string"
unlist(strsplit(a, " "))[1:2]
[1] "this" "is"
EDIT:
To return the original string when it has 2 words or fewer, a simple if-else can be used:
a <- "this is a character string"
words <- unlist(strsplit(a, " "))
if (length(words) > 2) {
  paste(words[1:2], collapse = " ")  # join the first two words back into one string
} else {
  a
}
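The if-else above works on one string at a time. A possible vectorized sketch is shown below; the function name first_two_words() is hypothetical, and it assumes words are separated by one or more spaces.

```r
# Hypothetical helper: first two space-separated words of each element
first_two_words <- function(x) {
  parts <- strsplit(x, " +")                      # split on runs of spaces
  vapply(parts,
         function(p) paste(head(p, 2), collapse = " "),  # head() copes with one-word strings
         character(1))
}

first_two_words(c("Auto Loan (Personal)", "Others"))
```

head(p, 2) returns a single word when only one is present, so no explicit length check is needed.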
You could use regex in base R using sub
sub("(\\w+\\s+\\w+).*", "\\1", "Auto Loan (Personal)")
#[1] "Auto Loan"
which will also work if you have only one word in the text
sub("(\\w+\\s+\\w+).*", "\\1", "Auto")
#[1] "Auto"
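Explanation:
The pattern captures the part inside the round brackets, (\\w+\\s+\\w+): \\w+ is one word, \\s+ is whitespace, and the second \\w+ is another word, so the group holds two words in total.
The trailing .* matches the rest of the string, and replacing the whole match with the backreference \\1 keeps only the captured words. If the string has only one word, the pattern does not match and sub() returns the string unchanged.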
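One detail worth checking is what counts as a "word": \\w does not match punctuation, so a string that starts with a bracketed token behaves differently. The sketch below compares the pattern above with an anchored variant based on \\S+ (runs of non-whitespace); the third sample string and the variant are assumptions added for illustration.

```r
x <- c("Auto Loan (Personal)", "Others", "(Personal) Auto Loan")

# Pattern from the answer: two \w-words anywhere in the string
by_word <- sub("(\\w+\\s+\\w+).*", "\\1", x)

# Variant: first two whitespace-separated tokens from the start of the string
by_token <- sub("^\\s*(\\S+\\s+\\S+).*$", "\\1", x)

by_word
by_token
```

For the third string the \\w version leaves the text unchanged, because the only match is "Auto Loan" at the end, while the \\S version keeps "(Personal) Auto".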

Regex expression starting from a certain character

Example: "example._AL(5)._._4500_GRE/Jan_2018"
I am trying to extract text from the above string, which contains parentheses. I want to extract everything starting from AL.
Output should look like: "AL(5)._._4500_GRE/Jan_2018"
It is not entirely clear what can be assumed about the input, so here are a few variations that make different assumptions.
1) word(: this removes everything before the first word that is followed by a left parenthesis.
"^" matches the start of string
".*?" is the shortest match of anything provided we still match rest of regex
"\\w+" matches a word
"\\(" matches a left paren
(...) forms a capture group which the replacement string can refer to as "\\1"
Code
x <- "example.AL(5)._._4500_GRE/Jan_2018"
sub("^.*?(\\w+\\()", "\\1", x)
## [1] "AL(5)._._4500_GRE/Jan_2018"
1a) or matching a word followed by ( followed by anything and extracting that:
library(gsubfn)
strapplyc(x, "\\w+\\(.*", simplify = TRUE)
## [1] "AL(5)._._4500_GRE/Jan_2018"
2) AL(: if we know that the word is AL, then:
sub("^.*?(AL\\(.*)", "\\1", x)
## [1] "AL(5)._._4500_GRE/Jan_2018"
3) Remove up to the first dot: if we know that the part to be removed is everything up to and including the first dot:
sub("^.*?\\.", "", x)
## [1] "AL(5)._._4500_GRE/Jan_2018"
4) Dot-separated fields: if the input consists of dot-separated fields, we can parse them all at once like this:
read.table(text = x, sep = ".", as.is = TRUE)
## V1 V2 V3 V4
## 1 example AL(5) _ _4500_GRE/Jan_2018
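Variation 1a relies on the gsubfn package. As a sketch, the same extraction can be done in base R with regexpr() and regmatches(), assuming the same x as in the code above:

```r
x <- "example.AL(5)._._4500_GRE/Jan_2018"

# regexpr() locates the first "word followed by (" and everything after it;
# regmatches() pulls out the matched text
m <- regexpr("\\w+\\(.*", x)
extracted <- regmatches(x, m)
extracted
```

Note that regmatches() drops elements with no match, so for a vector input the result can be shorter than the input.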

text match and replacement in R

I am working on a project where part of the data cleaning is stripping out the country names. The LOCATION_NAME column of my original data frame (named noaa) looks like this:
head(noaa$LOCATION_NAME,5)
[1] "JORDAN: BAB-A-DARAA,AL-KARAK"
[2] "SYRIA: UGARIT"
[3] "TURKMENISTAN: W"
[4] "GREECE: THERA ISLAND (SANTORINI)"
[5] "ISRAEL: ARIHA (JERICHO)"
To strip out the country names I'm using:
noaa$LOCATION_NAME <- gsub('^.*: +', '', noaa$LOCATION_NAME)
It works pretty well, however, I still get entries like:
"ANTAKYA (ANTIOCH); SYRIA"
or
"DIMASHQ; TURKEY:ANTIOCH; LEBANON:TARABULUS" (because the expression doesn't begin with "countryname:").
Removing everything that ends with a ":" is not an option because of cases like:
"CHINA: YUNNAN PROVINCE: MIDU"
where I would like to retain "YUNNAN PROVINCE: MIDU".
For "PAKISTAN: INDUS DELTA; INDIA: SAMAWANI (SAMAJI)"
I would like to retain "INDUS DELTA; SAMAWANI (SAMAJI)".
I also have instances like "SWITZERLAND" (no ":"), where I guess I would put just " " (space).
I have in my data frame a column with country names and I could make a vector with unique country names. I was wondering if there is a smart method to check if a part of a string matches a country name in my country column and if yes, then I could remove it.
I would be grateful for some help on this.
Since the country name might appear in different parts of the string, you can first split it on ";" and ":" and then match the pieces against your unique country names:
# dfOfCountries is the data.frame containing all the countries, as mentioned in your question
distinctcountries <- unique(dfOfCountries$COUNTRY)
noaa$COUNTRY <- sapply(noaa$LOCATION_NAME, function(x) {
strparts <- trimws(unlist(lapply(strsplit(x, ":")[[1]], strsplit, split=";")))
strparts[strparts %in% distinctcountries]
})
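The code above collects the matching country names. To actually remove them, the same split-and-match idea can be turned around, as in the sketch below. The function name strip_countries() and the sample country vector are hypothetical, and an entry that consists only of a country name becomes an empty string here rather than a single space.

```r
# Hypothetical helper: drop any ":"- or ";"-separated piece that is a known country
strip_countries <- function(x, countries) {
  vapply(x, function(s) {
    pieces <- trimws(strsplit(s, ";", fixed = TRUE)[[1]])      # split on ";" first
    cleaned <- vapply(pieces, function(p) {
      parts <- trimws(strsplit(p, ":", fixed = TRUE)[[1]])     # then on ":"
      paste(parts[!parts %in% countries], collapse = ": ")     # keep non-country parts
    }, character(1), USE.NAMES = FALSE)
    paste(cleaned[nzchar(cleaned)], collapse = "; ")           # drop emptied pieces
  }, character(1), USE.NAMES = FALSE)
}

countries <- c("CHINA", "PAKISTAN", "INDIA", "SYRIA", "TURKEY", "LEBANON", "SWITZERLAND")
loc <- c("CHINA: YUNNAN PROVINCE: MIDU",
         "PAKISTAN: INDUS DELTA; INDIA: SAMAWANI (SAMAJI)",
         "ANTAKYA (ANTIOCH); SYRIA",
         "DIMASHQ; TURKEY:ANTIOCH; LEBANON:TARABULUS",
         "SWITZERLAND")
stripped <- strip_countries(loc, countries)
stripped
```

Matching whole pieces with %in% avoids removing text that merely contains a country name.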
This builds a regex that is an OR list of patterns (separated by |).
noaa <- read.table(text='
LOCATION_NAME
"JORDAN: BAB-A-DARAA,AL-KARAK"
"SYRIA: UGARIT"
"TURKMENISTAN: W"
"GREECE: THERA ISLAND (SANTORINI)"
"ISRAEL: ARIHA (JERICHO)"
"SWITZERLAND SOMEWHERE"
', header = TRUE, stringsAsFactors = FALSE)
countries <- c("JORDAN", "SYRIA", "GREECE", "SWITZERLAND")
# build an or list of patterns including country name ending with
# either (in priority order) <space>: or : or <space>
patterns <- paste0(countries, collapse="(\\s\\:|\\:|\\s)|")
trimws(gsub(patterns, "", noaa$LOCATION_NAME))
# [1] "BAB-A-DARAA,AL-KARAK" "UGARIT" "TURKMENISTAN: W" "THERA ISLAND (SANTORINI)"
# [5] "ISRAEL: ARIHA (JERICHO)" "SOMEWHERE"
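A possible variation on the pattern above wraps the country alternatives in word boundaries and makes the trailing colon optional, so a bare country name is removed while a longer word that merely starts with one is left alone. This is only a sketch: the sample vector is illustrative, and it assumes the country names contain no regex metacharacters.

```r
location <- c("JORDAN: BAB-A-DARAA,AL-KARAK",
              "SWITZERLAND",
              "SYRIAN DESERT",
              "DIMASHQ; TURKEY:ANTIOCH")
countries <- c("JORDAN", "SYRIA", "TURKEY", "SWITZERLAND")

# \b(A|B|...)\b followed by optional spaces, an optional colon, optional spaces
pattern <- paste0("\\b(", paste(countries, collapse = "|"), ")\\b\\s*:?\\s*")

cleaned <- trimws(gsub(pattern, "", location, perl = TRUE))
cleaned
```

"SYRIAN DESERT" is untouched because there is no word boundary between "SYRIA" and the following "N".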

Move location of special character

I have a vector of strings in which the only special symbol is "-".
To be clear, a sample string looks like 23 C-Exam
I'd like to change it to 23-C Exam
I essentially want R to find the location of "-" and move it 2 positions back.
I feel this is a really simple task, although I can't figure out how.
Assume that whenever R finds "-", the character two positions back is whitespace, just like in the example above.
regex attempt:
x <- c("23 C-Exam","45 D-Exam")
#[1] "23 C-Exam" "45 D-Exam"
sub(".(.)-", "-\\1 ", x)
#[1] "23-C Exam" "45-D Exam"
The pattern matches any character (.), then a captured character ((.)), then a literal dash (-).
The replacement is a literal dash, the captured character (\\1), and a space, so the dash and the space effectively swap places.
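Since the question states that the character two positions before the dash is always whitespace, the pattern can say so explicitly. The sketch below is a slightly stricter variant; the third sample string is an assumption added to show that strings without a dash are left alone.

```r
x <- c("23 C-Exam", "45 D-Exam", "67 Exam")

# \s = the whitespace to be replaced, (\S) = the single character kept, - = the dash
moved <- sub("\\s(\\S)-", "-\\1 ", x)
moved
```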
There is probably a sleek way of doing this with regular expressions, but one approach is to simply splice together the various pieces of the desired output. First, I find the index in the string containing the -, and then I use substr() to piece together the output.
x <- "23 C-Exam"
pos <- regexpr("-", x)
x <- paste0(substr(x, 1, pos-3),
            "-",
            substr(x, pos-1, pos-1),
            " ",
            substr(x, pos+1, nchar(x)))
> x
[1] "23-C Exam"
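The substr() approach above handles one string. Because regexpr(), substr() and paste0() are all vectorized, it can be wrapped into a function for a whole vector, as in this sketch; the name move_dash() is hypothetical, and strings without a dash are returned unchanged.

```r
# Hypothetical helper: move the first "-" two positions back in each string
move_dash <- function(x) {
  pos <- as.integer(regexpr("-", x, fixed = TRUE))   # -1 where there is no dash
  moved <- paste0(substr(x, 1, pos - 3),             # text before the whitespace
                  "-",
                  substr(x, pos - 1, pos - 1),       # the character before the dash
                  " ",
                  substr(x, pos + 1, nchar(x)))      # text after the dash
  ifelse(pos > 0, moved, x)
}

move_dash(c("23 C-Exam", "45 D-Exam", "67 Exam"))
```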
We can also use chartr
chartr(" -", "- ", x)
#[1] "23-C Exam" "45-D Exam"
data
x <- c("23 C-Exam","45 D-Exam")
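Note that chartr() swaps every space and every dash in the string, not just the pair around the letter. That is fine for the sample data, but the sketch below (with an illustrative second string) shows how it differs from the sub() approach when a string contains additional spaces.

```r
x <- c("23 C-Exam", "23 C-Exam Part 2")

swapped <- chartr(" -", "- ", x)        # every " " becomes "-" and every "-" becomes " "
targeted <- sub(".(.)-", "-\\1 ", x)    # only the characters around the first dash change

swapped
targeted
```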

R list within matrix to dataframe conversion

R struggles. I am using the following to extract quotations from text, with multiple results on a large dataset. I am trying to have the output be a character string within a dataframe, so I can easily share this as a csv with others.
Sample data:
normalCase <- 'He said, "I am a test," very quickly.'
endCase <- 'This is a long quote, which we said, "Would never happen."'
shortCase <- 'A "quote" yo';
beginningCase <- '"I said this," he said quickly';
multipleCase <- 'When asked, "No," said Sam "I do not like green eggs and ham."'
testdata = c(normalCase,endCase,shortCase,beginningCase,multipleCase)
Using the following to extract quotations and a buffer of characters:
result <-function(testdata) {
str_extract_all(testdata, '[^\"]?{15}"[^\"]+"[^\"]?{15}')
}
extract <- sapply(testdata, FUN=result)
The extract is a list within a matrix. However, I want the extract to be a character string that I can later merge to a dataframe as a column. How do I convert this?
Code
normalCase <- 'He said, "I am a test," very quickly.'
endCase <- 'This is a long quote, which we said, "Would never happen."'
shortCase <- 'A "quote" yo';
beginningCase <- '"I said this," he said quickly';
multipleCase <- 'When asked, "No," said Sam "I do not like green eggs and ham."'
testdata = c(normalCase,endCase,shortCase,beginningCase,multipleCase)
# extract quotations
gsub(pattern = "[^\"]*((?:\"[^\"]*\")|$)", replacement = "\\1 ", x = testdata)
Output
[1] "\"I am a test,\" "
[2] "\"Would never happen.\" "
[3] "\"quote\" "
[4] "\"I said this,\" "
[5] "\"No,\" \"I do not like green eggs and ham.\" "
Explanation
pattern = "[^\"]" will match any character except a double quote
pattern = "[^\"]*" will match any character except a double quote, 0 or more times
pattern = "\"[^\"]*\"" will match a double quote, then any character except a double quote 0 or more times, then another double quote (i.e. a quotation)
pattern = "(?:\"[^\"]*\")" will match a quotation, but won't capture it
pattern = "((?:\"[^\"]*\")|$)" will match a quotation or the end of the string, and capture it. Note that this is the first group we capture
replacement = "\\1 " will replace the match with the first captured group followed by a space
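To get from the extracted quotations to a data frame column, one option is to extract the matches as a list and collapse each element into a single string. The sketch below uses base R's gregexpr() and regmatches() instead of stringr; the column names text and quotes are made up for illustration.

```r
testdata <- c('He said, "I am a test," very quickly.',
              'This is a long quote, which we said, "Would never happen."',
              'A "quote" yo',
              '"I said this," he said quickly',
              'When asked, "No," said Sam "I do not like green eggs and ham."')

# One character vector of quotations per input string
quotes <- regmatches(testdata, gregexpr('"[^"]*"', testdata))

# Collapse each vector into one string so it fits in a single column
quote_col <- vapply(quotes, paste, character(1), collapse = " ")

out <- data.frame(text = testdata, quotes = quote_col, stringsAsFactors = FALSE)
out$quotes
```

The resulting data frame holds plain character columns, so it can be passed to write.csv() directly.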

Resources