I am working on a project where part of cleaning the data is stripping out the country names. The LOCATION_NAME column of my original data frame (named noaa) looks like this:
head(noaa$LOCATION_NAME,5)
[1] "JORDAN: BAB-A-DARAA,AL-KARAK"
[2] "SYRIA: UGARIT"
[3] "TURKMENISTAN: W"
[4] "GREECE: THERA ISLAND (SANTORINI)"
[5] "ISRAEL: ARIHA (JERICHO)"
To strip out the country names I'm using:
noaa$LOCATION_NAME <- gsub('^.*: +', '', noaa$LOCATION_NAME)
It works pretty well, however, I still get entries like:
"ANTAKYA (ANTIOCH); SYRIA"
or
"DIMASHQ; TURKEY:ANTIOCH; LEBANON:TARABULUS" (because the expression doesn't begin with "countryname:"
Removing anything ending with a ":" is not an option; in the case of
"CHINA: YUNNAN PROVINCE: MIDU"
I would like to retain "YUNNAN PROVINCE: MIDU",
and for "PAKISTAN: INDUS DELTA; INDIA: SAMAWANI (SAMAJI)"
I would like to retain "INDUS DELTA; SAMAWANI (SAMAJI)".
I also have instances like "SWITZERLAND" (no ":"), where I guess I would just put " " (a space).
My data frame also has a column with country names, from which I could make a vector of unique country names. I was wondering whether there is a smart method to check if part of a string matches a country name in that column and, if so, remove it.
I would be grateful for some help on this.
Since the country name might appear in a different section of the string, you can partition the string on ":" and ";" first and then match the parts against your unique country names:
# dfOfCountries is the data.frame containing all the countries, as mentioned in your question
distinctcountries <- unique(dfOfCountries$COUNTRY)
noaa$COUNTRY <- sapply(noaa$LOCATION_NAME, function(x) {
  strparts <- trimws(unlist(lapply(strsplit(x, ":")[[1]], strsplit, split = ";")))
  strparts[strparts %in% distinctcountries]
})
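The code above pulls the matched country into its own column. If you also want to strip the country out of LOCATION_NAME, the same split-and-match idea can be turned around: drop the parts that are country names and stitch the rest back together. A rough sketch along those lines (it reuses distinctcountries from above; note that it re-joins the remaining parts with "; ", so an internal ":" such as the one in "YUNNAN PROVINCE: MIDU" is not preserved exactly, and a bare country name like "SWITZERLAND" becomes an empty string):
noaa$LOCATION_NAME <- sapply(noaa$LOCATION_NAME, function(x) {
  parts <- trimws(unlist(strsplit(x, "[:;]")))   # split on both ":" and ";"
  kept  <- parts[!parts %in% distinctcountries]  # drop the parts that are country names
  paste(kept, collapse = "; ")                   # re-join whatever is left
}, USE.NAMES = FALSE)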
Another approach is to build a single regex from the list of country names, with the alternative patterns separated by |:
noaa <- read.table(text='
LOCATION_NAME
"JORDAN: BAB-A-DARAA,AL-KARAK"
"SYRIA: UGARIT"
"TURKMENISTAN: W"
"GREECE: THERA ISLAND (SANTORINI)"
"ISRAEL: ARIHA (JERICHO)"
"SWITZERLAND SOMEWHERE"
', header = TRUE, stringsAsFactors = FALSE)
countries <- c("JORDAN", "SYRIA", "GREECE", "SWITZERLAND")
# build an or list of patterns including country name ending with
# either (in priority order) <space>: or : or <space>
patterns <- paste0(countries, collapse="(\\s\\:|\\:|\\s)|")
trimws(gsub(patterns, "", noaa$LOCATION_NAME))
# [1] "BAB-A-DARAA,AL-KARAK" "UGARIT" "TURKMENISTAN: W" "THERA ISLAND (SANTORINI)"
# [5] "ISRAEL: ARIHA (JERICHO)" "SOMEWHERE"
I have a data frame with a field called "full.path.name"
This contains things like
s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx
The "01 GROUP" part is of variable length within the whole string.
I would like to add a new field onto the data frame called "short.path"
and it would contain things like
s:///01 GROUP
s:///02 GROUP LONGER NAME
I've managed to extract the last four characters of the file using stringr, so I think I should use stringr again.
This gives me the file extension
sfiles$file_type<-as.factor(str_sub(sfiles$Type.of.file,-4))
I went to https://www.regextester.com/
and got this
s:///*.[^/]*
as the regex to use
so I tried it below
sfiles$file_path_short<-as.factor(str_match(sfiles$Full.path.name,regex("s:///*.[^/]*")))
What I thought I would get is a new field on my data frame containing
01 GROUP etc
Instead, I get NA.
When I try this
sfiles$file_path_short<-str_extract(sfiles$Full.path.name,"[S]")
it gives me "S".
Where am I going wrong?
When I use: https://regexr.com/
I get
\d* [A-Z]* [A-Z]*[^/]
How do I put that into
sfiles$file_path_short<-str_extract(sfiles$Full.path.name,\d* [A-Z]* [A-Z]*[^\/])
And make things work?
EDIT:
There are two solutions here.
The reason the solutions didn't work at first was that
sfiles$Full.path.name
was longer than 255 characters in some cases.
What I did:
To make g_t_m's regex work
library(tidyverse)
#read the file
sfiles <- read.csv("H:/sdrive_files.csv", stringsAsFactors = F)
# add a field to calculate path length and filter out the over-long paths
sfiles$file_path_length <- str_length(sfiles$Full.path.name)
sfiles <- sfiles %>% filter(file_path_length <= 255)
# then use str_replace to take out the full path name and leave only the top folder names
sfiles$file_path_short <- as.factor(str_replace(sfiles$Full.path.name, "(^.+?/[^/]+?)/.+$", "\\1"))
levels(sfiles$file_path_short)
[1] "S:///01 GROUP 1"
[2] "S:///02 GROUP 2"
[3] "S:///03 GROUP 3"
[4] "S:///04 GROUP 4"
[5] "S:///05 GROUP 5"
[6] "S:///06 GROUP 6"
[7] "S:///07 GROUP 7
I think it was the length of the full.path.name field that was causing problems.
To make Wiktor's answer work I did this:
#read the file
sfiles<-read.csv("H:/sdrive_files.csv", stringsAsFactors = F)
str(sfiles)
sfiles$file_path_length <- str_length(sfiles$Full.path.name)
sfiles<-sfiles%>%filter(file_path_length <=255)
sfiles$file_path_short <- str_replace(sfiles$Full.path.name, "(^.+?/[^/]+?)/.+$", "\\1")
You may use a mere
sfiles$file_path_short <- str_extract(sfiles$Full.path.name, "^s:///[^/]+")
If you plan to exclude s:/// from the results, wrap it within a positive lookbehind:
"(?<=^s:///)[^/]+"
Details
^ - start of string
s:/// - a literal substring
[^/]+ - a negated character class matching any 1+ chars other than /.
(?<=^s:///) - a positive lookbehind that requires the presence of s:/// at the start of the string immediately to the left of the current location (but this value does not appear in the resulting matches since lookarounds are non-consuming patterns).
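For instance, on the sample path from the question the two variants would look like this (a quick sketch, assuming stringr is loaded):
library(stringr)
x <- "s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx"
str_extract(x, "^s:///[^/]+")       # "s:///01 GROUP"
str_extract(x, "(?<=^s:///)[^/]+")  # "01 GROUP"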
Firstly, I would amend your regex to extract the file extension, since file extensions are not always 4 characters long:
library(stringr)
df <- data.frame(full.path.name = c("s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx",
"s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.pdf"), stringsAsFactors = F)
df$file_type <- str_replace(basename(df$full.path.name), "^.+\\.(.+)$", "\\1")
df$file_type
[1] "docx" "pdf"
Then, the following code should give you your short name:
df$file_path_short <- str_replace(df$full.path.name, "(^.+?/[^/]+?)/.+$", "\\1")
df
full.path.name file_type file_path_short
1 s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx docx s:///01 GROUP
2 s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.pdf pdf s:///01 GROUP
I have a character vector in which I'd like to match a specific string and then collapse the element containing that match with the element immediately following it, continuing the process until the end of the vector. For example, here is just one such situation:
'"FundSponsor:Blackrock Advisors" "Category:" "Tax-Free Income-Pennsylvania" "Ticker:" "MPA" "NAV Ticker:" "XMPAX" "Average Daily Volume (shares):" "26,000" "Average Daily Volume (USD):" "$0.335M" "Inception Date:" "10/30/1992" "Inception Share Price:" "$15.00" "Inception NAV:" "$14.18" "Tender Offer:" "No" "Term:" "No"'
Combining each element containing a ":" with only the element following it would be great, BUT I've struggled with the paste function because it generally collapses the entire vector into one element based on the ":", which is not the more targeted solution I'm looking for.
Here's an example of what I'd like a portion of the revised output to look like:
"Inception Share Price:$15.00"
Here is something that might help:
First split using strsplit, then bind elements that belong together
# split the string
vec <- unlist(strsplit(string, '(?=\")(?=\")', perl = TRUE))
vec <- vec[! vec %in% c(' ', '\"')]
# this is what vec looks like right now
head(vec)
# [1] "FundSponsor:Blackrock Advisors" "Category:" "Tax-Free Income-Pennsylvania" "Ticker:" "MPA"
# [6] "NAV Ticker:"
#
# now paste the elements that belong together
ind <- grepl(':.+', vec)                       # elements already of the form key:value
tmp <- vec[!ind]                               # alternating "key:" / "value" elements
pairs <- paste0(tmp[seq(1, length(tmp), 2)], tmp[seq(2, length(tmp), 2)])
vec[which(!ind)[seq(1, length(tmp), 2)]] <- pairs  # put each merged pair where its key was
vec <- vec[-which(!ind)[seq(2, length(tmp), 2)]]   # drop the value elements just merged
head(vec)
# [1] "FundSponsor:Blackrock Advisors" "Category:Tax-Free Income-Pennsylvania" "Ticker:MPA" "NAV Ticker:XMPAX"
# [5] "Average Daily Volume (shares):26,000" "Average Daily Volume (USD):$0.335M"
with the data
string = "\"FundSponsor:Blackrock Advisors\" \"Category:\" \"Tax-Free Income-Pennsylvania\" \"Ticker:\" \"MPA\" \"NAV Ticker:\" \"XMPAX\" \"Average Daily Volume (shares):\" \"26,000\" \"Average Daily Volume (USD):\" \"$0.335M\" \"Inception Date:\" \"10/30/1992\" \"Inception Share Price:\" \"$15.00\" \"Inception NAV:\" \"$14.18\" \"Tender Offer:\" \"No\" \"Term:\" \"No\""
Explanation
The regex (?=\")(?=\") basically tells R to split the string whenever there are two \". The syntax (?!*something*) means *something* comes before/after. So the above simply reads: split the string at every position that is preceeded by a \" and that preceeds a \".
The strsplit(...) above creates elements of the form \" and ('\"Category:\" \"...' becomes the vector '\"';'Category:';'\"';' ';'...'). So by using ! vec %in% c(...) we remove those unwanted elements.
Addendum
If the vector also contains elements of the form "string:" that are followed by a " " (i.e. a key with an empty value), then in the code above remove the line vec <- vec[! vec %in% c(' ', '\"')] and add the lines
vec <- vec[seq(2L, length(vec), 4L)]
vec[vec == ' '] <- NA_character_
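An alternative that sidesteps the splitting quirks entirely is to pull the quoted fields out first with gregexpr()/regmatches() and then merge each bare "key:" field with the field that follows it. This is only a sketch built on the string object defined above, not part of the original answer:
# extract the quoted fields, then drop the surrounding quotes
fields <- regmatches(string, gregexpr('"[^"]*"', string))[[1]]
fields <- gsub('"', '', fields)
# merge each bare "key:" field with the field that follows it, then drop the leftovers
i <- grepl(':$', fields)
fields[i] <- paste0(fields[i], fields[which(i) + 1])
fields <- fields[-(which(i) + 1)]
head(fields) then matches the head(vec) output shown above.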
I am not sure whether you want the outcome to be a single key: value pair or whether you just want to clean up that long string and have it in the format key1: value1 key2: value2 key3: value3. If it is the latter, you can achieve it with the following code:
char = '"FundSponsor:Blackrock Advisors" "Category:" "Tax-Free Income-Pennsylvania" "Ticker:" "MPA" "NAV Ticker:" "XMPAX" "Average Daily Volume (shares):" "26,000" "Average Daily Volume (USD):" "$0.335M" "Inception Date:" "10/30/1992" "Inception Share Price:" "$15.00" "Inception NAV:" "$14.18" "Tender Offer:" "No" "Term:" "No"'
char_tidy = gsub('\\" \\"', " ", char)
# output is below
> char_tidy
[1] "\"FundSponsor:Blackrock Advisors Category: Tax-Free Income-Pennsylvania Ticker: MPA NAV Ticker: XMPAX Average Daily Volume (shares): 26,000 Average Daily Volume (USD): $0.335M Inception Date: 10/30/1992 Inception Share Price: $15.00 Inception NAV: $14.18 Tender Offer: No Term: No\""
I have a dataframe df with a place field containing strings that looks like so:
countryName0 / provinceName0 / countyName0 / cityName0
countryName1 / provinceName1
Using this code I can pull out the finest resolution place identifier:
df$shortplace <- trimws(basename(df$place))
or:
df$shortplace <- gsub(".*/ ", "", df$place)
e.g.
cityName0
provinceName1
I can then use the ggmap library to extract geocodes for cityName0 and provinceName1:
df$geo <- geocode(df$shortplace)
Result looks like this:
geo.lat geo.long
-33.789 147.909
-29.333 133.819
Unfortunately, some city names are not unique, e.g. Perth is the capital of Western Australia, a town in Tasmania, and a city in Scotland. What I need to do is extract not just the place identifier after the last "/" but everything after the second-last "/" (and replace that "/" with a " ") to provide more information for the geocode() function. How do I scan back to the second-last "/" and extract the last two place names? E.g.
shortplace
countyName0 cityName0
countryName1 provinceName1
There are other ways, but strsplit() seems the most straightforward to me here. Give this a try:
x = "countryName0 / provinceName0 / countyName0 / cityName0"
x_split = strsplit(x, " / ")[[1]] # Somewhat confusingly, result of strsplit() is a list; [[1]] pulls out the one and only entry here
n_terms = length(x_split)
result = paste(x_split[n_terms - 1], x_split[n_terms], sep = ", ")
result
# [1] "countyName0, cityName0"
One option is sub: match the alphanumeric characters followed by one or more spaces (\\s+), a /, more spaces, and then another set of alphanumeric characters up to the end of the string ($); capture the two alphanumeric parts as groups and replace the whole match with the backreferences (\\1 \\2) of those capture groups.
df$shortplace <- sub(".*\\b([[:alnum:]]+)\\s+\\/\\s+([[:alnum:]]+)$", "\\1 \\2", df$place)
df$shortplace
#[1] "countyName0 cityName0" "countryName1 provinceName1"
This worked for me in the end:
df$shortplace <- gsub("((?:/[^/\r\n]*){2})$", "\1", df$place)
df$shortplace <- gsub("\\ / ", ", ", df$place)
Not super elegant but it does the job.
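As a quick check, a reproducible sketch of that on the two example rows from the question:
df <- data.frame(place = c("countryName0 / provinceName0 / countyName0 / cityName0",
                           "countryName1 / provinceName1"),
                 stringsAsFactors = FALSE)
df$shortplace <- sub("^.*/ ([^/]+ / [^/]+)$", "\\1", df$place)
df$shortplace <- gsub(" / ", " ", df$shortplace)
df$shortplace
# [1] "countyName0 cityName0"       "countryName1 provinceName1"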
I have a list of names and would like to extract the last name of each individual. The complication is that some of the entries have middle names, some have nicknames, etc. Here's my example, building off a similar question but with the formatting changed to reflect my situation:
df <- c("bob smith","mary ann d. jane","jose chung","michael mike marx","charlie m. ivan")
To get the first names, I use the following:
firstnames <- sapply(strsplit(df, " "), '[',1)
Is there any way to get the element in "final" position, however? Thanks in advance.
> lastnames <- sapply(strsplit(df, " "), tail, 1)
>
> lastnames
[1] "smith" "jane" "chung" "marx" "ivan"
I have a data frame with several variables. What I want is to create a string by concatenating the variable names, but with something else in between them...
Here is a simplified example (the number of variables is reduced to only 3, whereas I actually have many).
Making up some data frame
df1 <- data.frame(1,2,3) # A one row data frame
names(df1) <- c('Location1','Location2','Location3')
Actual code...
len1 <- ncol(df1)
string1 <- 'The locations that we are considering are'
for(i in 1:(len1-1)) string1 <- c(string1,paste(names(df1[i]),sep=','))
string1 <- c(string1,'and',paste(names(df1[len1]),'.'))
string1
This gives...
[1] "The locations that we are considering are"
[2] "Location1"
[3] "Location2"
[4] "Location3 ."
But I want
The locations that we are considering are Location1, Location2 and Location3.
I am sure there is a much simpler method which some of you would know...
Thank you for your time...
Are you looking for the collapse argument of paste?
> paste(letters[1:3], collapse = " and ")
[1] "a and b and c"
The fact that these are names of a data.frame does not really matter, so I've pulled that part out and assigned them to a variable strs.
strs <- names(df1)
len1 <- length(strs)
string1 <- paste("The locations that we are considering are ",
paste(strs[-len1], collapse=", ", sep=""),
" and ",
strs[len1],
".\n",
sep="")
This gives
> cat(string1)
The locations that we are considering are Location1, Location2 and Location3.
Note that this will not give sensible English if there is only 1 element in strs.
The idea is to collapse all but the last string with comma-space between them, and then paste that together with the boilerplate text and the last string.
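If this comes up more than once, the idea can be wrapped in a small helper that also copes with the single-element case mentioned above (a sketch; and_list is just a made-up name, not an existing function):
# hypothetical helper: comma-separate all but the last element, then add "and"
and_list <- function(x) {
  if (length(x) == 1) return(x)
  paste0(paste(x[-length(x)], collapse = ", "), " and ", x[length(x)])
}
paste0("The locations that we are considering are ", and_list(names(df1)), ".")
# [1] "The locations that we are considering are Location1, Location2 and Location3."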
If your main goal is to print the results to the screen (or other output) then use the cat function (whose name derives from concatenate):
> cat(names(iris), sep=' and '); cat('\n')
Sepal.Length and Sepal.Width and Petal.Length and Petal.Width and Species
If you need a variable with the string, then you can use paste with the collapse argument. The sprintf function can also be useful for inserting strings into other strings (or numbers into strings).
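For example, a rough sketch of the sprintf() route, reusing df1 from the question:
locs <- names(df1)
sprintf("The locations that we are considering are %s and %s.",
        paste(locs[-length(locs)], collapse = ", "),
        locs[length(locs)])
# [1] "The locations that we are considering are Location1, Location2 and Location3."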
Another option would be:
library(stringr)
str_c("The locations that we are considering are ",
      str_c(str_c(head(names(df1), -1), collapse = ", "),
            names(df1)[length(names(df1))], sep = " and "))