I have a character vector in which I'd like to match a specific string, collapse the element containing the match with only the next element, and then repeat this until the end of the vector. Here is one example:
'"FundSponsor:Blackrock Advisors" "Category:" "Tax-Free Income-Pennsylvania" "Ticker:" "MPA" "NAV Ticker:" "XMPAX" "Average Daily Volume (shares):" "26,000" "Average Daily Volume (USD):" "$0.335M" "Inception Date:" "10/30/1992" "Inception Share Price:" "$15.00" "Inception NAV:" "$14.18" "Tender Offer:" "No" "Term:" "No"'
Combining each element containing a : with only the element following it would be great, but I've struggled with the paste function: it collapses the entire vector into one element, which is not the targeted result I'm looking for.
Here's an example of what I'd like a portion of the revised output to look like:
"Inception Share Price:$15.00"
Here is something that might help:
First split using strsplit, then bind elements that belong together
# split the string
vec <- unlist(strsplit(string, '(?=\")(?=\")', perl = TRUE))
vec <- vec[! vec %in% c(' ', '\"')]
# this is what vec looks like now
head(vec)
# [1] "FundSponsor:Blackrock Advisors" "Category:" "Tax-Free Income-Pennsylvania" "Ticker:" "MPA"
# [6] "NAV Ticker:"
#
# now paste the elements
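ind <- grepl(':.+',vec)   # elements that already hold "key:value"
tmp <- vec[!ind]          # the rest alternate between "key:" and value
# pair each key with its value; already complete elements are moved to the front
vec <- c(vec[ind], paste0(tmp[seq(1,length(tmp),2)], tmp[seq(2,length(tmp),2)]))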
head(vec)
# [1] "FundSponsor:Blackrock Advisors" "Category:Tax-Free Income-Pennsylvania" "Ticker:MPA" "NAV Ticker:XMPAX"
# [5] "Average Daily Volume (shares):26,000" "Average Daily Volume (USD):$0.335M"
with the data
string = "\"FundSponsor:Blackrock Advisors\" \"Category:\" \"Tax-Free Income-Pennsylvania\" \"Ticker:\" \"MPA\" \"NAV Ticker:\" \"XMPAX\" \"Average Daily Volume (shares):\" \"26,000\" \"Average Daily Volume (USD):\" \"$0.335M\" \"Inception Date:\" \"10/30/1992\" \"Inception Share Price:\" \"$15.00\" \"Inception NAV:\" \"$14.18\" \"Tender Offer:\" \"No\" \"Term:\" \"No\""
Explanation
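The regex (?=\")(?=\") consists of two identical lookaheads, so it is equivalent to (?=\"). The syntax (?=*something*) is a zero-width lookahead: it matches a position that is immediately followed by *something* without consuming it. So the above simply reads: split the string at every position that precedes a \". Because strsplit emits a single character as its own piece when a zero-length match sits at the very start of the remaining text, each \" ends up as a separate element.
The strsplit(...) above therefore creates elements of the form \" and ' ' ('\"Category:\" \"...' becomes the vector '\"';'Category:';'\"';' ';'\"';'...'). So by using ! vec %in% c(...) we remove those unwanted elements.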
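To see the splitting behaviour on a smaller input, here is a minimal sketch. The two-field string small is a made-up example, not part of the original data, and the single lookahead (?=") is used in place of the doubled one on the assumption that the two are equivalent.

```r
# Hypothetical two-field input, shortened from the data in this answer
small <- '"a:" "b"'

# (?=") is a zero-width lookahead: it matches just before every double quote
parts <- unlist(strsplit(small, '(?=")', perl = TRUE))
print(parts)

# Drop the separators (lone quotes and single spaces), keeping only the fields
fields <- parts[!parts %in% c(' ', '"')]
print(fields)
```

Printing parts before filtering makes it easy to check which separator elements need to be removed for a given input.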
Addendum
If the data contains keys of the form "string:" whose value is just " ", remove the line vec <- vec[! vec %in% c(' ', '\"')] from the code above and add these lines instead:
vec <- vec[seq(2L, length(vec), 4L)]
vec[vec == ' '] <- NA_character_
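If the data is already a character vector (as the question describes) rather than one long string, the pairing can be sketched without any splitting. This is only an illustration under two assumptions: every element ending in ":" is a key, and every key is directly followed by its value. The shortened vec and the names is_key, is_value and result are made up for the example.

```r
# Shortened, hypothetical version of the vector from the question
vec <- c("FundSponsor:Blackrock Advisors", "Category:",
         "Tax-Free Income-Pennsylvania", "Ticker:", "MPA",
         "Inception Share Price:", "$15.00")

# A key is an element that ends in ":"
is_key <- grepl(":$", vec)

# A value is the element immediately after a key
is_value <- c(FALSE, head(is_key, -1))

# Paste each key onto its value, then drop the now redundant value elements
merged <- vec
merged[is_key] <- paste0(vec[is_key], vec[is_value])
result <- merged[!is_value]
print(result)
```

Unlike the recycling-prone odd/even indexing, this keeps elements that are already complete (such as the first one) in their original position.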
I am not sure if you want the outcome to be one single key: value format or if you just want to clean that long string and have it in the following format key1: value1 key2: value2 key3: value3. If this is the case, you can achieve it via the following code:
char = '"FundSponsor:Blackrock Advisors" "Category:" "Tax-Free Income-Pennsylvania" "Ticker:" "MPA" "NAV Ticker:" "XMPAX" "Average Daily Volume (shares):" "26,000" "Average Daily Volume (USD):" "$0.335M" "Inception Date:" "10/30/1992" "Inception Share Price:" "$15.00" "Inception NAV:" "$14.18" "Tender Offer:" "No" "Term:" "No"'
char_tidy = gsub('\\" \\"', " ", char)
# output is below
> char_tidy
[1] "\"FundSponsor:Blackrock Advisors Category: Tax-Free Income-Pennsylvania Ticker: MPA NAV Ticker: XMPAX Average Daily Volume (shares): 26,000 Average Daily Volume (USD): $0.335M Inception Date: 10/30/1992 Inception Share Price: $15.00 Inception NAV: $14.18 Tender Offer: No Term: No\""
Related
I have a dataframe df with a place field containing strings that looks like so:
countryName0 / provinceName0 / countyName0 / cityName0
countryName1 / provinceName1
Using this code I can pull out the finest resolution place identifier:
df$shortplace <- trimws(basename(df$place))
or:
df$shortplace <- gsub(".*/ ", "", df$place)
e.g.
cityName0
provinceName1
I can then use ggmap library to extract geocodes for cityName0 and provinceName1:
df$geo <- geocode(df$shortplace)
Result looks like this:
geo.lat geo.long
-33.789 147.909
-29.333 133.819
Unfortunately, some city names are not unique, e.g. Perth is the capital of Western Australia, a town in Tasmania, and a city in Scotland. What I need is to extract everything after the second-last "/" rather than after the last "/" (and replace the "/" with a " ") to give the geocode() function more information. How do I scan to the second-last "/" and extract the last two place names? E.g.
shortplace
countyName0 cityName0
countryName1 provinceName1
There are other ways, but strsplit() seems the most straightforward to me here. Give this a try:
x = "countryName0 / provinceName0 / countyName0 / cityName0"
x_split = strsplit(x, " / ")[[1]] # Somewhat confusingly, result of strsplit() is a list; [[1]] pulls out the one and only entry here
n_terms = length(x_split)
result = paste(x_split[n_terms - 1], x_split[n_terms], sep = ", ")
result
# [1] "countyName0, cityName0"
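The same idea can be applied to a whole column at once. The following is a rough sketch, assuming the separator is always " / "; the helper name last_two and the vector place are made up for illustration and stand in for df$place.

```r
# Hypothetical stand-in for df$place
place <- c("countryName0 / provinceName0 / countyName0 / cityName0",
           "countryName1 / provinceName1")

# Keep the last n components of each "a / b / c" string, joined by a space
last_two <- function(x, n = 2) {
  parts <- strsplit(x, " / ", fixed = TRUE)
  vapply(parts, function(p) paste(tail(p, n), collapse = " "), character(1))
}

shortplace <- last_two(place)
print(shortplace)
```

Because tail() returns whatever is available, a string with only one component is passed through unchanged.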
One option is sub: match a run of alphanumeric characters, then one or more spaces (\\s+), a /, and one or more spaces again, then another run of alphanumeric characters up to the end of the string ($). Capture both runs as groups and replace the whole match with their backreferences (\\1 \\2).
df$shortplace <- sub(".*\\b([[:alnum:]]+)\\s+\\/\\s+([[:alnum:]]+)$", "\\1 \\2", df$place)
df$shortplace
#[1] "countyName0 cityName0" "countryName1 provinceName1"
This worked for me in the end:
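# keep only the last two "/"-separated components ("\\1" is the backreference)
df$shortplace <- sub("^.*?([^/]+/[^/]+)$", "\\1", df$place, perl = TRUE)
# then turn the remaining separator into ", " and trim the leading space
df$shortplace <- trimws(gsub(" / ", ", ", df$shortplace, fixed = TRUE))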
Not super elegant but it does the job.
I am working on a project where part of cleaning the data is stripping out the country names. The LOCATION_NAME column of my original data frame (named noaa) looks like this:
head(noaa$LOCATION_NAME,5)
[1] "JORDAN: BAB-A-DARAA,AL-KARAK"
[2] "SYRIA: UGARIT"
[3] "TURKMENISTAN: W"
[4] "GREECE: THERA ISLAND (SANTORINI)"
[5] "ISRAEL: ARIHA (JERICHO)"
To strip out the country names I'm using:
noaa$LOCATION_NAME <- gsub('^.*: +', '', noaa$LOCATION_NAME)
It works pretty well, however, I still get entries like:
"ANTAKYA (ANTIOCH); SYRIA"
or
"DIMASHQ; TURKEY:ANTIOCH; LEBANON:TARABULUS" (because the expression doesn't begin with "countryname:")
Removing anything ending with a ":" is not an option, in case of:
"CHINA: YUNNAN PROVINCE: MIDU"
I would like to retain "YUNNAN PROVINCE: MIDU"
For "PAKISTAN: INDUS DELTA; INDIA: SAMAWANI (SAMAJI)"
I would like to retain "INDUS DELTA; SAMAWANI (SAMAJI)"
I also have instances like "SWITZERLAND" (no ":"), where I guess I would put just " " (space).
I have in my data frame a column with country names and I could make a vector with unique country names. I was wondering if there is a smart method to check if a part of a string matches a country name in my country column and if yes, then I could remove it.
I would be grateful for some help on this.
Since the country string might be in different sections of the string, you can split it on ";" and ":" first, then match the pieces against your unique country names:
#dfOfCountries is the data.frame containing all the countries as mentioned in your qn
distinctcountries <- unique(dfOfCountries$COUNTRY)
noaa$COUNTRY <- sapply(noaa$LOCATION_NAME, function(x) {
    strparts <- trimws(unlist(lapply(strsplit(x, ":")[[1]], strsplit, split=";")))
    strparts[strparts %in% distinctcountries]
})
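The code above extracts the matching countries. To remove them instead, one possible sketch is shown below. It assumes a country only ever appears as a whole ";"-separated piece or as the part of a piece before its first ":"; the function name strip_countries and the countries vector are hypothetical.

```r
# Hypothetical list of known country names
countries <- c("JORDAN", "SYRIA", "CHINA", "PAKISTAN", "INDIA", "SWITZERLAND")

strip_countries <- function(x, countries) {
  vapply(x, function(s) {
    # Split on ";" so each piece can be checked on its own
    pieces <- trimws(strsplit(s, ";", fixed = TRUE)[[1]])
    cleaned <- vapply(pieces, function(p) {
      prefix <- trimws(sub(":.*$", "", p))  # text before the first ":"
      if (!(prefix %in% countries)) return(p)
      # Known country: drop "COUNTRY:" or, without a ":", the whole piece
      if (grepl(":", p, fixed = TRUE)) trimws(sub("^[^:]*:", "", p)) else ""
    }, character(1), USE.NAMES = FALSE)
    paste(cleaned[nzchar(cleaned)], collapse = "; ")
  }, character(1), USE.NAMES = FALSE)
}

examples <- c("JORDAN: BAB-A-DARAA,AL-KARAK",
              "CHINA: YUNNAN PROVINCE: MIDU",
              "PAKISTAN: INDUS DELTA; INDIA: SAMAWANI (SAMAJI)",
              "ANTAKYA (ANTIOCH); SYRIA",
              "SWITZERLAND")
print(strip_countries(examples, countries))
```

Only the first ":" of a piece is treated as the country separator, so "YUNNAN PROVINCE: MIDU" is kept intact. An entry that is only a country name comes back as an empty string here.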
This builds a single regex that is an "or" list of patterns (separated by |).
noaa <- read.table(text='
LOCATION_NAME
"JORDAN: BAB-A-DARAA,AL-KARAK"
"SYRIA: UGARIT"
"TURKMENISTAN: W"
"GREECE: THERA ISLAND (SANTORINI)"
"ISRAEL: ARIHA (JERICHO)"
"SWITZERLAND SOMEWHERE"
', header = TRUE, stringsAsFactors = FALSE)
countries <- c("JORDAN", "SYRIA", "GREECE", "SWITZERLAND")
# build an or list of patterns including country name ending with
# either (in priority order) <space>: or : or <space>
# (the suffix is pasted onto every country, including the last one)
patterns <- paste0(countries, "(\\s:|:|\\s)", collapse = "|")
trimws(gsub(patterns, "", noaa$LOCATION_NAME))
# [1] "BAB-A-DARAA,AL-KARAK" "UGARIT" "TURKMENISTAN: W" "THERA ISLAND (SANTORINI)"
# [5] "ISRAEL: ARIHA (JERICHO)" "SOMEWHERE"
I need to keep track of my data through a series of calculations.
I have the following vectors:
start_date <- "2010-11-04"
end_date <- "2010-11-10"
I wish to merge them into a single character vector that looks like this:
"2010-11-04 to 2010-11-10"
Then I would like the inverse:
from "2010-11-04 to 2010-11-10" back to the original two vectors.
This does not use any packages:
p <- paste(start_date, "to", end_date)
p
## [1] "2010-11-04 to 2010-11-10"
Going in the other direction, remove everything from the first space onward to get the string representing the first date, and remove everything up to the last space to get the string representing the second date:
sub(" .*", "", p)
## [1] "2010-11-04"
sub(".* ", "", p)
## [1] "2010-11-10"
An alternative is to split the string using strsplit:
s <- unlist(strsplit(p, " to "))
s
## [1] "2010-11-04" "2010-11-10"
so s[1] is the string representing the first date and s[2] is the string representing the second date.
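The strsplit approach also works for several ranges at once. Below is a small sketch of the round trip; the vectors starts and ends are invented sample data, and the final conversion with as.Date is an optional extra that the question did not ask for.

```r
# Invented sample data: two date ranges
starts <- c("2010-11-04", "2011-01-15")
ends   <- c("2010-11-10", "2011-02-01")

# Forward: one label per range
labels <- paste(starts, "to", ends)

# Inverse: split on the literal " to " and pick the first / second piece
parts <- strsplit(labels, " to ", fixed = TRUE)
recovered_start <- vapply(parts, `[`, character(1), 1)
recovered_end   <- vapply(parts, `[`, character(1), 2)

# Optional: convert to Date to compute with the values
span_days <- as.numeric(as.Date(recovered_end) - as.Date(recovered_start))
print(span_days)
```

Using fixed = TRUE treats " to " as plain text rather than a regular expression.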
UPDATE: Modified as per Richard Scriven's comment.
I have a list of names and would like to extract the last name of each individual. The complication is that some of the entries have middle names, some have nicknames, etc. Here's my example, building off of this question, but changing the formatting to reflect my situation:
df <- c("bob smith","mary ann d. jane","jose chung","michael mike marx","charlie m. ivan")
To get the first names, I use the following:
firstnames <- sapply(strsplit(df, " "), '[',1)
Is there any way to get the element in "final" position, however? Thanks in advance.
> lastnames <- sapply(strsplit(df, " "), tail, 1)
> lastnames
[1] "smith" "jane" "chung" "marx" "ivan"
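A regex alternative, sketched here under the assumption that the last name is whatever follows the final whitespace: sub() removes everything up to and including the last run of whitespace. The vector is renamed people in this sketch to avoid confusion with a data frame.

```r
# Same names as in the question, under a hypothetical variable name
people <- c("bob smith", "mary ann d. jane", "jose chung",
            "michael mike marx", "charlie m. ivan")

# Greedy ".*" runs to the last whitespace, so only the final word survives
lastnames <- sub("^.*\\s+", "", trimws(people))
print(lastnames)
```

A single-word entry has no whitespace to match, so it is returned unchanged.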
I have a data frame with several variables. What I want is to create a string by concatenating the variable names, with something else in between them...
Here is a simplified example (number of variables reduced to only 3 whereas I have actually many)
Making up some data frame
df1 <- data.frame(1,2,3) # A one row data frame
names(df1) <- c('Location1','Location2','Location3')
Actual code...
len1 <- ncol(df1)
string1 <- 'The locations that we are considering are'
for(i in 1:(len1-1)) string1 <- c(string1,paste(names(df1[i]),sep=','))
string1 <- c(string1,'and',paste(names(df1[len1]),'.'))
string1
This gives...
[1] "The locations that we are considering are"
[2] "Location1"
[3] "Location2"
[4] "Location3 ."
But I want
The locations that we are considering are Location1, Location2 and Location3.
I am sure there is a much simpler method which some of you would know...
Thank you for your time...
Are you looking for the collapse argument of paste?
> paste (letters [1:3], collapse = " and ")
[1] "a and b and c"
The fact that these are names of a data.frame does not really matter, so I've pulled that part out and assigned them to a variable strs.
strs <- names(df1)
len1 <- length(strs)
string1 <- paste("The locations that we are considering are ",
paste(strs[-len1], collapse=", ", sep=""),
" and ",
strs[len1],
".\n",
sep="")
This gives
> cat(string1)
The locations that we are considering are Location1, Location2 and Location3.
Note that this will not give sensible English if there is only 1 element in strs.
The idea is to collapse all but the last string with comma-space between them, and then paste that together with the boilerplate text and the last string.
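To cover the short cases as well, the same idea can be wrapped in a small helper. This is only a sketch; the function name join_with_and is made up, and returning an empty string for empty input is an arbitrary choice.

```r
# Join strings as "a, b and c"; hypothetical helper, not from any package
join_with_and <- function(x) {
  n <- length(x)
  if (n == 0) return("")   # arbitrary choice for empty input
  if (n == 1) return(x)    # nothing to join
  # Collapse all but the last with ", ", then attach the last with " and "
  paste(paste(x[-n], collapse = ", "), x[n], sep = " and ")
}

locations <- c("Location1", "Location2", "Location3")
sentence <- paste0("The locations that we are considering are ",
                   join_with_and(locations), ".")
print(sentence)
```

With two elements this produces "a and b", and with one element the element itself.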
If your main goal is to print the results to the screen (or other output) then use the cat function (whose name derives from concatenate):
> cat(names(iris), sep=' and '); cat('\n')
Sepal.Length and Sepal.Width and Petal.Length and Petal.Width and Species
If you need a variable with the string, then you can use paste with the collapse argument. The sprintf function can also be useful for inserting strings into other strings (or numbers into strings).
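As an illustration of the sprintf route, here is a minimal sketch that reuses the df1 example from the question and assumes it has at least two columns.

```r
# Example data frame from the question
df1 <- data.frame(1, 2, 3)
names(df1) <- c("Location1", "Location2", "Location3")

nms <- names(df1)

# First %s: all names but the last, comma-separated; second %s: the last name
sentence <- sprintf("The locations that we are considering are %s and %s.",
                    paste(head(nms, -1), collapse = ", "),
                    tail(nms, 1))
print(sentence)
```

Keeping the template in one string makes the fixed wording easy to change without touching the pasting logic.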
Another option would be:
library(stringr)
nms <- names(df1)
n <- length(nms)   # nms[-n] drops the last name; 1:n-1 only works by accident
str_c("The locations that we are considering are ", str_c(str_c(nms[-n], collapse=", "), nms[n], sep=" and "), ".")