Extracting first word after a specific expression in R

Extracting first word after a specific expression in R - r

I have a column that contains thousands of descriptions like this (example) :
Description
Building a hospital in the city of LA, USA
Building a school in the city of NYC, USA
Building shops in the city of Chicago, USA
I'd like to create a column with the first word after "city of", like that :
Description
City
Building a hospital in the city of LA, USA
LA
Building a school in the city of NYC, USA
NYC
Building shops in the city of Chicago, USA
Chicago
I tried with the following code after seeing this topic Extracting string after specific word, but my column is only filled with missing values
library(stringr)
df$city <- data.frame(str_extract(df$Description, "(?<=city of:\\s)[^;]+"))
df$city <- data.frame(str_extract(df$Description, "(?<=of:\\s)[^;]+"))
I took a look at the dput() and the output is the same than the descriptions i see in the dataframe directly.

Solution
This should make the trick for the data you showed:
df$city <- str_extract(df$Description, "(?<=city of )(\\w+)")
df
#> Description city
#> 1 Building a hospital in the city of LA, USA LA
#> 2 Building a school in the city of NYC, USA NYC
#> 3 Building shops in the city of Chicago, USA Chicago
Alternative
However, in case you want the whole string till the first comma (for example in case of cities with a blank in the name), you can go with:
df$city <- str_extract(df$Description, "(?<=city of )(.+)(?=,)")
Check out the following example:
df <- data.frame(Description = c("Building a hospital in the city of LA, USA",
"Building a school in the city of NYC, USA",
"Building shops in the city of Chicago, USA",
"Building a church in the city of Salt Lake City, USA"))
str_extract(df$Description, "(?<=the city of )(\\w+)")
#> [1] "LA" "NYC" "Chicago" "Salt"
str_extract(df$Description, "(?<=the city of )(.+)(?=,)")
#> [1] "LA" "NYC" "Chicago" "Salt Lake City"
Documentation
Check out ?regex:
Patterns (?=...) and (?!...) are zero-width positive and negative
lookahead assertions: they match if an attempt to match the ...
forward from the current position would succeed (or not), but use up
no characters in the string being processed. Patterns (?<=...) and
(?<!...) are the lookbehind equivalents: they do not allow repetition
quantifiers nor \C in ....

Related

Unknown issue prevents geocode_reverse from working

I recently update RStudio to the version RStudio 2022.07.1, working on Windows 10.
When I tried different geocode reverse functions(Which is input coordinate, output is the address), they all return no found.
Example 1:
library(revgeo)
revgeo(-77.016472, 38.785026)
Suppose return "146 National Plaza, Fort Washington, Maryland, 20745, United States of America". But I got
"Getting geocode data from Photon: http://photon.komoot.de/reverse?lon=-77.016472&lat=38.785026"
[[1]]
[1] "House Number Not Found Street Not Found, City Not Found, State Not Found, Postcode Not Found, Country Not Found"
Data from https://github.com/mhudecheck/revgeo
Example 2:
library(tidygeocoder)
library(dplyr)
path <- "filepath"
df <- read.csv (paste (path, "sample.csv", sep = ""))
reverse <- df %>%
reverse_geocode(lat = longitude, long = latitude, method = 'osm',
address = address_found, full_results = TRUE)
reverse
Where the sample.csv is
name
addr
latitude
longitude
White House
1600 Pennsylvania Ave NW, Washington, DC
38.89770
-77.03655
Transamerica Pyramid
600 Montgomery St, San Francisco, CA 94111
37.79520
-122.40279
Willis Tower
233 S Wacker Dr, Chicago, IL 60606
41.87535
-87.63576
Suppose to get
name
addr
latitude
longitude
address_found
White House
1600 Pennsylvania Ave NW, Washington, DC
38.89770
-77.03655
White House, 1600, Pennsylvania Avenue Northwest, Washington, District of Columbia, 20500, United States
Transamerica Pyramid
600 Montgomery St, San Francisco, CA 94111
37.79520
-122.40279
Transamerica Pyramid, 600, Montgomery Street, Chinatown, San Francisco, San Francisco City and County, San Francisco, California, 94111, United States
Willis Tower
233 S Wacker Dr, Chicago, IL 60606
41.87535
-87.63576
South Wacker Drive, Printer’s Row, Loop, Chicago, Cook County, Illinois, 60606, United States
But I got
# A tibble: 3 × 5
name addr latitude longitude address_found
<chr> <chr> <dbl> <dbl> <chr>
1 White House 1600 Pennsylvania Ave NW, Wash… 38.9 -77.0 NA
2 Transamerica Pyramid 600 Montgomery St, San Francis… 37.8 -122. NA
3 Willis Tower 233 S Wacker Dr, Chicago, IL 6… 41.9 -87.6 NA
Data source: https://cran.r-project.org/web/packages/tidygeocoder/readme/README.html
However, when I tried
reverse_geo(lat = 38.895865, long = -77.0307713, method = "osm")
I'm able to get
# A tibble: 1 × 3
lat long address
<dbl> <dbl> <chr>
1 38.9 -77.0 Pennsylvania Avenue, Washington, District of Columbia, 20045, United States
I had contact the tidygeocoder developer, he/she didn't find out any problem. Detail in https://github.com/jessecambon/tidygeocoder/issues/175
Not sure which part goes wrong. Anyone want try on their RStudio?

The updated revgeo needs to be submitted to CRAN. This has nothing to do with RStudio.
Going to http://photon.komoot.de/reverse?lon=-77.016472&lat=38.785026 in my browser also returns an error. However, I searched for the Photon reverse geocoder, and their example uses .io not .de in the URL, and https://photon.komoot.io/reverse?lon=-77.016472&lat=38.785026 works.
Photon also include a Note at the bottom of their examples:
Until October 2020 the API was available under photon.komoot.de. Requests still work as they redirected to photon.komoot.io but please update your apps accordingly.
Seems like that redirect is either broken or deprecated.
The version of revgeo on github has this change made already, so you can get a working version by using remotes::install_github("https://github.com/mhudecheck/revgeo")

How to extract one element of text from a column in R?

I'm working with a data frame that contains the locations of where people got tested for COVID. There is not standardization of formatting of the ordering facility (the place that ordered the test). My data frame look something like this:
TestingLocation <- data.frame(TestingLocation= c("New York Hospital One", "Chicago Clinic Two", "Nursing Home Name One",
"Los Angeles University_Testing_Site", "Test-Site-in-BOSTON-MA"))
I have a list of the cities where someone could get tested.
Cities <- data.frame(PossibleTestCities=c("Los Angeles", "Chicago", "New York", "Miami", "Boston", "Austin", "Santa Fe"))
Is there a way to use the Cities frame I have to extract the city and put it into a new column. Additionally, if no city appears, to put "Unknown" or something along those lines? Ideally, my frame would look like this:
DesiredFrame <- data.frame(TestingLocation= c("New York Hospital One", "Chicago Clinic Two", "Nursing Home Name One",
"Los Angeles University_Testing_Site", "Test-Site-in-BOSTON-MA"),
TestCity= c("New York", "Chicago", "Unknown", "Los Angeles", "Boston"))
Thank you!

Does this work:
library(dplyr)
library(stringr)
TestingLocation %>% mutate(TestCity = str_to_title(str_extract(toupper(TestingLocation), toupper(str_c(Cities$PossibleTestCities, collapse = '|'))))) %>%
mutate(TestCity = replace_na(TestCity, 'Unknown'))
TestingLocation TestCity
1 New York Hospital One New York
2 Chicago Clinic Two Chicago
3 Nursing Home Name One Unknown
4 Los Angeles University_Testing_Site Los Angeles
5 Test-Site-in-BOSTON-MA Boston

This doesn't look pretty but it works:
TestingLocation$TestCity <- sub("(^[a-z]+.*$)", NA, sub(paste0(".*(",
paste(tolower(Cities$PossibleTestCities), collapse = "|"),").*"),
"\\U\\1", tolower(TestingLocation$TestingLocation), perl = T))
There's a number of operations involved. There are two sub operations one nested in the other. The first is to replace the (lower-case) TestingLocation$TestingLocations with the matching (lower-case) Cities$PossibleTestCities and set the replacements to upper-case, while the second is to set the values that did not find a match and that hence remained lower-case to NA.
Instead of using a compact but hard-to parse single piece of code you can achieve the substitutions step-by-step:
# 1. define pattern with alternatives:
mypattern <- paste0(".*(", paste(tolower(Cities$PossibleTestCities), collapse = "|"),").*")
# 2. perform first substitution to set matches to City names:
TestingLocation$TestCity <- sub(mypattern, "\\U\\1", tolower(TestingLocation$TestingLocation), perl = T)
# 3. perform second substitution to set non-match to NA:
TestingLocation$TestCity <- sub("(^[a-z]+.*$)", NA, TestingLocation$TestCity)
Result:
TestingLocation
TestingLocation TestCity
1 New York Hospital One NEW YORK
2 Chicago Clinic Two CHICAGO
3 Nursing Home Name One <NA>
4 Los Angeles University_Testing_Site LOS ANGELES
5 Test-Site-in-BOSTON-MA BOSTON

Can I create a vector with regexps?

My data looks somthing like this:
412 U CA, Riverside
413 U British Columbia
414 CREI
415 U Pompeu Fabra
416 Office of the Comptroller of the Currency, US Department of the Treasury
417 Bureau of Economics, US Federal Trade Commission
418 U Carlos III de Madrid
419 U Brescia
420 LUISS Guido Carli
421 U Alicante
422 Harvard Society of Fellows
423 Toulouse School of Economics
424 Decision Economics Inc, Boston, MA
425 ECARES, Free U Brussels
I will need to geocode this data in order to get the coordinates for each specific institution. in order to do that I need all state names to be spelled out. At the same time I don't want acronyms like "ECARES" to be transformed into "ECaliforniaRES".
I have been toying with the idea of converting the state.abb and state.name vectors into vectors of regular expressions, so that state.abb would look something like this (Using Alabama and California as state 1 and state 2):
c("^AL "|" AL "|" AL,"|",AL "| " AL$", "^CA "[....])
And the state.name vector something like this:
c("^Alabama "|" Alabama "|" Alabama,"|",Alabama "| " Alabama$", "^California "[....])
Hopefully, I can then use the mgsub function to replace all expressions in the modified state.abb vector with the corresponding entries in the modified state.name vector.
For some reason, however, it doesn't seem to be possible to put regexps in a vector:
test<-c(^AL, ^AB)
Error: unexpected '^' in "test<-c(^"
I have tried excusing the "^"-signs but this doesnt really seem to work:
test<-c(\^AL, \^AB)
Error: unexpected input in "test<-c(\"
> test<-c(\\^AL, \\^AB)
Is there any way of putting regexps in a vector, or is there another way of achieving my goal (that is, to replace all two-letter state abbreviations to state names without messing up other acronyms in the process)?
Excerpt of my data:
c("U Lausanne", "Swiss Finance Institute", "U CA, Riverside",
"U British Columbia", "CREI", "U Pompeu Fabra", "Office of the Comptroller of the Currency, US Department of the Treasury",
"Bureau of Economics, US Federal Trade Commission", "U Carlos III de Madrid",
"U Brescia", "LUISS Guido Carli", "U Alicante", "Harvard Society of Fellows",
"Toulouse School of Economics", "Decision Economics Inc, Boston, MA",
"ECARES, Free U Brussels", "Baylor U", "Research Centre for Education",
"the Labour Market, Maastricht U", "U Bonn", "Swarthmore College"
)

We can make use of the state.abb vector and paste it together by collapseing with |
pat1 <- paste0("\\b(", paste(state.abb, collapse="|"), ")\\b")
The \\b signifies the word boundary so that indiscriminate matches "SAL" can be avoided
and similarly with state.name, paste the ^ and $ as prefix/suffix to mark the start, end of the string respectively
pat2 <- paste0("^(", paste(state.name, collapse="|"), ")$")

Extracting cities from string vector in R

I have a column in my dataset db, say db$affiliation, which looks like:
**db$affiliation**
[1] "[SCOTT, ALLEN J.] UNIV CALIF LOS ANGELES, DEPT GEOG, LOS ANGELES, CA 90095 USA"
[2] "[VAN DUINEN, RIANNE; VAN DER VEEN, ANNE] UNIV TWENTE, DEPT WATER ENGN & MANAGEMENT, DRIENERLOLAAN 5,POB 217, NL-7500 AE ENSCHEDE, NETHERLANDS."
[3] "[ANANTSUKSOMSRI, SUTEE] CHULALONGKORN UNIV, FAC ARCHITECTURE, BANGKOK, THAILAND."
[4] ...
I would like to create a column within the same dataset containing only the name of the city in db$affiliation, such as
**db$cities**
[1] LOS ANGELES
[2] TWENTE
[3] BANGKOK
[4] ...
If multiple city names are available, I'd like the command to return only the last one, if no city names are available I'd like to have NA. How can I do that?
I thought that I could use world.cities$name in data(world.cities) in the maps package but I can not figure out how.
I even tried to split the db$affiliation column such as:
db$affiliation <- gsub("\\[[^\\]]*\\]", "", db$affiliation, perl=TRUE) # remove content within brackets
db$affiliation[2] # check the separator
db <- cSplit(db, 'affiliation', sep=c(", "), type.convert=FALSE) # split after comma
Which results (I've truncated it after affiliation_3) in:
affiliation_1 affiliation_2 affiliation_3
[1] UNIV CALIF LOS ANGELES DEPT GEOG LOS ANGELES
[2] UNIV TWENTE DEPT WATER ENGN & MANAGEMENT DRIENERLOLAAN
[3] CHULALONGKORN UNIV FAC ARCHITECTURE BANGKOK
And then pass:
db$cities <- lapply(db$affiliation_1, function(x)x[which(x %in% world.cities$name)])
But I get an empty column.
Thanks for the help!

There are many cities in your sample string so you may need to think again if you still want to fetch the 'last city' in case multiple cities are found in affiliation column.
library(maps)
data(world.cities)
#sample data
df <- data.frame(affiliation = c("[SCOTT, ALLEN J.] UNIV CALIF LOS ANGELES, DEPT GEOG, LOS ANGELES, CA 90095 USA",
"[VAN DUINEN, RIANNE; VAN DER VEEN, ANNE] UNIV TWENTE, DEPT WATER ENGN & MANAGEMENT, DRIENERLOLAAN 5,POB 217, NL-7500 AE ENSCHEDE, NETHERLANDS.",
"[ANANTSUKSOMSRI, SUTEE] CHULALONGKORN UNIV, FAC ARCHITECTURE, BANGKOK, THAILAND.",
"Prem"), stringsAsFactors = F)
#fetch city and it's respective country from 'affiliation' column
cities_country <- lapply(gsub("\\[|\\]|[,;]|\\.","",df$affiliation), function(x)
paste(as.character(world.cities$name[sapply(world.cities$name, grepl, x, ignore.case=T)]),
as.character(world.cities$country.etc[sapply(world.cities$name, grepl, x, ignore.case=T)]),
sep="_"))
df$cities_country <- lapply(cities_country, function(x) if(identical(x, character(0))) NA_character_ else x)
df
Output is:
affiliation
1 [SCOTT, ALLEN J.] UNIV CALIF LOS ANGELES, DEPT GEOG, LOS ANGELES, CA 90095 USA
2 [VAN DUINEN, RIANNE; VAN DER VEEN, ANNE] UNIV TWENTE, DEPT WATER ENGN & MANAGEMENT, DRIENERLOLAAN 5,POB 217, NL-7500 AE ENSCHEDE, NETHERLANDS.
3 [ANANTSUKSOMSRI, SUTEE] CHULALONGKORN UNIV, FAC ARCHITECTURE, BANGKOK, THAILAND.
4 Prem
cities_country
1 Al_Norway, Alle_Switzerland, Allen_Philippines, Allen_USA, Angeles_Costa Rica, Angeles_Philippines, Cali_Colombia, Cot_Costa Rica, Li_Norway, Los Angeles_Chile, Los Angeles_USA, Os_Kyrgyzstan, Os_Norway, U_Micronesia, Usa_Japan
2 Ae_Marshall Islands, Ede_Netherlands, Ede_Nigeria, Enschede_Netherlands, Hede_China, Ine_Marshall Islands, Laa_Austria, Lola_Guinea, Man_Ivory Coast, Mana_French Guiana, Manage_Belgium, Nagem_Luxembourg, Ob_Russia, Ola_Panama, Po_Burkina Faso, U_Micronesia, Van_Turkey, Wa_Ghana, We_New Caledonia
3 Aila_Estonia, Al_Norway, Anan_Japan, Ba_Fiji, Bangkok_Thailand, Hit_Iraq, Ila_Nigeria, Ilan_Taiwan, Long_Thailand, Nan_Thailand, Tsu_Japan, U_Micronesia, Ula_Turkey
4 NA
(Note that in above output I have kept all occurrences of cities and for convenience also suffixed it with their respective countries)

From the few lines you have shown it looks like you might be able to do the following (note you missed aligning the casing):
tmpVec <- sapply(strsplit(db$affiliation, split = ","), function(x) {
cleanVec <- toupper(trimws(x))
cleanVec[max(which(cleanVec %in% toupper(maps::world.cities$name)))]
})
Or put a bit more code into the function to avoid the ugly warnings.

Let me leave a part of a solution. As far as I can tell from my own research, letters in the square brackets seem to indicate personal names. For example, I found that Sutee Anantsuksomsri is an actual name. This observation suggests that we probably want to remove texts in the brackets.
Once I removed the texts in the square brackets, I split the words using unnest_tokens() in the tidytext package. Note that the function converts all letters to small letters. If you do not like it, you can change that by specifying to_lower = FALSE. First, I split each city name into word. I also assigned an ID number for each city. Second, I cleaned up your data. As I said earlier, I removed texts in square brackets using gsub(). Then, I applied unnest_tokens() to the data. I subset words using the words from cities in filter(). The result we get up to this point is the following. Obviously, you have more work to do. I leave the sampling data, mydf below. I hope you can move on from here.
data(world.cities)
cities <- world.cities %>%
mutate(id = 1:n()) %>%
unnest_tokens(input = name, output = word, token = "words")
temp <- mydf %>%
mutate(affiliation = gsub(x = affiliation, pattern = "\\[.*\\]", replacement = "")) %>%
unnest_tokens(input = affiliation, output = word, token = "words") %>%
filter(word %in% cities$word)
id word
1 1 los
2 1 angeles
3 1 los
4 1 angeles
5 1 ca
6 1 usa
7 2 water
8 2 ae
9 2 enschede
10 3 bangkok
DATA
mydf <- structure(list(id = 1:3, affiliation = c("[SCOTT, ALLEN J.] UNIV CALIF LOS ANGELES, DEPT GEOG, LOS ANGELES, CA 90095 USA",
"[VAN DUINEN, RIANNE; VAN DER VEEN, ANNE] UNIV TWENTE, DEPT WATER ENGN & MANAGEMENT, DRIENERLOLAAN 5,POB 217, NL-7500 AE ENSCHEDE, NETHERLANDS.",
"[ANANTSUKSOMSRI, SUTEE] CHULALONGKORN UNIV, FAC ARCHITECTURE, BANGKOK, THAILAND."
)), .Names = c("id", "affiliation"), row.names = c(NA, -3L), class = "data.frame")

R: Match character vector with another character vector [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 5 years ago.
Improve this question
Can't wrap my mind around this task
Consider a data frame "usa" with 3 columns, "title", "city" and "state" (reproducible):
title <- c("Events in Chicago, September", "California hotels",
"Los Angeles, August", "Restaurant in Chicago")
city <- c("","", "Los Angeles", "Chicago")
state <- c("","", "California", "IL")
usa <-data.frame(title, city, state)
Resulting in this:
title city state
1 Events in Chicago, September
2 California hotels
3 Los Angeles, August Los Angeles California
4 Restaurant in Chicago Chicago IL
Now what I am trying to do is to fill the STATE variable for the first 2 observations, which are now missing.
TITLE variable contains a clue: either a city or a state is mentioned in each of the entries.
I need to do the following:
Check if any word in "title" column matches any observation found in "city" and "state" columns;
If any word in "title" matches any observation in "state", paste the same state for the given title's observation;
If any word in "title" matches any observation in "city", paste the matched city's state in the "state" column of the title's row.
So what I want to get eventually is this:
title city state
1 Events in Chicago, September IL
2 California hotels California
3 Los Angeles, August Los Angeles California
4 Restaurant in Chicago Chicago IL
In other words, in the second row the title contained a word "California", so a matching state was found from state vector. However, in the first line, the word "Chicago" was the key, and there was another entry in the data frame (row 4), which linked Chicago to "IL" state, so "IL" has to be pasted in the first row of "state" column.
Waiting for the community's ideas :) Thanks!

I would recommend you use the stringr package; specifically, a function called str_extract.
If you have a complete list of cities, e.g. city <- c("Los Angeles", "Chicago"), then you can make it into regular expression using paste(city, collapse = '|'). That will give you: 'Los Angeles|Chicago'. With str_extract, you can extract that city (will extract the first one it sees, and an NA if none appear). Here's the complete code. Note: this only works if your dataframe is a data_frame (tibble), not a data.frame (not totally sure why, haven't looked into it)
library(tidyverse)
library(stringr)
title <- c("Events in Chicago, September", "California hotels",
"Los Angeles, August", "Restaurant in Chicago")
city <- c("","", "Los Angeles", "Chicago")
state <- c("","", "California", "IL")
usa <-data_frame(title, city, state) # notice this is a data_frame not data.frame
cities <- paste(c("Los Angeles", "Chicago"), collapse = '|')
states <- paste(c("California", "IL"), collapse = '|')
usa <- usa %>%
mutate(city = ifelse(city == '', str_extract(title, cities), city),
state = ifelse(state == '', str_extract(title, states), state))
This results in:
# A tibble: 4 x 3
title city state
<chr> <chr> <chr>
1 Events in Chicago, September Chicago <NA>
2 California hotels <NA> California
3 Los Angeles, August Los Angeles California
4 Restaurant in Chicago Chicago IL

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex