Cleaning and editing a column in R

I've been trying to figure out how to clean and edit a column in my data set.
The data set is supposed to cover only the city of San Francisco, but a column called "city" contains many different spellings of San Francisco, as well as other cities. Here is what it looks like:
table(sf$city)
Brentwood CA
30401 18 370
DALY CITY FOSTER CITY HAYWARD
0 0 0
Novato Oakland OAKLAND
0 40 0
S F S.F. s.F. Ca
0 31428 12
SAN BRUNO SAN FRANCICSO San Franciisco
0 221 54
san francisco san Francisco San francisco
20 284 0
San Francisco SAN FRANCISCO san Francisco CA
78050 16603 6
San Francisco, San Francisco, Ca San Francisco, CA
12 4 72
San Francisco, CA 94132 San Franciscvo San Francsico
0 0 2
San Franicisco Sand Francisco sf
41 30 17
Sf SF SF , CA
214 81226 1
SF CA 94133 SF, CA SF, CA 94110
0 9 38
SF, CA 94115 SF. SF`
4 1656 31
SO. SAN FRANCISCO SO.S.F.
0 6
What I am trying to do is change sf$city so that every entry reads "San Francisco". All the data in sf$city should be grouped under one city, so that typing table(sf$city) shows only San Francisco.
Could I subset? Something like:
sf$city = subset(sf, city == "S.F." & "s.F. Ca" & "SAN FRANCICSO" & ...
And subset all the city variables I want? Or will this distort and mess up my data?

I would try approximate string matching with agrep and regular expressions with grep.
Example data:
d <- c("Brentwood", "CA", "DALY CITY", "FOSTER CITY", "HAYWARD", "Novato",
"Oakland", "OAKLAND", "S F", "S.F.", "s.F. Ca", "SAN BRUNO",
"SAN FRANCICSO", "San Franciisco", "san francisco", "san Francisco",
"San francisco", "San Francisco", "SAN FRANCISCO", "san Francisco CA",
"San Francisco,", "San Francisco, Ca", "San Francisco, CA", "San Francisco, CA 94132",
"San Franciscvo", "San Francsico", "San Franicisco", "Sand Francisco",
"sf", "Sf", "SF", "SF , CA", "SF CA", "94133", "SF, CA", "SF, CA 94110",
"SF, CA 94115", "SF.", "SF`", "SO. SAN FRANCISCO", "SO.S.F.")
You can target strings like "San Francisco" with agrep; the default of max.dist = 0.1 works well enough here. You can then target the S.F. variants using grep:
d[agrep("San Francisco", d, ignore.case = TRUE, max.dist = 0.1)] <- "San Francisco"
d[grep("\\bS[. ]?F\\.?\\b", d, ignore.case = TRUE, perl = TRUE)] <- "San Francisco"
# [1] "Brentwood" "CA" "DALY CITY" "FOSTER CITY"
# [5] "HAYWARD" "Novato" "Oakland" "OAKLAND"
# [9] "San Francisco" "San Francisco" "San Francisco" "SAN BRUNO"
#[13] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[17] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[21] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[25] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[29] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[33] "San Francisco" "94133" "San Francisco" "San Francisco"
#[37] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[41] "San Francisco"
adist is another option for targeting strings like "San Francisco". I found the following settings to work well; they also pick up truncations like "San Fran":
d[adist("San Francisco", d, ignore.case = TRUE,
        costs = c(del = 0.5, ins = 0.5, sub = 3)) < 3] <- "San Francisco"
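Applied to the original data frame rather than the example vector, the same calls work on the column directly. A minimal sketch, assuming sf$city is a character vector (converted first in case it is a factor):
# apply the cleanup to the actual column
sf$city <- as.character(sf$city)
sf$city[agrep("San Francisco", sf$city, ignore.case = TRUE, max.dist = 0.1)] <- "San Francisco"
sf$city[grep("\\bS[. ]?F\\.?\\b", sf$city, ignore.case = TRUE, perl = TRUE)] <- "San Francisco"
table(sf$city)  # check what is left over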

To riff on @jota's answer, you could also take the resulting data set and run it through the Google Maps geocoding API as shown here: https://gist.github.com/josecarlosgonz/6417633
Specifically, using the functions available at that link, you could take the cleaned vector and run:
library(plyr)  # for ldply; geoCode() comes from the gist above
locations <- ldply(d, function(x) geoCode(x))
head(locations, 10)
Which will give you the following output:
# V1 V2 V3 V4
# 1 36.0331164 -86.7827772 APPROXIMATE Brentwood, TN, USA
# 2 36.778261 -119.4179324 APPROXIMATE California, USA
# 3 37.6879241 -122.4702079 APPROXIMATE Daly City, CA, USA
# 4 37.5585465 -122.2710788 APPROXIMATE Foster City, CA, USA
# 5 37.6688205 -122.0807964 APPROXIMATE Hayward, CA, USA
# 6 38.1074198 -122.5697032 APPROXIMATE Novato, CA, USA
# 7 37.8043637 -122.2711137 APPROXIMATE Oakland, CA, USA
# 8 37.8043637 -122.2711137 APPROXIMATE Oakland, CA, USA
# 9 37.7749295 -122.4194155 APPROXIMATE San Francisco, CA, USA
# 10 37.7749295 -122.4194155 APPROXIMATE San Francisco, CA, USA
As it looks like you know that all of your locations are in CA, you may also want to append ", CA" to the entries that don't already mention it, as shown here:
d[grep("CA", d, invert = TRUE)] <- paste0(d[grep("CA", d, invert = TRUE)], ", CA")
locations <- ldply(d, function(x) geoCode(x))
head(locations, 10)
As shown below, this will make sure that Google places Brentwood in CA.
The advantage of this approach is that you will end up with normalized cities in V4, which could be helpful when it comes to filtering and other things.
# V1 V2 V3 V4
# 1 37.931868 -121.6957863 APPROXIMATE Brentwood, CA 94513, USA
# 2 36.778261 -119.4179324 APPROXIMATE California, USA
# 3 37.6879241 -122.4702079 APPROXIMATE Daly City, CA, USA
# 4 37.5585465 -122.2710788 APPROXIMATE Foster City, CA, USA
# 5 37.6688205 -122.0807964 APPROXIMATE Hayward, CA, USA
# 6 38.1074198 -122.5697032 APPROXIMATE Novato, CA, USA
# 7 37.8043637 -122.2711137 APPROXIMATE Oakland, CA, USA
# 8 37.8043637 -122.2711137 APPROXIMATE Oakland, CA, USA
# 9 37.7749295 -122.4194155 APPROXIMATE San Francisco, CA, USA
# 10 37.7749295 -122.4194155 APPROXIMATE San Francisco, CA, USA
NOTE: Google rate-limits its API. If you want to avoid registering and getting an API key, you will want to chunk the ldply calls with 10-second pauses, as suggested in the comments at the GitHub link above.
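For example, a minimal sketch of that pacing idea, assuming the geoCode() helper from the gist is loaded (the chunk size and pause length are arbitrary):
# geocode in chunks, pausing between them to stay under the rate limit
chunks <- split(d, ceiling(seq_along(d) / 10))
locations <- do.call(rbind, lapply(chunks, function(ch) {
  out <- ldply(ch, function(x) geoCode(x))
  Sys.sleep(10)  # pause before the next chunk
  out
}))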

To overwrite sf$city to be "San Francisco" for every entry, here is the typical method:
sf$city <- "San Francisco"
However, if some of your observations are not San Francisco and you would like to drop them, do that first. Here is a start:
# drop non-SF observations (levels taken from the table above)
sfReal <- sf[!(tolower(sf$city) %in% c("brentwood ca", "daly city", "foster city",
                                       "hayward", "novato", "oakland", "san bruno")), ]
My geography is not the best, so I may be missing some. Alternatively, you could use %in% as a whitelist, keeping only the observations that are San Francisco; given the set of spellings you provided above, though, that list would be long.
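If you do want the whitelist version, here is a minimal sketch; the vector of accepted spellings below is illustrative, not complete:
# keep only rows whose city value is on a list of accepted spellings (illustrative)
sfSpellings <- c("san francisco", "sf", "s.f.", "sf, ca", "san francisco, ca")
sfReal <- sf[tolower(sf$city) %in% sfSpellings, ]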
In the future, if this is a repeated task, you should look into regular expressions and grep. This is an amazing tool that will pay gigantic dividends for string manipulation tasks. @jota provides a great method for this in the answer above.

Related

Extract cell with AND without commas in R

I'm trying to extract the city and state from the Address column into 2 separate columns labeled City and State in r. This is what my data looks like:
library(tidyr)
df <- data.frame(address = c("Los Angeles, CA", "Pittsburgh PA", "Miami FL","Baltimore MD", "Philadelphia, PA", "Trenton, NJ")) %>%
separate(address, c("City", "State"), sep=",")
I tried using the separate function but that only gets the ones with commas. Any ideas on how to do this for both cases?
There is a pattern at the end (space, letter, letter) that I could exploit after removing any commas, but I am not sure how the syntax would work using grep.
Starting from your df
df <- data.frame(address = c("Los Angeles, CA", "Pittsburgh PA", "Miami FL","Baltimore MD", "Philadelphia, PA", "Trenton, NJ"))
> df
address
1 Los Angeles, CA
2 Pittsburgh PA
3 Miami FL
4 Baltimore MD
5 Philadelphia, PA
6 Trenton, NJ
It's possible to use gsub to pull each piece out of the string like this:
> city=gsub(',','',gsub("(.*).{3}","\\1",df[,1]))
> city
[1] "Los Angeles" "Pittsburgh" "Miami" "Baltimore" "Philadelphia"
[6] "Trenton"
> state=gsub(".*(\\w{2})","\\1",df[,1])
> state
[1] "CA" "PA" "FL" "MD" "PA" "NJ"
df=data.frame(City=city,State=state)
> df
City State
1 Los Angeles CA
2 Pittsburgh PA
3 Miami FL
4 Baltimore MD
5 Philadelphia PA
6 Trenton NJ
This is a little unorthodox, but it works well. It assumes that all states are 2 characters long and that there is at least 1 space between the city and state; commas are ignored.
df <- data.frame(address = c("Los Angeles, CA", "Pittsburgh PA", "Miami FL","Baltimore MD", "Philadelphia, PA", "Trenton, NJ"))
df$city <- substring(sub(",","",df$address),1,nchar(sub(",","",df$address))-3)  # everything but the last 3 chars (space + state)
df$state <- substring(as.character(df$address),nchar(as.character(df$address))-1,nchar(as.character(df$address)))  # last 2 chars
df <- within(df,rm(address))  # drop the original column
output:
city state
1 Los Angeles CA
2 Pittsburgh PA
3 Miami FL
4 Baltimore MD
5 Philadelphia PA
6 Trenton NJ
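Since the question mentions exploiting the trailing space-letter-letter pattern, here is one more sketch using tidyr::extract with such a regex, assuming the state is always two capital letters at the end of the string:
library(tidyr)
df <- data.frame(address = c("Los Angeles, CA", "Pittsburgh PA", "Miami FL",
                             "Baltimore MD", "Philadelphia, PA", "Trenton, NJ"))
# lazily capture the city, skip an optional comma, then grab the two-letter state
extract(df, address, into = c("City", "State"), regex = "^(.*?),?\\s+([A-Z]{2})$")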

How to add a new column based on rows with a pattern and a sequence in R?

I have a data frame with a column that follows a pattern: a row containing a header-like string, followed by rows containing numbered names. This pattern repeats throughout the data frame.
I want to create a new column based on this condition: when a row's string starts with the word "CANTON" (and has no leading number), copy that string minus the first word ("CANTON") into the new column for all the following rows, until another row starting with "CANTON" appears, at which point the new string takes over.
An example of the dataframe is the next one:
datos <- data.frame(sitio = c("CANTON SAN JOSE", "01 Carmen", "02 Merced",
"03 Hospital", "04 Catedral", "05 San Franscisco",
"CANTON ESCAZU", "01 Escazu", "02 San Antonio", "03 San Rafael" ),
area = c(44.62, 1.49, 2.29, 3.38, 2.31, 2.85, 34.49, 4.38,
16.99, 13.22))
datos
And the expected result would be:
expected_result <-data.frame(
sitio = c("CANTON SAN JOSE", "01 Carmen", "02 Merced",
"03 Hospital", "04 Catedral", "05 San Franscisco",
"CANTON ESCAZU", "01 Escazu", "02 San Antonio",
"03 San Rafael" ),
area = c(44.62, 1.49, 2.29, 3.38, 2.31, 2.85, 34.49, 4.38,
16.99, 13.22),
canton = c("SAN JOSE", "SAN JOSE", "SAN JOSE", "SAN JOSE",
"SAN JOSE", "SAN JOSE", "ESCAZU", "ESCAZU", "ESCAZU",
"ESCAZU"))
I have tried many for loops, subsets, and data frame joins without success; I cannot translate this pattern into R instructions.
Thanks for helping!
Hope this works for your data:
x <- gsub('^CANTON ', '', datos$sitio)     # strip the CANTON prefix
x[!grepl('^CANTON ', datos$sitio)] <- NA   # blank out the non-header rows
datos$canton <- ave(x, cumsum(!is.na(x)), FUN = function(xx) xx[1])  # carry each header forward
# > datos
# sitio area canton
# 1 CANTON SAN JOSE 44.62 SAN JOSE
# 2 01 Carmen 1.49 SAN JOSE
# 3 02 Merced 2.29 SAN JOSE
# 4 03 Hospital 3.38 SAN JOSE
# 5 04 Catedral 2.31 SAN JOSE
# 6 05 San Franscisco 2.85 SAN JOSE
# 7 CANTON ESCAZU 34.49 ESCAZU
# 8 01 Escazu 4.38 ESCAZU
# 9 02 San Antonio 16.99 ESCAZU
# 10 03 San Rafael 13.22 ESCAZU
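A tidyverse alternative, in case it reads more naturally: blank out the non-CANTON rows and carry the last canton forward with tidyr::fill. A sketch assuming the header rows always start with exactly "CANTON ":
library(dplyr)
library(tidyr)
datos %>%
  mutate(canton = if_else(grepl("^CANTON ", sitio),
                          sub("^CANTON ", "", sitio),
                          NA_character_)) %>%
  fill(canton)  # fill each NA with the last non-NA value above it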

Extract city names from large text with R

Hello, I have an intriguing question here. Suppose I have a long character string that includes city names among other things.
test<-"Ucsd Medical Center, San Diego, California, USA|Yale Cancer Center, New Haven, Connecticut, USA|Massachusetts General Hospital., Boston, Massachusetts, USA|Dana Farber Cancer Institute, Boston, Massachusetts, USA|Washington University, Saint Louis, Missouri, USA|Mount SInai Medical Center, New York, New York, USA|Memorial Sloan Kettering Cancer Center, New York, New York, USA|Carolinas Healthcare System, Charlotte, North Carolina, USA|University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA|Vanderbilt University Medical Center, Nashville, Tennessee, USA|Seattle Cancer Care Alliance, Seattle, Washington, USA|National Cancer Center, Gyeonggi-do, Korea, Republic of|Seoul National University Hospital, Seoul, Korea, Republic of|Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of|Korea University Guro Hospital, Seoul, Korea, Republic of|Asan Medical Center., Seoul, Korea, Republic of|VU MEDISCH CENTRUM; Dept. of Medical Oncology"
My goal is to extract all the city names from it, and I achieved it with the following five steps.
library(stringr)
# replace | with ,
test2<-str_replace_all(test, "[|]", ", ")
# Remove punctuation from data
test3<-gsub("[[:punct:]\n]","",test2)
# Split data at word boundaries
test4 <- strsplit(test3, " ")
# Load data from package maps
library(maps)
data(world.cities)
# Match on cities in world.cities
citiestest<-lapply(test4, function(x)x[which(x %in% world.cities$name)])
The result is mostly correct:
citiestest
[[1]]
[1] "San" "Boston" "Boston" "Washington" "York"
[6] "York" "Kettering" "York" "York" "Charlotte"
[11] "Carolina" "Cleveland" "Nashville" "Seattle" "Seattle"
[16] "Washington" "Asan"
But as you can see, I cannot deal with cities that have two-word names (New York, San Diego, etc.), as they get separated. Fixing this manually is not an option, as my real dataset is quite large.
A rather different approach which may be more or less useful, depending on the data at hand: Pass each address to a geocoding API, then pull the city out of the response.
library(tidyverse)
places <- data_frame(string = "Ucsd Medical Center, San Diego, California, USA|Yale Cancer Center, New Haven, Connecticut, USA|Massachusetts General Hospital., Boston, Massachusetts, USA|Dana Farber Cancer Institute, Boston, Massachusetts, USA|Washington University, Saint Louis, Missouri, USA|Mount SInai Medical Center, New York, New York, USA|Memorial Sloan Kettering Cancer Center, New York, New York, USA|Carolinas Healthcare System, Charlotte, North Carolina, USA|University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA|Vanderbilt University Medical Center, Nashville, Tennessee, USA|Seattle Cancer Care Alliance, Seattle, Washington, USA|National Cancer Center, Gyeonggi-do, Korea, Republic of|Seoul National University Hospital, Seoul, Korea, Republic of|Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of|Korea University Guro Hospital, Seoul, Korea, Republic of|Asan Medical Center., Seoul, Korea, Republic of|VU MEDISCH CENTRUM; Dept. of Medical Oncology") %>%
separate_rows(string, sep = '\\|')
places <- places %>%
mutate(geodata = map(string, ~{Sys.sleep(1); ggmap::geocode(.x, output = 'all')}))
places <- places %>%
mutate(address_components = map(geodata, list('results', 1, 'address_components')),
address_components = map(address_components,
~as_data_frame(transpose(.x)) %>%
unnest(long_name, short_name)),
city = map(address_components, unnest),
city = map_chr(city, ~{
l <- set_names(.x$long_name, .x$types);
coalesce(l['locality'], l['administrative_area_level_1'])
}))
Comparing the result and the original,
places %>% select(city, string)
#> # A tibble: 17 x 2
#> city string
#> <chr> <chr>
#> 1 San Diego Ucsd Medical Center, San Diego, California, USA
#> 2 New Haven Yale Cancer Center, New Haven, Connecticut, USA
#> 3 Boston Massachusetts General Hospital., Boston, Massachusetts, USA
#> 4 Boston Dana Farber Cancer Institute, Boston, Massachusetts, USA
#> 5 St. Louis Washington University, Saint Louis, Missouri, USA
#> 6 New York Mount SInai Medical Center, New York, New York, USA
#> 7 New York Memorial Sloan Kettering Cancer Center, New York, New York, USA
#> 8 Charlotte Carolinas Healthcare System, Charlotte, North Carolina, USA
#> 9 Cleveland University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA
#> 10 Nashville Vanderbilt University Medical Center, Nashville, Tennessee, USA
#> 11 Seattle Seattle Cancer Care Alliance, Seattle, Washington, USA
#> 12 Goyang-si National Cancer Center, Gyeonggi-do, Korea, Republic of
#> 13 서울특별시 Seoul National University Hospital, Seoul, Korea, Republic of
#> 14 Seoul Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of
#> 15 Seoul Korea University Guro Hospital, Seoul, Korea, Republic of
#> 16 Seoul Asan Medical Center., Seoul, Korea, Republic of
#> 17 Amsterdam VU MEDISCH CENTRUM; Dept. of Medical Oncology
...well, it's not perfect. The biggest issue is that cities come back as localities for US addresses, but as administrative_area_level_1 (which in the US is the state) for South Korea. Unlike the other Korean rows, row 12 actually has a locality, but it is not the city listed (which appears in the response as an administrative region). Further, "Seoul" in row 13 was inexplicably returned in Korean.
The good news is that "Saint Louis" has been shortened to "St. Louis", which is a more standardized form, and the last row has been located in Amsterdam.
Scaling such an approach would likely require paying Google a little for the usage of their API.
Here is a base R option using strsplit and sub:
terms <- unlist(strsplit(test, "\\s*\\|\\s*"))
cities <- sapply(terms, function(x) gsub("[^,]+,\\s*([^,]+),.*", "\\1", x))
cities[1:3]
Ucsd Medical Center, San Diego, California, USA
"San Diego"
Yale Cancer Center, New Haven, Connecticut, USA
"New Haven"
Massachusetts General Hospital., Boston, Massachusetts, USA
"Boston"
Another way, with no explicit loop:
pat="(,.\\w+,)|(,.\\w+.\\w+,)"
gsub("(,\\s)|,","",regmatches(m<-strsplit(test,"\\|")[[1]],regexpr(pat,m)))
[1] "San Diego" "New Haven" "Boston" "Boston" "Saint Louis" "New York" "New York"
[8] "Charlotte" "Cleveland" "Nashville" "Seattle" "Gyeonggi-do" "Seoul" "Seoul"
[15] "Seoul" "Seoul"
The other answers given on this page do fail in places: for example, Gyeonggi-do is a real place, and it is not returned by the other solutions. Also, some of the code returns the whole string as the town.
What I would do:
test2 <- str_replace_all(test, "[|]", ", ") #Same as you did
test3 <- unlist(strsplit(test2, split=", ")) #Turns string into a vector
check <- test3 %in% world.cities$name #Check if element vectors match list of city names
test3[check] # Select vector elements that match the list of city names
[1] "San Diego" "New Haven" "Boston" "Boston" "Saint Louis" "New York" "New York" "New York"
[9] "New York" "Charlotte" "Cleveland" "Nashville" "Seattle" "Washington"
To expand on @hrbrmstr's comment above, you can use the Stanford CoreNLP library to do named entity recognition (NER) on each string. The big caveat to such an undertaking is that most NER annotators only go so far as to annotate a token as a "location" or equivalent, which is not very useful when cities are mixed in with states and countries. Beyond its usual NER annotator, though, CoreNLP does contain an extra regex NER annotator that can increase NER granularity to the level of cities.
In R, you can use the coreNLP package to run the annotators. It does require rJava, which in some cases can be hard to configure. You'll also need to download the actual (pretty big) library, which can be done with coreNLP::downloadCoreNLP, and, should you like, set the CORENLP_HOME environment variable in ~/.Renviron to the installation path.
Also note that this approach is fairly slow and resource-intensive, as it's doing a lot of work in Java.
library(tidyverse)
library(coreNLP)
# set which annotators to use
writeLines('annotators = tokenize, ssplit, pos, lemma, ner, regexner\n', 'corenlp.properties')
initCoreNLP(libLoc = Sys.getenv('CORENLP_HOME'), parameterFile = 'corenlp.properties')
unlink('corenlp.properties') # clean up
places <- data_frame(string = "Ucsd Medical Center, San Diego, California, USA|Yale Cancer Center, New Haven, Connecticut, USA|Massachusetts General Hospital., Boston, Massachusetts, USA|Dana Farber Cancer Institute, Boston, Massachusetts, USA|Washington University, Saint Louis, Missouri, USA|Mount SInai Medical Center, New York, New York, USA|Memorial Sloan Kettering Cancer Center, New York, New York, USA|Carolinas Healthcare System, Charlotte, North Carolina, USA|University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA|Vanderbilt University Medical Center, Nashville, Tennessee, USA|Seattle Cancer Care Alliance, Seattle, Washington, USA|National Cancer Center, Gyeonggi-do, Korea, Republic of|Seoul National University Hospital, Seoul, Korea, Republic of|Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of|Korea University Guro Hospital, Seoul, Korea, Republic of|Asan Medical Center., Seoul, Korea, Republic of|VU MEDISCH CENTRUM; Dept. of Medical Oncology") %>%
separate_rows(string, sep = '\\|') # separate strings
places_ner <- places %>%
mutate(annotations = map(string, annotateString),
tokens = map(annotations, 'token'),
tokens = map(tokens, group_by, token_id = data.table::rleid(NER)),
city = map(tokens, filter, NER == 'CITY'),
city = map(city, summarise, city = paste(token, collapse = ' ')),
city = map_chr(city, ~if(nrow(.x) == 0) NA_character_ else .x$city))
which returns
places_ner %>% select(city, string)
#> # A tibble: 17 x 2
#> city string
#> <chr> <chr>
#> 1 San Diego Ucsd Medical Center, San Diego, California, USA
#> 2 New Haven Yale Cancer Center, New Haven, Connecticut, USA
#> 3 Boston Massachusetts General Hospital., Boston, Massachusetts, USA
#> 4 Boston Dana Farber Cancer Institute, Boston, Massachusetts, USA
#> 5 NA Washington University, Saint Louis, Missouri, USA
#> 6 NA Mount SInai Medical Center, New York, New York, USA
#> 7 NA Memorial Sloan Kettering Cancer Center, New York, New York, USA
#> 8 Charlotte Carolinas Healthcare System, Charlotte, North Carolina, USA
#> 9 Cleveland University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA
#> 10 Nashville Vanderbilt University Medical Center, Nashville, Tennessee, USA
#> 11 Seattle Seattle Cancer Care Alliance, Seattle, Washington, USA
#> 12 NA National Cancer Center, Gyeonggi-do, Korea, Republic of
#> 13 Seoul Seoul National University Hospital, Seoul, Korea, Republic of
#> 14 Seoul Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of
#> 15 Seoul Korea University Guro Hospital, Seoul, Korea, Republic of
#> 16 Seoul Asan Medical Center., Seoul, Korea, Republic of
#> 17 NA VU MEDISCH CENTRUM; Dept. of Medical Oncology
Failings:
"New York" is recognized twice as a state or province ("New York City" would be recognized as such).
"Saint Louis" is recognized as a person. "St. Louis" is recognized as a location on my installation, but an online version of the same library recognizes the original as a location, so this may be a version issue.
"Gyeonggi-do" isn't recognized, though "Seoul" is. I'm not sure how granular the regexner annotator goes, but given (as the name suggests) it works by regex, there is a size/familiarity threshold under which it doesn't contain a regex. You can add your own regex to it if it's worthwhile, though.
The cleanNLP package also supports Stanford CoreNLP (and a couple other backends) with an easier-to-use interface (setup is still hard), but as far as I can tell doesn't allow the use of regexner at the moment due to how it initializes CoreNLP.
You can use tidytext to extract bigrams, split them into words, and intersect both with a list of known city names to get the matches:
library(tidyverse)
library(tidytext)
# city is a vector containing pre-defined city names
t2 <- test %>% as_tibble() %>%
unnest_tokens(bigram,value,token = 'ngrams', n =2) %>%
separate(bigram,c('word1','word2'),remove = F)
city_get <- c(intersect(t2$bigram,city),intersect(t2$word1,city))%>%
unique()
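For the pre-defined city vector, one option (an assumption on my part; any reference list of city names would do) is to reuse world.cities from the maps package, lowercased to match the unnest_tokens output:
library(maps)
data(world.cities)
city <- tolower(world.cities$name)  # unnest_tokens lowercases tokens by default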

Finding all string matches from another dataframe in R

I am relatively new to R.
I have a dataframe locs that has 1 variable V1 and looks like:
V1
edmonton general hospital
cardiovascular institute, hospital san carlos, madrid spain
hospital of santa maria, lisbon, portugal
and another dataframe cities that has two variables that look like this:
city country
edmonton canada
san carlos spain
los angeles united states
santa maria united states
tokyo japan
madrid spain
santa maria portugal
lisbon portugal
I want to create two new variables in locs that capture any string match of city within V1, so that locs looks like this:
V1                                                           city                 country
edmonton general hospital                                    edmonton             canada
cardiovascular institute, hospital san carlos, madrid spain  san carlos, madrid   spain
hospital of santa maria, lisbon, portugal                    santa maria, lisbon  portugal, united states
A few things to note: V1 may have multiple country names. Also, if there is a repeat country (for instance, both san carlos and madrid are in spain), then I only want one instance of the country.
Please advise.
Thanks.
A solution using tidyverse and stringr. locs2 is the final output.
library(tidyverse)
library(stringr)
locs2 <- locs %>%
rowwise() %>%
mutate(city = list(str_match(V1, cities$city))) %>%
unnest() %>%
drop_na(city) %>%
left_join(cities, by = "city") %>%
group_by(V1) %>%
summarise_all(funs(toString(sort(unique(.)))))
Result
locs2 %>% as.data.frame()
V1 city country
1 cardiovascular institute, hospital san carlos, madrid spain madrid, san carlos spain
2 edmonton general hospital edmonton canada
3 hospital of santa maria, lisbon, portugal lisbon, santa maria portugal, united states
DATA
library(tidyverse)
locs <- data_frame(V1 = c("edmonton general hospital",
"cardiovascular institute, hospital san carlos, madrid spain",
"hospital of santa maria, lisbon, portugal"))
cities <- read.table(text = "city country
edmonton canada
'san carlos' spain
'los angeles' 'united states'
'santa maria' 'united states'
tokyo japan
madrid spain
'santa maria' portugal
lisbon portugal",
header = TRUE, stringsAsFactors = FALSE)

Using spread() in tidyr to pivot and drop NAs

I am using R and I have data like
California | Los Angeles
California | San Diego
California | San Francisco
New York | Albany
New York | New York City
which I would like to transform to
California | New York
Los Angeles | Albany
San Diego | New York City
San Francisco | NA
I am trying to use spread() in tidyr but can't quite get it to give me the output the way I need it. The closest I can come is
California | New York
Los Angeles | NA
San Diego | NA
San Francisco | NA
NA | Albany
NA | New York City
Can someone please help me get it in the desired format?
Here's how I do it in base:
df <- data.frame(v1 = c(rep("California", 3), rep("New York", 2)),
                 v2 = c("Los Angeles", "San Diego", "San Francisco", "Albany", "New York City"))
cali<-as.character(df[df$v1=="California", 2])
ny<-as.character(df[df$v1=="New York", 2])
new <- data.frame(California=cali, NewYork=c(ny, NA))
new
California NewYork
1 Los Angeles Albany
2 San Diego New York City
3 San Francisco <NA>
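And since the question asked about spread() specifically: the usual trick is to add a row index within each group, so that spread() has a unique row identifier to line the values up on. A sketch with dplyr and tidyr:
library(dplyr)
library(tidyr)
df %>%
  group_by(v1) %>%
  mutate(row = row_number()) %>%  # unique id within each state
  ungroup() %>%
  spread(v1, v2) %>%
  select(-row)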
