R: Find unique value and find record like %_%

I have a data table of 10,000 records with multiple columns. Below are the code and part of the data set.
library(stringr)
states <- str_trim(unlist(strsplit(as.vector(search_data_set$location_name), ";")))
Part of the dataset:
Maine Virginia;
Oklahoma;
Kansas Minnesota South Dakota;
Delaware;
West Virginia;
Utah South Carolina;
Utah South Dakota Utah;
Indiana; Michigan Alaska Washington;
Washington Connecticut Maine;
Maine Oregon South Carolina Oregon;
Alabama Alaska;
Iowa Alabama New Mexico;
Virgin Islands South Dakota;
Maine Louisiana; Colorado;
District of Columbia Virgin Islands;
Pennsylvania Alabama;
I need to fulfill the two requirements below and need help here:
Each record should count each location value only once (in "Utah South Dakota Utah;", Utah should be counted as one unique value).
When the user searches the dataset, it should return a record if the location appears anywhere in it (as with SQL's %Oregon%). The current code does not return the record "Maine Oregon South Carolina Oregon;" when the user searches for "Oregon".
Need help in achieving this. Thanks in advance!
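A minimal sketch of one approach, assuming the column is search_data_set$location_name as in the code above, and assuming (this list is not from the original post) that valid locations are the US state names plus the two territories shown in the sample:
library(stringr)
# Known location names; longest first so "West Virginia" matches before "Virginia"
state_list <- c(state.name, "District of Columbia", "Virgin Islands")
state_list <- state_list[order(-nchar(state_list))]
pattern <- paste(state_list, collapse = "|")
# 1) Unique locations per record ("Utah South Dakota Utah;" counts Utah once)
locations <- lapply(search_data_set$location_name,
                    function(x) unique(str_extract_all(x, pattern)[[1]]))
# 2) SQL-style %Oregon% search: match the location anywhere in the record
hits <- search_data_set[str_detect(search_data_set$location_name, "Oregon"), ]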

Related

R: Is it possible to create multiple tables based on unique values by looping?

Say we have a dataframe such as the one below:
region country city
North America USA Washington
North America USA Boston
Western Europe UK Sheffield
Western Europe Germany Düsseldorf
Eastern Europe Ukraine Kiev
North America Canada Vancouver
Western Europe France Reims
Western Europe Belgium Antwerp
North America USA Chicago
Eastern Europe Belarus Minsk
Eastern Europe Russia Omsk
Eastern Europe Russia Moscow
Western Europe UK Southampton
Western Europe Germany Hamburg
North America Canada Ottawa
I would like to know how to loop through this dataframe to check whether countries are assigned to the right region, and likewise for cities. Usually I do this with the table() function; however, that is very time-consuming, as it requires several ad hoc statements such as table(df$country[df$region == 'North America']) and so on for all the regions and countries involved.
Thus, I'm eager to know how to write a loop so I can get this output while economizing as much as possible on time and lines of code.
Thanks in advance!
library(dplyr)
df %>% group_by(region) %>% group_split() # returns a list of data frames, one per region
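For the stated goal of verifying region-country assignments, a sketch of an alternative that avoids the per-region table() calls (assuming the column names shown in the question):
library(dplyr)
# One row per region/country pair; a country assigned to two regions shows up twice
df %>% distinct(region, country) %>% arrange(country)
# The same idea in base R: a full cross-tabulation in one call
table(df$country, df$region)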

Extract city names from large text with R

Hello, I have an intriguing question here. Suppose that I have a long character string which includes city names, among other things.
test<-"Ucsd Medical Center, San Diego, California, USA|Yale Cancer Center, New Haven, Connecticut, USA|Massachusetts General Hospital., Boston, Massachusetts, USA|Dana Farber Cancer Institute, Boston, Massachusetts, USA|Washington University, Saint Louis, Missouri, USA|Mount SInai Medical Center, New York, New York, USA|Memorial Sloan Kettering Cancer Center, New York, New York, USA|Carolinas Healthcare System, Charlotte, North Carolina, USA|University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA|Vanderbilt University Medical Center, Nashville, Tennessee, USA|Seattle Cancer Care Alliance, Seattle, Washington, USA|National Cancer Center, Gyeonggi-do, Korea, Republic of|Seoul National University Hospital, Seoul, Korea, Republic of|Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of|Korea University Guro Hospital, Seoul, Korea, Republic of|Asan Medical Center., Seoul, Korea, Republic of|VU MEDISCH CENTRUM; Dept. of Medical Oncology"
My goal is to extract all the city names from it, which I attempted in the following five steps.
library(stringr)
library(maps)
# Replace | with ,
test2 <- str_replace_all(test, "[|]", ", ")
# Remove punctuation from the data
test3 <- gsub("[[:punct:]\n]", "", test2)
# Split the data at word boundaries
test4 <- strsplit(test3, " ")
# Load data from the maps package
data(world.cities)
# Match on cities in world.cities
citiestest <- lapply(test4, function(x) x[which(x %in% world.cities$name)])
The result is mostly correct:
citiestest
[[1]]
[1] "San" "Boston" "Boston" "Washington" "York"
[6] "York" "Kettering" "York" "York" "Charlotte"
[11] "Carolina" "Cleveland" "Nashville" "Seattle" "Seattle"
[16] "Washington" "Asan"
But as you can see, I cannot deal with two-word city names (New York, San Diego, etc.), as they get separated. Of course, fixing this manually is not an option, since my real dataset is quite large.
A rather different approach which may be more or less useful, depending on the data at hand: Pass each address to a geocoding API, then pull the city out of the response.
library(tidyverse)
places <- data_frame(string = "Ucsd Medical Center, San Diego, California, USA|Yale Cancer Center, New Haven, Connecticut, USA|Massachusetts General Hospital., Boston, Massachusetts, USA|Dana Farber Cancer Institute, Boston, Massachusetts, USA|Washington University, Saint Louis, Missouri, USA|Mount SInai Medical Center, New York, New York, USA|Memorial Sloan Kettering Cancer Center, New York, New York, USA|Carolinas Healthcare System, Charlotte, North Carolina, USA|University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA|Vanderbilt University Medical Center, Nashville, Tennessee, USA|Seattle Cancer Care Alliance, Seattle, Washington, USA|National Cancer Center, Gyeonggi-do, Korea, Republic of|Seoul National University Hospital, Seoul, Korea, Republic of|Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of|Korea University Guro Hospital, Seoul, Korea, Republic of|Asan Medical Center., Seoul, Korea, Republic of|VU MEDISCH CENTRUM; Dept. of Medical Oncology") %>%
  separate_rows(string, sep = '\\|')

places <- places %>%
  mutate(geodata = map(string, ~{ Sys.sleep(1); ggmap::geocode(.x, output = 'all') }))

places <- places %>%
  mutate(address_components = map(geodata, list('results', 1, 'address_components')),
         address_components = map(address_components,
                                  ~as_data_frame(transpose(.x)) %>%
                                    unnest(long_name, short_name)),
         city = map(address_components, unnest),
         city = map_chr(city, ~{
           l <- set_names(.x$long_name, .x$types)
           coalesce(l['locality'], l['administrative_area_level_1'])
         }))
Comparing the result and the original,
places %>% select(city, string)
#> # A tibble: 17 x 2
#> city string
#> <chr> <chr>
#> 1 San Diego Ucsd Medical Center, San Diego, California, USA
#> 2 New Haven Yale Cancer Center, New Haven, Connecticut, USA
#> 3 Boston Massachusetts General Hospital., Boston, Massachusetts, USA
#> 4 Boston Dana Farber Cancer Institute, Boston, Massachusetts, USA
#> 5 St. Louis Washington University, Saint Louis, Missouri, USA
#> 6 New York Mount SInai Medical Center, New York, New York, USA
#> 7 New York Memorial Sloan Kettering Cancer Center, New York, New York, USA
#> 8 Charlotte Carolinas Healthcare System, Charlotte, North Carolina, USA
#> 9 Cleveland University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA
#> 10 Nashville Vanderbilt University Medical Center, Nashville, Tennessee, USA
#> 11 Seattle Seattle Cancer Care Alliance, Seattle, Washington, USA
#> 12 Goyang-si National Cancer Center, Gyeonggi-do, Korea, Republic of
#> 13 서울특별시 Seoul National University Hospital, Seoul, Korea, Republic of
#> 14 Seoul Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of
#> 15 Seoul Korea University Guro Hospital, Seoul, Korea, Republic of
#> 16 Seoul Asan Medical Center., Seoul, Korea, Republic of
#> 17 Amsterdam VU MEDISCH CENTRUM; Dept. of Medical Oncology
...well, it's not perfect. The biggest issue is that US cities are classified as localities, but South Korean cities as administrative_area_level_1 (which in the US is the state). Unlike the other Korean rows, row 12 actually has a locality, but it is not the city listed (which appears in the response as an administrative region). Further, "Seoul" in row 13 was inexplicably translated to Korean.
The good news is that "Saint Louis" has been shortened to "St. Louis", which is a more standardized form, and the last row has been located in Amsterdam.
Scaling such an approach would likely require paying Google a little for the usage of their API.
Here is a base R option using strsplit and gsub; the pattern "[^,]+,\s*([^,]+),.*" keeps the second comma-separated field of each address, which is the city:
terms <- unlist(strsplit(test, "\\s*\\|\\s*"))
cities <- sapply(terms, function(x) gsub("[^,]+,\\s*([^,]+),.*", "\\1", x))
cities[1:3]
Ucsd Medical Center, San Diego, California, USA
"San Diego"
Yale Cancer Center, New Haven, Connecticut, USA
"New Haven"
Massachusetts General Hospital., Boston, Massachusetts, USA
"Boston"
Another way, without a loop:
pat <- "(,.\\w+,)|(,.\\w+.\\w+,)"
gsub("(,\\s)|,", "", regmatches(m <- strsplit(test, "\\|")[[1]], regexpr(pat, m)))
[1] "San Diego" "New Haven" "Boston" "Boston" "Saint Louis" "New York" "New York"
[8] "Charlotte" "Cleveland" "Nashville" "Seattle" "Gyeonggi-do" "Seoul" "Seoul"
[15] "Seoul" "Seoul"
The other answers on this page can fail: for example, "Gyeonggi-do" is a town that is not returned by the other solutions. Also, some of the other code returns the whole string as the town.
What I would do:
library(stringr)
library(maps)
data(world.cities)
test2 <- str_replace_all(test, "[|]", ", ") # same as you did
test3 <- unlist(strsplit(test2, split = ", ")) # turn the string into a vector
check <- test3 %in% world.cities$name # check which elements match the list of city names
test3[check] # select the vector elements that match
[1] "San Diego" "New Haven" "Boston" "Boston" "Saint Louis" "New York" "New York" "New York"
[9] "New York" "Charlotte" "Cleveland" "Nashville" "Seattle" "Washington"
To expand on #hrbrmstr's comment above, you can use the Stanford CoreNLP library to do named entity recognition (NER) on each string. The big caveat to such an undertaking is that most NER annotators only go so far as to annotate a token as a "location" or equivalent, which is not very useful when cities are mixed in with states and countries. Beyond its usual NER annotator, though, CoreNLP does contain an extra regex NER annotator that can increase NER granularity to the level of cities.
In R, you can use the coreNLP package to run the annotators. It does require rJava, which in some cases can be hard to configure. You'll also need to download the actual (pretty big) library, which can be done with coreNLP::downloadCoreNLP, and, should you like, set the CORENLP_HOME environment variable in ~/.Renviron to the installation path.
Also note that this approach is fairly slow and resource-intensive, as it's doing a lot of work in Java.
library(tidyverse)
library(coreNLP)
# set which annotators to use
writeLines('annotators = tokenize, ssplit, pos, lemma, ner, regexner\n', 'corenlp.properties')
initCoreNLP(libLoc = Sys.getenv('CORENLP_HOME'), parameterFile = 'corenlp.properties')
unlink('corenlp.properties') # clean up
places <- data_frame(string = "Ucsd Medical Center, San Diego, California, USA|Yale Cancer Center, New Haven, Connecticut, USA|Massachusetts General Hospital., Boston, Massachusetts, USA|Dana Farber Cancer Institute, Boston, Massachusetts, USA|Washington University, Saint Louis, Missouri, USA|Mount SInai Medical Center, New York, New York, USA|Memorial Sloan Kettering Cancer Center, New York, New York, USA|Carolinas Healthcare System, Charlotte, North Carolina, USA|University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA|Vanderbilt University Medical Center, Nashville, Tennessee, USA|Seattle Cancer Care Alliance, Seattle, Washington, USA|National Cancer Center, Gyeonggi-do, Korea, Republic of|Seoul National University Hospital, Seoul, Korea, Republic of|Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of|Korea University Guro Hospital, Seoul, Korea, Republic of|Asan Medical Center., Seoul, Korea, Republic of|VU MEDISCH CENTRUM; Dept. of Medical Oncology") %>%
  separate_rows(string, sep = '\\|') # separate strings

places_ner <- places %>%
  mutate(annotations = map(string, annotateString),
         tokens = map(annotations, 'token'),
         tokens = map(tokens, group_by, token_id = data.table::rleid(NER)),
         city = map(tokens, filter, NER == 'CITY'),
         city = map(city, summarise, city = paste(token, collapse = ' ')),
         city = map_chr(city, ~if (nrow(.x) == 0) NA_character_ else .x$city))
which returns
places_ner %>% select(city, string)
#> # A tibble: 17 x 2
#> city string
#> <chr> <chr>
#> 1 San Diego Ucsd Medical Center, San Diego, California, USA
#> 2 New Haven Yale Cancer Center, New Haven, Connecticut, USA
#> 3 Boston Massachusetts General Hospital., Boston, Massachusetts, USA
#> 4 Boston Dana Farber Cancer Institute, Boston, Massachusetts, USA
#> 5 NA Washington University, Saint Louis, Missouri, USA
#> 6 NA Mount SInai Medical Center, New York, New York, USA
#> 7 NA Memorial Sloan Kettering Cancer Center, New York, New York, USA
#> 8 Charlotte Carolinas Healthcare System, Charlotte, North Carolina, USA
#> 9 Cleveland University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA
#> 10 Nashville Vanderbilt University Medical Center, Nashville, Tennessee, USA
#> 11 Seattle Seattle Cancer Care Alliance, Seattle, Washington, USA
#> 12 NA National Cancer Center, Gyeonggi-do, Korea, Republic of
#> 13 Seoul Seoul National University Hospital, Seoul, Korea, Republic of
#> 14 Seoul Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of
#> 15 Seoul Korea University Guro Hospital, Seoul, Korea, Republic of
#> 16 Seoul Asan Medical Center., Seoul, Korea, Republic of
#> 17 NA VU MEDISCH CENTRUM; Dept. of Medical Oncology
Failings:
"New York" is recognized twice as a state or province ("New York City" would be recognized as such).
"Saint Louis" is recognized as a person. "St. Louis" is recognized as a location on my installation, but an online version of the same library recognizes the original as a location, so this may be a version issue.
"Gyeonggi-do" isn't recognized, though "Seoul" is. I'm not sure how granular the regexner annotator goes, but given (as the name suggests) it works by regex, there is a size/familiarity threshold under which it doesn't contain a regex. You can add your own regex to it if it's worthwhile, though.
The cleanNLP package also supports Stanford CoreNLP (and a couple other backends) with an easier-to-use interface (setup is still hard), but as far as I can tell doesn't allow the use of regexner at the moment due to how it initializes CoreNLP.
You can use tidytext to extract bigrams, split them into words, and intersect with a list of known city names to get the common part.
library(tidyverse)
library(tidytext)
# `city` is a vector containing pre-defined city names
t2 <- test %>%
  as_tibble() %>%
  unnest_tokens(bigram, value, token = 'ngrams', n = 2) %>%
  separate(bigram, c('word1', 'word2'), remove = FALSE)
city_get <- c(intersect(t2$bigram, city), intersect(t2$word1, city)) %>%
  unique()
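The `city` vector is not defined in the answer; one possible source (an assumption, not part of the original) is the maps package used in the question:
library(maps)
data(world.cities)
city <- world.cities$name # pre-defined city names to intersect against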

Find string, if does not exist, find another string

I have many files from the OECD that have data available at different regional granularities. An example would be:
File A
REG_ID Region
AUS Australia
AU1GS Sydney
AU1 New South Wales
AU2 Victoria
AU2GM Melbourne
File B
REG_ID Region
AUS Australia
AU1GS Sydney
AU2GM Melbourne
File C
REG_ID Region
AUS Australia
AU1 New South Wales
AU1GS Sydney
AU2 Victoria
I want to extract the most granular region, in this case Sydney only, and not New South Wales. However, if Sydney is unavailable, I want to extract New South Wales.
How do I write code that is generalisable to all these files?
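A minimal sketch of one generalisable rule, assuming the OECD coding convention visible above: three-letter country codes (AUS), a digit added for states/territories (AU1), and further letters for cities (AU1GS). Within each state prefix, keep only the longest code present (the function name most_granular is hypothetical):
most_granular <- function(df) {
  sub <- df[grepl("[0-9]", df$REG_ID), ] # drop the country-level row (e.g. "AUS")
  state <- substr(sub$REG_ID, 1, 3)      # state-level prefix, e.g. "AU1"
  do.call(rbind, lapply(split(sub, state), function(g) {
    g[nchar(g$REG_ID) == max(nchar(g$REG_ID)), ] # keep the finest level present
  }))
}
On File C this keeps Sydney (AU1GS) and Victoria (AU2), since no city-level row exists under AU2; on Files A and B it keeps Sydney and Melbourne.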

R append function

I'm writing an R script that parses the state abbreviation out of a column in a data.frame. It then uses the which() function to determine the index of the found state abbreviation in a lookup data frame that contains state abbreviations and their corresponding full state names. I then use the found index to access the full state name and append it to a vector called completeList. Finally, I add the vector completeList, which should contain the full state names, to my original data frame under a newly created column STATE_NAME.
However, for some reason completeList contains only the indexes that were found earlier and not the full state names that I expected. What did I do wrong?
# Read in the CSV weather data file
file <- read.csv(header = TRUE, file = "C:\\Users\\michael.guarino1\\Desktop\\Work\\weather\\nov_2_1976\\734677_cleaned.csv")
# Read in the CSV state abbreviation file
abbreviationsFile <- read.csv(header = TRUE, file = "C:\\Users\\michael.guarino1\\Desktop\\Work\\weather\\stateAbbreviationMatches.csv")
# Iterate through STATION_NAME and store abbreviations
completeList <- c()
for (stateAbvr in file$STATION_NAME) {
  addTo <- substring(stateAbvr, nchar(stateAbvr) - 4, nchar(stateAbvr) - 3)
  index <- which(abbreviationsFile$Abbreviation == addTo)
  addCompleteStateName <- abbreviationsFile[index, 1]
  completeList <- append(completeList, addCompleteStateName)
}
file["STATE_NAME"] <- completeList
> completeList
[1] 27 17 17 29 42 50 20 53 45 19 22 52 9 29 26 37 8 58 35
Here is the csv file where the abbreviation of the station is found
STATION STATION_NAME ELEVATION
GHCND:USC00202381 EAST JORDAN MI US 180.1
GHCND:USC00111290 CARLYLE RESERVOIR IL US 153
GHCND:USC00116661 PAW PAW 2 S IL US 274.9
GHCND:USC00228556 SUMRALL MS US 88.1
GHCND:USC00340292 ARDMORE OK US 267.9
GHCND:USC00408522 SPARTA WASTEWATER PLANT TN US 289.9
GHCND:USC00148341 VALLEY FALLS KS US 283.5
GHCND:USW00014742 BURLINGTON INTERNATIONAL AIRPORT VT US 101.2
GHCND:USC00367782 SALINA 3 W PA US 338
GHCND:USC00134142 IOWA FALLS IA US 356.9
GHCND:USC00161565 CARVILLE 2 SW LA US 9.1
GHCND:USC00421446 CITY CRK WATER PLANT UT US 1628.9
GHCND:USW00013781 WILMINGTON NEW CASTLE CO AIRPORT DE US 22.6
GHCND:USC00229400 WATER VALLEY MS US 116.1
GHCND:USC00190562 BELCHERTOWN MA US 171
GHCND:USW00094728 NEW YORK CENTRAL PARK OBS BELVEDERE TOWER NY US 40.2
GHCND:USC00060973 BURLINGTON CT US 155.4
GHCND:USC00475516 MINOCQUA WI US 484.9
GHCND:USC00286055 NEW BRUNSWICK 3 SE NJ US 38.1
Here is the csv file where we look up abbreviations and find the corresponding full state name
State/Possession Abbreviation
Alabama AL
Alaska AK
American Samoa AS
Arizona AZ
Arkansas AR
California CA
Colorado CO
Connecticut CT
Delaware DE
District of Columbia DC
Federated States of Micronesia FM
Florida FL
Georgia GA
Guam GU
Hawaii HI
Idaho ID
Illinois IL
Indiana IN
Iowa IA
Kansas KS
Kentucky KY
Louisiana LA
Maine ME
Marshall Islands MH
Maryland MD
Massachusetts MA
Michigan MI
Minnesota MN
Mississippi MS
Missouri MO
Montana MT
Nebraska NE
Nevada NV
New Hampshire NH
New Jersey NJ
New Mexico NM
New York NY
North Carolina NC
North Dakota ND
Northern Mariana Islands MP
Ohio OH
Oklahoma OK
Oregon OR
Palau PW
Pennsylvania PA
Puerto Rico PR
Rhode Island RI
South Carolina SC
South Dakota SD
Tennessee TN
Texas TX
Utah UT
Vermont VT
Virgin Islands VI
Virginia VA
Washington WA
West Virginia WV
Wisconsin WI
Wyoming WY
Why am I not getting the full state name?
figured it out 😎
# Read in the CSV weather data file
file <- read.csv(header = TRUE, file = "C:\\Users\\michael.guarino1\\Desktop\\Work\\weather\\nov_2_1976\\734677_cleaned.csv")
# Read in the CSV state abbreviation file
abbreviationsFile <- read.csv(header = TRUE, file = "C:\\Users\\michael.guarino1\\Desktop\\Work\\weather\\stateAbbreviationMatches.csv")
# Iterate through STATION_NAME and store abbreviations
completeList <- c()
for (stateAbvr in file$STATION_NAME) {
  addTo <- substring(stateAbvr, nchar(stateAbvr) - 4, nchar(stateAbvr) - 3)
  index <- which(abbreviationsFile$Abbreviation == addTo)
  addCompleteStateName <- abbreviationsFile[index, 1]
  completeList <- append(completeList, toString(addCompleteStateName))
}
file["STATE_NAME"] <- completeList
The type was being coerced to an integer (the factor's underlying codes).
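A quick standalone illustration of that coercion (the values here are hypothetical, just to show the mechanics):
f <- factor("Alabama", levels = state.name)
as.integer(f)   # 1 -- the underlying integer code that leaked into completeList
as.character(f) # "Alabama" -- the label that was actually wanted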
The variable addCompleteStateName is a factor. You can convert it to a character to append the labels.
# Iterate through STATION_NAME and store abbreviations
completeList <- c()
for (stateAbvr in file$STATION_NAME) {
  addTo <- substring(stateAbvr, nchar(stateAbvr) - 4, nchar(stateAbvr) - 3)
  index <- which(abbreviationsFile$Abbreviation == addTo)
  addCompleteStateName <- abbreviationsFile[index, 1]
  # Modified to convert addCompleteStateName to character
  completeList <- append(completeList, as.character(addCompleteStateName))
}
file["STATE_NAME"] <- completeList

How can I separate one column into two in R so that the all capital letter words are in one column?

I have one column like this:
x <- c('WV West Virginia','FL Florida','CA California','SC South Carolina')
# [1] WV West Virginia FL Florida
# [3] CA California SC South Carolina
How can I separate the abbreviation from the full state name? I want to give the two new columns different headers. I think I can only solve this by splitting off the all-capital words.
With tidyr we can use separate to expand the column into two while specifying the new names. The argument extra="merge" limits the output to the given columns; the separator defaults to non-alphanumeric characters:
library(tidyr)
separate(df, x, c("Abb", "State"), extra="merge")
# Abb State
#1 WV West Virginia
#2 FL Florida
#3 CA California
#4 SC South Carolina
Data
df <- data.frame(x = c('WV West Virginia', 'FL Florida', 'CA California', 'SC South Carolina'))
Two approaches without external packages:
Approach 1: you could use substring in combination with nchar.
dat <- data.frame(raw = c("WV West Virginia", "FL Florida", "CA California", "SC South Carolina"),
                  stringsAsFactors = FALSE)
dat$code <- substr(dat$raw,1,2)
dat$state <- substr(dat$raw, 4, nchar(dat$raw))
> dat
raw code state
1 WV West Virginia WV West Virginia
2 FL Florida FL Florida
3 CA California CA California
4 SC South Carolina SC South Carolina
Approach 2: you could use regular expressions to replace parts of your strings:
## Approach 2: regex
dat$code <- sub(" .+", "", dat$raw)
dat$state <- sub("[A-Z]{2} ", "", dat$raw)
Use the state.* constants that come with the base datasets package
DF = data.frame(raw=c("WV West Virginia","FL Florida","CA California","SC South Carolina"))
DF$state.abbr <- substr(DF$raw, 1, 2)
DF$state.name <- state.name[ match(DF$state.abbr, state.abb) ]
# raw state.abbr state.name
# 1 WV West Virginia WV West Virginia
# 2 FL Florida FL Florida
# 3 CA California CA California
# 4 SC South Carolina SC South Carolina
This way, you can afford to have typos or other oddities in the state names.
Use the reshape2 package.
library(reshape2)
x <- rbind('WV West Virginia','FL Florida','CA California','SC South Carolina')
colsplit(x," ",c("Code","State"))
Output:
Code State
1 WV West Virginia
2 FL Florida
3 CA California
4 SC South Carolina
Based on #rawr's comment, we could split 'x' at the whitespace that follows the first two characters, i.e. using the regex lookbehind (?<=^.{2}). The output will be a list, which we rbind, convert to a data.frame, and then cbind with the original vector 'x'.
cbind(x, as.data.frame(do.call(rbind, strsplit(x, '(?<=^.{2})\\s+', perl = TRUE)),
                       stringsAsFactors = FALSE))
# x V1 V2
#1 WV West Virginia WV West Virginia
#2 FL Florida FL Florida
#3 CA California CA California
#4 SC South Carolina SC South Carolina
Or, instead of the regex lookbehind, we could use stri_split with n = 2 and split at whitespace.
library(stringi)
cbind(x, as.data.frame(do.call(rbind, stri_split(x, regex = '\\s+', n = 2))))
Here's a data.table/ gsub approach:
x <- c('WV West Virginia','FL Florida','CA California','SC South Carolina')
data.table::data.table(x)[, abb := gsub("(^[A-Z]{2})( .+)", "\\1", x)][,
  state := gsub("(^[A-Z]{2})( .+)", "\\2", x)][]
## x abb state
## 1: WV West Virginia WV West Virginia
## 2: FL Florida FL Florida
## 3: CA California CA California
## 4: SC South Carolina SC South Carolina
