How to extract one element of text from a column in R?

I'm working with a data frame that contains the locations where people got tested for COVID. There is no standardized formatting for the ordering facility (the place that ordered the test). My data frame looks something like this:
TestingLocation <- data.frame(TestingLocation= c("New York Hospital One", "Chicago Clinic Two", "Nursing Home Name One",
"Los Angeles University_Testing_Site", "Test-Site-in-BOSTON-MA"))
I have a list of the cities where someone could get tested.
Cities <- data.frame(PossibleTestCities=c("Los Angeles", "Chicago", "New York", "Miami", "Boston", "Austin", "Santa Fe"))
Is there a way to use the Cities frame I have to extract the city and put it into a new column? Additionally, if no city appears, could it put "Unknown" or something along those lines? Ideally, my frame would look like this:
DesiredFrame <- data.frame(TestingLocation= c("New York Hospital One", "Chicago Clinic Two", "Nursing Home Name One",
"Los Angeles University_Testing_Site", "Test-Site-in-BOSTON-MA"),
TestCity= c("New York", "Chicago", "Unknown", "Los Angeles", "Boston"))
Thank you!

Does this work:
library(dplyr)
library(stringr)
library(tidyr)   # replace_na() comes from tidyr
TestingLocation %>%
  mutate(TestCity = str_to_title(str_extract(toupper(TestingLocation),
                                             toupper(str_c(Cities$PossibleTestCities, collapse = '|'))))) %>%
  mutate(TestCity = replace_na(TestCity, 'Unknown'))
TestingLocation TestCity
1 New York Hospital One New York
2 Chicago Clinic Two Chicago
3 Nursing Home Name One Unknown
4 Los Angeles University_Testing_Site Los Angeles
5 Test-Site-in-BOSTON-MA Boston

This doesn't look pretty but it works:
TestingLocation$TestCity <- sub("(^[a-z]+.*$)", NA,
                                sub(paste0(".*(", paste(tolower(Cities$PossibleTestCities), collapse = "|"), ").*"),
                                    "\\U\\1", tolower(TestingLocation$TestingLocation), perl = T))
There are a number of operations involved: two sub calls, one nested inside the other. The inner one replaces the (lower-cased) TestingLocation$TestingLocation values with the matching (lower-cased) Cities$PossibleTestCities and converts the replacements to upper case, while the outer one sets the values that found no match, and hence remained lower case, to NA.
Instead of using a compact but hard-to-parse single piece of code, you can perform the substitutions step by step:
# 1. define pattern with alternatives:
mypattern <- paste0(".*(", paste(tolower(Cities$PossibleTestCities), collapse = "|"),").*")
# 2. perform first substitution to set matches to City names:
TestingLocation$TestCity <- sub(mypattern, "\\U\\1", tolower(TestingLocation$TestingLocation), perl = T)
# 3. perform second substitution to set non-match to NA:
TestingLocation$TestCity <- sub("(^[a-z]+.*$)", NA, TestingLocation$TestCity)
Result:
TestingLocation
TestingLocation TestCity
1 New York Hospital One NEW YORK
2 Chicago Clinic Two CHICAGO
3 Nursing Home Name One <NA>
4 Los Angeles University_Testing_Site LOS ANGELES
5 Test-Site-in-BOSTON-MA BOSTON
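If you want the result to match the DesiredFrame from the question (title-case city names and "Unknown" instead of NA), here is a small base-R follow-up sketch building on the TestCity column created above:
# replace non-matches with "Unknown", then convert everything to title case
TestingLocation$TestCity[is.na(TestingLocation$TestCity)] <- "UNKNOWN"
TestingLocation$TestCity <- tools::toTitleCase(tolower(TestingLocation$TestCity))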

Related

Extracting first word after a specific expression in R

I have a column that contains thousands of descriptions like this (example):
Description
Building a hospital in the city of LA, USA
Building a school in the city of NYC, USA
Building shops in the city of Chicago, USA
I'd like to create a column with the first word after "city of", like this:
Description                                  City
Building a hospital in the city of LA, USA   LA
Building a school in the city of NYC, USA    NYC
Building shops in the city of Chicago, USA   Chicago
I tried the following code after seeing the topic Extracting string after specific word, but my column is only filled with missing values:
library(stringr)
df$city <- data.frame(str_extract(df$Description, "(?<=city of:\\s)[^;]+"))
df$city <- data.frame(str_extract(df$Description, "(?<=of:\\s)[^;]+"))
I took a look at the dput() output and it is the same as the descriptions I see in the data frame directly.
Solution
This should do the trick for the data you showed:
df$city <- str_extract(df$Description, "(?<=city of )(\\w+)")
df
#> Description city
#> 1 Building a hospital in the city of LA, USA LA
#> 2 Building a school in the city of NYC, USA NYC
#> 3 Building shops in the city of Chicago, USA Chicago
Alternative
However, in case you want the whole string up to the first comma (for example for cities with a space in the name), you can go with:
df$city <- str_extract(df$Description, "(?<=city of )(.+)(?=,)")
Check out the following example:
df <- data.frame(Description = c("Building a hospital in the city of LA, USA",
"Building a school in the city of NYC, USA",
"Building shops in the city of Chicago, USA",
"Building a church in the city of Salt Lake City, USA"))
str_extract(df$Description, "(?<=the city of )(\\w+)")
#> [1] "LA" "NYC" "Chicago" "Salt"
str_extract(df$Description, "(?<=the city of )(.+)(?=,)")
#> [1] "LA" "NYC" "Chicago" "Salt Lake City"
Documentation
Check out ?regex:
Patterns (?=...) and (?!...) are zero-width positive and negative
lookahead assertions: they match if an attempt to match the ...
forward from the current position would succeed (or not), but use up
no characters in the string being processed. Patterns (?<=...) and
(?<!...) are the lookbehind equivalents: they do not allow repetition
quantifiers nor \C in ....
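As a minimal illustration of the difference between lookahead and lookbehind, reusing the stringr calls from above on one of the sample strings:
library(stringr)
x <- "Building a school in the city of NYC, USA"
str_extract(x, "(?<=city of )\\w+")  # lookbehind: word after "city of " -> "NYC"
str_extract(x, "\\w+(?=,)")          # lookahead: word before the comma -> "NYC"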

Remove all rows that doesn't match a set of strings and recategorization of the columns

I have a set of social media data queried from the Twitter API, which also includes people's self-reported location. However, the location string does not follow a standard format for categorization, and sometimes there are "troll" values. Here is an example:
a1 = data.frame(x=c(1:4),y=c("181 Metro Drive San Francisco", "Wall Street New York", "Austin, TX", "The Moon"))
a1
My plan is to obtain a CSV file with all city names around the world from https://www.kaggle.com/max-mind/world-cities-database and import it into R as a vector; here is a small example:
a2 = c("New York", "Washington", "Austin")
a2
What I want to do is write an R function that cross-references a1 against a2, replaces every string in a1 that has no match in a2 with NA, and replaces every string that does match with that exact matching value. For example, if our function is f, its output would be as follows:
x = data.frame(x=c(1:4),c("San Francisco", "New York", "Austin", NA))
x
Can I write a function in R for this, or is there an existing R package built for this task? Thank you for the help.
We can paste all the city names into a single pattern and then use str_extract to extract the match.
library(stringr)
str_extract(a1, str_c(a2, collapse = "|"))
#[1] "San Francisco" "New York" "Austin" NA
data
a2 = c("New York", "Washington", "Austin", "San Francisco")
a1 = c("181 Metro Drive San Francisco", "Wall Street New York",
"Austin, TX", "The Moon")

R: Match character vector with another character vector [closed]

I can't wrap my mind around this task.
Consider a data frame "usa" with 3 columns, "title", "city" and "state" (reproducible):
title <- c("Events in Chicago, September", "California hotels",
"Los Angeles, August", "Restaurant in Chicago")
city <- c("","", "Los Angeles", "Chicago")
state <- c("","", "California", "IL")
usa <-data.frame(title, city, state)
Resulting in this:
title city state
1 Events in Chicago, September
2 California hotels
3 Los Angeles, August Los Angeles California
4 Restaurant in Chicago Chicago IL
Now what I am trying to do is fill in the state variable for the first two observations, where it is currently missing.
The title variable contains a clue: either a city or a state is mentioned in each of the entries.
I need to do the following:
Check if any word in "title" column matches any observation found in "city" and "state" columns;
If any word in "title" matches any observation in "state", paste the same state for the given title's observation;
If any word in "title" matches any observation in "city", paste the matched city's state in the "state" column of the title's row.
So what I want to get eventually is this:
title city state
1 Events in Chicago, September IL
2 California hotels California
3 Los Angeles, August Los Angeles California
4 Restaurant in Chicago Chicago IL
In other words, in the second row the title contained a word "California", so a matching state was found from state vector. However, in the first line, the word "Chicago" was the key, and there was another entry in the data frame (row 4), which linked Chicago to "IL" state, so "IL" has to be pasted in the first row of "state" column.
Waiting for the community's ideas :) Thanks!
I would recommend you use the stringr package; specifically, a function called str_extract.
If you have a complete list of cities, e.g. city <- c("Los Angeles", "Chicago"), then you can turn it into a regular expression using paste(city, collapse = '|'). That will give you 'Los Angeles|Chicago'. With str_extract, you can extract that city (it will extract the first one it sees, and an NA if none appear). Here's the complete code. Note: this only works if your data frame is a data_frame (tibble), not a data.frame (not totally sure why, haven't looked into it).
library(tidyverse)
library(stringr)
title <- c("Events in Chicago, September", "California hotels",
"Los Angeles, August", "Restaurant in Chicago")
city <- c("","", "Los Angeles", "Chicago")
state <- c("","", "California", "IL")
usa <-data_frame(title, city, state) # notice this is a data_frame not data.frame
cities <- paste(c("Los Angeles", "Chicago"), collapse = '|')
states <- paste(c("California", "IL"), collapse = '|')
usa <- usa %>%
  mutate(city = ifelse(city == '', str_extract(title, cities), city),
         state = ifelse(state == '', str_extract(title, states), state))
This results in:
# A tibble: 4 x 3
title city state
<chr> <chr> <chr>
1 Events in Chicago, September Chicago <NA>
2 California hotels <NA> California
3 Los Angeles, August Los Angeles California
4 Restaurant in Chicago Chicago IL
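The first row above still ends up with an NA state because its title only mentions a city (Chicago). To cover the third requirement from the question (take the state of a matched city from a row where both are known), here is one possible follow-up sketch, assuming each city in your data maps to a single state; the lookup and city_state names are just for illustration:
# build a city-to-state lookup from rows where both values are known
lookup <- usa %>%
  filter(!is.na(city), !is.na(state)) %>%
  select(city, city_state = state) %>%
  distinct()
# fill the missing states from the lookup
usa <- usa %>%
  left_join(lookup, by = "city") %>%
  mutate(state = coalesce(state, city_state)) %>%
  select(-city_state)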

In R, how can I view and count unique entries in a column of a data set?

I'm new to R, and I need to pull out only the names of the cities in this data set. What command would I use to do that?
Here's an option:
unique(cities$city)
You can also view the frequency that each city name occurred with:
table(cities$city)
Here's a demo with sample data:
cities <- c("New York","New York", "Los Angeles", "Boston", "Los Angeles")
unique(cities)
[1] "New York" "Los Angeles" "Boston"
table(cities)
cities
Boston Los Angeles New York
1 2 2
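If you also want the counts sorted by frequency, or as a data frame you can keep working with, here are two small follow-up sketches using the same sample vector:
# counts sorted from most to least frequent
sort(table(cities), decreasing = TRUE)
# counts as a two-column data frame
as.data.frame(table(cities))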

improve nested ifelse statement in r

I have more than 10k addresses that look like "XXX street, city, state, US", stored in a character vector.
I want to group them by state, so I use a nested ifelse to build an address data.frame with two variables, add_info and state.
library(stringr)
for (i in 1:nrow(address)) {
ifelse(str_detect(address, 'Alabama'), address[i,state]='Alabama',
ifelse(str_detect(address, 'Alaska'), address[i,state]='Alaska',
ifelse(str_detect(address, 'Arizona'), address[i,state]='Arizona',
...
ifelse(str_detect(address, 'Wyoming'), address[i,state]='Wyoming', address[i,state]=NA)...)
}
Of course, this is extremely inefficient, but I don't know how to rewrite this nested ifelse. Any idea?
There are many ways to approach this problem. This is one approach assuming that your address string always contains the full spelling of only one US state.
library(stringr)
# Get a list of all states
state.list = scan(text = "Alabama, Alaska, Arizona, Arkansas, California, Colorado, Connecticut, Delaware, Florida, Georgia, Hawaii, Idaho, Illinois, Indiana, Iowa, Kansas, Kentucky, Louisiana, Maine, Maryland, Massachusetts, Michigan, Minnesota, Mississippi, Missouri, Montana, Nebraska, Nevada, New Hampshire, New Jersey, New Mexico, New York, North Carolina, North Dakota, Ohio, Oklahoma, Oregon, Pennsylvania, Rhode Island, South Carolina, South Dakota, Tennessee, Texas, Utah, Vermont, Virginia, Washington, West Virginia, Wisconsin, Wyoming", what = "", sep = ",", strip.white = T)
# Extract state from vector address using library(stringr)
state = unlist(sapply(address, function(x) state.list[str_detect(x, state.list)]))
# Generate fake data to test
fake.address = paste0(replicate(10, paste(sample(c(0:9, LETTERS), 10, replace=TRUE), collapse="")),
                      sample(state.list, 20, rep = T),
                      replicate(10, paste(sample(c(0:9, LETTERS), 10, replace=TRUE), collapse="")))
# Test using fake address
unlist(sapply(fake.address, function(x) state.list[str_detect(x, state.list)]))
Output for fake address
O4H8V0NYEHColoradoA5K5XK35LX 44NDPQVMZ8UtahMY0I4M3086 LJ0LJW8BOBFloridaP5H2QW8B81 521IHHC1MFCaliforniaG7QTYCJRO5
"Colorado" "Utah" "Florida" "California"
YESTB7R6EPRhode IslandXEEGD4GEY3 5OHN2BR29HKansasCOKR9DY1WJ 4UXNJQW0QKNew MexicoH9GVQR3ZFY 5SYELTKO5HTexas3ONM1HU1VB
"Rhode Island" "Kansas" "New Mexico" "Texas"
Z8MKKL7K1RWashingtonGEBS7LJUU0 WPRSQEI2CNIndiana141S0Z1M2E O4H8V0NYEHNorth DakotaA5K5XK35LX 44NDPQVMZ8New HampshireMY0I4M3086
"Washington" "Indiana" "North Dakota" "New Hampshire"
LJ0LJW8BOBWest VirginiaP5H2QW8B811 LJ0LJW8BOBWest VirginiaP5H2QW8B812 521IHHC1MFNew JerseyG7QTYCJRO5 YESTB7R6EPWisconsinXEEGD4GEY3
"Virginia" "West Virginia" "New Jersey" "Wisconsin"
5OHN2BR29HOregonCOKR9DY1WJ 4UXNJQW0QKOhioH9GVQR3ZFY 5SYELTKO5HRhode Island3ONM1HU1VB Z8MKKL7K1ROklahomaGEBS7LJUU0
"Oregon" "Ohio" "Rhode Island" "Oklahoma"
WPRSQEI2CNIowa141S0Z1M2E
"Iowa"
Edit: for fuzzy matching, use the following variant based on agrep(). It should work with minor spelling mistakes. The code calls the index-assign operator `[<-` functionally so that the NA entries of the logical match vector L become FALSE before it is used to subset state.list:
unlist(sapply(fake.address, function(x) state.list[`[<-`((L <- as.logical(sapply(state.list, function(s) agrep(s, x)*1))), is.na(L), F)]))
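For instance, agrep() still finds a state when a letter is missing, as long as max.distance allows the edit (a small illustration with a deliberately misspelled string, not taken from the data above):
grep("Wisconsin", "YESTB7R6EPWisconsnXEEGD4GEY3")                     # integer(0): exact matching fails
agrep("Wisconsin", "YESTB7R6EPWisconsnXEEGD4GEY3", max.distance = 1)  # 1: approximate matching succeeds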
Assuming that your formatting is consistent (sensu Joran's comment above), you could just parse with strsplit and then use data.frame:
address1<-"410 West Street, Small Town, MN, US"
address2<-"5844 Green Street, Foo Town, NY, US"
address3<-"875 Cardinal Lane, Placeville, CA, US"
vector<-c(address1,address2,address3)
df <- t(data.frame(strsplit(vector, ", ")))
colnames(df)<-c("Number","City","State","Country")
rownames(df)<-NULL
df
which produces:
Number City State Country
[1,] "410 West Street" "Small Town" "MN" "US"
[2,] "5844 Green Street" "Foo Town" "NY" "US"
[3,] "875 Cardinal Lane" "Placeville" "CA" "US"
There are several methods.
First we need some sample data.
# some sample data
set.seed(123)
dat <- data.frame(addr=sprintf('123 street, Townville, %s, US',
                               sample(state.name, 25, replace=T)),
                  stringsAsFactors=F)
If your data is super regular like that:
# the easy way, split on commas:
matrix(unlist(strsplit(dat$addr, ',')), ncol=4, byrow=T)
Method 2, use grep to search for values. This works even if no commas or different commas in different rows. (As long as the states always appear spelled the same way)
# get a list of state name matches; need to match ', state name,' otherwise
# West Virginia counts as Virginia...
matches <- sapply(paste0(', ', state.name, ','), grep, dat$addr)
# now pair up the state name with the row it matches to
state_df <- data.frame(state=rep(state.name, sapply(matches, length)),
                       row=unname(unlist(matches)),
                       stringsAsFactors=F)
# reorder based on position in original data.frame, and there you go!
dat$state <- state_df[order(state_df$row), 'state']
This seemed to be working in my tests:
just.ST <- gsub(paste0(".+(", paste(state.name, collapse = "|"), ").+$"),
                "\\1", address)
As mentioned in the comments and illustrated in other answers, state.name is available by default. This approach does have the deficiency that, in case of a non-match, it returns the whole string, but you can probably use:
is.na(just.ST) <- nchar(just.ST) > max(nchar(state.name))
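A quick check with a couple of made-up addresses in the question's format (full state names spelled out; the third one deliberately has no valid state):
address <- c("12 Main street, Birmingham, Alabama, US",
             "34 Elm street, Cheyenne, Wyoming, US",
             "56 Oak street, Springfield, Atlantis, US")
just.ST <- gsub(paste0(".+(", paste(state.name, collapse = "|"), ").+$"),
                "\\1", address)
is.na(just.ST) <- nchar(just.ST) > max(nchar(state.name))
just.ST
# [1] "Alabama" "Wyoming" NA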
