Removing different words from vector in R

Let's say I have a long data frame in R like this:
var1 <- c("Los Angeles - CA", "New York - NY", "Seattle - WA", "Los Angeles - CA", "New York - NY")
var2 <- c(1, 2, 3, 4, 5)
df <- data.frame(var1, var2)
I want to remove the " - State", to get a result like:
var1 <- c("Los Angeles", "New York", "Seattle", "Los Angeles", "New York")
var2 <- c(1, 2, 3, 4, 5)
df <- data.frame(var1, var2)
I haven't been able to figure out how to do this, since I have more than 5,000 rows and can't simply use gsub: I'd have to spell out every state abbreviation to remove, and there are dozens of " - State" patterns I'd have to define a priori before using such functions.
Is there an easy way to remove every " - State" suffix from that column at once, using some splitting pattern that I haven't figured out yet?

A couple of options.
The most basic would be to just remove the last five characters:
library(stringr)
str_sub(var1, 1L, -6L)
Or maybe search for the pattern and delete that:
gsub(" - \\w+$","",var1)
or
str_remove_all(var1, " - \\w+$")
All three will get you the same result:
[1] "Los Angeles" "New York" "Seattle" "Los Angeles" "New York"

var1 <- c("Los Angeles - CA", "New York - NY", "Seattle - WA", "Los Angeles - CA", "New York - NY")
gsub(" - [A-Z]+$", "", var1)
[1] "Los Angeles" "New York" "Seattle" "Los Angeles" "New York"

How to extract one element of text from a column in R?

I'm working with a data frame that contains the locations where people got tested for COVID. There is no standardized formatting of the ordering facility (the place that ordered the test). My data frame looks something like this:
TestingLocation <- data.frame(TestingLocation= c("New York Hospital One", "Chicago Clinic Two", "Nursing Home Name One",
"Los Angeles University_Testing_Site", "Test-Site-in-BOSTON-MA"))
I have a list of the cities where someone could get tested.
Cities <- data.frame(PossibleTestCities=c("Los Angeles", "Chicago", "New York", "Miami", "Boston", "Austin", "Santa Fe"))
Is there a way to use the Cities frame I have to extract the city and put it into a new column? And if no city appears, to put "Unknown" or something along those lines? Ideally, my frame would look like this:
DesiredFrame <- data.frame(TestingLocation= c("New York Hospital One", "Chicago Clinic Two", "Nursing Home Name One",
"Los Angeles University_Testing_Site", "Test-Site-in-BOSTON-MA"),
TestCity= c("New York", "Chicago", "Unknown", "Los Angeles", "Boston"))
Thank you!
Does this work:
library(dplyr)
library(stringr)
library(tidyr)  # for replace_na()
TestingLocation %>%
  mutate(TestCity = str_to_title(str_extract(toupper(TestingLocation),
                                             toupper(str_c(Cities$PossibleTestCities, collapse = '|'))))) %>%
  mutate(TestCity = replace_na(TestCity, 'Unknown'))
                      TestingLocation    TestCity
1               New York Hospital One    New York
2                  Chicago Clinic Two     Chicago
3               Nursing Home Name One     Unknown
4 Los Angeles University_Testing_Site Los Angeles
5              Test-Site-in-BOSTON-MA      Boston
This doesn't look pretty but it works:
TestingLocation$TestCity <- sub("(^[a-z]+.*$)", NA,
                                sub(paste0(".*(", paste(tolower(Cities$PossibleTestCities), collapse = "|"), ").*"),
                                    "\\U\\1", tolower(TestingLocation$TestingLocation), perl = TRUE))
There are a number of operations involved: two sub calls, one nested inside the other. The inner one replaces each (lower-cased) TestingLocation$TestingLocation with the matching (lower-cased) Cities$PossibleTestCities entry and upper-cases the replacement, while the outer one sets the values that found no match, and hence remained lower-case, to NA.
Instead of a compact but hard-to-parse one-liner, you can perform the substitutions step by step:
# 1. define pattern with alternatives:
mypattern <- paste0(".*(", paste(tolower(Cities$PossibleTestCities), collapse = "|"),").*")
# 2. perform first substitution to set matches to City names:
TestingLocation$TestCity <- sub(mypattern, "\\U\\1", tolower(TestingLocation$TestingLocation), perl = T)
# 3. perform second substitution to set non-match to NA:
TestingLocation$TestCity <- sub("(^[a-z]+.*$)", NA, TestingLocation$TestCity)
Result:
TestingLocation
                      TestingLocation    TestCity
1               New York Hospital One    NEW YORK
2                  Chicago Clinic Two     CHICAGO
3               Nursing Home Name One        <NA>
4 Los Angeles University_Testing_Site LOS ANGELES
5              Test-Site-in-BOSTON-MA      BOSTON
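If you want this result to match the desired frame exactly, with title case and "Unknown" instead of <NA>, two small follow-up steps should do it. A sketch that reuses str_to_title() from the stringr-based answer above:
# fill the non-matches first, then convert the all-caps city names to title case
TestingLocation$TestCity[is.na(TestingLocation$TestCity)] <- "Unknown"
TestingLocation$TestCity <- str_to_title(TestingLocation$TestCity)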

Using compound statement in an "if" clause

I found a similar question involving templates that was beyond the scope of my question. I want to be able to say something like: if (a and b) then (do something). Here is an example:
t1 <- tribble(
~state, ~county,
"New York", "Bronx",
"New York", "Richmond",
"New York", "Albany",
"Virginia", "Richmond"
)
five_boroughs = c("Bronx", "Kings", "New York", "Queens", "Richmond")
if t1$state == "New York" && t1$county in five_boroughs
t1$county = "New York City"
Using either &, &&, in, or %in% puts New York City in Virginia. I apologize to New Yorkers for calling counties boroughs.
We can use case_when
library(dplyr)
library(stringr)
t1 %>%
  mutate(county = case_when(
    state == 'New York' & county %in% five_boroughs ~ str_c(state, ' City'),
    TRUE ~ county
  ))
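For comparison, here is a minimal base R sketch of the same logic, assuming the t1 and five_boroughs objects defined in the question. The original if attempt fails because if expects a single TRUE/FALSE, whereas ifelse() works element-wise across the whole column:
t1$county <- ifelse(t1$state == "New York" & t1$county %in% five_boroughs,
                    "New York City", t1$county)
t1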

Remove all rows that don't match a set of strings and recategorize the columns

I have a set of social media data queried from the Twitter API, which also includes people's self-reported location. However, the location strings don't follow a standard format that could be used for categorization, and sometimes there are "troll" values. Here is an example:
a1 = data.frame(x=c(1:4),y=c("181 Metro Drive San Francisco", "Wall Street New York", "Austin, TX", "The Moon"))
a1
My plan is to obtain a CSV file with all city names around the world from https://www.kaggle.com/max-mind/world-cities-database and import it into R as a vector; here is a small example:
a2 = c("New York", "Washington", "Austin")
a2
What I want to do is write an R function that cross-references a1 against a2, replaces every string in a1 that doesn't contain a value from a2 with NA, and replaces every string that does with that exact value. For example, if our function is f, its output would be as follows:
x = data.frame(x=c(1:4), y=c("San Francisco", "New York", "Austin", NA))
x
Can I write a function in R for this, or is there an existing R package built for this task? Thank you for the help.
We can paste all the city names together into a single alternation pattern and then use str_extract to extract the match.
library(stringr)
str_extract(a1, str_c(a2, collapse = "|"))
#[1] "San Francisco" "New York" "Austin" NA
data
a2 = c("New York", "Washington", "Austin", "San Francisco")
a1 = c("181 Metro Drive San Francisco", "Wall Street New York",
"Austin, TX", "The Moon")

In R, how can I view and count unique entries in a column of a data set?

I'm new to R, and I need to pull out only the names of the cities in this data set. What command would I use to do that?
Here's an option:
unique(cities$city)
You can also view the frequency with which each city name occurs:
table(cities$city)
Here's a demo with sample data:
cities <- c("New York","New York", "Los Angeles", "Boston", "Los Angeles")
unique(cities)
[1] "New York" "Los Angeles" "Boston"
table(cities)
cities
Boston Los Angeles New York
     1           2        2
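If you would rather see the counts sorted, or work with them as a data frame, a couple of base R one-liners should do it. A sketch using the cities vector from the demo above:
# frequency table sorted from most to least common
sort(table(cities), decreasing = TRUE)
# the same counts as a data frame with columns "cities" and "Freq"
as.data.frame(sort(table(cities), decreasing = TRUE))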

R - using regex to delete all strings with 2 characters or less [duplicate]

I've got a problem and I'm sure it's super simple to fix, but I've been searching for an answer for about an hour and can't seem to work it out.
I have a character vector with data that looks a bit like this:
[5] "Toronto, ON" "Manchester, UK"
[7] "New York City, NY" "Newark, NJ"
[9] "Melbourne" "Los Angeles, CA"
[11] "New York, USA" "Liverpool, England"
[13] "Fort Collins, CO" "London, UK"
[15] "New York, NY"
and basically I'd like to get rid of all character elements that are 2 characters or shorter, so that the data can then look as follows:
[5] "Toronto, " "Manchester, "
[7] "New York City, " "Newark, "
[9] "Melbourne" "Los Angeles, "
[11] "New York, USA" "Liverpool, England"
[13] "Fort Collins, " "London, "
[15] "New York, "
The commas I know how to get rid of. As I said, I'm sure this is super simple, any help would be greatly appreciated. Thanks!
You can use a quantifier on the word-character class \\w together with word boundaries: \\b\\w{1,2}\\b matches any word of one or two characters. Use gsub to remove it, since you may have multiple matches per string:
gsub("\\b\\w{1,2}\\b", "", v)
# [1] "Toronto, " "Manchester, " "New York City, " "Newark, " "Melbourne" "Los Angeles, " "New York, USA"
# [8] "Liverpool, England" "Fort Collins, " "London, " "New York, "
Note that \\w matches letters, digits, and the underscore; if you only want to take letters into account, you can use gsub("\\b[a-zA-Z]{1,2}\\b", "", v).
v <- c("Toronto, ON", "Manchester, UK", "New York City, NY", "Newark, NJ", "Melbourne", "Los Angeles, CA", "New York, USA", "Liverpool, England", "Fort Collins, CO", "London, UK", "New York, NY")
Doesn't use regex but it gets the job done:
d <- c("Toronto, ON", "Manchester, UK",
       "New York City, NY", "Newark, NJ",
       "Melbourne", "Los Angeles, CA",
       "New York, USA", "Liverpool, England",
       "Fort Collins, CO", "London, UK",
       "New York, NY")
# split each element into whitespace-separated tokens
toks <- strsplit(d, "\\s+")
# character count of every token (a list here, since the elements have different numbers of tokens)
lens <- sapply(toks, nchar)
# keep only the tokens longer than two characters and paste each element back together
mapply(function(a, b) paste(a[b > 2], collapse = " "), toks, lens)
