Improve nested ifelse statement in R

I have more than 10k addresses, each of the form "XXX street, city, state, US", in a character vector.
I want to group them by state, so I use nested ifelse to build an address data.frame with two variables, add_info and state.
library(stringr)
for (i in nrow(address){
ifelse(str_detect(address, 'Alabama'), address[i,state]='Alabama',
ifelse(str_detect(address, 'Alaska'), address[i,state]='Alaska',
ifelse(str_detect(address, 'Arizona'), address[i,state]='Arizona',
...
ifelse(str_detect(address, 'Wyoming'), address[i,state]='Wyoming', address[i,state]=NA)...)
}
Of course, this is extremely inefficient, but I don't know how to rewrite this nested ifelse. Any idea?

There are many ways to approach this problem. This is one approach assuming that your address string always contains the full spelling of only one US state.
library(stringr)
# Get a list of all states
state.list = scan(text = "Alabama, Alaska, Arizona, Arkansas, California, Colorado, Connecticut, Delaware, Florida, Georgia, Hawaii, Idaho, Illinois, Indiana, Iowa, Kansas, Kentucky, Louisiana, Maine, Maryland, Massachusetts, Michigan, Minnesota, Mississippi, Missouri, Montana, Nebraska, Nevada, New Hampshire, New Jersey, New Mexico, New York, North Carolina, North Dakota, Ohio, Oklahoma, Oregon, Pennsylvania, Rhode Island, South Carolina, South Dakota, Tennessee, Texas, Utah, Vermont, Virginia, Washington, West Virginia, Wisconsin, Wyoming", what = "", sep = ",", strip.white = T)
# Extract state from vector address using library(stringr)
state = unlist(sapply(address, function(x) state.list[str_detect(x, state.list)]))
# Generate fake data to test
fake.address = paste0(replicate(10, paste(sample(c(0:9, LETTERS), 10, replace=TRUE), collapse="")),
sample(state.list, 20, rep = T),
replicate(10, paste(sample(c(0:9, LETTERS), 10, replace=TRUE), collapse="")))
# Test using fake address
unlist(sapply(fake.address, function(x) state.list[str_detect(x, state.list)]))
Output for fake address
O4H8V0NYEHColoradoA5K5XK35LX 44NDPQVMZ8UtahMY0I4M3086 LJ0LJW8BOBFloridaP5H2QW8B81 521IHHC1MFCaliforniaG7QTYCJRO5
"Colorado" "Utah" "Florida" "California"
YESTB7R6EPRhode IslandXEEGD4GEY3 5OHN2BR29HKansasCOKR9DY1WJ 4UXNJQW0QKNew MexicoH9GVQR3ZFY 5SYELTKO5HTexas3ONM1HU1VB
"Rhode Island" "Kansas" "New Mexico" "Texas"
Z8MKKL7K1RWashingtonGEBS7LJUU0 WPRSQEI2CNIndiana141S0Z1M2E O4H8V0NYEHNorth DakotaA5K5XK35LX 44NDPQVMZ8New HampshireMY0I4M3086
"Washington" "Indiana" "North Dakota" "New Hampshire"
LJ0LJW8BOBWest VirginiaP5H2QW8B811 LJ0LJW8BOBWest VirginiaP5H2QW8B812 521IHHC1MFNew JerseyG7QTYCJRO5 YESTB7R6EPWisconsinXEEGD4GEY3
"Virginia" "West Virginia" "New Jersey" "Wisconsin"
5OHN2BR29HOregonCOKR9DY1WJ 4UXNJQW0QKOhioH9GVQR3ZFY 5SYELTKO5HRhode Island3ONM1HU1VB Z8MKKL7K1ROklahomaGEBS7LJUU0
"Oregon" "Ohio" "Rhode Island" "Oklahoma"
WPRSQEI2CNIowa141S0Z1M2E
"Iowa"
Edit: Use the following function based on agrep() for fuzzy matching. It should tolerate minor spelling mistakes. The code calls the index-assign operator `[<-` functionally, so the backticks around it are required.
unlist(sapply(fake.address, function(x) state.list[`[<-`((L<-as.logical(sapply(state.list, function(s) agrep(s, x)*1))),is.na(L),F)]))
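A vectorised alternative to checking every address against every state, offered as a minimal sketch rather than a drop-in replacement. It assumes each address contains at most one fully spelled state name and relies on base R's built-in state.name vector; the three sample addresses are made up for illustration.

```r
# Hypothetical sample addresses (not from the original question)
address <- c("12 Main street, Mobile, Alabama, US",
             "7 Oak street, Charleston, West Virginia, US",
             "99 Elm street, Toronto, Ontario, CA")

# One alternation pattern covering all 50 state names
pattern <- paste(state.name, collapse = "|")

# regexpr() gives the position of the leftmost match per element, or -1 if none
m <- regexpr(pattern, address)

# regmatches() returns only the matched substrings, so fill them in where m > 0
state <- rep(NA_character_, length(address))
state[m > 0] <- regmatches(address, m)

result <- data.frame(add_info = address, state = state, stringsAsFactors = FALSE)
result
```

Because the leftmost match wins, "West Virginia" is found before the "Virginia" embedded in it, and addresses without a state name end up as NA.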

Assuming that your formatting is consistent (sensu Joran's comment above), you could just parse with strsplit and then use data.frame:
address1<-"410 West Street, Small Town, MN, US"
address2<-"5844 Green Street, Foo Town, NY, US"
address3<-"875 Cardinal Lane, Placeville, CA, US"
vector<-c(address1,address2,address3)
df<-t(data.frame(strsplit(vector,", ")))
colnames(df)<-c("Number","City","State","Country")
rownames(df)<-NULL
df
which produces:
     Number              City         State Country
[1,] "410 West Street"   "Small Town" "MN"  "US"
[2,] "5844 Green Street" "Foo Town"   "NY"  "US"
[3,] "875 Cardinal Lane" "Placeville" "CA"  "US"

There are several methods.
First we need some sample data.
# some sample data
set.seed(123)
dat <- data.frame(addr=sprintf('123 street, Townville, %s, US',
sample(state.name, 25, replace=T)),
stringsAsFactors=F)
If your data is super regular like that:
# the easy way, split on commas:
matrix(unlist(strsplit(dat$addr, ',')), ncol=4, byrow=T)
Method 2, use grep to search for values. This works even if no commas or different commas in different rows. (As long as the states always appear spelled the same way)
# get a list of state name matches; need to match ', state name,' otherwise
# West Virginia counts as Virginia...
matches <- sapply(paste0(', ', state.name, ','), grep, dat$addr)
# now pair up the state name with the row it matches to
state_df <- data.frame(state=rep(state.name, sapply(matches, length)),
row=unname(unlist(matches)),
stringsAsFactors=F)
# reorder based on position in original data.frame, and there you go!
dat$state <- state_df[order(state_df$row), 'state']

This seemed to be working in my tests:
just.ST <- gsub( paste0(".+(", paste(state.name,collapse="|"), ").+$"),
"\\1", address)
As mentioned in comments and illustrated in other answers, state.name should be available by default. It does have the deficiency that in case of a non-match it returns the whole string, but you can probably use:
is.na(just.ST) <- nchar(just.ST) > max(nchar(state.name))

Related

How to identify all country names mentioned in a string and split accordingly?

I have a string that contains country and other region names. I am only interested in the country names and would ideally like to add several columns, each of which contains a country name listed in the string. Here is example code showing how the data frame is set up:
df <- data.frame(id = c(1,2,3),
country = c("Cote d'Ivoire Africa Developing Economies West Africa",
"South Africa United Kingdom Africa BRICS Countries",
"Myanmar Gambia Bangladesh Netherlands Africa Asia"))
If I only split the string by space, those countries which contain a space get lost (e.g. "United Kingdom"). See here:
df2 <- separate(df, country, paste0("C",3:8), sep=" ")
Therefore, I tried to look up country names using the world.cities dataset. However, this only seems to loop through the string until there is a non-country name. See here:
library(maps)
library(stringr)
all_countries <- str_c(unique(world.cities$country.etc), collapse = "|")
df$c1 <- sapply(str_extract_all(df$country, all_countries), toString)
I am wondering whether it's possible to use the space as a delimiter but define exceptions (like "United Kingdom"). This might obviously require some manual work, but appears to be the most feasible solution to me. Does anyone know how to define such exceptions? I am of course also open to and thankful for any other solutions.
UPDATE:
I figured out another solution using the countrycode package:
library(countrycode)
countries <- data.frame(countryname_dict)
countries$continent <- countrycode(sourcevar = countries[["country.name.en"]],
origin = "country.name.en",
destination = "continent")
africa <- countries[ which(countries$continent=='Africa'), ]
library(stringr)
pat <- paste0("\\b", paste(africa$country.name.en , collapse="\\b|\\b"), "\\b")
df$country_list <- str_extract_all(df$country, regex(pat, ignore_case = TRUE))
You could do:
library(stringi)
vec <- stri_trans_general(countrycode::codelist$country.name.en, id = "Latin-ASCII")
stri_extract_all(df$country,regex = sprintf(r"(\b(%s)\b)",stri_c(vec,collapse = "|")))
[[1]]
[1] "Cote d'Ivoire"
[[2]]
[1] "South Africa" "United Kingdom"
[[3]]
[1] "Gambia" "Bangladesh" "Netherlands"
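For comparison, the same idea can be sketched in base R without extra packages. The short countries vector below is a hand-made stand-in (an assumption for illustration, not a complete list); in practice it would come from a source such as countrycode.

```r
df <- data.frame(id = c(1, 2, 3),
                 country = c("Cote d'Ivoire Africa Developing Economies West Africa",
                             "South Africa United Kingdom Africa BRICS Countries",
                             "Myanmar Gambia Bangladesh Netherlands Africa Asia"),
                 stringsAsFactors = FALSE)

# Hypothetical, hand-made list of country names (multi-word names included)
countries <- c("Cote d'Ivoire", "South Africa", "United Kingdom",
               "Myanmar", "Gambia", "Bangladesh", "Netherlands")

# Word boundaries keep a name from matching inside a longer word
pat <- paste0("\\b(", paste(countries, collapse = "|"), ")\\b")

# gregexpr() finds every match per string; regmatches() extracts them as a list
found <- regmatches(df$country, gregexpr(pat, df$country, perl = TRUE))
found
```

Multi-word names such as "United Kingdom" survive because the match is driven by the name list, not by splitting on spaces.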

How to extract one element of text from a column in R?

I'm working with a data frame that contains the locations where people got tested for COVID. There is no standardization in the formatting of the ordering facility (the place that ordered the test). My data frame looks something like this:
TestingLocation <- data.frame(TestingLocation= c("New York Hospital One", "Chicago Clinic Two", "Nursing Home Name One",
"Los Angeles University_Testing_Site", "Test-Site-in-BOSTON-MA"))
I have a list of the cities where someone could get tested.
Cities <- data.frame(PossibleTestCities=c("Los Angeles", "Chicago", "New York", "Miami", "Boston", "Austin", "Santa Fe"))
Is there a way to use the Cities frame I have to extract the city and put it into a new column. Additionally, if no city appears, to put "Unknown" or something along those lines? Ideally, my frame would look like this:
DesiredFrame <- data.frame(TestingLocation= c("New York Hospital One", "Chicago Clinic Two", "Nursing Home Name One",
"Los Angeles University_Testing_Site", "Test-Site-in-BOSTON-MA"),
TestCity= c("New York", "Chicago", "Unknown", "Los Angeles", "Boston"))
Thank you!
Does this work:
library(dplyr)
library(stringr)
library(tidyr) # provides replace_na()
TestingLocation %>% mutate(TestCity = str_to_title(str_extract(toupper(TestingLocation), toupper(str_c(Cities$PossibleTestCities, collapse = '|'))))) %>%
mutate(TestCity = replace_na(TestCity, 'Unknown'))
TestingLocation TestCity
1 New York Hospital One New York
2 Chicago Clinic Two Chicago
3 Nursing Home Name One Unknown
4 Los Angeles University_Testing_Site Los Angeles
5 Test-Site-in-BOSTON-MA Boston
This doesn't look pretty but it works:
TestingLocation$TestCity <- sub("(^[a-z]+.*$)", NA, sub(paste0(".*(",
paste(tolower(Cities$PossibleTestCities), collapse = "|"),").*"),
"\\U\\1", tolower(TestingLocation$TestingLocation), perl = T))
There are a number of operations involved. There are two sub operations, one nested in the other. The first replaces the (lower-case) TestingLocation$TestingLocation with the matching (lower-case) Cities$PossibleTestCities and sets the replacements to upper-case, while the second sets the values that did not find a match, and hence remained lower-case, to NA.
Instead of using a compact but hard-to-parse single piece of code you can achieve the substitutions step-by-step:
# 1. define pattern with alternatives:
mypattern <- paste0(".*(", paste(tolower(Cities$PossibleTestCities), collapse = "|"),").*")
# 2. perform first substitution to set matches to City names:
TestingLocation$TestCity <- sub(mypattern, "\\U\\1", tolower(TestingLocation$TestingLocation), perl = T)
# 3. perform second substitution to set non-match to NA:
TestingLocation$TestCity <- sub("(^[a-z]+.*$)", NA, TestingLocation$TestCity)
Result:
TestingLocation
TestingLocation TestCity
1 New York Hospital One NEW YORK
2 Chicago Clinic Two CHICAGO
3 Nursing Home Name One <NA>
4 Los Angeles University_Testing_Site LOS ANGELES
5 Test-Site-in-BOSTON-MA BOSTON
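If the original spelling of the city names should be kept (rather than upper case) and non-matches should read "Unknown", one possible base R sketch is shown below. It assumes at most one city per location and that stringsAsFactors = FALSE is acceptable; found and idx are names introduced here for illustration.

```r
TestingLocation <- data.frame(
  TestingLocation = c("New York Hospital One", "Chicago Clinic Two", "Nursing Home Name One",
                      "Los Angeles University_Testing_Site", "Test-Site-in-BOSTON-MA"),
  stringsAsFactors = FALSE)
Cities <- data.frame(
  PossibleTestCities = c("Los Angeles", "Chicago", "New York", "Miami",
                         "Boston", "Austin", "Santa Fe"),
  stringsAsFactors = FALSE)

# Case-insensitive search for any of the city names
pattern <- paste(Cities$PossibleTestCities, collapse = "|")
m <- regexpr(pattern, TestingLocation$TestingLocation, ignore.case = TRUE)

# Matched text as it appears in the data (e.g. "BOSTON"), NA where nothing matched
found <- rep(NA_character_, nrow(TestingLocation))
found[m > 0] <- regmatches(TestingLocation$TestingLocation, m)

# Map the matched text back to the canonical spelling in Cities
idx <- match(tolower(found), tolower(Cities$PossibleTestCities))
TestingLocation$TestCity <- ifelse(is.na(idx), "Unknown", Cities$PossibleTestCities[idx])
TestingLocation
```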

Is there syntactic sugar to define a data frame in R

I want to regroup US states by regions and thus I need to define a "US state" -> "US Region" mapping function, which is done by setting up an appropriate data frame.
The basis is this exercise (apparently this is a map of the "Commonwealth of the Fallout"):
One starts off with an original list in raw form:
Alabama = "Gulf"
Arizona = "Four States"
Arkansas = "Texas"
California = "South West"
Colorado = "Four States"
Connecticut = "New England"
Delaware = "Columbia"
which eventually leads to this R code:
us_state <- c("Alabama","Arizona","Arkansas","California","Colorado","Connecticut",
"Delaware","District of Columbia","Florida","Georgia","Idaho","Illinois","Indiana",
"Iowa","Kansas","Kentucky","Louisiana","Maine","Maryland","Massachusetts","Michigan",
"Minnesota","Mississippi","Missouri","Montana","Nebraska","Nevada","New Hampshire",
"New Jersey","New Mexico","New York","North Carolina","North Dakota","Ohio","Oklahoma",
"Oregon","Pennsylvania","Rhode Island","South Carolina","South Dakota","Tennessee",
"Texas","Utah","Vermont","Virginia","Washington","West Virginia ","Wisconsin","Wyoming")
us_region <- c("Gulf","Four States","Texas","South West","Four States","New England",
"Columbia","Columbia","Gulf","Southeast","North West","Midwest","Midwest","Plains",
"Plains","East Central","Gulf","New England","Columbia","New England","Midwest",
"Midwest","Gulf","Plains","North","Plains","South West","New England","Eastern",
"Four States","Eastern","Southeast","North","East Central","Plains","North West",
"Eastern","New England","Southeast","North","East Central","Texas","Four States",
"New England","Columbia","North West","Eastern","Midwest","North")
us_state_to_region_map <- data.frame(us_state, us_region, stringsAsFactors=FALSE)
which is supremely ugly and unmaintainable as the State -> Region mapping is effectively
obfuscated.
I actually wrote a Perl program to generate the above from the original list.
In Perl, one would write things like:
#!/usr/bin/perl
$mapping = {
"Alabama"=> "Gulf",
"Arizona"=> "Four States",
"Arkansas"=> "Texas",
"California"=> "South West",
"Colorado"=> "Four States",
"Connecticut"=> "New England",
...etc...etc...
"West Virginia "=> "Eastern",
"Wisconsin"=> "Midwest",
"Wyoming"=> "North" };
which is maintainable because one can verify the mapping on a line-by-line basis.
There must be something similar to this Perl goodness in R?
It seems a bit open for interpretation as to what you're looking for.
Is the mapping meant to be a function type thing such that a call would return the region or vice versa (e.g. similar to a function call mapping("alabama") => "Gulf")?
I am reading the question to be more looking for a dictionary style storage, which in R could be obtained with an equivalent named list
ncountry <- 49
mapping <- as.list(c("Gulf","Four States",
...
,"Midwest","North"))
names(mapping) <- c("Alabama","Arizona",
...
,"Wisconsin","Wyoming")
mapping[["Pennsylvania"]]
[1] "Eastern"
This could be performed in a single call
mapping <- list("Alabama" = "Gulf",
"Arizona" = "Four States",
...,
"Wisconsin" = "Midwest",
"Wyoming" = "North")
Which makes it very simple to check that the mapping is working as expected. This doesn't convert nicely to a 2 column data.frame however, which we would then obtain using
mapping_df <- data.frame(region = unlist(mapping), state = names(mapping))
note "not nicely" simply means as.data.frame doesn't translate the input into a 2 column output.
Alternatively just using a named character vector would likely be fine too
mapping_c <- c("Alabama" = "Gulf",
"Arizona" = "Four States",
...,
"Wisconsin" = "Midwest",
"Wyoming" = "North")
which would be converted to a data.frame in almost the same fashion
mapping_df_c <- data.frame(region = mapping_c, state = names(mapping_c))
Note however a slight difference in the two choices of storage. While referencing an entry that exists using either single brackets [ or double brackets [[ works just fine
#Works:
mapping_c["Pennsylvania"] == mapping["Pennsylvania"]
#output
Pennsylvania
TRUE
mapping_c[["Pennsylvania"]] == mapping[["Pennsylvania"]]
[1] TRUE
But when referencing unknown entries these differ slightly in behaviour
#works sorta:
mapping_c["hello"] == mapping["hello"]
#output
$<NA>
NULL
#Does not work:
mapping_c[["hello"]] == mapping[["hello"]]
Error in mapping_c[["hello"]] : subscript out of bounds
If you are converting your input into a data.frame this is not an issue, but it is worth being aware of this, so you obtain the behaviour expected.
Of course you could use a function call to create a proper dictionary with a simple switch statement. I don't think that would be any prettier though.
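To illustrate how a named character vector can act as the lookup itself, here is a minimal sketch using only the first three entries from the question; the states vector, including the deliberately unknown "Nowhere", is made up for the example.

```r
# First three state -> region pairs from the question
mapping_c <- c("Alabama"  = "Gulf",
               "Arizona"  = "Four States",
               "Arkansas" = "Texas")

# Hypothetical input to translate; "Nowhere" is intentionally not in the mapping
states <- c("Arizona", "Alabama", "Nowhere")

# Indexing by name is vectorised; names missing from the mapping come back as NA
regions <- unname(mapping_c[states])
regions
```

Single-bracket indexing is used on purpose here: it returns NA for unknown names instead of raising the "subscript out of bounds" error that [[ gives.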
If us_region is a named list...
us_region <- list(Alabama = "Gulf",
Arizona = "Four States",
Arkansas = "Texas",
California = "South West",
Colorado = "Four States",
Connecticut = "New England",
Delaware = "Columbia")
Then,
us_state_to_region_map <- data.frame(us_state = names(us_region),
us_region = sapply(us_region, c),
stringsAsFactors = FALSE)
and, as a bonus, you also get the states as row names...
us_state_to_region_map
us_state us_region
Alabama Alabama Gulf
Arizona Arizona Four States
Arkansas Arkansas Texas
California California South West
Colorado Colorado Four States
Connecticut Connecticut New England
Delaware Delaware Columbia
As @tim-biegeleisen says, it could be more appropriate to maintain this dataset in a database, a CSV file or a spreadsheet and open it in R (with readxl::read_excel(), readr::read_csv(), ...).
However, if you want to write it directly in your code you can use tibble::tribble(), which lets you write a data frame row by row:
library(tibble)
tribble(~ state, ~ region,
"Alabama", "Gulf",
"Arizona", "Four States",
(...)
"Wisconsin", "Midwest",
"Wyoming", "North")
One option could be to create a data frame in wide format (your initial list makes it very straightforward and this maintains a very obvious mapping. It is actually quite similar to your Perl code), then transform it to the long format:
library(tidyr)
data.frame(
Alabama = "Gulf",
Arizona = "Four States",
Arkansas = "Texas",
California = "South West",
Colorado = "Four States",
Connecticut = "New England",
Delaware = "Columbia",
stringsAsFactors = FALSE
) %>%
gather("us_state", "us_region") # transform to long format

Using grepl to subset dataframe containing the same mentioning of some text in two columns

I'm working on a dataframe (account) with two columns containing "posting" IP location (in the column city) and the locations at the time when those accounts were first registered (in the column register). I'm using grepl() to subset rows whose posting location and register location are both from the state of New York (NY). Below are part of the data and my code for subsetting the desired output:
account <- data.frame(city = c("Beijing, China", "New York, NY", "Hoboken, NJ", "Los Angeles, CA", "New York, NY", "Bloomington, IN"),
register = c("New York, NY", "New York, NY", "Wilwaukee, WI", "Rochester, NY", "New York, NY", "Tokyo, Japan"))
sub_data <- subset(account, grepl("NY", city) == "NY" & grepl("NY", register) == "NY")
sub_data
[1] city register
<0 rows> (or 0-length row.names)
My code didn't work and returned 0 rows (while at least two rows should have met my selection criterion). What went wrong in my code?
I have referenced this previous thread before lodging this question.
The function grepl already returns a logical vector, so just use the following:
sub_data <- subset(account,
grepl("NY", city) & grepl("NY", register)
)
By using something like grepl("NY", city) == "NY" you are asking R if any values in FALSE TRUE FALSE FALSE TRUE FALSE are equal to "NY", which is of course false.
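The difference can be checked step by step. This is a small sketch on the question's data; ny_city and ny_register are helper names introduced here.

```r
account <- data.frame(
  city = c("Beijing, China", "New York, NY", "Hoboken, NJ",
           "Los Angeles, CA", "New York, NY", "Bloomington, IN"),
  register = c("New York, NY", "New York, NY", "Wilwaukee, WI",
               "Rochester, NY", "New York, NY", "Tokyo, Japan"),
  stringsAsFactors = FALSE)

# grepl() already yields one TRUE/FALSE per row
ny_city     <- grepl("NY", account$city)
ny_register <- grepl("NY", account$register)

# Comparing logicals with "NY" coerces them to "TRUE"/"FALSE", which never equal "NY"
any(ny_city == "NY")

# Combine the two logical vectors directly to keep rows where both columns mention NY
sub_data <- account[ny_city & ny_register, ]
sub_data
```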

Remove specific string at the end position of each row from dataframe(csv)

I am trying to clean a data set in CSV format. After loading the data into R, I need to replace and remove some characters from it. Below is an example. Ideally I want to
replace the St at the end of each row with Street;
in cases where there is St St.,
remove St and replace St. with just Street.
I tried to use this code
sub(x = evostreet, pattern = "St.", replacement = " ") and later
gsub(x = evostreet, pattern = "St.", replacement = " ") to remove the St. at the end of each row, but this also removes other occurrences of St together with the following character
3 James St.
4 Glover Road St.
5 Jubilee Estate. St.
7 Fed Housing Estate St.
8 River State School St.
9 Brown State Veterinary Clinic. St.
11 Saw Mill St.
12 Dyke St St.
13 Governor Rd St.
I'm seeing a lot of close answers but I'm not seeing any that address the second problem he's having such as replacing "St St." with "Street"; e.g., "Dyke St St."
sub, as stated in the documentation:
The two *sub functions differ only in that sub replaces only the first occurrence of a pattern
So, just using "St\\." as the pattern match is incorrect.
OP needs to match a possible pattern of "St St." and I'll further assume that it could even be "St. St." or "St. St".
Assuming OP is using a simple list:
x = c("James St.", "Glover Road St.", "Jubilee Estate. St.",
"Fed Housing Estate St.", "River State School St St.",
"Brown State Vet Clinic. St. St.", "Dyke St St.")
[1] "James St." "Glover Road St."
[3] "Jubilee Estate. St." "Fed Housing Estate St."
[5] "River State School St St." "Brown State Vet Clinic. St. St."
[7] "Dyke St St."
Then the following will replace the possible combinations mentioned above with "Street", as requested:
y <- sub(x, pattern = "[ St\\.]*$", replacement = " Street")
[1] "James Street" "Glover Road Street"
[3] "Jubilee Estate Street" "Fed Housing Estate Street"
[5] "River State School Street" "Brown State Vet Clinic Street"
[7] "Dyke Street"
Edit:
To answer OP's question below in regard to replacing one substr of St. with Saint and another with Street, I was looking for a way to be able to match similar expressions to return different values but at this point I haven't been able to find it. I suspect regmatches can do this but it's something I'll have to fiddle with later.
A simple way to accomplish what you're wanting - let's assume:
x <- c("St. Mary St St.", "River State School St St.", "Dyke St. St")
[1] "St. Mary St St." "River State School St St."
[3] "Dyke St. St"
So you want x[1] to be Saint Mary Street, x[2] to be River State School Street and x[3] to be Dyke Street. I would want to resolve the Saint issue first by assigning sub() to y like:
y <- sub(x, pattern = "^St\\.", replacement = "Saint")
[1] "Saint Mary St St." "River State School St St."
[3] "Dyke St. St"
To resolve the St's at the end, we can use the same resolution as I posted, except notice now I'm not using x as my input vector but instead the y I just made:
y <- sub(y, pattern = "[ St\\.]*$", replacement = " Street")
[1] "Saint Mary Street" "River State School Street"
[3] "Dyke Street"
And that should take care of it. Now, I don't know if this is the most efficient way. And if your dataset is rather large this may run slowly. If I find a better solution I will post it (provided no one else beats me).
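One thing to watch with a character class such as [ St\\.] is that it matches any trailing run of spaces, S, t and dots, so a name that itself ends in one of those letters can lose characters. A possible tightening, sketched under the assumption that the suffix is always one or more " St" or " St." tokens (optionally preceded by a stray dot), is to group the whole token. The "West St." entry is a made-up test case.

```r
x <- c("James St.", "Jubilee Estate. St.", "Dyke St St.",
       "Brown State Vet Clinic. St. St.", "Dyke St. St",
       "West St.")  # hypothetical case: the name ends in "t"

# Optional stray dot, then one or more whole " St" / " St." tokens at the end
fixed <- sub("\\.?( St\\.?)+$", " Street", x)
fixed

# The character-class version also strips the final "t" of "West"
sub("[ St\\.]*$", " Street", "West St.")
```

Matching whole tokens keeps "West" intact, while "St" inside words such as "State" or "Estate" is untouched because the pattern is anchored to the end of the string.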
You don't need to use regular expression here.
sub(x = evostreet, pattern = "St.", replacement = " ", fixed=T)
The fixed argument means that you want to replace this exact character, not matches of a regular expression.
I think that your problem is that the '.' character in the regular expression world means "any single character". So to match literally in R you should write
sub(x = evostreet, pattern = "St\\.", replacement = " ")
You will need to escape the dot; otherwise it matches any character after St, and that is why some other parts of your text are eliminated.
sub(x = evostreet, pattern = "St\\.", replacement = " ")
You can add $ at the end if you want to remove the tag only where it appears at the end of the text.
sub(x = evostreet, pattern = "St\\.$", replacement = " ")
The difference between sub and gsub is that sub handles only the first occurrence of your tag in a text, while gsub replaces all occurrences if there are duplicates. In your case, as you are looking for the pattern at the end of the line, it should not make any difference once you use the $.

Resources