This question already has an answer here:
How to use ifelse and paste functions
(1 answer)
Closed last year.
I am importing a dataset from SAS which has columns with named headers, and thousands of rows of entries, which I have named start_list. one column is fund, in which we have a selection of strings eg 'South East Auth', 'North West', 'South' etc. I want to add the word 'auth' to the North west and South strings, but leave the strings that already have this sunch as south east, alone.
I don't need to do anything clever like look for auth - there are only a few that need changing. I just want to basically, in laymans code, do
if string in column 'fund' = 'south east' then change to 'south east auth' and apply this to thousands of instances of it
vec <- c("South West", "South East", "North East", "North West")
paste0(vec, ifelse(vec %in% c("South East", "North West"), " Auth", ""))
# [1] "South West" "South East Auth" "North East" "North West Auth"
If you want it to be case-insensitive,
paste0(vec, ifelse(tolower(vec) %in% c("south east", "north west"), " Auth", ""))
# [1] "South West" "South East Auth" "North East" "North West Auth"
I found a similar question involving templates that was was beyond the scope of my question. I want to be able to say something like: if (a and b) then (do something). Here is an example:
t1 <- tribble(
~state, ~county,
"New York", "Bronx",
"New York", "Richmond",
"New York", "Albany",
"Virginia", "Richmond"
)
five_boroughs = c("Bronx", "Kings", "New York", "Queens", "Richmond")
if t1$state == "New York" && t1$county in five_boroughs
t1$county = "New York City"
Using either &, &&, in, or %in% puts New York City in Virginia. I apologize to New Yorkers for calling counties boroughs.
We can use case_when
library(dplyr)
library(stringr)
t1 %>%
mutate(county = case_when(state == 'New York' &
county %in% five_boroughs~ str_c(state, ' City')))
I want to regroup US states by regions and thus I need to define a "US state" -> "US Region" mapping function, which is done by setting up an appropriate data frame.
The basis is this exercise (apparently this is a map of the "Commonwealth of the Fallout"):
One starts off with an original list in raw form:
Alabama = "Gulf"
Arizona = "Four States"
Arkansas = "Texas"
California = "South West"
Colorado = "Four States"
Connecticut = "New England"
Delaware = "Columbia"
which eventually leads to this R code:
us_state <- c("Alabama","Arizona","Arkansas","California","Colorado","Connecticut",
"Delaware","District of Columbia","Florida","Georgia","Idaho","Illinois","Indiana",
"Iowa","Kansas","Kentucky","Louisiana","Maine","Maryland","Massachusetts","Michigan",
"Minnesota","Mississippi","Missouri","Montana","Nebraska","Nevada","New Hampshire",
"New Jersey","New Mexico","New York","North Carolina","North Dakota","Ohio","Oklahoma",
"Oregon","Pennsylvania","Rhode Island","South Carolina","South Dakota","Tennessee",
"Texas","Utah","Vermont","Virginia","Washington","West Virginia ","Wisconsin","Wyoming")
us_region <- c("Gulf","Four States","Texas","South West","Four States","New England",
"Columbia","Columbia","Gulf","Southeast","North West","Midwest","Midwest","Plains",
"Plains","East Central","Gulf","New England","Columbia","New England","Midwest",
"Midwest","Gulf","Plains","North","Plains","South West","New England","Eastern",
"Four States","Eastern","Southeast","North","East Central","Plains","North West",
"Eastern","New England","Southeast","North","East Central","Texas","Four States",
"New England","Columbia","North West","Eastern","Midwest","North")
us_state_to_region_map <- data.frame(us_state, us_region, stringsAsFactors=FALSE)
which is supremely ugly and unmaintainable as the State -> Region mapping is effectively
obfuscated.
I actually wrote a Perl program to generate the above from the original list.
In Perl, one would write things like:
#!/usr/bin/perl
$mapping = {
"Alabama"=> "Gulf",
"Arizona"=> "Four States",
"Arkansas"=> "Texas",
"California"=> "South West",
"Colorado"=> "Four States",
"Connecticut"=> "New England",
...etc...etc...
"West Virginia "=> "Eastern",
"Wisconsin"=> "Midwest",
"Wyoming"=> "North" };
which is maintainable because one can verify the mapping on a line-by-line basis.
There must be something similar to this Perl goodness in R?
It seems a bit open for interpretation as to what you're looking for.
Is the mapping meant to be a function type thing such that a call would return the region or vise-versa (Eg. similar to a function call mapping("alabama") => "Gulf")?
I am reading the question to be more looking for a dictionary style storage, which in R could be obtained with an equivalent named list
ncountry <- 49
mapping <- as.list(c("Gulf","Four States",
...
,"Midwest","North"))
names(mapping) <- c("Alabama","Arizona",
...
,"Wisconsin","Wyoming")
mapping[["Pennsylvania"]]
[1] "Eastern"
This could be performed in a single call
mapping <- list("Alabama" = "Gulf",
"Arizona" = "Four States",
...,
"Wisconsin" = "Midwest",
"Wyoming" = "North")
Which makes it very simple to check that the mapping is working as expected. This doesn't convert nicely to a 2 column data.frame however, which we would then obtain using
mapping_df <- data.frame(region = unlist(mapping), state = names(mapping))
note "not nicely" simply means as.data.frame doesn't translate the input into a 2 column output.
Alternatively just using a named character vector would likely be fine too
mapping_c <- c("Alabama" = "Gulf",
"Arizona" = "Four States",
...,
"Wisconsin" = "Midwest",
"Wyoming" = "North")
which would be converted to a data.frame in almost the same fashion
mapping_df_c <- data.frame(region = mapping_c, state = names(mapping_c))
Note however a slight difference in the two choices of storage. While referencing an entry that exists using either single brackets [ or double brackets [[ works just fine
#Works:
mapping_c["Pennsylvania"] == mapping["Pennsylvania"]
#output
Pennsylvania
TRUE
mapping_c[["Pennsylvania"]] == mapping[["Pennsylvania"]]
[1] TRUE
But when referencing unknown entries these differ slightly in behaviour
#works sorta:
mapping_c["hello"] == mapping["hello"]
#output
$<NA>
NULL
#Does not work:
mapping_c[["hello"]] == mapping[["hello"]]
Error in mapping_c[["hello"]] : subscript out of bounds
If you are converting your input into a data.frame this is not an issue, but it is worth being aware of this, so you obtain the behaviour expected.
Of course you could use a function call to create a proper dictionary with a simple switch statement. I don't think that would be any prettier though.
If us_region is a named list...
us_region <- list(Alabama = "Gulf",
Arizona = "Four States",
Arkansas = "Texas",
California = "South West",
Colorado = "Four States",
Connecticut = "New England",
Delaware = "Columbia")
Then,
us_state_to_region_map <- data.frame(us_state = names(us_region),
us_region = sapply(us_region, c),
stringsAsFactors = FALSE)
and, as a bonus, you also get the states as row names...
us_state_to_region_map
us_state us_region
Alabama Alabama Gulf
Arizona Arizona Four States
Arkansas Arkansas Texas
California California South West
Colorado Colorado Four States
Connecticut Connecticut New England
Delaware Delaware Columbia
As #tim-biegeleisen says it could be more appropriate to maintain this dataset in a database, a CSV file or a spreadsheet and open it in R (with readxl::read_excel(), readr::read_csv(),...).
However if you want to write it directly in your code you can use tibble:tribble() which allows to write a dataframe row by row :
library(tibble)
tribble(~ state, ~ region,
"Alabama", "Gulf",
"Arizona", "Four States",
(...)
"Wisconsin", "Midwest",
"Wyoming", "North")
One option could be to create a data frame in wide format (your initial list makes it very straightforward and this maintains a very obvious mapping. It is actually quite similar to your Perl code), then transform it to the long format:
library(tidyr)
data.frame(
Alabama = "Gulf",
Arizona = "Four States",
Arkansas = "Texas",
California = "South West",
Colorado = "Four States",
Connecticut = "New England",
Delaware = "Columbia",
stringsAsFactors = FALSE
) %>%
gather("us_state", "us_region") # transform to long format
This question already has answers here:
How can I match fuzzy match strings from two datasets?
(7 answers)
Closed 4 years ago.
I have a list of university names input with spelling errors and inconsistencies. I need to match them against an official list of university names to link my data together.
I know fuzzy matching/join is my way to go, but I'm a bit lost on the correct method. Any help would be greatly appreciated.
d<-data.frame(name=c("University of New Yorkk", "The University of South
Carolina", "Syracuuse University", "University of South Texas",
"The University of No Carolina"), score = c(1,3,6,10,4))
y<-data.frame(name=c("University of South Texas", "The University of North
Carolina", "University of South Carolina", "Syracuse
University","University of New York"), distance = c(100, 400, 200, 20, 70))
And I desire an output that has them merged together as closely as possible
matched<-data.frame(name=c("University of New Yorkk", "The University of South Carolina",
"Syracuuse University","University of South Texas","The University of No Carolina"),
correctmatch = c("University of New York", "University of South Carolina",
"Syracuse University","University of South Texas", "The University of North Carolina"))
I use adist() for things like this and have little wrapper function called closest_match() to help compare a value against a set of "good/permitted" values.
library(magrittr) # for the %>%
closest_match <- function(bad_value, good_values) {
distances <- adist(bad_value, good_values, ignore.case = TRUE) %>%
as.numeric() %>%
setNames(good_values)
distances[distances == min(distances)] %>%
names()
}
sapply(d$name, function(x) closest_match(x, y$name)) %>%
setNames(d$name)
University of New Yorkk The University of South\n Carolina Syracuuse University
"University of New York" "University of South Carolina" "University of New York"
University of South Texas The University of No Carolina
"University of South Texas" "University of South Carolina"
adist() utilizes Levenshtein distance to compare similarity between two strings.
I am trying to do something like this, I want every name inside of england to be set to england so when it is ran it will count everything in that collection and they will all be added to englands total. as you can see below there are 9 other labels I want anything named as such to become another england label. I hope this makes sense to someone out there, I really didn't know how to explain this.
area_c <- factor(Outlets2016_local$Region,levels = c("England","Scotland","Wales"),labels = c("England" = england,"Scotland","Wales"))
here is englands collection:
england <- c("London","North East","East of England","West Midlands","South East","North West","East Midlands","South West","Yorkshire and The Humber")
You can do the following if you don't mind recoding Outlets2016_local$Region.
england <- c("London", "North East", "East of England", "West Midlands", "South East", "North West", "East Midlands", "South West", "Yorkshire and The Humber")
Outlets2016_local$Region[Outlets2016_local$Region %in% england] <- "England"
area_c <- factor(Outlets2016_local$Region, levels = c("England", "Scotland", "Wales"), labels = c("England", "Scotland", "Wales"))