How can I edit strings in a data.frame [duplicate] - r

I am importing a dataset from SAS, which I have named start_list, that has columns with named headers and thousands of rows of entries. One column is fund, which contains a selection of strings, e.g. 'South East Auth', 'North West', 'South', etc. I want to add the word 'auth' to the North West and South strings, but leave alone the strings that already have it, such as South East Auth.
I don't need to do anything clever like look for auth; there are only a few that need changing. Basically, in layman's code, I just want to do:
if string in column 'fund' = 'south east' then change to 'south east auth', and apply this to thousands of instances of it

vec <- c("South West", "South East", "North East", "North West")
paste0(vec, ifelse(vec %in% c("South East", "North West"), " Auth", ""))
# [1] "South West" "South East Auth" "North East" "North West Auth"
If you want it to be case-insensitive,
paste0(vec, ifelse(tolower(vec) %in% c("south east", "north west"), " Auth", ""))
# [1] "South West" "South East Auth" "North East" "North West Auth"
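To apply the same idea directly to the data frame column from the question, a minimal sketch might look like the following. The small start_list built here is a hypothetical stand-in for the SAS import, and needs_auth is an assumed name for the short list of values to change.

```r
# Hypothetical stand-in for the data imported from SAS
start_list <- data.frame(
  fund = c("South East Auth", "North West", "South", "North West"),
  stringsAsFactors = FALSE
)

# Only these exact values get the suffix; everything else is left alone
needs_auth <- c("North West", "South")

start_list$fund <- ifelse(start_list$fund %in% needs_auth,
                          paste(start_list$fund, "Auth"),
                          start_list$fund)
start_list$fund
# [1] "South East Auth" "North West Auth" "South Auth" "North West Auth"
```

Because the test is an exact match with %in%, values that already end in "Auth" are never touched.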

Related

r- Error when trying to use mutate with case_when

I am trying to add vector to a data frame holding the region of each US state. I have tried the following code and keep on getting an error message. I'm new to the tidyverse so any help you can offer would be appreciated. I'm guessing it's something small and embarrassing. :)
df <- df %>%
mutate(region = case_when((State=="Connecticut"|State=="Maine"|State=="Massachusetts"|State=="New Hampshire"|State=="Rhode Island"|State=="Vermont"~ "New England"),
case_when((State=="Delaware"| State=="District of Columbia" | State=="Maryland"| State=="New Jersey"| State=="New York"| State=="Pennsylvania"~ "Central Atlanic"),
case_when((State=="Florida"| State=="Georgia"| State=="North Carolina"|State=="South Carolina"| State=="Virginia"| State=="West Virginia"~ "Lower Atlantic"),
case_when((State=="Illinois"| State=="Indiana"| State=="Iowa"| State=="Kansas"| State=="Kentucky"| State=="Michigan"| State=="Minnesota"| State=="Missouri"| State=="Nebraska"| State=="North Dakota"| State=="Ohio"| State=="Oklahoma"| State=="South Dakota"| State=="Tennessee" |State=="Wisconsin"~ "Midwest"),
case_when((State=="Alabama" | State=="Arkansas" | State=="Louisiana"| State=="Mississippi"| State=="New Mexico"| State=="Texas"~ "Gulf Coast"),
case_when((State=="Colorado"| State=="Idaho" | State=="Montana"| State=="Utah"| State=="Wyoming"~ "Rocky Mountain"),
case_when((State=="Alaska" | State=="Arizona" | State=="California"| State=="Hawaii" | State=="Nevada"| State=="Oregon"| State=="Washington"~ "West Coast"), TRUE~"NA"))))))))
Error in mutate():
! Problem while computing region = case_when(...).
Caused by error in case_when():
! Case 2 ((State == "Colorado" | State == "Idaho" | State == "Montana" | State == "Utah" | State == "Wyoming" ~ "Rocky Mountain")) must be a two-sided formula, not a character vector.
As the docs show, there is no need to nest case_when. Simply separate the mutually exclusive conditions with commas. Also, consider %in% to avoid the many OR calls.
mutate(region = case_when(
  State %in% c("Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island", "Vermont") ~ "New England",
  State %in% c("Delaware", "District of Columbia", "Maryland", "New Jersey", "New York", "Pennsylvania") ~ "Central Atlantic",
  ...,
  TRUE ~ NA_character_  # a typed NA keeps every branch a character value
))
In fact, consider simply merging and avoid any conditional logic:
txt = 'State Region
Connecticut "New England"
Maine "New England"
Massachusetts "New England"
"New Hampshire" "New England"
"Rhode Island" "New England"
Vermont "New England"
Delaware "Central Atlantic"
"District of Columbia" "Central Atlantic"
Maryland "Central Atlantic"
"New Jersey" "Central Atlantic"
"New York" "Central Atlantic"
Pennsylvania "Central Atlantic"
...'
region_df <- read.table(text = txt, header = TRUE)
region_df
# State Region
# 1 Connecticut New England
# 2 Maine New England
# 3 Massachusetts New England
# 4 New Hampshire New England
# 5 Rhode Island New England
# 6 Vermont New England
# 7 Delaware Central Atlantic
# 8 District of Columbia Central Atlantic
# 9 Maryland Central Atlantic
# 10 New Jersey Central Atlantic
# 11 New York Central Atlantic
# 12 Pennsylvania Central Atlantic
# ...
main_df <- merge(main_df, region_df, by = "State")
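Note that merge() performs an inner join by default, so rows of main_df whose State has no entry in region_df are dropped rather than given NA. A small sketch, using made-up data, of keeping those rows with all.x = TRUE:

```r
# Made-up lookup table and main data for illustration only
region_df <- data.frame(
  State  = c("Maine", "Vermont", "Delaware"),
  Region = c("New England", "New England", "Central Atlantic"),
  stringsAsFactors = FALSE
)
main_df <- data.frame(
  State = c("Vermont", "Delaware", "Guam", "Maine"),
  value = 1:4,
  stringsAsFactors = FALSE
)

inner  <- merge(main_df, region_df, by = "State")                # drops "Guam"
merged <- merge(main_df, region_df, by = "State", all.x = TRUE)  # keeps it, Region is NA

nrow(inner)   # 3
nrow(merged)  # 4
```

With all.x = TRUE the result mirrors the TRUE ~ NA fallback of the case_when approach.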

Is there syntactic sugar to define a data frame in R

I want to regroup US states by regions and thus I need to define a "US state" -> "US Region" mapping function, which is done by setting up an appropriate data frame.
The basis is this exercise (apparently this is a map of the "Commonwealth of the Fallout"):
One starts off with an original list in raw form:
Alabama = "Gulf"
Arizona = "Four States"
Arkansas = "Texas"
California = "South West"
Colorado = "Four States"
Connecticut = "New England"
Delaware = "Columbia"
which eventually leads to this R code:
us_state <- c("Alabama","Arizona","Arkansas","California","Colorado","Connecticut",
"Delaware","District of Columbia","Florida","Georgia","Idaho","Illinois","Indiana",
"Iowa","Kansas","Kentucky","Louisiana","Maine","Maryland","Massachusetts","Michigan",
"Minnesota","Mississippi","Missouri","Montana","Nebraska","Nevada","New Hampshire",
"New Jersey","New Mexico","New York","North Carolina","North Dakota","Ohio","Oklahoma",
"Oregon","Pennsylvania","Rhode Island","South Carolina","South Dakota","Tennessee",
"Texas","Utah","Vermont","Virginia","Washington","West Virginia ","Wisconsin","Wyoming")
us_region <- c("Gulf","Four States","Texas","South West","Four States","New England",
"Columbia","Columbia","Gulf","Southeast","North West","Midwest","Midwest","Plains",
"Plains","East Central","Gulf","New England","Columbia","New England","Midwest",
"Midwest","Gulf","Plains","North","Plains","South West","New England","Eastern",
"Four States","Eastern","Southeast","North","East Central","Plains","North West",
"Eastern","New England","Southeast","North","East Central","Texas","Four States",
"New England","Columbia","North West","Eastern","Midwest","North")
us_state_to_region_map <- data.frame(us_state, us_region, stringsAsFactors=FALSE)
which is supremely ugly and unmaintainable as the State -> Region mapping is effectively obfuscated.
I actually wrote a Perl program to generate the above from the original list.
In Perl, one would write things like:
#!/usr/bin/perl
$mapping = {
"Alabama"=> "Gulf",
"Arizona"=> "Four States",
"Arkansas"=> "Texas",
"California"=> "South West",
"Colorado"=> "Four States",
"Connecticut"=> "New England",
...etc...etc...
"West Virginia "=> "Eastern",
"Wisconsin"=> "Midwest",
"Wyoming"=> "North" };
which is maintainable because one can verify the mapping on a line-by-line basis.
There must be something similar to this Perl goodness in R?
It seems a bit open to interpretation what you're looking for.
Is the mapping meant to be a function-type thing, such that a call would return the region or vice versa (e.g. similar to a function call mapping("alabama") => "Gulf")?
I am reading the question as looking more for dictionary-style storage, which in R could be obtained with an equivalent named list:
ncountry <- 49
mapping <- as.list(c("Gulf","Four States",
...
,"Midwest","North"))
names(mapping) <- c("Alabama","Arizona",
...
,"Wisconsin","Wyoming")
mapping[["Pennsylvania"]]
[1] "Eastern"
This could be performed in a single call
mapping <- list("Alabama" = "Gulf",
"Arizona" = "Four States",
...,
"Wisconsin" = "Midwest",
"Wyoming" = "North")
This makes it very simple to check that the mapping is working as expected. It doesn't convert nicely to a 2-column data.frame, however, which we would then obtain using
mapping_df <- data.frame(region = unlist(mapping), state = names(mapping))
Note that "not nicely" simply means as.data.frame doesn't translate the input into a 2-column output.
Alternatively, just using a named character vector would likely be fine too:
mapping_c <- c("Alabama" = "Gulf",
"Arizona" = "Four States",
...,
"Wisconsin" = "Midwest",
"Wyoming" = "North")
which would be converted to a data.frame in almost the same fashion
mapping_df_c <- data.frame(region = mapping_c, state = names(mapping_c))
Note however a slight difference in the two choices of storage. While referencing an entry that exists using either single brackets [ or double brackets [[ works just fine
#Works:
mapping_c["Pennsylvania"] == mapping["Pennsylvania"]
#output
Pennsylvania
TRUE
mapping_c[["Pennsylvania"]] == mapping[["Pennsylvania"]]
[1] TRUE
But when referencing unknown entries these differ slightly in behaviour
#works sorta:
mapping_c["hello"] == mapping["hello"]
#output
$<NA>
NULL
#Does not work:
mapping_c[["hello"]] == mapping[["hello"]]
Error in mapping_c[["hello"]] : subscript out of bounds
If you are converting your input into a data.frame this is not an issue, but it is worth being aware of this, so you obtain the behaviour expected.
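The behaviour for unknown keys can be checked one storage type at a time. This is a minimal sketch on a two-entry mapping; lookup_region is a hypothetical helper name, not part of the original answer.

```r
mapping   <- list(Alabama = "Gulf", Arizona = "Four States")  # named list
mapping_c <- c(Alabama = "Gulf", Arizona = "Four States")     # named character vector

# Named list: [[ with an unknown name quietly returns NULL
missing_in_list <- mapping[["hello"]]

# Named vector: [ with an unknown name returns NA ...
missing_in_vec <- mapping_c["hello"]

# ... while [[ with an unknown name is an error
err <- tryCatch(mapping_c[["hello"]], error = function(e) conditionMessage(e))

# Hypothetical vectorised lookup that yields NA for unknown states
lookup_region <- function(state, map) unname(map[state])
lookup_region(c("Alabama", "hello"), mapping_c)
# [1] "Gulf" NA
```

Single-bracket indexing on the named vector is the forgiving choice when unknown states should simply become NA.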
Of course you could use a function call to create a proper dictionary with a simple switch statement. I don't think that would be any prettier though.
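For completeness, a sketch of what such a switch-based function might look like, limited to three states from the question's list. The name state_to_region and the NA default are assumptions made for illustration.

```r
# Hypothetical lookup function; switch() works on a single string at a time
state_to_region <- function(state) {
  switch(state,
         "Alabama"  = "Gulf",
         "Arizona"  = "Four States",
         "Arkansas" = "Texas",
         NA_character_)  # unnamed last argument is the default
}

# switch() is not vectorised, so apply it element-wise
regions <- vapply(c("Alabama", "Arkansas", "Nowhere"),
                  state_to_region, character(1), USE.NAMES = FALSE)
regions
# [1] "Gulf"  "Texas" NA
```

The mapping stays readable line by line, but the extra vapply call is the price of switch() handling only one value per call.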
If us_region is a named list...
us_region <- list(Alabama = "Gulf",
Arizona = "Four States",
Arkansas = "Texas",
California = "South West",
Colorado = "Four States",
Connecticut = "New England",
Delaware = "Columbia")
Then,
us_state_to_region_map <- data.frame(us_state = names(us_region),
us_region = sapply(us_region, c),
stringsAsFactors = FALSE)
and, as a bonus, you also get the states as row names...
us_state_to_region_map
us_state us_region
Alabama Alabama Gulf
Arizona Arizona Four States
Arkansas Arkansas Texas
California California South West
Colorado Colorado Four States
Connecticut Connecticut New England
Delaware Delaware Columbia
As @tim-biegeleisen says, it could be more appropriate to maintain this dataset in a database, a CSV file or a spreadsheet and open it in R (with readxl::read_excel(), readr::read_csv(), ...).
However, if you want to write it directly in your code, you can use tibble::tribble(), which lets you write a data frame row by row:
library(tibble)
tribble(~ state, ~ region,
"Alabama", "Gulf",
"Arizona", "Four States",
(...)
"Wisconsin", "Midwest",
"Wyoming", "North")
One option is to create a data frame in wide format, which your initial list makes very straightforward and which keeps the mapping obvious (it is actually quite similar to your Perl code), then transform it to long format:
library(tidyr)
data.frame(
Alabama = "Gulf",
Arizona = "Four States",
Arkansas = "Texas",
California = "South West",
Colorado = "Four States",
Connecticut = "New England",
Delaware = "Columbia",
stringsAsFactors = FALSE
) %>%
gather("us_state", "us_region") # transform to long format
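If adding a package dependency is not desirable, base R's stack() performs a similar wide-to-long step on a named list. This is only a sketch of that alternative, not part of the original answer; the column names are taken from the question.

```r
# Same line-by-line mapping, stored as a named list
mapping <- list(
  Alabama  = "Gulf",
  Arizona  = "Four States",
  Arkansas = "Texas"
)

# stack() returns a data frame with columns `values` and `ind`
long <- stack(mapping)

us_state_to_region_map <- data.frame(
  us_state  = as.character(long$ind),     # `ind` is a factor of the list names
  us_region = as.character(long$values),
  stringsAsFactors = FALSE
)
us_state_to_region_map
```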

Fuzzy Matching/Join Two Data Frames of University Names [duplicate]

I have a list of university names input with spelling errors and inconsistencies. I need to match them against an official list of university names to link my data together.
I know fuzzy matching/join is my way to go, but I'm a bit lost on the correct method. Any help would be greatly appreciated.
d<-data.frame(name=c("University of New Yorkk", "The University of South
Carolina", "Syracuuse University", "University of South Texas",
"The University of No Carolina"), score = c(1,3,6,10,4))
y<-data.frame(name=c("University of South Texas", "The University of North
Carolina", "University of South Carolina", "Syracuse
University","University of New York"), distance = c(100, 400, 200, 20, 70))
And I desire an output that has them merged together as closely as possible
matched<-data.frame(name=c("University of New Yorkk", "The University of South Carolina",
"Syracuuse University","University of South Texas","The University of No Carolina"),
correctmatch = c("University of New York", "University of South Carolina",
"Syracuse University","University of South Texas", "The University of North Carolina"))
I use adist() for things like this and have a little wrapper function called closest_match() to help compare a value against a set of "good/permitted" values.
library(magrittr) # for the %>%
closest_match <- function(bad_value, good_values) {
distances <- adist(bad_value, good_values, ignore.case = TRUE) %>%
as.numeric() %>%
setNames(good_values)
distances[distances == min(distances)] %>%
names()
}
sapply(d$name, function(x) closest_match(x, y$name)) %>%
setNames(d$name)
University of New Yorkk The University of South\n Carolina Syracuuse University
"University of New York" "University of South Carolina" "University of New York"
University of South Texas The University of No Carolina
"University of South Texas" "University of South Carolina"
adist() utilizes Levenshtein distance to compare similarity between two strings.
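In the output above, "Syracuuse University" is paired with "University of New York". One likely cause is that the example strings were typed across several lines, so they contain embedded line breaks and indentation that count toward the edit distance. A sketch of collapsing whitespace before comparing follows; normalize_ws and closest_match_ws are hypothetical helper names, and ties are resolved by taking the first minimum.

```r
# Collapse any run of whitespace (including newlines) to one space
normalize_ws <- function(x) trimws(gsub("\\s+", " ", x))

# Return the cleaned good value with the smallest edit distance
closest_match_ws <- function(bad_value, good_values) {
  good <- normalize_ws(good_values)
  d <- adist(normalize_ws(bad_value), good, ignore.case = TRUE)
  good[which.min(d)]
}

good_names <- c("University of New York", "Syracuse\n      University")
closest_match_ws("Syracuuse University", good_names)
# [1] "Syracuse University"
```

Edit distance alone can still pick a wrong neighbour when two official names differ by only a few characters, so the matches are worth reviewing by hand.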

R script, how do i assign 3 values from a collection with the same label to sort into levels

I am trying to do something like this: I want every name inside england to be set to England, so that when the code is run everything in that collection is counted and added to England's total. As you can see below, there are 9 other labels, and I want anything with one of those names to be relabelled as England. I hope this makes sense to someone out there; I really didn't know how to explain this.
area_c <- factor(Outlets2016_local$Region,levels = c("England","Scotland","Wales"),labels = c("England" = england,"Scotland","Wales"))
here is englands collection:
england <- c("London","North East","East of England","West Midlands","South East","North West","East Midlands","South West","Yorkshire and The Humber")
You can do the following if you don't mind recoding Outlets2016_local$Region.
england <- c("London", "North East", "East of England", "West Midlands", "South East", "North West", "East Midlands", "South West", "Yorkshire and The Humber")
Outlets2016_local$Region[Outlets2016_local$Region %in% england] <- "England"
area_c <- factor(Outlets2016_local$Region, levels = c("England", "Scotland", "Wales"), labels = c("England", "Scotland", "Wales"))
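If Region is stored as a factor, assigning "England" in place would produce NA unless "England" is already a level. A sketch that avoids both that problem and overwriting the original column is shown below; the region vector is a made-up stand-in for Outlets2016_local$Region.

```r
# Made-up stand-in for Outlets2016_local$Region (a factor on purpose)
region <- factor(c("London", "Scotland", "North East", "Wales", "South West"))

england <- c("London", "North East", "East of England", "West Midlands",
             "South East", "North West", "East Midlands", "South West",
             "Yorkshire and The Humber")

# Work on a character copy so new values are not limited to existing levels
region_chr <- as.character(region)
country <- ifelse(region_chr %in% england, "England", region_chr)

area_c <- factor(country, levels = c("England", "Scotland", "Wales"))
table(area_c)
```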

R: finding specific numbers of characters in a character array

I want to find states with exactly two Os in the name. I tried this:
> data(state)
> index=grep('o.*o',state.name)
> state.name[index]
"Colorado" "North Carolina" "North Dakota" "South Carolina" "South Dakota"
Problem: there are three Os in "Colorado" and I don't want it. How can I revise my regex?
I also want to do three Os:
> data(state)
> index=grep('o.*o.*o',state.name)
> state.name[index]
"Colorado"
Is there a simpler way to do this?
You can do:
grep('^([^o]*o[^o]*){2}$', state.name, value = TRUE)
# [1] "North Carolina" "North Dakota"
# [3] "South Carolina" "South Dakota"
grep('^([^o]*o[^o]*){3}$', state.name, value = TRUE)
# [1] "Colorado"
and as GSee suggested below, you can add ignore.case = TRUE if you want to include states with a capital O like Ohio, Oklahoma, and Oregon.
Michael's response is definitely more eloquent but here's the brute force method:
state.name[sapply(strsplit(tolower(state.name), NULL), function(x) sum(x %in% "o") == 2)]
You should ensure that the other characters that you're matching, besides the two matching Os, are not Os:
grep("^[^o]*o[^o]*o[^o]*$", state.name, value = TRUE)
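Solution using ?gregexpr: a little ugly, but it generalizes well to other regexes. (Don't forget the capital O in Ohio.) Note that gregexpr returns -1 when there is no match, so this length-based count cannot tell zero matches from one; it is only reliable for counts of two or more.
state.name[sapply(state.name,function(x) length(unlist(gregexpr("o|O",x)))) == 2]
Count the number of o's in each state name: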
State <- c("North Dakota","Ohio","Colorado","South Dakota")
nos <- nchar(gsub("[^oO]","",State))
State[nos==2]
State[nos==3]
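The counting approaches above can be wrapped in one small helper. This is a sketch that uses regmatches() so that names without any match count as zero; count_char is a hypothetical name, and the count is case-insensitive by assumption.

```r
# Count occurrences of a pattern in each string, ignoring case
count_char <- function(x, pattern) {
  lengths(regmatches(x, gregexpr(pattern, x, ignore.case = TRUE)))
}

count_char(c("Colorado", "Ohio", "Maine"), "o")
# [1] 3 2 0

# state.name ships with base R (datasets package)
state.name[count_char(state.name, "o") == 3]
# [1] "Colorado"
```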
