Is there syntactic sugar to define a data frame in R - r
I want to regroup US states by regions and thus I need to define a "US state" -> "US Region" mapping function, which is done by setting up an appropriate data frame.
The basis is this exercise (apparently this is a map of the "Commonwealth of the Fallout"):
One starts off with an original list in raw form:
Alabama = "Gulf"
Arizona = "Four States"
Arkansas = "Texas"
California = "South West"
Colorado = "Four States"
Connecticut = "New England"
Delaware = "Columbia"
which eventually leads to this R code:
us_state <- c("Alabama","Arizona","Arkansas","California","Colorado","Connecticut",
"Delaware","District of Columbia","Florida","Georgia","Idaho","Illinois","Indiana",
"Iowa","Kansas","Kentucky","Louisiana","Maine","Maryland","Massachusetts","Michigan",
"Minnesota","Mississippi","Missouri","Montana","Nebraska","Nevada","New Hampshire",
"New Jersey","New Mexico","New York","North Carolina","North Dakota","Ohio","Oklahoma",
"Oregon","Pennsylvania","Rhode Island","South Carolina","South Dakota","Tennessee",
"Texas","Utah","Vermont","Virginia","Washington","West Virginia","Wisconsin","Wyoming")
us_region <- c("Gulf","Four States","Texas","South West","Four States","New England",
"Columbia","Columbia","Gulf","Southeast","North West","Midwest","Midwest","Plains",
"Plains","East Central","Gulf","New England","Columbia","New England","Midwest",
"Midwest","Gulf","Plains","North","Plains","South West","New England","Eastern",
"Four States","Eastern","Southeast","North","East Central","Plains","North West",
"Eastern","New England","Southeast","North","East Central","Texas","Four States",
"New England","Columbia","North West","Eastern","Midwest","North")
us_state_to_region_map <- data.frame(us_state, us_region, stringsAsFactors=FALSE)
which is supremely ugly and unmaintainable as the State -> Region mapping is effectively
obfuscated.
I actually wrote a Perl program to generate the above from the original list.
In Perl, one would write things like:
#!/usr/bin/perl
$mapping = {
"Alabama"=> "Gulf",
"Arizona"=> "Four States",
"Arkansas"=> "Texas",
"California"=> "South West",
"Colorado"=> "Four States",
"Connecticut"=> "New England",
...etc...etc...
"West Virginia"=> "Eastern",
"Wisconsin"=> "Midwest",
"Wyoming"=> "North" };
which is maintainable because one can verify the mapping on a line-by-line basis.
There must be something similar to this Perl goodness in R?
It seems a bit open for interpretation as to what you're looking for.
Is the mapping meant to be a function-like thing, such that a call would return the region or vice versa (e.g. something like mapping("alabama") => "Gulf")?
I am reading the question as looking for dictionary-style storage, which in R could be obtained with an equivalent named list
ncountry <- 49
mapping <- as.list(c("Gulf","Four States",
...
,"Midwest","North"))
names(mapping) <- c("Alabama","Arizona",
...
,"Wisconsin","Wyoming")
mapping[["Pennsylvania"]]
[1] "Eastern"
This could be performed in a single call
mapping <- list("Alabama" = "Gulf",
"Arizona" = "Four States",
...,
"Wisconsin" = "Midwest",
"Wyoming" = "North")
Which makes it very simple to check that the mapping is working as expected. This doesn't convert nicely to a 2 column data.frame however, which we would then obtain using
mapping_df <- data.frame(region = unlist(mapping), state = names(mapping))
Note that "not nicely" simply means that as.data.frame doesn't translate the input into a 2 column output.
Alternatively just using a named character vector would likely be fine too
mapping_c <- c("Alabama" = "Gulf",
"Arizona" = "Four States",
...,
"Wisconsin" = "Midwest",
"Wyoming" = "North")
which would be converted to a data.frame in almost the same fashion
mapping_df_c <- data.frame(region = mapping_c, state = names(mapping_c))
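Since the stated goal is a "US state" -> "US Region" mapping function, it may be worth adding that a named character vector can be indexed with a whole vector of names at once. A minimal sketch, using only the seven states from the raw list in the question (the states vector is a made-up example):

```r
# Lookup table: names are states, values are regions (subset taken from the question)
mapping_c <- c("Alabama"     = "Gulf",
               "Arizona"     = "Four States",
               "Arkansas"    = "Texas",
               "California"  = "South West",
               "Colorado"    = "Four States",
               "Connecticut" = "New England",
               "Delaware"    = "Columbia")

# Hypothetical input: a column of state names to be regrouped
states <- c("Colorado", "Alabama", "Colorado", "Delaware")

# Vectorised lookup; unname() drops the state names carried along by the subsetting
regions <- unname(mapping_c[states])
regions
```

States that are missing from the lookup table come back as NA rather than raising an error, which is usually what you want for this kind of recoding.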
Note however a slight difference in the two choices of storage. While referencing an entry that exists using either single brackets [ or double brackets [[ works just fine
#Works:
mapping_c["Pennsylvania"] == mapping["Pennsylvania"]
#output
Pennsylvania
TRUE
mapping_c[["Pennsylvania"]] == mapping[["Pennsylvania"]]
[1] TRUE
But when referencing unknown entries these differ slightly in behaviour
#works sorta:
mapping_c["hello"] == mapping["hello"]
#output
$<NA>
NULL
#Does not work:
mapping_c[["hello"]] == mapping[["hello"]]
Error in mapping_c[["hello"]] : subscript out of bounds
If you are converting your input into a data.frame this is not an issue, but it is worth being aware of this, so you obtain the behaviour expected.
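To make the difference concrete, here is a small sketch that looks at each call separately instead of comparing them with ==. It uses a two-entry toy mapping, and "hello" is deliberately a key that does not exist:

```r
mapping   <- list(Alabama = "Gulf", Arizona = "Four States")   # named list
mapping_c <- c(Alabama = "Gulf", Arizona = "Four States")      # named character vector

# Single brackets never fail on an unknown name
vec_single  <- mapping_c["hello"]   # a character NA (its name is NA as well)
list_single <- mapping["hello"]     # a list of length 1 holding NULL

# Double brackets differ: the list returns NULL, the vector raises an error
list_double <- mapping[["hello"]]   # NULL, no error
vec_double  <- tryCatch(mapping_c[["hello"]],
                        error = function(e) e)  # capture the error object instead of stopping
```

So for an unknown key the list form quietly gives NULL, while the character vector gives NA with [ and stops with an error with [[.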
Of course you could use a function call to create a proper dictionary with a simple switch statement. I don't think that would be any prettier though.
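For completeness, a rough sketch of what the switch variant could look like. The function name state_to_region is made up for illustration, and only a few states from the question are included:

```r
# Hypothetical helper: maps ONE state name to its region via switch()
state_to_region <- function(state) {
  switch(state,
         "Alabama"  = "Gulf",
         "Arizona"  = ,               # empty entry: falls through to the next value
         "Colorado" = "Four States",
         "Arkansas" = "Texas",
         NA_character_)               # unnamed last argument is the default
}

# switch() is not vectorised, so apply it element by element
res <- vapply(c("Arizona", "Arkansas", "Narnia"), state_to_region,
              character(1), USE.NAMES = FALSE)
res
```

The mapping is still readable line by line, but the extra vapply() call is one reason this is probably not prettier than the named vector.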
If us_region is a named list...
us_region <- list(Alabama = "Gulf",
Arizona = "Four States",
Arkansas = "Texas",
California = "South West",
Colorado = "Four States",
Connecticut = "New England",
Delaware = "Columbia")
Then,
us_state_to_region_map <- data.frame(us_state = names(us_region),
us_region = sapply(us_region, c),
stringsAsFactors = FALSE)
and, as a bonus, you also get the states as row names...
us_state_to_region_map
               us_state   us_region
Alabama         Alabama        Gulf
Arizona         Arizona Four States
Arkansas       Arkansas       Texas
California   California  South West
Colorado       Colorado Four States
Connecticut Connecticut New England
Delaware       Delaware    Columbia
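Once the mapping is a data frame, regrouping other data by region could look roughly like the sketch below. The sales data frame and its columns are invented purely for illustration; only the state/region pairs come from the question:

```r
# Mapping table (three rows from the question, for brevity)
us_state_to_region_map <- data.frame(
  us_state  = c("Alabama", "Arizona", "Arkansas"),
  us_region = c("Gulf", "Four States", "Texas"),
  stringsAsFactors = FALSE
)

# Hypothetical data to regroup; "Ohio" is deliberately absent from the mapping
sales <- data.frame(state  = c("Arizona", "Alabama", "Arizona", "Ohio"),
                    amount = c(10, 20, 30, 40),
                    stringsAsFactors = FALSE)

# match() gives the row of the mapping for each state (NA when there is none)
idx <- match(sales$state, us_state_to_region_map$us_state)
sales$region <- us_state_to_region_map$us_region[idx]

# Aggregate by region; rows with an NA region are dropped by the formula interface
totals <- aggregate(amount ~ region, data = sales, FUN = sum)
totals
```

merge(sales, us_state_to_region_map, by.x = "state", by.y = "us_state") would be another way to attach the region column.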
As @tim-biegeleisen says, it could be more appropriate to maintain this dataset in a database, a CSV file or a spreadsheet and read it into R (with readxl::read_excel(), readr::read_csv(), ...).
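A minimal sketch of the CSV route using base R's read.csv(). The table is passed inline through the text argument only to keep the example self-contained; in practice you would pass the path of your own file instead (the column names us_state and us_region are taken from the question):

```r
# One state per line, so the mapping can be checked line by line
us_state_to_region_map <- read.csv(text = "us_state,us_region
Alabama,Gulf
Arizona,Four States
Arkansas,Texas
California,South West
Colorado,Four States
Connecticut,New England
Delaware,Columbia",
  strip.white = TRUE,          # trims stray spaces around the fields
  stringsAsFactors = FALSE)    # keep both columns as character

us_state_to_region_map
```

strip.white = TRUE also guards against accidental trailing spaces in a state name, which would otherwise break lookups.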
However, if you want to write it directly in your code, you can use tibble::tribble(), which lets you write a data frame row by row:
library(tibble)
tribble(~ state, ~ region,
"Alabama", "Gulf",
"Arizona", "Four States",
(...)
"Wisconsin", "Midwest",
"Wyoming", "North")
One option could be to create a data frame in wide format (your initial list makes it very straightforward and this maintains a very obvious mapping. It is actually quite similar to your Perl code), then transform it to the long format:
library(tidyr)
data.frame(
Alabama = "Gulf",
Arizona = "Four States",
Arkansas = "Texas",
California = "South West",
Colorado = "Four States",
Connecticut = "New England",
Delaware = "Columbia",
stringsAsFactors = FALSE
) %>%
gather("us_state", "us_region") # transform to long format
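One caveat with the wide format: data.frame() rewrites column names that are not syntactic, so "New Hampshire" would become New.Hampshire unless check.names = FALSE is set. A base R sketch of the same wide-to-long idea under that assumption (three states from the question, no tidyr needed):

```r
# Wide format: one column per state; check.names = FALSE keeps the spaces in the names
wide <- data.frame(
  "Alabama"       = "Gulf",
  "New Hampshire" = "New England",
  "West Virginia" = "Eastern",
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Long format: column names become us_state, the single row becomes us_region
long <- data.frame(
  us_state  = names(wide),
  us_region = unlist(wide[1, ], use.names = FALSE),
  stringsAsFactors = FALSE
)
long
```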
Related
R - if field A is empty, and field B contains certain partial string, add word to field A
I have a data frame with one field that contains geographic names or phrases (locality.curated) and a corresponding field with an abbreviated code for these regions (convert). Not all the abbreviations are filled in; some are NA. I want to add abbreviations to the convert field by matching a keyword in locality.curated, without overwriting convert if it is already occupied. That is: if convert is NA and locality.curated contains "Delaware", then set convert to "USA".
I have tried various permutations of ifelse and if, but either I cannot get the syntax right or, as in my latest attempt below, the data frame is not modified:
if(is.na(test$convert) && grepl("Delaware", test$locality.curated, value=T)) {paste("USA", test$convert)}
Dummy data, input:
test <- data.frame(locality.curated=c("Canada to Delaware River", "California", "Delaware", "Wilmington Delaware", "Alaska"), convert=c("CAN","USA", "USA",NA,NA))
Desired output:
test.out <- data.frame(locality.curated=c("Canada to Delaware River", "California", "Delaware", "Wilmington Delaware", "Alaska"), convert=c("CAN","USA", "USA","USA",NA))
Many thanks!
test$new <- ifelse(is.na(test$convert) & grepl(pattern = "Delaware", x = test$locality.curated), yes = "USA", no = test$convert)
test
          locality.curated convert  new
1 Canada to Delaware River     CAN  CAN
2               California     USA  USA
3                 Delaware     USA  USA
4      Wilmington Delaware    <NA>  USA
5                   Alaska    <NA> <NA>
Missing observations when using str_replace_all
I have a dataset of map data created with the following:
worldMap_df <- map_data("world") %>%
  rename(Economy = region) %>%
  filter(Economy != "Antarctica") %>%
  mutate(Economy = str_replace_all(Economy,
    c("Brunei" = "Brunei Darussalam",
      "Macedonia" = "Macedonia, FYR",
      "Puerto Rico" = "Puerto Rico US",
      "Russia" = "Russian Federation",
      "UK" = "United Kingdom",
      "USA" = "United States",
      "Palestine" = "West Bank and Gaza",
      "Saint Lucia" = "St Lucia",
      "East Timor" = "Timor-Leste")))
There are a number of countries (under Economy) that I am trying to merge under a single label using str_replace_all. One example is the observations for which Economy is either "Trinidad" or "Tobago". I've used the following, but it only partially relabels the observations:
trin_tobago_vector <- c("Trinidad", "Tobago")
worldMap_df$Economy <- str_replace_all(worldMap_df$Economy, trin_tobago_vector, "Trinidad and Tobago")
Some observations now have "Trinidad and Tobago" under Economy, whilst others remain "Trinidad" or "Tobago". Can anyone see what I'm doing wrong here?
You supply str_replace_all with a pattern that is a vector: trin_tobago_vector. It will then iterate over your 'Economy' column and check the first element with "Trinidad", the second element with "Tobago", the third with "Trinidad", and so on.
You should do this replacement in two steps instead:
worldMap_df$Economy <- str_replace_all(worldMap_df$Economy, "^Trinidad$", "Trinidad and Tobago")
worldMap_df$Economy <- str_replace_all(worldMap_df$Economy, "^Tobago$", "Trinidad and Tobago")
or use a named vector:
trin_tobago_vector <- c("^Trinidad$" = "Trinidad and Tobago", "^Tobago$" = "Trinidad and Tobago")
worldMap_df$Economy <- str_replace_all(worldMap_df$Economy, trin_tobago_vector)
The ^ and $ inside the pattern vector make sure that only the literal strings "Trinidad" and "Tobago" are replaced.
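When the goal is to relabel whole values rather than substrings, a plain base R assignment with %in% is a possible alternative. Note this is a different technique from str_replace_all, sketched here on a toy vector rather than on the real worldMap_df:

```r
# Toy stand-in for worldMap_df$Economy
economy <- c("Trinidad", "Tobago", "Trinidad", "Tobago", "Barbados")

# Relabel every element that is exactly "Trinidad" or "Tobago";
# %in% tests each element against the whole set, so nothing is recycled pairwise
economy[economy %in% c("Trinidad", "Tobago")] <- "Trinidad and Tobago"
economy
```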
R loop with characters list
The problem here is similar to this previous one, but here we do not need to do any computation, just build lists.
I have some lists of world regions:
list.asia <- c("Central Asia", "Eastern Asia", "South-eastern Asia", "Southern Asia", "Western Asia")
list.africa <- c("Northern Africa", "Sub-Saharan Africa", "Eastern Africa", "Middle Africa", "Southern Africa", "Western Africa")
I use the R library("ISOcodes") to produce lists of countries in ISO Alpha 3 format as follows:
region <- subset(UN_M.49_Regions, Name %in% list.asia)
subset <- subset(UN_M.49_Countries, Code %in% unlist(strsplit(region$Children, ", ")))
subset$ISO_Alpha_3
This example, with list.asia, gives the expected result:
 [1] "AFG" "ARM" "AZE" "BHR" "BGD" "BTN" "BRN" "KHM" "CHN" "HKG" "MAC" "CYP" "PRK"
[14] "GEO" "IND" "IDN" "IRN" "IRQ" "ISR" "JPN" "JOR" "KAZ" "KWT" "KGZ" "LAO" "LBN"
[27] "MYS" "MDV" "MNG" "MMR" "NPL" "OMN" "PAK" "PHL" "QAT" "KOR" "SAU" "SGP" "LKA"
[40] "PSE" "SYR" "TJK" "THA" "TLS" "TUR" "TKM" "ARE" "UZB" "VNM" "YEM"
which can easily be saved as follows:
countries.list.asia <- subset$ISO_Alpha_3
The problem is that I have a lot of regions and I would prefer to use a loop. To keep it simple, let's say that I only have 2 lists, list.asia and list.africa. I regroup them in a new list.continent
list.continent <- c("list.asia","list.africa")
and then I "loop" the list production (which does not work):
for(i in list.continent){
  list.loop <- sym(i)
  region <- subset(UN_M.49_Regions, Name %in% list.loop)
  subset <- subset(UN_M.49_Countries, Code %in% unlist(strsplit(region$Children, ", ")))
  paste("countries",list.loop, sep=".") <- subset$ISO_Alpha_3
  rm(region, subset, list.loop)
}
The expected results (in this case) are 2 new objects (class list) called countries.list.asia and countries.list.africa, containing the ISO Alpha 3 codes of the countries present in these regions.
I tried to replace list.loop by !!list.loop or as.list(list.loop), but nothing works. Any idea?
Consider using one overall list rather than trying to save objects to the global environment iteratively, and use a function to return the output you need, which avoids having to remove helper objects. In R, a list plus a function can be combined with lapply (or its wrapper sapply, used here to keep the list names):
# NAMED LIST OF ACTUAL OBJECTS (NOT CHARACTER VECTOR)
list.continent <- list(list.asia = list.asia, list.africa = list.africa)
# BUILD NEW LIST OF SUBSETTED ITEMS
new_list.continent <- sapply(list.continent, function(item) {
  region <- subset(UN_M.49_Regions, Name %in% item)
  sub <- subset(UN_M.49_Countries, Code %in% unlist(strsplit(region$Children, ", ")))
  return(sub$ISO_Alpha_3)
}, simplify = FALSE)
# SHOW OBJECT CONTENTS
new_list.continent$list.asia
new_list.continent$list.africa
R script, how do i assign 3 values from a collection with the same label to sort into levels
I am trying to do something like this: I want every name inside england to be set to "England", so that when the code is run, everything in that collection is counted and added to England's total. As you can see below, there are 9 labels that I want to be treated as England. I hope this makes sense to someone out there; I really didn't know how to explain this.
area_c <- factor(Outlets2016_local$Region,levels = c("England","Scotland","Wales"),labels = c("England" = england,"Scotland","Wales"))
Here is England's collection:
england <- c("London","North East","East of England","West Midlands","South East","North West","East Midlands","South West","Yorkshire and The Humber")
You can do the following if you don't mind recoding Outlets2016_local$Region.
england <- c("London", "North East", "East of England", "West Midlands", "South East", "North West", "East Midlands", "South West", "Yorkshire and The Humber")
Outlets2016_local$Region[Outlets2016_local$Region %in% england] <- "England"
area_c <- factor(Outlets2016_local$Region, levels = c("England", "Scotland", "Wales"), labels = c("England", "Scotland", "Wales"))
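If you would rather not overwrite the Region column, another option is to merge factor levels by assigning a named list to levels(). A small sketch on a made-up region vector standing in for Outlets2016_local$Region (the england vector is the one from the question):

```r
england <- c("London", "North East", "East of England", "West Midlands",
             "South East", "North West", "East Midlands", "South West",
             "Yorkshire and The Humber")

# Hypothetical stand-in for Outlets2016_local$Region
region <- c("London", "Scotland", "North West", "Wales", "South East")

area_c <- factor(region)
# Each list name is a new level; its value lists the old levels merged into it
levels(area_c) <- list(England = england, Scotland = "Scotland", Wales = "Wales")

table(area_c)
```

The original region vector is left untouched, and table(area_c) then counts every English region under "England".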
improve nested ifelse statement in r
I have more than 10k addresses, each looking like "XXX street, city, state, US", in a character vector. I want to group them by state, so I use nested ifelse to build an address data.frame with two variables, add_info and state.
library(stringr)
for (i in nrow(address){
ifelse(str_detect(address, 'Alabama'), address[i,state]='Alabama',
ifelse(str_detect(address, 'Alaska'), address[i,state]='Alaska',
ifelse(str_detect(address, 'Arizona'), address[i,state]='Arizona',
...
ifelse(str_detect(address, 'Wyoming'), address[i,state]='Wyoming', address[i,state]=NA)...)
}
Of course, this is extremely inefficient, but I don't know how to rewrite this nested ifelse. Any idea?
There are many ways to approach this problem. This is one approach, assuming that your address string always contains the full spelling of only one US state.
library(stringr)
# Get a list of all states
state.list = scan(text = "Alabama, Alaska, Arizona, Arkansas, California, Colorado, Connecticut, Delaware, Florida, Georgia, Hawaii, Idaho, Illinois, Indiana, Iowa, Kansas, Kentucky, Louisiana, Maine, Maryland, Massachusetts, Michigan, Minnesota, Mississippi, Missouri, Montana, Nebraska, Nevada, New Hampshire, New Jersey, New Mexico, New York, North Carolina, North Dakota, Ohio, Oklahoma, Oregon, Pennsylvania, Rhode Island, South Carolina, South Dakota, Tennessee, Texas, Utah, Vermont, Virginia, Washington, West Virginia, Wisconsin, Wyoming", what = "", sep = ",", strip.white = T)
# Extract state from vector address using library(stringr)
state = unlist(sapply(address, function(x) state.list[str_detect(x, state.list)]))
# Generate fake data to test
fake.address = paste0(replicate(10, paste(sample(c(0:9, LETTERS), 10, replace=TRUE), collapse="")),
                      sample(state.list, 20, rep = T),
                      replicate(10, paste(sample(c(0:9, LETTERS), 10, replace=TRUE), collapse="")))
# Test using fake address
unlist(sapply(fake.address, function(x) state.list[str_detect(x, state.list)]))
Output for fake address:
O4H8V0NYEHColoradoA5K5XK35LX 44NDPQVMZ8UtahMY0I4M3086 LJ0LJW8BOBFloridaP5H2QW8B81 521IHHC1MFCaliforniaG7QTYCJRO5
"Colorado" "Utah" "Florida" "California"
YESTB7R6EPRhode IslandXEEGD4GEY3 5OHN2BR29HKansasCOKR9DY1WJ 4UXNJQW0QKNew MexicoH9GVQR3ZFY 5SYELTKO5HTexas3ONM1HU1VB
"Rhode Island" "Kansas" "New Mexico" "Texas"
Z8MKKL7K1RWashingtonGEBS7LJUU0 WPRSQEI2CNIndiana141S0Z1M2E O4H8V0NYEHNorth DakotaA5K5XK35LX 44NDPQVMZ8New HampshireMY0I4M3086
"Washington" "Indiana" "North Dakota" "New Hampshire"
LJ0LJW8BOBWest VirginiaP5H2QW8B811 LJ0LJW8BOBWest VirginiaP5H2QW8B812 521IHHC1MFNew JerseyG7QTYCJRO5 YESTB7R6EPWisconsinXEEGD4GEY3
"Virginia" "West Virginia" "New Jersey" "Wisconsin"
5OHN2BR29HOregonCOKR9DY1WJ 4UXNJQW0QKOhioH9GVQR3ZFY 5SYELTKO5HRhode Island3ONM1HU1VB Z8MKKL7K1ROklahomaGEBS7LJUU0
"Oregon" "Ohio" "Rhode Island" "Oklahoma"
WPRSQEI2CNIowa141S0Z1M2E
"Iowa"
edit: Use the following function based on agrep() for fuzzy matching. It should work with minor spelling mistakes. The code calls the index-assign operator `[<-` functionally (written with backticks), which the page display may garble, so you might need to open the edit view to copy the code.
unlist(sapply(fake.address, function(x) state.list[`[<-`((L<-as.logical(sapply(state.list, function(s) agrep(s, x)*1))),is.na(L),F)]))
Assuming that your formatting is consistent (sensu Joran's comment above), you could just parse with strsplit and then use data.frame:
address1 <- "410 West Street, Small Town, MN, US"
address2 <- "5844 Green Street, Foo Town, NY, US"
address3 <- "875 Cardinal Lane, Placeville, CA, US"
vector <- c(address1, address2, address3)
df <- t(data.frame(strsplit(vector, ", ")))
colnames(df) <- c("Number", "City", "State", "Country")
rownames(df) <- NULL
df
which produces:
     Number              City         State Country
[1,] "410 West Street"   "Small Town" "MN"  "US"
[2,] "5844 Green Street" "Foo Town"   "NY"  "US"
[3,] "875 Cardinal Lane" "Placeville" "CA"  "US"
There are several methods. First we need some sample data.
# some sample data
set.seed(123)
dat <- data.frame(addr=sprintf('123 street, Townville, %s, US', sample(state.name, 25, replace=T)), stringsAsFactors=F)
If your data is super regular like that:
# the easy way, split on commas:
matrix(unlist(strsplit(dat$addr, ',')), ncol=4, byrow=T)
Method 2: use grep to search for values. This works even if there are no commas, or different commas in different rows (as long as the states always appear spelled the same way).
# get a list of state name matches; need to match ', state name,' otherwise
# West Virginia counts as Virginia...
matches <- sapply(paste0(', ', state.name, ','), grep, dat$addr)
# now pair up the state name with the row it matches to
state_df <- data.frame(state=rep(state.name, sapply(matches, length)), row=unname(unlist(matches)), stringsAsFactors=F)
# reorder based on position in original data.frame, and there you go!
dat$state <- state_df[order(state_df$row), 'state']
This seemed to be working in my tests:
just.ST <- gsub( paste0(".+(", paste(state.name,collapse="|"), ").+$"), "\\1", address)
As mentioned in comments and illustrated in other answers, state.name should be available by default. It does have the deficiency that in case of a non-match it returns the whole string, but you can probably use:
is.na(just.ST) <- nchar(just.ST) > max(nchar(state.name))
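A variation on the same idea, sketched with regexpr() and regmatches() so that non-matches can be set to NA directly. The address vector below is made up for the test, and sorting the state names longest first is only a precaution so that longer names are tried before shorter ones:

```r
# Made-up addresses; the last one contains no US state name
address <- c("1 Main St, Charleston, West Virginia, US",
             "2 Oak Ave, Little Rock, Arkansas, US",
             "3 Elm Rd, Somewhere, Atlantis, XX")

# state.name ships with base R (datasets package); try longer names first
states_by_length <- state.name[order(nchar(state.name), decreasing = TRUE)]
pattern <- paste(states_by_length, collapse = "|")

# regexpr() returns -1 where nothing matched
hits <- regexpr(pattern, address, perl = TRUE)

# Start with NA everywhere, then fill in the matched state names
state <- rep(NA_character_, length(address))
state[hits != -1] <- regmatches(address, hits)
state
```

Because the match is taken at its leftmost position, "West Virginia" is not cut down to "Virginia" and "Arkansas" is not cut down to "Kansas".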