I am having some issues with a regular expression I hope you can help me with.
I have a dataset that looks like this:
name <- c("Chester-le-Street",
"Westbury-on-Trym",
"Easton-in-Gordano",
"Weston-super-Mare",
"Bourne End-cum-Hedsor",
"Amersham-on-the-Hill East",
"South Westbury-on-Trym")
What I want to do is mainly two things:
remove the symbols "-"
Replace the first letter after where a "-" was by a capital letter.
In a way that would result in the following:
target_name <- c("Chester Le Street",
"Westbury On Trym",
"Easton In Gordano",
"Weston Super Mare",
"Bourne End Cum Hedsor",
"Amersham On The Hill East",
"South Westbury On Trym")
I have tried many different approaches; they take a long time and never quite give the result I am after.
Would really appreciate any help!
Thanks.
You could use -(\w) as your pattern, which matches a dash followed by a word character (a letter, digit or underscore). Capture that character as group 1, then replace the match with a space followed by the upper-cased group 1, i.e. \U\1. Because each backslash has to be escaped inside an R string, the code uses two of them; \U also requires perl = TRUE.
gsub("-(\\w)", " \\U\\1", name, perl = TRUE)
[1] "Chester Le Street" "Westbury On Trym" "Easton In Gordano"
[4] "Weston Super Mare" "Bourne End Cum Hedsor" "Amersham On The Hill East"
[7] "South Westbury On Trym"
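As a quick check of the pattern above, here is a small sketch that compares the result with target_name from the question. It swaps \w for [[:alpha:]] so that only letters are captured (my assumption, not part of the original answer), and the last line uses a made-up name to show that capitals elsewhere in the string are left alone.

```r
name <- c("Chester-le-Street", "Westbury-on-Trym", "Easton-in-Gordano",
          "Weston-super-Mare", "Bourne End-cum-Hedsor",
          "Amersham-on-the-Hill East", "South Westbury-on-Trym")
target_name <- c("Chester Le Street", "Westbury On Trym", "Easton In Gordano",
                 "Weston Super Mare", "Bourne End Cum Hedsor",
                 "Amersham On The Hill East", "South Westbury On Trym")

# Replace each dash plus the letter after it with a space and the upper-cased letter.
result <- gsub("-([[:alpha:]])", " \\U\\1", name, perl = TRUE)
identical(result, target_name)

# Made-up example: letters that are not directly after a dash keep their case.
mixed <- gsub("-([[:alpha:]])", " \\U\\1", "Stoke-sub-HAMdon", perl = TRUE)
mixed
```

Unlike a title-casing function, this only touches the character right after each dash, which matters if some names contain internal capitals.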
Or:
library(stringr)
str_to_title(str_replace_all(name, '-', ' '))
[1] "Chester Le Street" "Westbury On Trym" "Easton In Gordano"
[4] "Weston Super Mare" "Bourne End Cum Hedsor" "Amersham On The Hill East"
[7] "South Westbury On Trym"
library(stringr)
name <- c("Chester-le-Street",
"Westbury-on-Trym",
"Easton-in-Gordano",
"Weston-super-Mare",
"Bourne End-cum-Hedsor",
"Amersham-on-the-Hill East",
"South Westbury-on-Trym")
name %>%
str_replace_all("-", " ") %>%
str_to_title()
I am using the USArrests data.frame in R and I need to see for each crime (Murder, Assault and Rape) which state presents the smallest and the largest crime rate.
I guess I have to calculate the max and min for each crime and I have done that.
which(USArrests$Murder == min(USArrests$Murder))
[1] 34
The problem is that I cannot retrieve State in row 34, but only the whole row:
USArrests[34,]
Murder Assault UrbanPop Rape
North Dakota 0.8 45 44 7.3
I am just starting using R so can anyone help me please?
I would usually suggest taking a different approach to a problem like this but for ease I'm going to offer the following solution and maybe come back later with a more well thought out way.
You can use the attributes() function to see particular 'attributes' of a dataframe.
Eg:
attributes(USArrests)
will give you the following output.
$names
[1] "Murder" "Assault" "UrbanPop" "Rape"
$class
[1] "data.frame"
$row.names
[1] "Alabama" "Alaska" "Arizona" "Arkansas" "California" "Colorado"
[7] "Connecticut" "Delaware" "Florida" "Georgia" "Hawaii" "Idaho"
[13] "Illinois" "Indiana" "Iowa" "Kansas" "Kentucky" "Louisiana"
[19] "Maine" "Maryland" "Massachusetts" "Michigan" "Minnesota" "Mississippi"
[25] "Missouri" "Montana" "Nebraska" "Nevada" "New Hampshire" "New Jersey"
[31] "New Mexico" "New York" "North Carolina" "North Dakota" "Ohio" "Oklahoma"
[37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina" "South Dakota" "Tennessee"
[43] "Texas" "Utah" "Vermont" "Virginia" "Washington" "West Virginia"
[49] "Wisconsin" "Wyoming"
So now we know the dataframe is composed of 'names' (name of charge), 'row.names' (names of states) and that the 'class' is a dataframe. As a newcomer to R it is important to note that in the results above, the row id is only given for the first item on each new line. This will make more sense in the last step.
Using this knowledge we can use attributes to find just the states by doing the following:
attributes(USArrests)$row.names
To find the 34th state in the list which you have identified as North Dakota, we can simply give the row id for that state, as per below.
attributes(USArrests)$row.names[34]
Which will give you....
[1] "North Dakota"
Again, this is probably not the most elegant way of doing this, but it will work for your scenario.
Hope this helps and happy coding.
EDIT
As I mentioned there's usually a more elegant, performant and efficient way of doing things. Here is another such way of achieving your goal.
row.names(USArrests)[which.min(USArrests$Murder)]
You'll probably be able to see instantly what is happening here, but essentially, we're asking for the row name associated with the lowest value for the Murder charge. Again this gives...
[1] "North Dakota"
You can now apply this logic to find the states with the max & min crime rates for each offence. Eg, for max Assaults
row.names(USArrests)[which.max(USArrests$Assault)]
Giving...
[1] "North Carolina"
It appears that the State name is stored as a rowname. You can access the rownames of a dataframe using the rownames function.
To find the element which has the lowest value in the vector-column, you can use the which.min function.
We have indeed:
> USArrests[which.min(USArrests$Murder), "Murder"]
[1] 0.8
Hence, your command becomes:
> rownames(USArrests)[which.min(USArrests$Murder)]
[1] "North Dakota"
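To cover all three crimes in one go, the which.min / which.max idea from the answers above can be wrapped in a small helper. This is only a sketch; state_at and extremes are names I made up for illustration.

```r
crimes <- c("Murder", "Assault", "Rape")

# Hypothetical helper: row name at the position chosen by `pick` (which.min or which.max).
state_at <- function(crime, pick) rownames(USArrests)[pick(USArrests[[crime]])]

extremes <- data.frame(
  crime   = crimes,
  lowest  = vapply(crimes, state_at, character(1), pick = which.min),
  highest = vapply(crimes, state_at, character(1), pick = which.max),
  row.names = NULL,
  stringsAsFactors = FALSE
)
extremes
```

Note that which.min() and which.max() return only the first position when several states tie for the extreme value.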
I want to regroup US states by regions and thus I need to define a "US state" -> "US Region" mapping function, which is done by setting up an appropriate data frame.
The basis is this exercise (apparently this is a map of the "Commonwealth of the Fallout"):
One starts off with an original list in raw form:
Alabama = "Gulf"
Arizona = "Four States"
Arkansas = "Texas"
California = "South West"
Colorado = "Four States"
Connecticut = "New England"
Delaware = "Columbia"
which eventually leads to this R code:
us_state <- c("Alabama","Arizona","Arkansas","California","Colorado","Connecticut",
"Delaware","District of Columbia","Florida","Georgia","Idaho","Illinois","Indiana",
"Iowa","Kansas","Kentucky","Louisiana","Maine","Maryland","Massachusetts","Michigan",
"Minnesota","Mississippi","Missouri","Montana","Nebraska","Nevada","New Hampshire",
"New Jersey","New Mexico","New York","North Carolina","North Dakota","Ohio","Oklahoma",
"Oregon","Pennsylvania","Rhode Island","South Carolina","South Dakota","Tennessee",
"Texas","Utah","Vermont","Virginia","Washington","West Virginia ","Wisconsin","Wyoming")
us_region <- c("Gulf","Four States","Texas","South West","Four States","New England",
"Columbia","Columbia","Gulf","Southeast","North West","Midwest","Midwest","Plains",
"Plains","East Central","Gulf","New England","Columbia","New England","Midwest",
"Midwest","Gulf","Plains","North","Plains","South West","New England","Eastern",
"Four States","Eastern","Southeast","North","East Central","Plains","North West",
"Eastern","New England","Southeast","North","East Central","Texas","Four States",
"New England","Columbia","North West","Eastern","Midwest","North")
us_state_to_region_map <- data.frame(us_state, us_region, stringsAsFactors=FALSE)
which is supremely ugly and unmaintainable as the State -> Region mapping is effectively
obfuscated.
I actually wrote a Perl program to generate the above from the original list.
In Perl, one would write things like:
#!/usr/bin/perl
$mapping = {
"Alabama"=> "Gulf",
"Arizona"=> "Four States",
"Arkansas"=> "Texas",
"California"=> "South West",
"Colorado"=> "Four States",
"Connecticut"=> "New England",
...etc...etc...
"West Virginia "=> "Eastern",
"Wisconsin"=> "Midwest",
"Wyoming"=> "North" };
which is maintainable because one can verify the mapping on a line-by-line basis.
There must be something similar to this Perl goodness in R?
It seems a bit open for interpretation as to what you're looking for.
Is the mapping meant to be function-like, so that a call returns the region or vice versa (e.g. a call like mapping("Alabama") => "Gulf")?
I am reading the question to be more looking for a dictionary style storage, which in R could be obtained with an equivalent named list
ncountry <- 49
mapping <- as.list(c("Gulf","Four States",
...
,"Midwest","North"))
names(mapping) <- c("Alabama","Arizona",
...
,"Wisconsin","Wyoming")
mapping[["Pennsylvania"]]
[1] "Eastern"
This could be performed in a single call
mapping <- list("Alabama" = "Gulf",
"Arizona" = "Four States",
...,
"Wisconsin" = "Midwest",
"Wyoming" = "North")
Which makes it very simple to check that the mapping is working as expected. This doesn't convert nicely to a 2 column data.frame however, which we would then obtain using
mapping_df <- data.frame(region = unlist(mapping), state = names(mapping))
Note that "not nicely" simply means that as.data.frame doesn't translate the input into a two-column output.
Alternatively just using a named character vector would likely be fine too
mapping_c <- c("Alabama" = "Gulf",
"Arizona" = "Four States",
...,
"Wisconsin" = "Midwest",
"Wyoming" = "North")
which would be converted to a data.frame in almost the same fashion
mapping_df_c <- data.frame(region = mapping_c, state = names(mapping_c))
Note however a slight difference between the two choices of storage. Referencing an entry that exists works just fine with either single brackets [ or double brackets [[
#Works:
mapping_c["Pennsylvania"] == mapping["Pennsylvania"]
#output
Pennsylvania
TRUE
mapping_c[["Pennsylvania"]] == mapping[["Pennsylvania"]]
[1] TRUE
But when referencing unknown entries these differ slightly in behaviour
#works sorta:
mapping_c["hello"] == mapping["hello"]
#output
$<NA>
NULL
#Does not work:
mapping_c[["hello"]] == mapping[["hello"]]
Error in mapping_c[["hello"]] : subscript out of bounds
If you are converting your input into a data.frame this is not an issue, but it is worth being aware of this, so you obtain the behaviour expected.
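The difference for unknown entries can be seen in isolation with a two-entry mapping. This is a minimal sketch; region_of is a hypothetical helper that returns NA for unknown states regardless of which storage is used.

```r
mapping   <- list(Alabama = "Gulf", Arizona = "Four States")
mapping_c <- c(Alabama = "Gulf", Arizona = "Four States")

# A list returns NULL for an unknown name with [[ ]] ...
list_miss <- mapping[["hello"]]
# ... a named character vector returns NA with [ ] ...
vec_miss <- mapping_c["hello"]
# ... and raises an error with [[ ]], captured here as a message.
vec_err <- tryCatch(mapping_c[["hello"]], error = function(e) conditionMessage(e))

# Hypothetical helper: vectorised lookup that yields NA for unknown states.
region_of <- function(state, map = mapping_c) unname(unlist(map)[state])
region_of(c("Arizona", "hello"))
```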
Of course you could use a function call to create a proper dictionary with a simple switch statement. I don't think that would be any prettier though.
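For completeness, a switch-based version might look like the sketch below. state_to_region is a made-up name, only three states are listed, and the trailing unnamed argument acts as the default.

```r
# Hypothetical lookup function built on switch(); works on one state at a time.
state_to_region <- function(state) {
  switch(state,
         "Alabama"  = "Gulf",
         "Arizona"  = "Four States",
         "Arkansas" = "Texas",
         NA_character_)  # default for anything not listed
}

state_to_region("Arizona")
# switch() is not vectorised, so apply it element by element.
vapply(c("Alabama", "Oregon"), state_to_region, character(1))
```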
If us_region is a named list...
us_region <- list(Alabama = "Gulf",
Arizona = "Four States",
Arkansas = "Texas",
California = "South West",
Colorado = "Four States",
Connecticut = "New England",
Delaware = "Columbia")
Then,
us_state_to_region_map <- data.frame(us_state = names(us_region),
us_region = sapply(us_region, c),
stringsAsFactors = FALSE)
and, as a bonus, you also get the states as row names...
us_state_to_region_map
us_state us_region
Alabama Alabama Gulf
Arizona Arizona Four States
Arkansas Arkansas Texas
California California South West
Colorado Colorado Four States
Connecticut Connecticut New England
Delaware Delaware Columbia
As @tim-biegeleisen says, it could be more appropriate to maintain this dataset in a database, a CSV file or a spreadsheet and open it in R (with readxl::read_excel(), readr::read_csv(), ...).
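If the mapping is kept as plain text in the State = "Region" form shown in the question, it can also be parsed with base R. A minimal sketch, assuming every line has exactly that shape; in practice raw would come from something like readLines() on a file of your own.

```r
# Four lines copied from the question's raw list.
raw <- c('Alabama = "Gulf"',
         'Arizona = "Four States"',
         'Arkansas = "Texas"',
         'California = "South West"')

# Split each line at " = ", then strip the double quotes from the region.
parts <- strsplit(raw, " = ", fixed = TRUE)
us_state_to_region_map <- data.frame(
  us_state  = trimws(vapply(parts, `[`, character(1), 1)),
  us_region = gsub('"', "", vapply(parts, `[`, character(1), 2), fixed = TRUE),
  stringsAsFactors = FALSE
)
us_state_to_region_map
```

This keeps the line-by-line readability of the raw list while still producing the two-column data frame.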
However, if you want to write it directly in your code, you can use tibble::tribble(), which lets you write a data frame row by row:
library(tibble)
tribble(~ state, ~ region,
"Alabama", "Gulf",
"Arizona", "Four States",
(...)
"Wisconsin", "Midwest",
"Wyoming", "North")
One option could be to create a data frame in wide format and then transform it to the long format. Your initial list makes this very straightforward, the mapping stays very obvious, and it is actually quite similar to your Perl code:
library(tidyr)
data.frame(
Alabama = "Gulf",
Arizona = "Four States",
Arkansas = "Texas",
California = "South West",
Colorado = "Four States",
Connecticut = "New England",
Delaware = "Columbia",
stringsAsFactors = FALSE
) %>%
gather("us_state", "us_region") # transform to long format
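Whichever way the mapping is written down, using it to regroup states comes down to a lookup by name. A short base R sketch with the seven states from the question; "Oregon" is included only to show what happens for a state that is missing from the mapping.

```r
us_region <- c(Alabama = "Gulf", Arizona = "Four States", Arkansas = "Texas",
               California = "South West", Colorado = "Four States",
               Connecticut = "New England", Delaware = "Columbia")

# Vectorised lookup: states missing from the mapping come back as NA.
states  <- c("Colorado", "Alabama", "Oregon")
regions <- unname(us_region[states])
regions

# Reverse view: which states belong to each region.
by_region <- split(names(us_region), us_region)
by_region[["Four States"]]
```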
The problem here is similar to this previous one, but here we do not need to do any computation, just build lists.
I have some list of world regions:
list.asia <- c("Central Asia", "Eastern Asia", "South-eastern Asia", "Southern Asia", "Western Asia")
list.africa <- c("Northern Africa", "Sub-Saharan Africa", "Eastern Africa", "Middle Africa", "Southern Africa", "Western Africa")
I use the R library("ISOcodes") to produce lists of countries as ISO alpha-3 codes, as follows:
region <- subset(UN_M.49_Regions, Name %in% list.asia)
subset <- subset(UN_M.49_Countries, Code %in% unlist(strsplit(region$Children, ", ")))
subset$ISO_Alpha_3
This example, with list.asia, gives the expected result:
[1] "AFG" "ARM" "AZE" "BHR" "BGD" "BTN" "BRN" "KHM" "CHN" "HKG" "MAC" "CYP" "PRK"
[14] "GEO" "IND" "IDN" "IRN" "IRQ" "ISR" "JPN" "JOR" "KAZ" "KWT" "KGZ" "LAO" "LBN"
[27] "MYS" "MDV" "MNG" "MMR" "NPL" "OMN" "PAK" "PHL" "QAT" "KOR" "SAU" "SGP" "LKA"
[40] "PSE" "SYR" "TJK" "THA" "TLS" "TUR" "TKM" "ARE" "UZB" "VNM" "YEM"
which can easily be saved as follows:
countries.list.asia <- subset$ISO_Alpha_3
The problem is that I have got a lot of regions and I would prefer to do a loop.
To keep it simple let's say that I only have 2 lists list.asia and list.africa. I regroup them in a new list.continent
list.continent <- c("list.asia","list.africa")
and then I "loop" the list production: (which does not work)
for(i in list.continent){
list.loop <- sym(i)
region <- subset(UN_M.49_Regions, Name %in% list.loop)
subset <- subset(UN_M.49_Countries, Code %in% unlist(strsplit(region$Children, ", ")))
paste("countries",list.loop, sep=".") <- subset$ISO_Alpha_3
rm(region, subset, list.loop)
}
The expected results (in this case) are 2 new objects (class list) called countries.list.asia and countries.list.africa containing the ISO Alpha 3 digits codes of the countries present in these regions.
I tried to replace list.loop by !!list.loop or as.list(list.loop), but nothing works. Any Idea?
Consider using one overall list rather than saving objects to the global environment iteratively, and use a function to return the output you need, which also avoids having to remove helper objects. In R, a list plus a function can be combined with lapply (or its wrapper sapply, used here to keep the list names):
# NAMED LIST OF ACTUAL OBJECTS (NOT CHARACTER VECTOR)
list.continent <- list(list.asia = list.asia, list.africa = list.africa)
# BUILD NEW LIST OF SUBSETTED ITEMS
new_list.continent <- sapply(list.continent, function(item) {
region <- subset(UN_M.49_Regions, Name %in% item)
sub <- subset(UN_M.49_Countries, Code %in% unlist(strsplit(region$Children, ", ")))
return(sub$ISO_Alpha_3)
}, simplify = FALSE)
# SHOW OBJECT CONTENTS
new_list.continent$list.asia
new_list.continent$list.africa
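If separate objects in the workspace are really what is wanted, the loop from the question can be sketched with get() and assign() instead of sym(). The two small data frames below are made-up stand-ins for UN_M.49_Regions and UN_M.49_Countries so that the example runs without the ISOcodes package; the list-based approach above is usually still easier to work with.

```r
# Toy stand-ins for UN_M.49_Regions and UN_M.49_Countries (illustrative rows only).
toy_regions <- data.frame(Name = c("Eastern Asia", "Northern Africa"),
                          Children = c("156, 392", "012, 818"),
                          stringsAsFactors = FALSE)
toy_countries <- data.frame(Code = c("156", "392", "012", "818"),
                            ISO_Alpha_3 = c("CHN", "JPN", "DZA", "EGY"),
                            stringsAsFactors = FALSE)

list.asia <- c("Eastern Asia")
list.africa <- c("Northern Africa")
list.continent <- c("list.asia", "list.africa")

for (i in list.continent) {
  wanted <- get(i)  # fetch the vector whose name is stored in i
  region <- toy_regions[toy_regions$Name %in% wanted, ]
  codes  <- unlist(strsplit(region$Children, ", "))
  result <- toy_countries$ISO_Alpha_3[toy_countries$Code %in% codes]
  assign(paste("countries", i, sep = "."), result)  # e.g. countries.list.asia
}

countries.list.asia
countries.list.africa
```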
I have more than 10k address info, looks like "XXX street, city, state, US", in a character vector.
I want to group them by state, so I use nested ifelse calls to get an address data.frame with two variables, add_info and state.
library(stringr)
for (i in nrow(address){
ifelse(str_detect(address, 'Alabama'), address[i,state]='Alabama',
ifelse(str_detect(address, 'Alaska'), address[i,state]='Alaska',
ifelse(str_detect(address, 'Arizona'), address[i,state]='Arizona',
...
ifelse(str_detect(address, 'Wyoming'), address[i,state]='Wyoming', address[i,state]=NA)...)
}
Of course, this is extremely inefficient, but I don't know how to rewrite this nested ifelse. Any idea?
There are many ways to approach this problem. This is one approach assuming that your address string always contains the full spelling of only one US state.
library(stringr)
# Get a list of all states
state.list = scan(text = "Alabama, Alaska, Arizona, Arkansas, California, Colorado, Connecticut, Delaware, Florida, Georgia, Hawaii, Idaho, Illinois, Indiana, Iowa, Kansas, Kentucky, Louisiana, Maine, Maryland, Massachusetts, Michigan, Minnesota, Mississippi, Missouri, Montana, Nebraska, Nevada, New Hampshire, New Jersey, New Mexico, New York, North Carolina, North Dakota, Ohio, Oklahoma, Oregon, Pennsylvania, Rhode Island, South Carolina, South Dakota, Tennessee, Texas, Utah, Vermont, Virginia, Washington, West Virginia, Wisconsin, Wyoming", what = "", sep = ",", strip.white = T)
# Extract state from vector address using library(stringr)
state = unlist(sapply(address, function(x) state.list[str_detect(x, state.list)]))
# Generate fake data to test
fake.address = paste0(replicate(10, paste(sample(c(0:9, LETTERS), 10, replace=TRUE), collapse="")),
sample(state.list, 20, rep = T),
replicate(10, paste(sample(c(0:9, LETTERS), 10, replace=TRUE), collapse="")))
# Test using fake address
unlist(sapply(fake.address, function(x) state.list[str_detect(x, state.list)]))
Output for fake address
O4H8V0NYEHColoradoA5K5XK35LX 44NDPQVMZ8UtahMY0I4M3086 LJ0LJW8BOBFloridaP5H2QW8B81 521IHHC1MFCaliforniaG7QTYCJRO5
"Colorado" "Utah" "Florida" "California"
YESTB7R6EPRhode IslandXEEGD4GEY3 5OHN2BR29HKansasCOKR9DY1WJ 4UXNJQW0QKNew MexicoH9GVQR3ZFY 5SYELTKO5HTexas3ONM1HU1VB
"Rhode Island" "Kansas" "New Mexico" "Texas"
Z8MKKL7K1RWashingtonGEBS7LJUU0 WPRSQEI2CNIndiana141S0Z1M2E O4H8V0NYEHNorth DakotaA5K5XK35LX 44NDPQVMZ8New HampshireMY0I4M3086
"Washington" "Indiana" "North Dakota" "New Hampshire"
LJ0LJW8BOBWest VirginiaP5H2QW8B811 LJ0LJW8BOBWest VirginiaP5H2QW8B812 521IHHC1MFNew JerseyG7QTYCJRO5 YESTB7R6EPWisconsinXEEGD4GEY3
"Virginia" "West Virginia" "New Jersey" "Wisconsin"
5OHN2BR29HOregonCOKR9DY1WJ 4UXNJQW0QKOhioH9GVQR3ZFY 5SYELTKO5HRhode Island3ONM1HU1VB Z8MKKL7K1ROklahomaGEBS7LJUU0
"Oregon" "Ohio" "Rhode Island" "Oklahoma"
WPRSQEI2CNIowa141S0Z1M2E
"Iowa"
edit: Use the following function based on agrep() for fuzzy matching. It should work with minor spelling mistakes. Note that the code calls the index-assign operator [<- functionally, so the operator has to be wrapped in backticks.
unlist(sapply(fake.address, function(x) state.list[`[<-`((L<-as.logical(sapply(state.list, function(s) agrep(s, x)*1))),is.na(L),F)]))
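The fake-address output above shows one address containing "West Virginia" being reported as "Virginia". One way around that, sketched here in base R, is to search for all state names in a single alternation so that each address yields at most one, leftmost match. extract_state is a hypothetical helper; it uses the built-in state.name vector and assumes no other part of the address (a street name, say) spells out a state.

```r
# Hypothetical helper: first full state name found in each address, NA if none.
extract_state <- function(addr, states = state.name) {
  # Longest names first, so a longer name wins over a shorter one at the same position.
  states  <- states[order(nchar(states), decreasing = TRUE)]
  pattern <- paste(states, collapse = "|")
  hit <- regexpr(pattern, addr, perl = TRUE)
  out <- rep(NA_character_, length(addr))
  out[hit > 0] <- regmatches(addr, hit)  # regmatches() returns matched elements only
  out
}

addr <- c("12 Main St, Charleston, West Virginia, US",
          "1 Elm St, Little Rock, Arkansas, US",
          "no state here")
extract_state(addr)
```

Because the whole vector is handled by one regexpr() call, this also avoids looping over 10k addresses one at a time.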
Assuming that your formatting is consistent (sensu Joran's comment above), you could just parse with strsplit and then use data.frame:
address1<-"410 West Street, Small Town, MN, US"
address2<-"5844 Green Street, Foo Town, NY, US"
address3<-"875 Cardinal Lane, Placeville, CA, US"
vector<-c(address1,address2,address3)
df<-t(data.frame(strsplit(vector,", ")))
colnames(df)<-c("Number","City","State","Country")
rownames(df)<-NULL
df
which produces:
Number City State Country
[1,] "410 West Street" "Small Town" "MN" "US"
[2,] "5844 Green Street" "Foo Town" "NY" "US"
[3,] "875 Cardinal Lane" "Placeville" "CA" "US"
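The same strsplit idea can be taken one step further to get a real data frame (the result above is a character matrix) and to group by state. A sketch, assuming every address has exactly four comma-separated parts; the column name "Street" is my own choice.

```r
addresses <- c("410 West Street, Small Town, MN, US",
               "5844 Green Street, Foo Town, NY, US",
               "875 Cardinal Lane, Placeville, CA, US")

# One row per address, one column per comma-separated part.
parts <- do.call(rbind, strsplit(addresses, ", ", fixed = TRUE))
df <- as.data.frame(parts, stringsAsFactors = FALSE)
names(df) <- c("Street", "City", "State", "Country")

# Group the street addresses by state.
by_state <- split(df$Street, df$State)
by_state
```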
There are several methods.
First we need some sample data.
# some sample data
set.seed(123)
dat <- data.frame(addr=sprintf('123 street, Townville, %s, US',
sample(state.name, 25, replace=T)),
stringsAsFactors=F)
If your data is super regular like that:
# the easy way, split on commas:
matrix(unlist(strsplit(dat$addr, ',')), ncol=4, byrow=T)
Method 2: use grep to search for the state names. This works even if the number of comma-separated fields differs between rows, as long as the state is always spelled the same way and sits between commas, as the pattern below requires.
# get a list of state name matches; need to match ', state name,' otherwise
# West Virginia counts as Virginia...
matches <- sapply(paste0(', ', state.name, ','), grep, dat$addr)
# now pair up the state name with the row it matches to
state_df <- data.frame(state=rep(state.name, sapply(matches, length)),
row=unname(unlist(matches)),
stringsAsFactors=F)
# reorder based on position in original data.frame, and there you go!
dat$state <- state_df[order(state_df$row), 'state']
This seemed to be working in my tests:
just.ST <- gsub( paste0(".+(", paste(state.name,collapse="|"), ").+$"),
"\\1", address)
As mentioned in comments and illustrated in other answers, state.name should be available by default. It does have the deficiency that in case of a non-match it returns the whole string, but you can probably use:
is.na(just.ST) <- nchar(just.ST) > max(nchar(state.name))