Using regular expressions with R

I have a character vector in R. Some of the strings have a '(number)' pattern appended to them. I'm trying to remove this '(number)' suffix using regular expressions but cannot figure it out. I can find the rows where the string contains whitespace followed by other characters, but there must be a way to target just these number strings.
dat <- c("Alabama-Birmingham", "Arizona State", "Canisius", "UCF", "George Washington",
"Green Bay", "Iona", "Louisville (7)", "UMass", "Memphis", "Michigan State",
"Milwaukee", "Nebraska", "Niagara", "Northern Kentucky", "Notre Dame (21)",
"Quinnipiac", "Siena", "Tulsa", "Washington State", "Wright State",
"Xavier")
rows <- grep(" (.*)", dat)
fixed <- gsub(" (.*)","",games[rows,])
dat = fixed

First, you need to escape the parentheses, and it would be good to be more specific about what is inside them:
gsub("\\s+\\(\\d+\\)", "", dat)
[1] "Alabama-Birmingham" "Arizona State" "Canisius"
[4] "UCF" "George Washington" "Green Bay"
[7] "Iona" "Louisville" "UMass"
[10] "Memphis" "Michigan State" "Milwaukee"
[13] "Nebraska" "Niagara" "Northern Kentucky"
[16] "Notre Dame" "Quinnipiac" "Siena"
[19] "Tulsa" "Washington State" "Wright State"
[22] "Xavier"
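
To see why the original attempt removed too much, it may help to compare the unescaped and escaped patterns side by side. The following is a minimal sketch: the vector name teams is made up for illustration, and the trailing $ anchor is an optional extra that assumes the number always sits at the end of the string.

```r
# Hypothetical sample taken from the question's data
teams <- c("Arizona State", "Louisville (7)", "Notre Dame (21)")

# Unescaped: "(" and ")" form a capture group, so " (.*)" means
# "a space followed by anything"; everything after the first space is removed
unescaped <- sub(" (.*)", "", teams)

# Escaped: "\\(" and "\\)" match literal parentheses, "\\d+" matches digits only;
# "$" (an added assumption) ties the match to the end of the string
escaped <- sub("\\s+\\(\\d+\\)$", "", teams)

print(unescaped)  # "Arizona" "Louisville" "Notre"
print(escaped)    # "Arizona State" "Louisville" "Notre Dame"
```

The unescaped version also damages names that never had a number, which is why escaping matters even for the rows that look clean.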

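We can do this with sub, removing everything from the first opening parenthesis (and any whitespace before it) to the end of the string: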
sub("\\s*\\(.*", "", dat)
#[1] "Alabama-Birmingham" "Arizona State" "Canisius"
#[4] "UCF" "George Washington" "Green Bay"
#[7] "Iona" "Louisville" "UMass"
#[10] "Memphis" "Michigan State" "Milwaukee"
#[13] "Nebraska" "Niagara" "Northern Kentucky"
#[16] "Notre Dame" "Quinnipiac" "Siena"
#[19] "Tulsa" "Washington State" "Wright State"
#[22] "Xavier"
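
The two answers behave differently when a name contains parentheses that are not a ranking. A small sketch, using a name with a non-numeric suffix as an assumed example (no such name occurs in the question's data):

```r
# Hypothetical names: one with a numeric suffix, one with a non-numeric suffix
x <- c("Louisville (7)", "Miami (FL)")

# Removes everything from the first "(" onward, whatever is inside
from_paren <- sub("\\s*\\(.*", "", x)

# Removes only a parenthesised run of digits
digits_only <- gsub("\\s+\\(\\d+\\)", "", x)

print(from_paren)   # "Louisville" "Miami"
print(digits_only)  # "Louisville" "Miami (FL)"
```

If the data might contain qualifiers such as a state code in parentheses, the digits-only pattern is probably the safer choice.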

Related

In R, remove substring pattern from string with gsub

We have a string column in our database with values for sports teams. The names of these teams are occasionally prefixed with the team's ranking, like so: (13) Miami (FL). Here the 13 is Miami's rank, and the (FL) means this is Miami of Florida, not Miami of Ohio (Miami (OH)).
We need to clean up this string, removing (13) and keeping only Miami (FL). So far we've used gsub and tried the following:
> gsub("\\s*\\([^\\)]+\\)", "", "(13) Miami (FL)")
[1] " Miami"
This is incorrectly removing the (FL) suffix, and it's also not handling the white space correctly in front.
Edit
Here are a few additional school names, to show a bit of the data we're working with. Note that not every school has the (##) prefix:
c("North Texas", "Southern Methodist", "Texas-El Paso",
"Brigham Young", "Winner", "(12) Miami (FL)", "Appalachian State",
"Arkansas State", "Army", "(1) Clemson",
"(14) Georgia Southern")
You can use sub to remove a number in brackets followed by whitespace.
sub("\\(\\d+\\)\\s", "", "(13) Miami (FL)")
#[1] "Miami (FL)"
The regex could be made stricter based on the pattern in data.
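
One way to make the pattern stricter, assuming the ranking only ever appears at the very start of the name, is to anchor it with ^. A minimal sketch, with teams as a made-up variable name holding part of the question's sample data:

```r
# Hypothetical variable holding a subset of the question's sample names
teams <- c("North Texas", "(12) Miami (FL)", "Army", "(1) Clemson",
           "(14) Georgia Southern")

# "^" anchors the match to the start of the string, so a parenthesised
# number elsewhere in the name would be left alone
cleaned <- sub("^\\(\\d+\\)\\s+", "", teams)

print(cleaned)
# "North Texas" "Miami (FL)" "Army" "Clemson" "Georgia Southern"
```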
We can match the opening ( followed by one or more digits (\\d+), then the closing ) and one or more spaces (\\s+), and replace the match with an empty string ("")
sub("\\(\\d+\\)\\s+", "", "(13) Miami (FL)")
#[1] "Miami (FL)"
Using the OP's updated example (stored here as v1)
sub("\\(\\d+\\)\\s+", "", v1)
#[1] "North Texas" "Southern Methodist" "Texas-El Paso" "Brigham Young" "Winner" "Miami (FL)"
#[7] "Appalachian State" "Arkansas State" "Army" "Clemson" "Georgia Southern"
Or another option with str_remove from stringr
library(stringr)
str_remove("(13) Miami (FL)", "\\(\\d+\\)\\s+")
Another solution, based on stringr, is this:
str_extract(v1, "[A-Z].*")
[1] "North Texas" "Southern Methodist" "Texas-El Paso" "Brigham Young" "Winner"
[6] "Miami (FL)" "Appalachian State" "Arkansas State" "Army" "Clemson"
[11] "Georgia Southern"
This extracts everything starting from the first upper case letter (thereby ignoring the unwanted rankings).

Capitalizing in R with Exceptions

How can you convert data to title case in R while leaving certain words (here, the state abbreviations) in upper case?
For example:
Given a list of cities and states in the form: "NEW YORK, NY"
It needs to be changed to: "New York, NY"
The str_to_title function changes it to "New York, Ny".
Patterns:
WASHINGTON, DC
AMHERST, MA
HANOVER, NH
DAVIDSON, NC
BRUNSWICK, ME
GREENVILLE, SC
PORTLAND, OR
LOUISVILLE, KY
They should all be in the form: Amherst, MA or Brunswick, ME
We could use a negative lookbehind ((?<!, )) so that a run of upper-case letters is matched only when it does not directly follow a comma and a space. The first letter and the remaining letters are captured as two groups ((...)); in the replacement we refer to them by backreference (\\1, \\2) and convert the second group to lower case with \\L, which requires perl = TRUE.
gsub("(?<!, )([A-Z])([A-Z]+)\\b", "\\1\\L\\2", str1, perl = TRUE)
#[1] "New York, NY" "Washington, DC" "Amherst, MA" "Hanover, NH"
#[5] "Davidson, NC" "Brunswick, ME"
#[7] "Greenville, SC" "Portland, OR" "Louisville, KY"
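
An alternative that avoids the lookbehind, sketched here under the assumption that every element has the form "CITY, ST" with exactly one comma, is to split off the state code, title-case only the city, and paste the two back together. The names places, city, state and city_title are made up for illustration.

```r
# Hypothetical sample in the same "CITY, ST" form as the question
places <- c("NEW YORK, NY", "WASHINGTON, DC", "PORTLAND, OR")

# Split each element at the comma
city  <- sub(",.*$", "", places)      # e.g. "NEW YORK"
state <- sub("^.*,\\s*", "", places)  # e.g. "NY"

# Lower-case the city, then upper-case the first letter of every word;
# \\U in the replacement needs perl = TRUE
city_title <- gsub("\\b([a-z])", "\\U\\1", tolower(city), perl = TRUE)

result <- paste(city_title, state, sep = ", ")
print(result)  # "New York, NY" "Washington, DC" "Portland, OR"
```

This version would mangle an element without a comma (city and state would both be the whole string), so the single-regex approach is more forgiving on irregular input.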
data
str1 <- c("NEW YORK, NY", "WASHINGTON, DC", "AMHERST, MA", "HANOVER, NH",
"DAVIDSON, NC", "BRUNSWICK, ME", "GREENVILLE, SC", "PORTLAND, OR",
"LOUISVILLE, KY")

USArrests data.frame in R - which state (row) presents the smallest and the largest crime rate (column)

I am using the USArrests data.frame in R and I need to see for each crime (Murder, Assault and Rape) which state presents the smallest and the largest crime rate.
I guess I have to calculate the max and min for each crime and I have done that.
which(USArrests$Murder == min(USArrests$Murder))
[1] 34
The problem is that I cannot retrieve just the state name for row 34, only the whole row:
USArrests[34,]
             Murder Assault UrbanPop Rape
North Dakota    0.8      45       44  7.3
I am just starting using R so can anyone help me please?
I would usually suggest taking a different approach to a problem like this but for ease I'm going to offer the following solution and maybe come back later with a more well thought out way.
You can use the attributes() function to see particular 'attributes' of a dataframe.
Eg:
attributes(USArrests)
will give you the following output.
$names
[1] "Murder" "Assault" "UrbanPop" "Rape"
$class
[1] "data.frame"
$row.names
[1] "Alabama" "Alaska" "Arizona" "Arkansas" "California" "Colorado"
[7] "Connecticut" "Delaware" "Florida" "Georgia" "Hawaii" "Idaho"
[13] "Illinois" "Indiana" "Iowa" "Kansas" "Kentucky" "Louisiana"
[19] "Maine" "Maryland" "Massachusetts" "Michigan" "Minnesota" "Mississippi"
[25] "Missouri" "Montana" "Nebraska" "Nevada" "New Hampshire" "New Jersey"
[31] "New Mexico" "New York" "North Carolina" "North Dakota" "Ohio" "Oklahoma"
[37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina" "South Dakota" "Tennessee"
[43] "Texas" "Utah" "Vermont" "Virginia" "Washington" "West Virginia"
[49] "Wisconsin" "Wyoming"
So now we know the dataframe is composed of 'names' (name of charge), 'row.names' (names of states) and that the 'class' is a dataframe. As a newcomer to R it is important to note that in the results above, the row id is only given for the first item on each new line. This will make more sense in the last step.
Using this knowledge we can use attributes to find just the states by doing the following:
attributes(USArrests)$row.names
To find the 34th state in the list which you have identified as North Dakota, we can simply give the row id for that state, as per below.
attributes(USArrests)$row.names[34]
Which will give you....
[1] "North Dakota"
Again, this is probably not the most elegant way of doing this, but it will work for your scenario.
Hope this helps and happy coding.
EDIT
As I mentioned there's usually a more elegant, performant and efficient way of doing things. Here is another such way of achieving your goal.
row.names(USArrests)[which.min(USArrests$Murder)]
You'll probably be able to see instantly what is happening here, but essentially, we're asking for the row name associated with the lowest value for the Murder charge. Again this gives...
[1] "North Dakota"
You can now apply this logic to find the states with the max & min crime rates for each offence. Eg, for max Assaults
row.names(USArrests)[which.max(USArrests$Assault)]
Giving...
[1] "North Carolina"
It appears that the State name is stored as a rowname. You can access the rownames of a dataframe using the rownames function.
To find the element which has the lowest value in the vector-column, you can use the which.min function.
We have indeed:
> USArrests[which.min(USArrests$Murder), "Murder"]
[1] 0.8
Hence, your command becomes:
> rownames(USArrests)[which.min(USArrests$Murder)]
[1] "North Dakota"
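
Since the question asks for the smallest and largest rate for each of Murder, Assault and Rape, the same idiom can be applied to all three columns at once. This is a minimal sketch; crimes and extremes are names chosen for illustration, and which.min and which.max return only the first row in case of a tie.

```r
# Columns of interest (UrbanPop is not a crime rate, so it is left out)
crimes <- c("Murder", "Assault", "Rape")

# For each crime, look up the row names at the minimum and maximum rate
extremes <- sapply(crimes, function(crime) {
  rates <- USArrests[[crime]]
  c(lowest  = rownames(USArrests)[which.min(rates)],
    highest = rownames(USArrests)[which.max(rates)])
})

# A character matrix with one column per crime and rows "lowest" and "highest"
print(extremes)
```

Individual entries can then be read with, for example, extremes["lowest", "Murder"].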

How to sort data by alphabetical order by removing numbers at the beginning of the name [duplicate]

I'm trying to sort these state names in alphabetical order while retaining the number to the left of the state name. I currently cannot figure out how to do this.
I've tried using various forms of gsub in an attempt to remove the numbers before sorting without any success.
This is the dataset with the states:
print(StateRankings)
# [1] "1. Arizona" "10. Missouri" "11. Tennessee" "12. Florida"
# [5] "13. West Virginia" "14. Kentucky" "15. New Hampshire" "16. Mississippi"
# [9] "17. Wyoming" "18. Alabama" "19. Idaho" "2. Alaska"
#[13] "20. Vermont" "21. Indiana" "22. Arkansas" "23. Wisconsin"
#[17] "24. South Carolina" "25. Nevada" "26. North Carolina" "27. Michigan"
#[21] "28. Louisiana" "29. Ohio" "3. Kansas" "30. Maine"
#[25] "31. Virginia" "32. South Dakota" "33. Pennsylvania" "34. Oregon"
#[29] "35. Nebraska" "36. Iowa" "37. New Mexico" "38. Washington"
#[33] "39. Colorado" "4. Oklahoma" "40. Illinois" "41. Minnesota"
#[37] "42. Delaware" "43. Rhode Island" "44. Maryland" "45. Connecticut"
#[41] "46. California" "47. Hawaii" "48. New Jersey" "49. Massachusetts"
#[45] "5. Montana" "50. New York" "6. Utah" "7. North Dakota"
#[49] "8. Texas" "9. Georgia"
We can remove the numbers and dot from the character vector, then use order to sort only the names and subset the original vector.
StateRankings[order(sub("^\\d+\\.\\s+", "", StateRankings))]
#[1] "18. Alabama" "2. Alaska" "1. Arizona" "12. Florida" "19. Idaho"
#[6] "14. Kentucky" "16. Mississippi" "10. Missouri" "15. New Hampshire"
#[10] "11. Tennessee" "13. West Virginia" "17. Wyoming"
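
If the rank and the name are both needed later, it may be more convenient to split them into two columns first and then sort on whichever one is required. A sketch under the assumption that every element looks like "N. Name"; rankings, ranked, by_name and by_rank are hypothetical names.

```r
# Hypothetical subset of the question's data
rankings <- c("1. Arizona", "10. Missouri", "18. Alabama", "2. Alaska")

ranked <- data.frame(
  rank  = as.integer(sub("\\..*$", "", rankings)),  # digits before the dot
  state = sub("^\\d+\\.\\s+", "", rankings),        # text after "N. "
  stringsAsFactors = FALSE
)

by_name <- ranked[order(ranked$state), ]  # alphabetical by state
by_rank <- ranked[order(ranked$rank), ]   # numeric, so 2 sorts before 10

print(by_name)
```

Sorting on the integer column also avoids the "1, 10, 11, ..., 2" ordering that appears when the original strings are sorted as text.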
Just FYI, R has a built-in vector of state names, state.name, which is already in alphabetical order:
state.name
#[1] "Alabama" "Alaska" "Arizona" "Arkansas" "California" "Colorado"
#[7] "Connecticut" "Delaware" "Florida" "Georgia" "Hawaii" "Idaho"........
data
StateRankings <- c("1. Arizona", "10. Missouri", "11. Tennessee" ,"12. Florida",
"13. West Virginia" ,"14. Kentucky", "15. New Hampshire", "16. Mississippi",
"17. Wyoming", "18. Alabama", "19. Idaho" ,"2. Alaska")

R - using regex to delete all strings with 2 characters or less [duplicate]

I've got a problem and I'm sure it's super simple to fix, but I've been searching for an answer for about an hour and can't seem to work it out.
I have a character vector with data that looks a bit like this:
[5] "Toronto, ON" "Manchester, UK"
[7] "New York City, NY" "Newark, NJ"
[9] "Melbourne" "Los Angeles, CA"
[11] "New York, USA" "Liverpool, England"
[13] "Fort Collins, CO" "London, UK"
[15] "New York, NY"
and basically I'd like to get rid of all words that are two characters or shorter, so that the data then looks as follows:
[5] "Toronto, " "Manchester, "
[7] "New York City, " "Newark, "
[9] "Melbourne" "Los Angeles, "
[11] "New York, USA" "Liverpool, England"
[13] "Fort Collins, " "London, "
[15] "New York, "
The commas I know how to get rid of. As I said, I'm sure this is super simple, any help would be greatly appreciated. Thanks!
You can use a quantifier on the word character class \\w together with word boundaries: \\b\\w{1,2}\\b matches a word of one or two characters. Use gsub rather than sub so that every match in a string is removed:
gsub("\\b\\w{1,2}\\b", "", v)
# [1] "Toronto, " "Manchester, " "New York City, " "Newark, " "Melbourne" "Los Angeles, " "New York, USA"
# [8] "Liverpool, England" "Fort Collins, " "London, " "New York, "
Note that \\w matches letters, digits and the underscore; if you only want to consider letters, you can use gsub("\\b[a-zA-Z]{1,2}\\b", "", v).
v <- c("Toronto, ON", "Manchester, UK", "New York City, NY", "Newark, NJ", "Melbourne", "Los Angeles, CA", "New York, USA", "Liverpool, England", "Fort Collins, CO", "London, UK", "New York, NY")
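
If the short words only ever occur as a trailing region code, the comma can be removed in the same step by anchoring the pattern to the end of the string. This is a sketch under that assumption; cities and trimmed are made-up names, and a short word elsewhere in the string would not be touched.

```r
# Hypothetical subset of the question's data
cities <- c("Toronto, ON", "Melbourne", "New York, USA", "London, UK")

# ",\\s*"          : the comma and any whitespace after it
# "[A-Za-z]{1,2}$" : one or two letters at the very end of the string
trimmed <- sub(",\\s*[A-Za-z]{1,2}$", "", cities)

print(trimmed)  # "Toronto" "Melbourne" "New York, USA" "London"
```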
Doesn't use regex but it gets the job done:
d <- c(
"Toronto, ON", "Manchester, UK",
"New York City, NY", "Newark, NJ",
"Melbourne", "Los Angeles, CA" ,
"New York, USA", "Liverpool, England" ,
"Fort Collins, CO", "London, UK" ,
"New York, NY" )
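# split each string into whitespace-separated tokens
toks <- strsplit(d, "\\s+")
# lapply always returns a list; sapply would simplify to a matrix
# if every string happened to have the same number of tokens
lens <- lapply(toks, nchar)
# keep only the tokens longer than two characters and paste them back together
mapply(function(a, b) {
  paste(a[b > 2], collapse = " ")
}, toks, lens)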
