I'm cleaning a dataset that has location entries like "Sarasota Florida6h". I'm not sure why, but all the strings end with two or three extra characters that start with a number:
[413] "Los Angeles11h" "Pittsburgh PA1h"
[415] "London UK18h" "Mumbai India19h"
[417] "Orange County CA1h" "Columbus OH2d"
[419] "4d" "Sarasota Florida6h"
[421] "Toronto9m" "Adelaide Australia7h"
[423] "Wayland MA4h" "Scottsdale AZ USA1h"
[425] "Sydney Australia6d" "Connecticut USA31m"
[427] "United States5m" "Boulder Colorado12h"
[429] "Berlin Germany7h" " India Chaibasa1h"
I need a script to remove everything from the numeral onward to clean these out.
I've tried the below, but clearly something is wrong here.
follow_dat$loc <- sapply(strsplit(follow_dat$Location, "\\[0-9]"), `[[`, 2)
Your kind assistance is appreciated.
Mari
Use regular expressions. For example, you can clean them this way:
gsub("[0-9]..*", "", follow_dat$Location)
What this expression says is: starting at the first digit, replace it and everything after it with nothing ('') throughout follow_dat$Location.
If there are no other numbers in your strings (as your example suggests), then we can use gsub:
gsub('[0-9]+[a-z]', '', follow_dat$Location)
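If you are worried about digits that might appear earlier in a location name, a slightly more defensive variant (a sketch using the sample values above) anchors the pattern to the end of the string:

```r
locs <- c("Los Angeles11h", "Pittsburgh PA1h", "Toronto9m", "4d")

# Remove one or more digits followed by one or more lowercase letters,
# but only when they sit at the very end of the string
cleaned <- sub("[0-9]+[a-z]+$", "", locs)
cleaned
# [1] "Los Angeles"   "Pittsburgh PA" "Toronto"       ""
```

Note that the bare "4d" entry becomes an empty string, which you may prefer to convert to NA afterwards.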
Related
I want to use textConnection and scan in R to convert
a pasted character dataset into a character vector for use as row.names.
A small example is as follows:
x = textConnection('
Arcadia
Bryce Canyon
Cuyahoga Valley
Everglades
Grand Canyon
Grand Teton
Great Smoky
Hot Springs
Olympic
Mount Rainier
Rocky Mountain
Shenandoah
Yellowstone
Yosemite
Zion
')
scan(x,character(0))
Each line of the dataset represents a place, so
I expect a character vector of length 15.
However, scan(x,character(0)) gives
Read 23 items
[1] "Arcadia" "Bryce" "Canyon" "Cuyahoga" "Valley"
[6] "Everglades" "Grand" "Canyon" "Grand" "Teton"
[11] "Great" "Smoky" "Hot" "Springs" "Olympic"
[16] "Mount" "Rainier" "Rocky" "Mountain" "Shenandoah"
[21] "Yellowstone" "Yosemite" "Zion"
I then tried scan(x,character(0),seq='\n'), but that didn't work either! Any help?
You need to specify the parameter sep (and not seq!) if you want scan to split on newlines rather than on whitespace, which is the default delimiter.
From ?scan:
sep by default, scan expects to read ‘white-space’ delimited input fields. Alternatively, sep can be used to specify a character
which delimits fields. A field is always delimited by an end-of-line
marker unless it is quoted. If specified this should be the empty
character string (the default) or NULL or a character string
containing just one single-byte character.
x = textConnection('
Arcadia
Bryce Canyon
Cuyahoga Valley
Everglades
Grand Canyon
Grand Teton
Great Smoky
Hot Springs
Olympic
Mount Rainier
Rocky Mountain
Shenandoah
Yellowstone
Yosemite
Zion
')
scan(x,character(0), sep="\n")
Returns:
Read 15 items
[1] "Arcadia" "Bryce Canyon" "Cuyahoga Valley" "Everglades"
[5] "Grand Canyon" "Grand Teton" "Great Smoky" "Hot Springs"
[9] "Olympic" "Mount Rainier" "Rocky Mountain" "Shenandoah"
[13] "Yellowstone" "Yosemite" "Zion"
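An alternative worth knowing is readLines(), which splits on newlines by default, so no sep argument is needed (a sketch; unlike scan, it keeps the blank line created by the opening quote, so that has to be dropped by hand):

```r
x <- textConnection('
Arcadia
Bryce Canyon
Zion
')
parks <- readLines(x)
close(x)
parks <- parks[nzchar(parks)]  # drop the empty leading line
parks
# [1] "Arcadia"      "Bryce Canyon" "Zion"
```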
I have address data I extracted from SQL and have now loaded into R. I am trying to extract the individual components, namely the zip code at the end of each string (the state would also be nice). I would like the zip code and state to be in new individual columns.
The primary issue is that the zip code is sometimes 5 digits and sometimes 9.
Two example rows would be:
Address_FULL
1234 NOWHERE ST WASHINGTON DC 20005
567 EVERYWHERE LN CHARLOTTE NC 22011-1203
I suspect I'll need some kind of regex \\d{5} notation, or some kind of fancy manipulation in dplyr that I'm not aware exists.
If the zip code is always at the end you could use
str_extract(Address_FULL,"[[:digit:]]{5}(-[[:digit:]]{4})?$")
To add a "zip" column via dplyr you could use
df %>% mutate(zip = str_extract(Address_FULL,"[[:digit:]]{5}(-[[:digit:]]{4})?$"))
Where df is your dataframe containing Address_FULL and
str_extract() is from stringr.
State could be extracted as follows:
str_extract(Address_FULL,"(?<=\\s)[[:alpha:]]{2}(?=\\s[[:digit:]]{5})")
However, this makes the following assumptions:
The state abbreviation is 2 characters long
The state abbreviation is followed immediately by a space
The zip code follows immediately after the space that follows the state
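The same two extractions also work in base R via regmatches() and regexpr(), if you would rather avoid the stringr dependency (a sketch; note that base R needs perl = TRUE for the lookaround in the state pattern):

```r
addr <- c("1234 NOWHERE ST WASHINGTON DC 20005",
          "567 EVERYWHERE LN CHARLOTTE NC 22011-1203")

# Zip: 5 digits, optionally followed by a dash and 4 more, anchored to the end
zip <- regmatches(addr, regexpr("[[:digit:]]{5}(-[[:digit:]]{4})?$", addr))

# State: two letters sitting between a space and " NNNNN"
state <- regmatches(addr,
                    regexpr("(?<=\\s)[[:alpha:]]{2}(?=\\s[[:digit:]]{5})",
                            addr, perl = TRUE))

zip
# [1] "20005"      "22011-1203"
state
# [1] "DC" "NC"
```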
Assuming that the zip is always at the end, you can try:
tail(unlist(strsplit(STRING, split=" ")), 1)
For example
ex1 = "1234 NOWHERE ST WASHINGTON DC 20005"
ex2 = "567 EVERYWHERE LN CHARLOTTE NC 22011-1203"
> tail(unlist(strsplit(ex1, split=" ")), 1)
[1] "20005"
> tail(unlist(strsplit(ex2, split=" ")), 1)
[1] "22011-1203"
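For a whole column rather than a single string, the same idea vectorises with sapply (a sketch):

```r
addresses <- c("1234 NOWHERE ST WASHINGTON DC 20005",
               "567 EVERYWHERE LN CHARLOTTE NC 22011-1203")

# Take the last space-separated token of each element
zips <- sapply(strsplit(addresses, split = " "), tail, 1)
zips
# [1] "20005"      "22011-1203"
```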
Use my package tfwstring.
It works automatically on any address type, even with prefixes and suffixes.
if (!require(remotes)) install.packages("remotes")
remotes::install_github("nbarsch/tfwstring")
parseaddress("1234 NOWHERE ST WASHINGTON DC 20005", force_stateabb = F)
AddressNumber StreetName StreetNamePostType PlaceName StateName ZipCode
"1234" "NOWHERE" "ST" "WASHINGTON" "DC" "20005"
parseaddress("567 EVERYWHERE LN CHARLOTTE NC 22011-1203", force_stateabb = F)
AddressNumber StreetName StreetNamePostType PlaceName StateName ZipCode
"567" "EVERYWHERE" "LN" "CHARLOTTE" "NC" "22011-1203"
I have a regular expression that is able to match my data, using grepl, but I can't figure out how to extract the sub-expressions inside it to new columns.
This is returning the test string as foo, without any of the sub-expressions:
entryPattern <- "(\\d+)\\s+([[:lower:][:blank:]-]*[A-Z][[:alpha:][:blank:]-]+[A-Z]\\s[[:alpha:][:blank:]]+)\\s+([A-Z]{3})\\s+(\\d{4})\\s+(\\d\\d\\-\\d\\d)\\s+([[:print:][:blank:]]+)\\s+(\\d*\\:?\\d+\\.\\d+)"
test <- "101 POULET Laure FRA 1992 25-29 E. M. S. Bron Natation 26.00"
m <- regexpr(entryPattern, test)
foo <- regmatches(test, m)
In my real use case, I'm acting on lots of strings similar to test. I'm able to find the correctly formatted ones, so I think the pattern is correct.
rows$isMatch <- grepl(entryPattern, rows$text)
What I'm hoping to do is add the sub-expressions as new columns in the rows dataframe (i.e. rows$rank, rows$name, rows$country, etc.). Thanks in advance for any advice.
It seems that regmatches won't do what I want. Instead, I need the stringr package, as suggested by @kent-johnson.
library(stringr)
test <- "101 POULET Laure FRA 1992 25-29 E. M. S. Bron Natation 26.00"
entryPattern <- "(\\d+)\\s+([[:lower:][:blank:]-]*[A-Z][[:alpha:][:blank:]-]+[A-Z]\\s[[:alpha:][:blank:]]+?)\\s+([A-Z]{3})\\s+(\\d{4})\\s+(\\d\\d\\-\\d\\d)\\s+([[:print:][:blank:]]+?)\\s+(\\d*\\:?\\d+\\.\\d+)"
str_match(test, entryPattern)[1,2:8]
Which outputs:
[1] "101"
[2] "POULET Laure"
[3] "FRA"
[4] "1992"
[5] "25-29"
[6] "E. M. S. Bron Natation"
[7] "26.00"
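For completeness: base R can return the capture groups too, via regexec() rather than regexpr() (a sketch; perl = TRUE is used so the lazy +? quantifiers behave as they do in stringr):

```r
entryPattern <- "(\\d+)\\s+([[:lower:][:blank:]-]*[A-Z][[:alpha:][:blank:]-]+[A-Z]\\s[[:alpha:][:blank:]]+?)\\s+([A-Z]{3})\\s+(\\d{4})\\s+(\\d\\d\\-\\d\\d)\\s+([[:print:][:blank:]]+?)\\s+(\\d*\\:?\\d+\\.\\d+)"
test <- "101 POULET Laure FRA 1992 25-29 E. M. S. Bron Natation 26.00"

# regexec records the positions of the capture groups as well as the full match
m <- regexec(entryPattern, test, perl = TRUE)
groups <- regmatches(test, m)[[1]][-1]  # [-1] drops the full match
groups
```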
I am trying to determine how, in R, to split a column that has multiple fields with multiple delimiters.
From an API, I get a column in a data frame called "Location". It has multiple location identifiers in it. Here is an example of one entry (edit: I added a couple more):
6540 BENNINGTON AVE
Kansas City, MO 64133
(39.005620414000475, -94.50998643299965)
4284 E 61ST ST
Kansas City, MO 64130
(39.014638172000446, -94.5335298549997)
3002 SPRUCE AVE
Kansas City, MO 64128
(39.07083265200049, -94.53320606399967)
6022 E Red Bridge Rd
Kansas City, MO 64134
(38.92458893200046, -94.52090062499968)
So the above are the entries in rows 1-4 of the "location" column.
I want to split this into address, city, state, zip, long and lat columns. Some fields are separated by a space or tab while others by a comma. Also, nothing is fixed width.
I have looked at the reshape package, but it seems I need a single delimiter. I can't use a space (or can I?), as the address has spaces in it.
Thoughts?
If the data you have is not like this, let everyone know by adding code we can copy and paste into R to reproduce your data (see how this sample data can be easily copied and pasted into R?)
Sample data:
location <- c(
"6540 BENNINGTON AVE
Kansas City, MO 64133
(39.005620414000475, -94.50998643299965)",
"456 POOH LANE
New York City, NY 10025
(40, -90)")
location
#[1] "6540 BENNINGTON AVE\nKansas City, MO 64133\n(39.005620414000475, -94.50998643299965)"
#[2] "456 POOH LANE\nNew York City, NY 10025\n(40, -90)"
A solution:
# Insert a comma between the state abbreviation and the zip code
step1 <- gsub("([[:alpha:]]{2}) ([[:digit:]]{5})", "\\1,\\2", location)
# get rid of parentheses
step2 <- gsub("\\(|\\)", "", step1)
# split on "\n", ",", and ", "
strsplit(step2, "\n|,|, ")
#[[1]]
#[1] "6540 BENNINGTON AVE" "Kansas City" "MO"
#[4] "64133" "39.005620414000475" "-94.50998643299965"
#[[2]]
#[1] "456 POOH LANE" "New York City" "NY" "10025"
#[5] "40" "-90"
Here is an example with the stringr package.
Using #Frank's example data from above, you can do:
library(stringr)
address <- str_match(location,
"(^[[:print:]]+)[[:space:]]([[:alpha:]. ]+), ([[:alpha:]]{2}) ([[:digit:]]{5})[[:space:]][(]([[:digit:].-]+), ([[:digit:].-]+)")
address <- data.frame(address[,-1]) # get rid of the first column which has the full match
names(address) <- c("address", "city", "state", "zip", "lat", "lon")
> address
address city state zip lat lon
1 6540 BENNINGTON AVE Kansas City MO 64133 39.005620414000475 -94.50998643299965
2 456 POOH LANE New York City NY 10025 40 -90
Note that this is pretty specific to the format of the one entry given. It would need to be tweaked if there is variation in any number of ways.
This takes everything from the start of the string to the first [:space:] character as address. The next set of letters, spaces and periods up until the next comma is given to city. After the comma and a space, the next two letters are given to state. Following a space, the next five digits make up the zip field. Finally, the next set of numbers, period and/or minus signs each get assigned to lat and lon.
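Base R (>= 3.4.0) also offers strcapture(), which goes from capture groups straight to a typed data frame in one call; a sketch reusing the same pattern and sample data:

```r
location <- c(
  "6540 BENNINGTON AVE\nKansas City, MO 64133\n(39.005620414000475, -94.50998643299965)",
  "456 POOH LANE\nNew York City, NY 10025\n(40, -90)")

pattern <- "(^[[:print:]]+)[[:space:]]([[:alpha:]. ]+), ([[:alpha:]]{2}) ([[:digit:]]{5})[[:space:]][(]([[:digit:].-]+), ([[:digit:].-]+)"

# proto gives the column names and types; lat/lon come back as numeric
parsed <- strcapture(pattern, location,
                     proto = data.frame(address = character(),
                                        city = character(),
                                        state = character(),
                                        zip = character(),
                                        lat = numeric(),
                                        lon = numeric()))
parsed
```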
I have a CSV file like
LocationList,Identity,Category
"New York,New York,United States","42","S"
"NA,California,United States","89","lyt"
"Hartford,Connecticut,United States","879","polo"
"San Diego,California,United States","45454","utyr"
"Seattle,Washington,United States","uytr","69"
"NA,NA,United States","87","tree"
I want to remove all 'NA' from the 'LocationList' Column
The Desired Result -
LocationList,Identity,Category
"New York,New York,United States","42","S"
"California,United States","89","lyt"
"Hartford,Connecticut,United States","879","polo"
"San Diego,California,United States","45454","utyr"
"Seattle,Washington,United States","uytr","69"
"United States","87","tree"
The number of columns is not fixed; they may increase or decrease. Also, I want to write to the CSV file without quotes and without escaping for the 'LocationList' column.
How can I achieve this in R?
I'm new to R; any help is appreciated.
In this case, you just want to replace the leading "NA," text with nothing. Note that this is plain string replacement, not the standard way of removing missing (NA) values.
Assuming dat is your data, use
dat$LocationList <- gsub("^(NA,)+", "", dat$LocationList)
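A quick check of that pattern against the affected values (a sketch):

```r
loc <- c("New York,New York,United States",
         "NA,California,United States",
         "NA,NA,United States")

# "^(NA,)+" strips one or more leading "NA," runs; an "NA" appearing
# later in the string is left alone
cleaned <- gsub("^(NA,)+", "", loc)
cleaned
# [1] "New York,New York,United States" "California,United States"
# [3] "United States"
```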
Try:
my.data <- read.table(text='LocationList,Identity,Category
"New York,New York,United States","42","S"
"NA,California,United States","89","lyt"
"Hartford,Connecticut,United States","879","polo"
"San Diego,California,United States","45454","utyr"
"Seattle,Washington,United States","uytr","69"
"NA,NA,United States","87","tree"', header=T, sep=",")
my.data$LocationList <- gsub("NA,", "", my.data$LocationList)
my.data
# LocationList Identity Category
# 1 New York,New York,United States 42 S
# 2 California,United States 89 lyt
# 3 Hartford,Connecticut,United States 879 polo
# 4 San Diego,California,United States 45454 utyr
# 5 Seattle,Washington,United States uytr 69
# 6 United States 87 tree
If you get rid of the quotes when you write to a conventional csv file, you will have trouble reading the data in later. This is because you have commas already inside each value in the LocationList variable, so you would have commas both in the middle of fields and marking the break between fields. You might try using write.csv2() instead, which will indicate new fields with a semicolon ;. You could use:
write.csv2(my.data, file="myFile.csv", quote=FALSE, row.names=FALSE)
Which yields the following file:
LocationList;Identity;Category
New York,New York,United States;42;S
California,United States;89;lyt
Hartford,Connecticut,United States;879;polo
San Diego,California,United States;45454;utyr
Seattle,Washington,United States;uytr;69
United States;87;tree
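If the semicolon-separated format is not an option, the other robust route is to keep the quotes: write.csv() quotes character fields by default, so embedded commas survive a round trip (a sketch):

```r
df <- data.frame(LocationList = "California,United States",
                 Identity = "89",
                 stringsAsFactors = FALSE)

tmp <- tempfile(fileext = ".csv")
write.csv(df, tmp, row.names = FALSE)  # quote = TRUE is the default

back <- read.csv(tmp, stringsAsFactors = FALSE)
back$LocationList
# [1] "California,United States"
```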
(I now notice that the values for Identity and Category for row 5 are presumably messed up. You may want to switch those before writing to file.)
x <- my.data[5, 2]
my.data[5, 2] <- my.data[5, 3]
my.data[5, 3] <- x
rm(x)
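The temporary-variable swap can also be written as a single indexed assignment (a sketch on a toy two-column frame):

```r
df <- data.frame(Identity = "uytr", Category = "69", stringsAsFactors = FALSE)

# Swap the two cells of row 1 in one step: the right-hand side's values
# are assigned positionally to the selected columns
df[1, c("Identity", "Category")] <- df[1, c("Category", "Identity")]
df
```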