Formatting Zipcode in R (removing - from zipcode)

I have a dataset of the following:
> head(data, 3)
         city state   zip_code overall_spend
1 MIDDLESBORO    KY      40965   $252,168.12
2  PALM BEACH    FL 33411-3518   $369,240.74
3      CORBIN    KY      40701   $292,496.03
Now I want to format the zip_code values that have an extra part after the -. For example, the second row has 33411-3518; after formatting I want only 33411. How can I do this for the whole zip_code column? Note that zip_code is currently a factor.

Try
data$zip_code <- sub('-.*', '', data$zip_code) # sub() coerces the factor to character
data$zip_code
#[1] "40965" "33411" "40701"
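If every ZIP code is guaranteed to start with the five-digit part, a fixed-width alternative is substr — a sketch assuming that format holds for all rows (for a factor column, wrap it in as.character() first if you want the coercion to be explicit):

```r
zip <- c("40965", "33411-3518", "40701")
substr(zip, 1, 5)  # keep only the first five characters
#[1] "40965" "33411" "40701"
```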

Related

How to find the top n highest values in R

I'm new to R coding and want to find code for this question: display the city name and the total attendance of the five top-attendance stadiums. I have a dataframe worldcupmatches. Please, if anyone can help me out.
Since you have not provided a subset of your data (which is strongly recommended), I will create a tiny dataset with city names and attendance like so:
df = data.frame(city = c("London", "Liverpool", "Manchester", "Birmingham"),
                attendance = c(2390, 1290, 8734, 5433))
Then your problem can easily be solved. For example, one of the base R approaches is:
df[order(df$attendance, decreasing = T), ]
You could also use dplyr which makes things look a little tidier:
library(dplyr)
df %>% arrange(desc(attendance))
Output of both methods is your original data, ordered from the highest to the lowest attendance:
        city attendance
3 Manchester       8734
4 Birmingham       5433
1     London       2390
2  Liverpool       1290
If you specifically want to display a certain number of cities (or stadiums) with the highest attendance, you could do:
df[order(df$attendance, decreasing = T), ][1:3, ] # 1:3 takes the top 3 stadiums
        city attendance
3 Manchester       8734
4 Birmingham       5433
1     London       2390
Again, the dplyr approach looks much cleaner:
df %>% slice_max(n = 3, order_by = attendance)
        city attendance
1 Manchester       8734
2 Birmingham       5433
3     London       2390
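To display only the requested columns for the top rows, slice_max pairs naturally with pull — a sketch on the toy data above (for the real data, use n = 5 and the corresponding worldcupmatches column names):

```r
library(dplyr)

df <- data.frame(city = c("London", "Liverpool", "Manchester", "Birmingham"),
                 attendance = c(2390, 1290, 8734, 5433))

# top 3 by attendance, keeping just the city names
df %>% slice_max(attendance, n = 3) %>% pull(city)
#[1] "Manchester" "Birmingham" "London"
```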

remove list of strings from string column in R

I have a dataframe like so:
df = data.frame('name' = c('California parks', 'bear lake', 'beautiful tree house', 'banana plant'),
                'extract' = c('parks', 'bear', 'tree', 'plant'))
How do I remove the strings of the 'extract' column from the name column to get the following result:
name_new = California, lake, beautiful house, banana
I suspect this demands a combination of str_extract and lapply but can't quite figure it out.
Thanks!
str_remove and str_replace are vectorized over both string and pattern. So, if we have two columns, we can just pass 'name' and 'extract' as the string and pattern to remove the substrings from 'name' elementwise. Once those substrings are removed, there may be leftover spaces before or after, which can be cleaned up with str_replace_all and trimws (to remove the leading/trailing spaces):
library(dplyr)
library(stringr)
df %>%
  mutate(name_new = str_remove(name, extract),
         name_new = str_replace_all(trimws(name_new), "\\s{2,}", " "))
#                  name extract        name_new
#1     California parks   parks      California
#2            bear lake    bear            lake
#3 beautiful tree house    tree beautiful house
#4         banana plant   plant          banana
A base R option using gsub + Vectorize. Note the \\s* quantifiers: with a plain \\s on each side, words at the start or end of the string would never match, because there is no surrounding space there. Replacing the match with a single space and trimming afterwards keeps the remaining words separated:
within(df, name_new <- trimws(Vectorize(gsub)(paste0("\\s*", extract, "\\s*"), " ", name)))
which gives
                  name extract        name_new
1     California parks   parks      California
2            bear lake    bear            lake
3 beautiful tree house    tree beautiful house
4         banana plant   plant          banana
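The same elementwise idea also works in base R with mapply, pairing each pattern with its own row — a sketch equivalent to the Vectorize version:

```r
df <- data.frame(name = c("California parks", "bear lake", "beautiful tree house", "banana plant"),
                 extract = c("parks", "bear", "tree", "plant"))

# \\s* absorbs the spaces around each pattern; replacing with a single
# space and trimming keeps the remaining words separated
df$name_new <- trimws(mapply(function(p, x) gsub(paste0("\\s*", p, "\\s*"), " ", x),
                             df$extract, df$name, USE.NAMES = FALSE))
df$name_new
#[1] "California" "lake" "beautiful house" "banana"
```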

Splitting character object using vector of delimiters

I have a large number of text files. Each file is stored as an observation in a dataframe. Each observation contains multiple fields so there is some structure in each object. I'm looking to split each based on the structured information within each file.
Data is currently in the following structure (simplified):
a <- c("Name: John Doe Age: 50 Address Please give full address 22 Main Street, New York")
b <- c("Name: Jane Bloggs Age: 42 Address Please give full address 1 Lower Street, London")
df <- data.frame(rawtext = c(a,b))
I'd like to split each observation into individual variable columns. It should end up looking like this:
Name Age Address
John Doe 50 22 Main Street, New York
Jane Bloggs 42 1 Lower Street, London
I thought this could be done fairly simply using a pre-defined vector of delimiters, since each text object is structured. I have tried stringr's str_split(), but it doesn't handle a vector of delimiters the way I need. E.g.
delims <- c("Name:", "Age", "Address Please give full address")
str_split(df$rawtext, delims)
I'm perhaps trying to oversimplify here. The only other approach I can think of is to loop through each observation and extract all text after delims[1] and before delims[2] (and so on) for all fields.
e.g. the following bodge would get me the name field based on the delimiters:
sub(paste0(".*", delims[1]), "", df$rawtext[1]) %>% sub(paste0(delims[2], ".*"), "", .)
[1] " John Doe "
This feels extremely inefficient. Is there a better way that I'm missing?
A tidyverse solution:
library(tidyverse)
delims <- c("Name", "Age", "Address Please give full address")
df %>%
  mutate(rawtext = str_remove_all(rawtext, ":")) %>%
  separate(rawtext, c("x", delims), sep = paste(delims, collapse = "|"), convert = T) %>%
  mutate(across(where(is.character), str_squish), x = NULL)
# # A tibble: 2 x 3
#   Name          Age `Address Please give full address`
#   <chr>       <dbl> <chr>
# 1 John Doe       50 22 Main Street, New York
# 2 Jane Bloggs    42 1 Lower Street, London
Note: convert = T in separate() converts Age from character to numeric ignoring leading/trailing whitespaces.
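A base R sketch of the same approach: strip the colons, split on the delimiters themselves, and keep only the values. This assumes no delimiter word occurs inside a value:

```r
delims <- c("Name", "Age", "Address Please give full address")
df <- data.frame(rawtext = c(
  "Name: John Doe Age: 50 Address Please give full address 22 Main Street, New York",
  "Name: Jane Bloggs Age: 42 Address Please give full address 1 Lower Street, London"))

# split each string on any delimiter; the piece before "Name" is empty, so drop it
parts <- strsplit(gsub(":", "", df$rawtext), paste(delims, collapse = "|"))
out <- data.frame(do.call(rbind, lapply(parts, function(x) trimws(x[-1]))))
names(out) <- delims
out
```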

Regex, Separate according to punctuation R?

I know this is a regex question that has probably been answered, but I cannot figure out the answer to this particular one. I have a dataset of 5000 addresses, and some of the addresses are presented as:
199 REEDSDALE ROAD MILTON, MA (42.252352, -71.075213)
2014 WASHINGTON STREET NEWTON, MA (42.332339, -71.246592)
75 FRANCIS STREET BOSTON, MA (42.335954, -71.107661)
235 NORTH PEARL STREET BROCKTON, MA (42.09707, -71.065645)
41 HIGHLAND AVENUE WINCHESTER, MA (42.465496, -71.121408)
The first comma separates the address city from the state, but there are also latitude and longitude coordinates. I am interested in getting the coordinates into two columns, latitude and longitude, as
       lat        lon
 42.252352 -71.075213
 42.332339 -71.246592
 42.335954 -71.107661
  42.09707 -71.065645
 42.465496 -71.121408
Any and all help is appreciated!
One option is to extract the numeric parts with regex lookarounds (a lookbehind for lat, a lookahead for lon):
library(tidyverse)
data_frame(lat = str_extract(lines, "(?<=\\()-?[0-9.]+"),
           lon = str_extract(lines, "-?[0-9.]+(?=\\))"))
# A tibble: 5 x 2
#  lat       lon
#  <chr>     <chr>
#1 42.252352 -71.075213
#2 42.332339 -71.246592
#3 42.335954 -71.107661
#4 42.09707  -71.065645
#5 42.465496 -71.121408
Or with read.csv, after using gsub to remove everything up to and including the opening ( as well as the trailing ), leaving the , as the separator for read.csv to split into two columns:
read.csv(text = gsub("^[^(]+\\(|\\)$", "", lines), header = FALSE,
         col.names = c("lat", "lon"))
#       lat       lon
#1 42.25235 -71.07521
#2 42.33234 -71.24659
#3 42.33595 -71.10766
#4 42.09707 -71.06565
#5 42.46550 -71.12141
data
lines <- readLines("file.txt")
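tidyr::extract is one more option, pulling both capture groups out in a single call — a sketch with two of the sample rows inlined (convert = TRUE turns the new columns numeric):

```r
library(tidyr)

lines <- c("199 REEDSDALE ROAD MILTON, MA (42.252352, -71.075213)",
           "2014 WASHINGTON STREET NEWTON, MA (42.332339, -71.246592)")

# one capture group per output column; the rest of the address is dropped
extract(data.frame(address = lines), address, into = c("lat", "lon"),
        regex = "\\((-?[0-9.]+), (-?[0-9.]+)\\)", convert = TRUE)
```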

Merge dataframes based on regex condition

This problem involves R. I have two dataframes, represented by this minimal reproducible example:
a <- data.frame(geocode_selector = c("36005", "36047", "36061", "36081", "36085"),
                county_name = c("Bronx", "Kings", "New York", "Queens", "Richmond"))
b <- data.frame(geocode = c("360050002001002", "360850323001019"),
                jobs = c("4", "204"))
An example to help communicate the very specific operation I am trying to perform: the geocode_selector column in dataframe a contains the FIPS county codes of the five boroughs of NY. The geocode column in dataframe b is the 15-digit ID of a specific Census block. The first five digits of a geocode match a more general geocode_selector, indicating which county the Census block is located in. I want to add a column to b specifying which county each census block falls under, based on which geocode_selector each geocode in b matches with.
Generally, I'm trying to merge dataframes based on a regex condition. Ideally, I'd like to perform a full merge carrying all of the columns of a over to b and not just the county_name.
I tried something along the lines of:
b[, "county_name"] <- NA
for (i in 1:nrow(b)) {
  for (j in 1:nrow(a)) {
    if (grepl(a$geocode_selector[j], b$geocode[i])) {
      b$county_name[i] <- a$county_name[j]
    }
  }
}
but it took an extremely long time for the large datasets I am actually processing and the finished product was not what I wanted.
Any insight on how to merge dataframes conditionally based on a regex condition would be much appreciated.
You could do this...
b$geocode_selector <- substr(b$geocode,1,5)
b2 <- merge(b, a, all.x=TRUE) #by default it will merge on common column names
b2
  geocode_selector         geocode jobs county_name
1            36005 360050002001002    4       Bronx
2            36085 360850323001019  204    Richmond
If you wish, you can delete the geocode_selector column from b2 with b2[,1] <- NULL
We can use sub to create the 'geocode_selector' and then do the join
library(data.table)
setDT(a)[as.data.table(b)[, geocode_selector := sub('^(.{5}).*', '\\1', geocode)],
         on = .(geocode_selector)]
#   geocode_selector county_name         geocode jobs
#1:            36005       Bronx 360050002001002    4
#2:            36085    Richmond 360850323001019  204
This is a great opportunity to use dplyr. I also tend to like the string handling functions in stringr, such as str_sub.
library(dplyr)
library(stringr)
a <- data_frame(geocode_selector = c("36005", "36047", "36061", "36081", "36085"),
                county_name = c("Bronx", "Kings", "New York", "Queens", "Richmond"))
b <- data_frame(geocode = c("360050002001002", "360850323001019"),
                jobs = c("4", "204"))
b %>%
  mutate(geocode_selector = str_sub(geocode, end = 5)) %>%
  inner_join(a, by = "geocode_selector")
#> # A tibble: 2 x 4
#>   geocode         jobs  geocode_selector county_name
#>   <chr>           <chr> <chr>            <chr>
#> 1 360050002001002 4     36005            Bronx
#> 2 360850323001019 204   36085            Richmond
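If the match genuinely has to stay a regex (say, codes of varying length), the fuzzyjoin package can join on a pattern column directly — a sketch, assuming fuzzyjoin is installed; the ^ anchor keeps each code matching only as a prefix of geocode:

```r
library(fuzzyjoin)

a <- data.frame(geocode_selector = c("36005", "36047", "36061", "36081", "36085"),
                county_name = c("Bronx", "Kings", "New York", "Queens", "Richmond"))
b <- data.frame(geocode = c("360050002001002", "360850323001019"),
                jobs = c("4", "204"))

# in regex_*_join, the second data frame's column supplies the regex;
# anchor it so a code can only match at the start of geocode
a$geocode_selector <- paste0("^", a$geocode_selector)
regex_left_join(b, a, by = c(geocode = "geocode_selector"))
```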
