How to add a new column based on rows with a pattern and a sequence in R?

I have a dataframe with a column that follows a pattern: a row containing a name string, followed by several rows containing names prefixed with a sequence of numbers. This pattern repeats throughout the dataframe.
I want to create a new column based on this condition: whenever a row's string starts with the word "CANTON" (and has no leading number), copy that string minus the first word ("CANTON") down through all the following rows of the new column, until another row starting with "CANTON" appears; from there on, copy the new remainder instead.
An example of the dataframe is the next one:
datos <- data.frame(
  sitio = c("CANTON SAN JOSE", "01 Carmen", "02 Merced",
            "03 Hospital", "04 Catedral", "05 San Franscisco",
            "CANTON ESCAZU", "01 Escazu", "02 San Antonio", "03 San Rafael"),
  area = c(44.62, 1.49, 2.29, 3.38, 2.31, 2.85, 34.49, 4.38, 16.99, 13.22))
datos
And the expected result would be:
expected_result <- data.frame(
  sitio = c("CANTON SAN JOSE", "01 Carmen", "02 Merced",
            "03 Hospital", "04 Catedral", "05 San Franscisco",
            "CANTON ESCAZU", "01 Escazu", "02 San Antonio", "03 San Rafael"),
  area = c(44.62, 1.49, 2.29, 3.38, 2.31, 2.85, 34.49, 4.38, 16.99, 13.22),
  canton = c("SAN JOSE", "SAN JOSE", "SAN JOSE", "SAN JOSE", "SAN JOSE",
             "SAN JOSE", "ESCAZU", "ESCAZU", "ESCAZU", "ESCAZU"))
I have tried many for loops, subsets, and dataframe joins without success; I cannot express this pattern as an instruction in R.
Thanks for any help!

Hope this works for your data:
# strip the "CANTON " prefix everywhere, then blank out the non-CANTON rows
x <- gsub('^CANTON ', '', datos$sitio)
x[!grepl('^CANTON ', datos$sitio)] <- NA
# cumsum(!is.na(x)) numbers each CANTON block; ave() then fills every
# block with its first (non-NA) value
datos$canton <- ave(x, cumsum(!is.na(x)), FUN = function(xx) xx[1])
# > datos
# sitio area canton
# 1 CANTON SAN JOSE 44.62 SAN JOSE
# 2 01 Carmen 1.49 SAN JOSE
# 3 02 Merced 2.29 SAN JOSE
# 4 03 Hospital 3.38 SAN JOSE
# 5 04 Catedral 2.31 SAN JOSE
# 6 05 San Franscisco 2.85 SAN JOSE
# 7 CANTON ESCAZU 34.49 ESCAZU
# 8 01 Escazu 4.38 ESCAZU
# 9 02 San Antonio 16.99 ESCAZU
# 10 03 San Rafael 13.22 ESCAZU
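For reference, the same fill-down logic can be written with dplyr and tidyr; a minimal sketch, assuming a tidyverse workflow is acceptable:
library(dplyr)
library(tidyr)
datos %>%
  # keep the stripped name on "CANTON" rows, NA elsewhere
  mutate(canton = ifelse(grepl("^CANTON ", sitio),
                         sub("^CANTON ", "", sitio), NA_character_)) %>%
  # carry the last non-NA value down through each block
  fill(canton)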

Related

Separating address column into multiple columns in R

I have a dataset that has an address column for 400 records. I would like to split this column into multiple columns.
Sample data
Full_Address = c("1111 Harding St Hollywood, FL 33024",
"2222 W Broward Blvd Plantation, 33317",
"3333 SW 74 Ave Davie, 33314",
"4444 Thomas Street Hollywood, FL 33024",
"11111 Lake Road (SW 12 Street) Davie, 33325",
"555 Bryan Blvd Plantation, 33317",
"5555 NW 71 Ter Parkland, 33067",
"7777 N Oakland Forest Dr Oakland Park, 33309,
"888 Some Ave Pines Pembroke Pines, 33346",
"9999 Some Blvd Hallandale Beach, 33365",
"4440 Some 123 Ave Pompano Beach, 33389")
Desired Columns
ID = c("1111",
"2222",
"3333",
"4444",
"11111",
"555",
"5555",
"7777",
"888",
"9999",
"4440")
Street_Address = c("Harding St",
"W Broward Blvd",
"SW 74 Ave",
"Thomas Street",
"Lake Road (SW 12 Street)",
"Bryan Blvd",
"NW 71 Ter",
"N Oakland Forest Dr",
"Some Ave Pines",
"Some Blvd",
"Some 123 Ave")
City = c("Hollywood",
"Plantation",
"Davie",
"Hollywood",
"Davie",
"Plantation",
"Parkland",
"Oakland Park",
"Pembroke Pines",
"Hallandale Beach",
"Pompano Beach")
Zipcode = c("33024",
"33317",
"33314",
"33024",
"33325",
"33317",
"33067",
"33309",
"33346",
"33365",
"33389")
How can I do this in R via tidyr?
Code
library(tidyverse)
library(tidyr)
df = data.frame(Full_Address)
df = df %>% tidyr::separate(Full_Address,
                            c("ID", "Street_Address", "City", "Zipcode"),
                            sep = , extra = "merge") # stuck at this step.....
Note that this pattern assumes a city name is a single word: multi-word cities like "Oakland Park" or "Pembroke Pines" will not be matched, which is why the output below has only seven rows.
data.frame(Full_Address) %>%
extract(Full_Address, c("ID", "Street_Address", "City", "Zipcode"),
'(\\d+) ([^,]+) (\\w+),\\D+(\\d+)')
ID Street_Address City Zipcode
1 1111 Harding St Hollywood 33024
2 2222 W Broward Blvd Plantation 33317
3 3333 SW 74 Ave Davie 33314
4 4444 Thomas Street Hollywood 33024
5 11111 Lake Road (SW 12 Street) Davie 33325
6 555 Bryan Blvd Plantation 33317
7 5555 NW 71 Ter Parkland 33067
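There is no way to find the street/city boundary from the punctuation alone, but if the city names in your data can be enumerated, one workaround is to build them into the pattern as an alternation; a sketch, assuming the hypothetical cities vector below covers your data:
# hypothetical list of the cities appearing in the data
cities <- c("Hollywood", "Plantation", "Davie", "Parkland", "Oakland Park",
            "Pembroke Pines", "Hallandale Beach", "Pompano Beach")
city_re <- paste(cities, collapse = "|")
data.frame(Full_Address) %>%
  extract(Full_Address, c("ID", "Street_Address", "City", "Zipcode"),
          sprintf("(\\d+) (.*?) (%s),\\D*(\\d+)", city_re))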

I'm trying to use stringr, specifically regex, to cut up "MA: Bristol County (25005)"

I'm trying to take a variable column and cut it up into several columns. The values follow a basic pattern with the county name having a variety of lengths and formats.
State-county :
[1] "MA: Bristol County (25005)"
[2] "LA: St. Tammany Parish (22103)"
[3] "CA: Ventura County (06111)"
[4] "CA: San Mateo County (06081)"
I need state, county name, and county code columns that I can add back into the data.frame. I've been trying to figure out how to use str_extract to accomplish the task.
Ideally, this is where I'd end up, but I'll take any help I can get.
state: county: county code:
[1] "MA" Bristol County 25005
[2] "LA" St. Tammany Parish 22103
[3] "CA" Ventura County 06111
[4] "CA: San Mateo County 06081
I was able to use this code I found, str_extract_all("(?<=\\().+?(?=\\))"), for the county code (thanks Nettle), and the closest I could get to the state abbreviation was
str_extract_all(h, "..:")
which is close but includes the ":".
I also tried: str_extract_all("(?<=\\:")
Sorry if this isn't the best format, I tried to be really clear and in the style I've seen.
Use str_match_all:
str_match_all(df$State_county, "([A-Z]+): ([^()]+) \\((\\d+)\\)")
as_tibble(df) %>%
mutate(matches=str_match_all(State_county, "([A-Z]+): ([^()]+) \\((\\d+)\\)")) %>%
unnest_wider(matches) %>%
select(-2) %>%
set_names("State_county", "State", "County", "ZIP")
# A tibble: 4 x 4
State_county State County ZIP
<fct> <chr> <chr> <chr>
1 MA: Bristol County (25005) MA Bristol County 25005
2 LA: St. Tammany Parish (22103) LA St. Tammany Parish 22103
3 CA: Ventura County (06111) CA Ventura County 06111
4 CA: San Mateo County (06081) CA San Mateo County 06081
### OR with str_match as we're only using a single pattern
## this saves us from the warning caused by unnest_wider
as_tibble(df) %>%
mutate(matches=str_match(State_county, "([A-Z]+): ([^()]+) \\((\\d+)\\)"), State=matches[,2], County=matches[,3], ZIP=matches[,4], matches=NULL)
# A tibble: 4 x 4
State_county State County ZIP
<fct> <chr> <chr> <chr>
1 MA: Bristol County (25005) MA Bristol County 25005
2 LA: St. Tammany Parish (22103) LA St. Tammany Parish 22103
3 CA: Ventura County (06111) CA Ventura County 06111
4 CA: San Mateo County (06081) CA San Mateo County 06081
### Another way
str_match(df$State_county, "([A-Z]+): ([^()]+) \\((\\d+)\\)") %>%
as.data.frame %>% set_names("State_county", "State", "County", "County_code")
State_county State County County_code
1 MA: Bristol County (25005) MA Bristol County 25005
2 LA: St. Tammany Parish (22103) LA St. Tammany Parish 22103
3 CA: Ventura County (06111) CA Ventura County 06111
4 CA: San Mateo County (06081) CA San Mateo County 06081
Explanation:
str_match basically returns the full string that matched the whole pattern, plus the captured groups (the sub-patterns written in non-escaped parentheses, e.g. ([A-Z]+)):
[A-Z]+ : matches the state abbreviation.
[^()]+ : matches anything that is not a parenthesis, i.e. the county.
\\((\\d+)\\) : matches the opening \\( and closing \\) parentheses while capturing the digits between them, i.e. the county code.
str_match(df$State_county, "([A-Z]+): ([^()]+) \\((\\d+)\\)")
[,1] [,2] [,3] [,4]
[1,] "MA: Bristol County (25005)" "MA" "Bristol County" "25005"
[2,] "LA: St. Tammany Parish (22103)" "LA" "St. Tammany Parish" "22103"
[3,] "CA: Ventura County (06111)" "CA" "Ventura County" "06111"
[4,] "CA: San Mateo County (06081)" "CA" "San Mateo County" "06081"
You can use tidyr's extract to get the data into different columns, specifying the regex used to divide the data.
tidyr::extract(df, col,
c('state', 'county', 'county_code'),
'(\\w+):\\s*(.*)\\((\\d+)\\)')
# state county county_code
#1 MA Bristol County 25005
#2 LA St. Tammany Parish 22103
#3 CA Ventura County 06111
#4 CA San Mateo County 06081
We use 3 capture groups to extract the data from the col column.
data
df <- structure(list(col = c("MA: Bristol County (25005)",
"LA: St. Tammany Parish (22103)",
"CA: Ventura County (06111)", "CA: San Mateo County (06081)")),
class = "data.frame", row.names = c(NA, -4L))
Here is an entirely base R approach, which uses strsplit to separate the three components (note that perl = TRUE is needed for the non-capturing group):
output <- apply(df, 1, function(x) { strsplit(x, "(?:: | \\(|\\))", perl = TRUE) })
output <- unlist(output, recursive=FALSE)
names(output) <- c(1:length(output))
df <- as.data.frame(do.call(rbind, output))
names(df) <- c("state", "county", "zip")
df
state county zip
1 MA Bristol County 25005
2 LA St. Tammany Parish 22103
Data:
df <- data.frame(state=c("MA: Bristol County (25005)",
"LA: St. Tammany Parish (22103)"),
stringsAsFactors=FALSE)
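For a single pattern, base R can also mimic str_match via regexec and regmatches; a small sketch using the same regex and the two-row df above:
m <- regexec("([A-Z]+): ([^()]+) \\((\\d+)\\)", df$state)
parts <- do.call(rbind, regmatches(df$state, m))
setNames(as.data.frame(parts), c("State_county", "State", "County", "County_code"))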

Extract cell with AND without commas in R

I'm trying to extract the city and state from the Address column into 2 separate columns labeled City and State in r. This is what my data looks like:
df <- data.frame(address = c("Los Angeles, CA", "Pittsburgh PA", "Miami FL","Baltimore MD", "Philadelphia, PA", "Trenton, NJ")) %>%
separate(address, c("City", "State"), sep=",")
I tried using the separate function but that only gets the ones with commas. Any ideas on how to do this for both cases?
There is a pattern at the end (space, letter, letter) which I could exploit, and then remove any commas, but I'm not sure how the syntax would work using grep.
Starting from your df
df <- data.frame(address = c("Los Angeles, CA", "Pittsburgh PA", "Miami FL","Baltimore MD", "Philadelphia, PA", "Trenton, NJ"))
> df
address
1 Los Angeles, CA
2 Pittsburgh PA
3 Miami FL
4 Baltimore MD
5 Philadelphia, PA
6 Trenton, NJ
It's possible to use gsub to carve up the string like this (drop the last three characters to get the city, then strip any comma):
> city=gsub(',','',gsub("(.*).{3}","\\1",df[,1]))
> city
[1] "Los Angeles" "Pittsburgh" "Miami" "Baltimore" "Philadelphia"
[6] "Trenton"
> state=gsub(".*(\\w{2})","\\1",df[,1])
> state
[1] "CA" "PA" "FL" "MD" "PA" "NJ"
df=data.frame(City=city,State=state)
> df
City State
1 Los Angeles CA
2 Pittsburgh PA
3 Miami FL
4 Baltimore MD
5 Philadelphia PA
6 Trenton NJ
This is a little unorthodox but it works well. It assumes that all state abbreviations are 2 characters long and that there is at least one space between the city and the state; commas are ignored.
df <- data.frame(address = c("Los Angeles, CA", "Pittsburgh PA", "Miami FL","Baltimore MD", "Philadelphia, PA", "Trenton, NJ"))
df$city <- substring(sub(",","",df$address),1,nchar(sub(",","",df$address))-3)
df$state <- substring(as.character(df$address),nchar(as.character(df$address))-1,nchar(as.character(df$address)))
df <- within(df,rm(address))
output:
city state
1 Los Angeles CA
2 Pittsburgh PA
3 Miami FL
4 Baltimore MD
5 Philadelphia PA
6 Trenton NJ
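Since the question started from tidyr::separate, both cases can also be handled there by using a regular-expression separator; a sketch, assuming the state is always the final two capital letters:
library(tidyr)
df <- data.frame(address = c("Los Angeles, CA", "Pittsburgh PA", "Miami FL",
                             "Baltimore MD", "Philadelphia, PA", "Trenton, NJ"))
# split at an optional comma plus whitespace, just before the trailing state
separate(df, address, c("City", "State"), sep = ",?\\s+(?=[A-Z]{2}$)")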

Cleaning and editing a column

I've been trying to figure out how to clean and edit a column in my data set.
The dataset I am using is supposed to only be for the city of San Francisco. A column in the data set called "city" contains multiple different spellings of San Francisco, as well as other cities. Here is what it looks like:
table(sf$city)
Brentwood CA
30401 18 370
DALY CITY FOSTER CITY HAYWARD
0 0 0
Novato Oakland OAKLAND
0 40 0
S F S.F. s.F. Ca
0 31428 12
SAN BRUNO SAN FRANCICSO San Franciisco
0 221 54
san francisco san Francisco San francisco
20 284 0
San Francisco SAN FRANCISCO san Francisco CA
78050 16603 6
San Francisco, San Francisco, Ca San Francisco, CA
12 4 72
San Francisco, CA 94132 San Franciscvo San Francsico
0 0 2
San Franicisco Sand Francisco sf
41 30 17
Sf SF SF , CA
214 81226 1
SF CA 94133 SF, CA SF, CA 94110
0 9 38
SF, CA 94115 SF. SF`
4 1656 31
SO. SAN FRANCISCO SO.S.F.
0 6
What I am trying to do is change sf$city so that every row reads "San Francisco", placing all the data under that one city. Then table(sf$city) would show only San Francisco.
Could I subset? Something like:
sf$city = subset(sf, city == "S.F." & "s.F. Ca" & "SAN FRANCICSO" & ...
And subset all the city variables I want? Or will this distort and mess up my data?
I would try regular expressions with agrep and grep.
Example data:
d <- c("Brentwood", "CA", "DALY CITY", "FOSTER CITY", "HAYWARD", "Novato",
"Oakland", "OAKLAND", "S F", "S.F.", "s.F. Ca", "SAN BRUNO",
"SAN FRANCICSO", "San Franciisco", "san francisco", "san Francisco",
"San francisco", "San Francisco", "SAN FRANCISCO", "san Francisco CA",
"San Francisco,", "San Francisco, Ca", "San Francisco, CA", "San Francisco, CA 94132",
"San Franciscvo", "San Francsico", "San Franicisco", "Sand Francisco",
"sf", "Sf", "SF", "SF , CA", "SF CA", "94133", "SF, CA", "SF, CA 94110",
"SF, CA 94115", "SF.", "SF`", "SO. SAN FRANCISCO", "SO.S.F.")
You can target words like "San Francisco" with agrep; the default max.dist = 0.1 works well enough here. You can then target the S.F. variants using grep:
d[agrep("San Francisco", d, ignore.case = TRUE, max.dist = 0.1)] <- "San Francisco"
d[grep("\\bS[. ]?F\\.?\\b", d, ignore.case = TRUE, perl = TRUE)] <- "San Francisco"
# [1] "Brentwood" "CA" "DALY CITY" "FOSTER CITY"
# [5] "HAYWARD" "Novato" "Oakland" "OAKLAND"
# [9] "San Francisco" "San Francisco" "San Francisco" "SAN BRUNO"
#[13] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[17] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[21] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[25] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[29] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[33] "San Francisco" "94133" "San Francisco" "San Francisco"
#[37] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[41] "San Francisco"
adist is another option for targeting words like "San Francisco". I found the following settings to work well; they even pick up strings like "San Fran":
d[adist("San Francisco", d, ignore.case = TRUE,
cost = c(del = 0.5, ins = 0.5, sub = 3)) < 3] <- "San Francisco"
To riff on @jota's answer, you could also take the resulting data set and run it through the Google Maps API as shown here: https://gist.github.com/josecarlosgonz/6417633
Specifically, using the functions available at that link, you could take the grep() output and run
locations <- ldply(d, function(x) geoCode(x))
head(locations, 10)
Which will give you the following output:
# V1 V2 V3 V4
# 1 36.0331164 -86.7827772 APPROXIMATE Brentwood, TN, USA
# 2 36.778261 -119.4179324 APPROXIMATE California, USA
# 3 37.6879241 -122.4702079 APPROXIMATE Daly City, CA, USA
# 4 37.5585465 -122.2710788 APPROXIMATE Foster City, CA, USA
# 5 37.6688205 -122.0807964 APPROXIMATE Hayward, CA, USA
# 6 38.1074198 -122.5697032 APPROXIMATE Novato, CA, USA
# 7 37.8043637 -122.2711137 APPROXIMATE Oakland, CA, USA
# 8 37.8043637 -122.2711137 APPROXIMATE Oakland, CA, USA
# 9 37.7749295 -122.4194155 APPROXIMATE San Francisco, CA, USA
# 10 37.7749295 -122.4194155 APPROXIMATE San Francisco, CA, USA
As it looks like you know that all of your locations are in CA, you may also want to append a CA to the end of your vector as shown here:
d[grep("CA", d, invert = TRUE)] <- paste0(d[grep("CA", d, invert = TRUE)], ", CA")
locations <- ldply(d, function(x) geoCode(x))
head(locations, 10)
As shown below, this will make sure that Google places Brentwood in CA.
The advantage of this approach is that you will end up with normalized cities in V4, which could be helpful when it comes to filtering and other things.
# V1 V2 V3 V4
# 1 37.931868 -121.6957863 APPROXIMATE Brentwood, CA 94513, USA
# 2 36.778261 -119.4179324 APPROXIMATE California, USA
# 3 37.6879241 -122.4702079 APPROXIMATE Daly City, CA, USA
# 4 37.5585465 -122.2710788 APPROXIMATE Foster City, CA, USA
# 5 37.6688205 -122.0807964 APPROXIMATE Hayward, CA, USA
# 6 38.1074198 -122.5697032 APPROXIMATE Novato, CA, USA
# 7 37.8043637 -122.2711137 APPROXIMATE Oakland, CA, USA
# 8 37.8043637 -122.2711137 APPROXIMATE Oakland, CA, USA
# 9 37.7749295 -122.4194155 APPROXIMATE San Francisco, CA, USA
# 10 37.7749295 -122.4194155 APPROXIMATE San Francisco, CA, USA
NOTE: Google rate-limits its API. If you want to avoid registering for an API key, you will want to chunk the ldply calls into 10-second bites, as suggested in the comment at the GitHub link above.
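A hypothetical chunking loop, assuming geoCode() from the gist above and batches of 10 addresses with a 10-second pause between them:
library(plyr)
chunks <- split(d, ceiling(seq_along(d) / 10))  # batches of 10 addresses
locations <- list()
for (i in seq_along(chunks)) {
  locations[[i]] <- ldply(chunks[[i]], function(x) geoCode(x))
  Sys.sleep(10)  # pause between batches to stay under the rate limit
}
locations <- do.call(rbind, locations)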
To overwrite sf$city to be "San Francisco" for every entry, here is the typical method:
sf$city <- "San Francisco"
However, if some of your observations are not San Francisco and you would like to exclude them, you should drop them first. Here is a start:
# drop non-SF observations
sfReal <- sf[!(tolower(sf$city) %in% c("daly city", "brentwood", "hayward", "oakland")), ]
My geography is not the best, so I may be missing some. Alternatively, you could use %in% to only include those observations that are San Francisco. Given the set you provided above, I doubt this is the case.
In the future, if this is a repeated task, you should look into regular expressions and grep. These are amazing tools that will pay gigantic dividends for string-manipulation tasks; @jota provides a great method for this in the answer above.

Import raw data into R

Can anyone help me import this data into R from a text or .dat file? It is space delimited, but multi-word city names like NEW YORK should not be treated as two separate fields.
1 NEW YORK 7,262,700
2 LOS ANGELES 3,259,340
3 CHICAGO 3,009,530
4 HOUSTON 1,728,910
5 PHILADELPHIA 1,642,900
6 DETROIT 1,086,220
7 SAN DIEGO 1,015,190
8 DALLAS 1,003,520
9 SAN ANTONIO 914,350
10 PHOENIX 894,070
For your particular data frame, where the spaces inside a city name only occur between capital letters, consider using a regular expression:
gsub("([A-Z]) ([A-Z]+)", "\\1-\\2", "1 NEW YORK 7,262,700")
# [1] "1 NEW-YORK 7,262,700"
gsub("([A-Z]) ([A-Z]+)", "\\1-\\2", "3 CHICAGO 3,009,530")
# [1] "3 CHICAGO 3,009,530"
You can then interpret spaces as field separators.
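To finish the idea, a minimal sketch, assuming the listing above is saved as "cities.txt":
x <- readLines("cities.txt")
x <- gsub("([A-Z]) ([A-Z]+)", "\\1-\\2", x)  # glue multi-word city names
out <- read.table(text = x, col.names = c("rank", "city", "population"))
out$city <- gsub("-", " ", out$city)         # restore the spaces
out$population <- as.numeric(gsub(",", "", out$population))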
A variation on a theme... but first, some sample data:
cat("1 NEW YORK 7,262,700",
"2 LOS ANGELES 3,259,340",
"3 CHICAGO 3,009,530",
"4 HOUSTON 1,728,910",
"5 PHILADELPHIA 1,642,900",
"6 DETROIT 1,086,220",
"7 SAN DIEGO 1,015,190",
"8 DALLAS 1,003,520",
"9 SAN ANTONIO 914,350",
"10 PHOENIX 894,070", sep = "\n", file = "test.txt")
Step 1: Read the data in with readLines
x <- readLines("test.txt")
Step 2: Figure out a regular expression that you can use to insert delimiters. Here, the pattern (reading from the end of each line) is a run of digits and commas, preceded by a space, preceded by words in ALL CAPS. We can capture those groups and insert "tab" delimiters (\t). The extra backslashes are there to escape them properly.
gsub("([A-Z ]+)(\\s?[0-9,]+$)", "\\\t\\1\\\t\\2", x)
# [1] "1\t NEW YORK \t7,262,700" "2\t LOS ANGELES \t3,259,340"
# [3] "3\t CHICAGO \t3,009,530" "4\t HOUSTON \t1,728,910"
# [5] "5\t PHILADELPHIA \t1,642,900" "6\t DETROIT \t1,086,220"
# [7] "7\t SAN DIEGO \t1,015,190" "8\t DALLAS \t1,003,520"
# [9] "9\t SAN ANTONIO \t914,350" "10\t PHOENIX \t894,070"
Step 3: Since we know our gsub is working, and we know that read.delim has a "text" argument that can be used instead of a "file" argument, we can use read.delim directly on the result of gsub:
out <- read.delim(text = gsub("([A-Z ]+)(\\s?[0-9,]+$)", "\\\t\\1\\\t\\2", x),
header = FALSE, strip.white = TRUE)
out
# V1 V2 V3
# 1 1 NEW YORK 7,262,700
# 2 2 LOS ANGELES 3,259,340
# 3 3 CHICAGO 3,009,530
# 4 4 HOUSTON 1,728,910
# 5 5 PHILADELPHIA 1,642,900
# 6 6 DETROIT 1,086,220
# 7 7 SAN DIEGO 1,015,190
# 8 8 DALLAS 1,003,520
# 9 9 SAN ANTONIO 914,350
# 10 10 PHOENIX 894,070
One possible last step would be to convert the third column to numeric:
out$V3 <- as.numeric(gsub(",", "", out$V3))
Expanding on @Hugh's answer, I would try the following, although it's not particularly efficient.
lines <- scan("cities.txt", sep="\n", what="character")
lines <- unlist(lapply(lines, function(x) {
gsub(pattern="([a-zA-Z]) ([a-zA-Z]+)", replacement="\\1-\\2", x)
}))
citiesDF <- data.frame(num = rep(0, length(lines)),
city = rep("", length(lines)),
population = rep(0, length(lines)),
stringsAsFactors=FALSE)
for (i in 1:length(lines)) {
splitted = strsplit(lines[i], " +")
citiesDF[i, "num"] <- as.numeric(splitted[[1]][1])
citiesDF[i, "city"] <- gsub("-", " ", splitted[[1]][2])
citiesDF[i, "population"] <- as.numeric(gsub(",", "", splitted[[1]][3]))
}
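The loop can also be vectorized; a sketch of the same idea, reusing the hyphen-glued lines from above:
parts <- do.call(rbind, strsplit(lines, " +"))
citiesDF <- data.frame(num = as.numeric(parts[, 1]),
                       city = gsub("-", " ", parts[, 2]),
                       population = as.numeric(gsub(",", "", parts[, 3])))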
