Separating data within a cell and duplicating row data

I have data that is within one cell, separated by spaces.
For example, there is one column with a city name such as "New York, NY" and another column with the zip codes "12345 67891 23456".
What is a good method for separating this single row so that it becomes three rows, each having "New York, NY" and a single associated zip code?

Try this:
library(dplyr)
library(tidyr)
tibble(city = "New York, NY", zipcodes = "12345 67891 23456") %>%
  mutate(zipcodes = strsplit(zipcodes, "\\s+")) %>%
  unnest(zipcodes)
# # A tibble: 3 x 2
#   city         zipcodes
#   <chr>        <chr>
# 1 New York, NY 12345
# 2 New York, NY 67891
# 3 New York, NY 23456
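A one-step alternative, if you prefer: tidyr's separate_rows() splits a column on a separator and duplicates the other columns in the same call, so it should give the same result as the strsplit-plus-unnest above:
library(dplyr)
library(tidyr)
tibble(city = "New York, NY", zipcodes = "12345 67891 23456") %>%
  separate_rows(zipcodes, sep = "\\s+")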
Base R:
dat <- data.frame(city = "New York, NY", zipcodes = "12345 67891 23456", stringsAsFactors = FALSE)
zips <- strsplit(dat$zipcodes, "\\s+")
# note: `times =`, not `each =`, so this generalizes to multiple input rows
data.frame(city = rep(dat$city, times = lengths(zips)), zipcode = unlist(zips))
# city zipcode
# 1 New York, NY 12345
# 2 New York, NY 67891
# 3 New York, NY 23456
One premise of this answer is that the zip codes are separated by one or more whitespace characters (space, tab, etc.). If the codes themselves contain legitimate spaces (true in many countries), then #ThomasIsCoding's approach may be a better start, in that it attempts to extract the specific elements. Both will fail where zip codes are alphanumeric and contain a space; for instance, the UK has BS2 0JA as a postal code. In that case, you'll need a lot more logic to extract them safely.
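For illustration only: one way to start on space-containing codes is to match a known shape instead of splitting on whitespace. The regex below is a rough sketch of the UK postcode format, not a complete validator:
s <- "BS2 0JA SW1A 1AA EC1A 1BB"
# simplified UK shape: outward code, space, inward code (an assumption, not exhaustive)
regmatches(s, gregexpr("[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][A-Z]{2}", s))[[1]]
# [1] "BS2 0JA"  "SW1A 1AA" "EC1A 1BB"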

If you are using base R, do you mean this kind of output?
s <- "New York, NY 12345 67891 23456"
data.frame(addr = paste0(gsub("(.*?\\s)\\d.*","\\1",s), unlist(regmatches(s,gregexpr("\\d+",s)))))
yielding
addr
1 New York, NY 12345
2 New York, NY 67891
3 New York, NY 23456
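This handles a single string; here is a sketch of wrapping the same idea over a whole column, assuming every value follows the same city-then-digits layout:
rows <- c("New York, NY 12345 67891", "Boston, MA 02101")
do.call(rbind, lapply(rows, function(s) {
  city <- gsub("(.*?\\s)\\d.*", "\\1", s)  # keep everything before the first zip
  data.frame(addr = paste0(city, regmatches(s, gregexpr("\\d+", s))[[1]]))
}))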

Related

Combine every two rows of data in R

I have a csv file that I have read in, but I now need to combine every two rows. There is a total of 2000 rows, and I need to reduce that to 1000. Every two rows share the same account number in one column and have the address split across two rows in another; two rows are taken up for each observation, and I want to combine the two address rows into one. For example, rows 1 and 2 are Acct# 1234 and have 123 Hollywood Blvd and LA California 90028 on their own lines respectively.
Using the tidyverse, you can group_by the Acct number and summarise with str_c:
library(tidyverse)
df %>%
  group_by(Acct) %>%
  summarise(Address = str_c(Address, collapse = " "))
# A tibble: 2 × 2
   Acct Address
  <dbl> <chr>
1  1234 123 Hollywood Blvd LA California 90028
2  4321 55 Park Avenue NY New York State 6666
Data:
df <- data.frame(
  Acct = c(1234, 1234, 4321, 4321),
  Address = c("123 Hollywood Blvd", "LA California 90028",
              "55 Park Avenue", "NY New York State 6666")
)
It can be fairly simple with data.table package:
# assuming `dataset` is the name of your dataset, column with account number is called 'actN' and column with adress is 'adr'
library(data.table)
dataset2 <- data.table(dataset)[, .(whole = paste0(adr, collapse = ", ")), by = .(actN)]
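Applied to the df from the tidyverse answer above, the same idea looks like this (a sketch; the collapse separator is a space here to match that expected output):
library(data.table)
data.table(df)[, .(Address = paste0(Address, collapse = " ")), by = .(Acct)]
# should match the group_by()/summarise() result above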

Replacing NAs in a dataframe based on a partial string match (in another dataframe) in R

Goal: To change a column of NAs in one dataframe based on a "key" in another dataframe (something like a VLookUp, except only in R)
Given df1 here (For Simplicity's sake, I just have 6 rows. The key I have is 50 rows for 50 states):
Index
State_Name
Abbreviation
1
California
CA
2
Maryland
MD
3
New York
NY
4
Texas
TX
5
Virginia
VA
6
Washington
WA
And given df2 here (this is just an example; the real dataframe I'm working with has a lot more rows):
Index  State  Article
1      NA     Texas governor, Abbott, signs new abortion bill
2      NA     Effort to recall California governor Newsome loses steam
3      NA     New York governor, Cuomo, accused of manipulating Covid-19 nursing home data
4      NA     Hogan (Maryland, R) announces plans to lift statewide Covid restrictions
5      NA     DC statehood unlikely as Manchin opposes
6      NA     Amazon HQ2 causing housing prices to soar in northern Virginia
Task: To create an R function that loops through and reads the state in each df2$Article row, cross-references it with df1$State_Name, and replaces the NA in df2$State with the respective df1$Abbreviation based on the state found in df2$Article. I know it's quite a mouthful. I'm stuck on how to start and finish this puzzle. Hard-coding is not an option, as the real datasheet has thousands of rows like this and will update as we add more articles to text-scrape.
The output should look like:
Index  State  Article
1      TX     Texas governor, Abbott, signs new abortion bill
2      CA     Effort to recall California governor Newsome loses steam
3      NY     New York governor, Cuomo, accused of manipulating Covid-19 nursing home data
4      MD     Hogan (Maryland, R) announces plans to lift statewide Covid restrictions
5      NA     DC statehood unlikely as Manchin opposes
6      VA     Amazon HQ2 causing housing prices to soar in northern Virginia
Note: The fifth entry with DC is intended to be NA.
Any links to guides, and/or any advice on how to code this is most appreciated. Thank you!
You can create a regex pattern from State_Name and use str_extract to extract it from Article, then use match to get the corresponding Abbreviation from df1.
library(stringr)
df2$State <- df1$Abbreviation[match(str_extract(df2$Article,
               str_c(df1$State_Name, collapse = '|')), df1$State_Name)]
df2$State
#[1] "TX" "CA" "NY" "MD" NA "VA"
You can also use inbuilt state.name and state.abb instead of df1 to get state name and abbreviations.
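For example (a sketch using those built-in vectors in place of df1):
df2$State <- state.abb[match(str_extract(df2$Article,
                                         str_c(state.name, collapse = '|')),
                             state.name)]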
Here's a way to do this in for loop -
for(i in seq(nrow(df1))) {
  inds <- grep(df1$State_Name[i], df2$Article)
  if(length(inds)) df2$State[inds] <- df1$Abbreviation[i]
}
df2
# Index State Article
#1 1 TX Texas governor, Abbott, signs new abortion bill
#2 2 CA Effort to recall California governor Newsome loses steam
#3 3 NY New York governor, Cuomo, accused of manipulating Covid-19 nursing home data
#4 4 MD Hogan (Maryland, R) announces plans to lift statewide Covid restrictions
#5 5 <NA> DC statehood unlikely as Manchin opposes
#6 6 VA Amazon HQ2 causing housing prices to soar in northern Virginia
Not as concise as above but a Base R approach:
# Unlist handling 0 length vectors: list_2_vec => function()
list_2_vec <- function(lst){
  # Coerce 0 length vectors to NA values of the appropriate type:
  # .zero_to_nas => function()
  .zero_to_nas <- function(x){
    if(identical(x, character(0))){
      NA_character_
    }else if(identical(x, integer(0))){
      NA_integer_
    }else if(identical(x, numeric(0))){
      NA_real_
    }else if(identical(x, complex(0))){
      NA_complex_
    }else if(identical(x, logical(0))){
      NA
    }else{
      x
    }
  }
  # Unlist cleaned list: res => vector
  res <- unlist(lapply(lst, .zero_to_nas))
  # Explicitly define return object: vector => GlobalEnv()
  return(res)
}
# Classify each article as belonging to the appropriate state:
# clean_df => data.frame
clean_df <- transform(
  df2,
  State = df1$Abbreviation[
    match(
      list_2_vec(
        regmatches(
          Article,
          gregexpr(
            paste0(df1$State_Name, collapse = "|"), Article
          )
        )
      ),
      df1$State_Name
    )
  ]
)
# Data:
df1 <- structure(list(Index = 1:6, State_Name = c("California", "Maryland",
"New York", "Texas", "Virginia", "Washington"), Abbreviation = c("CA",
"MD", "NY", "TX", "VA", "WA")), class = "data.frame", row.names = c(NA, -6L))
df2 <- structure(list(Index = 1:6, State = c(NA, NA, NA, NA, NA, NA),
Article = c("Texas governor, Abbott, signs new abortion bill",
"Effort to recall California governor Newsome loses steam",
"New York governor, Cuomo, accused of manipulating Covid-19 nursing home data",
"Hogan (Maryland, R) announces plans to lift statewide Covid restrictions",
"DC statehood unlikely as Manchin opposes", "Amazon HQ2 causing housing prices to soar in northern Virginia"
)), class = "data.frame", row.names = c(NA, -6L))

Checking to see if strings in one column matches the abbreviated form of the strings in another column

I have a large data frame "df" with 2 columns:
column1                             column2
The City of New York                TCNY
The Land of the Free                TLF
Stellar Stars Basketball Program    SSBP
Center for Life Sciences            CLS
Children's Hospital of Los Angeles  CHLA
New York Yankees                    NY
etc                                 etc
I've done some research and saw that you could use mapply to apply a function to two columns at the same time, but I'm uncertain what function to use. I was thinking of something that checks all the capital letters in the strings of column1 and whether those capital letters exist in column2, but I'm really unsure how. Any help would be great! Thank you so much!
Here's an example of what I think you might be trying to achieve (on a subset of the rows you've shown in your question):
df <- data.frame(
  col_1 = c("The City of New York", "The Land of the Free", "New York Yankees"),
  col_2 = c("TCNY", "TLF", "NY")
)
> df
col_1 col_2
1 The City of New York TCNY
2 The Land of the Free TLF
3 New York Yankees NY
# Add a third column indicating whether the capitalised letters of the first
# column are equal to the strings in the second
df$col_3 <- unlist(apply(df, 1, function(x) gsub("[^A-Z]", "", x[1]) == x[2]))
> df
col_1 col_2 col_3
1 The City of New York TCNY TRUE
2 The Land of the Free TLF TRUE
3 New York Yankees NY FALSE
Above I'm using gsub to remove any characters that aren't upper case from the first column values, then comparing them to the second column in an apply statement, which is operating on each row of the dataframe. Then I'm using unlist to convert the result from a list to a vector, which can be stored in the third column of the dataframe df.
Using base R (the inline x <- assignment stores the computed abbreviation vector so it can be reused by the check column):
transform(dat, correctABBV = x <- gsub("[^A-Z]", "", column1), check = x == column2)
column1 column2 correctABBV check
1 The City of New York TCNY TCNY TRUE
2 The Land of the Free TLF TLF TRUE
3 Stellar Stars Basketball Program SSBP SSBP TRUE
4 Center for Life Sciences CLS CLS TRUE
5 Children's Hospital of Los Angeles CHLA CHLA TRUE
6 New York Yankees NY NYY FALSE
Here is one approach for you. I was not sure if you wanted to keep "etc" as an abbreviation or not; at the moment, I treat it as one. First, I wanted to create abbreviations based on the first column. I checked how many words exist in each string using stri_count(). When the logical condition is TRUE, I used gsub() to extract capital letters; when it is FALSE, I carried the elements of mycol1 over to abb. Finally, I checked whether the elements in abb and mycol2 are the same and created check.
mydf <- data.frame(mycol1 = c("The City of New York", "The Land of the Free", "Stellar Stars Basketball Program",
                              "Center for Life Sciences", "Children's Hospital of Los Angeles", "New York Yankees", "etc"),
                   mycol2 = c("TCNY", "TLF", "SSBP", "CLS", "CHLA", "NY", "etc"),
                   stringsAsFactors = FALSE)
library(dplyr)
library(stringi)
mutate(mydf,
       abb = if_else(stri_count(mycol1, regex = "\\w+") > 1,
                     gsub(x = mycol1, pattern = "[^A-Z]", replacement = ""),
                     mycol1),
       check = abb == mycol2)
mycol1 mycol2 abb check
1 The City of New York TCNY TCNY TRUE
2 The Land of the Free TLF TLF TRUE
3 Stellar Stars Basketball Program SSBP SSBP TRUE
4 Center for Life Sciences CLS CLS TRUE
5 Children's Hospital of Los Angeles CHLA CHLA TRUE
6 New York Yankees NY NYY FALSE
7 etc etc etc TRUE

Purrr-Fection: In Search of An Elegant Solution to Conditional Data Frame Operations Leveraging Purrr

The Background
I have an issue for which a number of solution pathways are possible, but I am convinced there is an as-yet-undiscovered elegant solution leveraging purrr.
The Example Code
I have a large data frame as follows, for which I have included an example below:
library(tibble)
library(ggmap)
library(purrr)
library(dplyr)
# Define Example Data
df <- frame_data(
~Street, ~City, ~State, ~Zip, ~lon, ~lat,
"226 W 46th St", "New York", "New York", 10036, -73.9867, 40.75902,
"5th Ave", "New York", "New York", 10022, NA, NA,
"75 Broadway", "New York", "New York", 10006, -74.01205, 40.70814,
"350 5th Ave", "New York", "New York", 10118, -73.98566, 40.74871,
"20 Sagamore Hill Rd", "Oyster Bay", "New York", 11771, NA, NA,
"45 Rockefeller Plaza", "New York", "New York", 10111, -73.97771, 40.75915
)
The Challenge
I would like to geotag all locations for which the lon and lat columns are currently NA. There are many ways I could go about this, one of which is shown below:
# Safe Code is Great Code
safe_geocode <- safely(geocode)
# Identify Data to be Geotagged by Absence of lon and lat
data_to_be_geotagged <- df %>% filter(is.na(lon) | is.na(lat))
# GeoTag Addresses of Missing Data Points
fullAddress <- paste(data_to_be_geotagged$Street,
data_to_be_geotagged$City,
data_to_be_geotagged$State,
data_to_be_geotagged$Zip,
sep = ", ")
fullAddress %>%
map(safe_geocode) %>%
map("result") %>%
plyr::ldply()
The Question
While I can get the above to work, and even wrangle the newly identified lon and lat coordinates back into the original data frame, the whole scheme feels dirty. I am convinced there is an elegant way to leverage piping and purrr to go through the data-frame and conditionally geotag the locations based on the absence of lon and lat.
I have been down a number of rabbit holes including purrr::pmap in an attempt to walk through multiple columns in parallel when constructing the full address (As well as rowwise() and by_row()). Nevertheless, I fall short in constructing anything that would qualify as an elegant solution.
Any insight provided would be most appreciated.
Really, you want to avoid calling geocode any more than necessary because it's slow and if you're using Google, you only have 2500 queries per day. Thus, it's best to make both columns from the same call, which can be done with a list column, making a new version of the data.frame with do, or a self-join.
1. With a list column
With a list column, you make a new version of lon and lat with ifelse, geocoding if there are NAs, else just copying the existing values. Afterwards, get rid of the old versions of the columns and unnest the new ones:
library(dplyr)
library(ggmap)
library(tidyr) # For `unnest`
# Evaluate each row separately
df %>% rowwise() %>%
  # Add a list column. If lon or lat are NA,
  mutate(data = ifelse(any(is.na(c(lon, lat))),
                       # return a data.frame of the geocoded results,
                       list(geocode(paste(Street, City, State, Zip))),
                       # else return a data.frame of existing columns.
                       list(data_frame(lon = lon, lat = lat)))) %>%
  # Remove old columns
  select(-lon, -lat) %>%
  # Unnest newly created ones from list column
  unnest(data)
## # A tibble: 6 × 6
## Street City State Zip lon lat
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 226 W 46th St New York New York 10036 -73.98670 40.75902
## 2 5th Ave New York New York 10022 -73.97491 40.76167
## 3 75 Broadway New York New York 10006 -74.01205 40.70814
## 4 350 5th Ave New York New York 10118 -73.98566 40.74871
## 5 20 Sagamore Hill Rd Oyster Bay New York 11771 -73.50538 40.88259
## 6 45 Rockefeller Plaza New York New York 10111 -73.97771 40.75915
2. With do
do, on the other hand, creates a wholly new data.frame from pieces of the old one. It requires slightly clunky $ notation, with . to represent the grouped data.frame piped in. Using if and else instead of ifelse lets you avoid nesting results in lists (which they had to be above, anyway).
# Evaluate each row separately
df %>% rowwise() %>%
  # Make a new data.frame from the first four columns and the geocode results or existing lon/lat
  do(bind_cols(.[1:4], if(any(is.na(c(.$lon, .$lat)))){
    geocode(paste(.[1:4], collapse = ' '))
  } else {
    .[5:6]
  }))
which returns exactly the same thing as the first version.
3. On a subset, recombining with a self-join
If the ifelse is overly confusing, you can just geocode a subset and then recombine by binding the rows to the anti_join, i.e. all the rows that are in df but not the subset .:
df %>% filter(is.na(lon) | is.na(lat)) %>%
select(1:4) %>%
bind_cols(geocode(paste(.$Street, .$City, .$State, .$Zip))) %>%
bind_rows(anti_join(df, ., by = c('Street', 'Zip')))
which returns the same thing, but with the newly geocoded rows at the top. The same approach works with a list column or do, but since there's no need to combine two sets of columns, just bind_cols will do the trick.
4. On a subset with mutate_geocode
ggmap actually includes a mutate_geocode function that will add lon and lat columns when passed a data.frame and a column of addresses. It has one limitation: it can't accept anything more than a column name for the address, and thus requires a single column containing the entire address. So while this version could be quite nice, it requires creating and then deleting an extra column holding the whole address, making it less concise:
df %>% filter(is.na(lon) | is.na(lat)) %>%
select(1:4) %>%
mutate(address = paste(Street, City, State, Zip)) %>% # make an address column
mutate_geocode(address) %>%
select(-address) %>% # get rid of address column
bind_rows(anti_join(df, ., by = c('Street', 'Zip')))
## Street City State Zip lon lat
## 1 5th Ave New York New York 10022 -73.97491 40.76167
## 2 20 Sagamore Hill Rd Oyster Bay New York 11771 -73.50538 40.88259
## 3 45 Rockefeller Plaza New York New York 10111 -73.97771 40.75915
## 4 350 5th Ave New York New York 10118 -73.98566 40.74871
## 5 75 Broadway New York New York 10006 -74.01205 40.70814
## 6 226 W 46th St New York New York 10036 -73.98670 40.75902
5. Base R
Base R can assign to a subset directly, which makes the idiom here much simpler, even if it requires a lot of subsetting:
missing <- is.na(df$lon) | is.na(df$lat)
df[missing, c('lon', 'lat')] <- geocode(paste(df$Street, df$City, df$State, df$Zip)[missing])
Results are the same as the first version.
All versions only call geocode twice.
Note that while you could use purrr for the job, it's not particularly better suited than regular dplyr. purrr excels at dealing with lists, and while a list column is one option, it doesn't really have to be manipulated.
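That said, if you want a purrr flavor anyway, here is a sketch with pmap_dfr() (recent purrr), rebuilding the frame row by row and geocoding only when coordinates are missing. It assumes the libraries loaded in the question and is untested against the live geocoder:
df %>%
  pmap_dfr(function(Street, City, State, Zip, lon, lat) {
    coords <- if (any(is.na(c(lon, lat)))) {
      geocode(paste(Street, City, State, Zip))  # query only when needed
    } else {
      tibble(lon = lon, lat = lat)              # keep existing coordinates
    }
    bind_cols(tibble(Street, City, State, Zip), coords)
  })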
I'm not sure about purrr, but here's the following using the pipe:
df <- frame_data(
~Street, ~City, ~State, ~Zip, ~lon, ~lat,
"226 W 46th St", "New York", "New York", 10036, -73.9867, 40.75902,
"5th Ave", "New York", "New York", 10022, NA, NA,
"75 Broadway", "New York", "New York", 10006, -74.01205, 40.70814,
"350 5th Ave", "New York", "New York", 10118, -73.98566, 40.74871,
"20 Sagamore Hill Rd", "Oyster Bay", "New York", 11771, NA, NA,
"45 Rockefeller Plaza", "New York", "New York", 10111, -73.97771, 40.75915
)
df2 <- df %>%
  filter(is.na(lon) | is.na(lat)) %>%
  group_by(Street, City, State) %>% # not really necessary but it suppresses a warning
  mutate(lon = ifelse(is.na(lon) | is.na(lat),
                      geocode(paste(Street, City, State, sep = " ")), 0)) %>%
  mutate(lat = ifelse(is.na(lon) | is.na(lat),
                      rev(geocode(paste(Street, City, State, sep = " "))), 0))
If you want the partial output like in your example code above:
as.data.frame(df2)[,5:6]
lon lat
1 40.77505 -73.96515
2 40.88259 -73.50538
Or include all columns:
as.data.frame(df2)
Street City State Zip lon lat
1 5th Ave New York New York 10022 40.77505 -73.96515
2 20 Sagamore Hill Rd Oyster Bay New York 11771 40.88259 -73.50538
And if you want to combine your original data with the new data you can do the following:
as.data.frame(rbind(filter(df, !is.na(lon) | !is.na(lat)),df2 ))
Street City State Zip lon lat
1 226 W 46th St New York New York 10036 -73.98670 40.75902
2 75 Broadway New York New York 10006 -74.01205 40.70814
3 350 5th Ave New York New York 10118 -73.98566 40.74871
4 45 Rockefeller Plaza New York New York 10111 -73.97771 40.75915
5 5th Ave New York New York 10022 40.77505 -73.96515
6 20 Sagamore Hill Rd Oyster Bay New York 11771 -73.96515 40.77505
...Or you can streamline it all into one pipeline as follows (keeps the original order):
df2 <- df %>%
  #group_by(Street, City, State) %>% # uncomment if you want to suppress the warning
  mutate(lon = ifelse(is.na(lon) | is.na(lat),
                      geocode(paste(Street, City, State, sep = " ")), lon)) %>%
  mutate(lat = ifelse(is.na(lon) | is.na(lat),
                      rev(geocode(paste(Street, City, State, sep = " "))), lat))
as.data.frame(df2)
Street City State Zip lon lat
1 226 W 46th St New York New York 10036 -73.98670 40.75902
2 5th Ave New York New York 10022 -73.98670 40.75902
3 75 Broadway New York New York 10006 -74.01205 40.70814
4 350 5th Ave New York New York 10118 -73.98566 40.74871
5 20 Sagamore Hill Rd Oyster Bay New York 11771 40.75902 -73.98670
6 45 Rockefeller Plaza New York New York 10111 -73.97771 40.75915
Using dplyr:
df %>% mutate(lon = case_when(is.na(lon) ~ geocode(paste(Street, City, State, Zip))[, 1],
                              TRUE ~ lon),
              lat = case_when(is.na(lat) ~ geocode(paste(Street, City, State, Zip))[, 2],
                              TRUE ~ lat))
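One caveat with this: case_when() evaluates every right-hand side for all rows, so the version above geocodes all six addresses twice. A sketch that queries only the missing rows, once (missing and coords are scratch names introduced here):
missing <- is.na(df$lon) | is.na(df$lat)
coords  <- geocode(paste(df$Street, df$City, df$State, df$Zip)[missing])
df %>% mutate(lon = replace(lon, missing, coords$lon),
              lat = replace(lat, missing, coords$lat))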

Extract cities from each row in excel and export to its respective row using R

I have extracted tweets in .csv format and the data looks like this:
(row 1) The latest The Admin Resources Daily! Thanks to #officerenegade #roberthalf #elliottdotorg #airfare #jobsearch
(row 2) RT #airfarewatchdog: Los Angeles #LAX to Cabo #SJD $312 nonstop on #AmericanAir for summer travel. #airfare
(row 3) RT #TheFlightDeal: #Airfare Deal: [AA] New York - Mexico City, Mexico. $270 r/t. Details:
(row 4) The latest The Nasir Muhammad Daily! Thanks to #Matt_Revel #Roddee #JaeKay #lefforum #airfare
(row 5) RT #BarefootNomads: So cool! <U+2708> <U+2764><U+FE0F> #airfare deals w #Skyscanner Facebook Messenger Bot #traveldeals #cheapflights ht…
(row 6) Flights to #Oranjestad #Aruba are £169 for a 15 day trip departing Tue, Jun 7th. #airfare via #hitlist_app"
I have used an NLP technique to extract city names from the tweets, but the output is a single list of cities, one city per row. It simply identifies all the city names and lists them, losing track of which tweet each came from.
Output:
1 Los Angeles
2 New York
3 Mexico City
4 Mexico
5 Tue
6 London
7 New York
8 Fort Lauderdale
9 Los Angeles
10 Paris
I want the output to be something like:
1 Los Angeles Cabo (from the first tweet in row 2)
2 New York Mexico City Mexico (from the second tweet in row 3)
Code:
#Named Entity Recognition (NER)
bio <- readLines("C:\\xyz\\tweets.csv")
print(bio)
install.packages(c("NLP", "openNLP", "RWeka", "qdap"))
install.packages("openNLPmodels.en",
repos = "http://datacube.wu.ac.at/",
type = "source")
library(NLP)
library(openNLP)
library(RWeka)
library(qdap)
library(openNLPmodels.en)
library(magrittr)
bio <- as.String(bio)
word_ann <- Maxent_Word_Token_Annotator()
sent_ann <- Maxent_Sent_Token_Annotator()
bio_annotations <- annotate(bio, list(sent_ann, word_ann))
class(bio_annotations)
head(bio_annotations)
bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)
sents(bio_doc) %>% head(2)
words(bio_doc) %>% head(10)
location_ann <- Maxent_Entity_Annotator(kind = "location")
pipeline <- list(sent_ann,
                 word_ann,
                 location_ann)
bio_annotations <- annotate(bio, pipeline)
bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)
entities <- function(doc, kind) {
  s <- doc$content
  a <- annotations(doc)[[1]]
  if(hasArg(kind)) {
    k <- sapply(a$features, `[[`, "kind")
    s[a[k == kind]]
  } else {
    s[a[a$type == "entity"]]
  }
}
entities(bio_doc, kind = "location")
cities <- entities(bio_doc, kind = "location")
library(xlsx)
write.xlsx(cities, "C:\\xyz\\xyz.xlsx")
Also, is there a way to further separate the cities into origins and destinations, i.e. by classifying cities before 'to' or '-' as origins and the rest as destinations?
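On the follow-up: annotating each tweet separately (e.g. lapply over the readLines() output instead of collapsing everything with as.String()) would keep the cities grouped per tweet. And if the tweets reliably put 'to' or '-' between origin and destination, here is a rough sketch of the split; the delimiter rule is an assumption about your data, not a general solution:
tweets <- c("Los Angeles #LAX to Cabo #SJD $312 nonstop",
            "New York - Mexico City, Mexico. $270 r/t.")
parts  <- strsplit(tweets, "\\s+(to|-)\\s+")  # split on ' to ' or ' - '
origin <- sapply(parts, `[`, 1)
dest   <- sapply(parts, function(p) paste(p[-1], collapse = " "))
data.frame(origin, dest)
# the NER step could then be run on origin and dest separately to pull out just the city names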