Splitting character object using vector of delimiters - r

I have a large number of text files. Each file is stored as an observation in a dataframe. Each observation contains multiple fields so there is some structure in each object. I'm looking to split each based on the structured information within each file.
Data is currently in the following structure (simplified):
a <- c("Name: John Doe Age: 50 Address Please give full address 22 Main Street, New York")
b <- c("Name: Jane Bloggs Age: 42 Address Please give full address 1 Lower Street, London")
df <- data.frame(rawtext = c(a,b))
I'd like to split each observation into individual variable columns. It should end up looking like this:
Name Age Address
John Doe 50 22 Main Street, New York
Jane Bloggs 42 1 Lower Street, London
I thought that this could be done fairly simply using a pre-defined vector of delimiters since each text object is structured. I have tried using stringr and str_split() but this doesn't handle the vector input. e.g.
delims <- c("Name:", "Age", "Address Please give full address")
str_split(df$rawtext, delims)
I'm perhaps trying to oversimplify here. The only other approach I can think of is to loop through each observation and extract all text after delims[1] and before delims[2] (and so on) for all fields.
e.g. the following bodge would get me the name field based on the delimiters:
sub(paste0(".*", delims[1]), "", df$rawtext[1]) %>% sub(paste0(delims[2], ".*"), "", .)
[1] " John Doe "
This feels extremely inefficient. Is there a better way that I'm missing?

A tidyverse solution:
library(tidyverse)
delims <- c("Name", "Age", "Address Please give full address")
df %>%
mutate(rawtext = str_remove_all(rawtext, ":")) %>%
separate(rawtext, c("x", delims), sep = paste(delims, collapse = "|"), convert = T) %>%
mutate(across(where(is.character), str_squish), x = NULL)
# # A tibble: 2 x 3
# Name Age `Address Please give full address`
# <chr> <dbl> <chr>
# 1 John Doe 50 22 Main Street, New York
# 2 Jane Bloggs 42 1 Lower Street, London
Note: convert = T in separate() runs type conversion on the new columns, so Age is converted from character to numeric; leading and trailing whitespace is ignored.
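For comparison, the same idea can be sketched in base R. This is only an illustration under stated assumptions: split_by_delims is a hypothetical helper name, every delimiter is assumed to appear exactly once and in the given order in each string, and the colons are folded into the delimiters instead of being removed first.

```r
# Hypothetical helper (not part of the answer above): split each string on
# any of the delimiters and return one column per field.
split_by_delims <- function(x, delims, col_names) {
  pattern <- paste(delims, collapse = "|")
  parts <- strsplit(x, pattern)
  # Every string starts with a delimiter, so the first piece is "" and is dropped.
  fields <- lapply(parts, function(p) trimws(p[-1]))
  out <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
  names(out) <- col_names
  out
}

a <- "Name: John Doe Age: 50 Address Please give full address 22 Main Street, New York"
b <- "Name: Jane Bloggs Age: 42 Address Please give full address 1 Lower Street, London"
delims <- c("Name:", "Age:", "Address Please give full address")

res <- split_by_delims(c(a, b), delims, c("Name", "Age", "Address"))
res$Age <- as.integer(res$Age)  # fields come back as character
res
```

If a delimiter could be missing from some records, the pieces would no longer line up with the column names, so this sketch would need a length check before binding the rows.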

Related

extracting data from a data frame row, performing internal lookup, and restructuring into long format

I've been beating my head against this for awhile and was hoping for some suggestions. I'm trying to extract semicolon delimited text from a row in a data frame, performing an internal lookup on a string in that row based on the extracted values, and then outputting that (along with another extracted variable) into a long format...and then repeating for every row in the data frame. I can do the first and last manipulations with str_split, and I think i could just loop everything with apply, but the internal lookup (join?) has me tied in knots. I'd like to imagine that i could do this w/ dplyr but
Starting with a data frame:
name<-"Adam, B.C.; Dave, E.F.; Gerald, H."
school<-"[Adam, B.C.; Gerald, H.]U.Penn; [Dave, E.F.]U.Georgia"
index<-12345
foo<-data.frame(name,school,index)
foo
name school index
1 Adam, B.C.; Dave, E.F.; Gerald, H. [Adam, B.C.; Gerald, H.]U.Penn; [Dave, E.F.]U.Georgia 12345
Desired output:
name school index
Adam, B.C. U.Penn 12345
Dave, E.F. U.Georgia 12345
Gerald, H. U.Penn 12345
etc. etc. etc.
thanks!
A mixture of tidyr::separate() and tidyr::separate_rows() could do the trick:
library(tidyverse)
foo |>
tidyr::separate_rows(school, sep = "\\[", convert = T) |>
tidyr::separate(col = school, into = c("name", "school"), sep = "]") |>
tidyr::separate_rows(name, sep = ";", convert = T) |>
slice(-1) |>
mutate(across(everything(), trimws)) |>
mutate(across(everything(), str_remove, ";" ))
Output:
# A tibble: 3 x 3
index name school
<chr> <chr> <chr>
1 12345 Adam, B.C. U.Penn
2 12345 Gerald, H. U.Penn
3 12345 Dave, E.F. U.Georgia
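If the original order of the names matters (the desired output lists Adam, Dave, Gerald), one option is to build an explicit lookup table first. The following base R sketch handles a single row and assumes that every school entry has the form "[names]school" and that names are separated by semicolons; the variable names groups, members, schools, and lookup are illustrative only.

```r
name   <- "Adam, B.C.; Dave, E.F.; Gerald, H."
school <- "[Adam, B.C.; Gerald, H.]U.Penn; [Dave, E.F.]U.Georgia"
index  <- 12345

# Pull out each "[names]school" group from the school string.
groups <- regmatches(school, gregexpr("\\[[^]]*\\][^;]*", school))[[1]]

# Names inside the brackets, and the school that follows them.
members <- strsplit(sub("^\\[([^]]*)\\].*$", "\\1", groups), ";\\s*")
schools <- trimws(sub("^\\[[^]]*\\]", "", groups))

# Named vector: person -> school.
lookup <- setNames(rep(schools, lengths(members)), trimws(unlist(members)))

# Look up each person in the order given by the name column.
names_vec <- trimws(strsplit(name, ";")[[1]])
result <- data.frame(
  name   = names_vec,
  school = unname(lookup[names_vec]),
  index  = index
)
result
```

For a data frame with many rows, the same steps could be wrapped in a function and applied row by row.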

Joining two dataframes on a condition (grepl)

I'm looking to join two dataframes based on a condition, in this case, that one string is inside another. Say I have two dataframes,
df1 <- data.frame(fullnames=c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.", "Bill Lee III", "Ms. Kate Smith"),
ages = c(30, 51, 45, 38, 20))
fullnames ages
1 Jane Doe 30
2 Mr. John Smith 51
3 Nate Cox, Esq. 45
4 Bill Lee III 38
5 Ms. Kate Smith 20
df2 <- data.frame(lastnames=c("Doe", "Cox", "Smith", "Jung", "Smith", "Lee"),
ages=c(30, 45, 20, 28, 51, 38),
homestate=c("NJ", "CT", "MA", "RI", "MA", "NY"))
lastnames ages homestate
1 Doe 30 NJ
2 Cox 45 CT
3 Smith 20 MA
4 Jung 28 RI
5 Smith 51 MA
6 Lee 38 NY
I want to do a left join on these two dataframes on ages and the row in which df2$lastnames is contained within df1$fullnames. I thought fuzzy_join might do it, but I don't think it liked my grepl:
joined_dfs <- fuzzy_join(df1, df2, by = c("ages", "fullnames"="lastnames"),
+ match_fun = c("=", "grepl()"),
+ mode="left")
Error in which(m) : argument to 'which' is not logical
Desired result: a dataframe identical to the first but with a "homestate" column appended. Any ideas?
TLDR
You just need to fix match_fun:
# ...
match_fun = list(`==`, stringr::str_detect),
# ...
Background
You had the right idea, but you went wrong in your interpretation of the match_fun parameter in fuzzyjoin::fuzzy_join(). Per the documentation, match_fun should be a
Vectorized function given two columns, returning TRUE or FALSE as to whether they are a match. Can be a list of functions one for each pair of columns specified in by (if a named list, it uses the names in x). If only one function is given it is used on all column pairs.
Solution
A simple correction will do the trick, with further formatting by dplyr. For conceptual clarity, I've typographically aligned the by columns with the functions used to match them:
library(dplyr)
# ...
# Existing code
# ...
joined_dfs <- fuzzy_join(
df1, df2,
by = c("ages", "fullnames" = "lastnames"),
# |----| |-----------------------|
match_fun = list(`==` , stringr::str_detect ),
# |--| |-----------------|
# Match by equality ^ ^ Match by detection of `lastnames` in `fullnames`
mode = "left"
) %>%
# Format resulting dataset as you requested.
select(fullnames, ages = ages.x, homestate)
Result
Given your sample data reproduced here
df1 <- data.frame(
fullnames = c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.", "Bill Lee III", "Ms. Kate Smith"),
ages = c(30, 51, 45, 38, 20)
)
df2 <- data.frame(
lastnames = c("Doe", "Cox", "Smith", "Jung", "Smith", "Lee"),
ages = c(30, 45, 20, 28, 51, 38),
homestate = c("NJ", "CT", "MA", "RI", "MA", "NY")
)
this solution should produce the following data.frame for joined_dfs, formatted as requested:
fullnames ages homestate
1 Jane Doe 30 NJ
2 Mr. John Smith 51 MA
3 Nate Cox, Esq. 45 CT
4 Bill Lee III 38 NY
5 Ms. Kate Smith 20 MA
Note
Because each ages is coincidentally a unique key, the following join on only *names
fuzzy_join(
df1, df2,
by = c("fullnames" = "lastnames"),
match_fun = stringr::str_detect,
mode = "left"
)
will better illustrate the behavior of matching on substrings:
fullnames ages.x lastnames ages.y homestate
1 Jane Doe 30 Doe 30 NJ
2 Mr. John Smith 51 Smith 20 MA
3 Mr. John Smith 51 Smith 51 MA
4 Nate Cox, Esq. 45 Cox 45 CT
5 Bill Lee III 38 Lee 38 NY
6 Ms. Kate Smith 20 Smith 20 MA
7 Ms. Kate Smith 20 Smith 51 MA
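To see where the duplicated Smith rows come from, it can help to spell the matching out by hand. The base R sketch below is a conceptual illustration, not how fuzzyjoin is necessarily implemented: it tests every pair of rows with outer(), using a hypothetical helper contains() for literal substring matching, and keeps only matched pairs (so unmatched left rows, of which there are none here, would be dropped).

```r
df1 <- data.frame(
  fullnames = c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.", "Bill Lee III", "Ms. Kate Smith"),
  ages = c(30, 51, 45, 38, 20)
)
df2 <- data.frame(
  lastnames = c("Doe", "Cox", "Smith", "Jung", "Smith", "Lee"),
  ages = c(30, 45, 20, 28, 51, 38),
  homestate = c("NJ", "CT", "MA", "RI", "MA", "NY")
)

# Hypothetical helper: is last[i] contained (literally) in full[i]?
contains <- function(full, last) {
  unname(mapply(function(f, l) grepl(l, f, fixed = TRUE), full, last))
}

# One logical matrix per join condition: rows index df1, columns index df2.
name_hits <- outer(df1$fullnames, df2$lastnames, contains)
age_hits  <- outer(df1$ages, df2$ages, "==")

sum(name_hits)             # pairs matched on names only
sum(name_hits & age_hits)  # pairs that also agree on age

# Assemble the matched pairs, ordered by the row of df1.
idx <- which(name_hits & age_hits, arr.ind = TRUE)
idx <- idx[order(idx[, "row"]), , drop = FALSE]
joined <- data.frame(
  fullnames = df1$fullnames[idx[, "row"]],
  ages      = df1$ages[idx[, "row"]],
  homestate = df2$homestate[idx[, "col"]]
)
joined
```

The names-only condition pairs each Smith in df1 with both Smith rows in df2; requiring equal ages removes the two spurious pairs.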
Where You Went Wrong
Error in Type
The value passed to match_fun should be either (the symbol for) a function
fuzzyjoin::fuzzy_join(
# ...
match_fun = grepl
# ...
)
or a list of such (symbols for) functions:
fuzzyjoin::fuzzy_join(
# ...
match_fun = list(`=`, grepl)
# ...
)
Instead of providing a list of symbols
match_fun = list(`=`, grepl)
you incorrectly provided a vector of character strings:
match_fun = c("=", "grepl()")
Error in Syntax
The user should name the functions
`=`
grepl
yet you incorrectly attempted to call them:
=
grepl()
Naming them will pass the functions themselves to match_fun, as intended, whereas calling them will pass their return values*. In R, an operator like = is named using backticks: `=`.
* Assuming the calls didn't fail with errors. Here, they would fail.
Inappropriate Functions
To compare two values for equality, here the numeric vectors df1$ages and df2$ages, you should use the relational operator ==; yet you incorrectly supplied the assignment operator =.
Furthermore grepl() is not vectorized in quite the way match_fun desires. While its second argument (x) is indeed a vector
a character vector where matches are sought, or an object which can be coerced by as.character to a character vector. Long vectors are supported.
its first argument (pattern) is (treated as) a single character string:
character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector. Coerced by as.character to a character string if possible. If a character vector of length 2 or more is supplied, the first element is used with a warning. Missing values are allowed except for regexpr, gregexpr and regexec.
Thus, grepl() is not a
Vectorized function given two columns...
but rather a function given one string (scalar) and one column (vector) of strings.
The answer to your prayers is not grepl() but rather something like stringr::str_detect(), which is
Vectorised over string and pattern. Equivalent to grepl(pattern, x).
and which wraps stringi::stri_detect().
Note
Since you're simply trying to detect whether a literal string in df1$fullnames contains a literal string in df2$lastnames, you don't want to accidentally treat the strings in df2$lastnames as regular expression patterns. Your df2$lastnames column is unlikely to contain many names with special regex characters. The most common candidate is -, which is interpreted literally outside of [], and square brackets are very unlikely to be found in a name.
If you're still worried about accidental regex, you might want to consider alternative search methods with stringi::stri_detect_fixed() or stringi::stri_detect_coll(). These perform literal matching, respectively by either byte or "canonical equivalence"; the latter adjusts for locale and special characters, in keeping with natural language processing.
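The difference between regex and literal matching can be shown with a small base R sketch. The names below are made up for illustration, and grepl(..., fixed = TRUE) stands in here for the stringi functions mentioned above.

```r
# Hypothetical names: the pattern contains ".", a regex metacharacter.
full <- c("Mary St.John", "Mary StXJohn")
last <- "St.John"

as_regex   <- grepl(last, full)                # "." matches any character
as_literal <- grepl(last, full, fixed = TRUE)  # "." must be a literal dot

as_regex
as_literal
```

Under regex matching both strings are reported as containing the last name, whereas literal matching accepts only the first.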
This seems to work given your two dataframes:
Edited as per comment by @Greg:
The code is adapted to the data as posted; if your actual data contains more variants, especially in last names (such as not only III but also IV), feel free to adapt the code accordingly:
library(dplyr)
df1 %>%
mutate(
# create new column that gets rid of strings after last name:
lastnames = sub("\\sI{1,3}$|,.+$", "", fullnames),
# grab last names:
lastnames = sub(".*?(\\w+)$", "\\1", lastnames)) %>%
# join the two dataframes:
left_join(., df2, by = c("lastnames", "ages"))
fullnames ages lastnames homestate
1 Jane Doe 30 Doe NJ
2 Mr. John Smith 51 Smith MA
3 Nate Cox, Esq. 45 Cox CT
4 Bill Lee III 38 Lee NY
5 Ms. Kate Smith 20 Smith MA
If you want lastnames removed, just append this after %>%:
select(-lastnames)
EDIT #2:
If you don't trust the above solution given massive variation in how last names are actually noted, then of course fuzzy_join is an option too. BUT, the current fuzzy_join solution is not enough; it needs to be amended by one critical data transformation. This is because str_detect detects whether a string is contained within another string. That is, it will return TRUE if it compares, for example, Smith to Smithsonian or to Hammer-Smith - each time the string Smith is indeed contained in the longer names. If, as will likely be the case in a large dataset, Smith and Smithsonian happen to have the same ages the mismatch will be perfect: fuzzy_join will incorrectly join the two. The same problem arises when you have, e.g., Smith and Smith-Klein of the same age: there too fuzzy_join will join them.
The first problematic case can be resolved by wrapping each name in df2 in word boundary anchors \\b. These assert that, for example, Smith must have a word boundary on both sides. Smithsonian fails that test: it has a boundary to the left of its S, but none directly after Smith, because the next boundary only comes after its last letter n. Note that Hammer-Smith still matches, because a hyphen is a non-word character and therefore creates a boundary before Smith. The second set of problematic cases can be addressed by including a negative lookahead after \\b, namely \\b(?!-), which asserts that after the word boundary there must not be a hyphen.
The solution is easily implemented with mutate and paste0 like so:
library(dplyr)
library(stringr)
library(fuzzyjoin)
fuzzy_join(
df1, df2 %>%
mutate(lastnames = paste0("\\b", lastnames, "\\b(?!-)")),
by = c("ages", "fullnames" = "lastnames"),
match_fun = list(`==`, str_detect),
mode = "left"
) %>%
select(fullnames, ages = ages.x, homestate)
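The effect of the anchors can be checked in isolation. The sketch below uses base R's grepl() with perl = TRUE as a stand-in for str_detect(), on the assumption that both regex engines treat \\b and lookarounds the same way for these inputs. The candidate names are invented, and the stricter pattern with a negative lookbehind (?<!-) is an extra suggestion, not part of the answer above.

```r
candidates <- c("Mr. John Smith", "Ann Smithsonian", "Bob Hammer-Smith", "Eve Smith-Klein")

plain    <- "Smith"
bounded  <- paste0("\\b", "Smith", "\\b(?!-)")         # pattern built as in the answer
stricter <- paste0("(?<!-)\\b", "Smith", "\\b(?!-)")   # also rejects a leading hyphen

hits_plain    <- grepl(plain, candidates, perl = TRUE)
hits_bounded  <- grepl(bounded, candidates, perl = TRUE)
hits_stricter <- grepl(stricter, candidates, perl = TRUE)

data.frame(candidates, hits_plain, hits_bounded, hits_stricter)
```

The bounded pattern rejects Smithsonian and Smith-Klein but still accepts Hammer-Smith; adding the lookbehind excludes that case as well.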

Add value in one column based on multiple key words in another column in r

I want to do the following things: if key words "GARAGE", "PARKING", "LOT" exist in column "Name" then I would add value "Parking&Garage" into column "Type".
Here is the dataset:
df<-data.frame(Name=c("GARAGE 1","GARAGE 2", "101 GARAGE","PARKING LOT","CENTRAL PARKING","SCHOOL PARKING 1","CITY HALL"))
The following code works well for me, but is there a neater way to make it shorter? Thanks!
df$Type[grepl("GARAGE", df$Name) |
grepl("PARKING", df$Name) |
grepl("LOT", df$Name)]<-"Parking&Garage"
The regex "or" operator | is your friend here:
df$Type[grepl("GARAGE|PARKING|LOT", df$Name)]<-"Parking&Garage"
You can create a list of keywords to change, create a pattern dynamically and replace the values.
keywords <- c('GARAGE', 'PARKING', 'LOT')
df$Type <- NA
df$Type[grep(paste0(keywords, collapse = '|'), df$Name)] <- "Parking&Garage"
df
# Name Type
#1 GARAGE 1 Parking&Garage
#2 GARAGE 2 Parking&Garage
#3 101 GARAGE Parking&Garage
#4 PARKING LOT Parking&Garage
#5 CENTRAL PARKING Parking&Garage
#6 SCHOOL PARKING 1 Parking&Garage
#7 CITY HALL <NA>
This would be helpful if you need to add more keywords to your list later.
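One thing to watch with a dynamically built pattern is that short keywords such as LOT also match inside longer words. A possible refinement, sketched here with made-up place names, is to wrap the alternation in word boundaries; whether that is desirable depends on the real data.

```r
keywords <- c("GARAGE", "PARKING", "LOT")
places <- c("PARKING LOT", "CHARLOTTE HALL", "101 GARAGE")  # hypothetical examples

loose  <- paste(keywords, collapse = "|")
strict <- paste0("\\b(", loose, ")\\b")  # keywords must be whole words

loose_hits  <- grepl(loose, places)
strict_hits <- grepl(strict, places)

data.frame(places, loose_hits, strict_hits)
```

The loose pattern flags CHARLOTTE HALL because it contains LOT, while the bounded pattern does not.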
An alternative with the dplyr and stringr packages:
library(stringr)
library(dplyr)
df %>%
dplyr::mutate(TYPE = stringr::str_detect(Name, "GARAGE|PARKING|LOT"),
TYPE = ifelse(TYPE == TRUE, "Parking&Garage", NA_character_))

Extracting substring by positions in pipe

I would like to extract a substring from every row of the id column of a tibble. I am always interested in the region between the 1st and 3rd space of the original id. The resulting substrings, here Zoe Boston and Jane Rome, would go into a new column, name.
I tried to get the positions of "spaces" in every id with str_locate_all and then use positions to use str_sub. However I cannot extract the positions correctly.
data <- tibble(id = c("#1265746 Zoe Boston 58962 st. Victory cont_1.0)", "#958463279246 Jane Rome 874593.01 musician band: XYZ 985147") ) %>%
mutate(coor = str_locate_all(id, "\\s"),
name = str_sub(id, start = coor[[1]], end = coor[[3]] ) )
You can use regex to extract what you want.
Assuming you have stored your tibble in data, you can use sub to extract 1st and 2nd word.
sub('^#\\w+\\s(\\w+\\s\\w+).*', '\\1', data$id)
#[1] "Zoe Boston" "Jane Rome"
^# - starts with hash
\\w+ - A word
\\s - Whitespace
( - start of capture group
\\w+ - A word
followed by \\s - whitespace
\\w+ - another word
) - end of capture group.
.* - remaining string.
The str_locate_all approach is more complex: it returns the positions of every whitespace character, so you need to take the position just after the 1st whitespace and just before the 3rd, and then use str_sub to extract the text between those positions.
library(dplyr)
library(stringr)
library(purrr)
data %>%
mutate(coor = str_locate_all(id, "\\s"),
start = map_dbl(coor, `[`, 1) + 1,
end = map_dbl(coor, `[`, 3) - 1,
name = str_sub(id, start, end))
# A tibble: 2 x 2
# id name
# <chr> <chr>
#1 #1265746 Zoe Boston 58962 st. Victory cont_1.0) Zoe Boston
#2 #958463279246 Jane Rome 874593.01 musician band: XYZ 985147 Jane Rome
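The same position-based logic can be sketched in base R with gregexpr() and substr(), which may make the start and end arithmetic easier to follow. It assumes, like the code above, that every id contains at least three whitespace characters.

```r
id <- c("#1265746 Zoe Boston 58962 st. Victory cont_1.0)",
        "#958463279246 Jane Rome 874593.01 musician band: XYZ 985147")

# Positions of every whitespace character, one integer vector per id.
spaces <- gregexpr("\\s", id)

# One character after the 1st space, one character before the 3rd space.
start <- vapply(spaces, function(p) p[1], integer(1)) + 1L
end   <- vapply(spaces, function(p) p[3], integer(1)) - 1L

name <- substr(id, start, end)
name
```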
Another possible solution using stringr and purrr packages
library(stringr)
library(purrr)
library(dplyr)
data %>%
mutate(name = map_chr(str_split(id, " "), ~paste(unlist(.)[2:3], collapse = " ")))
Explanation:
in str_split(id, " ") we create a list of the terms that are separated inside id by a whitespace
map_chr is useful to take each one of these lists, and apply the following function to them: unlist the list, take the elements in positions 2 and 3 (which are the name we want) and then collapse them with a whitespace between them
Output
# A tibble: 2 x 2
# id name
# <chr> <chr>
# 1 #1265746 Zoe Boston 58962 st. Victory cont_1.0) Zoe Boston
# 2 #958463279246 Jane Rome 874593.01 musician band: XYZ 985147 Jane Rome
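For readers without purrr, the split-and-paste step described in the explanation can be sketched with base R's strsplit() and vapply(). It relies on the same assumption as the code above, namely that the fields are separated by single spaces.

```r
id <- c("#1265746 Zoe Boston 58962 st. Victory cont_1.0)",
        "#958463279246 Jane Rome 874593.01 musician band: XYZ 985147")

# Split each id on spaces, keep the 2nd and 3rd pieces, and glue them together.
words <- strsplit(id, " ", fixed = TRUE)
name <- vapply(words, function(w) paste(w[2:3], collapse = " "), character(1))
name
```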

How do I extract part of the data from a column and paste it in another column using R?

I want to extract part of the data from a column and paste it in another column using R:
My data looks like this:
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NULL,+3579514862,NULL,+5554848123)
data <- data.frame(names,country,mobile)
data
> data
names country mobile
1 Sia London +1234567890 NULL
2 Ryan Paris +3579514862
3 J Sydney +0123458796 NULL
4 Ricky Delhi +5554848123
I would like to separate phone number from country column wherever applicable and put it into mobile.
The output should be,
> data
names country mobile
1 Sia London +1234567890
2 Ryan Paris +3579514862
3 J Sydney +0123458796
4 Ricky Delhi +5554848123
You can use separate() from tidyr, which is loaded with the tidyverse.
Note that I use NA rather than NULL inside the mobile vector, because c() silently drops NULL and would leave mobile with only two elements.
Also, I use the option, stringsAsFactors = F when creating the dataframe to avoid working with factors.
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NA, "+3579514862", NA, "+5554848123")
data <- data.frame(names,country,mobile, stringsAsFactors = F)
library(tidyverse)
data %>% as_tibble() %>%
separate(country, c("country", "number"), sep = " ", fill = "right") %>%
mutate(mobile = coalesce(mobile, number)) %>%
select(-number)
# A tibble: 4 x 3
names country mobile
<chr> <chr> <chr>
1 Sia London +1234567890
2 Ryan Paris +3579514862
3 J Sydney +0123458796
4 Ricky Delhi +5554848123
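Splitting on a space works for the sample data but would break for place names that contain spaces. A possible alternative, sketched below in base R, is to match only a trailing phone number. The pattern assumes a number is a plus sign followed by digits at the very end of the string, which is a guess based on the sample data.

```r
names <- c("Sia", "Ryan", "J", "Ricky")
country <- c("London +1234567890", "Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NA, "+3579514862", NA, "+5554848123")
data <- data.frame(names, country, mobile, stringsAsFactors = FALSE)

# Assumed format: whitespace, then "+" and digits, at the end of the string.
has_number <- grepl("\\s\\+[0-9]+$", data$country)
number <- ifelse(has_number,
                 sub("^.*\\s(\\+[0-9]+)$", "\\1", data$country),
                 NA_character_)

# Fill missing mobiles from the extracted number, then clean the country column.
data$mobile <- ifelse(is.na(data$mobile), number, data$mobile)
data$country <- sub("\\s*\\+[0-9]+$", "", data$country)
data
```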
EDIT
If you want to do this without the pipes (which I would not recommend because the code becomes much harder to read) you can do this:
select(mutate(separate(as_tibble(data), country, c("country", "number"), sep = " ", fill = "right"), mobile = coalesce(mobile, number)), -number)
