Splitting a string a few characters after the delimiter - r

I have a large data set of names and states that I need to split. After splitting, I want to create new rows with each name and state. My data strings are in multiple lines that look like this
"Peter Johnson, IN Chet Charles, TX Ed Walsh, AZ"
"Ralph Hogan, TX, Michael Johnson, FL"
I need the data to look like this
attr name state
1 Peter Johnson IN
2 Chet Charles TX
3 Ed Walsh AZ
4 Ralph Hogan TX
5 Michael Johnson FL
I can't figure out how to do this, perhaps split it somehow a few characters after the comma? Any help would be greatly appreciated.

If the input is a vector of strings (one per line), we can insert a delimiter after each state code with gsub, split the strings on it with strsplit, build a data.frame from each element of the resulting list, and rbind the pieces together.
d1 <- do.call(rbind, lapply(strsplit(gsub("([A-Z]{2})(\\s+|,)",
"\\1;", lines), "[,;]"), function(x) {
x1 <- trimws(x)
data.frame(name = x1[c(TRUE, FALSE)],state = x1[c(FALSE, TRUE)]) }))
cbind(attr = seq_len(nrow(d1)), d1)
# attr name state
#1 1 Peter Johnson IN
#2 2 Chet Charles TX
#3 3 Ed Walsh AZ
#4 4 Ralph Hogan TX
#5 5 Michael Johnson FL
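To see why this works, it can help to run the steps one at a time. The following is a minimal sketch using the same regular expression on the two sample strings; the variable names marked and pieces are only illustrative. It assumes that the only place where two capital letters are followed by whitespace or a comma is a state code, so an all-caps surname would break it.

```r
lines <- c("Peter Johnson, IN Chet Charles, TX Ed Walsh, AZ",
           "Ralph Hogan, TX, Michael Johnson, FL")

# Step 1: replace the whitespace or comma that follows a state code with ";"
marked <- gsub("([A-Z]{2})(\\s+|,)", "\\1;", lines)
marked
# [1] "Peter Johnson, IN;Chet Charles, TX;Ed Walsh, AZ"
# [2] "Ralph Hogan, TX; Michael Johnson, FL"

# Step 2: split on "," or ";" and trim; names and states now alternate
pieces <- lapply(strsplit(marked, "[,;]"), trimws)
pieces[[2]]
# [1] "Ralph Hogan"     "TX"              "Michael Johnson" "FL"
```

Because names and states alternate, the logical indices c(TRUE, FALSE) and c(FALSE, TRUE) pick out the odd and even elements respectively.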
Or this can be done in a compact way
library(data.table)
fread(paste(gsub("([A-Z]{2})(\\s+|,)", "\\1\n", lines), collapse="\n"),
col.names = c("names", "state"), header = FALSE)[, attr := 1:.N][]
# names state attr
#1: Peter Johnson IN 1
#2: Chet Charles TX 2
#3: Ed Walsh AZ 3
#4: Ralph Hogan TX 4
#5: Michael Johnson FL 5
data
lines <- readLines(textConnection("Peter Johnson, IN Chet Charles, TX Ed Walsh, AZ
Ralph Hogan, TX, Michael Johnson, FL"))

Related

Extracting best guess first and last names from a string

I have a set of names that looks as such:
names <- structure(list(name = c('Michael Smith ♕',
'Scott Lewis - Realtor',
'Erin Hopkins Ŧ',
'Katie Parsons | Denver',
'Madison Hollins Taylor',
'Kevin D. Williams',
'|Ryan Farmer|',
'l a u r e n t h o m a s',
'Dave Goodwin💦',
'Candice Harper Makeup Artist',
'dani longfeld // millenialmodels',
'Madison Jantzen | DALLAS, TX',
'Rachel Wallace Perkins',
'Kayla Wright Photography',
'Scott Green Jr.')), class = "data.frame", row.names = c(NA, -15L))
In addition to getting first and last name extracted from each of these, for ones like Rachel Wallace Perkins and Madison Hollins Taylor, I'd like to create one to multiple extracts since we don't really know which is their true last name. The final output would look something like this:
names_revised <- structure(list(name = c('Michael Smith',
'Scott Lewis',
'Erin Hopkins',
'Katie Parsons',
'Madison Hollins',
'Madison Taylor',
'Kevin Williams',
'Ryan Farmer',
'Lauren Thomas',
'Dave Goodwin',
'Candice Harper',
'Dani Longfeld',
'Madison Jantzen',
'Rachel Wallace',
'Rachel Perkins',
'Kayla Wright',
'Scott Green')), class = "data.frame", row.names = c(NA, -17L))
Based on some previous answers, I attempted to do (using the tidyr package):
names_extract <- tidyr::extract(names, name, c("FirstName", "LastName"), "([^ ]+) (.*)")
But that doesn't seem to do the trick, as the output it produces looks as such:
FirstName LastName
1 Michael Smith ♕
2 Scott Lewis - Realtor
3 Erin Hopkins Ŧ
4 Katie Parsons | Denver
5 Madison Hollins Taylor
6 Kevin D. Williams
7 |Ryan Farmer|
8 l a u r e n t h o m a s
9 Dave Goodwin💦
10 Candice Harper Makeup Artist
11 dani longfeld // millenialmodels
12 Madison Jantzen | DALLAS, TX
13 Rachel Wallace Perkins
14 Kayla Wright Photography
15 Scott Green Jr.
I know there are a ton of little edge cases that make this difficult, but overall, what would be the best approach for handling this that would capture the most results I'm trying for?
This fixes most of the rows.
library(dplyr)
library(tidyr)
names %>%
mutate(name2 = sub("^[[:punct:]]", "", name) %>%
sub(" \\w[.] ", " ", .) %>%
sub("[[:punct:]]+ *[^[:punct:]]*$", "", .) %>%
sub("\\W+[[:upper:]]+$", "", .) %>%
trimws) %>%
separate(name2, c("First", "Last"), extra = "merge")
giving:
name First Last
1 Michael Smith ♕ Michael Smith
2 Scott Lewis - Realtor Scott Lewis
3 Erin Hopkins Ŧ Erin Hopkins
4 Katie Parsons | Denver Katie Parsons
5 Madison Hollins Taylor Madison Hollins Taylor
6 Kevin D. Williams Kevin Williams
7 |Ryan Farmer| Ryan Farmer
8 l a u r e n t h o m a s l a u r e n t h o m a s
9 Dave Goodwin💦 Dave Goodwin
10 Candice Harper Makeup Artist Candice Harper Makeup Artist
11 dani longfeld // millenialmodels dani longfeld
12 Madison Jantzen | DALLAS, TX Madison Jantzen
13 Rachel Wallace Perkins Rachel Wallace Perkins
14 Kayla Wright Photography Kayla Wright Photography
15 Scott Green Jr. Scott Green Jr
Here's a first go at cleaning the data - (much) more will be needed to obtain perfect data:
library(dplyr)
library(stringr)
names %>%
mutate(name = str_extract(name, "[\\w\\s.]+\\w"))
name
1 Michael Smith
2 Scott Lewis
3 Erin Hopkins
4 Katie Parsons
5 Madison Hollins Taylor
6 Kevin D. Williams
7 Ryan Farmer
8 l a u r e n t h o m a s
9 Dave Goodwin
10 Candice Harper Makeup Artist
11 dani longfeld
12 Madison Jantzen
13 Rachel Wallace Perkins
14 Kayla Wright Photography
15 Scott Green Jr
Here we use str_extract, which extracts only the first match in each string. That is convenient because most of the characters you want to remove sit at the right end of the string. The character class [\\w\\s.]+ matches one or more word characters (letters, digits, underscore), whitespace characters, or dots. It is followed by \\w, i.e., a single word character, so that the extracted part cannot end in whitespace or a dot. As said, this is just a first go, but the data is already much tidier.
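For readers without stringr, roughly the same extraction can be sketched in base R with regexpr() and regmatches(). This is an illustration on three ASCII-only rows, not the answer's code, and how \\w treats non-ASCII letters may differ between regex engines.

```r
x <- c("Scott Lewis - Realtor", "|Ryan Farmer|", "Scott Green Jr.")

# regexpr() finds the first match per string; regmatches() extracts it
m <- regexpr("[\\w\\s.]+\\w", x, perl = TRUE)
cleaned <- regmatches(x, m)
cleaned
# [1] "Scott Lewis"    "Ryan Farmer"    "Scott Green Jr"
```

Note that regmatches() silently drops strings with no match, so on real data the result can be shorter than the input.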

regex - Find match for " C " but not "J.C." in R

The Setup:
I am using a regular expression to organize baseball lineups into a data frame.
LINEUPS <- c('OF Andrew Johnson P Victor Bailey OF Walter Hill 2B Carl Smith 3B Brian Rivera P Joseph Cox 1B Steven Parker SS William Gonzales OF Christopher Taylor C David Washington
',
'SS J.C. Roberts P Dennis Flores OF Jason Torres 2B Jack Rodriguez OF Randy Baker P Edward Anderson C David Washington 3B Thomas Wilson OF Ryan Walker 1B Robert Harris Jr
',
'1B J.P. Allen P Philip Hernandez OF Ryan Walker OF Christopher Taylor 2B Jack Rodriguez C Russell James 3B Brian Rivera P Joseph Cox OF Andrew Johnson SS Ralph Martinez
')
mm <- gregexpr("\\b(P|C|OF|SS|1B|2B|3B)\\b", LINEUPS)
players <- do.call("rbind", unname(Map(function(x, m, i) {
pstart <- m
pend <- pstart + attr(m, "match.length")
hstart <- pend + 1
hend <- c(tail(pstart,-1)-1, nchar(x))
data.frame(game=i, pos=substring(x, pstart, pend), name=substring(x, hstart, hend))
}, LINEUPS, mm, seq_along(LINEUPS))))
players$pos <- sub("^\\s|\\s+$","", players$pos)
players$name <- sub("^\\s|\\s+$","", players$name)
library(dplyr)
library(tidyr)
players <- players %>%
group_by(game, pos) %>%
mutate(pos=if_else(rep(n(),n())>1, paste0(pos, row_number()), pos)) %>%
pivot_wider(game, names_from=pos, values_from=name)
The Problem:
When the player's name includes initials that also happen to match one of the positions, I run into problems. In the example above: SS J.C. Roberts matches the position C and 1B J.P. Allen matches the position P, causing the string to be split incorrectly.
The Question:
How do I modify the current search to exclude these kinds of matches so that I end up with the following result:
P1 <- c('Victor Bailey','Dennis Flores','Philip Hernandez')
P2 <- c('Joseph Cox','Edward Anderson','Joseph Cox')
C <- c('David Washington','David Washington','Russell James')
"1B" <- c('Steven Parker','Robert Harris Jr', 'J.P. Allen')
"2B" <- c('Carl Smith','Jack Rodriguez','Jack Rodriguez')
"3B" <- c('Brian Rivera','Thomas Wilson','Brian Rivera')
SS <- c('William Gonzales','J.C. Roberts','Ralph Martinez')
OF1 <- c('Andrew Johnson','Jason Torres','Ryan Walker')
OF2 <- c('Walter Hill','Randy Baker','Christopher Taylor')
OF3 <- c('Christopher Taylor','Ryan Walker','Andrew Johnson')
RESULT <- data.frame(P1, P2, C, `1B`, `2B`, `3B`, SS, OF1, OF2, OF3)
Assuming you want to match C as a whole word, but not the C inside the initials J.C. (this pattern excludes only J.C.; other initials such as J.P. are not covered), use:
\bC\b(?<!\bJ\.C(?=\.))
With your regex:
\b(P|C|OF|SS|1B|2B|3B)\b(?<!\bJ\.C(?=\.))
In your code:
mm <- gregexpr("\\b(P|C|OF|SS|1B|2B|3B)\\b(?<!\\bJ\\.C(?=\\.))", LINEUPS, perl=TRUE)
The main trick:
Use negative look-ahead in regex (?!<your-pattern>) to forbid following characters after your single letter position patterns - in this case (?!\\.).
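The effect of the look-ahead can be checked in isolation. Below is a small base R sketch on a shortened, made-up lineup string; perl = TRUE is needed because R's default regex engine does not support lookarounds.

```r
x <- "SS J.C. Roberts P Dennis Flores C David Washington"

# P or C only count as positions when not followed by a literal dot
pat <- "\\b(P(?!\\.)|C(?!\\.)|OF|SS|1B|2B|3B)\\b"
hits <- regmatches(x, gregexpr(pat, x, perl = TRUE))[[1]]
hits
# [1] "SS" "P"  "C"
```

The C in J.C. is skipped because it is followed by a dot, while the standalone C before David Washington is still found.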
Helper functions and finally the processing function process_lineups():
require(stringr)
extract_positions <- function(lineups, pos_pattern) {
sapply(stringr::str_extract_all(lineups, pos_pattern), stringr::str_trim)
}
extract_names <- function(lineups, pos_pattern) {
res <- sapply(stringr::str_split(lineups, pos_pattern), stringr::str_trim)
res[2:nrow(res), ]
}
get_indexes_matching <- function(pattern, vec) {
# Return all pattern-matching index positions in vec. `pattern` can be regex.
grep(pattern, vec)
}
pattern2names <- function(pattern, df) {
# Utility function to prepare names of result data frame.
# 1. clean from "^" and "$" in patterns.
# 2. Add numberings if multiple hits.
# (e.g. for "^P$" -> "P" -(if multi-hits add numbering)-> "P1" "P2")
cleaned_pattern <- gsub("^\\^", "", gsub("\\$$", "", pattern))
if (ncol(df) > 1) {
paste0(cleaned_pattern, 1:ncol(df))
} else {
cleaned_pattern
}
}
extract_patterns_to_df <- function(pattern, positions, names) {
# Return all hits of positions as names and the positions as column name(s).
# It returns a data frame. (e.g. columns: "P1" "P2" or single hit: column: "C")
res <- sapply(1:ncol(positions), function(i) names[get_indexes_matching(pattern, positions[, i]), i])
if (is.matrix(res)) {
df <- as.data.frame(t(res))
} else if (is.vector(res)) {
df <- data.frame("col" = res)
}
names(df) <- pattern2names(pattern, df)
df
}
process_lineups <- function(LINEUPS, position_pattern, ordered_patterns) {
# All necessary procedures to generate the final RESULT data frame.
positions <- extract_positions(LINEUPS, position_pattern)
names <- extract_names(LINEUPS, position_pattern)
Reduce(cbind,
lapply(ordered_patterns,
function(pos) extract_patterns_to_df(pos, positions, names)))
}
Apply the function process_lineups():
LINEUPS <- c('OF Andrew Johnson P Victor Bailey OF Walter Hill 2B Carl Smith 3B Brian Rivera P Joseph Cox 1B Steven Parker SS William Gonzales OF Christopher Taylor C David Washington',
'SS J.C. Roberts P Dennis Flores OF Jason Torres 2B Jack Rodriguez OF Randy Baker P Edward Anderson C David Washington 3B Thomas Wilson OF Ryan Walker 1B Robert Harris Jr',
'1B J.P. Allen P Philip Hernandez OF Ryan Walker OF Christopher Taylor 2B Jack Rodriguez C Russell James 3B Brian Rivera P Joseph Cox OF Andrew Johnson SS Ralph Martinez')
# use negative lookahead (?!<pattern>) to forbid e.g. P or C followed by a `\\.`
position_pattern <- "\\b(P(?!\\.)|C(?!\\.)|OF|SS|1B|2B|3B)\\b"
ordered_patterns <- c("^P$", "^C$", "^1B$", "^2B$", "^3B$", "^SS$", "^OF$")
res_df <- process_lineups(LINEUPS, position_pattern, ordered_patterns)
The result:
# > res_df
# P1 P2 C 1B
# 1 Victor Bailey Joseph Cox David Washington Steven Parker
# 2 Dennis Flores Edward Anderson David Washington Robert Harris Jr
# 3 Philip Hernandez Joseph Cox Russell James J.P. Allen
# 2B 3B SS OF1
# 1 Carl Smith Brian Rivera William Gonzales Andrew Johnson
# 2 Jack Rodriguez Thomas Wilson J.C. Roberts Jason Torres
# 3 Jack Rodriguez Brian Rivera Ralph Martinez Ryan Walker
# OF2 OF3
# 1 Walter Hill Christopher Taylor
# 2 Randy Baker Ryan Walker
# 3 Christopher Taylor Andrew Johnson
Finally, one could rename "1B", "2B", "3B" into "X1B", "X2B", "X3B".
You weren't asking for an optimisation, but I couldn't help myself trying ;-)
sample data
LINEUPS <- c('OF Andrew Johnson P Victor Bailey OF Walter Hill 2B Carl Smith 3B Brian Rivera P Joseph Cox 1B Steven Parker SS William Gonzales OF Christopher Taylor C David Washington',
'SS J.C. Roberts P Dennis Flores OF Jason Torres 2B Jack Rodriguez OF Randy Baker P Edward Anderson C David Washington 3B Thomas Wilson OF Ryan Walker 1B Robert Harris Jr',
'1B J.P. Allen P Philip Hernandez OF Ryan Walker OF Christopher Taylor 2B Jack Rodriguez C Russell James 3B Brian Rivera P Joseph Cox OF Andrew Johnson SS Ralph Martinez')
code
# split on delimiters, while keeping the delimiter
# also, trim whitespace using trimws
pattern <- "(?<=.)(?=\\b(P|C|OF|SS|1B|2B|3B)[^\\.]\\b)"
L <- lapply( strsplit( LINEUPS, pattern, perl = TRUE ), trimws )
#split after first space
pattern2 <- "^(\\w+)\\s?(.*)$"
lapply( L, function(x) {
data.frame( position = sub( pattern2, "\\1", x ),
player = sub( pattern2, "\\2",x ) )
})
output
# [[1]]
# position player
# 1 OF Andrew Johnson
# 2 P Victor Bailey
# 3 OF Walter Hill
# 4 2B Carl Smith
# 5 3B Brian Rivera
# 6 P Joseph Cox
# 7 1B Steven Parker
# 8 SS William Gonzales
# 9 OF Christopher Taylor
# 10 C David Washington
#
# [[2]]
# position player
# 1 SS J.C. Roberts
# 2 P Dennis Flores
# 3 OF Jason Torres
# 4 2B Jack Rodriguez
# 5 OF Randy Baker
# 6 P Edward Anderson
# 7 C David Washington
# 8 3B Thomas Wilson
# 9 OF Ryan Walker
# 10 1B Robert Harris Jr
#
# [[3]]
# position player
# 1 1B J.P. Allen
# 2 P Philip Hernandez
# 3 OF Ryan Walker
# 4 OF Christopher Taylor
# 5 2B Jack Rodriguez
# 6 C Russell James
# 7 3B Brian Rivera
# 8 P Joseph Cox
# 9 OF Andrew Johnson
# 10 SS Ralph Martinez
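The split pattern above is zero-width: the look-ahead finds a spot right before a position code, and the leading (?<=.) requires at least one preceding character, so the very start of the string is not treated as a split point. A minimal sketch on a shortened string (the variable name parts is only illustrative):

```r
x <- "OF Andrew Johnson P Victor Bailey"
pattern <- "(?<=.)(?=\\b(P|C|OF|SS|1B|2B|3B)[^\\.]\\b)"

# Nothing is consumed by the split, so each piece keeps its position code
parts <- trimws(strsplit(x, pattern, perl = TRUE)[[1]])
parts
# [1] "OF Andrew Johnson" "P Victor Bailey"
```

strsplit() is known to behave unexpectedly when a pattern can match the empty string at the start of the input, which is presumably why the lookbehind is there.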
If you need to store the output by position as separate objects, you can use list2env.
Store the output from the code above in ans, and then:
list2env(
split(
data.table::rbindlist( ans, use.names = TRUE ),
by = "position",
keep.by = FALSE ),
envir = .GlobalEnv )
There are some good solutions here, but I believe I found a much more efficient one: removing the . characters entirely using LINEUPS <- gsub(".", "", LINEUPS, fixed = TRUE). For my purposes, it doesn't matter if the names are exact matches with the original input data - only that they are organized in a way I can put them to use.
Simple and functional. :)
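A quick sketch of that approach on a shortened, made-up string. fixed = TRUE matters here: without it, "." is a regular expression that matches any character.

```r
x <- "SS J.C. Roberts P Dennis Flores"

# Remove literal dots only
no_dots <- gsub(".", "", x, fixed = TRUE)
no_dots
# [1] "SS JC Roberts P Dennis Flores"

# With the dots gone, the original pattern no longer sees a standalone C
pat <- "\\b(P|C|OF|SS|1B|2B|3B)\\b"
hits <- regmatches(no_dots, gregexpr(pat, no_dots))[[1]]
hits
# [1] "SS" "P"
```

The trade-off is that the names in the result no longer match the input exactly (J.C. becomes JC).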

Convert json list to data frame

I am having trouble converting a JSON file to a data frame.
I am using jsonlite's fromJSON() together with unlist(), but I cannot get the data into the shape I want.
The JSON file is structured this way:
{"JOHN":["AZ","YZ","ZE","ZR","FZ"],"MARK":["FZ","JF","FS"],"LINDA":["FZ","RZ","QF"]}
And I would like to have a data frame similar to this:
NAME GROUP
JOHN AZ
JOHN YZ
JOHN ZE
JOHN ZR
JOHN FZ
MARK FZ
MARK JF
MARK FS
...
Thanks !
We can use fromJSON from jsonlite to get a list of key/value vectors, convert that to a two column data.frame with stack, rearrange the columns and change the column names (if needed).
library(jsonlite)
setNames(stack(fromJSON(str1))[2:1], c("NAME", "GROUP"))
# NAME GROUP
#1 JOHN AZ
#2 JOHN YZ
#3 JOHN ZE
#4 JOHN ZR
#5 JOHN FZ
#6 MARK FZ
#7 MARK JF
#8 MARK FS
#9 LINDA FZ
#10 LINDA RZ
#11 LINDA QF
data
str1 <- '{"JOHN":["AZ","YZ","ZE","ZR","FZ"],"MARK":["FZ","JF","FS"],"LINDA":["FZ","RZ","QF"]}'
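To see what stack does here without jsonlite, the sketch below builds by hand a smaller named list of the kind fromJSON() returns for this JSON. The names groups, long and res are only illustrative.

```r
# Named list of character vectors, one element per key
groups <- list(JOHN = c("AZ", "YZ"), MARK = "FZ")

long <- stack(groups)                           # columns: values, ind
res <- setNames(long[2:1], c("NAME", "GROUP"))  # swap and rename
res
#   NAME GROUP
# 1 JOHN    AZ
# 2 JOHN    YZ
# 3 MARK    FZ
```

Note that stack() returns the key column as a factor; wrap it in as.character() if plain strings are needed.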

Turn names into numbers in a dataframe based on the row index of the name in another dataframe

I have two dataframes. One is just the names of my Facebook friends and the other holds the links, with source and target columns. I want to turn the names in the links dataframe into numbers based on the row index of that name in the friends dataframe.
friends
name
1 Andrewt Thomas
2 Robbie McCord
3 Mohammad Mojadidi
4 Andrew John
5 Professor Owk
6 Joseph Charles
links
source target
1 Andrewt Thomas Andrew John
2 Andrewt Thomas James Zou
3 Robbie McCord Bz Benz
4 Robbie McCord Yousef AL-alawi
5 Robbie McCord Sherhan Asimov
6 Robbie McCord Aigerim Aig
Seems trivial, but I cannot figure it out. Thanks for help.
Just use a simple match
links$source <- match(links$source, friends$name)
links
# source target
# 1 1 Andrew John
# 2 1 James Zou
# 3 2 Bz Benz
# 4 2 Yousef AL-alawi
# 5 2 Sherhan Asimov
# 6 2 Aigerim Aig
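If both columns should be converted, match() can be applied to each column in turn. The following is a sketch on a reduced version of the data, assuming character columns; names that do not appear in friends come back as NA.

```r
friends <- data.frame(name = c("Andrewt Thomas", "Robbie McCord", "Andrew John"),
                      stringsAsFactors = FALSE)
links <- data.frame(source = c("Andrewt Thomas", "Robbie McCord"),
                    target = c("Andrew John", "Bz Benz"),
                    stringsAsFactors = FALSE)

# Apply match() to every column; links[] keeps the data frame structure
links[] <- lapply(links, match, table = friends$name)
links
#   source target
# 1      1      3
# 2      2     NA
```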
Something like this?
links$source <- vapply(links$source, function(x) which(friends$name == x), integer(1))
Full example
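# friends was not defined in the original example; this hypothetical one is consistent with the output below
friends <- data.frame(name = c("Bob", "Alice", "John"))
links <- data.frame(source = c("John", "John", "Alice"), target = c("Jimmy", "Al", "Chris"))
links$source <- vapply(links$source, function(x) which(friends$name == x), integer(1), USE.NAMES = FALSE)
links$source
[1] 3 3 2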

Lookup values in a vectorized way

I keep reading about the importance of vectorized functionality so hopefully someone can help me out here.
Say I have a data frame with two columns: name and ID. I also have another data frame with name and birthplace, but this data frame is much larger than the first and contains some, but not all, of the names from the first data frame. How can I add a third column to the first table that is populated with birthplaces looked up in the second table?
What I have now is:
corresponding.birthplaces <- sapply(table1$Name,
function(name){return(table2$Birthplace[table2$Name==name])})
This seems inefficient. Thoughts? Does anyone know of a good book/resource for using R 'properly'..I get the feeling that I generally do think in the least computationally effective manner conceivable.
Thanks :)
See ?merge, which performs a database-like merge or join.
Here is an example:
set.seed(2)
d1 <- data.frame(ID = 1:5, Name = c("Bill","Bob","Jessica","Jennifer","Robyn"))
d2 <- data.frame(Name = c("Bill", "Gavin", "Bob", "Joris", "Jessica", "Andrie",
"Jennifer","Joshua","Robyn","Iterator"),
Birthplace = sample(c("London","New York",
"San Francisco", "Berlin",
"Tokyo", "Paris"), 10, replace = TRUE))
which gives:
> d1
ID Name
1 1 Bill
2 2 Bob
3 3 Jessica
4 4 Jennifer
5 5 Robyn
> d2
Name Birthplace
1 Bill New York
2 Gavin Tokyo
3 Bob Berlin
4 Joris New York
5 Jessica Paris
6 Andrie Paris
7 Jennifer London
8 Joshua Paris
9 Robyn San Francisco
10 Iterator Berlin
Then we use merge() to do the join:
> merge(d1, d2)
Name ID Birthplace
1 Bill 1 New York
2 Bob 2 Berlin
3 Jennifer 4 London
4 Jessica 3 Paris
5 Robyn 5 San Francisco
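By default merge() keeps only names found in both tables and sorts the result by the key. Since the question says the lookup table does not contain every name, a left join or a match()-based lookup may be closer to what is wanted. A minimal sketch on made-up data (Zed is deliberately missing from the lookup table):

```r
d1 <- data.frame(ID = 1:3, Name = c("Bill", "Bob", "Zed"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(Name = c("Bob", "Bill", "Gavin"),
                 Birthplace = c("Berlin", "New York", "Tokyo"),
                 stringsAsFactors = FALSE)

# Left join: keep every row of d1, NA where d2 has no entry
left <- merge(d1, d2, by = "Name", all.x = TRUE)

# Vectorized lookup that preserves the row order of d1
d1$Birthplace <- d2$Birthplace[match(d1$Name, d2$Name)]
d1$Birthplace
# [1] "New York" "Berlin"   NA
```

The match() form replaces the sapply() loop from the question with a single vectorized call and returns exactly one value per row of the first table.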
