Using gsub across columns in R

I have some data:
library(tibble)
testData <- tibble(fname = c("Alice", "Bob", "Charlie", "Dan", "Eric"),
                   lname = c("Smith", "West", "CharlieBlack", "DanMcDowell", "Bush"))
A few of the last names have first names concatenated to them.
What is an effective way to go through and fix the lname column?
I want it to look like this:
lname = c("Smith", "West", "Black", "McDowell", "Bush")
I can use a for loop but I have half a million rows of data so I'd like to find a more efficient method.

We can use str_remove:
library(tidyverse)
testData %>%
  mutate(lname = str_remove(lname, fname))
# A tibble: 5 x 2
# fname lname
# <chr> <chr>
#1 Alice Smith
#2 Bob West
#3 Charlie Black
#4 Dan McDowell
#5 Eric Bush
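str_remove() is vectorised over both string and pattern, so each lname is paired with the fname from the same row. If any first names could contain regex metacharacters, wrapping the pattern in fixed() treats it as a literal string (a small hedged variation, not needed for this particular data):
testData %>%
  mutate(lname = str_remove(lname, fixed(fname)))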

We can use gsub within apply:
apply(testData,1,function(x) gsub(x['fname'],"",x['lname']))
Output:
[1] "Smith" "West" "Black" "McDowell" "Bush"

Try mutate with an ifelse clause to catch the lname entries that are concatenated, e.g.:
library(dplyr)
testData <- testData %>%
  mutate(lname = ifelse(grepl('[[:upper:]][[:lower:]]+[[:upper:]]', lname),
                        gsub('^[[:upper:]][[:lower:]]+', "", lname),
                        lname))
In this example, you are saying "mutate lname IF the string has an uppercase letter, at least one lowercase letter, and then another uppercase letter. If that condition is met, replace the leading uppercase letter and the lowercase letters that follow it with nothing. If that condition is not met, just keep the original lname text."
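To see the two pieces in isolation on the example column (a quick illustration of the condition and the replacement):
lname <- c("Smith", "West", "CharlieBlack", "DanMcDowell", "Bush")
grepl('[[:upper:]][[:lower:]]+[[:upper:]]', lname)
#[1] FALSE FALSE  TRUE  TRUE FALSE
gsub('^[[:upper:]][[:lower:]]+', "", lname[3:4])
#[1] "Black"    "McDowell"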

Related

Add value in one column based on multiple key words in another column in r

I want to do the following: if the keywords "GARAGE", "PARKING", or "LOT" exist in column "Name", then I would add the value "Parking&Garage" to column "Type".
Here is the dataset:
df<-data.frame(Name=c("GARAGE 1","GARAGE 2", "101 GARAGE","PARKING LOT","CENTRAL PARKING","SCHOOL PARKING 1","CITY HALL"))
The following code works well for me, but is there a neater way to make it shorter? Thanks!
df$Type[grepl("GARAGE", df$Name) |
        grepl("PARKING", df$Name) |
        grepl("LOT", df$Name)] <- "Parking&Garage"
The regex "or" operator | is your friend here:
df$Type[grepl("GARAGE|PARKING|LOT", df$Name)]<-"Parking&Garage"
You can create a vector of keywords, build the pattern dynamically, and replace the values.
keywords <- c('GARAGE', 'PARKING', 'LOT')
df$Type <- NA
df$Type[grep(paste0(keywords, collapse = '|'), df$Name)] <- "Parking&Garage"
df
# Name Type
#1 GARAGE 1 Parking&Garage
#2 GARAGE 2 Parking&Garage
#3 101 GARAGE Parking&Garage
#4 PARKING LOT Parking&Garage
#5 CENTRAL PARKING Parking&Garage
#6 SCHOOL PARKING 1 Parking&Garage
#7 CITY HALL <NA>
This would be helpful if you need to add more keywords to your list later.
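For reference, the dynamically built pattern is just the alternation the previous answer wrote by hand:
paste0(keywords, collapse = '|')
#[1] "GARAGE|PARKING|LOT"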
An alternative with the dplyr and stringr packages:
library(stringr)
library(dplyr)
df %>%
  dplyr::mutate(TYPE = stringr::str_detect(Name, "GARAGE|PARKING|LOT"),
                TYPE = ifelse(TYPE == TRUE, "Parking&Garage", NA_character_))
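The two mutate() steps can also be collapsed into one; a minimal sketch along the same lines, writing to the question's Type column:
df %>%
  dplyr::mutate(Type = dplyr::if_else(stringr::str_detect(Name, "GARAGE|PARKING|LOT"),
                                      "Parking&Garage", NA_character_))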

Splitting character object using vector of delimiters

I have a large number of text files. Each file is stored as an observation in a dataframe. Each observation contains multiple fields so there is some structure in each object. I'm looking to split each based on the structured information within each file.
Data is currently in the following structure (simplified):
a <- c("Name: John Doe Age: 50 Address Please give full address 22 Main Street, New York")
b <- c("Name: Jane Bloggs Age: 42 Address Please give full address 1 Lower Street, London")
df <- data.frame(rawtext = c(a,b))
I'd like to split each observation into individual variable columns. It should end up looking like this:
Name Age Address
John Doe 50 22 Main Street, New York
Jane Bloggs 42 1 Lower Street, London
I thought that this could be done fairly simply using a pre-defined vector of delimiters, since each text object is structured. I have tried using stringr and str_split(), but this doesn't handle a vector of delimiters, e.g.:
delims <- c("Name:", "Age", "Address Please give full address")
str_split(df$rawtext, delims)
I'm perhaps trying to oversimplify here. The only other approach I can think of is to loop through each observation and extract all text after delims[1] and before delims[2] (and so on) for all fields.
e.g. the following bodge would get me the name field based on the delimiters:
sub(paste0(".*", delims[1]), "", df$rawtext[1]) %>% sub(paste0(delims[2], ".*"), "", .)
[1] " John Doe "
This feels extremely inefficient. Is there a better way that I'm missing?
A tidyverse solution:
library(tidyverse)
delims <- c("Name", "Age", "Address Please give full address")
df %>%
  mutate(rawtext = str_remove_all(rawtext, ":")) %>%
  separate(rawtext, c("x", delims), sep = paste(delims, collapse = "|"), convert = T) %>%
  mutate(across(where(is.character), str_squish), x = NULL)
# # A tibble: 2 x 3
# Name Age `Address Please give full address`
# <chr> <dbl> <chr>
# 1 John Doe 50 22 Main Street, New York
# 2 Jane Bloggs 42 1 Lower Street, London
Note: convert = T in separate() converts Age from character to numeric, ignoring leading/trailing whitespace. The extra "x" column collects the (empty) text before the first delimiter and is then dropped with x = NULL.
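To see what the sep pattern matches, here is the raw split in base R (a rough sketch using the same delimiters; not part of the answer above):
pat <- paste(delims, collapse = "|")
pat
#[1] "Name|Age|Address Please give full address"
strsplit(gsub(":", "", as.character(df$rawtext[1])), pat)[[1]]
# pieces: "", " John Doe ", " 50 ", " 22 Main Street, New York"
The empty first piece is why the separate() call needs the throwaway "x" column, and the stray whitespace is what str_squish() cleans up.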

Extracting substring by positions in pipe

I would like to extract a substring from every row of the id column of a tibble. I am always interested in the region between the 1st and 3rd space of the original id. The resulting substrings, Zoe Boston and Jane Rome, would go into a new column, name.
I tried to get the positions of the spaces in every id with str_locate_all and then use those positions with str_sub. However, I cannot extract the positions correctly.
data <- tibble(id = c("#1265746 Zoe Boston 58962 st. Victory cont_1.0)",
                      "#958463279246 Jane Rome 874593.01 musician band: XYZ 985147")) %>%
  mutate(coor = str_locate_all(id, "\\s"),
         name = str_sub(id, start = coor[[1]], end = coor[[3]]))
You can use regex to extract what you want.
Assuming you have stored your tibble in data, you can use sub to extract 1st and 2nd word.
sub('^#\\w+\\s(\\w+\\s\\w+).*', '\\1', data$id)
#[1] "Zoe Boston" "Jane Rome"
^# - starts with a hash
\\w+ - a word
\\s - whitespace
( - start of capture group
\\w+ - a word
\\s - whitespace
\\w+ - another word
) - end of capture group
.* - remaining string
The str_locate_all approach is more complex: it first returns the positions of the whitespaces, then you need to select the end of the 1st whitespace and the start of the 3rd, and finally use str_sub to extract the text between those positions (a peek at what str_locate_all returns is shown after the output below).
library(dplyr)
library(stringr)
library(purrr)
data %>%
  mutate(coor = str_locate_all(id, "\\s"),
         start = map_dbl(coor, `[`, 1) + 1,
         end = map_dbl(coor, `[`, 3) - 1,
         name = str_sub(id, start, end))
# A tibble: 2 x 2
# id name
# <chr> <chr>
#1 #1265746 Zoe Boston 58962 st. Victory cont_1.0) Zoe Boston
#2 #958463279246 Jane Rome 874593.01 musician band: XYZ 985147 Jane Rome
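To see why `[`, 1 and `[`, 3 pick the right values: str_locate_all() returns a list with one start/end matrix per string, and subsetting a matrix with a single index reads it in column-major order, so elements 1 and 3 are the start positions of the 1st and 3rd space. For the first id (a quick peek, not part of the answer above):
str_locate_all(data$id[1], "\\s")[[1]]
#      start end
# [1,]     9   9
# [2,]    13  13
# [3,]    20  20
# [4,]    26  26
# [5,]    30  30
# [6,]    38  38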
Another possible solution using stringr and purrr packages
library(stringr)
library(purrr)
library(dplyr)
data %>%
  mutate(name = map_chr(str_split(id, " "), ~paste(unlist(.)[2:3], collapse = " ")))
Explanation:
In str_split(id, " ") we create a list of the terms that are separated inside id by a whitespace.
map_chr takes each one of these lists and applies the following function to it: unlist the list, take the elements in positions 2 and 3 (which form the name we want), and then collapse them with a whitespace between them (a short demo follows the output below).
Output
# A tibble: 2 x 2
# id name
# <chr> <chr>
# 1 #1265746 Zoe Boston 58962 st. Victory cont_1.0) Zoe Boston
# 2 #958463279246 Jane Rome 874593.01 musician band: XYZ 985147 Jane Rome
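For the first id, the intermediate pieces look like this (a quick illustration, not part of the answer above):
str_split(data$id[1], " ")[[1]][2:3]
#[1] "Zoe"    "Boston"
paste(str_split(data$id[1], " ")[[1]][2:3], collapse = " ")
#[1] "Zoe Boston"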

Replace lowercase in names, not in surnames

I have a problem with a database of people's names. I want to abbreviate the first names but not the last names. The last name is separated from the name by a comma, and the different people are separated from each other by a semicolon, like in this example:
Michael, Jordan; Bird, Larry;
If the name is a single word, the code would be like this:
breve$autor <- str_replace_all(breve$autor, "[:lower:]{1,}\\;", ".\\;")
Result with this code:
Michael, J.; Bird, L.;
The problem is with compound names. With this code, the name:
Jordan, Michael Larry;
It would be:
Jordan, Michael L.;
Could someone tell me how to remove all lowercase letters between the comma and the semicolon, so that it looks like this:
Jordan, M.L.;
Here is another solution:
x1 <- 'Michael, Jordan; Bird, Larry;'
x2 <- 'Jordan, Michael Larry;'
gsub('([A-Z])[a-z]+(?=[ ;])', '\\1.', x1, perl = TRUE)
# [1] "Michael, J.; Bird, L.;"
gsub('([A-Z])[a-z]+(?=[ ;])', '\\1.', x2, perl = TRUE)
# [1] "Jordan, M. L.;"
Surnames are followed by a comma, while the parts of the first names are followed by a space or a semicolon. Here I use (?=[ ;]) to make sure that the character following the matched pattern is a space or a semicolon.
To remove the space between M. and L., an additional step is needed:
gsub('\\. ', '.', gsub('([A-Z])[a-z]+(?=[ ;])', '\\1.', x2, perl = TRUE))
# [1] "Jordan, M.L.;"
There must be a regular expression that will do this, of course. But that magic is a little beyond me. So here is an approach with simple string manipulation in a data frame using tidyverse functions.
library(stringr)
library(dplyr)
library(tidyr)
ballers <- "Michael, Jordan; Bird, Larry;"
mj <- "Jordan, Michael Larry"
c(ballers, mj) %>%
  # split the players
  str_split(., ";", simplify = TRUE) %>%
  # remove white space
  str_trim() %>%
  # transpose to get players in a column
  t() %>%
  # split again into last name and first + middle (if any)
  str_split(",", simplify = TRUE) %>%
  # convert to a tibble
  as_tibble() %>%
  # remove more white space
  mutate(V2 = str_trim(V2)) %>%
  # remove empty rows (these can be avoided by different manipulation upstream)
  filter(!V1 == "") %>%
  # name the columns
  rename("Last" = V1, "First_two" = V2) %>%
  # separate the given names into first and middle (if any)
  separate(First_two, into = c("First", "Middle"), sep = " ") %>%
  # abbreviate to first letter
  mutate(First_i = abbreviate(First, 1)) %>%
  # abbreviate, but take into account that the middle name might be missing
  mutate(Middle_i = ifelse(!is.na(Middle), paste0(abbreviate(Middle, 1), "."), "")) %>%
  # combine the first and middle initials
  mutate(Initials = paste(First_i, Middle_i, sep = ".")) %>%
  # make the desired Last, F.M. vector
  mutate(Final = paste(Last, Initials, sep = ", "))
# A tibble: 3 x 7
  Last    First   Middle First_i Middle_i Initials Final
  <chr>   <chr>   <chr>  <chr>   <chr>    <chr>    <chr>
1 Michael Jordan  NA     J       ""       J.       Michael, J.
2 Jordan  Michael Larry  M       L.       M.L.     Jordan, M.L.
3 Bird    Larry   NA     L       ""       L.       Bird, L.
Warning message:
Expected 2 pieces. Missing pieces filled with `NA` in 2 rows [1, 3].
Much longer than a regex.
There will probably be a better way to do this, but I managed to get it to work using the stringr and tibble packages.
library(stringr)
library(tibble)
names <- 'Jordan, Michael; Bird, Larry; Obama, Barack; Bush, George Walker'
df <- as_tibble(str_split(unlist(str_split(names, '; ')), ', ', simplify = TRUE))
df[, 2] <- gsub('[a-z]+', '.', pull(df[, 2]))
This code generates the tibble df, which has the following contents:
# A tibble: 4 x 2
V1 V2
<chr> <chr>
1 Jordan M.
2 Bird L.
3 Obama B.
4 Bush G. W.
The names are first split into first and last names and stored in a data frame so that the gsub() call does not operate on the last names. Then, gsub() searches for runs of lowercase letters in the given names and replaces each run with a single .
Then, you can call str_c(str_c(pull(df[, 1]), ', ', pull(df[, 2])), collapse = '; ') (or str_c(pull(unite(df, full, c('V1', 'V2'), sep = ', ')), collapse = '; ') if you already have the tidyr package loaded) to return the string "Jordan, M.; Bird, L.; Obama, B.; Bush, G. W.".
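For reference, running the recombination described above gives:
str_c(str_c(pull(df[, 1]), ', ', pull(df[, 2])), collapse = '; ')
#[1] "Jordan, M.; Bird, L.; Obama, B.; Bush, G. W."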
...also, did you mean Michael Jordan, not Jordan Michael? lol
Here's one that uses gsub twice. The inner one is for names with no middle names and the outer is for names that have a middle name.
x = c("Michael, Jordan; Jordan, Michael Larry; Bird, Larry;")
gsub(", ([A-Z])[a-z]+ ([A-Z])[a-z]+;", ", \\1.\\2.;", gsub(", ([A-Z])[a-z]+;", ", \\1.;", x))
#[1] "Michael, J.; Jordan, M.L.; Bird, L.;"

Split a dataframe into multiple columns in R

My data frame is as follows:
User
JohnLenon03041965
RogerFederer12021954
RickLandsman01041975
and I am trying to get the output as
Name Lastname Birthdate
John Lenon 03041965
Roger Federer 12021954
Rick Landsman 01041975
I tried the following code:
a <- gsub('([[:upper:]])', ' \\1', df$User)
a <- as.data.frame(a)
library(tidyr)
a <- separate(a, a, into = c("Name", "Last"), sep = " (?=[^ ]+$)")
I get the following:
Name Last
John Lenon03041965
Roger Federer12021954
Rick Landsman01041975
I am trying to use a sep condition like (?=[0-9]) but I am getting an error like this:
c <-separate(c, c, into = c("last", "date"), sep = '(?=[0-9])')
Error in if (!after) c(values, x) else if (after >= lengx) c(x, values) else c(x[1L:after], : argument is of length zero
We can use regex lookarounds as sep, specifying a split either between a lower case letter and an upper case letter ((?<=[a-z])(?=[A-Z])) or (|) between a lower case letter and a number ((?<=[a-z])(?=[0-9]+)).
df1 %>%
  separate(User, into = c("Name", "LastName", "Birthdate"),
           sep = "(?<=[a-z])(?=[A-Z])|(?<=[a-z])(?=[0-9]+)")
# Name LastName Birthdate
#1 John Lenon 03041965
#2 Roger Federer 12021954
#3 Rick Landsman 01041975
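The same lookaround pattern works outside separate() as well; for example, base R strsplit() with perl = TRUE splits at the same zero-width positions (a sketch, not part of the answer above):
strsplit(df1$User, "(?<=[a-z])(?=[A-Z])|(?<=[a-z])(?=[0-9])", perl = TRUE)[[1]]
#[1] "John"     "Lenon"    "03041965"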
Or another option is extract, which captures characters as groups placed inside parentheses ((...)). Here, the 1st capture group matches an upper case letter followed by one or more lower case letters (([A-Z][a-z]+)) from the start (^) of the string, the 2nd captures one or more characters that are not numbers (([^0-9]+)), and the 3rd captures the rest of the characters ((.*)).
df1 %>%
  extract(User, into = c("Name", "LastName", "Birthdate"),
          "^([A-Z][a-z]+)([^0-9]+)(.*)")
# Name LastName Birthdate
#1 John Lenon 03041965
#2 Roger Federer 12021954
#3 Rick Landsman 01041975
data
df1 <- structure(list(User = c("JohnLenon03041965", "RogerFederer12021954",
                               "RickLandsman01041975")),
                 .Names = "User", class = "data.frame", row.names = c(NA, -3L))
