I have a dataframe like so:
df = data.frame('name' = c('California parks', 'bear lake', 'beautiful tree house', 'banana plant'), 'extract' = c('parks', 'bear', 'tree', 'plant'))
How do I remove the strings of the 'extract' column from the name column to get the following result:
name_new = California, lake, beautiful house, banana
I suspect this needs a combination of str_extract and lapply, but I can't quite figure it out.
Thanks!
str_remove and str_replace are vectorized over both string and pattern. So, with two columns, just pass 'name' as the string and 'extract' as the pattern to remove the substring from 'name' elementwise. Removing the substring can leave stray spaces behind: trimws strips the leading/trailing ones, and str_replace_all collapses any run of two or more spaces into a single space.
library(dplyr)
library(stringr)
df %>%
mutate(name_new = str_remove(name, extract),
name_new = str_replace_all(trimws(name_new), "\\s{2,}", " "))
# name extract name_new
#1 California parks parks California
#2 bear lake bear lake
#3 beautiful tree house tree beautiful house
#4 banana plant plant banana
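One caveat worth sketching: str_remove treats each pattern as a regular expression, so an 'extract' value containing characters such as ( or . may not match literally. The base R sketch below uses made-up data and an illustrative helper, remove_literal (a hypothetical name, not part of any package), that matches literally via sub(fixed = TRUE).

```r
# Hypothetical data: the second 'extract' value contains regex metacharacters
df <- data.frame(name = c("California parks", "price (USD) list"),
                 extract = c("parks", "(USD)"),
                 stringsAsFactors = FALSE)

# remove_literal() is an illustrative helper, not a library function
remove_literal <- function(string, pattern) {
  # remove the first literal occurrence of each pattern from its own string
  out <- mapply(function(s, p) sub(p, "", s, fixed = TRUE),
                string, pattern, USE.NAMES = FALSE)
  # collapse runs of whitespace left behind, then trim both ends
  trimws(gsub("\\s+", " ", out))
}

df$name_new <- remove_literal(df$name, df$extract)
df$name_new
```

With stringr, wrapping the pattern in fixed() should give a similar literal match.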
A base R option using gsub + Vectorize
within(df, name_new <- trimws(Vectorize(gsub)(paste0("\\s*", extract, "\\s*"), " ", name)))
which gives
name extract name_new
1 California parks parks California
2 bear lake bear lake
3 beautiful tree house tree beautiful house
4 banana plant plant banana
I have a dataset similar to the following (but larger):
dataset <- data.frame(First = c("John","John","Andy","John"), Last = c("Lewis","Brown","Alphie","Johnson"))
I would like to create a new column that contains each unique last name corresponding to the given first name. Thus, each observation of "John" would have c("Lewis", "Brown", "Johnson") in the third column.
I'm a bit perplexed because my attempts at vectorization seem impossible given I can't reference the particular observation I'm looking at. Specifically, what I want to write is:
dataset$allLastNames <- unique(dataset$Last[dataset$First == "the current index???"])
I think this can work in a loop (since I reference the observation with 'i'), but it is taking too long given the size of my data:
for(i in 1:nrow(dataset)){
dataset$allLastNames[i] <- unique(dataset$Last[dataset$First == dataset$First[i]])
}
Any suggestions for how I could make this work (using Base R)?
Thanks!
You can do this with dplyr in a few lines. First, group by first name and collect each group's unique last names in a list column.
library(dplyr)
list_names = dataset %>%
group_by(First) %>%
summarise(allLastNames = list(unique(Last)))
Then, add the summary table to your dataset matching the First names:
dataset %>% left_join(list_names,by='First')
First Last allLastNames
1 John Lewis Lewis, Brown, Johnson
2 John Brown Lewis, Brown, Johnson
3 Andy Alphie Alphie
4 John Johnson Lewis, Brown, Johnson
Also, R makes it easy to avoid for-loops: grouped and vectorized operations like the ones above work on whole columns at once.
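Since the question asks for base R and a vector of unique last names per row, here is a minimal loop-free sketch under those assumptions. The intermediate name by_first is made up for illustration; the result is stored as a list column.

```r
dataset <- data.frame(First = c("John", "John", "Andy", "John"),
                      Last = c("Lewis", "Brown", "Alphie", "Johnson"),
                      stringsAsFactors = FALSE)

# Named list: one vector of unique last names per first name
by_first <- lapply(split(dataset$Last, dataset$First), unique)

# Look up each row's first name in that list; unname() drops the lookup keys
dataset$allLastNames <- unname(by_first[dataset$First])

dataset$allLastNames[[1]]
```

The grouping work is done once per distinct first name rather than once per row, which is where the loop in the question spends its time.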
Base R option:
allLastNames <- aggregate(.~First, dataset, paste, collapse = ",")
dataset <- merge(dataset, allLastNames, by = "First")
names(dataset) <- c("First", "Last", "allLastNames")
Output:
First Last allLastNames
1 Andy Alphie Alphie
2 John Lewis Lewis,Brown,Johnson
3 John Brown Lewis,Brown,Johnson
4 John Johnson Lewis,Brown,Johnson
library(dplyr)
library(stringr)
dataset %>%
group_by(First) %>%
mutate(Lastnames = str_flatten(Last, ', '))
# Groups: First [2]
First Last Lastnames
<chr> <chr> <chr>
1 John Lewis Lewis, Brown, Johnson
2 John Brown Lewis, Brown, Johnson
3 Andy Alphie Alphie
4 John Johnson Lewis, Brown, Johnson
I want to do the following: if any of the keywords "GARAGE", "PARKING", or "LOT" appears in column "Name", then set column "Type" to "Parking&Garage".
Here is the dataset:
df<-data.frame(Name=c("GARAGE 1","GARAGE 2", "101 GARAGE","PARKING LOT","CENTRAL PARKING","SCHOOL PARKING 1","CITY HALL"))
The following code works well for me, but is there a neat way to make it shorter? Thanks!
df$Type[grepl("GARAGE", df$Name) |
grepl("PARKING", df$Name) |
grepl("LOT", df$Name)]<-"Parking&Garage"
The regex "or" operator | is your friend here:
df$Type[grepl("GARAGE|PARKING|LOT", df$Name)]<-"Parking&Garage"
You can create a list of keywords to change, create a pattern dynamically and replace the values.
keywords <- c('GARAGE', 'PARKING', 'LOT')
df$Type <- NA
df$Type[grep(paste0(keywords, collapse = '|'), df$Name)] <- "Parking&Garage"
df
# Name Type
#1 GARAGE 1 Parking&Garage
#2 GARAGE 2 Parking&Garage
#3 101 GARAGE Parking&Garage
#4 PARKING LOT Parking&Garage
#5 CENTRAL PARKING Parking&Garage
#6 SCHOOL PARKING 1 Parking&Garage
#7 CITY HALL <NA>
This would be helpful if you need to add more keywords to your list later.
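One thing to watch with a dynamically built pattern: a bare keyword such as "LOT" also matches inside longer words like "PILOT". The sketch below, using made-up rows, wraps the alternation in \\b word boundaries. Whether whole-word matching is actually wanted depends on the data, so treat it as an assumption.

```r
# Hypothetical rows; "PILOT SCHOOL" contains "LOT" only as part of a word
df <- data.frame(Name = c("GARAGE 1", "PARKING LOT", "PILOT SCHOOL", "CITY HALL"),
                 stringsAsFactors = FALSE)

keywords <- c("GARAGE", "PARKING", "LOT")

# \\b restricts each keyword to whole-word matches
pattern <- paste0("\\b(", paste(keywords, collapse = "|"), ")\\b")

df$Type <- ifelse(grepl(pattern, df$Name, perl = TRUE),
                  "Parking&Garage", NA_character_)
df
```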
An alternative with the dplyr and stringr packages:
library(stringr)
library(dplyr)
df %>%
dplyr::mutate(TYPE = stringr::str_detect(Name, "GARAGE|PARKING|LOT"),
TYPE = ifelse(TYPE == TRUE, "Parking&Garage", NA_character_))
I have a large number of text files. Each file is stored as an observation in a dataframe. Each observation contains multiple fields so there is some structure in each object. I'm looking to split each based on the structured information within each file.
Data is currently in the following structure (simplified):
a <- c("Name: John Doe Age: 50 Address Please give full address 22 Main Street, New York")
b <- c("Name: Jane Bloggs Age: 42 Address Please give full address 1 Lower Street, London")
df <- data.frame(rawtext = c(a,b))
I'd like to split each observation into individual variable columns. It should end up looking like this:
Name Age Address
John Doe 50 22 Main Street, New York
Jane Bloggs 42 1 Lower Street, London
I thought that this could be done fairly simply using a pre-defined vector of delimiters since each text object is structured. I have tried using stringr and str_split() but this doesn't handle the vector input. e.g.
delims <- c("Name:", "Age", "Address Please give full address")
str_split(df$rawtext, delims)
I'm perhaps trying to oversimplify here. The only other approach I can think of is to loop through each observation and extract all text after delims[1] and before delims[2] (and so on) for all fields.
e.g. the following bodge would get me the name field based on the delimiters:
sub(paste0(".*", delims[1]), "", df$rawtext[1]) %>% sub(paste0(delims[2], ".*"), "", .)
[1] " John Doe "
This feels extremely inefficient. Is there a better way that I'm missing?
A tidyverse solution:
library(tidyverse)
delims <- c("Name", "Age", "Address Please give full address")
df %>%
mutate(rawtext = str_remove_all(rawtext, ":")) %>%
separate(rawtext, c("x", delims), sep = paste(delims, collapse = "|"), convert = TRUE) %>%
mutate(across(where(is.character), str_squish), x = NULL)
# # A tibble: 2 x 3
# Name Age `Address Please give full address`
# <chr> <dbl> <chr>
# 1 John Doe 50 22 Main Street, New York
# 2 Jane Bloggs 42 1 Lower Street, London
Note: convert = TRUE in separate() converts Age from character to numeric, ignoring leading/trailing whitespace.
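For comparison, a base R sketch of the same idea: put all the delimiters into one regular expression with a capture group per field, then pull the groups out with regexec() and regmatches(). The pattern is written by hand for the two example strings and assumes every record follows exactly that layout.

```r
a <- "Name: John Doe Age: 50 Address Please give full address 22 Main Street, New York"
b <- "Name: Jane Bloggs Age: 42 Address Please give full address 1 Lower Street, London"
df <- data.frame(rawtext = c(a, b), stringsAsFactors = FALSE)

# One capture group per field; .*? is lazy so it stops at the next delimiter
pattern <- "^Name:\\s*(.*?)\\s*Age:\\s*(\\d+)\\s*Address Please give full address\\s*(.*)$"

# Each list element holds the full match followed by the three groups
m <- regmatches(df$rawtext, regexec(pattern, df$rawtext, perl = TRUE))
parts <- do.call(rbind, m)

out <- data.frame(Name = parts[, 2],
                  Age = as.integer(parts[, 3]),
                  Address = parts[, 4],
                  stringsAsFactors = FALSE)
out
```

Records that do not match the pattern yield an empty element in m, so real data would need a check before the rbind step.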
I would like to extract a substring from every row of the id column of a tibble. I am always interested in the region between the 1st and 3rd space of the original id. The resulting substrings, here Zoe Boston and Jane Rome, would go into a new column, name.
I tried to get the positions of "spaces" in every id with str_locate_all and then use positions to use str_sub. However I cannot extract the positions correctly.
data <- tibble(id = c("#1265746 Zoe Boston 58962 st. Victory cont_1.0)", "#958463279246 Jane Rome 874593.01 musician band: XYZ 985147") ) %>%
mutate(coor = str_locate_all(id, "\\s"),
name = str_sub(id, start = coor[[1]], end = coor[[3]] ) )
You can use regex to extract what you want.
Assuming you have stored your tibble in data, you can use sub to extract 1st and 2nd word.
sub('^#\\w+\\s(\\w+\\s\\w+).*', '\\1', data$id)
#[1] "Zoe Boston" "Jane Rome"
^# - starts with a hash
\\w+ - a word
\\s - whitespace
( - start of capture group
\\w+ - a word
\\s - whitespace
\\w+ - another word
) - end of capture group
.* - the rest of the string
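Note that \\w only matches letters, digits and underscores, so a name containing a hyphen or apostrophe would not match, and sub() would then return the string unchanged. A hedged variant that treats any run of non-whitespace (\\S+) as a "word" is sketched below; the second id is made up for illustration.

```r
# The first id is shortened from the question; the second is hypothetical
id <- c("#1265746 Zoe Boston 58962 st. Victory cont_1.0)",
        "#42 Anne-Marie O'Neil 77 singer")

# \\S+ matches any run of non-whitespace characters
name <- sub("^\\S+\\s(\\S+\\s\\S+).*", "\\1", id)
name
```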
The str_locate_all approach is more involved: it returns the position of every whitespace character, so you need to take the position just after the 1st whitespace and just before the 3rd, then use str_sub to extract the text between those positions.
library(dplyr)
library(stringr)
library(purrr)
data %>%
mutate(coor = str_locate_all(id, "\\s"),
start = map_dbl(coor, `[`, 1) + 1,
end = map_dbl(coor, `[`, 3) - 1,
name = str_sub(id, start, end))
# A tibble: 2 x 2
# id name
# <chr> <chr>
#1 #1265746 Zoe Boston 58962 st. Victory cont_1.0) Zoe Boston
#2 #958463279246 Jane Rome 874593.01 musician band: XYZ 985147 Jane Rome
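The same position-based idea can be sketched in base R with gregexpr() and substr(), assuming every id contains at least three spaces:

```r
id <- c("#1265746 Zoe Boston 58962 st. Victory cont_1.0)",
        "#958463279246 Jane Rome 874593.01 musician band: XYZ 985147")

# Positions of every space in each id
pos <- gregexpr(" ", id, fixed = TRUE)

# One character after the 1st space, one character before the 3rd
start <- vapply(pos, function(p) p[1], integer(1)) + 1L
end <- vapply(pos, function(p) p[3], integer(1)) - 1L

name <- substr(id, start, end)
name
```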
Another possible solution using stringr and purrr packages
library(stringr)
library(purrr)
library(dplyr)
data %>%
mutate(name = map_chr(str_split(id, " "), ~paste(unlist(.)[2:3], collapse = " ")))
Explanation:
in str_split(id, " ") we create a list of the terms that are separated inside id by a whitespace
map_chr is useful to take each one of these lists, and apply the following function to them: unlist the list, take the elements in positions 2 and 3 (which are the name we want) and then collapse them with a whitespace between them
Output
# A tibble: 2 x 2
# id name
# <chr> <chr>
# 1 #1265746 Zoe Boston 58962 st. Victory cont_1.0) Zoe Boston
# 2 #958463279246 Jane Rome 874593.01 musician band: XYZ 985147 Jane Rome
I have a dataset of the following:
> head(data,3)
city state zip_code overall_spend
1 MIDDLESBORO KY 40965 $252,168.12
2 PALM BEACH FL 33411-3518 $369,240.74
3 CORBIN KY 40701 $292,496.03
Now, I want to format the zip_code values that have extra parts after the -. For example, in the second row, I have 33411-3518. After formatting I want to have only 33411. How can I do this to the whole zip_code column? Also, zip_code is currently a factor.
Try
data$zip_code <- sub('-.*', '', data$zip_code)
data$zip_code
#[1] "40965" "33411" "40701"
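Because zip_code is a factor, sub() coerces it to character, so the result is a character vector rather than a factor. A small sketch with made-up values makes that coercion explicit; keeping the column as character (not numeric) is an assumption that preserves leading zeros.

```r
# Hypothetical values, including one ZIP code with a leading zero
zip_code <- factor(c("40965", "33411-3518", "40701", "02134-1001"))

# Drop the hyphen and everything after it
zip5 <- sub("-.*$", "", as.character(zip_code))
zip5
```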