Parsing a string with multiple brackets - r

I have a dataset dt with a column "subject" that I need to parse. For example,
ID subject
1 USA(Texas)(Austin)
2 USA(California)(Sacramento)
As a result, I want to get the following table:
ID subject Country State Capital
1 USA(Texas)(Austin) USA Texas Austin
2 USA(California)(Sacramento) USA California Sacramento
How can I do it?

Since there are several bracketed groups to extract, the middle capture group has to be lazy (.*?) so that it stops at the first closing bracket.
library(dplyr)
library(tidyr)
extract(dt, subject, into = c("Country", "State", "Capital"),
regex = "(.*)\\((.*?)\\)\\((.*)\\)", remove = FALSE)
# ID subject Country State Capital
#1 1 USA(Texas)(Austin) USA Texas Austin
#2 2 USA(California)(Sacramento) USA California Sacramento
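To see why the laziness matters, here is a small base R sketch (not part of the original answer) that compares a greedy and a lazy group on one of the sample strings. perl = TRUE is an assumption made here only so that the lazy quantifier is handled by a well-documented engine.

```r
x <- "USA(Texas)(Austin)"

# Greedy: .* runs on to the last closing bracket it can reach.
greedy <- regmatches(x, regexpr("\\((.*)\\)", x, perl = TRUE))

# Lazy: .*? stops at the first closing bracket.
lazy <- regmatches(x, regexpr("\\((.*?)\\)", x, perl = TRUE))

print(greedy)  # "(Texas)(Austin)"
print(lazy)    # "(Texas)"
```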
Another option with a simpler regex is to replace the round brackets with spaces using gsub and then use separate with whitespace as the sep argument.
dt %>%
mutate(subject = trimws(gsub('[()]', ' ', subject))) %>%
separate(subject, into = c("Country", "State", "Capital"), sep = "\\s+")
data
dt <- structure(list(ID = 1:2, subject = structure(2:1,
.Label = c("USA(California)(Sacramento)", "USA(Texas)(Austin)"),
class = "factor")), class = "data.frame", row.names = c(NA, -2L))
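One caveat with the whitespace-based split: a value that itself contains a space would be broken into two pieces. The sketch below is a base R variant that splits on the brackets instead; the "New York" row is made up for illustration and is not in the question's data.

```r
x <- c("USA(Texas)(Austin)", "USA(New York)(Albany)")  # second row is hypothetical

# Split on runs of brackets; strsplit() drops the trailing empty piece.
parts <- do.call(rbind, strsplit(x, "[()]+"))
colnames(parts) <- c("Country", "State", "Capital")

res <- data.frame(subject = x, parts, stringsAsFactors = FALSE)
print(res)
```

This assumes every row has exactly two bracketed groups; rows with a different number of groups would need separate handling.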

Related

Remove duplicate character strings from list column

I have this dataframe:
structure(list(class = c("Großbrittanien", "Rest Europa"), countries = list(
c("United Kingdom", "United Kingdom"), "Spain")), row.names = c(NA,
-2L), class = c("tbl_df", "tbl", "data.frame"))
I want to turn the countries list column into a character column and remove duplicate entries, so that United Kingdom only appears once. I am a little confused about how to achieve that using dplyr syntax.
You can unnest countries and then remove duplicated rows.
library(tidyverse)
df %>%
unnest(countries) %>%
distinct()
# # A tibble: 2 × 2
# class countries
# <chr> <chr>
# 1 Großbrittanien United Kingdom
# 2 Rest Europa Spain
Or, without unnest, apply unique within each class before collapsing the result to a single string with toString.
With grouping:
library(dplyr)
df |>
mutate(countries = toString(unique(unlist(countries))), .by = class)
# Note: If you're using `dplyr < v.1.1.0`, use `group_by`/`ungroup`.
With purrr:
library(dplyr)
library(purrr)
df |>
mutate(countries = map_chr(countries, ~ toString(unique(.))))
Output:
# A tibble: 2 × 2
class countries
<chr> <chr>
1 Großbrittanien United Kingdom
2 Rest Europa Spain, Portugal
Data (including something that is not duplicated .. Portugal):
df <-
structure(list(class = c("Großbrittanien", "Rest Europa"), countries = list(
c("United Kingdom", "United Kingdom"), c("Spain", "Portugal"))), row.names = c(NA,
-2L), class = c("tbl_df", "tbl", "data.frame"))
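For comparison, the same per-row deduplication can be sketched in base R without dplyr or purrr. This is only an illustration built on the data above, using a plain data.frame with a list column rather than a tibble.

```r
df <- data.frame(class = c("Großbrittanien", "Rest Europa"),
                 stringsAsFactors = FALSE)
df$countries <- list(c("United Kingdom", "United Kingdom"),
                     c("Spain", "Portugal"))

# Collapse each list element to one de-duplicated, comma-separated string.
df$countries <- vapply(df$countries,
                       function(x) toString(unique(x)),
                       character(1))
print(df)
```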
Here is a version using dplyr syntax (unnest itself comes from tidyr):
library(dplyr)
library(tidyr)
df %>%
unnest(countries) %>%
distinct(class, countries) %>%
group_by(class) %>%
summarise(countries = paste(countries, collapse = ", "))

Separating a column with multiple different entries with tidyr

I am trying to split one column of a data frame, which shows the active period(s) of several artists/bands, into two columns (start_of_career, end_of_career). The column is of class character. I tried tidyr's separate function: when I run it, the split shows up in the console but not in the data frame itself, so I assume it doesn't work properly.
Please see here a made up example of the data I want to split:
Column A    Column B
Artist A    1995-present
Artist B    1995-1997, 2008, 2010-present
As you can see, some rows consist only of a start and end date, while others have several dates.
All I actually need is the first number and the last, e.g. for Artist B I need only start_of_career 1995 and end_of_career "present". But I am somehow not able to solve this issue.
The code I used was:
library(tidyr)
df %>% separate(col = period_active, into = c('start_of_career', 'end_of_career'), sep = '-')
I also tried other separators (",", " "), but that didn't work either.
I also tried:
df$start_of_career = strsplit(df$period_active, split = '-')
But this didn't work either.
Using df, shown reproducibly in the Note at the end, remove everything except the first and last parts of Column B and then separate what is left.
library(dplyr)
library(tidyr)
df %>%
mutate(`Column B` = sub("-.*-", "-", `Column B`)) %>%
separate(`Column B`, c("start", "end"))
## Column A start end
## 1 Artist A 1995 present
## 2 Artist B 1995 present
Note
df <-
structure(list(`Column A` = c("Artist A", "Artist B"), `Column B` = c("1995-present",
"1995-1997, 2008, 2010-present")), class = "data.frame", row.names = c(NA,
-2L))
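As a quick check of what the sub() call in this answer does to each input, here is a minimal base R sketch on the two sample strings.

```r
x <- c("1995-present", "1995-1997, 2008, 2010-present")

# "-.*-" needs two dashes, so the first string is left alone; in the second,
# everything from the first dash to the last dash collapses to a single "-".
trimmed <- sub("-.*-", "-", x)
print(trimmed)  # "1995-present" "1995-present"
```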
Using base R
df <- cbind(df[1], read.table(text = sub("-[0-9, ]+", "", df$`Column B`),
header = FALSE, col.names = c("start", "end"), sep = "-"))
-output
> df
Column A start end
1 Artist A 1995 present
2 Artist B 1995 present
We could do this with separate as well
library(tidyr)
separate(df, `Column B`, into = c("start", "end"), sep = "-[^A-Za-z]*")
Column A start end
1 Artist A 1995 present
2 Artist B 1995 present
data
df <- structure(list(`Column A` = c("Artist A", "Artist B"),
`Column B` = c("1995-present",
"1995-1997, 2008, 2010-present")), class = "data.frame",
row.names = c(NA,
-2L))
We could use separate_rows and then filter for first and last row of group:
library(tidyr)
library(dplyr)
df %>%
separate_rows(Column.B) %>%
group_by(Column.A) %>%
filter(row_number()==1 | row_number()==n()) %>%
mutate(Column.C = c("start", "end"))
Column.A Column.B Column.C
<chr> <chr> <chr>
1 Artist A 1995 start
2 Artist A present end
3 Artist B 1995 start
4 Artist B present end
data:
df <- structure(list(Column.A = c("Artist A", "Artist B"), Column.B = c("1995-present",
"1995-1997, 2008, 2010-present")), class = "data.frame", row.names = c(NA,
-2L))
Use strsplit and then pick the first and the last entry.
library(dplyr)
df %>%
rowwise() %>%
mutate(splitrow = strsplit(`Column B`, "-"),
start_of_career = splitrow[1],
end_of_career = splitrow[length(splitrow)],
splitrow = NULL) %>%
ungroup()
# A tibble: 2 × 4
`Column A` `Column B` start_of_career end_of_career
<chr> <chr> <chr> <chr>
1 Artist A 1995-present 1995 present
2 Artist B 1995-1997, 2008, 2010-present 1995 present
Data
df <- structure(list(`Column A` = c("Artist A", "Artist B"), `Column B` = c("1995-present",
"1995-1997, 2008, 2010-present")), class = "data.frame", row.names = c(NA,
-2L))
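The same first-and-last idea can also be sketched without rowwise(), working on the whole column at once in base R. The variable names below are illustrative only.

```r
x <- c("1995-present", "1995-1997, 2008, 2010-present")

# One character vector of pieces per input string.
parts <- strsplit(x, "-", fixed = TRUE)

# First and last piece of each vector.
start_of_career <- vapply(parts, function(p) p[1], character(1))
end_of_career <- vapply(parts, function(p) p[length(p)], character(1))

print(data.frame(start_of_career, end_of_career))
```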
Another option: use strsplit and return a list of the start and end values (the \(v) shorthand needs R >= 4.1).
library(dplyr)
library(tidyr)
f <- \(v) {
v <- strsplit(v, "-|,| ")[[1]]
list(start = v[1], end = v[length(v)])
}
df %>%
mutate(`Column B` = lapply(`Column B`, f)) %>%
unnest_wider(`Column B`)
Output:
# A tibble: 2 × 3
`Column A` start end
<chr> <chr> <chr>
1 Artist A 1995 present
2 Artist B 1995 present
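The code below extracts the first word before the first dash and the last word after the last dash. Note that the loop has to run over the rows, not over length(df), which is the number of columns.
for (i in seq_len(nrow(df)))
{
df$start[i] <- sub("-.*", "", df$`Column B`[i])
df$end[i] <- sub("^.+-", "", df$`Column B`[i])
}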
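A small sketch of why row loops are usually written with seq_len(nrow(df)): length() of a data frame counts columns, not rows. The toy data frame here is made up for illustration.

```r
toy <- data.frame(a = c("x", "y", "z"), b = 1:3)

n_cols <- length(toy)  # number of columns
n_rows <- nrow(toy)    # number of rows

print(c(columns = n_cols, rows = n_rows))

# seq_len() also behaves sensibly for an empty data frame,
# whereas 1:0 would yield c(1, 0).
empty_index <- seq_len(0)
print(length(empty_index))
```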

Pivot_longer: Rotating multiple columns of data with same data types

I'm trying to rotate multiple columns of data into single, data-type consistent columns.
I've created a minimum example below.
library(tibble)
library(dplyr)
library(tidyr) # pivot_longer() and pivot_wider() come from tidyr
# I have data like this
df <- tibble(contact_1_prefix=c('Mr.','Mrs.','Dr.'),
contact_2_prefix=c('Dr.','Mr.','Mrs.'),
contact_1 = c('Bob Johnson','Robert Johnson','Bobby Johnson'),
contact_2 = c('Tommy Two Tones','Tommy Three Tones','Tommy No Tones'),
contact_1_loc = c('Earth','New York','Los Angeles'),
contact_2_loc = c('London','Geneva','Paris'))
# My attempt at a solution:
df %>% rename(contact_1_name=contact_1,
contact_2_name=contact_2) %>%
pivot_longer(cols=c(matches('_[12]_')),
names_to=c('.value','dat'),
names_pattern = "(.*)_[1-2]_(.*)") %>%
pivot_wider(names_from='dat',values_from='contact')
#What I want is to widen that data to achieve a tibble with these two example lines
df_desired <- tibble(name=c('Bob Johnson','Tommy Two Tones'),
loc =c('Earth','London'),
prefix=c('Mr.','Dr.'))
I want all names under name, all locations under loc, and all prefixes under prefix.
If I use just this snippet from the middle statement:
df %>% rename(contact_1_name=contact_1,
contact_2_name=contact_2) %>%
pivot_longer(cols=c(matches('_[12]_')),
names_to=c('.value','dat'),
names_pattern = "(.*)_[1-2]_(.*)")
The dput of the output is:
structure(list(dat = c("prefix", "prefix", "name", "name", "loc",
"loc", "prefix", "prefix", "name", "name", "loc", "loc", "prefix",
"prefix", "name", "name", "loc", "loc"), contact = c("Mr.", "Dr.",
"Bob Johnson", "Tommy Two Tones", "Earth", "London", "Mrs.",
"Mr.", "Robert Johnson", "Tommy Three Tones", "New York", "Geneva",
"Dr.", "Mrs.", "Bobby Johnson", "Tommy No Tones", "Los Angeles",
"Paris")), row.names = c(NA, -18L), class = c("tbl_df", "tbl",
"data.frame"))
From that, I thought for sure pivot_wider was the solution, but there is a name conflict.
I assume a single pivot_longer statement will achieve the task. I studied "Gathering wide columns into multiple long columns using pivot_longer" carefully but can't quite figure this out. I have to admit I don't quite understand what the names_to = c(".value", "group") phrase does.
In any event, any help is appreciated.
Thanks
You were on the right path. Renaming is needed because the name columns are the only ones without a suffix to identify them. In names_to, .value says that the matched part of each original column name becomes the name of an output column, rather than a value in a key column. If you drop everything up to the last underscore, what remains are the new column names, and names_pattern captures exactly that part with a regex.
library(dplyr)
library(tidyr)
df %>%
rename(contact_1_name=contact_1,
contact_2_name=contact_2) %>%
pivot_longer(cols = everything(),
names_to = '.value',
names_pattern = '.*_(\\w+)')
# prefix name loc
# <chr> <chr> <chr>
#1 Mr. Bob Johnson Earth
#2 Dr. Tommy Two Tones London
#3 Mrs. Robert Johnson New York
#4 Mr. Tommy Three Tones Geneva
#5 Dr. Bobby Johnson Los Angeles
#6 Mrs. Tommy No Tones Paris
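To illustrate what the two-element form names_to = c(<key>, ".value") does, here is a hedged sketch that keeps the contact number as its own column. The column name "contact" and the reduced two-row data are choices made for this example, and the columns are assumed to be already renamed to the contact_N_name form.

```r
library(tidyr)

# Two rows of the question's data, with the name columns already renamed.
df <- data.frame(contact_1_prefix = c("Mr.", "Mrs."),
                 contact_2_prefix = c("Dr.", "Mr."),
                 contact_1_name = c("Bob Johnson", "Robert Johnson"),
                 contact_2_name = c("Tommy Two Tones", "Tommy Three Tones"),
                 contact_1_loc = c("Earth", "New York"),
                 contact_2_loc = c("London", "Geneva"),
                 stringsAsFactors = FALSE)

# First capture group -> values of the "contact" column;
# second capture group (.value) -> names of the output columns.
long <- pivot_longer(df,
                     cols = everything(),
                     names_to = c("contact", ".value"),
                     names_pattern = "contact_(\\d+)_(.*)")
print(long)
```

Each regex group is paired with one entry of names_to, so the digit ends up as data while prefix, name and loc become columns.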
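Here is a solution using split.default, which groups the columns by the digit in their names. Each piece is renamed positionally with setNames before the pieces are stacked.
data.table::rbindlist(
lapply( split.default( df, gsub( "[^0-9]+", "", names(df) ) ),
setNames,
c("prefix", "name", "loc") ) )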
# prefix name loc
# 1: Mr. Bob Johnson Earth
# 2: Mrs. Robert Johnson New York
# 3: Dr. Bobby Johnson Los Angeles
# 4: Dr. Tommy Two Tones London
# 5: Mr. Tommy Three Tones Geneva
# 6: Mrs. Tommy No Tones Paris

From state and county names to fips in R

I have the following data frame in R and would like to get the fips codes for it. I tried the fips function in usmap (https://rdrr.io/cran/usmap/man/fips.html), but could not get it to work because, as I understood it, the values need to be enclosed in double quotes. I then tried paste0(""", df$state, """), but that did not work either. Is there an efficient way to get the fips codes?
> df1
state county
1 california napa
2 florida palm beach
3 florida collier
4 florida duval
UPDATE
I can get "\"california\"" by using dQuote. Thanks. After converting each column, I tried the following. How do I deal with this issue?
> df1$state <- dQuote(df1$state, FALSE)
> df1$county <- dQuote(df1$county, FALSE)
> fips(state = df1$state, county = df1$county)
Error in fips(state = df1$state, county = df1$county) :
`county` parameter cannot be used with multiple states.
> fips(state = df1$state[1], county = df1$county[1])
Error in fips(state = df1$state[1], county = df1$county[1]) :
"napa" is not a valid county in "california".
> fips(state = "california", county = "napa")
[1] "06055"
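fips accepts several counties only when they belong to a single state, so we can split the dataset by state and apply fips to each piece (using the original, unquoted columns).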
library(usmap)
lapply(split(df1, df1$state), function(x)
fips(state = x$state[1], county = x$county))
#$california
#[1] "06055"
#$florida
#[1] "12099" "12021" "12031"
Or with Map
lst1 <- split(df1$county, df1$state)
Map(fips, lst1, state = names(lst1))
#$california
#[1] "06055"
#$florida
#[1] "12099" "12021" "12031"
Or with tidyverse
library(dplyr)
library(tidyr)
df1 %>%
group_by(state) %>%
summarise(new = list(fips(state = first(state), county = county))) %>%
unnest(c(new))
# A tibble: 4 x 2
# state new
# <chr> <chr>
#1 california 06055
#2 florida 12099
#3 florida 12021
#4 florida 12031
data
df1 <- structure(list(state = c("california", "florida", "florida",
"florida"), county = c("napa", "palm beach", "collier", "duval"
)), class = "data.frame", row.names = c("1", "2", "3", "4"))
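The dQuote attempt in the question's update fails because dQuote puts literal quote characters into the string itself, so the value no longer equals the plain name. A minimal base R sketch of that difference:

```r
state <- "california"
quoted <- dQuote(state, FALSE)  # FALSE: plain ASCII quotes, no fancy quotes

print(quoted)                    # "\"california\""
print(nchar(state))              # 10
print(nchar(quoted))             # 12: the two quote characters are part of the value
print(identical(quoted, state))  # FALSE
```

The quotes shown when R prints a character vector are only display syntax, so the unquoted column values can be passed to a function as they are.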

Combine multiple columns keeping variable name as part of data

I have data as below
df=data.frame(
Id=c("001","002","003","004"),
author=c('John Cage','Thomas Carlyle'),
circa=c('1988', '1817'),
quote=c('I cant understand why people are frightened of new ideas. Im frightened of the old ones.',
'My books are friends that never fail me.')
)
df
I would like to combine 3 columns to obtain the data frame below
df2 = data.frame(
Id=c("001","002"),
text = c(
'Author:
John Cage
Circa:
1988
quote:
I cant understand why people are frightened of new ideas. Im frightened of the old ones.
',
'Author:
Thomas Carlyle
Circa:
1817
quote:
My books are friends that never fail me.
'
)
)
df2
I am aware I can use paste or unite from tidyr, but how can I include the column names in the newly created column?
You can get the data in long format and then paste by group.
library(dplyr)
df %>%
tidyr::pivot_longer(cols = -Id) %>%
group_by(Id) %>%
summarise(text = paste(name, value, sep = ":", collapse = "\n"))
# A tibble: 4 x 2
# Id text
# <fct> <chr>
#1 001 "author:John Cage\ncirca:1988\nquote:I cant understand why people are f…
#2 002 "author:Thomas Carlyle\ncirca:1817\nquote:My books are friends that nev…
#3 003 "author:John Cage\ncirca:1988\nquote:I cant understand why people are f…
#4 004 "author:Thomas Carlyle\ncirca:1817\nquote:My books are friends that nev…
Here is a base R solution using paste0():
res <- cbind(df[1],
text = apply(apply(df[-1], 1, function(v) paste0(names(df[-1]), ": ", v)),
2, paste0, collapse = "\n"))
such that
> res
Id text
1 001 author: John Cage\ncirca: 1988\nquote: I cant understand why people are frightened of new ideas. Im frightened of the old ones.
2 002 author: Thomas Carlyle\ncirca: 1817\nquote: My books are friends that never fail me.
DATA
df <- structure(list(Id = structure(1:2, .Label = c("001", "002"), class = "factor"),
author = structure(1:2, .Label = c("John Cage", "Thomas Carlyle"
), class = "factor"), circa = structure(2:1, .Label = c("1817",
"1988"), class = "factor"), quote = structure(1:2, .Label = c("I cant understand why people are frightened of new ideas. Im frightened of the old ones.",
"My books are friends that never fail me."), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
We can use melt in data.table
library(data.table)
melt(setDT(df), id.var = 'Id')[, .(text = paste(variable,
value, sep=":", collapse="\n")), Id]
# Id text
#1: 001 author:John Cage\ncirca:1988\nquote:I cant understand why people are frightened of new ideas. Im frightened of the old ones.
#2: 002 author:Thomas Carlyle\ncirca:1817\nquote:My books are friends that never fail me.
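The answers above put the label and the value on the same line. If the label should sit on its own line, as in the desired output of the question, one possible base R sketch is the following. It keeps the column names as labels unchanged (no capitalisation) and assumes character columns.

```r
df <- data.frame(Id = c("001", "002"),
                 author = c("John Cage", "Thomas Carlyle"),
                 circa = c("1988", "1817"),
                 quote = c("I cant understand why people are frightened of new ideas. Im frightened of the old ones.",
                           "My books are friends that never fail me."),
                 stringsAsFactors = FALSE)

cols <- setdiff(names(df), "Id")

# For each row, emit "label:" and the value on separate lines.
text <- vapply(seq_len(nrow(df)), function(i) {
  paste0(cols, ":\n", unlist(df[i, cols]), collapse = "\n")
}, character(1))

res <- data.frame(Id = df$Id, text = text, stringsAsFactors = FALSE)
cat(res$text[2], "\n")
```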
