Here is a small sample of a larger character string that I have (no whitespaces). It contains fictional details of individuals.
Each individual is separated by a period (.). There are 10 attributes for each individual.
txt = "EREKSON(Andrew,Hélène),female10/06/2011#Geneva(Switzerland),PPF,2000X007707,dist.093,Dt.043/996.BOUKAR(Mohamed,El-Hadi),male04/12/1956#London(England),PPF,2001X005729,dist.097,Dt.043/997.HARIMA(Olak,N’nassik,Gerad,Elisa,Jeremie),female25/06/2013#Paris(France),PPF,2009X005729,dist.088,Dt.043/998.THOMAS(Hajil,Pau,Joëli),female03/03/1980#Berlin(Germany),VAT,2010X006016,dist.078,Dt.043/999."
I'd like to parse this into a dataframe with one row per individual and 10 columns, one per attribute.
I've tried using regex and looking at other text extraction solutions on stackoverflow, but haven't been able to reach the output I want.
This is the final dataframe I have in mind, based on the character string input:
result = data.frame(first_names = c('Andrew Hélène','Mohamed El-Hadi','Olak N’nassik Gerad Elisa Jeremie','Hajil Pau Joëli'),
family_name = c('EREKSON','BOUKAR','HARIMA','THOMAS'),
gender = c('female','male','female','female'),
birthday = c('10/06/2011','04/12/1956','25/06/2013','03/03/1980'),
birth_city = c('Geneva','London','Paris','Berlin'),
birth_country = c('Switzerland','England','France','Germany'),
acc_type = c('PPF','PPF','PPF','VAT'),
acc_num = c('2000X007707','2001X005729','2009X005729','2010X006016'),
district = c('dist.093','dist.097','dist.088','dist.078'),
code = c('Dt.043/996','Dt.043/997','Dt.043/998','Dt.043/999'))
Any help would be much appreciated
Here's a tidy solution with tidyr's functions separate_rows and extract:
library(tidyr)
data.frame(txt) %>%
# separate `txt` into rows using the dot `.` *if*
# preceded by `Dt\\.\\d{3}/\\d{3}` as splitting pattern:
separate_rows(txt, sep = "(?<=Dt\\.\\d{3}/\\d{3})\\.(?!$)") %>%
extract(
# select column from which to extract:
txt,
# define column names into which to extract:
into = c("family_name","first_names","gender",
"birthday","birth_city","birth_country",
"acc_type","acc_num","district","code"),
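# describe the string exhaustively using capturing groups
# `(...)` to delimit what's to be extracted; `[^)]+` lets the first
# names contain hyphens and apostrophes (e.g. El-Hadi, N’nassik),
# and the last two groups keep the `dist.` and `Dt.` prefixes:
regex = "([A-Z]+)\\(([^)]+)\\),([a-z]+)([\\d/]+)#(\\w+)\\((\\w+)\\),([A-Z]+),(\\w+),(dist\\.\\d+),(Dt\\.[\\d/]+)")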
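Output (produced with an earlier version of the sample string, so the names and the first gender differ from txt above):
# A tibble: 4 × 10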
family_name first_names gender birthday birth_city birth_country acc_type acc_num
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 EREKSON Andrew,Peter male 10/06/2011 Geneva Switzerland PPF 2000X007…
2 OBAMA Barack,Hussian male 04/12/1956 London England PPF 2001X005…
3 CLINTON Hillary female 25/06/2013 Paris France PPF 2009X005…
4 GATES Melinda female 03/03/1980 Berlin Germany VAT 2010X006…
# … with 2 more variables: district <chr>, code <chr>
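The splitting step can be tried on its own. The following is a minimal sketch that swaps in base R's strsplit (with perl = TRUE) for separate_rows; sample_txt is a made-up two-record string in the same layout as txt, not the original data. It shows that the lookbehind only splits on the period that follows a complete Dt.ddd/ddd code and leaves the periods inside dist.001 and Dt.043 alone.

```r
# Hypothetical two-record string in the same layout as `txt`
sample_txt <- paste0(
  "ALPHA(Ann,Bo),female01/02/2003#Oslo(Norway),PPF,2000X000001,dist.001,Dt.043/996.",
  "BETA(Cy),male04/05/2006#Rome(Italy),VAT,2001X000002,dist.002,Dt.043/997."
)

# Split only on a "." that directly follows "Dt.ddd/ddd";
# the lookbehind (?<=...) checks the preceding text without consuming it
records <- strsplit(sample_txt, "(?<=Dt\\.\\d{3}/\\d{3})\\.", perl = TRUE)[[1]]

# One element per individual; the separating periods are dropped
print(records)
```

Because strsplit discards a match at the very end of the string, the `(?!$)` guard used with separate_rows is not needed in this variant.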
Here is a solution using the tidyverse which pipes together different stringr functions to clean the string, before having readr read it, basically as a CSV:
library(dplyr, warn.conflicts = FALSE) # for pipes
df <-
txt %>%
# Replace "." sep with newline
stringr::str_replace_all(
"\\.[A-Z]",
function(x) stringr::str_replace(x, "\\.", "\n")
) %>%
# Replace all commas in (First[,Middle1,Middle2,...]) with space
stringr::str_replace_all(
# Match anything inside brackets, but as few times as possible, so we don't
# match multiple brackets
"\\(.*?\\)",
# Inside the regex that was matched, replace comma with space
function(x) stringr::str_replace_all(x, ",", " ")
) %>%
# Replace ( with ,
stringr::str_replace_all("\\(", ",") %>%
# Remove )
stringr::str_remove_all("\\)") %>%
# Replace # with ,
stringr::str_replace_all("#", ",") %>%
# Remove the last "."
stringr::str_replace_all("\\.$", "\n") %>%
# Add , after female/male
stringr::str_replace_all("male", "male,") %>%
# Read as comma delimited file (works since string contains \n)
readr::read_delim(
file = .,
delim = ",",
col_names = FALSE,
show_col_types = FALSE
)
# Add names (could also be done directly in read_delim with col_names argument)
names(df) <- c(
"family_name",
"first_names",
"gender",
"birthday",
"birth_city",
"birth_country",
"acc_type",
"acc_num",
"district",
"code"
)
df
#> # A tibble: 4 × 10
#> family_name first_names gender birthday birth_city birth_country acc_type
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 EREKSON Andrew Hélène female 10/06/2… Geneva Switzerland PPF
#> 2 BOUKAR Mohamed El-Hadi male 04/12/1… London England PPF
#> 3 HARIMA Olak N’nassik G… female 25/06/2… Paris France PPF
#> 4 THOMAS Hajil Pau Joëli female 03/03/1… Berlin Germany VAT
#> # … with 3 more variables: acc_num <chr>, district <chr>, code <chr>
Created on 2022-03-20 by the reprex package (v2.0.1)
Note that more efficient regexes probably exist, but I believe this approach is simpler and easier to change later.
Question
What is the proper way to quote a parameter in a function that will be used to create a new variable that will be passed to another function?
Background
Ultimate goal is to create labels in a dataframe for a treemap with 2 levels of hierarchy, and I'm trying to create a reusable function. Here's a more basic example:
Example
library(scales)
library(tidyverse)
# Create dataframe
region = rep(c("North", "South"), 3)
district <- sprintf("Dist-%d", 1:6)
sales <- seq(2000, 1500000000, length.out = 6)
df <- tibble(region, district, sales)
df
# A tibble: 6 × 3
region district sales
<chr> <chr> <dbl>
1 North Dist-1 2000
2 South Dist-2 300001600
3 North Dist-3 600001200
4 South Dist-4 900000800
5 North Dist-5 1200000400
6 South Dist-6 1500000000
I created this helper function to format the currency. It will be used in the main function, and my issue is related to passing a new variable name from the main function to this helper:
# First function for formatting currency
mydollars <- scales::label_dollar(prefix = "$",
largest_with_cents = 5000,
scale_cut = c(0, " K" = 1e3, " M" = 1e6, " B" = 1e9, " T" = 1e12)
)
# Example function output
mydollars(df$sales)
[1] "$2 K" "$300 M" "$600 M" "$900 M" "$1.2 B" "$1.5 B"
This is the main function that uses the helper above. I pass a dataframe to the function and create the second-level ".index" label. Then I group and aggregate the number column, appending a "2" suffix to its name so I know it's the second number. My problem arises inside paste(), at mydollars("{{agg_number}}2"). If I replace that code with "Test String", the function works.
treemap_index1 <- function(df, category1, category2, agg_number){
df_out <- df %>%
mutate("{{category2}}.index" := paste({{category2}}, mydollars({{agg_number}}), sep = "\n")) %>%
group_by({{category1}}) %>%
mutate("{{agg_number}}2" := sum({{agg_number}}),
"{{category1}}.index" := paste({{category1}},
mydollars("{{agg_number}}2"), # Code breaks on this line
sep = "\n")) %>%
print()
return(df_out)
}
treemap_index1(df, region, district, sales)
rlang::last_error()
<error/dplyr:::mutate_error>
Error in `mutate()`:
! Problem while computing `region.index = paste(region, mydollars("{{agg_number}}2"), sep = "\n")`.
ℹ The error occurred in group 1: region = "North".
Caused by error in `x * scale`:
! non-numeric argument to binary operator
---
Backtrace:
1. global treemap_index1(df, region, district, sales)
10. scales (local) mydollars("{{agg_number}}2")
11. scales::dollar(...)
12. scales::number(...)
13. scales:::scale_cut(...)
14. base::cut(...)
Run `rlang::last_trace()` to see the full context.
If I replace the offending code as shown below, the function works:
treemap_index2 <- function(df, category1, category2, agg_number){
df_out <- df %>%
mutate("{{category2}}.index" := paste({{category2}}, mydollars({{agg_number}}), sep = "\n")) %>%
group_by({{category1}}) %>%
mutate("{{agg_number}}2" := sum({{agg_number}}),
"{{category1}}.index" := paste({{category1}},
"Test String", # Temporarily replaced code
sep = "\n")) %>%
print()
return(df_out)
}
treemap_index2(df, region, district, sales)
# A tibble: 6 × 6
# Groups: region [2]
region district sales district.index sales2 region.index
<chr> <chr> <dbl> <chr> <dbl> <chr>
1 North Dist-1 2000 "Dist-1\n$2 K" 1800003600 "North\nTest String"
2 South Dist-2 300001600 "Dist-2\n$300 M" 2700002400 "South\nTest String"
3 North Dist-3 600001200 "Dist-3\n$600 M" 1800003600 "North\nTest String"
4 South Dist-4 900000800 "Dist-4\n$900 M" 2700002400 "South\nTest String"
5 North Dist-5 1200000400 "Dist-5\n$1.2 B" 1800003600 "North\nTest String"
6 South Dist-6 1500000000 "Dist-6\n$1.5 B" 2700002400 "South\nTest String"
Assistance appreciated...
I would appreciate guidance on how to properly pass the new variable name to the helper function. As I am new to data-masking, quosures, and non-standard evaluation, any other comments on how this could be done better are also welcome. Thank you.
Adapting the answer by Lionel Henry (@LionelHenry), one option would be to use rlang::englue and the .data pronoun, like so:
library(scales)
library(tidyverse)
treemap_index1 <- function(df, category1, category2, agg_number) {
df %>%
mutate("{{category2}}.index" := paste({{ category2 }}, mydollars({{ agg_number }}), sep = "\n")) %>%
group_by({{ category1 }}) %>%
mutate(
"{{agg_number}}2" := sum({{ agg_number }}),
"{{category1}}.index" := paste(
{{ category1 }},
mydollars(.data[[rlang::englue("{{agg_number}}2")]]),
sep = "\n"
)
)
}
treemap_index1(df, region, district, sales)
#> # A tibble: 6 × 6
#> # Groups: region [2]
#> region district sales district.index sales2 region.index
#> <chr> <chr> <dbl> <chr> <dbl> <chr>
#> 1 North Dist-1 2000 "Dist-1\n$2 K" 1800003600 "North\n$2 B"
#> 2 South Dist-2 300001600 "Dist-2\n$300 M" 2700002400 "South\n$3 B"
#> 3 North Dist-3 600001200 "Dist-3\n$600 M" 1800003600 "North\n$2 B"
#> 4 South Dist-4 900000800 "Dist-4\n$900 M" 2700002400 "South\n$3 B"
#> 5 North Dist-5 1200000400 "Dist-5\n$1.2 B" 1800003600 "North\n$2 B"
#> 6 South Dist-6 1500000000 "Dist-6\n$1.5 B" 2700002400 "South\n$3 B"
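Why this works: on the right-hand side of :=, "{{agg_number}}2" is just a literal string, so mydollars() receives text rather than numbers, hence the "non-numeric argument" error. The glue-style interpolation only happens automatically in the name on the left-hand side. rlang::englue() performs the same interpolation explicitly and returns the column name as a string, which can then be used for a lookup. A minimal sketch of just that step (agg_name and toy are made-up names for illustration, and a plain data frame lookup stands in for .data[[...]]):

```r
library(rlang)

# Hypothetical helper: turn the column passed as `var` into the
# name of the aggregated column, e.g. sales -> "sales2"
agg_name <- function(var) {
  englue("{{ var }}2")
}

col <- agg_name(sales)
print(col)

# The resulting string can be used to look a column up by name,
# which is what .data[[...]] does inside mutate()
toy <- data.frame(sales = c(1, 2), sales2 = c(3, 3))
print(toy[[col]])
```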
I have a dataset called cookbooks4 that currently looks like this:
SectionTitle Author Work PubYear Item
Albertine Susan Markovitz The salad book 1928 pear,cream cheese,lettuce, vinegar
The dataset is long, so I have included only the first line; as you can imagine, each recipe has its own ingredients. I need to create dummy variables for Item (the ingredients): I would like to take every unique ingredient and give it its own column, so the result looks like the table below. Bear in mind that this is only the first line; I should end up with roughly 630 different ingredients (so 630 dummy columns). The columns after cream may be ingredients that are not listed in the Item column of this particular recipe, in which case the dummy would be 0. Any help would be greatly appreciated.
SectionTitle Author Work PubYear Item pear cream ....
Albertine Susan... The salad.. 1928 pear,cream... 1 1
I tried the following, but I get an error message. I would also need to keep all the other columns.
library(dplyr)
library(stringr)
final <- strsplit(cookbooks4$Item, split = ",")
Item <- unique(str_trim(unlist(t)))
final2 <- as.data.frame(Reduce(cbind, lapply(Item, function(i) sapply(t, function(j) +(any(grepl(i, j), na.rm = TRUE))))))
names(final2) <- item
library(dplyr)
library(tidyr)
data <- data.frame(SectionTitle = "Albertine", Author = "Susan Markovitz",
Work = "The salad book", PubYear = 1928L, Item = "pear,cream,cheese,lettuce,vinegar")
data %>% mutate(Itemlist = strsplit(Item,",")) %>% unnest(Itemlist) %>%
pivot_wider(names_from = Itemlist, values_from = Itemlist, values_fn = length)
#> # A tibble: 1 × 10
#> SectionTitle Author Work PubYear Item pear cream cheese lettuce vinegar
#> <chr> <chr> <chr> <int> <chr> <int> <int> <int> <int> <int>
#> 1 Albertine Susan Mar… The … 1928 pear… 1 1 1 1 1
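With more than one recipe, ingredients that a recipe lacks come out as NA in the approach above, and stray spaces after commas produce separate columns such as " vinegar". A possible variation, sketched under the assumption that a 0/1 dummy is wanted, splits on a comma with optional surrounding whitespace and fills the gaps with 0. The recipes data frame below is invented for illustration; only its first Item value comes from the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical two-recipe example
recipes <- data.frame(
  Work = c("The salad book", "Soups"),
  Item = c("pear,cream cheese,lettuce, vinegar", "cream cheese, onion"),
  stringsAsFactors = FALSE
)

dummies <- recipes %>%
  mutate(ingredient = Item) %>%                       # keep Item intact
  separate_rows(ingredient, sep = "\\s*,\\s*") %>%    # one row per ingredient, spaces trimmed
  distinct() %>%                                      # guard against repeated ingredients
  mutate(present = 1L) %>%
  pivot_wider(names_from = ingredient, values_from = present, values_fill = 0L)

print(dummies)
```

All original columns are kept because pivot_wider treats every column not named in names_from or values_from as an identifier.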
string_dat <- structure(list(ID = c(2455, 2455), Location = c("c(\"Southside of Dune\", \"The Hogwarts Express\")",
"Vertex, Inc.")), class = "data.frame", row.names = c(NA, -2L
))
> string_dat
ID Location
1 2455 c("Southside of Dune", "The Hogwarts Express")
2 2455 Vertex, Inc.
I would like to expand the data.frame above based on Location.
library(tidyr)
> string_dat %>% tidyr::separate_rows(Location, sep = ",")
# A tibble: 4 × 2
ID Location
<dbl> <chr>
1 2455 "c(\"Southside of Dune\""
2 2455 " \"The Hogwarts Express\")"
3 2455 "Vertex"
4 2455 " Inc."
Splitting on just , wrongly splits Vertex, Inc. into two entries. It also does not remove the c(\", the \" and the ) around the first two strings.
I also tried to remove the c(\" at the beginning by using gsub, but it gave me the following error.
> gsub('c(\"', "", x = string_dat$Location)
Error in gsub("c(\"", "", x = string_dat$Location) :
invalid regular expression 'c("', reason 'Missing ')''
My desired output is
# A tibble: 3 × 2
ID Location
<dbl> <chr>
1 2455 "Southside of Dune"
2 2455 "The Hogwarts Express"
3 2455 "Vertex, Inc."
********** Edit **********
library(tidyverse)
string_dat %>%
mutate(
# mark twin elements with `;`:
Location = str_replace(Location, '",', '";'),
# remove string-first `c` and all non-alphanumeric characters
# except `,`, `.`, and `;`:
Location = str_replace_all(Location, '^c|(?![.,; ])\\W', '')) %>%
separate_rows(Location, sep = '; ')
# A tibble: 3 × 2
ID Location
<dbl> <chr>
1 2455 "c(\"Southside of Dune\""
2 2455 "\"The Hogwarts Express\")"
3 2455 "Vertex, Inc."
Here's an approach that combines data cleaning with separate_rows:
library(tidyverse)
string_dat %>%
mutate(
# mark twin elements with `;`:
Location = str_replace(Location, '",', '";'),
# remove string-first `c` and all non-alphanumeric characters
# except `,`, `.`, and `;`:
Location = str_replace_all(Location, '^c|(?![.,; ])\\W', '')) %>%
separate_rows(Location, sep = '; ')
# A tibble: 3 × 2
ID Location
<dbl> <chr>
1 2455 Southside of Dune
2 2455 The Hogwarts Express
3 2455 Vertex, Inc.
How the regex ^c|(?![.,; ])\\W works:
^c: matches literal c at the beginning of the string
|: initiates alternation (i.e., "OR")
(?![.,; ])\\W: matches any non-alphanumeric character (\\W with upper-case "W") except a period, comma, semi-colon, or space; the exception is implemented by the negative lookahead, which rejects a position if the next character is one of those four
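The pattern can be checked in isolation. This is a small sketch using base R's gsub with perl = TRUE instead of str_replace_all; the input x mimics the two Location values after the first replacement step (with `",` already turned into `";`).

```r
# Location values as they look after '",' has been replaced by '";'
x <- c('c("Southside of Dune"; "The Hogwarts Express")', "Vertex, Inc.")

# Drop a leading "c" and every non-word character except . , ; and space
cleaned <- gsub('^c|(?![.,; ])\\W', "", x, perl = TRUE)

print(cleaned)
# [1] "Southside of Dune; The Hogwarts Express" "Vertex, Inc."
```

One caveat of `^c`: it would also strip the first letter of a location that genuinely starts with a lower-case "c", so a stricter alternative such as `^c(?=\\()` may be safer on real data.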
The Location column has a strange data format. In the 1st element, it stores R code, because it's using the c("s1", "s2") syntax for a two-element character vector. For the 2nd element, you're missing escaped quotes for this to be valid R code for a one-element character vector.
If I manually edit the 2nd element to add these quotation marks, then we can easily evaluate the R code contained in the Location column, and then unnest the resulting list column. This might be easier than attempting to edit the strings programmatically?
library(tidyverse)
string_dat <- data.frame(
ID = c(2455, 2455),
Location = c("c(\"Southside of Dune\", \"The Hogwarts Express\")", "\"Vertex, Inc.\"")
)
string_dat %>%
rowwise() %>%
mutate(Location = list(eval(parse(text=Location)))) %>%
unnest(cols=Location)
#> # A tibble: 3 × 2
#> ID Location
#> <dbl> <chr>
#> 1 2455 Southside of Dune
#> 2 2455 The Hogwarts Express
#> 3 2455 Vertex, Inc.
Created on 2022-09-23 by the reprex package (v2.0.1)
I need to split a string column in a dataframe into two columns: the first contains the value before the round brackets, and the second contains the value inside the round brackets.
This is an example:
study_name = c("apple bannan (tcga, raw 2018)", "frame shift (mskk2 nature, 2000)" )
results= c("Untested", "tested")
df = data_frame(study_name,results)
This is how I tried to do it:
df <- df %>%
mutate(reference = str_extract_all(study_name, "\\([^()]+\\)")) %>%
rename(~gsub("\\([^()]+\\)", "", study_name))
This is the expected dataframe:
reference = c("(tcga, raw 2018)", "(mskk2 nature, 2000)")
study = c("apple bannan", "frame shift")
expected_df = data_frame(study, reference)
You can use separate() and set the separator as "\\s(?=\\()".
library(tidyr)
df %>%
separate(study_name, c("study", "reference"), sep = "\\s(?=\\()")
# # A tibble: 2 x 3
# study reference results
# <chr> <chr> <chr>
# 1 apple bannan (tcga, raw 2018) Untested
# 2 frame shift (mskk2 nature, 2000) tested
If you want to extract the text inside the parentheses without the parentheses themselves, extract() is a suitable choice.
df %>%
extract(study_name, c("study", "reference"), regex = "(.+)\\s\\((.+)\\)")
# # A tibble: 2 x 3
# study reference results
# <chr> <chr> <chr>
# 1 apple bannan tcga, raw 2018 Untested
# 2 frame shift mskk2 nature, 2000 tested
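The separator `\\s(?=\\()` used with separate() matches a single whitespace character only when it is immediately followed by an opening parenthesis; the lookahead `(?=...)` checks for the parenthesis without consuming it, which is why the parenthesis stays in the reference column. A minimal sketch of the same split with base R's strsplit (perl = TRUE), offered as an illustration rather than a replacement for the tidyr code:

```r
study_name <- c("apple bannan (tcga, raw 2018)", "frame shift (mskk2 nature, 2000)")

# Split at the space that precedes "("; other spaces are left alone
parts <- strsplit(study_name, "\\s(?=\\()", perl = TRUE)

# Bind the pieces into a two-column character matrix
split_mat <- do.call(rbind, parts)
colnames(split_mat) <- c("study", "reference")
print(split_mat)
```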
We can use str_extract thus:
library(stringr)
df$reference <- str_extract(df$study_name, "\\(.*\\)")
df$study <- str_extract(df$study_name, ".*(?= \\(.*\\))")
Result:
df
study_name results reference study
1 apple bannan (tcga, raw 2018) Untested (tcga, raw 2018) apple bannan
2 frame shift (mskk2 nature, 2000) tested (mskk2 nature, 2000) frame shift
If you no longer want the study_name column, remove it thus:
df$study_name <- NULL
I want to extract part of the data from one column and paste it into another column using R:
My data looks like this:
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NULL,+3579514862,NULL,+5554848123)
data <- data.frame(names,country,mobile)
data
> data
names country mobile
1 Sia London +1234567890 NULL
2 Ryan Paris +3579514862
3 J Sydney +0123458796 NULL
4 Ricky Delhi +5554848123
I would like to separate phone number from country column wherever applicable and put it into mobile.
The output should be,
> data
names country mobile
1 Sia London +1234567890
2 Ryan Paris +3579514862
3 J Sydney +0123458796
4 Ricky Delhi +5554848123
You can use the tidyverse, whose tidyr package provides a separate function.
Note that I use NA rather than NULL inside the mobile vector, because NULL elements are dropped by c().
I also use the option stringsAsFactors = F when creating the dataframe to avoid working with factors.
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NA, "+3579514862", NA, "+5554848123")
data <- data.frame(names,country,mobile, stringsAsFactors = F)
library(tidyverse)
data %>% as_tibble() %>%
separate(country, c("country", "number"), sep = " ", fill = "right") %>%
mutate(mobile = coalesce(mobile, number)) %>%
select(-number)
# A tibble: 4 x 3
names country mobile
<chr> <chr> <chr>
1 Sia London +1234567890
2 Ryan Paris +3579514862
3 J Sydney +0123458796
4 Ricky Delhi +5554848123
EDIT
If you want to do this without the pipes (which I would not recommend because the code becomes much harder to read) you can do this:
select(
  mutate(
    separate(as_tibble(data), country, c("country", "number"), sep = " ", fill = "right"),
    mobile = coalesce(mobile, number)
  ),
  -number
)
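Splitting on a plain space assumes that every place name is a single word. If that does not hold, one possible variation is to split only on whitespace that is followed by a +, using a lookahead. The sketch below uses made-up data (data2, with a two-word city) and otherwise follows the same steps as above.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with a two-word city name
data2 <- tibble(
  names   = c("Sia", "Ryan"),
  country = c("New York +1234567890", "Paris"),
  mobile  = c(NA, "+3579514862")
)

result <- data2 %>%
  # split only where whitespace is followed by "+", so "New York" stays whole
  separate(country, c("country", "number"), sep = "\\s+(?=\\+)", fill = "right") %>%
  mutate(mobile = coalesce(mobile, number)) %>%
  select(-number)

print(result)
```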