How to manipulate digits in a character string in R?

I feel like I have a super easy question but for the life of me I can't find it when googling or searching here (or I don't know the correct terms to find a solution) so here goes.
I have a large amount of text in R in which I want to identify all numbers/digits, and add a specific number to them, for example 5.
So just as a small example, if this were my text:
text <- c("Hi. It is 6am. I want to leave at 7am")
I want the output to be:
> text
[1] "Hi. It is 11am. I want to leave at 12am"
But also I need the addition for each individual digit, so if this is the text:
text <- c("Hi. It is 2017. I am 35 years old.")
...I want the output to be:
> text
[1] "Hi. It is 75612. I am 810 years old."
I have tried 'grabbing' the numbers from the string and adding 5, but I don't know how to then get them back into the original string so I can get the full text back.
How should I go about this? Thanks in advance!

Here is how I would do the time. I would search for a number that is followed by am or pm and then sub in a math expression to be evaluated by gsubfn. This is pretty flexible, but would require whole hours in its current implementation. I added an am and pm if you wanted to swap those, but I didn't try to code in detecting if the number changes from am to pm. Also note that I didn't code in rolling from 12 to 1. If you add numbers over 12, you will get a number bigger than 12.
text1 <- c("Hi. It is 6am. I want to leave at 7am")
text2 <- c("It is 9am. I want to leave at 10am, but the cab comes at 11am. Can I push my flight to 12am?")
change_time <- function(text, hours, sign, am_pm){
string_change <- glue::glue("`(\\1{sign}{hours})`{am_pm}")
gsub("(\\d+)(?=am|pm)(am|pm)", string_change, text, perl = TRUE)|>
gsubfn::fn$c()
}
change_time(text = text1, hours = 5, sign = "+", am_pm = "am")
#> [1] "Hi. It is 11am. I want to leave at 12am"
change_time(text = text2, hours = 3, sign = "-", am_pm = "pm")
#> [1] "It is 6pm. I want to leave at 7pm, but the cab comes at 8pm. Can I push my flight to 9pm?"
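The rollover that the answer above leaves out (12 to 1, and am to pm) can be handled by converting each time to a 24-hour clock before adding. What follows is only a minimal base R sketch, not part of the original answer: `shift_hours()` is a hypothetical helper name, and it assumes whole hours written like 6am or 10pm.

```r
# Hypothetical helper: shift every "<hour>am"/"<hour>pm" token by `hours`,
# wrapping around the clock and flipping am/pm where needed.
shift_hours <- function(text, hours) {
  m <- gregexpr("[0-9]+(am|pm)", text)
  regmatches(text, m) <- lapply(regmatches(text, m), function(tokens) {
    h12   <- as.integer(sub("(am|pm)$", "", tokens))
    is_pm <- grepl("pm$", tokens)
    # 12am is hour 0 and 12pm is hour 12 on a 24-hour clock
    h24 <- (h12 %% 12L + ifelse(is_pm, 12L, 0L) + hours) %% 24
    paste0(ifelse(h24 %% 12 == 0, 12, h24 %% 12),
           ifelse(h24 < 12, "am", "pm"))
  })
  text
}

shift_hours("Hi. It is 6am. I want to leave at 7am", 5)
# [1] "Hi. It is 11am. I want to leave at 12pm"
shift_hours("Cab at 11pm", 3)
# [1] "Cab at 2am"
```

Because R's `%%` returns a non-negative result for a positive modulus, negative shifts wrap backwards in the same way.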

text1 <- c("Hi. It is 2017. I am 35 years old.")
text2 <- c("Hi. It is 6am. I want to leave at 7am")
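change_number <- function(text, change, sign){
  # wrap every single digit in a backticked expression such as `(7+5)` ...
  string_change <- glue::glue("`(\\1{sign}{change})`")
  # ... and let gsubfn evaluate those expressions (native pipe, so no magrittr is needed)
  gsub("(\\d)", string_change, text, perl = TRUE) |>
    gsubfn::fn$c()
}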
change_number(text = text1, change = 5, sign = "+")
#>[1] "Hi. It is 75612. I am 810 years old."
change_number(text = text2, change = 5, sign = "+")
#>[1] "Hi. It is 11am. I want to leave at 12am"
This works perfectly. Many thanks to @AndS. I tweaked (or rather, simplified) your code to fit my needs better. I was determined to figure out the other text myself haha, so thanks for showing me how!

Something quick and dirty with base R:
add_n = \(x, n, by_digit = FALSE) {
  # match single digits or whole numbers, depending on `by_digit`
  ptrn = if (by_digit) "[0-9]" else "[0-9]+"
  tmp = gregexpr(ptrn, x)
  raw = regmatches(x, tmp)
  raw_plusn = lapply(raw, \(x) as.integer(x) + n)
  # write the shifted numbers back into their original positions
  for (i in seq_along(x)) regmatches(x[i], tmp[i]) = raw_plusn[i]
  x
}
text = c(
"Hi. It is 6am. I want to leave at 7am",
"wow it's 505 dollars and 19 cents",
"Hi. It is 2017. I am 35 years old."
)
> add_n(text, 5)
# [1] "Hi. It is 11am. I want to leave at 12am"
# [2] "wow it's 510 dollars and 24 cents"
# [3] "Hi. It is 2022. I am 40 years old."
> add_n(text, -2)
# [1] "Hi. It is 4am. I want to leave at 5am" "wow it's 503 dollars and 17 cents"
# [3] "Hi. It is 2015. I am 33 years old."
> add_n(text, 5, by_digit = TRUE)
# [1] "Hi. It is 11am. I want to leave at 12am"
# [2] "wow it's 10510 dollars and 614 cents"
# [3] "Hi. It is 75612. I am 810 years old."

Here's a tidyverse solution:
library(tidyverse)
data.frame(text) %>%
# separate `text` into individual characters:
separate_rows(text, sep = "(?<!^)(?!$)") %>%
# add `5` to any digit:
mutate(
# if you detect a digit...
text = ifelse(str_detect(text, "\\d"),
# ... extract it, convert it to numeric, add `5`:
as.numeric(str_extract(text, "\\d")) + 5,
# ... else leave `text` as is:
text)
) %>%
# string the characters back together:
summarise(text = str_c(text, collapse = ""))
# A tibble: 1 × 1
text
<chr>
1 Hi. It is 11am. I want to leave at 12am
Data 1:
text <- c("Hi. It is 6am. I want to leave at 7am")
Note that the same code works for the second text as well without any change:
# A tibble: 1 × 1
text
<chr>
1 Hi. It is 75612. I am 810 years old.
Data 2:
text <- c("Hi. It is 2017. I am 35 years old.")
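The same split-into-characters idea also works without any packages. The following is a minimal base R sketch rather than a tested replacement for the answer above; `add_to_digits()` is a hypothetical name, and it assumes only the ASCII digits 0-9 should be shifted.

```r
# Hypothetical helper: add `n` to every individual digit of each string.
add_to_digits <- function(x, n) {
  vapply(strsplit(x, ""), function(chars) {
    is_digit <- grepl("^[0-9]$", chars)
    # a digit may turn into a two-character number, e.g. "7" -> "12"
    chars[is_digit] <- as.character(as.integer(chars[is_digit]) + n)
    paste(chars, collapse = "")
  }, character(1))
}

add_to_digits(c("Hi. It is 6am. I want to leave at 7am",
                "Hi. It is 2017. I am 35 years old."), 5)
# [1] "Hi. It is 11am. I want to leave at 12am"
# [2] "Hi. It is 75612. I am 810 years old."
```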

Related

Extracting gene name and ID number from a vector [duplicate]

What gsub function can I use in R to get the gene name and the id number from a vector which looks like this?
head(colnames(cn), 20)
[1] "A1BG (1)" "NAT2 (10)" "ADA (100)" "CDH2 (1000)" "AKT3 (10000)" "GAGE12F (100008586)"
[7] "RNA5-8SN5 (100008587)" "RNA18SN5 (100008588)" "RNA28SN5 (100008589)" "LINC02584 (100009613)" "POU5F1P5 (100009667)" "ZBTB11-AS1 (100009676)"
[13] "MED6 (10001)" "NR2E3 (10002)" "NAALAD2 (10003)" "DUXB (100033411)" "SNORD116-1 (100033413)" "SNORD116-2 (100033414)"
[19] "SNORD116-3 (100033415)" "SNORD116-4 (100033416)"
1) Assuming the input s given in the Note at the end we can use read.table specifying that the fields are separated by ( and that ) is a comment character. We also strip white space around fields and give meaningful column names. No packages are used.
DF <- read.table(text = s, sep = "(", comment.char = ")",
strip.white = TRUE, col.names = c("Gene", "Id"))
DF
giving this data frame, in which DF$Gene holds the gene names and DF$Id the ids:
Gene Id
1 A1BG 1
2 NAT2 10
3 ADA 100
4 CDH2 1000
5 AKT3 10000
6 GAGE12F 100008586
7 RNA5-8SN5 100008587
8 RNA18SN5 100008588
9 RNA28SN5 100008589
10 LINC02584 100009613
11 POU5F1P5 100009667
12 ZBTB11-AS1 100009676
13 MED6 10001
14 NR2E3 10002
15 NAALAD2 10003
16 DUXB 100033411
17 SNORD116-1 100033413
18 SNORD116-2 100033414
19 SNORD116-3 100033415
20 SNORD116-4 100033416
2) A variation of the above is to first remove the parentheses and then read it in giving the same result. Note that the second argument of chartr contains two spaces so that each parenthesis is translated to a space.
read.table(text = chartr("()", "  ", s), col.names = c("Gene", "Id"))
Note
Lines <- '[1] "A1BG (1)" "NAT2 (10)" "ADA (100)" "CDH2 (1000)" "AKT3 (10000)" "GAGE12F (100008586)"
[7] "RNA5-8SN5 (100008587)" "RNA18SN5 (100008588)" "RNA28SN5 (100008589)" "LINC02584 (100009613)" "POU5F1P5 (100009667)" "ZBTB11-AS1 (100009676)"
[13] "MED6 (10001)" "NR2E3 (10002)" "NAALAD2 (10003)" "DUXB (100033411)" "SNORD116-1 (100033413)" "SNORD116-2 (100033414)"
[19] "SNORD116-3 (100033415)" "SNORD116-4 (100033416)" '
L <- Lines |>
textConnection() |>
readLines() |>
gsub(pattern = "\\[\\d+\\]", replacement = "")
s <- scan(text = L, what = "")
so s looks like this:
> dput(s)
c("A1BG (1)", "NAT2 (10)", "ADA (100)", "CDH2 (1000)", "AKT3 (10000)",
"GAGE12F (100008586)", "RNA5-8SN5 (100008587)", "RNA18SN5 (100008588)",
"RNA28SN5 (100008589)", "LINC02584 (100009613)", "POU5F1P5 (100009667)",
"ZBTB11-AS1 (100009676)", "MED6 (10001)", "NR2E3 (10002)", "NAALAD2 (10003)",
"DUXB (100033411)", "SNORD116-1 (100033413)", "SNORD116-2 (100033414)",
"SNORD116-3 (100033415)", "SNORD116-4 (100033416)")
First, in the future please share your data using the dput() command. See this for details.
Second, here is one solution for extracting the parts you need:
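library(tidyverse)
g <- c("A1BG (1)", "NAT2 (10)", "ADA (100)", "RNA18SN5 (100008588)", "RNA28SN5 (100008589)")
# id number: everything between the parentheses, without the parentheses themselves
gnumber <- stringr::str_extract(g, "(?<=\\().*?(?=\\))")
gnumber
# leading letters only (this gives "A" for "A1BG")
gname <- stringr::str_extract(g, "[:alpha:]+")
gname
# or, to get the whole first word:
gname <- stringr::word(g, 1, 1)
gname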
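Since the question asks specifically about gsub-style functions, here is a minimal base R sketch using `sub()`. It is not taken from either answer, and it assumes every element looks like `NAME (digits)` with ids small enough to fit in an integer.

```r
s <- c("A1BG (1)", "NAT2 (10)", "RNA5-8SN5 (100008587)")

# gene name: drop everything from the opening parenthesis onwards
gene <- sub(" *\\(.*$", "", s)
# id: keep only the digits captured between the parentheses
id <- as.integer(sub("^.*\\(([0-9]+)\\) *$", "\\1", s))

data.frame(Gene = gene, Id = id)
```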

Extract certain words from dynamic strings vector

I'm working with questionnaire datasets where I need to extract some brands' names from several questions. The problem is each data might have a different question line, for example:
Data #1
What do you know about AlphaToy?
Data #2
What comes to your mind when you heard AlphaCars?
Data #3
What do you think of FoodTruckers?
What I want to extract are the words AlphaToy, AlphaCars, and FoodTruckers. In Excel, I can get those brand names via Flash Fill; the illustration is below.
As I'm working with R, I need to convert the "Flash Fill" step into an R function, but I couldn't find out how to do it. Here's the desired output:
brandName <- list(
Toy = c(
"1. What do you know about AlphaToy?",
"2. What do you know about BetaToyz?",
"3. What do you know about CharlieDoll?",
"4. What do you know about DeltaToys?",
"5. What do you know about Echoty?"
),
Car = c(
"18. What comes to your mind when you heard AlphaCars?",
"19. What comes to your mind when you heard BestCar?",
"20. What comes to your mind when you heard CoolCarz?"
),
Trucker = c(
"5. What do you think of FoodTruckers?",
"6. What do you think of IceCreamTruckers?",
"7. What do you think of JellyTruckers?",
"8. What do you think of SodaTruckers?"
)
)
extractBrandName <- function(...) {
#some codes here
}
#desired output
> extractBrandName(brandName$Toy)
[1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
As the title says, the function should work on dynamic strings, so when the function is applied to brandName the desired output is:
> lapply(brandName, extractBrandName)
$Toy
[1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
$Car
[1] "AlphaCars" "BestCar" "CoolCarz"
$Trucker
[1] "FoodTruckers" "IceCreamTruckers" "JellyTruckers" "SodaTruckers"
Edit:
The brand name can be in lowercase, uppercase, or even two or more words, for instance: IBM, Louis Vuitton
The brand names might appear in the middle of the sentence; they don't always come at the end. The sentences are unpredictable because each client might provide different data.
Can anyone help me with the function code to achieve the desired output? Thank you in advance!
Edit: here's my attempt
The idea (thanks to shs' answer) is to find the words that all inputs have in common, then exclude them, leaving the unique words (which should be the brand names) behind. Following this post, I use intersect() wrapped inside Reduce() to get the common words, then I exclude them via lapply() and make sure any brand names of two or more words are merged together with str_c(collapse = " ").
Code
library(stringr)
extractBrandName <- function(x) {
cleanWords <- x %>%
str_remove_all("^\\d+|\\.|,|\\?") %>%
str_squish() %>%
str_split(" ")
commonWords <- cleanWords %>%
Reduce(intersect, .)
extractedWords <- cleanWords %>%
lapply(., function(y) {
y[!y %in% commonWords] %>%
str_c(collapse = " ")
}) %>% unlist()
return(extractedWords)
}
Output (1st test case)
> #output
> extractBrandName(brandName$Toy)
[1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
> lapply(brandName, extractBrandName)
$Toy
[1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
$Car
[1] "AlphaCars" "BestCar" "CoolCarz"
$Trucker
[1] "FoodTruckers" "IceCreamTruckers" "JellyTruckers" "SodaTruckers"
Output (2nd test case)
This test case includes brand names of two or more words, located in the middle and at the beginning of the sentence.
brandName2 <- list(
Middle = c("Have you used any products from AlphaToy this past 6 months?",
"Have you used any products from BetaToys Collection this past 6 months?",
"Have you used any products from Charl TOYZ this past 6 months?"),
First = c("AlphaCars is the best automobile dealer, yes/no?",
"Best Vehc is the best automobile dealer, yes/no?",
"CoolCarz & Bike is the best automobile dealer, yes/no?")
)
> #output
> lapply(brandName2, extractBrandName)
$Middle
[1] "AlphaToy" "BetaToys Collection" "Charl TOYZ"
$First
[1] "AlphaCars" "Best Vehc" "CoolCarz & Bike"
In the end, I found a solution to this problem, thanks to shs, who gave the initial idea, and to the answer from the post I linked above. If you have any suggestions, please feel free to comment. Thank you.
This function checks which words the first two strings have in common and then removes everything from the beginning of the strings up to and including the common element, leaving only the desired part of the string:
library(stringr)
extractBrandName <- function(x) {
x %>%
str_split(" ") %>%
{.[[1]][.[[1]] %in% .[[2]]]} %>%
str_c(collapse = " ") %>%
str_c("^.+", .) %>%
str_remove(x, .) %>%
str_squish() %>%
str_remove("\\?")
}
lapply(brandName, extractBrandName)
#> $Toy
#> [1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
#>
#> $Car
#> [1] "AlphaCars" "BestCar" "CoolCarz"
#>
#> $Trucker
#> [1] "FoodTruckers" "IceCreamTruckers" "JellyTruckers" "SodaTruckers"
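Both approaches above depend on which words the strings share. A different angle, offered only as a sketch under stated assumptions, is positional: trim the longest run of leading words and the longest run of trailing words that all strings have in common, and keep whatever is left in the middle. `extract_by_position()` is a hypothetical name; the sketch assumes at least two inputs, whitespace-separated words, identical wording around the brand, and brands that do not all start or end with the same word.

```r
# Hypothetical helper: keep the words left over after removing the common
# leading and trailing words shared by every string in `x`.
extract_by_position <- function(x) {
  # drop a leading item number ("1.") plus question marks and commas
  cleaned <- trimws(gsub("^[0-9]+\\.|[?,]", "", x))
  words <- strsplit(cleaned, "\\s+")
  shortest <- min(lengths(words))
  # count the leading words that are identical across all strings
  n_prefix <- 0
  while (n_prefix < shortest &&
         length(unique(vapply(words, `[`, "", n_prefix + 1))) == 1) {
    n_prefix <- n_prefix + 1
  }
  # count the identical trailing words, never overlapping the prefix
  n_suffix <- 0
  while (n_prefix + n_suffix < shortest &&
         length(unique(vapply(words, function(w) w[length(w) - n_suffix], ""))) == 1) {
    n_suffix <- n_suffix + 1
  }
  vapply(words, function(w) {
    keep <- seq_len(length(w) - n_prefix - n_suffix) + n_prefix
    paste(w[keep], collapse = " ")
  }, "")
}

toy <- c("1. What do you know about AlphaToy?",
         "2. What do you know about BetaToyz?",
         "3. What do you know about CharlieDoll?")
middle <- c("Have you used any products from AlphaToy this past 6 months?",
            "Have you used any products from BetaToys Collection this past 6 months?",
            "Have you used any products from Charl TOYZ this past 6 months?")

extract_by_position(toy)
# [1] "AlphaToy"    "BetaToyz"    "CharlieDoll"
extract_by_position(middle)
# [1] "AlphaToy"            "BetaToys Collection" "Charl TOYZ"
```

Because the trimming is positional, a brand word is not dropped merely because the same word also occurs elsewhere in the sentence.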

How to Correct Sprintf Input in R to Change Value in XPath?

There are a series of XPaths which correspond to a list of job titles on a webpage.
E.g. the XPath of one job title is //*[@id="ctl00_CPH1_vcyS_vsGrid_ctl00_ctl04_Title"]
The XPath of another job title is //*[@id="ctl00_CPH1_vcyS_vsGrid_ctl00_ctl10_Title"]
The pattern that changes in these are the digits (e.g. 04) in the ctl04 part of the XPath.
So, I'd like to write a for loop which iterates over the XPaths, going from 04 to 18 in steps of 1. I have this code:
for (i in seq(from = 04, to = 18, by = 1)) {
title_xpath <- sprintf('//*[@id="ctl00_CPH1_vcyS_vsGrid_ctl00_ctl%g_Title"]', i)
}
I assumed through sprintf, the '%g' would be replaced with the values of i in the for loop (i.e. try 04, then 05, etc.), up to 18. But this doesn't happen.
Any ideas?
Edit: thanks for the suggestions so far. However, they don't work when I run the full code (pasted below):
title_list <- list()
item_count <- 1
for (i in seq(from = 1, to = 18, by = 1)) {
title_xpath <- sprintf('//*[@id="ctl00_CPH1_vcyS_vsGrid_ctl00_ctl%02d_Title"]', i)
# Find the element on the website and transform it to text directly
job_title <- driver$findElement(using = "xpath",
value = title_xpath)$getElementText()[[1]]
# Add the outcome to the list
title_list[[item_count]] <- job_title
item_count <- item_count + 1
}
print(title_list)
The part in this that doesn't work is related to the XPath. If I change the XPath from ctl%02d to ctl04, the job title at position ctl04 gets printed 18 times. What I want instead is for the code to print the job titles which correspond to ctl04, ctl05, etc., up to ctl18. Help appreciated.
You probably need %02d, which zero-pads the number to two digits. sprintf() is vectorised, so all the XPaths can be built in one call:
title_xpath <- sprintf('//*[@id="ctl00_CPH1_vcyS_vsGrid_ctl00_ctl%02d_Title"]', 4:18)
title_xpath
# [1] "//*[@id=\"ctl00_CPH1_vcyS_vsGrid_ctl00_ctl04_Title\"]"
# [2] "//*[@id=\"ctl00_CPH1_vcyS_vsGrid_ctl00_ctl05_Title\"]"
# [3] "//*[@id=\"ctl00_CPH1_vcyS_vsGrid_ctl00_ctl06_Title\"]"
# [4] "//*[@id=\"ctl00_CPH1_vcyS_vsGrid_ctl00_ctl07_Title\"]"
# [5] "//*[@id=\"ctl00_CPH1_vcyS_vsGrid_ctl00_ctl08_Title\"]"
# [6] "//*[@id=\"ctl00_CPH1_vcyS_vsGrid_ctl00_ctl09_Title\"]"
# [7] "//*[@id=\"ctl00_CPH1_vcyS_vsGrid_ctl00_ctl10_Title\"]"
# [8] "//*[@id=\"ctl00_CPH1_vcyS_vsGrid_ctl00_ctl11_Title\"]"
# [9] "//*[@id=\"ctl00_CPH1_vcyS_vsGrid_ctl00_ctl12_Title\"]"
#[10] "//*[@id=\"ctl00_CPH1_vcyS_vsGrid_ctl00_ctl13_Title\"]"
#[11] "//*[@id=\"ctl00_CPH1_vcyS_vsGrid_ctl00_ctl14_Title\"]"
#[12] "//*[@id=\"ctl00_CPH1_vcyS_vsGrid_ctl00_ctl15_Title\"]"
#[13] "//*[@id=\"ctl00_CPH1_vcyS_vsGrid_ctl00_ctl16_Title\"]"
#[14] "//*[@id=\"ctl00_CPH1_vcyS_vsGrid_ctl00_ctl17_Title\"]"
#[15] "//*[@id=\"ctl00_CPH1_vcyS_vsGrid_ctl00_ctl18_Title\"]"
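To connect this back to the loop in the question, here is a minimal sketch of building the XPaths once and then visiting each of them. It uses the standard XPath attribute syntax `@id`. `fetch_title()` is a hypothetical stand-in for the `driver$findElement(...)$getElementText()` call, so the example runs without a browser session, and starting at 4 assumes that ctl04 is the first title element, as the question's description suggests.

```r
# %g prints the number as-is, while %02d pads it to two digits
sprintf("ctl%g", 4)     # "ctl4"
sprintf("ctl%02d", 4L)  # "ctl04"

xpath_template <- '//*[@id="ctl00_CPH1_vcyS_vsGrid_ctl00_ctl%02d_Title"]'
title_xpaths <- sprintf(xpath_template, 4:18)

# Hypothetical stand-in for the Selenium lookup in the question
fetch_title <- function(xpath) paste("title found at", xpath)

# one result per XPath, instead of overwriting a single variable in a loop
title_list <- lapply(title_xpaths, fetch_title)
length(title_list)
# [1] 15
```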

Sequence of numbers by hyphen without hyphenating single occurrences

I want to generate readable number sequences (e.g. 1, 2, 3, 4 = 1-4), but for a set of data where each number in the sequence must have four digits (e.g. 99 = 0099 or 1 = 0001 or 1022 = 1022) AND where there are different letters in front of each number.
I was looking at the answer to this question, which managed to do almost exactly as I want with two caveats:
If there is a stand-alone number that does not appear in a sequence, it will appear twice with a hyphen in between
If there are several stand-alone numbers that do not appear in a sequence, they won't be included in the result
### Create Data Set ====
## Create the data for different tags. I'm only using two unique levels here, but in my dataset I've got
## 400+ unique levels.
FM <- paste0('FM', c('0001', '0016', '0017', '0018', '0019', '0021', '0024', '0026', '0028'))
SC <- paste0('SC', c('0002', '0003', '0004', '0010', '0012', '0014', '0033', '0036', '0039'))
## Combine data
my.seq1 <- c(FM, SC)
## Sort data by number in sequence
my.seq1 <- my.seq1[order(substr(my.seq1, 3, 7))]
### Attempt Number Sequencing ====
## Get the letters
sp.tags <- substr(my.seq1, 1, 2)
## Get the readable number sequence
my.seq2 <- lapply(split(my.seq1, sp.tags), ## Split data by the tag ID
  function(x){
    ## Get the run lengths as per the previous answer
    rl <- rle(c(1, pmin(diff(as.numeric(substr(x, 3, 7))), 2)))
    ## Generate number sequence by separator as per the previous answer
    seq2 <- paste0(x[c(1, cumsum(rl$lengths))], c("-", ",")[rl$values], collapse="")
    return(substr(seq2, 1, nchar(seq2)-1))
  })
## Combine lists and sort elements
my.seq2 <- unlist(strsplit(do.call(c, my.seq2), ","))
my.seq2 <- my.seq2[order(substr(my.seq2, 3, 7))]
names(my.seq2) <- NULL
my.seq2
[1] "FM0001-FM0001" "SC0002-SC0004" "FM0016-FM0019" "FM0028" "SC0039"
my.seq1
[1] "FM0001" "SC0002" "SC0003" "SC0004" "SC0010" "SC0012" "SC0014" "FM0016" "FM0017" "FM0018" "FM0019" "FM0021"
[13] "FM0024" "FM0026" "FM0028" "SC0033" "SC0036" "SC0039"
The major problems with this are:
Some values are completely missing from the result (e.g. FM0021, FM0024, FM0026)
The first number in the sequence (FM0001) appears with a hyphen in between
I feel like I'm getting warmer by using A5C1D2H2I1M1N2O1R2T1's answer to utilize seqToHumanReadable because it's quite elegant AND solves both problems. Two more problems are that I'm not able to tag the ID before each number and can't force the number of digits to four (e.g. 0004 becomes 4).
library(R.utils)
lapply(split(my.seq1, sp.tags), function(x){
return(unlist(strsplit(seqToHumanReadable(substr(x, 3, 7)), ',')))
})
$FM
[1] "1" " 16-19" " 21" " 24" " 26" " 28"
$SC
[1] "2-4" " 10" " 12" " 14" " 33" " 36" " 39"
Ideally the result would be:
"FM0001, SC0002-SC0004, SC0010, SC0012, SC0014, FM0016-FM0019, FM0021, FM0024, FM0026, FM0028, SC0033, SC0036, SC0039"
Any ideas? It's one of those things that's really simple to do by hand but would take blinking ages, and you'd think a function would exist for it but I haven't found it yet or it doesn't exist :(
This should do?
# get the prefix/tag and number
tag <- gsub("(^[A-Za-z]+)(.+)", "\\1", my.seq1)
num <- gsub("([A-Za-z]+)(\\d+$)", "\\2", my.seq1)
# get a sequence id
n <- length(tag)
do_match <- c(FALSE, diff(as.numeric(num)) == 1 & tag[-1] == tag[-n])
seq_id <- cumsum(!do_match) # a sequence id
# tapply to combine the result
res <- setNames(tapply(my.seq1, seq_id, function(x)
if(length(x) < 2)
return(x)
else
paste(x[1], x[length(x)], sep = "-")), NULL)
# show the result
res
#R> [1] "FM0001" "SC0002-SC0004" "SC0010" "SC0012" "SC0014" "FM0016-FM0019" "FM0021"
#R> [8] "FM0024" "FM0026" "FM0028" "SC0033" "SC0036" "SC0039"
# compare with
my.seq1
#R> [1] "FM0001" "SC0002" "SC0003" "SC0004" "SC0010" "SC0012" "SC0014" "FM0016" "FM0017" "FM0018" "FM0019" "FM0021" "FM0024"
#R> [14] "FM0026" "FM0028" "SC0033" "SC0036" "SC0039"
Data
FM <- paste0('FM', c('0001', '0016', '0017', '0018', '0019', '0021', '0024', '0026', '0028'))
SC <- paste0('SC', c('0002', '0003', '0004', '0010', '0012', '0014', '0033', '0036', '0039'))
my.seq1 <- c(FM, SC)
my.seq1 <- my.seq1[order(substr(my.seq1, 3, 7))]
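The logic above can also be wrapped in a small function that returns the single comma-separated string the question describes as the ideal result. This is a sketch rather than a drop-in replacement: `collapse_runs()` is a hypothetical name, and it assumes the ids are already in the desired order, that the tag is the non-digit prefix, and that the number is the trailing run of digits.

```r
# Hypothetical helper: collapse consecutive ids with the same tag into
# "first-last" ranges and leave stand-alone ids unhyphenated.
collapse_runs <- function(ids) {
  tag <- sub("[0-9]+$", "", ids)
  num <- as.integer(sub("^[^0-9]+", "", ids))
  # a new run starts whenever the number does not increase by 1 or the tag changes
  new_run <- c(TRUE, diff(num) != 1 | tag[-1] != tag[-length(tag)])
  runs <- tapply(ids, cumsum(new_run), function(x) {
    if (length(x) == 1) x else paste(x[1], x[length(x)], sep = "-")
  })
  paste(unname(runs), collapse = ", ")
}

collapse_runs(c("FM0001", "SC0002", "SC0003", "SC0004", "SC0010", "FM0016", "FM0017"))
# [1] "FM0001, SC0002-SC0004, SC0010, FM0016-FM0017"
```

Working on the original strings rather than on the extracted numbers keeps both the tag and the zero padding in the output.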

Importing multiple invoices (.PDF) in R. Turning them from strings to a tibble

So I'm doing a project where I need to load a large number of PDFs into R. This part is somewhat covered. The problem is that when the PDFs are imported into R, every line is a string. Not all the information in the string is relevant, and in some cases information is missing. So I want to select the info I need and place it into a tibble for further analysis.
Importing the PDFs is done with pdftools. It's working, but hints or tips are welcome.
library(pdftools)
library(tidyverse)
invoice_pdfs = list.files(pattern = "\\.pdf$") # gather all the .pdf files in the current wd (pattern is a regex, not a glob)
invoice_list <- map(invoice_pdfs, .f = function(invoices){ # using the purrr::map function
  pdf_text(invoices) %>% # extract text from the listed pdf file(s)
    readr::read_lines() %>% # split the text into lines
    str_squish() %>% # collapse redundant white space
    str_to_lower() # convert strings to lower case
})
reproducible example:
invoice_example <- c("invoice",
"to: rade ris",
"cane nompany",
"kakber street 23d",
"nork wey",
"+223 (0)56 015 6542",
"invoice id: 85600023",
"date reference product product reference weigth amount",
"01-02-2016 840000023 product a 24.45.6 de6583621 14.900 kg a 50 per tonne 745,00",
"07-02-2016 840000048 product b 24.45.7 qf8463641 19.000 kg a 50 per tonne 950,00",
"03-02-2016 840000032 product b 24.34.2 qf8463641 4.000 kg per tonne 250,00",
"02-02-2016 840000027 ke7801465 1.780 kg per tonne 89,00",
"subtotal 2.034,00",
"sales tax 183,06",
"total 2.217,06")
So here is where the problem starts.
What I've tried is using stringr and rebus to select specific parts of the text. I've made the following function to search the document for specific string, it returns the rownumber:
word_finder <- function(x, findWord){
word_hit <- x %>% # temp for storing TRUE or FALSE
str_detect(pattern = fixed(findWord))
which(word_hit == TRUE) # give rownumber if TRUE
}
And the following search patterns:
detect_date <- dgt(2) %R% "-" %R% dgt(2) %R% "-" %R% dgt(4)
detect_money <- optional(DIGIT) %R% optional(".") %R% one_or_more(DIGIT) %R% "," %R% dgt(2)
detect_invoice_num <- str_trim(SPC %R% dgt(8) %R% optional(SPC))
The next step should be to make a tibble (or data frame) with the column names c("date", "reference", "product", "product reference", "weight", "amount"). I've also tried making a tibble of the whole invoice_example; the problem is that info is missing in some fields, so the column names don't match the corresponding values.
So I would like to write a function that uses the search patterns and places each specific value in a predetermined column. I've got no clue how to get this done. Or maybe I should handle this completely differently?
The final result should be something like this.
reproducible example:
invoice_nr <- c("85600023", "85600023", "85600023", "85600023" )
date <- c( "01-02-2016", "07-02-2016", "03-02-2016", "02-02-2016")
reference <- c( "840000023", "840000048", "840000032", "840000027")
product_id <- c( "de6583621", "qf8463641", "qf8463641", "ke7801465")
weight <- c("14.900", "19.000", "4.000", "1.780")
amount <- c("745.00", "950.00", "250.00", "89.00")
example_tibble <- tibble(invoice_nr, date, reference, product_id, weight, amount)
Result:
# A tibble: 4 x 6
invoice_nr date reference product_id weight amount
<chr> <chr> <chr> <chr> <chr> <chr>
1 85600023 01-02-2016 840000023 de6583621 14.900 745.00
2 85600023 07-02-2016 840000048 qf8463641 19.000 950.00
3 85600023 03-02-2016 840000032 qf8463641 4.000 250.00
4 85600023 02-02-2016 840000027 ke7801465 1.780 89.00
Any suggested ways of dealing with this will be appreciated!
Actually, you can use the functions of library(stringr) to achieve your goal (I skipped the rebus part, as it seems to be 'just' a helper for creating the regex anyway, which I did by hand):
library(tidyverse)
parse_invoice <- function(in_text) {
## define regex, some assumptions:
## product id is 2 lower characters followed by 7 digits
## weight is some digits with a dot followed by kg
## amount is some digits at the end with a comma
all_regex <- list(date = "\\d{2}-\\d{2}-\\d{4}",
reference = "\\d{9}",
product_id = "[a-z]{2}\\d{7}",
weight = "\\d+\\.\\d+ kg",
amount = "\\d+,\\d+$")
## look only at lines where there is invoice data
rel_lines <- str_subset(in_text, all_regex$date)
## extract the pieces from the regex
ret <- as_tibble(map(all_regex, str_extract, string = rel_lines))
## clean up the data
ret %>%
mutate(invoice_nr = str_extract(str_subset(in_text, "invoice id:"), "\\d{8}"),
date = as.Date(date, "%d-%m-%Y"),
weight = as.numeric(str_replace(weight, "(\\d+\\.\\d+) kg", "\\1")),
amount = as.numeric(str_replace(amount, ",", "."))
) %>%
select(invoice_nr,
date,
reference,
product_id,
weight,
amount)
}
str(parse_invoice(invoice_example))
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 4 obs. of 6 variables:
# $ invoice_nr: chr "85600023" "85600023" "85600023" "85600023"
# $ date : Date, format: "2016-02-01" "2016-02-07" ...
# $ reference : chr "840000023" "840000048" "840000032" "840000027"
# $ product_id: chr "de6583621" "qf8463641" "qf8463641" "ke7801465"
# $ weight : num 14.9 19 4 1.78
# $ amount : num 745 950 250 89
Since I'm not familiar with rebus I've rewritten your code. Assuming the invoices are at least somewhat structured the same I could generate a tibble from your example. You would just have to apply this to your whole list and then purrr::reduce it to a big tibble:
df <- tibble(date=na.omit(str_extract(invoice_example,"\\d{2}-\\d{2}-\\d{4}")))
df %>% mutate(invoice_nr=na.omit(sub("invoice id: ","",str_extract(invoice_example,"invoice id: [0-9]+"))),
reference=na.omit(sub("\\d{2}-\\d{2}-\\d{4} ","",str_extract(invoice_example,"\\d{2}-\\d{2}-\\d{4} \\d{9}"))),
product_id=na.omit(str_extract(invoice_example,"[:lower:]{2}\\d{7}")),
weight=na.omit(sub(" kg","",str_extract(invoice_example,"[0-9\\.]+ kg"))),
amount=na.omit(sub("tonne ","",str_extract(invoice_example,"tonne [0-9,]+"))))
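The last step mentioned above, applying the parser to every invoice and combining the results, could look roughly like the sketch below. It is base R only and not taken from either answer: `first_match()` and `parse_lines()` are hypothetical helpers, the second invoice (id 85600099) is made-up example data, and the patterns assume that data rows start with a dd-mm-yyyy date and end with an amount that has no thousands separator.

```r
# Hypothetical helper: first regex match per element, NA where nothing matches
first_match <- function(pattern, x) {
  m <- regexpr(pattern, x)
  hit <- !is.na(m) & m > 0
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)
  out
}

# Hypothetical parser: one data frame row per dated invoice line
parse_lines <- function(lines) {
  rows <- grep("^[0-9]{2}-[0-9]{2}-[0-9]{4}", lines, value = TRUE)
  invoice_nr <- first_match("[0-9]{8}", grep("invoice id:", lines, value = TRUE)[1])
  data.frame(
    invoice_nr = rep(invoice_nr, length(rows)),
    date       = as.Date(first_match("^[0-9]{2}-[0-9]{2}-[0-9]{4}", rows), format = "%d-%m-%Y"),
    product_id = first_match("[a-z]{2}[0-9]{7}", rows),
    amount     = as.numeric(sub(",", ".", first_match("[0-9]+,[0-9]+$", rows), fixed = TRUE)),
    stringsAsFactors = FALSE
  )
}

# Two small invoices in the shape produced by the import step
invoice_list <- list(
  c("invoice id: 85600023",
    "01-02-2016 840000023 product a 24.45.6 de6583621 14.900 kg a 50 per tonne 745,00",
    "total 745,00"),
  c("invoice id: 85600099",  # made-up second invoice
    "02-02-2016 840000027 ke7801465 1.780 kg per tonne 89,00",
    "07-02-2016 840000048 product b 24.45.7 qf8463641 19.000 kg a 50 per tonne 950,00")
)

# parse each invoice, then stack the per-invoice data frames
all_invoices <- do.call(rbind, lapply(invoice_list, parse_lines))
all_invoices
```

With the tidyverse loaded, `purrr::map()` followed by `dplyr::bind_rows()` would be a similar way to do the final step.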
