Split a String in R, using specific criteria [closed]


Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 3 years ago.
I have this vector of characters and I need to split the elements in every row.
Here's the code of the dataset:
Y = c("Voci dell’attivo 31-dic-2018 31-dic-2017", "10. Cassa e disponibilità liquide 113.154.047 105.800.459",
"20. Attività finanziarie valutate al fair value con impatto a conto economico 18.400.808 16.926.126",
"a) attività finanziarie detenute per la negoziazione 4.343.408 4.356.151",
"c) altre attività finanziarie obbligatoriamente valutate al fair value 14.057.400 12.569.975",
"30. Attività finanziarie valutate al fair value con impatto sulla redditività complessiva 636.155.429 738.384.151",
"40. Attività finanziarie valutate al costo ammortizzato 11.203.834.137 11.113.118.111",
"a) crediti verso banche 3.198.599.841 3.375.218.889", "b) crediti verso clientela 8.005.234.296 7.737.899.222",
"50. Derivati di copertura 516.238 696.134", "70. Partecipazioni 183.546.796 156.038.259",
"80. Attività materiali 224.588.110 237.315.814", "90. Attività immateriali 1.917.192 2.041.953",
"di cui:", "- avviamento 1.650.000 1.650.000", "100. Attività fiscali 222.227.750 175.106.461",
"a) correnti 4.897.477 10.066.708", "b) anticipate 217.330.273 165.039.753",
"120. Altre attività 82.551.535 110.120.187", "Totale dell'attivo 12.686.892.042 12.655.547.655",
"(unità di euro)", "Voci del passivo e del patrimonio netto",
"31-dic-2018 31-dic-2017", "10. Passività finanziarie valutate al costo ammortizzato 11.375.985.022 11.176.338.096",
"a) debiti verso banche 146.551.210 144.854.107", "b) debiti verso clientela 10.861.850.231 10.251.703.644",
"c) titoli in circolazione 367.583.581 779.780.345", "20. Passività finanziarie di negoziazione 2.392.620 2.370.319",
"40. Derivati di copertura 6.189.059 2.971.997", "60. Passività fiscali 4.091.701 3.909.554",
"a) correnti 752.147 -", "b) differite 3.339.554 3.909.554",
"80. Altre passività 239.940.794 152.157.192", "90. Trattamento di fine rapporto del personale 54.720.108 56.331.622",
"100. Fondi per rischi e oneri: 66.580.390 69.699.078", "a) impegni e garanzie rilasciate 12.705.663 9.475.181",
"c) altri fondi per rischi e oneri 53.874.727 60.223.897", "110. Riserve da valutazione 119.988.702 139.381.644",
"140. Riserve 460.527.397 761.938.256", "150. Sovrapprezzi di emissione 126.318.353 126.318.353",
"160. Capitale 155.247.762 155.247.762", "180. Utile (perdita) d’esercizio (+/-) 74.910.134 8.883.782",
"Totale del passivo e del patrimonio netto 12.686.892.042 12.655.547.655")
Basically, what I need to do is:
Put the leading number or letter-with-parenthesis label and ALL the following words into one column
Put the first number in a second column
Delete the last number
I don't know how to make R recognize different types of elements inside the same string. How can I do this?

Given your Y, this should achieve what you want, or get very close to that:
library(tidyverse)
tibble(orig = str_remove(string = Y, pattern = " [[:digit:][:punct:]]+$")) %>%
transmute(col1 = str_remove(string = orig, pattern = " [[:digit:][:punct:]]+$"),
col2 = str_extract(string = orig, pattern = "[[:digit:][:punct:]]+$"))
#> # A tibble: 43 x 2
#> col1 col2
#> <chr> <chr>
#> 1 Voci dell’attivo 31-dic-2018 31-dic-2017 -2017
#> 2 10. Cassa e disponibilità liquide 113.154.047
#> 3 20. Attività finanziarie valutate al fair value con impatt… 18.400.808
#> 4 a) attività finanziarie detenute per la negoziazione 4.343.408
#> 5 c) altre attività finanziarie obbligatoriamente valutate a… 14.057.400
#> 6 30. Attività finanziarie valutate al fair value con impatt… 636.155.429
#> 7 40. Attività finanziarie valutate al costo ammortizzato 11.203.834.…
#> 8 a) crediti verso banche 3.198.599.8…
#> 9 b) crediti verso clientela 8.005.234.2…
#> 10 50. Derivati di copertura 516.238
#> # … with 33 more rows
Created on 2019-10-09 by the reprex package (v0.3.0)
To clarify: the dollar sign anchors the pattern to the end of the string, so each pattern only matches a trailing run of digits and punctuation. The code first removes the last such run (the final number); then, from the remaining strings, it either removes the trailing run to build col1 or extracts it to build col2. Note that [:punct:] also matches the hyphen, which is why the header row yields -2017 in col2.
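As a hedged alternative, the same split can be sketched in base R with a stricter pattern that only accepts digits and dots (or a lone dash) as values, so header rows such as the date line are left untouched. The names Y_sample, pattern, col1 and col2 are illustrative, and the sample is a small subset of Y rather than the full vector.

```r
# Small illustrative subset of Y (assumption: every data row ends in two values)
Y_sample <- c("10. Cassa e disponibilità liquide 113.154.047 105.800.459",
              "a) correnti 752.147 -",
              "di cui:")

# label, then a space, then two values; a value is digits with dots, or a lone "-"
pattern <- "^(.*) ([0-9][0-9.]*|-) ([0-9][0-9.]*|-)$"

has_values <- grepl(pattern, Y_sample)

# rows without two trailing values keep their full text and get NA as value
col1 <- ifelse(has_values, sub(pattern, "\\1", Y_sample), Y_sample)
col2 <- ifelse(has_values, sub(pattern, "\\2", Y_sample), NA_character_)

# optional: turn "113.154.047" into a number by dropping the thousands separators
col2_num <- suppressWarnings(as.numeric(gsub(".", "", col2, fixed = TRUE)))

data.frame(col1, col2, col2_num)
```

The third capture group (the last number) is matched but never used, which is how it gets dropped.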

Related

Reading csv file with commas in R that stops fread function

I am trying to read multiple csv files (around 300) with the function fread in R.
When I open one of the csv files in Excel, the columns are delimited correctly, even when some observations contain commas.
When I try to read one of the files, the function doesn't read all the observations in the file and the following warning appears:
> file_prueba<-fread("Datos/Datos_precios/INP_PP_CAB18 (7)_A_vivienda_06_2020.csv", skip = 5, header = TRUE)
Warning message:
In fread("Datos/Datos_precios/INP_PP_CAB18 (7)_A_vivienda_06_2020.csv", :
Stopped early on line 1073. Expected 17 fields but found 22. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<"2020","06","20/07/2020 12:00:00 a. m.","12","San Luis Potosí, S.L.P.","3. Vivienda","3.1. Costo de uso de vivienda","3.1.1. Costo de uso de vivienda","42 Vivienda propia","140","Productos para reparación menor de la vivienda","001","PLOMERIA, TUBO DE PVC, REFORZADO, 4", PZA 6 MTS","231.55","1","PZA","">>
Therefore I can't read the whole file. I suspect it is because one of the observations contains commas, like "PLOMERIA, TUBO DE COBRE, DE 60 MTS", but I'm not sure.
How can I fix this without fixing each csv file one by one?
Here's the file that I'm using in the example, but as I said, I need to read multiple files like this:
https://drive.google.com/file/d/1gSjyL14sZQC5KNtMXhN_iN79xCETTZAG/view?usp=sharing

The file is corrupt in two places: lines 1073 and 3401 have unescaped embedded quotes. But there's another problem here: read down to the second section, "fread and double-double-quotes", for the problem with fread.
(Ultimately, this is a failure of the exporting process and a failure of fread to read embedded double quotes.)
Corrupted lines
Scroll right to see the problems.
Line 1073:
"2020","06","20/07/2020 12:00:00 a. m.","12","San Luis Potosí, S.L.P.","3. Vivienda","3.1. Costo de uso de vivienda","3.1.1. Costo de uso de vivienda","42 Vivienda propia","140","Productos para reparación menor de la vivienda","001","PLOMERIA, TUBO DE PVC, REFORZADO, 4", PZA 6 MTS","231.55","1","PZA",""
---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ^-- this quote is incorrect
Line 3401:
"2020","06","20/07/2020 12:00:00 a. m.","43","Campeche, Camp.","3. Vivienda","3.1. Costo de uso de vivienda","3.1.1. Costo de uso de vivienda","42 Vivienda propia","140","Productos para reparación menor de la vivienda","003","NACOBRE, PLOMERIA, TUBO DE COBRE, BARRA DE 1/2" X 6 MT","316.76","1","PZA",""
---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ^-- this quote is incorrect
The best fix is to get whatever person/process exported this to export compliant CSV.
Here is a command-line (sed) fix that will allow fread to load it without warning or error (this is on a shell prompt, not in R).
sed -i \
-e 's/", PZA/"", PZA/g' \
-e 's/BARRA DE 1\/2"/BARRA DE 1\/2""/g' \
"INP_PP_CAB18 (7)_A_vivienda_06_2020.CSV"
Simple explanation: the CSV standard (well-framed at https://en.wikipedia.org/wiki/Comma-separated_values) suggests that either double-quotes should never be in a quoted field, or if present they should be doubled (as in "" to produce a single " in the middle of a value).
In this case, the sed command finds the two specific failing strings and adds the second quote to each.
-i means to make the change in-place; perhaps a more defensive use would be to do sed -e 's/../../g' -e 's/../../g' < oldfile.csv > newfile.csv, which would preserve the broken file. Over to you.
-e adds a sed script/command, multiple commands can be given.
s/from/to/g means to replace the pattern from with the string in to; the g means "global".
This changes the two lines (shown one after the other here for simplicity):
"2020","06","20/07/2020 12:00:00 a. m.","12","San Luis Potosí, S.L.P.","3. Vivienda","3.1. Costo de uso de vivienda","3.1.1. Costo de uso de vivienda","42 Vivienda propia","140","Productos para reparación menor de la vivienda","001","PLOMERIA, TUBO DE PVC, REFORZADO, 4"", PZA 6 MTS","231.55","1","PZA",""
"2020","06","20/07/2020 12:00:00 a. m.","43","Campeche, Camp.","3. Vivienda","3.1. Costo de uso de vivienda","3.1.1. Costo de uso de vivienda","42 Vivienda propia","140","Productos para reparación menor de la vivienda","003","NACOBRE, PLOMERIA, TUBO DE COBRE, BARRA DE 1/2"" X 6 MT","316.76","1","PZA",""
---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ^^^^^-- the changes, double-double quotes
FYI: if you don't have sed in the path ... if you're running windows, then look in the RTools40 path; for me, I have c:/rtools40/usr/bin/sed.exe. If you're on macos or linux and cannot find sed, well ... that's odd.
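If sed is not available, roughly the same substitution can be sketched in R itself. This is a minimal sketch, not a general CSV repair: fix_embedded_quotes is a hypothetical helper, it hard-codes the same two patterns as the sed command, and the sample lines are shortened versions of the two broken rows.

```r
# Hypothetical helper: doubles the two known stray quotes, like the sed script
fix_embedded_quotes <- function(lines) {
  lines <- gsub('", PZA', '"", PZA', lines, fixed = TRUE)
  lines <- gsub('BARRA DE 1/2"', 'BARRA DE 1/2""', lines, fixed = TRUE)
  lines
}

# Shortened copies of the two broken rows, for illustration only
broken <- c(
  '"001","PLOMERIA, TUBO DE PVC, REFORZADO, 4", PZA 6 MTS","231.55"',
  '"003","NACOBRE, PLOMERIA, TUBO DE COBRE, BARRA DE 1/2" X 6 MT","316.76"'
)

fixed <- fix_embedded_quotes(broken)
print(fixed)
```

To apply this to many files, one could loop over list.files(), read each file with readLines(), and write the result to a new file with writeLines(), taking care with the file encoding. Like the sed version, the helper is not idempotent: running it twice on the same lines adds a third quote.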
After that sed command executes correctly, it will load without problem. HOWEVER, don't let this mislead you ... it is not really fixed. Keep reading.
csv <- fread("INP_PP_CAB18 (7)_A_vivienda_06_2020.CSV", skip = 5)
csv
# Año Mes Fecha_Pub_DOF Clave ciudad Nombre ciudad División
# <int> <int> <char> <int> <char> <char>
# 1: 2020 6 20/07/2020 12:00:00 a. m. 1 Área Met. de la Cd. de México 3. Vivienda
# 2: 2020 6 20/07/2020 12:00:00 a. m. 1 Área Met. de la Cd. de México 3. Vivienda
# 3: 2020 6 20/07/2020 12:00:00 a. m. 1 Área Met. de la Cd. de México 3. Vivienda
...snip...
# 11 variables not shown: [Grupo <char>, Clase <char>, Subclase <char>, Clave genérico <int>, Genérico <char>, Consecutivo <int>, Especificación <char>, Precio promedio <num>, Cantidad <int>, Unidad <char>, ...]
fread and double-double-quotes
The problem with the above is that while it seems to have worked correctly, it (still) does not handle embedded quotes correctly. If you want your data to keep the embedded quotes intact, then unfortunately you cannot use fread.
Why?
str(csv[1067,])
# Classes 'data.table' and 'data.frame': 1 obs. of 17 variables:
# $ Año : int 2020
# $ Mes : int 6
# $ Fecha_Pub_DOF : chr "20/07/2020 12:00:00 a. m."
# $ Clave ciudad : int 12
# $ Nombre ciudad : chr "San Luis Potosí, S.L.P."
# $ División : chr "3. Vivienda"
# $ Grupo : chr "3.1. Costo de uso de vivienda"
# $ Clase : chr "3.1.1. Costo de uso de vivienda"
# $ Subclase : chr "42 Vivienda propia"
# $ Clave genérico : int 140
# $ Genérico : chr "Productos para reparación menor de la vivienda"
# $ Consecutivo : int 1
# $ Especificación : chr "PLOMERIA, TUBO DE PVC, REFORZADO, 4\"\", PZA 6 MTS"
# $ Precio promedio: num 232
# $ Cantidad : int 1
# $ Unidad : chr "PZA"
# $ Estatus : chr ""
# - attr(*, ".internal.selfref")=<externalptr>
Namely, see
csv$Especificación[1067]
# [1] "PLOMERIA, TUBO DE PVC, REFORZADO, 4\"\", PZA 6 MTS"
^^^^ should only be a single "
Fortunately, read.csv works fine here:
csv <- read.csv("INP_PP_CAB18 (7)_A_vivienda_06_2020.CSV", skip = 5)
csv$Especificación[1067]
# [1] "PLOMERIA, TUBO DE PVC, REFORZADO, 4\", PZA 6 MTS"
FYI, if you don't care about the embedded quotes, you can still use fread if you change the sed expressions to remove the double-quotes instead of doubling the double-quotes. That is, -e 's/", PZA/, PZA/g' and likewise for the second expression. I didn't recommend this first because it changes your data, which you should not have to do.

The file you linked is properly quoted.
It has 5 lines of non-CSV data though, so skip these:
csv <- read.csv("INP_PP_CAB18 (7)_A_vivienda_06_2020.CSV", header = TRUE, skip = 5, fileEncoding = "Latin1")
This works fine for me.
I am not so familiar with fread, and it does seem to have a problem with this file. Is there a reason you need data.table::fread for this?

How to scrape data from PDF with R?

I need to extract data from a PDF file. This file is a booklet of public services, where each page is about a specific service, which contains fields with the following information: name of the service, service description, steps, documentation, fees and observations. All pages follow this same pattern, changing only the information contained in these fields.
I would like to know if it is possible to extract all the data contained in these fields using R, please.
[those that are marked in highlighter are the fields with the information]

I've used the command line Java application Tabula and the R version TabulizeR to extract tabular data from text-based PDF files.
https://github.com/tabulapdf/tabula
https://github.com/ropensci/tabulizer
However, if your PDF is actually an image, then this becomes an OCR problem and needs a different tool.
Caveat: Tabula only works on text-based PDFs, not scanned documents. If you can click-and-drag to select text in your table in a PDF viewer (even if the output is disorganized trash), then your PDF is text-based and Tabula should work.

Here is an approach that can be considered to extract the text of your image :
library(RDCOMClient)
library(magick)
################################################
#### Step 1 : We convert the image to a PDF ####
################################################
path_TXT <- "C:\\temp.txt"
path_PDF <- "C:\\temp.pdf"
path_PNG <- "C:\\stackoverflow145.png"
path_Word <- "C:\\temp.docx"
pdf(path_PDF, width = 16, height = 6)
im <- image_read(path_PNG)
plot(im)
dev.off()
####################################################################
#### Step 2 : We use the OCR of Word to convert the PDF to word ####
####################################################################
wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE
doc <- wordApp[["Documents"]]$Open(normalizePath(path_PDF),
ConfirmConversions = FALSE)
doc$SaveAs2(path_Word)
#############################################
#### Step 3 : We convert the word to txt ####
#############################################
doc$SaveAs(path_TXT, FileFormat = 4)
text <- readLines(path_TXT)
text
[1] "Etapas:"
[2] "Documenta‡Æo:"
[3] "\a Consulte a se‡Æo documenta‡Æo b sica (veja no ¡ndice)."
[4] "\a Original da primeira via da nota fiscal do fabricante - DANFE - Resolu‡Æo Sefaz 118/08 para ve¡culos adquiridos diretamente da f brica, ou original da primeira via da nota fiscal do revendedor ou c¢pia da Nota Fiscal Eletr“nica - DANFE, para ve¡culos adquiridos em revendedores, acompanhados, em ambos os casos, da etiqueta contendo o decalque do chassi em baixo relevo;"
[5] "\a Documento que autoriza a inclusÆo do ve¡culo na frota de permission rios/ concession rios, expedido pelo ¢rgÆo federal, estadual ou municipal concedente, quando se tratar de ve¡culo classificado na esp‚cie \"passageiros\" e na categoria \"aluguel\","
[6] "\a No caso de inclusÆo de GRAVAME comercial, as institui‡äes financeiras e demais empresas credoras serÆo obrigados informar, eletronicamente, ao Sistema Nacional de GRAVAMEs (SNG), sobre o financiamento do ve¡culo. Os contratos de GRAVAME comercial serÆo previamente registrados no sistema de Registro de Contratos pela institui‡Æo financeira respons vel para permitir a inclusÆo do GRAVAME;"
[7] "\a Certificado de registro expedido pelo ex‚rcito para ve¡culo blindado;"
[8] "\a C¢pia autenticada em cart¢rio do laudo m‚dico e o registro do n£mero do Certificado de Seguran‡a Veicular (CSV), quando se tratar de ve¡culo adaptado para deficientes f¡sicos."
[9] "Taxas:"
[10] "Duda 001-9 (Primeira Licen‡a). Duda de Emplacamento: 037-0 para Carros; 041-8 para motos."
[11] "Observa‡Æo:"
[12] "1. A 1' licen‡a de ve¡culo adquirido atrav‚s de leilÆo dever ser feita atrav‚s de Processo Administrativo, nas CIRETRANS, SATs ou uma unidade de Protocolo Geral. (veja no ¡ndice as se‡äes Lista de CIRETRANS, Lista de SATs e Unidades de Protocolo Geral para endere‡os )"
Once you have the text in R, you can use the R package stringr.
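As one possible next step, the lines can be grouped under their field names. The sketch below uses base R rather than stringr, and it assumes that every field name sits on its own line ending in a colon, as in the output above. The sample vector is a shortened, accent-free stand-in for the real text, and fields is an illustrative name.

```r
# Shortened stand-in for the OCR output (assumption: headers end with ":")
text <- c("Etapas:",
          "Taxas:",
          "Duda 001-9 (Primeira Licenca).",
          "Observacao:",
          "1. Primeira linha.",
          "2. Segunda linha.")

is_header  <- grepl(":$", text)
section_id <- cumsum(is_header)   # which header each line belongs to

# group the non-header lines by section, keeping empty sections too
fields <- split(text[!is_header],
                factor(section_id[!is_header], levels = seq_len(sum(is_header))))
names(fields) <- sub(":$", "", text[is_header])

str(fields)
```

A field with no content, such as Etapas here, comes back as an empty character vector instead of being dropped.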

Searching for sentences in get_sentences

I am analysing some Amazon reviews. At the current step of my analysis, I'd like to take the sentences written in reviews that have fewer than two stars. I have done that, applied get_sentences, and written this function in order to search for words inside the sentences and print out only those containing all of the words:
ricerca <- function(sentences,keyword){
found <- lapply(sentences, function(x) grep(keyword, x, value = TRUE))
found <-found[lengths(found) > 0]
return(found)
}
The sentences are made in the following way:
> class(frasi_negative)
[1] "get_sentences" "get_sentences_character" "list"
> frasi_negative[1:2]
[[1]]
[1] "Auricolari comodissimi."
[2] "Restano incollati alle orecchie e non cadono in nessuna situazione."
[3] "Suono limpido, pulito."
[4] "Bassi consistenti."
[5] "La durata della batteria è più che soddisfacente (io li ho usati anche per 4 ore di fila senza problemi)."
[6] "Decisamente soddisfatto."
[7] "Li ricomprerei."
[8] "Amazon perfetta come al solito."
[9] "AGGIORNAMENTO RECENSIONE!!!!!"
[10] "- Dopo un mese di utilizzo non è più possibile ricaricare le cuffie."
[11] "Non danno più segni di vita."
[12] "Delusissimo."
[13] "Non mi sarei mai aspettato una fine così."
[14] "Peccato perché il prodotto era praticamente perfetto."
[[2]]
[1] "Al mio cellulare (Xiaomi Redmi Note 5) si mostrano singolarmente, separate, quando cerco di connetterle."
[2] "O si connette alla destra, o alla sinistra, e in ogni caso il suono poi esce dalle casse del cellulare (nonostante aver dato alle cuffie tutti i permessi)."
[3] "Non capisco perché, data che la prima connessione era andata come si deve; spente e riaccese, hanno iniziato a comportarsi così."
[4] "Ho provato a riavviare sia loro che cellulare, a rimetterle nella scatoletta e ritoglierle, ma il problema persiste."
[5] "Non penso c'entri il mio cellulare (mai avuto problemi con prodotti simili), in ogni caso effettuo reso con rimborso."
When I try searching for a word, it seems to work (even if the output is really horrible):
> found<-ricerca(frasi_negative, "qualità")
> found[1:3]
[[1]]
[1] "Pessima qualità."
[2] "La qualità delle chiamate telefoniche è assolutamente pessima (il proprio interlocutore non riceve in modo chiaro la nostra voce, dunque, risultano inutilizzabili come aurocolari telefonici)."
[[2]]
[1] "imbarazzanti non so la gente qui come fanno a dargli 5 stelle, l'unica cosa che mi viene in mente e che non hanno mai provato un paio di cuffie decenti, qualità dell audio pessima si sente basso e male , l'unica cosa buona e che la batteria si comporta bene - a distanza di 20 cm ogni tanto si scollegano provato con piu dispositivi sicuramente richiedo il rimborso davvero pessime"
[[3]]
[1] "La qualità costruttiva è ottima, l'accoppiamento è avvenuto in maniera facile ed immediata, e la durata è ottima."
But when I try searching for several words (for example c("quality","bad")), it only searches for the first word and gives me a lot of warnings.
I have no idea about how to adapt this function, so thanks to all of you in advance.
Library: sentimentr
UPDATE:
Thanks for the answers, but it seems that both of the functions you posted output all sentences containing at least one of the two words. I just want to see those which contain both. Is there a way to do it?

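The function below iterates through a vector of keywords and collects the matching sentences for each one:
ricerca <- function(sentences, keywords){
  results <- list()
  for(i in seq_along(keywords)){
    # sentences matching the i-th keyword, dropping reviews with no match
    found <- lapply(sentences, function(x) grep(keywords[i], x, value = TRUE))
    results[[keywords[i]]] <- found[lengths(found) > 0]
  }
  return(results)
}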
I hope this helps!

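Try combining the keywords into one pattern with paste0. Note that the | separator means "or", so this matches sentences containing any of the keywords.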
ricerca <- function(sentences,keyword){
found <- lapply(sentences, function(x)
grep(paste0(keyword, collapse = "|"), x, value = TRUE))
found <-found[lengths(found) > 0]
return(found)
}
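For the case raised in the question's update, where a sentence must contain every keyword, one hedged sketch is to test each keyword separately and combine the results with a logical AND. ricerca_all and the sample reviews are illustrative names and data, and the keywords are matched as fixed, case-sensitive strings here.

```r
# Hypothetical variant: keep only sentences that contain ALL keywords
ricerca_all <- function(sentences, keywords) {
  found <- lapply(sentences, function(x) {
    # one logical vector per keyword, combined element-wise with AND
    keep <- Reduce(`&`, lapply(keywords, grepl, x = x, fixed = TRUE))
    x[keep]
  })
  found[lengths(found) > 0]   # drop reviews with no matching sentence
}

# Made-up reviews in the same list-of-character-vectors shape
reviews <- list(
  c("Pessima qualita.", "La qualita audio e pessima."),
  c("Suono limpido.", "Batteria ottima.")
)

result <- ricerca_all(reviews, c("qualita", "audio"))
print(result)
```

Only the sentence mentioning both "qualita" and "audio" is kept; the first sentence matches one keyword only and is dropped.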

Extract sentences containing specific word(s)

I have a get_sentences (sentimentr) list, and need to extract only those sentences containing specific words.
This is what the data looks like:
> class(frasi_negative)
[1] "get_sentences" "get_sentences_character" "list"
> frasi_negative[2:3]
[[1]]
[1] "Al mio cellulare (Xiaomi Redmi Note 5) si mostrano singolarmente, separate, quando cerco di connetterle."
[2] "O si connette alla destra, o alla sinistra, e in ogni caso il suono poi esce dalle casse del cellulare (nonostante aver dato alle cuffie tutti i permessi)."
[3] "Non capisco perché, data che la prima connessione era andata come si deve; spente e riaccese, hanno iniziato a comportarsi così."
[4] "Ho provato a riavviare sia loro che cellulare, a rimetterle nella scatoletta e ritoglierle, ma il problema persiste."
[5] "Non penso c'entri il mio cellulare (mai avuto problemi con prodotti simili), in ogni caso effettuo reso con rimborso."
[[2]]
[1] "Comprate due mesi fa."
[2] "All'inizio funzionavano perfettamente, ma dopo qualche settimana hanno iniziato a disconnettersi tra loro di tanto in tanto."
[3] "qualche giorno fa estraendo la sinistra dall'astuccio magnetico è saltata la saldatura che la teneva chiusa e questo è il risultato."
[4] "Usandole in chiamata rendono un suono non limpido."
[5] "Il suono è accettabile ma nulla di speciale."
[6] "Ormai sono inutilizzabili."
[7] "Per il prezzo mi aspettavo un prodotto migliore e più duraturo (avendo già provato auricolari wireless della stessa fascia di prezzo di altri brand)"
As an example, if I search "wireless", only the [[2]][7] element should show up.
How can I achieve this?
Thank you in advance.

You can iterate over the list using lapply and return the sentences that match a particular keyword using grep.
keyword <- 'wireless'
lapply(frasi_negative, function(x) grep(keyword, x, value = TRUE))

We can use str_detect
library(purrr)
library(stringr)
keyword <- 'wireless'
map(frasi_negative, ~ .x[str_detect(.x, keyword)])
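Both answers return the matching sentences but not where they sit in the list. If the positions matter (the question refers to the [[2]][7] element), a small base R sketch can report them. The list frasi and the data frame positions are illustrative stand-ins, not the original data.

```r
# Made-up stand-in for frasi_negative
frasi <- list(
  c("Comprate due mesi fa.", "Ormai sono inutilizzabili."),
  c("Suono ok.", "auricolari wireless economici")
)

keyword <- "wireless"

# without value = TRUE, grep returns the indices of the matching sentences
hits <- lapply(frasi, function(x) grep(keyword, x, fixed = TRUE))

# one row per match: which review, and which sentence inside it
positions <- data.frame(
  review   = rep(seq_along(hits), lengths(hits)),
  sentence = unlist(hits)
)
print(positions)
```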

Extract date from a text document in R

I am again here with an interesting problem.
I have a document like shown below:
"""UDAYA FILLING STATION ps\na MATTUPATTY ROAD oe\noe 4 MUNNAR Be:\nSeat 4 04865230318 Rat\nBree 4 ORIGINAL bepas e\n\noe: Han Die MC DE ER DC I se ek OO UO a Be ten\" % aot\n: ag 29-MAY-2019 14:02:23 [i\n— INVOICE NO: 292 hee fos\nae VEHICLE NO: NOT ENTERED Bea\nss NOZZLE NO : 1 ome\n- PRODUCT: PETROL ae\ne RATE : 75.01 INR/Ltr yee\n“| VOLUME: 1.33 Ltr ae\n~ 9 =6AMOUNT: 100.00 INR mae wae\nage, Ee pel Di EE I EE oe NE BE DO DC DE a De ee De ae Cate\notome S.1T. No : 27430268741C =. ver\nnes M.S.T. No: 27430268741V ae\n\nThank You! Visit Again\n""""
From the above document, I need to extract date highlighted in bold and Italics.
I tried with strpdate function but did not get the desired results.
Any help will be greatly appreciated.
Thanks in advance.

Assuming you only want to capture a single date, you may use sub here:
text <- "UDAYA FILLING STATION ps\na MATTUPATTY ROAD oe\noe 4 MUNNAR Be:\nSeat 4 04865230318 Rat\nBree 4 ORIGINAL bepas e\n\noe: Han Die MC DE ER DC I se ek OO UO a Be ten\" % aot\n: ag 29-MAY-2019 14:02:23 [i\n— INVOICE NO: 292 hee fos\nae VEHICLE NO: NOT ENTERED Bea\nss NOZZLE NO : 1 ome\n- PRODUCT: PETROL ae\ne RATE : 75.01 INR/Ltr yee\n“| VOLUME: 1.33 Ltr ae\n~ 9 =6AMOUNT: 100.00 INR mae wae\nage, Ee pel Di EE I EE oe NE BE DO DC DE a De ee De ae Cate\notome S.1T. No : 27430268741C =. ver\nnes M.S.T. No: 27430268741V ae\n\nThank You! Visit Again\n"
date <- sub("^.*\\b(\\d{2}-[A-Z]+-\\d{4})\\b.*", "\\1", text)
date
[1] "29-MAY-2019"
If you need to match multiple such dates in your text, then you may use regmatches along with gregexpr (regexec only finds the first match and returns it together with its capture group):
text <- "Hello World 29-MAY-2019 Goodbye World 01-JAN-2018"
regmatches(text, gregexpr("\\b\\d{2}-[A-Z]+-\\d{4}\\b", text))[[1]]
[1] "29-MAY-2019" "01-JAN-2018"
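Once the date string is extracted, it may be useful to turn it into a Date object. Parsing the month with the %b format depends on the session locale, so this sketch looks the month up in R's built-in month.abb constant instead. It assumes English three-letter month abbreviations, and the variable names are illustrative.

```r
date_string <- "29-MAY-2019"

# split into day, month abbreviation, year
parts <- strsplit(date_string, "-", fixed = TRUE)[[1]]

# month.abb holds "Jan", "Feb", ...; compare in upper case
month_number <- match(toupper(parts[2]), toupper(month.abb))

# rebuild as ISO year-month-day, which as.Date parses by default
parsed <- as.Date(sprintf("%s-%02d-%s", parts[3], month_number, parts[1]))
print(parsed)
```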
