Extract date from a text document in R - r

I am here again with an interesting problem.
I have a document like the one shown below:
"""UDAYA FILLING STATION ps\na MATTUPATTY ROAD oe\noe 4 MUNNAR Be:\nSeat 4 04865230318 Rat\nBree 4 ORIGINAL bepas e\n\noe: Han Die MC DE ER DC I se ek OO UO a Be ten\" % aot\n: ag 29-MAY-2019 14:02:23 [i\n— INVOICE NO: 292 hee fos\nae VEHICLE NO: NOT ENTERED Bea\nss NOZZLE NO : 1 ome\n- PRODUCT: PETROL ae\ne RATE : 75.01 INR/Ltr yee\n“| VOLUME: 1.33 Ltr ae\n~ 9 =6AMOUNT: 100.00 INR mae wae\nage, Ee pel Di EE I EE oe NE BE DO DC DE a De ee De ae Cate\notome S.1T. No : 27430268741C =. ver\nnes M.S.T. No: 27430268741V ae\n\nThank You! Visit Again\n""""
From the above document, I need to extract the date (29-MAY-2019 in this sample).
I tried the strptime function but did not get the desired result.
Any help will be greatly appreciated.
Thanks in advance.

Assuming you only want to capture a single date, you may use sub here:
text <- "UDAYA FILLING STATION ps\na MATTUPATTY ROAD oe\noe 4 MUNNAR Be:\nSeat 4 04865230318 Rat\nBree 4 ORIGINAL bepas e\n\noe: Han Die MC DE ER DC I se ek OO UO a Be ten\" % aot\n: ag 29-MAY-2019 14:02:23 [i\n— INVOICE NO: 292 hee fos\nae VEHICLE NO: NOT ENTERED Bea\nss NOZZLE NO : 1 ome\n- PRODUCT: PETROL ae\ne RATE : 75.01 INR/Ltr yee\n“| VOLUME: 1.33 Ltr ae\n~ 9 =6AMOUNT: 100.00 INR mae wae\nage, Ee pel Di EE I EE oe NE BE DO DC DE a De ee De ae Cate\notome S.1T. No : 27430268741C =. ver\nnes M.S.T. No: 27430268741V ae\n\nThank You! Visit Again\n"
date <- sub("^.*\\b(\\d{2}-[A-Z]+-\\d{4})\\b.*", "\\1", text)
date
[1] "29-MAY-2019"
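If you then need a real Date object rather than a string, here is a minimal sketch in base R. It assumes the month is always a three-letter, upper-case English abbreviation, as in the sample, and looks it up in the built-in month.abb constant so the result does not depend on the session locale. parse_dmy is a hypothetical helper name, not part of any package.

```r
# Hypothetical helper: convert a "29-MAY-2019" style string to a Date.
# Assumes a three-letter, upper-case English month abbreviation.
parse_dmy <- function(x) {
  parts <- strsplit(x, "-", fixed = TRUE)[[1]]       # c("29", "MAY", "2019")
  month_num <- match(parts[2], toupper(month.abb))   # "MAY" -> 5
  if (is.na(month_num)) return(as.Date(NA))          # unknown month name
  # Rebuild as ISO "YYYY-MM-DD", which as.Date() parses by default
  as.Date(sprintf("%s-%02d-%s", parts[3], month_num, parts[1]))
}

parsed <- parse_dmy("29-MAY-2019")
format(parsed, "%Y-%m-%d")
```

Going through month.abb instead of as.Date(x, format = "%d-%b-%Y") is a deliberate choice here: %b is interpreted in the current locale, so it may fail on a machine that is not set to English.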
If you need to match multiple such dates in your text, use regmatches along with gregexpr (regexec only returns the first match together with its capture group):
text <- "Hello World 29-MAY-2019 Goodbye World 01-JAN-2018"
regmatches(text, gregexpr("\\b\\d{2}-[A-Z]+-\\d{4}\\b", text))[[1]]
[1] "29-MAY-2019" "01-JAN-2018"

Related

Transposition Cipher: How to solve

I have a ciphertext as follows, of which I do not know the keylength:
wlna evesy ehudre thnma upbum w onaw-dino olsile tf hndcseoorl foouA. bnsst uho,et r,vweeirh teorf efer tsw lae tsutas sfeccsan,ul eytd hduu sbhe edtmel faut s,b bo nte oefrroeth ad ofhlea fl, in nthdan olwe hacpe lenbe euce rdo edt acsuhutobslinre ut,h tae a sv tmsoeedswitiny cl ardesro ndipipn ino,ng trat reeac edimanthf o chme ay einrh iwhccodha ur stoorn uftentuauac aqnctina dse oy.realgea Lrsea ms nos fl kiceofdan wi tndieer eroscvto edsindre oun a usht-out e,bcoo n wesin o retou befwh,nd mahic vehy alax ep teindre hepe nsecho oftul sebox kybhi eswav chhenbe eeal aref dyrd rereHo.to r ow uaudhyrencli erngie ba hdconee edenvym r fogaeth terdne to h wospt hrheecore ed rveeseshi menss hhigtreeav edimanevo fr m erarytysee e wrot itn to frof hesulmt ohi d,wol cht aud sy e vrn apli. ltaead Hehdev ei blntycanee d irre bwdono ty wonrpesne s,owhf o ad omhare rmy bkall asml aefethe ndtert ohsun uu llaly ogare Osne.e tn he,owhlwat i stms obar pothebl he atteni slglEt nanhisminb, eg le
How would I be able to decipher this transposed ciphertext in a non-manual way like on https://tholman.com/other/transposition/ ?
I believe that the punctuation and spaces matter as well in this ciphertext.

Tables in RMarkdown: not using dataframe and/or data.table

I'm having some issues trying to insert a table into the text of an RMarkdown document.
After reading the rmarkdown cheat sheet (the "write with markdown" part) and following its instructions, my code looks like this:
Palavra-chave | Biblioteca do Conhecimento Online | Google acadêmico Portugal | Resultado encontrado
estilos de uso del espacio virtual | 2 | 10 | 12
estilos de uso do espaço virtual | 6 | 15 | 21
questionário estilos de uso do espaço virtual | 1 | 5 | 6
Total | 9 | 30 | 39
Unfortunately, in every output format I tried (PDF, .docx, or HTML) the table is not rendered correctly.
Any thoughts?
Thanks in advance!

extract number in string using regex

I have a data.frame like this :
SO <- data.frame(coiffure_IDF$SIREN, coiffure_IDF$L6_NORMALISEE )
coiffure_IDF.SIREN coiffure_IDF.L6_NORMALISEE
1 54805015 75008 PARIS
2 300086907 94210 ST MAUR DES FOSSES
3 300090453 94220 CHARENTON LE PONT
4 300209608 75007 PARIS
5 300570553 95880 ENGHIEN LES BAINS
6 301123626 75019 PARIS
7 301362349 92300 LEVALLOIS PERRET
I want to have this :
coiffure_IDF.SIREN codpos_norm ville
1 54805015 75008 PARIS
2 300086907 94210 ST MAUR DES FOSSES
3 300090453 94220 CHARENTON LE PONT
4 300209608 75007 PARIS
5 300570553 95880 ENGHIEN LES BAINS
6 301123626 75019 PARIS
7 301362349 92300 LEVALLOIS PERRET
So I used a regex:
SO2<- SO %>% extract(col="coiffure_IDF.L6_NORMALISEE", into=c("codpos_norm", "ville"), regex="(\\d+)\\s+(\\S+)")
The "codpos_norm" column comes out right, but in "ville" row 2 only contains "ST" instead of "ST MAUR DES FOSSES", row 3 only "CHARENTON", and so on.
I tried adding more \\s+ and \\S+ to the regex, but R complained that there are too many groups and that the regex must have exactly 2 groups.
What can I do?
You need to match the rest of the string in Group 2; the \S construct only matches non-whitespace chars, so \S+ stops at the first space. Use .+ to match any 1+ chars up to the string end:
SO2 <- SO %>% extract(col="coiffure_IDF.L6_NORMALISEE", into=c("codpos_norm", "ville"), regex="(\\d+)\\s+(.+)")
You may use .* instead if Group 2 should also match an empty string (when there is no text after the 1+ whitespaces).
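To see the difference between the two patterns side by side, here is a small self-contained sketch. It assumes the tidyr package is installed and rebuilds the first three rows of the sample data by hand; the names with_S and with_dot are only for illustration.

```r
library(tidyr)

# First three rows of the sample data, rebuilt by hand for illustration
SO <- data.frame(
  coiffure_IDF.SIREN = c(54805015, 300086907, 300090453),
  coiffure_IDF.L6_NORMALISEE = c("75008 PARIS",
                                 "94210 ST MAUR DES FOSSES",
                                 "94220 CHARENTON LE PONT"),
  stringsAsFactors = FALSE
)

# \\S+ stops at the first whitespace, so only the first word is captured
with_S <- tidyr::extract(SO, col = "coiffure_IDF.L6_NORMALISEE",
                         into = c("codpos_norm", "ville"),
                         regex = "(\\d+)\\s+(\\S+)")

# .+ runs to the end of the string, so the whole town name is captured
with_dot <- tidyr::extract(SO, col = "coiffure_IDF.L6_NORMALISEE",
                           into = c("codpos_norm", "ville"),
                           regex = "(\\d+)\\s+(.+)")

with_S$ville    # first word of each town only
with_dot$ville  # full town names
```

Calling tidyr::extract with the namespace prefix avoids a clash with magrittr::extract if both packages are attached.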

Array values being overwritten in gawk

Sample of File I'm reading in
011084,31.0581,-87.0547, 25.9 AL BREWTON 3 SSE
012813,30.5467,-87.8808, 7.0 AL FAIRHOPE 2 NE
013160,32.8347,-88.1342, 38.1 AL GAINESVILLE LOCK
013511,32.7017,-87.5808, 67.1 AL GREENSBORO
013816,31.8700,-86.2542, 132.0 AL HIGHLAND HOME
015749,34.7442,-87.5997, 164.6 AL MUSCLE SHOALS AP
017157,34.1736,-86.8133, 243.8 AL SAINT BERNARD
017304,34.6736,-86.0536, 187.5 AL SCOTTSBORO
GAWK Code
#!/bin/gawk
BEGIN{
FS=",";
OFS=",";
}
{
print $1,$2,$3,$4
station=""$1 #Forces to be string
#Save latitude
stationInfo[station][lat]=$2
print "lat",stationInfo[station][lat]
#Save longitude
stationInfo[station][lon]=$3
print "lon",stationInfo[station][lon]
#Now try printing the latitude again
#It will return the value of the longitude instead
print "lat",stationInfo[station][lat]
print "---------------"
}
Sample output
011084,31.0581,-87.0547, 25.9 AL BREWTON 3 SSE
lat,31.0581
lon,-87.0547
lat,-87.0547
---------------
012813,30.5467,-87.8808, 7.0 AL FAIRHOPE 2 NE
lat,30.5467
lon,-87.8808
lat,-87.8808
---------------
For some reason the value stored in stationInfo[station][lat] is being overwritten by the longitude. I'm at a loss for what in the world is going on.
I'm using GAWK 4.1.1 on Fedora 22
The problem is that the unquoted lat and lon are variables, not strings. Both are uninitialized, so they evaluate to the empty string, and the assignments stationInfo[station][lat]=$2 and stationInfo[station][lon]=$3 both write to stationInfo[station][""]. The second assignment therefore overwrites the first.
You need to quote lat and lon in those (and the other) lines to use strings instead of variables:
#!/bin/gawk -f
BEGIN {
    FS = ","
    OFS = ","
}
{
    print $1, $2, $3, $4
    station = "" $1    # Forces to be string
    # Save latitude under the string key "lat"
    stationInfo[station]["lat"] = $2
    print "lat", stationInfo[station]["lat"]
    # Save longitude under the string key "lon"
    stationInfo[station]["lon"] = $3
    print "lon", stationInfo[station]["lon"]
    # Print the latitude again; with quoted keys it is no longer
    # overwritten by the longitude
    print "lat", stationInfo[station]["lat"]
    print "---------------"
}

Row count for a column

In my subreport I want to display, for example:
Number of clients born in 1972: 34
The database holds a list of the clients' birth years.
How can I display this count in a field?
Here is a Sample of the data:
<Born> <Name> <BleBle>
1981 Mnr EH Van Niekerk 9517
1982 MEV A BELL 9520
1972 Mnr GI van der Westhuize 9517
1987 Mnr A Juyn 9517
1983 Mev MJC Prinsloo 9513
1972 Mnr WA Van Rensburg 9517
1989 Kmdt EL Van Der Colff 9514
1972 Mnr JS Jansen Van Vuuren 9517
So if this was all the data the output would have to be
Number of clients born in 1972: 3
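Create a variable BORN_IN_1972.
Set its "Variable class" to java.lang.Integer.
Set "Calculation" to "Count".
Set "Variable Expression" to an expression that is non-null only for the year you want, for example $F{Born}.intValue() == 1972 ? $F{Born} : null (this assumes Born is a java.lang.Integer field). "Count" only counts non-null values, so a plain $F{Born} would count every row.
Set "Initial Value Expression" to 0.
Then add a "Summary" band to your report, and put the static text "Number of clients born in 1972:" and a text field "$V{BORN_IN_1972}" into it.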
Assuming birth year is a string:
SELECT COUNT(*)
FROM MyClients
WHERE birth_year = '1972'
And if birth year is being used as an input control:
SELECT COUNT(*)
FROM MyClients
WHERE birth_year = $P{birth_year}
To count only non-zero records in Jasper, use the expression below:
( $F{test} == 0.0 ? null : $F{test} )
