How to extract a date from text - r

I tried to extract a date from the following text. Unfortunately, it keeps giving me a warning and the result is NA.
I have the following text:
"IRA-401K Investment Assets Under Management (AUM) As of July 31, 2018 BMG Funds
$217,743,573 BMG BullionBars $45,176,561 TOTAL $262,920,134 Physical Holdings Download
Scotiabank BMG BullionBars List Download Brinks BMG BullionBars List Holdings by Ounces As
of July 31, 2018 Gold Bars 21,132.496 Silver Bars 453,531.574 Silver Coins
80,500 Platinum Bars"
The text contains the following date: July 31, 2018. The date appears twice in the text.
I used the following code to extract the dates from the text:
test_take <- lapply(cleanurl_text, parse_date_time, orders = "mdy",
                    locale = Sys.setlocale('LC_TIME', locale = "English_Canada.1252"))
I get the following warning:
Warning message:
All formats failed to parse. No formats found.
When I include exact = TRUE:
test_take <- lapply(as.character(cleanurl_text), parse_date_time, orders = "mdy",
                    locale = Sys.setlocale('LC_TIME', locale = "English_Canada.1252"), exact = TRUE)
I get the following warning:
Warning message:
1 failed to parse.
The resulting object still contains NA.

The following regex can extract the date in the posted format.
pattern <- paste(month.name, collapse = "|")
pattern <- paste0("(", pattern, ")\\s\\d{1,2}.{1,2}\\d{4}")
m <- gregexpr(pattern, cleanurl_text)
regmatches(cleanurl_text, m)
#[[1]]
#[1] "July 31, 2018" "July 31, 2018"
Note that this can be done in a single line, regmatches(cleanurl_text, gregexpr(pattern, cleanurl_text)), but I have opted for two lines to make it more readable.
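Once the date strings have been extracted, they can be parsed into proper date objects. A small follow-up sketch (not part of the original answer), reusing the match object m from above and the parse_date_time call from the question, assuming an English locale for the month names:
library(lubridate)
date_strings <- regmatches(cleanurl_text, m)[[1]]   # "July 31, 2018" "July 31, 2018"
parse_date_time(date_strings, orders = "mdy")
# [1] "2018-07-31 UTC" "2018-07-31 UTC"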

Related

How to extract patterns along with dates in string using R?

I want to extract the dates, together with a preceding pattern (the dates come after the pattern), from sentences.
text1 <- "The number of subscribers as of December 31, 2022 of Netflix increased by 1.15% compared to the previous year."
text2 <- "Netflix's number of subscribers in January 10, 2023 has grown more than 1.50%."
The pattern is number of subscribers, followed by the date in Month Day, Year format. Sometimes there is as of or in, or nothing at all, between the pattern and the date.
I have tried the following script.
find_dates <- function(text){
  pattern <- "\\bnumber\\s+of\\s+subscribers\\s+(\\S+(?:\\s+\\S+){3})" # pattern and next 3 words
  str_extract(text, pattern)
}
However, this extracts the in-between words too, which I would like to ignore.
Desired output:
find_dates(text1)
'number of subscribers December 31, 2022'
find_dates(text2)
'number of subscribers January 10, 2023'
An approach using stringr
library(stringr)
find_Dates <- function(x) paste0(str_extract_all(x,
  "\\bnumber\\b (\\b\\S+\\b ){2}|\\b\\S+\\b \\d{2}, \\d{4}")[[1]], collapse = "")
find_Dates(text1)
[1] "number of subscribers December 31, 2022"
# all texts
lapply(c(text1, text2), find_Dates)
[[1]]
[1] "number of subscribers December 31, 2022"
[[2]]
[1] "number of subscribers January 10, 2023"
text1 <- "The number of subscribers as of December 31, 2022 of Netflix increased by 1.15% compared to the previous year."
text2 <- "Netflix's number of subscribers in January 10, 2023 has grown more than 1.50%."
find_dates <- function(text){
  # pattern <- "(\\bnumber\\s+of\\s+subscribers)\\s+(\\S+(?:\\s+\\S+){3})" # pattern and next 3 words
  pattern <- "(\\bnumber\\s+of\\s+subscribers)(?:\\s+as\\s+of\\s|\\s+in\\s+)?(\\S+(\\s+\\S+){2})" # pattern, optional connector, then the date
  str_extract(text, pattern, 1:2)
}
find_dates(text1)
# [1] "number of subscribers" "December 31, 2022"
find_dates(text2)
# [1] "number of subscribers" "January 10, 2023"

Reading a Fixed-Width Multi-Line File in R

I have data from a PDF file that I am reading into R.
library(pdftools)
library(readr)
library(stringr)
library(dplyr)
results <- pdf_text("health_data.pdf") %>%
  readr::read_lines()
When I read it in with this method, a character vector is returned. Multi-line information for a given column is spread across different lines (and not all columns for each observation will have data).
A reproducible example is below:
ex_result <- c("03/11/2012 BES 3RD BES inc and corp no- no- sale -",
               " group with sale no- sale",
               " boxes",
               "03/11/2012 KRS six and firefly 45 mg/dL 100 - 200",
               " seven",
               "03/11/2012 KRS core ladybuyg 55 mg/dL 42 - 87")
I am trying to use read_fwf with fwf_widths as I read that it can handle multi-line input if you give the widths for multi-line records.
ex_result_width <- read_fwf(ex_result, fwf_widths(
  c(10, 24, 16, 7, 5, 15, 100),
  c("date", "name", "description", "value", "unit", "range", "ab_flag")))
I determined the widths by calling nchar() in the console on the longest string I saw for each column.
Using fwf_widths I can get the date column by setting its width to 10 characters, but if I set the name column to, say, 24 characters, the values come back concatenated across columns instead of the continuation rows being folded into their records. This cascades so that the remaining columns hold the wrong data, and whatever no longer fits is dropped.
Ultimately this is the desired output:
desired_output <- tibble(
  date = c("03/11/2012", "03/11/2012", "03/11/2012"),
  name = c("BES 3RD group with boxes", "KRS six and seven", "KRS core"),
  description = c("BES inc and corp", "firefly", "ladybug"),
  value = c("no-sale", "45", "55"),
  unit = c("", "mg/dL", "mg/dL"),
  range = c("no-sale no-sale", "100 - 200", "42 - 87"),
  ab_flag = c("", "", ""))
I am trying to figure out:
How can I get fwf_widths to recognize multi-line text and missing columns?
Is there a better way to read in the PDF file to account for multi-line values and missing columns? (I was following this tutorial, but it uses a more structured PDF file.)
# keep only the lines that contain a date, i.e. the lines that start a record
str_subset(ex_result, pattern = "/\\d{2}/")
[1] "03/11/2012 BES 3RD BES inc and corp no- no- sale -"
[2] "03/11/2012 KRS six and firefly 45 mg/dL 100 - 200"
[3] "03/11/2012 KRS core ladybuyg 55 mg/dL 42 - 87"

Retrieve date modified of a file from an FTP Server

Building off of this question (Retrieve modified DateTime of a file from an FTP Server), it's clear how to get the date modified value. However, the full date is not returned even though it's visible from the FTP site.
This shows how to get the date modified values for files at ftp://ftp.FreeBSD.org/pub/FreeBSD/
library(curl)
library(stringr)
con <- curl("ftp://ftp.FreeBSD.org/pub/FreeBSD/")
dat <- readLines(con)
close(con)
no_dirs <- grep("^d", dat, value=TRUE, invert=TRUE)
date_and_name <- sub("^[[:alnum:][:punct:][:blank:]]{43}", "", no_dirs)
dates <- sub('\\s[[:alpha:][:punct:][:alpha:]]+$', '', date_and_name)
dates
## [1] "May 07 2015" "Apr 22 15:15" "Apr 22 10:00"
Some dates are in month/day/year format, others are in month/day hour:minute format.
Looking at the FTP site, all dates are in month/day/year hour:minute:second format.
I assume it's got something to do with Unix format standards (explained in FTP details command doesn't seem to return the year the file was modified, is there a way around this?). It would be nice to get the full date.
If you use download.file you get an HTML representation of the directory, which you can parse with the xml2 package.
read_ftp <- function(url)
{
  tmp <- tempfile()
  download.file(url, tmp, quiet = TRUE)
  html <- xml2::read_html(readChar(tmp, 1e6))
  file.remove(tmp)
  lines <- strsplit(xml2::xml_text(html), "[\n\r]+")[[1]]
  lines <- grep("(\\d{2}/){2}\\d{4}", lines, value = TRUE)
  result <- read.table(text = lines, stringsAsFactors = FALSE)
  setNames(result, c("Date", "Time", "Size", "File"))
}
Which allows you to just do this:
read_ftp("ftp://ftp.FreeBSD.org/pub/FreeBSD/")
#>         Date    Time      Size        File
#> 1 05/07/2015 12:00AM     4,259  README.TXT
#> 2 04/22/2020 08:00PM        35   TIMESTAMP
#> 3 04/22/2020 08:00PM Directory development
#> 4 04/22/2020 10:00AM     2,325   dir.sizes
#> 5 11/12/2017 12:00AM Directory         doc
#> 6 11/12/2017 12:00AM Directory       ports
#> 7 04/22/2020 08:00PM Directory    releases
#> 8 11/09/2018 12:00AM Directory   snapshots
Created on 2020-04-22 by the reprex package (v0.3.0)
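If a proper date-time is wanted afterwards, the Date and Time columns can be combined and parsed. A small follow-up sketch (not part of the original answer), assuming an English locale for the AM/PM indicator; the listing name is just illustrative:
listing <- read_ftp("ftp://ftp.FreeBSD.org/pub/FreeBSD/")
as.POSIXct(paste(listing$Date, listing$Time),
           format = "%m/%d/%Y %I:%M%p", tz = "UTC")
# e.g. "2015-05-07 00:00:00 UTC" for the README.TXT row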

Writing a function to clean string data and rename columns

I am writing a function to be applied to many individual matrices. Each matrix has 5 columns of string text. I want to remove a piece of one string that exactly matches the string in another element, apply a couple more stringr functions, transform the result into a data frame, and rename the columns; in the last step I want to add a number to the end of each column name, since I will apply this to many matrices and need to identify the columns later.
This is very similar to another function I wrote, so I can't figure out why it won't work. When I fill in the inputs and run each line individually like this, it works perfectly:
Review1[,4] <- str_remove(Review1[,4], Review1[,3])
Review1[,4] <- str_sub(Review1[,4], 4, -4)
Review1[,4] <- str_trim(Review1[,4], "both")
Review1 <- as.data.frame(Review1)
colnames(Review1) <- c("Title", "Rating", "Date", "User", "Text")
Review1 <- Review1 %>% rename_all(paste0, 1)
But when I run the function nothing seems to happen at all.
Transform_Reviews <- function(x, y, z, a) {
  x[,y] <- str_remove(x[,y], x[,z])
  x[,y] <- str_sub(x[,y], 4, -4)
  x[,y] <- str_trim(x[,y], "both")
  x <- as.data.frame(x)
  colnames(x) <- c("Title", "Rating", "Date", "User", "Text")
  x <- x %>% rename_all(paste0, a)
}
Transform_Reviews(Review1, 4, 3, 1)
This is the only warning message I get. I also receive it when I run str_remove on its own, but in that case the elements still change; when I run the UDF, nothing changes at all.
Warning messages:
1: In stri_replace_first_regex(string, pattern, fix_replacement(replacement), ... :
empty search patterns are not supported
This is an example of the part of Review1 that I'm working with.
     [,3]           [,4]
[1,] "6 April 2014" "By Copnovelist on 6 April 2014"
[2,] "18 Dec. 2015" "By kenneth bell on 18 Dec. 2015"
[3,] "26 May 2015"  "By Simon.B :-) on 26 May 2015"
[4,] "22 July 2013" "By Lilla Lukacs on 22 July 2013"
This is what I want the output to look like:
         Date1        User1
1 6 April 2014  Copnovelist
2 18 Dec. 2015 kenneth bell
3  26 May 2015  Simon.B :-)
4 22 July 2013 Lilla Lukacs
I realized I just needed to use an assignment operator to see my function work: the function modifies a copy of its argument and returns the result, so that result has to be captured.
Review1 <- Transform_Reviews(Review1, 4, 3, 1)
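A minimal illustration of why the assignment matters (not specific to the reviews data): R functions work on a copy of their arguments, and when the last expression in a function is itself an assignment, the result is returned invisibly, so nothing prints and the original object stays untouched until the return value is captured.
f <- function(m) {
  m[1, 1] <- "changed"     # modifies the local copy only
  m <- as.data.frame(m)    # last expression is an assignment: the result is returned invisibly
}
mat <- matrix("original", 2, 2)
f(mat)        # prints nothing, and mat is untouched
mat           # still the original matrix
mat <- f(mat) # capturing the return value is what makes the change stick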

Creating a unified time-series, with dates coming from different (natural) languages

I am using the as.Date function as follows:
x$time_date <- as.Date(x$time_date, format = "%H:%M - %d %b %Y")
This worked fine until I saw a lot of NA values in the output, which I traced back to some of the dates stemming from a different language: German.
My English dates look like this: 18:00 - 10 Dec 2014
Where the German equivalent is: 18:00 - 10 Dez 2014
The month December is abbreviated the German way. This is not recognised by the as.Date function. I have the same problem for five other months:
Mar - März
May - Mai
Jun - Juni
Jul - Juli
Oct - Okt
This looks like it would be of use, but I am unsure of how to implement it for 'unrecognised' formats:
How to change multiple Date formats in same column
I attempted to just go through and use gsub to replace all the occurrences of German months, but without luck. x below is the data.table and I work on just the time_date column:
x$time_date <- gsub("(März)?", "Mar", x$time_date) %>%
  gsub("(Mai)?", "May", .) %>%
  gsub("(Juni)?", "Jun", .) %>%
  gsub("(Juli)?", "Jul", .) %>%
  gsub("(Okt)?", "Oct", .) %>%
  gsub("(Dez)?", "Dec", .)
Not only did this not work, but it is also a very slow process and I have nearly 20 GB of pure .csv files to work through.
In the as.Date documentation there is mention of different locales / languages, but not of how to work with several simultaneously. I also found instructions on how to use different languages; however, my data is all mixed, so I can only think of a conditional loop that uses the correct language for each file, which would also be slow.
Is there a known workaround for this, which I can't find?
Create a table tab that contains all the translations and then use subscripting to do the actual translation. The code below seems to work for me on Windows, provided your input abbreviations are the same as the standard ones generated; the precise language names ("German", etc.) may vary depending on your system. See ?Sys.setlocale for more information. Also, if the abbreviations in your input differ from the ones generated here, you will have to add them to tab yourself, e.g. tab <- c(tab, Juli = "Jul").
langs <- c("French", "German", "English")
tab <- unlist(lapply(langs, function(lang) {
  Sys.setlocale("LC_TIME", lang)
  nms <- format(ISOdate(2000, 1:12, 1), "%b")
  setNames(month.abb, nms)
}))
x <- c("18:00 - 10 Juli 2014", "18:00 - 10 Mai 2014") # test input
source_month <- gsub("[^[:alpha:]]", "", x)
mapply(sub, source_month, tab[source_month], x, USE.NAMES = FALSE)
giving:
[1] "18:00 - 10 Jul 2014" "18:00 - 10 May 2014"
