Extract dates from a vector of character strings - r

I have vector with two elements. Each element contains a string of characters
with two sets of dates. I need to extract the latter of these two dates,
and make a new vector or list with them.
#webextract vector
webextract <- list("The Employment Situation, December 2006 January 5 \t 8:30 am\r","The Employment Situation, January 2007 \tFeb. 2, 2007\t 8:30 am \r")
#This is what the output of webextract looks like:
[[1]]
[1] The Employment Situation, December 2006 January 5 \t 8:30 am\r
[[2]]
[1] The Employment Situation, January 2007 \tFeb. 2, 2007\t 8:30 am \r
webextract is the result of web scraping a URL with plain text, which is why it looks like that. What I need to extract is "January 5" and "Feb. 2". I have been experimenting with grep and strsplit and failed to get anywhere. I have gone through all related SO questions without success. Thank you for your help.

We can try gsub after unlisting 'webextract':
gsub("^\\D+\\d+\\s+|(,\\s+\\d+)*\\D+\\d+:.*$", "", unlist(webextract))
#[1] "January 5" "Feb. 2"
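An alternative sketch with stringr: match a capitalized month name (optionally abbreviated with a dot) followed by a day number sitting just before the tab-separated time. The pattern is an assumption based on the two sample strings only:

```r
library(stringr)

webextract <- list("The Employment Situation, December 2006 January 5 \t 8:30 am\r",
                   "The Employment Situation, January 2007 \tFeb. 2, 2007\t 8:30 am \r")
# Month name + day, immediately before an optional ", YYYY" and the tab
str_extract(unlist(webextract), "[A-Z][a-z]+\\.?\\s\\d{1,2}(?=(,\\s\\d{4})?\\s*\\t)")
#> [1] "January 5" "Feb. 2"
```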

Optional pattern part in regex lookbehind

In the example below I am trying to extract the text between 'Supreme Court' or 'Supreme Court of the United States' and the next date (including the date). The result below is not what I intended since result 2 includes "of the United States".
I assume the error is due to the .*? part since . can also match 'of the United States'. Any ideas how to exclude it?
I guess more generally speaking, the question is how to include an optional 'element' into a lookbehind (which seems not to be possible since ? makes it a non-fixed length input).
Many thanks!
library(tidyverse)
txt <- c("The US Supreme Court decided on 2 April 2020 The Supreme Court of the United States decided on 5 March 2011 also.")
str_extract_all(txt, regex("(?<=Supreme Court)(\\sof the United States)?.*?\\d{1,2}\\s\\w+\\s\\d{2,4}"))
#> [[1]]
#> [1] " decided on 2 April 2020"
#> [2] " of the United States decided on 5 March 2011"
Created on 2021-12-09 by the reprex package (v2.0.1)
I also tried
str_extract_all(txt, regex("(?<=(Supreme Court)|(Supreme Court of the United States)).*?\\d{1,2}\\s\\w+\\s\\d{2,4}"))
however the result is the same.
In this case, I would prefer the PCRE engine available in base R (perl = TRUE) rather than the ICU engine that stringr/stringi uses.
pattern <- "Supreme Court (of the United States ?)?\\K.*?\\d{1,2}\\s\\w+\\s\\d{2,4}"
regmatches(txt, gregexpr(pattern, txt, perl = TRUE))
[[1]]
[1] "decided on 2 April 2020" "decided on 5 March 2011"
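For readers unfamiliar with \K: it resets the start of the reported match, discarding everything matched so far, which sidesteps PCRE's fixed-width lookbehind restriction. A minimal illustration:

```r
# The engine consumes "foo", then \K drops it from the reported match,
# so only the digits are returned -- like a variable-width lookbehind.
regmatches("foo123", regexpr("foo\\K\\d+", "foo123", perl = TRUE))
#> [1] "123"
```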
You can do this with str_match_all and group capture:
str_match_all(txt, regex("Supreme Court(?:\\sof the United States)?(.*?\\d{1,2}\\s\\w+\\s\\d{2,4})")) %>%
.[[1]] %>% .[, 2]
[1] " decided on 2 April 2020" " decided on 5 March 2011"
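Note the captured group keeps the leading space; a trimws() pass cleans that up (same txt as above):

```r
library(stringr)

txt <- "The US Supreme Court decided on 2 April 2020 The Supreme Court of the United States decided on 5 March 2011 also."
# Capture group 1 holds everything after the (optionally extended) court name
m <- str_match_all(txt, "Supreme Court(?:\\sof the United States)?(.*?\\d{1,2}\\s\\w+\\s\\d{2,4})")[[1]][, 2]
trimws(m)
#> [1] "decided on 2 April 2020" "decided on 5 March 2011"
```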

How to extract a fixed number of characters before a string in R

I have a text that contains somewhere in the document a citation to a court case, such as
x <- "2009 U.S. LEXIS"
I know it is always a four-digit year plus a space in front of the pattern "U.S. LEXIS". How should I extract these four digits of years?
Thanks
I think the data/vector you gave is too minimal to fully illustrate your problem.
UPDATE try this
str_extract_all(x, "\\d{4}(?=\\sU.S.\\sLEXIS)")
[[1]]
[1] "2009" "2015" "1990"
OR to extract these as numbers
lapply(str_extract_all(x, "\\d{4}(?=\\sU.S.\\sLEXIS)"), as.numeric)
[[1]]
[1] 2009 2015 1990
OLD ANSWER I am also fairly new to regex, so my solution may not be a very clean method. Your case is typically one of searching nested groups in regex patterns. Still, you can try this method:
x <- "aabv 2009 U.S. LEXIS abcs aa 2015 U.S. LEXIS 45 fghhg ds fdavfd 1990 U.S. LEXIS bye bye!"
> x
[1] "aabv 2009 U.S. LEXIS abcs aa 2015 U.S. LEXIS 45 fghhg ds fdavfd 1990 U.S. LEXIS bye bye!"
Now follow these steps
library(stringr)
lapply(str_extract_all(x, "(\\d{4})\\sU.S.\\sLEXIS"), str_extract, pattern = "(\\d{4})")
[[1]]
[1] "2009" "2015" "1990"
Typically "((\\d{4})\\sU.S.\\sLEXIS)" would have worked as a regex pattern, but I am not sure about nested groups in R, so I used lapply here. Basically str_extract_all(x, "(\\d{4})\\sU.S.\\sLEXIS") returns all the citations. Try this.
You can try :
x <- "2009 U.S. LEXIS"
as.numeric(sub('.*?(\\d{4}) U.S. LEXIS', '\\1', x))
#[1] 2009
Using stringr::str_extract :
as.numeric(stringr::str_extract(x, '\\d{4}(?= U.S. LEXIS)'))
We can use parse_number from readr
library(readr)
parse_number(x)
#[1] 2009
data
x <- "2009 U.S. LEXIS"
The substr function from base R solves it, since the year is the first four characters:
substr(x,1,4)
If you need it as a number, wrap it in as.numeric:
as.numeric(substr(x,1,4))
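One caveat on the regex answers above: the dots in U.S. are unescaped, so . matches any character; escaping them makes the pattern strict. A base R sketch on the multi-citation string from the earlier answer:

```r
x <- "aabv 2009 U.S. LEXIS abcs aa 2015 U.S. LEXIS 45 fghhg ds fdavfd 1990 U.S. LEXIS bye bye!"
# \\. matches a literal dot; the lookahead requires " U.S. LEXIS" to follow the year
regmatches(x, gregexpr("\\d{4}(?=\\sU\\.S\\.\\sLEXIS)", x, perl = TRUE))[[1]]
#> [1] "2009" "2015" "1990"
```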

Writing a function to clean string data and rename columns

I am writing a function to be applied to many individual matrices. Each matrix has 5 columns of string text. I want to remove a piece of one string which matches the string inside another element exactly, then apply a couple more stringr functions, transform it into a data frame, and rename the columns. In the last step I want to add a number to the end of each column name, since I will apply this to many matrices and need to identify the columns later.
This is very similar to another function I wrote so I can't figure out why it won't work. I tried running each line individually by filling in the inputs like this and it works perfectly:
Review1[,4] <- str_remove(Review1[,4], Review1[,3])
Review1[,4] <- str_sub(Review1[,4], 4, -4)
Review1[,4] <- str_trim(Review1[,4], "both")
Review1 <- as.data.frame(Review1)
colnames(Review1) <- c("Title", "Rating", "Date", "User", "Text")
Review1 <- Review1 %>% rename_all(paste0, 1)
But when I run the function nothing seems to happen at all.
Transform_Reviews <- function(x, y, z, a) {
  x[,y] <- str_remove(x[,y], x[,z])
  x[,y] <- str_sub(x[,y], 4, -4)
  x[,y] <- str_trim(x[,y], "both")
  x <- as.data.frame(x)
  colnames(x) <- c("Title", "Rating", "Date", "User", "Text")
  x <- x %>% rename_all(paste0, a)
}
Transform_Reviews(Review1, 4, 3, 1)
This is the only warning message I get. I also receive this when I run the str_remove function individually, but it still changes the elements. But it changes nothing when I run the UDF.
Warning messages:
1: In stri_replace_first_regex(string, pattern, fix_replacement(replacement), ... :
empty search patterns are not supported
This is an example of the part of Review1 that I'm working with.
[,3] [,4]
[1,] "6 April 2014" "By Copnovelist on 6 April 2014"
[2,] "18 Dec. 2015" "By kenneth bell on 18 Dec. 2015"
[3,] "26 May 2015" "By Simon.B :-) on 26 May 2015"
[4,] "22 July 2013" "By Lilla Lukacs on 22 July 2013"
This is what I want the output to look like:
Date1 User1
1 6 April 2014 Copnovelist
2 18 Dec. 2015 kenneth bell
3 26 May 2015 Simon.B :-)
4 22 July 2013 Lilla Lukacs
I realized I just needed to use an assignment operator to see my function work.
Review1 <- Transform_Reviews(Review1, 4, 3, 1)
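The underlying reason: R passes arguments by value, so the function modifies a local copy of Review1 and returns the result; without assignment the caller's object never changes. A tiny demonstration of the semantics:

```r
f <- function(v) { v[1] <- 0; v }  # modifies a local copy, then returns it
a <- c(1, 2)
f(a)       # the returned copy has 0 in position 1 ...
a          # ... but a itself is untouched
a <- f(a)  # assigning the result is what persists the change
```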

Regex - matching text after the nth '\n'

I have a sample text like this:
"\n Apr 15, 2019\n 12:00 PM – 3:00 PMWMC 2502, Burnaby\n "
I want to extract the date, time and location separately.
What I am thinking is to extract whatever comes before the second "\n", which should give me "\n Apr 15, 2019". Then I can remove the "\n" and white spaces.
Then for the time, I want to remove whatever before the second "\n" and whatever after "PM".
For the location, just keep whatever after PM, then remove the "\n" and white spaces.
Here is the result I want:
[1] Apr 15, 2019
[2] 12:00 PM – 3:00 PM
[3] WMC 2502, Burnaby
Could anyone tell me how to do this? Doing it in some other ways is fine too.
Thanks.
Here is a base R one-liner using strsplit
sapply(strsplit(ss, "(\\s{2,}|(?<=[AP]M)(?=\\w))", perl = T), function(x) x[x != ""])
#     [,1]
#[1,] "Apr 15, 2019"
#[2,] "12:00 PM – 3:00 PM"
#[3,] "WMC 2502, Burnaby"
It's difficult to say how well this generalises on account of the very small sample string.
Explanation: We split ss on either a stretch of at least 2 whitespaces "\\s{2,}" (this avoids splitting on a single whitespace), or at a position that is preceded by "[AP]M" through a positive look-behind and followed by a word character (i.e. not a whitespace) through a positive look-ahead "(?<=[AP]M)(?=\\w)".
Sample data
ss <- "\n Apr 15, 2019\n 12:00 PM – 3:00 PMWMC 2502, Burnaby\n "
This should work if your strings share the same structure as the sample text.
library(dplyr)
library(stringr)
str_split(x, "\\n", simplify = T) %>%
  trimws() %>%
  as.data.frame() %>%
  mutate(
    time = str_match(V3, "^.+PM"),
    location = gsub(time, "", V3)
  ) %>%
  select(
    date = 2,
    time,
    location
  )
# date time location
# 1 Apr 15, 2019 12:00 PM – 3:00 PM WMC 2502, Burnaby
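A base R sketch along the lines the question describes (split on newlines, then split the middle piece at the last "PM"); this assumes the same structure as the sample string:

```r
ss <- "\n Apr 15, 2019\n 12:00 PM – 3:00 PMWMC 2502, Burnaby\n "
parts <- trimws(strsplit(ss, "\n")[[1]])
parts <- parts[parts != ""]  # keep "Apr 15, 2019" and the time+location piece
time <- regmatches(parts[2], regexpr("^.*[AP]M", parts[2]))  # greedy: up to the last PM
location <- sub("^.*[AP]M", "", parts[2])                    # whatever follows the last PM
c(parts[1], time, location)
#> [1] "Apr 15, 2019"       "12:00 PM – 3:00 PM" "WMC 2502, Burnaby"
```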

How to sort by Library of Congress Classification (LCC) number in R

Library of Congress Classification numbers are used in libraries to give call numbers to things so they can be ordered on the shelf. They can be simple or quite complex, with a few mandatory parts but many optional. (See "entering call numbers in 050" on 050 Library of Congress Call Number for how they break down, or lc_callnumber for a Ruby tool that sorts them.)
I would like to sort by LCC number in R. I've looked at Sort a list of nontrivial elements in R and Sorting list of list of elements of a custom class in R? but haven't got it figured out.
Here are four call numbers, entered in sorted order:
call_numbers <- c("QA 7 H3 1992", "QA 76.73 R3 W53 2015", "QA 90 H33 2016", "QA 276.45 R3 A35 2010")
sort sorts them by character, so 276 < 7 < 76.73 < 90.
> sort(call_numbers)
[1] "QA 276.45 R3 A35 2010" "QA 7 H3 1992" "QA 76.73 R3 W53 2015" "QA 90 H33 2016"
To sort them properly I think I'd have to define a class and then some methods on it, like this:
library(stringr)
class(call_numbers) <- "LCC"
## Just pick out the letters and digits for now, leave the rest
## until sorting works, then work down more levels.
lcc_regex <- '([[:alpha:]]+?) ([[:digit:]\\.]+?) (.*)'
"<.LCC" <- function(x, y) {
  x_lcc <- str_match(x, lcc_regex)
  y_lcc <- str_match(y, lcc_regex)
  if(x_lcc[2] < y_lcc[2]) return(x)
  if(as.integer(x_lcc[3]) < as.integer(y_lcc[3])) return(x)
}
"==.LCC" <- function(x, y) {
  x_lcc <- str_match(x, lcc_regex)
  y_lcc <- str_match(y, lcc_regex)
  x_lcc[2] == y_lcc[2] && x_lcc[3] == y_lcc[3]
}
">.LCC" <- function(x, y) {
  x_lcc <- str_match(x, lcc_regex)
  y_lcc <- str_match(y, lcc_regex)
  if(x_lcc[2] > y_lcc[2]) return(x)
  if(as.integer(x_lcc[3]) > as.integer(y_lcc[3])) return(x)
}
This doesn't change the sort order. I haven't defined a subset method ("[.myclass") because I have no idea what it should be.
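As an aside on why the methods above leave sort() unchanged: comparison methods must return logical vectors, not elements, and sort() on a classed vector actually routes through xtfrm(), which should return a numeric sort key. A hedged sketch of that approach, handling only the letter class and the first number (like the regex above):

```r
# xtfrm.LCC returns ranks; order() and sort() pick it up automatically for class "LCC"
xtfrm.LCC <- function(x) {
  parts <- strsplit(unclass(x), " ")
  cls <- vapply(parts, `[`, "", 1)              # letter class, e.g. "QA"
  num <- as.numeric(vapply(parts, `[`, "", 2))  # first number, e.g. 76.73
  order(order(cls, num))                        # rank vector: a valid numeric key
}

call_numbers <- c("QA 7 H3 1992", "QA 76.73 R3 W53 2015",
                  "QA 90 H33 2016", "QA 276.45 R3 A35 2010")
class(call_numbers) <- "LCC"
sort(call_numbers)  # now sorts by class letters, then numerically
```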
This might be a simpler approach. This assumes every number has the following format: 2-letter code, space, number, space, letter-number, space, ..., year.
The strategy is to split the LCC number on the spaces, obtain columns of data for the first three fields, and then sort the columns sequentially with the order function.
call_numbers <- c("QA 7 H3 1992", "QA 76.73 R3 W53 2015", "QA 90 H33 2016", "QA 276.45 R3 A35 2010")
#split on the spaces
split<-strsplit(call_numbers, " " )
#Retrieve the 2 letter code
letters<-sapply(split, function(x){x[1]})
#retrieve the 2nd number group and convert to numeric values for sorting
second<-sapply(split, function(x){as.numeric(x[2])})
#obtain the 3rd grouping
third<-sapply(split, function(x){x[3]})
#find the year
year<-sapply(split, function(x){x[length(x)]})
df<-data.frame(call_numbers)
#sort based on the letter code, the numeric part, and the third group
call_numbers[order(letters, second, third)]
For this limited dataset the technique works.
mixedsort from the gtools package (on CRAN) turns out to do just the trick:
library(gtools)
call_numbers <- c("QA 7 H3 1992", "QA 76.73 R3 W53 2015", "QA 90 H33 2016", "QA 276.45 R3 A35 2010")
mixedsort(call_numbers)
## [1] "QA 7 H3 1992" "QA 76.73 R3 W53 2015" "QA 90 H33 2016" "QA 276.45 R3 A35 2010"
Further, mixedorder can be used to sort a data frame by one column.
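For instance, with a hypothetical data frame holding the call numbers in a column:

```r
library(gtools)

call_numbers <- c("QA 7 H3 1992", "QA 76.73 R3 W53 2015",
                  "QA 90 H33 2016", "QA 276.45 R3 A35 2010")
books <- data.frame(call_number = sample(call_numbers))  # shuffled on purpose
books[mixedorder(books$call_number), , drop = FALSE]     # rows back in shelf order
```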
This is a special case of what was answered earlier in How to sort a character vector where elements contain letters and numbers in R?
I feel like I spent way too much time on figuring out a solution to exactly what you're trying to do --only mine was for JavaScript. But it basically comes down to the notion of "normalization" of these numbers so that they can be sorted alphabetically.
Maybe this solution can be used and ported over to R. At the very least, hopefully this could get you started. It involves some regular expressions and a little bit of extra scripting to get the call numbers into a state where they can be sorted.
https://github.com/rayvoelker/js-loc-callnumbers/blob/master/locCallClass.js
Good luck!
Easiest (and elegant) way: using str_sort from the stringr package
# install.packages("stringr") ## Uncomment if not already installed
library(stringr)
str_sort(call_numbers, numeric = TRUE)
[1] "QA 7 H3 1992" "QA 76.73 R3 W53 2015" "QA 90 H33 2016"
[4] "QA 276.45 R3 A35 2010"
