Removing dates and all junk from texts using R

I am cleaning a huge dataset made up of tens of thousands of texts using R. I know regular expressions will do the job conveniently, but I am not good at using them. I have combed Stack Overflow but could not find a solution. This is my dummy data:
foo_data <- c("03 / 05 / 2016 Education is good: WO0001982",
"04/02/2016 Health is a priority: WAI000553",
"09/ 08/2016 Economy is bad: 2031CE8D",
": : 21 / 05 / 13: Vehicle license is needed: DPH2790 ")
I want to remove all the dates, punctuation and IDs, and I want my result to be this:
[1] "Education is good"
[2] "Health is a priority"
[3] "Economy is bad"
[4] "Vehicle license is needed"
Any help in R will be appreciated.

I think specificity is in order here:
First, let's remove the date-like strings. I'll assume either mm/dd/yyyy or dd/mm/yyyy, where the first two can be 1-2 digits, and the third is always 4 digits. If this is variable, the regex can be changed to be a little more permissive:
foo_data2 <- gsub("\\d{1,2}\\s*/\\s*\\d{1,2}\\s*/\\s*\\d{4}", "", foo_data)
foo_data2
# [1] " Education is good: WO0001982" " Health is a priority: WO0002021" " Economy is bad: WO001999" " Vehicle license is needed: WO001050"
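If the year can also be written with two digits (as in 21 / 05 / 13), a more permissive variant of the pattern above might look like the sketch below. The vector x and the name date_re are made up for illustration, and \\d{2,4} would also accept three-digit years, which is assumed to be harmless here.

```r
# Hypothetical sample mixing a 4-digit and a 2-digit year
x <- c("04/02/2016 Health is a priority",
       "21 / 05 / 13 Vehicle license is needed")

# Same pattern as above, except that the year may be 2 to 4 digits long
date_re <- "\\d{1,2}\\s*/\\s*\\d{1,2}\\s*/\\s*\\d{2,4}"

# Remove the dates, then trim the whitespace they leave behind
no_dates <- trimws(gsub(date_re, "", x))
no_dates
```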
From here, the abbreviations seem rather easy to remove, as the other answers have demonstrated. You have not specified if the abbreviation is hard-coded to be anything after a colon, numbers prepended with "WO", or just some one-word combination of letters and numbers. Those could be:
gsub(":.*", "", foo_data2)
# [1] " Education is good" " Health is a priority" " Economy is bad" " Vehicle license is needed"
gsub("\\bWO\\S*", "", foo_data2)
# [1] " Education is good: " " Health is a priority: " " Economy is bad: " " Vehicle license is needed: "
gsub("\\b[A-Za-z]+\\d+\\b", "", foo_data2)
# [1] " Education is good: " " Health is a priority: " " Economy is bad: " " Vehicle license is needed: "
Removing the : should be straightforward, and trimws(.) will remove the leading/trailing spaces.
This can obviously be combined into a single regex (using the alternation operator | with pattern grouping) or a single R call (nested gsub) without complication; I kept them broken apart for discussion.
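As a rough sketch of the combined form, the date pattern and the "everything from the first colon onward" pattern can be joined with | and applied in one gsub call. The names date_re and id_re are only illustrative, and the sketch assumes the ID always follows the first colon, as in the sample data.

```r
foo_data <- c("03 / 05 / 2016 Education is good: WO0001982",
              "04/02/2016 Health is a priority: WO0002021",
              "09/ 08/2016 Economy is bad: WO001999",
              "09/08/ 2016 Vehicle license is needed: WO001050")

# Date-like strings: 1-2 digits / 1-2 digits / 4 digits, with optional spaces
date_re <- "\\d{1,2}\\s*/\\s*\\d{1,2}\\s*/\\s*\\d{4}"
# Assumption: the ID is everything from the first colon to the end
id_re <- ":.*"

# One alternation, one gsub call, then trim the leftover whitespace
cleaned <- trimws(gsub(paste0(date_re, "|", id_re), "", foo_data))
cleaned
```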
I think https://stackoverflow.com/a/22944075/3358272 is a good reference for regex in general. Note that while that page shows many regex constructs with single backslashes, R requires double backslashes for all of them (e.g., \d in regex needs to be \\d in R). The exception is the raw-string syntax introduced in R 4.0, where these two are identical:
"\\b[A-Za-z]+\\d+\\b"
r"(\b[A-Za-z]+\d+\b)"
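A quick way to convince yourself that the two spellings are the same string is to compare them directly. This is a small sketch that needs R 4.0.0 or later; the variable names are arbitrary, and perl = TRUE is used only so that \b and \d are handled by PCRE.

```r
# Classic string literal: every regex backslash is doubled
escaped <- "\\b[A-Za-z]+\\d+\\b"
# Raw string literal (R >= 4.0.0): backslashes are taken literally
raw_str <- r"(\b[A-Za-z]+\d+\b)"

# TRUE when both literals produce the same characters
same_pattern <- identical(escaped, raw_str)
same_pattern

# Either form can be passed as the pattern
stripped <- gsub(raw_str, "", "Economy is bad: WO001999", perl = TRUE)
stripped
```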

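Using stringr, try this. The pattern removes every slash, every run of digits, and the literal ": WO" prefix; str_squish() then trims and collapses the leftover whitespace: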
library(stringr)
library(magrittr)
str_remove_all(foo_data, "\\/|\\d+|\\: WO") %>%
str_squish()
#> [1] "Education is good" "Health is a priority"
#> [3] "Economy is bad" "Vehicle license is needed"
Created on 2021-04-22 by the reprex package (v2.0.0)
data
foo_data <- c("03 / 05 / 2016 Education is good: WO0001982", "04/02/2016 Health is a priority: WO0002021",
"09/ 08/2016 Economy is bad: WO001999", "09/08/ 2016 Vehicle license is needed: WO001050")

foo_data <- c("03 / 05 / 2016 Education is good: WO0001982", "04/02/2016 Health is a priority: WO0002021",
"09/ 08/2016 Economy is bad: WO001999", "09/08/ 2016 Vehicle license is needed: WO001050")
# keep only the text captured between the 4-digit year and the last colon
gsub(".*\\d{4}[[:space:]]+(.*):.*", "\\1", foo_data)
#> [1] "Education is good" "Health is a priority"
#> [3] "Economy is bad" "Vehicle license is needed"
Created on 2021-04-22 by the reprex package (v2.0.0)

Related

extracting information from pdfs that have line spills using R

I am trying to extract information from PDF files using R. The data I want are in tables, although R does not recognise them as such.
I am using the pdftools package to read in the PDF file, export it to a text file, and then re-read it line by line.
The files look like this.
I want to extract the Net cash from / (used in) operating activities figures, but as you can see, the label spills over several lines, which makes this very hard.
pdf_text <- pdf_text("test.pdf")
write.table(pdf_text,"out.txt")
just <- readLines("input_file.txt")
> just[30:40]
[1] " (g) insurance costs - (137)"
[2] " 1.3 Dividends received (see note 3) - -"
[3] " 1.4 Interest received 9 21"
[4] " 1.5 Interest and other costs of finance paid - -"
[5] " 1.6 Income taxes paid - -"
[6] " 1.7 Government grants and tax incentives - -"
[7] " 1.8 Other (provide details if material) - -"
[8] " 1.9 Net cash from / (used in) operating"
[9] " (1,258) (3,785)"
[10] " activities"
I want to grab the numbers (1,258) and (3,785) still with the parentheses around them.
The numbers can appear on line 8, 9 or 10 (using my example above as reference), so I can't simply write code to grab the data that is 'next' to "Net cash from / (used in) operating activities".
This code almost arrives at the desired result:
text_file <- readLines("out.txt")
operating_line <- grep("Net cash from / \\(used in\\) operat", text_file)
operating_line <- operating_line[1]
number_line1 <- text_file[operating_line]
number_line2 <- text_file[operating_line + 1]
number_line3 <- text_file[operating_line - 1]
if (gsub("[^()[:digit:],]+", "", number_line1) != "") {
  numbers <- gsub("[^()[:digit:],]+", "", number_line1)
} else if (gsub("[^()[:digit:],]+", "", number_line2) != "") {
  numbers <- gsub("[^()[:digit:],]+", "", number_line2)
} else {
  numbers <- gsub("[^()[:digit:],]+", "", number_line3)
}
numbers <- gsub("\\d+\\(\\)", "", numbers)
numbers
# [1] "(1,258)(3,785)"
However, there is no gap between (1,258) and (3,785), i.e. they are not identified as separate elements.
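One possible way to get the two figures as separate elements is to extract each parenthesised group instead of deleting everything around them. This is only a sketch: it assumes every figure of interest is wrapped in parentheses and contains nothing but digits and commas, and the variable names are illustrative.

```r
# The fused string from the attempt above
numbers <- "(1,258)(3,785)"

# Extract every "(digits,commas)" group as its own element
parts <- regmatches(numbers, gregexpr("\\([0-9,]+\\)", numbers))[[1]]
parts

# The same pattern also works on the untouched text line,
# so the intermediate gsub() clean-up may not be needed at all
raw_line <- "      (1,258)      (3,785)"
from_line <- regmatches(raw_line, gregexpr("\\([0-9,]+\\)", raw_line))[[1]]
from_line
```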

Optional pattern part in regex lookbehind

In the example below I am trying to extract the text between 'Supreme Court' or 'Supreme Court of the United States' and the next date (including the date). The result below is not what I intended since result 2 includes "of the United States".
I assume the error is due to the .*? part since . can also match 'of the United States'. Any ideas how to exclude it?
I guess more generally speaking, the question is how to include an optional 'element' into a lookbehind (which seems not to be possible since ? makes it a non-fixed length input).
Many thanks!
library(tidyverse)
txt <- c("The US Supreme Court decided on 2 April 2020 The Supreme Court of the United States decided on 5 March 2011 also.")
str_extract_all(txt, regex("(?<=Supreme Court)(\\sof the United States)?.*?\\d{1,2}\\s\\w+\\s\\d{2,4}"))
#> [[1]]
#> [1] " decided on 2 April 2020"
#> [2] " of the United States decided on 5 March 2011"
Created on 2021-12-09 by the reprex package (v2.0.1)
I also tried
str_extract_all(txt, regex("(?<=(Supreme Court)|(Supreme Court of the United States)).*?\\d{1,2}\\s\\w+\\s\\d{2,4}"))
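However, the result is the same.
In this case, I would use the PCRE engine available in base R (perl = TRUE) rather than the ICU engine that stringr/stringi use. PCRE supports \K, which drops everything matched so far from the reported match, so the optional part does not have to live inside a lookbehind: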
pattern <- "Supreme Court (of the United States ?)?\\K.*?\\d{1,2}\\s\\w+\\s\\d{2,4}"
regmatches(txt, gregexpr(pattern, txt, perl = TRUE))
[[1]]
[1] "decided on 2 April 2020" "decided on 5 March 2011"
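To see in isolation what \K does, here is a smaller sketch with made-up inputs: the text before \K has to match, including the optional group, but only what follows \K is returned.

```r
x <- c("Supreme Court decided",
       "Supreme Court of the United States decided")

# The prefix (with its optional part) must match, but \K resets the
# start of the reported match, so only the following word is returned
pattern <- "Supreme Court (of the United States )?\\K\\w+"
first_word <- regmatches(x, regexpr(pattern, x, perl = TRUE))
first_word
```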
You can also do this with str_match_all and a capture group:
str_match_all(txt, regex("Supreme Court(?:\\sof the United States)?(.*?\\d{1,2}\\s\\w+\\s\\d{2,4})")) %>%
.[[1]] %>% .[, 2]
[1] " decided on 2 April 2020" " decided on 5 March 2011"

A more elegant way to remove duplicated names (phrases) in the elements of a character string

I have a vector of organization names in a dataframe. Some of them are just fine, others have the name repeated twice in the same element. Also, when that name is repeated, there is no separating space so the name has a camelCase appearance.
For example (id column added for general dataframe referencing):
id org
1 Alpha Company
2 Bravo InstituteBravo Institute
3 Charlie Group
4 Delta IncorporatedDelta Incorporated
but it should look like:
id org
1 Alpha Company
2 Bravo Institute
3 Charlie Group
4 Delta Incorporated
I have a solution that gets the result I need--reproducible example code below. However, it seems a bit lengthy and not very elegant.
Does anyone have a better approach for the same results?
Bonus question: If organizations have 'types' included, such as Alpha Company, LLC, then my gsub() line to fix the camelCase does not work as well. Any suggestions on how to adjust the camelCase fix to account for the ", LLC" and still work with the rest of the solution?
Thanks in advance!
(Thanks to the OP & those who helped on the previous SO post about splitting camelCase strings in R)
# packages
library(stringr)
# toy data
df <- data.frame(id=1:4, org=c("Alpha Company", "Bravo InstituteBravo Institute", "Charlie Group", "Delta IncorporatedDelta Incorporated"))
# split up & clean camelCase words
df$org_fix <- gsub("([A-Z])", " \\1", df$org)
df$org_fix <- str_trim(str_squish(df$org_fix))
# temp vector with half the org names
df$org_half <- word(df$org_fix, start=1, end=(sapply(strsplit(df$org_fix, " "), length)/2)) # stringr::word
# double the temp vector
df$org_dbl <- paste(df$org_half, df$org_half)
# flag TRUE for orgs that contain duplicates in name
df$org_dup <- df$org_fix == df$org_dbl
# corrected the org names
df$org_fix <- ifelse(df$org_dup, df$org_half, df$org_fix)
# drop excess columns
df <- df[,c("id", "org_fix")]
# toy data for the bonus question
df2 <- data.frame(id=1:4, org=c("Alpha Company, LLC", "Bravo InstituteBravo Institute", "Charlie Group", "Delta IncorporatedDelta Incorporated"))
Another approach is to compare the first half of the string with the second half of the string. If equal, pick the first half. It also works if there are numbers, underscores or any other characters present in the company name.
org <- c("Alpha Company", "Bravo InstituteBravo Institute", "Charlie Group", "Delta IncorporatedDelta Incorporated", "WD40WD40", "3M3M")
ifelse(substring(org, 1, nchar(org) / 2) == substring(org, nchar(org) / 2 + 1, nchar(org)), substring(org, 1, nchar(org) / 2), org)
# [1] "Alpha Company" "Bravo Institute" "Charlie Group" "Delta Incorporated" "WD40" "3M"
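The half-versus-half comparison can also be wrapped in a small helper so the substring arithmetic is written only once. This is a sketch, not part of the original answer; dedupe_half is a hypothetical name, and it uses integer division so that odd-length strings are never treated as duplicates.

```r
# Hypothetical helper: return the first half when a string is an exact
# repetition of itself, otherwise return the string unchanged
dedupe_half <- function(x) {
  n <- nchar(x)
  half <- n %/% 2                       # integer half length
  first <- substr(x, 1, half)
  second <- substr(x, half + 1, n)
  # only even-length strings can consist of two identical halves
  ifelse(n %% 2 == 0 & first == second, first, x)
}

org <- c("Alpha Company", "Bravo InstituteBravo Institute",
         "Charlie Group", "WD40WD40", "3M3M")
fixed <- dedupe_half(org)
fixed
```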
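You can use a regex like the one below. Note that it keeps only the first pair of capitalized words, so it assumes every name consists of exactly two such words: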
my_df$org <- str_extract(string = my_df$org, pattern = "([A-Z][a-z]+ [A-Z][a-z]+){1}")
If all individual words start with a capital letter (not followed by another capital letter), then you can use that to split on. Keep only the unique elements, then paste and collapse. This will also work on the bonus LLC option:
org <- c("Alpha CompanyCompany , LLC", "Bravo InstituteBravo Institute", "Charlie Group", "Delta IncorporatedDelta Incorporated")
sapply(
lapply(
strsplit(gsub("[^A-Za-z0-9]", "", org),
"(?<=[^A-Z])(?=[A-Z])",
perl = TRUE),
unique),
paste0, collapse = " ")
[1] "Alpha Company LLC" "Bravo Institute" "Charlie Group" "Delta Incorporated"

How to extract keywords below and above a text from an article

I have this character vector of lines from a journal:
test_1 <- c(" Journal of Neonatal Nursing 27 (2021) 106–110",
" Contents lists available at ScienceDirect",
" Journal of Neonatal Nursing",
" journal homepage: www.elsevier.com/locate/jnn",
"Comparison of inter-facility transports of critically ill neonates who died",
"after admission vs. survivors", "Robert Schultz a, *, Jennifer Berk-King a, Laura Wallace a, Girija Natarajan a, b",
"a", " Children’s Hospital of Michigan, Detroit, MI, USA",
"b", " Division of Neonatology, Wayne State University School of Medicine, Detroit, MI, USA",
"A R T I C L E I N F O A B S T R A C T",
"Keywords: Objective: To compare characteristics before, during and after inter-facility transports (IFT), and changes in the",
"Inter-facility transport Transport Risk Index of Physiologic Stability (TRIPS) before and after inter-facility transports (IFT) in infants",
"Neonatal intensive care who died within 7 days of admission to a level IV NICU versus matched survivors.",
"Mortality", " Study design: This retrospective case-control study included infants who died within 7 days of IFT and controls",
" matched for gestational age and reason for admission. Unplanned events were temperature or respiratory de­",
" rangements. Therapeutic interventions included increased respiratory support, resuscitation or blood product",
" transfusion.",
" Results: Our cohort was predominantly preterm and male. Cases had a higher rate of resuscitation, lower Apgar",
" scores, more respiratory acidosis, lower BP and higher TRIPS, compared to controls. Deterioration in TRIPS was",
" independently associated with male gender and unplanned events; not with patient group.",
" Conclusions: Rates of unplanned events, therapeutic interventions, and deterioration in TRIPS following IFT by a",
" transport team are comparable in cases and controls.",
" outcomes. The Transport Risk Index of Physiologic Stability (TRIPS) is",
"1. Introduction an assessment measure of infant status before and after transport (Lee"
)
I want to extract the keywords from these lines, which are Inter-facility transport, Neonatal intensive care and Mortality. I've tried to get the line which has "Keywords" with test_1[str_detect(test_1, "^Keywords:")]. I want to get all the keywords below this line and above 1. Introduction.
What regex or stringr functions will do this?
Thanks
If I understood correctly, you are sort of scanning the pdf downloaded from here. I think you should find a better way to scan your PDFs.
Till then, the best option could be this:
library(stringr)
# get the line after ^Keywords:
start <- which(str_detect(test_1, "^Keywords:")) +1
# get the line before ^1. Introduction (the dot is escaped so it matches literally)
end <- which(str_detect(test_1, "^1\\. Introduction")) -1
# get the lines in between
x <- test_1[start:end]
# Extract keywords
x <- str_trim(str_sub(x, 1, 60))
x <- x[x!=""]
x
#> [1] "Inter-facility transport" "Neonatal intensive care" "Mortality"
EDIT:
You can define a function to find the index of the line at which Keywords occurs and the indices of the lines below that line:
find_keywords <- function(pattern, text) {
index <- which(grepl(pattern, text))
sort(c(index + 1, index + 2, index + 3)) # If you suspect there are more than three keywords, then just `index + ...`
}
Based on that function, you can extract the keywords:
library(stringr)
str_extract(test_1[find_keywords(pattern = "^Keywords:", text = test_1)], "^\\S+")
[1] "Inter-facility" "Neonatal" "Mortality"
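If the number of keywords is not known in advance, the start/end idea from the first part of this answer can be turned into a reusable helper instead of hard-coding index + 1, index + 2, index + 3. The function name lines_between and the toy vector are assumptions made for this sketch.

```r
# Hypothetical helper: return the lines strictly between the first line
# matching start_pattern and the first line matching end_pattern
lines_between <- function(text, start_pattern, end_pattern) {
  start <- grep(start_pattern, text)[1] + 1
  end <- grep(end_pattern, text)[1] - 1
  # no marker found, or nothing between the markers
  if (is.na(start) || is.na(end) || start > end) return(character(0))
  text[start:end]
}

toy <- c("Some title",
         "Keywords: Objective: ...",
         "Inter-facility transport",
         "Mortality",
         "1. Introduction an assessment measure")
between <- lines_between(toy, "^Keywords:", "^1\\. Introduction")
between
```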

Extract title from multiple lines

I have multiple files, each with a different title, and I want to extract the title from each file. Here is an example of one file:
[1] "<START" "ID=\"CMP-001\"" "NO=\"1\">"
[4] "<NAME>Plasma-derived" "vaccine" "(PDV)"
[7] "versus" "placebo" "by"
[10] "intramuscular" "route</NAME>" "<DIC"
[13] "CHI2=\"3.6385\"" "CI_END=\"0.6042\"" "CI_START=\"0.3425\""
[16] "CI_STUDY=\"95\"" "CI_TOTAL=\"95\"" "DF=\"3.0\""
[19] "TOTAL_1=\"0.6648\"" "TOTAL_2=\"0.50487622\"" "BLE=\"YES\""
.
.
.
[789] "TOTAL_2=\"39\"" "WEIGHT=\"300.0\"" "Z=\"1.5443\">"
[792] "<NAME>Local" "adverse" "events"
[795] "after" "each" "injection"
[798] "of" "vaccine</NAME>" "<GROUP_LABEL_1>PDV</GROUP_LABEL_1>"
[801] "</GROUP_LABEL_2>" "<GRAPH_LABEL_1>" "PDV</GRAPH_LABEL_1>"
the extracted expected title is
Plasma-derived vaccine (PDV) versus placebo by intramuscular route
Note that the title length differs from file to file.
Here is a solution using stringr. This first collapses the vector into one long string, and then captures all words / characters that are not a newline \n between every pair of "<NAME>" and "</NAME>". In the future, people will be able to help you more easily if you make a reproducible example (e.g., using dput()). Hope this helps!
Note: if you just want the first title, you can use str_match() instead of str_match_all().
library(stringr)
str_match_all(paste0(string, collapse = " "), "<NAME>(.*?)</NAME>")[[1]][,2]
[1] "Plasma-derived vaccine (PDV) versus placebo by intramuscular route"
[2] "Local adverse events after each injection of vaccine"
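For anyone who prefers to stay in base R, a roughly equivalent sketch uses lookarounds with perl = TRUE so that the tags themselves are excluded from the match. The shortened tokens vector below is made up for illustration and is not the full data from the question.

```r
# Shortened, made-up version of the tokenised file
tokens <- c("<START", "<NAME>Plasma-derived", "vaccine", "(PDV)",
            "route</NAME>", "<DIC", "<NAME>Local", "adverse",
            "events</NAME>")

# Collapse to one string, as in the stringr version
collapsed <- paste(tokens, collapse = " ")

# Lazily match whatever sits between <NAME> and the next </NAME>
titles <- regmatches(
  collapsed,
  gregexpr("(?<=<NAME>).*?(?=</NAME>)", collapsed, perl = TRUE)
)[[1]]
titles
```

Taking titles[1] gives just the first title, mirroring the str_match() note above.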
Data:
string <- c("<START", "ID=\"CMP-001\"", "NO=\"1\">", "<NAME>Plasma-derived", "vaccine", "(PDV)", "versus", "placebo", "by", "intramuscular", "route</NAME>", "<DIC", "CHI2=\"3.6385\"", "CI_END=\"0.6042\"", "CI_START=\"0.3425\"", "CI_STUDY=\"95\"", "CI_TOTAL=\"95\"", "DF=\"3.0\"", "TOTAL_1=\"0.6648\"", "TOTAL_2=\"0.50487622\"", "BLE=\"YES\"",
"TOTAL_2=\"39\"", "WEIGHT=\"300.0\"", "Z=\"1.5443\">", "<NAME>Local", "adverse", "events", "after", "each", "injection", "of", "vaccine</NAME>", "<GROUP_LABEL_1>PDV</GROUP_LABEL_1>", "</GROUP_LABEL_2>", "<GRAPH_LABEL_1>", "PDV</GRAPH_LABEL_1>")
