How to conduct pattern recognition for strings? - r

Below is a vector I'm working with. I am trying to extract only the ages (including whether the number is in months or years) from each entry in the vector. I know I have to use str/grep functions and regex, but I'm not sure how to combine them to get what I want.
All ages are expressed like this: number, time interval, sex.
For example, 18MOM is an 18-month-old male, 18YOF is an 18-year-old female, etc.
[1] "DX LAC CHIN/ABRASION CHEEK/CONTU HAND(S): 6YOF OUT RIDING BIKE, W WOBBLY ON BIKE AND HIT FACE ON ROAD, ABRASION TO L CHEEK, CHIN & R HAND"
[2] "DX LWOBS: 2YOM L PINKY FINGER CAUGHT IN BOWLING BALL, SM AMT BLDG/SWELLING TO PINKY FINGER. CRUSH W BOWLING BALL"
[3] "DX KNEE SPRAIN/CONTU KNEE/HIGH BLD PRESS: 16YOM R KNEE PN AFTER TWISTING KNEE COMING DOWN F JUMP' DUR' BASKETBALL GAME, LANDED ON BENT KNEE"
[4] "DX LBP: 21YOM STRETCHING OUT AFTER WORKOUT (DOING ***) HEARD POP"
[5] "DX FX PHALANX FOOT: 36YOF STUBBED R GREAT TOE ON STAIRS, PN, SWELL' SUROUNDING R GREAT TOE"
[6] "DX ELBOW CONTU/ELBOW ABRASION: 10YOM FELL F BED HAND HIT R ELBOW ON BEDPLAYING W SISTER, BRUSING TO ELBOW"
[7] "DX LWOBS: 3YOM LAC TO SCALP/ S/P PLASTIC LAMP FELL OFF DRESSER TO HEAD,PT W ~1CM LAC"
[8] "DX CONTU FINGER: 55YOM L 5TH FINGER PN AFTER FALL F BICYCLE W TRYING TOBAL AT STOPPED POSITION"
[9] "DX COSTOCHONDRITIS/CHEST PN: 24YOM SUBSTERNAL CHEST PN W WORKING OUT, HAD SHARP SPASM PN TO SUBSTERNAL CHEST TO L CHEST"
[10] "DX 1ST DEG BURN E: 28YOF W BURN TO L HAND, GRABBED HOT PAN UNDER BROILER W/O POTHOLDER; REDNESS TO PLAM & FINGER TIPS, FEW BLISTERS START' G F"
[11] "DX LWOBS LAC HAND: 1YOM W FINGER INJ, CUT FINGER ON A FAN"

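agevector <- gsub(".* (\\d+[MY]O).*", "\\1", vector)
This creates agevector, a character vector containing values such as 19MO and 5YO.
The pattern looks for "[one or more digits] followed by [M or Y] followed by O", preceded by a space. Using \\d+ rather than \\d* requires at least one digit, so a word that merely starts with MO or YO is not mistaken for an age. Elements with no match are returned unchanged.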
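As a hedged alternative in base R, regexpr() with regmatches() extracts the match directly instead of deleting everything around it, which makes entries without an age easy to spot. The sketch below uses a shortened `notes` vector (the 18MOF and no-age entries are invented for illustration) and a Perl lookahead so the sex letter is checked but not returned.

```r
# Hypothetical sample: first entry shortened from the question, the other two invented
notes <- c("DX LWOBS: 2YOM L PINKY FINGER CAUGHT IN BOWLING BALL",
           "DX BURN: 18MOF W BURN TO L HAND",
           "NO AGE RECORDED")

# regexpr() returns the position of the first match per element, or -1 if none
m <- regexpr("\\d+[MY]O(?=[MF]\\b)", notes, perl = TRUE)

# regmatches() returns only the matched substrings, so fill them in by position
ages <- rep(NA_character_, length(notes))
ages[m != -1] <- regmatches(notes, m)
ages
# "2YO" "18MO" NA
```

Unlike the gsub() approach, an entry with no age ends up as NA rather than as the full original string.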

You can use stringr
You can first extract all the ages from your text and then carry out further analysis.
This code will do the trick (assuming your string vector is called str):
library(stringr)
ages <- str_extract_all(str, "(\\d{1,2}[MY]O[MF])", simplify = TRUE)
Use case:
library(stringr)
str <- c("DX LAC CHIN/ABRASION 12YOF CHEEK/CONTU HAND(S): 6YOF OUT RIDING BIKE, W WOBBLY ON BIKE AND HIT FACE ON ROAD, ABRASION TO L CHEEK, CHIN & R HAND",
"DX KNEE SPRAIN/CONTU KNEE/HIGH BLD PRESS: 16YOM R KNEE PN AFTER TWISTING KNEE COMING DOWN F JUMP' DUR' BASKETBALL GAME, LANDED ON BENT KNEE",
"DX FX PHALANX FOOT: 36YOF STUBBED R GREAT TOE ON STAIRS, PN, SWELL' SUROUNDING R GREAT TOE")
# collapse the three strings into one, so all ages come back in a single row
str <- paste(str, collapse = '')
ages <- str_extract_all(str, "(\\d{1,2}[MY]O[MF])", simplify = TRUE)
Output:
> ages
[,1] [,2] [,3] [,4]
[1,] "12YOF" "6YOF" "16YOM" "36YOF"
Hope this helps.
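If the number, unit and sex are needed separately, one possible follow-up is stringr's str_match(), which returns each capture group in its own column. This is only a sketch: the `notes` vector is a shortened, partly invented sample, and the column names `value`, `unit` and `sex` are illustrative. It also uses \\d+ instead of \\d{1,2}, on the assumption that three-digit ages could occur and should not be truncated.

```r
library(stringr)

# Hypothetical sample entries in the same "number, interval, sex" format
notes <- c("DX LAC: 6YOF OUT RIDING BIKE",
           "DX LWOBS: 18MOM CUT FINGER ON A FAN")

# Column 1 is the full match; columns 2-4 are the three capture groups
parts <- str_match(notes, "\\b(\\d+)([MY])O([MF])\\b")

ages <- data.frame(
  value = as.integer(parts[, 2]),
  unit  = ifelse(parts[, 3] == "Y", "years", "months"),
  sex   = parts[, 4],
  stringsAsFactors = FALSE
)
ages
```

str_match() returns only the first match per string, so this assumes one age per entry.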

Related

How to extract substring between periods in R

I need to create a dataframe from a .csv file containing author references:
refs <- data.frame(reference = "Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22.")
Essentially I want to pull out the coauthors, year of publication, and article title.
refs$author[1]
Harris P R, Harris D L
refs$year[1]
1983
refs$title[1]
Training for the Metaindustrial Work Culture
At this stage, I do not need a publication source as I can get this via rscopus.
I can extract authors and years with this code:
refs <- refs %>%
  mutate(author = sub("\\(.*", "", reference),
         year = str_extract(reference, "\\d{4}"))
However, I need help extracting the title (substring between two periods after bracketed date).
This regex works for your minimal example:
refs <- data.frame(reference = "Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22.")
sub("[^.]+\\.([^.]+)\\..*", "\\1", refs$reference)
#> [1] " Training for the Metaindustrial Work Culture"
Explanation:
"[^.]+\\.([^.]+)\\..*" - whole regex
[^.]+\\. - one or more characters that aren't a period, followed by a period (i.e. everything up to and including the first period)
([^.]+)\\..* - open capture group 1 with "(", match one or more characters that aren't a period ([^.]+), and close the group with ")" just before the next period "\\." (group 1 now holds the title); then match everything else with ".*"
Then sub replaces the entire match with group 1 ("\\1"), leaving only the title.
Unfortunately, you may run into problems with your 'real world' data. Using rscopus to extract the title might be a better solution to avoid unforeseen errors.
Using tidyverse functions:
library(tidyverse)
refs <- data.frame(reference = "Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22.")
refs %>%
  mutate(author = sub("\\(.*", "", reference),
         year = str_extract(reference, "\\d{4}"),
         title = sub("[^.]+\\.([^.]+)\\..*", "\\1", reference))
#> reference
#> 1 Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22.
#> author year title
#> 1 Harris P R, Harris D L 1983 Training for the Metaindustrial Work Culture
Created on 2022-12-05 with reprex v2.0.2
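One way the "real world" problem mentioned above can show up is author initials written with periods, which would make the first period appear before the year. A possible workaround, sketched below in base R, is to anchor on the bracketed year instead of the first period. The second reference is invented purely for illustration, and the pattern assumes the title itself contains no period.

```r
refs <- c(
  "Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22.",
  # hypothetical entry with dotted initials
  "Smith, J. A. (2001). A Title Here. Some Journal, 1(1): 1."
)

# Skip lazily to "(year).", drop following spaces, capture up to the next period
title <- sub("^.*?\\(\\d{4}\\)\\.\\s*([^.]+)\\..*$", "\\1", refs, perl = TRUE)
title
# "Training for the Metaindustrial Work Culture" "A Title Here"
```

Because \\s* consumes the space after the year, the captured title has no leading whitespace.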

How to split a string at every point, question mark and exclamation mark in r

As the title says, I should split a string at every . ! and ?
That doesn't work:
strsplit(x, "/ (\\?|\\.|!) /")
$`352`
[1] "Saudi Arabian Oil Minister Hisham (...)
the\n... accord and it will never sell its oil at prices below the\npronounced prices under any circumstance.\"\n Saudi Arabia was a main architect of December pact under\nwhich OPEC agreed to cut its total oil output ceiling by 7.25\npct and return to fixed prices of around 18 dollars a barrel.\n Reuter"
$`353`
[1] "Kuwait's oil minister said (...)
daily (bpd).\n Crude oil prices fell sharply last week as international\noil traders and analysts estimated the 13-nation OPEC was\npumping up to one million bpd over its self-imposed limits.\n Reuter"
$`368`
[1] "The port of Philadelphia (...)
the ship on the high tide.\n After delivering oil to a refinery in Paulsboro, New\nJersey, the ship apparently lost its steering and hit the power\ntransmission line carrying power from the nuclear plant to the\nstate of Delaware.\n Reuter"
I shortened it with "(...)" here, so that's not part of the code obviously.
There should be far more splits because there are points where it doesn't split.
Jonathan V. Solórzano is right:
x <- "Ceci.est!un?pipe. . ."
strsplit(x, "\\?|\\.|!")
[[1]]
[1] "Ceci" "est" "un" "pipe" " " " "
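The original attempt fails because R has no /.../ regex delimiters: the slashes and spaces in "/ (\\?|\\.|!) /" are matched literally. Two hedged variants in base R are sketched below with invented sample text. The first uses a character class and tidies the pieces; the second splits on whitespace that follows a terminator, which keeps the punctuation and leaves decimals such as 7.25 intact.

```r
# Hypothetical sample text
x <- "Crude oil prices fell sharply last week.\n Is OPEC over its limits? Yes! Reuter"

# Variant 1: split on any of . ! ? then trim and drop empty pieces
pieces <- strsplit(x, "[.!?]")[[1]]
pieces <- trimws(pieces)
pieces <- pieces[nzchar(pieces)]
pieces
# "Crude oil prices fell sharply last week" "Is OPEC over its limits" "Yes" "Reuter"

# Variant 2: split on whitespace preceded by . ! or ? (Perl lookbehind)
y <- "Output was cut by 7.25 pct. Prices rose! Why?"
sentences <- strsplit(y, "(?<=[.!?])\\s+", perl = TRUE)[[1]]
sentences
# "Output was cut by 7.25 pct." "Prices rose!" "Why?"
```

Variant 2 still splits after abbreviations such as "V. " so it is a heuristic rather than a full sentence tokenizer.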

How to extract keywords below and above a text from an article

I have this character vector of lines from a journal:
test_1 <- c(" Journal of Neonatal Nursing 27 (2021) 106–110",
" Contents lists available at ScienceDirect",
" Journal of Neonatal Nursing",
" journal homepage: www.elsevier.com/locate/jnn",
"Comparison of inter-facility transports of critically ill neonates who died",
"after admission vs. survivors", "Robert Schultz a, *, Jennifer Berk-King a, Laura Wallace a, Girija Natarajan a, b",
"a", " Children’s Hospital of Michigan, Detroit, MI, USA",
"b", " Division of Neonatology, Wayne State University School of Medicine, Detroit, MI, USA",
"A R T I C L E I N F O A B S T R A C T",
"Keywords: Objective: To compare characteristics before, during and after inter-facility transports (IFT), and changes in the",
"Inter-facility transport Transport Risk Index of Physiologic Stability (TRIPS) before and after inter-facility transports (IFT) in infants",
"Neonatal intensive care who died within 7 days of admission to a level IV NICU versus matched survivors.",
"Mortality", " Study design: This retrospective case-control study included infants who died within 7 days of IFT and controls",
" matched for gestational age and reason for admission. Unplanned events were temperature or respiratory de­",
" rangements. Therapeutic interventions included increased respiratory support, resuscitation or blood product",
" transfusion.",
" Results: Our cohort was predominantly preterm and male. Cases had a higher rate of resuscitation, lower Apgar",
" scores, more respiratory acidosis, lower BP and higher TRIPS, compared to controls. Deterioration in TRIPS was",
" independently associated with male gender and unplanned events; not with patient group.",
" Conclusions: Rates of unplanned events, therapeutic interventions, and deterioration in TRIPS following IFT by a",
" transport team are comparable in cases and controls.",
" outcomes. The Transport Risk Index of Physiologic Stability (TRIPS) is",
"1. Introduction an assessment measure of infant status before and after transport (Lee"
)
I want to extract the keywords from these lines, which are Inter-facility transport, Neonatal intensive care, and Mortality. I've tried to get the line which has "Keywords" with test_1[str_detect(test_1, "^Keywords:")]. I want to get all the keywords below this line and above 1. Introduction.
What regex or stringr functions will do this?
Thanks
If I understood correctly, you are sort of scanning the pdf downloaded from here. I think you should find a better way to scan your PDFs.
Till then, the best option could be this:
library(stringr)
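# index of the line after the one starting with "Keywords:"
start <- which(str_detect(test_1, "^Keywords:")) + 1
# index of the line before the one starting with "1. Introduction" (period escaped so it matches literally)
end <- which(str_detect(test_1, "^1\\. Introduction")) - 1
# get the lines in between
x <- test_1[start:end]
# keep the first 60 characters of each line, trim whitespace, and drop empty results
x <- str_trim(str_sub(x, 1, 60))
x <- x[x != ""]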
x
#> [1] "Inter-facility transport" "Neonatal intensive care" "Mortality"
EDIT:
You can define a function to find the index of the line at which Keywords occurs and the indices of the lines below that line:
find_keywords <- function(pattern, text) {
  index <- which(grepl(pattern, text))
  sort(c(index + 1, index + 2, index + 3)) # If you suspect there are more than three keywords, then just `index + ...`
}
Based on that function, you can extract the keywords:
library(stringr)
str_extract(test_1[find_keywords(pattern = "^Keywords:", text = test_1)], "^\\S+")
[1] "Inter-facility" "Neonatal" "Mortality"
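The hard-coded offsets and the single-word extraction above could be loosened along the following lines. This is a sketch under stated assumptions: `find_keyword_lines` and its `n` parameter are hypothetical names, the `page` vector is an invented sample, and the approach only works if the two-column layout survives as runs of two or more spaces between the keyword column and the abstract column.

```r
# Hypothetical helper: indices of the n lines following each line that matches pattern
find_keyword_lines <- function(pattern, text, n = 3) {
  index <- grep(pattern, text)
  rows <- sort(rep(index, each = n) + seq_len(n))
  rows[rows <= length(text)]
}

# Invented sample that keeps wide gaps between the two columns
page <- c(
  "A R T I C L E   I N F O          A B S T R A C T",
  "Keywords:                        Objective: To compare characteristics",
  "Inter-facility transport         Transport Risk Index of Physiologic Stability",
  "Neonatal intensive care          who died within 7 days of admission",
  "Mortality",
  "1. Introduction"
)

rows <- find_keyword_lines("^Keywords:", page, n = 3)

# Drop everything from the first run of two or more whitespace characters onward
keywords <- sub("[[:space:]]{2,}.*$", "", page[rows])
keywords
# "Inter-facility transport" "Neonatal intensive care" "Mortality"
```

This keeps multi-word keywords whole, whereas "^\\S+" returns only the first word of each line.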

Structure character data into data frame

I used the rvest package in R to scrape some web data, but I am having a lot of trouble getting it into a usable format.
My data currently looks like this:
test
[1] "v. Philadelphia"
[2] "TD GardenRegular Season"
[3] "PTS: 23. Jayson TatumREB: 10. M. MorrisAST: 7. Kyrie Irving"
[4] "PTS: 23. Joel EmbiidREB: 15. Ben SimmonsAST: 8. Ben Simmons"
[5] "100.7 - 83.4"
[6] "# Toronto"
[7] "Air Canada Centre Regular Season"
[8] "PTS: 21. Kyrie IrvingREB: 10. Al HorfordAST: 9. Al Horford"
[9] "PTS: 31. K. LeonardREB: 10. K. LeonardAST: 7. F. VanVleet"
[10] "115.6 - 103.3"
Can someone help me perform the correct operations so that it looks like this (as a data frame)? I would really appreciate it if you could provide the code:
Opponent      Venue
Philadelphia  TD Garden
Toronto       Air Canada Centre
I do not need any of the other information.
Let me know if there are any issues :)
# put your data in here
input <- c("v. Philadelphia", "TD GardenRegular Season",
"", "", "",
"# Toronto", "Air Canada Centre Regular Season",
"", "", "")
index <- seq_along(input)
# raw table format
out_raw <- data.frame(Opponent = input[index%%5==1],
Venue = input[index%%5==2])
# using stringi package (install it once beforehand with install.packages("stringi"))
library(stringi)
# copy and clean up
out_clean <- out_raw
out_clean$Opponent <- stri_extract_last_regex(out_raw$Opponent, "(?<=\\s).*$")
out_clean$Venue <- trimws(gsub("Regular Season", "", out_raw$Venue))
out_clean
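The same result can probably be reached without stringi by reshaping the vector into a matrix with one row per game. The sketch below is base R only and assumes, as the answer above does, that every game occupies exactly five consecutive elements; the names `blocks` and `out` are illustrative.

```r
input <- c("v. Philadelphia", "TD GardenRegular Season", "", "", "",
           "# Toronto", "Air Canada Centre Regular Season", "", "", "")

# One row per game, five columns per row (assumes length(input) is a multiple of 5)
blocks <- matrix(input, ncol = 5, byrow = TRUE)

out <- data.frame(
  # drop the leading marker ("v." or "#") and the whitespace after it
  Opponent = sub("^\\S+\\s+", "", blocks[, 1]),
  # remove the trailing "Regular Season" label and any leftover space
  Venue = trimws(sub("Regular Season$", "", blocks[, 2])),
  stringsAsFactors = FALSE
)
out
```

If some games carry a different label than "Regular Season", the Venue pattern would need to be widened accordingly.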

Extracting Sub-expressions from a Dataframe of Strings Using Regular Expressions

I have a regular expression that is able to match my data, using grepl, but I can't figure out how to extract the sub-expressions inside it to new columns.
This is returning the test string as foo, without any of the sub-expressions:
entryPattern <- "(\\d+)\\s+([[:lower:][:blank:]-]*[A-Z][[:alpha:][:blank:]-]+[A-Z]\\s[[:alpha:][:blank:]]+)\\s+([A-Z]{3})\\s+(\\d{4})\\s+(\\d\\d\\-\\d\\d)\\s+([[:print:][:blank:]]+)\\s+(\\d*\\:?\\d+\\.\\d+)"
test <- "101 POULET Laure FRA 1992 25-29 E. M. S. Bron Natation 26.00"
m <- regexpr(entryPattern, test)
foo <- regmatches(test, m)
In my real use case, I'm acting on lots of strings similar to test. I'm able to find the correctly formatted ones, so I think the pattern is correct.
rows$isMatch <- grepl(entryPattern, rows$text)
What I'm hoping to do is add the sub-expressions as new columns in the rows dataframe (i.e. rows$rank, rows$name, rows$country, etc.). Thanks in advance for any advice.
It seems that regmatches won't do what I want. Instead, I need the stringr package, as suggested by @kent-johnson.
library(stringr)
test <- "101 POULET Laure FRA 1992 25-29 E. M. S. Bron Natation 26.00"
entryPattern <- "(\\d+)\\s+([[:lower:][:blank:]-]*[A-Z][[:alpha:][:blank:]-]+[A-Z]\\s[[:alpha:][:blank:]]+?)\\s+([A-Z]{3})\\s+(\\d{4})\\s+(\\d\\d\\-\\d\\d)\\s+([[:print:][:blank:]]+?)\\s+(\\d*\\:?\\d+\\.\\d+)"
str_match(test, entryPattern)[1,2:8]
Which outputs:
[1] "101"
[2] "POULET Laure"
[3] "FRA"
[4] "1992"
[5] "25-29"
[6] "E. M. S. Bron Natation"
[7] "26.00"
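For completeness, base R can also return capture groups when regmatches() is paired with regexec() rather than regexpr(). The sketch below uses a deliberately simplified pattern and invented `entries`, not the full entryPattern above, and the column names are illustrative; it shows one way to turn the groups into data frame columns.

```r
# Hypothetical, simplified entries: rank, country code, birth year
entries <- c("101 FRA 1992", "7 USA 2001", "not an entry")
pattern <- "^(\\d+)\\s+([A-Z]{3})\\s+(\\d{4})$"

# regexec() records the positions of the full match and of every group;
# non-matching elements yield character(0)
matches <- regmatches(entries, regexec(pattern, entries))

matched <- lengths(matches) > 0
fields <- do.call(rbind, matches[matched])  # column 1 is the full match

rows <- data.frame(
  text    = entries[matched],
  rank    = as.integer(fields[, 2]),
  country = fields[, 3],
  year    = as.integer(fields[, 4]),
  stringsAsFactors = FALSE
)
rows
```

With stringr, str_match(rows$text, entryPattern) gives a comparable matrix in one call, with NA rows for strings that do not match.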
