I am trying to create a data table from some PDF files, which sometimes results in data with unplanned spaces, e.g.
MWE <- c("Gross Domestic Product 2.3",
"blabla 1.5",
"blabla2 6.5",
"G ross Domestic Product 4.5",
"Another L ine 9.6",
"Gross Domestic Product 6.9",
"G r oss D omes tic Pr o du ct 7.6")
I would like to find all the occurrences of Gross Domestic Product, whether or not they contain stray spaces. But a simple grep("Gross Domestic Product",MWE) only matches when the spacing is exact:
grep("Gross Domestic Product",MWE)
[1] 1 6
I can do that upstream, for instance by removing every space, e.g.
MWE_2 <- gsub("\\s","",MWE)
grep("GrossDomesticProduct",MWE_2)
[1] 1 4 6 7
I was wondering whether it is possible to achieve the same result with grep alone, which could prove useful in some cases (e.g. to avoid creating a new table).
You can modify your search string and then use grep, as shown below. The idea is to build a regex that allows optional whitespace between the characters of the phrase.
MWE <- c("Gross Domestic Product 2.3",
"blabla 1.5",
"blabla2 6.5",
"G ross Domestic Product 4.5",
"Another L ine 9.6",
"Gross Domestic Product 6.9",
"G r oss D omes tic Pr o du ct 7.6")
gdp_str <- "Gross Domestic Product"
gdp_str <- sub("\\s*", "\\\\s*", gsub('(.{1})', '\\1\\\\s*', gdp_str))
grep(gdp_str, MWE)
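A more explicit way to build the same kind of pattern, as a minimal sketch: split the phrase into single characters and join them with \\s*. The helper name space_tolerant is hypothetical, and the sketch assumes the phrase contains only letters and spaces (no regex metacharacters that would need escaping).

```r
MWE <- c("Gross Domestic Product 2.3",
         "blabla 1.5",
         "blabla2 6.5",
         "G ross Domestic Product 4.5",
         "Another L ine 9.6",
         "Gross Domestic Product 6.9",
         "G r oss D omes tic Pr o du ct 7.6")

# Hypothetical helper: drop the spaces from the phrase, then allow
# optional whitespace between every pair of remaining characters.
space_tolerant <- function(phrase) {
  chars <- strsplit(gsub("\\s", "", phrase), "")[[1]]
  paste(chars, collapse = "\\s*")
}

pattern <- space_tolerant("Gross Domestic Product")
hits <- grep(pattern, MWE)
hits
#> [1] 1 4 6 7
```

Because the spaces in the phrase are dropped first, the spaces between words become optional too, so "GrossDomesticProduct" would also match.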
I need to create a dataframe from a .csv file containing author references:
refs <- data.frame(reference = "Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22.")
Essentially I want to pull out the coauthors, year of publication, and article title.
refs$author[1]
Harris P R, Harris D L
refs$year[1]
1983
refs$title[1]
Training for the Metaindustrial Work Culture
At this stage, I do not need a publication source as I can get this via rscopus.
I can extract authors and years with this code:
refs <- refs %>%
mutate(author = sub("\\(.*", "", reference),
year = str_extract(reference, "\\d{4}"))
However, I need help extracting the title (the substring between the two periods that follow the bracketed date).
This regex works for your minimal example:
refs <- data.frame(reference = "Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22.")
sub("[^.]+\\.([^.]+)\\..*", "\\1", refs$reference)
#> [1] " Training for the Metaindustrial Work Culture"
Explanation:
"[^.]+\\.([^.]+)\\..*" - whole regex
[^.]+\\. - one or more characters that aren't a period, followed by a period (i.e. everything up to and including the first period)
([^.]+)\\..* - start capturing 'group 1' "(" which contains one or more characters that aren't a period ([^.]+) then stop capturing group 1 ")" at the next period "\\." (group 1 now = the title), then match everything else ".*"
Then, in the sub command, you replace the whole match with group 1 ("\\1").
Unfortunately, you may run into problems with your 'real world' data. Using rscopus to extract the title might be a better solution to avoid unforeseen errors.
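To illustrate the kind of problem meant here, a small sketch: the second reference below is made up and has periods after the author initials, which breaks the "everything up to the first period" assumption. Anchoring on the bracketed year instead is one possible workaround, not a general solution. trimws() is used to drop the leading space the original call leaves in place.

```r
reference <- "Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22."

# Original regex, with the leading space trimmed off
title <- trimws(sub("[^.]+\\.([^.]+)\\..*", "\\1", reference))
title
#> [1] "Training for the Metaindustrial Work Culture"

# Made-up reference with periods in the initials
tricky <- "Harris P. R. (1983). Some Title. Some Journal."

# The original regex stops at the first period and returns an initial
naive <- sub("[^.]+\\.([^.]+)\\..*", "\\1", tricky)
naive
#> [1] " R"

# Anchoring on "(yyyy)." instead skips the periods in the author list
anchored <- sub("^.*?\\(\\d{4}\\)\\.\\s*([^.]+)\\..*$", "\\1", tricky, perl = TRUE)
anchored
#> [1] "Some Title"
```

Titles that themselves contain a period would still be cut short by either pattern.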
Using tidyverse functions:
library(tidyverse)
refs <- data.frame(reference = "Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22.")
refs %>%
mutate(author = sub("\\(.*", "", reference),
year = str_extract(reference, "\\d{4}"),
title = sub("[^.]+\\.([^.]+)\\..*", "\\1", reference))
#> reference
#> 1 Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22.
#> author year title
#> 1 Harris P R, Harris D L 1983 Training for the Metaindustrial Work Culture
Created on 2022-12-05 with reprex v2.0.2
I am struggling to only keep the part before the first " - ".
If I try this regex on regex101.com I get the expected output but when I try it in R I get a different output.
authors <- sub("\\s-\\s.*", "", authors)
Input:
[1] "T Dietz, RL Shwom, CT Whitley - Annual Review of Sociology, 2020 - annualreviews.org"
[2] "L Berrang-Ford, JD Ford, J Paterson - Global environmental change, 2011 - Elsevier"
[3] "CD Thomas - Diversity and Distributions, 2010 - Wiley Online Library"
Expected output:
[1] "T Dietz, RL Shwom, CT Whitley"
[2] "L Berrang-Ford, JD Ford, J Paterson"
[3] "CD Thomas"
Actual output:
[1] "T Dietz, RL Shwom, CT Whitley - Annual Review of Sociology, 2020"
[2] "L Berrang-Ford, JD Ford, J Paterson - Global environmental change, 2011"
[3] "CD Thomas - Diversity and Distributions, 2010"
Thanks in advance!
It seems your input contains some Unicode whitespace characters.
In this case, the following will work:
sub("(*UTF)(*UCP)\\s-\\s.*", "", authors, perl=TRUE)
The (*UTF)(*UCP) prefix (or probably just (*UCP)) makes \s match any Unicode whitespace.
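A quick way to check for such characters, as a sketch: the input below is made up and assumes the stray character is a no-break space (U+00A0) just before the first hyphen; the question does not say which character is actually involved.

```r
# Made-up input: "\u00a0" is a no-break space placed before the first hyphen
authors <- "CD Thomas\u00a0- Diversity and Distributions, 2010 - Wiley Online Library"

# Inspect the code point of the 10th character; an ordinary space would be 32
code_point <- utf8ToInt(substr(authors, 10, 10))
code_point
#> [1] 160

# Option 1: let \s match Unicode whitespace via PCRE, as in the answer above
sub("(*UCP)\\s-\\s.*", "", authors, perl = TRUE)

# Option 2: replace no-break spaces with ordinary spaces first,
# then use the original pattern unchanged
normalised <- gsub("\u00a0", " ", authors, fixed = TRUE)
cleaned <- sub("\\s-\\s.*", "", normalised)
cleaned
#> [1] "CD Thomas"
```

Option 2 only handles the one character it names, so it is worth inspecting the code points first.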
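You can also use this regex and replace the matches with nothing, for example in Notepad++:
\s-\s.*$
The whitespace around the hyphen matters: a bare - would also match inside hyphenated names such as Berrang-Ford.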
You can also just split the string on your delimiter (" -") and take the first element:
sapply(strsplit(authors, " -", fixed = TRUE), `[[`, 1)
[1] "T Dietz, RL Shwom, CT Whitley" "L Berrang-Ford, JD Ford, J Paterson"
[3] "CD Thomas"
You can also use regex greedy matching to remove everything after and including your delimiter. Because it is greedy it will match as much as possible:
stringr::str_remove(authors, " -.*")
[1] "T Dietz, RL Shwom, CT Whitley" "L Berrang-Ford, JD Ford, J Paterson"
[3] "CD Thomas"
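To see what greedy matching changes here, a short sketch comparing it with a lazy quantifier on one of the inputs from the question (base R sub is used so that no extra package is needed):

```r
x <- "CD Thomas - Diversity and Distributions, 2010 - Wiley Online Library"

# Greedy: .* runs to the end of the string, so everything from the
# first " -" onwards is removed
greedy <- sub(" -.*", "", x)
greedy
#> [1] "CD Thomas"

# Lazy: .*? matches as little as possible (nothing), so only " -" goes
lazy <- sub(" -.*?", "", x, perl = TRUE)
lazy
#> [1] "CD Thomas Diversity and Distributions, 2010 - Wiley Online Library"
```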
Too long for a comment at the moment, may delete later. When I run this code alone, I get your expected output:
authors <- c("T Dietz, RL Shwom, CT Whitley - Annual Review of Sociology, 2020 - annualreviews.org",
"L Berrang-Ford, JD Ford, J Paterson - Global environmental change, 2011 - Elsevier",
"CD Thomas - Diversity and Distributions, 2010 - Wiley Online Library")
sub("\\s-\\s.*", "", authors)
#[1] "T Dietz, RL Shwom, CT Whitley" "L Berrang-Ford, JD Ford, J Paterson" "CD Thomas"
This might have something to do with the fact that you reassign to authors every time you try subbing, which overwrites authors. You might have been doing that as you were developing the regex, and forgot to reassign the authors vector to the original.
I have a vector of organization names in a dataframe. Some of them are just fine, others have the name repeated twice in the same element. Also, when that name is repeated, there is no separating space so the name has a camelCase appearance.
For example (id column added for general dataframe referencing):
id  org
1   Alpha Company
2   Bravo InstituteBravo Institute
3   Charlie Group
4   Delta IncorporatedDelta Incorporated
but it should look like:
id  org
1   Alpha Company
2   Bravo Institute
3   Charlie Group
4   Delta Incorporated
I have a solution that gets the result I need--reproducible example code below. However, it seems a bit lengthy and not very elegant.
Does anyone have a better approach for the same results?
Bonus question: If organizations have 'types' included, such as Alpha Company, LLC, then my gsub() line to fix the camelCase does not work as well. Any suggestions on how to adjust the camelCase fix to account for the ", LLC" and still work with the rest of the solution?
Thanks in advance!
(Thanks to the OP & those who helped on the previous SO post about splitting camelCase strings in R)
# packages
library(stringr)
# toy data
df <- data.frame(id=1:4, org=c("Alpha Company", "Bravo InstituteBravo Institute", "Charlie Group", "Delta IncorporatedDelta Incorporated"))
# split up & clean camelCase words
df$org_fix <- gsub("([A-Z])", " \\1", df$org)
df$org_fix <- str_trim(str_squish(df$org_fix))
# temp vector with half the org names
df$org_half <- word(df$org_fix, start=1, end=(sapply(strsplit(df$org_fix, " "), length)/2)) # stringr::word
# double the temp vector
df$org_dbl <- paste(df$org_half, df$org_half)
# flag TRUE for orgs that contain duplicates in name
df$org_dup <- df$org_fix == df$org_dbl
# corrected the org names
df$org_fix <- ifelse(df$org_dup, df$org_half, df$org_fix)
# drop excess columns
df <- df[,c("id", "org_fix")]
# toy data for the bonus question
df2 <- data.frame(id=1:4, org=c("Alpha Company, LLC", "Bravo InstituteBravo Institute", "Charlie Group", "Delta IncorporatedDelta Incorporated"))
Another approach is to compare the first half of the string with the second half of the string. If equal, pick the first half. It also works if there are numbers, underscores or any other characters present in the company name.
org <- c("Alpha Company", "Bravo InstituteBravo Institute", "Charlie Group", "Delta IncorporatedDelta Incorporated", "WD40WD40", "3M3M")
ifelse(substring(org, 1, nchar(org) / 2) == substring(org, nchar(org) / 2 + 1, nchar(org)), substring(org, 1, nchar(org) / 2), org)
# [1] "Alpha Company" "Bravo Institute" "Charlie Group" "Delta Incorporated" "WD40" "3M"
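The same "first half equals second half" test can also be written as a single regex with a backreference. This is an alternative sketch, not taken from the answers here: ^(.+)\\1$ only matches when the whole string is some text immediately followed by an exact copy of itself. The last element is added to cover the bonus case.

```r
org <- c("Alpha Company", "Bravo InstituteBravo Institute", "Charlie Group",
         "Delta IncorporatedDelta Incorporated", "WD40WD40", "3M3M",
         "Alpha Company, LLC")

# Group 1 captures the first copy; \\1 must then match the second copy.
# Strings that are not an exact doubling are returned unchanged.
deduped <- sub("^(.+)\\1$", "\\1", org, perl = TRUE)
deduped
#> [1] "Alpha Company"      "Bravo Institute"    "Charlie Group"
#> [4] "Delta Incorporated" "WD40"               "3M"
#> [7] "Alpha Company, LLC"
```

No camelCase splitting is involved, so names with a suffix such as ", LLC" are left alone unless the whole name is doubled.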
You can use a regex, as in the line below (this needs stringr and assumes each name is two capitalized words):
my_df$org <- str_extract(string = my_df$org, pattern = "([A-Z][a-z]+ [A-Z][a-z]+){1}")
If all individual words start with a capital letter (not followed by another capital letter), then you can split on that. Keep only the unique elements, then paste and collapse. This will also work on the bonus LLC option.
org <- c("Alpha CompanyCompany , LLC", "Bravo InstituteBravo Institute", "Charlie Group", "Delta IncorporatedDelta Incorporated")
sapply(
lapply(
strsplit(gsub("[^A-Za-z0-9]", "", org),
"(?<=[^A-Z])(?=[A-Z])",
perl = TRUE),
unique),
paste0, collapse = " ")
[1] "Alpha Company LLC" "Bravo Institute" "Charlie Group" "Delta Incorporated"
I have this character vector of lines from a journal:
test_1 <- c(" Journal of Neonatal Nursing 27 (2021) 106–110",
" Contents lists available at ScienceDirect",
" Journal of Neonatal Nursing",
" journal homepage: www.elsevier.com/locate/jnn",
"Comparison of inter-facility transports of critically ill neonates who died",
"after admission vs. survivors", "Robert Schultz a, *, Jennifer Berk-King a, Laura Wallace a, Girija Natarajan a, b",
"a", " Children’s Hospital of Michigan, Detroit, MI, USA",
"b", " Division of Neonatology, Wayne State University School of Medicine, Detroit, MI, USA",
"A R T I C L E I N F O A B S T R A C T",
"Keywords: Objective: To compare characteristics before, during and after inter-facility transports (IFT), and changes in the",
"Inter-facility transport Transport Risk Index of Physiologic Stability (TRIPS) before and after inter-facility transports (IFT) in infants",
"Neonatal intensive care who died within 7 days of admission to a level IV NICU versus matched survivors.",
"Mortality", " Study design: This retrospective case-control study included infants who died within 7 days of IFT and controls",
" matched for gestational age and reason for admission. Unplanned events were temperature or respiratory de",
" rangements. Therapeutic interventions included increased respiratory support, resuscitation or blood product",
" transfusion.",
" Results: Our cohort was predominantly preterm and male. Cases had a higher rate of resuscitation, lower Apgar",
" scores, more respiratory acidosis, lower BP and higher TRIPS, compared to controls. Deterioration in TRIPS was",
" independently associated with male gender and unplanned events; not with patient group.",
" Conclusions: Rates of unplanned events, therapeutic interventions, and deterioration in TRIPS following IFT by a",
" transport team are comparable in cases and controls.",
" outcomes. The Transport Risk Index of Physiologic Stability (TRIPS) is",
"1. Introduction an assessment measure of infant status before and after transport (Lee"
)
I want to extract the keywords from these lines, which are Inter-facility transport, Neonatal intensive care, Mortality. I've tried to get the line which has "Keywords" with test_1[str_detect(test_1, "^Keywords:")]. I want to get all the keywords below this line and above 1. Introduction.
What regex or stringr functions will do this?
Thanks
If I understood correctly, you are extracting the text from the PDF downloaded from here. I think you should find a better way to extract text from your PDFs.
Till then, the best option could be this:
library(stringr)
# get the line after ^Keywords:
start <- which(str_detect(test_1, "^Keywords:")) +1
# get the line before ^1. Introduction (the dot is escaped so it matches a literal period)
end <- which(str_detect(test_1, "^1\\. Introduction")) -1
# get the lines in between
x <- test_1[start:end]
# Extract keywords
x <- str_trim(str_sub(x, 1, 60))
x <- x[x!=""]
x
#> [1] "Inter-facility transport" "Neonatal intensive care" "Mortality"
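The start/end indexing above can be wrapped in a small base R helper that fails loudly when a marker line is missing or appears more than once. This is only a sketch: lines_between is a hypothetical name and the toy vector is made up.

```r
# Hypothetical helper: return the lines strictly between the single line
# matching start_pattern and the single line matching end_pattern.
lines_between <- function(text, start_pattern, end_pattern) {
  start <- grep(start_pattern, text)
  end <- grep(end_pattern, text)
  # Stop early if a marker is missing, ambiguous, or in the wrong order
  stopifnot(length(start) == 1, length(end) == 1, start < end)
  if (end - start < 2) return(character(0))
  text[(start + 1):(end - 1)]
}

# Made-up toy input
toy <- c("Some title",
         "Keywords: Objective: ...",
         "Inter-facility transport",
         "Mortality",
         "1. Introduction some text")

between <- lines_between(toy, "^Keywords:", "^1\\. Introduction")
between
#> [1] "Inter-facility transport" "Mortality"
```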
EDIT:
You can define a function to find the index of the line at which Keywords occurs and the indices of the lines below that line:
find_keywords <- function(pattern, text) {
index <- which(grepl(pattern, text))
sort(c(index + 1, index + 2, index + 3)) # If you suspect there are more than three keywords, then just `index + ...`
}
Based on that function, you can extract the keywords:
library(stringr)
str_extract(test_1[find_keywords(pattern = "^Keywords:", text = test_1)], "^\\S+")
[1] "Inter-facility" "Neonatal" "Mortality"
Below is a vector I'm working with. What I am trying to do is extract only the ages (including whether the number is in months or years) from each entry in the vector. I know I have to use str/grep functions and regex, but I'm not sure how to combine the functions to get what I want.
All ages are expressed like this: number time interval sex.
So for example: 18MOM is an 18 month old male, 18YOF is 18 year old female etc.
[1] "DX LAC CHIN/ABRASION CHEEK/CONTU HAND(S): 6YOF OUT RIDING BIKE, W WOBBLY ON BIKE AND HIT FACE ON ROAD, ABRASION TO L CHEEK, CHIN & R HAND"
[2] "DX LWOBS: 2YOM L PINKY FINGER CAUGHT IN BOWLING BALL, SM AMT BLDG/SWELLING TO PINKY FINGER. CRUSH W BOWLING BALL"
[3] "DX KNEE SPRAIN/CONTU KNEE/HIGH BLD PRESS: 16YOM R KNEE PN AFTER TWISTING KNEE COMING DOWN F JUMP' DUR' BASKETBALL GAME, LANDED ON BENT KNEE"
[4] "DX LBP: 21YOM STRETCHING OUT AFTER WORKOUT (DOING ***) HEARD POP"
[5] "DX FX PHALANX FOOT: 36YOF STUBBED R GREAT TOE ON STAIRS, PN, SWELL' SUROUNDING R GREAT TOE"
[6] "DX ELBOW CONTU/ELBOW ABRASION: 10YOM FELL F BED HAND HIT R ELBOW ON BEDPLAYING W SISTER, BRUSING TO ELBOW"
[7] "DX LWOBS: 3YOM LAC TO SCALP/ S/P PLASTIC LAMP FELL OFF DRESSER TO HEAD,PT W ~1CM LAC"
[8] "DX CONTU FINGER: 55YOM L 5TH FINGER PN AFTER FALL F BICYCLE W TRYING TOBAL AT STOPPED POSITION"
[9] "DX COSTOCHONDRITIS/CHEST PN: 24YOM SUBSTERNAL CHEST PN W WORKING OUT, HAD SHARP SPASM PN TO SUBSTERNAL CHEST TO L CHEST"
[10] "DX 1ST DEG BURN E: 28YOF W BURN TO L HAND, GRABBED HOT PAN UNDER BROILER W/O POTHOLDER; REDNESS TO PLAM & FINGER TIPS, FEW BLISTERS START' G F"
[11] "DX LWOBS LAC HAND: 1YOM W FINGER INJ, CUT FINGER ON A FAN"
agevector <- gsub(".* (\\d+[MY]O).*", "\\1", vector)
This creates agevector, a character vector containing values like 19MO and 5YO.
It looks for the pattern "[a space] followed by [one or more digits] followed by [M or Y] followed by O".
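If the number and the unit are needed separately, a base R sketch along these lines may help. The first two notes are shortened from the question, the third is made up to include a months case, and the sketch assumes every note contains exactly one age token.

```r
notes <- c("DX LWOBS: 2YOM L PINKY FINGER CAUGHT IN BOWLING BALL",
           "DX LBP: 21YOM STRETCHING OUT AFTER WORKOUT",
           "DX RASH: 18MOF W RASH ON ARM")  # made-up entry

# Digits, then MO or YO, only when followed by the sex letter (M or F);
# the lookahead keeps the sex letter out of the match.
pattern <- "\\b\\d+[MY]O(?=[MF]\\b)"
ages <- regmatches(notes, regexpr(pattern, notes, perl = TRUE))
ages
#> [1] "2YO"  "21YO" "18MO"

# Split each token into a numeric value and a unit
value <- as.integer(sub("[MY]O$", "", ages))
unit <- ifelse(grepl("MO$", ages), "months", "years")
data.frame(value, unit)
```

Note that regmatches with regexpr silently drops entries without a match, so the result can be shorter than the input.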
You can use stringr: first extract all the ages from your text, then carry out any further analysis.
This code will do the trick (assuming your string vector is called str):
library(stringr)
ages <- str_extract_all(str, "(\\d{1,2}[MY]O[MF])", simplify = TRUE)
Use case:
library(stringr)
str <- c("DX LAC CHIN/ABRASION 12YOF CHEEK/CONTU HAND(S): 6YOF OUT RIDING BIKE, W WOBBLY ON BIKE AND HIT FACE ON ROAD, ABRASION TO L CHEEK, CHIN & R HAND",
"DX KNEE SPRAIN/CONTU KNEE/HIGH BLD PRESS: 16YOM R KNEE PN AFTER TWISTING KNEE COMING DOWN F JUMP' DUR' BASKETBALL GAME, LANDED ON BENT KNEE",
"DX FX PHALANX FOOT: 36YOF STUBBED R GREAT TOE ON STAIRS, PN, SWELL' SUROUNDING R GREAT TOE")
str <- paste(str, collapse = '')
ages <- str_extract_all(str, "(\\d{1,2}[MY]O[MF])", simplify = TRUE)
Output:
> ages
[,1] [,2] [,3] [,4]
[1,] "12YOF" "6YOF" "16YOM" "36YOF"
Hope this helps.