How to extract substring between periods in R

I need to create a dataframe from a .csv file containing author references:
refs <- data.frame(reference = "Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22.")
Essentially I want to pull out the coauthors, year of publication, and article title.
refs$author[1]
Harris P R, Harris D L
refs$year[1]
1983
refs$title[1]
Training for the Metaindustrial Work Culture
At this stage, I do not need a publication source as I can get this via rscopus.
I can extract authors and years with this code:
refs <- refs %>%
  mutate(author = sub("\\(.*", "", reference),
         year = str_extract(reference, "\\d{4}"))
However, I need help extracting the title (the substring between the two periods that follow the bracketed date).

This regex works for your minimal example:
refs <- data.frame(reference = "Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22.")
sub("[^.]+\\.([^.]+)\\..*", "\\1", refs$reference)
#> [1] " Training for the Metaindustrial Work Culture"
Explanation:
"[^.]+\\.([^.]+)\\..*" - whole regex
[^.]+\\. - one or more characters that aren't a period, followed by a period (i.e. everything up to and including the first period)
([^.]+)\\..* - start capturing group 1 with "(", match one or more characters that aren't a period ([^.]+), close group 1 with ")" at the next period "\\." (group 1 is now the title), then match everything else with ".*"
Then, in the sub command, you print group 1 ("\\1").
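Note that the captured group keeps the leading space that follows the first period. If that's unwanted, wrapping the call in base R's trimws() strips it:
trimws(sub("[^.]+\\.([^.]+)\\..*", "\\1", refs$reference))
#> [1] "Training for the Metaindustrial Work Culture"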
Unfortunately, you may run into problems with your 'real world' data. Using rscopus to extract the title might be a better solution to avoid unforeseen errors.
Using tidyverse functions:
library(tidyverse)
refs <- data.frame(reference = "Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22.")
refs %>%
  mutate(author = sub("\\(.*", "", reference),
         year = str_extract(reference, "\\d{4}"),
         title = sub("[^.]+\\.([^.]+)\\..*", "\\1", reference))
#> reference
#> 1 Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22.
#> author year title
#> 1 Harris P R, Harris D L 1983 Training for the Metaindustrial Work Culture
Created on 2022-12-05 with reprex v2.0.2

Related

Using keywords to label a new column in R

I need to mutate a new column "Group" based on those keywords.
I tried using %in% but did not get the data I expected.
I want to create an extra column named 'Group' in my df data frame.
In this column, I want to label every row using some keywords
(from the keyword vectors, or maybe another keywords data frame).
For example:
library(tibble)
df <- tibble(Title = c("Iran: How we are uncovering the protests and crackdowns",
"Deepak Nirula: The man who brought burgers and pizzas to India",
"Phil Foden: Manchester City midfielder signs new deal with club until 2027",
"The Danish tradition we all need now",
"Slovakia LGBT attack"),
Text = c("Iranian authorities have been disrupting the internet service in order to limit the flow of information and control the narrative, but Iranians are still sending BBC Persian videos of protests happening across the country via messaging apps. Videos are also being posted frequently on social media.
Before a video can be used in any reports, journalists need to establish where and when it was filmed.They can pinpoint the location by looking for landmarks and signs in the footage and checking them against satellite images, street-level photos and previous footage. Weather reports, the position of the sun and the angles of shadows it creates can be used to confirm the timing.",
"For anyone who grew up in capital Delhi during the 1970s and 1980s, Nirula's - run by the family of Deepak Nirula who died last week - is more than a restaurant. It's an emotion.
The restaurant transformed the eating-out culture in the city and introduced an entire generation to fast food, American style, before McDonald's and KFC came into the country. For many it was synonymous with its hot chocolate fudge.",
"Stockport-born Foden, who has scored two goals in 18 caps for England, has won 11 trophies with City, including four Premier League titles, four EFL Cups and the FA Cup.He has also won the Premier League Young Player of the Season and PFA Young Player of the Year awards in each of the last two seasons.
City boss Pep Guardiola handed him his debut as a 17-year-old and Foden credited the Spaniard for his impressive development over the last five years.",
"Norwegian playwright and poet Henrik Ibsen popularised the term /friluftsliv/ in the 1850s to describe the value of spending time in remote locations for spiritual and physical wellbeing. It literally translates to /open-air living/, and today, Scandinavians value connecting to nature in different ways – something we all need right now as we emerge from an era of lockdowns and inactivity.",
"The men were shot dead in the capital Bratislava on Wednesday, in a suspected hate crime.Organisers estimated that 20,000 people took part in the vigil, mourning the men's deaths and demanding action on LGBT rights.Slovak President Zuzana Caputova, who has raised the rainbow flag over her office, spoke at the event.")
)
keyword1 <- c("authorities", "Iranian", "Iraq", "control", "Riots")
keyword2 <- c("McDonald's","KFC", "McCafé", "fast food")
keyword3 <- c("caps", "trophies", "season", "seasons")
keyword4 <- c("travel", "landscape", "living", "spiritual")
keyword5 <- c("LGBT", "lesbian", "les", "rainbow", "Gay", "Bisexual","Transgender")
I need to mutate a new column "Group" based on those keywords:
if it matches keyword1, label it "Politics";
if it matches keyword2, label it "Food";
if it matches keyword3, label it "Sport";
if it matches keyword4, label it "Travel";
if it matches keyword5, label it "LGBT".
Can the matching also ignore case?
Below is the expected output:
Title         Text        Group
Iran: How..   Iranian...  Politics
Deepak Nir..  For any...  Food
Phil Foden..  Stockpo...  Sport
The Danish..  Norwegi...  Travel
Slovakia L..  The men...  LGBT
Thanks to everyone who spends time on this.
You could try this:
df %>%
  rowwise %>%
  mutate(
    ## add column with words found in title or text (splitting by non-word character):
    words = list(strsplit(split = '\\W', paste(Title, Text)) %>% unlist),
    group = {
      categories <- list(keyword1, keyword2, keyword3, keyword4, keyword5)
      ## i indexes those items (= keyword vectors) of list 'categories'
      ## which share at least one word with column Title or Text (so that length > 0)
      i <- categories %>% lapply(\(category) length(intersect(unlist(words), category))) %>% as.logical
      ## pick group name via index; join with ',' if more than one category applies
      c('Politics', 'Food', 'Sport', 'Travel', 'LGBT')[i] %>% paste(collapse = ',')
    }
  )
output:
## # A tibble: 5 x 4
## # Rowwise:
## Title Text words group
## <chr> <chr> <lis> <chr>
## 1 Iran: How we are uncovering the protests and crackdowns "Ira~ <chr> Poli~
## 2 Deepak Nirula: The man who brought burgers and pizzas to In~ "For~ <chr> Food
## 3 Phil Foden: Manchester City midfielder signs new deal with ~ "Sto~ <chr> Sport
## 4 The Danish tradition we all need now "Nor~ <chr> Trav~
## 5 Slovakia LGBT attack "The~ <chr> LGBT
Check this out. The basic idea is to define all keyword* vectors as case-insensitive alternation patterns (hence the (?i) in the patterns and the | for collapsing), wrapped in word boundaries (the \\b before and after the alternatives ensures that "caps" is matched but not, for example, "capsize"), and to use nested ifelse statements to assign the Group labels:
library(tidyverse)
df %>%
  mutate(
    All = str_c(Title, Text),
    Group = ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword1, collapse = "|"), ")\\b")), "Politics",
            ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword2, collapse = "|"), ")\\b")), "Food",
            ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword3, collapse = "|"), ")\\b")), "Sport",
            ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword4, collapse = "|"), ")\\b")), "Travel", "LGBT"))))
  ) %>%
  select(Group)
# A tibble: 5 × 1
Group
<chr>
1 Politics
2 Food
3 Sport
4 Travel
5 LGBT
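If you'd rather avoid the nested ifelse calls, dplyr::case_when expresses the same ordered logic more flatly. A minimal sketch under the same setup (the kw_pattern helper is just a convenience introduced here, not part of the original code):
library(tidyverse)
# build the case-insensitive, word-bounded alternation pattern for a keyword vector
kw_pattern <- function(kws) str_c("(?i)\\b(", str_c(kws, collapse = "|"), ")\\b")
df %>%
  mutate(
    All = str_c(Title, Text),
    Group = case_when(
      str_detect(All, kw_pattern(keyword1)) ~ "Politics",
      str_detect(All, kw_pattern(keyword2)) ~ "Food",
      str_detect(All, kw_pattern(keyword3)) ~ "Sport",
      str_detect(All, kw_pattern(keyword4)) ~ "Travel",
      TRUE ~ "LGBT"
    )
  ) %>%
  select(Group)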

How to extract keywords below one line and above another in text from an article

I have this character vector of lines from a journal:
test_1 <- c(" Journal of Neonatal Nursing 27 (2021) 106–110",
" Contents lists available at ScienceDirect",
" Journal of Neonatal Nursing",
" journal homepage: www.elsevier.com/locate/jnn",
"Comparison of inter-facility transports of critically ill neonates who died",
"after admission vs. survivors", "Robert Schultz a, *, Jennifer Berk-King a, Laura Wallace a, Girija Natarajan a, b",
"a", " Children’s Hospital of Michigan, Detroit, MI, USA",
"b", " Division of Neonatology, Wayne State University School of Medicine, Detroit, MI, USA",
"A R T I C L E I N F O A B S T R A C T",
"Keywords: Objective: To compare characteristics before, during and after inter-facility transports (IFT), and changes in the",
"Inter-facility transport Transport Risk Index of Physiologic Stability (TRIPS) before and after inter-facility transports (IFT) in infants",
"Neonatal intensive care who died within 7 days of admission to a level IV NICU versus matched survivors.",
"Mortality", " Study design: This retrospective case-control study included infants who died within 7 days of IFT and controls",
" matched for gestational age and reason for admission. Unplanned events were temperature or respiratory de­",
" rangements. Therapeutic interventions included increased respiratory support, resuscitation or blood product",
" transfusion.",
" Results: Our cohort was predominantly preterm and male. Cases had a higher rate of resuscitation, lower Apgar",
" scores, more respiratory acidosis, lower BP and higher TRIPS, compared to controls. Deterioration in TRIPS was",
" independently associated with male gender and unplanned events; not with patient group.",
" Conclusions: Rates of unplanned events, therapeutic interventions, and deterioration in TRIPS following IFT by a",
" transport team are comparable in cases and controls.",
" outcomes. The Transport Risk Index of Physiologic Stability (TRIPS) is",
"1. Introduction an assessment measure of infant status before and after transport (Lee"
)
I want to extract the keywords from these lines, which are Inter-facility transport, Neonatal intensive care, and Mortality. I've tried to get the line that contains "Keywords" with test_1[str_detect(test_1, "^Keywords:")]. Now I want to get all the keywords below this line and above "1. Introduction".
What regex or stringr functions will do this?
Thanks
If I understood correctly, you are sort of scanning the PDF downloaded from here. I think you should find a better way to scan your PDFs.
Till then, the best option could be this:
library(stringr)
# get the line after ^Keywords:
start <- which(str_detect(test_1, "^Keywords:")) +1
# get the line before ^1. Introduction
end <- which(str_detect(test_1, "^1. Introduction")) -1
# get the lines in between
x <- test_1[start:end]
# Extract keywords
x <- str_trim(str_sub(x, 1, 60))
x <- x[x!=""]
x
#> [1] "Inter-facility transport" "Neonatal intensive care" "Mortality"
EDIT:
You can define a function to find the index of the line at which Keywords occurs and the indices of the lines below that line:
find_keywords <- function(pattern, text) {
  index <- which(grepl(pattern, text))
  sort(c(index + 1, index + 2, index + 3)) # if you suspect there are more than three keywords, just add `index + ...`
}
Based on that function, you can extract the keywords:
library(stringr)
str_extract(test_1[find_keywords(pattern = "^Keywords:", text = test_1)], "^\\S+")
[1] "Inter-facility" "Neonatal" "Mortality"
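If the number of keyword lines varies between articles, a sketch that takes the count as an argument (the default n = 3 mirrors the hardcoded version above):
find_keywords <- function(pattern, text, n = 3) {
  index <- which(grepl(pattern, text))
  index + seq_len(n) # the n lines immediately below the matching line
}
str_extract(test_1[find_keywords("^Keywords:", test_1)], "^\\S+")
[1] "Inter-facility" "Neonatal" "Mortality"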

Remove words that occur only once and with low IDF in R

I have a dataframe with a column with some text in it. I want to do three data pre-processing steps:
1) remove words that occur only once,
2) remove words with low inverse document frequency (IDF), and
3) remove words that occur most frequently.
This is an example of the data:
head(stormfront_data$stormfront_self_content)
Output:
[1] " , , stormfront! thread members post introduction, \".\" stumbled white networking site, reading & decided register account, largest networking site white brothers, sisters! read : : guidelines posting - stormfront introduction stormfront - stormfront main board consists forums, -forums : newslinks & articles - stormfront ideology philosophy - stormfront activism - stormfront network local level: local regional - stormfront international - stormfront , . addition main board supply social groups utilized networking. final note: steps sustaining member, core member site online, affords additional online features. sf: shopping cart stormfront!"
[2] "bonjour warm brother ! forward speaking !"
[3] " check time time forums. frequently moved columbia distinctly numbered. groups gatherings "
[4] " ! site pretty nice. amount news articles. main concern moment islamification."
[5] " , discovered site weeks ago. finally decided join found article wanted share . proud race long time idea site people shared views existed."
[6] " white brothers, names jay member years, bit info ? stormfront meet ups ? stay strong guys jay, uk"
Any help would be greatly appreciated, as I am not too familiar with R.
Here's a solution to Q1 in several steps:
Step 1: clean data by replacing every run of non-word characters (\\W+) with a single space:
data2 <- trimws(paste0(gsub("\\W+", " ", data), collapse = ""))
Step 2: Make a sorted frequency list of the words:
fw <- as.data.frame(sort(table(strsplit(data2, "\\s{1,}")), decreasing = T))
Step 3: define a pattern to match (namely all the words that occur only once), making sure to wrap them in word-boundary markers (\\b) so that only exact matches get matched (e.g., network but not networking):
pattern <- paste0("\\b(", paste0(fw$Var1[fw$Freq==1], collapse = "|"), ")\\b")
Step 4: remove matched words:
data3 <- gsub(pattern, "", data2)
Step 5: clean up by removing superfluous spaces:
data4 <- trimws(gsub("\\s{1,}", " ", data3))
Result:
[1] "stormfront introduction white networking site decided networking site white brothers stormfront introduction stormfront stormfront main board forums forums articles stormfront stormfront stormfront local local stormfront stormfront main board groups networking member member site online online stormfront time time forums groups site articles main site decided time site white brothers jay member stormfront jay"
Here is an approach with tidytext
library(tidytext)
library(dplyr)
word_count <- tibble(document = seq(1, nrow(data)), text = data) %>%
  unnest_tokens(word, text) %>%
  count(document, word, sort = TRUE)
total_count <- tibble(document = seq(1, nrow(data)), text = data) %>%
  unnest_tokens(word, text) %>%
  group_by(word) %>%
  summarize(total = n())
words <- left_join(word_count, total_count)
words %>%
  bind_tf_idf(word, document, n)
# A tibble: 111 x 7
document word n total tf idf tf_idf
<int> <chr> <int> <int> <dbl> <dbl> <dbl>
1 1 stormfront 10 11 0.139 1.10 0.153
2 1 networking 3 3 0.0417 1.79 0.0747
3 1 site 3 6 0.0417 0.693 0.0289
4 1 board 2 2 0.0278 1.79 0.0498
5 1 forums 2 3 0.0278 1.10 0.0305
6 1 introduction 2 2 0.0278 1.79 0.0498
7 1 local 2 2 0.0278 1.79 0.0498
8 1 main 2 3 0.0278 1.10 0.0305
9 1 member 2 3 0.0278 1.10 0.0305
10 1 online 2 2 0.0278 1.79 0.0498
# … with 101 more rows
From here it is trivial to filter with dplyr::filter, but since you don't define any specific criteria other than "only once", I'll leave that to you.
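For example, a sketch applying all three filters at once (the IDF cutoff of 0.5 is an arbitrary placeholder to tune, not a value from the question):
words %>%
  bind_tf_idf(word, document, n) %>%
  filter(total > 1,           # 1) drop words occurring only once
         idf > 0.5,           # 2) drop low-IDF words (arbitrary cutoff)
         total < max(total))  # 3) drop the most frequent word(s)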
Data
data <- structure(c(" , , stormfront! thread members post introduction, \".\" stumbled white networking site, reading & decided register account, largest networking site white brothers, sisters! read : : guidelines posting - stormfront introduction stormfront - stormfront main board consists forums, -forums : newslinks & articles - stormfront ideology philosophy - stormfront activism - stormfront network local level: local regional - stormfront international - stormfront , . addition main board supply social groups utilized networking. final note: steps sustaining member, core member site online, affords additional online features. sf: shopping cart stormfront!",
"bonjour warm brother ! forward speaking !", " check time time forums. frequently moved columbia distinctly numbered. groups gatherings ",
" ! site pretty nice. amount news articles. main concern moment islamification.",
" , discovered site weeks ago. finally decided join found article wanted share . proud race long time idea site people shared views existed.",
" white brothers, names jay member years, bit info ? stormfront meet ups ? stay strong guys jay, uk"
), .Dim = c(6L, 1L))
Base R solution:
# Remove double spacing and punctuation at the start of strings:
# cleaned_str => character vector
cstr <- trimws(gsub("\\s*[[:punct:]]+", "", trimws(gsub('\\s+|^\\s*[[:punct:]]+|"',
' ', df), "both")), "both")
# Calculate the document frequency: document_freq => data.frame
document_freq <- data.frame(table(unlist(sapply(cstr, function(x){
unique(unlist(strsplit(x, "[^a-z]+")))}))))
# Store the inverse document frequency as a vector: idf => double vector:
document_freq$idf <- log(length(cstr)/document_freq$Freq)
# For each record remove terms that occur only once, occur the maximum number
# of times a word occurs in the dataset, or words with a "low" idf:
# pp_records => character vector
pp_records <- do.call("rbind", lapply(cstr, function(x){
# Store the term and corresponding term frequency as a data.frame: tf_dataf => data.frame
tf_dataf <- data.frame(table(as.character(na.omit(gsub("^$", NA_character_,
unlist(strsplit(x, "[^a-z]+")))))),
stringsAsFactors = FALSE)
# Store a vector containing each term's idf: idf => double vector
tf_dataf$idf <- document_freq$idf[match(tf_dataf$Var1, document_freq$Var1)]
# Explicitly return the ppd vector: .GlobalEnv() => character vector
return(
data.frame(
cleaned_record = x,
pp_records =
paste0(unique(unlist(
strsplit(gsub("\\s+", " ",
trimws(
gsub(paste0(tf_dataf$Var1[tf_dataf$Freq == 1 |
tf_dataf$idf < (quantile(tf_dataf$idf, .25) - (1.5 * IQR(tf_dataf$idf))) |
tf_dataf$Freq == max(tf_dataf$Freq)],
collapse = "|"), "", x), "both"
)), "\\s")
)), collapse = " "),
row.names = NULL,
stringsAsFactors = FALSE
)
)
}
))
# Column bind cleaned strings with the original records: ppd_cleaned_df => data.frame
ppd_cleaned_df <- cbind(orig_record = df, pp_records)
# Output to console: ppd_cleaned_df => stdout (console)
ppd_cleaned_df

How can I use Mutate with ifelse to parse string data into a new variable?

I'm new to R and am playing around with the Titanic Kaggle dataset. I've watched David Langer's great YouTube videos on exploring this dataset, and he is able to parse out the titles of each passenger with a for loop. However, I can't help but figure there is an easier way to do this with mutate and stringr.
note: titanic.full = data.frame
This is my best guess... obviously it doesn't work though:
mutate(titanic.full, Title = ifelse(str_detect(titanic.full$Name, "Mr."), "Mr.") elseif(str_detect(titanic.full$Name, "Mrs."), "Mrs."), "Other")
Any guidance would be very appreciated.
Using a regular expression match seems easier here. .*? matches all characters up to the first occurrence of what follows. (Mr|Mrs|Miss|$) matches any of the options with $ meaning end of line (in order to capture any lines that have none of the prior values). Finally .* matches whatever is left. "\\1" refers to the characters that match the portion of the pattern within parentheses.
titanic.full %>% mutate(Title = sub(".*?(Mr|Mrs|Miss|$).*", "\\1", Name))
Note: Since the input was not provided reproducibly in the question we provide it here:
u <- "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/Titanic.csv"
titanic.full <- read.csv(u)
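For completeness, the mutate/ifelse approach from the question can also be made to work. A sketch using nested ifelse with word-boundary patterns (so the titles match whether or not they are followed by a period):
library(dplyr)
library(stringr)
titanic.full %>%
  mutate(Title = ifelse(str_detect(Name, "\\bMrs\\b"), "Mrs",
                 ifelse(str_detect(Name, "\\bMr\\b"), "Mr",
                 ifelse(str_detect(Name, "\\bMiss\\b"), "Miss", "Other"))))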
If you want the tidyverse solution, you can do the following (note that tidyr::extract expects exactly one capture group per column in `into`, so only the title is wrapped in parentheses):
library(tidyverse)
df <- "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/Titanic.csv"
df <- read.csv(df, stringsAsFactors = FALSE)
df <- as_tibble(df)
df
df %>%
  extract(Name,
          "Title",
          "(Mr|Mrs|Miss) [^ ]+",
          remove = FALSE) %>%
  select(Name, Title)
Which returns:
# A tibble: 1,313 x 2
Name Title
* <chr> <chr>
1 Allen, Miss Elisabeth Walton Miss
2 Allison, Miss Helen Loraine Miss
3 Allison, Mr Hudson Joshua Creighton Mr
4 Allison, Mrs Hudson JC (Bessie Waldo Daniels) Mrs
5 Allison, Master Hudson Trevor <NA>
6 Anderson, Mr Harry Mr
7 Andrews, Miss Kornelia Theodosia Miss
8 Andrews, Mr Thomas, jr Mr
9 Appleton, Mrs Edward Dale (Charlotte Lamson) Mrs
10 Artagaveytia, Mr Ramon Mr
# ... with 1,303 more rows
Thanks to G. Grothendieck for providing the data.

Quantmod FRED Metadata in R

library(quantmod)
getSymbols("GDPC1",src = "FRED")
I am trying to extract the numerical economic/financial data from FRED, but also the metadata. I am trying to chart CPI and use the metadata as labels/footnotes. Is there a way to extract this data using the quantmod package?
Title: Real Gross Domestic Product
Series ID: GDPC1
Source: U.S. Department of Commerce: Bureau of Economic Analysis
Release: Gross Domestic Product
Seasonal Adjustment: Seasonally Adjusted Annual Rate
Frequency: Quarterly
Units: Billions of Chained 2009 Dollars
Date Range: 1947-01-01 to 2014-01-01
Last Updated: 2014-06-25 7:51 AM CDT
Notes: BEA Account Code: A191RX1
Real gross domestic product is the inflation adjusted value of the
goods and services produced by labor and property located in the
United States.
For more information see the Guide to the National Income and Product
Accounts of the United States (NIPA) -
(http://www.bea.gov/national/pdf/nipaguid.pdf)
You can use the same code that's in the body of getSymbols.FRED, but change ".csv" to ".xls", then read the metadata you're interested in from the .xls file.
library(gdata)
Symbol <- "GDPC1"
FRED.URL <- "http://research.stlouisfed.org/fred2/series"
tmp <- tempfile()
download.file(paste0(FRED.URL, "/", Symbol, "/downloaddata/", Symbol, ".xls"),
destfile=tmp)
read.xls(tmp, nrows=17, header=FALSE)
# V1 V2
# 1 Title: Real Gross Domestic Product
# 2 Series ID: GDPC1
# 3 Source: U.S. Department of Commerce: Bureau of Economic Analysis
# 4 Release: Gross Domestic Product
# 5 Seasonal Adjustment: Seasonally Adjusted Annual Rate
# 6 Frequency: Quarterly
# 7 Units: Billions of Chained 2009 Dollars
# 8 Date Range: 1947-01-01 to 2014-01-01
# 9 Last Updated: 2014-06-25 7:51 AM CDT
# 10 Notes: BEA Account Code: A191RX1
# 11 Real gross domestic product is the inflation adjusted value of the
# 12 goods and services produced by labor and property located in the
# 13 United States.
# 14
# 15 For more information see the Guide to the National Income and Product
# 16 Accounts of the United States (NIPA) -
# 17 (http://www.bea.gov/national/pdf/nipaguid.pdf)
Instead of hardcoding nrows=17, you can use grep to search for the row that has the headers of the data, and subset to only include rows before that.
dat <- read.xls(tmp, header=FALSE, stringsAsFactors=FALSE)
dat[seq_len(grep("DATE", dat[, 1])-1),]
unlink(tmp) # remove the temp file when you're done with it.
FRED has a straightforward, well-documented JSON interface (http://api.stlouisfed.org/docs/fred/) which provides both metadata and time series data for all of its economic series. Access requires a FRED account and API key, but these are available on request from http://api.stlouisfed.org/api_key.html.
The Excel descriptive data you asked for can be retrieved using:
get.FRSeriesTags <- function(seriesNam)
{
  # seriesNam = character string containing the ID identifying the FRED series to be retrieved
  #
  library("httr")
  library("jsonlite")
  # dummy FRED api key; request valid key from http://api.stlouisfed.org/api_key.html
  apiKey <- "&api_key=abcdefghijklmnopqrstuvwxyz123456"
  base <- "http://api.stlouisfed.org/fred/"
  seriesID <- paste("series_id=", seriesNam, sep="")
  fileType <- "&file_type=json"
  #
  # get series descriptive data
  #
  datType <- "series?"
  url <- paste(base, datType, seriesID, apiKey, fileType, sep="")
  series <- fromJSON(url)$seriess
  #
  # get series tag data
  #
  datType <- "series/tags?"
  url <- paste(base, datType, seriesID, apiKey, fileType, sep="")
  tags <- fromJSON(url)$tags
  #
  # format as excel descriptive rows
  #
  description <- data.frame(Title = series$title[1],
                            Series_ID = series$id[1],
                            Source = tags$notes[tags$group_id=="src"][1],
                            Release = tags$notes[tags$group_id=="gen"][1],
                            Frequency = series$frequency[1],
                            Units = series$units[1],
                            Date_Range = paste(series[1, c("observation_start","observation_end")], collapse=" to "),
                            Last_Updated = series$last_updated[1],
                            Notes = series$notes[1],
                            row.names = series$id[1])
  return(t(description))
}
Retrieving the actual time series data would be done in a similar way. There are several JSON packages available for R, but jsonlite works particularly well for this application.
There's a bit more to setting this up than the previous answer but perhaps worth it if you do much with FRED data.
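For example, a minimal sketch of pulling the observations themselves via FRED's series/observations endpoint (the function name is a placeholder; same dummy API key convention as above):
get.FRSeriesData <- function(seriesNam)
{
  library("httr")
  library("jsonlite")
  # dummy FRED api key; request valid key from http://api.stlouisfed.org/api_key.html
  apiKey <- "&api_key=abcdefghijklmnopqrstuvwxyz123456"
  base <- "http://api.stlouisfed.org/fred/"
  url <- paste(base, "series/observations?", "series_id=", seriesNam,
               apiKey, "&file_type=json", sep = "")
  obs <- fromJSON(url)$observations
  # observations arrive as strings; missing values come through as "." and become NA
  data.frame(date = as.Date(obs$date), value = as.numeric(obs$value))
}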
