Splitting strings using regex in R

I have a really long list of strings like the following that I want to split into several pieces.
strings <- c("https://www.website.com/stats/stat.227.y2020.eon.t879.html",
             "https://www.website.com/stats/stat.229.y2019.eoff.t476.html")
and the desired output is as below:
links Year Seas Tour
https://www.website.com/stats/stat.227. y2020 eon t879
https://www.website.com/stats/stat.229. y2019 eoff t476
How can I achieve this using regex?

Using str_match:
stringr::str_match(strings, '.*\\.(y\\d+)\\.(\\w+)\\.(t\\d+)')
You can use the same regex in tidyr::extract if you put strings in a dataframe.
tidyr::extract(data.frame(strings), strings, c("Year", "Seas", "Tour"),
               '\\.(y\\d+)\\.(\\w+)\\.(t\\d+)', remove = FALSE)
#                                                        strings  Year Seas Tour
# 1  https://www.website.com/stats/stat.227.y2020.eon.t879.html y2020  eon t879
# 2 https://www.website.com/stats/stat.229.y2019.eoff.t476.html y2019 eoff t476
Here, we capture the data in three parts (capture groups):
1st part - 'y' followed by a number
2nd part - the next word following part 1
3rd part - 't' followed by a number.
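If you want the matches bound into a data frame with the desired column names, a minimal sketch (the extra leading capture group for the link prefix is an addition, not part of the answer's regex above):
library(stringr)
# Column 1 of the result is the full match; columns 2+ are the capture groups.
m <- str_match(strings, '(.*\\.)(y\\d+)\\.(\\w+)\\.(t\\d+)')
data.frame(links = m[, 2], Year = m[, 3], Seas = m[, 4], Tour = m[, 5])
#                                     links  Year Seas Tour
# 1 https://www.website.com/stats/stat.227. y2020  eon t879
# 2 https://www.website.com/stats/stat.229. y2019 eoff t476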

You could use {unglue}:
library(unglue)
unglue::unglue_data(
  strings, "{links}.{Year=[^.]+}.{Seas=[^.]+}.{Tour=[^.]+}.html")
#> links Year Seas Tour
#> 1 https://www.website.com/stats/stat.227 y2020 eon t879
#> 2 https://www.website.com/stats/stat.229 y2019 eoff t476
here "[^.]+" means "one or more non dot characters", which is what we want for Year, Seas, and Tour.

Related

Grepl for 2 words/phrases in proximity in R (dplyr)

I'm trying to create a filter for a large dataframe. I'm using grepl to search for a series of text within a specific column. I've done this for single words/combinations, but now I want to search for two words in close proximity (i.e. the word tumo(u)r within 3 words of the word colon).
I've checked my regular expression on https://www.regextester.com/109207 and my search works there, but it doesn't work within R.
The error I get is
Error: '\W' is an unrecognized escape in character string starting ""\btumor|tumour)\W"
Example below - trying to search for tumo(u)r within 3 words of colon.
Can anyone help?
library(tibble)
library(dplyr)
example.df <- tibble(
  number = 1:4,
  AB = c('tumor of the colon is a very hard disease to cure',
         'breast cancer is also known as a neoplasia of the breast',
         'tumour of the colon is bad',
         'colon cancer is also bad'))
filtered.df <- example.df %>%
filter(grepl(("\btumor|tumour)\W|\w+(\w+\W+){0,3}colon\b"), AB, ignore.case=T)
R uses backslashes as escapes, and the regex engine does, too, so you need to double your backslashes. This is explained in multiple prior questions on StackOverflow, as well as in the help page brought up at ?regex. You should try the escaped operators in a simpler set of tests before attempting complex operations, and pay closer attention to the placement of parentheses and quotes in the pattern argument.
filtered.df <- example.df %>%
#filter(grepl(("\btumor|tumour)\W|\w+(\w+\W+){0,3}colon\b"), AB,
# errors here ....^.^..............^..^...^..^.............^.^
filter(grepl( "(\\btumor|tumour)\\W|\\w+(\\w+\\W+){0,3}colon\\b", AB,
ignore.case=T) )
> filtered.df
# A tibble: 2 × 2
number AB
<int> <chr>
1 1 tumor of the colon is a very hard disease to cure
2 3 tumour of the colon is bad
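For what it's worth, a hedged alternative pattern (not the corrected pattern above): fold the two spellings into a single branch with tumou?r and anchor the proximity test directly to it, so the pattern reads as "tumo(u)r followed within 0 to 3 words by colon":
library(dplyr)
# "tumou?r" makes the "u" optional; (\W+\w+){0,3} allows up to 3 intervening words.
example.df %>%
  filter(grepl("\\btumou?r\\b(\\W+\\w+){0,3}\\W+colon\\b", AB, ignore.case = TRUE))
# Returns rows 1 and 3, the same result as above.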

regex: extract segments of a string containing a word, between symbols

Hello, I have a data frame that looks something like this:
dataframe <- data_frame(text = c('WAFF, some words to keep, ciao, WOFF hey ;other ;WIFF12',
                                 'WUFF;other stuff to keep;WIFF2;yes yes IGWIFF'))
print(dataframe)
# A tibble: 2 × 1
text
<chr>
1 WAFF, some words to keep, ciao, WOFF hey ;other ;WIFF12
2 WUFF;other stuff to keep;WIFF2;yes yes IGWIFF
I want to extract the segment of each string containing the word "keep". Note that these segments can be separated from other parts by different symbols, for example , and ;.
The final dataset should look something like this:
final_dataframe <- data_frame(text = c('some words to keep',
'other stuff to keep'))
print(final_dataframe)
# A tibble: 2 × 1
text
<chr>
1 some words to keep
2 other stuff to keep
Does anyone know how I could do this?
With stringr ...
library(stringr)
library(dplyr)
dataframe %>%
  mutate(text = trimws(str_extract(text, "(?<=[,;]).*keep")))
# A tibble: 2 × 1
text
<chr>
1 some words to keep
2 other stuff to keep
Created on 2022-02-01 by the reprex package (v2.0.1)
I've made great use of the positive lookbehind and positive lookahead group constructs -- check this out: https://regex101.com/r/Sc7h8O/1
If you want to assert that the text you're looking for comes after a character/group -- in your first case the apostrophe -- use (?<=').
If you want to assert that it comes before ' then use (?=').
And since you want to match between 0 and unlimited characters surrounding "keep", use .* on either side, and you wind up with (?<=').*keep.*(?=').
I did find in my test that a string like text = c('WAFF, some words to keep, ciao, WOFF hey ;other ;WIFF12', will also match the c(, which I didn't intend. But I assume your strings are all captured by pairs of apostrophes.
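The same idea works in base R, as a rough sketch (using [^,;]* instead of .* so the match cannot run across a separator):
x <- c("WAFF, some words to keep, ciao, WOFF hey ;other ;WIFF12",
       "WUFF;other stuff to keep;WIFF2;yes yes IGWIFF")
# The lookbehind needs perl = TRUE; regexpr() takes the first match per string.
trimws(regmatches(x, regexpr("(?<=[,;])[^,;]*keep", x, perl = TRUE)))
# [1] "some words to keep"  "other stuff to keep"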

R: extract substring with capital letters from string

I have a dataframe with strings in a column. How could I extract only the substrings that are in capital letters and add them to another column?
This is an example:
       fecha               incident
1 2020-12-01        Check GENERATOR
2 2020-12-01            Check BLADE
3 2020-12-02   Problem in GENERATOR
4 2020-12-01              Check YAW
5 2020-12-02 Alarm in SAFETY SYSTEM
And I would like to create another column as follows:
       fecha               incident        system
1 2020-12-01        Check GENERATOR     GENERATOR
2 2020-12-01            Check BLADE         BLADE
3 2020-12-02   Problem in GENERATOR     GENERATOR
4 2020-12-01              Check YAW           YAW
5 2020-12-02 Alarm in SAFETY SYSTEM SAFETY SYSTEM
I have tried str_sub and str_extract_all with a regex, but I believe I'm doing things wrong.
You can use str_extract if you want to work in a dataframe and tie it into a tidyverse workflow.
The regex asks for either capital letters or spaces, and requires two or more consecutive ones (so it does not match merely capitalized words). str_trim removes the whitespace that can get picked up if the capitalized word is not at the end of the string.
Note that this code snippet will only extract the first run of capitalized words connected by spaces. If there are capitalized words in different parts of the string, only the first one will be returned.
library(tidyverse)
x <- c("CAPITAL and not Capital", "one more CAP word", "MULTIPLE CAPITAL words", "CAP words NOT connected")
cap <- str_trim(str_extract(x, "([:upper:]|[:space:]){2,}"))
cap
#> [1] "CAPITAL" "CAP" "MULTIPLE CAPITAL" "CAP"
Created on 2021-01-08 by the reprex package (v0.3.0)
library(tidyverse)
string <- data.frame(test="does this WORK")
string$new <- str_extract_all(string$test, "[A-Z]+")
string
test new
1 does this WORK WORK
If there are cases where the upper-case letters are not next to each other, you can use str_extract_all to extract all the capital-letter runs in a sentence and then paste them together.
sapply(stringr::str_extract_all(df$incident, '[A-Z]{2,}'), paste0, collapse = ' ')
#[1] "GENERATOR" "BLADE" "GENERATOR" "YAW" "SAFETY SYSTEM"

How to remove decimal points from dataframe column?

I have a .csv dataframe in which one of the columns is a ZIP code. The ZIP code is a factor. Here is an example:
Country <- c("US", "US", "US", "CAN", "CAN")
ZIP <- c("00210", "01210", "65483.0", "H3P", "H3P3C")
data <- data.frame(Country, ZIP)
I did the following but the output is not what I want:
data$ZIP <- round(as.numeric(as.character(data$ZIP)), 0)
Although it removed the decimals, the zip codes 00210 and 01210 became 210 and 1210, and the zip codes for CANADA became NA. I want to keep the US zip codes at 5 digits and preserve the Canadian zip codes.
How can I do that?
Thank you.
Try this
data$ZIP <- sub("\\.\\d+$", "", data$ZIP)
# Country ZIP
# 1 US 00210
# 2 US 01210
# 3 US 65483
# 4 CAN H3P
# 5 CAN H3P3C
Explanation
From the help page, a typical usage of sub is
sub(pattern, replacement, x)
x is a character vector where matches are sought...
In our case, x will be the ZIP column (the values of the ZIP column, to be specific).
The pattern ("\\.\\d+$") breaks down as:
\\. matches a literal dot
\\d+ matches one or more digits
$ anchors the match to the end of the string.
The replacement is the empty string "".
So the call replaces everything from the dot to the end of the string with nothing.
For example
sub("\\.\\d+$", "", 21358.222)
# "21358"
Hope that helps.

R text mining: Create document term matrix from dataframe, convert to dataframe, retain columns from original dataframe

Thanks to lawyeR for recommending the tidytext package. Here is some code based on that package that seems to work pretty well on my sample data. It doesn't work quite so well, though, when the value of the text column is blank. (There are times when this will happen and it makes sense to keep the blank rather than filtering it.) I've set the first observation for TVAR to a blank to illustrate. The code drops this observation. How can I get R to keep the observation and set the frequencies for each word to zero? I tried some ifelse statements with and without the pipe, but it didn't work so well. The trouble seems to center around the unnest_tokens function from the tidytext package.
library(dplyr)
library(tidyr)
library(tidytext)

sampletxt$TVAR[1] <- ""
chunk_words <- sampletxt %>%
  group_by(PTNO, DATE, TYPE) %>%
  unnest_tokens(word, TVAR, to_lower = FALSE) %>%
  count(word) %>%
  spread(word, n, 0)
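A hedged sketch of one possible workaround: build the counts as above, then join them back onto the identifier columns, so a blank document comes back as a row of zeros instead of being dropped (this assumes dplyr/tidyr versions that support across() and replace_na()):
word_counts <- sampletxt %>%
  group_by(PTNO, DATE, TYPE) %>%
  unnest_tokens(word, TVAR, to_lower = FALSE) %>%
  count(word) %>%
  spread(word, n, 0) %>%
  ungroup()

# Rows with blank TVAR produce no tokens and are missing from word_counts;
# the left join brings them back, and the NAs become zeros.
chunk_words <- sampletxt %>%
  select(PTNO, DATE, TYPE) %>%
  left_join(word_counts, by = c("PTNO", "DATE", "TYPE")) %>%
  mutate(across(-c(PTNO, DATE, TYPE), ~ replace_na(.x, 0)))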
I have an R data frame. I want to use it to create a document term matrix. Presumably I would want to use the tm package to do that but there might be other ways. I then want to convert that matrix back to a data frame. I want the final data frame to contain identifying variables from the original data frame.
Question is, how do I do that? I found some answers to a similar question, but that was for a data frame with text and a single ID variable. My data are likely to have about half a dozen variables that identify a given text record. So far, attempts to scale up the solution for a single ID variable haven't proven all that successful.
Below are some sample data. I created these for another task which I did manage to solve.
How can I get a version of this data frame that has an additional frequency column for each word in the text entries and that retains variables like PTNO, DATE, and TYPE?
sampletxt <-
structure(
list(
PTNO = c(1, 2, 2, 3, 3),
DATE = structure(c(16801, 16436, 16436, 16832, 16845), class = "Date"),
TYPE = c(
"Progress note",
"Progress note",
"CAT scan",
"Progress note",
"Progress note"
),
TVAR = c(
"This sentence contains the word metastatic and the word Breast plus the phrase denies any symptoms referable to.",
"This sentence contains tnm code T-1, N-O, M-0. This sentence contains contains tnm code T-1, N-O, M-1. This sentence contains tnm code T1N0M0. This sentence contains contains tnm code T1NOM1. This sentence is a sentence!?!?",
"This sentence contains Dr. Seuss and no target words. This sentence contains Ms. Mary J. blige and no target words.",
"This sentence contains the term stageIV and the word Breast. This sentence contains no target words.",
"This sentence contains the word breast and the term metastatic. This sentence contains the word breast and the term stage IV."
)), .Names = c("PTNO", "DATE", "TYPE", "TVAR"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L))
The quanteda package is faster and more straightforward than tm, and works nicely with tidytext as well. Here's how to do it:
These operations create a corpus from your object, create a document-feature matrix, and then return a data.frame that combines the variables with the feature counts. (Additional options are available when creating the dfm, see ?dfm).
library("quanteda")
samplecorp <- corpus(sampletxt, text_field = "TVAR")
sampledfm <- dfm(samplecorp)
result <- cbind(docvars(sampledfm), as.data.frame(sampledfm))
You can then group by the variables to get your result. (Here I am showing just the first 6 columns.)
dplyr::group_by(result[, 1:6], PTNO, DATE, TYPE)
# # A tibble: 5 x 6
# # Groups: PTNO, DATE, TYPE [5]
# PTNO DATE TYPE this sentence contains
# * <dbl> <date> <chr> <dbl> <dbl> <dbl>
# 1 1 2016-01-01 Progress note 1 1 1
# 2 2 2015-01-01 Progress note 5 6 6
# 3 2 2015-01-01 CAT scan 2 2 2
# 4 3 2016-02-01 Progress note 2 2 2
# 5 3 2016-02-14 Progress note 2 2 2
packageVersion("quanteda")
# [1] ‘0.99.6’
This function from the "SentimentAnalysis" package is the easiest way to do this, especially if you are trying to convert a column of a dataframe to a DTM (though it also works on a txt file). Note that VCorpus, VectorSource, and TermDocumentMatrix come from the tm package:
library("tm")
library("SentimentAnalysis")
corpus <- VCorpus(VectorSource(df$column_or_txt))
tdm <- TermDocumentMatrix(corpus,
                          control = list(wordLengths = c(1, Inf),
                                         tokenize = function(x) ngram_tokenize(x, char = FALSE,
                                                                               ngmin = 1, ngmax = 2)))
It is simple and works like a charm every time, with both Chinese and English, for those doing text mining in Chinese.
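If you then need the counts back in a data frame next to the original columns, a rough sketch (a TermDocumentMatrix stores terms as rows and documents as columns, hence the transpose; df is the same data frame as above):
# Documents become rows after the transpose, matching the rows of df.
dtm_df <- cbind(df, as.data.frame(t(as.matrix(tdm))))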
