Using the same regex for multiple specific columns in R

I have the data as below
Data
df <- structure(list(obs = 1:4, text0 = c("nothing to do with this column even it contains keywords",
"FIFA text", "AFC text", "UEFA text"), text1 = c("here is some FIFA text",
"this row dont have", "some UEFA text", "nothing"), text2 = c("nothing here",
"I see AFC text", "Some samples", "End of text")), class = "data.frame", row.names = c(NA,
-4L))
obs text0 text1 text2
1 1 nothing to do with this column even it contains keywords here is some FIFA text nothing here
2 2 FIFA text this row dont have I see AFC text
3 3 AFC text some UEFA text Some samples
4 4 UEFA text nothing End of text
Expected Output:
obs text0 text1 text2
1 1 nothing to do with this column even it contains keywords here is some FIFA text nothing here
2 2 FIFA text this row dont have I see AFC text
3 3 AFC text some UEFA text Some samples
Question: I have several columns that contain keywords (FIFA, UEFA, AFC) I am looking for. I want to filter on these keywords in specific columns only (in this case: text1 and text2). Any row where one of those keywords is found in text1 or text2 should be kept, as in the expected output; text0 should be ignored. I am wondering if there is any regex to get this result.

Using filter_at
library(dplyr)
library(stringr)
patvec <- c("FIFA", "UEFA", "AFC")
# // create a single pattern string by collapsing the vector with `|`
# // specify the word boundary (\\b) so as not to have any mismatches
pat <- str_c("\\b(", str_c(patvec, collapse="|"), ")\\b")
df %>%
  filter_at(vars(c('text1', 'text2')),
            any_vars(str_detect(., pat)))
With across, filter currently applies all_vars-style matching rather than any_vars. One option is rowwise with c_across:
df %>%
  rowwise %>%
  filter(any(str_detect(c_across(c(text1, text2)), pat))) %>%
  ungroup
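On current dplyr (1.0.4 or later), if_any() gives the any_vars()-style semantics directly inside filter(), without rowwise(). A sketch using the same df and pat as above:

```r
library(dplyr)
library(stringr)

df <- data.frame(
  obs = 1:4,
  text0 = c("nothing to do with this column even it contains keywords",
            "FIFA text", "AFC text", "UEFA text"),
  text1 = c("here is some FIFA text", "this row dont have", "some UEFA text", "nothing"),
  text2 = c("nothing here", "I see AFC text", "Some samples", "End of text")
)
pat <- str_c("\\b(", str_c(c("FIFA", "UEFA", "AFC"), collapse = "|"), ")\\b")

# if_any() is TRUE when the predicate matches in at least one selected column
res <- df %>%
  filter(if_any(c(text1, text2), ~ str_detect(.x, pat)))
```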

Also you can try (base R):
#Keys
keys <- c('FIFA', 'UEFA', 'AFC')
keys <- paste0(keys, collapse = '|')
#Filter
df[grepl(pattern = keys, x = df$text1) | grepl(pattern = keys, x = df$text2), ]
Output:
obs text0 text1 text2
1 1 nothing to do with this column even it contains keywords here is some FIFA text nothing here
2 2 FIFA text this row dont have I see AFC text
3 3 AFC text some UEFA text Some samples

Another base R option:
pat <- sprintf("\\b(%s)\\b",paste(patvec, collapse = "|"))
subset(df, grepl(pat, do.call(paste, df[c("text1","text2")])))
obs text0 text1 text2
1 1 nothing to do with this column even it contains keywords here is some FIFA text nothing here
2 2 FIFA text this row dont have I see AFC text
3 3 AFC text some UEFA text Some samples
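If the set of columns to search grows beyond two, the pairwise grepl() calls can be generalized by OR-ing one logical vector per column with Reduce() — a base R sketch:

```r
df <- data.frame(
  obs = 1:4,
  text0 = c("nothing to do with this column even it contains keywords",
            "FIFA text", "AFC text", "UEFA text"),
  text1 = c("here is some FIFA text", "this row dont have", "some UEFA text", "nothing"),
  text2 = c("nothing here", "I see AFC text", "Some samples", "End of text")
)
pat <- sprintf("\\b(%s)\\b", paste(c("FIFA", "UEFA", "AFC"), collapse = "|"))
cols <- c("text1", "text2")

# one logical vector per column, OR-ed together row-wise
hits <- Reduce(`|`, lapply(df[cols], grepl, pattern = pat))
res <- df[hits, ]
```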

Related

In R, compare string variable of two dataframes to create new flag variable indicating match in both dataframes, using a for-loop?

I have two dataframes which I would like to compare. One of them contains a complete list of sentences as a string variable as well as manually assigned codes of 0 and 1 (i.e. data.1). The second dataframe contains a subset of the sentences of the first dataframe and is reduced to those sentences that were matched by a dictionary.
This is, in essence, what these two datasets look like:
data.1 = data.frame(texts = c("This is a sentence", "This is another sentence", "This is not a sentence", "Yet another sentence"),
code = c(1,1,0,1))
data.2 = data.frame(texts = c("This is not a sentence", "This is a sentence"),
code = c(1,1))
I would like to merge the results of the data.2 into data.1 and ideally create a new code_2 variable there that indicates whether a sentence was matched by the dictionary. This would yield something like this:
> data.1
texts code code_2
1 This is a sentence 1 1
2 This is another sentence 1 0
3 This is not a sentence 0 1
4 Yet another sentence 1 0
To make this slightly more difficult, and as you can see above, the sentences in data.2 are not just a subset of data.1 but they may also be in a different order (e.g. "This is not a sentence" is in the third row of the first dataframe but in the first row of the second dataframe).
I was thinking that looping through all of the texts of data.1 would do the trick, but I'm not sure how to implement this.
for (i in 1:nrow(data.1)) {
# For each i in data.1...
# compare sentence to ALL sentences in data.2...
# create a new variable called "code_2"...
# assign a 1 if a sentence occurs in both dataframes...
# and a 0 otherwise (i.e. if that sentence only occurs in `data.1` but not in `data.2`).
}
Note: My question is similar to this one, where the string variable "Letter" corresponds to my "texts", yet the problem is somewhat different, since the matching of sentences itself is the basis for the creation of a new flag variable in my case (which is not the case in said other question).
Can you just join the dataframes?
NOTE: Added replace_na to substitute the NAs with 0.
data.1 = data.frame(texts = c("This is a sentence", "This is another sentence", "This is not a sentence", "Yet another sentence"),
code = c(1,1,0,1))
data.2 = data.frame(texts = c("This is not a sentence", "This is a sentence"),
code = c(1,1))
data.1 %>%
  dplyr::left_join(data.2, by = 'texts', suffix = c("", "_2")) %>%
  dplyr::mutate(code_2 = tidyr::replace_na(code_2, 0))
I believe the following match based solution does what the question asks for.
i <- match(data.2$texts, data.1$texts)
data.1$code_2 <- 0L
data.1$code_2[i] <- data.2$code
data.1
data.1
# texts code code_2
#1 This is a sentence 1 1
#2 This is another sentence 1 0
#3 This is not a sentence 0 1
#4 Yet another sentence 1 0
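Since the flag only records whether a sentence occurs in both frames, it can also be computed directly with %in% (this ignores data.2$code and assumes presence alone defines the flag):

```r
data.1 <- data.frame(
  texts = c("This is a sentence", "This is another sentence",
            "This is not a sentence", "Yet another sentence"),
  code = c(1, 1, 0, 1)
)
data.2 <- data.frame(
  texts = c("This is not a sentence", "This is a sentence"),
  code = c(1, 1)
)

# 1 if the sentence also occurs in data.2, 0 otherwise
data.1$code_2 <- as.integer(data.1$texts %in% data.2$texts)
```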

Count the amount of bigrams in sentences in a dataframe

I have a dataset that looks a bit like this:
sentences <- c("sample text in sentence 1", "sample text in sentence 2")
id <- c(1,2)
df <- data.frame(sentences, id)
I would like to have a count where I can see the occurrence of certain bigrams. So lets say I have:
trigger_bg_1 <- "sample text"
I expect the output to be 2 (as there are two occurrences of "sample text" in the two sentences). I know how to do a word count like this:
trigger_word <- "sample"
trigger_word_count <- 0
for (i in 1:nrow(df)) {
  words <- unlist(strsplit(df$sentences[i], " "))
  for (w in words) {
    if (w == trigger_word) {
      trigger_word_count <- trigger_word_count + 1
    }
  }
}
But I can't get something working for a bigram. Any thoughts on how I should change the code to get it working? I have a long list of trigger words which I need to count over many sentences, so an efficient approach is preferred.
In case you want to count the sentences where you have a match you can use grep:
length(grep(trigger_bg_1, sentences, fixed = TRUE))
#[1] 2
In case you want to count how many times you find trigger_bg_1 you can use gregexpr:
sum(unlist(lapply(gregexpr(trigger_bg_1, sentences, fixed = TRUE),
                  function(x) sum(x > 0))))
#[1] 2
You could sum a grepl
sum(grepl(trigger_bg_1, df$sentences))
[1] 2
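If there is a long vector of trigger phrases to count, the gregexpr() approach can be wrapped in vapply(); the triggers below are made-up examples:

```r
sentences <- c("sample text in sentence 1", "sample text in sentence 2")
triggers <- c("sample text", "text in", "sentence 2")  # hypothetical trigger list

# total occurrences of each trigger across all sentences;
# gregexpr() returns one vector of match positions per sentence (-1 if no match)
counts <- vapply(triggers, function(p) {
  sum(vapply(gregexpr(p, sentences, fixed = TRUE),
             function(m) sum(m > 0), integer(1)))
}, integer(1))
```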
If you are really interested in bigrams rather than just set word combinations, the package quanteda can offer a more substantial and systematic way forward:
Data:
sentences <- c("sample text in sentence 1", "sample text in sentence 2")
id <- c(1,2)
df <- data.frame(sentences, id)
Solution:
library(quanteda)
# strip sentences down to words (removing punctuation):
words <- tokens(sentences, remove_punct = TRUE)
# make bigrams, tabulate them and sort them in decreasing order:
bigrams <- sort(table(unlist(as.character(
  tokens_ngrams(words, n = 2, concatenator = " ")
))), decreasing = TRUE)
Result:
bigrams
in sentence sample text text in sentence 1 sentence 2
2 2 2 1 1
If you want to inspect the frequency count of a specific bigram:
bigrams["in sentence"]
in sentence
2

Add quotation marks to every row of specific column

Having a dataframe in this format:
data.frame(id = c(4,2), text = c("my text here", "another text here"))
How is it possible to add triple quotation marks at the start and end of every value/row in the text column?
Expected printed output:
id text
4 """my text here"""
2 """another text here"""
Without any paste or cat call, you can simply run:
data.frame(id = c(4,2), text = c('"""my text here"""', '"""another text here"""'))
id text
1 4 """my text here"""
2 2 """another text here"""

Delimiting Text from Democratic Debate

I am trying to delimit the following data by first name, time stamp, and then the text. Currently, the entire dataset is listed in a single column of a data frame; this column is called text. Here is how it looks:
text
First Name: 00:03 Welcome Back text text text
First Name 2: 00:54 Text Text Text
First Name 3: 01:24 Text Text Text
This is what I did so far:
text$specificname = str_split_fixed(text$text, ":", 2)
and it created the following
text specific name
First Name: 00:03 Welcome Back text text text First Name
First Name 2: 00:54 Text Text Text First Name2
First Name 3: 01:24 Text Text Text First Name 3
How do I do the same for the timestamp and text? Is this the best way of doing it?
EDIT 1: This is how I brought in my data
#Specifying the url for desired website to be scraped
url = 'https://www.rev.com/blog/transcript-of-july-democratic-debate-night-1-full-transcript-july-30-2019'
#Reading the HTML code from the website
wp = read_html(url)
#assigning the class to an object
alltext = html_nodes(wp, 'p')
#turn data into text, then dataframe
alltext = html_text(alltext)
text = data.frame(alltext)
Assuming that text is in the form shown in the Note at the end, i.e. a character vector with one component per line, we can use read.table
read.table(text = sub("\\s+(\\d\\d:\\d\\d)\\s+", ",\\1,", text), sep = ",", as.is = TRUE)
giving this data.frame:
V1 V2 V3
1 First Name: 00:03 Welcome Back text text text
2 First Name 2: 00:54 Text Text Text
3 First Name 3: 01:24 Text Text Text
Note
Lines <- "First Name: 00:03 Welcome Back text text text
First Name 2: 00:54 Text Text Text
First Name 3: 01:24 Text Text Text"
text <- readLines(textConnection(Lines))
Update
Regarding the EDIT that was added to the question: define a regular expression pat which matches optional whitespace, two digits, a colon, two digits, and optional trailing whitespace. Then grep out all lines that match it, giving tt, and in each line replace the first match with the timestamp wrapped in # characters, giving g. Finally, read it in using # as the field separator, giving DF.
pat <- "\\s*(\\d\\d:\\d\\d)\\s*"
tt <- grep(pat, text$alltext, value = TRUE)
g <- sub(pat, "#\\1#", tt)
DF <- read.table(text = g, sep = "#", quote = "", as.is = TRUE)
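Applied to the three sample lines from the Note, the same pattern yields a three-column frame; the column names speaker, time and text below are assumed, not from the question:

```r
Lines <- "First Name: 00:03 Welcome Back text text text
First Name 2: 00:54 Text Text Text
First Name 3: 01:24 Text Text Text"
text <- readLines(textConnection(Lines))

pat <- "\\s*(\\d\\d:\\d\\d)\\s*"
g <- sub(pat, "#\\1#", text)
DF <- read.table(text = g, sep = "#", quote = "", as.is = TRUE)
names(DF) <- c("speaker", "time", "text")  # assumed column names
DF$speaker <- sub(":$", "", DF$speaker)    # drop the trailing colon
```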

Classification based on list of words R

I have a data set with article titles and abstracts that I want to classify based on matching words.
"This is an example of text that I want to classify based on the words that are matched from a list. This would be about 2 - 3 sentences long. word4, word5, text, text, text"
Topic 1 Topic 2 Topic (X)
word1 word4 word(a)
word2 word5 word(b)
word3 word6 word(c)
Given that the text above matches words in Topic 2, I want to assign a new column with this label. Preferred if this could be done with "tidyverse" packages.
Given the sentence as a string and the topics in a data frame you can do something like this
input<- c("This is an example of text that I want to classify based on the words that are matched from a list. This would be about 2 - 3 sentences long. word4, word5, text, text, text")
df <- data.frame(Topic1 = c("word1", "word2", "word3"),Topic2 = c("word4", "word5", "word6"))
## This splits on space and punctuation (only , and .)
input <- unlist(strsplit(input, " |,|\\."))
newcol <- paste(names(df)[apply(df, 2, function(x) sum(input %in% x) > 0)], collapse = ", ")
Since I am unsure of the data frame you want to add this to, I have made a vector newcol.
If you had a data frame of long sentences then you can use a similar approach.
inputdf<- data.frame(title=c("This is an example of text that I want to classify based on the words that are matched from a list. This would be about 2 - 3 sentences long. word4, word5, text, text, text", "word2", "word3, word4"))
input <- strsplit(as.character(inputdf$title), " |,|\\.")
inputdf$newcolmn <-unlist(lapply(input, function(x) paste(names(df)[apply(df,2, function(y) sum(x %in% y)>0)], collapse = ", ")))
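Since the question prefers tidyverse packages, roughly the same logic can be expressed with stringr by building one word-boundary regex per topic column — a sketch:

```r
library(stringr)

df <- data.frame(Topic1 = c("word1", "word2", "word3"),
                 Topic2 = c("word4", "word5", "word6"))
input <- "This is an example of text that I want to classify based on the words that are matched from a list. This would be about 2 - 3 sentences long. word4, word5, text, text, text"

# one regex per topic column, e.g. "\\b(word4|word5|word6)\\b"
topic_pats <- sapply(df, function(w) str_c("\\b(", str_c(w, collapse = "|"), ")\\b"))

# which topic patterns match the input string?
matched <- paste(names(topic_pats)[str_detect(input, topic_pats)], collapse = ", ")
```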
