Classification based on a list of words in R

I have a data set with article titles and abstracts that I want to classify based on matching words.
"This is an example of text that I want to classify based on the words that are matched from a list. This would be about 2 - 3 sentences long. word4, word5, text, text, text"
Topic 1    Topic 2    Topic (X)
word1      word4      word(a)
word2      word5      word(b)
word3      word6      word(c)
Given that the text above matches words in Topic 2, I want to assign a new column with this label. Preferably this would be done with tidyverse packages.

Given the sentence as a string and the topics in a data frame, you can do something like this:
input<- c("This is an example of text that I want to classify based on the words that are matched from a list. This would be about 2 - 3 sentences long. word4, word5, text, text, text")
df <- data.frame(Topic1 = c("word1", "word2", "word3"),Topic2 = c("word4", "word5", "word6"))
## This splits on space and punctation (only , and .)
input<-unlist(strsplit(input, " |,|\\."))
newcol <- paste(names(df)[apply(df,2, function(x) sum(input %in% x) > 0)], collapse=", ")
Since I am unsure which data frame you want to add this to, I have made a vector, newcol. If you have a data frame of long sentences, you can use a similar approach:
inputdf <- data.frame(title = c("This is an example of text that I want to classify based on the words that are matched from a list. This would be about 2 - 3 sentences long. word4, word5, text, text, text",
                                "word2",
                                "word3, word4"))
input <- strsplit(as.character(inputdf$title), " |,|\\.")
inputdf$newcolmn <- unlist(lapply(input, function(x) paste(names(df)[apply(df, 2, function(y) sum(x %in% y) > 0)], collapse = ", ")))
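Since the question asked for tidyverse packages, here is a rough equivalent as a sketch (assuming the df and inputdf objects from above, and R >= 4.0 so the columns are character rather than factor): reshape the topic table to long form, split the titles into words, join, and collapse the matched topic names.
library(tidyverse)
## one row per (topic, word) pair
topic_words <- df %>%
  pivot_longer(everything(), names_to = "topic", values_to = "word")
## split each title into words, join against the lookup, and collapse
## the matching topic names back into one label per title
## (titles with no match get an empty string)
inputdf %>%
  mutate(id = row_number(),
         word = str_split(as.character(title), " |,|\\.")) %>%
  unnest(cols = word) %>%
  left_join(topic_words, by = "word") %>%
  group_by(id, title) %>%
  summarise(newcolmn = paste(unique(na.omit(topic)), collapse = ", "),
            .groups = "drop")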

Related

Extract sentences from texts in data frame

I have a data frame with a column "text", and each row of "text" contains several sentences (maybe only two, maybe 100 or more). I would like to analyze the text in every row of my data frame for specific keywords. If a keyword is found in the text of a row, I would like to extract the sentences that contain keywords into a separate column, e.g.:
needles = c("first", "hope", "analyze", "happy")
mydata <- data.frame(
text = c("This is the first sentence. It is the beginning of this project",
"My second sentence is this. I hope this project will work fine. Then I will analyze the third sentence.",
"And this is the last sentence. Finally my work ends. I am really happy about that.",
"These sentences do not contain any relevant information. There is no keyword. And it is not relevant."),
findings = c("This is the first sentence.",
"I hope this project will work fine. Then I will analyze the third sentence.",
"I am really happy about that.",
NA)
)
So column "text" contains the sentences I want to check for keywords, "findings" is the result I would like to have in the end.
Can anyone help me how to apply the solution for all rows of the data frame?
Thank you!
We can work with list columns: split each row into separate sentences and look for the needles inside each resulting sentence. The reduce steps then collapse the nested list levels back down.
Code:
library(tidyverse)
needles <- c("first", "hope", "analyze", "happy")
mydata <- data.frame(
  text = c(
    "This is the first sentence. It is the beginning of this project",
    "My second sentence is this. I hope this project will work fine. Then I will analyze the third sentence.",
    "And this is the last sentence. Finally my work ends. I am really happy about that.",
    "These sentences do not contain any relevant information. There is no keyword. And it is not relevant."
  ),
  findings = c(
    "This is the first sentence.",
    "I hope this project will work fine. Then I will analyze the third sentence.",
    "I am really happy about that.",
    NA
  )
)
df <- as_tibble(mydata) %>%
  mutate(findings = str_split(text, "\\.\\s") %>%
           map(~str_subset(., rebus::or1(needles))) %>%
           map_if(~length(.) > 1, ~reduce(., ~paste(.x, .y, sep = '. '))),
         findings = map_if(findings, ~length(.) == 0, ~NA) %>% reduce(c))
df
#> # A tibble: 4 × 2
#> text findings
#> <chr> <chr>
#> 1 This is the first sentence. It is the … This is the first sentence
#> 2 My second sentence is this. I hope thi… I hope this project will work fine. T…
#> 3 And this is the last sentence. Finally… I am really happy about that.
#> 4 These sentences do not contain any rel… <NA>
Created on 2021-11-27 by the reprex package (v2.0.1)
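If the map_if/reduce steps look opaque: reduce is what glues a multi-sentence match back into a single string, and the final reduce(c) flattens the list column into a plain character vector. A minimal illustration of the first one:
reduce(c("I hope this project will work fine",
         "Then I will analyze the third sentence."),
       ~paste(.x, .y, sep = '. '))
#> [1] "I hope this project will work fine. Then I will analyze the third sentence."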
What about something like this:
library(dplyr)
library(tidyr)
library(stringr)

find_sentence <- function(text, word){
  # split on a period plus the following character (usually the space)
  x <- c(str_split(text, "\\..", simplify = TRUE))
  inds <- which(str_detect(x, word))
  if(length(inds) > 0){
    list(x[inds])
  } else {
    list(NA)
  }
}
mydata %>%
  rowwise() %>%
  mutate(res = find_sentence(text, "the")) %>%
  unnest(res)
# # A tibble: 4 × 3
# text findings res
# <chr> <chr> <chr>
# 1 This is the first sentence. It is the beginning of this project This is the first sentence. This is the fi…
# 2 This is the first sentence. It is the beginning of this project This is the first sentence. It is the begi…
# 3 My second sentence is this. I hope this project will work fine. Then I will analyze the third sentence. I hope this project will wo… Then I will an…
# 4 And this is the last sentence. Finally my work ends. I am really happy about that. I am really happy about tha… And this is th…
This returns a new variable called res with a separate row for each sentence in which the keyword occurs. So, if two sentences contain the word (as in the first entry of text), the text and findings columns are replicated for each of the relevant sentences in res.
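If you would rather keep one row per text and collapse multiple matched sentences instead, a small variant of the same idea (note that the NA case comes back as the string "NA", hence the na_if() cleanup):
mydata %>%
  rowwise() %>%
  mutate(res = paste(unlist(find_sentence(text, "the")), collapse = ". ")) %>%
  ungroup() %>%
  mutate(res = na_if(res, "NA"))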
With Base R,
lookup <- strsplit(as.character(mydata[, 1]), "\\.")
out <- lapply(lookup, function(x) {
  logic <- grepl(paste0(needles, collapse = "|"), x)
  paste0(x[logic], collapse = ".")
})
data.frame(findings = do.call(rbind, out))
gives,
# findings
#1 This is the first sentence
#2 I hope this project will work fine. Then I will analyze the third sentence
#3 I am really happy about that
#4
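The desired output has NA in the fourth row, whereas this leaves an empty string; if that matters, a small tweak (res is just a scratch name here) converts the empties:
res <- do.call(rbind, out)
res[res == ""] <- NA
data.frame(findings = res)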
This uses grep and strsplit to get the matches (indexing the text column directly, since transposing the whole data frame would also sweep in the findings column):
mydata$findings <- sapply(strsplit(as.character(mydata$text), "\\. "), function(x)
  x[unlist(lapply(needles, function(y) grep(y, x)))])
text
1 This is the first sentence. It is the beginning of this project
2 My second sentence is this. I hope this project will work fine. Then I will analyze the third sentence.
3 And this is the last sentence. Finally my work ends. I am really happy about that.
4 These sentences do not contain any relevant information. There is no keyword. And it is not relevant.
findings
1 This is the first sentence
2 I hope this project will work fine, Then I will analyze the third sentence.
3 I am really happy about that.
4

Count the number of bigrams in sentences in a data frame

I have a dataset that looks a bit like this:
sentences <- c("sample text in sentence 1", "sample text in sentence 2")
id <- c(1,2)
df <- data.frame(sentences, id)
I would like to have a count where I can see the occurrence of certain bigrams. So let's say I have:
trigger_bg_1 <- "sample text"
I expect an output of 2 (as there are two occurrences of "sample text" in the two sentences). I know how to do a word count like this:
trigger_word <- "sample"   # the single word being counted
trigger_word_count <- 0
for(i in 1:nrow(df)){
  words <- unlist(strsplit(as.character(df$sentences[i]), " "))
  for(w in words){
    if(w == trigger_word){
      trigger_word_count <- trigger_word_count + 1
    }
  }
}
But I can't get something similar working for a bigram. Any thoughts on how I should change the code to get it working? I also have a long list of trigger words that I need to count across many sentences, so the approach should scale.
In case you want to count the sentences in which there is a match, you can use grep:
length(grep(trigger_bg_1, sentences, fixed = TRUE))
#[1] 2
In case you want to count how many times trigger_bg_1 occurs, you can use gregexpr:
sum(unlist(lapply(gregexpr(trigger_bg_1, sentences, fixed = TRUE),
                  function(x) sum(x > 0))))
#[1] 2
You could sum a grepl
sum(grepl(trigger_bg_1, df$sentences))
[1] 2
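A stringr equivalent, as a sketch: str_count() counts every occurrence rather than just the sentences that match, so it behaves like the gregexpr version above.
library(stringr)
sum(str_count(sentences, fixed(trigger_bg_1)))
#[1] 2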
If you are really interested in bigrams rather than just fixed word combinations, the quanteda package can offer a more substantial and systematic way forward:
Data:
sentences <- c("sample text in sentence 1", "sample text in sentence 2")
id <- c(1,2)
df <- data.frame(sentences, id)
Solution:
library(quanteda)
# strip sentences down to words (removing punctuation):
words <- tokens(sentences, remove_punct = TRUE)
# make bigrams, tabulate them, and sort them in decreasing order:
bigrams <- sort(table(unlist(as.character(tokens_ngrams(words, n = 2, concatenator = " ")))),
                decreasing = TRUE)
Result:
bigrams
in sentence  sample text      text in   sentence 1   sentence 2
          2            2            2            1            1
If you want to inspect the frequency count of a specific bigram:
bigrams["in sentence"]
in sentence
2
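And if you have a whole vector of trigger bigrams, you can index the table with all of them at once; triggers here is a hypothetical list, and bigrams that never occur come back as NA:
triggers <- c("sample text", "in sentence", "no match")
bigrams[triggers]
sample text in sentence        <NA>
          2           2          NA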

grepl for finding words

I am trying to find the Spanish words in a set of words using R. I have all the Spanish words from an Excel file that I don't know how to attach to the post (it has more than 80,000 words), and I am trying to check whether certain words appear in it.
For example:
words = c("Silla", "Sillas", "Perro", "asdfg")
I tried to use this solution:
grepl(paste(spanish_words, collapse = "|"), words)
But there are too many Spanish words, and it gives me an error.
So... how can I do it? I also tried this:
toupper(words) %in% toupper(spanish_words)
As you can see, this option only gives TRUE for exact matches, and I need "Sillas" to appear as TRUE as well (it is the plural of "silla"). That was the reason I tried grepl first, to catch plurals too.
Any idea?
As a df:
library(tidyverse)
df <- tibble(text = c("some words",
                      "more words",
                      "Perro",
                      "And asdfg",
                      "Comb perro and asdfg"))
Vector of words:
words <- c("Silla", "Sillas", "Perro", "asdfg")
words <- tolower(paste(words, collapse = "|"))
Then use mutate and str_detect:
df %>%
  mutate(
    text = tolower(text),
    spanish_word = str_detect(text, words)
  )
Returns:
text spanish_word
<chr> <lgl>
1 some words FALSE
2 more words FALSE
3 perro TRUE
4 and asdfg TRUE
5 comb perro and asdfg TRUE
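Note that str_detect() with a plain alternation matches substrings, which is exactly what the asker wants for plurals ("sillas" contains "silla"). If you ever need whole-word matches instead, a sketch with word boundaries (reusing the collapsed words pattern from above):
words_exact <- paste0("\\b(", words, ")\\b")
str_detect("comb perro and asdfg", words_exact)
#> [1] TRUE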

Extract text according to delimiters but miss out missing entries

I have some text as follows:
inputString <- "Patient Name:MRS Comfor Atest Date of Birth:23/02/1981 Hospital Number:000000 Date of Procedure:01/01/2010 Endoscopist:Dr. Sebastian Zeki: Nurses:Anthony Nurse , Medications:Medication A 50 mcg, Another drug 2.5 mg Instrument:D111 Extent of Exam:second part of duodenum Visualization:Good Tolerance: Good Complications: None Co-morbidity:None INDICATIONS FOR EXAMINATION Illness Stomach pain. PROCEDURE PERFORMED Gastroscopy (OGD) FINDINGS Things found and biopsied DIAGNOSIS Biopsy of various RECOMMENDATIONS Chase for histology. FOLLOW UP Return Home"
I want to extract parts of the text into their own columns according to some text boundaries I have set:
myWords<-c("Patient Name","Date of Birth","Hospital Number","Date of Procedure","Endoscopist","Second Endoscopist","Trainee","Referring Physician","Nurses"."Medications")
Not all of the delimiter words are in the text (but they are always in the same order).
I have a function that should separate them out (with the column title as the start of the word boundary):
delim <- myWords
inputStringdf <- data.frame(inputString, stringsAsFactors = FALSE)
inputStringdf <- inputStringdf %>%
  tidyr::separate(inputString, into = c("added_name", delim),
                  sep = paste(delim, collapse = "|"),
                  extra = "drop", fill = "right")
However, when there is no text between two delimiters, or when a delimiter does not exist, rather than placing NA in the column it just fills the column with the next text found between two delimiters. How can I make sure that the correct columns are filled with the correct text as defined by the delimiters?
Using the input shown in the Note at the end, transform it into DCF format and then read it in using read.dcf, which converts the input lines into a character matrix m. See ?read.dcf for more info. No packages are used.
pat <- sprintf("(%s)", paste(myWords, collapse = "|"))
g <- gsub(pat, "\n\\1", paste0(Lines, "\n"))
m <- read.dcf(textConnection(g))
Here are the first three columns:
m[, 1:3]
## Patient Name Date of Birth Hospital Number
## [1,] "MRS Comfor Atest" "23/02/1981" "000000"
## [2,] "MRS Comfor Atest" NA "000000"
Note
The input is assumed to have one record per patient, like this example, which has two records. We have simply repeated the first patient to synthesize an input data set, except that the Date of Birth is omitted in the second record.
Lines <- c(inputString, sub("Date of Birth:23/02/1981 ", "", inputString))
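An alternative sketch with stringr, if you prefer to stay in the tidyverse: extract each field with a regex that captures lazily up to the next known delimiter (or the end of the string), returning NA when a field is absent. extract_field is a hypothetical helper built on the myWords and Lines objects above.
library(stringr)
stop_at <- paste(myWords, collapse = "|")
extract_field <- function(x, field) {
  ## the first capture group (column 2 of str_match) is the text between
  ## this field's label and the next delimiter (or end of string)
  str_match(x, paste0(field, ":?\\s*(.*?)\\s*(?:", stop_at, "|$)"))[, 2]
}
extract_field(Lines, "Date of Birth")
#> [1] "23/02/1981" NA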

Parse a character string and later re-assemble it

I am trying to parse a character string into its parts, check whether each part exists in a separate vocabulary, and later re-assemble only those strings whose parts are all in the vocabulary. The vocabulary is a vector of words and is created separately from the strings I want to compare. The final goal is to create a data frame containing only those strings whose word parts are in the vocabulary.
I have written a piece of code to parse the data into strings but cannot figure out how to make the comparison. If you believe that parsing out the data is not the optimal solution, please let me know.
Here is an example:
Assume that I have three character strings:
"The elephant in the room is blue",
"The dog cannot swim",
"The cat is blue"
and my vocabulary consists of the words:
cat, **the**, **elephant**, hippo,
**in**, run, **is**, bike,
walk, **room, is, blue, cannot**
In this case I will pick only the first and third strings, because all of their word parts match words in my vocabulary. I will not select the second string, because the words "dog" and "swim" are not in the vocabulary.
Thank you!
Per request, attached is the code I have written so far to clean the strings and parse them into unique words:
animals <- c("The elephant in the room is blue", "The dog cannot swim", "The cat is blue")
animals2 <- toupper(animals)
animals2 <- gsub("[[:punct:]]", " ", animals2)
animals2 <- gsub("(^ +)|( +$)|( +)", " ", animals2)
## Parse the characters and select unique words only
animals2 <- unlist(strsplit(animals2," "))
animals2 <- unique(animals2)
Here is how I would do it:
Read the data.
Clean vocab to remove extra spaces and *.
Loop over the strings, using setdiff.
My code is:
## read your data
tt <- c("The elephant in the room is blue",
        "The dog cannot swim",
        "The cat is blue")
vocab <- scan(textConnection('cat, **the**, **elephant**, hippo,
**in**, run, **is**, bike,
walk, **room, is, blue, cannot**'), sep = ',', what = 'char')
## polish vocab: drop whitespace and asterisks, then empty entries
vocab <- gsub('\\s+|[*]+', '', vocab)
vocab <- vocab[nchar(vocab) > 0]
## TRUE when every word of the string is in the vocabulary
sapply(tt, function(x){
  x.words <- tolower(unlist(strsplit(x, ' ')))  ## lower case so "The" == "the"
  length(setdiff(x.words, vocab)) == 0
})
The elephant in the room is blue The dog cannot swim The cat is blue
TRUE FALSE TRUE
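The same check can also be written with %in% and all(), which may read more directly: every word of the string must appear in the vocabulary.
sapply(tt, function(x) all(tolower(unlist(strsplit(x, ' '))) %in% vocab))
This returns the same named TRUE/FALSE vector as above.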
