R: How to use grep() to find specific words?

I have a long data frame of words. I want to use several specific words to find every part-of-speech variant of each one.
For example:
df <- data.frame(word = c("clean", "grinding liquid cmp", "cleaning",
"cleaning composition", "supplying", "supply",
"supplying cmp abrasive", "chemical mechanical"))
words
1 clean
2 grinding liquid cmp
3 cleaning
4 cleaning composition
5 supplying
6 supply
7 supplying cmp abrasive
8 chemical mechanical
I want to extract the single words built on "clean" and "supply", whatever their part of speech. I have tried using grepl() to do this:
specific_word <- c("clean", "supply")
grep_onto <- df[grepl(paste(specific_word, collapse = "|"), df$word), ] %>%
data.frame(word = ., row.names = NULL) %>%
unique()
But the result is not what I want:
word
1 cleans
2 grinding liquid cmp
3 cleaning
4 cleaning composition
5 supplying
6 supply
7 supplying cmp abrasive
8 chemical mechanical
I prefer to get
words
1 clean
2 cleaning
3 supplying
4 supply
I know a regular expression can probably solve my problem, but I don't know how to write it. Can anyone give me some advice?

There are various ways to do this, but generally if you want it to be a single word and you're using regex, you need to specify the beginning ^ and end $ of the line so as to limit what can come before or after your pattern. You seem to want it to be able to expand with more letters, so add in \\w* to allow it:
df <- data.frame(word = c("clean", "grinding liquid cmp", "cleaning",
"cleaning composition", "supplying", "supply",
"supplying cmp abrasive", "chemical mechanical"))
specific_word <- c("clean", "supply")
pattern <- paste0('^\\w*', specific_word, '\\w*$', collapse = '|')
pattern
#> [1] "^\\w*clean\\w*$|^\\w*supply\\w*$"
df[grep(pattern, df$word), , drop = FALSE] # drop = FALSE to stop simplification to vector
#> word
#> 1 clean
#> 3 cleaning
#> 5 supplying
#> 6 supply
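One thing to watch with the pattern above: the leading \\w* also allows letters before the word, so a term such as "unclean" would match too. If only suffixes should be allowed, a minimal sketch (assuming that is the intent; the "unclean" row is a made-up addition, not part of the original data) anchors each word at the start of the term:

```r
# Sample terms from the question, plus a hypothetical "unclean" row
words <- c("clean", "grinding liquid cmp", "cleaning", "cleaning composition",
           "supplying", "supply", "supplying cmp abrasive",
           "chemical mechanical", "unclean")
specific_word <- c("clean", "supply")

# ^(clean|supply)\w*$ : the term must start with one of the words and
# may only continue with more word characters (so no spaces)
prefix_pattern <- paste0("^(", paste(specific_word, collapse = "|"), ")\\w*$")

prefix_matches <- words[grepl(prefix_pattern, words)]
prefix_matches
#> [1] "clean"     "cleaning"  "supplying" "supply"
```

Whether prefixes should be allowed depends on the vocabulary, so treat this as one possible tightening rather than a replacement.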
Another interpretation of what you're looking for is to split each term into individual words, and search any of those for a match. tidyr::separate_rows can be used for such a split, which you can then filter with grepl:
library(tidyverse)
df <- tibble(word = c("clean", "grinding liquid cmp", "cleaning",
"cleaning composition", "supplying", "supply",
"supplying cmp abrasive", "chemical mechanical"))
specific_word <- c("clean", "supply")
df %>% separate_rows(word) %>%
filter(grepl(paste(specific_word, collapse = '|'), word)) %>%
distinct()
#> # A tibble: 4 x 1
#> word
#> <chr>
#> 1 clean
#> 2 cleaning
#> 3 supplying
#> 4 supply
For more robust word tokenization, try tidytext::unnest_tokens or another actual word tokenizer.
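If pulling in the tidyverse is not an option, the same split-then-filter idea can be sketched in base R. This is only an illustration that assumes the terms are separated by plain whitespace; the variable names are chosen here and are not from the answer:

```r
words <- c("clean", "grinding liquid cmp", "cleaning", "cleaning composition",
           "supplying", "supply", "supplying cmp abrasive",
           "chemical mechanical")
specific_word <- c("clean", "supply")

# Split every term on whitespace and keep each distinct token once
tokens <- unique(unlist(strsplit(words, "\\s+")))

# Keep the tokens that contain any of the specific words
token_matches <- tokens[grepl(paste(specific_word, collapse = "|"), tokens)]
token_matches
#> [1] "clean"     "cleaning"  "supplying" "supply"
```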

Related

How many elements in common on multiple lists?

Hi, I'm looking at a dataset that has a column named "genres" holding string vectors with all the genre tags of each film. I want to create a plot that shows the popularity of each genre.
structure(list(anime_id = c("10152", "11061", "11266", "11757",
"11771"), Name.x = c("Kimi ni Todoke 2nd Season: Kataomoi",
"Hunter x Hunter (2011)", "Ao no Exorcist: Kuro no Iede",
"Sword Art Online", "Kuroko no Basket"
), genres = list("Romance", c("Action", " Adventure", " Fantasy"
), "Fantasy", c("Action", " Adventure", " Fantasy", " Romance"
), "Sports")), row.names = c(NA, 5L), class = "data.frame")
Initially the genres column is a string with genres separated by commas, for example: ['action', 'drama', 'fantasy']. To work with it, I ran this code to edit the column:
AnimeList2022new$genres <- gsub("\\[|\\]|'" , "",
as.character(AnimeList2022new$genres))
AnimeList2022new$genres <- strsplit( AnimeList2022new$genres,
",")
I don't know how to compare all the vectors in order to find out how many times each tag appears.
I'm trying with group_by and summarise:
genresdata <-MyAnimeList %>%
group_by(genres) %>%
summarise( count = n() ) %>%
arrange( -count)
but obviously this code groups identical vectors, not the individual strings contained in the vectors.
Your genres column is of class list, so it sounds like you want the length() of each row in it. Generally, we could do that like this:
MyAnimeList %>%
mutate(n_genres = sapply(genres, length))
But this is a special case where there is a nice convenience function lengths() (notice the s at the end) built-in to R that gives us the same result, so we can simply do
MyAnimeList %>%
mutate(n_genres = lengths(genres))
The above will give the number of genres for each row.
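To see that the two forms agree, here is a small base R check on a list shaped like the genres column from the question (the list is retyped here for illustration):

```r
# A list column like `genres`: one character vector per row
genres <- list("Romance",
               c("Action", " Adventure", " Fantasy"),
               "Fantasy",
               c("Action", " Adventure", " Fantasy", " Romance"),
               "Sports")

n_sapply  <- sapply(genres, length)  # general approach
n_lengths <- lengths(genres)         # convenience function, same result

n_lengths
#> [1] 1 3 1 4 1
identical(n_sapply, n_lengths)
#> [1] TRUE
```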
In the comments I see you say you want "for example how many times "Action" appears in the whole column". For that, we can unnest() the genre list column and then count:
library(tidyr)
MyAnimeList %>%
unnest(genres) %>%
count(genres)
# # A tibble: 7 × 2
# genres n
# <chr> <int>
# 1 " Adventure" 2
# 2 " Fantasy" 2
# 3 " Romance" 1
# 4 "Action" 2
# 5 "Fantasy" 1
# 6 "Romance" 1
# 7 "Sports" 1
Do notice that some of your genres have leading whitespace. It's probably best to solve this problem "upstream", wherever the genre column was created, but we could do it now using trimws to trim whitespace:
MyAnimeList %>%
unnest(genres) %>%
count(trimws(genres))
# # A tibble: 5 × 2
# `trimws(genres)` n
# <chr> <int>
# 1 Action 2
# 2 Adventure 2
# 3 Fantasy 3
# 4 Romance 2
# 5 Sports 1
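As a rough sketch of the "upstream" fix, the split itself can absorb the whitespace by splitting on a comma followed by optional spaces. The raw_genres strings below are an assumption, reconstructed from the format described in the question; base R's table() then does the counting:

```r
# Assumed raw format of the column before cleaning (reconstructed)
raw_genres <- c("['Romance']",
                "['Action', 'Adventure', 'Fantasy']",
                "['Fantasy']",
                "['Action', 'Adventure', 'Fantasy', 'Romance']",
                "['Sports']")

# Same bracket/quote removal as in the question
cleaned <- gsub("\\[|\\]|'", "", raw_genres)

# Split on a comma plus any following whitespace, so no tag keeps a leading space
genres <- strsplit(cleaned, ",\\s*")

# Count how often each tag occurs across all rows
genre_counts <- table(unlist(genres))
genre_counts[["Fantasy"]]
#> [1] 3
```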

How to join tokenized words back together in a column in R dataframe

I have a dataframe with previously tokenized words that look like below. Replication code:
df <- data.frame (id = c("1", "2","3"),
text = c("['I', 'like', 'apple']", "['we', 'go', 'swimming']", "['ask', 'questions']")
)
id text
1 ["I", "like", "apple"]
2 ["we", "go", "swimming"]
3 ["ask", "questions"]
The original data frame was obtained in Python after preprocessing (including tokenizing) raw text data.
I'd like to merge these tokens back into a sentence so it would look like below
id text
1 I like apple
2 we go swimming
3 ask questions
I tried using the paste() function df$text_new<-paste(df$text, sep = " "), but it failed to work, still returning the same result.
You can separate() then unite() them with tidyr. You will have to provide a character vector long enough for each word in the longest sentence with into = -- I used letters to get 26 -- and then refer to the first and last (a:z).
library(tidyr)
df <- data.frame (id = c("1", "2","3"),
text = c("['I', 'like', 'apple']", "['we', 'go', 'swimming']", "['ask', 'questions']"))
df %>%
separate(text, into = letters, fill = "right") %>%
unite(text, a:z, sep = " ", na.rm = TRUE)
#> id text
#> 1 1 I like apple
#> 2 2 we go swimming
#> 3 3 ask questions
Created on 2022-05-26 by the reprex package (v2.0.1)
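A base R alternative, offered as a sketch rather than as part of the answer above, is to delete the list punctuation directly with gsub(). It assumes the tokens themselves never contain commas, apostrophes or square brackets, since those characters are removed everywhere:

```r
df <- data.frame(id = c("1", "2", "3"),
                 text = c("['I', 'like', 'apple']",
                          "['we', 'go', 'swimming']",
                          "['ask', 'questions']"),
                 stringsAsFactors = FALSE)

# Remove brackets, single quotes and commas; the spaces that followed the
# commas remain and act as the word separators
df$text_new <- gsub("\\[|\\]|'|,", "", df$text)
df$text_new
#> [1] "I like apple"   "we go swimming" "ask questions"
```

A token such as "don't" would lose its apostrophe here, so this only fits data as clean as the example.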

how to get a sentiment score (and keep the sentiment words) in quanteda?

Consider this simple example
library(tibble)
library(quanteda)
tibble(mytext = c('this is a good movie',
'oh man this is really bad',
'quanteda is great!'))
# A tibble: 3 x 1
mytext
<chr>
1 this is a good movie
2 oh man this is really bad
3 quanteda is great!
I would like to perform some basic sentiment analysis, but with a twist. Here is my dictionary, stored into a regular tibble
mydictionary <- tibble(sentiment = c('positive', 'positive','negative'),
word = c('good', 'great', 'bad'))
# A tibble: 3 x 2
sentiment word
<chr> <chr>
1 positive good
2 positive great
3 negative bad
Essentially, I would like to count how many positive and negative words are detected in each sentence, but also keep track of the matching words. In other words, the output should look like
mytext nb.pos nb.neg pos.words
1 this is a good and great movie 2 0 good, great
2 oh man this is really bad 0 1 bad
3 quanteda is great! 1 0 great
How can I do that in quanteda? Is this possible?
Thanks!
Stay tuned for quanteda v. 2.1 in which we will have greatly expanded, dedicated functions for sentiment analysis. In the meantime, see below. Note that I made some adjustments since there is a discrepancy in what you report as the text and your input text, also you have all sentiment words in pos.words, not just positive words. Below, I compute both positive and all sentiment matches.
# note the amended input text
mytext <- c(
"this is a good and great movie",
"oh man this is really bad",
"quanteda is great!"
)
mydictionary <- tibble::tibble(
sentiment = c("positive", "positive", "negative"),
word = c("good", "great", "bad")
)
library("quanteda", warn.conflicts = FALSE)
## Package version: 2.0.9000
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
# make the dictionary into a quanteda dictionary
qdict <- as.dictionary(mydictionary)
Now we can use the lookup functions to get to your final data.frame.
# get the sentiment scores
toks <- tokens(mytext)
df <- toks %>%
tokens_lookup(dictionary = qdict) %>%
dfm() %>%
convert(to = "data.frame")
names(df)[2:3] <- c("nb.neg", "nb.pos")
# get the matches for pos and all words
poswords <- tokens_keep(toks, qdict["positive"])
allwords <- tokens_keep(toks, qdict)
data.frame(
mytext = mytext,
df[, 2:3],
pos.words = sapply(poswords, paste, collapse = ", "),
all.words = sapply(allwords, paste, collapse = ", "),
row.names = NULL
)
## mytext nb.neg nb.pos pos.words all.words
## 1 this is a good and great movie 0 2 good, great good, great
## 2 oh man this is really bad 1 0 bad
## 3 quanteda is great! 0 1 great great
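For comparison, the same bookkeeping can be sketched without quanteda. This swaps in a deliberately naive base R tokenizer (lowercase, split on anything that is not a letter), so it is an illustration of the counting logic, not a substitute for quanteda's tokens(); count_hits is a helper name made up for this sketch:

```r
mytext <- c("this is a good and great movie",
            "oh man this is really bad",
            "quanteda is great!")
dict <- data.frame(sentiment = c("positive", "positive", "negative"),
                   word = c("good", "great", "bad"),
                   stringsAsFactors = FALSE)

# Naive tokenizer: lowercase, then split on runs of non-letters
toks <- strsplit(tolower(mytext), "[^a-z]+")

pos_words <- dict$word[dict$sentiment == "positive"]
neg_words <- dict$word[dict$sentiment == "negative"]

# Hypothetical helper: number of tokens found in a word list
count_hits <- function(tok, words) sum(tok %in% words)

res <- data.frame(
  mytext = mytext,
  nb.pos = sapply(toks, count_hits, words = pos_words),
  nb.neg = sapply(toks, count_hits, words = neg_words),
  all.words = sapply(toks, function(tok) paste(tok[tok %in% dict$word], collapse = ", ")),
  stringsAsFactors = FALSE
)
res$all.words
#> [1] "good, great" "bad"         "great"
```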

Select phrases found in dictionary and return dataframe of doc_id and phrase

I have a dictionary file of medical phrases and a corpus of raw texts. I'm trying to use the dictionary file to select the relevant phrases from the text. Phrases, in this case, are 1 to 5-word n-grams. In the end, I would like the selected phrases in a dataframe with two columns: doc_id, phrase
I've been trying to use the quanteda package to do this but haven't been successful. Below is some code to reproduce my latest attempt. I'd appreciate any advice you have...I've tried a variety of methods but keep getting back only single-word matches.
version R version 3.6.2 (2019-12-12)
os Windows 10 x64
system x86_64, mingw32
ui RStudio
Packages:
dbplyr 1.4.2
quanteda 1.5.2
library(quanteda)
library(dplyr)
raw <- data.frame("doc_id" = c("1", "2", "3"),
"text" = c("diffuse intrinsic pontine glioma are highly aggressive and difficult to treat brain tumors found at the base of the brain.",
"magnetic resonance imaging (mri) is a medical imaging technique used in radiology to form pictures of the anatomy and the physiological processes of the body.",
"radiation therapy or radiotherapy, often abbreviated rt, rtx, or xrt, is a therapy using ionizing radiation, generally as part of cancer treatment to control or kill malignant cells and normally delivered by a linear accelerator."))
term = c("diffuse intrinsic pontine glioma", "brain tumors", "brain", "pontine glioma", "mri", "medical imaging", "radiology", "anatomy", "physiological processes", "radiation therapy", "radiotherapy", "cancer treatment", "malignant cells")
medTerms = list(term = term)
dict <- dictionary(medTerms)
corp <- raw %>% group_by(doc_id) %>% summarise(text = paste(text, collapse=" "))
corp <- corpus(corp, text_field = "text")
dfm <- dfm(corp,
tolower = TRUE, stem = FALSE, remove_punct = TRUE,
remove = stopwords("english"))
dfm <- dfm_select(dfm, pattern = phrase(dict))
What I'd eventually like to get back is something like the following:
doc_id term
1 diffuse intrinsic pontine glioma
1 pontine glioma
1 brain tumors
1 brain
2 mri
2 medical imaging
2 radiology
2 anatomy
2 physiological processes
3 radiation therapy
3 radiotherapy
3 cancer treatment
3 malignant cells
If you want to match multi-word patterns from a dictionary, you can do so by constructing your dfm using ngrams.
library(quanteda)
library(dplyr)
library(tidyr)
raw$text <- as.character(raw$text) # you forgot to use stringsAsFactors = FALSE while constructing the data.frame, so I convert your factor to character before continuing
corp <- corpus(raw, text_field = "text")
dfm <- tokens(corp) %>%
tokens_ngrams(1:5) %>% # This is the new way of creating ngram dfms. 1:5 means to construct all from unigram to 5-grams
dfm(tolower = TRUE,
stem = FALSE,
remove_punct = TRUE) %>% # I wouldn't remove stopwords for this matching task
dfm_select(pattern = dict)
Now we just have to convert the dfm to a data.frame and bring it into a long format:
convert(dfm, "data.frame") %>%
pivot_longer(-document, names_to = "term") %>%
filter(value > 0)
#> # A tibble: 13 x 3
#> document term value
#> <chr> <chr> <dbl>
#> 1 1 brain 2
#> 2 1 pontine_glioma 1
#> 3 1 brain_tumors 1
#> 4 1 diffuse_intrinsic_pontine_glioma 1
#> 5 2 mri 1
#> 6 2 radiology 1
#> 7 2 anatomy 1
#> 8 2 medical_imaging 1
#> 9 2 physiological_processes 1
#> 10 3 radiotherapy 1
#> 11 3 radiation_therapy 1
#> 12 3 cancer_treatment 1
#> 13 3 malignant_cells 1
You could remove the value column but it might be of interest later on.
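The terms in that output are joined with underscores, whereas the dictionary entries use spaces. If the original spelling is wanted, a small post-processing step could look like this; the matches data frame is a hand-typed excerpt of the output above, used only for illustration:

```r
# Hand-typed excerpt of the long-format result shown above
matches <- data.frame(document = c("1", "1", "2"),
                      term = c("pontine_glioma", "brain_tumors", "medical_imaging"),
                      value = c(1, 1, 1),
                      stringsAsFactors = FALSE)

# Replace the n-gram concatenator with a space to recover the dictionary spelling
matches$term <- gsub("_", " ", matches$term, fixed = TRUE)
matches$term
#> [1] "pontine glioma"  "brain tumors"    "medical imaging"
```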
You could form all ngrams from 1 to 5 in length and then select the dictionary matches out of them, but for large texts this would be very inefficient. Here's a more direct way. I've reproduced the entire problem here with a few modifications (such as stringsAsFactors = FALSE and skipping some unnecessary steps).
Granted, this does not double count the terms as in your expected example, but I submit that you probably did not want this. Why count "brain" if it occurred within "brain tumor"? You would be better counting "brain tumor" when it occurs as that phrase, and "brain" only when it occurs without "tumor". The code below does that.
library(quanteda)
## Package version: 2.0.1
raw <- data.frame(
"doc_id" = c("1", "2", "3"),
"text" = c(
"diffuse intrinsic pontine glioma are highly aggressive and difficult to treat brain tumors found at the base of the brain.",
"magnetic resonance imaging (mri) is a medical imaging technique used in radiology to form pictures of the anatomy and the physiological processes of the body.",
"radiation therapy or radiotherapy, often abbreviated rt, rtx, or xrt, is a therapy using ionizing radiation, generally as part of cancer treatment to control or kill malignant cells and normally delivered by a linear accelerator."
),
stringsAsFactors = FALSE
)
dict <- dictionary(list(
term = c(
"diffuse intrinsic pontine glioma",
"brain tumors", "brain", "pontine glioma", "mri", "medical imaging",
"radiology", "anatomy", "physiological processes", "radiation therapy",
"radiotherapy", "cancer treatment", "malignant cells"
)
))
Here's the key to the answer: using the dictionary first to select the tokens, then to concatenate them, then to reshape them one dictionary match per new "document". The last step creates the data.frame you want.
toks <- corpus(raw) %>%
tokens() %>%
tokens_select(dict) %>% # select just dictionary values
tokens_compound(dict, concatenator = " ") %>% # turn phrase into single "tokens"
tokens_segment(pattern = "*") # make one token per "document"
# make into data.frame
data.frame(
doc_id = docid(toks), term = as.character(toks),
stringsAsFactors = FALSE
)
## doc_id term
## 1 1 diffuse intrinsic pontine glioma
## 2 1 brain tumors
## 3 1 brain
## 4 2 mri
## 5 2 medical imaging
## 6 2 radiology
## 7 2 anatomy
## 8 2 physiological processes
## 9 3 radiation therapy
## 10 3 radiotherapy
## 11 3 cancer treatment
## 12 3 malignant cells
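The "longest phrase wins" behaviour can also be imitated in base R with a regular expression. This is a different technique from the quanteda pipeline above, sketched under two assumptions: the terms contain no regex metacharacters, and the texts are already lowercase. The texts and term list are shortened versions of the example data:

```r
raw <- data.frame(
  doc_id = c("1", "2"),
  text = c("diffuse intrinsic pontine glioma are difficult to treat brain tumors found at the base of the brain.",
           "magnetic resonance imaging (mri) is a medical imaging technique used in radiology."),
  stringsAsFactors = FALSE
)
terms <- c("diffuse intrinsic pontine glioma", "brain tumors", "brain",
           "pontine glioma", "mri", "medical imaging", "radiology")

# Put longer terms first: with perl = TRUE the alternation takes the first
# alternative that matches, so "brain tumors" is tried before "brain"
ordered <- terms[order(nchar(terms), decreasing = TRUE)]
pattern <- paste0("\\b(", paste(ordered, collapse = "|"), ")\\b")

# All non-overlapping matches per document
hits <- regmatches(raw$text, gregexpr(pattern, raw$text, perl = TRUE))

phrases <- data.frame(doc_id = rep(raw$doc_id, lengths(hits)),
                      term = unlist(hits),
                      stringsAsFactors = FALSE)
phrases$term
```

As in the quanteda version, "pontine glioma" and "brain" are not reported separately where they sit inside a longer matched phrase.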

Splitting a column in a data frame by an nth instance of a character

I have a dataframe with several columns, and one of those columns is populated by pipes "|" and information that I am trying to obtain.
For example:
View(Table$Column)
"|1||KK|12|Gold||4K|"
"|1||Rst|E|Silver||13||"
"|1||RST|E|Silver||18||"
"|1||KK|Y|Iron|y|12||"
"|1||||Copper|Cpr|||E"
"|1||||Iron|||12|F"
And so on for about 120K rows.
What I am trying to extract is everything between the 5th pipe and the 6th pipe in this series, but in its own column vector, so the end result looks like this:
View(Extracted)
Gold
Silver
Silver
Iron
Copper
Iron
I don't want to use RegEx. My tools are only limited to R here. Would you guys happen to have any advice how to overcome this?
Thank you.
1) Assuming Table as defined reproducibly in the Note at the end, use read.table as shown. No regular expressions or packages are used.
read.table(text = Table$Column, sep = "|", header = FALSE,
as.is = TRUE, fill = TRUE)[6]
giving:
V6
1 Gold
2 Silver
3 Silver
4 Iron
5 Copper
6 Iron
2) This alternative does use a regular expression (which the question asked not to) but just in case here is a tidyr solution. Note that it requires tidyr 0.8.2 or later since earlier versions of tidyr did not support NA in the into= argument.
library(dplyr)
library(tidyr)
Table %>%
separate(Column, into = c(rep(NA, 5), "commodity"), sep = "\\|", extra = "drop")
giving:
commodity
1 Gold
2 Silver
3 Silver
4 Iron
5 Copper
6 Iron
3) This is another base solution. It is probably not the one you want given that (1) is so much simpler but I wanted to see if we could come up with a second approach in base that did not use regexes. Note that if the split= argument of strsplit is "" then it is treated specially and so is not a regex. It creates a list each of whose components is a vector of single characters. Each such vector is passed to the anonymous function which labels | and the characters in the field after it with its ordinal number. We then take the characters corresponding to 5 (except the first as it is |) and collapse them together using paste.
data.frame(commodities = sapply(strsplit(Table$Column, ""), function(chars) {
wx <- which(cumsum(chars == "|") == 5)
paste(chars[seq(wx[2], tail(wx, 1))], collapse = "")
}), stringsAsFactors = FALSE)
giving:
commodities
1 Gold
2 Silver
3 Silver
4 Iron
5 Copper
6 Iron
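To make the labelling step in (3) concrete, here is the same idea traced on a single string. One small variation, which is an assumption of this sketch and not the code above: it drops the leading pipe with wx[-1], which also returns an empty string when the field is empty.

```r
chars <- strsplit("|1||KK|12|Gold||4K|", "")[[1]]

# Each "|" starts a new field, so the running count of pipes labels every
# character with the ordinal number of the pipe that precedes it
field_id <- cumsum(chars == "|")

# Positions labelled 5: the 5th pipe itself plus the characters after it
wx <- which(field_id == 5)

# Drop the pipe (first position) and glue the remaining characters together
commodity <- paste(chars[wx[-1]], collapse = "")
commodity
#> [1] "Gold"
```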
Note
Table <- data.frame(Column = c("|1||KK|12|Gold||4K|",
"|1||Rst|E|Silver||13||",
"|1||RST|E|Silver||18||",
"|1||KK|Y|Iron|y|12||",
"|1||||Copper|Cpr|||E",
"|1||||Iron|||12|F"), stringsAsFactors = FALSE)
You can try this:
df <- data.frame(x = c("|1||KK|12|Gold||4K|", "|1||Rst|E|Silver||13||"), stringsAsFactors = FALSE)
library(stringr)
stringr::str_split(df$x, "\\|", simplify = TRUE)[, 6]
1) We can use strsplit from base R on the delimiter | and extract the 6th element from the list of vectors
sapply(strsplit(Table$Column, "|", fixed = TRUE), `[`, 6)
#[1] "Gold" "Silver" "Silver" "Iron" "Copper" "Iron"
2) Or using regex (again from base R), use sub to extract the 6th word
sub("^([|][^|]+){4}[|]([^|]*).*", "\\2",
gsub("(?<=[|])(?=[|])", "and", Table$Column, perl = TRUE))
#[1] "Gold" "Silver" "Silver" "Iron" "Copper" "Iron"
data
Table <- structure(list(Column = c("|1||KK|12|Gold||4K|",
"|1||Rst|E|Silver||13||",
"|1||RST|E|Silver||18||", "|1||KK|Y|Iron|y|12||", "|1||||Copper|Cpr|||E",
"|1||||Iron|||12|F")), class = "data.frame", row.names = c(NA,
-6L))
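The strsplit approach with fixed = TRUE generalizes to any field position. A minimal sketch follows; nth_field is a hypothetical helper name introduced here, and it returns NA when a string has fewer than n fields:

```r
Column <- c("|1||KK|12|Gold||4K|", "|1||Rst|E|Silver||13||",
            "|1||RST|E|Silver||18||", "|1||KK|Y|Iron|y|12||",
            "|1||||Copper|Cpr|||E", "|1||||Iron|||12|F")

# Hypothetical helper: n-th field of each string, split on a literal separator
nth_field <- function(x, n, sep = "|") {
  parts <- strsplit(x, sep, fixed = TRUE)  # fixed = TRUE: no regex involved
  vapply(parts,
         function(p) if (length(p) >= n) p[n] else NA_character_,
         character(1))
}

# The text between the 5th and 6th pipe is field 6, because each string
# starts with a pipe and so field 1 is empty
extracted <- nth_field(Column, 6)
extracted
#> [1] "Gold"   "Silver" "Silver" "Iron"   "Copper" "Iron"
```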
