String replacement from one dataframe to another - r

How is it possible, having a dataframe like this:
df_words <- data.frame(words = c("4 Google", "5Amazon", "4sec"))
to replace, in the rows of a dataframe like this:
df <- data.frame(id = c(1, 2, 4),
                 text = c("Increase for 4 Google",
                          "There is a slight decrease for 5Amazon",
                          "I will need 4sec more"),
                 stringsAsFactors = FALSE)
the words listed in df_words with a specific replacement, like this:
"4 Google|5Amazon" -> "stock"
"4sec" -> "time"
Example of expected output:
data.frame(id = c(1, 2, 4),
           text = c("Increase for stock",
                    "There is a slight decrease for stock",
                    "I will need time more"),
           stringsAsFactors = FALSE)

I recommend the stringi library. Example:
library(stringi)
strings = c("Increase for 4 Google", "There is a slight decrease for 5Amazon", "I will need 4sec more")
patterns = c("4 Google", "5Amazon", "4sec")
replacements = c("stock", "stock", "time")
# vectorize_all = FALSE applies every pattern to every string,
# rather than pairing patterns with strings element-wise
strings = stri_replace_all_fixed(strings, patterns, replacements, vectorize_all = FALSE)
However, you probably want to handle many stocks and many times, so you might be better off doing something like this:
stocks = c("4 Google", "5Amazon")
strings = stri_replace_all_fixed(strings, stocks, 'stock', vectorize_all = FALSE)
strings = stri_replace_all_regex(strings, '\\b[0-9]+sec\\b', 'time')
\b[0-9]+sec\b is a regular expression meaning:
word boundary
one or more number characters
"sec"
word boundary
This will include strings such as "2sec" but exclude those such as "1sector"
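Putting it together for the original df, the patterns and replacements can be stored side by side. A minimal sketch, assuming a replacement column is added to df_words (it is not in the original data):
library(stringi)
df_words <- data.frame(words = c("4 Google", "5Amazon", "4sec"),
                       replacement = c("stock", "stock", "time"),  # assumed column
                       stringsAsFactors = FALSE)
df$text <- stri_replace_all_fixed(df$text,
                                  df_words$words,
                                  df_words$replacement,
                                  vectorize_all = FALSE)
df$text
# [1] "Increase for stock"
# [2] "There is a slight decrease for stock"
# [3] "I will need time more"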

Related

Tokenization of Compound Words not Working in Quanteda

I'm trying to create a dataframe containing specific keywords-in-context using the kwic() function, but unfortunately, I'm running into an error when attempting to tokenize the underlying dataset.
This is the subset of the dataset I'm using as a reproducible example:
test_cluster <- speeches_subset %>%
  filter(grepl('Schwester Agnes',
               speechContent,
               ignore.case = TRUE))
test_corpus <- corpus(test_cluster,
                      docid_field = "id",
                      text_field = "speechContent")
Here, test_cluster contains six observations of 12 variables, that is, six rows in which the column speechContent contains the compound word "Schwester Agnes". test_corpus transforms the underlying data into a quanteda corpus object.
When I then run the following code, I would expect, first, the content of the speechContent variables to be tokenized, and due to tokens_compound, the compound word "Schwester Agnes" to be tokenized as such. In a second step, I would expect the kwic() function to return a dataframe consisting of six rows, with the keyword variable including the compound word "Schwester Agnes". Instead, however, kwic() returns an empty dataframe containing 0 observations of 7 variables. I think this is because of some mistake I'm making with tokens_compound(), but I'm not sure... Any help would be greatly appreciated!
test_tokens <- tokens(test_corpus,
                      remove_punct = TRUE,
                      remove_numbers = TRUE) %>%
  tokens_compound(pattern = phrase("Schwester Agnes"))
test_kwic <- kwic(test_tokens,
                  pattern = "Schwester Agnes",
                  window = 5)
EDIT: I realize that the examples above are not easily reproducible, so please refer to the reprex below:
speech = c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stack overflow. However there are so many more words that I am not interested in assessing the sentiment of", "This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stack overflow.", "this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech.")
data <- data.frame(id=1:3,
speechContent = speech)
test_corpus <- corpus(data,
docid_field = "id",
text_field = "speechContent")
test_tokens <- tokens(test_corpus,
remove_punct = TRUE,
remove_numbers = TRUE) %>%
tokens_compound(pattern = c("stack", "overflow"))
test_kwic <- kwic(test_tokens,
pattern = "stack overflow",
window = 5)
You need to apply phrase("stack overflow") and set concatenator = " " in tokens_compound().
require(quanteda)
#> Package version: 3.2.1
#> Unicode version: 13.0
#> ICU version: 69.1
speech <- c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stack overflow. However there are so many more words that I am not interested in assessing the sentiment of",
"This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stack overflow.",
"this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech.")
data <- data.frame(id = 1:3,
speechContent = speech)
test_corpus <- corpus(data,
docid_field = "id",
text_field = "speechContent")
test_tokens <- tokens(test_corpus,
remove_punct = TRUE,
remove_numbers = TRUE) %>%
tokens_compound(pattern = phrase("stack overflow"), concatenator = " ")
test_kwic <- kwic(test_tokens,
pattern = "stack overflow",
window = 5)
test_kwic
#> Keyword-in-context with 2 matches.
#>  [1, 29] for example is the word | stack overflow | However there are so many
#>  [2, 24]     but at the very end | stack overflow |
Created on 2022-05-06 by the reprex package (v2.0.1)
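Since the question asks for a dataframe of keywords-in-context, note that the kwic object converts directly:
as.data.frame(test_kwic)
# one row per match, with docname, from, to, pre, keyword, post and pattern columns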

Removing Stop words from a list of strings in R

Sample data
dput output of my data:
x <- structure(list(Comments = structure(2:1, .Label = c("I have a lot of home-work to be completed..",
        "I want to vist my teacher today only!!"), class = "factor"),
    Comment_ID = c(704, 802)), class = "data.frame", row.names = c(NA, -2L))
I want to remove the stop words from the above data set using tidytext::stop_words$word and also retain the same columns in the output. Along with this, how can I remove punctuation with the tidytext package?
Note: I don't want to change my dataset into corpus
You can collapse all the words in tidytext::stop_words$word into one regex by adding word boundaries. However, tidytext::stop_words$word has length 1149, and the resulting pattern might be too big for the regex engine to handle, so you can drop the words you don't need before applying it.
For example taking only first 10 words from tidytext::stop_words$word, you can do :
gsub(paste0(paste0('\\b', tidytext::stop_words$word[1:10], '\\b',
                   collapse = "|"), '|[[:punct:]]+'), '', x$Comments)
#[1] "I want to vist my teacher today only"
#[2] "I have lot of homework to be completed"
Alternatively, with the tm package (clean_tweet being a character vector of texts):
library(tm)
clean_tweet = removeWords(clean_tweet, stopwords("english"))

String replacements: how to deal with similar strings and spaces

Context: translate a table from French to English using a table containing corresponding replacements.
Problem: the character strings are sometimes very similar; when white space is involved, str_replace() does not consider the whole string.
Reproducible example:
library(stringr) #needed for the str_replace_all() function
#datasets
# test is the table indicating corresponding strings
test = data.frame(fr = as.character(c("Autre", "Autres", "Autre encore")),
                  en = as.character(c("Other", "Others", "Other again")),
                  stringsAsFactors = FALSE)
# test1 is the table I want to translate
test1 = data.frame(totrans = as.character(c("Autre", "Autres", "Autre encore")),
                   stringsAsFactors = FALSE)
# here is a function to translate
test2 = str_replace_all(test1$totrans, setNames(test$en, test$fr))
Output:
I get
> test2
[1] "Other" "Others" "Other encore"
Expected result:
> testexpected
[1] "Other" "Others" "Other again"
As you can see, when strings start the same but there is no whitespace, the replacement is a success (see Other and Others), but when there is a whitespace it fails ("Autre encore" is replaced by "Other encore" and not by "Other again").
I feel the answer is very obvious but I just can't find out how to solve it... Any suggestion is welcome.
I think you just need word boundaries (i.e. "\\b") around your look ups. It is straightforward to add these with a paste0 call inside str_replace_all.
Note you don't need to include the whole tidyverse for this; the str_replace_all function is part of the stringr package, which is just one of several packages loaded when you call library(tidyverse):
library(stringr)
test = data.frame(fr = as.character(c("Autre", "Autres", "Autre encore")),
                  en = as.character(c("Other", "Others", "Other again")),
                  stringsAsFactors = FALSE)
test1 = data.frame(totrans = as.character(c("Autre", "Autres", "Autre encore")),
                   stringsAsFactors = FALSE)
str_replace_all(test1$totrans, paste0("\\b", test$fr, "\\b"), test$en)
#> [1] "Other" "Others" "Other again"
Created on 2020-05-14 by the reprex package (v0.3.0)
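Note that the call above pairs patterns with strings element-wise, which works here only because test and test1 happen to line up row by row. For an arbitrary table, the named-vector form applies every lookup to every string. A minimal sketch, sorting longer French strings first so "Autre" cannot fire inside "Autre encore":
# order lookups by decreasing length before building the named vector
test_sorted <- test[order(-nchar(test$fr)), ]
str_replace_all(test1$totrans,
                setNames(test_sorted$en, paste0("\\b", test_sorted$fr, "\\b")))
#> [1] "Other"       "Others"      "Other again"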

Replace string in one column by another column

I have a dataframe that has two columns, and I want to replace some string in the first column with the second column. In the dataframe below, I want the 'em' tag to be replaced by an 'a' tag with the URL from the urls column.
df1 = data.frame(text = c("I like <em'>Rstudio</em> very much",
                          "<em'> Anaconda</em> is an amazing data science tool"),
                 urls = c('https://www.rstudio.com/', 'https://anaconda.org/'))
I am looking for a vector like below.
text = c("I like <a href = 'https://www.rstudio.com/'>Rstudio</a> very much",
"<a href = 'https://anaconda.org/'> Anaconda</a> is an amazing data science tool")
One option uses gsub() inside mapply():
mapply(function(x, y) gsub("<em'>.*</em>", x, y),
       df1$urls, df1$text, USE.NAMES = FALSE)
# [1] "I like https://www.rstudio.com/ very much"
# [2] "https://anaconda.org/ is an amazing data science tool"
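If the goal is the full anchor tag from the expected output rather than the bare URL, a capture group can keep the link text. A minimal sketch along the same lines (the href formatting is an assumption based on the expected output in the question):
# keep the text between the em tags via a capture group and wrap it in an a tag
mapply(function(url, txt) gsub("<em'>(.*)</em>",
                               sprintf("<a href = '%s'>\\1</a>", url),
                               txt),
       df1$urls, df1$text, USE.NAMES = FALSE)
# [1] "I like <a href = 'https://www.rstudio.com/'>Rstudio</a> very much"
# [2] "<a href = 'https://anaconda.org/'> Anaconda</a> is an amazing data science tool"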

R Find similar sentences in texts

I have a problem where I'm struggling to find a solution or an approach to solve it.
I have some model sentences, e.g.
model_sentences = data.frame("model_id" = c("model_id_1", "model_id_2"),
                             "model_text" = c("Company x had 3000 employees in 2016.",
                                              "Google makes 300 dollar in revenue in 2018."))
and some texts
data = data.frame("id" = c("id1", "id2"),
                  "text" = c("Company y is expected to employ 2000 employees in 2020. This is an increase of 10%. Some stupid sentences.",
                             "Amazon's revenue is 400 dollar in 2020. That is twice as much as last year."))
and I would like to extract sentences from those texts which are similar to the model sentences.
Something like this would be my desired solution:
result = data.frame("id" = c("id1", "id2"),
                    "model_id" = c("model_id_1", "model_id_2"),
                    "sentence_from_data" = c("Company y is expected to employ 2000 employees in 2020.",
                                             "Amazon's revenue is 400 dollar in 2020."),
                    "score" = c(0.5, 0.4))
Maybe it is possible to compute a kind of 'similarity_score'.
I use this function to split texts by sentence (note it needs stringi for stri_trim_both):
library(stringi)
split_by_sentence <- function (text) {
  result <- unlist(strsplit(text, "(?<=[[:alnum:]]{4}[?!.])\\s+", perl = TRUE))
  result <- stri_trim_both(result)
  result <- result[nchar(result) > 0]
  if (length(result) == 0)
    result <- ""
  return (result)
}
But I have no idea how to compare each sentence to a model sentence.
I'm glad for any suggestions.
Check out the stringdist package.
Example:
library(stringdist)
mysent = "This is a sentence"
apply(model_sentences, 1, function(row) {
  stringdist(row['model_text'], mysent, method = "jaccard")
})
It returns the Jaccard distance between mysent and each model_text value. The smaller the value, the more similar the sentences are under the given distance measure.
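To get from here to the desired result dataframe, you can pair split_by_sentence() with stringdist(): split each text, compute the distance of every sentence to every model sentence, and keep the best pair. A minimal sketch, assuming the helper above; the score is just 1 minus the Jaccard distance, not a calibrated similarity:
library(stringdist)
library(stringi)
result <- do.call(rbind, lapply(seq_len(nrow(data)), function(i) {
  sentences <- split_by_sentence(as.character(data$text[i]))
  # distance matrix: one row per model sentence, one column per candidate sentence
  d <- sapply(sentences, function(s)
    stringdist(as.character(model_sentences$model_text), s, method = "jaccard"))
  best <- arrayInd(which.min(d), dim(d))
  data.frame(id = data$id[i],
             model_id = model_sentences$model_id[best[1]],
             sentence_from_data = sentences[best[2]],
             score = 1 - min(d))
}))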
