Consider this data frame:
test = data.frame(
  language = c("german", "english"),
  text = I(list(c("und das Beil", "wichtige Thematik der"),
                c("some useful information", "the most unuseful product")))
)
I need to delete the stopwords in each vector of the column "text" according to the language the row belongs to. Actually, I only need to distinguish between German and English, so I thought of using apply in combination with ifelse, like this:
test[2] = apply(test, 1, function(x)
  ifelse(x[1] == "german",
         lapply(x[2], function(y) removeWords(y, stopwords("de"))),
         lapply(x[2], function(y) removeWords(y, stopwords("en")))))
But this doesn't work.
Maybe there is an even more elegant way to solve this?
As a first step you could do:
library(tm)
apply(test, 1, function(x) removeWords(x[["text"]], stopwords(x[["language"]])))
Which gives you as a result:
[,1] [,2]
[1,] " Beil" " useful information"
[2,] "wichtige Thematik " " unuseful product"
I don't know what the desired output is, though...
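If you want to keep a list-column instead of the matrix that apply() returns, here is a minimal sketch with Map (my addition, not part of the original answer; it assumes tm is loaded as above):
# replace each list element with its stopword-free version, row by row
test$text <- Map(function(txt, lang) removeWords(txt, stopwords(lang)),
                 test$text, test$language)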
Here is a tidy solution that can easily be extended to multiple languages:
library(tidyverse)
library(tm)   # for removeWords() and stopwords()

test <- tibble(
  language = c("german", "english"),
  text = I(list(c("und das Beil", "wichtige Thematik der"),
                c("some useful information", "the most unuseful product")))
)

test %>%
  mutate(lang_abr = recode(language, "german" = "de", "english" = "en")) %>%
  mutate(text = map2(text, lang_abr, ~ removeWords(.x, stopwords(.y))))
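If you then want one row per string rather than a list-column, tidyr's unnest() (loaded with the tidyverse) flattens it; a small follow-up sketch:
test %>%
  mutate(lang_abr = recode(language, "german" = "de", "english" = "en")) %>%
  mutate(text = map2(text, lang_abr, ~ removeWords(.x, stopwords(.y)))) %>%
  unnest(text)   # one row per cleaned string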
I am trying to clean some text data, and after tokenising and e.g. removing punctuation, I want to transform the tokens object into a vector/data frame/corpus.
My current approach is:
library(quanteda)
library(dplyr)
raw <- c("This is text #1.", "And a second document...")
tokens <- raw %>% tokens(remove_punct = TRUE)
docs <- lapply(tokens, toString) %>% gsub(pattern = ",", replacement = "")
Is there a more "quanteda" or at least a simpler way to do this?
This is how I would do it, and it preserves the docnames as element names in your output vector. (But you can add USE.NAMES = FALSE if you don't want to keep them.)
> sapply(tokens, function(x) paste(as.character(x), collapse = " "))
text1 text2
"This is text #1" "And a second document"
You don't need the library(dplyr) here.
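A type-stable variant of the same idea, in case you prefer vapply over sapply (a minor alternative, not from the original answer):
# vapply enforces that each element yields exactly one character string
vapply(tokens, function(x) paste(as.character(x), collapse = " "), character(1))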
Context: translate a table from French to English using a table containing corresponding replacements.
Problem: the character strings are sometimes very similar; when whitespace is involved, str_replace_all() does not consider the whole string.
Reproducible example:
library(stringr) #needed for the str_replace_all() function
#datasets
# test is the table indicating corresponding strings
test = data.frame(fr = as.character(c("Autre", "Autres", "Autre encore")),
                  en = as.character(c("Other", "Others", "Other again")),
                  stringsAsFactors = FALSE)

# test1 is the table I want to translate
test1 = data.frame(totrans = as.character(c("Autre", "Autres", "Autre encore")),
                   stringsAsFactors = FALSE)

# here is the call to translate
test2 = str_replace_all(test1$totrans, setNames(test$en, test$fr))
Output:
I get
> test2
[1] "Other" "Others" "Other encore"
Expected result:
> testexpected
[1] "Other" "Others" "Other again"
As you can see, if the strings start the same but there is no whitespace, the replacement is a success (see "Other" and "Others"), but when there is whitespace, it fails ("Autre encore" is replaced by "Other encore" and not by "Other again").
I feel the answer is very obvious, but I just can't figure out how to solve it... Any suggestion is welcome.
I think you just need word boundaries (i.e. "\\b") around your look ups. It is straightforward to add these with a paste0 call inside str_replace_all.
Note you don't need to include the whole tidyverse for this; the str_replace_all function is part of the stringr package, which is just one of several packages loaded when you call library(tidyverse):
library(stringr)

test = data.frame(fr = as.character(c("Autre", "Autres", "Autre encore")),
                  en = as.character(c("Other", "Others", "Other again")),
                  stringsAsFactors = FALSE)

test1 = data.frame(totrans = as.character(c("Autre", "Autres", "Autre encore")),
                   stringsAsFactors = FALSE)
str_replace_all(test1$totrans, paste0("\\b", test$fr, "\\b"), test$en)
#> [1] "Other" "Others" "Other again"
Created on 2020-05-14 by the reprex package (v0.3.0)
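Note that the call above works element-wise because test1$totrans and test$fr happen to be the same length. If instead you need every pattern applied to every string (the named-vector form from the question), ordering the look-ups longest-first keeps "Autre" from clipping "Autre encore"; a hedged sketch of that variant (my addition):
ord <- order(nchar(test$fr), decreasing = TRUE)   # longest patterns first
str_replace_all(test1$totrans,
                setNames(test$en[ord], paste0("\\b", test$fr[ord], "\\b")))
#> [1] "Other"       "Others"      "Other again"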
I've been wrapping my head around this for a while, trying plenty of variations of map, Reduce and such, but without success so far.
I am looking for a functional, elegant approach to replace a sequence of gsub calls such as
text_example <- c(
"I'm sure dogs are the best",
"I won't, I can't think otherwise",
"We'll be happy to discuss about dogs",
"cant do it today tho"
)
text_example %>%
  gsub(pattern = "'ll", replacement = " will") %>%
  gsub(pattern = "can'?t", replacement = "can not") %>%
  gsub(pattern = "won'?t", replacement = "will not") %>%
  gsub(pattern = "n't", replacement = " not") %>%
  gsub(pattern = "'m", replacement = " am") %>%
  gsub(pattern = "'s", replacement = " is") %>%
  gsub(pattern = "dog", replacement = "cat")
into something of the form
text_example %>%
???(dict$pattern, dict$replacement, gsub())
Where, for sake of a reproducible example, dict can be a data.frame such as
dict <- structure(
  list(
    pattern     = c("'ll", "can'?t", "won'?t", "n't", "'m", "'s", "dog"),
    replacement = c(" will", "can not", "will not", " not", " am", " is", "cat")
  ),
  row.names = c(NA, -7L),
  class = "data.frame"
)
(and I am aware that the substitutions performed might not be correct linguistically, but that's not the problem now)
Of course, a brutal
for (i in seq(nrow(dict))) {
  text_example <- gsub(dict$pattern[i], dict$replacement[i], text_example)
}
would work, and I know that there are dozens of libraries that solve this issue with some specific function. But I want to understand how to deal with recursion and problems like this in a simple, functional way, keeping as close as possible to base R. I love my lambdas!
Thank you in advance for the help.
You can use mapply for a parallel apply-effect (note that the function has to be mapply's first argument):
mapply(function(pttrn, repl) gsub(pttrn, repl, text_example),
       dict$pattern, dict$replacement)
(You might want to use SIMPLIFY=FALSE)
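(Also note that this applies each pattern/replacement pair to the original text_example independently, so you get one result per pair rather than one text with all substitutions applied in sequence; for the cumulative version, see the Reduce sketch after the next answer.)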
Maybe the following does what you want.
It is inspired by Functional Programming, the link in your comment.
I don't like the output, though; it is a list with as many elements as rows of the data frame dict, and only the last one is of interest.
new_text <- function(text) {
  txt <- text
  function(pattern, replacement) {
    txt <<- gsub(pattern, replacement, txt)   # update the captured text in place
    txt
  }
}

Replace <- new_text(text_example)
Map(Replace, dict$pattern, dict$replacement)
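Since only that last element matters, base R's Reduce expresses the same fold directly and returns only the final value; a minimal sketch using the dict and text_example from the question:
# thread the text through each pattern/replacement pair in sequence
Reduce(function(txt, i) gsub(dict$pattern[i], dict$replacement[i], txt),
       seq_len(nrow(dict)),
       init = text_example)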
I am trying to test this nice solution using a data frame as input for your_sentences.
library(koRpus)

remove_words <- function(sentence, badword = "blame"){
  tagged.text <- treetag(file = sentence, format = "obj", treetagger = "manual", lang = "en",
                         TT.options = list(path = "C:\\Treetagger", preset = "en"))
  # Check for bad words AND verb:
  cond1 <- (tagged.text@TT.res$token == badword)
  cond2 <- (substring(tagged.text@TT.res$tag, 0, 1) == "V")
  redflag <- which(cond1 & cond2)
  # If no such case, return sentence as is. If so, then remove that word:
  if (length(redflag) == 0) return(sentence)
  else {
    splitsent <- strsplit(sentence, " ")[[1]]
    splitsent <- splitsent[-redflag]
    return(paste0(splitsent, collapse = " "))
  }
}
lapply(your_sentences, remove_words)
The data frame has 1 column and 351 rows. In the lapply call I pass my data frame and the column name, and I receive this error (the same error occurs when I use the data frame without specifying the column):
> dfnew <- lapply(df$text, remove_words)
Error in writeLines(text, con = conn.tempfile) : invalid 'text' argument
What can I do to solve the error?
Example data:
df = data.frame(text = c('I blame myself for what happened',
                         'For what happened the blame is yours',
                         'I will blame you if my friend removes'))
What a bummer, I hoped that it was only a typo :-). But I have a second guess. You probably stepped into the difficulties caused by stringsAsFactors = TRUE. This might have caused your function to be passed values of type factor instead of character. Try the following:
df = data.frame(text = c('I blame myself for what happened'
                         , 'For what happened the blame is yours'
                         , 'I will blame you if my friend removes')
                , stringsAsFactors = FALSE)
Your strings seem to be saved as factors, and therefore remove_words is supplied with factor values instead of strings. Using stringsAsFactors = FALSE as an argument will solve the issue:
df <- data.frame(text = c('I blame myself for what happened',
                          'For what happened the blame is yours',
                          'I will blame you if my friend removes'),
                 stringsAsFactors = FALSE)
Or, if you have already defined your df with factors, you can convert the columns with df[] <- lapply(df, as.character) (the df[] keeps the result a data frame rather than a plain list).
lapply(df$text, remove_words)
[[1]]
[1] "I myself for what happened"
[[2]]
[1] "For what happened the blame is yours"
[[3]]
[1] "I will you if my friend removes"
Working in R, I'm trying to find an efficient way to search through a file of texts and remove or replace all instances of proper names (e.g., Thomas). I assume there is something available to do this but have been unable to locate it.
So, in this example the words "Susan" and "Bob" would be removed. This is a simplified example; in reality I would want this to apply to hundreds of documents and therefore a fairly large list of names.
texts <- as.data.frame(rbind(
  'This text stuff if quite interesting',
  'Where are all the names said Susan',
  'Bob wondered what happened to all the proper nouns'
))
names(texts)[1] <- "text"
Here's one approach based upon a data set of first names:
install.packages("gender")
library(gender)
install_genderdata_package()
sets <- data(package = "genderdata")$results[,"Item"]
data(list = sets, package = "genderdata")
stopwords <- unique(kantrowitz$name)
texts <- as.data.frame(rbind(
  'This text stuff if quite interesting',
  'Where are all the names said Susan',
  'Bob wondered what happened to all the proper nouns'
))
removeWords <- function(txt, words, n = 30000L) {
  # cumulative length of the alternation "word1|word2|...": word lengths
  # plus one "|" separator per word after the first
  l <- cumsum(nchar(words) + c(0, rep(1, length(words) - 1)))
  # split the words into groups so that each regex stays below ~n characters
  groups <- cut(l, breaks = seq(1, ceiling(tail(l, 1)/n)*n + 1, by = n))
  # one big alternation per group; sorting in decreasing order makes longer
  # words match before their prefixes
  regexes <- sapply(split(words, groups), function(words)
    sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), collapse = "|")))
  for (regex in regexes) txt <- gsub(regex, "", txt, perl = TRUE, ignore.case = TRUE)
  return(txt)
}
removeWords(texts[,1], stopwords)
# [1] "This text stuff if quite interesting"
# [2] "Where are all the names said "
# [3] " wondered what happened to all the proper nouns"
It may need some tuning for your specific data set.
Another approach could be based upon part-of-speech tagging.
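For instance, here is a hedged sketch of that idea with the udpipe package (my addition, not part of the original answer): tag the texts and drop every token whose universal POS tag is PROPN. The model path and the aggregate step are assumptions, not the package's only workflow.
library(udpipe)

# download an English model; returns a data frame whose file_model column
# holds the path to the downloaded model
model <- udpipe_download_model(language = "english")

# annotate the texts; the result has doc_id, token and upos columns
ud <- udpipe(as.character(texts$text), object = model$file_model)

# keep everything that is not tagged as a proper noun, then stitch back
aggregate(token ~ doc_id, data = subset(ud, upos != "PROPN"),
          FUN = paste, collapse = " ")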