Convert tokens to corpus

I have a variable named df2, which is a character vector.
As a preprocessing step, I would like to remove the standard English stopwords together with my own list of stopwords. After that, I would like to create a corpus and a dfm from the result.
I use the following commands:
library(quanteda)
datastopwords_removed <- tokens_remove(tokens(df2, remove_punct = TRUE), c(stopwords("english"), mystopwords$phrases))
mycorpus <- corpus(datastopwords_removed)
myDfm <- dfm(datastopwords_removed, ngrams = c(1,5))
But the corpus() call gives this error:
Error in UseMethod("corpus") :
no applicable method for 'corpus' applied to an object of class "tokens"
How can I fix it? Also, if my stopword list contains phrases with more than one token, do I need any special handling? The code runs without an error, so I assume the phrases are removed.
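One way to address both points, sketched here under assumptions rather than as a definitive answer: corpus() has no method for tokens objects, so the corpus can be built from the original character vector, with the tokens and the dfm derived from it. The sample texts and the contents of mystopwords below are hypothetical stand-ins. phrase() is used on the assumption that multi-word entries should match sequences of tokens; without it, a pattern containing spaces is compared against single tokens and can silently match nothing. Recent quanteda versions form n-grams with tokens_ngrams() rather than through an argument to dfm().

```r
library(quanteda)

# Hypothetical stand-ins for the asker's data
df2 <- c("The quick brown fox jumps over the lazy dog.",
         "New York is a big city, as a matter of fact.")
mystopwords <- data.frame(phrases = c("quick", "as a matter of fact"),
                          stringsAsFactors = FALSE)

# Build the corpus from the raw text, not from the tokens
mycorpus <- corpus(df2)

# Tokenize, then remove the custom (possibly multi-word) stopwords first,
# so that phrases are still intact when they are matched
toks <- tokens(mycorpus, remove_punct = TRUE)
toks <- tokens_remove(toks, phrase(mystopwords$phrases))
toks <- tokens_remove(toks, stopwords("english"))

# Unigrams and 5-grams, mirroring ngrams = c(1, 5) in the question
toks <- tokens_ngrams(toks, n = c(1, 5))

# A dfm can be created directly from a tokens object
myDfm <- dfm(toks)
print(featnames(myDfm))
```

If the phrases should be removed only as whole sequences and never as their individual words, keeping them in a separate tokens_remove() call, as above, makes that explicit.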

Related

R stopwords: getting rid of ALL the words starting with 'https'

I'm doing a project that includes Twitter scraping.
The problem: I don't seem to be able to remove ALL of the words that start with 'https'.
My code:
library(twitteR)
library(tm)
library(RColorBrewer)
library(e1071)
library(class)
library(wordcloud)
library(tidytext)
scraped_tweets <- searchTwitter('Silk Sonic - leave door open', n = 10000, lang='en')
# get text data from tweets
scraped_text <- sapply(scraped_tweets, function(x){x$getText()})
# removing emojis and characters
scraped_text <- iconv(scraped_text, 'UTF-8', 'ASCII')
scraped_corpus <- Corpus(VectorSource(scraped_text))
doc_matrix <- TermDocumentMatrix(scraped_corpus,
                                 control = list(removePunctuation = TRUE,
                                                stopwords = c('https', 'http', 'sonic', 'silk',
                                                              stopwords('english')),
                                                removeNumbers = TRUE,
                                                tolower = TRUE))
# convert object into a matrix
doc_matrix <- as.matrix(doc_matrix)
# get word counts
head(doc_matrix,1)
words <- sort(rowSums(doc_matrix), decreasing = T)
dm <- data.frame(word = names(words), freq = words)
# wordcloud
wordcloud(dm$word, dm$freq, random.order = F, colors = brewer.pal(8, 'Dark2'))
I added 'https' and 'http' to the stopword list, but it didn't help.
I can of course clean the output with gsub but it's not the same as I still get the rest of the link name as an output.
Are there any ideas how I could do this?
Thanks in advance.

Let's have a look at the tm documentation for the stopwords control option:
stopwords: Either a Boolean value indicating stopword removal using default language specific stopword lists shipped with this package, a character vector holding custom stopwords, or a custom function for stopword removal. Defaults to FALSE.
The stopwords argument does not seem to do any partial or pattern matching on the supplied stopwords. It does accept a custom function, though. That is one option, but I think it is easiest to remove the URLs from the character vector before even turning it into a corpus:
scraped_text <- sapply(scraped_tweets, function(x){x$getText()})
scraped_text <- iconv(scraped_text, 'UTF-8', 'ASCII')
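# Added lines for regex string removal
# (str_remove_all() comes from stringr; the r"(...)" raw string needs R >= 4.0)
library(stringr)
# "|$" lets the lookahead also succeed when the URL is the last thing in the tweet
scraped_text <- str_remove_all(scraped_text, r"(https?://[^)\]\s]+(?=[)\]\s]|$))")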
scraped_corpus <- Corpus(VectorSource(scraped_text))
This is a rather simple regex for url recognition, but it works reasonably well. There are more complicated ones out there, which can easily be found with a google search.
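For readers without stringr, the same idea can be sketched in base R. The sample tweets below are hypothetical, and the pattern is a simplified variant of the one above (no lookahead), so treat it as an illustration rather than a complete URL matcher.

```r
# Hypothetical sample tweets standing in for scraped_text
scraped_text <- c("Leave the door open https://t.co/abc123",
                  "New single (https://example.com/song) out now",
                  "no link here")

# Match the scheme, then everything up to the next whitespace, ')' or ']'
url_pattern <- "https?://[^)\\]\\s]+"
cleaned <- gsub(url_pattern, "", scraped_text, perl = TRUE)

# Collapse the whitespace left behind by the removed URLs
cleaned <- trimws(gsub("\\s+", " ", cleaned, perl = TRUE))
print(cleaned)
```

Because the whole URL is removed before the corpus is built, no link fragments are left over to show up as terms in the word cloud.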

How can I solve this R error message relating to atomic vectors?

I am using R in RStudio and I am running the following code to perform a sentiment analysis on a set of unstructured texts.
Since the texts contain some invalid characters (caused by emoticons and typos), I want to remove them before proceeding with the analysis.
An extract of my R code follows:
setwd("E:/sentiment")
doc1=read.csv("book1.csv", stringsAsFactors = FALSE, header = TRUE)
# replace specific characters in doc1
doc1<-gsub("[^\x01-\x7F]", "", doc1)
library(tm)
#Build Corpus
corpus<- iconv(doc1$Review.Text, to = 'utf-8')
corpus<- Corpus(VectorSource(corpus))
I get the following error message when I reach this line of code corpus<- iconv(doc1$Review.Text, to = 'utf-8'):
Error in doc1$Review.Text : $ operator is invalid for atomic vectors
I had a look at the following StackOverflow questions:
remove emoticons in R using tm package
Replace specific characters within strings
I have also tried the following to clean my texts before running the tm package, but I am getting the same error: doc1<-iconv(doc1, "latin1", "ASCII", sub="")
How can I solve this issue?

With 
doc1<-gsub("[^\x01-\x7F]", "", doc1)
you overwrite the object doc1; from then on it is no longer a data frame but a character vector. See:
doc1 <- gsub("[^\x01-\x7F]", "", iris)
str(doc1)
and now, clearly,
doc1$Species
produces the error.
What you probably want to do is:
doc1$Review.Text <- gsub("[^\x01-\x7F]", "", doc1$Review.Text)

Error in creating TermDocumentMatrix using tm package in R

I am unable to create a term-document matrix using the tm package in R. It throws the following error when I try to create one from a preprocessed corpus.
Error in UseMethod("TermDocumentMatrix", x) :
no applicable method for 'TermDocumentMatrix' applied to an object of class
"character"
Below is my script that I am using. I am using R v3.4.1 with tm package v0.7-1.
data <- readLines("Data/en_US/en_US_sample.txt", n = 100)
data <- Corpus(VectorSource(data))
data <- tm_map(data, removePunctuation)
data <- tm_map(data, removeNumbers)
data <- tm_map(data, content_transformer(tolower))
data <- tm_map(data, removeWords, stopwords("en"))
data <- tm_map(data, stripWhitespace)
words <- TermDocumentMatrix("data")
I believe TermDocumentMatrix requires the corpus to be in some specific text document format, so I tried coercing my corpus to PlainTextDocument using tm_map, but that doesn't solve the problem. When I load my text data using Corpus on a VectorSource, the resulting object has class SimpleCorpus, which might be the problem, but I am not sure.
Any help would be much appreciated. Thanks!

You did everything right, except that in your last line you accidentally passed the character string "data" (note the quotation marks) to TermDocumentMatrix() instead of the object data.
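A minimal sketch of the corrected call, using two hypothetical sample lines in place of the file from the question:

```r
library(tm)

# Hypothetical sample lines standing in for the output of readLines()
data <- c("The first sample line, with 2 numbers: 1 and 2.",
          "A second line about text mining in R.")
data <- Corpus(VectorSource(data))
data <- tm_map(data, removePunctuation)
data <- tm_map(data, content_transformer(tolower))

# A quoted name is just a string, which is why the method dispatch fails
print(class("data"))   # "character"

# Pass the corpus object itself, without quotation marks
words <- TermDocumentMatrix(data)
print(dim(words))
```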

How to select words from corpus for TermDocumentMatrix creation in tm

I want to retain only the pattern words (i.e. gene names that I have specified) from each document of my corpus to generate the dtm. I do not want to pre-process the documents before corpus creation; I want to select and retain the gene names from the corpus only. I have used a custom function to keep only the terms in "pattern" and remove everything else (see "How to select only a subset of corpus terms for TermDocumentMatrix creation in tm"). Here is my code.
library(tm)
library(Rstem)
library(RTextTools)
docs <- Corpus(DirSource(path of the directory))
# Custom function to keep only the terms in "pattern" and remove everything else
f <- content_transformer(function(x, pattern)regmatches(x, gregexpr(pattern, x, perl=TRUE, ignore.case=TRUE)))
# The pattern I want to search for
gene = "IL1|IL2|IL3|IL4|IL5|IL6|IL7|IL8|IL9|IL10|TNF|TGF|AP2|OLR1|OLR2"
docs <- tm_map(docs, f, gene)[[1]]
However, I get the error:
Error in UseMethod("content", x) : no applicable method for 'content' applied to an object of class "character"

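A possible approach, sketched under assumptions rather than as a confirmed fix for the error above: regmatches() returns a list, so the matches are collapsed back into one string per document before they are stored as content, and the result of tm_map() is kept whole instead of being subset with [[1]] (which would keep only the first document). VCorpus is used so that the transformer receives one document at a time. The sample texts are hypothetical, and the pattern is rewritten with word boundaries (a change from the question's pattern) so that IL10 is not matched as IL1.

```r
library(tm)

# Hypothetical sample documents standing in for the files in the directory
texts <- c("IL6 and TNF were elevated; IL10 was not.",
           "No genes here except TGF.")
docs <- VCorpus(VectorSource(texts))

# Word boundaries keep IL10 from being matched as IL1
gene <- "\\b(IL10|IL[1-9]|TNF|TGF|AP2|OLR1|OLR2)\\b"

# Keep only the matches and paste them back into a single string
keep_matches <- content_transformer(function(x, pattern) {
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE, ignore.case = TRUE))
  vapply(hits, paste, character(1), collapse = " ")
})

docs <- tm_map(docs, keep_matches, gene)
print(content(docs[[1]]))

# Terms are lowercased by default when the matrix is built
dtm <- DocumentTermMatrix(docs)
print(Terms(dtm))
```

Note that the default minimum term length in tm is three characters, so shorter gene symbols would need a wordLengths entry in the control list.
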
Removing non-English text from Corpus in R using tm()

I am using tm() and wordcloud() for some basic data-mining in R, but am running into difficulties because there are non-English characters in my dataset (even though I've tried to filter out other languages based on background variables).
Let's say that some of the lines in my TXT file (saved as UTF-8 in TextWrangler) look like this:
Special
satisfação
Happy
Sad
Potential für
I then read my txt file into R:
words <- Corpus(DirSource("~/temp", encoding = "UTF-8"),readerControl = list(language = "lat"))
This yields the warning message:
Warning message:
In readLines(y, encoding = x$Encoding) :
incomplete final line found on '/temp/file.txt'
But since it's a warning, not an error, I continue to push forward.
words <- tm_map(words, stripWhitespace)
words <- tm_map(words, tolower)
This then yields the error:
Error in FUN(X[[1L]], ...) : invalid input 'satisfa��o' in 'utf8towcs'
I'm open to finding ways to filter out the non-English characters either in TextWrangler or R; whatever is the most expedient. Thanks for your help!

Here's a method to remove words with non-ASCII characters before making a corpus:
# remove words with non-ASCII characters
# assuming you read your txt file in as a vector, eg.
# dat <- readLines('~/temp/dat.txt')
dat <- "Special, satisfação, Happy, Sad, Potential, für"
# convert string to vector of words
dat2 <- unlist(strsplit(dat, split=", "))
# find indices of words with non-ASCII characters
dat3 <- grep("dat2", iconv(dat2, "latin1", "ASCII", sub="dat2"))
# subset original vector of words to exclude words with non-ASCII char
dat4 <- dat2[-dat3]
# convert vector back to a string
dat5 <- paste(dat4, collapse = ", ")
# make corpus
require(tm)
words1 <- Corpus(VectorSource(dat5))
inspect(words1)
A corpus with 1 text document
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
[[1]]
Special, Happy, Sad, Potential
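A variant of the same idea, as a sketch: when no sub value is supplied, iconv() returns NA for any element it cannot convert, so the NA positions identify the words with non-ASCII characters without needing a marker string. The input is the sample from above, written with \u escapes (an assumption made here so that the script does not depend on the file's encoding).

```r
# Same sample input as above, written with \u escapes
dat <- "Special, satisfa\u00e7\u00e3o, Happy, Sad, Potential, f\u00fcr"
words <- unlist(strsplit(dat, split = ", "))

# Elements that cannot be represented in ASCII come back as NA
is_ascii <- !is.na(iconv(words, from = "UTF-8", to = "ASCII"))

# Logical indexing also behaves correctly when every word is ASCII,
# whereas dat2[-dat3] returns an empty vector if dat3 is empty
ascii_words <- words[is_ascii]
print(paste(ascii_words, collapse = ", "))
```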

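You can also use the "stringi" package, which transliterates the non-ASCII characters to their closest ASCII equivalents instead of dropping the words.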
Using the above example:
library(stringi)
dat <- "Special, satisfação, Happy, Sad, Potential, für"
stringi::stri_trans_general(dat, "latin-ascii")
Output:
[1] "Special, satisfacao, Happy, Sad, Potential, fur"
