Count misspelled words in R

Row<-c(1,2,3,4,5)
Content<-c("I love cheese", "whre is the fish", "Final Countdow", "show me your s", "where is what")
Data<-cbind(Row, Content)
View(Data)
I wanted to create a function which tells me how many words are wrong per Row.
An intermediate step would be to have it look like this:
Row<-c(1,2,3,4,5)
Content<-c("I love cheese", "whre is the fs", "Final Countdow", "show me your s", "where is what")
MisspelledWords<-c(NA, "whre, fs", "Countdow","s",NA)
Data<-cbind(Row, Content,MisspelledWords)
I know that I have to use aspell, but I'm having problems performing aspell on individual rows rather than directly on the whole file. Finally, I want to count how many words are wrong in every Row; for this I would use the code from: Count the number of words in a string in R?

Inspired by this article, here's a try with which_misspelled and check_spelling in library(qdap).
library(qdap)
# which_misspelled
n_misspelled <- sapply(Content, function(x) {
  length(which_misspelled(x, suggest = FALSE))
})
data.frame(Content, n_misspelled, row.names = NULL)
#          Content n_misspelled
# 1  I love cheese            0
# 2 whre is the fs            2
# 3 Final Countdow            1
# 4 show me your s            0
# 5  where is what            0
# check_spelling
df <- check_spelling(Content, n.suggest = 0)
n_misspelled <- as.vector(table(factor(df$row, levels = Row)))
data.frame(Content, n_misspelled)
#          Content n_misspelled
# 1  I love cheese            0
# 2 whre is the fs            2
# 3 Final Countdow            1
# 4 show me your s            0
# 5  where is what            0
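To also get the asker's intermediate MisspelledWords column, here's a hedged sketch along the same lines (it assumes which_misspelled() returns the offending words themselves when suggest = FALSE, as its use above suggests):
MisspelledWords <- sapply(Content, function(x) {
  bad <- which_misspelled(x, suggest = FALSE)
  # collapse the offending words into one string, or NA if none
  if (length(bad) == 0) NA else paste(bad, collapse = ", ")
}, USE.NAMES = FALSE)
data.frame(Row, Content, MisspelledWords)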

To use aspell you have to use a file. It's pretty straightforward to use a function to dump a column to a file, run aspell and get the counts (but it will not be all that efficient if you have a large matrix/dataframe).
countMispelled <- function(words) {
  # do a bit of cleanup (if necessary); note " +" here, since
  # " *" would also match the empty string between letters
  words <- gsub(" +", " ", gsub("[[:punct:]]", "", words))
  temp_file <- tempfile()
  writeLines(words, temp_file)
  res <- aspell(temp_file)
  unlink(temp_file)
  # return the number of misspelled words
  length(res$Original)
}
Data <- cbind(Data, Errors=unlist(lapply(Data[,2], countMispelled)))
Data
## Row Content Errors
## [1,] "1" "I love cheese" "0"
## [2,] "2" "whre is thed fish" "2"
## [3,] "3" "Final Countdow" "1"
## [4,] "4" "show me your s" "0"
## [5,] "5" "where is what" "0"
You might be better off using a data frame vs a matrix (I just worked with what you provided) since you can keep Row and Errors numeric that way.
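A minimal sketch of that suggestion (Data_df is a hypothetical name), keeping Row numeric and Errors integer:
Data_df <- data.frame(Row = Row, Content = Content, stringsAsFactors = FALSE)
Data_df$Errors <- vapply(Data_df$Content, countMispelled, integer(1))
Data_df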

Related

Extract a 100-Character Window around Keywords in Text Data with R (Quanteda or Tidytext Packages)

This is my first time asking a question on here, so I hope I don't miss any crucial parts. I want to perform sentiment analysis on windows of speeches around certain keywords. My dataset is a large csv file containing a number of speeches, but I'm only interested in the sentiment of the words immediately surrounding certain key words.
I was told that the quanteda package in R would likely be my best bet for finding such a function, but I've been unsuccessful in locating it so far. If anyone knows how to do such a task it would be greatly appreciated!
Reprex (I hope?) below:
speech = c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stackoverflow. However there are so many more words that I am not interested in assessing the sentiment of", "This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stackoverflow.", "this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech.")
data <- data.frame(id=1:3,
speechContent = speech)
I'd suggest using tokens_select() with the window argument set to a range of tokens surrounding your target terms.
To take your example, if "stackoverflow" is the target term, and you want to measure sentiment in the +/- 10 tokens around that, then this would work:
library("quanteda")
## Package version: 3.2.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
## [CODE FROM ABOVE]
corp <- corpus(data, text_field = "speechContent")
toks <- tokens(corp) %>%
  tokens_select("stackoverflow", window = 10)
toks
## Tokens consisting of 3 documents and 1 docvar.
## text1 :
## [1] "One" "relevant" "word" ","
## [5] "for" "example" "," "is"
## [9] "the" "word" "stackoverflow" "."
## [ ... and 9 more ]
##
## text2 :
## [1] "word" "of" "interest" ","
## [5] "but" "at" "the" "very"
## [9] "end" "." "stackoverflow" "."
##
## text3 :
## character(0)
There are many ways to compute sentiment from this point. An easy one is to apply a sentiment dictionary, e.g.
tokens_lookup(toks, data_dictionary_LSD2015) %>%
  dfm()
## Document-feature matrix of: 3 documents, 4 features (91.67% sparse) and 1 docvar.
##        features
## docs    negative positive neg_positive neg_negative
##   text1        0        1            0            0
##   text2        0        0            0            0
##   text3        0        0            0            0
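If you just need one score per document, a hedged follow-up sketch (it assumes raw counts are enough and reuses the LSD2015 feature names shown above):
sent <- tokens_lookup(toks, data_dictionary_LSD2015) %>%
  dfm() %>%
  convert(to = "data.frame")
sent$net <- sent$positive - sent$negative   # net sentiment per document
sent[, c("doc_id", "negative", "positive", "net")]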
Using quanteda:
library(quanteda)
corp <- corpus(data, docid_field = "id", text_field = "speechContent")
x <- kwic(tokens(corp, remove_punct = TRUE),
          pattern = "stackoverflow",
          window = 3)
x
x
Keyword-in-context with 2 matches.
[1, 29] is the word | stackoverflow | However there are
[2, 24] the very end | stackoverflow |
as.data.frame(x)
  docname from to          pre       keyword              post       pattern
1       1   29 29  is the word stackoverflow However there are stackoverflow
2       2   24 24 the very end stackoverflow                   stackoverflow
Now read the help for kwic (use ?kwic in the console) to see what kinds of patterns you can use. With tokens you can specify which data cleaning you want to apply before using kwic; in my example I removed the punctuation.
The end result is a data frame with the window before and after the keyword(s), in this example a window of length 3. After that you can do some form of sentiment analysis on the pre and post results (or paste them together first), as sketched below.
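As a sketch of that last step (window_text is a hypothetical column name), pasting pre and post gives one string per match for any downstream sentiment scorer:
ctx <- as.data.frame(x)
ctx$window_text <- paste(ctx$pre, ctx$post)   # combine both sides of the window
ctx[, c("docname", "window_text")]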

R For loop unwanted overwrite

I would like every result of the loop stored in a different object, text(somename).
Right now the loop overwrites text on every iteration;
library(rvest)
main.page <- read_html(x = "http://www.imdb.com/event/ev0000681/2016")
urls <- main.page %>%                           # feed `main.page` to the next step
  html_nodes(".alt:nth-child(2) strong a") %>%  # get the CSS nodes
  html_attr("href")                             # extract the URLs
for (i in urls) {
  a01 <- paste0("http://www.imdb.com", i)
  text <- read_html(a01) %>%                    # load the page
    html_nodes(".credit_summary_item~ .credit_summary_item+ .credit_summary_item .itemprop , .summary_text+ .credit_summary_item .itemprop") %>% # isolate the text
    html_text()
}
How could I code it in such a way that the 'i' from the list is added to text in the for statement?
To solidify my comment:
main.page <- read_html(x = "http://www.imdb.com/event/ev0000681/2016")
urls <- main.page %>%                           # feed `main.page` to the next step
  html_nodes(".alt:nth-child(2) strong a") %>%  # get the CSS nodes
  html_attr("href")                             # extract the URLs
texts <- sapply(head(urls, n = 3), function(i) {
  read_html(paste0("http://www.imdb.com", i)) %>%
    html_nodes(".credit_summary_item~ .credit_summary_item+ .credit_summary_item .itemprop , .summary_text+ .credit_summary_item .itemprop") %>%
    html_text()
}, simplify = FALSE)
str(texts)
str(texts)
# List of 3
# $ /title/tt5843990/: chr [1:4] "Lav Diaz" "Charo Santos-Concio" "John Lloyd Cruz" "Michael De Mesa"
# $ /title/tt4551318/: chr [1:4] "Andrey Konchalovskiy" "Yuliya Vysotskaya" "Peter Kurth" "Philippe Duquesne"
# $ /title/tt4550098/: chr [1:4] "Tom Ford" "Amy Adams" "Jake Gyllenhaal" "Michael Shannon"
If you use lapply(...), you'll get an unnamed list, which may or may not be a problem for you. Instead, using sapply(..., simplify = FALSE), we get a named list where each name is (in this case) the partial url retrieved from urls.
Using sapply without simplify = FALSE (i.e., with the default simplification) can lead to unexpected outputs. As an example:
set.seed(9)
sapply(1:3, function(i) rep(i, sample(3, size=1)))
# [1] 1 2 3
One might think that this will always return a vector. However, if the elements returned do not all have the same length, the result becomes a list instead:
set.seed(10)
sapply(1:3, function(i) rep(i, sample(3, size=1)))
# [[1]]
# [1] 1 1
# [[2]]
# [1] 2
# [[3]]
# [1] 3 3
In which case, it's best to have certainty in the return value, forcing a list:
set.seed(9)
sapply(1:3, function(i) rep(i, sample(3, size=1)), simplify = FALSE)
# [[1]]
# [1] 1
# [[2]]
# [1] 2
# [[3]]
# [1] 3
That way, you always know exactly how to reference sub-returns. (This is one of the tenets and advantages of Hadley's purrr package: each function always returns a result of exactly the type you declare. There are other advantages to the package as well.)
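For comparison, a minimal purrr sketch of the same scrape (same selector as above, not re-run against the live site): map() always returns a list, and set_names() names it by the partial URLs.
library(purrr)
texts <- head(urls, n = 3) %>%
  set_names() %>%   # use the partial URLs as element names
  map(~ read_html(paste0("http://www.imdb.com", .x)) %>%
        html_nodes(".credit_summary_item~ .credit_summary_item+ .credit_summary_item .itemprop , .summary_text+ .credit_summary_item .itemprop") %>%
        html_text())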

Using the VCorpus() function but losing content

I am using the VCorpus() function in the R package tm. Here is the problem I have:
example_text = data.frame(num = c(1, 2, 3),
                          Author1 = c("Text mining is a great time.",
                                      "Text analysis provides insights",
                                      "qdap and tm are used in text mining"),
                          Author2 = c("R is a great language",
                                      "R has many uses",
                                      "DataCamp is cool!"))
This looks like
  num                             Author1               Author2
1   1        Text mining is a great time. R is a great language
2   2     Text analysis provides insights       R has many uses
3   3 qdap and tm are used in text mining     DataCamp is cool!
Then I type df_source = DataframeSource(example_text[,2:3]) to only extract the last 2 columns.
df_source looks correct. After that, I did df_corpus = VCorpus(df_source) and df_corpus[[1]] is
<<PlainTextDocument>>
Metadata: 7
Content: chars: 2
And df_corpus[[1]][1] gives me
$content
[1] "3" "3"
But df_corpus[[1]] should return
<<PlainTextDocument>>
Metadata: 7
Content: chars: 49
And df_corpus[[1]][1] should return
$content
[1] "Text mining is a great time." "R is a great language"
I don't know where it goes wrong. Any suggestions would be appreciated.
The texts inside example_text that are supposed to be character have all become factors, because the 'factory-fresh' value of stringsAsFactors is TRUE (in R versions before 4.0.0; since R 4.0.0 the default is FALSE), which is weird and annoying from my point of view.
example_text <- data.frame(num = c(1, 2, 3),
                           Author1 = c("Text mining is a great time.",
                                       "Text analysis provides insights",
                                       "qdap and tm are used in text mining"),
                           Author2 = c("R is a great language",
                                       "R has many uses",
                                       "DataCamp is cool!"))
lapply(example_text, class)
# $num
# [1] "numeric"
#
# $Author1
# [1] "factor"
#
# $Author2
# [1] "factor"
To ensure the columns Author1 and Author2 are character, you may try any of the following:
Add options(stringsAsFactors = FALSE) at the beginning of your code.
Add stringsAsFactors = FALSE inside your data.frame(...) statement.
Run example_text[, 2:3] <- lapply(example_text[, 2:3], as.character)
Run example_text[, 2:3] <- lapply(example_text[, 2:3], paste)
Then everything should work fine.
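For example, a minimal end-to-end sketch of the third option, assuming the tm version from the question (newer tm releases instead require doc_id and text columns for DataframeSource):
example_text[, 2:3] <- lapply(example_text[, 2:3], as.character)
df_source <- DataframeSource(example_text[, 2:3])
df_corpus <- VCorpus(df_source)
df_corpus[[1]]$content
# [1] "Text mining is a great time." "R is a great language"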

Counting words in text (in R): Results unreadable

I'm counting words in a given text using R libraries tm and qdap. When my vector (words) has only a few words, everything looks fine:
library(tm)
library(qdap)
text <- "activat affect affected affecting affects aggravat allow attribut based basis
bc because bosses caus change changed changes changing compel compliance"
text <- Corpus(VectorSource(text))
words <- c("activat", "affect", "affected")
# Using termco to search for the words in the text
apply_as_df(text, termco, match.list = words)
# Results:
#    docs word.count  activat    affect affected
# 1 doc 1         20 1(5.00%) 4(20.00%) 1(5.00%)
But when my vector (words) has too many words the results get garbled and unreadable:
words <- c("activat", "affect", "affected", "affecting", "affects", "aggravat", "allow",
"attribut", "based", "basis", "bc", "because", "bosses", "caus", "change",
"changed", "changes", "changing", "compel", "compliance")
# Using termco to search for the words in the text
apply_as_df(text, termco, match.list=words)
# Results:
# docs word.count activat affect affected affecting affects aggravat allow
# attribut based basis bc because bosses caus change changed
# changes changing compel compliance
# 1 doc 1 20 1(5.00%) 4(20.00%) 1(5.00%) 1(5.00%) 1(5.00%) 1(5.00%) 1(5.00%)
# 1(5.00%) 1(5.00%) 1(5.00%) 1(5.00%) 1(5.00%) 1(5.00%) 2(10.00%) 3(15.00%) 1(5.00%)
# 1(5.00%) 1(5.00%) 1(5.00%) 1(5.00%)
How can I have the results display in a dataframe/matrix so I can read them more easily?
I tried using termco2mat (qdap library) which supposedly "returns a matrix of term counts" (https://trinker.github.io/qdap/termco.html) like so (please see below), but I'm getting an error:
apply_as_df(text, termco2mat, match.list=words)
# Results:
# Error in qdapfun(text.var = text, ...) :
# unused arguments (text.var = text, match.list = c("activat", "affect", "affected",
# "affecting", "affects", "aggravat", "allow", "attribut", "based", "basis", "bc",
# "because", "bosses", "caus", "change", "changed", "changes", "changing", "compel",
# "compliance"))
Or:
termco2mat(apply_as_df(text, termco, match.list=words))
# Results:
# Error in `rownames<-`(`*tmp*`, value = "doc 1") :
# attempt to set 'rownames' on an object with no dimensions
Here's a solution without qdap:
library(tm)
text1 <- "activat affect affected affecting affects aggravat allow attribut"
text2 <- "based basis bc because bosses caus change changed changes changing compel compliance"
text <- Corpus(VectorSource(c(text1, text2)))
words <- c("activat", "affect", "affected")
dtm <- DocumentTermMatrix(text)
data.frame(cnt = colSums(as.matrix(dtm[ , words])))
Output
         cnt
activat    1
affect     1
affected   1
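One caveat worth hedging: dtm[, words] errors if any search term never occurs in the corpus, so you may want to guard with tm's Terms(), e.g.:
present <- intersect(words, Terms(dtm))   # keep only terms that exist in the dtm
data.frame(cnt = colSums(as.matrix(dtm[, present])))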
I'm not sure exactly what you're trying to do, but scores and counts are how to extract the objects from the list. Maybe you want to use t to transpose the output?
apply_as_df(text, termco, match.list = words) %>%
  counts() %>%
  t()
## docs "doc 1"
## word.count "20"
## activat "1"
## affect "4"
## affected "1"
## affecting "1"
## affects "1"
## aggravat "1"
## allow "1"
## attribut "1"
## based "1"
## basis "1"
## bc "1"
## because "1"
## bosses "1"
## caus "2"
## change "3"
## changed "1"
## changes "1"
## changing "1"
## compel "1"
## compliance "1"

Load csv file with different number of strings

How would I load data from a csv file into R if the file contains a different number of strings in every line? I need to unite it in one variable (e.g. a list of lists?). The data in the file looks like this (I don't know the maximum number of elements in one row):
Peter; Paul; Mary
Jeff;
Peter; Jeff
Julia; Vanessa; Paul
Use fill = TRUE:
read.table(text = '
Peter; Paul; Mary
Jeff;
Peter; Jeff
Julia; Vanessa; Paul', sep = ';', fill = TRUE)
     V1      V2   V3
1 Peter    Paul Mary
2  Jeff
3 Peter    Jeff
4 Julia Vanessa Paul
r <- readLines("tmp3.csv")
getLine <- function(x) {
r <- scan(text=x,sep=";",what="character",quiet=TRUE)
r <- r[nchar(r)>0] ## drop empties
r <- gsub("(^ +| +$)","",r) ## strip whitespace
r
}
lapply(r,getLine)
## [[1]]
## [1] "Peter" "Paul" "Mary"
##
## [[2]]
## [1] "Jeff"
##
## [[3]]
## [1] "Peter" "Jeff"
##
## [[4]]
## [1] "Julia" "Vanessa" "Paul"
This is technically a list of vectors rather than a list of lists but it might be what you want ...
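A minimal base-R alternative sketch (same hypothetical tmp3.csv), using strsplit() plus trimws() instead of scan():
r <- readLines("tmp3.csv")
lapply(strsplit(r, ";", fixed = TRUE),
       function(x) { x <- trimws(x); x[nzchar(x)] })  ## trim, then drop empties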
