ngrams analysis in tidytext in R

I am trying to do ngram analysis in tidytext. I have a corpus of 770 speeches. However, the function unnest_tokens in tidytext takes a data frame as input. When I checked the example (the Jane Austen books), each line of the book is stored as a row in a data frame. I am not able to convert my corpus into a data frame, neither one speech at a time nor the whole corpus at once.
What is the way I can run ngram analysis (n = 2, 3, etc.) in tidytext using unnest_tokens on my corpus? Can someone please suggest?
Thanks

You can use the ngram and tm libraries for this. You can replace "myCorpus" with the corpus you created.
library(tm)
library(ngram)
myCorpus <- c("Hi How are you", "Hello World", "I love Stackoverflow", "Good Bye All")
ng <- ngram(myCorpus, n = 2)
get.phrasetable(ng)
If you want to tokenize and convert your corpus into a data frame, use the code below.
tokenizedCorpus <- lapply(myCorpus, scan_tokenizer)
mydata <- data.frame(text = sapply(tokenizedCorpus, paste, collapse = " "), stringsAsFactors = FALSE)
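Once the corpus is in a data frame, a quick follow-up sketch (not part of the original answer, but the same pattern as the tidytext answer below) is to feed mydata to unnest_tokens for bigrams:
library(dplyr)
library(tidytext)
# Bigrams from the converted data frame built above
mydata %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)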

You say that you have a "corpus" of 770 speeches. Do you mean you have a character vector? If so, you can tokenize your text in this way:
library(tidyverse)
library(tidytext)
speech_vec <- c("I am giving a speech!",
"My second speech is even better.",
"Unfortunately, this speech is terrible!",
"For my final speech, I will wow you all.")
speech_df <- tibble(text = speech_vec) %>%
mutate(speech = row_number())
tidy_speeches <- speech_df %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2)
tidy_speeches
#> # A tibble: 21 x 2
#> speech bigram
#> <int> <chr>
#> 1 1 i am
#> 2 1 am giving
#> 3 1 giving a
#> 4 1 a speech
#> 5 2 my second
#> 6 2 second speech
#> 7 2 speech is
#> 8 2 is even
#> 9 2 even better
#> 10 3 unfortunately this
#> # … with 11 more rows
Created on 2020-02-15 by the reprex package (v0.3.0)
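From here, counting or filtering the bigrams is just more dplyr; a quick sketch (the separate()/filter() step is the usual stop-word trick, not something the question asked for):
# Count bigrams across all speeches
tidy_speeches %>%
  count(bigram, sort = TRUE)
# Or split each bigram so stop words can be filtered on either position
# (separate() comes from tidyr, loaded with tidyverse above)
tidy_speeches %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word)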
If instead, you mean that you have a DocumentTermMatrix from the tm package, check out this chapter for details on how to convert to a tidy data structure.
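As a minimal sketch of that conversion (assuming your object is a DocumentTermMatrix called my_dtm): tidy() gives one row per document-term pair with its count. Note that a DTM has already discarded word order, so for bigrams or trigrams you still need the raw text, as above.
library(tm)
library(tidytext)
# Assumption: `my_dtm` is a DocumentTermMatrix built from the 770 speeches
tidy_dtm <- tidy(my_dtm)   # columns: document, term, count
tidy_dtm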

Related

Finding the dominant topic in each sentence in topic modeling

One question I can't find the answer to in R is: how can I find the dominant topic for each sentence in an NLP topic model?
Imagine I have data frame like this:
comment <- c("outstanding renovation all improvements are topoftheline and done with energy efficiency in mind low monthly utilities even the interior",
"solidly constructed lovingly maintained sf crest built",
"one year since built new this well designed storey home",
"beautiful street large bdm in the heart of lynn valley over sqft bathrooms",
"rare to find legal beautiful upgr in port moody centre with a mountain view all bedroom units were nicely renovated",
"fantastic opportunity to get value for the money excellent family home in desirable blueridge with legal selfcontained bachelor suite on the main floor great location close to swimming ice skating community",
"original owner tired but rock solid perfect location half a block to norquay elementary school and short quiet blocks to slocan park and sky train station")
id <- c(1,2,3,4,5,6,7)
data <- data.frame(id, comment)
I preprocess the data as shown below:
text_cleaning_tokens <- data %>%
tidytext::unnest_tokens(word, comment)
text_cleaning_tokens$word <- gsub('[[:digit:]]+', '', text_cleaning_tokens$word)
text_cleaning_tokens$word <- gsub('[[:punct:]]+', '', text_cleaning_tokens$word)
text_cleaning_tokens <- text_cleaning_tokens %>% filter(!(nchar(word) == 1))%>%
anti_join(stop_words)
stemmed_token <- text_cleaning_tokens %>% mutate(word=wordStem(word))
tokens <- stemmed_token %>% filter(!(word==""))
tokens <- tokens %>% mutate(ind = row_number())
tokens <- tokens %>% group_by(id) %>% mutate(ind = row_number()) %>%
tidyr::spread(key = ind, value = word)
tokens [is.na(tokens)] <- ""
tokens <- tidyr::unite(tokens, clean_remark,-id,sep =" " )
tokens$clean_remark <- trimws(tokens$clean_remark)
Then I ran the FitLdaModel function on this data and finally found the top terms based on 2 topics:
t_1 t_2
1 beauti built
2 block home
3 renov legal
4 bathroom locat
5 bdm bachelor
6 bdm_heart bachelor_suit
7 beauti_street block_norquai
8 beauti_upgr blueridg
9 bedroom blueridg_legal
10 bedroom_unit built_design
now based on the result I have, I want to find the most dominant topic in each sentence in topic modelling. For example, I want to know that for comment 1 ("outstanding renovation all improvements are topoftheline and done with energy efficiency in mind low monthly utilities even the interior"), which topic (topic 1 or topic 2) is the most dominant?
Can anyone help me with this question? do we have any package that can do this?
It is pretty easy to work with quanteda and topicmodels. The former is for data management and quantitative analysis of textual data, the latter is for topic modeling inference.
Here I take your comment object and transform it to a corpus and then to a dfm. I then convert it to be understandable by topicmodels.
The function LDA() gives you all you need to easily extract information. In particular, with get_topics() you get the most probable topic for each document. If you instead want to see the document-topic weights, you can do so with ldamodel@gamma. You will see that get_topics() does exactly what you asked for.
Please, see if this works for you.
library(quanteda)
#> Package version: 2.1.2
#> Parallel computing: 2 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.
#>
#> Attaching package: 'quanteda'
#> The following object is masked from 'package:utils':
#>
#> View
library(topicmodels)
comment <- c("outstanding renovation all improvements are topoftheline and done with energy efficiency in mind low monthly utilities even the interior",
"solidly constructed lovingly maintained sf crest built",
"one year since built new this well designed storey home",
"beautiful street large bdm in the heart of lynn valley over sqft bathrooms",
"rare to find legal beautiful upgr in port moody centre with a mountain view all bedroom units were nicely renovated",
"fantastic opportunity to get value for the money excellent family home in desirable blueridge with legal selfcontained bachelor suite on the main floor great location close to swimming ice skating community",
"original owner tired but rock solid perfect location half a block to norquay elementary school and short quiet blocks to slocan park and sky train station")
mycorp <- corpus(comment)
docvars(mycorp, "id") <- 1L:7L
mydfm <- dfm(mycorp)
# convert the dfm to a DocumentTermMatrix for topicmodels
forTM <- convert(mydfm, to = "topicmodels")
myLDA <- LDA(forTM, k = 2)
dominant_topics <- get_topics(myLDA)
dominant_topics
#> text1 text2 text3 text4 text5 text6 text7
#> 2 2 2 2 1 1 1
dtw <- myLDA@gamma
dtw
#> [,1] [,2]
#> [1,] 0.4870600 0.5129400
#> [2,] 0.4994974 0.5005026
#> [3,] 0.4980144 0.5019856
#> [4,] 0.4938985 0.5061015
#> [5,] 0.5037667 0.4962333
#> [6,] 0.5000727 0.4999273
#> [7,] 0.5176960 0.4823040
Created on 2021-03-18 by the reprex package (v1.0.0)
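If you want the dominant topic next to the original comments, a small sketch reusing the id docvar set above:
# Bind the most probable topic back onto the original comments
data.frame(id = docvars(mycorp, "id"),
           comment = comment,
           topic = dominant_topics)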
I agree with the other answer that quanteda and topicmodels are a better choice. Maybe also look into seededlda which is an LDA implementation from one of the quanteda authors (with extra features you don't have to use).
However, if you want to stick with your choice of tidytext and textmineR, this is how you would do it.
First, I simplified your preprocessing a bit, since you did some steps that seemed unnecessary to me:
library(tidyverse)
library(tidytext)
text_cleaning_tokens <- data %>%
unnest_tokens(word, comment) %>%
mutate(word = str_remove(word, "[[:digit:]]|[[:punct:]]")) %>%
filter(!(nchar(word) <= 1))%>%
anti_join(stop_words, by = "word") %>%
mutate(word = SnowballC::wordStem(word))
Then I run LDA according to the textmineR example:
lda <- text_cleaning_tokens %>%
cast_sparse(id, word) %>%
textmineR::FitLdaModel(k = 2,
iterations = 200,
burnin = 175,
optimize_alpha = TRUE,
calc_likelihood = TRUE,
calc_r2 = TRUE)
Now all implementations of LDA deliver two important results:
phi (φ) which shows for each word in the corpus how it scored on each topic. The higher the phi-value, the more prevalent the word in this specific topic.
theta (θ) which shows for each document in the corpus how it scored on each topic. The higher the theta-value, the more prevalent the topic is in the document. (topicmodels calls it gamma for some reason.)
In other words, all you have to do to find the most dominant topic in a text is:
lda$theta %>%
as_tibble() %>%
rowwise() %>%
mutate(top = which.max(c_across(everything()))) %>% # find highest value per row dplyr style
bind_cols(data, .) %>% # bind to original data
as_tibble() # just for nicer printing
#> # A tibble: 7 x 5
#> id comment t_1 t_2 top
#> <int> <chr> <dbl> <dbl> <int>
#> 1 1 outstanding renovation all improvements are t… 0.892 0.108 1
#> 2 2 solidly constructed lovingly maintained sf crest … 0.0161 0.984 2
#> 3 3 one year since built new this well designed store… 0.0238 0.976 2
#> 4 4 beautiful street large bdm in the heart of lynn v… 0.986 0.0139 1
#> 5 5 rare to find legal beautiful upgr in port moody c… 0.992 0.00820 1
#> 6 6 fantastic opportunity to get value for the money … 0.266 0.734 2
#> 7 7 original owner tired but rock solid perfect locat… 0.00549 0.995 2
Created on 2021-03-18 by the reprex package (v1.0.0)
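If you also want to inspect the topics themselves, phi works the same way; a quick sketch, assuming (as far as I know) that textmineR returns phi with topics in rows and words in columns. textmineR's GetTopTerms() helper does something similar, if I recall correctly.
# Top 5 words per topic from the word-topic matrix phi
lda$phi %>%
  as_tibble(rownames = "topic") %>%
  pivot_longer(-topic, names_to = "word", values_to = "phi") %>%
  group_by(topic) %>%
  slice_max(phi, n = 5)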
I also recommend you read Julia Silge's stuff on the matter. For example, this and this.

How to tokenize my dataset in R using the tidytext library?

I have been trying to follow Text Mining with R by Julia Silge; however, I cannot tokenize my dataset with the unnest_tokens function.
Here are the packages I have loaded:
# Load
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
library(corpus)
library(corpustools)
library(dplyr)
library(tidyverse)
library(tidytext)
library(tokenizers)
library(stringr)
Here is the dataset I tried to use which is online, so the results should be reproducible:
bible <- readLines('http://bereanbible.com/bsb.txt')
And here is where everything falls apart.
Input:
bible <- bible %>%
unnest_tokens(word, text)
Output:
Error in tbl[[input]] : subscript out of bounds
From what I have read about this error in RStudio, the issue is that the dataset needs to be a matrix, so I tried transforming the dataset into a matrix, but I received the same error message.
Input:
bible <- readLines('http://bereanbible.com/bsb.txt')
bible <- as.matrix(bible, nrow = 31105, ncol = 2 )
bible <- bible %>%
unnest_tokens(word, text)
Output:
Error in tbl[[input]] : subscript out of bounds
Any recommendations for what next steps I could take or maybe some good Text mining sources I could use as I continue to dive into this would be very much appreciated.
The problem is that readLines() creates a vector, not the data frame that unnest_tokens() expects, so you need to convert it. It is also helpful to separate the verse into its own column:
library(tidytext)
library(tidyverse)
bible_orig <- readLines('http://bereanbible.com/bsb.txt')
# Get rid of the copyright etc.
bible_orig <- bible_orig[4:length(bible_orig)]
# Convert to df
bible <- enframe(bible_orig)
# Separate verse from text
bible <- bible %>%
separate(value, into = c("verse", "text"), sep = "\t")
tidy_bible <- bible %>%
unnest_tokens(word, text)
tidy_bible
#> # A tibble: 730,130 x 3
#> name verse word
#> <int> <chr> <chr>
#> 1 1 Genesis 1:1 in
#> 2 1 Genesis 1:1 the
#> 3 1 Genesis 1:1 beginning
#> 4 1 Genesis 1:1 god
#> 5 1 Genesis 1:1 created
#> 6 1 Genesis 1:1 the
#> 7 1 Genesis 1:1 heavens
#> 8 1 Genesis 1:1 and
#> 9 1 Genesis 1:1 the
#> 10 1 Genesis 1:1 earth
#> # … with 730,120 more rows
Created on 2020-07-14 by the reprex package (v0.3.0)
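From here the usual tidytext recipes apply; for example, a quick sketch of the most frequent words once stop words are removed:
# Most common words after dropping stop words
tidy_bible %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)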

Sentiment analysis in R for Cyrillic

I can't comment on the page where I found this function (Sentiment Analysis Text Analytics in Russian / Cyrillic languages):
get_sentiment_rus <- function(char_v, method="custom", lexicon=NULL, path_to_tagger = NULL, cl = NULL, language = "english") {
language <- tolower(language)
russ.char.yes <- "[\u0401\u0410-\u044F\u0451]"
russ.char.no <- "[^\u0401\u0410-\u044F\u0451]"
if (is.na(pmatch(method, c("syuzhet", "afinn", "bing", "nrc",
"stanford", "custom"))))
stop("Invalid Method")
if (!is.character(char_v))
stop("Data must be a character vector.")
if (!is.null(cl) && !inherits(cl, "cluster"))
stop("Invalid Cluster")
if (method == "syuzhet") {
char_v <- gsub("-", "", char_v)
}
if (method == "afinn" || method == "bing" || method == "syuzhet") {
word_l <- strsplit(tolower(char_v), "[^A-Za-z']+")
if (is.null(cl)) {
result <- unlist(lapply(word_l, get_sent_values,
method))
}
else {
result <- unlist(parallel::parLapply(cl = cl, word_l,
get_sent_values, method))
}
}
else if (method == "nrc") {
# word_l <- strsplit(tolower(char_v), "[^A-Za-z']+")
word_l <- strsplit(tolower(char_v), paste0(russ.char.no, "+"), perl=T)
lexicon <- dplyr::filter_(syuzhet:::nrc, ~lang == tolower(language),
~sentiment %in% c("positive", "negative"))
lexicon[which(lexicon$sentiment == "negative"), "value"] <- -1
result <- unlist(lapply(word_l, get_sent_values, method,
lexicon))
}
else if (method == "custom") {
# word_l <- strsplit(tolower(char_v), "[^A-Za-z']+")
word_l <- strsplit(tolower(char_v), paste0(russ.char.no, "+"), perl=T)
result <- unlist(lapply(word_l, get_sent_values, method,
lexicon))
}
else if (method == "stanford") {
if (is.null(path_to_tagger))
stop("You must include a path to your installation of the coreNLP package. See http://nlp.stanford.edu/software/corenlp.shtml")
result <- get_stanford_sentiment(char_v, path_to_tagger)
}
return(result)
}
It gives an error
> mysentiment <- get_sentiment_rus(as.character(corpus))
Show Traceback
Rerun with Debug
Error in UseMethod("filter_") :
no applicable method for 'filter_' applied to an object of class "NULL"
And the sentiment scores are equal to 0
> SentimentScores <- data.frame(colSums(mysentiment[,]))
> SentimentScores
colSums.mysentiment.....
anger 0
anticipation 0
disgust 0
fear 0
joy 0
sadness 0
surprise 0
trust 0
negative 0
positive 0
Could you please point out where the problem might be? Or suggest any other working method for sentiment analysis in R? I just wonder which packages support the Russian language.
I am looking for any working method for sentiment analysis of a text in Russian.
It looks to me like your function did not really find any sentiment words in your text. This might have to do with the sentiment dictionary you are using. Instead of trying to repair this function, you might want to consider a tidy approach, which is outlined in the book "Text Mining with R: A Tidy Approach". The advantage is that it does not mind the Cyrillic letters and that it is really easy to understand and tweak.
First, we need a dictionary with sentiment values. I found one on GitHub, which we can directly read into R:
library(rvest)
library(stringr)
library(tidytext)
library(dplyr)
dict <- readr::read_csv("https://raw.githubusercontent.com/text-machine-lab/sentimental/master/sentimental/word_list/russian.csv")
Next, let's get some test data to work with. For no particular reason, I use the Russian Wikipedia entry for Brexit and scrape the text:
brexit <- "https://ru.wikipedia.org/wiki/%D0%92%D1%8B%D1%85%D0%BE%D0%B4_%D0%92%D0%B5%D0%BB%D0%B8%D0%BA%D0%BE%D0%B1%D1%80%D0%B8%D1%82%D0%B0%D0%BD%D0%B8%D0%B8_%D0%B8%D0%B7_%D0%95%D0%B2%D1%80%D0%BE%D0%BF%D0%B5%D0%B9%D1%81%D0%BA%D0%BE%D0%B3%D0%BE_%D1%81%D0%BE%D1%8E%D0%B7%D0%B0" %>%
read_html() %>%
html_nodes("body") %>%
html_text() %>%
tibble(text = .)
Now this data can be turned into a tidy format. I split the text into paragraphs first, so we can check sentiment scores for paragraphs individually.
brexit_tidy <- brexit %>%
unnest_tokens(output = "paragraph", input = "text", token = "paragraphs") %>%
mutate(id = seq_along(paragraph)) %>%
unnest_tokens(output = "word", input = "paragraph", token = "words")
The way a dictionary is used with tidy data is incredibly straightforward from this point. You just combine the data frame with sentiment values (i.e., the dictionary) and the data frame with the words in your text. Where text and dictionary match, the sentiment value is added. All other values are dropped.
# apply dictionary
brexit_sentiment <- brexit_tidy %>%
inner_join(dict, by = "word")
head(brexit_sentiment)
#> # A tibble: 6 x 3
#> id word score
#> <int> <chr> <dbl>
#> 1 7 затяжной -1.7
#> 2 13 против -5
#> 3 22 популярность 5
#> 4 22 против -5
#> 5 23 нужно 1.7
#> 6 39 против -5
Instead of the value for each word, you probably prefer the values per paragraph. This can easily be done by getting the mean for each paragraph:
# group sentiment by paragraph
brexit_sentiment %>%
group_by(id) %>%
summarise(sentiment = mean(score))
#> # A tibble: 25 x 2
#> id sentiment
#> <int> <dbl>
#> 1 7 -1.7
#> 2 13 -5
#> 3 22 0
#> 4 23 1.7
#> 5 39 -5
#> 6 42 5
#> 7 43 -1.88
#> 8 44 -3.32
#> 9 45 -3.35
#> 10 47 1.7
#> # … with 15 more rows
There are a couple of ways this approach could be improved if necessary:
to get rid of different word forms, you could lemmatize the words, making matches more likely
in case your text includes misspellings, you could consider matching words which are similar with e.g. fuzzyjoin (a rough sketch follows below)
you can find or create a better dictionary than the one I pulled off the first page I found when googling "russian sentiment dictionary"
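As a rough sketch of the fuzzy-matching idea (max_dist = 1 is an arbitrary choice, and the join returns word.x/word.y columns you would still need to reconcile):
library(fuzzyjoin)
# Match text words to dictionary entries within one edit distance
brexit_fuzzy <- brexit_tidy %>%
  stringdist_inner_join(dict, by = "word", max_dist = 1)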

Sentiment analysis (AFINN) in R

I am trying to score the sentiment of a dataset of tweets using the AFINN dictionary (get_sentiments("afinn")). A sample of the dataset is provided below:
A tibble: 10 x 2
Date TweetText
<dttm> <chr>
1 2018-02-10 21:58:19 "RT #RealSirTomJones: Still got the moves! That was a lo~
2 2018-02-10 21:58:19 "Yass Tom \U0001f600 #snakehips still got it #TheVoiceUK"
3 2018-02-10 21:58:19 Yasss tom he’s some chanter #TheVoiceUK #ItsNotUnusual
4 2018-02-10 21:58:20 #TheVoiceUK SIR TOM JONES...HE'S STILL HOT... AMAZING VO~
5 2018-02-10 21:58:21 I wonder how many hips Tom Jones has been through? #TheV~
6 2018-02-10 21:58:21 Tom Jones has still got it!!! #TheVoiceUK
7 2018-02-10 21:58:21 Good grief Tom Jones is amazing #TheVoiceuk
8 2018-02-10 21:58:21 RT #tonysheps: Sir Thomas Jones you’re a bloody legend #~
9 2018-02-10 21:58:22 #ITV Tom Jones what a legend!!! ❤️ #StillGotIt #TheVoice~
10 2018-02-10 21:58:22 "RT #RealSirTomJones: Still got the moves! That was a lo~
What I want to do is:
1. Split up the Tweets into individual words.
2. Score those words using the AFINN lexicon.
3. Sum the score of all the words of each Tweet
4. Return this sum into a new third column, so I can see the score per Tweet.
For a similar lexicon I found the following code:
# Initiate the scoreTopic
scoreTopic <- 0
# Start a loop over the documents
for (i in 1:length (myCorpus)) {
# Store separate words in character vector
terms <- unlist(strsplit(myCorpus[[i]]$content, " "))
# Determine the number of positive matches
pos_matches <- sum(terms %in% positive_words)
# Determine the number of negative matches
neg_matches <- sum(terms %in% negative_words)
# Store the difference in the results vector
scoreTopic [i] <- pos_matches - neg_matches
} # End of the for loop
dsMyTweets$score <- scoreTopic
However, I am not able to adjust this code to get it working with the AFINN dictionary.
This would be a great use case for tidy data principles. Let's set up some example data (these are real tweets of mine).
library(tidytext)
library(tidyverse)
tweets <- tribble(
~tweetID, ~TweetText,
1, "Was Julie helping me because I don't know anything about Python package management? Yes, yes, she was.",
2, "#darinself OMG, this is my favorite.",
3, "#treycausey #ftrain THIS IS AMAZING.",
4, "#nest No, no, not in error. Just the turkey!",
5, "The #nest people should write a blog post about how many smoke alarms went off yesterday. (I know ours did.)")
Now we have some example data. In the code below, unnest_tokens() tokenizes the text, i.e. breaks it up into individual words (the tidytext package allows you to use a special tokenizer for tweets) and the inner_join() implements the sentiment analysis.
tweet_sentiment <- tweets %>%
unnest_tokens(word, TweetText, token = "tweets") %>%
inner_join(get_sentiments("afinn"))
#> Joining, by = "word"
Now we can find the scores for each tweet. Take the original data set of tweets and left_join() on to it the sum() of the scores for each tweet. The handy function replace_na() from tidyr lets you replace the resulting NA values with zero.
tweets %>%
left_join(tweet_sentiment %>%
group_by(tweetID) %>%
summarise(score = sum(score))) %>%
replace_na(list(score = 0))
#> Joining, by = "tweetID"
#> # A tibble: 5 x 3
#> tweetID TweetText score
#> <dbl> <chr> <dbl>
#> 1 1. Was Julie helping me because I don't know anything about … 4.
#> 2 2. #darinself OMG, this is my favorite. 2.
#> 3 3. #treycausey #ftrain THIS IS AMAZING. 4.
#> 4 4. #nest No, no, not in error. Just the turkey! -4.
#> 5 5. The #nest people should write a blog post about how many … 0.
Created on 2018-05-09 by the reprex package (v0.2.0).
If you are interested in sentiment analysis and text mining, I invite you to check out the extensive documentation and tutorials we have for tidytext.
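One caveat if you run this today: newer releases of the AFINN lexicon (pulled in via the textdata package) name the numeric column value instead of score, so the aggregation step would become something like:
# Adjusted aggregation for AFINN versions where the column is `value`
tweets %>%
  left_join(tweet_sentiment %>%
              group_by(tweetID) %>%
              summarise(score = sum(value)),
            by = "tweetID") %>%
  replace_na(list(score = 0))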
For future reference:
Score_word <- function(x) {
word_bool_vec <- get_sentiments("afinn")$word==x
score <- get_sentiments("afinn")$score[word_bool_vec]
return (score) }
Score_tweet <- function(sentence) {
words <- unlist(strsplit(sentence, " "))
words <- as.vector(words)
scores <- sapply(words, Score_word)
scores <- unlist(scores)
Score_tweet <- sum(scores)
return (Score_tweet)
}
dsMyTweets$score<-apply(df, 1, Score_tweet)
This executes what I initially wanted! :)

How to chain together multiple qdap transformations for text mining / sentiment (polarity) analysis in R

I have a data.frame that has week numbers, week, and text reviews, text. I would like to treat the week variable as my grouping variable and run some basic text analysis on it (e.g. qdap::polarity). Some of the review text have multiple sentences; however, I only care about the week's polarity "on-the-whole".
How can I chain together multiple text transformations before running qdap::polarity and adhere to its warning messages? I am able to chain together transformations with the tm::tm_map and tm::tm_reduce -- is there something comparable in qdap? What is the proper way to pre-treat/transform this text prior to running qdap::polarity and/or qdap::sentSplit?
More details in the following code / reproducible example:
library(qdap)
library(tm)
df <- data.frame(week = c(1, 1, 1, 2, 2, 3, 4),
text = c("This is some text. It was bad. Not good.",
"Another review that was bad!",
"Great job, very helpful; more stuff here, but can't quite get it.",
"Short, poor, not good Dr. Jay, but just so-so. And some more text here.",
"Awesome job! This was a great review. Very helpful and thorough.",
"Not so great.",
"The 1st time Mr. Smith helped me was not good."),
stringsAsFactors = FALSE)
docs <- as.Corpus(df$text, df$week)
funs <- list(stripWhitespace,
tolower,
replace_ordinal,
replace_number,
replace_abbreviation)
# Is there a qdap function that does something similar to the next line?
# Or is there a way to pass this VCorpus / Corpus directly to qdap::polarity?
docs <- tm_map(docs, FUN = tm_reduce, tmFuns = funs)
# At the end of the day, I would like to get this type of output, but adhere to
# the warning message about running sentSplit. How should I pre-treat / cleanse
# these sentences, but keep the "week" grouping?
pol <- polarity(df$text, df$week)
## Not run:
# check_text(df$text)
You could run sentSplit as suggested in the warning as follows:
df_split <- sentSplit(df, "text")
with(df_split, polarity(text, week))
## week total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 1 5 26 -0.138 0.710 -0.195
## 2 2 6 26 0.342 0.402 0.852
## 3 3 1 3 -0.577 NA NA
## 4 4 2 10 0.000 0.000 NaN
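To also address the chaining part of the question: a minimal sketch is to run qdap's replace_* helpers (the same ones listed in your funs) directly on the text column by plain reassignment, then split and score; whether every step is strictly necessary for polarity() I have not checked:
# Cleanse the text column with qdap's replacement helpers, then split and score
df$text <- replace_abbreviation(df$text)
df$text <- replace_ordinal(df$text)
df$text <- replace_number(df$text)
df_split <- sentSplit(df, "text")
with(df_split, polarity(text, week))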
Note that I have a breakout sentiment package sentimentr available on GitHub that is an improvement in speed, functionality, and documentation over the qdap version. This does the sentence splitting internally in the sentiment_by function. The script below allows you to install the package and use it:
if (!require("pacman")) install.packages("pacman")
p_load_gh("trinker/sentimentr")
with(df, sentiment_by(text, week))
## week word_count sd ave_sentiment
## 1: 2 25 0.7562542 0.21086408
## 2: 1 26 1.1291541 0.05781106
## 3: 4 10 NA 0.00000000
## 4: 3 3 NA -0.57735027
