Extract total frequency of words from vector in R - r

This is the vector I have:
posts = c("originally by: cearainmy only concern with csm is they seem a bit insulated from players. they have private message boards where it appears most of their work goes on. i would bet they are posting more there than in jita speakers corner. i think that is unfortunate because its hard to know who to vote for if you never really see what positions they hold. its sort of like ccp used to post here on the forums then they stopped. so they got a csm to represent players and use jita park forum to interact. now the csm no longer posts there as they have their internal forums where they hash things out. perhaps we need a csm to the csm to find out what they are up to.i don't think you need to worry too much. the csm has had an internal forum for over 2 years, although it is getting used a lot more now than it was. a lot of what goes on in there is nda stuff that we couldn't discuss anyway.i am quite happy to give my opinion on any topic, to the extent that the nda allows, and i" , "fot those of you bleating about imagined nda scandals as you attempt to cast yourselves as the julian assange of eve, here's a quote from the winter summit thread:originally by: sokrateszday 3post dominion 0.0 (3hrs!)if i had to fly to iceland only for this session i would have done it. we had gathered a list of items and prepared it a bit. important things we went over were supercaps, force projection, empire building, profitability of 0.0, objectives for small gangs and of course sovereingty.the csm spent 3 hours talking to ccp about how dominion had changed 0.0, and the first thing on sokratesz's list is supercaps. its not hard to figure out the nature of the discussion.on the other hand, maybe you're right, and the csm's priority for this discussion was to talk about how underpowered and useless supercarriers are and how they needed triple the ehp and dps from their current levels?(it wasn't)")
I want a data frame as a result that contains each word and the number of times it occurs.
So the result should look something like:
word count
a 300
and 260
be 200
... ...
... ...
What I tried to do was use tm:
corpus <- VCorpus(VectorSource(posts))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
m <- DocumentTermMatrix(corpus)
Running findFreqTerms(m, lowfreq = 0, highfreq = Inf) just gives me the words. I understand it's a sparse matrix, but how do I extract the words and their frequencies?
Is there an easier way to do this, maybe without using tm at all?

posts = c("originally by: cearainmy only concern with csm is they seem a bit insulated from players. they have private message boards where it appears most of their work goes on. i would bet they are posting more there than in jita speakers corner. i think that is unfortunate because its hard to know who to vote for if you never really see what positions they hold. its sort of like ccp used to post here on the forums then they stopped. so they got a csm to represent players and use jita park forum to interact. now the csm no longer posts there as they have their internal forums where they hash things out. perhaps we need a csm to the csm to find out what they are up to.i don't think you need to worry too much. the csm has had an internal forum for over 2 years, although it is getting used a lot more now than it was. a lot of what goes on in there is nda stuff that we couldn't discuss anyway.i am quite happy to give my opinion on any topic, to the extent that the nda allows, and i" , "fot those of you bleating about imagined nda scandals as you attempt to cast yourselves as the julian assange of eve, here's a quote from the winter summit thread:originally by: sokrateszday 3post dominion 0.0 (3hrs!)if i had to fly to iceland only for this session i would have done it. we had gathered a list of items and prepared it a bit. important things we went over were supercaps, force projection, empire building, profitability of 0.0, objectives for small gangs and of course sovereingty.the csm spent 3 hours talking to ccp about how dominion had changed 0.0, and the first thing on sokratesz's list is supercaps. its not hard to figure out the nature of the discussion.on the other hand, maybe you're right, and the csm's priority for this discussion was to talk about how underpowered and useless supercarriers are and how they needed triple the ehp and dps from their current levels?(it wasn't)")
posts <- gsub("[[:punct:]]", '', posts) # remove punctuation
posts <- gsub("[[:digit:]]", '', posts) # remove numbers
word_counts <- as.data.frame(table(unlist(strsplit(posts, " ")))) # split the vector on spaces and tabulate
word_counts <- with(word_counts, word_counts[Var1 != "", ]) # remove empty strings
head(word_counts)
# Var1 Freq
# 2 a 8
# 3 about 3
# 4 allows 1
# 5 although 1
# 6 am 1
# 7 an 1

Plain R solution, assuming all words are separated by spaces:
words <- strsplit(posts, " ", fixed = TRUE)
words <- unlist(words)
counts <- table(words)
Here names(counts) holds the words, and the values are the counts.
You might want to use gsub to get rid of (),.?: and of 's, 't or 're, as in your example:
posts <- gsub("'s|'t|'re", "", posts)
posts <- gsub("[(),.?:]", " ", posts)
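If you then want the two-column data frame the question asks for, you can convert and sort the table, e.g. (a small sketch building on counts above):
word_counts <- as.data.frame(counts, stringsAsFactors = FALSE)  # columns: words, Freq
names(word_counts) <- c("word", "count")
word_counts <- word_counts[order(word_counts$count, decreasing = TRUE), ]
head(word_counts)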

You've got two options, depending on whether you want word counts per document or across all documents.
All Documents
library(dplyr)
count <- as.data.frame(t(as.matrix(m))) # dense matrix: terms as rows, documents as columns
sel_cols <- colnames(count)
count$word <- rownames(count)
rownames(count) <- seq(length = nrow(count))
count$count <- rowSums(count[,sel_cols])
count <- count %>% select(word,count)
count <- count[order(count$count, decreasing=TRUE), ]
### RESULT of head(count)
# word count
# 140 the 14
# 144 they 10
# 4 and 9
# 25 csm 7
# 43 for 5
# 55 had 4
This should capture occurrences across all documents (by use of rowSums).
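If you prefer, the same all-documents table can be had more directly from the DocumentTermMatrix with colSums (a small base-R sketch using the m from the question):
freq <- sort(colSums(as.matrix(m)), decreasing = TRUE)  # total count of each term across documents
word_counts <- data.frame(word = names(freq), count = as.integer(freq), row.names = NULL)
head(word_counts)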
Per Document
I would suggest using the tidytext package if you want word frequencies per document.
library(tidytext)
m_td <- tidy(m)
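The result m_td should be a long-format tibble with one row per document-term pair (columns document, term, and count), so per-document frequencies are already there; for example, a small sketch to see the top terms in each document:
library(dplyr)
m_td %>%
  group_by(document) %>%
  slice_max(count, n = 5)   # the 5 most frequent terms in each document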

The tidytext package allows fairly intuitive text mining, including tokenization. It is designed to work in a tidyverse pipeline, so it supplies a list of stop words ("a", "the", "to", etc.) to exclude with dplyr::anti_join. Here, you might do
library(dplyr) # or if you want it all, `library(tidyverse)`
library(tidytext)
tibble(posts) %>%
  unnest_tokens(word, posts) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)
## # A tibble: 101 × 2
## word n
## <chr> <int>
## 1 csm 7
## 2 0.0 3
## 3 nda 3
## 4 bit 2
## 5 ccp 2
## 6 dominion 2
## 7 forum 2
## 8 forums 2
## 9 hard 2
## 10 internal 2
## # ... with 91 more rows

Related

Data Preparation In R

I have six .txt dataset files stored at '../data/csv'. All the datasets have a similar structure (X1 (the speech) and part (the part of the speech, i.e. Charlotte_part_1 ... Charlotte_part_60)). I am having trouble combining all six datasets into a single .csv file called biden.csv, which should have speech, part, location, event and date. In particular, I am having trouble extracting the speech and part (these two come from the file contents) and the event (from the file name), because of the files' varying naming structure.
The six datasets
"Charlotte_Sep23_2020_Racial_Equity_Discussion-1.txt",
"Cleveland_Sep30_2020_Whistle_Stop_Tour.txt",
"Milwaukee_Aug20_2020_Democratic_National_Convention.txt",
"Philadelphia_Sep20_2020_SCOTUS.txt",
"Washington_Sep26_2020_US_Conference_of_Mayors.txt",
"Wilmington_Nov25_2020_Thanksgiving.txt"
Sample content from 'Charlotte_Sep23_2020_Racial_Equity_Discussion-1.txt'
X1 part
"Folks, thanks for taking the time to be here today. I really appreciate it. And we even have an astronaut in our house and I tell you what, that’s pretty cool. Look, first of all, I want to thank Chris and the mayor for being here, and all of you for being here. You know, these are tough times. Over 200,000 Americans have passed away. Over 200,000, and the number is still rising. The impact on communities is bad across the board, but particularly bad for African-American communities. Almost four times as likely, three times as likely to catch the disease, COVID, and when it’s caught, twice as likely to die as white Americans. It’s sort of emblematic of the inequality that exists and the circumstances that exist." Charlotte_part_1
"One of the things that really matters to me, is we could do … It didn’t have to be this bad. You have 30 million people on unemployment, you have 20 million people figuring whether or not they can pay their mortgage payment this month, and what they’re going to be able to do or not do as the consequence of that, and you’ve got millions of people who are worried that they’re going to be thrown out in the street because they can’t pay their rent. Although they’ve been given a reprieve for three months, but they have to pay double the next three months when it comes around." Charlotte_part_2
Here is the code I have written, but it's not producing the output I want; it just creates the tibble with the titles (column names) but no content in any of the variables.
biden_data <- tibble() # initialize empty tibble
# loop through all text files in the specified directory
for (file in list.files(path = "./data/csv", pattern = '*.txt', full.names = TRUE)) {
  filename <- strsplit(file, "[./]")[[1]][5] # extract file name from path
  # extract location from file name
  location <- strsplit(filename, split = '_')[[1]][1]
  # extract raw date from file name
  raw_date <- strsplit(filename, split = '_')[[1]][2]
  date <- as.Date(raw_date, "%b%d_%Y") # format as datetime
  # extract event from file name
  event <- strsplit(filename, split = '_')[[1]][3]
  # extract speech and part from file
  content <- readChar(file, file.info(file)$size)
  speech <- content[grepl("^X1", content)]
  part <- content[grepl("^part", content)]
  # create a new observation (row)
  new_obs <- tibble(speech = speech, part = part, location = location, event = event, date = date)
  # append the new observation to the existing data
  biden_data <- bind_rows(biden_data, new_obs)
  rm(filename, location, raw_date, date, content, speech, part, new_obs, file) # cleanup
}
Desired Output is supposed to look like this:
## # A tibble: 128 x 5
## speech part location event date
## <chr> <chr> <chr> <chr> <date>
## 1 Folks, thanks for taking the time to be here~ Char~ Charlot~ Raci~ 2020-09-23
## 2 One of the things that really matters to me,~ Char~ Charlot~ Raci~ 2020-09-23
## 3 How people going to do that? And the way, in~ Char~ Charlot~ Raci~ 2020-09-23
## 4 In addition to that, we find ourselves in a ~ Char~ Charlot~ Raci~ 2020-09-23
## 5 If he had spoken, as I said, they said at Co~ Char~ Charlot~ Raci~ 2020-09-23
## 6 But what I want to talk to you about today i~ Char~ Charlot~ Raci~ 2020-09-23
## 7 And thirdly, if you’re a business person, le~ Char~ Charlot~ Raci~ 2020-09-23
## 8 For too many people, particularly in the Afr~ Char~ Charlot~ Raci~ 2020-09-23
## 9 It goes to education, as well as access to e~ Char~ Charlot~ Raci~ 2020-09-23
## 10 And then we’re going to talk about, I think ~ Char~ Charlot~ Raci~ 2020-09-23
## # ... with 118 more rows
Starting with a vector of file paths:
files <- c("Charlotte_Sep23_2020_Racial_Equity_Discussion-1.txt", "Cleveland_Sep30_2020_Whistle_Stop_Tour.txt", "Milwaukee_Aug20_2020_Democratic_National_Convention.txt", "Philadelphia_Sep20_2020_SCOTUS.txt", "Washington_Sep26_2020_US_Conference_of_Mayors.txt", "Wilmington_Nov25_2020_Thanksgiving.txt")
We can capture the components into a frame:
meta <- strcapture("^([^_]+)_([^_]+_[^_]+)_(.*)\\.txt$", files, list(location="", date="", event=""))
meta
# location date event
# 1 Charlotte Sep23_2020 Racial_Equity_Discussion-1
# 2 Cleveland Sep30_2020 Whistle_Stop_Tour
# 3 Milwaukee Aug20_2020 Democratic_National_Convention
# 4 Philadelphia Sep20_2020 SCOTUS
# 5 Washington Sep26_2020 US_Conference_of_Mayors
# 6 Wilmington Nov25_2020 Thanksgiving
And then we iterate over that, reading each file's contents and combining everything into a single frame.
out <- do.call(Map, c(list(f = function(fn, ...) cbind(..., read.table(fn, header = TRUE))),
                      list(files), meta))
out <- do.call(rbind, out)
rownames(out) <- NULL
out[1:3,]
# location date event
# 1 Charlotte Sep23_2020 Racial_Equity_Discussion-1
# 2 Charlotte Sep23_2020 Racial_Equity_Discussion-1
# 3 Cleveland Sep30_2020 Whistle_Stop_Tour
# X1
# 1 Folks, thanks for taking the time to be here today. I really appreciate it. And we even have an astronaut in our house and I tell you what, that’s pretty cool. Look, first of all, I want to thank Chris and the mayor for being here, and all of you for being here. You know, these are tough times. Over 200,000 Americans have passed away. Over 200,000, and the number is still rising. The impact on communities is bad across the board, but particularly bad for African-American communities. Almost four times as likely, three times as likely to catch the disease, COVID, and when it’s caught, twice as likely to die as white Americans. It’s sort of emblematic of the inequality that exists and the circumstances that exist.
# 2 One of the things that really matters to me, is we could do … It didn’t have to be this bad. You have 30 million people on unemployment, you have 20 million people figuring whether or not they can pay their mortgage payment this month, and what they’re going to be able to do or not do as the consequence of that, and you’ve got millions of people who are worried that they’re going to be thrown out in the street because they can’t pay their rent. Although they’ve been given a reprieve for three months, but they have to pay double the next three months when it comes around.
# 3 Charlotte_Sep23_2020_Racial_Equity_Discussion-1.txt
# part
# 1 Charlotte_part_1
# 2 Charlotte_part_2
# 3 something
(I made fake files for all but the first file.)
Brief walk-through:
strcapture takes the regex (lots of _-separation) and creates a frame of location, date, etc.
Map takes a function with 1 or more arguments (we use two: fn= for the filename, and ... for "the rest") and applies it to each of the subsequent lists/vectors. In this case, I'm using ... to cbind (column-bind/concatenate) the columns from meta to what we read from the file itself. This is useful in that it combines the 1 row of each meta row with any-number-of-rows from the file itself. (We could have hard-coded ... instead as location, date, and event, but I tend to prefer to generalize, in case you need to extract something else from the filenames.)
Because we use ..., however, we need to combine files and the columns of meta in a list and then call our anon-function with the list contents as arguments.
The result of our do.call(Map, ...) is a list, not a single frame. Each element of this list is a frame with the same column structure, so we then combine them by rows with do.call(rbind, out).
R is going to carry the names from files over into the row names, which I find unnecessary (and distracting), so I removed them. Optional.
If you're interested, this may appear much easier to digest using dplyr and friends:
library(dplyr)
# library(tidyr) # unnest
out <- strcapture("^([^_]+)_([^_]+_[^_]+)_(.*)\\.txt$", files,
                  list(location = "", date = "", event = "")) %>%
  mutate(contents = lapply(files, read.table, header = TRUE)) %>%
  tidyr::unnest(contents)
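The desired output has date as a proper Date column; since the captured date strings look like Sep23_2020, you could convert them afterwards with the same format string the question already uses (assuming an English locale for the month abbreviations):
out$date <- as.Date(out$date, format = "%b%d_%Y")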

R Hunspell autocorrect and stemming in pipe function for 2 columns tribble / with unnest_tokens

I am currently trying, so far unsuccessfully, to apply autocorrection and stemming with Hunspell to my data. The data in question is a tribble of sentences, each with an author, which are then to be evaluated via a more complex unnest function. The answer to How do i optimize the performance of stemming and spell check in R? already describes a very effective way to do autocorrection and stemming.
For me, the options are either to apply the autocorrection and stemming to the complete sentences and then run my evaluations via unnest, or to check and adjust the individual words in the pipe function.
The following data describes my problem:
df <- tibble::tribble(
~text, ~author,
"We aree drivng as fast as we drove yestrday or evven fastter zysxzw", "U1",
"Today waas a beautifull day", "U2",
"Hopefulli we learn to write correect one day", "U2"
)
df %>%
  unnest_tokens(input = text,
                output = word) %>%
  count(author, word, sort = TRUE)
However, so far I have not found a way to perform the autocorrection and stemming before this example evaluation. I would like, for example, to feed only checked and corrected words into the count function.
I've been stuck on this problem for a while and am infinitely grateful for any input and ideas! Thank you!
Here is a simple guide for how you can start:
You will need the spellAndStem_tokens function from How do i optimize the performance of stemming and spell check in R?:
spellAndStem_tokens <- function(sent, language = "en_US") {
  sent_t <- quanteda::tokens(sent)
  # extract types to only work on them
  types <- quanteda::types(sent_t)
  # spelling
  correct <- hunspell_check(
    words = as.character(types),
    dict = hunspell::dictionary(language)
  )
  pattern <- types[!correct]
  replacement <- sapply(hunspell_suggest(pattern, dict = language), FUN = "[", 1)
  types <- stringi::stri_replace_all_fixed(
    types,
    pattern,
    replacement,
    vectorize_all = FALSE
  )
  # stemming
  types <- hunspell_stem(types, dict = hunspell::dictionary(language))
  # replace original tokens
  sent_t_new <- quanteda::tokens_replace(sent_t, quanteda::types(sent_t), as.character(types))
  sent_t_new <- quanteda::tokens_remove(sent_t_new, pattern = "NULL", valuetype = "fixed")
  paste(as.character(sent_t_new), collapse = " ")
}
Then the code for the first output:
#install.packages("quanteda")
#install.packages("hunspell")
library(hunspell)
library(quanteda)
library(tidyverse)
library(tidytext)
df %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE) %>%
  print(n = 30)
word n
<chr> <int>
1 we 3
2 as 2
3 day 2
4 a 1
5 aree 1
6 beautifull 1
7 correect 1
8 drivng 1
9 drove 1
10 evven 1
11 fast 1
12 fastter 1
13 hopefulli 1
14 learn 1
15 one 1
16 or 1
17 to 1
18 today 1
19 waas 1
20 write 1
21 yestrday 1
22 zysxzw 1
Then the code for the second output:
df %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE) %>%
  mutate(word = spellAndStem_tokens(word)) %>%
  print(n = 30)
output:
word n
<chr> <int>
1 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday 3
2 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday 2
3 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday 2
4 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday 1
5 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday 1
6 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday 1
7 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday 1
8 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday 1
9 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday 1
10 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday 1
11 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday 1
12 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday 1
13 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday 1
14 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday 1
15 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday 1
16 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday 1
17 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday 1
18 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday 1
19 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday 1
20 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday 1
21 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday 1
22 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday 1
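As the second output shows, passing the whole word column through spellAndStem_tokens at once collapses everything into a single string. The question mentions two possible routes; here is a minimal sketch of each (my own illustration rather than part of the answer above, assuming the libraries and the spellAndStem_tokens function are loaded):
# Option 1: correct and stem the complete sentences, then unnest and count
df %>%
  mutate(text = sapply(text, spellAndStem_tokens)) %>%
  unnest_tokens(word, text) %>%
  count(author, word, sort = TRUE)

# Option 2: correct and stem the individual words inside the pipe
df %>%
  unnest_tokens(word, text) %>%
  # autocorrect: keep words hunspell recognises, otherwise take its first suggestion
  mutate(word = ifelse(hunspell_check(word), word,
                       sapply(hunspell_suggest(word), "[", 1))) %>%
  # stem: take the first stem hunspell offers, or drop the word if it has none
  mutate(word = sapply(hunspell_stem(word),
                       function(s) if (length(s)) s[[1]] else NA_character_)) %>%
  filter(!is.na(word)) %>%
  count(author, word, sort = TRUE)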

Find frequency of terms from Function

I need to find the frequency of the terms produced by a function I have created that finds terms with punctuation in them.
library("tm")
my.text.location <- "C:/Users/*/"
newpapers <- VCorpus(DirSource(my.text.location))
I read it in and then make the function:
library("stringr")
punctterms <- function(x){str_extract_all(x, "[[:alnum:]]{1,}[[:punct:]]{1,}?[[:alnum:]]{1,}")}
terms <- lapply(newpapers, punctterms)
Now I'm lost as to how I will find the frequency of each term in each file. Do I turn it into a DTM, or is there a better way that doesn't need one?
Thank you!
This task is better suited to quanteda than tm. Your function creates a list and strips everything out of the corpus structure. With quanteda you can just use its own commands to get everything you want.
Since you didn't provide any reproducible data, I will use a data set that comes with quanteda. Comments in the code explain what is going on. The most important function in this code is dfm_select, where you can use a diverse set of selection patterns to find terms in the text.
library(quanteda)
# load corpus
my_corpus <- corpus(data_corpus_inaugural)
# create document features (like document term matrix)
my_dfm <- dfm(my_corpus)
# dfm_select can use regex selections to select terms
my_dfm_punct <- dfm_select(my_dfm,
                           pattern = "[[:alnum:]]{1,}[[:punct:]]{1,}?[[:alnum:]]{1,}",
                           selection = "keep",
                           valuetype = "regex")
# show frequency of selected terms.
head(textstat_frequency(my_dfm_punct))
feature frequency rank docfreq group
1 fellow-citizens 39 1 19 all
2 america's 35 2 11 all
3 self-government 30 3 16 all
4 world's 24 4 15 all
5 nation's 22 5 13 all
6 god's 15 6 14 all
So I got it to work without using quanteda:
m <- as.data.frame(table(unlist(terms)))
names(m) <- c("Terms", "Frequency")
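Note that table(unlist(terms)) pools all files together. For the frequency of each term in each file, as originally asked, one small sketch is to tabulate each list element separately:
# one frequency table per file, using the `terms` list created above
per_file <- lapply(terms, function(x) {
  tab <- table(unlist(x))
  data.frame(Terms = names(tab), Frequency = as.integer(tab), row.names = NULL)
})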

Extract words starting with # in R dataframe and save as new column

My dataframe column looks like this:
head(tweets_date$Tweet)
[1] b"It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac
[2] b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81
[3] b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!
[4] b'CHAMPIONS - 2018 #IPLFinal
[5] b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.
[6] b"Final. It's all over! Chennai Super Kings won by 8 wickets
These are tweets which have mentions starting with '#'. I need to extract all of them and save the mentions from each tweet as "#mention1 #mention2". Currently my code just extracts them as lists.
My code:
tweets_date$Mentions<-str_extract_all(tweets_date$Tweet, "#\\w+")
How do I collapse the list in each row to form a single string separated by spaces, as mentioned earlier?
Thanks in advance.
I trust it would be best if you used an AsIs (I()) list column in this case:
extract words:
library(stringr)
Mentions <- str_extract_all(lis, "#\\w+")
some data frame:
df <- data.frame(col = 1:6, lett = LETTERS[1:6])
create a list column:
df$Mentions <- I(Mentions)
df
#output
col lett Mentions
1 1 A #DineshK....
2 2 B #IPL, #p....
3 3 C
4 4 D
5 5 E #ChennaiIPL
6 6 F
I think this is better since it allows for quite easy subsetting:
df$Mentions[[1]]
#output
[1] "#DineshKarthik" "#KKRiders"
df$Mentions[[1]][1]
#output
[1] "#DineshKarthik"
and it succinctly shows what's inside the column when printing the df.
data:
lis <- c("b'It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac",
"b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81",
"b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!",
"b'CHAMPIONS - 2018 #IPLFinal",
"b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.",
"b'Final. It's all over! Chennai Super Kings won by 8 wickets")
The str_extract_all function from the stringr package returns a list of character vectors. So, if you instead want a single comma-separated string per tweet, then you may try using sapply for a base R option:
tweets <- str_extract_all(tweets_date$Tweet, "#\\w+")
tweets_date$Mentions <- sapply(tweets, function(x) paste(x, collapse=", "))
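The question asked for the mentions separated by spaces ("#mention1 #mention2"); for that, just change the collapse separator:
tweets_date$Mentions <- sapply(tweets, function(x) paste(x, collapse = " "))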
Via Twitter's help site: "Your username cannot be longer than 15 characters. Your real name can be longer (20 characters), but usernames are kept shorter for the sake of ease. A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores, as noted above. Check to make sure your desired username doesn't contain any symbols, dashes, or spaces."
Note that email addresses can be in tweets as can URLs with #'s in them (and not just the silly URLs with username/password in the host component). Thus, something like:
(^|[^[[:alnum:]_]#/\\!?=&])#([[:alnum:]_]{1,15})\\b
is likely a better, safer choice

What does support feature mean in result of function "term_stats()" from package "tm" in R and how is it different from count?

Running the following script produces these results:
a <- c("Your work is going to fill a large part of your life, and the only way to be truly satisfied is to do what you believe is great work. And the only way to do great work is to love what you do. If you haven't found it yet, keep looking. Don't settle. As with all matters of the heart, you'll know when you find it. - Steve Jobs")
a_source <- VectorSource(a)
a_corpus <- VCorpus(a_source)
term_stats(a_corpus)
term count support
1 . 5 1
2 to 5 1
3 is 4 1
4 you 4 1
5 , 3 1
Support is the number of documents in which the term occurs; count is the total number of occurrences. You need both if you are doing tf-idf.
library(tm)
txt <- c("Your work is going to fill a large part of your life,
and the only way to be truly satisfied is to do what you
believe is great work.
And the only way to do great work is to love what you do.
If you haven't found it yet, keep looking. Don't settle.
As with all matters of the heart, you'll know when you find it.
- Steve Jobs")
term_stats(VCorpus(VectorSource(txt)))[1:5,]
term count support
. 5 1
to 5 1
is 4 1
#Split txt into 4 docs
txt_df <- data.frame( txt = c(
"Your work is going to fill a large part of your life,
and the only way to be truly satisfied is to do what you
believe is great work." ,
"And the only way to do great work is to love what you do." ,
"If you haven't found it yet, keep looking. Don't settle." ,
"As with all matters of the heart, you'll know when you find it. -
Steve Jobs"))
term_stats(VCorpus(VectorSource(txt_df$txt)))[1:6,]
term count support
. 5 4
you 4 4
, 3 3
the 3 3
to 5 2
is 4 2
Default is to sort by support.
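If you want the table ordered by raw count instead, or a quick sense of how the two columns feed into tf-idf, here is a small sketch continuing from the code above (the idf formula is just one common variant):
stats <- term_stats(VCorpus(VectorSource(txt_df$txt)))
# reorder by total occurrences rather than by document frequency
head(stats[order(stats$count, decreasing = TRUE), ])
# count feeds term frequency; support (document frequency) feeds idf
idf <- log(4 / stats$support)   # 4 documents in txt_df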

Resources