I have a simple problem. Consider this example:
library(dplyr)
library(stringr)
dataframe <- data_frame(mytext = c('stackoverflow is pretty good my friend',
'but sometimes pretty bad as well'))
# A tibble: 2 x 1
mytext
<chr>
1 stackoverflow is pretty good my friend
2 but sometimes pretty bad as well
I want to count the number of times stackoverflow appears near good. I use the following regex, but it does not work:
dataframe %>% mutate(mycount = str_count(mytext,
regex('stackoverflow(?:\\w+){0,5}good', ignore_case = TRUE)))
# A tibble: 2 x 2
mytext mycount
<chr> <int>
1 stackoverflow is pretty good my friend 0
2 but sometimes pretty bad as well 0
Can someone tell me what I am missing here?
Thanks!
I had a lot of trouble with this too, and I'm still not sure why the things I tried didn't work. I'm only decent at regular expressions, not an expert. However, I was able to get it to work with lookbehind and lookahead.
library(dplyr)
library(stringr)
dataframe <- data_frame(mytext = c('stackoverflow is pretty good my friend',
'but sometimes pretty bad as well',
'stackoverflow one two three four five six good',
'stackoverflow good'))
dataframe
dataframe %>% mutate(mycount = str_count(mytext,
regex('(?<=stackoverflow)\\s(?:\\w+\\s){0,5}(?=good)', ignore_case = TRUE)))
## A tibble: 4 x 2
# mytext mycount
# <chr> <int>
#1 stackoverflow is pretty good my friend 1
#2 but sometimes pretty bad as well 0
#3 stackoverflow one two three four five six good 0
#4 stackoverflow good 1
The corpus package makes this pretty easy:
library(corpus)
dataframe <- data.frame(mytext = c('stackoverflow is pretty good my friend',
'but sometimes pretty bad as well'))
# find instances of 'stackoverflow'
loc <- text_locate(dataframe$mytext, "stackoverflow")
# count the number of times 'good' is within 5 tokens
near_good <- (text_detect(text_sub(loc$before, -4, -1), "good")
| text_detect(text_sub(loc$after, 1, 4), "good"))
# aggregate over text
count <- tapply(near_good, loc$text, sum, default = 0)
Conceptually, corpus treats text as a sequence of tokens. The library allows you to index these sequences using the text_sub() command. You can also change the definition of a token using a text_filter().
Here's an example that works the same way but ignores punctuation-only tokens:
corpus <- corpus_frame(text = c("Stackoverflow, is pretty (?) GOOD my friend!",
"But sometimes pretty bad as well"))
text_filter(corpus)$drop_punct <- TRUE
loc <- text_locate(corpus, "stackoverflow")
near_good <- (text_detect(text_sub(loc$before, -4, -1), "good")
| text_detect(text_sub(loc$after, 1, 4), "good"))
count <- tapply(near_good, loc$text, sum, default = 0)
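The same token-window idea can be sketched in base R for readers who do not have corpus installed. This is only an approximation under my own assumptions: count_near() is a hypothetical helper, a token is any run of letters or digits, and "near" means at most `window` tokens away on either side, which is not necessarily how corpus defines it.

```r
# Hypothetical helper: count occurrences of `target` that have `other`
# within `window` tokens on either side.
count_near <- function(text, target, other, window = 5) {
  # Assumption: tokens are runs of letters/digits; everything else separates them
  tokens <- strsplit(tolower(text), "[^a-z0-9]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  target_pos <- which(tokens == target)
  other_pos <- which(tokens == other)
  # For each target position, check whether any `other` token is close enough
  sum(vapply(target_pos,
             function(p) any(abs(other_pos - p) <= window),
             logical(1)))
}

texts <- c("Stackoverflow, is pretty (?) GOOD my friend!",
           "But sometimes pretty bad as well",
           "stackoverflow one two three four five six good")

counts <- vapply(texts, count_near, integer(1),
                 target = "stackoverflow", other = "good",
                 USE.NAMES = FALSE)
counts
```

Because punctuation never becomes a token here, the first text counts as a hit even though "(?)" sits between the two words.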
I think I got it
dataframe %>%
mutate(mycount = str_count(mytext,
regex('stackoverflow\\W+(?:\\w+ ){0,5}good', ignore_case = TRUE)))
# A tibble: 4 x 2
mytext mycount
<chr> <int>
1 stackoverflow is pretty good my friend 1
2 but sometimes pretty bad as well 0
3 stackoverflow good good stackoverflow 1
4 stackoverflowgood 0
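The key was adding \W+, which matches one or more non-word characters (such as spaces or punctuation) between words. The original pattern used only \w+, which cannot match the spaces separating the words.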
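To make the difference concrete, here is a small base R comparison of the two patterns. It is a sketch, not part of the original answer: count_matches() is a hypothetical helper, and it uses PCRE (perl = TRUE) instead of stringr, which should behave the same way for these particular patterns.

```r
# Hypothetical helper: number of case-insensitive matches per string
count_matches <- function(x, pattern) {
  hits <- gregexpr(pattern, x, perl = TRUE, ignore.case = TRUE)
  lengths(regmatches(x, hits))
}

texts <- c("stackoverflow is pretty good my friend",
           "stackoverflow good",
           "stackoverflowgood")

# Original pattern: \w+ only matches word characters, so it can never
# step over the space that follows "stackoverflow"
broken <- "stackoverflow(?:\\w+){0,5}good"

# Fixed pattern: \W+ consumes the separator, then up to five
# "word followed by a space" groups may appear before "good"
fixed <- "stackoverflow\\W+(?:\\w+ ){0,5}good"

broken_counts <- count_matches(texts, broken)
fixed_counts <- count_matches(texts, fixed)
broken_counts
fixed_counts
```

The original pattern only matches when the two words are glued together, which is the one case the fixed pattern rejects.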
I created this function to find the longest run of consecutive identical characters in a word.
max(rle(unlist(strsplit("happy", split = "")))$lengths)
The function works on individual words, but when I try to use the function within a mutate step it doesn't work. Here is the code that involves the mutate step.
text3 <- "The most pressing of those issues, considering the franchise's
stated goal of competing for championships above all else, is an apparent
disconnect between Lakers vice president of basketball operations and general manager"
text3_df <- tibble(line = 1:1, text3)
text3_df %>%
unnest_tokens(word, text3) %>%
mutate(
num_letters = nchar(word),
num_vowels = get_count(word),
num_consec_char = max(rle(unlist(strsplit(word, split = "")))$lengths)
)
The variables num_letters and num_vowels work fine, but I get a 2 for every value of num_consec_char. I can't figure out what I'm doing wrong.
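The expression rle(unlist(strsplit(word, split = "")))$lengths is not vectorized. Inside mutate(), word is the entire column, so strsplit() returns one character vector per word, unlist() flattens them all into a single sequence of letters, and max() returns a single value that is recycled to every row.
You will need some type of loop (e.g., for, sapply, or purrr::map) to apply it to each word separately.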
library(dplyr)
library(tidytext)
text3 <- "The most pressing of those issues, considering the franchise's
stated goal of competing for championships above all else, is an apparent
disconnect between Lakers vice president of basketball operations and general manager"
text3_df <- tibble(line = 1:1, text3)
output <- text3_df %>%
unnest_tokens(word, text3) %>%
mutate(
num_letters = nchar(word),
# num_vowels = get_count(word),
)
output$num_consec_char <- sapply(output$word, function(word) {
max(rle(unlist(strsplit(word, split = "")))$lengths)
})
output
# A tibble: 32 × 4
line word num_letters num_consec_char
<int> <chr> <int> <int>
1 1 the 3 1
2 1 most 4 1
3 1 pressing 8 2
4 1 of 2 1
5 1 those 5 1
6 1 issues 6 2
7 1 considering 11 1
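The difference between the whole-column call and the per-word call can be seen on a few words in plain base R. This is a minimal sketch; the word list is just a sample taken from the sentence above, and vapply() is used here in place of sapply() only to guarantee an integer result.

```r
words <- c("the", "most", "pressing", "issues")

# Whole-vector call: the letters of all words are flattened into one
# sequence, so a single maximum is returned for the entire column
whole_column <- max(rle(unlist(strsplit(words, split = "")))$lengths)

# Per-word call: run-length encode each word's letters separately
per_word <- vapply(strsplit(words, split = ""),
                   function(chars) max(rle(chars)$lengths),
                   integer(1))

whole_column
per_word
```

The single value from the first call is what mutate() recycles to every row, which explains the 2 seen for every word in the question.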
I would like to count the number of English words in a string of text.
df.words <- data.frame(ID = 1:2,
text = c(c("frog friend fresh frink foot"),
c("get give gint gobble")))
df.words
ID text
1 1 frog friend fresh frink foot
2 2 get give gint gobble
I'd like the final product to look like this:
ID text count
1 1 frog friend fresh frink foot 4
2 2 get give gint gobble 3
I'm guessing I'll have to first separate based on spaces and then reference the words against a dictionary?
Building on @r2evans's suggestion of using strsplit(), and using an English word list (.txt file) found online as the dictionary, an example is below. This solution may not scale well if you have a large number of comparisons because of the unnest step.
library(dplyr)
library(tidyr)
# text file with 479k English words ~4MB
dict <- read.table(file = url("https://github.com/dwyl/english-words/raw/master/words_alpha.txt"), col.names = "text2")
df.words <- data.frame(ID = 1:2,
text = c(c("frog friend fresh frink foot"),
c("get give gint gobble")),
stringsAsFactors = FALSE)
df.words %>%
mutate(text2 = strsplit(text, split = "\\s")) %>%
unnest(text2) %>%
semi_join(dict, by = c("text2")) %>%
group_by(ID, text) %>%
summarise(count = length(text2))
Output
ID text count
<int> <chr> <int>
1 1 frog friend fresh frink foot 4
2 2 get give gint gobble 3
Base R alternative, using EJJ's great recommendation for dict:
sapply(strsplit(df.words$text, "\\s+"),
function(z) sum(z %in% dict$text2))
# [1] 4 3
I thought that this would be a clear winner in speed, but apparently doing sum(. %in% .) one at a time can be a little expensive. (It is slower with this data.)
Faster but not necessarily simpler:
words <- strsplit(df.words$text, "\\s+")
words <- sapply(words, `length<-`, max(lengths(words)))
found <- array(words %in% dict$text2, dim = dim(words))
colSums(found)
# [1] 4 3
It's a hair faster (~10-15%) than EJJ's solution, so it is likely only worthwhile if you need to wring some performance out of it.
(Caveat: EJJ's is faster with this 2-row dataset. If the data is 1000x larger, then my first solution is a little faster, and my second solution is twice as fast. Benchmarks are only benchmarks, though: don't optimize code beyond usability if speed is not a critical factor.)
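How the padded-matrix version works is easier to see with a tiny in-memory dictionary. The sketch below makes one assumption: `dict` is a short made-up word list standing in for the downloaded file, so it runs offline.

```r
# Assumption: a tiny stand-in dictionary instead of the 479k-word file
dict <- c("frog", "friend", "fresh", "foot", "get", "give", "gobble")
text <- c("frog friend fresh frink foot", "get give gint gobble")

words <- strsplit(text, "\\s+")

# Simple version: one %in% lookup per row
counts_simple <- vapply(words, function(w) sum(w %in% dict), integer(1))

# Padded version: `length<-` pads every row with NA to the longest row,
# so sapply() returns a matrix with one column per row of input
padded <- sapply(words, `length<-`, max(lengths(words)))

# A single %in% over the whole matrix (NA is not in dict, so padding
# counts as FALSE), reshaped and summed column by column
found <- array(padded %in% dict, dim = dim(padded))
counts_fast <- colSums(found)

counts_simple
counts_fast
```

The speed-up comes from calling %in% once on all words instead of once per row; the price is the extra padding step.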
The original dataset has 3k+ rows and 2 columns: ids and the programming languages each id uses in practice. My first step was to find how often each combination of languages was chosen. For example, how many times Python was chosen along with R and SQL, or how many times Java was picked with JavaScript, C++, and so on.
Some research on Stack Overflow helped me find these patterns. Here's some code with a sample data set:
sample <- data.frame(id = rep(randomNames::randomNames(4), each = 4),
programming = c("R", "Python", "C#", "Other",
"R", "Tableu", "Assembler",
"Other", "Java", "JavaScript",
"Python", "C#","R", "Python", "C#",
"Other"))
gr <- sample %>%
group_by(id) %>%
arrange(programming) %>%
summarise(programming = paste(sort(unique(programming)), collapse = ", ")) %>%
count(programming)
But now I wonder how I can find the most frequent picks for each language. For instance, R was picked together with Java and Kotlin very few times, so this is not a popular combination, whereas R picked together with Python and SQL is more popular. My goal is to find which languages are most frequently picked together.
I also did some research (example) but, unfortunately, have not found a solution.
I think I should iterate over my programming column to find all possible picks (R + ..., Python + ...; then R + Python + ...). I tried using lapply but struggled with writing the anonymous function.
What are the possible ways to solve this? Is there an efficient function for this purpose?
One option would be to create all pairwise combinations of languages within each id and count which pairs occur together most frequently.
library(dplyr)
sample %>%
group_by(id) %>%
summarise(programming = combn(sort(programming), 2,
paste0, collapse = '-'), .groups = 'drop') %>%
count(programming, sort = TRUE)
# programming n
# <chr> <int>
# 1 C#-Python 3
# 2 Other-R 3
# 3 C#-Other 2
# 4 C#-R 2
# 5 Other-Python 2
# 6 Python-R 2
# 7 Assembler-Other 1
# 8 Assembler-R 1
# 9 Assembler-Tableu 1
#10 C#-Java 1
#11 C#-JavaScript 1
#12 Java-JavaScript 1
#13 Java-Python 1
#14 JavaScript-Python 1
#15 Other-Tableu 1
#16 R-Tableu 1
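The same pair counting can be sketched in base R on a small fixed data set, which makes the result reproducible without randomNames. The data frame `picks` and its values are made up for illustration.

```r
# Made-up data: three ids, three languages each
picks <- data.frame(
  id = rep(c("a", "b", "c"), each = 3),
  programming = c("R", "Python", "SQL",
                  "R", "Python", "Java",
                  "R", "SQL", "Java"),
  stringsAsFactors = FALSE
)

# For each id, build every unordered pair of its languages; sorting
# first means "Python-R" and "R-Python" end up as the same label
pairs <- unlist(lapply(
  split(picks$programming, picks$id),
  function(langs) combn(sort(unique(langs)), 2, paste, collapse = "-")
))

# Count how often each pair occurs across ids, most frequent first
pair_counts <- sort(table(pairs), decreasing = TRUE)
pair_counts
```

Changing the 2 in combn() to 3 would count triples instead of pairs, provided every id has at least three languages.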
I am very new to NLP, so please don't judge me too harshly.
I have a very large data frame of customer feedback, and my goal is to analyze it. I tokenized the feedback into words and removed stop words (SMART). Now I need a table of the most and least frequently used words.
The code looks like this:
library(tokenizers)
library(stopwords)
words_as_tokens <-
tokenize_words(dat$description,
stopwords = stopwords(language = "en", source = "smart"))
The data frame looks like this: there are many feedback entries (variable "description") and the customers who gave them (customers are not unique; they can be repeated). I want a table with 3 columns: a) customer name, b) word, c) its frequency. This "ranking" should be in decreasing order.
Try this
library(tokenizers)
library(stopwords)
library(tidyverse)
# count freq of words
words_as_tokens <- setNames(lapply(sapply(dat$description,
tokenize_words,
stopwords = stopwords(language = "en", source = "smart")),
function(x) as.data.frame(sort(table(x), TRUE), stringsAsFactors = FALSE)), dat$name)
# tidyverse's job
df <- words_as_tokens %>%
bind_rows(.id = "name") %>%
rename(word = x)
# output
df
# name word Freq
# 1 John experience 2
# 2 John word 2
# 3 John absolutely 1
# 4 John action 1
# 5 John amazon 1
# 6 John amazon.ae 1
# 7 John answering 1
# ....
# 42 Alex break 2
# 43 Alex nice 2
# 44 Alex times 2
# 45 Alex 8 1
# 46 Alex accent 1
# 47 Alex africa 1
# 48 Alex agents 1
# ....
Data
dat <- data.frame(name = c("John", "Alex"),
description = c("Unprecedented. The perfect word to describe Amazon. In every positive sense of that word! All because of one man - Jeff Bezos. What an entrepreneur! What a vision! This is from personal experience. Let me explain. I had given up all hope, after a horrible experience with Amazon.ae (formerly Souq.com) - due to a Herculean effort to get an order cancelled and the subsequent refund issued. I have never faced such a feedback-resistant team in my life! They were robotically answering my calls and sending me monotonous, unhelpful emails, followed by absolutely zero action!",
"Not only does Amazon have great products but their Customer Service for the most part is wonderful. Although most times you are outsourced to a different country, I personally have found that when I call it's either South Africa or Philippines and they speak so well, understand me and my NY accent and are quite nice. Let’s face it. Most times you are calling CS with a problem or issue. These agents have to listen to 8 hours of complaints so they themselves need a break. No matter how annoyed I am I try to be on my best behavior and as nice as can be because they too need a break with how nasty we as a society can be."), stringsAsFactors = F)
You can try with quanteda as well as follows:
library(quanteda)
library(quanteda.textstats)
# define a corpus object to store your initial documents
mycorpus = corpus(dat$description)
# tokenize first: recent quanteda versions expect dfm() to receive tokens, not a corpus
mytokens = tokens( mycorpus,
remove_punct = TRUE, # this removes punctuation
remove_numbers = TRUE, # this removes digits
remove_symbols = TRUE, # this removes symbols
remove_url = TRUE ) # this removes urls
mytokens = tokens_remove( mytokens, stopwords("en") ) # this removes English stopwords
# convert the tokens to a Document-Feature Matrix
mydfm = dfm( mytokens, tolower = TRUE )
# calculate word frequencies and return a data.frame
word_frequencies = textstat_frequency( mydfm )
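For the three-column table the question asks for (customer, word, frequency), where the same customer can appear more than once, a base R sketch could look like the following. Everything in it is illustrative: the `feedback` data, the tiny `stop_words` list (not the SMART list), and the letters-only tokenization are all assumptions.

```r
# Made-up data: John appears twice, so his counts must be pooled
feedback <- data.frame(
  name = c("John", "Alex", "John"),
  description = c("Great service, great price",
                  "Slow delivery and the box was damaged",
                  "Great support"),
  stringsAsFactors = FALSE
)
stop_words <- c("the", "a", "and", "is", "was")  # hypothetical mini stop list

# Lower-case and split on anything that is not a letter
tokens <- strsplit(tolower(feedback$description), "[^a-z]+")

# One row per (customer, word) occurrence
long <- data.frame(name = rep(feedback$name, lengths(tokens)),
                   word = unlist(tokens),
                   stringsAsFactors = FALSE)
long <- long[nzchar(long$word) & !long$word %in% stop_words, ]

# Cross-tabulate, drop empty cells, sort by customer then frequency
freq <- as.data.frame(table(name = long$name, word = long$word),
                      stringsAsFactors = FALSE)
freq <- freq[freq$Freq > 0, ]
freq <- freq[order(freq$name, -freq$Freq), ]
freq
```

Because the counting is done on customer names rather than on individual feedback entries, repeated customers are pooled into one ranking.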
I am trying to compute the sentiment of a dataset of Tweets using the AFINN dictionary (get_sentiments("afinn")). A sample of the dataset is provided below:
A tibble: 10 x 2
Date TweetText
<dttm> <chr>
1 2018-02-10 21:58:19 "RT #RealSirTomJones: Still got the moves! That was a lo~
2 2018-02-10 21:58:19 "Yass Tom \U0001f600 #snakehips still got it #TheVoiceUK"
3 2018-02-10 21:58:19 Yasss tom he’s some chanter #TheVoiceUK #ItsNotUnusual
4 2018-02-10 21:58:20 #TheVoiceUK SIR TOM JONES...HE'S STILL HOT... AMAZING VO~
5 2018-02-10 21:58:21 I wonder how many hips Tom Jones has been through? #TheV~
6 2018-02-10 21:58:21 Tom Jones has still got it!!! #TheVoiceUK
7 2018-02-10 21:58:21 Good grief Tom Jones is amazing #TheVoiceuk
8 2018-02-10 21:58:21 RT #tonysheps: Sir Thomas Jones you’re a bloody legend #~
9 2018-02-10 21:58:22 #ITV Tom Jones what a legend!!! ❤️ #StillGotIt #TheVoice~
10 2018-02-10 21:58:22 "RT #RealSirTomJones: Still got the moves! That was a lo~
What I want to do is:
1. Split up the Tweets into individual words.
2. Score those words using the AFINN lexicon.
3. Sum the score of all the words of each Tweet
4. Return this sum into a new third column, so I can see the score per Tweet.
For a similar lexicon I found the following code:
# Initiate the scoreTopic
scoreTopic <- 0
# Start a loop over the documents
for (i in 1:length (myCorpus)) {
# Store separate words in character vector
terms <- unlist(strsplit(myCorpus[[i]]$content, " "))
# Determine the number of positive matches
pos_matches <- sum(terms %in% positive_words)
# Determine the number of negative matches
neg_matches <- sum(terms %in% negative_words)
# Store the difference in the results vector
scoreTopic [i] <- pos_matches - neg_matches
} # End of the for loop
dsMyTweets$score <- scoreTopic
I am however not able to adjust this code to get it working with the afinn dictionary.
This would be a great use case for tidy data principles. Let's set up some example data (these are real tweets of mine).
library(tidytext)
library(tidyverse)
tweets <- tribble(
~tweetID, ~TweetText,
1, "Was Julie helping me because I don't know anything about Python package management? Yes, yes, she was.",
2, "#darinself OMG, this is my favorite.",
3, "#treycausey #ftrain THIS IS AMAZING.",
4, "#nest No, no, not in error. Just the turkey!",
5, "The #nest people should write a blog post about how many smoke alarms went off yesterday. (I know ours did.)")
Now we have some example data. In the code below, unnest_tokens() tokenizes the text, i.e. breaks it up into individual words (the tidytext package allows you to use a special tokenizer for tweets), and inner_join() implements the sentiment analysis by keeping only the words that appear in the AFINN lexicon.
tweet_sentiment <- tweets %>%
unnest_tokens(word, TweetText, token = "tweets") %>%
inner_join(get_sentiments("afinn"))
#> Joining, by = "word"
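Now we can find the scores for each tweet. Take the original data set of tweets and left_join() onto it the sum() of the scores for each tweet. Tweets with no words in the lexicon get NA after the join, and the handy function replace_na() from tidyr lets you replace those NA values with zero. (In more recent versions of tidytext the AFINN score column is named value rather than score, so the call there would be sum(value).)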
tweets %>%
left_join(tweet_sentiment %>%
group_by(tweetID) %>%
summarise(score = sum(score))) %>%
replace_na(list(score = 0))
#> Joining, by = "tweetID"
#> # A tibble: 5 x 3
#> tweetID TweetText score
#> <dbl> <chr> <dbl>
#> 1 1. Was Julie helping me because I don't know anything about … 4.
#> 2 2. #darinself OMG, this is my favorite. 2.
#> 3 3. #treycausey #ftrain THIS IS AMAZING. 4.
#> 4 4. #nest No, no, not in error. Just the turkey! -4.
#> 5 5. The #nest people should write a blog post about how many … 0.
Created on 2018-05-09 by the reprex package (v0.2.0).
If you are interested in sentiment analysis and text mining, I invite you to check out the extensive documentation and tutorials we have for tidytext.
For future reference:
library(tidytext) # provides get_sentiments()
afinn <- get_sentiments("afinn") # load the lexicon once instead of once per word
Score_word <- function(x) {
  # scores of the lexicon entries equal to x; numeric(0) if x is not in the lexicon
  afinn$score[afinn$word == x]
}
Score_tweet <- function(sentence) {
  words <- unlist(strsplit(sentence, " "))
  scores <- unlist(sapply(words, Score_word))
  sum(scores)
}
# score the text of each tweet
dsMyTweets$score <- sapply(dsMyTweets$TweetText, Score_tweet, USE.NAMES = FALSE)
This executes what I initially wanted! :)
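A vectorized variant of the same idea replaces the per-word lookup with match(). The sketch below is self-contained and makes loud assumptions: `lexicon` is a made-up four-word table (not the real AFINN data), its score column is called value, and score_text() is a hypothetical helper.

```r
# Made-up mini lexicon; the scores are illustrative, NOT real AFINN values
lexicon <- data.frame(word = c("good", "amazing", "bad", "legend"),
                      value = c(3, 4, -3, 2),
                      stringsAsFactors = FALSE)

# Hypothetical helper: sum the lexicon scores of the words in one text
score_text <- function(text, lexicon) {
  # Lower-case so that "Good" and "GOOD" also hit the lexicon
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  # match() gives NA for words outside the lexicon; na.rm drops them
  sum(lexicon$value[match(words, lexicon$word)], na.rm = TRUE)
}

tweets <- c("Good grief Tom Jones is amazing",
            "what a bad day",
            "nothing to score here")

scores <- vapply(tweets, score_text, numeric(1),
                 lexicon = lexicon, USE.NAMES = FALSE)
scores
```

Lower-casing and splitting on non-letters also catches words that a plain split on spaces would miss, such as "Good" or "amazing!".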