Matching text between two different data frames in R

I have the following data in a data frame:
structure(list(`head(ker$text)` = structure(1:6, .Label = c("#_rpg_17 little league travel tourney. These parents about to be wild.",
"#auscricketfan #davidwarner31 yes WI tour is coming soon", "#keralatourism #favourite #destination #munnar #topstation https://t.co/sm9qz7Z9aR",
"#NWAWhatsup tour of duty in NWA considered a dismal assignment? Companies send in their best ppl and then those ppl don't want to leave",
"Are you Looking for a trip to Kerala? #Kerala-prime tourist attractions of India.Visit:http://t.co/zFCoaoqCMP http://t.co/zaGNd0aOBy",
"Are you Looking for a trip to Kerala? #Kerala, God's own country, is one of the prime tourist attractions of... http://t.co/FLZrEo7NpO"
), class = "factor")), .Names = "head(ker$text)", row.names = c(NA,
-6L), class = "data.frame")
I have another data frame that contains hashtags extracted from the above data frame. It is as follows:
structure(list(destination = c("#topstation", "#destination", "#munnar",
"#Kerala", "#Delhi", "#beach")), .Names = "destination", row.names = c(NA,
6L), class = "data.frame")
I want to create a new column in my first data frame that contains only the tags matched with the second data frame. For example, the first line of df1 does not have any hashtags, so this cell in the new column will be blank. The third line, however, contains five hashtags, three of which match the second data frame. I have tried using the str_match and str_extract functions, and I came very close using code given in one of the posts here:
new_col <- ker[unlist(lapply(destn$destination, agrep, ker$text)), ]
While I understand that I am getting a list as output, I am also getting an error indicating
replacement has 1472 rows, data has 644
I have tried setting max.distance to different values; each gave me a different error. Can someone help me with a solution? One alternative I am thinking of is to have each hashtag in a separate column, but I am not sure whether that will help me analyse the data further with the other variables I have. The output I am looking for is as follows:
text new_col new_col2 new_col3
statement1
statement2
statement3 #destination #munnar #topstation
statement4
statement5 #Kerala
statement6 #Kerala

library(stringi);
df1$tags <- sapply(stri_extract_all(df1[[1]],regex='#\\w+'),function(x) paste(x[x%in%df2[[1]]],collapse=','));
df1;
## head(ker$text) tags
## 1 #_rpg_17 little league travel tourney. These parents about to be wild.
## 2 #auscricketfan #davidwarner31 yes WI tour is coming soon
## 3 #keralatourism #favourite #destination #munnar #topstation https://t.co/sm9qz7Z9aR #destination,#munnar,#topstation
## 4 #NWAWhatsup tour of duty in NWA considered a dismal assignment? Companies send in their best ppl and then those ppl don't want to leave
## 5 Are you Looking for a trip to Kerala? #Kerala-prime tourist attractions of India.Visit:http://t.co/zFCoaoqCMP http://t.co/zaGNd0aOBy #Kerala
## 6 Are you Looking for a trip to Kerala? #Kerala, God's own country, is one of the prime tourist attractions of... http://t.co/FLZrEo7NpO #Kerala
Edit: If you want a separate column for each tag:
library(stringi);
m <- sapply(stri_extract_all(df1[[1]],regex='#\\w+'),function(x) x[x%in%df2[[1]]]);
df1 <- cbind(df1,do.call(rbind,lapply(m,`[`,1:max(sapply(m,length)))));
df1;
## head(ker$text) 1 2 3
## 1 #_rpg_17 little league travel tourney. These parents about to be wild. <NA> <NA> <NA>
## 2 #auscricketfan #davidwarner31 yes WI tour is coming soon <NA> <NA> <NA>
## 3 #keralatourism #favourite #destination #munnar #topstation https://t.co/sm9qz7Z9aR #destination #munnar #topstation
## 4 #NWAWhatsup tour of duty in NWA considered a dismal assignment? Companies send in their best ppl and then those ppl don't want to leave <NA> <NA> <NA>
## 5 Are you Looking for a trip to Kerala? #Kerala-prime tourist attractions of India.Visit:http://t.co/zFCoaoqCMP http://t.co/zaGNd0aOBy #Kerala <NA> <NA>
## 6 Are you Looking for a trip to Kerala? #Kerala, God's own country, is one of the prime tourist attractions of... http://t.co/FLZrEo7NpO #Kerala <NA> <NA>
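The same matched-tag extraction can also be sketched in base R with gregexpr()/regmatches(), without stringi. The data frames below are toy stand-ins for the question's df1 and df2:

```r
# Toy stand-ins for the question's data frames (hypothetical values)
df1 <- data.frame(text = c("no hashtags in this tweet",
                           "#keralatourism #destination #munnar trip"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(destination = c("#destination", "#munnar", "#Kerala"),
                  stringsAsFactors = FALSE)

# Pull every hashtag per tweet, then keep only those present in df2
tags <- regmatches(df1$text, gregexpr("#\\w+", df1$text))
df1$tags <- vapply(tags,
                   function(x) paste(x[x %in% df2$destination], collapse = ","),
                   character(1))
df1$tags
## [1] ""                     "#destination,#munnar"
```

A tweet with no hashtags yields character(0) from regmatches(), which paste() collapses to an empty string, so no special-casing is needed.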

You could do something like this:
library(stringr)
results <- sapply(df$`head(ker$text)`,
function(x) { str_match_all(x, paste(df2$destination, collapse = "|")) })
df$matches <- results
If you want to separate the results out, you can use:
df <- cbind(df, do.call(rbind, lapply(results, `[`, 1:max(sapply(results, length)))))


str_extract() and summarise() give me an NA row

This should be pretty straightforward, as I think I'm just looking for verification of what I'm seeing.
I'm trying to use str_extract() to pull areas of interest out of a column in my data frame, and then count how often each word appears. I'm running into an issue though where when I do this, the data frame I produce has NA listed in one of the rows. This is confusing to me, because I don't know what is causing it or if it is a sign of an error in my code. I'm not sure how to fix this.
Additionally, note that the last item in words is "the table is light", which contains two of the words of interest in this example. I've done this intentionally because I want to make sure that it will be counted twice.
library(tidyverse)
df <- data.frame(words =c("paper book", "food press", "computer monitor", "my fancy speakers",
"my two dogs", "the old couch", "the new couch", "loud speakers",
"wasted paper", "put the dishes away", "set the table", "put it on the table",
"lets go to church", "turn out the lights", "why are the lights on",
"the table is light"))
keep <- c("dogs|paper|table|light|couch")
new_df <- df %>%
mutate(Subject = str_extract(words, keep), n = n()) %>%
group_by(Subject)%>%
summarise(`Word Count` = length(Subject))
This is what I'm getting now
Subject `Word Count`
<chr> <int>
1 couch 2
2 dogs 1
3 light 2
4 paper 2
5 table 3
6 NA 6
So my question is: what is causing the NA row in Subject? Is it all the other records?
The NA appears for rows where none of the words in keep occur, so there is nothing to extract.
library(dplyr)
library(stringr)
df %>% mutate(Subject = str_extract(words, keep))
# words Subject
#1 paper book paper
#2 food press <NA>
#3 computer monitor <NA>
#4 my fancy speakers <NA>
#5 my two dogs dogs
#6 the old couch couch
#7 the new couch couch
#8 loud speakers <NA>
#9 wasted paper paper
#10 put the dishes away <NA>
#11 set the table table
#12 put it on the table table
#13 lets go to church <NA>
#14 turn out the lights light
#15 why are the lights on light
#16 the table is light table
For example, the 2nd row 'food press' contains none of the values from "dogs|paper|table|light|couch", hence str_extract returns NA.
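Note also that str_extract() returns only the first match per string, so a row like "the table is light" contributes a single count. If each occurrence should be counted, as intended in the question, here is a sketch using str_extract_all() plus unnest() on a small toy subset of the data:

```r
library(tidyverse)

# Toy subset: one row with two matches, one with none, one with one
df <- data.frame(words = c("the table is light", "food press", "paper book"))
keep <- "dogs|paper|table|light|couch"

counts <- df %>%
  mutate(Subject = str_extract_all(words, keep)) %>%  # list-column of all matches
  unnest(Subject) %>%                                 # one row per match; no-match rows drop out
  count(Subject, name = "Word Count")
counts
## # A tibble: 3 x 2
##   Subject `Word Count`
## 1 light              1
## 2 paper              1
## 3 table              1
```

Here "the table is light" yields both "table" and "light", while "food press" disappears entirely instead of becoming an NA row.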

Split strings into utterances and assign same-speaker utterances to columns in dataframe

I have multi-party conversations in strings like this:
convers <- "Peter: Hiya Mary: Hi. How w'z your weekend. Peter: a::hh still got a headache. An' you (.) party a lot? Mary: nuh, you know my kid's sick 'n stuff Peter: yeah i know that's=erm al hamshi: hey guys how's it goin'? Peter: Great! Mary: where've you BEn last week al hamshi: ah y' know, camping with my girl friend."
I also have a vector with the speakers' names:
speakers <- c("Peter", "Mary", "al hamshi")
I'd like to create a dataframe with the utterances of each individual speaker in a separate column. At the moment I can only do this piecemeal, addressing each speaker by its index in speakers and then combining the separate results in a list. What I'd really like is a dataframe with a separate column for each speaker:
Peter <- str_extract_all(convers, paste0("(?<=", speakers[1],":\\s).*?(?=\\s*(?:", paste(speakers, collapse="|"),"):|\\z)"))
Mary <- str_extract_all(convers, paste0("(?<=", speakers[2],":\\s).*?(?=\\s*(?:", paste(speakers, collapse="|"),"):|\\z)"))
al_hamshi <- str_extract_all(convers, paste0("(?<=", speakers[3],":\\s).*?(?=\\s*(?:", paste(speakers, collapse="|"),"):|\\z)"))
df <- list(
Peter = Peter, Mary = Mary , al_hamshi = al_hamshi
)
df
$Peter
$Peter[[1]]
[1] "Hiya" "a::hh still got a headache. An' you (.) party a lot?"
[3] "yeah i know that's=erm" "Great!"
$Mary
$Mary[[1]]
[1] "Hi. How w'z your weekend." "nuh, you know my kid's sick 'n stuff" "where've you BEn last week"
$al_hamshi
$al_hamshi[[1]]
[1] "hey guys how's it goin'?" "ah y' know, camping with my girl friend."
How can I extract the same-speaker utterances not one by one but in one go and how can the results be assigned not to a list but a dataframe?
With a bit of pre-processing, and assuming the names exactly match the speakers in the conversation text, you can do:
# Pattern to use to insert new lines in string
pattern <- paste0("(", paste0(speakers, ":", collapse = "|"), ")")
# Split string by newlines
split_conv <- strsplit(gsub(pattern, "\n\\1", convers), "\n")[[1]][-1]
# Capture speaker and text into data frame
dat <- strcapture("(.*?):(.*)", split_conv, data.frame(speaker = character(), text = character()))
Which gives:
speaker text
1 Peter Hiya
2 Mary Hi. How w'z your weekend.
3 Peter a::hh still got a headache. An' you (.) party a lot?
4 Mary nuh, you know my kid's sick 'n stuff
5 Peter yeah i know that's=erm
6 al hamshi hey guys how's it goin'?
7 Peter Great!
8 Mary where've you BEn last week
9 al hamshi ah y' know, camping with my girl friend.
To get each speaker into their own column:
# Count lines by speaker
dat$cnt <- with(dat, ave(speaker, speaker, FUN = seq_along))
# Reshape and rename
dat <- reshape(dat, idvar = "cnt", timevar = "speaker", direction = "wide")
names(dat) <- sub("text\\.", "", names(dat))
cnt Peter Mary al hamshi
1 1 Hiya Hi. How w'z your weekend. hey guys how's it goin'?
3 2 a::hh still got a headache. An' you (.) party a lot? nuh, you know my kid's sick 'n stuff ah y' know, camping with my girl friend.
5 3 yeah i know that's=erm where've you BEn last week <NA>
7 4 Great! <NA> <NA>
If newlines already exist in your text, choose another character that doesn't occur in it to use for splitting the string.
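The reshape() step above can equivalently be sketched with tidyr::pivot_wider(); the long-format data frame below is a toy stand-in for the `dat` produced by strcapture():

```r
suppressPackageStartupMessages({
  library(dplyr)
  library(tidyr)
})

# Toy long-format stand-in for `dat` (speaker/text pairs)
dat <- data.frame(speaker = c("Peter", "Mary", "Peter"),
                  text    = c("Hiya", "Hi. How w'z your weekend.", "Great!"),
                  stringsAsFactors = FALSE)

wide <- dat %>%
  group_by(speaker) %>%
  mutate(cnt = row_number()) %>%   # utterance number within each speaker
  ungroup() %>%
  pivot_wider(names_from = speaker, values_from = text)
wide
## # A tibble: 2 x 3
##     cnt Peter  Mary
## 1     1 Hiya   Hi. How w'z your weekend.
## 2     2 Great! <NA>
```

pivot_wider() pads the shorter speaker columns with NA automatically, mirroring what the `[` trick does in the base R answers.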
You can append :\\s to each speaker name, as you are already doing, then use gregexpr to find the positions where a speaker starts. Extract these with regmatches and strip the appended :\\s again to get the speaker names. Call regmatches once more, this time with invert, to get the utterances. split then groups the utterances by speaker. To bring this into the desired data.frame you have to pad the shorter groups with NA so that all speakers have the same length, done here with [ inside lapply:
x <- gregexpr(paste0(speakers, ":\\s", collapse="|"), convers)
y <- sub(":\\s$", "", regmatches(convers, x)[[1]])
z <- trimws(regmatches(convers, x, TRUE)[[1]][-1])
tt <- split(z, y)
do.call(data.frame, lapply(tt, "[", seq_len(max(lengths(tt)))))
# al.hamshi Mary Peter
#1 hey guys how's it goin'? Hi. How w'z your weekend. Hiya
#2 ah y' know, camping with my girl friend. nuh, you know my kid's sick 'n stuff a::hh still got a headache. An' you (.) party a lot?
#3 <NA> where've you BEn last week yeah i know that's=erm
#4 <NA> <NA> Great!

keeping the best string matched by fuzzy matching in R

I have two dataframes in R. one a dataframe of the phrases I want to match along with their synonyms in another column (df.word), and the other a data frame of the strings I want to match along with codes (df.string). The strings are complicated but to make it easy say we have:
df.word <- data.frame(label = c('warm wet', 'warm dry', 'cold wet'),
synonym = c('hot and drizzling\nsunny and raining','sunny and clear sky\ndry sunny day', 'cold winds and raining\nsnowing'))
df.string <- data.frame(day = c(1,2,3,4),
weather = c('there would be some drizzling at dawn but we will have a hot day', 'today there are cold winds and a bit of raining or snowing at night', 'a sunny and clear sky is what we have today', 'a warm dry day'))
I want to create df.string$extract, containing the best available match for each string:
a column like this
df$extract <- c('warm wet', 'cold wet', 'warm dry', 'warm dry')
Thanks in advance to anyone who can help.
There are a few points that I did not quite understand in your question; however, I am proposing a solution. Check whether it works for you.
I assume that you want to find the best-matching label for each weather text. If so, you can use the stringsim function from library(stringdist) in the following way.
First note: if you clean the \n in your data, the result will be more accurate, so I clean them for this example; you can keep them if you prefer.
Second note: you can change the similarity measure by choosing a different method. Here I use cosine similarity, which is a relatively good starting point. For the alternative methods, please see the function's reference:
?stringsim
The clean data is as follow:
df.word <- data.frame(
label = c("warm wet", "warm dry", "cold wet"),
synonym = c(
"hot and drizzling sunny and raining",
"sunny and clear sky dry sunny day",
"cold winds and raining snowing"
)
)
df.string <- data.frame(
day = c(1, 2, 3, 4),
weather = c(
"there would be some drizzling at dawn but we will have a hot day",
"today there are cold winds and a bit of raining or snowing at night",
"a sunny and clear sky is what we have today",
"a warm dry day"
)
)
Install the library and load it
install.packages('stringdist')
library(stringdist)
Create an n x m matrix that contains the similarity score of each weather text with each synonym. The rows represent the weather texts and the columns the synonym groups.
match.scores <- sapply( ## Create a nested loop with sapply
seq_along(df.word$synonym), ## Loop for each synonym as 'i'
function(i) {
sapply(
seq_along(df.string$weather), ## Loop for each weather as 'j'
function(j) {
stringsim(df.word$synonym[i], df.string$weather[j], ## Check similarity
method = "cosine", ## Method cosine
q = 2 ## Size of the q -gram: 2
)
}
)
}
)
r$> match.scores
[,1] [,2] [,3]
[1,] 0.3657341 0.1919924 0.24629819
[2,] 0.6067799 0.2548236 0.73552828
[3,] 0.3333974 0.6300619 0.21791793
[4,] 0.1460593 0.4485426 0.03688556
Get the best match across the rows for each weather text, find the label with the highest matching score, and add these labels to the data frame.
ranked.match <- apply(match.scores, 1, which.max)
df.string$extract <- df.word$label[ranked.match]
df.string
r$> df.string
day weather extract
1 1 there would be some drizzling at dawn but we will have a hot day warm wet
2 2 today there are cold winds and a bit of raining or snowing at night cold wet
3 3 a sunny and clear sky is what we have today warm dry
4 4 a warm dry day warm dry
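One caveat with which.max is that every row gets a label, even when nothing matches well. A sketch that keeps a label only when the best score clears a threshold (the 0.3 cutoff below is arbitrary and should be tuned on real data; the second text is a deliberately unrelated toy string):

```r
library(stringdist)

labels   <- c("warm wet", "warm dry", "cold wet")
synonyms <- c("hot and drizzling sunny and raining",
              "sunny and clear sky dry sunny day",
              "cold winds and raining snowing")
texts    <- c("a warm dry day",   # matches "warm dry" well
              "qqq xyz qqq")      # shares no bigrams with any synonym

# texts x synonyms similarity matrix (cosine on 2-grams, as above)
scores <- sapply(synonyms, function(s)
  stringsim(texts, s, method = "cosine", q = 2))

best <- apply(scores, 1, max)
idx  <- apply(scores, 1, which.max)
# Keep the label only when the best score clears the (hypothetical) cutoff
extract <- ifelse(best >= 0.3, labels[idx], NA)
extract
## [1] "warm dry" NA
```

This way rows with no plausible match end up as NA instead of being forced into the least-bad label.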

How do I count the number of words from a list mentioned in a data frame in R

I have a data frame with Review and Text columns and multiple rows. I also have a list of words. I want a for loop to examine each row of the data frame and sum the number of words found from the list. I want to keep each row's sum separate and place the results into a new result data frame.
#Data Frame
Review Text
1 I like to run and play.
2 I eat cookies.
3 I went to swim in the pool.
4 I like to sleep.
5 I like to run, play, swim, and eat.
#List Words
Run
Play
Eat
Swim
#Result Data Frame
Review Count
1 2
2 1
3 1
4 0
5 4
Here is a base R solution, where gregexpr is used for counting occurrences.
Given the pattern below,
pat <- c("Run", "Play", "Eat", "Swim")
then the counts added to the data frame can be made via:
df$Count <- sapply(gregexpr(paste0(tolower(pat),collapse = "|"),tolower(df$Text)),
function(v) ifelse(-1 %in% v, 0,length(v)))
such that
> df
Review Text Count
1 1 I like to run and play 2
2 2 I eat cookies 1
3 3 I went to swim in the pool. 1
4 4 I like to sleep. 0
5 5 I like to run, play, swim, and eat. 4
We can use stringr::str_count after pasting the words together as one pattern.
df$Count <- stringr::str_count(df$Text,
paste0("\\b", tolower(words), "\\b", collapse = "|"))
df
# Review Text Count
#1 1 I like to run and play. 2
#2 2 I eat cookies. 1
#3 3 I went to swim in the pool. 1
#4 4 I like to sleep. 0
#5 5 I like to run, play, swim, and eat. 4
data
df <- structure(list(Review = 1:5, Text = structure(c(2L, 1L, 5L, 4L,
3L), .Label = c("I eat cookies.", "I like to run and play.",
"I like to run, play, swim, and eat.", "I like to sleep.",
"I went to swim in the pool."), class = "factor")), class =
"data.frame", row.names = c(NA, -5L))
words <- c("Run","Play","Eat","Swim")
Base R solution (note this solution is intentionally case insensitive):
# Create a vector of patterns to search for:
patterns <- c("Run", "Play", "Eat", "Swim")
# Split on the review number, apply a term counting function (for each review number):
df$term_count <- sapply(split(df, df$Review),
function(x){length(grep(paste0(tolower(patterns), collapse = "|"),
tolower(unlist(strsplit(x$Text, "\\s+")))))})
Data:
df <- data.frame(Review = 1:5, Text = as.character(c("I like to run and play",
"I eat cookies",
"I went to swim in the pool.",
"I like to sleep.",
"I like to run, play, swim, and eat.")),
stringsAsFactors = FALSE)

Better and easier way to find who spoke the top 10 anger words in conversation text

I have a dataframe that contains the variables 'AgentID', 'Type', 'Date', and 'Text'; a subset is as follows:
structure(list(AgentID = c("AA0101", "AA0101", "AA0101", "AA0101",
"AA0101"), Type = c("PS", "PS", "PS", "PS", "PS"), Date = c("4/1/2019", "4/1/2019", "4/1/2019", "4/1/2019", "4/1/2019"), Text = c("I am on social security XXXX and I understand it can not be garnished by Paypal credit because it's federally protected.I owe paypal {$3600.00} I would like them to cancel this please.",
"My XXXX account is being reported late 6 times for XXXX per each loan I was under the impression that I was paying one loan but it's split into three so one payment = 3 or one missed payment would be three missed on my credit,. \n\nMy account is being reported wrong by all credit bureaus because I was in forbearance at the time that these late payments have been reported Section 623 ( a ) ( 2 ) States : If at any time a person who regularly and in the ordinary course of business furnishes information to one or more CRAs determines that the information provided is not complete or accurate, the furnisher must promptly provide complete and accurate information to the CRA. In addition, the furnisher must notify all CRAs that received the information of any corrections, and must thereafter report only the complete and accurate information. \n\nIn this case, I was in forbearance during that tie and document attached proves this. By law, credit need to be reported as of this time with all information and documentation",
"A few weeks ago I started to care for my credit and trying to build it up since I have never used my credit in the past, while checking my I discover some derogatory remarks in my XXXX credit report stating the amount owed of {$1900.00} to XXXX from XX/XX/2015 and another one owed to XXXX for {$1700.00} I would like to address this immediately and either pay off this debt or get this negative remark remove from my report.",
"I disputed this XXXX account with all three credit bureaus, the reported that it was closed in XXXX, now its reflecting closed XXXX once I paid the {$120.00} which I dont believe I owed this amount since it was an fee for a company trying to take money out of my account without my permission, I was charged the fee and my account was closed. I have notified all 3 bureaus to have this removed but they keep saying its correct. One bureau is showing XXXX closed and the other on shows XXXX according to XXXX XXXX, XXXX shows a XXXX, this account has been on my report for seven years",
"On XX/XX/XXXX I went on XXXX XXXX and noticed my score had gone down, went to check out why and seen something from XXXX XXXX and enhanced recovery company ... I also seen that it had come from XXXX and XXXX dated XX/XX/XXXX, XX/XX/XXXX, and XX/XX/XXXX ... I didnt have neither one before, I called and it the rep said it had come from an address Im XXXX XXXX, Florida I have never lived in Florida ever ... .I have also never had XXXX XXXX nor XXXX XXXX ... I need this taken off because it if affecting my credit score ... This is obviously identify theft and fraud..I have never received bills from here which proves that is was not done by me, I havent received any notifications ... if it was not for me checking my score I wouldnt have known nothing of this" )), row.names = c(NA, 5L), class = "data.frame")
First, I found out the top 10 anger words using the following:
library(tm)
library(tidytext)
library(tidyverse)
library(sentimentr)
library(wordcloud)
library(ggplot2)
CS <- function(txt){
MC <- Corpus(VectorSource(txt))
SW <- stopwords('english')
MC <- tm_map(MC, tolower)
MC<- tm_map(MC,removePunctuation)
MC <- tm_map(MC, removeNumbers)
MC <- tm_map(MC, removeWords, SW)
MC <- tm_map(MC, stripWhitespace)
myTDM <- as.matrix(TermDocumentMatrix(MC))
v <- sort(rowSums(myTDM), decreasing=TRUE)
FM <- data.frame(word = names(v), freq=v)
row.names(FM) <- NULL
FM <- FM %>%
mutate(word = tolower(word)) %>%
filter(str_count(word, "x") <= 1)
return(FM)
}
DF <- CS(df$Text)
# using nrc
nrc <- get_sentiments("nrc")
# create final dataset
DF_nrc = DF %>% inner_join(nrc)
And then I created a vector of the top 10 anger words as follows:
TAW <- DF_nrc %>%
filter(sentiment=="anger") %>%
group_by(word) %>%
summarize(freq = mean(freq)) %>%
arrange(desc(freq)) %>%
top_n(10) %>%
select(word)
Next, I want to find which Agent(s) spoke these words most frequently and rank them, but I am confused about how to do that. Should I search for the words one by one and group by agent, or is there a better way? The output I am looking for is something like this:
AgentID Words_Spoken Rank
A0001 theft, dispute, money 1
A0001 theft, fraud, 2
.......
If you are more of a dplyr/tidyverse person, you can take an approach using some dplyr verbs, after converting your text data to a tidy format.
First, let's set up some example data with several speakers, one of whom speaks no anger words. You can use unnest_tokens() to take care of most of your text cleaning steps with its defaults, such as splitting tokens, removing punctuation, etc. Then remove stopwords using anti_join(). I show using inner_join() to find the anger words as a separate step, but you could join these up into one big pipe if you like.
library(tidyverse)
library(tidytext)
my_df <- tibble(AgentID = c("AA0101", "AA0101", "AA0102", "AA0103"),
Text = c("I want to report a theft and there has been fraud.",
"I have taken great offense when there was theft and also poison. It is distressing.",
"I only experience soft, fluffy, happy feelings.",
"I have a dispute with the hateful scorpion, and also, I would like to report a fraud."))
my_df
#> # A tibble: 4 x 2
#> AgentID Text
#> <chr> <chr>
#> 1 AA0101 I want to report a theft and there has been fraud.
#> 2 AA0101 I have taken great offense when there was theft and also poison.…
#> 3 AA0102 I only experience soft, fluffy, happy feelings.
#> 4 AA0103 I have a dispute with the hateful scorpion, and also, I would li…
tidy_words <- my_df %>%
unnest_tokens(word, Text) %>%
anti_join(get_stopwords())
#> Joining, by = "word"
anger_words <- tidy_words %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment == "anger"))
#> Joining, by = "word"
anger_words
#> # A tibble: 10 x 3
#> AgentID word sentiment
#> <chr> <chr> <chr>
#> 1 AA0101 theft anger
#> 2 AA0101 fraud anger
#> 3 AA0101 offense anger
#> 4 AA0101 theft anger
#> 5 AA0101 poison anger
#> 6 AA0101 distressing anger
#> 7 AA0103 dispute anger
#> 8 AA0103 hateful anger
#> 9 AA0103 scorpion anger
#> 10 AA0103 fraud anger
Now you know which anger words each person used, and the next step is to count them up and rank people. The dplyr package has fantastic support for exactly this kind of work. First you want to group_by() the person identifier, then calculate a couple of summarized quantities:
the total number of words (so you can arrange by this)
a pasted-together string of the words used
Afterwards, arrange by the number of words and make a new column that gives you the rank.
anger_words %>%
group_by(AgentID) %>%
summarise(TotalWords = n(),
WordsSpoken = paste0(word, collapse = ", ")) %>%
arrange(-TotalWords) %>%
mutate(Rank = row_number())
#> # A tibble: 2 x 4
#> AgentID TotalWords WordsSpoken Rank
#> <chr> <int> <chr> <int>
#> 1 AA0101 6 theft, fraud, offense, theft, poison, distressi… 1
#> 2 AA0103 4 dispute, hateful, scorpion, fraud 2
Do notice that with this approach, you don't have a zero entry for the person who spoke no anger words; they get dropped at the inner_join(). If you want them in the final data set, you will likely need to join back up with an earlier dataset and use replace_na().
Created on 2019-09-11 by the reprex package (v0.3.0)
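The join-back-and-replace_na() step mentioned above could be sketched like this, using a toy agent roster and a toy summary table in place of the real anger_words output:

```r
suppressPackageStartupMessages({
  library(dplyr)
  library(tidyr)
})

# Toy roster of all agents, including one with no anger words
all_agents <- data.frame(AgentID = c("AA0101", "AA0102", "AA0103"),
                         stringsAsFactors = FALSE)
# Toy stand-in for the summarised anger-word table
ranked <- data.frame(AgentID     = c("AA0101", "AA0103"),
                     TotalWords  = c(6L, 4L),
                     WordsSpoken = c("theft, fraud", "dispute, fraud"),
                     stringsAsFactors = FALSE)

res <- all_agents %>%
  left_join(ranked, by = "AgentID") %>%
  mutate(TotalWords  = replace_na(TotalWords, 0L),
         WordsSpoken = replace_na(WordsSpoken, ""))
res
```

left_join() reintroduces the agents dropped by inner_join(), and replace_na() turns their missing counts into explicit zeros.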
Not the most elegant solution, but here's how you could count the words based on the row number:
library(stringr)
# write a new data.frame retaining the AgentID and Date from the original table
new.data <- data.frame(Agent = df$AgentID, Date = df$Date)
# using a for-loop to go through every row of text in the df provided.
for(i in seq(nrow(new.data))){ # i is the row number in the original df
# write a temporary object (e101) that:
## uses stringr::str_detect to check whether the text in row i, df[i, "Text"], contains each word in TAW$word
## loops str_detect over TAW$word with sapply so each word gets its own boolean check
## returns the matching words via TAW$word[...]
e101 <- TAW$word[sapply(TAW$word, function(x) str_detect(df[i, "Text"], x))]
# write the number of returned words in e101 as a corresponding value in new data.frame
new.data[i, "number_of_TAW"] <- length(e101)
# concatenate the returned words in e101 as a corresponding value in new data.frame
new.data[i, "Words_Spoken"] <- ifelse(length(e101)==0, "", paste(e101, collapse=","))
}
new.data
# Agent Date number_of_TAW Words_Spoken
# 1 AA0101 4/1/2019 0
# 2 AA0101 4/1/2019 0
# 3 AA0101 4/1/2019 2 derogatory,remove
# 4 AA0101 4/1/2019 3 fee,money,remove
# 5 AA0101 4/1/2019 1 theft
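The row-by-row loop above can also be vectorized, since str_detect() recycles a length-one string over a vector of patterns. A sketch with toy text rows standing in for df$Text and a toy word vector standing in for TAW$word:

```r
library(stringr)

taw   <- c("theft", "fraud", "money")           # stand-in for TAW$word
texts <- c("a clear case of theft and fraud",   # toy rows standing in for df$Text
           "nothing relevant here")

# For each text, keep the words it contains (fixed() for literal matching)
hits <- lapply(texts, function(t) taw[str_detect(t, fixed(taw))])

tally <- data.frame(number_of_TAW = lengths(hits),
                    Words_Spoken  = sapply(hits, paste, collapse = ","))
tally
##   number_of_TAW Words_Spoken
## 1             2  theft,fraud
## 2             0
```

lengths() and the paste() collapse replace the per-row assignments in the loop, and rows with no matches naturally get a count of 0 and an empty string.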
