R: extract and paste keyword matches

I am new to R and have been struggling with this one. I want to create a new column that checks whether any of a set of words ("foo", "x", "y") exists in column 'text', then writes the matching words to the new column.
I have a data frame that looks like this: a->
id text time username
1 "hello x" 10 "me"
2 "foo and y" 5 "you"
3 "nothing" 15 "everyone"
4 "x,y,foo" 0 "know"
The correct output should be:
a2 ->
id text time username keywordtag
1 "hello x" 10 "me" x
2 "foo and y" 5 "you" foo,y
3 "nothing" 15 "everyone" 0
4 "x,y,foo" 0 "know" x,y,foo
I have this:
df1 <- data.frame(text = c("hello x", "foo and y", "nothing", "x,y,foo"))
terms <- c('foo', 'x', 'y')
df1$keywordtag <- apply(sapply(terms, grepl, df1$text), 1, function(x) paste(terms[x], collapse=','))
This works, but it crashes R when my list of terms contains 12k words and my text has 155k rows. Is there a way to do this that won't crash R?

This is a variation on what you have done, and what was suggested in the comments. This uses dplyr and stringr. There may be a more efficient way but this may not crash your R session.
library(dplyr)
library(stringr)
terms <- c('foo', 'x', 'y')
term_regex <- paste0('(', paste(terms, collapse = '|'), ')')
### Solution: this uses dplyr::mutate and stringr::str_extract_all
df1 %>%
mutate(keywordtag = sapply(str_extract_all(text, term_regex), function(x) paste(x, collapse=',')))
# text keywordtag
#1 hello x x
#2 foo and y foo,y
#3 nothing
#4 x,y,foo x,y,foo
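For the large case in the question (12k terms, 155k rows), one option that avoids building a rows-by-terms logical matrix is to tokenize each text once and keep only the tokens that appear in the term list. The following is a minimal base R sketch, assuming every term is a single word and should match whole tokens only (so 'x' would not match inside "next", unlike the grepl and regex approaches above). It has not been benchmarked at that scale, and the `tokens` variable is introduced here purely for illustration.

```r
df1 <- data.frame(text = c("hello x", "foo and y", "nothing", "x,y,foo"),
                  stringsAsFactors = FALSE)
terms <- c("foo", "x", "y")

# Split each text on runs of non-alphanumeric characters (spaces, commas, ...)
tokens <- strsplit(df1$text, "[^[:alnum:]]+")

# For each row, keep the tokens found in `terms`, in order of appearance
df1$keywordtag <- vapply(
  tokens,
  function(w) paste(unique(w[w %in% terms]), collapse = ","),
  character(1)
)

df1
```

Rows without a match get an empty string, as in the stringr answer; replace it with 0 afterwards if that is the required placeholder.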

Related

R: Count the frequency of every unique character in a column

I have a data frame df which contains a column named strings. The values in this column are some sentences.
For example:
id strings
1 "I want to go to school, how about you?"
2 "I like you."
3 "I like you so much"
4 "I like you very much"
5 "I don't like you"
Now, I have a list of stop words:
["I", "don't", "you"]
How can I make another data frame that stores the total number of occurrences of each unique word (except the stop words) in the column of the previous data frame?
keyword frequency
want 1
to 2
go 1
school 1
how 1
about 1
like 4
so 1
very 1
much 2
My idea is that:
combine the strings in the column to a big string.
Make a list storing the unique character in the big string.
Make the df whose one column is the unique words.
Compute the frequency.
But this seems really inefficient and I don't know how to really code this.
First, you can create a vector of all words with str_split, and then create a frequency table of those words.
library(stringr)
stop_words <- c("I", "don't", "you")
# create a vector of all words in your df
all_words <- unlist(str_split(df$strings, pattern = " "))
# create a frequency table
word_list <- as.data.frame(table(all_words))
# omit all stop words from the frequency table
word_list[!word_list$all_words %in% stop_words, ]
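Splitting on a single space keeps punctuation attached to words ("school," and "you?"), so those tokens would not match the desired output. A hedged base R variant is sketched below: it splits on runs of characters that are neither letters nor apostrophes. The data frame is recreated from the question, and the column names `keyword` and `frequency` are taken from the desired output; treat the splitting rule as an assumption about what counts as a word.

```r
df <- data.frame(
  id = 1:5,
  strings = c("I want to go to school, how about you?",
              "I like you.",
              "I like you so much",
              "I like you very much",
              "I don't like you"),
  stringsAsFactors = FALSE
)
stop_words <- c("I", "don't", "you")

# Split on anything that is not a letter or an apostrophe (keeps "don't" whole)
all_words <- unlist(strsplit(df$strings, "[^[:alpha:]']+"))

# Drop empty tokens and stop words before counting
all_words <- all_words[nzchar(all_words) & !all_words %in% stop_words]

# Frequency table as a data frame with the column names from the question
word_list <- as.data.frame(table(keyword = all_words),
                           responseName = "frequency",
                           stringsAsFactors = FALSE)
word_list
```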
One way is to use tidytext. Here is the code (the package also has an accompanying book):
library("tidytext")
library("tidyverse")
#> df <- data.frame( id = 1:6, strings = c("I want to go to school", "how about you?",
#> "I like you.", "I like you so much", "I like you very much", "I don't like you"))
df %>%
mutate(strings = as.character(strings)) %>%
unnest_tokens(word, strings) %>% # this tokenizes the strings and extracts the words
filter(!word %in% c("I", "i", "don't", "you")) %>%
count(word)
#> # A tibble: 11 x 2
#> word n
#> <chr> <int>
#> 1 about 1
#> 2 go 1
#> 3 how 1
#> 4 like 4
#> 5 much 2
EDIT
All the tokens are converted to lower case, so you either include i in the stop words or add the argument to_lower = FALSE to unnest_tokens.
Assuming you have a mystring object and a vector of stopWords, you can do it like this:
# split text into words vector
words = strsplit(mystring, " ")[[1]]
# remove stopwords from the vector
words = words[!words %in% stopWords]
At this point you can turn a frequency table() into a dataframe object:
frequency_df = data.frame(table(words))
Let me know if this can help you.

Remove first n words and take count

I have a data frame with a text column. I need to ignore or remove the first 2 words of each string and then count the resulting strings in that column.
b <- data.frame(text = c("hello sunitha what can I do for you?",
"hi john what can I do for you?"))
Expected output for data frame 'b': how can we remove the first 2 words, so that the count of 'what can I do for you?' is 2?
You can use gsub to remove the first two words and then tapply and count, i.e.
i1 <- gsub("^\\w*\\s*\\w*\\s*", "", b$text)
tapply(i1, i1, length)
#what can I do for you?
# 2
If you need to remove any range of words, we can amend i1 as follows,
i1 <- sapply(strsplit(as.character(b$text), ' '), function(i)paste(i[-c(2:4)], collapse = ' '))
tapply(i1, i1, length)
#hello I do for you? hi I do for you?
# 1 1
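The same idea can be generalized to dropping the first n words. The sketch below uses a hypothetical helper, `drop_first_words()` (not part of any package), which builds the pattern with sprintf. It assumes words are separated by whitespace; strings with n or fewer words are returned unchanged because the pattern does not match.

```r
b <- data.frame(text = c("hello sunitha what can I do for you?",
                         "hi john what can I do for you?"),
                stringsAsFactors = FALSE)

# Hypothetical helper: drop the first n whitespace-separated words of each string
drop_first_words <- function(x, n) {
  # n repetitions of "a run of non-spaces followed by whitespace" at the start
  pattern <- sprintf("^\\s*(\\S+\\s+){%d}", n)
  sub(pattern, "", x, perl = TRUE)
}

trimmed <- drop_first_words(b$text, 2)

# Count how often each remaining string occurs
counts <- table(trimmed)
counts
```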
library(magrittr) # provides the %>% pipe used below
b=data.frame(text=c("hello sunitha what can I do for you?","hi john what can I do for you?"),stringsAsFactors = FALSE)
b$processed = sapply(b$text, function(x) (strsplit(x," ")[[1]]%>%.[-c(1:2)])%>%paste0(.,collapse=" "))
b$count = sapply(b$processed, function(x) length(strsplit(x," ")[[1]]))
> b
text processed count
1 hello sunitha what can I do for you? what can I do for you? 6
2 hi john what can I do for you? what can I do for you? 6
Are you looking for something like this? Watch out for stringsAsFactors = FALSE; otherwise your texts will be of type factor and harder to work with.

Have a specific list, find how many times each term exists

If I have a specific list with terms like this:
df_specific <- data.frame(terms = c("hi", "why here", "see you soon"))
and a data frame with text
df_text <- data.frame(text = c("hi my name is", "why here you are",
"hi see you later", "I hope to see you soon"))
How can I use the first list as an index to find how many times each term occurs in df_text?
Example of expected output:
term num
hi 2
why here 1
see you soon 1
You can use grepl for an individual term and use sapply to map that across all of your test terms.
sapply(df_specific$terms, function(x) sum(grepl(x, df_text$text)))
[1] 2 1 1
If you want to get the specific format that you listed, just cbind the previous result onto df_specific
num = sapply(df_specific$terms, function(x) sum(grepl(x, df_text$text)))
cbind(df_specific, num)
terms num
1 hi 2
2 why here 1
3 see you soon 1
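Two details are worth making explicit here. First, grepl counts the number of texts that contain a term, not the number of occurrences. Second, it treats each term as a regular expression unless told otherwise. The sketch below is a hedged variant that matches the terms literally with fixed = TRUE and builds the result with the column names from the expected output. Note that matching is still by substring, so "hi" would also match inside a word such as "this".

```r
df_specific <- data.frame(terms = c("hi", "why here", "see you soon"),
                          stringsAsFactors = FALSE)
df_text <- data.frame(text = c("hi my name is", "why here you are",
                               "hi see you later", "I hope to see you soon"),
                      stringsAsFactors = FALSE)

# For each term, count the texts that contain it as a literal substring
num <- vapply(
  df_specific$terms,
  function(term) sum(grepl(term, df_text$text, fixed = TRUE)),
  integer(1),
  USE.NAMES = FALSE
)

result <- data.frame(term = df_specific$terms, num = num)
result
```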
Overview
Using the tidyverse, I supplied each value in df_specific$terms as the pattern to test for its existence in all the values in df_text$text via map_df() and str_count().
# load necessary packages ----
library(tidyverse)
# load necessary data ------
df_specific <- tibble(terms = c("hi", "why here", "see you soon"))
df_text <- tibble(text = c("hi my name is"
, "why here you are"
, "hi see you later"
, "I hope to see you soon"))
# perform analysis --------
df_specific %>%
pull(terms) %>%
set_names() %>%
# for each value in df_text$text
# count how many times .x appears in the vector
map_df(.f = ~ str_count(string = df_text$text
, pattern = .x) %>% sum()) %>%
# transform data from wide to long
gather(key = "term", value = "num")
# A tibble: 3 x 2
# term num
# <chr> <int>
# 1 hi 2
# 2 why here 1
# 3 see you soon 1
# end of script #

Tie melted table object back to original dataframe?

I am trying to count the number of times each word in a row in a dataframe occurs at a given time. Here is my dataframe:
library(stringr)
df <- data.frame("Corpus" = c("this is some text",
"here is some more text text",
"more food for everyone",
"less for no one",
"something text here is some more text",
"everyone should go home",
"more random text",
"random text more more more",
"plenty of random text",
"the final piece of random everyone text"),
"Class" = c("X", "Y", "Y", "Y", "Y",
"Y", "Y", "Z",
"Z", "Z"),
"OpenTime" = c("12/01/2016 10:45:00", "11/07/2016 10:32:00",
"11/15/2015 01:45:00", "08/23/2012 1:23:00",
"12/17/2016 11:45:00", "12/16/2016 9:47:00",
"04/11/2015 04:23:00", "11/27/2016 12:12:00",
"08/25/2015 10:46:00", "09/27/2016 10:46:00"))
I am trying to get this result:
Class OpenTime Word Frequency
X 12/01/2016 10:45:00 this 1
X 12/01/2016 10:45:00 is 1
X 12/01/2016 10:45:00 some 1
X 12/01/2016 10:45:00 text 1
Y 11/07/2016 10:32:00 here 1
Y 11/07/2016 10:32:00 is 1
Y 11/07/2016 10:32:00 some 1
Y 11/07/2016 10:32:00 more 1
Y 11/07/2016 10:32:00 text 2
...
I'd love to do this all with groupby in dplyr, but I haven't yet got that to work. Instead, this is what I've tried:
splits <- strsplit(as.character(df$Corpus), split = " ")
counts <- lapply(splits, table)
counts.melted <- lapply(counts, melt)
This gives me the transposed view I want:
> counts.melted
[[1]]
Var1 value
1 is 1
2 some 1
3 text 1
4 this 1
[[2]]
Var1 value
1 here 1
2 is 1
3 more 1
4 some 1
5 text 2
...
But how can I tie that list of melted tables back to the original data to produce the desired output above? I tried using rep to repeat the Class value for as many words as there were in each row, but have had little success. It would be easy to do all of this in a for loop, but I would much rather do this using vectorised methods like lapply.
out.df <- data.frame("RRN" = NULL, "OpenTime" = NULL,
"Word" = NULL, "Frequency" = NULL)
For those coming here in the future, I was able to vectorize most of the solution to my problem. Unfortunately, I'm still looking for ways to use lapply instead of the for loop below, but this does exactly what I want:
library(reshape2) # provides melt()
# split each row in the corpus column on spaces
splits <- strsplit(as.character(df$Corpus), split = " ")
# count the number of times each word in a row appears in that row
counts <- lapply(splits, table)
# melt that table to make things more palatable
counts.melted <- lapply(counts, melt)
# the result data frame to which we'll append our results
out.df <- data.frame("Class" = c(), "OpenTime" = c(),
"Word" = c(), "Frequency" = c())
# it would be better to vectorize this, using something like lapply
for(idx in seq_along(counts.melted)){
# coerce the melted table at that index to a data frame
count.df <- as.data.frame(counts.melted[[idx]])
# change the column names
names(count.df) <- c("Word", "Frequency")
# repeat the Class and time for that row to fill in those columns
count.df[, 'Class'] <- rep(as.character(df[idx, "Class"]), nrow(count.df))
count.df[, 'OpenTime'] <- rep(as.character(df[idx, "OpenTime"]), nrow(count.df))
# append the results
out.df <- rbind(out.df, count.df)
}
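The for loop above can be replaced by lapply over the row indices followed by a single rbind, which also avoids growing out.df one row at a time. This is a minimal base R sketch that skips melt entirely; it uses only the first two rows of the question's data to stay short, and `per_row` is an illustrative name.

```r
df <- data.frame(
  Corpus = c("this is some text", "here is some more text text"),
  Class = c("X", "Y"),
  OpenTime = c("12/01/2016 10:45:00", "11/07/2016 10:32:00"),
  stringsAsFactors = FALSE
)

# split each row in the corpus column on spaces
splits <- strsplit(df$Corpus, " ", fixed = TRUE)

# build one small data frame per row, carrying Class and OpenTime along
per_row <- lapply(seq_along(splits), function(i) {
  counts <- table(splits[[i]])
  data.frame(Class = df$Class[i],
             OpenTime = df$OpenTime[i],
             Word = names(counts),
             Frequency = as.vector(counts),
             stringsAsFactors = FALSE)
})

# bind all per-row results together in one step
out.df <- do.call(rbind, per_row)
out.df
```

Scalar values such as df$Class[i] are recycled by data.frame(), so no explicit rep() is needed.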

R remove multiple text strings in data frame

New to R. I am looking to remove certain words from a data frame. Since there are multiple words, I would like to define this list of words as a string, and use gsub to remove. Then convert back to a dataframe and maintain same structure.
wordstoremove <- c("ai", "computing", "ulitzer", "ibm", "privacy", "cognitive")
a
id text time username
1 "ai and x" 10 "me"
2 "and computing" 5 "you"
3 "nothing" 15 "everyone"
4 "ibm privacy" 0 "know"
I was thinking something like:
a2 <- apply(a, 1, gsub(wordstoremove, "", a)
but clearly this doesn't work, and I would still need to convert the result back to a data frame.
wordstoremove <- c("ai", "computing", "ulitzer", "ibm", "privacy", "cognitive")
(dat <- read.table(header = TRUE, text = 'id text time username
1 "ai and x" 10 "me"
2 "and computing" 5 "you"
3 "nothing" 15 "everyone"
4 "ibm privacy" 0 "know"'))
# id text time username
# 1 1 ai and x 10 me
# 2 2 and computing 5 you
# 3 3 nothing 15 everyone
# 4 4 ibm privacy 0 know
(dat1 <- as.data.frame(sapply(dat, function(x)
gsub(paste(wordstoremove, collapse = '|'), '', x))))
# id text time username
# 1 1 and x 10 me
# 2 2 and 5 you
# 3 3 nothing 15 everyone
# 4 4 0 know
Another option using dplyr::mutate() and stringr::str_remove_all():
library(dplyr)
library(stringr)
dat <- dat %>%
mutate(text = str_remove_all(text, regex(str_c("\\b",wordstoremove, "\\b", collapse = '|'), ignore_case = TRUE)))
Because lowercase 'ai' could easily be part of a longer word, the words to remove are bounded with \\b so that they are not removed from the beginning, middle, or end of other words.
The search pattern is also wrapped with regex(pattern, ignore_case = TRUE) in case some words are capitalized in the text string.
str_replace_all() could be used if you wanted to replace the words with something other than just removing them. str_remove_all() is just an alias for str_replace_all(string, pattern, '').
rawr's answer could be updated to:
dat1 <- as.data.frame(sapply(dat, function(x)
gsub(paste0('\\b', wordstoremove, '\\b', collapse = '|'), '', x, ignore.case = TRUE)))
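Applying sapply to the whole data frame converts every column to character. If only the text column needs cleaning, the removal can be restricted to that column, which keeps the other column types intact. The sketch below is a base R version under that assumption; `remove_words()` is a hypothetical helper introduced for illustration, and it also collapses the whitespace left behind by removed words.

```r
wordstoremove <- c("ai", "computing", "ulitzer", "ibm", "privacy", "cognitive")
dat <- data.frame(id = 1:4,
                  text = c("ai and x", "and computing", "nothing", "ibm privacy"),
                  time = c(10, 5, 15, 0),
                  username = c("me", "you", "everyone", "know"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: remove whole words (case-insensitive), then tidy whitespace
remove_words <- function(x, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  cleaned <- gsub(pattern, "", x, ignore.case = TRUE, perl = TRUE)
  trimws(gsub("\\s+", " ", cleaned))
}

# Only the text column is modified; id and time stay numeric
dat$text <- remove_words(dat$text, wordstoremove)
dat
```

Because of the \\b boundaries, a string such as "email chain" is left untouched even though both words contain "ai".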
