I have a data frame df with a column named strings. The values in this column are sentences.
For example:
id strings
1 "I want to go to school, how about you?"
2 "I like you."
3 "I like you so much"
4 "I like you very much"
5 "I don't like you"
Now, I have a list of stop words:
["I", "don't", "you"]
How can I make another data frame that stores the total number of occurrences of each unique word (except stop words) in the strings column of the previous data frame?
keyword frequency
want 1
to 2
go 1
school 1
how 1
about 1
like 4
so 1
very 1
much 2
My idea is:
Combine the strings in the column into one big string.
Make a list storing the unique words in the big string.
Make a df with one column containing the unique words.
Compute the frequency of each.
But this seems really inefficient and I don't know how to code it.
First, you can create a vector of all words with str_split() and then build a frequency table of the words.
library(stringr)
stop_words <- c("I", "don't", "you")
# create a vector of all words in your df
all_words <- unlist(str_split(df$strings, pattern = " "))
# create a frequency table
word_list <- as.data.frame(table(all_words))
# omit all stop words from the frequency table
word_list[!word_list$all_words %in% stop_words, ]
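Note that splitting on spaces alone leaves punctuation attached ("you?" and "you." would count as different words). A possible refinement, sketched here, is to strip sentence punctuation first, keeping apostrophes so "don't" still matches the stop list:
# strip sentence punctuation before splitting (keep apostrophes so "don't" still matches)
all_words <- unlist(str_split(str_remove_all(df$strings, "[.,!?]"), pattern = " "))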
One way is using tidytext. Here is the code (the tidytext book, Text Mining with R, covers this approach in depth):
library("tidytext")
library("tidyverse")
#> df <- data.frame( id = 1:6, strings = c("I want to go to school", "how about you?",
#> "I like you.", "I like you so much", "I like you very much", "I don't like you"))
df %>%
mutate(strings = as.character(strings)) %>%
unnest_tokens(word, strings) %>% # this tokenizes the strings column and extracts the words
filter(!word %in% c("I", "i", "don't", "you")) %>%
count(word)
#> # A tibble: 11 x 2
#> word n
#> <chr> <int>
#> 1 about 1
#> 2 go 1
#> 3 how 1
#> 4 like 4
#> 5 much 2
#> # ... with 6 more rows
EDIT
All the tokens are converted to lower case, so you either include "i" in the stop words or pass the argument to_lower = FALSE to unnest_tokens().
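For example, a minimal sketch of the case-preserving variant:
df %>%
  mutate(strings = as.character(strings)) %>%
  unnest_tokens(word, strings, to_lower = FALSE) %>%
  filter(!word %in% c("I", "don't", "you")) %>%
  count(word)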
Assuming you have a mystring object and a vector of stopWords, you can do it like this:
# split text into words vector
wordvector = strsplit(mystring, " ")[[1]]
# remove stopwords from the vector
wordvector = wordvector[!wordvector %in% stopWords]
At this point you can turn the frequency table() into a data.frame object:
frequency_df = data.frame(table(wordvector))
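For the data frame in the question, mystring and stopWords could be built like this:
# collapse all sentences into one big string
mystring <- paste(df$strings, collapse = " ")
stopWords <- c("I", "don't", "you")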
Let me know if this can help you.
I'm trying to count the number of times that some pre-specified words appear in a character column, Post.
Now, I want to count all green/sustainable words in each of the posts and add this number as an extra column.
I have manually created a lexicon where all green words have Polarity == 1 and non-green words have Polarity == 0.
How can I do this?
str_count() from stringr can help with this (and with a lot more string-based tasks; see the strings chapter of R for Data Science).
library(stringr)
# Create a reproducible example
dat <- data.frame(Post = c(
"This is a sample post without any target words",
"Whilst this is green!",
"And this is eco-friendly",
"This is green AND eco-friendly!"))
lexicon <- data.frame(Word = c("green", "eco-friendly", "neutral"),
Polarity = c(1, 1, 0))
# Extract relevant words from lexicon
green_words <- lexicon$Word[lexicon$Polarity == 1]
# Create new variable
dat$n_green_words <- str_count(dat$Post, paste(green_words, collapse = "|"))
dat
Output:
#> Post n_green_words
#> 1 This is a sample post without any target words 0
#> 2 Whilst this is green! 1
#> 3 And this is eco-friendly 1
#> 4 This is green AND eco-friendly! 2
Created on 2022-07-15 by the reprex package (v2.0.1)
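One caveat worth noting (an addition, not from the original answer): str_count() counts substring matches, so "green" would also be counted inside a word like "greenhouse". Wrapping each lexicon word in \b word boundaries guards against that:
# anchor each lexicon word at word boundaries to avoid partial matches
pattern <- paste0("\\b", green_words, "\\b", collapse = "|")
dat$n_green_words <- str_count(dat$Post, pattern)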
I need support with RegEx filtering!
I have a list of keywords and many rows that should be checked.
In this example, the keyword "-book-" can appear (1) in the middle of the sentence or (2) at the end, in which case the final hyphen is not present.
I need a regex that identifies both "-book-" and "-book".
I don't want similar keywords like "-booking-" to be matched.
library(dplyr)
keywords = c("-album-", "-book-", "-castle-")
search_terms = paste(keywords, collapse = "|")
number = c(1:5)
sentences = c("the-best-album-in-shop", "this-book-is-fantastic", "that-is-the-best-book", "spacespacespace", "unwanted-sentence-with-booking")
data = data.frame(number, sentences)
output = data %>% filter(grepl(search_terms, sentences))
# Current output:
number sentences
1 1 the-best-album-in-shop
2 2 this-book-is-fantastic
# DESIRED output:
number sentences
1 1 the-best-album-in-shop
2 2 this-book-is-fantastic
3 3 that-is-the-best-book
You could also do:
subset(data, grepl(paste0(sprintf("%s?\\b",keywords),collapse = "|"), sentences))
number sentences
1 1 the-best-album-in-shop
2 2 this-book-is-fantastic
3 3 that-is-the-best-book
Note that this only matches -book- in the middle of the sentence or -book at the end, not book- at the beginning.
The -book- pattern matches the whole word book with a hyphen on both the left and the right.
To match the whole word book with a hyphen on the left or the right, you need the alternation \bbook-|-book\b.
Thus, you can use
keywords = c( "-album-", "\\bbook-", "-book\\b", "-castle-" )
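Plugged into the pipeline from the question, this becomes:
search_terms = paste(keywords, collapse = "|")
data %>% filter(grepl(search_terms, sentences))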
Another solution you can take into account:
library(stringr)
data %>%
filter(str_detect(sentences, regex("-castle-|-album-|-book$|-book-\\w{1,}")))
# number sentences
# 1 1 the-best-album-in-shop
# 2 2 this-book-is-fantastic
# 3 3 that-is-the-best-book
example <- data.frame(
file_name = c("some_file_name_first_2020.csv",
"some_file_name_second_and_third_2020.csv",
"some_file_name_4_2020_update.csv"),
a = 1:3
)
example
#> file_name a
#> 1 some_file_name_first_2020.csv 1
#> 2 some_file_name_second_and_third_2020.csv 2
#> 3 some_file_name_4_2020_update.csv 3
I have a dataframe that looks something like this example. The "some_file_name" part changes often, the unique identifier is usually in the middle, and sometimes there is suffixed information that is important to retain.
I would like to end up with the dataframe below. The approach I can think of is finding all common string "components" and removing them from each row.
desired
#> file_name a
#> 1 first 1
#> 2 second_and_third 2
#> 3 4_update 3
This works for the example shared; perhaps you can use it to build a more general solution:
#split the data on "_" or "."
list_data <- strsplit(example$file_name, '_|\\.')
#Get the words that occur only once
unique_words <- names(Filter(function(x) x==1, table(unlist(list_data))))
#Keep only unique_words and paste the string back.
sapply(list_data, function(x) paste(x[x %in% unique_words], collapse = "_"))
#[1] "first" "second_and_third" "4_update"
However, this answer relies on the fact that you would have separators like "_" in the filenames to detect each "component".
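The same logic, wrapped into a reusable function (a sketch; the function name and default split pattern are assumptions):
strip_common_parts <- function(x, split = '_|\\.') {
  parts <- strsplit(x, split)
  # keep only the components that occur exactly once across all names
  keep <- names(Filter(function(n) n == 1, table(unlist(parts))))
  sapply(parts, function(p) paste(p[p %in% keep], collapse = "_"))
}
strip_common_parts(example$file_name)
#[1] "first"            "second_and_third" "4_update"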
Consider this simple example
library(stringr)
library(dplyr)
dataframe <- data_frame(text = c('how is the biggest ??',
'really amazing stuff'))
# A tibble: 2 x 1
text
<chr>
1 how is the biggest ??
2 really amazing stuff
I need to extract some terms based on a regex, but only keep the longest match.
So far, I have only been able to extract the first match (not necessarily the longest) using str_extract.
> dataframe %>% mutate(mymatch = str_extract(text, regex('\\w+')))
# A tibble: 2 x 2
text mymatch
<chr> <chr>
1 how is the biggest ?? how
2 really amazing stuff really
I tried to play with str_extract_all but I can't find an efficient syntax.
Output should be:
# A tibble: 2 x 2
text mymatch
<chr> <chr>
1 how is the biggest ?? biggest
2 really amazing stuff amazing
Any ideas?
Thanks!
You can do something like this:
library(stringr)
library(dplyr)
dataframe %>%
mutate(mymatch = sapply(str_extract_all(text, '\\w+'),
function(x) x[nchar(x) == max(nchar(x))][1]))
With purrr:
library(purrr)
dataframe %>%
mutate(mymatch = map_chr(str_extract_all(text, '\\w+'),
~ .[nchar(.) == max(nchar(.))][1]))
Result:
# A tibble: 2 x 2
text mymatch
<chr> <chr>
1 how is the biggest ?? biggest
2 really amazing stuff amazing
Note:
If there is a tie, this takes the first one.
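If you wanted the last one instead, a hypothetical variant of the same idea:
dataframe %>%
  mutate(mymatch = sapply(str_extract_all(text, '\\w+'),
                          function(x) tail(x[nchar(x) == max(nchar(x))], 1)))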
Data:
dataframe <- data_frame(text = c('how is the biggest ??',
'really amazing biggest stuff'))
An easy way is to break the process down into two steps: first build a list of all the words in each row, then find and return the longest word from each sub-list:
df <- data_frame(text = c('how is the biggest ??',
'really amazing stuff'))
library(stringr)
#create a list of all words per row
splits<-str_extract_all(df$text, '\\w+', simplify = FALSE)
#find longest word and return it
sapply(splits, function(x) {x[which.max(nchar(x))]})
As a variant of other answers, I'd suggest writing a function that does the manipulation
longest_match <- function(x, pattern) {
matches <- str_match_all(x, pattern)
purrr::map_chr(matches, ~ .[which.max(nchar(.))])
}
Then use it
dataframe %>%
mutate(mymatch = longest_match(text, "\\w+"))
By way of commentary, it seems better practice to isolate the function that does the new work, longest_match(), from the manipulations enabled by mutate(). The function is easy to test, can be reused in other circumstances, and can be modified ('return the last rather than the first longest match') independently of the data transformation step. There's no real value in sticking everything into one line, so it makes sense to write lines of code that each accomplish one logical thing: find all matches, then map from all matches to the longest. purrr::map_chr() is better than sapply() because it is more robust: it guarantees that the result is a character vector, so that something like
> df1 = dataframe[FALSE,]
> df1 %>% mutate(mymatch = longest_match(text, "\\w+"))
# A tibble: 0 x 2
# ... with 2 variables: text <chr>, mymatch <chr>
'does the right thing', i.e., mymatch is a character vector (sapply() would return a list in this case).
Or, using purrr...
library(dplyr)
library(purrr)
library(stringr)
dataframe %>% mutate(mymatch=map_chr(str_extract_all(text,"\\w+"),
~.[which.max(nchar(.))]))
# A tibble: 2 x 2
text mymatch
<chr> <chr>
1 how is the biggest ?? biggest
2 really amazing stuff amazing
I have a character list. I would like to return rows in a df that contain any of the strings in the list in a given column.
I have tried things like:
hits <- df %>%
filter(column, any(strings))
strings <- c("ape", "bat", "cat")
head(df$column)
[1] "ape and some other text here"
[2] "just some random text"
[3] "Something about cats"
I would like only rows 1 and 3 returned
Thanks in advance for the help.
Use grepl() with a regular expression matching any of the strings in your strings vector:
strings <- c("ape", "bat", "cat")
Firstly, you can collapse the strings vector to the regex you need:
regex <- paste(strings, collapse = "|")
Which gives:
> regex <- paste(strings, collapse = "|")
> regex
[1] "ape|bat|cat"
The pipe symbol | acts as an or operator, so this regex ape|bat|cat will match ape or bat or cat.
If your data.frame df looks like this:
> df
# A tibble: 3 x 1
column
<chr>
1 ape and some other text here
2 just some random text
3 something about cats
Then you can run the following line of code to return just the rows matching your desired strings:
df[grepl(regex, df$column), ]
The output is as follows:
> df[grepl(regex, df$column), ]
# A tibble: 2 x 1
column
<chr>
1 ape and some other text here
2 something about cats
Note that the above example is case-sensitive: it will only match the lower-case strings exactly as specified. You can overcome this easily using the ignore.case parameter of grepl() (note the upper-case Cats):
> df[grepl(regex, df$column, ignore.case = TRUE), ]
# A tibble: 2 x 1
column
<chr>
1 ape and some other text here
2 something about Cats
This can be accomplished with a regular expression.
aColumn <- c("ape and some other text here","just some random text","Something about cats")
aColumn[grepl("ape|bat|cat",aColumn)]
...and the output:
> aColumn[grepl("ape|bat|cat",aColumn)]
[1] "ape and some other text here" "Something about cats"
One can also set up the regular expression in an R object, as follows.
# use with a variable
strings <- "ape|cat|bat"
aColumn[grepl(strings,aColumn)]
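One caveat (an addition, not from the answers above): if any of the strings contain regex metacharacters (e.g. "c.t"), collapsing them into a pattern can match unintended text. Matching each string literally and combining the results avoids this; a sketch using the strings vector from the first answer:
strings <- c("ape", "bat", "cat")
# match each string literally (fixed = TRUE) and combine the results with OR
hits <- Reduce(`|`, lapply(strings, function(s) grepl(s, aColumn, fixed = TRUE)))
aColumn[hits]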