I am trying to separate my text data into tokens using the unnest_tokens function from the tidytext package. The thing is that some expressions appear multiple times, and I would like to keep them as a single token instead of multiple tokens.
Normal outcome:
df <- data.frame(
  Id = c(1, 2),
  Text = c('A first nice text', 'A second nice text')
)
df %>%
  unnest_tokens(Word, Text)
Id Word
1 1 a
2 1 first
3 1 nice
4 1 text
5 2 a
6 2 second
7 2 nice
8 2 text
What I would like (expression = "nice text"):
df <- data.frame(
  Id = c(1, 2),
  Text = c('A first nice text', 'A second nice text')
)
df %>%
  unnest_tokens(Word, Text)
Id Word
1 1 a
2 1 first
3 1 nice text
4 2 a
5 2 second
6 2 nice text
Here's a concise solution based on lookarounds: a negative lookbehind (?<!...) and a negative lookahead (?!...) disallow separate_rows from splitting Text on a whitespace \\s that has nice to its left and text to its right (\\b are word boundary anchors, in case you have, say, "nice texts", which you do want to separate):
library(tidyr)
df %>%
  separate_rows(Text, sep = "(?<!\\bnice)\\s|\\s(?!\\btext\\b)")
# A tibble: 6 × 2
Id Text
<dbl> <chr>
1 1 A
2 1 first
3 1 nice text
4 2 A
5 2 second
6 2 nice text
A more advanced regex uses the PCRE control verbs (*SKIP)(*F):
df %>%
  separate_rows(Text, sep = "(\\bnice text\\b)(*SKIP)(*F)|\\s")
For more info: How do (*SKIP) or (*F) work on regex?
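These control verbs come from the PCRE engine, so a quick way to see them in action (a minimal sketch) is base R's strsplit() with perl = TRUE: any match of \\bnice text\\b is skipped and failed, leaving only bare whitespace as a split point.

```r
# Split on whitespace, except where "nice text" matches:
# (*SKIP)(*F) consumes the phrase and fails, so the regex never
# treats the space inside it as a separator.
strsplit("A first nice text", "(\\bnice text\\b)(*SKIP)(*F)|\\s", perl = TRUE)
# [[1]]
# [1] "A"         "first"     "nice text"
```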
A bit verbose, and there might be an option to exclude certain phrases in unnest_tokens itself, but it does the trick:
library(tidyverse)
library(tidytext)
df <- data.frame(Id = c(1, 2),
                 Text = c('A first nice text', 'A second nice text')) %>%
  unnest_tokens('Word', Text)
df %>%
  group_by(Id) %>%
  summarize(Word = paste(if_else(lag(Word, default = '') == 'nice' & Word == 'text', 'nice text', Word))) %>%
  mutate(temp_id = row_number()) %>%
  filter(temp_id != temp_id[Word == 'nice text'] - 1) %>%
  ungroup() %>%
  select(-temp_id)
which gives:
# A tibble: 6 x 2
Id Word
<dbl> <chr>
1 1 a
2 1 first
3 1 nice text
4 2 a
5 2 second
6 2 nice text
Related
I have a vector A <- c("Tom; Jerry", "Lisa; Marc")
and am trying to identify the number of occurrences of each name.
I already used the code:
sort(table(unlist(strsplit(A, ""))), decreasing = TRUE)
However, this code only produces output like this:
Tom; Jerry: 1 - Lisa; Marc: 1
I am looking for a way to count every name, despite the fact that two names are present in one cell. Consequently, my preferred result would be:
Tom: 1 Jerry: 1 Lisa: 1 Marc: 1
The split should be on ; followed by zero or more spaces (\\s*):
sort(table(unlist(strsplit(A, ";\\s*"))), decreasing = TRUE)
-output
Jerry Lisa Marc Tom
1 1 1 1
Use separate_rows to split the strings, group_by the names and summarise them:
library(tidyverse)
data.frame(A) %>%
  separate_rows(A, sep = "; ") %>%
  group_by(A) %>%
  summarise(N = n())
# A tibble: 4 × 2
A N
<chr> <int>
1 Jerry 1
2 Lisa 1
3 Marc 1
4 Tom 1
This is my dataframe:
df <- tibble(col1 = c("1. word","2. word","3. word","4. word","5. N. word","6. word","7. word","8. word"))
I need to split it into two columns using the separate function, naming one Number and the other Words. I've tried this but it's not working:
df %>% separate(col = col1 , into = c('Number','Words'), sep = "^. ")
The problem is that the fifth row has two dots, and I don't know how to handle this in the regex.
Any help?
Here is an alternative using readr's parse_number and a regex:
library(dplyr)
library(readr)
df %>%
  mutate(Numbers = parse_number(col1), .before = 1) %>%
  mutate(col1 = gsub('\\d+\\. ', '', col1))
Numbers col1
<dbl> <chr>
1 1 word
2 2 word
3 3 word
4 4 word
5 5 N. word
6 6 word
7 7 word
8 8 word
A tidyverse approach would be to first clean the data then separate.
df %>%
  mutate(col1 = gsub("\\s.*(?=word)", "", col1, perl = TRUE)) %>%
  tidyr::separate(col1, into = c("Number", "Words"), sep = "\\.")
Result:
# A tibble: 8 x 2
Number Words
<chr> <chr>
1 1 word
2 2 word
3 3 word
4 4 word
5 5 word
6 6 word
7 7 word
8 8 word
I'm assuming that you would like to keep the cumbersome "N." in the result. For that, my advice is to use extract instead of separate:
df %>%
  extract(
    col = col1,
    into = c('Number', 'Words'),
    regex = "([0-9]+)\\. (.*)"
  )
The regular expression ([0-9]+)\\. (.*) means that you first look for a number, which goes into the first column, followed by a dot and a space (\\. ) that are discarded; the rest goes into the second column.
The result:
# A tibble: 8 × 2
Number Words
<chr> <chr>
1 1 word
2 2 word
3 3 word
4 4 word
5 5 N. word
6 6 word
7 7 word
8 8 word
Try read.table + sub
> read.table(text = sub("\\.", ",", df$col1), sep = ",")
V1 V2
1 1 word
2 2 word
3 3 word
4 4 word
5 5 N. word
6 6 word
7 7 word
8 8 word
I am not sure how to do this with tidyr, but the following should work with base R.
df$col1 <- gsub('N. ', '', df$col1, fixed = TRUE)
df$Numbers <- as.numeric(sapply(strsplit(df$col1, ' '), '[', 1))
df$Words <- sapply(strsplit(df$col1, ' '), '[', 2)
df$col1 <- NULL
Result
> head(df)
Numbers Words
1 1 word
2 2 word
3 3 word
4 4 word
5 5 word
6 6 word
I would like to find the text or string that appeared in 3 of my columns.
> dput(df1)
structure(list(Jan = "The price of oil declined.", Feb = "The price of gold declined.",
Mar = "Prices remained unchanged."), row.names = c(NA, -1L
), class = c("tbl_df", "tbl", "data.frame"))
I want to get something like
Word Count
The 2
price 3
declined 2
of 2
Thank you.
You can count the occurrence of each word in the text and keep only the ones that occur more than once.
library(dplyr)
library(tidyr)
df1 %>%
  pivot_longer(cols = everything()) %>%
  separate_rows(value, sep = '\\s+') %>%
  mutate(value = tolower(gsub('[[:punct:]]', '', value))) %>%
  count(value) %>%
  filter(n > 1)
Maybe this:
setNames(data.frame(table(unlist(
  strsplit(trimws(tolower(stack(df1)$values), whitespace = '\\.'), '\\s+', perl = TRUE)
))), c('words', 'Frequency'))
stack(df1) reshapes the data frame from wide (row) structure to long (columnar) structure; its values column then holds all the sentences. trimws strips the trailing periods, strsplit splits the data on whitespace, and unlist flattens the result. Taking table and then converting to a data.frame yields the desired counts; setNames renames the columns.
Output:
# words Frequency
#1 declined 2
#2 gold 1
#3 of 2
#4 oil 1
#5 price 2
#6 prices 1
#7 remained 1
#8 the 2
#9 unchanged 1
This code won't normalize the data as you may wish, for example treating "price" and "Prices" as the same word. If you want that, it gets more complicated.
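If you do want to fold inflected forms together, one approach (a sketch, not part of the answers above) is to stem each word before counting, e.g. with SnowballC's wordStem. Note that stemming is only an approximation: "declined" becomes "declin".

```r
library(SnowballC)

# Stemming collapses "price" and "prices" to one form before counting
wordStem(c("price", "prices", "declined"))
# [1] "price"  "price"  "declin"
```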
> data.frame(table(unlist(strsplit(tolower(gsub("\\.|,", "", paste(as.character(unlist(df1)), collapse = " "))), " "))))
Var1 Freq
1 declined 2
2 gold 1
3 of 2
4 oil 1
5 price 2
6 prices 1
7 remained 1
8 the 2
9 unchanged 1
Base R solution:
setNames(
data.frame(
table(
unlist(strsplit(tolower(do.call(c, df1)), "\\s+|[[:punct:]]"))
)
),
c("Words", "Frequency")
)
I want to use filter command from dplyr along with str_detect.
library(tidyverse)
dt1 <-
tibble(
No = c(1, 2, 3, 4)
, Text = c("I have a pen.", "I have a book.", "I have a pencile.", "I have a pen and a book.")
)
dt1
# A tibble: 4 x 2
No Text
<dbl> <chr>
1 1 I have a pen.
2 2 I have a book.
3 3 I have a pencile.
4 4 I have a pen and a book.
MatchText <- c("Pen", "Book")
dt1 %>%
filter(str_detect(Text, regex(paste0(MatchText, collapse = '|'), ignore_case = TRUE)))
# A tibble: 4 x 2
No Text
<dbl> <chr>
1 1 I have a pen.
2 2 I have a book.
3 3 I have a pencile.
4 4 I have a pen and a book.
Required Output
I want the following output in a more efficient way (since in my original problem there would be many unknown elements in MatchText).
dt1 %>%
filter(str_detect(Text, regex("Pen", ignore_case = TRUE))) %>%
select(-Text) %>%
mutate(MatchText = "Pen") %>%
bind_rows(
dt1 %>%
filter(str_detect(Text, regex("Book", ignore_case = TRUE))) %>%
select(-Text) %>%
mutate(MatchText = "Book")
)
# A tibble: 5 x 2
No MatchText
<dbl> <chr>
1 1 Pen
2 3 Pen
3 4 Pen
4 2 Book
5 4 Book
Any hint to accomplish the above task more efficiently would be appreciated.
library(tidyverse)
dt1 %>%
  mutate(
    result = str_extract_all(Text, regex(paste0("\\b", MatchText, "\\b", collapse = '|'), ignore_case = TRUE))
  ) %>%
  unnest(result) %>%
  select(-Text)
# # A tibble: 4 x 2
# No result
# <dbl> <chr>
# 1 1 pen
# 2 2 book
# 3 4 pen
# 4 4 book
I'm not sure what happened to the "whole words" part of your question after edits - I left in the word boundaries to match whole words, but since "pen" isn't a whole word match for "pencile", my result doesn't match yours. Get rid of the \\b if you want partial word matches.
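A quick sketch of what the boundaries change:

```r
library(stringr)

str_detect("I have a pencile.", "pen")        # TRUE: "pen" matches inside "pencile"
str_detect("I have a pencile.", "\\bpen\\b")  # FALSE: whole-word matches only
```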
str_extract_all() gives multiple matches which you can unnest into separate rows to get your desired output. If you want you can still use the paste+collapse method to generate the pattern from a vector.
library(stringr)
dt1 %>%
  mutate(match = str_extract_all(tolower(Text), "pen|book")) %>%
  unnest(match) %>%
  select(-Text)
When I use unnest_tokens on a list I enter manually, the output includes the row number each word came from.
library(dplyr)
library(tidytext)
library(tidyr)
library(NLP)
library(tm)
library(SnowballC)
library(widyr)
library(textstem)
#test data
text <- c("furloughs", "Working MORE for less pay", "total burnout and exhaustion")
#break text file into single words and list which row they are in
text_df <- tibble(text = text)
tidy_text <- text_df %>%
  mutate_all(as.character) %>%
  mutate(row_name = row_number()) %>%
  unnest_tokens(word, text) %>%
  mutate(word = wordStem(word))
The results look like this, which is what I want.
row_name word
<int> <chr>
1 1 furlough
2 2 work
3 2 more
4 2 for
5 2 less
6 2 pai
7 3 total
8 3 burnout
9 3 and
10 3 exhaust
But when I try to read in the real responses from a csv file:
#Import data
text <- read.csv("TextSample.csv", stringsAsFactors=FALSE)
But otherwise use the same code:
#break text file into single words and list which row they are in
text_df <- tibble(text = text)
tidy_text <- text_df %>%
  mutate_all(as.character) %>%
  mutate(row_name = row_number()) %>%
  unnest_tokens(word, text) %>%
  mutate(word = wordStem(word))
I get the entire token list assigned to row 1 and then again assigned to row 2 and so on.
row_name word
<int> <chr>
1 1 c
2 1 furlough
3 1 work
4 1 more
5 1 for
6 1 less
7 1 pai
8 1 total
9 1 burnout
10 1 and
OR, if I move the mutate(row_name = row_number()) call to after the unnest command, I get a separate row number for each token.
word row_name
<chr> <int>
1 c 1
2 furlough 2
3 work 3
4 more 4
5 for 5
6 less 6
7 pai 7
8 total 8
9 burnout 9
10 and 10
What am I missing?
I guess if you import the text using text <- read.csv("TextSample.csv", stringsAsFactors = FALSE), text is a data frame, while if you enter it manually it is a character vector.
If you alter the code to text_df <- tibble(text = text$col_name), selecting the column (which is a vector) from the data frame in the csv case, I think you should get the same result as before.
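A minimal sketch of that fix (the column name Response is hypothetical; substitute whatever your CSV's text column is actually called):

```r
library(dplyr)
library(tidytext)
library(SnowballC)

text <- read.csv("TextSample.csv", stringsAsFactors = FALSE)

# Pull the text column out as a character vector instead of
# wrapping the whole data frame in the tibble
text_df <- tibble(text = text$Response)  # "Response" is a hypothetical column name

tidy_text <- text_df %>%
  mutate(row_name = row_number()) %>%
  unnest_tokens(word, text) %>%
  mutate(word = wordStem(word))
```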