Remove part of a string based on overlapping patterns - r

I have the following data:
dat <- data.frame(x = c("this is my example text", "and here is my other text example", "my other text is short"),
some_other_cols = c(1, 2, 2))
Further, I have the following vector of patterns:
my_patterns <- c("my example", "is my", "my other text")
What I want to achieve is to remove any text of my_patterns that occurs in dat$x.
I tried the solution below, but the problem is that as soon as I remove the first pattern from the text (here: "my example"), my solution is no longer able to detect the occurrence of the second pattern (here: "is my") or the third.
Wrong solution:
library(tidyverse)
my_patterns_c <- str_c(my_patterns, collapse = "|")
dat_new <- dat %>%
mutate(short_x = str_replace_all(x, pattern = my_patterns_c, replacement = ""))
I guess I could do something like looping through all patterns, collecting the string positions in dat$x that match my patterns, then combining them into a range and deleting that range from the text. E.g. I add columns to my dat data frame like start_pattern_1 and end_pattern_1 and so on. So for row 1 I get 9 (start) and 18 (end) for the first pattern, and 6/10 for the second pattern. I then need to check if any end position overlaps with any start position (here start 9 and end 10), combine them into a range 6-18, and remove this range from the text.
Problem is that I potentially have many new start/end columns then (could be a few hundred patterns in my case) and if I need to pairwise compare the overlapping ranges, my computer will probably crash.
So I'm wondering how I could get it work or how I should best approach this solution. Maybe (and I hope so) there's a better/more elegant/easy solution.
Desired Output of dat would be:
x some_other_cols short_x
this is my example text 1 this text
and here is my other text example 2 and here example
my other text is short 2 is short
Appreciate your help! Thanks.

A newer option uses str_locate_all, mentioned by Uwe in a comment under the question, which greatly simplifies the code:
library(stringr)
# Create a function to remove the matching parts of a text.
# First argument is the text, second argument is a matrix of start/end positions.
remove_matching_parts <- function(text, positions) {
  if (nrow(positions) == 0) return(text)
  ret <- strsplit(text, "")[[1]]
  for (i in seq_len(nrow(positions))) {
    ret[positions[i, 1]:positions[i, 2]] <- NA
  }
  paste0(ret[!is.na(ret)], collapse = "")
}
# Loop over the data to locate the patterns in each row;
# each element is one table of start/end positions
matches <- lapply(dat$x, function(x) {
  do.call(rbind, str_locate_all(x, my_patterns)) # bind the list output of str_locate_all into one table
})
# Avoid growing a vector in a for loop; create it beforehand, the same length as the vector we work against
dat$result <- vector("character", length(dat$x))
# Loop on each value to remove the matching parts
for (i in seq_along(dat$x)) {
  dat$result[i] <- remove_matching_parts(as.character(dat$x[i]), matches[[i]])
}
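For the first row of dat, the combined position table looks like this ("my example" spans 9-18, "is my" spans 6-10, and "my other text" does not match):
matches[[1]]
#>      start end
#> [1,]     9  18
#> [2,]     6  10
Overlapping positions are simply set to NA twice, so overlaps are handled for free.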
If you have control over the pattern definition and can create it by hand then it can be achieved with a regex solution:
> gsub("(is )?my (other text|example)?","",dat$x)
[1] "this text" "and here example" " is short"
The idea is to create the pattern with optional parts (the ? after the grouping parentheses).
So we have roughly:
(is )? <= optional "is" followed by space
my <= literal "my" followed by space
(other text|example)? <= Optional text after "my ", either "other text" or (the |) "example"
If you don't have control, things get messy. I hope I've commented enough for it to be understandable; given the number of loops involved, don't expect it to be quick:
# Given datas
dat <- data.frame(x = c("this is my example text", "and here is my other text example", "my other text is short","yet another text"),
some_other_cols = c(1, 2, 2, 4))
my_patterns <- c("my example", "is my", "my other text")
# Create a function to remove the matching parts of a text.
# First argument is the text, second argument is a list of start/end vectors (NULL when there is no match).
remove_matching_parts <- function(text, positions) {
  ret <- strsplit(text, "")[[1]]
  for (x in positions) {
    if (!is.null(x)) ret[x[1]:x[2]] <- NA
  }
  paste0(ret[!is.na(ret)], collapse = "")
}
# Create the matches between a vector and a pattern.
# First argument is the pattern to match, second is the vector of characters.
match_pat_to_vector <- function(pattern, vector) {
  sapply(regexec(pattern, vector),
         function(x) {
           if (x[1] > -1) {
             # Create a start/end vector from the index and length of the match
             c(start = as.numeric(x), end = as.numeric(x + attr(x, "match.length")))
           }
         })
}
# Loop over the patterns to create a matrix of matches
# row = length of vector, columns = length of pattern
matches <- sapply(my_patterns,match_pat_to_vector,vector=dat$x)
# Avoid growing a vector in a for loop; create it beforehand, the same length as the vector we work against
dat$result <- vector("character", length(dat$x))
# Loop on each value to remove the matching parts
for (i in seq_along(dat$x)) {
  dat$result[i] <- remove_matching_parts(as.character(dat$x[i]), matches[i, ])
}
Result after run:
> dat
x some_other_cols result
1 this is my example text 1 this text
2 and here is my other text example 2 and here example
3 my other text is short 2 is short
4 yet another text 4 yet another text

There are two crucial points here:
The patterns to remove from a string may overlap
There may be multiple non-overlapping patterns to remove from the string
The solution below tries to address both issues using my favorite tools
library(data.table)
setDT(dat)[, rn := .I] # add row numbers to join on later
library(stringr)
library(magrittr) # piping used to improve readability
pos <-
# find start and end positions for each pattern
lapply(my_patterns, function(pat) str_locate_all(dat$x, pat) %>%
lapply(as.data.table) %>%
rbindlist(idcol = "rn")) %>%
rbindlist() %>%
# collapse overlapping positions
setorder(rn, start, end) %>%
.[, grp := cumsum(cummax(shift(end, fill = 0)) < start), by = rn] %>%
.[, .(start = min(start), end = max(end)), by = .(rn, grp)]
Now, pos has become:
rn grp start end
1: 1 1 6 18
2: 2 1 10 25
3: 3 1 1 13
4: 5 1 6 10
5: 5 2 24 28
6: 6 1 1 13
7: 6 2 15 27
8: 7 1 3 7
9: 8 1 1 10
10: 8 2 12 16
11: 8 3 22 34
12: 9 1 1 10
13: 9 2 19 31
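To see what the collapsing step does, here is a minimal sketch (my own illustration) using row 1's two overlapping intervals plus a hypothetical third: a new group starts whenever an interval's start lies beyond the running maximum of the previous ends.
library(data.table) # already attached above
iv <- data.table(start = c(6, 9, 24), end = c(10, 18, 28))
iv[, grp := cumsum(cummax(shift(end, fill = 0)) < start)]
iv
#>    start end grp
#> 1:     6  10   1
#> 2:     9  18   1
#> 3:    24  28   2
Intervals 6-10 and 9-18 fall into the same group and get merged to 6-18; the third interval starts a new group.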
# remove patterns from strings from back to front
dat[, short_x := x]
for (g in rev(seq_len(max(pos$grp)))) {
# update join
dat[pos[grp == g], on = .(rn), short_x := `str_sub<-`(short_x, start, end, value = "")]
}
dat[, rn := NULL][ #remove row number
, short_x := str_squish(short_x)][] # remove whitespace
x some_other_cols short_x
1: this is my example text 1 this text
2: and here is my other text example 2 and here example
3: my other text is short 2 is short
4: yet another text 4 yet another text
5: this is my text where 'is my' appears twice 5 this text where '' appears twice
6: my other text is my example 6
7: This myself 7 Thself
8: my example is my not my other text 8 not
9: my example is not my other text 9 is not
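The update join relies on `str_sub<-`(), stringr's replacement function called in functional form, so it returns the modified string instead of changing it in place. A standalone illustration (my own):
library(stringr)
s <- "this is my example text"
`str_sub<-`(s, 6, 18, value = "")  # same as: str_sub(s, 6, 18) <- ""
#> [1] "this  text"
The doubled whitespace left behind is cleaned up by str_squish() at the end.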
The code to collapse overlapping positions is modified from this answer.
The intermediate result
lapply(my_patterns, function(pat) str_locate_all(dat$x, pat) %>%
lapply(as.data.table) %>%
rbindlist(idcol = "rn"))
[[1]]
rn start end
1: 1 9 18
2: 6 18 27
3: 8 1 10
4: 9 1 10
[[2]]
rn start end
1: 1 6 10
2: 2 10 14
3: 5 6 10
4: 5 24 28
5: 6 15 19
6: 7 3 7
7: 8 12 16
[[3]]
rn start end
1: 2 13 25
2: 3 1 13
3: 6 1 13
4: 8 22 34
5: 9 19 31
shows that patterns 1 and 2 overlap in row 1 and patterns 2 and 3 overlap in row 2. Rows 5, 8, and 9 have non-overlapping patterns. Row 7 is to show that patterns are extracted regardless of word boundaries.
EDIT: dplyr version
The OP has mentioned that he/she has "successfully avoided data.table so far". So, I felt challenged to add a dplyr version:
library(dplyr)
library(stringr)
pos <-
# find start and end positions for each pattern
lapply(my_patterns, function(pat) str_locate_all(dat$x, pat) %>%
lapply(as_tibble) %>%
bind_rows(.id = "rn")) %>%
bind_rows() %>%
# collapse overlapping positions
arrange(rn, start, end) %>%
group_by(rn) %>%
mutate(grp = cumsum(cummax(lag(end, default = 0)) < start)) %>%
group_by(rn, grp) %>%
summarize(start = min(start), end = max(end))
# remove patterns from strings from back to front
dat <- dat %>%
mutate(rn = row_number() %>% as.character(),
short_x = x %>% as.character())
for (g in rev(seq_len(max(pos$grp)))) {
dat <- dat %>%
left_join(pos %>% filter(grp == g), by = "rn") %>%
mutate(short_x = ifelse(is.na(grp), short_x, `str_sub<-`(short_x, start, end, value = ""))) %>%
select(-grp, -start, -end)
}
# remove row number
dat %>%
select(-rn) %>%
mutate(short_x = str_squish(short_x))
x some_other_cols short_x
1 this is my example text 1 this text
2 and here is my other text example 2 and here example
3 my other text is short 2 is short
4 yet another text 4 yet another text
5 this is my text where 'is my' appears twice 5 this text where '' appears twice
6 my other text is my example 6
7 This myself 7 Thself
8 my example is my not my other text 8 not
9 my example is not my other text 9 is not
The algorithm is essentially the same. However, there are two challenges here where dplyr differs from data.table:
dplyr requires explicit coercion from factor to character
there is no update join available in dplyr, so the for loop has become more verbose than the data.table counterpart (perhaps someone knows a fancy purrr function or a map-reduce trick to accomplish the same; one possibility with base Reduce() is sketched below)
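As a sketch (my addition, not part of the original answer), Reduce() can fold the same join-and-trim step over the groups, replacing the explicit loop while performing exactly the same work:
dat <- Reduce(
  function(d, g) {
    d %>%
      left_join(pos %>% filter(grp == g), by = "rn") %>%
      mutate(short_x = ifelse(is.na(grp), short_x,
                              `str_sub<-`(short_x, start, end, value = ""))) %>%
      select(-grp, -start, -end)
  },
  rev(seq_len(max(pos$grp))),
  init = dat
)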
EDIT 2
There are some bug fixes and improvements to the code above:
Collapsing positions has been corrected to work also for some edge case I have added to dat.
seq() has been replaced by seq_len().
str_squish() reduces repeated whitespace inside a string and removes whitespace from start and end of a string.
Data
I have added some use cases to test for non-overlapping patterns and complete removal, e.g.:
dat <- data.frame(
x = c(
"this is my example text",
"and here is my other text example",
"my other text is short",
"yet another text",
"this is my text where 'is my' appears twice",
"my other text is my example",
"This myself",
"my example is my not my other text",
"my example is not my other text"
),
some_other_cols = c(1, 2, 2, 4, 5, 6, 7, 8, 9)
)
my_patterns <- c("my example", "is my", "my other text")

Related

filter with a list of string conditions

This is an example of what the data looks like:
height <- c("T_0.1", "T_0.2", "T_0.3", "T_0.11", "T_0.12", "T_0.13", "T_10.1", "T_10.2",
"T_10.3", "T_10.11", "T_10.12", "T_10.13","T_36.1", "T_36.2", "T_36.3", "T_36.11", "T_36.12",
"T_36.13")
value <- c(1,12,14,15,20,22,5,9,4,0.0,0.45,0.7,1,2,7,100,9,45)
df <- data.frame(height,value)
I want to filter all the values in height that end with ".1", ".2", or ".3". However, I want to do that using a "list of patterns" because the actual data frame has more than 1000 values.
Here is what I tried:
vars_list <- c(".1", ".2",".3")
df_new<-df[grepl(paste(vars_list, collapse = "|"), df$height),]
matchPattern <- paste(vars_list, collapse = "|")
df_new <- df %>% select(matches(matchPattern))
Both attempts return 0 observations. I am not sure what the issue is and I couldn't find a post that would help, so any help is very much appreciated!
The dot is a regex metacharacter, which matches any character except a newline. You need to escape it (i.e. tell R you are looking for a literal dot) by prepending it with \\.
However, your pattern will then match all rows in your sample data.
I assume you do not want to match, for example, "T_0.13", because it does not end with ".1", ".2" or ".3". In which case, you should add a dollar sign to indicate that you want your string to end with the desired match, rather than just contain it.
vars_list <- c("\\.1$", "\\.2$","\\.3$")
df_new<-df[grepl(paste(vars_list, collapse = "|"), df$height),]
df_new
# height value
# 1 T_0.1 1
# 2 T_0.2 12
# 3 T_0.3 14
# 7 T_10.1 5
# 8 T_10.2 9
# 9 T_10.3 4
# 13 T_36.1 1
# 14 T_36.2 2
# 15 T_36.3 7
Incidentally, another way you could express this is:
df[grepl("\\.[1-3]$", df$height),]
You can read more about the syntax used in regular expressions in R's help page, ?regex.
Alternatively, use the base function endsWith on the unescaped suffixes (filter() here comes from dplyr):
library(dplyr)
vars_list <- c(".1", ".2", ".3")
df <- data.frame(height, value) %>% filter(endsWith(height, vars_list))
Created on 2023-02-12 with reprex v2.0.2
height value
1 T_0.1 1
2 T_0.2 12
3 T_0.3 14
4 T_10.1 5
5 T_10.2 9
6 T_10.3 4
7 T_36.1 1
8 T_36.2 2
9 T_36.3 7
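Note that endsWith() is not vectorized over several suffixes: it recycles vars_list along height, which happens to line up with this particular data. A recycling-safe variant (my addition) tests every suffix against every value:
df_new <- df[Reduce(`|`, lapply(vars_list, \(s) endsWith(df$height, s))), ]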

Convert string to dictionary in R

I have a data frame column with dictionary-like strings.
data = data.frame(date = c('2022-12-01', '2022-12-02'),
code = c("{\"551\":4,\"181\":4,\"180\":4,\"181\":4}",
"{\"321\":14,\"181\":4,\"230\":4,\"189\":12}"))
My goal is to calculate the total of the values whose keys start with 18.
For example, in the first row (2022-12-01) three keys start with 18, so the total is 4+4+4 = 12.
In the second row (2022-12-02) two keys start with 18, so the total is 4+12 = 16.
I tried strsplit(data$code, "\\W"), which splits on every delimiter, and strsplit(data$code, ","), but failed to store the result in a dictionary-type structure.
I feel that after converting the string to a dictionary, filtering on names that start with 18 would be feasible, but I have no idea how to get started. Thank you for your advice!
data = data.frame(date = c('2022-12-01', '2022-12-02'),
code = c("{\"551\":4,\"181\":4,\"180\":4,\"181\":4}",
"{\"321\":14,\"181\":4,\"230\":4,\"189\":12}"))
data$count <- lapply(data$code, jsonlite::fromJSON) |>
  sapply(\(x) sum(unlist(x)[grep("^18", names(x))]))
data
#> date code count
#> 1 2022-12-01 {"551":4,"181":4,"180":4,"181":4} 12
#> 2 2022-12-02 {"321":14,"181":4,"230":4,"189":12} 16
Here are several approaches. The first uses strapply and is particularly short. The next shows how to create a dictionary using strapply and the last uses only base R.
In all of these use transform(data, sum = ...) or use mutate in dplyr to add the solution as a new column to data.
1) Match the number after an 18 and then convert the match to numeric and sum. Using strapply we get particularly concise code.
library(gsubfn)
sapply(strapply(data$code, '"18\\d+":(\\d+)', as.numeric), sum)
## [1] 12 16
2) In the question, the desirability of creating a dictionary first was mentioned. To do that, dict below is a list of dictionaries, one per row; we then grep out the desired elements and sum.
library(gsubfn)
dict <- strapply(data$code, '"(\\d+)":(\\d+)', x + y ~ setNames(as.numeric(y), x))
sapply(lapply(dict, function(x) x[grepl("^18", names(x))]), sum)
## [1] 12 16
dict
## [[1]]
## 551 181 180 181
## 4 4 4 4
##
## [[2]]
## 321 181 230 189
## 14 4 4 12
3) A base solution replaces the {, } and comma characters with newline and then for each row reads the rest into two columns (the dictionary). It then subsets out the rows that begin with 18 and sums.
sapply(data$code, function(x)
gsub('[{},]', '\n', x) |>
read.table(text = _, sep = ":") |>
subset(grepl("^18", V1)) |>
with(sum(V2)), USE.NAMES = FALSE)
## [1] 12 16
If you just want the part of the code that constructs the dictionaries:
lapply(data$code, function(x)
gsub('[{},]', '\n', x) |>
read.table(text = _, sep = ":"))
## [[1]]
## V1 V2
## 1 551 4
## 2 181 4
## 3 180 4
## 4 181 4
##
## [[2]]
## V1 V2
## 1 321 14
## 2 181 4
## 3 230 4
## 4 189 12
I would first make a data.frame where each row is a {name, value} pair. I do this by first separating the pairs onto rows, then separating the name and value into separate columns. Then I parse the text to keep only the numbers. Finally we summarise the table by date, taking the sum of those values for which the name starts with "18".
library(tidyverse)
data %>%
separate_rows(code, sep = ',') %>%
separate(code, sep = '":', into = c('name', 'value')) %>%
mutate(across(c(name, value), parse_number)) %>%
group_by(date) %>%
summarise(result = sum(value[substr(name, 1, 2) == "18"]))
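This should return one row per date; reconstructing the output from the totals stated in the question:
#> # A tibble: 2 × 2
#>   date       result
#>   <chr>       <dbl>
#> 1 2022-12-01     12
#> 2 2022-12-02     16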
Using base R
data$Sum <- sapply(regmatches(data$code, gregexpr('(?<=18\\d":)(\\d+)',
data$code, perl = TRUE)), \(x) sum(as.numeric(x)))
data$Sum
[1] 12 16
A base R approach using strsplit and sub/gsub.
First remove the braces and quotes, then keep the strings starting with 18 (^18), and finally sum the trailing numbers after the :.
cbind(df, Sum = sapply(strsplit(df$code, ","), function(x)
sum(as.numeric(
sub(".*:", "", grep("^18", gsub("\\{|\"|\\}", "", x), value=T)))
)))
date code Sum
1 2022-12-01 {"551":4,"181":4,"180":4,"181":4} 12
2 2022-12-02 {"321":14,"181":4,"230":4,"189":12} 16

How to convert a sequence of numbers from data frame to text file in special format in R?

I have a data frame with one field consisting of a sequence of numbers:
test <- data.frame(N=c(1,2,3,5,7,8,9,11,13,14,15))
> test
N
1 1
2 2
3 3
4 5
5 7
6 8
7 9
8 11
9 13
10 14
11 15
The field N contains a sequence of integers in ascending order
N sometimes skips some numbers, such as 2,3,5 (4 is missing).
I need to convert it into the following text format:
1-3,5,7-9,
11,13-15
The output is not a data frame, but just a simple text file that satisfies the following conditions:
Runs of consecutive numbers should be shortened to their first and last values joined by -, i.e., 1,2,3 should become 1-3 and 1,2,3,5,6 should become 1-3,5,6
Each number (or shortened run) should be separated by a comma , (no space needed)
Once a line has three numbers or shortened runs, it should break and continue on the next line
Each line should end with a comma, except the last line
Currently I can only convert the data frame into a sequence of numbers; the output is wrapped in c(), consecutive numbers are not shortened, and there are no line breaks.
> tapply(test, (seq_along(test)-1)%/%3, paste, collapse=", ")
0
"c(1, 2, 3, 5, 7, 8, 9, 11, 13, 14, 15)"
I'd appreciate any ideas on how to make this work!
Thank you in advance for your support.
Here's a possible solution using dplyr -
library(dplyr)
output <- test %>%
#Create groups to collapse consecutive numbers
group_by(grp = cumsum(c(TRUE, diff(N) > 1))) %>%
#If more than 1 number in a group paste first and last value
summarise(text = if(n() > 1) paste0(range(N), collapse = '-') else as.character(N)) %>%
#For each 3 groups collapse the ranges as one string
group_by(line_num = ceiling(row_number()/3)) %>%
summarise(text = toString(text))
output
# line_num text
# <dbl> <chr>
#1 1 1-3, 5, 7-9
#2 2 11, 13-15
#Write the output
cat(paste0(output$text, collapse = '\n'), file = 'output.txt')
The output text file looks like:
1-3, 5, 7-9
11, 13-15
(toString() inserts ", " separators; use paste(text, collapse = ",") instead if you want no spaces.)
Let's put the values in a vector v:
v <- c(1,2,3,5,7,8,9,11,13,14,15)
Then split v into sets of consecutive numbers:
vv <- split(v, cumsum(c(1, diff(v) != 1)))
vv
$`1`
[1] 1 2 3
$`2`
[1] 5
$`3`
[1] 7 8 9
$`4`
[1] 11
$`5`
[1] 13 14 15
Finally, transform it into the form you want (%>% comes from magrittr, which tidyverse users will already have loaded):
lapply(vv, function(x) {
  if (length(x) == 1) {
    x
  } else {
    paste0(x[1], "-", tail(x, n = 1))
  }
}) %>% unlist %>% as.vector
[1] "1-3" "5" "7-9" "11" "13-15"
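To get all the way to the requested file format (three items per line, comma-separated, a trailing comma on every line but the last), one possible finishing step (my addition, not part of the original answer):
ranges <- c("1-3", "5", "7-9", "11", "13-15")
# group the ranges three per line, then join each group with commas
lines <- tapply(ranges, (seq_along(ranges) - 1) %/% 3, paste, collapse = ",")
# append a comma to every line except the last and write the file
writeLines(paste0(lines, c(rep(",", length(lines) - 1), "")), "output.txt")
# output.txt now contains:
# 1-3,5,7-9,
# 11,13-15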

How to merge/copy different rows into one conditionally [R]

I have a large data frame with names and a "classifying" variable named sequence. sequence indicates a row's position relative to the other rows. It has two values: first and additional.
The problem is that the distribution of these values is not uniform, i.e., there is not one additional per every first, and every letters value is unique.
The data frame looks like this (simplified version):
letters <- sample(LETTERS, 20)
sequence <- c("first","additional","first","first","first","first","first","additional","additional","additional","first","first","additional","first","additional","additional","first","additional","first","first")
df <- data.frame(sequence, letters)
Now, what I want to do is take every additional value in letters and paste it into its corresponding first value in letters.
So, for example, the second (row) value in the letters column would be pasted into the first, since it's the corresponding additional. Further, the eighth, ninth, and tenth values in letters should be pasted next to the seventh value of letters (e.g., first; additional; additional; additional).
I've tried the following with the obvious limitation that it only looks to the immediate next value,
library(dplyr)
df <- df %>% mutate(letters_ok = if_else(sequence == "additional",
paste(letters, lag(letters), sep = "; "), letters))
highlighting my problem: How do I manage to lag conditionally on the values in sequence, so that I can paste the values in letters according to the first or additional classification?
Since every letters value is unique and tied to a specific sequence value, I didn't use group_by. Every other solution eludes my current knowledge of string/character wrangling, so I would very much appreciate any help.
Here is a data.table approach. I slightly altered your sample data, since letters is not a very convenient column name. Also, I added set.seed(123) for reproducibility.
sample data
set.seed(123)
letter <- sample(LETTERS, 20)
sequence <- c("first","additional","first","first","first","first","first","additional","additional","additional","first","first","additional","first","additional","additional","first","additional","first","first")
df <- data.frame(sequence, letter)
# sequence letter
# 1 first O
# 2 additional S
# 3 first N
# 4 first C
# 5 first J
# 6 first R
# 7 first K
# 8 additional E
# 9 additional X
# 10 additional Y
# 11 first W
# 12 first T
# 13 additional I
# 14 first L
# 15 additional U
# 16 additional M
# 17 first P
# 18 additional H
# 19 first B
# 20 first G
code
library( data.table )
#convert to data.table format
setDT(df)
#add id-column
df[, id := .I ]
#perform rolling join
temp <- df[ sequence == "first", ][ df[ sequence == "additional", ],
.( x.letter, i.letter, i.id, x.id),
on = .(id),
roll = Inf ]
#summarise
temp <- temp[, paste0( `i.letter`, collapse = ";" ), by = .(x.id) ]
#join, drop id column
df[sequence == "first", ][ temp, letter := paste( letter, i.V1, sep = ";"), on = .(id = `x.id`) ][, id := NULL]
output
# sequence letter
# 1: first O;S
# 2: first N
# 3: first C
# 4: first J
# 5: first R
# 6: first K;E;X;Y
# 7: first W
# 8: first T;I
# 9: first L;U;M
#10: first P;H
#11: first B
#12: first G
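For comparison, a compact dplyr sketch of the same idea (my addition, not from the original answer): build a group id that increments at every "first" row, then collapse each group.
library(dplyr)
df %>%
  group_by(grp = cumsum(sequence == "first")) %>%
  summarise(sequence = first(sequence),
            letter = paste(letter, collapse = ";")) %>%
  select(-grp)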

Extract and count common word-pairs from character vector

How can someone find frequent pairs of adjacent words in a character vector? Using the crude data set, for example, some common pairs are "crude oil", "oil market", and "million barrels".
The code for the small example below tries to identify frequent terms and then, using a positive lookahead assertion, count how many times those frequent terms are followed immediately by a frequent term. But the attempt crashed and burned.
Any guidance would be appreciated as to how to create a data frame that shows in the first column ("Pairs") the common pairs and in the second column ("Count") the number of times they appeared in the text.
library(qdap)
library(tm)
# from the crude data set, create a text file from the first three documents, then clean it
text <- c(crude[[1]][1], crude[[2]][2], crude[[3]][3])
text <- tolower(text)
text <- tm::removeNumbers(text)
text <- str_replace_all(text, "  ", "") # remove double spaces (note: this glues the surrounding words together)
text <- str_replace_all(text, pattern = "[[:punct:]]", " ")
text <- removeWords(text, stopwords(kind = "SMART"))
# pick the top 10 individual words by frequency, since they will likely form the most common pairs
freq.terms <- head(freq_terms(text.var = text), 10)
# create a pattern from the top words for the regex expression below
freq.terms.pat <- str_c(freq.terms$WORD, collapse = "|")
# match frequent terms that are followed by a frequent term
library(stringr)
pairs <- str_extract_all(string = text, pattern = "freq.terms.pat(?= freq.terms.pat)")
Here is where the effort falters.
Not knowing Java or Python, these related questions on counting word pairs in Java and Python did not help, but they may be useful references for others.
Thank you.
First, modify your initial text list from:
text <- c(crude[[1]][1], crude[[2]][2], crude[[3]][3])
to:
text <- c(crude[[1]][1], crude[[2]][1], crude[[3]][1])
Then, you can go on with your text cleaning (note that your method will create ill-formed words like "oilcanadian", but it will suffice for the example at hand):
text <- tolower(text)
text <- tm::removeNumbers(text)
text <- str_replace_all(text, "  ", "")
text <- str_replace_all(text, pattern = "[[:punct:]]", " ")
text <- removeWords(text, stopwords(kind = "SMART"))
Build a new Corpus:
v <- Corpus(VectorSource(text))
Create a bigram tokenizer function:
BigramTokenizer <- function(x) {
unlist(
lapply(ngrams(words(x), 2), paste, collapse = " "),
use.names = FALSE
)
}
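A quick check of the tokenizer on a single document (my own illustration; words() is tm's method for PlainTextDocument, and ngrams() comes from NLP, which tm attaches):
doc <- PlainTextDocument("crude oil prices rose")
BigramTokenizer(doc)
#> [1] "crude oil"   "oil prices"  "prices rose"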
Create your TermDocumentMatrix using the control parameter tokenize:
tdm <- TermDocumentMatrix(v, control = list(tokenize = BigramTokenizer))
Now that you have your new tdm, to get your desired output, you could do:
library(dplyr)
data.frame(inspect(tdm)) %>%
add_rownames() %>%
mutate(total = rowSums(.[,-1])) %>%
arrange(desc(total))
Which gives:
#Source: local data frame [272 x 5]
#
# rowname X1 X2 X3 total
#1 crude oil 2 0 1 3
#2 mln bpd 0 3 0 3
#3 oil prices 0 3 0 3
#4 cut contract 2 0 0 2
#5 demand opec 0 2 0 2
#6 dlrs barrel 2 0 0 2
#7 effective today 1 0 1 2
#8 emergency meeting 0 2 0 2
#9 oil companies 1 1 0 2
#10 oil industry 0 2 0 2
#.. ... .. .. .. ...
One idea here is to create a new corpus of bigrams:
A bigram or digram is every sequence of two adjacent elements in a string of tokens.
A recursive function to extract bigrams:
bigram <-
function(xs){
if (length(xs) >= 2)
c(paste(xs[seq(2)],collapse='_'),bigram(tail(xs,-1)))
}
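A quick sanity check of the helper (my own illustration):
bigram(c("crude", "oil", "prices"))
#> [1] "crude_oil"  "oil_prices"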
Then apply this to the crude data from the tm package. (I did some text cleaning here, but this step depends on the text.)
res <- unlist(lapply(crude,function(x){
x <- tm::removeNumbers(tolower(x))
x <- gsub('\n|[[:punct:]]',' ',x)
x <- gsub(' +', ' ', x)
## after cleaning, compute bigram frequencies using table
freqs <- table(bigram(strsplit(x," ")[[1]]))
freqs[freqs>1]
}))
as.data.frame(tail(sort(res),5))
tail(sort(res), 5)
reut-00022.xml.hold_a 3
reut-00022.xml.in_the 3
reut-00011.xml.of_the 4
reut-00022.xml.a_futures 4
reut-00010.xml.abdul_aziz 5
The bigrams "abdul aziz" and "a futures" are the most common. You should re-clean the data to remove remaining stopwords (of, the, ...), but this should be a good start.
Edit after OP comments:
In case you want bigram frequencies over the whole corpus, one idea is to compute the bigrams in the loop and then compute the frequencies on the combined result. I've taken the opportunity to add better text cleaning.
res <- unlist(lapply(crude,function(x){
x <- removeNumbers(tolower(x))
x <- removeWords(x, words=c("the","of"))
x <- removePunctuation(x)
x <- gsub('\n|[[:punct:]]', ' ', x)
x <- gsub(' +', ' ', x)
## after cleaning, compute bigrams on words longer than 2 characters
words <- strsplit(x, " ")[[1]]
bigram(words[nchar(words) > 2])
}))
library(data.table)
xx <- as.data.frame(table(res))
setDT(xx)[order(Freq)]
# res Freq
# 1: abdulaziz_bin 1
# 2: ability_hold 1
# 3: ability_keep 1
# 4: ability_sell 1
# 5: able_hedge 1
# ---
# 2177: last_month 6
# 2178: crude_oil 7
# 2179: oil_minister 7
# 2180: world_oil 7
# 2181: oil_prices 14
