Extract character-level n-grams from text in R

I have a dataframe with text and I want to extract the character-level bigrams (n = 2), e.g. "st", "ac", "ck", for each text in R.
I also want to count the frequency of each character-level bigram in the text.
Data:
df$text
[1] "hy my name is"
[2] "stackover flow is great"
[3] "how are you"

I'm not quite sure of your expected output here. I would have thought that the bigrams for "stack" would be "st", "ta", "ac", and "ck", since this captures each consecutive pair.
For example, if you wanted to know how many instances of the bigram "th" the word "brothers" had in it, and you split it into the bigrams "br", "ot", "he" and "rs", then you would get the answer 0, which is wrong.
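You can verify this with a quick base R check: the consecutive bigrams of "brothers" do contain "th" (this snippet is just an illustration, using vectorized substring()):
w <- "brothers"
substring(w, 1:(nchar(w) - 1), 2:nchar(w))
#> [1] "br" "ro" "ot" "th" "he" "er" "rs"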
You can build this up from a few small functions, ending with one that gets all the bigrams:
# This function takes a vector of single characters and creates all the bigrams
# within that vector. For example "s", "t", "a", "c", "k" becomes
# "st", "ta", "ac", and "ck"
pair_chars <- function(char_vec) {
  all_pairs <- paste0(char_vec[-length(char_vec)], char_vec[-1])
  return(as.vector(all_pairs[nchar(all_pairs) == 2]))
}
# This function splits a single word into a character vector and gets its bigrams
word_bigrams <- function(words) {
  unlist(lapply(strsplit(words, ""), pair_chars))
}
# This function splits a string or vector of strings into words and gets their bigrams
string_bigrams <- function(strings) {
  unlist(lapply(strsplit(strings, " "), word_bigrams))
}
So now we can test this on your example:
df <- data.frame(text = c("hy my name is", "stackover flow is great",
                          "how are you"), stringsAsFactors = FALSE)
string_bigrams(df$text)
#> [1] "hy" "my" "na" "am" "me" "is" "st" "ta" "ac" "ck" "ko" "ov" "ve" "er" "fl"
#> [16] "lo" "ow" "is" "gr" "re" "ea" "at" "ho" "ow" "ar" "re" "yo" "ou"
If you want to count occurrences, you can just use table:
table(string_bigrams(df$text))
#> ac am ar at ck ea er fl gr ho hy is ko lo me my na ou ov ow re st ta ve yo
#> 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 2 1 1 1 1
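If you would rather see the most common bigrams first, sorting the table is a one-line addition:
sort(table(string_bigrams(df$text)), decreasing = TRUE)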
However, if you are going to be doing a fair bit of text mining, you should look into dedicated R packages such as stringi, stringr, tm, and quanteda that help with these basic tasks.
For example, all of the base R functions I wrote above can be replaced using the quanteda package like this:
library(quanteda)
char_ngrams(unlist(tokens(df$text, "character")), concatenator = "")
#> [1] "hy" "ym" "my" "yn" "na" "am" "me" "ei" "is" "ss" "st" "ta" "ac" "ck"
#> [15] "ko" "ov" "ve" "er" "rf" "fl" "lo" "ow" "wi" "is" "sg" "gr" "re" "ea"
#> [29] "at" "th" "ho" "ow" "wa" "ar" "re" "ey" "yo" "ou"
Created on 2020-06-13 by the reprex package (v0.3.0)
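One difference worth noting: the quanteda one-liner above forms bigrams across word boundaries too (e.g. "ym" from "hy my"), whereas string_bigrams() stays within words. A rough sketch of a within-word version, reusing the same quanteda functions and assuming the same API as above, which should match the string_bigrams() output:
library(quanteda)
# tokenize to words first, then build character bigrams per word
words <- unlist(as.list(tokens(df$text, what = "word")))
unlist(lapply(strsplit(words, ""), char_ngrams, concatenator = ""))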

In addition to Allen's answer, you could use the qgrams function from the stringdist package, in combination with gsub to remove the spaces.
library(stringdist)
qgrams(gsub(" ", "", df$text), q = 2)
hy ym yn yo my na st ta ve wi wa ov rf sg ow re ou me is ko lo am ei er fl gr ho ey ck ea at ar ac
V1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
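The q argument generalizes this to other n-gram sizes; for character trigrams, for example:
qgrams(gsub(" ", "", df$text), q = 3)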

Related

Dropping the last two numbers from every entry in a column of data.table

Preface: I am a beginner to R who is eager to learn. Please don't mistake the simplicity of the question (if it has a simple answer) for lack of research or effort!
Here is a look at the data I am working with:
year state age POP
1: 90 1001 0 239
2: 90 1001 0 203
3: 90 1001 1 821
4: 90 1001 1 769
5: 90 1001 2 1089
The state column contains the FIPS codes for all states. For the purpose of merging, I need the state column to match another dataset. To achieve this, all I have to do is omit the last two digits of each FIPS code so that the table looks like this:
year state age POP
1: 90 10 0 239
2: 90 10 0 203
3: 90 10 1 821
4: 90 10 1 769
5: 90 10 2 1089
I can't figure out how to accomplish this task on a numeric column; substr() makes this easy on a character column.
In case your number is not always 4 digits long, you can omit the last two digits by making use of the vectorized behavior of substr():
x <- rownames(mtcars)[1:5]
x
#> [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
#> [4] "Hornet 4 Drive" "Hornet Sportabout"
substr(x, 1, nchar(x)-2)
#> [1] "Mazda R" "Mazda RX4 W" "Datsun 7" "Hornet 4 Dri"
#> [5] "Hornet Sportabo"
# dummy code for inside a data.table
dt[, x_new := substr(x, 1, nchar(x)-2)]
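For the OP's numeric state column specifically, here is a small sketch (table and column names taken from the question; the integer-division line is an alternative that avoids strings entirely):
library(data.table)
dt <- data.table(year = 90, state = c(1001, 1001, 1001), age = c(0, 0, 1), POP = c(239, 203, 821))
# coerce to character, drop the last two digits, convert back
dt[, state := as.integer(substr(as.character(state), 1, nchar(as.character(state)) - 2))]
# or, since the column is numeric anyway: dt[, state := state %/% 100]
dt
#>    year state age POP
#> 1:   90    10   0 239
#> 2:   90    10   0 203
#> 3:   90    10   1 821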
To generalize this to the case where you have a very large numeric column and need to substr() it correctly: substr() coerces numbers with as.character(), which switches to scientific notation for large values, so the naive approach silently breaks. (This is probably a good argument for storing/importing such a column as character to start with, but it's an imperfect world...)
x <- c(10000000000, 1000000000, 100000000, 10000000, 1000000,100000,10000,1000,100)
substr(x, 1, nchar(x)-2 )
#[1] "1e+" "1e+" "1e+" "1e+" "1e+" "1e+" "100" "10" "1"
as.character(x)
#[1] "1e+10" "1e+09" "1e+08" "1e+07" "1e+06" "1e+05" "10000" "1000"
#[9] "100"
xsf <- sprintf("%.0f", x)
substr(xsf, 1, nchar(xsf)-2)
#[1] "100000000" "10000000" "1000000" "100000" "10000"
#[6] "1000" "100" "10" "1"
cbind(x, xsf, xsfsub=substr(xsf, 1, nchar(xsf)-2) )
# x xsf xsfsub
# [1,] "1e+10" "10000000000" "100000000"
# [2,] "1e+09" "1000000000" "10000000"
# [3,] "1e+08" "100000000" "1000000"
# [4,] "1e+07" "10000000" "100000"
# [5,] "1e+06" "1000000" "10000"
# [6,] "1e+05" "100000" "1000"
# [7,] "10000" "10000" "100"
# [8,] "1000" "1000" "10"
# [9,] "100" "100" "1"

How to split a string after the nth character in r

I am working with the following data:
District <- c("AR01", "AZ03", "AZ05", "AZ08", "CA01", "CA05", "CA11", "CA16", "CA18", "CA21")
I want to split the string after the second character and put them into two columns.
So that the data looks like this:
state district
AR 01
AZ 03
AZ 05
AZ 08
CA 01
CA 05
CA 11
CA 16
CA 18
CA 21
Is there a simple code to get this done? Thanks so much for your help.
You can use substr if you always want to split after the second character.
District <- c("AR01", "AZ03", "AZ05", "AZ08", "CA01", "CA05", "CA11", "CA16", "CA18", "CA21")
# extract characters 1 through 2 (the state)
state <- substr(District, 1, 2)
# extract characters 3 through 4 (the district)
district <- substr(District, 3, 4)
# put in a data frame if needed
st_dt <- data.frame(state = state, district = district, stringsAsFactors = FALSE)
You could use strcapture from base R:
strcapture("(\\w{2})(\\w{2})", District,
           data.frame(state = character(), District = character()))
state District
1 AR 01
2 AZ 03
3 AZ 05
4 AZ 08
5 CA 01
6 CA 05
7 CA 11
8 CA 16
9 CA 18
10 CA 21
where \\w{2} matches two word characters (letters, digits, or underscore).
The OP has written:
"I'm more familiar with strsplit(). But since there is nothing to split on, it's not applicable in this case."
Au contraire! There is something to split on and it's called lookbehind:
strsplit(District, "(?<=[A-Z]{2})", perl = TRUE)
The lookbehind works like "inserting an invisible break" after 2 capital letters and splits the strings there.
The result is a list of vectors
[[1]]
[1] "AR" "01"
[[2]]
[1] "AZ" "03"
[[3]]
[1] "AZ" "05"
[[4]]
[1] "AZ" "08"
[[5]]
[1] "CA" "01"
[[6]]
[1] "CA" "05"
[[7]]
[1] "CA" "11"
[[8]]
[1] "CA" "16"
[[9]]
[1] "CA" "18"
[[10]]
[1] "CA" "21"
which can be turned into a matrix, e.g., by
do.call(rbind, strsplit(District, "(?<=[A-Z]{2})", perl = TRUE))
[,1] [,2]
[1,] "AR" "01"
[2,] "AZ" "03"
[3,] "AZ" "05"
[4,] "AZ" "08"
[5,] "CA" "01"
[6,] "CA" "05"
[7,] "CA" "11"
[8,] "CA" "16"
[9,] "CA" "18"
[10,] "CA" "21"
We can use str_match to capture the first two characters and the remaining string in separate columns.
stringr::str_match(District, "(..)(.*)")[, -1]
# [,1] [,2]
# [1,] "AR" "01"
# [2,] "AZ" "03"
# [3,] "AZ" "05"
# [4,] "AZ" "08"
# [5,] "CA" "01"
# [6,] "CA" "05"
# [7,] "CA" "11"
# [8,] "CA" "16"
# [9,] "CA" "18"
#[10,] "CA" "21"
With the tidyverse this is very easy using the function separate from tidyr:
library(tidyverse)
District %>%
  as.tibble() %>%
  separate(value, c("state", "district"), sep = "(?<=[A-Z]{2})")
# A tibble: 10 × 2
state district
<chr> <chr>
1 AR 01
2 AZ 03
3 AZ 05
4 AZ 08
5 CA 01
6 CA 05
7 CA 11
8 CA 16
9 CA 18
10 CA 21
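As a side note, separate() also accepts a numeric position for sep, so if the state code is always exactly two characters you can split by position instead of a lookbehind (a sketch; the output should be the same as above):
District %>%
  as.tibble() %>%
  separate(value, c("state", "district"), sep = 2)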
Treat it as a fixed-width file and import it:
# read fixed width file
read.fwf(textConnection(District), widths = c(2, 2), colClasses = "character")
# V1 V2
# 1 AR 01
# 2 AZ 03
# 3 AZ 05
# 4 AZ 08
# 5 CA 01
# 6 CA 05
# 7 CA 11
# 8 CA 16
# 9 CA 18
# 10 CA 21

Recode factors to number of my choosing

I'd like to convert NG to 0, SG to 1.25, LG to 7.25, MG to 26, and HG to 40.
My actual data, which looks exactly like the t below, is linked here:
actual data causing problems
t<-rep(c("NG","SG","LG","MG","HG"),each=5)
colnames(t)<-c("X.1","X1","X2","X4","X8","X12","X24","X48")
Why doesn't this work?
t[t=="NG"] <- "0"
t[t=="SG"] <- "1.25"
t[t=="LG"] <- "7.25"
t[t=="MG"] <- "26"
or this:
factor(t, levels=c("NG","SG","LG","MG", "HG"), labels=c("0","1.25","7.25","26","40"))
or this:
t <- sapply(t,switch,"NG"=0,"SG"=1.25,"LG"=7.25,"MG"=26, "HG"=40)
You may want this:
t <- rep(c(NG = 0, SG = 1.25, LG = 7.25, MG = 26, HG = 40), each = 5)
t <- factor(t)
levels(t)
# [1] "0" "1.25" "7.25" "26" "40"
labels(t)
# [1] "NG" "NG" "NG" "NG" "NG" "SG" "SG" "SG" "SG" "SG" "LG" "LG" "LG" "LG" "LG"
# [16] "MG" "MG" "MG" "MG" "MG" "HG" "HG" "HG" "HG" "HG"
The internal codes for the factor will always be integers, so you can't create a factor with internal codes that are double precision floats.
unclass(t)
# NG NG NG NG NG SG SG SG SG SG LG LG LG LG LG MG MG MG MG MG HG HG HG HG HG
# 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5
# attr(,"levels")
# [1] "0" "1.25" "7.25" "26" "40"
You can still extract the numerical value using the label for a level:
t["SG"]
# SG
# 1.25
# Levels: 0 1.25 7.25 26 40
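If what you ultimately need is the numeric values themselves rather than a factor, a named lookup vector applied to the original character data is a direct route (a sketch; lookup and t_chr are just local names):
lookup <- c(NG = 0, SG = 1.25, LG = 7.25, MG = 26, HG = 40)
t_chr <- rep(c("NG", "SG", "LG", "MG", "HG"), each = 5)
unname(lookup[t_chr])
#  [1]  0.00  0.00  0.00  0.00  0.00  1.25  1.25  1.25  1.25  1.25  7.25  7.25
# [13]  7.25  7.25  7.25 26.00 26.00 26.00 26.00 26.00 40.00 40.00 40.00 40.00
# [25] 40.00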

R - extract values from strings

I'm working with some data that has strings like these:
1) C: 0.664 (3327)T: 0.336 (1681)
2) C|C: 0.462 (1158)C|T: 0.404 (1011)T|T: 0.134 (335)
I'm interested in extracting just the letters and the numbers within the parenthesis to get data frames like these:
1)
L1 N1 L2 N2
C 3327 T 1681
2)
L1 N1 L2 N2 L3 N3
CC 1158 CT 1011 TT 335
Is there any function/package or efficient way to do this in R?
We could also use stri_extract_all_regex from library(stringi) after removing the | with gsub. The pattern matches one or more characters that are not ) followed by a lookahead for : ([^)]+(?=:)), or one or more characters that are not ) preceded by a lookbehind for ( ((?<=\\()[^)]+).
library(stringi)
stri_extract_all_regex(gsub('\\|', '', x), '[^)]+(?=:)|(?<=\\()[^)]+')
#[[1]]
#[1] "C" "3327" "T" "1681"
#[[2]]
#[1] "CC" "1158" "CT" "1011" "TT" "335"
We could also use two gsub calls and then convert the output to a data.frame. With this method, the numeric and character elements keep their respective classes.
res <- read.table(text = gsub('\\:[^(]+|[()]', ' ', gsub('[|]', '', x)),
                  sep = '', header = FALSE, stringsAsFactors = FALSE,
                  na.strings = '', fill = TRUE)
# V1 V2 V3 V4 V5 V6
#1 C 3327 T 1681 <NA> NA
#2 CC 1158 CT 1011 TT 335
str(res)
#'data.frame': 2 obs. of 6 variables:
# $ V1: chr "C" "CC"
# $ V2: int 3327 1158
# $ V3: chr "T" "CT"
# $ V4: int 1681 1011
# $ V5: chr NA "TT"
# $ V6: int NA 335
NOTE: We can change the column names using colnames().
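For instance, to reproduce the column layout requested in the question:
colnames(res) <- c("L1", "N1", "L2", "N2", "L3", "N3")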
Example
x = c(
"C: 0.664 (3327)T: 0.336 (1681)",
"C|C: 0.462 (1158)C|T: 0.404 (1011)T|T: 0.134 (335)"
)
Select parts
s = strsplit(x, "\\)|(:.*?\\()")
# [[1]]
# [1] "C" "3327" "T" "1681"
#
# [[2]]
# [1] "C|C" "1158" "C|T" "1011" "T|T" "335"
The regex matches two things: \\) or :.*?\\(. In the second:
. matches any character
* quantifies the match as "any character any number of times"
? makes the quantifier "non-greedy", so the match stops at the first \\( (even though ( is itself a character that . could match); see the comparison below.
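To see the difference the ? makes, compare greedy and non-greedy on the first string (the " | " replacement text is just a marker):
sub(":.*\\(", " | ", x[1])   # greedy: runs through to the last "("
# [1] "C | 1681)"
sub(":.*?\\(", " | ", x[1])  # non-greedy: stops at the first "("
# [1] "C | 3327)T: 0.336 (1681)"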
From there, it's pretty straightforward to perform your remaining formatting tasks:
Map(function(r, n)
      setNames(gsub("\\|", "", r),
               paste0(c("L", "N"), rep(seq(n), each = 2))),
    s,
    lengths(s) / 2)
# [[1]]
# L1 N1 L2 N2
# "C" "3327" "T" "1681"
#
# [[2]]
# L1 N1 L2 N2 L3 N3
# "CC" "1158" "CT" "1011" "TT" "335"

Finding most frequent term in each document of a corpus

I've been using R's tm package with much success on classification problems. I know how to find the most frequent terms across the entire corpus (with findFreqTerms()), but I don't see anything in the documentation that would find the most frequent term (after I've stemmed and removed stopwords, but before I remove sparse terms) in each individual document of the corpus. I've tried using apply() with max, but this gives me the maximum number of times a term occurs in each document, not the name of the term itself.
library(tm)
data("crude")
corpus<-tm_map(crude, removePunctuation)
corpus<-tm_map(corpus, stripWhitespace)
corpus<-tm_map(corpus, tolower)
corpus<-tm_map(corpus, removeWords, stopwords("English"))
corpus<-tm_map(corpus, stemDocument)
dtm <- DocumentTermMatrix(corpus)
maxterms<-apply(dtm, 1, max)
maxterms
127 144 191 194 211 236 237 242 246 248 273 349 352
5 13 2 3 3 10 8 3 7 9 9 4 5
353 368 489 502 543 704 708
4 4 4 5 5 9 4
Thoughts?
Ben's answer gives what you asked for, but I am not sure that what you asked for is wise: it does not account for ties. Here is one approach, and a second one using the qdap package. They will give you lists of the words (in qdap's case, a list of data frames with words and frequencies). You can use unlist to get the rest of the way with the first option, and lapply, indexing, and unlist with qdap. The qdap approach works on the raw Corpus:
Option #1:
apply(dtm, 1, function(x)
  unlist(dtm[["dimnames"]][2], use.names = FALSE)[x == max(x)])
Option #2 with qdap:
library(qdap)
dat <- tm_corpus2df(crude)
tapply(stemmer(dat$text), dat$docs, freq_terms, top = 1,
       stopwords = tm::stopwords("English"))
Wrapping the tapply in lapply(WRAP_HERE, "[", 1) makes the two answers identical in content and nearly identical in format, as shown below.
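Concretely, that wrapping looks like this (just substituting the tapply call above for WRAP_HERE):
lapply(tapply(stemmer(dat$text), dat$docs, freq_terms, top = 1,
              stopwords = tm::stopwords("English")), "[", 1)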
EDIT: Added an example that is a leaner use of qdap:
FUN <- function(x) freq_terms(x, top = 1, stopwords = stopwords("English"))[, 1]
lapply(stemmer(crude), FUN)
## [[1]]
## [1] "oil" "price"
##
## [[2]]
## [1] "opec"
##
## [[3]]
## [1] "canada" "canadian" "crude" "oil" "post" "price" "texaco"
##
## [[4]]
## [1] "crude"
##
## [[5]]
## [1] "estim" "reserv" "said" "trust"
##
## [[6]]
## [1] "kuwait" "said"
##
## [[7]]
## [1] "report" "say"
##
## [[8]]
## [1] "yesterday"
##
## [[9]]
## [1] "billion"
##
## [[10]]
## [1] "market" "price"
##
## [[11]]
## [1] "mln"
##
## [[12]]
## [1] "oil"
##
## [[13]]
## [1] "oil" "price"
##
## [[14]]
## [1] "oil" "opec"
##
## [[15]]
## [1] "power"
##
## [[16]]
## [1] "oil"
##
## [[17]]
## [1] "oil"
##
## [[18]]
## [1] "dlrs"
##
## [[19]]
## [1] "futur"
##
## [[20]]
## [1] "januari"
You're almost there: replace max with which.max to get the column index of the term with the highest frequency per document (i.e., per row). Then use that vector of column indices to subset the Terms (essentially the column names) of the document-term matrix. That returns, for each document, the actual term with the maximum frequency in that document (rather than just the frequency value, as max gives). So, following from your example:
maxterms<-apply(dtm, 1, which.max)
dtm$dimnames$Terms[maxterms]
[1] "oil" "opec" "canada" "crude" "said" "said" "report" "oil"
[9] "billion" "oil" "mln" "oil" "oil" "oil" "power" "oil"
[17] "oil" "dlrs" "futures" "january"
