Why does the ngrams() function give distinct bigrams? - r

I am writing an R script and am using library(ngram).
Suppose I have a string,
"good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"
and want to find bi-grams.
The ngram library is giving me bi-grams as follows:
"appreci product" "process meat" "food product" "food bought" "qualiti dog" "product found" "product look" "look like" "like stew" "good qualiti" "labrador finicki" "bought sever" "qualiti product" "better labrador"
"dog food" "smell better" "vital can" "meat smell" "found good" "sever vital" "stew process" "can dog" "finicki appreci" "product better"
As the sentence contains "dog food" two times, I want this bi-gram two times. But I am getting it once!
Is there an option in thengram library or any other library that gives all the bi-grams of my sentence in R?

The development version of ngram has a get.phrasetable method:
devtools::install_github("wrathematics/ngram")
library(ngram)
text <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"
ng <- ngram(text)
head(get.phrasetable(ng))
# ngrams freq prop
# 1 good qualiti 2 0.07692308
# 2 dog food 2 0.07692308
# 3 appreci product 1 0.03846154
# 4 process meat 1 0.03846154
# 5 food product 1 0.03846154
# 6 food bought 1 0.03846154
In addition, you can use the print() method and specify output == "full". That is:
print(ng, output = "full")
# NOTE: more output not shown...
better labrador | 1
finicki {1} |
dog food | 2
product {1} | bought {1}
# NOTE: more output not shown...

You can use stylo package. Gives duplicates:
library(stylo)
a = "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"
b = txt.to.words(a)
c = make.ngrams(b, ngram.size = 2)
print(c)
Result:
[1] "good qualiti" "qualiti dog" "dog food" "food bought" "bought sever" "sever vital" "vital can" "can dog" "dog food"
[10] "food product" "product found" "found good" "good qualiti" "qualiti product" "product look" "look like" "like stew" "stew process"
[19] "process meat" "meat smell" "smell better" "better labrador" "labrador finicki" "finicki appreci" "appreci product" "product better"
>

You could use RWeka. In the result you can see "dog food" and "good qualiti" appearing twice
txt <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"
library(RWeka)
RWEKABigramTokenizer <- function(x) {
NGramTokenizer(x, Weka_control(min = 2, max = 2))
}
RWEKABigramTokenizer(txt)
[1] "good qualiti" "qualiti dog" "dog food" "food bought" "bought sever" "sever vital" "vital can"
[8] "can dog" "dog food" "food product" "product found" "found good" "good qualiti" "qualiti product"
[15] "product look" "look like" "like stew" "stew process" "process meat" "meat smell" "smell better"
[22] "better labrador" "labrador finicki" "finicki appreci" "appreci product" "product better"
Or use the tm package in combination with RWeka
library(tm)
library(RWeka)
my_corp <- Corpus(VectorSource(txt))
tdm_RWEKA <- TermDocumentMatrix(my_corp, control=list(tokenize = RWEKABigramTokenizer))
#show the 2 bigrams
findFreqTerms(tdm_RWEKA, lowfreq = 2)
[1] "dog food" "good qualiti"
#turn into matrix with frequency counts
tdm_matrix <- as.matrix(tdm_RWEKA)

In order to produce such bi-gram, you don't need any special package. Basically, split the text and paste it together again.
t <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"
ug <- strsplit(t, " ")[[1]]
bg <- paste(ug, ug[2:length(ug)])
The resulted bg would be:
[1] "good qualiti" "qualiti dog" "dog food"
[4] "food bought" "bought sever" "sever vital"
[7] "vital can" "can dog" "dog food"
[10] "food product" "product found" "found good"
[13] "good qualiti" "qualiti product" "product look"
[16] "look like" "like stew" "stew process"
[19] "process meat" "meat smell" "smell better"
[22] "better labrador" "labrador finicki" "finicki appreci"
[25] "appreci product" "product better" "better qualiti"

Try the quanteda package:
> quanteda::tokenize(txt, ngrams = 2, concatenator = " ")
[[1]]
[1] "good qualiti" "qualiti dog" "dog food" "food bought" "bought sever" "sever vital"
[7] "vital can" "can dog" "dog food" "food product" "product found" "found good"
[13] "good qualiti" "qualiti product" "product look" "look like" "like stew" "stew process"
[19] "process meat" "meat smell" "smell better" "better labrador" "labrador finicki" "finicki appreci"
[25] "appreci product" "product better"
Plenty of additional arguments available through ngrams, including getting different combinations of n sizes, or skip-grams.

Related

How to count how many string in a column

I have a datasets with a column "amenities" and I want to count how many amenities in each row.
> airbnbT$amenities[1]
[1] ["Essentials", "Refrigerator", "Shampoo", "TV", "Dedicated workspace", "Hangers", "Iron", "Long term stays allowed", "Dishes and silverware", "First aid kit", "Free parking on premises", "Hair dryer", "Patio or balcony", "Washer", "Dryer", "Cooking basics", "Coffee maker", "Private entrance", "Hot water", "Fire extinguisher", "Wifi", "Air conditioning", "Hot tub", "Kitchen", "Microwave", "Oven", "Smoke alarm"]
14673 Levels: ["Air conditioning", "Baby bath", "Long term stays allowed", "Baby monitor"] ...
> class(airbnbT$amenities[1])
[1] "factor"
Here for row 1, there are 27 amenities.
Is there a way to count the comma in each row "," ? This way would count the numbers of amenities.
Try str_count from the stringr package. You will need to add 1 since there will be one fewer comma than the number of amenities:
library(stringr)
airbnbT$amenities_count = str_count(airbnbT$amenities,",") + 1

Scraping keywords on PHP page

I would like to scrape the keywords inside the dropdown table of this webpage https://www.aeaweb.org/jel/guide/jel.php
The problem is that the drop-down menu of each item prevents me from scraping the table directly because it only takes the heading and not the inner content of each item.
rvest::read_html("https://www.aeaweb.org/jel/guide/jel.php") %>%
rvest::html_table()
I thought of scraping each line that starts with Keywords: but I do not get how can I do that. Seems like the HTML is not showing the items inside the table.
A RSelenium solution,
#Start the server
library(RSelenium)
driver = rsDriver(
browser = c("firefox"))
remDr <- driver[["client"]]
#Navigate to the url
remDr$navigate("https://www.aeaweb.org/jel/guide/jel.php")
#xpath of the table
remDr$findElement(using = "xpath",'/html/body/main/div/section/div[4]') -> out
#get text from the table
out <- out$getElementText()
out= out[[1]]
Split using stringr package
library(stringr)
str_split(out, "\n", n = Inf, simplify = FALSE)
[[1]]
[1] "A General Economics and Teaching"
[2] "B History of Economic Thought, Methodology, and Heterodox Approaches"
[3] "C Mathematical and Quantitative Methods"
[4] "D Microeconomics"
[5] "E Macroeconomics and Monetary Economics"
[6] "F International Economics"
[7] "G Financial Economics"
[8] "H Public Economics"
[9] "I Health, Education, and Welfare"
[10] "J Labor and Demographic Economics"
[11] "K Law and Economics"
[12] "L Industrial Organization"
[13] "M Business Administration and Business Economics; Marketing; Accounting; Personnel Economics"
[14] "N Economic History"
[15] "O Economic Development, Innovation, Technological Change, and Growth"
[16] "P Economic Systems"
[17] "Q Agricultural and Natural Resource Economics; Environmental and Ecological Economics"
[18] "R Urban, Rural, Regional, Real Estate, and Transportation Economics"
[19] "Y Miscellaneous Categories"
[20] "Z Other Special Topics"
To get the Keywords for History of Economic Thought, Methodology, and Heterodox Approaches
out1 <- remDr$findElement(using = 'xpath', value = '//*[#id="cl_B"]')
out1$clickElement()
out1 <- remDr$findElement(using = 'xpath', value = '/html/body/main/div/section/div[4]/div[2]/div[2]/div/div/div/div[2]')
out1$getElementText()
[[1]]
[1] "Keywords: History of Economic Thought"

Split 2 numbers character into it's own new line

I've a character object with 84 elements.
> head(output.by.line)
[1] "\n17"
[2] "Now when Joseph saw that his father"
[3] "laid his right hand on the head of"
[4] "Ephraim, it displeased him; so he took"
[5] "hold of his father's hand to remove it"
[6] "from Ephraim's head to Manasseh's"
But there is a line that has 2 numbers (49) that is not in it's own line:
[35] "49And Jacob called his sons and"
I'd like to transform this into:
[35] "\n49"
[36] "And Jacob called his sons and"
And insert this in the correct numeration, after object 34.
Dput Output:
dput(output.by.line)
c("\n17", "Now when Joseph saw that his father", "laid his right hand on the head of",
"Ephraim, it displeased him; so he took", "hold of his father's hand to remove it",
"from Ephraim's head to Manasseh's", "head.", "\n18", "And Joseph said to his father, \"Not so,",
"my father, for this one is the firstborn;", "put your right hand on his head.\"",
"\n19", "But his father refused and said, \"I", "know, my son, I know. He also shall",
"become a people, and he also shall be", "great; but truly his younger brother shall",
"be greater than he, and his descendants", "shall become a multitude of nations.\"",
"\n20", "So he blessed them that day, saying,", "\"By you Israel will bless, saying, \"May",
"God make you as Ephraim and as", "Manasseh!\"' And thus he set Ephraim",
"before Manasseh.", "\n21", "Then Israel said to Joseph, \"Behold, I",
"am dying, but God will be with you and", "bring you back to the land of your",
"fathers.", "\n22", "Moreover I have given to you one", "portion above your brothers, which I",
"took from the hand of the Amorite with", "my sword and my bow.\"",
"49And Jacob called his sons and", "said, \"Gather together, that I may tell",
"you what shall befall you in the last", "days:", "\n2", "\"Gather together and hear, you sons of",
"Jacob, And listen to Israel your father.", "\n3", "\"Reuben, you are my firstborn, My",
"might and the beginning of my strength,", "The excellency of dignity and the",
"excellency of power.", "\n4", "Unstable as water, you shall not excel,",
"Because you went up to your father's", "bed; Then you defiled it-- He went up to",
"my couch.", "\n5", "\"Simeon and Levi are brothers;", "Instruments of cruelty are in their",
"dwelling place.", "\n6", "Let not my soul enter their council; Let",
"not my honor be united to their", "assembly; For in their anger they slew a",
"man, And in their self-will they", "hamstrung an ox.", "\n7",
"Cursed be their anger, for it is fierce;", "And their wrath, for it is cruel! I will",
"divide them in Jacob And scatter them", "in Israel.", "\n8",
"\"Judah, you are he whom your brothers", "shall praise; Your hand shall be on the",
"neck of your enemies; Your father's", "children shall bow down before you.",
"\n9", "Judah is a lion's whelp; From the prey,", "my son, you have gone up. He bows",
"down, he lies down as a lion; And as a", "lion, who shall rouse him?",
"\n10", "The scepter shall not depart from", "Judah, Nor a lawgiver from between his",
"feet, Until Shiloh comes; And to Him", "shall be the obedience of the people.",
"\n11", "Binding his donkey to the vine, And his", "donkey's colt to the choice vine, He"
)
Please, check this:
library(tidyverse)
split_line_number <- function(x) {
x %>%
str_replace("^([0-9]+)", "\n\\1\b") %>%
str_split("\b")
}
output.by.line %>%
map(split_line_number) %>%
unlist()
# Output:
# [35] "\n49"
# [36] "And Jacob called his sons and"
# [37] "said, \"Gather together, that I may tell"
# [38] "you what shall befall you in the last"
An option using stringr::str_match is to match two components of an optional number followed by everything. Get the captured output from the matched matrix (2:3) and create a new vector of strings by dropping NAs and empty strings.
vals <- c(t(stringr::str_match(output.by.line, "(\n?\\d+)?(.*)")[, 2:3]))
output <- vals[!is.na(vals) & vals != ""]
output[32:39]
#[1] "portion above your brothers, which I"
#[2] "took from the hand of the Amorite with"
#[3] "my sword and my bow.\""
#[4] "49"
#[5] "And Jacob called his sons and"
#[6] "said, \"Gather together, that I may tell"
#[7] "you what shall befall you in the last" "days:"
We'll make use of the stringr package:
library(stringr)
Modify the object:
output.by.line <- unlist(
ifelse(grepl('[[:digit:]][[:alpha:]]', output.by.line), str_split(gsub('([[:digit:]]+)([[:alpha:]])', paste0('\n', '\\1 \\2'), output.by.line), '[[:blank:]]', n = 2), output.by.line)
)
Print the resuts:
dput(output.by.line)
#[32] "portion above your brothers, which I"
#[33] "took from the hand of the Amorite with"
#[34] "my sword and my bow.\""
#[35] "\n49"
#[36] "And Jacob called his sons and"
#[37] "said, \"Gather together, that I may tell"
#[38] "you what shall befall you in the last"

Words in sentences and their nearest neighbors in a lexicons

I have following data frame:
sent <- data.frame(words = c("just right size", "size love quality", "laptop worth price", "price amazing user",
"explanation complex what", "easy set", "product best buy", "buy priceless when"), user = c(1,2,3,4,5,6,7,8))
Sent data frame resulted into:
words user
just right size 1
size love quality 2
laptop worth price 3
price amazing user 4
explanation complex what 5
easy set 6
product best buy 7
buy priceless when 8
I need to remove word at the begining of following sentence which is the same as a word at the end of previous sentece.
I mean eg. we have a sentences "just right size" and "size love quality", so I need to remove word size at the second user possition.
Then sentences "laptop worth price" and "price amazing user", so I need to remove word price at fourth user possition.
Can anyone help me, I'll appreciate any of your help. Thank you very much in advance.
You could extract the "first" and "last" word from the "words" column for the succeeding row and the current row using sub. If the words are the same, remove the first word from the succeeding row or else keep it as such (ifelse(...))
w1 <- sub(' .*', '', sent$words[-1])
w2 <- sub('.* ', '', sent$words[-nrow(sent)])
sent$words <- as.character(sent$words)
sent$words
#[1] "just right size" "size love quality"
#[3] "laptop worth price" "price amazing user"
#[5] "explanation complex what" "easy set"
#[7] "product best buy" "buy priceless when"
sent$words[-1] <- with(sent, ifelse(w1==w2, sub('\\w+ ', '',words[-1]),
words[-1]))
sent$words
#[1] "just right size" "love quality"
#[3] "laptop worth price" "amazing user"
#[5] "explanation complex what" "easy set"
#[7] "product best buy" "priceless when"

How to split a string conditionally in R?

I would like to split a string into multiple columns based on a number of conditions.
An example of my data:
Col1<- c("01/05/2004 02:59", "01/05/2004 05:04", "01/06/2004 07:19", "01/07/2004 02:55", "01/07/2004 04:32", "01/07/2004 04:38", "01/07/2004 17:13", "01/07/2004 18:40", "01/07/2004 20:58", "01/07/2004 23:39", "01/09/2004 13:28")
Col2<- c("Wabamun #4 off line.", "Keephills #2 on line.", "Wabamun #1 on line.", "North Red Deer T217s bus lock out. Under investigation.", "T217s has blown CTs on 778L", "T217s North Red Deer bus back in service (778L out of service)", "Keephills #2 off line.", "Wabamun #4 on line.", "Sundance #1 off line.", "Keephills #2 on line", "Homeland security event lowered to yellow ( elevated)")
df<- data.frame(Col1,Col2)
I would like to be able to split column w conditionally.
To get something like this:
Col3<- c("Wabamun #4", "Keephills #2", "Wabamun #1", "General Asset", "General Asset", "General Asset", "Keephills #2", "Wabamun #4", "Sundance #1", "Keephills #2", "General Asset")
Col4<- c("off line.", "on line.", "on line.", "North Red Deer T217s bus lock out. Under investigation.", "T217s has blown CTs on 778L", "T217s North Red Deer bus back in service (778L out of service)", "off line.", "on line.", "off line.", "on line", "Homeland security event lowered to yellow ( elevated)")
After I'm planning to find the times between when an asset goes down and comes back online. These are often generator plants so I would also be looking up the capacity of the plant. Example Keephills #2 has a capacity of 300MW.
Thankfully, regular expressions are here to save the day.
# This line prevents character strings turning into factors
df<- data.frame(Col1,Col2, stringsAsFactors=FALSE)
# This match works with the powerplant names as
# they're all 1 or more characters followed by a space, hash and single digit.
pwrmatch <- regexpr("^[[:alpha:]]+ #[[:digit:]]", df$Col2)
df$Col3 <- "General Asset"
df$Col3[grepl("^[[:alpha:]]+ #[[:digit:]]", df$Col2)] <- regmatches(df$Col2, pwrmatch)
Col3 now looks like: c("Wabamun #4", "Keephills #2", "Wabamun #1", "General Asset",
"General Asset", "General Asset", "Keephills #2", "Wabamun #4",
"Sundance #1", "Keephills #2", "General Asset")
The other line is a similar matter, simply matching all cases of on/off line.
linematch <- regexpr("(on|off) line", df$Col2)
df$Col4 <- df$Col2
df$Col4[grepl("(on|off) line", df$Col2)] <- regmatches(df$Col2, linematch)
Col4 now looks like: c("off line", "on line", "on line", "North Red Deer T217s bus lock out. Under investigation.",
"T217s has blown CTs on 778L", "T217s North Red Deer bus back in service (778L out of service)",
"off line", "on line", "off line", "on line", "Homeland security event lowered to yellow ( elevated)"
)
> Col3 <- Col4 <- character(nrow(df))
> index <- grep("#", Col2, invert = TRUE)
> spl1 <- unlist(strsplit(Col2[-index], " o"))[c(TRUE, FALSE)]
> Col3[-index] <- spl1
> Col3[index] <- "General Asset"
> spl2 <- unlist(strsplit(Col2[-index], " o"))[c(FALSE, TRUE)]
> Col4[-index] <- paste("o", spl2, sep="")
> Col4[index] <- Col2[index]
> Col3
## [1] "Wabamun #4" "Keephills #2" "Wabamun #1" "General Asset"
## [5] "General Asset" "General Asset" "Keephills #2" "Wabamun #4"
## [9] "Sundance #1" "Keephills #2" "General Asset"
> Col4
## [1] "off line."
## [2] "on line."
## [3] "on line."
## [4] "North Red Deer T217s bus lock out. Under investigation."
## [5] "T217s has blown CTs on 778L"
## [6] "T217s North Red Deer bus back in service (778L out of service)"
## [7] "off line."
## [8] "on line."
## [9] "off line."
## [10] "on line"
## [11] "Homeland security event lowered to yellow ( elevated)"

Resources