Split a 2-digit number in a character vector into its own new line - r

I have a character vector with 84 elements.
> head(output.by.line)
[1] "\n17"
[2] "Now when Joseph saw that his father"
[3] "laid his right hand on the head of"
[4] "Ephraim, it displeased him; so he took"
[5] "hold of his father's hand to remove it"
[6] "from Ephraim's head to Manasseh's"
But there is a line with a 2-digit number (49) that is not on its own line:
[35] "49And Jacob called his sons and"
I'd like to transform this into:
[35] "\n49"
[36] "And Jacob called his sons and"
And insert it at the correct position, after element 34.
dput output:
dput(output.by.line)
c("\n17", "Now when Joseph saw that his father", "laid his right hand on the head of",
"Ephraim, it displeased him; so he took", "hold of his father's hand to remove it",
"from Ephraim's head to Manasseh's", "head.", "\n18", "And Joseph said to his father, \"Not so,",
"my father, for this one is the firstborn;", "put your right hand on his head.\"",
"\n19", "But his father refused and said, \"I", "know, my son, I know. He also shall",
"become a people, and he also shall be", "great; but truly his younger brother shall",
"be greater than he, and his descendants", "shall become a multitude of nations.\"",
"\n20", "So he blessed them that day, saying,", "\"By you Israel will bless, saying, \"May",
"God make you as Ephraim and as", "Manasseh!\"' And thus he set Ephraim",
"before Manasseh.", "\n21", "Then Israel said to Joseph, \"Behold, I",
"am dying, but God will be with you and", "bring you back to the land of your",
"fathers.", "\n22", "Moreover I have given to you one", "portion above your brothers, which I",
"took from the hand of the Amorite with", "my sword and my bow.\"",
"49And Jacob called his sons and", "said, \"Gather together, that I may tell",
"you what shall befall you in the last", "days:", "\n2", "\"Gather together and hear, you sons of",
"Jacob, And listen to Israel your father.", "\n3", "\"Reuben, you are my firstborn, My",
"might and the beginning of my strength,", "The excellency of dignity and the",
"excellency of power.", "\n4", "Unstable as water, you shall not excel,",
"Because you went up to your father's", "bed; Then you defiled it-- He went up to",
"my couch.", "\n5", "\"Simeon and Levi are brothers;", "Instruments of cruelty are in their",
"dwelling place.", "\n6", "Let not my soul enter their council; Let",
"not my honor be united to their", "assembly; For in their anger they slew a",
"man, And in their self-will they", "hamstrung an ox.", "\n7",
"Cursed be their anger, for it is fierce;", "And their wrath, for it is cruel! I will",
"divide them in Jacob And scatter them", "in Israel.", "\n8",
"\"Judah, you are he whom your brothers", "shall praise; Your hand shall be on the",
"neck of your enemies; Your father's", "children shall bow down before you.",
"\n9", "Judah is a lion's whelp; From the prey,", "my son, you have gone up. He bows",
"down, he lies down as a lion; And as a", "lion, who shall rouse him?",
"\n10", "The scepter shall not depart from", "Judah, Nor a lawgiver from between his",
"feet, Until Shiloh comes; And to Him", "shall be the obedience of the people.",
"\n11", "Binding his donkey to the vine, And his", "donkey's colt to the choice vine, He"
)

Please check this:
library(tidyverse)
split_line_number <- function(x) {
  x %>%
    str_replace("^([0-9]+)", "\n\\1\b") %>%
    str_split("\b")
}
output.by.line %>%
  map(split_line_number) %>%
  unlist()
# Output:
# [35] "\n49"
# [36] "And Jacob called his sons and"
# [37] "said, \"Gather together, that I may tell"
# [38] "you what shall befall you in the last"
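The same split can be sketched in base R without the tidyverse, assuming (as in the data above) that verse numbers only ever appear glued to the start of an element. A throwaway separator character is inserted after a leading digit run and then split on:

```r
# Insert a carriage-return separator after a leading digit run, then
# split on it; elements without a leading number pass through unchanged.
x <- c("my sword and my bow.\"", "49And Jacob called his sons and")
out <- unlist(strsplit(sub("^([0-9]+)", "\n\\1\r", x), "\r", fixed = TRUE))
out
# out is c("my sword and my bow.\"", "\n49", "And Jacob called his sons and")
```

Any character guaranteed not to occur in the text works as the separator; "\r" is just one safe choice here.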

An option using stringr::str_match is to match two components: an optional number (with an optional leading newline) followed by everything else. Take the captured groups from columns 2:3 of the match matrix, interleave them row by row, and create a new vector of strings by dropping NAs and empty strings.
vals <- c(t(stringr::str_match(output.by.line, "(\n?\\d+)?(.*)")[, 2:3]))
output <- vals[!is.na(vals) & vals != ""]
output[32:39]
#[1] "portion above your brothers, which I"
#[2] "took from the hand of the Amorite with"
#[3] "my sword and my bow.\""
#[4] "49"
#[5] "And Jacob called his sons and"
#[6] "said, \"Gather together, that I may tell"
#[7] "you what shall befall you in the last"
#[8] "days:"

We'll make use of the stringr package:
library(stringr)
Modify the object:
output.by.line <- unlist(
  ifelse(
    grepl('[[:digit:]][[:alpha:]]', output.by.line),
    str_split(
      gsub('([[:digit:]]+)([[:alpha:]])', '\n\\1 \\2', output.by.line),
      '[[:blank:]]', n = 2
    ),
    output.by.line
  )
)
Print the results:
dput(output.by.line)
#[32] "portion above your brothers, which I"
#[33] "took from the hand of the Amorite with"
#[34] "my sword and my bow.\""
#[35] "\n49"
#[36] "And Jacob called his sons and"
#[37] "said, \"Gather together, that I may tell"
#[38] "you what shall befall you in the last"

How to extract conversational utterances from single string

I have a conversation between several speakers recorded as a single string:
convers <- "Peter: hiya Mary: hi how wz your weekend Peter: ahh still got a headache An you party a lot Mary: nuh you know my kid s sick n stuff Peter: yeah i know thats erm al hamshi: hey guys how s it goin Peter: Great Mary: where ve you been last week al hamshi: ah you know camping with my girl friend"
I also have a vector of the speakers' names:
speakers <- c("Peter", "Mary", "al hamshi")
Using this vector as a component of my regex pattern I'm doing relatively well with this extraction:
library(stringr)
str_extract_all(convers,
paste("(?<=: )[\\w\\s]+(?= ", paste0(".*\\b(", paste(speakers, collapse = "|"), ")\\b.*"), ")", sep = ""))
[[1]]
[1] "hiya" "hi how wz your weekend" "ahh still got a headache An you party a lot"
[4] "nuh you know my kid s sick n stuff" "yeah i know thats erm al" "hey guys how s it goin"
[7] "Great" "where ve you been last week"
However, the first part of the third speaker's name (al) is contained in one of the extracted utterances (yeah i know thats erm al), and the last utterance by speaker al hamshi (ah you know camping with my girl friend) is missing from the output. How can the regex be improved so that all utterances get matched and extracted correctly?
What if you take another approach?
Remove all speakers from the text and split the string on '\\s*:\\s*'
strsplit(gsub(paste(speakers, collapse = "|"), '', convers), '\\s*:\\s*')[[1]]
# [1] "" "hiya"
# [3] "hi how wz your weekend" "ahh still got a headache An you party a lot"
# [5] "nuh you know my kid s sick n stuff" "yeah i know thats erm"
# [7] "hey guys how s it goin" "Great"
# [9] "where ve you been last week" "ah you know camping with my girl friend"
You can clean up the output a bit to remove the first empty value from it.
A correct splitting approach would look like
p2 <- paste0("\\s*\\b(?:", paste(speakers, collapse = "|"), ")(?=:)")
strsplit(sub("^\\W+", "", gsub(p2, "", convers, perl=TRUE)), "\\s*:\\s*")[[1]]
# => [1] "hiya"
# => [2] "hi how wz your weekend"
# => [3] "ahh still got a headache An you party a lot"
# => [4] "nuh you know my kid s sick n stuff"
# => [5] "yeah i know thats erm"
# => [6] "hey guys how s it goin"
# => [7] "Great"
# => [8] "where ve you been last week"
# => [9] "ah you know camping with my girl friend"
The regex to remove speakers from the string will look like
\s*\b(?:Peter|Mary|al hamshi)(?=:)
See the regex demo. It will match
\s* - 0+ whitespaces
\b - a word boundary
(?:Peter|Mary|al hamshi) - one of the speaker names
(?=:) - that must be followed with a : char.
Then, the non-word chars at the start are removed with the sub("^\\W+", "", ...) call, and then the whole string is split with \s*:\s* regex that matches a : enclosed with 0+ whitespaces.
Alternatively, you can use
(?<=(?:Peter|Mary|al hamshi):\s).*?(?=\s*(?:Peter|Mary|al hamshi):|\z)
See this regex demo. Details:
(?<=(?:Peter|Mary|al hamshi):\s) - a location immediately preceded with any speaker name and a whitespace
.*? - any 0+ chars (other than line break chars, use (?s) at the pattern start to make it match any chars) as few as possible
(?=\s*(?:Peter|Mary|al hamshi):|\z) - a location immediately followed with 0+ whitespaces, then any speaker name and a : or end of string.
In R, you can use
library(stringr)
speakers <- c("Peter", "Mary", "al hamshi")
convers <- "Peter: hiya Mary: hi how wz your weekend Peter: ahh still got a headache An you party a lot Mary: nuh you know my kid s sick n stuff Peter: yeah i know thats erm al hamshi: hey guys how s it goin Peter: Great Mary: where ve you been last week al hamshi: ah you know camping with my girl friend"
p = paste0("(?<=(?:",paste(speakers, collapse="|"),"):\\s).*?(?=\\s*(?:", paste(speakers, collapse="|"),"):|\\z)")
str_extract_all(convers, p)
# => [[1]]
# => [1] "hiya"
# => [2] "hi how wz your weekend"
# => [3] "ahh still got a headache An you party a lot"
# => [4] "nuh you know my kid s sick n stuff"
# => [5] "yeah i know thats erm"
# => [6] "hey guys how s it goin"
# => [7] "Great"
# => [8] "where ve you been last week"
# => [9] "ah you know camping with my girl friend"
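Building on the removal-and-split idea, a hedged sketch that also recovers who said what: extract the speaker labels in order with the same alternation, then pair them with the split utterances (spk_rx, who, and utt are names introduced here for illustration):

```r
speakers <- c("Peter", "Mary", "al hamshi")
convers <- "Peter: hiya Mary: hi how wz your weekend Peter: ahh still got a headache An you party a lot Mary: nuh you know my kid s sick n stuff Peter: yeah i know thats erm al hamshi: hey guys how s it goin Peter: Great Mary: where ve you been last week al hamshi: ah you know camping with my girl friend"

# A name counts as a speaker label only when immediately followed by ":"
spk_rx <- paste0("\\b(?:", paste(speakers, collapse = "|"), ")(?=:)")

# Speaker labels, in order of appearance
who <- regmatches(convers, gregexpr(spk_rx, convers, perl = TRUE))[[1]]

# Utterances: strip the labels, trim the leading ": ", split on ":"
utt <- strsplit(sub("^\\W+", "",
                    gsub(paste0("\\s*", spk_rx), "", convers, perl = TRUE)),
                "\\s*:\\s*")[[1]]

data.frame(speaker = who, utterance = utt)
```

The lookahead (?=:) keeps the standalone "al" inside "thats erm al" from being mistaken for a speaker label, since only the full "al hamshi" followed by a colon can match.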

Changing "firstname lastname" to "lastname, firstname"

I have a list of names that I need to convert from "Firstname Lastname" to "Lastname, Firstname".
Barack Obama
Donald J. Trump
J. Edgar Hoover
Beyonce Knowles-Carter
Sting
I used G. Grothendieck's answer to "last name, first name" -> "first name last name" in serialized strings to get to gsub("([^ ]*) ([^ ]*)", "\\2, \\1", str) which gives me -
Obama, Barack
J., DonaldTrump,
Edgar, J.Hoover,
Knowles-Carter, Beyonce
Sting
What I would like to get -
Obama, Barack
Trump, Donald J.
Hoover, J. Edgar
Knowles-Carter, Beyonce
Sting
I would like a regex answer.
There is an esoteric function called person designed for holding names, a conversion function as.person which does this parsing for you, and a format method to make use of it afterwards (with a creative use of the braces argument). It even works with complex surnames (e.g. van Nistelrooy), but the single-name result is unsatisfactory. It can be fixed with a quick trailing sub, though.
x <- c("Barack Obama","Donald J. Trump","J. Edgar Hoover","Beyonce Knowles-Carter","Sting", "Ruud van Nistelrooy", "John von Neumann")
y <- as.person(x)
format(y, include=c("family","given"), braces=list(family=c("",",")))
[1] "Obama, Barack" "Trump, Donald J."
[3] "Hoover, J. Edgar" "Knowles-Carter, Beyonce"
[5] "Sting," "van Nistelrooy, Ruud"
[7] "von Neumann, John"
## fix for single names - curse you Sting!
sub(",$", "", format(y, include=c("family","given"), braces=list(family=c("",","))))
[1] "Obama, Barack" "Trump, Donald J."
[3] "Hoover, J. Edgar" "Knowles-Carter, Beyonce"
[5] "Sting" "van Nistelrooy, Ruud"
[7] "von Neumann, John"
Use
gsub("(.*[^van])\\s(.*)", "\\2, \\1", people)
The regex, in parts:
(.*[^van]) - any number of characters, the last of which is not v, a or n (note that [^van] is a character class excluding those three letters, not the literal word "van")
\\s - the last white space that qualifies
(.*) - the last name, containing any characters
Data:
people <- c("Barack Obama",
"Donald J. Trump",
"J. Edgar Hoover",
"Beyonce Knowles-Carter",
"Sting",
"Ruud van Nistelrooy",
"Xi Jinping",
"Hans Zimvanmer")
Result:
[1] "Obama, Barack" "Trump, Donald J." "Hoover, J. Edgar"
[4] "Knowles-Carter, Beyonce" "Sting" "van Nistelrooy, Ruud"
[7] "Jinping, Xi" "Zimvanmer, Hans"
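Since [^van] is a character class (any single character other than v, a or n) rather than the word "van", names whose first part ends in one of those letters can behave unexpectedly. A hedged alternative is to name the particles explicitly in an alternation; the particle list below is an illustrative assumption, not exhaustive:

```r
people <- c("Barack Obama", "Donald J. Trump", "J. Edgar Hoover",
            "Beyonce Knowles-Carter", "Sting", "Ruud van Nistelrooy")

# The lazy first group stops as early as possible, so the last-name group
# grabs either "particle + word" or just the final word; single names
# contain no whitespace, fail to match, and pass through unchanged.
flipped <- sub("^(.+?)\\s+((?:van|von|de|da)\\s+\\S+|\\S+)$",
               "\\2, \\1", people, perl = TRUE)
flipped
# [1] "Obama, Barack"           "Trump, Donald J."
# [3] "Hoover, J. Edgar"        "Knowles-Carter, Beyonce"
# [5] "Sting"                   "van Nistelrooy, Ruud"
```

perl = TRUE is used to guarantee PCRE semantics for the lazy quantifier.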

Remove specific string at the end position of each row from dataframe(csv)

I am trying to clean a set of data which is in csv format. After loading the data into R, I need to replace and also remove some characters from it. Below is an example. Ideally I want to:
replace the St at the end of each row with Street
in cases where there is St St., remove the St and replace St. with just Street
I tried to use this code:
sub(x = evostreet, pattern = "St.", replacement = " ") and later
gsub(x = evostreet, pattern = "St.", replacement = " ")
to remove the St. at the end of each row, but this also removes some other occurrences of St together with the next character:
3 James St.
4 Glover Road St.
5 Jubilee Estate. St.
7 Fed Housing Estate St.
8 River State School St.
9 Brown State Veterinary Clinic. St.
11 Saw Mill St.
12 Dyke St St.
13 Governor Rd St.
I'm seeing a lot of close answers, but I'm not seeing any that address the second problem he's having, such as replacing "St St." with "Street"; e.g., "Dyke St St."
sub, as stated in the documentation:
The two *sub functions differ only in that sub replaces only the first occurrence of a pattern
So, just using "St\\." as the pattern match is incorrect.
OP needs to match a possible pattern of "St St." and I'll further assume that it could even be "St. St." or "St. St".
Assuming OP is using a simple list:
x = c("James St.", "Glover Road St.", "Jubilee Estate. St.",
      "Fed Housing Estate St.", "River State School St St.",
      "Brown State Vet Clinic. St. St.", "Dyke St St.")
[1] "James St." "Glover Road St."
[3] "Jubilee Estate. St." "Fed Housing Estate St."
[5] "River State School St St." "Brown State Vet Clinic. St. St."
[7] "Dyke St St."
Then the following will replace the possible combinations mentioned above with "Street", as requested (note that "[ St\\.]*$" is a character class: it matches any trailing run of spaces, the letters S and t, and literal periods, which happens to cover all the combinations here):
y <- sub(x, pattern = "[ St\\.]*$", replacement = " Street")
[1] "James Street" "Glover Road Street"
[3] "Jubilee Estate Street" "Fed Housing Estate Street"
[5] "River State School Street" "Brown State Vet Clinic Street"
[7] "Dyke Street"
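An alternative sketch that avoids the character-class subtlety is to repeat an explicit " St" / " St." group. Unlike the class, it leaves periods belonging to other words (e.g. "Estate.") alone; whether that is desirable depends on the data:

```r
x <- c("James St.", "Glover Road St.", "Jubilee Estate. St.",
       "River State School St St.", "Dyke St St.")

# One or more trailing " St" or " St." tokens, replaced in one pass
y <- sub("(\\sSt\\.?)+$", " Street", x)
y
# [1] "James Street"              "Glover Road Street"
# [3] "Jubilee Estate. Street"    "River State School Street"
# [5] "Dyke Street"
```

The $ anchor keeps the group from touching the "St" inside words like "State".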
Edit:
To answer OP's question below about replacing one occurrence of St. with Saint and another with Street: I was looking for a way to match similar expressions and return different values, but at this point I haven't been able to find it. I suspect regmatches can do this, but it's something I'll have to fiddle with later.
A simple way to accomplish what you're wanting - let's assume:
x <- c("St. Mary St St.", "River State School St St.", "Dyke St. St")
[1] "Saint Mary St St." "River State School St St."
[3] "Dyke St. St"
So you want x[1] to be Saint Mary Street, x[2] to be River State School Street and x[3] to be Dyke Street. I would want to resolve the Saint issue first by assigning sub() to y like:
y <- sub(x, pattern = "^St\\.", replacement = "Saint")
[1] "Saint Mary St St."         "River State School St St."
[3] "Dyke St. St"
To resolve the St's at the end, we can use the same resolution as I posted, except notice now I'm not using x as my input vector but instead the y I just made:
y <- sub(y, pattern = "[ St\\.]*$", replacement = " Street")
[1] "Saint Mary Street"          "River State School Street"
[3] "Dyke Street"
And that should take care of it. Now, I don't know if this is the most efficient way, and if your dataset is rather large this may run slowly. If I find a better solution I will post it (provided no one else beats me to it).
You don't need to use a regular expression here.
sub(x = evostreet, pattern = "St.", replacement = " ", fixed = TRUE)
The fixed = TRUE argument means that you want to replace this exact string, not matches of a regular expression.
I think that your problem is that the '.' character in the regular-expression world means "any single character". So to match it literally in R you should write
sub(x = evostreet, pattern = "St\\.", replacement = " ")
You will need to escape the dot... otherwise it means any single character after St, and that is why some other parts of your text are eliminated.
sub(x = evostreet, pattern = "St\\.", replacement = " ")
You can add $ at the end if you want to remove the tag apearing just at the end of the text.
sub(x = evostreet, pattern = "St\\.$", replacement = " ")
The difference between sub and gsub is that sub deals just with the first time your pattern appears in the text; gsub eliminates all occurrences if there are duplicates. In your case, as you are looking for the pattern at the end of the line, it should not make any difference if you use the $.
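A small demo of why the dot must be escaped, using hypothetical street names:

```r
streets <- c("Strong Road St.", "James St.")

# Unescaped, "St." matches any "St" plus one character - here the "Str"
# at the start of "Strong" is removed first
sub("St.", "", streets[1])
# [1] "ong Road St."

# Escaped and anchored, only the trailing "St." is touched
sub("St\\.$", "Street", streets)
# [1] "Strong Road Street" "James Street"
```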

Why does the ngrams() function give distinct bigrams?

I am writing an R script and am using library(ngram).
Suppose I have a string,
"good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"
and want to find bi-grams.
The ngram library is giving me bi-grams as follows:
"appreci product" "process meat" "food product" "food bought" "qualiti dog" "product found" "product look" "look like" "like stew" "good qualiti" "labrador finicki" "bought sever" "qualiti product" "better labrador"
"dog food" "smell better" "vital can" "meat smell" "found good" "sever vital" "stew process" "can dog" "finicki appreci" "product better"
As the sentence contains "dog food" two times, I want this bi-gram two times. But I am getting it once!
Is there an option in the ngram library, or any other library, that gives all the bi-grams of my sentence in R?
The development version of ngram has a get.phrasetable method:
devtools::install_github("wrathematics/ngram")
library(ngram)
text <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"
ng <- ngram(text)
head(get.phrasetable(ng))
# ngrams freq prop
# 1 good qualiti 2 0.07692308
# 2 dog food 2 0.07692308
# 3 appreci product 1 0.03846154
# 4 process meat 1 0.03846154
# 5 food product 1 0.03846154
# 6 food bought 1 0.03846154
In addition, you can use the print() method and specify output = "full". That is:
print(ng, output = "full")
# NOTE: more output not shown...
better labrador | 1
finicki {1} |
dog food | 2
product {1} | bought {1}
# NOTE: more output not shown...
You can use the stylo package. It gives duplicates:
library(stylo)
a = "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"
b = txt.to.words(a)
c = make.ngrams(b, ngram.size = 2)
print(c)
Result:
[1] "good qualiti" "qualiti dog" "dog food" "food bought" "bought sever" "sever vital" "vital can" "can dog" "dog food"
[10] "food product" "product found" "found good" "good qualiti" "qualiti product" "product look" "look like" "like stew" "stew process"
[19] "process meat" "meat smell" "smell better" "better labrador" "labrador finicki" "finicki appreci" "appreci product" "product better"
You could use RWeka. In the result you can see "dog food" and "good qualiti" appearing twice
txt <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"
library(RWeka)
RWEKABigramTokenizer <- function(x) {
NGramTokenizer(x, Weka_control(min = 2, max = 2))
}
RWEKABigramTokenizer(txt)
[1] "good qualiti" "qualiti dog" "dog food" "food bought" "bought sever" "sever vital" "vital can"
[8] "can dog" "dog food" "food product" "product found" "found good" "good qualiti" "qualiti product"
[15] "product look" "look like" "like stew" "stew process" "process meat" "meat smell" "smell better"
[22] "better labrador" "labrador finicki" "finicki appreci" "appreci product" "product better"
Or use the tm package in combination with RWeka
library(tm)
library(RWeka)
my_corp <- Corpus(VectorSource(txt))
tdm_RWEKA <- TermDocumentMatrix(my_corp, control=list(tokenize = RWEKABigramTokenizer))
#show the 2 bigrams
findFreqTerms(tdm_RWEKA, lowfreq = 2)
[1] "dog food" "good qualiti"
#turn into matrix with frequency counts
tdm_matrix <- as.matrix(tdm_RWEKA)
In order to produce such bi-grams, you don't need any special package. Basically, split the text into words and paste each word together with the next one. (Note that paste(ug, ug[2:length(ug)]) would recycle the shorter vector and append a spurious wrap-around bigram, so drop the last and first elements instead.)
t <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"
ug <- strsplit(t, " ")[[1]]
bg <- paste(ug[-length(ug)], ug[-1])
The resulting bg would be:
 [1] "good qualiti"     "qualiti dog"      "dog food"
 [4] "food bought"      "bought sever"     "sever vital"
 [7] "vital can"        "can dog"          "dog food"
[10] "food product"     "product found"    "found good"
[13] "good qualiti"     "qualiti product"  "product look"
[16] "look like"        "like stew"        "stew process"
[19] "process meat"     "meat smell"       "smell better"
[22] "better labrador"  "labrador finicki" "finicki appreci"
[25] "appreci product"  "product better"
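If counts are wanted (e.g. "dog food" twice) without any extra package, the split-and-paste bigrams can be tabulated with table(); a base-R sketch:

```r
t <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"
ug <- strsplit(t, " ")[[1]]
bg <- paste(ug[-length(ug)], ug[-1])   # adjacent pairs, no recycling

# Frequency table of bigrams, most frequent first
freq <- sort(table(bg), decreasing = TRUE)
head(freq, 3)
# "dog food" and "good qualiti" each appear twice; everything else once
```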
Try the quanteda package:
> quanteda::tokenize(txt, ngrams = 2, concatenator = " ")
[[1]]
[1] "good qualiti" "qualiti dog" "dog food" "food bought" "bought sever" "sever vital"
[7] "vital can" "can dog" "dog food" "food product" "product found" "found good"
[13] "good qualiti" "qualiti product" "product look" "look like" "like stew" "stew process"
[19] "process meat" "meat smell" "smell better" "better labrador" "labrador finicki" "finicki appreci"
[25] "appreci product" "product better"
Plenty of additional arguments available through ngrams, including getting different combinations of n sizes, or skip-grams.

Words in sentences and their nearest neighbors in a lexicons

I have following data frame:
sent <- data.frame(words = c("just right size", "size love quality", "laptop worth price", "price amazing user",
"explanation complex what", "easy set", "product best buy", "buy priceless when"), user = c(1,2,3,4,5,6,7,8))
The sent data frame looks like this:
words user
just right size 1
size love quality 2
laptop worth price 3
price amazing user 4
explanation complex what 5
easy set 6
product best buy 7
buy priceless when 8
I need to remove a word at the beginning of a sentence when it is the same as the word at the end of the previous sentence.
I mean, e.g., we have the sentences "just right size" and "size love quality", so I need to remove the word size at the second user position.
Then for the sentences "laptop worth price" and "price amazing user", I need to remove the word price at the fourth user position.
Can anyone help me? I'll appreciate any of your help. Thank you very much in advance.
You could extract the "first" and "last" words from the "words" column for the succeeding rows and the current rows using sub. If the words are the same, remove the first word from the succeeding row; otherwise keep it as is (ifelse(...)).
w1 <- sub(' .*', '', sent$words[-1])
w2 <- sub('.* ', '', sent$words[-nrow(sent)])
sent$words <- as.character(sent$words)
sent$words
#[1] "just right size" "size love quality"
#[3] "laptop worth price" "price amazing user"
#[5] "explanation complex what" "easy set"
#[7] "product best buy" "buy priceless when"
sent$words[-1] <- with(sent, ifelse(w1==w2, sub('\\w+ ', '',words[-1]),
words[-1]))
sent$words
#[1] "just right size" "love quality"
#[3] "laptop worth price" "amazing user"
#[5] "explanation complex what" "easy set"
#[7] "product best buy" "priceless when"
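The same logic can be wrapped as a small reusable helper; drop_echo is a name made up here for illustration:

```r
# Remove the leading word of each sentence when it repeats the
# trailing word of the previous sentence.
drop_echo <- function(words) {
  words <- as.character(words)
  first <- sub(" .*", "", words[-1])              # first word of rows 2..n
  last  <- sub(".* ", "", words[-length(words)])  # last word of rows 1..n-1
  words[-1] <- ifelse(first == last,
                      sub("^\\S+ ", "", words[-1]),
                      words[-1])
  words
}

drop_echo(c("just right size", "size love quality"))
# [1] "just right size" "love quality"
```

Applied to the whole sent$words column it reproduces the result above row by row.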
