R struggles. I am using the following to extract quotations from text, with multiple results, on a large dataset. I am trying to have the output be a character string within a data frame so I can easily share it as a CSV with others.
Sample data:
normalCase <- 'He said, "I am a test," very quickly.'
endCase <- 'This is a long quote, which we said, "Would never happen."'
shortCase <- 'A "quote" yo';
beginningCase <- '"I said this," he said quickly';
multipleCase <- 'When asked, "No," said Sam "I do not like green eggs and ham."'
testdata = c(normalCase,endCase,shortCase,beginningCase,multipleCase)
Using the following to extract quotations and a buffer of characters:
library(stringr)
result <- function(testdata) {
  str_extract_all(testdata, '[^\"]{0,15}"[^\"]+"[^\"]{0,15}')
}
extract <- sapply(testdata, FUN=result)
The extract is a list within a matrix. However, I want the extract to be a character string that I can later merge to a dataframe as a column. How do I convert this?
Code
normalCase <- 'He said, "I am a test," very quickly.'
endCase <- 'This is a long quote, which we said, "Would never happen."'
shortCase <- 'A "quote" yo';
beginningCase <- '"I said this," he said quickly';
multipleCase <- 'When asked, "No," said Sam "I do not like green eggs and ham."'
testdata = c(normalCase,endCase,shortCase,beginningCase,multipleCase)
# extract quotations
gsub(pattern = "[^\"]*((?:\"[^\"]*\")|$)", replacement = "\\1 ", x = testdata)
Output
[1] "\"I am a test,\" "
[2] "\"Would never happen.\" "
[3] "\"quote\" "
[4] "\"I said this,\" "
[5] "\"No,\" \"I do not like green eggs and ham.\" "
Explanation
pattern = "[^\"]" will match with any character except a double quote
pattern = "[^\"]*" will match with any character except a double quote 0 or more times
pattern = "\"[^\"]*\"" will match with a double quote, then any
character except a double quote 0 or more times, then another double
quote (i.e.) quotations
pattern = "(?:\"[^\"]*\")" will match with quotations, but wont capture
it
pattern = "((?:\"[^\"]*\")|$)" will match with quotations or endOfString,
and capture it. Note that this is the first group we capture
replacement = "\\1 " will replace with the first group we captured followed by a space
Related
I have a large data frame in R with column "NameFull" holding a text string made up of two words (binomial scientific name), followed by author name(s) and initials. Both have been corrupted (presumably UTF translation issues). This means that in the binomials any leading "x" (indicating hybrids) has been replaced with "?". Unfortunately any non-standard characters in the author names have also been replaced with "?" so I cannot just replace all "?" with x.
I simply want to replace any leading "?" in the first two words with "x" (I will then have to manually compose a list of corrected author names to replace the corrupted ones, unless anyone has a bright idea on that!).
Example chunk of df:
df.corrupt <- data.frame(Bing = 1:6, FullName = c("?Anthematricaria dominii Rohlena", "?Anthemimatricaria inolens P.Fourn.", "?Anthemimatricaria maleolens P.Fourn.", "Achillea ?albinea Bjel?i? & K.Mal?", "Achillea carpatica B?ocki ex Dubovik", "Floscaldasia azorelloides SklenĀ ? & H.Rob."), Bang = 1:6)
I've tried to shoehorn it into regex but can't get close. Any help appreciated!
On my understanding, you want to replace ? only if it occurs in word-initial position in either the first or the second word; if that's correct, this should work:
Data: (I've changed a few chars)
df.corrupt <- data.frame(Bing = 1:6,
FullName = c("?Anthematricaria dominii ?Rohlena",
"?Anthemimatricaria inolens P.Fourn.",
"?Anthemimatricaria maleolens ?P.Fourn.",
"Achillea ?albinea Bjel?i? & K.Mal?",
"Achillea carpatica B?ocki ex Dubovik",
"Floscaldasia azorelloides Sklen ? & H.Rob."), Bang = 1:6)
Solution:
library(stringr)
str_replace_all(df.corrupt$FullName, "^\\?|(?<=^(\\?)?\\b\\w{1,100}\\b\\s)\\?", "x")
[1] "xAnthematricaria dominii ?Rohlena" "xAnthemimatricaria inolens P.Fourn."
[3] "xAnthemimatricaria maleolens ?P.Fourn." "Achillea xalbinea Bjel?i? & K.Mal?"
[5] "Achillea carpatica B?ocki ex Dubovik" "Floscaldasia azorelloides Sklen ? & H.Rob."
This stringr solution puts an x wherever ? occurs right at the start of the string (^) or (|), using a positive lookbehind (a non-consuming group), where it follows a whitespace character (\\s) that is preceded by a word boundary (\\b), up to 100 \\w characters, another word boundary, and finally an optional ? at the very start of the string.
We can match a ? that follows a space or sits at the start of the string and replace it with 'x':
trimws(gsub("(^|\\s)\\?", " x", df.corrupt$FullName))
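Since both approaches return a plain character vector, the result can be assigned straight back to the column; a minimal sketch using the gsub() version:
# overwrite the corrupted column with the cleaned-up names
df.corrupt$FullName <- trimws(gsub("(^|\\s)\\?", " x", df.corrupt$FullName))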
I have a string printed out like this:
"\"Jenna and Alex were making cupcakes.\", \"Jenna asked Alex whether all were ready to be frosted.\", \"Alex said that\", \" some of them \", \"were.\", \"He added\", \"that\", \"the rest\", \"would be\", \"ready\", \"soon.\", \"\""
(The "\" wasn't there. R just automatically prints it out.)
I would like to calculate how many non-empty segments there are in this string. In this case the answer should be 11.
I tried to convert it to a vector, but R ignores the quotation marks so I still ended up with a vector with length 1.
I don't know whether I need to extract those segments first and then count them, or whether there is an easier way to do it.
If it's the former case, which regular expression function best suits my need?
Thank you very much.
You can use scan to convert your large string into a vector of individual strings, then use nchar to count their lengths. Assuming your large string is x:
y <- scan(text=x, what="character", sep=",", strip.white=TRUE)
Read 12 items
sum(nchar(y)>0)
[1] 11
I assume a segment is defined as anything between a . or a ,. An option using strsplit:
length(grep("\\w+", trimws(strsplit(str, split=",|\\.")[[1]])))
#[1] 11
Note: trimws is not mandatory in the above statement. I have included it so that one can get the value of each segment just by adding the value = TRUE argument to grep (see the sketch after the data below).
Data:
str <- "\"Jenna and Alex were making cupcakes.\", \"Jenna asked Alex whether all were ready to be frosted.\", \"Alex said that\", \" some of them \", \"were.\", \"He added\", \"that\", \"the rest\", \"would be\", \"ready\", \"soon.\", \"\""
strsplit might be one possibility?
txt <- "Jenna and Alex were making cupcakes., Jenna asked Alex whether all were ready to be frosted.,
Alex said that, some of them , were., He added, that, the rest, would be, ready, soon.,"
a <- strsplit(txt, split=",")
length(a[[1]])
[1] 11
If the backslashes are part of the text, it doesn't really change a lot, except for the last element, which would have "\"" in it. By filtering that out, the result is the same:
txt <- "\"Jenna and Alex were making cupcakes.\", \"Jenna asked Alex whether all
were ready to be frosted.\", \"Alex said that\", \" some of them \",
\"were.\", \"He added\", \"that\", \"the rest\", \"would be\", \"ready\", \"soon.\", \"\""
a <- strsplit(txt, split=", \"")
length(a[[1]][a[[1]] != "\""])
[1] 11
This is an absurd idea, but it does work:
txt <- "\"Jenna and Alex were making cupcakes.\", \"Jenna asked Alex whether all were ready to be frosted.\", \"Alex said that\", \" some of them \", \"were.\", \"He added\", \"that\", \"the rest\", \"would be\", \"ready\", \"soon.\", \"\""
Txt <-
read.csv(text = txt,
header = FALSE,
colClasses = "character",
na.strings = c("", " "))
sum(!vapply(Txt, is.na, logical(1)))
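Here read.csv splits the text on the commas, na.strings turns the empty last field into NA, and the vapply() call counts the columns that are not NA, which should again come out to:
[1] 11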
I have a string in R as
x <- "The length of the word is going to be of nice use to me"
I want the first 10 words of the above specified string.
Also for example I have a CSV file where the format looks like this :-
Keyword,City(Column Header)
The length of the string should not be more than 10,New York
The Keyword should be of specific length,Los Angeles
This is an experimental basis program string,Seattle
Please help me with getting only the first ten words,Boston
I want to get only the first 10 words from the column 'Keyword' for each row and write it onto a CSV file.
Please help me in this regard.
Regular expression (regex) answer using \w (word character) and its negation \W:
gsub("^((\\w+\\W+){9}\\w+).*$","\\1",x)
^ Beginning of the string (zero-width)
((\\w+\\W+){9}\\w+) Ten words separated by not-words.
(\\w+\\W+){9} A word followed by not-a-word, 9 times
\\w+ One or more word characters (i.e. a word)
\\W+ One or more non-word characters (i.e. a space)
{9} Nine repetitions
\\w+ The tenth word
.* Anything else, including other following words
$ End of the string (zero-width)
\\1 When this pattern is matched, replace it with the first captured group (the 10 words)
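Applied to the x defined above, this should keep just the first ten words:
gsub("^((\\w+\\W+){9}\\w+).*$", "\\1", x)
[1] "The length of the word is going to be of"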
How about using the word function from Hadley Wickham's stringr package?
word(string = x, start = 1, end = 10, sep = fixed(" "))
Here is a small function that unlists the split string, subsets the first ten words, and then pastes them back together.
string_fun <- function(x) {
ul = unlist(strsplit(x, split = "\\s+"))[1:10]
paste(ul,collapse=" ")
}
string_fun(x)
df <- read.table(text = "Keyword,City(Column Header)
The length of the string should not be more than 10 is or are in,New York
The Keyword should be of specific length is or are in,Los Angeles
This is an experimental basis program string is or are in,Seattle
Please help me with getting only the first ten words is or are in,Boston", sep = ",", header = TRUE)
df <- as.data.frame(df)
Using apply (the function isn't doing anything in the second column)
df$Keyword <- apply(df[,1:2], 1, string_fun)
EDIT
Probably this is a more general way to use the function.
df[,1] <- as.character(df[,1])
df$Keyword <- unlist(lapply(df[,1], string_fun))
print(df)
# Keyword City.Column.Header.
# 1 The length of the string should not be more than New York
# 2 The Keyword should be of specific length is or are Los Angeles
# 3 This is an experimental basis program string is or Seattle
# 4 Please help me with getting only the first ten Boston
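To get the trimmed keywords back out as a CSV, as asked, the data frame can then be written with write.csv() (the file name here is only a placeholder):
write.csv(df, "keywords_first10.csv", row.names = FALSE)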
x <- "The length of the word is going to be of nice use to me"
head(strsplit(x, split = " ")[[1]], 10)
I have a vector of strings that looks like:
str <- c("bills slashed for poor families today", "your calls are charged", "complaints dept awaiting refund")
I want to get all the words that end with the letter s and remove the s. I have tried:
gsub("s$","",str)
but it doesn't work because it matches an s only at the end of each whole string, not at the end of each word. I'm trying to get an output that looks like:
[1] bill slashed for poor familie today
[2] your call are charged
[3] complaint dept awaiting refund
Any pointers as to how I can do this? Thanks
$ checks for the end of the string, not the end of a word.
To check for word boundaries you should use \b
So:
gsub("s\\b", "", str)
Here's a non base R solution:
library(rebus)
library(stringr)
plurals <- "s" %R% BOUNDARY
str_replace_all(str, pattern = plurals, replacement = "")
You could also use a positive lookahead assertion:
gsub(pattern = "s{1}(?>\\s)", " ", x = str, perl = T)
I am no expert on regex, but I believe this expression looks for an "s" if it is followed by a space. Finding a match, it replaces that "s" with a space. So, final "s's" are removed.