I have a dataset with unstructured text data.
From the text I want to extract sentences that have the following words:
education_vector <- c("university", "academy", "school", "college")
For example, from the text "I am a student at the University of Wyoming. My major is biology." I want to get "I am a student at the University of Wyoming."
From the text "I love statistics and I enjoy working with numbers. I graduated from Walla Wall Community College" I want to get "I graduated from Walla Wall Community College", and so on.
I tried using the grep function, but it returned wrong results.
Answer modified to select the first match.
texts = c("I am a student at the University of Wyoming. My major is biology.",
"I love statistics and I enjoy working with numbers. I graduated from Walla Wall Community College",
"First, I went to the Bowdoin College. Then I went to the University of California.")
gsub(".*?([^\\.]*(university|academy|school|college)[^\\.]*).*",
"\\1", texts, ignore.case=TRUE)
[1] "I am a student at the University of Wyoming"
[2] " I graduated from Walla Wall Community College"
[3] "First, I went to the Bowdoin College"
Explanation:
.*? is a non-greedy match up to the rest of the pattern. This is there to remove any sentences before the relevant sentence.
([^\\.]*(university|academy|school|college)[^\\.]*) matches any string of characters other than a period immediately before and after one of the key words.
.* handles anything after the relevant sentence.
This replaces the entire string with only the relevant part.
Here is a solution using grep
education <- c("university", "academy", "school", "college")
str1 <- "I am a student at the University of Wyoming. My major is biology."
str2 <- "I love statistics and I enjoy working with numbers. I graduated from Walla Wall Community College"
str1 <- tolower(str1) # we use tolower because "university" != "University"
str2 <- tolower(str2)
grep(paste(education, collapse = "|"), unlist(strsplit(str1, "(?<=\\.)\\s+",
perl = TRUE)),
value = TRUE)
grep(paste(education, collapse = "|"), unlist(strsplit(str2, "(?<=\\.)\\s+",
perl = TRUE)),
value = TRUE)
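The two grep calls above can be wrapped into a single helper that works over a whole vector of texts. This is a minimal sketch building on the answer's approach; the sentence splitter assumes sentences end with a period:

```r
education <- c("university", "academy", "school", "college")
pattern <- paste(education, collapse = "|")

# Lower-case first ("University" != "university"), split into sentences,
# then keep only the sentences that mention one of the keywords
extract_sentences <- function(x) {
  sentences <- unlist(strsplit(tolower(x), "(?<=\\.)\\s+", perl = TRUE))
  grep(pattern, sentences, value = TRUE)
}

texts <- c("I am a student at the University of Wyoming. My major is biology.",
           "I love statistics and I enjoy working with numbers. I graduated from Walla Wall Community College")
lapply(texts, extract_sentences)
```

Unlike the single-match gsub answer, this returns every matching sentence per text, not just the first.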
I have several Word files containing articles from which I want to extract the strings between quotes. My code works fine if I have one quote per article, but if I have more than one, R also extracts the sentence that separates one quote from the next.
Here is the text from my articles:
A Bengal tiger named India that went missing in the US state of Texas, has been found unharmed and now transferred to one of the animal shelter in Houston.
"We got him and he is healthy," said Houston Police Department (HPD) Major Offenders Commander Ron Borza. He went on to say, “I adore tigers”. This is the end.
A global recovery plan followed and WWF – together with individuals, businesses, communities, governments, and other conservation partners – have worked tirelessly to turn this bold and ambitious conservation goal into reality. “The target catalysed much greater conservation action, which was desperately needed,” says Becci May, senior programme advisor for tigers at WWF-UK.
And this is my code:
library(readtext)
library(stringr)
#' folder where you've saved your articles
path <- "articles"
#' reads in anything saved as .docx
mydata <-
readtext(paste0(path, "\\*.docx")) #' make sure the Word document is saved as .docx
#' remove curly punctuation
mydata$text <- gsub("/’", "/'", mydata$text, ignore.case = TRUE)
mydata$text <- gsub("[“”]", "\"", gsub("[‘’]", "'", mydata$text))
#' extract the quotes
stringi::stri_extract_all_regex(str = mydata$text, pattern = '(?<=").*?(?=")')
The output is:
[[1]]
[1] "We got him and he is healthy,"
[2] " said Houston Police Department (HPD) Major Offenders Commander Ron Borza. He went on to say, "
[3] "I adore tigers"
[[2]]
[1] "The target catalysed much greater conservation action, which was desperately needed,"
You can see that the second element of the first output is incorrect. I don't want to include
" said Houston Police Department (HPD) Major Offenders Commander Ron
Borza. He went on to say, "
Well, technically the second element of the first output is within quotes, so the code is working correctly as per the pattern used. A quick fix would be to remove every second entry from the list:
sapply(
stringi::stri_extract_all_regex(str = text, pattern = '(?<=").*?(?=")'),
`[`, c(TRUE, FALSE)
)
#[[1]]
#[1] "We got him and he is healthy," "I adore tigers"
#[[2]]
#[1] "The target catalysed much greater conservation action, which was desperately needed,"
We can do this with base R
sapply(regmatches(text, gregexpr('(?<=")[^"]+(?=")', text, perl = TRUE)), function(x) x[c(TRUE, FALSE)])
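If you would rather not rely on odd/even positions, a pattern that consumes a full quote pair never captures the narration between two quotations. A minimal base-R sketch, run on a shortened version of the sample text:

```r
text <- "\"We got him and he is healthy,\" said Ron Borza. He went on to say, \"I adore tigers\". This is the end."
# Match complete "..." pairs, then strip the surrounding quotes;
# the narration between two quotations can never be captured this way
m <- regmatches(text, gregexpr('"[^"]*"', text))
quotes <- lapply(m, function(x) gsub('^"|"$', '', x))
quotes
```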
I might be missing something very obvious, but how can I write efficient code to get all matches of the singular form of a noun but not its plural? For example, I want to match
angel investor
angel
BUT NOT
angels
try angels
If I try
grep("angel ", string)
Then a string with JUST the word
angel
won't match.
Please help!
Use word-boundary markers \\b:
x <- c("angel investor", "angel","angels", "try angels")
grep("\\bangel\\b", x, value = TRUE)
[1] "angel investor" "angel"
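The same boundary trick scales to a whole vector of singular terms (a base-R sketch; the word list here is invented for illustration):

```r
words <- c("angel", "hero")
x <- c("angel investor", "angel", "angels", "try angels", "hero", "heroes")
# \b on both sides: "angels" cannot match because the "s" after
# "angel" is still a word character, so there is no boundary there
pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
grep(pattern, x, value = TRUE)
```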
You can try the following approach. I still believe there are other excellent ways to solve this problem.
library(dplyr)
library(stringr)
df <- data.frame(obs = 1:4, words = c("angel", "try angels", "angel investor", "angels"))
df %>%
  filter(!str_detect(words, "(?<=[ertkgwmnl])s\\b"))
#   obs          words
# 1   1          angel
# 2   3 angel investor
I have a .csv-file with a column containing book descriptions scraped from the web which I import into R for further analysis. My goal is to extract the protagonists' ages from this column in R, so what I imagine is this:
Match strings like "age" and "-year-old" with a regex
Copy the sentences containing these strings into a new column (so that I can make sure the sentence is not, for example, "In the middle ages 50 people lived in xy")
Extract the numbers (and, if possible some number words) from this column into a new column.
The resulting table (or probably data.frame) would then hopefully look like this
| Description | Sentence | Age |
| YY is a novel by Mr. X about a boy. The 12-year-old boy is named Dave. | The 12-year-old boy is named Dave. | 12 |
If you could help me out, that would be great, since my R skills are still very limited and I have not found a solution for this problem!
Another option, if the string contains other numbers or descriptions besides the age but you only want the age:
library(stringr)
description <- "YY is a novel by Mr. X about a boy. The boy is 5 feet tall. The 12-year-old boy is named Dave. Dave is happy. Dave lives at 42 Washington street."
sentences <- unlist(str_split(description, "\\."))
sentence <- sentences[grepl("-year-old", sentences)]
> sentence
[1] " The 12-year-old boy is named Dave"
age <- as.numeric(str_extract(description, "\\d+(?=-year-old)"))
> age
[1] 12
Here we use the string "-year-old" to tell us which sentence to pull and then we extract the age that is followed by that string.
You can try the following
library(stringr)
description <- "YY is a novel by Mr. X about a boy. The 12-year-old boy is named Dave. Dave is happy."
sentence <- str_extract(description, pattern = "\\.[^\\.]*[0-9]+[^\\.]*.") %>%
str_replace("^\\. ", "")
> sentence
[1] "The 12-year-old boy is named Dave."
age <- str_extract(sentence, pattern = "[0-9]+")
> age
[1] "12"
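Putting the pieces from both answers together over a whole column gives the table the question asks for. A minimal sketch; the input data frame and the get_sentence helper are invented for illustration:

```r
# Hypothetical column of book descriptions (invented for illustration)
df <- data.frame(
  Description = c("YY is a novel by Mr. X about a boy. The 12-year-old boy is named Dave.",
                  "In the middle ages 50 people lived in xy."),
  stringsAsFactors = FALSE
)

# Keep only the sentence mentioning "-year-old"; NA when there is none,
# so "In the middle ages ..." never yields a spurious age
get_sentence <- function(d) {
  s <- trimws(unlist(strsplit(d, "\\.")))
  hit <- grep("-year-old", s, value = TRUE)
  if (length(hit) > 0) hit[1] else NA_character_
}
df$Sentence <- unname(vapply(df$Description, get_sentence, character(1)))
df$Age <- as.numeric(sub(".*?(\\d+)-year-old.*", "\\1", df$Sentence))
df
```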
I have a text file with a sample text like below, all in lower case:
"venezuela probes ex-oil czar ramirez over alleged graft scheme
caracas/houston (reuters) - venezuela is investigating rafael ramirez, a
once powerful oil minister and former head of state oil company pdvsa, in
connection with an alleged $4.8 billion vienna-based corruption scheme, the
state prosecutor's office announced on friday.
5.5 hours ago
— reuters
amazon ordered not to pull in customers who can't spell `birkenstock'
a german court has ordered amazon not to lure internet shoppers to its
online marketplace when they mistakenly search for "brikenstock",
"birkenstok", "bierkenstock" and other variations in google.
6 hours ago
— business standard"
What I need in R is to get these two pieces of text, separated out.
The first piece of text would correspond with the text1 variable and the second piece of text should correspond with the text2 variable.
Please remember I have many text-like paragraphs in this file. The solution would have to work for, say, 100,000 texts.
The only thing I thought that could be used as a delimiter is "—" but with that I lose the source of the information such as "reuters" or "business standard". I need that as well.
Would you know how to accomplish this in R?
Read the text from the file with readLines and then split on the shifted cumsum of the occurrences of the special dash in front of the publisher:
Lines <- readLines("Lines.txt") # from file in wd()
split(Lines, cumsum(c(0, head(grepl("—", Lines),-1))) )
#--------------
$`0`
[1] "venezuela probes ex-oil czar ramirez over alleged graft scheme"
[2] "caracas/houston (reuters) - venezuela is investigating rafael ramirez, a "
[3] "once powerful oil minister and former head of state oil company pdvsa, in "
[4] "connection with an alleged $4.8 billion vienna-based corruption scheme, the "
[5] "state prosecutor's office announced on friday."
[6] "5.5 hours ago"
[7] "— reuters"
$`1`
[1] "amazon ordered not to pull in customers who can't spell `birkenstock'"
[2] "a german court has ordered amazon not to lure internet shoppers to its "
[3] "online marketplace when they mistakenly search for \"brikenstock\", "
[4] "\"birkenstok\", \"bierkenstock\" and other variations in google."
[5] "6 hours ago"
[6] "— business standard'"
It's not a regular "-"; it's an em-dash, "—". And notice that, by default, readLines will omit the blank lines.
Here's what I could do. I do not like the loop in this, but I could not vectorize it. Hopefully this answer will at least serve as a starting point for better answers.
Assumption: all publisher names are preceded by "— "
TEXT <- read.delim2("C:/Users/Arani.das/Desktop/TEXT.txt", header=FALSE, quote="", stringsAsFactors=F)
TEXT$Publisher <- grepl("— ", TEXT$V1)
TEXT$V1 <- gsub("^\\s+|\\s+$", "", TEXT$V1) #trim whitespaces in start and end of line
TEXT$FLAG <- 1 #grouping variable
for (i in 2:nrow(TEXT)) {
  if (TEXT$Publisher[i - 1]) {
    TEXT$FLAG[i] <- TEXT$FLAG[i - 1] + 1  # start a new group after a publisher line
  } else {
    TEXT$FLAG[i] <- TEXT$FLAG[i - 1]
  }
} # grouping entries
TEXT <- data.table::data.table(TEXT, key="FLAG")
TEXT2 <- TEXT[, list(News=paste0(V1[1:(length(V1)-2)], collapse=" "), Time=V1[length(V1)-1], Publisher=V1[length(V1)]), by="FLAG"]
Output:
FLAG News Time Publisher
1 Venezuela... 5.5 hours ago — reuters
2 amazon... 6 hours ago — business standard
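The explicit row loop can also be replaced with the cumsum trick from the answer above, which vectorizes the grouping. A sketch combining both answers on a small invented sample:

```r
Lines <- c("venezuela probes ex-oil czar ramirez", "5.5 hours ago", "— reuters",
           "amazon ordered not to pull in customers", "6 hours ago", "— business standard")
# Each "—" publisher line closes one article, so the shifted cumsum
# assigns one group number per article
flag <- cumsum(c(1, head(grepl("—", Lines), -1)))
out <- do.call(rbind, lapply(split(Lines, flag), function(g) {
  n <- length(g)
  data.frame(News = paste(g[seq_len(n - 2)], collapse = " "),
             Time = g[n - 1],
             Publisher = sub("^—\\s*", "", g[n]),
             stringsAsFactors = FALSE)
}))
out
```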
I want to find and replace a list of words with another list of words.
Say my data is
1) plz suggst med for flu
2) tgif , m goin to have a blast
3) she is getting anorexic day by day
List of words to be replaced are
1) plz -- please
2) pls -- please
3) sugg -- suggest
4) suggst -- suggest
5) tgif -- thank god its friday
6) lol -- laughed out loud
7) med -- medicine
I would like to have 2 lists, list "A" --a list of words to be found and list "B" --a list of words to be replaced with. So that I can keep adding terms to these lists as and when required. I need a mechanism to search for all the words in list "A" and then replace it with corresponding words in list "B".
What is the best way to achieve this in R? Thanks in advance.
Try this:
#messy list
listA <- c("plz suggst med for flu",
"tgif , m goin to have a blast",
"she is getting anorexic day by day")
#lookup table
list_gsub <- read.csv(text="
a,b
plz,please
pls,please
sugg,suggest
suggst,suggest
tgif,thank god its friday
lol,laughed out loud
med,medicine")
#loop through each lookup row
for(x in 1:nrow(list_gsub))
listA <- gsub(list_gsub[x,"a"],list_gsub[x,"b"], listA)
#output
listA
#[1] "please suggestst medicine for flu"
#[2] "thank god its friday , m goin to have a blast"
#[3] "she is getting anorexic day by day"
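Note the first line of the output: the plain gsub loop also rewrites "sugg" inside "suggst", producing "suggestst". Wrapping each lookup term in \b word boundaries avoids such partial-word hits. A sketch of the same loop with boundaries added (the shortened lookup table is for illustration):

```r
listA <- c("plz suggst med for flu")
lookup <- data.frame(a = c("plz", "sugg", "suggst", "med"),
                     b = c("please", "suggest", "suggest", "medicine"),
                     stringsAsFactors = FALSE)
# \b keeps "sugg" from matching inside the longer token "suggst"
for (x in seq_len(nrow(lookup)))
  listA <- gsub(paste0("\\b", lookup$a[x], "\\b"), lookup$b[x], listA)
listA
```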
Have a look at ?gsub:
x <- c("plz suggst med for flu", "tgif , m goin to have a blast", "she is getting anorexic day by day")
gsub("plz", "please", x)
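To apply list "A" and list "B" in a single expression rather than one gsub call per word, you can fold the pairs over the text with Reduce (a base-R sketch; \b added so only whole words are replaced — stringr users could pass a named vector to str_replace_all instead):

```r
x <- c("plz suggst med for flu", "tgif , m goin to have a blast")
A <- c("plz", "suggst", "med", "tgif")
B <- c("please", "suggest", "medicine", "thank god its friday")
# Fold each (A[i] -> B[i]) replacement over the texts in turn
result <- Reduce(function(txt, i) gsub(paste0("\\b", A[i], "\\b"), B[i], txt),
                 seq_along(A), init = x)
result
```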