Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I have a .csv-file with a column containing book descriptions scraped from the web which I import into R for further analysis. My goal is to extract the protagonists' ages from this column in R, so what I imagine is this:
Match strings like "age" and "-year-old" with a regex
Copy the sentences containing these strings into a new column (so that I can make sure that the sentence is not, for example, "In the middle ages 50 people lived in xy").
Extract the numbers (and, if possible some number words) from this column into a new column.
The resulting table (or probably data.frame) would then hopefully look like this
| Description | Sentence | Age |
| YY is a novel by Mr. X about a boy. The 12-year-old boy is named Dave. | The 12-year-old boy is named Dave. | 12 |
If you could help me out, that would be great, since my R skills are still very limited and I have not found a solution to this problem!
Another option, for when the string contains other numbers or descriptions besides the age but you only want the age:
library(stringr)
description <- "YY is a novel by Mr. X about a boy. The boy is 5 feet tall. The 12-year-old boy is named Dave. Dave is happy. Dave lives at 42 Washington street."
sentences <- str_split(description, "\\.")[[1]] # split the description at periods
sentence <- sentences[grepl("-year-old", sentences)] # keep the sentence(s) that mention an age
> sentence
[1] " The 12-year-old boy is named Dave"
age <- as.numeric(str_extract(description, "\\d+(?=-year-old)"))
> age
[1] 12
Here we use the string "-year-old" to tell us which sentence to pull and then we extract the age that is followed by that string.
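To apply the same idea to a whole column, a minimal sketch could look like the following. The data frame `books` and its contents are made up for illustration and only stand in for the imported CSV. Sentences are naively delimited by periods, so an abbreviation such as "Mr." inside the relevant sentence would cut it short.

```r
library(stringr)

# Hypothetical data standing in for the imported .csv file
books <- data.frame(
  Description = c(
    "YY is a novel by Mr. X about a boy. The 12-year-old boy is named Dave.",
    "In the middle ages 50 people lived in xy.",
    "Anna, a 9-year-old girl, finds a map. She is brave."
  ),
  stringsAsFactors = FALSE
)

# Period-delimited stretch of text that contains "<digits>-year-old"
books$Sentence <- str_trim(
  str_extract(books$Description, "[^.]*\\d+-year-old[^.]*\\.?")
)

# Digits directly followed by "-year-old"; NA where no sentence was found
books$Age <- as.numeric(str_extract(books$Sentence, "\\d+(?=-year-old)"))

books[, c("Sentence", "Age")]
```

Rows without a "-year-old" phrase get `NA` in both new columns, which makes them easy to filter out or inspect by hand.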
You can try the following
library(stringr)
description <- "YY is a novel by Mr. X about a boy. The 12-year-old boy is named Dave. Dave is happy."
sentence <- str_extract(description, pattern = "\\.[^\\.]*[0-9]+[^\\.]*.") %>%
str_replace("^\\. ", "")
> sentence
[1] "The 12-year-old boy is named Dave."
age <- str_extract(sentence, pattern = "[0-9]+")
> age
[1] "12"
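The question also asks about number words. A rough sketch of one way to handle them is shown below; the lookup `number_words` and the helper `extract_age` are hypothetical names introduced here, and the lookup only covers "one" to "twelve". Compound forms such as "twenty-one-year-old" are not handled and would be misread as 1.

```r
library(stringr)

# Hypothetical lookup; extend it as needed (only "one" to "twelve" are covered)
number_words <- c(one = 1, two = 2, three = 3, four = 4, five = 5, six = 6,
                  seven = 7, eight = 8, nine = 9, ten = 10, eleven = 11,
                  twelve = 12)

extract_age <- function(text) {
  # Digits or a known number word, directly followed by "-year-old"
  pattern <- paste0("(?i)\\b(?:\\d+|",
                    paste(names(number_words), collapse = "|"),
                    ")(?=-year-old)")
  token <- str_to_lower(str_extract(text, pattern))
  age <- suppressWarnings(as.numeric(token))  # NA for number words and non-matches
  is_word <- is.na(age) & !is.na(token)       # tokens that still need the lookup
  age[is_word] <- unname(number_words[token[is_word]])
  age
}

ages <- extract_age(c("The 12-year-old boy is named Dave.",
                      "A seven-year-old girl finds a map.",
                      "No age is mentioned here."))
ages
```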
What is the best way to match words with my sentence? Here is a little sample:
words <- c("apple", "pear", "grape")
sentences <- c("I have an apple and a pear", "Grape is my favorite", "I don't like pear")
The best is if the output could look like:
count sentence
2 "I have an apple and a pear"
1 "Grape is my favorite"
1 "I don't like pear"
I have tried using str_count but to no avail. Any help is appreciated!
library(stringr)
str_count(sentences, paste0("(?i)\\b(", paste0(words, collapse = "|"), ")\\b"))
[1] 2 1 1
How this works:
(?i): this makes sure the pattern match is case-insensitive
\\b and \\b make sure the words are matched as words with word boundaries (if \\b is not used you may end up matching something that just contains your words but forms itself a different word such as grapple, which contains apple)
( and ) form a group (a capturing group; (?: and ) would make it non-capturing), the content of which is the list of words separated, or combined if you prefer, by the pipe |, a metacharacter for alternation signifying 'OR'.
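The effect of the word boundaries can be checked with a small comparison. This is only an illustrative sketch: the vector `test_sentences` is made up, and the group is written as `(?:...)`, the explicitly non-capturing form, since the captured text is never used.

```r
library(stringr)

words <- c("apple", "pear", "grape")
alternatives <- paste(words, collapse = "|")

# Same alternation, with and without word boundaries
pattern_bounded <- paste0("(?i)\\b(?:", alternatives, ")\\b")
pattern_loose   <- paste0("(?i)(?:", alternatives, ")")

# Made-up sentences: "grapple" contains "apple", "pears" contains "pear"
test_sentences <- c("I grapple with pears", "Grape and APPLE")

bounded <- str_count(test_sentences, pattern_bounded)
loose   <- str_count(test_sentences, pattern_loose)

bounded  # whole-word matches only
loose    # also counts matches inside longer words
```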
If you want to have this inside a dataframe:
df <- data.frame(
sentences = sentences,
count = str_count(sentences, paste0("(?i)\\b(", paste0(words, collapse = "|"), ")\\b")))
Result:
df
sentences count
1 I have an apple and a pear 2
2 Grape is my favorite 1
3 I don't like pear 1
I might be missing something very obvious, but how can I write efficient code to get all matches of the singular form of a noun but NOT its plural? For example, I want to match
angel investor
angel
BUT NOT
angels
try angels
If I try
grep("angel ", string)
Then a string with JUST the word
angel
won't match.
Please help!
Use word-boundary markers \\b:
x <- c("angel investor", "angel","angels", "try angels")
grep("\\bangel\\b", x, value = TRUE)
[1] "angel investor" "angel"
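Note that a word boundary sits between any word character (letter, digit or underscore) and any non-word character, so punctuation counts as a boundary too. The short sketch below uses made-up strings to show which cases match; `perl = TRUE` is an optional choice here to use the PCRE engine.

```r
# Made-up examples: apostrophes and hyphens count as boundaries,
# letters on either side of "angel" do not
candidates <- c("angel's wings", "angel-investor", "archangel", "angels")

matches <- grepl("\\bangel\\b", candidates, perl = TRUE)
setNames(matches, candidates)
```

If possessives or hyphenated compounds should be excluded as well, the pattern would need to be stricter than a plain word boundary.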
You can try the following approach. I still believe there are other excellent ways to solve this problem.
library(dplyr)
library(stringr)
df <- data.frame(obs = 1:4, words = c("angel", "try angels", "angel investor", "angels"))
# drop rows containing a word that ends in "s" after one of the listed letters
df %>%
filter(!str_detect(words, "(?<=[ertkgwmnl])s\\b"))
# obs words
# 1 1 angel
# 2 3 angel investor
My dataframe column looks like this:
head(tweets_date$Tweet)
[1] b"It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac
[2] b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81
[3] b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!
[4] b'CHAMPIONS - 2018 #IPLFinal
[5] b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.
[6] b"Final. It's all over! Chennai Super Kings won by 8 wickets
These are tweets which have mentions starting with '#', I need to extract all of them and save each mention in that particular tweet as "#mention1 #mention2". Currently my code just extracts them as lists.
My code:
tweets_date$Mentions<-str_extract_all(tweets_date$Tweet, "#\\w+")
How do I collapse those lists in each row to form a single string separated by spaces, as mentioned earlier?
Thanks in advance.
I think it would be best to use an AsIs list column (created with I()) in this case:
extract words:
library(stringr)
Mentions <- str_extract_all(lis, "#\\w+")
some data frame:
df <- data.frame(col = 1:6, lett = LETTERS[1:6])
create a list column:
df$Mentions <- I(Mentions)
df
#output
col lett Mentions
1 1 A #DineshK....
2 2 B #IPL, #p....
3 3 C
4 4 D
5 5 E #ChennaiIPL
6 6 F
I think this is better since it allows for quite easy subsetting:
df$Mentions[[1]]
#output
[1] "#DineshKarthik" "#KKRiders"
df$Mentions[[1]][1]
#output
[1] "#DineshKarthik"
and it succinctly shows what's inside the column when printing the df.
data:
lis <- c("b'It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac",
"b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81",
"b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!",
"b'CHAMPIONS - 2018 #IPLFinal",
"b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.",
"b'Final. It's all over! Chennai Super Kings won by 8 wickets")
The str_extract_all function from the stringr package returns a list of character vectors. So, if you instead want a single comma-separated string per tweet, you may try using sapply for a base R option:
tweets <- str_extract_all(tweets_date$Tweet, "#\\w+")
tweets_date$Mentions <- sapply(tweets, function(x) paste(x, collapse=", "))
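Since the question asks for the form "#mention1 #mention2", the same approach works with a space as the separator. The sketch below uses a short made-up vector `tweets` in place of `tweets_date$Tweet`, and `vapply` instead of `sapply` so that the result is guaranteed to be a character vector.

```r
library(stringr)

# Made-up tweets standing in for tweets_date$Tweet
tweets <- c("b'CHAMPIONS - 2018 #IPLFinal",
            "b'#ChennaiIPL beat #SRH by 8 wickets",
            "b'no tags here")

hashtags <- str_extract_all(tweets, "#\\w+")

# One space-separated string per tweet; tweets without hashtags give ""
mentions <- vapply(hashtags, paste, character(1), collapse = " ")
mentions
```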
Via Twitter's help site: "Your username cannot be longer than 15 characters. Your real name can be longer (20 characters), but usernames are kept shorter for the sake of ease. A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores, as noted above. Check to make sure your desired username doesn't contain any symbols, dashes, or spaces."
Note that email addresses can be in tweets as can URLs with #'s in them (and not just the silly URLs with username/password in the host component). Thus, something like:
(^|[^[:alnum:]_#/\\!?=&])#([[:alnum:]_]{1,15})\\b
is likely a better, safer choice.
I have a dataset with unstructured text data.
From the text I want to extract sentences that have the following words:
education_vector <- c("university", "academy", "school", "college")
For example, from the text "I am a student at the University of Wyoming. My major is biology." I want to get "I am a student at the University of Wyoming."
From the text "I love statistics and I enjoy working with numbers. I graduated from Walla Wall Community College" I want to get "I graduated from Walla Wall Community College", and so on.
I tried using the grep function, but it returned the wrong results.
Answer modified to select first match.
texts = c("I am a student at the University of Wyoming. My major is biology.",
"I love statistics and I enjoy working with numbers. I graduated from Walla Wall Community College",
"First, I went to the Bowdoin College. Then I went to the University of California.")
gsub(".*?([^\\.]*(university|academy|school|college)[^\\.]*).*",
"\\1", texts, ignore.case=TRUE)
[1] "I am a student at the University of Wyoming"
[2] " I graduated from Walla Wall Community College"
[3] "First, I went to the Bowdoin College"
Explanation:
.*? is a non-greedy match up to the rest of the pattern. This is there to remove any sentences before the relevant sentence.
([^\\.]*(university|academy|school|college)[^\\.]*) matches any string of characters other than a period immediately before and after one of the key words.
.* handles anything after the relevant sentence.
This replaces the entire string with only the relevant part.
Here is a solution using grep
education <- c("university", "academy", "school", "college")
str1 <- "I am a student at the University of Wyoming. My major is biology."
str2 <- "I love statistics and I enjoy working with numbers. I graduated from Walla Wall Community College"
str1 <- tolower(str1) # we use tolower because "university" != "University"
str2 <- tolower(str2)
grep(paste(education, collapse = "|"), unlist(strsplit(str1, "(?<=\\.)\\s+",
perl = TRUE)),
value = TRUE)
grep(paste(education, collapse = "|"), unlist(strsplit(str2, "(?<=\\.)\\s+",
perl = TRUE)),
value = TRUE)
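The split-then-grep idea can be wrapped in a small helper so that it runs over a whole vector of texts and returns every matching sentence, not only the first. This is a sketch under the same assumption as above (sentences end with a period followed by whitespace); `extract_sentences` is a hypothetical name, and `ignore.case = TRUE` is used instead of `tolower` so the original capitalization is preserved.

```r
education <- c("university", "academy", "school", "college")

texts <- c("I am a student at the University of Wyoming. My major is biology.",
           "I love statistics and I enjoy working with numbers. I graduated from Walla Wall Community College",
           "First, I went to the Bowdoin College. Then I went to the University of California.")

# Hypothetical helper: split one text into sentences, keep those with a keyword
extract_sentences <- function(text, keywords) {
  sentences <- unlist(strsplit(text, "(?<=\\.)\\s+", perl = TRUE))
  grep(paste(keywords, collapse = "|"), sentences,
       value = TRUE, ignore.case = TRUE)
}

# One character vector of matching sentences per input text
result <- lapply(texts, extract_sentences, keywords = education)
result
```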
I want to find and replace a list of words with another list of words.
Say my data is
1) plz suggst med for flu
2) tgif , m goin to have a blast
3) she is getting anorexic day by day
List of words to be replaced are
1) plz -- please
2) pls -- please
3) sugg -- suggest
4) suggst -- suggest
5) tgif -- thank god its friday
6) lol -- laughed out loud
7) med -- medicine
I would like to have 2 lists, list "A" --a list of words to be found and list "B" --a list of words to be replaced with. So that I can keep adding terms to these lists as and when required. I need a mechanism to search for all the words in list "A" and then replace it with corresponding words in list "B".
What is the best way to achieve this in R? Thanks in advance.
Try this:
#messy list
listA <- c("plz suggst med for flu",
"tgif , m goin to have a blast",
"she is getting anorexic day by day")
#lookup table
list_gsub <- read.csv(text="
a,b
plz,please
pls,please
sugg,suggest
suggst,suggest
tgif,thank god its friday
lol,laughed out loud
med,medicine")
#loop through each lookup row; \\b anchors each pattern to whole words,
#so that "sugg" is not replaced inside "suggst"
for(x in 1:nrow(list_gsub))
listA <- gsub(paste0("\\b", list_gsub[x,"a"], "\\b"), list_gsub[x,"b"], listA)
#output
listA
#[1] "please suggest medicine for flu"
#[2] "thank god its friday , m goin to have a blast"
#[3] "she is getting anorexic day by day"
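Because "sugg" is itself a substring of "suggst", plain substring replacement can rewrite part of a longer abbreviation. As an alternative sketch, the lookup can be kept as a named vector and passed to `stringr::str_replace_all`, with each abbreviation wrapped in word boundaries. The names `lookup`, `patterns` and `cleaned` are chosen here for illustration, and the sketch assumes the abbreviations contain no regex metacharacters.

```r
library(stringr)

messy <- c("plz suggst med for flu",
           "tgif , m goin to have a blast",
           "she is getting anorexic day by day")

# Lookup as a named vector: names are abbreviations, values are replacements
lookup <- c(plz = "please", pls = "please", sugg = "suggest",
            suggst = "suggest", tgif = "thank god its friday",
            lol = "laughed out loud", med = "medicine")

# str_replace_all accepts a named vector of the form c(pattern = replacement)
patterns <- setNames(unname(lookup), paste0("\\b", names(lookup), "\\b"))

cleaned <- str_replace_all(messy, patterns)
cleaned
```

New abbreviations can then be added by appending entries to `lookup`, without touching the replacement code.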
Have a look at ?gsub:
x <- c("plz suggst med for flu", "tgif , m goin to have a blast", "she is getting anorexic day by day")
gsub("plz", "please", x)