I have a data frame with two columns. cnn_handle contains Twitter handles and tweet contains tweets where the Twitter handle in the corresponding row is mentioned. However, most tweets mention at least one other user/handle indicated by #. I want to remove all rows where a tweet contains more than one #.
df
cnn_handle tweet
1 #DanaBashCNN #JohnKingCNN #DanaBashCNN #kaitlancollins #eliehonig #thelauracoates #KristenhCNN CNN you are still FAKE NEWS !!!
2 #DanaBashCNN #DanaBashCNN He could have made the same calls here, from SC.
3 #DanaBashCNN #DanaBashCNN GRAMMER ALERT: THAT'S FORMER PRESIDENT TRUMP Please don't forget this important point. Also please refrain from showing a pic of him till you have one in his casket. thank you
4 #brianstelter #eliehonig #brianstelter My apologies to you sir. Just seems like that story disappeared. Imo the nursing home scandal is just as bad.
5 #brianstelter #DrAndrewBaer1 #JGreenblattADL #brianstelter #CNN #TuckerCarlson #FoxNews Anti-Semite are you, Herr Doktor? How very Mengele of you.
6 #brianstelter #ma_makosh #Shortguy1 #brianstelter #ChrisCuomo Liberals, their feelings before facts and their crucifixion of people before due process. Never a presumption of innocence when it concerns the rival party. So un-American.
7 #andersoncooper #BrendonLeslie And Biden was a staunch opponent of “forced busing”. He also said that integrating schools will cause a “racial jungle”. But u won’t hear this on #ChrisCuomo #jaketapper #Acosta #andersoncooper bc they continue to cover up the truth about Biden & his family.
8 #andersoncooper Anderson Cooper revealed that he "wanted a change" when reflecting on his break from news as #TheMole arrives on Netflix.
9 #andersoncooper #johnnydollar01 #newsbusters #drsanjaygupta #andersoncooper He was terrible as a host
I suspect some type of regular expression is needed. However, I am not sure how to combine it with a greater-than sign.
The desired result, i.e. tweets only mentioning the corresponding cnn_handle:
cnn_handle tweet
2 #DanaBashCNN #DanaBashCNN He could have made the same calls here, from SC.
3 #DanaBashCNN #DanaBashCNN GRAMMER ALERT: THAT'S FORMER PRESIDENT TRUMP Please don't forget this important point. Also please refrain from showing a pic of him till you have one in his casket. thank you
8 #andersoncooper Anderson Cooper revealed that he "wanted a change" when reflecting on his break from news as #TheMole arrives on Netflix.
A straightforward solution uses str_count from stringr and presupposes that # occurs only in Twitter handles:
base R:
library(stringr)
df[str_count(df$tweet, "#") <= 1, ]
dplyr:
library(dplyr)
library(stringr)
df %>%
  filter(!str_count(tweet, "#") > 1)
Assuming your dataframe is called tweets, just check to see if there is more than one match for # followed by text:
pattern <- "#[a-zA-Z.+]"
multiple_ats <- unlist(lapply(tweets$tweet, function(x) length(gregexpr(pattern, x)[[1]])>1))
tweets[!multiple_ats,]
Output:
# A tibble: 3 x 2
cnn_handle tweet
<chr> <chr>
1 #DanaBashCNN "#DanaBashCNN He could have made the same calls here, from SC."
2 #DanaBashCNN "#DanaBashCNN GRAMMER ALERT: THAT'S FORMER PRESIDENT TRUMP Please don't forget this important point. Also please refrain from showing a pic of him till you have one in his casket. thank you"
3 #andersoncooper "Anderson Cooper revealed that he \"wanted a change\" when reflecting on his break from news as #TheMole arrives on Netflix."
Edit: You will have to change the pattern if Twitter user names are allowed to start with numbers or special characters. I don't know what the rules are.
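For reference, Twitter usernames may contain only letters, digits, and underscores and are at most 15 characters long, so the pattern can be tightened; the lapply()/unlist() pair can also be collapsed with base R's lengths(). A sketch combining both tweaks:
pattern <- "#\\w{1,15}"   # "#" followed by a valid Twitter username
multiple_ats <- lengths(gregexpr(pattern, tweets$tweet)) > 1
tweets[!multiple_ats, ]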
I have a text from which I want to extract the first two paragraphs. The text consists of several paragraphs separated by empty lines. The paragraphs themselves can contain line breaks. What I want to extract is everything from the beginning of the text until the second empty line. This is the original text:
Today I meet my friends in Kyiv to celebrate my new permanent residency status in Ukraine.
Then I went to a nice restaurant with them.
Buy me a Beer: https://www.buymeacoffee.com/johnnyfd
Support the GoFundMe: http://gofundme.com/f/send-money-dire...
Follow Me:
The text I want to have is:
Today I meet my friends in Kyiv to celebrate my new permanent residency status in Ukraine.
Then I went to a nice restaurant with them.
Buy me a Beer: https://www.buymeacoffee.com/johnnyfd
I tried to create a regular expression to do the job, and I thought the following seemed to be a possible solution:
(.*|\n)*(?:[[:blank:]]*\n){2,}(.*|\n)*(?:[[:blank:]]*\n){2,}
When I use it in R in stri_extract_all_regex, I receive the following error:
Error in stri_extract_all_regex(video_desc_orig, "(.*|\n)*?(?:[[:blank:]]*\n){2,}(.*?|\n)*(?:[[:blank:]]*\n){2,}") :
Regular expression backtrack stack overflow. (U_REGEX_STACK_OVERFLOW)
This is my first time using regex, and I really don't know how to interpret this error. Any help appreciated ;)
You have nested quantifiers like (.*|\n)*, which create an enormous number of paths to explore. That pattern, for example, first matches all of the text and then starts backtracking to fit in the next parts of the pattern.
The following pattern includes the last 2 newlines and makes sure that the lines contain at least a single non-whitespace character:
\A[^\S\n]*\S.*(?:\n[^\S\n]*\S.*)*\n{2,}[^\S\n]*\S.*(?:\n[^\S\n]*\S.*)*
Explanation
\A Start of string
[^\S\n]*\S.* Match a whole line with at least a single non-whitespace char
(?:\n[^\S\n]*\S.*)* Optionally repeat all following lines that contain at least a single non-whitespace char
\n{2,} Match 2 or more newlines
[^\S\n]*\S.*(?:\n[^\S\n]*\S.*)* Same as the previous pattern, to match the lines of the second paragraph
Example
library(stringi)
string <- 'Today I meet my friends in Kyiv to celebrate my new permanent residency status in Ukraine.
Then I went to a nice restaurant with them.
Buy me a Beer: https://www.buymeacoffee.com/johnnyfd
Support the GoFundMe: http://gofundme.com/f/send-money-dire...
Follow Me: '
stri_extract_all_regex(
  string,
  '\\A[^\\S\\n]*\\S.*(?:\\n[^\\S\\n]*\\S.*)*\\n{2,}[^\\S\\n]*\\S.*(?:\\n[^\\S\\n]*\\S.*)*'
)
Output
[[1]]
[1] "Today I meet my friends in Kyiv to celebrate my new permanent residency status in Ukraine.\nThen I went to a nice restaurant with them.\n\nBuy me a Beer: https://www.buymeacoffee.com/johnnyfd"
In R you need to double the backslashes, writing \\ for each \ in the pattern.
string <- 'Today I meet my friends in Kyiv to celebrate my new permanent residency status in Ukraine.
Then I went to a nice restaurant with them.
Buy me a Beer: https://www.buymeacoffee.com/johnnyfd
Support the GoFundMe: http://gofundme.com/f/send-money-dire...
Follow Me: '
library(stringr)
string |>
  str_extract('(.*|\\n)*(?:[[:blank:]]*\\n){2,}(.*|\\n)*(?:[[:blank:]]*\\n){2,}') |>
  cat()
# Output
Today I meet my friends in Kyiv to celebrate my new permanent residency status in Ukraine.
Then I went to a nice restaurant with them.
Buy me a Beer: https://www.buymeacoffee.com/johnnyfd
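If a regex-free route is acceptable, here is a minimal sketch using the same string as defined above: since paragraphs are separated by blank lines, split on those and rejoin the first two chunks.
# split wherever a blank (possibly whitespace-only) line separates paragraphs
paras <- strsplit(string, "\n[[:blank:]]*\n")[[1]]
# keep the first two paragraphs and restore the blank line between them
cat(paste(paras[1:2], collapse = "\n\n"))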
'''
A stray SKATEBOARD clips her, causing her to stumble and
spill her coffee, as well as the contents of her backpack.
The young RIDER dashes over to help, trembling when he sees
who his board has hit.
                         RIDER
                    Hey -- sorry.
Cowering in fear, he attempts to scoop up her scattered
belongings.
                         KAT
                    Leave it
He persists.
                         KAT (continuing)
                    I said, leave it!
                         RIDER
                    Hey -- sorry.
'''
I'm scraping some scripts that I want to do some text analysis with. I want to pull only the dialogue from the scripts, and it looks like the dialogue has a consistent amount of spacing.
So, for example, I want the line "Hey -- sorry.". I know that the spacing is 20 and that it is consistent throughout the script. How can I read in only that line and the rest that have equal spacing?
My instinct is to use read.fwf and read the file as fixed-width.
What do you guys think?
I'm scraping from URLs like this:
https://imsdb.com/scripts/10-Things-I-Hate-About-You.html
library(tidytext)
library(tidyverse)
text <- c("PADUA HIGH SCHOOL - DAY
Welcome to Padua High School, your typical urban-suburban
high school in Portland, Oregon. Smarties, Skids, Preppies,
Granolas. Loners, Lovers, the In and the Out Crowd rub sleep
out of their eyes and head for the main building.
PADUA HIGH PARKING LOT - DAY
KAT STRATFORD, eighteen, pretty -- but trying hard not to be
-- in a baggy granny dress and glasses, balances a cup of
coffee and a backpack as she climbs out of her battered,
baby blue '75 Dodge Dart.
A stray SKATEBOARD clips her, causing her to stumble and
spill her coffee, as well as the contents of her backpack.
The young RIDER dashes over to help, trembling when he sees
who his board has hit.
                         RIDER
                    Hey -- sorry.
Cowering in fear, he attempts to scoop up her scattered
belongings.
                         KAT
                    Leave it
He persists.
                         KAT (continuing)
                    I said, leave it!
She grabs his skateboard and uses it to SHOVE him against a
car, skateboard tip to his throat. He whimpers pitifully
and she lets him go. A path clears for her as she marches
through a pack of fearful students and SLAMS open the door,
entering school.
INT. GIRLS' ROOM - DAY
BIANCA STRATFORD, a beautiful sophomore, stands facing the
mirror, applying lipstick. Her less extraordinary, but
still cute friend, CHASTITY stands next to her.
                         BIANCA
                    Did you change your hair?
                         CHASTITY
                    No.
                         BIANCA
                    You might wanna think about it
Leave the girls' room and enter the hallway.
HALLWAY - DAY- CONTINUOUS
Bianca is immediately greeted by an admiring crowd, both
boys
and girls alike.
                         BOY
                    (adoring)
                    Hey, Bianca.
                         GIRL
                    Awesome shoes.
The greetings continue as Chastity remains wordless and
unaddressed by her side. Bianca smiles proudly,
acknowledging her fans.
GUIDANCE COUNSELOR'S OFFICE - DAY
CAMERON JAMES, a clean-cut, easy-going senior with an open,
farm-boy face, sits facing Miss Perky, an impossibly cheery
guidance counselor.")
names_stopwords <- c("^(rider|kat|chastity|bianca|boy|girl)")
text %>%
  as_tibble() %>%
  unnest_tokens(text, value, token = "lines") %>%
  filter(str_detect(text, "\\s{15,}")) %>%
  mutate(text = str_trim(text)) %>%
  filter(!str_detect(text, names_stopwords))
Output:
# A tibble: 9 x 1
text
<chr>
1 hey -- sorry.
2 leave it
3 i said, leave it!
4 did you change your hair?
5 no.
6 you might wanna think about it
7 (adoring)
8 hey, bianca.
9 awesome shoes.
You can include further character names in the names_stopwords vector.
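If you would rather not hardcode that list, here is a possible sketch, assuming the cue lines are exactly the deeply indented lines that open with two or more capital letters:
# collect the indented ALL-CAPS cue lines, e.g. "RIDER", "KAT (continuing)"
cues <- str_split(text, "\n")[[1]] %>%
  str_subset("^\\s{15,}[A-Z]{2,}") %>%
  str_trim() %>%
  str_extract("^[A-Z]+") %>%
  unique() %>%
  tolower()
names_stopwords <- paste0("^(", paste(cues, collapse = "|"), ")")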
You can try the following:
url <- 'https://imsdb.com/scripts/10-Things-I-Hate-About-You.html'
url %>%
  # Read webpage line by line
  readLines() %>%
  # Remove '<b>' and '</b>' from the strings
  gsub('<b>|</b>', '', .) %>%
  # Select only the text which begins with 20 whitespace characters
  grep('^\\s{20,}', ., value = TRUE) %>%
  # Remove leading/trailing whitespace
  trimws() %>%
  # Remove all-caps strings (scene headings, character names)
  grep('^([A-Z]+\\s?)+$', ., value = TRUE, invert = TRUE)
#[1] "Hey -- sorry." "Leave it" "KAT (continuing)"
#[4] "I said, leave it!" "Did you change your hair?" "No."
#...
#...
I have tried cleaning this as much as possible, but it might require some more cleaning depending on what you actually want to extract.
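Picking up the read.fwf instinct from the question, here is a base-R sketch of the same fixed-width idea, assuming the 20-space indent is exact: keep only the lines whose first 20 characters are all blank. As with the grep() answer above, the all-caps cue lines would still need filtering afterwards.
lines <- readLines('https://imsdb.com/scripts/10-Things-I-Hate-About-You.html')
lines <- gsub('<b>|</b>', '', lines)                  # strip the bold tags
dialogue <- lines[substr(lines, 1, 20) == strrep(" ", 20)]
trimws(dialogue)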
I am looking at Twitter data which I am then feeding into an HTML document. Often the text contains special characters, like emojis, that aren't properly encoded for HTML. For example the tweet:
If both #AvengersEndgame and #Joker are nominated for Best Picture, it will be Marvel vs DC for the first time in a Best Picture race. I think both films deserve the nod, but the Twitter discourse leading up to the ceremony will be 🔥 🔥 🔥
would become:
If both #AvengersEndgame and #Joker are nominated for Best Picture, it will be Marvel vs DC for the first time in a Best Picture race. I think both films deserve the nod, but the Twitter discourse leading up to the ceremony will be ðŸ”¥ ðŸ”¥ ðŸ”¥
when fed into an html document.
Working manually I could use a tool like https://www.textfixer.com/html/html-character-encoding.php to encode the tweet to look like:
If both #AvengersEndgame and #Joker are nominated for Best Picture, it will be Marvel vs DC for the first time in a Best Picture race. I think both films deserve the nod, but the Twitter discourse leading up to the ceremony will be &#55357;&#56613; &#55357;&#56613; &#55357;&#56613;
which I could then feed into an HTML document and have the emojis show up. Is there a package or function in R that could take text and HTML-encode it similarly to the web tool above?
Here's a function which will encode non-ASCII characters as HTML entities.
entity_encode <- function(x) {
  cp <- utf8ToInt(x)                          # integer code points
  rr <- vector("character", length(cp))
  ucp <- cp > 127                             # TRUE for non-ASCII code points
  rr[ucp] <- paste0("&#", cp[ucp], ";")       # numeric HTML entities
  rr[!ucp] <- sapply(cp[!ucp], function(z) rawToChar(as.raw(z)))
  paste0(rr, collapse = "")
}
This returns
[1] "If both #AvengersEndgame and #Joker are nominated for Best Picture, it will be Marvel vs DC for the first time in a Best Picture race. I think both films deserve the nod, but the Twitter discourse leading up to the ceremony will be 🔥 🔥 🔥"
for your input, but those seem to be equivalent encodings: the web tool emits UTF-16 surrogate pairs (&#55357;&#56613;), while this function emits the code point directly (&#128293;).
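As a quick self-contained check of the function (the emoji is written as a Unicode escape so the snippet stays ASCII):
tweet <- "the ceremony will be \U0001F525 \U0001F525 \U0001F525"
entity_encode(tweet)
# [1] "the ceremony will be &#128293; &#128293; &#128293;"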
I understand that I can pull tweets from R using the two methods below:
searchTwitter("abc") # pulls all tweets that contain the keyword 'abc'
searchTwitter("from:abc") # pulls all tweets from user 'abc'
I want to pull tweets where a user was mentioned or tagged. On Twitter, another user is mentioned or tagged in a tweet by putting @ before the user's name. To achieve my goal, should I use searchTwitter("abc") or searchTwitter("@abc")?
Is there a single command that could give me tweets from the user abc as well as all tweets where abc was tagged? I can always do two different searches - searchTwitter("abc") and searchTwitter("@abc") (provided "@abc" returns all tweets where user abc was mentioned or tagged) - and then remove the duplicates. But I want to avoid that, as it would involve pulling many of the same tweets twice: Twitter allows pulling only 18000 tweets per 15 minutes.
You can pull all tweets from a single user using userTimeline('user_name'). You can define the number of tweets to query by adding n = number after the user's handle.
Example:
userTimeline('StackOverflow',n=5)
[[1]]
[1] "StackOverflow: #szalapski Hi Patrick! I just asked and our team says it's on their list to update the data AND make future updates… "
[[2]]
[1] "StackOverflow: If you want to see the code, check out Julia’s blog post below. #IBMcloud #DSX "
[[3]]
[1] "StackOverflow: Interested in text mining? Check out how Stack Overflow data scientist #juliasilge implements topic modeling using… "
[[4]]
[1] "StackOverflow: #martindaniel4 #Google "
[[5]]
[1] "StackOverflow: #Koprowski_it \xed��\xed�\u008d"
The query will also return https links but they had to be deleted for the post to go through here.
The book R and Data Mining: Examples and Case Studies by Yanchang Zhao has some excellent examples if you're looking for a textbook walk-through.
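Back to the original question of combining both result sets without double-counting: a sketch using the twitteR package, whose status objects expose an id field ('abc' is a placeholder handle, and the calls are untested):
library(twitteR)
by_user  <- searchTwitter("from:abc", n = 1000)  # tweets written by abc
mentions <- searchTwitter("@abc", n = 1000)      # tweets mentioning abc
combined <- c(by_user, mentions)
ids <- sapply(combined, function(s) s$id)        # status objects expose $id
combined <- combined[!duplicated(ids)]           # drop tweets pulled twice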
I read a text into R using the readChar() function. I aim to test the hypothesis that the sentences of the text have as many occurrences of the letter "a" as of the letter "b". I recently discovered the {stringr} package, which helped me a great deal to do useful things with my text, such as counting the number of characters and the total number of occurrences of each letter in the entire text. Now I need to know the number of sentences in the whole text. Does R have any function which can help me do that? Thank you very much!
Thank you @gui11aume for your answer. A very good package I just found that can help do the work is {openNLP}. This is the code to do that:
install.packages("openNLP") ## Installs the required natural language processing (NLP) package
install.packages("openNLPmodels.en") ## Installs the model files for the English language
library(openNLP) ## Loads the package for use in the task
library(openNLPmodels.en) ## Loads the model files for the English language
text = "Dr. Brown and Mrs. Theresa will be away from a very long time!!! I can't wait to see them again." ## This sentence has unusual punctuation as suggested by #gui11aume
x = sentDetect(text, language = "en") ## sentDetect() is the function to use. It detects and separates sentences in a text. The first argument is the string vector (or text) and the second argument is the language.
x ## Displays the different sentences in the string vector (or text).
[1] "Dr. Brown and Mrs. Theresa will be away from a very long time!!! "
[2] "I can't wait to see them again."
length(x) ## Displays the number of sentences in the string vector (or text).
[1] 2
The {openNLP} package is really great for natural language processing in R; the package documentation includes a good, short introduction.
Three more languages are supported in the package. You just need to install and load the corresponding model files.
{openNLPmodels.es} for Spanish
{openNLPmodels.ge} for German
{openNLPmodels.th} for Thai
What you are looking for is sentence tokenization, and it is not as straightforward as it seems, even in English (sentences like "I met Dr. Bennett, the ex-husband of Mrs. Johnson." contain full stops that do not end the sentence).
R is definitely not the best choice for natural language processing. If you are proficient in Python, I suggest you have a look at the nltk module, which covers sentence tokenization, word tokenization, and many other topics.
If you want to stick to R, I would suggest you count the end-of-sentence characters (., ?, !), since you are able to count characters. A way of doing it with a regular expression is like so:
text <- 'Hello world!! Here are two sentences for you...'
length(gregexpr('[[:alnum:] ][.!?]', text)[[1]])
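One caveat worth noting: gregexpr() returns -1 when there is no match, and length() would still count that as one sentence. A small guarded wrapper, as a sketch:
count_sentences <- function(text) {
  m <- gregexpr('[[:alnum:] ][.!?]', text)[[1]]
  if (m[1] == -1) 0L else length(m)   # -1 means no sentence-ending match
}
count_sentences('Hello world!! Here are two sentences for you...')
# [1] 2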