Cleaning tweets using gsub in R [duplicate]

This question already has answers here:
Removing hashtags , hyperlinks and twitter handles from dataset in R using gsub
(2 answers)
Closed 4 years ago.
I am trying to clean a bunch of tweets using gsub.
V3
1 Well: Getting Insurance to Pay for Midwives http://xxxxxxxxx
2 Lightning may be giving you a headache http://xxxxxxxx
3 New York City is requiring flu shots for kids under 5 in city preschools and day care. Do your kids get the flu shot? http://xxxxxxxx
4 VIDEO: Can we erase memories entirely? http://xxxxxxxx
5 Artificial sweeteners are a $1.5-billion-a-year market #kchangnyt reported last year. http://xxxxxxxx
I tried to use the following code to remove all the links (taken from a previous question at SO):
newdf1$V3 <- gsub("http\\w+", "", newdf1$V3)
However, there was no change in the tweets.
Further, when I use the code newdf1$V3 <- gsub("http.*", "", newdf1$V3), I am able to remove the links:
V3
1 Well: Getting Insurance to Pay for Midwives
2 Lightning may be giving you a headache
3 New York City is requiring flu shots for kids under 5 in city preschools and day care. Do your kids get the flu shot?
4 VIDEO: Can we erase memories entirely?
5 Artificial sweeteners are a $1.5-billion-a-year market #kchangnyt reported last year.
Can someone explain why the code in the first case does not yield the desired results?

That's because \w matches only word characters (letters, digits, and the underscore). Since "http" in these tweets is immediately followed by "://", and ":" is not a word character, the pattern http\\w+ never finds a match, so gsub changes nothing.
In contrast, .* matches anything that follows "http", so that works.
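If you want to drop the URL without also consuming unrelated text, one common middle ground is to match non-whitespace rather than word characters. A minimal sketch (only the \\S shorthand changes from the original attempt):
# \S matches any non-whitespace character, so it crosses "://"
# and stops at the next space, removing just the URL itself
newdf1$V3 <- gsub("http\\S+", "", newdf1$V3)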

Related

How do I automate the code to run for each row entry in the dataset in R?

Let's say I have this data frame of several random sentences
Sentences <- c("John is playing a video game at the moment",
               "Tom will cook a delicious meal later",
               "Kyle is with his friends watching the game",
               "Diana is hosting her birthday party tomorrow night")
df <- data.frame(Sentences)
keywords <- c("game", "is", "will", "meal", "birthday", "party")
I also have a vector of keywords. I need to create a new column in the data frame containing only the keywords that appear in each sentence.
na.omit(str_match(df[n,],keywords))
I have constructed this line of code which returns keywords that were used in those sentences (n stands for row number). How do I automate this code to be applied for each row?
We could use str_extract_all from the stringr package for this:
library(dplyr)
library(stringr)
df %>%
  mutate(new_col = str_extract_all(Sentences, paste(keywords, collapse = "|")))
Sentences new_col
1 John is playing a video game at the moment is, game
2 Tom will cook a delicious meal later will, meal
3 Kyle is with his friends watching the game is, is, game
4 Diana is hosting her birthday party tomorrow night is, birthday, party
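Note that the pattern matches substrings, which is why row 3 returns "is" twice: the second hit comes from inside "his". If whole-word matches are wanted, a small refinement (a sketch, not part of the original answer) is to wrap the alternation in word boundaries:
pattern <- paste0("\\b(", paste(keywords, collapse = "|"), ")\\b")
df %>%
  mutate(new_col = str_extract_all(Sentences, pattern))
# row 3 now yields: is, game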

Extract words starting with # in R dataframe and save as new column

My dataframe column looks like this:
head(tweets_date$Tweet)
[1] b"It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac
[2] b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81
[3] b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!
[4] b'CHAMPIONS - 2018 #IPLFinal
[5] b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.
[6] b"Final. It's all over! Chennai Super Kings won by 8 wickets
These are tweets with mentions starting with '#'. I need to extract all of them and save the mentions in each particular tweet as "#mention1 #mention2". Currently my code just extracts them as lists.
My code:
tweets_date$Mentions<-str_extract_all(tweets_date$Tweet, "#\\w+")
How do I collapse the list in each row to form a single string separated by spaces, as mentioned earlier?
Thanks in advance.
I trust it would be best if you used an "AsIs" list column in this case:
extract words:
library(stringr)
Mentions <- str_extract_all(lis, "#\\w+")
some data frame:
df <- data.frame(col = 1:6, lett = LETTERS[1:6])
create a list column:
df$Mentions <- I(Mentions)
df
#output
col lett Mentions
1 1 A #DineshK....
2 2 B #IPL, #p....
3 3 C
4 4 D
5 5 E #ChennaiIPL
6 6 F
I think this is better since it allows for quite easy subsetting:
df$Mentions[[1]]
#output
[1] "#DineshKarthik" "#KKRiders"
df$Mentions[[1]][1]
#output
[1] "#DineshKarthik"
and it succinctly shows what's inside the column when printing the df.
data:
lis <- c("b'It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac",
"b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81",
"b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!",
"b'CHAMPIONS - 2018 #IPLFinal",
"b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.",
"b'Final. It's all over! Chennai Super Kings won by 8 wickets")
The str_extract_all function from the stringr package returns a list of character vectors. So, if you instead want a single comma-separated string per tweet, you may try using sapply for a base R option:
tweets <- str_extract_all(tweets_date$Tweet, "#\\w+")
tweets_date$Mentions <- sapply(tweets, function(x) paste(x, collapse=", "))
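Since the question asked for the mentions separated by spaces ("#mention1 #mention2") rather than commas, just swap the separator:
tweets_date$Mentions <- sapply(tweets, paste, collapse = " ")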
Via Twitter's help site: "Your username cannot be longer than 15 characters. Your real name can be longer (20 characters), but usernames are kept shorter for the sake of ease. A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores, as noted above. Check to make sure your desired username doesn't contain any symbols, dashes, or spaces."
Note that email addresses can appear in tweets, as can URLs with #'s in them (and not just the silly URLs with username/password in the host component). Thus, something like:
(^|[^[:alnum:]_#/\\!?=&])#([[:alnum:]_]{1,15})\\b
is likely a better, safer choice.
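For illustration, a minimal sketch of applying that pattern with base R, using the lis vector from the answer above (capture group 2 holds the name without the '#'):
pat <- "(^|[^[:alnum:]_#/\\!?=&])#([[:alnum:]_]{1,15})\\b"
m <- regmatches(lis[1], regexec(pat, lis[1], perl = TRUE))
m[[1]][3]
# [1] "DineshKarthik"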

add a sentiment column onto a dataset in r

I have done some basic sentiment analysis in R and wanted to know if there is a way to analyze the sentiment of a sentence or row and then append a column with that sentiment. All the analysis I have done up until now gives me an overview of the sentiment or pulls out specific words, but doesn't link back to the original row of data.
The input of my data would be fed in through a BI software and would look something like below with a case number and some text:
"12345","I am extremely angry with my service"
"23456","I was happy with how everything turned out"
"34567","The rep did a great job helping me"
I would like it to be returned as an output below
"12345","I am extremely angry with my service","Anger"
"23456","I was happy with how everything turned out","Positive"
"34567","The rep did a great job helping me","Positive"
Any point in the right direction of a package or resource would be greatly appreciated!
The problem you run into with sentences is that sentiment lexicons are based on words. If you look at the nrc lexicon, the word "angry" has three sentiment values: anger, disgust, and negative. Which one do you choose? Or a sentence may return multiple words that are in a lexicon. Try testing different lexicons with your text to see what happens, for example with tidytext.
If you want a package that can analyse sentiment at the sentence level, you can look into sentimentr. You will not get sentiment values like anger back, but a sentiment/polarity score. More about sentimentr can be found in the package documentation and on the sentimentr GitHub page.
A small example code:
library(sentimentr)
text <- data.frame(id = c("12345", "23456", "34567"),
                   sentence = c("I am extremely angry with my service",
                                "I was happy with how everything turned out",
                                "The rep did a great job helping me"),
                   stringsAsFactors = FALSE)
sentiment(text$sentence)
element_id sentence_id word_count sentiment
1: 1 1 7 -0.5102520
2: 2 1 8 0.2651650
3: 3 1 8 0.3535534
# add sentiment score to data.frame
text$sentiment <- sentiment(text$sentence)$sentiment
text
id sentence sentiment
1 12345 I am extremely angry with my service -0.5102520
2 23456 I was happy with how everything turned out 0.2651650
3 34567 The rep did a great job helping me 0.3535534
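sentimentr stops at the numeric score; if you need coarse labels like those in the expected output, a simple thresholding step on top of it could look like the sketch below (the cutoff at 0 is an arbitrary assumption, and this cannot recover emotion labels such as "Anger"):
# map the sign of the polarity score to a coarse label
text$label <- ifelse(text$sentiment > 0, "Positive",
              ifelse(text$sentiment < 0, "Negative", "Neutral"))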

unnest_tokens fails to handle vectors in R with tidytext package

I want to use the tidytext package to create a column with ngrams, using the following code:
library(tidytext)
unnest_tokens(tbl = president_tweets,
              output = bigrams,
              input = text,
              token = "ngrams",
              n = 2)
But when I run this I get the following error message:
error: unnest_tokens expects all columns of input to be atomic vectors (not lists)
My text column consists of a lot of tweets with rows that look like the following and is of class character.
president_tweets$text <- c("The United States Senate just passed the biggest in history Tax Cut and Reform Bill. Terrible Individual Mandate (ObamaCare)Repealed. Goes to the House tomorrow morning for final vote. If approved, there will be a News Conference at The White House at approximately 1:00 P.M.",
"Congratulations to Paul Ryan, Kevin McCarthy, Kevin Brady, Steve Scalise, Cathy McMorris Rodgers and all great House Republicans who voted in favor of cutting your taxes!",
"A story in the #washingtonpost that I was close to rescinding the nomination of Justice Gorsuch prior to confirmation is FAKE NEWS. I never even wavered and am very proud of him and the job he is doing as a Justice of the U.S. Supreme Court. The unnamed sources dont exist!",
"Stocks and the economy have a long way to go after the Tax Cut Bill is totally understood and appreciated in scope and size. Immediate expensing will have a big impact. Biggest Tax Cuts and Reform EVER passed. Enjoy, and create many beautiful JOBS!",
"DOW RISES 5000 POINTS ON THE YEAR FOR THE FIRST TIME EVER - MAKE AMERICA GREAT AGAIN!",
"70 Record Closes for the Dow so far this year! We have NEVER had 70 Dow Records in a one year period. Wow!"
)
---------Update:----------
It looks like the sentimentr or exploratory package caused the conflict. I reloaded my packages without these and now it works again!
Hmmmmm, I am not able to reproduce your problem.
library(tidytext)
library(dplyr)
president_tweets <- data_frame(text = c("The United States Senate just passed the biggest in history Tax Cut and Reform Bill. Terrible Individual Mandate (ObamaCare)Repealed. Goes to the House tomorrow morning for final vote. If approved, there will be a News Conference at The White House at approximately 1:00 P.M.",
"Congratulations to Paul Ryan, Kevin McCarthy, Kevin Brady, Steve Scalise, Cathy McMorris Rodgers and all great House Republicans who voted in favor of cutting your taxes!",
"A story in the #washingtonpost that I was close to rescinding the nomination of Justice Gorsuch prior to confirmation is FAKE NEWS. I never even wavered and am very proud of him and the job he is doing as a Justice of the U.S. Supreme Court. The unnamed sources dont exist!",
"Stocks and the economy have a long way to go after the Tax Cut Bill is totally understood and appreciated in scope and size. Immediate expensing will have a big impact. Biggest Tax Cuts and Reform EVER passed. Enjoy, and create many beautiful JOBS!",
"DOW RISES 5000 POINTS ON THE YEAR FOR THE FIRST TIME EVER - MAKE AMERICA GREAT AGAIN!",
"70 Record Closes for the Dow so far this year! We have NEVER had 70 Dow Records in a one year period. Wow!"))
unnest_tokens(tbl = president_tweets,
              output = bigrams,
              input = text,
              token = "ngrams",
              n = 2)
#> # A tibble: 205 x 1
#> bigrams
#> <chr>
#> 1 the united
#> 2 united states
#> 3 states senate
#> 4 senate just
#> 5 just passed
#> 6 passed the
#> 7 the biggest
#> 8 biggest in
#> 9 in history
#> 10 history tax
#> # ... with 195 more rows
The current CRAN version of tidytext does in fact not allow list-columns, but we have changed the column handling so that the development version on GitHub now supports list-columns. Are you sure you don't have any of these in your data frame/tibble? What are the data types of all of your columns? Are any of them of type list?
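A quick way to check for list-columns (a minimal sketch; president_tweets stands in for your actual data frame):
# prints the class of every column; any "list" entries are the likely culprits
sapply(president_tweets, class)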

How to create a column and replace value

Question
1
An artist impression of a star system is responsible for a nova. The team from university of VYU focus on a class of compounds. The young people was seen enjoying the football match.
2
Scientists have made a breakthrough and solved a decades-old mystery by revealing how a powerful. Heart attacks more due to nurture than nature. SA footballer Senzo Meyiwa shot dead to save girlfriend
Expected output
1 An artist impression of a star system is responsible for a nova.
1 The team from university of VYU focus on a class of compounds.
1 The young people was seen enjoying the football match.
2 Scientists have made a breakthrough and solved a decades-old mystery by revealing how a powerful.
2 Heart attacks more due to nurture than nature.
2 SA footballer Senzo Meyiwa shot dead to save girlfriend
The data is in CSV format and has around 1000 data points; the numbers are in column 1 and the sentences in column 2. I need to split the strings and retain the row number for each particular sentence. Need your help to build the R code.
Note: the number and the sentence are two different columns.
I have tried this code to split the strings, but I still need the code for the row index:
x$qwerty <- as.character(x$qwerty)
sa<-list(strsplit(x$qwerty,".",fixed=TRUE))[[1]]
s<-unlist(sa)
write.csv(s,"C:\\Users\\Suhas\\Desktop\\out23.csv")
One inconvenience of vectorization in R is that vectorized functions operate from "inside" the vector. That is, they operate on the elements themselves, rather than on the elements in the context of the vector. The user therefore loses the innate ability to keep track of the index, i.e. where the element being operated on was located in the original object.
The workaround is to generate the index separately. This is easy to achieve with seq_along, which is an optimized version of 1:length(qwerty). Then you can just paste the index and the results together. In your case, you'll obviously want to do the pasting before you unlist.
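Spelled out, a minimal sketch of that idea, assuming (as in the question) the sentences live in x$qwerty:
x$qwerty <- as.character(x$qwerty)
parts <- strsplit(x$qwerty, ".", fixed = TRUE)   # one vector of sentences per row
# repeat each row index once per sentence in that row, then flatten
out <- data.frame(id = rep(seq_along(parts), lengths(parts)),
                  Col1 = trimws(unlist(parts)),
                  stringsAsFactors = FALSE)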
If your dataset is as shown above, maybe this helps. You can read from the file with readLines("file.txt"):
lines <- readLines(n=7)
1
An artist impression of a star system is responsible for a nova. The team from university of VYU focus on a class of compounds. The young people was seen enjoying the football match.
2
Scientists have made a breakthrough and solved a decades-old mystery by revealing how a powerful. Heart attacks more due to nurture than nature. SA footballer Senzo Meyiwa shot dead to save girlfriend
lines1 <- lines[lines != '']                     # drop empty lines
# split after each "." (lookbehind), so the ids stay on their own elements
lines2 <- unlist(strsplit(lines1, '(?<=\\.)(\\b| )', perl = TRUE))
indx <- grepl("^\\d+$", lines2)                  # TRUE where an element is an id
# group each id with the sentences that follow it, pasting the id onto each
res <- unlist(lapply(split(lines2, cumsum(indx)),
                     function(x) paste(x[1], x[-1])), use.names = FALSE)
res
#[1] "1 An artist impression of a star system is responsible for a nova."
#[2] "1 The team from university of VYU focus on a class of compounds."
#[3] "1 The young people was seen enjoying the football match."
#[4] "2 Scientists have made a breakthrough and solved a decades-old mystery by revealing how a powerful."
#[5] "2 Heart attacks more due to nurture than nature."
#[6] "2 SA footballer Senzo Meyiwa shot dead to save girlfriend"
If you want it as a 2-column data.frame:
# repeat each id once per sentence that follows it
dat <- data.frame(id = rep(lines2[indx], diff(c(which(indx), length(indx) + 1)) - 1),
                  Col1 = lines2[!indx], stringsAsFactors = FALSE)
head(dat,2)
# id Col1
#1 1 An artist impression of a star system is responsible for a nova.
#2 1 The team from university of VYU focus on a class of compounds.
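And if, as in the original attempt, the result should end up in a CSV (path taken from the question):
write.csv(dat, "C:\\Users\\Suhas\\Desktop\\out23.csv", row.names = FALSE)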
