How do I clean twitter data in R? - r

I extracted tweets from Twitter using the twitteR package and saved them into a text file.
I have carried out the following on the corpus:
xx<-tm_map(xx,removeNumbers, lazy=TRUE, 'mc.cores=1')
xx<-tm_map(xx,stripWhitespace, lazy=TRUE, 'mc.cores=1')
xx<-tm_map(xx,removePunctuation, lazy=TRUE, 'mc.cores=1')
xx<-tm_map(xx,strip_retweets, lazy=TRUE, 'mc.cores=1')
xx<-tm_map(xx,removeWords,stopwords("english"), lazy=TRUE, 'mc.cores=1')
(using mc.cores=1 and lazy=TRUE, as R on a Mac otherwise runs into errors)
tdm<-TermDocumentMatrix(xx)
But this term document matrix has a lot of strange symbols, meaningless words and the like.
If a tweet is
RT @Foxtel: One man stands between us and annihilation: #IanZiering.
Sharknado 3: OH HELL NO! - July 23 on Foxtel #SyfyAU
After cleaning the tweet, I want only proper, complete English words to be left, i.e. a sentence or phrase free of everything else (user names, shortened words, URLs).
example:
One man stands between us and annihilation oh hell no on
(Note: The transformation commands in the tm package can only remove stop words, punctuation, and whitespace, and convert text to lowercase.)

Using gsub and the stringr package, I have figured out part of the solution for removing retweets, references to screen names, hashtags, spaces, numbers, punctuation, and URLs.
clean_tweet = gsub("&amp", "", unclean_tweet)
clean_tweet = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", clean_tweet)
clean_tweet = gsub("@\\w+", "", clean_tweet)
clean_tweet = gsub("[[:punct:]]", "", clean_tweet)
clean_tweet = gsub("[[:digit:]]", "", clean_tweet)
clean_tweet = gsub("http\\w+", "", clean_tweet)
clean_tweet = gsub("[ \t]{2,}", " ", clean_tweet)
clean_tweet = gsub("^\\s+|\\s+$", "", clean_tweet)
ref: ( Hicks , 2014)
After the above, I did the following.
# Collapse runs of spaces into a single space
clean_tweet <- str_replace_all(clean_tweet, " {2,}", " ")
# Get rid of URLs
clean_tweet <- str_replace_all(clean_tweet, "http://t.co/[a-z,A-Z,0-9]*{8}","")
# Take out retweet header, there is only one
clean_tweet <- str_replace(clean_tweet,"RT @[a-z,A-Z]*: ","")
# Get rid of hashtags
clean_tweet <- str_replace_all(clean_tweet,"#[a-z,A-Z]*","")
# Get rid of references to other screennames
clean_tweet <- str_replace_all(clean_tweet,"@[a-z,A-Z]*","")
ref: (Stanton 2013)
Before doing any of the above I collapsed the whole string into a single long character using the below.
paste(mytweets, collapse=" ")
This cleaning process has worked quite well for me, unlike the tm_map transforms.
All that I am left with now is a set of proper words and a very few improper words.
Now I only have to figure out how to remove the words that are not proper English.
I will probably have to check my set of words against a dictionary of words.
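The dictionary check mentioned above could be sketched roughly as follows in base R. The function name keep_dictionary_words and the tiny dictionary vector are made up for illustration; a real word list would have to come from a dictionary file or package of your choice.

```r
# Hypothetical helper: keep only the words that appear in a dictionary.
# `dictionary` is assumed to be a character vector of lowercase words.
keep_dictionary_words <- function(text, dictionary) {
  words <- strsplit(tolower(text), "\\s+")[[1]]         # split on whitespace
  paste(words[words %in% dictionary], collapse = " ")  # drop unknown words
}

# Toy dictionary for illustration only
dictionary <- c("one", "man", "stands", "between", "us", "and",
                "annihilation", "oh", "hell", "no", "on")

keep_dictionary_words(
  "one man stands between us and annihilation sharknado oh hell no july on foxtel",
  dictionary
)
# [1] "one man stands between us and annihilation oh hell no on"
```

Matching with %in% is exact, so inflected forms would only survive if the dictionary lists them.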

library(tidyverse)
clean_tweets <- function(x) {
  x %>%
    # Remove URLs (each URL runs up to the next whitespace)
    str_remove_all(" ?(f|ht)tps?://\\S+") %>%
    # Remove mentions e.g. "@my_account"
    str_remove_all("@[[:alnum:]_]{4,}") %>%
    # Remove hashtags
    str_remove_all("#[[:alnum:]_]+") %>%
    # Replace "&amp;" character reference with "and"
    str_replace_all("&amp;", "and") %>%
    # Remove punctuation, using a standard character class
    str_remove_all("[[:punct:]]") %>%
    # Remove "RT: " from beginning of retweets
    str_remove_all("^RT:? ") %>%
    # Replace any newline characters with a space
    str_replace_all("\n", " ") %>%
    # Make everything lowercase
    str_to_lower() %>%
    # Remove any leading or trailing whitespace around the text
    str_trim("both")
}
tweets %>% clean_tweets

To remove the URLs you could try the following:
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
# content_transformer() keeps the result a valid tm corpus (tm 0.6 and later)
xx <- tm_map(xx, content_transformer(removeURL))
You could define similar functions to transform the text further.

For me, this code did not work:
# Get rid of URLs
clean_tweet <- str_replace_all(clean_tweet, "http://t.co/[a-z,A-Z,0-9]*{8}","")
The error was:
Error in stri_replace_all_regex(string, pattern, fix_replacement(replacement), :
Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX)
The likely cause is the *{8} at the end of the pattern, which stacks one quantifier on another. So, instead, I used
clean_tweet4 <- str_replace_all(clean_tweet3, "https://t.co/[a-z,A-Z,0-9]*","")
clean_tweet5 <- str_replace_all(clean_tweet4, "http://t.co/[a-z,A-Z,0-9]*","")
to get rid of URLs.
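The two replacements can probably be folded into one pattern. The sketch below uses base R gsub rather than stringr, and the function name remove_tco_links is made up for illustration; it assumes the links are t.co short links made of letters and digits.

```r
# Hypothetical helper: strip both http:// and https:// t.co links in one pass.
remove_tco_links <- function(x) {
  # "s?" makes the "s" optional; "\\." matches a literal dot
  gsub("https?://t\\.co/[[:alnum:]]+", "", x)
}

remove_tco_links("Sharknado 3 https://t.co/AbC123xyz0")
# [1] "Sharknado 3 "
```

The space before the link is left behind, so a final trimming step is still needed.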

The code below does some basic cleaning.
Converting to lowercase
df <- tm_map(df, content_transformer(tolower))
Removing punctuation
df <- tm_map(df, removePunctuation)
Removing numbers
df <- tm_map(df, removeNumbers)
Removing common words
df <- tm_map(df, removeWords, stopwords('english'))
Removing URLs
removeURL <- function(x) gsub('http[[:alnum:]]*', '', x)

Related

Within a column, I'd like to gsub each row of string values and remove any value that matches a list of values I created

Context
I am working with a messy datafile right now. I have a list of comments that I'd like to sort out and grab the most common combination of phrases. An example phrase would be "Did not qualify because of X and Y" and "Did not qualify because of Y and X". I am trying to go through and remove Stop Words so I can match X and Y as a common phrase. I was able to easily do this for common single words, but phrases are a little difficult. Below is my code for context
Create Datafile
dat1 <- dat %>% filter(Action != Exclude)
Remove problem characters
dat1$Comments <- stri_trans_general(dat1$Comments, "latin-ascii")
dat1$Comments <- gsub(pattern='<[^<>]*>', replacement=" ", x=dat1$Comments)
dat1$Comments <- gsub(pattern='\n', replacement=" ", x=dat1$Comments)
dat1$Comments <- gsub(pattern="[[:punct:]]", replacement=" ", x=dat1$Comments)
Remove stop words (Where my problem is)
sw <- paste0("\\b(", paste0(stop_words$word, collapse="|"), ")\\b")
dat1$Comments <- lapply(dat1$Comments, function(x) (gsub(pattern=sw, replacement=" ", x)))
Remove extra spaces between words
dat1$Comments <- trimws(gsub("\\s+", " ", dat1$Comments))
dat1$Comments <- gsub("(^[[:space:]]*)|([[:space:]]*$)", "", dat1$Comments)
Sweet Data
top_phrases <- data.frame(text = dat1$Comments) %>%
unnest_tokens(bigram, text, 'ngrams', n = Length, to_lower = TRUE) %>%
count(bigram, sort = TRUE)
Issue
This is what pops up and is traced back to the gsub code
Error in gsub(pattern = sw, replacement = " ", x) : assertion 'tree->num_tags == num_tags' failed in executing regexp: file 'tre-compile.c', line 634
If anyone is curious, here is what is stored in "sw"
"\\b(a|a's|able|about|above|according|accordingly|across|actually|after|afterwards|again|against|ain't|all|allow|allows|almost|alone|along|already|also|although|always|am|among|amongst|an|and|another|any|anybody|anyhow|anyone|anything|anyway|anyways|anywhere|apart|appear|appreciate|appropriate|are|aren't|around|as|aside|ask|asking|associated|at|available|away|awfully|b|be|became|because|become|becomes|becoming|been|before|beforehand|behind|being|believe|below|beside|besides|best|better|between|beyond|both|brief|but|by|c|c'mon|c's|came|can|can't|cannot|cant|cause|causes|certain|certainly|changes|clearly|co|com|come|comes|concerning|consequently|consider|considering|contain|containing|contains|corresponding|could|couldn't|course|currently|d|definitely|described|despite|did|didn't|different|do|does|doesn't|doing|don't|done|down|downwards|during|e|each|edu|eg|eight|either|else|elsewhere|enough|entirely|especially|et|etc|even|ever|every|everybody|everyone|everything|everywhere|ex|exactly|example|except|f|far|few|fifth|first|five|followed|following|follows|for|former|formerly|forth|four|from|further|furthermore|g|get|gets|getting|given|gives|go|goes|going|gone|got|gotten|greetings|h|had|hadn't|happens|hardly|has|hasn't|have|haven't|having|he|he's|hello|help|hence|her|here|here's|hereafter|hereby|herein|hereupon|hers|herself|hi|him|himself|his|hither|hopefully|how|howbeit|however|i|i'd|i'll|i'm|i've|ie|if|ignored|immediate|in|inasmuch|inc|indeed|indicate|indicated|indicates|inner|insofar|instead|into|inward|is|isn't|it|it'd|it'll|it's|its|itself|j|just|k|keep|keeps|kept|know|knows|known|l|last|lately|later|latter|latterly|least|less|lest|let|let's|like|liked|likely|little|look|looking|looks|ltd|m|mainly|many|may|maybe|me|mean|meanwhile|merely|might|more|moreover|most|mostly|much|must|my|myself|n|name|namely|nd|near|nearly|necessary|need|needs|neither|never|nevertheless|new|next|nine|no|nobody|non|none|noone|nor|normally|not|nothing|novel|now|nowhere|o|obviously|of|off|ofte
n|oh|ok|okay|old|on|once|one|ones|only|onto|or|other|others|otherwise|ought|our|ours|ourselves|out|outside|over|overall|own|p|particular|particularly|per|perhaps|placed|please|plus|possible|presumably|probably|provides|q|que|quite|qv|r|rather|rd|re|really|reasonably|regarding|regardless|regards|relatively|respectively|right|s|said|same|saw|say|saying|says|second|secondly|see|seeing|seem|seemed|seeming|seems|seen|self|selves|sensible|sent|serious|seriously|seven|several|shall|she|should|shouldn't|since|six|so|some|somebody|somehow|someone|something|sometime|sometimes|somewhat|somewhere|soon|sorry|specified|specify|specifying|still|sub|such|sup|sure|t|t's|take|taken|tell|tends|th|than|thank|thanks|thanx|that|that's|thats|the|their|theirs|them|themselves|then|thence|there|there's|thereafter|thereby|therefore|therein|theres|thereupon|these|they|they'd|they'll|they're|they've|think|third|this|thorough|thoroughly|those|though|three|through|throughout|thru|thus|to|together|too|took|toward|towards|tried|tries|truly|try|trying|twice|two|u|un|under|unfortunately|unless|unlikely|until|unto|up|upon|us|use|used|useful|uses|using|usually|uucp|v|value|various|very|via|viz|vs|w|want|wants|was|wasn't|way|we|we'd|we'll|we're|we've|welcome|well|went|were|weren't|what|what's|whatever|when|whence|whenever|where|where's|whereafter|whereas|whereby|wherein|whereupon|wherever|whether|which|while|whither|who|who's|whoever|whole|whom|whose|why|will|willing|wish|with|within|without|won't|wonder|would|would|wouldn't|x|y|yes|yet|you|you'd|you'll|you're|you've|your|yours|yourself|yourselves|z|zero|i|me|my|myself|we|our|ours|ourselves|you|your|yours|yourself|yourselves|he|him|his|himself|she|her|hers|herself|it|its|itself|they|them|their|theirs|themselves|what|which|who|whom|this|that|these|those|am|is|are|was|were|be|been|being|have|has|had|having|do|does|did|doing|would|should|could|ought|i'm|you're|he's|she's|it's|we're|they're|i've|you've|we've|they've|i'd|you'd|he'd|she'd|we'd|they'd|i'll|you
'll|he'll|she'll|we'll|they'll|isn't|aren't|wasn't|weren't|hasn't|haven't|hadn't|doesn't|don't|didn't|won't|wouldn't|shan't|shouldn't|can't|cannot|couldn't|mustn't|let's|that's|who's|what's|here's|there's|when's|where's|why's|how's|a|an|the|and|but|if|or|because|as|until|while|of|at|by|for|with|about|against|between|into|through|during|before|after|above|below|to|from|up|down|in|out|on|off|over|under|again|further|then|once|here|there|when|where|why|how|all|any|both|each|few|more|most|other|some|such|no|nor|not|only|own|same|so|than|too|very|a|about|above|across|after|again|against|all|almost|alone|along|already|also|although|always|among|an|and|another|any|anybody|anyone|anything|anywhere|are|area|areas|around|as|ask|asked|asking|asks|at|away|back|backed|backing|backs|be|became|because|become|becomes|been|before|began|behind|being|beings|best|better|between|big|both|but|by|came|can|cannot|case|cases|certain|certainly|clear|clearly|come|could|did|differ|different|differently|do|does|done|down|down|downed|downing|downs|during|each|early|either|end|ended|ending|ends|enough|even|evenly|ever|every|everybody|everyone|everything|everywhere|face|faces|fact|facts|far|felt|few|find|finds|first|for|four|from|full|fully|further|furthered|furthering|furthers|gave|general|generally|get|gets|give|given|gives|go|going|good|goods|got|great|greater|greatest|group|grouped|grouping|groups|had|has|have|having|he|her|here|herself|high|high|high|higher|highest|him|himself|his|how|however|i|if|important|in|interest|interested|interesting|interests|into|is|it|its|itself|just|keep|keeps|kind|knew|know|known|knows|large|largely|last|later|latest|least|less|let|lets|like|likely|long|longer|longest|made|make|making|man|many|may|me|member|members|men|might|more|most|mostly|mr|mrs|much|must|my|myself|necessary|need|needed|needing|needs|never|new|new|newer|newest|next|no|nobody|non|noone|not|nothing|now|nowhere|number|numbers|of|off|often|old|older|oldest|on|once|one|only|open|opened|opening|open
s|or|order|ordered|ordering|orders|other|others|our|out|over|part|parted|parting|parts|per|perhaps|place|places|point|pointed|pointing|points|possible|present|presented|presenting|presents|problem|problems|put|puts|quite|rather|really|right|right|room|rooms|said|same|saw|say|says|second|seconds|see|seem|seemed|seeming|seems|sees|several|shall|she|should|show|showed|showing|shows|side|sides|since|small|smaller|smallest|some|somebody|someone|something|somewhere|state|states|still|still|such|sure|take|taken|than|that|the|their|them|then|there|therefore|these|they|thing|things|think|thinks|this|those|though|thought|thoughts|three|through|thus|to|today|together|too|took|toward|turn|turned|turning|turns|two|under|until|up|upon|us|use|used|uses|very|want|wanted|wanting|wants|was|way|ways|we|well|wells|went|were|what|when|where|whether|which|while|who|whole|whose|why|will|with|within|without|work|worked|working|works|would|year|years|yet|you|young|younger|youngest|your|yours)\\b"
Both TRE (the default regex engine used in base R regex functions) and PCRE (the regex engine used in base R regex functions with perl=TRUE) have fairly strict limits on pattern length.
In your case, stringr regex functions will work better, as they use the ICU regex engine, which supports much longer regex patterns.
So, you may replace
gsub(pattern=sw, replacement=" ", x)
with
stringr::str_replace_all(x, sw, " ")
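A different route, not the one suggested above, is to avoid the giant regex altogether: split each comment into words and drop the ones found in the stop-word vector. The sketch below is base R only. The name remove_stop_tokens and the short stop_vec are made up for illustration; in the question the vector would be stop_words$word, ideally passed through unique() since it contains many repeats.

```r
# Hypothetical helper: remove stop words token by token instead of by regex.
remove_stop_tokens <- function(comments, stop_vec) {
  tokens <- strsplit(tolower(comments), "\\s+")   # one character vector per comment
  vapply(tokens, function(tok) {
    paste(tok[nzchar(tok) & !tok %in% stop_vec], collapse = " ")
  }, character(1))
}

# Toy stop-word list for illustration only
stop_vec <- unique(c("did", "not", "because", "of", "and", "and"))

remove_stop_tokens(c("Did not qualify because of X and Y",
                     "Did not qualify because of Y and X"), stop_vec)
# [1] "qualify x y" "qualify y x"
```

Note that this lowercases the text and returns a character vector, not a list, so it can be assigned straight back to a data frame column.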

using Regex and/or removing duplicate

I am scraping a website and, as a result, I have half-cleaned output:
[3] "2♠2:2♠2: Texas:28,,845:25,46,5:4.4%:36♠36:55,32:9,23:698,53:8.68%"*
Above is one example; I am trying to remove the number that follows each ♠ symbol.
Desired output is:
[3] "2:2: Texas:28,,845:25,46,5:4.4%:36:55,32:9,23:698,53:8.68%"
Basically, I want to remove everything from the ♠ up to the next colon, including the ♠ itself.
I would greatly appreciate any help. I have tried the following code, but it did not work.
str_replace_all(dataSet, "♠*:", "", fixed = T)
gsub("*♠", "", data, fixed = T)
website <- read_html("https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population")
results <- website %>% html_nodes("table")
data_body <- results[1] %>% html_nodes("tbody")
rows <- data_body %>% html_nodes("tr")
clean_rows_text <- str_replace_all(rows_text,"[7000100000000000000]", "")
clean_rows_text <- str_replace_all(clean_rows_text, "\n\n", ":")
clean_rows_text <- str_replace_all(clean_rows_text, "\n", "")
From this point, I can handle the rest.
This should do it:
data <- "2♠2:2♠2: Texas:28,,845:25,46,5:4.4%:36♠36:55,32:9,23:698,53:8.68%*"
gsub("♠.+?(?=:)", "", data, perl=T)
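If only digits ever sit between the ♠ and the colon, as in the sample, a narrower pattern without a lookahead may be enough. This is a sketch under that assumption; the function name strip_spade_numbers is made up, and the symbol is written as the escape \u2660 so the code does not depend on the file encoding.

```r
# Hypothetical helper: remove the spade symbol and the digits right after it.
strip_spade_numbers <- function(x) {
  gsub("\u2660[0-9]+", "", x)
}

data <- "2\u26602:2\u26602: Texas:28,,845:36\u266036:55,32"
strip_spade_numbers(data)
# [1] "2:2: Texas:28,,845:36:55,32"
```

Unlike the lazy ".+?" version, this leaves the text untouched when something other than digits follows the symbol.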

Filtering text from numbers and stopwords in R(not for tdm)

I have text corpus.
mytextdata = read.csv(path to texts.csv)
Mystopwords=read.csv(path to mystopwords.txt)
How can I filter this text? I need to:
1) delete all numbers
2) remove the stop words
3) remove the brackets
I will not work with a dtm; I just need to clean this text data of numbers and stop words.
sample data:
112773-Tablet for cleaning the hydraulic system Jura (6 pcs.) 62715
Jura and the are stop words.
As output I expect:
Tablet for cleaning hydraulic system
Since there is one character string available in the question at the moment, I decided to create a sample data by myself. I hope this is something close to your actual data. As Nate suggested, using the tidytext package is one way to go. Here, I first removed numbers, punctuations, contents in the brackets, and the brackets themselves. Then, I split words in each string using unnest_tokens(). Then, I removed stop words. Since you have your own stop words, you may want to create your own dictionary. I simply added jura in the filter() part. Grouping the data by id, I combined the words in order to create character strings in summarise(). Note that I used jura instead of Jura. This is because unnest_tokens() converts capital letters to small letters.
mydata <- data.frame(id = 1:2,
text = c("112773-Tablet for cleaning the hydraulic system Jura (6 pcs.) 62715",
"1234567-Tablet for cleaning the mambojumbo system Jura (12 pcs.) 654321"),
stringsAsFactors = F)
library(dplyr)
library(tidytext)
data(stop_words)
mutate(mydata, text = gsub(x = text, pattern = "[0-9]+|[[:punct:]]|\\(.*\\)", replacement = "")) %>%
unnest_tokens(input = text, output = word) %>%
filter(!word %in% c(stop_words$word, "jura")) %>%
group_by(id) %>%
summarise(text = paste(word, collapse = " "))
# id text
# <int> <chr>
#1 1 tablet cleaning hydraulic system
#2 2 tablet cleaning mambojumbo system
Another way would be the following. In this case, I am not using unnest_tokens().
library(magrittr)
library(stringi)
library(tidytext)
data(stop_words)
gsub(x = mydata$text, pattern = "[0-9]+|[[:punct:]]|\\(.*\\)", replacement = "") %>%
stri_split_regex(str = ., pattern = " ", omit_empty = TRUE) %>%
lapply(function(x){
foo <- x[which(!x %in% c(stop_words$word, "Jura"))] %>%
paste(collapse = " ")
foo}) %>%
unlist
#[1] "Tablet cleaning hydraulic system" "Tablet cleaning mambojumbo system"
There are multiple ways of doing this. If you want to rely on base R only, you can adapt @jazurro's answer a bit and use gsub() to find and replace the text patterns you want to delete.
I'll do this by using two regular expressions: the first one matches the content of the brackets and numeric values, whereas the second one will remove the stop words. The second regex will have to be constructed based on the stop words you want to remove. If we put it all in a function, you can easily apply it to all your strings using sapply:
mytextdata <- read.csv("123.csv", header=FALSE, stringsAsFactors=FALSE)
custom_filter <- function(string, stopwords=c()){
string <- gsub("[-0-9]+|\\(.*\\) ", "", string)
# Create something like: "\\b( the|Jura)\\b"
new_regex <- paste0("\\b( ", paste0(stopwords, collapse="|"), ")\\b")
gsub(new_regex, "", string)
}
stopwords <- c("the", "Jura")
# mytextdata[[1]] is the first column as a character vector; gsub() is vectorised over it
custom_filter(mytextdata[[1]], stopwords)
# [1] "Tablet for cleaning hydraulic system "
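The result above still carries stray spaces. A variant that also collapses and trims whitespace could look like the sketch below. The function name strip_codes_and_stopwords is made up for illustration, and it assumes the stop words contain no regex metacharacters.

```r
# Hypothetical helper: drop numbers, bracketed text and stop words, then tidy spaces.
strip_codes_and_stopwords <- function(x, stopwords) {
  x <- gsub("[-0-9]+|\\(.*\\)", "", x)                    # digits, hyphens, "(...)"
  sw <- paste0("\\b(", paste(stopwords, collapse = "|"), ")\\b")
  x <- gsub(sw, "", x)                                    # whole-word stop words
  trimws(gsub("\\s+", " ", x))                            # collapse and trim spaces
}

texts <- c("112773-Tablet for cleaning the hydraulic system Jura (6 pcs.) 62715",
           "1234567-Tablet for cleaning the mambojumbo system Jura (12 pcs.) 654321")
strip_codes_and_stopwords(texts, c("the", "Jura"))
# [1] "Tablet for cleaning hydraulic system"  "Tablet for cleaning mambojumbo system"
```

Because gsub() is vectorised, no sapply() call is needed to process the whole column.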

I'm trying to create a dictionary of brands and then clean the text of a transaction to extract only the brand name

I'm working with gsub to erase every word after a brand in the dictionary, but how can I erase the words before it too?
Hi, I'm trying to clean transactions to see clearly which brands the clients use. This is an example using gsub that erases every word after the brand "cabify":
tabla1_texto <- "exppcabify u.s.2313; 1212; 534"
tabla1_texto <- gsub("cabify", "cabify-", tabla1_texto)
tabla1_texto <- gsub(";", " ;",tabla1_texto)
tabla1_texto <- gsub("-\\S* ","", tabla1_texto)
This erases every character up to the ";". How can I delete the "expp" too?
Does anyone also know how I can create a dictionary of brands automatically?
Thanks
To delete the prior word, you can use:
gsub("\\w+(?=cabify)", "", tabla1_texto, perl = TRUE)
To delete everything before, you can use:
gsub(".*(?=cabify)", "", tabla1_texto, perl = TRUE)
A starting point for a "dictionary" could be:
brands <- c("cabify", "thundersausage")
for (brand in brands) {
  tabla1_texto <- gsub(brand, paste0(brand, "-"), tabla1_texto)
  tabla1_texto <- gsub(";", " ;",tabla1_texto)
  tabla1_texto <- gsub("-\\S* ","", tabla1_texto)
  tabla1_texto <- gsub(paste0("\\w+(?=", brand, ")"), "", tabla1_texto, perl = TRUE)
}
tabla1_texto # view the result
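If the goal is only the brand name, it may be simpler to extract the match than to delete everything around it. The sketch below is one way to do that in base R. The function name extract_brand is made up for illustration, and it assumes the brand names are plain letters with no regex metacharacters and that the input has no missing values.

```r
# Hypothetical helper: return the first known brand found in each string,
# or NA when none of the brands occurs.
extract_brand <- function(text, brands) {
  pattern <- paste(brands, collapse = "|")   # e.g. "cabify|thundersausage"
  m <- regexpr(pattern, text)                # start of first match, -1 if none
  out <- rep(NA_character_, length(text))
  out[m > 0] <- regmatches(text, m)          # fill in only the matched strings
  out
}

brands <- c("cabify", "thundersausage")
extract_brand(c("exppcabify u.s.2313; 1212; 534", "unknown shop 99"), brands)
# [1] "cabify" NA
```

The brands vector still has to be maintained by hand here; building it automatically is a separate problem.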

String: extract wanted character instead of removing unwanted

I was wondering if in R there is a function like KeepChar("abcde....xyz", some_text) that you feed with all the characters you want to keep, and which returns the string with only those characters left in it. Here the function would keep only the lowercase letters of the alphabet. I would like something that looks like this:
some_text <- "Hel-_l0o W#oRr^ld"
some_text <- KeepChar("abcdefghijklmnopqrstuvwxyz ", some_text)
some_text
> "hello world"
I feel that the removal methods I am currently using, gsub("#\\w+", "", some_text), tm_map(some_text, stripWhitespace) or str_replace_all(some_text,"[^[:graph:]]", " "), take a lot of time and lines of code, with a constant risk of forgetting to remove a specific character, especially when you already know exactly what you want to keep.
I ask this question because I am coding a platform to run sentiment analysis on texts from various sources such as Twitter, and I want to make sure I do not forget to remove any unwanted character.
To handle a pattern without using regex, I would try this:
string <- "Hel-_l0o W#oRr^ld"
pattern <- "abcdefghijklmnopqrstuvwxyz"
KeepChar = function(pattern, string){
  splitted_string <- unlist(strsplit(string, ""))
  splitted_pattern <- unlist(strsplit(pattern, ""))
  ids_string <- splitted_string %in% splitted_pattern
  return(paste(splitted_string[ids_string], sep = "", collapse = ""))
}
some_text <- KeepChar(pattern = pattern, string = string)
You can try this:
some_text <- "Hel-_l0o W#oRr^ld"
gsub("[^[:alpha:] ]", "", some_text) # keeps all letters and spaces
gsub("[^[:lower:] ]", "", some_text) # keeps only lowercase letters and spaces
gsub("[^[:upper:] ]", "", some_text) # keeps only uppercase letters and spaces
You can also look at the page https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html to see the character classes available in R.
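To get closer to the lowercase output the question asks for, the text can be lowercased first and then filtered with a negated character class. The sketch below spells out the allowed letters from R's built-in letters constant instead of using an a-z range, since ranges can behave differently across locales. The function name keep_lower_and_space is made up for illustration.

```r
# Hypothetical helper: lowercase the text, then keep only a-z and spaces.
keep_lower_and_space <- function(x) {
  allowed <- paste(letters, collapse = "")          # "abcdefghijklmnopqrstuvwxyz"
  gsub(paste0("[^", allowed, " ]"), "", tolower(x))
}

keep_lower_and_space("Hel-_l0o W#oRr^ld")
# [1] "hello worrld"
```

The sample string contains both "R" and "r", so the result is "hello worrld" and not "hello world": every letter is kept, only its case changes.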

Resources