How do I remove line breaks from a string in R?

I am extracting tweets using the Twitter API in R.
I have been saving my results to CSV using a write.csv2 command, which works, except that carriage returns in the tweet text are splitting a single tweet across multiple rows in the spreadsheet.
I've tried using str_replace_all, but it doesn't seem to work for me and I can't find anything as to why.
Here is my code:
library(twitteR)   # searchTwitter(), twListToDF()
library(stringr)   # str_replace_all()
searchTags <- c("Galwaybikeshare", "Corkbikeshare", "dublinbikes", "BelfastBikes", "SantanderCycles", "CitiBikeNYC", "obike", "Hubway", "bicing")
additionalParams <- c("-rt -http")
searchString <- paste((paste(searchTags[1:9], collapse = " OR ")), additionalParams, collapse = "")
tweets_list <- searchTwitter(searchString, n = 20, lang = "en", resultType = 'recent')
str_replace_all(tweets_list, "[\r\n]", "")  # this is the line that seems to have no effect
tweets.df <- twListToDF(tweets_list)
todayDate <- Sys.Date()
tweetArchive <- paste("BikeShareTweets ", todayDate, ".csv", sep = "")
write.csv2(tweets.df, file = tweetArchive)
The text below is an example of a tweet which is causing the issue.
"TransitNinja205: 0.01% of the budget for 5-borough #CitiBikeNYC,\nand 0.2% for #FairFares. #NYCmayor #NYCmayorsOffice #progressive"
Why isn't my str_replace_all removing the \n from the text?

stringr::str_replace_all works, you're just ignoring the result: stringr functions return a new string rather than modifying their input in place. To fix it, assign the result back:
tweets_list <- str_replace_all(tweets_list, "[\r\n]", "")

stringr::str_remove_all will also do this for you:
tweets_list <- str_remove_all(tweets_list, "[\r\n]")
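Note that searchTwitter returns a list of status objects rather than a character vector, so replacing on tweets_list coerces those objects to strings. A sketch of a more robust variant, assuming twListToDF puts the tweet text in a text column (as the twitteR package does), is to clean the data frame instead:
tweets.df <- twListToDF(tweets_list)
# strip carriage returns and newlines from the text column before writing
tweets.df$text <- str_replace_all(tweets.df$text, "[\r\n]", "")
write.csv2(tweets.df, file = tweetArchive)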

Related

R: output binary when terms from a list appear more than once in a string

Using R in Databricks.
I have the following sample list of possible text entries:
extract <- c("codeine", "tramadol", "fentanyl", "morphine")
I want to check if any of these appear more than once in a string (example below) and return a binary output in a new column.
Example <- "codeine with fentanyl oral"
The output for this example would be 1.
I have tried the following, with only partial success:
df$testvar1 <- +(str_count(df$medname, fixed(extract)) > 1)
also tried
df$testvar2 <- cSplit_e(df$medname, split.col = "String", sep = " ", type = "factor", mode = "binary", fixed = TRUE, fill = 0)
and also
df$testvar3 <- str_extract_all(df$medname, paste(extract, collapse = " "))
Combine your extract terms into a single alternation pattern with |:
+(stringr::str_count(Example, paste(extract, collapse = "|")) > 1)
# [1] 1
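Applied to a data-frame column, that becomes (a sketch, assuming df$medname holds the strings, as in the question):
pattern <- paste(extract, collapse = "|")  # "codeine|tramadol|fentanyl|morphine"
df$testvar1 <- +(stringr::str_count(df$medname, pattern) > 1)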
I tried the following and it worked for my code:
df$testvar <- sapply(df$medname, function(x) str_extract(x, paste(extract, collapse = "|")))

How to access a single item in an R data frame?

So I'm diving into yet another language (R), and need to be able to look at individual items in a dataframe(?). I've tried a number of ways to access this, but so far am confused by what R wants me to do to get this out. Current code:
empStatistics <- read.csv("C:/temp/empstats.csv", header = TRUE, row.names = NULL,
                          encoding = "UTF-8", sep = ",", dec = ".", quote = "\"",
                          comment.char = "")
attach(empStatistics)
library(svDialogs)
Search_Item <- dlgInput("Enter a Category", "")$res
if (!length(Search_Item)) {
  cat("You didn't pick anything!?")
} else {
  Category <- empStatistics[Search_Item]
}
Employee_Name <- dlgInput("Enter a Player", "")$res
if (!length(Employee_Name)) {
  cat("No Person Selected!\n")
} else {
  cat(empStatistics[Employee_Name, Search_Item])
}
and a sample of my csv file:
Name,Age,Salary,Department
Frank,25,40000,IT
Joe,24,40000,Sales
Mary,34,56000,HR
June,39,70000,CEO
Charles,60,120000,Janitor
From the languages I'm used to, I would have expected the brackets to work, but that obviously isn't the case here. I tried looking for other solutions, including separating each variable into its own brackets, figuring out how to use subset() (failed there, not sure it is applicable), finding the column and row indexes, and a few other things I'm not sure I can describe.
How can I enter values into variables and then use them to get individual pieces of data (e.g., enter "Frank" for the name and "Age" for the search item and get back 25, or "June" and "Department" to get back "CEO")?
If you would like to access it like that, you can do:
Search_Item <- "Salary"
Employee_Name <- "Frank"
empStatistics <- read.csv("empstats.csv", header = TRUE, row.names = 1)
empStatistics[Employee_Name, Search_Item]
# [1] 40000
R data frames don't have an index in the database sense; row.names = 1 above makes the Name column serve that role. If you keep Name as an ordinary column instead, you can match on it:
empStatistics <- read.csv("empstats.csv", header = TRUE)
empStatistics[match(Employee_Name, empStatistics$Name), Search_Item]
# [1] 40000
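As a side note, single-bracket indexing with one argument selects whole columns, which is why empStatistics[Search_Item] in the question returned a data frame rather than a single value; the two-argument [row, column] form drills down to one element. A quick sketch against the sample data, assuming row.names = 1 as above:
empStatistics["June", "Department"]  # "CEO"
empStatistics["Frank", "Age"]        # 25
empStatistics["Age"]                 # one-argument form: a one-column data frame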

R string, UTF-8 coding swedish character treatment

I have a problem getting the Swedish characters ä, ö, and å to display in a presentable way in R.
I get my data directly from an MS SQL database.
Here are the examples:
markets <- c("Caf\xe9 ","Restaurang kv\xe4ll ","Barnomsorg tillagningsk\xf6k ","Folkh\xf6gskola ")
Then I use gsub to remove the trailing space:
market <- gsub(" ", "", markets, fixed = TRUE)
I got this error:
Error in gsub(" ", "", market, fixed = TRUE) :
input string 3 is invalid UTF-8
then I use this command:
markets_new <- gsub(" ", "", markets)
and then I get strange Chinese characters in the strings:
"Caf攼㸹"
"Restauranglunch+kv攼㸴ll"
"Barnomsorgtillagningsk昼㸶k"
"Folkh昼㸶gskola"
I tried changing the default encoding setting of RStudio by following:
https://yihui.name/en/2018/11/biggest-regret-knitr/?fbclid=IwAR2E5Lp0zjS51fcdjgZ1tej0sg5EBxfG8sNitt-cUA2XEshnT3lNCHNQ3Do
It does not help. I also tried to use gsub() to substitute the characters, but that doesn't seem to work either.
One more thing, if I use
write.csv(markets,'submarket product view.csv',row.names = F)
then in my csv file I see the following:
"Caf<e9> "
"Restaurang kv<e4>ll "
"Barnomsorg tillagningsk<f6>k "
"Folkh<f6>gskola "
"Sm<f6>rg<e5>s/salladsrestaurang "
I think <e9> is é (e with an acute accent), <e4> is ä, <f6> is ö, and <e5> is å.
Any treatment suggestion?
Thanks to @Wiktor Stribiżew, this solution works best:
df$m <- gsub(" ", "", `Encoding<-`(as.character(df$m), "latin1"), fixed = TRUE)
try this:
# the \xe9, \xe4, \xf6 bytes are latin1, so mark the encoding accordingly
Encoding(markets) <- "latin1"
markets <- trimws(markets)
# [1] "Café" "Restaurang kväll" "Barnomsorg tillagningskök" "Folkhögskola"

Efficient way to remove all proper names from corpus

Working in R, I'm trying to find an efficient way to search through a file of texts and remove or replace all instances of proper names (e.g., Thomas). I assume something is available to do this but have been unable to locate it.
So, in this example the words "Susan" and "Bob" would be removed. This is a simplified example; in reality I'd want this to apply to hundreds of documents and therefore to a fairly large list of names.
texts <- as.data.frame(rbind(
  'This text stuff if quite interesting',
  'Where are all the names said Susan',
  'Bob wondered what happened to all the proper nouns'
))
names(texts)[1] <- "text"
Here's one approach based upon a data set of first names:
install.packages("gender")
library(gender)
install_genderdata_package()
sets <- data(package = "genderdata")$results[,"Item"]
data(list = sets, package = "genderdata")
stopwords <- unique(kantrowitz$name)
texts <- as.data.frame(rbind(
  'This text stuff if quite interesting',
  'Where are all the names said Susan',
  'Bob wondered what happened to all the proper nouns'
))
removeWords <- function(txt, words, n = 30000L) {
  # cumulative character count of the words, plus one per separating "|"
  l <- cumsum(nchar(words) + c(0, rep(1, length(words) - 1)))
  # bin the words into groups so each regex stays under ~n characters
  groups <- cut(l, breaks = seq(1, ceiling(tail(l, 1) / n) * n + 1, by = n))
  # one word-boundary alternation per group; (*UCP) makes \b Unicode-aware,
  # and the decreasing sort lets longer names match before their prefixes
  regexes <- sapply(split(words, groups), function(words)
    sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), collapse = "|")))
  for (regex in regexes) txt <- gsub(regex, "", txt, perl = TRUE, ignore.case = TRUE)
  return(txt)
}
removeWords(texts[,1], stopwords)
# [1] "This text stuff if quite interesting"
# [2] "Where are all the names said "
# [3] " wondered what happened to all the proper nouns"
It may need some tuning for your specific data set.
Another approach could be based upon part-of-speech tagging.
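For the part-of-speech route, here is a minimal sketch using the udpipe package (my choice of tagger; any tagger that labels proper nouns as PROPN would do):
library(udpipe)
# download and load a pretrained English model (a one-time download)
model_info <- udpipe_download_model(language = "english")
model <- udpipe_load_model(model_info$file_model)
# annotate the texts, then rebuild each document without PROPN tokens
anno <- as.data.frame(udpipe_annotate(model, x = texts[, 1]))
sapply(split(anno, anno$doc_id),
       function(d) paste(d$token[d$upos != "PROPN"], collapse = " "))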

Regular expressions error message - "Out of memory"

I've been playing around with R's sentiment analysis capabilities and keep running into an error that is raised when running a gsub function. The positive and negative word lists were taken from here.
After some Google searches, I found one mention of this error on the R help list but nothing else. Has anyone run into this problem? What is going on? Is there a workaround?
I've run similar code (using gsub and the stringr package) when working with strings in the past, and this is the first time I've had this type of error come up. Furthermore, I tried to reproduce the error by writing a similar script on a different set of strings, and that worked fine.
Here is the error message:
> pos_match <- str_c(vpos, collapse = "|")
> neg_match <- str_c(vneg, collapse = "|")
> dat$positive <- as.numeric(str_detect(dat$Comment, pos_match))
> dat$negative <- as.numeric(str_detect(dat$Comment, neg_match))
Error: invalid regular expression, reason 'Out of memory'
Here's the whole 'process.'
## SET WORKING DIRECTORY AND IMPORT PACKAGES:
setwd("~/Desktop/R_Tricks")
require(tm); require(stringr); require(lubridate); library(RTextTools)
# IMPORT DATA:
d1 <- read.csv("Video_Comments.csv", stringsAsFactors=FALSE, sep=",", fileEncoding="ISO_8859-2")
pos <- read.csv("positive-words.csv", stringsAsFactors=FALSE, header=TRUE, fileEncoding="ISO_8859-2")
neg <- read.csv("negative-words.csv", stringsAsFactors=FALSE, header=TRUE, fileEncoding="ISO_8859-2")
vpos = as.vector(pos[,1]); vneg = as.vector(neg[,1])
head(vpos); head(vneg)
colnames(d1); nrow(d1); ncol(d1)
str(d1); head(d1)
table(d1$Likes); table(d1$Replies)
nrow(vpos); nrow(vneg)
length(vpos); length(vneg)
is.atomic(vpos); is.atomic(vneg)
# SELECT DATA:
dat = data.frame(Comment=c(d1$Comment))
head(dat)
# CLEAN DATA - COMMENTS:
dat$Comment = gsub('[[:punct:]]', '', dat$Comment)
dat$Comment = gsub('[[:cntrl:]]', '', dat$Comment)
dat$Comment = gsub('\\d+', '', dat$Comment)
dat$Comment = tolower(dat$Comment)
head(dat)
# CLEAN DATA - CLASSIFICATIONS:
vpos = gsub('[[:punct:]]', '', vpos); vneg = gsub('[[:punct:]]', '', vneg)
vpos = gsub('[[:cntrl:]]', '', vpos); vneg = gsub('[[:cntrl:]]', '', vneg)
vpos = gsub('\\d+', '', vpos); vneg = gsub('\\d+', '', vneg)
vpos = tolower(vpos); vneg = tolower(vneg)
head(vpos); head(vneg)
# MATCH WORDS WITH FACEBOOK COMMENTS:
pos_match <- str_c(vpos, collapse = "|")
neg_match <- str_c(vneg, collapse = "|")
dat$positive <- as.numeric(str_detect(dat$Comment, pos_match))
dat$negative <- as.numeric(str_detect(dat$Comment, neg_match))
EDIT:
Another error message I've received is the following:
> dat$negative <- as.numeric(str_detect(dat$Comment, neg_match))
Error: invalid regular expression 'faced|faces|abnormal|abolish|abominable|abominably|abominate|abomination|abort|aborted|
EDIT 2:
Data for reproducing error:
dat = c("Hey guys I am Aliza Lomez...18 y.o. I need your likes please like my page and find love quotes, beauty tips and much more.Please like my page you will never regret thank u all\u0083 <3 <3 <3...",
"Alexandra Saturn", "And that's what makes a Subaru a Subaru", "Missouri in a battleground....; meanwhile in southern California....", "What the Frisbee", "very cool !!!!", "Get a life",
"Try that with my GT!!!", "Did he make any money?", "Wo! WO! BSMITH THROWING DISCS WITH SUBARUS?!?! THIS IS SO AWESOME! SHOULD OF USED AN STI THO")
I don't know the entire solution but I can get you started. I made this community wiki so, hopefully, someone can fill in the blanks...
For the invalid regex, to create an OR you need to enclose everything in parentheses. For example, if you wanted to match the words "a", "an", or "the", you would use the regex string (a|an|the). If I have a list of words I'd like to match with an OR in regex, here's what I usually use:
mywords <- c("a", "an", "the")
mystring <- paste0("(", paste(mywords, collapse="|"), ")")
> mystring
[1] "(a|an|the)"
That should rid you of the invalid regex error, as your string doesn't begin with an open parenthesis and ends with a pipe instead of a close parenthesis.
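That said, an alternation over thousands of words can still exhaust the regex engine's memory. A regex-free sketch, assuming the lists should match whole words only (the comments are already lower-cased and stripped of punctuation earlier in the script):
# split each comment into words, then flag comments containing any list word
tokens <- strsplit(dat$Comment, "[^a-z']+")
dat$positive <- as.numeric(vapply(tokens, function(w) any(w %in% vpos), logical(1)))
dat$negative <- as.numeric(vapply(tokens, function(w) any(w %in% vneg), logical(1)))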
