Text argument from a dataframe - r

I am trying to test this nice solution using a data frame as the input for your_sentences.
library(koRpus)  # for treetag()

remove_words <- function(sentence, badword = "blame"){
  tagged.text <- treetag(file = sentence, format = "obj", treetagger = "manual", lang = "en",
                         TT.options = list(path = "C:\\Treetagger", preset = "en"))
  # Check for bad words AND verb:
  cond1 <- (tagged.text@TT.res$token == badword)
  cond2 <- (substring(tagged.text@TT.res$tag, 1, 1) == "V")
  redflag <- which(cond1 & cond2)
  # If no such case, return sentence as is. If so, then remove that word:
  if(length(redflag) == 0) return(sentence)
  else{
    splitsent <- strsplit(sentence, " ")[[1]]
    splitsent <- splitsent[-redflag]
    return(paste0(splitsent, collapse = " "))
  }
}
lapply(your_sentences, remove_words)
The data frame has 1 column and 351 rows. In the lapply call, in place of your_sentences, I pass my data frame and the column name, and I receive this error (the same error occurs when I pass the data frame without specifying the column):
> dfnew <- lapply(df$text, remove_words)
Error in writeLines(text, con = conn.tempfile) : invalid 'text' argument
What can I do to solve the error?
Example data:
df = data.frame(text = c('I blame myself for what happened',
                         'For what happened the blame is yours',
                         'I will blame you if my friend removes'))

What a bummer, I had hoped it was only a typo :-). But I have a second guess: you have probably stepped into the difficulties caused by stringsAsFactors = TRUE. This might have caused values of type factor, instead of character, to be passed to your function. Try the following:
df = data.frame(text = c('I blame myself for what happened',
                         'For what happened the blame is yours',
                         'I will blame you if my friend removes'),
                stringsAsFactors = FALSE)

Your strings seem to be saved as factors, and therefore remove_words is supplied with factor values instead of strings. Passing stringsAsFactors = FALSE as an argument will solve the issue:
df <- data.frame(text = c('I blame myself for what happened',
                          'For what happened the blame is yours',
                          'I will blame you if my friend removes'),
                 stringsAsFactors = FALSE)
Or, if you have already defined your df with factors, you can convert the columns in place using df[] <- lapply(df, as.character)
lapply(df$text, remove_words)
[[1]]
[1] "I myself for what happened"
[[2]]
[1] "For what happened the blame is yours"
[[3]]
[1] "I will you if my friend removes"

Related

combining words in tm R is not achieving desired result

I am trying to combine a few words so that they count as one.
In this example I want val and valuati to be counted as valuation.
The code I have been using to try and do this is below:
# load in package
library(tm)

replaceWords <- function(x, from, keep){
  regex_pat <- paste(from, collapse = "|")
  gsub(regex_pat, keep, x)
}
oldwords <- c("val", "valuati")
newword <- c("valuation")
TextDoc2 <- tm_map(TextDoc, replaceWords, from=oldwords, keep=newword)
However, this does not work as expected. Any time there is val in a word, it is replaced with valuation. For example, equivalent becomes equivaluation. How do I get around this and achieve my desired result?
Try this function -
replaceWords <- function(x, from, keep){
  regex_pat <- sprintf('\\b(%s)\\b', paste(from, collapse = '|'))
  gsub(regex_pat, keep, x)
}
val matches within equivalent. Adding word boundaries stops that from happening:
grepl('val', 'equivalent')
#[1] TRUE
grepl('\\bval\\b', 'equivalent')
#[1] FALSE
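A quick check of the fixed function on a plain character vector (hypothetical sentences; tm_map applies the same function to every document of the corpus):
replaceWords(c("the val was high", "an equivalent offer", "the valuati method"),
             from = c("val", "valuati"), keep = "valuation")
#[1] "the valuation was high" "an equivalent offer"    "the valuation method"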

How do I find differing words in two strings, sentence-wise?

I am comparing two similar texts. x1 is the model text and x2 is the text with mistakes (e.g. spelling, new characters, etc.). I am trying to remove words found in both texts. Since my actual text is not in English, I cannot use a dictionary.
What I have tried is to step through each character of x1 and, if it is the same character in x2, delete it from x2 and move to the next character of x1.
Code I've been working on:
x1 <- "This is a test. Weather is fine. What do I do? I am clueless this coding. Let’s do this as soon as possible."
x2 <- "This text is a test. This weather is fine. What id I do? I am clueless thius coding. Ley’s do ythis as soon as possiblke."
library(tidyverse)
x1 <- str_split(x1, "(?<=\\.)\\s")
x1 <- lapply(x1,tolower)
x2 <- str_split(x2, "(?<=\\.)\\s")
x2 <- lapply(x2,tolower)
delete_a_from_b <- function(a, b) {
  a_as_list <- str_remove_all(a, "word") %>%
    str_split(boundary("character")) %>% unlist
  b_n <- nchar(b)
  b_as_list <- str_remove_all(b, "word") %>%
    str_split(boundary("character")) %>% unlist
  previous_j <- 1
  for (i in 1:length(a_as_list)) {
    if(previous_j > length(b_as_list))
      break
    for (j in previous_j:length(b_as_list)){
      if(a_as_list[[i]] == b_as_list[[j]]){
        b_as_list[[j]] <- ""
        previous_j <- j + 1
        break
      }
    }
  }
  print(paste0(b_as_list, collapse = ""))
  paste0(b_as_list, collapse = "")
}
x3 <- delete_a_from_b(x1,x2)
x3 <- strsplit(x3,"\\s")
Output:
x3
[[1]]
[1] "text" "this" "i" "i" "d?am" "clueless" "thius" "coing.\","
[9] "\"ley’s" "dythsssoon" "as" "possibk"
What I want as a result is: 'text' 'this' 'id' 'thius' 'ley’s' 'ythis' 'possiblke'
I take it you want to compare the two strings x1 and x2 sentence by sentence - this is not really clear in the question, and the previous solutions do not take it into account.
Try this:
First split, both strings into sentences:
x1_sentences <- unlist(strsplit(tolower(x1), split = "[.?!] "))
x2_sentences <- unlist(strsplit(tolower(x2), split = "[.?!] "))
length(x1_sentences) == length(x2_sentences) # Make sure same number of resulting sentences
Then, for each sentence, split the two vectors again and show difference in words:
for (i in 1:length(x1_sentences)) {
  x1_vector <- unlist(strsplit(x1_sentences[i], split = "[ ]"))
  x2_vector <- unlist(strsplit(x2_sentences[i], split = "[ ]"))
  print(setdiff(x2_vector, x1_vector)) # The order here is important!
}
Gives (which you can easily turn into a new vector):
[1] "text"
[1] "this"
[1] "id"
[1] "thius"
[1] "ley’s" "ythis" "possiblke."
I think I did it - is this what you need?
x1 <- "This is a test. Weather is fine. What do I do? I am clueless this coding. Let’s do this as soon as possible."
x2 <- "This text is a test. This weather is fine. What id I do? I am clueless thius coding. Ley’s do ythis as soon as possiblke."
# lowercase first, then split into words:
x1_w <- strsplit(tolower(x1), " ")[[1]]
x2_w <- strsplit(tolower(x2), " ")[[1]]
`%notin%` <- Negate(`%in%`)
x2_w[which(x2_w %notin% x1_w)]
# same as:
setdiff(x2_w,x1_w)
# out:
#> x2_w[which(x2_w %notin% x1_w)]
#[1] "text" "id" "thius" "ley’s" "ythis" "possiblke."

R code hangs in between with large data?

I am dealing with a database with around 5 lakh (500,000+) records. I want to count the words in the data.
This is my code:
library(tm)
library(RPostgreSQL)

drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, user = "postgres", password = "root", dbname = "pharma",
                 host = "localhost", port = 5432)
query <- "select data->'PubmedArticleSet'->'PubmedArticle'->'MedlineCitation'->'Article'->'Journal'->>'Title' from searchresult where id BETWEEN 1 AND (select max(id) from searchresult)"
der <- dbGetQuery(con, query)
der <- VectorSource(der)
der <- Corpus(der)
der <- tolower(der)
wordlist <- strsplit(der, "\\W+", perl = TRUE)
wordvector <- unlist(wordlist)
freqlist <- table(wordvector)
sortedfreqlist <- sort(freqlist, decreasing = TRUE)
sortedtable <- paste(names(sortedfreqlist), sortedfreqlist, sep = "\t")
cat("Word\tFrequency", sortedtable, file = choose.files(), sep = "\n")
But the code hangs and stops at the line wordlist <- strsplit(der, "\\W+", perl = TRUE). Can someone please help me with this?
Is this because of the huge data?
Try replacing
wordlist<-strsplit(der, "\\W+", perl=TRUE)
with
word_vector = scan(text = as.character(der[1]),
                   what = "character", quote = "", quiet = TRUE)
sorted_word_table = sort(table(word_vector))
There are a few funny things going on in your code (i.e. you make a Corpus and then call tolower() on the whole thing, which turns it back into a character vector), but this should get you going.
The first bit splits your text up into words. You might also want to remove punctuation before you do this, using der = removePunctuation(der[1]). The second bit makes a table of the word frequencies.
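Here is a self-contained sketch of that approach on toy data (the titles below are hypothetical stand-ins for the query result):
library(tm)

titles <- c("Journal of Pharma Research", "Pharma Journal")
der <- removePunctuation(titles)
word_vector <- scan(text = tolower(as.character(der)),
                    what = "character", quote = "", quiet = TRUE)
sort(table(word_vector), decreasing = TRUE)
# "journal" and "pharma" appear twice, "of" and "research" once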
If the second bit is slow you could use the data.table package and the following function based on this answer instead of calling table()
t_dt <- function(x, key = TRUE){
  # creates a 1-d frequency table for x
  library(data.table)
  dt <- data.table(x)
  if(key) setkey(dt, x)
  tab <- dt[, list(freq = .N), by = x]
  out <- tab$freq
  names(out) <- tab$x
  out
}
sorted_word_table = sort(t_dt(word_vector))

Regular expressions error message - "Out of memory"

I've been playing around with R's sentiment analysis capabilities and keep running into an error that is raised when running a gsub function. The positive and negative word lists were taken from here.
After some Google searches, I found one mention of this error on the R help list but nothing else. Has anyone run into this problem? What is going on? Is there a workaround?
I've run similar code (using the gsub and stringr packages) when working with strings in the past, and this is the first time I've ever had this type of error come up. Furthermore, I tried to reproduce this error by writing a similar script on a different set of strings, and that worked fine.
Here is the error message:
> pos_match <- str_c(vpos, collapse = "|")
> neg_match <- str_c(vneg, collapse = "|")
> dat$positive <- as.numeric(str_detect(dat$Comment, pos_match))
> dat$negative <- as.numeric(str_detect(dat$Comment, neg_match))
Error: invalid regular expression, reason 'Out of memory'
Here's the whole 'process.'
## SET WORKING DIRECTORY AND IMPORT PACKAGES:
setwd("~/Desktop/R_Tricks")
require(tm); require(stringr); require(lubridate); library(RTextTools)
# IMPORT DATA:
d1 <- read.csv("Video_Comments.csv", stringsAsFactors=FALSE, sep=",", fileEncoding="ISO_8859-2")
pos <- read.csv("positive-words.csv", stringsAsFactors=FALSE, header=TRUE, fileEncoding="ISO_8859-2")
neg <- read.csv("negative-words.csv", stringsAsFactors=FALSE, header=TRUE, fileEncoding="ISO_8859-2")
vpos = as.vector(pos[,1]); vneg = as.vector(neg[,1])
head(vpos); head(vneg)
colnames(d1); nrow(d1); ncol(d1)
str(d1); head(d1)
table(d1$Likes); table(d1$Replies)
nrow(vpos); nrow(vneg)
length(vpos); length(vneg)
is.atomic(vpos); is.atomic(vneg)
# SELECT DATA:
dat = data.frame(Comment=c(d1$Comment))
head(dat)
# CLEAN DATA - COMMENTS:
dat$Comment = gsub('[[:punct:]]', '', dat$Comment)
dat$Comment = gsub('[[:cntrl:]]', '', dat$Comment)
dat$Comment = gsub('\\d+', '', dat$Comment)
dat$Comment = tolower(dat$Comment)
head(dat)
# CLEAN DATA - CLASSIFICATIONS:
vpos = gsub('[[:punct:]]', '', vpos); vneg = gsub('[[:punct:]]', '', vneg)
vpos = gsub('[[:cntrl:]]', '', vpos); vneg = gsub('[[:cntrl:]]', '', vneg)
vpos = gsub('\\d+', '', vpos); vneg = gsub('\\d+', '', vneg)
vpos = tolower(vpos); vneg = tolower(vneg)
head(vpos); head(vneg)
# MATCH WORDS WITH FACEBOOK COMMENTS:
pos_match <- str_c(vpos, collapse = "|")
neg_match <- str_c(vneg, collapse = "|")
dat$positive <- as.numeric(str_detect(dat$Comment, pos_match))
dat$negative <- as.numeric(str_detect(dat$Comment, neg_match))
EDIT:
Another error message I've received is the following:
> dat$negative <- as.numeric(str_detect(dat$Comment, neg_match))
Error: invalid regular expression 'faced|faces|abnormal|abolish|abominable|abominably|abominate|abomination|abort|aborted|
EDIT 2:
Data for reproducing error:
dat = c("Hey guys I am Aliza Lomez...18 y.o. I need your likes please like my page and find love quotes, beauty tips and much more.Please like my page you will never regret thank u all\u0083 <3 <3 <3...",
"Alexandra Saturn", "And that's what makes a Subaru a Subaru", "Missouri in a battleground....; meanwhile in southern California....", "What the Frisbee", "very cool !!!!", "Get a life",
"Try that with my GT!!!", "Did he make any money?", "Wo! WO! BSMITH THROWING DISCS WITH SUBARUS?!?! THIS IS SO AWESOME! SHOULD OF USED AN STI THO")
I don't know the entire solution but I can get you started. I made this community wiki so, hopefully, someone can fill in the blanks...
For the invalid regex, to create an OR you need to enclose everything in parentheses. For example, if you wanted to match the words "a", "an", or "the", you would use the regex string (a|an|the). If I have a list of words I'd like to match with an OR in regex, here's what I usually use:
mywords <- c("a", "an", "the")
mystring <- paste0("(", paste(mywords, collapse="|"), ")")
> mystring
[1] "(a|an|the)"
That should rid you of the invalid regex error, as your string doesn't begin with an open parenthesis and ends with a pipe instead of a close parenthesis.
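Applied to your setup, a minimal sketch (the word list and comments below are hypothetical stand-ins for the CSV data):
library(stringr)

vpos <- c("good", "great", "awesome")
comments <- c("this is great", "get a life")
pos_match <- paste0("(", paste(vpos, collapse = "|"), ")")
as.numeric(str_detect(comments, pos_match))
#[1] 1 0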

Deriving a variable from a column name passed to a function

I've gotten hold of some really messy data and I wrote a function to do some conversions (string to numeric), and I would love to improve it. Basically the function takes a vector of messy character data and converts the data to numeric.
for example:
## say you had this
library(stringr)  # for str_trim()

df1 <- data.frame(V1 = c(" $25.25", "4,828", " $7,253"),
                  V2 = c("THIS is bad data", "725", "*error"))

numconv <- function(vec){
  vec <- str_trim(vec)
  vec <- gsub(",|\\$", "", vec)
  if( sum(!grepl("[0-9]", vec)) == 0){
    vec <- as.numeric(vec)
  }
  if( sum(!grepl("[0-9]", vec)) != 0){
    print("!!ERROR STRANGE CHARACTERS!!")
  }
  vec  # return the (possibly converted) vector
}
df1$V1recode <- numconv(df1$V1)
df1$V2recode <- numconv(df1$V2)
[1] "!!ERROR STRANGE CHARACTERS!!"
How can I assign the name of the original column within the function, so that I can paste it into the error message, so it instead reads:
!!ERROR STRANGE CHARACTER IN V2!!
I've tried calling names() and colnames() within the function, but this doesn't seem to work.
Thanks in advance,
C
The old deparse(substitute(.)) trick seems to work.
numconv <- function(vec){
  nam <- deparse(substitute(vec))
  vec <- gsub(" ", "", vec)
  vec <- gsub(",|\\$", "", vec)
  if( sum(!grepl("[0-9]", vec)) == 0){
    vec <- as.numeric(vec)
  }
  if( sum(!grepl("[0-9]", vec)) != 0){
    print(paste("!!ERROR STRANGE CHARACTERS!!", nam))
  }
  vec  # return the (possibly converted) vector
}
df1$V2recode <- numconv(df1$V2)
# [1] "!!ERROR STRANGE CHARACTERS!! df1$V2"
(I didn't load stringr since I thought a gsub call would be more efficient.)
I feel this is a somewhat hacky way to do it, but you could use substitute and then strsplit on the $ - though this assumes you always call a column using its name with $. Anyway, you can get the column name this way and paste it into an error message as you wish...
x <- strsplit(deparse(substitute(vec)), "$", fixed = TRUE)[[1]][2]
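For reference, a minimal sketch of what substitute() captures here (this assumes the column is always passed as df$col):
show_call <- function(vec) as.character(substitute(vec))
show_call(df1$V2)
#[1] "$"   "df1" "V2"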
The key is to wrap the recoding up into the function as well. That way you can keep track of which columns you're working on, and so get the column names to put in your warning message. The following function recodes whatever columns of a data frame are listed in the col_names argument (if left NULL, the function applies to all of them). The function returns the original data frame plus the recoded columns, with the string in flag appended to the column names.
require(stringr)

df1 <- data.frame(
  V1 = c(" $25.25", "4,828", " $7,253"),
  V2 = c("THIS is bad data", "725", "*error"))

numconv <- function(df, col_names = NULL, flag = "recode"){
  if(is.null(col_names)) {
    col_names <- colnames(df)
  }
  out <- lapply(1:length(col_names), function(i) {
    vec <- str_trim(df[, col_names[i]])
    vec <- gsub(",|\\$", "", vec)
    if( sum(!grepl("[0-9]", vec)) == 0){
      vec <- as.numeric(vec)
    }
    if( sum(!grepl("[0-9]", vec)) != 0){
      print(paste("!!ERROR STRANGE CHARACTERS in", col_names[i], "!!"))
    }
    vec
  })
  out <- data.frame(out, stringsAsFactors = FALSE)
  colnames(out) <- paste(col_names, flag, sep = "")
  cbind(df, out)
}
numconv(df1)
[1] "!!ERROR STRANGE CHARACTERS in V2 !!"
       V1               V2 V1recode         V2recode
1  $25.25 THIS is bad data    25.25 THIS is bad data
2   4,828              725  4828.00              725
3  $7,253           *error  7253.00           *error
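Usage with an explicit column subset (only V1 is recoded, so no warning is printed):
numconv(df1, col_names = "V1")
       V1               V2 V1recode
1  $25.25 THIS is bad data    25.25
2   4,828              725  4828.00
3  $7,253           *error  7253.00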
