I have pairs of customer feedback data in a CSV, denoting whether the customer recommended the service they received (1 or 0), "rec", and an associated comment, "comment". I am trying to compare the customer feedback between those who recommended the service and those who did not.
I have used the tm package to simply read all the lines in a CSV with only comments and do some follow-on text-mining on all the comments, which worked:
file_loc <- "C:/Users/..(etc)...file.csv"
x <- read.csv(file_loc, header = TRUE)
require(tm)
fdbk <- Corpus(DataframeSource(x))
Now I am trying to compare the comments of customers who recommend the service with those who do not by including the "rec" column, but I have not been able to create a corpus from a single column of the CSV - I tried the following:
file_loc <- "C:/Users/..(etc)...file.csv"
x <- read.csv(file_loc, header = TRUE)
require(tm)
fdbk <- Corpus(DataframeSource(x$comment))
But I get an error saying
"Error in if (vectorized && (length <= 0))
stop("vectorized sources must have positive length") :
missing value where TRUE/FALSE needed"
I also tried binding the "rec" codes to the comments after creating a topic model, but certain comments end up getting filtered by the "topic" function, so the "rec" column is longer than the number of documents in the resulting topic model.
Is this something I can do simply with the tm package? I haven't worked with the qdap package at all, but is that something more appropriate here?
... as Ben mentioned:
vec <- as.character(x[, "comment"])  # i.e., the column that holds the comments
Corpus(VectorSource(vec))
perhaps some customer id as metadata would be nice...
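A minimal sketch of that idea (my own elaboration, assuming x has the "comment" and "rec" columns described in the question; VCorpus is used so per-document metadata can be set): build the corpus from the comment vector, optionally attach rec as metadata, and split on the rec column, which stays aligned with the documents.
library(tm)
x <- read.csv(file_loc, header = TRUE, stringsAsFactors = FALSE)
fdbk <- VCorpus(VectorSource(as.character(x$comment)))
# attach the recommendation flag to each document as local metadata
for (i in seq_along(fdbk)) {
  meta(fdbk[[i]], "rec") <- x$rec[i]
}
# documents keep the row order of x, so rec can be used directly to split
fdbk_rec   <- fdbk[x$rec == 1]
fdbk_norec <- fdbk[x$rec == 0]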
hth
How do you drop multiple terms from the sentimentr dictionary?
For example, the words "please" and "advise" are associated with positive sentiment, but I do not want those particular words to influence my analysis.
I've figured out a way with the following script to exclude 1 word but need to exclude many more:
mysentiment <- lexicon::hash_sentiment_jockers_rinker[x != "please"]
mytext <- c(
'Hello, We are looking to purchase this material for a part we will be making, but your site doesnt state that this is RoHS complaint. Is it possible that its just not listed as such online, but it actually is RoHS complaint? Please advise. '
)
sentiment_by(mytext, polarity_dt = mysentiment)
extract_sentiment_terms(mytext,polarity_dt = mysentiment)
You can subset the mysentiment data.table. Just create a vector of the words you don't want included and use it to subset.
mysentiment<- lexicon::hash_sentiment_jockers_rinker
words_to_exclude <- c("please", "advise")
mysentiment <- mysentiment[!x %in% words_to_exclude]
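For reference, lexicon::hash_sentiment_jockers_rinker is a two-column data.table (x = word, y = polarity score), which is why the subset on x works. A quick sanity check (my own addition) before re-running the analysis:
# confirm the excluded words are gone from the custom lexicon
mysentiment[x %in% words_to_exclude]   # should return an empty data.table
# then re-run with the trimmed lexicon
sentiment_by(mytext, polarity_dt = mysentiment)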
I'm currently working with wordnet in R (I'm using RStudio for Windows (64bit)) and created a data.frame containing synset_offset, ss_type and word from the data.x files (where x is noun, adj, etc) of the wordnet database.
A sample can be created like this:
wnet <- data.frame(
"synset_offset" = c(02370954,02371120,02371337),
"ss_type" = c("VERB","VERB","VERB"),
"word" = c("fill", "depute", "substitute")
)
My issue happens when using the wordnet package to get the list of synonyms that I'd like to add as an additional column.
library(wordnet)
wnet$synonyms <- synonyms(wnet$word,wnet$ss_type)
I receive the following error.
Error in .jnew(paste("com.nexagis.jawbone.filter", type, sep = "."), word, :
java.lang.NoSuchMethodError: <init>
If I apply the function with defined values, it works.
> synonyms("fill","VERB")
[1] "fill" "fill up" "fulfil" "fulfill" "make full" "meet" "occupy" "replete" "sate" "satiate" "satisfy"
[12] "take"
Any suggestions to solve my issue are welcome.
I can't install the wordnet package on my computer for some reason, but it looks like you're passing vector arguments to the synonyms function, which it doesn't accept; you should be able to solve it with apply.
syn_list <- apply(wnet, 1, function(row) {synonyms(row["word"], row["ss_type"])})
it will return the output of the synonyms function for each row of the wnet data.frame
it's not clear what you want to do with:
wnet$synonyms <- synonyms(wnet$word,wnet$ss_type)
as for each row you will get a vector of synonyms, which won't fit into the 3 rows of your data.frame.
maybe something like this will work for you:
wnet$synonyms <- sapply(syn_list,paste,collapse=", ")
EDIT - Here is a working solution to the problem above.
wnet$synset <- mapply(synonyms, as.character(wnet$word), as.character(wnet$ss_type))
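Note that mapply() will typically return a list here (each word can have a different number of synonyms), so wnet$synset becomes a list-column. If a flat character column is preferred, the same paste/collapse idea from above applies, e.g.:
# collapse each synonym vector into one comma-separated string
wnet$synset_flat <- sapply(wnet$synset, paste, collapse = ", ")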
I am working with unstructured text (Facebook) data and am pre-processing it (e.g., stripping punctuation, removing stop words, stemming). I need to retain the record (i.e., Facebook post) IDs while pre-processing. I have a solution that works on a subset of the data but fails with all the data (N = 127K posts). I have tried chunking the data, and that doesn't work either. I think it has something to do with my work-around of relying on row names. For example, it appears to work with the first ~15K posts, but when I keep subsetting, it fails. I realize my code is less than elegant, so I'm happy to learn better/completely different solutions - all I care about is keeping the IDs when I go to VCorpus and then back again. I'm new to the tm package and the readTabular function in particular. (Note: I ran tolower and removeWords before making the VCorpus, as I originally thought that was part of the issue.)
Working code is below:
Sample data
library(tm)
library(dplyr)   # for %>% and select()
fb = data.frame(RecordContent = c("I'm dating a celebrity! Skip to 2:02 if you, like me, don't care about the game.",
"Photo fails of this morning. Really Joe?",
"This piece has been almost two years in the making. Finally finished! I'm antsy for October to come around... >:)"),
FromRecordId = c(682245468452447, 737891849554475, 453178808037464),
stringsAsFactors = F)
Remove punctuation & make lower case
fb$RC = tolower(gsub("[[:punct:]]", "", fb$RecordContent))
fb$RC2 = removeWords(fb$RC, stopwords("english"))
Step 1: Create special reader function to retain record IDs
myReader = readTabular(mapping=list(content="RC2", id="FromRecordId"))
Step 2: Make my corpus. Read in the data using DataframeSource and the custom reader function where each FB post is a "document"
corpus.test = VCorpus(DataframeSource(fb), readerControl=list(reader=myReader))
Step 3: Clean and stem
corpus.test2 = corpus.test %>%
tm_map(removeNumbers) %>%
tm_map(stripWhitespace) %>%
tm_map(stemDocument, language = "english") %>%
as.VCorpus()
Step 4: Make the corpus back into a character vector. The row names are now the IDs
fb2 = data.frame(unlist(sapply(corpus.test2, `[`, "content")), stringsAsFactors = F)
Step 5: Make new ID variable for later merge, name vars, and prep for merge back onto original dataset
fb2$ID = row.names(fb2)
fb2$RC.ID = gsub(".content", "", fb2$ID)
colnames(fb2)[1] = "RC.stem"
fb3 = select(fb2, RC.ID, RC.stem)
row.names(fb3) = NULL
I think the IDs are being stored and retained by default by the tm package. You can fetch them all (in a vectorized manner) with
meta(corpus.test, "id")
$`682245468452447`
[1] "682245468452447"
$`737891849554475`
[1] "737891849554475"
$`453178808037464`
[1] "453178808037464"
I'd recommend reading the documentation of the tm::meta() function, though it's not very good.
You can also add arbitrary metadata (as key-value pairs) to each collection item in the corpus, as well as collection-level metadata.
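A small sketch of both kinds, using the corpus.test object from the question (the tag names here are arbitrary examples, not anything tm predefines):
# document-level ("local") metadata on a single document
meta(corpus.test[[1]], "source") <- "facebook"
meta(corpus.test[[1]], "source")
# collection-level metadata on the corpus itself
meta(corpus.test, tag = "creator", type = "corpus") <- "analysis script"
meta(corpus.test, type = "corpus")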
I want to create a transactions object in basket format which I can call anytime for my analyses. The data contains comma-separated items across 1001 transactions. The first 10 transactions look like this:
hering,corned_b,olives,ham,turkey,bourbon,ice_crea
baguette,soda,hering,cracker,heineken,olives,corned_b
avocado,cracker,artichok,heineken,ham,turkey,sardines
olives,bourbon,coke,turkey,ice_crea,ham,peppers
hering,corned_b,apples,olives,steak,avocado,turkey
sardines,heineken,chicken,coke,ice_crea,peppers,ham
olives,bourbon,coke,turkey,ice_crea,heineken,apples
corned_b,peppers,bourbon,cracker,chicken,ice_crea,baguette
soda,olives,bourbon,cracker,heineken,peppers,baguette
corned_b,peppers,bourbon,cracker,chicken,bordeaux,hering
...
I observed that there are duplicated transactions in the data and removed them, but each time I try to read the transactions, I get:
Error in asMethod(object) :
can not coerce list with transactions with duplicated items
Here is my code:
data <- read.csv("AssociationsItemList.txt",header=F)
data <- data[!duplicated(data),]
pop <- NULL
for(i in 1:length(data)){
pop <- paste(pop, data[i],sep="\n")
}
write(pop, file = "Trans", sep = ",")
transdata <- read.transactions("Trans", format = "basket", sep=",")
I'm sure there's something small but important I've missed. Kindly offer your assistance.
The problem is not with duplicated transactions (the same row appearing twice)
but duplicated items (the same item appearing twice, in the same transaction --
e.g., "olives" on line 4).
read.transactions has an rm.duplicates argument to remove those duplicates.
read.transactions("Trans", format = "basket", sep=",", rm.duplicates=TRUE)
Vincent Zoonekynd is right: the problem is caused by duplicated items within a transaction. Here I can explain why arules requires transactions without duplicated items.
Transaction data is stored internally as an ngCMatrix object. Relevant source code:
setClass("itemMatrix",
representation(
data = "ngCMatrix",
...
setClass("transactions",
contains = "itemMatrix",
...
ngCMatrix is a sparse matrix class defined in the Matrix package. Its description from the official documentation:
The nsparseMatrix class is a virtual class of sparse “pattern” matrices, i.e., binary matrices conceptually with TRUE/FALSE entries. Only the positions of the elements that are TRUE are stored
It seems ngCMatrix stores the status of each element as a binary indicator, which means a transactions object in arules can only record whether an item is present or absent in a transaction and cannot record quantity. So...
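A tiny illustration of that point, using toy baskets of my own rather than the question's data: with rm.duplicates = TRUE the repeated item collapses to a single TRUE entry, and read.transactions reports how many duplicates it dropped.
library(arules)
# the first basket deliberately contains "olives" twice
baskets <- textConnection("olives,ham,olives\nbread,ham")
trans <- read.transactions(baskets, format = "basket", sep = ",", rm.duplicates = TRUE)
inspect(trans)   # "olives" appears only once in the first transaction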
I just used the unique() function to remove duplicates. My data was a little different since I had a data.frame (the data was too large for a CSV) with 2 columns: product_id and transaction_id. I know it's not your specific question, but I had to do this to create the transactions dataset and apply association rules.
data # > 1 Million Transactions
data <- unique(data[, 1:2])
trans <- as(split(data[, "product_id"], data[, "transaction_id"]), "transactions")
rules <- apriori(trans, parameter = list(supp = 0.001, conf = 0.2))
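And a standard way to look at the resulting rules afterwards (plain arules usage, added for completeness):
# show the five rules with the highest lift
inspect(head(sort(rules, by = "lift"), 5))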
GEOquery is a great R package for retrieving and analyzing the gene expression data stored in the NCBI Gene Expression Omnibus (GEO) database. I have used the following code, provided by GEO's GEO2R service (which automatically generates an initial R script for analyzing the data you select), to extract a GEO series of experiments:
gset <- getGEO("GSE10246", GSEMatrix =TRUE)
if (length(gset) > 1) idx <- grep("GPL1261", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]
gset # displays a summary of the data stored in this variable
The problem is that I cannot retrieve the sample titles from it. I have found the function Columns(), which works on GDS datasets and returns the sample names, but not on GSE series.
Please note I am not interested in the sample accession IDs (i.e., GSM258609, GSM258610, etc.); what I want is the human-readable sample titles.
Any ideas? Thanks
After
gset <- getGEO("GSE10246", GSEMatrix =TRUE)
gset is a simple list; its first element is an ExpressionSet, and the sample information is in the phenoData / pData, so maybe you're looking for
pData(gset[[1]])
See ?ExpressionSet for more.
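In particular, the human-readable sample titles should be in the title column of that phenoData data.frame. A minimal sketch, assuming the gset object from above:
library(Biobase)           # pData(), sampleNames()
eset <- gset[[1]]          # the ExpressionSet
pd <- pData(eset)          # sample-level annotation as a data.frame
head(pd$title)             # human-readable sample titles
head(sampleNames(eset))    # the GSM accessions, by contrast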