Compare datasets in R - r

I have gathered a set of transactions in a CSV file of the format:
{Pierre, lait, oeuf, beurre, pain}
{Paul, mange du pain,jambon, lait}
{Jacques, oeuf, va chez la crémière, pain, voiture}
I plan to do a simple association rule analysis, but first I want to exclude items from each transactions which do not belong to ReferenceSet = {lait, oeuf, beurre, pain}.
Thus my resulting dataset would be, in my example :
{Pierre, lait, oeuf, beurre, pain}
{Paul,lait}
{Jacques, oeuf, pain,}
I'm sure this is quite simple, but would love to read suggestions/answers to help me a bit.

Another answer references %in%, but in this case intersect is even handier (you may want to look at match, too -- but I think it's documented in the same place as %in%) -- with lapply and intersect we can make the answer into a one-liner:
Data:
> L <- list(pierre=c("lait","oeuf","beurre","pain") ,
+ paul=c("mange du pain", "jambon", "lait"),
+ jacques=c("oeuf","va chez la crémière", "pain", "voiture"))
> reference <- c("lait", "oeuf", "beurre", "pain")
Answer:
> lapply(L,intersect,reference)
$pierre
[1] "lait" "oeuf" "beurre" "pain"
$paul
[1] "lait"
$jacques
[1] "oeuf" "pain"

One way is follows (but, as I'm leaving the structure as a matrix I've left NAs where data has been removed (these could be removed if exporting back to CSV); I'm also sure it's possible to do it without loops - this would make it faster (but, IMHO less readable), and I'm sure there's a more efficient way to do the logic too - I'd also be interested in seeing someone's else view on this)
ref <- c("lait","oeuf","beurre","pain")
input <- read.csv("info.csv",sep=",",header=FALSE,strip.white=TRUE)
> input
V1 V2 V3 V4 V5
1 Pierre lait oeuf beurre pain
2 Paul mange du pain jambon lait
3 Jacques oeuf va chez la crémière pain voiture
input <- as.matrix(input)
output <- matrix(nrow=nrow(input),ncol=ncol(input))
currentRow <- c()
for(i in 1:nrow(input)) {
j <- 2
output[i,1]<-input[i,1]
for(k in 2:length(input[i,])) {
if(toString(input[i,k]) %in% ref){
output[i,j] <- toString(input[i,k])
j<-j+1
}
}
}
> output
[,1] [,2] [,3] [,4] [,5]
[1,] "Pierre" "lait" "oeuf" "beurre" "pain"
[2,] "Paul" "lait" NA NA NA
[3,] "Jacques" "oeuf" "pain" NA NA

The %in% operator will come in handy.
pierre <- c("lait","oeuf","beurre","pain")
paul <- c("mange du pain", "jambon", "lait")
jacques <- c("oeuf","va chez la crémière", "pain", "voiture")
reference <- c("lait", "oeuf", "beurre", "pain")
pierre_fixed <- pierre[pierre %in% reference]
paul_fixed <- paul[paul %in% reference]
jacques_fixed <- jacques[jacques %in% reference]
pierre_fixed
paul_fixed
jacques_fixed

Related

Using VCorpus() function but lose content

I am using the VCorpus() function in r package tm. Here is the problem I have
example_text = data.frame(num=c(1,2,3),Author1 = c("Text mining is a great time.","Text analysis provides insights","qdap and tm are used in text mining"),Author2=c("R is a great language","R has many uses","DataCamp is cool!"))
This looks like
num Author1 Author2
1 1 Text mining is a great time. R is a great language
2 2 Text analysis provides insights R has many uses
3 3 qdap and tm are used in text mining here is a problem
Then I type df_source = DataframeSource(example_text[,2:3]) to only extract the last 2 columns.
df_source looks correct. After that, I did df_corpus = VCorpus(df_source) and df_corpus[[1]] is
<<PlainTextDocument>>
Metadata: 7
Content: chars: 2
And df_corpus[[1]] gives me
$content
[1] "3" "3"
But df_corpus[[1]] should return
<<PlainTextDocument>>
Metadata: 7
Content: chars: 49
And df_corpus[[1]][1] should return
$content
[1] "Text mining is a great time." "R is a great language"
I don't know where goes wrong. Any suggestions will be appreciated.
The texts inside example_text that are supposed to be character have all become factors because the 'factory-fresh' value of stringsAsFactors is TRUE, which is weird and annoying from my point of view.
example_text <- data.frame(num=c(1,2,3),Author1 = c("Text mining is a great time.","Text analysis provides insights","qdap and tm are used in text mining"),Author2=c("R is a great language","R has many uses","DataCamp is cool!"))
lapply(example_text, class)
# $num
# [1] "numeric"
#
# $Author1
# [1] "factor"
#
# $Author2
# [1] "factor"
To ensure the column Author1 and Author2 to be character columns, you may try:
Add options(stringsAsFactors = FALSE) at the beginning of your code.
Add stringsAsFactors = FALSE inside your data.frame(...) statement.
Run example_text[, 2:3] <- lapply(example_text[, 2:3], as.character)
Run example_text[, 2:3] <- lapply(example_text[, 2:3], paste)
Then everything should work fine.

Get all of the nouns in a book (.txt file) in R and make a frequency table and wordcloud

I am trying to find all of the nouns in a text file. Originally I converted a .epub to a .pdf file. Then I converted the .pdf to a .txt file successfully and I removed half of the text since I only need to find the nouns from the last half of the book. I want to do this so I can find the frequency of the nouns and then identify them for finals.
I can do the frequency tables normally with the original text files without any transformations and make a wordcloud etc but I cannot seem to filter only the nouns. Any ideas?
cname <- file.path(".","Desktop", "egypt", "pdf")
mytxtfiles <- list.files(path = cname, pattern = "txt", full.names = TRUE)
#nouns2 and nouns doesnt seem to work :O -Ive tried both ways-
nouns2 <- regmatches(mytxtfiles, gregexpr("^([A-Z][a-z]+)+$", mytxtfiles, perl=TRUE))
nouns <- lapply(mytxtfiles, function(i) {
j <- paste0(scan(i, what = character()), collapse = " ")
regmatches(j, gregexpr("^([A-Z][a-z]+)+$", j, perl=TRUE))})
#transformation if nouns do not work
docs <- tm_map(docs[1], removeWords, stopwords("english"))
#working wordcloud and freq data
dtm <- DocumentTermMatrix(docs)
findFreqTerms(dtm, lowfreq=100)
findAssocs(dtm, "data", corlimit=0.6)
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
wf <- data.frame(word=names(freq), freq=freq)
p <- ggplot(subset(wf, freq >500), aes(word, freq))
p <-p + geom_bar(stat ="identity")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
library(wordcloud)
wordcloud(names(freq),freq,min.freq=100, colors=brewer.pal(6,"Dark2"))
I have tried nouns2 and nouns but they return something like:
nouns2
[[1]]
character(0)
[[2]]
character(0)
[[3]]
character(0)
Here is a method to find all the nouns, using the qdap package. You can go from here.
text <- "To further enhance our practice, the president was honored to have him join the firm, former commissioner and the first to institute patent reexaminations, bringing a wealth of experience and knowledge to the firm and our clients."
library(qdap)
pos.text <- pos(sentence) # tells the count and parts of speech in the text
vec.tagged <- as.vector(pos.text[[2]]) # retains only the tagged terms in a vector
vec.tagged.split <- str_split(vec.tagged$POStagged, "/") # breaks the vector apart at the "/"
all.nouns <- str_extract(vec.tagged.split[[1]], "^NN .+") # identifies the nouns
all.nouns <- str_replace(all.nouns, "NN\\s", "") # removes NN tag
all.nouns
[1] NA NA NA NA NA "novak" "druce"
[8] "was" NA NA NA NA NA NA
[15] NA NA NA NA NA "commissioner" "and"
[22] NA NA NA NA NA "reexaminations" NA
[29] NA NA "of" NA "and" NA "to"
[36] NA NA "and" NA NA NA

Document term matrix in R

I have the following code:
rm(list=ls(all=TRUE)) #clear data
setwd("~/UCSB/14 Win 15/Issy/text.fwt") #set working directory
files <- list.files(); head(files) #load & check working directory
fw1 <- scan(what="c", sep="\n",file="fw_chp01.fwt")
library(tm)
corpus2<-Corpus(VectorSource(c(fw1)))
skipWords<-(function(x) removeWords(x, stopwords("english")))
#remove punc, numbers, stopwords, etc
funcs<-list(content_transformer(tolower), removePunctuation, removeNumbers, stripWhitespace, skipWords)
corpus2.proc<-tm_map(corpus2, FUN = tm_reduce, tmFuns = funcs)
corpus2a.dtm <- DocumentTermMatrix(corpus2.proc, control = list(wordLengths = c(1,110))) #create document term matrix
I'm trying use some of the operations detailed in the tm reference manual (http://cran.r-project.org/web/packages/tm/tm.pdf) with little success. For example, when I try to use the findFreqTerms, I get the following error:
Error: inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")) is not TRUE
Can anyone clue me in as to why this isn't working and what I can do to fix it?
Edited for #lawyeR:
head(fw1) produces the first six lines of the text (Episode 1 of Finnegans Wake by James Joyce):
[1] "003.01 riverrun, past Eve and Adam's, from swerve of shore to bend"
[2] "003.02 of bay, brings us by a commodius vicus of recirculation back to"
[3] "003.03 Howth Castle and Environs."
[4] "003.04 Sir Tristram, violer d'amores, fr'over the short sea, had passen-"
[5] "003.05 core rearrived from North Armorica on this side the scraggy"
[6] "003.06 isthmus of Europe Minor to wielderfight his penisolate war: nor"
inspect(corpus2) outputs each line of the text in the following format (this is the final line of the text):
[[960]]
<<PlainTextDocument (metadata: 7)>>
029.36 borough. #this part differs by line of course
inspect(corpus2a.dtm) returns a table of all the types (there are 4163 in total( in the text in the following format:
Docs youths yoxen yu yurap yutah zee zephiroth zine zingzang zmorde zoom
1 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0
Here is a simplified form of what you provided and did, and tm does its job. It may be that one or more of your cleaning steps caused a problem.
> library(tm)
> fw1 <- c("riverrun, past Eve and Adam's, from swerve of shore to bend
+ of bay, brings us by a commodius vicus of recirculation back to
+ Howth Castle and Environs.
+ Sir Tristram, violer d'amores, fr'over the short sea, had passen-
+ core rearrived from North Armorica on this side the scraggy
+ isthmus of Europe Minor to wielderfight his penisolate war: nor")
>
> corpus<-Corpus(VectorSource(c(fw1)))
> inspect(corpus)
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
riverrun, past Eve and Adam's, from swerve of shore to bend
of bay, brings us by a commodius vicus of recirculation back to
Howth Castle and Environs.
Sir Tristram, violer d'amores, fr'over the short sea, had passen-
core rearrived from North Armorica on this side the scraggy
isthmus of Europe Minor to wielderfight his penisolate war: nor
> dtm <- DocumentTermMatrix(corpus)
> findFreqTerms(dtm)
[1] "adam's," "and" "armorica" "back" "bay," "bend"
[7] "brings" "castle" "commodius" "core" "d'amores," "environs."
[13] "europe" "eve" "fr'over" "from" "had" "his"
[19] "howth" "isthmus" "minor" "nor" "north" "passen-"
[25] "past" "penisolate" "rearrived" "recirculation" "riverrun," "scraggy"
[31] "sea," "shore" "short" "side" "sir" "swerve"
[37] "the" "this" "tristram," "vicus" "violer" "war:"
[43] "wielderfight"
As another point, I find it useful at the start to load a few other complementary packages to tm.
library(SnowballC); library(RWeka); library(rJava); library(RWekajars)
For what its worth, as compared to your somewhat complicated cleaning steps, I usually trudge along like this (replace comments$comment with your text vector):
comments$comment <- tolower(comments$comment)
comments$comment <- removeNumbers(comments$comment)
comments$comment <- stripWhitespace(comments$comment)
comments$comment <- str_replace_all(comments$comment, " ", " ")
# replace all double spaces internally with single space
# better to remove punctuation with str_ because the tm function doesn't insert a space
library(stringr)
comments$comment <- str_replace_all(comments$comment, pattern = "[[:punct:]]", " ")
comments$comment <- removeWords(comments$comment, stopwords(kind = "english"))
From another ticket this should help tm 0.6.0 has a bug and it can be addressed with this statement.
corpus_clean <- tm_map( corp_stemmed, PlainTextDocument)
Hope this helps.

Create a list with named values by applying a function to each row of a data frame

I'm trying to get a list where each element has a name, by applying a function to each row of a data frame, but can't get the right output.
Assuming this is the function that I want to apply to each row:
format_setup_name <- function(m, v, s) {
a <- list()
a[[paste(m, "machines and", v, s, "GB volumes")]] <- paste(num_machines,num_volumes,vol_size,sep="-")
a
}
If this is the input data frame:
df <- data.frame(m=c(1,2,3), v=c(3,3,3), s=c(15,20,30))
I can't get a list that looks like:
$`1-3-15`
[1] "1 machines and 3 15 GB volumes"
$`2-3-20`
[1] "2 machines and 3 20 GB volumes"
$`3-3-30`
[1] "3 machines and 3 30 GB volumes"
Can someone give me hints how to do it?
Why do I need this? Well, I want to populate selectizeInput in shiny using values coming from the database. Since I'm combining several columns, I need a way to match the selected input with the values.
This is a good use case for setNames which can add the names() attribute to an object, in place. Also, if you use as.list, you can do this in just one line without any looping:
setNames(as.list(paste(df$m, ifelse(df$m == 1, "machine", "machines"), "and", df$v, df$s, "GB volumes")), paste(df$m,df$v,df$s,sep="-"))
# $`1-3-15`
# [1] "1 machine and 3 15 GB volumes"
#
# $`2-3-20`
# [1] "2 machines and 3 20 GB volumes"
#
# $`3-3-30`
# [1] "3 machines and 3 30 GB volumes"
Thomas has already found a pretty neat solution to your problem (and in one line, too!). But I'll just show you how you could have succeeded with the approach you first tried:
# We'll use the same data, this time called "dat" (I avoid calling
# objects `df` because `df` is also a function's name)
dat <- data.frame(m = c(1,2,3), v = c(3,3,3), s = c(15,20,30))
format_setup_name <- function(m, v, s) {
a <- list() # initialize the list, all is well up to here
# But here we'll need a loop to assign in turn each element to the list
for(i in seq_along(m)) {
a[[paste(m[i], v[i], s[i], sep="-")]] <-
paste(m[i], "machines and", v[i], s[i], "GB volumes")
}
return(a)
}
Note that what goes inside the brackets is the name of the element, while what's at the right side of the <- is the content to be assigned, not the inverse as your code was suggesting.
So let's try it:
my.setup <- format_setup_name(dat$m, dat$v, dat$s)
my.setup
# $`1-3-15`
# [1] "1 machines and 3 15 GB volumes"
#
# $`2-3-20`
# [1] "2 machines and 3 20 GB volumes"
#
# $`3-3-30`
# [1] "3 machines and 3 30 GB volumes"
Everything seems nice. Just one thing to note: with the $ operator, you'll need to use single or double quotes to access individual items by their names:
my.setup$"1-3-15" # my.setup$1-3-15 won't work
# [1] "1 machines and 3 15 GB volumes"
my.setup[['1-3-15']] # equivalent
# [1] "1 machines and 3 15 GB volumes"
Edit: lapply version
Since loops have really fallen out of favor, here's a version with lapply:
format_setup_name <- function(m, v, s) {
a <- lapply(seq_along(m), function(i) paste(m[i], "machines and", v[i], s[i], "GB volumes"))
names(a) <- paste(m, v, s, sep="-")
return(a)
}

Count misspelled words in R

Row<-c(1,2,3,4,5)
Content<-c("I love cheese", "whre is the fish", "Final Countdow", "show me your s", "where is what")
Data<-cbind(Row, Content)
View(Data)
I wanted to create a function which tells me how many words are wrong per Row.
A intermediate step would be to have it look like this:
Row<-c(1,2,3,4,5)
Content<-c("I love cheese", "whre is the fs", "Final Countdow", "show me your s", "where is what")
MisspelledWords<-c(NA, "whre, fs", "Countdow","s",NA)
Data<-cbind(Row, Content,MisspelledWords)
I know that i have to use aspell but i'm having problems to perform aspell on only rows and not always directly on the whole file, finally i want to Count how many words are wrong on every Row For this i would take code of: Count the number of words in a string in R?
Inspired by this article, here's a try with which_misspelled and check_spelling in library(qdap).
library(qdap)
# which_misspelled
n_misspelled <- sapply(Content, function(x){
length(which_misspelled(x, suggest = FALSE))
})
data.frame(Content, n_misspelled, row.names = NULL)
# Content n_misspelled
# 1 I love cheese 0
# 2 whre is the fs 2
# 3 Final Countdow 1
# 4 show me your s 0
# 5 where is what 0
# check_spelling
df <- check_spelling(Content, n.suggest = 0)
n_misspelled <- as.vector(table(factor(df$row, levels = Row)))
data.frame(Content, n_misspelled)
# Content n_misspelled
# 1 I love cheese 0
# 2 whre is the fs 2
# 3 Final Countdow 1
# 4 show me your s 0
# 5 where is what 0
To use aspell you have to use a file. It's pretty straightforward to use a function to dump a column to a file, run aspell and get the counts (but it will not be all that efficient if you have a large matrix/dataframe).
countMispelled <- function(words) {
# do a bit of cleanup (if necessary)
words <- gsub(" *", " ", gsub("[[:punct:]]", "", words))
temp_file <- tempfile()
writeLines(words, temp_file);
res <- aspell(temp_file)
unlink(temp_file)
# return # of mispelled words
length(res$Original)
}
Data <- cbind(Data, Errors=unlist(lapply(Data[,2], countMispelled)))
Data
## Row Content Errors
## [1,] "1" "I love cheese" "0"
## [2,] "2" "whre is thed fish" "2"
## [3,] "3" "Final Countdow" "1"
## [4,] "4" "show me your s" "0"
## [5,] "5" "where is what" "0"
You might be better off using a data frame vs a matrix (I just worked with what you provided) since you can keep Row and Errors numeric that way.

Resources