Lists of term-frequency pairs into a matrix in R

I have a large data set in the following format, where each line is a document, encoded as word:frequency-in-the-document pairs separated by spaces; lines can be of variable length:
aword:3 bword:2 cword:15 dword:2
bword:4 cword:20 fword:1
etc...
E.g., in the first document, "aword" occurs 3 times. What I ultimately want to do is create a little search engine in which the documents (in the same format) matching a query are ranked; I thought about using tf-idf and the tm package (based on this tutorial, which requires the data to be in the format of a TermDocumentMatrix: http://anythingbutrbitrary.blogspot.be/2013/03/build-search-engine-in-20-minutes-or.html). Otherwise, I would just use tm's TermDocumentMatrix function on a corpus of text, but the catch here is that I already have these data indexed in this format (and I'd rather use these data, unless the format is truly alien and cannot be converted).
What I've tried so far is to import the lines and split them:
docs <- scan("data.txt", what="", sep="\n")
doclist <- strsplit(docs, "[[:space:]]+")
I figured I would put something like this in a loop:
doclist2 <- strsplit(doclist, ":", fixed=TRUE)
and somehow get the paired values into an array, and then run a loop that populates a matrix (pre-filled with zeroes: matrix(0,x,y)) by fetching the appropriate values from the word:freq pairs (would that in itself be a good way to construct the matrix?). But this way of converting does not seem right: the lists keep getting more complicated, and I still wouldn't know how to get to the point where I can populate the matrix.
What I (think I) would need in the end is a matrix like this:
      doc1 doc2 doc3 doc4 ...
aword    3    0    0    0
bword    2    4    0    0
cword   15   20    0    0
dword    2    0    0    0
fword    0    1    0    0
...
which I could then convert into a TermDocumentMatrix and get started with the tutorial. I have a feeling I am missing something very obvious here, something I probably cannot find because I don't know what these things are called (I've been googling for a day, on the theme of "term document vector/array/pairs", "two-dimensional array", "list into matrix" etc).
What would be a good way to get such a list of documents into a matrix of term-document frequencies? Alternatively, if the solution would be too obvious or doable with built-in functions: what is the actual term for the format that I described above, where there are those term:frequency pairs on a line, and each line is a document?

Here's an approach that gets you the output you say you might want:
## Your sample data
x <- c("aword:3 bword:2 cword:15 dword:2", "bword:4 cword:20 fword:1")
## Split on spaces and colons
B <- strsplit(x, "\\s+|:")
## Add names to your list to represent the source document
B <- setNames(B, paste0("document", seq_along(B)))
## Put everything together into a long matrix
out <- do.call(rbind, lapply(seq_along(B), function(x)
  cbind(document = names(B)[x],
        matrix(B[[x]], ncol = 2, byrow = TRUE,
               dimnames = list(NULL, c("word", "count"))))))
## Convert to a data.frame
out <- data.frame(out)
out
#    document  word count
# 1 document1 aword     3
# 2 document1 bword     2
# 3 document1 cword    15
# 4 document1 dword     2
# 5 document2 bword     4
# 6 document2 cword    20
# 7 document2 fword     1
## Make sure the counts column is a number
out$count <- as.numeric(as.character(out$count))
## Use xtabs to get the output you want
xtabs(count ~ word + document, out)
#        document
# word    document1 document2
#   aword         3         0
#   bword         2         4
#   cword        15        20
#   dword         2         0
#   fword         0         1
Note: the answer was edited to use matrices in the creation of "out", to minimize the number of calls to read.table, which would be a major bottleneck with bigger data.
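If you then want an actual TermDocumentMatrix for the tm-based tutorial, one possible bridge (a hedged sketch, not part of the original answer, assuming the tm and slam packages) is to coerce the xtabs() result into tm's underlying sparse triplet format:
library(tm)
library(slam)
## xtabs() returns a contingency table; strip the class, then convert
tdm_tab <- xtabs(count ~ word + document, out)
tdm <- as.TermDocumentMatrix(as.simple_triplet_matrix(unclass(tdm_tab)),
                             weighting = weightTf)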

Related

fread() with commas outside square brackets as separators

I'm trying to use fread() to get some data from a website. The data is conveniently set up with comma separators, but I get the error:
1: In fread("https://website.com/") :
Stopped early on line 56. Expected 5 fields but found 6. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<0,1,1,x[[0], [1]],0>>
This is because the entries before line 56 had a blank in column 4, so something like <<1,1,1,,0>>, whereas line 56 has a value containing a comma in column 4, which gets split into two columns. Now, I want the whole x[[y], [z]] to stay in one cell; that is, I want my data separated by commas, but not by commas inside square brackets.
Edit: The real website is private, so it makes no sense to link it here, but it simply contains data in a csv format. Something like:
field1,field2,field3,field4,field5
1,0,0,,1
0,0,0,,1
1,1,0,,1
1,1,0,,1
............
0,1,1,x[[0], [1]],0
0,1,0,x[[0], [1]],1
1,0,1,,1
0,0,1,x[[1], [0]],0
............
The problem is that x[[0], [1]] is supposed to be all in one cell, but because of the comma delimiter it gets split across two cells.
Is there any way to do this with fread()? Or with any other function that serves a similar purpose?
Thank you in advance, and sorry if the question is somewhat basic; I'm just getting started with R.
Instead of reading your CSV file directly from that private website of yours with fread, you can download the CSV first and then:
1. Read the lines of the CSV (without any special parsing); that is equivalent to my csv_lines <- read_lines(my_weird_csv_text).
2. Split those lines with the regex "(?!\\])(\\,)(?!\\s\\[)" rather than the single comma "," (this ensures that commas within the expressions containing "[[" and "]]" are not used as split characters).
3. Finally, use the first row of the resulting matrix (split_lines) as the column names of a new dataframe/tibble coerced from split_lines.
I hope it's clear.
Basically, we had to circumvent straightforward reading functions such as fread (or other equivalents) by reading line by line and then splitting on a regex that handles your special cases.
library(readr)
library(data.table)
library(stringr)
library(tibble)
library(magrittr) # for the %>% pipe used below
my_weird_csv_text <-
"field1,field2,field3,field4,field5
1,0,0,,1
0,0,0,,1
1,1,0,,1
1,1,0,,1
0,1,1,x[[0], [1]],0
0,1,0,x[[0], [1]],1
1,0,1,,1
0,0,1,x[[1], [0]],0"
csv_lines <- read_lines(my_weird_csv_text)
split_lines <- stringr::str_split(csv_lines, "(?!\\])(\\,)(?!\\s\\[)", simplify = TRUE)
as_tibble(split_lines[-1, ]) %>%
  `colnames<-`(split_lines[1, ]) -> tbl
tbl
#> # A tibble: 8 x 5
#> field1 field2 field3 field4 field5
#> <chr> <chr> <chr> <chr> <chr>
#> 1 1 0 0 "" 1
#> 2 0 0 0 "" 1
#> 3 1 1 0 "" 1
#> 4 1 1 0 "" 1
#> 5 0 1 1 x[[0], [1]] 0
#> 6 0 1 0 x[[0], [1]] 1
#> 7 1 0 1 "" 1
#> 8 0 0 1 x[[1], [0]] 0
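A small added note (not part of the original answer): every column comes back as character, so if you want the numeric fields parsed, readr's type_convert() is one option; field4 should stay character because of the bracket expressions.
tbl <- readr::type_convert(tbl)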
A suggestion:
From the documentation:
'fread' is for regular delimited files; i.e., where every row has the same number of columns.
If the number of columns varies or is irregular because of errors in file generation, an alternative like readLines would let you process the file line by line, perhaps using regular expressions like gsub; a sketch follows below.
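A minimal sketch of that line-by-line idea, reusing the sample data from the answer above (the semicolon placeholder is an arbitrary choice and assumes it never occurs in the real data):
lines <- readLines(textConnection(my_weird_csv_text))
## temporarily protect the comma inside "x[[0], [1]]" by rewriting ", ["
protected <- gsub(", \\[", ";[", lines)
fields <- strsplit(protected, ",", fixed = TRUE)
## restore the protected commas, then rebuild a data.frame from the rows,
## using the first line as the header
fields <- lapply(fields, function(f) gsub(";[", ", [", f, fixed = TRUE))
df <- setNames(as.data.frame(do.call(rbind, fields[-1]),
                             stringsAsFactors = FALSE), fields[[1]])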

strsplit a character array and convert to dataframe simultaneously

I have what feels like a difficult data manipulation problem, and am hoping to get some guidance. Here is a test version of what my current array looks like, as well as what dataframe I hope to obtain:
dput(test)
c("<play quarter=\"1\" oncourt-id=\"\" time-minutes=\"12\" time-seconds=\"0\" id=\"1\"/>", "<play quarter=\"2\" oncourt-id=\"\" time-minutes=\"10\" id=\"1\"/>")
test
[1] "<play quarter=\"1\" oncourt-id=\"\" time-minutes=\"12\" time-seconds=\"0\" id=\"1\"/>"
[2] "<play quarter=\"2\" oncourt-id=\"\" time-minutes=\"10\" id=\"1\"/>"
desired_df
  quarter oncourt-id time-minutes time-seconds id
1       1         NA           12            0  1
2       2         NA           10           NA  1
There are a few problems I am dealing with:
the character array "test" has backslashes where there should be nothing, but I was having difficulty using gsub in this format: gsub("\", "", test).
not every element in test has the same number of entries; note in the example that the 2nd element doesn't have time-seconds, so in the dataframe I would prefer it to return NA.
I have tried using strsplit(test, " ") to first split on spaces, which only exist between different column entries, but then I am returned a list of lists that is just as difficult to deal with.
You've got XML there. You could parse it, then run rbindlist on the result. This will probably be a lot less hassle than trying to split the name-value pairs as strings.
dflist <- lapply(test, function(x) {
  df <- as.data.frame.list(XML::xmlToList(x))
  is.na(df) <- df == ""
  df
})
data.table::rbindlist(dflist, fill = TRUE)
#    quarter oncourt.id time.minutes time.seconds id
# 1:       1         NA           12            0  1
# 2:       2         NA           10           NA  1
Note: You will need the XML and data.table packages for this solution.
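An added note beyond the original answer: the attribute values come back as character, so you may want to convert the numeric ones, e.g. with data.table's := (assuming the rbindlist() result is stored in res):
res <- data.table::rbindlist(dflist, fill = TRUE)
num_cols <- c("quarter", "time.minutes", "time.seconds", "id")
## convert the numeric attribute columns in place
res[, (num_cols) := lapply(.SD, as.numeric), .SDcols = num_cols]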

Concatenate dfm matrices in 'quanteda' package

Is there a method to concatenate two dfm matrices with different numbers of columns and rows? It can be done with some additional coding, so I am not interested in ad hoc code, but in a general and elegant solution, if one exists.
An example:
dfm1 <- dfm(c(doc1 = "This is one sample text sample."), verbose = FALSE)
dfm2 <- dfm(c(doc2 = "Surprise! This is one sample text sample."), verbose = FALSE)
rbind(dfm1, dfm2)
gives an error.
The 'tm' package can concatenate its matrices out of the box, but it is too slow for my purposes.
Also recall that 'dfm' from 'quanteda' is an S4 class.
Should work "out of the box", if you are using the latest version:
packageVersion("quanteda")
## [1] ‘0.9.6.9’
dfm1 <- dfm(c(doc1 = "This is one sample text sample."), verbose = FALSE)
dfm2 <- dfm(c(doc2 = "Surprise! This is one sample text sample."), verbose = FALSE)
rbind(dfm1, dfm2)
## Document-feature matrix of: 2 documents, 6 features.
## 2 x 6 sparse Matrix of class "dfmSparse"
##      is one sample surprise text this
## doc1  1   1      2        0    1    1
## doc2  1   1      2        1    1    1
See also ?selectFeatures where features is a dfm object (there are examples in the help file).
Added:
Note that this will correctly align the two texts in a common feature set, unlike the normal rbind methods for matrices, whose columns must match. For the same reasons, rbind() does not actually work in the tm package for DocumentTermMatrix objects with different terms:
require(tm)
dtm1 <- DocumentTermMatrix(Corpus(VectorSource(c(doc1 = "This is one sample text sample."))))
dtm2 <- DocumentTermMatrix(Corpus(VectorSource(c(doc2 = "Surprise! This is one sample text sample."))))
rbind(dtm1, dtm2)
## Error in f(init, x[[i]]) : Numbers of columns of matrices must match.
This almost gets it, but seems to duplicate the repeated feature:
as.matrix(rbind(c(dtm1, dtm2)))
##     Terms
## Docs one sample sample. text this surprise!
##    1   1      1       1    1    1         0
##    1   1      1       1    1    1         1
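For completeness, a hedged workaround sketch for tm (not from the original answer, and only sensible for small data since it densifies the matrices): align the two objects on the union of their terms, then rbind plain matrices:
m1 <- as.matrix(dtm1)
m2 <- as.matrix(dtm2)
all_terms <- union(colnames(m1), colnames(m2))
pad <- function(m) {
  ## zero-filled matrix over the full term set, then copy the known counts in
  out <- matrix(0, nrow(m), length(all_terms),
                dimnames = list(rownames(m), all_terms))
  out[, colnames(m)] <- m
  out
}
rbind(pad(m1), pad(m2))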

Replace apply function with lapply

I am creating a data set to compute aggregate values for different combinations of words using regex. Each row has a unique regex value that I want to check against another dataset, counting the number of times it appears there.
The first dataset (df1) looks like this :
word1   word2 pattern
air     10    (^|\\s)air(\\s.*)?\\s10($|\\s)
airport 20    (^|\\s)airport(\\s.*)?\\s20($|\\s)
car     30    (^|\\s)car(\\s.*)?\\s30($|\\s)
The other dataset (df2) from which I want to match this looks like
sl_no query
1     air 10
2     airport 20
3     airport 20
3     airport 20
3     car 30
The final output I want should look like
word1   word2 total_occ
air     10    1
airport 20    3
car     30    1
I am able to do this by using apply in R
process <- function(x) {
  length(grep(x[["pattern"]], df2$query))
}
df1$total_occ <- apply(df1, 1, process)
but find it time-consuming since my dataset is pretty big.
I found out that the "mclapply" function of the "parallel" package can be used to run such things on multiple cores, so I am trying to get lapply to run first. It's giving me an error:
lapply(df,process)
Error in x[, "pattern"] : incorrect number of dimensions
Please let me know what changes should I make to run lapply correctly.
Why not just lapply() over the pattern?
Here I've just pulled out your pattern, but this could just as easily be df1$pattern:
pattern <- c("(^|\\s)air(\\s.*)?\\s10($|\\s)",
"(^|\\s)airport(\\s.*)?\\s20($|\\s)",
"(^|\\s)car(\\s.*)?\\s30($|\\s)")
Using your data for df2
txt <- "sl_no query
1 'air 10'
2 'airport 20'
3 'airport 20'
3 'airport 20'
3 'car 30'"
df2 <- read.table(text = txt, header = TRUE)
Just iterate on pattern directly
> lapply(pattern, grep, x = df2$query)
[[1]]
[1] 1

[[2]]
[1] 2 3 4

[[3]]
[1] 5
If you want more compact output as suggested in your question, you'll need to run lengths() over the returned output (thanks to @Frank for pointing out the new lengths() function). E.g.
lengths(lapply(pattern, grep, x = df2$query))
which gives
> lengths(lapply(pattern, grep, x = df2$query))
[1] 1 3 1
You can add this to the original data via
dfnew <- cbind(df1[, 1:2],
               Count = lengths(lapply(pattern, grep, x = df2$query)))
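For the parallel version mentioned in the question, mclapply() is nearly a drop-in replacement for lapply() here (a sketch: mclapply() forks, so it only runs in parallel on Unix-alikes, and mc.cores is a tuning choice):
library(parallel)
## same grep as above, fanned out across cores; lengths() gives the counts
counts <- lengths(mclapply(df1$pattern, grep, x = df2$query,
                           mc.cores = max(1L, detectCores() - 1L)))
dfnew <- cbind(df1[, 1:2], Count = counts)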

create list based on data frame in R

I have a data frame A in the following format
user     item
10000000 1    # each user is an 8-digit integer; item is up to a 5-digit integer
10000000 2
10000000 3
10000001 1
10000001 4
..............
What I want is a list B, with users' names as the names of the list elements; each list element is a vector of the items corresponding to that user.
e.g.
B = list(c(1,2,3),c(1,4),...)
I also need to paste names onto B. To apply association rule learning, the items need to be converted to characters.
Originally I used tapply(A$user, A$item, c), but this makes it incompatible with the association rule package. See my post:
data format error in association rule learning R
But @sgibb's solution also seems to generate an array, not a list.
library("arules")
temp <- as(C, "transactions") # C is the output of @sgibb's solution
throws the error: Error in as(C, "transactions") :
no method or default for coercing “array” to “transactions”
Have a look at tapply:
df <- read.table(textConnection("
user item
10000000 1
10000000 2
10000000 3
10000001 1
10000001 4"), header=TRUE)
B <- tapply(df$item, df$user, FUN=as.character)
B
# $`10000000`
# [1] "1" "2" "3"
#
# $`10000001`
# [1] "1" "4"
EDIT: I do not know the arules package, but here is the solution proposed by @alexis_laz:
library("arules")
as(split(df$item, df$user), "transactions")
# transactions in sparse format with
# 2 transactions (rows) and
# 4 items (columns)
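To double-check the contents, arules provides inspect(), which prints each transaction together with its ID (a small added example, reusing df from above):
trans <- as(split(df$item, df$user), "transactions")
inspect(trans) # one row per user, showing that user's item set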
