I have a data frame A in the following format
user item
10000000 1 # each user is an 8-digit integer; each item is an integer of up to 5 digits
10000000 2
10000000 3
10000001 1
10000001 4
..............
What I want is a list B whose elements are named by user, where each element is the vector of items belonging to that user.
e.g.
B = list(c(1,2,3),c(1,4),...)
I also need to attach the user names to B. To apply association rule learning, the items need to be converted to characters.
Originally I used tapply(A$user,A$item, c), but the result is not compatible with the association rule package. See my post:
data format error in association rule learning R
But @sgibb's solution also seems to generate an array, not a list.
library("arules")
temp <- as(C, "transactions") # C is the output of @sgibb's solution
throws error: Error in as(C, "transactions") :
no method or default for coercing “array” to “transactions”
Have a look at tapply:
df <- read.table(textConnection("
user item
10000000 1
10000000 2
10000000 3
10000001 1
10000001 4"), header=TRUE)
B <- tapply(df$item, df$user, FUN=as.character)
B
# $`10000000`
# [1] "1" "2" "3"
#
# $`10000001`
# [1] "1" "4"
EDIT: I do not know the arules package, but here is the solution proposed by @alexis_laz:
library("arules")
as(split(df$item, df$user), "transactions")
# transactions in sparse format with
# 2 transactions (rows) and
# 4 items (columns)
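A minimal base-R sketch, assuming the same five-row example, of why split() fits here: it returns a plain named list, whereas tapply() returns a one-dimensional array whose cells hold the vectors. The users are written as integer literals (an assumption of this sketch) so that the element names keep all eight digits.

```r
# Example data; integer literals (L suffix) so the user IDs keep all digits as names
df <- data.frame(user = c(10000000L, 10000000L, 10000000L, 10000001L, 10000001L),
                 item = c(1L, 2L, 3L, 1L, 4L))

# tapply() returns a 1-d array (it carries a dim attribute)
B_tapply <- tapply(df$item, df$user, FUN = as.character)

# split() returns a plain named list of character vectors
B_split <- split(as.character(df$item), df$user)

is.null(dim(B_tapply))  # FALSE: this is an array
is.null(dim(B_split))   # TRUE: this is an ordinary list
names(B_split)          # "10000000" "10000001"
```

Converting the items with as.character() before splitting keeps each list element a character vector, which is what the question asks for.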
I want to write a function in R that does the following:
I have a table of cases, and some data. I want to find the correct row matching to each observation from the data. Example:
crit1 <- c(1,1,2)
crit2 <- c("yes","no","no")
Cases <- matrix(c(crit1,crit2),ncol=2,byrow=FALSE)
data1 <- c(1,2,1)
data2 <- c("no","no","yes")
data <- matrix(c(data1,data2),ncol=2,byrow=FALSE)
Now I want a function that returns for each row of my data, the matching row from Cases, the result would be the vector
c(2,3,1)
Are you sure you want to be using matrices for this?
Note that the numeric data in crit1 and data1 has been converted to string (matrices can only store one data type):
typeof(data[ , 1L])
# [1] "character"
In R, a data.frame is a much more natural choice for what you're after. data.table is (among many other things) a toolset for working with "enhanced" data.frames; see the Introduction.
I would create your data as:
library(data.table)
Cases = data.table(crit1, crit2)
data = data.table(data1, data2)
We can get the matching row indices as asked by doing a keyed join (See the vignette on keys):
setkey(Cases) # key by all columns
Cases
# crit1 crit2
# 1: 1 no
# 2: 1 yes
# 3: 2 no
setkey(data)
data
# data1 data2
# 1: 1 no
# 2: 1 yes
# 3: 2 no
Cases[data, which=TRUE]
# [1] 1 2 3
This differs from 2,3,1 because the order of your data has changed, but note that the answer is still correct.
If you don't want to change the order of your data, it's slightly more complicated (but more readable if you're not used to data.table syntax):
Cases = data.table(crit1, crit2)
data = data.table(data1, data2)
Cases[data, on = setNames(names(data), names(Cases)), which=TRUE]
# [1] 2 3 1
The on= part creates the mapping between the columns of data and those of Cases.
We could write this in a bit more SQL-like fashion as:
Cases[data, on = .(crit1 == data1, crit2 == data2), which=TRUE]
# [1] 2 3 1
This is shorter and more readable for your sample data, but not as extensible if your data has many columns or if you don't know the column names in advance.
The prodlim package has a function for that:
library(prodlim)
row.match(data, Cases)
# [1] 2 3 1
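For comparison, here is a rough base-R sketch of the same lookup that needs no extra package. It assumes both tables are data frames with the columns in the same order; the helper row_key is a hypothetical name introduced only for this illustration. Each row is collapsed into a single string key and match() then finds the position of each data row in Cases.

```r
Cases <- data.frame(crit1 = c(1, 1, 2), crit2 = c("yes", "no", "no"))
data  <- data.frame(data1 = c(1, 2, 1), data2 = c("no", "no", "yes"))

# Hypothetical helper: paste all columns of a row into one key string.
# "\r" is used as separator on the assumption that it never occurs in the data.
row_key <- function(d) do.call(paste, c(unname(as.list(d)), sep = "\r"))

# Position in Cases of each row of data, in the original row order of data
idx <- match(row_key(data), row_key(Cases))
idx
# [1] 2 3 1
```

Rows of data with no counterpart in Cases come back as NA, which is the usual behaviour of match().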
I've figured out that if I use as.character(df[x,y]) or as.<whatever>(df[x,y]) I can get/coerce what I need, every time, from my data frames.
What I can't seem to find or figure out is why. Details below.
When I access df[1,1] (or anything in column 1) I get
df[1,1]
[1] a
Levels: a b c
but when I access 1,3 it works fine
> df[1,3]
[1] 10
but then when I use as.character() it works.
> as.character(df[1,1])
[1] "a"
The data frame was built using this line
df = data.frame(names = c("a","b","c"), size = c(1,2,3),num = c(10,20,30) )
> df
names size num
1 a 1 10
2 b 2 20
3 c 3 30
But in this data frame
imp2met = read.csv('tomet.csv', header = TRUE, sep=",",dec='.')
> imp2met
unit mult ret
1 (yd) 0.9100 (m)
2 (in) 2.5200 (cm)
3 .....
I get these results for 1,3
> imp2met[1,3]
[1] (m)
Levels: (c) (cm) (cm^2) ....
>
> as.character(imp2met[1,3])
[1] "(m)"
So why the "random" results? Why do I need as.<whatever>() but only some of the time?
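In R versions before 4.0.0, data.frame converts character vectors to factors by default (read.csv does the same). You can change this with the argument stringsAsFactors=FALSE; since R 4.0.0 that is the default.
Also, when you subset a data frame using [, you can add the drop=FALSE argument to keep the result as a data frame instead of having it simplified to a vector or factor.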
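A short sketch of both points, reusing the data frame from the question. stringsAsFactors is set explicitly in each call so the result does not depend on the R version; df_fac and df_chr are names made up for this illustration.

```r
# Strings stored as factors (the pre-4.0.0 default behaviour)
df_fac <- data.frame(names = c("a", "b", "c"), size = c(1, 2, 3),
                     num = c(10, 20, 30), stringsAsFactors = TRUE)
class(df_fac[1, 1])                  # "factor": prints with a Levels: line
as.character(df_fac[1, 1])           # "a"

# Strings kept as plain character vectors
df_chr <- data.frame(names = c("a", "b", "c"), size = c(1, 2, 3),
                     num = c(10, 20, 30), stringsAsFactors = FALSE)
class(df_chr[1, 1])                  # "character"

# drop = FALSE keeps a single column as a data frame
class(df_chr[, 1])                   # "character"
class(df_chr[, 1, drop = FALSE])     # "data.frame"
```

The numeric column num never needed as.<whatever>() because numbers are not converted to factors in either case.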
I have a large data set in the following format, where each line is a document, encoded as word:frequency-in-the-document pairs separated by spaces; lines can be of variable length:
aword:3 bword:2 cword:15 dword:2
bword:4 cword:20 fword:1
etc...
E.g., in the first document, "aword" occurs 3 times. What I ultimately want to do is to create a little search engine, where the documents (in the same format) matching a query are ranked; I thought about using TfIdf and the tm package (based on this tutorial, which requires the data to be in the format of a TermDocumentMatrix: http://anythingbutrbitrary.blogspot.be/2013/03/build-search-engine-in-20-minutes-or.html). Normally I would just use tm's TermDocumentMatrix function on a corpus of text, but the catch here is that I already have these data indexed in this format (and I'd rather use these data, unless the format is truly something alien and cannot be converted).
What I've tried so far is to import the lines and split them:
docs <- scan("data.txt", what="", sep="\n")
doclist <- strsplit(docs, "[[:space:]]+")
I figured I would put something like this in a loop:
doclist2 <- strsplit(doclist, ":", fixed=TRUE)
and somehow get the paired values into an array, and then run a loop that populates a matrix (pre-filled with zeroes: matrix(0,x,y)) by fetching the appropriate values from the word:freq pairs (would that in itself be a good way to construct a matrix?). But this does not seem like a good way to do the conversion: the lists keep getting more complicated, and I still wouldn't know how to get to the point where I can populate the matrix.
What I (think I) would need in the end is a matrix like this:
       doc1 doc2 doc3 doc4 ...
aword     3    0    0    0
bword     2    4    0    0
cword    15   20    0    0
dword     2    0    0    0
fword     0    1    0    0
...
which I could then convert into a TermDocumentMatrix and get started with the tutorial. I have a feeling I am missing something very obvious here, something I probably cannot find because I don't know what these things are called (I've been googling for a day, on the theme of "term document vector/array/pairs", "two-dimensional array", "list into matrix" etc).
What would be a good way to get such a list of documents into a matrix of term-document frequencies? Alternatively, if the solution would be too obvious or doable with built-in functions: what is the actual term for the format that I described above, where there are those term:frequency pairs on a line, and each line is a document?
Here's an approach that gets you the output you say you might want:
## Your sample data
x <- c("aword:3 bword:2 cword:15 dword:2", "bword:4 cword:20 fword:1")
## Split on spaces and colons
B <- strsplit(x, "\\s+|:")
## Add names to your list to represent the source document
B <- setNames(B, paste0("document", seq_along(B)))
## Put everything together into a long matrix
out <- do.call(rbind, lapply(seq_along(B), function(i)
  cbind(document = names(B)[i],
        matrix(B[[i]], ncol = 2, byrow = TRUE,
               dimnames = list(NULL, c("word", "count"))))))
## Convert to a data.frame
out <- data.frame(out)
out
# document word count
# 1 document1 aword 3
# 2 document1 bword 2
# 3 document1 cword 15
# 4 document1 dword 2
# 5 document2 bword 4
# 6 document2 cword 20
# 7 document2 fword 1
## Make sure the counts column is a number
out$count <- as.numeric(as.character(out$count))
## Use xtabs to get the output you want
xtabs(count ~ word + document, out)
# document
# word document1 document2
# aword 3 0
# bword 2 4
# cword 15 20
# dword 2 0
# fword 0 1
Note: Answer edited to use matrices in the creation of "out" to minimize the number of calls to read.table which would be a major bottleneck with bigger data.
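The question's own idea of pre-filling a zero matrix and then populating it also works. The following is only a sketch under two assumptions: no word contains a colon, and no word appears twice on the same line. The names pairs, words, counts, vocab and tdm are made up for this illustration.

```r
x <- c("aword:3 bword:2 cword:15 dword:2", "bword:4 cword:20 fword:1")

# One character vector of "word:count" pairs per document
pairs  <- strsplit(x, "[[:space:]]+")
words  <- lapply(pairs, function(p) sub(":.*$", "", p))              # part before the colon
counts <- lapply(pairs, function(p) as.numeric(sub("^.*:", "", p)))  # part after the colon

# Zero-filled term-by-document matrix with one row per distinct word
vocab <- sort(unique(unlist(words)))
tdm <- matrix(0, nrow = length(vocab), ncol = length(x),
              dimnames = list(vocab, paste0("doc", seq_along(x))))

# Fill one column per document, indexing rows by word
for (j in seq_along(x)) tdm[words[[j]], j] <- counts[[j]]
tdm
```

Indexing the rows by name means that words missing from a document simply keep their initial zero.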
Suppose that I have a vector x whose elements I want to use to extract columns from a matrix or data frame M.
If x[1] = "A", I cannot use M$x[1] to extract the column with header name A, because M$A is recognized while M$"A" is not. How can I remove the quotes so that M$x[1] is M$A rather than M$"A" in this instance?
Don't use $ in this case; use [ instead. Here's a minimal example (if I understand what you're trying to do).
mydf <- data.frame(A = 1:2, B = 3:4)
mydf
# A B
# 1 1 3
# 2 2 4
x <- c("A", "B")
x
# [1] "A" "B"
mydf[, x[1]] ## As a vector
# [1] 1 2
mydf[, x[1], drop = FALSE] ## As a single column `data.frame`
# A
# 1 1
# 2 2
I think you would find your answer in the R Inferno. Start around Circle 8: "Believing it does as intended", one of the "string not the name" sub-sections. You might also find some explanation in this line from the help page at ?Extract: "The main difference is that $ does not allow computed indices, whereas [[ does."
Note that this approach is taken because the question specified using the approach to extract columns from a matrix or data frame, in which case, the [row, column] mode of extraction is really the way to go anyway (and the $ approach would not work with a matrix).
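To illustrate the ?Extract remark, a small sketch using the same mydf: [[ accepts a computed name on a data frame, and the [row, column] form carries over unchanged to a matrix. The matrix m is created here only for the illustration.

```r
mydf <- data.frame(A = 1:2, B = 3:4)
x <- c("A", "B")

# [[ evaluates its argument, so a name stored in a variable works
col_a <- mydf[[x[1]]]
col_a
# [1] 1 2

# The same name-based column indexing on a matrix
m <- as.matrix(mydf)
m_a <- m[, x[1]]

# $ looks for a column literally called "x", which does not exist
is.null(mydf$x)
# [1] TRUE
```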
Newbie R question. Sorry to ask: I'm sure it's been answered, but it's one that's hard to search, apparently. I've read the man page for var (variance), but I don't understand it. Checked books, web pages (OK, only two books). I'll wait for someone to point me to an existing answer ....
> df
first second
1 1 3
2 2 5
3 3 7
> df[,2]
[1] 3 5 7
> var(df[,2])
[1] 4
OK, so far, so good.
> df[1,]
first second
1 1 3
> var(df[1,])
first second
first NA NA
second NA NA
Huh??
Thanks in advance.
The first issue is that you get a different class of object when you select a row from a data.frame than when you select a column:
df = data.frame(first=c(1, 2, 3), second=c(3, 5, 7))
class(df[, 2])
# [1] "numeric"
class(df[1, ])
# [1] "data.frame"
# But you can explicitly convert with as.numeric.
var(as.numeric(df[1, ]))
# [1] 2
The second issue is that var() treats a data.frame quite differently. It treats each column as a variable and computes a matrix of variances and covariances by comparing each column to every other column:
# Create a data frame with some random data.
dat = data.frame(first=rnorm(20), second=rnorm(20), third=rnorm(20))
var(dat)
# first second third
# first 0.98363062 -0.2453755 0.04255154
# second -0.24537550 1.1177863 -0.16445768
# third 0.04255154 -0.1644577 0.58928970
var(dat$third)
# [1] 0.5892897
cov(dat$first, dat$second)
# [1] -0.2453755
If you know that a data.frame is all numeric and want it to be available for both row and column operations, then convert it to a matrix:
dm <- data.matrix(df) # df is the small data frame from the question
var(dm[1,])
# [1] 2
(The same effect occurs when you use apply(): the data frame is coerced to a matrix, so the list structure is lost and every row is converted to a single common type.)
> apply(dat, 1, var)
[1] 0.45998066 1.51241166 0.13634927 0.49981030 0.04440448 1.21224067 0.28113135 0.57968597
[9] 0.26102036 0.41999510 1.01237100 0.17304770 0.50572223 1.17825272 1.39342510 2.94125062
[17] 1.18640684 2.15245595 3.06482195 0.96396008
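Since the output above depends on random data, here is a reproducible sketch with the small data frame from the question. It contrasts var() on the whole data frame (a covariance matrix of the columns) with the row-wise variances from apply(); col_cov and row_var are names chosen for this illustration.

```r
df <- data.frame(first = c(1, 2, 3), second = c(3, 5, 7))

# var() on a data frame: 2 x 2 variance-covariance matrix of the columns
col_cov <- var(df)
col_cov
#        first second
# first      1      2
# second     2      4

# Row-wise variances: apply() coerces df to a matrix, then runs var() on each row
row_var <- unname(apply(df, 1, var))
row_var
# [1] 2.0 4.5 8.0
```

The first element of row_var matches the value of 2 obtained for the first row above.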