I generated a dataset with two distinct columns: an ID column identifying a customer and a second column listing that customer's active products:
head(df_itemList)
ID PRD_LISTE
1 1 A,B,C
3 2 C,D
4 3 A,B
5 4 A,B,C,D,E
7 5 B,A,D
8 6 A,C,D
I only selected customers that own more than one product. In total I have 589,454 rows and there are 16 different products.
Next, I wrote the data.frame to a csv file like this:
df_itemList$ID <- NULL
colnames(df_itemList) <- c("itemList")
write.csv(df_itemList, "Basket_List_13-08-2020.csv", row.names = TRUE)
Then I read the csv file in basket format in order to apply the apriori algorithm as implemented in the arules package.
library(arules)
txn <- read.transactions(file="Basket_List_13-08-2020.csv",
rm.duplicates= TRUE, format="basket",sep=",",cols=1)
txn@itemInfo$labels <- gsub("\"","",txn@itemInfo$labels)
The summary-function yields the following output:
summary(txn)
transactions as itemMatrix in sparse format with
589455 rows (elements/itemsets/transactions) and
1737 columns (items) and a density of 0.0005757052
most frequent items:
A,C A,B C,F C,D
57894 32150 31367 29434
A,B,C (Other)
29035 409575
element (itemset/transaction) length distribution:
sizes
1
589455
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 1 1 1 1 1
includes extended item information - examples:
labels
1 G,H,I,A,B,C,D,F,J
2 G,H,I,A,B,C,F
3 G,H,I,A,B,K,D
includes extended transaction information - examples:
transactionID
1
2 1
3 3
Now, I tried to run the apriori-algorithm:
basket_rules <- apriori(txn, parameter = list(sup = 1e-15,
conf = 1e-15, minlen = 2, target="rules"))
This is the output:
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support minlen maxlen target ext
0.01 0.1 1 none FALSE TRUE 5 1e-15 2 10 rules TRUE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 0
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[1737 item(s), 589455 transaction(s)] done [0.20s].
sorting and recoding items ... [1737 item(s)] done [0.00s].
creating transaction tree ... done [0.16s].
checking subsets of size 1 done [0.00s].
writing ... [0 rule(s)] done [0.00s].
creating S4 object ... done [0.04s].
Even with a ridiculously low support and confidence, no rules are generated...
summary(basket_rules)
set of 0 rules
Is this really because of my dataset? Or was there a mistake in my code?
Your summary shows that the data is not read in correctly:
most frequent items:
A,C A,B C,F C,D
57894 32150 31367 29434
A,B,C (Other)
29035 409575
Looks like "A,C" is read as an item, but it should be two items "A" and "C". The separating character does not work. I assume that could be because of quotation marks in the file. Make sure that Basket_List_13-08-2020.csv looks correct. Also, you need to skip the first line (headers) using skip = 1 when you read the transactions.
@Michael I am now quite positive that there is something wrong with the .csv file I am reading in. Since others have experienced similar problems, my guess is that this is the common source of the error. Can you please describe what the .csv file should look like when it is read in?
When typing in data <- read.csv("file.csv", header = TRUE, sep = ",") I get the following data.frame:
X Prd
1 A
2 A,B
3 B,A
4 B
5 C
Is it correct that, if there are multiple products for a customer X, these products are all written in a single column? Or should they be written in different columns?
Furthermore, when writing txn <- read.transactions(file="Versicherungen2_ItemList_Short.csv", rm.duplicates = TRUE, format = "basket", sep = ",", cols = 1, skip = 1) and summary(txn), I see the following problem:
most frequent items:
A B C A,B B,A
1256 1235 456 235 125
(the numbers are chosen randomly)
So the read.transactions function differentiates between A,B and B,A... which makes me guess there is something wrong with the .csv file.
I want to print a named vector to the console with the names in a vertical, table-like format.
Method-1: print the vector as a data.frame; however, the label appears as a column name.
Method-2: write.table, but the text appears wrapped into the row names.
Please suggest another function if one is available. Please see the minimal example below for better understanding.
sv<-(1:4)
names(sv)<-c("Al", "BetaRay", "Gamma", "Zabracadabra")
sv
Al BetaRay Gamma Zabracadabra
1 2 3 4
print(data.frame(sv))
sv
Al 1
BetaRay 2
Gamma 3
Zabracadabra 4
write.table(data.frame(sv), col.names = FALSE, quote = FALSE)
Al 1
BetaRay 2
Gamma 3
Zabracadabra 4
Desired Output is:
Al 1
BetaRay 2
Gamma 3
Zabracadabra 4
One option is to use write.fwf from the gdata package; however, if you need the columns aligned, the widths need to be adjusted.
library(gdata)
df <- data.frame(c("Al", "BetaRay", "Gamma", "Zabracadabra"),c(1:4))
write.fwf(df,quote=F,colnames = F)
Al 1
BetaRay 2
Gamma 3
Zabracadabra 4
You are not completely clear about what you are trying to achieve here. From your desired output, I understand you are trying to write the table to the console cleanly. Your approach using base R is correct.
data
sv<-(1:4)
names(sv)<-c("Al", "BetaRay", "Gamma", "Zabracadabra")
# Method-1: print the vector as a data.frame.
# However, the label appears as a column name.
sv <- unname(data.frame(sv)) # as suggested by @DirtySockSniffer
dim(sv)
[1] 4 1
Note: you have one column of data with 4 rows. What you are seeing on the left are row names;
I hope you are not being misled by this when you say the "label appears as a column name."
print(sv) # is exactly doing what you need
Al 1
BetaRay 2
Gamma 3
Zabracadabra 4
# Method-2: write.table. But the text appears wrapped into the row names.
Note: The text is not wrapped into the row names; it just appears that way. write.table's main purpose is to write a data frame to a file, and I don't think it is recommended for printing data to the console. However, when you specify a separator it can show some differentiation, but it is not an efficient way.
write.table(sv, col.names = FALSE, quote = FALSE, sep = "\t") # use a tab separator
Al 1
BetaRay 2
Gamma 3
Zabracadabra 4
# Other options require you to use some package to print cleanly to console.
# install.packages("knitr")
require(knitr)
kable(sv)
| | sv|
|:------------|--:|
|Al | 1|
|BetaRay | 2|
|Gamma | 3|
|Zabracadabra | 4|
The data.table way (more efficient, in my opinion, if you wish to do any further processing on your data frame): data.table does not support row names, so numeric row names are generated by default, and you are forced to keep your data frame's row names in an rn column. But it writes cleanly to the console.
require(data.table)
data.table(sv,keep.rownames = T)
rn sv
1: Al 1
2: BetaRay 2
3: Gamma 3
4: Zabracadabra 4
I have a large data set in the following format, where each line is a document, encoded as word:frequency-in-the-document pairs separated by spaces; lines can be of variable length:
aword:3 bword:2 cword:15 dword:2
bword:4 cword:20 fword:1
etc...
E.g., in the first document, "aword" occurs 3 times. What I ultimately want to do is to create a little search engine, where the documents (in the same format) matching a query are ranked. I thought about using TfIdf and the tm package (based on this tutorial, which requires the data to be in the format of a TermDocumentMatrix: http://anythingbutrbitrary.blogspot.be/2013/03/build-search-engine-in-20-minutes-or.html). Otherwise, I would just use tm's TermDocumentMatrix function on a corpus of text, but the catch here is that I already have these data indexed in this format (and I'd rather use these data, unless the format is truly alien and cannot be converted).
What I've tried so far is to import the lines and split them:
docs <- scan("data.txt", what="", sep="\n")
doclist <- strsplit(docs, "[[:space:]]+")
I figured I would put something like this in a loop:
doclist2 <- strsplit(doclist, ":", fixed=TRUE)
and somehow get the paired values into an array, and then run a loop that populates a matrix (pre-filled with zeroes: matrix(0, x, y)) by fetching the appropriate values from the word:freq pairs (would that in itself be a good way to construct the matrix?). But this way of converting does not seem like a good approach; the lists keep getting more complicated, and I still wouldn't know how to get to the point where I can populate the matrix.
What I (think I) would need in the end is a matrix like this:
      doc1 doc2 doc3 doc4 ...
aword    3    0    0    0
bword    2    4    0    0
cword   15   20    0    0
dword    2    0    0    0
fword    0    1    0    0
...
which I could then convert into a TermDocumentMatrix and get started with the tutorial. I have a feeling I am missing something very obvious here, something I probably cannot find because I don't know what these things are called (I've been googling for a day on themes like "term document vector/array/pairs", "two-dimensional array", "list into matrix", etc.).
What would be a good way to get such a list of documents into a matrix of term-document frequencies? Alternatively, if the solution would be too obvious or doable with built-in functions: what is the actual term for the format that I described above, where there are those term:frequency pairs on a line, and each line is a document?
Here's an approach that gets you the output you say you might want:
## Your sample data
x <- c("aword:3 bword:2 cword:15 dword:2", "bword:4 cword:20 fword:1")
## Split on spaces and colons
B <- strsplit(x, "\\s+|:")
## Add names to your list to represent the source document
B <- setNames(B, paste0("document", seq_along(B)))
## Put everything together into a long matrix
out <- do.call(rbind, lapply(seq_along(B), function(x)
cbind(document = names(B)[x], matrix(B[[x]], ncol = 2, byrow = TRUE,
dimnames = list(NULL, c("word", "count"))))))
## Convert to a data.frame
out <- data.frame(out)
out
# document word count
# 1 document1 aword 3
# 2 document1 bword 2
# 3 document1 cword 15
# 4 document1 dword 2
# 5 document2 bword 4
# 6 document2 cword 20
# 7 document2 fword 1
## Make sure the counts column is a number
out$count <- as.numeric(as.character(out$count))
## Use xtabs to get the output you want
xtabs(count ~ word + document, out)
# document
# word document1 document2
# aword 3 0
# bword 2 4
# cword 15 20
# dword 2 0
# fword 0 1
Note: the answer was edited to use matrices in the creation of "out" in order to minimize the number of data.frame calls, which would be a major bottleneck with bigger data.
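If you then want the tm object mentioned in the question, one possible continuation is to coerce the table into a sparse TermDocumentMatrix (a sketch assuming the tm and slam packages; not part of the original answer):
library(tm)    # as.TermDocumentMatrix and the weighting functions
library(slam)  # sparse simple_triplet_matrix representation
## Terms in rows, documents in columns, as in the xtabs result above
m <- as.matrix(xtabs(count ~ word + document, out))
tdm <- as.TermDocumentMatrix(as.simple_triplet_matrix(m), weighting = weightTf)
tdm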
I have a data frame A in the following format
user item
10000000 1 # each user is an 8-digit integer, item is up to a 5-digit integer
10000000 2
10000000 3
10000001 1
10000001 4
..............
What I want is a list B, with the user IDs as the names of the list elements, where each element is a vector of the items belonging to that user,
e.g.
B = list(c(1,2,3),c(1,4),...)
I also need to attach names to B. To apply association rule learning, the items need to be converted to characters.
Originally I used tapply(A$user, A$item, c), but this makes it incompatible with the association rule package. See my post:
data format error in association rule learning R
But @sgibb's solution also seems to generate an array, not a list.
library("arules")
temp <- as(C, "transactions") # C is the output of @sgibb's solution
throws error: Error in as(C, "transactions") :
no method or default for coercing “array” to “transactions”
Have a look at tapply:
df <- read.table(textConnection("
user item
10000000 1
10000000 2
10000000 3
10000001 1
10000001 4"), header=TRUE)
B <- tapply(df$item, df$user, FUN=as.character)
B
# $`10000000`
# [1] "1" "2" "3"
#
# $`10000001`
# [1] "1" "4"
EDIT: I do not know the arules package, but here the solution proposed by #alexis_laz:
library("arules")
as(split(df$item, df$user), "transactions")
# transactions in sparse format with
# 2 transactions (rows) and
# 4 items (columns)
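To also get the plain list B with user names and character items that the question asks for, a small variation on the same split() idea (a hedged sketch, not part of the original answer):
## Named list: one character vector of items per user
B <- lapply(split(df$item, df$user), as.character)
## The same list coerces directly to transactions
txn <- as(B, "transactions")
inspect(txn)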
I have a problem with the following task.
I was doing it in Excel, but the workload is too high, so now I want to do it automatically in R.
I have different models of washing machines.
For every model, I have a data.frame with all required components. For one model as an example:
Component = c("A","B","C","D","E","F","G","H","I","J")
Number = c(1,1,1,2,4,1,1,1,2,3)
Model.A= data.frame(Component,Quantity)
As the second piece of information, I have a data.frame with all components used across all models, together with the current stock of each component.
Component = c("A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P","Q","R","S","T","U","V","W","X","Y","Z")
Stock = c(100,102,103,105,1800,500,600,400,50,80,700,900,600,520,35,65,78,95,92,50,36,34,96,74,5,76)
Comp.Stock = data.frame(Component,Stock)
The third and last piece of information is the weekly production plan. I have 4 weekly production plans, i.e. a plan for one month: a data.frame with the washing machine models that will be produced in the next 4 weeks and their quantities.
pr.Models= c("MODEL.A","MODEL.B","MODEL.C","MODEL.D")
Quantity= c(15000,1000,18000,16000,5000)
Production= data.frame(pr.Models,Quantity)
My problem now is to combine this information so that I can compare the models that get produced (the last piece of information) with the components: first with the components used by each model on its own, and in addition with the data.frame that holds all components and their stock.
The aim is to get a warning if the component stock is not large enough to produce the models in the production plan.
Hint: many of the same components are used by different models.
Hopefully you understand what I mean and can help me with this problem.
Thank you =)
EDIT:
I cannot follow all of your steps.
Maybe this idea is also good, but I need a hint on how to do it:
Maybe it is possible to merge every produced model (Production) with its used components (taking into account the production quantity and the number needed per washing machine).
My preferred output would be to automatically get data frames for every produced model with the needed components.
In the next step it should be possible to merge these data with Comp.Stock to see how often each component is needed and to compare that with the stock.
Do you have any ideas along these lines?
Maybe I am just not seeing the presented way... I really need an automatic approach, because there are more than 4k different components and more than 180 different models of washing machines.
Thank you.
(And Comp.Stock additionally combined with all used models and their quantities from Production.)
You need to have the model name as a column in the first data.frame (to match Production)
Model.A$pr.Models <- 'MODEL.A'
Then you can merge. Note that there are two "Quantity" columns, and you don't want to merge by those:
merged <- merge(merge(Model.A, Comp.Stock),Production, by='pr.Models')
Extra is how many you will have on-hand after production:
transform(transform(merged, Needed = Quantity.x * Quantity.y), Extra = Stock - Needed)
## pr.Models Component Quantity.x Stock Quantity.y Needed Extra
## 1 MODEL.A A 1 100 15000 15000 -14900
## 2 MODEL.A B 1 102 15000 15000 -14898
## 3 MODEL.A C 1 103 15000 15000 -14897
## 4 MODEL.A D 2 105 15000 30000 -29895
## 5 MODEL.A E 4 1800 15000 60000 -58200
## 6 MODEL.A F 1 500 15000 15000 -14500
## 7 MODEL.A G 1 600 15000 15000 -14400
## 8 MODEL.A H 1 400 15000 15000 -14600
## 9 MODEL.A I 2 50 15000 30000 -29950
## 10 MODEL.A J 3 80 15000 45000 -44920
If Extra is negative, you'll need more parts. You're seriously deficient.
transform(transform(merged, Needed = Quantity.x * Quantity.y), Extra = Stock - Needed)$Extra < 0
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Not enough of any part.
As a function:
Not.Enough.Parts <- function(Model, Comp.Stock, Production) {
Model$pr.Models <- toupper(substitute(Model))
merged <- merge(merge(Model, Comp.Stock),Production, by='pr.Models')
extra <- transform(transform(merged, Needed = Quantity.x * Quantity.y), Extra = Stock - Needed)
retval <- extra$Extra < 0
names(retval) <- extra$Component
return(retval)
}
Not.Enough.Parts(Model.A, Comp.Stock, Production)
## A B C D E F G H I J
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
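To cover the whole production plan at once (the EDIT mentions more than 4k components and 180 models), here is a hedged sketch that assumes the per-model data frames are collected in a named list called boms, e.g. boms <- list(MODEL.A = Model.A, MODEL.B = Model.B, ...) (the name boms is hypothetical):
## Stack all bills of materials into one long data frame with the model name attached
all_boms <- do.call(rbind, lapply(names(boms), function(m)
  cbind(pr.Models = m, boms[[m]])))
## Join with the production plan and the stock, then sum the need per component
merged <- merge(merge(all_boms, Production, by = "pr.Models"), Comp.Stock, by = "Component")
merged$Needed <- merged$Quantity.x * merged$Quantity.y  # per-machine quantity * planned machines
need <- aggregate(Needed ~ Component + Stock, data = merged, FUN = sum)
need$Extra <- need$Stock - need$Needed
## Components whose stock is not sufficient for the whole plan
need[need$Extra < 0, ]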
I am trying to perform an association using the snpStats package.
I have a snp matrix called 'plink' which contains my genotype data (as
a list of $genotypes, $map, $fam), and plink$genotype has: SNP names as column names (2 SNPs) and the subject identifiers as the row names:
plink$genotype
SnpMatrix with 6 rows and 2 columns
Row names: 1 ... 6
Col names: 203 204
The plink dataset can be reproduced by copying the following ped and map files and saving them as 'plink.ped' and 'plink.map' respectively:
plink.ped:
1 1 0 0 1 -9 A A G G
2 2 0 0 2 -9 G A G G
3 3 0 0 1 -9 A A G G
4 4 0 0 1 -9 A A G G
5 5 0 0 1 -9 A A G G
6 6 0 0 2 -9 G A G G
plink.map:
1 203 0 792429
2 204 0 819185
And then use plink in this way:
./plink --file plink --make-bed
#----------------------------------------------------------#
| PLINK! | v1.07 | 10/Aug/2009 |
|----------------------------------------------------------|
| (C) 2009 Shaun Purcell, GNU General Public License, v2 |
|----------------------------------------------------------|
| For documentation, citation & bug-report instructions: |
| http://pngu.mgh.harvard.edu/purcell/plink/ |
#----------------------------------------------------------#
Web-based version check ( --noweb to skip )
Recent cached web-check found...Problem connecting to web
Writing this text to log file [ plink.log ]
Analysis started: Tue Nov 29 18:08:18 2011
Options in effect:
--file /ugi/home/claudiagiambartolomei/Desktop/plink
--make-bed
2 (of 2) markers to be included from [ /ugi/home/claudiagiambartolomei/Desktop/plink.map ]
6 individuals read from [ /ugi/home/claudiagiambartolomei/Desktop/plink.ped ]
0 individuals with nonmissing phenotypes
Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)
Missing phenotype value is also -9
0 cases, 0 controls and 6 missing
4 males, 2 females, and 0 of unspecified sex
Before frequency and genotyping pruning, there are 2 SNPs
6 founders and 0 non-founders found
Total genotyping rate in remaining individuals is 1
0 SNPs failed missingness test ( GENO > 1 )
0 SNPs failed frequency test ( MAF < 0 )
After frequency and genotyping pruning, there are 2 SNPs
After filtering, 0 cases, 0 controls and 6 missing
After filtering, 4 males, 2 females, and 0 of unspecified sex
Writing pedigree information to [ plink.fam ]
Writing map (extended format) information to [ plink.bim ]
Writing genotype bitfile to [ plink.bed ]
Using (default) SNP-major mode
Analysis finished: Tue Nov 29 18:08:18 2011
I also have a phenotype data frame which contains the outcomes (outcome1, outcome2,...) I would like to associate with the genotype, which is this:
ID<- 1:6
sex<- rep(1,6)
age<- c(59,60,54,48,46,50)
bmi<- c(26,28,22,20,23, NA)
ldl<- c(5, 3, 5, 4, 2, NA)
pheno<- data.frame(ID,sex,age,bmi,ldl)
The association works for the single outcomes when I do this (using the function snp.rhs.tests):
bmi<-snp.rhs.tests(bmi~sex+age,family="gaussian", data=pheno, snp.data=plink$genotype)
My question is, how do I loop through the outcomes? This type of data seems different from all the others I have worked with, and I am having trouble manipulating it, so I would also be grateful for suggestions of tutorials that can help me understand how to do this and other manipulations, such as subsetting the snp.matrix data, for example.
This is what I have tried for the loop:
rhs <- function(x) {
x<- snp.rhs.tests(x, family="gaussian", data=pheno,
snp.data=plink$genotype)
}
res_ <- apply(pheno,2,rhs)
Error in x$terms : $ operator is invalid for atomic vectors
Then I tried this:
for (cov in names(pheno)) {
association<-snp.rhs.tests(cov, family="gaussian",data=pheno, snp.data=plink$genotype)
}
Error in eval(expr, envir, enclos) : object 'bmi' not found
Thank you as usual for your help!
-f
The author of snpStats is David Clayton. Although the website listed in the package description is wrong, he is still at that domain and it's possible to do a search for documentation with the advanced search feature of Google with this specification:
snpStats site:https://www-gene.cimr.cam.ac.uk/staff/clayton/
The likely reason for your difficulty with access is that this is an S4 package and the methods for access are different. Instead of print methods S4 objects typically have show-methods. There is a vignette on the package here: https://www-gene.cimr.cam.ac.uk/staff/clayton/courses/florence11/practicals/practical6.pdf , and the directory for his entire short course is open for access: https://www-gene.cimr.cam.ac.uk/staff/clayton/courses/florence11/
It becomes clear that the object returned from snp.rhs.tests can be accessed with "[" using sequential numbers or names, as illustrated on p. 7. You can get the names:
# Using the example on the help(snp.rhs.tests) page:
> names(slt3)
[1] "173760" "173761" "173762" "173767" "173769" "173770" "173772" "173774"
[9] "173775" "173776"
The things you may be calling columns are probably "slots":
> getSlots(class(slt3))
snp.names var.names chisq df N
"ANY" "character" "numeric" "integer" "integer"
> str(getSlots(class(slt3)))
Named chr [1:5] "ANY" "character" "numeric" "integer" "integer"
- attr(*, "names")= chr [1:5] "snp.names" "var.names" "chisq" "df" ...
> names(getSlots(class(slt3)))
[1] "snp.names" "var.names" "chisq" "df" "N"
But there is no [i,j] method for looping over those slot names. You should instead go to the help page ?"GlmTests-class", which lists the methods defined for that S4 class.
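For illustration, a brief sketch of the "[" access described above, reusing the slt3 object from the help(snp.rhs.tests) example (output omitted):
slt3[1:3]               # subset the tests by position
slt3["173760"]          # a single test, by SNP name
slt3[names(slt3)[1:2]]  # or by a vector of names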
The correct way to do what the initial poster required is:
for (i in 1:ncol(pheno)) {
  association <- snp.rhs.tests(pheno[,i], family="gaussian", snp.data=plink$genotype)
}
The documentation of snp.rhs.tests() says that if data is missing, the phenotype is taken from the parent frame - or maybe it was worded in the opposite sense: if data is specified, the phenotype is evaluated in the specified data.frame.
This is a clearer version:
for (i in 1:ncol(pheno)) {
  cc <- pheno[,i]
  association <- snp.rhs.tests(cc, family="gaussian", snp.data=plink$genotype)
}
The documentation says data=parent.frame() is the default in snp.rhs.tests().
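Building on that, a hedged sketch (not from the original answer) that loops over only the outcome columns, keeps sex and age as covariates as in the original call, and stores each result in a named list:
outcomes <- c("bmi", "ldl")                  # the outcome columns in pheno
results <- lapply(outcomes, function(v) {
  f <- as.formula(paste(v, "~ sex + age"))   # e.g. bmi ~ sex + age
  snp.rhs.tests(f, family = "gaussian", data = pheno,
                snp.data = plink$genotype)
})
names(results) <- outcomes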
There is a glaring error in the apply() code: please do not do x <- some.fun(x), as it can do very bad things. Try this instead: drop the data= argument and use a different variable name.
rhs <- function(x) {
y<- snp.rhs.tests(x, family="gaussian", snp.data=plink$genotype)
}
res_ <- apply(pheno,2,rhs)
Also, the initial poster's question is misleading.
plink$genotype is an S4 object; pheno is a data.frame (an S3 object). You really just want to select columns in an S3 data.frame, but you are thrown off course by how snp.rhs.tests() looks for the columns (if a data.frame is given) or for a vector phenotype (if it is given as a plain vector, i.e. in the parent frame, or your "current" frame, since the subroutine is evaluated in a "child" frame).