I am trying to perform an association using the snpStats package.
I have a snp matrix called 'plink' which contains my genotype data (as
a list of $genotypes, $map, $fam), and plink$genotype has: SNP names as column names (2 SNPs) and the subject identifiers as the row names:
plink$genotype
SnpMatrix with 6 rows and 2 columns
Row names: 1 ... 6
Col names: 203 204
The plink dataset can be reproduced by copying the following ped and map files and saving them as 'plink.ped' and 'plink.map' respectively:
plink.ped:
1 1 0 0 1 -9 A A G G
2 2 0 0 2 -9 G A G G
3 3 0 0 1 -9 A A G G
4 4 0 0 1 -9 A A G G
5 5 0 0 1 -9 A A G G
6 6 0 0 2 -9 G A G G
plink.map:
1 203 0 792429
2 204 0 819185
And then use plink in this way:
./plink --file plink --make-bed
#----------------------------------------------------------#
| PLINK! | v1.07 | 10/Aug/2009 |
|----------------------------------------------------------|
| (C) 2009 Shaun Purcell, GNU General Public License, v2 |
|----------------------------------------------------------|
| For documentation, citation & bug-report instructions: |
| http://pngu.mgh.harvard.edu/purcell/plink/ |
#----------------------------------------------------------#
Web-based version check ( --noweb to skip )
Recent cached web-check found...Problem connecting to web
Writing this text to log file [ plink.log ]
Analysis started: Tue Nov 29 18:08:18 2011
Options in effect:
--file /ugi/home/claudiagiambartolomei/Desktop/plink
--make-bed
2 (of 2) markers to be included from [ /ugi/home/claudiagiambartolomei/Desktop /plink.map ]
6 individuals read from [ /ugi/home/claudiagiambartolomei/Desktop/plink.ped ]
0 individuals with nonmissing phenotypes
Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)
Missing phenotype value is also -9
0 cases, 0 controls and 6 missing
4 males, 2 females, and 0 of unspecified sex
Before frequency and genotyping pruning, there are 2 SNPs
6 founders and 0 non-founders found
Total genotyping rate in remaining individuals is 1
0 SNPs failed missingness test ( GENO > 1 )
0 SNPs failed frequency test ( MAF < 0 )
After frequency and genotyping pruning, there are 2 SNPs
After filtering, 0 cases, 0 controls and 6 missing
After filtering, 4 males, 2 females, and 0 of unspecified sex
Writing pedigree information to [ plink.fam ]
Writing map (extended format) information to [ plink.bim ]
Writing genotype bitfile to [ plink.bed ]
Using (default) SNP-major mode
Analysis finished: Tue Nov 29 18:08:18 2011
I also have a phenotype data frame which contains the outcomes (outcome1, outcome2,...) I would like to associate with the genotype, which is this:
ID<- 1:6
sex<- rep(1,6)
age<- c(59,60,54,48,46,50)
bmi<- c(26,28,22,20,23, NA)
ldl<- c(5, 3, 5, 4, 2, NA)
pheno<- data.frame(ID,sex,age,bmi,ldl)
The association works for a single outcome when I do this (using the function snp.rhs.tests):
bmi<-snp.rhs.tests(bmi~sex+age,family="gaussian", data=pheno, snp.data=plink$genotype)
My question is, how do I loop through the outcomes? This type of data
seems different from all the others and I am having trouble
manipulating it, so I would also be grateful if you have suggestions
of some tutorials that can help me understand how to do this and other
manipulations such as subsetting the snp.matrix data for example.
This is what I have tried for the loop:
rhs <- function(x) {
x<- snp.rhs.tests(x, family="gaussian", data=pheno,
snp.data=plink$genotype)
}
res_ <- apply(pheno,2,rhs)
Error in x$terms : $ operator is invalid for atomic vectors
Then I tried this:
for (cov in names(pheno)) {
association<-snp.rhs.tests(cov, family="gaussian",data=pheno, snp.data=plink$genotype)
}
Error in eval(expr, envir, enclos) : object 'bmi' not found
Thank you as usual for your help!
-f
The author of snpStats is David Clayton. Although the website listed in the package description is wrong, he is still at that domain and it's possible to do a search for documentation with the advanced search feature of Google with this specification:
snpStats site:https://www-gene.cimr.cam.ac.uk/staff/clayton/
The likely reason for your difficulty with access is that this is an S4 package and the methods for access are different. Instead of print methods S4 objects typically have show-methods. There is a vignette on the package here: https://www-gene.cimr.cam.ac.uk/staff/clayton/courses/florence11/practicals/practical6.pdf , and the directory for his entire short course is open for access: https://www-gene.cimr.cam.ac.uk/staff/clayton/courses/florence11/
It becomes clear that the object returned from snp.rhs.tests can be accessed with "[" using sequential numbers or names, as illustrated on p. 7. You can get the names:
# Using the example on the help(snp.rhs.tests) page:
> names(slt3)
[1] "173760" "173761" "173762" "173767" "173769" "173770" "173772" "173774"
[9] "173775" "173776"
The things you may be calling columns are probably "slots"
> getSlots(class(slt3))
snp.names var.names chisq df N
"ANY" "character" "numeric" "integer" "integer"
> str(getSlots(class(slt3)))
Named chr [1:5] "ANY" "character" "numeric" "integer" "integer"
- attr(*, "names")= chr [1:5] "snp.names" "var.names" "chisq" "df" ...
> names(getSlots(class(slt3)))
[1] "snp.names" "var.names" "chisq" "df" "N"
But there is no [i,j] method for looping over those slot names. You should instead go to the help page ?"GlmTests-class" which lists the methods defined for that S4 class.
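To make the slot idea concrete, here is a minimal sketch using a toy S4 class. The class name ToyTests and its contents are made up for illustration and are not the snpStats classes; only the slot-access functions from the methods package are real.

```r
library(methods)

# Hypothetical toy class (NOT the snpStats GlmTests class), with a few
# slots named after the ones shown above.
setClass("ToyTests",
         slots = c(snp.names = "character",
                   chisq     = "numeric",
                   df        = "integer"))

toy <- new("ToyTests",
           snp.names = c("rs1", "rs2"),
           chisq     = c(1.2, 3.4),
           df        = c(1L, 1L))

# List the slot names of an S4 object.
slotNames(toy)

# Slots are read with @ or slot(), not with $ or [i, j].
toy@chisq
slot(toy, "df")
```

Reaching into slots directly is mainly useful for exploration; where the package documents accessor methods for a class, those are usually the safer route.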
The correct way to do what the initial poster required is:
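association <- list()
for (i in seq_len(ncol(pheno))) { # ncol(pheno) alone would visit only the last column
association[[names(pheno)[i]]] <- snp.rhs.tests(pheno[,i], family="gaussian", snp.data=plink$genotype)
}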
The documentation of snp.rhs.tests() says that if data is missing, the phenotype is taken from the parent frame - or maybe it was worded in the opposite sense: if data is specified, the phenotype is evaluated in the specified data.frame.
This is a clearer version:
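association <- list()
for (i in seq_len(ncol(pheno))) { # ncol(pheno) alone would visit only the last column
cc <- pheno[,i]
association[[names(pheno)[i]]] <- snp.rhs.tests(cc, family="gaussian", snp.data=plink$genotype)
}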
The documentation says data=parent.frame() is the default in snp.rhs.tests().
There is a glaring error in the apply() code - Please do not do x <- some.fun(x), as it does very bad things. Try this instead - drop the data=, and use a different variable name.
rhs <- function(x) {
y<- snp.rhs.tests(x, family="gaussian", snp.data=plink$genotype)
}
res_ <- apply(pheno,2,rhs)
Also the initial poster's question is misleading.
plink$genotype is an S4 object, whereas pheno is a data.frame (an S3 object). You really just want to select columns in an S3 data.frame, but you are thrown off course by how snp.rhs.tests() looks up the phenotype: as a column of the data.frame if data= is given, or otherwise as a plain vector in the parent frame (your "current" frame, since the function body is evaluated in a "child" frame).
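Another option, offered only as a hedged sketch, is to keep data=pheno and build one formula per outcome with reformulate() from base R. The outcome and covariate names come from the question. The snp.rhs.tests() call is guarded, so it runs only if snpStats is installed and an object named plink exists in the session (an assumption). It has not been checked against a real SnpMatrix here.

```r
# Phenotype data from the question.
pheno <- data.frame(ID  = 1:6,
                    sex = rep(1, 6),
                    age = c(59, 60, 54, 48, 46, 50),
                    bmi = c(26, 28, 22, 20, 23, NA),
                    ldl = c(5, 3, 5, 4, 2, NA))

outcomes   <- c("bmi", "ldl")   # columns to use as responses
covariates <- c("sex", "age")   # right-hand side of the base model

# One formula per outcome, e.g. bmi ~ sex + age.
formulas <- lapply(setNames(outcomes, outcomes),
                   function(y) reformulate(covariates, response = y))
print(formulas)

# Guarded: only attempted when snpStats and the plink object are available.
if (requireNamespace("snpStats", quietly = TRUE) && exists("plink")) {
  results <- lapply(formulas, function(f)
    snpStats::snp.rhs.tests(f, family = "gaussian",
                            data = pheno, snp.data = plink$genotype))
}
```

Passing a character string such as "bmi" fails because the first argument is expected to be a formula. Building the formula explicitly avoids relying on where a bare vector is looked up.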
Related
I generated a dataset holding two distinct columns: an ID column associated to a customer and another column associated to his/her active products:
head(df_itemList)
ID PRD_LISTE
1 1 A,B,C
3 2 C,D
4 3 A,B
5 4 A,B,C,D,E
7 5 B,A,D
8 6 A,C,D
I only selected customers that own more than one product. In total I have 589.454 rows and there are 16 different products.
Next, I wrote the data.frame to a csv file like this:
df_itemList$ID <- NULL
colnames(df_itemList) <- c("itemList")
write.csv(df_itemList, "Basket_List_13-08-2020.csv", row.names = TRUE)
Then, I converted the csv-file into a basket format in order to apply the apriori algorithm as implemented in the arules-package.
library(arules)
txn <- read.transactions(file="Basket_List_13-08-2020.csv",
rm.duplicates= TRUE, format="basket",sep=",",cols=1)
txn@itemInfo$labels <- gsub("\"","",txn@itemInfo$labels)
The summary-function yields the following output:
summary(txn)
transactions as itemMatrix in sparse format with
589455 rows (elements/itemsets/transactions) and
1737 columns (items) and a density of 0.0005757052
most frequent items:
A,C A,B C,F C,D
57894 32150 31367 29434
A,B,C (Other)
29035 409575
element (itemset/transaction) length distribution:
sizes
1
589455
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 1 1 1 1 1
includes extended item information - examples:
labels
1 G,H,I,A,B,C,D,F,J
2 G,H,I,A,B,C,F
3 G,H,I,A,B,K,D
includes extended transaction information - examples:
transactionID
1
2 1
3 3
Now, I tried to run the apriori-algorithm:
basket_rules <- apriori(txn, parameter = list(sup = 1e-15,
conf = 1e-15, minlen = 2, target="rules"))
This is the output:
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support minlen maxlen target ext
0.01 0.1 1 none FALSE TRUE 5 1e-15 2 10 rules TRUE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 0
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[1737 item(s), 589455 transaction(s)] done [0.20s].
sorting and recoding items ... [1737 item(s)] done [0.00s].
creating transaction tree ... done [0.16s].
checking subsets of size 1 done [0.00s].
writing ... [0 rule(s)] done [0.00s].
creating S4 object ... done [0.04s].
Even with a ridiculously low support and confidence, no rules are generated...
summary(basket_rules)
set of 0 rules
Is this really because of my dataset? Or was there a mistake in my code?
Your summary shows that the data is not read in correctly:
most frequent items:
A,C A,B C,F C,D
57894 32150 31367 29434
A,B,C (Other)
29035 409575
Looks like "A,C" is read as an item, but it should be two items "A" and "C". The separating character does not work. I assume that could be because of quotation marks in the file. Make sure that Basket_List_13-08-2020.csv looks correct. Also, you need to skip the first line (headers) using skip = 1 when you read the transactions.
@Michael I am quite positive now that there is something wrong with the .csv file I am reading in. Since others have experienced similar problems, my guess is that this is a common cause of the error. Can you please describe what the .csv file should look like when it is read in?
When typing in data <- read.csv("file.csv", header = TRUE, sep = ",") I get the following data.frame:
X Prd
1 A
2 A,B
3 B,A
4 B
5 C
Is it correct that, if there are multiple products for a customer X, these products are all written in a single column? Or should they be written in different columns?
Furthermore, when writing txn <- read.transactions(file="Versicherungen2_ItemList_Short.csv", rm.duplicates= TRUE, format="basket",sep=",",cols=1, skip=1) and summary(txn) I see the following problem:
most frequent items:
A B C A,B B,A
1256 1235 456 235 125
(numbers are chosen randomly)
So the read.transactions function differentiates between A,B and B,A... So I am guessing there is something wrong with the .csv file.
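As a rough sketch of what a basket file could look like: one transaction per line, items separated by commas, with no header, no row names and no quotation marks. The three example rows below are taken from the question. Writing the file with writeLines() instead of write.csv() is an assumption about how to get that layout, not something the answer above prescribes. The arules call is guarded so the rest runs without the package.

```r
# Toy version of the item list from the question.
df_itemList <- data.frame(ID = 1:3,
                          PRD_LISTE = c("A,B,C", "C,D", "A,B"),
                          stringsAsFactors = FALSE)

# Write one basket per line: no header, no row names, no quotes.
basket_file <- tempfile(fileext = ".csv")
writeLines(df_itemList$PRD_LISTE, basket_file)

# Check the raw file and the individual items it contains.
readLines(basket_file)
items <- strsplit(readLines(basket_file), ",", fixed = TRUE)
item_labels <- sort(unique(unlist(items)))
item_labels

# Guarded: read the file as transactions only if arules is installed.
if (requireNamespace("arules", quietly = TRUE)) {
  library(arules)
  txn <- read.transactions(basket_file, format = "basket",
                           sep = ",", rm.duplicates = TRUE)
  print(itemLabels(txn))
}
```

With this layout there is no header line to skip and no ID column, so skip and cols are not needed. If "A,B" still shows up as a single item label, the separator is not being applied and the raw file is worth inspecting again.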
I have a data frame A in the following format
user item
10000000 1 # each user is a 8 digits integer, item is up to 5 digits integer
10000000 2
10000000 3
10000001 1
10000001 4
..............
What I want is a list B whose elements are named by user, where each element is a vector of the items belonging to that user.
e.g.
B = list(c(1,2,3),c(1,4),...)
I also need to attach the user names to B. To apply association rule learning, the items need to be converted to characters.
Originally I used tapply(A$user,A$item, c), but the result is not compatible with the association rule package. See my post:
data format error in association rule learning R
But @sgibb's solution also seems to generate an array, not a list.
library("arules")
temp <- as(C, "transactions") # C is the output of @sgibb's solution
throws the error: Error in as(C, "transactions") :
no method or default for coercing “array” to “transactions”
Have a look at tapply:
df <- read.table(textConnection("
user item
10000000 1
10000000 2
10000000 3
10000001 1
10000001 4"), header=TRUE)
B <- tapply(df$item, df$user, FUN=as.character)
B
# $`10000000`
# [1] "1" "2" "3"
#
# $`10000001`
# [1] "1" "4"
EDIT: I do not know the arules package, but here is the solution proposed by @alexis_laz:
library("arules")
as(split(df$item, df$user), "transactions")
# transactions in sparse format with
# 2 transactions (rows) and
# 4 items (columns)
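The difference between the two approaches can be checked in base R. This is a small sketch using the toy data from the question: tapply() returns a one-dimensional array of list mode, while split() returns a plain named list. The coercion to transactions is guarded and only runs if arules is installed.

```r
# Toy data from the question, with integer user IDs.
df <- data.frame(user = c(10000000L, 10000000L, 10000000L, 10000001L, 10000001L),
                 item = c(1, 2, 3, 1, 4))

# tapply() keeps a dim attribute, so the result is an array (of mode list).
B_tapply <- tapply(as.character(df$item), df$user, FUN = c)
is.array(B_tapply)

# split() gives an ordinary named list with one character vector per user.
B_split <- split(as.character(df$item), df$user)
is.array(B_split)
names(B_split)

# Guarded: coerce the plain list to transactions if arules is available.
if (requireNamespace("arules", quietly = TRUE)) {
  library(arules)
  txn <- as(B_split, "transactions")
  print(txn)
}
```

The dim attribute on the tapply() result is what makes as() see an "array" rather than a "list".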
Why does the table function find a variable that was deleted?
Dog <- c("Rover", "Spot")
Cat <- c("Scratch", "Fluffy")
Pets <- data.frame(Dog, Cat) #create a data frame with two variables
names(Pets)
# [1] "Dog" "Cat"
#rename Dog to a longer name
names(Pets)[names(Pets)=="Dog"] <- "Dog_as_very_long_name"
Pets$Dog <- NULL # delete Dog
names(Pets)
#[1] "Dog_as_very_long_name" "Cat" #the variable dog is not in the data set anymore
table(Pets$Dog) # Why does table() still work on a variable that was deleted?
# Rover Spot
# 1 1
This is simply because of the partial matching that occurs in certain uses of $.
Try this:
> table(Pets$Ca)
Fluffy Scratch
1 1
Using the [[ notation instead will give you more control.
> table(Pets[["Ca"]])
< table of extent 0 >
> table(Pets[["Ca", exact = FALSE]])
Fluffy Scratch
1 1
You can use the options settings to give a warning when partial matches are used. Consider:
> options(warnPartialMatchDollar = TRUE)
> table(Pets$Ca)
Fluffy Scratch
1 1
Warning message:
In Pets$Ca : partial match of 'Ca' to 'Cat'
> table(Pets$Dog)
Rover Spot
1 1
Warning message:
In Pets$Dog : partial match of 'Dog' to 'Dog_as_very_long_name'
I'm trying to do some data manipulation in R. I have 2 data frames: one is training data, the other testing data. All the data is categorical and stored as factor variables.
There are some NA's in the data and I'm trying to convert them to "-1". When I do it for the training data, things go fine, but not for the test data.
Something changes the values during a loop I run but I can't figure out what.
Here's the before:
> class(catTrain1[,"Cat_111"])
[1] "factor"
> class(catTest1[,"Cat_111"])
[1] "factor"
> table(catTrain1[,"Cat_111"])
1 2
726 25
> table(catTest1[,"Cat_111"])
0 1 2
1 503 15
Here's the loop:
> for(i in 1:ncol(catTrain1)){
+ catTrain1[,i] <- as.factor(as.character(ifelse(is.na(catTrain1[,i]), "-1", catTrain1[,i])))
+ }
> for(i in 1:ncol(catTest1)){
+ catTest1[,i] <- as.factor(as.character(ifelse(is.na(catTest1[,i]), "-1", catTest1[,i])))
+ }
Here's the after:
> table(catTrain1[,"Cat_111"])
1 2
726 25
> table(catTest1[,"Cat_111"])
1 2 3
1 503 15
I've seen the shift up by one with character -> numeric conversions but I can't figure out why this is happening, especially for just one of the dataframes / loops.
Any suggestions?
The column names in your first set of calls to table are the levels of the factor. In the second set of calls to table, the column names are the level indexes: ifelse is pulling the integer indexes, not the levels. In your loops, wrap as.character() around the final catTest1[,i] and catTrain1[,i] inside the ifelse() call.
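A minimal sketch of that fix, on a made-up one-column data frame (the values are hypothetical, only the column name is borrowed from the question). The factor is converted to character before ifelse() sees it, so the labels are kept rather than the integer codes.

```r
# Hypothetical stand-in for catTest1 with one factor column containing an NA.
catTest1 <- data.frame(Cat_111 = factor(c("0", "1", "2", NA, "1")))

for (i in seq_len(ncol(catTest1))) {
  # Convert to character first so ifelse() works on labels, not codes.
  col_chr <- as.character(catTest1[, i])
  catTest1[, i] <- factor(ifelse(is.na(col_chr), "-1", col_chr))
}

tab <- table(catTest1[, "Cat_111"])
tab
```

The original labels "0", "1" and "2" survive, and the NA becomes a new "-1" level.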
Try this instead (more R-like, vectorized):
levels( catTest1[,"Cat_111"] ) <- c( levels( catTest1[,"Cat_111"] ), "-1")
catTest1[,"Cat_111"][ is.na(catTest1[,"Cat_111"]) ] <- "-1"
I'm using the "by" function in R to chop up a data frame and apply a function to different parts, like this:
pairwise.compare <- function(x) {
Nright <- ...
Nwrong <- ...
Ntied <- ...
return(c(Nright=Nright, Nwrong=Nwrong, Ntied=Ntied))
}
Z.by <- by(rankings, INDICES=list(rankings$Rater, rankings$Class), FUN=pairwise.compare)
The result (Z.by) looks something like this:
: 4
: 357
Nright Nwrong Ntied
3 0 0
------------------------------------------------------------
: 8
: 357
NULL
------------------------------------------------------------
: 10
: 470
Nright Nwrong Ntied
3 4 1
------------------------------------------------------------
: 11
: 470
Nright Nwrong Ntied
12 4 1
What I would like is to have this result converted into a data frame (with the NULL entries not present) so it looks like this:
Rater Class Nright Nwrong Ntied
1 4 357 3 0 0
2 10 470 3 4 1
3 11 470 12 4 1
How do I do that?
The by function returns a list, so you can do something like this:
data.frame(do.call("rbind", by(x, column, mean)))
Consider using ddply in the plyr package instead of by. It handles the work of adding the column to your dataframe.
Old thread, but for anyone who searches for this topic:
analysis = by(...)
data.frame(t(vapply(analysis,unlist,unlist(analysis[[1]]))))
unlist() will take an element of a by() output (in this case, analysis) and express it as a named vector.
vapply() applies unlist to all the elements of analysis and outputs the result. It requires a template argument to know the output type, which is what unlist(analysis[[1]]) is there for. You may need to add a check that analysis is not empty, if that is possible.
Each output will be a column, so t() transposes it to the desired orientation where each analysis entry becomes a row.
This expands upon Shane's solution of using rbind() but also adds columns identifying groups and removes NULL groups - two features which were requested in the question. By using base package functions, no other dependencies are required, e.g., plyr.
simplify_by_output = function(by_output) {
null_ind = unlist(lapply(by_output, is.null)) # by() returns NULL for combinations of grouping variables for which there are no data. rbind() ignores those, so you have to keep track of them.
by_df = do.call(rbind, by_output) # Combine the results into a data frame.
return(cbind(expand.grid(dimnames(by_output))[!null_ind, ], by_df)) # Add columns identifying groups, discarding names of groups for which no data exist.
}
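As a usage sketch, the same steps can be applied to a small made-up data set. The rankings values and the summarise_group helper below are hypothetical and only stand in for the real data and pairwise.compare from the question.

```r
# Hypothetical toy data: only some Rater/Class combinations occur.
rankings <- data.frame(Rater = c(4, 4, 10, 10, 11),
                       Class = c(357, 357, 470, 470, 470),
                       score = c(1, 2, 3, 4, 5))

# Stand-in for pairwise.compare(): returns a named numeric vector per group.
summarise_group <- function(d) c(n = nrow(d), total = sum(d$score))

by_out <- by(rankings,
             INDICES = list(Rater = rankings$Rater, Class = rankings$Class),
             FUN = summarise_group)

# Flag the empty combinations, which by() fills with NULL.
null_ind <- vapply(by_out, is.null, logical(1))

# Bind the non-NULL results and attach the matching group labels.
by_df  <- do.call(rbind, by_out)
groups <- expand.grid(dimnames(by_out))[!null_ind, ]
res    <- cbind(groups, by_df)
res
```

The group columns come back as factors because expand.grid() builds them from the dimnames, so convert them with as.character() or as.numeric(as.character()) if numeric IDs are needed.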
I would do
x = by(data, list(data$x, data$y), function(d) whatever(d))
array(x, dim(x), dimnames(x))