long int in dataframe conversion (data.matrix()) or mean() of factor value - r

I have this type of data.frame :
"id" "var1" "t" "x" "y" "z" "idconnect" "bool1"
924903565 16 64 104 133 87 940539767 1
924903564 14 64 131 95 87 940539931 1
924903563 22 64 135 248 86 924903449 1
but colMeans() and mean() don't work: they return NA, or a wrong value when I apply as.numeric() first. For example, I tried mean(mydata[mydata[, "idconnect"]==940539931, "x"]), but class(mydata[mydata[, "idconnect"]==940539931, "x"]) returns "factor". I expected a matrix or a vector. Why is it a factor here? The as.matrix of my factor also looks weird.
So I tried to convert my data frame into a matrix, but with sapply(..., as.numeric) or data.matrix() it returns:
"id" "var1" "t" "x" "y" "z" "idconnect" "bool1"
47 7 442 5 34 97 154228 3
46 5 442 32 395 97 154274 3
45 14 442 36 149 96 45 3
How can I convert my data frame into a matrix while preserving my long integer values?
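The symptoms above are consistent with the columns having been read in as factors: as.numeric() and data.matrix() applied to a factor return its internal level codes rather than the printed values. The following is a minimal sketch of that behaviour and of the usual workaround (convert via character first). The data frame is a hypothetical reconstruction of three columns of the data shown above, and stringsAsFactors = TRUE is an assumption used only to reproduce factor columns.

```r
# Hypothetical reconstruction; stringsAsFactors = TRUE mimics columns
# that were read in as factors (the default in R versions before 4.0).
mydata <- data.frame(
  id        = c("924903565", "924903564", "924903563"),
  x         = c("104", "131", "135"),
  idconnect = c("940539767", "940539931", "924903449"),
  stringsAsFactors = TRUE
)

class(mydata$x)        # the column is a factor
as.numeric(mydata$x)   # internal level codes, not the original values

# Go through character so the labels, not the codes, become numbers
fixed <- as.data.frame(
  lapply(mydata, function(col) as.numeric(as.character(col)))
)
m <- as.matrix(fixed)  # numeric matrix that keeps the large id values

mean(fixed[fixed$idconnect == 940539931, "x"])
```

If the factors came from a file import, it may be worth checking the file for stray non-numeric entries, since a single one is enough to turn a whole column into a factor or character column.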

Related

Create a dataframe in R

I would like to create a dataframe with 117 columns and 90 rows, the first ones being: ID, date1, date2, Category, DR1, DRM01, DRM02, DRM03 .... up to DRM111. For the first column, it would have values ranging from 1 to 3. In date1 it would have a fixed value, which would be "2022-01-05", in date2, it would have values between 2021-12-20 to the maximum that it gives. Category can be ABC or ERF, in DR1 would be values that would vary from 200 to 250, and finally, in DRM columns, would be values that would vary from 0 to 300. Is it possible to create a dataframe like this?
I'm wondering if this is an effort at simulation. The first few tasks seem blindingly obvious, but the last call to replicate with simplify=FALSE might have been a bit less than trivial.
test <- data.frame( ID = rep(1:3, length=90),
date1 = as.Date( "2022-01-05"),
date2= seq( as.Date("2021-12-20"), length.out=90, by=1),
#Category = ???? so far not specified
DR1 = sample( 200:250, 90, replace=TRUE), # replace=TRUE is needed: 90 draws from only 51 values
setNames( replicate(111, { sample(0:300, 90)}, simplify=FALSE) ,
nm=paste("DRM",1:111) ) )
Snipped the last 105 rows of the output from str:
str(test)
'data.frame': 90 obs. of 115 variables:
$ ID : int 1 2 3 1 2 3 1 2 3 1 ...
$ date1 : Date, format: "2022-01-05" "2022-01-05" "2022-01-05" "2022-01-05" ...
$ date2 : Date, format: "2021-12-20" "2021-12-21" "2021-12-22" "2021-12-23" ...
$ DR1 : int 229 218 240 243 221 202 242 221 237 208 ...
$ DRM.1 : int 41 238 142 100 19 56 224 152 85 84 ...
$ DRM.2 : int 150 185 141 55 34 83 88 105 165 294 ...
$ DRM.3 : int 144 22 237 174 78 291 120 63 261 236 ...
$ DRM.4 : int 223 105 263 214 45 226 129 80 182 15 ...
$ DRM.5 : int 27 108 288 237 129 251 150 70 300 243 ...
# additional rows elided
The last item in that construction returns a list that has 111 "columns" with ascending numbered names. I admit to being puzzled about why there were periods in the DRM names but then realized that the data.frame function uses check.names to make sure they are legitimate, so the spaces from paste were converted to periods. If you don't like periods then use paste0.
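As a variation on the construction above, here is a minimal sketch that also fills in the Category column and avoids the periods in the DRM names. Drawing Category at random from "ABC" and "ERF", zero-padding the names with sprintf, the fixed seed, and the name test2 are all assumptions for illustration, since the question does not specify them.

```r
set.seed(1)  # assumption: fixed seed only so the sketch is reproducible
n <- 90

# 111 columns of random integers, named DRM01 ... DRM111 (no spaces,
# so check.names has nothing to convert into periods)
drm <- setNames(
  replicate(111, sample(0:300, n), simplify = FALSE),
  sprintf("DRM%02d", 1:111)
)

test2 <- data.frame(
  ID       = rep(1:3, length.out = n),
  date1    = as.Date("2022-01-05"),
  date2    = seq(as.Date("2021-12-20"), length.out = n, by = 1),
  Category = sample(c("ABC", "ERF"), n, replace = TRUE),  # assumed rule
  DR1      = sample(200:250, n, replace = TRUE),
  drm
)

dim(test2)
```

This gives 5 fixed columns plus 111 DRM columns, so 116 in total; if 117 columns are really required, one more column would have to be specified.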

Grouping columns and creating a list output

I am new to R. I have an R data frame with the following structure:
164_I_.CEL 164_II.CEL 183_I.CEL 183_II.CEL 2114_I.CEL
1 4496 5310 4492 4511 2872
2 181 280 137 101 91
3 4556 5104 4379 4608 2972
4 167 217 99 79 82
5 89 110 69 58 47
I want to group the columns which have "_I.CEL" in the column name.
I need a list output like NI, NI, I, NI, I
where NI means Not I.
Use a combination of ifelse and grepl to look for the required pattern in the column names. The dot is escaped because an unescaped dot in a regular expression matches any character.
ifelse(grepl("_I\\.CEL", names(df1)), "I", "NI")
#[1] "NI" "NI" "I" "NI" "I"
where df1 is your data frame.
Or use fixed = TRUE
ifelse(grepl("_I.CEL", names(df1), fixed = TRUE), "I", "NI")
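The two variants can be checked against each other on a small reconstruction of the data. This is a sketch only: df1 below holds just the first two rows of the example, and check.names = FALSE is an assumption used to keep the column names exactly as shown in the question.

```r
# Two rows of the example data; check.names = FALSE keeps the
# original column names instead of prefixing them with "X"
df1 <- data.frame(
  `164_I_.CEL` = c(4496, 181),
  `164_II.CEL` = c(5310, 280),
  `183_I.CEL`  = c(4492, 137),
  `183_II.CEL` = c(4511, 101),
  `2114_I.CEL` = c(2872, 91),
  check.names = FALSE
)

grp_regex <- ifelse(grepl("_I\\.CEL", names(df1)), "I", "NI")
grp_fixed <- ifelse(grepl("_I.CEL", names(df1), fixed = TRUE), "I", "NI")

identical(grp_regex, grp_fixed)  # both approaches agree

# Optional: collect the column names belonging to each group
split(names(df1), grp_regex)
```

The pattern also matches if the names were altered on import (for example to X183_I.CEL), because grepl only looks for the substring.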

Assign NA to the name of table

In the following example, I want to extract NA as a level and display it in the table just as other levels. The levels() function doesn't work with NA value. Is there any other way to deal with this problem?
n=1000
comorbid<-sample(c(rep("diabetes",2),
rep("hypertension",5),
"cirrhosis","stroke","heartfailure",
"renalfailure",rep("COPD",3)),
n,
replace=T)
comorbid[sample(1:n,50)]<-NA
mort<-sample(c(rep("alive",4),
"dead"),n,replace=T)
table.cat<-data.frame(matrix(rep(999,7),nrow=1))
table<-table(comorbid,useNA="always")
per<-prop.table(table)
table.sub<-table(comorbid,mort,useNA="always")
per.sub<-prop.table(table.sub,2)
p<-tryCatch({#using fisher's test when scarce data
chisq.test(table.sub)$p.value
}, warning = function(w) {
fisher.test(table.sub,
workspace = 10e7)$p.value
})
frame<-data.frame(No.tot=as.data.frame(table)[,"Freq"],
per.tot=as.data.frame(per)[,"Freq"],
No.1=as.data.frame.matrix(table.sub)[,"alive"],
per.1=as.data.frame.matrix(per.sub)[,"alive"],
No.2=as.data.frame.matrix(table.sub)[,"dead"],
per.2=as.data.frame.matrix(per.sub)[,"dead"],
p=p)
rownames(frame)<-paste("comorbid",levels(comorbid),sep="_")
levels() works just fine with NA values. What levels() requires however is a factor (or anything with a levels attribute). As per your code, comorbid is a character vector:
> class(comorbid)
[1] "character"
If you coerce comorbid to a factor and change the default so that NAs are not excluded from the factor levels, you get the desired behaviour:
fcomorbid <- factor(comorbid, exclude = NULL)
levels(fcomorbid)
paste("comorbid", levels(fcomorbid), sep = "_")
> levels(fcomorbid)
[1] "cirrhosis" "COPD" "diabetes" "heartfailure" "hypertension"
[6] "renalfailure" "stroke" NA
> paste("comorbid", levels(fcomorbid), sep = "_")
[1] "comorbid_cirrhosis" "comorbid_COPD" "comorbid_diabetes"
[4] "comorbid_heartfailure" "comorbid_hypertension" "comorbid_renalfailure"
[7] "comorbid_stroke" "comorbid_NA"
To complete your example then
rownames(frame) <- paste("comorbid", levels(fcomorbid), sep = "_")
and we have
> frame
No.tot per.tot No.1 per.1 No.2 per.2 p
comorbid_cirrhosis 69 0.069 57 0.07011070 12 0.06417112 0.3108409
comorbid_COPD 209 0.209 172 0.21156212 37 0.19786096 0.3108409
comorbid_diabetes 128 0.128 101 0.12423124 27 0.14438503 0.3108409
comorbid_heartfailure 57 0.057 45 0.05535055 12 0.06417112 0.3108409
comorbid_hypertension 334 0.334 267 0.32841328 67 0.35828877 0.3108409
comorbid_renalfailure 78 0.078 61 0.07503075 17 0.09090909 0.3108409
comorbid_stroke 75 0.075 63 0.07749077 12 0.06417112 0.3108409
comorbid_NA 50 0.050 47 0.05781058 3 0.01604278 0.3108409
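The key point of the answer can be isolated in a much smaller sketch. The five-element vector below is a made-up stand-in for comorbid; addNA() is shown as a base R alternative to factor(..., exclude = NULL), not as something the answer above used.

```r
# Tiny stand-in for the comorbid vector (hypothetical values)
comorbid <- c("COPD", "stroke", NA, "COPD", NA)

levels(comorbid)                        # NULL: a character vector has no levels

f1 <- factor(comorbid, exclude = NULL)  # keep NA as a level
f2 <- addNA(factor(comorbid))           # alternative: add NA as a level afterwards

levels(f1)
paste("comorbid", levels(f1), sep = "_")
table(f1)                               # NA is counted like any other level
```

Either form gives a levels vector whose length matches the rows produced by table(comorbid, useNA = "always"), which is what the rownames assignment needs.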

Creating a data set with paired data and converting it into a matrix

So, I'm using R to try and do a phylogenetic PCA on a dataset that I have using the phyl.pca function from the phytools package. However, I'm having issues organising my data in a way that the function will accept! And that's not all: I did a bit of experimenting and I know that there are more issues further down the line, which I will get into...
Getting straight to the issue, here's the data frame (with dummy data) that I'm using:
>all
Taxa Tibia Feather
1 Microraptor 138 101
2 Microraptor 139 114
3 Microraptor 145 141
4 Anchiornis 160 81
5 Anchiornis 14 NA
6 Archaeopteryx 134 82
7 Archaeopteryx 136 71
8 Archaeopteryx 132 NA
9 Archaeopteryx 14 NA
10 Scansoriopterygidae 120 85
11 Scansoriopterygidae 116 NA
12 Scansoriopterygidae 123 NA
13 Sapeornis 108 NA
14 Sapeornis 112 86
15 Sapeornis 118 NA
16 Sapeornis 103 NA
17 Confuciusornis 96 NA
18 Confuciusornis 107 30
19 Confuciusornis 148 33
20 Confuciusornis 128 61
The taxa are arranged into a tree (called "tree") with Microraptor being the most basal and then progressing in order through to Confuciusornis:
>summary(tree)
Phylogenetic tree: tree
Number of tips: 6
Number of nodes: 5
Branch lengths:
mean: 1
variance: 0
distribution summary:
Min. 1st Qu. Median 3rd Qu. Max.
1 1 1 1 1
No root edge.
Tip labels: Confuciusornis
Sapeornis
Scansoriopterygidae
Archaeopteryx
Anchiornis
Microraptor
No node labels.
And the function:
>phyl.pca(tree, all, method="BM", mode="corr")
And this is the error that is coming up:
Error in phyl.pca(tree, all, method = "BM", mode = "corr") :
number of rows in Y cannot be greater than number of taxa in your tree
Y being the "all" data frame. So I have 6 taxa in my tree (matching the 6 taxa in the data frame) but there are 20 rows in my data frame. So I used this function:
> all_agg <- aggregate(all[,-1],by=list(all$Taxa),mean,na.rm=TRUE)
And got this:
Group.1 Tibia Feather
1 Anchiornis 153 81
2 Archaeopteryx 136 77
3 Confuciusornis 120 41
4 Microraptor 141 119
5 Sapeornis 110 86
6 Scansoriopterygidae 120 85
It's a bit odd that the order of the taxa has changed... Is this ok?
In any case, I converted it into a matrix:
> all_agg_matrix <- as.matrix(all_agg)
> all_agg_matrix
Group.1 Tibia Feather
[1,] "Anchiornis" "153" "81"
[2,] "Archaeopteryx" "136" "77"
[3,] "Confuciusornis" "120" "41"
[4,] "Microraptor" "141" "119"
[5,] "Sapeornis" "110" "86"
[6,] "Scansoriopterygidae" "120" "85"
And then used the phyl.pca function:
> phyl.pca(tree, all_agg_matrix, method = "BM", mode = "corr")
[1] "Y has no names. function will assume that the row order of Y matches tree$tip.label"
Error in invC %*% X : requires numeric/complex matrix/vector arguments
So, now the order that the function is considering taxa in is all wrong (but I can fix that relatively easily). The issue is that phyl.pca doesn't seem to believe that my matrix is actually a matrix. Any ideas why?
I think you may have bigger problems. Most phylogenetic methods, I suspect including phyl.pca, assume that traits are fixed at the species level (i.e., they don't account for within-species variation). Thus, if you want to use phyl.pca, you probably need to collapse your data to a single value per species, e.g. via
dd_agg <- aggregate(dd[,-1],by=list(dd$Taxa),mean,na.rm=TRUE)
Extract the numeric columns and label the rows properly so that phyl.pca can match them up with the tips correctly:
dd_mat <- dd_agg[,-1]
rownames(dd_mat) <- dd_agg[,1]
Using these aggregated data, I can make up a tree (since you didn't give us one) and run phyl.pca ...
library(phytools)
tt <- rcoal(nrow(dd_agg),tip.label=dd_agg[,1])
phyl.pca(tt,dd_mat)
If you do need to do an analysis that takes within-species variation into account you might need to ask somewhere more specialized, e.g. the r-sig-phylo@r-project.org mailing list ...
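The "requires numeric/complex matrix/vector arguments" error in the question can be reproduced without phytools, which may make the fix above easier to follow. This is a base R sketch on a four-row subset of the dummy data; all_df and tip_order are hypothetical names, and tip_order stands in for tree$tip.label.

```r
# Four rows of the dummy data from the question
all_df <- data.frame(
  Taxa    = c("Microraptor", "Microraptor", "Anchiornis", "Anchiornis"),
  Tibia   = c(138, 139, 160, 14),
  Feather = c(101, 114, 81, NA)
)

all_agg <- aggregate(all_df[, -1], by = list(all_df$Taxa), mean, na.rm = TRUE)

# Including the name column turns every cell into text
mode(as.matrix(all_agg))

# Drop the name column first, then attach the names as row names
all_mat <- as.matrix(all_agg[, -1])
rownames(all_mat) <- all_agg[, 1]
mode(all_mat)

# Hypothetical stand-in for tree$tip.label: reorder rows by name
tip_order <- c("Microraptor", "Anchiornis")
all_mat[tip_order, ]
```

aggregate returns the groups in sorted order, which would explain the changed order of the taxa; once the rows carry names, they can be matched by name rather than by position.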
The answer posted by Ben Bolker seems to work: the data (called "all") are collapsed into a single value per species before creating a matrix and running the function, like so:
> all_agg <- aggregate(all[,-1],by=list(all$Taxa),mean,na.rm=TRUE)
> all_mat <- all_agg[,-1]
> rownames(all_mat) <- all_agg[,1]
> phyl.pca(tree,all_mat, method= "lambda", mode = "corr")
Thanks to everyone who contributed an answer and especially Ben! :)

Converting probe ids to entrez ids from a list of lists

The conversion of probe ids to entrez ids is quite straightforward:
i1<-c("246653_at", "246897_at", "251347_at", "252988_at", "255528_at", "256535_at", "257203_at", "257582_at", "258807_at", "261509_at", "265050_at", "265672_at")
select(ath1121501.db, i1, "ENTREZID", "PROBEID")
PROBEID ENTREZID
1 246653_at 833474
2 246897_at 832631
3 251347_at 825272
4 252988_at 829998
5 255528_at 827380
6 256535_at 840223
7 257203_at 821955
8 257582_at 841494
9 258807_at 819558
10 261509_at 843504
11 265050_at 841636
12 265672_at 817757
But I am unsure how to do it for a long list of lists resulting from a clustering, storing the result as a list of ENTREZ ids instead of probe ids:
For instance:
[[1]]
247964_at 248684_at 249126_at 249214_at 250223_at 253620_at 254907_at 259897_at 261256_at 267126_s_at
28 40 44 45 54 95 108 152 171 229
[[2]]
248230_at 250869_at 259765_at 265948_at 266221_at
33 64 151 216 221
[[3]]
245385_at 247282_at 248967_at 250180_at 250881_at 251073_at 53874_at 256093_at 257054_at 260007_at
5 22 42 52 65 67 101 117 125 155
261868_s_at 263136_at 267497_at
181 195 232
It should be something like
[[1]]
"835761","834904","834356","834281","831256","829175","826721","843479","837084","816891","816892"
and similarly for other list of lists.
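One possible shape for this is to loop over the list with lapply and look up the names of each element. The sketch below is self-contained and does not load ath1121501.db: the lookup table is copied from the first four rows of the select() output shown above, and clusters is a hypothetical two-element list built in the same form as the clustering result (named integer vectors).

```r
# Lookup table copied from the select() output above (first four rows)
lookup <- data.frame(
  PROBEID  = c("246653_at", "246897_at", "251347_at", "252988_at"),
  ENTREZID = c("833474", "832631", "825272", "829998")
)

# Hypothetical clusters: probe ids are the names, as in the question
clusters <- list(
  c("246653_at" = 1L, "251347_at" = 3L),
  c("246897_at" = 2L, "252988_at" = 4L)
)

# For each cluster, keep every ENTREZ id whose probe id is in the cluster;
# %in% retains all rows when one probe maps to several ENTREZ ids
entrez <- lapply(clusters, function(cl) {
  lookup$ENTREZID[lookup$PROBEID %in% names(cl)]
})

entrez
```

With the real annotation package, the body of the function could presumably be replaced by select(ath1121501.db, names(cl), "ENTREZID", "PROBEID")$ENTREZID, following the call shown in the question; that variant is untested here.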
