cramer.test: NAs introduced by coercion - r

I know there is a lot of information in Google about this problem, but I could not solve it.
I have a data frame:
> str(myData)
'data.frame': 1199456 obs. of 7 variables:
$ A: num 3064 82307 4431998 1354 193871 ...
$ B: num 6067 403916 2709997 2743 203434 ...
$ C: num 299 11752 33282 170 2748 ...
$ D: num 105 6676 7065 20 1593 ...
$ E: num 8 572 236 3 170 ...
$ F: num 0 21 95 0 13 ...
$ G: num 583 18512 961328 348 42728 ...
Then I convert it to a matrix in order to apply the Cramer-von Mises test from "cramer" library:
> myData = as.matrix(myData)
> str(myData)
num [1:1199456, 1:7] 3064 82307 4431998 1354 193871 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:1199456] "8" "32" "48" "49" ...
..$ : chr [1:7] "A" "B" "C" "D" ...
After that, if I call cramer.test(myData[x1:y1,], myData[x2:y2,]), I get the following error:
Error in rep(0, (RVAL$m + RVAL$n)^2) : invalid 'times' argument
In addition: Warning message:
In matrix(rep(0, (RVAL$m + RVAL$n)^2), ncol = (RVAL$m + RVAL$n)) :
NAs introduced by coercion
I also tried to convert the data frame to a matrix like this, but the error is the same:
> myData = as.matrix(sapply(myData, as.numeric))
> str(myData)
num [1:1199456, 1:7] 3064 82307 4431998 1354 193871 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:7] "A" "B" "C" "D" ...

Your problem is that your data set is too large for the algorithm that cramer.test is using (at least the way it's coded). The code tries to create a lookup table according to
lookup <- matrix(rep(0, (RVAL$m + RVAL$n)^2),
ncol = (RVAL$m + RVAL$n))
where RVAL$m and RVAL$n are the number of rows of the two samples. The standard maximum length of an R vector is 2^31-1 on a 32-bit platform: since your samples have equal numbers of rows N, you'll be trying to create a vector of length (2*N)^2, which in your case is 5.754779e+12 -- probably too big even if R would let you create the vector. A count that large cannot be represented as an integer, which is presumably why rep() reports an invalid 'times' argument along with the coercion warning.
You may have to look for another implementation of the test, or another test.
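To make the arithmetic concrete, here is a small sketch. It assumes each sample has the full N rows reported by str(myData); the question actually passes row ranges, so the real sizes may be smaller. The variable names are mine, not from the cramer package.

```r
# N is the row count from str(myData); assuming both samples have N rows
N <- 1199456
cells <- (2 * N)^2                 # (RVAL$m + RVAL$n)^2, computed as a double
cells > .Machine$integer.max       # is it larger than 2^31 - 1?
cells * 8 / 2^40                   # size in TiB if stored as doubles

# A count this large cannot be coerced to an integer, so it becomes NA
too_big <- suppressWarnings(as.integer(cells))
is.na(too_big)

# Largest combined row count m + n whose square still fits in 2^31 - 1
max_total <- floor(sqrt(.Machine$integer.max))
max_total
```

Even at that upper bound the lookup table would need many gigabytes of memory, so in practice the two samples would have to be far smaller, for example random subsamples of the rows.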

Related

Rphylopars: "Error in class(tree) <- "phylo" : attempt to set an attribute on NULL"

I'm trying to compute a phenotypic covariance matrix between a fatty acid dataset and a phylogenetic tree using the Rphylopars package.
I'm able to load the data set and phylogeny; however, when I attempt to run the test I get the error message
Error in class(tree) <- "phylo" : attempt to set an attribute on NULL
This is the code for the test
phy <- read.tree("combined_trees.txt")
plot(phy)
phy$tip.label
FA_data <- read.csv("fatty_acid_example_data.csv", header = TRUE, na.strings = ".")
head(FA_data)
str(FA_data)
PPE <- phylopars(trait_data = FA_data$fatty1_continuous, tree = FA_data$phy)
Not sure what other info will help figure out the issue. The data set and phylogeny loaded without an error.
In the tutorial, the tree and trait data are jointly simulated by the simtraits() function, so both end up as elements of a single list. In your case (which will be typical of real-data cases), the tree and the trait data come from different sources, so most likely you want
PPE <- phylopars(trait_data = FA_data, tree = phy)
provided that FA_data contains a first column named species whose entries match the tip names in phy, and otherwise only the numeric data you want to use (potentially only the single fatty1_continuous column).
For comparison, the data structure returned by simtraits() looks like this (using str()):
List of 4
$ trait_data:'data.frame': 45 obs. of 5 variables:
..$ species: chr [1:45] "t7" "t8" "t2" "t3" ...
..$ V1 : num [1:45] 1.338 0.308 1.739 2.009 2.903 ...
..$ V2 : num [1:45] -2.002 -0.115 -0.349 -4.452 NA ...
..$ V3 : num [1:45] -1.74 NA 1.09 -2.54 -1.19 ...
..$ V4 : num [1:45] 2.496 2.712 1.198 1.675 -0.117 ...
$ tree :List of 4
..$ edge : int [1:28, 1:2] 29 29 28 28 27 27 26 26 25 25 ...
..$ edge.length: num [1:28] 0.0941 0.0941 0.6233 0.7174 0.0527 ...
..$ Nnode : int 14
..$ tip.label : chr [1:15] "t7" "t8" "t2" "t3" ...
..- attr(*, "class")= chr "phylo"
..- attr(*, "order")= chr "postorder"
...
You can see that simtraits() returns a list containing (among other things) (1) a data frame with species as the first column and the other columns numeric and (2) a phylogenetic tree.
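Before calling phylopars() it may help to check that the two objects line up. The sketch below uses base R only; the phy object is a hypothetical stand-in that mimics just the tip.label element of a real tree, and the species names and trait values are invented.

```r
# Hypothetical stand-ins: 'phy' only mimics the tip.label element of a
# "phylo" object, and the species names and values are invented.
phy <- list(tip.label = c("sp_a", "sp_b", "sp_c"))
FA_data <- data.frame(species = c("sp_a", "sp_b", "sp_x"),
                      fatty1_continuous = c(0.12, 0.30, 0.25),
                      stringsAsFactors = FALSE)

# Species in the trait data that have no matching tip in the tree
missing_from_tree <- setdiff(FA_data$species, phy$tip.label)
# Tips in the tree that have no row in the trait data
missing_from_data <- setdiff(phy$tip.label, FA_data$species)

# Every column other than 'species' should be numeric
non_numeric <- names(FA_data)[-1][!sapply(FA_data[-1], is.numeric)]

missing_from_tree
missing_from_data
non_numeric
```

With your real data you would run the same setdiff() calls on the tree returned by read.tree() and the data frame returned by read.csv(); any names that show up are mismatches worth fixing first.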

Trouble converting list into factor in R

I am having problems creating a boxplot of my data, because one of my variables is in the form of a list.
I am trying to create a boxplot:
boxplot(dist~species, data=out)
and received the following error:
Error in model.frame.default(formula = dist ~ species, data = out) :
invalid type (list) for variable 'species'
I have been unsuccessful in forcing 'species' into the form of a factor:
out[species]<- as.factor(out[[out$species]])
and receive the following error:
Error in .subset2(x, i, exact = exact) : invalid subscript type 'list'
How can I convert my 'species' column into a factor which I can then use to create a boxplot? Thanks.
EDIT:
str(out)
'data.frame': 4570 obs. of 6 variables:
$ GridRef : chr "NT73" "NT80" "NT85" "NT86" ...
$ pred : num 154 71 81 85 73 99 113 157 92 85 ...
$ pred_bin : int 0 0 0 0 0 0 0 0 0 0 ...
$ dist : num 20000 10000 9842 14144 22361 ...
$ years_since_1990: chr "21" "16" "21" "20" ...
$ species :List of 4570
..$ : chr "C.splendens"
..$ : chr "C.splendens"
..$ : chr "C.splendens"
.. [list output truncated]
It's hard to imagine how you got the data into this form in the first place, but it looks like
out <- transform(out,species=unlist(species))
should solve your problem.
set.seed(101)
f <- as.list(sample(letters[1:5],replace=TRUE,size=100))
## need I() to make a wonky data frame ...
d <- data.frame(y=runif(100),f=I(f))
## 'data.frame': 100 obs. of 2 variables:
## $ y: num 0.125 0.0233 0.3919 0.8596 0.7183 ...
## $ f:List of 100
## ..$ : chr "b"
## ..$ : chr "a"
boxplot(y~f,data=d) ## invalid type (list) ...
d2 <- transform(d,f=unlist(f))
boxplot(y~f,data=d2)
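One caveat: unlist() only lines up with the rows when every list element has exactly one value, because empty elements are dropped and longer ones are expanded. A minimal sketch of a guarded conversion, using invented data shaped like the question's out data frame:

```r
# Invented data with a list column, built with I() as in the example above
out <- data.frame(dist = c(20000, 10000, 9842, 14144),
                  species = I(list("C.splendens", "C.virgo",
                                   "C.splendens", "C.virgo")))

# Check that each list element holds exactly one value before flattening
stopifnot(all(lengths(out$species) == 1))

# Flatten the list and turn it into a factor in one step
out$species <- factor(unlist(out$species))
str(out)
table(out$species)
```

After this, boxplot(dist ~ species, data = out) should accept the column.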

How to get fitted values from ar() method model in R

I want to retrieve the fitted values from a model returned by the ar() function in R. With Arima() I can get them using fitted(model.object), but I cannot find the equivalent for ar().
The ar object does not store a fitted vector, but it does store the residuals. Here is an example of using the residuals from the ar object to reconstruct the fitted values from the original data:
data(WWWusage)
arf <- ar(WWWusage)
str(arf)
#====================
List of 14
$ order : int 3
$ ar : num [1:3] 1.175 -0.0788 -0.1544
$ var.pred : num 117
$ x.mean : num 137
$ aic : Named num [1:21] 258.822 5.787 0.413 0 0.545 ...
..- attr(*, "names")= chr [1:21] "0" "1" "2" "3" ...
$ n.used : int 100
$ order.max : num 20
$ partialacf : num [1:20, 1, 1] 0.9602 -0.2666 -0.1544 -0.1202 -0.0715 ...
$ resid : Time-Series [1:100] from 1 to 100: NA NA NA -2.65 -4.19 ...
$ method : chr "Yule-Walker"
$ series : chr "WWWusage"
$ frequency : num 1
$ call : language ar(x = WWWusage)
$ asy.var.coef: num [1:3, 1:3] 0.01017 -0.01237 0.00271 -0.01237 0.02449 ...
- attr(*, "class")= chr "ar"
#===================
str(WWWusage)
# Time-Series [1:100] from 1 to 100: 88 84 85 85 84 85 83 85 88 89 ...
png(); plot(WWWusage)
lines(seq(WWWusage),WWWusage - arf$resid, col="red"); dev.off()
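As a sanity check on the "data minus residuals" idea, the sketch below rebuilds one fitted value by hand from the stored coefficients and mean. It assumes the default Yule-Walker fit with the mean removed; the variable names are mine.

```r
fit <- ar(WWWusage)                  # default Yule-Walker fit, mean removed
fitted_ar <- WWWusage - fit$resid    # the first fit$order values are NA

# Rebuild the one-step prediction for the first usable observation by hand
p <- fit$order
i <- p + 1
lagged <- WWWusage[(i - 1):(i - p)]  # x[i-1], x[i-2], ..., x[i-p]
by_hand <- fit$x.mean + sum(fit$ar * (lagged - fit$x.mean))

all.equal(as.numeric(fitted_ar[i]), by_hand)
```

If the two numbers agree, fitted_ar can be used wherever you would otherwise have used fitted().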
The simplest way to get the fits from an AR(p) model would be to use auto.arima() from the forecast package, which does have a fitted() method. If you really want a pure AR model, you can constrain the differencing via the d parameter and the MA order via the max.q parameter.
> library(forecast)
> fitted(auto.arima(WWWusage,d=0,max.q=0))
Time Series:
Start = 1
End = 100
Frequency = 1
[1] 91.68778 86.20842 82.13922 87.60576 ...
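If you would rather not depend on the forecast package, the same "data minus residuals" approach should work with base arima(). This is only a sketch on the built-in lh series with a fixed AR(3) order, not the WWWusage fit above.

```r
# Pure AR(3): no differencing (d = 0) and no MA terms (q = 0)
fit3 <- arima(lh, order = c(3, 0, 0))

# arima objects carry residuals, so fitted values are data minus residuals
fitted_arima <- lh - residuals(fit3)
head(fitted_arima)
```

Unlike ar(), arima() returns a residual for every observation, so there are no leading NA values in the result.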

Transform to numeric a column with "NULL" values

I've imported a dataset into R in which a column that should contain numeric values also contains the string NULL. This makes R set the column class to character or factor, depending on whether you use the stringsAsFactors argument.
To give you an idea, this is the structure of the dataset.
> str(data)
'data.frame': 1016 obs. of 10 variables:
$ Date : Date, format: "2014-01-01" "2014-01-01" "2014-01-01" "2014-01-01" ...
$ Name : chr "Chi" "Chi" "Chi" "Chi" ...
$ Impressions: chr "229097" "3323" "70171" "1359" ...
$ Revenue : num 533.78 11.62 346.16 3.36 1282.28 ...
$ Clicks : num 472 13 369 1 963 161 1 7 317 21 ...
$ CTR : chr "0.21" "0.39" "0.53" "0.07" ...
$ PCC : chr "32" "2" "18" "0" ...
$ PCOV : chr "3470.52" "94.97" "2176.95" "0" ...
$ PCROI : chr "6.5" "8.17" "6.29" "NULL" ...
$ Dimension : Factor w/ 11 levels "100x72","1200x627",..: 1 3 4 5 7 8 9 10 11 1 ...
I would like to convert the PCROI column to numeric, but the NULL entries make this harder.
I tried to get around the issue by setting all observations whose current value is NULL to 0, but I got the following error message:
> data$PCROI[which(data$PCROI == "NULL"), ] <- 0
Error in data$PCROI[which(data$PCROI == "NULL"), ] <- 0 :
incorrect number of subscripts on matrix
My idea was to change all the NULL observations to 0 and afterwards convert the whole column to numeric using the as.numeric function.
You have an indexing error: data$PCROI is a vector, not a matrix, so the subscript must not contain a comma:
data$PCROI[which(data$PCROI == "NULL"), ] <- 0 # will not work
data$PCROI[which(data$PCROI == "NULL")] <- 0 # will work
By the way, you can simply say:
data$PCROI = as.numeric(data$PCROI)
which converts your "NULL" entries to NA automatically (with a coercion warning), as long as the column is character rather than factor.
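A minimal sketch of both routes, using invented values in place of the real PCROI column. It also shows the na.strings argument of read.csv(), which avoids the problem at import time; the inline text stands in for your file.

```r
# Invented values standing in for the PCROI column
pcroi <- c("6.5", "8.17", "6.29", "NULL")

pcroi[pcroi == "NULL"] <- NA       # mark missing values explicitly
pcroi_num <- as.numeric(pcroi)     # no coercion warning this way
pcroi_num

# If the column was read as a factor, go through character first;
# as.numeric() on a factor returns the internal level codes instead.
pcroi_fac <- factor(c("6.5", "8.17", "6.29", "NULL"))
from_factor <- suppressWarnings(as.numeric(as.character(pcroi_fac)))

# Alternatively, declare "NULL" as a missing-value marker when importing
imported <- read.csv(text = "PCROI\n6.5\n8.17\nNULL", na.strings = "NULL")
str(imported)
```

Using NA rather than 0 keeps the missing observations out of means and sums; replace them with 0 only if a zero is really what the data mean.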

R k-means clustering data

in R, I have computed a k-means clustering as follows:
km = kmeans(mat2, centers=3)
where mat2 is a matrix of column vectors obtained by combining elements of a set of time series. There are 31 rows.
Now that I have my k-means object, how can I look at the data associated with a particular point? For example, suppose I clicked on a dot that belongs to one of the partitions. How can I view this data? Of course, what I mean is how to obtain this data programmatically.
I assume that you call kmeans like this:
set.seed(42)
df <- data.frame( row.names = paste0( "obs", 1:100 ),
V1 = rnorm(100),
V2 = rnorm(100),
V3 = rnorm(100) )
km <- kmeans( df, centers = 3 )
If you are unfamiliar with a new function, it's always a good idea to inspect the resulting object using str():
> str(km)
List of 7
$ cluster : Named int [1:100] 1 2 3 3 1 1 1 1 1 1 ...
..- attr(*, "names")= chr [1:100] "obs1" "obs2" "obs3" "obs4" ...
$ centers : num [1:3, 1:3] 0.65604 -1.09689 0.56428 0.11162 0.00549 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:3] "1" "2" "3"
.. ..$ : chr [1:3] "V1" "V2" "V3"
$ totss : num 291
$ withinss : num [1:3] 43.7 65.7 51.3
$ tot.withinss: num 161
$ betweenss : num 130
$ size : int [1:3] 36 34 30
- attr(*, "class")= chr "kmeans"
As I understood from your question, you are looking for km$cluster, which tells you which observation of your data has been assigned to which cluster. The cluster centers can accordingly be investigated by km$centers.
If you now want to know which observations have been assigned to the third cluster, with the center km$centers[3,], you can subset your data.frame (or matrix) by
> rownames(df[ km$cluster == 3, ])
[1] "obs3" "obs4" "obs12" "obs15" "obs16" "obs21" "obs25" "obs27" "obs32" "obs42" "obs43" "obs46" "obs48" "obs54" "obs55" "obs58" "obs61" "obs62" "obs63" "obs66" "obs67" "obs73" "obs76"
[24] "obs77" "obs81" "obs84" "obs86" "obs87" "obs90" "obs94"
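A few more ways to get at the data behind a cluster, sketched on the same simulated data frame as above. The names cluster3 and by_cluster are mine.

```r
set.seed(42)
df <- data.frame(row.names = paste0("obs", 1:100),
                 V1 = rnorm(100), V2 = rnorm(100), V3 = rnorm(100))
km <- kmeans(df, centers = 3)

km$cluster["obs12"]                  # cluster label of one named observation
cluster3 <- df[km$cluster == 3, ]    # the actual rows assigned to cluster 3
by_cluster <- split(df, km$cluster)  # list of data frames, one per cluster
sapply(by_cluster, nrow)             # should agree with km$size
```

The same indexing works on a matrix such as mat2, as long as its rows are the observations that were clustered.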
