multivariate random forest on a community matrix - r

I want to use random forest modeling to understand variable importance on community assembly - my response data is a community matrix.
library(randomForestSRC)
# simulated species matrix
species
# site species 1 species2 species 3
# 1 1 1 0
# 2 1 0 1
# 3 1 1 1
# 4 1 0 1
# 5 1 0 0
# 6 1 1 0
# 7 1 1 0
# 8 1 0 0
# 9 1 0 0
# 10 1 1 0
# environmental data
data
# site elevation_m PRECIPITATION_mm
# 1 500 28
# 2 140 37
# 3 445 15
# 4 340 45
# 5 448 20
# 6 55 70
# 7 320 18
# 8 200 42
# 9 420 22
# 10 180 8
# adding my species matrix into the environmental data frame
data[["species"]] <-(species)
# running the model
rf_model <- rfsrc(Multivar(species) ~.,data = data, importance = T)
but I'm getting an error message:
Error in parseFormula(formula, data, ytry) :
the y-outcome must be either real or a factor.
I'm guessing that the issue is the presence/absence data, but I'm not sure how to move past that. Is this a limitation of the function?

I think it MIGHT have to do with how you built your "data" data frame. When you used data[["species"]] <- (species), you created a data frame nested inside another data frame. If you run str(data) after that step, the output is this:
> str(data)
'data.frame': 10 obs. of 4 variables:
$ site : int 1 2 3 4 5 6 7 8 9 10
$ elevation: num 500 140 445 340 448 55 320 200 420 180
$ precip : num 28 37 15 45 20 70 18 42 22 8
$ species :'data.frame': 10 obs. of 4 variables: #2nd data frame
..$ site : int 1 2 3 4 5 6 7 8 9 10
..$ species.1: num 1 1 1 1 1 1 1 1 1 1
..$ species2 : num 1 0 1 0 0 1 1 0 0 1
..$ species.3: num 0 1 1 1 0 0 0 0 0 0
If you instead build your data frame as data2 <- as.data.frame(cbind(data, species)), then
rfsrc(Multivar(species.1, species2, species.3) ~ ., data = data2, importance = T)
seems to work: I don't get an error message, and instead I get some reasonable-looking output:
Sample size: 10
Number of trees: 1000
Forest terminal node size: 5
Average no. of terminal nodes: 2
No. of variables tried at each split: 2
Total no. of variables: 4
Total no. of responses: 3
User has requested response: species.1
Resampling used to grow trees: swr
Resample size used to grow trees: 10
Analysis: mRF-R
Family: regr+
Splitting rule: mv.mse *random*
Number of random split points: 10
% variance explained: NaN
Error rate: 0
I don't think the way you built the data frame is the customary approach, but I could be wrong. I suspect rfsrc() did not know how to read a nested data frame, and I doubt most modeling functions do without extra custom code.
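For what it's worth, here is a small sketch of one way to build the combined data frame flat, without duplicating the site column (this assumes both data frames are in the same site order):
# Append only the species columns; drop the duplicated site column from `species`.
data2 <- cbind(data, species[, setdiff(names(species), "site")])
str(data2)  # every column is now at the top level; no nested data frame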

Here's an example, using example data from the vegan package, of automatically constructing a formula that includes all of the species names in the response:
library(vegan)
library(randomForestSRC)
data("dune.env")
data("dune")
all <- as.data.frame(cbind(dune, dune.env))
form <- formula(sprintf("Multivar(%s) ~ .",
                        paste(colnames(dune), collapse = ",")))
rfsrc(form, data = all)
Suppose we want to do this with 2000 species. Here's a simulated example:
nsp <- 2000
nsamp <- 100
nenv <- 10
set.seed(101)
spmat <- matrix(rpois(nsp * nsamp, lambda = 5), ncol = nsp,
                dimnames = list(NULL, paste0("sp", seq(nsp))))
envmat <- matrix(rnorm(nenv * nsamp), ncol = nenv,
                 dimnames = list(NULL, paste0("env", seq(nenv))))
all2 <- as.data.frame(cbind(spmat, envmat))
form2 <- formula(sprintf("Multivar(%s) ~ .",
                         paste(colnames(spmat), collapse = ",")))
rfsrc(form2, data = all2)
In this particular example we seem to explain -3% (!!) of the variance, but it doesn't crash, so that's a good thing ...
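Since the original goal was variable importance for community assembly, note that you can also request importance and pull it out per response species. A small sketch using the dune example above (newer versions of randomForestSRC provide get.mv.vimp() for multivariate forests; check the help for your installed version):
fit <- rfsrc(form, data = all, importance = TRUE)
vi <- get.mv.vimp(fit)  # importance values per predictor, by response species (if supported)
head(vi)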

Related

survMisc::comp gives an incorrect p-value due to an incorrect covariance matrix calculation

I'm performing survival analysis on a somewhat unusual dataset, which in raw form looks like this:
> print(df)
# A tibble: 407 × 4
stress len time status
<dbl> <dbl> <dbl> <dbl>
1 0 5.32 1 1
2 0 5.97 1 1
3 0 4.08 1 1
4 0 4.57 1 1
5 0 6.11 1 1
6 0 7.74 1 1
7 0 5.55 1 1
8 0 5.86 1 1
9 0 6.01 1 1
10 0 4.86 1 1
# … with 397 more rows
It is a seed germination time dataset composed of 407 individual seeds. The variable time indicates the day on which a seed germinated (day 1 to 7), status indicates the event status (1 = germinated, 0 = censored), and stress indicates whether a stress treatment was applied to the seed (1 = applied, 0 = not applied); the variable len isn't used.
It is treated as right-censored data and does not follow the PH assumption. When passed to survMisc::ten() it looks like this:
> df_ten <- ten(Surv(time,status)~stress,data=df)
> as.data.frame(df_ten)
t e n cg ncg
1 1 52 407 1 200
2 2 44 335 1 148
3 3 51 236 1 104
4 4 20 127 1 53
5 5 15 69 1 32
6 6 2 42 1 17
7 7 6 29 1 15
8 1 20 407 2 207
9 2 55 335 2 187
10 3 58 236 2 132
11 4 37 127 2 74
12 5 12 69 2 37
13 6 11 42 2 25
14 7 2 29 2 14
I wanted to compare the survival curves using the log-rank test and its various weighted modifications. A simple log-rank test can be done with survival::survdiff(), while survMisc::comp() additionally calculates the weighted modifications. However, I found that survdiff() and comp() give different p-values for the log-rank test.
Example:
> survdiff(Surv(time,status)~stress,df)
Call:
survdiff(formula = Surv(time, status) ~ stress, data = df)
N Observed Expected (O-E)^2/E (O-E)^2/V
stress=0 200 190 173 1.70 4.72
stress=1 207 195 212 1.38 4.72
Chisq= 4.7 on 1 degrees of freedom, p= 0.03
survdiff gives p=0.03
> comp(df_ten)
# newer R versions have some trouble displaying comp() results, so I include them like this:
> as.data.frame(attr(df_ten,'lrt'))
W Q Var Z pNorm chiSq df pChisq
1 1 -17.1389764 7.788363e+01 -1.94205613 0.052130305 3.771582025 1 0.052130305
2 n -7159.0000000 6.365017e+06 -2.83760918 0.004545280 8.052025835 1 0.004545280
3 sqrtN -352.4452268 2.030262e+04 -2.47352082 0.013378901 6.118305256 1 0.013378901
4 S1 -14.2339087 2.040569e+01 -3.15100126 0.001627118 9.928808966 1 0.001627118
5 S2 -14.1996105 2.028548e+01 -3.15270847 0.001617633 9.939570710 1 0.001617633
6 FH_p=1_q=1 -0.1198792 2.298284e+00 -0.07907548 0.936972583 0.006252932 1 0.936972583
But comp gives p=0.052
The survdiff() result is actually the correct one and comp() is incorrect. I was able to figure out that this happens because of an incorrect covariance matrix calculation in survMisc::COV(), which is used within comp().
Again, example:
> survdiff(Surv(time,status)~stress,df)$var
[,1] [,2]
[1,] 62.17916 -62.17916
[2,] -62.17916 62.17916
This is the correct value of Var(Oi - Ei) = 62.2; I was able to obtain the same result by calculating it manually using the formula from the Kleinbaum and Klein book.
> COV(df_ten)[COV(df_ten)>0]
1 2 3 4 5 6 7
16.128233 20.824661 20.724324 10.556427 5.463805 2.473937 1.712247
But COV() gives an incorrect one: summed up it comes to 77.83, which is then used by comp().
I can still obtain the correct matrix from COV() by giving it a different ten format, but that format can't be used in comp().
> df_wide <- asWide(df_ten)
> as.data.frame(df_wide)
t n e n_1 e_1 n_2 e_2
1 1 407 72 200 52 207 20
2 2 335 99 148 44 187 55
3 3 236 109 104 51 132 58
4 4 127 57 53 20 74 37
5 5 69 27 32 15 37 12
6 6 42 13 17 2 25 11
7 7 29 8 15 6 14 2
> rowSums(df_wide[,COV(x=e, n=n, ncg=matrix(data=c(n_1, n_2), ncol=2))],dims=2)
1 2
1 62.17916 -62.17916
2 -62.17916 62.17916
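For reference, the manual calculation mentioned above (the standard hypergeometric variance term of the log-rank statistic, summed over event times) can be reproduced directly from the wide table:
# Var(O1 - E1) = sum over event times of n_1 * n_2 * e * (n - e) / (n^2 * (n - 1))
w <- as.data.frame(df_wide)
sum(with(w, n_1 * n_2 * e * (n - e) / (n^2 * (n - 1))))
# comes to about 62.18, matching survdiff()$var above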
I assume this has something to do with the properties of my dataset, specifically the number of events per time point or the relatively small number of time points, and with the formula used internally by COV() to calculate the value, which is:
...
res2 <- x[, (ncg / n) * (1 - (ncg / n)) * ((n - e) / (n - 1)) * e, by=list(t, cg)]
res2 <- data.table::setkey(res2[, sum(V1), by=t], t)
...
full function code here
I have tried this with different survival datasets in R, and the issue doesn't seem to affect larger ones, but in a relatively small dataset like coin::ocarcinoma you can also see a difference in the covariance and the corresponding log-rank value calculated by the survdiff/comp functions.
So is there any straightforward way to fix this issue with comp()? Perhaps by restructuring my dataset somehow, or by tweaking the formula used in COV()?

Bootstrapping for groups where within each group there are subgroups sampled unevenly

Say I collect two samples from two groups but the sample I collect from each subject within each group is a different size.
head(df1)
subject three bias number
<int> <dbl> <dbl> <int>
1 1 0 0.696 69
2 1 1 0.656 32
3 100 0 0.938 64
4 100 1 0.929 28
5 1002 0 0.7 40
6 1002 1 0.345 29
The df1 above is actually an aggregation, by subject, of a data frame df containing a Boolean bias variable:
str(df)
'data.frame': 1256 obs. of 3 variables:
$ subject: int 1 1 1 1 1 1 1 1 1 1 ...
$ three : num 0 0 0 0 0 0 0 0 0 0 ...
$ bias : num 1 1 1 1 1 1 1 1 1 1 ...
I want to determine whether there are significant within-subject (column 1 of df1) differences in the mean bias between three == 0 and three == 1 (column 2 of df1). The bias column (column 3 of df1) is averaged over the number of samples (column 4 of df1).
Ideally, I would like to run a bias-corrected and accelerated (BCa) bootstrap test or a permutation test that can handle the different sample sizes (one possible shape of such a permutation test is sketched below).
I'm confused about how to implement a valid statistical test when the data are a group of groups, as described here.
Any insight would be greatly appreciated!
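For concreteness, here is one rough shape such a test could take: a subject-level permutation sketch only, assuming df has the columns shown in str(df) above (subject, three, bias). Shuffling the three labels within each subject keeps every subject's sample sizes intact.
obs_diff <- function(d) {
  m <- tapply(d$bias, list(d$subject, d$three), mean)  # subject x condition means
  mean(m[, "1"] - m[, "0"], na.rm = TRUE)
}
set.seed(1)
observed <- obs_diff(df)
perm <- replicate(2000, {
  d <- df
  # permute the three labels within each subject
  d$three <- ave(d$three, d$subject, FUN = function(v) v[sample.int(length(v))])
  obs_diff(d)
})
p_value <- mean(abs(perm) >= abs(observed))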

R train, svmRadial "Cannot scale data"

I'm using R and this breastCancer data frame. I want to use the train function from the caret package, but it doesn't work because of the error below. However, when I use another data frame, the function works.
library(mlbench)
library(caret)
data("breastCancer")
BC = na.omit(breastCancer[,-1])
a = train(Class~., data = as.matrix(BC), method = "svmRadial")
This is the error:
error : In .local(x, ...) : Variable(s) `' constant. Cannot scale data.
We can start with the data you have:
library(mlbench)
library(caret)
data(BreastCancer)
BC = na.omit(BreastCancer[,-1])
str(BC)
'data.frame': 683 obs. of 10 variables:
$ Cl.thickness : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 5 5 3 6 4 8 1 2 2 4 ...
$ Cell.size : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 1 1 2 ...
$ Cell.shape : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 2 1 1 ...
$ Marg.adhesion : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 5 1 1 3 8 1 1 1 1 ...
$ Epith.c.size : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 2 7 2 3 2 7 2 2 2 2 ...
$ Bare.nuclei : Factor w/ 10 levels "1","2","3","4",..: 1 10 2 4 1 10 10 1 1 1 ...
$ Bl.cromatin : Factor w/ 10 levels "1","2","3","4",..: 3 3 3 3 3 9 3 3 1 2 ...
$ Normal.nucleoli: Factor w/ 10 levels "1","2","3","4",..: 1 2 1 7 1 7 1 1 1 1 ...
$ Mitoses : Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 5 1 ...
$ Class : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
BC is a data.frame and you can see all your predictors are categorical or ordinal. You are trying to fit svmRadial, meaning an SVM with a radial basis function kernel. It's not trivial to calculate Euclidean distance between categorical features, and if you look at the distribution of your categories:
sapply(BC,table)
$Cl.thickness
1 2 3 4 5 6 7 8 9 10
139 50 104 79 128 33 23 44 14 69
$Cell.size
1 2 3 4 5 6 7 8 9 10
373 45 52 38 30 25 19 28 6 67
$Cell.shape
1 2 3 4 5 6 7 8 9 10
346 58 53 43 32 29 30 27 7 58
$Marg.adhesion
1 2 3 4 5 6 7 8 9 10
393 58 58 33 23 21 13 25 4 55
When you train the model, the default resampling is the bootstrap, so some of your training resamples will be missing the levels that are poorly represented, for example category 9 of Marg.adhesion in the table above. That variable then becomes all zeros for the resample, hence the error. It most likely doesn't affect the overall result much (since those levels are rare).
One solution is to use cross-validation (it is unlikely that all the rare observations end up in the held-out fold). Note that you should never convert to a matrix with as.matrix() when you have a data.frame with factors or characters; caret can handle the data.frame directly:
train(Class ~.,data=BC,method="svmRadial",trControl=trainControl(method="cv"))
Support Vector Machines with Radial Basis Function Kernel
683 samples
9 predictor
2 classes: 'benign', 'malignant'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 614, 615, 615, 615, 616, 615, ...
Resampling results across tuning parameters:
C Accuracy Kappa
0.25 0.9575654 0.9101995
0.50 0.9619346 0.9190284
1.00 0.9633838 0.9220161
Tuning parameter 'sigma' was held constant at a value of 0.01841092
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.01841092 and C = 1.
The other option, if you want to keep the bootstrap resampling, is to either omit the observations with these rare levels or combine them with other levels (a rough sketch of the latter is below).
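Here is that sketch, illustrative only: the threshold of 25 observations and dropping the factor ordering are my own assumptions; pick a threshold large enough that the merged level is itself no longer rare.
# Merge levels with fewer than min_n observations into a single "other" level.
collapse_rare <- function(f, min_n = 25) {
  f <- factor(f, ordered = FALSE)            # drop ordering for simplicity
  tab <- table(f)
  rare <- names(tab)[tab < min_n]
  levels(f)[levels(f) %in% rare] <- "other"  # duplicated level names are merged
  f
}
BC2 <- BC
pred_cols <- setdiff(names(BC2), "Class")
BC2[pred_cols] <- lapply(BC2[pred_cols], collapse_rare)
train(Class ~ ., data = BC2, method = "svmRadial")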
Your code contains a typo: the dataset name is BreastCancer, not breastCancer. You can use the following code to get rid of the error:
library(mlbench)
library(caret)
data(BreastCancer)
BC = na.omit(BreastCancer[,-1])
a = train(Class~., data = as.matrix(BC), method = "svmRadial")
For me it returns:
#> Support Vector Machines with Radial Basis Function Kernel
#>
#> 683 samples
#> 9 predictor
#> 2 classes: 'benign', 'malignant'
#>
#> No pre-processing
#> Resampling: Bootstrapped (25 reps)
#> Summary of sample sizes: 683, 683, 683, 683, 683, 683, ...
#> Resampling results across tuning parameters:
#>
#> C Accuracy Kappa
#> 0.25 0.9550137 0.9034390
#> 0.50 0.9585504 0.9107666
#> 1.00 0.9611485 0.9161541
#>
#> Tuning parameter 'sigma' was held constant at a value of 0.02349173
#> Accuracy was used to select the optimal model using the largest value.
#> The final values used for the model were sigma = 0.02349173 and C = 1.

CountIF function in R?

I have the following formulas in Excel, but the calculation takes forever, so I would like to find a way to do the same calculation in R.
I'm counting the number of times an item shows up in a location (Central 1, Central 2, and External) with these formulas:
=SUMPRODUCT(($N:$N=$A2)*(RIGHT(!$C:$C)="1"))
=SUMPRODUCT(($N:$N=$A2)*(RIGHT(!$C:$C)="2"))
=SUMPRODUCT(($N:$N=$A2)*(LEFT($C:$C)="E"))
Here is the data frame to which the columns with these values will be added:
> str(FinalPars)
'data.frame': 10038 obs. of 3 variables:
$ ID: int 11 13 18 22 39 181 182 183 191 192 ...
$ Minimum : num 15 6 1.71 1 1 4.39 2.67 5 5 2 ...
$ Maximum : num 15 6 2 1 1 5.48 3.69 6.5 5 2 ...
and here is the dataset against which the item ID will be matched (this is a master list of all the locations each item is stored in):
> str(StorageLocations)
'data.frame': 14080 obs. of 3 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ CLASSIFICATION : Factor w/ 3 levels "Central 1","Central 2",..: 3 3 3 1 2 3 3 1 2 3 ...
$ Cart Descr : Factor w/ 145 levels "Closet1",..: 36 41 110 1 99 58 60 14 99 60 ...
Sample of Storage Location Data Frame:
ID Classification Cart Descr
123 Central 1 Main Store Room
123 Central 2 Secondary Store Room
123 External Closet 1
123 External Closet 2
123 External Closet 3
So the output for the above would be added to the FinalPars data frame as the new columns Central 1, Central 2, and External, counting the number of times the item was identified as being in those locations:
ID Minimum Maximum Central 1 Central 2 External
123 10 15 1 1 3
This was my output in Excel: a count of the number of times an item was identified as Central 1, Central 2, or External.
If anyone knows the comparable formula in R it would be great!
It's hard to know what you are really asking for without example data. I produced an example below.
Location <- c(rep(1,4), rep(2,4), rep(3,4))
Item_Id <- c(rep(1,2),rep(2,3),rep(1,2),rep(2,2),rep(1,3))
Item_Id_Want_to_Match <- 1
df <- data.frame(Location, Item_Id)
> df
Location Item_Id
1 1 1
2 1 1
3 1 2
4 1 2
5 2 2
6 2 1
7 2 1
8 2 2
9 3 2
10 3 1
11 3 1
12 3 1
> sum(ifelse(df$Location == 1 & df$Item_Id == Item_Id_Want_to_Match, df$Item_Id*df$Location, 0))
[1] 2
EDIT:
ID <- rep(123,5)
Classification <- c("Central 1", "Central 2", rep("External",3))
df <- data.frame(ID, Classification)
df$count <- 1
ID2 <- 123
Min <- 10
Max <- 15
df2 <- data.frame(ID2, Min, Max)
library(dplyr)
count_df <- df %>%
  group_by(ID, Classification) %>%
  summarise(count = sum(count))
> count_df
Source: local data frame [3 x 3]
Groups: ID
ID Classification count
1 123 Central 1 1
2 123 Central 2 1
3 123 External 3
library(reshape)
new_df <- recast(count_df, ID~Classification, id.var=c("ID", "Classification"))
> new_df
ID Central 1 Central 2 External
1 123 1 1 3
> merge(new_df, df2, by.x="ID", by.y="ID2")
ID Central 1 Central 2 External Min Max
1 123 1 1 3 10 15
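For the original data frames, the same counting can also be done in base R; this is a sketch only, with column names taken from the str() output above (ID and CLASSIFICATION in StorageLocations, ID in FinalPars):
# Cross-tabulate item ID by classification, then merge the counts onto FinalPars.
counts <- as.data.frame.matrix(table(StorageLocations$ID, StorageLocations$CLASSIFICATION))
counts$ID <- as.integer(rownames(counts))
result <- merge(FinalPars, counts, by = "ID", all.x = TRUE)
head(result)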

Automating split up of data frame

I have the following data frame in R:
> head(df)
date x y z n t
1 2012-01-01 1 1 1 0 52
2 2012-01-01 1 1 2 0 52
3 2012-01-01 1 1 3 0 52
4 2012-01-01 1 1 4 0 52
5 2012-01-01 1 1 5 0 52
6 2012-01-01 1 1 6 0 52
> str(df)
'data.frame': 4617600 obs. of 6 variables:
$ date: Date, format: "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01" ...
$ x : Factor w/ 45 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ y : Factor w/ 20 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ z : Factor w/ 111 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ n : int 0 0 0 0 0 0 0 0 29 0 ...
$ t : num 52 52 52 52 52 52 52 52 52 52 ...
What I want to do is split this large df into smaller data frames as follows:
1) I want one data frame for each of the 45 factor levels of 'x'. 2) I want to further split each of those by the 111 factor levels of 'z'. So I want a total of 45*111 = 4995 data frames.
I've seen plenty online about splitting data frames, which turns them into lists. However, I'm not seeing how to further split lists. Another concern I have is with computer memory. If I split the data frame into lists, will it not still take up as much computer memory? If I then want to run some prediction models on the split data, it seems impossible to do. Ideally I would split the data into many data frames, run prediction models on the first split data frame, get the results I need, and then delete it before moving on to the next one.
Here's what I would do. Your data already fits in memory, so just leave it in one piece:
require(data.table)
setDT(df)
df[, {
  sum(t * n)  # or whatever you're doing for "prediction models"
}, by = list(x, z)]
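If the goal is to run a model per (x, z) combination, the same grouped idiom extends without ever materializing 4,995 separate data frames. A sketch, where lm(n ~ t) is only a stand-in for whatever prediction model is actually needed:
# One model per (x, z) group, kept in a list column; .SD is that group's rows.
fits <- df[, .(fit = list(lm(n ~ t, data = .SD))), by = .(x, z)]
fits$fit[[1]]  # inspect the first group's fitted model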
