randomforest explained variation - r

I have the following dataframe. I would like to know which bacteria contribute more when comparing the location of the bacteria(categorical) and its pH(numeric).
For instance at the end i would like to say for example that a certain bacterial type is more frequently found in a certain location when looking at the temperature.
Bacillus Lactobacillus Janibacter Brevibacterium Lawsonella Location temperature
Sample1 2 30 164 8 21 48 bedroom 27
Sample2 0 211 0 996 195 108 bedroom 35
Sample3 1 938 1 21 38 43 pool 45
Sample4 0 95 17 1 4 334 pool 10
Sample5 0 192 91 25 1207 1659 soil 14
Sample6 0 12 33 6 12 119 soil 21
Sample7 0 16 3 0 0 805 soil 12
The idea is to run randomforest to select those features (bacteria) that are more important when looking at both the location and the temperature.
Is randomforest suitable for this ? When i run the follozinw command i get the following error:
randomForest(Location+Temperature ~.,data=mydf)
Error in Location + Temperature : non-numeric argument to binary operator.
From the error it looks that i cannot use a continous and categorical variable together. How can i fix this ?
Is for exemple convert the numeric temperature variable to ranges of temperatures as a categorical variables would be a solution ?
In fact i have tried and it worked by converting the numeric temperature to ranges and pasting the location so that i have a combination of location and temperature.
randomForest(Location_temperature ~.,data=dat)
I get the list of important bacteria which is what i was looking for. Now how can i know which one contributes more to one location or another since my model i was using all sites ? For example how to check that your important variables(let´s say Bacillus is the most important from the randomforest model ) is important in the pool location (how much variation it explains in the pool) ??
Hope it is clear....

Related

R group data into equal groups with a metric variable

I'm struggeling to get a good performing script for this problem: I have a table with a score, x, y. I want to sort the table by score and than build groups based on the x value. Each group should have an equal sum (not counts) of x. x is a metric number in the dataset and resembles the historic turnover of a customer.
score x y
0.436024136 3 435
0.282303336 46 56
0.532358015 24 34
0.644236597 0 2
0.99623626 0 4
0.557673456 56 46
0.08898779 0 7
0.702941303 453 2
0.415717835 23 1
0.017497461 234 3
0.426239166 23 59
0.638896238 234 86
0.629610596 26 68
0.073107526 0 35
0.85741877 0 977
0.468612039 0 324
0.740704267 23 56
0.720147257 0 68
0.965212467 23 0
a good way to do so is adding a group variable to the data.frame with cumsum! Now you can easily sum the groups with e. g. subset.
data.frame$group <-cumsum(as.numeric(data.frame$x)) %/% (ceiling(sum(data.frame$x) / 3)) + 1
remarks:
in big data.frames cumsum(as.numeric()) works reliably
%/% is a division where you get an integer back
the '+1' just let your groups start with 1 instead of 0
thank you #Ronak Shah!

Sample randomly with keeping class distribution in R

I have a matrix of 10 classes (2089 rows and 112 colunms).
0 1 2 3 4 5 6 7 8 9
482 60 404 134 60 339 376 66 63 105
I want to split the matrix randomly in three sets of proportion: 60, 20 and 20 % respectively with keeping the proportion classes in each set as in the original matrix.
I cheked Stratified random sampling from data frame but it's not the same question.
The first column of the matrix contains the classes indexes from 0 to 9. I want to do the split in 60, 20 and 20 % according to this column. For example the 9th class contains 63 observations (3%). The three parts must contains 3% of this class.

Plotting estimated probabilities from binary logistic regression when one or more predictor variables are held constant

I am a biology grad student who has been spinning my wheels for about thirty hours on the following issue. In summary I would like to plot a figure of estimated probabilities from a glm binary logistic regression model i produced. I have already gone through model selection, validation, etc and am now simply trying to produce figures. I had no problem plotting probability curves for the model i selected but what i am really interested in is producing a figure that shows probabilities of a binary outcome for a predictor variable when the other predictor variable is held constant.
I cannot figure out how to assign this constant value to only one of the predictor variables and plot the probability for the other variable. Ultimately i would like to produce figures similar to the crude example i attached desired output. I admit I am a novice in R and I certainly appreciate folks' time but i have exhausted online searches and have yet to find the approach or a solution adequately explained. This is the closest information related to my question but i found the explanation vague and it failed to provide an example for assigning one predictor a constant value while plotting the probability of the other predictor. https://stat.ethz.ch/pipermail/r-help/2010-September/253899.html
Below i provided a simulated dataset and my progress. Thank you very much for your expertise, i believe a solution and code example would be helpful for other ecologists who use logistic regression.
The simulated dataset shows survival outcomes over the winter for lizards. The predictor variables are "mass" and "depth".
x<-read.csv('logreg_example_data.csv',header = T)
x
survival mass depth
1 0 4.294456 262
2 0 8.359857 261
3 0 10.740580 257
4 0 10.740580 257
5 0 6.384678 257
6 0 6.384678 257
7 0 11.596380 270
8 0 11.596380 270
9 0 4.294456 262
10 0 4.294456 262
11 0 8.359857 261
12 0 8.359857 261
13 0 8.359857 261
14 0 7.920406 258
15 0 7.920406 258
16 0 7.920406 261
17 0 10.740580 257
18 0 10.740580 258
19 0 38.824960 262
20 0 9.916840 239
21 1 6.384678 257
22 1 6.384678 257
23 1 11.596380 270
24 1 11.596380 270
25 1 11.596380 270
26 1 23.709520 288
27 1 23.709520 288
28 1 23.709520 288
29 1 38.568970 262
30 1 38.568970 262
31 1 6.581013 295
32 1 6.581013 298
33 1 0.766564 269
34 1 5.440803 262
35 1 5.440803 262
36 1 19.534710 252
37 1 19.534710 259
38 1 8.359857 263
39 1 10.740580 257
40 1 38.824960 264
41 1 38.824960 264
42 1 41.556970 239
#Dataset name is x
# time to run the glm model
model1<-glm(formula=survival ~ mass + depth, family = "binomial", data=x)
model1
summary(model1)
#Ok now heres how i predict the probability of a lizard "Bob" surviving the winter with a mass of 32.949 grams and a burrow depth of 264 mm
newdata<-data.frame(mass = 32.949, depth = 264)
predict(model1, newdata, type = "response")
# the lizard "Bob" has a 87.3% chance of surviving the winter
#Now lets assume the glm. model was robust and the lizard was endangered,
#from all my research I know the average burrow depth is 263.9 mm at a national park
#lets say i am also interested in survival probabilities at burrow depths of 200 and 100 mm, respectively
#how do i use the valuable glm model produced above to generate a plot
#showing the probability of lizards surviving with average burrow depths stated above
#across a range of mass values from 0.0 to 100.0 grams??????????
#i know i need to use the plot and predict functions but i cannot figure out how to tell R that i
#want to use the glm model i produced to predict "survival" based on "mass" when the other predictor "depth" is held at constant values of biological relevance
#I would also like to add dashed lines for 95% CI

Trying to aggregate ReactionTypes in R

sample of outAct
Activity ReactionType numberActivities
activator activates 16
binding binds 83
recombinase binds 1
branching branches 3
carboxylase carboxylates 36
peptidase cleaves 425
endopeptidase cleaves 368
nuclease cleaves 53
glycosylase cleaves 24
cyclase converts 12
transhydrogenase converts 3
hist deacetylase deacetylates 8
deacetylase deacetylates 16
I want to count all the same ReactionTypes and sum up their numberActivities
reaction_types <-aggregate(numberActivities ~ ReactionType, unique(outAct), FUN=sum)
Desired output
ReactionType number
activates 16
binds 84
branches 3
carboxylates 36
cleaves 870
converts 15
deacetylates 24
Problem is, I’m getting duplicates, i.e. they are not being counted as one unique ReactionType e.g. the output contains rows such as
deacetylates 8
deacetylates 16
There are more examples like this throughout the output file.
Where am I going wrong?
Thanks in advance.
library(dplyr)
outAct %>% group_by(ReactionType) %>% summarise(number = sum(numberActivities))

glm giving different results in R 3.2.0 and R 3.2.2. Was there any substantial change in usage?

I use glm monthly to calculate a binomial model on the payment behaviour of a credit database, using a call like:
modelx = glm(paid ~ ., data = credit_db, family = binomial())
For the last month, I use R version 3.2.2 (just recently upgraded) and the results were very different than the previous month (done with R version 3.2.0). In order to check the code, I repeated the previous month calculations with version 3.2.2 and got different results from the previous calculation done in R 3.2.0.
Coefficients are also very different, in a wild form. I use at the beginning an exploratory model, with a variable that is the average number of delinquency days during the month, which should yield low coefficients for low average. In version 3.2.0, an extract of summary(modelx) was:
## Coefficients: Estimate Std. Error z value
## delinquency_avg_days1 -0.59329 0.18581 -3.193
## delinquency_avg_days2 -1.32286 0.19830 -6.671
## delinquency_avg_days3 -1.47359 0.21986 -6.702
## delinquency_avg_days4 -1.64158 0.21653 -7.581
## delinquency_avg_days5 -2.56311 0.25234 -10.158
## delinquency_avg_days6 -2.59042 0.25886 -10.007
and for version 3.2.2
## Coefficients Estimate Std. Error z value
## delinquency_avg_days.L -1.320e+01 1.083e+03 -0.012
## delinquency_avg_days.Q -1.140e+00 1.169e+03 -0.001
## delinquency_avg_days.C 3.439e+00 1.118e+03 0.003
## delinquency_avg_days^4 8.454e+00 1.020e+03 0.008
## delinquency_avg_days^5 3.733e+00 9.362e+02 0.004
## delinquency_avg_days^6 -4.988e+00 9.348e+02 -0.005
The summary output is a little different, since the Pr(>|z|) is shown. Notice also that the coefficient names changed too.
In the dataset this delinquency_avg_days variable have the following distribution (0 is "not paid", 1 is "paid", and as you can see, coefficients might be large for average days larger than 20 or so. Number of paid was sampled to match closely the number of "not paid".
0 1
0 140 663
1 59 209
2 62 118
3 56 87
4 66 50
5 69 41
6 64 40
7 78 30
8 75 31
9 70 29
10 77 23
11 69 18
12 79 17
13 61 13
14 53 5
15 67 18
16 50 10
17 40 9
18 39 8
19 23 9
20 24 2
21 36 9
22 35 1
23 17 0
24 11 0
25 11 0
26 7 1
27 3 0
28 0 0
29 0 1
30 1 0
In previous months, I used this exploratory model to create a second binomial model using ranges af average delinquency days. But this other model gives similar results with a few levels.
Now, I'd like to know whether there are substantial changes that require specifying other parameters or there is an issue with glm in version 3.2.2.

Resources