How to extract values from a survfit object in R

I've created this model:
model <- survfit(Surv(time,status)~c$sex)
model
and the output is:
Call: survfit(formula = Surv(time, status) ~ c$sex)
records n.max n.start events median 0.95LCL 0.95UCL
c$sex=female 15 15 15 8 720 517 NA
c$sex=male 28 28 28 23 234 145 712
So, I want to extract the median for males and the same for females, but have no idea how to do it.
Here are my attempts to do it:
>model$median
NULL
>summary(model)$table[, "median"]
c$sex=female c$sex=male
720.0 234.5
I want each of the values on its own ("720" and "234.5"). Can somebody help me?
Thanks in advance

You've already got it. All you are seeing printed to the screen is the names attribute of a length-2 numeric vector.
library(survival)
fit <- survfit(Surv(time, status) ~ x, data = aml)
summary(fit)$table
# records n.max n.start events median 0.95LCL 0.95UCL
#x=Maintained 11 11 11 7 31 18 NA
#x=Nonmaintained 12 12 12 11 23 8 NA
# Access first value like any other vector
summary(fit)$table[,'median'][1]
#x=Maintained
# 31
To print without names use unname()...
unname(summary(fit)$table[,'median'])
# [1] 31 23
But you do not need to unname() them to use them as numeric values; removing the names is purely cosmetic...
sum( summary(fit)$table[,'median'] )
[1] 54
For further proof (!) that it is a vector use str()...
str(summary(fit)$table[,'median'])
# Named num [1:2] 31 23
# - attr(*, "names")= chr [1:2] "x=Maintained" "x=Nonmaintained"
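If you would rather pull each group's median by name than by position, the summary table can also be indexed by its row and column labels (a small add-on to the above; with the asker's model the row labels would be "c$sex=female" and "c$sex=male"):
# Single-cell matrix indexing by row and column name returns a plain number
summary(fit)$table["x=Maintained", "median"]
# [1] 31
summary(fit)$table["x=Nonmaintained", "median"]
# [1] 23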

This also works:
> library(survMisc)
> fit <- survfit(Surv(time, status) ~ x, data = aml)
> median(fit)
median
x=Maintained 31
x=Nonmaintained 23
And without the names (i.e. removing the data.frame structure):
> unname(unlist(median(fit)))
[1] 31 23
This is also convenient if you want the confidence interval as well (the default CI method is 'log'):
> median(fit, CI=TRUE)
median lower upper
x=Maintained 31 13 NA
x=Nonmaintained 23 5 43
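As an aside (not part of the original answer), the survival package itself also provides a quantile() method for survfit objects in reasonably recent versions, so the medians and their confidence limits can be extracted without survMisc; a minimal sketch:
q <- quantile(fit, probs = 0.5, conf.int = TRUE)
q$quantile   # matrix of medians, one row per group
q$lower      # lower confidence limits
q$upper      # upper confidence limits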

Related

fit a normal distribution to grouped data, giving expected frequencies

I have a frequency distribution of observations, grouped into counts within class intervals.
I want to fit a normal (or other continuous) distribution, and find the expected frequencies in each interval according to that distribution.
For example, suppose the following, where I want to calculate another column, expected, giving the
expected number of soldiers with chest circumferences in the interval given by chest, where the intervals
are assumed to be centered on the nominal value, e.g., 35 means 34.5 <= y < 35.5. One analysis I've seen gives the expected frequency in this cell as 72.5 vs. the observed 81.
> data(ChestSizes, package="HistData")
>
> ChestSizes
chest count
1 33 3
2 34 18
3 35 81
4 36 185
5 37 420
6 38 749
7 39 1073
8 40 1079
9 41 934
10 42 658
11 43 370
12 44 92
13 45 50
14 46 21
15 47 4
16 48 1
>
> # ungroup to a vector of values
> chests <- vcdExtra::expand.dft(ChestSizes, freq="count")
There are quite a number of variations of this question, most of which relate to plotting the normal density on top of a histogram, scaled to represent counts not density. But none explicitly show the calculation of the expected frequencies. One close question is R: add normal fits to grouped histograms in ggplot2
I can perfectly well do the standard plot (below), but for other things, like a Chi-square test or a vcd::rootogram plot, I need the expected frequencies in the same class intervals.
library(ggplot2)

bw <- 1
n_obs <- nrow(chests)
xbar <- mean(chests$chest)
std <- sd(chests$chest)

plt <- ggplot(chests, aes(chest)) +
  geom_histogram(color = "black", fill = "lightblue", binwidth = bw) +
  stat_function(fun = function(x) dnorm(x, mean = xbar, sd = std) * bw * n_obs,
                color = "darkred", size = 1)
plt
Here is how you could calculate the expected frequencies for each group assuming normality: the expected count in each interval is the total count multiplied by the normal probability mass of that interval, i.e. pnorm(upper) - pnorm(lower), with the interval boundaries at the half-integers.
xbar <- with(ChestSizes, weighted.mean(chest, count))
sdx <- with(ChestSizes, sd(rep(chest, count)))
transform(ChestSizes, Expected = diff(pnorm(c(32, chest) + .5, xbar, sdx)) * sum(count))
chest count Expected
1 33 3 4.7600583
2 34 18 20.8822328
3 35 81 72.5129162
4 36 185 199.3338028
5 37 420 433.8292832
6 38 749 747.5926687
7 39 1073 1020.1058521
8 40 1079 1102.2356155
9 41 934 943.0970605
10 42 658 638.9745241
11 43 370 342.7971793
12 44 92 145.6089948
13 45 50 48.9662992
14 46 21 13.0351612
15 47 4 2.7465640
16 48 1 0.4579888
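A small follow-up (not part of the original answer): once the Expected column exists, the chi-square statistic and the vcd::rootogram mentioned in the question follow directly. A minimal sketch, reusing xbar and sdx from above and assuming vcd::rootogram's (observed, fitted) argument order:
chest_fit <- transform(ChestSizes,
                       Expected = diff(pnorm(c(32, chest) + .5, xbar, sdx)) * sum(count))

# Chi-square goodness-of-fit statistic (df would be 16 - 1 - 2 = 13,
# since two parameters were estimated from the data)
with(chest_fit, sum((count - Expected)^2 / Expected))

# Hanging rootogram of observed counts against the fitted expectations
vcd::rootogram(chest_fit$count, chest_fit$Expected)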

How to create a messy_impute() function that imputes NA values in messy data with mean or median?

I have the following data frame for a student with homework and exam scores.
> student1
UID Homework_1 Homework_2 Homework_3 Homework_4 Homework_5 Homework_6 Homework_7 Homework_8
10 582493224 59 99 88 10 66 90 50 80
Homework_9 Homework_10 Exam_1 Exam_2 Exam_3 Section
10 16 NA 41 61 11 A
The Homework_10 score is missing, and I need to create a function to impute the NA value with mean or median.
The function messy_impute should have the following arguments:
data : data frame or tibble to be imputed.
center : whether to impute using mean or median.
margin : whether to use the row or the column to impute the value (1 = use row, 2 = use column).
For example,
messy_impute(student1,mean,1) should print out
> student1
UID Homework_1 Homework_2 Homework_3 Homework_4 Homework_5 Homework_6 Homework_7 Homework_8
10 582493224 59 99 88 10 66 90 50 80
Homework_9 Homework_10 Exam_1 Exam_2 Exam_3 Section
10 16 **62** 41 61 11 A
since the mean of the rest of the homework is 62.
And, if the mean of the columns (other students) in section A for homework 10 is 50, then
messy_impute(student1,mean,2) should print out
> student1
UID Homework_1 Homework_2 Homework_3 Homework_4 Homework_5 Homework_6 Homework_7 Homework_8
10 582493224 59 99 88 10 66 90 50 80
Homework_9 Homework_10 Exam_1 Exam_2 Exam_3 Section
10 16 **50** 41 61 11 A
since the mean of that column among the students in section A is 50.
Please note that if the margin is 2, the calculation should be done within the same section.
I'm really stuck on defining this function.
Base R solution:
# Define a function to impute a row-wise mean (assumes one observation per student):
row_wise_mean_impute <- function(df) {
  grade_df <- df[, names(df) != "studid"]
  cbind(df[, "studid", drop = FALSE],
        replace(grade_df, is.na(grade_df),
                apply(grade_df, 1, mean, na.rm = TRUE)))
}

# Apply function:
row_wise_mean_impute(student1)
Data:
x <- c(rnorm(85, 50, 3), rnorm(15, 50, 15))
student1 <- cbind(studid = 1010101, data.frame(t(x)))
student1[, 10] <- NA_real_
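For the full messy_impute() interface described in the question, here is a minimal sketch (my own, not part of the answer above), assuming the score columns are named Homework_* / Exam_* and that a Section column identifies the group used when margin = 2:
messy_impute <- function(data, center = mean, margin = 1) {
  hw_cols   <- grep("^Homework_", names(data))
  exam_cols <- grep("^Exam_", names(data))
  for (j in c(hw_cols, exam_cols)) {
    # a homework score is imputed from the other homeworks, an exam from the other exams
    peers <- if (j %in% hw_cols) hw_cols else exam_cols
    for (i in which(is.na(data[[j]]))) {
      if (margin == 1) {
        # row-wise: center of the student's own remaining scores
        data[i, j] <- center(unlist(data[i, peers]), na.rm = TRUE)
      } else {
        # column-wise: center of the same column within the student's section
        same_sec <- data$Section == data$Section[i]
        data[i, j] <- center(data[same_sec, j, drop = TRUE], na.rm = TRUE)
      }
    }
  }
  data
}

# Usage: messy_impute(student1, mean, 1)   # row-wise mean, as in the example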

R - caret createDataPartition returns more samples than expected

I'm trying to split the iris dataset into a training set and a test set. I used createDataPartition() like this:
library(caret)
createDataPartition(iris$Species, p=0.1)
# [1] 12 22 26 41 42 57 63 79 89 93 114 117 134 137 142
createDataPartition(iris$Sepal.Length, p=0.1)
# [1] 1 27 44 46 54 68 72 77 83 84 93 99 104 109 117 132 134
I understand the first call: I get a vector of 0.1 * 150 = 15 elements (150 is the number of samples in the dataset). I expected a vector of the same length from the second call, but instead I get 17 elements rather than 15.
Any ideas as to why I get these results?
Sepal.Length is a numeric feature; from the online documentation:
For numeric y, the sample is split into groups sections based on percentiles and sampling is done within these subgroups. For createDataPartition, the number of percentiles is set via the groups argument.
groups: for numeric y, the number of breaks in the quantiles
with default value:
groups = min(5, length(y))
Here is what happens in your case:
Since you do not specify groups, it takes the value min(5, 150) = 5 breaks; in this case, these breaks coincide with the natural quartiles, i.e. the minimum, the 1st quartile, the median, the 3rd quartile, and the maximum, which you can see from the summary:
> summary(iris$Sepal.Length)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.300 5.100 5.800 5.843 6.400 7.900
For numeric features, the function will take a percentage of p = 0.1 from each one of the (4) intervals defined by the above breaks (quantiles); let's see how many samples we have per such interval:
l1 = length(which(iris$Sepal.Length >= 4.3 & iris$Sepal.Length <= 5.1)) # 41
l2 = length(which(iris$Sepal.Length > 5.1 & iris$Sepal.Length <= 5.8)) # 39
l3 = length(which(iris$Sepal.Length > 5.8 & iris$Sepal.Length <= 6.4)) # 35
l4 = length(which(iris$Sepal.Length > 6.4 & iris$Sepal.Length <= 7.9)) # 35
Exactly how many samples will be returned from each interval? Here is the catch: according to line #140 of the source code, it is the ceiling of the product between the number of samples in the interval and your p; let's see what this gives in your case for p = 0.1:
p <- 0.1
ceiling(l1*p) + ceiling(l2*p) + ceiling(l3*p) + ceiling(l4*p)
# 17
Bingo! :)
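If you simply need exactly floor(0.1 * n) indices and do not care about the stratified sampling that createDataPartition performs, a plain random sample sidesteps the ceiling effect (a workaround, not a feature of caret itself):
set.seed(123)  # arbitrary seed, for reproducibility only
idx <- sample(nrow(iris), size = floor(0.1 * nrow(iris)))
length(idx)
# [1] 15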

Creating a data set with paired data and converting it into a matrix

So, I'm using R to try and do a phylogenetic PCA on a dataset that I have using the phyl.pca function from the phytools package. However, I'm having issues organising my data in a way that the function will accept! And that's not all: I did a bit of experimenting and I know that there are more issues further down the line, which I will get into...
Getting straight to the issue, here's the data frame (with dummy data) that I'm using:
>all
Taxa Tibia Feather
1 Microraptor 138 101
2 Microraptor 139 114
3 Microraptor 145 141
4 Anchiornis 160 81
5 Anchiornis 14 NA
6 Archaeopteryx 134 82
7 Archaeopteryx 136 71
8 Archaeopteryx 132 NA
9 Archaeopteryx 14 NA
10 Scansoriopterygidae 120 85
11 Scansoriopterygidae 116 NA
12 Scansoriopterygidae 123 NA
13 Sapeornis 108 NA
14 Sapeornis 112 86
15 Sapeornis 118 NA
16 Sapeornis 103 NA
17 Confuciusornis 96 NA
18 Confuciusornis 107 30
19 Confuciusornis 148 33
20 Confuciusornis 128 61
The taxa are arranged into a tree (called "tree") with Microraptor being the most basal and then progressing in order through to Confuciusornis:
>summary(tree)
Phylogenetic tree: tree
Number of tips: 6
Number of nodes: 5
Branch lengths:
mean: 1
variance: 0
distribution summary:
Min. 1st Qu. Median 3rd Qu. Max.
1 1 1 1 1
No root edge.
Tip labels: Confuciusornis
Sapeornis
Scansoriopterygidae
Archaeopteryx
Anchiornis
Microraptor
No node labels.
And the function:
>phyl.pca(tree, all, method="BM", mode="corr")
And this is the error that is coming up:
Error in phyl.pca(tree, all, method = "BM", mode = "corr") :
number of rows in Y cannot be greater than number of taxa in your tree
Y being the "all" data frame. So I have 6 taxa in my tree (matching the 6 taxa in the data frame) but there are 20 rows in my data frame. So I used this function:
> all_agg <- aggregate(all[,-1],by=list(all$Taxa),mean,na.rm=TRUE)
And got this:
Group.1 Tibia Feather
1 Anchiornis 153 81
2 Archaeopteryx 136 77
3 Confuciusornis 120 41
4 Microraptor 141 119
5 Sapeornis 110 86
6 Scansoriopterygidae 120 85
It's a bit odd that the order of the taxa has changed... Is this ok?
In any case, I converted it into a matrix:
> all_agg_matrix <- as.matrix(all_agg)
> all_agg_matrix
Group.1 Tibia Feather
[1,] "Anchiornis" "153" "81"
[2,] "Archaeopteryx" "136" "77"
[3,] "Confuciusornis" "120" "41"
[4,] "Microraptor" "141" "119"
[5,] "Sapeornis" "110" "86"
[6,] "Scansoriopterygidae" "120" "85"
And then used the phyl.pca function:
> phyl.pca(tree, all_agg_matrix, method = "BM", mode = "corr")
[1] "Y has no names. function will assume that the row order of Y matches tree$tip.label"
Error in invC %*% X : requires numeric/complex matrix/vector arguments
So, now the order that the function is considering taxa in is all wrong (but I can fix that relatively easily). The issue is that phyl.pca doesn't seem to believe that my matrix is actually a matrix. Any ideas why?
I think you may have bigger problems. Most phylogenetic methods, I suspect including phyl.pca, assume that traits are fixed at the species level (i.e., they don't account for within-species variation). Thus, if you want to use phyl.pca, you probably need to collapse your data to a single value per species, e.g. via
dd_agg <- aggregate(dd[,-1],by=list(dd$Taxa),mean,na.rm=TRUE)
Extract the numeric columns and label the rows properly so that phyl.pca can match them up with the tips correctly:
dd_mat <- dd_agg[,-1]
rownames(dd_mat) <- dd_agg[,1]
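(Side note, added for clarity and not part of the original answer: the "requires numeric/complex matrix/vector arguments" error arose because as.matrix() on a data frame that still contains the character taxon column coerces every cell to character, so phyl.pca receives a character matrix. Keeping only the numeric columns and moving the taxa into the row names, as above, avoids this:)
class(as.matrix(dd_agg)[, "Tibia"])   # "character" -- every cell was coerced
storage.mode(as.matrix(dd_mat))       # "double"    -- numeric, as phyl.pca expects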
Using these aggregated data, I can make up a tree (since you didn't give us one) and run phyl.pca ...
library(phytools)
tt <- rcoal(nrow(dd_agg),tip.label=dd_agg[,1])
phyl.pca(tt,dd_mat)
If you do need to do an analysis that takes within-species variation into account, you might need to ask somewhere more specialized, e.g. the r-sig-phylo@r-project.org mailing list ...
The answer posted by Ben Bolker seems to work, whereby the data (called "all") is collapsed into a single value per species before creating a matrix and running the function, like so:
> all_agg <- aggregate(all[,-1],by=list(all$Taxa),mean,na.rm=TRUE)
> all_mat <- all_agg[,-1]
> rownames(all_mat) <- all_agg[,1]
> phyl.pca(tree,all_mat, method= "lambda", mode = "corr")
Thanks to everyone who contributed an answer and especially Ben! :)

Creating folds for k-fold CV in R using Caret

I'm trying to make a k-fold CV for several classification methods/hiperparameters using the data available at
http://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data.
This set is made of 208 rows, each with 60 attributes. I'm reading it into a data.frame using the read.table function.
The next step is to split my data into k folds, let's say k = 5. My first attempt was to use
test <- createFolds(t, k=5)
I had two issues with this. The first one is that the fold sizes are not close to each other:
      Length Class  Mode
Fold1 29     -none- numeric
Fold2 14     -none- numeric
Fold3  7     -none- numeric
Fold4  5     -none- numeric
Fold5  5     -none- numeric
The other one is that this apparently split my data according to the attribute indices, but I want to split the observations themselves. I thought transposing my data.frame would fix that:
test <- t(myDataNumericValues)
But when I call createFolds on the transposed data, it gives me something like this:
      Length Class  Mode
Fold1 2496   -none- numeric
Fold2 2496   -none- numeric
Fold3 2495   -none- numeric
Fold4 2496   -none- numeric
Fold5 2497   -none- numeric
The length issue was solved, but it's still not splitting my 208 observations properly.
What can I do? Is the caret package perhaps not the most appropriate one for this?
Please read ?createFolds to understand what the function does. It creates the indices that define which samples are held out in each fold (see the options to return the converse, i.e. the training-set indices):
> library(caret)
> library(mlbench)
> data(Sonar)
>
> folds <- createFolds(Sonar$Class)
> str(folds)
List of 10
$ Fold01: int [1:21] 25 39 58 63 69 73 80 85 90 95 ...
$ Fold02: int [1:21] 19 21 42 48 52 66 72 81 88 89 ...
$ Fold03: int [1:21] 4 5 17 34 35 47 54 68 86 100 ...
$ Fold04: int [1:21] 2 6 22 29 32 40 60 65 67 92 ...
$ Fold05: int [1:20] 3 14 36 41 45 75 78 84 94 104 ...
$ Fold06: int [1:21] 10 11 24 33 43 46 50 55 56 97 ...
$ Fold07: int [1:21] 1 7 8 20 23 28 31 44 71 76 ...
$ Fold08: int [1:20] 16 18 26 27 38 57 77 79 91 99 ...
$ Fold09: int [1:21] 13 15 30 37 49 53 74 83 93 96 ...
$ Fold10: int [1:21] 9 12 51 59 61 62 64 70 82 87 ...
To use these to split the data:
> split_up <- lapply(folds, function(ind, dat) dat[ind,], dat = Sonar)
> dim(Sonar)
[1] 208 61
> unlist(lapply(split_up, nrow))
Fold01 Fold02 Fold03 Fold04 Fold05 Fold06 Fold07 Fold08 Fold09 Fold10
21 21 21 21 20 21 21 20 21 21
The function train is used in this package to do the actual modeling; you don't usually need to do the splitting yourself (see this page).
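For completeness, here is a minimal sketch (my own, not part of the original answer) of handing your own folds to train() via trainControl(); returnTrain = TRUE makes createFolds return the indices kept for training rather than those held out:
library(caret)
library(mlbench)
data(Sonar)

train_folds <- createFolds(Sonar$Class, k = 5, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = train_folds)

# rpart is used here only as an example model
set.seed(1)
fit <- train(Class ~ ., data = Sonar, method = "rpart", trControl = ctrl)
fit$resample   # per-fold performance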
I'm not familiar with the caret package, but I once wrote a function that calculates CV based on a decision tree from the rpart package. Of course, the function needs modifying to suit your purpose.
library(rpart)  # the CV function below builds and prunes an rpart tree

CV <- function(form, x, fold = 10, cp = 0.01) {
  # x is the data; form is the model formula, e.g. Class ~ .
  n <- nrow(x)
  prop <- n %/% fold
  set.seed(7)
  # randomly assign each row to one of the folds
  newseq <- rank(runif(n))
  k <- as.factor((newseq - 1) %/% prop + 1)
  # name of the response variable, taken from the formula
  y <- unlist(strsplit(as.character(form), " "))[2]
  vec.accuracy <- vector(length = fold)
  for (i in seq(fold)) {
    # It depends on which classification method you use
    fit <- rpart(form, data = x[k != i, ], method = "class")
    fit.prune <- prune(fit, cp = cp)
    fcast <- predict(fit.prune, newdata = x[k == i, ], type = "class")
    cm <- table(x[k == i, y], fcast)
    accuracy <- (cm[1, 1] + cm[2, 2]) / sum(cm)
    vec.accuracy[i] <- accuracy
  }
  avg.accuracy <- mean(vec.accuracy)
  avg.error <- 1 - avg.accuracy
  cv <- data.frame(Accuracy = avg.accuracy, Error = avg.error)
  return(cv)
}
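For example, on the Sonar data used above it could be called like this (a usage sketch; it assumes rpart and mlbench are installed and that the formula names the class column):
library(mlbench)
data(Sonar)
CV(Class ~ ., Sonar, fold = 5, cp = 0.01)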
