Program formula to be jackknifed using R

I have a rather complex question about how to write R code, as I have gone through all the options known to me without any result. I can do the first part in Excel in less than an hour, i.e., get a result for the following formula.
C = (1/n) * Σ_j n_j (c_j - a_j)^2
where n is the total number of observations, n_j is the number of observations in class interval j, c_j is the mean confidence rating in interval j, a_j is the proportion of correct responses in interval j, and Σ_j sums over the class intervals.
The reason I want to do it in R is that (1) everything can be done at once with the same program, and, more importantly, (2) I need to jackknife (or bootstrap) this formula to get the jackknife SEs (which I need for the inferential confidence intervals).
This is the data
> str(data)
'data.frame': 372 obs. of 6 variables:
$ ID : int 120 432 664 163 283 659 78 150 158 188 ...
$ age : int 20 20 16 18 20 15 20 20 16 20 ...
$ sex : Factor w/ 2 levels "female","male": 1 1 1 2 2 2 1 1 1 2 ...
$ poconf123 : int 1 1 1 1 1 1 1 1 1 1 ...
$ PoConfPC : int 40 50 40 30 10 50 50 30 40 30 ...
$ idacc : Factor w/ 2 levels "accurate","inaccurate": 1 1 1 1 1 1 2 2 2 2 ...
To get n_j, c_j, and a_j, I tried the following code:
group_by(data, poconf123) %>% # this is to get results for each class interval
  summarise(poconf = mean(PoConfPC)) %>% # this is to get cj
  summarise(NumAcc = n(), prop = mean(idacc == 1) / n()) # this is to get nj and aj
Using the code above I get the following error message
Error in summarise_impl(.data, dots):
Evaluation error: object 'PoConfPC' not found.
The first two lines alone, or the first and the last lines alone do not produce the same error. However, the code
group_by(data, poconf123) %>%
  summarise(NumAcc = n(), prop = mean(idacc == 1) / n())
still seems suboptimal, because instead of getting proportions (e.g., .53), I get 0 in the entire 'prop' column.
Of course, I could ask for a proportion table, e.g.,
prop.table(table(data$idacc))
to get the proportions of both accurate and inaccurate responses, but this is not really what I need.
So, overall, I never even get to the point where I can run the analysis and obtain a single C, let alone the jackknife SE. And this brings me to the actual question: how can I program this formula into a function so that I can 'jackknife' it?
Below is an example of how to jackknife the point-biserial correlation with the outputs required for inferential confidence intervals.
library(bootstrap)
library(ltm) # provides biserial.cor()
xx = data$PoConfPC
yy = data$idacc
nn = length(xx)
theta <- function(x,xx,yy) {biserial.cor(xx[x], yy[x], level = 1)}
results <- jackknife(1:nn, theta, xx, yy)
summary(results$jack.values); results$jack.se; results$jack.bias
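For reference, a minimal sketch of how the C formula could be wrapped into a function and jackknifed the same way. Two assumptions here: idacc == "accurate" marks a correct response, and PoConfPC is rescaled by /100 so that c_j and a_j are on the same 0-1 scale.
library(dplyr)
library(bootstrap)
calc_C <- function(idx, df) {
  d <- df[idx, ]
  s <- d %>%
    group_by(poconf123) %>%
    summarise(nj = n(),                        # nj: observations per class interval
              cj = mean(PoConfPC / 100),       # cj: mean confidence (assumed rescaling)
              aj = mean(idacc == "accurate"))  # aj: proportion correct
  sum(s$nj * (s$cj - s$aj)^2) / nrow(d)
}
results <- jackknife(1:nrow(data), calc_C, data)
results$jack.se; results$jack.bias
Note that a single summarise() computes n_j, c_j, and a_j at once; the 'PoConfPC not found' error above arises because the first summarise() in the chain drops the raw columns before the second one runs. The zeros in 'prop' come from comparing the factor idacc against 1 (for factors, == compares against the level labels, so idacc == 1 is always FALSE) and from dividing by n() a second time.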

Related

How to use knn classification (class package) using training and test datasets

Df_census is the original data frame. I am trying to use Sex, EducYears and Age to predict whether a person's Income is "<=50K" or ">50K".
There are 20,000 rows in x_train_auto (training set) and 12,561 in x_test_auto (test set).
My classification variable (training set) has 15,124 <=50k and 4876 >50k.
Here is my code:
predictions = knn(train = x_train_auto, # response
                  test = x_test_auto, # response
                  cl = Df_census$Income[in_train_census], # prediction
                  k = 25)
table(predictions)
#<=50K
#12561
As you can see, all 12,561 test samples were predicted to have an Income of "<=50K".
This doesn't make sense, and I am not sure where I am going wrong.
P.S.: I have Sex one-hot encoded as 0 for male and 1 for female, and I have scaled EducYears and Age. I then added the one-hot encoded Sex variable back into the scaled test and train data.
identifying the problem
Your provided x_test_auto.csv data suggests that you passed logical vectors of TRUEs and FALSEs (which define the indices of training and test samples rather than the actual data) to the train and test arguments of class::knn.
the solution
Rather, use the logical vector in x_train_auto (which I believe corresponds to in_train_census in your example) to define two separate data.frames, each containing all your desired predictors. These are then the training and the test set.
p <- c("Age","EducYears","Sex")
Df_train <- Df_census[in_train_census,p]
Df_test <- Df_census[!in_train_census,p]
In the knn function, pass the training set to the train argument, and the test set to the test argument, and further pass the outcome / target variable of the training set (as a factor) to cl.
The output (see ?class::knn) will be the predicted outcome for the test set.
Here is a complete and reproducible workflow using your data.
the data
library(class)
# read data from Dropbox
x_train_auto <- read.csv("https://dropbox.com/s/6kupkp4u4qyizy7/x_test_auto.csv?dl=1", row.names = 1)
Df_census <- read.csv("https://dropbox.com/s/ccvck8ajnatmpv0/Df_census.csv?dl=1", row.names = 1, stringsAsFactors = TRUE)
table(x_train_auto) # TRUE are training, FALSE are test set
#> x_train_auto
#> FALSE TRUE
#> 12561 20000
str(Df_census) # Income as factor, Sex is binary, Age and EducYears are numeric
#> 'data.frame': 32561 obs. of 15 variables:
#> $ Age : int 39 50 38 53 28 37 49 52 31 42 ...
#> $ Work : Factor w/ 9 levels "?","Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
#> $ Fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
#> $ Education : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
#> $ EducYears : int 13 13 9 7 13 14 5 9 14 13 ...
#> $ MaritalStatus: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
#> $ Occupation : Factor w/ 15 levels "?","Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
#> $ Relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
#> $ Race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
#> $ Sex : int 1 1 1 1 0 0 0 1 0 1 ...
#> $ CapitalGain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
#> $ CapitalLoss : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ HoursPerWeek : int 40 13 40 40 40 40 16 45 50 40 ...
#> $ NativeCountry: Factor w/ 42 levels "?","Cambodia",..: 40 40 40 40 6 40 24 40 40 40 ...
#> $ Income : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
# predictors and response
p <- c("Age","EducYears","Sex")
y <- "Income"
# create data partition
in_train_census <- x_train_auto$x
Df_train <- Df_census[in_train_census,]
Df_test <- Df_census[!in_train_census,]
# check
dim(Df_train)
#> [1] 20000 15
dim(Df_test)
#> [1] 12561 15
table(Df_train$Income)
#>
#> <=50K >50K
#> 15124 4876
using class::knn
The knn (k-nearest-neighbors) algorithm can perform better or worse depending on the choice of the hyperparameter k. It's often difficult to know which k value is best for the classification of a particular dataset. In a machine learning setting, you'd want to try out different values of k to find a value that gives the highest performance on your test dataset (i.e., data which was not used for model fitting).
It's always important to strike a good balance between overfitting (model is too complex, and will give good results on the training data, but less accurate or even rubbish results on new test data) and underfitting (model is too trivial to explain the actual patterns in the data). In the case of knn, using a larger k value would probably better safeguard against overfitting, according to the explanations here.
# apply knn for various k using the given training / test set
r <- data.frame(array(NA, dim = c(0, 2), dimnames = list(NULL, c("k","accuracy"))))
for (k in 1:30) {
  # cat("k =", k, "\n")
  # fit model on training set, predict test set data
  set.seed(60402) # to be reproducible
  predictions <- knn(train = Df_train[, p],
                     test = Df_test[, p],
                     cl = Df_train[, y],
                     k = k)
  # confusion matrix on test set
  t <- table(pred = predictions, ref = Df_test[, y])
  # accuracy
  a <- sum(diag(t)) / sum(t)
  # bind
  r <- rbind(r, data.frame(k = k, accuracy = a))
}
visualize model assessment
# find best k
r[which.max(r$accuracy),]
#> k accuracy
#> 17 17 0.8007324
(k.best <- r[which.max(r$accuracy),"k"])
#> [1] 17
# plot
with(r, plot(k, accuracy, type = "l"))
abline(v = k.best, lty = 2)
Created on 2021-09-23 by the reprex package (v2.0.1)
interpretation
The loop results suggest that your optimal value of k for this particular training and test set is between 12 and 17 (see plot above), but the accuracy gain is very small compared to using k = 1 (it's at around 80% regardless of k).
additional thoughts
Given that high income is rarer compared to lower income, accuracy might not be the desired performance metric. Sensitivity might be equally or more important, and you could modify the example code to calculate and assess other performance metrics instead.
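For example, a small sketch of the sensitivity (true-positive rate) for the rarer ">50K" class, read off the confusion matrix t built in the loop above:
# proportion of true ">50K" cases that were predicted as ">50K"
sens <- t[">50K", ">50K"] / sum(t[, ">50K"])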
In addition to pure prediction, you might want to explore whether other variables could be informative predictors of the Income class, by adding them to the p vector and comparing the resulting accuracies.
Here, we base our conclusions on a particular realization of training and test data. Better machine learning practice would be to split your data in two (as here), but then repeatedly split the training set again to fit and assess many more models, using e.g. (repeated) k-fold cross-validation. Good packages for this in R are e.g. caret or tidymodels.
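A minimal sketch of what that could look like with caret (the fold counts and the k grid are illustrative choices, not recommendations):
library(caret)
set.seed(60402)
# 10-fold cross-validation, repeated 3 times, on the training set only
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
fit <- train(Income ~ Age + EducYears + Sex, data = Df_train,
             method = "knn", trControl = ctrl,
             tuneGrid = data.frame(k = seq(1, 30, 2)))
fit$bestTune # k selected by repeated cross-validation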
To gain a better understanding regarding which variables are the best predictors of Income class, I would also carry out a logistic regression on various uncorrelated predictors.
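For instance, a sketch using the same three predictors (Income's second factor level, ">50K", is modeled as the event):
fit_glm <- glm(Income ~ Age + EducYears + Sex,
               family = binomial, data = Df_train)
summary(fit_glm) # coefficient signs and sizes indicate direction and strength of association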

R: Create sample with at least one element from each category

For linear regression to predict house prices, I need to make train and test samples in 80% and 20% proportion.
However, some of the variables are factors, and a few factor levels have just one observation under them.
Because of this, random sampling can put those levels in the test sample but not in the train sample.
Hence, when predicting the Sale Price in the test set, this error comes up:
"Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor Exterior1st has new levels ImStucc"
Here is the summary of the train sample of Exterior1st variable:
> summary(train$Exterior1st)
AsbShng AsphShn BrkComm BrkFace CBlock CemntBd HdBoard ImStucc MetalSd Plywood Stone Stucco
11 0 1 36 0 41 173 0 164 78 2 17
VinylSd Wd Sdng WdShing
389 140 17
Here is summary of the test sample of Exterior1st variable:
> summary(test$Exterior1st)
AsbShng AsphShn BrkComm BrkFace CBlock CemntBd HdBoard ImStucc MetalSd Plywood Stone Stucco
4 0 0 8 1 11 37 1 37 22 0 4
VinylSd Wd Sdng WdShing
97 43 3
As you can see, the ImStucc level of this variable is present in the test sample but not in the train sample, which is why the predict function throws the error mentioned above.
In my search for a solution, I came across a function called "stratified",
but that function does not seem to work for me in R.
There was another solution using dplyr's group_by, but there one has to specify the number of observations for each group; that is not suitable for this dataset, as it would require a calculation for each factor.
Another solution sampled a vector alone rather than the data frame, so it does not help either.
t= sample(c(filtered_data$Exterior1st,sample(filtered_data$Exterior1st,size = 1000, replace = TRUE)))
> table(t)
t
1 3 4 5 6 7 8 9 10 11 12 13 14 15
26 2 74 2 91 375 1 345 168 3 37 848 329 36
The above sampling gives a total of 2337 entries even though the size given is 1000, presumably because c() concatenates the full Exterior1st vector with the 1000 resampled values. Hence, this is not what I'm looking for.
Is there a method to create a sample of 80% of the data such that at least one observation of each factor level is present in the sample?
If there isn't, what is the workaround for this situation?
Maybe I am misreading, but if you only have one observation of a categorical level, you won't be able to use that factor level, ImStucc, in a regression.
I would remove that variable from the model, collect more data, or aggregate the level with other levels (if possible). (Judging from table t, I would probably not include levels 2, 5, or 11 either, because they also have very few observations.)
Also, the function sample() (when replace = TRUE) can choose the same observation multiple times. Set replace = FALSE to avoid duplicated entries.
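For the 80% split itself, here is a minimal base-R sketch (assuming filtered_data has no empty factor levels; run droplevels() first if needed): first take one row per level of Exterior1st, then fill the rest of the 80% at random.
set.seed(1)
n <- nrow(filtered_data)
# one row index per level of Exterior1st
seed_idx <- vapply(split(seq_len(n), filtered_data$Exterior1st),
                   function(i) i[sample.int(length(i), 1L)],
                   integer(1))
# fill the remaining training slots from the leftover rows, without replacement
rest <- setdiff(seq_len(n), seed_idx)
train_idx <- c(seed_idx, sample(rest, floor(0.8 * n) - length(seed_idx)))
train <- filtered_data[train_idx, ]
test <- filtered_data[-train_idx, ]
To cover several factor columns at once, union the seed indices over the columns. Singleton levels will then never appear in the test set, which is what avoids the "new levels" error (though, as said above, a level with one observation carries little information for the regression).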

Why does coxph() return some coefficients as NA when using survSplit() in R?

I'm working with survival data and using survSplit() to deal with non-proportionality via time-dependent coefficients. The model is based on the article by Terry Therneau et al. (2020) (https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf).
I have a factor variable with 6 levels representing different types of knee prostheses. When I apply survSplit() with any cutpoints, the coefficients for the reference level of this time-adjusted factor appear as NA in the results. There is no collinearity, and the problem can be reproduced with other factor variables in the data. Also, if I change the reference level by reordering the factor levels, the new reference level is NA anyway.
The problem is reproduced below with the factor variable celltype in the veteran data:
str(veteran)
'data.frame': 137 obs. of 8 variables:
$ trt : num 1 1 1 1 1 1 1 1 1 1 ...
$ celltype: Factor w/ 4 levels "squamous","smallcell",..: 1 1 1 1 1 1 1 1 1 1 ...
$ time : num 72 411 228 126 118 10 82 110 314 100 ...
$ status : num 1 1 1 1 1 1 1 1 1 0 ...
$ karno : num 60 70 60 60 70 20 40 80 50 70 ...
$ diagtime: num 7 5 3 9 11 5 10 29 18 6 ...
$ age : num 69 64 38 63 65 49 69 68 43 70 ...
$ prior : num 0 10 0 10 10 0 10 0 0 0 ...
library(tidyverse)
library(survival)
library(survminer)
df <- veteran
cox <- coxph(Surv(time, status) ~ celltype + age + prior, data = df)
cox.zph(cox, terms=F)
cox_tdc <- survSplit(Surv(time, status) ~ .,
                     data = df,
                     cut = c(150),
                     zero = 0,
                     episode = "tgroup",
                     id = "id") %>%
  dplyr::select(id, tstart, time, status, tgroup, celltype, age, prior)
coxph(Surv(tstart, time, status) ~
        celltype:strata(tgroup) + age + prior,
      data = cox_tdc)
Call:
coxph(formula = Surv(tstart, time, status) ~ celltype:strata(tgroup) +
age + prior, data = cox_tdc)
coef exp(coef) se(coef) z p
age 0.005686 1.005702 0.009494 0.599 0.549262
prior 0.008592 1.008629 0.020661 0.416 0.677516
celltypesquamous:strata(tgroup)tgroup=1 0.300732 1.350848 0.360243 0.835 0.403828
celltypesmallcell:strata(tgroup)tgroup=1 1.172992 3.231649 0.325177 3.607 0.000309
celltypeadeno:strata(tgroup)tgroup=1 1.232753 3.430660 0.352423 3.498 0.000469
celltypelarge:strata(tgroup)tgroup=1 NA NA 0.000000 NA NA
celltypesquamous:strata(tgroup)tgroup=2 -1.160625 0.313290 0.450989 -2.574 0.010067
celltypesmallcell:strata(tgroup)tgroup=2 -0.238994 0.787420 0.542002 -0.441 0.659252
celltypeadeno:strata(tgroup)tgroup=2 1.455195 4.285319 0.837621 1.737 0.082335
celltypelarge:strata(tgroup)tgroup=2 NA NA 0.000000 NA NA
Likelihood ratio test=34.54 on 8 df, p=3.238e-05
n= 171, number of events= 128
The problem with this is that I cannot test the Schoenfeld residuals, because cox.zph() throws an error due to the NAs: "Error in solve.default(imat, u) :
system is computationally singular: reciprocal condition number = 5.09342e-19".
The NA problem does not occur if I don't use time-dependent coefficients (:strata(tgroup)).
Has anyone dealt with this problem before? Why are some of the coefficients NA? I really appreciate your help with this!
EDIT: example was changed to include reproducible data.
EDIT2: Fixed the time cutpoint in the example, which had resulted in biased coefficients.
EDIT3: I asked Terry Therneau about this problem and the email conversation is below:
Terry Therneau:
There are two issues here.
The model.matrix routine is used by lm, glm, coxph and a host of other routines to create the X matrix for regression. It tries to be intelligent so as not to create redundant columns in X; those columns would end up with an NA coefficient. It is pretty good, but not perfect. Your case of a model with a strata(tgroup):factor term is one where it leaves in too many. The extra NAs in the printout are a nuisance, but not something that I can fix.
The cox.zph routine, on the other hand, is my problem. It should ignore those NA columns, and does not. There is actually code to check for the NA, but your example shows that it is incomplete. I will add an NA case, like yours, to my test suite and repair the problem. (The NA case worked once, but some update broke it.)
Me:
When I get the results with NA rows as in this case, are the coefficients still correct, and does the NA represent the reference level?
Terry:
Yes, all the coefficients are correct. SAS, for instance, does not try to 'pre-eliminate' columns, and models with factors always have some missing values in the coefficient vector. Since they use "." instead of "NA" for printing, the missing values don't jump off the page as much. Numerically, there is no penalty for doing it one way or the other.
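A minimal sketch for listing which terms model.matrix left in but coxph could not estimate (the names correspond to the NA rows in the printout above):
fit <- coxph(Surv(tstart, time, status) ~
               celltype:strata(tgroup) + age + prior,
             data = cox_tdc)
# redundant columns come back as NA coefficients
names(coef(fit))[is.na(coef(fit))]
#> [1] "celltypelarge:strata(tgroup)tgroup=1" "celltypelarge:strata(tgroup)tgroup=2"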

"Number of observations <= number of random effects" error

I am using a package called diagmeta for meta-analysis purposes. I can use this package with the built-in data set Schneider2017. However, when I create my own dataset, I get the following error:
Error: number of observations (=300) <= number of random effects (=3074) for term (Group * Cutoff | Study); the random-effects parameters and the residual variance (or scale parameter) are probably unidentifiable
Another thread here on SO suggests the error is caused by the data format of one or more columns. I have made sure every column's data type matches that in the Schneider2017 dataset - no effect.
Link to the other thread
I have tried extracting all of the data from the Schneider2017 dataset into Excel and then importing that file back through RStudio. This again makes no difference. This suggests to me that something in the data format could be different, although I can't see what.
diag2 <- diagmeta(tpos, fpos, tneg, fneg, cutpoint,
                  studlab = paste(author, year, group),
                  data = SRschneider,
                  model = "DIDS", log.cutoff = FALSE,
                  check.nobs.vs.nRE = "ignore")
I expected the same successful execution and plotting as with the built-in data set, but keep getting this error.
The dataset looks like this:
> str(SRschneider)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 150 obs. of 10 variables:
$ ...1 : num 1 2 3 4 5 6 7 8 9 10 ...
$ study_id: num 1 1 1 1 1 1 1 1 1 1 ...
$ author : chr "Arora" "Arora" "Arora" "Arora" ...
$ year : num 2006 2006 2006 2006 2006 ...
$ group : chr NA NA NA NA ...
$ cutpoint: chr "6" "7.0" "8.0" "9.0" ...
$ tpos : num 133 131 130 127 119 115 113 110 102 98 ...
$ fneg : num 5 7 8 11 19 23 25 28 36 40 ...
$ fpos : num 34 33 31 30 28 26 25 21 19 19 ...
$ tneg : num 0 1 3 4 6 8 9 13 15 15 ...
Just a quick follow-up on Ben's detailed answer.
The statistical method implemented in diagmeta() expects that argument cutpoint is a continuous variable. We added a corresponding check for argument cutpoint (as well as arguments TP, FP, TN, and FN) in version 0.3-1 of R package diagmeta; see commit in GitHub repository for technical details.
Accordingly, the following R commands will result in a more informative error message:
data(Schneider2017)
diagmeta(tpos, fpos, tneg, fneg, as.character(cutpoint),
         studlab = paste(author, year, group), data = Schneider2017)
You said that you
have made sure every column's data type matches that in the Schneider2017 dataset
but that doesn't seem to be true. Besides differences between num (numeric) and int (integer) types (which actually aren't typically important), your data has
$ cutpoint: chr "6" "7.0" "8.0" "9.0" ...
while str(Schneider2017) has
$ cutpoint: num 6 7 8 9 10 11 12 13 14 15 ...
Having your cutpoint be a character rather than numeric means that R will try to treat it as a categorical variable (with many discrete levels). This is very likely the source of your problem.
The cutpoint variable is likely a character because R encountered some value in this column that can't be interpreted as numeric (something as simple as a typographic error). You can use SRschneider$cutpoint <- as.numeric(SRschneider$cutpoint) to convert the variable to numeric by brute force (values that can't be interpreted will be set to NA), but it would be better to go upstream and see where the problem is.
If you use tidyverse packages to load your data you should get a list of "parsing problems" that may be useful. You can also try cp <- SRschneider$cutpoint; cp[which(is.na(as.numeric(cp)))] to look at the values that can't be converted.
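A minimal sketch of both checks (the file name is hypothetical):
library(readr)
SRschneider <- read_csv("SRschneider.csv") # readr records parsing problems as it reads
problems(SRschneider)                      # rows/columns that did not parse as expected
# inspect the cutpoint values that cannot be converted to numeric
cp <- SRschneider$cutpoint
cp[is.na(suppressWarnings(as.numeric(cp))) & !is.na(cp)]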

Repeated measure ANOVA or time series' analysis?

I am quite new to R and (I admit it!) not so good with statistics, so I am sorry if my problem is too trivial, but I would really appreciate some hints on the matter.
I have 9 measurement points (plots) of soil humidity for each of the 2 plantation systems we have (agroforestry and agriculture), measured weekly over 2 months. We also have the distance in meters between each measurement point and the closest tree (bigger than 5 cm DBH); these distances vary between 4.2 and 12 m in agroforestry and are 50 m in agriculture. Therefore, I have a profile of humidity (y) over time (x) for each of the 18 plots (9 in agroforestry and 9 in agriculture); the profiles behave similarly but vary due to weather fluctuations. What I need to know is:
Are these variations in humidity between the measurement points over time dependent on (or influenced by) the distance to the trees? That is, do the trees hold more water, or take more water from the soil, when they are closer to the measurement points (which are in the middle of a plantation)?
Are these curves (humidity x time) significantly different from each other?
I first thought about grouping every 3 measurement points (smaller, medium, and larger distances from trees) for the agroforestry system, and treating all 9 from agriculture as one group, using them as replications, as they behave more similarly. However, this confused me a bit.
So... I got as far as thinking about using a repeated measure ANOVA from the ez package. So in this case I had:
str(SanPedro)
'data.frame': 450 obs. of 6 variables:
$ Parcel  : Factor w/ 2 levels "Forest","Agriculture": 1 1 1 1 1 1 1 1 1 1 ...
$ Distance: Factor w/ 4 levels "A","B","C","D": 1 1 1 1 1 1 1 1 1 1 ...
$ Plot    : num 1 1 1 1 1 1 1 1 1 1 ...
$ Date    : Date, format: "0011-07-20" "0011-07-24" ...
$ Humidity: num 0.217 0.205 0.199 0.2 0.192 0.181 0.184 0.18 0.179 0.178 ...
$ Number  : num 1 2 3 4 5 6 7 8 9 10 ...
When I tried to run the ezANOVA as
ezANOVA(data=SanPedro, dv=Humidity, wid=Number, within=Parcel, between=Plot, type=1, return_aov=TRUE)
I got this:
Warning: Converting "Number" to factor for ANOVA.
Warning: "Plot" will be treated as numeric.
Error in ezANOVA_main(data = data, dv = dv, wid = wid, within = within, :
One or more cells is missing data. Try using ezDesign() to check your data.
If I check the ezDesign(SanPedro), I get:
ezDesign(SanPedro)
Error in as.list(c(x, y, row, col)) :
argument "x" is missing, with no default
In the end, I do not really understand the problem with the data, and I am not even sure that ezANOVA is the right analysis for my case... I really deeply appreciate any hints and ideas on the matter!!! Thanks a loooot!!! =)
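One note on the last error: ezDesign() needs the design variables spelled out; calling it with just the data frame leaves its x argument missing, which is exactly the error shown above. A minimal sketch (the variable choices are illustrative):
library(ez)
# counts of observations per cell of the design, to spot missing cells
ezDesign(data = SanPedro, x = Date, y = Plot, row = Parcel)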
