PCA and Constant-Zero Column Error - r

I have a question about PCA using the caret package and an error message I'm getting, "cannot rescale a constant/zero column to unit variance".
Consider two sets of similar code. The first works just fine:
library(caret)
a = c(0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, -1, -1, NA)
b = c(1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, -1, -1, NA)
c = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0)
df = data.frame(a, b, c)
trans = preProcess(df, method = c("center", "scale", "pca"))
The variance of each column can be seen as:
apply(df, 2, var, na.rm=TRUE)
Note that the variance of column "c" is 0.11.
Let's say I change the second to last integer in column "c" to 1 instead of 0, and then run the same code:
a = c(0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, -1, -1, NA)
b = c(1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, -1, -1, NA)
c = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0)
df = data.frame(a, b, c)
trans = preProcess(df, method = c("center", "scale", "pca"))
I get an error message:
Error in prcomp.default(x, scale = TRUE, retx = FALSE) :
cannot rescale a constant/zero column to unit variance
If you look at the variance for column c, it's 0.059:
apply(df, 2, var, na.rm=TRUE)
Can anyone please help me understand the difference between these two sets of code and why the second gives an error when the first does not?
Thank you

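PCA only uses complete observations. In your second definition of df above, a PCA analysis will drop the last row because a and b are NA there, and column c is then constant (all 1s) within the remaining 16 rows, so it cannot be rescaled to unit variance. In your first definition, row 16 is kept and has c = 0, so c still varies. The variances you computed with na.rm=TRUE are taken column by column, so for c they still include the 0 in the last row, which is why they are non-zero in both cases.
Note: my answer is about PCA generally and not specific to the caret package.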
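A quick way to check this, as a sketch in base R: keep only the complete rows and recompute the variances. The names df1, df2 and var_complete are made up for this illustration, and whether caret filters rows in exactly this way internally is an assumption; the prcomp error is at least consistent with it.

```r
a <- c(0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, -1, -1, NA)
b <- c(1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, -1, -1, NA)
df1 <- data.frame(a, b, c = c(rep(1, 15), 0, 0))  # first example
df2 <- data.frame(a, b, c = c(rep(1, 16), 0))     # second example

# Variance of each column after dropping rows that contain an NA
var_complete <- function(d) sapply(d[complete.cases(d), ], var)

var_complete(df1)  # c still varies: row 16 has c = 0
var_complete(df2)  # c has variance 0: every remaining value is 1
```

If the constant column is not needed, dropping it before calling preProcess (or imputing the NA values first) would avoid the error.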

Related

Add column with the cumulative number of elements found in a specific row

This table is based on a species sampling procedure that comprises starting at a certain location in a forest and recording the number of species that occur in that exact spot. Then, the surveyor walks and records the distance he traveled until he found a new species. This is the distance between the place where he found the new species and the initial point.
I would like to create a new column that includes the cumulative number of species based on the traveled distance. Here's what the new column should look like.
Example data:
data<-structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), binomial = c("Dromicodryas bernieri",
"Dromicodryas quadrilineatus", "Erymnochelys madagascariensis",
"Furcifer lateralis", "Furcifer oustaleti", "Hemidactylus mercatorius",
"Langaha pseudoalluaudi", "Leioheterodon madagascariensis", "Lycodryas pseudogranuliceps",
"Liophidium torquatum", "Liopholidophis sexlineatus", "Madagascarophis colubrinus",
"Madatyphlops decorsei", "Madascincus polleni", "Mimophis mahfalensis",
"Pelusios castanoides", "Phelsuma madagascariensis", "Thamnosophis lateralis",
"Trachylepis elegans", "Trachylepis gravenhorstii", "Zonosaurus madagascariensis",
"Hemidactylus frenatus", "Calumma nasutum", "Trachylepis madagascariensis",
"Amphiglossus macrocercus", "Zonosaurus aeneus", "Phelsuma lineata",
"Pelomedusa subrufa", "Calumma crypticum", "Furcifer viridis",
"Lygodactylus blancae", "Calumma gastrotaenia", "Trachylepis boettgeri",
"Zonosaurus ornatus", "Sanzinia madagascariensis", "Oplurus cyclurus",
"Leioheterodon modestus", "Oplurus cuvieri", "Madascincus igneocaudatus",
"Acrantophis dumerili", "Furcifer campani", "Pseudoxyrhopus imerinae",
"Lygodactylus mirabilis", "Phelsuma barbouri", "Furcifer minor",
"Compsophis infralineatus", "Pseudoxyrhopus quinquelineatus",
"Calumma hilleniusi", "Paroedura bastardi", "Brookesia brygooi"
), distance = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 3714.77402549982, 6249.49093233716, 7067.80424387549,
7715.0303317613, 13769.1057463018, 17206.1480236598, 18733.5237644898,
21923.789153995, 27314.2085865309, 31154.1890492383, 35460.0864839256,
35822.0263564291, 36933.3736660544, 39735.6007540156, 40983.6673876956,
43032.8409122139, 43793.3004333338, 44063.3992480126, 44657.9183000201,
44723.8214805486, 45184.0884859559, 46785.9008560645, 48994.7048866502,
55332.621992021, 57746.4142325833, 58866.2845249788, 60839.811988087,
65560.1987963227)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-50L))
One option is to take the cumsum of a logical vector (TRUE where distance is not equal to 0) and then add the number of rows with distance 0 to it, i.e. sum(distance == 0):
library(dplyr)
data <- data %>%
  mutate(new = cumsum(distance != 0) + sum(distance == 0))
Or for this, we can use base R
data$new <- with(data, cumsum(distance!=0) + sum(distance == 0))
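To see why this works, here is a small sketch on made-up distances (the vectors distance, id and dist2 are toy values, not the survey data). The per-id variant with ave() is an assumption about how one might extend the idea if the table contained more than one id.

```r
# Toy distances: three species at the starting point, then three new finds
distance <- c(0, 0, 0, 12.5, 40.2, 77.9)

# cumsum() counts each non-zero distance as one new species;
# sum(distance == 0) adds the species already present at distance 0
new <- cumsum(distance != 0) + sum(distance == 0)
new
# [1] 3 3 3 4 5 6

# If the table held several surveys, the same idea could be applied per id
id <- c(1, 1, 1, 2, 2, 2)
dist2 <- c(0, 0, 5, 0, 3, 9)
per_id <- ave(dist2, id, FUN = function(d) cumsum(d != 0) + sum(d == 0))
per_id
# [1] 2 2 3 1 2 3
```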

How to plot a binary matrix without using additional packages?

I created a binary matrix and I want to plot the 1s as black squares.
How can I write it without using any package?
For example, my matrix is:
m <- matrix(c(0,1,1,0,0,1,0,1,1),nrow=3, ncol=3)
Do you want this?
m <- matrix(c(0,1,1,0,0,1,0,1,1), nrow=3, ncol=3)
image(m, main = "My binary matrix plot", col = c("white", "black"))
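One thing to keep in mind: image() puts the rows of the matrix along the x axis and draws column 1 at the bottom, so the picture is rotated relative to how the matrix prints. A minimal sketch of one way to get the printed orientation, where as_printed is a helper name invented here:

```r
m <- matrix(c(0, 1, 1, 0, 0, 1, 0, 1, 1), nrow = 3, ncol = 3)

# Reverse the row order, then transpose, so that image() shows
# m[1, 1] in the top-left corner as in the printed matrix
as_printed <- function(m) t(m[nrow(m):1, , drop = FALSE])

image(as_printed(m), col = c("white", "black"), axes = FALSE,
      main = "Binary matrix, printed orientation")
```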
If image doesn't suffice, we could write a generalized function using mapply like this one.
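chessplot <- function(m, col=1, border=NA) {
  stopifnot(nrow(m) == ncol(m))  ## allows only square matrices
  n <- nrow(m)
  ## empty plot spanning an n x n grid
  plot(n, n, type='n', xlim=c(0, n), ylim=c(0, n))
  ## t(m) is traversed row by row of m: i is the column, j the row, v the cell value
  mapply(\(i, j, v) {
    ## fill cells that hold a 1 with `col`; leave 0 cells unfilled
    rect(-1 + i, n - j, 0 + i, n - j + 1, col=if (v == 1) col else NA, border=border)
  }, seq(n), rep(seq(n), each=n), t(m)) |> invisible()
}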
Gives:
chessplot(m3)
chessplot(m4)
chessplot(m8)
Data:
m3 <- structure(c(0, 1, 1, 0, 0, 1, 0, 1, 1), .Dim = c(3L, 3L))
m4 <- structure(c(0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0), .Dim = c(4L,
4L))
m8 <- structure(c(0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0,
1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1,
0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1,
0, 1, 0, 1, 0), .Dim = c(8L, 8L))

Intersecting ranges of consecutive values in logical vectors in R

I have two logical vectors which look like this:
x = c(0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
y = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0)
I would like to count the intersections between ranges of consecutive values. Meaning that consecutive values (of 1s) are handled as one range. So in the above example, each vector contains one range of 1s and these ranges intersect only once.
Is there any R package for range intersections which could help here?
I think this should work (calling your logical vectors x and y):
sum(rle(x & y)$values)
A few examples:
x = c(0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
y = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0)
sum(rle(x & y)$values)
# [1] 1
x = c(1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
y = c(0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0)
sum(rle(x & y)$values)
# [1] 2
x = c(1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0)
y = c(0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0)
sum(rle(x & y)$values)
# [1] 3
By way of explanation, x & y gives the intersections on a per-element level, rle collapses runs of adjacent intersections, and sum counts.
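The intermediate steps for the third example can be inspected like this (both and runs are just illustrative variable names):

```r
x <- c(1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0)
y <- c(0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0)

both <- x & y      # TRUE where both vectors are 1 at the same position
runs <- rle(both)  # collapse adjacent equal values into runs
runs$values        # one entry per run; TRUE marks a stretch of overlap
# [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
runs$lengths
# [1] 1 1 3 2 3 1 2
sum(runs$values)   # number of TRUE runs
# [1] 3
```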

Optimum algorithm to check various combinations of items when number of items is too large

I have a data frame which has 20 columns/items in it, and 593 rows (number of rows doesn't matter though) as shown below:
Using this, the reliability of the test comes out as 0.94, obtained with alpha from the psych package (psych::alpha). The output also gives me the new value of Cronbach's alpha if I drop one of the items. However, I want to know how many items I can drop while retaining an alpha of at least 0.8. I used a brute-force approach for this: I create every combination of the items in my data frame and check whether its alpha is in the range (0.7, 0.9). Is there a better way of doing this? It is taking forever to run because there are too many item combinations to check. Below is my current piece of code:
numberOfItems <- 20
for(i in 2:(2^numberOfItems - 1)){
# ignoring the first case i.e. i=1, as it doesn't represent any model
# (`:` binds tighter than `-`, so 2:(2^numberOfItems)-1 would actually start at i=1)
# convert the value of i to binary, e.g. i=5 will give combination = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1
# using the binaryLogic package
combination <- as.binary(i, n=numberOfItems)
model <- c()
for(j in 1:length(combination)){
# choose which columns to consider depending on the combination
if(combination[j])
model <- c(model, j)
}
itemsToUse <- itemResponses[, c(model)]
#cat(model)
if(length(model) > 13){
alphaVal <- psych::alpha(itemsToUse)$total$raw_alpha
if(alphaVal > 0.7 && alphaVal < 0.9){
cat(alphaVal)
print(model)
}
}
}
A sample output from this code is as follows:
0.8989831 1 4 5 7 8 9 10 11 13 14 15 16 17 19 20
0.899768 1 4 5 7 8 9 10 11 12 13 15 17 18 19 20
0.899937 1 4 5 7 8 9 10 11 12 13 15 16 17 19 20
0.8980605 1 4 5 7 8 9 10 11 12 13 14 15 17 19 20
Here are the first 10 rows of data:
dput(itemResponses)
structure(list(CESD1 = c(1, 2, 2, 0, 1, 0, 0, 0, 0, 1), CESD2 = c(2,
3, 1, 0, 0, 1, 1, 1, 0, 1), CESD3 = c(0, 3, 0, 1, 1, 0, 0, 0,
0, 0), CESD4 = c(1, 2, 0, 1, 0, 1, 1, 1, 0, 0), CESD5 = c(0,
1, 0, 2, 1, 2, 2, 0, 0, 0), CESD6 = c(0, 3, 0, 1, 0, 0, 2, 0,
0, 0), CESD7 = c(1, 2, 1, 1, 2, 0, 1, 0, 1, 0), CESD8 = c(1,
3, 1, 1, 0, 1, 0, 0, 1, 0), CESD9 = c(0, 1, 0, 2, 0, 0, 1, 1,
0, 1), CESD10 = c(0, 1, 0, 2, 0, 0, 1, 1, 0, 1), CESD11 = c(0,
2, 1, 1, 1, 1, 2, 3, 0, 0), CESD12 = c(0, 3, 1, 1, 1, 0, 2, 0,
0, 0), CESD13 = c(0, 3, 0, 2, 1, 2, 1, 0, 1, 0), CESD14 = c(0,
3, 1, 2, 1, 1, 1, 0, 1, 1), CESD15 = c(0, 2, 0, 1, 0, 1, 0, 1,
1, 0), CESD16 = c(0, 2, 2, 0, 0, 1, 1, 0, 0, 0), CESD17 = c(0,
0, 0, 0, 0, 1, 1, 0, 0, 0), CESD18 = c(0, 2, 0, 0, 0, 0, 0, 0,
0, 1), CESD19 = c(0, 3, 0, 0, 0, 0, 0, 1, 1, 0), CESD20 = c(0,
3, 0, 1, 0, 0, 0, 0, 0, 0)), .Names = c("CESD1", "CESD2", "CESD3",
"CESD4", "CESD5", "CESD6", "CESD7", "CESD8", "CESD9", "CESD10",
"CESD11", "CESD12", "CESD13", "CESD14", "CESD15", "CESD16", "CESD17",
"CESD18", "CESD19", "CESD20"), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
The idea is to replace the computation of alpha with the so-called discrimination for each item from classical test theory (CTT). The discrimination is the correlation of the item score with a "true score" (which we would assume to be the row sum).
Let the data be
dat <- structure(list(CESD1 = c(1, 2, 2, 0, 1, 0, 0, 0, 0, 1), CESD2 = c(2, 3, 1, 0, 0, 1, 1, 1, 0, 1),
CESD3 = c(0, 3, 0, 1, 1, 0, 0, 0, 0, 0), CESD4 = c(1, 2, 0, 1, 0, 1, 1, 1, 0, 0),
CESD5 = c(0, 1, 0, 2, 1, 2, 2, 0, 0, 0), CESD6 = c(0, 3, 0, 1, 0, 0, 2, 0, 0, 0),
CESD7 = c(1, 2, 1, 1, 2, 0, 1, 0, 1, 0), CESD8 = c(1, 3, 1, 1, 0, 1, 0, 0, 1, 0),
CESD9 = c(0, 1, 0, 2, 0, 0, 1, 1, 0, 1), CESD10 = c(0, 1, 0, 2, 0, 0, 1, 1, 0, 1),
CESD11 = c(0, 2, 1, 1, 1, 1, 2, 3, 0, 0), CESD12 = c(0, 3, 1, 1, 1, 0, 2, 0, 0, 0),
CESD13 = c(0, 3, 0, 2, 1, 2, 1, 0, 1, 0), CESD14 = c(0, 3, 1, 2, 1, 1, 1, 0, 1, 1),
CESD15 = c(0, 2, 0, 1, 0, 1, 0, 1, 1, 0), CESD16 = c(0, 2, 2, 0, 0, 1, 1, 0, 0, 0),
CESD17 = c(0, 0, 0, 0, 0, 1, 1, 0, 0, 0), CESD18 = c(0, 2, 0, 0, 0, 0, 0, 0, 0, 1),
CESD19 = c(0, 3, 0, 0, 0, 0, 0, 1, 1, 0), CESD20 = c(0, 3, 0, 1, 0, 0, 0, 0, 0, 0)),
.Names = c("CESD1", "CESD2", "CESD3", "CESD4", "CESD5", "CESD6", "CESD7", "CESD8", "CESD9",
"CESD10", "CESD11", "CESD12", "CESD13", "CESD14", "CESD15", "CESD16", "CESD17",
"CESD18", "CESD19", "CESD20"), row.names = c(NA, -10L),
class = c("tbl_df", "tbl", "data.frame"))
We compute (1) the discrimination and (2) the alpha coefficient.
stat <- t(sapply(seq_len(ncol(dat)), function(ii){
  # dat[[ii]] extracts the item as a plain vector, also when dat is a tibble
  dd <- dat[[ii]]
  # discrimination is the correlation of the item with the row sum of the other items
  disc <- if(var(dd, na.rm = TRUE) > 0) cor(dd, rowSums(dat[, -ii]), use = "pairwise") else NA
  # alpha that would be obtained when we skip this item
  alpha <- psych::alpha(dat[, -ii])$total$raw_alpha
  c(disc, alpha)
}))
dimnames(stat) <- list(colnames(dat), c("disc", "alpha^I"))
stat <- data.frame(stat)
Observe that the discrimination (which is cheaper to compute) is negatively related to the alpha obtained when the item is deleted: removing a highly discriminating item lowers alpha the most. In other words, alpha is highest when there are many highly "discriminating" items (that correlate with each other).
plot(stat, pch = 19)
Use this information to select the sequence with which the items should be deleted to fall below a benchmark (say .9, since the toy data doesn't allow for a lower mark):
1) delete as many items as possible to stay above the benchmark; that is, start with the least discriminating items.
stat <- stat[order(stat$disc), ]
this <- sapply(1:(nrow(stat)-2), function(ii){
ind <- match(rownames(stat)[1:ii], colnames(dat))
alpha <- psych::alpha(dat[, -ind, drop = FALSE])$total$raw_alpha
})
delete_these <- rownames(stat)[which(this > .9)]
psych::alpha(dat[, -match(delete_these, colnames(dat)), drop = FALSE])$total$raw_alpha
length(delete_these)
2) delete as few items as possible to stay above the benchmark; that is, start with the most discriminating items.
stat <- stat[order(stat$disc, decreasing = TRUE), ]
this <- sapply(1:(nrow(stat)-2), function(ii){
ind <- match(rownames(stat)[1:ii], colnames(dat))
alpha <- psych::alpha(dat[, -ind, drop = FALSE])$total$raw_alpha
})
delete_these <- rownames(stat)[which(this > .9)]
psych::alpha(dat[, -match(delete_these, colnames(dat)), drop = FALSE])$total$raw_alpha
length(delete_these)
Note that 1) is consistent with classical item selection procedures in (psychological/educational) diagnostics and assessment: remove the items that fall below a benchmark in terms of discriminatory power.
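As a rough sketch of how the search could avoid enumerating every subset: raw alpha can be computed directly from the item covariance matrix, and items can be removed one at a time. Both cronbach_alpha and greedy_drop are hypothetical helpers written for this illustration, not functions from psych. The greedy rule is a heuristic and is not guaranteed to find the largest droppable set, and the covariance formula should match raw_alpha only when no items are reverse-keyed, which is worth verifying on your data. The simulated data frame sim stands in for itemResponses.

```r
# Raw Cronbach's alpha from the item covariance matrix:
# k / (k - 1) * (1 - sum of item variances / variance of the sum score)
cronbach_alpha <- function(items) {
  C <- cov(items, use = "pairwise.complete.obs")
  k <- ncol(C)
  k / (k - 1) * (1 - sum(diag(C)) / sum(C))
}

# Hypothetical greedy search (not guaranteed to be optimal): at each step
# drop the item whose removal leaves the highest alpha, and stop as soon
# as any further removal would push alpha below `target`
greedy_drop <- function(items, target = 0.8) {
  dropped <- character(0)
  while (ncol(items) > 2) {
    alphas <- sapply(seq_len(ncol(items)),
                     function(j) cronbach_alpha(items[, -j, drop = FALSE]))
    best <- which.max(alphas)
    if (alphas[best] < target) break
    dropped <- c(dropped, colnames(items)[best])
    items <- items[, -best, drop = FALSE]
  }
  list(dropped = dropped, kept = colnames(items), alpha = cronbach_alpha(items))
}

# Simulated stand-in for itemResponses: 8 items sharing one latent trait
set.seed(1)
latent <- rnorm(200)
sim <- as.data.frame(sapply(1:8, function(j) latent + rnorm(200)))
res <- greedy_drop(sim, target = 0.8)
res$dropped
res$alpha
```

For 20 items this evaluates alpha a couple of hundred times at most, compared with roughly a million subsets in the brute-force loop.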
I changed the code as follows: I now drop a fixed number of items and change the value of numberOfItemsToDrop from 1 to 20 manually. Although it is a little better, it still takes too long to run :(
I hope there is some better way of doing this.
numberOfItemsToDrop <- 13
combinations <- combinat::combn(20, numberOfItemsToDrop)
timesToIterate <- length(combinations)/numberOfItemsToDrop
for(i in 1:timesToIterate){
model <- combinations[,i]
itemsToUse <- itemResponses[, -c(model)]
alphaVal <- psych::alpha(itemsToUse)$total$raw_alpha
if(alphaVal < 0.82){
cat("Cronbach's alpha =",alphaVal, ", number of items dropped = ", length(model), " :: ")
print(model)
}
}

Add accuracy to data frame based on several predicted values and known actual values

I have a data frame
testdf <- data.frame(predicted1 = c(1, 0, 1, 3, 2, 1, 1, 0, 1, 0), predicted2 = c(1, 0, 2, 2, 2, 1, 1, 0, 0, 0), predicted3 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), actual = c(1, 0, 2, 3, 2, 1, 1, 1, 0, 0))
I want to add another column to this data frame which tells me the total percentage accuracy when looking at all predicted values. So for example, row 1 of this would have an accuracy of 100%, because all prediction columns predicted the correct value (1).
How can this be done?
Thanks!
We can compare the prediction columns with 'actual', take the rowMeans, multiply by 100, and round if needed:
round(100*rowMeans(testdf[1:3] == testdf$actual), 2)
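To store the result as a new column, a minimal sketch could look like this. The column name accuracy and the helper vector pred_cols are choices made for this example.

```r
testdf <- data.frame(predicted1 = c(1, 0, 1, 3, 2, 1, 1, 0, 1, 0),
                     predicted2 = c(1, 0, 2, 2, 2, 1, 1, 0, 0, 0),
                     predicted3 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
                     actual     = c(1, 0, 2, 3, 2, 1, 1, 1, 0, 0))

pred_cols <- c("predicted1", "predicted2", "predicted3")

# Each prediction column is compared element-wise with `actual`;
# rowMeans() of the resulting TRUE/FALSE matrix is the share of correct predictions
testdf$accuracy <- unname(round(100 * rowMeans(testdf[pred_cols] == testdf$actual), 2))

head(testdf, 2)
# row 1: accuracy 100 (all three match), row 2: 66.67 (two of three match)
```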
