How to solve prcomp.default(): cannot rescale a constant/zero column to unit variance - r

I have a data set of 9 samples (rows) with 51608 variables (columns) and I keep getting the error whenever I try to scale it:
This works fine:
pca = prcomp(pca_data)
However,
pca = prcomp(pca_data, scale = T)
gives
> Error in prcomp.default(pca_data, center = T, scale = T) :
cannot rescale a constant/zero column to unit variance
Obviously it's a little hard to post a reproducible example. Any ideas what the deal could be?
Looking for constant columns:
library(magrittr)  # for the %>% pipe

sapply(1:ncol(pca_data), function(x) {
  pca_data[, x] %>% unique() %>% length()
}) %>% table()
Output:
.
2 3 4 5 6 7 8 9
3892 4189 2124 1783 1622 2078 5179 30741
So no constant columns. Same with NAs:
is.na(pca_data) %>% sum
>[1] 0
This works fine:
pca_data = scale(pca_data)
But then afterwards both still give the exact same error:
pca = prcomp(pca_data)
pca = prcomp(pca_data, center = F, scale = F)
So why can't I manage to get a scaled PCA on this data? OK, let's make 100% sure that it's not constant.
pca_data = pca_data + rnorm(nrow(pca_data) * ncol(pca_data))
Same errors. Numeric data?
sapply(1:nrow(pca_data), function(row) {
  sapply(1:ncol(pca_data), function(column) {
    !is.numeric(pca_data[row, column])
  })
}) %>% sum()
Still the same errors. I'm out of ideas.
Edit: more details, and a hack that at least works around it.
Later, I was still having a hard time clustering this data, e.g.:
Error in hclust(d, method = "ward.D") :
NaN dissimilarity value in intermediate results.
Trimming values under a certain cutoff (e.g. < 1) to zero had no effect. What finally worked was dropping every column that had more than x zeros in it. This worked for # zeros <= 6, but 7+ gave errors. I have no idea whether this means it's a problem in general or whether it just happened to catch a problematic column. I'd still be happy to hear any ideas why, because this should work just fine as long as no variable is all zeros (or constant in another way).
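Roughly, the hack that worked looks like this (a sketch with an assumed cutoff max_zeros rather than my exact code):
# Drop columns containing more than `max_zeros` zero entries, then retry the scaled PCA
max_zeros <- 6
zero_counts <- colSums(pca_data == 0)
pca_trimmed <- pca_data[, zero_counts <= max_zeros, drop = FALSE]
pca <- prcomp(pca_trimmed, center = TRUE, scale. = TRUE)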

I don't think you're looking for zero-variance columns correctly. Let's try with some dummy data. First, an acceptable 10x100 matrix:
mat <- matrix(rnorm(1000, 0), nrow = 10)
And one with a zero-variance column. Let's call it oopsmat.
const <- rep(0.1,100)
oopsmat <- cbind(const, mat)
The first few elements of oopsmat look like this:
const
[1,] 0.1 0.75048899 0.5997527 -0.151815650 0.01002536 0.6736613 -0.225324647 -0.64374844 -0.7879052
[2,] 0.1 0.09143491 -0.8732389 -1.844355560 0.23682805 0.4353462 -0.148243210 0.61859245 0.5691021
[3,] 0.1 -0.80649512 1.3929716 -1.438738923 -0.09881381 0.2504555 -0.857300053 -0.98528008 0.9816383
[4,] 0.1 0.49174471 -0.8110623 -0.941413109 -0.70916436 1.3332522 0.003040624 0.29067871 -0.3752594
[5,] 0.1 1.20068447 -0.9811222 0.928731706 -1.97469637 -1.1374734 0.661594937 2.96029102 0.6040814
Let's try scaled and unscaled PCAs on oopsmat:
PCs <- prcomp(oopsmat) #works
PCs <- prcomp(oopsmat, scale. = T) #not forgetting the dot
#Error in prcomp.default(oopsmat, scale. = T) :
#cannot rescale a constant/zero column to unit variance
That's because you can't divide by the standard deviation when it's zero. To identify the zero-variance column, we can use which as follows to get the variable name.
which(apply(oopsmat, 2, var)==0)
#const
#1
And to remove zero-variance columns from the dataset, you can use the same apply expression, keeping only the columns whose variance is not equal to zero.
oopsmat[ , which(apply(oopsmat, 2, var) != 0)]
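As a quick sanity check on the same dummy data, the scaled PCA runs once that column is dropped:
# With the constant column removed, scale. = TRUE no longer errors
ok <- apply(oopsmat, 2, var) != 0
PCs <- prcomp(oopsmat[, ok], scale. = TRUE)
summary(PCs)$importance[, 1:3]  # variance explained by the first few PCs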
Hope that helps make things clearer!

In addition to Joe's answer, check that the classes of the columns in your data frame are numeric.
If some columns are stored as integers in a class like integer64, you can end up with variances of 0, causing the scaling to fail.
So if,
class(my_df$some_column)
is an integer64, for example, then do the following
my_df$some_column <- as.numeric(my_df$some_column)
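If several columns are affected, here is a small sketch that converts every integer64 column in one pass (my_df and the helper names are illustrative):
# Convert all integer64 columns to plain numerics before scaling
int64_cols <- sapply(my_df, function(col) inherits(col, "integer64"))
my_df[int64_cols] <- lapply(my_df[int64_cols], as.numeric)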
Hope this helps someone.

The error occurs because one of the columns has constant values.
Calculate the standard deviation of all the numeric columns to find the zero-variance variables.
If a standard deviation is zero, you can remove that variable and compute the PCA.
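A minimal sketch of that idea (df is an assumed data frame of numeric columns):
# Drop columns whose standard deviation is zero, then run the scaled PCA
sds <- sapply(df, sd, na.rm = TRUE)
pca <- prcomp(df[, sds != 0], center = TRUE, scale. = TRUE)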

Related

Finding p value between two rasters in R

How can a p value be obtained between two rasters?
I currently have two rasters, and I would like to compute a p value.
I convert both into data frames with na.rm=T:
df1 <- as.data.frame(r1, na.rm = T)
df2 <- as.data.frame(r2, na.rm = T)
cor.test(df1$gc, df2$ip)$p.value
Error in cor.test.default(df1$gc, df2$ip) :
  'x' and 'y' must have the same length
Even if I don't use na.rm, I get this:
df1 <- as.data.frame(r1)
df2 <- as.data.frame(r2)
cor.test(df1$gc,df2$ip)$p.value
[1] 0
You have to pass numeric vectors to cor.test(). You are getting there by converting to a data.frame and extracting the column, but it's probably easier to use the raster function values():
library(raster)
set.seed(666)
r1 <- raster(matrix(rnorm(100), 10, 10))
r2 <- raster(matrix(rnorm(100), 10, 10))
cor.test(values(r1), values(r2))$p.value
0.07144313
The error when you remove NAs occurs because the two vectors do not have the same length, i.e. one vector has more NA cells than the other. Depending on how many NAs you have, the p-value you get at the end (0) may or may not be meaningful. Did you try making a simple boxplot or distribution plot of the values in both rasters? This may help you understand whether the p-value is simply an artefact of too many NAs.
d <- data.frame(
  x = c(values(r1), values(r2)),
  y = c(rep(1, ncell(r1)), rep(2, ncell(r2)))
)
boxplot(d$x ~ d$y)
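If you do want to test only the cells that have data in both rasters, a minimal sketch (assuming r1 and r2 share the same extent and resolution):
# Compare NA counts, then correlate only cells that are non-NA in both layers
v1 <- values(r1)
v2 <- values(r2)
c(sum(is.na(v1)), sum(is.na(v2)))  # how many NA cells in each raster
ok <- complete.cases(v1, v2)
cor.test(v1[ok], v2[ok])$p.value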
/Emilio

Pearson coefficient per rows on large matrices

I'm currently working with a large matrix (4 cols and around 8000 rows).
I want to perform a correlation analysis using Pearson's correlation coefficient between the different rows composing this matrix.
I would like to proceed the following way:
Find Pearson's correlation coefficient between row 1 and row 2. Then between rows 1 and 3... and so on with the rest of the rows.
Then find Pearson's correlation coefficient between row 2 and row 3. Then between rows 2 and 4... and so on with the rest of the rows. Note I won't find the coefficient with row 1 again...
For coefficients higher than 0.7 or lower than -0.7, I would like to list in a separate file the row names corresponding to those coefficients, plus the coefficient itself. E.g.:
row 230 - row 5812 - 0.76
I wrote the following code for this aim. Unfortunately, it takes far too long to run (I estimated almost a week :( ).
arist1p <- NULL  # collects the row-name pairs and their correlation
for (i in 1:7999) {
  print("Analyzing row:")
  print(i)
  for (j in (i + 1):8000) {
    value <- cor(alpha1k[i, ], alpha1k[j, ], use = "everything", method = "pearson")
    if (value > 0.7 | value < (-0.7)) {
      aristi <- c(row.names(alpha1k)[i], row.names(alpha1k)[j], value)
      arist1p <- rbind(arist1p, aristi)
    }
  }
}
My question, then, is whether there's any way I could do this faster. I read about making these calculations in parallel, but I have no clue how to make that work. I hope I made myself clear enough; thank you in advance!
As Roland pointed out, you can use the matrix version of cor to simplify your task. Just transpose your matrix to get a "row" comparison.
mydf <- data.frame(a = c(1,2,3,1,2,3,1,2,3,4), b = rep(5,2,10), c = c(1:10))
cor_mat <- cor(t(mydf)) # correlation of your transposed matrix
idx <- which((abs(cor_mat) > 0.7), arr.ind = T) # get relevant indexes in a matrix form
cbind(idx, cor_mat[idx]) # combine coordinates and the correlation
Note that the parameters use = "everything" and method = "pearson" are the defaults for cor(), so there is no need to specify them.
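To get the output format asked for in the question (one line per pair of row names plus the coefficient, without the diagonal and duplicate pairs), here is a hedged sketch building on cor_mat above; the file name is illustrative:
# Blank out the diagonal and one triangle so each pair appears only once
cor_mat[lower.tri(cor_mat, diag = TRUE)] <- NA
idx <- which(abs(cor_mat) > 0.7, arr.ind = TRUE)
pairs <- data.frame(row_a = rownames(cor_mat)[idx[, 1]],
                    row_b = colnames(cor_mat)[idx[, 2]],
                    cor   = cor_mat[idx])
write.table(pairs, "high_correlations.txt", sep = "\t", row.names = FALSE, quote = FALSE)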

Confidence Interval of Sample Means using R

My dataframe contains sampling means of 500 samples of size 100 each. Below is the snapshot. I need to calculate the confidence interval at 90/95/99 for mean.
head(Means_df)
Means
1 14997
2 11655
3 12471
4 12527
5 13810
6 13099
I am using the code below but am only getting a confidence interval for one row. Can anyone help me with the code?
tint <- matrix(NA, nrow = dim(Means_df)[2], ncol = 2)
for (i in 1:dim(Means_df)[2]) {
  temp <- t.test(Means_df[, i], conf.level = 0.9)
  tint[i, ] <- temp$conf.int
}
colnames(tint) <- c("lcl", "ucl")
For any single mean, e.g. 14997, you cannot compute a 95% CI without knowing the variance or the standard deviation of the data the mean was computed from. If you have access to the standard deviation of each sample, you can then compute the standard error of the mean and, with that, easily the 95% CI. Apparently, you lack the information needed for the task.
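For example, if you also had the per-sample standard deviations (say in a column SDs; that column is an assumption, it is not in the snapshot) and each sample has n = 100 observations, the per-row 90% intervals would look roughly like this:
# Hypothetical SDs column; n = 100 observations per sample, two-sided 90% t-interval
n <- 100
t_crit <- qt(0.95, df = n - 1)
Means_df$lcl <- Means_df$Means - t_crit * Means_df$SDs / sqrt(n)
Means_df$ucl <- Means_df$Means + t_crit * Means_df$SDs / sqrt(n)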
Means_df is a data frame with 500 rows and 1 column. Therefore
dim(Means_df)[2]
will give the value 1.
Which is why you only get one value.
Solve the problem by using dim(Means_df)[1], or even better nrow(Means_df), instead of dim(Means_df)[2].

log- and z-transforming my data in R

I'm preparing my data for a PCA, for which I need to standardize it. I've been following someone else's code in vegan but am not getting a mean of zero and SD of 1, as I should be.
I'm using a data set called musci which has 13 variables, three of which are labels to identify my data.
library(vegan)
log.musci <- log(musci[, 4:13], 10)
stand.musci <- decostand(log.musci, method = "standardize", MARGIN = 2)
When I then check for mean=0 and SD=1...
colMeans(stand.musci)
sapply(stand.musci,sd)
I get mean values ranging from -8.9 to 3.8 and SD values are just listed as NA (for every data point in my data set rather than for each variable). If I leave out the last variable in my standardization, i.e.
log.musci<-log(musci[,4:12],10)
the means don't change, but the SDs now all have a value of 1.
Any ideas of where I've gone wrong?
Cheers!
Your data is likely a matrix.
## Sample data
dat <- as.matrix(data.frame(a=rnorm(100, 10, 4), b=rexp(100, 0.4)))
So, either convert to a data.frame and use sapply to operate on columns
dat <- data.frame(dat)
scaled <- sapply(dat, scale)
colMeans(scaled)
# a b
# -2.307095e-16 2.164935e-17
apply(scaled, 2, sd)
# a b
# 1 1
or use apply to do columnwise operations
scaled <- apply(dat, 2, scale)
A z-transformation is quite easy to do manually.
See below, using a simple vector of data.
data <- c(1,2,3,4,5,6,7,8,9,10)
data
mean(data)
sd(data)
z <- ((data - mean(data))/(sd(data)))
z
mean(z) == 0
sd(z) == 1
The logarithm transformation (assuming you mean a natural logarithm) is done using the log() function.
log(data)
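Tying this back to the workflow in the question, a minimal sketch (assuming the musci columns 4 to 13 are strictly positive numerics, as the log10 step implies):
# log10-transform, then standardize each column to mean 0 and SD 1
stand.musci <- scale(log10(musci[, 4:13]))
round(colMeans(stand.musci), 10)  # effectively 0 for every column
apply(stand.musci, 2, sd)         # 1 for every column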
Hope this helps!

Maximum first derivative in for values in a data frame R

Good day, I am looking for some help in processing my dataset. I have 14000 rows and 500 columns and I am trying to get the maximum value of the first derivative for individual rows in different column groups. I have my data saved as a data frame with the first column being the name of a variable. My data looks like this:
Species Spec400 Spec405 Spec410 Spec415
1 AfricanOilPalm_1_Lf_1 0.2400900 0.2318345 0.2329633 0.2432734
2 AfricanOilPalm_1_Lf_10 0.1783162 0.1808581 0.1844433 0.1960315
3 AfricanOilPalm_1_Lf_11 0.1699646 0.1722618 0.1615062 0.1766804
4 AfricanOilPalm_1_Lf_12 0.1685733 0.1743336 0.1669799 0.1818896
5 AfricanOilPalm_1_Lf_13 0.1747400 0.1772355 0.1735916 0.1800227
For each of the variables in the Species column, I want to get the maximum derivative from Spec495 to Spec550, for example. This is what I did before I ran into errors.
x <- c(495, 500, 505, 510, 515, 520, 525, 530, 535, 540, 545, 550)  ## x values of reflectance (Spec495 to Spec550)
y.data.f <- hsp[, 21:32]        ## values for the required columns
y <- as.numeric(y.data.f[1, ])  ## convert to a vector, for just the first row of data
library(pspline)                ## using a spline so a derivative may be calculated from a list of numeric values
I really wanted to avoid using a loop because of the time it takes, but this is the only way I know of thus far:
for (j in 1:14900) {
  y <- as.numeric(y.data.f[j, ])
  a1d <- max(predict(sm.spline(x, y), x, 1))
  write.table(a1d, file = "a1-d-appended.csv", sep = ",",
              col.names = FALSE, append = TRUE)
}
This loop runs up until the 7861st value and then gives this error:
Error in smooth.Pspline(x = ux, y = tmp[, 1], w = tmp[, 2], method = method, :
NA/NaN/Inf in foreign function call (arg 6)
I am sure there must be a way to avoid using a loop, maybe using the plyr package, but I can't figure out how to do so, nor which package would be best to get the value for maximum derivative.
Can anyone offer some insight or suggestions? Thanks in advance
First differences are the numerical analog of first derivatives when the x-dimension is evenly spaced. So something along the lines of:
which.max(diff(predict(sm.spline(x, y))$ysmth))
... will return the location of the maximum (positive) slope of the smoothed spline. If you wanted the maximal slope, allowing it to be either negative or positive, you would use abs() around the predict()$ysmth. If you are having difficulties with non-finite values, then using an index of is.finite will clear both Inf and NaN difficulties:
predy <- predict(sm.spline(x, y))$ysmth
predx <- predict(sm.spline(x, y))$x
is.na(predy) <- !is.finite(predy)
plot(predx, predy,  # NA values will not blow up R's plotting function,
                    # ... they just create discontinuities.
     main = "First Derivative")
