Apply PCA for data stored in list - r

My image data is stored in a list. For every pixel (626257) of my image I have a vector containing all the values corresponding to the different wavelengths (44 wavelengths). Now I would like to carry out a principal component analysis (PCA). Unfortunately, I am not able to convert my listed data into the desired form. Here is the code to generate a dummy data set.
test = replicate(626257, rnorm(44, 3, 1),simplify = FALSE)
When I now try to carry out the PCA then the following error message pops up.
pca = prcomp(test, scale = F)
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
How can I convert my list into a suitable datatype?

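We could change simplify to TRUE in replicate so that it returns a numeric matrix instead of a list, and then prcomp should work. The resulting matrix has one column per pixel (and 44 rows), so it is transposed here to make each pixel an observation:
test <- replicate(10, rnorm(44, 3, 1), simplify = TRUE)
pca = prcomp(t(test), scale = FALSE)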
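If the data already exist as a list (as in the question) and cannot simply be regenerated, one possible approach is to stack the vectors into a matrix with do.call(rbind, ...). This is only a sketch on a smaller dummy list (1000 pixels instead of 626257, an arbitrary choice to keep it quick); the name mat is introduced here for illustration.

```r
set.seed(1)

# Smaller dummy list than in the question: 1000 "pixels", 44 wavelengths each
test <- replicate(1000, rnorm(44, 3, 1), simplify = FALSE)

# Stack the list elements as rows: one row per pixel, one column per wavelength
mat <- do.call(rbind, test)
dim(mat)  # 1000 rows, 44 columns

# prcomp expects observations in rows and variables in columns
pca <- prcomp(mat, scale. = FALSE)
summary(pca)$importance[, 1:3]
```

With pixels in rows, pca$x holds the scores for each pixel and pca$rotation the loadings for each wavelength.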

Related

Normalising the data by scale() function changes the correlation in R

I have some data in data frame format and I want to see the correlation between two groups of these data (observation and estimation), both in the original form and in the normalized form. The data is available here. Here is the code:
joint_df_estimation_observation <- data.frame(df_estimation = as.vector(as.matrix(rowSums(estimation_data))), df_observation = rowSums(observation_data))
joint_df_estimation_observation
cor.test(joint_df_estimation_observation$df_estimation, joint_df_estimation_observation$df_observation)
and I got the results:
normalised_estimation_data = as.data.frame(scale(as.matrix(t(as.matrix(estimation_data))), scale = TRUE))
normalised_estimation_data = as.data.frame(t(as.matrix(normalised_estimation_data)))
normalised_observation_data = as.data.frame(scale(as.matrix(t(as.matrix(observation_data))), scale = TRUE))
normalised_observation_data = as.data.frame(t(as.matrix(normalised_observation_data)))
joint_df_estimation_observation_normalisation <- data.frame(df_estimation = as.vector(as.matrix(rowSums(normalised_estimation_data))), df_observation = rowSums(normalised_observation_data))
joint_df_estimation_observation_normalisation
cor.test(joint_df_estimation_observation_normalisation$df_estimation, joint_df_estimation_observation_normalisation$df_observation)
and I got the result
I do not understand why the correlation is so different. I thought that the correlation would remain the same even if the data are normalized. Is there anything wrong with what I have done? I have considered the case that NAs may be generated if a row has a constant value across all columns in the original data, and that this caused the issue. However, I have removed those rows and it does not change the results.
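A possible explanation can be checked on made-up data. The sketch below assumes a small random matrix m as a stand-in for estimation_data (the real data are not available here). Because the code above transposes, scales and transposes back, every row ends up with mean zero, so its rowSums are zero up to floating-point error and the second cor.test would effectively be correlating rounding noise. Rescaling the two summed vectors themselves, by contrast, leaves the Pearson correlation unchanged.

```r
set.seed(1)

# Hypothetical stand-in for estimation_data: 5 rows, 4 columns
m <- matrix(rnorm(20, mean = 10, sd = 3), nrow = 5)

# Same row-wise standardisation as in the question:
# transpose, scale the columns, transpose back
row_scaled <- t(scale(t(m)))

# Each row now has mean 0, so the row sums are numerically zero
row_sums_scaled <- rowSums(row_scaled)
print(row_sums_scaled)

# Pearson correlation is unchanged when each vector is centred and scaled
a <- rowSums(m)
b <- a + rnorm(5)
cor_raw <- cor(a, b)
cor_scaled <- cor(as.vector(scale(a)), as.vector(scale(b)))
print(c(raw = cor_raw, scaled = cor_scaled))
```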

Extracting the relative influence from a gbm.fit object

I am trying to extract the relative influence of each variable from a gbm.fit object but it is coming up with the error below:
> summary(boost_cox, plotit = FALSE)
Error in data.frame(var = object$var.names[i], rel.inf = rel.inf[i]) :
row names contain missing values
The boost_cox object itself is fitted as follows:
boost_cox = gbm.fit(x = x,
                    y = y,
                    distribution = "coxph",
                    verbose = FALSE,
                    keep.data = TRUE)
I have to use the gbm.fit function rather than the standard gbm function due to the large number of predictors (26k+)
I have now solved this issue myself.
The relative.influence() function can be used and works for objects created using both gbm() and gbm.fit(). However, it does not provide the plots as in the summary() function.
I hope this helps anyone else looking in the future.
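As a rough illustration of the relative.influence() route, here is a sketch on simulated data. It assumes the gbm package is installed, and it uses a small Gaussian model rather than the Cox model from the question; the data, the variable names a, b, c and the tuning values are all made up for the example.

```r
# Only runs if the gbm package is available
if (requireNamespace("gbm", quietly = TRUE)) {
  set.seed(42)

  # Made-up predictors and response (not the question's data)
  x <- data.frame(a = rnorm(200), b = rnorm(200), c = rnorm(200))
  y <- 2 * x$a - x$b + rnorm(200, sd = 0.1)

  # Fit with gbm.fit, as in the question, but with a Gaussian loss
  fit <- gbm::gbm.fit(x = x, y = y,
                      distribution = "gaussian",
                      n.trees = 50,
                      shrinkage = 0.1,
                      verbose = FALSE)

  # Named numeric vector with one entry per predictor
  rel_inf <- gbm::relative.influence(fit, n.trees = 50, sort. = TRUE)
  print(rel_inf)
}
```

The result is a plain named vector, so it can be passed to barplot() if a plot similar to the one from summary() is wanted.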

Calculate the pearson correlation between two lists

I have many equally structured text files containing experimental data (641*976). At the beginning I define the correct "working directory" and order all the files in a list. Thereby I generate two different lists. Once the file.listx containing my sample data and once the file.listy containing reference data. Afterwards I rearrange the data in order to conduct the correlation analysis. Here the code shows how I generate the "x" list. The "y" list was generated exactly the same way with the reference data.
file.listx <- list.files(pattern="*.txt", full.names=T)
datalist = lapply(file.listx, FUN=read.table, header = F, sep = "\t", skip = 2)
cmbn = expand.grid(1:641, 1:977)
flen = length(datalist)
x = lapply(1:(nrow(cmbn)), function(t, lst, cmbn) {
  return(sapply(1:flen, function(i, t1, lst1, cmbn1) {
    return(lst1[[i]][cmbn1$Var1[t1], cmbn1$Var2[t1]])
  }, t, lst, cmbn))
}, datalist, cmbn)
Now I want to calculate the pearson correlation between the two lists.
http://www.datasciencemadesimple.com/pearson-function-in-excel/
In terms of the Pearson correlation formula, my "x" corresponds to the sample and my "y" to the reference.
cor(x, y, method = "pearson")
Then an error message pops up saying that 'x' must be numeric. I do not know how to solve this problem. When I use
x = as.numeric(x)
it seems that the list structure gets lost. The following approach does not solve the problem either.
x = as.matrix(x)
How can I convert my list into a numeric type without losing the structure? I want to calculate the Pearson correlation between the two lists.
Here is the code to generate two dummy lists. This way the error can be reproduced.
x = list(4:10, 10:16, 32:38, 100:106) # sample
y = list(10:16, 20:26, 40:46, 110:116) # reference
cor(x, y, method = "pearson")
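The question leaves open whether one correlation per pair of list elements or a single overall correlation is wanted, so the sketch below shows both readings on the dummy lists. The names per_pair and overall are introduced here for illustration.

```r
x <- list(4:10, 10:16, 32:38, 100:106)    # sample
y <- list(10:16, 20:26, 40:46, 110:116)   # reference

# Reading 1: one Pearson correlation per pair of list elements
# (x[[1]] with y[[1]], x[[2]] with y[[2]], and so on)
per_pair <- mapply(cor, x, y)
print(per_pair)

# Reading 2: a single correlation over all values,
# after flattening both lists into numeric vectors
overall <- cor(unlist(x), unlist(y), method = "pearson")
print(overall)
```

For the dummy data every y element is just the matching x element shifted by a constant, so each per-pair correlation is 1.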

"Input datasets must be dataframes" error in kamila package in R

I have a mixed type data set, one continuous variable, and eight categorical variables, so I wanted to try kamila clustering. It gives me an error when I use one continuous variable, but when I use two continuous variables it is working.
library(kamila)
data <- read.csv("mixed.csv",header=FALSE,sep=";")
conInd <- 9
conVars <- data[,conInd]
conVars <- data.frame(scale(conVars))
catVarsFac <- data[,c(1,2,3,4,5,6,7,8)]
catVarsFac[] <- lapply(catVarsFac, factor)
kamRes <- kamila(conVars, catVarsFac, numClust=5, numInit=10,calcNumClust = "ps",numPredStrCvRun = 10, predStrThresh = 0.5)
Error in kamila(conVar = conVar[testInd, ], catFactor =
catFactor[testInd, : Input datasets must be dataframes
I think the problem is that the function assumes that you have at least two of both data types (i.e. >= 2 continuous variables, and >= 2 categorical variables). It looks like you supplied a single column index (conInd = 9, just column 9), so you have only one continuous variable in your data. Try adding another continuous variable to your continuous data.
I had the same problem (with categoricals) and this approach fixed it for me.
I think the ultimate source of the error in the program is at around line 170 of the source code. Here's the relevant snippet...
numObs <- nrow(conVar)
numInTest <- floor(numObs/2)
for (cvRun in 1:numPredStrCvRun) {
  for (ithNcInd in 1:length(numClust)) {
    testInd <- sample(numObs, size = numInTest, replace = FALSE)
    testClust <- kamila(conVar = conVar[testInd,],
                        catFactor = catFactor[testInd, ],
                        numClust = numClust[ithNcInd],
                        numInit = numInit, conWeights = conWeights,
                        catWeights = catWeights, maxIter = maxIter,
                        conInitMethod = conInitMethod, catBw = catBw,
                        verbose = FALSE)
When the code draws a test subset of your data, it selects rows from a one-column data.frame, and with the default drop = TRUE that returns a vector rather than a data.frame. So you end up with "not a data.frame" even though you did supply a data.frame. That's where the error comes from.
If you can't dig up another variable to add to your data, you could edit the code such that the calls to kamila in the cvRun for loop wrap the data.frame() function around any subsetted conVar or catFactor, e.g.
testClust <- kamila(conVar = data.frame(conVar[testInd, ]),
                    catFactor = data.frame(catFactor[testInd, ]), ...)
and just save that as your own version of the function called say, my_kamila, which you could use instead.
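The one-column subsetting behaviour described above can be checked directly in base R. In this sketch, conVars is a made-up one-column data frame and testInd a made-up index vector; it also shows drop = FALSE as a possible alternative to wrapping the subset in data.frame(), since it keeps the column name.

```r
# Hypothetical one-column data frame standing in for conVars
conVars <- data.frame(x = c(0.5, -1.2, 0.3, 1.1))
testInd <- c(1, 3)

# Default subsetting drops the data.frame structure for a single column
dropped <- conVars[testInd, ]
is.data.frame(dropped)

# drop = FALSE keeps a one-column data.frame with its column name
kept <- conVars[testInd, , drop = FALSE]
is.data.frame(kept)
names(kept)
```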
Hope this helps.

How to use Stargazer R package with factor analysis output

I've used stargazer in the past with regression tables.
However I'd like to know how to use stargazer with output from factor analysis and principal component analysis.
My code runs as follows:
fa1 <- factanal(new2, factors = 4, rotation = "varimax", sort = TRUE)
print(fa1, digits = 3, cutoff = .5, sort = TRUE)
load <- fa1$loadings[,1:2]
plot(load,type="n")
text(load,labels=names(new2),cex=.7)
AND
two <- pca(new2, nfactors = 3)
THIS doesn't work - my only attempt so far.
stargazer(fa1, type = "text", title="Descriptive statistics", digits=1, out="table1.txt")
UPDATE: Since posting I have been able to convert the object to a data frame with:
converted <- as.data.frame(unclass(fa1$loadings))
I then used the code above successfully EXCEPT that the output doesn't seem to include individual factor scores.
This might not be a perfect solution, and you might have found out by now, but what you can do is output the factor loadings separately with stargazer by adding summary = FALSE as an option. This way you get only the factor loadings as output.
For example like this:
stargazer(as.data.frame(unclass(fa1$loadings)), summary = FALSE, title="Descriptive statistics", digits=1, out="table1.txt")
