PCA analysis in a loop for certain column intervals in R

I have a data set containing 526 rows and 560 columns. I want to run PCA on each consecutive block of 16 columns in a loop and save the PCA scores for each row. I tried the code below, but it did not work. I would be happy to get your advice.
Thanks in advance for your help.
for(i in 1:ncol(df)) {
df[ , i:(i+15)] <- prcomp(df[, i:(i+15)], scale. = TRUE, center = T)
}

Here is a way with an lapply loop. Create a vector f of consecutive integers, each repeated 16 times (this assumes the number of columns is a multiple of 16, as 560 is). Then split the column names of the data.frame by this vector and apply prcomp to each subset with lapply. Finally, extract the scores.
f <- c(1, rep(0, 15))
f <- rep(f, length(names(df1))/16)
f <- cumsum(f)
nms <- split(names(df1), f)
pca_list <- lapply(nms, function(x){
  prcomp(df1[x], center = TRUE, scale. = TRUE)
})
scores_list <- lapply(pca_list, '[[', 'x')
Test data creation code
set.seed(2021)
df1 <- replicate(560, rnorm(526))
df1 <- as.data.frame(df1)
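If only the first component of each block is needed, the per-block scores can be collected into a single matrix with one row per observation. This is a minimal sketch using the same simulated data; the block index built with %/% and the name pc1 are illustrative choices, not part of the answer above.

```r
set.seed(2021)
df1 <- as.data.frame(replicate(560, rnorm(526)))

# Block index: columns 1-16 -> 1, columns 17-32 -> 2, and so on
f <- (seq_along(df1) - 1) %/% 16 + 1

# One PCA per block of 16 columns
pca_list <- lapply(split(names(df1), f), function(x) {
  prcomp(df1[x], center = TRUE, scale. = TRUE)
})

# Keep only the PC1 scores of each block: one row per observation,
# one column per block
pc1 <- sapply(pca_list, function(p) p$x[, 1])
dim(pc1)
```

To keep more components, replace p$x[, 1] with p$x[, 1:k] and use lapply instead of sapply, so that each block returns its own score matrix.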

Related

Dataframe output from a for-loop

I am trying to populate the output of a for loop into a data frame. The loop is repeating across the columns of a dataset called "data". The output is to be put into a new dataset called "data2". I specified an empty data frame with 4 columns (i.e. ncol=4). However, the output generates only the first two columns. I also get a warning message: "In matrix(value, n, p) : data length [2403] is not a sub-multiple or multiple of the number of columns [2]"
Why does the dataframe called "data2" have 2 columns, when I have specified 4 columns? This is my code:
a <- 0
b <- 0
GM <- 0
GSD <- 0
data2 <- data.frame(ncol=4, nrow=33)
for (i in 1:ncol(data))
{
if (i==34) {break}
a[i] <- colnames(data[i])
b <- data$cycle
GM[i] <- geoMean(data[,i], na.rm=TRUE)
GSD[i] <- geoSD(data[,i], na.rm=TRUE)
data2[i,] <- c(a[i], b, GM[i], GSD[i])
}
data2
If you look at the ?data.frame() help page, you'll see that it does not take arguments nrow and ncol--those are arguments for the matrix() function.
This is how you initialize data2, and you can see it starts with 2 columns, one column is named ncol, the second column is named nrow.
data2 <- data.frame(ncol=4, nrow=33)
data2
# ncol nrow
# 1 4 33
Instead you could try data2 <- as.data.frame(matrix(NA, ncol = 4, nrow = 33)), though if you share a small sample of data and your expected result there may be more efficient ways than explicit loops to get this job done.
Generally, if you do loop, you want to do as much outside of the loop as possible. This is just guesswork without sample data, but these changes seem like a start at improving your code.
a <- colnames(data)
b <- data$cycle ## this never changes, no need to redefine every iteration
GM <- numeric(ncol(data)) ## better to initialize vectors to the correct length
GSD <- numeric(ncol(data))
data2 <- as.data.frame(matrix(NA, ncol = 4, nrow = 33))
for (i in 1:ncol(data))
{
if (i==34) {break}
GM[i] <- geoMean(data[,i], na.rm=TRUE)
GSD[i] <- geoSD(data[,i], na.rm=TRUE)
## it's weird to assign a row of data.frame at once...
## maybe you should keep it as a matrix?
data2[i,] <- c(a[i], b, GM[i], GSD[i])
}
data2
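For comparison, here is a loop-free sketch that builds the summary table in one step. It is only an illustration under several assumptions: the example data are made up, all values are positive, and geo_mean() and geo_sd() are hypothetical base-R stand-ins for geoMean() and geoSD(), defined here so that the snippet runs without extra packages.

```r
# Made-up example data: 5 positive numeric columns (assumption)
set.seed(1)
data <- as.data.frame(matrix(rlnorm(50), ncol = 5))

# Hypothetical base-R stand-ins for geoMean()/geoSD();
# they assume all values are positive
geo_mean <- function(x) exp(mean(log(x), na.rm = TRUE))
geo_sd   <- function(x) exp(sd(log(x), na.rm = TRUE))

# One row per column of `data`, built without an explicit loop
data2 <- data.frame(
  variable  = colnames(data),
  GM        = sapply(data, geo_mean),
  GSD       = sapply(data, geo_sd),
  row.names = NULL
)
data2
```

Because every column of data2 is created at its final length, nothing has to be pre-allocated and no row-by-row assignment is needed.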

Picking 30 random data from data with sample()?

I am stuck.
We are asked to pick 30 random values from our dataset, then replace the picked values with NAs.
I'm stuck at the beginning. The following call selects 30 random rows, so it picks 30 values from each column, while I want 30 random values picked from the whole dataset.
data2[sample(nrow(data2),30), ]
I hope you can help me out. Thank you for your help.
Do you mean to replace 30 random rows?
data2 <- iris # as an example
throwouts <- sample(nrow(data2),30)
data2[throwouts, ] <- NA
print(data2)
Do you mean to replace 30 values in random rows and random columns?
data2 <- iris # as an example
coords <- expand.grid(1:nrow(data2),1:ncol(data2)) # all the possible values
coords <- coords[ sample(nrow(coords), 30), ] # take 30 unique ones of all possible values
for(i in 1:30) # erase each of them individually
data2[coords$Var1[i], coords$Var2[i] ] <- NA
print(data2)
The following seems to be memory efficient. It uses a logical matrix of FALSE values, with 30 TRUE values in random positions, to assign the NAs.
set.seed(2020)
v <- rep(FALSE, prod(dim(df1)))
v[sample(length(v), 30)] <- TRUE
is.na(df1) <- matrix(v, nrow = nrow(df1))
rm(v)
This can easily be written as a function.
assignNA <- function(x, n){
  v <- rep(FALSE, prod(dim(x)))
  v[sample(length(v), n)] <- TRUE  # use n rather than a hard-coded 30
  is.na(x) <- matrix(v, nrow = nrow(x))
  x
}
set.seed(2020)
assignNA(df1, n = 30)
Tested with the data
df1 <- iris
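A quick way to check the result is to count the missing cells afterwards. The sketch below repeats the logical-matrix idea in a self-contained helper (called assign_na here, a hypothetical name, so that it does not clash with the function above) and confirms that exactly n cells were blanked. It assumes the input has no NAs to begin with, as is the case for iris.

```r
# Self-contained helper: set n randomly chosen cells of x to NA
assign_na <- function(x, n) {
  v <- rep(FALSE, prod(dim(x)))
  v[sample(length(v), n)] <- TRUE          # n distinct cell positions
  is.na(x) <- matrix(v, nrow = nrow(x))    # blank the selected cells
  x
}

set.seed(2020)
out <- assign_na(iris, n = 30)

sum(is.na(out))      # total number of missing cells
colSums(is.na(out))  # how the missing cells are spread over the columns
```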

Any efficient way to filter out multi-dim dataframe by measuring its correlation coefficient in R?

I want to compute the Pearson correlation coefficient between multi-dimensional data and a numeric vector in R. Basically, I expect to get a correlation matrix using the Pearson method, and I want to keep the rows (i.e., the features measured for each column) of the multi-dimensional data whose correlation coefficient passes a certain threshold. I tried an R implementation, but it didn't produce a correct correlation matrix. How can I get this? Can anyone point out how to do this easily in R?
reproducible example
persons_df <- data.frame(person1=sample(1:20,10, replace = FALSE),
person2=as.factor(sample(10)),
person3=sample(1:25,10, replace = FALSE),
person4=sample(1:30,10, replace = FALSE),
person5=as.factor(sample(10)),
person6=as.factor(sample(10)))
row.names(persons_df) <-letters[1:10]
In persons_df, the rows are different features and the columns are different persons.
I also have age_df, which holds the age of each person.
age_df <- data.frame(personID= colnames(persons_df),
age=sample(1:50, 6 , replace = FALSE))
my initial attempt:
pearson_corr <- function(df1, df2, verbose=FALSE){
stopifnot(ncol(df1)==nrow(df2))
res <- as.data.frame()
lapply(colnames(df1), function(x){
lapply(x, rownames(y){
if(colnames(x) %in% rownames(df2)){
cor_mat <- stats::cor(y, df2$age, method = "pearson")
ncor <- ncol(cor_mat)
cmatt <- col(cor_mat)
ord <- order(-cmat, cor_mat, decreasing = TRUE)- (ncor*cmatt - ncor)
colnames(ord) <- colnames(cor_mat)
res <- cbind(ID=c(cold(ord), ID2=c(ord)))
res <- as.data.frame(cbind(out, cor=cor_mat[res]))
res <- cbind(res, cor=cor_mat[out])
}
})
})
return(final_df)
}
But the code above didn't return a correct correlation matrix. What I want to find out is how each feature, measured across the persons, is correlated with their ages. Is there an efficient way to do this?
goal:
Basically, I want to keep the features that show a high correlation with age. I don't have a better idea of how to do this in R. Can anyone point out how to get this done easily and efficiently in R? Thanks.
mylist = do.call(rbind,
apply(persons_df, 1, function(x){
temp = cor.test(age_df$age, as.numeric(x))
data.frame(t = temp$statistic, p = temp$p.value)
}))
mylist
# t p
#a -1.060264 3.488012e-01
#b -2.292612 8.361623e-02
#c -16.785311 7.382895e-05
#d -1.362776 2.446304e-01
#e -1.922296 1.269356e-01
#f -4.671259 9.509393e-03
#g -3.719296 2.048710e-02
#h -2.684663 5.496171e-02
#i -15.814635 9.341701e-05
#j -2.423014 7.252635e-02
Then use mylist to filter out what values you don't want.
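The filtering step could look like the sketch below. It makes several assumptions that are not in the question: the example data are a made-up, numeric-only variant of persons_df, the correlation estimate itself (rather than t or p) is used as the criterion, and the cutoff of 0.5 is arbitrary.

```r
set.seed(42)

# Made-up numeric-only data: 10 features (rows) x 6 persons (columns)
persons_df <- as.data.frame(
  matrix(sample(1:30, 60, replace = TRUE), nrow = 10,
         dimnames = list(letters[1:10], paste0("person", 1:6)))
)
age_df <- data.frame(personID = colnames(persons_df),
                     age = sample(1:50, 6))

# Pearson correlation of each feature (row) with age
r <- apply(persons_df, 1, function(x) cor(as.numeric(x), age_df$age))

# Keep the features whose absolute correlation reaches the cutoff
threshold <- 0.5  # assumed cutoff, adjust as needed
kept <- persons_df[which(abs(r) >= threshold), , drop = FALSE]
kept
```

Using which() means that features with an undefined correlation (for example, a constant row) are dropped instead of producing NA rows.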

R: How to write a for loop that reads every two lines in a matrix?

I want to calculate correlation statistics using cor.test(). I have a data matrix where the two members of each pair to be tested are on consecutive lines (I have more than a thousand pairs, so I also need to correct for multiple testing later). I was thinking that I could loop through the matrix two lines at a time and perform the test (i.e., first test the correlation between row1 and row2, then row3 and row4, row5 and row6, etc.), but I don't know how to write this kind of loop.
This is how I do the test on a single pair:
d = read.table(file="cor-test-sample-data.txt", header=T, sep="\t", row.names = 1)
d = as.matrix(d)
cor.test(d[1,], d[2,], method = "spearman")
You could try
res <- lapply(split(seq_len(nrow(mat1)),(seq_len(nrow(mat1))-1)%/%2 +1),
function(i){m1 <- mat1[i,]
if(NROW(m1)==2){
cor.test(m1[1,], m1[2,], method="spearman")
}
else NA
})
To get the p-values
resP <- sapply(res, function(x) x$p.value)
indx <- t(`dim<-`(seq_len(nrow(mat1)), c(2, nrow(mat1)/2)))
names(resP) <- paste(indx[,1], indx[,2], sep="_")
resP
# 1_2 3_4 5_6 7_8 9_10 11_12 13_14
#0.89726818 0.45191660 0.14106085 0.82532260 0.54262680 0.25384239 0.89726815
# 15_16 17_18 19_20 21_22 23_24 25_26 27_28
#0.02270217 0.16840791 0.45563229 0.28533447 0.53088721 0.23453161 0.79235990
# 29_30 31_32
#0.01345768 0.01611903
Or using mapply (assuming that the number of rows is even)
ind <- seq(1, nrow(mat1), by=2) #similar to the one used by @CathG in for loop
mapply(function(i,j) cor.test(mat1[i,], mat1[j,],
method='spearman')$p.value , ind, ind+1)
data
set.seed(25)
mat1 <- matrix(sample(0:100, 20*32, replace=TRUE), ncol=20)
Try
d = matrix(rep(1:9, 3), ncol=3, byrow = T)
sapply(2*(1:(nrow(d)/2)), function(pair) unname(cor.test(d[pair-1,], d[pair,], method="spearman")$estimate))
pvalues<-c()
for (i in seq(1,nrow(d),by=2)) {
pvalues<-c(pvalues,cor.test(d[i,],d[i+1,],method="spearman")$p.value)
}
names(pvalues)<-paste(row.names(d)[seq(1,nrow(d),by=2)],row.names(d)[seq(2,nrow(d),by=2)],sep="_")
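Since the question mentions correcting for the large number of pairs, the collected p-values can be passed to p.adjust(). The sketch below uses simulated data of the same shape as mat1 above; the choice of the Benjamini-Hochberg method ("BH") is an assumption, and exact = FALSE is set only to avoid the ties warning in Spearman's test.

```r
set.seed(25)
mat1 <- matrix(sample(0:100, 20 * 32, replace = TRUE), ncol = 20)

# First row of each consecutive pair: 1, 3, 5, ...
odd <- seq(1, nrow(mat1) - 1, by = 2)

# Raw Spearman p-value for each pair of consecutive rows
p_raw <- sapply(odd, function(i) {
  cor.test(mat1[i, ], mat1[i + 1, ],
           method = "spearman", exact = FALSE)$p.value
})
names(p_raw) <- paste(odd, odd + 1, sep = "_")

# Multiple-testing correction across all pairs
p_adj <- p.adjust(p_raw, method = "BH")
head(cbind(p_raw, p_adj))
```

Other methods accepted by p.adjust(), such as "bonferroni" or "holm", can be swapped in through the method argument.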

How do you find the sample sizes used in calculations on r?

I am running correlations between variables, some of which have missing data, so the sample sizes for the correlations are likely different. I tried print and summary, but neither of these shows me how big my n is for each correlation. This is a fairly simple problem that I cannot find the answer to anywhere.
like this..?
x <- c(1:100,NA)
length(x)
length(x[!is.na(x)])
you can also get the degrees of freedom like this...
y <- c(1:100,NA)
x <- c(1:100,NA)
cor.test(x,y)$parameter
But I think it would be best if you showed the code you use to estimate the correlation, so that you can get exact help.
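For a Pearson test, the degrees of freedom reported by cor.test() equal n - 2, so the sample size actually used can be recovered from them. A small sketch with made-up vectors whose missing values sit in different positions:

```r
set.seed(1)
x <- c(rnorm(100), NA)   # last value missing
y <- c(NA, rnorm(100))   # first value missing

ct <- cor.test(x, y)     # Pearson by default; incomplete pairs are dropped

# Pearson degrees of freedom are n - 2, so add 2 to recover n
n_used <- unname(ct$parameter) + 2
n_used

# Cross-check: number of pairs with both values present
sum(complete.cases(x, y))
```

This relies on the df parameter, which cor.test() reports for method = "pearson" only.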
Here's an example of how to find the pairwise sample sizes among the columns of a matrix. If you want to apply it to (certain) numeric columns of a data frame, combine them accordingly, coerce the resulting object to matrix and apply the function.
# Example matrix:
xx <- rnorm(3000)
# Generate some NAs
vv <- sample(3000, 200)
xx[vv] <- NA
# reshape to a matrix
dd <- matrix(xx, ncol = 3)
# find the number of NAs per column
apply(dd, 2, function(x) sum(is.na(x)))
# tack on some column names
colnames(dd) <- paste0("x", seq(3))
# Function to find the number of pairwise complete observations
# among all pairs of columns in a matrix. It returns a data frame
# whose first two columns comprise all column pairs
pairwiseN <- function(mat)
{
u <- if(is.null(colnames(mat))) paste0("x", seq_len(ncol(mat))) else colnames(mat)
h <- expand.grid(x = u, y = u)
f <- function(x, y)
sum(apply(mat[, c(x, y)], 1, function(z) !any(is.na(z))))
h$n <- mapply(f, h[, 1], h[, 2])
h
}
# Call it
pairwiseN(dd)
The function can easily be improved; for example, you could set h <- expand.grid(x = u[-1], y = u[-length(u)]) to cut down on the number of calculations, you could return an n x n matrix instead of a three-column data frame, etc.
Here is a for-loop implementation of Dennis' function above that outputs an n x n matrix directly, rather than having to reshape the result with pivot_wider(). On my Databricks cluster it cut the compute time for a 1865-row x 69-column matrix from 2.5-3 minutes to 30-40 seconds.
Thanks for your answer Dennis, this helped me with my work.
pairwise_nxn <- function(mat)
{
cols <- if(is.null(colnames(mat))) paste0("x", seq_len(ncol(mat))) else colnames(mat)
nn <- data.frame(matrix(nrow = length(cols), ncol = length(cols)))
rownames(nn) <- colnames(nn) <- cols
f <- function(x, y)
sum(apply(mat[, c(x, y)], 1, function(z) !any(is.na(z))))
for (i in 1:nrow(nn))
for (j in 1:ncol(nn))
nn[i,j] <- f(rownames(nn)[i], colnames(nn)[j])
nn
}
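The same n x n table of pairwise sample sizes can also be obtained without any loop, using a matrix product of the "observed" indicator matrix. This is a sketch of an alternative technique, not a rewrite of either function above; the example data are simulated and the names ok and n_pairs are illustrative.

```r
set.seed(1)
xx <- rnorm(3000)
xx[sample(3000, 200)] <- NA   # 200 distinct missing cells
dd <- matrix(xx, ncol = 3, dimnames = list(NULL, paste0("x", 1:3)))

# 1 where a value is observed, 0 where it is missing
ok <- (!is.na(dd)) * 1

# Entry [i, j] counts the rows where columns i and j are both observed;
# the diagonal holds the per-column non-missing counts
n_pairs <- crossprod(ok)
n_pairs
```

Because the work is done by a single matrix product, this tends to scale better than calling a function once per pair of columns.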
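If your variables are vectors named a and b, sum(is.na(a) | is.na(b)) counts the pairs that are dropped because of missing values, and sum(!is.na(a) & !is.na(b)) gives the number of complete pairs actually used. Would that help you?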
