R for loop to calculate wilcox.test - r

I am trying to write a code that would automatically calculate Wilcoxon test p-value for several comparisons.
Data used: 2 data sets with the same information representing two groups of participants completed the same 5 tasks which means that the each table contains 5 columns (tasks) and X rows with tasks scores.
data_17_18_G2 # first data set (in data.table format)
data_18_20_G2 # second data set (in data.table format)
Both data sets have identical names of column which are to be used in the W-test the next way:
wilcox.test(Group1Task1, Group2Task1, paired = F)
wilcox.test(Group1Task2, Group2Task2, paired = F)
and so on.
The inputs (e.g., Grou1Task1) are two vectors of task scores (the first one will be from data_17_18_G2 and the other one from data_18_20_G2
Desired output: a data table with a column of p-values
The problem I faced is that no matter how I manipulated the val1 and val2 empty objects, in the second and the third lines the right size "as.numeric(unlist(data_17_18_G2[, ..i]))" gives a correct output (a numeric vector) but it's left size "val1[i]" always returns only one value from the vector. That gave me the idea that the main problem appeared on the step of creating an empty vector, however, I wasn't able to solve it.
Empty objects:
result <- data.table(matrix(ncol=2))
val1 <- as.numeric() # here I also tried functions "numeric" and "vector"
val2 <- as.numeric()
res <- vector(mode = "list", length = 7)
For loop
for (i in 1:5) {
val1[i] <- as.numeric(unlist(data_17_18_G2[ , ..i]))
val2[i] <- as.numeric(unlist(data_18_20_G2[ , ..i]))
res[i] <- wilcox.test(val1[i], val2[i], paired = F)
result[i, 1] <- i
result[i, 2] <- res$p.value
}
Output:
Error in `[<-.data.table`(`*tmp*`, i, 2, value = NULL) :
When deleting columns, i should not be provided
1: В val1[i] <- as.numeric(unlist(data_17_18_G2[, ..i])) :
number of items to replace is not a multiple of replacement length
2: В val2[i] <- as.numeric(unlist(data_18_20_G2[, ..i])) :
number of items to replace is not a multiple of replacement length
3: В res[i] <- wilcox.test(val1[i], val2[i], paired = F) :
number of items to replace is not a multiple of replacement length
Alternative:
I changed the second and the third lines
for (i in 1:5) {
val1[i] <- as.numeric(data_17_18_G2[ , ..i])
val2[i] <- as.numeric(data_18_20_G2[ , ..i])
res[i] <- wilcox.test(val1[i], val2[i], paired = F)
result[i, 1] <- i
result[i, 2] <- res$p.value
}
And got this
Error in as.numeric(data_17_18_G2[, ..i]) :
(list) object cannot be coerced to type 'double'
which means that the function wilcox.test cannot interpret this type of input.
How can I improve the code so that I get a data table of p-values?

There would appear to be some bugs in the code. I have rewritten the code using the cars dataset as a example.
## use the cars dataset as a example (change with appropriate data)
data(cars)
data_17_18_G2 <- as.data.table(cars)
data_18_20_G2 <- data_17_18_G2[,2:1]
## Fixed code
result <- data.table(matrix(as.numeric(), nrow=ncol(data_17_18_G2), ncol=2))
val1 <- as.numeric()
val2 <- as.numeric()
res <- vector(mode = "list", length = 7)
for (i in 1:ncol(data_17_18_G2)) {
val1 <- as.numeric(unlist(data_17_18_G2[ , ..i]))
val2 <- as.numeric(unlist(data_18_20_G2[ , ..i]))
res[[i]] <- wilcox.test(val1, val2, paired = F)
result[i, 1] <- as.numeric(i)
result[i, 2] <- as.numeric(res[[i]]$p.value)
}
Hope this gives you the output you are after.

Related

Randomly subsampling seurat object

I've been trying to randomly subsample my seurat object.
I'm interested in subsampling based on 2 columns: condition and cell type. I have 5 conditions and 5 cell types. Main goal is to have 1000 cells for each cell type in each condition.
I've tried this so far:
First thing is subsetting my seurat object:
my.list <- list(hipo.c1.neurons = hipo %>%
subset(., condition %in% "c1" & group %in% "Neurons"),
hipo.c1.oligo = hipo %>%
subset(., condition %in% "c1" & group %in% "Oligod")...etc...)
And then subsample it using sample function:
set.seed(0)
my.list.sampled <- lapply(X = my.list, FUN = function(x) {
x <- x[,sample(ncol(x), 1000, replace = FALSE)]
})
And I get this error since there are some objects with less than 1000 cells: error in evaluating the argument 'j' in selecting a method for function '[': cannot take a sample larger than the population when 'replace = FALSE'
Then I've tried with this function:
lapply_with_error <- function(X,FUN,...){
lapply(X, function(x, ...) tryCatch(FUN(x, ...),
error = function(e)NULL))
}
But then it gives me 0 in those objects that have less than 1000 cells. What would be the way to skip those objects that have less than 1000 cells and leave it like they are (not sample those ones)?
Is there a simpler way to do this, so I don't have to subset all of my objects separately?
I can't say for certain without seeing your data, but could you just add an if statement in the function? It looks like you're sampling column-wise, so check the number of columns. Just return x if the number of columns is less than the number you'd like to sample.
set.seed(0)
my.list.sampled <- lapply(X = my.list, FUN = function(x) {
if(ncol(x) > 1000){
x <- x[,sample(ncol(x), 1000, replace = FALSE)]
} else {
x
}
})
You could make it more flexible if you want to sample something other than 1000.
set.seed(0)
my.list.sampled <- lapply(X = my.list, B = 1000, FUN = function(x, B) {
if(ncol(x) > B){
x <- x[,sample(ncol(x), B, replace = FALSE)]
} else {
x
}
})

I am having an issue adding vectors to a list [duplicate]

This question already has an answer here:
What is the difference between [ ] and [[ ]] in R? [duplicate]
(1 answer)
Closed 1 year ago.
I have a list of matrices constructed by the following loops:
# Set up Row and Column Names for prediction coefficients.
rows = c("Intercept", "actsBreaks0", "actsBreaks1","actsBreaks2","actsBreaks3","actsBreaks4","actsBreaks5","actsBreaks6",
"actsBreaks7","actsBreaks8","actsBreaks9","tBreaks0","tBreaks1","tBreaks2","tBreaks3", "unitBreaks0", "unitBreaks1",
"unitBreaks2","unitBreaks3", "covgBreaks0","covgBreaks1","covgBreaks2","covgBreaks3","covgBreaks4","covgBreaks5",
"covgBreaks6","yearBreaks2016","yearBreaks2015","yearBreaks2014","yearBreaks2013","yearBreaks2011",
"yearBreaks2010","yearBreaks2009","yearBreaks2008","yearBreaks2007","yearBreaks2006","yearBreaks2005",
"yearBreaks2004","yearBreaks2003","yearBreaks2002","yearBreaks2001","yearBreaks2000","yearBreaks1999",
"yearBreaks1998","plugBump0","plugBump1","plugBump2","plugBump3")
cols = c("Value")
# Build Matrix for dummy coefficient values.
matrix1 <- matrix(c(1:48), nrow = 48, ncol = 1, byrow = TRUE, dimnames = list(rows,cols))
matrix1
# Extract each variable type into own matrix (i.e. all "actsBreaks{x}")
#
Beta_names <- list()
betabreaks <- unique(gsub("[0-9]*", "", rows))
for (bc in betabreaks)
{
Breaks <- grep(paste0(bc, "[0-9]*"), rows)
Beta_names[[bc]] <- matrix1[Breaks, ,drop = FALSE]
Beta_names[[bc]] <- data.matrix(unlist(Beta_names[[bc]])) #, byrow = TRUE)
}
# Set up matrices for excluded/test data
one_column <-c(1,1,2,3,3,3,4,4,4,4,4,5,6,9,9,8,7,5,4,7,7,8,2,0,10)
two_column <-c(1,1,2,3,3,3,4,4,4,4,4,5,6,9,9,8,7,5,4,7,7,8,3,0,10)
three_column <-c(1,1,2,3,3,3,4,4,4,4,4,5,6,9,9,8,7,5,4,7,7,8,4,10,0)
four_column <-c(1,1,2,3,3,3,4,4,4,4,4,5,6,9,9,8,7,5,4,7,7,8,5,0,10)
five_column <-c(1,1,2,3,3,3,4,4,4,4,4,5,6,9,9,8,7,5,4,7,7,8,6,0,10)
six_column <-c(1,1,2,3,3,3,4,4,4,4,4,5,6,9,9,8,7,5,4,7,7,8,7,0,10)
seven_column <-c(1,1,2,3,3,3,4,4,4,4,4,5,6,9,9,8,7,5,4,7,7,8,8,0,10)
eight_column <-c(1,1,2,3,3,3,4,4,4,4,4,5,6,9,9,8,7,5,4,7,7,8,9,0,10)
nine_column <-c(1,1,2,3,3,3,4,4,4,4,4,5,6,9,9,8,7,5,4,7,7,8,1,0,10)
ten_column <-c(1,1,2,3,3,3,4,4,4,4,4,5,6,9,9,8,7,5,4,7,7,8,0,0,10)
DF1 <- data.frame (one_column ,two_column ,three_column ,
four_column ,five_column ,six_column ,
seven_column ,eight_column ,nine_column ,
ten_column )
paralength <- 5
Xnames <- list()
datindc <- 1
while ( datindc <= paralength )
{
Xbreaks <- factor(DF1[[datindc]],levels=sort(unique.default(DF1[[datindc]]),decreasing=FALSE))
Xnames[[datindc]] <- data.frame(model.matrix(~Xbreaks -1), stringsAsFactors = FALSE)
datindc <- datindc + 1
}
#
Xlngth <- length(Xnames)
BtaXind <- 1
BetaiXi <- list()
while ( BtaXind <= Xlngth )
{
BetaiXi[[BtaXind]] <- (Beta_names[[BtaXind + 1]] * Xnames[[BtaXind]])
BtaXind <- (BtaXind + 1)
}
I need to add each of those matrices' rows to each other, which I am trying to do by turning each matrix into a vector
BiXilngth <- length(BetaiXi)
BetaiXiTr <- list()
BtaiXiTrd <- 1
while (BtaiXiTrd <= BiXilngth)
{
Var1 <- c(t(BetaiXi[[BtaiXiTrd]]))
BetaiXiTr[BtaiXiTrd] <- Var1
BtaiXiTrd <- BtaiXiTrd + 1
}
and adding the vectors, effectively transposing the matrices. However, when I tried to convert the first matrix BetaiXi[[1]] to a vector and add it to the list with this command BetaiXiTr[BtaiXiTrd] <- c(t(BetaiXi[[BtaiXiTrd]])) I got the following message:
Warning message:
In BetaiXiTr[BtaiXiTrd] <- c(t(BetaiXi[[BtaiXiTrd]])) :
number of items to replace is not a multiple of replacement length
I then tried using unlist():
> BetaiXiTr[BtaiXiTrd] <-unlist(c(t(BetaiXi[[1]])))
Warning message:
In BetaiXiTr[BtaiXiTrd] <- unlist(c(t(BetaiXi[[1]]))) :
number of items to replace is not a multiple of replacement length
with the same result. Finally, I tried assigning the first vector to a variable > Var1 <- c(t(BetaiXi[[BtaiXiTrd]])) and assigning that vector to the list > BetaiXiTr[BtaiXiTrd] <- Var1 with, yet again, the same warning:
Warning message:
In BetaiXiTr[BtaiXiTrd] <- Var1 :
number of items to replace is not a multiple of replacement length
I searched for the warning message to determine what exactly I was being warned of but ended being more confused. Most reproduce or encountered the error message by trying to replace a vector of so many elements with a vector of fewer, while (to my understanding) I am simply trying to add a vector to a list. Am I going about this the incorrect way?
I was using [ ] and [ [ ] ] incorrectly in BetaiXiTr[BtaiXiTrd]. It needs to be BetaiXiTr[[BtaiXiTrd]]and that allows the vectors to be added.

nested loop in r to correlate columns of df1 to columns of df2

I have two datasets with abundance data from groups of different species. Columns are species and rows are sites. The sites (rows) are identical between the two datasets and what i am trying to do is to correlate the columns of the first dataset to the columns of the second dataset in order to see if there is a positive or a negative correlation.
library(Hmisc)
rcorr(otu.table.filter$sp1,new6$spA, type="spearman"))$P
rcorr(otu.table.filter$sp1,new6$spA, type="spearman"))$r
the first will give me the p value of the relation between sp1 and spA and the second the r value
I initially created a loop that allowed me to check all species of the first dataframe with a single column of the second dataframe. Needless to say if I was to make this work I would have to repeat the process a few hundred times.
My simple loop for one column of df1(new6) against all columns of df2(otu.table.filter)
pvalues = list()
for(i in 1:ncol(otu.table.filter)) {
pvalues[[i]] <-(rcorr(otu.table.filter[ , i], new6$Total, type="spearman"))$P
}
rvalues = list()
for(i in 1:ncol(otu.table.filter)) {
rvalues[[i]] <-(rcorr(otu.table.filter[ , i], new6$Total, type="spearman"))$r
}
p<-NULL
for(i in 1:length(pvalues)){
tmp <-print(pvalues[[i]][2])
p <- rbind(p, tmp)
}
r<-NULL
for(i in 1:length(rvalues)){
tmp <-print(rvalues[[i]][2])
r <- rbind(r, tmp)
}
fdr<-as.matrix(p.adjust(p, method = "fdr", n = length(p)))
sprman<-cbind(r,p,fdr)
and using the above as a starting point I tried to create a nested loop that each time would examine a column of df1 vs all columns of df2 and then it would proceed to the second column of df1 against all columns of df2 etc etc
but here i am a bit lost and i could not find an answer for a solution in r
I would assume that the pvalues output should be a list of
pvalues[[i]][[j]]
and similarly the rvalues output
rvalues[[i]][[j]]
but I am a bit lost and I dont know how to do that as I tried
pvalues = list()
rvalues = list()
for (j in 1:7){
for(i in 1:ncol(otu.table.filter)) {
pvalues[[i]][[j]] <-(rcorr(otu.table.filter[ , i], new7[,j], type="spearman"))$P
}
for(i in 1:ncol(otu.table.filter)) {
rvalues[[i]][[j]] <-(rcorr(otu.table.filter[ , i], new7[,j], type="spearman"))$r
}
}
but I cannot make it work cause I am not sure how to direct the output in the lists and then i would also appreciate if someone could help me with the next part which would be to extract for each comparison the p and r value and apply the fdr function (similar to what i did with my simple loop)
here is a subset of my two dataframes
Here a small demo. Let's assume two matrices x and y with a sample size n. Then correlation and approximate p-values can be estimated as:
n <- 100
x <- matrix(rnorm(10 * n), nrow = n)
y <- matrix(rnorm(5 * n), nrow = n)
## correlation matrix
r <- cor(x, y, method = "spearman")
## p-values
pval <- function(r, n) 2 * (1 - pt(abs(r)/sqrt((1 - r^2)/(n - 2)), n - 2))
pval(r, n)
## for comparison
cor.test(x[,1], y[,1], method = "spearman", exact = FALSE)
More details can be found here: https://stats.stackexchange.com/questions/312216/spearman-correlation-significancy-test
Edit
And finally a loop with cor.test:
## for comparison
p <- matrix(NA, nrow = ncol(x), ncol=ncol(y))
for (i in 1:ncol(x)) {
for (j in 1:ncol(y)) {
p[i, j] <- cor.test(x[,i], y[,j], method = "spearman")$p.value
}
}
p
The values differ a somewhat, because the first uses the t-approximation then the second the "exact AS 89 algorithm" of cor.test.

Txt Prediction Model Numerical Expression Warning

I have three dataframes created from different ngram counts (Uni, Bi , Tri) each data frame contains the separated ngram, frequency counts (n) and have added probability using smoothing.
I have written three functions to look through the tables and return the highest probable word based on an input string. And have binded them
##Prediction Model
trigramwords <- function(FirstWord, SecondWord, n = 5 , allow.cartesian =TRUE) {
probword <- trigramtable[.(FirstWord, SecondWord), allow.cartesian = TRUE][order(-Prob)]
if(any(is.na(probword)))
return(bigramwords(SecondWord, n))
if(nrow(probword) > n)
return(probword[1:n, ThirdWord])
count <-nrow(probword)
bgramwords <- bigramtable(SecondWord, n)[1:(n - count)]
return(c(probword[, ThirdWord], bgramwords))
}
bigramwords <- function(FirstWord, n = 5 , allow.cartesian = TRUE){
probword <- bigramtable[FirstWord][order(-Prob)]
if(any(is.na(probword)))
return(Unigramword(n))
if (nrow(probword) > n)
return(probword[1:n, SecondWord])
count <- nrow(probword)
word1 <- Unigramword(n)[1:(n - count)]
return(c(probword[, SecondWord], word1))
}
##Back off Model
Unigramword <- function(n = 5, allow.cartesian = TRUE){
return(sample(UnigramTable[, FirstWord], size = n))
}
## Bind Functions
predictword <- function(str) {
require(quanteda)
tokens <- tokens(x = char_tolower(str))
tokens <- char_wordstem(rev(rev(tokens[[1]])[1:2]), language = "english")
words <- trigramwords(tokens[1], tokens[2], 5)
chain_1 <- paste(tokens[1], tokens[2], words[1], sep = " ")
print(words[1])
}
However I receive the following warning message and the output is always the same word. If I use only the bigramwords function it works fine, but when adding the trigram function I get the warning message. I believe it because 1:n is not defined correctly.
Warning message:
In 1:n : numerical expression has 5718534 elements: only the first used

Reading series of values in R

I have read a series of 332 files like below by storing the data in each file as a data frame in List.
files <- list.files()
data <- list()
for (i in 1:332){
data[[i]] = read.csv(files[[i]])
}
The data has 3 columns with names id, city, town. Now I need to calculate the mean of all values under city corresponding to the id values 1:10 for which I wrote the below code
for(j in 1:10){
req.data <- data[[j]]$city
}
mean(na.omit(req.data))
But it is giving me a wrong value and when I call it in a function its transferring null values. Any help is highly appreciated.
Each time you iterate through j = 1:10 you assign data[[j]]$city to the object req.data. In doing so, for steps j = 2:10 you are overwriting the previous version of req.data with the contents of the jth data set. Hence req.data only ever contains at any one time a single city's worth of data and hence you are getting the wrong answer sa you are computing the mean for the last city only, not all 10.
Also note that you could do mean(req.data, na.rm = TRUE) to remove the NAs.
You can do this without an explicit loop at the user R level using lapply(), for example, with dummy data,
set.seed(42)
data <- list(data.frame(city = rnorm(100)),
data.frame(city = rnorm(100)),
data.frame(city = rnorm(100)))
mean(unlist(lapply(data, `[`, "city")), na.rm = TRUE)
which gives
> mean(unlist(lapply(data, `[`, "city")), na.rm = TRUE)
[1] -0.02177902
So in your case, you need:
mean(unlist(lapply(data[1:10], `[`, "city")), na.rm = TRUE)
If you want to write a loop, then perhaps
req.data <- vector("list", length = 3) ## allocate, adjust to length = 10
for (j in 1:3) { ## adjust to 1:10 for your data / Q
req.data[[j]] <- data[[j]]$city ## fill in
}
mean(unlist(req.data), na.rm = TRUE)
> mean(unlist(req.data), na.rm = TRUE)
[1] -0.02177902
is one way. Or alternatively, compute the mean of the individual cities and then average those means
vec <- numeric(length = 3) ## allocate, adjust to length = 10
for (j in 1:3) { ## adjust to 1:10 for your question
vec[j] <- mean(data[[j]]$city, na.rm = TRUE)
}
mean(vec)

Resources