Good evening,
I'm working on a class project, trying to run multiple unpaired two-sample t-tests and store their p-values so that I can work with just the p-values later.
Below is the code I have been trying:
pVals_1Beta <- vector("numeric", length = nrow(group1_Y_Beta))
for (i in 1:nrow(group1_Y_Beta)) {
  pVals_1Beta[i] <- t.test(x = group1_Y_Beta$values[i,],
                           y = group1_N_Beta$values[i,],
                           paired = FALSE,
                           var.equal = FALSE,
                           conf.level = 0.95)$p.value
}
where group1_Y_Beta and group1_N_Beta each have two columns (values and ind) and about 110312 rows. I want to run an unpaired t-test comparing the two groups' values for each row and store all 110312 p-values. When I try running this I get:
Error in group1_Y_Beta$values[i, ] : incorrect number of dimensions
Any help on how to tweak my code to get it to work would be greatly appreciated.
Thanks, Liz
Since group1_N_Beta and group1_Y_Beta are 2D objects, you would normally need both a row and a column identifier to obtain a specific cell's value. But because you have already selected the column with the $ notation, the result is a plain vector, so you only need to provide one index (or a vector of indices) to complete the query. Replace [i,] ("ith row, all columns") with [i].
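A tiny illustration of the indexing point, using a made-up two-column data frame (not your actual data):

df <- data.frame(values = c(10, 20, 30), ind = c("a", "b", "c"))

df$values       # the $ already gives you a plain vector
df$values[2]    # one index is enough: returns 20
df$values[2, ]  # Error in df$values[2, ] : incorrect number of dimensions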
I'm dealing with a data frame that I called GBM, which contains single-cell measurements. I'm relying on the SCnorm package to handle the normalization and to run a preliminary check of my data, using the plotCountDepth function.
This is my pipeline:
sce <- SingleCellExperiment::SingleCellExperiment(assays = list('counts' = GBM))
sce <- plotCountDepth(Data = sce,
                      Conditions = Label,
                      FilterCellProportion = .1,
                      NCores = 3)
I do not really understand why I keep getting this error:
Error in colSums(Data[, which(Conditions == Levels[x])]) :
'x' must be an array of at least two dimensions
even though I'm applying the same criteria that I found on Bioconductor.
For more background: Label is a vector matching the dimensions of GBM, which is a G x S matrix, and contains labels that distinguish each cell group.
Thank you in advance
PS: GBM is a matrix whose columns are named after the various cells, while the rows are of course the genes.
As the vignette states:
Data: can be a matrix of single-cell expression with cells where rows
are genes and columns are samples. Gene names should not be a column
in this matrix, but should be assigned to rownames(Data).
Below I provide a minimum working example and I suggest you check whether you specified the rownames correctly:
library(SingleCellExperiment)
library(SCnorm)
GBM = matrix(rpois(10000,20),ncol=50)
rownames(GBM) = paste0("Gene",1:200)
colnames(GBM) = paste0("Sample",1:50)
Label=rep(c("X","Y"),each=25)
sce <- SingleCellExperiment(assays = list('counts' = GBM))
This function works, but it is not very well written: it prints out the ggplot object, yet there is no way of storing it:
plt <- plotCountDepth(Data = sce, Conditions = Label,
                      FilterCellProportion = .1, NCores = 3)
I am trying to run a wilcox.test() on two subsets of data from a data frame. They are not of equal length (48 vs. 260). I want to see if there is a difference between the dbh (diameter at breast height) of live oak trees and water oak trees.
Pine_stand <- read.csv("Pine_stand.csv")
live_oaks <- subset(Pine_stand,Species=="live oak",select=c("dbh"));live_oaks
water_oaks <- subset(Pine_stand,Species=="water oak",select=c("dbh"));water_oaks
wilcox.test(live_oaks~water_oaks,conf.int=T,correct=F)
Error in model.frame.default(formula = live_oaks ~ water_oaks) :
invalid type (list) for variable 'live_oaks'
That was my first attempt; then I tried this:
Pine_stand <- read.csv("Pine_stand.csv")
live_dbh <- subset(Pine_stand,Species=="live oak",select=c("dbh"));live_oaks
water_dbh <- subset(Pine_stand,Species=="water oak",select=c("dbh"));water_oaks
oaks<-c(live_dbh,water_dbh)
wilcox.test(dbh~Species,data=oaks)
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 48, 260
and received that error. I have tried vectorizing the two groups, appending, tapply ... I know there is a simple answer I am overlooking; I just can't get it to work. All of the examples I am reading compare two vectors of the same length. I know I can do the Wilcoxon test by hand when the group sizes differ, so there should be a way. Any advice is welcome.
Yes, you can run wilcox.test on samples of different lengths. As stated at http://www.r-tutor.com/elementary-statistics/non-parametric-methods/mann-whitney-wilcoxon-test:
“Using the Mann-Whitney-Wilcoxon Test, we can decide whether the
population distributions are identical without assuming them to follow
the normal distribution.”
It is therefore a non-parametric equivalent of the t-test that we can use when the assumptions of the t-test are not met (for example, the distribution is not normal or the variances of the two samples are not equal).
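As a quick illustration that wilcox.test accepts two samples of different sizes (simulated data, purely for demonstration):

set.seed(1)
live_sim  <- rnorm(48,  mean = 40, sd = 8)   # pretend dbh values, n = 48
water_sim <- rnorm(260, mean = 42, sd = 8)   # pretend dbh values, n = 260
wilcox.test(live_sim, water_sim)             # unequal group sizes are fine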
The problem in your code is that with these two statements:
live_dbh <- subset(Pine_stand,Species=="live oak",select=c("dbh"))
water_dbh <- subset(Pine_stand,Species=="water oak",select=c("dbh"))
you are creating two objects that contain only the dbh values, but you lose the information about the labels (Species). Therefore you should write:
live_dbh <- subset(Pine_stand, Species=="live oak", select=c("dbh", "Species"))
water_dbh <- subset(Pine_stand, Species=="water oak", select=c("dbh", "Species"))
Secondly, when you try to merge the two sets with this code:
oaks<-c(live_dbh,water_dbh)
instead of creating a data frame you create a list. Why does that happen? As the documentation for c() says, its name stands for "Combine Values into a Vector or List". You have probably already used it to merge two vectors into one. However, subset() actually returns a one-column data frame rather than a vector, so our live_dbh and water_dbh sets are data frames (and, now that they carry the label, they even have two columns).
With one-column data frames you could still use c() with the recursive parameter set to TRUE to merge them:
total<-c(one_column_df1, one_column_df2, recursive=TRUE)
However, it is usually safer to use the rbind() function (and it is also the only one of the two that works when merging data frames with more than one column); rbind stands for row bind.
oaks<-rbind(live_dbh,water_dbh)
Now you should be able to run a wilcox.test:
wilcox.test(dbh~Species,data=oaks)
How about
wilcox.test(dbh ~ Species, data = Pine_stand,
            subset = (Species %in% c("live oak", "water oak")))
? (If these are the only two species in your data set, you don't need the subset argument.)
I am creating a correlation matrix, and via the findCorrelation() function from the caret package I am identifying parameters that have a correlation with another parameter higher than 0.75.
After that I am removing the correlated parameters coming out of the findCorrelation command.
highlyCorrelated <- findCorrelation(correlationMatrix,cutoff=(0.75),verbose = FALSE)
correlated_var=colnames(data[,highlyCorrelated])
data.dat <- data[!(names(data) %in% c(correlated_var))]
For completeness' sake, when presenting results later I want to list which parameters were removed, and also because of which correlation.
Is there a way to generate a data frame that contains in the first column the removed parameter, and in the following columns the parameter(s) that that specific parameter was correlated to?
I can call upon certain correlations by using:
correlationMatrix[correlationMatrix[x,]>0.75,x]
where x is the index of a parameter that correlates above 0.75 with other parameter(s). But I am not sure how to turn this into a data frame or table in order to present the findings.
Help is much appreciated!
Regards,
Eddy
I got somewhere using the packages plyr and rowr:
cor.table <- matrix(, nrow = 0, ncol = 0)
for (i in sort(highlyCorrelated)) {
  cor.table.i <- c(paste(colnames(correlationMatrix)[c(i)], ":"),
                   paste(names(correlationMatrix[abs(correlationMatrix[i,]) > 0.75, i])))
  cor.table <- cbind.fill(cor.table, cor.table.i, fill = NA)
}
cor.table <- t(cor.table[c(-1)])
It's a bit of a workaround, and maybe not the prettiest, but at least I get something I can export.
I can't get rid of the fact that the output lists each parameter as correlated with itself, for some reason.
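The self-correlation probably shows up because the diagonal of the correlation matrix is 1, which always passes the 0.75 cutoff. A possible tweak (just a sketch, using the same objects and rowr::cbind.fill as above) is to drop column i before filtering:

cor.table <- matrix(, nrow = 0, ncol = 0)
for (i in sort(highlyCorrelated)) {
  # drop column i so the parameter's correlation with itself is not reported
  partners <- names(which(abs(correlationMatrix[i, -i]) > 0.75))
  cor.table.i <- c(paste(colnames(correlationMatrix)[i], ":"), partners)
  cor.table <- cbind.fill(cor.table, cor.table.i, fill = NA)
}
cor.table <- t(cor.table[c(-1)])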
I wrote a function in R that parses values from a data frame and outputs the old data frame plus a new column with statistics computed from each row.
I get the following warning:
Warning message:
In `[[.data.frame`(xx, sxx[j]) :
named arguments other than 'exact' are discouraged
I am not sure what this means, to be honest. I did spot checks on the results and they seem OK to me.
The function itself is quite long, I will post it if needed to better answer the question.
Edit:
This is a sample dataframe:
my_df <- data.frame('ALT' = c('A,C', 'A,G'),
                    'Sample1' = c('1/1:35,3,0,35,3,35:1:1:0:0,1,0', './.:0,0,0,0,0,0:0:0:0:0,0,0'),
                    'Sample2' = c('2/2:188,188,188,33,33,0:11:11:0:0,0,11', '1/1:255,99,0,255,99,255:33:33:0:0,33,0'),
                    'Sample3' = c('1/1:219,69,0,219,69,219:23:23:0:0,23,0', '0/1:36,0,78,48,87,120:7:3:0:4,3,0'))
And this is the function:
multi_allelic_filter_v2 <- function(in_vcf, start_col, end_col, threshold = 1) {
  # input: must have gone through biallelic_assessment first
  table0 <- in_vcf
  # ALT_alleles is the number of ALT alleles with coverage > threshold across samples

  # The following function calculates coverage across samples for a single allele
  single_allele_tot_cov_count <- function(list_of_unparsed_cov, allele_pos) {
    single_allele_coverage_count <- 0
    for (i in 1:length(list_of_unparsed_cov)) {  # i is each group of coverages/sample
      single_allele_coverage_count <- single_allele_coverage_count +
        as.numeric(strsplit(as.character(list_of_unparsed_cov[i]),
                            split = ',')[[1]])[allele_pos]
    }
    return(single_allele_coverage_count)
  }

  # Single-row function:
  # we need to iterate over each ALT allele in the row
  single_row_assessment <- function(single_row) {
    # number of alternative alleles over threshold
    alt_alleles0 <- 0
    if (single_row$is_biallelic == TRUE) {
      alt_alleles0 <- 1
    } else {
      alt_coverages <- numeric()       # coverages across samples of each ALT allele
      altcovs_unparsed <- character()  # unparsed coverages from each sample
      for (i in start_col:end_col) {
        # fill altcovs_unparsed
        altcovs_unparsed <- c(altcovs_unparsed,
                              strsplit(x = as.character(single_row[1, i]), split = ':')[[1]][6])
      }
      # now calculate alt_coverages
      for (i in 1:lengths(strsplit(as.character(single_row$ALT), ',', fixed = TRUE))) {
        alt_coverages <- c(alt_coverages,
                           single_allele_tot_cov_count(list_of_unparsed_cov = altcovs_unparsed,
                                                       allele_pos = i + 1))
      }
      # count how many ALT alleles are over threshold
      alt_alleles0 <- sum(alt_coverages > threshold)
    }
    return(alt_alleles0)
  }

  # Now iterate across each row:
  # ALT_alleles is the number of ALT alleles with coverage > threshold across samples
  table0$ALT_alleles <- -99  # just a marker, to make sure the function works
  for (i in 1:nrow(table0)) {
    table0[i, 'ALT_alleles'] <- single_row_assessment(single_row = table0[i, ])
  }
  # Now we know how many ALT alleles >= threshold coverage are in each SNP
  return(table0)
}
Basically, in the following line:
'1/1:219,69,0,219,69,219:23:23:0:0,23,0'
fields are separated by ":", and I am interested in the last two numbers of the last field (23 and 0); in each row I want to sum all the numbers in those positions (two separate sums), and output how many of the "sums" are over a threshold. Hope it makes sense...
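To illustrate that parsing step on one such value (a standalone sketch, separate from the function above):

geno <- '1/1:219,69,0,219,69,219:23:23:0:0,23,0'

fields    <- strsplit(geno, split = ':')[[1]]              # split on ":"
last      <- fields[length(fields)]                        # "0,23,0"
coverages <- as.numeric(strsplit(last, split = ',')[[1]])  # 0 23 0

# the first number is skipped (the function above starts at allele_pos = i + 1);
# the remaining ones (23 and 0) are the values I want to sum across samples
alt_cov <- coverages[-1]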
OK, I re-ran the script with the same dataset on the same computer (same project, then a new project), then ran it again on a different computer, and could not reproduce the warnings in any case. I am not sure what happened, and the results seem correct. Never mind. Thanks anyway for the comments and advice.
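For reference, as far as I can tell the warning itself is issued by `[[.data.frame` whenever `[[` on a data frame receives a named argument other than exact. A tiny, made-up reproduction (not necessarily what triggered it in my function):

df <- data.frame(a = 1:3, b = 4:6)

df[["a"]]                 # fine: positional argument
df[["a", exact = TRUE]]   # fine: 'exact' is the one allowed named argument
df[[i = "a"]]             # warns: named arguments other than 'exact' are discouraged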
I am trying to compare mean similarity between 3 subsets of data using the com.sim function (simba-package), but I’m having trouble getting the function to ignore missing values and correctly run the analysis.
Some background on my data and what I’ve done so far: My data is binary, but unlike the kinds of data for which the function is written, I am working with skeletal remains, which are typically incomplete and fragmented. Thus, ~10% of my data matrix has missing values.
When I run this command in R
com.sim(mydata, subs, simil = "jaccard", binary = TRUE, permutations = 1000, alpha = 0.05, bonfc = TRUE)
I get the following error message:
Error in diffmean(as.numeric(sim(veg[subs == (comb[x, 1]), ], method = simil)), :
There are NA values. Consider setting na.rm accordingly
I subsequently modified the code of the function to the following (my modification is the added na.rm = TRUE argument):
if (binary) {
tmp <- lapply(c(1:nrow(comb)), function(x) diffmean(as.numeric(sim(veg[subs ==
(comb[x, 1]), ], method = simil,)), as.numeric(sim(veg[subs ==
(comb[x, 2]), ], method = simil, )), na.rm = TRUE))
Now the function runs, but it excludes all cases with at least one missing value (which is nearly half the data set!). It seems to be deleting cases with NA listwise, whereas I'd prefer pairwise deletion, so that similarity coefficients can still be calculated between cases with missing values (just excluding the variables with NA from each pairwise calculation). Is there any way to accomplish this within com.sim? I know other functions such as simil (proxy package) can handle missing values when calculating a matrix of Jaccard coefficients, but it seems that the sim functions in simba weren't built this way.
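To make the pairwise-deletion idea concrete, this is roughly the kind of calculation I mean (a rough sketch I put together, not simba code; the function names are made up):

# Jaccard similarity between two binary vectors, ignoring positions
# where either vector is NA (i.e. pairwise deletion of variables)
jaccard_pairwise <- function(a, b) {
  keep <- !is.na(a) & !is.na(b)      # variables scored in both cases
  a <- a[keep]; b <- b[keep]
  shared <- sum(a == 1 & b == 1)     # traits present in both cases
  either <- sum(a == 1 | b == 1)     # traits present in at least one case
  if (either == 0) return(NA)        # nothing informative left for this pair
  shared / either
}

# similarity matrix over all pairs of cases (rows) in a binary matrix with NAs
jaccard_matrix <- function(m) {
  n <- nrow(m)
  out <- matrix(NA, n, n, dimnames = list(rownames(m), rownames(m)))
  for (i in 1:n)
    for (j in 1:n)
      out[i, j] <- jaccard_pairwise(m[i, ], m[j, ])
  out
}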
I have zero coding experience (is it obvious?), so I would appreciate any help or advice on options to pursue!
Thank you very much, and please let me know if I can provide additional information.
Best,
Matt