T-test to find differentially expressed genes in R

I have a matrix of mRNA expression values, with genes (probe IDs) in rows and samples in columns.
ID_REF GSM362168 GSM362169 GSM362170 GSM362171 GSM362172 GSM362173 GSM362174
244901_at 5.171072 5.207896 5.191145 5.067809 5.010239 5.556884 4.879528
244902_at 5.296012 5.460796 5.419633 5.440318 5.234789 7.567894 6.908795
I wanted to find the differentially expressed genes in the matrix using a t-test, so I ran the following:
stat=mt.teststat(control,classlabel,test="t",na=.mt.naNUM,nonpara="n")
and I got the following error:
Error in is.factor(classlabel) : object 'classlabel' not found.
I am not sure how to assign the class labels. Is this the right way to find differentially expressed genes?
The documentation says that classlabel should be "a vector of integers corresponding to observation (column) class labels", but I do not understand what that means.

If you open the documentation for mt.teststat:
?mt.teststat
and scroll down to the end, you'll see an example using the "Golub data":
data(golub)
teststat <- mt.teststat(golub, golub.cl)
If you look at golub.cl, it will become clear what the classlabel vector should look like:
golub.cl
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
In this case, 0 or 1 are labels for two classes of sample. There should be as many values in the vector as you have samples, in the same order that the samples appear in the data matrix. You can also look at:
?golub
golub.cl: numeric vector indicating the tumor class, 27 acute
lymphoblastic leukemia (ALL) cases (code 0) and 11 acute
myeloid leukemia (AML) cases (code 1).
So you need to create a similar vector, with labels (0, 1, ...) for however many classes you have for your own data.
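As a minimal sketch of what that could look like for the seven-sample matrix in the question: the split into four class-0 and three class-1 samples below is purely an assumption for illustration (the question does not say which GSM samples belong to which group). The per-gene Welch t statistic is computed with base R's t.test so that the example runs without multtest; the same classlabel vector is what would be passed as the second argument of mt.teststat.

```r
# Expression matrix: genes in rows, samples in columns (values from the question)
expr <- matrix(
  c(5.171072, 5.207896, 5.191145, 5.067809, 5.010239, 5.556884, 4.879528,
    5.296012, 5.460796, 5.419633, 5.440318, 5.234789, 7.567894, 6.908795),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("244901_at", "244902_at"), paste0("GSM", 362168:362174))
)

# ASSUMPTION (hypothetical grouping): first 4 samples are class 0, last 3 are class 1
classlabel <- c(0, 0, 0, 0, 1, 1, 1)

# One label per column, in the same order as the columns
stopifnot(length(classlabel) == ncol(expr))

# Welch two-sample t statistic for each gene (class 1 versus class 0), base R only
tstat <- apply(expr, 1, function(x) {
  unname(t.test(x[classlabel == 1], x[classlabel == 0])$statistic)
})
print(tstat)

# If multtest is installed, the same label vector goes straight into mt.teststat
if (requireNamespace("multtest", quietly = TRUE)) {
  stat <- multtest::mt.teststat(expr, classlabel, test = "t")
  print(stat)
}
```

With real data, the 0/1 pattern has to follow the actual experimental design, for example taken from the sample annotation of the GEO series.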

Related

Test to compare proportions / paired (small) samples / 7-levels categorical variables

I'm working on data from a pre-post survey: the same participants were asked the same questions at 2 different times (so the samples are not independent). I have 19 categorical variables (Likert scale with 7 levels).
For each question, I want to know if there is a significant difference between the "pre" and "post" answers. To do this, I want to compare proportions in each of the 7 categories between pre and post results.
I have two data sets (one 'pre' and one 'post'), which I have merged as in the following example (I've made sure that the categorical variables have the same levels for PRE and POST):
prepost <- data.frame(ID = c(1:7),
Quest1_PRE = c('5_SomeA','1_StronglyD','3_SomeD','4_Neither','6_Agree','2_Disagree','7_StronglyA'),
Quest1_POST = c('1_StronglyD','7_StronglyA','6_Agree','7_StronglyA','3_SomeD','5_SomeA','7_StronglyA'))
I tried to perform a McNemar test:
temp <- table(prepost_S1$Quest1_PRE,prepost_S1$Quest1_POST)
mcnemar.test(temp)
> McNemar's Chi-squared test
data: temp
McNemar's chi-squared = NaN, df = 21, p-value = NA
But whatever the question, the test always returns NA values. I think it is because the pivot table (temp) has very low frequencies (I only have 24 participants).
One example of a pivot table (24 participants):
> temp
            1_StronglyD 2_Disagree 3_SomeD 4_Neither 5_SomeA 6_Agree 7_StronglyA
1_StronglyD           0          0       0         0       0       1           0
2_Disagree            0          0       0         0       1       0           0
3_SomeD               0          0       0         0       0       1           1
4_Neither             0          0       1         1       2       2           2
5_SomeA               0          0       0         0       1       1           2
6_Agree               0          0       0         0       0       3           2
7_StronglyA           0          0       0         0       0       1           2
I've tried aggregating the variables' levels into 5 instead of 7 ("1_Disagree", "2_SomeD", "3_Neither", "4_SomeA", "5_Agree") but it still doesn't work.
Is there an equivalent of Fisher's exact test for paired samples? I've done some research but couldn't find anything helpful.
If not, could you think of any other test that could answer my question (= Do the answers differ significantly between the pre and post survey)?
Thanks!
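On the NaN result: for a table larger than 2 x 2, mcnemar.test computes a symmetry statistic that sums (n_ij - n_ji)^2 / (n_ij + n_ji) over all off-diagonal pairs of categories, so any pair that nobody moved between in either direction contributes 0/0. The sketch below rebuilds the example table from the question and reproduces that calculation by hand. It only illustrates where the NaN comes from and is not a recommendation of a particular alternative test.

```r
lev <- c("1_StronglyD", "2_Disagree", "3_SomeD", "4_Neither",
         "5_SomeA", "6_Agree", "7_StronglyA")

# Example pivot table from the question (rows = PRE, columns = POST)
temp <- matrix(c(0, 0, 0, 0, 0, 1, 0,
                 0, 0, 0, 0, 1, 0, 0,
                 0, 0, 0, 0, 0, 1, 1,
                 0, 0, 1, 1, 2, 2, 2,
                 0, 0, 0, 0, 1, 1, 2,
                 0, 0, 0, 0, 0, 3, 2,
                 0, 0, 0, 0, 0, 1, 2),
               nrow = 7, byrow = TRUE,
               dimnames = list(PRE = lev, POST = lev))

up <- upper.tri(temp)
pair_total <- temp[up] + t(temp)[up]  # n_ij + n_ji for each pair i < j
pair_diff  <- temp[up] - t(temp)[up]  # n_ij - n_ji for each pair i < j

# Pairs with no observations in either direction give 0/0 = NaN
print(sum(pair_total == 0))
stat_by_hand <- sum(pair_diff^2 / pair_total)
print(stat_by_hand)

# The built-in test reports the same NaN, with df = 7 * 6 / 2 = 21
res <- mcnemar.test(temp)
print(res)
```

Merging levels reduces the number of empty pairs, but the statistic stays NaN as long as at least one pair of categories has no observations in either direction.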

How to import and transform adjacency matrix to R edge list?

A sample of my data can be seen below. The data contains information about ties between organizations (over 2000 organizations, the csv file has 0s and 1s, and empty cells)
      A2654 B0004 B0188 B1278 B1372 B1722 B2503
A2654     0     1     0     0     0     1     0
B0004     1     0     0     0     0     1     0
B0188     0     0     0     0     0     0     0
B1278     0     0     0     0     0     0     0
B1372     0     0     0     0     0     0     0
B1722     1     1     0     0     0     0     0
(1) The first problem is that I can't import this data (.csv) into R properly.
I run the following code:
dt <- read_csv2("Org_ties.csv")
In the csv file the header of the first column is left empty (as it should be), but when the file is read into R, read_csv2() labels this column "X1". I then run the next line to produce a graph:
g=graph_from_adjacency_matrix(dtmtrx, mode="directed", weighted = T)
However, I get the error message below. I think it is because the file is not being read properly.
graph.adjacency.dense(adjmatrix, mode = mode, weighted = weighted, :
not a square matrix
In addition: Warning message:
In mde(x) : NAs introduced by coercion
(2) The other problem is that I cannot seem to transform the current data structure into an edge list. How can I do that? The edge list should look something like this:
V1 V2 weight
A2654 B0004 1
A2654 B0188 0
A2654 B1278 0
A2654 B1372 0
A2654 B1722 1
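A minimal base R sketch of both steps, under several assumptions: the file is semicolon-separated (as the use of read_csv2 suggests), its first column holds the organization IDs, and empty cells mean "no tie". The 3 x 3 csv_text below is a hypothetical stand-in for Org_ties.csv, built from the first three organizations in the sample. Reading the first column as row names rather than as data keeps the matrix square and numeric, which is what graph_from_adjacency_matrix expects.

```r
# Hypothetical stand-in for the real file (first header cell is empty)
csv_text <- ";A2654;B0004;B0188
A2654;0;1;0
B0004;1;0;0
B0188;0;0;0"

# row.names = 1 turns the first column into row names instead of a data column
adj <- as.matrix(read.csv2(text = csv_text, row.names = 1, check.names = FALSE))

# ASSUMPTION: empty cells (read as NA) mean "no tie"
adj[is.na(adj)] <- 0

# The matrix must be square, with rows and columns in the same order
stopifnot(nrow(adj) == ncol(adj), identical(rownames(adj), colnames(adj)))

# Long format: one row per (from, to) pair, keeping zero weights as in the question
edges <- as.data.frame(as.table(adj), stringsAsFactors = FALSE)
names(edges) <- c("V1", "V2", "weight")
edges <- edges[edges$V1 != edges$V2, ]          # drop self-pairs
edges <- edges[order(edges$V1, edges$V2), ]     # sort by source, then target
rownames(edges) <- NULL
print(edges)
```

For the real file, read.csv2("Org_ties.csv", row.names = 1, check.names = FALSE) would take the place of the text = argument. If only existing ties are wanted, filtering on edges$weight != 0 shrinks the list considerably for 2000 organizations.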

Cannot get completed dataset using imputeMCA

I use the missMDA package to impute multiple categorical columns. However, I cannot get any result from these two functions: estim_ncpMCA(test_fill) and imputeMCA(test_fill). The program keeps running without showing a progress bar or returning any results.
This is a sample of the dataset.
Hybrid      G1 G5 G8 G9 G10
P1000:P2030  0 -1  0  1   0
P1006:P1384  0  0  0  0   1
P1006:P1401  0 NA NA NA   1
P1006:P1412  0  0  0  0   1
P1006:P1594  0  0  0  0   1
P1013:P1517  0  0  0  0   1
I am working on a genetics project in R. The dataset has 497 rows and 11,226 columns. Every row is the series of genetic markers for a particular hybrid, and every column is a genetic marker ("G1", "G2", etc.) with values 1, 0, -1 and NA. There are 746,433 missing values in total, and I am trying to fill them in with imputeMCA.
I also made some transformations on test_fill before running imputeMCA.
test_fill = as.matrix(test_fill)
test_fill[, -1] <- lapply(test_fill[, -1], as.numeric)
I am wondering whether this is caused by having too many columns in my dataset, and whether I need to transpose the rows and columns.
I don't know if you found your answer, but I think the functions don't run because of your first column, which seems to hold the labels of the individuals. You can exclude it from the analysis by subsetting the columns:
estim_ncpMCA(test_fill[,2:11226], ncp.max = 5)
imputeMCA(test_fill[,2:11226], ncp = X)
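Here X stands for the number of dimensions you choose, for example the value suggested by estim_ncpMCA. I hope this helps.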
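A hedged sketch of how the data could be prepared before those calls, using the six sample rows from the question. It assumes that test_fill is a data frame whose first column is the hybrid label, and that the marker columns should be factors, since imputeMCA is documented as taking a data frame of categorical variables. The missMDA calls themselves are left as comments because they need the package and the full data set; the names ncp_est, imputed and completed are placeholders.

```r
# Sample rows from the question, as a data frame
test_fill <- data.frame(
  Hybrid = c("P1000:P2030", "P1006:P1384", "P1006:P1401",
             "P1006:P1412", "P1006:P1594", "P1013:P1517"),
  G1  = c(0, 0, 0, 0, 0, 0),
  G5  = c(-1, 0, NA, 0, 0, 0),
  G8  = c(0, 0, NA, 0, 0, 0),
  G9  = c(1, 0, NA, 0, 0, 0),
  G10 = c(0, 1, 1, 1, 1, 1)
)

# Move the labels into row names so they are not treated as a variable
markers <- test_fill[, -1]
rownames(markers) <- test_fill$Hybrid

# Keep the data frame structure and convert every marker column to a factor
# (as.matrix() followed by lapply() would loop over single cells instead of columns)
markers[] <- lapply(markers, factor)
str(markers)

# With missMDA installed and the full data set, the calls would be along these lines:
# ncp_est   <- missMDA::estim_ncpMCA(markers, ncp.max = 5)
# imputed   <- missMDA::imputeMCA(markers, ncp = ncp_est$ncp)
# completed <- imputed$completeObs
```

With more than 11,000 columns the estimation step may still take a long time, so trying the calls on a subset of markers first is a cheap way to check that the input format is accepted.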

lpsolve API in R: Edit column

I'm using lpSolveAPI in R and would like to set coefficients for specific rows and columns (that is, the coefficient for a given constraint number and decision variable number).
However, while I can add a new column (a new decision variable) or set an existing column, I can't edit part of a column, because set.column removes all previous coefficients in that column.
For example, take 5 constraints and 2 decision variables:
lps.model <- make.lp(5, 2) #create lp model
#set coefficients for the first 3 constraint for both variables
for (i in seq(1,2)) set.column(lps.model, i, c(1,2,3), indices = c(1,2,3))
The model looks like this:
Model name:
            C1    C2    C3    C4
Minimize     0     0     0     0
R1           1     1     0     0  free  0
R2           2     2     0     0  free  0
R3           3     3     0     0  free  0
R4           0     0     0     0  free  0
R5           0     0     0     0  free  0
Kind       Std   Std   Std   Std
Type      Real  Real  Real  Real
Upper      Inf   Inf   Inf   Inf
Lower        0     0     0     0
Now I want to add coefficients for 4th and 5th constraints.
for (i in seq(1,2)) set.column(lps.model, i, c(4,5), indices = c(4,5))
This code overwrites the model, because set.column sets every coefficient not listed in its arguments to 0:
Model name:
            C1    C2
Minimize     0     0
R1           0     0  free  0
R2           0     0  free  0
R3           0     0  free  0
R4           4     4  free  0
R5           5     5  free  0
Kind       Std   Std
Type      Real  Real
Upper      Inf   Inf
Lower        0     0
I have a large matrix of constraints and decision variables and need to run similar loops for different sets of variables. Is there any way to edit existing columns without overwriting them?
You could use set.mat for this, which sets values in your A matrix one at a time. See ?set.mat.
For example:
> set.mat(lps.model, 4,5,3)
This sets the value in the 4th row, 5th column to 3 without overwriting anything else (the model needs at least 5 columns for this call; with the 5 x 2 model above the column index would have to be 1 or 2). So you can call set.mat within a double loop and change individual values.
However, since you say you have a large matrix of decision variables, it would be much more efficient to build whole columns first (preprocessing to create the list of coefficients) and then add them to lps.model in one shot using set.column.
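One way to sketch that preprocessing idea: keep the coefficients in an ordinary R matrix, fill it in as many passes as needed, and write each finished column into the model once. The names n_con, n_var and coefs are placeholders, and the lpSolveAPI calls are guarded so that the base R part runs even where the package is not installed.

```r
n_con <- 5  # number of constraints
n_var <- 2  # number of decision variables

# Plain matrix of coefficients: rows = constraints, columns = decision variables
coefs <- matrix(0, nrow = n_con, ncol = n_var)

# First pass: constraints 1 to 3 (the values are recycled down each column)
coefs[1:3, ] <- c(1, 2, 3)

# Second pass: constraints 4 and 5, without touching rows 1 to 3
coefs[4:5, ] <- c(4, 5)
print(coefs)

# Write each complete column into the model in one call
if (requireNamespace("lpSolveAPI", quietly = TRUE)) {
  lps.model <- lpSolveAPI::make.lp(n_con, n_var)
  for (j in seq_len(n_var)) {
    lpSolveAPI::set.column(lps.model, j, coefs[, j])
  }
  print(lps.model)
}
```

Because set.column replaces a whole column, the matrix acts as the single source of truth: later edits go into coefs first, and only the affected columns are written to the model again.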

Frequency Distribution Plot of Document Term Matrix

I have created a document term matrix that looks something like this:
inspect(dtm[1:4,1:6])
         allowed allowing almost alone companyunder companywide
Doc1.txt       1        1      1     0            1           0
Doc2.txt       0        1      1     0            1           1
Doc3.txt       0        0      0     1            0           1
Doc4.txt       1        0      1     0            1           1
After taking its column sums, I get:
colSums(dtm)
allowed 2
allowing 2
almost 3
alone 1
companyunder 3
companywide 3
This shows how many documents each word is found in (e.g., "allowed 2" means that "allowed" appears in two documents).
I'm having difficulty creating a frequency distribution plot with the document number on the x-axis and the number of words the document contains on the y-axis.
Is this what you're looking for?
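# Rebuild the example matrix: 4 documents (rows) x 6 terms (columns), filled column by column
dtm <- array(c(1,0,0,1,1,1,0,0,1,1,0,1,0,0,1,0,1,1,0,1,0,1,1,1), dim = c(4, 6))
dimnames(dtm) <- list(c("Doc1","Doc2","Doc3","Doc4"), c("allowed","allowing","almost","alone","companyunder","companywide"))
print(dtm)
# rowSums() gives the number of term occurrences in each document
plot(rowSums(dtm))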
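A slightly more labelled variant of the same idea, as a sketch: it uses a plain matrix like the one above, so a real DocumentTermMatrix from tm would probably need as.matrix(dtm) first (an assumption about how the object in the question was built). The name words_per_doc is a placeholder.

```r
# Same example data as a plain matrix: documents in rows, terms in columns
dtm <- matrix(c(1,0,0,1, 1,1,0,0, 1,1,0,1, 0,0,1,0, 1,1,0,1, 0,1,1,1),
              nrow = 4,
              dimnames = list(c("Doc1", "Doc2", "Doc3", "Doc4"),
                              c("allowed", "allowing", "almost", "alone",
                                "companyunder", "companywide")))

# Number of term occurrences in each document
words_per_doc <- rowSums(dtm)
print(words_per_doc)

# One bar per document, labelled with the document name
barplot(words_per_doc,
        xlab = "Document",
        ylab = "Number of words",
        main = "Words per document")
```

With 0/1 entries this counts distinct terms per document; with raw counts in the matrix it counts total word occurrences instead.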
