Cannot get completed dataset using imputeMCA - r

I use the missMDA package to impute missing values in multiple categorical columns. However, I cannot get any result from these two functions: estim_ncpMCA(test_fill) and imputeMCA(test_fill). Both calls keep running without showing a progress bar or returning a result.
This is a sample of the dataset.
Hybrid G1 G5 G8 G9 G10
P1000:P2030 0 -1 0 1 0
P1006:P1384 0 0 0 0 1
P1006:P1401 0 NA NA NA 1
P1006:P1412 0 0 0 0 1
P1006:P1594 0 0 0 0 1
P1013:P1517 0 0 0 0 1
I am working on a genetics project in R. The dataset has 497 rows and 11,226 columns. Every row is the genetic marker series for a particular hybrid, and every column is a genetic marker ("G1", "G2", etc.) with value 1, 0, -1 or NA. There are 746,433 missing values in total, and I am trying to fill them in with imputeMCA.
I also applied some transformations to test_fill before running imputeMCA.
test_fill = as.matrix(test_fill)
test_fill[, -1] <- lapply(test_fill[, -1], as.numeric)
I am wondering whether this is caused by having too many columns in my dataset, and whether I need to transpose the rows and columns.

I don't know if you found your answer, but I think the functions don't run because of your first column, which seems to hold the labels of the individuals. You can exclude it from the analysis:
estim_ncpMCA(test_fill[,2:11226], ncp.max = 5)
imputeMCA(test_fill[,2:11226], ncp = X)
Here X stands for the number of dimensions to use, for example the value suggested by estim_ncpMCA. I hope this helps.
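A minimal sketch of the preparation step, assuming (as the missMDA documentation describes) that imputeMCA works on a data frame of categorical variables stored as factors. The helper names markers_to_factors and impute_markers and the toy data are made up for illustration; only imputeMCA and its completeObs output come from the package.

```r
# Toy stand-in for test_fill: one ID column plus marker columns coded -1/0/1/NA.
test_fill <- data.frame(
  Hybrid = c("P1000:P2030", "P1006:P1384", "P1006:P1401", "P1006:P1412"),
  G1 = c(0, 0, 0, 0),
  G5 = c(-1, 0, NA, 0),
  G8 = c(0, 0, NA, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop the ID column and turn every marker into a factor,
# so the values are treated as categories rather than as numbers.
markers_to_factors <- function(df, id_col = 1) {
  markers <- df[, -id_col, drop = FALSE]
  markers[] <- lapply(markers, factor)
  markers
}

# Hypothetical wrapper around the missMDA call (needs the package installed).
# It is only defined here, not run on the toy data.
impute_markers <- function(df, ncp) {
  markers <- markers_to_factors(df)
  missMDA::imputeMCA(markers, ncp = ncp)$completeObs
}

markers <- markers_to_factors(test_fill)
str(markers)  # every remaining column should now be a factor
```

On the real data the call would then be impute_markers(test_fill, ncp = X), with X chosen as described above. With more than 11,000 marker columns the computation may still take a long time.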

Related

Test to compare proportions / paired (small) samples / 7-levels categorical variables

I'm working on data from a pre-post survey: the same participants were asked the same questions at 2 different times (so the samples are not independent). I have 19 categorical variables (Likert scale with 7 levels).
For each question, I want to know if there is a significant difference between the "pre" and "post" answer. To do this, I want to compare proportions in each of the 7 categories between pre and post results.
I have two datasets (one 'pre' and one 'post') which I have merged as in the following example (I've made sure that the categorical variables have the same levels for PRE and POST):
prepost <- data.frame(ID = c(1:7),
Quest1_PRE = c('5_SomeA','1_StronglyD','3_SomeD','4_Neither','6_Agree','2_Disagree','7_StronglyA'),
Quest1_POST = c('1_StronglyD','7_StronglyA','6_Agree','7_StronglyA','3_SomeD','5_SomeA','7_StronglyA'))
I tried to perform a McNemar test:
temp <- table(prepost_S1$Quest1_PRE,prepost_S1$Quest1_POST)
mcnemar.test(temp)
> McNemar's Chi-squared test
data: temp
McNemar's chi-squared = NaN, df = 21, p-value = NA
But whatever the question, the test always returns NA values. I think it is because the contingency table (temp) has very low frequencies (I only have 24 participants).
One example of such a table (I have 22 participants):
> temp
            1_StronglyD 2_Disagree 3_SomeD 4_Neither 5_SomeA 6_Agree 7_StronglyA
1_StronglyD           0          0       0         0       0       1           0
2_Disagree            0          0       0         0       1       0           0
3_SomeD               0          0       0         0       0       1           1
4_Neither             0          0       1         1       2       2           2
5_SomeA               0          0       0         0       1       1           2
6_Agree               0          0       0         0       0       3           2
7_StronglyA           0          0       0         0       0       1           2
I've tried aggregating the variables' levels into 5 instead of 7 ("1_Disagree", "2_SomeD", "3_Neither", "4_SomeA", "5_Agree") but it still doesn't work.
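A minimal sketch of where the NaN comes from, based on how base R's mcnemar.test handles tables larger than 2x2: the statistic is a sum over all pairs of categories (i, j) of (n_ij - n_ji)^2 / (n_ij + n_ji), so any pair whose two cells are both zero contributes 0/0. The matrix copies the example table above; the variable names are chosen for illustration.

```r
lv <- c("1_StronglyD", "2_Disagree", "3_SomeD", "4_Neither",
        "5_SomeA", "6_Agree", "7_StronglyA")

# The example contingency table: rows = PRE answers, columns = POST answers.
temp <- matrix(c(
  0, 0, 0, 0, 0, 1, 0,
  0, 0, 0, 0, 1, 0, 0,
  0, 0, 0, 0, 0, 1, 1,
  0, 0, 1, 1, 2, 2, 2,
  0, 0, 0, 0, 1, 1, 2,
  0, 0, 0, 0, 0, 3, 2,
  0, 0, 0, 0, 0, 1, 2
), nrow = 7, byrow = TRUE, dimnames = list(PRE = lv, POST = lv))

# Pair up each cell above the diagonal with its mirror cell below it.
upper <- temp[upper.tri(temp)]
lower <- t(temp)[upper.tri(temp)]

# One term per pair of categories; 0/0 gives NaN.
terms <- (upper - lower)^2 / (upper + lower)
any(is.nan(terms))

# The built-in test sums the same terms, so a single NaN term is enough
# to make the whole statistic NaN.
res <- mcnemar.test(temp)
res$statistic
res$parameter
```

Under this reading, merging levels only helps if it removes every pair of categories with no observations in either direction.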
Is there an equivalent of Fisher's exact test for paired samples? I've searched but couldn't find anything helpful.
If not, could you think of any other test that could answer my question (= Do the answers differ significantly between the pre and post survey)?
Thanks!

lpsolve API in R: Edit column

I'm using lpSolveAPI in R and would like to set coefficients for specified columns and rows (that is, the coefficient for a given constraint number and decision variable number).
However, while I can add a new column (a new decision variable) or set an existing column, I can't edit a column, because setting it removes all previous coefficients in that column.
For example, suppose there are 5 constraints and 2 decision variables. Then:
lps.model <- make.lp(5, 2) #create lp model
#set coefficients for the first 3 constraint for both variables
for (i in seq(1,2)) set.column(lps.model, i, c(1,2,3), indices = c(1,2,3))
The model looks like this:
Model name:
C1 C2 C3 C4
Minimize 0 0 0 0
R1 1 1 0 0 free 0
R2 2 2 0 0 free 0
R3 3 3 0 0 free 0
R4 0 0 0 0 free 0
R5 0 0 0 0 free 0
Kind Std Std Std Std
Type Real Real Real Real
Upper Inf Inf Inf Inf
Lower 0 0 0 0
Now I want to add coefficients for the 4th and 5th constraints.
for (i in seq(1,2)) set.column(lps.model, i, c(4,5), indices = c(4,5))
This overwrites the model, since set.column sets all coefficients that are not listed in its arguments to 0.
Model name:
C1 C2
Minimize 0 0
R1 0 0 free 0
R2 0 0 free 0
R3 0 0 free 0
R4 4 4 free 0
R5 5 5 free 0
Kind Std Std
Type Real Real
Upper Inf Inf
Lower 0 0
I have a big matrix of constraints and decision variables and need to run similar loops for different sets of variables. Is there any way to edit existing columns without overwriting them?
You could use set.mat for this: it sets values in your A matrix one at a time. See ?set.mat for details.
For example:
> set.mat(lps.model, 4,5,3)
This sets the value in the 4th row, 5th column to 3 without overwriting anything else, so you can call set.mat within a double loop and change individual values.
However, it would be much more efficient to build whole columns at a time (preprocessing to create the full list of coefficients) and then add them to lps.model in one shot using set.column, especially since you say you have a large matrix of decision variables.
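As a sketch of the set.mat approach on the 5 x 2 model from the question (assuming the lpSolveAPI package is installed; the loop indices are chosen to match the question's example):

```r
library(lpSolveAPI)

# 5 constraints, 2 decision variables, as in the question.
lps.model <- make.lp(5, 2)

# Coefficients 1, 2, 3 for constraints 1 to 3 in both columns.
for (j in 1:2) set.column(lps.model, j, c(1, 2, 3), indices = 1:3)

# Add coefficients for constraints 4 and 5 one cell at a time.
# set.mat(lprec, i, j, value) changes row i, column j only.
for (j in 1:2) {
  for (i in 4:5) {
    set.mat(lps.model, i, j, i)
  }
}

# The earlier coefficients are still in place.
get.mat(lps.model, 1, 1)
get.mat(lps.model, 5, 2)
```

For a large model, assembling each full column first and calling set.column once per column avoids the per-cell calls.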

Calculate mean value of subsets and store them in a vector for further analysis

Hullo, I've been working on a dataset for a while now, but am also kind of stuck. One question/answer here was already helpful, but I need to calculate the mean not for a single value, but sixty.
My dataset is basically this:
> data[c(1:5, 111:116), c(1:6, 85:87)]
plotcode block plot subsample year month Alo.pra Ant.odo Arr.ela
91 B1A01 B1 A01 1 2003 May 0 9 0
92 B1A02 B1 A02 1 2003 May 38 0 0
93 B1A03 B1 A03 1 2003 May 0 0 0
94 B1A04 B1 A04 1 2003 May 0 0 0
95 B1A05 B1 A05 1 2003 May 0 0 0
214 B2A16 B2 A16 2 2003 May 0 0 0
215 B2A17 B2 A17 2 2003 May 0 0 0
216 B2A18 B2 A18 2 2003 May 486 0 0
217 B2A19 B2 A19 2 2003 May 0 0 0
218 B2A20 B2 A20 2 2003 May 0 0 0
219 B2A21 B2 A21 2 2003 May 0 0 0
The first few columns are general data about the data point. Each plot has had up to 4 subsamples. The columns 85:144 are the data I want to calculate the means of.
I used this command:
tapply(data2003[,85] , as.factor(data2003$plotcode), mean, na.rm=T)
But like I said, I need to calculate the mean sixty times, for columns 85:144.
My idea was to use a for loop.
for (i in 85:144)
{
temp <- tapply(data2003[,i], data2003$plotcode, mean, na.rm=T)
mean.mass.2003 <- rbind(mean.mass.2003, temp)
}
But that doesn't work. I get multiple error messages, "number of columns of result is not a multiple of vector length (arg 2)".
What I basically want is a table in which the columns represent the species, with the rows as the plotcode and the actual entries in the fields being the respective means.
After some fiddling and some help, I got a version that works the way I wanted. I know it's a somewhat convoluted approach, but I only just started with R, so I like to understand what I code:
data.plots <- matrix(NA, 88, 60)  # a new, empty matrix we'll fill in the loop
for (i in 85:144)  # these numbers because that's where our relevant data is
{
  # tapply calculates the mean of the i-th column of data2003 over all rows
  # that share the same plotcode, ignoring NAs. temp holds one mean per plotcode.
  temp <- tapply(data2003[,i], data2003$plotcode, mean, na.rm=T)
  data.plots[,i-84] <- as.numeric(temp)  # stores these means as column i-84 of data.plots
}
colnames(data.plots) <- colnames(data[85:144])
# table() works like a count function: the first column of the resulting data frame
# holds the unique plotcodes, the second the frequency of each.
rownames(data.plots) <- as.data.frame(table(data$plotcode))[,1]
This works. For every unique entry in data2003$plotcode, tapply computes the mean biomass of one species and returns the means as a temporary named one-dimensional array (temp); the loop then fills the columns of the target matrix data.plots one after another.
After naming the rows and columns of data.plots I can work with it without always having to remember each name.
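A more compact variant of the loop above, as a sketch: sapply runs the same tapply call over each species column and binds the results into a matrix whose row and column names are filled in automatically. The toy data frame below is invented for illustration and uses only two species columns instead of 85:144.

```r
# Toy stand-in for data2003: two subsamples per plot, two species columns.
data2003 <- data.frame(
  plotcode = rep(c("B1A01", "B1A02", "B2A16"), each = 2),
  Alo.pra  = c(0, 38, 0, NA, 486, 0),
  Ant.odo  = c(9, 0, 0, 0, 0, 2)
)

# On the real data this would be names(data2003)[85:144].
species_cols <- c("Alo.pra", "Ant.odo")

# One tapply per species column; sapply collects the per-plot means
# into a matrix with plotcodes as rows and species as columns.
mean.mass.2003 <- sapply(
  data2003[species_cols],
  function(col) tapply(col, data2003$plotcode, mean, na.rm = TRUE)
)

mean.mass.2003
```

This avoids hard-coding the matrix dimensions (88 x 60), since the shape follows from the data.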

How to perform a repeated G.test in R?

I downloaded the R package RVAideMemoire in order to use the G.test.
> head(bio)
Date Trt Treated Control Dead DeadinC AliveinC
1 23Ap citol 1 3 1 0 13
2 23Ap cital 1 5 3 1 6
3 23Ap gerol 0 3 0 0 9
4 23Ap mix 0 5 0 0 8
5 23Ap cital 0 5 1 0 13
6 23Ap cella 0 5 0 1 4
So, I make subsets of the data to look at each treatment, because the G.test result will need to be pooled for each one.
datamix<-subset(bio, Trt=="mix")
head(datamix)
Date Trt Treated Control Dead DeadinC AliveinC
4 23Ap mix 0 5 0 0 8
8 23Ap mix 0 5 1 0 8
10 23Ap mix 0 2 3 0 5
20 23Ap mix 0 0 0 0 18
25 23Ap mix 0 2 1 0 15
28 23Ap mix 0 1 0 0 12
For G.test(x) to work when x is a matrix, x must consist of 2 columns of numbers, with 1 row per population. If I use the apply() function I can run the G.test on each row, provided my data set contains only two columns of numbers. I want to look only at Treated and Control, for example, but I'm not sure how to omit the other columns and the headers so that the G.test ignores them. I've tried the following but I get an error:
apply(datamix, 1, G.test)
Error in match.fun(FUN) : object 'G.test' not found
I have also thought about trying to use something like this rather than creating subsets.
by(bio, Trt, rowG.test)
The G.test spits out this, when you compare two numbers.
G-test for given probabilities
data: counts
G = 0.6796, df = 1, p-value = 0.4097
My other question is: is there some way to add up all the df and G values that I get for each row (once I'm able to get all these numbers) for each treatment? Is there also some way to have R report the G, df and p-values in a table, so they can be summed, rather than in the format above for each row?
Any help is hugely appreciated.
You're really close. This seems to work (hard to tell with such a small sample though).
by(bio,bio$Trt,function(x)G.test(as.matrix(x[,3:4])))
So first, the indices argument to by(...) (the second argument) is not evaluated in the context of bio, so you have to specify bio$Trt instead of just Trt.
Second, this will pass all the columns of bio, for each unique value of bio$Trt, to the function specified in the third argument. You need to extract only the two columns you want (columns 3 and 4).
Third, and this is a bit subtle, passing x[,3:4] to G.test(...) causes it to fail with an unintelligible error. Looking at the code, G.test(...) requires a matrix as its first argument, whereas x[,3:4] in the code above is a data.frame. So you need to convert with as.matrix(...).
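To get the statistic, df and p-value of each treatment into one table, the by(...) pattern above can be wrapped as in the following sketch. It assumes the test function returns a standard "htest" object with statistic, parameter and p.value components. So that the example runs without RVAideMemoire, base R's chisq.test is used as a stand-in; it is a different test from G.test. The helper name tabulate_tests and the counts are made up for illustration.

```r
# Invented counts, three rows (populations) per treatment.
bio <- data.frame(
  Trt     = c("mix", "mix", "mix", "cital", "cital", "cital"),
  Treated = c(12, 15, 9, 20, 14, 11),
  Control = c(10, 18, 14, 8, 16, 21)
)

# Hypothetical helper: run test_fun on the selected columns of each group
# and collect statistic, df and p-value in one data frame.
tabulate_tests <- function(data, group, cols, test_fun) {
  results <- by(data, group, function(x) test_fun(as.matrix(x[, cols])))
  groups <- dimnames(results)[[1]]
  rows <- lapply(seq_along(groups), function(i) {
    r <- results[[i]]
    data.frame(
      Trt       = groups[i],
      statistic = unname(r$statistic),
      df        = unname(r$parameter),
      p.value   = r$p.value
    )
  })
  do.call(rbind, rows)
}

# chisq.test is only a stand-in here; with RVAideMemoire loaded,
# G.test could be passed instead, assuming it returns an "htest" object.
res_table <- tabulate_tests(bio, bio$Trt, c("Treated", "Control"), chisq.test)
res_table
```

Columns of res_table can then be summed in the usual way, for example sum(res_table$df).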

T test to find differentially expressed genes in R

I have a matrix which contains the genes and their mRNA expression values.
ID_REF GSM362168 GSM362169 GSM362170 GSM362171 GSM362172 GSM362173 GSM362174
244901_at 5.171072 5.207896 5.191145 5.067809 5.010239 5.556884 4.879528
244902_at 5.296012 5.460796 5.419633 5.440318 5.234789 7.567894 6.908795
I wanted to find the differentially expressed genes from the matrix using a t-test, and I ran the following:
stat=mt.teststat(control,classlabel,test="t",na=.mt.naNUM,nonpara="n")
and I get the following error
Error in is.factor(classlabel) : object 'classlabel' not found.
I am not sure how I have to assign the class labels. Is this the right way to find the differentially expressed genes?
The classlabel should be a vector of integers corresponding to observation (column) class labels. I do not understand what that is.
If you open the documentation for mt.teststat:
?mt.teststat
and scroll down to the end, you'll see an example using the "Golub data":
data(golub)
teststat <- mt.teststat(golub, golub.cl)
If you look at golub.cl, it will become clear what the classlabel vector should look like:
golub.cl
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
In this case, 0 or 1 are labels for two classes of sample. There should be as many values in the vector as you have samples, in the same order that the samples appear in the data matrix. You can also look at:
?golub
golub.cl: numeric vector indicating the tumor class, 27 acute
lymphoblastic leukemia (ALL) cases (code 0) and 11 acute
myeloid leukemia (AML) cases (code 1).
So you need to create a similar vector, with labels (0, 1, ...) for however many classes you have for your own data.
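A minimal sketch for the matrix in the question. Which samples are controls is not stated there, so the split below (first four samples in class 0, last three in class 1) is purely an assumption for illustration. The probe IDs are used as row names so that the matrix itself is numeric.

```r
# The two rows shown in the question, with ID_REF moved into the row names.
expr <- matrix(
  c(5.171072, 5.207896, 5.191145, 5.067809, 5.010239, 5.556884, 4.879528,
    5.296012, 5.460796, 5.419633, 5.440318, 5.234789, 7.567894, 6.908795),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("244901_at", "244902_at"),
                  paste0("GSM", 362168:362174))
)

# ASSUMPTION: samples 1-4 belong to class 0, samples 5-7 to class 1.
# One label per column, in the same order as the columns of expr.
classlabel <- c(0, 0, 0, 0, 1, 1, 1)
stopifnot(length(classlabel) == ncol(expr))

# Run the test only if the Bioconductor package multtest is installed.
if (requireNamespace("multtest", quietly = TRUE)) {
  stat <- multtest::mt.teststat(expr, classlabel, test = "t")
  print(stat)
}
```

With real data, the labels should come from the experimental design (for example the sample annotation of the GEO series) rather than from column position.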
