ANOVA with multiple factors not working - r

I am trying to analyze my data using the two factors of depth and date of sampling. The code I used and the levels it produced are listed below.
date <- factor(Amount_allSampling_Depth$date)
depth <- factor(Amount_allSampling_Depth$depth)
depth_date <- date:depth
levels(depth_date)
[1] "2016-08-08:5" "2016-08-08:10" "2016-08-08:15" "2016-08-08:20" "2016-08-15:5"
[6] "2016-08-15:10" "2016-08-15:15" "2016-08-15:20" "2016-08-22:5" "2016-08-22:10"
[11] "2016-08-22:15" "2016-08-22:20" "2016-08-29:5" "2016-08-29:10" "2016-08-29:15"
[16] "2016-08-29:20" "2016-09-05:5" "2016-09-05:10" "2016-09-05:15" "2016-09-05:20"
[21] "2016-09-12:5" "2016-09-12:10" "2016-09-12:15" "2016-09-12:20" "2016-09-19:5"
[26] "2016-09-19:10" "2016-09-19:15" "2016-09-19:20" "2016-10-03:5" "2016-10-03:10"
[31] "2016-10-03:15" "2016-10-03:20" "2016-10-10:5" "2016-10-10:10" "2016-10-10:15"
[36] "2016-10-10:20" "2016-10-17:5" "2016-10-17:10" "2016-10-17:15" "2016-10-17:20"
[41] "2016-10-24:5" "2016-10-24:10" "2016-10-24:15" "2016-10-24:20"
When I try to run an ANOVA on one of my classes of data, e.g. Dinoflagellates:
Dinoflagellates <- Amount_allSampling_Depth$Dinoflagellates
anova(lm(Dinoflagellates~depth_date))
I got the warning message:
ANOVA F-tests on an essentially perfect fit are unreliable.
Could someone help me figure out how to make this work so that I can run multiple-factor analyses?
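For what it's worth, here is a minimal sketch of a two-factor model on made-up data (only the variable names mirror the question). Fitting the factors separately with date * depth gives both main effects plus the interaction; note that the "essentially perfect fit" warning typically means the model has as many parameters as observations, so the interaction F-test needs replicate samples within each date-by-depth cell:

```r
set.seed(1)
# Simulated stand-in: 3 replicates per date x depth combination
dat <- expand.grid(date  = factor(c("2016-08-08", "2016-08-15")),
                   depth = factor(c(5, 10, 15, 20)),
                   rep   = 1:3)
dat$Dinoflagellates <- rnorm(nrow(dat), mean = 10)

# Two-way ANOVA: main effects of date and depth plus their interaction
fit <- lm(Dinoflagellates ~ date * depth, data = dat)
anova(fit)  # rows: date, depth, date:depth, Residuals
```

With only one observation per combination, the Residuals row would have 0 degrees of freedom and the F-tests become meaningless, which is what the warning is flagging.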


reorder a 1 dimensional dataframe based on the column order of a larger dataframe (R)

relevant_ods_reordered <- relevant_ods[names(cpm)]
The above seeks to reorder the columns of the dataframe relevant_ods:
Plate1_DMSO_A01 Plate1_DMSO_B01 Plate1_DMSO_C01 Plate1_Lopinavir_D01
OD595 0.431 0.4495 0.4993 0.5785
Plate1_DMSO_E01 Plate1_DMSO_F01 Plate1_DMSO_G01 Plate1_DMSO_H01
OD595 0.5336 0.5133 0.527 0.5413
Plate1_DMSO_C12 Plate1_DMSO_D12 Plate1_Lopinavir_E12 Plate1_DMSO_F12
OD595 0.4137 0.4274 0.5241 0.4264
Plate1_DMSO_G12 Plate1_DMSO_H12
OD595 0.4561 0.4767
to match the order of the columns in a significantly larger dataframe:
[1] "Plate1_DMSO_A01" "Plate1_DMSO_A12"
[3] "Plate1_DMSO_B01" "Plate1_DMSO_B12"
[5] "Plate1_DMSO_C01" "Plate1_DMSO_C12"
[7] "Plate1_DMSO_D12" "Plate1_DMSO_E01"
[9] "Plate1_DMSO_F01" "Plate1_DMSO_F12"
[11] "Plate1_DMSO_G01" "Plate1_DMSO_G12"
[13] "Plate1_DMSO_H01" "Plate1_DMSO_H12"
[15] "Plate1_Lopinavir_D01" "Plate1_Lopinavir_E12"
[17] "Plate1_NS1519_22009_A02" "Plate1_NS1519_22009_A04"
[19] "Plate1_NS1519_22009_A05" "Plate1_NS1519_22009_A06"
[21] "Plate1_NS1519_22009_A07" "Plate1_NS1519_22009_A08"
[23] "Plate1_NS1519_22009_A09" "Plate1_NS1519_22009_A10"
[25] "Plate1_NS1519_22009_A11" "Plate1_NS1519_22009_B02"
[27] "Plate1_NS1519_22009_B03" "Plate1_NS1519_22009_B04"
[29] "Plate1_NS1519_22009_B05" "Plate1_NS1519_22009_B06"
etc.
Clearly, this returns the error
Error in `[.data.frame`(relevant_ods, names(cpm)) :
undefined columns selected
because cpm contains column names that do not exist in relevant_ods.
I have tried
relevant_ods_reordered <- relevant_ods[names(cpm),]
relevant_ods_reordered <- select(relevant_ods, names(cpm))
relevant_ods_reordered <- match(relevant_ods, names(cpm))
With base R, you need to find the names in common. intersect is good for this and preserves the order of its first argument:
relevant_ods[intersect(names(cpm), names(relevant_ods))]
Or with dplyr, use the select helper any_of:
select(relevant_ods, any_of(names(cpm)))
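A quick toy check (column names a–d are made up for illustration) shows how intersect keeps the order of cpm while dropping the columns that relevant_ods lacks:

```r
# cpm has more columns than relevant_ods
cpm          <- data.frame(a = 1, b = 2, c = 3, d = 4)
relevant_ods <- data.frame(c = 30, a = 10)

# Keep only the columns present in relevant_ods, ordered as in cpm
reordered <- relevant_ods[intersect(names(cpm), names(relevant_ods))]
names(reordered)  # "a" "c"
```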

Warning message: In log(t.dat) : NaNs produced even with na.omit=T

I want to create a new data frame by appending the binary label vector to a large dataframe t.dat. NaNs are produced even when I use na.omit=T, which means the NaNs were not due to 0 values.
label <- as.factor(c(rep(0, 21-1+1),rep(1,177-22+1))) # Binary vector 0=non-tumor and 1=glioma
svm.df <-data.frame(label, log(t.dat), na.omit=T)
Warning message: In log(t.dat) : NaNs produced
> which(is.nan(log(t.dat)))
[1] 597849 656262 673097 869853 949681 949692 949700 949725 949728
[10] 1255020 1255029 1427194 1462292 1462370 1946921 2085039 2375207 2375324
[19] 2459488 2471475 2756957 2756962 2756964 2756973 2756982 2757015 2757103
[28] 2757113 2757114 2757117 2757123 2866715 2966242 2966248 3108773 3612388
[37] 3712228 4106033 4863666 4863703 5011987 5012045 5012068 5266896 5358428
[46] 5361451 5494337 5630823 5733845 5733910 5815590 5815592 5815621 5815632
[55] 5815635 5941255 5941305 6073404 6073416 6073456 6073493 6073510 6073521
[64] 6073559 6100700 6100735 6100757 6100786 6239608 6239635 6239646 6239664
[73] 6239719 6425198 6476611 6489147 6865672 6905857 6966059 7049793 7148523
[82] 7172428 7172547 7623457 7726116 7829439 7829468 7829499 8008035
(1) data.frame doesn't have an na.omit argument (check the documentation), so the effect of including na.omit=T is to add an entire column called na.omit to your data frame.
(2) NaN values arise (in this case) from taking the log of a negative number. If you want to filter these out, you could try
ok <- which(t.dat >= 0)  # indices of the non-negative values
svm.df <- data.frame(label[ok], log(t.dat[ok]))
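A toy vector (a stand-in for the real t.dat) reproduces both the warning and the fix:

```r
t.dat <- c(2.5, -1.3, 0.7, -0.2)  # stand-in for the real data
log(t.dat)                        # warning: NaNs produced (from the negatives)
which(is.nan(log(t.dat)))         # 2 4
ok <- which(t.dat >= 0)           # keep only the non-negative entries
log(t.dat[ok])                    # no NaNs
```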

My dataset is huge and I don't know how to make figures with the data as it is

I have RNAseq data for a Time-course experiment (6 time points) and involves tens of thousands of genes.
I have used filter() from the tidyverse to find genes that fit certain criteria (reference genes for qPCR), but I don't know how to turn this data into a figure easily. Right now, I'd have to change the format of the dataset completely, which would take so much time as to be impractical.
The goal is just to have a graph for each gene that shows the change in expression over time for each condition (different leaf pairs and droughted/well-watered). I have done this for some in Excel but would like a quicker way to do it.
The dataset is set out like this:
[1] "gene.id" "LP1.2.02:00.WW" "LP1.2.02:00.WW_1" "LP1.2.02:00.WW_2"
[5] "LP1.2.06:00.WW" "LP1.2.06:00.WW_1" "LP1.2.06:00.WW_2" "LP1.2.10:00.WW"
[9] "LP1.2.10:00.WW_1" "LP1.2.10:00.WW_2" "LP1.2.14:00.WW" "LP1.2.14:00.WW_1"
[13] "LP1.2.14:00.WW_2" "LP1.2.18:00.WW" "LP1.2.18:00.WW_1" "LP1.2.18:00.WW_2"
[17] "LP1.2.22:00.WW" "LP1.2.22:00.WW_1" "LP1.2.22:00.WW_2" "LP3.4.5.02:00.WW"
[21] "LP3.4.5.02:00.WW_1" "LP3.4.5.02:00.WW_2" "LP3.4.5.06:00.WW" "LP3.4.5.06:00.WW_1"
[25] "LP3.4.5.06:00.WW_2" "LP3.4.5.10:00.WW" "LP3.4.5.10:00.WW_1" "LP3.4.5.10:00.WW_2"
[29] "LP3.4.5.14:00.WW" "LP3.4.5.14:00.WW_1" "LP3.4.5.14:00.WW_2" "LP3.4.5.18:00.WW"
[33] "LP3.4.5.18:00.WW_1" "LP3.4.5.18:00.WW_2" "LP3.4.5.22:00.WW" "LP3.4.5.22:00.WW_1"
[37] "LP3.4.5.22:00.WW_2" "LP1.2.02:00.Drought" "LP1.2.02:00.Drought_1" "LP1.2.02:00.Drought_2"
[41] "LP1.2.06:00.Drought" "LP1.2.06:00.Drought_1" "LP1.2.06:00.Drought_2" "LP1.2.10:00.Drought"
[45] "LP1.2.10:00.Drought_1" "LP1.2.10:00.Drought_2" "LP1.2.14:00.Drought" "LP1.2.14:00.Drought_1"
[49] "LP1.2.14:00.Drought_2" "LP1.2.18:00.Drought" "LP1.2.18:00.Drought_1" "LP1.2.18:00.Drought_2"
[53] "LP1.2.22:00.Drought" "LP1.2.22:00.Drought_1" "LP1.2.22:00.Drought_2" "LP3.4.5.02:00.Drought"
[57] "LP3.4.5.02:00.Drought_1" "LP3.4.5.02:00.Drought_2" "LP3.4.5.06:00.Drought" "LP3.4.5.06:00.Drought_1"
[61] "LP3.4.5.06:00.Drought_2" "LP3.4.5.10:00.Drought" "LP3.4.5.10:00.Drought_1" "LP3.4.5.10:00.Drought_2"
[65] "LP3.4.5.14:00.Drought" "LP3.4.5.14:00.Drought_1" "LP3.4.5.14:00.Drought_2" "LP3.4.5.18:00.Drought"
[69] "LP3.4.5.18:00.Drought_1" "LP3.4.5.18:00.Drought_2" "LP3.4.5.22:00.Drought." "LP3.4.5.22:00.Drought"
[73] "LP3.4.5.22:00.Drought_1" "X74" "LP1.2.02:00.WW.mean" "LP1.2.06:00.WW.mean"
[77] "LP1.2.10:00.WW.mean" "LP1.2.14:00.WW.mean" "LP1.2.18:00.WW.mean" "LP1.2.22:00.WW.mean"
[81] "LP1.2.02:00.drought.mean" "LP1.2.06:00.drought.mean" "LP1.2.10:00.drought.mean" "LP1.2.14:00.drought.mean"
[85] "LP1.2.18:00.drought.mean" "LP1.2.22:00.drought.mean" "LP3.4.5.02:00.WW.mean" "LP3.4.5.06:00.WW.mean"
[89] "LP3.4.5.10:00.WW.mean" "LP3.4.5.14:00.WW.mean" "LP3.4.5.18:00.WW.mean" "LP3.4.5.22:00.WW.mean"
[93] "LP3.4.5.02:00.drought.mean" "LP3.4.5.06:00.drought.mean" "LP3.4.5.10:00.drought.mean" "LP3.4.5.14:00.drought.mean"
[97] "LP3.4.5.18:00.drought.mean" "LP3.4.5.22:00.drought.mean"
It's a lot of headings, and as you can see from the titles, they contain the time, leaf pairs and condition. So, I'm not sure how to translate this into an x~y graph.
I've had several thoughts including trying to divide conditions into different subsets (LP1.2. WW/ LP.1.2.D/LP3.4.5.WW/LP.3.4.5.D) and making a subset for Time (02:00, 06:00, etc.) and trying to make a graph for that.
#make subset for the time points
Time <- c("02:00", "06:00", "10:00", "14:00", "18:00", "22:00")
#make subsets for each condition (LP1.2. WW/ LP.1.2.D/LP3.4.5.WW/LP.3.4.5.D)
LP1.2.WW.mean <- as.matrix(KG_graph_data[c( "LP1.2.02:00.WW.mean",
"LP1.2.06:00.WW.mean",
"LP1.2.10:00.WW.mean",
"LP1.2.14:00.WW.mean",
"LP1.2.18:00.WW.mean",
"LP1.2.22:00.WW.mean",
"gene.id")])
LP.1.2.D.mean <-
as.matrix(KG_graph_data[c("LP1.2.02:00.drought.mean",
"LP1.2.06:00.drought.mean",
"LP1.2.10:00.drought.mean",
"LP1.2.14:00.drought.mean",
"LP1.2.18:00.drought.mean",
"LP1.2.22:00.drought.mean",
"gene.id")])
LP345.WW.mean <- as.matrix((KG_graph_data[c("LP3.4.5.02:00.WW.mean",
"LP3.4.5.06:00.WW.mean",
"LP3.4.5.10:00.WW.mean",
"LP3.4.5.14:00.WW.mean",
"LP3.4.5.18:00.WW.mean",
"LP3.4.5.22:00.WW.mean",
"gene.id")]))
LP345.D.mean <-
as.matrix(KG_graph_data[c("LP3.4.5.02:00.drought.mean",
"LP3.4.5.06:00.drought.mean",
"LP3.4.5.10:00.drought.mean",
"LP3.4.5.14:00.drought.mean",
"LP3.4.5.18:00.drought.mean",
"LP3.4.5.22:00.drought.mean",
"gene.id")])
I tried extracting a particular gene from each matrix to plot a graph from, but it only worked when it came from one matrix, and even then the resulting table contained no data.
Total_KgGene007565 <- subset(LP1.2.WW.mean, "gene.id"=="KgGene007565",
LP.1.2.D.mean, "gene.id"=="KgGene007565",
LP345.WW.mean, "gene.id"=="KgGene007565",
LP345.D.mean, "gene.id"="KgGene007565")
I am not sure how to proceed from here or if this was the wrong way to approach this.
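One base-R route (sketched here on made-up column names that follow the question's pattern) is to reshape the wide mean columns into long format, then split each column name into leaf pair, time, and condition. After that, one plot per gene is a matter of subsetting, or, with ggplot2, facet_wrap(~ gene.id):

```r
# Toy wide table mimicking the real layout (gene ids and values are made up)
kg <- data.frame(
  gene.id = c("KgGene007565", "KgGene000001"),
  "LP1.2.02:00.WW.mean"      = c(1.2, 3.4),
  "LP1.2.06:00.WW.mean"      = c(1.5, 3.1),
  "LP1.2.02:00.drought.mean" = c(0.9, 2.8),
  "LP1.2.06:00.drought.mean" = c(1.1, 2.6),
  check.names = FALSE
)

# Wide -> long: one row per gene x sample column
long <- reshape(kg, direction = "long",
                varying = list(names(kg)[-1]),
                v.names = "expression",
                times = names(kg)[-1], timevar = "sample",
                idvar = "gene.id")

# Split each column name into its parts by pattern
base <- sub("\\.mean$", "", long$sample)
long$time      <- sub(".*\\.([0-9]{2}:[0-9]{2})\\..*", "\\1", base)
long$condition <- sub(".*[0-9]{2}:[0-9]{2}\\.", "", base)
long$pair      <- sub("\\.[0-9]{2}:[0-9]{2}\\..*", "", base)

# From here: subset(long, gene.id == "KgGene007565") and plot expression ~ time,
# or ggplot(long, aes(time, expression, colour = condition)) + facet_wrap(~ gene.id)
```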

Creating a loop with character variables on R - for two-sample t.test

I am looking to do multiple two sample t.tests in R.
I want to test 50 indicators that have two levels. So at first I used :
t.test(m~f)
Welch Two Sample t-test
data: m by f
t = 2.5733, df = 174.416, p-value = 0.01091
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.05787966 0.43891600
sample estimates:
mean in group FSS mean in group NON-FSS
0.8344209 0.5860231
Here m corresponds to the first indicator I want to test, m = Debt.to.equity.ratio.
Here is a list of all the indicators I need to test :
print (indicators)
[1] "Debt.to.equity.ratio" "Deposits.to.loans"
[3] "Deposits.to.total.assets" "Gross.loan.portfolio.to.total.assets"
[5] "Number.of.active.borrowers" "Percent.of.women.borrowers"
[7] "Number.of.loans.outstanding" "Gross.loan.portfolio"
[9] "Average.loan.balance.per.borrower" "Average.loan.balance.per.borrower...GNI.per.capita"
[11] "Average.outstanding.balance" "Average.outstanding.balance...GNI.per.capita"
[13] "Number.of.depositors" "Number.of.deposit.accounts"
[15] "Deposits" "Average.deposit.balance.per.depositor"
[17] "Average.deposit.balance.per.depositor...GNI.per.capita" "Average.deposit.account.balance"
[19] "Average.deposit.account.balance...GNI.per.capita" "Return.on.assets"
[21] "Return.on.equity" "Operational.self.sufficiency"
[23] "FSS" "Financial.revenue..assets"
[25] "Profit.margin" "Yield.on.gross.portfolio..nominal."
[27] "Yield.on.gross.portfolio..real." "Total.expense..assets"
[29] "Financial.expense..assets" "Provision.for.loan.impairment..assets"
[31] "Operating.expense..assets" "Personnel.expense..assets"
[33] "Administrative.expense..assets" "Operating.expense..loan.portfolio"
[35] "Personnel.expense..loan.portfolio" "Average.salary..GNI.per.capita"
[37] "Cost.per.borrower" "Cost.per.loan"
[39] "Borrowers.per.staff.member" "Loans.per.staff.member"
[41] "Borrowers.per.loan.officer" "Loans.per.loan.officer"
[43] "Depositors.per.staff.member" "Deposit.accounts.per.staff.member"
[45] "Personnel.allocation.ratio" "Portfolio.at.risk...30.days"
[47] "Portfolio.at.risk...90.days" "Write.off.ratio"
[49] "Loan.loss.rate" "Risk.coverage"
Instead of changing the indicator name each time in the t.test, I would like to create a loop that will do it automatically and calculate the p-value. I've tried creating a loop but can't make it work because the variable names are character strings.
I would really appreciate any tips on how to go forward!
Thank you very much !
Best
Morgan
I am assuming you are testing each indicator against the same grouping variable f.
In that case, you can try something like:
p_vals <- numeric(0)
for (this_indicator in indicators) {
  # build the formula "indicator ~ f" from the indicator's character name
  this_formula <- as.formula(paste(this_indicator, "f", sep = " ~ "))
  res <- t.test(this_formula)
  p_vals <- c(p_vals, res$p.value)
}
One comment, however: are you doing any multiplicity adjustment for these p-values? Given the large number of tests you are doing, there is a good chance you will be showered with false positives.
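On the multiplicity point, R's built-in p.adjust can correct the collected p-values, e.g. with Benjamini–Hochberg (the p-values below are toy stand-ins for the 50 collected ones):

```r
p_vals <- c(0.001, 0.020, 0.400)          # stand-ins for the collected p-values
p_adj  <- p.adjust(p_vals, method = "BH") # Benjamini-Hochberg FDR correction
p_adj                                     # 0.003 0.030 0.400
```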

R: Using for loop on data frame

I have a data frame, deflator.
I want to get a new data frame inflation which can be calculated by:
deflator[i] - deflator[i-4]
----------------------------- * 100
deflator [i - 4]
The data frame deflator has 71 numbers:
> deflator
[1] 0.9628929 0.9596746 0.9747274 0.9832532 0.9851884
[6] 0.9797770 0.9913502 1.0100561 1.0176906 1.0092516
[11] 1.0185932 1.0241043 1.0197975 1.0174097 1.0297328
[16] 1.0297071 1.0313232 1.0244618 1.0347808 1.0480411
[21] 1.0322142 1.0351968 1.0403264 1.0447121 1.0504402
[26] 1.0487097 1.0664664 1.0935239 1.0965951 1.1141851
[31] 1.1033155 1.1234482 1.1333870 1.1188136 1.1336276
[36] 1.1096461 1.1226584 1.1287245 1.1529588 1.1582911
[41] 1.1691221 1.1782178 1.1946234 1.1963453 1.1939922
[46] 1.2118189 1.2227960 1.2140535 1.2228828 1.2314258
[51] 1.2570788 1.2572214 1.2607763 1.2744415 1.2982076
[56] 1.3318808 1.3394186 1.3525902 1.3352815 1.3492751
[61] 1.3593859 1.3368135 1.3642940 1.3538567 1.3658135
[66] 1.3710932 1.3888638 1.4262185 1.4309707 1.4328823
[71] 1.4497201
This is a very tricky question for me.
I tried to do this using a for loop:
> d <- data.frame(deflator)
> for (i in 1:71) {d <- rbind(d, c(deflator))}
I think I might be doing it wrong.
Why use data frames? This is a straightforward vector operation.
inflation = 100 * (deflator[-(1:4)] - deflator[1:67])/deflator[1:67]
I agree with @Fhnuzoag that your example suggests calculations on a numeric vector, not a data frame. Here's an additional way to do your calculations taking advantage of the lag argument in the diff function (with indexes that match those in your question):
lagBy <- 4 # The number of indexes by which to lag
laggedDiff <- diff(deflator, lag = lagBy) # The numerator above
theDenom <- deflator[seq_len(length(deflator) - lagBy)] # The denominator above
inflation <- laggedDiff/theDenom
The first few results are:
head(inflation)
# [1] 0.02315470 0.02094710 0.01705379 0.02725941 0.03299085 0.03008297
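As a sanity check against the question's formula on the first few values (including the 100x scaling, which the diff-based snippet leaves out):

```r
deflator <- c(0.9628929, 0.9596746, 0.9747274, 0.9832532, 0.9851884)
n <- length(deflator)
# (deflator[i] - deflator[i-4]) / deflator[i-4] * 100, for i = 5..n
inflation <- 100 * (deflator[-(1:4)] - deflator[1:(n - 4)]) / deflator[1:(n - 4)]
inflation  # 2.31547
```

This matches the first value of the diff-based result (0.02315470) scaled by 100.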
