Common dispersion in R

I am trying to use the edgeR package to analyse my data, as follows.
This is my data frame, called subdata:
       A     B     C     D     E
1  13707 13866 12193 12671 10178
2      0     0     0     0     1
3   7165  5002  1256  1341  2087
6   8537 16679  9042  9620 19168
10 19438 25234 15563 16419 16582
16     3     3    11     3     5
genotype <- factor(c("LMS1", "LRS1", "MS3", "MS4", "RS5"))
y <- DGEList(counts = data.matrix(subdata), group = genotype)
design <- model.matrix(~ 0 + group, data = subdata)
y <- estimateGLMCommonDisp(y, design, verbose = TRUE)
I am trying to calculate the common dispersion so I can test for up- and down-regulated genes, but I get the following warning message:
Warning message:
In estimateGLMCommonDisp.default(y = y$counts, design = design, :
No residual df: setting dispersion to NA
Could anyone please help me resolve this problem? I would really appreciate it.
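The warning itself points at the likely cause: with one sample per group, the design matrix uses up all the degrees of freedom, so there is no residual information left to estimate the dispersion from. Below is a hedged sketch of the standard workaround for unreplicated data described in the edgeR user's guide, which is to assume a typical biological coefficient of variation (BCV) rather than estimate it; the BCV value and the pair of groups compared are only examples, not taken from the original post:
library(edgeR)
# Assumed, not estimated: the edgeR guide suggests a BCV around 0.4 as typical for human data
bcv <- 0.4
et <- exactTest(y, pair = c("LMS1", "LRS1"), dispersion = bcv^2)
topTags(et)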

Related

Cannot introduce offset into negative binomial regression

I'm performing a count data analysis in R on a data frame called 'doctor':
   V2  V3        L4 V5
1   1  32 10.866795  1
2   2 104 10.674706  1
3   3 206 10.261581  1
4   4 186  9.446440  1
5   5 102  8.578665  1
6   1   2  9.841080  2
7   2  12  9.275472  2
8   3  28  8.649974  2
9   4  28  7.857481  2
10  5  31  7.287561  2
The best model according to stepwise AIC was V3 ~ V2 + L4 + V5 + V2:L4:V5. Now I want to set L4 as the offset and perform a negative binomial regression including the interaction, so I used the code
nbinom = glm.nb(V3 ~ V2 + V5 + V2:V5, offset = L4)
but I get this error message:
Error in glm.control(...) : unused argument (offset = L4)
What have I done wrong here?
Offsets are entered using an offset() term in the model formula:
nbinom <- glm.nb(V3 ~ V2 + V5 + V2:V5 + offset(L4), data = doctor)
You can also write V2*V5 instead of V2 + V5 + V2:V5; the two are equivalent.
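For reference, here is a minimal self-contained sketch of a negative binomial regression with an offset, using simulated data and made-up variable names (the exposure is assumed to enter on the log scale, which is the usual convention for count models):
library(MASS)  # provides glm.nb
set.seed(1)
d <- data.frame(x = rnorm(100), exposure = runif(100, 1, 10))
d$y <- rnbinom(100, mu = exp(0.5 + 0.3 * d$x) * d$exposure, size = 2)
# offset(log(exposure)) fixes the coefficient of log-exposure at exactly 1
fit <- glm.nb(y ~ x + offset(log(exposure)), data = d)
summary(fit)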

Hierarchical clustering with specific number of data in each cluster

I'm clustering a set of words using hierarchical clustering, and I want each cluster to contain a fixed number of words, for example 2 words or 3 words.
I'm trying to modify existing code for this clustering. I just set the values at max(d) to Inf as well:
Lm[min(d), ] <- sl
Lm[, min(d)] <- sl
if (length(cluster) > 2) {  # if it's already clustered with more than 2 points,
  # then don't cluster them again, by setting the values to Inf
  Lm[min(d), min(d)] <- Inf
  Lm[max(d), max(d)] <- Inf
  Lm[max(d), ] <- Inf
  Lm[, max(d)] <- Inf
  Lm[min(d), ] <- Inf
  Lm[, min(d)] <- Inf
}
However, it doesn't give me the expected results, and I was wondering whether this is the correct approach. How can I do this type of clustering with constraints in R?
Here is an example of the results I got:
row V1 V2
166 -194 -38
167 166 -1
...
240 239 239
241 240 240
242 241 241
243 242 242
244 243 243
This will be tough to optimize, and it can produce arbitrarily bad results, because your size constraint goes against the principles of clustering.
Consider the one-dimensional data set -100, -1, 1, 100, and suppose you limit the cluster size to 2 elements. Hierarchical clustering will first merge -1 and +1 because they are closest. That cluster has now reached the maximum size, so the only remaining option is to cluster -100 and +100: the worst possible result, since that cluster is as wide as the entire data set.
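To see the first step concretely, here is a tiny illustration of my own (not part of the original answer), showing that unconstrained hclust does merge -1 and +1 first:
x <- c(-100, -1, 1, 100)
hc <- hclust(dist(x))
hc$merge[1, ]  # c(-2, -3): the first merge joins observations 2 and 3, i.e. -1 and +1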
Just to give you an example of what I meant by partitional clustering:
library(cluster)
data("ruspini")
desired_cluster_size <- 3L
corresponding_num_clusters <- round(nrow(ruspini) / desired_cluster_size)
km <- kmeans(ruspini, corresponding_num_clusters)
table(km$cluster)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
3 3 2 4 2 2 2 1 3 3 2 3 2 3 3 2 6 3 2 1 3 6 2 8 4
This definitely can't guarantee how many observations you'll have in each group, and it's not deterministic, but it at least gives you an approximation.
In the tabulated results you can see that most of the 25 clusters ended up with 2 or 3 elements.

Error in fisher.test : Bug in fexact3, it[i=6]=0: negative key (kyy=91)

I have this table and I want to analyse it statistically.
table(sci$category, sci$true_group)
               mono sim_rus_nen suc_balanced suc_nen_rus suc_rus_nen
generalization    9           3            9           4           3
description      35          16           15          13          17
scheme            2           1            1           1           2
syncretism        5           3            7          16           2
tautology         2           2            2           3           3
substitution      1           0            0           0           0
indefinite        7           5            5           6           9
no_answer        30          17           18          13          19
So I decided to apply Fisher's exact test. But I get the error below (although everything is fine with chisq.test):
fisher.test(table(sci$category, sci$true_group))
Error in fisher.test(my_tab) : Bug in fexact3, it[i=6]=0: negative
key -1099365618 (kyy=91)
How can I fix this?
For larger contingency tables/counts, the exact test gets resource-intensive, since all tables at least as extreme have to be enumerated to arrive at the p-value (which seems to be what this error is about). So it is convenient to simulate the p-value for tables larger than 2x2:
df <- table(sci$category, sci$true_group)
fisher.test(df, simulate.p.value = TRUE, B = 1e6)
Fisher's Exact Test for Count Data with simulated p-value (based on 1e+06 replicates)
data: df
p-value = 0.1054
alternative hypothesis: two.sided
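If you need an exact rather than a simulated p-value, another option sometimes worth trying (though there is no guarantee it avoids this particular fexact bug) is to enlarge the workspace of the underlying network algorithm via fisher.test's workspace argument:
fisher.test(df, workspace = 2e7)  # default is 2e5; the value here is just an example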
PS: Choosing between Fisher's exact test and the chi-squared test is a whole other discussion. I would refer you to this Cross Validated post for clarity: Alternatives to chisq-test.

Treat variables as data_frame and other things

I guess I have a problem in R. I have this data frame (see at the bottom), which I imported as a dataset called Weevils from a text file. I then converted the data via as.data.frame(Weevils), and is.data.frame(Weevils) returning [1] TRUE proved that it is a data frame. Yet I cannot use the $ operator, because the variables are all "atomic vectors". So I tried this instead:
pairs(x[Age_yrs] ~ x[Larvae_per_m²], col = x[Farmer], pch = 16)
but then this error occurred:
Error in plot.xy(xy, type, ...) :
numerische Farbe muss >= 0 sein, gefunden -2
(in English: "numeric colour must be >= 0, found -2"), which basically means that a negative value (for the Farmer?) was found, so the colors cannot be assigned. It is all supposed to look like this answer: https://stackoverflow.com/a/40829168/5987736 (thanks to Carles Mitjans!),
yet what I got in my case, when running pairs(x[Age_yrs] ~ x[Larvae_per_m²], pch = 16), was a plot with negative values, so the colors cannot be assigned.
So my questions are: why can't the variables in the Weevils data frame be treated as non-atomic vectors, i.e. why can't I use $? And why are the values negative; what can I do so the values get positive? Thanks for helping me!
       Farmer Age_yrs Larvae_per_m²
1        Band       2          1315
2        Band       4           725
3        Band       6            90
4     Fechney       1           520
5     Fechney       3           285
6     Fechney       9            30
7  Mulholland       2           725
8  Mulholland       6            20
9       Adams       2           150
10      Adams       3           225
11  Forrester       1           455
12  Forrester       3            75
13 Bilborough       2           850
14 Bilborough       3           650
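A hedged sketch of how the plot in the linked answer is usually produced (my own suggestion, assuming Weevils really has just the three columns shown above): convert Farmer to a factor so its integer codes are positive, address the columns with $, and use backticks for the non-syntactic name Larvae_per_m²:
Weevils <- as.data.frame(Weevils)         # make sure it is a plain data frame
Weevils$Farmer <- factor(Weevils$Farmer)  # colors are taken from positive integer codes
plot(`Larvae_per_m²` ~ Age_yrs, data = Weevils,
     col = as.integer(Weevils$Farmer), pch = 16)
legend("topright", legend = levels(Weevils$Farmer),
       col = seq_along(levels(Weevils$Farmer)), pch = 16)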

How can I filter out rows from a linear regression based on another linear regression

I would like to conduct a linear regression in three steps: 1) run the regression on all data points; 2) remove the 10 outliers, as identified by the absolute value of rstandard; 3) run the regression again on the new data frame.
I know how to do it manually, but that is very awkward. Is there a way to do it automatically? Can it be done for taking out columns as well?
Here is my toy data frame and code (I'll take out the 2 top outliers):
df <- read.table(text = "userid target birds wolfs
222 1 9 7
444 1 8 4
234 0 2 8
543 1 2 3
678 1 8 3
987 0 1 2
294 1 7 16
608 0 1 5
123 1 17 7
321 1 8 7
226 0 2 7
556 0 20 3
334 1 6 3
225 0 1 1
999 0 3 11
987 0 30 1 ",header = TRUE)
model <- lm(target ~ birds + wolfs, data = df)
rstandard <- abs(rstandard(model))
df <- cbind(df, rstandard)
g <- subset(df, rstandard > sort(unique(rstandard), decreasing = TRUE)[3])
g
userid target birds wolfs rstandard
4 543 1 2 3 1.189858
13 334 1 6 3 1.122579
modelNew <- lm(target ~ birds + wolfs, data = df[-c(4, 13), ])
I don't see how you could do this without estimating two models: the first to identify the most influential cases, and the second fitted to the data without those cases. You could simplify your code and avoid cluttering the workspace, however, by doing it all in one shot, with the subsetting process embedded in the call that estimates the "final" model. Here's code that does this for the example you gave:
model <- lm(target ~ birds + wolfs,
data = df[-(as.numeric(names(sort(abs(rstandard(lm(target ~ birds + wolfs, data=df))), decreasing=TRUE)))[1:2]),])
Here, the initial model, evaluation of influence, and ensuing subsetting of the data are all built into the code that comes after the first data =.
Also, note that the resulting model will differ from the one your code produced. That's because your g did not correctly identify the two most influential cases, as you can see if you eyeball the results of abs(rstandard(lm(target ~ birds + wolfs, data = df))). I think it has to do with your use of unique(), which seems unnecessary, but I'm not sure.
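If you need to do this repeatedly, one way (my own sketch, not from the original answer) is to wrap the same two-model logic in a small helper; using order() on the absolute standardized residuals avoids the names()/as.numeric() round-trip:
# Drop the n cases with the largest |rstandard| and refit
# (assumes no rows are dropped for missing values, so positions align with data rows)
refit_without_outliers <- function(formula, data, n = 2) {
  fit <- lm(formula, data = data)
  worst <- order(abs(rstandard(fit)), decreasing = TRUE)[1:n]
  lm(formula, data = data[-worst, ])
}
modelNew <- refit_without_outliers(target ~ birds + wolfs, df, n = 2)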
