The data matrix contains both TPM and counts at the transcript level (generated with Kallisto), and the goal is to produce a gene-level count matrix or TPM matrix.
Is there a tool that takes a transcript-level (TPM or count) matrix and calculates the gene-level (TPM or count) matrix, or does this need a custom script?
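A minimal sketch of the custom-script route, assuming a transcript-to-gene mapping table is available. The object names (tx_counts, tx2gene) and the toy values are hypothetical. Summing estimated counts, or TPM, over the transcripts of each gene gives the gene-level value. The Bioconductor package tximport is commonly used to do this directly from Kallisto output; the base R version below only illustrates the aggregation step.

```r
# Toy transcript-level matrix (rows = transcripts, columns = samples).
# matrix() fills column by column, so sampleA = 10, 5, 0, 20.
tx_counts <- matrix(
  c(10, 5, 0, 20,
     2, 8, 4,  6),
  nrow = 4,
  dimnames = list(c("tx1", "tx2", "tx3", "tx4"), c("sampleA", "sampleB"))
)

# Hypothetical transcript-to-gene mapping (normally taken from the annotation).
tx2gene <- data.frame(
  tx   = c("tx1", "tx2", "tx3", "tx4"),
  gene = c("geneA", "geneA", "geneB", "geneB")
)

# Look up the gene of every row, then sum rows that share a gene.
gene_of_tx  <- tx2gene$gene[match(rownames(tx_counts), tx2gene$tx)]
gene_counts <- rowsum(tx_counts, group = gene_of_tx)
print(gene_counts)
```

The same two lines work unchanged on a TPM matrix, since gene-level TPM is the sum of the TPM of its transcripts.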
I built a network from a matrix and calculated the degree with degree_gen <- degree(g, mode="all"). When I save the result to Excel, I get a single column containing only the degree. I have not been able to create a data frame with the node ID in the first column and the degree in the second.
I think you can try
stack(degree_gen)
to produce a dataframe with both node ID and degree
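As a small illustration, assuming degree_gen is a named numeric vector (which is what degree() returns when the vertices have names), here a hand-made vector with hypothetical node names stands in for the real result so the example runs without igraph:

```r
# Stand-in for the output of degree(g, mode = "all"); names are node IDs.
degree_gen <- c(nodeA = 3, nodeB = 1, nodeC = 2)

# stack() returns a data frame with columns 'values' and 'ind'.
stacked <- stack(degree_gen)

# Equivalent, with explicit column names and order.
degree_df <- data.frame(
  id     = names(degree_gen),
  degree = unname(degree_gen)
)
print(degree_df)
```

Either data frame can then be written out with write.csv() and opened in Excel.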
Hi, I have a binary matrix and have generated a cluster dendrogram with the hclust function in R. First I normalise the values and then I plot. This is the code:
mat.norm <- t(df / sqrt(2*rowSums(df)))
plot(hclust(dist(mat.norm, "euclidean")))
My data consists of 9 columns, and the dendrogram is plotted for all 9 columns. Does anybody know if it is possible to set one of those columns as the root of the dendrogram, from which all the other columns will be clustered?
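A hierarchical clustering tree has no leaf that acts as a root, so the following is only a rough approximation and not a true re-rooting: it keeps the same tree but reorders the branches so that a chosen reference column is drawn as the first leaf. The toy data and the choice of "V3" as the reference are assumptions for illustration.

```r
# Toy binary data with 9 columns (hypothetical); drop all-zero rows so the
# normalisation does not divide by zero.
set.seed(1)
df <- as.data.frame(matrix(rbinom(20 * 9, 1, 0.5), nrow = 20))
df <- df[rowSums(df) > 0, ]

# Same normalisation and clustering as in the question.
mat.norm <- t(df / sqrt(2 * rowSums(df)))
hc   <- hclust(dist(mat.norm, "euclidean"))
dend <- as.dendrogram(hc)

# Weight 0 for the reference column and 1 for the rest; with agglo.FUN = min
# the branch containing the reference sorts first at every split.
ref  <- "V3"
wts  <- as.numeric(rownames(mat.norm) != ref)
dend_ref_first <- reorder(dend, wts, agglo.FUN = min)

print(labels(dend_ref_first))
# plot(dend_ref_first)  # draws the reordered tree
```

The merge heights are unchanged; only the left-to-right drawing order differs.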
I am analysing data from a Delphi study and I need to create a vector of the frequency of each score (1:10) for each stakeholder group (6 groups, total of 73 participants) for each outcome (48). The data is in the form:
I would like to create a vector similar to:
score 1,2,3,4,5,6,7,8,9
trialists<-c(0,0,0,0,28.6,71.4,0,0,0)
Here each value is the percentage of a stakeholder group (e.g. trialists) that gave each score for a given outcome. I need to exclude a score of 10, as it represents "unable to answer".
This will result in 48 vectors for each of the 6 stakeholder groups.
Is there an elegant way to do this in R rather than plodding through the data in Excel and entering it manually?
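One possible sketch, assuming the data can be arranged in long format with one row per participant and outcome. The column names (group, outcome, score) and the toy rows are hypothetical, since the original layout is not shown. It also assumes the percentages are taken over the participants who gave a score from 1 to 9.

```r
# Toy long-format data: 8 trialists scoring one outcome; one answered 10.
delphi <- data.frame(
  group   = rep("trialists", 8),
  outcome = rep("outcome_1", 8),
  score   = c(5, 5, 6, 6, 6, 6, 6, 10)
)

# Drop "unable to answer" and fix the score levels so that zero counts appear.
valid <- delphi[delphi$score != 10, ]
valid$score <- factor(valid$score, levels = 1:9)

# One percentage vector per group/outcome combination.
pct <- lapply(
  split(valid$score, list(valid$group, valid$outcome), drop = TRUE),
  function(s) round(100 * prop.table(table(s)), 1)
)

trialists <- as.vector(pct[["trialists.outcome_1"]])
print(trialists)
```

With the full data, pct would hold one named entry for every combination of the 6 groups and 48 outcomes.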
I have a large vector of 11 billion values. The distribution of the data is not known, so I would like to sample 500k data points based on the existing probabilities/distribution. In R there is a limit on the number of values that can be loaded in a vector (2^31 - 1), which is why I plan to do the sampling manually.
Some information about the data: The data is just integers. And many of them are repeated multiple times.
large.vec <- c(1,2,3,4,1,1,8,7,4,1,...,216280)
To create the probabilities of 500k samples across the distribution I will first create the probability sequence.
prob.vec <- seq(0, 1, length.out = 500000)
Next, convert these probabilities to positions in the original sequence.
position.vec <- prob.vec*11034432564
The reason I created the position vector is so that I can pick the data point at each specific position after ordering the population data.
Now I count the occurrences of each integer value in the population and create a data frame with the integer values and their counts. I also create the interval for each of these values:
integer.values   counts        lw.interval   up.interval
0                300,000,034   0             300,000,034
1                169,345,364   300,000,034   469,345,398
2                450,555,321   469,345,399   919,900,719
...
Now using the position vector, I identify which position value falls in which interval and based on that get the value of that interval.
This way I believe I have a sample of the population. I got a large part of the idea from this reference: Calculate quantiles for large data.
I wanted to know if there is a better approach, or whether this approach could reasonably, albeit crudely, give me a good sample of the population.
This process does take a fair amount of time, as the position vector has to go through all possible intervals in the data frame. For that I have made it parallel using RHIPE.
I understand that I will be able to do this only because the data can be ordered.
I am not trying to sample randomly here; I am trying to "sample" the data while keeping the underlying distribution intact, mainly to reduce 11 billion values to 500k.
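The interval lookup described above can be sketched on a small in-memory vector. This is only an illustration under assumptions: the toy sizes are hypothetical, positions are rounded to whole 1-based ranks, and base R's findInterval() stands in for the loop over intervals. It performs a binary search over the cumulative counts, so it may avoid scanning every interval for every position.

```r
set.seed(7)
large.vec <- sample.int(50, size = 10000, replace = TRUE)  # toy population
n.samples <- 101                                           # toy sample size

# Value/count table; table() lists the integer values in increasing order.
tab    <- table(large.vec)
values <- as.integer(names(tab))
counts <- as.vector(tab)
upper  <- cumsum(counts)        # last rank (1-based) occupied by each value
N      <- sum(counts)

# Evenly spaced probabilities mapped to ranks in the sorted population.
prob.vec     <- seq(0, 1, length.out = n.samples)
position.vec <- pmax(1, round(prob.vec * N))

# Number of upper bounds strictly below each rank, plus one, is the row of
# the value whose interval contains that rank.
idx     <- findInterval(position.vec, upper, left.open = TRUE) + 1
sampled <- values[idx]

print(head(sampled))
```

Only the value/count table has to fit in memory, not the full population, and ranks around 11 billion are still represented exactly as doubles.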
I have a dataset with two groups and replicates. My data is split by group, so I have 24 samples in group 1 and 20 samples in group 2. Each set has 4 replicates, so I have 6 sets in group 1 and 5 sets in group 2. I have assigned indices to the sets (1-11) to make the permutation easier. What I want to do now is a routine permutation analysis to obtain the test statistic. I am using a non-parametric method with resampling with replacement.
I am trying to permute the group labels. My null hypothesis is that there is no difference between the mean values of the two groups. My problem in the R code is that I have to pool the data together and then resample the groups. When I do this, I have to make sure I maintain the sample size of each group (that is, after resampling the group labels, my new dataset should still contain 6 sets (24 samples) in group 1 and 5 sets (20 samples) in group 2). I am unable to achieve the latter.
How can I achieve this in R?
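A minimal sketch of one way to keep the group sizes fixed, assuming the labels are shuffled at the set level. Note that it shuffles without replacement (plain sample() on the label vector), which is what guarantees 6 sets and 5 sets after every shuffle; it is not the resampling-with-replacement scheme mentioned in the question. The measurements are random placeholders and all object names are hypothetical.

```r
set.seed(42)
set_id    <- rep(1:11, each = 4)          # set index of each of the 44 samples
set_group <- c(rep(1, 6), rep(2, 5))      # group label of each set
value     <- rnorm(44)                    # placeholder measurements

# Difference in group means for a given assignment of sets to groups.
mean_diff <- function(group_by_set) {
  g <- group_by_set[set_id]               # expand set labels to samples
  mean(value[g == 1]) - mean(value[g == 2])
}

observed <- mean_diff(set_group)

# sample() without replacement only reorders the 11 labels, so every
# permutation still has 6 sets in group 1 and 5 sets in group 2.
n_perm     <- 2000
perm_stats <- replicate(n_perm, mean_diff(sample(set_group)))

# Two-sided permutation p-value, counting the observed statistic itself.
p_value <- (sum(abs(perm_stats) >= abs(observed)) + 1) / (n_perm + 1)
print(p_value)
```

Because replicates move together with their set, each shuffled dataset keeps 24 samples in group 1 and 20 in group 2.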