How to manipulate a count matrix from a DGEList?

I am currently doing an RNA-Seq differential expression analysis. I used the function DGEList from edgeR to obtain the counts and samples objects. I now want to remove a list of genes from counts. This is the code I tried (where remove is the list of genes I want to remove and gene is the reference of all the genes I have):
n = 0
for (i in remove) {
  for (j in gene) {
    n = n + 1
    if (i == j) {
      counts = counts[-n, ]
      n = n - 1
    }
    if (n == nrow(counts)) {
      n = 0
    }
  }
}
I was expecting it to work, as it runs properly on a similar plain matrix. This code is still running, while the version working on the matrix finished a long time ago. It should remove about 16000 rows.
Do I have to manipulate it in a different way?

If I understand correctly, you want to filter out some genes from your count matrix. In that case, instead of the loops, you can index the counts object directly. Assuming the entries in remove match entries in rownames(counts), you could try:
counts_subset <- counts[!(rownames(counts) %in% remove), ]
A similar approach works on the table obtained by running the LRT test (result$table), which would be a better object to filter.
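If you are still working with the DGEList itself rather than the bare matrix, you can subset the whole object the same way; row-subsetting a DGEList keeps the counts and samples components in sync. A minimal sketch, assuming dge is your DGEList and remove is your character vector of gene IDs (both names are placeholders):
library(edgeR)
keep <- !(rownames(dge$counts) %in% remove)  # dge and remove are assumed names from the question
dge_subset <- dge[keep, ]  # subsetting rows of a DGEList also subsets the stored counts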

Related

create a list with the results (dataframes) of the same functions applied on a list of inputs in R

I am using the function textstat_keyness, which looks at the most frequently appearing words in a specific group of documents in comparison with all the other groups of documents (basically, you input the target group of documents, and the output is a dataset containing the words ordered from most to least important, plus some other columns with statistics).
I have a character vector with the names of all the document groups I want to apply the keyness analysis to:
interest_list <- c(unique(data$interest))
(it looks like: chr "0", "340", "456", etc.; basically each number corresponds to a group of documents)
I can easily apply textstat_keyness to a single group of documents as follows:
keyness <- dfm(dfmat_data, groups = "group_interest")
# Calculate keyness and determine audience as target group; compare frequencies
# of words between target and reference documents.
result_keyness <- textstat_keyness(keyness, target = "17627")
The problem is that I don't want to run textstat_keyness for each group individually, as I have around 100 groups.
I was thinking of using a for loop, but I am not sure how to create a list of all the data frames generated by textstat_keyness.
I wrote this so far, but I don't know how to store all the results I would obtain:
for (i in interest_list) {
  textstat_keyness(keyness, target = i)
}
Otherwise, I tried with lapply, but it doesn't work:
keylist <- lapply(keyness, textstat_keyness(keyness, target = interest_list))
Any idea how I can obtain my list of data frames in an efficient way?
Thank you very much,
Carlo
An alternative to the for loop provided by JaiPizGon is a solution with lapply:
keylist <- lapply(interest_list, function(i) textstat_keyness(keyness, target = i))
Note that lapply is essentially a for loop that always returns a list.
The notation used by JaiPizGon is also correct; you should just be careful about growing objects in R (see chapter 2 of "The R Inferno").
So if you are more comfortable using a for loop, I suggest specifying the size of the list prior to assignment, i.e.:
keylist <- vector("list", length(interest_list))
for (i in seq_along(interest_list)) {
  keylist[[i]] <- textstat_keyness(keyness, target = interest_list[i])
}
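Either way, naming the list elements makes individual results easier to retrieve later. A small sketch, assuming the keylist and interest_list objects above:
names(keylist) <- interest_list  # e.g. keylist[["17627"]] is that group's keyness table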
Have you tried initializing a list and assigning the results of the textstat_keyness function to it?
Code:
keylist <- list()
for (i in 1:length(interest_list)) {
  keylist[[i]] <- textstat_keyness(keyness, target = interest_list[i])
}

Append row to dataset in R

I'm trying to create a new dataset by deleting some rows from ds2 (through a comparison with a dataset ds1). I wrote a function that should do this:
compare <- function(ds1, ds2) {
  for (i in 1:length(ds1$long)) {
    for (j in 1:length(ds2$long)) {
      if (ds1$long[i] < (ds2$long[j] + 500) & ds1$long[i] > (ds2$long[j] - 500)) {
        if (ds1$lat[i] < (ds2$lat[j] + 500) & ds1$lat[i] > (ds2$lat[j] - 500)) {
          ds3 <- data.frame(merge(ds2[j, ], ds3))
        }
      }
    }
  }
  return(ds3)
}
ds3 is the dataset I want to return; it should be formed by the rows of the original dataset ds2 that satisfy the condition above.
My function gives me an error:
Error in as.data.frame(y) :
  argument "y" is missing, with no default
Is merge() the right function for creating such a dataset, appending rows to ds3?
If not, which is the right function to do this?
Thank you all in advance.
Edit: I modified the function thanks to your tips, using
ds3 <- data.frame()
ds3 <- rbind(ds3, ds2[j, ])
instead of
ds3 <- data.frame(merge(ds2[j, ], ds3))
Now I get this error:
Error in rbind(ds3, ds2[j, ]) :
  no method for coercing this S4 class to a vector
If I use rbind(), can I operate on SpatialPoints? (The data contained in my datasets are spatial points.)
Edit 2: I have two datasets: one with 330 rows (points on an irregular grid, ds1) and one with ~150000 rows (points on a regular grid, ds2). I want to compute the correlation between the variables in the first dataset and the variables in the second one. To do that, I want to "reduce" the second dataset to the dimensions of the first, keeping only the points which have the same (or nearly the same) coordinates in both datasets.
Without a small example this has had no testing, but if you are happy with the performance of the for loop then this may be what you are attempting:
compare <- function(ds1, ds2) {
  ds3 <- NULL  # rbind(NULL, row) returns the row itself, so no special first-match case is needed
  for (i in 1:length(ds1$long)) {
    for (j in i:length(ds2$long)) {  # I think starting at 1 will give twice as many hits
      if (ds1$long[i] < (ds2$long[j] + 500) & ds1$long[i] > (ds2$long[j] - 500)) {
        if (ds1$lat[i] < (ds2$lat[j] + 500) & ds1$lat[i] > (ds2$lat[j] - 500)) {
          ds3 <- rbind(ds3, ds2[j, ])  # append the matching row of ds2
        }
      }
    }
  }
  return(ds3)
}
I tried to avoid the added overhead of retesting for j,i matches where you already had an i,j match. Again, I cannot tell for sure this is appropriate because the problem description is still not exactly clear to me.
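Given the sizes in Edit 2 (330 by ~150000 points), a vectorised version of the same test may be worth trying. A sketch, assuming plain numeric long/lat columns (for sp objects you could extract them first via coordinates()); note it builds two roughly 330 x 150000 logical matrices, so it trades memory for speed:
long_ok <- abs(outer(ds1$long, ds2$long, "-")) < 500  # TRUE where longitudes differ by < 500
lat_ok  <- abs(outer(ds1$lat,  ds2$lat,  "-")) < 500  # same test for latitudes
ds3 <- ds2[colSums(long_ok & lat_ok) > 0, ]           # rows of ds2 near at least one ds1 point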

Creating a data frame that is populated by a custom function that returns a vector

I have the following code below, and what I would like to do is populate a data frame. Each row should be returned from the custom function rX (it returns a vector of 3 numbers).
I've come up with two ways to achieve this, but they both feel a bit like workarounds, and I was wondering if anyone had a better way to suggest.
Method 1 loops through each iteration, storing the result in a temporary variable and then putting it in the correct place in the data frame.
The second method rbinds the data in, but I'm left with a blank row which needs to be stripped out afterwards.
n <- 500
ff <- c(0.2, 0.3, 0.5, 0.25)
rX <- function(ff) {
  # vector to hold the set selections
  rands <- runif(3)
  s <- rep(0, 3)
  for (x in 1:3) {
    # generate probabilities from ff
    probs <- cumsum(ff / sum(ff))
    # select the first fracture set
    s[x] <- min(which(probs >= rands[x]))
    # get rid of that set and recalculate
    ff[s[x]] <- 0
  }
  s
}
Solutions:
# way 1
df_sets <- data.frame(s1 = rep(0, n), s2 = rep(0, n), s3 = rep(0, n))
for (i in 1:n) {
  a <- rX(ff)
  df_sets$s1[i] <- a[1]
  df_sets$s2[i] <- a[2]
  df_sets$s3[i] <- a[3]
}
head(df_sets)
# way 2
df_sets <- data.frame(s1 = 0, s2 = 0, s3 = 0)
for (i in 1:n) {
  a <- rX(ff)
  df_sets <- rbind(df_sets, a)
}
df_sets <- df_sets[-1, ]
head(df_sets)
Edit:
The point of this function is to create a number of realizations, each of which selects (without replacement) from a predetermined vector with discrete probabilities. The function rX uses the static input shown above. It selects one of the data points by comparing a random number between 0 and 1 to the cumulative percent passing at each point, then removes that point, recalculates the probability function, and compares again.
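Incidentally, the scheme rX implements (draw in proportion to ff, remove the drawn point, renormalize, draw again) is what base R's sample() does when given prob = with replace = FALSE, so the whole simulation can be written without the custom function. A sketch, assuming the n and ff defined above:
# each realization draws 3 indices, weighted by ff, without replacement
df_sets <- as.data.frame(t(replicate(n, sample(seq_along(ff), 3, replace = FALSE, prob = ff))))
colnames(df_sets) <- c("s1", "s2", "s3")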

Indexing in nested loops

I am new to R and this site. My aim with the following, assuredly unnecessarily-arcane code is to create an R function that produces a special type of box plot in ggplot2. I first need to process potential input thereinto by calculating the variables that I shall later wish to have plotted.
I start by generating some random data, called datos:
c1=rnorm(98,47,23)
c2=rnorm(98,56,13)
c3=rnorm(98,52,7)
fila1=as.matrix(t(c(-2,15,30)))
colnames(fila1)=c("c1","c2","c3")
fila2=as.matrix(t(c(-20,5,20)))
colnames(fila2)=c("c1","c2","c3")
datos=rbind(data.frame(c1,c2,c3),fila1,fila2)
rm(c1,c2,c3,fila1,fila2)
Then, I calculate the variables to later be plotted, which include, for each of the columns in datos, the mean (puntoMedio), the first and third quartiles (cuar1, cuar3), the interquartile range (iqr), the lower bound of potential submean whiskers (limInf), the upper bound of potential supermean whiskers (limSup), and the outliers (submean outliers vAtInf and supermean outliers vAtSup, to be combined in vAt):
puntoMedio=apply(datos,MARGIN=2,FUN=mean)
cuar1=apply(datos,MARGIN=2,FUN=quantile,probs=.25)
cuar3=apply(datos,MARGIN=2,FUN=quantile,probs=.75)
cuar=rbind(cuar1,cuar3)
iqr=apply(cuar,MARGIN=2,FUN=diff)
cuar=rbind(cuar,iqr,puntoMedio)
limInf <- array(dim = ncol(datos))
for (i in 1:ncol(datos)) {
  limInf0 <- as.matrix(t(cuar[1, ] - 1.5 * cuar[3, ]))
  if (length(datos[datos[, i] < limInf0[, i], i]) > 0) {
    limInf[i] <- limInf0[, i]
  } else {
    limInf[i] <- min(datos[, i])
  }
}
limSup <- array(dim = ncol(datos))
for (i in 1:ncol(datos)) {
  limSup0 <- as.matrix(t(cuar[2, ] + 1.5 * cuar[3, ]))
  if (length(datos[datos[, i] > limSup0[, i], i]) > 0) {
    limSup[i] <- limSup0[, i]
  } else {
    limSup[i] <- max(datos[, i])
  }
}
d <- data.frame(t(rbind(cuar, limInf, limSup)))
rm(cuar)
vAtInf <- datos
for (i in 1:ncol(vAtInf)) {
  vAtInf[vAtInf[, i] > limInf0[, i], i] <- NA
}
colnames(vAtInf) <- c("vAtInfc1", "vAtInfc2", "vAtInfc3")
vAtSup <- datos
for (i in 1:ncol(vAtSup)) {
  vAtSup[vAtSup[, i] < limSup0[, i], i] <- NA
}
colnames(vAtSup) <- c("vAtSupc1", "vAtSupc2", "vAtSupc3")
datos <- cbind(datos, vAtInf, vAtSup)
rm(limInf0, limSup0, cuar1, cuar3, i, iqr, limInf, limSup, puntoMedio)
Everything works as desired up until here. I have two data frames, d and datos: the former is of no interest here; the latter, in this specific case, comprises nine columns, namely three of all values, three of the corresponding submean outliers, and three of the corresponding supermean outliers (these latter six padded with NA). I now wish to extract all outliers by column, wherefore I have tried formulating the following loop. While it runs with neither error nor warning, it does not give the desired output in vAt (again, the by-column [columns 4:9] outliers from datos). The problem, as far as I have been able to discern, occurs in the nested for loop, upon attempting to input i into vAt: each iteration of the loop erases the last, such that upon completion of the entire loop, vAt only contains NA and the outliers from the last column, i.e. of the last iteration.
for (i in ((ncol(datos) / 3) + 1):ncol(datos)) {
  vAt <- matrix(nrow = .25 * nrow(datos), ncol = ncol(datos) - (ncol(datos) / 3))
  colnames(vAt) <- c(((ncol(datos) / 3) + 1):ncol(datos))
  if (length(datos[, i][is.na(datos[, i]) == F]) > 0) {
    for (j in 1:(length(datos[, i][is.na(datos[, i]) == F]))) {
      nom <- as.character(i)
      vAt[j, nom] <- datos[, i][is.na(datos[, i]) == F][j]
    }
  } else {
    next
  }
}
I have not been able to find any existent thread that answers my question. Thanks for any help.
The problem is that you are initialising vAt inside the loop here.
Moving the initialisation statements outside the for loop will fix the problem that you are facing:
vAt <- matrix(nrow = .25 * nrow(datos), ncol = ncol(datos) - (ncol(datos) / 3))
colnames(vAt) <- c(((ncol(datos) / 3) + 1):ncol(datos))
for (i in ((ncol(datos) / 3) + 1):ncol(datos)) {
  if (length(datos[, i][is.na(datos[, i]) == F]) > 0) {
    for (j in 1:(length(datos[, i][is.na(datos[, i]) == F]))) {
      nom <- as.character(i)
      vAt[j, nom] <- datos[, i][is.na(datos[, i]) == F][j]
    }
  } else {
    next
  }
}
However, there are various improvements which you can make to the code as it stands:
Using vectorisation and the *apply functions instead of for loops.
Not comparing logical vectors with == F, but using !is.na(...) directly.
Using sum(!is.na(...)) instead of length(datos[, i][!is.na(datos[, i])]) to count the non-NA values.
And some more. These will not change the correctness of the code, but will make it more efficient and more idiomatic.
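As a sketch of that vectorised direction, assuming the datos built above (where the outlier columns are 4:9): since each column holds a different number of outliers, a named list is a more natural container than an NA-padded matrix:
out_cols <- ((ncol(datos) / 3) + 1):ncol(datos)                    # columns 4:9 here
vAt <- lapply(out_cols, function(i) datos[!is.na(datos[, i]), i])  # drop the NA padding
names(vAt) <- colnames(datos)[out_cols]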

Efficient function to return varying length vector from lookup table

I have three data sources:
types<-c(1,3,3)
places<-list(c(1,2,3),1,c(2,3))
lookup.counts<-as.data.frame(matrix(runif(9,min=0,max=10),nrow=3,ncol=3))
assigned.places<-rep.int(0,length(types))
The numbers in the types vector tell me what 'type' a given observation is. The vectors in the places list tell me which places the observation can be found in (some observations are found in only one place, others in all places). By definition there is one entry in types and one list entry in places for each observation. lookup.counts tells me how many observations of each type are located in each place (generated from another data source).
I want to randomly assign each observation to a place based on a probability generated from lookup.counts. Using for loops, it looks something like:
for (i in 1:length(types)) {
  row <- types[i]
  columns <- places[[i]]
  this.obs <- lookup.counts[row, columns]  # the counts of this type in each place
  total <- sum(this.obs)
  this.obs <- this.obs / total  # the share of observations of this type in these places
  pick <- runif(1, min = 0, max = 1)
  # the following should really be a 'while' loop, but regardless it needs help
  for (j in 1:length(this.obs)) {
    if (this.obs[j] > pick) {
      # pick is less than this county so assign
      pick <- 100  # just a way of making sure an observation doesn't get assigned twice
      assigned.places[i] <- colnames(lookup.counts)[j]
    } else {
      # pick is greater, move to the next category
      pick <- pick - this.obs[j]
    }
  }
}
I have been trying to vectorize this somehow, but am getting hung up on the variable length of 'places' and of 'this.obs'
In practice, of course, the lookup.counts table is quite a bit bigger (500 x 40) and I have some 900K observations with places lists of length 1 through length 39.
To vectorize the inner loop, you can use sample or sample.int to choose from several alternatives with prescribed probabilities. Unless I read your code incorrectly, you want something like this:
assigned.places[i] <- sample(colnames(this.obs), 1, prob = this.obs)
I'm a bit surprised that you're using colnames(lookup.counts) instead. Shouldn't this be subset by columns as well? It seems that either I missed something, or there is a bug in your code.
The different lengths of your lists are a severe obstacle to vectorizing your outer loop. Perhaps you could use the Matrix package to store that information as sparse matrices. Then you could simply multiply the probabilities by it to exclude those columns which are not in the places list of a given observation. But as you'd probably still use apply for the above sampling code, you might as well keep the list and use some form of apply to iterate over it.
The overall result might look somewhat like this:
assigned.places <- colnames(lookup.counts)[
  apply(cbind(types, places), 1, function(x) {
    sample(x[[2]], 1, prob = lookup.counts[x[[1]], x[[2]]])
  })
]
The use of cbind and apply isn't particularly beautiful, but seems to work. Each x is a list of two items, x[[1]] being the type and x[[2]] being the corresponding places. We use these to index lookup.counts just as you did. Then we use the found counts as relative probabilities when choosing the index of one of the columns we used in the subscript. Only after all these numbers have been assembled into a single vector by apply will the indices be turned into names based on colnames.
You can check whether things are faster if you don't cbind stuff together, but instead iterate over the indices only:
assigned.places <- colnames(lookup.counts)[
  sapply(1:length(types), function(i) {
    sample(places[[i]], 1, prob = lookup.counts[types[i], places[[i]]])
  })
]
This appears to work as well:
# More convenient if lookup.counts is a matrix.
lookup.counts <- matrix(runif(9, min = 0, max = 10), nrow = 3, ncol = 3)
colnames(lookup.counts) <- paste0('V', 1:ncol(lookup.counts))
# A function that does what the for loop does for each i
test <- function(i) {
  this.places <- colnames(lookup.counts)[places[[i]]]
  this.obs <- lookup.counts[types[i], this.places]
  sample(this.places, size = 1, prob = this.obs)
}
# Apply the function for all i
sapply(1:length(types), test)
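One caveat that applies to the numeric-index variants above: when x is a single number, sample(x, 1) draws from 1:x rather than returning x (see the warning in ?sample), so an observation whose places entry is a single numeric index can be silently misassigned. Sampling character names, as test() does, avoids this; with numeric indices, a small guard (a hypothetical helper, not from the original answers) is safer:
# draw one element of x even when length(x) == 1
sample_one <- function(x, prob) if (length(x) == 1) x else sample(x, 1, prob = prob)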
