I am following this tutorial online for analyzing RNA-seq data between cell types.
https://combine-australia.github.io/RNAseq-R/06-rnaseq-day1.html
I have been able to perform most of this using my own data, but I am now trying to perform pathway enrichment analysis. However, I am having issues because I am unable to label the rows of my initial readcounts matrix according to the Gene IDs.
I have tried to simply create a new column with the Gene IDs, however this changes the matrix to a dataframe and prevents me from using DGEList.
seqdata is my data.frame with all the information on the genes from the analysis, with column 1 as the gene ID names and columns 15 to 24 as the vectors with the read count information of each gene across 10 samples.
I generated a matrix from this data.frame called readcounts_g that just has the read counts for each of these genes, and I am trying to use the gene names in column 1 of seqdata as the row names of readcounts_g.
rownames(readcounts_g) <- seqdata[,1]
Error in `.rowNamesDF<-`(x, value = value) : invalid 'row.names' length
In addition: Warning message:
Setting row names on a tibble is deprecated.
I have also thought of simply adding the gene names as an additional column in readcounts_g, but if I do that then I cannot use DGEList because it requires a matrix.
Ultimately, I am trying to use goana to do a pathway enrichment analysis with differentially expressed genes, but I am unable to do this without gene names assigned to the final matrix of DEGs.
If anyone has insight on how I can remedy this, it would be greatly appreciated. I can try to explain further if need be.
If seqdata is a tibble, seqdata[,1] is itself a tibble and not a character or numeric vector, hence you are unable to assign it as the rownames of a matrix. See below for the alternative:
library(dplyr)
seqdata = tibble(geneID=sample(1:1000),
s1=rpois(1000,10),s2=rpois(1000,15),
s3=rpois(1000,20),s4=rpois(1000,25))
readcounts_g = as.matrix(seqdata[,2:5])
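# seqdata[,1] is still a tibble, so the following line throws an error:
# rownames(readcounts_g) = seqdata[,1]
# seqdata$geneID is a plain vector, so this works:
rownames(readcounts_g) = seqdata$geneID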
> head(readcounts_g)
    s1 s2 s3 s4
763 16 13 13 24
776 13 19 24 26
308 12 19 19 34
88  10  8 13 22
23  10 13 16 25
509  9 12 14 28
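As a minimal sketch of the difference, the following uses a plain data.frame with made-up gene IDs and counts (g1, s1 and so on are placeholders, not the asker's data). Subsetting with drop = FALSE mimics what a tibble returns for seqdata[,1], while [[1]] returns the underlying vector for data frames and tibbles alike:

```r
# Hypothetical toy data standing in for seqdata: gene IDs plus two count columns
seqdata <- data.frame(geneID = c("g1", "g2", "g3"),
                      s1 = c(5L, 0L, 12L),
                      s2 = c(7L, 1L, 9L),
                      stringsAsFactors = FALSE)

# With drop = FALSE the result stays a one-column table, like tibble[, 1]
one_col <- seqdata[, 1, drop = FALSE]
is.data.frame(one_col)

# [[ ]] extracts the column as a plain character vector
gene_ids <- seqdata[[1]]
is.character(gene_ids)

# Build the count matrix first, then attach the gene IDs as row names
readcounts_g <- as.matrix(seqdata[, c("s1", "s2")])
rownames(readcounts_g) <- gene_ids
readcounts_g
```

The resulting object is still a matrix, so it should be acceptable wherever a count matrix with gene row names is expected.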
I'm reading a .sav file using haven:
library(haven)
data <- read_spss("file.sav", user_na = FALSE)
Then trying to display one of the variables in a table:
table(data$region)
Which returns:
  1   2   3   4   5   6   7   8   9  10  11  12
 85 208  43 171  30  40  95 310 133  29  77  36
This is technically correct. However, in SPSS the numerical values in the top row have labels associated with them (region names in this case). If I just run data$region, it shows me the numbers and their associated labels at the end of the output, but is there a way to make those string labels appear in the first table row instead of their numerical counterparts?
Thank you in advance for your help!
The way to do this is to cast the variable as a factor, using the "labels" attribute of the vector as the factor levels. The sjlabelled package includes a function that does this in one step:
data$region <- sjlabelled::as_label(data$region)
While the table command will still work on the resulting data, the layout may be a little messy. The forcats package has a function that pretty-prints frequency tables for factors:
forcats::fct_count(data$region)
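To illustrate what such a conversion does, here is a base R sketch. It assumes that the labelled vector stores its value labels in a named "labels" attribute (label text as names, numeric codes as values), which is how haven represents them. The codes and region names are invented for the example:

```r
# Hypothetical stand-in for data$region: numeric codes with a "labels" attribute
codes <- c(1, 2, 2, 3, 1)
attr(codes, "labels") <- c(North = 1, South = 2, East = 3)

# Use the codes as factor levels and the label names as the displayed labels
labs <- attr(codes, "labels")
region <- factor(codes, levels = unname(labs), labels = names(labs))

# The frequency table is now headed by the label text instead of the codes
table(region)
```

This is only meant to show the mechanism; for real haven data a dedicated conversion function is the safer choice, since it also handles details such as missing-value codes.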
I have a vector of length 1000. It contains (numeric) survey answers of 100 participants, thus 10 answers per participant. I would like to drop the first three values for every participant to create a new vector of length 700 (including only the answers to questions 4-10).
I only know how to extract every n-th value of the vector, but cannot figure out how to solve the above problem.
vector <- seq(1,1000,1)
Expected output:
4 5 6 7 8 9 10 14 15 16 17 18 19 20 24 ...
Using a matrix to first structure and then flatten is one method. Another somewhat similar method is to use what I am calling a "logical pattern index":
head( # just showing the first couple of "segments"
vector[ c( rep(FALSE, 3), rep(TRUE, 10-3) ) ],
15)
[1] 4 5 6 7 8 9 10 14 15 16 17 18 19 20 24
This method can also be used inside the two-argument version of [ to select rows or columns using a logical pattern index. This works because of R's recycling of logical indices.
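A small sketch of the two-argument form, using a made-up 10 x 2 matrix. The pattern has length 5, which is shorter than the number of rows, so it is recycled over all 10 rows:

```r
# Hypothetical matrix: 10 rows, 2 columns, filled column by column with 1..20
m <- matrix(1:20, nrow = 10)

# Drop the first 3 of every 5 rows, keep the remaining 2
pattern <- c(rep(FALSE, 3), rep(TRUE, 2))

# The logical index is recycled along the row dimension
kept <- m[pattern, ]
kept   # rows 4, 5, 9 and 10 of m
```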
Thanks for providing example data, which makes this thread reproducible. Here is one solution:
c(matrix(vector, 10)[4:10, ])
We first convert the vector to a matrix with 10 rows, so that each column corresponds to one participant. Then we use row subsetting to remove the first three rows. Finally, the matrix is flattened to a vector again.
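To check that the matrix approach and the recycled logical index give the same result, a quick comparison can be run. Here answers is a stand-in name for the example vector, chosen only to avoid masking base::vector:

```r
# Example data from the question: 100 participants x 10 answers each
answers <- seq(1, 1000, 1)

# Approach 1: recycled logical pattern (drop 3, keep 7, repeat)
by_pattern <- answers[c(rep(FALSE, 3), rep(TRUE, 10 - 3))]

# Approach 2: reshape to 10 rows, keep rows 4 to 10, flatten column by column
by_matrix <- c(matrix(answers, 10)[4:10, ])

length(by_pattern)
identical(by_pattern, by_matrix)
head(by_matrix, 15)
```

Both approaches rely on every participant having exactly 10 answers stored consecutively.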
Assume I have three matrices...
A=matrix(c("a",1,2),nrow=1,ncol=3)
B=matrix(c("b","c",3,4,5,6),nrow=2,ncol=3)
C=matrix(c("d","e","f",7,8,9,10,11,12),nrow=3,ncol=3)
I want to find all possible combinations of column 1 (characters or names) while summing up columns 2 and 3. The result would be a single matrix with a number of rows equal to the total number of possible combinations, in this case 6. The result would look like the following matrix...
Result <- matrix(c("abd","abe","abf","acd","ace","acf",11,12,13,12,13,14,17,18,19,18,19,20),nrow=6,ncol=3)
I do not know how to add a table to this question, otherwise I would show it more descriptively. Thank you in advance.
You are mixing character and numeric values in a matrix and this will coerce all elements to character. Much better to define your matrix as numeric and keep the character values as the row names:
A <- matrix(c(1,2),nrow=1,dimnames=list("a",NULL))
B <- matrix(c(3,4,5,6),nrow=2,dimnames=list(c("b","c"),NULL))
C <- matrix(c(7,8,9,10,11,12),nrow=3,dimnames=list(c("d","e","f"),NULL))
#put all the matrices in a list
mlist<-list(A,B,C)
Then we use some Map, Reduce and lapply magic:
res <- Reduce("+",Map(function(x,y) y[x,],
expand.grid(lapply(mlist,function(x) seq_len(nrow(x)))),
mlist))
Finally, we build the rownames
rownames(res)<-do.call(paste0,expand.grid(lapply(mlist,rownames)))
#    [,1] [,2]
#abd   11   17
#acd   12   18
#abe   12   18
#ace   13   19
#abf   13   19
#acf   14   20
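The one-liner can be easier to follow when split into named steps. The sketch below repeats the same computation with intermediate variables (idx and picked are names introduced here for illustration) and adds drop = FALSE so that the picked rows always stay matrices:

```r
A <- matrix(c(1, 2), nrow = 1, dimnames = list("a", NULL))
B <- matrix(c(3, 4, 5, 6), nrow = 2, dimnames = list(c("b", "c"), NULL))
C <- matrix(c(7, 8, 9, 10, 11, 12), nrow = 3, dimnames = list(c("d", "e", "f"), NULL))
mlist <- list(A, B, C)

# Step 1: every combination of row indices, one column per matrix (1 * 2 * 3 = 6 rows)
idx <- expand.grid(lapply(mlist, function(x) seq_len(nrow(x))))

# Step 2: for each matrix, pull out the rows named by its index column
picked <- Map(function(i, m) m[i, , drop = FALSE], idx, mlist)

# Step 3: add the three 6 x 2 matrices element-wise
res <- Reduce(`+`, picked)

# Step 4: build the combined names in the same order as idx
rownames(res) <- do.call(paste0, expand.grid(lapply(mlist, rownames)))
res
```

Because expand.grid varies its first argument fastest, the rows come out in the order abd, acd, abe, ace, abf, acf rather than in alphabetical order.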
I want to import multiple data files in R and find the average of the third column across the files. I have shown an example below.
I have imported multiple files in R using Ramnath's solution from "Import multiple text files in R and assign them names from a predetermined list".
The code I have used so far is as follows:
# Import multiple text files with the extension .dat
txt_files = list.files(pattern='\\.dat$')
data_list = lapply(txt_files, read.table, sep="\t", header=TRUE)
I then used Nico's answer from "R list to data frame" to convert the list to a data frame:
# Change the list to dataframe
hello <- as.data.frame(do.call(rbind,data_list))
dim(hello)
# Using 12 files I got the following information
> dim(hello)
[1] 58536 1
Each file has 4878 rows. This is not what I am looking for: the code above merged all the data into one data frame by rows.
I want to combine the files by columns and calculate the average of the third column across the files, so that I end up with one array of averages.
The sample of what I want is as follows:
File 1
Lat Long Value
10 12 15
12 13 16
File 2
Lat Long Value
10 12 11
12 13 15
Final File
Lat Long Value
10 12 13
12 13 15.5
As you can see in the final file, the first two columns are the same; the only difference is the third column, which is the average of the two values from the two files. So, I want to turn my data into a data frame like the final file shown above.
Group by coordinates
Combining things by rows is all right, as long as you don't require your final list to be in any particular order, and don't have different rows with the same coordinates. In that case, you can simply use common coordinates to group rows, and then aggregate over them like this:
aggregate(Value ~ Lat + Lon, hello, mean)
Group by row numbers
If, on the other hand, you have duplicate coordinates, or want the final result to be in the same order as all the inputs, then you should extract the Value column from each data.frame and combine them into a matrix. Then you can compute the mean for each matrix row, and combine those means with the two coordinate columns of any input data frame. This whole approach relies heavily on the order of input data rows, i.e. on the row number of a given place being the same in all files. You could implement it like this:
mean_values <- apply(do.call(cbind, lapply(data_list, function(df) df$Value)), 1, mean)
cbind(data_list[[1]][1:2], Value=mean_values)
Trying this out
Here is an example session of what this looks like on my system:
> data_list <- list(File.1=data.frame(Lat=c(10,12),Lon=c(12,13),Value=c(15,16)),
File.2=data.frame(Lat=c(10,12),Lon=c(12,13),Value=c(11,15)))
> hello <- as.data.frame(do.call(rbind,data_list))
> dim(hello)
[1] 4 3
> str(hello)
'data.frame': 4 obs. of 3 variables:
$ Lat : num 10 12 10 12
$ Lon : num 12 13 12 13
$ Value: num 15 16 11 15
> aggregate(Value ~ Lat + Lon, hello, mean)
Lat Lon Value
1 10 12 13.0
2 12 13 15.5
> value_matrix <- do.call(cbind, lapply(data_list, function(df) df$Value))
> value_matrix
File.1 File.2
[1,] 15 11
[2,] 16 15
> mean_values <- apply(value_matrix, 1, mean)
> cbind(data_list[[1]][1:2], Value=mean_values)
Lat Lon Value
1 10 12 13.0
2 12 13 15.5
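A variant of the row-number approach, sketched with the column names from the question (Lat, Long, Value). It picks the third column by position, so it does not depend on whether the coordinate column is called Lon or Long, and it uses rowMeans instead of apply. The two small data frames are the sample files from the question:

```r
# Sample data standing in for the list of imported files
data_list <- list(
  data.frame(Lat = c(10, 12), Long = c(12, 13), Value = c(15, 16)),
  data.frame(Lat = c(10, 12), Long = c(12, 13), Value = c(11, 15))
)

# One column per file, taken from the third column of each data frame
value_matrix <- do.call(cbind, lapply(data_list, function(df) df[[3]]))

# Row-wise mean across files, combined with the coordinates of the first file
final <- cbind(data_list[[1]][1:2], Value = rowMeans(value_matrix))
final
```

As with the apply version, this assumes that row i refers to the same place in every file.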
Only a single column?
As you only get a single column from reading your input files, according to your dim output, you should investigate that data frame using head or str to see what went wrong. Most likely, your columns aren't separated by tabs but by commas or spaces or some such. Notice that if you do not specify sep, then any sequence of spaces and/or tabs will be used as a column separator. Read the documentation for read.table for details.
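The effect of a wrong separator can be reproduced without any files. In this sketch, raw is a made-up, space-separated stand-in for the contents of one .dat file, passed to read.table through its text argument:

```r
# Hypothetical file contents: columns separated by spaces, not tabs
raw <- "Lat Long Value\n10 12 15\n12 13 16"

# Forcing sep = "\t" finds no tabs, so each line becomes a single field
wrong <- read.table(text = raw, sep = "\t", header = TRUE)
dim(wrong)
str(wrong)

# Leaving sep at its default splits on any run of spaces and/or tabs
right <- read.table(text = raw, header = TRUE)
dim(right)
str(right)
```

If str shows one character column whose name looks like all headers glued together, the separator is the likely culprit.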
I have a data object signal in R with 40,000+ rows (named variables) of numeric values and 200+ columns (samples). For every row of each column, I want to subtract the value for the row named background for that column.
The code below can be used to create an example signal object in R. With the example, for column A, the background value of 4 is to be subtracted from the values of channelNo1 to 3. Similarly, for column B, the value of 6 is to be subtracted. And so on. What is the simplest way to achieve this in R?
text <- textConnection('
A B C
channelNo1 12 22 32
channelNo2 13 21 33
channelNo3 12 21 30
background 4 6 8
')
signal <- read.table(text, header = TRUE)
close(text)
typeof(signal)
# returns 'list'
class(signal)
# returns 'data.frame'
Elements in an R matrix are stored column by column (check out matrix(1:12, nrow=3)), and signal - signal[4,] is not doing what you think: check out column B, where the second and third values should be the same (and equal to 15). You could write
as.data.frame(Map("-", signal, as.vector(signal[4,])))
(I think this would be relatively efficient), but since the data really seem to be a matrix (i.e., a rectangle of homogeneous type) it makes a lot more sense to manipulate it as a matrix:
m = as.matrix(signal)
sweep(m, 2, m[4,], "-")
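A self-contained sketch of the matrix approach on the example data. It selects the background row by name rather than by position, which is an assumption that the row is always called background, and compares sweep with an equivalent double-transpose form:

```r
# Example data from the question; the short header line makes column 1 the row names
signal <- read.table(text = "
A B C
channelNo1 12 22 32
channelNo2 13 21 33
channelNo3 12 21 30
background 4 6 8
", header = TRUE)

m <- as.matrix(signal)

# Subtract each column's background value from every row of that column
corrected <- sweep(m, 2, m["background", ], "-")

# Same result via transposition: recycling then runs along the original columns
corrected_t <- t(t(m) - m["background", ])
all.equal(corrected, corrected_t)

# Optionally drop the background row, which is now all zeros
corrected[rownames(corrected) != "background", ]
```

Selecting by name avoids hard-coding the row number 4 when the background row sits elsewhere in a larger table.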