Annotation difference between justRMA and read.affybatch - bioconductor

I am trying to get the raw expression data from a CEL file with probe names. However, I could not get the probe names.
I tried using read.affybatch as follows:
library(affy)
smallest_file <- "GSM766572.CEL.gz"
setwd("/data0/RtmpGPL/raw")
a <- read.affybatch(smallest_file)
c <- justRMA(filenames = c(smallest_file), normalize=FALSE)
However, when I inspect the contents with
head(exprs(a))
I get:
GSM766572.CEL.gz
1 157
2 15900
3 171
4 15673
5 56
6 115
What I want is something like the following (which was produced with head(exprs(c))):
GSM766572.CEL.gz
1007_s_at 9.636495
1053_at 6.574971
117_at 5.966103
121_at 8.559288
1255_g_at 4.508790
1294_at 8.320684
But with the raw expression levels.
I do not understand why I cannot get the probe names alongside the raw expression levels. Please help.
Thanks

After reading the justRMA code and an existing solution, I found that what I wanted can be achieved as follows:
tmp <- ReadAffy(filenames=c("GSM766572.CEL.gz"),celfile.path = "/data0/RtmpGPL/raw")
pmIndex <- pmindex(tmp)
probeintensities <- pm(tmp)
probenames <- rep(names(pmIndex), unlist(lapply(pmIndex,length)))
rownames(probeintensities) <- probenames
This produces something like the following:
GSM766572.CEL.gz
1007_s_at 352.0
1007_s_at 790.0
1007_s_at 853.0
1007_s_at 4383.5
1007_s_at 3090.0
1007_s_at 2239.0
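As an aside, the same idea can be written more compactly with probeNames() from the affy package, which should return one probe-set name per PM probe in the same order as the rows of pm(). A sketch (not run against these files):

```r
library(affy)

# Assumption: probeNames(tmp) collapses the pmindex()/rep() construction
# above into a single call, yielding one probe-set name per PM probe
tmp <- ReadAffy(filenames = "GSM766572.CEL.gz",
                celfile.path = "/data0/RtmpGPL/raw")
probeintensities <- pm(tmp)
rownames(probeintensities) <- probeNames(tmp)
head(probeintensities)
```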

Related

Question about getting counts in the R survey package

I'm using the 2018 CBECS data set from the Energy Information Administration (available here: https://www.eia.gov/consumption/commercial/data/2018/xls/2018_public_use_data.csv), and I've set up the sample design according to their user guide. I'm noticing a discrepancy when I use the svyby function as opposed to just the svytotal function, and I'm hoping somebody can explain what it is I'm seeing and/or what I'm doing wrong.
Here is the set up for the sample design:
library(survey)
library(spatstat)
library(tidyverse)
cbecs2018 <- read_csv(paste0(getwd(), "/2018_public_use_data.csv"))
samp_wts <- cbecs2018$FINALWT
rep_wts <- cbecs2018[, grepl("^FINALWT", names(cbecs2018))]
rep_wts$FINALWT <- NULL
samp_design <- svrepdesign(weights=samp_wts, repweights=rep_wts,
type="JK2", mse=TRUE, data=cbecs2018)
sqftc <- factor(cbecs2018$SQFTC) # this is a categorical variable classifying buildings by size
When I run svytotal to get a count of buildings by each category in sqftc, I get the output below, which is consistent with what EIA has:
svytotal(~sqftc, samp_design)
total SE
sqftc2 2836939.2 138709.13
sqftc3 1358439.0 78632.96
sqftc4 966092.8 55503.86
sqftc5 396595.4 23727.58
sqftc6 218416.8 11718.72
sqftc7 93085.9 5179.07
sqftc8 39865.5 1993.62
sqftc9 6664.8 620.07
sqftc10 2111.8 255.25
However, when I try to break it out by census region, I get completely different counts by category. For example, instead of showing 2,836,939 buildings in the second sqftc group, the table below makes it look like there are 3,605,529 buildings in the group.
x <- svyby(~sqftc, ~region, samp_design, svytotal)
> sum(x$sqftc2)
[1] 3605529
print(x)
region sqftc2 sqftc3 sqftc4 sqftc5 sqftc6 sqftc7 sqftc8 sqftc9 sqftc10 se1 se2 se3 se4 se5 se6 se7 se8
1 1 679858.4 382470.2 466330.8 383649.9 638936.3 777312.6 918361.9 220786.7 97105.4 70972.33 58987.22 57377.8 41027.49 79224.73 100678.28 104811.7 26387.60
2 2 1142179.1 634697.1 752421.8 762969.8 929830.8 1107860.2 1382698.4 369059.3 149810.3 131036.12 88954.07 102800.3 120901.81 88769.62 118328.83 146119.8 56056.48
3 3 859228.7 456788.7 521518.6 540952.1 779310.4 912930.2 1062321.1 285638.1 100881.7 86845.98 50065.79 56198.4 53630.90 66850.76 68490.26 87545.5 34443.43
4 4 924262.5 499895.4 541658.9 555604.6 820252.5 927657.6 1205995.5 298595.7 96787.1 96106.38 51019.41 58771.1 58782.50 60113.72 85934.54 134417.5 41790.27
se9
1 14502.07
2 39303.04
3 21410.55
4 13725.39
I feel like whatever I'm doing wrong is probably pretty straightforward, but any pointers would be greatly appreciated.
Maybe review your minimal reproducible example? :-) When I run this, the numbers match:
library(survey)
cbecs2018 <- read.csv("https://www.eia.gov/consumption/commercial/data/2018/xls/2018_public_use_data.csv")
samp_design <-
svrepdesign(
weights = ~ FINALWT ,
repweights = "^FINALWT[0-9]" ,
type = 'JK2' ,
mse = TRUE ,
data = cbecs2018
)
samp_design <- update( samp_design , SQFTC = factor( SQFTC ) )
svytotal(~SQFTC, samp_design)
svyby(~SQFTC,~REGION,samp_design,svytotal)
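As a sanity check (a sketch using the column names this call should produce; not run against the full data), the group totals returned by svyby() can be summed across regions and compared with the overall svytotal():

```r
# Group totals by region; each SQFTC level becomes a column of x
x <- svyby(~SQFTC, ~REGION, samp_design, svytotal)

# Summing each SQFTC column over the regions should reproduce svytotal()
colSums(x[, grepl("^SQFTC", names(x))])
svytotal(~SQFTC, samp_design)
```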

R: read_csv reads numeric entries as logical - parsing col_logical instead of col_double

I am new to R.
I wrote code for an assignment which reads several csv files, binds them into a data frame, and then, according to the id, calculates the mean of either nitrate or sulfate.
Data sample:
Date sulfate nitrate ID
<date> <dbl> <dbl> <dbl>
1 2003-10-06 7.21 0.651 1
2 2003-10-12 5.99 0.428 1
3 2003-10-18 4.68 1.04 1
4 2003-10-24 3.47 0.363 1
5 2003-10-30 2.42 0.507 1
6 2003-11-11 1.43 0.474 1
...
To read the files and create a data.frame, I wrote this function:
library(readr)
library(dplyr)
pollutantmean <- function (pollutant, id = 1:332) {
  # creating a data frame from several files
  file_m <- list.files(path = "specdata", pattern = "*.csv", full.names = TRUE)
  read_file_m <- lapply(file_m, read_csv)
  df_1 <- bind_rows(read_file_m)
  # delete NAs
  df_clean <- df_1[complete.cases(df_1), ]
  # select rows according to id
  df_asid_clean <- filter(df_clean, ID %in% id)
  # calculate the mean of the column
  # ([[ extracts a vector; mean() needs a vector, not a one-column tibble)
  mean_result <- mean(df_asid_clean[[pollutant]])
  mean_result
}
However, when the read_csv function is applied, certain entries in the nitrate column are parsed as col_logical, although the column as a whole is numeric and the entries are numeric. It seems that read_csv "expects" to receive logical values, although the real values are not logical.
Throughout the reading I get this message:
<...>
Parsed with column specification:
cols(
Date = col_date(format = ""),
sulfate = col_double(),
nitrate = col_logical(),
ID = col_double()
)
Warning: 41 parsing failures.
row col expected actual file
2055 nitrate 1/0/T/F/TRUE/FALSE 0.383 'specdata/288.csv'
2067 nitrate 1/0/T/F/TRUE/FALSE 0.355 'specdata/288.csv'
2073 nitrate 1/0/T/F/TRUE/FALSE 0.469 'specdata/288.csv'
2085 nitrate 1/0/T/F/TRUE/FALSE 0.144 'specdata/288.csv'
2091 nitrate 1/0/T/F/TRUE/FALSE 0.0984 'specdata/288.csv'
.... ....... .................. ...... ..................
See problems(...) for more details.
I tried to change the column class after binding rows by writing
df_1[, "nitrate"] <- as.numeric(as.character(df_1[, "nitrate"]))
but NAs are then introduced again in the step that calculates the mean.
What is wrong here, and how could I solve it?
Would appreciate your help!
UPDATE: I tried to insert read_csv(col_types = list(...)), but I get an error that the "file" argument is missing. As I understand it, R evaluates the inner read_csv() call first, before lapply() supplies a file, and because there is no "file" given at that point, it shows the error.
The problem with readr::read_csv() failure in parsing the column types can be overcome by passing a col_types= argument in lapply(). We do this as follows:
pollutantmean <- function (directory,pollutant,id=1:332){
require(readr)
require(dplyr)
file_m <- list.files(path = directory, pattern = "*.csv", full.names = TRUE)[id]
read_file_m <- lapply(file_m, read_csv,col_types=list(col_date(),col_double(),
col_double(),col_integer()))
# rest of code goes here. Since I am a Community Mentor in the
# JHU Data Science Specialization, I am not allowed to post
# a complete solution to the programming assignment
}
Note that I use the [ form of the extract operator to subset the list of file names with the id vector that is an argument to the function, which avoids reading a lot of data that isn't necessary. This eliminates the need for the filter() statement in the code posted in the question.
With some additional programming statements to complete the assignment, the code in my answer produces the correct results for the three examples posted with the assignment, as listed below.
> pollutantmean("specdata","sulfate",1:10)
[1] 4.064128
> pollutantmean("specdata", "nitrate", 70:72)
[1] 1.706047
> pollutantmean("specdata", "nitrate", 23)
[1] 1.280833
Alternately we could implement lapply() with an anonymous function that also uses read_csv() as follows:
read_file_m <- lapply(file_m, function(x) {read_csv(x,col_types=list(col_date(),col_double(),
col_double(),col_integer()))})
NOTE: while it is completely understandable that students who have been exposed to the tidyverse would like to use it for the programming assignment, the fact that dplyr isn't introduced until the next course in the sequence (and readr isn't covered at all) makes it much more difficult to use for assignments in R Programming, especially the first assignment, where dplyr non-standard evaluation gives people fits. An example of this situation is yet another Stackoverflow question on pollutantmean().
With recent versions of readr you don't need lapply(): read_csv() accepts a vector of file paths directly, which you have already defined.
Regarding the column types, these can then be set manually in the col_types argument:
col_types = cols(Date = col_date(), sulfate = ...)
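Putting the two points together, a minimal sketch (assuming readr >= 2.0, where read_csv() accepts a vector of paths, and the four-column layout shown in the question):

```r
library(readr)

# read_csv() (readr >= 2.0) takes a vector of paths and binds the rows itself
files <- list.files(path = "specdata", pattern = "\\.csv$", full.names = TRUE)
df_1 <- read_csv(files,
                 col_types = cols(Date = col_date(format = ""),
                                  sulfate = col_double(),
                                  nitrate = col_double(),
                                  ID = col_integer()))
```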

R, how to add an unlist (and other) function inside an apply function?

Context: I am working with genes and ontology, but my question concerns R script writing.
I would like to replace the GO:IDs in my data frame with their corresponding terms extracted from a database.
So, this is my source data frame. It is a gene list (V1) with associated GO:IDs (V2):
>gene_list_and_Go_ID
V1 V2
2563 Gene1 GO:0003871, GO:0008270, GO:0008652, GO:0009086
2580 Gene2 GO:0003871, GO:0008270, GO:0008652, GO:0009086
12686 Gene3 GO:0003871, GO:0008270, GO:0008652, GO:0009086
14523 Gene4 GO:0004489, GO:0006555, GO:0055114
The request to the database looks very simple:
>select(GO.db, my_Go_id, "TERM", "GOID")
I tried the following lines to address manually the database, it worked well:
>my_Go_id = unlist(strsplit("GO:0008270, GO:0008652, GO:0009086", split=", "))
>select(GO.db, my_Go_id, "TERM", "GOID")
GOID TERM
1 GO:0008270 zinc ion binding
2 GO:0008652 cellular amino acid biosynthetic process
3 GO:0009086 methionine biosynthetic process
My problem: I cannot make this process automatic!
Precisely, for each row, I need to transform each string from column 2 of my data frame into a vector in order to query the database.
Then I need to replace the GO:IDs in the data frame with the result of the query.
1/ To start, I tried to put the "unlist" function inside an "apply" function applied to my data frame:
apply(gene_list_and_Go_ID,1,unlist(strsplit(gene_list_and_Go_ID[,2], split=", ")))
I got :
Error in strsplit(ok, split = ", ") : non-character argument
2/ Then, can I also add the database query inside the apply function?
3/ Finally, I do not know how to replace column n°2 by the result of the database request.
This is an example of the expected “ideal” result:
V1 V2
2563 Gene1 GOID TERM
1 GO:0008270 zinc ion binding
2 GO:0008652 cellular amino acid biosynthetic process
3 GO:0009086 methionine biosynthetic process
Thanks for your help.
The proximate issue is that apply() is not called the way you wrote it. Instead of passing a function call, you need to provide a function that will take each row/column of the array in turn as input via its first argument, so you want something like (not tested, because you don't need this)
apply(gene_list_and_Go_ID, 1,
function(x) { unlist(strsplit(x[2], split=", "))})
However, notice that you don't need entire rows of gene_list_and_Go_ID. What you want is to work on the V2 column of gene_list_and_Go_ID. Now also note that strsplit is vectorised, which means if you pass it a vector of length greater than 1 it will work on each element of that vector as if you'd repeatedly called strsplit() on each element of the vector in turn.
Consider the following:
df <- data.frame(V1 = paste0("Gene", 1:4),
V2 = c("GO:0003871, GO:0008270, GO:0008652, GO:0009086",
"GO:0003871, GO:0008270, GO:0008652, GO:0009086",
"GO:0003871, GO:0008270, GO:0008652, GO:0009086",
"GO:0004489, GO:0006555, GO:0055114"),
stringsAsFactors = FALSE)
Note that V2 needs to be a character vector --- here I used stringsAsFactors = FALSE to stop the automatic coercion character -> factor, but you could also just use as.character(V2) where I have V2 in the code below.
To run strsplit on each element of V2 we could use:
spl <- with(df, strsplit(V2, ", "))
which gets us
> spl
[[1]]
[1] "GO:0003871" "GO:0008270" "GO:0008652" "GO:0009086"
[[2]]
[1] "GO:0003871" "GO:0008270" "GO:0008652" "GO:0009086"
[[3]]
[1] "GO:0003871" "GO:0008270" "GO:0008652" "GO:0009086"
[[4]]
[1] "GO:0004489" "GO:0006555" "GO:0055114"
By the look of the select call, this is a one-shot deal: you need to call it for all rows in df (your gene_list_and_Go_ID). If so, just iterate over the elements of the list returned by strsplit():
names(spl) <- with(df, as.character(V1))
term <- lapply(spl, function(x, db) select(db, x, "TERM", "GOID"),
db = GO.db)
This will return a list where each element is the result of a call to select for a single gene / row of df.
Putting it back together you probably want:
out <- cbind.data.frame(Gene = rep(names(spl), each = lengths(spl)),
do.call("rbind", term))
But I can't test the last few parts as I have no idea where select() comes from nor what created GO.db
OK, thanks to Gavin's answer and his kind help, I got the right script. But there was a very important step that blocked me: converting the second column of my "gene_list_and_Go_ID" data frame from factors to characters. I did this to avoid the "non-character argument" error from strsplit(). This post helped me: LINK
So here is my starting data frame:
>gene_list_and_Go_ID
V1 V2
2563 Gene1 GO:0003871, GO:0008270, GO:0008652, GO:0009086
2580 Gene2 GO:0003871, GO:0008270, GO:0008652, GO:0009086
12686 Gene3 GO:0003871, GO:0008270, GO:0008652, GO:0009086
14523 Gene4 GO:0004489, GO:0006555, GO:0055114
Next, the script.
The first new line proved very useful (it converts my data frame from factors to characters):
>gene_list_and_Go_ID <- data.frame(lapply(gene_list_and_Go_ID, as.character), stringsAsFactors=FALSE)
next:
>V_ID <- with(gene_list_and_Go_ID, strsplit(V2, ", "))
>names(V_ID) <- with(gene_list_and_Go_ID, as.character(V1))
>terms <- lapply(V_ID, function(x, db) select(db, x, "TERM", "GOID"), db = GO.db)
Final output is perfect :-) :
> terms
$Gene1
GOID TERM
1 GO:0003871 S-methyltransferase activity
2 GO:0008270 zinc ion binding
3 GO:0008652 cellular amino acid biosynthetic process
4 GO:0009086 methionine biosynthetic process
$Gene2
... etc ...
... etc ...
Note: I skipped Gavin's last suggestion:
out <- cbind.data.frame(Gene = rep(names(spl), each = lengths(spl)),
do.call("rbind", term))
It may be a very elegant script, but I have difficulty understanding everything it does, and here is what it generates:
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 16, 15
In addition: Warning message:
In rep(names(V_ID), each = lengths(V_ID)) :
first element used of 'each' argument
THX
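For what it's worth, the error above looks like it comes from rep()'s each argument, which must be a single number, while lengths(V_ID) is a vector. Using times = instead repeats each gene name by its own count. A sketch, untested against these data:

```r
# `times` may be a vector (one repetition count per element), unlike `each`
out <- cbind.data.frame(Gene = rep(names(V_ID), times = lengths(V_ID)),
                        do.call("rbind", terms))
```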

How can I retrieve gene annotation info (more specifically, functions) of specific genes in R?

I have a list of genes as row-names of my eset like:
iDs <- head(rownames(eset))
[1] "LGI1" "ATE1" "NELL1" "CCND1" "FADD" "PPFIA1" "ORAOV1" "FOXN4" "NOVA1" "PTGER2" "GPX2" "DLK1"
[13] "ZG16" "MYH2" "NPTX1"
How can I find GO info (more specifically, function) of these genes in R?
This was my solution but I am not sure if I'm doing it right:
library(biomaRt)
mart = useMart("ensembl", dataset="hsapiens_gene_ensembl")
getGene( id = iDs[1] , type = "hgnc_symbol", mart = mart)
hgnc_symbol hgnc_symbol description chromosome_name
1 LGI1 LGI1 leucine-rich, glioma inactivated 1 [Source:HGNC Symbol;Acc:HGNC:6572] 10
band strand start_position end_position ensembl_gene_id
1 q23.33 1 93757809 93798174 ENSG00000108231
As I mentioned before, what I want is to find out the function of these genes, not the description or their location.
Any helps would be appreciated.
Thanks,
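One possible direction for GO terms rather than descriptions (untested; the attribute names "go_id" and "name_1006" are taken from Ensembl BioMart and may vary by release, so check listAttributes(mart) first) is to query GO attributes with getBM() instead of getGene():

```r
library(biomaRt)

mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

# Assumption: "go_id" holds the GO accession and "name_1006" the GO term
# name in this mart; verify with listAttributes(mart) before relying on them
go_info <- getBM(attributes = c("hgnc_symbol", "go_id", "name_1006"),
                 filters = "hgnc_symbol",
                 values = iDs,
                 mart = mart)
```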

Subset by function's variable using $variable

I am having trouble subsetting from a list using a variable of my function.
rankhospital <- function(state, outcome, num = "best") {
  # code here
  e3 <- data.frame(..., state.name, ...)
  if (num == "worst") {
    return(worst(state, outcome))
  } else if ((num %in% b == "TRUE" & outcome == "heart attack") == "TRUE") {
    sep <- split(e3, e3$state.name)
    hosp.estado <- sep$state
    hospital <- hosp.estado[num, 1]
    return(as.character(hospital))
  }
}
I split my data frame by state (which is a variable of my function)
But hosp.estado<-sep$state doesn't work. I have also tried as.data.frame.
The function call (rankhospital("NY", ....)) returns character(0).
When I feed sep$state with sep$"NY" directly in the code it works perfectly, so I guess the problem is that I can't use a function's variable this way. Am I right? What could I use instead?
Thank you!!
If state is a variable in your function, you can refer to a column with the name given by state using: sep[state] or sep[[state]]. The first produces a data frame with one column named based on the value of state. The second produces an unnamed vector.
df=data.frame(NY=rnorm(10),CA=rnorm(10), IL=rnorm(10))
state="NY"
df[state]
# NY
# 1 -0.79533912
# 2 -0.05487747
# 3 0.25014132
# 4 0.61824329
# 5 -0.17262350
# 6 -2.22390027
# 7 -1.26361438
# 8 0.35872890
# 9 -0.01104548
# 10 -0.94064916
df[[state]]
# [1] -0.79533912 -0.05487747 0.25014132 0.61824329 -0.17262350 -2.22390027 -1.26361438 0.35872890 -0.01104548 -0.94064916
class(df[state])
# [1] "data.frame"
class(df[[state]])
# [1] "numeric"
It seems like you are trying to get the top hospital in a state. You don't want to split here (see the result of sep to see what I mean). Instead, use:
as.character(e3[e3$state.name==state, 1][num])
This hopefully does what you want.
You need sep[[state]] instead of sep$state to get the data frame out of your sep list, which matches the state parameter of your function. Like this:
e3 <- read.csv("https://raw.github.com/Hindol/data-analysis-coursera/master/HW3/hospital-data.csv")
state <- "WY"
num <- 1:5
sep<-split(e3,e3$State)
hosp.estado<-sep[[state]]
hospital<-hosp.estado[num,1]
as.character(hospital)
# [1] "530002" "530006" "530008" "530010" "530011"