Extract the names from an ordered vector in R

I am studying http://cran.r-project.org/doc/contrib/Krijnen-IntroBioInfStatistics.pdf
Here's my question: how do I extract the names from an ordered vector? The exercise in the book asks for the gene identifiers of the three genes with the largest mean (among the patients in disease stage B1) as output.
The data set is from the package "ALL":
source("http://bioconductor.org/biocLite.R")
biocLite("ALL")
Here's what I have so far:
library("ALL")
data("ALL")
B1 <- exprs(ALL[,ALL$BT=="B1"])
hist(B1)
mean(B1)
meanB1 <- apply(exprs(ALL[,ALL$BT=="B1"]),1,mean)
omeanB1 <- order((meanB1), decreasing=TRUE)
I'm wondering if there is a particular function I can call in R to extract just the names of the genes. In the package "golub", there is golub.gnames to help extract the gene names.

It seems to me that you're almost there. Once you have the order, you can apply it to meanB1:
head(meanB1[omeanB1])
# AFFX-hum_alu_at 31962_at 31957_r_at 40887_g_at 36546_r_at
# 13.41648 13.16671 13.15995 13.10987 12.94578
# 1288_s_at
# 12.80290
To get the names of the top three genes, you can do:
names(meanB1[omeanB1])[1:3]
# [1] "AFFX-hum_alu_at" "31962_at" "31957_r_at"
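The same idea on a small made-up vector, as a minimal sketch (the gene names and values below are invented for illustration and do not come from ALL). It also shows sort(), which keeps the names attached, as an alternative to order():

```r
# Toy named vector standing in for meanB1 (names and values are made up)
gene_means <- c(g1 = 2.5, g2 = 9.1, g3 = 4.0, g4 = 7.3, g5 = 1.2)

# order() returns positions; indexing names() with them gives the ordered names
idx <- order(gene_means, decreasing = TRUE)
top3_by_order <- names(gene_means)[idx[1:3]]

# sort() keeps the names attached, so names() can be applied directly
top3_by_sort <- names(sort(gene_means, decreasing = TRUE))[1:3]

print(top3_by_order)
```

Both routes should give the same three names. For a numeric matrix such as B1, rowMeans(B1) computes the same per-row means as apply(B1, 1, mean).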

Related

ID Conversion from Human Gene Symbols to Zebrafish Gene Symbols

I am looking for an R solution (or a general logic solution) to convert Homo sapiens gene names into Danio rerio gene names. My coding skills are fairly basic, so I tried writing something with for-loops and if-statements, but it picks up only one of the ortholog genes even when there are several. For example, the human gene REG3G has three zebrafish orthologs: si:ch211-125e6.13, zgc:172053, lectin. I have added the code I wrote; it picks up only the last one, but I would like it to output all three.
I have also had trouble finding R/biomaRt code for this task and would love any advice.
# Read excel file containing list of zebrafish genes and their human orthologs.
library(readxl)
ortho_genes <- read_excel("/Users/talha/Desktop/Ortho_Gene_List.xlsx")
# Separate data from excel file into lists.
zebrafish <- ortho_genes$`Zebra Gene Name`
human <- ortho_genes$`Human Gene Name`
# Read sample list of differential expressed genes
sample_list <- c("GREB1L","SIN3B","NCAPG2","FAM50A","PSMD12","BPTF","SLF2","SMC5", "SMC6", "TMEM260","SSBP1","TCF12", "ANLN", "TFAM", "DDX3X","REG3G")
# Make a matrix with one row per gene in the supplied list.
final_m <- matrix(nrow=length(sample_list),ncol=2)
# Iterate through every gene in the supplied list
for(x in 1:length(sample_list)){
  # Iterate through every human gene
  for(y in 1:length(human)){
    # If the gene from the supplied list matches a human gene
    if(sample_list[x] == human[y]){
      # Fill our matrix in with the supplied gene and the zebrafish ortholog
      # that matches up with the cell of the human gene
      final_m[x,1] = sample_list[x]
      final_m[x,2] = zebrafish[y]
    }
  }
}
You didn't specify the structure of ortho_genes. This is my guess:
ortho_genes <- tibble::tibble(Zebra = c("greb1l", "sin3b", "ncapg2", "fam50a", "psmd12", "bptf", "fam178a"),
Human = c("GREB1L","SIN3B","NCAPG2","FAM50A","PSMD12","BPTF","SLF2"))
You can simply index the table with sample_list (it's a vector, not a list)
sample_list <- c("NCAPG2", "SLF2", "GREB1L")
ortho_genes[ortho_genes$Human %in% sample_list,]
You also didn't specify how you want the output. Do you need a matrix? If you want to write the result into a file, a matrix may not be optimal.
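To illustrate that this kind of indexing keeps every ortholog, here is a minimal sketch on a hypothetical table. The REG3G rows reuse the zebrafish names from the question; the other rows and the column names Zebra and Human are assumptions carried over from the guess above:

```r
# Hypothetical ortholog table; the REG3G rows use the names from the question,
# the remaining rows are placeholders
ortho_genes <- data.frame(
  Zebra = c("greb1l", "si:ch211-125e6.13", "zgc:172053", "lectin", "sin3b"),
  Human = c("GREB1L", "REG3G", "REG3G", "REG3G", "SIN3B"),
  stringsAsFactors = FALSE
)
sample_list <- c("REG3G", "GREB1L")

# %in% keeps every row whose human gene is in sample_list,
# so all orthologs of a gene survive, not just the last one
hits <- ortho_genes[ortho_genes$Human %in% sample_list, ]

# Optionally group the zebrafish names by human gene
by_human <- split(hits$Zebra, hits$Human)
print(by_human)
```

The grouped list is one possible output shape; a two-column table like hits may be easier to write to a file.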

Comparing character lists in R

I have two lists of characters that I read in from Excel files.
One is a very long list of all bird species that have been documented in a region (allBirds), and the other is a much shorter list of species that were recently seen in a sample location (sampleBirds). I want to write a section of code that will compare the lists and tell me which sampleBirds show up in the allBirds list. Both are lists of characters.
I have tried:
# Load the xlsx files
library(readxl)
Full_table <- read_excel("Full_table.xlsx")
Pathogen_table <- read_excel("pathogens.xlsx")
# Read the species column into a new data frame
species <- c(as.data.frame(Full_table[,7], drop=FALSE))
pathogens <- c(as.data.frame(Pathogen_table[,3], drop=FALSE))
intersect(pathogens, species)
intersect(species, pathogens)
but intersect returns a list of length 0, which I know cannot be right. Any suggestions?
Maybe you can try the match() function or "==".
You need to run the intersect on the individual columns that are stored in the list:
> a <- c(data.frame(c1=as.factor(c('a', 'q'))))
> b <- c(data.frame(c1=as.factor(c('w', 'a'))))
> intersect(a,b)
list()
> intersect(a$c1,b$c1)
[1] "a"
This will probably do in your case ([[ ]] extracts the column as a plain vector, whereas [,7] on a table read with read_excel keeps it wrapped in a one-column table):
intersect(Full_table[[7]], Pathogen_table[[3]])
Or, if you insist on creating the intermediate objects:
intersect(pathogens[[1]], species[[1]])
where [[1]] selects the first (and only) column. Note that by using c(as.data.frame(... you are converting the data.frame to a regular list. I'd go with only as.data.frame(....
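A minimal sketch of the comparison on plain character vectors, with made-up species names standing in for the two Excel columns (allBirds and sampleBirds are the names used in the question):

```r
# Made-up species names standing in for the two Excel columns
allBirds <- c("Turdus merula", "Parus major", "Pica pica", "Corvus corax")
sampleBirds <- c("Pica pica", "Sturnus vulgaris", "Parus major")

# intersect() on plain vectors returns the shared values
shared <- intersect(sampleBirds, allBirds)

# %in% gives one TRUE/FALSE per sampled species
seen_before <- sampleBirds %in% allBirds

# Wrapping each vector in a list compares whole columns instead of values,
# which reproduces the empty result from the question
empty <- intersect(list(sampleBirds), list(allBirds))

print(shared)
```

The %in% form is handy when the result should stay aligned with the sample table, for example as a new logical column.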

min() does not work as expected

I am trying to get the minimum of a column.
The data has been split into groups using the "abbr" factor. My objective is to return the values in column 2 that correspond to the minimum of the column whose number is passed as an argument. If it helps, this is part of the Coursera R Programming introductory course.
The minimum is supposed to be somewhere around 8, but it shows 10.
Please help me here.
Here's the link to the CSV file, which I read with read.csv:
https://drive.google.com/file/d/0Bxkj3-FNtxqrLW14MFZCeEl6UGc/view?usp=sharing
best <- function(abbr, outvar) {
  ## outcome is a data frame containing (among many others) a column labelled "State"
  ## outvar is the desired column number
  statecol <- split(outcome, outcome$State)  ## State is a factor; abbr selects one of its levels
  dislist <- statecol[[abbr]][, 2][statecol[[abbr]][, outvar] ==
                                     min(statecol[[abbr]][, outvar])]
  dislist
}
In my opinion the problem is the handling of NA values: make sure "Not Available" is read as NA (via na.strings) and pass na.rm=TRUE to min().
filedata <- read.table(file.choose(), quote = '"', sep = ",", dec = ".", header = TRUE,
                       stringsAsFactors = FALSE, na.strings = "Not Available")
f <- function(df, abbr, outVar, na.rm = TRUE) {
  outlist <- split(df, df["State"])
  tempCol <- outlist[[abbr]][outVar]
  outlist[[abbr]][, 2][which(tempCol == min(tempCol, na.rm = na.rm))]
}
f(filedata, "AK", 44)
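One possible explanation for seeing 10 instead of 8, sketched on invented values: if the column was read as text because it contains "Not Available", min() compares the entries as strings rather than as numbers. This is an assumption about the data, not something confirmed from the file:

```r
# Made-up values; a "Not Available" entry forces the column to be character
rates_chr <- c("10.1", "8.9", "Not Available", "12.4")

# Character comparison is lexical, so "10.1" sorts before "8.9"
min_as_text <- min(rates_chr)

# Turn the placeholder into NA first, then compare numerically
rates_num <- as.numeric(replace(rates_chr, rates_chr == "Not Available", NA))
min_as_number <- min(rates_num, na.rm = TRUE)

print(c(min_as_text, min_as_number))
```

Reading the file with na.strings = "Not Available", as above, avoids the manual conversion step.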

R unique function vs eliminating duplicated values

I have a data frame and an info file (also in table format) that describes the data within the data frame. The row names of the data frame need to be relabelled according to information within the info file. The problem is that the information corresponding to the data frame row names, in the info file, contains lots of duplicated values. Hence it is necessary to convert the df to a matrix such that the row names can have duplicate values.
matrix1<-as.matrix(df)
ptr<-match(rownames(matrix1), info_file$Array_Address_Id)
rownames(matrix1)<-info_file$ILMN_Gene[ptr]
matrix1<-matrix1[!duplicated(rownames(E.rna_conditions_cleaned)), ]
The above is my own code; however, a friend gave me some code with a similar goal but different results:
u.genes <- unique(info_file$ILMN_Gene)
ptr.u.genes <- match( u.genes, info_file$ILMN_Gene )
matrix2 <- as.matrix(df[ptr.u.genes,])
rownames(matrix2) <- u.genes
The problem is that these two strategies output different results:
> dim(matrix1)
[1] 30783 565
> dim(matrix2[,ptr.use])
[1] 34694 565
As shown above, matrix2 has ~4000 more rows than matrix1.
As you can see, the row names in the output below are indeed unique, but that doesn't explain why the two methods selected different rows. Which method is better, and why is the output different?
          U.95   JIA.65    DV.93    KD.76    HC.54    KD.77
7A5   5.136470 5.657738 5.122299 5.195540 5.378040 4.997210
A1BG  6.166210 6.210373 6.382051 6.494048 5.888900 5.914070
A1CF  5.222130 4.940529 4.715292 5.182658 4.510937 5.060749
A26C3 5.410403 5.148601 5.122299 3.967419 4.780758 4.868472
A2BP1 5.725115 4.817920 5.483607 5.444427 5.503358 5.121951
A2LD1 6.505271 6.558276 5.494096 4.833267 6.988192 6.082662
I need to know this because I want the rows that will yield the most accurate downstream analysis.
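One way the two strategies can end up with different row counts, sketched on invented data (a guess at the mechanism, not a diagnosis of the real files): the first approach can only keep genes whose probes are rows of df, while the second creates one row per unique gene in the whole info file and indexes df with positions that refer to info_file. In this toy version, rownames(matrix1) is used in the duplicated() step, which is an assumption about what the question's code intended:

```r
# Invented annotation table: probe a4 is annotated but absent from the data
info_file <- data.frame(
  Array_Address_Id = c("a1", "a2", "a3", "a4"),
  ILMN_Gene = c("G1", "G1", "G2", "G3"),
  stringsAsFactors = FALSE
)
df <- data.frame(s1 = 1:3, s2 = 4:6, row.names = c("a1", "a2", "a3"))

# Approach 1: relabel the data rows, then drop repeated gene names
matrix1 <- as.matrix(df)
ptr <- match(rownames(matrix1), info_file$Array_Address_Id)
rownames(matrix1) <- info_file$ILMN_Gene[ptr]
matrix1 <- matrix1[!duplicated(rownames(matrix1)), , drop = FALSE]

# Approach 2: one row per unique gene in the annotation table;
# note that ptr.u.genes holds positions in info_file, not in df
u.genes <- unique(info_file$ILMN_Gene)
ptr.u.genes <- match(u.genes, info_file$ILMN_Gene)
matrix2 <- as.matrix(df[ptr.u.genes, ])
rownames(matrix2) <- u.genes

print(dim(matrix1))
print(dim(matrix2))
```

In this toy case matrix2 gains a row of NA values for the gene that has no probe in df, and its other rows are only right if df and info_file happen to be in the same order, so checking anyNA(matrix2) and the row order on the real data may be informative.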

perform function on pairs of columns

I am trying to run some Monte Carlo simulations on animal position data. So far, I have sampled 100 X and Y coordinates, 100 times. This results in a list of 200 elements. I then convert this list into a data frame that is more conducive to the functions I eventually want to run for each sample (kernel.area).
Now I have a data frame with 200 columns, and I would like to perform the kernel.area function using each successive pair of columns.
I can't reproduce my own data here very well, so I've tried to give a basic example just to show the structure of the data frame I'm working with. I've included the for loop I've tried so far, but I am still an R novice and would appreciate any suggestions.
# generate dataframe representing X and Y positions
df <- data.frame(x=seq(1:200),y=seq(1:200))
# 100 replications of sampling 100 "positions"
resamp <- replicate(100,df[sample(nrow(df),100),])
# convert to data frame (kernel.area needs an xy dataframe)
df2 <- do.call("rbind", resamp[1:2,])
# xy positions need to be in columns for kernel.area
df3 <- t(df2)
#edit: kernel.area requires you have an id field, but I am only dealing with one individual, so I'll construct a fake one of the same length as the positions
id=replicate(100,c("id"))
id=data.frame(id)
Here is the structure of the for loop I've tried (edited since first post):
for (j in seq(1,ncol(df3)-1,2)) {
  kud <- kernel.area(df3[,j:(j+1)],id=id,kern="bivnorm",unin=c("m"),unout=c("km2"))
  print(kud)
}
My end goal is to calculate kernel.area for each resampling event (i.e. rows 1:100 for every pair of columns up to 200), and to be able to combine the results in a data frame. However, after running the loop, I get this error message:
Error in df[, 1] : incorrect number of dimensions
Edit: I realised my id format was not the same as my data frame, so I changed it and now get the error:
Error in kernelUD(xy, id, h, grid, same4all, hlim, kern, extent) :
id should have the same length as xy
First, a disclaimer: I have never worked with the package adehabitat, which has a function kernel.area, which I assume you are using. Perhaps you could confirm which package contains the function in question.
I think there are a couple suggestions I can make that are independent of knowledge of the specific package, though.
The first lies in the creation of df3. This should probably be df3 <- t(df2), but this is most likely correct in your actual code and just a typo in your post.
The second suggestion has to do with the way you subset df3 in the loop. j:j+1 is just a single number, since : has higher precedence than + (see ?Syntax for the order in which operators are evaluated in R). To get the desired two columns, use j:(j+1) instead.
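The precedence point can be checked directly in the console; a minimal sketch with an arbitrary value of j:

```r
j <- 3

# ":" binds tighter than "+", so this is (j:j) + 1, a single number
wrong <- j:j+1

# Parentheses give the intended two consecutive column indices
right <- j:(j+1)

print(wrong)
print(right)
```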
EDIT:
When loading adehabitat, I was warned to "Be careful" and use the related new packages, among which is adehabitatHR, which also contains a function kernel.area. This function has slightly different syntax and behavior, but perhaps it would be worthwhile examining. Using adehabitatHR (I had to install from source since the package is not available for R 2.15.0), I was able to do the following.
library(adehabitatHR)
for (j in seq(1,ncol(df3)-1,2)) {
  kud <-kernelUD(SpatialPoints(df3[,j:(j+1)]),kern="bivnorm")
  kernAr<-kernel.area(kud,unin=c("m"),unout=c("km2"))
  print(kernAr)
}
detach(package:adehabitatHR, unload=TRUE)
This prints something, and as is mentioned in a comment below, kernelUD() is called before kernel.area().
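To combine the per-pair results into one data frame instead of printing them, one option is to collect them with lapply() and bind the rows afterwards. The sketch below uses only base R: summarise_pair() is a hypothetical placeholder for the kernelUD()/kernel.area() step, and df3 is rebuilt in a simplified way with the same 100 x 200 layout as in the question:

```r
# Simplified stand-in for df3: 100 rows, 200 columns laid out as x, y, x, y, ...
set.seed(1)
df <- data.frame(x = 1:200, y = 1:200)
df3 <- do.call(cbind, lapply(1:100, function(i) {
  as.matrix(df[sample(nrow(df), 100), ])
}))

# Hypothetical placeholder for the kernelUD()/kernel.area() step
summarise_pair <- function(xy) {
  data.frame(mean_x = mean(xy[, 1]), mean_y = mean(xy[, 2]))
}

# One result row per resampling event, i.e. per pair of columns
starts <- seq(1, ncol(df3) - 1, by = 2)
results <- do.call(rbind, lapply(starts, function(j) {
  cbind(resample = (j + 1) / 2, summarise_pair(df3[, j:(j + 1)]))
}))

print(head(results))
```

Swapping the body of summarise_pair() for the real home-range computation should keep the same structure, provided that step returns one row (or one vector of fixed length) per pair.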
