ID Conversion from Human Gene Symbols to Zebrafish Gene Symbols - r

I am looking for an R solution (or a general logic solution) to convert Homo sapiens gene names into Danio rerio gene names. My coding skills are fairly basic, so I tried writing something with for-loops and if-statements, but it picks up only one ortholog even when there are several. For example, the human gene REG3G has three zebrafish orthologs: si:ch211-125e6.13, zgc:172053, and lectin. The code I wrote (below) keeps only the last match, but I would like it to output all three.
I have also had trouble finding R/biomaRt code for this task and would love any advice.
library(readxl)
# Read excel file containing list of zebrafish genes and their human orthologs.
ortho_genes <- read_excel("/Users/talha/Desktop/Ortho_Gene_List.xlsx")
# Separate data from excel file into vectors.
zebrafish <- ortho_genes$`Zebra Gene Name`
human <- ortho_genes$`Human Gene Name`
# Read sample list of differentially expressed genes
sample_list <- c("GREB1L","SIN3B","NCAPG2","FAM50A","PSMD12","BPTF","SLF2","SMC5", "SMC6", "TMEM260","SSBP1","TCF12", "ANLN", "TFAM", "DDX3X","REG3G")
# Make a matrix with the same number of rows as genes in the supplied list.
final_m <- matrix(nrow=length(sample_list),ncol=2)
# Iterate through every gene in the supplied list
for(x in 1:length(sample_list)){
  # Iterate through every human gene
  for(y in 1:length(human)){
    # If the gene from the supplied list matches a human gene
    if(sample_list[x] == human[y]){
      # Fill our matrix in with the supplied gene and the zebrafish ortholog
      # that matches up with the cell of the human gene
      final_m[x,1] = sample_list[x]
      final_m[x,2] = zebrafish[y]
    }
  }
}

You didn't specify the structure of ortho_genes. This is my guess:
ortho_genes <- tibble::tibble(Zebra = c("greb1l", "sin3b", "ncapg2", "fam50a", "psmd12", "bptf", "fam178a"),
Human = c("GREB1L","SIN3B","NCAPG2","FAM50A","PSMD12","BPTF","SLF2"))
You can simply index the table with sample_list (it's a vector, not a list):
sample_list <- c("NCAPG2", "SLF2", "GREB1L")
ortho_genes[ortho_genes$Human %in% sample_list,]
You also didn't specify how you want the output. Do you need a matrix? If you want to write the result into a file, a matrix may not be optimal.
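As for why the original loop reports only one ortholog: every match for the same human gene writes to the same cell final_m[x,2], so the last match overwrites the earlier ones. Here is a minimal sketch of the %in% approach on a toy table. The table layout and column names are my assumptions; the REG3G rows use the three orthologs named in the question. It keeps every matching row and can optionally collapse them to one string per human gene:

```r
# Toy ortholog table (illustrative; not the asker's real spreadsheet).
ortho_genes <- data.frame(
  Human = c("REG3G", "REG3G", "REG3G", "SLF2", "BPTF"),
  Zebra = c("si:ch211-125e6.13", "zgc:172053", "lectin", "fam178a", "bptf"),
  stringsAsFactors = FALSE
)
sample_list <- c("REG3G", "SLF2", "TFAM")

# %in% keeps every matching row, so all three REG3G orthologs survive.
hits <- ortho_genes[ortho_genes$Human %in% sample_list, ]

# Optionally collapse to one comma-separated entry per human gene.
by_gene <- split(hits$Zebra, hits$Human)
collapsed <- vapply(by_gene, paste, character(1), collapse = ", ")

# Genes from sample_list that have no ortholog in the table.
unmatched <- setdiff(sample_list, ortho_genes$Human)

print(hits)
print(collapsed)
print(unmatched)
```

Keeping the long format (one row per human/zebrafish pair) is usually easier to write to a file or join against other tables than a two-column matrix.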

Related

Comparing character lists in R

I have two lists of characters that I read in from Excel files.
One is a very long list of all bird species that have been documented in a region (allBirds), and the other is a much shorter list of species that were recently seen at a sample location (sampleBirds). I want to write a section of code that will compare the lists and tell me which sampleBirds show up in the allBirds list. Both are lists of characters.
I have tried:
# upload xlsx file
Full_table <- read_excel("Full_table.xlsx")
Pathogen_table <- read_excel("pathogens.xlsx")
# read species column into a new dataframe
species <-c(as.data.frame(Full_table[,7], drop=FALSE))
pathogens <- c(as.data.frame(Pathogen_table[,3], drop=FALSE))
intersect(pathogens, species)
intersect(species, pathogens)
but intersect is returning lists of length 0, which I know cannot be right. Any suggestions?
Maybe you can try match() function or "==".
You need to run the intersect on the individual columns that are stored in the list:
> a <- c(data.frame(c1=as.factor(c('a', 'q'))))
> b <- c(data.frame(c1=as.factor(c('w', 'a'))))
> intersect(a,b)
list()
> intersect(a$c1,b$c1)
[1] "a"
This will probably do in your case ([[ ]] extracts the column as a plain vector; on a tibble returned by read_excel, [,7] would give a one-column tibble instead):
intersect(Full_table[[7]], Pathogen_table[[3]])
Or if you insist on creating the intermediate objects:
intersect(pathogens[[1]], species[[1]])
where [[1]] selects the first (and only) element of each list, i.e. the column itself. Note that by using c(as.data.frame(...)) you are converting the data.frame to a regular list. I'd go with only as.data.frame(...).
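A small self-contained sketch of the same idea, using the bird wording from the question. The species names and the column name are made up for illustration. Besides intersect(), %in% gives a TRUE/FALSE flag per sample species, which is handy if you also want to see which ones were not found:

```r
# Hypothetical data standing in for the two Excel sheets.
allBirds <- data.frame(
  species = c("Robin", "Sparrow", "Heron", "Kestrel"),
  stringsAsFactors = FALSE
)
sampleBirds <- data.frame(
  species = c("Heron", "Puffin", "Robin"),
  stringsAsFactors = FALSE
)

# Compare the columns as plain character vectors, not as lists/data frames.
seen <- intersect(sampleBirds[["species"]], allBirds[["species"]])

# One logical flag per sampled species.
is_known <- sampleBirds$species %in% allBirds$species
not_found <- sampleBirds$species[!is_known]

print(seen)
print(not_found)
```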

In R, how do I modify a dataframe column in a list given a string name

I'm new to R. Thank you for your patience. I'm working with the survey package.
Background: I'm writing a function that loops through combinations of predictor and outcome variables (i.e., svyglm(outcome~predictor)) in a complex survey to output crude prevalence ratios. For each outcome/predictor combination, I want to first relevel the predictor within the survey design object to ensure the output ratios are all > 1.
Specific problem: Given the survey design object name, column name and reference level as strings, how do I tell R that I want said column releveled?
prams16 is the name of the survey design object which includes a list of 9 items, variables is the analytic dataset (data frame) within the survey design object and mrace is a column in the variables DF.
These work:
prams16$variables$mrace <- relevel(prams16$variables$mrace, ref="White")
prams16[["variables"]]["mrace"] <- relevel(prams16$variables$mrace, ref="White")
However, when I try to construct references to prams16$variables$mrace or prams16[["variables"]]["mrace"] with strings, nothing seems to work.
Thanks!
EDIT: Requested reproducible example of problem.
library(tibble)
myPredictor <- as.factor(c("Red","White","Black","Red","Green","Black","White","Black","Red","Green","Black"))
myOutcome <- c(1,0,1,0,1,0,1,0,1,0,1)
myDF <- tibble(myPredictor, myOutcome)
myOtherStuff <- c("etc","etc")
myObj <- list(myDF=myDF,myOtherStuff=myOtherStuff)
#These work...
myObj$myDF$myPredictor <- relevel(myObj$myDF$myPredictor, ref="White")
str(myObj$myDF$myPredictor) #"White" is now the referent level
myObj[["myDF"]]["myPredictor"] <- relevel(myObj$myDF$myPredictor, ref="Red")
str(myObj$myDF$myPredictor) #"Red" is now the referent level
#How to construct relevel assignment statement from strings?
anObj <- "myObj"
aPredictor <- "myPredictor"
aRef <- "Green"
#Produces error
as.name(paste0(anObj,"$myDF$",aPredictor)) <- relevel(as.name(paste0(anObj,"$myDF$",aPredictor)), ref=aRef)
Here's a way to solve this using expression arithmetic. Our task is to construct and evaluate the following expression:
myObj$myDF[[aPredictor]] <- relevel( myObj$myDF[[aPredictor]], ref=aRef )
Step 1: Convert the string "myObj" to a symbolic name:
sObj <- rlang::sym(anObj) # Option 1
sObj <- as.name(anObj) # Option 2
Step 2: Construct the expression myObj$myDF[[aPredictor]]:
e1 <- rlang::expr( (!!sObj)$myDF[[aPredictor]] )
Here, we use !! to tell rlang::expr that we want to replace sObj with whatever symbol is stored inside that variable. Without !!, the expression would be sObj$myDF[[aPredictor]], which is not quite what we want.
Step 3: Construct the target expression:
e2 <- rlang::expr( !!e1 <- relevel(!!e1, ref=aRef) )
As before, !! replaces e1 with whatever expression is stored inside it (i.e., what we constructed in Step 2).
Step 4: Evaluate the expression and inspect the result:
eval.parent(e2)
## The column is now correctly releveled to Green
myObj$myDF$myPredictor
# [1] Red White Black Red Green Black White Black Red Green Black
# Levels: Green Black Red White
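If you would rather avoid building expressions, a different technique (plain base R get()/assign(), not the rlang approach above) can do the same job. This is only a sketch: relevel_by_name is a hypothetical helper name, and it hard-codes the myDF element from the reproducible example. For a survey design object that element would be variables.

```r
# Small stand-in for the object in the question (data.frame instead of tibble).
myObj <- list(
  myDF = data.frame(
    myPredictor = factor(c("Red", "White", "Black", "Red", "Green", "Black")),
    myOutcome = c(1, 0, 1, 0, 1, 0)
  ),
  myOtherStuff = c("etc", "etc")
)

# Hypothetical helper: look the object up by name, relevel one column,
# and write the modified copy back under the same name.
relevel_by_name <- function(obj_name, col_name, ref, envir = parent.frame()) {
  obj <- get(obj_name, envir = envir)
  obj$myDF[[col_name]] <- relevel(obj$myDF[[col_name]], ref = ref)
  assign(obj_name, obj, envir = envir)
  invisible(obj)
}

relevel_by_name("myObj", "myPredictor", "Green")
print(levels(myObj$myDF$myPredictor))
```

The key point in both approaches is that [[ ]] accepts a string, so only the object name itself needs special handling.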

R unique function vs eliminating duplicated values

I have a data frame and an info file (also in table format) that describes the data within the data frame. The row names of the data frame need to be relabelled according to information within the info file. The problem is that the information corresponding to the data frame row names, in the info file, contains lots of duplicated values. Hence it is necessary to convert the df to a matrix such that the row names can have duplicate values.
matrix1<-as.matrix(df)
ptr<-match(rownames(matrix1), info_file$Array_Address_Id)
rownames(matrix1)<-info_file$ILMN_Gene[ptr]
matrix1<-matrix1[!duplicated(rownames(E.rna_conditions_cleaned)), ]
The above is my own code; however, a friend gave me some code with a similar goal but different results:
u.genes <- unique(info_file$ILMN_Gene)
ptr.u.genes <- match( u.genes, info_file$ILMN_Gene )
matrix2 <- as.matrix(df[ptr.u.genes,])
rownames(matrix2) <- u.genes
The problem is that these two strategies output different results:
> dim(matrix1)
[1] 30783 565
> dim(matrix2[,ptr.use])
[1] 34694 565
As shown above, matrix2 has ~4000 more rows than matrix1.
As you can see, the row names of the output below are indeed unique, but that doesn't explain why the two methods selected different rows. Which method is better, and why is the output different?
          U.95   JIA.65    DV.93    KD.76    HC.54    KD.77
7A5   5.136470 5.657738 5.122299 5.195540 5.378040 4.997210
A1BG  6.166210 6.210373 6.382051 6.494048 5.888900 5.914070
A1CF  5.222130 4.940529 4.715292 5.182658 4.510937 5.060749
A26C3 5.410403 5.148601 5.122299 3.967419 4.780758 4.868472
A2BP1 5.725115 4.817920 5.483607 5.444427 5.503358 5.121951
A2LD1 6.505271 6.558276 5.494096 4.833267 6.988192 6.082662
I need to know this because I want to keep the rows that will yield the most accurate downstream analysis.
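One possible explanation, sketched on toy data; I can't confirm it from the post, and the probe IDs and values below are invented. The first approach looks each df row up in the info file by its probe ID, while the second indexes df by row positions taken from the info file. The two only agree if df has exactly the same rows, in the same order, as the info file. I also assume the duplicated() call was meant to use rownames(matrix1).

```r
# Toy info file: 5 probes, A1BG measured by two probes.
info_file <- data.frame(
  Array_Address_Id = c("10", "20", "30", "40", "50"),
  ILMN_Gene = c("A1BG", "A1BG", "A1CF", "A2LD1", "7A5"),
  stringsAsFactors = FALSE
)
# Toy expression data: only 3 of the probes, in a different order.
df <- data.frame(s1 = c(1, 2, 3), s2 = c(4, 5, 6),
                 row.names = c("30", "20", "40"))

# Approach 1: match df rows to the info file by probe ID, then drop repeats.
m1 <- as.matrix(df)
ptr <- match(rownames(m1), info_file$Array_Address_Id)
rownames(m1) <- info_file$ILMN_Gene[ptr]
m1 <- m1[!duplicated(rownames(m1)), , drop = FALSE]

# Approach 2: take info-file row positions and use them to index df.
u.genes <- unique(info_file$ILMN_Gene)
ptr.u.genes <- match(u.genes, info_file$ILMN_Gene)
m2 <- as.matrix(df[ptr.u.genes, ])
rownames(m2) <- u.genes

print(dim(m1))       # one row per gene actually present in df
print(dim(m2))       # one row per gene in the info file
print(m1["A1BG", ])  # values from probe 20
print(m2["A1BG", ])  # values from df row 1, which is probe 30 (A1CF)
```

In this toy case the second approach returns more rows (one per gene in the info file, some of them all NA) and attaches gene names to the wrong measurements. If your df really is in info-file order with no rows removed, this would not apply, so it is worth checking identical(rownames(df), as.character(info_file$Array_Address_Id)) first.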

Replacing values in a vector with values from another vector in R

I am having trouble replacing values in a vector with values from another vector. The basic program logic is this:
Establish a keyword list (code omitted for this).
Establish a maximum occurrence vector. This is a vector which contains the keywords and the maximum occurrence when the keywords are compared with CSV files.
For example, my keyword list looks like this
ABILITY
DEVELOPS
ENVIRONMENTAL
...
my maximum occurrence vector is created from the keyword list and is initialized like this
ABILITY 0
DEVELOPS 0
ENVIRONMENTAL 0
and now I am comparing the maximum occurrence vector with CSV files like this
file 1
ABILITY 3
DEVELOPS 5
ENVIRONMENTAL 4
file 2
ABILITY 5
DEVELOPS 7
ENVIRONMENTAL 1
So basically I would like to populate the maximum occurrence vector with the maximum from file 1 and 2. For example, in the maximum occurrence vector, the maximum occurrence for ENVIRONMENTAL should be changed to 4 (the maximum occurrence after scanning file 1 and file 2). Here's my code:
# Find the largest frequency of the given keywords by searching the keyword sets
# Start by defining and initializing the max occurrence vector
keywordslength=length(keywords)
keywordmax=data.frame(keywords)
keywordmax$Max=0
# Start by reading the keyword set and keeping the frequency of the keyword
ksearch1=read.csv("set1.csv",header=FALSE,sep=",")
ksearch1$V1=toupper(ksearch1$V1)
# Now scan ksearch1 for the word in question
for (i in 1:keywordslength)
{
  # Establish the keyword
  testkey=keywords[i]
  testmax=0
  # Scan ksearch1
  for (j in 1:length(ksearch1$V1))
  {
    if (ksearch1[j,1]==testkey)
    {
      testmax=ksearch1[j,2]
    }
    if (subset(keywordmax, keywords==testkey, select=c(Max))>=testmax)
    {
      keywordmax[which(keywords==testkey),2]=testmax
    }
  }
}
This should work.
Create the list of keywords and the two files:
keywords <- as.data.frame(c("Ability","Develops","Environmental"))
max_occur <- data.frame(keywords,c(0,0,0))
file1 <- data.frame(keywords,c(3,5,4))
file2 <- data.frame(keywords,c(5,7,1))
Rename the columns appropriately:
colnames(file1) <- c("V1","V2")
colnames(file2) <- c("V1","V2")
colnames(keywords) <- c("V1")
colnames(max_occur) <- c("V1","V2")
Sort the data frames by keyword so that the rows line up. Use order(), which returns row indices; sort() returns the sorted values themselves and cannot reliably be used to index rows:
keywords <- keywords[order(keywords$V1), , drop = FALSE]
max_occur <- max_occur[order(max_occur$V1), ]
file1 <- file1[order(file1$V1), ]
file2 <- file2[order(file2$V1), ]
Find the largest value for each keyword and store it in max_occur (pmax() takes the element-wise maximum, so no loop is needed):
max_occur$V2 <- pmax(max_occur$V2, file1$V2, file2$V2)
Let me know if not all keywords are present in each file, and I'll change the code a bit. From what you've posted, they all appear in each file.
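For the case where a keyword is missing from some file, here is a sketch of a different technique that does not rely on the rows lining up. It stacks the files and takes the maximum per keyword with tapply(). The data frames are typed in by hand from the question's example instead of being read with read.csv, and keywords that appear in no file are assumed to keep a maximum of 0.

```r
keywords <- c("ABILITY", "DEVELOPS", "ENVIRONMENTAL")

# Stand-ins for the CSV files, using the counts shown in the question.
file1 <- data.frame(V1 = c("ABILITY", "DEVELOPS", "ENVIRONMENTAL"),
                    V2 = c(3, 5, 4), stringsAsFactors = FALSE)
file2 <- data.frame(V1 = c("ABILITY", "DEVELOPS", "ENVIRONMENTAL"),
                    V2 = c(5, 7, 1), stringsAsFactors = FALSE)

# Stack all files and keep only rows for keywords of interest.
all_counts <- rbind(file1, file2)
all_counts <- all_counts[all_counts$V1 %in% keywords, ]

# Maximum count per keyword across every file.
max_by_key <- tapply(all_counts$V2, all_counts$V1, max)

# Look each keyword up; keywords found in no file fall back to 0.
idx <- match(keywords, names(max_by_key))
keywordmax <- data.frame(
  keywords = keywords,
  Max = ifelse(is.na(idx), 0, as.vector(max_by_key)[idx]),
  stringsAsFactors = FALSE
)
print(keywordmax)
```

Adding a third or fourth file only means adding it to the rbind() call.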

Extract the names from an ordered vector in R

I'm working through http://cran.r-project.org/doc/contrib/Krijnen-IntroBioInfStatistics.pdf
Here's my question: how do I extract the names from an ordered vector? The exercise in the book asks for the gene identifiers of the three genes with the largest mean (among the patients in disease stage B1).
The data set is from the package "ALL":
# biocLite() has been superseded by BiocManager for installing Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("ALL")
Here's what I've got so far:
library("ALL")
data("ALL")
B1 <- exprs(ALL[,ALL$BT=="B1"])
hist(B1)
mean(B1)
meanB1 <- apply(exprs(ALL[,ALL$BT=="B1"]),1,mean)
omeanB1 <- order((meanB1), decreasing=TRUE)
I'm wondering if there is a particular function I can call in R to extract just the names of the genes. For the golub data set there is golub.gnames to help extract the gene names.
It seems to me that you're almost there. Once you have the order, you can apply it to meanB1:
head(meanB1[omeanB1])
# AFFX-hum_alu_at 31962_at 31957_r_at 40887_g_at 36546_r_at
# 13.41648 13.16671 13.15995 13.10987 12.94578
# 1288_s_at
# 12.80290
To get the names of the top three genes, you can do:
names(meanB1[omeanB1])[1:3]
# [1] "AFFX-hum_alu_at" "31962_at" "31957_r_at"
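The same pattern on a tiny named vector, in case you want to try it without installing ALL. The gene names and values here are invented for illustration. apply(..., 1, mean) returns a vector named by the row names of the matrix, which is why names() is all you need:

```r
# Hypothetical per-gene means; in the real data these come from apply().
gene_means <- c(g1 = 5.2, g2 = 13.4, g3 = 9.8, g4 = 12.1, g5 = 7.0)

# order() returns positions, so index the names with the first three of them.
ord <- order(gene_means, decreasing = TRUE)
top3 <- names(gene_means)[ord[1:3]]

# Equivalent: sort() keeps the names attached to the values.
top3_alt <- names(sort(gene_means, decreasing = TRUE))[1:3]

print(top3)
```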
