A bioinformatics programming question. In R, I have a classic speciesA-to-speciesB gene symbol conversion, in this example from mouse to human, which I'm performing using biomaRt, and specifically the getLDS function.
x<-c("Lbp","Ndufv3","Ggt1")
require(biomaRt)
convert <- function(x) {
  human <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
  mouse <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")
  newgenes <- getLDS(
    attributes  = "mgi_symbol",
    filters     = "mgi_symbol",
    values      = x,
    mart        = mouse,
    attributesL = "hgnc_symbol",
    martL       = human,
    uniqueRows  = TRUE
  )
  humanx <- unique(newgenes)
  return(humanx)
}
conversion<-convert(x)
However, I would like to obtain ALL IDs present in the linked databases: in other words, all mouse/human pairs (in this example). Is there something I can pass to the values parameter of getLDS to retrieve all IDs, not just those specified in the x variable? I am talking about a full map, tens of thousands of lines long, specifying all orthologous relationships between symbols of the two databases.
Any ideas or workarounds? Thanks a lot!
I believe a workaround could be retrieving all IDs from the Biomart database itself, here: https://www.ensembl.org/biomart/martview/
Click on choose database -> Ensembl Genes
Choose dataset -> your selected species (e.g. Mouse genes)
Click on Results -> Check "Unique results only" -> Go
Profit
The list retrieved here currently has 53,605 IDs, which is, I believe, what you need.
Enjoy!
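If you want to stay inside R, here is a minimal sketch using getBM() on the mouse mart instead of getLDS(). The attribute and filter names ("hsapiens_homolog_associated_gene_name", "with_hsapiens_homolog") are what I believe the current Ensembl release uses, so double-check them with listAttributes(mouse) and listFilters(mouse):

library(biomaRt)

mouse <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")

# one row per mouse/human ortholog pair, for every mouse gene with a human homolog
full_map <- getBM(
  attributes = c("mgi_symbol", "hsapiens_homolog_associated_gene_name"),
  filters    = "with_hsapiens_homolog",
  values     = TRUE,
  mart       = mouse
)

full_map <- unique(full_map[full_map$mgi_symbol != "", ])
nrow(full_map)   # tens of thousands of symbol pairs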
I am currently researching publicly available payer transparency files across multiple insurers and I am trying to parse and extract JSON files using R and output them into .CSV files to later use with SQL. The file I am currently working with contains nested tables within the highest table.
I have attached the specific file I am working with right now in a link below, along with the code to mount it into R's dataviewer. I have used R extensively in healthcare analytics classes for statistical analysis and machine learning; though, I have never used R for building out data tables.
My goal is to assign a primary key to the highest level of the table, apply foreign and primary keys to lower tables, and extract the lower tables and join them onto each other later to build out a large CSV or TXT file to load into SQL.
So far, I have used the jsonlite and rjson packages to extract the JSON itself into R, but trying to de-list and unnest the tables within the tables remains an enigma to me even after extensive research. I also find myself running into problems with "subscript out of bounds", "unimplemented list" errors and other issues.
It could also very well be the case that the JSON is too large for R's packages or that the JSON is structurally flawed (I wouldn't know if it is; I am not accustomed to JSON). It seems that this could be a problem better solved with Python, though I don't know Python too well and I am optimistic about R given how powerful it is.
Any feedback or answers would be greatly appreciated.
JSON file link: https://individual.carefirst.com/carefirst-resources/machine-readable/Provider_Med_5.json
Code to load JSON:
json2 <- fromJSON('https://individual.carefirst.com/carefirst-resources/machine-readable/Provider_Med_5.json')
The JSON loads correctly, but there are tables embedded within tables. I would hope that these tables could be easily exported and given keys for joining, but I cannot figure out how to de-nest these tables from within the data.
Some nested tables are out of subscript bounds for the data array. I have never encountered this problem and am bewildered as to how to go about resolving the issue.
I cannot figure out how to 'extract' the lower-level tables, let alone open them, due to the subscript boundary error.
I can assign a row ID to the main/highest table in the file, but I cannot figure out how to add sub-row IDs to the lower tables for future joins.
Maybe the jsonStrings package can help. It allows you to manipulate JSON without converting it to an R object. This is the first time I have tried it on such a big JSON string, and it works fine.
Here is how to get the table in the first element of the JSON array:
options(timeout = 300)
download.file(
  "https://individual.carefirst.com/carefirst-resources/machine-readable/Provider_Med_5.json",
  "jsonFile.json"
)
library(jsonStrings)
# load the JSON file
jstring <- jsonString$new("jsonFile.json")
# extract table "plans" of first element (indexed by 0)
jsonTable <- jstring$at(0, "plans")
# get a dataframe
library(jsonlite)
dat <- fromJSON(jsonTable$asString())
But the dataframe dat has a list column. I don't know how you want to make a CSV with this dataframe.
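If you still need a flat CSV out of dat, one option (just a sketch, not tested against this particular file; "files" below is a hypothetical column name, replace it with whatever str(dat, max.level = 1) shows) is to expand the list column with tidyr before writing:

library(jsonlite)
library(tidyr)
library(tibble)

dat <- fromJSON(jsonTable$asString())

# hypothetical list-column name "files"; use the real one from str(dat)
flat <- as_tibble(dat) |>
  unnest(cols = files, names_sep = "_")   # use unnest_wider() instead if the
                                          # column holds named lists, not tables

write.csv(flat, "plans_flat.csv", row.names = FALSE)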
I have a list of local election candidates and I would like to find out
(i) if these individuals have a Twitter account, and
(ii) if so, what their screen names/usernames are.
search_users seemed to be the best option but it does not do a good job. Here is an example:
library(rtweet)
y1 <- search_users(q = "suleyman kilinc", n = 5, parse = TRUE)
This gives me a list of 5 users, and none of them is the one that I am looking for. This is often the case. But when I do the same search on Google with the keywords "suleyman+kilinc+twitter", the first result Google offers is exactly what I need. This is true for 95% of the random names that I manually searched. Is there a good way to automate the name-to-username search in R, or a better option than the search_users function?
Any help is appreciated.
It is a very interesting question. The q parameter accepts a string, as indicated above. When you pass a word with a space as the value of q, you are instructing the function to search for "suleyman" & "kilinc", hence "suleyman kilinc" is the same as "suleyman AND kilinc". The Twitter REST API in this case will return any user matching both "suleyman" and "kilinc", regardless of the order.
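A possible workaround (just a sketch; it assumes rtweet's search_users() returns a data frame with name and screen_name columns, and that the "best hit" is usually an exact display-name match) is to loop over the candidate names and keep only exact matches:

library(rtweet)

candidates <- c("suleyman kilinc", "ayse yilmaz")   # your list of names

lookup_handle <- function(full_name, n = 20) {
  res <- search_users(q = full_name, n = n, parse = TRUE)
  if (NROW(res) == 0) return(NA_character_)
  # keep users whose display name matches the query exactly (ignoring case)
  hit <- res[tolower(res$name) == tolower(full_name), ]
  if (NROW(hit) == 0) NA_character_ else hit$screen_name[1]
}

handles <- vapply(candidates, lookup_handle, character(1))
handles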
I currently do a lot of descriptive analysis in R. I always work with a data.table like df
net <- seq(1, 20, by = 2)
gross <- seq(2, 20, by = 2)
# recycled to the same length as net so data.table() builds a clean 10-row table
color <- rep(c("green", "blue", "white"), length.out = length(net))
height <- rep(c(170, 172, 180, 188), length.out = length(net))
library(data.table)
df <- data.table(net, gross, color, height)
In order to obtain results, I apply a lot of filters.
Sometimes I use one filter, sometimes I use a combination of multiple filters, e.g.:
df[color=="green" & height>175]
In my real data.table, I have 7 columns and all kind of filter-combinations.
Since I always address the same data.table, I'd like to find the most efficient way to filter the data.
So far, my files are organized like this (bottom-up):
execution level: multiple R-scripts with a very specific job (no interaction between them) that calculate and write the results to an excel file using XL Connect
source file: this file receives a pre-filtered data.table and sources all files from the execution level. It is necessary in case I add/remove files on the execution level.
filter files: read the data.table and apply one or multiple filters, as shown above with df_green_high. By filtering, the filter files create a new data.table and source the "source file" with this new filtered table.
I am currently struggling because I have too many filter files. With 7 variables, there are so many possible filter combinations that I'll get lost sooner or later.
How can I do my analysis more efficiently (and reduce the number of "filter files")?
How can I conveniently name the exported files according to the filters used?
I have read Workflow for statistical analysis and report writing and some other similar questions. However, in this case I always refer to the same basic table, so there should be a more efficient way. I do not have a CS background, so any help is highly appreciated. On SO, I also read about creating a package, but I am not sure if this is reasonable.
I usually do it like this:
create a list called say "my_case_list"
filter data, do computation on the filtered data
add a column called "case" to each filtered dataset, filled with a string describing the filter, e.g. "case 1: color=="green" & height>175"
put this data into my_case_list
convert list to data.frame like object
export results to sql server
import results from sql server to Excel Pivot table
make sense of results
Automate the process as much as possible.
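As a sketch of how the filter/label/bind steps above can be scripted (the case labels and filter expressions here are placeholders for your own), you can keep each filter as a quoted expression in a named list and loop over it:

library(data.table)

# one entry per "filter file": a label plus a quoted filter expression
cases <- list(
  "case 1: green & tall" = quote(color == "green" & height > 175),
  "case 2: blue"         = quote(color == "blue")
)

my_case_list <- lapply(names(cases), function(nm) {
  out <- df[eval(cases[[nm]])]   # apply the filter to the master table
  out[, case := nm]              # tag the rows with the case label
  out
})

results <- rbindlist(my_case_list)   # one table holding all cases

# the case label can also be reused to name exported files, e.g.
# fwrite(results, paste0("results_", Sys.Date(), ".csv"))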
For a project at work, I need to generate a table from a list of proposal ids, and a table with more data about some of those proposals (called "awards"). I'm having trouble with the match() function; the data in the "awards" table often has several rows that use the same ID, while the proposals frame has only one copy of each ID. From what I've tried, R ignores multiple rows and only returns the first match, when I need all of them. I haven't been able to find anything in documentation or through searches that helps me, though I have been having difficulty phrasing the right question.
Here's what I have so far:
#R CODE to add awards data on proposals to new data spreadsheet
#read tab delimited files
Awards=read.delim("O:/testing.txt",as.is=T)
Proposals=read.delim("O:/test.txt",as.is=T)
#match IDs from both spreadsheets
Proposals$TotalAwarded <- Awards$TotalAwarded[match(Proposals$IDs, Awards$IDs)]
write.table(Proposals,"O:/tested.txt",quote=F,row.names=F,sep="\t")
This does exactly what I want, except that only the first match is captured.
What's the best way to go forward? How do I make R utilize all of the matches available?
Thanks
See help on merge: ?merge
merge(Proposals, Awards, by = "IDs", all.y = TRUE)
But I cannot believe this hasn't been asked on SO before.
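To see the difference on toy data (hypothetical IDs, not your files): match() picks up only the first award per proposal, while merge() keeps every matching row.

Proposals <- data.frame(IDs = c("P1", "P2", "P3"))
Awards    <- data.frame(IDs = c("P1", "P1", "P3"),
                        TotalAwarded = c(1000, 500, 250))

# match(): one value per proposal, only the first award for P1 is found
Awards$TotalAwarded[match(Proposals$IDs, Awards$IDs)]
#> [1] 1000   NA  250

# merge(): one row per proposal/award pair, so P1 shows up twice
# (all.x = TRUE also keeps proposals that have no award yet, here P2)
merge(Proposals, Awards, by = "IDs", all.x = TRUE)
#> 4 rows: P1 appears twice (1000 and 500), P2 has NA, P3 has 250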
First off, this may be the wrong Forum for this question, as it's pretty darn R+Bioconductor specific. Here's what I have:
library('GEOquery')
GDS = getGEO('GDS785')
cd4T = GDS2eSet(GDS)
cd4T <- cd4T[fData(cd4T)$Gene.symbol != "", ]
Now cd4T is an ExpressionSet object which wraps a big matrix with 19794 rows (probesets) and 15 columns (samples). The final line gets rid of all probesets that do not have corresponding gene symbols. Now the trouble is that most genes in this set are assigned to more than one probeset. You can see this by doing
gene_symbols = factor(fData(cd4T)$Gene.symbol)
length(gene_symbols)-length(levels(gene_symbols))
[1] 6897
So only 6897 of my 19794 probesets have unique probeset -> gene mappings. I'd like to somehow combine the expression levels of each probeset associated with each gene. I don't care much about the actual probe id for each probe. I'd like very much to end up with an ExpressionSet containing the merged information as all of my downstream analysis is designed to work with this class.
I think I can write some code that will do this by hand, and make a new expression set from scratch. However, I'm assuming this can't be a new problem and that code exists to do it, using a statistically sound method to combine the gene expression levels. I'm guessing there's a proper name for this also but my googles aren't showing up much of use. Can anyone help?
I'm not an expert, but from what I've seen over the years everyone has their own favorite way of combining probesets. The two methods that I've seen used most on a large scale have been keeping only the probeset with the largest variance across the expression matrix, and taking the mean of the probesets to create a meta-probeset. For smaller blocks of probesets I've seen people use more intensive methods involving looking at per-probeset plots to get a feel for what's going on ... generally what happens is that one probeset turns out to be the 'good' one and the rest aren't very good.
I haven't seen generalized code to do this - as an example we recently realized in my lab that a few of us have our own private functions to do this same thing.
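For what it's worth, here is a rough sketch of the "mean per gene" variant (the column name Gene.symbol is taken from the question above; the rest is illustrative and not tested on GDS785):

library(Biobase)

exprs_mat <- exprs(cd4T)
symbols   <- as.character(fData(cd4T)$Gene.symbol)

# sum the probesets mapping to each symbol, then divide by how many there were
summed    <- rowsum(exprs_mat, group = symbols)
collapsed <- summed / as.vector(table(symbols))

# wrap the gene-level matrix back into an ExpressionSet for downstream code
cd4T_gene <- ExpressionSet(assayData = collapsed,
                           phenoData = phenoData(cd4T))
dim(cd4T_gene)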
The function you are looking for is nsFilter in the R genefilter package. It does two major things: it keeps only probesets with Entrez gene IDs (the rest of the probesets are filtered out), and when an Entrez ID has multiple probesets, the one with the largest value of the filter statistic is retained and the others are removed. You then have a matrix mapped to unique Entrez gene IDs. Hope this helps.
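In case it helps, this is roughly how I would call it (a sketch: nsFilter needs the ExpressionSet to have an annotation package set so it can look up Entrez IDs, and defaults may differ between versions, see ?nsFilter):

library(genefilter)

# keep only probesets that map to an Entrez ID; when several probesets share
# one Entrez ID, keep the one with the largest filter statistic (IQR by default)
filtered <- nsFilter(cd4T,
                     require.entrez   = TRUE,
                     remove.dupEntrez = TRUE,
                     var.filter       = FALSE)

cd4T_unique <- filtered$eset    # ExpressionSet with one probeset per gene
filtered$filter.log             # counts of features removed at each step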