Retrieve Intron splicings from Gene NCBI - python-3.6

I would like to retrieve the intron sequences of some genes (e.g. https://www.ncbi.nlm.nih.gov/nuccore/X62462.1).
I can get them from the Nucleotide database for some of the genes, but others only appear in the NCBI Gene database. I am using Biopython for this.
Here is a piece of code that retrieves an intron from the Nucleotide database.
from Bio.Seq import Seq
from Bio import SeqIO, Entrez
count = 4 # Number of entries to see
genes = ["estrogen receptor"]
shortname = genes[0]
Entrez.email = "email#gmail.com"
handle = Entrez.esearch(db="nucleotide", term="Human[Orgn] AND "+shortname+"[GENE] AND biomol_genomic[PROP] AND nucleotide_protein[Filter]", idtype="acc", retmax=count)
record = Entrez.read(handle)
handle.close()
With this part I check which entry I want:
print("Entries:", record["Count"])
seq_records = []
for i, idname in enumerate(record["IdList"]):
    with Entrez.efetch(db="nucleotide", rettype="gb", retmode="text", id=idname) as handle:
        seq_record = SeqIO.read(handle, "gb")
    seq_records.append(seq_record)
    print(i, "--", seq_record.description, seq_record.id)
Entries: 1
0 -- H.sapiens 5' flanking region for estrogen receptor (breast) gene X62462.1
And now I retrieve the introns sequences for this gene:
id_chosen = 0
chosen = seq_records[id_chosen]  # use the selected record, not the last one fetched
intron = [f for f in chosen.features if f.type == "intron"]
for x, e in enumerate(intron, start=1):
    # int() converts the Biopython positions to plain 0-based integers
    start, end = int(e.location.start), int(e.location.end)
    print(">>>", chosen.id, "Intron:", x, start+1, end, ",len:", len(chosen.seq[start:end]))
    print(chosen.seq[start:end], "\n")
Output: >X62462.1 Intron: 1 911 2933 ,len: 2023
GTAGGCTTGTTTTGATTTCTCTCTCTGTAGCTTTAGCATTTTGAGAAAGCAACTTACCTTTCTGGCTAGTGTCTGTATCCTAGCAGGGAGATGAGGATTGCTGTTCTCCATGG......
In this case there is only one intron.
So my question is: how can I do this for a gene that has several introns and appears only in the Gene database? How can I access those features?
Example: https://www.ncbi.nlm.nih.gov/gene/374
Thanks!
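A possible starting point, offered as a sketch rather than a tested answer: when a record annotates exons (or a CDS/mRNA with a joined location) but no explicit intron features, the introns can be derived as the gaps between consecutive exons. The function below works on plain 0-based, end-exclusive (start, end) pairs, like the start and end values used in the loop above. introns_from_exons is a hypothetical helper, the exon coordinates are made up for illustration, and minus-strand genes would additionally need reverse-complementing, which is not handled here.

```python
def introns_from_exons(exons):
    """Return the gaps between consecutive exons as (start, end) pairs.

    exons: iterable of (start, end) tuples, 0-based and end-exclusive.
    """
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end:  # skip touching or overlapping exons
            introns.append((prev_end, next_start))
    return introns


# Made-up exon coordinates, for illustration only
exons = [(0, 910), (2933, 3100), (4000, 4250)]
sequence = "N" * 5000  # placeholder standing in for seq_record.seq
for number, (start, end) in enumerate(introns_from_exons(exons), start=1):
    print("Intron:", number, start + 1, end, "len:", len(sequence[start:end]))
```

How the exon coordinates are obtained for a Gene-only entry (for example from the genomic record linked on the Gene page) is left open here.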

Related

Data Scraping with list in excel

I have a list in Excel, with one code in column A and another in column B.
There is a website where I need to enter both values in two different boxes, which then leads to another page.
That page contains certain details which I need to scrape into Excel.
Can anyone help with this?
Ok. Give this a shot:
import pandas as pd
import requests
df = pd.read_excel('C:/test/data.xlsx')
url = 'http://rla.dgft.gov.in:8100/dgft/IecPrint'
results = pd.DataFrame()
for row in df.itertuples():
    # row[1] is column A (the IEC code), row[2] is column B (the name)
    payload = {
        'iec': '%010d' % row[1],
        'name': row[2]}
    response = requests.post(url, params=payload)
    print('IEC: %010d\tName: %s' % (row[1], row[2]))
    try:
        dfs = pd.read_html(response.text)
    except Exception:
        print('The name Given By you does not match with the data OR you have entered less than three letters')
        temp_df = pd.DataFrame([['%010d' % row[1], row[2], 'ERROR']],
                               columns=['IEC', 'Party Name and Address', 'ERROR'])
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        results = pd.concat([results, temp_df], sort=False).reset_index(drop=True)
        continue
    generalData = dfs[0]
    generalData = generalData.iloc[:, [0, -1]].set_index(generalData.columns[0]).T.reset_index(drop=True)
    directorData = dfs[1]
    directorData = directorData.iloc[:, [-1]].T.reset_index(drop=True)
    directorData.columns = ['director_%02d' % (each + 1) for each in directorData.columns]
    try:
        branchData = dfs[2]
        branchData = branchData.iloc[:, [-1]].T.reset_index(drop=True)
        branchData.columns = ['branch_%02d' % (each + 1) for each in branchData.columns]
    except Exception:
        branchData = pd.DataFrame()
        print('No Branch Data.')
    temp_df = pd.concat([generalData, directorData, branchData], axis=1)
    results = pd.concat([results, temp_df], sort=False).reset_index(drop=True)
results.to_excel('path.new_file.xlsx', index=False)
Output:
print (results.to_string())
IEC IEC Allotment Date File Number File Date Party Name and Address Phone No e_mail Exporter Type IEC Status Date of Establishment BIN (PAN+Extension) PAN ISSUE DATE PAN ISSUED BY Nature Of Concern Banker Detail director_01 director_02 director_03 branch_01 branch_02 branch_03 branch_04 branch_05 branch_06 branch_07 branch_08 branch_09
0 0305008111 03.05.2005 04/04/131/51473/AM20/ 20.08.2019 NISSAN MOTOR INDIA PVT. LTD. PLOT-1A,SIPCOT IN... 918939917907 shailesh.kumar#rnaipl.com 5 Merchant/Manufacturer Valid IEC 2005-02-07 AACCN0695D FT001 NaN NaN 3 Private Limited STANDARD CHARTERED BANK A/C Type:1 CA A/C No :... HARDEEP SINGH BRAR GURMEL SINGH BRAR HOUSE NO ... JEROME YVES MARIE SAIGOT THIERRY SAIGOT A9/2, ... KOJI KAWAKITA KIHACHI KAWAKITA 3-21-3, NAGATAK... Branch Code:165TH FLOOR ORCHID BUSINESS PARK,S... Branch Code:14NRPDC , WAREHOUSE NO.B -2A,PATAU... Branch Code:12EQUINOX BUSINESS PARK TOWER 3 4T... Branch Code:8GRAND PALLADIUM,5TH FLR.,B WING,,... Branch Code:6TVS LOGISTICS SERVICES LTD.SING,C... Branch Code:2PLOT 1A SIPCOT INDUL PARK,ORAGADA... Branch Code:5BLDG.NO.3 PART,124A,VALLAM A,SRIP... Branch Code:15SURVEY NO. 678 679 680 681 682 6... Branch Code:10INDOSPACE SKCL INDL.PARK,BULD.NO...

neo4j single pass over graph but multiple matches

I have a graph in neo4j with vertices of:
person:ID,name,value:int,:LABEL
1,Alice,1,Person
2,Bob,0,Person
3,Charlie,0,Person
4,David,0,Person
5,Esther,0,Person
6,Fanny,0,Person
7,Gabby,0,Person
8,XXXX,1,Person
and edges:
:START_ID,:END_ID,:TYPE
1,2,call
2,3,text
3,2,text
6,3,text
5,6,text
5,4,call
4,1,call
4,5,text
1,5,call
1,8,call
6,8,call
6,8,text
8,6,text
7,1,text
imported into neo4j like:
DATA_DIR_SAMPLE=/data_network/
$NEO4J_HOME/bin/neo4j-admin import --mode=csv \
--database=graph.db \
--nodes:Person ${DATA_DIR_SAMPLE}/vertices.csv \
--relationships ${DATA_DIR_SAMPLE}/edges.csv
Now when querying the graph like:
MATCH (source:Person)-[*1]-(destination:Person)
RETURN source.name, source.value, avg(destination.value), 'undir_1_any' as type
UNION ALL
MATCH (source:Person)-[*2]-(destination:Person)
RETURN source.name, source.value, avg(destination.value), 'undir_2_any' as type
one can see that the graph is traversed multiple times. In addition, I want to obtain a table like:
Vertex | value | type_undir_1_any | type_undir_2_any
Alice | 1 | 0.2 | 0
which would require an additional aggregation step (pivot/reshape).
In the future, I would like to add the following patterns:
undirected | directed
all relations | type of relation
each up to 3 levels deep into the graph, and all permutations of these.
Is there a better way to combine the queries?
You need to aggregate by path length, computing the average yourself with conditional sums:
MATCH p = (source:Person)-[*1..2]-(destination:Person)
WITH
length(p) as L, source, destination
RETURN
source.name as Vertex,
source.value as value,
1.0 *
sum(CASE WHEN L = 1 THEN destination.value ELSE 0 END) /
sum(CASE WHEN L = 1 THEN 1 ELSE 0 END) as type_undir_1_any,
1.0 *
sum(CASE WHEN L = 2 THEN destination.value ELSE 0 END) /
sum(CASE WHEN L = 2 THEN 1 ELSE 0 END) as type_undir_2_any
Or a more elegant version, using a function from the APOC library to calculate the average of a collection:
MATCH p = (source:Person)-[*1..2]-(destination:Person)
RETURN
source.name as Vertex,
source.value as value,
apoc.coll.avg(COLLECT(
CASE WHEN length(p) = 1 THEN destination.value ELSE NULL END
)) as type_undir_1_any,
apoc.coll.avg(COLLECT(
CASE WHEN length(p) = 2 THEN destination.value ELSE NULL END
)) as type_undir_2_any
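The idea behind both queries is conditional aggregation: every path is visited once and its destination value is routed to a bucket chosen by the path length, so neither a UNION nor a later pivot is needed. A minimal Python sketch of the same idea, run on a made-up list of (source, path length, destination value) rows rather than on real Neo4j output; pivot_average_by_length is a hypothetical helper name:

```python
from collections import defaultdict


def pivot_average_by_length(rows, lengths=(1, 2)):
    """Average destination values per source and path length in one pass."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for source, length, value in rows:
        sums[source][length] += value
        counts[source][length] += 1
    table = {}
    for source in sums:
        table[source] = {
            # None marks a path length for which the source has no paths
            f"type_undir_{n}_any": (
                sums[source][n] / counts[source][n] if counts[source][n] else None
            )
            for n in lengths
        }
    return table


# Made-up rows: (source name, path length, destination value)
rows = [("Alice", 1, 0), ("Alice", 1, 1), ("Alice", 2, 0), ("Bob", 1, 1)]
print(pivot_average_by_length(rows))
```

Adding another pattern then means adding another bucket key (for example direction or relationship type) instead of another traversal.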

GameTheory package: Convert data frame of games to Coalition Set

I am looking to explore the GameTheory package from CRAN, but I would appreciate help in converting my data (a data frame of unique combinations and results) into the required coalition object. I believe the precursor to this is an ordered list of all coalition values (https://cran.r-project.org/web/packages/GameTheory/vignettes/GameTheory.pdf).
My real data has n ~ 30 'players', and unique combinations = large (say 1000 unique combinations), for which I have 1 and 0 identifiers to describe the combinations. This data is sparsely populated in that I do not have data for all combinations, but will assume combinations not described have zero value. I plan to have one specific 'player' who will appear in all combinations, and act as a baseline.
By way of example this is the data frame I am starting with:
require(GameTheory)
games <- read.csv('C:\\Users\\me\\Desktop\\SampleGames.csv', header = TRUE, row.names = 1)
games
n1 n2 n3 n4 Stakes Wins Success_Rate
1 1 1 0 0 800 60 7.50%
2 1 0 1 0 850 45 5.29%
3 1 0 0 1 150000 10 0.01%
4 1 1 1 0 300 25 8.33%
5 1 1 0 1 1800 65 3.61%
6 1 0 1 1 1900 55 2.89%
7 1 1 1 1 700 40 5.71%
8 1 0 0 0 3000000 10 0.00333%
where n1 is my universal player, and in this instance, I have described all combinations.
To calculate my 'base' coalition value from player {1} alone, I am looking to perform the calculation: 0.00333% (success rate) * all stakes, i.e.
0.00333% * (800 + 850 + 150000 + 300 + 1800 + 1900 + 700 + 3000000) = 105
I'll then have zero values for {2}, {3} and {4} as they never "play" alone in this example.
To calculate my first pair coalition value, I am looking to perform the calculation:
7.5%(800 + 300 + 1800 + 700) + 0.00333%(850 + 150000 + 1900 + 3000000) = 375
This is calculated as players {1,2} base win rate (7.5%) by the stakes they feature in, plus player {1} base win rate (0.00333%) by the combinations he features in that player {2} does not - i.e. exclusive sets.
This logic is repeated for the other unique combinations. For example row 4 would be the combination of {1,2,3} so the calculation is:
7.5%(800+1800) + 5.29%(850+1900) + 8.33%(300+700) + 0.00333%(3000000+150000) = 529 which descriptively is set {1,2} success rate% by Stakes for the combinations it appears in that {3} does not, {1,3} by where {2} does not feature, {1,2,3} by their occurrences, and the base player {1} by examples where neither {2} nor {3} occur.
My expected outcome therefore should look like this I believe:
c(105,0,0,0, 375,304,110,0,0,0, 529,283,246,0, 400)
where the first four numbers are the single player combinations {1} {2} {3} and {4}, the next six numbers are two player combinations {1,2} {1,3} {1,4} (and the null cases {2,3} {2,4} {3,4} which don't exist), then the next four are the three player combinations {1,2,3} {1,2,4} {1,3,4} and the null case {2,3,4}, and lastly the full combination set {1,2,3,4}.
I'd then feed this into the DefineGame function of the package to create my coalitions object.
Appreciate any help: I have tried to be as descriptive as possible. I really don't know where to start on generating the necessary sets and set exclusions.
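One way to read the worked examples above is as a single rule: for a coalition S, each game played by the set T contributes its stakes multiplied by the success rate of the sub-team T ∩ S, with a rate of zero when that sub-team never appears as a row. The sketch below is in Python rather than R and only illustrates that set logic; the rule is inferred from the three worked examples, the rates are the Success_Rate column taken as printed, and coalition_value and all_coalition_values are hypothetical helper names.

```python
from itertools import combinations

# Rows of the sample table: (players in the game, stakes, stated success rate)
GAMES = [
    ({1, 2}, 800, 0.075),
    ({1, 3}, 850, 0.0529),
    ({1, 4}, 150000, 0.0001),
    ({1, 2, 3}, 300, 0.0833),
    ({1, 2, 4}, 1800, 0.0361),
    ({1, 3, 4}, 1900, 0.0289),
    ({1, 2, 3, 4}, 700, 0.0571),
    ({1}, 3000000, 0.0000333),
]


def coalition_value(coalition, games=GAMES):
    """Credit each game with the rate of the coalition's sub-team playing in it.

    A sub-team that never appears as a row has rate 0 (assumption from the post).
    """
    rates = {frozenset(players): rate for players, _, rate in games}
    total = 0.0
    for players, stakes, _ in games:
        overlap = frozenset(players) & frozenset(coalition)
        total += rates.get(overlap, 0.0) * stakes
    return total


def all_coalition_values(n_players=4):
    """Values ordered by coalition size, then lexicographically."""
    players = range(1, n_players + 1)
    return [
        coalition_value(subset)
        for size in range(1, n_players + 1)
        for subset in combinations(players, size)
    ]


print(round(coalition_value({1})), round(coalition_value({1, 2})),
      round(coalition_value({1, 2, 3})))
```

The three printed values correspond to the worked examples for {1}, {1,2} and {1,2,3}. Other entries may differ slightly from the expected vector in the post, since the rates in the table are rounded. The ordering produced by all_coalition_values follows the description of that vector (singletons, pairs, triples, then the full set).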

What is the best way to parse this flat text format in R?

1. ZFP112
Official Symbol: ZFP112 and Name: zinc finger protein 112 homolog (mouse)[Homo sapiens]
Other Aliases: ZNF112, ZNF228
Other Designations: zfp-112; zinc finger protein 112; zinc finger protein 228
Chromosome: 19; Location: 19q13.2
Annotation: Chromosome 19NC_000019.9 (44830706..44860856, complement)
ID: 7771
2. SEP15
15 kDa selenoprotein[Homo sapiens]
Chromosome: 1; Location: 1p31
Annotation: Chromosome 1NC_000001.10 (87328128..87380107, complement)
MIM: 606254
ID: 9403
3. MLL4
myeloid/lymphoid or mixed-lineage leukemia 4[Homo sapiens]
Other Aliases: HRX2, KMT2B, MLL2, TRX2, WBP7
Other Designations: KMT2D; WBP-7; WW domain binding protein 7; WW domain-binding protein 7; histone-lysine N-methyltransferase MLL4; lysine N-methyltransferase 2B; lysine N-methyltransferase 2D; mixed lineage leukemia gene homolog 2; myeloid/lymphoid or mixed-lineage leukemia protein 4; trithorax homolog 2; trithorax homologue 2
Chromosome: 19; Location: 19q13.1
Annotation: Chromosome 19NC_000019.9 (36208921..36229779)
MIM: 606834
ID: 9757
37. LOC100509547
hypothetical protein LOC100509547[Homo sapiens]
This record was discontinued.
ID: 100509547
43. LOC100509587
hypothetical protein LOC100509587[Homo sapiens]
Chromosome: 6
This record was replaced with GeneID: 100506601
ID: 100509587
I want to get the gene name (ZFP112, SEP15, MLL4), the Location field (if present), the ID field, and skip the other stuff. All the string utilities like scan() seem geared toward more regular data. The blank line between records is effectively the record separator. I can write this to disk and read it back in with readLines() but I'd prefer to do it from memory since I downloaded it over HTTP.
Read the data in from "myfile.dat", say (or just start from L below if you have already read it in as separate lines). Extract the lines that begin with digits followed by a dot and a space, that contain Location: , or that start with ID: . Then remove everything in those lines up to and including the last space, giving v2. Create a grouping vector g which identifies the group to which each component of v2 belongs. (This uses the fact that the first field of each group starts with a non-digit and the other fields start with a digit.) Then split v2 into those groups, giving s. Expand the short components of s by inserting an NA, on the assumption that if a component is short then Location: is missing. (We assume the first field and the ID field cannot be missing.) Finally, transpose the result so that the fields are in columns and the cases in rows.
L <- readLines("myfile.dat")
v <- grep("^\\d+\\. |Location: |^ID: ", L, value = TRUE)
v2 <- sub(".* ", "", v)
g <- cumsum(regexpr("^\\D", v2) > 0)
s <- split(v2, g)
m <- sapply(s, function(x) if (length(x) == 2) c(x[[1]], NA, x[[2]]) else x)
t(m)
Using the sample data in the post we get this from the last line:
[,1] [,2] [,3]
1 "ZFP112" "19q13.2" "7771"
2 "SEP15" "1p31" "9403"
3 "MLL4" "19q13.1" "9757"
4 "LOC100509547" NA "100509547"
5 "LOC100509587" NA "100509587"
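For comparison, a rough Python sketch of the same extraction working directly on an in-memory string (for instance the body of an HTTP response), so nothing is written to disk. parse_gene_records is a hypothetical helper, and it assumes, as the answer above does, that every record starts with a line like "1. ZFP112" and contains an ID: line.

```python
import re


def parse_gene_records(text):
    """Extract symbol, Location (if present) and ID from each record."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        header = re.match(r"^\d+\.\s+(\S+)$", line)
        if header:
            # A line such as "2. SEP15" starts a new record
            current = {"symbol": header.group(1), "location": None, "id": None}
            records.append(current)
        elif current is not None:
            loc = re.search(r"Location:\s*(\S+)", line)
            if loc:
                current["location"] = loc.group(1)
            gene_id = re.match(r"^ID:\s*(\d+)$", line)  # ignores "GeneID: ..."
            if gene_id:
                current["id"] = gene_id.group(1)
    return records


sample = """1. ZFP112
Official Symbol: ZFP112 and Name: zinc finger protein 112 homolog (mouse)[Homo sapiens]
Chromosome: 19; Location: 19q13.2
ID: 7771

43. LOC100509587
hypothetical protein LOC100509587[Homo sapiens]
Chromosome: 6
This record was replaced with GeneID: 100506601
ID: 100509587
"""
for rec in parse_gene_records(sample):
    print(rec["symbol"], rec["location"], rec["id"])
```

Records without a Location: field keep None in that slot, which plays the role of the NA in the R result.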

How do associations, #NS and #NV work in UniData Dictionaries?

Does anyone have a quick example of how Associations, #NS and #NV work in UniData?
I’m trying to work out associations in dictionary items but cannot get them to do anything.
For example, in a record
<1,1> = A
<1,2> = B
<2,1> = Apple
<2,2> = Banana
I created 3 dictionary items, LETTER, FRUIT and COMBO, as follows:
LETTER:
<1> = D
<2> = 1
<3> =
<3> = Letter
<4> = 6L
<5> = M
<6> = COMBO
FRUIT:
<1> = D
<2> = 1
<3> =
<3> = Letter
<4> = 6L
<5> = M
<6> = COMBO
COMBO:
<1> = PH
<2> = LETTER FRUIT
Doing a LIST LETTER FRUIT or LIST COMBO gives the same result as when LETTER and FRUIT do not have an association declared in 6.
At this point I thought it might group multivalues together when SELECTing so I created another record as such:
<1,1> = A
<1,2> = B
<2,1> = Banana
<2,2> = Apple
Doing SELECT MyFile WITH LETTER = “A” and FRUIT = “Apple” selects both records, so that cannot be it either.
I then tried changing LETTER to be:
<1> = I
<2> = EXTRACT(#RECORD,1,#NV,1);EXTRACT(FRUIT,1,#NV,1);#1:" (":#2:")" : #NS
<3> =
<3> = Letter
<4> = 6L
<5> = M
<6> = COMBO
I was hoping that a LIST MyFile LETTER would bring back all the different letters with their associated fruit in parentheses. That didn’t work either, as LETTER now only ever displayed the first multivalue instead of all of them. For example:
LIST MyFile LETTER 14:05:22 26 FEB 2010 1
MyFile.... LETTER..............
RECORD2 A (Banana)1
RECORD A (Apple)1
2 records listed
The manuals don’t go any further than saying the word “association”. Is anyone able to clarify this for me?
NV and NS often only work when you use BY-EXP in your LIST or SELECT statements. You need to use modifiers that specifically look at multivalues and subvalues.
WHEN is one, and BY-EXP is another. There are others, but I am not sure what they are off the top of my head. I primarily use BY-EXP and BY-EXP-DSND.
LIST MyFile BY-EXP LETTER = "A" BY-EXP FRUIT ="Apple" LETTER FRUIT LETTER.COMBO
To bring back all the combinations, you need to do the following:
LIST MyFile BY-EXP LETTER LETTER FRUIT LETTER.COMBO
Change the following virtual field from 'LETTER' to say 'LETTER.COMBO' or something along those lines:
<1> = I
<2> = EXTRACT(#RECORD,1,#NV,1);EXTRACT(FRUIT,1,#NV,1);#1:" (":#2:")" : #NS
<3> =
<3> = Letter
<4> = 6L
<5> = M
<6> = COMBO
Hope that helps.
-Nathan
To answer part of my own question:
Only 'WHEN' is affected by the association, not 'WITH'. If you turn on UDT.OPTIONS 94 and do
LIST MyFile WHEN LETTER = "A" AND FRUIT="Apple" COMBO
when using my D-Type definition of LETTER, I get
LIST MyFile WHEN LETTER = "A" AND FRUIT="Apple" LETTER FRUIT 16:06:42 26 FEB 2010 1
MyFile.... LETTER.............. FRUIT...............
RECORD A Apple
1 record listed
Which is what one would expect.
To use the WHEN clause you need to be in ECLTYPE U, not P. It would be helpful if this was clearer, but oh well...
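As a rough analogy only (this is Python, not UniData, and it models nothing more than the results reported in this thread): WITH behaves as if each condition may be satisfied by a different multivalue, while WHEN on associated fields behaves as if both conditions must hold at the same multivalue position. matches_with and matches_when are hypothetical names.

```python
def matches_with(record, letter, fruit):
    """WITH-like: each condition may be satisfied by a different multivalue."""
    return letter in record["LETTER"] and fruit in record["FRUIT"]


def matches_when(record, letter, fruit):
    """WHEN-like on associated fields: both conditions at the same position."""
    return any(
        l == letter and f == fruit
        for l, f in zip(record["LETTER"], record["FRUIT"])
    )


# The two records from the question
records = {
    "RECORD": {"LETTER": ["A", "B"], "FRUIT": ["Apple", "Banana"]},
    "RECORD2": {"LETTER": ["A", "B"], "FRUIT": ["Banana", "Apple"]},
}
for name, rec in records.items():
    print(name, matches_with(rec, "A", "Apple"), matches_when(rec, "A", "Apple"))
```

Under this reading both records pass the WITH-style test, but only RECORD passes the WHEN-style test, which matches the listings shown above.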
