I have about 3500 CAS numbers for which I would like to extract chemical information from PubChem and put it into a data frame. I have no idea how to format the output so I can put it into a data frame when I use the code below. The output of each call (please see below) seems to have the same format: a list of 9 elements of varying size, 2 of which are tibbles of varying size. Any ideas would be appreciated! Thank you!
library(dplyr)
library(webchem)
# usage: ci_query(query, from = c("rn", "inchikey"), verbose = getOption("verbose"))
y1 <- ci_query('50-00-0', from = 'rn')
which yields:
y1
$`50-00-0`
$`50-00-0`$name
[1] "Formaldehyde [USP]" "Methanal"
$`50-00-0`$synonyms
[1] "AI3-26806" "Aldehyd mravenci" "Aldehyd
mravenci [Czech]"
[4] "Aldehyde formique" "Aldehyde formique [French]" "Aldehyde
formique [ISO-French]"
[7] "Aldeide formica" "Aldeide formica [Italian]" "BFV"
[10] "Caswell No. 465" "CCRIS 315" "Dormol"
[13] "EC 200-001-8" "EINECS 200-001-8" "EPA
Pesticide Chemical Code 043001"
[16] "Fannoform" "Formalaz"
"Formaldehyd"
[19] "Formaldehyd [Czech, Polish]" "Formaldehyde"
"Formaldehyde solution"
[22] "Formaldehyde, gas" "Formalin" "Formalin
40"
[25] "Formalin [JAN]" "Formalin-loesungen"
"Formalin-loesungen [German]"
[28] "Formalina" "Formalina [Italian]"
"Formaline"
[31] "Formaline [German]" "Formalith" "Formic
aldehyde"
[34] "Formol" "FYDE" "HSDB
164"
[37] "Karsan" "Lysoform"
"Methaldehyde"
[40] "Methanal" "Methyl aldehyde"
"Methylene oxide"
[43] "Morbicid" "NCI-C02799" "NSC
298885"
[46] "Oplossingen" "Oplossingen [Dutch]"
"Oxomethane"
[49] "Oxymethylene" "Paraform" "RCRA
waste number U122"
[52] "Superlysoform" "UN 1198" "UN 2209
(formalin)"
[55] "UNII-1HG84L3525"
$`50-00-0`$cas
[1] "50-00-0"
$`50-00-0`$inchi
[1] "InChI=1S/CH2O/c1-2/h1H2"
$`50-00-0`$inchikey
[1] "WSFSSNUMVMOOMR-UHFFFAOYSA-N"
$`50-00-0`$smiles
[1] "C=O"
$`50-00-0`$toxicity
# A tibble: 24 x 6
   Organism   `Test Type` Route        `Reported Dose (Normalized Dose)` Effect                                                 Source
   <chr>      <chr>       <chr>        <chr>                             <chr>                                                  <chr>
 1 cat        LCLo        inhalation   400mg/m3/2H (400mg/m3)            ""                                                     "\"To~
 2 cat        LDLo        intravenous  30mg/kg (30mg/kg)                 "BLOOD: OTHER CHANGES"                                 "Acta~
 3 dog        LDLo        intravenous  70mg/kg (70mg/kg)                 ""                                                     "Inte~
 4 dog        LDLo        subcutaneous 350mg/kg (350mg/kg)               ""                                                     "Inte~
 5 frog       LDLo        parenteral   800ug/kg (0.8mg/kg)               ""                                                     "Inte~
 6 guinea pig LD50        oral         260mg/kg (260mg/kg)               ""                                                     "Jour~
 7 human      TCLo        inhalation   17mg/m3/30M (17mg/m3)             "LUNGS, THORAX, OR RESPIRATION: OTHER CHANGESSENSE OR~ "JAMA~
 8 man        LDLo        unreported   477mg/kg (477mg/kg)               ""                                                     "\"Po~
 9 man        TCLo        inhalation   300ug/m3 (0.3mg/m3)               "SENSE ORGANS AND SPECIAL SENSES: OTHER CHANGES: OLFA~ "Gigi~
10 man        TDLo        oral         643mg/kg (643mg/kg)               "GASTROINTESTINAL: NAUSEA OR VOMITINGLUNGS, THORAX, O~ "Japa~
# ... with 14 more rows
$`50-00-0`$physprop
# A tibble: 8 x 5
  `Physical Property`              Value Units            `Temp (deg C)` Source
  <chr>                            <dbl> <chr>                     <int> <chr>
1 Melting Point                 -9.2e+01 deg C                        NA EXP
2 Boiling Point                -1.91e+01 deg C                        NA EXP
3 pKa Dissociation Constant     1.33e+01 (none)                       25 EXP
4 log P (octanol-water)          3.5e-01 (none)                       NA EXP
5 Water Solubility               4.0e+05 mg/L                         20 EXP
6 Vapor Pressure                3.89e+03 mm Hg                        25 EXP
7 Henry's Law Constant          3.37e-07 atm-m3/mole                  25 EXP
8 Atmospheric OH Rate Constant  9.37e-12 cm3/molecule-sec             25 EXP
$`50-00-0`$source_url
[1] "https://chem.nlm.nih.gov/chemidplus/rn/50-00-0"
attr(,"class")
[1] "ci_query" "list"
I have two data sets. One that looks like this:
Male Female Territory
1 1 11 TEE
2 2 12 JEB
3 3 13 GAT
4 4 14 SHY
5 5 15 BOB
6 6 16 LEE
7 7 17 BOO
8 8 18 DON
9 9 19 RAZ
10 10 20 ZAP
This data set gives the ID numbers of the males and females (these are the observed mating pairs; for example, male 1 and female 11 were observed to have mated, and the territory they occupy is called TEE) and the name of the territory they live in.
The other data set looks like this:
$GAT
[1] "TEE" "SHY" "BOB"
$JEB
[1] "LEE" "GAT" "BOO"
$TEE
[1] "DON" "RAZ" "ZAP"
This second data set lists the surrounding territories for each territory. For example, territories TEE, SHY, and BOB surround the territory GAT.
Both of these data sets are in character form.
What I am trying to do is make a list of potential mates for every female individual based on the territories that surround the territory they reside in and the males that live in those surrounding territories. So my end goal is to get something that looks like this:
$11
[1] "8" "9" "10"
$12
[1] "6" "3" "7"
$13
[1] "1" "4" "5"
etc...
So I have to match the territory that each female resides in against the surrounding-territory list to get the list of surrounding territories for each female. Then I have to find all the males that reside in those surrounding territories (as well as the males that reside in the female's own territory).
I'm honestly not even sure how to start this. Even something that can help me start this will be very appreciated.
Thanks!
I altered your example a bit to include a territory duplicate.
df <- data.frame(Male=1:4, Female=5:8, Territory=c("TEE","TEE","JEB","GAT"), Year=2013, stringsAsFactors = FALSE)
# Male Female Territory Year
#1 1 5 TEE 2013
#2 2 6 TEE 2013
#3 3 7 JEB 2013
#4 4 8 GAT 2013
neighbour <- list()
neighbour[['GAT']] <- c("TEE","SHY","BOB")
neighbour[['JEB']] <- c("LEE", "GAT", "BOO")
neighbour[['TEE']] <- c("DON", "RAZ", "ZAP")
#$GAT
#[1] "TEE" "SHY" "BOB"
#$JEB
#[1] "LEE" "GAT" "BOO"
#$TEE
#[1] "DON" "RAZ" "ZAP"
Here is a possible solution using lapply and %in%.
#iterate over all females
result <- lapply(setNames(nm = df$Female), function(x) {
  # territory of the current female
  FemTer <- df[df$Female == x, "Territory"]
  # males living in the neighbourhood
  df[df$Territory %in% c(FemTer, neighbour[[FemTer]]), "Male"]
})
result
#$`5`
#[1] 1 2
#
#$`6`
#[1] 1 2
#
#$`7`
#[1] 3 4
#
#$`8`
#[1] 1 2 4
I just assumed that you would include the territory a female resides in as well as its surroundings. If you only want the surroundings, just delete FemTer, from df[df$Territory %in% c(FemTer, neighbour[[FemTer]]), "Male"], as in the sketch below.
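For completeness, here is a hedged sketch of that surroundings-only variant (the same lapply() as above, with FemTer dropped from the %in% filter):
result_surround <- lapply(setNames(nm = df$Female), function(x) {
  # territory of the current female
  FemTer <- df[df$Female == x, "Territory"]
  # males living only in the neighbouring territories
  df[df$Territory %in% neighbour[[FemTer]], "Male"]
})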
Consider this base R data wrangling with merge, reshape, by:
Data
txt = ' Male Female Territory
1 1 11 TEE
2 2 12 JEB
3 3 13 GAT
4 4 14 SHY
5 5 15 BOB
6 6 16 LEE
7 7 17 BOO
8 8 18 DON
9 9 19 RAZ
10 10 20 ZAP'
df <- read.table(text=txt, header=TRUE)
territories <- list(GAT=c("TEE","SHY","BOB"),
JEB=c("LEE","GAT","BOO"),
TEE=c("DON","RAZ","ZAP"))
Process
# CASE LIST TO DF
df_territories <- data.frame(territories, stringsAsFactors = FALSE)
# MELT DF TO LONG FORMAT
df_territories <- reshape(df_territories, varying = list(1:3),
                          v.names = "nearest", timevar = "Territory",
                          times = names(df_territories)[1:ncol(df_territories)],
                          direction = "long")
# NESTED MERGE
mdf <- merge(merge(df, df_territories, by = "Territory", all.x = TRUE),
             df, by.x = "nearest", by.y = "Territory")
# BY GROUP SLICE
matelist <- by(mdf, mdf$Female.x, FUN = function(grp){
  as.character(sort(grp$Male.y))
})
# LIST CLEANUP
attributes(matelist) <- NULL
names(matelist) <- unique(sort(mdf$Female.x))
matelist
# $`11`
# [1] "8" "9" "10"
# $`12`
# [1] "3" "6" "7"
# $`13`
# [1] "1" "4" "5"
I have part of the data set, shown below in CSV form; the full data set has more rows and columns than shown. I want to implement apriori on this data set. Say I have this:
Maths Science C++ Java DC
[1] 75 44 55 56 88
[2] 56 88 54 78 44
The original data set has 30 columns (representing subjects) and 24 rows (serial numbers representing students).
DATASET:link
I want to convert this data set into the form shown below:
[1] {Maths,DC}
[2] {Science,Java}
i.e. a list of lists (I think that is what it is called) containing the column names. The list for a student shows the subjects in which he/she scored 75 marks or more; the rest of the subjects are dropped (the only condition of the problem).
E.g. the first student scored 75+ marks in DC and Maths, so his list includes only DC and Maths.
I am sorry for posting this, but I searched a lot on Stack Overflow and found a few working suggestions, yet couldn't reach the final goal.
My goal is to get a form like this:
[9834] {semi-finished bread,
bottled water,
soda,
bottled beer}
[9835] {chicken,
tropical fruit,
other vegetables,
vinegar,
shopping bags}
As given by:
library(arules)
inspect(Groceries)
Alternatively, I would appreciate it if anyone could suggest another way to represent the data that apriori can understand, as long as it follows the stated conditions.
(Sorry for the long post. I hope converting my data set into this format will help me study the patterns in the student-subject data. Thanks a ton for all the help.)
library(plyr)
library(arules)
df <- read.table(text =
" 75 44 55 56 88
56 88 54 78 44")
names(df) <- c("Maths", "Science", "C++", "Java", "DC")
transactions <- as(alply(df, 1, function(x) names(x)[x >= 75]), "transactions")
inspect(transactions)
# items transactionID
# [1] {DC,Maths} 1
# [2] {Java,Science} 2
Edit: It works with your example dataset, too:
library(plyr)
library(arules)
df <- read.csv(file = url("https://drive.google.com/uc?export=download&id=0B3kdblyHw4qLR0dpT24xWUZGcGs"))
transactions <- as(alply(df, 1, function(x) names(x)[x >= 75]), "transactions")
inspect(transactions)
# items transactionID
# [1] {CD,CG,CN,DA,Data.Struc} 1
# [2] {CD,CG,CO,ML,OS} 2
# [3] {CN,Data.Struc,DC,DM,DMS} 3
# [4] {CHE,DD,DM,EC,EE} 4
# [5] {CHE,CN,MATHS,PHY} 5
# [6] {Data.Science,DM,DMS,ML,OS} 6
# [7] {CD,DA,Data.Struc,EC,MATHS} 7
# [8] {CG,CHE,CN,CO,OS} 8
# [9] {CN,CO,Data.Science,DC,DMS} 9
# [10] {DC,DD,EC,EE,PHY} 10
# [11] {CHE,DD,DMS,MATHS,PHY} 11
# [12] {CN,Data.Science,DM,MATHS,ML} 12
# [13] {CD,CG,DA,Data.Science,Data.Struc} 13
# [14] {CG,CO,EE,MATHS,OS} 14
# [15] {CN,CO,DC,DMS,PHY} 15
# [16] {CN,CO,DD,EC,EE} 16
# [17] {CHE,DA,EE,MATHS,PHY} 17
# [18] {Data.Science,DD,DM,ML,PHY} 18
# [19] {CD,CO,DA,Data.Struc,DC} 19
# [20] {CG,CO,DD,DM,OS} 20
# [21] {CG,CN,DA,DC,DMS} 21
# [22] {DD,EC,EE,ML,OS} 22
# [23] {CHE,CN,Data.Struc,MATHS,PHY} 23
# [24] {CG,Data.Science,DM,EE,ML} 24
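If the goal is to go on and actually mine rules from these transactions, a short follow-up could look like the sketch below; the support, confidence, and minlen thresholds are illustrative values, not taken from the question.
# mine association rules from the transactions built above
rules <- apriori(transactions,
                 parameter = list(supp = 0.1, conf = 0.8, minlen = 2))
# show the ten rules with the highest lift
inspect(head(sort(rules, by = "lift"), 10))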
I am creating 1000 random communities (vectors) from a species pool of 128, with certain operations applied to each community and the result stored in a new vector. For simplicity, I have been practicing writing code using 10 random communities from a species pool of 20. The problem is that there are a couple of pairs of species such that, if one of the pairs turns up in a random community, I need that community to be thrown out and a new one regenerated. I have been able to code it so that, if a pair is found in a community, that community (vector) is labeled NA. I also know how to tell the loop to skip that vector using the "next" command. But with both of these options, I do not get all of the communities that I need.
Here is my code using the NA option, but again that ends up leaving me short of communities.
C<-c(1:20)
D<-numeric(10)
X<- numeric(5)
for(i in 1:10){
  X <- sample(C, size = 5, replace = FALSE)
  if ("10" %in% X & "11" %in% X) X <- NA
  if ("1" %in% X & "2" %in% X) X <- NA
  print(X)
  D[i] <- sum(X)
}
print(D)
This is what my result looks like.
[1] 5 1 7 3 14
[1] 20 8 3 18 17
[1] NA
[1] NA
[1] 4 7 1 5 3
[1] 16 1 11 3 12
[1] 14 3 8 10 15
[1] 7 6 18 3 17
[1] 6 5 7 3 20
[1] 16 14 17 7 9
> print(D)
[1] 30 66 NA NA 20 43 50 51 41 63
Thanks so much!
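Not a posted answer, but one way to keep the full number of communities is to resample inside a repeat loop until a community without a forbidden pair is drawn. This sketch reuses the setup above and assumes the forbidden pairs are 10/11 and 1/2:
C <- 1:20
D <- numeric(10)
for (i in 1:10) {
  repeat {
    X <- sample(C, size = 5, replace = FALSE)
    # redraw if either forbidden pair occurs together
    bad <- (10 %in% X && 11 %in% X) || (1 %in% X && 2 %in% X)
    if (!bad) break
  }
  print(X)
  D[i] <- sum(X)
}
print(D)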
I have a list of 15 million strings and a dictionary of 8 million words. I want to replace every word in each database string by that word's index in the dictionary.
I tried using the hash package for faster indexing, but it is still taking hours for replacing in all 15 million strings.
What is the efficient way of implementing this?
Example[EDITED]:
# Database
[[1]]
[1]"a admit been c case"
[[2]]
[1]"co confirm d ebola ha hospit howard http lik"
# dictionary
"t" 1
"ker" 2
"be" 3
...
# Output:
[[1]]
[1]  123 3453 3453  567

[[2]]
[1]   6786   3423 234123   1234  23423   6767   3423 124431 787889    111

Here, the index of "admit" in the dictionary is 3453.
Any kind of help is appreciated.
Updated Example with Code:
This is what I am currently doing.
Example: data =
[1] "a co crimea divid doe east hasten http polit secess split t threaten ukrain via w west xtcnwl youtub"
[2] "billion by cia fund group nazy spent the tweethead ukrain"
[3] "all back energy grandpar home miss my posit radiat the"
[4] "ao bv chega co de ebola http kkmnxv pacy rio suspeito t"
[5] "android androidgam co coin collect gameinsight gold http i jzdydkylwd t ve"
library(hash)

words.list = strsplit(data, "\\W+", perl = TRUE)
words.vector = unlist(words.list)
sorted.words = sort(table(words.vector), decreasing = TRUE)
h = hash(names(sorted.words), 1:length(names(sorted.words)))

index = lapply(data, function(row) {
  temp = trim.leading(row)   # trim.leading() is a helper that strips leading whitespace
  word_list = unlist(strsplit(temp, "\\W+", perl = TRUE))
  index_list = lapply(word_list, function(x) {
    return(h[[x]])
  })
  #print(index_list)
  return(unlist(index_list))
})
Output:
index_list
[[1]]
[1] 6 1 19 21 22 23 31 2 40 44 46 3 48 5 51 52 53 54 55
[[2]]
[1] 12 14 16 26 30 38 45 4 49 5
[[3]]
[1] 7 11 25 29 32 36 37 41 42 4
[[4]]
[1] 10 13 15 1 20 24 2 35 39 43 47 3
[[5]]
[1] 8 9 1 17 18 27 28 2 33 34 3 50
The output is index. This runs fast if data is small, but execution is really slow when the length is 15 million.
My task is nearest-neighbor search. I want to search with 1000 queries that are in the same format as the database.
I have tried many things like parallel computations as well, but had issues with memory.
[EDIT] How can I implement this using Rcpp?
I think you'd like to avoid the lapply() by splitting the data, unlisting, and then processing the vector of words:
data.list = strsplit(data, "\\W+", perl=TRUE)
words = unlist(data.list)
## ... additional processing, e.g., strip white space, on the vector 'words'
Perform the match, then re-list to the original structure:
relist(match(words, word.vector), data.list)  # word.vector: your dictionary as a character vector
For downstream applications it might actually pay to retain the vector + 'partitioning' information, partition = sapply(data.list, length) rather than re-listing, since it'll continue to be efficient to operate on the unlisted vector. The Bioconductor S4Vectors package provides a CharacterList class that takes this approach, where one mostly works on something that is list-like, but where the data are stored and most operations are on an underlying character vector.
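To make that concrete, here is a small self-contained sketch using the two example strings from the question; the dictionary here is just built from the data itself, whereas in practice it would be your 8-million-word vector.
data <- c(
  "a admit been c case",
  "co confirm d ebola ha hospit howard http lik"
)

data.list <- strsplit(data, "\\W+", perl = TRUE)
words     <- unlist(data.list)

# stand-in dictionary; replace with the real 8-million-word character vector
dictionary <- sort(unique(words))

# one vectorized lookup for all words, then restore the per-string structure
relist(match(words, dictionary), data.list)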
Sounds like you're doing NLP.
A fast non-R solution (which you could wrap in R) is word2vec
The word2vec tool takes a text corpus as input and produces the word vectors as output. It first constructs a vocabulary from the training text data and then learns vector representations of words. The resulting word vector file can be used as features in many natural language processing and machine learning applications.