Formatting webchem PubChem output to a dataframe - r

I have about 3500 CAS numbers for which I would like to extract chemical information from PubChem and put it into a dataframe. I have no idea how to format the output of the code below so that it fits into a dataframe. Each call (see the output below) seems to return the same format: a list of 9 elements of varying size, 2 of which are tibbles of varying size. Any ideas would be appreciated. Thank you!
library(dplyr)
library(webchem)
# Usage: ci_query(query, from = c("rn", "inchikey"), verbose = getOption("verbose"))
y1 <- ci_query('50-00-0', from = 'rn')
which yields:
y1
$`50-00-0`
$`50-00-0`$name
[1] "Formaldehyde [USP]" "Methanal"
$`50-00-0`$synonyms
[1] "AI3-26806" "Aldehyd mravenci" "Aldehyd mravenci [Czech]"
[4] "Aldehyde formique" "Aldehyde formique [French]" "Aldehyde formique [ISO-French]"
[7] "Aldeide formica" "Aldeide formica [Italian]" "BFV"
[10] "Caswell No. 465" "CCRIS 315" "Dormol"
[13] "EC 200-001-8" "EINECS 200-001-8" "EPA Pesticide Chemical Code 043001"
[16] "Fannoform" "Formalaz" "Formaldehyd"
[19] "Formaldehyd [Czech, Polish]" "Formaldehyde" "Formaldehyde solution"
[22] "Formaldehyde, gas" "Formalin" "Formalin 40"
[25] "Formalin [JAN]" "Formalin-loesungen" "Formalin-loesungen [German]"
[28] "Formalina" "Formalina [Italian]" "Formaline"
[31] "Formaline [German]" "Formalith" "Formic aldehyde"
[34] "Formol" "FYDE" "HSDB 164"
[37] "Karsan" "Lysoform" "Methaldehyde"
[40] "Methanal" "Methyl aldehyde" "Methylene oxide"
[43] "Morbicid" "NCI-C02799" "NSC 298885"
[46] "Oplossingen" "Oplossingen [Dutch]" "Oxomethane"
[49] "Oxymethylene" "Paraform" "RCRA waste number U122"
[52] "Superlysoform" "UN 1198" "UN 2209 (formalin)"
[55] "UNII-1HG84L3525"
$`50-00-0`$cas
[1] "50-00-0"
$`50-00-0`$inchi
[1] "InChI=1S/CH2O/c1-2/h1H2"
$`50-00-0`$inchikey
[1] "WSFSSNUMVMOOMR-UHFFFAOYSA-N"
$`50-00-0`$smiles
[1] "C=O"
$`50-00-0`$toxicity
# A tibble: 24 x 6
Organism `Test Type` Route `Reported Dose (Normalized Dose)` Effect Source
<chr> <chr> <chr> <chr> <chr> <chr>
1 cat LCLo inhalation 400mg/m3/2H (400mg/m3) "" "\"To~
2 cat LDLo intravenous 30mg/kg (30mg/kg) "BLOOD: OTHER CHANGES" "Acta~
3 dog LDLo intravenous 70mg/kg (70mg/kg) "" "Inte~
4 dog LDLo subcutaneous 350mg/kg (350mg/kg) "" "Inte~
5 frog LDLo parenteral 800ug/kg (0.8mg/kg) "" "Inte~
6 guinea pig LD50 oral 260mg/kg (260mg/kg) "" "Jour~
7 human TCLo inhalation 17mg/m3/30M (17mg/m3) "LUNGS, THORAX, OR RESPIRATION: OTHER CHANGESSENSE OR~ "JAMA~
8 man LDLo unreported 477mg/kg (477mg/kg) "" "\"Po~
9 man TCLo inhalation 300ug/m3 (0.3mg/m3) "SENSE ORGANS AND SPECIAL SENSES: OTHER CHANGES: OLFA~ "Gigi~
10 man TDLo oral 643mg/kg (643mg/kg) "GASTROINTESTINAL: NAUSEA OR VOMITINGLUNGS, THORAX, O~ "Japa~
# ... with 14 more rows
$`50-00-0`$physprop
# A tibble: 8 x 5
`Physical Property` Value Units `Temp (deg C)` Source
<chr> <dbl> <chr> <int> <chr>
1 Melting Point -9.2 e+ 1 deg C NA EXP
2 Boiling Point -1.91e+ 1 deg C NA EXP
3 pKa Dissociation Constant 1.33e+ 1 (none) 25 EXP
4 log P (octanol-water) 3.5 e- 1 (none) NA EXP
5 Water Solubility 4 e+ 5 mg/L 20 EXP
6 Vapor Pressure 3.89e+ 3 mm Hg 25 EXP
7 Henry's Law Constant 3.37e- 7 atm-m3/mole 25 EXP
8 Atmospheric OH Rate Constant 9.37e-12 cm3/molecule-sec 25 EXP
$`50-00-0`$source_url
[1] "https://chem.nlm.nih.gov/chemidplus/rn/50-00-0"
attr(,"class")
[1] "ci_query" "list"
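One way to approach this, sketched here under stated assumptions rather than as a tested webchem recipe: keep the scalar fields in one row per CAS number and stack each nested table into its own long dataframe keyed by CAS. The list `res` below is a hand-built stand-in shaped like the output above (the second entry and its values are made up), and no call to ci_query is made.

```r
# Hypothetical stand-in for ci_query() results; only a few fields are included.
res <- list(
  "50-00-0" = list(
    name = c("Formaldehyde [USP]", "Methanal"),
    cas = "50-00-0",
    smiles = "C=O",
    physprop = data.frame(
      `Physical Property` = c("Melting Point", "Boiling Point"),
      Value = c(-92, -19.1),
      check.names = FALSE, stringsAsFactors = FALSE
    )
  ),
  "00-00-0" = list(  # made-up entry with placeholder values
    name = "Example compound",
    cas = "00-00-0",
    smiles = NA_character_,
    physprop = data.frame(
      `Physical Property` = c("Melting Point", "Boiling Point"),
      Value = c(1, 2),
      check.names = FALSE, stringsAsFactors = FALSE
    )
  )
)

# One row per compound: scalar fields as-is, multi-valued fields collapsed.
compounds <- do.call(rbind, lapply(res, function(x) {
  data.frame(
    cas = x$cas,
    name = paste(x$name, collapse = "; "),
    smiles = x$smiles,
    stringsAsFactors = FALSE
  )
}))

# One long table per nested tibble, tagged with the CAS number it came from.
physprop <- do.call(rbind, lapply(names(res), function(id) {
  data.frame(cas = id, res[[id]]$physprop,
             check.names = FALSE, stringsAsFactors = FALSE)
}))

print(compounds)
print(physprop)
```

The same pattern would apply to the toxicity tibble. Real results may contain missing or failed entries, so a check for those before binding is probably needed.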

Related

Override hidden metadata within a tbl_graph object

Through a bunch of manipulations I recombined a tbl_graph that has this structure:
# A tbl_graph: 98 nodes and 78 edges
#
# An undirected simple graph with 46 components
#
# Node Data: 98 x 4 (active)
w nucleus name alpha
<dbl> <int> <int> <dbl>
1 0.4 1 95 0.05
2 0.4 1 34 0.05
3 0.4 1 82 0.05
4 0.4 2 10 0.55
5 0.4 2 2 0.55
6 0.4 3 68 0.55
# ... with 92 more rows
#
# Edge Data: 78 x 3
from to color
<int> <int> <chr>
1 34 95 red
2 82 95 red
3 34 82 red
# ... with 75 more rows
However, when I call as.igraph on it (a necessary step for ggraph), the edges in the metadata are these:
IGRAPH 70d96f3 UN-- 98 78 --
+ attr: w (v/n), nucleus (v/n), name (v/n), alpha (v/n), color (e/c)
+ edges from 70d96f3 (vertex names):
[1] 3--18 40--18 3--40 34--53 79--93 20-- 5 20--76 5--76 8--83 66--75 66--78 75--78 94--54 41--89
[15] 4--89 4--41 16--31 67--77 9--28 58--28 58-- 9 57--80 82--63 10--59 27--39 35--19 36--42 91--42
[29] 36--91 52--64 25--52 25--64 50--33 50--65 33--65 7--84 97--14 96-- 1 96--43 96-- 6 43-- 1 6-- 1
[43] 6--43 21--62 86--62 21--86 30--49 37--17 37--90 17--90 92--44 68--69 68--26 68--38 68--72 68--24
[57] 69--26 69--38 69--72 69--24 38--26 72--26 26--24 72--38 38--24 72--24 48--71 71--51 48--51 23--74
[71] 88--60 61--88 61--60 47--85 98--85 98--47 95--81 29--46
As you can see, these are different edges. Is there an easy way to force igraph to override the metadata with the contents of the columns name, from, and to?
EDIT:
It seems to be related to the order of the nodes: if I rearrange by name, from and to change to match the new row index of the node they originally matched in name, which is something I don't want to happen.
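One reading of the EDIT above is that from and to in a tbl_graph are row positions in the node table, while as.igraph prints the name attribute of those rows. A minimal base R sketch under that assumption, using only the three nodes and edges visible in the printout (the data frames `nodes` and `edges` are hypothetical stand-ins, and no tidygraph or igraph call is made), shows how name-based endpoints could be translated into row positions before the graph is built:

```r
# Hypothetical stand-ins for the node and edge tables shown above.
nodes <- data.frame(name = c(95L, 34L, 82L), nucleus = c(1L, 1L, 1L))
edges <- data.frame(from = c(34L, 82L, 34L),
                    to = c(95L, 95L, 82L),
                    color = "red",
                    stringsAsFactors = FALSE)

# Translate endpoints that refer to `name` values into row positions.
edges_by_row <- edges
edges_by_row$from <- match(edges$from, nodes$name)
edges_by_row$to <- match(edges$to, nodes$name)

# Fail early if an edge refers to a name that is not in the node table.
stopifnot(!anyNA(edges_by_row$from), !anyNA(edges_by_row$to))

print(edges_by_row)
```

If the assumption holds, a graph rebuilt from `nodes` and `edges_by_row` should connect the nodes whose name values appear in the original from and to columns.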

Data Manipulation in R for Apriori

I have part of a data set, in CSV form, as shown below; the real number of rows and columns is larger than what is shown. I want to run apriori on this data set. Say I have this:
Maths Science C++ Java DC
[1] 75 44 55 56 88
[2] 56 88 54 78 44
The original data set has 30 columns (representing subjects) and 24 rows (representing students).
DATASET:link
I want to convert this data set into the form shown below:
[1] {Maths,DC}
[2] {Science,Java}
That is, a list of lists (I think that is what it is called) containing the column names. The list for a student shows the subjects in which he/she scored 75 marks or more; the rest of the subjects are dropped (this is the only condition of the problem).
E.g., the first student scored 75+ marks in DC and Maths, so his list includes only DC and Maths.
I am sorry for posting this, but I searched a lot on Stack Overflow and found a few working suggestions, yet couldn't reach the final goal.
My goal is to get a form like this:
[9834] {semi-finished bread,
bottled water,
soda,
bottled beer}
[9835] {chicken,
tropical fruit,
other vegetables,
vinegar,
shopping bags}
As given by:
library(arules)
inspect(Groceries)
Alternatively, I would appreciate it if anyone could suggest another way to represent the data that apriori can understand, as long as it follows the condition stated above.
(Sorry for the long post. I hope converting my data set into this format will help me study patterns in the student-subject data. Thanks a ton for all the help.)
library(plyr)
library(arules)
df <- read.table(text =
" 75 44 55 56 88
56 88 54 78 44")
names(df) <- c("Maths", "Science", "C++", "Java", "DC")
transactions <- as(alply(df, 1, function(x) names(x)[x >= 75]), "transactions")
inspect(transactions)
# items transactionID
# [1] {DC,Maths} 1
# [2] {Java,Science} 2
Edit: It works with your example dataset, too:
library(plyr)
library(arules)
df <- read.csv(file = url("https://drive.google.com/uc?export=download&id=0B3kdblyHw4qLR0dpT24xWUZGcGs"))
transactions <- as(alply(df, 1, function(x) names(x)[x >= 75]), "transactions")
inspect(transactions)
# items transactionID
# [1] {CD,CG,CN,DA,Data.Struc} 1
# [2] {CD,CG,CO,ML,OS} 2
# [3] {CN,Data.Struc,DC,DM,DMS} 3
# [4] {CHE,DD,DM,EC,EE} 4
# [5] {CHE,CN,MATHS,PHY} 5
# [6] {Data.Science,DM,DMS,ML,OS} 6
# [7] {CD,DA,Data.Struc,EC,MATHS} 7
# [8] {CG,CHE,CN,CO,OS} 8
# [9] {CN,CO,Data.Science,DC,DMS} 9
# [10] {DC,DD,EC,EE,PHY} 10
# [11] {CHE,DD,DMS,MATHS,PHY} 11
# [12] {CN,Data.Science,DM,MATHS,ML} 12
# [13] {CD,CG,DA,Data.Science,Data.Struc} 13
# [14] {CG,CO,EE,MATHS,OS} 14
# [15] {CN,CO,DC,DMS,PHY} 15
# [16] {CN,CO,DD,EC,EE} 16
# [17] {CHE,DA,EE,MATHS,PHY} 17
# [18] {Data.Science,DD,DM,ML,PHY} 18
# [19] {CD,CO,DA,Data.Struc,DC} 19
# [20] {CG,CO,DD,DM,OS} 20
# [21] {CG,CN,DA,DC,DMS} 21
# [22] {DD,EC,EE,ML,OS} 22
# [23] {CHE,CN,Data.Struc,MATHS,PHY} 23
# [24] {CG,Data.Science,DM,EE,ML} 24
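The row-wise selection in the answer above can also be written without plyr. The following is a base R sketch of that step only, using the two example rows from the question; the final conversion with as(..., "transactions") is left as a comment so the block runs without arules installed.

```r
# The two example rows from the question; check.names = FALSE keeps "C++" intact.
df <- data.frame(Maths = c(75, 56), Science = c(44, 88), `C++` = c(55, 54),
                 Java = c(56, 78), DC = c(88, 44), check.names = FALSE)

# For each student (row), keep the names of the subjects with marks >= 75.
items <- lapply(seq_len(nrow(df)), function(i) {
  names(df)[unlist(df[i, ]) >= 75]
})

print(items)

# With arules loaded, the list could then be coerced as in the answer:
# transactions <- as(items, "transactions")
```

Changing the threshold or the comparison inside the anonymous function is the only edit needed to apply a different condition.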

R: Creating n-grams in R with Asian / Chinese characters?

So I'm trying to create bigrams and trigrams of a given set of text, which just happens to be Chinese. At first glance, the tau package seems almost perfect for the application. Given the following set-up, I get close to what I want:
library(tau)
q <- c("天","平","天","平","天","平","天","平","天空","昊天","今天的天气很好")
textcnt(q,method="ngram",n=3L,decreasing=TRUE)
The only problem is that the output is in unicode character strings, not the characters themselves. So I get something like:
_ + < <U <U+ > U U+ 9 +5 5 U+5 >_ _< _<U +59 59 2 29 29> 592 7 92
22 19 19 19 19 19 19 19 17 14 14 14 11 11 11 9 9 8 8 8 8 8 8
929 9> >< ><U 9>_ E +5E 3 3> 3>_ 5E 5E7 6 73 73> A E7 E73 4 8 9>< A> +6
8 8 8 8 5 5 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 2
+7 4> 4>< 7A A>< C U+6 U+7 +4 +4E +5F +66 +6C +76 +7A 0 0A 0A> 1 14 14> 4E 4EC
2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
597 5F 5F8 60 60A 66 660 68 684 6C 6C1 76 768 7A7 7A> 7D 7D> 84 84> 88 88> 8> 8><
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
97 97D A7 A7A A>_ C1 C14 CA CA> D D> D>_ EC ECA F F8 F88 U+4
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
I tried to write something that would perform a similar function, but I can't wrap my head around the code for anything more than a monogram (apologies if the code is inefficient or ugly, I'm doing my best here). The advantage of this method is also that I can get word-counts within individual "documents" by simply examining DTM, which is kind of nice.
data <- c(NA, NA, NA)
names(data) <- c("doc", "term", "freq")
terms <- NA
for(i in 1:length(q)){
temp <- data.frame(i,table(strsplit(q[i],"")))
names(temp) <- c("doc", "term", "freq")
data <- rbind(data, temp)
}
data <- data[-1,]
DTM <- xtabs(freq ~ doc + term, data)
colSums(DTM)
This actually gives a nice little output:
天 平 空 昊 今 好 很 气 的
8 4 1 1 1 1 1 1 1
Does anyone have any suggestions for using tau or altering my own code to achieve bigrams and trigrams for my Chinese characters?
Edit:
As requested in the comments, here is my sessionInfo() output:
R version 3.0.0 (2013-04-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tau_0.0-15
loaded via a namespace (and not attached):
[1] tools_3.0.0
The stringdist package will do that for you:
> library(stringdist)
> q <- c("天","平","天","平","天","平","天","平","天空","昊天","今天的天气很好")
> v1 <- c("天","平","天","平","天","平","天","平","天空","昊天","今天的天气很好")
> t(qgrams(v1, q=1))
V1
天 8
平 4
空 1
昊 1
...
> v2 <- c("天气气","平","很好平","天空天空天空","昊天","今天的天天气很好")
> t(qgrams(v2, q=2))
V1
天气 2
气气 1
空天 2
天空 3
天的 1
天天 3
今天 1
...
I transpose the returned matrices because R renders them incorrectly with regard to column width, which ends up being the length of the Unicode-ID character string (e.g. "<U+6C14><U+6C14>").
In case you are interested in further details about the stringdist package - I recommend this text: http://www.joyofdata.de/blog/comparison-of-string-distance-algorithms ;)
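For the "altering my own code" half of the question, a base R sketch is given below. It assumes the session handles the strings as UTF-8, so that strsplit(x, "") splits on characters rather than bytes; the helper name `char_ngrams` is made up for illustration. Unlike tau's default behaviour shown in the question, it does not pad with boundary markers.

```r
# Hypothetical helper: all character n-grams of a single string.
char_ngrams <- function(x, n) {
  chars <- strsplit(x, "")[[1]]
  if (length(chars) < n) {
    return(character(0))  # string too short to contain an n-gram
  }
  starts <- seq_len(length(chars) - n + 1)
  vapply(starts, function(i) {
    paste(chars[i:(i + n - 1)], collapse = "")
  }, character(1))
}

docs <- c("天", "平", "天", "平", "天", "平", "天", "平",
          "天空", "昊天", "今天的天气很好")

# Corpus-wide bigram and trigram counts.
bigrams <- table(unlist(lapply(docs, char_ngrams, n = 2)))
trigrams <- table(unlist(lapply(docs, char_ngrams, n = 3)))

print(sort(bigrams, decreasing = TRUE))
print(sort(trigrams, decreasing = TRUE))
```

Per-document counts, as in the DTM from the question, could be obtained by keeping the lapply() result as a list instead of calling unlist() on it.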

Parsing Deeply Nested JSON Structures in R Using RJSONIO

I suspect I'm missing something obvious here, but how do I parse deeply nested structures in R using RJSONIO?
For example - suppose I directly want to reference $familyName in results.data.json$MRData$RaceTable$Races[[1]]$Results[[8]]$Driver as grabbed using the following snippet:
require(RJSONIO)
resultsURL='http://ergast.com/api/f1/2012/1/results.json'
results.data.json=fromJSON(resultsURL)
RJSONIO doesn't appear to want to parse the ..$Results[[i]] data as structured elements?
require(RJSONIO)
somedata1<-list(a=1,b='w')
fromJSON(toJSON(somedata1))
# $a
# [1] 1
# $b
# [1] "w"
somedata2<-list(a=1,b=2)
fromJSON(toJSON(somedata2))
# a b
# 1 2
somedata3<-list(a='v',b='w')
fromJSON(toJSON(somedata3))
# a b
# "v" "w"
fromJSON(toJSON(somedata3),simplify=StrictNumeric)
# $a
# [1] "v"
# $b
# [1] "w"
fromJSON(toJSON(somedata2),simplify=FALSE)
# $a
# [1] 1
# $b
# [1] 2
fromJSON(toJSON(somedata3),simplifyWithNames = FALSE)
# $a
# [1] "v"
# $b
# [1] "w"
fromJSON(toJSON(somedata2),simplifyWithNames = FALSE)
# $a
# [1] 1
# $b
# [1] 2
As the examples above show, by default RJSONIO simplifies "collections/arrays of homogeneous scalar elements to R vectors". This simplification can be controlled using simplify or simplifyWithNames. In your example you can do any of the following to access the element you want:
require(RJSONIO)
resultsURL='http://ergast.com/api/f1/2012/1/results.json'
results.data.json=fromJSON(resultsURL)
results.data.json$MRData$RaceTable$Races[[1]]$Results[[8]]$Driver['familyName']
# familyName
# "Pérez"
results.data.json=fromJSON(resultsURL,simplify=FALSE)
results.data.json$MRData$RaceTable$Races[[1]]$Results[[8]]$Driver$familyName
# [1] "Pérez"
results.data.json=fromJSON(resultsURL,simplify=StrictNumeric)
results.data.json$MRData$RaceTable$Races[[1]]$Results[[8]]$Driver$familyName
# [1] "Pérez"
results.data.json=fromJSON(resultsURL,simplifyWithNames = FALSE)
results.data.json$MRData$RaceTable$Races[[1]]$Results[[8]]$Driver$familyName
# [1] "Pérez"
The jsonlite package is a fork of RJSONIO which tries to use a smarter mapping between R and JSON structures. I think this might make your life easier:
> library(jsonlite)
> x = fromJSON('http://ergast.com/api/f1/2012/1/results.json')
> x$MRData$RaceTable$Races$Results[[1]]$Driver
driverId code url
1 button BUT http://en.wikipedia.org/wiki/Jenson_Button
2 vettel VET http://en.wikipedia.org/wiki/Sebastian_Vettel
3 hamilton HAM http://en.wikipedia.org/wiki/Lewis_Hamilton
4 webber WEB http://en.wikipedia.org/wiki/Mark_Webber
5 alonso ALO http://en.wikipedia.org/wiki/Fernando_Alonso
6 kobayashi KOB http://en.wikipedia.org/wiki/Kamui_Kobayashi
7 raikkonen RAI http://en.wikipedia.org/wiki/Kimi_R%C3%A4ikk%C3%B6nen
8 perez PER http://en.wikipedia.org/wiki/Sergio_P%C3%A9rez
9 ricciardo RIC http://en.wikipedia.org/wiki/Daniel_Ricciardo
10 resta DIR http://en.wikipedia.org/wiki/Paul_di_Resta
11 vergne VER http://en.wikipedia.org/wiki/Jean-%C3%89ric_Vergne
12 rosberg ROS http://en.wikipedia.org/wiki/Nico_Rosberg
13 maldonado MAL http://en.wikipedia.org/wiki/Pastor_Maldonado
14 glock GLO http://en.wikipedia.org/wiki/Timo_Glock
15 pic PIC http://en.wikipedia.org/wiki/Charles_Pic
16 bruno_senna SEN http://en.wikipedia.org/wiki/Bruno_Senna
17 massa MAS http://en.wikipedia.org/wiki/Felipe_Massa
18 kovalainen KOV http://en.wikipedia.org/wiki/Heikki_Kovalainen
19 petrov PET http://en.wikipedia.org/wiki/Vitaly_Petrov
20 michael_schumacher MSC http://en.wikipedia.org/wiki/Michael_Schumacher
21 grosjean GRO http://en.wikipedia.org/wiki/Romain_Grosjean
22 hulkenberg HUL http://en.wikipedia.org/wiki/Nico_H%C3%BClkenberg
23 rosa DLR http://en.wikipedia.org/wiki/Pedro_de_la_Rosa
24 karthikeyan KAR http://en.wikipedia.org/wiki/Narain_Karthikeyan
givenName familyName dateOfBirth nationality
1 Jenson Button 1980-01-19 British
2 Sebastian Vettel 1987-07-03 German
3 Lewis Hamilton 1985-01-07 British
4 Mark Webber 1976-08-27 Australian
5 Fernando Alonso 1981-07-29 Spanish
6 Kamui Kobayashi 1986-09-13 Japanese
7 Kimi Räikkönen 1979-10-17 Finnish
8 Sergio Pérez 1990-01-26 Mexican
9 Daniel Ricciardo 1989-07-01 Australian
10 Paul di Resta 1986-04-16 Scottish
11 Jean-Éric Vergne 1990-04-25 French
12 Nico Rosberg 1985-06-27 German
13 Pastor Maldonado 1985-03-09 Venezuelan
14 Timo Glock 1982-03-18 German
15 Charles Pic 1990-02-15 French
16 Bruno Senna 1983-10-15 Brazilian
17 Felipe Massa 1981-04-25 Brazilian
18 Heikki Kovalainen 1981-10-19 Finnish
19 Vitaly Petrov 1984-09-08 Russian
20 Michael Schumacher 1969-01-03 German
21 Romain Grosjean 1986-04-17 French
22 Nico Hülkenberg 1987-08-19 German
23 Pedro de la Rosa 1971-02-24 Spanish
24 Narain Karthikeyan 1977-01-14 Indian
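The difference between the two access styles can be tried offline. The sketch below assumes jsonlite is installed and uses a hand-written JSON string that imitates only the nesting of the Ergast response (two drivers and two fields, taken from the table above); it is not the real API payload.

```r
library(jsonlite)

# Hypothetical, heavily trimmed imitation of the response structure.
txt <- '{"MRData": {"RaceTable": {"Races": [
  {"Results": [
    {"Driver": {"driverId": "button", "familyName": "Button"}},
    {"Driver": {"driverId": "vettel", "familyName": "Vettel"}}
  ]}
]}}}'

# Without simplification everything stays a nested list,
# so each array level needs [[i]] indexing.
nested <- fromJSON(txt, simplifyVector = FALSE)
first_name <- nested$MRData$RaceTable$Races[[1]]$Results[[1]]$Driver$familyName

# With the default simplification, arrays of records become data frames,
# so a whole column can be pulled out at once.
simplified <- fromJSON(txt)
all_names <- simplified$MRData$RaceTable$Races$Results[[1]]$Driver$familyName

print(first_name)
print(all_names)
```

Which form is more convenient depends on whether a single element or a whole column is needed.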

Selecting rows in data.frame based on character strings

I've a data.frame with row.names as in test.
test <-
c("Env_1990:trait_KPS", "Env_1990:trait_SPSM", "Env_1990:trait_TKW",
"Env_1990:trait_Yield", "Env_1991:trait_KPS", "Env_1991:trait_SPSM",
"Env_1991:trait_TKW", "Env_1991:trait_Yield", "Env_1992:trait_KPS",
"Env_1992:trait_SPSM", "Env_1992:trait_TKW", "Env_1992:trait_Yield",
"Env_1993:trait_KPS", "Env_1993:trait_SPSM", "Env_1993:trait_TKW",
"Env_1993:trait_Yield", "Env_1994:trait_KPS", "Env_1994:trait_SPSM",
"Env_1994:trait_TKW", "Env_1994:trait_Yield", "Env_1995:trait_KPS",
"Env_1995:trait_SPSM", "Env_1995:trait_TKW", "Env_1995:trait_Yield",
"Gen_B88:Env_1990:trait_KPS", "Gen_B88:Env_1990:trait_SPSM",
"Gen_B88:Env_1990:trait_TKW", "Gen_B88:Env_1990:trait_Yield",
"Gen_B88:Env_1991:trait_KPS", "Gen_B88:Env_1991:trait_SPSM",
"Gen_B88:Env_1991:trait_TKW", "Gen_B88:Env_1991:trait_Yield",
"Gen_B88:Env_1992:trait_KPS", "Gen_B88:Env_1992:trait_SPSM",
"Gen_B88:Env_1992:trait_TKW", "Gen_B88:Env_1992:trait_Yield",
"Gen_B88:Env_1993:trait_KPS", "Gen_B88:Env_1993:trait_SPSM",
"Gen_B88:Env_1993:trait_TKW", "Gen_B88:Env_1993:trait_Yield")
I want to select only those rows which start with Env_. I tried this code in R:
grep(pattern="[Env_]", x=test)
This code gives me all rows because Env_ appears in every row name. How can I select only the rows which start with Env_? Thanks in advance for your help.
You want to add the ^ character for beginning of line/string:
> grep("^Env_", test)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
> grep("^Env_", test, value = TRUE)
[1] "Env_1990:trait_KPS" "Env_1990:trait_SPSM" "Env_1990:trait_TKW"
[4] "Env_1990:trait_Yield" "Env_1991:trait_KPS" "Env_1991:trait_SPSM"
[7] "Env_1991:trait_TKW" "Env_1991:trait_Yield" "Env_1992:trait_KPS"
[10] "Env_1992:trait_SPSM" "Env_1992:trait_TKW" "Env_1992:trait_Yield"
[13] "Env_1993:trait_KPS" "Env_1993:trait_SPSM" "Env_1993:trait_TKW"
[16] "Env_1993:trait_Yield" "Env_1994:trait_KPS" "Env_1994:trait_SPSM"
[19] "Env_1994:trait_TKW" "Env_1994:trait_Yield" "Env_1995:trait_KPS"
[22] "Env_1995:trait_SPSM" "Env_1995:trait_TKW" "Env_1995:trait_Yield"
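For context on why the original attempt matched everything: in a regular expression, "[Env_]" is a character class that matches any single one of the characters E, n, v or _, so every name containing at least one of them qualifies. A short sketch on a three-element subset of the names, with a made-up `value` column, contrasts the two patterns and applies the anchored one to the rows of a data.frame:

```r
# Small subset of the row names from the question.
test <- c("Env_1990:trait_KPS",
          "Gen_B88:Env_1990:trait_KPS",
          "Gen_B88:Env_1991:trait_TKW")

# Hypothetical data.frame using those names as row names.
df <- data.frame(value = 1:3, row.names = test)

# Character class: matches any string containing E, n, v or _.
class_hits <- grep("[Env_]", test)

# Anchored literal: matches only strings that begin with "Env_".
anchor_hits <- grep("^Env_", test)

# Row selection; drop = FALSE keeps a data.frame even with one column.
env_rows <- df[grepl("^Env_", rownames(df)), , drop = FALSE]

# Equivalent selection without a regular expression (R >= 3.3.0).
prefix_rows <- df[startsWith(rownames(df), "Env_"), , drop = FALSE]

print(env_rows)
```

startsWith() avoids regular expressions altogether, which can be convenient when the prefix contains characters that would otherwise need escaping.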
