# parse PubMed data
library(XML) # xpath
library(rentrez) # entrez_fetch
pmids <- c("25506969","25032371","24983039","24983034","24983032","24983031","26386083",
"26273372","26066373","25837167","25466451","25013473","23733758")
# The IDs above are a mix of books and journal articles
# ID 23733758 is a journal article with no abstract
data.pubmed <- entrez_fetch(db = "pubmed", id = pmids, rettype = "xml",
parsed = TRUE)
abstracts <- xpathApply(data.pubmed, "//Abstract", xmlValue)
names(abstracts) <- pmids
This works well when every record has an abstract. However, when a PMID (here 23733758) has no PubMed abstract (or is a book record, or something else), that record is skipped, and assigning the names fails with the error: 'names' attribute [5] must be the same length as the vector [4]
Q: How can I pass multiple paths/nodes so that I can extract journal articles, books, or reviews?
UPDATE: hrbrmstr's solution addresses the NA. But can xpathApply take multiple nodes, like c(//Abstract, //ReviewArticle, etc.)?
You have to attack it one tag element up:
abstracts <- xpathApply(data.pubmed, "//PubmedArticle//Article", function(x) {
  val <- xpathSApply(x, "./Abstract", xmlValue)
  if (length(val) == 0) val <- NA_character_
  val
})
names(abstracts) <- pmids
str(abstracts)
## List of 5
## $ 24019382: chr "Adenocarcinoma of the lung, a leading cause of cancer death, frequently displays mutational activation of the KRAS proto-oncoge"| __truncated__
## $ 23927882: chr "Mutations in components of the mitogen-activated protein kinase (MAPK) cascade may be a new candidate for target for lung cance"| __truncated__
## $ 23825589: chr "Aberrant activation of MAP kinase signaling pathway and loss of tumor suppressor LKB1 have been implicated in lung cancer devel"| __truncated__
## $ 23792568: chr "Sorafenib, the first agent developed to target BRAF mutant melanoma, is a multi-kinase inhibitor that was approved by the FDA f"| __truncated__
## $ 23733758: chr NA
Per your comment, here's an alternate way to do this:
str(xpathApply(data.pubmed, '//PubmedArticle//Article', function(x) {
xmlValue(xmlChildren(x)$Abstract)
}))
## List of 5
## $ : chr "Adenocarcinoma of the lung, a leading cause of cancer death, frequently displays mutational activation of the KRAS proto-oncoge"| __truncated__
## $ : chr "Mutations in components of the mitogen-activated protein kinase (MAPK) cascade may be a new candidate for target for lung cance"| __truncated__
## $ : chr "Aberrant activation of MAP kinase signaling pathway and loss of tumor suppressor LKB1 have been implicated in lung cancer devel"| __truncated__
## $ : chr "Sorafenib, the first agent developed to target BRAF mutant melanoma, is a multi-kinase inhibitor that was approved by the FDA f"| __truncated__
## $ : chr NA
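On the follow-up question: xpathApply() takes a single XPath expression rather than a vector of paths, but XPath's "|" operator builds a union, so one call can match several node types. A minimal sketch, assuming book records are exposed as PubmedBookArticle/BookDocument (verify against your fetched XML):
# "|" unions the two paths; adjust the book path to whatever your XML actually uses
abstracts <- xpathApply(
  data.pubmed,
  "//PubmedArticle//Article | //PubmedBookArticle//BookDocument",
  function(x) {
    val <- xpathSApply(x, "./Abstract", xmlValue)
    if (length(val) == 0) NA_character_ else val
  }
)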
OK, I know this is not a reproducible example, because I only manage to get this error with this specific data.table, which is almost 1 GB, so I don't know how to send it to you. Anyway, I am completely lost; if someone knows what is happening here, please tell me.
I have the original data.table and some other ones obtained by just changing the skip argument.
> original <- fread('json.csv')
> skip100 <- fread('json.csv', skip = 100, sep = ',')
> skip1000 <- fread('json.csv', skip = 1000, sep = ',')
> skip10000 <- fread('json.csv', skip = 10000, sep = ',')
> str(original)
Classes ‘data.table’ and 'data.frame': 29315 obs. of 7 variables:
$ id : chr "0015023cc06b5362d332b3baf348d11567ca2fbb" "004f0f8bb66cf446678dc13cf2701feec4f36d76" "00d16927588fb04d4be0e6b269fc02f0d3c2aa7b" "0139ea4ca580af99b602c6435368e7fdbefacb03" ...
$ title : chr "The RNA pseudoknots in foot-and-mouth disease virus are dispensable for genome replication but essential for th"| __truncated__ "Healthcare-resource-adjusted vulnerabilities towards the 2019-nCoV epidemic across China" "Real-time, MinION-based, amplicon sequencing for lineage typing of infectious bronchitis virus from upper respiratory samples" "A Combined Evidence Approach to Prioritize Nipah Virus Inhibitors" ...
$ authors : chr "Joseph C Ward,Lidia Lasecka-Dykes,Chris Neil,Oluwapelumi Adeyemi,Sarah , Gold,Niall Mclean,Caroline Wrig"| __truncated__ "Hanchu Zhou,Jiannan Yang,Kaicheng Tang,â\200 ,Qingpeng Zhang,Zhidong Cao,Dirk Pfeiffer,Daniel Dajun Zeng" "Salman L Butt,Eric C Erwood,Jian Zhang,Holly S Sellers,Kelsey Young,Kevin K Lahmers,James B Stanton" "Nishi Kumari,Ayush Upadhyay,Kishan Kalia,Rakesh Kumar,Kanika Tuteja,Rani Paul,Eugenia Covernton,Tina Sh"| __truncated__ ...
$ institution: chr "" "City University of Hong Kong,City University of Hong Kong,City University of Hong Kong,NA,City University of Ho"| __truncated__ "University of Georgia,University of Georgia,University of Georgia,University of Georgia,University of Georgia,V"| __truncated__ "Panjab University,Delhi University,D.A.V. College,CSIR-Institute of Microbial Technology,Panjab University,Univ"| __truncated__ ...
$ country : chr "" "China,China,China,NA,China,China,China,China" "USA,USA,USA,USA,USA,USA,USA" "India,India,India,India,India,India,France,India,NA,India" ...
$ abstract : chr "word count: 194 22 Text word count: 5168 23 24 25 author/funder. All rights reserved. No reuse allowed without "| __truncated__ "" "Infectious bronchitis (IB) causes significant economic losses in the global poultry industry. Control of infect"| __truncated__ "Nipah Virus (NiV) came into limelight recently due to an outbreak in Kerala, India. NiV causes severe disease a"| __truncated__ ...
$ body_text : chr "VP3, and VP0 (which is further processed to VP2 and VP4 during virus assembly) (6). The P2 64 and P3 regions en"| __truncated__ "The 2019-nCoV epidemic has spread across China and 24 other countries 1-3 as of February 8, 2020 . The mass qua"| __truncated__ "Infectious bronchitis (IB), which is caused by infectious bronchitis virus (IBV), is one of the most important "| __truncated__ "Nipah is an infectious negative-sense single-stranded RNA virus which belongs to the genus henipavirus and fami"| __truncated__ ...
- attr(*, ".internal.selfref")=<externalptr>
The number of observations is consistent (original minus skip) for skip = 100 and skip = 10000, but not for skip = 1000, as shown below.
> nrow(original)
[1] 29315
> nrow(skip100)
[1] 29215
> nrow(skip1000)
[1] 28316
> nrow(skip10000)
[1] 19315
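To spell out what I mean by "consistent": assuming skip drops exactly that many data rows, the expected versus observed row counts are:
# expected assumes each skipped line is exactly one record
# (i.e. no embedded newlines inside quoted fields)
skips <- c(100, 1000, 10000)
expected <- nrow(original) - skips
observed <- c(nrow(skip100), nrow(skip1000), nrow(skip10000))
data.frame(skip = skips, expected, observed)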
What is happening?
I was wondering if there is a way to automatically pull the Russell 3000 holdings from the iShares website in R using read_html (or other rvest functions)?
url: https://www.ishares.com/us/products/239714/ishares-russell-3000-etf
(all holdings in the table on the bottom, not just top 10)
So far I have had to copy and paste into an Excel document, save as a CSV, and use read_csv to create a tibble in R of the ticker, company name, and sector.
I have used read_html to pull the S&P 500 holdings from Wikipedia, but can't seem to figure out the path I need to have R automatically pull from the iShares website (and I haven't found other reputable websites with all ~3,000 holdings). Here is the code used for the S&P 500:
read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")%>%
html_node("table.wikitable")%>%
html_table()%>%
select('Symbol','Security','GICS Sector','GICS Sub Industry')%>%
as_tibble()
First post, sorry if it is hard to follow...
Any help would be much appreciated
Michael
IMPORTANT
According to the Terms & Conditions listed on BlackRock's website (here):
Use any robot, spider, intelligent agent, other automatic device, or manual process to search, monitor or copy this Website or the reports, data, information, content, software, products services, or other materials on, generated by or obtained from this Website, whether through links or otherwise (collectively, "Materials"), without BlackRock's permission, provided that generally available third-party web browsers may be used without such permission;
I suggest you make sure you are abiding by those terms before using their data in a way that violates them. For educational purposes, here is how the data would be obtained:
First you need to get at the actual data (not the interactive JavaScript). How familiar are you with the developer tools in your browser? If you navigate through the website and track the network traffic, you will notice a large AJAX request:
https://www.ishares.com/us/products/239714/ishares-russell-3000-etf/1467271812596.ajax?tab=all&fileType=json
This is the data you need (all of it). After locating it, it is just a matter of cleaning the data. Example:
library(jsonlite)

# Locate the raw data by searching the Network traffic:
url <- "https://www.ishares.com/us/products/239714/ishares-russell-3000-etf/1467271812596.ajax?tab=all&fileType=json"

# pull the data in via fromJSON
x <- jsonlite::fromJSON(url, flatten = TRUE)
# Large list (10.4 Mb)

# use a combination of `lapply` and `rapply` to unlist,
# structuring the results as one large list
y <- lapply(rapply(x, enquote, how = "unlist"), eval)
# Large list (50677 elements, 6.9 Mb)

y1 <- y[1:15]
> str(y1)
List of 15
$ aaData1 : chr "MSFT"
$ aaData2 : chr "MICROSOFT CORP"
$ aaData3 : chr "Equity"
$ aaData.display: chr "2.95"
$ aaData.raw : num 2.95
$ aaData.display: chr "109.41"
$ aaData.raw : num 109
$ aaData.display: chr "2,615,449.00"
$ aaData.raw : int 2615449
$ aaData.display: chr "$286,156,275.09"
$ aaData.raw : num 2.86e+08
$ aaData.display: chr "286,156,275.09"
$ aaData.raw : num 2.86e+08
$ aaData14 : chr "Information Technology"
$ aaData15 : chr "2588173"
Updated: In case you are unable to clean the data, here you go:
testdf <- data.frame(matrix(unlist(y), nrow = 50677, byrow = T),
                     stringsAsFactors = FALSE)

# where we want to break the DF (every nth row)
breaks <- 17
# number of rows in the full DF
nbr.row <- nrow(testdf)
repeats <- rep(1:ceiling(nbr.row / breaks), each = breaks)[1:nbr.row]

# split the DF for clean-up
newDF <- split(testdf, repeats)
Result:
> str(head(newDF))
List of 6
$ 1:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "MSFT" "MICROSOFT CORP" "Equity" "2.95" ...
$ 2:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "AAPL" "APPLE INC" "Equity" "2.89" ...
$ 3:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "AMZN" "AMAZON COM INC" "Equity" "2.34" ...
$ 4:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "BRKB" "BERKSHIRE HATHAWAY INC CLASS B" "Equity" "1.42" ...
$ 5:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "FB" "FACEBOOK CLASS A INC" "Equity" "1.35" ...
$ 6:'data.frame': 17 obs. of 1 variable:
..$ matrix.unlist.y...nrow...50677..byrow...T.: chr [1:17] "JNJ" "JOHNSON & JOHNSON" "Equity" "1.29" ...
I am trying to scrape headlines off a few news websites using html_nodes and SelectorGadget, but find that some do not work, returning "{xml_nodeset (0)}". For example, the code below gives that result:
library(rvest)

url_cnn = 'https://edition.cnn.com/'
webpage_cnn = read_html(url_cnn)
headlines_html_cnn = html_nodes(webpage_cnn, '.cd__headline-text')
headlines_html_cnn
The ".cd__headline-text" I got using the SelectorGadget.
Other websites work such as:
url_cnbc = 'https://www.cnbc.com/world/?region=world'
webpage_cnbc = read_html(url_cnbc)
headlines_html_cnbc = html_nodes(webpage_cnbc,'.headline')
headlines_html_cnbc
That gives a full set of headlines. Any ideas why some websites return the "{xml_nodeset (0)}" result?
Please, please, please stop using Selector Gadget. I know Hadley swears by it but he's 100% wrong. What you see with Selector Gadget is what's been created in the DOM after javascript has been executed and other resources have been loaded asynchronously. Please use "View Source". That's what you get when you use read_html().
Having said that, I'm impressed CNN is as generous as they are (you definitely can scrape this page), and the content is most certainly on that page, just not rendered (which is likely even better):
Now, that's JavaScript, not JSON, so we'll need some help from the V8 package:
library(rvest)
library(V8)
ctx <- v8()
# get the page source
pg <- read_html("https://edition.cnn.com/")
# find the node with the data in a <script> tag
html_node(pg, xpath = ".//script[contains(., 'var CNN = CNN || {};CNN.isWebview')]") %>%
  html_text() %>% # get the plain text
  ctx$eval()      # send it to V8 to execute it

cnn <- ctx$get("CNN") # get the data ^^ just created
After exploring the cnn object:
str(cnn[["contentModel"]][["siblings"]][["articleList"]], 1)
## 'data.frame': 55 obs. of 7 variables:
## $ uri : chr "/2018/11/16/politics/cia-assessment-khashoggi-assassination-saudi-arabia/index.html" "/2018/11/16/politics/hunt-crown-prince-saudi-un-resolution/index.html" "/2018/11/15/politics/us-khashoggi-sanctions/index.html" "/2018/11/15/middleeast/jamal-khashoggi-saudi-prosecutor-death-penalty-intl/index.html" ...
## $ headline : chr "<strong>CIA determines Saudi Crown Prince personally ordered journalist's death, senior US official says</strong>" "Saudi crown prince's 'fit' over UN resolution" "US issues sanctions on 17 Saudis over Khashoggi murder" "Saudi prosecutor seeks death penalty for Khashoggi killers" ...
## $ thumbnail : chr "//cdn.cnn.com/cnnnext/dam/assets/181025083025-prince-mohammed-bin-salman-small-11.jpg" "//cdn.cnn.com/cnnnext/dam/assets/181025083025-prince-mohammed-bin-salman-small-11.jpg" "//cdn.cnn.com/cnnnext/dam/assets/181025171830-jamal-khashoggi-small-11.jpg" "//cdn.cnn.com/cnnnext/dam/assets/181025171830-jamal-khashoggi-small-11.jpg" ...
## $ duration : chr "" "" "" "" ...
## $ description: chr "The CIA has determined that Saudi Crown Prince Mohammed bin Salman personally ordered the killing of journalist"| __truncated__ "Multiple sources tell CNN that a much-anticipated United Nations Security Council resolution calling for a cess"| __truncated__ "The Trump administration on Thursday imposed penalties on 17 individuals over their alleged roles in the <a hre"| __truncated__ "Saudi prosecutors said Thursday they would seek the death penalty for five people allegedly involved in the mur"| __truncated__ ...
## $ layout : chr "" "" "" "" ...
## $ iconType : chr NA NA NA NA ...
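For the headlines themselves, one way to strip the embedded markup (e.g. the <strong> tags) is to run each headline back through xml2; this is a sketch that assumes the structure shown above:
library(xml2)
articles <- cnn[["contentModel"]][["siblings"]][["articleList"]]
# each headline may contain HTML fragments, so parse each one and keep only the text
headlines <- vapply(
  articles$headline,
  function(h) xml_text(read_html(paste0("<p>", h, "</p>"))),
  character(1),
  USE.NAMES = FALSE
)
head(headlines)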
I am trying to import an Excel spreadsheet of a balance sheet into R, and I would like it to end up looking more or less like the balance sheet does now.
Assets 2011, 2010, 2009
Non current assets 32.322 3.111
intangible assets 12,222
Something along those lines. I am also trying to import the second tab, which is a different balance sheet. The idea is that I will probably have 50 or more balance sheets. Would this be inefficient for analysis?
I am only interested in a few of the same variables from each balance sheet (think current assets, non-current assets for all the years, etc.). Is it possible to import just specific rows and columns from an Excel spreadsheet?
For instance, just import:
A) Non current assets 32.322 3.111 322
B) Current assets 345 543 2.233
etc.? The row names do not change, so could I use a function to do this?
Look at quantmod!
library(quantmod)
library(xlsx)
getFin("GS")
gs_BS <- GS.f$BS$A
str(gs_BS)
#num [1:42, 1:4] 106533 NA 113003 71883 NA ...
#- attr(*, "dimnames")=List of 2
# ..$ : chr [1:42] "Cash & Equivalents" "Short Term Investments" "Cash and Short Term Investments" "Accounts Receivable - Trade, Net" ...
# ..$ : chr [1:4] "2015-12-31" "2014-12-31" "2013-12-31" "2012-12-31"
#- attr(*, "col_desc")= chr [1:4] "As of 2015-12-31" "As of 2014-12-31" "As of 2013-12-31" "As of 2012-12-31"
transposed <- t(gs_BS)
write.xlsx(transposed, "C:\\Users\\your_path_here\\Desktop\\bal_sheet.xlsx", row.names=FALSE)
transp <- read.xlsx("C:\\Users\\your_path_here\\Desktop\\bal_sheet.xlsx" , sheetName="Sheet1")
transp$year <- c("2015", "2014", "2013", "2012")
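If you also want to read just a block of rows/columns straight from the Excel balance sheets you already have, readxl can do that; a sketch with placeholder file, sheet, and cell range:
library(readxl)
# file name, sheet number, and range are placeholders -- adjust to your workbook
bs <- read_excel("balance_sheet.xlsx", sheet = 2, range = "A5:D7",
                 col_names = c("item", "y2011", "y2010", "y2009"))
# since the row labels never change, you can also read the whole sheet
# and keep only the rows you care about
full <- read_excel("balance_sheet.xlsx", sheet = 1)
keep <- c("Non current assets", "Current assets")
subset(full, full[[1]] %in% keep)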
This is good too.
require(quantmod)

equityList <- read.csv("EquityList.csv", header = FALSE, stringsAsFactors = FALSE)
names(equityList) <- c("Ticker")

for (i in seq_along(equityList$Ticker)) {
  temp <- getFinancials(equityList$Ticker[i], src = "google", auto.assign = FALSE)
  write.csv(temp$IS$A, paste(equityList$Ticker[i], "_Income_Statement(Annual).csv", sep = ""))
  write.csv(temp$BS$A, paste(equityList$Ticker[i], "_Balance_Sheet(Annual).csv", sep = ""))
  write.csv(temp$CF$A, paste(equityList$Ticker[i], "_Cash_Flow(Annual).csv", sep = ""))
  write.csv(temp$IS$Q, paste(equityList$Ticker[i], "_Income_Statement(Quarterly).csv", sep = ""))
  write.csv(temp$BS$Q, paste(equityList$Ticker[i], "_Balance_Sheet(Quarterly).csv", sep = ""))
  write.csv(temp$CF$Q, paste(equityList$Ticker[i], "_Cash_Flow(Quarterly).csv", sep = ""))
}
Also, check this out.
https://msperlin.github.io/pafdR/importingInternet.html
There are other ways to do very similar things.
I am using the R package tm.plugin.webmining. Using the function GoogleNewsSource(), I would like to query the news sorted by date and also from a specific date. Is there any parameter to query the news for a specific date?
library(tm)
library(tm.plugin.webmining)
searchTerm <- "Data Mining"
corpusGoog <- WebCorpus(GoogleNewsSource(params=list(hl="en", q=searchTerm,
ie="utf-8", num=10, output="rss" )))
headers <- meta(corpusGoog,tag="datetimestamp")
If you're looking for a data frame-like structure, this is how you'd go about creating it (note: not all fields are extracted from the corpus):
library(dplyr)
make_row <- function(elem) {
data.frame(timestamp=elem[[2]]$datetimestamp,
heading=elem[[2]]$heading,
description=elem[[2]]$description,
content=elem$content,
stringsAsFactors=FALSE)
}
dat <- bind_rows(lapply(corpusGoog, make_row))
str(dat)
## Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 10 obs. of 4 variables:
## $ timestamp : POSIXct, format: "2015-02-03 13:08:16" "2015-01-11 23:37:45" ...
## $ heading : chr "A guide to data mining with Hadoop - Information Age" "Barack Obama to seek limits on student data mining - Politico" "Is data mining riddled with risk or a natural hazard of the internet? - INTHEBLACK" "Why an obscure British data-mining company is worth $3 billion - Quartz" ...
## $ description: chr "Information AgeA guide to data mining with HadoopInformation AgeWith the advent of the Internet of Things and the transition fr"| __truncated__ "PoliticoBarack Obama to seek limits on student data miningPoliticoPresident Barack Obama on Monday is expected to call for toug"| __truncated__ "INTHEBLACKIs data mining riddled with risk or a natural hazard of the internet?INTHEBLACKData mining is now viewed as a serious"| __truncated__ "QuartzWhy an obscure British data-mining company is worth $3 billionQuartzTesco, the troubled British retail group, is starting"| __truncated__ ...
## $ content : chr "A guide to data mining with Hadoop\nHow businesses can realise and capitalise on the opportunities that Hadoop offers\nPosted b"| __truncated__ "By Stephanie Simon\n1/11/15 6:32 PM EST\nPresident Barack Obama on Monday is expected to call for tough legislation to protect "| __truncated__ "By Adam Courtenay\nData mining is now viewed as a serious security threat, but with all the hype, s"| __truncated__ "How We Buy\nJanuary 12, 2015\nTesco, the troubled British retail group, is starting over. After an accounting scandal , a serie"| __truncated__ ...
Then, you can do anything you want. For example:
dat %>%
arrange(timestamp) %>%
select(heading) %>%
head
## Source: local data frame [6 x 1]
##
## heading
## 1 The potential of fighting corruption through data mining - Transparency International (pre
## 2 Barack Obama to seek limits on student data mining - Politico
## 3 Why an obscure British data-mining company is worth $3 billion - Quartz
## 4 Parks and Rec Recap: Treat Yo Self to Some Data Mining - Indianapolis Monthly
## 5 Fraud and data mining in Vancouverâ\u0080¦just Outside the Lines - Vancouver Sun (blog)
## 6 'Parks and Rec' Data-Mining Episode Was Eerily True To Life - MediaPost Communications
If you want/need something else, you need to be clearer in your question.
I was looking at the Google query string and noticed that it passes startdate and enddate tags in the query if you click the dates on the right-hand side of the page.
You can use the same tag names and your results will be confined to the start and end dates.
GoogleFinanceSource(query, params = list(hl = "en", q = query, ie = "utf-8",
start = 0, num = 25, output = "rss",
startdate='2015-10-26', enddate = '2015-10-28'))