I am trying to figure out how to read a Google Chart data table into R.
For example, the source code of this page contains historical Peercoin daily prices. I would like to copy into an R matrix the content of the data table that begins at line 497 of the source with:
var data = google.visualization.arrayToDataTable([
['Period', right_title_name],
['2014/10/01 18:00', 0.01189974],
['2014/10/02 18:00', 0.01194000],
['2014/10/03 18:00', 0.01171897],
['2014/10/04 18:00', 0.01199999],
['2014/10/05 18:00', 0.01200000],
['2014/10/06 18:00', 0.01188685],
['2014/10/07 18:00', 0.01161999],
// data here
]);
I've installed several packages like RCurl, XML and data.table and followed examples from related questions (e.g. using fread, readHTMLTable and getURL), but I'm facing various issues reading the correct data from the source code: there is too much noise I can't filter out. For example, with RCurl:
library(RCurl)
address <- "http://alt19.com/19R/chart_showing_btc.php?shw=1&label=LTC_BTC&source=cryptsy&period=1day"
data <- getURL(address)
data contains everything, but I'm not able to select the dates and prices with strsplit(data, "some code here").
Could somebody suggest an approach to achieve this?
Thank you,
Florent
There's probably a better way, but what I usually do after getting the page source with getURL, as you posted, is some string manipulation.
My try:
library(RCurl)    # getURL()
library(stringr)  # str_locate(), str_trim()

pageSource <- getURL(address)
index1<-str_locate(pageSource,"'Period', right_title_name],")[[2]]
sourceCut1<-substr(pageSource,index1+1,nchar(pageSource))
index2<-str_locate(sourceCut1,"]);")[[1]]
sourceCut2<-substr(sourceCut1,1,index2-1)
#sourceCut2 is the part of page source with the data
data<-str_trim(strsplit(sourceCut2,"\n")[[1]]) #split data rows
dates<-gsub("^.*'([0-9/: ]+).*$", "\\1", data) #extract dates
dates<-as.POSIXct(dates,format="%Y/%m/%d %H:%M")
values<-as.numeric(gsub("^.*,([0-9 .]+).*$", "\\1", data)) #extract numeric values
mydata<-data.frame(dates=dates,values=values)
Note that this will keep working only if the structure of the data (date format, blank spaces, square brackets) remains unchanged; otherwise you will probably need to adjust some of the regexes.
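For a more compact alternative, here is a sketch that pulls the date/value pairs out of pageSource in one regex pass with str_match_all (it assumes each row keeps the exact ['YYYY/MM/DD HH:MM', value] format shown above):
# one regex pass over the whole source: group 1 = date, group 2 = value
rows <- str_match_all(
  pageSource,
  "\\['([0-9]{4}/[0-9]{2}/[0-9]{2} [0-9]{2}:[0-9]{2})',\\s*([0-9.]+)\\]"
)[[1]]
mydata <- data.frame(
  dates  = as.POSIXct(rows[, 2], format = "%Y/%m/%d %H:%M"),
  values = as.numeric(rows[, 3])
)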
This answer is highly specific to your situation (and that URL), but it may be enough of a base for others with similar challenges. You can use the V8 package to parse and interpret JavaScript, so I grab the page, extract the JavaScript for the table, clean it up so it can be evaluated easily, then post-process the result. It's not pretty and others might be able to optimize it, but it will get you what you need:
library(V8)
library(httr)
library(stringr)
library(dplyr)
library(magrittr)
pg <- GET("http://alt19.com/19R/chart_showing_btc.php?shw=1&label=LTC_BTC&source=cryptsy&period=1day")
content(pg, as="text") %>%
str_extract("(google\\.visualization\\.arrayToDataTable.*\\]\\);)") %>%
str_replace("google\\.visualization\\.arrayToDataTable\\(", "[") %>%
str_replace("\\)", "]") %>%
str_replace("right_title_name", "'right_title_name'") -> tbl
ct <- new_context()          # V8 JavaScript context
ct$eval(tbl) %>%             # evaluate the cleaned-up array; V8 returns it as a comma-separated string
  str_split(",") %>%
  extract2(1) %>%
  matrix(ncol=2, byrow=TRUE) %>%
  data.frame(stringsAsFactors=FALSE) %>%
  tail(-1) %>%                         # drop the header row
  select(timestamp=1, value=2) %>%
  mutate(timestamp=as.Date(as.POSIXct(timestamp)),
         value=as.numeric(value)) -> dat
glimpse(dat)
## Observations: 227
## Variables:
## $ timestamp (date) 2014-10-01, 2014-10-02, 2014-10-03, 2014-1...
## $ value (dbl) 0.01189974, 0.01194000, 0.01171897, 0.01199...
library(ggplot2)
ggplot(dat, aes(timestamp, value)) + geom_line(size=0.5) + theme_bw()
I want to automate downloading some UNICEF data from https://data.unicef.org/indicator-profile/ using rvest or a similar package. I have noticed that there are indicator codes, but I am having trouble identifying the correct codes and actually downloading the data.
Upon inspecting the element, there is a data-inner-wrapper class that seems like it might be useful. You can access a download link by going to the page associated with an indicator and specifying a time period. For example, CME_TMY5T9 is the code for Deaths aged 5 to 9.
The data is available by going to
https://data.unicef.org/resources/data_explorer/unicef_f/?ag=UNICEF&df=GLOBAL_DATAFLOW&ver=1.0&dq=.CME_TMY5T9..&startPeriod=2017&endPeriod=2022 and then clicking a download link.
If anyone could help me figure out how to get all the data, that would be fantastic. Thanks
library(rvest)
library(dplyr)
library(tidyverse)
page = "https://data.unicef.org/indicator-profile/"
df = read_html(page) %>%
  #html_nodes("div.data-inner-wrapper")
  html_nodes(xpath = "//div[@class='data-inner-wrapper']")
EDIT: Alternatively, downloading all data for each country would be possible. I think that would just require getting the download link or getting at the data within the table (since country codes aren't much of an issue).
This shows all the data for Afghanistan. I just need to figure out a programmatic way of actually downloading the data:
https://data.unicef.org/resources/data_explorer/unicef_f/?ag=UNICEF&df=GLOBAL_DATAFLOW&ver=1.0&dq=AFG..&startPeriod=1970&endPeriod=2022
You are on the right track! When you visit https://data.unicef.org/indicator-profile/, it does not directly contain the indicator codes, because these are loaded dynamically at a later point. You can try using the "network analysis" function of your web browser and look at the different requests your browser makes to fully load the page. The one you are looking for, with all the indicator codes, is here: https://uni-drp-rdm-api.azurewebsites.net/api/indicators
library(httr)
library(jsonlite)
library(glue)
library(dplyr)   # provides %>%
library(readr)   # provides read_csv()
## this gets the indicator codes
indicators <- GET("https://uni-drp-rdm-api.azurewebsites.net/api/indicators") %>%
content(as = "text") %>%
jsonlite::fromJSON()
## try looking at it in your browser
browseURL("https://uni-drp-rdm-api.azurewebsites.net/api/indicators")
You also correctly identified the URL that lets you download individual datasets in the data browser. Now you just need to find the request that fires when you actually download an Excel file and recursively plug in the different helix codes from the indicators. I have not tried applying this to all indicators; for some, the URL might differ and you might get incomplete data or errors. But this should get you started.
GET(glue("https://sdmx.data.unicef.org/ws/public/sdmxapi/rest/data/UNICEF,GLOBAL_DATAFLOW,1.0/.{indicators$helixCode[3]}..?startPeriod=2017&endPeriod=2022&format=csv&labels=name")) %>%
content(as = "text") %>%
read_csv()
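Before looping over everything, it can help to peek at what came back and confirm the code from the question is present; just a sketch using the helixCode field referenced above:
str(indicators, max.level = 1)                          # inspect the structure of the indicator object
grep("CME_TMY5T9", indicators$helixCode, value = TRUE)  # the Deaths aged 5 to 9 code from the question, assuming it appears in the list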
This might be a good place to get started on how to mimic requests that your browser executes: https://cran.r-project.org/web/packages/httr/vignettes/quickstart.html
Here is what I did, based on the very helpful code from @Datapumpernickel:
library(dplyr)
library(httr)
library(jsonlite)
library(glue)
library(tidyverse)
library(tictoc)
## this gets the indicator codes
indicators <- GET("https://uni-drp-rdm-api.azurewebsites.net/api/indicators") %>%
content(as = "text") %>%
jsonlite::fromJSON()
## try looking at it in your browser
#browseURL("https://uni-drp-rdm-api.azurewebsites.net/api/indicators")
tic()
FULL_DF = NULL
for(i in seq(1, length(unique(indicators$helixCode)), 1)){
  # Set up a tryCatch to keep on going when it encounters errors
  tryCatch({
    print(paste0("Processing : ", i, " of 546 ", indicators$helixCode[i]))
    TMP = GET(glue("https://sdmx.data.unicef.org/ws/public/sdmxapi/rest/data/UNICEF,GLOBAL_DATAFLOW,1.0/.{indicators$helixCode[i]}..?startPeriod=2017&endPeriod=2022&format=csv&labels=name")) %>%
      content(as = "text") %>%
      read_csv(col_types = cols())
    # Basic formatting for variables I want
    TMP = TMP %>%
      select(`Geographic area`, Indicator, Sex, TIME_PERIOD, OBS_VALUE) %>%
      mutate(description = indicators$helixCode[i]) %>%
      rename(country = `Geographic area`,
             variablename = Indicator,
             disaggregation = Sex,
             year = TIME_PERIOD,
             value = OBS_VALUE)
    # rbind each indicator to the full dataframe
    FULL_DF = FULL_DF %>% rbind(TMP)
  },
  error = function(cond){
    cat("\n WARNING COULD NOT PROCESS : ", i, " of 546 ", indicators$helixCode[i])
    message(cond)
    return(NA)
  }
  )
}
toc()
# Save the data
rio::export(FULL_DF, "unicef-data.csv")
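For what it's worth, the same loop can be sketched with purrr instead of the explicit for loop and tryCatch: possibly() simply returns NULL for indicators that error, and map_dfr() drops those. This is only a sketch and is untested against all of the codes:
library(purrr)
# sketch: purrr version of the download loop above; failing indicators are silently skipped
fetch_indicator <- possibly(function(code) {
  GET(glue("https://sdmx.data.unicef.org/ws/public/sdmxapi/rest/data/UNICEF,GLOBAL_DATAFLOW,1.0/.{code}..?startPeriod=2017&endPeriod=2022&format=csv&labels=name")) %>%
    content(as = "text") %>%
    read_csv(col_types = cols()) %>%
    select(`Geographic area`, Indicator, Sex, TIME_PERIOD, OBS_VALUE) %>%
    mutate(description = code) %>%
    rename(country = `Geographic area`, variablename = Indicator,
           disaggregation = Sex, year = TIME_PERIOD, value = OBS_VALUE)
}, otherwise = NULL)
FULL_DF <- map_dfr(unique(indicators$helixCode), fetch_indicator)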
I am trying to grab Hawaii-specific data from this site: https://www.opentable.com/state-of-industry. I want to get the data for Hawaii from every table on the site. This is done after selecting the State tab.
In R, I am trying to use the rvest library with SelectorGadget.
So far I've tried
library(rvest)
html <- read_html("https://www.opentable.com/state-of-industry")
html %>%
html_element("tbody") %>%
html_table()
However, this isn't giving me what I am looking for yet: I am getting the Global dataset in a tibble instead. Any suggestions on how to grab the Hawaii dataset from the State tab?
Also, is there a way to download the dataset behind the Download dataset tab? I could then work from the CSV file instead.
All the page data is stored in a script tag, from which it is pulled dynamically in the browser. You can regex out the JavaScript object containing all the data and write a custom function to extract just the info for Hawaii, as shown below. The function get_state_index accepts a state argument, in case you wish to view other states' information.
library(rvest)
library(jsonlite)
library(magrittr)
library(stringr)
library(purrr)
library(dplyr)
# return the position of the first entry in `states` whose name matches `state`
get_state_index <- function(states, state) {
  return(match(TRUE, map(states, ~ {
    .x$name == state
  })))
}
s <- read_html("https://www.opentable.com/state-of-industry") %>% html_text()
all_data <- jsonlite::parse_json(stringr::str_match(s, "__INITIAL_STATE__ = (.*?\\});w\\.")[, 2])
fullbook <- all_data$covidDataCenter$fullbook
hawaii_dataset <- tibble(
  date = fullbook$headers %>% unlist() %>% as.Date(),
  yoy = fullbook$states[get_state_index(fullbook$states, "Hawaii")][[1]]$yoy %>% unlist()
)
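If you need more states than Hawaii, a small wrapper (just a sketch reusing fullbook and get_state_index from above) pulls the same series for any state by name:
# sketch: build the same tibble for an arbitrary state
get_state_data <- function(state) {
  tibble(
    date = fullbook$headers %>% unlist() %>% as.Date(),
    yoy  = fullbook$states[get_state_index(fullbook$states, state)][[1]]$yoy %>% unlist()
  )
}
texas_dataset <- get_state_data("Texas")  # assumes the site lists the state under this exact name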
The title doesn't really do my question justice, because there are probably a few ways to skin this cat. But I picked one approach and went with it. This is what I'm working with:
I've pulled all the metadata for a particular study in the NCBI database using the "Send to:" option on their interface and downloading a .txt file.
In total, I have ~23k samples, each with up to 609 unique questions and answers from a questionnaire totaling 8M+ obs of 1 variable when read as a .csv. To my dismay, the metadata are irregular. Some samples have 140 associated key/value pairs. Others have 492. I've included a header of a sample below.
1: qiita_sid_10317:10317.BLANK1.6H.GUELPH
Identifiers: BioSample: SAMEA4790059; SRA: ERS2609990
Organism: metagenome
Attributes:
/Alias="qiita_sid_10317:10317.BLANK1.6H.GUELPH"
/description="American Gut control"
/ENA checklist="ERC000011"
/INSDC center alias="UCSDMI"
/INSDC center name="University of California San Diego Microbiome Initiative"
/INSDC first public="2018-07-13T17:03:10Z"
/INSDC last update="2018-07-13T14:50:03Z"
/INSDC status="public"
/SRA accession="ERS2609990"
I've tried (including but not limited to):
Reading the .txt file (adding a delimiter hasn't made a difference; am I missing something here?)
I've tried reading the data using various delimiters
I've even removed the header data in Sublime Text, leaving only "Attributes:" and the "/"-delimited key/value pairs in order to mess with the column that way
I've split the column, found all unique values in col1 to maybe create a df from scratch, etc.
Can't seem to get past the cleaning steps:
library(splitstackshape)  # cSplit() comes from here

samples <- read.csv("~/biosample_result_full.txt")
samples_split <- cSplit(samples, splitCols = samples$Colname, sep = "=")
samples_split$Attributes_1 <- gsub(" ", "_", samples_split$Attributes_1)
questions <- unique(samples_split$Attributes_1)
Ideally, each sample and associated metadata would be transformed into rows, with each "Attribute"/question as the column name.
Any help is greatly appreciated.
I see that the website you've linked to allows exporting the data to XML. I strongly suggest doing so; R can handle/parse XML files very efficiently.
When I download the first three results from that site to a file biosample_result.xml, it's easy to process using the xml2 package:
library( xml2 )
library( magrittr )
doc <- read_xml( "./biosample_result.xml" )
# get all BioSample nodes
BioSample.Nodes <- xml_find_all( doc, "//BioSample" )
# build a data.frame
data.frame(
  sample_name = xml_find_first( BioSample.Nodes, ".//Id[@db='SRA']" ) %>% xml_text(),
  stringsAsFactors = FALSE )
# sample_name
# 1 ERS2609990
# 2 ERS2609989
# 3 ERS2609988
So if you can use the XML, you will just have to use the right XPath syntax to get the data/nodes you need into the columns you want...
In the example above, I extracted (from each BioSample node) the first Id node whose db attribute equals SRA, and stored the result in the column sample_name.
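For example, adding a second column is just another XPath expression; a sketch assuming the attribute names match the header you posted (e.g. description):
# sketch: pull a second field per BioSample node by its attribute_name
data.frame(
  sample_name = xml_find_first( BioSample.Nodes, ".//Id[@db='SRA']" ) %>% xml_text(),
  description = xml_find_first( BioSample.Nodes, ".//Attribute[@attribute_name='description']" ) %>% xml_text(),
  stringsAsFactors = FALSE )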
Still assuming you can use the XML data:
If you are looking for all attributes in one df, you need the functions from purrr, so just load the entire tidyverse:
library( tidyverse )
df <- xml_find_all( doc, "//BioSample" ) %>%
  map_df(~{
    set_names(
      xml_find_all(.x, ".//Attribute") %>% xml_text(),
      xml_find_all(.x, ".//Attribute") %>% xml_attr( "attribute_name" )
    ) %>%
      as.list() %>%
      flatten_df()
  })
This will result in a df with one row per BioSample and one column per attribute.
I've read the various posts on this, but I still haven't found a solution. Here's some example code:
library(dplyr)
library(lubridate)
urlfile<-'https://raw.githubusercontent.com/blakeobeans/Predicting-Service-Calls/master/Data/nc.csv'
dates<-read.csv(urlfile, header=FALSE)
dates$V1 <- mdy(dates$V1)
dates <- dates %>%
rename("data.time" = V1) %>%
filter("2017-10-01" >= data.time & data.time >= "2017-06-01") %>%
group_by(data.time) %>%
summarise(n = n())
When I output to the PDF, the long URL runs outside of the grey code box.
The same thing happens if I have comments in the code that run out of the grey box.
I've tried using the following line of code at the beginning:
knitr::opts_chunk$set(tidy.opts=list(width.cutoff=60),tidy=TRUE)
But that doesn't help.
I had a similar problem when putting a package on CRAN (they give a note if an Rd file line exceeds 90 characters: "NOTE: lines wider than 90 characters"). One of the arguments to my function was a URL to a GitHub dataset. The solution was to split the URL into separate parts. For example:
urlRemote <- "https://raw.githubusercontent.com/"
pathGithub <- "blakeobeans/Predicting-Service-Calls/master/Data/"
fileName <- "nc.csv"
And you can use it in your code like this:
paste0(urlRemote, pathGithub, fileName) %>%
read.csv(header = FALSE)
This solution has an advantage when you want to use multiple files from the same repository, as you can use paste0(urlRemote, pathGithub, fileName1), paste0(urlRemote, pathGithub, fileName2), etc.
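If you read several files this way, a small helper keeps that pattern tidy (read_github_csv is just a hypothetical name; %>% comes from the dplyr you already load):
# hypothetical helper: build the URL from its parts, then read the file
read_github_csv <- function(fileName) {
  paste0(urlRemote, pathGithub, fileName) %>%
    read.csv(header = FALSE)
}
dates <- read_github_csv("nc.csv")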
I'm attempting to extract data from a pdf, which can be located at https://www.dol.gov/ui/data.pdf. The data I'm interested in are on page 4 of the PDF and are the 3 observations of the Initial Claims (NSA), the 3 observations of the Insured Unemployment (NSA), and the most recent week used covered employment (footnote 2).
I've read the PDF into R using pdftools, but the text output which is generated is quite ugly (kind of to be expected - due to the nature of PDFs). Is there any way I can extract specific data from this text output? I believe the data will always be in the same place in the output, which is helpful.
The output I'm looking at can be seen with the following script:
library(pdftools)
download.file("https://www.dol.gov/ui/data.pdf", "data.pdf", mode="wb")
uidata <- pdf_text("data.pdf")
uidata[4]
I've looked at similar questions and fiddled around with scan() and grep(), but can't seem to figure out a way to isolate and extract the data I need from the text output. Thanks in advance if anyone stumbles upon this and can point me in the right direction; if not, I'll keep trying to figure this out!
With grep and a little regex, you can get everything you need into a usable structure:
library(magrittr)
x <- pdftools::pdf_text('https://www.dol.gov/ui/data.pdf')
x2 <- readLines(textConnection(x[4]))
r <- grep('WEEK ENDING', x2)   # the line each table starts on
l <- lapply(seq_along(r), function(i){
  x2[r[i]:(na.omit(c(r[i + 1], grep('FOOTNOTE', x2)))[1] - 1)] %>%   # lines of table i, up to the next table or the footnotes
    trimws() %>%
    gsub('\\s{2,}', ';', .) %>%                                      # collapse runs of spaces into ';' separators
    paste(collapse = '\n') %>%
    read.csv2(text = ., dec = '.')
})
from_footnote <- as.numeric(gsub('^2|\\D', '', x2[grep('2\\.', x2)]))   # covered employment figure from footnote 2
l[[1]][3,]
#> WEEK.ENDING December.17 December.10 Change
#> Initial Claims (NSA) 315,613 305,333 +10,280 352,534
#> December.3
#> Initial Claims (NSA) 319,641
from_footnote
#> [1] 138322138
You'll still need to parse the numbers, but at least it's usable.
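If it helps, a minimal sketch of that last parsing step, assuming l[[1]] keeps the shape printed above (numbers stored as strings with commas and plus signs):
# sketch: strip '+' and ',' and convert every column to numeric
claims <- l[[1]]
claims[] <- lapply(claims, function(col) as.numeric(gsub("[+,]", "", col)))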