I've read the various posts on this, but I still haven't found a solution. Here's some example code:
library(dplyr)
library(lubridate)
urlfile<-'https://raw.githubusercontent.com/blakeobeans/Predicting-Service-Calls/master/Data/nc.csv'
dates<-read.csv(urlfile, header=FALSE)
dates$V1 <- mdy(dates$V1)
dates <- dates %>%
rename("data.time" = V1) %>%
filter("2017-10-01" >= data.time & data.time >= "2017-06-01") %>%
group_by(data.time) %>%
summarise(n = n())
When I output to the PDF, the long URL runs off the edge of the grey code block. The same thing happens if I have comments in the code that run past the grey bar.
I've tried using the following line of code at the beginning:
knitr::opts_chunk$set(tidy.opts=list(width.cutoff=60),tidy=TRUE)
But that doesn't help.
I had a similar problem when putting a package on CRAN (they give a note if an Rd file line exceeds 90 characters: "NOTE: lines wider than 90 characters"). One of the arguments to my function was a URL to a GitHub dataset. The solution was to split the URL into separate arguments. For example:
urlRemote <- "https://raw.githubusercontent.com/"
pathGithub <- "blakeobeans/Predicting-Service-Calls/master/Data/"
fileName <- "nc.csv"
And you can use it in your code like this:
paste0(urlRemote, pathGithub, fileName) %>%
read.csv(header = FALSE)
This solution has an advantage when you want to use multiple files from the same repository, as you can use paste0(urlRemote, pathGithub, fileName1), paste0(urlRemote, pathGithub, fileName2), etc.
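If you read several files this way, wrapping the paste0() call in a small helper keeps the chunk lines short. A minimal sketch, reusing the urlRemote and pathGithub values above (github_csv is just an illustrative name, not part of the original answer):
# hypothetical helper: build the full URL from its pieces and read the CSV
github_csv <- function(fileName, header = FALSE) {
  read.csv(paste0(urlRemote, pathGithub, fileName), header = header)
}
dates <- github_csv("nc.csv")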
I was wanting to automate downloading of some UNICEF data from https://data.unicef.org/indicator-profile/ using rvest or a similar package. I have noticed that there are indicator codes, but I am having trouble identifying the correct codes and actually downloading the data.
Upon inspecting the element, there is a data-inner-wrapper class that seems like it might be useful. You can access a download link by going to a page associated with an indicator and specifying a time period. For example, CME_TMY5T9 is the code for Deaths aged 5 to 9.
The data is available by going to
https://data.unicef.org/resources/data_explorer/unicef_f/?ag=UNICEF&df=GLOBAL_DATAFLOW&ver=1.0&dq=.CME_TMY5T9..&startPeriod=2017&endPeriod=2022 and then clicking a download link.
If anyone could help me figure out how to get all the data, that would be fantastic. Thanks
library(rvest)
library(dplyr)
library(tidyverse)
page = "https://data.unicef.org/indicator-profile/"
df = read_html(page) %>%
#html_nodes("div.data-inner-wrapper")
html_nodes(xpath = "//div[@class='data-inner-wrapper']")
EDIT: Alternatively, downloading all data for each country would be possible. I think that would just require getting the download link or getting at the data within the table (since country codes aren't much of an issue).
This shows all the data for Afghanistan. I just need to figure out a programmatic way of actually downloading the data....
https://data.unicef.org/resources/data_explorer/unicef_f/?ag=UNICEF&df=GLOBAL_DATAFLOW&ver=1.0&dq=AFG..&startPeriod=1970&endPeriod=2022
You are on the right track! When you visit the website https://data.unicef.org/indicator-profile/, it does not directly contain the indicator codes, because these are loaded dynamically at a later point. You can try using the "network analysis" function of your web browser and look at the different requests your browser makes to fully load the page. The one you are looking for, with all the indicator codes, is here: https://uni-drp-rdm-api.azurewebsites.net/api/indicators
library(httr)
library(jsonlite)
library(glue)
library(readr) # read_csv() below comes from readr
## this gets the indicator codes
indicators <- GET("https://uni-drp-rdm-api.azurewebsites.net/api/indicators") %>%
content(as = "text") %>%
jsonlite::fromJSON()
## try looking at it in your browser
browseURL("https://uni-drp-rdm-api.azurewebsites.net/api/indicators")
You also correctly identified the URL which lets you download individual datasets in the data browser. Now you just need to find the one that pops up when you actually download an Excel file, and recursively add in the different helix codes from the indicators. I have not tried applying this to all indicators; for some the URL might differ and you might get incomplete data or errors. But this should get you started.
GET(glue("https://sdmx.data.unicef.org/ws/public/sdmxapi/rest/data/UNICEF,GLOBAL_DATAFLOW,1.0/.{indicators$helixCode[3]}..?startPeriod=2017&endPeriod=2022&format=csv&labels=name")) %>%
content(as = "text") %>%
read_csv()
This might be a good place to get started on how to mimic the requests that your browser executes: https://cran.r-project.org/web/packages/httr/vignettes/quickstart.html
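As a small illustration of that vignette (a sketch, not part of the original answer), httr also lets you pass the query string as a named list instead of pasting it into the URL, which is easier to tweak once you start varying the parameters. This assumes the endpoint treats the parameters the same either way:
library(httr)
library(readr)
## same request as above for one indicator, with the query built by httr
resp <- GET(
  "https://sdmx.data.unicef.org/ws/public/sdmxapi/rest/data/UNICEF,GLOBAL_DATAFLOW,1.0/.CME_TMY5T9..",
  query = list(startPeriod = 2017, endPeriod = 2022, format = "csv", labels = "name")
)
cme_5to9 <- read_csv(content(resp, as = "text"))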
Here is what I did based on the very helpful code from @Datapumpernickel.
library(dplyr)
library(httr)
library(jsonlite)
library(glue)
library(tidyverse)
library(tictoc)
## this gets the indicator codes
indicators <- GET("https://uni-drp-rdm-api.azurewebsites.net/api/indicators") %>%
content(as = "text") %>%
jsonlite::fromJSON()
## try looking at it in your browser
#browseURL("https://uni-drp-rdm-api.azurewebsites.net/api/indicators")
tic()
FULL_DF = NULL
for(i in seq_along(indicators$helixCode)){
# Set up a trycatch loop to keep on going when it encounters errors
tryCatch({
print(paste0("Processing : ", i, " of ", length(indicators$helixCode), " ", indicators$helixCode[i]))
TMP = GET(glue("https://sdmx.data.unicef.org/ws/public/sdmxapi/rest/data/UNICEF,GLOBAL_DATAFLOW,1.0/.{indicators$helixCode[i]}..?startPeriod=2017&endPeriod=2022&format=csv&labels=name")) %>%
content(as = "text") %>%
read_csv(col_types = cols())
# # Basic formatting for variables I want
TMP = TMP %>%
select(`Geographic area`, Indicator, Sex, TIME_PERIOD, OBS_VALUE) %>%
mutate(description = indicators$helixCode[i]) %>%
rename(country = `Geographic area`,
variablename = Indicator,
disaggregation = Sex,
year = TIME_PERIOD,
value = OBS_VALUE)
# rbind each indicator to the full dataframe
FULL_DF = FULL_DF %>% rbind(TMP)
},
error = function(cond){
cat("\n WARNING COULD NOT PROCESS : ", i, " of ", length(indicators$helixCode), " ", indicators$helixCode[i])
message(cond)
return(NA)
}
)
}
toc()
# Save the data
rio::export(FULL_DF, "unicef-data.csv")
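Growing FULL_DF with rbind() inside the loop works, but it gets slower as the data frame grows. A rough alternative sketch (not tested against every indicator, with the same caveat about failing codes as above) that collects the results in one go with purrr; fetch_indicator is a hypothetical wrapper around the same SDMX request used in the loop:
library(purrr)
## hypothetical wrapper: download and tidy one indicator
fetch_indicator <- function(code) {
  GET(glue("https://sdmx.data.unicef.org/ws/public/sdmxapi/rest/data/UNICEF,GLOBAL_DATAFLOW,1.0/.{code}..?startPeriod=2017&endPeriod=2022&format=csv&labels=name")) %>%
    content(as = "text") %>%
    read_csv(col_types = cols()) %>%
    select(`Geographic area`, Indicator, Sex, TIME_PERIOD, OBS_VALUE) %>%
    mutate(description = code) %>%
    rename(country = `Geographic area`, variablename = Indicator,
           disaggregation = Sex, year = TIME_PERIOD, value = OBS_VALUE)
}
## indicators that error simply contribute no rows
FULL_DF <- map_dfr(unique(indicators$helixCode),
                   ~ tryCatch(fetch_indicator(.x), error = function(e) NULL))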
Currently, I use the following code to store Excel files (which are stored in a folder on my PC) in a list.
decrease_names <- list.files("4_large_decreases",pattern = ".xlsx",full.names = T)
decrease_list <- sapply(decrease_names,read_excel,simplify = F)
After that, I combine the dataframes into one object by using the following code.
decrease <- decrease_list %>%
keep(function(x) nrow(x) > 0) %>%
bind_rows()
The problem I have is that the Excel files stored in the folder contain decimal points (".") as well as thousand separators (","). I think R (and read_excel() in particular) converts the thousand separators into decimal points, which results in incorrect data.
Although I know that I can remove the thousand separators in Excel first, this would mean a lot of manual work, so I am interested in a solution that recognises the thousand separator and keeps it intact (or removes it; the goal is to keep the data correct).
EDIT: as @dario suggested, I am adding a snippet of a tibble that is stored in decrease_list after I run the code. The snippet looks like this:
Raised    Avg. change
526.000   2.04
186.000   3.24
...
In the column Raised the "." used to be a "," (a thousand separator) but has become a decimal point. The "." in Avg. change was a decimal point already.
Assuming that each Excel file contains data in the same format, we can apply the following code:
library(tidyverse)
library(readxl)
decrease_names <- list.files("4_large_decreases",pattern = ".xlsx",full.names = T)
# 10 columns as written in your comment
decrease_list <- sapply(decrease_names, readxl::read_excel, col_types = rep("text", 10L), simplify = F)
# Not tested
decrease <- decrease_list %>%
keep(function(x) nrow(x) > 0) %>%
bind_rows() %>%
mutate(across(where(is.character), ~ as.numeric(gsub("\\,", "", .x))))
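To make the last step concrete, this is what the conversion does to a single cell that was read in as text (a tiny illustration, not taken from your files):
as.numeric(gsub("\\,", "", "526,000"))
#> 526000
as.numeric(gsub("\\,", "", "2.04"))
#> 2.04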
I have a .txt import file from a weather station that uses some pretty advanced code, and I need to sort based on one area of content within each line. Here are a few lines:
13:30:00.587: <- $GPGGA,183000.30,4415.6243,N,08823.9769,W,1,7,1.7,225.5,M,-33.4,M,,*68
13:30:00.683: <- $GPGLL,4415.6243,N,08823.9769,W,183000.40,A,A*72
13:30:00.779: <- $GPVTG,159.6,T,163.2,M,0.1,N,0.1,K,A*2E
I basically need to be able to group together all lines with $GPGGA, and do the same for $GPGLL, $GPVTG, and I believe 6 other types of entries that repeat. group_by() does not work, nor do select() or sort(), for obvious reasons. The formatting here is clearly not in any organized table format, which makes this very difficult for me. How do I do this?
Here's the code I used to import the original file (I replaced my actual username with "my username"):
filefolder <- "C:\\Users\\my username\\Downloads\\"
Weather_data <- paste(filefolder, "Jul_13_2021_Weatherstation_Test_File.txt", sep = "")
Weather_data <- read.delim(Weather_data)
And here's what I have so far in my attempt:
Screenshot of what I have so far: https://i.stack.imgur.com/FSlzf.png
As you say, there is no organisation in the table. I would suggest doing something with regular expressions:
df <- data.frame(text = c("13:30:00.587: <- $GPGGA,183000.30,4415.6243,N,08823.9769,W,1,7,1.7,225.5,M,-33.4,M,,*68",
"13:30:00.683: <- $GPGLL,4415.6243,N,08823.9769,W,183000.40,A,A*72",
"13:30:00.779: <- $GPVTG,159.6,T,163.2,M,0.1,N,0.1,K,A*2E"))
library(dplyr)
df %>%
mutate(Entry = gsub(".*\\$([A-Z]+),.*", "\\1", text)) %>%
group_by(Entry)
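If you actually want one data frame per message type rather than a single grouped tibble, a small follow-up sketch (it reuses the Entry column built above):
# split into a named list of data frames, one per sentence type
sentences <- df %>%
  mutate(Entry = gsub(".*\\$([A-Z]+),.*", "\\1", text)) %>%
  split(.$Entry)
sentences$GPGGA # all $GPGGA lines
sentences$GPGLL # all $GPGLL lines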
For this website: https://www.coinopsy.com/dead-coins/, I'm using R and the rvest package to scrape names, summary, etc., that kind of info, to make my own form. I've done this with other websites and it was really successful, but this one is odd.
I used SelectorGadget, which has been useful in my previous jobs for figuring out the CSS nodes' names, but here html_nodes and html_text return an empty character vector. I don't know if it's because the website is structured in a totally different format!
An example of the css code:
<td class="all sorting_1"><a class="coin_name" href="007coin">007Coin</a></td>
<a class="coin_name" href="007coin">007Coin</a>
url <- "https://www.coinopsy.com/dead-coins/"
webpage <- read_html(url)
Item_html <- html_nodes(webpage,'.coin_name')
Item <- html_text(Item_html)
> Item
character(0)
Can someone help me out on this issue?
If you disable JavaScript in the browser you will see that that content is not loaded. If you then inspect the HTML you will see the data is stored in a script tag, presumably loaded into the table when JavaScript runs in the browser. JavaScript doesn't run with the method you are using. You can extract the JavaScript array of arrays from the response HTML and then parse it into a dataframe. I am new to R, so I am looking into how this can be done in this case; I will include a full example with Python at the end and will update if my research yields something. Otherwise, you can regex out the contents from the string returned in data.
library(rvest)
library(stringr)
library(magrittr)
url = 'https://www.coinopsy.com/dead-coins/'
r <- read_html(url) %>%
html_node('body') %>%
html_text() %>%
toString()
data <- str_match_all(r,'var table_data = (.*?);')
data <- data[[1]][,2] # string representation of list of lists
#step to convert string to object
#step to convert object to dataframe
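One way to fill in those two placeholder steps is to let JavaScript itself do the parsing via the V8 package. This is a sketch I have not run against the full site (table_data is just the variable name I give the array inside the engine), assuming data holds the array literal captured above:
library(V8)
ct <- new_context()
ct$eval(paste0("var table_data = ", data, ";")) # evaluate the JS array literal
listings <- ct$get("table_data") # converted back via jsonlite (a character matrix if rectangular)
df <- as.data.frame(listings, stringsAsFactors = FALSE)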
In Python there is the ast library, which makes the conversion easy; the result of the code below is the table you see on the page.
import requests
import re
import ast
import pandas as pd
r = requests.get('https://www.coinopsy.com/dead-coins/')
p = re.compile(r'var table_data = (.*?);') #p1 = re.compile(r'(\[".*?"\])')
data = p.findall(r.text)[0]
listings = ast.literal_eval(data)
df = pd.DataFrame(listings)
print(df)
Edit:
Currently I can't find a library which does the conversion I mentioned. Below is an ugly way of combining the pieces, and it feels inefficient. I would welcome suggestions on improvements (though that may be for Code Review later). I'm still looking at this, so I will update.
library(rvest)
library(stringr)
library(magrittr)
url = 'https://www.coinopsy.com/dead-coins/'
headers <- c("Column To Drop","Name","Summary","Project Start Date","Project End Date","Founder","urlId")
# https://www.coinopsy.com/dead-coins/bigone-token/ where bigone-token is urlId
r <- read_html(url) %>%
html_node('body') %>%
html_text() %>%
toString()
data <- str_match_all(r,'var table_data = (.*?);')
data <- data[[1]][,2]
z <- substr(data, start = 2, stop = nchar(data)-1) %>% str_match_all(., "\\[(.*?)\\]")
z <- z[[1]][,2]
for(i in seq(1,length(z))){
if(i==1){
df <- rapply(as.list(strsplit(z[i], ",")[[1]][2:7]), function(x) trimws(sub("'(.*?)'", "\\1", x)))
}else{
df <- rbind(df,rapply(as.list(strsplit(z[i], ",")[[1]][2:7]), function(x) trimws(sub("'(.*?)'", "\\1", x))))
}
}
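The headers vector defined above is never applied. Assuming the columns come out in the order listed there (minus the first placeholder column) and that more than one row was parsed, so df is a character matrix, you could finish with something like:
df <- as.data.frame(df, stringsAsFactors = FALSE)
names(df) <- headers[2:7] # "Name", "Summary", ..., "urlId"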
Maybe it will help someone: I had the same problem. The solution was that I had to specify the tag the selector targets before the "." and the class name. In your case you want to address a class named coin_name, but when specifying that class in the html_nodes function you don't specify the tag, which is what I was doing too. To solve it, I only had to add the tag, which in your case is the "a" tag, so it looks like this:
Item_html <- html_nodes(webpage,'a.coin_name')
That way the html_nodes function would not return empty.
I know you already solved it, but I hope this helps someone else.
I am trying to figure out how to read the data table of a Google Chart in R.
For example, the source code of this page contains historical Peercoin daily prices. I would like to copy into an R matrix the content of the data table that begins at line 497 with:
var data = google.visualization.arrayToDataTable([
['Period', right_title_name],
['2014/10/01 18:00', 0.01189974],
['2014/10/02 18:00', 0.01194000],
['2014/10/03 18:00', 0.01171897],
['2014/10/04 18:00', 0.01199999],
['2014/10/05 18:00', 0.01200000],
['2014/10/06 18:00', 0.01188685],
['2014/10/07 18:00', 0.01161999],
// data here
]);
I've installed several packages like RCurl, XML and data.table and followed examples from related questions (e.g. using fread, readHTMLTable and getURL), but I'm facing various issues reading the correct data from the source code: there is too much noise that I can't filter out. For example, with RCurl:
library(RCurl)
address <- "http://alt19.com/19R/chart_showing_btc.php?shw=1&label=LTC_BTC&source=cryptsy&period=1day"
data <- getURL(address)
data contains everything, but I'm not able to select the dates and prices with strsplit(data, "some code here").
Can somebody suggest an idea for how to achieve this?
Thank you,
Florent
Probably there's a better way, but what I usually do after getting the page source with getURL, as you posted, is some string manipulation.
My try:
library(stringr) # for str_locate() and str_trim() below
pageSource <- getURL(address)
index1<-str_locate(pageSource,"'Period', right_title_name],")[[2]]
sourceCut1<-substr(pageSource,index1+1,nchar(pageSource))
index2<-str_locate(sourceCut1,"]);")[[1]]
sourceCut2<-substr(sourceCut1,1,index2-1)
#sourceCut2 is the part of page source with the data
data<-str_trim(strsplit(sourceCut2,"\n")[[1]]) #split data rows
dates<-gsub("^.*'([0-9/: ]+).*$", "\\1", data) #extract dates
dates<-as.POSIXct(dates,format="%Y/%m/%d %H:%M")
values<-as.numeric(gsub("^.*,([0-9 .]+).*$", "\\1", data)) #extract numeric values
mydata<-data.frame(dates=dates,values=values)
Note that it will continue working only if the structure of the data (date format, blank spaces, square brackets) remains unchanged, otherwise you will probably need to modify some of the regex.
This answer is highly specific to your situation (and that URL), but it may be enough of a base for others with similar challenges. You can use the V8 package to parse and interpret the JavaScript, so I grab the page, extract the JavaScript for the table, do some cleanup so it can be interpreted fairly easily, then post-process the conversion. It's not pretty and others might be able to optimize it, but it will get you what you need:
library(V8)
library(httr)
library(stringr)
library(dplyr)
library(magrittr)
pg <- GET("http://alt19.com/19R/chart_showing_btc.php?shw=1&label=LTC_BTC&source=cryptsy&period=1day")
content(pg, as="text") %>%
str_extract("(google\\.visualization\\.arrayToDataTable.*\\]\\);)") %>%
str_replace("google\\.visualization\\.arrayToDataTable\\(", "[") %>%
str_replace("\\)", "]") %>%
str_replace("right_title_name", "'right_title_name'") -> tbl
ct <- new_context()
ct$eval(tbl) %>%
str_split(",") %>%
extract2(1) %>%
matrix(ncol=2, byrow=TRUE) %>%
data.frame(stringsAsFactors=FALSE) %>%
tail(-1) %>%
select(timestamp=1, value=2) %>%
mutate(timestamp=as.Date(as.POSIXct(timestamp)),
value=as.numeric(value)) -> dat
glimpse(dat)
## Observations: 227
## Variables:
## $ timestamp (date) 2014-10-01, 2014-10-02, 2014-10-03, 2014-1...
## $ value (dbl) 0.01189974, 0.01194000, 0.01171897, 0.01199...
library(ggplot2)
ggplot(dat, aes(timestamp, value)) + geom_line(size=0.5) + theme_bw()