Handling OSM data for a whole country in R

Is there any way to get the data matching the given filters from the osmdata library for a whole country or another large area? The problem is that as the area gets bigger, the file becomes too large and the download aborts. Below is the import section I'm using right now.
library(osmdata)
q <- getbb("Germany") %>%
  opq() %>%
  add_osm_feature("amenity", "restaurant")
str(q)  # query structure
restaurants <- osmdata_sf(q)

From the Planet.osm file I can tell you that there are about 97,000 objects (ways or nodes) with the tag amenity = "restaurant" in Germany. The OSM API won't handle that; you'll even get a timeout using overpass-turbo.
For large amounts of data you'll have to download the Planet.osm file into a database. You'll find a nice tutorial here
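As an alternative to a full database import, the idea of working from a downloaded extract instead of the Overpass API can be sketched with the osmextract package. This is a hedged sketch, not part of the original answer: it assumes osmextract's `oe_get()` with its `extra_tags` and `query` arguments, and the Germany extract is a multi-hundred-MB download.

```r
library(osmextract)

# Download the Geofabrik extract for Germany once, then filter locally.
# "amenity" is not a default field of the points layer, so it is requested
# via extra_tags; the query is GDAL SQL run against the local file.
restaurants <- oe_get(
  "Germany",
  layer = "points",
  extra_tags = "amenity",
  query = "SELECT * FROM points WHERE amenity = 'restaurant'"
)
```

Because the filtering happens on the local file, there is no Overpass timeout and repeated queries reuse the cached download.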

Related

How do you download data from an API and export it into a nice CSV file to query?

I am trying to figure out how to download data from an API and export it into a nice CSV file so I can analyse it.
I am currently looking at WHO data here:
I am doing so by following the documentation, and getting output like so:
test_data <- jsonlite::parse_json(url("http://apps.who.int/gho/athena/api/GHO/WHS6_102.json?profile=simple"))
head(test_data)
This gives me a rather messy list of lists of lists, which is not easy to analyse.
How could I clean this up, keeping only the information from dim (say REGION, YEAR and COUNTRY) together with the values from the Value column? I would like to turn this into a nice data frame/CSV file so I can more easily understand what is happening.
Can anyone give any advice?
jsonlite::fromJSON gives you the data in a better format, and the third element in the list is where the main data is.
url <- 'https://apps.who.int/gho/athena/api/GHO/WHS6_102.json?profile=simple'
tmp <- jsonlite::fromJSON(url)
data <- tmp[[3]]
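From there, the requested columns can be flattened out and written to CSV. A hedged sketch, assuming the simple-profile JSON carries the dimensions in a nested `dim` data frame with columns named REGION, YEAR and COUNTRY (check `names(data$dim)` first):

```r
library(jsonlite)

url <- 'https://apps.who.int/gho/athena/api/GHO/WHS6_102.json?profile=simple'
tmp <- jsonlite::fromJSON(url)
data <- tmp[[3]]  # main data: one row per observation

# dim is a nested data frame of dimensions; keep only the requested ones
# and pair them with the Value column
clean <- cbind(data$dim[, c("REGION", "YEAR", "COUNTRY")],
               Value = data$Value)
write.csv(clean, "who_whs6_102.csv", row.names = FALSE)
```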

Scraping table into R

I'm trying to scrape data into R from a table on a website, but am having some trouble finding the table. The table seems to not have any distinguishable class or id, as it is labeled as . I have scraped data from many tables but have never encountered this situation. I'm still a novice and have done some searching, but nothing has worked so far.
I have tried the following R commands, but they only produce a result of "" with no data scraped. I got some results when targeting the article-template class, but the output is jumbled, so I know the data exists in the html read. I am just not sure of the proper way to program for this type of website.
Here is what I have so far in R:
library(xml2)
library(rvest)
website <- read_html("https://baseballsavant.mlb.com/daily_matchups")
dailymatchup <- website %>%
  html_nodes(xpath = '//*[@id="leaderboard"]') %>%
  html_text()
Essentially, pulling the data from the table should yield a quick and organizable data frame. That is the ultimate target I am seeking.

Can AZ ML workbench reference multiple data sources from Data Prep Transform Dataflow expression

Using AZ ML Workbench for a class project (a required tool), I coded the desired logic below in an exploration notebook, but I cannot find a way to include it in a Data Prep Transform Dataflow.
all_columns = df.columns
sum_columns = [col_name for col_name in all_columns if col_name not in ['NPI', 'Gender', 'State', 'Credentials', 'Specialty']]
sum_op_columns = list(set(sum_columns) & set(df_op['Drug Name'].values))
The logic uses the column names from one data source, df_op (opioid drugs), to choose which subset of columns to include from another data source, df (all drugs). When adding a Python script/expression Transform Dataflow, I only see the ability to reference the single df. Alternatives?
I may have a way for you to access both data frames.
In Workbench, once you have the data sources that you need loaded, right click on one and select "Generate Data Access Code File".
Once there you're automatically given code to access that specific file. However, you can use the same code to access the other files.
With two data sources loaded, I can use the code below to access them both as pandas data frames and manipulate them as I need.
df_salary = datasource.load_datasource('SalaryData.dsource')
df_startup = datasource.load_datasource('50-Startups.dsource')
I believe from there you can save your updated data frame to a CSV and then use that in the train script.
Hope that helps or at least points you to another solution.
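The question's subsetting logic can also be checked outside Workbench with plain pandas. A sketch with toy stand-in frames (all data-source, column and drug names here are illustrative only; in Workbench the frames would come from `datasource.load_datasource` as above):

```python
import pandas as pd

# Toy stand-ins for the two data sources
df = pd.DataFrame({
    'NPI': [1, 2],
    'Gender': ['F', 'M'],
    'State': ['OH', 'TX'],
    'Credentials': ['MD', 'DO'],
    'Specialty': ['GP', 'GP'],
    'DrugA': [10, 20],
    'DrugB': [5, 0],
    'DrugC': [1, 2],
})
df_op = pd.DataFrame({'Drug Name': ['DrugA', 'DrugC']})

id_columns = ['NPI', 'Gender', 'State', 'Credentials', 'Specialty']
sum_columns = [c for c in df.columns if c not in id_columns]
sum_op_columns = sorted(set(sum_columns) & set(df_op['Drug Name'].values))

# Keep the identifier columns plus only the opioid drug columns,
# then save to CSV for the train script.
subset = df[id_columns + sum_op_columns]
subset.to_csv('opioid_subset.csv', index=False)
```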

Scraping data from a dynamic web page (.asp) with R

I'm trying to scrape some data using this code.
require(XML)
tables <- readHTMLTable('http://fantasynba.movistarplus.es/basketball/reports/player_rankings.asp')
str(tables, max.level = 1)
df <- tables$searchResults
It works perfectly, but the problem is that it only gives me data for the first 188 observations, which correspond to the players whose position is "Base". Whenever I try to get data for "Pivot" or "Alero" players, it gives me the same info. Since the URL never changes, I don't know how to get this info.

How to use R to read XML data from S3 more quickly?

This is my first time working with XML data, and I'd appreciate any help/advice that you can offer!
I'm working on pulling some data that is stored on AWS in a collection of XML files. I have an index file that contains a list of the ~200,000 URLs where the XML files are hosted. I'm currently using the XML package in R to loop through each URL and pull the data from the node that I'm interested in. This works fine, but with so many URLs, the loop takes around 12 hours to finish.
Here's a simplified version of my code. The index file contains the list of URLs. The parsed XML files aren't very large (stored as dat in this example...R tells me they're 432 bytes). I've put NodeOfInterest in as a placeholder for the spot where I'd normally list the XML tag that I'd like to pull data from.
compiled_data <- data.frame()  # master file
for (i in 1:200000) {
  url <- paste('http://s3.amazonaws.com/', index[i, 9], '_public.xml', sep = "")  # create URL based off of index file
  dat <- xmlTreeParse(url, useInternal = TRUE)  # load entire XML file
  nodes <- getNodeSet(dat, "//x:NodeOfInterest", "x")  # find nodes for the tag I'm interested in
  if (length(nodes) > 0) {
    dat2 <- xmlToDataFrame(nodes)  # create data table from nodes
    compiled_data <- rbind(compiled_data, dat2)  # append to master file
    rm(dat2)
  }
  print(i)
}
It seems like there must be a more efficient way to pull this data. I think the longest step by far is loading the XML into memory, but I haven't found anything that suggests another option. Any advice?
Thanks in advance!
If parsing the XML into a tree (in xmlTreeParse) is your bottleneck, maybe use a streaming interface like SAX, which lets you process only those elements that are useful for your application. I haven't used it myself, but the package xml2 is built on top of libxml2, which provides SAX ability.
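The SAX idea can be sketched with `XML::xmlEventParse`, which fires handlers as elements stream past instead of building a tree. A hedged sketch, assuming a locally saved copy of one file and keeping the question's placeholder tag name:

```r
library(XML)

# Collect text only from the elements we care about; nothing else is
# kept in memory. The handlers write to the enclosing environment via <<-.
values <- character(0)
in_node <- FALSE

handlers <- list(
  startElement = function(name, attrs) {
    if (name == "NodeOfInterest") in_node <<- TRUE
  },
  endElement = function(name) {
    if (name == "NodeOfInterest") in_node <<- FALSE
  },
  text = function(content) {
    if (in_node) values <<- c(values, content)
  }
)

xmlEventParse("local_copy.xml", handlers = handlers)
compiled_data <- data.frame(value = values)
```

Independently of the parser, note that `rbind` inside a 200,000-iteration loop is quadratic; collecting results in a pre-allocated list and calling `do.call(rbind, ...)` once at the end often saves more time than the parse itself.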
