Extract URL of CSV data from webpage - web-scraping

It seems that onvista has recently changed its website design. My former links for retrieving stock market data no longer work. Unfortunately, I am not able to extract the URL of the CSV data from the website, for example:
https://www.onvista.de/index/boersenkurse/DAX-Kursindex-Index-1966970
EDIT: This is what I have done so far:
library(lubridate)  # for mday(), month(), year()
link <- paste0("https://www.onvista.de/onvista/boxes/historicalquote/export.csv?notationId=",
               onvista.ID, "&dateStart=", mday(aktuell), ".", month(aktuell),
               ".", year(aktuell), "&interval=M1")
temp <- read.csv2(url(link))
Can someone help me? Thanks in advance!
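The thread doesn't give the new endpoint, but the Network-tab approach described in the marathon answer further down applies here as well: open the quote page, watch the XHR requests in the browser's developer tools, and rebuild the download URL from what you see. A rough sketch in R; the endpoint path and parameter names below are hypothetical placeholders, not onvista's actual API (onvista.ID and aktuell are the variables from the question):
library(httr)

# HYPOTHETICAL endpoint -- replace with whatever URL the Network tab
# actually shows when the historical-quote export is requested.
endpoint <- "https://www.onvista.de/some/observed/export/path"
resp <- GET(endpoint, query = list(notationId = onvista.ID,  # hypothetical parameter names
                                   dateStart  = format(aktuell, "%d.%m.%Y"),
                                   interval   = "M1"))
temp <- read.csv2(text = content(resp, as = "text", encoding = "UTF-8"))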

Related

Scrape an image from website in Google sheets (using ImportXml)

I'm trying to scrape a product image from the Aliexpress website to display in a cell together with the link and other details. I've been trying to formulate the XPath, but I keep getting an error.
This is the formula I've used:
=image(IMPORTXML(A2, "(//img[@class='magnifier-image'])[1]/@src"))
A2 is the product link.
This is the element from the Aliexpress website.
Link to the product: https://www.aliexpress.com/item/1005001845596088.html
Could anyone help me understand what I'm doing wrong, please?
I would be very grateful for any ideas.
Thank you,
Kristyna
Unfortunately, the HTML retrieved by IMPORTXML has no element matching img[@class='magnifier-image']. Fortunately, that image URL can be retrieved from a meta tag. Reflecting this in the formula gives the following.
Sample formula:
=IMAGE(IMPORTXML(A2,"//meta[@property='og:image']/@content"))
The cell "A2" has the URL of https://www.aliexpress.com/item/1005001845596088.html.
In this case, the image URL of https://ae01.alicdn.com/kf/S962937cf821a4e0196a11cf0c877df11Y/Birthday-Valentine-Day-Keychain-Gifts-for-Boyfriend-Husband-My-Man-I-love-you-Couples-Keyring-for.jpg is retrieved.
Note:
This sample formula is for the URL https://www.aliexpress.com/item/1005001845596088.html, so if you change the URL, the formula might not work. Likewise, if the site's structure changes, the formula might stop working. Please be careful about this.
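For reference, the same og:image lookup can be expressed in R with rvest (the language used in the other questions here); a sketch, assuming the page still exposes the og:image meta tag:
library(rvest)

pg <- read_html("https://www.aliexpress.com/item/1005001845596088.html")
img <- pg %>%
  html_node(xpath = "//meta[@property='og:image']") %>%
  html_attr("content")
img  # the product image URL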

Wordpress : Get assigned data value from defined variables

I am working on an e-commerce site built with WordPress. I want to assign a zip code to each city in the country, since users tend to misspell the names.
I want it to work so that each time they type a city, its zip code appears.
For example, when I type in New York, 00012 should pop up.
Any suggestion would be welcome.
Thanks
I have used a plugin for this; it should help you get the zip code, though you may need to do a bit of custom design work if needed.
Here is the link: https://wordpress.org/plugins/simple-locator/
Their demo is here: https://locatewp.com/

How can I scrape data from a website within a frame using R?

The following link contains the results of the Paris marathon: http://www.schneiderelectricparismarathon.com/us/the-race/results/results-marathon.
I want to scrape these results, but the information lies within a frame. I know the basics of scraping with rvest and RSelenium, but I am clueless about how to retrieve the data within such a frame. To give an idea, one of the things I tried was:
library(rvest)

url <- "http://www.schneiderelectricparismarathon.com/us/the-race/results/results-marathon"
site <- read_html(url)
ParisResults <- site %>% html_node("iframe") %>% html_table()  # fails: the iframe node contains no table
ParisResults <- as.data.frame(ParisResults)
Any help in solving this problem would be very welcome!
The results are loaded by AJAX from the following URL:
library(rvest)

url <- "http://www.aso.fr/massevents/resultats/ajax.php?v=1460995792&course=mar16&langue=us&version=3&action=search"
table <- url %>%
  read_html(encoding = "UTF-8") %>%
  html_nodes(xpath = '//table[@class="footable"]') %>%
  html_table()
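html_table() returns a list of data frames, so pull the table out afterwards, for example:
ParisResults <- as.data.frame(table[[1]])  # the first (and only) matched table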
PS: I don't know exactly what AJAX is, and I only know the basics of rvest.
EDIT, in order to answer the question in the comment: I don't have a lot of experience in web scraping. If you only use very basic techniques with rvest or XML, you have to understand the website a little more, and every site has its own structure. For this one, here is how I did it:
As you can see, the page source doesn't show any results because they sit in an iframe; when inspecting the code, you can see, after "RESULTS OF 2016 EDITION":
class="iframe-xdm iframe-resultats" data-href="http://www.aso.fr/massevents/resultats/index.php?langue=us&course=mar16&version=3"
Now you can use this URL directly: http://www.aso.fr/massevents/resultats/index.php?langue=us&course=mar16&version=2
But you still can't get the results. You can then use Chrome developer tools > Network > XHR. When refreshing the page, you can see that the data is loaded from this URL (when you choose the Women category): http://www.aso.fr/massevents/resultats/ajax.php?course=mar16&langue=us&version=2&action=search&fields%5Bsex%5D=F&limiter=&order=
Now you can get the results!
And if you want the second page, etc., you can click on the page number, then use the developer tools to see what happens; a sketch of that paging loop follows below.
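A hedged sketch of what the loop might look like in R; the "page" parameter below is a hypothetical placeholder, so use whatever parameter name actually appears in the Network tab:
library(rvest)

base <- "http://www.aso.fr/massevents/resultats/ajax.php?course=mar16&langue=us&version=2&action=search&fields%5Bsex%5D=F"
pages <- lapply(1:3, function(p) {
  paste0(base, "&page=", p) %>%   # "page" is a guess -- confirm in dev tools
    read_html(encoding = "UTF-8") %>%
    html_nodes(xpath = '//table[@class="footable"]') %>%
    html_table() %>%
    .[[1]]
})
results <- do.call(rbind, pages)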

Getting the right data from webpage

I am looking to extract some data from this website:
http://www.delfi.lv/bizness/biznesa_vide/tirgus-liberalizacija-ka-latvija-nonaca-krievijas-gazes-juga.d?id=44233361&com=1&s=5
What's valuable for me is info like:
"<h3 class="ca56269332 comment-noavatar listh3 comment-noavatar-author">
vārds
</h3>"
In this example "ca56269332" and "vārds" ("vārds" is Latvian for "name") are dynamic values.
What I want to achieve is something like this:
"<h3 class="* comment-noavatar listh3 comment-noavatar-author">
*
</h3>"
where "*" denotes a dynamic value, and export the result to some kind of Excel or data file.
Also I want to extract multiple pages, like:
/tirgus-liberalizacija-ka-latvija-nonaca-krievijas-gazes-juga.d?id=44233361&com=1&s=5&no=0
/tirgus-liberalizacija-ka-latvija-nonaca-krievijas-gazes-juga.d?id=44233361&com=1&s=5&no=20
/tirgus-liberalizacija-ka-latvija-nonaca-krievijas-gazes-juga.d?id=44233361&com=1&s=5&no=40
etc.
Can anyone please share some valuable resources for achieving this? I know it can be done with PHP's file_get_contents, but I want an easier solution, because my goal is not to publish it on a webpage but to use it as a data file for my study project.
How can I extract just the dynamic data, avoiding saving every page with all the useless information it contains, and avoid manually processing a large number of web comments?
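Since rvest is used elsewhere in this thread, here is a minimal sketch of that approach. It assumes the comment markup still matches the snippet above; the random part of the class (ca56269332 here) changes, so the selector matches only the stable classes:
library(rvest)

base <- "http://www.delfi.lv/bizness/biznesa_vide/tirgus-liberalizacija-ka-latvija-nonaca-krievijas-gazes-juga.d?id=44233361&com=1&s=5"
offsets <- c(0, 20, 40)  # the &no= values from the question
authors <- unlist(lapply(offsets, function(no) {
  paste0(base, "&no=", no) %>%
    read_html() %>%
    html_nodes("h3.comment-noavatar-author") %>%  # stable classes only
    html_text(trim = TRUE)
}))
# Export for use as a study-project data file:
write.csv(data.frame(author = authors), "comments.csv", row.names = FALSE)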

Download ASPX page with R

There are a number of fairly detailed answers on SO which cover authenticated login to an ASPX site and downloading from it. As a complete n00b, I haven't been able to find a simple explanation of how to get data from a web form.
The following MWE is intended as an example only, and this question is really meant to teach me how to do it for a wider collection of webpages.
Website:
http://data.un.org/Data.aspx?d=SNA&f=group_code%3a101
What I tried, and (obviously) failed with:
test <- read.csv('http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=group_code:101;country_code:826&DataMartId=SNA&Format=csv&c=2,3,4,6,7,8,9,10,11,12,13&s=_cr_engNameOrderBy:asc,fiscal_year:desc,_grIt_code:asc')
which gives me gobbledegook in View(test).
Anything that steps me through this or points me in the right direction would be very gratefully received.
The URL you are accessing with read.csv returns a zipped file. You could download it using httr, say, and write the contents to a temp file:
library(httr)

urlUN <- "http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=group_code:101;country_code:826&DataMartId=SNA&Format=csv&c=2,3,4,6,7,8,9,10,11,12,13&s=_cr_engNameOrderBy:asc,fiscal_year:desc,_grIt_code:asc"
response <- GET(urlUN)

dir.create("temp", showWarnings = FALSE)            # make sure temp/ exists
writeBin(content(response, as = "raw"), "temp/temp.zip")

fName <- unzip("temp/temp.zip", list = TRUE)$Name   # name of the CSV inside the zip
unzip("temp/temp.zip", exdir = "temp")
read.csv(paste0("temp/", fName))
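A small variant of the same idea that avoids the hard-coded temp/ directory, using base R's tempfile() and unz() (urlUN as defined above):
tmp <- tempfile(fileext = ".zip")
writeBin(content(GET(urlUN), as = "raw"), tmp)
fName <- unzip(tmp, list = TRUE)$Name[1]  # name of the CSV inside the zip
unData <- read.csv(unz(tmp, fName))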
Alternatively, Hmisc has a useful getZip function:
library(Hmisc)
urlUN <- "http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=group_code:101;country_code:826&DataMartId=SNA&Format=csv&c=2,3,4,6,7,8,9,10,11,12,13&s=_cr_engNameOrderBy:asc,fiscal_year:desc,_grIt_code:asc"
unData <- read.csv(getZip(urlUN))
The links are being generated dynamically. The other problem is that the content isn't actually at that link: you're making a request to a (very odd and poorly documented) API which will eventually return the zip file. If you watch the Chrome dev tools as you click on that link, you'll see the request and response headers.
There are a few ways you can solve this. If you know some JavaScript, you can script a headless WebKit instance like PhantomJS to load these pages, simulate click events, wait for a content response, and then pipe that to something.
Alternatively, you may be able to finagle httr into treating this like a proper RESTful API. I have no idea if that's even remotely possible. :)
