Novice R user here. I'm looking to scrape a large amount of data on daily streaming volumes for songs on Spotify's Top 200 charts, for a research project I'm involved with. Basically, I would like to write a script that scrapes all the info for the tracks in the Top 200 on a given day, such as today's chart, and does this for every day over a number of years, across a number of countries. I previously used code from a guide to scrape this data successfully, but it is no longer working for me.
I followed that guide pretty much word for word. While it originally worked, it now returns an empty tibble. I suspect the problem has to do with Spotify having redeveloped their charts site since my last attempt. The site looks different, but, more importantly, the HTML node names appear to have changed as well. My hunch is that this is what is causing the issue.
However, I am not at all sure if this is the case. Would appreciate it greatly if I could have some guidance on what I would need to do differently to achieve my aims, and whether it is indeed still possible to scrape these charts.
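For what it's worth, the old spotifycharts.com site exposed each daily chart as a CSV at a predictable URL, which is what most of the older guides relied on. Below is a sketch of that country-by-date pattern; note that this endpoint belongs to the retired site and may well be what broke after the redesign, since the new charts site sits behind a login and loads data from an authenticated API:

```r
library(purrr)   # map2_dfr
library(readr)   # read_csv
library(dplyr)

# Build the old-style CSV download URL for one country/date.
# NOTE: this is the retired spotifycharts.com endpoint and may no longer
# respond; it is shown to illustrate the URL pattern the old guides used.
chart_url <- function(country, date) {
  sprintf("https://spotifycharts.com/regional/%s/daily/%s/download",
          country, format(as.Date(date), "%Y-%m-%d"))
}

countries <- c("us", "gb", "de")
dates     <- seq(as.Date("2021-01-01"), as.Date("2021-01-07"), by = "day")
grid      <- expand.grid(country = countries, date = dates,
                         stringsAsFactors = FALSE)

# Fetch each chart; skip = 1 drops the banner line the old CSVs carried
# above the header. Failed days are silently skipped.
charts <- map2_dfr(grid$country, grid$date, function(ctry, d) {
  Sys.sleep(1)  # be gentle with the server
  tryCatch(
    read_csv(chart_url(ctry, d), skip = 1, show_col_types = FALSE) |>
      mutate(country = ctry, chart_date = d),
    error = function(e) NULL
  )
})
```

If every request fails, the endpoint is likely gone for good, and the next step is to open the new charts site with the browser's DevTools Network tab and look for the request that actually delivers the chart data.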
Cheers
I'm trying to get rankings data for NFT collections sorted by their highest all-time volume. It seems that the OpenSea API does not currently offer an endpoint for ranked lists. As a workaround, I'm looking at web scraping to fetch the all-time volume rankings from https://opensea.io/rankings?sortBy=total_volume.
However, I am having difficulty fetching data for any entry past the first 100 items, i.e. page 2 of the rankings and onwards. The OpenSea URL does not change when I click through the list of ranks at the bottom of the page (101-201).
Any ideas on how I could automate web scraping for ranks past the first 100 entries?
I'd appreciate any help here; thanks in advance!
Have you checked out this library, which does the scraping for you under the hood? I tested some of its endpoints and it does return data: https://github.com/dcts/opensea-scraper
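More generally, when the visible URL doesn't change as you page through results, the page is fetching the extra rows in the background over XHR. The usual approach is to open DevTools, switch to the Network tab, click the "101-200" pager, and replay the request that appears. A sketch of that replay pattern in R follows; the endpoint URL, parameter names, and response fields below are hypothetical placeholders, so copy the real ones from your own Network tab:

```r
library(httr)
library(jsonlite)

# HYPOTHETICAL endpoint and parameters: substitute the actual request you
# observe in the browser's Network tab when paging the rankings.
fetch_page <- function(cursor = NULL) {
  resp <- GET(
    "https://opensea.io/__api/rankings",          # placeholder URL
    query = list(sortBy = "total_volume", cursor = cursor),
    add_headers(`User-Agent` = "Mozilla/5.0")
  )
  stop_for_status(resp)
  fromJSON(content(resp, as = "text", encoding = "UTF-8"))
}

# Cursor-paginated endpoints typically return the next cursor alongside
# the data; loop until it comes back empty. Field names are guesses.
all_rows <- list()
cursor <- NULL
repeat {
  page <- fetch_page(cursor)
  all_rows <- c(all_rows, list(page$data))
  cursor <- page$nextCursor
  if (is.null(cursor)) break
}
```

The linked opensea-scraper library automates essentially this discovery step for you.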
Was hoping someone smarter than me would know how to get around two websites, Woolworths.com.au and Coles.com.au, that appear to block the use of the IMPORTXML() function.
I am trying to do some personal budgeting and I'm trying to input the price of various products into my spreadsheet. e.g. Toothpaste
Here's what my spreadsheet looks like so far; I am trying to avoid manually inputting the prices of the Woolworths and Coles items.
I have been trying IMPORTHTML and IMPORTXML:
=IMPORTXML("https://www.woolworths.com.au/shop/productdetails/238473/colgate-plax-alcohol-free-mouthwash-freshmint","//shared-price[@class='ng-star-inserted']")
I was initially able to get some price data, but after about a day it stopped working (I think they have blocked my sheet).
Pretty tricky to work around, so if anyone has suggestions, or knows of advanced Google Sheets forums to ask in, I'd appreciate it.
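One way to check whether the sites are blocking the fetcher (rather than the XPath being wrong) is to request the page yourself and compare responses with and without a browser-like User-Agent, since IMPORTXML fetches with Google's own agent string, which retailers commonly deny. A sketch in R:

```r
library(httr)

url <- "https://www.woolworths.com.au/shop/productdetails/238473/colgate-plax-alcohol-free-mouthwash-freshmint"

# No browser headers: roughly comparable to an automated fetcher.
plain   <- GET(url)
# Same request, but presenting a browser User-Agent.
browser <- GET(url, add_headers(
  `User-Agent` = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
))

status_code(plain)    # 403/429 here would suggest automated fetches are blocked
status_code(browser)  # 200 here would point to User-Agent based blocking
```

One caveat: the ng-star-inserted class in the formula indicates an Angular page, so the price may be rendered by JavaScript after load; in that case even an unblocked plain fetch will not contain the price in the raw HTML, and you would need to find the underlying JSON request instead.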
I am trying to scrape NCAA gymnastics scores from roadtonationals.com into R. I have been able to do this in the past, using readLines(), but the website has been updated recently, and my old code no longer works.
In particular, when I am looking at the standings (roadtonationals.com/results/standings/), I can change season, year, week, and team/individual using the drop down menus. I can change between the four events and the all around using the tabs on the right. However, even if the table changes, the URL remains the same. I know very little about coding for websites, so I don’t even really know what this type of table is called or where to start with it.
Technically, I could copy and paste, but eventually, I’d like each individual score, like I used to be able to get, from a page like roadtonationals.com/results/schedule/meet/20409, which also involves selecting the teams or the events without changing the URL.
I found this question:
Using R to scrape tables when URL does not change
which seems to be asking the same thing that I am.
However, when I tried
library(httr)
standings <- POST(url = "https://roadtonationals.com/results/standings/season")
I get a response that says "Not Acceptable": "An appropriate representation of the requested resource /results/standings/season could not be found on this server."
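When the dropdowns change the table but not the URL, the table is being filled by background XHR requests, and guessing the request (as the POST above does) rarely matches what the server expects; that mismatch is a typical cause of a 406 "Not Acceptable". The reliable route is to open DevTools, go to the Network tab (XHR filter), change a dropdown, and copy the request that appears. A sketch of replaying such a request follows; the endpoint path and parameter names are hypothetical, so use the ones you actually observe:

```r
library(httr)
library(jsonlite)

# HYPOTHETICAL endpoint and parameters: replace with the request your
# browser sends (DevTools -> Network -> XHR) when you change a dropdown.
resp <- GET(
  "https://roadtonationals.com/api/standings",   # placeholder path
  query = list(year = 2020, week = 5, event = "vault"),
  add_headers(Accept = "application/json")
)
stop_for_status(resp)

# Such endpoints usually return JSON, which parses straight into a
# data frame or list of data frames.
standings <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
str(standings)
```

The same approach should work for the meet pages like /results/schedule/meet/20409: the team and event tabs will fire their own XHR requests you can replay per meet.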
I think this question may have been answered here before, but I could not find the topic. I am a newbie to web scraping. I have to develop a script that will collect all the Google search results for a specific name, then grab the related data for that name; if more than one match is found, the data should be grouped by name.
All I know is that Google places some kind of restriction on scraping and provides a Custom Search API instead. I have not used that API yet, but I am hoping to get all the result links for a query from it. What I don't understand is the right process for scraping the information from those links. Any tutorial link or suggestion is very much appreciated.
You should have provided a bit more detail about what you have tried; it does not sound like you attempted to solve it yourself.
Anyway, if you are still on it:
You can scrape Google in two ways: one is allowed, one is not.
a) Use their API; you can get around 2k results a day.
You can raise that to around 3k a day for 2,000 USD/year, and higher still by getting in contact with them directly.
You will not be able to get accurate ranking positions with this method. If you only need a small number of requests and are mainly interested in finding websites for a keyword, this is the choice.
Starting point would be here: https://code.google.com/apis/console/
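Once you have created an API key and a custom search engine ID (cx) in the console, the JSON API can be called directly over HTTPS. A minimal sketch in R, with the key and cx as placeholders you must fill in:

```r
library(httr)
library(jsonlite)

api_key <- "YOUR_API_KEY"   # created in the API console
cx      <- "YOUR_CX_ID"     # ID of your custom search engine

resp <- GET(
  "https://www.googleapis.com/customsearch/v1",
  query = list(key = api_key, cx = cx, q = "john smith", num = 10)
)
stop_for_status(resp)

results <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
# Each item carries a title, link and snippet; the links are what you
# would then fetch and group by name.
results$items$link
```

The API returns at most 10 items per request; you page through further results with the start parameter.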
b) You can scrape the real search results
That's the only way to get the true ranking positions, for SEO purposes or for tracking website positions. It also lets you collect a large number of results, if done right.
You can Google for code, the most advanced free (PHP) code I know is at http://scraping.compunect.com
However, there are other projects and code snippets.
You can start off at 300-500 requests per day, and this can be multiplied by using multiple IPs. Look at the linked article if you want to go that route; it explains things in more detail and is quite accurate.
That said, if you choose route b) you breach Google's terms, so either do not accept them or make sure you are not detected. If Google detects you, your script will be blocked by IP ban or captcha. Not getting detected should be a priority.
I am interested in extracting data on paranormal activity reported in the news, so that I can analyze the places and times of the reported appearances for any correlations. This project is just for fun, to learn and use web scraping, text extraction, and spatial and temporal correlation analysis. So please forgive my choice of topic; I wanted to do something interesting and challenging.
First, I found that this website has a collection of reported paranormal incidents; they have collections for 2009, 2010, 2011 and 2012.
The structure of the website is: for every year there are pages 1..10, and the links look like this.
For year 2009:
link http://paranormal.about.com/od/paranormalgeneralinfo/tp/2009-paranormal-activity.htm
On each page the stories are collected under headings like this:
Internal structure
Paranormal Activity, Posted 03-14-09
Each of these headlines has two pages inside it, like this:
link http://paranormal.about.com/od/paranormalgeneralinfo/a/news_090314n.htm
On each of these pages are the actual reported stories, collected under various headlines, along with links to the original websites for those stories. I am interested in collecting that reported text and extracting information about the kind of paranormal activity (ghost, demon, or UFO) and the time, date, and place of the incidents. I wish to analyze this data for any spatial and temporal correlations. If UFOs or ghosts are real, they must show some behavior and correlations in space or time in their movements. That is the long version of the story...
I need help scraping the text from the pages described above. Below I have written code that follows one page and its links down to the final text I want. Can anyone tell me whether there is a better, more efficient way to get clean text from the final page, and how to automate collecting the text by following all 10 pages for the whole of 2009?
library(XML)
#source of paranormal news from about.com
#first page to start
#2009 - http://paranormal.about.com/od/paranormalgeneralinfo/tp/2009-paranormal-activity.htm
pn.url<-"http://paranormal.about.com/od/paranormalgeneralinfo/tp/2009-paranormal-activity.htm"
pn.html<-htmlTreeParse(pn.url,useInternalNodes=T)
pn.h3=xpathSApply(pn.html,"//h3",xmlValue)
#extracting the links of the headlines to follow to the story
pn.h3.links=xpathSApply(pn.html,"//h3/a/@href")
#Extracted the links of the Internal structure to follow ...
#Paranormal Activity, Posted 01-03-09 (following this head line)
#http://paranormal.about.com/od/paranormalgeneralinfo/a/news_090314n.htm
pn.l1.url<-pn.h3.links[1]
pn.l1.html<-htmlTreeParse(pn.l1.url,useInternalNodes=T)
pn.l1.links=xpathSApply(pn.l1.html,"//p/a/@href")
#Extracted the links of the Internal structure to follow ...
#British couple has 'black-and-white-twins' twice (following this head line)
#http://www.msnbc.msn.com/id/28471626/
pn.l1.f1.url=pn.l1.links[7]
pn.l1.f1.html=htmlTreeParse(pn.l1.f1.url,useInternalNodes=T)
pn.l1.f1.text=xpathSApply(pn.l1.f1.html,"//text()[not(ancestor::script)][not(ancestor::style)]",xmlValue)
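For the automation part, the per-page steps above can be rolled into a loop over every headline link on the index page, using the same XML functions. A sketch (note that position-based indices like the [7] above are fragile, so you may need to filter the extracted links instead):

```r
library(XML)

# Reuse the single-page steps above as a function: parse a page and pull
# out all visible text, dropping script/style contents.
page_text <- function(url) {
  doc <- htmlTreeParse(url, useInternalNodes = TRUE)
  txt <- xpathSApply(doc,
    "//text()[not(ancestor::script)][not(ancestor::style)]", xmlValue)
  paste(trimws(txt[nzchar(trimws(txt))]), collapse = " ")
}

# Follow every headline link on the 2009 index page.
pn.url  <- "http://paranormal.about.com/od/paranormalgeneralinfo/tp/2009-paranormal-activity.htm"
pn.html <- htmlTreeParse(pn.url, useInternalNodes = TRUE)
headline.links <- xpathSApply(pn.html, "//h3/a/@href")

# For each headline page, follow its story links and collect their text;
# failed fetches become NA rather than stopping the whole run.
stories <- lapply(headline.links, function(l) {
  Sys.sleep(1)  # be polite between requests
  story.html  <- htmlTreeParse(l, useInternalNodes = TRUE)
  story.links <- xpathSApply(story.html, "//p/a/@href")
  lapply(story.links, function(s)
    tryCatch(page_text(s), error = function(e) NA))
})
```

The same loop extended over the year pages 1..10 (and over 2010-2012) would cover the full collection.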
Sincere thanks in advance for reading my post and for your time in helping me.
I would be grateful to any expert who would like to mentor me through this project.
Regards
Sathish
Try the Scrapy and BeautifulSoup libraries. Although they are Python-based, they are considered among the best in the scraping domain. You can use the command line interface to connect the two; for more details about connecting R and Python, have a look here.
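If you would rather stay inside R, the reticulate package can drive BeautifulSoup directly instead of going through the command line. A minimal sketch, assuming a Python installation with the bs4 package available:

```r
library(reticulate)

# Import the Python bs4 module into the R session.
bs4 <- import("bs4")

# A tiny HTML fragment standing in for a fetched page.
html <- "<html><body><h3><a href='x.htm'>Story</a></h3></body></html>"

# Parse it with BeautifulSoup and pull out the link targets,
# mirroring the //h3/a/@href XPath used above.
soup  <- bs4$BeautifulSoup(html, "html.parser")
links <- soup$find_all("a")
sapply(links, function(a) a$get("href"))
```

This keeps the whole pipeline, from fetching through analysis, in a single R script.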