How to View Data (for Scraping) of Interactive Graph with Hover - web-scraping

I would like to scrape an interactive plot that displays different information depending on where the pointer is hovering. This is the website I am interested in: https://embed.chartblocks.com/1.0/?c=60dcd8c53ba0f68e2d162a90&t=44027b4de63d924
I would like to scrape:
Hover over the green bar for 2011 and get "Credits 6.6B".
Hover over the blue bar for 2011 and get "Debits 9.48B".
Any suggestions on what I should try? Thank you.

Here's a method with Python. The chart data is embedded in the page's first script tag as a JavaScript variable (var chartResponse = {...}), so you can extract it with a regex and parse it as JSON:
import requests
from bs4 import BeautifulSoup
import json
import re
import pandas as pd

url = 'https://embed.chartblocks.com/1.0/?c=60dcd8c53ba0f68e2d162a90&t=44027b4de63d924'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Pull the chartResponse JSON object out of the first script tag
script = str(soup.find('script'))
jsonStr = re.search(r"(var chartResponse = )({.*);", script).group(2)
jsonData = json.loads(jsonStr)['data']['series']

debits_data = jsonData['ds-0']['values']
credits_data = jsonData['ds-1']['values']

# The raw debits series contains each point twice; keep every other entry
debits_data = [x for idx, x in enumerate(debits_data) if idx % 2 == 0]

debits_df = pd.DataFrame(debits_data)
debits_df['type'] = 'Debit'
credits_df = pd.DataFrame(credits_data)
credits_df['type'] = 'Credit'

# DataFrame.append is deprecated; use pd.concat instead
results_df = pd.concat([debits_df, credits_df], sort=False).reset_index(drop=True)
Output:
print(results_df)
y x type
0 9476793016 2011 Debit
1 9789776200 2012 Debit
2 10197864252 2013 Debit
3 10639447839 2014 Debit
4 11262181013 2015 Debit
5 11859026050 2016 Debit
6 12515031664 2017 Debit
7 13437815328 2018 Debit
8 14436466583 2019 Debit
9 15179342495 2020 Debit
10 6602575885 2011 Credit
11 6960716962 2012 Credit
12 7353902696 2013 Credit
13 7658670144 2014 Credit
14 8050922524 2015 Credit
15 8469721067 2016 Credit
16 8974604919 2017 Credit
17 9532204297 2018 Credit
18 10313095156 2019 Credit
19 11611223625 2020 Credit
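The question asks for the tooltip strings ("Credits 6.6B", "Debits 9.48B") rather than the raw numbers, so it can help to reformat the scraped values the same way. A small sketch (human_readable is my own helper, not part of the chart's code):

```python
def human_readable(n):
    # Format a raw value the way the chart tooltip does, e.g. 6602575885 -> '6.6B'
    for divisor, suffix in ((1e9, "B"), (1e6, "M"), (1e3, "K")):
        if n >= divisor:
            return f"{n / divisor:.2f}".rstrip("0").rstrip(".") + suffix
    return str(n)

print(human_readable(6602575885))  # 6.6B  (the 2011 Credit tooltip)
print(human_readable(9476793016))  # 9.48B (the 2011 Debit tooltip)
```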

Related

Scraping: how to isolate multiple elements in the same source code on a website

import requests
from bs4 import BeautifulSoup
import pandas as pd
headers={'User-Agent': 'Chrome/106.0.0.0'}
page = "https://www.transfermarkt.com/premier-league/einnahmenausgaben/wettbewerb/GB1/ids/a/sa//saison_id/2010/saison_id_bis/2010/nat/0/pos//w_s//intern/0"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
ClubList = []
ExpenditureList = []
ArrivalsList = []
IncomeList = []
DeparturesList = []
BalanceList = []
Club = pageSoup.find_all("td", {"class": "hauptlink no-border-links"})
Expenditure = pageSoup.find_all("td", {"class": "rechts hauptlink redtext"})
Arrivals = pageSoup.find_all("td", {"class": "zentriert"})
Income = pageSoup.find_all("td", {"class": "rechts hauptlink greentext"})
Departures = pageSoup.find_all("td", {"class": "zentriert"})
Balance = pageSoup.find_all("td", {"class": "rechts hauptlink"})

for i in range(0, 20):
    ClubList.append(Club[i].text)
    ExpenditureList.append(Expenditure[i].text)
    ArrivalsList.append(Arrivals[i].text)
    IncomeList.append(Income[i].text)
    DeparturesList.append(Departures[i].text)
    BalanceList.append(Balance[i].text)

df = pd.DataFrame({"Club": ClubList, "Expenditure": ExpenditureList, "Arrivals": ArrivalsList, "Income": IncomeList, "Departures": DeparturesList, "Balance": BalanceList})
df.head(20)
I want to scrape Arrivals and Departures, but I am missing something when searching for the right information.
Hi everyone, I have a problem with elements in the same source code: I want to separate pieces of information that share the same markup, but I don't know how to do that.
Thanks in advance for your help.
Try to use pd.read_html:
import requests
import pandas as pd

url = "https://www.transfermarkt.com/premier-league/einnahmenausgaben/wettbewerb/GB1/ids/a/sa//saison_id/2010/saison_id_bis/2010/nat/0/pos//w_s//intern/0"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:105.0) Gecko/20100101 Firefox/105.0"
}

df = pd.read_html(requests.get(url, headers=headers).text)[1]
df = df.drop(columns=["club", "#", "Balance"])
df.columns = [
    "club",
    "Expenditure",
    "Arrivals",
    "Income",
    "Departures",
    "Balance",
]

print(df.to_markdown(index=False))
Prints:
| club                    | Expenditure | Arrivals | Income   | Departures | Balance   |
|-------------------------|-------------|----------|----------|------------|-----------|
| Manchester City         | €183.61m    | 33       | €40.15m  | 26         | €-143.46m |
| Chelsea FC              | €121.50m    | 19       | €16.50m  | 21         | €-105.00m |
| Liverpool FC            | €97.73m     | 21       | €101.50m | 24         | €3.78m    |
| Aston Villa             | €37.40m     | 20       | €28.40m  | 20         | €-9.00m   |
| Manchester United       | €29.30m     | 16       | €16.97m  | 18         | €-12.34m  |
| Sunderland AFC          | €29.10m     | 29       | €41.78m  | 29         | €12.68m   |
| Tottenham Hotspur       | €26.60m     | 39       | €2.94m   | 33         | €-23.67m  |
| Birmingham City         | €25.60m     | 24       | €175Th.  | 21         | €-25.43m  |
| Arsenal FC              | €23.00m     | 18       | €8.10m   | 22         | €-14.90m  |
| Stoke City              | €21.10m     | 24       | €6.78m   | 23         | €-14.33m  |
| Wolverhampton Wanderers | €20.22m     | 28       | €6.44m   | 25         | €-13.78m  |
| West Ham United         | €18.03m     | 18       | €4.95m   | 19         | €-13.08m  |
| Newcastle United        | €13.98m     | 18       | €41.75m  | 16         | €27.77m   |
| West Bromwich Albion    | €13.83m     | 28       | €850Th.  | 25         | €-12.98m  |
| Wigan Athletic          | €11.40m     | 23       | €3.40m   | 21         | €-8.00m   |
| Fulham FC               | €11.05m     | 16       | €12.81m  | 22         | €1.76m    |
| Blackpool FC            | €5.73m      | 27       | €700Th.  | 23         | €-5.03m   |
| Bolton Wanderers        | €5.40m      | 17       | €1.40m   | 19         | €-4.00m   |
| Blackburn Rovers        | €5.05m      | 29       | €275Th.  | 29         | €-4.78m   |
| Everton FC              | €1.70m      | 12       | €6.60m   | 14         | €4.90m    |
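To compute with those figures, the euro strings have to be converted to numbers first. A minimal sketch (parse_euro is my own helper, assuming the only suffixes are 'm' for millions and 'Th.' for thousands, as in the table above):

```python
def parse_euro(s):
    # Convert Transfermarkt-style strings like '€183.61m' or '€175Th.'
    # to integer euro amounts; assumes only 'm' and 'Th.' suffixes occur.
    s = s.replace("€", "").strip()
    if s.endswith("Th."):
        return int(round(float(s[:-3]) * 1_000))
    if s.endswith("m"):
        return int(round(float(s[:-1]) * 1_000_000))
    return int(round(float(s)))

print(parse_euro("€183.61m"))   # 183610000
print(parse_euro("€-143.46m"))  # -143460000
```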

What is the syntax for a stop message when no data

I am importing some data in to R and want the code to stop running if there is no file or there is no data in the file. I'm using base R and readxl. Please can you help with the syntax?
I've tried
if (dim(Llatest) == NULL) {stop('STOP NO DATA')}
if (dim(Llatest)[1] == 0) + stop('STOP NO DATA')}
if (isTRUE(dim(Llatest) == NULL)) {stop('STOP NO DATA')}
Some data imported from Sep19import.xlsx
ID Code Received Actioned Decision
1 123 Jul 01 2019 Sep 02 2019 Hold
2 456 Jul 11 2019 Sep 13 2019 No action
3 789 Nov 26 2018 Sep 25 2019 Investigate
4 321 Sep 12 2019 Sep 12 2019 Await decision
5 654 Aug 30 2019 Sep 26 2019 Hold
6 987 Feb 22 2019 Sep 02 2019 Investigate
Obtain list of files for import
LFiles <- list.files(path = "C:/Projects/Sep/code", pattern = "*import.xlsx", full.names = TRUE)
***I wish to stop here if LFiles is empty
Identify the latest file
Llatest <- subset(LFiles, LFiles == max(LFiles))
Extract data from file
LMonthly <- read_excel(Llatest)
***I wish to stop here if LMonthly is empty
Error message received: "no non-missing arguments, returning NA"
I expect the output to be 'STOP NO DATA'

A new table in R based on fields and data from existing one

I need to raise the question again, as it was closed as a duplicate but the issue hasn't been resolved.
So, I'm working on international trade data and currently have the following table, with 5 different values for commodity_code (commod_codes = c('85','84','87','73','29')):
year trade_flow reporter partner commodity_code commodity trade_value_usd
1 2012 Import Belarus China 29 Organic chemicals 150863100
2 2013 Import Belarus China 29 Organic chemicals 151614000
3 2014 Import Belarus China 29 Organic chemicals 73110200
4 2015 Import Belarus China 29 Organic chemicals 140396300
5 2016 Import Belarus China 29 Organic chemicals 135311600
6 2012 Import Belarus China 73 Articles of iron or steel 100484600
I need to create a new table that looks simple (commodity codes in top row, years in first column and corresponding trade values in cells):
year commodity_code
29 73 84 85 87
1998 value1 ... value 5
1999
…
2016
* I used reshape() but didn't succeed.
Would appreciate your support.
In case there are duplicate permutations, I would suggest this code (though not base R - it uses the dplyr and tidyr packages):
trade_data[, c("year", "commodity_code", "trade_value_usd")] %>%
  group_by(year, commodity_code) %>%
  summarise(trade_value_usd = sum(trade_value_usd)) %>%
  spread(commodity_code, trade_value_usd) %>%
  as.data.frame()
Provided I understood you correctly, here is a one-liner in base R.
xtabs(trade_value_usd ~ year + commodity_code, data = df);
#year 29 73
# 2012 150863100 100484600
# 2013 151614000 0
# 2014 73110200 0
# 2015 140396300 0
# 2016 135311600 0
Explanation: Use xtabs to cross-tabulate trade_value_usd as a function of year (rows) and commodity_code (columns).
Sample data
df <- read.table(text =
"year trade_flow reporter partner commodity_code commodity trade_value_usd
1 2012 Import Belarus China 29 'Organic chemicals' 150863100
2 2013 Import Belarus China 29 'Organic chemicals' 151614000
3 2014 Import Belarus China 29 'Organic chemicals' 73110200
4 2015 Import Belarus China 29 'Organic chemicals' 140396300
5 2016 Import Belarus China 29 'Organic chemicals' 135311600
6 2012 Import Belarus China 73 'Articles of iron or steel' 100484600
", header = T, row.names = 1)

how to retrieve text from span & p tag in r

I have the following link:
url2 = "https://timesofindia.indiatimes.com/topic/Adani"
From the above URL I want to extract the headline, the paragraph below it, and the date, in 3 different columns.
I am able to extract only one news headline and paragraph with the following code:
results_headline <- url2 %>%
read_html() %>%
html_nodes(xpath = '//*[@id="c_topic_list1_1"]/div[1]/ul/li[4]/div/a/span[1]')
results_para <- url2 %>%
read_html() %>%
html_nodes(xpath = '//*[@id="c_topic_list1_1"]/div[1]/ul/li[4]/div/a/p')
I want to extract all the headlines,paragraph and date on that page.
How can I do it in R?
Once again, you can simply use CSS selectors to extract the content. Reading the page once and reusing the parsed document:
url2 <- "https://timesofindia.indiatimes.com/topic/Adani"
page <- read_html(url2)
titles <- page %>% html_nodes("div > a > span.title") %>% html_text()
dates <- page %>% html_nodes("div > a > span.meta") %>% html_text()
desc <- page %>% html_nodes("div > a > p") %>% html_text()
data.frame(titles, dates, desc)
output:
> data.frame(titles,dates,desc)
titles dates
1 \nDRI drops Adani Group overvaluation case\n Oct 28
2 \nAdani Enterprises to demerge renewable energy biz\n Oct 7
3 \nAdani Enterprises' Q2 PAT falls 6% to Rs 59 cr\n Nov 13
4 \nAdani firm close to finalising RInfra power acquisition deal\n Nov 12
5 \nAdani group shares surge up to 9%\n Aug 28
6 \nAdani Transmission acquires RInfra WRSSS assets for Rs 1k cr\n Nov 1
7 \nVedanta, Adani may bid for Bunder diamond project in MP\n Oct 27
8 \nAdani Power coercing land from farmers: M K Stalin\n Oct 31
9 \nAdani Transmission acquires 2 SPVs from RVPN\n Aug 6
desc
1 Additional director general, DRI (adjudication), K V S Singh, has dropped all charges and summarily closed all proceedings in a speaking order.
2 New Delhi, Oct 7 () Adani Enterprises today announced plans to demerge its renewable energy business into associate company Adani Green Energy Ltd as part of simplifying overall business structure.
3 New Delhi, Nov 13 () Adani Enterprises, the flagship firm of Adani group, today said its profit after tax fell by 6.34 per cent to Rs 59 crore in the July-September quarter of 2017-18 compared to Rs 63 crore in the same quarter a year ago.
4 New Delhi, Nov 12 () Adani Transmission is likely to clinch a deal of Rs 13,000-14,000 crore with Reliance Infrastructure to acquire the latter's Mumbai power business much before the January 2018 deadline to mark its foray into power distribution business.
5 New Delhi, Aug 28 () Shares of Adani group of companies surged up to 9 per cent today as the mining giant will start work on its 16.5 billion dollar Carmichael coal project in Australia in October and is expected to ship the first consignment in March 2020. The stock jumped 9.
6 New Delhi, Nov 1 () Adani Transmission today said it has completed acquisition of operational transmission assets of WRSS Schemes of Reliance Infra for Rs 1,000 crore. In effect, its power-wheeling network crossed the 8,500 circuit km mark.
7 New Delhi, Oct 27 () Metals and mining major Vedanta Ltd and the Adani Group may bid for the Bunder diamond project in Madhya Pradesh from which global giant Rio Tinto exited this year, according to sources. "Vedanta may bid for the Bunder project," said a source on the condition of anonymity.
8
9
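The same child-combinator CSS selectors work outside rvest too; BeautifulSoup's select() accepts them, for example. A sketch on a static HTML snippet (the live page layout may have changed since this answer was written):

```python
from bs4 import BeautifulSoup

# A static snippet mimicking the structure the selectors above target
html = """
<div><a href="#">
  <span class="title">DRI drops Adani Group overvaluation case</span>
  <span class="meta">Oct 28</span>
  <p>Additional director general, DRI (adjudication)...</p>
</a></div>
"""
soup = BeautifulSoup(html, "html.parser")
titles = [t.get_text(strip=True) for t in soup.select("div > a > span.title")]
dates = [t.get_text(strip=True) for t in soup.select("div > a > span.meta")]
print(titles, dates)
```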

rvest: select an option and submit form

I am trying to extract the unemployment rate data from this site. In the form, there is a select tag with some options. I can extract the table for the default years 2007 to 2017, but I am having a hard time setting values for from_year and to_year. Here is the code I have so far:
session = html_session("https://data.bls.gov/timeseries/LNS14000000")
form = read_html("https://data.bls.gov/timeseries/LNS14000000") %>% html_node("table form") %>% html_form()
set_values(form, from_year = 2000, to_year = as.numeric(format(Sys.Date(), "%Y"))) # nothing happened if I set the value for years
submit_form(session, form)
It doesn't work as expected.
Thanks so much @Andrew!
I can use the API to extract the data.
library(rjson)
library(blsAPI)

uer1 <- list(
  'seriesid' = c('LNS14000000'),
  'startyear' = 2000,
  'endyear' = 2009
)
response <- blsAPI(uer1, 2, TRUE)
The response looks like:
year period periodName value seriesID
1 2009 M12 December 9.9 LNS14000000
2 2009 M11 November 9.9 LNS14000000
3 2009 M10 October 10.0 LNS14000000
4 2009 M09 September 9.8 LNS14000000
5 2009 M08 August 9.6 LNS14000000
6 2009 M07 July 9.5 LNS14000000
...
Note that the API imposes some query limits; see the BLS documentation on API limits.
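For readers without the blsAPI package, the same request can be made directly against the public BLS API v2 endpoint from Python. A minimal sketch (the endpoint URL and payload shape follow the public BLS documentation; the network call itself is left commented out):

```python
import json

# Equivalent payload for the public BLS API v2, using the same series
# and year range as the blsAPI call above. Years are passed as strings.
payload = json.dumps({
    "seriesid": ["LNS14000000"],
    "startyear": "2000",
    "endyear": "2009",
})

# To actually fetch (requires the requests package):
# import requests
# r = requests.post("https://api.bls.gov/publicAPI/v2/timeseries/data/",
#                   data=payload,
#                   headers={"Content-type": "application/json"})
# rows = r.json()["Results"]["series"][0]["data"]
print(payload)
```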
