Scraping: How to isolate multiple elements in the same page source on a website - web-scraping

import requests
from bs4 import BeautifulSoup
import pandas as pd
headers={'User-Agent': 'Chrome/106.0.0.0'}
page = "https://www.transfermarkt.com/premier-league/einnahmenausgaben/wettbewerb/GB1/ids/a/sa//saison_id/2010/saison_id_bis/2010/nat/0/pos//w_s//intern/0"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
ClubList = []
ExpenditureList = []
ArrivalsList = []
IncomeList = []
DeparturesList = []
BalanceList = []
Club = pageSoup.find_all("td", {"class": "hauptlink no-border-links"})
Expenditure = pageSoup.find_all("td", {"class": "rechts hauptlink redtext"})
Arrivals = pageSoup.find_all("td", {"class": "zentriert"})
Income = pageSoup.find_all("td", {"class": "rechts hauptlink greentext"})
Departures = pageSoup.find_all("td", {"class": "zentriert"})
Balance = pageSoup.find_all("td", {"class": "rechts hauptlink"})
for i in range(0,20):
    ClubList.append(Club[i].text)
    ExpenditureList.append(Expenditure[i].text)
    ArrivalsList.append(Arrivals[i].text)
    IncomeList.append(Income[i].text)
    DeparturesList.append(Departures[i].text)
    BalanceList.append(Balance[i].text)
df = pd.DataFrame({"Club":ClubList,"Expenditure":ExpenditureList,"Arrivals":ArrivalsList,"Income":IncomeList,"Departures":DeparturesList,"Balance":BalanceList})
df.head(20)
Hi everyone,
I want to scrape the Arrivals and Departures columns. Both values sit in cells with the same class ("zentriert") in the page source, and I don't know how to separate them, so my search is not picking up the right information.
Thanks in advance for your help.

Try to use pd.read_html:
import requests
import pandas as pd
url = "https://www.transfermarkt.com/premier-league/einnahmenausgaben/wettbewerb/GB1/ids/a/sa//saison_id/2010/saison_id_bis/2010/nat/0/pos//w_s//intern/0"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:105.0) Gecko/20100101 Firefox/105.0"
}
df = pd.read_html(requests.get(url, headers=headers).text)[1]
df = df.drop(columns=["club", "#", "Balance"])
df.columns = [
    "club",
    "Expenditure",
    "Arrivals",
    "Income",
    "Departures",
    "Balance",
]
print(df.to_markdown(index=False))
Prints:
| club                    | Expenditure | Arrivals | Income   | Departures | Balance   |
|:------------------------|:------------|---------:|:---------|-----------:|:----------|
| Manchester City         | €183.61m    |       33 | €40.15m  |         26 | €-143.46m |
| Chelsea FC              | €121.50m    |       19 | €16.50m  |         21 | €-105.00m |
| Liverpool FC            | €97.73m     |       21 | €101.50m |         24 | €3.78m    |
| Aston Villa             | €37.40m     |       20 | €28.40m  |         20 | €-9.00m   |
| Manchester United       | €29.30m     |       16 | €16.97m  |         18 | €-12.34m  |
| Sunderland AFC          | €29.10m     |       29 | €41.78m  |         29 | €12.68m   |
| Tottenham Hotspur       | €26.60m     |       39 | €2.94m   |         33 | €-23.67m  |
| Birmingham City         | €25.60m     |       24 | €175Th.  |         21 | €-25.43m  |
| Arsenal FC              | €23.00m     |       18 | €8.10m   |         22 | €-14.90m  |
| Stoke City              | €21.10m     |       24 | €6.78m   |         23 | €-14.33m  |
| Wolverhampton Wanderers | €20.22m     |       28 | €6.44m   |         25 | €-13.78m  |
| West Ham United         | €18.03m     |       18 | €4.95m   |         19 | €-13.08m  |
| Newcastle United        | €13.98m     |       18 | €41.75m  |         16 | €27.77m   |
| West Bromwich Albion    | €13.83m     |       28 | €850Th.  |         25 | €-12.98m  |
| Wigan Athletic          | €11.40m     |       23 | €3.40m   |         21 | €-8.00m   |
| Fulham FC               | €11.05m     |       16 | €12.81m  |         22 | €1.76m    |
| Blackpool FC            | €5.73m      |       27 | €700Th.  |         23 | €-5.03m   |
| Bolton Wanderers        | €5.40m      |       17 | €1.40m   |         19 | €-4.00m   |
| Blackburn Rovers        | €5.05m      |       29 | €275Th.  |         29 | €-4.78m   |
| Everton FC              | €1.70m      |       12 | €6.60m   |         14 | €4.90m    |
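If you would rather keep parsing the cells yourself, one option is to walk the table row by row and pick cells by column position, since Arrivals and Departures share the same class. The following is only a sketch: the sample HTML, the column positions (2 and 4) and the RowCollector helper are assumptions made for illustration, and the real page's markup may differ.

```python
from html.parser import HTMLParser

# Hypothetical two-row sample; the real page's markup may differ.
SAMPLE_HTML = """
<table>
  <tr><td class="hauptlink">Club A</td><td class="rechts">10m</td>
      <td class="zentriert">33</td><td class="rechts">4m</td>
      <td class="zentriert">26</td></tr>
  <tr><td class="hauptlink">Club B</td><td class="rechts">7m</td>
      <td class="zentriert">19</td><td class="rechts">2m</td>
      <td class="zentriert">21</td></tr>
</table>
"""


class RowCollector(HTMLParser):
    """Collect the text of every <td>, grouped by the <tr> it belongs to."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows, each a list of cell strings
        self._row = None       # cells of the row currently being parsed
        self._in_cell = False  # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


parser = RowCollector()
parser.feed(SAMPLE_HTML)

# Same class, different meaning: pick the cells by column position instead.
arrivals = [row[2] for row in parser.rows]
departures = [row[4] for row in parser.rows]
print(arrivals, departures)
```

Working per row keeps each club's values together, so a missing cell in one row cannot shift the values of every later row the way a flat find_all list can.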

Related

R: move everything after a word to a new column and then only keep the last four digits in the new column

My data frame has a column called "State" and contains the state name, HB/HF number, and the date the law went into effect. I want the state column to only contain the state name and the second column to contain just the year. How would I do this?
Mintz = read.csv('https://github.com/bandcar/mintz/raw/main/State%20Legislation%20on%20Biosimilars2.csv')
mintz = Mintz
# delete rows if col 2 has a blank value.
mintz = mintz[mintz$Substitution.Requirements != "", ]
# removes entire row if column 1 has the word State
mintz=mintz[mintz$State != "State", ]
#reset row numbers
mintz= mintz %>% data.frame(row.names = 1:nrow(.))
# delete PR
mintz = mintz[-34,]
#reset row numbers
mintz= mintz %>% data.frame(row.names = 1:nrow(.))
I'm almost certain I'll need to use strsplit(gsub()), but I'm not sure how to do this since there's no specific pattern.
EDIT
I still need help keeping only the state name in column 1.
As for moving the year to a new column, I found the below. It works, but I don't know why it works. From my understanding \d means that \d is the actual character it's searching for. the "." means to search for one character, and I have no idea what the \1 means. Another strange thing is that Minnesota (row 20) did not have a year, so it instead used characters. Isn't \d only supposed to be for digits? Someone care to explain?
mintz2 = mintz
mintz2$Year = sub('.*(\\d{4}).*', '\\1', mintz2$State)
One way could be:
For demonstration purposes, select only the State column.
Then use str_extract with the pattern \\d{4}$ to pull out the four digits at the end of the string; this gives the Year column.
Finally, collapse the built-in state.name vector into a single regex pattern, use it with str_extract to keep only the state name, and drop the rows that contain NA.
library(dplyr)
library(stringr)
mintz %>%
  select(State) %>%
  mutate(Year = str_extract(State, '\\d{4}$'), .after=State,
         State = str_extract(State, paste(state.name, collapse='|'))
  ) %>%
  na.omit()
State Year
2 Arizona 2016
3 California 2016
7 Connecticut 2018
12 Florida 2013
13 Georgia 2015
16 Hawaii 2016
21 Illinois 2016
24 Indiana 2014
28 Iowa 2017
32 Kansas 2017
33 Kentucky 2016
34 Louisiana 2015
39 Maryland 2017
42 Michigan 2018
46 Missouri 2016
47 Montana 2017
50 Nebraska 2018
51 Nevada 2018
54 New Hampshire 2018
55 New Jersey 2016
59 New York 2017
62 North Carolina 2015
63 North Dakota 2013
66 Ohio 2017
67 Oregon 2016
70 Pennsylvania 2016
74 Rhode Island 2016
75 South Carolina 2017
78 South Dakota 2019
79 Tennessee 2015
82 Texas 2015
85 Utah 2015
88 Vermont 2018
89 Virginia 2013
92 Washington 2015
93 West Virginia 2018
96 Wisconsin 2019
97 Wyoming 2018
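As for why the sub() call in the question works: in an R string, \\d is the regex \d, which matches any digit; the dot matches any single character and * repeats it; the parentheses capture the four digits; and \\1 in the replacement stands for whatever the first group captured. When the pattern does not match at all, nothing is replaced and the input comes back unchanged, which would explain the row without a year keeping its letters. Here is a small Python sketch of the same behaviour; the sample strings are made up for illustration and are not taken from the CSV.

```python
import re

# Greedy prefix, four captured digits, then any suffix.
PATTERN = r".*(\d{4}).*"


def year_or_original(text):
    # \1 in the replacement is the text captured by the first (...) group.
    # If PATTERN does not match, re.sub returns the text unchanged.
    return re.sub(PATTERN, r"\1", text, count=1)


def trailing_year(text):
    # Anchoring with $ only accepts four digits at the very end.
    match = re.search(r"\d{4}$", text)
    return match.group(0) if match else None


print(year_or_original("Arizona HB 2310 (2016)"))  # 2016
print(year_or_original("Minnesota"))               # Minnesota
print(trailing_year("Minnesota"))                  # None
```

An extraction that returns a missing value when there is no match, as str_extract does, makes rows without a year easier to spot than a substitution that silently leaves them as they were.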

How to View Data (for Scraping) of Interactive Graph with Hover

I would like to scrape an interactive plot that displays different information based on where the pointer is hovering. This website is what I am interested in: https://embed.chartblocks.com/1.0/?c=60dcd8c53ba0f68e2d162a90&t=44027b4de63d924
I would like to scrape:
Hover green bar for 2011 and get "Credits 6.6B".
Hover blue bar for 2011 and get "Debits 9.48B".
Any suggestion on what I should try? Thank you
Here's a method with Python.
import requests
from bs4 import BeautifulSoup
import json
import re
import pandas as pd
url = 'https://embed.chartblocks.com/1.0/?c=60dcd8c53ba0f68e2d162a90&t=44027b4de63d924'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
script =str(soup.find('script'))
jsonStr = re.search(r"(var chartResponse = )({.*);", script).group(2)
jsonData = json.loads(jsonStr)['data']['series']
debits_data = jsonData['ds-0']['values']
credits_data = jsonData['ds-1']['values']
debits_data = [x for idx, x in enumerate(debits_data) if idx%2 == 0]
debits_df = pd.DataFrame(debits_data)
debits_df['type'] = 'Debit'
credits_df = pd.DataFrame(credits_data)
credits_df['type'] = 'Credit'
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
results_df = pd.concat([debits_df, credits_df], sort=False).reset_index(drop=True)
Output:
print(results_df)
y x type
0 9476793016 2011 Debit
1 9789776200 2012 Debit
2 10197864252 2013 Debit
3 10639447839 2014 Debit
4 11262181013 2015 Debit
5 11859026050 2016 Debit
6 12515031664 2017 Debit
7 13437815328 2018 Debit
8 14436466583 2019 Debit
9 15179342495 2020 Debit
10 6602575885 2011 Credit
11 6960716962 2012 Credit
12 7353902696 2013 Credit
13 7658670144 2014 Credit
14 8050922524 2015 Credit
15 8469721067 2016 Credit
16 8974604919 2017 Credit
17 9532204297 2018 Credit
18 10313095156 2019 Credit
19 11611223625 2020 Credit
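The key step above is that the chart data is not rendered on hover but embedded as JSON in a script tag, so it can be cut out with a regex and parsed. Below is a minimal, self-contained sketch of just that step. The HTML fragment is a hypothetical, heavily shortened stand-in for the real embed; only the two 2011 values and the ds-0/ds-1 keys come from the answer.

```python
import json
import re

# Hypothetical page fragment; the real embed carries many more fields.
html = """
<script>
var chartResponse = {"data": {"series": {
    "ds-0": {"values": [{"x": 2011, "y": 9476793016}]},
    "ds-1": {"values": [{"x": 2011, "y": 6602575885}]}
}}};
</script>
"""

# Capture everything from the first "{" up to the closing "};".
match = re.search(r"var chartResponse = (\{.*\});", html, flags=re.DOTALL)
series = json.loads(match.group(1))["data"]["series"]

# Flatten both series into one list of records with a type label.
rows = [
    {"type": label, **point}
    for key, label in (("ds-0", "Debit"), ("ds-1", "Credit"))
    for point in series[key]["values"]
]
print(rows)
```

If the site changes the variable name or nests the data differently, the regex and the key path are the two places that need adjusting.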

How to use If function in R to create a column using multiple conditions

I am not familiar with R and need your help with this issue.
I have a data frame named df with 25 variables (25 columns); a simplified version:
name experience Club age Position
luc 2 FCB 18 Goalkeeper
jean 9 Real 26 midfielder
ronaldo 14 FCB 32 Goalkeeper
jean 9 Real 26 midfielder
messi 11 Liverpool 35 midfielder
tevez 6 Chelsea 27 Attack
inzaghi 9 Juve 34 Defender
kwfni 17 Bayern 40 Attack
Blabla 9 Real 25 midfielder
wdfood 11 Liverpool 33 midfielder
player2 7 Chelsea 28 Attack
player3 10 Juve 34 Defender
fgh 17 Bayern 40 Attack
I would like to add a column named "country" to this data frame. The new column depends on the club, according to the following conditions:
Juve Italy
FCB Spain
Real Spain
Chelsea England
Liverpool England
Bayern Germany
So, let's say, if the club is FCB or Real, the value in country is Spain.
The output of df$Country should be as follows:
Country
Spain
Spain
Spain
Spain
England
England
Italy
Germany
Spain
England
England
Italy
Germany
The code I started writing is the following:
df$country=ifelse(df$Club=="FCB","spain", df$Club=="Real","Spain" ......)
But it seems to be wrong.
My real data set has more than 250 different values in the "Club" column and more than 30 in "Country", so doing this manually would take too long.
Could you help me with this, please?
Do you know how to use if-else statements inside for loops? This would be the simplest way out.
Something like this:
df <- data.frame(name = c("a", "b", "c"),
                 Club = c("FCB", "Real", "Liverpool"),
                 stringsAsFactors = FALSE)
for(i in 1:nrow(df)){
  if(df$Club[i] == "FCB" | df$Club[i] == "Real"){
    df$country[i] <- "Spain"
  } else if(df$Club[i] == "Liverpool"){
    df$country[i] <- "England"
  } else{
    df$country[i] <- NA
  }
}
df
df
# name Club country
# 1 a FCB Spain
# 2 b Real Spain
# 3 c Liverpool England
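With 250 clubs, a chain of if/else branches gets long quickly. A different technique from the loop above is a lookup table: store the club-to-country pairs once and look each club up. Here is a sketch of that idea in Python, using the pairs from the question; the names CLUB_TO_COUNTRY and clubs are illustrative. In R the same idea is usually expressed with a named vector or a merge against a two-column table.

```python
# Club -> country pairs taken from the question; extend as needed.
CLUB_TO_COUNTRY = {
    "Juve": "Italy",
    "FCB": "Spain",
    "Real": "Spain",
    "Chelsea": "England",
    "Liverpool": "England",
    "Bayern": "Germany",
}

# First eight clubs of the sample data, in order.
clubs = ["FCB", "Real", "FCB", "Real", "Liverpool", "Chelsea", "Juve", "Bayern"]

# dict.get returns None for a club missing from the table instead of raising.
countries = [CLUB_TO_COUNTRY.get(club) for club in clubs]
print(countries)
```

The mapping can also live in a separate file, so adding a club means adding one row rather than one more branch.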

Not sure how to separate a column of data that I scraped

I have scraped data from the schedule of Albany Women's Basketball team from an espn website and the win/loss column is formatted like this: W 77-70, which means that Albany won 77-70. I want to separate this so that one column shows how many points Albany scored, and how many points the opponent scored.
Here is my code, not sure what to do next:
library(rvest)
library(stringr)
library(tidyr)
w.url <- "http://www.espn.com/womens-college-basketball/team/schedule/_/id/399"
webpage <- read_html(w.url)
w_table <- html_nodes(webpage, 'table')
w <- html_table(w_table)[[1]]
head(w)
w <- w[-(1:2), ]
names(w) <- c("Date", "Opponent", "Score", "Record")
head(w)
You can first drop the rows that do not contain real results using grepl, and then use a regex to extract the specific pieces of information:
w <- w[grepl("-", w$Score),]
gsub("^([A-Z])([0-9]+)-([0-9]+).*", "\\1,\\2,\\3", w$Score) %>%
  strsplit(., split = ",") %>%
  lapply(function(x){
    data.frame(
      result = x[1],
      oponent = ifelse(x[1] == "L", x[2], x[3]),
      albany = ifelse(x[1] == "W", x[2], x[3])
    )
  }) %>%
  do.call('rbind',.) %>%
  cbind(w,.) -> w2
head(w2)
# Date Opponent Score Record result oponent albany
#3 Fri, Nov 9 ##22 South Florida L74-37 0-1 (0-0) L 74 37
#4 Mon, Nov 12 #Cornell L48-34 0-2 (0-0) L 48 34
#5 Wed, Nov 14 vsManhattan W60-54 1-2 (0-0) W 54 60
#6 Sun, Nov 18 #Rutgers L65-39 1-3 (0-0) L 65 39
#7 Wed, Nov 21 #Monmouth L64-56 1-4 (0-0) L 64 56
#8 Sun, Nov 25 vsHoly Cross L56-50 1-5 (0-0) L 56 50
This is how I did it. Basically, use sub to extract either the winning or the losing score, depending on whether Albany won or lost. The winner's score is always listed first, regardless of whether Albany won, so the ifelse function is necessary. The parentheses capture the digits, and "\\1" in the replacement refers back to that captured group.
w<-w[1:24,]
w$Albany<-ifelse(substr(w$Score,1,1)=='W',sub('W(\\d+)-\\d+','\\1',w$Score),sub('L\\d+-(\\d+)','\\1',w$Score))
w$Opponent_Team<-ifelse(substr(w$Score,1,1)=='W',sub('W\\d+-(\\d+)','\\1',w$Score),sub('L(\\d+)-\\d+','\\1',w$Score))
head(w)
Date Opponent Score Record Albany Opponent_Team
3 Fri, Nov 9 ##22 South Florida L74-37 0-1 (0-0) 37 74
4 Mon, Nov 12 #Cornell L48-34 0-2 (0-0) 34 48
5 Wed, Nov 14 vsManhattan W60-54 1-2 (0-0) 60 54
6 Sun, Nov 18 #Rutgers L65-39 1-3 (0-0) 39 65
7 Wed, Nov 21 #Monmouth L64-56 1-4 (0-0) 56 64
8 Sun, Nov 25 vsHoly Cross L56-50 1-5 (0-0) 50 56
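Both answers rely on the same rule: the winner's points come first, so which number belongs to Albany depends on the leading W or L. Here is a short Python sketch of that rule; the function name split_score and the returned keys are illustrative, and it assumes scores look like "W60-54" or "W 77-70".

```python
import re

# Result letter, optional space, winner's points, dash, loser's points.
SCORE = re.compile(r"^([WL])\s*(\d+)-(\d+)")


def split_score(score):
    """Return Albany's and the opponent's points, or None if not a result."""
    match = SCORE.match(score)
    if match is None:
        return None
    result = match.group(1)
    winner_pts, loser_pts = int(match.group(2)), int(match.group(3))
    if result == "W":
        return {"result": result, "albany": winner_pts, "opponent": loser_pts}
    return {"result": result, "albany": loser_pts, "opponent": winner_pts}


print(split_score("W60-54"))     # Albany 60, opponent 54
print(split_score("L74-37"))     # Albany 37, opponent 74
print(split_score("Postponed"))  # None
```

Returning None for rows that are not results plays the same role as the grepl filter in the first answer.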

A new table in R based on fields and data from existing one

I need to raise the question again as it was closed as a duplicate, but the issue hasn't been resolved.
So, I'm working on international trade data and have the following table at the moment with 5 different values for commodity_code (commod_codes = c('85','84','87','73','29')):
year trade_flow reporter partner commodity_code commodity trade_value_usd
1 2012 Import Belarus China 29 Organic chemicals 150863100
2 2013 Import Belarus China 29 Organic chemicals 151614000
3 2014 Import Belarus China 29 Organic chemicals 73110200
4 2015 Import Belarus China 29 Organic chemicals 140396300
5 2016 Import Belarus China 29 Organic chemicals 135311600
6 2012 Import Belarus China 73 Articles of iron or steel 100484600
I need to create a new table that looks simple (commodity codes in top row, years in first column and corresponding trade values in cells):
year commodity_code
29 73 84 85 87
1998 value1 ... value 5
1999
…
2016
* I used reshape() but didn't succeed.
Would appreciate your support.
In case there are duplicate year/commodity_code combinations, I would suggest this code (not base R; it uses the dplyr and tidyr packages):
as.data.frame(trade_data[,c("year","commodity_code","trade_value_usd")] %>% group_by (year,commodity_code)%>% summarise( sum(trade_value_usd))%>%spread(commodity_code,3))
Provided I understood you correctly, here is a one-liner in base R.
xtabs(trade_value_usd ~ year + commodity_code, data = df);
#year 29 73
# 2012 150863100 100484600
# 2013 151614000 0
# 2014 73110200 0
# 2015 140396300 0
# 2016 135311600 0
Explanation: Use xtabs to cross-tabulate trade_value_usd as a function of year (rows) and commodity_code (columns).
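For readers who want to see what the cross-tabulation does step by step, here is a rough Python equivalent using only the standard library. It is a sketch, not a replacement for xtabs: the records list repeats the six sample rows from the question, values for the same year and code are summed, and missing combinations come out as 0.

```python
from collections import defaultdict

# (year, commodity_code, trade_value_usd) rows from the question's sample.
records = [
    (2012, "29", 150863100),
    (2013, "29", 151614000),
    (2014, "29", 73110200),
    (2015, "29", 140396300),
    (2016, "29", 135311600),
    (2012, "73", 100484600),
]

# table[year][code] accumulates trade values; missing cells default to 0.
table = defaultdict(lambda: defaultdict(int))
for year, code, value in records:
    table[year][code] += value

# Print years as rows and commodity codes as columns.
codes = sorted({code for _, code, _ in records})
print("year", *codes)
for year in sorted(table):
    print(year, *(table[year][code] for code in codes))
```

Summing inside the loop is what makes duplicate year/code rows safe, which is the same concern the dplyr answer addresses with group_by and summarise.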
Sample data
df <- read.table(text =
"year trade_flow reporter partner commodity_code commodity trade_value_usd
1 2012 Import Belarus China 29 'Organic chemicals' 150863100
2 2013 Import Belarus China 29 'Organic chemicals' 151614000
3 2014 Import Belarus China 29 'Organic chemicals' 73110200
4 2015 Import Belarus China 29 'Organic chemicals' 140396300
5 2016 Import Belarus China 29 'Organic chemicals' 135311600
6 2012 Import Belarus China 73 'Articles of iron or steel' 100484600
", header = T, row.names = 1)
