I'm trying to scrape a table from Sports Reference:
cu_url <- "https://www.sports-reference.com/cbb/schools/creighton/"
I was able to get the table into a data frame as intended like this:
cu_html <- read_html(cu_url)
cu_table <- html_nodes(cu_html, "table")
cu_info <- data.frame(html_table(cu_table))
colnames(cu_info) <- cu_info[1,]
cu_info <- cu_info[-1,]
However, I noticed after the fact that the header row repeats throughout the data. For example, row 22 shows the headers again as a row. Is there an efficient way to remove these? In the HTML, the header rows all have a table row (<tr>) class of "thead", so I'm wondering if I can ask rvest to ignore these, but I've failed when trying this using !=.
Appreciate any thoughts. If I need to remove the actual header in order for this to work I will but would prefer to keep that one and just remove the repeats.
You can keep only the rows that contain nothing but digits in the Rk column.
library(rvest)
library(dplyr)
cu_url %>%
read_html %>%
html_nodes('table') %>%
html_table() %>%
.[[1]] %>%
setNames(make.unique(unlist(.[1,]))) %>%
slice(-1L) %>%
filter(grepl('^\\d+$', Rk)) -> result
result
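The same filter can be tried without hitting the website. Below is a minimal base R sketch on a made-up data frame (the values are invented for illustration, only the Rk column name comes from the real table) where the header repeats as a data row:

```r
# Toy stand-in for the scraped table: row 3 is a repeated header row.
cu_info <- data.frame(
  Rk     = c("1", "2", "Rk", "3"),
  Season = c("2020-21", "2019-20", "Season", "2018-19"),
  W      = c("22", "24", "W", "20"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose Rk value consists entirely of digits.
result <- cu_info[grepl("^[0-9]+$", cu_info$Rk), ]
rownames(result) <- NULL

# With the header rows gone, the numeric columns can be converted.
result$Rk <- as.integer(result$Rk)
result$W  <- as.integer(result$W)
result
```

Because the repeated headers never contain digits in Rk, this drops them while leaving the real column names in place.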
I am trying to scrape some sports data from this website (https://en.khl.ru/stat/players/1097/skaters/) using rvest. There are no pages to filter through, but there is a 'Show All' icon to show all the data on the page.
I have been trying to use a css selector to extract the table. Unfortunately, no rows are produced but the column names of the table are present.
I suspect the problem lies in the website's interactive features with the table.
Yes, this page is dynamically generated, thus troublesome for rvest to handle. But the key to scrape this page is to realize the data is stored as JSON in a script element on the page.
The code below reads the page and extracts the script nodes. I reviewed the script nodes to find the correct one, then extracted the JSON data with some trial and error. Finally, I cleaned up the player and team name columns for the final answer.
library(rvest)
library(dplyr)
library(stringr)
url <- "https://en.khl.ru/stat/players/1097/skaters/"
page <- read_html(url)
#the data for the page is stored in a script element
scripts <-page %>% html_elements("script")
#get column names
headers <- page %>% html_elements("thead th") %>% html_text()
#examined the nodes and manually determined the 31st node was it
tail(scripts, 18)
data <- scripts[31] %>% html_text()
#examined the data string and notice the start of the JSON was '[ ['
#end of the JSON was ']]'
jsonstring <- str_extract(data, "\\[ \\[.+\\]\\]")
#convert the JSON into data frame
answer <- jsonlite::fromJSON(jsonstring) %>% as.data.frame
#rename column titles
names(answer) <- headers
#function to clean up html code in columns
cleanhtml <- function(text) {
out<-text %>% read_html() %>% html_text()
}
#remove the html information in columns 1 &3
answer <- answer[ , -32] %>% rowwise() %>%
mutate(Player = cleanhtml(Player), Team=cleanhtml(Team))
answer
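To see the extraction step in isolation, here is a small offline sketch. The script text, the variable name rows and the column names are all made up for illustration; only the regular expression and the use of jsonlite::fromJSON follow the code above. It also swaps in a simple gsub() to strip tags, which is cruder than re-parsing with read_html() but enough for simple markup:

```r
library(jsonlite)

# Hypothetical script content: a JSON array of arrays embedded in JavaScript.
script_text <- 'var rows = [ ["<a href=/p/1>Ann</a>", "10"], ["<b>Bob</b>", "7"]]; draw(rows);'

# Pull out everything from the opening '[ [' to the last ']]'.
json_string <- regmatches(script_text, regexpr("\\[ \\[.+\\]\\]", script_text))

# An array of equal-length arrays is parsed into a character matrix.
players <- as.data.frame(fromJSON(json_string), stringsAsFactors = FALSE)
names(players) <- c("Player", "Goals")

# Strip the HTML tags and convert the numeric column.
players$Player <- gsub("<[^>]+>", "", players$Player)
players$Goals  <- as.integer(players$Goals)
players
```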
I am new using R and this is my first question. I apologize if it has been solved before but I haven't found a solution.
Using the code below, which I found here, I can get data for a specific subsector from the Finviz screener:
library (rvest)
url <- read_html("https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry")
tables <- html_nodes(url,"table")
screener <- tables %>% html_nodes("table") %>% .[11] %>%
html_table(fill=TRUE) %>% data.frame()
head(screener)
It was a bit difficult to find the table number, but I did. My question is about lists with more than 20 rows, like the one in my example. They use &r=1, &r=21, &r=41, &r=61 at the end of each URL.
How could I create the loop structure in this case?
i=0
for(z in ...){
Many thanks in advance for your help.
Updated script based on the new table number and link:
library (rvest)
library(stringr)
url <- "https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry"
TableList<-c("1","21","41","61") # table list
GetData<-function(URL,tableNo){
cat('\n',"Running For table",tableNo,'\n', 'Weblink Used:',stringr::str_c(URL,"&r=",tableNo),'\n') # use the URL argument, not the global url
tables<-read_html(stringr::str_c(URL,"&r=",tableNo)) #get data from webpage based on table numbers
screener <- tables %>%
html_nodes("table") %>%
.[17] %>%
html_table(fill=TRUE) %>%
data.frame()
return(screener)
}
AllData<- lapply(TableList, function(x) GetData(URL=url, tableNo = x)) # getting all data in form of list
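If the number of pages changes, the offsets do not have to be typed by hand. The sketch below builds the &r= URLs with seq() and stacks the per-page results into one data frame. It runs offline: fetch_page is a hypothetical stand-in for the real scraping function, and the upper offset of 61 is just the value from the question.

```r
base_url <- "https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry"

# Offsets 1, 21, 41, 61: one page of 20 rows each.
offsets   <- seq(from = 1, to = 61, by = 20)
page_urls <- paste0(base_url, "&r=", offsets)

# Hypothetical placeholder: a real version would call read_html() on
# page_url and return the screener table as a data frame.
fetch_page <- function(page_url) {
  data.frame(url = page_url, stringsAsFactors = FALSE)
}

# One data frame per page, then bind them into a single table.
pages    <- lapply(page_urls, fetch_page)
all_rows <- do.call(rbind, pages)
```

The same do.call(rbind, ...) step should work on a list like AllData, as long as every page returns the same columns.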
Here is one approach using stringr and lapply:
library (rvest)
library(stringr)
url <- "https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry" # base url
TableList<-c("1","21","41","61") # table number list
GetData<-function(URL,tableNo){
cat('\n',"Running For table",tableNo,'\n', 'Weblink Used:',stringr::str_c(url,"&",tableNo),'\n')
tables<-read_html(stringr::str_c(url,"&",tableNo)) #get data from webpage based on table numbers
screener <- tables %>%
html_nodes("table") %>%
.[11] %>% # check
html_table(fill=TRUE) %>%
data.frame()
return(screener)
}
AllData<- lapply(TableList, function(x) GetData(URL=url, tableNo = x)) # list of dataframes
However, please check the .[11] index, as it will change for these URLs (the URLs with &1, &21, etc.). It works for the base URL, but for the URLs with &1, &21, etc. the data is not at the 11th index. Please change it accordingly.
I have an HMMSCAN result file of protein domains with 10 columns. Please see the link for the CSV file.
https://docs.google.com/spreadsheets/d/10d_YQwD41uj0q5pKinIo7wElhDj3BqilwWxThfIg75s/edit?usp=sharing
But I want it to look like this:
1BVN:P|PDBID|CHAIN|SEQUENCE Alpha-amylase Alpha-amylase_C A_amylase_inhib
3EF3:A|PDBID|CHAIN|SEQUENCE Cutinase
3IP8:A|PDBID|CHAIN|SEQUENCE Amdase
4Q1U:A|PDBID|CHAIN|SEQUENCE Arylesterase
4ROT:A|PDBID|CHAIN|SEQUENCE Esterase
5XJH:A|PDBID|CHAIN|SEQUENCE DLH
6QG9:A|PDBID|CHAIN|SEQUENCE Tannase
The repeated entries of column 3 should be grouped, and their corresponding values from column 1, which are in different rows, should be arranged in separate columns.
This is what I have written so far:
df <- read.csv ("hydrolase_sorted.txt" , header = FALSE, sep ="\t")
new <- df %>% select (V1,V3) %>% group_by(V3) %>% spread(V1, V3)
I hope I am clear with the problem statement. Thanks in advance!!
Your input data set has two irregular rows. The approach in your solution is right, but one more step is required:
library(dplyr)
df %>% select(V3,V1) %>% group_by(V3) %>% mutate(x = paste(V1,collapse=" ")) %>% select(V3,x)
What we did here is simply concatenate the strings by V3. Before running the code above, you should preprocess the data and fix some improper rows manually: the rows for TIM, Dannase, and DLH. To do that, you can use the Text to Columns feature in Excel.
The required steps are shown below. Problematic columns are highlighted in yellow:
Sorry for the non-English interface of my Excel, but the steps are self-explanatory.
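For reference, the grouping step can also be done in base R so that each ID appears only once. This is a minimal sketch on a made-up three-row sample, assuming (as in the question) that V1 holds the domain name and V3 holds the sequence ID:

```r
# Toy sample: two domains for the first ID, one for the second.
df <- data.frame(
  V1 = c("Alpha-amylase", "Alpha-amylase_C", "Cutinase"),
  V3 = c("1BVN:P|PDBID|CHAIN|SEQUENCE",
         "1BVN:P|PDBID|CHAIN|SEQUENCE",
         "3EF3:A|PDBID|CHAIN|SEQUENCE"),
  stringsAsFactors = FALSE
)

# Collapse all V1 values that share a V3 into one space-separated string.
grouped <- aggregate(V1 ~ V3, data = df, FUN = paste, collapse = " ")
grouped
```

Unlike mutate(), which keeps one row per input row, aggregate() returns a single row per ID, so no de-duplication is needed afterwards.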
I'm doing the problem 'Selecting Columns' in the lesson 'Introduction to data frames in R' on Codecademy.
It asks "Select the group column of artists using select() and save the result to artist_groups. View artist_groups."
I know how to select the column, it just doesn't tell me how to save it.
artists %>%
select(group)
I'm guessing I use artist_groups <- in some way but I can't get it to work
Put it before your code. Here is an example with the dataset 'trees':
column <- trees %>%
select(Volume)
With your data, it would be:
artist_groups <- artists %>%
select(group)
View(artist_groups) # And view the result
Try this
install.packages("dplyr")
library(dplyr)
artist_groups <- artists %>%
select(group)
I have a question regarding this data extraction I did. I would like to create a bar chart with the data, but unfortunately I am unable to convert the extracted characters to numbers inside R. If I edit the file in a text editor, there's no problem at all, but I'd like to do the whole process in R. Here is the code:
install.packages("rvest")
library(rvest)
url <- "https://en.wikipedia.org/wiki/Corporate_tax"
corporatetax <- url %>%
read_html() %>%
html_nodes(xpath='//*[@id="mw-content-text"]/div/table[5]') %>%
html_table()
str(corporatetax)
As a result, corporatetax contains a data.frame with 3 variables, all of them characters. My question, which I've not been able to resolve, is how to convert the second and third columns to numbers so I can create a bar chart. I've tried sapply() and dplyr but did not find a correct way to do that.
Thanks!
You might try to clean up the table like this
library(rvest)
library(stringr)
library(dplyr)
url <- "https://en.wikipedia.org/wiki/Corporate_tax"
corporatetax <- url %>%
read_html() %>%
# your xpath defines the single table, so you can use html_node() instead of html_nodes()
html_node(xpath='//*[@id="mw-content-text"]/div/table[5]') %>%
html_table() %>% as_tibble() %>%
setNames(c("country", "corporate_tax", "combined_tax"))
corporatetax %>%
mutate(corporate_tax=as.numeric(str_replace(corporate_tax, "%", ""))/100,
combined_tax=as.numeric(str_replace(combined_tax, "%", ""))/100
)
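The conversion itself does not depend on rvest, so it can be checked on a small example first. Below is a base R sketch with invented countries and rates (not the Wikipedia figures), assuming the percentages arrive as strings like "25%":

```r
# Toy stand-in for the scraped table; values are made up for illustration.
tax <- data.frame(
  country       = c("A", "B", "C"),
  corporate_tax = c("25%", "12.5%", "30%"),
  stringsAsFactors = FALSE
)

# Drop the percent sign, convert to numeric, and scale to a fraction.
tax$corporate_tax <- as.numeric(sub("%", "", tax$corporate_tax, fixed = TRUE)) / 100

# The column is now numeric, so it can be plotted directly.
barplot(tax$corporate_tax, names.arg = tax$country, ylab = "Corporate tax rate")
```

If some cells in the real table hold ranges or footnote markers instead of a single percentage, as.numeric() will return NA for them with a warning, so those rows would need extra cleaning.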