R stops scraping when there is missing data

I am using this code to loop through multiple URLs to scrape data. The code works fine until it reaches a date with missing data, at which point it stops with this error:
Error in data.frame(away, home, away1H, home1H, awayPinnacle, homePinnacle) :
arguments imply differing number of rows: 7, 8
I am very new to coding and could not figure out how to make it keep scraping despite the missing data.
library(rvest)
library(dplyr)

get_data <- function(date) {
  # Specifying URL
  url <- paste0('https://classic.sportsbookreview.com/betting-odds/nba-basketball/money-line/1st-half/?date=', date)
  # Reading the HTML code from the website
  oddspage <- read_html(url)
  # Using CSS selectors to scrape away teams
  awayHtml <- html_nodes(oddspage, '.eventLine-value:nth-child(1) a')
  # Using CSS selectors to scrape 1Q scores
  away1QHtml <- html_nodes(oddspage, '.current-score+ .first')
  away1Q <- html_text(away1QHtml)
  away1Q <- as.numeric(away1Q)
  home1QHtml <- html_nodes(oddspage, '.score-periods+ .score-periods .current-score+ .period')
  home1Q <- html_text(home1QHtml)
  home1Q <- as.numeric(home1Q)
  # Using CSS selectors to scrape 2Q scores
  away2QHtml <- html_nodes(oddspage, '.first:nth-child(3)')
  away2Q <- html_text(away2QHtml)
  away2Q <- as.numeric(away2Q)
  home2QHtml <- html_nodes(oddspage, '.score-periods+ .score-periods .period:nth-child(3)')
  home2Q <- html_text(home2QHtml)
  home2Q <- as.numeric(home2Q)
  # Creating first-half scores
  away1H <- away1Q + away2Q
  home1H <- home1Q + home2Q
  # Using CSS selectors to scrape final scores
  awayScoreHtml <- html_nodes(oddspage, '.first.total')
  awayScore <- html_text(awayScoreHtml)
  awayScore <- as.numeric(awayScore)
  homeScoreHtml <- html_nodes(oddspage, '.score-periods+ .score-periods .total')
  homeScore <- html_text(homeScoreHtml)
  homeScore <- as.numeric(homeScore)
  # Converting away data to text
  away <- html_text(awayHtml)
  # Using CSS selectors to scrape home teams
  homeHtml <- html_nodes(oddspage, '.eventLine-value+ .eventLine-value a')
  # Converting home data to text
  home <- html_text(homeHtml)
  # Using CSS selectors to scrape Pinnacle away odds
  awayPinnacleHtml <- html_nodes(oddspage, '.eventLine-consensus+ .eventLine-book .eventLine-book-value:nth-child(1) b')
  # Converting away odds to text
  awayPinnacle <- html_text(awayPinnacleHtml)
  # Converting away odds to numeric
  awayPinnacle <- as.numeric(awayPinnacle)
  # Using CSS selectors to scrape Pinnacle home odds
  homePinnacleHtml <- html_nodes(oddspage, '.eventLine-consensus+ .eventLine-book .eventLine-book-value+ .eventLine-book-value b')
  # Converting home odds to text
  homePinnacle <- html_text(homePinnacleHtml)
  # Converting home odds to numeric
  homePinnacle <- as.numeric(homePinnacle)
  # Create data frame
  df <- data.frame(away, home, away1H, home1H, awayPinnacle, homePinnacle)
}
date_vec <- sprintf('201902%02d', 02:06)
all_data <- do.call(rbind, lapply(date_vec, get_data))
View(all_data)

I'd recommend purrr::map_dfr() instead of lapply() with do.call(rbind, ...). Then you can wrap your call to get_data() with possibly(), which is a nice way to catch errors and keep going.
library(purrr)
map_dfr(date_vec, possibly(get_data, otherwise = data.frame()))
Output:
away home away1H home1H awayPinnacle homePinnacle
1 L.A. Clippers Detroit 47 65 116 -131
2 Milwaukee Washington 73 50 -181 159
3 Chicago Charlotte 60 51 192 -220
4 Brooklyn Orlando 48 44 121 -137
5 Indiana Miami 53 54 117 -133
6 Dallas Cleveland 58 55 -159 140
7 L.A. Lakers Golden State 58 63 513 -651
8 New Orleans San Antonio 50 63 298 -352
9 Denver Minnesota 61 64 107 -121
10 Houston Utah 63 50 186 -213
11 Atlanta Phoenix 58 57 110 -125
12 Philadelphia Sacramento 52 62 -139 123
13 Memphis New York 42 41 -129 114
14 Oklahoma City Boston 58 66 137 -156
15 L.A. Clippers Toronto 51 65 228 -263
16 Atlanta Washington 61 57 172 -196
17 Denver Detroit 55 68 -112 -101
18 Milwaukee Brooklyn 51 42 -211 184
19 Indiana New Orleans 53 50 -143 127
20 Houston Phoenix 63 57 -256 222
21 San Antonio Sacramento 59 63 -124 110
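If you would rather stay in base R, roughly the same safety net can be built with tryCatch(): the sketch below wraps get_data() so a failing date returns NULL, and rbind() simply drops the NULL entries.
safe_get_data <- function(date) {
  # Return NULL instead of stopping when a date has missing data
  tryCatch(get_data(date), error = function(e) NULL)
}
all_data <- do.call(rbind, lapply(date_vec, safe_get_data))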

Related

Using str_split to fill rows down data frame with number ranges and multiple numbers

I have a dataframe with crop names and their respective FAO codes. Unfortunately, some crop categories, such as 'other cereals', have multiple FAO codes, ranges of FAO codes or even worse - multiple ranges of FAO codes.
Snippet of the dataframe with the different formats for FAO codes.
> FAOCODE_crops
SPAM_full_name FAOCODE
1 wheat 15
2 rice 27
8 other cereals 68,71,75,89,92,94,97,101,103,108
27 other oil crops 260:310,312:339
31 other fibre crops 773:821
Using the following code successfully breaks down these numbers,
unlist(lapply(unlist(strsplit(FAOCODE_crops$FAOCODE, ",")), function(x) eval(parse(text = x))))
[1] 15 27 56 44 79 79 83 68 71 75 89 92 94 97 101 103 108
... but I fail to merge these numbers back into the dataframe, where every FAOCODE gets its own row.
> FAOCODE_crops$FAOCODE <- unlist(lapply(unlist(strsplit(MAPSPAM_crops$FAOCODE, ",")), function(x) eval(parse(text = x))))
Error in `$<-.data.frame`(`*tmp*`, FAOCODE, value = c(15, 27, 56, 44, :
replacement has 571 rows, data has 42
I fully understand why it doesn't merge successfully, but I can't figure out a way to fill the table with a new row for each FAOCODE as idealized below:
SPAM_full_name FAOCODE
1 wheat 15
2 rice 27
8 other cereals 68
8 other cereals 71
8 other cereals 75
8 other cereals 89
And so on...
Any help is greatly appreciated!
We can use separate_rows to split on the commas. After that, we can loop through FAOCODE with map and ~ eval(parse(text = .x)) to evaluate the number ranges. Finally, we can use unnest to expand the data frame.
library(tidyverse)

dat2 <- dat %>%
  separate_rows(FAOCODE, sep = ",") %>%
  mutate(FAOCODE = map(FAOCODE, ~ eval(parse(text = .x)))) %>%
  unnest(cols = FAOCODE)

dat2
# # A tibble: 140 x 2
# SPAM_full_name FAOCODE
# <chr> <dbl>
# 1 wheat 15
# 2 rice 27
# 3 other cereals 68
# 4 other cereals 71
# 5 other cereals 75
# 6 other cereals 89
# 7 other cereals 92
# 8 other cereals 94
# 9 other cereals 97
# 10 other cereals 101
# # ... with 130 more rows
DATA
dat <- read.table(text = " SPAM_full_name FAOCODE
1 wheat 15
2 rice 27
8 'other cereals' '68,71,75,89,92,94,97,101,103,108'
27 'other oil crops' '260:310,312:339'
31 'other fibre crops' '773:821'",
header = TRUE, stringsAsFactors = FALSE)
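If you would rather not run eval(parse()) on the codes, an eval()-free variant (a sketch, still using only tidyverse functions) splits each range into its endpoints and rebuilds the sequences with seq():
dat2 <- dat %>%
  separate_rows(FAOCODE, sep = ",") %>%
  # split "260:310" into its endpoints; single codes get NA in 'to'
  separate(FAOCODE, into = c("from", "to"), sep = ":", fill = "right", convert = TRUE) %>%
  mutate(to      = coalesce(to, from),
         FAOCODE = map2(from, to, seq)) %>%
  unnest(cols = FAOCODE) %>%
  select(-from, -to)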

Create a Column that Counts (one at a time) Repeated Occurrences Within Groups? (in R)

mydata <- read.table(header=TRUE, text="
Away Home Game.ID Points.A Points.H Series.ID Series.Wins
Denver Utah aaa123 121 123 aaabbb Utah
Denver Utah aaa124 132 116 aaabbb Denver
Utah Denver aaa125 117 121 aaabbb Denver
Utah Denver aaa126 112 120 aaabbb Denver
Denver Utah aaa127 115 122 aaabbb Utah
Atlanta Boston aab123 112 114 aaaccc Boston
")
I am trying to create an additional column that counts, one at a time, the Series.Wins within each Series.ID group. So, from the data above, that column would look like:
new.column <- c(1, 1, 2, 3, 2, 1)
The ultimate goal is to come up with a series record column with "home wins - away wins":
Record <- c("1-0", "1-1", "2-1", "3-1", "2-3", "1-0")
This seemed to work:
mydata <- mydata %>%
  group_by(Series.Wins) %>%
  group_by(Series.ID, add = TRUE) %>%
  mutate(id = seq_len(n()))
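To get all the way to the "home wins - away wins" record, one option (a sketch, not part of the answer above) is to count, within each Series.ID, how many games the current row's home team and the current row's away team have won so far, matching the perspective used in the expected output:
library(dplyr)

mydata %>%
  group_by(Series.ID) %>%
  mutate(
    # wins so far by this row's home team and away team, including this game
    home_wins = sapply(seq_len(n()), function(i) sum(Series.Wins[1:i] == Home[i])),
    away_wins = sapply(seq_len(n()), function(i) sum(Series.Wins[1:i] == Away[i])),
    Record    = paste(home_wins, away_wins, sep = "-")
  ) %>%
  ungroup()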

Performing a row by row chisq test on a data frame and capturing the result as a tibble

I have a data frame similar to this:
df1 <- data.frame(c(31, 3447, 12, 1966, 39, 3275),
                  c(20, 3460, 10, 1968, 30, 3284),
                  c(334, 3146, 212, 1766, 338, 2976),
                  c(36, 3442, 35, 1943, 47, 3267),
                  c(81, 3399, 71, 1907, 112, 3202),
                  c(22, 3458, 22, 1956, 42, 3272))
colnames(df1) <- c("Site1.C1", "Site1.C2", "Site2.C1", "Site2.C2", "Site3.C1", "Site3.C2")
df1
Site1.C1 Site1.C2 Site2.C1 Site2.C2 Site3.C1 Site3.C2
1 31 20 334 36 81 22
2 3447 3460 3146 3442 3399 3458
3 12 10 212 35 71 22
4 1966 1968 1766 1943 1907 1956
5 39 30 338 47 112 42
6 3275 3284 2976 3267 3202 3272
I am converting each row into a table and then performing a chisq test.
In order get specific values from the chisq result (p value, parameter, statistic, expected, etc), I'm having to repeat chisq test several times over (in a very ugly and cumbersome way), using the following code:
df2 <- df1 %>%
  rowwise() %>%
  mutate(P            = chisq.test(rbind(c(Site1.C1, Site1.C2), c(Site2.C1, Site2.C2), c(Site3.C1, Site3.C2)))$p.value,
         df           = chisq.test(rbind(c(Site1.C1, Site1.C2), c(Site2.C1, Site2.C2), c(Site3.C1, Site3.C2)))$parameter,
         Site1.c1.exp = chisq.test(rbind(c(Site1.C1, Site1.C2), c(Site2.C1, Site2.C2), c(Site3.C1, Site3.C2)))$expected[1, 1],
         Site1.c2.exp = chisq.test(rbind(c(Site1.C1, Site1.C2), c(Site2.C1, Site2.C2), c(Site3.C1, Site3.C2)))$expected[1, 2],
         Site2.c1.exp = chisq.test(rbind(c(Site1.C1, Site1.C2), c(Site2.C1, Site2.C2), c(Site3.C1, Site3.C2)))$expected[2, 1],
         Site2.c2.exp = chisq.test(rbind(c(Site1.C1, Site1.C2), c(Site2.C1, Site2.C2), c(Site3.C1, Site3.C2)))$expected[2, 2],
         Site3.c1.exp = chisq.test(rbind(c(Site1.C1, Site1.C2), c(Site2.C1, Site2.C2), c(Site3.C1, Site3.C2)))$expected[3, 1],
         Site3.c2.exp = chisq.test(rbind(c(Site1.C1, Site1.C2), c(Site2.C1, Site2.C2), c(Site3.C1, Site3.C2)))$expected[3, 2])
as.data.frame(df2)
Site1.C1 Site1.C2 Site2.C1 Site2.C2 Site3.C1 Site3.C2 P df Site1.c1.exp Site1.c2.exp Site2.c1.exp Site2.c2.exp Site3.c1.exp Site3.c2.exp
1 31 20 334 36 81 22 2.513166e-08 2 43.40840 7.591603 314.9237 55.07634 87.66794 15.33206
2 3447 3460 3146 3442 3399 3458 2.760225e-02 2 3391.05464 3515.945362 3234.4387 3353.56132 3366.50668 3490.49332
3 12 10 212 35 71 22 4.743725e-04 2 17.92818 4.071823 201.2845 45.71547 75.78729 17.21271
4 1966 1968 1766 1943 1907 1956 1.026376e-01 2 1928.02242 2005.977577 1817.7517 1891.24831 1893.22588 1969.77412
5 39 30 338 47 112 42 2.632225e-10 2 55.49507 13.504934 309.6464 75.35362 123.85855 30.14145
6 3275 3284 2976 3267 3202 3272 2.686389e-02 2 3216.55048 3342.449523 3061.5833 3181.41674 3174.86626 3299.13374
Is there a more elegant way to do chisq test just once and capture the result as a tibble in the same row and then extract values on a need-to basis into additional columns?
My data frame has over a million of rows and some additional variables not used with the Chisq test.
Thank you.
With input from @akrun, I was able to get the desired result using the following code:
df2 <- df1 %>%
  rowwise() %>%
  mutate(result = list(chisq.test(rbind(c(Site1.C1, Site1.C2),
                                        c(Site2.C1, Site2.C2),
                                        c(Site3.C1, Site3.C2)))))
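The individual pieces can then be pulled back out of that list column on a need-to basis, for example with purrr (a sketch using the columns above; df3 is just an illustrative name):
library(purrr)

df3 <- df2 %>%
  ungroup() %>%
  mutate(
    P            = map_dbl(result, "p.value"),
    df           = map_dbl(result, "parameter"),
    Site1.c1.exp = map_dbl(result, ~ .x$expected[1, 1]),
    Site1.c2.exp = map_dbl(result, ~ .x$expected[1, 2])
  )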

R tidyr::spread duplicate error

I have the following data:
ID AGE SEX RACE COUNTRY VISITNUM VSDTC VSTESTCD VSORRES
32320058 58 M WHITE UKRAINE 2 2016-04-28 DIABP 74
32320058 58 M WHITE UKRAINE 1 2016-04-21 HEIGHT 183
32320058 58 M WHITE UKRAINE 1 2016-04-21 SYSBP 116
32320058 58 M WHITE UKRAINE 2 2016-04-28 SYSBP 116
32320058 58 M WHITE UKRAINE 1 2016-04-21 WEIGHT 109
22080090 75 M WHITE MEXICO 1 2016-05-17 DIABP 81
22080090 75 M WHITE MEXICO 1 2016-05-17 HEIGHT 176
22080090 75 M WHITE MEXICO 1 2016-05-17 SYSBP 151
I would like to reshape the data using tidyr::spread to get the following output:
ID AGE SEX RACE COUNTRY VISITNUM VSDTC DIABP SYSBP WEIGHT HEIGHT
32320058 58 M WHITE UKRAINE 2 2016-04-28 74 116 NA NA
32320058 58 M WHITE UKRAINE 1 2016-04-21 NA 116 109 183
22080090 75 M WHITE MEXICO 1 2016-05-17 81 151 NA 176
I receive duplicate errors, although I don't have duplicates in my data!
df1=spread(df,VSTESTCD,VSORRES)
Error: Duplicate identifiers for rows (36282, 36283), (59176, 59177), (59179, 59180)
Assuming I understand your question correctly:
# As many rows are identical, we should create a unique identifier column.
# Let's take the iris dataset as an example.
# Install the caret package if you don't have it
install.packages("caret")
# Load the libraries
library(tidyverse)
library(caret)
# Check the dataset (iris)
head(iris)
# Assume that I gather all columns in the iris dataset except the Species variable.
# Create a unique identifier column and transform wide data to long data as follows:
iris_gather <- iris %>%
  dplyr::mutate(ID = row_number(Species)) %>%
  tidyr::gather(key = Type, value = my_value, 1:4)
# Check the first six rows
head(iris_gather)
# Use *spread* to spread out the data
iris_spread <- iris_gather %>%
  dplyr::group_by(ID) %>%
  tidyr::spread(key = Type, value = my_value) %>%
  dplyr::ungroup() %>%
  dplyr::select(-ID)
# Check the first six rows of iris_spread
head(iris_spread)
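Applied to the original data, the same idea might look like this (a sketch; assuming the data frame is called df): number the rows within each ID/VISITNUM/VSDTC/VSTESTCD combination so that hidden duplicates become distinguishable, spread, then drop the helper column.
library(dplyr)
library(tidyr)

df1 <- df %>%
  group_by(ID, VISITNUM, VSDTC, VSTESTCD) %>%
  mutate(n = row_number()) %>%   # unique identifier within each combination
  ungroup() %>%
  spread(VSTESTCD, VSORRES) %>%
  select(-n)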

Reshaping a data frame --- changing rows to columns

Suppose that we have a data frame that looks like
set.seed(7302012)
county <- rep(letters[1:4], each=2)
state <- rep(LETTERS[1], times=8)
industry <- rep(c("construction", "manufacturing"), 4)
employment <- round(rnorm(8, 100, 50), 0)
establishments <- round(rnorm(8, 20, 5), 0)
data <- data.frame(state, county, industry, employment, establishments)
state county industry employment establishments
1 A a construction 146 19
2 A a manufacturing 110 20
3 A b construction 121 10
4 A b manufacturing 90 27
5 A c construction 197 18
6 A c manufacturing 73 29
7 A d construction 98 30
8 A d manufacturing 102 19
We'd like to reshape this so that each row represents a (state and) county, rather than a county-industry, with columns construction.employment, construction.establishments, and analogous versions for manufacturing. What is an efficient way to do this?
One way is to subset
construction <- data[data$industry == "construction", ]
names(construction)[4:5] <- c("construction.employment", "construction.establishments")
And similarly for manufacturing, then do a merge. This isn't so bad if there are only two industries, but imagine that there are 14; this process would become tedious (though made less so by using a for loop over the levels of industry).
Any other ideas?
This can be done with base R's reshape(), if I understand your question correctly:
reshape(data, direction="wide", idvar=c("state", "county"), timevar="industry")
# state county employment.construction establishments.construction
# 1 A a 146 19
# 3 A b 121 10
# 5 A c 197 18
# 7 A d 98 30
# employment.manufacturing establishments.manufacturing
# 1 110 20
# 3 90 27
# 5 73 29
# 7 102 19
Also using the reshape package:
library(reshape)
m <- reshape::melt(data)
cast(m, state + county~...)
Yielding:
> cast(m, state + county~...)
state county construction_employment construction_establishments manufacturing_employment manufacturing_establishments
1 A a 146 19 110 20
2 A b 121 10 90 27
3 A c 197 18 73 29
4 A d 98 30 102 19
I personally use base reshape, so I probably should have shown this with reshape2 (Wickham's package) but forgot there was a reshape2 package. The reshape2 version is slightly different:
library(reshape2)
m <- reshape2::melt(data)
dcast(m, state + county~...)
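For reference, the current tidyr equivalent (not shown in the answers above) does the same reshape with pivot_wider():
library(tidyr)

wide <- pivot_wider(
  data,
  id_cols     = c(state, county),
  names_from  = industry,
  values_from = c(employment, establishments)
)
This produces one row per state/county with columns such as employment_construction and establishments_manufacturing.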
