With the small reproducible example below, I'd like to identify a dplyr approach that arrives at the data.frame shown at the end of this note. The dplyr output should ensure that the data.frame is sorted by date (note that the dates 1999-04-13 and 1999-03-12 are out of order) and should then accumulate, within each wy grouping (wy = "water year"; Oct 1-Sep 30), the number of days on which Q is above a threshold of 3.0.
dat <- read.table(text="
Date wy Q
1997-01-01 1997 9.82
1997-02-01 1997 3.51
1997-02-02 1997 9.35
1997-10-04 1998 0.93
1997-11-01 1998 1.66
1997-12-02 1998 0.81
1998-04-03 1998 5.65
1998-05-05 1998 7.82
1998-07-05 1998 6.33
1998-09-06 1998 0.55
1998-09-07 1998 4.54
1998-10-09 1999 6.50
1998-12-31 1999 2.17
1999-01-01 1999 5.67
1999-04-13 1999 5.66
1999-03-12 1999 4.67
1999-06-05 1999 3.34
1999-09-30 1999 1.99
1999-11-06 2000 5.75
2000-03-04 2000 6.28
2000-06-07 2000 0.81
2000-07-06 2000 9.66
2000-09-09 2000 9.08
2000-09-21 2000 6.72", header=TRUE)
dat$Date <- as.Date(dat$Date)
mdat <- dat %>%
  group_by(wy) %>%
  filter(Q > 3) %>%
  ?
Desired results:
Date wy Q abvThreshCum
1997-01-01 1997 9.82 1
1997-02-01 1997 3.51 2
1997-02-02 1997 9.35 3
1997-10-04 1998 0.93 0
1997-11-01 1998 1.66 0
1997-12-02 1998 0.81 0
1998-04-03 1998 5.65 1
1998-05-05 1998 7.82 2
1998-07-05 1998 6.33 3
1998-09-06 1998 0.55 3
1998-09-07 1998 4.54 4
1998-10-09 1999 6.50 1
1998-12-31 1999 2.17 1
1999-01-01 1999 5.67 2
1999-03-12 1999 4.67 3
1999-04-13 1999 5.66 4
1999-06-05 1999 3.34 5
1999-09-30 1999 1.99 5
1999-11-06 2000 5.75 1
2000-03-04 2000 6.28 2
2000-06-07 2000 0.81 2
2000-07-06 2000 9.66 3
2000-09-09 2000 9.08 4
2000-09-21 2000 6.72 5
library(dplyr)
dat %>%
  arrange(Date) %>%
  group_by(wy) %>%
  mutate(abv = cumsum(Q > 3)) %>%
  ungroup()
# # A tibble: 24 x 4
# Date wy Q abv
# <date> <int> <dbl> <int>
# 1 1997-01-01 1997 9.82 1
# 2 1997-02-01 1997 3.51 2
# 3 1997-02-02 1997 9.35 3
# 4 1997-10-04 1998 0.93 0
# 5 1997-11-01 1998 1.66 0
# 6 1997-12-02 1998 0.81 0
# 7 1998-04-03 1998 5.65 1
# 8 1998-05-05 1998 7.82 2
# 9 1998-07-05 1998 6.33 3
# 10 1998-09-06 1998 0.55 3
# # ... with 14 more rows
**data.table approach**
library(data.table)
setDT(dat, key = "Date")[, abvThreshCum := cumsum(Q > 3), by = .(wy)]
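Both answers hinge on the same idiom: Q > 3 yields a logical vector, and cumsum() treats TRUE as 1 and FALSE as 0, so the running sum counts qualifying days while keeping every row (a plain filter(Q > 3), as in the attempt above, would drop the sub-threshold days entirely). A minimal illustration:
Q <- c(9.82, 3.51, 0.93, 5.65)
Q > 3           # TRUE TRUE FALSE TRUE
cumsum(Q > 3)   # 1 2 2 3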
I have two large dataframes: one is called Dates_only and the other Values.
**Dates_only:**
ID Quart_y Quart
1 1118 2017Q3 0.25
2 1118 2017Q4 0.50
3 1118 2018Q1 0.75
4 1118 2018Q2 1.00
5 1118 2018Q3 1.25
6 1118 2018Q4 1.50
7 1118 2019Q1 1.75
8 1118 2019Q2 2.00
9 1119 2017Q3 0.25
10 1119 2017Q4 0.50
11 1119 2018Q1 0.75
12 1119 2018Q2 1.00
13 1119 2018Q3 1.25
14 1119 2018Q4 1.50
15 1119 2019Q1 1.75
16 1119 2019Q2 2.00
17 13PP 2017Q3 0.25
18 13PP 2017Q4 0.50
19 13PP 2018Q1 0.75
20 13PP 2018Q2 1.00
21 13PP 2018Q3 1.25
22 13PP 2018Q4 1.50
23 13PP 2019Q1 1.75
24 13PP 2019Q2 2.00
And the second dataset:
**Values**
ID Day Value
1 1118 0 7.6
2 1119 0 6.2
3 13PP 0 6.8
4 1118 0.14 7.1
5 1119 0.13 6.2
6 13PP 0.13 5.9
7 1118 0.20 6.8
8 1119 0.23 5.8
9 13PP 0.24 4.6
10 1118 0.27 6.5
11 1119 0.28 5.4
12 13PP 0.32 4.2
13 1118 0.32 6.3
14 1119 0.32 4.8
15 13PP 0.44 4.0
16 1118 0.47 6.0
17 1119 0.49 4.3
18 13PP 0.49 3.8
19 1118 0.59 5.9
20 1119 0.64 4.0
21 13PP 0.61 3.6
22 1118 0.72 5.6
23 1119 0.71 3.8
24 13PP 0.73 3.4
25 1118 0.95 5.4
26 1119 0.86 3.2
27 13PP 0.78 3.0
28 1118 1.10 5.0
29 1119 0.93 2.9
30 13PP 1.15 2.9
What I want to do is create a fourth column in Dates_only, called Value_average, which will contain average scores extracted from the Values$Value column of the Values dataframe.
Specifically, as you can observe in Dates_only, Quart_y represents the quarter/year, and Quart quantifies this with a number from 0.25 to 2.00.
The pattern goes like this: Q3 - x.25, Q4 - x.50, Q1 - x.75, Q2 - x.00.
In the second dataframe, Values, we have scores that represent days of the year. The idea is that days with 0 <= Day <= 0.25 belong to 2017Q3, days with 0.25 < Day <= 0.50 belong to 2017Q4, days with 1.00 < Day <= 1.25 belong to 2018Q3, and so on.
For each ID in the Dates_only dataframe, I want the average of the Values$Value numbers that fall in the corresponding time frame:
For ID=1118 and 2017Q3, the Values$Day elements in 0 <= Day <= 0.25 are (0, 0.14, 0.20) and the corresponding Values$Value entries are (7.6, 7.1, 6.8), so Dates_only$Value_average will be 7.16. The next row will average values for days 0.25 < Day <= 0.50, and so on.
**Dates_only:**
ID Quart_y Quart Value_average
1 1118 2017Q3 0.25 7.16
2 1118 2017Q4 0.50 6.27
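For reference, those first two averages can be checked by hand against the Values rows for ID 1118:
mean(c(7.6, 7.1, 6.8))   # days 0, 0.14, 0.20 -> 7.166667
mean(c(6.5, 6.3, 6.0))   # days 0.27, 0.32, 0.47 -> 6.266667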
The code that I have used is:
Dates_only$Value_average <- 0
for (i in 1:length(Dates_only$ID)){
  id <- as.character(Dates_only$ID[i])
  quart <- as.numeric(Dates_only$Quart[i])
  quart_prev <- quart - 0.25
  count_d <- 0
  sum_val <- 0
  for (k in 1:length(Values$ID)){
    if (id == as.character(Values$ID[k])
        && quart >= as.numeric(Values$Day[k])
        && as.numeric(Values$Day[k]) > quart_prev){
      sum_val <- as.numeric(Values$Value[k]) + sum_val
      count_d <- count_d + 1
    }
  }
  av_value <- sum_val/count_d
  Dates_only$Value_average[i] <- av_value
}
Is there a more efficient way to do this on very large datasets (over 300K observations)? I am pretty sure there is, but my novice R skills are not helping much.
To replicate the two dataframes:
Dates_only <- data.frame(ID=c('1118','1118','1118','1118','1118',
'1118','1118','1118','1119','1119',
'1119','1119','1119','1119','1119',
'1119','13PP','13PP','13PP','13PP',
'13PP','13PP','13PP','13PP'),
Quart_y=c('2017Q3','2017Q4','2018Q1','2018Q2',
'2018Q3','2018Q4','2019Q1','2019Q2',
'2017Q3','2017Q4','2018Q1','2018Q2',
'2018Q3','2018Q4','2019Q1','2019Q2',
'2017Q3','2017Q4','2018Q1','2018Q2',
'2018Q3','2018Q4','2019Q1','2019Q2'),
Quart=c(0.25,0.50,0.75,1.00,1.25,1.50,1.75,2.00,
0.25,0.50,0.75,1.00,1.25,1.50,1.75,2.00,
0.25,0.50,0.75,1.00,1.25,1.50,1.75,2.00))
Values <- data.frame(ID=c('1118','1119','13PP','1118','1119','13PP',
'1118','1119','13PP','1118','1119','13PP',
'1118','1119','13PP','1118','1119','13PP',
'1118','1119','13PP','1118','1119','13PP',
'1118','1119','13PP','1118','1119','13PP'),
Day=c(0,0,0,0.14,0.13,0.13,0.2,0.23,0.24,0.27,0.28,
0.32,0.32,0.32,0.44,0.47,0.49,0.49,0.59,0.64,
0.61,0.72,0.71,0.73,0.95,0.86,0.78,1.1,0.93,1.15),
Value=c(7.6,6.2,6.8,7.1,6.2,5.9,6.8,5.8,4.6,6.5,5.4,
4.2,6.3,4.8,4,6,4.3,3.8,5.9,4,3.6,5.6,3.8,
3.4,5.4,3.2,3,5,2.9,2.9))
We can accomplish almost all of this using the dplyr package:
library(dplyr)
Values %>%
  mutate(Day = ifelse(Day == 0, 0.01, Day)) %>%
  mutate(Quart = ceiling(Day / 0.25) * 0.25) %>%
  full_join(., Dates_only, by = c("ID", "Quart")) %>%
  group_by(ID, Quart, Quart_y) %>%
  summarise(Value_average = mean(Value, na.rm = TRUE))
Which gives you:
ID Quart Quart_y Value_average
<fctr> <dbl> <fctr> <dbl>
1 1118 0.25 2017Q3 7.166667
2 1118 0.50 2017Q4 6.266667
3 1118 0.75 2018Q1 5.750000
4 1118 1.00 2018Q2 5.400000
5 1118 1.25 2018Q3 5.000000
6 1118 1.50 2018Q4 NaN
7 1118 1.75 2019Q1 NaN
8 1118 2.00 2019Q2 NaN
9 1119 0.25 2017Q3 6.066667
10 1119 0.50 2017Q4 4.833333
# ... with 14 more rows
See below for a line-by-line breakdown of the code, in case anything is unclear:
# Start with your `Values` data frame
Values %>%
  # Recode `Day` values of 0, as they would otherwise be excluded by
  # the rule 2017Q3: 0 < Day <= 0.25
  # I picked 0.01 arbitrarily to fit this rule
  mutate(Day = ifelse(Day == 0, 0.01, Day)) %>%
  # Now round all `Day` values up to the nearest 0.25
  mutate(Quart = ceiling(Day / 0.25) * 0.25) %>%
  # Now join the two data frames using a `full_join`
  # A left_join may also be used if you are uninterested in NA's
  full_join(., Dates_only, by = c("ID", "Quart")) %>%
  # Finally, designate groupings to calculate the mean values
  # for each ID for each quarter
  group_by(ID, Quart, Quart_y) %>%
  summarise(Value_average = mean(Value, na.rm = TRUE))
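As an aside, if recoding Day == 0 to an arbitrary 0.01 feels fragile, one alternative (just a sketch on the same data, not tested beyond this example) is to clamp the bin index so that zero lands in the first quarter:
library(dplyr)
Values %>%
  # pmax(..., 1) forces Day == 0 (which ceiling() would put in bin 0) into the first bin
  mutate(Quart = pmax(ceiling(Day / 0.25), 1) * 0.25) %>%
  full_join(Dates_only, by = c("ID", "Quart")) %>%
  group_by(ID, Quart, Quart_y) %>%
  summarise(Value_average = mean(Value, na.rm = TRUE))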
I am trying to scrape http://www.basketball-reference.com/teams/CHI/2015.html using rvest. I used SelectorGadget and found the selector for the table I want to be #advanced. However, rvest wasn't picking it up. Looking at the page source, I noticed that the tables are inside HTML comment tags (<!-- ... -->).
What is the best way to get the tables from inside the comment tags? Thanks!
Edit: I am trying to pull the 'Advanced' table: http://www.basketball-reference.com/teams/CHI/2015.html#advanced::none
You can use the XPath comment() function to select comment nodes, then reparse their contents as HTML:
library(rvest)
# scrape page
h <- read_html('http://www.basketball-reference.com/teams/CHI/2015.html')
df <- h %>% html_nodes(xpath = '//comment()') %>% # select comment nodes
html_text() %>% # extract comment text
paste(collapse = '') %>% # collapse to a single string
read_html() %>% # reparse to HTML
html_node('table#advanced') %>% # select the desired table
html_table() %>% # parse table
.[colSums(is.na(.)) < nrow(.)] # get rid of spacer columns
df[, 1:15]
## Rk Player Age G MP PER TS% 3PAr FTr ORB% DRB% TRB% AST% STL% BLK%
## 1 1 Pau Gasol 34 78 2681 22.7 0.550 0.023 0.317 9.2 27.6 18.6 14.4 0.5 4.0
## 2 2 Jimmy Butler 25 65 2513 21.3 0.583 0.212 0.508 5.1 11.2 8.2 14.4 2.3 1.0
## 3 3 Joakim Noah 29 67 2049 15.3 0.482 0.005 0.407 11.9 22.1 17.1 23.0 1.2 2.6
## 4 4 Aaron Brooks 30 82 1885 14.4 0.534 0.383 0.213 1.9 7.5 4.8 24.2 1.5 0.6
## 5 5 Mike Dunleavy 34 63 1838 11.6 0.573 0.547 0.181 1.7 12.7 7.3 9.7 1.1 0.8
## 6 6 Taj Gibson 29 62 1692 16.1 0.545 0.000 0.364 10.7 14.6 12.7 6.9 1.1 3.2
## 7 7 Nikola Mirotic 23 82 1654 17.9 0.556 0.502 0.455 4.3 21.8 13.3 9.7 1.7 2.4
## 8 8 Kirk Hinrich 34 66 1610 6.8 0.468 0.441 0.131 1.4 6.6 4.1 13.8 1.5 0.6
## 9 9 Derrick Rose 26 51 1530 15.9 0.493 0.325 0.224 2.6 8.7 5.7 30.7 1.2 0.8
## 10 10 Tony Snell 23 72 1412 10.2 0.550 0.531 0.148 2.5 10.9 6.8 6.8 1.2 0.6
## 11 11 E'Twaun Moore 25 56 504 10.3 0.504 0.273 0.144 2.7 7.1 5.0 10.4 2.1 0.9
## 12 12 Doug McDermott 23 36 321 6.1 0.480 0.383 0.140 2.1 12.2 7.3 3.0 0.6 0.2
## 13 13 Nazr Mohammed 37 23 128 8.7 0.431 0.000 0.100 9.6 22.3 16.1 3.6 1.6 2.8
## 14 14 Cameron Bairstow 24 18 64 2.1 0.309 0.000 0.357 10.5 3.3 6.8 2.2 1.6 1.1
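If you want every commented-out table rather than just table#advanced, the same trick generalizes; a sketch, with fill = TRUE to guard against ragged rows:
library(rvest)
h <- read_html('http://www.basketball-reference.com/teams/CHI/2015.html')
all_tables <- h %>%
  html_nodes(xpath = '//comment()') %>%  # all comment nodes
  html_text() %>%
  paste(collapse = '') %>%
  read_html() %>%
  html_nodes('table') %>%                # every table hidden in the comments
  html_table(fill = TRUE)                # returns a list of data frames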
OK, got it.
library(stringi)
library(knitr)
library(rvest)
any_version_html <- function(x){
  XML::htmlParse(x)  # parse HTML via the XML package, whatever the rvest version
}
a <- 'http://www.basketball-reference.com/teams/CHI/2015.html#advanced::none'
b <- readLines(a)                      # raw page source, comments included
c <- paste0(b, collapse = "")          # collapse to a single string
d <- as.character(unlist(stri_extract_all_regex(c, '<table(.*?)/table>', omit_no_match = T, simplify = T)))  # pull out every <table>...</table>
e <- html_table(any_version_html(d))   # parse all extracted tables
> kable(summary(e),'rst')
====== ========== ====
Length Class Mode
====== ========== ====
9 data.frame list
2 data.frame list
24 data.frame list
21 data.frame list
28 data.frame list
28 data.frame list
27 data.frame list
30 data.frame list
27 data.frame list
27 data.frame list
28 data.frame list
28 data.frame list
27 data.frame list
30 data.frame list
27 data.frame list
27 data.frame list
3 data.frame list
====== ========== ====
kable(e[[1]],'rst')
=== ================ === ==== === ================== === === =================================
No. Player Pos Ht Wt Birth Date  Exp College
=== ================ === ==== === ================== === === =================================
41 Cameron Bairstow PF 6-9 250 December 7, 1990 au R University of New Mexico
0 Aaron Brooks PG 6-0 161 January 14, 1985 us 6 University of Oregon
21 Jimmy Butler SG 6-7 220 September 14, 1989 us 3 Marquette University
34 Mike Dunleavy SF 6-9 230 September 15, 1980 us 12 Duke University
16 Pau Gasol PF 7-0 250 July 6, 1980 es 13
22 Taj Gibson PF 6-9 225 June 24, 1985 us 5 University of Southern California
12 Kirk Hinrich SG 6-4 190 January 2, 1981 us 11 University of Kansas
3 Doug McDermott SF 6-8 225 January 3, 1992 us R Creighton University
## Realized we should index with some names... but this is somewhat cheating, since we know the start and end indexes of the table titles. I prefer to parse in the dark.
# Names are in h2-tags
e_names <- as.character(unlist(stri_extract_all_regex(c, '<h2(.*?)/h2>', simplify = T)))
e_names <- gsub("<(.*?)>","",e_names[grep('Roster',e_names):grep('Salaries',e_names)])
names(e) <- e_names
kable(head(e$Salaries), 'rst')
=== ============== ===========
Rk Player Salary
=== ============== ===========
1 Derrick Rose $18,862,875
2 Carlos Boozer $13,550,000
3 Joakim Noah $12,200,000
4 Taj Gibson $8,000,000
5 Pau Gasol $7,128,000
6 Nikola Mirotic $5,305,000
=== ============== ===========
Hi,
I have a 10-year, 5-minute resolution data set of dust concentration, and, separately, a 15-year data set with daily resolution of the synoptic classification.
How can I combine these two datasets? They are not the same length or resolution.
Here is a sample of the data:
> head(synoptic)
date synoptic
1 01/01/1995 8
2 02/01/1995 7
3 03/01/1995 7
4 04/01/1995 20
5 05/01/1995 1
6 06/01/1995 1
> head(beit.shemesh)
X........................ StWd SHT PRE GSR RH Temp WD WS PM10 CO O3
1 NA 64 19.8 0 -2.9 37 15.2 61 2.2 241 0.9 40.6
2 NA 37 20.1 0 1.1 38 15.2 344 2.1 241 0.9 40.3
3 NA 36 20.2 0 0.7 39 15.1 32 1.9 241 0.9 39.4
4 NA 52 20.1 0 0.9 40 14.9 20 2.1 241 0.9 38.7
5 NA 42 19.0 0 0.9 40 14.6 11 2.0 241 0.9 38.7
6 NA 75 19.9 0 0.2 40 14.5 341 1.3 241 0.9 39.1
No2 Nox No SO2 date
1 1.4 2.9 1.5 1.6 31/12/2000 24:00
2 1.7 3.1 1.4 0.9 01/01/2001 00:05
3 2.1 3.5 1.4 1.2 01/01/2001 00:10
4 2.7 4.2 1.5 1.3 01/01/2001 00:15
5 2.3 3.8 1.5 1.4 01/01/2001 00:20
6 2.8 4.3 1.5 1.3 01/01/2001 00:25
Any ideas?
Make an extra column holding the date, and then merge. To do this, you have to generate a variable bearing the same name in each dataframe, hence you first need some renaming. Also make sure that the merge column has the same type in both dataframes:
beit.shemesh$datetime <- beit.shemesh$date
beit.shemesh$date <- as.Date(beit.shemesh$datetime, format = "%d/%m/%Y")
synoptic$date <- as.Date(synoptic$date, format = "%d/%m/%Y")
merge(synoptic, beit.shemesh, by = "date", all.y = TRUE)
Using all.y=TRUE keeps the beit.shemesh dataset intact. If you also want empty rows for all non-matching rows in synoptic, you could use all=TRUE instead.
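If you would rather end up with one row per day (instead of the daily synoptic value being repeated on every 5-minute row), you could aggregate first and then merge. A sketch, assuming a daily mean of the PM10 column is what you are after:
# collapse the 5-minute data to daily means, then attach the synoptic class
daily <- aggregate(PM10 ~ date, data = beit.shemesh, FUN = mean)
merge(synoptic, daily, by = "date", all = TRUE)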