I am having trouble converting daily data into weekly using averages over the week.
My data looks like this:
> str(daily_FWIH)
'data.frame': 4371 obs. of 6 variables:
$ Date : Date, format: "2013-03-01" "2013-03-02" "2013-03-04" "2013-03-05" ...
$ CST.OUC : Factor w/ 6 levels "BVG11","BVG12",..: 1 1 1 1 1 1 1 1 1 1 ...
$ CST.NAME : Factor w/ 6 levels "Central Scotland",..: 2 2 2 2 2 2 2 2 2 2 ...
$ SOM_patch: Factor w/ 6 levels "BVG11_Highlands & Islands",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Row_Desc : Factor w/ 1 level "FSFluidWIH": 1 1 1 1 1 1 1 1 1 1 ...
$ Value : num 1.16 1.99 1.47 1.15 1.16 1.28 1.27 2.07 1.26 1.19 ...
> head(daily_FWIH)
Date CST.OUC CST.NAME SOM_patch Row_Desc Value
1 2013-03-01 BVG11 Highlands & Islands BVG11_Highlands & Islands FSFluidWIH 1.16
2 2013-03-02 BVG11 Highlands & Islands BVG11_Highlands & Islands FSFluidWIH 1.99
3 2013-03-04 BVG11 Highlands & Islands BVG11_Highlands & Islands FSFluidWIH 1.47
4 2013-03-05 BVG11 Highlands & Islands BVG11_Highlands & Islands FSFluidWIH 1.15
5 2013-03-06 BVG11 Highlands & Islands BVG11_Highlands & Islands FSFluidWIH 1.16
6 2013-03-07 BVG11 Highlands & Islands BVG11_Highlands & Islands FSFluidWIH 1.28
This is what I have tried, converting the data to an xts object as shown here:
daily_FWIH$Date = as.Date(as.character(daily_FWIH$Date), "%d/%m/%Y")
library(xts)
temp.x = xts(daily_FWIH[-1], order.by=daily_FWIH$Date)
apply.weekly(temp.x, colMeans(temp.x$Value))
I have two problems: my week starts and ends on a "Saturday", and I get the following error:
> apply.weekly(temp.x, colMeans(temp.x$Value))
Error in colMeans(temp.x$Value) : 'x' must be numeric
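(Aside: this error seems to arise for two reasons: apply.weekly() expects a function as its second argument, not an evaluated call, and building the xts from the whole data frame coerces every column, including Value, to character because of the factor columns. A minimal sketch that avoids both, though it still uses xts's default week boundaries rather than Saturday-starting weeks:
library(xts)
# build the xts from the numeric column only, so the core data stays numeric
temp.x <- xts(daily_FWIH$Value, order.by = daily_FWIH$Date)
apply.weekly(temp.x, mean)  # pass the function itself, not an evaluated call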
UPDATE Based on Sam's comments:
This is what I did:
daily_FWIH$Date <- ymd(daily_FWIH$Date) # convert to Date
daily_FWIH$fakeDate <- daily_FWIH$Date + days(2)
daily_FWIH$week <- week(daily_FWIH$fakeDate) # extract week value
daily_FWIH$year <- year(daily_FWIH$fakeDate)
> daily_FWIH %>%
+ group_by(year,week) %>%
+ mutate(weeklyAvg = mean(Value), weekStartsOn = min(Date)) %>% # create the average variable
+ slice(which(Date == weekStartsOn)) %>% # select just the first record of the week - other vars will come from this
+ select(-Value,-fakeDate,-week,-year,-Date, -CST.OUC,-CST.NAME) # drop unneeded variables
Source: local data frame [631 x 6]
Groups: year, week
year week SOM_patch Row_Desc weeklyAvg weekStartsOn
1 2013 9 BVG11_Highlands & Islands FSFluidWIH 1.048333 2013-03-01
2 2013 9 BVG12_North East Scotland FSFluidWIH 1.048333 2013-03-01
3 2013 9 BVG13_Central Scotland FSFluidWIH 1.048333 2013-03-01
4 2013 9 BVG14_South East Scotland FSFluidWIH 1.048333 2013-03-01
5 2013 9 BVG15_West Central Scotland FSFluidWIH 1.048333 2013-03-01
6 2013 9 BVG16_South West Scotland FSFluidWIH 1.048333 2013-03-01
7 2013 10 BVG11_Highlands & Islands FSFluidWIH 1.520500 2013-03-02
8 2013 10 BVG12_North East Scotland FSFluidWIH 1.520500 2013-03-02
9 2013 10 BVG13_Central Scotland FSFluidWIH 1.520500 2013-03-02
10 2013 10 BVG14_South East Scotland FSFluidWIH 1.520500 2013-03-02
.. ... ... ... ... ... ...
Which is incorrect: every SOM_patch shows the same weeklyAvg, because SOM_patch is missing from the group_by(), so the mean is taken across all patches within each week.
The desired output is:
> head(desired)
Date BVG11.Highlands_I_.A_pct BVG12.North.East.ScotlandA_pct BVG13.Central.ScotlandA_pct
1 01/03/2013 1.16 1.13 1.08
2 08/03/2013 1.41 2.37 1.80
3 15/03/2013 1.33 3.31 1.34
4 22/03/2013 1.39 2.49 1.62
5 29/03/2013 5.06 3.42 1.42
6 NA NA NA
BVG14.South.East.ScotlandA_pct BVG15.West.Central.ScotlandA_pct BVG16.South.West.ScotlandA_pct
1 1.05 0.98 0.89
2 1.51 1.21 1.07
3 1.13 2.13 2.01
4 2.14 1.24 1.37
5 1.62 1.46 1.95
6 NA NA NA
> str(desired)
'data.frame': 11 obs. of 7 variables:
$ Date : Factor w/ 6 levels "01/03/2013",..: 2 3 4 5 6 1 1 1 1 1 ...
$ BVG11.Highlands_I_.A_pct : num 1.16 1.41 1.33 1.39 5.06 ...
$ BVG12.North.East.ScotlandA_pct : num 1.13 2.37 3.31 2.49 3.42 ...
$ BVG13.Central.ScotlandA_pct : num 1.08 1.8 1.34 1.62 1.42 ...
$ BVG14.South.East.ScotlandA_pct : num 1.05 1.51 1.13 2.14 1.62 ...
$ BVG15.West.Central.ScotlandA_pct: num 0.98 1.21 2.13 1.24 1.46 ...
$ BVG16.South.West.ScotlandA_pct : num 0.89 1.07 2.01 1.37 1.95 ...
Find the first Saturday in your data, then assign a week ID to all dates in your data set based on that :
library(lubridate) # for the wday() and ymd() functions
daily_FWIH$Date <- ymd(daily_FWIH$Date)
saturdays <- daily_FWIH[wday(daily_FWIH$Date) == 7, ] # filter for Saturdays
startDate <- min(saturdays$Date) # select first Saturday
daily_FWIH$week <- floor(as.numeric(difftime(daily_FWIH$Date, startDate, units = "weeks")))
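Dates falling before the first Saturday get negative week numbers (hence week -1 in the example output further down); they still form their own, possibly partial, week and group correctly.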
Once you have a weekID-starting-on-Saturday variable, this is a standard R problem. You can calculate the weekly averages using your method of choice for calculating means within a subgroup. I like dplyr:
library(dplyr)
daily_FWIH %>%
group_by(week, SOM_patch) %>% # use your grouping variables in addition to week
summarise(weeklyAvg = mean(Value), weekBeginDate = min(Date)) %>%
mutate(firstDayOfWeek = wday(weekBeginDate, label=TRUE)) # confirm correct week cuts
Source: local data frame [2 x 5]
Groups: week
week SOM_patch weeklyAvg weekBeginDate firstDayOfWeek
1 -1 BVG11_Highlands & Islands 1.16 2013-03-01 Fri
2 0 BVG11_Highlands & Islands 1.41 2013-03-02 Sat
Update based on comments below:
If you want to see the other values in your dataset, you'll need to decide how to select or calculate weekly values when daily values within a week conflict. In your sample data, they are the same in all rows, so I'm just drawing them from the row containing the first day of the week.
library(dplyr)
daily_FWIH %>%
group_by(week, SOM_patch) %>% # use your grouping variables
mutate(weeklyAvg = mean(Value), weekBeginDate = min(Date)) %>%
slice(which(Date == weekBeginDate)) %>% # select just the first record of the week - other vars will come from this
select(-Value, -Date) # drop unneeded variables
Source: local data frame [2 x 7]
Groups: week, SOM_patch
CST.OUC CST.NAME SOM_patch Row_Desc week weeklyAvg weekBeginDate
1 BVG11 Highlands & Islands BVG11_Highlands & Islands FSFluidWIH -1 1.16 2013-03-01
2 BVG11 Highlands & Islands BVG11_Highlands & Islands FSFluidWIH 0 1.41 2013-03-02
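If you ultimately need the wide layout of desired (one row per week, one column per SOM_patch), the weekly summary can be reshaped. A sketch assuming the tidyr package is available (pivot_wider() needs tidyr >= 1.0.0; older versions would use spread()):
library(dplyr)
library(tidyr)
daily_FWIH %>%
  group_by(week, SOM_patch) %>%
  summarise(weeklyAvg = mean(Value), weekBeginDate = min(Date)) %>%
  ungroup() %>%
  # one column per SOM_patch, one row per week-beginning date
  pivot_wider(id_cols = weekBeginDate, names_from = SOM_patch, values_from = weeklyAvg)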
sales <- read.table(header = TRUE, text="Year Name Sales
1980 Atari 4.00
1980 Activision 1.07
1981 Activision 4.21
1981 ParkerBros. 2.06
1981 Imagic 1.99
1981 Atari 1.84
1981 Coleco 1.36
1981 Mystique 0.76
1981 Fox 0.74
1981 Men 0.72")
I was able to get the sums using aggregate.
I want to divide each row's Sales by that year's total sales to get a percentage, but I don't know how to divide each row by its respective year's total.
DF <- aggregate(Sales ~ Year + Name, data = sales, FUN = sum)
DFC48 <- aggregate(Sales ~ Year, data = DF, FUN = sum)
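One way to finish the aggregate() approach, sketched here (the totals and YearTotal names are just illustrative), is to merge the yearly totals back onto the data and divide:
totals <- aggregate(Sales ~ Year, data = sales, FUN = sum)  # one row per year
names(totals)[2] <- "YearTotal"
merged <- merge(sales, totals, by = "Year")                 # attach each year's total
merged$Pct <- merged$Sales / merged$YearTotal * 100         # share of that year's sales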
Base R:
We could use ave() to apply the function x/sum(x) within each Year group, where x is sales$Sales:
sales$su <- ave(sales$Sales, sales$Year, FUN = function(x) x/sum(x))
Year Name Sales su
1 1980 Atari 4.00 0.78895464
2 1980 Activision 1.07 0.21104536
3 1981 Activision 4.21 0.30774854
4 1981 ParkerBros. 2.06 0.15058480
5 1981 Imagic 1.99 0.14546784
6 1981 Atari 1.84 0.13450292
7 1981 Coleco 1.36 0.09941520
8 1981 Mystique 0.76 0.05555556
9 1981 Fox 0.74 0.05409357
10 1981 Men 0.72 0.05263158
You could try the code below:
library(dplyr)
sales %>% group_by(Year) %>% mutate(su = Sales / sum(Sales))
Created on 2023-02-17 with reprex v2.0.2
# A tibble: 10 × 4
# Groups: Year [2]
Year Name Sales su
<dbl> <chr> <dbl> <dbl>
1 1980 Atari 4 0.789
2 1980 Activision 1.07 0.211
3 1981 Activision 4.21 0.308
4 1981 Parker Bros. 2.06 0.151
5 1981 Imagic 1.99 0.145
6 1981 Atari 1.84 0.135
7 1981 Coleco 1.36 0.0994
8 1981 Mystique 0.76 0.0556
9 1981 Fox 0.74 0.0541
10 1981 Men 0.72 0.0526
Another option using prop.table like this:
library(dplyr)
sales %>%
group_by(Year) %>%
mutate(su = prop.table(Sales))
#> # A tibble: 10 × 4
#> # Groups: Year [2]
#> Year Name Sales su
#> <int> <chr> <dbl> <dbl>
#> 1 1980 Atari 4 0.789
#> 2 1980 Activision 1.07 0.211
#> 3 1981 Activision 4.21 0.308
#> 4 1981 ParkerBros. 2.06 0.151
#> 5 1981 Imagic 1.99 0.145
#> 6 1981 Atari 1.84 0.135
#> 7 1981 Coleco 1.36 0.0994
#> 8 1981 Mystique 0.76 0.0556
#> 9 1981 Fox 0.74 0.0541
#> 10 1981 Men 0.72 0.0526
Created on 2023-02-18 with reprex v2.0.2
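For a vector, prop.table(x) is simply x / sum(x); in recent versions of R (4.0.1+) the same function is also available under the clearer name proportions().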
I'm a beginner to R, but I have the below data frame (with more observations) in which each 'id' appears at most once in each of three years: 1991, 1999 and 2007.
I want to create a variable avg_ln_rd by 'id' that takes the average of the observation's 'ln_rd' and the 'ln_rd' from 1991 if the observation is from 1999, and the 'ln_rd' from 1999 if the observation is from 2007.
id year ln_rd
<dbl> <dbl> <dbl>
1 1013 1991 3.51
2 1013 1999 5.64
3 1013 2007 4.26
4 1021 1991 0.899
5 1021 1999 0.791
6 1021 2007 0.704
7 1034 1991 2.58
8 1034 1999 3.72
9 1034 2007 4.95
10 1037 1991 0.262
I have already dropped any 'id' that appears in only one of the three years.
My first thought was to create a standalone variable of ln_rd for each year, but then I would still need to filter by id, which I do not know how to do.
Then I tried using these standalone variables to form an if clause.
df$lagln_rd_99 <- ifelse(df$year == 1999, df$ln_rd_91, NA)
But again I do not know how to keep 'id' fixed.
Any help would be greatly appreciated.
EDIT:
I grouped by id using dplyr. Can I then just sort my df by id and create a new variable that is ln_rd but shifted by one row?
Still a bit unclear what to do if all three years are present in a group, but this might help (edited to show the desired output):
library(dplyr)
df %>%
group_by(id) %>%
arrange(id, year) %>%
mutate(avg91 = mean(c(ln_rd[year == 1991], ln_rd[year == 1999])),
avg99 = mean(c(ln_rd[year == 1999], ln_rd[year == 2007])),
avg91 = ifelse(year == 1991, avg91, NA),
avg99 = ifelse(year == 2007, avg99, NA)) %>%
ungroup()
# A tibble: 15 × 5
year id ln_rd avg91 avg99
<int> <int> <dbl> <dbl> <dbl>
1 1991 3505 3.38 3.09 NA
2 1999 3505 2.80 NA NA
3 1991 4584 1.45 1.34 NA
4 1999 4584 1.22 NA NA
5 1991 5709 1.90 2.13 NA
6 1999 5709 2.36 NA NA
7 2007 5709 3.11 NA 2.74
8 2007 9777 2.36 NA 2.36
9 1991 18729 4.82 5.07 NA
10 1999 18729 5.32 NA NA
11 2007 18729 5.53 NA 5.42
12 1991 20054 0.588 0.307 NA
13 1999 20054 0.0266 NA NA
14 1999 62169 1.91 NA NA
15 2007 62169 1.45 NA 1.68
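Alternatively, the row-shift idea from the EDIT above can work; a sketch assuming one row per (id, year) and that the average should land on the later year's row:
library(dplyr)
df %>%
  arrange(id, year) %>%
  group_by(id) %>%                                  # keep the shift within each id
  mutate(avg_ln_rd = (ln_rd + lag(ln_rd)) / 2) %>%  # NA on each id's first year
  ungroup()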
What I want is to create var3 using lag() (dplyr package), but it should be consistent with the year and the ID: the lagged value should come from the same ID. The dataset is like an unbalanced panel.
YEAR ID VARS
2010 1 -
2011 1 -
2012 1 -
2010 2 -
2011 2 -
2012 2 -
2010 3 -
...
My issue is similar to the following question/post, but grouping by two categories:
dplyr: lead() and lag() wrong when used with group_by()
I tried to extend the solution, unsuccessfully (I get NAs).
Attempt #1:
data %>%
  group_by(YEAR, ID) %>%
  summarise(var1 = ...,
            var2 = ...,
            var3 = var1 - dplyr::lag(var2))
Attempt #2:
data %>%
  group_by(YEAR, ID) %>%
  summarise(var1 = ...,
            var2 = ...,
            gr = sprintf(YEAR, ID),
            var3 = var1 - dplyr::lag(var2, order_by = gr))
Minimal example:
MyData <-
data.frame(YEAR = rep(seq(2010,2014),5),
ID = rep(1:5, each=5),
var1 = rnorm(n=25,mean=10,sd=3),
var2 = rnorm(n=25,mean=1,sd=1)
)
MyData %>%
group_by(YEAR,ID) %>%
summarise(var3 = var1 - dplyr::lag(var2)
)
Thanks in advance.
Do you mean group_by(ID) and effectively "order by YEAR"?
MyData %>%
group_by(ID) %>%
mutate(var3 = var1 - dplyr::lag(var2)) %>%
print(n=99)
# # A tibble: 25 x 5
# # Groups: ID [5]
# YEAR ID var1 var2 var3
# <int> <int> <dbl> <dbl> <dbl>
# 1 2010 1 11.1 1.16 NA
# 2 2011 1 13.5 -0.550 12.4
# 3 2012 1 10.2 2.11 10.7
# 4 2013 1 8.57 1.43 6.46
# 5 2014 1 12.6 1.89 11.2
# 6 2010 2 8.87 1.87 NA
# 7 2011 2 5.30 1.70 3.43
# 8 2012 2 6.81 0.956 5.11
# 9 2013 2 13.3 -0.0296 12.4
# 10 2014 2 9.98 -1.27 10.0
# 11 2010 3 8.62 0.258 NA
# 12 2011 3 12.4 2.00 12.2
# 13 2012 3 16.1 2.12 14.1
# 14 2013 3 8.48 2.83 6.37
# 15 2014 3 10.6 0.190 7.80
# 16 2010 4 12.3 0.887 NA
# 17 2011 4 10.9 1.07 10.0
# 18 2012 4 7.99 1.09 6.92
# 19 2013 4 10.1 1.95 9.03
# 20 2014 4 11.1 1.82 9.17
# 21 2010 5 15.1 1.67 NA
# 22 2011 5 10.4 0.492 8.76
# 23 2012 5 10.0 1.66 9.51
# 24 2013 5 10.6 0.567 8.91
# 25 2014 5 5.32 -0.881 4.76
(I've turned your summarise into a mutate for now, so every original row is kept.)
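One caveat: lag() works on row order, so if MyData isn't guaranteed to be sorted by YEAR within each ID, make the ordering explicit; a sketch:
MyData %>%
  arrange(ID, YEAR) %>%                 # explicit row order
  group_by(ID) %>%
  mutate(var3 = var1 - dplyr::lag(var2, order_by = YEAR)) %>%
  ungroup()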
I'm attempting to scrape the second table shown at the URL below, and I'm running into issues which may be related to the interactive nature of the table.
div_stats_standard appears to refer to the table of interest.
The code runs with no errors but returns an empty list.
library(rvest)
url <- 'https://fbref.com/en/comps/9/stats/Premier-League-Stats'
data <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[(@id = "div_stats_standard")]') %>%
  html_table()
Can anyone tell me where I'm going wrong?
Look for the table.
library(rvest)
url <- "https://fbref.com/en/comps/9/stats/Premier-League-Stats"
page <- read_html(url)
nodes <- html_nodes(page, "table") # you can use Selectorgadget to identify the node
table <- html_table(nodes[[1]]) # each element of the nodes list is one table that can be extracted
head(table)
Result:
head(table)
Playing Time Playing Time Playing Time Performance Performance
1 Squad # Pl MP Starts Min Gls Ast
2 Arsenal 26 27 297 2,430 39 26
3 Aston Villa 28 27 297 2,430 33 27
4 Bournemouth 25 28 308 2,520 27 17
5 Brighton 23 28 308 2,520 28 19
6 Burnley 21 28 308 2,520 32 23
Performance Performance Performance Performance Per 90 Minutes Per 90 Minutes
1 PK PKatt CrdY CrdR Gls Ast
2 2 2 64 3 1.44 0.96
3 1 3 54 1 1.22 1.00
4 1 1 60 3 0.96 0.61
5 1 1 44 2 1.00 0.68
6 2 2 53 0 1.14 0.82
Per 90 Minutes Per 90 Minutes Per 90 Minutes Expected Expected Expected Per 90 Minutes
1 G+A G-PK G+A-PK xG npxG xA xG
2 2.41 1.37 2.33 35.0 33.5 21.3 1.30
3 2.22 1.19 2.19 30.6 28.2 22.0 1.13
4 1.57 0.93 1.54 31.2 30.5 20.8 1.12
5 1.68 0.96 1.64 33.8 33.1 22.4 1.21
6 1.96 1.07 1.89 30.9 29.4 18.9 1.10
Per 90 Minutes Per 90 Minutes Per 90 Minutes Per 90 Minutes
1 xA xG+xA npxG npxG+xA
2 0.79 2.09 1.24 2.03
3 0.81 1.95 1.04 1.86
4 0.74 1.86 1.09 1.83
5 0.80 2.01 1.18 1.98
6 0.68 1.78 1.05 1.73
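A caveat on this approach: it returns the first (squad) table only. On fbref / sports-reference pages the later tables, including div_stats_standard, are typically wrapped in HTML comments, which would explain the empty list from the XPath query. A sketch of one workaround, with the usual caveat that the live page may change:
library(rvest)
page <- read_html("https://fbref.com/en/comps/9/stats/Premier-League-Stats")
# extract the comment nodes, re-parse their text as HTML, then pull the tables
comments <- html_nodes(page, xpath = "//comment()")
hidden <- read_html(paste(html_text(comments), collapse = ""))
tables <- html_table(html_nodes(hidden, "table"))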
Process_Table = Process_Table[order(-Process_Table$Process, -Process_Table$Freq),]
#output
Process Freq Percent
17 Other Airport Services 45 15.46
5 Check-in 35 12.03
23 Ticket sales and support channels 35 12.03
11 Flight and inflight 33 11.34
19 Pegasus Plus 23 7.90
24 Time Delays 16 5.50
7 Other 13 4.47
14 Other 13 4.47
22 Other 13 4.47
25 Other 13 4.47
16 Other 11 3.78
20 Other 6 2.06
26 Other 6 2.06
3 Other 5 1.72
13 Other 5 1.72
18 Other 5 1.72
21 Other 4 1.37
1 Other 2 0.69
2 Other 1 0.34
4 Other 1 0.34
6 Other 1 0.34
8 Other 1 0.34
9 Other 1 0.34
10 Other 1 0.34
12 Other 1 0.34
15 Other 1 0.34
As you can see, it gives separate frequencies for the same level, whereas if I print the levels of that feature I get the following:
levels(Process_Table$Process)
[1] "Check-in" "Flight and inflight"
[3] "Other" "Other Airport Services"
[5] "Pegasus Plus" "Ticket sales and support channels"
[7] "Time Delays"
What I want is the combined frequency of the "Other" category. Can anyone help me out with this?
Edit: this is the code that was used to derive the first set of output:
Process_Table$Percent = round(Process_Table$Freq/sum(Process_Table$Freq) * 100, 2)
Process_Table$Process = as.character(Process_Table$Process)
library(dplyr)
low_list = Process_Table %>%
  filter(Percent < 5.50) %>%
  select(Process)
Process_Table$Process = ifelse(Process_Table$Process %in% low_list$Process, 'Other', Process_Table$Process)
as.data.frame(Process_Table)
Process_Table$Process = as.factor(Process_Table$Process)
Your Process_Table should undergo one more aggregation step. Add the following after your final aggregation:
Process_Table <- Process_Table %>%
  group_by(Process) %>%
  summarize(Freq = sum(Freq), Percent = sum(Percent))
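With the counts shown above, the twenty "Other" rows collapse into a single row with Freq = 104 and Percent = 35.72, while each named level keeps its original totals.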