I have a data frame of NFL teams and game-level data about them. I want to add each team's points per game (PPG) going into that particular week.
I cannot simply summarize the data by team, because I need to keep the individual games as they are currently represented.
CurrYrfun <- function(Yr, Tm, Wk) {
  PPG <- Schedule_Results %>%
    filter(Year == Yr & Team == Tm & Week < Wk) %>%
    group_by(Team) %>%
    summarize(APG = mean(Pts))
  return(PPG[['APG']])
}
This function gives the correct result for individual records, but when I try to mutate a new column in the dataframe as below:
Schedule_Results <- Schedule_Results %>%
  mutate(PPG = CurrYrfun(Year, Team, Week))
I get an error saying PPG is of length 0.
Edited to include data and examples:
Schedule_Results <- structure(list(Year = c(2019L, 2019L, 2019L, 2019L, 2019L, 2019L,
2019L, 2019L, 2019L, 2019L, 2019L, 2019L, 2019L, 2019L, 2019L,
2019L, 2019L, 2019L, 2019L, 2019L, 2019L), Week = c(17, 17, 17,
16, 16, 16, 15, 15, 15, 14, 14, 14, 13, 13, 13, 12, 12, 12, 11,
11, 11), Team = c("Washington Redskins", "Cincinnati Bengals",
"Jacksonville Jaguars", "Jacksonville Jaguars", "Washington Redskins",
"Cincinnati Bengals", "Cincinnati Bengals", "Washington Redskins",
"Jacksonville Jaguars", "Washington Redskins", "Cincinnati Bengals",
"Jacksonville Jaguars", "Jacksonville Jaguars", "Washington Redskins",
"Cincinnati Bengals", "Cincinnati Bengals", "Jacksonville Jaguars",
"Washington Redskins", "Washington Redskins", "Jacksonville Jaguars",
"Cincinnati Bengals"), Opp = c("Dallas Cowboys", "Cleveland Browns",
"Indianapolis Colts", "Atlanta Falcons", "New York Giants", "Miami Dolphins",
"New England Patriots", "Philadelphia Eagles", "Oakland Raiders",
"Green Bay Packers", "Cleveland Browns", "Los Angeles Chargers",
"Tampa Bay Buccaneers", "Carolina Panthers", "New York Jets",
"Pittsburgh Steelers", "Tennessee Titans", "Detroit Lions", "New York Jets",
"Indianapolis Colts", "Oakland Raiders"), Pts = c(16, 33, 38,
12, 35, 35, 13, 27, 20, 15, 19, 10, 11, 29, 22, 10, 20, 19, 17,
13, 10), Opp_Pts = c(47, 23, 20, 24, 41, 38, 34, 37, 16, 20,
27, 45, 28, 21, 6, 16, 42, 16, 34, 33, 17), Yds = c(271, 361,
353, 288, 361, 430, 315, 352, 262, 262, 451, 252, 242, 362, 277,
244, 369, 230, 225, 308, 246), Opp_Yds = c(517, 313, 275, 518,
552, 502, 291, 415, 364, 341, 333, 525, 315, 278, 271, 338, 471,
364, 400, 389, 386), TO = c(2, 1, 1, 1, 0, 1, 5, 1, 0, 1, 1,
0, 4, 0, 0, 2, 1, 2, 1, 1, 2), Opp_TO = c(1, 3, 2, 2, 0, 1, 0,
1, 0, 1, 2, 0, 1, 2, 0, 1, 2, 4, 2, 2, 2), Home = c("1", "1",
"1", "1", "0", "1", "0", "0", "0", "1", "1", "0", "0", "0", "1",
"0", "1", "1", "0", "1", "1"), Playoffs = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), win = c("0", "1",
"1", "0", "0", "0", "0", "0", "1", "0", "0", "0", "0", "1", "1",
"0", "0", "1", "0", "0", "0")), row.names = c(NA, -21L), class = "data.frame")
CurrYrfun <- function(Yr, Tm, Wk) {
  PPG <- Schedule_Results %>%
    filter(Year == Yr & Team == Tm & Week < Wk) %>%
    group_by(Team) %>%
    summarize(APG = mean(Pts))
  return(PPG[['APG']])
}
CurrYrfun(2019,'Washington Redskins',13)
CurrYrfun(2019,'Jacksonville Jaguars',14)
CurrYrfun(2019,'Washington Redskins',16)
CurrYrfun(2019,'Cincinnati Bengals',15)
Schedule_Results <- Schedule_Results %>%
  mutate(PPG = CurrYrfun(Year, Team, Week))
My goal is to return the function's output for each row as a new column in the dataframe.
I'm pretty sure this is what you want. I spot-checked the first couple of examples you gave, and they look right.
Schedule_Results %>%
  group_by(Team, Year) %>%
  arrange(Week) %>%
  mutate(PPG = lag(cummean(Pts), 1))
# # A tibble: 21 x 14
# # Groups: Team, Year [3]
# Year Week Team Opp Pts Opp_Pts Yds Opp_Yds TO Opp_TO Home Playoffs win PPG
# <int> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <dbl>
# 1 2019 11 Washington Reds~ New York Jets 17 34 225 400 1 2 0 0 0 NA
# 2 2019 11 Jacksonville Ja~ Indianapolis Co~ 13 33 308 389 1 2 1 0 0 NA
# 3 2019 11 Cincinnati Beng~ Oakland Raiders 10 17 246 386 2 2 1 0 0 NA
# 4 2019 12 Cincinnati Beng~ Pittsburgh Stee~ 10 16 244 338 2 1 0 0 0 10
# 5 2019 12 Jacksonville Ja~ Tennessee Titans 20 42 369 471 1 2 1 0 0 13
# 6 2019 12 Washington Reds~ Detroit Lions 19 16 230 364 2 4 1 0 1 17
# 7 2019 13 Jacksonville Ja~ Tampa Bay Bucca~ 11 28 242 315 4 1 0 0 0 16.5
# 8 2019 13 Washington Reds~ Carolina Panthe~ 29 21 362 278 0 2 0 0 1 18
# 9 2019 13 Cincinnati Beng~ New York Jets 22 6 277 271 0 0 1 0 1 10
# 10 2019 14 Washington Reds~ Green Bay Packe~ 15 20 262 341 1 1 1 0 0 21.7
...
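For reference, the reason your original mutate() call errors is that CurrYrfun() is not vectorized: mutate() passes the entire Year, Team, and Week columns in a single call, the elementwise comparisons inside filter() end up comparing each row of Schedule_Results against itself, every row is dropped, and the function returns a length-0 result. If you wanted to keep the original function, it can be mapped row by row instead; a minimal sketch using purrr (returning NA where a team has no earlier games that season), though the grouped cummean() above is far more efficient:
library(dplyr)
library(purrr)
Schedule_Results %>%
  mutate(PPG = pmap_dbl(list(Year, Team, Week), \(y, t, w) {
    res <- CurrYrfun(y, t, w)
    if (length(res) == 0) NA_real_ else res  # no earlier games -> NA
  }))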
With the friendly help of a fellow SO member, I was able to create a crosstable that compares variable labels across dataframes. I have now realized that my data not only has different variable labels across the dataframes, but also that variable names can differ across dataframes even when the variable label is the same. Hence, I would like to change the code so that, instead of marking each cell with an X, it inserts the respective variable name.
library(haven)
library(labelled)
# for testing, add discordant labels for one variable
var_label(GEM2001$weight) <- "Weight provided by data vendor"
var_label(GEM2002$weight) <- "Weight according to vendor"
# for testing, adding a discordant variable name for one variable
names(GEM2001)[names(GEM2001) == "country"] <- "country_updtd"
df_list <- list(GEM2001 = GEM2001, GEM2002 = GEM2002)
df_labels <- lapply(df_list, \(df) unlist(var_label(df)))
all_labels <- unique(unlist(df_labels))
label_table <- data.frame(label = all_labels)
for (df in names(df_labels)) {
  label_table[[df]] <- ifelse(all_labels %in% df_labels[[df]], "X", "")
}
label_table
Minimum reproducible example of the data:
# First dataset:
GEM2001 <- structure(list(setid = structure(c(7700001, 7700002, 7700003,
7700004, 7700005, 7700006), label = "Harmonization ID", format.spss = "F12.0", display_width = 14L),
setid_ne = structure(c(1000000007700001, 1000000007700002,
1000000007700003, 1000000007700004, 1000000007700005, 1000000007700006
), label = "Alternative ID variable to avoid duplicates across years", format.spss = "F15.0", display_width = 17L),
yrsurv = structure(c(2001, 2001, 2001, 2001, 2001, 2001), label = "Year survey was administered", format.spss = "F4.0"),
country = structure(c(7, 7, 7, 7, 7, 7), label = "Country", format.spss = "F4.0", display_width = 9L, labels = c(`United States` = 1,
Russia = 7, Egypt = 20, `South Africa` = 27, Greece = 30,
Netherlands = 31, Belgium = 32, France = 33, Spain = 34,
Hungary = 36, Italy = 39, Romania = 40, Switzerland = 41,
Austria = 43, `United Kingdom` = 44, Denmark = 45, Sweden = 46,
Norway = 47, Poland = 48, Germany = 49, Peru = 51, Mexico = 52,
Argentina = 54, Brazil = 55, Chile = 56, Colombia = 57, Malaysia = 60,
Australia = 61, Indonesia = 62, Philippines = 63, `New Zealand` = 64,
Singapore = 65, Thailand = 66, Japan = 81, Korea = 82, Vietnam = 84,
China = 86, Turkey = 90, India = 91, Pakistan = 92, Iran = 98,
Canada = 101, Morocco = 212, Algeria = 213, Tunisia = 216,
Libya = 218, Ghana = 233, Nigeria = 234, Angola = 244, Barbados = 246,
Ethiopia = 251, Uganda = 256, Zambia = 260, Namibia = 264,
Malawi = 265, Botswana = 267, Portugal = 351, Luxembourg = 352,
Ireland = 353, Iceland = 354, Finland = 358, Lithuania = 370,
Latvia = 371, Estonia = 372, Serbia = 381, Montenegro = 382,
Croatia = 385, Slovenia = 386, `Bosnia and Herzegovina` = 387,
Macedonia = 389, `Czech Republic` = 420, Slovakia = 421,
Guatemala = 502, `El Salvador` = 503, `Costa Rica` = 506,
Panama = 507, Venezuela = 582, Bolivia = 591, Ecuador = 593,
Suriname = 597, Uruguay = 598, `* 'Azores'` = 620, Tonga = 676,
Vanuatu = 678, Kazakstan = 701, `Shenzhen*` = 755, `Puerto Rico` = 787,
`Dominican Republic` = 809, `Hong Kong` = 852, `Trinidad & Tobago` = 868,
Jamaica = 876, Bangladesh = 880, Taiwan = 886, Lebanon = 961,
Jordan = 962, Syria = 963, `Saudi Arabia` = 966, Yemen = 967,
`West Bank & Gaza Strip` = 970, `United Arab Emirates` = 971,
Israel = 972), class = c("haven_labelled", "vctrs_vctr",
"double")), weight = structure(c(0.947503767410607, 0.919003654090076,
0.924603676356567, 1.01710404415125, 0.716602849315504, 0.83510332049034
), label = "Weight provided by data vendor", format.spss = "F8.6", display_width = 10L)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"), label = "JN725 - IAE - GEM 2009")
# Second dataset
GEM2002 <- structure(list(setid = structure(c(1121800009, 1121800025, 1121800031,
1121800035, 1121800036, 1121800039), label = "Harmonization ID", format.spss = "F12.0", display_width = 14L),
setid_ne = structure(c(2000001121800009, 2000001121800025,
2000001121800031, 2000001121800035, 2000001121800036, 2000001121800039
), label = "Alternative ID variable to avoid duplicates across years", format.spss = "F15.0", display_width = 17L),
yrsurv = structure(c(2002, 2002, 2002, 2002, 2002, 2002), label = "Year survey was administered", format.spss = "F4.0"),
country = structure(c(1, 1, 1, 1, 1, 1), label = "Country", format.spss = "F4.0", display_width = 9L, labels = c(`United States` = 1,
Russia = 7, Egypt = 20, `South Africa` = 27, Greece = 30,
Netherlands = 31, Belgium = 32, France = 33, Spain = 34,
Hungary = 36, Italy = 39, Romania = 40, Switzerland = 41,
Austria = 43, `United Kingdom` = 44, Denmark = 45, Sweden = 46,
Norway = 47, Poland = 48, Germany = 49, Peru = 51, Mexico = 52,
Argentina = 54, Brazil = 55, Chile = 56, Colombia = 57, Malaysia = 60,
Australia = 61, Indonesia = 62, Philippines = 63, `New Zealand` = 64,
Singapore = 65, Thailand = 66, Japan = 81, Korea = 82, Vietnam = 84,
China = 86, Turkey = 90, India = 91, Pakistan = 92, Iran = 98,
Canada = 101, Morocco = 212, Algeria = 213, Tunisia = 216,
Libya = 218, Ghana = 233, Nigeria = 234, Angola = 244, Barbados = 246,
Ethiopia = 251, Uganda = 256, Zambia = 260, Namibia = 264,
Malawi = 265, Botswana = 267, Portugal = 351, Luxembourg = 352,
Ireland = 353, Iceland = 354, Finland = 358, Lithuania = 370,
Latvia = 371, Estonia = 372, Serbia = 381, Montenegro = 382,
Croatia = 385, Slovenia = 386, `Bosnia and Herzegovina` = 387,
Macedonia = 389, `Czech Republic` = 420, Slovakia = 421,
Guatemala = 502, `El Salvador` = 503, `Costa Rica` = 506,
Panama = 507, Venezuela = 582, Bolivia = 591, Ecuador = 593,
Suriname = 597, Uruguay = 598, `* 'Azores'` = 620, Tonga = 676,
Vanuatu = 678, Kazakstan = 701, `Shenzhen*` = 755, `Puerto Rico` = 787,
`Dominican Republic` = 809, `Hong Kong` = 852, `Trinidad & Tobago` = 868,
Jamaica = 876, Bangladesh = 880, Taiwan = 886, Lebanon = 961,
Jordan = 962, Syria = 963, `Saudi Arabia` = 966, Yemen = 967,
`West Bank & Gaza Strip` = 970, `United Arab Emirates` = 971,
Israel = 972), class = c("haven_labelled", "vctrs_vctr",
"double")), weight = structure(c(0.666652666946661, 1.35532689346212,
0.886868262634747, 0.247242055158897, 1.7567198656027, 0.595583088338233
), label = "Weight provided by data vendor", format.spss = "F8.6", display_width = 10L)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"), label = "JN725 - IAE - GEM 2009")
Updated code:
label_table <- data.frame(label = all_labels)
for (df in names(df_list)) {
  for (col_name in names(df_list[[df]])) {
    if (var_label(df_list[[df]][[col_name]]) %in% all_labels) {
      label_table[label_table$label == var_label(df_list[[df]][[col_name]]), df] <- col_name
    }
  }
}
label_table
Output:
> label_table
label GEM2001 GEM2002
1 Harmonization ID setid setid
2 Alternative ID variable to avoid duplicates across years setid_ne setid_ne
3 Year survey was administered yrsurv yrsurv
4 Country country_updtd country
5 Weight provided by data vendor weight <NA>
6 Weight according to vendor <NA> weight
I would like to compare the variable labels across multiple dataframes. I downloaded the SPSS datasets of the Global Entrepreneurship Monitor using the haven package. I would like to create a crosstable listing all unique variable labels across the dataframes horizontally and all dataframes vertically, and then fill each cell with an x if the variable label is contained in that dataframe. An MRE of two datasets can be found below the code.
I started with the following code, which seems to get the unique variable labels per dataframe:
library(labelled)
library(tidyverse)
# Get an overview of the same variables across all data frames
# combine data frames into a list
df_list <- list(GEM2001, GEM2002)
# extract unique variable labels across all data frames
all_labels <- Reduce(union, lapply(df_list, var_label))
# custom function to extract variable labels from each data frame and compare to the unique labels
check_vars <- function(df) {
  vars_in_df <- var_label(df)
  sapply(all_labels, function(x) ifelse(x %in% vars_in_df, "x", ""))
}
# apply custom function to each data frame in the list
df_var_mat <- lapply(df_list, check_vars)
# create cross table
cross_table <- table(df_var_mat, dnn = c("DataFrame", "Variable"), useNA = "ifany")
Minimum reproducible example of the data: the same two datasets (GEM2001 and GEM2002) shown above.
How about:
library(haven)
library(labelled)
# for testing, add discordant labels for one variable
var_label(GEM2001$weight) <- "Weight provided by data vendor"
var_label(GEM2002$weight) <- "Weight according to vendor"
df_list <- list(GEM2001 = GEM2001, GEM2002 = GEM2002)
df_labels <- lapply(df_list, \(df) unlist(var_label(df)))
all_labels <- unique(unlist(df_labels))
label_table <- data.frame(label = all_labels)
for (df in names(df_labels)) {
  label_table[[df]] <- ifelse(all_labels %in% df_labels[[df]], "X", "")
}
label_table
label GEM2001 GEM2002
1 Harmonization ID X X
2 Alternative ID variable to avoid duplicates across years X X
3 Year survey was administered X X
4 Country X X
5 Weight provided by data vendor X
6 Weight according to vendor X
Edit:
OP asked "Is there an easy way to replace the 'X' in the crosstable with the variable name?" To do so, change the for loop as follows:
for (df in names(df_labels)) {
  label_table[[df]] <- ifelse(
    all_labels %in% df_labels[[df]],
    names(df_labels[[df]])[match(all_labels, df_labels[[df]])],
    ""
  )
}
label_table
label GEM2001 GEM2002
1 Harmonization ID setid setid
2 Alternative ID variable to avoid duplicates across years setid_ne setid_ne
3 Year survey was administered yrsurv yrsurv
4 Country country country
5 Weight provided by data vendor weight
6 Weight according to vendor weight
You could use the following approach:
library(dplyr)
library(tidyr)
library(labelled)  # for var_label
df_list <- list("df1" = df1, "df2" = df2)
bind_rows(lapply(names(df_list), \(n) {
  tibble(frame = n, labels = unlist(var_label(df_list[[n]])))
})) %>%
  pivot_wider(id_cols = labels, names_from = frame, values_from = frame, values_fn = length)
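If you want the cells to show the variable names instead of a count, the same reshaping can carry the names. A minimal sketch under the same assumptions (df_list as defined above; labels assumed unique within each dataframe):
library(dplyr)
library(tidyr)
library(labelled)
bind_rows(lapply(names(df_list), \(n) {
  labs <- unlist(var_label(df_list[[n]]))
  # keep each variable name alongside its label
  tibble(frame = n, variable = names(labs), label = unname(labs))
})) %>%
  pivot_wider(id_cols = label, names_from = frame, values_from = variable)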
I'm trying to scrape, in R, a list of apartments for sale and their basic info (address, m², price, rooms, etc.) from this website: https://www.boligsiden.dk/tilsalg/ejerlejlighed?sortAscending=true&priceMin=3000000&priceMax=7000000
Using SelectorGadget, I haven't been able to create a path that uniquely extracts the square meters of all 50 apartments on page 1, or another path that uniquely extracts the number of rooms, etc.
I did manage to find a path that uniquely extracts the addresses (see the code block below), but this is in a separate block/class from the rest of the text.
Here is my current code:
library(rvest)
library(dplyr)
link = "https://www.boligsiden.dk/tilsalg/ejerlejlighed?sortAscending=true&priceMin=3000000&priceMax=7000000&page=1"
page = read_html(link)
address = page %>% html_nodes("div.mr-2") %>% html_text()
price = #MISSING - CAN'T FIGURE OUT
sqm = #MISSING - CAN'T FIGURE OUT
rooms = #MISSING - CAN'T FIGURE OUT
forsale = data.frame(address, price, sqm, rooms, stringsAsFactors = FALSE)
Any ideas on how to approach it?
I tried using XPath as well to extract the sqm, but only managed to get one specific text field, not all 50 on the page.
Alternative approaches are welcome too. Thanks in advance!
Using their API (found in the network tab of the browser's developer tools), you can call it and retrieve the information like this:
library(tidyverse)
library(httr2)
"https://api.prod.bs-aws-stage.com/search/cases?addressTypes=condo&priceMax=7000000&priceMin=3000000&per_page=100&page=1&sortAscending=true&sortBy=timeOnMarket" %>%
request() %>%
req_perform() %>%
resp_body_json(simplifyVector = TRUE) %>%
pluck("cases") %>%
unnest(address, names_sep = "_") %>%
mutate(
address = str_c(address_roadName, address_houseNumber, address_zipCode, sep = " "),
.before = 1
) %>%
select(address,
price = priceCash,
sqm = housingArea,
rooms = numberOfRooms)
# A tibble: 100 × 4
address price sqm rooms
<chr> <int> <int> <int>
1 Holsteinsgade 66 2100 3135000 56 2
2 Tuborgvej 60 2900 4875000 114 4
3 Poppellunden 8 4000 3350000 92 3
4 Hyldegårds Tværvej 5 2920 6498000 115 3
5 Grollowstræde 3 3000 3495000 92 3
6 Rasmus Rasks Vej 8 2500 3995000 80 3
7 Ryesgade 7 8000 4598000 110 4
8 Carl Th. Zahles Gade 8 2300 5795000 113 3
9 Strandlodsvej 23E 2300 5495000 101 3
10 Nordre Fasanvej 162 2000 4695000 90 4
# … with 90 more rows
# ℹ Use `print(n = ...)` to see more rows
To see which variables are available for extraction:
"https://api.prod.bs-aws-stage.com/search/cases?addressTypes=condo&priceMax=7000000&priceMin=3000000&per_page=100&page=1&sortAscending=true&sortBy=timeOnMarket" %>%
request() %>%
req_perform() %>%
resp_body_json(simplifyVector = TRUE) %>%
pluck("cases") %>%
glimpse
Rows: 100
Columns: 37
$ `_links` <df[,1]> <data.frame[30 x 1]>
$ address <df[,28]> <data.frame[30 x 28]>
$ addressType <chr> "condo", "condo", "condo", "condo", "condo", "condo", "c…
$ caseID <chr> "89194273-5948-4734-8085-fec9d42ac3c2", "ff6a9ff5-eacf-…
$ caseUrl <chr> "https://www.lokalbolig.dk/?sag=26-X0001820", "https://www.…
$ coordinates <df[,3]> <data.frame[30 x 3]>
$ daysOnMarket <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ defaultImage <df[,1]> <data.frame[30 x 1]>
$ descriptionBody <chr> "Lys stuelejlighed med to terrasser i HørsholmNær centrum o…
$ descriptionTitle <chr> "Lys stuelejlighed med to terrasser i Hørsholm", "Fantas…
$ distinction <chr> "real_estate", "real_estate", "real_estate", "real_estate",…
$ energyLabel <chr> "c", "c", "d", "c", "d", "c", "c", "c", "c", "c", "c", "…
$ highlighted <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ housingArea <int> 98, 82, 64, 91, 81, 97, 78, 113, 81, 91, 133, 69, 80, 64, 1…
$ images <list> [<data.frame[5 x 1]>], [<data.frame[3 x 1]>], [<data.frame[…
$ monthlyExpense <int> 4183, 3888, 2798, 3205, 3557, 3405, 3233, 2688, 3921, 3907,…
$ nextOpenHouse <df[,4]> <data.frame[30 x 4]>
$ numberOfFloors <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1,…
$ numberOfRooms <int> 3, 3, 2, 3, 3, 4, 3, 4, 3, 4, 3, 3, 3, 2, 4, 2, 3, 4, 2, 4,…
$ pageViews <int> 126, 341, 191, 160, 358, 356, 242, 516, 133, 180, 134, 106…
$ perAreaPrice <int> 40765, 54817, 62422, 71374, 43148, 58711, 60897, 41150, 480…
$ priceCash <int> 3995000, 4495000, 3995000, 6495000, 3495000, 5695000, 47…
$ priceChangePercentage <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ providerCaseID <chr> "26-X000182018025lok", "114-2102", "43000000643cam", "13433…
$ realEstate <df[,3]> <data.frame[30 x 3]>
$ realtor <df[,21]> <data.frame[30 x 21]>
$ slug <chr> "oerbaekgaards-alle-901-0-tv-2970-hoersholm-02239600_901_st…
$ status <chr> "open", "open", "open", "open", "open", "open", "open", "op…
$ timeOnMarket <df[,2]> <data.frame[30 x 2]>
$ totalClickCount <int> 103, 274, 109, 121, 227, 273, 205, 415, 82, 128, 122, 92, 1…
$ totalFavourites <int> 1, 3, 0, 0, 4, 1, 1, 3, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 2,…
$ utilitiesConnectionFee <df[,1]> <data.frame[30 x 1]>
$ yearBuilt <int> 2002, 1886, 1907, 2008, 1932, 1914, 1900, 1926, 1934, 1932,…
$ basementArea <int> NA, NA, NA, NA, NA, NA, NA, NA, 88, NA, NA, NA, NA, NA, …
$ lotArea <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 5327, NA, N…
$ weightedArea <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ secondaryAddressType <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
Here is how you can save the data into your environment:
df <- "https://api.prod.bs-aws-stage.com/search/cases?addressTypes=condo&priceMax=7000000&priceMin=3000000&per_page=100&page=1&sortAscending=true&sortBy=timeOnMarket" %>%
request() %>%
req_perform() %>%
resp_body_json(simplifyVector = TRUE) %>%
pluck("cases")
The selectors are kind of convoluted and fragile, but for now this seems to work:
library(rvest)
library(dplyr)
library(purrr)
library(stringr)
url <- "https://www.boligsiden.dk/tilsalg/ejerlejlighed?sortAscending=true&priceMin=3000000&priceMax=7000000"
html <- read_html(url)
html |> html_elements("div.shadow.overflow-hidden.mx-4") |>
map_dfr(\(x)
list(
"address" = html_element(x ,"div.mr-2") |> html_text2() |> str_squish(),
"price" = html_element(x ,"span.text-lg.pr-2") |> html_text(),
"sqm" = html_element(x ,"div.hidden.grid-cols-5.grid-rows-2 > div:nth-child(1) .text-sm" ) |> html_text(),
"rooms" = html_element(x ,"div.hidden.grid-cols-5.grid-rows-2 > div:nth-child(4) .text-sm" ) |> html_text()
)
)
#> # A tibble: 50 × 4
#> address price sqm rooms
#> <chr> <chr> <chr> <chr>
#> 1 Poppellunden 8, 4. tv. Himmelev, 4000 Roskilde 3.350.000 kr. 92 m² 3 Vær.
#> 2 Tuborgvej 60, 2. th. 2900 Hellerup 4.875.000 kr. 114 m² 4 Vær.
#> 3 Hyldegårds Tværvej 5, st. tv. 2920 Charlottenlund 6.498.000 kr. 115 m² 3 Vær.
#> 4 Grollowstræde 3 3000 Helsingør 3.495.000 kr. 92 m² 3 Vær.
#> 5 Ryesgade 7, 2. tv. 8000 Aarhus C 4.598.000 kr. 110 m² 4 Vær.
#> 6 Carl Th. Zahles Gade 8, 2. tv. 2300 København S 5.795.000 kr. 113 m² 3 Vær.
#> 7 Rasmus Rasks Vej 8, 2. tv. 2500 Valby 3.995.000 kr. 80 m² 3 Vær.
#> 8 Strandlodsvej 23E, 1. mf. 2300 København S 5.495.000 kr. 101 m² 3 Vær.
#> 9 Nordre Fasanvej 162, 3. th. 2000 Frederiksberg 4.695.000 kr. 90 m² 4 Vær.
#> 10 Ringstedgade 17B, 1. th. 4000 Roskilde 5.395.000 kr. 137 m² 5 Vær.
#> # … with 40 more rows
Created on 2023-02-01 with reprex v2.0.2
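Since the scraped columns come back as character ("3.350.000 kr.", "92 m²", "3 Vær."), a natural follow-up is converting them to numbers. A sketch, assuming the tibble above is stored as listings; readr's parse_number() with a Danish locale handles the dot thousands separators:
library(dplyr)
library(readr)
listings %>%
  mutate(
    price = parse_number(price, locale = locale(grouping_mark = ".", decimal_mark = ",")),
    sqm = parse_number(sqm),
    rooms = parse_number(rooms)
  )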
I have a dataframe games_h. This is just a snippet of the table, but it has many teams and is sorted by date, team, and game number. I am trying to create a weighted rolling average grouped by team, where the most recent game is weighted more heavily than the game before it. So the weights would be (Game_1 * 1 + Game_2 * 2) / 3, or equivalently weights that sum to 1 in the same ratio: weights = c(1 - .667, .667).
dput(games_h)
structure(list(GameId = c(16, 16, 37, 37, 57, 57), GameDate = structure(c(17905,
17905, 17916, 17916, 17926, 17926), class = "Date"), NeutralSite = c(0,
0, 0, 0, 0, 0), AwayTeam = c("Virginia Cavaliers", "Virginia Cavaliers",
"Florida State Seminoles", "Florida State Seminoles", "Syracuse Orange",
"Syracuse Orange"), HomeTeam = c("Boston College Eagles", "Boston College Eagles",
"Boston College Eagles", "Boston College Eagles", "Boston College Eagles",
"Boston College Eagles"), Team = c("Virginia Cavaliers", "Boston College Eagles",
"Florida State Seminoles", "Boston College Eagles", "Syracuse Orange",
"Boston College Eagles"), Home = c(0, 1, 0, 1, 0, 1), Score = c(83,
56, 82, 87, 77, 71), AST = c(17, 6, 12, 16, 11, 13), TOV = c(10,
8, 9, 13, 11, 11), STL = c(5, 4, 4, 6, 6, 5), BLK = c(6, 0, 4,
4, 1, 0), Rebounds = c(38, 18, 36, 33, 23, 23), ORB = c(7, 4,
16, 10, 7, 6), DRB = c(31, 14, 20, 23, 16, 17), FGA = c(55, 57,
67, 55, 52, 45), FGM = c(33, 22, 28, 27, 29, 21), X3FGM = c(8,
7, 8, 13, 11, 9), X3FGA = c(19, 25, 25, 21, 26, 22), FTA = c(14,
9, 24, 28, 15, 23), FTM = c(9, 5, 18, 20, 8, 20), Fouls = c(16,
12, 25, 20, 19, 19), Game_Number = 1:6, Count = c(1, 1, 1, 1,
1, 1)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -6L), groups = structure(list(HomeTeam = "Boston College Eagles",
.rows = structure(list(1:6), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -1L), .drop = TRUE))
Here is an example of the desired output for the Score column (originally attached as an image).
Here is my failed attempt. The function works correctly, but I cannot apply it to the entire dataframe by group.
library(zoo)  # for rollsum
weighted_avg <- function(x, wt1, wt2) {
  rs1 <- rollsum(x, 1, align = "right")
  rs2 <- rollsum(x, 2, align = "right")
  rs1 <- rs1[-1]
  rs3 <- rs2 - rs1
  weighted_avg <- ((rs3 * wt2) + (rs1 * wt1)) / (wt1 + wt2)
  return(weighted_avg)
}
weighted_avg(csum$Score_Y, 2, 1)
apply(csum$Score_Y , 2, weighted_avg, wt1 = 2, wt2=1)
test<-csum %>%
group_by(Team)%>%
group_map(across(c(Score:Fouls), weighted_avg(.x$Team, 2, 1) ))
test<-csum %>%
group_by(Team)%>%
group_walk(across(c(Score:Fouls),weighted_avg(.~,2,1) ))
Here are some notes about the code:
I used the slider::slide_dbl function. First we specify the vector for which we would like to compute the moving average (Score).
As we need a sliding window of length 2, I used the .before argument of slide_dbl so that the previous value and the current value are used when calculating the moving average.
I also set the .complete argument to TRUE to make sure the moving average is only calculated when a previous value exists. In other words, there is no moving average in the first row.
For more info, check the documentation for the slider package.
library(tidyverse)
library(slider)
# df is the games_h data from the question
df %>%
  group_by(HomeTeam) %>%
  summarise(Example = c(NA, slide_dbl(Score, .before = 1, .complete = TRUE,
                                      .f = ~ (.x[1] * 1 + .x[2] * 2) / 3)))
`summarise()` has grouped output by 'HomeTeam'. You can override using the `.groups` argument.
# A tibble: 7 × 2
# Groups: HomeTeam [1]
HomeTeam Example
<chr> <dbl>
1 Boston College Eagles NA
2 Boston College Eagles NA
3 Boston College Eagles 65
4 Boston College Eagles 73.3
5 Boston College Eagles 85.3
6 Boston College Eagles 80.3
7 Boston College Eagles 73
If you want to calculate the moving average for all numeric columns, you could try:
df %>%
  group_by(HomeTeam) %>%
  summarise(across(where(is.numeric), ~ c(NA, slide_dbl(., .before = 1, .complete = TRUE,
                                                        .f = ~ (.x[1] * 1 + .x[2] * 2) / 3)))) %>%
  ungroup()
`summarise()` has grouped output by 'HomeTeam'. You can override using the `.groups` argument.
# A tibble: 7 × 21
HomeTeam GameId NeutralSite Home Score AST TOV STL BLK Rebounds ORB DRB FGA FGM
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Boston C… NA NA NA NA NA NA NA NA NA NA NA NA NA
2 Boston C… NA NA NA NA NA NA NA NA NA NA NA NA NA
3 Boston C… 16 0 0.667 65 9.67 8.67 4.33 2 24.7 5 19.7 56.3 25.7
4 Boston C… 30 0 0.333 73.3 10 8.67 4 2.67 30 12 18 63.7 26
5 Boston C… 37 0 0.667 85.3 14.7 11.7 5.33 4 34 12 22 59 27.3
6 Boston C… 50.3 0 0.333 80.3 12.7 11.7 6 2 26.3 8 18.3 53 28.3
7 Boston C… 57 0 0.667 73 12.3 11 5.33 0.333 23 6.33 16.7 47.3 23.7
# … with 7 more variables: X3FGM <dbl>, X3FGA <dbl>, FTA <dbl>, FTM <dbl>, Fouls <dbl>,
# Game_Number <dbl>, Count <dbl>
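The same window logic generalizes to arbitrary weights via weighted.mean(). A sketch, again assuming df holds the games_h data from the question:
library(tidyverse)
library(slider)
wts <- c(1, 2)  # previous game gets weight 1, the most recent game weight 2
df %>%
  group_by(HomeTeam) %>%
  summarise(Example = c(NA, slide_dbl(Score, .before = 1, .complete = TRUE,
                                      .f = ~ weighted.mean(.x, wts))))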
I would like to modify the answer to the question here, or find a new solution, to include another column that shows the second largest consecutive run of "0". My sample data and code are below; the function operates on the month columns, and the Second largest run column is what I hope to add. I am working with a large dataset, so the more efficient the better. Any ideas are appreciated, thanks.
sample data
test <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9), V1 = c("A",
"B", "A", "B", "B", "A", "A", "B", "B"), V2 = c(21, 233, 185,
85, 208, 112, 238, 66, 38), V3 = c(149, 250, 218, 104, 62, 19,
175, 168, 28), Jan = c(10, 20, 10, 12, 76, 28, 137, 162, 101),
Feb = c(20, 25, 15, 0, 89, 0, 152, 177, 119), March = c(0,
28, 20, 14, 108, 0, 165, 194, 132), April = c(0, 34, 25,
16, 125, 71, 181, 208, 149), May = c(25, 0, 30, 22, 135,
0, 191, 224, 169), June = c(29, 0, 35, 24, 145, 0, 205, 244,
187), July = c(34, 0, 40, 28, 163, 0, 217, 256, 207), August = c(37,
0, 45, 29, 173, 0, 228, 276, 221), Sep = c(0, 39, 50, 31,
193, 0, 239, 308, 236), Oct = c(0, 48, 55, 35, 210, 163,
252, 0, 247), Nov = c(48, 55, 60, 40, 221, 183, 272, 0, 264
), Dec = c(50, 60, 65, 45, 239, 195, 289, 0, 277), `Second largest run` = c(1,
NA, NA, NA, NA, 2, NA, NA, NA), result = c(2, 4, -Inf, 1,
-Inf, 5, -Inf, 3, -Inf)), row.names = c(NA, -9L), class = c("tbl_df",
"tbl", "data.frame"))
code
most_consecutive_val = function(x, val = 0) {
  with(rle(x), max(lengths[values == val]))
}
test$result = apply(test[, -c(1:4, 17)], MARGIN = 1, most_consecutive_val)
Rather than taking the max from the run length encoding (rle) output, we can sort the run lengths and then extract the desired index. We get NA when we request an index that doesn't exist, for example where there isn't a second run of zeroes in row 2.
ordered_runs = function(x, val = 0, idx = 1) {
  with(rle(x), sort(lengths[values == val], decreasing = TRUE))[idx]
}
test$result_1 <- apply(test[, -c(1:4, 17:18)], MARGIN = 1, ordered_runs, idx = 1)
test$result_2 <- apply(test[, -c(1:4, 17:18)], MARGIN = 1, ordered_runs, idx = 2)
The output is slightly different from what you expected: (1) it uses NA rather than -Inf, and (2) in your first row, I believe there is a tie, with a second run of 2 zeroes.
> test[,c(1,17:20)]
# A tibble: 9 x 5
ID `Second largest run` result result_1 result_2
<dbl> <dbl> <dbl> <int> <int>
1 1 1 2 2 2
2 2 NA 4 4 NA
3 3 NA -Inf NA NA
4 4 NA 1 1 NA
5 5 NA -Inf NA NA
6 6 2 5 5 2
7 7 NA -Inf NA NA
8 8 NA 3 3 NA
9 9 NA -Inf NA NA
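To see what the helper is doing, here is rle() applied to row 1's month values (a quick illustration):
x <- c(10, 20, 0, 0, 25, 29, 34, 37, 0, 0, 48, 50)  # Jan through Dec for ID 1
r <- rle(x)
r$lengths[r$values == 0]
# [1] 2 2  -> two runs of zeros, both of length 2
sort(r$lengths[r$values == 0], decreasing = TRUE)[2]
# [1] 2    -> the second largest run (the tie mentioned above)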
Here is an option using data.table, which should be quite fast for the OP's large dataset and also identifies all runs of zeros simultaneously:
library(data.table)
setDT(DF)
cols <- c("Jan", "Feb", "March", "April", "May", "June", "July", "August", "Sep", "Oct", "Nov", "Dec")
# convert into a long format
m <- melt(DF, measure.vars = cols)[
  # identify consecutive sequences of the same value within each ID and count along each run
  order(ID), c("rl", "rw") := .(rl <- rleid(ID, value), rowid(rl))][
  # keep the last element of each run of zeros (that is the length of the run)
  value == 0L, .(ID = ID[.N], len = rw[.N]), rl][
  # sort in descending order of run length
  order(ID, -len)]
# pivot into wide format and perform an update join
wide <- dcast(m, ID ~ rowid(ID), value.var = "len")
DF[wide, on = .(ID), (names(wide)) := mget(names(wide))]
output:
ID V1 V2 V3 Jan Feb March April May June July August Sep Oct Nov Dec 1 2
1: 1 A 21 149 10 20 0 0 25 29 34 37 0 0 48 50 2 2
2: 2 B 233 250 20 25 28 34 0 0 0 0 39 48 55 60 4 NA
3: 3 A 185 218 10 15 20 25 30 35 40 45 50 55 60 65 NA NA
4: 4 B 85 104 12 0 14 16 22 24 28 29 31 35 40 45 1 NA
5: 5 B 208 62 76 89 108 125 135 145 163 173 193 210 221 239 NA NA
6: 6 A 112 19 28 0 0 71 0 0 0 0 0 163 183 195 5 2
7: 7 A 238 175 137 152 165 181 191 205 217 228 239 252 272 289 NA NA
8: 8 B 66 168 162 177 194 208 224 244 256 276 308 0 0 0 3 NA
9: 9 B 38 28 101 119 132 149 169 187 207 221 236 247 264 277 NA NA
data:
DF <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9), V1 = c("A",
"B", "A", "B", "B", "A", "A", "B", "B"), V2 = c(21, 233, 185,
85, 208, 112, 238, 66, 38), V3 = c(149, 250, 218, 104, 62, 19,
175, 168, 28), Jan = c(10, 20, 10, 12, 76, 28, 137, 162, 101),
Feb = c(20, 25, 15, 0, 89, 0, 152, 177, 119), March = c(0,
28, 20, 14, 108, 0, 165, 194, 132), April = c(0, 34, 25,
16, 125, 71, 181, 208, 149), May = c(25, 0, 30, 22, 135,
0, 191, 224, 169), June = c(29, 0, 35, 24, 145, 0, 205, 244,
187), July = c(34, 0, 40, 28, 163, 0, 217, 256, 207), August = c(37,
0, 45, 29, 173, 0, 228, 276, 221), Sep = c(0, 39, 50, 31,
193, 0, 239, 308, 236), Oct = c(0, 48, 55, 35, 210, 163,
252, 0, 247), Nov = c(48, 55, 60, 40, 221, 183, 272, 0, 264
), Dec = c(50, 60, 65, 45, 239, 195, 289, 0, 277), `1` = c(2L,
4L, NA, 1L, NA, 5L, NA, 3L, NA), `2` = c(2L, NA, NA, NA,
NA, 2L, NA, NA, NA)), row.names = c(NA, -9L), class = "data.frame")
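One small follow-up: the update join leaves the new columns named 1 and 2 (the within-ID rank of each run length). If you prefer descriptive names, a sketch (longest_run and second_longest_run are just illustrative names):
# rename the rank columns added by the update join; assumes the code above has run
setnames(DF, c("1", "2"), c("longest_run", "second_longest_run"), skip_absent = TRUE)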