Define a tidyverse function in R

I have a data.frame df and I would like to do some checks on the data. If there's an error (e.g. missing values or non plausible values) I would like to make a list containing the id of the case and the type of error.
# Define an empty data.frame to collect the errors
errors <- data.frame(id = numeric(),
                     message = character())

# Function that stacks all the errors
addErrorMessage <- function(data, message) {
  errors <- rbind(errors, ) # <= what to do here?
}

df <- data.frame(id = 1:7,
                 var1 = c(1, 2, 3, 3, 9, 4, 5),
                 var2 = c("A", "A", "B", "C", NA, "D", "A"))

########### List of checks ################
# Check 1: var1 should be smaller than 5
df %>% filter(var1 > 5) %>%
  addErrorMessage(message = "Value of var1 is 5 or greater")

# Check 2: var2 should not be missing
df %>% filter(is.na(var2)) %>%
  addErrorMessage(message = "Value of var2 is missing")
My question is: How can I define a function addErrorMessage() that I can use directly in a tidyverse workflow? I want to avoid saving the wrong cases to a temporary data.frame for each check and then stacking this data.frame onto the errors data.frame using rbind().

Your actual problem can probably be solved with the {pointblank} package, which contains many functions for conducting this kind of test and similar ones.
If you are more interested in writing such validation functions yourself, see the very rough draft further below.
library(pointblank)

df <- data.frame(id = 1:7,
                 var1 = c(1, 2, 3, 3, 9, 4, 5),
                 var2 = c("A", "A", "B", "C", NA, "D", "A"))

df %>%
  col_vals_lt(vars(var1),
              value = 5) %>%
  col_vals_not_null(vars(var2))
#> Error: Exceedance of failed test units where values in `var1` should have been < `5`.
#> The `col_vals_lt()` validation failed beyond the absolute threshold level (1).
#> * failure level (2) >= failure threshold (1)
Created on 2021-08-17 by the reprex package (v2.0.1)
{pointblank} can also generate data validation reports:
agent <-
  create_agent(
    tbl = df,
    tbl_name = "My data",
    label = "Checking column values",
    actions = action_levels(stop_at = 1)
  ) %>%
  col_vals_lt(vars(var1),
              value = 5) %>%
  col_vals_not_null(vars(var2)) %>%
  interrogate()

agent
If you are more interested in writing this kind of function yourself, below is a very rough draft. It uses the attributes of the underlying data.frame, which is not a great solution, since the attributes might get lost depending on which functions you use between checks. In a package we could use a dedicated environment to capture the errors, in which case we wouldn't need the attributes.
library(dplyr)

df <- data.frame(id = 1:7,
                 var1 = c(10, 2, 3, 3, 9, 4, 5),
                 var2 = c("A", NA, "B", "C", NA, "D", "A"))

check <- function(data, condition, message) {
  # capture the condition as an unevaluated expression
  exp <- rlang::enexpr(condition)
  # evaluate it against the columns of `data`
  test <- transmute(data, new = eval(exp))$new
  if (any(test)) {
    err_df <- attr(data, "error_df")
    if (is.null(err_df)) {
      # first failing check: create the error data.frame
      attr(data, "error_df") <- data.frame(check = 1L,
                                           row_nr = which(test),
                                           message = message)
    } else {
      # later checks: stack the new errors and increment the check counter
      attr(data, "error_df") <- rbind(err_df,
                                      data.frame(check = max(err_df$check) + 1L,
                                                 row_nr = which(test),
                                                 message = message))
    }
  }
  # return the data so the pipe can continue
  data
}

get_errors <- function(data) {
  print(attr(data, "error_df"))
  invisible(data)
}
df %>%
  check(condition = var1 > 5,
        message = "Value of var1 is 5 or greater") %>%
  check(condition = is.na(var2),
        message = "Value of var2 is missing") %>%
  get_errors()
#>   check row_nr                       message
#> 1     1      1 Value of var1 is 5 or greater
#> 2     1      5 Value of var1 is 5 or greater
#> 3     2      2      Value of var2 is missing
#> 4     2      5      Value of var2 is missing
Created on 2021-08-17 by the reprex package (v2.0.1)
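As noted above, in a package a dedicated environment could capture the errors instead of data.frame attributes. A minimal sketch of that idea (the environment and helper names here are my own, not from any package):
library(dplyr)

# a private environment that collects errors across checks
.error_env <- new.env()
.error_env$error_df <- NULL

check_env <- function(data, condition, message) {
  exp <- rlang::enexpr(condition)
  test <- transmute(data, new = eval(exp))$new
  if (any(test)) {
    new_errors <- data.frame(row_nr = which(test), message = message)
    .error_env$error_df <- rbind(.error_env$error_df, new_errors)
  }
  data
}

get_errors_env <- function(data) {
  print(.error_env$error_df)
  invisible(data)
}
Because the errors live in the environment rather than in an attribute, they survive any intermediate steps that drop attributes.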

Related

Is it possible to use tidyselect helpers with the cols_only() function?

I have a .csv file like this (except that the real .csv file has many more columns):
library(tidyverse)

tibble(id1 = c("a", "b"),
       id2 = c("c", "d"),
       data1 = c(1, 2),
       data2 = c(3, 4),
       data1s = c(5, 6),
       data2s = c(7, 8)) %>%
  write_csv("df.csv")
I only want id1, id2, data1, and data2.
I can do this:
df <- read_csv("df.csv",
               col_names = TRUE,
               cols_only(id1 = col_character(),
                         id2 = col_character(),
                         data1 = col_integer(),
                         data2 = col_integer()))
But, as mentioned above, my real dataset has many more columns, so I'd like to use tidyselect helpers to only read in specified columns and ensure specified formats.
I tried this:
df2 <- read_csv("df.csv",
                col_names = TRUE,
                cols_only(starts_with("id") = col_character(),
                          starts_with("data") & !ends_with("s") = col_integer()))
But the error message indicates that there's a problem with the syntax. Is it possible to use tidyselect helpers in this way?
My proposal goes around the houses somewhat, but it does let you customise the read spec on a 'rules' basis rather than an explicit, column-by-column basis.
library(tidyverse)

tibble(id1 = c("a", "b"),
       id2 = c("c", "d"),
       data1 = c(1, 2),
       data2 = c(3, 4),
       data1s = c(5, 6),
       data2s = c(7, 8)) %>%
  write_csv("df.csv")

# read only 1 row to build a spec with a minimal read; really just to get the colnames
df_spec <- spec(read_csv("df.csv",
                         col_names = TRUE,
                         n_max = 1))

# alter the spec with base R functions startsWith() / endsWith() etc.
df_spec$cols <- imap(df_spec$cols, ~ {
  if (startsWith(.y, "id")) {
    col_character()
  } else if (startsWith(.y, "data") && !endsWith(.y, "s")) {
    col_integer()
  } else {
    col_skip()
  }
})

df <- read_csv("df.csv",
               col_types = df_spec$cols)
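For what it's worth, newer versions of readr (2.0 and later) accept tidyselect helpers directly through the col_select argument, which may already be enough here. A sketch, assuming readr >= 2.0; note that col_select only picks columns, so column types would still need to be enforced via col_types if guessing is not acceptable:
library(readr)

# tidyselect helpers work directly in col_select (readr >= 2.0)
df3 <- read_csv("df.csv",
                col_select = c(starts_with("id"),
                               starts_with("data") & !ends_with("s")))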

How to get the number of unique values in a column considering another column values with R?

I have this data frame:
df <- data.frame(id = c(918, 919, 920, 921, 922),
                 city = c("a", "c", "b", "c", "a"),
                 mosquitoes = c(9, 13, 8, 25, 10))
What I want to do is to get the number of unique ID values for each city and then create a new data frame that should look like:
newdf <- data.frame(city = c("a", "b", "c"),
                    id = c(2, 1, 2),
                    mosquitoes = c(19, 8, 38))
I know how to do half of that using
newdf <- aggregate(mosquitoes ~ city, data = df, sum)
But no matter how I try, I can't get the number of unique ID values for each city. I've been trying
newdf$id <- aggregate(length(id) ~ city, data = df, sum)
And I also tried a loop (because my original data has way more than 3 cities), but it only produced a mess and I couldn't make it work at all:
x <- unique(df$city)
unique_ID <-
  for (x in df$city) {
    city = unique(df$city)
    mosquitoes = ?
    ID = ?
  }
This topic was the most similar to mine that I could find, but apparently it only works with numeric values. At least I couldn't make it work with my character columns.
Can someone please help me?
You could do:
library(tidyverse)

df <- data.frame(id = c(918, 919, 920, 921, 922),
                 city = c("a", "c", "b", "c", "a"),
                 mosquitoes = c(9, 13, 8, 25, 10))

df %>%
  group_by(city) %>%
  summarise(id = n_distinct(id), mosquitoes = sum(mosquitoes))
#> # A tibble: 3 x 3
#>   city     id mosquitoes
#>   <chr> <int>      <dbl>
#> 1 a         2         19
#> 2 b         1          8
#> 3 c         2         38
Created on 2022-09-05 with reprex v2.0.2
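Since the question started from aggregate(), a base R equivalent is possible as well (a sketch; merge() joins the two per-city summaries):
newdf <- merge(
  aggregate(id ~ city, data = df, FUN = function(x) length(unique(x))),
  aggregate(mosquitoes ~ city, data = df, FUN = sum)
)
newdf
#   city id mosquitoes
# 1    a  2         19
# 2    b  1          8
# 3    c  2         38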

How to filter a data frame to only min and max values of different columns in R?

Let's say I have the following data frame:
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
                 col1 = c("a", "a", "b", "c", "d", "e", "f", "g", "h", "g"),
                 start_day = c(NA, 1, 15, NA, 4, 22, 5, 11, 14, 18),
                 end_day = c(NA, 2, 15, NA, 6, 22, 6, 12, 16, 21))
I want to create a data frame with the columns id, start_day, and end_day, such that for each unique id I keep only the minimum of the start_day column and the maximum of the end_day column. The final data frame should look as follows:
  id start_day end_day
   1         1      15
   2         4      22
   3         5      21
To get this new data frame I wrote the following code:
df <- df[!(is.na(df$start_day)), ]
dt <- data.frame(matrix(ncol = 3, nrow = length(unique(df$id))))
colnames(dt) <- c("id", "start_day", "end_day")
dt$id <- unique(df$id)
st_day <- vector()
en_day <- vector()
for (elm in dt$id) {
  d <- df[df$id == elm, ]
  minimum <- min(d$start_day)
  maximum <- max(d$end_day)
  st_day <- c(st_day, minimum)
  en_day <- c(en_day, maximum)
}
dt$start_day <- st_day
dt$end_day <- en_day
df <- dt
My code is creating what I am looking for, but I am not happy with it. I would love to learn a better and cleaner way to do the same thing. Any idea is very much appreciated.
You can try data.table like below
library(data.table)
na.omit(setDT(df))[, .(start_day = min(start_day), end_day = max(end_day)), id]
#    id start_day end_day
# 1:  1         1      15
# 2:  2         4      22
# 3:  3         5      21
This should do:
library(dplyr)

df %>%
  group_by(id) %>%
  summarise(start_day = min(start_day, na.rm = TRUE),
            end_day = max(end_day, na.rm = TRUE))
Output:
     id start_day end_day
  <dbl>     <dbl>   <dbl>
1     1         1      15
2     2         4      22
3     3         5      21
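A base R version is also possible (a sketch; the formula interface of aggregate() drops the NA rows by default):
merge(aggregate(start_day ~ id, data = df, FUN = min),
      aggregate(end_day ~ id, data = df, FUN = max))
#   id start_day end_day
# 1  1         1      15
# 2  2         4      22
# 3  3         5      21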

Pass multiple arguments to ddply

I am attempting to create a function which takes a list as input, and returns a summarised data frame. However, after trying multiple ways, I am unable to pass a list to the function for the aggregation.
So far I have the following, but it is failing.
library(plyr) # ddply(), .() and summarize() come from plyr

random_df <- data.frame(
  region = c("A", "B", "C", "C"),
  number_of_reports = c(1, 3, 2, 1),
  report_MV = c(12, 33, 22, 12)
)

output_graph <- function(input) {
  print(input$arguments)
  DF <- input$DF
  group_by <- input$group_by
  args <- input$arguments
  flow <- ddply(DF, group_by, summarize, args) # <- this call fails
  return(flow)
}

graph_functions <- list(
  DF = random_df,
  group_by = .(region),
  arguments = .(Reports = sum(number_of_reports),
                MV_Reports = sum(report_MV))
)

output_graph(graph_functions)
Where this works:
library(plyr)

random_df <- data.frame(
  region = c("A", "B", "C", "C"),
  number_of_reports = c(1, 3, 2, 1),
  report_MV = c(12, 33, 22, 12)
)

output_graph <- function(input) {
  print(input$arguments)
  DF <- input$DF
  group_by <- input$group_by
  args <- input$arguments
  flow <- ddply(
    DF,
    group_by,
    summarize,
    Reports = sum(number_of_reports),
    MV_Reports = sum(report_MV)
  )
  return(flow)
}

graph_functions <- list(
  DF = random_df,
  group_by = .(region),
  arguments = .(Reports = sum(number_of_reports),
                MV_Reports = sum(report_MV))
)

output_graph(graph_functions)
Is anyone aware of a way to pass a list of functions to ddply, or of another way to achieve the same goal of aggregating a dynamic set of variables?
In order to pass arguments into the function for use by dplyr, I recommend reading up on non-standard evaluation (NSE). Here is an edited function producing the same output as the original.
library(dplyr)

random_df <- data.frame(
  region = c("A", "B", "C", "C"),
  number_of_reports = c(1, 3, 2, 1),
  report_MV = c(12, 33, 22, 12)
)

output_graph <- function(df, group, args) {
  grp_quo <- enquo(group)    # capture the bare column name as a quosure
  df %>%
    group_by(!!grp_quo) %>%  # unquote it for grouping
    summarise(!!!args)       # splice the list of quosures into summarise()
}

args <- list(
  Reports = quo(sum(number_of_reports)),
  MV_Reports = quo(sum(report_MV))
)

output_graph(random_df, region, args)
# # A tibble: 3 x 3
#   region Reports MV_Reports
#   <fctr>   <dbl>      <dbl>
# 1 A         1.00       12.0
# 2 B         3.00       33.0
# 3 C         3.00       34.0
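With rlang 0.4 or later, the enquo()/!! pair can be written more compactly with the curly-curly operator; a sketch of the same function (output unchanged):
output_graph2 <- function(df, group, args) {
  df %>%
    group_by({{ group }}) %>%  # {{ }} captures and unquotes in one step
    summarise(!!!args)
}

output_graph2(random_df, region, args)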

sort data into deciles based on a rolling subset

I am trying to replicate the Fama French 1993 paper using R. I need to do the following sorting:
1. for each month,
2. calculate ME decile breakpoints on NYSE stocks only,
3. sort all stocks into the deciles created in 2.
Data generation:
set.seed(1234)
n <- 120
stocks <- c("A", "B", "C", "D", "E")
exchange <- c("NYSE", "NASDAQ", "AMEX")
df <- as.data.frame(cbind(Month = 1:12,
                          exchangeCode = exchange[round(runif(n, 1, 3))],
                          Stock = stocks[round(runif(n, 1, 5))],
                          ME = floor(100 * abs(rnorm(n)))))
Desired Output:
ME_NYSE_vals <- as.numeric(paste(df[df$Month == 1 & df$exchangeCode == "NYSE", "ME"]))
ME_ALL_vals <- as.numeric(paste(df[df$Month == 1, "ME"]))
cut(x = ME_ALL_vals,
    breaks = c(-Inf, quantile(ME_NYSE_vals, probs = seq(.1, .9, .1)), +Inf),
    labels = 1:10)
The breaks should be calculated based on ME_NYSE_vals. The cut should be applied to all ME_ALL_vals for each month.
If the intention is to keep the whole data frame but generate deciles only for the NYSE values, the code below could do: it computes deciles only for the entries pertaining to the NYSE while keeping the full data set, achieving a form of partial sorting.
# Libs
Vectorize(require)(package = c("dplyr", "magrittr"),
                   character.only = TRUE)

# Transformations
df %<>%
  mutate(nTileNYSE = ifelse(exchangeCode == "NYSE", ntile(ME, 10), NA)) %>%
  arrange(nTileNYSE)
The code was applied to the data:
set.seed(1)
df <- as.data.frame(cbind(exchangeCode = c("NYSE", "NASDAQ"),
                          Stock = c("A", "B", "C", "A"),
                          Month = 1:12,
                          ME = rnorm(1200)))
2nd approach
Following the discussion in the comments I would suggest the following approach:
# Libs --------------------------------------------------------------------
Vectorize(require)(package = c("tidyr", "dplyr", "magrittr", "xts", "Hmisc"),
                   char = TRUE)

# Data generation ---------------------------------------------------------
set.seed(1234)
n <- 120
stocks <- c("A", "B", "C", "D", "E")
exchange <- c("NYSE", "NASDAQ", "AMEX")
df <- as.data.frame(cbind(Month = 1:12,
                          exchangeCode = exchange[round(runif(n, 1, 3))],
                          Stock = stocks[round(runif(n, 1, 5))],
                          ME = floor(100 * abs(rnorm(n)))))

# Transformations ---------------------------------------------------------
# cbind() coerced all columns to character, so ME has to be converted back
df$ME <- as.numeric(as.character(df$ME))

# Generate cuts
dfNtiles <- df %>%
  arrange(exchangeCode, Month, ME) %>%
  group_by(exchangeCode, Month) %>%
  mutate(cutsBsdOnNYSE = cut(x = ME,
                             breaks = cut2(x = df$ME[df$exchangeCode == "NYSE"],
                                           g = 10, onlycuts = TRUE))) %>%
  ungroup() %>%
  group_by(cutsBsdOnNYSE) %>%
  mutate(grpBsdOnNYSE = n())
The approach is fairly straightforward:
1. Generate cut brackets reflecting a subset of the data (the NYSE entries).
2. Apply those brackets to the whole ME vector.
3. Number the obtained groups so that a group identifier is created.
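For the per-month requirement in the question, the asker's own month-1 logic can also be generalised with a grouped mutate. A sketch: it assumes ME has been converted to numeric as above, and it will fail if the NYSE quantiles within a month are not unique, in which case the breaks need de-duplication:
df %>%
  group_by(Month) %>%
  mutate(decile = cut(ME,
                      breaks = c(-Inf,
                                 quantile(ME[exchangeCode == "NYSE"],
                                          probs = seq(.1, .9, .1)),
                                 Inf),
                      labels = 1:10)) %>%
  ungroup()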