For loop over the same variable in multiple datasets - r

I have multiple datasets and would like to create a contingency table for the same variable in each of them. I am attempting to write a for loop over these datasets, but am having difficulty accessing the necessary variable. Here's a fake set-up to illustrate my issue:
data1 <- data.frame(name = c("A", "B", "C"),
value1 = c(1, 2, 2),
value2 = c(1, 3, 7))
data2 <- data.frame(name = c("D", "E", "F"),
value1 = c(3, 4, 3),
value2 = c(8, 2, 1))
datasets <- c("data1", "data2")
If I manually execute table(data1$value1) then I receive a result. However, if I try something like the following:
for (i in seq_along(datasets)) {
variable <- datasets[[i]]$value1
table(variable)
}
then R throws an error message "Error: $ operator is invalid for atomic vectors." Given this, what is the best way to achieve my initial aim?
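One possible approach, offered as a sketch rather than a definitive answer: `datasets` holds the names of the data frames as character strings, so `datasets[[i]]` is the string "data1" and not the data frame itself, which is why `$` fails. Storing the data frames in a named list, or looking them up from their names with `mget()`, avoids this. The names `tables`, `dataset_names` and `tables_by_name` below are illustrative only.

```r
data1 <- data.frame(name = c("A", "B", "C"),
                    value1 = c(1, 2, 2),
                    value2 = c(1, 3, 7))
data2 <- data.frame(name = c("D", "E", "F"),
                    value1 = c(3, 4, 3),
                    value2 = c(8, 2, 1))

# Store the data frames themselves rather than their names
datasets <- list(data1 = data1, data2 = data2)

# One contingency table of value1 per dataset, returned as a named list
tables <- lapply(datasets, function(d) table(d$value1))
print(tables)

# If only the names are available, mget() looks the objects up by name;
# envir = environment() searches the environment this code runs in
dataset_names <- c("data1", "data2")
tables_by_name <- lapply(mget(dataset_names, envir = environment()),
                         function(d) table(d$value1))

# Inside a for loop results are not auto-printed, so print() is needed
for (nm in names(datasets)) {
  cat(nm, "\n")
  print(table(datasets[[nm]]$value1))
}
```

Returning the tables in a list keeps them available for later use, whereas the loop only displays them.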

Related

Turn the dataset into a script, creating the dataset

Is there a way to turn a data frame (imported from Excel) into code that creates this data frame?
I have a data frame that I read in from Excel, and I'd love to get code like this from it:
abc <- data.frame(a = c("a", "aa", "r"),
b = c(1, 2, 3),
c = c(T, F, T))
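A minimal sketch of one way to do this, assuming base R's `dput()` and `deparse()` are acceptable: both emit R code that recreates an object, although the result is a `structure(list(...))` call rather than the tidy `data.frame(...)` call shown above. The names `code` and `rebuilt` are illustrative.

```r
abc <- data.frame(a = c("a", "aa", "r"),
                  b = c(1, 2, 3),
                  c = c(TRUE, FALSE, TRUE))

# Print R code that recreates the data frame; paste it into a script
dput(abc)

# The same code as a single string, e.g. for writing to a file
code <- paste(deparse(abc), collapse = "")

# Evaluating that code gives back an equivalent data frame
rebuilt <- eval(parse(text = code))
all.equal(abc, rebuilt)
```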

Is it possible to use tidyselect helpers with the cols_only() function?

I have a .csv file like this (except that the real .csv file has many more columns):
library(tidyverse)
tibble(id1 = c("a", "b"),
id2 = c("c", "d"),
data1 = c(1, 2),
data2 = c(3, 4),
data1s = c(5, 6),
data2s = c(7, 8)) %>%
write_csv("df.csv")
I only want id1, id2, data1, and data2.
I can do this:
df <- read_csv("df.csv",
col_names = TRUE,
cols_only(id1 = col_character(),
id2 = col_character(),
data1 = col_integer(),
data2 = col_integer()))
But, as mentioned above, my real dataset has many more columns, so I'd like to use tidyselect helpers to only read in specified columns and ensure specified formats.
I tried this:
df2 <- read_csv("df.csv",
col_names = TRUE,
cols_only(starts_with("id") = col_character(),
starts_with("data") & !ends_with("s") = col_integer()))
But the error message indicates that there's a problem with the syntax. Is it possible to use tidyselect helpers in this way?
My proposal is somewhat roundabout, but it does let you customise the read spec based on rules rather than an explicit column-by-column list.
library(tidyverse)
tibble(id1 = c("a", "b"),
id2 = c("c", "d"),
data1 = c(1, 2),
data2 = c(3, 4),
data1s = c(5, 6),
data2s = c(7, 8)) %>%
write_csv("df.csv")
# Read a single row just to obtain a column spec (really only the column names)
df_spec <- spec(read_csv("df.csv",
                         col_names = TRUE,
                         n_max = 1))
# Alter the spec with the base R functions startsWith() / endsWith()
df_spec$cols <- imap(df_spec$cols, ~ {
  if (startsWith(.y, "id")) {
    col_character()
  } else if (startsWith(.y, "data") && !endsWith(.y, "s")) {
    col_integer()
  } else {
    col_skip()
  }
})
df <- read_csv("df.csv",
               col_types = df_spec$cols)
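For comparison, the same rule-based idea can be sketched without readr, using base R's `read.csv()` and its `colClasses` argument, where the string "NULL" skips a column. This is a swapped-in technique and not what the answer above uses; `csv_path`, `header` and `classes` are illustrative names, and the file is written to a temporary location so the sketch is self-contained.

```r
# Write the example data to a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id1 = c("a", "b"),
                     id2 = c("c", "d"),
                     data1 = c(1, 2),
                     data2 = c(3, 4),
                     data1s = c(5, 6),
                     data2s = c(7, 8)),
          csv_path, row.names = FALSE)

# Read one row only to learn the column names
header <- names(read.csv(csv_path, nrows = 1))

# Derive one class per column from name-based rules; "NULL" skips a column
classes <- ifelse(startsWith(header, "id"), "character",
           ifelse(startsWith(header, "data") & !endsWith(header, "s"),
                  "integer", "NULL"))

df <- read.csv(csv_path, colClasses = classes)
str(df)
```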

How to read and merge only the second sheet from a number of excel files (xlsm) in R?

I have a working directory with a large number of xlsm files (600ish). I need to merge all of these files into one dataframe, but ONLY the second sheet of the excel file. Since there are a lot of files, ideally I would use a loop, but I'm struggling with how to do this. Right now I have this code, which is obviously not working. Any thoughts on how to best do this would be greatly appreciated.
library(readxl)
library(tidyverse)
data.files = list.files(pattern = "*.xlsm")
data_to_merge <- lapply(data.files, read_excel(x, sheet = 2))
combined_df <- bind_rows(data_to_merge)
Not sure how to include examples of the data so it's easily reproducible since my question is dealing with excel sheets, not data that's already in r, but if this is useful, all of the 2nd sheets have the same simple structure that looks something like this:
data1 <- data.frame(id = 1:6,
x1 = c(5, 1, 4, 9, 1, 2),
x2 = c("A", "Y", "G", "F", "G", "Y"))
data2 <- data.frame(id = 4:9,
y1 = c(3, 3, 4, 1, 2, 9),
y2 = c("a", "x", "a", "x", "a", "x"))
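You were close. lapply() expects the function itself as its second argument, with any extra arguments passed after it; read_excel(x, sheet = 2) is instead a call that R tries to evaluate immediately. You just need to alter your lapply statement slightly so that the function and the parameter are separated by a comma.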
library(readxl)
library(tidyverse)
# pattern is a regular expression, so match a literal ".xlsm" at the end of the name
data.files = list.files(pattern = "\\.xlsm$")
data_to_merge <- lapply(data.files, read_excel, sheet = 2)
combined_df <- bind_rows(data_to_merge)
Or a more tidyverse approach:
combined_df <- list.files(pattern = "\\.xlsm$") %>%
  map(~ read_excel(.x, sheet = 2)) %>%
  bind_rows()
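The mechanism at work is that `lapply()` forwards any additional arguments to the function it calls. A small self-contained sketch of this, using temporary CSV files and `read.csv()` as stand-ins because Excel workbooks cannot be created here without extra packages (`tmp_dir`, `files`, `pieces` and `combined` are illustrative names):

```r
# Create two small CSV files in a fresh temporary directory
tmp_dir <- tempfile("lapply_demo")
dir.create(tmp_dir)
write.csv(data.frame(id = 1:2, x = c("A", "B")),
          file.path(tmp_dir, "file1.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, x = c("C", "D")),
          file.path(tmp_dir, "file2.csv"), row.names = FALSE)

files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# The function is passed by name; stringsAsFactors = FALSE is forwarded
# to every read.csv() call, just as sheet = 2 is forwarded to read_excel()
pieces <- lapply(files, read.csv, stringsAsFactors = FALSE)

# Stack the pieces into one data frame (base R counterpart of bind_rows())
combined <- do.call(rbind, pieces)
print(combined)
```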

Define a tidyverse-function

I have a data.frame df and I would like to do some checks on the data. If there's an error (e.g. missing values or non plausible values) I would like to make a list containing the id of the case and the type of error.
# Define an empty data.frame
errors <- data.frame(id = numeric(),
message = character())
# Function that stacks all the errors
addErrorMessage(message){
errors <- rbind(errors, ) # <= what to do here?
}
df <- data.frame(id = 1:7,
var1 = c(1, 2, 3, 3, 9, 4, 5),
var2 = c("A", "A", "B", "C", NA, "D", "A"))
########### List of checks ################
# Check 1: var1 should be smaller than 5
df %>% filter(var1 > 5) %>%
addErrorMessage(message = "Value of var1 is 5 or greater")
# Check 2: var2 should not be missing
df %>% filter(is.na(var2)) %>%
addErrorMessage(message = "Value of var2 is missing")
My question is: How can I define a function addErrorMessage() that I can use directly in a tidyverse workflow? I want to avoid saving the wrong cases to a temporary data.frame for each check and then stacking that data.frame onto the errors data.frame using rbind().
Your actual problem can probably be solved using the {pointblank} package which contains a lot of functions that help to conduct this and similar tests.
If you are more interested in writing such validation functions yourself, see a very rough draft below.
df <- data.frame(id = 1:7,
var1 = c(1, 2, 3, 3, 9, 4, 5),
var2 = c("A", "A", "B", "C", NA, "D", "A"))
library(pointblank)
df %>%
col_vals_lt(vars(var1),
value = 5) %>%
col_vals_not_null(vars(var2))
#> Error: Exceedance of failed test units where values in `var1` should have been < `5`.
#> The `col_vals_lt()` validation failed beyond the absolute threshold level (1).
#> * failure level (2) >= failure threshold (1)
Created on 2021-08-17 by the reprex package (v2.0.1)
{pointblank} can also generate data validation reports:
agent <-
create_agent(
tbl = df,
tbl_name = "My data",
label = "Checking column values",
actions = action_levels(stop_at = 1)
) %>%
col_vals_lt(vars(var1),
value = 5) %>%
col_vals_not_null(vars(var2)) %>%
interrogate()
agent
If you are more interested in writing this kind of function yourself, below is a very rough draft. It stores the errors in an attribute of the underlying data.frame, which is not a great solution: depending on the functions you use between checks, the attributes might get lost. In a package we could use a dedicated environment to capture errors, in which case we wouldn't need the attributes.
library(dplyr)
df <- data.frame(id = 1:7,
var1 = c(10, 2, 3, 3, 9, 4, 5),
var2 = c("A", NA, "B", "C", NA, "D", "A"))
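check <- function(data, condition, message){
  # Capture the unevaluated condition and evaluate it against the data
  exp <- rlang::enexpr(condition)
  test <- transmute(data, new = eval(exp))$new
  # na.rm = TRUE so that a condition yielding NA does not break if()
  if (any(test, na.rm = TRUE)) {
    err_df <- attr(data, "error_df")
    if (is.null(err_df)) {
      # First failing check: start the error table
      attr(data, "error_df") <- data.frame(check = 1L,
                                           row_nr = which(test),
                                           message = message)
    } else {
      # Later failing checks: append with the next check number
      attr(data, "error_df") <- rbind(err_df,
                                      data.frame(check = max(err_df$check) + 1L,
                                                 row_nr = which(test),
                                                 message = message)
      )
    }
  }
  data
}
get_errors <- function(data) {
  print(attr(data, "error_df"))
  invisible(data)
}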
df %>%
check(condition = var1 > 5,
message = "Value of var1 is 5 or greater") %>%
check(condition = is.na(var2),
message = "Value of var2 is missing") %>%
get_errors
#>   check row_nr                       message
#> 1     1      1 Value of var1 is 5 or greater
#> 2     1      5 Value of var1 is 5 or greater
#> 3     2      2      Value of var2 is missing
#> 4     2      5      Value of var2 is missing
Created on 2021-08-17 by the reprex package (v2.0.1)
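The dedicated-environment idea mentioned above could look roughly like the following. This is only a sketch under stated assumptions: `error_log` and `add_error_message()` are hypothetical names, the failing rows are assumed to carry an `id` column, and base R subsetting stands in for `filter()` so the example needs no packages. Because the data is the first argument, the function should also slot in after `df %>% filter(...)` in a pipe.

```r
# Environment that collects errors across all checks
error_log <- new.env()
error_log$errors <- data.frame(id = integer(), message = character())

# Append one row per failing case; return the input invisibly for piping
add_error_message <- function(data, message) {
  if (nrow(data) > 0) {
    error_log$errors <- rbind(error_log$errors,
                              data.frame(id = data$id, message = message))
  }
  invisible(data)
}

df <- data.frame(id = 1:7,
                 var1 = c(1, 2, 3, 3, 9, 4, 5),
                 var2 = c("A", "A", "B", "C", NA, "D", "A"))

# Check 1: var1 should be smaller than 5
add_error_message(df[df$var1 >= 5, ], "Value of var1 is 5 or greater")
# Check 2: var2 should not be missing
add_error_message(df[is.na(df$var2), ], "Value of var2 is missing")

print(error_log$errors)
```

Unlike the attribute-based draft, the collected errors survive whatever happens to the data frame between checks, at the cost of relying on shared state.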

How do I select variables in an R dataframe whose names contain a particular string?

Two examples would be very helpful for me.
How would I select:
1) variables whose names start with b or B (i.e. case-insensitive)
or
2) variables whose names contain a 3
df <- data.frame(a1 = factor(c("Hi", "Med", "Hi", "Low"),
levels = c("Low", "Med", "Hi"), ordered = TRUE),
a2 = c("A", "D", "A", "C"), a3 = c(8, 3, 9, 9),
b1 = c(1, 1, 1, 2), b2 = c( 5, 4, 3,2), b3 = c(3, 4, 3, 4),
B1 = c(3, 6, 4, 4))
If you just want the variable names:
grep("^[Bb]", names(df), value=TRUE)
grep("3", names(df), value=TRUE)
If you are wanting to select those columns, then either
df[,grep("^[Bb]", names(df), value=TRUE)]
df[,grep("^[Bb]", names(df))]
The first uses selecting by name, the second uses selecting by a set of column numbers.
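A slightly more defensive variant of the same idea, as a sketch: `grepl()` returns a logical vector, `ignore.case = TRUE` replaces the `[Bb]` character class, and `drop = FALSE` keeps the result a data frame even when only one column matches. The names `b_cols` and `three_cols` are illustrative.

```r
df <- data.frame(a1 = factor(c("Hi", "Med", "Hi", "Low"),
                             levels = c("Low", "Med", "Hi"), ordered = TRUE),
                 a2 = c("A", "D", "A", "C"), a3 = c(8, 3, 9, 9),
                 b1 = c(1, 1, 1, 2), b2 = c(5, 4, 3, 2), b3 = c(3, 4, 3, 4),
                 B1 = c(3, 6, 4, 4))

# 1) columns whose names start with b or B
b_cols <- df[, grepl("^b", names(df), ignore.case = TRUE), drop = FALSE]

# 2) columns whose names contain a 3 (fixed = TRUE matches the literal text)
three_cols <- df[, grepl("3", names(df), fixed = TRUE), drop = FALSE]

names(b_cols)
names(three_cols)
```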
While I like the answer above, I wanted to give a "tidyverse" solution as well. If you are doing a lot of pipes and trying to do several things at once, as I often do, you may like this answer. Also, I find this code more "humanly" readable.
The function tidyselect::vars_select takes a character vector as its first argument, here the column names of the data frame, and selects from it using a select helper such as starts_with or matches.
library(dplyr)
library(tidyselect)
df <- data.frame(a1 = factor(c("Hi", "Med", "Hi", "Low"),
levels = c("Low", "Med", "Hi"), ordered = TRUE),
a2 = c("A", "D", "A", "C"), a3 = c(8, 3, 9, 9),
b1 = c(1, 1, 1, 2), b2 = c( 5, 4, 3,2), b3 = c(3, 4, 3, 4),
B1 = c(3, 6, 4, 4))
# will select the names starting with a "b" or a "B"
tidyselect::vars_select(names(df), starts_with('b', ignore.case = TRUE))
# use select in conjunction with the previous code
df %>%
select(vars_select(names(df), starts_with('b', ignore.case = TRUE)))
# Alternatively
tidyselect::vars_select(names(df), matches('^[Bb]'))
Note that the default for ignore.case is TRUE, but I put it here to show explicitly, and in case future readers are curious how to adjust the code. The include and exclude arguments are also very useful. For example, you could use vars_select(names(df), matches('^[Bb]'), include = 'a1') if you wanted everything that starts with a "B" or a "b", and you wanted to include "a1" as well.
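For the common case where the goal is the columns themselves rather than their names, the select helpers can also be used directly inside `dplyr::select()`. A short sketch, assuming dplyr is installed (`b_cols` and `three_cols` are illustrative names):

```r
library(dplyr)

df <- data.frame(a1 = factor(c("Hi", "Med", "Hi", "Low"),
                             levels = c("Low", "Med", "Hi"), ordered = TRUE),
                 a2 = c("A", "D", "A", "C"), a3 = c(8, 3, 9, 9),
                 b1 = c(1, 1, 1, 2), b2 = c(5, 4, 3, 2), b3 = c(3, 4, 3, 4),
                 B1 = c(3, 6, 4, 4))

# Columns whose names start with b or B (ignore.case defaults to TRUE)
b_cols <- df %>% select(starts_with("b"))

# Columns whose names contain a 3
three_cols <- df %>% select(contains("3"))

names(b_cols)
names(three_cols)
```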
I thought it was worth adding that select_vars is retired as of tidyverse version 1.2.1. Now, tidyselect::vars_select() is likely what you're looking for within the "tidyverse". See the tidyselect documentation.
