Dynamic variable in grepl() - r

This is the continuation of the following thread:
Creating Binary Identifiers Based On Condition Of Word Combinations For Filter
Expected output is the same as per the said thread.
I am now writing a function that can take dynamic names as variables.
This is the code that I am aiming at, if I am to run it manually:
df <- df %>% group_by(id, date) %>% mutate(flag1 = if(eval(parse(text=conditions))) grepl(pattern, item_name2) else FALSE)
To make it take into consideration dynamic variable names, I have been doing the code this way:
groupcolumns <- c(id, date)
# where id and date will be entered into the function as character strings by the user
variable <- list(~if(eval(parse(text=conditions))) grepl(pattern, item) else FALSE)
# converting to formula to use with dynamically generated column names
# "conditions" being the following character string, which I can automatically generate:
conditions <- 'any(grepl("Alpha", Item)) & any(grepl("Bravo", Item))'
This becomes:
df <- df %>% group_by_(.dots = groupcolumns) %>% mutate_(.dots = setNames(variable, flags[1]))
# where flags[1] is a predefined vector of columns names that I have created
flags <- paste("flag", seq(1:100), sep = "")
The problem is, I am unable to do anything to the grepl function; to specify the "item" dynamically. If I do it this way, as "df$item", and do a eval(parse(text="df$item")), the intention of piping fails as I am doing a group_by_ and it results in an error (naturally). This also applies to the conditions that I set.
Is there a way for me to tell grepl to use a dynamic variable name?
Thanks a lot (especially to akrun)!
edit 1:
tried the following, and now there is no problem of passing the name of the item into grepl.
variable <- list(~if(eval(parse(text=conditions))) grepl(pattern, as.name(item)) else FALSE)
However, the problem lies in that piping seems not to work, as the output of as.name(item) is seen as an object, which does not exist in the environment.
edit 2:
trying do() in dplyr:
variable <- list(~if(eval(parse(text=conditions))) grepl(pattern, .$deparse(as.name(item))) else FALSE)
df <- df %>% group_by_(.dots = groupcolumns) %>% do_(.dots = setNames(variable, combiflags[1]))
which throws me the error:
Error: object 'Item' not found

If I understand your question correctly, you want to be able to dynamically input both patterns and the object to be searched by these patterns in grepl? The best solution for you will depend entirely on how you choose to store the patterns and how you choose to store the objects to be searched. I have a few ideas that should help you though.
For dynamic patterns, try inputting a list of patterns using the paste function. This will allow you to search many different patterns at once.
grepl(paste(your.pattern.list, collapse="|"), item)
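As a minimal sketch of that alternation pattern (the vectors patterns and items are made-up illustration data, not taken from the question):

```r
# Hypothetical patterns and items, for illustration only
patterns <- c("Alpha", "Bravo")
items <- c("Alpha one", "Charlie two", "Bravo three")

# Collapsing with "|" builds the regex "Alpha|Bravo", which matches either word
combined <- paste(patterns, collapse = "|")
matches <- grepl(combined, items)
```

Note that this matches when any one of the patterns is present. A condition that requires every pattern, as in the question, needs one any(grepl()) call per pattern instead.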
Let's say you want to set up a scenario where you are storing many patterns of interest in a directory, perhaps collected automatically from a server or from some other output. If the patterns are in separate files, you can build the pattern lists like this:
library(readr)
#set working directory
setwd("/path/to/files/i/want")
#make a list of all files in this directory
inFilePaths = list.files(path=".", pattern=glob2rx("*"), full.names=TRUE)
#perform a function for each file in the list
for (inFilePath in inFilePaths)
{
  #grepl function goes here
  #if each file in the folder is a table/matrix/dataframe of patterns try this
  inFileData = read_csv(inFilePath)
  vectorData = as.vector(inFileData$ColumnOfPatterns)
  #item is the vector to be searched; print() shows the result from inside the loop
  print(grepl(paste(vectorData, collapse="|"), item))
}
For dynamically specifying the item, you can use an almost identical framework:
library(readr)
#set working directory
setwd("/path/to/files/i/want")
#make a list of all files in this directory
inFilePaths = list.files(path=".", pattern=glob2rx("*"), full.names=TRUE)
#perform a function for each file in the list
for (inFilePath in inFilePaths)
{
  #grepl function goes here
  #if each file in the folder is a table/matrix/dataframe of data to be searched try this
  inFileData = read_csv(inFilePath)
  #pattern is the regular expression to search for; print() shows the result from inside the loop
  print(grepl(pattern, inFileData$ColumnToBeSearched))
}
If this is too far off from what you envisioned, please update your question with details about how the data you are using is stored.
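For the original problem of pointing grepl() at a column whose name is only known as a string, one option is to avoid eval(parse()) altogether and look the column up with [[ ]]. The following is a minimal base R sketch under that assumption, not the dplyr pipeline from the question. The helper name flag_groups and the sample data are invented for illustration. In current dplyr, the .data[[item]] pronoun plays a similar role inside mutate().

```r
# Hypothetical helper: flag rows whose group contains every pattern.
# group_cols and item are column names supplied as character strings.
flag_groups <- function(df, group_cols, item, patterns) {
  flag <- rep(TRUE, nrow(df))
  for (p in patterns) {
    # df[[item]] looks the column up by name, so no eval(parse()) is needed
    hit <- grepl(p, df[[item]])
    # ave() applies any() within each group and spreads the result over its rows
    flag <- flag & ave(hit, df[group_cols], FUN = any)
  }
  flag
}

# Made-up sample data
df <- data.frame(
  id   = c(1, 1, 2, 2),
  date = c("2020-01-01", "2020-01-01", "2020-01-01", "2020-01-01"),
  Item = c("Alpha x", "Bravo y", "Alpha z", "Charlie w")
)
flags <- paste0("flag", 1:100)

# The new column name is also just a string
df[[flags[1]]] <- flag_groups(df, c("id", "date"), "Item", c("Alpha", "Bravo"))
```

Here only the first group (id 1) contains both "Alpha" and "Bravo", so only its rows are flagged.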

Related

Name a variable or object based on the value of another variable in R

I read data files from a directory where I don't know the number or the names of the files. Each file is a data frame (stored as a parquet file). I can read those files, but how do I name the results?
I would like to have something like a named list where the filename is the name of the element. I don't know how to do this in R. In Python I would use dictionaries like this
import pandas as pd

file_names = ['A.parquet', 'B.parquet']
all_data = {}
for fn in file_names:
    data = pd.read_parquet(fn)
    all_data[fn] = data
How can I solve this in R?
library("arrow")
file_names = c('a.parquet', 'B.parquet')
# "named vector"?
daten = c()
for (pf in file_names) {
  # name of data frame (filename without suffix)
  df_name <- strsplit(pf, ".", fixed=TRUE)[[1]][1]
  df <- arrow::read_parquet(pf)
  daten[df_name] = df
}
This doesn't work because I got this error
number of items to replace is not a multiple of replacement length
In the tidyverse you would use purrr. This is basically the same as the lapply() or sapply() approach, but in a different ecosystem.
library(arrow)
library(purrr)
file_names = c('a.parquet', 'B.parquet')
daten <- file_names %>%
set_names(tools::file_path_sans_ext) %>%
map(read_parquet)
You would access each list item through the usual ways.
daten$a
daten$B
# or
daten[["a"]]
daten[["B"]]
Explanation
The pipe operator %>% is an extremely common thing to run into in R these days. It is from the magrittr package, but is also exported from various other tidyverse packages, including purrr.
The pipe takes the left hand argument and enters it as the first argument on the right side expression. So f(x, y) can be written as x %>% f(y). This is useful to chain together expressions. R itself has a native pipe operator |> starting with version 4.1.0.
file_names is an unnamed character vector of the file names.
set_names() will make this a named vector by applying the function file_path_sans_ext() to file_names. This removes the file extension, so each element is named according to its name before the extension.
map() will iterate over each element of the vector, returning a list named according to the names of the vector elements. Each iteration runs the read_parquet function on the input (the file name).
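The pipe and naming steps described above can be tried out without any packages. This is a small base R sketch that assumes R 4.1.0 or later for the native pipe. It uses setNames() and lapply() as stand-ins for set_names() and map(), and nchar() as a placeholder for the file reader.

```r
# Requires R >= 4.1.0 for the native pipe |>
file_names <- c("a.parquet", "B.parquet")

# x |> f() is evaluated as f(x), so both lines give the same result
piped  <- file_names |> toupper() |> rev()
nested <- rev(toupper(file_names))

# Base R counterpart of set_names(tools::file_path_sans_ext):
# name each element after its file name with the extension removed
named <- setNames(file_names, tools::file_path_sans_ext(file_names))

# Iterating over a named vector returns a list that carries the same names
lens <- lapply(named, nchar)
```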
You can use named lists like so.
You can either use the names directly
sapply(file_names, arrow::read_parquet,USE.NAMES = TRUE,simplify = FALSE)
or set them after with whatever function you want to apply
setNames(lapply(file_names, arrow::read_parquet), tools::file_path_sans_ext(file_names))
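The loop from the question can also be kept almost as written. The error most likely arises because daten starts as c() and a whole data frame (itself a list of columns) is assigned into a single slot with single brackets. Starting from list() and assigning with [[ ]] avoids that. In this sketch, CSV files in a temporary directory and read.csv stand in for the parquet files and arrow::read_parquet, purely so the example runs on its own.

```r
# Stand-in files: CSVs in a temp directory replace the parquet files here
tmp <- tempfile("daten_")
dir.create(tmp)
file_names <- file.path(tmp, c("a.csv", "B.csv"))
write.csv(data.frame(x = 1:3), file_names[1], row.names = FALSE)
write.csv(data.frame(x = 4:5), file_names[2], row.names = FALSE)

daten <- list()  # a list, not c(), so each element can hold a whole data frame
for (pf in file_names) {
  # name of the data frame: file name without directory and suffix
  df_name <- tools::file_path_sans_ext(basename(pf))
  # [[ ]] assigns one list element; read.csv stands in for arrow::read_parquet
  daten[[df_name]] <- read.csv(pf)
}
```

The result behaves much like the Python dictionary: daten$a or daten[["B"]] returns the corresponding data frame.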

select a value from several dataframes (file.csv) and merge them into a new dataframe with R

Maybe I'm asking for something too simple, but I can't solve this problem.
I want to create a script that recursively enters the folders present in a base_folder, opens a specific file whose name always contains the same string (w3nu), and selects a precise value (I need the subject's email from the Response column, filtering for the corresponding entry in the Question.Key column).
I want my script to repeat itself in the same way for all the folders present in the base folder.
Finally, I want to merge all the emails into a new dataframe.
I have created this script but it does not work.
library(tidyverse)
base_folder <- "data/raw/exp_1_participants/sbj"
files <- list.files(base_folder, recursive = TRUE, full.names = TRUE)
demo_email <- files[str_detect(files, "w3nu")]
email_extraction <- function(demo_email){
demo_email <- read.csv(task,header = T)
demo_email <- demo_email %>%
filter(Question.Key == "respondent-email") %>%
select(Response)
}
email_list_jolly <- vector(mode = "list", length = length(demo_email))
for(i in 1:length(email_list_jolly)){
email_list_jolly[[i]] <- email_extraction(demo_email[i])
}
email_list_stud <- cbind(email_list_jolly)
write.csv(email_list_stud, 'data/cleaned/email_list_stud.csv')
can you help me? thanks
From comments:
Looks like you haven't defined task within the script shown above, but you're telling read.csv to find it. Did you mean to pass demo_email to read.csv instead? task is probably a random vector in your workspace.
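Following that comment, a corrected version might look like the sketch below. It is written in base R rather than with the tidyverse pipeline from the question, and the temporary folders and sample rows are invented so that the example runs on its own. Only the column names Question.Key and Response and the key "respondent-email" come from the question.

```r
# Invented stand-in for the folder tree: two subject folders, one w3nu file each
base_folder <- tempfile("sbj_")
for (s in c("s1", "s2")) {
  dir.create(file.path(base_folder, s), recursive = TRUE)
  write.csv(
    data.frame(Question.Key = c("respondent-age", "respondent-email"),
               Response     = c("30", paste0(s, "@example.com"))),
    file.path(base_folder, s, "data_w3nu.csv"), row.names = FALSE
  )
}

files <- list.files(base_folder, recursive = TRUE, full.names = TRUE)
email_files <- files[grepl("w3nu", files)]

# Read the file passed in as the argument (not the undefined `task`)
email_extraction <- function(path) {
  d <- read.csv(path, header = TRUE)
  d[d$Question.Key == "respondent-email", "Response", drop = FALSE]
}

# One small data frame per file, stacked into a single data frame
email_list <- do.call(rbind, lapply(email_files, email_extraction))
```

Stacking with rbind gives one row per subject, which write.csv can save directly. cbind on a list of data frames, as in the question, does not produce that shape.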

R purrr::map() & mutate(): Add many new columns based on variables in list

I need to create a dataframe summarising information relating to file checking.
I have a list of 126 unique combinations of climate scenarios and years (e.g. 'ssp126_2030', 'ssp126_2050', 'ssp145_2030', 'ssp245_2050'). These unique elements represent sections of a larger full file path pointing to a specific file (scenario_list, below). For each unique element, I need to create multiple new columns specifying whether the file exists, its size and the date it was created.
I would like to loop through the list of 126 elements and stitch together a table of file checks (file_check_table, below). I start with a table of sub-directories, then split these strings into sections so I can paste0() together a string that points to the file within the sub-directory that I want to check. I am aiming to use mutate()/transmute() and purrr::map() to loop through each element in the climate scenario list and add multiple file checking columns (see below image of table).
I am new to functional programming, and this is what I have tried so far. I was thinking of creating a function to add new columns, and then applying the function to the list of climate scenarios. My end goal is to have one new column for each climate scenario and type of file check:
file_checks <- function(x) {
dir_list %>%
mutate(file_check_table,!!paste0(new_col_name) := ifelse(file.exists(paste0(file))==TRUE,1,0))}
file_check_table <- map(scenario_list, file_checks(x))
However, this function does not work; I don't think I have written the function correctly, or perhaps I have not used purrr correctly. Any thoughts on how to fix this would be much appreciated, thank you. This is what I would like file_check_table to look like: [image of the desired table]
If I understand your question correctly, you have a scenario_list that describes the path to the files, and would like the characteristics of the files. The natural way to do that would be to run a pipe with one entry per row, no reason to put it in a function.
For example:
library(tidyverse)
scenario_list <- read_lines("scenario_list.txt")
root_dir <- "C:/Users/Documents/my_project/data_subdir"
file_table <- tibble(scenario = scenario_list) %>%
mutate(path = file.path(root_dir, paste0(scenario, ".csv")),
exists = file.exists(path),
full_info = file.info(path),
file_size = full_info$size,
file_date = full_info$mtime)
And then if you want the output on a single row as in your screenshot:
file_table %>%
select(-path, -full_info) %>%
pivot_wider(names_from = scenario,
names_glue = "{scenario}_{.value}",
values_from = !scenario) %>%
write_csv("output.csv")
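No map() call is needed here because file.exists() and file.info() are vectorised over paths. A small base R sketch of that behaviour follows, using a temporary directory and two invented scenario names so that it runs anywhere. Only one of the two files is created, to show what a missing file looks like.

```r
# Invented stand-in directory containing just one of the two expected files
root_dir <- tempfile("checks_")
dir.create(root_dir)
scenario_list <- c("ssp126_2030", "ssp126_2050")
writeLines("a,b", file.path(root_dir, "ssp126_2030.csv"))

# Both functions take the whole vector of paths at once
path <- file.path(root_dir, paste0(scenario_list, ".csv"))
info <- file.info(path)

file_table <- data.frame(
  scenario  = scenario_list,
  exists    = file.exists(path),
  file_size = info$size,   # NA for a file that does not exist
  file_date = info$mtime
)
```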

Using Character as Naming Convention in R

I am analyzing a data set and have created a function that summarizes most of my columns. The goal of my script is to automate the creation and extraction of summary tables(more or less dataframes).
To generalize as much as possible, I want to pass a character string to my function to be used to name columns, rows, files and more.
What I am working with currently:
NameFun <- function(df, name) {
##Name the first column
colnames(df)[1] <- "name"
##Write DF to Excel Workbook
write.xlsx(df, "Workbook.xlsx", sheetName = "name",
col.names = TRUE, row.names = TRUE, append = TRUE)
}
The objective here is to input a character "name" and use it within the function. I have tried "eval", "assign", and "get" with no luck. I have tried a few other approaches, but either R doesn't recognize it in the environment, does nothing at all, or rejects the idea of passing a character altogether.
I am open to any other solutions as to help generalize my script even more. Each column will have a unique name but report the same number of columns and type of metrics. Ideally, I would be able to pass a list of each column to the function and loop it through the whole data set.
Thanks!
-J
You could probably do this:
library(xlsx)
# Initialize a list to hold your results
ll <- list()
# You can run a loop or run it multiple times to generate your summary
ll[[name]] <- summary_Method(...) # Or pass the df
# Return the updated list; assigning inside the function alone does not change ll outside it
NameFun <- function(name, ll, df){
  ll[[name]] <- df
  ll
}
# Write the list of dataframes to an Excel file, one sheet per name.
lapply(names(ll), function(x) write.xlsx(ll[[x]], 'Workbook.xlsx', sheetName=x, append=TRUE))
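The quoting in the question's NameFun is also worth a closer look: "name" in quotes is the literal string, while name without quotes is the value that was passed in. Below is a minimal sketch of the difference. It uses write.csv in place of write.xlsx so that it runs without extra packages, and the function name name_and_write and the out_dir argument are hypothetical.

```r
# Hypothetical helper: rename the first column and write the data frame to a
# file, both driven by the character string passed as `name`
name_and_write <- function(df, name, out_dir = tempdir()) {
  # Unquoted `name` is the argument's value; "name" would be the literal string
  colnames(df)[1] <- name
  out_file <- file.path(out_dir, paste0(name, ".csv"))
  write.csv(df, out_file, row.names = FALSE)
  invisible(out_file)
}

out <- name_and_write(data.frame(a = 1:2, b = 3:4), "height")
```

The same idea carries over to write.xlsx: passing sheetName = name (unquoted) would use the supplied string as the sheet name.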

Load multiple excel files and name object after a file name

I have read several questions related to this but none is what I am looking for.
The best one by far uses the readxl package
library(readxl)
file.list <- list.files(pattern='*.xlsx')
df.list <- lapply(file.list, read_excel)
but, as explained there, it gives a list.
What I want is to get each file as an object named after the file in my working directory.
What I am doing now is to setwd into the directory that holds all the xls files and then load them one by one by name, for example
mydf1 <- read_excel("mydf1.xlsx")
mydfb <- read_excel("mydfb.xlsx")
datac <- read_excel("datac.xlsx")
Is there any other way to get them without repeating the name over and over again?
You can use assign with for loop:
library(readxl)
file.list <- list.files(pattern = "\\.xlsx$")
for(i in file.list) {
  assign(sub("\\.xlsx$", "", i), read_excel(i))
}
PS.: you need sub to remove file extension (otherwise you would get object mydf1.xlsx instead of mydf1).
This is a perfect use case for the purrr package:
library(readxl)
library(tidyverse) #loads purrr
#for each excel file name, read the excel sheet and row-bind everything into one df
df.excel <- file.list %>% map_df( ~ read_excel(path = .x))
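You could use something like this in place of your loop:
lapply(file.list, function(x){
  # x is a file name, so it can be passed straight to read_excel
  df <- read_excel(x)
  # strip everything from the first dot onwards to get the object name
  y <- gsub("\\..*", "", x)
  assign(y, df, envir = globalenv())
})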
You only think that you want each one loading into the global environment. As you become more experienced with R you will find that in most (if not all) cases it is better to keep related objects like this together in a list.
If they are all in a list then you can use lapply or sapply to run the same command on every element instead of trying to create a new loop where you get each object and process it.
The list approach is less likely to overwrite other objects that you may want to keep, or to cause other action-at-a-distance bugs (which can be very hard to track down).
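As a small illustration of working with such a list, here is a sketch with invented data frames standing in for the Excel files; the element names mirror the file names from the question.

```r
# Invented stand-in data: a named list as it might come back from lapply() over files
df_list <- list(
  mydf1 = data.frame(x = 1:3),
  mydfb = data.frame(x = 1:5),
  datac = data.frame(x = 1:2)
)

# One call processes every data frame; names carry through to the result
row_counts <- sapply(df_list, nrow)

# Individual data frames are still available by name
first <- df_list[["mydf1"]]
```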
Building on the purrr approach by @SethRaithel, this provides a column with the file names.
library(tidyverse)
library(readxl)
# create a list of files matching a regular expression
# in a defined path
file_list <- fs::dir_ls(temp_path, regexp = "\\.xlsx?$")
data_new <- file_list %>%
# convert to a tibble
as_tibble() %>%
# create column with just file name for reference
mutate(file = fs::path_file(value)) %>%
# uses map to read all the files and then return a single df
mutate(data = purrr::map(value, .f=readxl::read_excel)) %>%
unnest(cols=data) %>%
janitor::clean_names() %>%
select(-value)

Resources