R - writing a function that includes a for loop

I scanned similar questions previously answered but couldn't find the thread that is specific to my problem.
I have a number of datasets that all have five flagging columns (binary) at the end.
The aim is to produce an output that summarises the specified column in each dataset by each flag.
Hence, each output is a list of five summary tables.
library(tidyverse)
library(janitor)
## mydataset1
mydataset1 <- tibble(id = 1:100,
column_000 = sample(1:16, 100, replace = TRUE),
flag1 = sample(0:1, 100, replace = TRUE),
flag2 = sample(0:1, 100, replace = TRUE),
flag3 = sample(0:1, 100, replace = TRUE),
flag4 = sample(0:1, 100, replace = TRUE),
flag5 = sample(0:1, 100, replace = TRUE))
## summary table function
get_table <- function(data, column) {
data %>%
# select the flag
filter(data[[i]] == 1) %>%
# summary table
tabyl(column) %>%
arrange(desc(n)) %>%
top_n(5, n)
}
## list of tables function
output_list <- function(data, column) {
# empty list
output <- list()
# for loop - go through each flagging column
for (i in (length(data)-4):length(data)) {
output[[i]] <- get_table(data, column)
}
# for some reason, there are NULL list items for all other columns
output <- compact(output)
# rename and print
names(output) <- names(data)[(length(data)-4):length(data)]
print(output)
}
### execute
output_list(mydataset1, "column_000")
# error
### manually executing the function works fine
# empty list
output <- list()
# for loop - go through each flagging column
for (i in (length(mydataset1)-4):length(mydataset1)) {
output[[i]] <- get_table(mydataset1, "column_000")
}
# for some reason, there are NULL list items for all other columns
output <- compact(output)
# rename and print
names(output) <- names(mydataset1)[(length(mydataset1)-4):length(mydataset1)]
print(output)
This is what I have for now.
If I execute the contents of output_list function manually, it works fine.
However, if I execute it as a function, it gives me an error that object i is not found.
Where did I get it wrong? Please help!

Pass i as an argument to the get_table function.
library(tidyverse)
library(janitor)
get_table <- function(data, column, i) {
data %>%
# select the flag
filter(data[[i]] == 1) %>%
# summary table
tabyl(column) %>%
arrange(desc(n)) %>%
top_n(5, n)
}
Make the corresponding change in the output_list function.
output_list <- function(data, column) {
# empty list
output <- list()
# for loop - go through each flagging column
for (i in (length(data)-4):length(data)) {
output[[i]] <- get_table(data, column, i)
}
# for some reason, there are NULL list items for all other columns
output <- compact(output)
# rename and print
names(output) <- names(data)[(length(data)-4):length(data)]
print(output)
}
Run the function:
output_list(mydataset1, "column_000")
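The NULL entries that compact() has to remove come from indexing the list by column position: assigning to output[[3]] on an empty list pads positions 1 and 2 with NULL. A minimal sketch that avoids this, assuming the flag columns are always the last five columns, iterates over the flag names instead. The names get_table_by_flag and output_list_by_flag are hypothetical, and count()/slice_max() from dplyr stand in for tabyl()/top_n(), so the output format differs slightly from the original.

```r
library(dplyr)
library(purrr)

set.seed(42)  # only to make the sketch reproducible
mydataset1 <- tibble(
  id = 1:100,
  column_000 = sample(1:16, 100, replace = TRUE),
  flag1 = sample(0:1, 100, replace = TRUE),
  flag2 = sample(0:1, 100, replace = TRUE),
  flag3 = sample(0:1, 100, replace = TRUE),
  flag4 = sample(0:1, 100, replace = TRUE),
  flag5 = sample(0:1, 100, replace = TRUE)
)

# Hypothetical helper: the flag is passed by name, so nothing depends on a
# loop variable living in some other environment
get_table_by_flag <- function(data, column, flag) {
  data %>%
    filter(.data[[flag]] == 1) %>%               # keep rows where the flag is set
    count(across(all_of(column)), sort = TRUE) %>%  # frequency of each value
    slice_max(n, n = 5)                          # top 5 counts, ties kept
}

# Hypothetical wrapper: assumes the flags are the last n_flags columns
output_list_by_flag <- function(data, column, n_flags = 5) {
  flags <- tail(names(data), n_flags)
  flags %>%
    set_names() %>%   # the result list is named after the flags
    map(function(flag) get_table_by_flag(data, column, flag))
}

tables <- output_list_by_flag(mydataset1, "column_000")
names(tables)
```

Because map() returns exactly one element per flag name, there is nothing to compact and no separate renaming step.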

In your get_table function, you use i without declaring it as a function argument. Your code works when you run it step by step because the for loop then assigns i in the global environment, where get_table can find it. Inside output_list, i is local to that function, so get_table cannot see it. If you intend to use i from the for loop in get_table, declare it as an argument and pass it in. See the code below.
library(tidyverse)
library(janitor)
## mydataset1
mydataset1 <- tibble(id = 1:100,
column_000 = sample(1:16, 100, replace = TRUE),
flag1 = sample(0:1, 100, replace = TRUE),
flag2 = sample(0:1, 100, replace = TRUE),
flag3 = sample(0:1, 100, replace = TRUE),
flag4 = sample(0:1, 100, replace = TRUE),
flag5 = sample(0:1, 100, replace = TRUE))
## summary table function
get_table <- function(data, column, i) {
data %>%
# select the flag
filter(data[[i]] == 1) %>%
# summary table
tabyl(column) %>%
arrange(desc(n)) %>%
top_n(5, n)
}
## list of tables function
output_list <- function(data, column) {
# empty list
output <- list()
# for loop - go through each flagging column
for (i in (length(data)-4):length(data)) {
output[[i]] <- get_table(data, column, i)
}
# for some reason, there are NULL list items for all other columns
output <- compact(output)
# rename and print
names(output) <- names(data)[(length(data)-4):length(data)]
print(output)
}
### execute
output_list(mydataset1, "column_000")
# no error now that i is passed explicitly
### manually executing the function works fine
# empty list
output <- list()
# for loop - go through each flagging column
for (i in (length(mydataset1)-4):length(mydataset1)) {
output[[i]] <- get_table(mydataset1, "column_000", i)
}
# for some reason, there are NULL list items for all other columns
output <- compact(output)
# rename and print
names(output) <- names(mydataset1)[(length(mydataset1)-4):length(mydataset1)]
print(output)
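The scoping behaviour behind the error can be reproduced in a few lines. This is a minimal sketch with hypothetical names (show_index, caller, loop_index), assuming no object called loop_index exists in the session beforehand: a function looks up free variables where it was defined, not where it is called from.

```r
# show_index() uses loop_index without taking it as an argument
show_index <- function() loop_index

# caller() has its own local loop_index, but show_index() cannot see it
caller <- function() {
  loop_index <- 1
  show_index()
}

# With no global loop_index, the lookup fails
first_try <- tryCatch(caller(), error = function(e) "error")

# Once a global loop_index exists, show_index() finds that one,
# not the local value inside caller()
loop_index <- 2
second_try <- caller()

first_try
second_try
```

This mirrors the question: running the loop at the top level creates a global i, which hides the missing argument until the loop is moved inside a function.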

Related

What line(s) can I add to only pull unique values before counting all values and writing the CSV file?

I am trying to create a CSV file that lists the number of unique rows in each file of my dataset. My data is in a folder that contains 200+ CSV files, all with 9 columns and a varying number of rows. Some files have no duplicates, but many have duplicate values. I have found some code that lists how many rows are in each file, and I am wondering what I could add to it so that it removes the duplicate values and counts only the unique values in the final output CSV. I would like the final CSV file to list the row count of each of the 200+ files in one sheet.
The code I found is below
library(tidyverse)
csv.file <- list.files("TestA") # Directory with your .csv files
data.frame.output <- data.frame(number_of_cols = NA,
number_of_rows = NA,
name_of_csv = NA) #The df to be written
MyF <- function(x){
csv.read.file <- data.table::fread(
paste("TestA", x, sep = "/")
)
number.of.cols <- ncol(csv.read.file)
number.of.rows <- nrow(csv.read.file)
data.frame.output <<- add_row(data.frame.output,
number_of_cols = number.of.cols,
number_of_rows = number.of.rows,
name_of_csv = str_remove_all(x,".csv")) %>%
filter(!is.na(name_of_csv))
}
map(csv.file, MyF)
data.table::fwrite(data.frame.output, file = "Output1.csv")
I appreciate any guidance as I am a total R/coding beginner.
The following function accepts a vector of file names, reads them one by one, removes duplicated rows and outputs a data.frame with numbers of columns and rows and CSV filename.
There is no need to previously create a results data.frame data.frame.output.
MyF <- function(x, path = "TestA"){
f <- function(x, path) {
# commented out to test the function
# uncomment these 3 lines and comment out the next one
#csv.read.file <- data.table::fread(
# file.path(path, x)
#)
csv.read.file <- data.table::fread(x)
i_dups <- (duplicated(csv.read.file) | duplicated(csv.read.file, fromLast = TRUE))
csv.read.file <- csv.read.file[!i_dups, ]
#
number.of.cols <- ncol(csv.read.file)
number.of.rows <- nrow(csv.read.file)
#
name_of_csv <- if(is.na(x)) NA_character_ else basename(x)
name_of_csv <- tools::file_path_sans_ext(name_of_csv)
#
data.frame(number_of_cols = number.of.cols,
number_of_rows = number.of.rows,
name_of_csv = name_of_csv) |>
dplyr::filter(!is.na(name_of_csv))
}
#
y <- purrr::map(x, f, path = path)
data.table::rbindlist(y)
}
data.frame.output <- MyF(csv.file)
data.table::fwrite(data.frame.output, file = "Output1.csv")
I find this for loop version better. Though for loops are not considered very idiomatic in R, there is nothing wrong with them. Like the function above, it avoids assignment in the parent environment with the operator <<-, and the code is simpler. The results data.frame data.frame.output is created beforehand with the number of rows equal to the length of the input filenames vector, and assignment is done by replacing the NA values with each CSV file's values.
MyF <- function(x, path = "TestA"){
data.frame.output <- data.frame(number_of_cols = rep(NA, length(x)),
number_of_rows = rep(NA, length(x)),
name_of_csv = rep(NA, length(x)))
for(i in seq_along(x)) {
# commented out to test the function
# uncomment this line and comment out the next one
#fl_name <- file.path(path, x[i])
fl_name <- x[i]
#
csv.read.file <- data.table::fread(fl_name)
i_dups <- (duplicated(csv.read.file) | duplicated(csv.read.file, fromLast = TRUE))
csv.read.file <- csv.read.file[!i_dups, ]
#
data.frame.output$number_of_cols[i] <- ncol(csv.read.file)
data.frame.output$number_of_rows[i] <- nrow(csv.read.file)
#
name_of_csv <- if(is.na(fl_name)) NA_character_ else basename(fl_name)
name_of_csv <- tools::file_path_sans_ext(name_of_csv)
data.frame.output$name_of_csv[i] <- name_of_csv
}
#
data.frame.output |> dplyr::filter(!is.na(name_of_csv))
}
MyF(csv.file)
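Note that the mask duplicated(x) | duplicated(x, fromLast = TRUE) used in both versions drops every copy of a repeated row, including the first one. If the goal is instead to keep one copy of each distinct row, unique() gives a different count. A small sketch on made-up data (the data frame toy_rows is hypothetical) shows the difference:

```r
# Rows 1 and 2 are identical; rows 3 and 4 occur once each
toy_rows <- data.frame(a = c(1, 1, 2, 3),
                       b = c("x", "x", "y", "z"))

# Keep one copy of each distinct row
n_distinct_rows <- nrow(unique(toy_rows))

# Drop every row that has a duplicate anywhere (the approach used above)
i_dups <- duplicated(toy_rows) | duplicated(toy_rows, fromLast = TRUE)
n_never_repeated <- nrow(toy_rows[!i_dups, ])

n_distinct_rows    # 3
n_never_repeated   # 2
```

Which of the two counts is wanted depends on what "unique values" means for the files in question.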

How to convert character string to executable code in R?

I have a dataframe e.g.
df_reprex <- data.frame(id = rep(paste0("S",round(runif(100, 1000000, 9999999),0)), each=10),
date = rep(seq.Date(Sys.Date(), by=-7, length.out = 10), 100),
var1 = runif(1000, 10, 20),
var2 = runif(1000, 20, 50),
var3 = runif(1000, 2, 5),
var250 = runif(1000, 100, 200),
var1_baseline = rep(runif(100, 5, 10), each=10),
var2_baseline = rep(runif(100, 50, 80), each=10),
var3_baseline = rep(runif(100, 1, 3), each=10),
var250_baseline = rep(runif(100, 20, 70), each=10))
I want to write a function containing a for loop that for each row in the dataframe will subtract every "_baseline" column from the non-baseline column with the same name.
I have created a script that automatically creates a character string containing the code I would like to run:
df <- df_reprex
# get only numeric columns
df_num <- df %>% dplyr::select_if(., is.numeric)
# create a version with no baselines
df_nobaselines <- df_num %>% select(-contains("baseline"))
#extract names of non-baseline columns
numeric_cols <- names(df_nobaselines)
#initialise empty string
mutatestring <- ""
#write loop to fill in string:
for (colname in numeric_cols) {
mutatestring <- paste(mutatestring, ",", paste0(colname, "_change"), "=", colname, "-", paste0(colname, "_baseline"))
# df_num <- df_num %>%
# mutate(paste0(col, "_change") = col - paste0(col, "_baseline"))
}
mutatestring <- substr(mutatestring, 4, 9999999) # remove stuff at start (I know it's inefficient)
mutatestring2 <- paste("df %>% mutate(", mutatestring, ")") # add mutate call
but when I try to call "mutatestring2" it just prints the character string e.g.:
[1] "df %>% mutate( var1_change = var1 - var1_baseline , var2_change = var2 - var2_baseline , var3_change = var3 - var3_baseline , var250_change = var250 - var250_baseline )"
I thought that this part would be relatively easy and I'm sure I've missed something obvious, but I just can't get the text inside that string to run!
I've tried various slightly ridiculous methods but none of them return the desired output (i.e. the result returned by the character string if it was entered into the console as a command):
call(mutatestring2)
eval(mutatestring2)
parse(mutatestring2)
str2lang(mutatestring2)
mget(mutatestring2)
diff_func <- function() {mutatestring2}
diff_func1 <- function() {
a <-mutatestring2
return(a)}
diff_func2 <- function() {str2lang(mutatestring2)}
diff_func3 <- function() {eval(mutatestring2)}
diff_func4 <- function() {parse(mutatestring2)}
diff_func5 <- function() {call(mutatestring2)}
diff_func()
diff_func1()
diff_func2()
diff_func3()
diff_func4()
diff_func5()
I'm sure there must be a very straightforward way of doing this, but I just can't work it out!
How do I convert a character string to something that I can run or pass to a magrittr pipe?
You need to use the text parameter in parse, then eval the result. For example, you can do:
eval(parse(text = "print(5)"))
#> [1] 5
However, using eval(parse()) is normally a very bad idea, and there is usually a more sensible alternative.
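To see why the attempts in the question only echoed the string, it may help to compare the steps on a short, made-up string (code_string is a hypothetical name). Evaluating a character vector just returns it; the text has to be turned into a language object first.

```r
code_string <- "sum(1, 2, 3)"

# A character string is already a value, so eval() returns it unchanged
as_is <- eval(code_string)

# parse(text = ...) turns the text into an unevaluated expression;
# without text =, the first argument is treated as a file name
parsed <- parse(text = code_string)
from_parse <- eval(parsed)

# str2lang() (R >= 3.6) builds a single call; it still needs eval() to run
from_str2lang <- eval(str2lang(code_string))

as_is          # "sum(1, 2, 3)"
from_parse     # 6
from_str2lang  # 6
```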
In your case you can do this without resorting to eval(parse()). For example, in base R you could subtract each baseline column from its matching non-baseline column like this:
baseline <- grep("_baseline$", names(df_reprex), value = TRUE)
non_baseline <- gsub("_baseline", "", baseline)
df_new <- cbind(df_reprex, as.data.frame(setNames(mapply(
function(i, j) df_reprex[[i]] - df_reprex[[j]],
non_baseline, baseline, SIMPLIFY = FALSE),
paste0(non_baseline, "_corrected"))))
Or if you want to keep the whole thing in a single pipe without storing intermediate variables, you could do:
mapply(function(i, j) df_reprex[[i]] - df_reprex[[j]],
gsub("_baseline", "", grep("_baseline$", names(df_reprex), value = TRUE)),
grep("_baseline$", names(df_reprex), value = TRUE),
SIMPLIFY = FALSE) %>%
setNames(gsub("_baseline", "_corrected",
grep("_baseline$", names(df_reprex), value = TRUE))) %>%
as.data.frame() %>%
{cbind(df_reprex, .)}
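Since subtraction between two data frames of the same shape works position by position, the same result can also be sketched without mapply(). The example below uses a small made-up data frame (toy_df) and the "_change" suffix from the question; both are illustrative choices rather than part of the answer above.

```r
toy_df <- data.frame(id = c("S1", "S2"),
                     var1 = c(10, 12),
                     var2 = c(30, 40),
                     var1_baseline = c(5, 6),
                     var2_baseline = c(50, 60))

# Pair each *_baseline column with the column of the same stem
baseline_cols <- grep("_baseline$", names(toy_df), value = TRUE)
value_cols <- sub("_baseline$", "", baseline_cols)

# Non-baseline minus baseline, column by column, stored as *_change
toy_df[paste0(value_cols, "_change")] <- toy_df[value_cols] - toy_df[baseline_cols]

toy_df
```

The two name vectors are built from the same grep() result, so the columns line up pair by pair.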

Convert data.frame into a row with the specific sorting of numeric and character data?

Intro: Working in R, I often need to reorganize information from lists of data.frames to create a summary table. In this example, I start with a single data.frame, and I show my function that converts key information from the data.frame into a single row. Bearing in mind that my desired output requires the sorting of a mixture of numeric and character data, I can’t help wondering if there is an easier technique to do this kind of thing.
My question: Can anyone provide advice, or better yet a solution, for a simpler technique to convert data.frames like these into rows, while respecting the specific sorting of the data?
#sample data
input_df <- data.frame(M1 = c("Age", "Weight", "Speed", "Range"),
dogs = c(100, 120, 85, 105),
cats = c(115, 89, 80, 111),
birds = c(100, 90, 100, 104))
# desired summary row
desired_row <- data.frame(Model = "M1",
dogs = "Weight (120)",
cats = "Age (115), Range (111)",
birds = "Range (104)",
stringsAsFactors = F)
desired_row$Model <- factor(desired_row$Model)
# my function
makeRow <- function(dat1) {
# get model name
mod <- data.frame(Model = names(dat1[1]))
# make list of variables with model variable
d1 <- setNames(lapply(names(dat1)[-1], function(x) cbind(dat1[1],
dat1[x])), names(dat1)[-1])
# create a sorted named vector, largest-to-smallest
sorted_named_vec <- function(x) {
sort(setNames(x[[2]], x[[1]]), decreasing = T)
}
d2 <- lapply(d1, sorted_named_vec)
# implement a criterion to report only top indexes
keep_tops <- function(x) {
ifelse(x == max(x) | x >= 110 | (x > 102) & ((x -
100)/(max(x) - 100) > 0.33), x, "")
}
d3 <- lapply(d2, keep_tops)
# remove blank character elements
remove_blank_elements <- function(x) {
x[nchar(x) > 0]
}
d4 <- lapply(d3, remove_blank_elements)
# collapse variable name with top values and add parenthesis
collapse_to_string <- function(x) {
paste0(names(x), " (", x, "),", collapse = " ")
}
d5 <- lapply(d4, collapse_to_string)
# remove the last comma
remove_last_comma <- function(x) {
gsub("\\,$", "", x)
}
d6 <- lapply(d5, remove_last_comma)
# construct a row from the list
row <- cbind(mod, as.data.frame(d6, stringsAsFactors = F))
row
}
# call
row_output <- makeRow(dat1 = input_df)
row_output
# check output to desired
identical(desired_row, row_output)
I'm not sure whether this is more efficient, but it is slightly less code and, in my opinion, a more direct approach.
makeRow <- function(dat1) {
#make data frame for row with model name
d0 <- data.frame(mod = names(dat1)[1]) #col name changed later
# implement a criterion to report only top indexes -> now return if true or false
keep_tops <- function(x) {
x == max(x) | x >= 110 | (x > 102) & ((x - 100)/(max(x) - 100) > 0.33)
}
vals =c() #empty -> for values of each cols
# make list of variables with model variables(dat1 cols)
#use the columns of the df directly
for(col in 2:ncol(dat1)){
#make temp df with each and evaluate what row to keep in the same line
df = dat1[keep_tops(dat1[,col])==1,c(1,col)]
df[,2] = paste0("(",df[,2],")") #add the () around the numbers
val = apply(as.data.frame(apply(df, 1, paste0, collapse=" ")), 2, paste0, collapse=", ") #collapse rows, then cols
vals = c(vals, val) #add this variable values to the values' list
}
# bind the first col made earlier with these values
row <- cbind(d0, as.data.frame(t(vals), stringsAsFactors = F))
colnames(row) = colnames(dat1) #rename the columns to match
row
}
# call
row_output <- makeRow(dat1 = input_df)
# check output to desired
identical(desired_row$birds, row_output$birds)
With your input_df, this identical() check on the birds column returned TRUE.
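Another way to shorten the function is to sort once with order() and build each cell with a single paste0() call. The following is only a sketch under the same selection criterion as the question; the name make_row is hypothetical, and it has been checked only against the sample input_df.

```r
input_df <- data.frame(M1 = c("Age", "Weight", "Speed", "Range"),
                       dogs = c(100, 120, 85, 105),
                       cats = c(115, 89, 80, 111),
                       birds = c(100, 90, 100, 104))

# Same criterion as in the question, returned as a logical vector
keep_tops <- function(x) {
  x == max(x) | x >= 110 | (x > 102 & (x - 100) / (max(x) - 100) > 0.33)
}

make_row <- function(dat) {
  labels <- dat[[1]]
  cells <- lapply(dat[-1], function(x) {
    ord <- order(x, decreasing = TRUE)   # largest to smallest
    keep <- ord[keep_tops(x[ord])]       # row positions that pass the criterion
    paste0(labels[keep], " (", x[keep], ")", collapse = ", ")
  })
  # First column holds the model name, taken from the first column's header
  data.frame(Model = factor(names(dat)[1]), cells, stringsAsFactors = FALSE)
}

row_sketch <- make_row(input_df)
row_sketch
```

Collapsing with ", " as the separator means there is no trailing comma to strip afterwards.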

Arguments imply differing number of rows for an iteration loop

I have a problem saving the results of every iteration of a loop into a single data frame.
library(rscopus)
library(dplyr)
auth_token_header("d2f02ad55dcfc907212f0e6b216bf847")
akey="d2f02ad55dcfc907212f0e6b216bf847"
set_api_key(akey)
df = data.frame(doi = c("10.1109/TPAMI.2018.2798607", "10.1109/CNS.2017.8228696"))
df_references <- NULL
for (i in 1:nrow(df)) {
x = abstract_retrieval(df$doi[i], identifier= "doi")
for (a in 1:length(x$content$`abstracts-retrieval-response`$`item`$bibrecord$tail$`bibliography`$reference)){
call_str <- paste("ref <- x$content$`abstracts-retrieval-response`$`item`$bibrecord$tail$`bibliography`$reference[[",a,"]]$`ref-info`$`ref-title`")
eval(parse(text = call_str))
df_references <- rbind(df_references, data.frame(initial_paper = df$doi[i],
ref_title = ref))
}
}
I expect the results of every iteration to be saved into one data frame.
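One plausible cause of the error in the title is that the extracted title is NULL or zero-length for some references, so data.frame() receives columns of different lengths. The sketch below does not call the API: references is a hypothetical stand-in for the nested reference list, and it assumes a title, when present, is a single string (the real response may be structured differently). It also indexes the list directly, which makes eval(parse()) unnecessary.

```r
# Hypothetical stand-in for the nested reference list of one paper
references <- list(
  list(`ref-info` = list(`ref-title` = "First cited title")),
  list(`ref-info` = list()),                       # this reference has no title
  list(`ref-info` = list(`ref-title` = "Third cited title"))
)
doi <- "10.1109/CNS.2017.8228696"

# Build one single-row data frame per reference, using NA when the title is missing
rows <- lapply(references, function(r) {
  title <- r[["ref-info"]][["ref-title"]]
  if (is.null(title) || length(title) == 0) title <- NA_character_
  data.frame(initial_paper = doi, ref_title = title, stringsAsFactors = FALSE)
})

# Bind all rows at once instead of growing the data frame inside the loop
df_references <- do.call(rbind, rows)
df_references
```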

Two same type of dataframes perform differently in a function

Below is my data
set.seed(100)
toydata <- data.frame(A = sample(1:50,50,replace = T),
B = sample(1:50,50,replace = T),
C = sample(1:50,50,replace = T)
)
Below is my swapping function
derangement <- function(x){
if(max(table(x)) > length(x)/2) return(NA)
while(TRUE){
y <- sample(x)
if(all(y != x)) return(y)
}
}
swapFun <- function(x, n = 10){
inx <- which(x < n)
y <- derangement(x[inx])
if(length(y) == 1) return(NA)
x[inx] <- y
x
}
In the first case, I get the new data toy by swapping the entire dataframe. The code is below:
toydata<-as.matrix(toydata)
toy<-swapFun(toydata)
toy<-as.data.frame(toy)
In the second case, I get the new data toy by swapping each column respectively. Below is the code:
toydata<-as.data.frame(toydata)
toy2 <- toydata # Work with a copy
toy2[] <- lapply(toydata, swapFun)
toy<-toy2
Below is the function that outputs the difference between the contingency tables before and after swapping.
# the function to compare contingency tables
f = function(x,y){
table1<-table(toydata[,x],toydata[,y])
table2<-table(toy[,x],toy[,y])
sum(abs(table1-table2))
}
# vectorise your function
f = Vectorize(f)
combn(x=names(toydata),
y=names(toydata), 2) %>%# create all combinations of your column names
t() %>% # transpose
data.frame(., stringsAsFactors = F) %>% # save as dataframe
filter(X1 != X2) %>% # exclude pairs of same
# column
mutate(SumAbs = f(X1,X2)) # apply function
In the second case, this mutate call works.
But in the first case, the mutate call does not work. It says:
+ filter(X1 != X2) %>% # exclude pairs of same column
+ mutate(SumAbs = f(X1,X2)) # apply function
Error in combn(x = names(toydata), y = names(toydata), 2) : n < m
However in the two cases, the toy data are all dataframes with the same dimension, the same row names and the same column names. I feel confused.
How can I fix it? Thanks.
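The error message points at combn() rather than mutate(). In the first case toydata is still a matrix after toydata <- as.matrix(toydata), and names() returns NULL for a matrix, so combn() appears to receive an empty x and reports n < m. (combn() also has no y parameter; the extra argument is absorbed by `...`.) A small sketch with made-up data (toy_df and toy_mat are hypothetical names) shows the difference and one possible fix using colnames():

```r
toy_df <- data.frame(A = 1:3, B = 4:6, C = 7:9)
toy_mat <- as.matrix(toy_df)

df_names <- names(toy_df)       # column names of the data frame
mat_names <- names(toy_mat)     # NULL: a matrix has dimnames, not names
mat_colnames <- colnames(toy_mat)

# colnames() works for both matrices and data frames
pairs <- t(combn(colnames(toy_mat), 2))
pairs
```

Converting toydata back with as.data.frame() before building the pairs, as the second case does, would have the same effect.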
