Creating function to read data set and columns and displaying nrow - r

I am struggling a bit with a probably fairly simple task. I wanted to create a function whose arguments are a dataframe (df), two column names from that dataframe (T and R), and a value from each selected column (a and b). I know that the function reads the dataframe, but I don't know how the columns are selected, and I'm getting an error.
fun <- function(df,T,a,R,b)
{
col <- ds[c("x","y")]
omit <- na.omit(col)
data1 <- omit[omit$x == 'a',]
data2 <- omit[omit$x == 'b',]
nrow(data2)/nrow(data1)
}
fun(jugs,Place,UK,Price,10)
I'm new to the R language, so please help me.

There are several errors you're making.
col <- ds[c("x","y")]
What are x and y? Presumably they're arguments that you're passing, but you specify T and R in your function, not x and y.
data1 <- omit[omit$x == 'a',]
data2 <- omit[omit$x == 'b',]
Again, presumably you want a and b to be arguments you passed to the function, but you specified the literal strings 'a' and 'b', not the arguments. Also, I assume that the second "omit$x" should be "omit$y" (or vice versa). And since you just made this into a new data frame with two columns, you can simply use the column index.
nrow(data2)/nrow(data1)
You should print this line, or return it. Either one should suffice.
fun(jugs,Place,UK,Price,10)
Finally, you should use quotes on Place, UK, and Price, at least the way I've done it.
fun <- function(df, col1, val1, col2, val2){
new_cols <- df[,c(col1, col2)]
omit <- na.omit(new_cols)
data1 <- omit[omit[,1] == val1,]
data2 <- omit[omit[,2] == val2,]
print(nrow(data2)/nrow(data1))
}
fun(jugs, "Place", "UK", "Price", 10)
And if I understand what you're trying to do, it may be easier to avoid creating multiple dataframes that you don't need and just use counts instead.
fun <- function(df, col1, val1, col2, val2){
new_cols <- df[,c(col1, col2)]
omit <- na.omit(new_cols)
n1 <- sum(omit[,1] == val1)
n2 <- sum(omit[,2] == val2)
print(n2/n1)
}
fun(jugs, "Place", "UK", "Price", 10)

I would write this function as follows:
fun <- function(df,T,a,R,b) {
data <- na.omit(df[c(T,R)]);
sum(data[[R]]==b)/sum(data[[T]]==a);
};
As you can see, you can combine the first two lines into one, because in your code col was not reused anywhere. Secondly, since you only care about the number of rows of the two subsets of the intermediate data.frame, you don't actually need to construct those two data.frames; instead, you can just compute the logical vectors that result from the two comparisons, and then call sum() on those logical vectors, which naturally treats FALSE as 0 and TRUE as 1.
Demo:
fun <- function(df,T,a,R,b) { data <- na.omit(df[c(T,R)]); sum(data[[R]]==b)/sum(data[[T]]==a); };
df <- data.frame(place=c(rep(c('p1','p2'),each=4),NA,NA), price=c(10,10,20,NA,20,20,20,NA,20,20), stringsAsFactors=F );
df;
## place price
## 1 p1 10
## 2 p1 10
## 3 p1 20
## 4 p1 NA
## 5 p2 20
## 6 p2 20
## 7 p2 20
## 8 p2 NA
## 9 <NA> 20
## 10 <NA> 20
fun(df,'place','p1','price',20);
## [1] 1.333333

Related

How to efficiently store and retrieve parameters/arguments used during data processing in lists

I work with a large count table, and for my analyses I usually need to split this table into subsets based on observations, variables, values, or context information.
# generating toy data
count_df1 <- data.frame(
column1 = c(1:50),
column2 = runif(50, 1, 10),
column3 = runif(50, 1, 10)
)
count_df2 <- data.frame(
column1 = c(1:50),
column2 = runif(50, 1.5, 9),
column3 = runif(50, 1.5, 9)
)
list_count_df <- list(count_df1 = count_df1, count_df2 = count_df2)
I learned to use lists and for loops to process all resulting subsets in the same manner. I'd rather use for loops than apply because I use the names of the objects (with the help of counters) to keep track of how I modified them, and I don't know how to do this with e.g. lapply.
# set values to iterate over
thresholds <- c(2, 4)
conditions <- c(TRUE, FALSE)
# perform some kind of subsetting and store the parameters used
output_list <- list()
counter <- 0
for (current_threshold in thresholds) {
for (count_df in list_count_df) {
counter <- counter + 1
# modify the name to keep track of changes
current_name <- paste(names(list_count_df)[counter], current_threshold, sep = "_")
output_list[[current_name]] <- subset(count_df, column2 < current_threshold)
}
counter <- 0
}
Additionally, the time-consuming part is usually the main function of the body, so a loop with the reduced overhead of apply would probably not save much time (I'm still open to this).
After I'm done preparing the various subsets and subjecting them to the analysis, I need to store the analysis results and the accompanying parameters for the different subsets. That is probably a common task.
# allocate for df to store the results
result_length <- length(output_list) * length(conditions)
df_headers <- c("Names", "Threshold", "Input_table", "Standard_deviation", "Scaling")
df_results <- setNames(data.frame(matrix(ncol = length(df_headers),
nrow = result_length)), df_headers)
# perform some analyses (here: PCA) on the dfs while looping over
# analysis parameters and storing some results directly
iii <- 0
table_counter <- 0
for (item in output_list) {
table_counter <- table_counter + 1
for (condition in conditions) {
iii <- iii + 1
current_name <- paste(names(output_list)[table_counter], condition, sep = "_")
tmp <- prcomp(item, scale = condition)
# let's pretend we are only interested in standard deviation per item
df_results[iii, 1] <- current_name
df_results[iii, 4] <- tmp$sdev[1]
rm(tmp)
}
}
However, I'm partly doing this by extracting parts of the name of the object, which is highly repetitive, very custom, and has to be changed for each additional step included beforehand. As I want to start my own package soon, this is nothing another user could easily follow.
# extract more values from the name of the former object
df_results$Threshold <- as.numeric(sapply(strsplit(as.character(df_results$Names), '_'), "[", 3))
df_results$Input_table <- as.factor(sapply(strsplit(as.character(df_results$Names), '_'), "[", 2))
df_results$Scaling <- as.factor(sapply(strsplit(as.character(df_results$Names), '_'), "[", 4))
df_results
# now I could convert this into long format, do plotting, etc.
I provided a short example above of how such a workflow could look. My questions are:
1) What are the general good practices on how to store parameters used for and how to extract them after processing?
2) If the solution is too case-specific for a general approach:
a) any ideas what to change here?
b) Are lists and/or for loops the way to go at all?
I do it this way because modifying names in lapply is unclear to me, and without the names I lose track of what is what. I also would not know how to efficiently handle all these different subsets in one big data.frame.
Please consider that my original data contains numerical, factor, and character columns with hundreds of rows/observations and tens of thousands of columns/variables.
Honestly, there are many ways to do this and it will come down to personal preference. One common way would be to define a class object that sets the standard for how you access information on it. Creating a class means that you can write S3 methods too. This can give you more flexibility in how you generate your class, depending on whether you are working on a list, a data frame, or just a vector.
generate_foo <- function(x, ...){
UseMethod("generate_foo")}
generate_foo.default <- function(x, current_threshold, conditions, name = NULL){
if(is.null(name)){
name <- as.character(substitute(x))
}
x <- x[x[["column2"]]<current_threshold,]
tmp <- tryCatch({prcomp(x, scale = conditions)}, error=function(er){return("Error")})
retval <- list(list(subset = x,
pcaObj = tmp, #could store the entire object or just the parts you care about.
subsetparam = current_threshold,
condition = conditions,
name = name))
class <- "foo"
return(retval)
}
generate_foo.list <- function(x,
current_threshold,
conditions, name = NULL){
if(is.null(name)||length(name)!=length(x)){
name <- names(x)
}
#Generate combinations
combi <- tidyr::separate( #generate all the possible combination indexes at once
data.frame(
indx = levels(suppressWarnings(interaction(1:length(x),
1:length(current_threshold),
1:length(conditions))))),
col = "indx", into = c("df","thresh","cond"), sep = "\\.")
x <- x[as.numeric(combi$df)]
name <- name[as.numeric(combi$df)]
current_threshold <- current_threshold[as.numeric(combi$thresh)]
conditions <- conditions[as.numeric(combi$cond)]
foolist <- mapply(FUN = generate_foo.default,
x = x,
current_threshold = current_threshold,
conditions = conditions,
name = name)
class(foolist) <- "foolist"
return(foolist)
}
With this method when you call:
foo <- generate_foo(x = list_count_df,
current_threshold = thresholds,
conditions = conditions,
name = c("Custname1","Custname2"))
You will end up with a list of objects of class "foo". Specifically, in this case the resulting object has length 8, each element containing 5 fields: subset, pcaObj, subsetparam, condition, and name. Because pcaObj sometimes throws an error if the subset is too small, a tryCatch call prevents the code from failing.
Take it a step further by writing custom print and summary functions!
#summary
summary.foolist <- function(x){
subsetdim <- unlist(lapply(x, function(y){prod(dim(y[["subset"]]))}))
pcasdev <- unlist(lapply(x, function(y){y[["pcaObj"]]$sdev[1]}))
subsetparam <- unlist(lapply(x, function(y){y[["subsetparam"]]}))
condition <- unlist(lapply(x, function(y){y[["condition"]]}))
name <- unlist(lapply(x,function(y){y[["name"]]}))
df <- data.frame(SubsetDim=subsetdim, PCAsdev=pcasdev, SubsetParam=subsetparam, condition=condition, name = name)
return(df)
}
summary(foo)
SubsetDim PCAsdev SubsetParam condition name
1 24 1.207833 2 TRUE Custname1
2 6 1.732051 2 TRUE Custname2
3 54 1.324284 4 TRUE Custname1
4 33 1.372508 4 TRUE Custname2
5 24 16.258848 2 FALSE Custname1
6 6 12.024556 2 FALSE Custname2
7 54 15.592938 4 FALSE Custname1
8 33 14.057929 4 FALSE Custname2
Using a convention like this ensures your data is stored in a canonical way. Of course there are many ways you can choose to build your custom R class and object.
You could make one function that makes a list of subsetted dataframes and set that as one class. Then make another function that performs the analysis and generates a new class object. As long as you stick to building a named list, then accessing parts of the object become easier because they are organized.
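In the same spirit, a minimal print method might look like this (a sketch, not from the original answer; it assumes the field names used above):
print.foolist <- function(x, ...){
# one line per analysis: which subset, which parameters, how many rows survived
for (y in x) {
cat(y[["name"]], "| threshold:", y[["subsetparam"]],
"| scaled:", y[["condition"]],
"| rows in subset:", nrow(y[["subset"]]), "\n")
}
invisible(x)
}
print(foo)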
Functional solution
0. Generate source data frame
# for reproducibility of random tasks
set.seed(1)
df <- data.frame(
col1 = c(1:100),
col2 = c(runif(50,1,10), runif(50,11,20)),
col3 = c(runif(50,1,10), runif(50,11,20))
)
# so half of the rows have numbers 1 to 10 in col2 and col3
# and the other half have 11 to 20 in col2 and col3.
# let's randomize the order of rows
df <- df[sample(1:100),]
# and take this data frame `df` as our source data frame
# from which we will do the analysis.
1. Problem description
We want to subdivide the original df into sub data frames
applying 2 different criteria.
Then, we analyze each sub data frame
using all possible combinations of 2 different parameters,
and finally collect all analysis values and aggregate them into a data frame.
The criteria:
criterion 1: if the col2 value is <= 10, we assign the row to "df1", otherwise to "df2".
categories c("df1", "df2").
criterion 2: if the col3 value is lower than the first limit, the row is assigned 'class5';
if the col3 value is > the first limit but <= the second limit, assign 'class15';
other cases don't interest us - let's assign 'other'.
categories c("class5", "class15", "other") # each row will be assigned to
one of them
We want a separate sub data frame for each combination of the two criteria,
on which the analysis should be done.
The parameters for the analysis:
parameter1: 'scale.=' c(TRUE, FALSE)
parameter_categories c("sc+", "sc-")
parameter2: 'center=' c(TRUE, FALSE)
parameter_categories c("cen+", "cen-")
The analysis result value:
We want a separate result for each combination of the two parameters: the values
for 'Standard deviation' -
3 stddev columns, one each for PC1, PC2, PC3
Additional information to be collected:
we want a distinguishable (unique) name for each combination
2. What the entire analysis will look like:
# 0. categorize and split data frame
categories1 <- c("df1", "df2")[cut(df[, "col2"], c(1, 11, 20, Inf))]
categories2 <- c("class5", "class15", "other")[cut(df[, "col3"], c(-Inf, 5, 15, Inf))]
dfs <- split(df, gsub("class", "", paste(categories1, categories2, sep="_")))
# 1. Declare parameters and prepare all parameter combinations
parameters1 <- list("scale." = TRUE, "scale."=FALSE)
np1 <- c("scpos", "scneg")
parameters2 <- list("center"=TRUE, "center"=FALSE)
np2 <- c("cpos", "cneg")
params_list <- named_cross_combine(parameters1, parameters2, np1, np2, sep="_")
# 2. Apply analysis over all sub dfs and parameter combinations
# and extract and aggregate analysis results into a final data frame
df_final <- apply_extract_aggravate(
dfs=dfs,
params=params_list,
analyzer_func=prcomp,
extractor_func=function(x) x$sdev, # extractor must return a vector
col_names=c("df", "limits", "scale", "center", "std_PC1", "std_PC2", "std_PC3"),
sep="_" # separator for names
)
# 3. rename parameter column contents
df_final$scale <- unlist(lookup(df_final$scale, np1, parameters1))
df_final$center <- unlist(lookup(df_final$center, np2, parameters2))
df_final:
df limits scale center std_PC1 std_PC2 std_PC3
df1_15_scpos_cpos df1 15 TRUE TRUE 1.205986 0.9554013 0.7954906
df1_15_scpos_cneg df1 15 TRUE FALSE 1.638142 0.5159250 0.2243043
df1_15_scneg_cpos df1 15 FALSE TRUE 15.618145 2.4501942 1.3687843
df1_15_scneg_cneg df1 15 FALSE FALSE 31.425246 5.9055013 1.7178626
df1_5_scpos_cpos df1 5 TRUE TRUE 1.128371 1.0732246 0.7582659
df1_5_scpos_cneg df1 5 TRUE FALSE 1.613217 0.4782639 0.4108470
df1_5_scneg_cpos df1 5 FALSE TRUE 13.525868 2.5524661 0.9894493
df1_5_scneg_cneg df1 5 FALSE FALSE 30.007511 3.9094993 1.6020638
df2_15_scpos_cpos df2 15 TRUE TRUE 1.129298 1.0069030 0.8431092
df2_15_scpos_cneg df2 15 TRUE FALSE 1.720909 0.1523516 0.1235295
df2_15_scneg_cpos df2 15 FALSE TRUE 14.061532 2.4172787 1.2348606
df2_15_scneg_cneg df2 15 FALSE FALSE 80.543382 3.8409639 1.8480111
df2_other_scpos_cpos df2 other TRUE TRUE 1.090057 0.9588241 0.9446865
df2_other_scpos_cneg df2 other TRUE FALSE 1.718190 0.1881516 0.1114570
df2_other_scneg_cpos df2 other FALSE TRUE 15.168160 2.5579403 1.3354016
df2_other_scneg_cneg df2 other FALSE FALSE 82.297724 5.0580949 1.9356444
3. Explanation step by step
3.1 Declare Helper functions
# for preparing parameter combinations as lists
named_cross_combine <- function(seq1, seq2, seq1_names, seq2_names, sep="_") {
res <- list()
i <- 1
namevec <- c()
for (j1 in seq_along(seq1)) {
for (j2 in seq_along(seq2)) {
res[[i]] <- c(seq1[j1], seq2[j2])
namevec[i] <- paste0(seq1_names[j1], sep, seq2_names[j2])
i <- i + 1
}
}
names(res) <- namevec
res
}
# correctly named params list - `sep=` determines how names are joined
# you can apply `gsub()` on the namevec before assignment to adjust further the names.
# useful for doing analysis
do.call2 <- function(fun, x, rest) {
do.call(fun, c(list(x), rest))
}
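# (Illustration, not part of the original answer: do.call2() simply prepends
# the data argument, so do.call2(prcomp, df, list(scale. = TRUE)) is
# equivalent to prcomp(df, scale. = TRUE).)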
apply_parameters <- function(funcname,
dfs,
params) {
lapply(dfs, function(df) lapply(params, function(pl) do.call2(funcname, df, pl)))
}
split_names_to_data_frame <- function(names_vec, sep) {
res <- lapply(names_vec, function(s) strsplit(s, sep)[[1]])
df <- Reduce(rbind, res)
# colnames(df) <- col_names
rownames(df) <- names_vec
df
}
apply_to_subdf_and_combine <- function(
res_list,
accessor_func=function(x) x, # subdf result
subdf_level_combiner_func=as.data.frame, # within subdf result
combine_prepare_func=function(x) x, # applied on each subdf result
final_combiner_func=rbind, # combine the results
col_names=NULL, # column names for final
sep="_") { # joiner for names
res_accessed_combined <- lapply(res_list,
function(x) do.call(what=subdf_level_combiner_func,
list(lapply(x, accessor_func))))
res_prepared <- lapply(res_accessed_combined, combine_prepare_func)
res_df <- Reduce(final_combiner_func, res_prepared)
rownames(res_df) <- paste(unlist(sapply(names(res_prepared), rep, nrow(res_prepared[[1]]))),
unlist(sapply(res_prepared, rownames)),
sep = sep)
names_df <- split_names_to_data_frame(rownames(res_df), sep = sep)
final_df <- as.data.frame(cbind(names_df, res_df))
if (!is.null(col_names)) {
colnames(final_df) <- col_names
}
final_df
}
# for simplifying the function call
extract_and_combine <- function(res_list,
result_extractor_func,
col_names,
sep="_") {
apply_to_subdf_and_combine(
res_list = res_list,
accessor_func = result_extractor_func,
subdf_level_combiner_func=as.data.frame,
combine_prepare_func=function(x) as.data.frame(t(x)),
final_combiner_func=rbind,
col_names=col_names,
sep=sep
)
}
# for even more simplifying function call
apply_extract_aggravate <- function(dfs,
params,
analyzer_func,
extractor_func,
col_names,
sep="_") {
extract_and_combine(
res_list=apply_parameters(funcname=analyzer_func, dfs=dfs, params=params),
result_extractor_func=extractor_func,
col_names=col_names,
sep=sep
)
}
# useful for renaming the data frame columns values
lookup <- function(x, seq1, seq2) {
seq2[sapply(x, function(x) which(x == seq1))]
}
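As a quick illustration of lookup() (my addition, reusing np1 and parameters1 from step 1 above), it maps the short labels back to the original parameter values:
lookup(c("scneg", "scpos"), np1, parameters1)
## $scale.
## [1] FALSE
## $scale.
## [1] TRUE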
3.2 Categorize and split data frame
categories1 <- c("df1", "df2")[cut(df[, "col2"], c(1, 11, 20, Inf))]
categories2 <- c("5", "15", "other")[cut(df[, "col3"], c(-Inf, 5, 15, Inf))]
dfs <- split(df, gsub("class", "", paste(categories1, categories2, sep="_")))
But to have full control over categorization, you can
declare your own categorizer functions and categorize and
split the data frame:
# write rules for criterium1 1 element as function
categorizer1 <- function(x) {
if (1 <= x && x <= 10) {
"df1"
} else if (11 <= x && x <= 20) {
"df2"
}
}
# vectorize it to be able to apply it on entire columns
categorizer1 <- Vectorize(categorizer1)
# do the same for criterion 2
categorizer2 <- function(x) {
if (x <= 5) {
"class5"
} else if (5 < x && x <= 15) {
"class15"
} else {
"other"
}
}
categorizer2 <- Vectorize(categorizer2)
# apply on col2 and col3 the corresponding categorizers
categories1 <- categorizer1(df[,"col2"])
categories2 <- categorizer2(df[,"col3"])
# get the list of sub data frames according to categories
dfs <- split(df, gsub("class", "", paste(categories1, categories2, sep="_")))
# Let the categorizer functions return strings and
# for the second argument use `paste()` with `sep=` to determine
# how the names should be combined - here with "_".
# Use `gsub(pattern, replacement, x, ignore.case=F, perl=T)`
# to process the name using regex patterns to how you want it at the end.
# Here, we remove the bulky "class".
3.3 Declare parameters as lists, along with the short names used in the final names
parameters1 <- list("scale." = TRUE, "scale."=FALSE)
np1 <- c("scpos", "scneg")
parameters2 <- list("center"=TRUE, "center"=FALSE)
np2 <- c("cpos", "cneg")
# prepare all combinations of them in a list of lists
params_list <- named_cross_combine(parameters1, parameters2, np1, np2, sep="_")
# this produces a list of all possible parameter combination lists.
# Each parameter combination has to be kept itself in a list, because
# `do.call()` later requires the parameters being in a list.
# `named_cross_combine()` takes care of correct naming,
# joining the names using `sep` values.
# The first element in `parameters1` is taken and paired with each element of
# `parameters2`. Then the second element of `parameters1` goes through all of `parameters2`, etc.
3.4 Apply all parameters over dfs and collect the results into a data frame
df_final <- apply_extract_aggravate(
dfs=dfs,
params=params_list,
analyzer_func=prcomp,
extractor_func=function(x) x$sdev, # extractor must return a vector
col_names=c("df", "limits", "scale", "center", "std_PC1", "std_PC2", "std_PC3"),
sep="_" # separator for names
)
# This function takes the dfs and the parameters list and runs the
# analyzer_func, here `prcomp()`, over all combinations of both.
# The `extractor_func` must be chosen in a way that the returned result is a vector.
# If it is already a vector, set here `function(x) x` the identity function.
# The column names should give new names to the resulting columns.
# The number of names is determined by:
# - the number of categoriesN,
# - the number of parametersN,
# - the number of elements of result after extractor_func() was applied.
# `sep=` determines which joiner is used for joining the names.
3.5 Finally, rename parameter columns' contents by using lookup() + previously declared parameter lists (parametersN) with their corresponding name vectors (npN)
df_final$scale <- unlist(lookup(df_final$scale, np1, parameters1))
df_final$center <- unlist(lookup(df_final$center, np2, parameters2))
# Two parameter columns, so two commands.
This converts df_final from this:
# df limits scale center std_PC1 std_PC2 std_PC3
# df1_15_scpos_cpos df1 15 scpos cpos 1.205986 0.9554013 0.7954906
# df1_15_scpos_cneg df1 15 scpos cneg 1.638142 0.5159250 0.2243043
# df1_15_scneg_cpos df1 15 scneg cpos 15.618145 2.4501942 1.3687843
# df1_15_scneg_cneg df1 15 scneg cneg 31.425246 5.9055013 1.7178626
# df1_5_scpos_cpos df1 5 scpos cpos 1.128371 1.0732246 0.7582659
# df1_5_scpos_cneg df1 5 scpos cneg 1.613217 0.4782639 0.4108470
# df1_5_scneg_cpos df1 5 scneg cpos 13.525868 2.5524661 0.9894493
# df1_5_scneg_cneg df1 5 scneg cneg 30.007511 3.9094993 1.6020638
# df2_15_scpos_cpos df2 15 scpos cpos 1.129298 1.0069030 0.8431092
# df2_15_scpos_cneg df2 15 scpos cneg 1.720909 0.1523516 0.1235295
# df2_15_scneg_cpos df2 15 scneg cpos 14.061532 2.4172787 1.2348606
# df2_15_scneg_cneg df2 15 scneg cneg 80.543382 3.8409639 1.8480111
# df2_other_scpos_cpos df2 other scpos cpos 1.090057 0.9588241 0.9446865
# df2_other_scpos_cneg df2 other scpos cneg 1.718190 0.1881516 0.1114570
# df2_other_scneg_cpos df2 other scneg cpos 15.168160 2.5579403 1.3354016
# df2_other_scneg_cneg df2 other scneg cneg 82.297724 5.0580949 1.9356444
to this:
df limits scale center std_PC1 std_PC2 std_PC3
df1_15_scpos_cpos df1 15 TRUE TRUE 1.205986 0.9554013 0.7954906
df1_15_scpos_cneg df1 15 TRUE FALSE 1.638142 0.5159250 0.2243043
df1_15_scneg_cpos df1 15 FALSE TRUE 15.618145 2.4501942 1.3687843
df1_15_scneg_cneg df1 15 FALSE FALSE 31.425246 5.9055013 1.7178626
df1_5_scpos_cpos df1 5 TRUE TRUE 1.128371 1.0732246 0.7582659
df1_5_scpos_cneg df1 5 TRUE FALSE 1.613217 0.4782639 0.4108470
df1_5_scneg_cpos df1 5 FALSE TRUE 13.525868 2.5524661 0.9894493
df1_5_scneg_cneg df1 5 FALSE FALSE 30.007511 3.9094993 1.6020638
df2_15_scpos_cpos df2 15 TRUE TRUE 1.129298 1.0069030 0.8431092
df2_15_scpos_cneg df2 15 TRUE FALSE 1.720909 0.1523516 0.1235295
df2_15_scneg_cpos df2 15 FALSE TRUE 14.061532 2.4172787 1.2348606
df2_15_scneg_cneg df2 15 FALSE FALSE 80.543382 3.8409639 1.8480111
df2_other_scpos_cpos df2 other TRUE TRUE 1.090057 0.9588241 0.9446865
df2_other_scpos_cneg df2 other TRUE FALSE 1.718190 0.1881516 0.1114570
df2_other_scneg_cpos df2 other FALSE TRUE 15.168160 2.5579403 1.3354016
df2_other_scneg_cneg df2 other FALSE FALSE 82.297724 5.0580949 1.9356444
4. Final remarks
This is not very different from your approach. All the information is collected in the names, and the names are used for generating the part of the data frame which explains the background of the analysis data.
The lookup() function is very useful for renaming the columns for the parameters.
The categorization of a column can be simplified considerably by the cut() function. But in the cut() function you don't have full control over
whether the upper/lower limit is included (<=) or not (<).
That is why declaring your own categorizer functions can sometimes be an advantage (especially for more complex categorizations).
Extensibility
More categories: Just define more categories categories1 categories2 categories3 ...
# then do
dfs <- split(df, paste(categories1, categories2, categories3, ..., sep="_"))
# use `gsub()` around `paste()` or do
# names(dfs) <- gsub("search_term", "replace_term", names(dfs)) - over and over again
# until all names are as they should be.
More parameters: Just define more parametersN - npN pairs.
# then do
params_list <- named_cross_combine(parameters1, parameters2, np1, np2, sep="_")
params_list <- named_cross_combine(params_list, parameters3, names(params_list), np3, sep="_")
params_list <- named_cross_combine(params_list, parameters4, names(params_list), np4, sep="_")
... (and so on ...)
# use then at the end more lines for renaming parameter column contents:
df_final[, prmcol_name1] <- unlist(lookup(df_final[, prmcol_name1], np1, parameters1))
df_final[, prmcol_name2] <- unlist(lookup(df_final[, prmcol_name2], np2, parameters2))
df_final[, prmcol_name3] <- unlist(lookup(df_final[, prmcol_name3], np3, parameters3))
... (and so on ...)
Thus, the number of categories and parameters is easily extensible.
The core helper functions stay the same and don't have to be modified.
(The use of higher-order functions (functions which take functions as arguments) as helper functions is key to their flexibility - one of the strengths of functional programming.)

Conditionally add named elements to a list

I have a function to perform actions on a variable list of dataframes depending on user selections. The function mostly performs generic actions, but there are a few actions that are dataframe-specific.
My code runs fine if all dataframes are selected, but I am unable to get it to work if not all dataframes are selected.
The following provides a minimal reproducible example:
# User switches.
df1Switch <- TRUE
df2Switch <- TRUE
df3Switch <- TRUE
# DF creation.
set.seed(1)
df <- data.frame(X=sample(1:10), Y=sample(11:20))
if (df1Switch) df1 <- df
if (df2Switch) df2 <- df
if (df3Switch) df3 <- df
# Function to do something.
fn_something <- function(file_list, file_names) {
df <- file_list
# Do lots of generic things.
df$Z <- df$X + df$Y
# Do a few specific things.
if (file_names == "Name1") df$X <- df$X + 1
else if (file_names == "Name2") df$X <- df$Z - 1
else if (file_names == "Name3") df$Y <- df$X + df$Y
return(df)
}
# Call function to do something.
file_list <- list(Name1=df1, Name2=df2, Name3=df3)
file_names <- names(file_list)
all_df <- do.call(rbind,mapply(fn_something, file_list, file_names,
SIMPLIFY=FALSE))
In this case the code runs fine as the user has selected to create all three dataframes. I use a named list so that the specific actions can be performed against the correct dataframes.
The output looks something like this (the actual numbers aren't important):
X Y Z
Name1.1 4 13 16
Name1.2 5 12 16
Name1.3 6 16 21
: : : :
Name2.1 15 13 16
: : : :
The problem arises if the user selects not to create some dataframes, e.g.:
# User switches.
df1Switch <- TRUE
df2Switch <- FALSE
df3Switch <- TRUE
Not surprisingly, in this case an object not found error results:
> # Call function to do something.
> file_list <- list(Name1=df1, Name2=df2, Name3=df3)
Error: object 'df2' not found
What I would like to do is conditionally specify the contents of file_list along the lines of this pseudo code:
file_list <- list(if (df1Switch) {Name1=df1}, if (df2Switch) {Name2=df2}, if (df3Switch) {Name3=df3})
I have come across list.foldLeft in Conditionally merge list elements, but I don't know if it is suitable.
(I'll re-hash my comment:)
In general, I would encourage you to consider use of a list-of-dataframes instead of individual frames. My rationale for this:
assuming that each frame is structured (nearly) identically; and
assuming that what you do to one frame you will (or at least can) do to all frames; then
it is easier to list_of_frames <- lapply(list_of_frames, some_func) than it is to do something like:
for (nm in c("df1", "df2", "df3")) {
d <- get(nm)
d <- some_func(d)
assign(nm, d)
}
especially when dealing with non-global environments (i.e., doing this within a function).
To be clear, "easier" is subjective: though it does win code-golf, I find it much easier to read and understand "I am running some_func on each element of list_of_frames and saving the result". (You can even save it to a new list-of-frames, thereby keeping the original frames untouched.)
You may also do things conditionally, as in
needs_work <- sapply(list_of_frames, some_checker_func) # returns logical
# or
needs_work <- c("df1", "df2") # names of elements of list_of_frames
list_of_frames[needs_work] <- lapply(list_of_frames[needs_work], some_func)
Having said that ... the direct answer to your one liner:
c(if (df1Switch) list(Name1=df1), if (df2Switch) list(Name2=df2), if (df3Switch) list(Name3=df3))
This capitalizes on the fact that unstated else results in a NULL, and the NULL-compressing (dropping) characteristic of c(). You can see it in action with:
c(if (T) list(a=1), if (T) list(b=2), if (T) list(d=4))
# $a
# [1] 1
# $b
# [1] 2
# $d
# [1] 4
c(if (T) list(a=1), if (FALSE) list(b=2), if (T) list(d=4))
# $a
# [1] 1
# $d
# [1] 4

Using %in% operator in R for categorical variables

I am trying to use the %in% operator in R to write the equivalent of the SAS code below:
If weather in (2,5) then new_weather=25;
else if weather in (1,3,4,7) then new_weather=14;
else new_weather=weather;
The SAS code produces the variable "new_weather" with the values 25, 14, or the original value of "weather".
R code:
GS <- function(df, col, newcol){
# Pass a dataframe, col name, new column name
df[newcol] = df[col]
df[df[newcol] %in% c(2,5)]= 25
df[df[newcol] %in% c(1,3,4,7)] = 14
return(df)
}
Result: the output values of "col" and "newcol" are the same when passing a data frame through the function "GS". Why is the syntax not picking up the second or later values for the variable "newcol"? I'd appreciate your time explaining the reason and a possible fix.
Is this what you are trying to do?
df <- data.frame(A=seq(1:4), B=seq(1:4))
add_and_adjust <- function(df, copy_column, new_column_name) {
df[new_column_name] <- df[copy_column] # make copy of column
df[,new_column_name] <- ifelse(df[,new_column_name] %in% c(2,5), 25, df[,new_column_name])
df[,new_column_name] <- ifelse(df[,new_column_name] %in% c(1,3,4,7), 14, df[,new_column_name])
return(df)
}
Usage:
add_and_adjust(df, 'B', 'my_new_column')
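which gives (my addition, showing the expected result):
# A B my_new_column
# 1 1 1 14
# 2 2 2 25
# 3 3 3 14
# 4 4 4 14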
df[newcol] is a data frame (with one column); df[[newcol]] or df[, newcol] is a vector (just the column). You need to use [[ here (see the short illustration after the example below).
You also need to assign the result to df[[newcol]], not to the whole df. And to be perfectly consistent and safe, you should probably test the col values, not the newcol values.
GS <- function(df, col, newcol){
# Pass a dataframe, col name, new column name
df[[newcol]] = df[[col]]
df[[newcol]][df[[col]] %in% c(2,5)] = 25
df[[newcol]][df[[col]] %in% c(1,3,4,7)] = 14
return(df)
}
GS(data.frame(x = 1:7), "x", "new")
# x new
# 1 1 14
# 2 2 25
# 3 3 14
# 4 4 14
# 5 5 25
# 6 6 6
# 7 7 14
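To see the [ vs [[ distinction concretely, here's a quick illustration (my addition):
df <- data.frame(x = 1:3)
class(df["x"])
# [1] "data.frame"
class(df[["x"]])
# [1] "integer"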
@user9231640 before you invest too much time in writing your own function, you may want to explore some of the recode functions that already exist in places like car and Hmisc.
Depending on how complex your recoding gets, your function will get longer and longer in order to check various boundary conditions or to change data types.
Just based on your example, you can do this in base R, and it will be more self-documenting and transparent at one level:
df <- data.frame(A=seq(1:30), B=seq(1:30))
df$my_new_column <- df$B
df$my_new_column <- ifelse(df$my_new_column %in% c(2,5), 25, df$my_new_column)
df$my_new_column <- ifelse(df$my_new_column %in% c(1,3,4,7), 14, df$my_new_column)

Attach/detach in R behaving very strangely

I want to subset a dataframe by applying two conditions to it. When I attach the dataframe, apply the first condition, detach the dataframe, attach it again, apply the second condition, and detach again, I get the expected result, a dataframe with 9 observations.
Of course, you wouldn't normally detach/attach before applying the second condition. So I attach, apply the two conditions after one another, and then detach. But the result is different now: It's a dataframe with 24 observations. All but 5 of these observations consist exclusively of NA-values.
I know there's lots of advice against using attach, and I understand the point that it's dangerous because it's easy to lose track of an attach statement still being active. My point here is a different one: I see a behaviour in attach that I just can't understand. I'm using RStudio 0.99.465 with 64-bit R 3.2.1.
So here's the code, first the version that is clumsy, but produces the correct result (df with 9 observations, all non-NA):
df <- expand.grid(early_vvl=c(0,1), inter_churn=c(0,1), inter_new_contract=c(0,1), late_vvl=c(0,1), late_no_reaction=c(0,1))
attach(df)
df <- df[(1-early_vvl) >= inter_churn + inter_new_contract + late_vvl,]
detach(df)
attach(df)
df <- df[early_vvl <= late_no_reaction,]
detach(df)
Now the one that produces the dataframe with 24 observations, most of which consist only of NA values:
df <- expand.grid(early_vvl=c(0,1), inter_churn=c(0,1), inter_new_contract=c(0,1), late_vvl=c(0,1), late_no_reaction=c(0,1))
attach(df)
df <- df[(1-early_vvl) >= inter_churn + inter_new_contract + late_vvl,]
df <- df[early_vvl <= late_no_reaction,]
detach(df)
I'm puzzled. Does anybody understand why the second version produces a different result?
Have a look at what happens here:
attach(df)
df <- df[(1-early_vvl) >= inter_churn + inter_new_contract + late_vvl,]
length(early_vvl <= late_no_reaction)
## [1] 32
df <- df[early_vvl <= late_no_reaction,]
detach(df)
So your logical vector early_vvl <= late_no_reaction still uses the original df, the one that you attached. When you subset the data.frame the second time, the logical vector is longer than the number of rows in the data.frame, and it behaves like this:
df <- data.frame(x=1:5, y = letters[1:5])
df[rep(c(TRUE, FALSE), 5), ]
## x y
## 1 1 a
## 3 3 c
## 5 5 e
## NA NA <NA>
## NA.1 NA <NA>
You could just use & to avoid the problem:
df <- expand.grid(early_vvl=c(0,1), inter_churn=c(0,1), inter_new_contract=c(0,1), late_vvl=c(0,1), late_no_reaction=c(0,1))
attach(df)
df <- df[(1-early_vvl) >= inter_churn + inter_new_contract + late_vvl & early_vvl <= late_no_reaction,]
detach(df)
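As an aside (my addition, not part of the original answer): the same subset can be written without attach() at all. with() evaluates the expression in the data frame's own environment, which sidesteps this whole class of bug:
df <- expand.grid(early_vvl=c(0,1), inter_churn=c(0,1), inter_new_contract=c(0,1), late_vvl=c(0,1), late_no_reaction=c(0,1))
df <- df[with(df, (1-early_vvl) >= inter_churn + inter_new_contract + late_vvl & early_vvl <= late_no_reaction),]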

apply a function over groups of columns

How can I use apply or a related function to create a new data frame that contains the results of the row averages of each pair of columns in a very large data frame?
I have an instrument that outputs n replicate measurements on a large number of samples, where each single measurement is a vector (all measurements are the same length vectors). I'd like to calculate the average (and other stats) on all replicate measurements of each sample. This means I need to group n consecutive columns together and do row-wise calculations.
For a simple example, with three replicate measurements on two samples, how can I end up with a data frame that has two columns (one per sample): one that is the average of each row of the replicates in dat$a, dat$b and dat$c, and one that is the average of each row for dat$d, dat$e and dat$f?
Here's some example data
dat <- data.frame( a = rnorm(16), b = rnorm(16), c = rnorm(16), d = rnorm(16), e = rnorm(16), f = rnorm(16))
a b c d e f
1 -0.9089594 -0.8144765 0.872691548 0.4051094 -0.09705234 -1.5100709
2 0.7993102 0.3243804 0.394560355 0.6646588 0.91033497 2.2504104
3 0.2963102 -0.2911078 -0.243723116 1.0661698 -0.89747522 -0.8455833
4 -0.4311512 -0.5997466 -0.545381175 0.3495578 0.38359390 0.4999425
5 -0.4955802 1.8949285 -0.266580411 1.2773987 -0.79373386 -1.8664651
6 1.0957793 -0.3326867 -1.116623982 -0.8584253 0.83704172 1.8368212
7 -0.2529444 0.5792413 -0.001950741 0.2661068 1.17515099 0.4875377
8 1.2560402 0.1354533 1.440160168 -2.1295397 2.05025701 1.0377283
9 0.8123061 0.4453768 1.598246016 0.7146553 -1.09476532 0.0600665
10 0.1084029 -0.4934862 -0.584671816 -0.8096653 1.54466019 -1.8117459
11 -0.8152812 0.9494620 0.100909570 1.5944528 1.56724269 0.6839954
12 0.3130357 2.6245864 1.750448404 -0.7494403 1.06055267 1.0358267
13 1.1976817 -1.2110708 0.719397607 -0.2690107 0.83364274 -0.6895936
14 -2.1860098 -0.8488031 -0.302743475 -0.7348443 0.34302096 -0.8024803
15 0.2361756 0.6773727 1.279737692 0.8742478 -0.03064782 -0.4874172
16 -1.5634527 -0.8276335 0.753090683 2.0394865 0.79006103 0.5704210
I'm after something like this
X1 X2
1 -0.28358147 -0.40067128
2 0.50608365 1.27513471
3 -0.07950691 -0.22562957
4 -0.52542633 0.41103139
5 0.37758930 -0.46093340
6 -0.11784382 0.60514586
7 0.10811540 0.64293184
8 0.94388455 0.31948189
9 0.95197629 -0.10668118
10 -0.32325169 -0.35891702
11 0.07836345 1.28189698
12 1.56269017 0.44897971
13 0.23533617 -0.04165384
14 -1.11251880 -0.39810121
15 0.73109533 0.11872758
16 -0.54599850 1.13332286
which I did with this, but it is obviously no good for my much larger data frame...
data.frame(cbind(
apply(cbind(dat$a, dat$b, dat$c), 1, mean),
apply(cbind(dat$d, dat$e, dat$f), 1, mean)
))
I've tried apply and loops and can't quite get it together. My actual data has some hundreds of columns.
This may be more generalizable to your situation in that you pass a list of indices. If speed is an issue (large data frame) I'd opt for lapply with do.call rather than sapply:
x <- list(1:3, 4:6)
do.call(cbind, lapply(x, function(i) rowMeans(dat[, i])))
Works if you just have col names too:
x <- list(c('a','b','c'), c('d', 'e', 'f'))
do.call(cbind, lapply(x, function(i) rowMeans(dat[, i])))
EDIT
Just happened to think maybe you want to automate this to do every three columns. I know there's a better way, but here it is on a 100-column data set:
dat <- data.frame(matrix(rnorm(16*100), ncol=100))
n <- 1:ncol(dat)
ind <- matrix(c(n, rep(NA, 3 - ncol(dat)%%3)), byrow=TRUE, ncol=3)
ind <- data.frame(t(na.omit(ind)))
do.call(cbind, lapply(ind, function(i) rowMeans(dat[, i])))
EDIT 2
Still not happy with the indexing. I think there's a better/faster way to pass the indexes. Here's a second, though not satisfying, method:
n <- 1:ncol(dat)
ind <- data.frame(matrix(c(n, rep(NA, 3 - ncol(dat)%%3)), byrow=F, nrow=3))
nonna <- sapply(ind, function(x) all(!is.na(x)))
ind <- ind[, nonna]
do.call(cbind, lapply(ind, function(i)rowMeans(dat[, i])))
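For what it's worth, a cleaner way to build the same index list (my addition, not from the original answer) is to let split() do the grouping; it also handles a trailing partial group without any NA padding:
ind <- split(seq_len(ncol(dat)), ceiling(seq_len(ncol(dat)) / 3))
# drop = FALSE keeps one-column groups as data frames so rowMeans() still works
do.call(cbind, lapply(ind, function(i) rowMeans(dat[, i, drop = FALSE])))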
A similar question was asked here by @david: averaging every 16 columns in r (now closed), which I answered by adapting @TylerRinker's answer above, following a suggestion by @joran and @Ben. Because the resulting function might be of help to the OP or future readers, I am copying that function here, along with an example for the OP's data.
# Function to apply 'fun' to object 'x' over every 'by' columns
# Alternatively, 'by' may be a vector of groups
byapply <- function(x, by, fun, ...)
{
# Create index list
if (length(by) == 1)
{
nc <- ncol(x)
split.index <- rep(1:ceiling(nc / by), each = by, length.out = nc)
} else # 'by' is a vector of groups
{
nc <- length(by)
split.index <- by
}
index.list <- split(seq(from = 1, to = nc), split.index)
# Pass index list to fun using sapply() and return object
sapply(index.list, function(i)
{
do.call(fun, list(x[, i], ...))
})
}
Then, to find the mean of the replicates:
byapply(dat, 3, rowMeans)
Or, perhaps the standard deviation of the replicates:
byapply(dat, 3, apply, 1, sd)
Update
by can also be specified as a vector of groups:
byapply(dat, c(1,1,1,2,2,2), rowMeans)
Mean for rows from vectors a, b, c:
rowMeans(dat[1:3])
Means for rows from vectors d, e, f:
rowMeans(dat[4:6])
All in one call you get:
results <- cbind(rowMeans(dat[1:3]), rowMeans(dat[4:6]))
If you only know the names of the columns and not the order, then you can use:
rowMeans(cbind(dat["a"], dat["b"], dat["c"]))
rowMeans(cbind(dat["d"], dat["e"], dat["f"]))
# I don't know how much damage this does to speed, but it should still be quick
The rowMeans solution will be faster, but for completeness here's how you might do this with apply:
t(apply(dat,1,function(x){ c(mean(x[1:3]),mean(x[4:6])) }))
Inspired by @joran's suggestion, I came up with this (actually a bit different from what he suggested, though the transposing suggestion was especially useful):
Make a data frame of example data with p columns to simulate a realistic data set (following @TylerRinker's answer above, and unlike my poor example in the question):
p <- 99 # how many columns?
dat <- data.frame(matrix(rnorm(4*p), ncol = p))
Rename the columns in this data frame to create groups of n consecutive columns, so that if I'm interested in groups of three columns I get column names like 1,1,1,2,2,2,3,3,3, etc., or for groups of four columns, 1,1,1,1,2,2,2,2,3,3,3,3, etc. I'm going with three for now (I guess this is a kind of indexing for people like me who don't know much about indexing):
n <- 3 # how many consecutive columns in the groups of interest?
names(dat) <- rep(seq(1:(ncol(dat)/n)), each = n, len = (ncol(dat)))
Now use apply and tapply to get row means for each of the groups
dat.avs <- data.frame(t(apply(dat, 1, tapply, names(dat), mean)))
The main downsides are that the column names in the original data are replaced (though this could be overcome by putting the grouping numbers in a new row rather than in the colnames) and that the column names come back from the apply-tapply combination in an unhelpful order.
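One possible fix for the ordering (my addition, assuming data.frame() has prefixed the group numbers with "X"): reorder the columns numerically.
dat.avs <- dat.avs[, order(as.numeric(sub("^X", "", colnames(dat.avs))))]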
Further to @joran's suggestion, here's a data.table solution:
p <- 99 # how many columns?
dat <- data.frame(matrix(rnorm(4*p), ncol = p))
dat.t <- data.frame(t(dat))
n <- 3 # how many consecutive columns in the groups of interest?
dat.t$groups <- as.character(rep(seq(1:(ncol(dat)/n)), each = n, len = (ncol(dat))))
library(data.table)
DT <- data.table(dat.t)
setkey(DT, groups)
dat.av <- DT[, lapply(.SD,mean), by=groups]
Thanks everyone for your quick and patient efforts!
There is a beautifully simple solution if you are interested in applying a function to each unique combination of columns, in what is known as combinatorics.
combinations <- combn(colnames(dat), 2, function(x) rowMeans(dat[x]))
To calculate statistics for every unique combination of three columns, etc., just change the 2 to a 3. The operation is compact and avoids the explicit apply-family loops used above. If the order of the columns matters, then you instead need a permutation algorithm designed to reproduce ordered sets: combinat::permn
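If you want to keep track of which pair of columns produced each result, the same combn() call can generate matching labels (my addition, reusing dat from the question):
combinations <- combn(colnames(dat), 2, function(x) rowMeans(dat[x]))
# label each column of the result with the pair it came from, e.g. "a_b", "a_c", ...
colnames(combinations) <- combn(colnames(dat), 2, paste, collapse = "_")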
