I work with a large count table, and my analyses usually require splitting this table into subsets based on observations, variables, values or context information.
# generating toy data
count_df1 <- data.frame(
column1 = c(1:50),
column2 = runif(50, 1, 10),
column3 = runif(50, 1, 10)
)
count_df2 <- data.frame(
column1 = c(1:50),
column2 = runif(50, 1.5, 9),
column3 = runif(50, 1.5, 9)
)
list_count_df <- list(count_df1 = count_df1, count_df2 = count_df2)
I learned to use lists and for loops to process all resulting subsets in the same manner. I use for loops rather than apply because I use the names of the objects (with the help of counters) to keep track of how I modified them, and I don't know how to do this with e.g. lapply.
# set values to iterate over
thresholds <- c(2, 4)
conditions <- c(TRUE, FALSE)
# perform some kind of subsetting and store the parameters used
output_list <- list()
counter <- 0
for (current_threshold in thresholds) {
for (count_df in list_count_df) {
counter <- counter + 1
# modify the name to keep track of changes
current_name <- paste(names(list_count_df)[counter], current_threshold, sep = "_")
output_list[[current_name]] <- subset(count_df, column2 < current_threshold)
}
counter <- 0
}
Additionally, the time-consuming part is usually the main function in the loop body, so reducing the loop overhead with apply would probably not save much time (I'm still open to this).
After preparing the various subsets and subjecting them to the analysis, I need to store the analysis results and the accompanying parameters for the different subsets. That is probably a common task.
# allocate for df to store the results
result_length <- length(output_list) * length(conditions)
df_headers <- c("Names", "Threshold", "Input_table", "Standard_deviation", "Scaling")
df_results <- setNames(data.frame(matrix(ncol = length(df_headers),
nrow = result_length)), df_headers)
# perform some analyses (here: PCA) on the dfs while looping over
# analysis parameters and storing some results directly
iii <- 0
table_counter <- 0
for (item in output_list) {
table_counter <- table_counter + 1
for (condition in conditions) {
iii <- iii + 1
current_name <- paste(names(output_list)[table_counter], condition, sep = "_")
tmp <- prcomp(item, scale = condition)
# let's pretend we are only interested in standard deviation per item
df_results[iii, 1] <- current_name
df_results[iii, 4] <- tmp$sdev[1]
rm(tmp)
}
}
However, I'm partly doing this by extracting parts of the object's name, which is highly repetitive, very custom, and has to be changed for each additional step included beforehand. As I want to start my own package soon, this is not something another user could easily follow.
# extract more values from the name of the former object
df_results$Threshold <- as.numeric(sapply(strsplit(as.character(df_results$Names), '_'), "[", 3))
df_results$Input_table <- as.factor(sapply(strsplit(as.character(df_results$Names), '_'), "[", 2))
df_results$Scaling <- as.factor(sapply(strsplit(as.character(df_results$Names), '_'), "[", 4))
df_results
# now I could convert this into long format, do plotting etc
I provided a short example of what such a workflow could look like. My questions are:
1) What are the general good practices on how to store parameters used for and how to extract them after processing?
2) If the solution is too case-specific for a general approach:
a) any ideas what to change here?
b) Are lists and/or for loops the way to go at all?
I do it because modifying names in lapply is unclear to me, and without this I lose track of what is what. I also would not know how to efficiently handle all these different subsets in one big data.frame.
Please consider that my original data contains numerical, factor and character columns with 100s of rows/observations and ten thousands of columns/variables.
Honestly there are many ways to do this and it will come down to personal preference. One common way would be to define a class object that will set the standard of how you access information on it. Creating a class means that you can make S3 methods too. This could help give more flexibility on how you generate your class depending on if you are working on a list, df, or just a vector.
generate_foo <- function(x, ...){
UseMethod("generate_foo")}
generate_foo.default <- function(x, current_threshold, conditions, name = NULL){
if(is.null(name)){
name <- as.character(substitute(x))
}
x <- x[x[["column2"]]<current_threshold,]
tmp <- tryCatch({prcomp(x, scale = conditions)}, error=function(er){return("Error")})
retval <- list(list(subset = x,
pcaObj = tmp, #could store the entire object or just the parts you care about.
subsetparam = current_threshold,
condition = conditions,
name = name))
class(retval[[1]]) <- "foo" # set the class on the inner list (the actual foo object)
return(retval)
}
generate_foo.list <- function(x,
current_threshold,
conditions, name = NULL){
if(is.null(name)||length(name)!=length(x)){
name <- names(x)
}
#Generate combinations (separate() comes from the tidyr package)
combi <- separate( #generate all the possible combination indexes at once
data.frame(
indx = levels(suppressWarnings(interaction(1:length(x),
1:length(current_threshold),
1:length(conditions))))),
col = "indx", into = c("df","thresh","cond"), sep = "\\.")
x <- x[as.numeric(combi$df)]
name <- name[as.numeric(combi$df)]
current_threshold <- current_threshold[as.numeric(combi$thresh)]
conditions <- conditions[as.numeric(combi$cond)]
foolist <- mapply(FUN = generate_foo.default,
x = x,
current_threshold = current_threshold,
conditions = conditions,
name = name)
class(foolist) <- "foolist"
return(foolist)
}
With this method when you call:
foo <- generate_foo(x = list_count_df,
current_threshold = thresholds,
conditions = conditions,
name = c("Custname1","Custname2"))
You will end up with a list of objects of class "foo". In this case the resulting list has length 8 (2 tables x 2 thresholds x 2 conditions), and each element contains five components: subset, pcaObj, subsetparam, condition and name. Because prcomp() can throw an error when a subset is too small, the tryCatch() call stores "Error" in pcaObj instead of letting the whole run fail.
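To illustrate the tryCatch() idea on its own, here is a minimal sketch. The helper name safe_pca is hypothetical, and returning NULL instead of the string "Error" is my own choice, not something taken from the code above.

```r
# Hypothetical helper: run prcomp() and return NULL instead of stopping on error
safe_pca <- function(x, scale.) {
  tryCatch(prcomp(x, scale. = scale.), error = function(e) NULL)
}

# A well-behaved table: the PCA succeeds
ok <- safe_pca(data.frame(a = c(1, 2, 3, 4), b = c(2, 1, 4, 3)), scale. = TRUE)

# A constant column cannot be rescaled to unit variance, so prcomp() errors
bad <- safe_pca(data.frame(a = c(1, 1, 1, 1), b = c(2, 1, 4, 3)), scale. = TRUE)

inherits(ok, "prcomp")
is.null(bad)
```

A NULL sentinel is easy to test for later with is.null(), which is why I picked it for this sketch.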
You can take it a step further by writing custom print and summary functions!
#summary
summary.foolist <- function(x){
subsetdim <- unlist(lapply(x, function(y){prod(dim(y[["subset"]]))}))
pcasdev <- unlist(lapply(x, function(y){y[["pcaObj"]]$sdev[1]}))
subsetparam <- unlist(lapply(x, function(y){y[["subsetparam"]]}))
condition <- unlist(lapply(x, function(y){y[["condition"]]}))
name <- unlist(lapply(x,function(y){y[["name"]]}))
df <- data.frame(SubsetDim=subsetdim, PCAsdev=pcasdev, SubsetParam=subsetparam, condition=condition, name = name)
return(df)
}
summary(foo)
SubsetDim PCAsdev SubsetParam condition name
1 24 1.207833 2 TRUE Custname1
2 6 1.732051 2 TRUE Custname2
3 54 1.324284 4 TRUE Custname1
4 33 1.372508 4 TRUE Custname2
5 24 16.258848 2 FALSE Custname1
6 6 12.024556 2 FALSE Custname2
7 54 15.592938 4 FALSE Custname1
8 33 14.057929 4 FALSE Custname2
Using a convention like this ensures your data is stored in a canonical way. Of course there are many ways you can choose to build your custom R class and object.
You could make one function that makes a list of subsetted dataframes and set that as one class. Then make another function that performs the analysis and generates a new class object. As long as you stick to building a named list, accessing parts of the object becomes easier because they are organized.
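A rough sketch of that two-function split could look like the following. All names here (make_subsets, analyse_subsets, the classes "subset_list" and "analysis_table", and the toy tables) are made up for illustration, and the subsetting rule simply mirrors the column2 threshold from the question.

```r
# Stage 1 (hypothetical): build a classed, named list of subsets.
# Every element carries its data together with the parameters that produced it.
make_subsets <- function(dfs, thresholds) {
  grid <- expand.grid(table = names(dfs), threshold = thresholds,
                      stringsAsFactors = FALSE)
  subsets <- lapply(seq_len(nrow(grid)), function(i) {
    d <- dfs[[grid$table[i]]]
    list(data = d[d$column2 < grid$threshold[i], , drop = FALSE],
         table = grid$table[i],
         threshold = grid$threshold[i])
  })
  names(subsets) <- paste(grid$table, grid$threshold, sep = "_")
  structure(subsets, class = "subset_list")
}

# Stage 2 (hypothetical): analyse every subset, one result row per subset.
analyse_subsets <- function(subsets, scale. = FALSE) {
  stopifnot(inherits(subsets, "subset_list"))
  rows <- lapply(subsets, function(s) {
    fit <- tryCatch(prcomp(s$data, scale. = scale.), error = function(e) NULL)
    data.frame(table = s$table, threshold = s$threshold, scale = scale.,
               n_rows = nrow(s$data),
               sdev_pc1 = if (is.null(fit)) NA_real_ else fit$sdev[1])
  })
  structure(do.call(rbind, rows), class = c("analysis_table", "data.frame"))
}

set.seed(42)
toy <- list(t1 = data.frame(column1 = 1:30, column2 = runif(30, 1, 10)),
            t2 = data.frame(column1 = 1:30, column2 = runif(30, 1, 10)))
res <- analyse_subsets(make_subsets(toy, thresholds = c(4, 8)), scale. = TRUE)
res
```

The parameters travel with the data as list components and end up as ordinary columns, so nothing has to be parsed back out of a name.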
Functional solution
0. Generate source data frame
# for reproducibility of random tasks
set.seed(1)
df <- data.frame(
col1 = c(1:100),
col2 = c(runif(50,1,10), runif(50,11,20)),
col3 = c(runif(50,1,10), runif(50,11,20))
)
# so half of the rows have numbers 1 to 10 in col2 and col3
# and other have 11 to 20 in col2 and col3.
# let's randomize the order of rows
df <- df[sample(1:100),]
# and take this data frame `df` as our source data frame
# from which we will do the analysis.
1. Problem description
We want to subdivide the original df into sub data frames
applying 2 different criteria.
Then, we analyze each sub data frame
using all possible combinations of 2 different parameters,
finally, collect all analysis values and aggregate them into a data frame.
The criteria:
criterium1: if col2 value is <= 10, we assign the row to "df1", otherwise to "df2".
categories c("df1", "df2").
criterium2: if col3 value is <= first limit, the row is assigned 'class5'
if col3 value is > first limit but <= second limit, assign 'class15'
other cases don't interest us - let's assign 'other'
categories c("class5", "class15", "other") # each row will be assigned to
one of them
We want, for each combination of the two criteria, its own sub data frame
on which the analysis should be done.
The parameters for the analysis:
parameter1: 'scale.=' c(TRUE, FALSE)
parameter_categories c("scpos", "scneg")
parameter2: 'center=' c(TRUE, FALSE)
parameter_categories c("cpos", "cneg")
The analysis result value:
We want, for each combination of the two parameters, its own report of the value
for 'Standard deviation'.
3 stddev columns of the PC1, PC2, PC3
Additional information to be collected:
we want a distinguishable (unique) name for each combination
2. What the entire analysis will look like:
# 0. categorize and split data frame
categories1 <- c("df1", "df2")[cut(df[, "col2"], c(1, 11, 20, Inf))]
categories2 <- c("class5", "class15", "other")[cut(df[, "col3"], c(-Inf, 5, 15, Inf))]
dfs <- split(df, gsub("class", "", paste(categories1, categories2, sep="_")))
# 1. Declare parameters and prepare all parameter combinations
parameters1 <- list("scale." = TRUE, "scale."=FALSE)
np1 <- c("scpos", "scneg")
parameters2 <- list("center"=TRUE, "center"=FALSE)
np2 <- c("cpos", "cneg")
params_list <- named_cross_combine(parameters1, parameters2, np1, np2, sep="_")
# 2. Apply analysis over all sub dfs and parameter combinations
# and extract and aggregate analysis results into a final data frame
df_final <- apply_extract_aggravate(
dfs=dfs,
params=params_list,
analyzer_func=prcomp,
extractor_func=function(x) x$sdev, # extractor must return a vector
col_names=c("df", "limits", "scale", "center", "std_PC1", "std_PC2", "std_PC3"),
sep="_" # separator for names
)
# 3. rename parameter column contents
df_final$scale <- unlist(lookup(df_final$scale, np1, parameters1))
df_final$center <- unlist(lookup(df_final$center, np2, parameters2))
df_final:
df limits scale center std_PC1 std_PC2 std_PC3
df1_15_scpos_cpos df1 15 TRUE TRUE 1.205986 0.9554013 0.7954906
df1_15_scpos_cneg df1 15 TRUE FALSE 1.638142 0.5159250 0.2243043
df1_15_scneg_cpos df1 15 FALSE TRUE 15.618145 2.4501942 1.3687843
df1_15_scneg_cneg df1 15 FALSE FALSE 31.425246 5.9055013 1.7178626
df1_5_scpos_cpos df1 5 TRUE TRUE 1.128371 1.0732246 0.7582659
df1_5_scpos_cneg df1 5 TRUE FALSE 1.613217 0.4782639 0.4108470
df1_5_scneg_cpos df1 5 FALSE TRUE 13.525868 2.5524661 0.9894493
df1_5_scneg_cneg df1 5 FALSE FALSE 30.007511 3.9094993 1.6020638
df2_15_scpos_cpos df2 15 TRUE TRUE 1.129298 1.0069030 0.8431092
df2_15_scpos_cneg df2 15 TRUE FALSE 1.720909 0.1523516 0.1235295
df2_15_scneg_cpos df2 15 FALSE TRUE 14.061532 2.4172787 1.2348606
df2_15_scneg_cneg df2 15 FALSE FALSE 80.543382 3.8409639 1.8480111
df2_other_scpos_cpos df2 other TRUE TRUE 1.090057 0.9588241 0.9446865
df2_other_scpos_cneg df2 other TRUE FALSE 1.718190 0.1881516 0.1114570
df2_other_scneg_cpos df2 other FALSE TRUE 15.168160 2.5579403 1.3354016
df2_other_scneg_cneg df2 other FALSE FALSE 82.297724 5.0580949 1.9356444
3. Explanation step by step
3.1 Declare Helper functions
# for preparing parameter combinations as lists
named_cross_combine <- function(seq1, seq2, seq1_names, seq2_names, sep="_") {
res <- list()
i <- 1
namevec <- c()
for (j1 in seq_along(seq1)) {
for (j2 in seq_along(seq2)) {
res[[i]] <- c(seq1[j1], seq2[j2])
namevec[i] <- paste0(seq1_names[j1], sep, seq2_names[j2])
i <- i + 1
}
}
names(res) <- namevec
res
}
# correctly named params list - `sep=` determines how names are joined
# you can apply `gsub()` on the namevec before assignment to adjust further the names.
# useful for doing analysis
do.call2 <- function(fun, x, rest) {
do.call(fun, c(list(x), rest))
}
apply_parameters <- function(funcname,
dfs,
params) {
lapply(dfs, function(df) lapply(params, function(pl) do.call2(funcname, df, pl)))
}
split_names_to_data_frame <- function(names_vec, sep) {
res <- lapply(names_vec, function(s) strsplit(s, sep)[[1]])
df <- Reduce(rbind, res)
# colnames(df) <- col_names
rownames(df) <- names_vec
df
}
apply_to_subdf_and_combine <- function(
res_list,
accessor_func=function(x) x, # subdf result
subdf_level_combiner_func=as.data.frame, # within subdf result
combine_prepare_func=function(x) x, # applied on each subdf result
final_combiner_func=rbind, # combine the results
col_names=NULL, # column names for final
sep="_") { # joiner for names
res_accessed_combined <- lapply(res_list,
function(x) do.call(what=subdf_level_combiner_func,
list(lapply(x, accessor_func))))
res_prepared <- lapply(res_accessed_combined, combine_prepare_func)
res_df <- Reduce(final_combiner_func, res_prepared)
rownames(res_df) <- paste(unlist(sapply(names(res_prepared), rep, nrow(res_prepared[[1]]))),
unlist(sapply(res_prepared, rownames)),
sep = sep)
names_df <- split_names_to_data_frame(rownames(res_df), sep = sep)
final_df <- as.data.frame(cbind(names_df, res_df))
if (!is.null(col_names)) {
colnames(final_df) <- col_names
}
final_df
}
# for simplifying the function call
extract_and_combine <- function(res_list,
result_extractor_func,
col_names,
sep="_") {
apply_to_subdf_and_combine(
res_list = res_list,
accessor_func = result_extractor_func,
subdf_level_combiner_func=as.data.frame,
combine_prepare_func=function(x) as.data.frame(t(x)),
final_combiner_func=rbind,
col_names=col_names,
sep=sep
)
}
# for even more simplifying function call
apply_extract_aggravate <- function(dfs,
params,
analyzer_func,
extractor_func,
col_names,
sep="_") {
extract_and_combine(
res_list=apply_parameters(funcname=analyzer_func, dfs=dfs, params=params),
result_extractor_func=extractor_func,
col_names=col_names,
sep=sep
)
}
# useful for renaming the data frame columns values
lookup <- function(x, seq1, seq2) {
seq2[sapply(x, function(x) which(x == seq1))]
}
3.2 Categorize and split data frame
categories1 <- c("df1", "df2")[cut(df[, "col2"], c(1, 11, 20, Inf))]
categories2 <- c("5", "15", "other")[cut(df[, "col3"], c(-Inf, 5, 15, Inf))]
dfs <- split(df, gsub("class", "", paste(categories1, categories2, sep="_")))
But to have full control over categorization, you can
declare your own categorizer functions and categorize and
split the data frame:
# write the rules of criterium1 for a single element as a function
categorizer1 <- function(x) {
if (1 <= x && x <= 10) {
"df1"
} else if (11 <= x && x <= 20) {
"df2"
}
}
# vectorize it to be able to apply it on entire columns
categorizer1 <- Vectorize(categorizer1)
# do the same for criterium2
categorizer2 <- function(x) {
if (x <= 5) {
"class5"
} else if (5 < x && x <= 15) {
"class15"
} else {
"other"
}
}
categorizer2 <- Vectorize(categorizer2)
# apply on col2 and col3 the corresponding categorizers
categories1 <- categorizer1(df[,"col2"])
categories2 <- categorizer2(df[,"col3"])
# get the list of sub data frames according to categories
dfs <- split(df, gsub("class", "", paste(categories1, categories2, sep="_")))
# Let the categorizer functions return strings and
# for the second argument use `paste()` with `sep=` to determine
# how the names should be combined - here with "_".
# Use `gsub(pattern, replacement, x, ignore.case=F, perl=T)`
# to process the name using regex patterns to how you want it at the end.
# Here, we remove the bulky "class".
3.3 Declare parameters as lists and their corresponding names in filename
parameters1 <- list("scale." = TRUE, "scale."=FALSE)
np1 <- c("scpos", "scneg")
parameters2 <- list("center"=TRUE, "center"=FALSE)
np2 <- c("cpos", "cneg")
# prepare all combinations of them in a list of lists
params_list <- named_cross_combine(parameters1, parameters2, np1, np2, sep="_")
# this produces a list of all possible parameter combination lists.
# Each parameter combination has to be kept itself in a list, because
# `do.call()` later requires the parameters being in a list.
# `named_cross_combine()` takes care of correct naming,
# joining the names using `sep` values.
# The first element in `parameters1` is taken and is paired with each of
# `parameters2`. Then the second of `parameters1` through all `parameters2`, etc.
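For comparison, a similar table of all parameter combinations can be built with base R's expand.grid(). This is only a sketch under my own assumptions: the toy data frame and the column name sdev_pc1 are made up, and expand.grid() varies the first parameter fastest, which is the opposite order to named_cross_combine().

```r
set.seed(1)
toy <- data.frame(a = runif(20), b = runif(20), c = runif(20))

# One row per parameter combination; the column names match prcomp() arguments
grid <- expand.grid(scale. = c(TRUE, FALSE), center = c(TRUE, FALSE))

# Run prcomp() once per row, passing that row as named arguments via do.call()
sdev_pc1 <- vapply(seq_len(nrow(grid)), function(i) {
  fit <- do.call(prcomp, c(list(toy), as.list(grid[i, ])))
  fit$sdev[1]
}, numeric(1))

# Parameters and results sit side by side, no name parsing required
results <- cbind(grid, sdev_pc1)
results
```

Here the parameters are stored as real logical columns from the start, so the later lookup step from short names back to TRUE/FALSE would not be needed.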
3.4 Apply all parameters over dfs and collect the results into a data frame
df_final <- apply_extract_aggravate(
dfs=dfs,
params=params_list,
analyzer_func=prcomp,
extractor_func=function(x) x$sdev, # extractor must return a vector
col_names=c("df", "limits", "scale", "center", "std_PC1", "std_PC2", "std_PC3"),
sep="_" # separator for names
)
# This function takes the dfs and the parameters list and runs the
# analyzer_func, here `prcomp()`, over all combinations of both.
# The `extractor_func` must be chosen in a way that the returned result is a vector.
# If it is already a vector, set here `function(x) x` the identity function.
# The column names should give new names to the resulting columns.
# The number of names is determined by:
# - the number of categoriesN,
# - the number of parametersN,
# - the number of elements of result after extractor_func() was applied.
# `sep=` determines which joiner is used for joining the names.
3.5 Finally, rename parameter columns' contents by using lookup() + previously declared parameter lists (parametersN) with their corresponding name vectors (npN)
df_final$scale <- unlist(lookup(df_final$scale, np1, parameters1))
df_final$center <- unlist(lookup(df_final$center, np2, parameters2))
# Two parameter columns, so two commands.
This converts df_final from this:
# df limits scale center std_PC1 std_PC2 std_PC3
# df1_15_scpos_cpos df1 15 scpos cpos 1.205986 0.9554013 0.7954906
# df1_15_scpos_cneg df1 15 scpos cneg 1.638142 0.5159250 0.2243043
# df1_15_scneg_cpos df1 15 scneg cpos 15.618145 2.4501942 1.3687843
# df1_15_scneg_cneg df1 15 scneg cneg 31.425246 5.9055013 1.7178626
# df1_5_scpos_cpos df1 5 scpos cpos 1.128371 1.0732246 0.7582659
# df1_5_scpos_cneg df1 5 scpos cneg 1.613217 0.4782639 0.4108470
# df1_5_scneg_cpos df1 5 scneg cpos 13.525868 2.5524661 0.9894493
# df1_5_scneg_cneg df1 5 scneg cneg 30.007511 3.9094993 1.6020638
# df2_15_scpos_cpos df2 15 scpos cpos 1.129298 1.0069030 0.8431092
# df2_15_scpos_cneg df2 15 scpos cneg 1.720909 0.1523516 0.1235295
# df2_15_scneg_cpos df2 15 scneg cpos 14.061532 2.4172787 1.2348606
# df2_15_scneg_cneg df2 15 scneg cneg 80.543382 3.8409639 1.8480111
# df2_other_scpos_cpos df2 other scpos cpos 1.090057 0.9588241 0.9446865
# df2_other_scpos_cneg df2 other scpos cneg 1.718190 0.1881516 0.1114570
# df2_other_scneg_cpos df2 other scneg cpos 15.168160 2.5579403 1.3354016
# df2_other_scneg_cneg df2 other scneg cneg 82.297724 5.0580949 1.9356444
to this:
df limits scale center std_PC1 std_PC2 std_PC3
df1_15_scpos_cpos df1 15 TRUE TRUE 1.205986 0.9554013 0.7954906
df1_15_scpos_cneg df1 15 TRUE FALSE 1.638142 0.5159250 0.2243043
df1_15_scneg_cpos df1 15 FALSE TRUE 15.618145 2.4501942 1.3687843
df1_15_scneg_cneg df1 15 FALSE FALSE 31.425246 5.9055013 1.7178626
df1_5_scpos_cpos df1 5 TRUE TRUE 1.128371 1.0732246 0.7582659
df1_5_scpos_cneg df1 5 TRUE FALSE 1.613217 0.4782639 0.4108470
df1_5_scneg_cpos df1 5 FALSE TRUE 13.525868 2.5524661 0.9894493
df1_5_scneg_cneg df1 5 FALSE FALSE 30.007511 3.9094993 1.6020638
df2_15_scpos_cpos df2 15 TRUE TRUE 1.129298 1.0069030 0.8431092
df2_15_scpos_cneg df2 15 TRUE FALSE 1.720909 0.1523516 0.1235295
df2_15_scneg_cpos df2 15 FALSE TRUE 14.061532 2.4172787 1.2348606
df2_15_scneg_cneg df2 15 FALSE FALSE 80.543382 3.8409639 1.8480111
df2_other_scpos_cpos df2 other TRUE TRUE 1.090057 0.9588241 0.9446865
df2_other_scpos_cneg df2 other TRUE FALSE 1.718190 0.1881516 0.1114570
df2_other_scneg_cpos df2 other FALSE TRUE 15.168160 2.5579403 1.3354016
df2_other_scneg_cneg df2 other FALSE FALSE 82.297724 5.0580949 1.9356444
4. Final remarks
This is not very different from your approach. All information is collected in the names, and the names are then used to generate the part of the data frame that describes the background of the analysis data.
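The core of that name-based bookkeeping can be shown in a few lines. This is a simplified stand-in for split_names_to_data_frame(), not the helper itself, and the three run names are just examples in the same pattern as the output above.

```r
# Example run names: <table>_<limit>_<scale>_<center>
run_names <- c("df1_15_scpos_cpos", "df1_5_scneg_cneg", "df2_other_scpos_cneg")

# Split every name at the separator and stack the pieces row-wise
parts <- do.call(rbind, strsplit(run_names, "_", fixed = TRUE))

meta <- as.data.frame(parts, stringsAsFactors = FALSE)
colnames(meta) <- c("df", "limits", "scale", "center")
rownames(meta) <- run_names
meta
```

Note that this only works as long as no single name component contains the separator itself.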
The lookup() function is very useful for renaming the columns for the parameters.
The categorization of a column can be greatly simplified with the cut() function. However, cut() applies one rule to all intervals (set via its right= and include.lowest= arguments), so you don't have full control over
whether each upper/lower limit is included (<=) or not (<).
That is why declaring your own categorizer functions can sometimes be an advantage (especially for more complex categorizations).
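As a small illustration of how cut() treats the interval limits, here is a sketch with values that sit exactly on the breaks. The variable names are arbitrary; right= is a documented argument of cut().

```r
x <- c(5, 10, 15)
breaks <- c(-Inf, 5, 15, Inf)
labels <- c("class5", "class15", "other")

# right = TRUE (the default): intervals are (lower, upper]
closed_right <- as.character(cut(x, breaks, labels = labels, right = TRUE))
closed_right
# "class5" "class15" "class15"

# right = FALSE: intervals are [lower, upper)
closed_left <- as.character(cut(x, breaks, labels = labels, right = FALSE))
closed_left
# "class15" "class15" "other"
```

The same rule is applied to every interval, so mixing <= on one limit with < on another still needs a custom categorizer.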
Extensibility
More categories: Just define more categories categories1 categories2 categories3 ...
# then do
dfs <- split(df, paste(categories1, categories2, categories3, ..., sep="_"))
# use `gsub()` around `paste()` or do
# names(dfs) <- gsub("search_term", "replace_term", names(dfs)) - over and over again
# until all names are as they should be.
More parameters: Just define more parametersN - npN pairs.
# then do
params_list <- named_cross_combine(parameters1, parameters2, np1, np2, sep="_")
params_list <- named_cross_combine(params_list, parameters3, names(params_list), np3, sep="_")
params_list <- named_cross_combine(params_list, parameters4, names(params_list), np4, sep="_")
... (and so on ...)
# use then at the end more lines for renaming parameter column contents:
df_final[, prmcol_name1] <- unlist(lookup(df_final[, prmcol_name1], np1, parameters1))
df_final[, prmcol_name2] <- unlist(lookup(df_final[, prmcol_name2], np2, parameters2))
df_final[, prmcol_name3] <- unlist(lookup(df_final[, prmcol_name3], np3, parameters3))
... (and so on ...)
Thus, the number of categories and parameters is easily extensible.
The core helper functions stay the same and don't have to be modified.
(The use of higher-order functions (functions which take functions as arguments) as helper functions is key to their flexibility, one of the strengths of functional programming.)
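A stripped-down sketch of the higher-order idea: the function below (analyse_all is a made-up name, as are the toy tables) takes the analysis and the extraction step as arguments, so the same helper serves quite different analyses.

```r
# Hypothetical higher-order helper: the analyzer and extractor are passed in
analyse_all <- function(data_list, analyzer, extractor) {
  lapply(data_list, function(d) extractor(analyzer(d)))
}

toy <- list(a = data.frame(x = c(1, 2, 3, 4), y = c(2, 1, 4, 3)),
            b = data.frame(x = c(1, 2, 3, 4), y = c(1, 3, 2, 5)))

# PCA: keep the standard deviations of the components
sdevs <- analyse_all(toy, prcomp, function(fit) fit$sdev)

# Linear model: keep only the slope of x
slopes <- analyse_all(toy, function(d) lm(y ~ x, data = d),
                      function(fit) coef(fit)[["x"]])
```

Because lapply() keeps the names of the input list, the results stay labelled by table without any counters.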
I have several data frames named as such:
orange_ABC
orange_BCD
apple_ABC
apple_BCD
grape_ABC
grape_BCD
I need to rbind those that have the first part of their name in common (orange, apple, grape), and name the new data frames as such. I'm accessing the names from a list of data frames names(fruitlist) (from which I made the aforementioned data frames) and have tried using lapply with function(x) with no luck. I'm somewhat new to R, so think I'm making a simple mistake when it comes to dynamically naming the new data frame...
lapply(names(fruitlist),
function(x){
frame_nm <- toString((names(fruitlist[x])))
frame_nm <- do.call(rbind, mget(ls(pattern=paste0((names(splitlist[x])),"*"))))
})
I've tried the standalone line on one type of "fruit" and it seems to work:
test_DF <- do.call(rbind, mget(ls(pattern="apple*")))
EDIT: I realize I forgot to mention that the example list of 6 data frames was created dynamically, so I can't simply generate a list of them. However, I do have a list of the "fruits", and all possible endings of the new data frame names are known ("_ABC" and "_BCD").
As suspected, the proposed way of assigning values to objects does not work. Moreover, care has to be taken when using ls() and mget() for listing and accessing named objects within a function, because by default they only "see" variables in the local scope and do not ascend to parent environments unless told otherwise. This applies to R version 3.4; older versions may behave differently.
Creating named objects.
In order to create new objects in the global environment, use assign() (already suggested in Luke C's answer):
> assign("foo", "some text")
> foo
[1] "some text"
Placing code inside a function induces a local scope. Explicitly specifying the global environment allows setting global variables:
> set_foo <- function (x) { assign("foo", x, envir=globalenv()) }
> set_foo("other text")
> foo
[1] "other text"
Note that omitting the envir argument would leave the global environment unaffected.
Use of ls()/mget() within a local function.
By default, ls() only lists names from the current (local) environment of that function, which in the example code given in the question contains only the argument x. As above, a quick fix is to specify the global environment explicitly by adding the argument envir=globalenv(). The same applies to mget().
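The scoping behaviour is easy to check with a toy function. The names below (outer_value, inner_value, list_names) are made up for this sketch. For mget() I use inherits = TRUE here, which searches the enclosing environments and is an alternative to passing envir=globalenv().

```r
outer_value <- 1  # defined outside the function

list_names <- function(x) {
  inner_value <- 2
  # ls() without arguments only reports the function's own variables
  local_names <- ls()
  # mget() with inherits = TRUE also looks in the enclosing environments
  found <- mget("outer_value", inherits = TRUE)
  list(local = local_names, found = found)
}

seen <- list_names(x = 0)
seen$local
# "inner_value" "x"
seen$found$outer_value
# 1
```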
Since no MWE was provided, I am taking the liberty of adapting the "fake data" example code provided in Luke C's answer.
# Populate environment
namelist <- paste(fruit = rep(c("orange", "apple", "grape"), 2),
nums = rep(c("_ABC", "_BCD"), each = 3), sep = "")
for(x in namelist)
assign(x, data.frame(a = 1:4, b = 11:14))
# The following re-generates the list of fruits used above
grouplist <- unique(unlist(lapply(strsplit(namelist, "_"), function (x) { x[[1]] })))
# Group and rbind by prefix, suppressing output
invisible(lapply(grouplist,
function(x) {
grouped <- do.call(rbind,
mget(ls(pattern=paste0("^", x, "_"), envir=globalenv()),
envir=globalenv()))
assign(x, grouped, envir=globalenv())
}))
If your fruitlist is a named list of data frames, maybe this will suit.
First, get the like names into their own list:
fruit.groups <- split(names(fruitlist),
sapply(strsplit(names(fruitlist), split = "_"), "[[", 1))
> fruit.groups
$apple
[1] "apple_ABC" "apple_BCD"
$grape
[1] "grape_ABC" "grape_BCD"
$orange
[1] "orange_ABC" "orange_BCD"
Then, use lapply to rbind by group:
fdf <- lapply(fruit.groups, function(x){
out <- do.call(rbind, fruitlist[x])
out$from <- gsub("(\\..*)", "", rownames(out))
rownames(out) <- NULL
return(out)
})
> fdf$apple
a b from
1 1 11 apple_ABC
2 2 12 apple_ABC
3 3 13 apple_ABC
4 4 14 apple_ABC
5 1 11 apple_BCD
6 2 12 apple_BCD
7 3 13 apple_BCD
8 4 14 apple_BCD
Fake data:
namelist <- paste(fruit = rep(c("orange", "apple", "grape"), 2),
nums = rep(c("_ABC", "_BCD"), each = 3), sep = "")
# llply() comes from the plyr package
fruitlist <- llply(namelist, function(x){
assign(as.character(x), data.frame(a = 1:4, b = 11:14))
})
EDIT:
From the edits to your question above:
If you have the fruits and suffixes, use expand.grid to get all possible combinations (assuming that all combinations will refer to the dynamically generated data frames).
fruits <- c("orange", "apple", "grape")
suffixes <- c("_ABC", "_BCD")
fullnames <- apply(expand.grid(fruits, suffixes), 1, paste, collapse = "")
Using that list of names, use mget to generate a list of the present dataframes.
new_fruit_df_list <- mget(fullnames)
Then, the code from above should work, modified here to reflect the name changes:
fruit.groups <- split(names(new_fruit_df_list),
sapply(strsplit(names(new_fruit_df_list), split = "_"), "[[", 1))
fdf <- lapply(fruit.groups, function(x){
out <- do.call(rbind, new_fruit_df_list[x])
out$from <- gsub("(\\..*)", "", rownames(out))
rownames(out) <- NULL
return(out)
})
Have a look at the head of each, with the added column (remove if you don't want it) showing the name of that row's original data frame.
> lapply(fdf, head, 2)
$apple
a b from
1 1 11 apple_ABC
2 2 12 apple_ABC
$grape
a b from
1 1 11 grape_ABC
2 2 12 grape_ABC
$orange
a b from
1 1 11 orange_ABC
2 2 12 orange_ABC
Give this a try:
file_groups <- ls()[grep(".*_.*", ls())]
file_groups <- unique(gsub("(.*)_.*", "\\1", file_groups)) # one entry per prefix
df_list <- lapply(file_groups,
function(x){ do.call(rbind, mget(ls(pattern = paste0(x, "*"))))})
I am struggling a bit with a probably fairly simple task. I want to create a function whose arguments are a dataframe (df), column names of the dataframe (T and R), and values in the selected columns (a and b). I know that the function reads the dataframe, but I don't know how the columns are selected. I'm getting an error.
fun <- function(df,T,a,R,b)
{
col <- ds[c("x","y")]
omit <- na.omit(col)
data1 <- omit[omit$x == 'a',]
data2 <- omit[omit$x == 'b',]
nrow(data2)/nrow(data1)
}
fun(jugs,Place,UK,Price,10)
I'm new to the R language, so please help me.
There are several errors you're making.
col <- ds[c("x","y")]
What are x and y? Presumably they're arguments that you're passing, but you specify T and R in your function, not x and y.
data1 <- omit[omit$x == 'a',]
data2 <- omit[omit$x == 'b',]
Again, presumably, you want a and b to be arguments you passed to the function, but you specified 'a' and 'b' which are specific, not general arguments. Also, I assume that second "omit$x" should be "omit$y" (or vice versa). And actually, since you just made this into a new data frame with two columns, you can just use the column index.
nrow(data2)/nrow(data1)
You should print this line, or return it. Either one should suffice.
fun(jugs,Place,UK,Price,10)
Finally, you should use quotes on Place, UK, and Price, at least the way I've done it.
fun <- function(df, col1, val1, col2, val2){
new_cols <- df[,c(col1, col2)]
omit <- na.omit(new_cols)
data1 <- omit[omit[,1] == val1,]
data2 <- omit[omit[,2] == val2,]
print(nrow(data2)/nrow(data1))
}
fun(jugs, "Place", "UK", "Price", 10)
And if I understand what you're trying to do, it may be easier to avoid creating multiple dataframes that you don't need and just use counts instead.
fun <- function(df, col1, val1, col2, val2){
new_cols <- df[,c(col1, col2)]
omit <- na.omit(new_cols)
n1 <- sum(omit[,1] == val1)
n2 <- sum(omit[,2] == val2)
print(n2/n1)
}
fun(jugs, "Place", "UK", "Price", 10)
I would write this function as follows:
fun <- function(df,T,a,R,b) {
data <- na.omit(df[c(T,R)]);
sum(data[[R]]==b)/sum(data[[T]]==a);
};
As you can see, you can combine the first two lines into one, because in your code col was not reused anywhere. Secondly, since you only care about the number of rows of the two subsets of the intermediate data.frame, you don't actually need to construct those two data.frames; instead, you can just compute the logical vectors that result from the two comparisons, and then call sum() on those logical vectors, which naturally treats FALSE as 0 and TRUE as 1.
Demo:
fun <- function(df,T,a,R,b) { data <- na.omit(df[c(T,R)]); sum(data[[R]]==b)/sum(data[[T]]==a); };
df <- data.frame(place=c(rep(c('p1','p2'),each=4),NA,NA), price=c(10,10,20,NA,20,20,20,NA,20,20), stringsAsFactors=F );
df;
## place price
## 1 p1 10
## 2 p1 10
## 3 p1 20
## 4 p1 NA
## 5 p2 20
## 6 p2 20
## 7 p2 20
## 8 p2 NA
## 9 <NA> 20
## 10 <NA> 20
fun(df,'place','p1','price',20);
## [1] 1.333333
I have a bunch of DF named like: df1, df2, ..., dfN
and lt1, lt2, ..., ltN
I would like to merge them in a loop, something like:
for (X in 1:N){
outputX <- merge(dfX, ltX, ...)
}
But I have some trouble getting the names of output, dfX, and ltX to change in each iteration. I realize that plyr/data.table/reshape might have an easier way, but I would like the for loop to work.
Perhaps I should clarify. The DFs are quite large, which is why plyr etc. will not work (they crash). I would like to avoid copying.
The next step in the code is to save the merged DF.
This is why I prefer the for-loop approach, since I know what each merged DF is named in the environment.
You can combine data frames into lists and use mapply, as in the example below:
i <- 1:3
d1.a <- data.frame(i=i,a=letters[i])
d1.b <- data.frame(i=i,A=LETTERS[i])
i <- 11:13
d2.a <- data.frame(i=i,a=letters[i])
d2.b <- data.frame(i=i,A=LETTERS[i])
L1 <- list(d1.a, d2.a)
L2 <- list(d1.b, d2.b)
mapply(merge,L1,L2,SIMPLIFY=F)
# [[1]]
# i a A
# 1 1 a A
# 2 2 b B
# 3 3 c C
#
# [[2]]
# i a A
# 1 11 k K
# 2 12 l L
# 3 13 m M
If you'd like to save each of the resulting data frames in the global environment (I'd advise against it though), you could do:
result <- mapply(merge,L1,L2,SIMPLIFY=F)
names(result) <- paste0('output',seq_along(result))
which will give a name to every data frame in the list, and then:
sapply(names(result),function(s) assign(s,result[[s]],envir = globalenv()))
Please note that this is a base R solution that does essentially the same thing as your sample code.
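If the goal is only to reach each merged data frame by a known name, base R's list2env() can copy a named list into an environment in one call. The sketch below uses a separate environment (store, a made-up name) and two stand-in data frames rather than the merge results, so nothing in the global environment is touched.

```r
# Stand-ins for the merged data frames, already named
result <- list(output1 = data.frame(i = 1:3, a = letters[1:3]),
               output2 = data.frame(i = 11:13, a = letters[11:13]))

# Copy every list element into a dedicated environment under its list name
store <- new.env()
list2env(result, envir = store)

sort(ls(store))
# "output1" "output2"
store$output2$i
# 11 12 13
```

Passing envir = globalenv() instead would create output1, output2, ... as global variables, with the same caveat as above.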
If your data frames are in a list, writing a for loop is trivial:
# lt = list(lt1, lt2, lt3, ...)
# if your data is very big, this may run you out of memory
lt = lapply(ls(pattern = "lt[0-9]*"), get)
merged_data = merge(lt[[1]], lt[[2]])
for (i in seq_along(lt)[-(1:2)]) { # skips the loop when there are only two data frames
merged_data = merge(merged_data, lt[[i]])
save(merged_data, file = paste0("merging", i, ".rda"))
}
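For the pairing asked about in the question (dfX with ltX), a for loop over the index can look up both objects by name with get() and store each result under a predictable name. This is a sketch with made-up toy data and an assumed shared key column id. It keeps the results in a named list instead of creating outputX variables.

```r
# Toy stand-ins for df1..dfN and lt1..ltN (the key column "id" is an assumption)
df1 <- data.frame(id = 1:3, x = c(10, 20, 30))
lt1 <- data.frame(id = 1:3, y = c("a", "b", "c"))
df2 <- data.frame(id = 4:5, x = c(40, 50))
lt2 <- data.frame(id = 4:5, y = c("d", "e"))
N <- 2

# Pre-allocate a named list so every result has a known name
outputs <- vector("list", N)
names(outputs) <- paste0("output", seq_len(N))

for (i in seq_len(N)) {
  # get() fetches the object whose name is built from the loop index
  outputs[[i]] <- merge(get(paste0("df", i)), get(paste0("lt", i)), by = "id")
}

names(outputs)
# "output1" "output2"
```

Each merged table can then be saved inside the loop, for example with save() or saveRDS(), and removed again if memory is tight.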
I am struggling to make my apply() work: I have two dataframes:
from <- c(1,2,3)
to <- c(2,3,4)
df1 <- data.frame(from, to)
long <-c(9,9.2,9.4,9.6)
lat <- c(45,45.2,45.4,45.6)
id <- c(1,2,3,4)
df2 <- data.frame(long, lat, id)
Now I want something like this:
myFunction <- function(arg){
>>> How do I access arg$from and arg$to? <<<<
}
apply(df1,1,myFunction)
In myFunction I need to make some calculations and return a value for each from-to pair. I don't understand how to access parts of the arg, since arg[0] gives me numeric(0) and arg$from just crashes.
The problem is that apply(...) requires a matrix or array as the first argument. If you pass a dataframe, it will coerce that to a matrix. Matrices are 1 indexed, so the upper left element is [1,1], not [0,0]. Also, matrix columns cannot be referenced using the $ notation.
So,
f <- function(x) {
from <- x[1]
to <- x[2]
# do stuff with from and to...
}
apply(df1,1,f)
would work.
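Since apply() hands each row over as a named vector, the elements can also be picked by column name instead of position. A small sketch using the df1 from the question (the name step is arbitrary):

```r
df1 <- data.frame(from = c(1, 2, 3), to = c(2, 3, 4))

# Each row arrives as a named numeric vector, so [[ ]] with a name works
step <- apply(df1, 1, function(x) x[["to"]] - x[["from"]])
step
# 1 1 1
```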
One other thing to watch out for is that if your dataframe has (other) columns that have character strings, the conversion will make everything character (including the numbers!). This is because, by definition, all elements of a matrix must have the same data type. Your example does not have that problem, though.
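The coercion is easy to see with a toy data frame that has one character column (mixed and the other names are made up for this sketch). Iterating over the columns with mapply() is one way to keep the numbers numeric.

```r
mixed <- data.frame(from = c(1, 2), to = c(2, 3), label = c("a", "b"),
                    stringsAsFactors = FALSE)

# apply() converts the data frame to a matrix first, so every row is character
row_types <- apply(mixed, 1, function(x) class(x))
row_types

# mapply() walks over the chosen columns in parallel and keeps their types
sums <- mapply(function(from, to) from + to, mixed$from, mixed$to)
sums
# 3 5
```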
Try mapply(). It's a multivariate version of sapply(). For example:
> myFunction <- function(arg1, arg2){
+ return(sum(arg1, arg2))
+ }
>
> mapply(myFunction, df1$from, df1$to)
[1] 3 5 7
You can also use it to make a new variable in your data frame.
> df1$newvar <- mapply(myFunction, df1$from, df1$to)
> df1
from to newvar
1 1 2 3
2 2 3 5
3 3 4 7