I work with a large count table, and my analyses usually require splitting this table into subsets based on observations, variables, values or context information.
# generating toy data
count_df1 <- data.frame(
column1 = c(1:50),
column2 = runif(50, 1, 10),
column3 = runif(50, 1, 10)
)
count_df2 <- data.frame(
column1 = c(1:50),
column2 = runif(50, 1.5, 9),
column3 = runif(50, 1.5, 9)
)
list_count_df <- list(count_df1 = count_df1, count_df2 = count_df2)
I learned to use lists and for loops to process all resulting subsets in the same manner. I prefer for loops over apply because I use the names of the objects (with the help of counters) to keep track of how I modified them, and I don't know how to do this with e.g. lapply.
# set values to iterate over
thresholds <- c(2, 4)
conditions <- c(TRUE, FALSE)
# perform some kind of subsetting and store the parameters used
output_list <- list()
counter <- 0
for (current_threshold in thresholds) {
for (count_df in list_count_df) {
counter <- counter + 1
# modify the name to keep track of changes
current_name <- paste(names(list_count_df)[counter], current_threshold, sep = "_")
# subset the data frame of the current iteration (not always count_df1)
output_list[[current_name]] <- subset(count_df, column2 < current_threshold)
}
counter <- 0
}
Additionally, the time-consuming part is usually the main function in the loop body, so replacing the loop with apply to reduce overhead would probably not save much time (I'm still open to this).
After I'm done preparing the various subsets and subjecting them to the analysis, I need to store the analysis results and the accompanying parameters for the different subsets. That is probably a common task.
# allocate for df to store the results
result_length <- length(output_list) * length(conditions)
df_headers <- c("Names", "Threshold", "Input_table", "Standard_deviation", "Scaling")
df_results <- setNames(data.frame(matrix(ncol = length(df_headers),
nrow = result_length)), df_headers)
# perform some analyses (here: PCA) on the dfs while looping over
# analysis parameters and storing some results directly
iii <- 0
table_counter <- 0
for (item in output_list) {
table_counter <- table_counter + 1
for (condition in conditions) {
iii <- iii + 1
current_name <- paste(names(output_list)[table_counter], condition, sep = "_")
tmp <- prcomp(item, scale = condition)
# let's pretend we are only interested in standard deviation per item
df_results[iii, 1] <- current_name
df_results[iii, 4] <- tmp$sdev[1]
rm(tmp)
}
}
However, I'm partly doing this by extracting parts of the name of the object, which is highly repetitive and also very custom and has to be changed for each additional step included beforehand. As I want to start my own package soon, this is nothing another user could easily follow.
# extract more values from the name of the former object
df_results$Threshold <- as.numeric(sapply(strsplit(as.character(df_results$Names), '_'), "[", 3))
df_results$Input_table <- as.factor(sapply(strsplit(as.character(df_results$Names), '_'), "[", 2))
df_results$Scaling <- as.factor(sapply(strsplit(as.character(df_results$Names), '_'), "[", 4))
df_results
# now I could convert this into long format, do plotting etc.
I provided a short example of what such a workflow could look like. My questions are:
1) What are the general good practices on how to store parameters used for and how to extract them after processing?
2) If the solution is too case-specific for a general approach:
a) any ideas what to change here?
b) Are lists and/or for loops the way to go at all?
I do it because modifying names in lapply is unclear to me, and without this I lose track of what is what. I also would not know how to efficiently handle all these different subsets in one big data.frame.
Please consider that my original data contains numerical, factor and character columns with hundreds of rows/observations and tens of thousands of columns/variables.
Honestly there are many ways to do this and it will come down to personal preference. One common way would be to define a class object that will set the standard of how you access information on it. Creating a class means that you can make S3 methods too. This could help give more flexibility on how you generate your class depending on if you are working on a list, df, or just a vector.
generate_foo <- function(x, ...){
UseMethod("generate_foo")}
generate_foo.default <- function(x, current_threshold, conditions, name = NULL){
if(is.null(name)){
name <- as.character(substitute(x))
}
x <- x[x[["column2"]]<current_threshold,]
tmp <- tryCatch({prcomp(x, scale = conditions)}, error=function(er){return("Error")})
retval <- list(list(subset = x,
pcaObj = tmp, #could store the entire object or just the parts you care about.
subsetparam = current_threshold,
condition = conditions,
name = name))
class(retval[[1]]) <- "foo" # tag the inner list so each element carries class "foo"
return(retval)
}
generate_foo.list <- function(x,
current_threshold,
conditions, name = NULL){
if(is.null(name)||length(name)!=length(x)){
name <- names(x)
}
#Generate combinations
combi <- tidyr::separate( # separate() comes from the tidyr package; generates all the possible combination indexes at once
data.frame(
indx = levels(suppressWarnings(interaction(1:length(x),
1:length(current_threshold),
1:length(conditions))))),
col = "indx", into = c("df","thresh","cond"), sep = "\\.")
x <- x[as.numeric(combi$df)]
name <- name[as.numeric(combi$df)]
current_threshold <- current_threshold[as.numeric(combi$thresh)]
conditions <- conditions[as.numeric(combi$cond)]
foolist <- mapply(FUN = generate_foo.default,
x = x,
current_threshold = current_threshold,
conditions = conditions,
name = name)
class(foolist) <- "foolist"
return(foolist)
}
With this method when you call:
foo <- generate_foo(x = list_count_df,
current_threshold = thresholds,
conditions = conditions,
name = c("Custname1","Custname2"))
You will end up with a list of objects with class "foo". Specifically, in this case the resulting object has length 8, and each element contains 5 components: subset, pcaObj, subsetparam, condition and name. Because prcomp sometimes throws an error if the subset is too small, the call is wrapped in tryCatch, which prevents the code from failing.
Take it a step further by writing custom print and summary functions!
#summary
summary.foolist <- function(x){
subsetdim <- unlist(lapply(x, function(y){prod(dim(y[["subset"]]))}))
pcasdev <- unlist(lapply(x, function(y){y[["pcaObj"]]$sdev[1]}))
subsetparam <- unlist(lapply(x, function(y){y[["subsetparam"]]}))
condition <- unlist(lapply(x, function(y){y[["condition"]]}))
name <- unlist(lapply(x,function(y){y[["name"]]}))
df <- data.frame(SubsetDim=subsetdim, PCAsdev=pcasdev, SubsetParam=subsetparam, condition=condition, name = name)
return(df)
}
summary(foo)
SubsetDim PCAsdev SubsetParam condition name
1 24 1.207833 2 TRUE Custname1
2 6 1.732051 2 TRUE Custname2
3 54 1.324284 4 TRUE Custname1
4 33 1.372508 4 TRUE Custname2
5 24 16.258848 2 FALSE Custname1
6 6 12.024556 2 FALSE Custname2
7 54 15.592938 4 FALSE Custname1
8 33 14.057929 4 FALSE Custname2
Using a convention like this ensures your data is stored in a canonical way. Of course there are many ways you can choose to build your custom R class and object.
You could make one function that makes a list of subsetted data frames and set that as one class. Then make another function that performs the analysis and generates a new class object. As long as you stick to building a named list, accessing parts of the object becomes easier because they are organized.
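The two-function idea described above could be sketched roughly as follows. This is only an illustration under assumptions: the function names `make_subsets()` and `analyse_subsets()`, the class name `"subset_list"` and the result column names are made up here, and the subsetting rule (`column2 < threshold`) is borrowed from the question's toy example. The point is that the parameters travel with each subset as list fields, so nothing has to be parsed back out of object names.

```r
# Step 1: build all subsets and keep the parameters next to the data
make_subsets <- function(dfs, thresholds) {
  grid <- expand.grid(input = names(dfs), threshold = thresholds,
                      stringsAsFactors = FALSE)
  out <- lapply(seq_len(nrow(grid)), function(i) {
    d <- dfs[[grid$input[i]]]
    list(input = grid$input[i],
         threshold = grid$threshold[i],
         data = d[d$column2 < grid$threshold[i], , drop = FALSE])
  })
  class(out) <- "subset_list"  # hypothetical class name
  out
}

# Step 2: analyse every subset under every scaling condition
analyse_subsets <- function(subsets, conditions) {
  rows <- lapply(subsets, function(s) {
    do.call(rbind, lapply(conditions, function(cond) {
      # prcomp() can fail on tiny subsets; record NA instead of stopping
      fit <- tryCatch(prcomp(s$data, scale. = cond), error = function(e) NULL)
      data.frame(input = s$input,
                 threshold = s$threshold,
                 scaling = cond,
                 sdev_pc1 = if (is.null(fit)) NA_real_ else fit$sdev[1])
    }))
  })
  do.call(rbind, rows)
}

# Toy data shaped like the question's example
set.seed(1)
toy <- list(
  count_df1 = data.frame(column1 = 1:50,
                         column2 = runif(50, 1, 10),
                         column3 = runif(50, 1, 10)),
  count_df2 = data.frame(column1 = 1:50,
                         column2 = runif(50, 1.5, 9),
                         column3 = runif(50, 1.5, 9))
)
subsets <- make_subsets(toy, thresholds = c(2, 4))
results <- analyse_subsets(subsets, conditions = c(TRUE, FALSE))
```

`results` then has one row per subset and condition, with the input name, threshold and scaling stored in their own columns.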
Functional solution
0. Generate source data frame
# for reproducibility of random tasks
set.seed(1)
df <- data.frame(
col1 = c(1:100),
col2 = c(runif(50,1,10), runif(50,11,20)),
col3 = c(runif(50,1,10), runif(50,11,20))
)
# so half of the rows have numbers 1 to 10 in col2 and col3
# and other have 11 to 20 in col2 and col3.
# let's randomize the order of rows
df <- df[sample(1:100),]
# and take this data frame `df` as our source data frame
# from which we will do the analysis.
1. Problem description
We want to subdivide the original df into sub data frames
by applying 2 different criteria.
Then, we analyze each sub data frame
using all possible combinations of 2 different parameters,
and finally collect all analysis values and aggregate them into a data frame.
The criteria:
criterion1: if the col2 value is <= 10, we assign the row to "df1", otherwise to "df2".
categories c("df1", "df2").
criterion2: if the col3 value is <= the first limit, the row is assigned 'class5'
if the col3 value is > the first limit but <= the second limit, assign 'class15'
other cases don't interest us - let's assign 'other'
categories c("class5", "class15", "other") # each row will be assigned to
one of them
We want a separate sub data frame for each combination of the two criteria,
on which the analysis should be done.
The parameters for the analysis:
parameter1: 'scale.=' c(TRUE, FALSE)
parameter_categories c("sc+", "sc-")
parameter2: 'center=' c(TRUE, FALSE)
parameter_categories c("cen+", "cen-")
The analysis result value:
For each combination of the two parameters we want a separate report of the value
for 'Standard deviation':
3 stddev columns, one each for PC1, PC2 and PC3
Additional information to be collected:
we want a distinguishable (unique) name for each combination
2. What the entire analysis will look like:
# 0. categorize and split data frame
categories1 <- c("df1", "df2")[cut(df[, "col2"], c(1, 11, 20, Inf))]
categories2 <- c("class5", "class15", "other")[cut(df[, "col3"], c(-Inf, 5, 15, Inf))]
dfs <- split(df, gsub("class", "", paste(categories1, categories2, sep="_")))
# 1. Declare parameters and prepare all parameter combinations
parameters1 <- list("scale." = TRUE, "scale."=FALSE)
np1 <- c("scpos", "scneg")
parameters2 <- list("center"=TRUE, "center"=FALSE)
np2 <- c("cpos", "cneg")
params_list <- named_cross_combine(parameters1, parameters2, np1, np2, sep="_")
# 2. Apply analysis over all sub dfs and parameter combinations
# and extract and aggregate analysis results into a final data frame
df_final <- apply_extract_aggravate(
dfs=dfs,
params=params_list,
analyzer_func=prcomp,
extractor_func=function(x) x$sdev, # extractor must return a vector
col_names=c("df", "limits", "scale", "center", "std_PC1", "std_PC2", "std_PC3"),
sep="_" # separator for names
)
# 3. rename parameter column contents
df_final$scale <- unlist(lookup(df_final$scale, np1, parameters1))
df_final$center <- unlist(lookup(df_final$center, np2, parameters2))
df_final:
df limits scale center std_PC1 std_PC2 std_PC3
df1_15_scpos_cpos df1 15 TRUE TRUE 1.205986 0.9554013 0.7954906
df1_15_scpos_cneg df1 15 TRUE FALSE 1.638142 0.5159250 0.2243043
df1_15_scneg_cpos df1 15 FALSE TRUE 15.618145 2.4501942 1.3687843
df1_15_scneg_cneg df1 15 FALSE FALSE 31.425246 5.9055013 1.7178626
df1_5_scpos_cpos df1 5 TRUE TRUE 1.128371 1.0732246 0.7582659
df1_5_scpos_cneg df1 5 TRUE FALSE 1.613217 0.4782639 0.4108470
df1_5_scneg_cpos df1 5 FALSE TRUE 13.525868 2.5524661 0.9894493
df1_5_scneg_cneg df1 5 FALSE FALSE 30.007511 3.9094993 1.6020638
df2_15_scpos_cpos df2 15 TRUE TRUE 1.129298 1.0069030 0.8431092
df2_15_scpos_cneg df2 15 TRUE FALSE 1.720909 0.1523516 0.1235295
df2_15_scneg_cpos df2 15 FALSE TRUE 14.061532 2.4172787 1.2348606
df2_15_scneg_cneg df2 15 FALSE FALSE 80.543382 3.8409639 1.8480111
df2_other_scpos_cpos df2 other TRUE TRUE 1.090057 0.9588241 0.9446865
df2_other_scpos_cneg df2 other TRUE FALSE 1.718190 0.1881516 0.1114570
df2_other_scneg_cpos df2 other FALSE TRUE 15.168160 2.5579403 1.3354016
df2_other_scneg_cneg df2 other FALSE FALSE 82.297724 5.0580949 1.9356444
3. Explanation step by step
3.1 Declare Helper functions
# for preparing parameter combinations as lists
named_cross_combine <- function(seq1, seq2, seq1_names, seq2_names, sep="_") {
res <- list()
i <- 1
namevec <- c()
for (j1 in seq_along(seq1)) {
for (j2 in seq_along(seq2)) {
res[[i]] <- c(seq1[j1], seq2[j2])
namevec[i] <- paste0(seq1_names[j1], sep, seq2_names[j2])
i <- i + 1
}
}
names(res) <- namevec
res
}
# correctly named params list - `sep=` determines how names are joined
# you can apply `gsub()` on the namevec before assignment to adjust further the names.
# useful for doing analysis
do.call2 <- function(fun, x, rest) {
do.call(fun, c(list(x), rest))
}
apply_parameters <- function(funcname,
dfs,
params) {
# use the `params` argument here, not the global `params_list`
lapply(dfs, function(df) lapply(params, function(pl) do.call2(funcname, df, pl)))
}
split_names_to_data_frame <- function(names_vec, sep) {
res <- lapply(names_vec, function(s) strsplit(s, sep)[[1]])
df <- Reduce(rbind, res)
# colnames(df) <- col_names
rownames(df) <- names_vec
df
}
apply_to_subdf_and_combine <- function(
res_list,
accessor_func=function(x) x, # subdf result
subdf_level_combiner_func=as.data.frame, # within subdf result
combine_prepare_func=function(x) x, # applied on each subdf result
final_combiner_func=rbind, # combine the results
col_names=NULL, # column names for final
sep="_") { # joiner for names
res_accessed_combined <- lapply(res_list,
function(x) do.call(what=subdf_level_combiner_func,
list(lapply(x, accessor_func))))
res_prepared <- lapply(res_accessed_combined, combine_prepare_func)
res_df <- Reduce(final_combiner_func, res_prepared)
rownames(res_df) <- paste(unlist(sapply(names(res_prepared), rep, nrow(res_prepared[[1]]))),
unlist(sapply(res_prepared, rownames)),
sep = sep)
names_df <- split_names_to_data_frame(rownames(res_df), sep = sep)
final_df <- as.data.frame(cbind(names_df, res_df))
if (!is.null(col_names)) {
colnames(final_df) <- col_names
}
final_df
}
# for simplifying the function call
extract_and_combine <- function(res_list,
result_extractor_func,
col_names,
sep="_") {
apply_to_subdf_and_combine(
res_list = res_list,
accessor_func = result_extractor_func,
subdf_level_combiner_func=as.data.frame,
combine_prepare_func=function(x) as.data.frame(t(x)),
final_combiner_func=rbind,
col_names=col_names,
sep=sep
)
}
# for even more simplifying function call
apply_extract_aggravate <- function(dfs,
params,
analyzer_func,
extractor_func,
col_names,
sep="_") {
extract_and_combine(
res_list=apply_parameters(funcname=analyzer_func, dfs=dfs, params=params),
result_extractor_func=extractor_func,
col_names=col_names,
sep=sep
)
}
# useful for renaming the data frame columns values
lookup <- function(x, seq1, seq2) {
seq2[sapply(x, function(x) which(x == seq1))]
}
3.2 Categorize and split data frame
categories1 <- c("df1", "df2")[cut(df[, "col2"], c(1, 11, 20, Inf))]
categories2 <- c("5", "15", "other")[cut(df[, "col3"], c(-Inf, 5, 15, Inf))]
dfs <- split(df, gsub("class", "", paste(categories1, categories2, sep="_")))
But to have full control over categorization, you can
declare your own categorizer functions and categorize and
split the data frame:
# write the rule for criterion 1 as a function of a single element
categorizer1 <- function(x) {
if (1 <= x && x <= 10) {
"df1"
} else if (11 <= x && x <= 20) {
"df2"
} else {
NA_character_ # values outside both ranges; avoids an implicit NULL
}
}
# vectorize it to be able to apply it on entire columns
categorizer1 <- Vectorize(categorizer1)
# do the same for criterion 2
categorizer2 <- function(x) {
if (x <= 5) {
"class5"
} else if (5 < x && x <= 15) {
"class15"
} else {
"other"
}
}
categorizer2 <- Vectorize(categorizer2)
# apply on col2 and col3 the corresponding categorizers
categories1 <- categorizer1(df[,"col2"])
categories2 <- categorizer2(df[,"col3"])
# get the list of sub data frames according to categories
dfs <- split(df, gsub("class", "", paste(categories1, categories2, sep="_")))
# Let the categorizer functions return strings and
# for the second argument use `paste()` with `sep=` to determine
# how the names should be combined - here with "_".
# Use `gsub(pattern, replacement, x, ignore.case=F, perl=T)`
# to process the name using regex patterns to how you want it at the end.
# Here, we remove the bulky "class".
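As a small illustration of the naming step described in the comments above, this is what `paste()` and `gsub()` do to a handful of category labels. The two input vectors are invented for the example.

```r
# Invented category labels for three rows
categories1 <- c("df1", "df2", "df2")
categories2 <- c("class5", "class15", "other")

# Join the labels with "_" ...
raw_names <- paste(categories1, categories2, sep = "_")
# ... and strip the bulky "class" prefix
clean_names <- gsub("class", "", raw_names)
clean_names
# "df1_5" "df2_15" "df2_other"
```

`split()` then uses these strings both to group the rows and to name the resulting list elements.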
3.3 Declare parameters as lists and their corresponding names in filename
parameters1 <- list("scale." = TRUE, "scale."=FALSE)
np1 <- c("scpos", "scneg")
parameters2 <- list("center"=TRUE, "center"=FALSE)
np2 <- c("cpos", "cneg")
# prepare all combinations of them in a list of lists
params_list <- named_cross_combine(parameters1, parameters2, np1, np2, sep="_")
# this produces a list of all possible parameter combination lists.
# Each parameter combination has to be kept itself in a list, because
# `do.call()` later requires the parameters being in a list.
# `named_cross_combine()` takes care of correct naming,
# joining the names using `sep` values.
# The first element in `parameters1` is taken and is paired with each of
# `parameters2`. Then the second of `parameters1` through all `parameters2`, etc.
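To make the remark about `do.call()` concrete, here is a minimal sketch of how a single parameter combination, kept as a named list, is passed to `prcomp()`. The data frame `toy` and the name `param_set` are invented for this illustration; only `prcomp()` and `do.call()` are real base/stats functions.

```r
set.seed(1)
toy <- data.frame(a = rnorm(20), b = rnorm(20), c = rnorm(20))

# One parameter combination, as a named list
param_set <- list("scale." = TRUE, "center" = FALSE)

# do.call() expects all arguments in one list: the data first, then the parameters
fit_via_list <- do.call(prcomp, c(list(toy), param_set))

# This is the same call written out by hand
fit_direct <- prcomp(toy, scale. = TRUE, center = FALSE)

same_result <- isTRUE(all.equal(fit_via_list$sdev, fit_direct$sdev))
```

Because the parameters are ordinary list elements, a list of such lists can be iterated over with `lapply()` without writing out each call.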
3.4 Apply all parameters over dfs and collect the results into a data frame
df_final <- apply_extract_aggravate(
dfs=dfs,
params=params_list,
analyzer_func=prcomp,
extractor_func=function(x) x$sdev, # extractor must return a vector
col_names=c("df", "limits", "scale", "center", "std_PC1", "std_PC2", "std_PC3"),
sep="_" # separator for names
)
# This function takes the dfs and the parameters list and runs the
# analyzer_func, here `prcomp()`, over all combinations of both.
# The `extractor_func` must be chosen in a way that the returned result is a vector.
# If it is already a vector, set here `function(x) x` the identity function.
# The column names should give new names to the resulting columns.
# The number of the names are determined by:
# - the number of categoriesN,
# - the number of parametersN,
# - the number of elements of result after extractor_func() was applied.
# `sep=` determines which joiner is used for joining the names.
3.5 Finally, rename parameter columns' contents by using lookup() + previously declared parameter lists (parametersN) with their corresponding name vectors (npN)
df_final$scale <- unlist(lookup(df_final$scale, np1, parameters1))
df_final$center <- unlist(lookup(df_final$center, np2, parameters2))
# Two parameter columns, so two commands.
This converts df_final from this:
# df limits scale center std_PC1 std_PC2 std_PC3
# df1_15_scpos_cpos df1 15 scpos cpos 1.205986 0.9554013 0.7954906
# df1_15_scpos_cneg df1 15 scpos cneg 1.638142 0.5159250 0.2243043
# df1_15_scneg_cpos df1 15 scneg cpos 15.618145 2.4501942 1.3687843
# df1_15_scneg_cneg df1 15 scneg cneg 31.425246 5.9055013 1.7178626
# df1_5_scpos_cpos df1 5 scpos cpos 1.128371 1.0732246 0.7582659
# df1_5_scpos_cneg df1 5 scpos cneg 1.613217 0.4782639 0.4108470
# df1_5_scneg_cpos df1 5 scneg cpos 13.525868 2.5524661 0.9894493
# df1_5_scneg_cneg df1 5 scneg cneg 30.007511 3.9094993 1.6020638
# df2_15_scpos_cpos df2 15 scpos cpos 1.129298 1.0069030 0.8431092
# df2_15_scpos_cneg df2 15 scpos cneg 1.720909 0.1523516 0.1235295
# df2_15_scneg_cpos df2 15 scneg cpos 14.061532 2.4172787 1.2348606
# df2_15_scneg_cneg df2 15 scneg cneg 80.543382 3.8409639 1.8480111
# df2_other_scpos_cpos df2 other scpos cpos 1.090057 0.9588241 0.9446865
# df2_other_scpos_cneg df2 other scpos cneg 1.718190 0.1881516 0.1114570
# df2_other_scneg_cpos df2 other scneg cpos 15.168160 2.5579403 1.3354016
# df2_other_scneg_cneg df2 other scneg cneg 82.297724 5.0580949 1.9356444
to this:
df limits scale center std_PC1 std_PC2 std_PC3
df1_15_scpos_cpos df1 15 TRUE TRUE 1.205986 0.9554013 0.7954906
df1_15_scpos_cneg df1 15 TRUE FALSE 1.638142 0.5159250 0.2243043
df1_15_scneg_cpos df1 15 FALSE TRUE 15.618145 2.4501942 1.3687843
df1_15_scneg_cneg df1 15 FALSE FALSE 31.425246 5.9055013 1.7178626
df1_5_scpos_cpos df1 5 TRUE TRUE 1.128371 1.0732246 0.7582659
df1_5_scpos_cneg df1 5 TRUE FALSE 1.613217 0.4782639 0.4108470
df1_5_scneg_cpos df1 5 FALSE TRUE 13.525868 2.5524661 0.9894493
df1_5_scneg_cneg df1 5 FALSE FALSE 30.007511 3.9094993 1.6020638
df2_15_scpos_cpos df2 15 TRUE TRUE 1.129298 1.0069030 0.8431092
df2_15_scpos_cneg df2 15 TRUE FALSE 1.720909 0.1523516 0.1235295
df2_15_scneg_cpos df2 15 FALSE TRUE 14.061532 2.4172787 1.2348606
df2_15_scneg_cneg df2 15 FALSE FALSE 80.543382 3.8409639 1.8480111
df2_other_scpos_cpos df2 other TRUE TRUE 1.090057 0.9588241 0.9446865
df2_other_scpos_cneg df2 other TRUE FALSE 1.718190 0.1881516 0.1114570
df2_other_scneg_cpos df2 other FALSE TRUE 15.168160 2.5579403 1.3354016
df2_other_scneg_cneg df2 other FALSE FALSE 82.297724 5.0580949 1.9356444
4. Final remarks
This is not very different from your approach. All information is collected in the names, and the names are then used to generate the part of the data frame which describes the background of the analysis data.
The lookup() function is very useful for renaming the columns for the parameters.
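For readers who prefer base R idioms, roughly the same translation from name fragments back to parameter values can be sketched with `match()`. This is an alternative illustration, not the helper defined above; the vector `labels` is invented.

```r
np1 <- c("scpos", "scneg")
parameters1 <- list("scale." = TRUE, "scale." = FALSE)

# Name fragments as they would appear in a parameter column
labels <- c("scpos", "scneg", "scneg", "scpos")

# match() gives the position of each label in np1; use it to index the values
values <- unlist(parameters1[match(labels, np1)], use.names = FALSE)
values
# TRUE FALSE FALSE TRUE
```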
The categorization of a column can be greatly simplified with the cut() function. But with cut() the choice of which limit is inclusive (the right= argument) applies to all intervals at once, so you don't have full control over
whether each upper/lower limit is included (<=) or not (<).
That is why declaring your own categorizer functions can sometimes be an advantage (especially for more complex categorizations).
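A short sketch of how `cut()` treats interval limits, using invented values that sit exactly on the breaks. The `right=` argument is a real argument of `cut()`; the vectors `x`, `breaks` and `labels` are made up for the example.

```r
x <- c(5, 10, 15)
breaks <- c(0, 5, 15, Inf)
labels <- c("class5", "class15", "other")

# Default (right = TRUE): intervals are (0,5], (5,15], (15,Inf]
right_closed <- labels[cut(x, breaks)]
right_closed
# "class5" "class15" "class15"

# right = FALSE: intervals are [0,5), [5,15), [15,Inf)
left_closed <- labels[cut(x, breaks, right = FALSE)]
left_closed
# "class15" "class15" "other"
```

The setting applies to every interval in the call, which is why mixed rules (for example `<=` at one limit and `<` at another) need a custom categorizer.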
Extensibility
More categories: Just define more categories categories1 categories2 categories3 ...
# then do
dfs <- split(df, paste(categories1, categories2, categories3, ..., sep="_"))
# use `gsub()` around `paste()` or do
# names(dfs) <- gsub("search_term", "replace_term", names(dfs)) - over and over again
# until all names are as they should be.
More parameters: Just define more parametersN - npN pairs.
# then do
params_list <- named_cross_combine(parameters1, parameters2, np1, np2, sep="_")
params_list <- named_cross_combine(params_list, parameters3, names(params_list), np3, sep="_")
params_list <- named_cross_combine(params_list, parameters4, names(params_list), np4, sep="_")
... (and so on ...)
# use then at the end more lines for renaming parameter column contents:
df_final[, prmcol_name1] <- unlist(lookup(df_final[, prmcol_name1], np1, parameters1))
df_final[, prmcol_name2] <- unlist(lookup(df_final[, prmcol_name2], np2, parameters2))
df_final[, prmcol_name3] <- unlist(lookup(df_final[, prmcol_name3], np3, parameters3))
... (and so on ...)
Thus, the number of categories and parameters is easily extensible.
The core helper functions stay the same and don't have to be modified.
(The use of higher-order functions (functions which take functions as arguments) as helper functions is key to their flexibility - one of the strengths of functional programming.)
I need to generate a Data Frame in R from the below Excel Table.
Every time I modify one of the values from column Value the variable Score will have a different value (the cell is protected so I cannot see the formula).
The idea is to generate enough samples to check the main sources of variability, and perform some basic statistics.
I think the only way would be to manually modify the variables in the column Value and annotate the result from Score in the data frame.
The main issue I am having is that I am not used to working with data in this format, and because of this I am finding it difficult to visualize how I should structure the data frame.
I am getting stuck because the variable Score depends on 5 different Stages (where each one of them has 2 different variables) and a set of dimensions with 7 different variables.
I tried the way I usually create data frames, starting with the vectors, but it feels wrong and I cannot see how I can represent the relationships between the different variables.
stage <- c('Inspection','Cut','Assembling','Test','Labelling','Dimensions')
variables <- c('Experience level', 'Equipement', 'User','Length','Wide','Length Body','Width Body','Tape Wing','Tape Body','Clip')
range <- c('b','m','a','UA','UB','UC') # ?? not sure what to do about the range ??
Could anybody help me with the logic on how this should be modelled?
As suggested by #Gregor, to resolve your main issue consider building a data frame of all needed values in respective columns. Then run each row to produce Score.
Specifically, to build the needed data frame from the inputs in the Excel table, consider Map (a wrapper to mapply) and the data.frame constructor on equal-length lists or vectors of 17 items:
Excel Table Inputs
# VECTOR OF 17 CHARACTER ITEMS
stage_list <- c(rep("Inspection", 2),
rep("Cut", 2),
rep("Assembling", 2),
rep("Test", 2),
rep("Labelling", 2),
rep("Dimensions", 7))
# VECTOR OF 17 CHARACTER ITEMS
exp_equip <- c("Experience level", "Equipement")
var_list <- c(rep(exp_equip, 3),
c("User", "Equipement"),
exp_equip,
c("Length", "Wide", "Length body", "Width body",
"Tape wing", "Tape body", "Clip"))
# LIST OF 17 VECTORS
bma_range <- c("b", "m", "a")
noyes_range <- c("no", "yes")
range_list <- c(replicate(6, bma_range, simplify=FALSE),
list(c("UA", "UB", "UC")),
replicate(3, bma_range, simplify=FALSE),
list(seq(6.5, 9.5, by=0.1)),
list(seq(11.9, 12.1, by=0.1)),
list(seq(6.5, 9.5, by=0.1)),
list(seq(4, 6, by=1)),
replicate(3, noyes_range, simplify=FALSE))
Map + data.frame
df_list <- Map(function(s, v, r)
data.frame(Stage = s, Variable = v, Range = r, stringsAsFactors=FALSE),
stage_list, var_list, range_list, USE.NAMES = FALSE)
# APPEND ALL DFS
final_df <- do.call(rbind, df_list)
head(final_df)
# Stage Variable Range
# 1 Inspection Experience level b
# 2 Inspection Experience level m
# 3 Inspection Experience level a
# 4 Inspection Equipement b
# 5 Inspection Equipement m
# 6 Inspection Equipement a
Score Calculation (using unknown score_function, assumed to take three non-optional args)
# VECTORIZED METHOD
final_df$Score <- score_function(final_df$Stage, final_df$Variable, final_df$Range)
# NON-VECTORIZED/LOOP ROW METHOD
final_df$Score <- sapply(1:nrow(final_df), function(i)
score_function(final_df$Stage[i], final_df$Variable[i], final_df$Range[i]))
# NON-VECTORIZED/LOOP ELEMENTWISE METHOD
final_df$Score <- mapply(score_function, final_df$Stage, final_df$Variable, final_df$Range)
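For the manual workflow described in the question (change the values, then read off Score), a wide layout with one row per tested configuration may be easier to fill in. The sketch below is only an assumption about how that could look: the column names are invented, only a few of the 17 variables are shown, and the random draws merely stand in for whatever settings are actually tried in the spreadsheet.

```r
set.seed(42)  # only to make the illustrative draws repeatable
n_samples <- 5

bma <- c("b", "m", "a")
design <- data.frame(
  Inspection_Experience = sample(bma, n_samples, replace = TRUE),
  Inspection_Equipement = sample(bma, n_samples, replace = TRUE),
  Test_User             = sample(c("UA", "UB", "UC"), n_samples, replace = TRUE),
  Dim_Length            = sample(seq(6.5, 9.5, by = 0.1), n_samples, replace = TRUE),
  Dim_Clip              = sample(c("no", "yes"), n_samples, replace = TRUE),
  Score                 = NA_real_,  # to be filled in by hand from the spreadsheet
  stringsAsFactors = FALSE
)
design
```

Once Score has been filled in, each row is one observation and each input is one column, which is the shape most summary and plotting functions expect.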
I have a data frame that contains groups and logical vectors assessing whether they reside in each area.
# Create data frame
Group = c('Group1', 'Group2', 'Group3', 'Group4')
Area1 = c(TRUE, FALSE, TRUE, FALSE)
Area2 = c(TRUE, TRUE, FALSE, FALSE)
Area3 = c(FALSE, TRUE, FALSE, FALSE)
Area4 = c(FALSE, FALSE, FALSE, TRUE)
df = data.frame(Group, Area1, Area2, Area3, Area4)
# Generate unique combinations of Groups
links <- expand.grid(df$Group, df$Group) #generates all possible combination
links$key <- apply(links, 1, function(x)paste(sort(x), collapse=''))
undirected <- subset(links, !duplicated(links$key))
undirected$ID <- seq.int(nrow(undirected))
For each unique group dyad, I am trying to determine how many areas they share. My desired output is the dyad, the count of number of areas they share, and the names of the areas.
# Desired Output
Group1Group2 1 Area2
Group1Group3 1 Area1
Group1Group4 0 NA
Group2Group3 0 NA
Group2Group4 2 Area3, Area4
Group3Group4 0 NA
I'm not sure that I've understood your question properly. The data structure is confusing. Does the dyad {i=2, j=4} titled Group2Group4 really have Areas 3 and 4 in common? I'd think not.
I'm not sure igraph is really needed here. This could, however, be set up as a bipartite network like G(V₁,V₂,E) differentiating areas ∈ V₁ from groups ∈ V₂ and having dyads running always from areas to groups: eⁱʲ ∈ E; i ∈ V₁; j ∈ V₂. Then you'd get shared areas by listing each group-node's neighbourhood, and the number of shared areas by counting their in-degree.
If you'd really, really like to see that in code, I'll re-post when I have time.
In the meantime, this, I think, does what you want. I'm not winning any code-golfing competitions with this one, but if I understand your question correctly, it does the job:
# Make that same data
Group = c('Group1', 'Group2', 'Group3', 'Group4')
Area1 = c(TRUE, FALSE, TRUE, FALSE)
Area2 = c(TRUE, TRUE, FALSE, FALSE)
Area3 = c(FALSE, TRUE, FALSE, FALSE)
Area4 = c(FALSE, FALSE, FALSE, TRUE)
df = data.frame(Group, Area1, Area2, Area3, Area4)
# Take two groups (by number) and list the areas they have in common
is.shared <- function(i, j){
# Make a dataframe with two rows (one for i and one for j) where
# The order of the areas are multiplied with the boolean that indicates
# if the group resides in area x. If so, set x, if not, set 0.
dyad <- as.data.frame(matrix(rep(2:ncol(df)-1,2), nrow=2, byrow=T)) * df[c(i,j),2:5]
# The shared areas is the intersection of the two sets
shared.areas <- intersect(as.numeric(dyad[1,]), as.numeric(dyad[2,]))
}
# Take a vector of area-numbers and return a string that lists them.
# c(2,4,0) becomes "Area2, Area4".
list.areas <- function(vector){
result = c()
for(area in vector){
if(area != 0){
result <- c(result, paste("Area", area, sep=""))
}
}
paste(result, collapse=", ")
}
# Make a matrix of all possible dyadic combinations (two-way)
dyads <- expand.grid(1:nrow(df), 1:nrow(df)) # both columns index groups (rows of df)
names(dyads) <- c("Group i", "Group j")
# Each row contains a dyad - a pair (i, and j) of groups.
# Generate a unique dyadic key
dyads$Key <- apply(dyads, 1, function(x) paste(sort(x), collapse='->'))
# For each row of dyads, that is to say, for each pair (i,j), check if
# any areas are shared using is.shared(), and convert the result to a
# string using list.areas()
dyads$Shared_Areas <- sapply(1:nrow(dyads), function(x)
list.areas(is.shared(dyads[x,1], dyads[x,2]) )
)
# Count the number of shared areas by splitting the string by commas
dyads$Shared_Area_Nums <- sapply(dyads$Shared_Areas, function(x)
length(strsplit(x,",")[[1]])
)
# Note that it's not as safe to count the result of is.shared() directly.
# If two groups share ALL areas with each other, no 0 will be returned in
# the vector. If we assume that no two groups reside in all areas, it would
# also be ok to generate the count of shared areas like this:
dyads$Shared_Areas_Unsafe <- sapply(1:nrow(dyads), function(x)
length(is.shared(dyads[x,1], dyads[x,2]))
) - 1
# Reorder columns
dyads <- dyads[,c("Group i","Group j", "Key", "Shared_Area_Nums",
"Shared_Areas_Unsafe", "Shared_Areas")]
Then go ahead and clean this to an undirected structure if you like, and perhaps remove the self-links too (I don't get the impression that Group2Group2 is of interest to you).
Note that you were already halfway there with your code. The expand.grid(df$Group, df$Group) in particular was neat.
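The incidence-matrix view mentioned earlier can also be sketched in base R without igraph. This is only one possible reading of the question: it treats the Area columns as a groups-by-areas 0/1 matrix, so that `tcrossprod()` counts shared areas for every pair of groups, and `combn()` lists each undirected dyad once. The names `M`, `shared_counts`, `pairs` and `result` are made up for this sketch.

```r
Group <- c("Group1", "Group2", "Group3", "Group4")
df <- data.frame(Group,
                 Area1 = c(TRUE, FALSE, TRUE, FALSE),
                 Area2 = c(TRUE, TRUE, FALSE, FALSE),
                 Area3 = c(FALSE, TRUE, FALSE, FALSE),
                 Area4 = c(FALSE, FALSE, FALSE, TRUE))

# Groups-by-areas incidence matrix with 0/1 entries
M <- 1 * as.matrix(df[, -1])
rownames(M) <- Group

# Entry [i, j] is the number of areas shared by group i and group j
shared_counts <- tcrossprod(M)

# Each undirected dyad once, without self-pairs
pairs <- t(combn(Group, 2))

# Names of the areas in which both groups of a dyad reside
shared <- lapply(seq_len(nrow(pairs)), function(k) {
  colnames(M)[M[pairs[k, 1], ] > 0 & M[pairs[k, 2], ] > 0]
})

result <- data.frame(
  Dyad = paste0(pairs[, 1], pairs[, 2]),
  N_shared = lengths(shared),
  Areas = vapply(shared, function(a) {
    if (length(a) > 0) paste(a, collapse = ", ") else NA_character_
  }, character(1)),
  stringsAsFactors = FALSE
)
result
```

With the example data only Group1Group2 (Area2) and Group1Group3 (Area1) share an area; every other dyad gets a count of 0 and NA for the area names.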
I've got the following task
Treatment$V010 <- as.numeric(substr(Treatment$V010,1,2))
Treatment$V020 <- as.numeric(substr(Treatment$V020,1,2))
[...]
Treatment$V1000 <- as.numeric(substr(Treatment$V1000,1,2))
I have 100 variables from $V010, $V020, $V030... to $V1000. Those are numbers of different lengths. I want to "extract" just the first two digits of each number and replace the old number with the new number, which is two digits long.
My data frame "Treatment" has 80 more variables which I did not mention here, so my goal is that this function is applied only to the 100 variables mentioned.
How can I do that? I could write that command 100 times but I am sure there is a better solution.
Alright, let's do it. First things first: as you want to get specific columns of your dataframe, you need to specify their names to access them:
cnames = paste0('V',formatC(seq(10,1000,by=10), width = 3, format = "d", flag = "0"))
(cnames is a vector containing c('V010','V020', ..., 'V1000'))
Next, we will get their indexes:
coli=unlist(sapply(cnames, function (x) which(colnames(Treatment)==x)))
(coli is a vector containing the indexes in Treatment of the relevant columns)
Finally, we will apply your function over these columns:
Treatment[coli] = mapply(function (x) as.numeric(substr(x, 1, 2)), Treatment[coli])
Does it work?
PS: if anyone has a better/more concise way to do it, please tell me :)
EDIT:
The intermediate step is not useful, as you can already use the column names cnames to get the relevant columns, i.e.
Treatment[cnames] = mapply(function (x) as.numeric(substr(x, 1, 2)), Treatment[cnames])
(the only advantage of doing the conversion from column names to column indexes is when there are some missing columns in the dataframe - in this case, Treatment['non existing column'] crashes with undefined columns selected)
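If some of the expected columns might be missing, one hedged alternative to the index lookup is to intersect the generated names with the names that actually exist, and to use `lapply()` so that each column is replaced by a plain vector. The toy `Treatment` below is invented for the illustration, as is the name `present`.

```r
# Invented stand-in for the real data: two target columns and one other column
Treatment <- data.frame(
  V010  = c(120, 1305),
  V1000 = c(111, 22),
  other = c("a", "b"),
  stringsAsFactors = FALSE
)

# "V010", "V020", ..., "V990", "V1000"
cnames <- sprintf("V%03d", seq(10, 1000, by = 10))

# Keep only the names that exist in the data frame
present <- intersect(cnames, names(Treatment))

# Replace each of those columns by its first two digits
Treatment[present] <- lapply(Treatment[present], function(x) {
  as.numeric(substr(x, 1, 2))
})
Treatment
```

One caveat with any `substr()`-based approach: numbers are converted to text first, and values that R prints in scientific notation (for example 100000 becomes "1e+05") will not give the expected digits.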
Here is a solution where the relevant columns are selected based on a pattern that can be described with a regular expression.
Regex explanation:
^: Start of string
V: Literal V
\\d{2}: Exactly 2 digits
Treatment <- data.frame(V010 = c(120, 130), x010 = c(120, 130), xV1000 = c(111, 222), V1000 = c(111, 222))
Treatment
# V010 x010 xV1000 V1000
# 1 120 120 111 111
# 2 130 130 222 222
# columns with a name that matches the pattern (logical vector)
idx <- grepl(x = names(Treatment), pattern = "^V\\d{2}")
# substr the relevant columns
Treatment[ , idx] <- sapply(Treatment[ , idx], FUN = function(x){
as.numeric(substr(x, 1, 2))
})
Treatment
# V010 x010 xV1000 V1000
# 1 12 120 111 11
# 2 13 130 222 22
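Since the pattern above is only anchored at the start, it also accepts names that merely begin with V and two digits. If that is a concern, a stricter pattern can be sketched as follows; the column names here, including `V01abc`, are invented to show the difference, and the assumption is that the target names are always V followed by three or four digits.

```r
col_names <- c("V010", "x010", "xV1000", "V1000", "V01abc")

# Anchored at the start only: "V01abc" also matches
loose <- grepl("^V\\d{2}", col_names)
loose
# TRUE FALSE FALSE TRUE TRUE

# Anchored at both ends: V followed by exactly 3 or 4 digits and nothing else
strict <- grepl("^V\\d{3,4}$", col_names)
strict
# TRUE FALSE FALSE TRUE FALSE
```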