wrong number of observations - nobs - r

I have just written my "complete" function (R programming - Coursera), but something is not working as expected.
complete <- function(directory, id = 1:332) {
  # create a list of files
  list_files <- list.files(directory, full.names = TRUE)
  # create an empty data frame
  dat <- data.frame()
  for (i in id) {
    # read in the file
    temp <- read.csv(list_files[i], header = TRUE)
    # delete rows that do not have complete cases
    temp <- na.omit(temp)
    # count all of the rows with complete cases
    tNobs <- nrow(temp)
    # enumerate the complete cases by index
    dat <- rbind(dat, data.frame(i, tNobs))
  }
  return(dat)
}
When I ask:
cc <- complete("specdata", c(6, 10, 20, 34, 100, 200, 310))
print(cc$nobs)
it returns NULL. Why? It should return:
228 148 124 165 104 460 232

The issue is that the data.frame was created without assigning column names, so it takes the object names as the column names, i.e. tNobs for the second column.
i <- 2
tNobs <- 10
data.frame(i, tNobs)
#   i tNobs
# 1 2    10
So when we ask for nobs, a column that doesn't exist in the data.frame, we get NULL. Name the columns explicitly instead:
dat <- rbind(dat, data.frame(id = i, nobs = tNobs))
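Putting it together, the corrected function could look like this (same logic as above, just with named columns):
complete <- function(directory, id = 1:332) {
  list_files <- list.files(directory, full.names = TRUE)
  dat <- data.frame()
  for (i in id) {
    # drop incomplete rows, then count what remains
    temp <- na.omit(read.csv(list_files[i], header = TRUE))
    dat <- rbind(dat, data.frame(id = i, nobs = nrow(temp)))
  }
  dat
}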

Related

(Pearson's) Correlation loop through the data frame

I have a data frame with 159 obs and 27 variables, and I want to correlate all 159 obs of column 4 (variable 4) with each one of the following columns (variables); that is, correlate column 4 with 5, then column 4 with 6, and so on. I've been unsuccessfully trying to create a loop, and since I'm a beginner in R, it turned out harder than I thought. The reason I want to simplify this is that I would need to do the same thing for a couple more data frames, and if I had a function that could do it, it would be much easier and less time-consuming. It would be wonderful if anyone could help me.
df <- ZEB1_23genes # CHANGE ZEB1_23genes for df (dataframe)
for (i in colnames(df)) { # Check the class of the variables
  print(class(df[[i]]))
}
print(df)
# Correlate ZEB1 with each of the 23 genes accordingly to Pearson's method
cor.test(df$ZEB1, df$PITPNC1, method = "pearson")
### OR ###
cor.test(df[,4], df[,5])
So I can correlate individually but I cannot create a loop to go back to column 4 and correlate it to the next column (5, 6, ..., 27).
Thank you!
If I've understood your question correctly, the solution below should work well.
#Sample data
df <- data.frame(matrix(data = sample(runif(100000), 4293), nrow = 159, ncol = 27))
# Correlation function
# Takes a data.frame containing the columns with values to be correlated as input.
# The column against which the other columns are correlated can be specified (start_col; default is 4).
# The last column to be correlated against start_col can also be specified (end_col; default is the last column).
# Returns a data.frame with one row per comparison, containing start_col, end_col, and the correlation value.
my_correlator <- function(mydf, start_col = 4, end_col = 0) {
  if (end_col == 0) {
    end_col <- ncol(mydf)
  }
  out_corr <- list()
  for (i in (start_col + 1):end_col) {
    out_corr[[i]] <- data.frame(start_col = start_col,
                                end_col = i,
                                corr_val = as.numeric(cor.test(mydf[, start_col], mydf[, i])$estimate))
  }
  return(do.call("rbind", out_corr))
}
test_run <- my_correlator(df, 4)
head(test_run)
# start_col end_col corr_val
# 1 4 5 -0.027508521
# 2 4 6 0.100414199
# 3 4 7 0.036648608
# 4 4 8 -0.050845418
# 5 4 9 -0.003625019
# 6 4 10 -0.058172227
The function basically takes a data.frame as input and spits out another data.frame containing correlations between a given column of the original data.frame and all subsequent columns. I do not know the structure of your data, and obviously this function will fail if it runs into unexpected conditions (for instance, a character column).
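For comparison, here is a loop-free sketch of the same idea (assuming all columns from column 4 onward are numeric):
# correlate column 4 against every later column in one call
corr_vals <- sapply(5:ncol(df), function(i) cor.test(df[[4]], df[[i]])$estimate)
data.frame(start_col = 4, end_col = 5:ncol(df), corr_val = as.numeric(corr_vals))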

How to efficiently store and retrieve parameters/arguments used during data processing in lists

I work with a large count table and for my analyses it is usually required to split this table into subsets based on observations, variables, values or context information.
# generating toy data
count_df1 <- data.frame(
  column1 = c(1:50),
  column2 = runif(50, 1, 10),
  column3 = runif(50, 1, 10)
)
count_df2 <- data.frame(
  column1 = c(1:50),
  column2 = runif(50, 1.5, 9),
  column3 = runif(50, 1.5, 9)
)
list_count_df <- list(count_df1 = count_df1, count_df2 = count_df2)
I learned to use lists and for loops to process all resulting subsets in the same manner. I'm using for loops rather than apply because I use the names of the objects (with the help of counters) to keep track of how I modified them, and I don't know how to do this with e.g. lapply.
# set values to iterate over
thresholds <- c(2, 4)
conditions <- c(TRUE, FALSE)
# perform some kind of subsetting and store the parameters used
output_list <- list()
counter <- 0
for (current_threshold in thresholds) {
  for (count_df in list_count_df) {
    counter <- counter + 1
    # modify the name to keep track of changes
    current_name <- paste(names(list_count_df)[counter], current_threshold, sep = "_")
    output_list[[current_name]] <- subset(count_df, column2 < current_threshold)
  }
  counter <- 0
}
Additionally, the time-consuming part is usually the main function of the body, so a loop with reduced overhead via apply would probably not save much time (I'm still open to this, though).
After I'm done preparing the various subsets and subjecting them to the analysis, I need to store the analysis results and the accompanying parameters for the different subsets. That is probably a common task.
# allocate a df to store the results
result_length <- length(output_list) * length(conditions)
df_headers <- c("Names", "Threshold", "Input_table", "Standard_deviation", "Scaling")
df_results <- setNames(data.frame(matrix(ncol = length(df_headers),
                                         nrow = result_length)), df_headers)
# perform some analyses (here: PCA) on the dfs while looping over
# analysis parameters and storing some results directly
iii <- 0
table_counter <- 0
for (item in output_list) {
  table_counter <- table_counter + 1
  for (condition in conditions) {
    iii <- iii + 1
    current_name <- paste(names(output_list)[table_counter], condition, sep = "_")
    tmp <- prcomp(item, scale = condition)
    # let's pretend we are only interested in the standard deviation per item
    df_results[iii, 1] <- current_name
    df_results[iii, 4] <- tmp$sdev[1]
    rm(tmp)
  }
}
However, I'm partly doing this by extracting parts of the objects' names, which is highly repetitive, very custom, and has to be changed for each additional step included beforehand. As I want to start my own package soon, this is nothing another user could easily follow.
# extract more values from the name of the former object
df_results$Threshold <- as.numeric(sapply(strsplit(as.character(df_results$Names), '_'), "[", 3))
df_results$Input_table <- as.factor(sapply(strsplit(as.character(df_results$Names), '_'), "[", 2))
df_results$Scaling <- as.factor(sapply(strsplit(as.character(df_results$Names), '_'), "[", 4))
df_results
# now I could convert this into long format, do plotting, etc.
I provided a short example above of how such a workflow could look. My questions are:
1) What are the general good practices for storing the parameters used for processing and for extracting them afterwards?
2) If the solution is too case-specific for a general approach:
a) any ideas what to change here?
b) Are lists and/or for loops the way to go at all?
I do it this way because modifying names in lapply is unclear to me, and without the names I lose track of what is what. I also would not know how to efficiently handle all these different subsets in one big data.frame.
Please consider that my original data contains numerical, factor, and character columns with hundreds of rows/observations and tens of thousands of columns/variables.
Honestly, there are many ways to do this and it will come down to personal preference. One common way would be to define a class object that sets the standard for how you access information on it. Creating a class means that you can write S3 methods too. This can give you more flexibility in how you generate your class depending on whether you are working on a list, a data.frame, or just a vector.
generate_foo <- function(x, ...){
  UseMethod("generate_foo")
}

generate_foo.default <- function(x, current_threshold, conditions, name = NULL){
  if(is.null(name)){
    name <- as.character(substitute(x))
  }
  x <- x[x[["column2"]] < current_threshold, ]
  tmp <- tryCatch({prcomp(x, scale = conditions)}, error = function(er){return("Error")})
  retval <- list(list(subset = x,
                      pcaObj = tmp, # could store the entire object or just the parts you care about
                      subsetparam = current_threshold,
                      condition = conditions,
                      name = name))
  class(retval) <- "foo"
  return(retval)
}
generate_foo.list <- function(x,
                              current_threshold,
                              conditions, name = NULL){
  if(is.null(name) || length(name) != length(x)){
    name <- names(x)
  }
  # Generate combinations (separate() requires the tidyr package)
  combi <- separate( # generate all the possible combination indexes at once
    data.frame(
      indx = levels(suppressWarnings(interaction(1:length(x),
                                                 1:length(current_threshold),
                                                 1:length(conditions))))),
    col = "indx", into = c("df", "thresh", "cond"), sep = "\\.")
  x <- x[as.numeric(combi$df)]
  name <- name[as.numeric(combi$df)]
  current_threshold <- current_threshold[as.numeric(combi$thresh)]
  conditions <- conditions[as.numeric(combi$cond)]
  foolist <- mapply(FUN = generate_foo.default,
                    x = x,
                    current_threshold = current_threshold,
                    conditions = conditions,
                    name = name)
  class(foolist) <- "foolist"
  return(foolist)
}
With this method, when you call:
foo <- generate_foo(x = list_count_df,
                    current_threshold = thresholds,
                    conditions = conditions,
                    name = c("Custname1", "Custname2"))
you will end up with a list of objects of class "foo". In this case the resulting object has length 8, each element containing five entries: subset, pcaObj, subsetparam, condition, and name. Since pcaObj sometimes throws an error if the subset is too small, the tryCatch call prevents the code from failing.
Take it a step further by writing custom print and summary methods!
# summary
summary.foolist <- function(x){
  subsetdim <- unlist(lapply(x, function(y){prod(dim(y[["subset"]]))}))
  pcasdev <- unlist(lapply(x, function(y){y[["pcaObj"]]$sdev[1]}))
  subsetparam <- unlist(lapply(x, function(y){y[["subsetparam"]]}))
  condition <- unlist(lapply(x, function(y){y[["condition"]]}))
  name <- unlist(lapply(x, function(y){y[["name"]]}))
  df <- data.frame(SubsetDim = subsetdim, PCAsdev = pcasdev,
                   SubsetParam = subsetparam, condition = condition, name = name)
  return(df)
}
summary(foo)
SubsetDim PCAsdev SubsetParam condition name
1 24 1.207833 2 TRUE Custname1
2 6 1.732051 2 TRUE Custname2
3 54 1.324284 4 TRUE Custname1
4 33 1.372508 4 TRUE Custname2
5 24 16.258848 2 FALSE Custname1
6 6 12.024556 2 FALSE Custname2
7 54 15.592938 4 FALSE Custname1
8 33 14.057929 4 FALSE Custname2
Using a convention like this ensures your data is stored in a canonical way. Of course there are many ways you can choose to build your custom R class and object.
You could make one function that builds a list of subsetted data frames and assigns that list one class, and then another function that performs the analysis and generates a new class object, as sketched below. As long as you stick to building a named list, accessing parts of the objects becomes easier because they are organized.
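A minimal sketch of that two-step design could look like this (function and class names are hypothetical, built on the toy data above):
make_subsets <- function(dfs, thresholds) {
  out <- list()
  for (nm in names(dfs)) {
    for (th in thresholds) {
      # keep each subset together with the parameters that produced it
      out[[paste(nm, th, sep = "_")]] <- list(data = subset(dfs[[nm]], column2 < th),
                                              source = nm, threshold = th)
    }
  }
  structure(out, class = "subset_list")
}
analyze_subsets <- function(x, conditions) {
  stopifnot(inherits(x, "subset_list"))
  res <- list()
  for (nm in names(x)) {
    for (cond in conditions) {
      # carry the stored parameters along with the new analysis result
      res[[paste(nm, cond, sep = "_")]] <- c(x[[nm]][c("source", "threshold")],
                                             list(scaled = cond,
                                                  sdev1 = prcomp(x[[nm]]$data, scale. = cond)$sdev[1]))
    }
  }
  structure(res, class = "analysis_list")
}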
Functional solution
0. Generate source data frame
# for reproducibility of random tasks
set.seed(1)
df <- data.frame(
  col1 = c(1:100),
  col2 = c(runif(50, 1, 10), runif(50, 11, 20)),
  col3 = c(runif(50, 1, 10), runif(50, 11, 20))
)
# so half of the rows have numbers 1 to 10 in col2 and col3
# and the others have 11 to 20 in col2 and col3.
# let's randomize the order of the rows
df <- df[sample(1:100), ]
# and take this data frame `df` as our source data frame
# from which we will do the analysis.
1. Problem description
We want to subdivide the original df into sub data frames
by applying 2 different criteria.
Then we analyze each sub data frame
using all possible combinations of 2 different parameters,
and finally we collect all analysis values and aggregate them into a data frame.
The criteria:
criterium1: if the col2 value is <= 10, the row is assigned "df1", otherwise "df2".
categories: c("df1", "df2")
criterium2: if the col3 value is below the first limit, the row is assigned 'class5';
if the col3 value is > the first limit but <= the second limit, assign 'class15';
other cases don't interest us - let's assign 'other'.
categories: c("class5", "class15", "other") # each row will fall into one of them
For each combination of the two criteria we want an own sub data frame
on which the analysis should be done.
The parameters for the analysis:
parameter1: 'scale.=' c(TRUE, FALSE)
parameter_categories c("sc+", "sc-")
parameter2: 'center=' c(TRUE, FALSE)
parameter_categories c("cen+", "cen-")
The analysis result values:
For each combination of the two parameters we want the values
for 'Standard deviation':
3 stddev columns, for PC1, PC2, PC3.
Additional information to be collected:
we want a distinguishable (unique) name for each combination
2. How the entire analysis will look like:
# 0. categorize and split data frame
categories1 <- c("df1", "df2")[cut(df[, "col2"], c(1, 11, 20, Inf))]
categories2 <- c("class5", "class15", "other")[cut(df[, "col3"], c(-Inf, 5, 15, Inf))]
dfs <- split(df, gsub("class", "", paste(categories1, categories2, sep="_")))
# 1. Declare parameters and prepare all parameter combinations
parameters1 <- list("scale." = TRUE, "scale."=FALSE)
np1 <- c("scpos", "scneg")
parameters2 <- list("center"=TRUE, "center"=FALSE)
np2 <- c("cpos", "cneg")
params_list <- named_cross_combine(parameters1, parameters2, np1, np2, sep="_")
# 2. Apply analysis over all sub dfs and parameter combinations
# and extract and aggregate analysis results into a final data frame
df_final <- apply_extract_aggravate(
dfs=dfs,
params=params_list,
analyzer_func=prcomp,
extractor_func=function(x) x$sdev, # extractor must return a vector
col_names=c("df", "limits", "scale", "center", "std_PC1", "std_PC2", "std_PC3"),
sep="_" # separator for names
)
# 3. rename parameter column contents
df_final$scale <- unlist(lookup(df_final$scale, np1, parameters1))
df_final$center <- unlist(lookup(df_final$center, np2, parameters2))
df_final:
df limits scale center std_PC1 std_PC2 std_PC3
df1_15_scpos_cpos df1 15 TRUE TRUE 1.205986 0.9554013 0.7954906
df1_15_scpos_cneg df1 15 TRUE FALSE 1.638142 0.5159250 0.2243043
df1_15_scneg_cpos df1 15 FALSE TRUE 15.618145 2.4501942 1.3687843
df1_15_scneg_cneg df1 15 FALSE FALSE 31.425246 5.9055013 1.7178626
df1_5_scpos_cpos df1 5 TRUE TRUE 1.128371 1.0732246 0.7582659
df1_5_scpos_cneg df1 5 TRUE FALSE 1.613217 0.4782639 0.4108470
df1_5_scneg_cpos df1 5 FALSE TRUE 13.525868 2.5524661 0.9894493
df1_5_scneg_cneg df1 5 FALSE FALSE 30.007511 3.9094993 1.6020638
df2_15_scpos_cpos df2 15 TRUE TRUE 1.129298 1.0069030 0.8431092
df2_15_scpos_cneg df2 15 TRUE FALSE 1.720909 0.1523516 0.1235295
df2_15_scneg_cpos df2 15 FALSE TRUE 14.061532 2.4172787 1.2348606
df2_15_scneg_cneg df2 15 FALSE FALSE 80.543382 3.8409639 1.8480111
df2_other_scpos_cpos df2 other TRUE TRUE 1.090057 0.9588241 0.9446865
df2_other_scpos_cneg df2 other TRUE FALSE 1.718190 0.1881516 0.1114570
df2_other_scneg_cpos df2 other FALSE TRUE 15.168160 2.5579403 1.3354016
df2_other_scneg_cneg df2 other FALSE FALSE 82.297724 5.0580949 1.9356444
3. Explanation step by step
3.1 Declare Helper functions
# for preparing parameter combinations as lists
named_cross_combine <- function(seq1, seq2, seq1_names, seq2_names, sep="_") {
  res <- list()
  i <- 1
  namevec <- c()
  for (j1 in seq_along(seq1)) {
    for (j2 in seq_along(seq2)) {
      res[[i]] <- c(seq1[j1], seq2[j2])
      namevec[i] <- paste0(seq1_names[j1], sep, seq2_names[j2])
      i <- i + 1
    }
  }
  names(res) <- namevec
  res
}
# correctly named params list - `sep=` determines how the names are joined
# you can apply `gsub()` on the namevec before assignment to further adjust the names.
# useful for doing the analysis
do.call2 <- function(fun, x, rest) {
  do.call(fun, c(list(x), rest))
}

apply_parameters <- function(funcname,
                             dfs,
                             params) {
  lapply(dfs, function(df) lapply(params, function(pl) do.call2(funcname, df, pl)))
}
split_names_to_data_frame <- function(names_vec, sep) {
  res <- lapply(names_vec, function(s) strsplit(s, sep)[[1]])
  df <- Reduce(rbind, res)
  rownames(df) <- names_vec
  df
}
apply_to_subdf_and_combine <- function(
    res_list,
    accessor_func = function(x) x,             # applied to each subdf result
    subdf_level_combiner_func = as.data.frame, # combines within a subdf result
    combine_prepare_func = function(x) x,      # applied on each combined subdf result
    final_combiner_func = rbind,               # combines the per-subdf results
    col_names = NULL,                          # column names for the final data frame
    sep = "_") {                               # joiner for the names
  res_accessed_combined <- lapply(res_list,
                                  function(x) do.call(what = subdf_level_combiner_func,
                                                      list(lapply(x, accessor_func))))
  res_prepared <- lapply(res_accessed_combined, combine_prepare_func)
  res_df <- Reduce(final_combiner_func, res_prepared)
  rownames(res_df) <- paste(unlist(sapply(names(res_prepared), rep, nrow(res_prepared[[1]]))),
                            unlist(sapply(res_prepared, rownames)),
                            sep = sep)
  names_df <- split_names_to_data_frame(rownames(res_df), sep = sep)
  final_df <- as.data.frame(cbind(names_df, res_df))
  if (!is.null(col_names)) {
    colnames(final_df) <- col_names
  }
  final_df
}
# for simplifying the function call
extract_and_combine <- function(res_list,
                                result_extractor_func,
                                col_names,
                                sep = "_") {
  apply_to_subdf_and_combine(
    res_list = res_list,
    accessor_func = result_extractor_func,
    subdf_level_combiner_func = as.data.frame,
    combine_prepare_func = function(x) as.data.frame(t(x)),
    final_combiner_func = rbind,
    col_names = col_names,
    sep = sep
  )
}
# for simplifying the function call even more
apply_extract_aggravate <- function(dfs,
                                    params,
                                    analyzer_func,
                                    extractor_func,
                                    col_names,
                                    sep = "_") {
  extract_and_combine(
    res_list = apply_parameters(funcname = analyzer_func, dfs = dfs, params = params),
    result_extractor_func = extractor_func,
    col_names = col_names,
    sep = sep
  )
}
# useful for renaming the parameter columns' contents
lookup <- function(x, seq1, seq2) {
  seq2[sapply(x, function(x) which(x == seq1))]
}
3.2 Categorize and split data frame
categories1 <- c("df1", "df2")[cut(df[, "col2"], c(1, 11, 20, Inf))]
categories2 <- c("5", "15", "other")[cut(df[, "col3"], c(-Inf, 5, 15, Inf))]
dfs <- split(df, gsub("class", "", paste(categories1, categories2, sep="_")))
But to have full control over the categorization, you can
declare your own categorizer functions and then categorize and
split the data frame:
# write the rules for criterium1 as a function of one element
categorizer1 <- function(x) {
  if (1 <= x && x <= 10) {
    "df1"
  } else if (11 <= x && x <= 20) {
    "df2"
  }
}
# vectorize it to be able to apply it on entire columns
categorizer1 <- Vectorize(categorizer1)

# do the same for criterium2
categorizer2 <- function(x) {
  if (x <= 5) {
    "class5"
  } else if (5 < x && x <= 15) {
    "class15"
  } else {
    "other"
  }
}
categorizer2 <- Vectorize(categorizer2)
# apply on col2 and col3 the corresponding categorizers
categories1 <- categorizer1(df[,"col2"])
categories2 <- categorizer2(df[,"col3"])
# get the list of sub data frames according to categories
dfs <- split(df, gsub("class", "", paste(categories1, categories2, sep="_")))
# Let the categorizer functions return strings and
# for the second argument use `paste()` with `sep=` to determine
# how the names should be combined - here with "_".
# Use `gsub(pattern, replacement, x, ignore.case=F, perl=T)`
# to process the names with regex patterns until they are how you want them.
# Here, we remove the bulky "class".
3.3 Declare parameters as lists, with their corresponding name vectors
parameters1 <- list("scale." = TRUE, "scale."=FALSE)
np1 <- c("scpos", "scneg")
parameters2 <- list("center"=TRUE, "center"=FALSE)
np2 <- c("cpos", "cneg")
# prepare all combinations of them in a list of lists
params_list <- named_cross_combine(parameters1, parameters2, np1, np2, sep="_")
# this produces a list of all possible parameter-combination lists.
# Each parameter combination has to be kept in a list itself, because
# `do.call()` later requires the parameters to be in a list.
# `named_cross_combine()` takes care of correct naming,
# joining the names using the `sep` value.
# The first element of `parameters1` is paired with each element of
# `parameters2`, then the second element of `parameters1`, and so on.
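For the parameters above, the resulting params_list should look like this:
str(params_list)
# List of 4
#  $ scpos_cpos:List of 2
#   ..$ scale.: logi TRUE
#   ..$ center: logi TRUE
#  $ scpos_cneg:List of 2
#   ..$ scale.: logi TRUE
#   ..$ center: logi FALSE
#  $ scneg_cpos:List of 2
#   ..$ scale.: logi FALSE
#   ..$ center: logi TRUE
#  $ scneg_cneg:List of 2
#   ..$ scale.: logi FALSE
#   ..$ center: logi FALSE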
3.4 Apply all parameters over dfs and collect the results into a data frame
df_final <- apply_extract_aggravate(
dfs=dfs,
params=params_list,
analyzer_func=prcomp,
extractor_func=function(x) x$sdev, # extractor must return a vector
col_names=c("df", "limits", "scale", "center", "std_PC1", "std_PC2", "std_PC3"),
sep="_" # separator for names
)
# This function takes the dfs and the parameters list and runs
# analyzer_func, here `prcomp()`, over all combinations of both.
# The `extractor_func` must be chosen so that the returned result is a vector.
# If it is already a vector, set the identity function `function(x) x` here.
# The column names give new names to the resulting columns.
# How many names are needed is determined by:
# - the number of categoriesN,
# - the number of parametersN,
# - the number of elements of the result after extractor_func() was applied.
# `sep=` determines which joiner is used for the names.
3.5 Finally, rename parameter columns' contents by using lookup() + previously declared parameter lists (parametersN) with their corresponding name vectors (npN)
df_final$scale <- unlist(lookup(df_final$scale, np1, parameters1))
df_final$center <- unlist(lookup(df_final$center, np2, parameters2))
# Two parameter columns, so two commands.
This converts df_final from this:
# df limits scale center std_PC1 std_PC2 std_PC3
# df1_15_scpos_cpos df1 15 scpos cpos 1.205986 0.9554013 0.7954906
# df1_15_scpos_cneg df1 15 scpos cneg 1.638142 0.5159250 0.2243043
# df1_15_scneg_cpos df1 15 scneg cpos 15.618145 2.4501942 1.3687843
# df1_15_scneg_cneg df1 15 scneg cneg 31.425246 5.9055013 1.7178626
# df1_5_scpos_cpos df1 5 scpos cpos 1.128371 1.0732246 0.7582659
# df1_5_scpos_cneg df1 5 scpos cneg 1.613217 0.4782639 0.4108470
# df1_5_scneg_cpos df1 5 scneg cpos 13.525868 2.5524661 0.9894493
# df1_5_scneg_cneg df1 5 scneg cneg 30.007511 3.9094993 1.6020638
# df2_15_scpos_cpos df2 15 scpos cpos 1.129298 1.0069030 0.8431092
# df2_15_scpos_cneg df2 15 scpos cneg 1.720909 0.1523516 0.1235295
# df2_15_scneg_cpos df2 15 scneg cpos 14.061532 2.4172787 1.2348606
# df2_15_scneg_cneg df2 15 scneg cneg 80.543382 3.8409639 1.8480111
# df2_other_scpos_cpos df2 other scpos cpos 1.090057 0.9588241 0.9446865
# df2_other_scpos_cneg df2 other scpos cneg 1.718190 0.1881516 0.1114570
# df2_other_scneg_cpos df2 other scneg cpos 15.168160 2.5579403 1.3354016
# df2_other_scneg_cneg df2 other scneg cneg 82.297724 5.0580949 1.9356444
to this:
df limits scale center std_PC1 std_PC2 std_PC3
df1_15_scpos_cpos df1 15 TRUE TRUE 1.205986 0.9554013 0.7954906
df1_15_scpos_cneg df1 15 TRUE FALSE 1.638142 0.5159250 0.2243043
df1_15_scneg_cpos df1 15 FALSE TRUE 15.618145 2.4501942 1.3687843
df1_15_scneg_cneg df1 15 FALSE FALSE 31.425246 5.9055013 1.7178626
df1_5_scpos_cpos df1 5 TRUE TRUE 1.128371 1.0732246 0.7582659
df1_5_scpos_cneg df1 5 TRUE FALSE 1.613217 0.4782639 0.4108470
df1_5_scneg_cpos df1 5 FALSE TRUE 13.525868 2.5524661 0.9894493
df1_5_scneg_cneg df1 5 FALSE FALSE 30.007511 3.9094993 1.6020638
df2_15_scpos_cpos df2 15 TRUE TRUE 1.129298 1.0069030 0.8431092
df2_15_scpos_cneg df2 15 TRUE FALSE 1.720909 0.1523516 0.1235295
df2_15_scneg_cpos df2 15 FALSE TRUE 14.061532 2.4172787 1.2348606
df2_15_scneg_cneg df2 15 FALSE FALSE 80.543382 3.8409639 1.8480111
df2_other_scpos_cpos df2 other TRUE TRUE 1.090057 0.9588241 0.9446865
df2_other_scpos_cneg df2 other TRUE FALSE 1.718190 0.1881516 0.1114570
df2_other_scneg_cpos df2 other FALSE TRUE 15.168160 2.5579403 1.3354016
df2_other_scneg_cneg df2 other FALSE FALSE 82.297724 5.0580949 1.9356444
4. Final remarks
This is not very different from your approach. All information is collected in the names, and the names are used to generate the part of the data frame which explains the background of the analysis data.
The lookup() function is very useful for renaming the parameter columns.
The categorization of a column can be simplified a lot by the cut() function. But cut() only lets you choose, via its right= argument, whether all intervals are right-closed or left-closed; you cannot mix both in one call.
That is why declaring your own categorizer functions can sometimes be an advantage (especially for more complex categorizations).
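For example, right= flips all interval boundaries at once:
x <- c(5, 10, 15)
cut(x, breaks = c(0, 5, 10, 15), right = TRUE)   # (0,5]   (5,10]   (10,15]
cut(x, breaks = c(0, 5, 10, 15), right = FALSE)  # [5,10)  [10,15)  <NA>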
Extensibility
More categories: Just define more categories: categories1, categories2, categories3, ...
# then do
dfs <- split(df, paste(categories1, categories2, categories3, ..., sep="_"))
# use `gsub()` around `paste()` or do
# names(dfs) <- gsub("search_term", "replace_term", names(dfs)) - over and over again
# until all names are as they should be.
More parameters: Just define more parametersN - npN pairs.
# then do
params_list <- named_cross_combine(parameters1, parameters2, np1, np2, sep="_")
params_list <- named_cross_combine(params_list, parameters3, names(params_list), np3, sep="_")
params_list <- named_cross_combine(params_list, parameters4, names(params_list), np4, sep="_")
... (and so on ...)
# use then at the end more lines for renaming parameter column contents:
df_final[, prmcol_name1] <- unlist(lookup(df_final[, prmcol_name1], np1, parameters1))
df_final[, prmcol_name2] <- unlist(lookup(df_final[, prmcol_name2], np2, parameters2))
df_final[, prmcol_name3] <- unlist(lookup(df_final[, prmcol_name3], np3, parameters3))
... (and so on ...)
Thus, the number of categories and parameters is easily extensible.
The core helper functions stay the same and don't have to be modified.
(The use of higher-order functions (functions which take functions as arguments) as helper functions is key to their flexibility - one of the strengths of functional programming.)

Generate data frame length (and column data) from function

I want to generate a data frame with a random length.
> head(df)
"id" "age"
53 12 # randomly chosen data from fn1(){} and fn2(){}
146 31 #
343 22 #
...#randomly generated length from sample(50:5000,1)
The problem is that the way I'm trying it just repeats the same element over and over:
# This just repeats the same value instead of calling the function over and over
a <- fn1(){}
rep(a,15)
[1] "S" "S" "S" "S" "S" "S" "S" ...
Ideally I want to specify the column names and assign each column a value from another function:
# Generate length of data frame
df.length <- sample(50:500,1)
# Generate data for each row from function
df.column.id <- fn1(){}
df.column.age <- fn2(){}
...
df <- data.frame("id" = df.column.id, "age" = df.column.age, ...)
Unfortunately the rep function isn't doing what I want, so how can the data frame columns be generated from functions? I also tried matrix(data = c(df.column.id, df.column.age), nrow = df.length), which didn't work as intended either.
Edit:
replicate(10, RandomStatusColor(), simplify = "vector") is working to generate a vector of the function outputs.
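Building on that edit, one possible sketch (with hypothetical generator functions standing in for fn1() and fn2()):
fn1 <- function() sample(50:5000, 1)  # hypothetical id generator
fn2 <- function() sample(1:99, 1)     # hypothetical age generator
n <- sample(50:500, 1)                # random data frame length
df <- data.frame(id = replicate(n, fn1()),
                 age = replicate(n, fn2()))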
Maybe something like this could help:
min_rownum <- 10
max_rownum <- 50
num_of_rows <- sample(seq(min_rownum, max_rownum), 1)
min_age <- 1
max_age <- 50
age <- sample(seq(min_age, max_age), num_of_rows, replace = TRUE)
min_ID <- 50
max_ID <- 500
id <- sample(seq(min_ID, max_ID), num_of_rows)
df1 <- data.frame(id, age)
I tried to use variable names that would make the code self-explanatory.
The parameter replace = TRUE in the sample() function means that an element can be selected more than once. In the case of ages this is plausible, whereas IDs should be unique. The second argument of sample() defines how many elements should be chosen from the vector that is passed as a first argument.
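A quick illustration of the difference:
sample(1:3, 5, replace = TRUE)  # fine, e.g. 2 1 3 3 1
# sample(1:3, 5)  # error: cannot take a sample larger than the population when 'replace = FALSE'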
The title of the question suggests that the data.frame should be generated by a function. In that case the above code can be wrapped into a function like this:
make_random_df <- function(min_rownum = 10, max_rownum = 50, min_age = 1, max_age = 50,
                           min_ID = 50, max_ID = 500) {
  num_of_rows <- sample(seq(min_rownum, max_rownum), 1)
  age <- sample(seq(min_age, max_age), num_of_rows, replace = TRUE)
  id <- sample(seq(min_ID, max_ID), num_of_rows)
  data.frame(id, age)  # return the data.frame as the last expression
}
Using this function, the data.frame can be created with
my_random_df <- make_random_df()
#> head(my_random_df)
# id age
#1 461 7
#2 86 44
#3 319 8
#4 363 45
#5 59 3
#6 258 49
Here's a function that generates data samples of a given length (len) from a given vector (vec):
createData <- function(vec, len) {
  sample(vec, len, replace = TRUE)
}
nobs <- 20
df <- data.frame(id = createData(vec = c("a", "b", "c"), len = nobs),
                 age = createData(vec = seq(10, 50, 10), len = nobs))
df
Is this what you are after?

function that counts complete cases of id subset output order in R

I am creating a function that returns a list of user IDs and the number of complete-case observations per ID. I've got the core code working, but I'm running into an issue where the return/output is ordered numerically rather than in the order the argument is passed to the function. I've read what I can on Stack Overflow and Google about subsetting data and functions, but I can't isolate why the data comes back ordered the way it is.
Here's my code:
complete <- function(directory, id = 1:332){
  files <- list.files(pattern = "*.csv")
  myfiles <- do.call(rbind, lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE)))
  data.set <- subset(myfiles, ID %in% id)
  ans <- sapply(split(data.set, data.set$ID), function(y) sum(complete.cases(y)))
  return(data.frame(id = names(ans), nobs = unname(ans)))
}
And here is a sample call with the output I'm getting:
complete("specdata", 30:25)
id nobs
1 25 463
2 26 586
3 27 338
4 28 475
5 29 711
6 30 932
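For reference, this happens because split() turns the numeric ID into a factor whose levels are sorted in increasing order, so the result comes back in numeric ID order. One minimal fix (assuming every requested id is present in the data) is to reindex ans by the ids as passed in, before building the result:
ans <- ans[as.character(id)]  # restore the order of the id argument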

How to repetitively replace substrings in variables in R

I've got the following task:
Treatment$V010 <- as.numeric(substr(Treatment$V010,1,2))
Treatment$V020 <- as.numeric(substr(Treatment$V020,1,2))
[...]
Treatment$V1000 <- as.numeric(substr(Treatment$V1000,1,2))
I have 100 variables from $V010, $V020, $V030... to $V1000. Those are numbers of different lengths. I want to "extract" just the first two digits of each number and replace the old number with the new, two-digit number.
My data frame "Treatment" has 80 more variables which I did not mention here, so my goal is that this is applied only to the 100 variables mentioned.
How can I do that? I could write that command 100 times but I am sure there is a better solution.
Alright, let's do it. First things first: as you want to get specific columns of your data frame, you need to specify their names to access them:
cnames = paste0('V',formatC(seq(10,1000,by=10), width = 3, format = "d", flag = "0"))
(cnames is a vector containing c('V010','V020', ..., 'V1000'))
Next, we will get their indexes:
coli=unlist(sapply(cnames, function (x) which(colnames(Treatment)==x)))
(coli is a vector containing the indexes in Treatment of the relevant columns)
Finally, we will apply your function over these columns:
Treatment[coli] = mapply(function (x) as.numeric(substr(x, 1, 2)), Treatment[coli])
Does it work?
PS: if anyone has a better/more concise way to do it, please tell me :)
EDIT:
The intermediate step is not useful, as you can already use the column names cnames to get the relevant columns, i.e.
Treatment[cnames] = mapply(function (x) as.numeric(substr(x, 1, 2)), Treatment[cnames])
(the only advantage of doing the conversion from column names to column indexes is when there are some missing columns in the dataframe - in this case, Treatment['non existing column'] crashes with undefined columns selected)
A solution where relevant columns are selected based on a pattern that can be described with a regular expression.
Regex explanation:
^: Start of string
V: Literal V
\\d{2}: Exactly 2 digits
Treatment <- data.frame(V010 = c(120, 130), x010 = c(120, 130), xV1000 = c(111, 222), V1000 = c(111, 222))
Treatment
# V010 x010 xV1000 V1000
# 1 120 120 111 111
# 2 130 130 222 222
# columns with a name that matches the pattern (logical vector)
idx <- grepl(x = names(Treatment), pattern = "^V\\d{2}")
# substr the relevant columns
Treatment[ , idx] <- sapply(Treatment[ , idx], FUN = function(x){
  as.numeric(substr(x, 1, 2))
})
Treatment
# V010 x010 xV1000 V1000
# 1 12 120 111 11
# 2 13 130 222 22
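For completeness, a tidyverse sketch of the same replacement (assumes the dplyr package is available; not part of the original answers):
library(dplyr)
Treatment <- Treatment %>%
  mutate(across(matches("^V\\d{2}"), ~ as.numeric(substr(.x, 1, 2))))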
