I have a dataset made up of clinics; each clinic is made up of doctors who perform procedures on patients.
I have written functions to perform analyses on the dataset, filtering by clinic lists or doctor lists (a simple one is below):
num.of <- function(x.doctor, x.clinic){
if (!missing(x.clinic)){
df_filter <- filter(df_clean, clinic == x.clinic)
}
if (!missing(x.doctor)) {
df_filter <- filter(df_clean, doctor == x.doctor)
}
num_doctor <- length(unique(df_filter$doctor))
num_surveys <- nrow(df_filter)
num_procedure <- length(unique(df_filter$PPID))
result <- setNames(c(num_doctor, num_surveys, num_procedure), c("num_doctor", "num_surveys", "num_procedure"))
return(result)
}
I am attempting to call on these functions with either a list of doctors or a list of clinics:
sapply(doctor_list, num.of, x.clinic = NULL)
However, the function only works when the 'first' argument is passed through, i.e. the function above does not work, but this does:
sapply(clinic_list, num.of, x.doctor = NULL)
If the arguments are reversed when writing the initial function, the opposite of the above examples is true.
The functions are fed only one set of arguments at a time: Either a list for x.doctor or a list for x.clinic.
How can I rewrite my functions so that sapply works with x.clinic and, in a separate function call, with x.doctor?
Thank you!
Try this:
num.of <- function(x, data, type = c("doctor", "clinic")) {
type <- match.arg(type)
df_filter <-
if (type == "doctor") {
filter(data, doctor == x)
} else {
filter(data, clinic == x)
}
num_doctor <- length(unique(df_filter$doctor))
num_surveys <- nrow(df_filter)
num_procedure <- length(unique(df_filter$PPID))
result <- setNames(c(num_doctor, num_surveys, num_procedure), c("num_doctor", "num_surveys", "num_procedure"))
return(result)
}
This enables an explicit and clear call:
sapply(doctor_list, num.of, data = df_clean, type = "doctor")
sapply(clinic_list, num.of, data = df_clean, type = "clinic")
I took the liberty of helping with a scope breach: accessing df_clean from inside the function may work but can present problems in the future. It makes the function very context-dependent and inflexible in the presence of multiple datasets. Even if you are 100% certain you will always always always have df_clean in your calling (or global) environment for this case, passing the data in as an argument is a good habit (among "Best Practices TM").
If this doesn't work, then you might need to make a more reproducible example so that we can actually test the function. Since you may not want to include actual data, it makes things much easier for everyone else if you make it as generic as possible, with simple names and simple example data.
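For illustration, here is a minimal sketch of what such a generic example could look like. The toy data frame and the name num_of_base are made up for this sketch, and it uses base subsetting in place of dplyr::filter so that it runs without any packages loaded; treat it as one possible shape for a reproducible example rather than the definitive version of the function.

```r
# Toy data standing in for df_clean (all values are made up)
toy <- data.frame(
  clinic = c("c1", "c1", "c1", "c2", "c2"),
  doctor = c("d1", "d1", "d2", "d3", "d3"),
  PPID   = c(101, 102, 103, 104, 104),
  stringsAsFactors = FALSE
)

# Same counting logic as num.of, with base subsetting instead of dplyr::filter
# (num_of_base is a hypothetical name used only in this sketch)
num_of_base <- function(x, data, type = c("doctor", "clinic")) {
  type <- match.arg(type)
  df_filter <- data[data[[type]] == x, , drop = FALSE]
  c(num_doctor    = length(unique(df_filter$doctor)),
    num_surveys   = nrow(df_filter),
    num_procedure = length(unique(df_filter$PPID)))
}

# One column per doctor (or clinic), one row per statistic
by_doctor <- sapply(unique(toy$doctor), num_of_base, data = toy, type = "doctor")
by_clinic <- sapply(unique(toy$clinic), num_of_base, data = toy, type = "clinic")
```

With data this small, anyone can check the counts by eye and see immediately whether the function behaves as intended.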
I am working with microarray data within an ExpressionSet object downloaded from Gene Expression Omnibus. The rows of the expression data in this object are labeled with probe names, but for downstream analysis I really need the gene symbols.
Thankfully, the individuals that compiled this dataset included the corresponding gene symbols in the metadata that accompanies this kind of object.
I am trying to write a for loop within a function that looks at the list of variable labels (effectively row names for the metadata), determines whether a column called "GENE_SYMBOL" is present, then either writes those gene symbols to a vector, or moves on and converts the probe names to gene symbols using gprofileR.
I don't want my if else statement to run for each iteration of my for loop, I just want it to run after the if statement has determined if any of the row names are "GENE_SYMBOL".
So far I have written the for loop with the if statement but can't figure out how to put the condition if ANY of the column names match, then do A, if none match then do B.
nums <- as.data.frame(matrix(0, ncol = 27, nrow = 12))
feature_headers <- c(letters, "GENE_SYMBOL")
colnames(nums) <- feature_headers
for (i in 1:length(feature_headers)) {
if(feature_headers[i] == "GENE_SYMBOL") {
gene_symb <- nums[["GENE_SYMBOL"]]
}else{
#what else it does is more involved than this question needs, so
#I just wrote out something for the function to say
cat("boohoo no genes for you\n")
}
}
Any help you could provide would be much appreciated and let me know if you need more information.
In your specific situation, R has a handy %in% operator that you can use to check this:
if ("GENE_SYMBOL" %in% feature_headers) {
#...
}
As a more general rule, if your goal is "if loop A meets condition B, then do action C", you can follow this pattern:
found <- FALSE
for (loopStatement) {
if(condition) {
found <- TRUE
break
}
}
if(found) {
doActionC()
}
This way, if you get through the whole list without finding the label, found is still FALSE, but if you do find the label, you stop early and don't do a bunch of unnecessary checking. This is essentially the gist of what %in% does for you, and %in% is faster to write and probably faster to process. The loop pattern is a good thing to know for other situations, though.
Also, the %in% operator can be used to check if the elements of one list are shared with another list!
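As a small sketch of that last point: %in% is vectorised over its left-hand side, so it returns one TRUE/FALSE per element, and any() or all() then collapses the result. The label "ENTREZ_ID" below is made up purely for illustration.

```r
feature_headers <- c(letters, "GENE_SYMBOL")
wanted <- c("GENE_SYMBOL", "ENTREZ_ID")  # "ENTREZ_ID" is a hypothetical label

# One logical value per element of `wanted`
present <- wanted %in% feature_headers

any_present <- any(present)  # is at least one wanted label there?
all_present <- all(present)  # are all wanted labels there?
```

Here any_present answers "do A if ANY of the names match", while all_present is the stricter check.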
You can add a boolean variable to record whether your condition is hit in the for loop, and then break to avoid unnecessary calculation:
nums <- as.data.frame(matrix(0, ncol = 27, nrow = 12))
feature_headers <- c(letters, "GENE_SYMBOL")
colnames(nums) <- feature_headers
found <- FALSE
for (i in seq_along(feature_headers)) {
  if (feature_headers[i] == "GENE_SYMBOL") {
    found <- TRUE
    break
  }
}
# else must sit on the same line as the closing brace at top level
if (found) {
  dosomething()
} else {
  dosomethingelse()
}
Hi, I want to check whether a column in a data.frame exists and, only if it does, check another condition.
I know I can use a nested if statement as I have in the example.
This is normally for checking inputs to functions. This is a working example which gives me the output I want, I just was wondering if there is a smarter way, as this can get messy especially if I am doing it for a number of conditions. My example:
testfun <- function(dat,...){
library(dplyr)
if("Site" %in% colnames(dat)){
#for example check number of sites, this condition could be anything though
if(n_distinct(dat$Site) > 1) stop ("Function must have site specific data")
}
#do stuff
return(1)
}
testdf1 <- data.frame(x = 1:10, y = 1:10)
testdf2 <- data.frame(x = 1:10, y = 1:10,Site = "A")
testdf3 <- data.frame(x = 1:10, y = 1:10,Site = rep(c("A","B"),each = 5))
testfun(testdf1)
testfun(testdf2)
testfun(testdf3)
Edit with a bit more context: For this example, the reason is that the user may input data that is site specific and therefore doesn't have a Site column (i.e. they have a data.frame with data at only one site, so they have never specified the site as a column), or they might be using a data.frame that has data for a number of sites specified in a column. So if there is no Site column, it is safe to assume that the data is for one site and it's valid to continue calculations, but if there is a Site column I have to check that it only has one distinct value (e.g. it might have been filtered on this column before applying the function, or applied through plyr::ddply).
There are a lot of other cases, however, where I want to check that my input data to a function is of the expected form, and if the input is a data.frame this often means checking for column names and something about that column.
You can decide whether this is a smarter way or not, but one option is to separate the logic using map_if. Here we check the basic condition ("Site" %in% colnames(dat)) in the predicate part, and based on that we call one of two functions: one for TRUE and the other for FALSE. We still check similar conditions, but keeping the functions separate keeps the code clean and makes it easy to see which part is doing what.
library(dplyr)
library(purrr)
testfun <- function(dat, ...) {
unlist(map_if(list(dat), "Site" %in% colnames(dat), true_fun, .else = false_fun))
}
true_fun <- function(dat) {
if(n_distinct(dat$Site) > 1) stop ("Function must have site specific data")
return(1)
}
false_fun <- function(dat) { return(1) }
testfun(testdf1)
#[1] 1
testfun(testdf2)
#[1] 1
testfun(testdf3)
#Error in .f(.x[[i]], ...) : Function must have site specific data
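For comparison, here is a minimal base R sketch of the same check that relies on the short-circuit behaviour of &&: the right-hand side is only evaluated when the column exists, so the nested if collapses into one condition. The name check_site and the small test data frames are made up for this sketch, and length(unique(...)) stands in for dplyr's n_distinct so that no packages are needed.

```r
# check_site is a hypothetical helper; it stops on multi-site input
check_site <- function(dat) {
  # && evaluates the second test only if the Site column is present
  if ("Site" %in% colnames(dat) && length(unique(dat$Site)) > 1) {
    stop("Function must have site specific data")
  }
  invisible(TRUE)
}

no_site   <- data.frame(x = 1:4)
one_site  <- data.frame(x = 1:4, Site = "A")
two_sites <- data.frame(x = 1:4, Site = c("A", "A", "B", "B"))

ok_none <- check_site(no_site)
ok_one  <- check_site(one_site)
# try() captures the error so the script keeps running
failed  <- inherits(try(check_site(two_sites), silent = TRUE), "try-error")
```

Each further input condition can then be its own one-line if at the top of the function, which tends to stay readable as the number of checks grows.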
In Stata, summarize prints a brief statistical summary of all variables in the current workspace. In R, summary(<myvariable>) does something similar for a particular <myvariable>.
Q: In R, how should I print a statistical summary of ALL relevant variables in my workspace?
I tried:
x <- runif(4)
y <- runif(4)
z <- runif(3)
w <- matrix(runif(4), nrow = 2)
sapply(ls(), function(i) {if (class(get(i)) == "numeric") summary(get(i))})
which gets close to what I want. But it still prints
$w
NULL
...
which is undesirable. Also, this code throws an error when there's a variable of type closure in my workspace...
I feel like I'm going off into the weeds here. There must be a simpler, straightforward way of more-or-less replicating Stata's summarize in R, right?
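You can use methods() to determine which variable types work with summary:
summary.methods <- methods(summary)
# TRUE if x is an atomic vector (handled by summary.default) or if a
# summary method is registered for its first class
check.method <- function(x) {
  is.atomic(x) ||
    any(grepl(paste0('^summary\\.', class(x)[1], '$'), summary.methods))
}
# Summarise every named object in envir, recursing into lists and data frames
summarize.all <- function(nms, envir = .GlobalEnv) {
  for (z in nms) {
    obj <- get(z, envir = envir)
    if (inherits(obj, c('list', 'data.frame')))
      summarize.all(names(obj), as.environment(obj))
    else if (check.method(obj))
      print(summary(obj))
    else
      print(paste0("No summary for: ", z))
  }
}
summarize.all(ls())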
You may want to change this depending on how much data you have, but it should work.
Added some recursion for list/data frames.
If you want to get it to work with lists and individual data frame columns, I would check for those classes and use as.environment to get variables from the list/frame. I can show you a more explicit way of doing this later if you like.
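As a rough sketch of the as.environment idea mentioned above: converting a data frame (or a named list) to an environment turns each column into a variable that ls(), get() and mget() can see. The data frame below is made up for illustration.

```r
scores <- data.frame(a = c(1, 2, 3), b = c(10, 20, 60))  # made-up data

# Each column of the data frame becomes a variable in the new environment
env <- as.environment(scores)
col_names <- sort(ls(env))

# Fetch the columns by name and summarise each one
col_summaries <- lapply(mget(col_names, envir = env), summary)
```

Note that this only works for lists whose elements are all named, since every variable in an environment needs a name.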
I want to go through a vector, name all variables with i and use i to subset a larger file.
Why does this not work?
x <- c(seq(.1,.9,.1),seq(.9,1,.01))
doplot <- function(y)
{
for (i in unique(y))
{
paste("f_", i, sep = "") <- (F_agg[F_agg$Assort==i,])
}
}
doplot(x)
There are several problems here. First of all, on the left hand side of <- you need a symbol (well, or a special function, but let's not get into that now). So when you do this:
a <- "b"
a <- 15
then a will be set to 15; R does not first evaluate a to "b" and then set b to 15.
Then, if you create variables within a function, they will be (by default) local to that function, and destroyed at the end of the function.
Third, it is not good practice to create variables this way. (For details I will not go into now.) It is better to put your data in a named list, and then return the list from the function.
Here is a solution that should work, although I cannot test it, because you did not provide any test data:
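doplot <- function(y) {
  vals <- unique(y)
  # one subset of F_agg per value of Assort
  res <- lapply(vals, function(i) {
    F_agg[F_agg$Assort == i, ]
  })
  # name the list elements f_<value>
  setNames(res, paste0("f_", vals))
}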
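If the goal is simply one subset per value of Assort, base R's split() gives a named list directly. This is a sketch on made-up data (the value column and the numbers are invented, since no test data was provided), offered as an alternative rather than as the function above.

```r
# Made-up stand-in for F_agg
F_agg <- data.frame(
  Assort = c(1, 1, 2, 3, 3, 3),
  value  = c(5, 6, 7, 8, 9, 10)
)

# split() returns a list with one data frame per distinct Assort value,
# named after those values
pieces <- split(F_agg, F_agg$Assort)
names(pieces) <- paste0("f_", names(pieces))
```

Individual subsets are then available as pieces$f_1, pieces[["f_2"]], and so on, without creating separate variables.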
I have a list of samples, each of varying lengths. I need to compare sample means (using a Mann-Whitney-Wilcoxon test) for all samples in the list. Current code is as follows:
wilcox.v = list() ##This creates the list of samples
for (i in df){
treat = list(i$treatment)
wilcox.v = c(wilcox.v,treat)
}
###This *should* iterate over all items in the list
wilcox = sapply(wilcox.v, function(i){ wilcox.test(as.numeric(wilcox.v[i,]), as.numeric(wilcox.v[-i,]), exact = FALSE)$p.value
})
I'd like to have the function return a vector of p-values, so that the broader function can re-sample if necessary.
The problem seems to lie in the need to compare a sample mean to all other sample means in the list.
I'm sure there's an easy way to do this (and I think it has something to do with calling indices correctly), but I'm not sure!
As joran said, you wrote your apply function a little wonky. There are two ways you can fix this.
Modify it so i is in fact an index reference:
wilcox = sapply(seq_along(wilcox.v)
                ,function(i){ wilcox.test(as.numeric(wilcox.v[[i]])
                                          # pool every sample except the i-th
                                          ,as.numeric(unlist(wilcox.v[-i])), exact = FALSE)$p.value
                })
Modify your function so it appropriately treats i as a list element. I'll leave this as an exercise to you (primarily since I don't want to deal with the wilcox.v[-i,] term).
Thanks for your help! This is the solution I ended up using. It's hardly elegant but it gets the job done.
mannwhit = vector()
for (i in mannwhit.v){
for (j in mannwhit.v){
if (identical(i,j) == FALSE){
p.val = wilcox.test(i, j, paired=FALSE)$p.value
mannwhit = c(mannwhit, p.val)
}
}
}
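For reference, a minimal sketch of the same pairwise idea using combn(), which enumerates each unordered pair of samples once (the nested loop above tests every pair in both directions). The sample values below are made up, and exact = FALSE is carried over from the original question.

```r
# Made-up samples of varying lengths
samples <- list(
  a = c(1.1, 2.3, 3.8, 4.2),
  b = c(2.0, 3.1, 5.5),
  c = c(7.4, 8.9, 9.3, 10.6, 12.1)
)

# Each column of `pairs` holds the names of one pair of samples
pairs <- combn(names(samples), 2)

# One Mann-Whitney-Wilcoxon p-value per pair
p_values <- apply(pairs, 2, function(p) {
  wilcox.test(samples[[p[1]]], samples[[p[2]]], exact = FALSE)$p.value
})
names(p_values) <- paste(pairs[1, ], pairs[2, ], sep = "_vs_")
```

The result is a named numeric vector, so a calling function can inspect individual comparisons or pass the whole vector to p.adjust() if a multiple-testing correction is wanted.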