My question isn't quite as crazy as it sounds (hopefully). I have a function which takes as an input a list of dataframes, and gives as an output a list of corresponding linear models. One input to the function is the argument log_transform. I'd like the user to be able to input a list of variables to be log transformed, and have that be taken into account by the model. This gets a little complex however, since this needs to be applied not only to multiple variables, but to multiple variables across multiple dataframes. As is, I have this coded as so:
function(df_list, log_transform = c("var1", "var2")) {
if(!is.null(log_transform)) { #"If someone inputs a list of variables to be transformed, then..."
trans <- function(df) {
sapply(log_transform, function(x) { #"...apply a log transformation to each variable..."
x <- log(x + 1)
}, simplify = T)
llply(df_list, trans) #"...for each dataframe in df_list."
}
etc
}
However, when I try to run this, I receive the error:
Error in x + 1 : non-numeric argument to binary operator
Where am I going wrong?
Thanks
No test cases were provided, so this remains untested, but it was checked for syntactic completeness and a right curly brace was added. You still need to reference the columns by name within the component dataframes, which your code was not doing:
function(df_list, log_transform = c("var1", "var2")) {
  if(!is.null(log_transform))
  { #"If someone inputs a list of variables to be transformed, then..."
    trans <- function(df) {
      # Single brackets select several columns at once; lapply returns one list element per column
      df[log_transform] <- lapply(log_transform, function(x) {
        #"...apply a log transformation to each variable..."
        log(df[[x]] + 1)
      } )
      df # return the modified dataframe
    }
    df_list <- llply(df_list, trans) # llply comes from the plyr package
    #"...for each dataframe in df_list."
  }
  #etc
}
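To try the idea on small data, here is a minimal self-contained sketch. It uses base R's lapply in place of plyr's llply, and log1p(x) as the base R equivalent of log(x + 1). The toy dataframes and the helper name log_transform_list are made up for illustration and are not from the original post.

```r
# Hypothetical toy data: two dataframes sharing the columns var1 and var2
df_list <- list(
  a = data.frame(var1 = c(0, 1, 9), var2 = c(2, 4, 6), other = c("x", "y", "z")),
  b = data.frame(var1 = c(3, 5), var2 = c(0, 99), other = c("u", "v"))
)

# Illustrative helper (name and structure are assumptions)
log_transform_list <- function(df_list, log_transform = c("var1", "var2")) {
  if (is.null(log_transform)) return(df_list)
  lapply(df_list, function(df) {
    # Replace each named column by log(column + 1); other columns stay untouched
    df[log_transform] <- lapply(df[log_transform], log1p)
    df
  })
}

res <- log_transform_list(df_list)
res$a$var1  # same values as log(c(0, 1, 9) + 1)
```

The transformed list can then be passed on to the model-fitting step in place of the original df_list.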
I'm still new to writing my own functions. As an exercise, and because I use it a lot, I want to write a flexible function to easily reverse survey response scales. This is what I came up with:
rev_scale = function(var, new_var, scale){
for (i in 1:length(abs(var))){
new_var[i] = scale-abs(var[i])+1
}
}
Info on code
var = variable I want to reverse.
new_var = new column with the reversed variable
scale = how many points in the scale (eg. 5 for a 5-point scale)
The reason why I use 'abs' instead of just 'var' is that some dataframes also return value-labels, and I only want the values in this function.
Question
When applying this new function on a variable, R returns "NULL". However, if I run the for-loop separately, with the arguments 'imputed', my new variable is properly reversed.
Any ideas on what is happening here?
Thanks in advance!
### Example of the (working) for-loop with arguments 'imputed' ###
df <- data.frame(matrix(ncol = 1, nrow = 4))
df$var = c(1,2,3,4)
for (i in 1:length(abs(df$var))){
df$var_rev[i] = 4-abs(df$var[i])+1
}
df$var_rev
OUTPUT:
[1] 4 3 2 1
R does not use reference-variables (think pointers)*. So the new_var outside of your function does not get updated when it is referred to inside the function. Instead, R creates a new copy of new_var and updates that.
You should instead return the new value from your function. I.e.
rev_scale = function(var, scale){
res <- vector('numeric', length(var))
for (i in 1:length(abs(var))){
res[i] = scale-abs(var[i])+1
}
return(res)
}
Also note that I have removed new_var from the function's arguments. In other words, I have completely separated the function's input arguments from its output.
The reason you get a NULL from the function is that in R, every function returns something. If there is no explicit return(), the function returns the value of the last expression it evaluated. In your function that last expression is a for loop, and a loop always evaluates to NULL (invisibly). An if statement, by contrast, evaluates to the value of whichever branch was run.
* There are a couple of exceptions and work-arounds, but I will not go into that here.
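Both points (the function works on a copy, and a trailing loop yields NULL) can be checked with a small sketch. The function names loop_only and loop_and_return are made up for this illustration.

```r
# A function whose last expression is a for loop returns NULL (invisibly)
loop_only <- function(x) {
  for (i in seq_along(x)) x[i] <- x[i] * 2  # modifies the local copy only
}

# Making the modified copy the last expression hands it back to the caller
loop_and_return <- function(x) {
  for (i in seq_along(x)) x[i] <- x[i] * 2
  x
}

v <- c(1, 2, 3)
is.null(loop_only(v))  # TRUE
loop_and_return(v)     # 2 4 6
v                      # still 1 2 3: the caller's vector is unchanged
```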
Edit:
As benimwolfspelz noted, you do not need to explicitly iterate over each element in var, as R does this implicitly. Your entire function could be reduced to:
rev_scale = function(var, scale) {
scale-abs(var)+1
}
Secondly, in your for-loop, you can simplify length(abs(var)) to length(var), as abs(var) does not change the length of the vector.
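As a usage sketch, the vectorized version can be applied to the example data from the question. The result is assigned to the new column by the caller rather than inside the function.

```r
rev_scale <- function(var, scale) {
  scale - abs(var) + 1
}

# Same values as in the question's working for-loop example
df <- data.frame(var = c(1, 2, 3, 4))
df$var_rev <- rev_scale(df$var, scale = 4)
df$var_rev  # 4 3 2 1
```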
First let me say that I am not an expert coder and any advice about this particular question or my general technique will be greatly appreciated.
I have a large data set that is made up of similar data frames named Table6.# such as: Table6.1, Table6.2, etc. I have variables in each data frame that repeat as well, such as: ST1_Delta_PV%, ST2_Delta_PV%, etc. and ST1_Realloc_Margin, ST2_Reallocation_Margin, etc.
I am trying to write several nested loops that will calculate values in each table across these similar variables. I have tried to do this with the paste function as shown below, but this is obviously not the correct way to do this.
for (i in 1:25){
for (j in 1:4){
for (k in 1:length(paste("Table6.",i,"sep="")[,1]){
paste("Table6.",i,sep="")$paste("ST",j,"NonTgt_Shr",sep="")[k] <- paste("Table6.",i,sep="")$paste("ST",j,"_Delta_PV%",sep="")[k] * paste("Table6.",i,sep="")$paste("ST",j,"_Reallocation_Margin",sep="")[k]
}
}
}
I apologize if this is a complete mess. I appreciate your help.
As akrun says, you should put your data frames in a list
Tables <- list(Table6.1, Table6.2, …)
for (Table in Tables) { … }
This way, you do not need to use paste to construct the different Table names.
For accessing the different columns, you can use the df[["column"]] syntax. This is similar to df$column, except that inside the brackets you can use any string, including one stored in a variable. (Single brackets, df["column"], return a one-column data frame rather than the column itself.)
nonTgt_Shr.column.name <- paste0("ST",j,"NonTgt_Shr")
delta.column.name <- paste0("ST",j,"_Delta_PV%")
for (k in 1:nrow(Table)) {
Table[[nonTgt_Shr.column.name]][k] <- Table[[delta.column.name]][k] * …
}
Note how I use variables for storing the name, making the line with the actual computation much more readable.
Also, nrow is more intuitive than length(Table[,1]).
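Put together, the approach might look like the sketch below. The two toy tables and their values are made up, and only the ST1 prefix is included to keep it short. The sketch uses double brackets so that each column comes back as a plain vector, multiplies whole columns at once instead of looping over rows, and indexes the list by name, because a loop variable such as Table would only be a copy of the list element.

```r
# Hypothetical stand-ins for Table6.1 and Table6.2 (values are invented)
Table6.1 <- data.frame(`ST1_Delta_PV%` = c(0.1, 0.2),
                       ST1_Reallocation_Margin = c(10, 20),
                       check.names = FALSE)  # keep the % in the column name
Table6.2 <- data.frame(`ST1_Delta_PV%` = 0.5,
                       ST1_Reallocation_Margin = 4,
                       check.names = FALSE)
Tables <- list(Table6.1 = Table6.1, Table6.2 = Table6.2)

for (n in names(Tables)) {
  for (j in 1) {  # would be 1:4 with the question's full set of prefixes
    out    <- paste0("ST", j, "NonTgt_Shr")
    delta  <- paste0("ST", j, "_Delta_PV%")
    margin <- paste0("ST", j, "_Reallocation_Margin")
    # Assigning through Tables[[n]] stores the new column in the list itself
    Tables[[n]][[out]] <- Tables[[n]][[delta]] * Tables[[n]][[margin]]
  }
}

Tables$Table6.1$ST1NonTgt_Shr  # 1 4
```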
The calculations could be moved into a function, which improves readability, scalability and robustness.
In the actual calculation function, the function get is used to retrieve the data frame based on the name.
#Calculation Function
fn_CalcVariables <- function(
tableName="Table6.1",
outputVarName="NonTgt_Shr",
inputVarNames=c("_Delta_PV%", "_Reallocation_Margin"),
variablePrefix="ST1"
) {
DF <- get(tableName)
outputVarName <- paste0(variablePrefix, outputVarName)
inputVarNames <- paste0(variablePrefix, inputVarNames)
DF[,outputVarName] <- DF[,inputVarNames[1]] * DF[,inputVarNames[2]]
return(DF)
}
This function should be called by nested lapply calls.
lapply iterates over the elements of its first argument, calls the function (its second argument) on each element, and collects the return values in a list.
(As an exercise, try l <- list(a=1, b=2); lapply(l, function(x) { x*2 }).)
#List object names for tables and variable names
tableNamesList <- paste0("Table6.",1:25)
variablePrefixList <- paste0("ST",1:4)
#Nested loops to invoke custom function from above
lapply(variablePrefixList, function(alpha) {
  lapply(tableNamesList, function(x, varprefix=alpha) {
    cat("Begin Processing Table",x,"varPrefix",varprefix,"\n")
    result <- fn_CalcVariables(
      tableName=x,
      outputVarName="NonTgt_Shr",
      inputVarNames=c("_Delta_PV%","_Reallocation_Margin"),
      variablePrefix=varprefix
    )
    cat("End Processing Table", x, "varPrefix", varprefix, "\n")
    result #Return the modified table; otherwise the value of cat(), NULL, would be collected
  }) #End of inner lapply
}) #End of outer lapply
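One thing to keep in mind with this pattern is that get() hands the function a copy, so the results have to be collected from lapply's return value. A reduced sketch with made-up tables and a simplified stand-in function (calc_product is a hypothetical name, not the function above) shows this:

```r
# Toy tables standing in for Table6.1 and Table6.2 (values are invented)
Table6.1 <- data.frame(a = c(1, 2), b = c(3, 4))
Table6.2 <- data.frame(a = c(5, 6), b = c(7, 8))

# Simplified stand-in for fn_CalcVariables: looks the table up by name
calc_product <- function(tableName) {
  DF <- get(tableName)
  DF[, "prod"] <- DF[, "a"] * DF[, "b"]
  DF  # the modified copy is the return value
}

tableNames <- paste0("Table6.", 1:2)
results <- lapply(tableNames, calc_product)
names(results) <- tableNames

results[["Table6.2"]]$prod   # 35 48
"prod" %in% names(Table6.1)  # FALSE: the original table is not modified
```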
I have a dataset of clinics, where each clinic is made up of doctors who perform procedures on patients.
I have written functions to perform analyses on the dataset, filtering by clinic lists or doctor lists (a simple one is below):
num.of <- function(x.doctor, x.clinic){
if (!missing(x.clinic)){
df_filter <- filter(df_clean, clinic == x.clinic)
}
if (!missing(x.doctor)) {
df_filter <- filter(df_clean, doctor == x.doctor)
}
num_doctor <- length(unique(df_filter$doctor))
num_surveys <- nrow(df_filter)
num_procedure <- length(unique(df_filter$PPID))
result <- setNames(c(num_doctor, num_surveys, num_procedure), c("num_doctor", "num_surveys", "num_procedure"))
return(result)
}
I am attempting to call on these functions with either a list of doctors or a list of clinics:
sapply(doctor_list, num.of, x.clinic = NULL)
However, the function only works when the 'first' argument is passed through, i.e. the function above does not work, but this does:
sapply(clinic_list, num.of, x.doctor = NULL)
If the arguments are reversed when writing the initial function, the opposite of the above examples is true.
The functions are fed only one set of arguments at a time: Either a list for x.doctor or a list for x.clinic.
How can I rewrite my functions so that apply works for x.clinic and, in a separate function call, for x.doctor?
Thank you!
Try this:
num.of <- function(x, data, type = c("doctor", "clinic")) {
type <- match.arg(type)
df_filter <-
if (type == "doctor") {
filter(data, doctor == x)
} else {
filter(data, clinic == x)
}
num_doctor <- length(unique(df_filter$doctor))
num_surveys <- nrow(df_filter)
num_procedure <- length(unique(df_filter$PPID))
result <- setNames(c(num_doctor, num_surveys, num_procedure), c("num_doctor", "num_surveys", "num_procedure"))
return(result)
}
This enables an explicit and clear call:
sapply(doctor_list, num.of, data = df_clean, type = "doctor")
sapply(clinic_list, num.of, data = df_clean, type = "clinic")
I took the liberty of helping with a scope breach: accessing df_clean from inside the function may work but can present problems in the future. It makes the function very context-dependent and inflexible in the presence of multiple datasets. Even if you are 100% certain you will always always always have df_clean in your calling (or global) environment for this case, it's a good habit (among "Best Practices TM").
If this doesn't work, then you might need to make a more reproducible example so that we can actually test the function. Since you may not want to include actual data, it makes things much easier for everyone else if you make it generic-as-ever, with simple names and simple example data.
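A generic example of that kind could look like the sketch below. The data values are invented (only the column names clinic, doctor and PPID come from the question), and the filtering uses base R subsetting instead of dplyr's filter so that the sketch runs without extra packages.

```r
# Made-up example data; column names follow the question
df_clean <- data.frame(
  clinic = c("c1", "c1", "c2", "c2", "c2"),
  doctor = c("d1", "d2", "d3", "d3", "d1"),
  PPID   = c(101, 102, 103, 103, 104)
)

# Same interface as above, with base R subsetting as an assumption-free substitute
num.of <- function(x, data, type = c("doctor", "clinic")) {
  type <- match.arg(type)  # stops with an error for any other value
  df_filter <- data[data[[type]] == x, , drop = FALSE]
  c(num_doctor    = length(unique(df_filter$doctor)),
    num_surveys   = nrow(df_filter),
    num_procedure = length(unique(df_filter$PPID)))
}

by_clinic <- sapply(c("c1", "c2"), num.of, data = df_clean, type = "clinic")
by_clinic  # a matrix with one column per clinic
```

Because type is matched against a fixed set of choices, a misspelled value fails immediately instead of silently filtering on the wrong column.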
I need to apply transformations to all numeric variables of a large dataframe. The dataframe has variables of other types as well. My initial idea was to iterate over all the columns, check if they are numerical and then divide them by 1000.
I've got stuck in my code for a function, would appreciate some pointers here:
transformDivideThousand <- function(data_frame){
for(i in ncol(data_frame)){
if (is.numeric(data_frame[i])) {
data_frame[i]/1000
}
}
return(data_frame)
}
The execution of the function:
test <- transformDivideThousand(mypatients)
test is a dataframe, but the transformations are not happening. Where did I err?
As an extra, I would also like transformDivideThousand to have an optional argument where I could pass a list with the names of the variables to use; if it is empty, then iterate over all of them.
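@nicola's comment explains what's going wrong with your loop. Looking at the code, three things stand out: for(i in ncol(data_frame)) visits only the last column, data_frame[i] is a one-column data frame (so is.numeric returns FALSE), and the result of the division is never assigned back. Another option is to use sapply to identify the numeric columns, which results in more succinct code. For example, using the built-in iris data frame: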
iris[, sapply(iris, is.numeric)] =
iris[, sapply(iris, is.numeric)]/1000
You can just run this directly on a data frame, as above, or put it inside a function:
tDT <- function(data_frame) {
data_frame[, sapply(data_frame, is.numeric)] =
data_frame[, sapply(data_frame, is.numeric)]/1000
return(data_frame)
}
Then, to run it:
iris.new = tDT(iris)
For future reference, per @nicola's comment, here's how to make the for loop version work:
tDT2 <- function(data_frame) {
for (i in 1:ncol(data_frame)) {
if (is.numeric(data_frame[,i])) {
data_frame[,i] = data_frame[,i]/1000
}
}
return(data_frame)
}
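For the optional argument mentioned at the end of the question, one possible sketch is shown below. The function name tDT3 and the argument name vars are made up here. When vars is NULL every column is considered, and non-numeric columns among the requested names are skipped rather than treated as an error.

```r
tDT3 <- function(data_frame, vars = NULL) {
  # With no names given, fall back to every column
  if (is.null(vars)) vars <- names(data_frame)
  # Keep only those requested columns that are numeric
  is_num <- vapply(data_frame[vars], is.numeric, logical(1))
  num_vars <- vars[is_num]
  if (length(num_vars) > 0) {
    data_frame[num_vars] <- data_frame[num_vars] / 1000
  }
  data_frame
}

# Only Sepal.Length is divided; Species is requested but not numeric
scaled <- tDT3(iris, vars = c("Sepal.Length", "Species"))
head(scaled, 2)
```

A name in vars that does not exist in the data frame makes the column selection fail with an error, which seems preferable to ignoring it silently.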
I want to go through a vector, name all variables with i and use i to subset a larger file.
Why does this not work?
x <- c(seq(.1,.9,.1),seq(.9,1,.01))
doplot <- function(y)
{
for (i in unique(y))
{
paste("f_", i, sep = "") <- (F_agg[F_agg$Assort==i,])
}
}
doplot(x)
There are several problems here. First of all, on the left hand side of <- you need a symbol (well, or a special function, but let's not get into that now). So when you do this:
a <- "b"
a <- 15
then a will be set to 15, instead of first evaluating a to be b and then setting b to 15.
Then, if you create variables within a function, they will be (by default) local to that function, and destroyed at the end of the function.
Third, it is not good practice to create variables this way. (For details I will not go into now.) It is better to put your data in a named list, and then return the list from the function.
Here is a solution that should work, although I cannot test it, because you did not provide any test data:
doplot <- function(y) {
lapply(unique(y), function(i) {
F_agg[F_agg$Assort == i, ]
})
}
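To try this out without the real data, a sketch with a made-up stand-in for F_agg could look as follows. Only the Assort column is taken from the question; the value column, the function name subset_by_assort, and passing the data as an argument are assumptions for illustration. The list elements are named with the f_ prefix the question was trying to create.

```r
# Made-up stand-in for F_agg; only the Assort column comes from the question
F_agg <- data.frame(Assort = c(0.1, 0.1, 0.5, 0.9), value = 1:4)

# Variant that takes the data as an argument and names each list element
subset_by_assort <- function(data, y) {
  values <- unique(y)
  out <- lapply(values, function(i) data[data$Assort == i, ])
  names(out) <- paste0("f_", values)
  out
}

subsets <- subset_by_assort(F_agg, c(0.1, 0.5, 0.9))
names(subsets)              # "f_0.1" "f_0.5" "f_0.9"
nrow(subsets[["f_0.1"]])    # 2
```

Each subset is then available as subsets[["f_0.1"]] and so on, which replaces the separate f_ variables.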