Transforming all numerical variables in a dataframe with a function - r

I need to apply transformations to all numeric variables of a large dataframe. The dataframe has variables of other types as well. My initial idea was to iterate over all the columns, check if they are numerical and then divide them by 1000.
I've gotten stuck writing the function and would appreciate some pointers:
transformDivideThousand <- function(data_frame){
  for(i in ncol(data_frame)){
    if (is.numeric(data_frame[i])) {
      data_frame[i]/1000
    }
  }
  return(data_frame)
}
The execution of the function:
test <- transformDivideThousand(mypatients)
test is a dataframe, but the transformations are not happening. Where did I err?
As an extra, I would also like transformDivideThousand to have an optional argument where I could pass a list with the names of the variables to use; if empty, then iterate over all of them.

@nicola's comment explains what's going wrong with your loop. Another option is to use sapply to identify the numeric columns, which results in more succinct code. For example, using the built-in iris data frame:
iris[, sapply(iris, is.numeric)] =
  iris[, sapply(iris, is.numeric)] / 1000
You can just run this directly on a data frame, as above, or put it inside a function:
tDT <- function(data_frame) {
  data_frame[, sapply(data_frame, is.numeric)] =
    data_frame[, sapply(data_frame, is.numeric)] / 1000
  return(data_frame)
}
Then, to run it:
iris.new = tDT(iris)
For future reference, per @nicola's comment, here's how to make the for loop version work:
tDT2 <- function(data_frame) {
  for (i in 1:ncol(data_frame)) {
    if (is.numeric(data_frame[, i])) {
      data_frame[, i] = data_frame[, i] / 1000
    }
  }
  return(data_frame)
}
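For the optional argument mentioned in the question (a list of variable names to transform, defaulting to all numeric columns), here is one possible sketch; the cols argument name is just illustrative:
tDT3 <- function(data_frame, cols = NULL) {
  # default to every numeric column when no names are supplied
  if (is.null(cols)) {
    cols <- names(data_frame)[sapply(data_frame, is.numeric)]
  }
  data_frame[cols] <- data_frame[cols] / 1000
  return(data_frame)
}

# e.g. tDT3(iris, cols = c("Sepal.Length", "Sepal.Width"))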

Related

Calculating variable values using paste function in R

First let me say that I am not an expert coder and any advice about this particular question or my general technique will be greatly appreciated.
I have a large data set that is made up of similar data frames named Table6.#, such as Table6.1, Table6.2, etc. I have variables in each data frame that repeat as well, such as ST1_Delta_PV%, ST2_Delta_PV%, etc. and ST1_Realloc_Margin, ST2_Reallocation_Margin, etc.
I am trying to write several nested loops that will calculate values in each table across these similar variables. I have tried to do this with the paste function, as shown below, but this is obviously not the correct way to do it.
for (i in 1:25){
  for (j in 1:4){
    for (k in 1:length(paste("Table6.",i,sep="")[,1])){
      paste("Table6.",i,sep="")$paste("ST",j,"NonTgt_Shr",sep="")[k] <- paste("Table6.",i,sep="")$paste("ST",j,"_Delta_PV%",sep="")[k] * paste("Table6.",i,sep="")$paste("ST",j,"_Reallocation_Margin",sep="")[k]
    }
  }
}
I apologize if this is a complete mess. I appreciate your help.
As akrun says, you should put your data frames in a list
Tables <- list(Table6.1, Table6.2, …)
for (Table in Tables) { … }
This way, you do not need to use paste to construct the different Table names.
For accessing the different columns, you can use the df[["column"]] syntax - this is similar to df$column, except that inside the brackets you can use any string:
nonTgt_Shr.column.name <- paste0("ST", j, "NonTgt_Shr")
delta.column.name <- paste0("ST", j, "_Delta_PV%")
for (k in 1:nrow(Table)) {
  Table[[nonTgt_Shr.column.name]][k] <- Table[[delta.column.name]][k] * …
}
Note how I use variables for storing the name, making the line with the actual computation much more readable.
Also, nrow is more intuitive than length(Table[,1]).
The calculations can be wrapped in a function, which improves readability, scalability, and robustness.
In the calculation function, get is used to retrieve the data frame by its name.
#Calculation function
fn_CalcVariables <- function(
  tableName = "Table6.1",
  outputVarName = "NonTgt_Shr",
  inputVarNames = c("_Delta_PV%", "_Reallocation_Margin"),
  variablePrefix = "ST1"
) {
  DF <- get(tableName)
  outputVarName <- paste0(variablePrefix, outputVarName)
  inputVarNames <- paste0(variablePrefix, inputVarNames)
  DF[, outputVarName] <- DF[, inputVarNames[1]] * DF[, inputVarNames[2]]
  return(DF)
}
This function should be called via nested lapply calls.
lapply iterates over the lists of the arguments, calls the function (second argument), and collects a list of the return values.
(As an exercise, try l <- list(a=1, b=2); lapply(l, function(x) { x*2 }).)
#List object names for tables and variable names
tableNamesList <- paste0("Table6.", 1:25)
variablePrefixList <- paste0("ST", 1:4)

#Nested loops to invoke the custom function from above
lapply(variablePrefixList, function(alpha) {
  lapply(tableNamesList, function(x, varprefix = alpha) {
    cat("Begin Processing Table", x, "varPrefix", varprefix, "\n")
    result <- fn_CalcVariables(
      tableName = x,
      outputVarName = "NonTgt_Shr",
      inputVarNames = c("_Delta_PV%", "_Reallocation_Margin"),
      variablePrefix = varprefix
    )
    cat("End Processing Table", x, "varPrefix", varprefix, "\n")
    result  # return the updated data frame so lapply collects it
  }) # End of inner lapply
}) # End of outer lapply
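Note that lapply returns modified copies, so the call above produces a nested list of new data frames rather than updating the Table6.* objects themselves. If the originals should be overwritten, here is a minimal sketch, assuming the tables live in the global environment:
for (tbl in tableNamesList) {
  for (prefix in variablePrefixList) {
    assign(tbl, fn_CalcVariables(tableName = tbl,
                                 outputVarName = "NonTgt_Shr",
                                 inputVarNames = c("_Delta_PV%", "_Reallocation_Margin"),
                                 variablePrefix = prefix))
  }
}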

How to create a function in R that takes two arguments from two lists?

I want to generate a dataframe with summary statistics (AUC, Gini, RMSE, etc.) from validation of multiple models on multiple datasets.
I've got x models (classifiers - gbm, xgb, rf, etc. - all built with the caret package) enclosed in ListOfModels, and y datasets (dataframes with identical variables over several data points) enclosed in ListOfDatasets.
I can create a short version of the desired dataframe by running a custom function fun_modelStats (which extracts model stats, taking a model and a dataset as arguments) inside ldply - but only over ListOfModels with one specific dataset, or over ListOfDatasets with one specific model, like this:
modelStats_by_model <- ldply(ListOfModels, function(model) {
  modelStats <- fun_modelStats(model, B97_2012SU_2013)
})
and
modelStats_by_dataset <- ldply(ListOfDatasets, function(dataset) {
  modelStats <- fun_modelStats(gbmFit1, dataset)
})
The resulting dataframe of model stats has either x or y rows, and I can't work out how to build a dataframe with x*y rows, i.e. stats from all models validated on all datasets.
I experimented with Map, mapply, and a for loop, but to no avail.
Using Map I get weird incorrect output:
modelStats_all <- Map(fun_modelStats, ListOfModels, ListOfDatasets)
The for loop below does generate the desired output, but only as plain text in the console, whereas I need it as a dataframe.
for(i in names(ListOfModels)) {
  for(j in names(ListOfDatasets)) {
    modelStats <- fun_modelStats(ListOfModels[[i]], ListOfDatasets[[j]])
    print(modelStats)
  }
}
Many thanks in advance for help!
P.S. Further searching on SO (How to write a function that takes a model as an argument in R - this post, for example) suggests that aggregate.formula, aggregate.data.frame, or rbind.data.frame could help, but I can't figure out how.
Here is the solution, in case anyone faces a similar problem:
fun_multiModelStats <- function(ListOfModels, ListOfDatasets) {
  multiModelStats <- data.frame()
  for(i in names(ListOfModels)) {
    for(j in names(ListOfDatasets)) {
      model <- ListOfModels[[i]]
      dataset <- ListOfDatasets[[j]]
      modelStats <- fun_modelStats(model, dataset)
      modelName <- names(ListOfModels[i])
      datasetName <- names(ListOfDatasets[j])
      modelStats <- cbind(modelName, datasetName, modelStats)
      multiModelStats <- rbind(multiModelStats, modelStats)
    }
  }
  return(multiModelStats)
}
Yet, I would like to find a solution without double for loops but rather with something from the apply family of functions.
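One apply-family sketch that keeps the same fun_modelStats interface, assuming it returns a one-row data frame as in the loop above: build the x*y grid of name pairs with expand.grid, call the function for each pair with Map, and bind the results.
combos <- expand.grid(modelName = names(ListOfModels),
                      datasetName = names(ListOfDatasets),
                      stringsAsFactors = FALSE)

statsList <- Map(function(m, d) {
  cbind(modelName = m, datasetName = d,
        fun_modelStats(ListOfModels[[m]], ListOfDatasets[[d]]))
}, combos$modelName, combos$datasetName)

multiModelStats <- do.call(rbind, statsList)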
What happens if you modify your for loop as follows?
modelStats <- data.frame()
for(i in names(ListOfModels)) {
  for(j in names(ListOfDatasets)) {
    modelStats[i,j] <- fun_modelStats(ListOfModels[[i]], ListOfDatasets[[j]])
    print(modelStats)
  }
}

Counting the number of missing values in SPSS file (using memisc)

I'm trying to count, for every variable in an SPSS file, how many observations fall on each of its defined missing values. I imported the file using the memisc package. Here is my current code:
library(memisc)

# Takes about 70 seconds
escc <- spss.system.file(file.choose(), to.lower = FALSE)

system.time({
  esccMiss <- matrix(, length(escc), 9)
  esccMiss[, 1] <- names(escc)
  for (i in 1:length(escc)) {
    x <- escc[i]
    if (length(miss <- missing.values(x)) > 0) {
      ifelse(length(miss@range) > 0,
             vals <- miss@range[1]:(miss@range[1] + 3),
             vals <- miss@filter)
      for (j in 1:length(vals)) {
        esccMiss[i, 2*j]     <- vals[j]
        esccMiss[i, 2*j + 1] <- length(x[x == vals[j]])
      }
    }
  }
})
I'm fairly new to R (which explains the C-like structure of my code) and I realise this is really slow, but I'm having trouble finding a way to do the same thing with lapply on objects from the memisc package.
Forget my other answer, this is much faster:
escc2 <- as.data.set(escc)
system.time(lis <- lapply(escc2,function(x) table(x[which(is.missing(x))])))
Should only take a few seconds now.
Explanation: The original dataset (escc) is of a class that simply does not work with the *apply family, since there isn't a method written for it. However, memisc also includes as.data.set, which produces an object that does work with *apply.
is.missing returns a logical vector indicating which values are marked as missing.
which finds the indices of those missing values, and x[...] subsets x so you are left with only those values.
table then counts how often each of those values occurs.
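If a single table of counts is wanted rather than a list, one possible follow-up sketch (assuming lis is the list produced above) stacks the per-variable tables into a data frame:
missCounts <- do.call(rbind, lapply(names(lis), function(v) {
  tab <- lis[[v]]
  if (length(tab) == 0) return(NULL)   # variable with no missing values
  data.frame(variable = v,
             value = names(tab),
             count = as.integer(tab),
             stringsAsFactors = FALSE)
}))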

Need help applying a function, within an apply, within a function

My question isn't quite as crazy as it sounds (hopefully). I have a function which takes a list of dataframes as input and gives a list of corresponding linear models as output. One input to the function is the argument log_transform. I'd like the user to be able to pass a list of variables to be log transformed, and have that be taken into account by the model. This gets a little complex, however, since the transformation needs to be applied not just to multiple variables, but to multiple variables across multiple dataframes. As is, I have it coded like so:
function(df_list, log_transform = c("var1", "var2")) {
  if(!is.null(log_transform)) { # "If someone inputs a list of variables to be transformed, then..."
    trans <- function(df) {
      sapply(log_transform, function(x) { # "...apply a log transformation to each variable..."
        x <- log(x + 1)
      }, simplify = T)
      llply(df_list, trans) # "...for each dataframe in df_list."
    }
    etc
  }
However, when I try to run this, I receive the error:
Error in x + 1 : non-numeric argument to binary operator
Where am I going wrong?
Thanks
No test cases were provided, so this remains untested, but it was checked for syntactic completeness and a missing right curly brace was added. You still need to reference the columns by name within the component dataframes, which your code was not doing:
function(df_list, log_transform = c("var1", "var2")) {
  if (!is.null(log_transform)) {
    # "If someone inputs a list of variables to be transformed, then..."
    trans <- function(df) {
      df[log_transform] <- sapply(log_transform, function(x) {
        # "...apply a log transformation to each variable..."
        log(df[[x]] + 1)
      })
      df
    }
    df_list <- llply(df_list, trans)
    # "...for each dataframe in df_list."
  }
  # etc
}
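A quick sanity check of the transformation step on toy data (the data frame and variable names here are purely illustrative):
library(plyr)

df_list <- list(
  a = data.frame(var1 = 1:3, var2 = 4:6, other = letters[1:3]),
  b = data.frame(var1 = 7:9, var2 = 1:3, other = letters[4:6])
)
log_transform <- c("var1", "var2")

trans <- function(df) {
  df[log_transform] <- sapply(log_transform, function(x) log(df[[x]] + 1))
  df
}
transformed <- llply(df_list, trans)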

How to index data frame column by a variable?

As an example, I want a function that will iterate over the columns in a dataframe and print out each column's data type (e.g., "numeric", "integer", "character", etc)
Without a variable I know I can do class(df$MyColumn) and get the data type. How can I change it so "MyColumn" is a variable?
What I'm trying is
f <- function(df) {
  for(column in names(df)) {
    columnClass = class(df[column])
    print(columnClass)
  }
}
But this just prints out [1] "data.frame" for each column.
Since a data frame is simply a list, you can loop over the columns using lapply and apply the class function to each column:
lapply(df, class)
To address the previously unspoken concerns in User's comment: if you build a function that does whatever it is you hope to do to a column, then this will succeed:
func <- function(col) {print(class(col))}
lapply(df, func)
It's really mostly equivalent to:
for(col in names(df) ) { print(class(df[[col]]))}
And there would not be an unneeded 'colClass' variable cluttering up the .GlobalEnv.
Use a comma before column:
for(column in names(df)) {
  columnClass = class(df[,column])
  print(columnClass)
}
Much as DWin suggested:
apply(df, 2, class)
But you say you want to do more with each column? What do you want to do? Try to avoid abstract examples.
In case it helps:
apply(df, 2, mean)
apply(df, 2, sd)
or something more complicated:
apply(df, 2, function(x) { s = c(summary(x)["Mean"], summary(x)["Median"], sd(x)) })
Note that the summary function gives you most of this functionality anyway, but this is just an example. Any function can be placed inside an apply and iterated over the columns of a matrix or dataframe; that function can be as complex or as simple as you need it to be.
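One caveat worth keeping in mind: apply coerces the data frame to a matrix first, so on a data frame with mixed column types every column can come back as "character" (and numeric summaries turn into NA). sapply works on the columns directly, for example with the built-in iris data frame:
sapply(iris, class)                            # per-column classes, no matrix coercion
sapply(iris[sapply(iris, is.numeric)], mean)   # means of the numeric columns only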
You can use the colwise function of the plyr package to transform any function into a column-wise function. It is a wrapper around lapply.
library(plyr)
colwise.print.class <- colwise(.fun = function(col) { print(class(col)) })
colwise.print.class(df)
You can view the function created with
print(colwise.print.class)

Resources