Turn index-based loop into name-based function - r

I have at my disposal a clean dataframe (1500r x 297c, named 'Data' - very inspiring) with both numeric and factor columns. However, as is often the case, my factors were encoded as numbers (each number representing a level), hence a dataframe full of numeric vectors.
To overcome this I also have a second dataframe (VarLabels) containing information about the columns of the first dataframe (it has... 297 rows, as you would imagine). In there, one specific column (VarLabels$TypeVar) tells me what the data class of each column in the main dataframe should be.
I wrote the following piece of code, which might not be optimal but proved to work so far:
(NB: as you can see, for data labelled 'MIX' I wish to create a copy to have one numeric and one factor)
nbcol <- ncol(Data)
indexcol <- which(colnames(VarLabels) == "TypeVar")
for (i in 1:nbcol) {
  if (colnames(Data)[[i]] %in% VarLabels$VarName) {
    if (VarLabels[i, indexcol] == "Quant") {
      Data[[i]] <- as.numeric(Data[[i]])
    } else if (VarLabels[i, indexcol] == "Qual") {
      Data[[i]] <- as.character(Data[[i]])
      Data[[i]] <- as.factor(Data[[i]])
    } else if (VarLabels[i, indexcol] == "Mix") {
      Data <- cbind(Data, Data[[i]])
      Data[[i]] <- as.character(Data[[i]])
      Data[[i]] <- as.factor(Data[[i]])
      Data[[ncol(Data)]] <- as.numeric(Data[[ncol(Data)]])
      colnames(Data)[[ncol(Data)]] <- paste(colnames(Data)[[i]], "Num", sep = "_")
    } else {
      Data[[i]] <- as.numeric(Data[[i]])
    }
  } else {
  }
}
Do you have a neater solution, possibly using a function to reduce the number of code lines / using names instead of column index? (which may be risky if order changes in one of the two dataframes) I recently got into R and am still struggling with user-defined functions.
I read other related topics like:
Change all columns from factor to numeric in R
Function to change class of columns in R to match the class of an other dataset
Convert type of multiple columns of a dataframe at once
How do I get the classes of all columns in a data frame?
but could not apply the answers to my own problem. Any idea how to make things simple? (if possible!)

The following function does what the question asks for.
It matches input data set X column names with the new column types with a sequence of which/match statements, without needing loops. The coercion is performed with lapply loops.
The test data set is the built-in data set mtcars.
coerceCols <- function(X, VarLabels){
  i <- which(VarLabels$TypeVar == "Qual")
  j <- match(VarLabels$VarName[i], names(X))
  X[j] <- lapply(X[j], factor)
  i <- which(VarLabels$TypeVar == "Mix")
  j <- match(VarLabels$VarName[i], names(X))
  tmp <- X[j]
  names(tmp) <- paste(names(tmp), "Num", sep = "_")
  X[j] <- lapply(X[j], factor)
  cbind(X, tmp)
}
Data <- mtcars
VarLabels <- data.frame(VarName = names(mtcars),
                        TypeVar = c("Quant", "Mix", "Quant",
                                    "Quant", "Quant", "Quant",
                                    "Quant", "Qual", "Qual",
                                    "Mix", "Mix"),
                        stringsAsFactors = FALSE)
coerceCols(Data, VarLabels)
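As a complement, here is a minimal sketch of a variant that looks every column up by name, also coerces the "Quant" columns, and makes the "_Num" copies explicitly numeric, as the loop in the question does. The helper name coerceByName and the toy data are hypothetical; the sketch assumes VarLabels has the VarName and TypeVar columns described in the question.

```r
# Hypothetical helper: coerce columns of X by name, driven by VarLabels.
coerceByName <- function(X, VarLabels) {
  # Keep only the labels whose column actually exists in X
  labs <- VarLabels[VarLabels$VarName %in% names(X), ]
  for (k in seq_len(nrow(labs))) {
    nm   <- as.character(labs$VarName[k])
    type <- as.character(labs$TypeVar[k])
    if (type == "Qual") {
      X[[nm]] <- factor(X[[nm]])
    } else if (type == "Mix") {
      # Numeric copy first (appended at the end), then the factor version
      X[[paste(nm, "Num", sep = "_")]] <- as.numeric(X[[nm]])
      X[[nm]] <- factor(X[[nm]])
    } else {
      # "Quant" and any unknown label fall back to numeric, as in the question
      X[[nm]] <- as.numeric(X[[nm]])
    }
  }
  X
}

# Toy data; the rows of VarLabels are deliberately in a different order
Data <- data.frame(age = c(31, 45, 27), sex = c(1, 2, 1), grade = c(3, 1, 2))
VarLabels <- data.frame(VarName = c("grade", "sex", "age"),
                        TypeVar = c("Mix", "Qual", "Quant"),
                        stringsAsFactors = FALSE)
Data <- coerceByName(Data, VarLabels)
str(Data)
```

Because each label is resolved through its column name, the result does not depend on the two dataframes sharing the same order.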

Related

R function used to rename columns of a data frames

I have a data frame, say acs10. I need to relabel the columns. To do so, I created another data frame, named as labelName with two columns: The first column contains the old column names, and the second column contains names I want to use, like the table below:
column_1     column_2
oldLabel1    newLabel1
oldLabel2    newLabel2
Then, I wrote a for loop to change the column names:
for (i in seq_len(nrow(labelName))) {
  names(acs10)[names(acs10) == labelName[i, 1]] <- labelName[i, 2]
}
This works.
However, when I tried to put the for loop into a function, because I need to rename column names for other data frames as well, the function failed. The function I wrote looks like below:
renameDF <- function(dataF, varName){
  for (i in seq_len(nrow(varName))){
    names(dataF)[names(dataF) == varName[i, 1]] <- varName[i, 2]
    print(varName[i, 1])
    print(varName[i, 2])
    print(names(dataF))
  }
}
renameDF(acs10, labelName)
where dataF is the data frame whose names I need to change, and varName is another data frame where old variable names and new variable names are paired. I used print(names(dataF)) to debug, and the printout suggests that the function works. However, calling the function does not actually change the column names. I suspect it has something to do with scope, but I want to know how to make it work.
In your function you need to return the changed dataframe.
renameDF <- function(dataF, varName){
  for (i in seq_len(nrow(varName))){
    names(dataF)[names(dataF) == varName[i, 1]] <- varName[i, 2]
  }
  return(dataF)
}
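To illustrate why the return value matters: R passes a copy of the data frame into the function, so the caller only sees the new names after assigning the result back. The toy acs10 and labelName below are hypothetical stand-ins for the data in the question.

```r
renameDF <- function(dataF, varName) {
  for (i in seq_len(nrow(varName))) {
    names(dataF)[names(dataF) == varName[i, 1]] <- varName[i, 2]
  }
  dataF
}

# Hypothetical toy inputs
acs10 <- data.frame(oldLabel1 = 1:2, oldLabel2 = 3:4, other = 5:6)
labelName <- data.frame(column_1 = c("oldLabel1", "oldLabel2"),
                        column_2 = c("newLabel1", "newLabel2"),
                        stringsAsFactors = FALSE)

renamed_copy <- renameDF(acs10, labelName)  # acs10 itself is untouched here
names_before <- names(acs10)

acs10 <- renameDF(acs10, labelName)         # assigning the result keeps the change
names(acs10)
```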
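You can also simplify this and avoid the for loop by using match:
renameDF <- function(dataF, varName){
  # Position of each current name in the old-name column (NA if absent)
  idx <- match(names(dataF), varName[[1]])
  # Rename only the matched columns so unmatched names are not set to NA
  names(dataF)[!is.na(idx)] <- varName[[2]][idx[!is.na(idx)]]
  return(dataF)
}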
This should do the whole thing in one line.
colnames(acs10)[colnames(acs10) %in% labelName$column_1] <- labelName$column_2[match(colnames(acs10)[colnames(acs10) %in% labelName$column_1], labelName$column_1)]
This will work if the column name isn't in the data dictionary, but it's a bit more convoluted:
library(tibble)
df <- tribble(~column_1,   ~column_2,
              "oldLabel1", "newLabel1",
              "oldLabel2", "newLabel2")
d <- tibble(oldLabel1 = NA, oldLabel2 = NA, oldLabel3 = NA)
fun <- function(dat, dict) {
  names(dat) <- sapply(names(dat), function(x) ifelse(x %in% dict$column_1, dict[dict$column_1 == x,]$column_2, x))
  dat
}
fun(d, df)
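You can create a function containing just one line of code. Note that pmatch does partial matching and returns NA for names with no match, so columns missing from varName end up with NA names.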
renameDF <- function(df, varName){
setNames(df,varName[[2]][pmatch(names(df),varName[[1]])])
}

How to improve this function for converting columns to date types over a list of dataframes

I have a list of dataframes. I have a function currently - it is limited. It works by inputting the dataframes within this list that have a certain column name - and converts this column to date type.
Limitations:
You have to specify the dataframes that you know contain this column: if you include a dataframe that does not contain it, the function throws an error. It would be good if it could work over all dataframes in the list, even those that don't include the specified column.
Currently, it takes one column name, however a dataframe may contain other columns which could be converted to dates. It'd be good for the function to accept multiple column name inputs.
Here is my function currently:
repair_dates3 <- function(data, df_list, col_name) {
  lapply(df_list, function(x) {
    data[[x]][[col_name]] <<- as.Date(data[[x]][[col_name]], format = "%Y-%m-%d")
  })
  return(data)
}
You call it like this:
repair_dates3(data, c("dataframe1", "dataframe2", "dataframe3"), "Date")
Any ideas how I can improve this?
Many thanks
The following is simpler and more flexible. It allows a vector of data frames and a vector of column names to be defined and processed by the function.
Untested, since there is no example data set.
repair_dates3 <- function(x, col_names, format = "%Y-%m-%d") {
  x[col_names] <- lapply(x[col_names], as.Date, format = format)
  x
}
df_list <- c("dataframe1", "dataframe2", "dataframe3")
cols <- c("Date", "Date.2")
data[df_list] <- lapply(data[df_list], repair_dates3, cols)
Another solution, maybe closer to the question post, is to define a function that takes care of the outer lapply call. This will call an inner, private function f.
repair_dates4 <- function(x, which_dfs, col_names, format = "%Y-%m-%d") {
  # Inner helper: convert the columns d of a single data frame to Date
  f <- function(x, d, format = "%Y-%m-%d"){
    x[d] <- lapply(x[d], as.Date, format = format)
    x
  }
  x[which_dfs] <- lapply(x[which_dfs], f, col_names, format = format)
  x
}
data <- repair_dates4(data, df_list, cols)
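For the first limitation in the question (dataframes that lack the column), one possible sketch is to convert only the requested columns that are actually present, so the function can run over the whole list. The name repair_dates_safe and the toy list are hypothetical and not part of the answer above.

```r
# Hypothetical variant: convert only the requested columns that exist in df.
repair_dates_safe <- function(df, col_names, format = "%Y-%m-%d") {
  present <- intersect(col_names, names(df))
  for (nm in present) {
    df[[nm]] <- as.Date(df[[nm]], format = format)
  }
  df
}

# Toy list: the second data frame has no date column at all
data <- list(
  dataframe1 = data.frame(Date = c("2020-01-31", "2020-02-29"), x = 1:2,
                          stringsAsFactors = FALSE),
  dataframe2 = data.frame(y = 3:4)
)
data <- lapply(data, repair_dates_safe, col_names = c("Date", "Date.2"))
str(data)
```

Data frames without any of the named columns pass through unchanged instead of raising an error.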

Need to create data frame from list, but most terms turn into ones

I am trying to take a list of data frames that I had just created and unlist, so that I can create one giant data frame. However, somewhere in the for loop, my pvalues, lengths, and samples turn into factors, even though when I look at them individually, they are characters. I suppose them being factors is what makes the terms one when I try to unlist. How do I fix this so that the real terms are revealed, not ones?
Code:
for (i in 1:24) {
  l <- length(olap[[i]])
  for (j in 1:l) {
    sub_olap <- as.data.frame(mcols(subsetByOverlaps(all_list, olap[[i]][j])))
    chrdata[[j]] <- data.frame(median(sub_olap$log2Ratio),
                               paste0(sub_olap$pvalue, collapse = ","),
                               paste0(sub_olap$length, collapse = ","),
                               paste0(sub_olap$sample, collapse = ","))
    colnames(chrdata[[j]]) <- c("med_log2Ratio", "pvalue", "length", "sample")
  }
  mdata[[i]] <- chrdata
}
data.frame(matrix(unlist(mdata[[1]]), nrow = 81))
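A likely cause, offered as a sketch rather than a confirmed diagnosis: in R versions before 4.0, data.frame() converts strings to factors by default, and flattening one-level factors with unlist() yields their integer codes, which are all 1. Passing stringsAsFactors = FALSE and row-binding the pieces keeps the strings. The make_row helper and the toy values below are hypothetical stand-ins for the per-region data frames built in the loop.

```r
# Hypothetical stand-in for one iteration of the inner loop
make_row <- function(p, s) {
  data.frame(pvalue = paste0(p, collapse = ","),
             sample = paste0(s, collapse = ","),
             stringsAsFactors = FALSE)  # keep strings as character
}

chrdata <- list(make_row(c(0.01, 0.2), c("a", "b")),
                make_row(0.5, "c"))

# Row-bind the one-row data frames instead of unlist() + matrix()
combined <- do.call(rbind, chrdata)
str(combined)
```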

How to build subset query using a loop in R?

I'm trying to subset a big table across a number of columns, so all the rows where State_2009, State_2010, State_2011 etc. do not equal the value "Unknown."
My instinct was to do something like this (coming from a JS background), where I either build the query in a loop or continually subset the data in a loop, referencing the year as a variable.
mysubset <- data
for (i in 2009:2016) {
  mysubset <- subset(mysubset, paste("State_", i, " != Unknown", sep = ""))
}
But this doesn't work, at least because paste returns a string, giving me the error 'subset' must be logical.
Is there a better way to do this?
Using dplyr with the filter_ function should get you the correct output (note that filter_ is deprecated in recent dplyr versions):
library(dplyr)
mysubset <- data
for (i in 2009:2016) {
  mysubset <- mysubset %>%
    filter_(paste("State_", i, " != \"Unknown\"", sep = ""))
}
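To add to Matt's answer, you could also do it like this:
cols <- paste0("State_", 2009:2016)
# Row indices of every cell equal to "Unknown" in the State_* columns
inds <- which(mysubset[, cols] == "Unknown", arr.ind = TRUE)[, 1]
# Guard: negative indexing with an empty vector would drop every row
if (length(inds) > 0) mysubset <- mysubset[-unique(inds), ]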
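Another option in base R, sketched here with a hypothetical toy table, is to build one logical vector per State_* column and combine them with Reduce, which avoids pasting code into strings. The sketch assumes the State_* columns contain no missing values.

```r
# Hypothetical toy table with two of the State_* columns
data <- data.frame(id = 1:4,
                   State_2009 = c("NY", "Unknown", "CA", "TX"),
                   State_2010 = c("NY", "NJ", "Unknown", "TX"),
                   stringsAsFactors = FALSE)

cols <- paste0("State_", 2009:2010)

# One logical vector per column, combined element-wise with AND
keep <- Reduce(`&`, lapply(cols, function(col) data[[col]] != "Unknown"))
mysubset <- data[keep, ]
mysubset
```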

Subset variables in data frame based on column type

I need to subset data frame based on column type - for example from data frame with 100 columns I need to keep only those column with type factor or integer. I've written a short function to do this, but is there any simpler solution or some built-in function or package on CRAN?
My current solution to get variable names with requested types:
varlist <- function(df=NULL, vartypes=NULL) {
  type_function <- c("is.factor","is.integer","is.numeric","is.character","is.double","is.logical")
  names(type_function) <- c("factor","integer","numeric","character","double","logical")
  names(df)[as.logical(sapply(lapply(names(df), function(y) sapply(type_function[names(type_function) %in% vartypes], function(x) do.call(x,list(df[[y]])))),sum))]
}
The function varlist works as follows:
For every requested type and for every column in data frame call "is.TYPE" function
Sum tests for every variable (boolean is cast to integer automatically)
Cast result to logical vector
subset names in data frame
And some data to test it:
df <- read.table(file="http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data", sep=" ", header=FALSE, stringsAsFactors=TRUE)
names(df) <- c('ca_status','duration','credit_history','purpose','credit_amount','savings', 'present_employment_since','installment_rate_income','status_sex','other_debtors','present_residence_since','property','age','other_installment','housing','existing_credits', 'job','liable_maintenance_people','telephone','foreign_worker','gb')
df$gb <- ifelse(df$gb == 2, FALSE, TRUE)
df$property <- as.character(df$property)
varlist(df, c("integer","logical"))
I'm asking because my code looks really cryptic and hard to understand (even for me, and I finished the function 10 minutes ago).
Just do the following:
df[,sapply(df,is.factor) | sapply(df,is.integer)]
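subset_colclasses <- function(DF, colclasses="numeric") {
  # any() copes with multi-class columns such as ordered factors
  keep <- vapply(DF, function(vec) any(class(vec) %in% colclasses), logical(1))
  # drop = FALSE keeps a data frame even when only one column matches
  DF[, keep, drop = FALSE]
}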
str(subset_colclasses(df, c("factor", "integer")))
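As a further base R possibility, Filter() applies a predicate to each column and keeps the columns for which it returns TRUE. The toy data frame below is hypothetical; this is a sketch, not a replacement for the answers above.

```r
# Hypothetical toy data frame with four column types
df <- data.frame(f = factor(c("a", "b")), i = 1:2, d = c(1.5, 2.5),
                 s = c("x", "y"), stringsAsFactors = FALSE)

# Keep only factor or integer columns; the result is still a data frame
kept <- Filter(function(col) is.factor(col) || is.integer(col), df)
names(kept)
```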

Resources