Using for() over variables that need to be changed - r

I'd like to be able use for() loop to automate the same operation that runs over many variables modifying them.
Here's simplest example to could design:
varToChange = list( 1:10, iris$Species[1:10], letters[1:10]) # assume that it has many more than just 3 elements
varToChange
for (i in varToChange ) {
if (is.character(y)) i <- as.integer(as.ordered(i))
if (is.factor(y)) i <- as.integer(i)
}
varToChange # <-- Here I want to see my elements as integers now
Here's actual example that led me to this question - taken from: Best way to plot automatically all data.table columns using ggplot2
In the following function
f <- function(dt, x,y,k) {
if (is.numeric(x)) x <- names(dt)[x]
if (is.numeric(y)) y <- names(dt)[y]
if (is.numeric(k)) k <- names(dt)[k]
ggplot(dt, aes_string(x,y, col=k)) + geom_jitter(alpha=0.1)
}
f(diamonds, 1,7,2)
instead of brutally repeating the same line many times, as a programmer, I would rather have a loop to repeat this line for me.
Something like this one:
for (i in c(x,y,k)) {
if (is.numeric(i)) i <- names(dt)[i]
}
In C/C++ this would have been done using pointers. In R - is it all possible?
UPDATE: Very nice idea to use Map below. However it does not work for this example
getColName <- function(dt, x) {
if (is.numeric(x)) {
x <- names(dt)[x]
}
x
}
f<- function(dt, x,y,k) {
list(x,y,k) <- Map(getColName, list(x,y,k), dt)
# if (is.numeric(x)) x <- names(dt)[x]
# if (is.numeric(y)) y <- names(dt)[y]
# if (is.numeric(k)) k <- names(dt)[k]
ggplot(dt, aes_string(x,y, col=k)) + geom_jitter(alpha=0.1)
}
f(diamonds, 1,7,2) # Brrr..

No need for for loop, just Map a function over each of your list items
varToChange = list( 1:10, iris$Species[1:10], letters[1:10])
myfun <- function(y) {
if (is.character(y)) y <- as.integer(as.ordered(y))
if (is.factor(y)) y <- as.integer(y)
y
}
varToChange <- Map(myfun, varToChange)
UPDATE: Map never modifies variables in place, This is simply not done in R. Use the new values returned by Map
f<- function(dt, x, y, k) {
args <- Map(function(x) getColName(dt, x), list(x=x,y=y,k=k))
ggplot(dt, aes_string(args$x,args$y, col=args$k)) + geom_jitter(alpha=0.1)
}
f(diamonds, 1,7,2)

You have two choices for iteration in R, iterate over variables themselves, or over their indices. I generally recommend iterating over indices. This case illustrates a strong advantage of that because your question is a non-issue if you are using indices.
varToChange = list( 1:10, iris$Species[1:10], letters[1:10])
for (i in seq_along(varToChange)) {
if (is.character(varToChange[[i]])) varToChange[[i]] <- as.integer(as.factor(varToChange[[i]]))
if (is.factor(varToChange[[i]])) varToChange[[i]] <- as.integer(varToChange[[i]])
}
I also replaced as.ordered() with as.factor() - the only difference between an ordered factor and a regular factor are the default contrasts used in modeling. As you are just coercing to integer, it doesn't matter.

Related

How to write a function with an unspecified number of arguments where the arguments are column names

I am trying to write a function with an unspecified number of arguments using ... but I am running into issues where those arguments are column names. As a simple example, if I want a function that takes a data frame and uses within() to make a new column that is several other columns pasted together, I would intuitively write it as
example.fun <- function(input,...){
res <- within(input,pasted <- paste(...))
res}
where input is a data frame and ... specifies column names. This gives an error saying that the column names cannot be found (they are treated as objects). e.g.
df <- data.frame(x = c(1,2),y=c("a","b"))
example.fun(df,x,y)
This returns "Error in paste(...) : object 'x' not found "
I can use attach() and detach() within the function as a work around,
example.fun2 <- function(input,...){
attach(input)
res <- within(input,pasted <- paste(...))
detach(input)
res}
This works, but it's clunky and runs into issues if there happens to be an object in the global environment that is called the same thing as a column name, so it's not my preference.
What is the correct way to do this?
Thanks
1) Wrap the code in eval(substitute(...code...)) like this:
example.fun <- function(data, ...) {
eval(substitute(within(data, pasted <- paste(...))))
}
# test
df <- data.frame(x = c(1, 2), y = c("a", "b"))
example.fun(df, x, y)
## x y pasted
## 1 1 a 1 a
## 2 2 b 2 b
1a) A variation of that would be:
example.fun.2 <- function(data, ...) {
data.frame(data, pasted = eval(substitute(paste(...)), data))
}
example.fun.2(df, x, y)
2) Another possibility is to convert each argument to a character string and then use indexing.
example.fun.3 <- function(data, ...) {
vnames <- sapply(substitute(list(...))[-1], deparse)
data.frame(data, pasted = do.call("paste", data[vnames]))
}
example.fun.3(df, x, y)
3) Other possibilities are to change the design of the function and pass the variable names as a formula or character vector.
example.fun.4 <- function(data, formula) {
data.frame(data, pasted = do.call("paste", get_all_vars(formula, data)))
}
example.fun.4(df, ~ x + y)
example.fun.5 <- function(data, vnames) {
data.frame(data, pasted = do.call("paste", data[vnames]))
}
example.fun.5(df, c("x", "y"))

Applying a Function to a Data Frame : lapply vs traditional way

I have this data frame in R:
x <- seq(1, 10,0.1)
y <- seq(1, 10,0.1)
data_frame <- expand.grid(x,y)
I also have this function:
some_function <- function(x,y) { return(x+y) }
Basically, I want to create a new column in the data frame based on "some_function". I thought I could do this with the "lapply" function in R:
data_frame$new_column <-lapply(c(data_frame$x, data_frame$y),some_function)
This does not work:
Error in `$<-.data.frame`(`*tmp*`, f, value = list()) :
replacement has 0 rows, data has 8281
I know how to do this in a more "clunky and traditional" way:
data_frame$new_column = x + y
But I would like to know how to do this using "lapply" - in the future, I will have much more complicated and longer functions that will be a pain to write out like I did above. Can someone show me how to do this using "lapply"?
Thank you!
When working within a data.frame you could use apply instead of lapply:
x <- seq(1, 10,0.1)
y <- seq(1, 10,0.1)
data_frame <- expand.grid(x,y)
head(data_frame)
some_function <- function(x,y) { return(x+y) }
data_frame$new_column <- apply(data_frame, 1, \(x) some_function(x["Var1"], x["Var2"]))
head(data_frame)
To apply a function to rows set MAR = 1, to apply a function to columns set MAR = 2.
lapply, as the name suggests, is a list-apply. As a data.frame is a list of columns you can use it to compute over columns but within rectangular data, apply is often the easiest.
If some_function is written for that specific purpose, it can be written to accept a single row of the data.frame as in
x <- seq(1, 10,0.1)
y <- seq(1, 10,0.1)
data_frame <- expand.grid(x,y)
head(data_frame)
some_function <- function(row) { return(row[1]+row[2]) }
data_frame$yet_another <- apply(data_frame, 1, some_function)
head(data_frame)
Final comment: Often functions written for only a pair of values come out as perfectly vectorized. Probably the best way to call some_function is without any function of the apply-familiy as in
some_function <- function(x,y) { return(x + y) }
data_frame$last_one <- some_function(data_frame$Var1, data_frame$Var2)

Nested apply with multiple parameters

I would like to use the apply family instead of a for loop.
My for loop is nested and contains several vectors and a list, for which I am unsure how to input as parameters with apply.
Codes <- c("A","B","C")
Samples <- c("A","A","B","B","B","C")
Samples_Names <- c("A1","A2","B1","B2","B3","C1")
Samples_folder <- c("Alpha","Alpha","Beta","Beta","Beta","Charlie")
Df <- list(data.frame(T1 = c(1,2,3)), data.frame(T1 = c(1,2,3)), data.frame(T1 = c(1,2,3)))
for (i in 1:length(Codes)){
for (j in 1:length(Samples)) {
if(Codes[i] == Samples[j]) {
write_csv(Df[[i]], path = paste0(Working_Directory,Samples_folder[j],"/",Samples_Names[j],".csv"))
}
}
}
This will give an output of A1,A2 in Alpha, B1,B2,B3 in Beta, and C1 in charlie.
Since you are looking to just use write_csv, we can use pwalk from purrr to accomplish this over the three equal size vectors. No need to include the loop on Codes, as for each iteration in the apply we can write_csv the dataset corresponding to where Samples is found in Codes.
I shortened Working_Directory to WD.
library(purrr)
pwalk(list(Samples, Samples_folder, Samples_Names),
function(x, y, z) write_csv(Df[[match(x, Codes)]], path = paste0(WD, y, "/", z, ".csv")))

R loop to create data frames with 2 counters

What I want is to create 60 data frames with 500 rows in each. I tried the below code and, while I get no errors, I am not getting the data frames. However, when I do a View on the as.data.frame, I get the view, but no data frame in my environment. I've been trying for three days with various versions of this code:
getDS <- function(x){
for(i in 1:3){
for(j in 1:30000){
ID_i <- data.table(x$ID[j: (j+500)])
}
}
as.data.frame(ID_i)
}
getDS(DATASETNAME)
We can use outer (on a small example)
out1 <- c(outer(1:3, 1:3, Vectorize(function(i, j) list(x$ID[j:(j + 5)]))))
lapply(out1, as.data.table)
--
The issue in the OP's function is that inside the loop, the ID_i gets updated each time i.e. it is not stored. Inorder to do that we can initialize a list and then store it
getDS <- function(x) {
ID_i <- vector('list', 3)
for(i in 1:3) {
for(j in 1:3) {
ID_i[[i]][[j]] <- data.table(x$ID[j:(j + 5)])
}
}
ID_i
}
do.call(c, getDS(x))
data
x <- data.table(ID = 1:50)
I'm not sure the description matches the code, so I'm a little unsure what the desired result is. That said, it is usually not helpful to split a data.table because the built-in by-processing makes it unnecessary. If for some reason you do want to split into a list of data.tables you might consider something along the lines of
getDS <- function(x, n=5, size = nrow(x)/n, column = "ID", reps = 3) {
x <- x[1:(n*size), ..column]
index <- rep(1:n, each = size)
replicate(reps, split(x, index),
simplify = FALSE)
}
getDS(data.table(ID = 1:20), n = 5)

Function within function not activating as expected

I have a function that I use to get a "quick look" at a data.frame... I deal with a lot of survey data and this acts as a quick tool to see what's what.
f.table <- function(x) {
if (is.factor(x[[1]])) {
frequency <- function(x) {
x <- round(length(x)/n, digits=2)
}
x <- na.omit(melt(x,c()))
x <- cast(x, variable ~ value, frequency)
x <- cbind(x,top2=x[,ncol(x)]+x[,ncol(x)-1], bottom=x[,2])
}
if (is.numeric(x[[1]])) {
frequency <- function(x) {
x[x > 1] <- 1
x[is.na(x)] <- 0
x <- round(sum(x)/n, digits=2)
}
x <- na.omit(melt(x))
x <- cast(x, variable ~ ., c(frequency, mean, sd, min, max))
x <- transform(x, variable=reorder(variable, frequency))
}
return(x)
}
What I find happens is that if I don't define "frequency" outside of the function, it returns wonky results for data frames with continuous variables. It doesn't seem to matter which definition I use outside of the function, so long as I do.
try:
n <- 100
x <- data.frame(a=c(1:25),b=rnorm(100),c=rnorm(100))
x[x > 20] <- NA
Now, select either one of the frequency functions and paste them in and try it again:
frequency <- function(x) {
x <- round(length(x)/n, digits=2)
}
f.table(x)
Why is that?
Crucially, I think this is where your problem is. cast() is evaluating those functions without reference to the function it was called from. Inside cast() it evaluates fun.aggregate via funstofun and, although I don't really follow what it is doing, is getting stats:::frequency and not your local one.
Hence my comment to your Q. What do you wan the function to do? At the moment it would seem necessary to define a "frequency" function in the global environment so that cast() or funstofun() finds it. Give it a unique name so it is unlikely to clash with anything so it should be the only thing found, say .Frequency(). Without knowing what you want to do with the function (rather than what you thought the function [f.table] should do) it is a bit difficult to provide further guidance, but why not have .FrequencyNum() and .FrequencyFac() defined in the global workspace and rewrite your f.table() wrapper calls to cast to use the relevant one?
.FrequencyFac <- function(X, N) {
round(length(X)/N, digits=2)
}
.FrequencyNum <- function(X, N) {
X[X > 1] <- 1
X[is.na(X)] <- 0
round(sum(X)/N, digits=2)
}
f.table <- function(x, N) {
if (is.factor(x[[1]])) {
x <- na.omit(melt(x, c()))
x <- dcast(x, variable ~ value, .FrequencyFac, N = N)
x <- cbind(x,top2=x[,ncol(x)]+x[,ncol(x)-1], bottom=x[,2])
}
if (is.numeric(x[[1]])) {
x <- na.omit(melt(x))
x <- cast(x, variable ~ ., c(.FrequencyNum, mean, sd, min, max), N = N)
##x <- transform(x, variable=reorder(variable, frequency))
## left this out as I wanted to see what cast returned
}
return(x)
}
Which I thought would work, but it is not finding N, and it should be. So perhaps I am missing something here?
By the way, it is probably not a good idea to rely on function that find n (in your version) from outside the function. Always pass in the variables you need as arguments.
I don't have the package that contains melt, but there are a couple potential issues I can see:
Your frequency functions do not return anything.
It's generally bad practice to alter function inputs (x is the input and the output).
There is already a generic frequency function in stats package in base R, which may cause issues with method dispatch (I'm not sure).

Resources