Don't understand how apply() gets its parameters in R

I am struggling to make my apply() work: I have two dataframes:
from <- c(1,2,3)
to <- c(2,3,4)
df1 <- data.frame(from, to)
long <-c(9,9.2,9.4,9.6)
lat <- c(45,45.2,45.4,45.6)
id <- c(1,2,3,4)
df2 <- data.frame(long, lat, id)
Now I want something like this:
myFunction <- function(arg){
>>> How do I access arg$from and arg$to? <<<<
}
apply(df1,1,myFunction)
In myFunction I need to make some calculations and return a value for each from-to pair. I don't understand how to access parts of the arg, since arg[0] gives me numeric(0) and arg$from just crashes.

The problem is that apply(...) expects a matrix or array as its first argument; if you pass a data frame, it is coerced to a matrix. R is 1-indexed, so the upper-left element is [1,1], not [0,0]. Also, matrix columns cannot be referenced with the $ notation.
So,
f <- function(x) {
from <- x[1]
to <- x[2]
# do stuff with from and to...
}
apply(df1, 1, f)
would work.
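Since apply() keeps the column names on each row it passes to your function, you can also index by name rather than by position. A small sketch (the subtraction is just a placeholder for whatever you actually calculate):
f <- function(x) {
  from <- x["from"]  # named access works: the row vector keeps the column names
  to <- x["to"]
  unname(to - from)  # placeholder calculation
}
apply(df1, 1, f)     # one value per row: 1 1 1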
One other thing to watch out for is that if your dataframe has (other) columns that have character strings, the conversion will make everything character (including the numbers!). This is because, by definition, all elements of a matrix must have the same data type. Your example does not have that problem, though.
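A minimal illustration of that caveat, using a made-up frame with one character column:
df_chr <- data.frame(from = 1:3, to = 2:4, label = c("a", "b", "c"))
class(as.matrix(df_chr)[, "from"])
# [1] "character"   # the numeric columns were coerced along with the character one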

Try mapply(). It's a multivariate version of sapply(). For example:
> myFunction <- function(arg1, arg2){
+ return(sum(arg1, arg2))
+ }
>
> mapply(myFunction, df1$from, df1$to)
[1] 3 5 7
You can also use it to make a new variable in your data frame.
> df1$newvar <- mapply(myFunction, df1$from, df1$to)
> df1
  from to newvar
1    1  2      3
2    2  3      5
3    3  4      7
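If the result for each pair is not a single number, the closely related Map() (which is just mapply() with SIMPLIFY = FALSE) returns a list instead; a quick sketch with the same toy function:
> unlist(Map(myFunction, df1$from, df1$to))
[1] 3 5 7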

Related

Is there a way to check if a data frame is empty and, if so, to add an NA row?

For example, I have a data frame that has nothing inside, but I need it to run through the full code because the code usually expects there to be data. I tried this but it did not work:
ifelse(dim(df_empty)[1]==0,rbind(Shots1B_empty,NA))
Maybe something like this:
df_empty <- data.frame(x=integer(0), y = numeric(0), a = character(0))
if(nrow(df_empty) == 0){
df_empty <- rbind(df_empty, data.frame(x=NA, y=NA, a=NA))
}
df_empty
# x y a
#1 NA NA NA
Simple question, OP, but actually pretty interesting. All the elements of your code should work, but the issue is that when you run as is, it will return a list, not a data frame. Let me show you with an example:
growing_df <- data.frame(
A=rep(1, 3),
B=1:3,
c=LETTERS[4:6])
df_empty <- data.frame()
If we evaluate as you have written you get:
df <- ifelse(dim(df_empty)[1]==0, rbind(growing_df, NA))
with df resulting in a List:
> class(df)
[1] "list"
> df
[[1]]
[1] 1 1 1 NA
The code "worked", but the resulting class of df is wrong. It's odd because this works:
> rbind(growing_df, NA)
A B c
1 1 1 D
2 1 2 E
3 1 3 F
4 NA NA <NA>
The answer is to use if and else, rather than ifelse(), just as @akrun noted in their answer. The reason is found if you dig into the documentation of ifelse():
ifelse returns a value with the same shape as test which is filled
with elements selected from either yes or no depending on whether the
element of test is TRUE or FALSE.
Since the test dim(df_empty)[1] == 0 (equivalently nrow(df_empty) == 0) has length one, ifelse() returns a length-one result: the data frame produced by rbind() is, underneath, a list of columns, so recycling it to the shape of the test leaves a one-element list holding just the first column. In other words, the shape and class of what ifelse() returns are decided by the test, not by the value you are trying to return. Compare that with an if {} statement, whose result is simply the value of whatever expression sits inside the braces.
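A tiny illustration of that "same shape as test" rule, with made-up values:
ifelse(TRUE, c(10, 20, 30), 0)
# [1] 10   # the test has length one, so only the first element of the yes value comes back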
We may need if/else instead of ifelse() - ifelse() requires all its arguments to be of the same length, which obviously will not be the case when we rbind:
Shots1B_empty <- if(nrow(df_empty) == 0) rbind(Shots1B_empty, NA)
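Note that an if without an else returns NULL when the condition is FALSE, so if the frame might already contain rows you probably want an explicit else that keeps the original object; a small sketch reusing the names from the question:
Shots1B_empty <- if (nrow(df_empty) == 0) rbind(Shots1B_empty, NA) else Shots1B_empty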

Conditionally add named elements to a list

I have a function to perform actions on a variable list of dataframes depending on user selections. The function mostly performs generic actions but there are a few actions that are dataframe specific.
My code runs fine if all dataframes are selected but I am unable to get it to work if not all dataframes are selected.
The following provides a minimal reproducible example:
# User switches.
df1Switch <- TRUE
df2Switch <- TRUE
df3Switch <- TRUE
# DF creation.
set.seed(1)
df <- data.frame(X=sample(1:10), Y=sample(11:20))
if (df1Switch) df1 <- df
if (df2Switch) df2 <- df
if (df3Switch) df3 <- df
# Function to do something.
fn_something <- function(file_list, file_names) {
df <- file_list
# Do lots of generic things.
df$Z <- df$X + df$Y
# Do a few specific things.
if (file_names == "Name1") df$X <- df$X + 1
else if (file_names == "Name2") df$X <- df$Z - 1
else if (file_names == "Name3") df$Y <- df$X + df$Y
return(df)
}
# Call function to do something.
file_list <- list(Name1=df1, Name2=df2, Name3=df3)
file_names <- names(file_list)
all_df <- do.call(rbind,mapply(fn_something, file_list, file_names,
SIMPLIFY=FALSE))
In this case the code runs fine as the user has selected to create all three dataframes. I use a named list so that the specific actions can be performed against the correct dataframes.
The output looks something like this (the actual numbers aren't important):
X Y Z
Name1.1 4 13 16
Name1.2 5 12 16
Name1.3 6 16 21
: : : :
Name2.1 15 13 16
: : : :
The problem arises if the user selects not to create some dataframes, e.g.:
# User switches.
df1Switch <- TRUE
df2Switch <- FALSE
df3Switch <- TRUE
Not surprisingly, in this case an object not found error results:
> # Call function to do something.
> file_list <- list(Name1=df1, Name2=df2, Name3=df3)
Error: object 'df2' not found
What I would like to do is conditionally specify the contents of file_list along the lines of this pseudo code:
file_list <- list(if (df1Switch) {Name1=df1}, if (df2Switch) {Name2=df2}, if (df3Switch) {Name3=df3})
I have come across list.foldLeft in "Conditionally merge list elements", but I don't know if it is suitable.
(I'll re-hash my comment:)
In general, I would encourage you to consider use of a list-of-dataframes instead of individual frames. My rationale for this:
assuming that each frame is structured (nearly) identically; and
assuming that what you do to one frame you will (or at least can) do to all frames; then
it is easier to list_of_frames <- lapply(list_of_frames, some_func) than it is to do something like:
for (nm in c("df1", "df2", "df3")) {
d <- get(nm)
d <- some_func(d)
assign(nm, d)
}
especially when dealing with non-global environments (i.e., doing this within a function).
To be clear, "easier" is subjective: though it does win code-golf, I find it much easier to read and understand that "I am running some_func on each element of list_of_frames and saving the result". (You can even save it to a new list-of-frames, thereby keeping the original frames untouched.)
You may also do things conditionally, as in
needs_work <- sapply(list_of_frames, some_checker_func) # returns logical
# or
needs_work <- c("df1", "df2") # names of elements of list_of_frames
list_of_frames[needs_work] <- lapply(list_of_frames[needs_work], some_func)
Having said that ... the direct answer to your one liner:
c(if (df1Switch) list(Name1=df1), if (df2Switch) list(Name2=df2), if (df3Switch) list(Name3=df3))
This capitalizes on the fact that unstated else results in a NULL, and the NULL-compressing (dropping) characteristic of c(). You can see it in action with:
c(if (T) list(a=1), if (T) list(b=2), if (T) list(d=4))
# $a
# [1] 1
# $b
# [1] 2
# $d
# [1] 4
c(if (T) list(a=1), if (FALSE) list(b=2), if (T) list(d=4))
# $a
# [1] 1
# $d
# [1] 4
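Applied back to the question, the same pattern builds file_list only from the frames that actually exist, and the rest of the pipeline can stay as it is (a sketch reusing the switches and names from the question):
file_list <- c(if (df1Switch) list(Name1 = df1),
               if (df2Switch) list(Name2 = df2),
               if (df3Switch) list(Name3 = df3))
all_df <- do.call(rbind, mapply(fn_something, file_list, names(file_list), SIMPLIFY = FALSE))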

Looping through rows in an R data frame?

I'm working with multiple big data frames in R and I'm trying to write functions that can modify each of them (given a set of common parameters). One function is giving me trouble (shown below).
RawData <- function(x)
{
for(i in 1:nrow(x))
{
if(grep(".DERIVED", x[i,]) >= 1)
{
x <- x[-i,]
}
}
for(i in 1:ncol(x))
{
if(is.numeric(x[,i]) != TRUE)
{
x <- x[,-i]
}
}
return(x)
}
The objective of this function is twofold: first, to remove any rows that contain a ".DERIVED" string in any one of their cells (using grep), and second, to remove any columns that are non-numeric (using is.numeric). I get an error on the following condition:
if(grep(".DERIVED", x[i,]) >= 1)
The error states the "argument is of zero length", which I believe is usually associated with NULL values in a vector. However, I've used is.null on the entire data frame that is giving me errors, and it confirmed that there are no null values in the DF. I'm sure I'm missing something relatively simple here. Any advice would be greatly appreciated.
If you can use non-base-R functions, this should address your issue. df is the data.frame in question here. It will also be faster than looping over rows (generally not advised if avoidable).
library(dplyr)
library(stringr)
df %>%
  filter_all(all_vars(!str_detect(., '\\.DERIVED'))) %>%
  select_if(is.numeric)
You can make it a function just as you would anything else:
mattsFunction <- function(dat){
  dat %>%
    filter_all(all_vars(!str_detect(., '\\.DERIVED'))) %>%
    select_if(is.numeric)
}
you should probably give it a better name though
The error is from the line
if(grep(".DERIVED", x[i,]) >= 1)
When grep() doesn't find the term ".DERIVED", it returns integer(0), something of zero length. Your inequality then evaluates to logical(0) rather than TRUE or FALSE, and the error is telling you that the if statement cannot work with a zero-length condition.
A simple example:
if(grep(".DERIVED", "1234.DERIVEDabcdefg") >= 1) {print("it works")} # Works nicely, since the inequality can be evaluated
if(grep(".DERIVED", "1234abcdefg") > 1) {print("no dice")}
You can replace that line with if(length(grep(".DERIVED", x[i,])) != 0)
There's something else you haven't noticed yet, which is that you're removing rows/columns in a loop. Say you remove the 5th column: on the next loop iteration (when i = 6) you will be handling what was originally the 7th column, and this eventually ends in an error along the lines of Error in `[.data.frame`(x, , i) : undefined columns selected.
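If you want to keep a row loop anyway, one way around that is to collect a logical index first and subset once at the end; a rough sketch of that repair to the original loop:
keep <- rep(TRUE, nrow(x))
for (i in seq_len(nrow(x))) {
  if (length(grep("\\.DERIVED", x[i, ])) != 0) keep[i] <- FALSE # mark the row instead of deleting it
}
x <- x[keep, , drop = FALSE] # drop all marked rows in one go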
I prefer using dplyr, but if you need to use base R functions there are ways to do this without if statements.
Notice that you should consider using the regex version of "\\.DERIVED" and not ".DERIVED" which would mean "any character followed by DERIVED".
I don't have example data or output, so here's my best go...
# Made up data
test <- data.frame(a = c("data","data.DERIVED","data","data","data.DERIVED"),
b = (c(1,2,3,4,5)),
c = c("A","B","C","D","E"),
d = c(2,5,6,8,9),
stringsAsFactors = FALSE)
# Note: Like the example code in the question, the following assumes the numeric
# columns really are stored as numeric. It will not detect columns whose values
# are numbers stored as character strings.
# Using the base subset command
test2 <- subset(test,
subset = !grepl("\\.DERIVED",test$a),
select = sapply(test,is.numeric))
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8
# Trying to use []. Note: If only 1 column is numeric this will return a vector
# instead of a data.frame
test2 <- test[!grepl("\\.DERIVED",test$a),]
test2 <- test2[,sapply(test,is.numeric)]
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8
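An equivalent one-pass version of the same base-R idea, with apply() building the row filter and drop = FALSE guarding against the single-numeric-column case:
keep <- !apply(test, 1, function(r) any(grepl("\\.DERIVED", r)))
test2 <- test[keep, sapply(test, is.numeric), drop = FALSE]
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8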

Reference vector from data frame using custom function

I'm trying to call a vector "a" from a data frame "df" using a function. I know I could do this just fine with the following:
> df$a
[1] 1 2 3
But I'd like to use a function where both the data frame and vector names are input separately as arguments. This is the best that I've come up with:
show_vector <- function(data.set, column) {
data.set$column
}
But here's how it goes when I try it out:
> show_vector(df, a)
NULL
How could I change this function in order to successfully reference vector df$a where the names of both are input to a function as arguments?
It's actually possible to do this without passing the column name as a string (in other words, you can pass in the unquoted column name):
show_vector <- function(data.set, column) {
eval(substitute(column), envir = data.set)
}
Usage example:
df <- data.frame(a = 1:3, b = 4:6)
show_vector(df, b)
# [1] 4 5 6
I've wondered about this kind of thing a lot in the past and haven't found an easy fix. The best I've come up with is this:
df <- data.frame(c(1, 2, 3), c(4, 5, 6))
colnames(df) <- c("A", "B")
test <- function(dataframe, columnName) {
return(dataframe[, match(columnName, colnames(dataframe))])
}
test(df, "A")
Your function would work if you changed data.set$column to data.set[[column]] and passed the column name in quotes, i.e. show_vector(df, "a"); the $ form always looks for a column literally named "column". There are several other ways to do this:
Using base functionality
func <- function(df, cname){
return(df[, grep(cname, colnames(df))])
}
Or even
func <- function(df, cname){
return(df[, cname])
}
You can use substitute() to capture the name of the input column as it is, and then as.character() to turn it into a character string.
show_vector <- function(data.set, column) {
data.set[,as.character(substitute(column))]
}
Now let's take a look:
(dat=data.frame(a=1:3,b=4:6,c=10:12))
a b c
1 1 4 10
2 2 5 11
3 3 6 12
show_vector(dat,a)
[1] 1 2 3
show_vector(dat,"a")
[1] 1 2 3
It works.
We can also write a simpler one where we just input a character string:
show_vector1 <- function(data.set, column) {
data.set[,column]
}
show_vector1(dat,"a")
[1] 1 2 3
Although this will not work if the column name is not a character:
show_vector1(dat,a)
Error in `[.data.frame`(data.set, , column) : undefined columns selected
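For completeness, [[ does the same job when you have the column name as a character string:
show_vector2 <- function(data.set, column) {
  data.set[[column]] # [[ extracts a single column by name
}
show_vector2(dat, "a")
[1] 1 2 3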

Retain Vector Names as Dataframe Column Names

In my code, I am filling the columns of a data frame with vectors, like so:
df1[columnNum] <- barWidth
This works fine, except for one thing: I want the name of the vector variable (barWidth above) to be retained as the column header, one column at a time. Furthermore, I do not wish to use cbind, since it slows the execution of my code down considerably; I am therefore using a pre-allocated data frame.
Can this be done in the vector-to-column assignment? If not, how do I change it after the fact? I can't find the right syntax to do this with colnames().
TIA
It's being done by the [<-.data.frame function. It could conceivably be replaced by one that looked at the name of the argument but it's such a fundamental function I would be hesitant. Furthermore there appears to be an aversion to that practice signaled by this code at the top of the function definition:
> `[<-.data.frame`
function (x, i, j, value)
{
if (!all(names(sys.call()) %in% c("", "value")))
warning("named arguments are discouraged")
nA <- nargs()
if (nA == 4L) {
<snipped rest of rather long definition>
I don't know why that is there, but it is. Maybe you should either be thinking about using names<- after the column assignment, or using this method:
> dfrm["barWidth"] <- barWidth
> dfrm
a V2 barWidth
1 a 1 1
2 b 2 2
3 c 3 3
4 d 4 4
This can be generalized to a list of new columns:
dfrm <- data.frame(a=letters[1:4])
barWidth <- 1:4
newcols <- list(barWidth=barWidth, bw2 =barWidth)
dfrm[names(newcols)] <- newcols
dfrm
#   a barWidth bw2
# 1 a        1   1
# 2 b        2   2
# 3 c        3   3
# 4 d        4   4
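The names<- route mentioned above would look something like this (a sketch, with columnNum standing in for whatever column index you are filling):
df1[columnNum] <- barWidth
names(df1)[columnNum] <- "barWidth" # set the header by hand after the assignment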
If you have a list of the names of the vectors you want to assign, you could do:
namevec <- c(...,"barWidth"...,)
columnNums <- c(...,10,...)
df1[columnNums[i]] <- get(namevec[i])
names(df1)[columnNums[i]] <- namevec[i]
or even
columnNums <- c(barWidth=4,...)
for (i in seq_along(columnNums)) {
df1[columnNums[i]] <- get(names(columnNums)[i])
}
names(df1)[columnNums] <- names(columnNums)
but the deeper question would be where this set of vectors is coming from in the first place: could you have them in a list all along?
I'd simply use cbind():
df1 <- cbind( df1, barWidth )
which retains the name. It will, however, end up as the last column in df1
