I am trying to understand an example from a textbook.
The example code is like this:
x <- cbind(x1,x2,x3)
z <- NULL
y <- rbind(z,x)
My question is: why rbind() to NULL when the output seems to be the same as just x?
This example may be to illustrate that rbind()ing an object to NULL just returns the non-NULL object. This is something you can make use of if you have code that may or may not return a data.frame/matrix/vector (and otherwise returns NULL). The case you show is fairly trivial, but consider this example:
results <- lapply(1:4, function(i) {
  if (i %% 2 == 0) {
    return(data.frame(a = i, b = i / 2))
  } else {
    return(NULL)
  }
})
# a list of mixed results (some may be NULL)
results
#> [[1]]
#> NULL
#>
#> [[2]]
#> a b
#> 1 2 1
#>
#> [[3]]
#> NULL
#>
#> [[4]]
#> a b
#> 1 4 2
# get a data.frame of just the non-NULL rows
do.call('rbind', results)
#> a b
#> 1 2 1
#> 2 4 2
Here, we are going to iteratively apply rbind() to each element in the list resulting from a call to lapply().
The function we apply is arbitrary, but it has some internal logic that in some cases produces a data.frame and in others returns NULL. Consider, for example, an API request that may or may not return data, or that may fail if you are not connected to the internet.
Since rbind() of an object with NULL just returns the original object, there is no additional handling or need to return dummy values for the cases that do not meet the condition.
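As a quick check of that behaviour (using a small made-up matrix in place of your x, since x1, x2 and x3 aren't shown):
x <- cbind(x1 = 1:3, x2 = 4:6, x3 = 7:9)
identical(rbind(NULL, x), x)
#> [1] TRUE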
I am changing the values of a column of a data frame. Then I save the file, supposedly with the changes, but they are not there. What am I missing? Thanks.
test <- data.frame(name_s = c("x","y","z"), number_s = c(1,2,3))
lapply(1:length(test$number_s), function(x) {
  test$number_s[x] <- test$number_s[[x]] + 1
})
write.csv(test, paste0("test ", format(Sys.time(), "%Y%m%d"), ".csv"),
          row.names = F)
That was oversimplified; the real case is this one:
library(stringr)

date_format_1 = "[0-9]-[:alpha:][:alpha:][:alpha:]"
date_format_2 = "[:alpha:][:alpha:][:alpha:]-[0-9][0-9]"
test <- data.frame(name_s = c("v","w","x","y","z"),
                   event_text = c("Aug-89","7-May","9-Jun","4-Dec-2021","Feb-99"))
lapply(1:length(test$event_text), function(x) {
  if (str_detect(test$event_text[[x]], paste0("\\b", date_format_1, "\\b")) == T) {
    test$event_text[x] <- paste0(str_sub(test$event_text[[x]], 1, 1), "/F",
                                 which(month.abb %in% str_sub(test$event_text[[x]], 3, 5)))
  } else if (str_detect(test$event_text[[x]], paste0("\\b", date_format_2, "\\b")) == T) {
    test$event_text[[x]] <- paste0(which(month.abb %in% str_sub(test$event_text[x], 1, 3)),
                                   "/F", str_sub(test$event_text[[x]], -2))
  } else {
    test$event_text[x] <- test$event_text[[x]]
  }
})
write.csv(test, paste0("test ", format(Sys.time(), "%Y%m%d"), ".csv"),
          row.names = F)
Below I have written two calls to lapply() that fix the issue you were having. The problem stems from the fact that the function passed to lapply() has its own scope: the value is changed inside that function's environment, but it is never returned or assigned back to the test object in your workspace. I have demonstrated this by printing the data frame after each of the lapply() calls below.
We can fix this in two ways. The first, more correct, version is to let lapply() compute the new values by adding one to each element and returning x + 1, then unlist() the result and assign it back to the column (lapply() returns a list, so without unlist() you would end up with a list column). (Note I have skipped the curly braces, so the function returns the value of the expression x + 1; alternatively you could write function(x) {return(x + 1)} in that argument.)
An alternate approach that will run slower but still uses the indexing method is global assignment: <<- assigns to the variable in the enclosing (here global) environment rather than the local scope of the function. (Note this code is run sequentially, so by the time the <<- version below is called it is adding 1 for the second time to the already-incremented data frame.)
test <- data.frame(name_s = c("x","y","z"), number_s = c(1,2,3))
# Original Behaviour, doesn't work due to scoping issues
lapply(1:length(test$number_s), function(x) {
  test$number_s[x] <- test$number_s[[x]] + 1
})
#> [[1]]
#> [1] 2
#>
#> [[2]]
#> [1] 3
#>
#> [[3]]
#> [1] 4
print(test)
#> name_s number_s
#> 1 x 1
#> 2 y 2
#> 3 z 3
# version that is syntactically and functionally correct:
# instead of modifying the vector inside the function's scope, each call returns
# the incremented value; lapply() collects these into a list, which we unlist()
# into a vector and assign back to the data frame's column
test$number_s <- unlist(lapply(test$number_s, function(x) x + 1))
print(test)
#> name_s number_s
#> 1 x 2
#> 2 y 3
#> 3 z 4
# function that is syntactically odd but functionally correct
# the function affects the values in the global scope, this works but is slower
# and is not best practice as it would be difficult to read
lapply(1:length(test$number_s), function(x) {
  test$number_s[x] <<- test$number_s[[x]] + 1
})
#> [[1]]
#> [1] 3
#>
#> [[2]]
#> [1] 4
#>
#> [[3]]
#> [1] 5
print(test)
#> name_s number_s
#> 1 x 3
#> 2 y 4
#> 3 z 5
Created on 2021-07-23 by the reprex package (v2.0.0)
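The same "return the value and assign it back" pattern carries over to your real date-formatting code. A rough sketch (assuming stringr is loaded, date_format_1/date_format_2 are defined as in your question, and event_text is a character column):
library(stringr)
test$event_text <- sapply(test$event_text, function(e) {
  if (str_detect(e, paste0("\\b", date_format_1, "\\b"))) {
    paste0(str_sub(e, 1, 1), "/F", which(month.abb %in% str_sub(e, 3, 5)))
  } else if (str_detect(e, paste0("\\b", date_format_2, "\\b"))) {
    paste0(which(month.abb %in% str_sub(e, 1, 3)), "/F", str_sub(e, -2))
  } else {
    e  # leave values that match neither format unchanged
  }
}, USE.NAMES = FALSE)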
I need to be able to store functions in a list and call them using a for() loop.
Example:
Sample1 = function(a,c,b) c*3 + b*2 + a*1 + 3
Sample2 = function(d,e,f) d*1 + e*2 + f*3 + 5
sampleRList = list(Sample1,Sample2)
When I call
for (item in sampleRList) {
  print(typeof(item))
}
it prints out "closure" as expected, since each item is a function, but when I try to get the variables in the function using
foo <- function(x) {
  if (length(x) > 1) {
    res <- lapply(x, foo)
  } else {
    res <- if (is.numeric(x)) x else NA
  }
  na.omit(unlist(res))
}
foo(body(item))
(in this sample code, x would be the current item), it does not return the list containing the variables/numbers.
example of expected result is
[1] 1 2 3 4
How do I fix this? Thanks.
foo code taken from here.
It's not terribly clear what you are asking here. I think you are saying that you want to extract the multiplicative coefficients from the functions Sample1 and Sample2 using the function foo, but you need to be able to do it if Sample1 and Sample2 are in a list.
In your example, you call foo(body(Ex1)), but you have no object called Ex1, so this code throws an error. If on the other hand you call:
foo(body(sampleRList[[1]]))
#> [1] 3 2 1 3
Then you can see that you retrieve the correct coefficients from Sample1, and to retrieve those from Sample2 you would do:
foo(body(sampleRList[[2]]))
#> [1] 1 2 3 5
If you wanted to get both at the same time you could do:
lapply(sampleRList, function(x) foo(body(x)))
#> [[1]]
#> [1] 3 2 1 3
#>
#> [[2]]
#> [1] 1 2 3 5
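And if you want to stick with a for() loop as in your example, the same call works inside it:
for (item in sampleRList) {
  print(foo(body(item)))
}
#> [1] 3 2 1 3
#> [1] 1 2 3 5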
I'm trying to use a new R package called waldo (see also the post on the tidyverse blog) that is designed to compare data objects to find differences. The waldo::compare() function returns an object that is, according to the documentation:
a character vector with class "waldo_compare"
The main purpose of this function is to be used within the console, leveraging coloring features to highlight the values that differ between data objects. However, while examining the output in the console is useful, I want to take those values and act on them (filter them out from the data, etc.). Therefore, I want to programmatically extract the differing values. I don't know how.
Example
Generate a vector of length 10:
set.seed(2020)
vec_a <- sample(0:20, size = 10)
## [1] 3 15 13 0 16 11 10 12 6 18
Create a duplicate vector, and add an additional value (4) as an 11th element.
vec_b <- vec_a
vec_b[11] <- 4
vec_b <- as.integer(vec_b)
## [1] 3 15 13 0 16 11 10 12 6 18 4
Use waldo::compare() to test the differences between the two vectors
waldo::compare(vec_a, vec_b)
## `old[8:10]`: 12 6 18
## `new[8:11]`: 12 6 18 4
The beauty is that the difference is highlighted in the console.
But now, how do I extract the different value?
I can try to assign waldo::compare() to an object:
waldo_diff <- waldo::compare(vec_a, vec_b)
and then what? When I try to do waldo_diff[[1]] I get:
[1] "`old[8:10]`: \033[90m12\033[39m \033[90m6\033[39m \033[90m18\033[39m \n`new[8:11]`: \033[90m12\033[39m \033[90m6\033[39m \033[90m18\033[39m \033[34m4\033[39m"
and for waldo_diff[[2]] it's even worse:
Error in waldo_diff[[2]] : subscript out of bounds
Any idea how I could programmatically extract the outstanding values that appear in the "new" vector but not in the "old"?
As a disclaimer, I didn't know anything about this package until you posted, so this is far from an authoritative answer. You can't easily extract the differing values using the compare() function, as it returns an ANSI-formatted string ready for pretty printing. Instead, the workhorses for vectors seem to be the internal functions ses() and ses_context(), which return the indices of the differences between the two objects. The difference between them seems to be that ses_context() splits the result into a list of non-contiguous differences.
waldo:::ses(vec_a, vec_b)
# A tibble: 1 x 5
     x1    x2 t        y1    y2
  <int> <int> <chr> <int> <int>
1    10    10 a        11    11
The results show that there is an addition in the new vector beginning and ending at position 11.
The following simple function is very limited in scope and assumes that only additions in the new vector are of interest:
new_diff_additions <- function(x, y) {
  res <- waldo:::ses(x, y)
  res <- res[res$t == "a", ]  # keep only additions
  if (nrow(res) == 0) {
    return(NULL)
  } else {
    Map(function(start, end) {
      d <- y[start:end]
      `attributes<-`(d, list(start = start, end = end))
    },
    res[["y1"]], res[["y2"]])
  }
}
new_diff_additions(vec_a, vec_b)
[[1]]
[1] 4
attr(,"start")
[1] 11
attr(,"end")
[1] 11
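Since the positions are attached as attributes, you can also recover where each addition sits in the new vector afterwards, e.g.:
additions <- new_diff_additions(vec_a, vec_b)
attr(additions[[1]], "start")
[1] 11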
At least for the simple case of comparing two vectors, you’ll be better off
using diffobj::ses_dat() (which is from the package that waldo uses
under the hood) directly:
waldo::compare(1:3, 2:4)
#> `old`: 1 2 3
#> `new`: 2 3 4
diffobj::ses_dat(1:3, 2:4)
#>       op val id.a id.b
#> 1 Delete   1    1   NA
#> 2  Match   2    2   NA
#> 3  Match   3    3   NA
#> 4 Insert   4   NA    3
For completeness, to extract additions you could do e.g.:
extract_additions <- function(x, y) {
  ses <- diffobj::ses_dat(x, y)
  y[ses$id.b[ses$op == "Insert"]]
}
old <- 1:3
new <- 2:4
extract_additions(old, new)
#> [1] 4
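Similarly, deletions (values present only in the old vector) could be pulled out via the id.a column; a small sketch along the same lines:
extract_deletions <- function(x, y) {
  ses <- diffobj::ses_dat(x, y)
  x[ses$id.a[ses$op == "Delete"]]
}
extract_deletions(old, new)
#> [1] 1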
I have multiple .rData files whose top level Global Environment variables are a mix of data.frames, lists, deeply nested lists. I know that many of the nested lists have within them data.frame types, but I'm having trouble retrieving the path to them.
I faced a similar problem before with another type of class, using the following code
names(rapply(mget(ls(.GlobalEnv), envir=.GlobalEnv), length, classes="fluor.spectral.data", how="unlist"))
and while not the most elegant solution, it achieved what I needed, and quickly, returning names like "Fluor.Spec.WA.M12.SC.13", which then lets me manipulate the object after reformatting the '.' into '$'.
Can someone help me retrieve the path to all data.frame class types, nested or otherwise, in highly variable .rData files? Thanks in advance.
If you want to return all data.frames loaded in the global environment, whether present as individual objects or as elements of a nested list, use rrapply() in the rrapply package (an extension of base rapply()).
library(rrapply)
w <- data.frame(1)
x <- list(1, 2, 3)
y <- 5
z <- list(1, 2, list(1, df = data.frame(a = 1, b = 2)))
rrapply(as.list(.GlobalEnv), classes = "data.frame", how = "flatten")
#> $w
#> X1
#> 1 1
#>
#> $df
#> a b
#> 1 1 2
Setting classes = "data.frame" avoids recursion into data.frame columns (as base rapply would do), and how = "flatten" will return the collected data.frames as a flattened list.
NB: If you want to return the complete object paths to the found data.frames, set how = "prune" instead of how = "flatten":
rrapply(as.list(.GlobalEnv), classes = "data.frame", how = "prune")
#> $w
#> X1
#> 1 1
#>
#> $z
#> $z[[1]]
#> $z[[1]]$df
#> a b
#> 1 1 2
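If you would rather have the object paths in a flat tabular form (closer to your original rapply() + names() approach), I believe newer versions of rrapply also support how = "melt", which returns the path components of each matched element as columns of a data.frame; check ?rrapply to confirm this option is available in your version:
## assumes an rrapply version that offers how = "melt" (see ?rrapply)
rrapply(as.list(.GlobalEnv), classes = "data.frame", how = "melt")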
Edit: In order to also return data.frames present in slots of some S4-class, a possible way to extend the above call would be:
## define S4-class with a data.frame in "df" slot
userClass <- setClass("user", slots = c(df = "data.frame"))
v <- userClass(df = data.frame(user = 1))
rrapply(as.list(.GlobalEnv),
        classes = c("data.frame", "user"),
        f = function(x) {
          if (class(x) == "user") {
            slot(x, "df")
          } else {
            x
          }
        },
        how = "flatten")
#> $v
#> user
#> 1 1
#>
#> $w
#> X1
#> 1 1
#>
#> $df
#> a b
#> 1 1 2
In this case, classes = c("data.frame", "user") will check for data.frames and for S4 objects of class "user". The f function, applied to each matched object, returns the object itself if it is a data.frame, or its "df" slot if it is an S4 object.
Note that this code assumes that the S4-class name is known as well as the slot(s) which contain the data.frame objects.
Is there any exception handling mechanism in dplyr's mutate()? What I mean is a way to catch exceptions and handle them.
Let us suppose that I have a function that throws an error in some cases (in the example, if the input is negative). For the sake of simplicity I define the function here, but in real life it would come from some R package. Let us suppose this function is vectorized:
# function throwing an error
my_func <- function(x) {
  if (x > 0) return(sqrt(x))
  stop('x must be positive')
}
my_func_vect <- Vectorize(my_func)
Now, let's suppose I want to use this function inside mutate().
If this function is used inside a mutate(), it stops at the first error and no result is returned:
library(dplyr)
# dummy data
data <- data.frame(x = c(1, -1, 4, 9))
data %>% mutate(y = my_func_vect(x))
# Error in mutate_impl(.data, dots) : Evaluation error: x must be positive.
Is there a way to catch the error, and do something (e.g. return an NA) in this case, while getting results for the other elements?
The result I expect is what would be achieved using a loop with tryCatch(), i.e. something along the lines of:
y <- rep(NA_real_, length(data$x))
for (i in seq_along(data$x)) {
  tryCatch({
    y[i] <- my_func_vect(data$x[i])
  }, error = function(err) {})
}
y
# Result is: 1 NA 2 3
We can also make use of purrr's safely() or possibly() functions.
From the purrr help:
safely: wrapped function instead returns a list with components result and error. One value is always NULL.
quietly: wrapped function instead returns a list with components result, output, messages and warnings.
possibly: wrapped function uses a default value (otherwise) whenever an error occurs.
It doesn't change the fact that you have to apply the function to each row separately.
library(dplyr)
library(purrr)
# function throwing an error
my_func <- function(x) {
  if (x > 0) return(sqrt(x))
  stop('x must be positive')
}
my_func_vect <- Vectorize(my_func)
# dummy data
data <- data.frame(x = c(1, -1, 4, 9))
With map:
data %>%
  mutate(y = map_dbl(x, ~possibly(my_func_vect, otherwise = NA_real_)(.x)))
#>    x  y
#> 1  1  1
#> 2 -1 NA
#> 3  4  2
#> 4  9  3
Using rowwise():
data %>%
  rowwise() %>%
  mutate(y = possibly(my_func_vect, otherwise = NA_real_)(x))
#> Source: local data frame [4 x 2]
#> Groups: <by row>
#>
#> # A tibble: 4 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     1
#> 2    -1    NA
#> 3     4     2
#> 4     9     3
The other functions are somewhat more difficult to use and apply in a data-frame context, as they are better suited to working with lists, and return lists themselves.
Created on 2018-05-15 by the reprex package (v0.2.0).
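For example, a small sketch of what safely() looks like in this setting (continuing with my_func_vect and data from above): each element becomes a list with result and error components, which you then have to unpack yourself, e.g. with map_dbl():
# wrap the function so errors become NA results instead of stopping execution
safe_func <- safely(my_func_vect, otherwise = NA_real_)
data %>%
  mutate(y = map_dbl(map(x, safe_func), "result"))
#>    x  y
#> 1  1  1
#> 2 -1 NA
#> 3  4  2
#> 4  9  3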
If you want to evaluate every occurring error individually, maybe you shouldn't use the vectorized function. Instead use map() from the purrr package, which is effectively the same as lapply() here.
Make a wrapper function that catches the error, so you get an NA value whenever an error occurs.
try_my_func <- function(x) {
  tryCatch(my_func(x), error = function(err) {NA})
}
Then use mutate with map
data %>% mutate(y = purrr::map(x, try_my_func))
   x  y
1  1  1
2 -1 NA
3  4  2
4  9  3
Or similarly, if you don't want to declare a new function.
data %>% mutate(y = purrr::map(x, ~ tryCatch(my_func(.), error = function(err){NA})))
And lastly, if you do want to use a Vectorize()d function, you can skip the map() call altogether. But personally I never use Vectorize(), so I'd do it with map().
data %>% mutate(y = Vectorize(try_my_func)(x))