purrring with NULL list-columns in R

library(tidyverse)
data(mtcars)
mtcars <- rownames_to_column(mtcars,var = "car")
mtcars$make <- map_chr(mtcars$car,~strsplit(.x," ")[[1]][1])
mt2 <- mtcars %>% select(1:8,make) %>% nest(-make,.key = "l")
mt4<-mt2[1:5,]
mt4[c(1,5),"l"] <- list(list(NULL))
Now, I'd like to run the following function for each make of car:
fun_mt <- function(df){
a <- df %>%
filter(cyl<8) %>%
arrange(mpg) %>%
slice(1) %>%
select(mpg,disp)
return(a)
}
mt4 %>% mutate(newdf=map(l,~possibly(fun_mt(.x),otherwise = "NA"))) %>% unnest(newdf)
However, the NULL columns refuse to evaluate due to
Error: no applicable method for 'filter_' applied to an object of class "NULL"
I also tried using the safely and possibly approach, but still I get an error msg:
Error: Don't know how to convert NULL into a function
Any good solutions to this?

The problem is that NULL gets passed into the function fun_mt(). You wanted to catch this with possibly(). But possibly() is a function operator, i.e. you pass it a function and it returns a function. So, your call should have been
~ possibly(fun_mt, otherwise = "NA")(.x)
But this doesn't yet work with unnest(). Instead of a character "NA" (a bad idea anyway, rather use a proper NA) you would have to default to a data frame:
~ possibly(fun_mt, otherwise = data.frame(mpg = NA, disp = NA))(.x)
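Put together, the approach can be sketched as follows. This is a minimal example under assumptions: the `nested` tibble is hypothetical toy data standing in for `mt4`, `safe_fun_mt` is an illustrative name, and `unnest(newdf)` assumes a reasonably recent tidyr.

```r
library(dplyr)
library(purrr)
library(tidyr)

# Hypothetical toy data: the first list-column entry is NULL
nested <- tibble(
  make = c("A", "B", "C"),
  l = list(
    NULL,
    tibble(cyl = c(4, 6), mpg = c(30, 21), disp = c(80, 160)),
    tibble(cyl = c(4, 8), mpg = c(25, 15), disp = c(100, 300))
  )
)

fun_mt <- function(df) {
  df %>%
    filter(cyl < 8) %>%
    arrange(mpg) %>%
    slice(1) %>%
    select(mpg, disp)
}

# possibly() takes a function and returns a wrapped function;
# the fallback is a one-row data frame so that unnest() still works
safe_fun_mt <- possibly(
  fun_mt,
  otherwise = tibble(mpg = NA_real_, disp = NA_real_)
)

result <- nested %>%
  mutate(newdf = map(l, safe_fun_mt)) %>%
  select(-l) %>%
  unnest(newdf)
```

Building the wrapped function once, outside `map()`, avoids re-creating it for every element; the NULL entry ends up as a row of NA values instead of stopping the pipeline.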

Related

Object not found in function environment for nested objects

I have a code snippet which I am trying to convert into a function. This function is supposed to look for potential spelling errors in a manual-entry field. The snippet works and you can try it out like this, using the starwars data from the tidyverse package:
require(tidyverse)
require(rlang) # loaded for {{ to force function arguments as well as the with_env() function
require(RecordLinkage) # loaded for the jarowinkler() function
starwars_cleaning <- starwars %>%
add_count(name, name = "Freq_name") %>% # this keeps track of which spelling is more frequent
distinct(name, .keep_all = T) %>% # this prevents duplicated comparisons and self-comparisons
nest_by(homeworld, .key = ".Nest") %>%
mutate(Mapped = list(imap_dfr(.x = .Nest$name,
.f = ~jarowinkler(str1 = .x,
str2 = .Nest$name[-.y]) %>%
list() %>%
tibble(Score_n = ., Match_n = list(.Nest$name[-.y]),
Freq_n = list(.Nest$Freq_name[-.y]))
)))
The function should accept the variable(s) to nest on (ellipses) and the variable to look for potential misspelled matches in as arguments. Right now, it looks like this:
string_matching <- function(.df, .string_col, ...){
.df$.tmp_string <- .df %>% select({{.string_col}})
.df <- .df %>%
add_count(.tmp_string, name = "Freq_name") %>%
distinct(.tmp_string, .keep_all = T) %>%
nest_by(..., .key = ".Nest") %>%
mutate(Mapped_n = list(with_env(env = current_env(), # same error with or without specifying the execution environment for imap
expr = imap_dfr(.x = .Nest$.tmp_string,
.f = ~jarowinkler(str1 = .x,
str2 = .Nest$.tmp_string[-.y]) %>%
list() %>%
tibble(Score_n = ., Match_n = list(.Nest$.tmp_string[-.y]),
Freq_n = list(.Nest$Freq_name[-.y]))
)
))
)
return(.df)
}
starwars %>%
string_matching(name, homeworld)
On the starwars data, it isn't very useful, clearly. And I cut down some of the features of this code to get a MWE--but that's the idea. When I wrap the code up like this in a function, it returns invalid argument to unary operator (apparently caused by the [-.y]). I tried the force() command after reading this post since this problem apparently comes up a lot. Because of the current error and that post, I thought the problem might have to do with the function environment causing imap_dfr() to lose track of the data somehow. I tried to wrap the call to map in with_env() and instruct it to use the function environment rather than its own. I also tried to break up the function by assigning an intermediate object to the global environment so that it could be found in the mapping step of the function:
assign(x = "TEMP", value = .df$.Nest, envir = global_env())
That landed me with the same 'unary operator' error. I'm not sure what to try next. I seem to be going in circles. Any insights into what is causing this problem and how to fix it would be greatly appreciated.
I don't think the post you pointed to is related here, and I don't think your problem has anything to do with the execution environment. The problem is how you pass the variable to your function. When you create .tmp_string, you call select(), which returns a tibble rather than the vector of column values. imap_dfr() then iterates over the columns of that tibble and passes the column name (a character string) as .y, so [-.y] fails with "invalid argument to unary operator". Instead, use pull() to extract the values as a plain vector, so that .y is the integer position.
string_matching <- function(.df, .string_col, ...){
.df$.tmp_string <- .df %>% pull({{.string_col}})
.df <- .df %>%
add_count(.tmp_string, name = "Freq_name") %>%
distinct(.tmp_string, .keep_all = T) %>%
nest_by(..., .key = ".Nest") %>%
mutate(Mapped_n = list(with_env(env = current_env(), # with_env() is carried over from the question; it is not needed for the fix
expr = imap_dfr(.x = .Nest$.tmp_string,
.f = ~jarowinkler(str1 = .x,
str2 = .Nest$.tmp_string[-.y]) %>%
list() %>%
tibble(Score_n = ., Match_n = list(.Nest$.tmp_string[-.y]),
Freq_n = list(.Nest$Freq_name[-.y]))
)
))
)
return(.df)
}
Or you could write your code to avoid the need for that temp column completely:
string_matching <- function(.df, .string_col, ...){
col <- rlang::ensym(.string_col)
.df <- .df %>%
add_count(!!col, name = "Freq_name") %>%
distinct(!!col, .keep_all = T) %>%
nest_by(..., .key = ".Nest") %>%
mutate(Mapped_n = list(imap_dfr(.x = .Nest %>% pull(!!col),
.f = ~jarowinkler(str1 = .x,
str2 = (.Nest %>% pull(!!col))[-.y]) %>%
list() %>%
tibble(Score_n = ., Match_n = list((.Nest %>% pull(!!col))[-.y]),
Freq_n = list(.Nest$Freq_name[-.y]))
))
)
return(.df)
}
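The difference between select() and pull() that triggers the error can be checked in isolation. The following is a small sketch on hypothetical toy data (`people`, `y_select`, `y_pull`, and `others` are illustrative names, not from the original code):

```r
library(dplyr)
library(purrr)

people <- tibble(name = c("Luke", "Leia", "Han"))  # hypothetical toy data

# select() returns a one-column tibble; imap() iterates over its columns
# and passes the column *name* as .y
y_select <- imap(select(people, name), ~ .y)

# pull() returns an unnamed vector; imap() passes the integer position as .y
y_pull <- imap(pull(people, name), ~ .y)

y_select[[1]]  # "name"
y_pull[[2]]    # 2

# With an integer .y, negative indexing drops the current element as intended
others <- imap(pull(people, name), ~ pull(people, name)[-.y])
```

Because `-"name"` is not a valid operation on a character string, the select() version fails as soon as `[-.y]` is evaluated, regardless of which environment the code runs in.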

Problem with mutate keyword and functions in R

I have a problem with the use of mutate(); please see the following code block.
output1 <- mytibble %>%
mutate(newfield = FND(mytibble$ndoc))
output1
where FND is a function that applies a filter to a large tibble (5 GB):
FND <- function(n){
result <- LARGETIBBLE %>% filter(LARGETIBBLE$id == n)
return(paste(unique(result$somefield),collapse=" "))
}
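I want to execute the FND function for each row of the output1 tibble, but it executes only once.
Avoid $ inside dplyr pipes; it is very rarely needed there, because columns can be referred to by their bare names. Note also that mutate() calls FND once with the whole ndoc column rather than once per row. You can change your FND function to: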
library(dplyr)
FND <- function(n){
LARGETIBBLE %>% filter(id == n) %>% pull(somefield) %>%
unique %>% paste(collapse = " ")
}
Now apply this function to every ndoc value in mytibble.
mytibble %>% mutate(newfield = purrr::map_chr(ndoc, FND))
You can also use sapply :
mytibble$newfield <- sapply(mytibble$ndoc, FND)
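As a quick check of the row-by-row behaviour, here is a minimal sketch in which `LARGETIBBLE` and `mytibble` are small hypothetical stand-ins for the objects in the question:

```r
library(dplyr)
library(purrr)

# Hypothetical stand-ins for the real data
LARGETIBBLE <- tibble(
  id = c(1, 1, 2, 3),
  somefield = c("a", "b", "c", "a")
)
mytibble <- tibble(ndoc = c(1, 2, 3))

FND <- function(n) {
  LARGETIBBLE %>%
    filter(id == n) %>%
    pull(somefield) %>%
    unique() %>%
    paste(collapse = " ")
}

# map_chr() calls FND once per element of ndoc and returns a character vector
out <- mytibble %>% mutate(newfield = map_chr(ndoc, FND))
```

Each call to FND scans the whole of LARGETIBBLE, so on a 5 GB table this may be slow; that is a property of the per-row approach rather than of map_chr() itself.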
The mytibble$ndoc form is the base R way of referring to a data frame column. When you use functions such as mutate() on a tibble, there is no need to specify the name of the tibble, only that of the column, because the %>% pipe already passes the tibble as the data in which columns are looked up. Thus your example would be:
output1 <- mytibble %>%
mutate(newfield = FND(ndoc))
FND <- function(n){
result <- LARGETIBBLE %>% filter(id == n)
return(paste(unique(result$somefield),collapse=" "))
}
That is the theory; however, I do not know whether your function FND will behave as intended, since mutate() passes the entire ndoc column to it in a single call. Try it, and if it does not work, give a practical example with data and describe what you are trying to achieve.

Write a function in R to change a group of datasets layout

I have many datasets in tibble format, with variables as rows. I want to change the layout and wrangle each dataset. To save myself repetitive work and reduce the risk of making mistakes, I wrote this function in R.
library(tidyverse)
change_data_layout<- function(data_df){
data_df_2 <- data_df %>% mutate(samples = colnames()) %>% t()
colnames(data_df_2) <-data_df_2[1,]
rownames <- rownames(data_df_2) [2:nrow(data_df_2)]
data_df_3 <- data_df_2[1:nrow(data_df_2),] %>% as_tibble() %>% mutate(samples = rownames)
colnames(data_df_3) <- data_df_3 [1,]
data_df_4 <- data_df_3[2:nrow(data_df_3),]
data_final <- data_df_4 %>%
mutate_each(funs(type.convert)) %>% mutate_if(is.factor, as.character)
return(data_final)
}
However, when I run this function as :
dataset1_final <- change_data_layout(dataset1)
I got this error message:
Error: argument "x" is missing, with no default
Called from: mutate_impl(.data, dots)
Any help and suggestions?

NSE on complex expressions with dplyr's do()

Can someone help me understand how NSE works with dplyr when the variable reference is in the form ".$mpg"?
After reading here, I thought using as.name would do it, since I have a character string that gives a variable name.
For example, this works:
mtcars %>%
summarise_(interp(~mean(var), var = as.name("mpg")))
and this doesn't work:
mtcars %>%
summarise_(interp(~mean(var), var = as.name(".$mpg")))
but this does:
mtcars %>%
summarise(mean(.$mpg))
and so does this:
mtcars %>%
summarise(mean(mpg))
I want to be able to specify the variable in the form .$mpg so that I can use it with do() when I don't have the option of specifying a dot for the data like in the following example:
library(dplyr)
library(broom)
mtcars %>%
tbl_df() %>%
slice(., 1) %>%
do(tidy(prop.test(.$mpg, .$disp, p = .50)))
(I chose arbitrary variables here to demonstrate how the prop.test function works; please don't interpret this as a misuse of the test.)
Eventually, I want to turn this into a function like this:
library(lazyeval)
library(broom)
library(dplyr)
p_test <- function(x, miles, distance){
x %>%
tbl_df() %>%
slice(., 1) %>%
do_(tidy(prop.test(miles, distance, p = .50)))
}
p_test(mtcars, ".$mpg", ".$disp")
I originally thought that I would have to do something like:
interp(~var, var = as.name(miles)), where miles would be replaced with .$mpg, but as I mentioned at the top, this does not seem to work.
The reason is that as.name creates an unevaluated variable name, but .$mpg, when used in code, is not a variable name. Rather, it’s a complex expression which is equivalent to:
`$`(., mpg)
That is, it’s a function call to the function $ with two arguments. Using as.name causes R to subsequently search for a variable with the name `.$mpg` rather than calling the above-described function.
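The distinction can be inspected in base R alone. This is a minimal sketch; `sym` and `cl` are illustrative names that do not appear in the question:

```r
# as.name() builds a single symbol whose name happens to contain "$"
sym <- as.name(".$mpg")

# parse() builds a call to the function `$` with two arguments, . and mpg
cl <- parse(text = ".$mpg")[[1]]

is.name(sym)                      # TRUE: one variable name
is.call(cl)                       # TRUE: a function call
identical(cl[[1]], as.name("$"))  # TRUE: the function being called is `$`

# Evaluating the call with . bound to mtcars extracts the column
mpg_values <- eval(cl, list(. = mtcars))
```

Evaluating `sym` in the same way would fail, because no variable literally named `.$mpg` exists.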
That’s the explanation of why your attempt doesn’t work. The solution is then relatively straightforward: instead of creating an unevaluated variable name, we need to create an unevaluated function call expression. We can do this in various ways, and I’m going to show two here.
The first is simply to call parse:
p_test = function (data, miles, distance) {
x = parse(text = miles)[[1]]
n = parse(text = distance)[[1]]
data %>%
slice(1) %>%
do_(interp(~tidy(prop.test(x, n, p = 0.5)), x = x, n = n))
}
Now you can call p_test(mtcars, '.$mpg', '.$disp') and get the desired result.
However, a more dplyr-y way of doing the same thing would be to pass unevaluated objects to p_test:
p_test(mtcars, mpg, disp)
… and we can easily do this with a simple change:
p_test_ = function (data, var1, var2) {
data %>%
slice(1) %>%
do_(interp(~tidy(prop.test(.$x, .$n, p = 0.5)),
x = as.name(var1), n = as.name(var2)))
}
p_test = function (data, var1, var2) {
p_test_(data, substitute(var1), substitute(var2))
}
Now the following two pieces of code both work:
p_test(mtcars, mpg, disp)
p_test_(mtcars, 'mpg', 'disp')

gather_ does not work. Shouldn't quoting and ~ing have the same effect in standard evaluation mode?

I have issues getting tidyr's gather to work in its standard evaluation version gather_:
require(tidyr)
require(dplyr)
require(lazyeval)
df = data.frame(varName=c(1,2))
gather works:
df %>% gather(variable,value,varName)
but I'd like to be able to take the name varName from a variable in standard evaluation mode, and can't seem to get it right:
name='varName'
df %>% gather_("variable","value",interp(~v,v=name))
Error in match(x, y, 0L) : 'match' requires vector arguments
I'm also confused by the following.
This works as expected:
df %>% gather_("variable","value","varName")
The next line should be equivalent to the previous one (from my understanding of http://cran.r-project.org/web/packages/dplyr/vignettes/nse.html ), but doesn't work:
df %>% gather_(~variable,~value,~varName)
Error in match(x, y, 0L) : 'match' requires vector arguments
Looking at the source of tidyr:::gather_.data.frame, you can see that it is just a wrapper for reshape2::melt. As such, it only works for character or numeric arguments. Actually, the following (which I would consider a bug) works:
df %>% gather_("variable", "value", 1)
As far as I can tell the nse vignette only refers to dplyr and not to tidyr.
Although this question has been answered, the following code could be used for defining keys and values for gathering purposes more generally in a function, using a vector of inputs for key and value:
data <- data.frame(a = runif(10), b = runif(10), c = runif(10))
Key <- "ColId"
Value <- "ColValue"
data %>% gather(key = KeyTmp, value = ValTmp) %>%
rename_(.dots = setNames("KeyTmp", Key) ) %>%
rename_(.dots = setNames("ValTmp", Value) )
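The underscore verbs used above are deprecated in current tidyr and dplyr. As a swapped-in alternative rather than the method from the answers, the same string-driven reshaping can be sketched with pivot_longer(), assuming a tidyr version that provides it (1.0.0 or later); `long` and `long2` are illustrative names:

```r
library(tidyr)
library(dplyr)

df <- data.frame(varName = c(1, 2))
name <- "varName"

# all_of() selects columns from a character vector;
# names_to and values_to accept plain strings
long <- df %>%
  pivot_longer(cols = all_of(name), names_to = "variable", values_to = "value")

# The key and value column names can themselves come from variables
Key <- "ColId"
Value <- "ColValue"
long2 <- df %>%
  pivot_longer(cols = everything(), names_to = Key, values_to = Value)
```

This avoids both the formula interface and the temporary column names followed by rename_().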

Resources